Saturday, January 15, 2011


Swineherd is a useful tool for combining multiple Pig scripts, Wukong scripts, and even R scripts into a workflow managed by Rake. Here, though, I'd like to save the actual workflow part for a later post and just illustrate its uniform interface to these scripts by showing how to launch a Wukong script:

#!/usr/bin/env ruby

require 'rake'
require 'swineherd'

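task :wukong_job do
  # WukongScript is assumed to be Swineherd's wrapper class for wukong scripts
  script = WukongScript.new('/path/to/wukong_script')
  script.options = {:some_option => "123", :another_option => "foobar"}
  script.input << '/path/to/input'
  script.output << '/path/to/output'
  script.run # assumed: run launches the job with the settings above
end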

You can save this into a file called "Rakefile" and run it by saying:
rake wukong_job
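
To make the "uniform interface" idea a bit more concrete, here is a minimal sketch in plain Ruby. It is not Swineherd's implementation: SketchScript and its cmd method are hypothetical names made up for illustration, and the "--key=value" flag format is an assumption. The sketch only shows how one object with options, input, and output can be turned into a single command line, whatever kind of script sits behind it.

```ruby
# Hypothetical stand-in, NOT Swineherd's real class. It only mimics the
# options/input/output shape used in the Rakefile above.
class SketchScript
  attr_accessor :options
  attr_reader :source, :input, :output

  def initialize(source)
    @source  = source
    @options = {}
    @input   = []
    @output  = []
  end

  # Build the command line a runner could hand to the shell.
  # The "--key=value" flag format is an assumption for this sketch.
  def cmd
    flags = options.map { |key, value| "--#{key}=#{value}" }
    ([source] + flags + input + output).join(' ')
  end
end

script = SketchScript.new('/path/to/wukong_script')
script.options = {:some_option => "123", :another_option => "foobar"}
script.input  << '/path/to/input'
script.output << '/path/to/output'
puts script.cmd
```

Because the caller only touches options, input, and output, the same few lines would work unchanged if the object wrapped a Pig or R script instead.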


  1. Great! Thank you so much for that. Unfortunately, my wukong scripts only work in local mode. I get a strange error at the 100% reduce stage, like:
    12/01/29 12:39:55 ERROR streaming.StreamJob: Job not successful. Error: NA
    12/01/29 12:39:55 INFO streaming.StreamJob: killJob...
    Streaming Command Failed!
    packageJobJar: [/home/hduser/process_ufo.rb, /tmp/hadoop-hduser/hadoop-unjar6128292723359852311/] [] /tmp/streamjob1254886336204086037.jar tmpDir=null
    /usr/local/rvm/gems/ruby-1.9.2-p290/gems/wukong-2.0.2/lib/wukong/script.rb:234:in `execute_command!': Streaming command failed! (RuntimeError)
    from /usr/local/rvm/gems/ruby-1.9.2-p290/gems/wukong-2.0.2/lib/wukong/script/hadoop_command.rb:78:in `execute_hadoop_workflow'
    from /usr/local/rvm/gems/ruby-1.9.2-p290/gems/wukong-2.0.2/lib/wukong/script.rb:152:in `run'
    from ./process_ufo.rb:31:in `<main>'