Wednesday, January 19, 2011

Apache Pig 0.8 with Cloudera cdh3

So it's January and Cloudera hasn't released pig 0.8 as a debian package yet. Too bad. Turns out for the particular project I'm working on it's important to have a custom partioner, only available in pig 0.8. Also, I'd like to make use of the HbaseStorage load and storefuncs. Also, only available in 0.8. Anyhow, here's how I got it working with my current install of Hadoop (cdh3):

Get Pig

Go the the Pig releases page here and download the apache release for pig-0.8

Install Pig

Skip this part if you don't care (ie. you're going to put wherever you want and don't give a flip what my opinion is on where it should go). It's usually a good idea to put things you download and install yourself in /usr/local/share/ so it doesn't conflict with /usr/lib/ when you apt-get install it. So go ahead and unpack the downloaded archive into that directory.

As an example (for those of us just getting familiar):

$: wget
$: tar -zxvf pig-0.8.0.tar.gz
$: sudo mv pig-0.8.0 /usr/local/share/
$: sudo ln -s /usr/local/share/pig-0.8.0 /usr/local/share/pig

Perform Pig Surgery

As it stands your new pig install will not work with cloudera hadoop. Let's fix that.

1. Nuke the current pig jar and rebuild without hadoop

$: sudo rm pig-0.8.0-core.jar
$: sudo ant jar-withouthadoop

2. Add these lines to bin/pig (I don't think it matters where, I put mine before PIG_CLASSPATH is set):

# Add installed version of Hadoop to classpath

for jar in $HADOOP_HOME/hadoop-core-*.jar $HADOOP_HOME/lib/* ; do
if [ ! -z "$HADOOP_CLASSPATH" ] ; then
if [ ! -z "$HADOOP_CONF_DIR" ] ; then

3. Nuke the build dir and rename pig-withouthadoop.jar

$: sudo mv pig-withouthadoop.jar pig-0.8.0-core.jar
$: sudo rm -r build

4. Test it out

$: bin/pig
2011-01-19 13:49:07,766 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/local/share/pig-0.8.0/pig_1295466547762.log
2011-01-19 13:49:07,959 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2011-01-19 13:49:08,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021

You can try typing things like 'ls' in the grunt shell to make sure it sees your HDFS. Hurray.


  1. FYI: it matters where you put the stuff in step 2. Before setting PIG_CLASSPATH is a good spot :-)

  2. Another idea is to use Cloudera's version of Pig, to be found at

  3. Thank you very much for this :). Saved me a bunch of time.

  4. Gaining Python certifications will validate your skills and advance your career.

  5. Nice tips. Very innovative... Your post shows all your effort and great experience towards your work Your Information is Great if mastered very well.
    python Training institute in Pune
    python Training institute in Chennai
    python Training institute in Bangalore

  6. This is most informative and also this post most user friendly and super navigation to all posts... Thank you so much for giving this information to me.. 
    Data Science Training in Chennai
    Data Science course in anna nagar
    Data Science course in chennai
    Data science course in Bangalore
    Data Science course in marathahalli

  7. Excellant post!!!. The strategy you have posted on this technology helped me to get into the next level and had lot of information in it.Best Devops Training in pune
    Microsoft azure training in Bangalore
    Power bi training in Chennai

  8. Very nice post here and thanks for it .I always like and such a super contents of these post.Excellent and very cool idea and great content of different kinds of the valuable information's.
    rpa training in bangalore
    best rpa training in bangalore
    rpa training in pune | rpa course in bangalore
    rpa training in chennai

  9. Superb. I really enjoyed very much with this article here. Really it is an amazing article I had ever read. I hope it will help a lot for all. Thank you so much for this amazing posts and please keep update like this excellent article. thank you for sharing such a great blog with us.
    rpa training in bangalore
    best rpa training in bangalore
    rpa training in pune | rpa course in bangalore
    rpa training in chennai