
Wednesday, January 19, 2011

Apache Pig 0.8 with Cloudera cdh3

So it's January and Cloudera hasn't released Pig 0.8 as a Debian package yet. Too bad. Turns out for the particular project I'm working on it's important to have a custom partitioner, which is only available in Pig 0.8. I'd also like to make use of the HBaseStorage load and store funcs, which are also only available in 0.8. Anyhow, here's how I got it working with my current install of Hadoop (cdh3):

Get Pig


Go to the Pig releases page and download the Apache release for pig-0.8.

Install Pig


Skip this part if you don't care (i.e. you're going to put it wherever you want and don't give a flip what my opinion is on where it should go). It's usually a good idea to put things you download and install yourself in /usr/local/share/ so they don't conflict with what lands in /usr/lib/ if you later apt-get install the packaged version. So go ahead and unpack the downloaded archive into that directory.

As an example (for those of us just getting familiar):

$: wget http://apache.mesi.com.ar//pig/pig-0.8.0/pig-0.8.0.tar.gz
$: tar -zxvf pig-0.8.0.tar.gz
$: sudo mv pig-0.8.0 /usr/local/share/
$: sudo ln -s /usr/local/share/pig-0.8.0 /usr/local/share/pig
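If the versioned-directory-plus-symlink layout is new to you, here's a minimal sketch of it in a throwaway directory. Everything in it is made up for illustration (the scratch directory from mktemp, the pretend pig-0.8.1 directory), and nothing touches /usr/local/share.

```shell
#!/bin/sh
# Sketch: versioned install dirs plus a stable "pig" symlink, in a scratch dir.
# The directories are stand-ins; pig-0.8.1 is hypothetical.
scratch=$(mktemp -d)
mkdir "$scratch/pig-0.8.0" "$scratch/pig-0.8.1"

# Point the stable name at the version currently in use.
ln -s "$scratch/pig-0.8.0" "$scratch/pig"
readlink "$scratch/pig"

# Later, switch versions by repointing the link. -n replaces the link itself
# instead of creating a new link inside the directory it points to.
ln -sfn "$scratch/pig-0.8.1" "$scratch/pig"
readlink "$scratch/pig"
```

The point of the symlink is that scripts and PATH entries can refer to /usr/local/share/pig and keep working when the version underneath changes.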


Perform Pig Surgery


As it stands your new pig install will not work with cloudera hadoop. Let's fix that.

1. Nuke the current pig jar and rebuild without hadoop (run these from inside the pig directory, /usr/local/share/pig-0.8.0)

$: sudo rm pig-0.8.0-core.jar
$: sudo ant jar-withouthadoop


2. Add these lines to bin/pig. Placement does matter: put them before PIG_CLASSPATH is set, which is where I put mine:

# Add installed version of Hadoop to classpath
HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
. $HADOOP_HOME/bin/hadoop-config.sh

for jar in $HADOOP_HOME/hadoop-core-*.jar $HADOOP_HOME/lib/* ; do
    CLASSPATH=$CLASSPATH:$jar
done
if [ ! -z "$HADOOP_CLASSPATH" ] ; then
    CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH
fi
if [ ! -z "$HADOOP_CONF_DIR" ] ; then
    CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR
fi
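To see what that jar loop actually appends, here's a small sketch that runs the same idea against a fake Hadoop layout in a temp directory. The function name build_classpath and the jar file names are made up for illustration; on a real box you'd point it at /usr/lib/hadoop instead. Unlike the lines above, it skips unmatched globs and doesn't emit a leading colon.

```shell
#!/bin/sh
# Sketch: the jar loop wrapped in a (hypothetical) function so it can be
# run against any directory and the result inspected.
build_classpath() {
    hadoop_home=$1
    cp=""
    for jar in "$hadoop_home"/hadoop-core-*.jar "$hadoop_home"/lib/*; do
        # An unmatched glob stays literal, so skip anything that doesn't exist.
        [ -e "$jar" ] || continue
        # Add a ':' separator only when cp already has an entry.
        cp="${cp:+$cp:}$jar"
    done
    printf '%s\n' "$cp"
}

# Fake Hadoop install; these file names are placeholders, not real CDH3 jars.
fake_home=$(mktemp -d)
mkdir "$fake_home/lib"
touch "$fake_home/hadoop-core-x.jar" "$fake_home/lib/a.jar" "$fake_home/lib/b.jar"

# Print one classpath entry per line so it's easy to eyeball.
build_classpath "$fake_home" | tr ':' '\n'
```

If the hadoop-core jar doesn't show up in the output when you run this against your real HADOOP_HOME, bin/pig won't find it either.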


3. Rename pig-withouthadoop.jar and nuke the build dir

$: sudo mv pig-withouthadoop.jar pig-0.8.0-core.jar
$: sudo rm -r build


4. Test it out

$: bin/pig
2011-01-19 13:49:07,766 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/local/share/pig-0.8.0/pig_1295466547762.log
2011-01-19 13:49:07,959 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2011-01-19 13:49:08,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
grunt>

You can try typing things like 'ls' in the grunt shell to make sure it sees your HDFS. Hurray.
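If bin/pig dies instead of handing you a grunt prompt, one thing worth checking is whether your HADOOP_HOME really contains what step 2 expects: bin/hadoop-config.sh and a hadoop-core jar. Here's a rough sketch of that check. check_hadoop_home is a helper I made up for this, not something that ships with Pig or Hadoop.

```shell
#!/bin/sh
# Sketch: check that a Hadoop home has the two things step 2 relies on.
# check_hadoop_home is a hypothetical helper, not part of Pig or Hadoop.
check_hadoop_home() {
    home=$1
    if [ ! -f "$home/bin/hadoop-config.sh" ]; then
        echo "missing $home/bin/hadoop-config.sh"
        return 1
    fi
    # ls fails when the glob matches nothing.
    if ! ls "$home"/hadoop-core-*.jar >/dev/null 2>&1; then
        echo "no hadoop-core-*.jar in $home"
        return 1
    fi
    echo "ok: $home"
}

# Same default as the lines added to bin/pig; "|| true" keeps the script
# going so the message gets printed either way.
check_hadoop_home "${HADOOP_HOME:-/usr/lib/hadoop}" || true
```

This only checks that the files exist. It says nothing about whether the namenode and job tracker from the log lines above are actually reachable.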

12 comments:

  1. FYI: it matters where you put the stuff in step 2. Before setting PIG_CLASSPATH is a good spot :-)

  2. Another idea is to use Cloudera's version of Pig, to be found at http://nightly.cloudera.com/cdh/3/

  3. Thank you very much for this :). Saved me a bunch of time.
