Wednesday, January 19, 2011

Apache Pig 0.8 with Cloudera cdh3

So it's January and Cloudera hasn't released pig 0.8 as a debian package yet. Too bad. Turns out for the particular project I'm working on it's important to have a custom partioner, only available in pig 0.8. Also, I'd like to make use of the HbaseStorage load and storefuncs. Also, only available in 0.8. Anyhow, here's how I got it working with my current install of Hadoop (cdh3):

Get Pig

Go the the Pig releases page here and download the apache release for pig-0.8

Install Pig

Skip this part if you don't care (ie. you're going to put wherever you want and don't give a flip what my opinion is on where it should go). It's usually a good idea to put things you download and install yourself in /usr/local/share/ so it doesn't conflict with /usr/lib/ when you apt-get install it. So go ahead and unpack the downloaded archive into that directory.

As an example (for those of us just getting familiar):

$: wget
$: tar -zxvf pig-0.8.0.tar.gz
$: sudo mv pig-0.8.0 /usr/local/share/
$: sudo ln -s /usr/local/share/pig-0.8.0 /usr/local/share/pig

Perform Pig Surgery

As it stands your new pig install will not work with cloudera hadoop. Let's fix that.

1. Nuke the current pig jar and rebuild without hadoop

$: sudo rm pig-0.8.0-core.jar
$: sudo ant jar-withouthadoop

2. Add these lines to bin/pig (I don't think it matters where, I put mine before PIG_CLASSPATH is set):

# Add installed version of Hadoop to classpath

for jar in $HADOOP_HOME/hadoop-core-*.jar $HADOOP_HOME/lib/* ; do
if [ ! -z "$HADOOP_CLASSPATH" ] ; then
if [ ! -z "$HADOOP_CONF_DIR" ] ; then

3. Nuke the build dir and rename pig-withouthadoop.jar

$: sudo mv pig-withouthadoop.jar pig-0.8.0-core.jar
$: sudo rm -r build

4. Test it out

$: bin/pig
2011-01-19 13:49:07,766 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/local/share/pig-0.8.0/pig_1295466547762.log
2011-01-19 13:49:07,959 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2011-01-19 13:49:08,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021

You can try typing things like 'ls' in the grunt shell to make sure it sees your HDFS. Hurray.