## Wednesday, January 19, 2011

### Apache Pig 0.8 with Cloudera cdh3

So it's January and Cloudera hasn't released Pig 0.8 as a Debian package yet. Too bad. It turns out the project I'm working on needs a custom partitioner, which is only available in Pig 0.8. I'd also like to make use of the HBaseStorage load and store funcs — again, only available in 0.8. Anyhow, here's how I got it working with my current install of Hadoop (CDH3):

## Get Pig

Go to the Pig releases page here and download the Apache release for pig-0.8.0.

## Install Pig

Skip this part if you don't care (i.e., you're going to put it wherever you want and don't give a flip what my opinion is on where it should go). It's usually a good idea to put things you download and install yourself in /usr/local/share/ so they don't conflict with /usr/lib/ when you later apt-get install them. So go ahead and unpack the downloaded archive into that directory.

As an example (for those of us just getting familiar):

```
$: wget http://apache.mesi.com.ar//pig/pig-0.8.0/pig-0.8.0.tar.gz
$: tar -zxvf pig-0.8.0.tar.gz
$: sudo mv pig-0.8.0 /usr/local/share/
$: sudo ln -s /usr/local/share/pig-0.8.0 /usr/local/share/pig
```
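The point of that last symlink is that later upgrades only need to repoint the link, so anything referencing the unversioned path keeps working. A throwaway illustration of the idea, run against a temp directory rather than the real /usr/local/share (the pig-0.9.0 version here is just a stand-in for some future release):

```shell
# Create a scratch area standing in for /usr/local/share
base=$(mktemp -d)
mkdir "$base/pig-0.8.0"
ln -s "$base/pig-0.8.0" "$base/pig"

# Upgrading later just means repointing the link; scripts that
# reference "$base/pig" keep working unchanged
mkdir "$base/pig-0.9.0"
ln -sfn "$base/pig-0.9.0" "$base/pig"
readlink "$base/pig"
```

The `-n` flag on the second `ln` makes it replace the symlink itself instead of creating a link inside the directory it points to.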

## Perform Pig Surgery

As it stands, your new Pig install will not work with Cloudera's Hadoop. Let's fix that.

1. Nuke the current pig jar and rebuild without hadoop
```
$: sudo rm pig-0.8.0-core.jar
$: sudo ant jar-withouthadoop
```

2. Add these lines to bin/pig, before PIG_CLASSPATH is set (as the first comment below points out, the placement does matter):
```
# Add installed version of Hadoop to classpath
HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
. $HADOOP_HOME/bin/hadoop-config.sh
for jar in $HADOOP_HOME/hadoop-core-*.jar $HADOOP_HOME/lib/* ; do
    CLASSPATH=$CLASSPATH:$jar
done
if [ ! -z "$HADOOP_CLASSPATH" ] ; then
    CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH
fi
if [ ! -z "$HADOOP_CONF_DIR" ] ; then
    CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR
fi
```
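The two idioms doing the work here are `${VAR:-default}` (use HADOOP_HOME from the environment if it's set, otherwise fall back to the CDH3 default path) and the loop that appends every jar it finds onto CLASSPATH. A minimal sketch of both, run against a throwaway directory with empty stand-in jars rather than a real Hadoop install:

```shell
# ${VAR:-default}: the fallback only kicks in when the variable
# is unset or empty
unset DEMO_HOME
DEMO_HOME=${DEMO_HOME:-/usr/lib/hadoop}
echo "$DEMO_HOME"    # /usr/lib/hadoop

# Appending jars to a classpath, as the loop in bin/pig does
# (these jars are empty placeholders, not real Hadoop jars)
dir=$(mktemp -d)
touch "$dir/a.jar" "$dir/b.jar"
CLASSPATH=pig-0.8.0-core.jar
for jar in "$dir"/*.jar ; do
    CLASSPATH=$CLASSPATH:$jar
done
echo "$CLASSPATH"    # core jar, then every jar the glob matched
```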

3. Nuke the build dir and rename pig-withouthadoop.jar
```
$: sudo mv pig-withouthadoop.jar pig-0.8.0-core.jar
$: sudo rm -r build
```

4. Test it out
```
$: bin/pig
2011-01-19 13:49:07,766 [main] INFO  org.apache.pig.Main - Logging error messages to: /usr/local/share/pig-0.8.0/pig_1295466547762.log
2011-01-19 13:49:07,959 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2011-01-19 13:49:08,163 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
grunt>
```

You can try typing things like 'ls' in the grunt shell to make sure it sees your HDFS. Hurray.
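Once grunt comes up, you can also sanity-check the two 0.8-only features that motivated all this. A hedged sketch — the table name, column families, and partitioner class below are made up for illustration; only the HBaseStorage class path and the PARTITION BY clause are actual 0.8 features:

```
-- Load from HBase with the new-in-0.8 storage func
raw = LOAD 'hbase://my_table'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');

-- Group using a custom partitioner, also new in 0.8
-- (com.example.MyPartitioner is a hypothetical Hadoop Partitioner subclass)
grouped = GROUP raw BY $0 PARTITION BY com.example.MyPartitioner PARALLEL 8;
```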

## Comments

1. FYI: it matters where you put the stuff in step 2. Before setting PIG_CLASSPATH is a good spot :-)

2. Nice article.

3. Another idea is to use Cloudera's version of Pig, to be found at http://nightly.cloudera.com/cdh/3/

4. Thank you very much for this :). Saved me a bunch of time.