
Wednesday, January 19, 2011

Apache Pig 0.8 with Cloudera cdh3

So it's January and Cloudera hasn't released Pig 0.8 as a Debian package yet. Too bad. It turns out the particular project I'm working on needs a custom partitioner, which is only available in Pig 0.8. I'd also like to make use of the HBaseStorage load and store funcs. Also only available in 0.8. Anyhow, here's how I got it working with my current install of Hadoop (CDH3):
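For context, here's the kind of script that needs 0.8 in the first place. This is just a sketch: the 'users' table, the 'info' column family, and the org.example.MyPartitioner class are all hypothetical stand-ins for whatever you actually have.

```
-- Load two columns from an HBase table (HBaseStorage ships with Pig 0.8)
raw = LOAD 'hbase://users'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:age info:city')
      AS (age:chararray, city:chararray);

-- PARTITION BY (new in 0.8) routes groups through a custom Hadoop partitioner
grouped = GROUP raw BY city PARTITION BY org.example.MyPartitioner PARALLEL 4;

STORE grouped INTO '/tmp/grouped_users';
```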

Get Pig


Go to the Apache Pig releases page and download the pig-0.8.0 release.

Install Pig


Skip this part if you don't care (i.e. you're going to put it wherever you want and don't give a flip what my opinion is on where it should go). It's usually a good idea to put things you download and install yourself in /usr/local/share/ so they don't conflict with /usr/lib/ when you later apt-get install them. So go ahead and unpack the downloaded archive into that directory.

As an example (for those of us just getting familiar):

$: wget http://apache.mesi.com.ar//pig/pig-0.8.0/pig-0.8.0.tar.gz
$: tar -zxvf pig-0.8.0.tar.gz
$: sudo mv pig-0.8.0 /usr/local/share/
$: sudo ln -s /usr/local/share/pig-0.8.0 /usr/local/share/pig
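If you want to run pig from anywhere, one option is to export PIG_HOME and extend your PATH. A sketch for a bash-style shell; adjust for yours:

```shell
# Point PIG_HOME at the symlink so a future upgrade only needs a new link
export PIG_HOME=/usr/local/share/pig
export PATH="$PATH:$PIG_HOME/bin"
```

Putting these in ~/.bashrc (or your shell's equivalent) makes them stick across sessions.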


Perform Pig Surgery


As it stands, your new Pig install will not work with Cloudera's Hadoop. Let's fix that.

1. Nuke the current pig jar and rebuild without Hadoop (run these from inside /usr/local/share/pig):

$: cd /usr/local/share/pig
$: sudo rm pig-0.8.0-core.jar
$: sudo ant jar-withouthadoop


2. Add these lines to bin/pig (I don't think it matters where; I put mine before PIG_CLASSPATH is set):

# Add installed version of Hadoop to classpath
HADOOP_HOME=${HADOOP_HOME:-/usr/lib/hadoop}
. $HADOOP_HOME/bin/hadoop-config.sh

for jar in $HADOOP_HOME/hadoop-core-*.jar $HADOOP_HOME/lib/* ; do
    CLASSPATH=$CLASSPATH:$jar
done
if [ ! -z "$HADOOP_CLASSPATH" ] ; then
    CLASSPATH=$CLASSPATH:$HADOOP_CLASSPATH
fi
if [ ! -z "$HADOOP_CONF_DIR" ] ; then
    CLASSPATH=$CLASSPATH:$HADOOP_CONF_DIR
fi


3. Rename pig-withouthadoop.jar so bin/pig can find it, then nuke the build dir

$: sudo mv pig-withouthadoop.jar pig-0.8.0-core.jar
$: sudo rm -r build


4. Test it out

$: bin/pig
2011-01-19 13:49:07,766 [main] INFO org.apache.pig.Main - Logging error messages to: /usr/local/share/pig-0.8.0/pig_1295466547762.log
2011-01-19 13:49:07,959 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost:8020
2011-01-19 13:49:08,163 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
grunt>

You can try typing things like 'ls' in the grunt shell to make sure it sees your HDFS. Hurray.
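A smoke-test session might look something like this (commands only; the paths are illustrative):

```
grunt> ls /
grunt> mkdir /tmp/pig-smoke-test
grunt> ls /tmp
grunt> quit
```

If the ls commands come back with your HDFS contents rather than a stack trace, the classpath surgery took.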