Tuesday, April 26, 2011

A Lucene Text Tokenization UDF for Apache Pig

As much as I loathe admitting it, sometimes Java is called for. One of those times is tokenizing raw text. You'll notice in the post about tfidf how I used a Wukong script, written in Ruby, to tokenize a large text corpus with Hadoop and Pig. There are a couple of problems with this:

1. Ruby is slow at this.

2. All the gem dependencies (Wukong itself, extlib, etc.) must exist on every machine in the cluster and be available in the RUBYLIB (yet another environment variable to manage).

There is a better way.

A Pig UDF



Pig UDFs (User Defined Functions) come in a variety of flavors. The simplest type is the EvalFunc, whose 'exec()' method plays essentially the same role as the Wukong 'process()' method or the Java Hadoop Mapper's 'map()' function. Here we're going to write an EvalFunc that takes a raw text string as input and outputs a Pig DataBag. Each Tuple in the DataBag will hold a single token. Here's what it looks like as a whole:



import java.io.IOException;
import java.io.StringReader;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.BagFactory;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenizeText extends EvalFunc<DataBag> {

    private static final TupleFactory tupleFactory = TupleFactory.getInstance();
    private static final BagFactory bagFactory = BagFactory.getInstance();
    private static final String NOFIELD = "";
    private static final StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

    @Override
    public DataBag exec(Tuple input) throws IOException {
        // Nothing to tokenize: pass the null through
        if (input == null || input.size() < 1 || input.isNull(0))
            return null;

        // Output bag
        DataBag bagOfTokens = bagFactory.newDefaultBag();

        StringReader textInput = new StringReader(input.get(0).toString());
        TokenStream stream = analyzer.tokenStream(NOFIELD, textInput);
        CharTermAttribute termAttribute = stream.addAttribute(CharTermAttribute.class);

        try {
            stream.reset();
            while (stream.incrementToken()) {
                // One single-field tuple per token
                bagOfTokens.add(tupleFactory.newTuple(termAttribute.toString()));
            }
            stream.end();
        } finally {
            stream.close();
        }
        return bagOfTokens;
    }
}


There's absolutely nothing special going on here. Remember, the 'exec' function gets called on every Pig Tuple of input. bagOfTokens will be the Pig DataBag returned. First, the Lucene library tokenizes the input string. Then each token in the resulting stream is wrapped in a Pig Tuple and added to the result DataBag. Finally, the resulting DataBag is returned. Keep in mind that the StandardAnalyzer also lowercases tokens and drops common English stop words along the way. A document is truly a bag of words.
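
If you want to poke at the shape of that 'exec' contract without Pig or Lucene on the classpath, here's a minimal sketch using only the JDK. TokenizeSketch and its regex splitter are hypothetical stand-ins of mine, not part of the UDF, and the regex only roughly approximates what a real analyzer does. All it shows is null in, null out, and one single-field "tuple" per token.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for TokenizeText: no Pig or Lucene required.
class TokenizeSketch {

    // Mirrors the shape of exec(): null in, null out; otherwise one
    // single-field "tuple" (here a one-element List) per token.
    static List<List<String>> exec(String text) {
        if (text == null) {
            return null;
        }
        List<List<String>> bagOfTokens = new ArrayList<>();
        // Assumption: lowercasing and splitting on runs of non-letter,
        // non-digit characters is only a crude approximation of an analyzer.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                bagOfTokens.add(Collections.singletonList(token));
            }
        }
        return bagOfTokens;
    }

    public static void main(String[] args) {
        // prints [[pig], [udfs], [written], [in], [java]]
        System.out.println(exec("Pig UDFs, written in Java!"));
    }
}
```

Unlike the real thing, this sketch does no stop word removal, so don't expect its output to match the UDF token for token.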

Example Pig Script



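And here's an example script to use that UDF (remember to REGISTER the jar containing TokenizeText, along with the Lucene jar, before calling it):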


documents = LOAD 'documents' AS (doc_id:chararray, text:chararray);
tokenized = FOREACH documents GENERATE doc_id AS doc_id, FLATTEN(TokenizeText(text)) AS (token:chararray);
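
To make the FLATTEN step concrete, here's a rough plain-Java sketch of what it does to each record: the doc_id is paired with every token in the bag, one output row per token. FlattenSketch is a hypothetical illustration of mine, not Pig code and not how Pig implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of what FLATTEN does to a (doc_id, bag) record.
class FlattenSketch {

    // Pairs docId with every token in the bag, one output row per token.
    static List<String[]> flatten(String docId, List<String> bagOfTokens) {
        List<String[]> rows = new ArrayList<>();
        for (String token : bagOfTokens) {
            rows.add(new String[] {docId, token});
        }
        return rows; // an empty bag yields no rows at all
    }

    public static void main(String[] args) {
        // Repeated tokens stay repeated: each occurrence gets its own row.
        for (String[] row : flatten("doc1", Arrays.asList("pig", "udf", "pig"))) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Because duplicates survive, the flattened relation is ready to be grouped by (doc_id, token) and counted.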



And that's it. It's blazing fast text tokenization for Apache Pig.

Hurray.

5 comments:

  1. There's a typo in the java code, the last line is a weird 'databag' xml tag that I don't know how to get rid of. Just ignore it.

  2. Mate, I think we need to parametrize the class with DataBag. What do you say?

  3. Yes, of course :) This post was prior to me figuring out how to prevent Blogger from stripping the '>' and '<' characters. It's definitely supposed to be there though.

  4. Also, I am getting an error for the LOG function. I don't know why it's identifying it as an alias, even though I used it in caps. It says "Alias not found for LOG".

  5. Must say this blog is an awesome place to park and learn :) Thanks for such a wonderful blog! :)
    Mate, do you have an example where we could perform classification using Pig? Maybe using Naive Bayes, or some example where we can use Pig to convert normal text into a Mahout-processable format (vectorizing text so that it can be used with Mahout)?
