Data Recipes: A Lucene Text Tokenization UDF for Apache Pig

As much as I'm loath to admit it, sometimes Java is called for. One of those times is tokenizing raw text. You'll notice in the post about tf-idf how I used a Wukong script, written in Ruby, to tokenize a large text corpus with Hadoop and Pig. There are a couple of problems with this:

1. Ruby is slow at this.

2. All the gem dependencies (wukong itself, extlib, etc.) must be installed on every machine in the cluster and be available on the RUBYLIB path (yet another environment variable to manage).

There is a better way.

A Pig UDF

Pig UDFs (User Defined Functions) come in a variety of flavors. The simplest type is the EvalFunc, whose 'exec()' method plays the same role as the Wukong 'process()' method or the 'map()' method of a Java Hadoop Mapper. Here we're going to write an EvalFunc that takes a raw text string as input and outputs a Pig DataBag. Each Tuple in the DataBag will hold a single token. Here's what it looks like as a whole:



import java.io.IOException;
import java.io.StringReader;
import java.util.Iterator;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.BagFactory;

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

// EvalFunc is parameterized with the return type of exec()
public class TokenizeText extends EvalFunc<DataBag> {

    private static TupleFactory tupleFactory = TupleFactory.getInstance();
    private static BagFactory bagFactory = BagFactory.getInstance();
    private static String NOFIELD = "";
    private static StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

    public DataBag exec(Tuple input) throws IOException {
        if (input == null || input.size() < 1 || input.isNull(0))
            return null;

        // Output bag
        DataBag bagOfTokens = bagFactory.newDefaultBag();
                
        StringReader textInput = new StringReader(input.get(0).toString());
        TokenStream stream = analyzer.tokenStream(NOFIELD, textInput);
        CharTermAttribute termAttribute = stream.getAttribute(CharTermAttribute.class);

        while (stream.incrementToken()) {
            Tuple termText = tupleFactory.newTuple(termAttribute.toString());
            bagOfTokens.add(termText);
            termAttribute.setEmpty();
        }
        return bagOfTokens;
    }
}

There's absolutely nothing special going on here. Remember, the 'exec' function gets called on every Pig Tuple of input, and bagOfTokens is the Pig DataBag it returns. First, the Lucene library tokenizes the input string. Note that StandardAnalyzer does a bit more than split on word boundaries: it also lowercases each token and drops common English stop words. Then each token in the resulting stream is wrapped in a Pig Tuple and added to the result DataBag. Finally the resulting DataBag is returned. A document is truly a bag of words.
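If you want a feel for the shape of the output without pulling in Pig or Lucene, here's a rough plain-JDK sketch. To be clear, this is not Lucene: the regex split is a crude stand-in for its tokenizer, and the class name, method name and abbreviated stop list are all made up for illustration. It only approximates the lowercase-and-drop-stop-words behavior and returns tokens in order, duplicates included, the way the bag holds one tuple per token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class TokenBagSketch {

    // Abbreviated, hypothetical stop list; Lucene's default English set is longer.
    private static final Set<String> STOP_WORDS =
        new HashSet<String>(Arrays.asList("a", "an", "and", "is", "of", "the", "to"));

    // Returns the tokens of one document in order, duplicates included.
    // Unlike the UDF, a null input yields an empty list rather than null.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        if (text == null) {
            return tokens;
        }
        // Crude stand-in for a real tokenizer: lowercase, then split on
        // runs of characters that are neither letters nor digits.
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [document, truly, bag, words]
        System.out.println(tokenize("A document is truly a bag of words."));
    }
}
```

The real analyzer handles things like punctuation inside words and non-English text far more carefully, which is the whole reason to lean on Lucene instead of a regex.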

Example Pig Script

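And here's an example script that uses the UDF. It assumes the jar containing TokenizeText, along with the Lucene jar it depends on, has already been registered with Pig's REGISTER statement: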


documents    = LOAD 'documents' AS (doc_id:chararray, text:chararray);
tokenized    = FOREACH documents GENERATE doc_id AS doc_id, FLATTEN(TokenizeText(text)) AS (token:chararray);
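In case FLATTEN is unfamiliar: it turns each (doc_id, bag) pair into one row per token in the bag. Here's a tiny plain-Java sketch of that reshaping, with hypothetical names and tab-separated strings standing in for Pig tuples. It's only meant to show the shape of the 'tokenized' relation, not how Pig implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenSketch {

    // Expands one (doc_id, bag-of-tokens) pair into one "doc_id<TAB>token"
    // row per token. An empty bag produces no rows at all.
    public static List<String> flatten(String docId, List<String> bagOfTokens) {
        List<String> rows = new ArrayList<String>();
        for (String token : bagOfTokens) {
            rows.add(docId + "\t" + token);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Prints four rows, each "doc1" followed by a tab and one token.
        for (String row : flatten("doc1", Arrays.asList("document", "truly", "bag", "words"))) {
            System.out.println(row);
        }
    }
}
```

One row per (doc_id, token) is a convenient shape for a downstream GROUP, such as counting term frequencies.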

And that's it. It's blazing fast text tokenization for Apache Pig.

Hurray.

Comments:

thedatachef, April 26, 2011 at 8:44 PM
There's a typo in the java code, the last line is a weird 'databag' xml tag that I don't know how to get rid of. Just ignore it.
Siiddharth Tiwari, June 29, 2012 at 12:42 PM
Mate I think we need to parametrize the class with DataBag. what do u say ?
thedatachef, June 29, 2012 at 12:53 PM
Yes, of course :) This post was prior to me figuring out how to prevent Blogger from stripping the '>' and '<' characters. It's definitely supposed to be there though.
Siiddharth Tiwari, June 29, 2012 at 1:21 PM
Also I am getting an error for the LOG function, I dont know why its identifying it as an Alias, even though I used it in caps. It say Alias not found for LOG
Siiddharth Tiwari, June 29, 2012 at 1:31 PM
Must say this blog is an awesome place to park and learn :) . Thanks for such a wonderful blog !!!! :)
Mate, do you have an example where we could perform classification using pig ? may be using Naive bayes or some example where we can use pig to convert normal text into mahout processable format ( vectorizing text so that it could be used with mahout )


Tuesday, April 26, 2011
