Tuesday, February 15, 2011

Indexing Text Corpora With Pig and ElasticSearch

Routinely working with raw data is strenuous, ulcer-inducing, and overall hazardous to your health. Unstructured text even more so. Safe data mining requires proper mining gear. Imagine if you could step into a power-suit, fire up your hydraulic force pincers, and make that data your bitch.
Turns out, one of my specialties is building useful and interesting exoskeletons (think of Sigourney Weaver in Aliens) for developers of all shapes and sizes.

On that note, I've written a Pig STORE function for ElasticSearch. Now you can use simple Pig syntax to transform arbitrary input data and index the output records with ElasticSearch. Here's an example:

%default INDEX 'ufo_sightings'
%default OBJ 'ufo_sighting'

ufo_sightings = LOAD '/data/domestic/aliens/ufo_awesome.tsv' AS (sighted_at:long, reported_at:long, location:chararray, shape:chararray, duration:chararray, description:chararray);
STORE ufo_sightings INTO 'es://$INDEX/$OBJ' USING com.infochimps.elasticsearch.pig.ElasticSearchIndex('-1', '1000');

Here '-1' means the records have no inherent id, and '1000' is the number of records to batch up before indexing. The code is on GitHub, in the wonderdog project.
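The example above indexes the file as loaded. Since the point is that you can transform records with ordinary Pig before indexing them, here is a rough sketch of what that might look like. The FILTER and FOREACH steps are my own illustration, not something wonderdog requires, and the relation names with_text and cleaned are made up for the example. It also assumes the wonderdog and ElasticSearch jars are already available to Pig.

```pig
%default INDEX 'ufo_sightings'
%default OBJ 'ufo_sighting'

-- Same input file and schema as the example above.
ufo_sightings = LOAD '/data/domestic/aliens/ufo_awesome.tsv' AS (sighted_at:long, reported_at:long, location:chararray, shape:chararray, duration:chararray, description:chararray);

-- Illustrative cleanup step: drop records that have no description text.
with_text = FILTER ufo_sightings BY description IS NOT NULL;

-- Illustrative projection: lowercase two fields and keep only four columns.
cleaned = FOREACH with_text GENERATE sighted_at, LOWER(location) AS location, LOWER(shape) AS shape, description;

-- Same STORE call as before: no inherent id ('-1'), batches of 1000 records.
STORE cleaned INTO 'es://$INDEX/$OBJ' USING com.infochimps.elasticsearch.pig.ElasticSearchIndex('-1', '1000');
```

Only the relation handed to STORE changes; the index, object type, and the two ElasticSearchIndex arguments are exactly the ones from the original example.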

It doesn't get any simpler. You've just been endowed with magic super text indexing powers. Now go. Index some raw text.

