

Sunday, January 23, 2011

Bulk Indexing With ElasticSearch and Hadoop

At Infochimps we recently indexed over 2.5 billion documents, for a total indexed size of 4TB. This would not have been possible without ElasticSearch and the Hadoop bulk loader we wrote, wonderdog. I'll go into the technical details in a later post, but for now here's how you can get started with ElasticSearch and Hadoop:

Getting Started with ElasticSearch



The first thing is to actually install elasticsearch:


$: wget http://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.14.2.zip
$: unzip elasticsearch-0.14.2.zip
$: sudo mv elasticsearch-0.14.2 /usr/local/share/
$: sudo ln -s /usr/local/share/elasticsearch-0.14.2 /usr/local/share/elasticsearch


Next you'll want to make sure there is an 'elasticsearch' user and that there are suitable data, work, and log directories that 'elasticsearch' owns:


$: sudo useradd elasticsearch
$: sudo mkdir -p /var/log/elasticsearch /var/run/elasticsearch/{data,work}
$: sudo chown -R elasticsearch /var/{log,run}/elasticsearch


Then get wonderdog (you'll have to git clone it for now) and, from inside your wonderdog checkout, copy over the example configuration in wonderdog/config:


$: sudo mkdir -p /etc/elasticsearch
$: sudo cp config/elasticsearch-example.yml /etc/elasticsearch/elasticsearch.yml
$: sudo cp config/logging.yml /etc/elasticsearch/
$: sudo cp config/elasticsearch.in.sh /etc/elasticsearch/


Edit 'elasticsearch.yml' so that it points to the correct data, work, and log directories. Also, set 'recover_after_nodes' and 'expected_nodes' in elasticsearch.yml to however many nodes (machines) you actually expect to have in your cluster. You'll probably also want to do a quick once-over of elasticsearch.in.sh and make sure the jvm settings, etc are sane for your particular setup.
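
For reference, here's a sketch of the relevant elasticsearch.yml settings. The paths match the directories created above; the node counts are placeholders for your actual cluster size:

path.data: /var/run/elasticsearch/data
path.work: /var/run/elasticsearch/work
path.logs: /var/log/elasticsearch
gateway.recover_after_nodes: 2
gateway.expected_nodes: 2

Finally, to start it up: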


sudo -u elasticsearch /usr/local/share/elasticsearch/bin/elasticsearch -Des.config=/etc/elasticsearch/elasticsearch.yml


You should now have a happily running (reasonably configured) elasticsearch data node.
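
To sanity check it, hit the HTTP interface (port 9200 by default); you should get back a small JSON blob with the node's name and version:

$: curl -s localhost:9200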

Index Some Data



Prerequisites:


  • You have a working hadoop cluster

  • Elasticsearch data nodes are installed and running on all your machines and they have discovered each other. See the elasticsearch documentation for details on making that actually work.

  • You've installed the following rubygems: 'configliere' and 'json'
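
If you're missing those gems, this should do it (you may not need the sudo, depending on your ruby setup):

$: sudo gem install configliere json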



Get Data



As an example, let's index the UFO sightings data set from Infochimps. (You should be familiar with this one by now...) It's mostly raw text, so it's a very reasonable thing to index. Once it's downloaded, go ahead and throw it on the HDFS:

$: hadoop fs -mkdir /data/domestic/ufo
$: hadoop fs -put chimps_16154-2010-10-20_14-33-35/ufo_awesome.tsv /data/domestic/ufo/


Index Data



This is the easy part:


$: bin/wonderdog --rm --field_names=sighted_at,reported_at,location,shape,duration,description --id_field=-1 --index_name=ufo_sightings --object_type=ufo_sighting --es_config=/etc/elasticsearch/elasticsearch.yml /data/domestic/ufo/ufo_awesome.tsv /tmp/elasticsearch/ufo/out


Flags:

'--rm' - Remove output on the hdfs if it exists
'--field_names' - A comma-separated list of the field names in the tsv, in order
'--id_field' - The field to use as the record id, -1 if the record has no inherent id
'--index_name' - The index name to bulk load into
'--object_type' - The type of objects we're indexing
'--es_config' - Points to the elasticsearch config*

*The elasticsearch config passed via '--es_config' must be present on all of the hadoop machines and must have a 'hosts' entry listing the IPs of all the elasticsearch data nodes (see wonderdog/config/elasticsearch-example.yml). This means the hadoop job can run on a completely different cluster than the one the elasticsearch data nodes live on.
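
The hosts entry amounts to something like the following; the IPs here are placeholders, and the exact layout may differ, so defer to wonderdog/config/elasticsearch-example.yml:

hosts: 10.195.10.1,10.195.10.2,10.195.10.3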

The other two arguments are the input and output paths. The output path in this case only gets written to if one or more index requests fail. This way you can re-run the job on only those records that didn't make it the first time.

The indexing should go pretty quickly.
Next, refresh the index so we can actually query our newly indexed data. There's a tool in wonderdog's bin directory for that:

$: bin/estool --host=`hostname -i` refresh_index



Query Data



Once again, use estool:

$: bin/estool --host=`hostname -i` --index_name=ufo_sightings --query_string="ufo" query
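
If you'd rather skip estool and poke elasticsearch directly, the same query against the REST API looks roughly like this (assuming the default HTTP port of 9200):

$: curl -s 'localhost:9200/ufo_sightings/_search?q=ufo'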


Hurray.

Saturday, January 22, 2011

JRuby and Hadoop, Notes From a Non-Java Programmer

So I spent a fair deal of time playing with JRuby this weekend. Here are my notes and conclusions so far. Disclaimer: my experience with java overall is limited, so most of this might be obvious to a java ninja but maybe not to a ruby one...

Goal:

The whole point here is that it sure would be nice to only write a few lines of ruby code into a file called something like 'rb_mapreduce.rb' and run it by saying: ./rb_mapreduce.rb. No compiling, no awkward syntax. Pure ruby, plain and simple.

Experiments/Notes:

I created a 'WukongPlusMapper' that subclassed 'org.apache.hadoop.mapreduce.Mapper' and implemented the important methods, namely 'map'. Then I set up and launched a job from inside jruby using this jruby mapper ('WukongPlusMapper') as the map class.
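
In spirit, the experiment looked something like this. This is a sketch, not the actual WukongPlus code: it assumes jruby is running with the hadoop jars on the classpath, the map body is stand-in logic, and the paths are placeholders:

require 'java'

java_import 'org.apache.hadoop.mapreduce.Job'
java_import 'org.apache.hadoop.mapreduce.lib.input.FileInputFormat'
java_import 'org.apache.hadoop.mapreduce.lib.output.FileOutputFormat'
java_import 'org.apache.hadoop.fs.Path'

# subclass the java Mapper directly and override map
class WukongPlusMapper < org.apache.hadoop.mapreduce.Mapper
  def map(key, value, context)
    # stand-in logic: pass each record straight through
    context.write(key, value)
  end
end

job = Job.new
job.set_mapper_class WukongPlusMapper.java_class
FileInputFormat.add_input_path(job, Path.new('/tmp/rb_mapreduce/input'))
FileOutputFormat.set_output_path(job, Path.new('/tmp/rb_mapreduce/output'))
job.wait_for_completion(true)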

The job launched and ran just fine. But...

Problems and Lessons Learned:

  • It is possible (in fact extremely easy) to setup and launch a Hadoop job with pure jruby

  • It is not possible, as far as I can tell, to use an uncompiled jruby class as either the mapper or the reducer for a Hadoop job. It doesn't throw an error (so long as you've subclassed a proper java mapper) but instead just uses the superclass's definition. I believe the reason is that each map task must have access to the full class definition for its mapper (only sensible) and has no idea what to do with my transient 'WukongPlusMapper' class. Obviously the same applies to the reducer

  • It is possible to compile a jruby class ahead-of-time, stuff it into the job jar, and then launch the job with ordinary means. There are a couple somewhat obvious drawbacks with this method:

    • You've got to specify 'java_signatures' for each of your methods that are going to be called from java (see the sketch after this list)

    • Messy logic+code for compiling thyself, stuffing thyself into a jar, shipping thyself with MR job. Might as well just write java at this point. radoop has some logic for doing that pretty well laid out.

  • It is possible to define and create an object in jruby that subclasses a java class or implements a java interface, then simply overwrite the methods you want to overwrite. You can pass instances of this class to a java runtime that only knows about the superclass, and the subclass's methods (at least the ones with signatures defined in the superclass) will work just fine. Unfortunately (and plainly obvious in hindsight), this does NOT work with Hadoop, since these instances all show up in java as 'proxy classes' and are only accessible to the launching jvm

  • On another note, there is the option of using the scripting engine, which, as far as I can tell, is what both jruby-on-hadoop and radoop are using. Something of concern, though, is that neither of these projects seems to have much traction. However, it may be that the scripting engine is the only way to reasonably make this work; at least 2 people vote yes ...
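
To make the ahead-of-time route concrete, here's roughly what it involves. Again a sketch, not the actual WukongPlus code; the signature string and filenames are illustrative:

# wukong_plus_mapper.rb
require 'java'

class WukongPlusMapper < org.apache.hadoop.mapreduce.Mapper
  # jrubyc needs an explicit java signature for any method java will call back into
  java_signature 'void map(java.lang.Object, java.lang.Object, org.apache.hadoop.mapreduce.Mapper.Context)'
  def map(key, value, context)
    context.write(key, value)
  end
end

Compile it into real java classes with jrubyc, then add them to your job jar:

$: jrubyc --javac wukong_plus_mapper.rb
$: jar uf my_job.jar WukongPlusMapper.class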




So, implementation complexity aside, it looks like all one would have to do is come up with some way of making JRuby's in-memory class definitions available to all of the spawned mappers and reducers. Probably not something I want to delve into at the moment.

Script engine it is.

Saturday, January 15, 2011

Real Deal Concrete Hadoop Examples

If you think you have to be a java programmer to use Hadoop then you've been lied to. Hadoop is not hard. What makes learning hadoop (or, more correctly, map reduce) tedious is the lack of concrete and useful examples. Word counts are f*ing boring. The next few posts will give an overview of two of the most useful higher-level abstractions on top of hadoop (Pig and Wukong), with copious examples.

Thursday, January 13, 2011

JAVA_HOME on mac os X

Instead of searching for and trying to remember the JAVA_HOME path on a mac, there's a simple trick:


export JAVA_HOME=`/usr/libexec/java_home`
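
To make it stick across sessions, drop that same line into your shell profile (e.g. ~/.bash_profile):

$: echo 'export JAVA_HOME=`/usr/libexec/java_home`' >> ~/.bash_profile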