
Sunday, January 23, 2011

Bulk Indexing With ElasticSearch and Hadoop

At Infochimps we recently indexed over 2.5 billion documents, for a total indexed size of 4TB. This would not have been possible without ElasticSearch and the Hadoop bulk loader we wrote, wonderdog. I'll go into the technical details in a later post, but for now here's how you can get started with ElasticSearch and Hadoop:

Getting Started with ElasticSearch



The first thing is to actually install elasticsearch:


$: wget http://github.com/downloads/elasticsearch/elasticsearch/elasticsearch-0.14.2.zip
$: unzip elasticsearch-0.14.2.zip
$: sudo mv elasticsearch-0.14.2 /usr/local/share/
$: sudo ln -s /usr/local/share/elasticsearch-0.14.2 /usr/local/share/elasticsearch


Next you'll want to make sure there is an 'elasticsearch' user and that there are suitable data, work, and log directories that 'elasticsearch' owns:


$: sudo useradd elasticsearch
$: sudo mkdir -p /var/log/elasticsearch /var/run/elasticsearch/{data,work}
$: sudo chown -R elasticsearch /var/{log,run}/elasticsearch


Then get wonderdog (you'll have to git clone it for now) and go ahead and copy the example configuration in wonderdog/config:
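If you don't already have it, cloning from GitHub should look roughly like this (the repository location is an assumption on my part; check the project page for the current URL):


$: git clone http://github.com/infochimps-labs/wonderdog.git
$: cd wonderdog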


$: sudo mkdir -p /etc/elasticsearch
$: sudo cp config/elasticsearch-example.yml /etc/elasticsearch/elasticsearch.yml
$: sudo cp config/logging.yml /etc/elasticsearch/
$: sudo cp config/elasticsearch.in.sh /etc/elasticsearch/


Make changes to 'elasticsearch.yml' such that it points to the correct data, work, and log directories. Also, you'll want to change 'recover_after_nodes' and 'expected_nodes' in elasticsearch.yml to however many nodes (machines) you actually expect to have in your cluster. You'll probably also want to do a quick once-over of elasticsearch.in.sh and make sure the JVM settings, etc are sane for your particular setup.
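For reference, here's a minimal sketch of the settings in question. The paths match the directories created above; the node counts and exact YAML layout are illustrative, so check them against the comments in elasticsearch-example.yml for your version:


path:
  data: /var/run/elasticsearch/data
  work: /var/run/elasticsearch/work
  logs: /var/log/elasticsearch
gateway:
  recover_after_nodes: 3
  expected_nodes: 3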


Finally, to start it up:


$: sudo -u elasticsearch /usr/local/share/elasticsearch/bin/elasticsearch -Des.config=/etc/elasticsearch/elasticsearch.yml


You should now have a happily running (reasonably configured) elasticsearch data node.
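To sanity-check that it's up, you can hit the HTTP port (9200 by default) and ask for cluster health:


$: curl -s 'http://localhost:9200/_cluster/health?pretty=true'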

Index Some Data



Prerequisites:


  • You have a working hadoop cluster

  • Elasticsearch data nodes are installed and running on all your machines and they have discovered each other. See the elasticsearch documentation for details on making that actually work.

  • You've installed the following rubygems: 'configliere' and 'json'
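If those gems are missing, a plain gem install should take care of it (this assumes a working Ruby and rubygems on the machine you'll launch the job from):


$: sudo gem install configliere json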



Get Data



As an example, let's index the UFO sightings data set from Infochimps. (You should be familiar with this one by now...) It's mostly raw text, so it's a very reasonable thing to index. Once it's downloaded, go ahead and throw it on the HDFS:

$: hadoop fs -mkdir /data/domestic/ufo
$: hadoop fs -put chimps_16154-2010-10-20_14-33-35/ufo_awesome.tsv /data/domestic/ufo/


Index Data



This is the easy part:


$: bin/wonderdog --rm --field_names=sighted_at,reported_at,location,shape,duration,description --id_field=-1 --index_name=ufo_sightings --object_type=ufo_sighting --es_config=/etc/elasticsearch/elasticsearch.yml /data/domestic/ufo/ufo_awesome.tsv /tmp/elasticsearch/ufo/out


Flags:

'--rm' - Remove the output path on the HDFS if it already exists
'--field_names' - A comma-separated list of the field names in the tsv, in order
'--id_field' - The field to use as the record id; -1 if the records have no inherent id
'--index_name' - The index name to bulk load into
'--object_type' - The type of objects we're indexing
'--es_config' - Points to the elasticsearch config*

*The elasticsearch config referenced here must be present on all of the Hadoop machines and contain a 'hosts' entry listing the IPs of all the elasticsearch data nodes (see wonderdog/config/elasticsearch-example.yml). This means the Hadoop job can run on a different cluster than the one the elasticsearch data nodes are running on.

The other two arguments are the input and output paths. The output path in this case only gets written to if one or more index requests fail. This way you can re-run the job on only those records that didn't make it the first time.

The indexing should go pretty quickly. Next, refresh the index so we can actually query our newly indexed data. There's a tool in wonderdog's bin directory for that:

$: bin/estool --host=`hostname -i` refresh_index
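If you'd rather skip estool, the same refresh can be issued directly against the REST API (a sketch, assuming a data node is reachable on the default port 9200):


$: curl -XPOST 'http://localhost:9200/ufo_sightings/_refresh'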



Query Data



Once again, use estool:

$: bin/estool --host=`hostname -i` --index_name=ufo_sightings --query_string="ufo" query
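The equivalent query straight against the REST API would look something like this (again assuming the default port; pretty=true just makes the JSON readable):


$: curl 'http://localhost:9200/ufo_sightings/_search?q=ufo&pretty=true'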


Hurray.