
Saturday, January 15, 2011

Wukong's Hadoop Convenience Utilities

Wukong comes with a number of convenience command-line utilities for working with HDFS, as well as a few commands for basic Hadoop streaming. All of them can be found in Wukong's bin directory. Here are a few:

* hdp-put
* hdp-ls
* hdp-mkdir
* hdp-rm
* hdp-catd
* hdp-stream
* hdp-stream-flat
* and more...

HDFS utilities



These are just wrappers around the hadoop fs utility to cut down on the amount of typing:

| Hadoop fs utility | Wukong convenience command |
|---|---|
| hadoop fs -put | hdp-put |
| hadoop fs -ls | hdp-ls |
| hadoop fs -mkdir | hdp-mkdir |
| hadoop fs -rm | hdp-rm |
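
Each wrapper is essentially a one-line pass-through. Here's a rough sketch of what they amount to as bash functions (the real scripts live in wukong/bin; these are illustrative, not Wukong's actual code):

```shell
# Illustrative sketches only -- the real wrappers are scripts in wukong/bin.
# Each one simply forwards its arguments to the matching hadoop fs command.
hdp-put()   { hadoop fs -put   "$@"; }
hdp-ls()    { hadoop fs -ls    "$@"; }
hdp-mkdir() { hadoop fs -mkdir "$@"; }
hdp-rm()    { hadoop fs -rm    "$@"; }
```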


hdp-catd


hdp-catd will take an arbitrary HDFS directory and cat its contents, ignoring any files whose names start with a "_" character. This means we can cat a whole directory of those awful part-xxxxx files in one go.
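
In spirit, hdp-catd does something like the following loop, only over HDFS via hadoop fs -cat rather than the local filesystem (a hypothetical local sketch, not Wukong's actual code):

```shell
# Local sketch of hdp-catd's behavior: cat every file in a directory,
# skipping any whose name starts with "_" (e.g. _SUCCESS, _logs).
# The real hdp-catd does this over HDFS with `hadoop fs -cat`.
catd() {
  local dir="$1"
  for f in "$dir"/*; do
    case "$(basename "$f")" in
      _*) ;;            # ignore underscore-prefixed files
      *)  cat "$f" ;;
    esac
  done
}
```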

hdp-stream



hdp-stream allows you to run a generic streaming job without all the typing. You almost always need to specify only the input and output paths, the mapper and reducer scripts, the number of key fields for partitioning, the number of sort key fields, and the number of reducers. Here's an example of running a uniq:


hdp-stream /path/to/input /path/to/output /bin/cat /usr/bin/uniq 2 3 -Dmapred.reduce.tasks=10


will launch a streaming job that partitions on the first 2 fields, sorts on the first 3 fields, and uses 10 reduce tasks. See http://hadoop.apache.org/common/docs/r0.20.0/mapred-default.html for other options you can pass in with the "-D" flag.
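
On a single machine, that job boils down to the following pipeline (a local sketch of what the cluster does; the partition and sort fields only matter once data is spread across multiple reducers):

```shell
# Local sketch of the streaming job above: the mapper (/bin/cat)
# passes records through, Hadoop's shuffle sorts them, and the
# reducer (/usr/bin/uniq) collapses adjacent duplicate records.
printf 'b\tx\t1\na\ty\t2\nb\tx\t1\n' > /tmp/uniq_input.tsv
sort /tmp/uniq_input.tsv | uniq > /tmp/uniq_output.tsv
```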

hdp-stream-flat



There's one other extremely common case: you don't care about partitioning at all, either because you aren't running a reduce or because you don't care how your data is distributed across reducers. In this case hdp-stream-flat is very useful. Here's how to cut out the first two fields of a large input file:


hdp-stream-flat /path/to/input /path/to/output "/usr/bin/cut -f1,2" "/bin/cat" -Dmapred.reduce.tasks=0
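
Locally, that job reduces to a single cut (a sketch; with mapred.reduce.tasks=0 the mapper output is written straight out, so there's no sort or reduce phase to model):

```shell
# Local sketch of the hdp-stream-flat job above: the "mapper" is just
# cut -f1,2 over tab-separated input; with zero reduce tasks the
# mapper's output is the job's output.
printf '1\t2\t3\n4\t5\t6\n' > /tmp/flat_input.tsv
cut -f1,2 /tmp/flat_input.tsv > /tmp/flat_output.tsv
```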


See wukong/bin for more useful command-line utilities.
