This post was kindly contributed by DATA ANALYSIS - go there to comment and to read the full post.
Hadoop has its own shell, called the FS shell. Compared with the Bash shell common in the Linux ecosystem, the FS shell offers far fewer commands. To deal with the huge volume of data stored across the Hadoop nodes, I rely on the following 10 commands, most of them familiar from Linux, in my daily work.
1. sort
It is good practice to test map/reduce programs on the local machine before releasing the time-consuming map/reduce code to the cluster. The sort command simulates the sort and shuffle step of the map/reduce process. For example, I can run the pipeline below to check whether the Python code has any bugs.
    ./mapper.py | sort | ./reducer.py
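When the real mapper and reducer are not at hand, the same local check can be sketched with shell stand-ins. The two functions below are hypothetical word-count replacements for ./mapper.py and ./reducer.py, and the sample input is made up; only the map, sort, reduce shape comes from the pipeline above.

```shell
# Hypothetical stand-in for ./mapper.py: emit "word<TAB>1" per token.
mapper() {
  awk '{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'
}

# Hypothetical stand-in for ./reducer.py: sum the counts per key.
# It assumes equal keys arrive on adjacent lines, which is what sort provides.
reducer() {
  awk -F '\t' '
    $1 != key { if (NR > 1) print key "\t" total; key = $1; total = 0 }
    { total += $2 }
    END { if (NR > 0) print key "\t" total }'
}

# Made-up input; LC_ALL=C keeps the sort order byte-wise and predictable.
wordcount=$(printf 'b a\na b\nc a\n' | mapper | LC_ALL=C sort | reducer)
printf '%s\n' "$wordcount"
```

Dropping the sort step from this pipeline makes the reducer emit the same key more than once, which is the kind of bug the local test is meant to catch before the job reaches the cluster.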
2. tail
Interestingly, the FS shell only supports the tail command and not the head command, so I can only grab the bottom of a file stored on Hadoop. The FS version takes no line count and prints the last kilobyte of the file, so I pipe it through the Linux tail to keep the last 5 lines.
    hadoop fs -tail data/web.log.9 | tail -n 5
3. sed
Since the FS shell doesn’t provide the head command, the alternative is the sed command, which actually has more flexible options. With GNU sed, the address 1,+5 covers line 1 and the five lines after it, so the command below prints the first six lines.
    hadoop fs -cat data/web.log.9 | sed '1,+5!d'
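For comparison, here is a small sketch of two other ways to get head-like behavior from sed, run on a made-up 20-line sample instead of a real Hadoop file. Both forms are standard sed; the second one quits after line 5, so on a large file sed stops reading early rather than scanning to the end.

```shell
# Made-up sample standing in for: hadoop fs -cat data/web.log.9
sample=$(awk 'BEGIN { for (i = 1; i <= 20; i++) print "line " i }')

# Print lines 1-5; sed still reads (and discards) the rest of the input.
first_five=$(printf '%s\n' "$sample" | sed -n '1,5p')

# Print lines 1-5 and quit right after line 5.
first_five_quit=$(printf '%s\n' "$sample" | sed 5q)

printf '%s\n' "$first_five_quit"
```

Either form can be placed after hadoop fs -cat in the same way as the command above.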
4. stat
The stat command tells me when the file was last modified.
    hadoop fs -stat data/web.log.9
5. awk
The commands that the FS shell supports usually have very few options. For example, the du command under the FS shell does not support the -sh option to aggregate the disk usage of the sub-directories. In this case, I turn to the awk command to add up the sizes.
    hadoop fs -du data | awk '{sum+=$1} END {print sum}'
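The sum can also be made easier to read. The sketch below runs on made-up du output (the sizes and file names are hypothetical, and it assumes the size in bytes is the first column) and uses printf inside awk, since some awk implementations may print very large sums in exponent notation when plain print is used.

```shell
# Hypothetical output of: hadoop fs -du data
du_output='1048576  data/web.log.1
2097152  data/web.log.2
524288  data/web.log.3'

# Total size in bytes, forced to a plain integer format.
total_bytes=$(printf '%s\n' "$du_output" | awk '{ sum += $1 } END { printf "%.0f", sum }')

# The same total expressed in MiB (1 MiB = 1048576 bytes).
total_mib=$(printf '%s\n' "$du_output" | awk '{ sum += $1 } END { printf "%.1f MiB", sum / 1048576 }')

printf '%s bytes = %s\n' "$total_bytes" "$total_mib"
```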
6. wc
One of the first things to learn about a file stored on Hadoop is its total number of lines.
    hadoop fs -cat data/web.log.9 | wc -l
7. cut
The cut command is convenient for selecting specified columns from a file. For example, I can count the lines for each unique group taken from character positions 5 through 14. Because uniq -c only merges adjacent duplicates, the values are sorted first.
    hadoop fs -cat data/web.log.9 | cut -c 5-14 | sort | uniq -c
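To see why the order of the grouped values matters, here is a minimal sketch on a made-up log whose characters 5 through 14 hold a date. The log lines are hypothetical; the point is only that uniq -c counts runs of adjacent identical lines, so unsorted input yields one group per run instead of one group per value.

```shell
# Made-up log lines; characters 5-14 are the date field.
log='GET 2013-01-05 /index
GET 2013-01-06 /about
GET 2013-01-05 /cart
GET 2013-01-06 /index
GET 2013-01-05 /about'

# Without sort: every date change starts a new group.
unsorted_groups=$(printf '%s\n' "$log" | cut -c 5-14 | uniq -c | wc -l | tr -d ' ')

# With sort: one group per distinct date. awk reorders the columns to
# "value count" and drops the padding that uniq -c puts before the count.
counts=$(printf '%s\n' "$log" | cut -c 5-14 | LC_ALL=C sort | uniq -c | awk '{ print $2, $1 }')

printf 'groups without sort: %s\n%s\n' "$unsorted_groups" "$counts"
```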
8. getmerge
The great thing about the getmerge command is that it fetches all the result files of a map/reduce job to the local file system as a single file.
    hadoop fs -getmerge result result_merged.txt
9. grep
I can start a map-only job with nothing more than the grep command from the Bash shell, searching for the lines that contain the keywords I am interested in. Setting the number of reduce tasks to 0 makes it a map-only task.
    hadoop jar $STREAMING -D mapred.reduce.tasks=0 -input data -output result -mapper "bash -c 'grep -e Texas'"
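The mapper command can be tried locally before it is handed to the streaming job. The sketch below uses made-up input lines and also records the exit status: grep returns 1 when nothing matches, and to my knowledge Hadoop Streaming treats a non-zero mapper exit code as a task failure by default, so an input split without any match might fail the task. The "|| true" guard is my own assumption about how to avoid that, not part of the original command, and it also hides genuine grep errors.

```shell
# The mapper command from the streaming job above.
mapper_cmd='grep -e Texas'

# Made-up input with two matching lines.
matches=$(printf 'Austin, Texas\nBoston, Massachusetts\nDallas, Texas\n' | bash -c "$mapper_cmd")
printf '%s\n' "$matches"

# Input without a match: grep exits with status 1.
no_match_status=0
printf 'Boston, Massachusetts\n' | bash -c "$mapper_cmd" || no_match_status=$?

# Hypothetical guard: force a zero exit status even when nothing matches.
guarded_status=0
printf 'Boston, Massachusetts\n' | bash -c "$mapper_cmd || true" || guarded_status=$?

printf 'no match: %s, guarded: %s\n' "$no_match_status" "$guarded_status"
```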
10. at and crontab
The at and crontab commands are my favorites for scheduling a job on Hadoop. For example, I use the commands below to clean up the map/reduce results in the middle of the night.
    at 0212
    at> hadoop fs -rmr result
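For a cleanup that repeats every night, a crontab entry is the natural counterpart to the one-off at job. The sketch below only builds and prints the entry; the schedule (midnight, every day) is an assumption for illustration, and the install and at lines are left as comments because they change the system's job queue. cron often runs with a minimal PATH, so the full path to the hadoop binary may be needed in practice.

```shell
# The cleanup command used with at above.
cleanup_cmd='hadoop fs -rmr result'

# Crontab fields: minute hour day-of-month month day-of-week command.
# "0 0 * * *" means 00:00 every day (assumed schedule).
cron_entry="0 0 * * * $cleanup_cmd"
printf '%s\n' "$cron_entry"

# To install it alongside any existing entries (not run here):
#   (crontab -l 2>/dev/null; printf '%s\n' "$cron_entry") | crontab -
# One-off, non-interactive equivalent with at (not run here):
#   echo "$cleanup_cmd" | at midnight
```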