Apache Hadoop - How to Install a Three-Node Cluster (3)
At this point you should have your three-node Hadoop cluster up and running: hadoop1 is the namenode and resource manager, hadoop2 is the job history server, and hadoop3 is the datanode. We haven't configured the secondary namenode yet, but since the cluster is up and running, let's run some tests first.
Tests we will run:
Grep - this example extracts matching strings from text files and counts how many times they occur.
The command works differently from the Unix grep command: it doesn't display the complete matching line, only the matching string itself, so to display lines matching "foo", use .*foo.* as the regular expression.
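The regex semantics behind that tip can be illustrated with a quick Python sketch (this only demonstrates the pattern-matching behavior, it does not run Hadoop):

```python
import re

line = "my foo line"

# The Hadoop Grep example emits only the matching string itself:
print(re.findall("foo", line))        # ['foo']

# To reproduce Unix grep's whole-line output, match the entire line:
print(re.findall(".*foo.*", line))    # ['my foo line']
```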
The program runs two map/reduce jobs in sequence. The first job counts how many times a matching string occurred and the second job sorts matching strings by their frequency and stores the output in a single output file.
Each mapper of the first job takes a line as input and matches the user-provided regular expression against it. It extracts all matching strings and emits (matching string, 1) pairs. Each reducer sums the frequencies of each matching string. The output is a set of sequence files containing the matching string and its count. The reduce phase is optimized by running a combiner that sums the frequencies of strings from the local map output, which reduces the amount of data that needs to be shipped to a reduce task.
The second job takes the output of the first job as input. The mapper is an inverse map, while the reducer is an identity reducer. The number of reducers is one, so the output is stored in a single file, sorted by count in descending order. The output file is text; each line contains a count and a matching string.
The example also demonstrates how to pass a command-line parameter to a mapper or a reducer. This is done by adding (key, value) pairs to the job's configuration before the job is submitted. Map or reduce tasks are able to access the value by getting it from the job's configuration in the method configure.
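The two chained jobs, and the configuration-based parameter passing, can be sketched as a toy Python simulation. The names `grep_job` and `grep.pattern` are made up for illustration; real Hadoop code would use the Job/Configuration API:

```python
import re
from collections import Counter

def grep_job(lines, conf):
    """Toy simulation of the Hadoop Grep example's two chained jobs."""
    # conf plays the role of the job configuration: the mapper reads the
    # user-provided regular expression from it, just as a Hadoop mapper
    # would read it from the job's configuration.
    pattern = re.compile(conf["grep.pattern"])

    # Job 1: each mapper emits (matching string, 1) pairs; the
    # combiner/reducer sums the counts per matching string.
    counts = Counter()
    for line in lines:
        for match in pattern.findall(line):
            counts[match] += 1

    # Job 2: an inverse map swaps (string, count) -> (count, string);
    # a single identity reducer yields one output sorted by count,
    # in descending order.
    return sorted(((c, s) for s, c in counts.items()), reverse=True)

lines = ["this was a test", "it was what it was"]
print(grep_job(lines, {"grep.pattern": "was"}))   # [(3, 'was')]
```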
1. First we create a user directory in HDFS:
# su hdfs
$ hadoop fs -mkdir /user
$ hadoop fs -mkdir /user/lxu
$ hadoop fs -chown -R lxu:lxu /user/lxu
$ hadoop fs -ls /user/
drwxr-xr-x   - lxu lxu          0 2014-02-26 14:11 /user/lxu
Switch to user lxu.
Create the input directory:
$ hadoop fs -mkdir /user/lxu/grep-input
Upload sample.txt to grep-input. You can download the sample.txt file from:
sample.txt(https://drive.google.com/file/d/0B-r3WDLvcH3iT0U1c3A4TWlVajQ/edit?usp=sharing)
$ hadoop fs -put sample.txt /user/lxu/grep-input
If you see an error like:
put: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): 131072 < 1048576
Check your hdfs-site.xml file and increase your block size to at least 64 MB (67108864).
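For example, in hdfs-site.xml (dfs.blocksize is the Hadoop 2.x property name; note that the value must be at least dfs.namenode.fs-limits.min-block-size):

```xml
<property>
  <name>dfs.blocksize</name>
  <!-- 64 MB -->
  <value>67108864</value>
</property>
```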
$ hadoop fs -ls /user/lxu/grep-input
-rw-r--r--   1 lxu lxu     124640 2014-02-26 14:38 /user/lxu/grep-input/sample.txt
Run the grep command:
$ hadoop org.apache.hadoop.examples.Grep grep-input grep-output 'was'
....
14/02/27 16:23:14 INFO mapreduce.Job: Counters: 43
	File System Counters
		FILE: Number of bytes read=20
		FILE: Number of bytes written=158799
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=233
		HDFS: Number of bytes written=8
		HDFS: Number of read operations=7
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters
		Launched map tasks=1
		Launched reduce tasks=1
		Data-local map tasks=1
		Total time spent by all maps in occupied slots (ms)=5705
		Total time spent by all reduces in occupied slots (ms)=6419
	Map-Reduce Framework
		Map input records=1
		Map output records=1
		Map output bytes=12
		Map output materialized bytes=20
		Input split bytes=127
		Combine input records=0
		Combine output records=0
		Reduce input groups=1
		Reduce shuffle bytes=20
		Reduce input records=1
		Reduce output records=1
		Spilled Records=2
		Shuffled Maps =1
		Failed Shuffles=0
		Merged Map outputs=1
		GC time elapsed (ms)=147
		CPU time spent (ms)=1400
		Physical memory (bytes) snapshot=280195072
		Virtual memory (bytes) snapshot=1793179648
		Total committed heap usage (bytes)=136450048
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters
		Bytes Read=106
	File Output Format Counters
		Bytes Written=8
....
$ hadoop fs -cat /user/lxu/grep-output/part-r-00000
288	was