Thursday, September 05, 2013

A simple tutorial on how to set up Apache Flume, HDFS, Oozie and Hive (3)


In the previous two tutorials, (1) and (2), we loaded the Twitter data into HDFS and then used Hive to create external tables to query it. Using an external table allows us to query the data without moving it from the location where it ends up in HDFS. As we add more and more data, we have to make sure the table still scales when we are dealing with large data sets. A partitioned table allows us to prune the files that we read when querying, which results in better performance. However, with the Flume agent running all the time, we continue to stream tweets into HDFS, so new partitions have to be added as the data arrives. With Oozie, we can automate the periodic process of adding partitions to our table as the new data comes in.


Apache Oozie is a workflow coordination system. It is an extremely flexible system for designing job workflows, which can be scheduled to run based on a set of criteria. We can configure the workflow to run an ALTER TABLE command that adds a partition containing the last hour’s worth of data into Hive, and we can schedule the workflow to run every hour. This will ensure that we’re always looking at up-to-date data.
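To make the hourly step concrete, here is a minimal sketch of the kind of statement such a workflow would run. The table name (tweets), the partition column (datehour) and the Flume output path (/user/flume/tweets/YYYY/MM/DD/HH) are assumptions for illustration only; substitute whatever you used in the earlier tutorials.

```shell
#!/bin/sh
# Build the HiveQL that registers one hour of Flume output as a partition.
# Assumptions (not taken from this post): the table is called "tweets",
# it is partitioned by an integer column "datehour" (YYYYMMDDHH), and
# Flume writes to /user/flume/tweets/YYYY/MM/DD/HH.
build_add_partition() {
    year=$1; month=$2; day=$3; hour=$4
    printf "ALTER TABLE tweets ADD IF NOT EXISTS PARTITION (datehour = %s%s%s%s) LOCATION '/user/flume/tweets/%s/%s/%s/%s';\n" \
        "$year" "$month" "$day" "$hour" "$year" "$month" "$day" "$hour"
}

# Example: print the statement for the hour starting 2013-09-05 14:00.
build_add_partition 2013 09 05 14
```

In the real setup Oozie substitutes the date values itself each hour; the function above only shows the shape of the statement that ends up being executed.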

To configure Oozie and prepare the workflow:
1. Install Oozie, either from CM or manually (https://oozie.apache.org/docs/3.3.2/DG_QuickStart.html). Make sure you download the ExtJS lib (http://extjs.com/deploy/ext-2.2.zip) to enable the Oozie web console. The Java 1.6+ bin directory should be in the command path.

2. Create a lib directory and copy any necessary external JARs into it:
# cd ~
# mkdir -p oozie-workflows/lib
# cp hive-serdes/target/hive-serdes-1.0-SNAPSHOT.jar oozie-workflows/lib

3. Copy hive-site.xml to the oozie-workflows directory. To execute the Hive action, Oozie needs a copy of hive-site.xml:
# sudo cp /etc/hive/conf/hive-site.xml oozie-workflows
# sudo chown <username>:<username> oozie-workflows/hive-site.xml

4. Download job.properties and copy it to the oozie-workflows directory. Use the raw file URL; the GitHub "blob" URL returns an HTML page rather than the file itself:
# wget https://raw.githubusercontent.com/cloudera/cdh-twitter-example/master/oozie-workflows/job.properties
# cp ./job.properties oozie-workflows/

5. Copy the oozie-workflows directory to HDFS (make sure you have the proper permissions):
$ hadoop fs -put oozie-workflows /user/<username>/oozie-workflows

6. Install the Oozie ShareLib in HDFS, following "Step 3: Upgrade the Oozie Sharelib" on this page:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_17_6.html

7. Start the Oozie coordinator workflow:
$ oozie job -oozie http://<oozie-host>:11000/oozie -config oozie-workflows/job.properties -run

8. You can find your new Oozie job in the Oozie web UI, under Coordinator jobs:
http://<oozie-host>:11000/oozie
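The steps above can be wrapped in a small pre-flight script. This is only a sketch: check_staging and oozie_url are hypothetical helper names, oozie-server is a placeholder hostname, and the script assumes the staging directory lives at ~/oozie-workflows as in step 2. It uploads and submits only if the expected files are present and both the hadoop and oozie clients are on the PATH.

```shell
#!/bin/sh
# Check that the local staging directory holds what steps 2-4 put there.
# Prints each missing file and returns non-zero if anything is absent.
check_staging() {
    dir=$1
    missing=0
    for f in lib/hive-serdes-1.0-SNAPSHOT.jar hive-site.xml job.properties; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}

# Server URL in the form used in step 7; the hostname is a placeholder.
oozie_url() {
    printf 'http://%s:11000/oozie\n' "$1"
}

# Upload and submit only when the staging directory is complete and
# both command-line clients are installed.
if check_staging "$HOME/oozie-workflows" \
   && command -v hadoop >/dev/null 2>&1 \
   && command -v oozie >/dev/null 2>&1; then
    hadoop fs -put "$HOME/oozie-workflows" "/user/$USER/oozie-workflows"
    oozie job -oozie "$(oozie_url oozie-server)" \
        -config "$HOME/oozie-workflows/job.properties" -run
    # List coordinator jobs to confirm the submission from the command line.
    oozie jobs -oozie "$(oozie_url oozie-server)" -jobtype coordinator
fi
```

If a file is missing, the script prints its path and stops before touching HDFS, which is usually easier to debug than a failed coordinator action an hour later.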

