Thursday, August 22, 2013

A simple tutorial on how to setup Apache flume, HDFS, Oozie and Hive (2)

So in part 1, we talked about how to configure flume agent in both Apache Hadoop and Cloudera Manager. Now we are gonna talk about how to configure Hive.

Before we can query the data, we need to ensure that the Hive table can properly interpret the JSON data. By default, Hive expects that input files use a delimited row format, but our Twitter data is in a JSON format, which will not work with the defaults. This is actually one of Hive’s biggest strengths. Hive allows us to flexibly define, and redefine, how the data is represented on disk.

SerDe stands for Serializer and Deserializer, which are interfaces that tell Hive how it should translate the data into something that Hive can process. In detail, Deserializer is used when we read data off disk, and converts the data into objects that Hive knows how to manupulate. We need to create a custom SerDe program that reads JSON data in and translate it for Hive.

1. Create the Hive directory hierarchy (If you don't have it in HDFS):
# hadoop fs -mkdir /user/hive/warehouse  
# hadoop fs -chown -R hive:hive /user/hive 
# hadoop fs -chmod 750 /user/hive 
# hadoop fs -chmod 770 /user/hive/warehouse 

2. You'll also want to add whatever user you plan on executing Hive scripts with to the hive Unix group:
# usermod -a -G hive <username>

3. If you don't have JSON SerDe, you can download it from here:
then copy it to /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hive/lib
# cp ./hive-serdes-1.0-SNAPSHOT.jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hive/lib/

4. Create the tweets hive table:
Run hive, and execute the following commands:

hive> add jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;
Added /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar to class path
Added resource: /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar
    >   id BIGINT,
    >   created_at STRING,
    >   source STRING,
    >   favorited BOOLEAN,
    >   retweeted_status STRUCT<
    >     text:STRING,
    >     user:STRUCT<screen_name:STRING,name:STRING>,
    >     retweet_count:INT>,
    >   entities STRUCT<
    >     urls:ARRAY<STRUCT<expanded_url:STRING>>,
    >     user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    >     hashtags:ARRAY<STRUCT<text:STRING>>>,
    >   text STRING,
    >   user STRUCT<
    >     screen_name:STRING,
    >     name:STRING,
    >     friends_count:INT,
    >     followers_count:INT,
    >     statuses_count:INT,
    >     verified:BOOLEAN,
    >     utc_offset:INT,
    >     time_zone:STRING>,
    >   in_reply_to_screen_name STRING
    > )
    > PARTITIONED BY (datehour INT)
    > ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
    > LOCATION '/user/flume/tweets';
Time taken: 3.879 seconds

hive> show tables;
Time taken: 0.355 seconds

Or if you are using Cloudera Manger, go to "Query Editor", and run the abow hive command. You can add jar file in "File Resources":

5. Now that the table is created, let’s insert some data in the table.

From command line:
hive> hive> LOAD DATA INPATH '/user/flume/tweets/2013/08/21/15'
    > INTO TABLE `default.tweets`
    > PARTITION (datehour='2013022516');
Loading data to table default.tweets partition (datehour=2013022516)
Partition default.tweets{datehour=2013022516} stats: [num_files: 5, num_rows: 0, total_size: 98957, raw_data_size: 0]
Table default.tweets stats: [num_partitions: 1, num_files: 5, num_rows: 0, total_size: 98957, raw_data_size: 0]

Time taken: 23.86 seconds

From Cloudera Manager:
"Hue" -> "Home" -> "Table", select tweets, then click on "Import Data", choose path and then submit.

If you are getting the following error:
Caused by: Cannot move hdfs://hadoop1:8020/user/flume/tweets/2013/08/21/16/FlumeData.1377115317350.tmp to hdfs://hadoop1:8020/user/flume/tweets/datehour=2013082116/FlumeData.1377115317350.tmp
 at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(
 ... 19 more

Make sure the permissions are correct (try as hdfs user).

After the query executes, you should see data in the table 'tweets'.

hive> select * from tweets limit 10;

Or in Cloudera Manager, browse table.

No comments: