Week Eleven Activities: Hive jobs on Hadoop cluster

Apache Hive is a core component of the Hadoop ecosystem. Data analysts use Hive to query, summarize, explore, and analyze data, then turn it into actionable business insight. HiveQL is Hive's SQL-like query language.

In my project, I use Hive queries to analyze flight delay data. The Hive jobs load data from flight delay CSV files stored in HDFS into an external table, transform it into the ORC file format, and save it in an ORC table for further processing. Hive then runs analytical queries against the ORC table. The ORC format enables faster, more efficient, higher-performance data analysis through Hive.

The Hive jobs in this project:

  1. Create an external table by loading data from CSV files
  2. Create an ORC table by inserting data from the external table, and analyze the data
  3. Calculate the average security delay
  4. Write the result to an HDFS output directory

Log in to the HDInsight Hadoop cluster using the Ambari view:

[Screenshot: opening the Ambari view]

You will see an authentication screen. Enter your user name and password to log in.

[Screenshots: Ambari login and selecting the Hive View]

After opening the Hive View, you will see the following HiveQL query editor.

[Screenshot: HiveQL query editor]

Job 1: Create an external table by loading data from CSV files

  1. Create an external table in the Hive default database using the following table creation query. The external table stores the files in text format.

DROP TABLE delays_raw;
-- Creates an external table over the CSV files
CREATE EXTERNAL TABLE delays_raw (
YEAR string,
FL_DATE string,
UNIQUE_CARRIER string,
CARRIER string,
FL_NUM string,
ORIGIN_AIRPORT_ID string,
ORIGIN string,
ORIGIN_CITY_NAME string,
ORIGIN_CITY_NAME_TEMP string,
ORIGIN_STATE_ABR string,
DEST_AIRPORT_ID string,
DEST string,
DEST_CITY_NAME string,
DEST_STATE_ABR string,
DEP_DELAY_NEW float,
ARR_DELAY_NEW float,
CARRIER_DELAY float,
WEATHER_DELAY float,
NAS_DELAY float,
SECURITY_DELAY float,
LATE_AIRCRAFT_DELAY float)
-- The following lines describe the format and location of the files
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/momo/flightdelay/data';

[Screenshot: running the table creation query]

Now you can check that your data is loaded into the external table with a select query:

[Screenshot: sample rows from delays_raw]
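The check shown in the screenshot can be sketched as follows (a hedged example; the exact query shown may differ):

-- Preview a few rows to confirm the CSV data is readable through the external table
SELECT * FROM delays_raw LIMIT 10;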

Counting the number of rows in the external table delays_raw:

[Screenshot: row count of delays_raw]
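The row count can be reproduced with a query along these lines (a sketch, not necessarily the exact query shown):

-- Count the rows loaded into the external table
SELECT COUNT(*) FROM delays_raw;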

Job 2: Create an ORC table by inserting data from the external table, and analyze the data

Run the following script to create the ORC table from the external table. The script cleans the data (the substring calls strip the stray quote characters from the string fields) and stores it in ORC format for further analysis.

-- Drop the delays table if it exists
DROP TABLE flightdelays;
-- Create the delays table in ORC format and populate it with data
-- pulled in from the CSV files (via the external table defined previously)
CREATE TABLE flightdelays STORED AS ORC AS
SELECT YEAR AS year,
FL_DATE AS flight_date,
substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier,
substring(CARRIER, 2, length(CARRIER) -1) AS carrier,
substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num,
ORIGIN_AIRPORT_ID AS origin_airport_id,
substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code,
substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1) AS origin_state_abr,
DEST_AIRPORT_ID AS dest_airport_id,
substring(DEST, 2, length(DEST) -1) AS dest_airport_code,
substring(DEST_CITY_NAME, 2) AS dest_city_name,
substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr,
DEP_DELAY_NEW AS dep_delay_new,
ARR_DELAY_NEW AS arr_delay_new,
CARRIER_DELAY AS carrier_delay,
WEATHER_DELAY AS weather_delay,
NAS_DELAY AS nas_delay,
SECURITY_DELAY AS security_delay,
LATE_AIRCRAFT_DELAY AS late_aircraft_delay
FROM delays_raw;

[Screenshots: running the ORC table creation query]

Now you can see that the number of rows in the external table equals the number of rows in the ORC table, which confirms that the data was inserted into the ORC table accurately.

[Screenshot: row count of flightdelays]
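The comparison can be made with a pair of count queries, for example (a hedged sketch):

-- The two counts should match if the insert completed without dropping rows
SELECT COUNT(*) FROM delays_raw;
SELECT COUNT(*) FROM flightdelays;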

Job 3: Calculate the average security delay

Calculating the average security delay per origin city from the January to May 2017 data:

SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;

[Screenshot: average security delay results]

You can visualize the result of the Hive query via the Visualization option:

[Screenshots: visualization of the query results]

Data explorer: 

[Screenshots: Data explorer view]

Visual Explain: this view shows how MapReduce processes the Hive job on top of the Hadoop cluster after a query is executed.

[Screenshot: Visual Explain of the query plan]
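The same execution plan can also be inspected in text form with Hive's EXPLAIN statement, for example:

-- Show the stage plan Hive generates for the security delay query
EXPLAIN
SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;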

Job 4: Insert the average delay analysis result into the HDFS output directory

Write the average delay calculation result to the HDFS output directory with the following script:

INSERT OVERWRITE DIRECTORY '/momo/flightdelay/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;

[Screenshot: running the INSERT OVERWRITE query]

Now you can see the output file with the average security delay saved in the output directory inside the HDFS default file container.

[Screenshot: output file in the HDFS output directory]

 

Thank you 🙂
