Week Eleven Activities: Data transfer to Azure SQL database from HDFS and firewall setup

In this implementation phase, I am going to demonstrate how to transfer data from the HDInsight Hadoop cluster to an Azure SQL database, and how to integrate the Power BI business intelligence tool with the Azure SQL database for data consumption and visualization.

The configuration tasks are:

  1. Establish a connection between the Azure SQL database and HDFS
  2. Create a table in the Azure SQL database
  3. Transfer data from HDFS to the Azure SQL "securitydelay" table using Apache Sqoop
  4. Log in to the Azure SQL database via Visual Studio tools and set up the firewall

Task 1: Establish a connection between the Azure SQL database and HDFS

Step 1: The first step is to install FreeTDS.  “FreeTDS is a set of libraries for Unix and Linux that allows your programs to natively talk to Microsoft SQL Server and Sybase databases” – www.freetds.org

# FreeTDS installation command for connecting to the database from the HDInsight Linux cluster

sshuser@hdnO-flight:~$ sudo apt-get --assume-yes install freetds-dev freetds-bin

1.jpg

2.jpg

Step 2: Establish a connection to the Azure SQL database with the following command:

TDSVER=8.0 tsql -H flightdelayse.database.windows.net -U momataj -P <@password> -p 1433 -D flightdelaydb

3.jpg
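Once the 1> prompt appears (as in the screenshot), a quick sanity check such as the following confirms which database the session is connected to. This is a minimal sketch, assuming the tsql connection above succeeded; it is not part of the original walkthrough:

-- Returns the name of the database the session is connected to
SELECT DB_NAME() AS current_db;
GO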

Task 2: Create a table in the Azure SQL database

Step 1: Create a table in the SQL database to receive the file exported from HDFS, using the following script:

CREATE TABLE [dbo].[securitydelay](
[origin_city_name] [nvarchar](50) NOT NULL,
[security_delay] float,
CONSTRAINT [PK_security] PRIMARY KEY CLUSTERED
([origin_city_name] ASC))
GO

1.jpg

Step 2: You can verify the newly created table with the following query:

SELECT * FROM information_schema.tables;

2.jpg

Task 3: Transfer data from HDFS to the Azure SQL "securitydelay" table using Apache Sqoop

Step 1: Test the connection with the following command:

sqoop list-databases --connect jdbc:sqlserver://flightdelayse.database.windows.net:1433 --username momataj --password <password>

“Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.” —sqoop.apache.org/

3.jpg

 

Step 2: Export data from the HDInsight Hadoop storage system to the "securitydelay" table in the Azure SQL database with the following Sqoop command:

 

sqoop export --connect 'jdbc:sqlserver://flightdelayse.database.windows.net:1433;database=flightdelaydb' --username momataj --password <password> --table 'securitydelay' --export-dir '/momo/flightdelay/output' --fields-terminated-by '\t' -m 1

4.jpg

Step 3: Now we will check whether the data has been exported to the securitydelay table.
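A query of this form (a minimal sketch, assuming the table and column names created in Task 2) confirms that the rows arrived:

-- Row count and a small sample from the exported table
SELECT COUNT(*) AS row_count FROM dbo.securitydelay;
SELECT TOP 5 origin_city_name, security_delay FROM dbo.securitydelay;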

5.jpg

Task 4: Log in to the Azure SQL database via Visual Studio tools and set up the firewall

Go to the Azure dashboard, select "flightdelaydb" (your database name), and click "Tools".

1.jpg

Select “Open in Visual Studio”

2.jpg

Before connecting through Visual Studio, you need to set up the firewall. Click "Configure your firewall".

3.jpg

Click "Add client IP" to allow your public IP address to access the Azure service.

4.jpg

5.jpg

Now you can log in to the Azure SQL database from your local machine via Visual Studio tools.

Click "Open in Visual Studio." Visual Studio opens automatically if it is already installed on your machine; if not, you need to install the Visual Studio tools for Microsoft Azure.

6.jpg

Now you can see the table schema and the output of a select query on the "securitydelay" table.
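For example, a query along these lines (an illustrative query, not taken from the original post) lists the cities with the highest average security delay:

-- Top 10 origin cities by average security delay
SELECT TOP 10 origin_city_name, security_delay
FROM dbo.securitydelay
ORDER BY security_delay DESC;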

13.jpg

 

Thank you 🙂

 

 

Week Eleven Activities: Hive jobs on Hadoop cluster

Apache Hive is a core component of the Hadoop ecosystem. Data analysts use Hive to query, summarize, explore, and analyze data, then turn it into actionable business insight. HiveQL is an SQL-like query language.

In my project, I use Hive queries to analyze the flight delay data. The Hive jobs load data from the flight delay CSV files stored in HDFS into an external table, transform it into the ORC file format, and save it in an ORC table for further processing. Hive then runs the analytical queries against the ORC table. The ORC format enables faster, more efficient, high-performance data analysis through Hive.

The list of Hive Jobs in this project:

  1. Create an external table by loading data from the CSV files
  2. Create an ORC table by inserting data from the external table, for data analysis
  3. Calculate the average security delay
  4. Insert the result into the HDFS output directory

Log in to the HDInsight Hadoop cluster using the Ambari view:

1.jpg

You will see an authentication screen. Enter the username and password to log in.

2.jpg

4.jpg

3.jpg

After logging in to the Hive View, you will see the following HiveQL query editor.

5.jpg

Job 1: Create an external table by loading data from the CSV files

  1. Create an external table in the Hive default database using the following table creation query. Hive stores the files in text format.

DROP TABLE delays_raw;
-- Creates an external table over the csv file
CREATE EXTERNAL TABLE delays_raw (
YEAR string,
FL_DATE string,
UNIQUE_CARRIER string,
CARRIER string,
FL_NUM string,
ORIGIN_AIRPORT_ID string,
ORIGIN string,
ORIGIN_CITY_NAME string,
ORIGIN_CITY_NAME_TEMP string,
ORIGIN_STATE_ABR string,
DEST_AIRPORT_ID string,
DEST string,
DEST_CITY_NAME string,
DEST_STATE_ABR string,
DEP_DELAY_NEW float,
ARR_DELAY_NEW float,
CARRIER_DELAY float,
WEATHER_DELAY float,
NAS_DELAY float,
SECURITY_DELAY float,
LATE_AIRCRAFT_DELAY float)
-- The following lines describe the format and location of the file
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/momo/flightdelay/data';

6.jpg

Now you can check that your data has been loaded into the external table with a select query:
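The exact query is shown in the screenshot below; a minimal check along these lines (a sketch against the delays_raw table defined above) is enough:

-- Preview a few rows from the external table
SELECT * FROM delays_raw LIMIT 10;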

7.jpg

Counting the number of rows in the external table delays_raw:

8.jpg
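The row count shown in the screenshot can be reproduced with a simple count query (a sketch against the same delays_raw table):

-- Count the rows loaded into the external table
SELECT COUNT(*) FROM delays_raw;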

Job 2: Create an ORC table by inserting data from the external table, for data analysis

Run the following script to create the ORC table from the external table. The query cleans the data and stores it in the ORC table, sorted and split into multiple output files.

-- Drop the delays table if it exists
DROP TABLE flightdelays;
-- Create the delays table and populate it with data
-- pulled in from the CSV file (via the external table defined previously)
CREATE TABLE flightdelays AS
SELECT YEAR AS year,
FL_DATE AS flight_date,
substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier,
substring(CARRIER, 2, length(CARRIER) -1) AS carrier,
substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num,
ORIGIN_AIRPORT_ID AS origin_airport_id,
substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code,
substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1) AS origin_state_abr,
DEST_AIRPORT_ID AS dest_airport_id,
substring(DEST, 2, length(DEST) -1) AS dest_airport_code,
substring(DEST_CITY_NAME,2) AS dest_city_name,
substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr,
DEP_DELAY_NEW AS dep_delay_new,
ARR_DELAY_NEW AS arr_delay_new,
CARRIER_DELAY AS carrier_delay,
WEATHER_DELAY AS weather_delay,
NAS_DELAY AS nas_delay,
SECURITY_DELAY AS security_delay,
LATE_AIRCRAFT_DELAY AS late_aircraft_delay
FROM delays_raw;

9.jpg

10.jpg

Now you can see that the number of rows in the external table equals the number of rows in the ORC table, which confirms that the data was inserted into the ORC table correctly.
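A side-by-side count such as the following (an illustrative HiveQL query, not one of the original jobs) makes the comparison explicit:

-- Compare row counts of the external table and the ORC table
SELECT 'delays_raw' AS table_name, COUNT(*) AS row_count FROM delays_raw
UNION ALL
SELECT 'flightdelays' AS table_name, COUNT(*) AS row_count FROM flightdelays;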

11.jpg

Job 3: Calculate the average security delay

Calculate the average security delay from the January to May 2017 data:

SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;

13.jpg

You can visualize the result of the Hive query via the Visualization option:

15.jpg

14.jpg

Data explorer: 

16.jpg

17.jpg

Visual explain: this visual explanation shows how MapReduce processes the Hive job on top of the Hadoop cluster after a query is executed.

18.jpg

Job 4: Insert the average delay analysis result into the HDFS output directory

Insert the average delay calculation result into the HDFS output directory with the following script:

INSERT OVERWRITE DIRECTORY '/momo/flightdelay/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;

19.jpg

Now you can see the output file with the average security delay saved inside the HDFS default storage container, in the output directory.

20.jpg

 

Thank you 🙂

Week Eleven Activities: Implementation (Chapter- IV)

This week, I've started writing chapter four of the report, "Implementation". The Implementation chapter will cover the several phases of the implementation, from data collection to visualizing the resulting data through business intelligence software.

I am going to follow these steps while writing this chapter:

 

  • Collection of the dataset: flight delay
  • Ingest the dataset from the local machine to Microsoft Azure: Microsoft Azure Storage
  • Prepare the dataset: data cleaning and shaping for processing
  • Create and configure the HDInsight cluster: HDInsight Hadoop cluster on Linux, with 1 master and 2 worker VMs
  • Store data in the HDInsight Hadoop cluster: data is stored in the Hadoop cluster in preparation for processing
  • Configure the network: check the network, security, and IPs
  • Process the data: using Hive
  • Transfer the result data into the Microsoft SQL data warehouse: the result data is transferred to the data warehouse, and the cluster is deleted after the transfer
  • Configure the integration between BI and the SQL data warehouse: configure Microsoft Power BI with Microsoft Azure SQL Server
  • Visualize the analytical data: generate reports and test the outcomes in different report formats
  • Test the business requirements of the dataset: test the BI features for decision making, i.e. how easily the BI system analyses the data and helps users without writing queries

Thank you 🙂

Week ten Activities: Chapter Three Outcomes

This week, I have finished my chapter three report. The outcomes of this chapter are:

  • Selected the classified dataset "Flight delay"
  • Planned the business requirements
  • Designed a customised end-to-end big data analytical suite for the Microsoft Azure platform
  • Selected Microsoft Azure big data products, tools, and components for the specific dataset
  • Selected a business intelligence application for data visualisation
  • Planned the technical requirements

Thank you 🙂

Week ten Activities: Design Diagram and selection of services

In chapter three, the design diagram for an end-to-end big data suite has been developed from the fundamental principles of big data architecture.

The design diagram section focuses on how to design a flow chart based on the business requirements and how to build an architecture by following that flow chart.

  • Discussed the fundamental principles of big data
  • Designed a flow chart for the selected dataset
  • Discussed how to build an end-to-end architecture on the Azure cloud platform
  • Designed a customised architecture for an end-to-end suite

The big data services section discusses the services that will be used while implementing the project, such as:

  • Azure storage
  • Azure HDInsight cluster
  • Microsoft SQL Data Warehouse
  • Microsoft Power BI

Thank you 🙂

Week ten Activities: Project plan (chapter three)

Chapter three is the "Project Plan". I completed chapter two last week and have now started writing chapter three. The outlines that will be the focus of this chapter are:

  • Dataset: the classified dataset I am going to use in the implementation testing is "flight delay"
  • Design diagram: the design diagram will be the specific architecture I am going to use for the infrastructure of the selected flight delay dataset
  • Big data analysis services: the services I will need to analyse the flight delay dataset
  • Business intelligence services: I am going to use the "Microsoft Power BI" tools for this service
  • Design fine-tuning: fine-tuning will be tested on the basis of cluster cost and processing time

Thank you 🙂

Week Nine Activities: Chapter two outcomes

I completed chapter two this week. This chapter is quite long because in it I discussed the areas of big data: the big data concept, the sources of big data, its importance and challenges, and business intelligence for decision making and how it simplifies business decisions today.

The chapter also covers infrastructure and platform issues, why a public cloud was selected for this project's analysis, and finally the three public cloud service providers and their available big data products.

Finally, I concluded the chapter with the challenges and demands of the big data market.

Summary of chapter two: Background

Domain analysis 

  • What is big data, and what are its sources?
  • What are big data analysis and business intelligence, why do they matter in today's business, and how can big data be processed and visualized?

Platform and infrastructure analysis:

  • What are the available platforms for big data analysis?
  • What are the drawbacks of the traditional platforms?
  • What infrastructure models do cloud platforms offer for big data analysis?

Technical analysis:

  • Discussed the three cloud service providers in terms of their big data solutions.
  • What kinds of services do they offer for big data analysis?
  • What tools and services are required for which types of data?
  • What are my selected platform and infrastructure, and why did I choose them, with justification?

As the chapter is quite long, I have to change my table of contents slightly, and I will discuss this with my supervisor in the coming week.

Thank you 🙂