Week Eleven Activities: Create and configure Azure SQL Database

Azure SQL Database is a relational database-as-a-service that provides a high-performance, secure, and reliable database built on the Microsoft SQL Server engine.

In phase three of my project implementation, I am using Azure SQL Database to store and publish the analyzed data produced by the HDInsight Hadoop ecosystem. The business intelligence application Microsoft Power BI will then be integrated with Azure SQL Database for data visualization.

The steps of Azure SQL Database configuration:

Step 1: From the left-hand menu of the Microsoft Azure portal, select Databases, then click SQL Database.

1.jpg

Step 2: Give the database a name. In my case, I used “flightdelaydb”. Select your subscription and resource group, then click Server configuration to open the server settings.

2

Step 3: Click “Create a new server” and provide the required information. Select the same location as the HDInsight Hadoop cluster and Azure storage account; in my case, I used “East US”. Then click “Select”.

3.jpg

Step 4: Now you need to configure the elastic pool and the pricing tier.

4.jpg

Elastic pools are a solution for managing and scaling multiple databases in a cost-effective way. The databases in an elastic pool sit on a single Azure SQL Database server and share a set amount of resources, which optimizes price-performance for the group of databases.

DTU stands for Database Transaction Unit, and eDTU stands for elastic Database Transaction Unit. They measure the performance levels of the different Azure SQL Database service tiers based on a blended amount of resource utilization, such as CPU, memory, and I/O. The ratio among those resources is determined from real-world OLTP workloads.

Step 5: I am not going to use an elastic pool because I only need a single database. The choice of pricing tier depends on your budget, data size, and performance needs. I am going to use the “Basic” pricing tier because my demo data is smaller than 5 GB; if yours is larger, you can choose “Standard”. The cost of the pricing tier is calculated from monthly usage.

Step 6: Click “Create” to deploy the database. Deployment takes one to two minutes.

7

Step 7: After you click “Create”, Azure validates the configuration and then starts the deployment within seconds. You will get an error during validation if your configuration does not meet the requirements.

8.jpg

9.jpg

Step 8: You will get a notification once the deployment completes successfully.

10.jpg

Step 9: Overview of configuration

11.jpg

 

Thank you 🙂

 


Week Eleven Activities: Data transfer to Azure SQL database from HDFS and firewall setup

In this implementation phase, I am going to demonstrate how to transfer data from the HDInsight Hadoop cluster to Azure SQL Database, and how to integrate the business intelligence tool Power BI with Azure SQL Database for data consumption and visualization.

The configuration tasks are:

  1. Establish a connection between the Azure SQL database and HDFS
  2. Create a table in the Azure SQL database
  3. Transfer data from HDFS to the Azure SQL “securitydelay” table with Apache Sqoop
  4. Log in to the Azure SQL database via Visual Studio tools and set up the firewall

Task 1: Establish a connection between the Azure SQL database and HDFS

Step 1: The first step is to install FreeTDS.  “FreeTDS is a set of libraries for Unix and Linux that allows your programs to natively talk to Microsoft SQL Server and Sybase databases” – www.freetds.org

# FreeTDS installation command for connecting to the database from the HDInsight Linux cluster

sshuser@hdnO-flight:~$ sudo apt-get --assume-yes install freetds-dev freetds-bin

1.jpg

2.jpg

Step 2: Establish a connection to Azure SQL Database with the following command:

TDSVER=8.0 tsql -H flightdelayse.database.windows.net -U momataj -P <@password> -p 1433 -D flightdelaydb

3.jpg
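If the connection succeeds, you land at a tsql prompt connected to “flightdelaydb”. As an optional sanity check (not part of the original steps, just a minimal sketch), you can run a trivial query:

-- Confirm which database the session is connected to
SELECT DB_NAME();
GO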

Task 2: Create a table in Azure SQL database

Step 1: Create a table in the SQL database to receive the file exported from HDFS, using the following script:

CREATE TABLE [dbo].[securitydelay](
[origin_city_name] [nvarchar](50) NOT NULL,
[security_delay] float,
CONSTRAINT [PK_security] PRIMARY KEY CLUSTERED
([origin_city_name] ASC))
GO

1.jpg

Step 2: You can verify the newly created table with the following query:

2

SELECT * FROM information_schema.tables;
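To check for the new table specifically, a narrower variant of the same query can be used (a sketch; the screenshot uses the broader form):

-- Look up the newly created table by name
SELECT TABLE_SCHEMA, TABLE_NAME
FROM information_schema.tables
WHERE TABLE_NAME = 'securitydelay';
GO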

Task 3: Transfer data from HDFS to the Azure SQL “securitydelay” table with Apache Sqoop

Step 1: Test the connection with the following command:

sqoop list-databases --connect jdbc:sqlserver://flightdelayse.database.windows.net:1433 --username momataj --password <password>

“Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.” —sqoop.apache.org/

3.jpg

 

Step 2: Export data from the HDInsight Hadoop storage system to the “securitydelay” table of the Azure SQL database with the following Sqoop command:

 

sqoop export --connect 'jdbc:sqlserver://flightdelayse.database.windows.net:1433;database=flightdelaydb' --username momataj --password <password> --table 'securitydelay' --export-dir '/momo/flightdelay/output' --fields-terminated-by '\t' -m 1

4.jpg

Step 3: Now we will check whether the data was exported into the securitydelay table.
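A minimal way to check this from the tsql session is a row count and a preview (a sketch; the exact statements in the screenshot below may differ):

-- Count the exported rows and preview a few of them
SELECT COUNT(*) AS row_count FROM dbo.securitydelay;
GO
SELECT TOP 10 * FROM dbo.securitydelay;
GO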

5

Task 4: Log in to the Azure SQL database via Visual Studio tools and set up the firewall

Go to the Azure dashboard, open “flightdelaydb” (your database name), and click “Tools”.

1.jpg

Select “Open in Visual Studio”

2.jpg

Before connecting through Visual Studio, you need to set up the firewall. Click “Configure your firewall”.

3.jpg

Click “Add client IP” to allow your public IP address to access the Azure service.

4.jpg

5.jpg
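As an alternative to the portal, a server-level firewall rule can also be created with the sp_set_firewall_rule stored procedure in the server’s master database. This is only a sketch: the rule name and IP address are placeholders, and it assumes you can already connect as the server admin from an allowed network.

-- Hypothetical example: allow a single client IP at the server level (run against the master database)
EXECUTE sp_set_firewall_rule
    @name = N'AllowMyClientIP',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';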

Now you can log in to the Azure SQL database from your local machine via the Visual Studio tools.

Click “Open in Visual Studio”. Visual Studio opens automatically if it is already installed on your machine; if not, you need to install the Visual Studio tools for Microsoft Azure.

6.jpg

Now you can see the table schema and the output of a select query against the “securitydelay” table.

13.jpg
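Any T-SQL can be run from the Visual Studio query window; the following is a sketch of the kind of select shown above (the ordering is my own choice, not taken from the screenshot):

-- Ten origin cities with the highest average security delay
SELECT TOP 10 origin_city_name, security_delay
FROM dbo.securitydelay
ORDER BY security_delay DESC;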

 

Thank you 🙂

 

 

Week Eleven Activities: Hive jobs on Hadoop cluster

Apache Hive is a core building block of the Hadoop ecosystem. Data analysts use Hive to query, summarize, explore, and analyze data, then turn it into actionable business insight. HiveQL, its query language, is SQL-like.

In my project, I use Hive queries to analyze the flight delay data. The Hive jobs load the flight delay CSV files stored in HDFS into an external table, transform the data into ORC file format, and save it in an ORC table for further processing. Hive then runs the analytical queries against the ORC table, which enables faster, more efficient, higher-performance data analysis.

The list of Hive Jobs in this project:

  1. Create an external table by loading data from the CSV files
  2. Create an ORC table by inserting data from the external table
  3. Calculate the average security delay
  4. Insert the result into the HDFS output directory

Log in to the HDInsight Hadoop cluster using the Ambari view:

1.jpg

You will see an authentication screen. Enter the username and password to log in.

2.jpg

4.jpg

3

After logging in to the Hive View, you will see the following HiveQL query editor.

5.jpg

Job 1: Create an external table by loading data from the CSV files

  1. Create an external table in Hive’s default database using the following table-creation query. The external table keeps the underlying files in text format.

DROP TABLE delays_raw;
-- Creates an external table over the csv file
CREATE EXTERNAL TABLE delays_raw (
YEAR string,
FL_DATE string,
UNIQUE_CARRIER string,
CARRIER string,
FL_NUM string,
ORIGIN_AIRPORT_ID string,
ORIGIN string,
ORIGIN_CITY_NAME string,
ORIGIN_CITY_NAME_TEMP string,
ORIGIN_STATE_ABR string,
DEST_AIRPORT_ID string,
DEST string,
DEST_CITY_NAME string,
DEST_STATE_ABR string,
DEP_DELAY_NEW float,
ARR_DELAY_NEW float,
CARRIER_DELAY float,
WEATHER_DELAY float,
NAS_DELAY float,
SECURITY_DELAY float,
LATE_AIRCRAFT_DELAY float)
-- The following lines describe the format and location of the file
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/momo/flightdelay/data';

6.jpg

Now you can check that the data was loaded into the external table with a select query.
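The exact statement used in the screenshot below may differ, but a minimal check is simply to preview a few rows:

-- Preview a few rows from the external table
SELECT * FROM delays_raw LIMIT 10;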

7.jpg

Calculating the number of rows in the external table delays_raw:

8.jpg
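A query of this form returns the count shown in the screenshot (a sketch; the screenshot’s exact statement may differ):

-- Row count of the external table
SELECT COUNT(*) AS total_rows FROM delays_raw;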

Job 2: Create the ORC table by inserting data from the external table

Run the following script to create the ORC table from the external table. The SELECT also cleans the data, trimming the leading quote character from the text fields, so it is ready for analysis.

-- Drop the delays table if it exists
DROP TABLE flightdelays;
-- Create the delays table and populate it with data
-- pulled in from the CSV file (via the external table defined previously)
CREATE TABLE flightdelays AS
SELECT YEAR AS year,
FL_DATE AS flight_date,
substring(UNIQUE_CARRIER, 2, length(UNIQUE_CARRIER) -1) AS unique_carrier,
substring(CARRIER, 2, length(CARRIER) -1) AS carrier,
substring(FL_NUM, 2, length(FL_NUM) -1) AS flight_num,
ORIGIN_AIRPORT_ID AS origin_airport_id,
substring(ORIGIN, 2, length(ORIGIN) -1) AS origin_airport_code,
substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
substring(ORIGIN_STATE_ABR, 2, length(ORIGIN_STATE_ABR) -1) AS origin_state_abr,
DEST_AIRPORT_ID AS dest_airport_id,
substring(DEST, 2, length(DEST) -1) AS dest_airport_code,
substring(DEST_CITY_NAME,2) AS dest_city_name,
substring(DEST_STATE_ABR, 2, length(DEST_STATE_ABR) -1) AS dest_state_abr,
DEP_DELAY_NEW AS dep_delay_new,
ARR_DELAY_NEW AS arr_delay_new,
CARRIER_DELAY AS carrier_delay,
WEATHER_DELAY AS weather_delay,
NAS_DELAY AS nas_delay,
SECURITY_DELAY AS security_delay,
LATE_AIRCRAFT_DELAY AS late_aircraft_delay
FROM delays_raw;

9.jpg

10.jpg
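The CREATE TABLE AS SELECT above relies on the cluster’s default Hive file format. If you want the ORC format stated explicitly, the same pattern can carry a STORED AS clause; the sketch below is only an illustration (the table name flightdelays_orc and the shortened column list are hypothetical, not part of the script I ran):

-- Hypothetical variant: an explicitly ORC-formatted table (abbreviated column list)
CREATE TABLE flightdelays_orc STORED AS ORC AS
SELECT FL_DATE AS flight_date,
       substring(ORIGIN_CITY_NAME, 2) AS origin_city_name,
       SECURITY_DELAY AS security_delay
FROM delays_raw;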

Now you can see that the number of rows in the external table equals the number of rows in the ORC table, which confirms the data was inserted into the ORC table correctly.

11.jpg

Job 3: Calculate the average security delay

Calculating the average security delay per origin city from the January to May 2017 data (the regexp_replace call strips the quote characters left around the city names):

SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;

13.jpg

You can visualize the result of the Hive query via the Visualization option:

15.jpg

14.jpg

Data explorer: 

16.jpg

17.jpg

Visual Explain: this view shows how MapReduce processes the Hive job on top of the Hadoop cluster after a query is executed.

18.jpg
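The plan that Visual Explain draws can also be requested as text with HiveQL’s EXPLAIN keyword (a sketch using the Job 3 query; the output format depends on the execution engine):

-- Text form of the query plan that Visual Explain renders graphically
EXPLAIN
SELECT regexp_replace(origin_city_name, '''', ''),
       avg(security_delay) AS security_delay
FROM flightdelays
GROUP BY origin_city_name;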

Job 4: Insert the average security delay result into the HDFS output directory

Insert the average security delay result into the HDFS output directory with the following script:

INSERT OVERWRITE DIRECTORY '/momo/flightdelay/output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT regexp_replace(origin_city_name, '''', ''),
avg(security_delay) AS security_delay
FROM flightdelays
WHERE security_delay IS NOT NULL
GROUP BY origin_city_name;

19.jpg

Now you can see the output file with the average security delay saved in the output directory of the default HDFS container.

20.jpg

 

Thank you 🙂

Week Eleven Activities: Implementation (Chapter- IV)

This week, I’ve started to write chapter four of the report, “Implementation”. The implementation chapter will cover several phases, from data collection to visualizing the resulting data through business intelligence software.

These are the steps I am going to follow while writing this chapter:

 

  • Collection of dataset: the flight delay dataset
  • Ingest the dataset from the local machine to Microsoft Azure: Microsoft Azure Storage
  • Prepare the dataset: data cleaning and shaping for processing
  • Create and configure the HDInsight cluster: HDInsight Hadoop cluster on Linux, with 1 master and 2 worker VMs
  • Store data in the HDInsight Hadoop cluster: store the data in the Hadoop cluster and prepare it for processing
  • Configure the network: check the network, security, and IP settings
  • Process the data: using Hive
  • Transfer the result data into the Microsoft SQL data warehouse: transfer the results to the data warehouse, then delete the cluster after the transfer
  • Configure the integration between BI and the SQL data warehouse: configure Microsoft Power BI with Microsoft Azure SQL Server
  • Visualize the analytical data: generate reports and test the outcomes with different report formats
  • Test the business requirements of the dataset: test the BI features for decision making and how easily the BI system lets a user analyse data without writing queries

Thank you 🙂

Week ten Activities: Chapter Three Outcomes

This week, I finished my chapter three report. The outcomes of this chapter are:

  • Selected the classified dataset “flight delay”
  • Planned the business requirements
  • Designed a customised end-to-end big data analytical suite for the Microsoft Azure platform
  • Selected the Microsoft Azure big data products, tools, and components for the specific dataset
  • Selected a business intelligence application for data visualisation
  • Planned the technical requirements

Thank you 🙂

Week ten Activities: Design Diagram and selection of services

In chapter three, the design diagram for an end-to-end big data suite has been developed from the fundamental principles of big data architecture.

The design diagram section focuses on how to design a flow chart based on the business requirements and how to build an architecture that follows that flow chart.

  • Discussed the fundamental principles of big data
  • Designed a flow chart for the selected dataset
  • Discussed how to build an end-to-end architecture on the Azure cloud platform
  • Designed a customised architecture for the end-to-end suite

The big data services section discusses the services that will be used while implementing the project, such as:

  • Azure storage
  • Azure HDInsight cluster
  • Microsoft SQL Data Warehouse
  • Microsoft Power BI

Thank you 🙂

Week ten Activities: Project plan (chapter three)

Chapter three is the “Project Plan”. I completed chapter two last week, and now I’ve started to write chapter three. The outline of this chapter is:

  • Dataset: the classified dataset I am going to use for the implementation testing is “flight delay”
  • Design diagram: the specific architecture I am going to use for the infrastructure around the selected flight delay dataset
  • Big data analytics services: the services I will need to analyse the flight delay dataset
  • Business intelligence services: I am going to use the Microsoft Power BI tools for this service
  • Design fine-tuning: fine-tuning will be tested on the basis of cluster cost and processing time

Thank you 🙂