Week Nine Activities: Chapter two outcomes

I completed chapter two this week. The chapter is long because it covers the main areas of big data: the big data concept, sources of big data, its importance and challenges, and business intelligence for decision making and how it simplifies business decisions today.

The chapter also covers infrastructure and platform issues, explains why the public cloud was selected for analysis in this project, and finally describes the three public cloud service providers and their available big data products.

Finally, I concluded the chapter with the challenges and demand in the big data market.

Summary of chapter two: Background

Domain analysis:

  • What are big data and sources of big data?
  • What are big data analysis and business intelligence, why do they matter in today’s business, and how can big data be processed and visualised?

Platform and infrastructure analysis:

  • What are the available platforms for big data analysis?
  • What are the drawbacks of the traditional platform?
  • What infrastructure models does the cloud platform offer for big data analysis?

Technical analysis:

  • Discussion of the three cloud service providers in terms of big data solutions.
  • What kinds of services do they offer for big data analysis?
  • Which tools and services are required for which type of data?
  • What is my selected platform and infrastructure, and why did I choose it?

As the chapter is too long, I have to change my table of contents slightly, and I will discuss this with my supervisor in the coming week.

Thank you 🙂



Week Nine Activities: Use SSH with HDInsight to Connect

In my test environment, I used the Linux (Ubuntu) operating system for the nodes in the HDInsight Hadoop cluster.

SSH endpoints for logging in to the Linux cluster:

<clustername>-ssh.azurehdinsight.net, port 22: primary head node

<clustername>-ssh.azurehdinsight.net, port 23: secondary head node

<edgenodename>.<clustername>-ssh.azurehdinsight.net, port 22: edge node

<clustername>-ed-ssh.azurehdinsight.net, port 22: edge node (R Server on HDInsight)

Step 1: If you don’t have PuTTY on your machine, download it from the internet and install it before establishing a connection.

Step 2: Run PuTTY. Enter the hostname or IP address and port number 22 for the SSH connection.

Here my hostname is: “flightdelayh-ssh.azurehdinsight.net”.

If you don’t know the hostname, go to the HDInsight Hadoop dashboard, open “Secure Shell (SSH)” from the overview screen, and get your SSH hostname from there.

Step 3: After entering the HDInsight Hadoop SSH hostname and port number, click “Open” to establish the connection.


Step 4: You will get a confirmation message. Click “Yes” to connect.

Step 5: Provide the SSH user name and password.


Step 6: You can see the connection is established.
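
For readers who prefer a command-line SSH client to PuTTY, the endpoint naming pattern from the start of this post can be sketched in a few lines of Python. This is only an illustration: the function names `ssh_endpoints` and `ssh_command` and the user name `sshuser` are hypothetical, and the cluster name `flightdelayh` is taken from my hostname above.

```python
def ssh_endpoints(cluster_name, edge_node_name=None):
    """Return (role, host, port) tuples following the HDInsight SSH naming pattern."""
    base = f"{cluster_name}-ssh.azurehdinsight.net"
    endpoints = [
        ("primary head node", base, 22),
        ("secondary head node", base, 23),
    ]
    # The edge node entry only applies if the cluster has an edge node.
    if edge_node_name is not None:
        endpoints.append(("edge node", f"{edge_node_name}.{base}", 22))
    return endpoints


def ssh_command(user, host, port):
    """Build an OpenSSH command line for the given endpoint."""
    return f"ssh -p {port} {user}@{host}"


# "sshuser" is a placeholder; use the SSH user name chosen at cluster creation.
for role, host, port in ssh_endpoints("flightdelayh"):
    print(f"{role}: {ssh_command('sshuser', host, port)}")
```

The head nodes share one hostname and differ only by port, which is why the port is kept as a separate field instead of being baked into the host string.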


Thank you 🙂

Week Eight Activities: Background (Chapter 2) of Report

This week, I am working on Chapter 2: Background.

I have divided the background into three parts: Domain Analysis, Infrastructure Analysis, and Technical Analysis.

I have already finished the Domain Analysis, where I covered:

  • What is big data?
  • Classification of types of big data
  • Classified big data sets
  • Big data analysis
  • Business Intelligence (BI)

In the Infrastructure Analysis I am going to discuss:

  • On-Premises Data centre platform
  • Cloud Platform: Deployment models and delivery models

In the Technical Analysis part, I will focus on different cloud providers and the services they offer, and justify the cloud services I selected.

Thank You 🙂

Week Eight Activities: Configure Azure HDInsight with Hadoop cluster

The second phase of my project is to configure the HDInsight Hadoop cluster.

This week I configured Azure HDInsight with a Hadoop cluster. In this post, I am going to share how to create an HDInsight Hadoop cluster for analysing big data sets.

The function of Hadoop is batch query and analysis of stored data. By default, a Hadoop cluster has two head nodes and one or more data (worker) nodes, each with the default VM size.


But in my project, I am going to use two head nodes and two worker nodes.

Requirements and services for configuring the HDInsight Hadoop cluster:

Cluster tier: HDInsight service tiers

Microsoft Azure offers its big data cloud in two service tiers: Standard and Premium. During testing, I am using the Standard tier and the Linux operating system.

The default node configuration and virtual machine sizes for clusters:

Head: default VM size: D3 v2; recommended VM sizes: D3 v2, D4 v2, D12 v2

I selected head node VM size D12 v2: 2 nodes, 8 cores in total.

Worker: default VM size; recommended VM sizes: D3 v2, D4 v2, D12 v2

I selected worker node VM size D4 v2 for my project: 2 nodes, 16 cores in total.

Supported HDInsight versions: Highly available clusters with two head nodes and one worker node are deployed by default for HDInsight 2.1 and above. The latest available version is HDInsight 3.6 (Hortonworks Data Platform 2.6, Apache Hive & HCatalog 1.2.1). The default version is HDInsight 3.5 (Hortonworks Data Platform 2.5, Apache Hive & HCatalog 1.2.1). I selected the latest version for my project implementation.

Configuration Steps:

Step 1: Select HDInsight from the Data + Analytics tab on the left side of the dashboard. You will find Hadoop and other big data solution services inside HDInsight.


Step 2: Click “Custom” to configure size, settings, and applications. Provide the information needed for the Basics configuration settings.

HDInsight Hadoop configuration I selected for my project:

Cluster type: Hadoop, Operating system: Linux, Version: Hadoop 2.7.3 (HDI 3.6), Cluster tier: Standard. You can customise your settings as per your business requirements and budget.




Step 3: In the “Storage Account Settings” step, select a storage account that you configured in the previous implementation phase. Here, I am using my existing “flightdelaystore”.

A default container is created inside the Azure Storage account and is used as the default file system for Hadoop, storing the files for data analysis.



Step 4: Configure “Cluster size”. There are 2 head nodes by default; you need to select the number of worker nodes. Before configuring the cluster size, keep in mind your data size and how fast you need to process your data.

In my project, I am going to use 2 worker nodes and 2 head nodes. The choice of VM also depends on the data size and on the scale of execution and analysis. You also need to consider your analysis budget when selecting VM cores.

My subscription allows 60 cores, but the data in my project is not very large, around 5 GB, so I am going to use 24 of the 60 cores.
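
The core count can be checked with a small calculation. The sketch below is only illustrative: the per-node core counts are derived from the totals quoted in this post (8 cores over 2 D12 v2 nodes, 16 cores over 2 D4 v2 nodes), and the names `CORES_PER_NODE` and `total_cores` are my own.

```python
# Cores per node, derived from the totals in this post:
# D12 v2: 8 cores / 2 head nodes, D4 v2: 16 cores / 2 worker nodes.
CORES_PER_NODE = {"D12 v2": 4, "D4 v2": 8}

# Core limit of my subscription, as stated above.
SUBSCRIPTION_CORE_QUOTA = 60


def total_cores(node_groups):
    """Sum the cores over (vm_size, node_count) pairs."""
    return sum(CORES_PER_NODE[size] * count for size, count in node_groups)


# 2 head nodes on D12 v2 and 2 worker nodes on D4 v2.
cluster = [("D12 v2", 2), ("D4 v2", 2)]
used = total_cores(cluster)
remaining = SUBSCRIPTION_CORE_QUOTA - used
print(f"cores used: {used}, cores left in quota: {remaining}")
```

Changing the node counts in `cluster` shows quickly whether a larger configuration would still fit within the quota.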




Step 5: Validation check. Before the cluster is created, the system validates your configuration. If it shows any errors, check your setup again; otherwise, the cluster is ready for creation and deployment.

Step 6: Summary of Configuration



Step 7: Click “Create” to deploy. Azure HDInsight takes 15-20 minutes to deploy a cluster regardless of the number of nodes. Even a cluster with 100 nodes takes the same deployment time.

Step 8: Download template and parameters. You can download the template and parameters for later deployments, for example if you want to reuse the same configuration for another cluster.


Step 9: HDInsight cluster deployment. You will get a notification when deployment starts and another when it completes successfully. The cluster is then ready for data processing and analysis.


Step 10: HDInsight cluster overview




Step 11: Tools for managing and monitoring HDInsight 


Step 12: Cluster login information


Step 13: Scale cluster. You can scale your cluster up and down while jobs are running. It automatically moves data and processing jobs across the parallel distributed system without any interruption.


Step 14: Secure Shell (SSH): login information for accessing the cluster through SSH.


Step 15: Storage account. Here you can monitor and check information for the storage account that is integrated with HDInsight, and you can also edit it if required.


Step 16: Access Control (IAM): You can control access to the HDInsight cluster through the add and remove options in the IAM menu. You can also configure your own roles or use the default roles for your users.


Step 18: Activity log. The activity log shows activities such as access times and cluster operations.


Step 19: Cluster dashboard. If you have more than one cluster, you can view all of them on the cluster dashboard.


Step 20: Delete cluster. Click the “Delete” option to delete the cluster after the job is done.

Note: After completing your batch processing job on the HDInsight Hadoop cluster, delete the cluster to reduce your cost, because an HDInsight cluster is charged from the moment it starts running until it is deleted.
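
To see why deleting the cluster matters, here is a rough cost sketch. The hourly rate is a made-up placeholder and not real Azure pricing, and the function name `cluster_cost_cents` is hypothetical; the sketch also assumes every node costs the same, which a real cluster with different head and worker VM sizes would not. Amounts are in whole cents to avoid floating-point rounding.

```python
def cluster_cost_cents(rate_cents_per_node_hour, node_count, hours_running):
    """Cost of a cluster that is billed for every hour it exists."""
    return rate_cents_per_node_hour * node_count * hours_running


# Placeholder rate: 50 cents per node per hour (NOT an actual Azure price).
RATE_CENTS = 50
NODES = 4  # 2 head nodes + 2 worker nodes, as in my project

# Cluster deleted right after a 3-hour batch job.
job_only = cluster_cost_cents(RATE_CENTS, NODES, 3)

# Same cluster left running for a whole week (7 * 24 hours).
left_running = cluster_cost_cents(RATE_CENTS, NODES, 7 * 24)

print(f"deleted after job: {job_only / 100:.2f}")
print(f"left running for a week: {left_running / 100:.2f}")
```

Whatever the real rate is, the cost grows with the hours the cluster exists, not with the hours it is busy, so an idle cluster costs as much as a working one.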


Step 21: Cluster deleted successfully.


Thank you 🙂


Week Seven Activities: Configuring Azure Storage Account

In my implementation phase, the first task is to create and configure an Azure Storage account to load datasets into the cloud.

To analyse data with an HDInsight cluster, the data can be stored either in Azure Data Lake Store or in an Azure Storage account. Both keep the data safe, without any data loss, after the HDInsight Hadoop cluster is deleted.

An Azure Blob storage container is used as the default file system for the HDInsight big data analytics system and stores all types of structured or unstructured data. Azure Storage is robust and easily integrated with HDInsight.

Steps to open an Azure Storage account:

Step 1: Log in to your Microsoft Azure account. Select “Storage Accounts” from the list on the left side of the dashboard.

Step 2: Select “Create Storage Accounts”.

Note that the account kind must be “General Purpose” and the performance “Standard”, because the HDInsight Hadoop cluster only supports this configuration.

Step 3: Configure the “Resource Group”. If you have already created a resource group, you can use it, or you can create a new one. Then select the “Location”. Note: Use the same location for the whole implementation. I selected the “East US” location for my project.
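
The storage account constraints from Steps 2 and 3 can be written down as a simple pre-flight check. This is only a sketch of the rules stated in this post, not an Azure API call; the function name `check_storage_settings` and its parameters are hypothetical.

```python
# Constraints described in this post for use with an HDInsight Hadoop cluster.
REQUIRED_KIND = "General Purpose"
REQUIRED_PERFORMANCE = "Standard"


def check_storage_settings(kind, performance, storage_location, cluster_location):
    """Return a list of problems; an empty list means the settings look fine."""
    problems = []
    if kind != REQUIRED_KIND:
        problems.append(f"account kind must be {REQUIRED_KIND!r}, got {kind!r}")
    if performance != REQUIRED_PERFORMANCE:
        problems.append(
            f"performance must be {REQUIRED_PERFORMANCE!r}, got {performance!r}"
        )
    # The same location should be used for the whole implementation.
    if storage_location != cluster_location:
        problems.append("storage account and cluster should use the same location")
    return problems


# The settings I used in this project.
print(check_storage_settings("General Purpose", "Standard", "East US", "East US"))
```

Returning a list of problems instead of raising on the first one makes it possible to see every mismatch at once before opening the portal.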


Step 4: Click “Create”


Step 5: After I received the notification of successful deployment, my storage account was ready for loading data.



Step 6: Monitor the Resource Provider Status. Here you can register and unregister services that can be integrated with Azure storage.


Step 7: Azure Storage overview after completing the configuration


Configuring Azure Storage for integration with HDInsight is simple: it needs just a few clicks and some attention to a few selection types. I finished it without any complications.

Thank you 🙂

Week Seven Activities: Microsoft Azure Big Data solution

Microsoft Azure offers a range of big data services and tools for data analysis, as well as integration services with leading productivity applications.

The list of services is:

  • Big data analytics with HDInsight: Hadoop on Azure
  • Azure Data Lake Analytics
  • Deploying Apache Hadoop for big data with Hortonworks Data Platform on Azure
  • Orchestrating big data pipelines with Azure Data Factory
  • Machine learning

Integration services with leading productivity applications:

  • Cloudera Enterprise Data
  • Datameer
  • Informatica Cloud Service


Thank you 🙂