Week Eight Activities: Configure Azure HDInsight with Hadoop cluster

The second phase of my project is to configure the HDInsight Hadoop cluster.

This week I have been configuring Azure HDInsight with a Hadoop cluster. In this post, I am going to share how to create an HDInsight Hadoop cluster for analysing big data sets.

Hadoop's main function is batch querying and analysis of stored data. By default, a Hadoop cluster has two head nodes and one or more data (worker) nodes, each with a default VM size.

1

In my project, however, I am going to use two head nodes and two worker nodes.

Requirements and services for configuring the HDInsight Hadoop cluster:

Cluster tier: HDInsight service tiers

Microsoft Azure offers its big data cloud service in two service tiers: Standard and Premium. During testing, I am using the Standard tier and the Linux operating system.

The default node configuration and virtual machine sizes for clusters:

Head: default VM size: D3 v2; recommended VM sizes: D3 v2, D4 v2, D12 v2

I selected this head node VM size: D12 v2, 2 nodes, 8 cores

Worker: default VM size, recommended VM sizes: D3 v2, D4 v2, D12 v2

I selected this worker node VM size for my project: D4 v2, 2 nodes, 16 cores

Supported HDInsight versions: Highly available clusters with two head nodes and one worker node are deployed by default for HDInsight 2.1 and above. The latest available version is HDInsight 3.6 (Hortonworks Data Platform 2.6, Apache Hive & HCatalog 1.2.1). The default version is HDInsight 3.5 (Hortonworks Data Platform 2.5, Apache Hive & HCatalog 1.2.1). I selected the latest version for my project implementation.

Configuration Steps:

Step 1: Select HDInsight from the Data + Analytics tab on the left side of the dashboard. Hadoop and the other big data services are available inside HDInsight.

1.jpg

Step 2: Click “Custom” to configure the size, settings and applications. Provide the information needed for the Basics configuration settings.

HDInsight Hadoop configuration I selected for my project: 

Cluster type: Hadoop, Operating system: Linux, Version: Hadoop 2.7.3 (HDI 3.6), Cluster tier: Standard. You can customise your settings according to your business requirements and budget.

2.jpg

3.jpg

4.jpg

Step 3: In the “Storage Account Settings” step, you need to select a storage account that you already configured in the previous implementation phase. Here, I am using my existing “flightdelaystore”.

You can see that a default container is created inside the Azure Storage account. It is used as the default file system for Hadoop (HDFS) and stores the data files for analysis.
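Jobs on the cluster usually address files in that default container with a `wasb://` URI. Here is a minimal sketch of how such a URI is put together. The storage account name comes from this post, but the container name and file path are hypothetical placeholders, so substitute the container the portal created for you.

```python
def wasb_uri(account, container, path=""):
    """Build the wasb:// URI Hadoop uses for a blob in Azure Storage."""
    base = f"wasb://{container}@{account}.blob.core.windows.net"
    # Strip a leading slash so the URI never contains a double slash.
    return f"{base}/{path.lstrip('/')}"


# "flightdelay-container" and the CSV path are made-up examples.
print(wasb_uri("flightdelaystore", "flightdelay-container", "data/flights.csv"))
```

From inside the cluster itself, a path on the default container can normally be written without the account and container part, since it is already the default file system.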

5.jpg

6.jpg

Step 4: Configure “Cluster size”. There are two head nodes by default; you need to select the number of worker nodes. Before configuring the cluster size, keep in mind your data size and how fast you need your jobs to run.

In my project, I am going to use 2 worker nodes and 2 head nodes. The choice of VM also depends on the data size and on the execution and analysis workload. You also need to consider your analysis budget when selecting VM cores.

My subscription allows up to 60 cores, but my project data is not very large (around 5 GB), so I am going to use 24 of the 60 cores.
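The core count can be checked with a few lines of arithmetic. This is only a sketch: the cores per VM are inferred from the totals quoted in this post (2 x D12 v2 = 8 cores, 2 x D4 v2 = 16 cores), and the function names are my own.

```python
# Cores per VM, inferred from the totals in this post.
CORES_PER_VM = {"D12 v2": 4, "D4 v2": 8}


def total_cores(nodes):
    """Sum the cores over (vm_size, node_count) pairs."""
    return sum(CORES_PER_VM[size] * count for size, count in nodes)


def fits_quota(nodes, quota):
    """True if the cluster stays within the subscription's core quota."""
    return total_cores(nodes) <= quota


cluster = [("D12 v2", 2), ("D4 v2", 2)]  # head nodes, worker nodes
quota = 60
used = total_cores(cluster)
print(f"{used} of {quota} cores used, {quota - used} left")
```

Running the same check before adding worker nodes shows quickly whether a larger cluster would still fit in the subscription.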

7.jpg

8.jpg

9.jpg

Step 5: Validation check. Before the cluster is created, the system validates your configuration. If it shows any errors, check your setup again; otherwise, the cluster is ready to be created and deployed.

10.jpg

Step 6: Summary of Configuration

11.jpg

12

Step 7: Click Create for deployment. Azure HDInsight takes 15-20 minutes to deploy a cluster regardless of the cluster size. Even if a cluster contains 100 nodes, it takes about the same deployment time.

Step 8: Download the template and parameters. You can download them for later deployments, or if you want to reuse the same configuration to deploy another cluster.
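To give an idea of what the downloaded template describes, here is a rough sketch of the cluster properties for my configuration, built as a Python dictionary and printed as JSON. The field names follow the `Microsoft.HDInsight/clusters` resource schema as I understand it, and the helper function is my own, so treat this as an approximation and compare it with the template you actually download.

```python
import json


def role(name, vm_size, count):
    """One entry of the compute profile: a node role, its VM size and count."""
    return {
        "name": name,
        "targetInstanceCount": count,
        "hardwareProfile": {"vmSize": vm_size},
    }


# Values taken from this post; key names are assumed from the ARM schema.
cluster_properties = {
    "clusterVersion": "3.6",
    "osType": "Linux",
    "tier": "Standard",
    "clusterDefinition": {"kind": "hadoop"},
    "computeProfile": {
        "roles": [
            role("headnode", "Standard_D12_v2", 2),
            role("workernode", "Standard_D4_v2", 2),
        ]
    },
}

print(json.dumps(cluster_properties, indent=2))
```

Keeping the sizes and counts in one place like this makes it easy to see what would change if you redeployed the same cluster with more worker nodes.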

13.jpg

Step 9: HDInsight cluster deployment. You will get a notification when the deployment starts, and another one when it has completed successfully.

14.jpg

Step 9: HDInsight is deployed successfully and is ready for data processing and analysis.

15.jpg

Step 10: HDInsight cluster overview

16.jpg

17.jpg

18.jpg

Step 11: Tools for managing and monitoring HDInsight 

19.jpg

Step 12: Cluster login information

20.jpg

Step 13: Scale cluster. You can scale your cluster up and down while jobs are running. It automatically redistributes data and processing jobs across the parallel distributed system without any interruption.

21.jpg

Step 14: Secure Shell (SSH). Login information for accessing the cluster through SSH.

22.jpg
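The SSH login shown in the portal follows a fixed pattern, so the command can be built from the cluster name. A small sketch, assuming the usual `<clustername>-ssh.azurehdinsight.net` endpoint; the cluster name below is a hypothetical placeholder, and `sshuser` is only the name the portal suggests by default, so use whatever SSH user you entered in Step 2.

```python
def ssh_command(cluster_name, ssh_user="sshuser"):
    """Return the ssh command for an HDInsight cluster's head node."""
    host = f"{cluster_name}-ssh.azurehdinsight.net"
    return f"ssh {ssh_user}@{host}"


# "flightdelay-cluster" is a made-up cluster name for illustration.
print(ssh_command("flightdelay-cluster"))
```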

Step 15: Storage account. Here you can monitor and check the storage account that is integrated with HDInsight, and edit it if required.

23.jpg

Step 16: Access control (IAM). You can control access to the HDInsight cluster through the add and remove options in the IAM menu. You can also configure your own roles or use the default roles for your users.

24.jpg

Step 18: Activity log. The activity log shows cluster activity, such as access times and cluster operations.

25.jpg

Step 19: Cluster dashboard. If you have more than one cluster, you can view all of them on the cluster dashboard.

26.jpg

Step 20: Delete the cluster. Click the “Delete” option to delete the cluster once the job is done!

Note: After completing your batch processing job on the HDInsight Hadoop cluster, delete the cluster to reduce your cost, because an HDInsight cluster is charged from the moment it starts running until it is deleted.
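To see why deleting matters, here is a rough cost sketch. The hourly rates are hypothetical numbers chosen only for illustration, not real Azure prices; look up the current price of your VM sizes before relying on any figure.

```python
def cluster_cost(nodes, hours):
    """Estimate cost for a list of (node_count, hourly_rate_per_node) pairs."""
    hourly_total = sum(count * rate for count, rate in nodes)
    return hourly_total * hours


# Hypothetical rates per node-hour, NOT real Azure prices.
nodes = [(2, 0.50),   # 2 head nodes
         (2, 0.75)]   # 2 worker nodes

for hours in (3, 24, 24 * 7):
    print(f"{hours:>4} h: {cluster_cost(nodes, hours):.2f}")
```

The cost grows linearly with the time the cluster exists, whether or not a job is running, so a cluster left up for a week costs 56 times as much as one deleted after a three-hour job.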

27.jpg

Step 21: Cluster deleted successfully

28.jpg

 

Thank you 🙂

 
