The second phase of my project is to configure the HDInsight Hadoop cluster.
This week I configured an Azure HDInsight Hadoop cluster. In this post, I am going to share how to create an HDInsight Hadoop cluster for analysing big data sets.
Hadoop is used for batch query and analysis of stored data. By default, an HDInsight Hadoop cluster has two head nodes and one or more data (worker) nodes, each with a default VM size.
In my project, however, I am going to use two head nodes and two worker nodes.
Requirements and services for configuring an HDInsight Hadoop cluster:
Cluster tier: HDInsight service tiers
Microsoft Azure offers its big data cloud service in two service tiers: Standard and Premium. For testing, I am using the Standard tier and the Linux operating system.
The default node configuration and virtual machine sizes for clusters:
Head: default VM size: D3 v2; recommended VM sizes: D3 v2, D4 v2, D12 v2
I selected this VM size for the head nodes: D12 v2, 2 nodes, 8 cores
Worker: default VM size, recommended VM sizes: D3 v2, D4 v2, D12 v2
I selected this VM size for the worker nodes in my project: D4 v2, 2 nodes, 16 cores
Supported HDInsight versions: Highly available clusters with two head nodes and one worker node are deployed by default for HDInsight 2.1 and above. The latest available version is HDInsight 3.6 (Hortonworks Data Platform 2.6, Apache Hive & HCatalog 1.2.1). The default version is HDInsight 3.5 (Hortonworks Data Platform 2.5, Apache Hive & HCatalog 1.2.1). I selected the latest version for my project implementation.
Step 1: Select HDInsight from the Data + Analytics tab on the left side of the dashboard. You will find Hadoop and other big data solution services inside HDInsight.
Step 2: Click “Custom” to configure size, settings and applications. Provide the information needed for the Basics configuration settings.
HDInsight Hadoop configuration I selected for my project:
Cluster type: Hadoop, Operating system: Linux, Version: Hadoop 2.7.3 (HDI 3.6), Cluster tier: Standard. You can customise your settings as per your business requirements and budget.
Step 3: In the “Storage Account Settings” step, you need to select a storage account that you already configured in the previous implementation phase. Here, I am using my existing “flightdelaystore” account.
You can see that a default container is created inside the Azure Storage account. The cluster uses it as its default file system (in the role of Apache Hadoop HDFS) to store the files for data analysis.
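To make the idea of a container acting as the file system more concrete, here is a minimal Python sketch of how a file in the default container is usually addressed from the cluster. It assumes the storage account is a regular Azure Blob storage account, which HDInsight exposes through the wasb:// scheme. The container name and file path below are hypothetical; only “flightdelaystore” comes from my setup.

```python
def wasb_uri(container, account, path=""):
    """Build the wasb:// address of a path inside a Blob storage container."""
    clean_path = path.lstrip("/")  # avoid a double slash after the host name
    return f"wasb://{container}@{account}.blob.core.windows.net/{clean_path}"


# "flightdelaystore" is my storage account.
# The container name and the file path are hypothetical examples.
uri = wasb_uri("flightdelay-container", "flightdelaystore", "/data/flights.csv")
print(uri)
```

If secure transfer is enabled on the storage account, the scheme is wasbs:// instead, so treat the helper above as a starting point rather than the exact address for every setup.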
Step 4: Configure the “Cluster size”. There are 2 head nodes by default; you need to select the number of worker nodes. Before configuring the cluster size, keep in mind your data size and how fast you need to process your data.
In my project, I am going to use 2 worker nodes and 2 head nodes. The choice of VM also depends on the data size and on the amount of processing and analysis. You also need to consider your analysis budget when selecting VM cores.
My subscription allows me to use 60 cores, but my project’s data is not too large, around 5 GB, so I am going to use 24 of the 60 cores.
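The core arithmetic behind that choice can be sketched in a few lines of Python. This is only a rough check, not an Azure tool: the per-VM core counts are the ones implied by the totals in this post (D12 v2 and D3 v2 with 4 cores, D4 v2 with 8 cores), and the function names are my own.

```python
# Cores per VM size, matching the node counts and core totals quoted in this post.
CORES_PER_VM = {"D3 v2": 4, "D4 v2": 8, "D12 v2": 4}


def total_cores(nodes):
    """Sum the cores used by a list of (vm_size, node_count) pairs."""
    return sum(CORES_PER_VM[size] * count for size, count in nodes)


def fits_quota(nodes, quota):
    """Return True when the cluster stays within the subscription core quota."""
    return total_cores(nodes) <= quota


# My cluster: 2 head nodes (D12 v2) and 2 worker nodes (D4 v2).
my_cluster = [("D12 v2", 2), ("D4 v2", 2)]
SUBSCRIPTION_QUOTA = 60  # cores available in my subscription

print(total_cores(my_cluster))                     # 24
print(fits_quota(my_cluster, SUBSCRIPTION_QUOTA))  # True
```

The same check is handy before scaling: adding worker nodes only works while the total stays under the quota.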
Step 5: Validation check. Before the cluster is created, the system validates your configuration. If it shows any errors, you need to check your setup again; otherwise, the cluster is ready to be created and deployed.
Step 6: Summary of Configuration
Step 7: Click Create to deploy. Azure HDInsight takes 15-20 minutes to deploy a cluster regardless of the cluster size. Even if a cluster contains 100 nodes, it will take the same deployment time.
Step 8: Download template and parameters. You can download the template and parameters for later deployments, or if you want to reuse the same configuration for deploying another cluster.
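As an illustration of reusing the downloaded files, the sketch below changes a value in a parameters document before a second deployment. It assumes the usual layout of an Azure Resource Manager parameters file, where each entry under "parameters" holds a "value". The parameter names (clusterName, workerNodeCount) and the cluster names are hypothetical, so check them against your own downloaded file.

```python
import copy
import json


def override_parameters(params_doc, overrides):
    """Return a copy of a parameters document with some values replaced."""
    updated = copy.deepcopy(params_doc)  # leave the downloaded document untouched
    for name, value in overrides.items():
        if name not in updated["parameters"]:
            raise KeyError(f"unknown parameter: {name}")
        updated["parameters"][name]["value"] = value
    return updated


# Hypothetical contents of a downloaded parameters file.
downloaded = json.loads("""
{
  "contentVersion": "1.0.0.0",
  "parameters": {
    "clusterName": {"value": "flightdelay-cluster"},
    "workerNodeCount": {"value": 2}
  }
}
""")

# Same configuration, different cluster name for the second deployment.
second_cluster = override_parameters(downloaded, {"clusterName": "flightdelay-cluster-2"})
print(json.dumps(second_cluster, indent=2))
```

Rejecting unknown parameter names is a deliberate choice here: a misspelled name would otherwise be added silently and the deployment would still use the old value.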
Step 9: HDInsight cluster deployment. You will get a notification when the deployment starts and a success notification once it is done. HDInsight is then deployed successfully and ready for data processing and analysis.
Step 10: HDInsight cluster overview
Step 11: Tools for managing and monitoring HDInsight
Step 12: Cluster login information
Step 13: Scale cluster. You can scale your cluster up and down while jobs are running. It automatically redistributes data and processing jobs across the parallel distributed system without any interruption.
Step 14: Secure Shell (SSH): Login information for accessing the cluster through SSH
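For reference, the SSH login shown in the portal normally follows the pattern user@clustername-ssh.azurehdinsight.net. The small Python helper below just assembles that command as text; the cluster name and SSH user are hypothetical placeholders, so use the values from your own cluster login information.

```python
def ssh_command(cluster_name, ssh_user):
    """Build the ssh command for the SSH endpoint of an HDInsight cluster."""
    return f"ssh {ssh_user}@{cluster_name}-ssh.azurehdinsight.net"


# Hypothetical cluster name and SSH user name.
command = ssh_command("flightdelay-cluster", "sshuser")
print(command)
```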
Step 15: Storage account. Here you can monitor and check information about the storage account that is integrated with HDInsight, and you can also edit it if required.
Step 16: Access Control (IAM): You can control access to the HDInsight cluster through the add and remove options in the IAM menu. You can also configure your own roles or use the default roles for your users.
Step 18: Activity log. The activity log shows activities such as access times and cluster operations.
Step 19: Cluster dashboard. If you have more than one cluster, you can view all of them on the cluster dashboard.
Step 20: Cluster delete. Click the “Delete” option to delete the cluster after the job is done!
Note: After completing your batch processing job on the HDInsight Hadoop cluster, delete the cluster to reduce your cost, because an HDInsight cluster is charged from the moment it starts running until it is deleted.
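To see why deleting matters, here is a back-of-the-envelope Python sketch comparing a short job with a cluster left running for a week. The hourly rates are made-up numbers for illustration only, not real Azure prices; look up the current rates for your region and VM sizes before relying on any estimate.

```python
def cluster_cost(nodes, hours):
    """Estimate cost from (node_count, hourly_rate_per_node) pairs and running hours."""
    hourly_total = sum(count * rate for count, rate in nodes)
    return hourly_total * hours


# Hypothetical hourly rates: 2 head nodes at 0.50 and 2 worker nodes at 1.00 per hour.
nodes = [(2, 0.50), (2, 1.00)]

print(cluster_cost(nodes, 3))       # 3-hour batch job, then delete: 9.0
print(cluster_cost(nodes, 24 * 7))  # left running for a week: 504.0
```

The cluster is billed for every hour it exists, whether or not a job is running, which is why the second number grows so quickly.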
Step 21: Cluster deleted successfully
Thank you 🙂