Lab10: Introduction to Amazon Elastic MapReduce (EMR)

In this lab, I will demonstrate the following:

  • Creating an Amazon S3 bucket
  • Creating and launching an Amazon EMR cluster using the AWS Management Console
  • Running a sample word count application to process data, and monitoring the cluster using the Amazon EMR console status indicators
  • Viewing the Amazon EMR output data in Amazon S3

Amazon EMR is an AWS web service that helps you process vast amounts of data quickly and cost-effectively. It uses Apache Hadoop, an open-source framework, to distribute your data and its processing across a resizable cluster of Amazon EC2 instances.

Amazon EMR use cases: Amazon EMR can securely and reliably handle a broad set of big data workloads, such as:

  • Web Indexing
  • Log analysis
  • Machine learning
  • Scientific simulation
  • Data Warehousing
  • Bioinformatics
  • Data transformations (ETL)

Steps for the task: Creating an Amazon S3 bucket

  1. Click S3 under Storage in the Services menu of the AWS Management Console.
  2. Click “Create Bucket” on the Amazon S3 page. Type your bucket name and select a region, then click “Create”.

Note: Amazon EMR uses the Amazon S3 bucket name in its output and log locations. Your bucket name must be unique and lowercase only, without any spaces, underscores, or periods.
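To catch a bad name before the console rejects it, the naming rule in the note can be sketched as a small check. This is only a rough sketch: the function name is my own, I am assuming digits and hyphens are fine (as in the example name emr-bucket-slt2317), and I added the general S3 limits of 3 to 63 characters with a letter or digit at each end. Uniqueness cannot be checked locally.

```python
import re

# Assumed rule set: lowercase letters, digits and hyphens only,
# 3 to 63 characters, starting and ending with a letter or digit.
# No spaces, underscores, periods or uppercase letters.
_EMR_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_emr_friendly_bucket_name(name: str) -> bool:
    """Return True if `name` follows the naming rule described in the note.

    Hypothetical helper for illustration; it does not check whether the
    name is already taken by another user.
    """
    return _EMR_BUCKET_NAME.fullmatch(name) is not None
```

For example, is_emr_friendly_bucket_name("emr-bucket-slt2317") passes, while "EMR_Bucket" and "my.bucket" do not.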

  3. If your bucket name is not available because another user has already taken it, you will get an error.

Cost-reducing recommendation: Create the Amazon S3 bucket and the cluster in the same region to avoid paying cross-region bandwidth charges. This reduces your data transfer cost.

Steps for the task: Creating and launching an Amazon EMR cluster using the AWS Management Console

In this step, I will show how to create and launch a cluster on the Amazon platform. Amazon EMR provisions EC2 instances to perform the computation. These instances are preloaded with an Amazon Machine Image (AMI) that has been specially customised for Amazon EMR with Hadoop and other big data applications.

  1. Click “EMR” under Analytics in the Services menu of the AWS Management Console.
  2. Click “Create Cluster” on the Amazon Elastic MapReduce page.
  3. Go to the advanced options wizard from the “Create Cluster – Quick Options” page.
  4. Under Software Configuration, select emr-4.7.1 for Release. This is the software release version.
  5. On the Advanced Options page, under Add Steps, select “Streaming program” as the step type. Then click “Configure”.
  6. Clicking “Configure” opens the Add Step wizard. Type a name for the step, then fill in the locations listed below. I named mine “word count”, but you can use any meaningful name.

Mapper field: s3://elasticmapreduce/samples/wordcount/wordSplitter.py

Reducer field: aggregate

Input S3 Location field: s3://elasticmapreduce/samples/wordcount/input

Output S3 Location field: the output folder in your own bucket, for example s3://emr-bucket-slt2317/output/

  7. After filling in all the required fields, click “Add”.

  8. Tick the “Auto-terminate cluster after the last step is completed” box and click “Next”.

  9. Leave the “Hardware Setting” defaults and click “Next”.

  10. Keep logging enabled and, in the S3 folder field, replace the auto-generated name with the logs folder in your own bucket, for example s3://emr-bucket-slt2317/logs/

  11. Click “Next”. On the Security Options page, select an EC2 key pair. Note: You can proceed without a key pair, but in that case you cannot connect to the master node via SSH. In the permissions section, you can keep the defaults or customise your own permissions. Then click “Create Cluster”.

  12. My cluster is now ready: I have launched a managed cluster with EMR.

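For readers who prefer scripting to clicking, the console choices above can be sketched as a plain Python dictionary in the shape that boto3's EMR run_job_flow call accepts. Treat this as a rough sketch, not the console's exact behaviour: the helper names, the instance type and count, and the default role names are my assumptions (the lab simply leaves the hardware and permission defaults), and the streaming arguments follow the command-runner.jar form used for 4.x releases as far as I know. Nothing here contacts AWS.

```python
def streaming_step(name, mapper_s3, reducer, input_s3, output_s3):
    """Build one Hadoop streaming step (hypothetical helper)."""
    mapper_file = mapper_s3.rsplit("/", 1)[-1]  # e.g. wordSplitter.py
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                "-files", mapper_s3,
                "-mapper", mapper_file,
                "-reducer", reducer,
                "-input", input_s3,
                "-output", output_s3,
            ],
        },
    }


def cluster_request(bucket):
    """Collect the lab's console settings into run_job_flow-style arguments."""
    return {
        "Name": "My cluster",
        "ReleaseLabel": "emr-4.7.1",            # step 4
        "LogUri": f"s3://{bucket}/logs/",       # step 10
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",  # assumption, not from the lab
            "SlaveInstanceType": "m3.xlarge",   # assumption, not from the lab
            "InstanceCount": 3,                 # assumption, not from the lab
            # step 8: auto-terminate after the last step completes
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [
            streaming_step(
                "word count",
                "s3://elasticmapreduce/samples/wordcount/wordSplitter.py",
                "aggregate",
                "s3://elasticmapreduce/samples/wordcount/input",
                f"s3://{bucket}/output/",
            )
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }
```

With credentials configured, something like boto3.client("emr").run_job_flow(**cluster_request("emr-bucket-slt2317")) should submit the same job, but I have only walked through the console route in this lab.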

Steps for the task: Running a sample word count application to process data and monitoring your cluster using the Amazon EMR console status indicators

In these steps, you will monitor the cluster while it counts the words in the input text. The input data is the pre-configured sample that we mapped earlier during cluster configuration.

After the cluster finishes processing the data, Amazon EMR copies the word count results into the Amazon S3 output location that I linked during configuration.
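To make the word count itself less of a black box, here is a small local sketch of what the streaming job does. I am assuming the sample mapper emits one "LongValueSum:word" record with the value 1 per word, which is the key convention the built-in Hadoop aggregate reducer understands. The function names and the word pattern are my own, and the aggregate function is only a local stand-in for the real reducer.

```python
import re
from collections import defaultdict

# Assumed definition of a "word": a letter followed by letters or digits.
WORD = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def map_line(line):
    """Mapper sketch: one 'LongValueSum:<word><TAB>1' record per word."""
    return ["LongValueSum:" + word.lower() + "\t1" for word in WORD.findall(line)]


def aggregate(records):
    """Local stand-in for the 'aggregate' reducer: sum the values per key."""
    totals = defaultdict(int)
    for record in records:
        key, value = record.split("\t")
        word = key.split(":", 1)[1]  # drop the LongValueSum prefix
        totals[word] += int(value)
    return dict(totals)


def word_count(lines):
    """Run the mapper over every line, then aggregate the records."""
    records = [record for line in lines for record in map_line(line)]
    return aggregate(records)
```

On the cluster, Hadoop runs many mapper copies in parallel and groups the records by key before reducing; the sketch does the same work in a single process.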

  1. To view the cluster list and results, click “Cluster List”. Select My cluster, which appears to the right of the radio button, and click the small drop-down arrow to the left of My cluster.
  2. To monitor your EMR job, click the circular arrow icon periodically to refresh its status. When the EMR job is done, the cluster status shows “Terminated”.
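The refresh-and-check routine above could also be scripted. Below is a minimal polling sketch, assuming you pass in a boto3 EMR client (for example boto3.client("emr")) whose describe_cluster call reports the state under Cluster, Status, State. The function name, the 30-second default, and the injectable sleep parameter are my own choices, and the cluster ID you pass in is whatever the console shows for your cluster.

```python
import time

# States after which the cluster will not change any more.
TERMINAL_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(emr_client, cluster_id, poll_seconds=30, sleep=time.sleep):
    """Poll the cluster state until it reaches a terminal state.

    `emr_client` is assumed to behave like boto3.client("emr").
    `sleep` can be swapped out so the loop is testable without waiting.
    """
    while True:
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        print(f"cluster {cluster_id}: {state}")
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
```

A return value of TERMINATED_WITH_ERRORS would be the scripted equivalent of seeing a failed step in the console.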

Steps for the task: Viewing the Amazon EMR output data in Amazon S3

Once processing completes, the word count results are stored in the Amazon S3 output location you mapped.

  1. To view the results, go to Amazon S3 again from the Services menu of the AWS Management Console.
  2. Select the bucket you created in S3 for EMR.
  3. Click on the output folder. In it you will see one or more files containing the results as text. A file titled “_SUCCESS” indicates that the job completed. To download a file, click the download option; to view it, open the text file.
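Once a result file is downloaded, a few lines of Python can summarise it. This sketch assumes each line of the file holds a word and its count separated by a tab, which is how I understand the streaming output to be laid out; the function names are my own, and the words in the usage note are made up for illustration.

```python
def parse_counts(lines):
    """Turn 'word<TAB>count' lines into a dict, skipping blank lines."""
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        word, _, value = line.partition("\t")
        # The same word may appear in more than one part file.
        counts[word] = counts.get(word, 0) + int(value)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

For instance, after reading a downloaded file with open(path) as handle, top_words(parse_counts(handle)) lists the three most frequent words in it.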

Troubleshooting:

If the status of the streaming program step shows as failed, the word count processing did not succeed and you will not see any results; you will have to process your input data again. You can find the reason for the failure in the logs folder of your Amazon S3 bucket.


Cost analysis: 

Pricing method: Amazon EMR charges an hourly rate for each instance you use. For example, a 10-node cluster running for 10 hours costs the same as a 100-node cluster running for 1 hour.

The hourly rate depends on the instance type and its configuration, such as Standard, High CPU, High Memory, or High Storage. Hourly prices range from $0.011/hour to $0.27/hour ($94/year to $2367/year).

The Amazon EMR price does not include the Amazon EC2 price. That means you have to pay the EC2 price as well.
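The pricing arithmetic above can be written out in a few lines. This is a sketch of the stated rule only (instance-hours multiplied by the EMR rate plus the EC2 rate); the function names are my own and any rates you plug in are hypothetical placeholders, so check the pricing page for real numbers.

```python
def instance_hours(nodes, hours):
    """Number of instance-hours billed for a cluster."""
    return nodes * hours


def cluster_cost(nodes, hours, ec2_rate, emr_rate):
    """Total cost in dollars: the EMR rate is paid on top of the EC2 rate."""
    return instance_hours(nodes, hours) * (ec2_rate + emr_rate)
```

With the same rates, cluster_cost(10, 10, ...) and cluster_cost(100, 1, ...) come out equal because both clusters use 100 instance-hours.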

Pricing Options:

  • On-Demand Instances
  • 1-year and 3-year Reserved Instances
  • Spot Instances

On-Demand Instances are the most expensive but offer the most flexibility. EC2 also offers Reserved Instances and Spot Instances, which help you optimise your cost.

Cost-optimisation suggestion: Running Amazon EMR on Spot Instances is the most cost-effective way to scale and lowers the cost of your data processing. According to the VP of Engineering at Fliptop, it can reduce costs by over 50%. Source: https://aws.amazon.com/emr/pricing/

Thank you 🙂
