AWS Elastic MapReduce (EMR) Node Calculator: a Serverless Way
Context
To get good parallelism, an EMR cluster needs the right number of nodes. Working that number out normally involves tedious lookups and cross-referencing of instance specifications. This tool simplifies that process: it returns the estimated number of nodes your application needs to run seamlessly.
Cluster Node Calculation Formulae
- Read the default mapred-site.xml
- Get the mapreduce.map.memory.mb and yarn.scheduler.maximum-allocation-mb values
- Instance Mapper Capacity (mappers per node) = yarn.scheduler.maximum-allocation-mb / mapreduce.map.memory.mb

Total Mappers Required = Total Size of Input / Input Split Size
Numerator = Total Mappers * Time to Process Sample Files
Denominator = Instance Mapper Capacity * Desired Processing Time
Estimated number of nodes = Numerator / Denominator
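
The formulae above can be sketched in a few lines of Python. This is an illustration only, not the tool's actual code: the function names, the sample configuration values and the workload figures are all hypothetical, and rounding the mapper count and the node count up to whole numbers is an assumption the formulae do not state.

```python
import math
import xml.etree.ElementTree as ET

# Hypothetical Hadoop-style configuration; the values are made up.
SAMPLE_CONFIG = """
<configuration>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1536</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>12288</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def instance_mapper_capacity(props):
    """Mappers one node can run = maximum allocation memory / map memory."""
    max_alloc = int(props["yarn.scheduler.maximum-allocation-mb"])
    map_mem = int(props["mapreduce.map.memory.mb"])
    return max_alloc // map_mem  # assumption: partial mappers do not count


def estimate_nodes(total_input_mb, split_size_mb, sample_time_min,
                   desired_time_min, mapper_capacity):
    """Apply the numerator/denominator formula, rounded up to whole nodes."""
    total_mappers = math.ceil(total_input_mb / split_size_mb)
    numerator = total_mappers * sample_time_min
    denominator = mapper_capacity * desired_time_min
    return math.ceil(numerator / denominator)


props = read_properties(SAMPLE_CONFIG)
capacity = instance_mapper_capacity(props)
nodes = estimate_nodes(
    total_input_mb=102400,   # 100 GB of input (hypothetical)
    split_size_mb=128,       # hypothetical split size
    sample_time_min=10,      # time measured on the test run (hypothetical)
    desired_time_min=60,     # target processing time (hypothetical)
    mapper_capacity=capacity,
)
print(capacity, nodes)  # prints "8 17"
```

With these sample numbers, 800 mappers at 10 minutes each give a numerator of 8000, and 8 mappers per node over 60 minutes give a denominator of 480, so the estimate rounds up to 17 nodes.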
Prerequisite
- Get a test workload
- The number of sample files should match the number of mappers
- Run an EMR cluster with a single core node and process the sample files
- The time this run takes is the processing time used in the formula
 
Services and components
- DynamoDB: the NoSQL database offering of AWS
- Lambda: a compute service that runs code without deploying servers
- API Gateway: an AWS service that exposes the Lambda function as an HTTP API
- Front-end components: HTML, CSS, JS, jQuery and AJAX
 
Process Flow
- Get the details of all AWS compute instance types and store them in a database (DynamoDB)
- Create a Lambda function that reads this database and returns its contents
- Create an API endpoint to invoke this Lambda function
- Embed this API endpoint in the front-end code
- Parse the response and render the contents of the webpage dynamically
- (Optional) A Lambda function can be created to listen for AWS SNS notifications of service changes and update the DynamoDB contents on the fly
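
The Lambda step of this flow might look roughly like the Python sketch below. It is not the repository's function: the table name and helper names are hypothetical, and the CORS header is an assumption made so that a browser page can call the endpoint. The response shape (statusCode, headers, body as a JSON string) is what Lambda proxy integration expects.

```python
import json
from decimal import Decimal

TABLE_NAME = "emr-instance-types"  # hypothetical table name


def to_jsonable(value):
    """DynamoDB numbers arrive as Decimal, which json cannot encode directly."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def build_response(items):
    """Shape the payload the way Lambda proxy integration expects."""
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # Assumption: allow a page on another origin to read the response.
            "Access-Control-Allow-Origin": "*",
        },
        "body": json.dumps(items, default=to_jsonable),
    }


def scan_all(table):
    """Read every item, following DynamoDB's pagination key."""
    response = table.scan()
    items = list(response.get("Items", []))
    while "LastEvaluatedKey" in response:
        response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
        items.extend(response.get("Items", []))
    return items


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above run without AWS access
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    return build_response(scan_all(table))
```

Keeping the response building separate from the DynamoDB call makes the formatting logic easy to check locally, without AWS credentials.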
 
Set-up
- DynamoDB => Contains the data of instances
  - Load the instance data into the DynamoDB table with the loading script provided in the repository
- Lambda => To retrieve DB contents
  - Create a Lambda function in the AWS console
- API Gateway
  - Go to the API Gateway and provide a name, a description and an endpoint type
  - Create a GET method
  - Choose Lambda Function as the Integration type
  - Turn on Use Lambda Proxy Integration
  - Provide the region and the name of the Lambda function created in the previous step
  - Click OK when the popup asks you to provide access to the Lambda function
  - Click on Actions and Deploy API
  - Provide a stage name and description
  - Deploy the API
  - Note the Invoke URL; it will be used in the next step
- Front-end update
  - Embed this endpoint in the JS file
  - Run the HTML file. Provide the inputs and find the number of nodes with ease!
 
 
PS
- The number of mappers depends on the number of Hadoop splits
- If your files are smaller than the HDFS or Amazon S3 split size, the number of mappers is equal to the number of files
- If some or all of your files are larger than the HDFS or Amazon S3 split size (fs.s3.block.size), the number of mappers is equal to the sum, over all files, of each file's size divided by the HDFS/Amazon S3 block size
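
As a rough illustration of these rules, the sketch below counts mappers from a list of file sizes. The function name and the sample sizes are hypothetical, and rounding each large file up to a whole number of splits is an assumption.

```python
import math


def count_mappers(file_sizes_mb, split_size_mb):
    """One mapper per small file, ceil(size / split size) per large file."""
    total = 0
    for size in file_sizes_mb:
        if size <= split_size_mb:
            total += 1  # a file smaller than the split gets its own mapper
        else:
            total += math.ceil(size / split_size_mb)
    return total


print(count_mappers([10, 20, 30], 128))  # three small files: prints 3
print(count_mappers([300, 64], 128))     # 3 splits + 1 small file: prints 4
```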
 

