AWS Elastic MapReduce (EMR) Node Calculator – a Serverless Way
Context
To get good parallelism, an EMR cluster needs the right number of nodes. Working that number out by hand involves tedious lookups and cross-referencing. This tool simplifies that process: it returns the estimated number of nodes your application needs to run seamlessly.
Cluster Node Calculation Formulae
- Read the default mapred-site.xml
- Get the mapreduce.map.memory.mb and yarn.scheduler.maximum-allocation-mb values
- Number of mappers per instance (Instance Mapper Capacity) = yarn.scheduler.maximum-allocation-mb / mapreduce.map.memory.mb
Total Mappers Required = Total Size of Input / Input Split Size
Numerator = Total Mappers * Time to process Sample files
Denominator = Instance Mapper Capacity * Desired Processing Time
Estimated number of nodes = Numerator / Denominator
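The formulae above can be sketched in a few lines of Python. This is only an illustration, not the calculator's actual code: the function names are hypothetical, and rounding the mapper capacity down and the node count up are assumptions made here so that the results are whole numbers.

```python
import math


def instance_mapper_capacity(max_allocation_mb: int, map_memory_mb: int) -> int:
    """Mappers that fit on one instance.

    yarn.scheduler.maximum-allocation-mb / mapreduce.map.memory.mb,
    rounded down (assumption: a partial mapper does not count).
    """
    return max_allocation_mb // map_memory_mb


def total_mappers(total_input_bytes: int, split_bytes: int) -> int:
    """Total Size of Input / Input Split Size, rounded up (assumption)."""
    return -(-total_input_bytes // split_bytes)  # integer ceiling division


def estimate_nodes(mappers: int, sample_time: float,
                   mapper_capacity: int, desired_time: float) -> int:
    """Numerator / Denominator from the formulae, rounded up to whole nodes.

    sample_time and desired_time must use the same unit (e.g. minutes).
    """
    numerator = mappers * sample_time
    denominator = mapper_capacity * desired_time
    return math.ceil(numerator / denominator)


if __name__ == "__main__":
    # Made-up inputs, purely for illustration.
    capacity = instance_mapper_capacity(12288, 1536)
    mappers = total_mappers(100 * 1024**3, 128 * 1024**2)
    print(estimate_nodes(mappers, sample_time=5, mapper_capacity=capacity,
                         desired_time=60))
```

With these made-up inputs the capacity is 8 mappers per instance, 800 mappers are required in total, and the estimate is 800 * 5 / (8 * 60) ≈ 8.33, which rounds up to 9 nodes.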
Pre-Requisite
- Get a test workload
- The number of sample files should match the number of mappers
- Run an EMR cluster with a single core node and process the sample files
- The time taken is the "Time to process Sample files" used in the formula
Services and components
- DynamoDB: the NoSQL database offering of AWS
- Lambda: a compute service that runs code without provisioning servers
- API Gateway: an AWS service for creating the API endpoint that invokes the Lambda function
- Front-end components: HTML, CSS, JS, jQuery and AJAX
Process Flow
- Get the details of all AWS compute instance types and store them in a DB
- Create a Lambda function that reads this DB and returns its contents
- Create an API endpoint to invoke this Lambda function
- Embed this API endpoint in the front-end code
- Parse the response and render the contents of the webpage dynamically
- (Optional) Create a Lambda function that listens for AWS SNS notifications of service changes and updates the DynamoDB contents on the fly
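The Lambda step of this flow might look roughly like the sketch below. It is not the repository's actual function: the table name `emr-instance-types` is a hypothetical placeholder, and the response shape assumes the Lambda proxy integration that the Set-up section turns on, where the function itself returns `statusCode`, `headers` and a string `body`. The `table` parameter exists only so the handler can be exercised locally without AWS credentials.

```python
import json
from decimal import Decimal

TABLE_NAME = "emr-instance-types"  # hypothetical table name, not from the repo


def _to_jsonable(value):
    """json.dumps hook: boto3 returns DynamoDB numbers as Decimal."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def scan_all(table):
    """Read every item, following DynamoDB's LastEvaluatedKey pagination."""
    items, kwargs = [], {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            return items
        kwargs["ExclusiveStartKey"] = last_key


def lambda_handler(event, context, table=None):
    """Return the instance table as a Lambda proxy integration response."""
    if table is None:
        import boto3  # bundled with the Lambda Python runtime
        table = boto3.resource("dynamodb").Table(TABLE_NAME)
    return {
        "statusCode": 200,
        "headers": {
            "Content-Type": "application/json",
            # Assumption: the front end is served from another origin,
            # so the browser needs a CORS header on the response.
            "Access-Control-Allow-Origin": "*",
        },
        "body": json.dumps(scan_all(table), default=_to_jsonable),
    }
```

A full table scan is acceptable here only because the instance catalogue is small; a larger table would call for a query on a key instead.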
Set-up
- DynamoDB => Contains the instance data
  - Load the following contents into DynamoDB using the following script
- Lambda => Retrieves the DB contents
  - Create a Lambda function in the AWS console
- API Gateway
  - Go to the API Gateway and provide a name, description and endpoint type
  - Create a GET method
  - Choose Lambda Function as the Integration type
  - Turn on Use Lambda Proxy Integration
  - Provide the region and the name of the Lambda function created in the previous step
  - Click OK when the popup asks you to grant access to the Lambda function
  - Click on Actions and Deploy API
  - Provide a stage name and description
  - Deploy the API
  - Note the Invoke URL; it will be used in the next step
- Front-end update
  - Embed this endpoint in the code in the js file
  - Run the html file. Provide the inputs and find the number of nodes with ease!
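Before embedding the Invoke URL in the front end, it can help to confirm that the deployed endpoint returns parseable JSON. The following is a minimal Python sketch of such a check, not part of the project: `fetch_instances` is a hypothetical helper, and it assumes the GET method responds with a JSON document in the body. The `opener` parameter is there only so the function can be tried without a network call.

```python
import json
import urllib.request


def fetch_instances(invoke_url, opener=urllib.request.urlopen):
    """GET the API Gateway Invoke URL and decode the JSON body."""
    with opener(invoke_url) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    import sys

    # Usage: python check_endpoint.py <Invoke URL noted during deployment>
    if len(sys.argv) == 2:
        print(fetch_instances(sys.argv[1]))
```

If this prints the instance data, the same URL should work from the AJAX call in the js file, provided the response carries the CORS headers a browser requires.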
PS
- The number of mappers depends on the number of Hadoop splits
- If your files are smaller than the HDFS or Amazon S3 split size, the number of mappers is equal to the number of files
- If some or all of your files are larger than the HDFS or Amazon S3 split size (fs.s3.block.size), the number of mappers is equal to the sum, over all files, of each file's size divided by the HDFS/Amazon S3 block size.
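The two rules above can be combined into one small calculation. The sketch below is illustrative only: `mappers_for_files` is a hypothetical name, and rounding each file up to a whole number of splits (with at least one mapper per file) is an assumption; Hadoop's actual split computation may differ slightly at the boundaries.

```python
def mappers_for_files(file_sizes_bytes, split_size_bytes):
    """Estimate the mapper count from file sizes and the split size.

    A file no larger than the split size gets one mapper; a larger file
    gets one mapper per split, rounded up (assumption).
    """
    if split_size_bytes <= 0:
        raise ValueError("split_size_bytes must be positive")
    return sum(
        max(1, -(-size // split_size_bytes))  # integer ceiling, minimum 1
        for size in file_sizes_bytes
    )


if __name__ == "__main__":
    MIB = 1024**2
    # Two small files and one 200 MiB file with a 64 MiB split size.
    print(mappers_for_files([10 * MIB, 20 * MIB, 200 * MIB], 64 * MIB))
```

In this example the two small files take one mapper each and the 200 MiB file takes four, for a total of 6 mappers.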