Mastering AWS Elastic Map Reduce (EMR) for Data Engineers

Name: Mastering AWS Elastic Map Reduce (EMR) for Data Engineers
Rating: 4.3 (366 reviews)

Build PySpark and Spark SQL Applications on AWS EMR, Orchestrate using Step Functions, Manage EMR using Boto3 and more

Created by Durga Viswanatha Raju Gadiraju

Last updated 9/2022

English

What you'll learn

Creating Clusters using AWS Elastic Map Reduce Web Console
Setup Remote Application Development using AWS Elastic Map Reduce (EMR) and Visual Studio Code
Develop and Validate Simple Spark Application using Visual Studio Code and AWS Elastic Map Reduce (EMR)
Deploy Spark Application as Step to AWS Elastic Map Reduce (EMR)
Manage AWS Elastic Map Reduce (EMR) based Pipelines using Boto3 and Python
Build End to End AWS Elastic Map Reduce (EMR) based Pipelines using AWS Step Functions
Develop Applications using Spark SQL on AWS EMR Cluster
Build State Machine or Pipeline using AWS Step Functions using Spark SQL Script on AWS EMR Cluster
Understand how to pass parameters to Spark SQL Scripts deployed on EMR

Course content

12 sections • 150 lectures • 11h 18m total length

Introduction to Mastering AWS Elastic Map Reduce for Data Engineers • 0:19

Planning of EMR Cluster • 1:20
Create EC2 Key Pair • 4:30
Setup EMR Cluster with Spark • 5:59
Understanding Summary of AWS EMR Cluster • 3:28
Review EMR Cluster Application User Interfaces • 2:23
Review EMR Cluster Monitoring • 1:46
Review EMR Cluster Hardware and Cluster Scaling Policy • 1:16
Review EMR Cluster Configurations • 2:11
Review EMR cluster configurations by inspecting overridden properties and JSON, noting the Glue Data Catalog used as the Spark metastore, and troubleshoot via runtime overrides in the cluster dashboard.
Review EMR Cluster Events • 2:21
Review EMR Cluster Steps • 1:48
Review EMR Cluster Bootstrap Actions • 2:03
Connecting to EMR Master Node using SSH • 2:20
Disabling Termination Protection and Terminating the Cluster • 1:41
Clone and Create New Cluster • 3:37
Listing AWS S3 Buckets and Objects using AWS CLI on EMR Cluster • 3:20
Listing AWS S3 Buckets and Objects using HDFS CLI on EMR Cluster • 3:32
Managing Files in AWS S3 using HDFS CLI on EMR Cluster • 4:51
Review Glue Catalog Databases and Tables • 1:45
Accessing Glue Catalog Databases and Tables using EMR Cluster • 5:45
Accessing spark-sql CLI of AWS EMR Cluster • 4:11
Accessing pyspark CLI of AWS EMR Cluster • 6:05
Accessing spark-shell CLI of AWS EMR Cluster • 6:56
Create AWS EMR Cluster for Notebooks • 2:50

Create bootstrap script for AWS EMR Cluster • 4:37
Provision Elastic IP for Master Node of AWS EMR Cluster • 3:27
Provision an Elastic IP in AWS EC2 and map it to the EMR cluster master node for a stable public address, including allocation steps in the AWS console.
Create AWS EMR for Development • 3:52
Provision an EMR cluster with a bootstrap script and Elastic IP, using EMR 6.6.0 with Hive, Jupyter Enterprise Gateway, and Spark 3.2.0, integrated with the Glue Data Catalog.
Troubleshooting Issues related to Bootstrap of EMR Cluster • 1:59
Fix Bootstrap Script for AWS EMR Cluster • 3:51
Troubleshoot the AWS EMR bootstrap by cloning the cluster, removing bootstrap actions, and recreating it with a corrected bootstrap.sh that uses /usr/bin/pip3 and boto3, then verify master access.
Validate AWS EMR Cluster with Bootstrap Action with updated script • 5:43
Setup Python Virtual Environment as part of VS Code Workspace • 2:47
Getting Started with Boto3 to Manage AWS EMR Clusters • 2:28
Setup boto3 to explore APIs to manage AWS EMR Clusters • 2:47
Set AWS Profile using env file in Visual Studio Code • 4:25
Get Cluster Details of AWS EMR Development Cluster using boto3 • 5:24
Getting Instance Id of the Master Node of AWS EMR Cluster using boto3 • 2:24
Getting Allocation Id of the Elastic IP using AWS boto3 • 4:29
Use boto3 to locate the EMR master instance ID and its Elastic IP allocation ID, then associate the Elastic IP with the master node using AWS EC2 APIs.
Associating Elastic IP with AWS EMR Master Node using Boto3 • 3:39
Setup Notebook Environment for EMR Cluster using IAM User • 6:46
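
The Elastic IP lectures above describe a three-call flow: find the master node's instance ID, find the allocation ID of the Elastic IP, then associate the two. The following is a minimal sketch of that flow, not the course's own code. The function names are hypothetical, the clients are passed in as arguments (for example `boto3.client("emr")` and `boto3.client("ec2")`), and it assumes a cluster that uses instance groups rather than instance fleets.

```python
def get_master_instance_id(emr_client, cluster_id):
    """Return the EC2 instance ID of the cluster's master node.

    Assumes the cluster uses instance groups; clusters built with
    instance fleets would filter on InstanceFleetType instead.
    """
    response = emr_client.list_instances(
        ClusterId=cluster_id,
        InstanceGroupTypes=["MASTER"],
    )
    return response["Instances"][0]["Ec2InstanceId"]


def get_allocation_id(ec2_client, public_ip):
    """Return the allocation ID of an already-allocated Elastic IP."""
    response = ec2_client.describe_addresses(PublicIps=[public_ip])
    return response["Addresses"][0]["AllocationId"]


def associate_elastic_ip(emr_client, ec2_client, cluster_id, public_ip):
    """Attach the Elastic IP to the master node and return both IDs."""
    instance_id = get_master_instance_id(emr_client, cluster_id)
    allocation_id = get_allocation_id(ec2_client, public_ip)
    ec2_client.associate_address(
        InstanceId=instance_id,
        AllocationId=allocation_id,
    )
    return instance_id, allocation_id
```

Passing the clients in keeps the helpers independent of how credentials and profiles are configured, which also makes them easy to exercise with stand-in clients.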

Open Remote Window on AWS EMR Master Node using VS Code • 4:32
Setup Workspace on AWS EMR Master using Git Repository • 3:53
Best Practices and Advantages of using AWS EMR Cluster for Team Development • 4:13
Install VSCode Extensions in remote Workspace for Python • 4:58
Install the Pylance and Python extensions in the remote Visual Studio Code window to enable Python syntax highlighting and autocomplete for EMR cluster development; validate boto3 and s3_client usage.
Review Python and Pyspark details on EMR Cluster • 3:26
Running Applications using local and yarn during development • 2:39
Getting Started with Development of Spark Applications on EMR Cluster • 7:05
Create Function for Spark Session • 6:09
Create a modular Spark session function in Python, supporting dev and prod yarn modes, using util.py and a main guard, then run with spark-submit to execute Spark SQL queries.
Upload Files to AWS S3 for the development using AWS EMR Cluster • 4:51
Develop read logic for the Spark Application • 10:05
Process Data Frame using Spark APIs • 6:56
Write Data to Files using Spark APIs • 8:20
Partition data by year, month, and dayofmonth, coalesce to 16 files, and write the transformed DataFrame to S3 in Parquet format using Spark's DataFrame writer with append mode.
Productionize the Code and setup required data sets for validation • 5:59
Resize the AWS EMR Cluster using Web Console • 7:19
Validate Changes to productionize the Application Code • 8:49
Take the backup and terminate the cluster • 6:59
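
Two of the lectures above lend themselves to a short illustration: the environment-aware Spark session function and the partitioned Parquet write. The sketch below is an assumption about how such helpers might look, not the course's util.py. The function names, the DEV/PROD environment values, and the rule that DEV runs with a local master while everything else runs on YARN are all hypothetical, and the DataFrame is assumed to already contain year, month, and dayofmonth columns.

```python
def resolve_master(env):
    """Map an environment name to a Spark master.

    Assumption: DEV runs locally on the master node, anything else on YARN.
    """
    return "local" if env.upper() == "DEV" else "yarn"


def get_spark_session(env, app_name):
    """Create (or reuse) a SparkSession for the given environment."""
    # Imported here so resolve_master stays usable without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(resolve_master(env))
        .appName(app_name)
        .getOrCreate()
    )


def write_partitioned(df, target_dir):
    """Append the DataFrame to target_dir as Parquet, partitioned by date.

    coalesce(16) reduces the DataFrame to at most 16 partitions before
    the write; the year, month, and dayofmonth columns must already exist.
    """
    (
        df.coalesce(16)
        .write
        .partitionBy("year", "month", "dayofmonth")
        .mode("append")
        .parquet(target_dir)
    )
```

Keeping the master selection in its own small function means the same application code can be submitted unchanged in both modes, with only the environment value differing.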

Recreate the AWS EMR Cluster to deploy Spark Applications • 3:21
Clone and resize an EMR cluster to deploy Spark applications on non-development clusters, configure uniform master/core/task nodes, enable autoscaling, and prepare a zip file on the master.
Setup Code Repository on the AWS EMR Master Node • 8:15
Resize the AWS EMR Cluster to validate application on larger data sets • 4:38
Build Zip File for the Spark Application • 3:54
Validate the Spark Application using zip file and client as deploy mode • 6:02
Run Spark Application on EMR using Cluster Deployment Mode • 3:42
Run Spark Application copied to S3 on EMR using Cluster Deployment Mode • 3:49
Deploy Spark Application as Step to the AWS EMR Cluster • 5:29
Learn how to deploy a Spark application as a step on an existing EMR cluster using cluster mode and spark-submit options, with artifacts stored in S3.
Setup Multiple Files to Manage AWS S3 Objects using State Machines • 2:35
Learn to configure a state machine to delete multiple S3 objects at once by creating test files in CloudShell, copying them to S3, and validating with object listing.
Validate Spark Application Deployed as Step on AWS EMR Cluster • 2:14
Validate a Spark application deployed as a step on an AWS EMR cluster by reviewing stderr logs and using aws s3 ls to confirm the target files.
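
Deploying a Spark application as an EMR step comes down to wrapping a spark-submit command in a step definition that runs through command-runner.jar. The sketch below shows one plausible shape for that definition. The function name, the step name, and the S3 locations are hypothetical placeholders; only the step structure (Name, ActionOnFailure, HadoopJarStep) follows the EMR API.

```python
import json


def build_spark_step(name, app_uri, py_files_uri, app_args=()):
    """Build an EMR step that runs spark-submit in cluster deploy mode.

    app_uri is the driver script and py_files_uri the zipped application
    code; both are expected to be S3 locations readable by the cluster.
    """
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--py-files", py_files_uri,
                app_uri,
                *app_args,
            ],
        },
    }


# Hypothetical bucket and file names, for illustration only.
step = build_spark_step(
    "Sample Spark Step",
    "s3://my-bucket/app/app.py",
    "s3://my-bucket/app/app.zip",
)
print(json.dumps(step, indent=2))
```

The same dictionary can be pasted into the console's step form piece by piece, or passed programmatically when steps are added through an SDK.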

Update Material related to Managing AWS EMR using Boto3 • 3:42
Learn how to manage AWS EMR with boto3 by setting up a Mac or WSL development environment in VS Code, and configuring AWS profiles for notebooks and S3 access.
Create AWS EMR Cluster using AWS CLI Command • 7:01
Provision an AWS EMR cluster with Hadoop and Spark using the AWS web console and AWS CLI, configured for EMR 6.6.0 with one master and two core instances.
Manage AWS EMR Clusters using AWS CLI Commands • 6:42
Overview of AWS boto3 to Manage AWS EMR Clusters • 8:09
Overview of Run Job Flow API to create AWS EMR Cluster • 6:17
Create AWS EMR Cluster or Job Flow Cluster using AWS Boto3 • 11:12
Prepare Data Sets to add Spark Application as Step to AWS EMR Cluster • 2:42
Add Spark Application as Step to AWS EMR Cluster using Boto3 • 7:41
Exercise to add Spark Application as Step to EMR Cluster using boto3 • 2:28
Terminate the AWS EMR Cluster used for adding Steps • 1:58
Exercise to Create AWS EMR Cluster with Steps for Spark Application • 1:47
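
The lectures in this section revolve around three EMR API calls: run_job_flow to create a cluster, add_job_flow_steps to submit work, and terminate_job_flows to shut the cluster down. The sketch below illustrates them for a cluster similar to the one described above (EMR 6.6.0, one master and two core instances). The helper names are hypothetical, and the instance type and the default role names are assumptions that may not match a given account.

```python
def build_cluster_request(name, log_uri, key_name):
    """Build run_job_flow arguments for a small Hadoop and Spark cluster."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-6.6.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Master",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 2,
                },
            ],
            "Ec2KeyName": key_name,
            # Keep the cluster running so steps can be added later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed default role names; use the roles defined in your account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def create_cluster(emr_client, request):
    """Create the cluster and return its ID (the job flow ID)."""
    return emr_client.run_job_flow(**request)["JobFlowId"]


def add_step(emr_client, cluster_id, step):
    """Add a single step to a running cluster and return the step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


def terminate_cluster(emr_client, cluster_id):
    """Request termination of the cluster."""
    emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

With a real client this would be driven as `emr = boto3.client("emr")`, followed by `create_cluster(emr, build_cluster_request(...))`. Steps can also be supplied up front through the `Steps` argument of run_job_flow, which is what the final exercise in this section points toward.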

Review of Development Environment for AWS Step Functions and EMR • 2:09
Quick Overview of Important Terms of AWS Step Functions • 1:50
Getting Started with EMR based Pipeline using AWS Step Functions • 6:27
Overview of AWS IAM Role associated with State Machine copy • 1:26
Overview of Creating EMR Cluster using AWS Step Functions • 5:52
Learn to build an EMR cluster workflow with create cluster, add step, and terminate cluster using AWS Step Functions and Workflow Studio, including production considerations and run job flow details.
Parameters to Create EMR Cluster using AWS Step Functions • 2:52
Learn to create an EMR cluster with AWS Step Functions by translating run job flow parameters into JSON for the create cluster action, including name, logUri, release label, and instances.
Attach Permissions to Step Function Role to Create AWS EMR Cluster • 4:58
Grant the Step Functions role the permissions needed to create an AWS EMR cluster, including the service, job flow, and auto scaling roles, and attach the required policies.
Add Step to AWS EMR Cluster using AWS Step Function • 8:35
Validate Adding Step to AWS EMR Cluster using Step Functions • 3:59
Validate the successful execution of a step added to an AWS EMR cluster via Step Functions, review logs and S3 outputs, and plan termination handling in the next lecture.
Add Action to State Machine to Terminate the AWS EMR Cluster • 6:41
Validate the execution of State Machine to run Spark Application on AWS EMR • 2:35
Terminate AWS EMR Clusters Created to Validate State Machine copy • 2:28
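
The pipeline built in this section has three states: create the cluster, add a step, terminate the cluster. The sketch below generates an Amazon States Language definition with that shape as a Python dictionary. It is an illustration under stated assumptions rather than the course's state machine: the function name, state names, instance type, and role names are hypothetical, while the `elasticmapreduce` resource ARNs are the Step Functions service integrations for EMR.

```python
import json


def build_emr_pipeline(log_uri, app_uri):
    """Return a state machine definition: create cluster, add step, terminate."""
    return {
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "Pipeline Cluster",
                    "LogUri": log_uri,
                    "ReleaseLabel": "emr-6.6.0",
                    "Applications": [{"Name": "Spark"}],
                    # Assumed default role names.
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "Instances": {
                        "KeepJobFlowAliveWhenNoSteps": True,
                        "InstanceGroups": [
                            {"InstanceRole": "MASTER",
                             "InstanceType": "m5.xlarge", "InstanceCount": 1},
                            {"InstanceRole": "CORE",
                             "InstanceType": "m5.xlarge", "InstanceCount": 2},
                        ],
                    },
                },
                # Keep the original input and store the cluster details beside it.
                "ResultPath": "$.cluster",
                "Next": "AddStep",
            },
            "AddStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "Spark Application",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "--deploy-mode",
                                     "cluster", app_uri],
                        },
                    },
                },
                # A null ResultPath discards the step output and passes the input on.
                "ResultPath": None,
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "End": True,
            },
        },
    }


definition = build_emr_pipeline("s3://my-bucket/logs/", "s3://my-bucket/app/app.py")
print(json.dumps(definition, indent=2))
```

The printed JSON can be pasted into the Step Functions console's code editor. The `.sync` suffix makes each state wait for the EMR operation to finish before moving on.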

Review the current state of AWS EMR based Pipeline or State Machine copy • 0:41
Create State Machine using AWS Step Function to Validate S3 copy • 3:03
Attach Policy with Permissions on AWS S3 to Step Function Role copy • 3:00
Setup File in AWS S3 and Validate State Machine to list objects copy • 3:11
Relationship between AWS Boto3 and Actions in Step Functions copy • 3:41
Learn how boto3 APIs relate to AWS Step Functions and S3-based states, enabling state machines to manage and delete S3 objects using bucket and prefix parameters.
Add State to Delete Object from AWS S3 copy • 2:53
Fix Permissions and Run State Machine to Delete Object from AWS S3 copy • 3:01
Passing Input to States in AWS Step Functions State Machine copy • 5:54
Setup Multiple Files to Manage AWS S3 Objects using State Machines copy • 2:35
Process AWS S3 Objects using Map in State Machine • 8:30
Extract Key of AWS S3 Objects using Step Functions Pass • 4:54
Learn to extract S3 object keys from list objects using a Step Functions Map and Pass state with InputPath, then pass the keys to S3 delete object.
Add State to AWS Step Function Delete S3 Object • 4:12
Implement delete S3 object logic in the AWS Step Functions state machine, replacing Pass with S3 delete object, wiring bucket and key inputs, and noting Lambda-assisted array augmentation.
Develop AWS Lambda Function to customise State Machine Data • 7:03
Add AWS Lambda Function to State Machine to Pass S3 Details for delete • 9:11
Add Condition to State Machine to avoid Key Error on AWS S3 List Objects • 6:19
Overview of Map Concurrency in State Machines of AWS Step Functions • 3:33
Explore how Map concurrency in AWS Step Functions enables parallel deletion of S3 objects, with practical guidance on setting maximum concurrency and validating parallel execution.
Invoking AWS Step Function State Machine from Other State Machines • 6:27
Overview of integration of S3 based State Machine with EMR State Machine • 1:22
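
The S3 state machine described in this section lists objects under a prefix, skips deletion when nothing is found, and otherwise deletes each object inside a Map state with bounded concurrency. The sketch below generates one possible definition of that shape. It deliberately differs from the lectures in one respect: instead of a Lambda function that adds the bucket to each array element, it uses the Map state's `Parameters` field with the `$$.Map.Item.Value` context object. The function name, state names, and the expected input shape (`{"Bucket": ..., "Prefix": ...}`) are assumptions.

```python
import json


def build_s3_cleanup(max_concurrency=5):
    """Return a state machine definition that deletes all objects under a prefix.

    Expected execution input (assumed): {"Bucket": "...", "Prefix": "..."}
    """
    return {
        "StartAt": "ListObjects",
        "States": {
            "ListObjects": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:s3:listObjectsV2",
                "Parameters": {"Bucket.$": "$.Bucket", "Prefix.$": "$.Prefix"},
                # Keep Bucket and Prefix; store the listing beside them.
                "ResultPath": "$.listing",
                "Next": "AnyObjects",
            },
            "AnyObjects": {
                # Contents is absent when no object matches the prefix, so
                # check for it before iterating to avoid a missing-key error.
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.listing.Contents",
                        "IsPresent": True,
                        "Next": "DeleteEach",
                    }
                ],
                "Default": "NothingToDelete",
            },
            "DeleteEach": {
                "Type": "Map",
                "ItemsPath": "$.listing.Contents",
                "MaxConcurrency": max_concurrency,
                # Build the per-item input from the bucket and the item's key.
                "Parameters": {
                    "Bucket.$": "$.Bucket",
                    "Key.$": "$$.Map.Item.Value.Key",
                },
                "Iterator": {
                    "StartAt": "DeleteObject",
                    "States": {
                        "DeleteObject": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::aws-sdk:s3:deleteObject",
                            "Parameters": {"Bucket.$": "$.Bucket", "Key.$": "$.Key"},
                            "End": True,
                        }
                    },
                },
                "End": True,
            },
            "NothingToDelete": {"Type": "Succeed"},
        },
    }


print(json.dumps(build_s3_cleanup(), indent=2))
```

`MaxConcurrency` caps how many deletions run in parallel; a value of 0 leaves the limit to Step Functions, and 1 forces sequential processing. `Iterator` and `Parameters` are the older field names for the Map state; newer definitions use `ItemProcessor` and `ItemSelector` for the same purpose.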

Taking back up of AWS Step Functions State Machines • 2:13
Integrate two AWS Step Functions state machines (an EMR-based ghactivity converter and a validator for the S3 target location), ensuring validation precedes EMR pipelines, with backups via copy-to-new and Git versioning.
Grant Permissions between AWS Step Functions State Machines via IAM Role • 3:24
Learn to grant AWS Step Functions permissions via IAM roles for cross-state-machine invocation, enabling EMR Spark jobs and S3 target location validation.
Update AWS Step Function State Machine with EMR to validate S3 • 5:14
Pass EMR Step Details to AWS Step Functions State • 3:12
Propagate the original input through the EMR-based AWS Step Functions state machine to add a Spark step to the cluster, discarding output and routing input to downstream states.
Validate AWS Step Function EMR based State Machine Execution • 3:11
Run AWS Step Function State Machine to validate logic to delete AWS S3 Objects • 1:11
Exercise to add validation of source S3 location in AWS Step Function StateMach • 1:33
Update AWS Step Function State Machine to Validate Source S3 Location • 4:59
Run AWS Step Function State Machine with source S3 Validation Logic • 5:03
Develop AWS Lambda Function to check number of files in source S3 • 5:20
Attach Policy to State Machine Role to Invoke AWS Lambda Function • 2:30
Attach an inline policy to the state machine role to invoke the Lambda function, then review and specify the function ARN and name to enable execution.
Run Updated State Machine to validate source count • 11:58
Best Practices to Run AWS Step Functions State Machines • 3:10
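
One of the lectures above develops a Lambda function that checks how many files exist in the source S3 location so the state machine can decide whether to proceed. The sketch below shows how such a function might be written. The event shape (`Bucket` and `Prefix` keys), the `FileCount` output key, and the optional `s3_client` argument are assumptions made for illustration, not the course's actual contract.

```python
def count_objects(s3_client, bucket, prefix):
    """Count the objects under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix)
    # Each page reports how many keys it holds in KeyCount.
    return sum(page.get("KeyCount", 0) for page in pages)


def lambda_handler(event, context, s3_client=None):
    """Return the input event augmented with the number of source files.

    Assumed event shape: {"Bucket": "...", "Prefix": "..."}
    """
    if s3_client is None:
        # Created on demand so the handler can be exercised with a stand-in.
        import boto3

        s3_client = boto3.client("s3")
    count = count_objects(s3_client, event["Bucket"], event["Prefix"])
    return {**event, "FileCount": count}
```

Returning the original event alongside the count lets a following Choice state compare `$.FileCount` against a threshold while later states still see the bucket and prefix.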

Requirements

A computer science or IT Degree or 1 or 2 years of IT Experience
Basic Linux Skills with ability to run commands using Terminal
Programming skills in Python are required
Valid AWS Account to use the AWS Services to learn how to build Data Pipelines using AWS Lambda Functions

Description

AWS Elastic Map Reduce (EMR) is one of the key AWS services for building large-scale data processing solutions with Big Data technologies such as Apache Hadoop, Apache Spark, and Hive. In this course, you will learn AWS Elastic Map Reduce (EMR) by building end-to-end data pipelines with Apache Spark and AWS Step Functions.

Here is the detailed outline of the course.

First, you will learn how to Get Started with AWS Elastic Map Reduce (EMR) by using the AWS Web Console to create and manage EMR Clusters. You will also learn the key features of the Web Console, how to connect to the master node of the cluster, and how to validate the important command-line interfaces such as spark-shell, pyspark, and hive, as well as the hdfs and aws CLI commands.
Once you understand how to get started with AWS EMR, you will go through the details related to Setting up Development Cluster using AWS EMR. There are quite a few advantages to using AWS EMR Clusters for development purposes and most enterprises do so.
After setting up a development cluster using AWS EMR, you will go through the Development Life Cycle of Spark Applications using AWS EMR Development Cluster. You will be using Visual Studio Code Remote Development on top of the AWS EMR Development Cluster to go through the details.
Once the development is done, you will go through the details related to Deploying Spark Application on AWS EMR Cluster. You will build the zip file and understand how to run it using the CLI in both client and cluster deployment modes. You will also understand how to deploy the Spark application as a step on AWS EMR Clusters, and how to troubleshoot issues in Spark Applications by going through the relevant logs.
Typically we run Spark Applications programmatically. After going through the details related to deploying spark applications on AWS EMR Clusters, you will be learning how to Manage AWS EMR Clusters using Python Boto3. You will not only learn how to create clusters programmatically but also how to deploy Spark Applications as Steps programmatically using Python Boto3.
End-to-end data pipelines using AWS EMR can be built using AWS Step Functions. Once you understand how to manage EMR Clusters using Python Boto3 and how to deploy Spark Applications on EMR Clusters with it, it is important to learn how to Build EMR-based Workflows or Pipelines using AWS Step Functions. You will learn how to create the cluster, deploy a Spark Application as a Step on the cluster, and then terminate the cluster as part of a basic pipeline or State Machine using AWS Step Functions.
You will also learn how to perform validations as part of State Machines by Enhancing AWS EMR-based State Machine or Pipeline. You will check if the files specified already exist as part of the validations.
We can also build Data Processing Applications or Pipelines using Spark SQL on AWS EMR. First, you will learn how to design and develop solutions using a Spark SQL Script, and how to validate them by running the appropriate commands with the relevant runtime arguments.
Once you understand the development process of implementing solutions using Spark SQL on AWS EMR, you will learn how to build a Data Pipeline using AWS Step Functions that deploys a Spark SQL Script on an EMR Cluster. You will also learn the concept of Boto3 Waiters to make sure the steps are executed sequentially.
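
The last two items mention passing runtime arguments to a Spark SQL script and using Boto3 Waiters to run steps one after another. The sketch below combines both ideas. It assumes the script is run with the spark-sql CLI, that parameters are supplied with its `-d key=value` option and referenced in the script as `${key}`, and that the script can be read directly from S3 on the cluster. The function names and parameter names are hypothetical.

```python
def build_spark_sql_step(name, script_uri, params):
    """Build an EMR step that runs a Spark SQL script with runtime parameters."""
    args = ["spark-sql", "-f", script_uri]
    for key, value in params.items():
        # Each pair becomes a variable the script can reference as ${key}.
        args += ["-d", f"{key}={value}"]
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }


def run_step_and_wait(emr_client, cluster_id, step):
    """Submit a step and block until the waiter reports it complete."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    step_id = response["StepIds"][0]
    # The step_complete waiter polls the step status until it finishes
    # and raises an error if the step fails or the wait times out.
    emr_client.get_waiter("step_complete").wait(
        ClusterId=cluster_id,
        StepId=step_id,
    )
    return step_id
```

Calling `run_step_and_wait` once per step, in order, gives the linear execution described above: the second step is not submitted until the first has completed.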

Who this course is for:

University Students who want to learn AWS Elastic Map Reduce to process heavy volumes of data with hands-on and real-time examples
Aspiring Data Engineers and Data Scientists who want to master building data pipelines using AWS Elastic Map Reduce for large scale Data Processing
Experienced Application Developers who would like to explore how to build end to end Data Pipelines using Python and AWS Services such as AWS Elastic Map Reduce
Experienced Data Engineers who want to build end to end data pipelines using Python and AWS Elastic Map Reduce
Any IT Professional who is keen to deep dive into AWS Elastic Map Reduce (EMR) for heavy weight Data Processing

Course content

Introduction to Mastering AWS Elastic Map Reduce for Data Engineers • 1 lecture • 1min

Getting Started on Windows with Required Tools • 3 lectures • 9min

Getting Started with AWS EMR • 23 lectures • 1hr 16min

Setup Development Cluster using AWS EMR • 15 lectures • 59min

Development Life Cycle using AWS EMR Development Cluster • 16 lectures • 1hr 36min

Deploy Spark Application on AWS EMR Cluster • 10 lectures • 44min

Manage AWS EMR Clusters using Python Boto3 • 11 lectures • 1hr

Build EMR based Workflows or Pipelines using AWS Step Functions • 12 lectures • 50min

Develop State Machine using AWS Step Functions to manage S3 • 18 lectures • 1hr 20min

Adding S3 Validation Logic to AWS EMR based State Machine • 13 lectures • 53min
