Mastering AWS Elastic Map Reduce (EMR) for Data Engineers
What you'll learn
- Creating Clusters using AWS Elastic Map Reduce Web Console
- Setup Remote Application Development using AWS Elastic Map Reduce (EMR) and Visual Studio Code
- Develop and Validate Simple Spark Application using Visual Studio Code and AWS Elastic Map Reduce (EMR)
- Deploy Spark Application as Step to AWS Elastic Map Reduce (EMR)
- Manage AWS Elastic Map Reduce (EMR) based Pipelines using Boto3 and Python
- Build End to End AWS Elastic Map Reduce (EMR) based Pipelines using AWS Step Functions
- Develop Applications using Spark SQL on AWS EMR Cluster
- Build a State Machine or Pipeline with AWS Step Functions using a Spark SQL Script on an AWS EMR Cluster
- Understand how to pass parameters to Spark SQL Scripts deployed on EMR
Requirements
- A computer science or IT degree, or 1 to 2 years of IT experience
- Basic Linux skills with the ability to run commands using the Terminal
- Programming skills using Python are required
- A valid AWS Account to use the AWS Services needed to learn how to build Data Pipelines using AWS Elastic Map Reduce (EMR)
AWS Elastic Map Reduce (EMR) is one of the key AWS Services used in building large-scale data processing solutions, leveraging Big Data Technologies such as Apache Hadoop, Apache Spark, Hive, etc. As part of this course, you will learn AWS Elastic Map Reduce (EMR) by building end-to-end data pipelines leveraging Apache Spark and AWS Step Functions.
Here is the detailed outline of the course.
First, you will learn how to Get Started with AWS Elastic Map Reduce (EMR) by understanding how to use the AWS Web Console to create and manage EMR Clusters. You will also learn about the key features of the Web Console, how to connect to the master node of the cluster, and how to validate the important CLI interfaces such as spark-shell, pyspark, and hive, as well as the hdfs and aws CLI commands.
Once you understand how to get started with AWS EMR, you will go through the details related to Setting up Development Cluster using AWS EMR. There are quite a few advantages to using AWS EMR Clusters for development purposes and most enterprises do so.
After setting up a development cluster using AWS EMR, you will go through the Development Life Cycle of Spark Applications using AWS EMR Development Cluster. You will be using Visual Studio Code Remote Development on top of the AWS EMR Development Cluster to go through the details.
Once the development is done, you will go through the details related to Deploying Spark Applications on AWS EMR Clusters. You will build the zip file and understand how to run it using the CLI in both client and cluster deployment modes. You will also understand how to deploy the Spark Application as a Step on AWS EMR Clusters, as well as how to troubleshoot issues related to Spark Applications by going through the relevant logs.
Typically we run Spark Applications programmatically. After going through the details related to deploying spark applications on AWS EMR Clusters, you will be learning how to Manage AWS EMR Clusters using Python Boto3. You will not only learn how to create clusters programmatically but also how to deploy Spark Applications as Steps programmatically using Python Boto3.
End-to-end Data Pipelines on AWS EMR are built using AWS Step Functions. Once you understand how to manage EMR Clusters using Python Boto3 and deploy Spark Applications on EMR Clusters using the same, it is important to learn how to Build EMR-based Workflows or Pipelines using AWS Step Functions. You will learn how to create the cluster, deploy the Spark Application as a Step onto the cluster, and then terminate the cluster as part of a basic pipeline or State Machine using AWS Step Functions.
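A basic create-run-terminate State Machine of this shape can be expressed in Amazon States Language. The sketch below uses the standard Step Functions service integrations for EMR; the cluster name, instance types, IAM roles, and application path are placeholder assumptions, not values from the course.

```python
import json


def emr_pipeline_definition(app_path: str) -> str:
    """Return an ASL definition for: create cluster -> run Spark Step -> terminate cluster."""
    definition = {
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                # .sync integration waits until the cluster is up before moving on
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    "Name": "pipeline-cluster",  # placeholder
                    "ReleaseLabel": "emr-6.10.0",
                    "Applications": [{"Name": "Spark"}],
                    "Instances": {
                        "InstanceGroups": [
                            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
                        ],
                        "KeepJobFlowAliveWhenNoSteps": True,
                    },
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "ServiceRole": "EMR_DefaultRole",
                },
                "ResultPath": "$.Cluster",  # keep ClusterId for the next states
                "Next": "RunSparkStep",
            },
            "RunSparkStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.Cluster.ClusterId",
                    "Step": {
                        "Name": "spark-app",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", "--deploy-mode", "cluster", app_path],
                        },
                    },
                },
                "ResultPath": None,  # discard the step result, keep the cluster info
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
                "Parameters": {"ClusterId.$": "$.Cluster.ClusterId"},
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

The resulting JSON string is what you would pass as the `definition` when creating the State Machine.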
You will also learn how to perform validations as part of State Machines by enhancing the AWS EMR-based State Machine or Pipeline. As part of the validations, you will check whether the specified files already exist.
We can also build Data Processing Applications or Pipelines using Spark SQL on AWS EMR. First, you will learn how to design and develop solutions using Spark SQL Scripts, and how to validate them using the appropriate commands while passing relevant runtime arguments.
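As an illustration of passing runtime arguments, here is a small hypothetical helper that assembles a `spark-sql` invocation. It assumes the Hive-style `-d name=value` variable substitution that the `spark-sql` CLI supports, with variables referenced inside the script as `${name}`.

```python
def spark_sql_command(script_path: str, params: dict) -> list:
    """Assemble argv for running a Spark SQL script with runtime variables.
    Each variable is passed as -d name=value and can be referenced in the
    script as ${name} (Hive-style variable substitution)."""
    command = ["spark-sql", "-f", script_path]
    for name, value in params.items():
        command += ["-d", f"{name}={value}"]
    return command
```

For example, `spark_sql_command("daily_report.sql", {"bucket": "my-bucket", "run_date": "2024-01-01"})` produces a command whose script can read `${bucket}` and `${run_date}` at runtime.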
Once you understand the development process of implementing solutions using Spark SQL on AWS EMR, you will learn how to build a Data Pipeline using AWS Step Functions that deploys a Spark SQL Script on an EMR Cluster. You will also learn about Boto3 Waiters, which make sure the steps are executed in a linear fashion.
Who this course is for:
- University Students who want to learn AWS Elastic Map Reduce to process heavy volumes of data, with hands-on and real-time examples
- Aspiring Data Engineers and Data Scientists who want to master building data pipelines using AWS Elastic Map Reduce for large scale Data Processing
- Experienced Application Developers who would like to explore how to build end to end Data Pipelines using Python and AWS Services such as AWS Elastic Map Reduce
- Experienced Data Engineers who want to build end to end data pipelines using Python and AWS Elastic Map Reduce
- Any IT Professional who is keen to deep dive into AWS Elastic Map Reduce (EMR) for heavyweight Data Processing
20+ years of experience in executing complex projects using a vast array of technologies including Big Data and the Cloud.
ITVersity, Inc. is a US-based organization that provides quality training for IT professionals, with a track record of training hundreds of thousands of professionals globally.
Helping people build an IT career by providing the required tools, such as high-quality material, labs, and live support, to upskill and cross-skill is paramount for our organization.
At this time our training offerings are focused on the following areas:
* Application Development using Python and SQL
* Big Data and Business Intelligence
* Data Warehousing and Databases
- 4.4 Instructor Rating
- 9,530 Reviews
- 110,238 Students
- 11 Courses