Data Engineering using AWS Data Analytics

Name: Data Engineering using AWS Data Analytics
Rating: 4.4 (3502 reviews)

Build Data Engineering Pipelines on AWS using Data Analytics Services - Glue, EMR, Athena, Kinesis, Lambda, Redshift

Created by Durga Viswanatha Raju Gadiraju, Phani Bhushan Bozzam, Vinay Gadiraju

Last updated 11/2024

English

What you'll learn

Data Engineering leveraging Services under AWS Data Analytics
AWS Essentials such as s3, IAM, EC2, etc
Understanding AWS s3 for cloud based storage
Understanding details related to virtual machines on AWS known as EC2
Managing AWS IAM users, groups, roles and policies for RBAC (Role Based Access Control)
Managing Tables using AWS Glue Catalog
Engineering Batch Data Pipelines using AWS Glue Jobs
Orchestrating Batch Data Pipelines using AWS Glue Workflows
Running Queries using AWS Athena - Serverless query engine service
Using AWS Elastic Map Reduce (EMR) Clusters for building Data Pipelines
Using AWS Elastic Map Reduce (EMR) Clusters for reports and dashboards
Data Ingestion using AWS Lambda Functions
Scheduling using AWS EventBridge
Engineering Streaming Pipelines using AWS Kinesis
Streaming Web Server logs using AWS Kinesis Firehose
Overview of data processing using AWS Athena
Running AWS Athena queries or commands using CLI
Running AWS Athena queries using Python boto3
Creating AWS Redshift Cluster, Create tables and perform CRUD Operations
Copy data from s3 to AWS Redshift Tables
Understanding Distribution Styles and creating tables using Distkeys
Running queries on external RDBMS Tables using AWS Redshift Federated Queries
Running queries on Glue or Athena Catalog tables using AWS Redshift Spectrum

Course content

28 sections • 421 lectures • 25h 20m total length

Introduction to Data Engineering using AWS Analytics Services5:45
Delve into data engineering with AWS analytics services, from local and cloud setup (S3, IAM, CLI) to PySpark development and Glue, EMR, Kinesis, Lambda, and Athena pipelines.
Video Lectures and Reference Material3:01
Learn how to access and navigate the data engineering masterclass on AWS analytics services, using video lectures, code snippets, instructions, and external resources within the Udemy course interface.
Taking the Udemy Course for new Udemy Users4:07
Navigate Udemy by exploring sections and lectures, and track completion by watching to 80–90%. Use the video player to adjust speed, add notes and bookmarks, and review notes for learning.
Additional Costs for AWS Infrastructure for Hands-on Practice1:39
Understand the additional costs of AWS infrastructure for hands-on practice, including sign-up steps, credit card requirements, free tier limitations, and typical spend estimates for AWS analytics.
Signup for AWS Account1:45
Learn how to sign up for an AWS account, register a credit card, and complete the onboarding steps at aws.amazon.com to access AWS services for data engineering using AWS analytics.
Logging in into AWS Account1:45
Log in to the AWS management console as the root user, then create IAM users for ongoing access and explore services using the search bar.
Overview of AWS Billing Dashboard - Cost Explorer and Budgets3:16
Explore the AWS billing dashboard, cost explorer, and budgets to monitor forecasted and month-to-date spend, view spend by service or linked account, and set alerts.

Setup Local Environment on Windows for AWS3:22
Set up a Windows local environment with WSL and Ubuntu to interact with AWS via aws cli and Python, including Jupyter lab and boto3 access to S3.
Overview of Powershell on Windows 10 or Windows 114:25
Discover PowerShell, built into Windows 10 and 11, and learn to launch, customize its settings, and use SSH to remote machines, with basics like dir and mkdir.
Setup Ubuntu VM on Windows 10 or 11 using wsl6:07
Set up Ubuntu on Windows using WSL to install and configure an Ubuntu 20.04 virtual machine, learn WSL commands, installation steps, and reboot requirements for Windows 10 or Windows 11.
Setup Ubuntu VM on Windows 10 or 11 using wsl - Contd...5:17
Set up an Ubuntu 20.04 VM on Windows with WSL, create a login user, and access Linux commands via the WSL CLI and PowerShell.
Setup Python venv and pip on Ubuntu8:49
Validate and set up Python on Ubuntu via WSL, ensuring Python 3, venv, and pip for building Python-based apps. Create a demo-venv, verify it, and install modules with pip.
Setup AWS CLI on Windows and Ubuntu using Pip3:09
Learn to set up the local development environment for AWS by installing the AWS CLI with pip on Ubuntu via WSL or direct desktop, then verify with aws --version.
Create AWS IAM User and Download Credentials3:49
Create an IAM user with programmatic access and administrator permissions, then download the access key and secret key as a CSV backup to configure the AWS CLI.
Configure AWS CLI on Windows7:36
Configure the AWS CLI on Windows using an IAM user, run aws configure, and validate with aws s3 ls. Manage credentials and profiles in the .aws folder.
Create Python Virtual Environment for AWS Projects3:14
Learn to set up a Python virtual environment for AWS projects, install libraries, and validate Python can interact with your AWS account using venv, activation, and Linux commands.
Setup Boto3 as part of Python Virtual Environment2:29
Set up a python virtual environment and install boto3 to interact with AWS services, validate the installation with python cli, and prepare to use Jupyter for boto3-based AWS work.
Setup Jupyter Lab and Validate boto36:42
Install and launch Jupyter Lab within the virtual environment, then validate boto3 by importing it and listing S3 buckets via a Python client, using AWS credentials from the default profile.
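
The boto3 validation step described above can be sketched roughly as follows. This is a minimal illustration and not the course's own notebook code: the helper names `bucket_names` and `list_my_buckets` are hypothetical, and the second function assumes boto3 is installed and a default profile exists in the .aws folder.

```python
def bucket_names(s3_client):
    """Return the names of all buckets visible to the given S3 client."""
    response = s3_client.list_buckets()
    return [bucket["Name"] for bucket in response.get("Buckets", [])]


def list_my_buckets():
    """Hypothetical convenience wrapper: build a client from the default profile."""
    # Imported lazily so that bucket_names can be exercised without boto3 installed.
    import boto3

    return bucket_names(boto3.client("s3"))
```

In a Jupyter Lab cell, calling `list_my_buckets()` should return the same bucket names that `aws s3 ls` prints for the default profile.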

Setup Local Environment for AWS on Mac2:35
Learn to set up and configure a local AWS environment on macOS with IAM credentials, use AWS CLI and Python with boto3 to interact with S3 and other services.
Setup AWS CLI on Mac2:08
Set up the AWS CLI on macOS by installing via pip for Python 3.6–3.9, configure IAM access keys, and validate AWS commands.
Setup AWS IAM User to configure AWS CLI2:40
Create an IAM user with programmatic access, download its credentials, and use them to configure the AWS CLI for programmatic control of AWS services.
Configure AWS CLI using IAM User Credentials6:25
Configure the AWS CLI with IAM user credentials, verify access by listing S3 buckets, and manage multiple accounts with profiles and the .aws credentials and config files.
Setup Python Virtual Environment on Mac using Python 34:43
Set up a Python virtual environment on Mac using Python 3.8, activate it with source de-venv/bin/activate, and install configparser to enable AWS interaction via the Python SDK and the AWS CLI.
Setup Boto3 as part of Python Virtual Environment2:29
Set up a Python virtual environment and run pip install boto3 to enable Python-based interaction with AWS, validate by importing boto3 and listing S3 buckets, and explore Jupyter for development.
Setup Jupyter Lab and Validate boto36:42
Install and validate boto3 in a Python notebook, launch Jupyter Lab, and run S3 operations like list_buckets using AWS credentials and the default CLI profile.

Introduction - AWS Getting Started1:44
Set up users, groups, and roles in the AWS console, create an S3 bucket as a permissions testbed, and configure the AWS CLI to validate read access.
[Instructions] Introduction - AWS Getting Started0:27
Create AWS s3 Bucket using AWS Web Console3:45
Create an s3 bucket named ITV-GitHub in the AWS web console, then create landing and raw folders to store json GitHub data and parquet-raw data for a data lake.
[Instructions] Create s3 Bucket0:40
Create AWS IAM Group and User using AWS Web Console4:26
Create an IAM group named ITV GitHub group and add a user with programmatic access. Download credentials to configure the AWS CLI or boto3; permissions inherit from the group.
[Instructions] Create IAM Group and User0:45
Overview of AWS IAM Roles to grant permissions between AWS Services2:18
Understand how IAM roles grant permissions between AWS services, with examples from EC2, EMR, and AWS Glue, and learn to create and attach policies or a custom JSON policy.
[Instructions] Overview of Roles0:22
Create and Attach AWS IAM Custom Policy using AWS Web Console4:36
Attach a custom AWS IAM policy to a group to grant full S3 bucket access for ITV-GitHub, including listing and object read/write permissions, via the AWS web console.
[Instructions and Code] Create and Attach Custom Policy0:30
Configure and Validate AWS Command Line Interface to run AWS Commands4:39
Configure the AWS CLI with a dedicated ITV GitHub profile, validate credentials, and verify access to the ITV-GitHub S3 bucket from a dockerized Jupyter Lab environment.
[Instructions and Code] Configure and Validate AWS CLI0:25
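
A custom policy of the kind described in this section (bucket listing plus object read/write on a single bucket) might look roughly like the sketch below. The function name `s3_bucket_policy`, the example bucket name, and the exact action list are assumptions for illustration; the policy used in the course may differ.

```python
import json


def s3_bucket_policy(bucket_name):
    """Build an identity-based policy document scoped to a single S3 bucket."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Object-level actions apply to the keys inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


# JSON text that could be pasted into the policy editor in the web console.
policy_json = json.dumps(s3_bucket_policy("itv-github"), indent=2)
```

Splitting the statements this way reflects that `s3:ListBucket` is evaluated against the bucket ARN, while the object actions are evaluated against the `/*` object ARN.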

Getting Started with AWS Simple Storage aka S32:59
Learn AWS S3, the simple storage service, a low-cost cloud storage for buckets and objects accessible from anywhere. Master versioning, cross-region replication, and storage classes via console or CLI.
[Instructions] Getting Started with AWS S30:07
Setup Data Set locally to upload into AWS s32:17
Set up a local dataset, clone the retail_db repository with six folders, review locations, then create an S3 bucket and copy files as objects into S3.
[Instructions] Setup Data Set locally to upload into AWS s30:18
Adding AWS S3 Buckets and Objects using AWS Web Console5:49
Create a unique S3 bucket with your initials and retail prefix on the AWS web console, then upload folders as objects and learn single-file or single-folder limits before CLI steps.
[Instruction] Adding AWS s3 Buckets and Objects0:25
Version Control of AWS S3 Objects or Files5:55
Enable S3 bucket versioning and define lifecycle rules for retail_db to manage older versions, recover from prior versions, and control storage costs.
[Instructions] Version Control in AWS S31:01
AWS S3 Cross-Region Replication for fault tolerance9:15
Enable cross region replication for S3 to copy objects from the source bucket to a destination bucket in another region, boosting fault tolerance and availability with versioning.
[Instructions] AWS S3 Cross-Region Replication for fault tolerance0:49
Overview of AWS S3 Storage Classes or Storage Tiers5:58
Compare AWS S3 storage classes, including standard, intelligent-tiering, standard-IA, one zone-IA, glacier, and glacier deep archive, using the performance chart to guide cost, latency, and access.
[Instructions] Overview of AWS S3 Storage Classes or Storage Tiers0:51
Overview of Glacier in AWS s33:08
Discover glacier and glacier deep archive in S3 as low-cost backup storage, compare pricing and latency with standard, and apply storage class changes via object edits or lifecycle rules.
[Instructions] Overview of Glacier in AWS s30:19
Managing AWS S3 buckets and objects using AWS CLI7:07
Learn to manage s3 buckets and objects with the aws s3 cli, using ls, cp, mv, rm, mb, rb, and sync, and explore hosting a website on s3.
[Instructions and Commands] Managing AWS S3 buckets and objects using AWS CLI0:27
Managing Objects in AWS S3 using AWS CLI - Lab12:17
List all S3 objects recursively, delete retail_db subfolders, copy folders with aws s3 cp --recursive and include/exclude filters, and verify results via the AWS console.
[Instructions] Managing Objects in AWS S3 using AWS CLI - Lab0:34
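
The lifecycle rule for older object versions mentioned in the versioning lecture can also be expressed programmatically. The following is a sketch under assumptions: the helper names, the rule ID format, and the 30-day default are illustrative choices and do not come from the course material.

```python
def noncurrent_version_rule(prefix, days):
    """Build a lifecycle configuration that expires noncurrent object versions."""
    return {
        "Rules": [
            {
                "ID": f"expire-old-versions-{days}d",
                "Status": "Enabled",
                # Only objects under this key prefix are affected.
                "Filter": {"Prefix": prefix},
                "NoncurrentVersionExpiration": {"NoncurrentDays": days},
            }
        ]
    }


def apply_rule(s3_client, bucket, prefix="retail_db/", days=30):
    """Attach the rule to a versioned bucket using a boto3 S3 client."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=noncurrent_version_rule(prefix, days),
    )
```

Current versions are left untouched by this rule; only versions that have been superseded for the given number of days are removed, which is what keeps storage costs under control on a versioned bucket.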

Creating AWS IAM Users with Programmatic and Web Console Access6:23
Learn to create an AWS IAM user with both programmatic and web console access, attach policies (admin, EC2, S3), manage passwords, and handle access keys and credentials.
[Instructions] Creating IAM Users0:07
Logging into AWS Management Console using AWS IAM User2:24
Log in to the AWS management console with the IAM user itvadmin, use credentials from the CSV, reset the password on first login, and verify administrator access before configuring CLI.
[Instructions] Logging into AWS Management Console using IAM User0:21
Validate Programmatic Access to AWS IAM User via AWS CLI2:15
Configure the AWS CLI for the itvadmin IAM user using downloaded credentials, then verify programmatic access by listing S3 buckets and creating a unique bucket (dg-itvdemo) to confirm access.
[Instructions and Commands] Validate Programmatic Access to IAM User0:29
Getting Started with AWS IAM Identity-based Policies9:08
Explore identity-based policies in AWS IAM by examining predefined and custom policies, their JSON structure (effect, action, resource), and attaching them to users, groups, or roles.
[Instructions and Commands] IAM Identity-based Policies1:04
Managing AWS IAM User Groups6:20
Understand how AWS IAM user groups govern permissions by attaching policies to groups like itvadmin and itvsupport, and how users inherit these permissions.
[Instructions and Commands] Managing IAM Groups0:50
Managing AWS IAM Roles for Service Level Access9:38
Learn how to manage IAM roles to enable service-to-service access on AWS, attaching the AmazonS3ReadOnlyAccess policy to an EC2 role and validating read-only S3 permissions via AWS CLI.
[Instructions and Commands] Managing IAM Roles0:46
Overview of AWS Custom Policies to grant permissions to Users, Groups, and Roles9:00
Explore custom policies in AWS IAM to grant granular S3 permissions, using JSON and ARN scoping to allow retail_db access while listing buckets; attach policies to users, groups, or roles.
[Instructions and Commands] Overview of Custom Policies0:53
Managing AWS IAM Groups, Users, and Roles using AWS CLI8:56
Learn to manage IAM with the AWS CLI by listing users, groups, roles, and policies, and by creating, assigning, and deleting users in groups.
[Instructions and Commands] Managing IAM using AWS CLI0:39

Getting Started with AWS Elastic Compute Cloud aka EC22:59
Learn the fundamentals of AWS EC2, including regions, availability zones, and instance types, and provision EC2 instances with OS, EBS root storage, VPC, security groups, and key pairs.
[Instructions] Getting Started with EC20:36
Create AWS EC2 Key Pair for SSH Access6:34
Create an ec2 key pair in the AWS console to enable passwordless ssh, download the pem file, store it in ~/.ssh with strict permissions, and note the OS-specific standard user.
[Instructions] Create EC2 Key Pair0:50
Launch AWS EC2 Instance or Virtual Machine10:19
Launch an EC2 instance via the AWS console, choosing an Ubuntu 18.04 AMI, configuring storage, security groups, and networking, and launching with a key pair.
[Instructions] Launch EC2 Instance0:15
Connecting to AWS EC2 Instance or Virtual Machine using SSH3:07
Connect to the EC2 instance via ssh using the correct private key and key pair, with the public IPv4 DNS and the instance username.
[Instructions and Commands] Connecting to EC2 Instance0:11
Overview of AWS Security Groups for firewall security of AWS EC2 Instance8:28
Learn how AWS EC2 security groups define firewall rules to block or allow inbound traffic, enable SSH, HTTP, and HTTPS, and secure access with IP-based rules.
[Instructions and Commands] Security Groups Basics1:12
Overview of Public and Private IP Addresses of AWS EC2 Instance7:21
Understand public and private IP addresses and their DNS aliases for EC2 instances. See how ephemeral public IPs differ from private DNS used for internal communication in a VPC.
[Instructions] Public and Private IP Addresses0:31
Understanding AWS EC2 Instance or Virtual Machine Life Cycle3:46
Learn the AWS EC2 instance life cycle—from running and stopped to rebooting and restarting—including status checks, public IP behavior, and how to start, stop, reboot, terminate, or hibernate.
[Instructions] EC2 Life Cycle0:23
Allocating and Assigning AWS Elastic IP or Static IP address to AWS EC2 Instance5:06
Allocate an elastic IP in AWS and associate it with an EC2 instance to preserve the public IP and DNS across stops. Validate SSH connectivity and note costs.
[Instructions] Allocating and Assigning Elastic IP Addresses0:29
Managing AWS EC2 Instances or Virtual Machines Using AWS CLI8:52
Learn to manage EC2 instances using AWS CLI, including creating key pairs, launching and stopping instances, managing security groups and Elastic IPs, and querying instance metadata and status.
[Instructions and Commands] Managing EC2 Using AWS CLI1:01
Upgrade or Downgrade of AWS EC2 Instances or Virtual Machines6:46
Explore vertical scaling by upgrading or downgrading EC2 instance types in the AWS Management Console or via the command line, and distinguish horizontal scaling with EMR, ECS, and EKS.
[Instructions and Commands] Upgrade or Downgrade EC2 Instances1:10

Understanding AWS EC2 Instance or Virtual Machine Metadata4:00
Explore EC2 instance metadata and provisioning details, including instance type, security groups, key pair, IP addresses, DNS aliases, VPC, and retrieve describe-instances JSON with EBS and network interfaces.
[Instructions and Commands] Understanding EC2 Metadata0:21
Querying on AWS EC2 Instance or Virtual Machine Metadata5:22
Learn to extract instance metadata from AWS EC2 describe-instances using --query with JSON paths, including aliasing, filters, and state names.
[Instructions and Commands] Querying on EC2 Metadata0:17
Filtering on AWS EC2 Instance or Virtual Machine Metadata5:33
Learn to filter AWS EC2 instances with describe-instances using name and values, such as instance type to t2.micro, then query id, type, and status.
[Instructions and Commands] Filtering on EC2 Metadata0:26
Using Bootstrapping Scripts on AWS EC2 Instance or Virtual Machine7:33
Explore bootstrapping scripts for AWS EC2 instances, installing Apache2, Python pip, and AWS CLI on Ubuntu 18.04 t2.micro, with and without user data during first launch.
[Instructions and Commands] Using Bootstrapping Scripts0:26
Create an Amazon Machine Image aka AMI using AWS EC2 Instance5:52
Create a custom AMI from an EC2 instance by snapshotting the root volume. Launch new instances from this AMI with preinstalled software like Apache2, Python pip, and AWS CLI.
[Instructions and Commands] Create an AMI0:22
Validate Amazon Machine Image aka AMI - Lab4:08
Launch and validate a custom AMI by selecting it in the AMI dashboard. Launch the EC2 instance, verify apache2 and AWS CLI, pip install pandas, and terminate to avoid charges.
[Instructions and Commands] Validate AMI - Lab0:30
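
The querying and filtering lectures work with the JSON returned by describe-instances. A rough Python equivalent is sketched below; the function names and the output keys `Id`, `Type`, and `State` are hypothetical, and `fetch_micro_instances` assumes a boto3 EC2 client is passed in.

```python
def summarize_instances(response, instance_type=None):
    """Flatten a describe-instances response into id, type, and state rows."""
    rows = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Optional client-side filter, similar in spirit to --filters.
            if instance_type and instance.get("InstanceType") != instance_type:
                continue
            rows.append(
                {
                    "Id": instance["InstanceId"],
                    "Type": instance["InstanceType"],
                    "State": instance["State"]["Name"],
                }
            )
    return rows


def fetch_micro_instances(ec2_client):
    """Ask EC2 to filter server-side by instance type, then summarize."""
    response = ec2_client.describe_instances(
        Filters=[{"Name": "instance-type", "Values": ["t2.micro"]}]
    )
    return summarize_instances(response)
```

Server-side filters reduce the size of the response, while the summarizing step plays the role that `--query` plays on the command line.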

Hello World using AWS Lambda4:01
Create and deploy a hello world AWS lambda function in Python 3.8 via the web console, test with a sample event, and explore handlers, environment variables, and S3 permissions.
[Instructions] Hello World using AWS Lambda0:23
Setup Project for local development6:04
Set up the ghactivity downloader project for local development with a virtual environment, install requests and boto3, and prepare lambda deployment for downloading from gharchive and uploading to s3.
[Instructions and Code] Setup Project for local development0:44
Deploy Project to AWS Lambda console4:23
Create and test a Python lambda function locally, zip the code, upload to the AWS Lambda console, and validate the deployment with a test invocation.
[Instructions and Code] Deploy Project to AWS Lambda console0:15
Develop download functionality using requests7:06
Develop a download utility using the requests library to perform a get call, validate the status code, and build a lambda deployment zip.
[Instructions and Code] Develop download functionality using requests0:26
Using 3rd party libraries in AWS Lambda6:01
Bundle the requests library into your AWS Lambda deployment by zipping the ghallib subfolders, update memory settings, and validate the function with a test to ensure a 200 status.
[Instructions and Code] Using 3rd party libraries in AWS Lambda0:32
Validating s3 access for local development9:25
Validate s3 access for local development by using the itvgithub profile to list and upload files to the itv-github bucket with boto3, and distinguish local profiles from lambda roles.
[Instructions and Code] Validating s3 access for local development0:22
Develop upload functionality to s38:53
Refactor code to upload files to s3 with boto3, implementing get_client and upload_s3, using body, bucket, and file, and configure environment variables like bucket name for dev and lambda deployment.
[Instructions and Code] Develop upload functionality to s30:45
Validating using AWS Lambda Console2:46
Deploy the updated zip containing download.py, lambda_function.py, and upload.py to the AWS lambda console, set BUCKET_NAME as an environment variable, and ensure S3 permissions via IAM role for successful test.
[Instructions and Code] Validating using AWS Lambda Console0:22
Run using AWS Lambda Console4:26
Validate a lambda that downloads from gh archive and uploads to s3, resolve access denied by attaching the itvgithubs3 policy to the lambda role, and set file_prefix for sandbox uploads.
[Instructions] Run using AWS Lambda Console0:27
Validating files incrementally9:44
Validate files incrementally using Python datetime utilities and timedelta, generate filenames with strftime from a baseline, and check existence with requests until a 200 status, for eventual upload to S3.
[Instructions and Code] Validating files incrementally0:32
Reading and Writing Bookmark using s37:35
Enable incremental data ingestion by maintaining a bookmark in S3 to track the latest file, and use boto3 to get and put the bookmark payload.
[Instructions and Code] Reading and Writing Bookmark using s30:23
Maintaining Bookmark using s37:58
Maintain a bookmark in S3 by processing available files, incrementing the bookmark by one hour for each file, and exiting when no more files exist.
[Instructions and Code] Maintaining Bookmark using s30:37
Review the incremental upload logic5:55
Review the incremental upload logic that downloads GH Archive files into S3, using bucket name, file prefix, bookmark, and baseline, with next filename calculation and bookmark updates.
Deploying lambda function11:19
Deploy and validate an AWS Lambda function by packaging code into a zip, uploading to Lambda, configuring S3 permissions, setting environment variables, and scheduling invocations with EventBridge.
[Instructions and Source Code] - ghactivity-downloader Lambda Function0:43
Schedule Lambda Function using AWS EventBridge4:56
Schedule a lambda function with AWS EventBridge to perform incremental ghactivity ingestion from gharchive.org into S3 using a bookmark, with cleanup, monitoring, and validation.
[Instructions] Schedule Lambda Function using AWS EventBridge0:20
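
The incremental logic in this section (advance a bookmark by one hour per file and stop when no further file exists) can be sketched as below. This assumes GH Archive file names of the form 2021-01-29-0.json.gz with an unpadded hour; the function names and the `exists` callback are hypothetical stand-ins for the course's requests-based check.

```python
from datetime import datetime, timedelta

FILE_SUFFIX = ".json.gz"


def next_file_name(bookmark):
    """Return the file name one hour after the bookmarked file name."""
    stamp = bookmark[: -len(FILE_SUFFIX)]
    current = datetime.strptime(stamp, "%Y-%m-%d-%H")
    following = current + timedelta(hours=1)
    # The hour is written without zero padding, so format it separately.
    return f"{following:%Y-%m-%d}-{following.hour}{FILE_SUFFIX}"


def files_to_ingest(bookmark, exists, max_files=24):
    """Collect consecutive file names after the bookmark while they exist."""
    names = []
    candidate = next_file_name(bookmark)
    while len(names) < max_files and exists(candidate):
        names.append(candidate)
        candidate = next_file_name(candidate)
    return names
```

In a Lambda function, `exists` could wrap an HTTP request that checks for a 200 status, and the last name returned would be written back to S3 as the new bookmark.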

Setup Virtual Environment and Install Pyspark4:45
Set up a local data engineering workspace by creating and activating a python 3.7 virtual environment, installing pyspark 2.4 (or 2.3/3.0.1), and configuring PyCharm for development.
[Commands] - Setup Virtual Environment and Install Pyspark0:05
Getting Started with Pycharm4:56
Open an existing project in PyCharm, configure the correct virtual environment, and verify pyspark dependencies to develop python-based Spark data engineering pipelines.
[Code and Instructions] - Getting Started with Pycharm0:21
Passing Run Time Arguments5:30
Learn how to pass runtime arguments to Python programs using the sys library and argv, validate them in PyCharm or the command line, and configure run-time arguments for testing.
Accessing OS Environment Variables4:39
Learn to access environment variables in Python with os.environ.get, using exported keys like FOO to retrieve values such as BAR and database credentials.
Getting Started with Spark2:53
Learn to run a Spark program in a Python environment by creating a SparkSession with SparkSession.builder, set appName and local master, then query current_date and show the DataFrame.
Create Function for Spark Session5:48
Create a modular get_spark_session function to build a spark session based on environment (dev local or cluster) and integrate it into the app workflow.
[Code and Instructions] - Create Function for Spark Session0:28
Setup Sample Data2:09
Learn how to set up a sample data set by creating a folder structure, downloading GitHub activity files from gharchive.org, and preparing data for loading into a data frame.
Read data from files8:46
Create a modular data pipeline by implementing read.py with from_files(spark, data_dir, file_pattern, file_format), using spark.read.format or spark.read.json to read JSON data, and preview via schema and sample records.
[Code and Instructions] - Read data from files0:37
Process data using Spark APIs6:27
Extract year, month, and day from the created date field using pyspark, add them as new columns, and prepare partitioned writes by year, month, and day.
[Code and Instructions] - Process data using Spark APIs0:35
Write data to files7:13
Learn how to write a transformed data frame to files with partitioning by year, month, and day, using coalesce to control file counts and supporting parquet or json.
[Code and Instructions] - Write data to files0:44
Validating Writing Data to Files6:46
Refactor app.py to write transformed data frames to parquet files partitioned by year, month, and day using to_files and environment variables for target dir and format.
Productionizing the Code4:36
Productionize a data engineering codebase for multi-node clusters using yarn to monitor spark resources, deploy on gateway nodes, and run in prod or dev modes with a zip package.
[Code and Instructions] - Productionizing the code1:10
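
The pattern covered in this section (read settings from environment variables, then build a Spark session that runs locally in dev and on the cluster otherwise) might look roughly like this. The environment variable names `ENVIRON`, `SRC_DIR`, `TGT_DIR`, and `TGT_FILE_FORMAT`, the helper names, and the defaults are assumptions for illustration, not the course's exact code.

```python
import os


def read_settings(environ=None):
    """Collect pipeline settings from environment variables."""
    environ = os.environ if environ is None else environ
    return {
        "env": environ.get("ENVIRON", "DEV"),
        "src_dir": environ.get("SRC_DIR"),
        "tgt_dir": environ.get("TGT_DIR"),
        "tgt_file_format": environ.get("TGT_FILE_FORMAT", "parquet"),
    }


def spark_master_for(env):
    """Run locally in dev and hand scheduling to YARN elsewhere."""
    return "local" if env.upper() == "DEV" else "yarn"


def get_spark_session(env, app_name):
    """Build a SparkSession for the given environment."""
    # Imported lazily so the helpers above work without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master(spark_master_for(env))
        .appName(app_name)
        .getOrCreate()
    )
```

Keeping the environment lookup separate from session creation makes it possible to check the configuration logic without starting Spark.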

Requirements

A Computer with at least 8 GB RAM
Programming Experience using Python is highly desired as some of the topics are demonstrated using Python
SQL Experience is highly desired as some of the topics are demonstrated using SQL
Nice to have Data Engineering Experience using Pandas or Pyspark
This course is ideal for experienced data engineers to add AWS Analytics Services as key skills to their profile

Description

Data Engineering is all about building Data Pipelines to get data from multiple sources into Data Lakes or Data Warehouses and then from Data Lakes or Data Warehouses to downstream systems. As part of this course, I will walk you through how to build Data Engineering Pipelines using AWS Data Analytics Stack. It includes services such as Glue, Elastic Map Reduce (EMR), Lambda Functions, Athena, Kinesis, and many more.

Here are the high-level steps which you will follow as part of the course.

Setup Development Environment
Getting Started with AWS
Storage - All about AWS s3 (Simple Storage Service)
User Level Security - Managing Users, Roles, and Policies using IAM
Infrastructure - AWS EC2 (Elastic Compute Cloud)
Data Ingestion using AWS Lambda Functions
Overview of AWS Glue Components
Setup Spark History Server for AWS Glue Jobs
Deep Dive into AWS Glue Catalog
Exploring AWS Glue Job APIs
AWS Glue Job Bookmarks
Development Life Cycle of Pyspark
Getting Started with AWS EMR
Deploying Spark Applications using AWS EMR
Streaming Pipeline using AWS Kinesis
Consuming Data from AWS s3 using boto3 ingested using AWS Kinesis
Populating GitHub Data to AWS Dynamodb
Overview of Amazon AWS Athena
Amazon AWS Athena using AWS CLI
Amazon AWS Athena using Python boto3
Getting Started with Amazon AWS Redshift
Copy Data from AWS s3 into AWS Redshift Tables
Develop Applications using AWS Redshift Cluster
AWS Redshift Tables with Distkeys and Sortkeys
AWS Redshift Federated Queries and Spectrum

Here are the details about what you will be learning as part of this course. We will cover most of the commonly used services with hands-on practice which are available under AWS Data Analytics.

Getting Started with AWS

As part of this section, you will be going through the details related to getting started with AWS.

Introduction - AWS Getting Started
Create s3 Bucket
Create AWS IAM Group and AWS IAM User to have required access on s3 Bucket and other services
Overview of AWS IAM Roles
Create and Attach Custom AWS IAM Policy to both AWS IAM Groups as well as Users
Configure and Validate AWS CLI to access AWS Services using AWS CLI Commands

Storage - All about AWS s3 (Simple Storage Service)

AWS s3 is one of the most prominent fully managed AWS services. All IT professionals who would like to work on AWS should be familiar with it. We will get into quite a few common features related to AWS s3 in this section.

Getting Started with AWS S3
Setup Data Set locally to upload to AWS s3
Adding AWS S3 Buckets and Managing Objects (files and folders) in AWS s3 buckets
Version Control for AWS S3 Buckets
Cross-Region Replication for AWS S3 Buckets
Overview of AWS S3 Storage Classes
Overview of AWS S3 Glacier
Managing AWS S3 using AWS CLI Commands
Managing Objects in AWS S3 using CLI - Lab

User Level Security - Managing Users, Roles, and Policies using IAM

Once you start working on AWS, you need to understand the permissions you have as a non-admin user. As part of this section, you will understand the details related to AWS IAM users, groups, roles as well as policies.

Creating AWS IAM Users
Logging into AWS Management Console using AWS IAM User
Validate Programmatic Access to AWS IAM User
AWS IAM Identity-based Policies
Managing AWS IAM Groups
Managing AWS IAM Roles
Overview of Custom AWS IAM Policies
Managing AWS IAM users, groups, roles as well as policies using AWS CLI Commands

Infrastructure - AWS EC2 (Elastic Compute Cloud) Basics

AWS EC2 instances are virtual machines on AWS. As part of this section, we will go through some of the basics of AWS EC2.

Getting Started with AWS EC2
Create AWS EC2 Key Pair
Launch AWS EC2 Instance
Connecting to AWS EC2 Instance
AWS EC2 Security Groups Basics
AWS EC2 Public and Private IP Addresses
AWS EC2 Life Cycle
Allocating and Assigning AWS Elastic IP Address
Managing AWS EC2 Using AWS CLI
Upgrade or Downgrade AWS EC2 Instances

Infrastructure - AWS EC2 Advanced

In this section, we will continue with AWS EC2 to understand how we can manage EC2 instances using AWS Commands and also how to install additional OS modules leveraging bootstrap scripts.

Getting Started with AWS EC2
Understanding AWS EC2 Metadata
Querying on AWS EC2 Metadata
Filtering on AWS EC2 Metadata
Using Bootstrapping Scripts with AWS EC2 Instances to install additional software on AWS EC2 instances
Create an AWS AMI using AWS EC2 Instances
Validate AWS AMI - Lab

Data Ingestion using Lambda Functions

AWS Lambda functions are nothing but serverless functions. In this section, we will understand how we can develop and deploy Lambda functions using Python as a programming language. We will also see how to maintain a bookmark or checkpoint using s3.

Hello World using AWS Lambda
Setup Project for local development of AWS Lambda Functions
Deploy Project to AWS Lambda console
Develop download functionality using requests for AWS Lambda Functions
Using 3rd party libraries in AWS Lambda Functions
Validating AWS s3 access for local development of AWS Lambda Functions
Develop upload functionality to s3 using AWS Lambda Functions
Validating AWS Lambda Functions using AWS Lambda Console
Run AWS Lambda Functions using AWS Lambda Console
Validating files incrementally downloaded using AWS Lambda Functions
Reading and Writing Bookmark to s3 using AWS Lambda Functions
Maintaining Bookmark on s3 using AWS Lambda Functions
Review the incremental upload logic developed using AWS Lambda Functions
Deploying AWS Lambda Functions
Schedule AWS Lambda Functions using AWS EventBridge

Overview of AWS Glue Components

In this section, we will get a broad overview of all important Glue Components such as Glue Crawler, Glue Databases, Glue Tables, etc. We will also understand how to validate Glue tables using AWS Athena. AWS Glue (especially Glue Catalog) is one of the key components in the realm of AWS Data Analytics Services.

Introduction - Overview of AWS Glue Components
Create AWS Glue Crawler and AWS Glue Catalog Database as well as Table
Analyze Data using AWS Athena
Creating AWS S3 Bucket and Role to create AWS Glue Catalog Tables using Crawler on the s3 location
Create and Run the AWS Glue Job to process data in AWS Glue Catalog Tables
Validate using AWS Glue Catalog Table and by running queries using AWS Athena
Create and Run AWS Glue Trigger
Create AWS Glue Workflow
Run AWS Glue Workflow and Validate

Setup Spark History Server for AWS Glue Jobs

AWS Glue uses Apache Spark under the hood to process the data. It is important to set up Spark History Server for AWS Glue Jobs so that we can troubleshoot any issues.

Introduction - Spark History Server for AWS Glue
Setup Spark History Server on AWS
Clone AWS Glue Samples repository
Build AWS Glue Spark UI Container
Update AWS IAM Policy Permissions
Start AWS Glue Spark UI Container

Deep Dive into AWS Glue Catalog

AWS Glue has several components, but the most important ones are AWS Glue Crawlers, Databases, and Catalog Tables. In this section, we will go through some of the most important and commonly used features of the AWS Glue Catalog.

Prerequisites for AWS Glue Catalog Tables
Steps for Creating AWS Glue Catalog Tables
Download Data Set to use to create AWS Glue Catalog Tables
Upload data to s3 to crawl using AWS Glue Crawler to create required AWS Glue Catalog Tables
Create AWS Glue Catalog Database - itvghlandingdb
Create AWS Glue Catalog Table - ghactivity
Running Queries using AWS Athena - ghactivity
Crawling Multiple Folders using AWS Glue Crawlers
Managing AWS Glue Catalog using AWS CLI
Managing AWS Glue Catalog using Python Boto3
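
As an illustration of managing the catalog with boto3, here is a minimal sketch that lists the tables of one catalog database. The helper names are hypothetical, and the glue argument is assumed to be a boto3 Glue client (boto3.client("glue")) with permission to read the catalog.

```python
def summarize_tables(response: dict) -> list:
    """Pick (name, s3 location) pairs out of a Glue get_tables response."""
    return [
        (t["Name"], t.get("StorageDescriptor", {}).get("Location"))
        for t in response.get("TableList", [])
    ]


def list_tables(glue, database: str) -> list:
    """Collect tables across all result pages for one catalog database."""
    tables = []
    paginator = glue.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        tables.extend(summarize_tables(page))
    return tables


# Usage (needs AWS credentials, so it is left as a comment):
#   import boto3
#   list_tables(boto3.client("glue"), "itvghlandingdb")
```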

Exploring AWS Glue Job APIs

Once we deploy AWS Glue jobs, we can manage them using AWS Glue Job APIs. In this section, we will get an overview of the AWS Glue Job APIs used to run and manage jobs.

Update AWS IAM Role for AWS Glue Job
Generate baseline AWS Glue Job
Running baseline AWS Glue Job
AWS Glue Script for Partitioning Data
Validating using AWS Athena

Understanding AWS Glue Job Bookmarks

AWS Glue Job Bookmarks can be leveraged to maintain the bookmarks or checkpoints for incremental loads. In this section, we will go through the details related to AWS Glue Job Bookmarks.

Introduction to AWS Glue Job Bookmarks
Cleaning up the data to run AWS Glue Jobs
Overview of AWS Glue CLI and Commands
Run AWS Glue Job using AWS Glue Bookmark
Validate AWS Glue Bookmark using AWS CLI
Add new data to the landing zone to run AWS Glue Jobs using Bookmarks
Rerun AWS Glue Job using Bookmark
Validate AWS Glue Job Bookmark and Files for Incremental run
Recrawl the AWS Glue Catalog Table using AWS CLI Commands
Run AWS Athena Queries for Data Validation
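
A run with bookmarks enabled might look like the sketch below. It assumes the glue argument is a boto3 Glue client and that the job already exists. The helper names are hypothetical, while --job-bookmark-option is the job argument Glue uses to control bookmarks.

```python
def bookmark_run_kwargs(job_name: str, enabled: bool = True) -> dict:
    """Build start_job_run arguments that switch job bookmarks on or off."""
    option = "job-bookmark-enable" if enabled else "job-bookmark-disable"
    return {"JobName": job_name, "Arguments": {"--job-bookmark-option": option}}


def run_with_bookmark(glue, job_name: str) -> str:
    """Start the job with bookmarks enabled and return the run id."""
    run = glue.start_job_run(**bookmark_run_kwargs(job_name))
    return run["JobRunId"]


def bookmark_state(glue, job_name: str) -> dict:
    """Fetch the stored bookmark entry so an incremental run can be validated."""
    return glue.get_job_bookmark(JobName=job_name)["JobBookmarkEntry"]
```

After new files land in the source location, a rerun with the same arguments should pick up only the files that the previous run has not already recorded in the bookmark.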

Development Lifecycle for Pyspark

In this section, we will focus on the development of Spark applications using Pyspark. We will use this application later while exploring EMR in detail.

Setup Virtual Environment and Install Pyspark
Getting Started with Pycharm
Passing Run Time Arguments
Accessing OS Environment Variables
Getting Started with Spark
Create Function for Spark Session
Setup Sample Data
Read data from files
Process data using Spark APIs
Write data to files
Validating Writing Data to Files
Productionizing the Code
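
The lifecycle above (environment variables, a function for the Spark session, then read, process and write) could be organized along these lines. This is only a sketch: the variable names ENVIRON, SRC_DIR and TGT_DIR and their defaults are assumptions, and pyspark is imported lazily so the configuration helper can be used without it.

```python
import os


def load_config(environ=os.environ) -> dict:
    """Read run-time settings from OS environment variables.

    The variable names and defaults are hypothetical.
    """
    return {
        "env": environ.get("ENVIRON", "DEV"),
        "src_dir": environ.get("SRC_DIR", "data/landing"),
        "tgt_dir": environ.get("TGT_DIR", "data/raw"),
    }


def get_spark_session(env: str, app_name: str):
    """Create a local session for development, or defer to the cluster otherwise."""
    from pyspark.sql import SparkSession  # lazy import; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    if env == "DEV":
        builder = builder.master("local")
    return builder.getOrCreate()
```

Keeping the master setting out of the code for non-development environments lets the same application be submitted to a cluster later without changes.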

Getting Started with AWS EMR (Elastic Map Reduce)

As part of this section, we will understand how to get started with an AWS EMR Cluster. We will primarily focus on the AWS EMR Web Console. Elastic Map Reduce is one of the key services in AWS Data Analytics Services. It provides the capability to run applications that process large-scale data using distributed computing frameworks such as Spark.

Planning for AWS EMR Cluster
Create AWS EC2 Key Pair for AWS EMR Cluster
Setup AWS EMR Cluster with Apache Spark
Understanding Summary of AWS EMR Cluster
Review AWS EMR Cluster Application User Interfaces
Review AWS EMR Cluster Monitoring
Review AWS EMR Cluster Hardware and Cluster Scaling Policy
Review AWS EMR Cluster Configurations
Review AWS EMR Cluster Events
Review AWS EMR Cluster Steps
Review AWS EMR Cluster Bootstrap Actions
Connecting to AWS EMR Master Node using SSH
Disabling Termination Protection for AWS EMR Cluster and Terminating the AWS EMR Cluster
Clone and Create a New AWS EMR Cluster
Listing AWS S3 Buckets and Objects using AWS CLI on AWS EMR Cluster
Listing AWS S3 Buckets and Objects using HDFS CLI on AWS EMR Cluster
Managing Files in AWS S3 using HDFS CLI on AWS EMR Cluster
Review AWS Glue Catalog Databases and Tables
Accessing AWS Glue Catalog Databases and Tables using AWS EMR Cluster
Accessing spark-sql CLI of AWS EMR Cluster
Accessing pyspark CLI of AWS EMR Cluster
Accessing spark-shell CLI of AWS EMR Cluster
Create AWS EMR Cluster for Notebooks

Deploying Spark Applications using AWS EMR

As part of this section, we will understand how we typically deploy Spark Applications using AWS EMR. We will be using the Spark Application we developed earlier.

Deploying Applications using AWS EMR - Introduction
Setup AWS EMR Cluster to deploy applications
Validate SSH Connectivity to Master node of AWS EMR Cluster
Setup Jupyter Notebook Environment on AWS EMR Cluster
Create required AWS s3 Bucket for AWS EMR Cluster
Upload GHActivity Data to s3 so that we can process using Spark Application deployed on AWS EMR Cluster
Validate Application using AWS EMR Compatible Versions of Python and Spark
Deploy Spark Application to AWS EMR Master Node
Create user space for ec2-user on AWS EMR Cluster
Run Spark Application using spark-submit on AWS EMR Master Node
Validate Data using Jupyter Notebooks on AWS EMR Cluster
Clone and Start Auto Terminated AWS EMR Cluster
Delete Data Populated by GHActivity Application using AWS EMR Cluster
Differences between Spark Client and Cluster Deployment Modes on AWS EMR Cluster
Running Spark Application using Cluster Mode on AWS EMR Cluster
Overview of Adding Pyspark Application as Step to AWS EMR Cluster
Deploy Spark Application to AWS S3 to run using AWS EMR Steps
Running Spark Applications as AWS EMR Steps in client mode
Running Spark Applications as AWS EMR Steps in cluster mode
Validate AWS EMR Step Execution of Spark Application
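
Adding a Pyspark application as an EMR step can be sketched as below. The emr argument is assumed to be a boto3 EMR client, and the step name, script location and cluster id are hypothetical. Whether the script can be referenced directly from s3 in client mode depends on the EMR release, so treat the client-mode variant as an assumption to verify.

```python
def spark_step(name: str, script_uri: str, deploy_mode: str = "cluster") -> dict:
    """Describe a spark-submit invocation as an EMR step definition."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar runs the given command on the cluster
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--master", "yarn",
                "--deploy-mode", deploy_mode,
                script_uri,
            ],
        },
    }


def add_step(emr, cluster_id: str, step: dict) -> str:
    """Submit one step to a running cluster and return its step id."""
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```

In client mode the driver runs on the node that submits the application, while in cluster mode YARN launches the driver inside the cluster, which is why the two are covered as separate lectures.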

Streaming Data Ingestion Pipeline using AWS Kinesis

As part of this section, we will go through details related to the streaming data ingestion pipeline using AWS Kinesis which is a streaming service of AWS Data Analytics Services. We will use AWS Kinesis Firehose Agent and AWS Kinesis Delivery Stream to read the data from log files and ingest it into AWS s3.

Building Streaming Pipeline using AWS Kinesis Firehose Agent and Delivery Stream
Rotating Logs so that the files are created frequently which will be eventually ingested using AWS Kinesis Firehose Agent and AWS Kinesis Firehose Delivery Stream
Set up AWS Kinesis Firehose Agent to get data from logs into AWS Kinesis Delivery Stream.
Create AWS Kinesis Firehose Delivery Stream
Planning the Pipeline to ingest data into s3 using AWS Kinesis Delivery Stream
Create AWS IAM Group and User for Streaming Pipelines using AWS Kinesis Components
Granting Permissions to AWS IAM User using Policy for Streaming Pipelines using AWS Kinesis Components
Configure AWS Kinesis Firehose Agent to read the data from log files and ingest it into AWS Kinesis Firehose Delivery Stream.
Start and Validate AWS Kinesis Firehose Agent
Conclusion - Building Simple Streaming Pipeline using AWS Kinesis Firehose
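
The agent is driven by a JSON configuration file that maps log file patterns to delivery streams. The sketch below generates such a configuration. The log path, stream name and region are hypothetical placeholders, and the output would typically be saved as the agent's agent.json file.

```python
import json


def agent_config(file_pattern: str, delivery_stream: str, region: str = "us-east-1") -> dict:
    """Build a Kinesis agent configuration with a single Firehose flow."""
    return {
        "cloudwatch.emitMetrics": True,
        "firehose.endpoint": f"firehose.{region}.amazonaws.com",
        "flows": [
            {"filePattern": file_pattern, "deliveryStream": delivery_stream}
        ],
    }


# Hypothetical log location and delivery stream name.
config = agent_config("/var/log/example/access.log*", "example-delivery-stream")
print(json.dumps(config, indent=2))
```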

Consuming Data from AWS s3 using Python boto3 ingested using AWS Kinesis

Once data is ingested into AWS s3, we will understand how it can be processed using boto3.

Customizing AWS s3 folder using AWS Kinesis Delivery Stream
Create AWS IAM Policy to read from AWS s3 Bucket
Validate AWS s3 access using AWS CLI
Setup Python Virtual Environment to explore boto3
Validating access to AWS s3 using Python boto3
Read Content from AWS s3 object
Read multiple AWS s3 Objects
Get the number of AWS s3 Objects using Marker
Get the size of AWS s3 Objects using Marker
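
Counting and sizing objects with a marker amounts to paging through list_objects until the listing is no longer truncated. Here is a minimal sketch, assuming the s3 argument is a boto3 S3 client. The function name is hypothetical.

```python
def count_and_size(s3, bucket: str, prefix: str = "") -> tuple:
    """Return (object count, total bytes) under a prefix using Marker paging."""
    count, total, marker = 0, 0, None
    while True:
        kwargs = {"Bucket": bucket, "Prefix": prefix}
        if marker is not None:
            kwargs["Marker"] = marker  # resume after the last key seen
        response = s3.list_objects(**kwargs)
        contents = response.get("Contents", [])
        count += len(contents)
        total += sum(obj["Size"] for obj in contents)
        if not response.get("IsTruncated") or not contents:
            return count, total
        # NextMarker is only returned when a delimiter is used,
        # so fall back to the last key of the current page.
        marker = response.get("NextMarker") or contents[-1]["Key"]
```

Each call returns at most one page of keys, so the marker is what lets the loop continue from where the previous page ended.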

Populating GitHub Data to AWS Dynamodb

As part of this section, we will understand how we can populate data to AWS Dynamodb tables using Python as a programming language.

Install required libraries to get GitHub Data to AWS Dynamodb tables.
Understanding GitHub APIs
Setting up GitHub API Token
Understanding GitHub Rate Limit
Create New Repository for since
Extracting Required Information using Python
Processing Data using Python
Grant Permissions to create AWS dynamodb tables using boto3
Create AWS Dynamodb Tables
AWS Dynamodb CRUD Operations
Populate AWS Dynamodb Table
AWS Dynamodb Batch Operations

Overview of Amazon AWS Athena

As part of this section, we will understand how to get started with AWS Athena using AWS Web console. We will also focus on basic DDL and DML or CRUD Operations using AWS Athena Query Editor.

Getting Started with Amazon AWS Athena
Quick Recap of AWS Glue Catalog Databases and Tables
Access AWS Glue Catalog Databases and Tables using AWS Athena Query Editor
Create a Database and Table using AWS Athena
Populate Data into Table using AWS Athena
Using CTAS to create tables using AWS Athena
Overview of Amazon AWS Athena Architecture
Amazon AWS Athena Resources and relationship with Hive
Create a Partitioned Table using AWS Athena
Develop Query for Partitioned Column
Insert into Partitioned Tables using AWS Athena
Validate Data Partitioning using AWS Athena
Drop AWS Athena Tables and Delete Data Files
Drop Partitioned Table using AWS Athena
Data Partitioning in AWS Athena using CTAS

Amazon AWS Athena using AWS CLI

As part of this section, we will understand how to interact with AWS Athena using AWS CLI Commands.

Amazon AWS Athena using AWS CLI - Introduction
Get help and list AWS Athena databases using AWS CLI
Managing AWS Athena Workgroups using AWS CLI
Run AWS Athena Queries using AWS CLI
Get AWS Athena Table Metadata using AWS CLI
Run AWS Athena Queries with a custom location using AWS CLI
Drop AWS Athena table using AWS CLI
Run CTAS under AWS Athena using AWS CLI

Amazon AWS Athena using Python boto3

As part of this section, we will understand how to interact with AWS Athena using Python boto3.

Amazon AWS Athena using Python boto3 - Introduction
Getting Started with Managing AWS Athena using Python boto3
List Amazon AWS Athena Databases using Python boto3
List Amazon AWS Athena Tables using Python boto3
Run Amazon AWS Athena Queries with boto3
Review AWS Athena Query Results using boto3
Persist Amazon AWS Athena Query Results in Custom Location using boto3
Processing AWS Athena Query Results using Pandas
Run CTAS against Amazon AWS Athena using Python boto3
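
Running a query from boto3 involves starting the execution, polling its state, and then reading the results. The sketch below assumes the athena argument is a boto3 Athena client. The function names are hypothetical, and only the first page of results is read.

```python
import time


def rows_to_dicts(result: dict) -> list:
    """Convert a get_query_results response into a list of dicts.

    For SELECT queries the first row holds the column names.
    """
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [col.get("VarCharValue") for col in rows[0]["Data"]]
    return [
        dict(zip(header, (col.get("VarCharValue") for col in row["Data"])))
        for row in rows[1:]
    ]


def run_query(athena, sql: str, database: str, output_location: str,
              poll_seconds: float = 1.0) -> list:
    """Run a query, wait for it to finish, and return the first page of rows."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {query_id} ended in state {state}")
    return rows_to_dicts(athena.get_query_results(QueryExecutionId=query_id))
```

The list of dicts returned here can be passed to pandas.DataFrame for further processing.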

Getting Started with Amazon AWS Redshift

As part of this section, we will understand how to get started with AWS Redshift using AWS Web console. We will also focus on basic DDL and DML or CRUD Operations using AWS Redshift Query Editor.

Getting Started with Amazon AWS Redshift - Introduction
Create AWS Redshift Cluster using Free Trial
Connecting to Database using AWS Redshift Query Editor
Get a list of tables querying information schema
Run Queries against AWS Redshift Tables using Query Editor
Create AWS Redshift Table using Primary Key
Insert Data into AWS Redshift Tables
Update Data in AWS Redshift Tables
Delete data from AWS Redshift tables
Redshift Saved Queries using Query Editor
Deleting AWS Redshift Cluster
Restore AWS Redshift Cluster from Snapshot

Copy Data from s3 into AWS Redshift Tables

As part of this section, we will go through the details about copying data from s3 into AWS Redshift tables using the AWS Redshift Copy command.

Copy Data from s3 to AWS Redshift - Introduction
Setup Data in s3 for AWS Redshift Copy
Create Database and Table for AWS Redshift Copy Command
Create IAM User with full access on s3 for AWS Redshift Copy
Run Copy Command to copy data from s3 to AWS Redshift Table
Troubleshoot Errors related to AWS Redshift Copy Command
Run Copy Command to copy from s3 to AWS Redshift table
Validate using queries against AWS Redshift Table
Overview of AWS Redshift Copy Command
Create IAM Role for AWS Redshift to access s3
Copy Data from s3 to AWS Redshift table using IAM Role
Setup JSON Dataset in s3 for AWS Redshift Copy Command
Copy JSON Data from s3 to AWS Redshift table using IAM Role
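
A COPY command that uses an IAM role has the shape sketched below. The helper, table name, s3 location and role ARN are hypothetical. Plain string formatting is used for readability, so the inputs must be trusted values.

```python
def copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                   data_format: str = "CSV") -> str:
    """Build a Redshift COPY statement for CSV or JSON data in s3."""
    formats = {"CSV": "FORMAT AS CSV", "JSON": "FORMAT AS JSON 'auto'"}
    if data_format not in formats:
        raise ValueError(f"Unsupported format: {data_format}")
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"{formats[data_format]};"
    )


# Hypothetical table, bucket and role.
print(copy_statement(
    "orders",
    "s3://example-bucket/retail_db/orders/",
    "arn:aws:iam::123456789012:role/example-redshift-s3-role",
))
```

With JSON 'auto', Redshift matches the JSON field names to the column names of the target table.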

Develop Applications using AWS Redshift Cluster

As part of this section, we will understand how to develop applications against databases and tables created as part of AWS Redshift Cluster.

Develop application using AWS Redshift Cluster - Introduction
Allocate Elastic IP for AWS Redshift Cluster
Enable Public Accessibility for AWS Redshift Cluster
Update Inbound Rules in Security Group to access AWS Redshift Cluster
Create Database and User in AWS Redshift Cluster
Connect to the database in AWS Redshift using psql
Change Owner on AWS Redshift Tables
Download AWS Redshift JDBC Jar file
Connect to AWS Redshift Databases using IDEs such as SQL Workbench
Setup Python Virtual Environment for AWS Redshift
Run Simple Query against AWS Redshift Database Table using Python
Truncate AWS Redshift Table using Python
Create IAM User to copy from s3 to AWS Redshift Tables
Validate Access of IAM User using Boto3
Run AWS Redshift Copy Command using Python

AWS Redshift Tables with Distkeys and Sortkeys

As part of this section, we will go through AWS Redshift-specific features such as distribution keys and sort keys to create AWS Redshift tables.

AWS Redshift Tables with Distkeys and Sortkeys - Introduction
Quick Review of AWS Redshift Architecture
Create multi-node AWS Redshift Cluster
Connect to AWS Redshift Cluster using Query Editor
Create AWS Redshift Database
Create AWS Redshift Database User
Create AWS Redshift Database Schema
Default Distribution Style of AWS Redshift Table
Grant Select Permissions on Catalog to AWS Redshift Database User
Update Search Path to query AWS Redshift system tables
Validate AWS Redshift table with DISTSTYLE AUTO
Create AWS Redshift Cluster from Snapshot to the original state
Overview of Node Slices in AWS Redshift Cluster
Overview of Distribution Styles related to AWS Redshift tables
Distribution Strategies for retail tables in AWS Redshift Databases
Create AWS Redshift tables with distribution style all
Troubleshoot and Fix Load or Copy Errors
Create AWS Redshift Table with Distribution Style Auto
Create AWS Redshift Tables using Distribution Style Key
Delete AWS Redshift Cluster with a manual snapshot
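
The distribution styles covered above differ only in a few clauses of the CREATE TABLE statement. The sketch below generates those clauses. The helper and the table and column names are hypothetical.

```python
def create_table_ddl(table: str, columns: list, diststyle: str = "AUTO",
                     distkey: str = None, sortkey: str = None) -> str:
    """Build CREATE TABLE DDL with a Redshift distribution style."""
    style = diststyle.upper()
    if style not in {"AUTO", "EVEN", "ALL", "KEY"}:
        raise ValueError(f"Unknown distribution style: {diststyle}")
    # DISTKEY is required for KEY and not used with the other styles.
    if (style == "KEY") != (distkey is not None):
        raise ValueError("Provide distkey if and only if diststyle is KEY")
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = [f"CREATE TABLE {table} ({cols})", f"DISTSTYLE {style}"]
    if distkey:
        parts.append(f"DISTKEY ({distkey})")
    if sortkey:
        parts.append(f"SORTKEY ({sortkey})")
    return " ".join(parts) + ";"


# Small lookup table copied to every node; large table distributed by key.
print(create_table_ddl("departments", [("department_id", "INT")], "ALL"))
print(create_table_ddl("orders", [("order_id", "INT"), ("order_status", "VARCHAR(30)")],
                       "KEY", distkey="order_id", sortkey="order_id"))
```

A common rule of thumb is ALL for small dimension tables and KEY on the join column for large tables that are joined frequently, though the right choice depends on the data and queries.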

AWS Redshift Federated Queries and Spectrum

As part of this section, we will go through some of the advanced features of Redshift such as AWS Redshift Federated Queries and AWS Redshift Spectrum.

AWS Redshift Federated Queries and Spectrum - Introduction
Overview of integrating AWS RDS and AWS Redshift for Federated Queries
Create IAM Role for AWS Redshift Cluster
Setup Postgres Database Server for AWS Redshift Federated Queries
Create tables in Postgres Database for AWS Redshift Federated Queries
Creating Secret using Secrets Manager for Postgres Database
Accessing Secret Details using Python Boto3
Reading JSON Data to Dataframe using Pandas
Write JSON Data to AWS Redshift Database Tables using Pandas
Create AWS IAM Policy for Secret and associate with Redshift Role
Create AWS Redshift Cluster using AWS IAM Role with permissions on secret
Create AWS Redshift External Schema to Postgres Database
Update AWS Redshift Cluster Network Settings for Federated Queries
Performing ETL using AWS Redshift Federated Queries
Clean up resources added for AWS Redshift Federated Queries
Grant Access on AWS Glue Data Catalog to AWS Redshift Cluster for Spectrum
Setup AWS Redshift Clusters to run queries using Spectrum
Quick Recap of AWS Glue Catalog Database and Tables for AWS Redshift Spectrum
Create External Schema using AWS Redshift Spectrum
Run Queries using AWS Redshift Spectrum
Cleanup the AWS Redshift Cluster
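
Both features are exposed through external schemas: one that points to a Postgres database for federated queries, and one that points to a Glue Catalog database for Spectrum. The sketch below builds the two statements. All schema names, hosts and ARNs passed in are hypothetical.

```python
def federated_schema_sql(schema: str, database: str, host: str,
                         iam_role_arn: str, secret_arn: str,
                         remote_schema: str = "public", port: int = 5432) -> str:
    """External schema over a Postgres database (federated queries)."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM POSTGRES\n"
        f"DATABASE '{database}' SCHEMA '{remote_schema}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


def spectrum_schema_sql(schema: str, glue_database: str, iam_role_arn: str) -> str:
    """External schema over a Glue Catalog database (Spectrum)."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
```

The IAM role attached to the cluster needs access to the secret in the federated case and to the Glue Catalog and the underlying s3 data in the Spectrum case, which is why the lectures set up those permissions first.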

Who this course is for:

Beginner or Intermediate Data Engineers who want to learn AWS Analytics Services for Data Engineering
Intermediate Application Engineers who want to explore Data Engineering using AWS Analytics Services
Data and Analytics Engineers who want to learn Data Engineering using AWS Analytics Services
Testers who want to learn key skills to test Data Engineering applications built using AWS Analytics Services

Course content

Introduction to the course (7 lectures • 21min)

Setup Local Development Environment for AWS on Windows 10 or Windows 11 (11 lectures • 55min)

Setup Local Development Environment for AWS on Mac (7 lectures • 28min)

AWS Getting Started with s3, IAM and CLI (12 lectures • 25min)

Storage - Deep Dive into AWS Simple Storage Service aka s3 (18 lectures • 1hr)

AWS Security using IAM - Managing AWS Users, Roles and Policies using AWS IAM (16 lectures • 59min)

Infrastructure - Getting Started with AWS Elastic Cloud Compute aka EC2 (20 lectures • 1hr 10min)

Infrastructure - AWS EC2 Advanced (12 lectures • 35min)

Data Ingestion using Lambda Functions (29 lectures • 1hr 47min)

Development Lifecycle for Pyspark (19 lectures • 1hr 8min)
