Data Engineering, Serverless ETL & BI on Amazon Cloud

Name: Data Engineering, Serverless ETL & BI on Amazon Cloud
Rating: 4.2 (761 reviews)

Data warehousing & ETL on AWS Cloud

Created by Sid Raghunath

Last updated 10/2023

English

What you'll learn

Set up a data warehouse on Amazon Cloud from scratch using Redshift
Understand AWS Athena and when to use it
Store data in S3 data lakes using the Parquet columnar file format and reduce the amount of data Athena has to scan
Automate ETL processes using serverless components such as AWS Glue, Data Pipeline, and Lambda functions
Centralize data using Redshift Spectrum
Trigger and automate Glue jobs using Lambda functions
Pull data into QuickSight, the BI reporting and visualization offering from AWS

Course content

9 sections • 51 lectures • 6h 1m total length

Course & Project Overview (3:27)
Lecture 4 - Feedback and Learn More (1:06)
AWS Billing Components and Precautions to be taken (3:29)

Section Introduction (1:06)
Introduction to Redshift (3:01)
Redshift vs Snowflake vs BigQuery (2:56)
Introduction to AWS Glue (3:07)
Lab - Create MySQL Instance on AWS RDS (2:54)
Lab - Create MySQL DB/Tables and Data Preparation (8:32)
Lab - Deploy Glue Extraction Job from MySQL to S3 (11:53)
Lab - Use AWS Secrets Manager and Glue Job Arguments (7:14)
Lab - Create and Set Up Redshift Cluster (5:39)
Provision a single-node Redshift cluster and enable public access. Connect from a local SQL client such as DBeaver, using the cluster endpoint and port 5439, to test the connection.
Lab - Ingest Data into Redshift using Copy Commands (8:05)
Learn to ingest data into a Redshift cluster using COPY commands, creating a transactional layer schema with orders, order items, reviews, and products from CSV and Parquet files in S3.
Lab - Deploy Glue Jobs for Redshift Data Ingestion (6:05)
Introduction to AWS Step Functions (1:21)
Lab - Execute Step Functions for AWS Glue Jobs (8:44)
Lab - Handle Incremental Data Loads into Redshift (7:26)
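
As a rough illustration of the COPY-based ingestion lab in this section, the sketch below builds Redshift COPY statements for CSV and Parquet files in S3. The table name, bucket path, and IAM role ARN are hypothetical placeholders, not values from the course; the helper only assembles the SQL text, which you would then run from a SQL client.

```python
def build_copy_sql(table: str, s3_path: str, iam_role: str, fmt: str = "csv") -> str:
    """Return a Redshift COPY statement for CSV or Parquet files stored in S3."""
    fmt = fmt.lower()
    if fmt == "csv":
        # IGNOREHEADER 1 skips the header row of each CSV file.
        options = "FORMAT AS CSV IGNOREHEADER 1"
    elif fmt == "parquet":
        options = "FORMAT AS PARQUET"
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' {options};"


# Hypothetical names, for illustration only.
ROLE = "arn:aws:iam::123456789012:role/example-redshift-role"
print(build_copy_sql("sales.orders", "s3://example-bucket/orders/", ROLE, "csv"))
print(build_copy_sql("sales.reviews", "s3://example-bucket/reviews/", ROLE, "parquet"))
```

Keeping the statement in a small helper makes it easy to reuse the same load pattern for each table in the transactional layer.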

Section Overview and Introduction (6:13)
Enrich and centralize data by stitching transactional data with third-party user behavior data using AWS Glue crawlers, Athena, and PySpark on AWS. Centralize the data in Redshift for analytics.
Lab - AWS Glue Crawler Setup (11:40)
Lab - Athena - Data and Table Scan Explanation (6:03)
Lab - PySpark Development Local (15:33)
Lab - Port Local PySpark Script to AWS Glue (8:08)
Lab - AWS Glue PySpark - Parquet File Format & Snappy Compression (9:43)
Lab - AWS Lambda to Trigger Glue Jobs (11:19)
Lab - Glue Crawler Run - Populate Partitions in Data Catalog (3:16)
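
To give a feel for the Lambda-to-Glue trigger lab in this section, here is a minimal sketch of a Lambda handler that starts a Glue job. It assumes the function is invoked by an S3 upload event, which the listing does not state; the job name and the `--source_path` argument are hypothetical. The Glue client can be passed in so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "example-parquet-job"  # hypothetical job name


def lambda_handler(event, context, glue_client=None):
    """Start a Glue job run for the S3 object named in the triggering event."""
    if glue_client is None:
        import boto3  # bundled with the AWS Lambda Python runtime

        glue_client = boto3.client("glue")

    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["s3"]["object"]["key"])

    response = glue_client.start_job_run(
        JobName=GLUE_JOB_NAME,
        # Hypothetical job argument; the Glue script would read it at startup.
        Arguments={"--source_path": f"s3://{bucket}/{key}"},
    )
    return {"job_run_id": response["JobRunId"]}
```

Injecting the client is a design choice for testability only; in a deployed function the default `boto3` client would be used.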

Introduction to Redshift Spectrum (6:21)
Lab - Redshift Spectrum | Create External Schema (10:26)
Create an external schema with Redshift Spectrum linked to the data catalog to query S3 data as if it resides in Redshift, then create an external table from CSV files.
Lab - Redshift Spectrum | Cross Database Joins (2:56)
Learn to perform cross-database joins in Redshift Spectrum by joining parquet_output with an RDBMS table to fetch the English category names and group by year.
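
The external schema step in this section can be sketched as follows. The schema name, Glue Data Catalog database name, and IAM role ARN are hypothetical; the helper only assembles the `CREATE EXTERNAL SCHEMA` statement, which you would run in Redshift.

```python
def build_external_schema_sql(schema: str, catalog_database: str, iam_role: str) -> str:
    """Return SQL that maps a Glue Data Catalog database into Redshift Spectrum."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role}';"
    )


# Hypothetical names, for illustration only.
print(
    build_external_schema_sql(
        "spectrum_schema",
        "example_glue_db",
        "arn:aws:iam::123456789012:role/example-spectrum-role",
    )
)
```

Once the schema exists, tables registered in that catalog database can be referenced as `spectrum_schema.<table>` alongside regular Redshift tables.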

Redshift - Sort Keys and Compound Sort Keys (5:56)
Analyze how Redshift sort keys, especially compound sort keys, organize data by column order to speed up queries. Learn how the first column affects data block access and overall performance.
Redshift - Interleaved Sort Keys (4:29)
Redshift - Vacuum Operations (7:13)
Learn how vacuum reclaims space from deletes and updates in Redshift, why it re-sorts data, and the main vacuum types: full, delete-only, and reindex for planning.
Redshift - Choosing Keys (4:35)
Redshift - Distribution Keys (6:03)
Lab - Parameter Group | Redshift Cluster Modification (7:25)
Create and apply a new parameter group for the Redshift cluster, add the schema to its search path, and reboot to enable sort keys on the orders table.
Lab - Sort and Dist Keys & Vacuuming | Alter Table Commands (11:52)
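
To illustrate why the first column of a compound sort key matters for block access, here is a toy model. It is a simplification under stated assumptions, not Redshift's storage engine: rows are sorted by (year, customer_id), split into fixed-size blocks, and a block is scanned only if its min/max range for the filtered column could contain the value. The table shape and block size are made up.

```python
from itertools import product


def blocks_to_scan(rows, block_size, column, value):
    """Count blocks whose min/max range for `column` could contain `value`."""
    blocks = [rows[i:i + block_size] for i in range(0, len(rows), block_size)]
    hits = 0
    for block in blocks:
        values = [row[column] for row in block]
        if min(values) <= value <= max(values):
            hits += 1
    return hits


# Hypothetical table: 4 years x 100 customers, sorted on (year, customer_id).
rows = sorted(product(range(2020, 2024), range(100)))

# Filtering on the leading sort column touches a single block.
first_col = blocks_to_scan(rows, block_size=100, column=0, value=2021)
# Filtering on the second column alone cannot skip any block.
second_col = blocks_to_scan(rows, block_size=100, column=1, value=42)
print(first_col, second_col)  # 1 4
```

In this model every block spans the full customer_id range, so a filter on the second column alone gets no pruning, which is the intuition behind choosing the most frequently filtered column as the leading sort key.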

What is Docker? (3:02)
Install Docker Engine (2:11)
Install Docker Desktop, sign up for a free account, and verify that the Docker daemon is running so that Docker commands work.
Steps to Create Docker Image (2:25)
Lab - Build and Run Docker Image (11:08)
Deploy Docker Image on AWS Cloud (15:02)
Build a Docker image, push it to AWS ECR, and deploy a Lambda function to run the container; read a CSV, apply pandas transformations, and write the results back.
Final - Lambda IAM Permissions (1:44)

Use Case Intro - Transactional Processing with AWS Lambda, DynamoDB & API Gateway (6:12)
Learn how to design a real-time, serverless e-commerce transaction processing solution on AWS using Lambda, DynamoDB, and API Gateway, with scalable, fault-tolerant, and cost-effective ingestion and retrieval via API endpoints.
Lab - Part 1 | Transactional Processing with AWS Lambda, DynamoDB & API Gateway (10:36)
Lab - Part 2 | Transactional Processing with AWS Lambda, DynamoDB & API Gateway (8:27)
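
The ingestion side of this use case might look roughly like the sketch below: a Lambda handler behind an API Gateway proxy integration that writes an order to DynamoDB. The table name and the `order_id` field are hypothetical, and the table object can be passed in so the handler can be tried without AWS access.

```python
import json
from decimal import Decimal

TABLE_NAME = "example-orders"  # hypothetical table name


def lambda_handler(event, context, table=None):
    """Store the JSON order from an API Gateway proxy request in DynamoDB."""
    if table is None:
        import boto3  # bundled with the AWS Lambda Python runtime

        table = boto3.resource("dynamodb").Table(TABLE_NAME)

    try:
        # boto3's DynamoDB resource rejects floats, so parse numbers as Decimal.
        order = json.loads(event.get("body") or "{}", parse_float=Decimal)
    except json.JSONDecodeError:
        return {"statusCode": 400, "body": json.dumps({"error": "invalid JSON"})}

    if not isinstance(order, dict) or "order_id" not in order:
        return {"statusCode": 400, "body": json.dumps({"error": "order_id is required"})}

    table.put_item(Item=order)
    return {"statusCode": 201, "body": json.dumps({"order_id": order["order_id"]})}
```

Returning a dictionary with `statusCode` and a string `body` is what the proxy integration expects back from the function.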
Use Case Intro - Data Processing Workflows using Step Functions, Lambda and Glue (24:36)
Learn to build serverless data processing workflows with AWS Step Functions, Lambda, and Glue that extract from a MySQL RDS, perform ETL with PySpark in Glue, and write to S3.
Lab - Data Processing Workflows using Step Functions, Lambda and Glue (0:31)
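
A workflow like the one in this use case could be described with a Step Functions state machine definition along these lines. This is a minimal sketch that chains two Glue jobs; the state names and job names are hypothetical, and the course's actual workflow (which also involves Lambda) may be structured differently.

```python
import json


def glue_task(job_name, next_state=None):
    """Build a Task state that runs a Glue job and waits for it to finish."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::glue:startJobRun.sync",
        "Parameters": {"JobName": job_name},
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_state_machine(extract_job, transform_job):
    """Return a two-step definition: extract from MySQL, then transform with PySpark."""
    return {
        "Comment": "Extract from MySQL RDS, transform in Glue, write to S3",
        "StartAt": "ExtractFromMySQL",
        "States": {
            "ExtractFromMySQL": glue_task(extract_job, "TransformWithPySpark"),
            "TransformWithPySpark": glue_task(transform_job),
        },
    }


# Hypothetical job names, for illustration only.
definition = json.dumps(build_state_machine("example-extract-job", "example-transform-job"), indent=2)
print(definition)
```

The `.sync` suffix on the resource makes each state wait for the Glue job run to complete before moving on, which keeps the steps strictly ordered.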

Requirements

Hands-on experience with Python and SQL is a must
A technical background or prior experience with PySpark (at least beginner level)
Basic understanding of different cloud components (AWS, GCP, or Azure)

Description

AWS Cloud can seem intimidating and overwhelming to a lot of people because of its vast ecosystem, but this course makes it easier for anyone who wants hands-on experience setting up a data warehouse in Redshift or building a BI infrastructure from scratch.

Data scientists, analysts, and business analysts will soon be expected (if they are not already) to become all-rounders and handle the technical side of data ingestion, engineering, and warehousing.

Anyone who has a basic understanding of how the cloud works can benefit from this course because:

- It is designed around the end-to-end life cycle of a typical data engineering project

- It provides practical solutions to real-world use cases

This course covers:

Setting up a data warehouse in AWS Redshift from scratch
Basic data warehousing concepts
Writing serverless AWS Glue jobs (PySpark and Python shell) for ETL and batch processing
AWS Athena for ad hoc analysis (and when to use Athena)
AWS Data Pipeline to sync incremental data
Lambda functions to trigger and automate ETL and data syncing processes
QuickSight setup, analyses, and dashboards

Prerequisites for this course are:

Python and SQL (an absolute must)
PySpark (you should know how to write some basic PySpark scripts)
Willingness to explore, learn, and put in the extra effort to succeed
An active AWS account

Important note: This course makes use of the free tiers for Redshift and RDS, so you will not be billed for them unless you exceed the free tier usage, which should be more than enough to practice everything in this course.

Also, this course uses the AWS UI in the browser for creating clusters and setting up jobs; there is no bash scripting involved. You can use any operating system to perform the lab sessions in this course.

This course is not code-heavy: only 35% of it involves coding, and the rest is execution, understanding, and chaining different components together. The whole purpose of this course is to make everyone aware of, and comfortable with, all the tools and features it uses.

Some tips:

Try to watch the videos at 1.2x speed
Every time you work on a new component or feature, do some research on other tools meant for the same purpose and see how they differ and in what respects, e.g. Redshift/Athena vs Snowflake or BigQuery, QuickSight vs Power BI vs MicroStrategy

Who this course is for:

Data scientists and analysts who need hands-on implementation experience with AWS ETL tools
Software developers who are curious to learn data engineering
Anyone with coding experience who wants to get into the field of data engineering, analytics, and data science

Course content

About the Course & Introduction (3 lectures • 8min)

Getting Started with AWS Glue, MySQL RDS and Redshift (14 lectures • 1hr 18min)

Data Lakes & Handling External Data Sources (8 lectures • 1hr 12min)

Redshift Spectrum (3 lectures • 20min)

QuickSight - BI / Reporting and Visualization (3 lectures • 31min)

Redshift - Optimization Techniques and Fine Tuning (7 lectures • 48min)

Bonus - Do More with AWS Glue (2 lectures • 20min)

New Section - Working with Docker and AWS ECR (6 lectures • 36min)

Data Engineering Use Cases (5 lectures • 50min)
