
This is Volume 2 of the Data Engineering course. In this course, I will talk about the open-source data processing technologies Spark and Kafka, which are among the most widely used and popular frameworks for batch and stream processing. You will learn Spark from Level 100 to Level 400 through real-life hands-on exercises and projects. I will also introduce you to the data lake on AWS (that is, S3) and the data lakehouse using Apache Iceberg.
I will use AWS as the hosting platform and talk about the AWS services EMR, S3 and MSK. I will cover Databricks as a Spark hosting platform. I will also show you Spark integration with other services such as AWS RDS (MySQL or PostgreSQL) and Redshift.
You will get the opportunity to work hands-on with large datasets (100 GB to 300 GB or more of data). The course provides hands-on exercises that match real-world scenarios, such as Spark batch processing, stream processing, performance tuning, streaming ingestion, window functions, and ACID transactions on Iceberg.
Some other highlights:
10 projects with different datasets, with a total dataset size of 250 GB or more.
Other technologies covered - EC2, EBS, VPC and IAM.
Optional Python videos
Optional AWS and SQL Essentials videos
I will conclude the Data Engineering course with Volume 3, in which I will cover the following topics:
Flink
Apache Airflow
Apache Pinot
AWS Kinesis
Please provide feedback and suggestions if you want me to add any other topics.