Master Data Engineering: Concepts to Production

Name: Master Data Engineering: Concepts to Production
Rating: 4.4 (110 reviews)

Data Engineering: SQL, Python, Unix, Spark, Cloud, AWS, ETL, Data Quality, Data Governance & Data Architecture

Created by Parijat Bose

Last updated 11/2025

English

What you'll learn

Get hands-on with Python, SQL, Unix, Hadoop, Spark, CI/CD, and ETL in an IDE to replicate a real-life data engineering workflow.
Design, build, and manage scalable data pipelines using tools like Spark and frameworks for job orchestration, ensuring efficient data flow from ingestion to co
Model data warehouses/lakes using star/snowflake schemas and optimize storage for analytics.
Enforce data governance with quality checks, metadata management, and compliance frameworks
Master advanced SQL for complex queries, ETL transformations, and database optimization.
Troubleshoot pipelines using logging, monitoring tools, and error-handling strategies.
Leverage cloud tools (AWS EC2, S3, Lambda) for cost-effective, auto-scaling data workflows.
Identify a real-world problem statement, then design and implement a data pipeline.

Course content

10 sections • 235 lectures • 10h 24m total length

About the Course and the Instructor0:40
Explore data engineering concepts to production through patterns and real-world practices, as your experienced instructor guides you through massive data pipelines, enterprise data warehouses, and high-performance processing frameworks.
Who are Data Engineers!1:07
Learn how data engineers turn raw, messy information into clean, production-ready datasets that teams can rely on, covering fundamentals to architecture, pipelines to orchestration.
Story of Chef Anna!0:44
Handpick the finest ingredients at the market and craft precise, plated dishes, using lifelong recipes and continuous improvement to respond to customer concerns.
Data Engineer is a Master Chef2:11
Here’s why you should take this course!1:26
Designed for IT professionals, data scientists, and students, this hands-on course reveals what data engineering is, how it works, and real-world use cases, with a completion certificate.
Course Overview0:54
Explore data engineering foundations from SQL and ETL basics to Unix, Python, and big data with Hadoop and Spark, then cover CI/CD, data quality, governance, and cloud computing.
Key Components of Data Engineering2:26
Explore the seven key components of data engineering, from data sources and ingestion to ETL processing, storage options, orchestration, data management, analytics, security, privacy monitoring, and logging.
Role of Data Engineers1:19
Types of Data1:05
Explore the three data types in data engineering: structured, semi-structured, and unstructured. Identify examples like tables, JSON/XML, and media such as images and audio.
Data Engineering is the Future2:16

Introduction To SQL3:54
Explore how SQL, the standard language for relational databases, enables efficient querying, data extraction, and maintenance of tables like doctor and patients, linked by IDs.
Setting up MySQL development Environment3:11
Create Table5:23
Insert Data3:00
Select and Where3:07
Group By2:42
Learn to count records and group by doctor IDs to determine how many patients are assigned to each doctor, using count(*) and group by.
Order By1:31
Rank doctors by patient totals using an order by clause with a count alias, and switch from ascending to descending to see doctors ranked from highest to lowest.
Having1:35
List doctors with a patient count of five or more by applying the having clause after group by and ordering by count. Explain why where cannot filter aggregate columns.
Inner Join3:03
Learn how to join patients and doctors using an inner join, with aliases and explicit column selection. Explore join types, unions, CTEs, and subqueries to query multiple tables.
Left and Right Join2:29
Learn how a left join returns all rows from the left table and matched rows from the right table, with nulls where there is no match. Compare this to the right join using doctor and patient tables.
Union and Union All2:57
Learn how a full outer join combines the left and right joins and fills nulls for unmatched rows, use UNION in MySQL to simulate it, then contrast UNION with UNION ALL.
Common Table Expression2:24
Learn how common table expressions (CTEs) simplify complex queries, using a doctor and patient table to find patients over 12 still tagged to a pediatrician with date_sub.
Subquery1:27
DDL Operations4:17
Learn to create and alter tables, add and modify columns, and update data in the clinic database, including a visits table and temporary tables, and learn when to truncate or drop tables.
Date and String Functions4:56
Window Functions Part 1 3:39
Window Functions Part 2 2:26
Explore window functions with a real-world example that finds the top-earning doctor per day by joining the doctor and visit tables and using a CTE to rank by date and total fees.
Window Functions Part 3 2:01
Explore rank, dense rank, row number, and the average window function to compare doctor earnings against daily clinic totals and understand tie effects.
Database Design and Normalization2:04
Explore database design principles, including normalization and denormalization, to create efficient schemas that minimize data redundancy and enforce transaction control.
First Normal Form1:13
Second Normal Form1:17
Third Normal Form1:11
Implement the third normal form by removing transitive dependencies and moving customer data to a separate order-customer mapping table, splitting the order table into order details and customer mapping.
Denormalization0:53
Designing Efficient Schemas1:50
Normalize data to remove duplicates, improving consistency and storage efficiency. Select appropriate data types, use indexes judiciously, and design for growth with partitioning or sharding to ensure scalability and performance.
ACID Property1:53
Learn how atomicity, consistency, isolation, and durability make database transactions all or nothing, follow rules, stay isolated, and remain permanently stored for reliable data management.
Performance Tuning2:47
Transaction Control1:45
Extract Transform and Load ETL1:55
Data Pipelines2:20
Explore how data pipelines extract, transform, and load data from sources like CRM, weblogs, and APIs into targets such as Hive, AWS S3, and Tableau dashboards.
Data Cleansing1:49
Cleanse and transform data with SQL by validating names in the patient table using regex, fixing entries like d0n to don via a CTE-driven update.
Data Warehouse1:39
Explore data warehousing concepts, including fact, dimension, and snapshot tables, and compare star and snowflake schemas. Learn how a time-variant, non-volatile repository integrates data from multiple sources for fast analysis.
Factual Snapshot and Dimension Tables3:22
Star and Snowflake Schema1:16
Slowly Changing Dimension Tables SCD1:59
Analyze slowly changing dimension tables and their three types: type 1 updates the row, type 2 adds history with dates, and type 3 tracks previous values.
Designing Datawarehouse2:10
Define goals, design architecture, and plan implementation for a data warehouse, selecting star, snowflake, or hybrid models, and addressing ETL, data quality, governance, security, and scalability.
Capstone Project5:57
Design an end-to-end ETL data pipeline that extracts from CRM, ERP, and web analytics sources into a data warehouse. Apply cleansing, transformations, and load data into dimension and fact tables.
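
The Group By, Having, and Left Join lectures in this section can be tried without installing MySQL. The following is a minimal sketch, not the course's own code: it swaps in Python's built-in sqlite3 module for MySQL, and the table layout, doctor names, and sample rows are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the course's MySQL setup (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE doctors (doctor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE patients (
        patient_id INTEGER PRIMARY KEY,
        name TEXT,
        doctor_id INTEGER
    );
    INSERT INTO doctors VALUES (1, 'Rao'), (2, 'Lee'), (3, 'Kim');
    INSERT INTO patients VALUES
        (1, 'Ann', 1), (2, 'Bob', 1), (3, 'Cy', 1), (4, 'Dee', 2);
    """
)


def doctors_with_min_patients(min_patients):
    """Count patients per doctor and keep doctors at or above a threshold."""
    query = """
        SELECT d.name, COUNT(*) AS patient_count
        FROM patients AS p
        INNER JOIN doctors AS d ON d.doctor_id = p.doctor_id
        GROUP BY d.doctor_id, d.name
        HAVING COUNT(*) >= ?          -- WHERE cannot filter on an aggregate
        ORDER BY patient_count DESC
    """
    return conn.execute(query, (min_patients,)).fetchall()


def doctors_with_patients():
    """LEFT JOIN keeps every doctor; doctors without patients get NULL (None)."""
    query = """
        SELECT d.name, p.name
        FROM doctors AS d
        LEFT JOIN patients AS p ON p.doctor_id = d.doctor_id
        ORDER BY d.doctor_id, p.patient_id
    """
    return conn.execute(query).fetchall()


print(doctors_with_min_patients(2))
print(doctors_with_patients())
```

The threshold is passed as a query parameter rather than hard-coded, so the same function covers the "five or more" case from the Having lecture on a larger table.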

Introduction to UNIX1:10
What is OS?3:02
What is UNIX?1:17
Explore Unix, a foundational and secure operating system from the 1960s–70s, and its influence on Mac OS, Linux, servers, and data tools, enabling transferable skills across technology and data work.
History of UNIX1:40
Trace Unix history from late 1960s Bell Labs leaders Ken Thompson and Dennis Ritchie. Observe how C made it portable, inspiring BSD, Solaris, Linux, and Mac OS.
Unix Vs Linux1:56
Significance of UNIX in Data Engineering1:53
UNIX Architecture2:18
Basic Unix Commands - File and Directory Management Part 1 2:49
Navigate the Unix file system and manage files using cd and pwd, then view details with ls -l and sort by time with ls -lt, reversing the order with the -r flag.
Basic Unix Commands - File and Directory Management Part 2 4:35
Explore essential Unix file and directory management, including mkdir, cd, pwd, ls, touch, rm, and copy and move operations for data engineering workflows.
Basic Unix Commands - File and Directory Management Part 3 3:19
Master three Unix file creation methods: touch, cat with redirection, and vi editor, and learn to write, view, and save content with proper commands.
Basic Unix Commands - File and Directory Management Part 4 1:39
File Permissions - Part 1 1:42
File Permissions - Part 2 2:08
Explore how user, group, and others gain read, write, and execute permissions, and use chmod to add, remove, or set these rights on files.
File Permissions - Part 3 2:40
File Permissions - Part 4 1:12
Learn how the chown and chgrp commands adjust file ownership and groups, with Linda as the owner and the admin group as the owning group.
Text Processing Tools - Part 1 2:13
Explore text processing tools using grep, sed, and awk to search patterns in logs, identify exceptions, and pinpoint root causes in Spark job failures.
Text Processing Tools - Part 2 1:26
Explore the stream editor (sed) for parsing and transforming text, including global replacements such as replacing info with debug in files or Spark logs, a key data engineering tool.
Text Processing Tools - Part 3 2:02
Process Management - Part 1 5:00
Explore how a Unix process is created and managed by the kernel. Track its life cycle from created to ready, running, blocked, and terminated, using ps, top, and kill.
Process Management - Part 2 2:00
Learn how the ps command displays active processes, list all processes with ps -e, use head -5 to show the top items, and use top for real-time monitoring of CPU and memory usage.
Data Compression and Archiving0:32
Explore data compression and archiving with gzip and tar, and distinguish archiving from compression: compression reduces size, while archiving groups files into a single archive file.
File Transfer and Networking2:04
Learn how to securely transfer files using SCP, securely log in with SSH, and access remote storage with a network file system (NFS), including mounting remote paths.
Regular Expression1:38
Master regular expressions for matching and processing in engineering, using grep to extract email IDs from emails.txt that start with lowercase letters, have @, and end with .com or .org.
Introduction to Shell Scripting1:15
Shell Scripts3:44
Write and run your first shell script with vi, saving hello world in the scripts directory. Compare sh invocation with ./ execution, use chmod to make the script executable, and learn to use variables.
Control Structures3:53
Functions and Modular Scripting2:11
Explore functions and modular scripting in Unix by defining a square calculator function that uses the first argument, demonstrates reusable blocks, and prints the square of a given number.
Redirect Output and Error2:01
Working with Files2:30
Learn how to work with files in shell scripts by writing and appending to myfile.txt, checking the return code, and displaying file contents.
Error Handling4:14
Practical Applications and Optimization12:57
Job Scheduling3:31
Master cron job scheduling with crontab using five fields for minutes, hours, day, month, and weekday. See a hands-on example executing a hello world script and logging to cron output.txt.
Best Practices and Tips2:24
Implement unix-based best practices for data engineering, including version control with git, modular scripts, cron automation, graceful error handling, log redirection, nohup, and minimal permissions using oc, z, and grip.
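
The Regular Expression lecture in this section describes matching email IDs that start with a lowercase letter, contain @, and end with .com or .org. The sketch below shows one possible pattern in Python rather than grep. The exact pattern, the function name extract_emails, and the sample addresses are assumptions, not the course's own script.

```python
import re

# Assumed pattern: a lowercase first letter, an "@", and a .com or .org ending.
EMAIL_PATTERN = re.compile(r"^[a-z][A-Za-z0-9._-]*@[A-Za-z0-9.-]+\.(?:com|org)$")


def extract_emails(lines):
    """Return the stripped lines that match EMAIL_PATTERN, in input order."""
    return [line.strip() for line in lines if EMAIL_PATTERN.match(line.strip())]


sample = [
    "anna@example.com",
    "Bob@example.com",      # rejected: starts with an uppercase letter
    "carol@example.org",
    "dave@example.net",     # rejected: does not end with .com or .org
    "not-an-email",         # rejected: no "@"
]
print(extract_emails(sample))
```

Anchoring the pattern with ^ and $ makes each line match as a whole, which is the usual choice when a file such as emails.txt holds one address per line.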

Introduction to Python2:17
Explore the history, features, and architecture of Python and its essential role for data engineers, from readability to data science and data workflows.
Key Features2:36
Explore Python's intuitive syntax and interpreted, line-by-line execution. Harness its cross-platform reach, extensive standard library, dynamic typing, multiple paradigms, vast ecosystems, and strong community for readable and maintainable code.
Architecture2:14
Explore how Python code runs from parsing source into an abstract syntax tree to bytecode execution by the Python virtual machine, and compare interpreters with compilers.
Importance of Python for Data Engineers1:14
Python powers data engineering with simple, readable code that maintains complex data pipelines, using Pandas, NumPy, and Spark or Dask for scalable ETL and data source integration.
Download and Install2:01
Setup IDE2:35
Install IntelliJ IDEA community edition on macOS, set up a Python project with a virtual environment and Python 3.12, and use Copilot for debugging, code completion, and data engineering productivity.
Understanding main3:05
Setup project structure1:19
Set up the project structure with resources and source directories, visualize the daily sales transaction export CSV using a CSV plugin, then read and process the data.
Introduction to Basic Operations0:30
Master Python basics, including syntax, variables, data types, operations, and expressions, and learn to define and call functions, pass arguments, return values, import modules, and read files.
Variables2:55
Learn to define Python variables with dynamic typing. Practice adding A and B, and concatenate outputs by converting integers to strings.
Methods4:22
Learn how Python methods modularize code by defining functions with parameters, calling them to sum numbers, and returning values to the caller.
Read File3:34
Learn to read a CSV using the read file method in the file utils module of the process files package, skip the header with an iterator, and print each row.
Introduction to Control Flow1:02
List4:00
Explore Python lists, tuples, and sets by converting a CSV reader to a list, iterating through items, and using built-in methods like append, pop, and index.
List Slicing7:26
For loop4:00
If Elif condition1:45
Learn to use if, elif, and else in Python to categorize customers by order value. Combine conditions with the logical operators and/or, and use proper indentation to create clear decision blocks.
Code Organization3:11
Organize your Python project with a utils folder and a categorize customers function that returns valued, medium, and least valued lists, then import and use it in main.
While loop3:13
Tuples2:08
Explore Python tuples as immutable data structures, distinguish them from lists by round brackets and indexing, and learn when to use tuples for fixed-size data while considering memory use.
Sets3:11
Dictionary2:47
Learn to map customer names to order values using Python dictionaries, with a get names and order value function and enumerate to pair indices with values in ETL workflows.
Introduction to Advance Features1:08
Write File2:09
Open the file in write mode and dump the customer dictionary to a JSON file using Python's json library with four-space indentation, demonstrating ETL by loading CSV data into JSON.
DocString2:35
Exceptions2:40
Wrap Python file reads in a try-except block to gracefully handle file not found errors, returning a blank list and printing a clear message, and learn common built-in exceptions.
Lambda5:46
Capstone Project10:03
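
The Read File, If Elif condition, Write File, and Exceptions lectures in this section fit together as a small extract, transform, and load flow. The following is a minimal sketch under stated assumptions: the function names read_file and categorize_customers, the thresholds, the file names, and the sample rows are hypothetical, not taken from the course project.

```python
import csv
import json
import os
import tempfile


def read_file(path):
    """Read a CSV file and return its rows without the header.

    Returns an empty list if the file does not exist.
    """
    try:
        with open(path, newline="") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row
            return list(reader)
    except FileNotFoundError:
        print(f"File not found: {path}")
        return []


def categorize_customers(rows, high=1000, low=100):
    """Split (name, order_value) rows into three lists by order value."""
    valued, medium, least = [], [], []
    for name, order_value in rows:
        amount = float(order_value)
        if amount >= high:
            valued.append(name)
        elif amount >= low:
            medium.append(name)
        else:
            least.append(name)
    return valued, medium, least


# Build a tiny CSV so the sketch is self-contained.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "sales.csv")
with open(csv_path, "w", newline="") as handle:
    csv.writer(handle).writerows(
        [["customer", "order_value"], ["Ann", "1500"], ["Bob", "250"], ["Cy", "40"]]
    )

rows = read_file(csv_path)
valued, medium, least = categorize_customers(rows)

# Load step: dump a name -> order value dictionary to JSON with 4-space indent.
json_path = os.path.join(workdir, "customers.json")
with open(json_path, "w") as handle:
    json.dump({name: float(value) for name, value in rows}, handle, indent=4)

print(valued, medium, least)
print(read_file(os.path.join(workdir, "missing.csv")))
```

Returning an empty list on a missing file lets the caller keep running with no rows, which mirrors the behaviour described in the Exceptions lecture.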

Introduction to Bigdata, Hadoop and Spark3:06
What is Big Data4:33
Understanding Use Case6:05
Big data enables real-time analytics, predictive modeling, and personalized insights across healthcare, finance, retail, and transportation, powering fraud detection, risk management, and optimized operations.
Challenges4:55
Explore the key challenges in big data processing—volume, variety, velocity, veracity, security, and scalability—and how tools like Hadoop HDFS, Apache Kafka, and Spark Streaming address them.
Introduction to Big Data Technologies0:24
Big Data Ecosystem2:33
Explore how data sources, data storage, data processing, ETL, analytics, visualization, data management, and data security form the big data ecosystem, with cloud platforms enabling scalable development and deployment.
Working with HDFS2:55
Traditional vs Big Data Processing2:16
Understand how traditional data processing struggles with huge data volumes and batch insights, while big data processing handles real-time streams and any data type, with horizontal scaling.
HDFS File Operations5:23
Learn to manage files in HDFS by creating a data directory, uploading with put or copy from local, downloading with get, copying directories, and deleting files with verification.
HDFS Checksum3:17
Introduction to Hadoop0:40
Discover Apache Hadoop and its core components, including HDFS, MapReduce, Yarn, and common utilities. Analyze HDFS architecture, the MapReduce framework, how MapReduce jobs execute, and Yarn architecture.
Apache Hadoop3:33
Explore Apache Hadoop, a distributed storage and processing framework with HDFS, MapReduce, Yarn, and Hadoop common, delivering scalable, fault-tolerant, cost-effective, flexible data handling for warehousing, analytics, and machine learning.
Components of Hadoop3:50
HDFS Architecture4:53
Explore the HDFS architecture with a primary name node, secondary name node, and data nodes storing 128 MB blocks replicated across nodes for client access and centralized metadata management.
MapReduce4:00
YARN2:36
Types of Setup2:30
Explore the five Hadoop or Spark setups: standalone, pseudo distributed, fully distributed, cloud based, and Hadoop as a service, and identify ideal use cases.
Install Java2:37
Install Hadoop5:27
Install and configure Hadoop locally, set fs.defaultFS to localhost:9000, enable yarn, format the name node, start daemons, and verify via hdfs and the NameNode and ResourceManager web UIs.
Introduction to Hive3:27
Explore Hive, a data warehousing solution with a SQL-like language on Hadoop to query large datasets in HDFS or S3, featuring Hive Server 2, Metastore, ACID, and replication.
Hive vs DBMS1:52
Use Cases of Hive1:20
Explore how Hive analyzes e-commerce clickstream data, processes electronic health records to spot care trends, detects fraud through real-time financial transactions, and monitors telecom network logs to optimize performance.
Hive Architecture5:53
HQL2:04
Install Hive5:20
Hive Tables1:44
Create Table, Partition, Bucket3:45
Create a Hive table with external or managed options, define columns with comments, and configure partitioning, bucketing, row format, and file format for efficient storage and retrieval.
Hive Supported File Formats4:13
Explore Hive file formats, including text files, sequence file, RC file, Avro, ORC, and Parquet, and learn how to choose the right format for schema evolution, compression, and analytical workloads.
Create Table2:48
Create a managed Hive table named employee with id, name, and salary; load data from a local file, then query using where, order by, group by, and a join condition.
Load Data0:57
Filter, Sort, Group, Join4:38
Hive UDF1:51
Explore how Hive user defined functions enable custom logic on top of tables to query and transform data, with built-in and custom UDFs and the difference from aggregate functions.
Introduction to Spark0:26
What is Spark1:27
Spark Ecosystem2:40
Spark Architecture2:34
Hadoop vs Spark2:04
Compare Hadoop MapReduce and Spark for big data processing, highlighting performance and fault tolerance. Spark's in-memory processing enables faster analytics and real-time insights.
Install Spark1:34
Understanding a Spark Job6:27
Submit a Spark job to trigger the driver to build a DAG, split it into stages at wide transformations and shuffles, and assign tasks to executors across the cluster.
SparkSession2:28
Discover how SparkSession unifies SparkContext, SQLContext, and HiveContext to run DataFrames, RDDs, and SQL queries via Spark shell, PySpark, Spark SQL, and the cluster manager.
Dataframe, Dataset and RDD2:17
Explore Spark data frames as named-column data sets and compare datasets to RDDs, highlighting fault tolerance through lineage, distributed partitions, and parallel processing.
Spark UI2:12
Navigate the Spark UI to debug applications by inspecting the context web UI and its tabs: jobs, stages, storage, environment, and executors; monitor status and executor health.
Spark Scala2:23
Learn Spark coding in Scala, PySpark, Java, and R; practice Scala by creating a data frame from records, naming columns, and displaying it with df.show, using the Spark context and session.
PySpark2:42
Spark IDE4:01
Set up a PySpark project in IntelliJ IDEA community edition, install PySpark, initialize a SparkSession, and use data frames and Spark SQL to filter and query data.
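
The HDFS Architecture lecture in this section mentions 128 MB blocks replicated across data nodes. The arithmetic behind that can be sketched as below. The replication factor of 3 is an assumption (it is the usual HDFS default, but the listing does not state it), and hdfs_footprint is a hypothetical helper name.

```python
import math

BLOCK_SIZE_MB = 128       # block size named in the HDFS Architecture lecture
REPLICATION_FACTOR = 3    # assumption: the common HDFS default


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies as much space as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


print(hdfs_footprint(1024))  # a 1 GB file
print(hdfs_footprint(300))   # a file that does not fill its last block
```

For the 1 GB file this gives 8 blocks and 24 block replicas; the 300 MB file needs 3 blocks even though the third is only partly filled.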

Introduction3:21
Benefits of CICD Pipeline1:14
Enable rapid, reliable data pipelines by implementing CI/CD, automated testing, and collaborative deployment across environments, reducing downtime and human error.
Stages of CICD4:10
Develop, validate, and deploy data pipelines through a five-stage CI/CD workflow (development, CI validation, CD deployment to staging and production, and production go-live) featuring version control, code reviews, tests, and compliance checks.
Introduction to Git1:21
Create Git Repository2:13
Sign up on GitHub, create a private repository named zozo, add a README and .gitignore, select the Apache license, set main as the default branch, and use pull requests.
Git Clone2:54
Clone a Git repository from remote to local by installing Git, creating a local directory, and running git clone with the remote URL, then handle private versus public access.
Git Push, Pull and Merge6:55
Examples0:52
Learn CI/CD tooling for data engineers, with GitHub, GitLab, Bitbucket, Jenkins, CircleCI, and GitHub Actions. Cover deployment and testing with Kubernetes, Docker, Terraform, PyTest, and Great Expectations.
Benefits Vs Challenges1:43
See how CI/CD automates testing and deployment to speed up data pipeline releases and improve reliability across environments. Understand challenges like data dependencies, test data availability, and cross-environment consistency.

What is Data Quality1:36
Key Aspects of Data Quality1:26
Learn the key data quality dimensions: accuracy, consistency, uniqueness, timeliness, validity, and completeness, and how to ensure data stays true, non-contradictory, up-to-date, and complete.
DQ Metrics0:57
Define data quality metrics with KPIs for accuracy, error rate, and uniqueness. Measure data accuracy rate, duplicate record rate, and distinct value count to gauge overall data quality.
Data Profiling1:06
Learn data profiling to understand data structure, content, and quality, its purpose, and the techniques and tools to reveal missing values, errors, and cardinality for cleansing or analysis.
Data Cleansing1:22
Learn how data cleansing corrects inaccurate, incomplete, or inconsistent data, and data standardization brings diverse formats, such as JSON and XML, into a single CSV, improving reliability for decision making.
Data Cleansing Tools1:23
Explore data cleansing and standardization techniques, including format standardization of dates and addresses and data type conversion, and apply cleansing methods for missing values, duplicates, error correction, outliers, and data validation.
What is Data Governance2:14
Principles of Data Governance1:32
Apply data governance principles across stewardship, ownership, quality, and metadata management, then ensure security, privacy, standards, master data management, and architecture for compliant integration and risk.
Data Governance Models2:47
Explore centralized, decentralized, hybrid, command and control, and collaborative data governance models, and compare their pros and cons for consistent policies, agile responsiveness, and accountability.
Data Stewardship1:58
Metadata Management2:37
Data Lineage1:45
Compliance2:27
Explore the essentials of data privacy laws, including GDPR and HIPAA, and how they govern the collection, use, storage, and sharing of personal information for data engineers.
Role of Data Engineers1:28
Explore how data engineers enforce governance and data quality through policies and metadata management. Implement security, compliance, profiling, cleansing, monitoring, and lineage tracking and versioning for auditing.
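
The DQ Metrics lecture in this section names measures such as duplicate record rate and distinct value count, and the Key Aspects lecture lists completeness. One way these might be computed is sketched below in plain Python. The function names, the record layout, and the sample values are assumptions for illustration only.

```python
def completeness_rate(records, field):
    """Share of records whose field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_record_rate(records, key):
    """Share of records whose key value was already seen earlier."""
    if not records:
        return 0.0
    seen, duplicates = set(), 0
    for r in records:
        value = r.get(key)
        if value in seen:
            duplicates += 1
        else:
            seen.add(value)
    return duplicates / len(records)


def distinct_value_count(records, field):
    """Number of distinct non-missing values in a field."""
    return len({r[field] for r in records if r.get(field) not in (None, "")})


records = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": ""},          # missing email
    {"id": 2, "email": "b@x.com"},   # duplicate id
    {"id": 3, "email": "a@x.com"},
]
print(completeness_rate(records, "email"))
print(duplicate_record_rate(records, "id"))
print(distinct_value_count(records, "email"))
```

Each metric is a ratio or count over the same record list, so the three can be tracked together as simple KPIs per dataset load.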

Introduction to Cloud Computing2:46
Explore the four cloud architectures—public, private, hybrid, and multi-cloud—and learn how pay-as-you-go public resources, private cloud data security, and workloads moving between public and private clouds shape modern deployments.
What is Cloud2:22
Cloud Platforms1:48
Cloud Offerings1:23
Introduction to AWS4:07
AWS Console and Billing3:04
EC2 and Lambda2:17
Explore AWS EC2 and AWS Lambda to provision resizable compute capacity and run code on demand, comparing virtual servers with serverless execution triggered by data changes or user events.
EC2 Hands On3:56
Lambda Hands On2:01
Create a function in the AWS Lambda console and choose the Python runtime. Deploy, then test with a new test event and review logs to confirm a 200 status.
AWS S3 and EBS2:49
AWS S3 Hands On5:28
EBS Hands On1:22
Explore elastic block store (EBS) and examine a volume attached to an EC2 instance, then learn to create and attach a new volume within the same region.
RDS and DynamoDB2:15
RDS Hands On4:50
Create an AWS RDS MySQL database using the free tier and single availability zone, note endpoint URL, install MySQL client on the EC2 instance, then connect and run show databases.
VPC and Route 53 3:10
IAM and Secrets Manager1:51
Learn how AWS IAM manages access and identity, including users, groups and roles, and how AWS Secrets Manager securely stores passwords, API keys, and database credentials.
IAM Hands On5:15
Create IAM policies for EC2 and S3, assemble them into a user group, and add users to grant access; assign a role enabling EC2 to access S3.
Secrets Manager Hands On6:01
Cloud Formation2:40
Explore AWS monitoring with CloudWatch to track metrics, logs, and events, set alarms, and trigger SNS notifications for scaling; learn CloudFormation to model and deploy resources via templates.
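
The Lambda Hands On lecture in this section ends with a test event that returns a 200 status. A function of that shape might look like the sketch below. The handler signature follows the usual Python runtime convention; the "name" event key and the message text are hypothetical, and the local call only imitates a console test.

```python
import json


def lambda_handler(event, context):
    """Entry point in the shape the AWS Lambda Python runtime calls."""
    name = event.get("name", "world")  # "name" is a hypothetical event key
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local stand-in for a console test event; Lambda passes a real context object.
response = lambda_handler({"name": "data engineer"}, None)
print(response["statusCode"], response["body"])
```

Because the handler is an ordinary function, it can be exercised locally like this before it is deployed.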

Introduction to Data Modeling and Architecture1:10
Master data modeling and architecture to organize and structure data for quick, informed decisions, ensuring scalable, reliable data flows across tools and systems.
Data Modeling and Types2:23
Data Modeling Methodologies2:17
Introduction to Data Architecture1:38
Learn how data architecture describes the way data flows through a system and serves as the blueprint for storage, access, and secure use, and how it differs from a data model.
Key Components of Data Architecture1:17
Identify the six components of data architecture—data sources, storage, integration, governance, processing frameworks, and data access—and illustrate tools like ETL, data lakes, Collibra, Hadoop, Spark, and APIs.
Entity Relationship Diagram2:10
Explore entity relationship modeling to visualize entities, attributes, and relationships. Learn about ER diagrams and their components, with examples like a customer placing orders in a one-to-many relationship.
Normalization and Denormalization1:15
First Normal Form1:44
First normal form ensures each column holds a single value and each record is unique by removing repeating groups, simplifying library borrowing lookups.
Second Normal Form1:21
Learn how second normal form eliminates partial dependencies after achieving 1NF, by splitting data into book, borrower, and book-borrower tables to remove redundancy.
Third Normal Form1:46
Identify the criteria for third normal form, which requires 2NF and no transitive dependencies between non-key attributes and the primary key, as shown by separating author and library data.
Boyce Codd Normal Form1:22
Explain Boyce-Codd normal form, in which every determinant must be a candidate key. Because author determines genre, the example violates BCNF, prompting a book branch table and the final tables: author, book, and borrower.
Denormalization0:56
Explore denormalization, the process of intentionally introducing redundancy by merging previously normalized tables to boost query performance. Understand trade-offs of increased storage and anomalies during insert, update, or delete.
Dimensional Modelling2:34
DataMart2:00
Explore data storage and retrieval patterns, including data lakes that hold raw data in its original formats, data warehouses, and data marts, with ETL/ELT, partitioning, indexing, and sharding for analytics.
Partitioning, Indexing and Sharding2:59
Architecting Hadoop Systems2:10
Discover Lambda and Kappa architectures, the five V's of Hadoop, and the key Hadoop ecosystem components. See how batch and stream processing yield accurate insights and real-time results.
Architecting Cloud Platforms2:46
Explore public, private, hybrid, and multi-cloud architectures, and learn how pay as you go cloud services from providers like AWS, Azure, and Google Cloud support secure, compliant data workloads.
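
The First Normal Form lecture in this section describes removing repeating groups so each column holds a single value, using library borrowing as the example. The sketch below illustrates that step in plain Python. The field names, the comma-separated input, and the book titles are assumptions, not the course's data.

```python
def to_first_normal_form(records):
    """Expand a multi-valued 'books' column into one row per borrower and book."""
    rows = []
    for record in records:
        for title in record["books"].split(","):
            title = title.strip()
            if title:
                rows.append({"borrower": record["borrower"], "book": title})
    return rows


# Unnormalized input: one column holds a repeating group of book titles.
unnormalized = [
    {"borrower": "Ann", "books": "Dune, Emma"},
    {"borrower": "Bob", "books": "Dune"},
]
for row in to_first_normal_form(unnormalized):
    print(row)
```

After the split, a lookup such as "who borrowed Dune" becomes an equality test on a single column instead of a substring search.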

Problem Statement and OKR2:28
Ingest raw JSON credit card transactions from S3, clean and validate data, mask PII, compute daily merchant summaries, and store in a data lake and MySQL for GDPR compliance.
Credit Card Lifecycle3:20
Explore the credit card transaction life cycle from authorization to clearing and settlement, and learn how data engineering captures and prepares data for analysis and predictive modeling.
Architecture Design1:52
Design and implement a data architecture that ingests JSON data into S3, triggers a Lambda-driven EMR Spark ETL, masks PII, aggregates by merchant and date, and loads results to MySQL.
Environment Setup1:40
Design S3 Path and Understanding RAW input JSON data2:10
Design the Daily Transaction Store bucket on S3 with raw, cleaned, and aggregated data in year/month/day folders; use Parquet for cleaned and aggregated data, including merchant daily and customer spend.
AWS S3 Setup1:52
Upload raw data to AWS EC2 and S35:52
Git and IDE Setup4:00
Set up a Git repository on GitHub, clone it locally, and create a Python virtual environment. Open the project in IntelliJ, add and commit files, then push to main.
PySpark Setup2:08
Activate the virtual environment in IntelliJ, install PySpark, configure the Python interpreter, run a Spark job to create a Spark session and data frames, and begin the ETL phase.
Project Setup1:32
Pipeline Execution and CLI command2:42
Main Class2:54
Create Spark Session(Singleton Pattern)2:44
Configuration and Credentials File Setup2:25
Describe how the Config.py class loads credentials from credentials.env, enforces validation rules, masks card numbers, and exposes get jdbc url and get MySQL properties for MySQL access.
Data Reader2:07
1.15 Data Cleansing and Quality - Part 1 3:31
The data cleaner initializes validation rules, trims strings, uppercases currency, removes nonnumeric card characters, rounds amounts, formats timestamps, derives transaction date and r, validates data, returns valid and invalid frames.
1.15 Data Cleansing and Quality - Part 2 2:52
Handling Sensitive Data3:16
Data Profiling2:24
Data Loading1:09
Use the data writer class to write data frames to S3 (Parquet or JSON), local JSON, and MySQL in overwrite mode, including invalid transactions, summaries, and data quality reports.
The Data Pipeline6:26
Before We Turn On The Tap1:53
Identify data quality issues in sample transactions, such as currency in lowercase, excessive decimal places, non-numeric card numbers, and duplicates, and review the DDL for creating the catalyst database in MySQL Workbench.
Data Pipeline Execution1:19
Submit the Spark job to a local cluster, validate and mask data, generate daily merchant summaries and risk metrics, write transactions to MySQL, and generate a log report.
Verifying Output2:51
Inspect processed transactions (999 records) and a single invalid entry from a duplicate id 1006, then review daily merchant summaries and hourly patterns to debug and fix the data pipeline.
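
The cleansing and masking steps described in this section (trimming strings, uppercasing currency, stripping non-numeric card characters, rounding amounts, masking card numbers, and flagging duplicate ids) can be sketched in plain Python. The course project uses PySpark; this stand-in only illustrates the logic, and every function name, field name, and sample value here is an assumption.

```python
import re


def mask_card_number(card_number):
    """Keep the last four digits and replace the rest with asterisks."""
    digits = re.sub(r"\D", "", card_number)   # drop non-numeric characters
    return "*" * max(len(digits) - 4, 0) + digits[-4:]


def clean_transaction(txn):
    """Trim strings, uppercase currency, round the amount, and mask the card."""
    return {
        "id": txn["id"],
        "merchant": txn["merchant"].strip(),
        "currency": txn["currency"].strip().upper(),
        "amount": round(float(txn["amount"]), 2),
        "card_number": mask_card_number(txn["card_number"]),
    }


def split_valid_invalid(transactions):
    """Clean each record; treat a repeated id as invalid (first one wins)."""
    valid, invalid, seen_ids = [], [], set()
    for txn in transactions:
        cleaned = clean_transaction(txn)
        if cleaned["id"] in seen_ids:
            invalid.append(cleaned)
        else:
            seen_ids.add(cleaned["id"])
            valid.append(cleaned)
    return valid, invalid


raw = [
    {"id": 1005, "merchant": " Cafe Uno ", "currency": "usd",
     "amount": "12.34567", "card_number": "4111-1111-1111-1111"},
    {"id": 1006, "merchant": "Book Nook", "currency": "USD",
     "amount": "80", "card_number": "5500 0000 0000 0004"},
    {"id": 1006, "merchant": "Book Nook", "currency": "USD",
     "amount": "80", "card_number": "5500 0000 0000 0004"},
]
valid, invalid = split_valid_invalid(raw)
print(len(valid), len(invalid))
print(valid[0])
```

Masking happens inside the cleaning step, so neither the valid nor the invalid output ever carries a full card number.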

Requirements

Basic Programming Knowledge
No Prior Data Engineering Experience Needed
Access to a Computer & Internet
Curiosity about data workflows, databases, or cloud tools.

Description

Master Data Engineering: Concepts to Production is a comprehensive course designed to transform beginners into proficient data engineers. Starting with foundational concepts (data lifecycle, roles, and tools), the course progresses to hands-on skills in SQL, ETL processes, UNIX scripting, and Python programming for automation and data manipulation. Dive into big data ecosystems with Hadoop and Spark, learning distributed processing and real-time analytics. Master data modeling (star and snowflake schemas) and architecture design for scalable systems.

Explore cloud technologies (AWS) to deploy storage, compute, and serverless solutions. Build robust data pipelines and orchestrate workflows, while integrating CI/CD practices for automated testing and deployment. Tackle data quality methods (validation, cleansing) and data governance principles (compliance, metadata management) to ensure reliability.

Each chapter combines theory with real-world projects: designing ETL workflows, optimizing Spark jobs, and deploying cloud-based pipelines. By the end, you’ll confidently handle end-to-end data solutions, from raw data ingestion to production-ready systems. Ideal for aspiring data engineers, analysts, or IT professionals seeking to upskill.

Prerequisites: Basic programming knowledge.

Tools covered: Spark, Hadoop, AWS, SQL, Python, UNIX, Git, IntelliJ IDE.

Outcome: Build a portfolio of projects showcasing your ability to solve complex data challenges.

Who this course is for:

Beginners with basic programming skills aiming to enter the field.
Professionals seeking to transition into engineering roles (ETL, pipelines, automation).
Developers or sysadmins wanting to specialize in scalable data systems, cloud (AWS), and big data tools.
Individuals with coding fundamentals pivoting to data engineering.
Teams needing modern data skills (Spark, Hadoop, CI/CD, governance) for enterprise projects.

Course content

Course Outline • 10 lectures • 14min

SQL and ETL • 36 lectures • 1hr 31min

UNIX • 33 lectures • 1hr 29min

Python • 28 lectures • 1hr 26min

Bigdata, Hadoop and Spark • 45 lectures • 2hr 19min

Continuous Integration and Continuous Development • 9 lectures • 25min

Data Quality and Governance • 14 lectures • 25min

Cloud Computing • 19 lectures • 59min

Data Modeling and Architecture • 17 lectures • 32min

Real Life Data Problem and Solution • 24 lectures • 1hr 5min