
Take data engineering from concepts to production through patterns and real-world practices, as your experienced instructor guides you through massive data pipelines, enterprise data warehouses, and high-performance processing frameworks.
Learn how data engineers turn raw, messy information into clean, production-ready datasets that teams can rely on, covering fundamentals to architecture, pipelines to orchestration.
Handpick the finest ingredients at the market and craft precise, plated dishes, using lifelong recipes and continuous improvement to respond to customer concerns.
Designed for IT professionals, data scientists, and students, this hands-on course reveals what data engineering is, how it works, and real-world use cases, with a completion certificate.
Explore data engineering foundations, from SQL and ETL basics to Unix, Python, and big data with Hadoop and Spark, and cover CI/CD, data quality, governance, and cloud computing.
Explore the seven key components of data engineering, from data sources and ingestion to ETL processing, storage options, orchestration, data management, analytics, security, privacy monitoring, and logging.
Explore the three data types in data engineering: structured, semi-structured, and unstructured. Identify examples like tables, JSON/XML, and media such as images and audio.
Explore how SQL, the standard language for relational databases, enables efficient querying, data extraction, and maintenance of tables like doctor and patients, linked by IDs.
Learn to count records and group by doctor IDs to determine how many patients are assigned to each doctor, using count(*) and group by.
Rank doctors by patient totals using an order by clause with a count alias, and switch from ascending to descending to see doctors ranked from highest to lowest.
List doctors with a patient count of five or more by applying the having clause after group by and ordering by count. Explain why where cannot filter aggregate columns.
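The count, group by, order by, and having steps described above can be sketched in a few lines. This is a minimal illustration that uses Python's built-in sqlite3 module rather than the course's own database setup; the table layout and the sample assignments are hypothetical.

```python
import sqlite3

# Hypothetical sample data: each patient row points at one doctor.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, doctor_id INTEGER)")
assignments = (
    [(i, 1) for i in range(1, 7)]      # doctor 1 has 6 patients
    + [(i, 2) for i in range(7, 10)]   # doctor 2 has 3 patients
    + [(i, 3) for i in range(10, 15)]  # doctor 3 has 5 patients
)
conn.executemany("INSERT INTO patients VALUES (?, ?)", assignments)

# WHERE is evaluated before rows are grouped, so it cannot see COUNT(*).
# HAVING is evaluated after GROUP BY and can filter on the aggregate.
query = """
    SELECT doctor_id, COUNT(*) AS patient_count
    FROM patients
    GROUP BY doctor_id
    HAVING COUNT(*) >= 5
    ORDER BY patient_count DESC
"""
busy_doctors = conn.execute(query).fetchall()
print(busy_doctors)
```

Switching DESC to ASC in the last clause reverses the ranking without changing which doctors pass the having filter.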
Learn how to join patients and doctors using an inner join, with aliases and explicit column selection. Explore join types, unions, CTEs, and subqueries to query multiple tables.
Learn how a left join returns all rows from the left table and matched rows from the right table, with nulls where there is no match. Compare this to the right join using doctor and patient tables.
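To make the null-filling behavior concrete, here is a small sketch that again uses Python's sqlite3 module as a stand-in database. The doctor and patient rows are invented for illustration, and the right join is shown by swapping the table order, which gives the same result as a right join would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE doctors (doctor_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT, doctor_id INTEGER);
    INSERT INTO doctors VALUES (1, 'Dr. Rao'), (2, 'Dr. Lee');
    INSERT INTO patients VALUES (10, 'Ana', 1), (11, 'Ben', NULL);
""")

# Left join: every patient is kept; Ben has no doctor, so the doctor column is NULL.
patients_left = conn.execute("""
    SELECT p.name, d.name
    FROM patients AS p
    LEFT JOIN doctors AS d ON p.doctor_id = d.doctor_id
    ORDER BY p.patient_id
""").fetchall()

# Swapping the table order keeps every doctor instead (the right-join view).
doctors_left = conn.execute("""
    SELECT d.name, p.name
    FROM doctors AS d
    LEFT JOIN patients AS p ON p.doctor_id = d.doctor_id
    ORDER BY d.doctor_id
""").fetchall()

print(patients_left)
print(doctors_left)
```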
Learn how a full outer join unites left and right joins, filling nulls for unmatched rows, and how to simulate it in MySQL with union; then contrast union and union all.
Learn how common table expressions (CTEs) simplify complex queries, using a doctor and patient table to find patients over 12 still tagged to a pediatrician with date_sub.
Learn to create and alter tables, add and modify columns, and update data in the clinic database, including a visits table, temporary tables, and when to truncate or drop tables.
Explore window functions with a real-world example to find the top-earning doctor per day by joining doctor and visit tables, using a CTE to rank by date and total fees.
Explore rank, dense rank, row number, and the average window function to compare doctor earnings against daily clinic totals and understand tie effects.
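The tie effects mentioned above are easiest to see side by side. The sketch below assumes a SQLite build with window-function support (version 3.25 or later), accessed through Python's sqlite3 module; the earnings table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE earnings (doctor TEXT, total_fees INTEGER);
    INSERT INTO earnings VALUES ('A', 500), ('B', 500), ('C', 300);
""")

# A and B tie on total_fees:
#   RANK leaves a gap after the tie, DENSE_RANK does not,
#   ROW_NUMBER always assigns distinct numbers (ties broken by doctor name here).
ranked = conn.execute("""
    SELECT doctor,
           RANK()       OVER (ORDER BY total_fees DESC)         AS rnk,
           DENSE_RANK() OVER (ORDER BY total_fees DESC)         AS dense_rnk,
           ROW_NUMBER() OVER (ORDER BY total_fees DESC, doctor) AS row_num
    FROM earnings
    ORDER BY row_num
""").fetchall()
print(ranked)
```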
Explore database design principles, including normalization and denormalization, to create efficient schemas that minimize data redundancy and enforce transaction control.
Implement the third normal form by removing transitive dependencies and moving customer data to a separate order-customer mapping table, splitting the order table into order details and customer mapping.
Normalize data to remove duplicates, improving consistency and storage efficiency. Select appropriate data types, use indexes judiciously, and design for growth with partitioning or sharding to ensure scalability and performance.
Learn how atomicity, consistency, isolation, and durability ensure that database transactions are all or nothing, follow the database's rules, stay isolated from one another, and remain permanently stored once committed, for reliable data management.
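As a rough illustration of atomicity and consistency, the sketch below runs a two-step transfer inside one transaction using Python's sqlite3 module. The accounts table, the names, and the non-negative balance rule are assumptions made for the example, not part of the course material.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the consistency rule: balances may not go negative.
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 500)  # debit would break the CHECK rule
after_failure = balances(conn)                # the credit to bob was rolled back too
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
print(failed, after_failure, succeeded, after_success)
```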
Explore how data pipelines extract, transform, and load data from sources like CRM, weblogs, and APIs into targets such as Hive, AWS S3, and Tableau dashboards.
Cleanse and transform data with SQL by validating names in the patient table using regex, fixing entries like d0n to don via a CTE-driven update.
Explore data warehousing concepts, including fact, dimension, and snapshot tables, and compare star and snowflake schemas. Learn how a time-variant, non-volatile repository integrates data from multiple sources for fast analysis.
Analyze slowly changing dimension tables and their three types: type 1 updates the row, type 2 adds history with dates, and type 3 tracks previous values.
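The three types can be contrasted with a short plain-Python sketch. The customer and city columns, the date columns, and the function names are hypothetical; a warehouse would normally apply the same logic with SQL updates and inserts.

```python
from datetime import date


def apply_type1(row, new_city):
    """Type 1: overwrite the value in place, keeping no history."""
    row["city"] = new_city
    return row


def apply_type2(history, customer_id, new_city, change_date):
    """Type 2: close the current row and append a new current row."""
    for row in history:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            row["end_date"] = change_date
    history.append(
        {"customer_id": customer_id, "city": new_city,
         "start_date": change_date, "end_date": None}
    )
    return history


def apply_type3(row, new_city):
    """Type 3: keep only the previous value in an extra column."""
    row["previous_city"] = row["city"]
    row["city"] = new_city
    return row


type1_row = apply_type1({"customer_id": 1, "city": "Pune"}, "Delhi")
type2_rows = apply_type2(
    [{"customer_id": 1, "city": "Pune", "start_date": date(2023, 1, 1), "end_date": None}],
    1, "Delhi", date(2024, 6, 1),
)
type3_row = apply_type3({"customer_id": 1, "city": "Pune", "previous_city": None}, "Delhi")
print(type1_row, type2_rows, type3_row, sep="\n")
```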
Define goals, design architecture, and plan implementation for a data warehouse, selecting star, snowflake, or hybrid models, and addressing ETL, data quality, governance, security, and scalability.
Design an end-to-end ETL data pipeline that extracts from CRM, ERP, and web analytics sources into a data warehouse. Apply cleansing, transformations, and load data into dimension and fact tables.
Explore Unix, a foundational and secure operating system from the 1960s–70s, and its influence on macOS, Linux, servers, and data tools, enabling transferable skills across technology and data work.
Trace Unix history from late 1960s Bell Labs leaders Ken Thompson and Dennis Ritchie. Observe how C made it portable, inspiring BSD, Solaris, Linux, and macOS.
Navigate the Unix file system using cd and pwd, then view file details with ls -l and sort by modification time with ls -lt, reversing the order with the -r flag.
Explore essential Unix file and directory management, including mkdir, cd, pwd, ls, touch, rm, and copy and move operations for data engineering workflows.
Master three Unix file creation methods: touch, cat with redirection, and vi editor, and learn to write, view, and save content with proper commands.
Explore how user, group, and others gain read, write, and execute permissions, and use chmod to add, remove, or set these rights on files.
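The read, write, and execute bits for user, group, and others map onto an octal mode, which can be inspected without touching any real file. The sketch below uses Python's standard stat module; the starting mode 644 is only an example, and the comments name the chmod invocations each bit operation corresponds to.

```python
import stat


def permission_string(mode):
    """Render permission bits the way ls -l shows them, e.g. 'rw-r--r--'."""
    # S_IFREG marks a regular file; the leading type character is dropped.
    return stat.filemode(stat.S_IFREG | mode)[1:]


mode = 0o644                               # user rw-, group r--, others r--
with_user_execute = mode | stat.S_IXUSR    # same effect as chmod u+x
without_other_read = mode & ~stat.S_IROTH  # same effect as chmod o-r

print(permission_string(mode))
print(permission_string(with_user_execute))
print(permission_string(without_other_read))
```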
Learn how the chown and chgrp commands adjust file ownership and groups, with an example that sets Linda as owner and the admin group as the owning group.
Explore text processing tools using grep, sed, and awk to search patterns in logs, identify exceptions, and pinpoint root causes in Spark job failures.
Explore the stream editor for parsing and transforming text, including global replacements like replacing info with debug in files or Spark logs, a key data engineering tool.
Explore how a Unix process is created and managed by the kernel. Track its life cycle from created to ready, running, blocked, and terminated, using ps, top, and kill.
Learn how the ps command displays active processes, lists all processes with ps -e, uses head -5 for top items, and top for real-time CPU and memory usage monitoring.
Explore data compression and archiving with gzip and tar, and distinguish archival from compression, noting that compression reduces size while archiving groups files into a single file.
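The difference between archiving and compressing can be checked in memory. This sketch uses Python's tarfile and gzip modules instead of the command-line tools, and the two log files are made-up sample content.

```python
import gzip
import io
import tarfile


def make_tar(files):
    """Bundle {name: bytes} into one uncompressed tar archive held in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Hypothetical, highly repetitive log content (4000 bytes per file).
files = {"a.log": b"INFO ok\n" * 500, "b.log": b"INFO ok\n" * 500}

tar_bytes = make_tar(files)          # archiving: one file, no size reduction
gz_bytes = gzip.compress(tar_bytes)  # compression: same content, fewer bytes

with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as archive:
    names = archive.getnames()

print(names, len(tar_bytes), len(gz_bytes))
```

The archive is slightly larger than the combined inputs because tar adds headers and padding; only the gzip step shrinks it.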
Learn how to securely transfer files using SCP, securely log in with SSH, and access remote storage with a network file system (NFS), including mounting remote paths.
Master regular expressions for matching and processing in engineering, using grep to extract email IDs from emails.txt that start with lowercase letters, have @, and end with .com or .org.
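One possible pattern for the rule described above (starts with a lowercase letter, contains @, ends in .com or .org) is sketched here with Python's re module rather than grep. The exact pattern used in the lesson is not shown in this listing, so treat this one and the sample lines as assumptions.

```python
import re

# Lowercase first letter, then word characters or dots, an @,
# a domain, and a .com or .org ending.
EMAIL_PATTERN = re.compile(r"^[a-z][\w.]*@[\w.-]+\.(?:com|org)$")


def extract_emails(lines):
    """Return the lines that match the pattern, with surrounding whitespace removed."""
    return [line.strip() for line in lines if EMAIL_PATTERN.match(line.strip())]


# Hypothetical contents of emails.txt.
sample_lines = [
    "alice@example.com\n",
    "Bob@example.com\n",   # rejected: starts with an uppercase letter
    "carol@site.org\n",
    "dave@site.net\n",     # rejected: ends in .net
    "no-at-sign.com\n",    # rejected: no @
]
matches = extract_emails(sample_lines)
print(matches)
```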
Write and run your first shell script with vi, saving hello world in the scripts directory. Compare sh invocation with dot-slash (./) execution, make the script executable with chmod, and learn to use variables.
Explore functions and modular scripting in Unix by defining a square calculator function that uses the first argument, demonstrates reusable blocks, and prints the square of a given number.
Learn how to work with files in shell scripts by writing and appending to myfile.txt, checking the return code, and displaying file contents.
Master cron job scheduling with crontab using five fields for minutes, hours, day, month, and weekday. See a hands-on example executing a hello world script and logging to cron output.txt.
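The five schedule fields come first on a crontab line, followed by the command. The sketch below only splits such a line into named parts to show the layout; it does not schedule anything, and the script and log paths are hypothetical.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def parse_cron_line(line):
    """Split a crontab entry into its five schedule fields and the command."""
    parts = line.strip().split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]


# Hypothetical entry: run a hello world script every five minutes and append
# both standard output and errors to a log file.
entry = "*/5 * * * * /home/user/scripts/hello.sh >> /tmp/cron_output.txt 2>&1"
schedule, command = parse_cron_line(entry)
print(schedule)
print(command)
```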
Implement Unix-based best practices for data engineering, including version control with Git, modular scripts, cron automation, graceful error handling, log redirection, nohup, and minimal permissions using oc, z, and grip.
Explore the history, features, and architecture of Python and its essential role for data engineers, from readability to data science and data workflows.
Explore Python's intuitive syntax and interpreted, line-by-line execution. Harness its cross-platform reach, extensive standard library, dynamic typing, multiple paradigms, vast ecosystems, and strong community for readable and maintainable code.
Explore how Python code runs from parsing source into an abstract syntax tree to bytecode execution by the Python virtual machine, and compare interpreters with compilers.
Python powers data engineering with simple, readable code that maintains complex data pipelines, using Pandas, NumPy, and Spark or Dask for scalable ETL and data source integration.
Install IntelliJ IDEA community edition on macOS, set up a Python project with a virtual environment and Python 3.12, and use Copilot for debugging, code completion, and data engineering productivity.
Set up the project structure with resources and source directories, visualize the sales daily transaction exports CSV using a CSV plugin, then read and process the data.
Master Python basics, including syntax, variables, data types, operations, and expressions, and learn to define and call functions, pass arguments, return values, import modules, and read files.
Learn to define Python variables with dynamic typing. Practice adding A and B, and concatenate outputs by converting integers to strings.
Learn how Python methods modularize code by defining functions with parameters, calling them to sum numbers, and returning values to the caller.
Learn to read a CSV using the read file method in the file utils module of the process files package, skip the header with an iterator, and print each row.
Explore Python lists, tuples, and sets by converting a CSV reader to a list, iterating through items, and using built-in methods like append, pop, and index.
Learn to use if, elif, and else in Python to categorize customers by order value. Apply the and/or logical operators and proper indentation to create clear decision blocks.
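A decision block of this kind might look like the sketch below. The thresholds of 1000 and 250, the sample orders, and the function signature are assumptions chosen for illustration; the category names follow the valued, medium, and least valued groups the course mentions.

```python
def categorize_customers(orders, high=1000, low=250):
    """Split (name, order_value) pairs into three lists by order value."""
    valued, medium, least_valued = [], [], []
    for name, order_value in orders:
        if order_value >= high:
            valued.append(name)
        elif order_value >= low and order_value < high:
            medium.append(name)
        else:
            least_valued.append(name)
    return valued, medium, least_valued


# Hypothetical orders.
orders = [("Asha", 1200), ("Ravi", 300), ("Mei", 40)]
valued, medium, least_valued = categorize_customers(orders)
print(valued, medium, least_valued)
```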
Organize your Python project with a utils folder and a categorize customers function that returns valued, medium, and least valued lists, then import and use it in main.
Explore Python tuples as immutable data structures, distinguish them from lists by round brackets and indexing, and learn when to use tuples for fixed-size data while considering memory use.
Learn to map customer names to order values using Python dictionaries, with a get names and order value function and enumerate to pair indices with values in ETL workflows.
Open the file in write mode and dump the customer dictionary to a JSON file using Python's json library with four-space indentation, demonstrating ETL by loading CSV data into JSON.
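A compact version of this CSV-to-JSON step is sketched below. The column names, sample rows, and helper names are hypothetical, and the file is written to a temporary directory so the example leaves nothing behind.

```python
import csv
import io
import json
import os
import tempfile

# Hypothetical CSV export: customer name and order value.
CSV_TEXT = "customer,order_value\nAsha,1200\nRavi,300\n"


def csv_to_dict(csv_text):
    """Extract and transform: map each customer name to a numeric order value."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return {name: float(value) for name, value in reader}


def dump_customers(customers, path):
    """Load: write the dictionary as JSON with four-space indentation."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(customers, handle, indent=4)


customers = csv_to_dict(CSV_TEXT)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "customers.json")
    dump_customers(customers, path)
    with open(path, encoding="utf-8") as handle:
        loaded = json.load(handle)
print(loaded)
```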
Wrap Python file reads in a try-except block to gracefully handle file not found errors, returning a blank list and printing a clear message, and learn common built-in exceptions.
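The pattern described above can be sketched as follows. The function name and the message wording are illustrative, not taken from the course code.

```python
import csv


def read_rows(path):
    """Return the data rows of a CSV file, or an empty list if the file is missing."""
    try:
        with open(path, newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header if there is one
            return list(reader)
    except FileNotFoundError:
        print(f"File not found: {path}")
        return []


rows = read_rows("this_file_does_not_exist.csv")
print(rows)
```

Catching only FileNotFoundError keeps other problems, such as permission errors, visible instead of silently returning an empty list.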
Big data enables real-time analytics, predictive modeling, and personalized insights across healthcare, finance, retail, and transportation, powering fraud detection, risk management, and optimized operations.
Explore the key challenges in big data processing—volume, variety, velocity, veracity, security, and scalability—and how tools like Hadoop HDFS, Apache Kafka, and Spark Streaming address them.
Explore how data sources, data storage, data processing, ETL, analytics, visualization, data management, and data security form the big data ecosystem, with cloud platforms enabling scalable development and deployment.
Understand how traditional data processing struggles with huge data volumes and batch insights, while big data processing handles real-time streams and any data type, with horizontal scaling.
Learn to manage files in HDFS by creating a data directory, uploading with put or copy from local, downloading with get, copying directories, and deleting files with verification.
Discover Apache Hadoop and its core components, including HDFS, MapReduce, YARN, and common utilities. Analyze HDFS architecture, the MapReduce framework, how MapReduce jobs execute, and YARN architecture.
Explore Apache Hadoop, a distributed storage and processing framework with HDFS, MapReduce, YARN, and Hadoop Common, delivering scalable, fault-tolerant, cost-effective, flexible data handling for warehousing, analytics, and machine learning.
Explore the HDFS architecture with a primary name node, secondary name node, and data nodes storing 128 MB blocks replicated across nodes for client access and centralized metadata management.
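A quick calculation shows what 128 MB blocks and replication mean for a single file. The replication factor of 3 used below is the usual HDFS default and is an assumption here, since this listing does not state the value used in the course.

```python
import math

BLOCK_SIZE_MB = 128  # block size named in the lesson


def hdfs_footprint(file_size_mb, replication=3):
    """Return (blocks, block replicas, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


footprint = hdfs_footprint(1000)
print(footprint)
```

For a 1000 MB file this gives 8 blocks (seven full blocks plus one partial), 24 block replicas spread across data nodes, and 3000 MB of raw storage.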
Explore the five Hadoop or Spark setups: standalone, pseudo distributed, fully distributed, cloud based, and Hadoop as a service, and identify ideal use cases.
Install and configure Hadoop locally, set fs.defaultFS to localhost:9000, enable YARN, format the NameNode, start daemons, and verify via hdfs and the NameNode and ResourceManager web UIs.
Explore Hive, a data warehousing solution with a SQL-like language on Hadoop to query large datasets in HDFS or S3, featuring HiveServer2, Metastore, ACID, and replication.
Explore how Hive analyzes e-commerce clickstream data, processes electronic health records to spot care trends, detects fraud through real-time financial transactions, and monitors telecom network logs to optimize performance.
Create a Hive table with external or managed options, define columns with comments, and configure partitioning, bucketing, row format, and file format for efficient storage and retrieval.
Explore Hive file formats, including text files, sequence file, RC file, Avro, ORC, and Parquet, and learn how to choose the right format for schema evolution, compression, and analytical workloads.
Create a managed Hive table named employee with id, name, and salary; load data from a local file, then query using where, order by, group by, and a join condition.
Explore how Hive user defined functions enable custom logic on top of tables to query and transform data, with built-in and custom UDFs and the difference from aggregate functions.
Compare Hadoop MapReduce and Spark for big data processing, highlighting performance and fault tolerance. Spark's in-memory processing enables faster analytics and real-time insights.
Submit a Spark job to trigger the driver to build a DAG, split it into stages, and assign tasks to executors across the cluster, reflecting wide transformations and shuffles.
Discover how SparkSession unifies SparkContext, SQLContext, and HiveContext to run data frames, RDDs, and SQL queries via Spark shell, PySpark, Spark SQL, and the cluster manager.
Explore Spark data frames as named-column data sets and compare datasets to RDDs, highlighting fault tolerance through lineage, distributed partitions, and parallel processing.
Navigate the Spark UI to debug applications by inspecting the context web UI and its tabs: jobs, stages, storage, environment, and executors; monitor status and executor health.
Learn Spark coding in Scala, PySpark, Java, and R; practice Scala by creating a data frame from records, naming columns, and displaying it with df.show, using Spark context and session.
Set up a PySpark project in IntelliJ IDEA community edition, install PySpark, initialize a SparkSession, and use data frames and Spark SQL to filter and query data.
Enable rapid, reliable data pipelines by implementing CI/CD, automated testing, and collaborative deployment across environments, reducing downtime and human error.
Develop, validate, and deploy data pipelines through a five-stage CI/CD workflow (development, CI validation, CD deployment to staging and production, and production go-live) featuring version control, code reviews, tests, and compliance checks.
Sign up on GitHub, create a private repository named zozo, add a readme and gitignore, select the Apache license, set main as default branch, and use pull requests.
Clone a Git repository from remote to local by installing Git, creating a local directory, and running git clone with the remote URL, then handle private versus public access.
Learn CI/CD tooling for data engineers, with GitHub, GitLab, Bitbucket, Jenkins, CircleCI, and GitHub Actions. Cover deployment and testing with Kubernetes, Docker, Terraform, PyTest, and Great Expectations.
See how CI/CD automates testing and deployment to speed up data pipeline releases and improve reliability across environments. Understand challenges like data dependencies, test data availability, and cross-environment consistency.
Learn the key data quality dimensions: accuracy, consistency, uniqueness, timeliness, validity, and completeness, and how to ensure data stays true, non-contradictory, up-to-date, and complete.
Define data quality metrics with KPIs for accuracy, error rate, and uniqueness. Measure data accuracy rate, duplicate record rate, and distinct value count to gauge overall data quality.
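The three measures named above reduce to simple ratios and counts. In the sketch below, the record layout, the validity rule, and the choice to count every repeat of a key as a duplicate are assumptions for illustration.

```python
def quality_metrics(records, key, is_valid):
    """Compute accuracy rate, duplicate record rate, and distinct value count."""
    total = len(records)
    distinct = len({record[key] for record in records})
    valid = sum(1 for record in records if is_valid(record))
    return {
        "accuracy_rate": valid / total,
        "duplicate_rate": (total - distinct) / total,
        "distinct_count": distinct,
    }


# Hypothetical records: id 2 appears twice and one amount is negative.
records = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": 25.5},
    {"id": 2, "amount": 25.5},
    {"id": 3, "amount": -4.0},
]
metrics = quality_metrics(records, key="id", is_valid=lambda r: r["amount"] >= 0)
print(metrics)
```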
Learn data profiling to understand data structure, content, and quality, its purpose, and the techniques and tools to reveal missing values, errors, and cardinality for cleansing or analysis.
Learn how data cleansing corrects inaccurate, incomplete, or inconsistent data, and data standardization brings diverse formats, such as JSON and XML, into a single CSV, improving reliability for decision making.
Explore data cleansing and standardization techniques, including formal standardization of dates, addresses, and data type conversion, and apply cleansing methods for missing values, duplicates, error correction, outliers, and data validation.
Apply data governance principles across stewardship, ownership, quality, and metadata management, then ensure security, privacy, standards, master data management, and architecture for compliant integration and risk.
Explore centralized, decentralized, hybrid, command and control, and collaborative data governance models, and compare their pros and cons for consistent policies, agile responsiveness, and accountability.
Explore the essentials of data privacy laws, including GDPR and HIPAA, and how they govern the collection, use, storage, and sharing of personal information for data engineers.
Explore how data engineers enforce governance and data quality through policies and metadata management. Implement security, compliance, profiling, cleansing, monitoring, and lineage tracking and versioning for auditing.
Explore the four cloud architectures—public, private, hybrid, and multi-cloud—and learn how pay-as-you-go public resources, private cloud data security, and workloads moving between public and private clouds shape modern deployments.
Explore AWS EC2 and AWS Lambda to provision resizable compute capacity and run code on demand, comparing virtual servers with serverless execution triggered by data changes or user events.
Create a function in the AWS Lambda console and choose the Python runtime. Deploy, then test with a new test event and review logs to confirm a 200 status.
Explore elastic block store (EBS) and examine a volume attached to an EC2 instance, then learn to create and attach a new volume within the same region.
Create an AWS RDS MySQL database using the free tier and single availability zone, note endpoint URL, install MySQL client on the EC2 instance, then connect and run show databases.
Learn how AWS IAM manages access and identity, including users, groups and roles, and how AWS Secrets Manager securely stores passwords, API keys, and database credentials.
Create IAM policies for EC2 and S3, assemble them into a user group, and add users to grant access; assign a role enabling EC2 to access S3.
Explore AWS monitoring with CloudWatch to track metrics, logs, and events, set alarms, and trigger SNS notifications for scaling; learn CloudFormation to model and deploy resources via templates.
Master data modeling and architecture to organize and structure data for quick, informed decisions, ensuring scalable, reliable data flows across tools and systems.
Introduction to data architecture explains how data flows through a system and serves as the blueprint for storage, access, and secure use, contrasting architecture with data models.
Identify the six components of data architecture—data sources, storage, integration, governance, processing frameworks, and data access—and illustrate tools like ETL, data lakes, Collibra, Hadoop, Spark, and APIs.
Explore entity relationship modeling to visualize entities, attributes, and relationships. Learn about ER diagrams and their components, with examples like a customer placing orders in a one-to-many relationship.
First normal form ensures each column holds a single value and each record is unique by removing repeating groups, simplifying library borrowing lookups.
Learn how second normal form eliminates partial dependencies after achieving 1NF, by splitting data into book, borrower, and book-borrower tables to remove redundancy.
Identify the criteria for third normal form: the table must be in 2NF and have no transitive dependencies between non-key attributes and the primary key, as shown by separating author and library data.
Explain Boyce-Codd normal form, in which every determinant must be a candidate key. In the example, author determines genre even though author is not a candidate key, which violates BCNF and prompts a book branch table and final tables: author, book, borrower.
Explore denormalization, the process of intentionally introducing redundancy by merging previously normalized tables to boost query performance. Understand trade-offs of increased storage and anomalies during insert, update, or delete.
Explore data storage and retrieval patterns, including data lakes, data warehouses, and data marts, containing raw data in original formats, with ETL/ELT, partitioning, indexing, and sharding for analytics.
Discover Lambda and Kappa architectures, the five v's of Hadoop, and the key Hadoop ecosystem components. See how batch and stream processing yield accurate insights and real-time results.
Explore public, private, hybrid, and multi-cloud architectures, and learn how pay-as-you-go cloud services from providers like AWS, Azure, and Google Cloud support secure, compliant data workloads.
Ingest raw JSON credit card transactions from S3, clean and validate data, mask PII, compute daily merchant summaries, and store in a data lake and MySQL for GDPR compliance.
Explore the credit card transaction life cycle from authorization to clearing and settlement, and learn how data engineering captures and prepares data for analysis and predictive modeling.
Design and implement a data architecture that ingests JSON data into S3, triggers a Lambda-driven EMR Spark ETL, masks PII, aggregates by merchant and date, and loads results to MySQL.
Design the Daily Transaction Store bucket on S3 with raw, cleaned, and aggregated data in year/month/day folders; use Parquet for cleaned and aggregated data, including merchant daily and customer spend.
Set up a git repository on GitHub, clone locally, and create a Python virtual environment. Open the project in IntelliJ, add and commit files, then push to main.
Activate the virtual environment in IntelliJ, install PySpark, configure the Python interpreter, run a Spark job to create a Spark session and data frames, and begin the ETL phase.
Describe how the Config.py class loads credentials from credentials.env, enforces validation rules, masks card numbers, and exposes get jdbc url and get MySQL properties for MySQL access.
The data cleaner initializes validation rules, trims strings, uppercases currency, removes non-numeric card characters, rounds amounts, formats timestamps, derives transaction date and hour, validates data, and returns valid and invalid frames.
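Several of these cleaning steps can be shown on a single record in plain Python. This is not the course's Spark-based data cleaner: the field names, the sample transaction, and the choice to mask all but the last four card digits are assumptions made for the sketch.

```python
import re


def clean_transaction(raw):
    """Apply trim, uppercase, digit-only, masking, and rounding steps to one record."""
    card_digits = re.sub(r"\D", "", raw["card_number"])  # drop non-numeric characters
    return {
        "merchant": raw["merchant"].strip(),
        "currency": raw["currency"].strip().upper(),
        "card_number": card_digits,
        # Assumed masking rule: hide everything except the last four digits.
        "masked_card": "*" * (len(card_digits) - 4) + card_digits[-4:],
        "amount": round(float(raw["amount"]), 2),
    }


# Hypothetical raw transaction showing the kinds of issues being fixed.
raw = {
    "merchant": "  Corner Cafe ",
    "currency": "usd",
    "card_number": "4111-1111 1111 1111",
    "amount": "12.3456",
}
cleaned = clean_transaction(raw)
print(cleaned)
```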
Use the data writer class to write data frames to S3 (Parquet or JSON), local JSON, and MySQL in overwrite mode, including invalid transactions, summaries, and data quality reports.
Identify data quality issues in sample transactions, such as currency in lowercase, excessive decimal places, non-numeric card numbers, and duplicates, and review DDL for creating catalyst database in MySQL Workbench.
Submit the Spark job to a local cluster, validate and mask data, generate daily merchant summaries and risk metrics, write transactions to MySQL, and generate a log report.
Inspect processed transactions (999 records) and a single invalid entry from a duplicate id 1006, then review daily merchant summaries and hourly patterns to debug and fix the data pipeline.
Master Data Engineering: Concepts to Production is a comprehensive course designed to transform beginners into proficient data engineers. Starting with foundational concepts (data lifecycle, roles, and tools), the course progresses to hands-on skills in SQL, ETL processes, UNIX scripting, and Python programming for automation and data manipulation. Dive into big data ecosystems with Hadoop and Spark, learning distributed processing and real-time analytics. Master data modeling (star and snowflake schemas) and architecture design for scalable systems.
Explore cloud technologies (AWS) to deploy storage, compute, and serverless solutions. Build robust data pipelines and orchestrate workflows, while integrating CI/CD practices for automated testing and deployment. Tackle data quality methods (validation, cleansing) and data governance principles (compliance, metadata management) to ensure reliability.
Each chapter combines theory with real-world projects: designing ETL workflows, optimizing Spark jobs, and deploying cloud-based pipelines. By the end, you’ll confidently handle end-to-end data solutions, from raw data ingestion to production-ready systems. Ideal for aspiring data engineers, analysts, or IT professionals seeking to upskill.
Prerequisites: Basic programming knowledge.
Tools covered: Spark, Hadoop, AWS, SQL, Python, UNIX, Git, IntelliJ IDEA.
Outcome: Build a portfolio of projects showcasing your ability to solve complex data challenges.