
Extract YouTube data via the YouTube API and load it into a Postgres data warehouse via ELT with Python, then perform data quality checks with Soda and enable CI/CD with Docker.
Build the code from the ground up with the class, using GitHub to store and version the project, and reference the final code as you move into data extraction.
Learn to create and secure a YouTube Data API v3 key via the Google Developer Console, set up a project, restrict the key to public data, and manage credentials.
Set up a virtual environment to isolate Python projects and avoid conflicts between versions like 3.4 and 3.10. Activate it, install packages with pip, and use .gitignore to exclude the venv and __pycache__ directories.
Develop a Python script to fetch a YouTube channel's playlist ID via the YouTube API using requests, handle errors with try-except, and build modular code ready for ELT, Docker, and Airflow.
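As a rough illustration of that step, here is a minimal sketch that looks up a channel's uploads playlist ID. It assumes the channel is identified by its handle through the `forHandle` parameter of the `channels` endpoint. The names `get_playlist_id`, `http_get_json`, and the injectable `fetch` argument are hypothetical and not taken from the course code.

```python
from typing import Callable

CHANNELS_URL = "https://www.googleapis.com/youtube/v3/channels"


def http_get_json(url: str, params: dict) -> dict:
    """Default transport: GET the URL with requests and return the decoded JSON."""
    import requests  # imported lazily so the parsing logic can be tested offline

    try:
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, timeouts, HTTP error statuses and bad JSON.
        raise RuntimeError("YouTube API request failed") from exc


def get_playlist_id(
    api_key: str,
    channel_handle: str,
    fetch: Callable[[str, dict], dict] = http_get_json,
) -> str:
    """Return the ID of the playlist that holds all uploads of a channel."""
    params = {
        "part": "contentDetails",
        "forHandle": channel_handle,  # assumption: lookup by channel handle
        "key": api_key,
    }
    payload = fetch(CHANNELS_URL, params)
    try:
        return payload["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
    except (KeyError, IndexError, TypeError) as exc:
        # The response did not have the expected shape (e.g. unknown handle).
        raise ValueError(
            f"No uploads playlist found for handle {channel_handle!r}"
        ) from exc
```

Keeping the HTTP call behind a small `fetch` function is one way to keep the module easy to test and to reuse later inside a container or an orchestrated task.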
Develop a Python function to fetch unique video IDs from a playlist using the playlistItems resource, handling pagination with next-page tokens and robustly parsing contentDetails.videoId.
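The pagination loop described here could be sketched roughly as follows. The function name `get_video_ids` and the `fetch` callable (anything that takes a URL and a params dict and returns the decoded JSON response) are assumptions for illustration, not the course's actual code.

```python
from typing import Callable, Optional

PLAYLIST_ITEMS_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def get_video_ids(
    api_key: str,
    playlist_id: str,
    fetch: Callable[[str, dict], dict],
) -> list[str]:
    """Collect unique video IDs from a playlist, following nextPageToken."""
    video_ids: list[str] = []
    seen: set[str] = set()
    page_token: Optional[str] = None

    while True:
        params = {
            "part": "contentDetails",
            "playlistId": playlist_id,
            "maxResults": 50,  # largest page size the endpoint accepts
            "key": api_key,
        }
        if page_token:
            params["pageToken"] = page_token

        payload = fetch(PLAYLIST_ITEMS_URL, params)

        for item in payload.get("items", []):
            # Use .get() so a malformed item is skipped instead of raising.
            video_id = item.get("contentDetails", {}).get("videoId")
            if video_id and video_id not in seen:
                seen.add(video_id)
                video_ids.append(video_id)

        # The last page carries no nextPageToken, which ends the loop.
        page_token = payload.get("nextPageToken")
        if not page_token:
            break

    return video_ids
```

In practice `fetch` might be a thin wrapper around `requests.get(url, params=params).json()`; passing it in as an argument makes the loop testable without network access.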
Implement a batch-based function that maps video IDs to seven variables drawn from the snippet, contentDetails, and statistics parts: batch the IDs, build the API URL, fetch the data, and accumulate the results.
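A minimal sketch of the batching idea is shown below. The seven field names are an assumption (the summary does not list them), and `extract_video_data`, `batched`, and the `fetch` callable are hypothetical names chosen for the example.

```python
from typing import Callable, Iterator

VIDEOS_URL = "https://www.googleapis.com/youtube/v3/videos"


def batched(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def extract_video_data(
    api_key: str,
    video_ids: list[str],
    fetch: Callable[[str, dict], dict],
    batch_size: int = 50,  # the videos endpoint accepts up to 50 IDs per call
) -> list[dict]:
    """Fetch details for all video IDs, one API call per batch."""
    results: list[dict] = []
    for batch in batched(video_ids, batch_size):
        params = {
            "part": "snippet,contentDetails,statistics",
            "id": ",".join(batch),
            "key": api_key,
        }
        payload = fetch(VIDEOS_URL, params)
        for item in payload.get("items", []):
            snippet = item.get("snippet", {})
            content = item.get("contentDetails", {})
            stats = item.get("statistics", {})
            # Assumed set of seven variables; missing fields become None.
            results.append({
                "video_id": item.get("id"),
                "title": snippet.get("title"),
                "publishedAt": snippet.get("publishedAt"),
                "duration": content.get("duration"),
                "viewCount": stats.get("viewCount"),
                "likeCount": stats.get("likeCount"),
                "commentCount": stats.get("commentCount"),
            })
    return results
```

Statistics such as likeCount can be absent when a channel hides them, which is why the sketch reads every field with `.get()` rather than indexing directly.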
Build a Python data loading script to read JSON API data from the data directory with a load_path function, using JSON parsing and logging for robust error handling.
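One possible shape for such a loader is sketched below. Only the name `load_path` comes from the summary above; its signature, the log messages, and the decision to re-raise after logging are assumptions.

```python
import json
import logging
from pathlib import Path
from typing import Any, Union

logger = logging.getLogger(__name__)


def load_path(file_path: Union[str, Path]) -> Any:
    """Read a JSON file and return the parsed data, logging any failure."""
    path = Path(file_path)
    try:
        with path.open("r", encoding="utf-8") as handle:
            data = json.load(handle)
    except FileNotFoundError:
        logger.error("File not found: %s", path)
        raise
    except json.JSONDecodeError as exc:
        logger.error("Invalid JSON in %s: %s", path, exc)
        raise
    logger.info("Loaded JSON data from %s", path)
    return data
```

Re-raising after logging lets a caller (for example an orchestrated task) fail visibly instead of continuing with missing data.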
Learn to implement data quality tests using Soda Core, define checks in YAML, and run soda scan against Postgres in a Dockerized Airflow setup.
Commit and push all changes before the CI/CD section to version your work: stage any untracked changes, write a meaningful commit message, and push to update the GitHub repository ahead of the workflow setup.
You have successfully finished this data engineering course; celebrate your achievement and apply what you learned in the workplace, then consider leaving a rating about your experience.
Data engineering is the backbone of modern data-driven companies. To excel, you need experience with the tools and processes that power data pipelines in real-world environments. This course gives you practical, project-based learning with the following tools: PostgreSQL, Python, Docker, Airflow, Postman, Soda, and GitHub Actions. I will guide you through using each of them.
What you will learn in the course:
Python for Data Engineering: Build Python scripts that extract data from APIs (using Postman to interact with them), load it into the data warehouse, and transform it (ELT). This course uses Python 3.10.
SQL for Data Pipelines: Use PostgreSQL as a data warehouse. Interact with the data warehouse using both psql and DBeaver.
Docker for Containerized Deployments: Discover how to containerize data applications using Docker, making your data pipelines portable and easy to scale.
Airflow for Workflow Automation: Master the basics of orchestrating and automating your data workflows with Apache Airflow, a must-have tool in data engineering. In this course we use Airflow version 2.9.2.
Testing and Data Quality Assurance: Understand how to perform unit, integration, and end-to-end (E2E) tests using a combination of pytest and Airflow's DAG tests to validate your data pipelines. Implement data quality tests using Soda to ensure your data meets business and technical requirements.
CI/CD for Automated Testing & Deployment: Learn to automate deployment pipelines using GitHub Actions to ensure smooth, continuous integration and delivery.