Writing production-ready ETL pipelines in Python / Pandas

Rating: 4.3 (893 reviews)

Learn how to write professional ETL pipelines using best practices in Python and Data Engineering.

Created by Jan Schwarzlose

Last updated 7/2022

English

What you'll learn

How to write professional ETL pipelines in Python.
Steps to write production-level Python code.
How to apply functional programming in Data Engineering.
How to create a proper object-oriented code design.
How to use a meta file for job control.
Coding best practices for Python in ETL/Data Engineering.
How to implement a pipeline in Python extracting data from an AWS S3 source, transforming and loading the data to another AWS S3 target.

Course content

8 sections • 79 lectures • 7h 3m total length

Course Introduction (3:26)
[Important!] Updates (0:20)
Task Description (4:01)
Extract Deutsche Börse public data from S3 buckets and generate a weekly Parquet report with opening, closing, minimum, and maximum prices, daily volume, and change from the previous close.
Production Environment (2:03)
Task Steps (3:49)
Build a production-ready ETL pipeline in Python with pandas, progressing from virtual environments and Jupyter exploration to a tested, dockerized, well-structured design that uses functional and object-oriented patterns.
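
The report from the task description can be sketched roughly as follows. This is a minimal illustration and not the course's own solution: the column names (ISIN, Date, Time, StartPrice, EndPrice, MinPrice, MaxPrice, TradedVolume), the output column names, and the sample ISIN are assumptions made for this example.

```python
import pandas as pd


def build_report(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate minute-level rows to one row per ISIN and day (column names are assumed)."""
    if df.empty:
        return df
    # Sort so that "first" and "last" pick the opening and closing minute of each day
    df = df.sort_values(["ISIN", "Date", "Time"])
    report = df.groupby(["ISIN", "Date"], as_index=False).agg(
        opening_price=("StartPrice", "first"),
        closing_price=("EndPrice", "last"),
        minimum_price=("MinPrice", "min"),
        maximum_price=("MaxPrice", "max"),
        daily_traded_volume=("TradedVolume", "sum"),
    )
    # Percentage change of each close against the previous day's close, per ISIN
    prev_close = report.groupby("ISIN")["closing_price"].shift(1)
    report["change_prev_closing_pct"] = (
        (report["closing_price"] - prev_close) / prev_close * 100
    ).round(2)
    return report


# Hypothetical sample: one instrument, two trading days, two minutes each
sample = pd.DataFrame(
    {
        "ISIN": ["DE0001"] * 4,
        "Date": ["2022-07-04", "2022-07-04", "2022-07-05", "2022-07-05"],
        "Time": ["08:00", "08:01", "08:00", "08:01"],
        "StartPrice": [10.0, 10.5, 11.0, 11.5],
        "MaxPrice": [10.6, 10.8, 11.6, 12.2],
        "MinPrice": [9.9, 10.4, 10.9, 11.4],
        "EndPrice": [10.5, 10.0, 11.5, 12.0],
        "TradedVolume": [100, 200, 300, 400],
    }
)
report = build_report(sample)
print(report)
```

The first day of each instrument has no previous close, so its change value is empty. Writing the result to Parquet (for example with `DataFrame.to_parquet`) is left out here because it needs an additional Parquet engine.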

Why to use a virtual environment? (4:02)
Virtual Environment Setup (6:10)
AWS Setup (6:53)
Register for an AWS account, create an IAM user with programmatic access, attach S3 full access, download credentials, configure environment variables, install the AWS CLI, and access the public dataset.
Understanding the source data (10:10)
Quick and Dirty: Read multiple files (12:27)
Quick and Dirty: Transformations (15:48)
Quick and Dirty: Argument Date (9:57)
Learn to pass a date argument to a pandas ETL workflow, convert a string to datetime, compute the previous day, and filter S3 data by date for targeted reporting.
Quick and Dirty: Save to S3 (8:41)
Quick and Dirty: Code Improvements (8:27)
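
The date-argument step described above might look like the following sketch. It assumes that object keys start with a date folder such as `2022-07-04/`; the function name `keys_from_date` and the sample keys are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta


def keys_from_date(keys, arg_date, date_format="%Y-%m-%d"):
    """Keep keys whose leading date folder is on or after the day before arg_date."""
    # The previous day is included so a change from the previous close can be computed
    min_date = datetime.strptime(arg_date, date_format).date() - timedelta(days=1)
    return [
        key
        for key in keys
        if datetime.strptime(key.split("/")[0], date_format).date() >= min_date
    ]


# Hypothetical object keys, assumed to follow a "<date>/<file>" layout
all_keys = [
    "2022-07-01/a.csv",
    "2022-07-03/b.csv",
    "2022-07-04/c.csv",
    "2022-07-05/d.csv",
]
selected = keys_from_date(all_keys, "2022-07-04")
print(selected)
```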

Why a code design is needed? (2:42)
Functional vs. Object Oriented Programming (6:29)
Why Software Testing? (4:32)
Quick and Dirty to Functions: Architecture Design (0:57)
Quick and Dirty to Functions: Restructure Part 1 (15:38)
Quick and Dirty to Functions: Restructure Part 2 (12:32)
Restructure get_objects Intro (1:50)
Restructure get_objects Implementation (11:35)
Restructure the get_objects function by generating a date list and, for each date, retrieving files via a prefix-filtered list, using the adapter and application layers.
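
One way to picture the restructured get_objects is the sketch below. The signatures are assumptions for illustration: the application-layer function receives the adapter-layer listing function as a parameter, and an in-memory list of keys stands in for the real bucket.

```python
from datetime import date, timedelta
from typing import Callable, List


def generate_date_list(start: date, end: date) -> List[str]:
    """Return every day from start to end (inclusive) as an ISO date string."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def get_objects(
    list_files_in_prefix: Callable[[str], List[str]], start: date, end: date
) -> List[str]:
    """Application layer: collect the keys under each date prefix via the adapter function."""
    return [
        key
        for day in generate_date_list(start, end)
        for key in list_files_in_prefix(day)
    ]


# Stand-in for the adapter layer: an in-memory key list instead of a real bucket
fake_keys = [
    "2022-07-03/x.csv",
    "2022-07-04/y.csv",
    "2022-07-04/z.csv",
    "2022-07-06/w.csv",
]


def fake_list_files_in_prefix(prefix: str) -> List[str]:
    return [key for key in fake_keys if key.startswith(prefix)]


objects = get_objects(fake_list_files_in_prefix, date(2022, 7, 4), date(2022, 7, 5))
print(objects)
```

Passing the listing function in keeps the application logic independent of the storage backend, which also makes it easy to test without AWS access.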

Design Principles OOP (4:22)
More Requirements - Configuration, Meta Data, Logging, Exceptions, Entrypoint (11:30)
Meta Data: return_date_list Quick and Dirty (17:50)
Meta Data: return_date_list Function (14:17)
Meta Data: update_meta_file (12:12)
Code Design - Class design, methods, attributes, arguments (13:41)
Design a robust ETL workflow with object-oriented class design, single responsibility principles, and an interface for S3 bucket connectors, using a common function approach and a meta process.
Comparison Functional Programming and OOP (1:04)
Compare functional programming and object-oriented approaches for ETL jobs and data pipelines, noting their respective strengths. Prefer object-oriented programming for better encapsulation, extensibility, and reuse in larger projects.
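
The idea of a meta file for job control can be illustrated with a simplified sketch. It assumes the meta file is a small table with the columns `source_date` and `datetime_of_processing`; these column names, the function names, and the in-memory DataFrame (instead of a file in S3) are assumptions, and the course's own return_date_list and update_meta_file may work differently.

```python
from typing import List

import pandas as pd


def unprocessed_dates(meta: pd.DataFrame, candidates: List[str]) -> List[str]:
    """Return the candidate dates that are not yet recorded in the meta table."""
    done = set(meta["source_date"])
    return [day for day in candidates if day not in done]


def update_meta(meta: pd.DataFrame, processed: List[str], now: str) -> pd.DataFrame:
    """Append one row per newly processed date, stamped with the processing time."""
    if not processed:
        return meta
    new_rows = pd.DataFrame(
        {"source_date": processed, "datetime_of_processing": [now] * len(processed)}
    )
    return pd.concat([meta, new_rows], ignore_index=True)


# Hypothetical meta table: one date has already been processed
meta = pd.DataFrame(
    {"source_date": ["2022-07-04"], "datetime_of_processing": ["2022-07-04 23:00:00"]}
)
candidates = ["2022-07-04", "2022-07-05", "2022-07-06"]

todo = unprocessed_dates(meta, candidates)
updated = update_meta(meta, todo, now="2022-07-06 23:00:00")
print(todo)
print(updated)
```

With such a record, a scheduled job can skip dates it has already handled and pick up where the previous run stopped.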

Setting up Git Repository (4:54)
Setting up Python Project - Folder Structure (4:40)
Installation Visual Studio Code (2:31)
Setting up class frame - Task Description (1:54)
Setting up class frame - Solution S3 (11:57)
Setting up class frame - Solution meta_process (1:01)
Setting up class frame - Solution constants (1:00)
Setting up class frame - Solution custom_exceptions (0:23)
Setting up class frame - Solution xetra_transformer (2:54)
Set up and configure the Xetra transformer module in a Python ETL workflow, defining source and target configs, S3 bucket connectors, and core extract-transform-load steps.
Setting up class frame - Solution run (0:34)
Logging in Python - Intro (1:28)
Logging in Python - Implementation (12:50)
Create Pythonpath (6:00)
Python Clean Coding (3:34)
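
A class frame for the S3 connector could look like the sketch below. The class name and constructor are assumptions: here the connector receives a bucket-like object (for example a boto3 `Bucket` resource, whose `objects.filter(Prefix=...)` yields items with a `key` attribute), and a small fake bucket is used so that the example runs without AWS credentials. The method names are taken from the lecture titles; only list_files_in_prefix is filled in.

```python
import logging
from types import SimpleNamespace


class S3BucketConnector:
    """Frame for a bucket connector; the bucket object is injected (an assumption)."""

    def __init__(self, bucket):
        self._logger = logging.getLogger(__name__)
        self._bucket = bucket

    def list_files_in_prefix(self, prefix: str) -> list:
        """Return the keys of all objects under the prefix (empty list if none exist)."""
        return [obj.key for obj in self._bucket.objects.filter(Prefix=prefix)]

    def read_csv_to_df(self, key: str):
        raise NotImplementedError  # filled in later in the course

    def write_df_to_s3(self, data_frame, key: str):
        raise NotImplementedError  # filled in later in the course


class FakeObjects:
    """Minimal stand-in that mimics the objects.filter(Prefix=...) call."""

    def __init__(self, keys):
        self._keys = keys

    def filter(self, Prefix):
        return [SimpleNamespace(key=k) for k in self._keys if k.startswith(Prefix)]


fake_bucket = SimpleNamespace(
    objects=FakeObjects(["2022-07-04/a.csv", "2022-07-04/b.csv", "2022-07-05/c.csv"])
)
connector = S3BucketConnector(fake_bucket)
files = connector.list_files_in_prefix("2022-07-04/")
missing = connector.list_files_in_prefix("1999-01-01/")
print(files, missing)
```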

list_files_in_prefix - Thoughts (4:11)
Implement the list_files_in_prefix method in the S3 bucket connector, detailing input cases (prefix string, existing or missing) and outputs, exception behavior, and logging considerations.
list_files_in_prefix - Implementation (2:21)
list_files_in_prefix - Linting Intro (1:15)
Explore linting, which analyzes Python code to flag errors and style issues, and learn how git hooks and continuous integration keep ETL pipelines clean.
list_files_in_prefix - Pylint (4:46)
list_files_in_prefix - Unit Testing Intro (2:58)
list_files_in_prefix - Unit Test Specification (3:12)
list_files_in_prefix - Unit Test Implementation 1 (14:49)
list_files_in_prefix - Unit Test Implementation 2 (14:30)
Task Description - Writing Methods (2:02)
Solution - read_csv_to_df - Implementation (0:38)
Solution - read_csv_to_df - Unit Test Implementation (2:15)
Solution - write_df_to_s3 - Implementation (2:41)
Solution - write_df_to_s3 - Unit Test Implementation (3:38)
Solution - update_meta_file - Implementation (2:01)
Solution - update_meta_file - Unit Test Implementation (6:29)
Solution - return_date_list - Implementation (0:24)
Solution - return_date_list - Unit Test Implementation (4:50)
Solution - extract - Implementation (1:27)
Solution - extract - Unit Test Implementation (4:30)
Solution - transform_report1 - Implementation (0:35)
Implement a guard clause in transform_report1 that returns an empty data frame when the input is empty, skipping all transformations; refine the notebook code with comments and logger info messages.
Solution - transform_report1 - Unit Test Implementation (2:20)
Develop unit tests for transform_report1 in a pandas ETL workflow, validating via assertions that an empty input dataframe yields an empty result and that a valid input produces the expected report dataframe.
Solution - load - Implementation (0:43)
Refine the load method by cleaning notebook code, adding comments and logging. Build an update list with a list comprehension to extract dates.
Solution - load - Unit Test Implementation (1:32)
Solution - etl_report1 - Implementation (0:34)
Solution - etl_report1 - Unit Test Implementation (1:48)
Integration Tests - Intro (0:28)
Learn to write and run integration tests for ETL pipelines in Python, validating how modules interact with real interfaces without mocks, while reusing suitable unit tests.
Integration Tests - Test Specification (1:04)
Integration Tests - Implementation (12:09)
Entrypoint run - Implementation (12:22)
Launch the entrypoint run.py with the config path, create S3 bucket connectors for source and target, and run the Xetra ETL job while checking logs.
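
The guard clause in transform_report1 and its two unit tests can be sketched as follows. The transformation body is a deliberately tiny stand-in (one closing price per ISIN), and the column names are assumptions; only the guard-clause pattern and the test structure are the point here.

```python
import unittest

import pandas as pd


def transform_report1(data_frame: pd.DataFrame) -> pd.DataFrame:
    """Stand-in transformation with a guard clause for empty input."""
    if data_frame.empty:
        # Guard clause: nothing to transform, so skip every later step
        return data_frame
    return data_frame.groupby("ISIN", as_index=False).agg(
        closing_price=("EndPrice", "last")
    )


class TestTransformReport1(unittest.TestCase):
    def test_empty_input_returns_empty(self):
        result = transform_report1(pd.DataFrame())
        self.assertTrue(result.empty)

    def test_valid_input_returns_report(self):
        source = pd.DataFrame(
            {"ISIN": ["A", "A", "B"], "EndPrice": [1.0, 2.0, 3.0]}
        )
        expected = pd.DataFrame({"ISIN": ["A", "B"], "closing_price": [2.0, 3.0]})
        pd.testing.assert_frame_equal(transform_report1(source), expected)


# Run the test case programmatically so the script does not exit afterwards
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTransformReport1)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```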

Dependency Management - Intro (7:34)
pipenv Implementation (4:12)
Profiling and Timing - Intro (2:05)
Mem-Profiler (3:07)
Install memory-profiler and matplotlib, add the profile decorator to methods, then run and plot the results to identify hot spots and timing bottlenecks for focused optimization.
Dockerization (4:54)
Run in Production Environment (4:31)
Run a production ETL job with Argo Workflows and a container image to load source data into the target S3 bucket, mounting secrets and config maps as needed.

Requirements

Basic Python and Pandas knowledge is desirable.
Basic ETL and AWS S3 knowledge is desirable.

Description

This course shows each step of writing an ETL pipeline in Python, from scratch to production, using the necessary tools: Python 3.9, Jupyter Notebook, Git and GitHub, Visual Studio Code, Docker and Docker Hub, and the Python packages pandas, boto3, pyyaml, awscli, jupyter, pylint, moto, coverage, and memory-profiler.

Two different approaches to coding in the Data Engineering field are introduced and applied: functional and object-oriented programming.

Best practices in developing Python code will be introduced and applied:

design principles
clean coding
virtual environments
project/folder setup
configuration
logging
exception handling
linting
dependency management
performance tuning with profiling
unit testing
integration testing
dockerization

What is the goal of this course?

In this course we use the Xetra dataset. Xetra stands for Exchange Electronic Trading and is the trading platform of the Deutsche Börse Group. The dataset is derived in near real time, on a minute-by-minute basis, from Deutsche Börse’s trading system and saved in an AWS S3 bucket that is available to the public for free.

The ETL Pipeline we are going to create will extract the Xetra dataset from the AWS S3 source bucket on a scheduled basis, create a report using transformations and load the transformed data to another AWS S3 target bucket.

The pipeline will be written so that it can be deployed easily to almost any production environment that can handle containerized applications. The production environment we target consists of a GitHub code repository, a Docker Hub image repository, an execution platform such as Kubernetes, and an orchestration tool such as Apache Airflow or the container-native Kubernetes workflow engine Argo Workflows.

So what can you expect in the course?

You will receive primarily practical, interactive lessons in which you code and implement the pipeline, plus theory lessons when needed. Furthermore, you will get the Python code for each lesson in the course material, the whole project on GitHub, and the ready-to-use Docker image with the application code on Docker Hub.

PowerPoint slides are available for download for each theoretical lesson, along with useful links for each topic and step where you can find more information and dive deeper.

Who this course is for:

Data engineers, scientists and developers who want to write professional production-ready data pipelines in Python.
Everyone who is interested in writing data pipelines in Python that are ready for production.

Course content

Introduction • 5 lectures • 14min

Quick and Dirty Solution • 9 lectures • 1hr 23min

Functional Approach • 8 lectures • 56min

Object Oriented Approach • 7 lectures • 1hr 15min

Setup and Class Frame Implementation • 14 lectures • 56min

Code Implementation • 29 lectures • 1hr 53min

Finalizing the ETL Job • 6 lectures • 26min

Summary • 1 lecture • 2min