
Develop Python-based data pipelines with Airflow by meeting essential prerequisites: solid Python programming experience and Docker, enabling a shared environment for guided hands-on learning.
Navigate the Airflow roadmap from basics to advanced features, build your first data pipeline, and learn to scale, run in production, monitor, and secure your environment.
Meet the instructor, who builds data pipelines with Apache Airflow daily and explains how to run Airflow in production with Astronomer. Follow him on LinkedIn, YouTube, and Udemy for tips.
Set up your development environment for Airflow by installing Docker, using the Astro CLI to run Airflow locally, verifying the install with astro version, and following the OS-specific steps for Mac, Windows, and Linux.
Define data orchestration to coordinate extraction, cleaning, transformation, and loading with dependencies and automatic retries using Airflow. Manage thousands of tasks and monitor data workflows at scale.
Discover why Airflow is a scalable data orchestrator that manages dependencies, enables Python-based pipelines, and integrates with Airbyte, dbt, and Snowflake for end-to-end data workflows.
Map Airflow's core components: the web server and user interface, the metadata database, the scheduler, and the executor with its worker, including Kubernetes, Celery, and local options.
Discover Airflow's core concepts, including tasks and operators, and learn how to build a DAG by linking tasks with dependencies into a data pipeline.
See how Airflow orchestrates a DAG from the DAGs directory to a DAG run using the scheduler, metadata database, and executor. Monitor task states and progress via the web UI.
Explore Airflow's limitations, including its batch-oriented design and lack of real-time streaming. Pipelines may run with delays; integrate with Kafka for streaming, and offload heavy processing to Spark to avoid memory overflow.
Explore multiple Airflow installation methods, from pip to Docker and the Astro CLI, and learn best practices for a reliable, containerized Airflow environment.
Explore the Airflow user interface, focusing on the DAGs view and the grid, graph, Gantt, code, and task duration views to monitor and troubleshoot data pipelines.
Discover how the Airflow CLI extends workflow control beyond the UI, supports CI/CD pipelines, and is accessible via the Astro CLI or Docker for essential DAG, database, and task commands.
Explore the Airflow REST API and its endpoints to trigger data pipelines from external tools. Review authentication options, view DAG routes, and consult the official docs for practical use.
Build data pipelines with Airflow by defining tasks such as waiting for an event, executing Python functions, and interacting with a database, then monitor and debug the workflow.
Build a stock market data pipeline in Airflow that fetches Apple's prices, stores them in MinIO, formats them with Spark, loads them into Postgres, and powers dashboards with Metabase.
Learn to set up the project with Docker and Docker Compose, download and unzip the materials from Udemy, and explore the Docker Compose services, including Airflow, MinIO, Spark, and Metabase.
Build Docker images for the Spark master and worker, start the project with astro dev, and access the Airflow user interface and Postgres while verifying that containers and ports are running.
Create a DAG using the dag decorator in Apache Airflow to verify yfinance API availability for a stock market data pipeline with daily scheduling and tags.
Explore the TaskFlow API in Airflow, using the dag and task decorators to reduce boilerplate and share data via XComs between tasks, with automatic dependencies.
Poll the yfinance API using the task.sensor decorator every 30 seconds for up to 5 minutes, returning a PokeReturnValue when it is available. Set up the stock_api connection and verify it via requests.
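The polling behavior described above (check every 30 seconds, give up after 5 minutes) can be sketched in plain Python. This is a minimal illustration, not Airflow's sensor implementation: poll_until_ready and its parameters are hypothetical names, and the sleep and clock arguments exist only so the loop can be exercised without waiting.

```python
import time
from typing import Callable, Optional


def poll_until_ready(
    check: Callable[[], Optional[dict]],
    poke_interval: float = 30.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call `check` every `poke_interval` seconds until it returns a value.

    Raises TimeoutError once `timeout` seconds have elapsed without a result.
    """
    started = clock()
    while True:
        result = check()
        if result is not None:
            # The condition is met: hand the value to the caller.
            return result
        if clock() - started >= timeout:
            raise TimeoutError(f"condition not met after {timeout} seconds")
        sleep(poke_interval)
```

In a real DAG the check would be the API request, and the returned value plays the role that the PokeReturnValue plays for the sensor.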
Learn to fetch stock prices with Airflow by building a PythonOperator task that calls a finance API, uses templating and XComs, and returns Nvidia's latest stock data.
Store stock prices in MinIO by connecting to a MinIO server as you would an AWS S3 bucket, creating a bucket, building a MinIO client, and saving JSON data under company symbols.
Format stock prices by running a Spark job inside a Docker container via the Docker operator, converting prices.json in MinIO to a CSV file with a header.
Build an Airflow task to fetch the formatted stock prices CSV from MinIO using a Python operator, a MinIO client, and file-not-found handling in the data pipeline.
Configure an Airflow task to load a CSV into Postgres using the Astro SDK load_file operator, from MinIO to stock_market in the public schema, and validate via logs.
Build a Metabase dashboard to monitor Apple stock prices stored in your data warehouse, creating questions for average closing price, volume, and closing price visualizations.
See the Airflow UI run the stock DAG, triggering a data pipeline that fetches stock prices, formats them as CSV, and loads them into the data warehouse to create a dataset.
Airflow 2.7 introduces notifiers that encapsulate notification logic for task success or failure, enabling Slack and other notifier integrations via plug-and-play providers.
Clean the environment by stopping Docker containers with astro dev stop, then initialize a new Airflow project locally using astro dev init for a fresh data pipeline.
Explore three ways to define your DAGs in Airflow (the old DAG notation, the with DAG context manager, and the dag decorator) and learn why the dag decorator is recommended.
Define essential Airflow DAG parameters, including dag_id, start_date, schedule, catchup, description, tags, default_args, dagrun_timeout, and max_consecutive_failed_dag_runs.
Explore the basics of DAG scheduling, including start date, schedules, and presets. See how each DAG run defines a data interval and how max active runs sets the limit per DAG.
Explore how backfilling and catchup control DAG runs in Airflow, and learn to disable catchup or backfill with the CLI or UI for targeted reruns.
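To make data intervals and catchup concrete, here is a small standard-library sketch. It assumes a fixed timedelta schedule and a DAG that has never run before; runs_to_schedule is a hypothetical helper, not an Airflow function, and it only models the idea that a run becomes eligible once its whole interval has elapsed.

```python
from datetime import datetime, timedelta, timezone


def runs_to_schedule(start_date, schedule, now, catchup):
    """Return the (data_interval_start, data_interval_end) pairs to run at `now`.

    A run only becomes eligible once its whole interval has elapsed.
    """
    intervals = []
    interval_start = start_date
    while interval_start + schedule <= now:
        intervals.append((interval_start, interval_start + schedule))
        interval_start += schedule
    if not catchup:
        # Without catchup, only the most recent complete interval is run.
        intervals = intervals[-1:]
    return intervals


start = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 4, 12, tzinfo=timezone.utc)
daily = timedelta(days=1)

with_catchup = runs_to_schedule(start, daily, now, catchup=True)
without_catchup = runs_to_schedule(start, daily, now, catchup=False)
```

With catchup enabled, every missed daily interval since the start date gets its own run; with it disabled, only the latest one does.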
Backfill your DAG safely by ensuring tasks are idempotent to avoid duplicates when rerunning data pipelines, and use the data_interval_end variable in SQL instead of now.
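One common way to get idempotent loads is to replace everything that belongs to the run's data interval instead of appending. The sketch below uses an in-memory SQLite database as a stand-in for the warehouse; the prices table, the load_interval helper, and the sample row are all made up for illustration. The key point is that the filter uses the interval bounds passed in by the run, never the current time.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def load_interval(conn, data_interval_start, data_interval_end, rows):
    """Replace every row of one data interval, so a rerun leaves the same result."""
    start, end = data_interval_start.isoformat(), data_interval_end.isoformat()
    with conn:  # one transaction: the delete and the inserts commit together
        conn.execute("DELETE FROM prices WHERE ts >= ? AND ts < ?", (start, end))
        conn.executemany(
            "INSERT INTO prices (ts, symbol, close) VALUES (?, ?, ?)",
            [(ts.isoformat(), symbol, close) for ts, symbol, close in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (ts TEXT, symbol TEXT, close REAL)")

interval_start = datetime(2024, 1, 1, tzinfo=timezone.utc)
interval_end = interval_start + timedelta(days=1)
rows = [(interval_start + timedelta(hours=15), "AAPL", 185.6)]  # made-up sample row

load_interval(conn, interval_start, interval_end, rows)
load_interval(conn, interval_start, interval_end, rows)  # simulated rerun or backfill
row_count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
```

Running the load twice for the same interval leaves a single copy of the data, which is what makes a backfill safe to repeat.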
Explore configuring Airflow DAG scheduling, including start date, daily versus weekly cadence, and catchup behavior. Learn to manage DAG runs, data intervals, and safe schedule changes.
Learn how Airflow handles time zones, store data in UTC, display in local time, and use pendulum, while comparing cron and timedelta scheduling across daylight saving time.
Airflow enables data-based scheduling of DAGs using datasets, allowing a producer and consumer DAG to link via a dataset update that triggers the downstream workflow.
Master conditional dataset scheduling with and/or operators to run a DAG when (A or B) and (C or D) update, using parentheses in the schedule parameter and timetables.
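The grouping logic behind "(A or B) and (C or D)" can be modeled with a few small classes. This is a standard-library stand-in that shows how the | and & operators and the parentheses combine; DatasetRef, AnyOf, AllOf, and is_met are hypothetical names and not Airflow's own classes.

```python
class Condition:
    """Base class that lets conditions be combined with | and &."""

    def __or__(self, other):
        return AnyOf(self, other)

    def __and__(self, other):
        return AllOf(self, other)


class DatasetRef(Condition):
    """A single dataset, identified by its URI."""

    def __init__(self, uri):
        self.uri = uri

    def is_met(self, updated):
        return self.uri in updated


class AnyOf(Condition):
    """Met as soon as one of its parts is met."""

    def __init__(self, *parts):
        self.parts = parts

    def is_met(self, updated):
        return any(part.is_met(updated) for part in self.parts)


class AllOf(Condition):
    """Met only when every part is met."""

    def __init__(self, *parts):
        self.parts = parts

    def is_met(self, updated):
        return all(part.is_met(updated) for part in self.parts)


a, b, c, d = (DatasetRef(uri) for uri in ("a", "b", "c", "d"))
schedule = (a | b) & (c | d)  # the parentheses decide how the updates are grouped
```

Here schedule.is_met({"a", "d"}) is true, while an update to "a" alone, or to "c" and "d" only, does not satisfy the condition.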
Create and link Airflow DAGs that extract JSON data from an API, write it to a dataset, and trigger a dependent DAG with dataset scheduling.
Learn how to share data between Airflow tasks using XComs. Push values with xcom_push and retrieve them with xcom_pull, stored in the Airflow metadata database.
Centralize datasets in an include folder, reference them from your DAG files, and use an ignore file to keep the dags folder lean and fast.
Configure on-success and on-failure callbacks for DAG and task runs, inspect the context, and implement retries with default args and exponential backoff.
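Exponential backoff simply doubles the wait between retries, optionally up to a cap. The sketch below shows the arithmetic under simplifying assumptions: retry_delays is a hypothetical helper, and Airflow's own calculation differs in detail (it adds jitter, for instance), so treat the numbers as the general shape rather than exact values.

```python
from datetime import timedelta


def retry_delays(retries, retry_delay, max_retry_delay=None):
    """Return the wait before each retry, doubling every time up to an optional cap."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)
        if max_retry_delay is not None:
            delay = min(delay, max_retry_delay)
        delays.append(delay)
    return delays


delays = retry_delays(
    retries=4,
    retry_delay=timedelta(minutes=1),
    max_retry_delay=timedelta(minutes=5),
)
```

With a base delay of 1 minute and a cap of 5 minutes, the four retries wait 1, 2, 4, and 5 minutes.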
Discover how to test Apache Airflow pipelines with dag tests, validation tests, unit tests, and integration tests using the Astro CLI, then debug with IDE breakpoints.
Learn how to group tasks with task groups in Airflow to improve organization, apply defaults to sets of tasks, enable dynamic mapping, and create modular, reusable task blocks across DAGs.
Learn to use Airflow's branch operator and branch Python operator to route tasks based on conditions, such as data size or a cocktail being alcoholic, with task and group dependencies.
Discover how trigger rules change when a task runs in Airflow. Explore defaults like all_success and alternatives such as all_done, all_skipped, one_failed, and none_failed_min_one_success.
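The rules named above can be read as predicates over the states of a task's upstream tasks. The sketch below is a simplified model written for illustration: should_run is a hypothetical function, the states are plain strings, and it ignores details of Airflow's scheduler such as when a rule is evaluated.

```python
FAILED_STATES = {"failed", "upstream_failed"}
FINISHED_STATES = {"success", "failed", "upstream_failed", "skipped"}


def should_run(trigger_rule, upstream_states):
    """Decide whether a task may run, given the states of its upstream tasks."""
    states = list(upstream_states)
    if trigger_rule == "all_success":
        return all(state == "success" for state in states)
    if trigger_rule == "all_done":
        return all(state in FINISHED_STATES for state in states)
    if trigger_rule == "all_skipped":
        return all(state == "skipped" for state in states)
    if trigger_rule == "one_failed":
        # Does not wait for the remaining upstream tasks to finish.
        return any(state in FAILED_STATES for state in states)
    if trigger_rule == "none_failed_min_one_success":
        no_failure = not any(state in FAILED_STATES for state in states)
        return no_failure and "success" in states
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")
```

For example, a task placed after a branch, where one upstream task succeeded and the other was skipped, does not run under all_success but does run under none_failed_min_one_success.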
Discover how Airflow templating uses a template engine to replace placeholders with runtime data. Use ds, data interval start and end, templates_dict, and template_fields to avoid hardcoding.
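Placeholder replacement can be illustrated with a tiny renderer. Airflow relies on a full template engine (Jinja); the regex-based render function below is only a stand-in that handles plain names such as {{ ds }}, and the context values are invented for the example.

```python
import re

# Matches {{ name }} with optional spaces around the variable name.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, context):
    """Replace each {{ name }} placeholder with the matching value from `context`."""

    def substitute(match):
        name = match.group(1)
        if name not in context:
            raise KeyError(f"unknown template variable: {name}")
        return str(context[name])

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical runtime context for a run covering 1 January 2024.
context = {
    "ds": "2024-01-01",
    "data_interval_start": "2024-01-01T00:00:00+00:00",
    "data_interval_end": "2024-01-02T00:00:00+00:00",
}
query = render("SELECT * FROM prices WHERE day = '{{ ds }}'", context)
```

The same query text can then be reused for every run, because the date is filled in at runtime instead of being hardcoded.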
Learn to store XCom data outside the Airflow metadata database using a custom XCom backend with AWS S3, enabling versioning and archiving, then configure and verify the setup.
Learn how to avoid hardcoding values in Airflow by using variables, created via the UI, the CLI, or environment variables, with encryption and template integration for reuse across tasks.
This lecture explains the executor inside the Airflow scheduler and how the sequential executor runs tasks one at a time using SQLite, with config overrides via environment variables.
The local executor runs tasks in parallel within the scheduler using Postgres, avoids SQLite because it cannot handle concurrent writes, and is configured via the Astro CLI.
Configure parallelism, max active tasks per DAG, max active runs per DAG, and max active tis per DAG to control Airflow concurrency and understand how tasks run across DAG runs.
Learn to scale Airflow with the Celery executor by distributing tasks across multiple workers via a Redis broker, using Docker Compose, and monitoring a sample DAG.
Monitor Airflow with Flower to view the real-time status of Celery executor workers and tasks, explore the default Redis queue, and adjust max concurrency using the AIRFLOW__CELERY__WORKER_CONCURRENCY setting.
Add a second Airflow worker and create GPU and CPU queues to distribute tasks. Restart Docker Compose, run the DAG, and confirm tasks run on the correct workers.
Discover how Airflow runs on Kubernetes using the Celery executor, and learn Kubernetes concepts (pods, nodes, master node, scheduler, controller) and how Helm deploys Airflow on a Docker Desktop cluster.
Explore the shift from the Celery executor to the Kubernetes executor, discovering one-task-per-pod isolation, granular resources and environment customization, and Kubernetes API-driven scheduling in a Kubernetes cluster.
Install Airflow on a Kubernetes cluster with Helm, create the airflow namespace, verify nodes and pods, and access the UI via port-forward for admin access.
Configure Airflow on Kubernetes with Helm by editing values.yaml, upgrading the deployment, and verifying pods and the web UI, then switch from Celery to Kubernetes executor for scalable DAG execution.
Deploy Airflow on Kubernetes and fetch DAGs with a git-sync sidecar, then configure a Git repo, SSH keys, and a Kubernetes secret to run DAGs on the Kubernetes executor.
Set up a Kubernetes cluster with EKS and Rancher, install Airflow, and run it with the Kubernetes Executor in AWS, including creating an EC2 instance, IAM user, and ECR repo.
Explore AWS EKS as a managed Kubernetes service, compare it to ECS, and learn how to deploy Airflow with the Kubernetes Executor on AWS while considering costs.
Set up an EC2 instance for Rancher on AWS using Amazon Linux 2 and a t2.small instance type, and open the HTTP and HTTPS ports; install Docker and run Rancher to access its interface.
Create an IAM user in the AWS console for programmatic access, attach AdministratorAccess for testing, download the credentials CSV, and review permissions for Rancher to set up an EKS cluster.
Configure an ECR repository to store and deploy Docker images for Airflow, install and configure the AWS CLI, build and tag the image, then push it as v1.0.
Create an Amazon EKS cluster with Rancher to simplify Kubernetes deployment on AWS. Configure credentials, region, and kubectl access, then monitor with the Rancher dashboard.
Explore how to expose a Kubernetes web server from outside the cluster using NodePort, LoadBalancer, and Ingress, with ClusterIP as the internal address.
Install nginx ingress in your Kubernetes cluster using Rancher catalogs and Helm charts, enable the Helm catalog, launch nginx ingress, customize values.yaml, and access the cluster via port 80/443.
Deploy and run Airflow on an EKS cluster with the Kubernetes executor, using a catalog to install airflow-eks, then verify via the Airflow UI and a running DAG.
Clean up your AWS setup after using Rancher by terminating the EKS cluster, deleting CloudFormation stacks, terminating EC2 instances, removing load balancers, and deleting the VPC to avoid charges.
Monitor your Airflow instance and DAGs in production by configuring logging and dashboards. Explore ELK and TIG stacks to visualize metrics and set Grafana alerts.
Discover how Airflow uses Python logging to manage logs from web server, scheduler, and workers. Configure loggers, handlers, and formats in airflow.cfg, with optional remote logging via REMOTE_LOG_CONN_ID and REMOTE_LOGGING.
Explore Airflow logging customization by editing airflow.cfg parameters like base_log_folder and loglevel, configuring logging_config_class, and creating a custom log_config.py with the default logging config.
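Because Airflow builds on Python's standard logging module, a custom logging configuration is ultimately a dictionary of formatters, handlers, and loggers. The sketch below is a generic standard-library example, not Airflow's default logging config: the logger name demo.task, the format string, and the in-memory stream are assumptions chosen so the result can be inspected.

```python
import io
import logging
import logging.config

LOG_STREAM = io.StringIO()  # stand-in destination so the output can be inspected

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # Controls how each record is rendered.
        "simple": {"format": "[%(levelname)s] %(name)s - %(message)s"},
    },
    "handlers": {
        # Controls where records are written.
        "memory": {
            "class": "logging.StreamHandler",
            "formatter": "simple",
            "stream": LOG_STREAM,
        },
    },
    "loggers": {
        # Controls which records are kept for a given logger name.
        "demo.task": {"handlers": ["memory"], "level": "INFO", "propagate": False},
    },
}

logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger("demo.task")
logger.debug("dropped, because the logger level is INFO")
logger.info("task started")
```

A custom log_config.py follows the same structure, starting from a copy of the default configuration and changing only the handlers, formats, or levels that need to differ.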
Store Airflow logs in AWS S3 by creating a bucket and an IAM user with read/write access, then enable remote logging with the AWSS3LogStorage connection.
Explore how Elasticsearch stores, searches, and analyzes JSON documents and logs in near real time. Learn how indices, mappings, and the ELK stack enable ingestion, visualization, and monitoring for Airflow.
Configure Airflow to read logs from Elasticsearch via an ELK stack with Logstash and Filebeat. Set up containers, generate log_id and offset fields, and visualize DAG logs in Kibana.
Learn to monitor Airflow DAGs with Elasticsearch and Kibana by creating indices, dashboards, and visualizations that track failed tasks over the last seven days.
Learn how to monitor Airflow using metrics sent via UDP to StatsD, including counters, gauges, and timers, and visualize them with the TIG stack (Telegraf, InfluxDB, Grafana).
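StatsD metrics are short text lines sent as UDP datagrams, with a suffix that marks the metric type. The sketch below formats the three types mentioned above and shows how a line could be sent; format_metric and send_metric are hypothetical helpers, the metric names are examples, and 127.0.0.1:8125 assumes a StatsD-compatible listener (such as Telegraf) on the conventional port.

```python
import socket

# StatsD type suffixes for the three metric kinds.
TYPE_SUFFIX = {"counter": "c", "gauge": "g", "timer": "ms"}


def format_metric(name, value, metric_type):
    """Build a StatsD line of the form <name>:<value>|<type>."""
    return f"{name}:{value}|{TYPE_SUFFIX[metric_type]}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line as a UDP datagram (fire and forget, no reply expected)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


gauge_line = format_metric("airflow.dagbag_size", 12, "gauge")
counter_line = format_metric("airflow.example_counter", 1, "counter")
timer_line = format_metric("airflow.example_duration", 350, "timer")
```

Since UDP needs no connection or acknowledgment, emitting metrics this way adds very little overhead to the process that sends them.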
Set up the TIG stack to monitor Airflow by configuring Telegraf to send metrics to InfluxDB. Build a Grafana dashboard to visualize Airflow metrics, including dagbag_size.
Set up Grafana alerts for Airflow by triggering the logger_dag to collect metrics, configure an SMTP email channel via Gmail, and alert when the logger_dag.t2 duration exceeds 5 seconds.
Explore maintenance DAGs to keep Airflow running smoothly, including log-cleanup, db-cleanup, and kill-halted-tasks. Learn how to configure them and prevent metastore clutter.
Secure your Airflow deployment by encrypting passwords and managing the Fernet key. Hide sensitive variables in the UI, filter DAGs by owner, enable password authentication, and activate role-based access control.
Discover how Airflow encryption secures sensitive data by enabling secure_mode and encrypting credentials with Fernet keys, installing the crypto extra, and updating connections through the UI and metastore.
Learn to rotate the Fernet key in Airflow without breaking credentials, verify rotation, remove old keys, and configure securely with environment variables instead of airflow.cfg.
Learn how Airflow hides variable values using keywords and the UI, with values encrypted in the database, and that this security is cosmetic unless decrypted at the final destination.
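Keyword-based hiding amounts to checking the variable's name against a list of sensitive words and masking the value when it matches. The sketch below is illustrative only: the keyword list is an assumed sample rather than Airflow's exact list, and should_hide and display_value are hypothetical helpers.

```python
# Assumed sample of sensitive keywords; the real list is defined by Airflow.
SENSITIVE_KEYWORDS = ("password", "passwd", "secret", "api_key", "token")


def should_hide(key):
    """Return True when the variable name contains a sensitive keyword."""
    lowered = key.lower()
    return any(word in lowered for word in SENSITIVE_KEYWORDS)


def display_value(key, value):
    """Return the value as a UI would show it: masked when the name looks sensitive."""
    return "***" if should_hide(key) else value


shown = display_value("bucket_name", "stock-market")
masked = display_value("DB_PASSWORD", "not-a-real-password")
```

Masking only changes what is displayed; the stored value is unaffected, which is why naming variables consistently matters for this feature to help.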
Enable password authentication for the Airflow UI by setting authenticate to true and auth_backend to airflow.contrib.auth.backends.password_auth. Enable owner-based filtering so only the logged-in user's DAGs are shown.
Learn to implement RBAC in Apache Airflow, create admin and viewer users, and tailor permissions and roles to control access to DAGs and UI features.
Airflow 2.0 delivers active-active schedulers, DAG serialization for a stateless web UI, DAG versioning, a stable REST API with OpenAPI 3.0, functional DAGs, and a pluggable storage engine.
Learn how to backfill Airflow DAGs and tasks using the CLI, including when to backfill, how to rerun failures, and cloning DAGs to run in parallel.
Learn to use the Docker operator to run tasks in a Docker image, manage dependencies, test scripts, and work with mounts, XComs, and resources.
Apache Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.
It is scalable, dynamic, extensible, and modular.
Without any doubt, mastering Airflow is becoming a must-have and an attractive skill for anyone working with data.
What you will learn in the course:
Fundamentals of Airflow are explained, such as what Airflow is and how the scheduler and the web server work.
The Forex Data Pipeline project is an incredible way to discover many operators in Airflow and deal with Slack, Spark, Hadoop, and more.
Mastering your DAGs is a top priority, and you can play with timezones, unit test your DAGs, structure your DAG folder, and much more.
Scaling Airflow through different executors such as the Local Executor, the Celery Executor, and the Kubernetes Executor will be explained in detail. You will discover how to specialize your workers, add new workers, and what happens when a node crashes.
A Kubernetes cluster of 3 nodes will be set up locally with Rancher, Airflow, and the Kubernetes Executor to run your data pipelines.
Advanced concepts will be shown through practical examples such as templating your DAGs, making one DAG dependent on another, what SubDAGs and deadlocks are, and more.
You will set up a Kubernetes cluster in the cloud with AWS EKS and Rancher to use Airflow and the Kubernetes Executor.
Monitoring Airflow is extremely important! That's why you will learn how to do it with Elasticsearch and Grafana.
Security will also be addressed to make your Airflow instance compliant with your company's requirements: specifying roles and permissions for your users with RBAC, restricting access to the Airflow UI with password authentication, encrypting data, and more.
In addition:
Many practical exercises are given along the course so that you will have occasions to apply what you learn.
Best practices are stated when needed to give you the best ways of using Airflow.
Quizzes are available to assess your comprehension at the end of each section.
Answering your questions fast is my top priority, and I will do my best for you.
I put a lot of effort into giving you the best content, and I hope you will enjoy it as much as I enjoyed making it.
At the end of the course, you will be more confident than ever in using Airflow.
I wish you great success!
Marc Lamberti