
This lecture introduces how this course is structured and what you can expect from it.
In this lecture you're going to learn about the main features and use cases of Apache Airflow.
For more information visit the Apache Airflow website: https://airflow.apache.org/
In this lecture I compare Apache Airflow to two similar open-source projects, Apache Oozie and Azkaban. Like Airflow, both are scheduling tools; however, they focus on the Hadoop ecosystem, while Airflow integrates well with other systems too.
Learn more about Apache Oozie here: https://github.com/apache/oozie
Learn more about Azkaban here: https://github.com/azkaban/azkaban
Here is the link for the documentation from where you can download the installer: https://docs.conda.io/projects/conda/en/latest/user-guide/install/macos.html
Learn Airflow's four components (the web server, scheduler, executor, and metadata database) and how they collaborate to monitor, trigger, and execute workflows, covering resource allocation, DAGs, and connections.
Install Airflow on macOS by setting AIRFLOW_HOME, creating a Python 3.7 virtual environment, installing Airflow 1.10.10, and adding cryptography and Spark for secure connections and workflows.
Link to Ubuntu in Microsoft Store: https://www.microsoft.com/en-us/p/ubuntu/9nblggh4msv6
Learn to run Apache Airflow locally by initializing the metadata database, launching the web server and scheduler, and validating AIRFLOW_HOME and the working directory in your environment.
Explore the Airflow UI by starting the web server and inspecting DAGs, task states, and logs. Use the tree and graph views to trigger DAGs and view code and variables.
Learn to manage Apache Airflow from the command line using the airflow CLI, including triggering DAGs, checking next execution times, printing DAG structure, and listing tasks, without the web UI.
For all Airflow cron presets visit: https://airflow.apache.org/docs/stable/scheduler.html#dag-runs
Pass a default_args dictionary to a DAG to apply owner and start_date to all tasks, with the time of start_date defaulting to midnight UTC when no time is specified.
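A minimal sketch of this idea, assuming Airflow 1.10.x is installed when build_dag is called. The owner, date, and DAG id are hypothetical values chosen for illustration.

```python
from datetime import datetime

# Hypothetical values; every task in the DAG inherits these unless it overrides them.
default_args = {
    "owner": "airflow",
    # No time component is given, so this is midnight on 1 January 2020.
    "start_date": datetime(2020, 1, 1),
}


def build_dag():
    """Create a DAG whose tasks all receive default_args."""
    # Imported lazily so the dictionary above can be inspected without Airflow.
    from airflow import DAG

    return DAG(
        dag_id="example_default_args",  # hypothetical DAG id
        default_args=default_args,
        schedule_interval="@daily",
    )
```

Any task can still override a default by passing the same argument directly to its operator.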
Define tasks as units of work run by workers, each governed by an operator. Learn about action, transfer, and sensor operators, the base operator class, and built‑in and provider operators.
Define dependencies between tasks in Airflow using bitshift operators and relationship builders, with set_downstream and set_upstream, and practice with chain and cross_downstream for complex, multi-task graphs.
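To show why the arrow syntax works, here is a toy model that records dependencies the way the operators read. It is not Airflow's own code: the Task class and the task names are hypothetical, and real operators carry far more state.

```python
class Task:
    """Toy stand-in for an Airflow operator; it only tracks neighbours by id."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, other):
        # Accept a single task or a list of tasks.
        for task in other if isinstance(other, list) else [other]:
            self.downstream.add(task.task_id)
            task.upstream.add(self.task_id)

    def set_upstream(self, other):
        for task in other if isinstance(other, list) else [other]:
            task.set_downstream(self)

    def __rshift__(self, other):
        # a >> b means "b runs after a"; returning other allows a >> b >> c.
        self.set_downstream(other)
        return other

    def __lshift__(self, other):
        # a << b means "a runs after b".
        self.set_upstream(other)
        return other


extract, transform, load, notify = (
    Task(name) for name in ["extract", "transform", "load", "notify"]
)

# extract runs first, then transform, then load and notify in parallel.
extract >> transform >> [load, notify]
```

In Airflow 1.10.x the chain and cross_downstream helpers should be importable from airflow.utils.helpers; check the documentation for your version, since they moved in later releases.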
Apply Airflow to a real use case by loading sensor data from Google Cloud Storage into BigQuery, creating a table for the latest vehicle state and historical analysis.
In this lecture we're creating resources in Google Cloud which we can interact with from our Airflow instance. We're creating the following resources.
A Google Cloud Project
Project name: YOUR-CREDENTIALS-airflow-tutorial-demo
2 Google Cloud Buckets
Landing bucket name: YOUR-CREDENTIALS-logistics-landing-bucket
Landing bucket location: europe-west2
Backup bucket name: YOUR-CREDENTIALS-logistics-backup-bucket
Backup bucket location: europe-west2
BigQuery dataset
Dataset name: vehicle_analytics
Dataset location: europe-west2
We also upload the downloadable CSV files to the landing bucket.
Configure Airflow connections to external resources using the Google Cloud default connection: download the service account key JSON, place it in secrets, and provide the project ID.
Load CSV data from Cloud Storage into BigQuery using the Google Cloud Storage to BigQuery operator, configuring the source bucket, destination project, dataset, and table, plus write options.
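A sketch of how such a task might be configured, assuming Airflow 1.10.x, where the operator lives in airflow.contrib.operators.gcs_to_bq. The task id, the wildcard source, the header row, and the dispositions are my assumptions, not necessarily the settings used in the lecture.

```python
# Placeholder names taken from the resource list of this course.
PROJECT = "YOUR-CREDENTIALS-airflow-tutorial-demo"
LANDING_BUCKET = "YOUR-CREDENTIALS-logistics-landing-bucket"

load_kwargs = {
    "task_id": "load_data",                 # hypothetical task id
    "bucket": LANDING_BUCKET,               # source bucket
    "source_objects": ["*"],                # assumption: load every object
    "source_format": "CSV",
    "skip_leading_rows": 1,                 # assumption: files have a header row
    "destination_project_dataset_table": f"{PROJECT}.vehicle_analytics.history",
    "create_disposition": "CREATE_IF_NEEDED",
    "write_disposition": "WRITE_APPEND",    # assumption: append to history
    "autodetect": True,                     # assumption: let BigQuery infer the schema
}


def build_load_task(dag):
    """Attach the load task to a DAG; requires Airflow 1.10.x."""
    # Imported lazily so the configuration can be read without Airflow.
    from airflow.contrib.operators.gcs_to_bq import (
        GoogleCloudStorageToBigQueryOperator,
    )

    return GoogleCloudStorageToBigQueryOperator(dag=dag, **load_kwargs)
```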
We'll need to provide a SQL query to the BigQueryOperator. Here is the query that you need to use in order to select the latest data for each vehicle.
SELECT * except (rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY vehicle_id ORDER BY DATETIME(date, TIME(hour, minute, 0)) DESC
    ) as rank
  FROM `{YOUR-PROJECT-NAME}.vehicle_analytics.history`) as latest
WHERE rank = 1;
You might not have had enough time to write this down during the video lecture.
Note: In the video, the project name I use in the query is incorrect. It should be aa-airflow-tutorial-demo instead of aa-airflow-tutorial-dev.
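If the window function in the query is unfamiliar, the same selection can be sketched in plain Python. This is only an illustration of the logic, not something the DAG runs; the sample rows are made up, and on ties this version keeps the first row it sees, whereas ROW_NUMBER picks one arbitrarily.

```python
from datetime import date, datetime, time


def row_timestamp(row):
    """Same ordering key as DATETIME(date, TIME(hour, minute, 0)) in the query."""
    return datetime.combine(row["date"], time(row["hour"], row["minute"], 0))


def latest_per_vehicle(rows):
    """Return the most recent row for each vehicle_id."""
    latest = {}
    for row in rows:
        current = latest.get(row["vehicle_id"])
        if current is None or row_timestamp(row) > row_timestamp(current):
            latest[row["vehicle_id"]] = row
    return list(latest.values())


# Made-up sample data with the columns the query relies on.
sample_rows = [
    {"vehicle_id": "v1", "date": date(2020, 1, 1), "hour": 9, "minute": 30},
    {"vehicle_id": "v1", "date": date(2020, 1, 2), "hour": 8, "minute": 0},
    {"vehicle_id": "v2", "date": date(2020, 1, 1), "hour": 23, "minute": 59},
]

latest_rows = latest_per_vehicle(sample_rows)
```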
Use the Google Cloud Storage hook in Airflow to list landing-bucket objects, enforce a single active run, and move processed files to a backup bucket to prevent duplicates.
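The move step can be sketched as a small function that works with any hook object offering list, copy, and delete. I'm assuming the Airflow 1.10.x GoogleCloudStorageHook here; the function names are hypothetical.

```python
def move_processed_files(hook, source_bucket, destination_bucket):
    """Copy every object to the backup bucket, then delete the original."""
    moved = []
    for object_name in hook.list(source_bucket):
        hook.copy(
            source_bucket=source_bucket,
            source_object=object_name,
            destination_bucket=destination_bucket,
            destination_object=object_name,
        )
        # Delete only after the copy call has returned without raising.
        hook.delete(source_bucket, object_name)
        moved.append(object_name)
    return moved


def build_hook():
    """Create the real hook; requires Airflow 1.10.x and a configured connection."""
    from airflow.contrib.hooks.gcs_hook import GoogleCloudStorageHook

    return GoogleCloudStorageHook(
        google_cloud_storage_conn_id="google_cloud_default"
    )
```

Passing the hook in as an argument keeps the function easy to test with a fake hook, without touching Google Cloud.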
Learn cross-task communication with XComs by pushing and pulling messages as a shared state, use provide_context on Python operators, and pass data between tasks.
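A minimal sketch of two callables that exchange a value through XComs, assuming Airflow 1.10.x with provide_context=True so that the task instance arrives as context["ti"]. The task ids, the key, and the file names are hypothetical.

```python
def count_files(**context):
    """Push a value for downstream tasks to read."""
    files = ["vehicles_1.csv", "vehicles_2.csv"]  # hypothetical file list
    context["ti"].xcom_push(key="file_count", value=len(files))


def report_files(**context):
    """Pull the value pushed by the count_files task."""
    count = context["ti"].xcom_pull(task_ids="count_files", key="file_count")
    return f"{count} files found"


def build_tasks(dag):
    """Wire both callables into a DAG; requires Airflow 1.10.x."""
    from airflow.operators.python_operator import PythonOperator

    push = PythonOperator(
        task_id="count_files",
        python_callable=count_files,
        provide_context=True,
        dag=dag,
    )
    pull = PythonOperator(
        task_id="report_files",
        python_callable=report_files,
        provide_context=True,
        dag=dag,
    )
    push >> pull
    return push, pull
```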
Explore Jinja templating and macros in Apache Airflow, learning to prefix objects with the execution timestamp, use templated fields, and define user-defined macros for consistent parameterization.
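To make the timestamp prefix concrete, here is a rough preview of what a templated value looks like after rendering. Airflow renders templated fields with Jinja; the string replacement below only imitates that for one variable, ts_nodash, and the object name is hypothetical.

```python
from datetime import datetime

# Template as it could be passed to a templated operator field.
DESTINATION_TEMPLATE = "{{ ts_nodash }}_vehicles.csv"  # hypothetical object name


def ts_nodash(execution_date):
    """Format the execution date the way Airflow's ts_nodash variable does."""
    return execution_date.strftime("%Y%m%dT%H%M%S")


def render_preview(template, execution_date):
    """Imitate rendering for ts_nodash only; Airflow itself uses Jinja."""
    return template.replace("{{ ts_nodash }}", ts_nodash(execution_date))


preview = render_preview(DESTINATION_TEMPLATE, datetime(2020, 1, 1))
```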
Manage Airflow variables for dev, staging, and prod with the UI or CLI. Import them from a file and reference them in DAGs using Variable.get for project and bucket names.
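A sketch of what a variables file and the matching lookups might look like. The variable names are my assumption, and the values are the placeholders used earlier in this course; Variable.get needs a running Airflow metadata database.

```python
import json

# Hypothetical variable names with the placeholder values from this course.
variables = {
    "project": "YOUR-CREDENTIALS-airflow-tutorial-demo",
    "landing_bucket": "YOUR-CREDENTIALS-logistics-landing-bucket",
    "backup_bucket": "YOUR-CREDENTIALS-logistics-backup-bucket",
}

# Contents of a JSON file that can be imported through the UI or the CLI.
variables_json = json.dumps(variables, indent=2)


def read_settings():
    """Look up each variable inside a DAG file; requires Airflow."""
    from airflow.models import Variable

    return {name: Variable.get(name) for name in variables}
```

Keeping one such file per environment lets the same DAG code run against dev, staging, and prod.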
In this video I present our second use case. You can also download the PySpark jobs that we'll run later in the course as part of our DAG. Don't forget to download them!
Create a Google Cloud Storage bucket with a globally unique name, upload the Facebook folder, and enable the Cloud Dataproc API to prepare for Airflow job orchestration.
Create a Dataproc Hadoop cluster using the Dataproc cluster create operator, with a timestamped name, two workers, and a shared storage bucket in Europe, scheduled daily at 8 pm.
Explore branching in Airflow using the BranchPythonOperator to decide between weekday and weekend tasks by evaluating the execution date, returning 'weekday analytics' or 'weekend analytics' as the next task.
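The branching decision itself is a plain Python function that returns the id of the task to run next. A minimal sketch, assuming Airflow 1.10.x and hypothetical task ids:

```python
from datetime import datetime


def choose_branch(**context):
    """Return the task id to follow, based on the execution date."""
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    if context["execution_date"].weekday() >= 5:
        return "weekend_analytics"  # hypothetical task id
    return "weekday_analytics"      # hypothetical task id


def build_branch_task(dag):
    """Wrap the callable in a BranchPythonOperator; requires Airflow 1.10.x."""
    from airflow.operators.python_operator import BranchPythonOperator

    return BranchPythonOperator(
        task_id="weekday_or_weekend",  # hypothetical task id
        python_callable=choose_branch,
        provide_context=True,
        dag=dag,
    )


# 4 January 2020 was a Saturday, 6 January 2020 a Monday.
saturday_choice = choose_branch(execution_date=datetime(2020, 1, 4))
monday_choice = choose_branch(execution_date=datetime(2020, 1, 6))
```

The returned string must match the task_id of a task directly downstream of the branch operator; the other branch is skipped.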
Submit a PySpark job with Airflow's Spark operator on a Google Cloud Dataproc cluster, wiring GCS paths, cluster name, and jar for weekend analytics.
Learn how SubDAGs embed a DAG inside a DAG to group related tasks, enforce a consistent schedule, and manage dependencies. Learn how to avoid deadlocks when running many operators in parallel.
Use the cluster delete operator with the Google Cloud connection to remove the Dataproc cluster after the PySpark jobs, and configure a trigger rule so the downstream task runs once all upstream tasks are done.
Define and attach DAG and task documentation in the Airflow UI using Markdown or JSON-like strings. View titles, descriptions, and instance details to verify the documentation.
Develop a custom Airflow operator by extending the base operator, register it in the plugins folder, and implement a data validation operator with an optional UI color and templated parameters.
Create a custom sensor in the BigQuery plugin to wait for a dataset to exist in Google Cloud, returning true when the dataset is found and false otherwise.
Create and run a DAG using custom operators and a custom sensor to validate vehicle analytics data, including templated SQL and dataset checks, and monitor results in the web UI.
Learn how to validate and load DAGs in Airflow, run DAG loader tests via the CLI, inspect active DAGs and task dependencies, and verify error-free execution.
Master unit testing for Apache Airflow DAGs and operators. Create tests using a cheat sheet of attributes, properties, and methods, and verify upstream and downstream relationships and the DAG structure.
Test the custom data validation operator by creating a unit test that mocks its data retrieval, asserts a failure on empty results, and demonstrates operator testing.
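The testing pattern can be sketched without Airflow. The DataValidator class below is a hypothetical stand-in for the custom operator, not the one built in the lecture; the point is how the test mocks data retrieval and asserts a failure on empty results.

```python
import unittest
from unittest import mock


class DataValidator:
    """Hypothetical stand-in for the custom data validation operator."""

    def get_records(self, sql):
        # The real operator would query a database here.
        raise NotImplementedError("would query the database")

    def execute(self, sql):
        records = self.get_records(sql)
        if not records:
            raise ValueError("Query returned no results")
        return records


class DataValidatorTest(unittest.TestCase):
    def test_fails_on_empty_result(self):
        validator = DataValidator()
        # Replace data retrieval so the test needs no database.
        with mock.patch.object(validator, "get_records", return_value=[]):
            with self.assertRaises(ValueError):
                validator.execute("SELECT 1")

    def test_passes_records_through(self):
        validator = DataValidator()
        with mock.patch.object(validator, "get_records", return_value=[(1,)]):
            self.assertEqual(validator.execute("SELECT 1"), [(1,)])
```

Run the tests with python -m unittest from the folder that contains the file.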
Explore the four Airflow components (web UI, metadata database, scheduler, and executors) and compare the sequential, debug, local, Celery, and distributed executors, with production and setup notes.
Follow this tutorial to configure LocalExecutor with Load Balancer and static IP: https://medium.com/grensesnittet/airflow-on-gcp-may-2020-cdcdfe594019
Define and monitor task deadlines with service level agreements, configure per-task or default SLA, and receive one-time notifications and UI logs when misses occur; distinguish SLA from execution timeouts.
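A small sketch of the difference, using default_args so the settings apply to every task. The durations and owner are hypothetical; sla and execution_timeout are both standard operator arguments.

```python
from datetime import timedelta

default_args = {
    "owner": "airflow",  # hypothetical owner
    # SLA: if a task has not succeeded within this window, Airflow records
    # an SLA miss and sends a notification, but the task keeps running.
    "sla": timedelta(hours=1),
    # Execution timeout: a task running longer than this is stopped and fails.
    "execution_timeout": timedelta(hours=2),
}

# A single task can override the default by passing sla=... to its operator.
```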
Documentation for Fernet key rotation: https://airflow.apache.org/docs/stable/howto/secure-connections.html
Enable remote logging in Airflow, set the remote log connection ID to the Google Cloud default, and specify a base folder to store task logs in a remote bucket.
Recommended reading from Lyft Engineering: https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff
Learn to integrate Sentry with Airflow to monitor and fix crashes in real time, capture the full stack trace, browse and resolve errors, and configure alerts in a centralized dashboard.
Hi there, my name is Alexandra Abbas. I'm an Apache Airflow Contributor and a Google Cloud Certified Data Engineer & Architect with over 3 years of experience as a Data Engineer.
Are you struggling to learn Apache Airflow on your own? In this course I will teach you Airflow in a practical manner: every lecture comes with a full coding screencast. By the end of the course you will be able to use Airflow professionally and add Airflow to your CV.
This course includes 50 lectures and more than 4 hours of video, quizzes, and coding exercises, as well as 2 major real-life projects that you can add to your GitHub portfolio!
You will learn:
How to install and set up Airflow on your machine
Basic and advanced Airflow concepts
How to develop complex real-life data pipelines
How to interact with Google Cloud from your Airflow instance
How to extend Airflow with custom operators and sensors
How to test Airflow pipelines and operators
How to monitor your Airflow instance using Prometheus and Grafana
How to track errors with Sentry
How to set up and run Airflow in production
This course is for beginners. You do not need any previous knowledge of Apache Airflow, Data Engineering or Google Cloud. We will start right at the beginning and work our way through step by step.
You will get lifetime access to over 50 lectures plus corresponding cheat sheets, datasets and code base for the lectures!