
This course teaches you how to build a robust analytics data stack to monitor business performance, identify issues, evaluate ROI, and make data-driven decisions.
We'll explore why modern analytics has evolved from Excel (prone to inconsistency and errors) to SQL-based workflows that provide structure, readability, and reliability. You'll learn the Extract-Load-Transform (ELT) framework for data manipulation, focusing specifically on the Transform stage using DBT (Data Build Tool).
Prerequisites: Basic SQL knowledge is required. The course assumes you understand how to write SELECT queries and use WHERE clauses.
What you'll build: A production-ready data transformation pipeline following software engineering best practices, enabling your team to create clean, documented, and cost-optimized analytics datasets.
This section introduces DBT using a supermarket revenue tracking example. You'll learn how DBT transforms raw transaction data (purchases and customer loyalty information) into actionable daily revenue reports.
Key concepts covered:
Design-first approach: Always visualize your desired output before writing SQL
What DBT adds beyond SQL: Organization, table referencing, scheduling, and software engineering best practices
DBT's folder structure: Separate folders for raw data (using source()) and transformed data (using ref())
Referencing system: source() for raw data not created by DBT, ref() for tables created by other DBT models
Why organize raw data separately: Even with thousands of tables, creating one staging query per raw table improves readability and maintainability
By the end of this section, you'll understand how DBT brings software development principles (version control, modularity, documentation) to data analytics workflows.
This section explains how to execute your DBT models using command-line interface (CLI), just like software developers run code in a terminal.
Key concepts covered:
What is a terminal? A command-line interface to navigate folders and execute files without clicking
Basic DBT commands:
dbt run --select revenue_daily - Runs a specific model
dbt run --select +revenue_daily - Runs a model plus all its upstream dependencies
Why command-line execution matters: Enables scheduling and automation of your data pipelines
Software engineering parallel: DBT brings developer workflows to analytics, making data transformation schedulable and reproducible
What's next: You'll learn how to set up a complete DBT project from scratch, following best practices for building a robust data stack.
This section walks through the complete dbt init process to create your first DBT project connected to BigQuery, but it works the same for other Data Warehouse.
What you'll do:
Prerequisites: Install Python, Git, and set up your data warehouse environment
Initialize DBT project: Run dbt init and configure project settings
BigQuery setup: Create a GCP project and service account with BigQuery access
Configure connection parameters:
Service account JSON key
Project ID and dataset name
Threads (parallel query execution limit) - default: 4
Job execution timeout - default: 300 seconds
Data location (US, EU, etc.)
Key takeaway: Naming conventions established during initialization are critical and will persist throughout your project, so choose carefully.
By the end, you'll have a fully initialized DBT project ready for development.
This section teaches you how to configure profiles.yml, the critical file that connects DBT to your data warehouse.
Key concepts:
What is profiles.yml? The configuration file that defines where DBT runs your SQL queries (called "targets")
Why separate environments? Following software engineering best practices, you need:
Development (dev): Your testing sandbox where you can safely experiment
Production (prod): Live data available to end users
Configuration setup:
Start with default dev target
Add prod target with separate dataset/project
Use different datasets (e.g., dev_johanna vs production_dataset)
Testing your setup:
dbt run - Runs on dev by default
dbt run --target prod - Runs on production environment
Critical takeaway: This separation ensures you never accidentally impact production data while developing and testing new models.
This section covers credential security best practices to prevent accidentally exposing sensitive information in your DBT project.
What you'll learn:
The problem: Service account JSON files shouldn't be visible in your project folder
The solution: Store credentials in a hidden folder outside your project
Step-by-step process:
Use ls -la to view hidden files (starting with .)
Create hidden folder: mkdir ../.dbt
Move credentials using nano text editor
Update profiles.yml to reference new path
Delete original JSON file from project
Test with dbt run and dbt run --target prod
Important note: Always consult your DevOps/Platform engineer about your company's specific credential management policies and tools (e.g., secret managers, vaults).
Result: Your credentials are now secured outside your project directory and won't be accidentally committed to version control.
Summary:
This section explains how to properly configure .gitignore to exclude personal and generated files from version control.
What you'll learn:
What is .gitignore? A file that tells Git which files/folders to ignore and not track in your repository
Essential exclusions for DBT projects:
profiles.yml - Contains personal credentials and environment-specific settings
target/ folder - Contains compiled SQL and run artifacts (generated files)
BigQuery-specific: .user.yml file (auto-generated, not important)
Understanding the target/ folder:
Created when running dbt run or dbt compile
Contains temporary compiled queries
Personal to each developer's last command
Can be cleared with dbt clean and regenerated anytime
Why this matters: Proper .gitignore configuration prevents committing sensitive credentials or personal temporary files, keeping your repository clean and secure.
Summary:
This section walks through setting up version control for your DBT project using Git and GitHub.
Step-by-step process:
Create GitHub repository: Navigate to https://github.com/new and create a repo with the same name as your DBT project
Initialize Git locally:
git init - Initialize Git in your project folder
git remote add origin [repo-url] - Connect to GitHub repository
Make initial commit:
git add . - Stage all files (respecting .gitignore)
git commit -m "Initial commit" - Create first commit
Push to GitHub:
git branch -M main - Ensure you're on main branch
git push -u origin main - Push code to GitHub
Result: Your DBT project is now under version control and backed up on GitHub, enabling collaboration and change tracking.
Summary:
This section demonstrates the complete Git workflow by creating and merging your first pull request to verify everything is set up correctly.
What you'll do:
Create a new branch: Work on changes separately from main
Make a small change: Edit a file to test the workflow (e.g., update README)
Commit and push:
git add .
git commit -m "Test: first PR"
git push origin [branch-name]
Create Pull Request: On GitHub, open a PR from your branch to main
Review and merge: Verify changes and merge the PR
Pull changes locally: git checkout main and git pull to sync
Why this matters: This validates your entire setup (Git, GitHub, branches, PRs) and establishes the workflow you'll use for all future development—ensuring you never work directly on main.
Summary:
This section introduces UV, a modern Python package manager for managing dependencies and environments in your DBT project.
What you'll learn:
The problem: Different projects may require different DBT versions, leading to conflicts in your local Python environment
The solution: UV creates isolated environments per project
Practical demonstration:
Check dbt --version in your local environment (e.g., 1.8.2)
Initialize UV: uv init in your project
Check dbt --version within UV environment (e.g., 1.10.1)
See how versions differ between environments
Why UV matters:
Ensures everyone on your team uses the same DBT version
Prevents "works on my machine" issues
Makes your project reproducible
Result: Your DBT project now has isolated, version-controlled Python dependencies that can be replicated across different machines and team members.
Summary: Quick demonstration of creating a pull request to add UV-generated files (pyproject.toml, uv.lock, .python-version) to version control, ensuring all team members can use the same environment setup.
Summary: Update the README.md file with UV installation instructions and usage commands, so new team members understand how to initialize their development environment and run DBT with the correct dependencies.
Summary:
This section provides a comprehensive walkthrough of dbt_project.yml, the core configuration file for your DBT project.
Key topics covered:
Folder structure: Understanding the role of each folder (models, analyses, seeds, tests, etc.)
Models folder: The primary location for .sql files that run with dbt run
Testing folder behavior: Demonstrating which folders trigger model creation with dbt run
The target/ folder:
Auto-generated when running dbt run or dbt compile
Contains compiled SQL (without {{ }} syntax)
Shows how {{ ref() }} becomes actual table references
dbt compile vs dbt run:
dbt compile - Generates SQL without executing
dbt run - Compiles AND executes
dbt clean: Command to delete folders
Result: You'll understand how DBT transforms your templated SQL into executable queries and where to find compiled output for debugging.
Summary:
This section introduces DBT packages, reusable libraries that extend DBT's functionality with helpful macros and utilities.
What you'll do:
Create packages.yml: Define your project's package dependencies
Install dbt_utils: Add the most essential DBT package (dbt-labs/dbt_utils version 1.3.0)
Run dbt deps: Install packages, which creates:
dbt_packages/ folder - Contains installed package code (gitignored)
package-lock.yml - Locks exact versions installed (gitignored)
Important distinction:
packages.yml - Committed to Git (defines what packages to use)
dbt_packages/ and package-lock.yml - Gitignored (generated files)
Why dbt_utils? Provides essential utility macros for data testing, SQL generation, and common transformations you'll use throughout your project.
Don't forget to commit packages.yml: Create a PR to share the packages.yml file with your team
Summary:
This section establishes the rules and conventions for the staging layer, your data pipeline's entry point.
Folder structure:
Organize by: 01_staging/<tool>/<dataset_name>/<table_name>
Naming conventions:
Prefix: stg_
Format: stg_<tool>_<dataset_name>_<table_name>.sql
Match exact Data Warehouse dataset/table names
No double underscores
Code rules:
Only select from {{ source("xxx", "xxx") }}
Cast data types and rename columns
No joins, calculations, or aggregations
Materialization: Views (default) - avoids duplicating data in memory
Data type standards:
IDs → STRING
Timestamps → DATE or TIMESTAMP only
Why these rules? Staging models create a clean, consistent foundation by standardizing raw data before any business logic is applied.
Summary:
This section demonstrates the practical implementation of staging layer conventions by building real staging models.
What you'll create:
Folder structure:
01_staging/gsheet/raw_data/ directory structure
Organized staging model files
_src.yml file:
Reference raw data location in BigQuery
Document source database, schema, and tables
Define sources used afterwards in {{ source() }} syntax in .sql models
Staging models (stg_.._purchases.sql and stg_.._customers.sql):
Follow naming conventions (stg_ prefix)
Use CTE with final pattern
Cast data types (STRING for IDs, TIMESTAMP for dates, DECIMAL for numbers)
Select from {{ source("gsheet_raw_data", "table_name") }}
dbt_project.yml configuration:
Set +schema: staging
Set +materialized: view
Apply settings to all 01_staging models
Result: A fully functional staging layer that cleanly ingests and standardizes raw data following best practices.
Summary:
This section introduces the intermediate layer, where business logic and reusable data models are built using a granularity-based naming system.
Folder structure:
Organize by business domain: 02_intermediate/<business_section>/ (e.g., product, finance, sales)
Model types:
dim_ (Dimensions): Relatively stable data, lists of IDs (e.g., customers, products)
fct_ (Facts): Data that changes over time, events, metrics
Naming conventions - The Granularity Framework:
Format: <dim/fct>_<dimension_granularity>_<context>_<time_granularity>
Use singular (never plural)
Dimension granularity comes first, time granularity comes last
Examples:
dim_account → 1 row = 1 account
fct_account_daily → 1 row = 1 account × 1 day
fct_account_user_daily → 1 row = 1 account × 1 user × 1 day
fct_account_revenue_daily → 1 row = 1 account × 1 day (revenue-specific)
Code conventions:
Only reference {{ ref("staging_or_intermediate_model") }}
Materialized as tables (not views)
Philosophy: Anyone should understand what data is available by thinking in terms of granularity.
Summary:
This section demonstrates the practical implementation of the intermediate layer by creating dimension and fact models for the supermarket example.
What you'll create:
Folder structure:
02_intermediate/finance/ (or relevant business domain)
Organized by business function
Intermediate models:
Apply granularity-based naming conventions
Create dim_ models (e.g., dim_customer for customer dimension)
Create fct_ models (e.g., fct_revenue_daily for daily revenue facts)
Use {{ ref("stg_model_name") }} to reference staging models
Implement business logic, joins, and calculations
dbt_project.yml configuration:
Set +materialized: table for 02_intermediate
Ensure all intermediate models are persisted as tables
Result: Reusable, business-focused data models that transform clean staging data into analytical building blocks, following granularity principles for clear data structure.
Summary:
This section introduces the three-layer architecture for organizing your DBT models, a fundamental best practice for scalable data transformation.
The three layers:
Staging (01_staging/):
Entry point for raw data from sources
Light transformations only (renaming, type casting)
One model per source table
Intermediate (02_intermediate/):
Business logic and transformations
Reusable building blocks (fct_ and dim_ models)
Joins and calculations
Mart (03_mart/):
Final models for BI tools and end users
Combines intermediate models
Ready for analysis and reporting
What you'll do: Create these three folders under models/ to establish a clear, logical data pipeline structure that separates concerns and makes your project maintainable as it grows.
Why this matters: This layered approach is the industry standard for DBT projects.
Summary:
This section introduces the mart layer, where final, production-ready models are created for end users, BI tools, and external systems.
Folder structure:
03_mart/<external_tool>/ - For reverse ETL destinations (CRM, etc.)
03_mart/<business_section>/ - For business domain analytics (product, finance, sales)
Naming conventions:
Format: <dimension_granularity>_<context>_<time_granularity>
No dim_/fct_ prefixes (cleaner for end users)
Use singular only
For external tools: ext_<tool_name>_<dimension_granularity>_<context>_<time_granularity>
Examples:
revenue_daily → 1 row = 1 day
account_revenue_daily → 1 row = 1 account × 1 day
ext_gsheet_customer_summary → Customer data for Google Sheets export
Code conventions:
Only reference {{ ref("intermediate_or_mart_model") }}
Materialized as tables
Purpose: Combine intermediate models into final datasets ready for analysis, dashboards, and external system integration.
Summary:
This section demonstrates the practical implementation of the mart layer by creating production-ready models for end users and external systems.
What you'll create:
Folder structure:
03_mart/external_tools/../ - For Reverse ETL
Mart models:
revenue_daily.sql - Final daily revenue table (no dim_/fct_ prefix)
ext_salesforce_customer.sql - Customer data for Salesforce export
Follow granularity-based naming without prefixes
Use {{ ref("intermediate_model") }} to combine intermediate models
Implement final business logic and aggregations
dbt_project.yml configuration:
Set +materialized: table for 03_mart
Ensure all mart models are persisted as tables
Result: Clean, user-friendly data models ready for BI dashboards, reporting, and external system integration.
Summary:
This section introduces exposures.yml, a DBT feature for documenting how your data models are used in downstream tools and dashboards.
What are exposures?
Documentation of where your DBT models are consumed (BI dashboards, reports, external tools)
Creates visibility into the full data pipeline from source to end use
Links your data models to their real-world applications
Why exposures matter:
Impact analysis: Understand which dashboards/reports will be affected before changing a model
Dependency tracking: See the complete lineage from raw data to business insights
Collaboration: Help stakeholders understand what data is available and where it's used
Documentation: Maintain a living record of all analytics use cases
Common exposure types:
Looker/Tableau dashboards
Google Sheets reports
External API integrations
Email reports
Result: Complete visibility into how your data models serve the business, enabling safer changes and better collaboration.
Summary:
This section teaches how to resolve SQLFluff linting failures caused by complex macros by adding advanced configurations to .sqlfluff.
The issue:
Some macros cause SQLFluff to fail or hang during linting
SQLFluff can't execute these operations during static analysis
Configuration strategies:
1. Exclude specific macro files:
[sqlfluff:templater:jinja]
exclude_macros_from_path = ./macros/problematic_macro.sql
2. Provide mock/stub implementations:
[sqlfluff:templater:jinja:macros]
your_macro_name = {% macro your_macro_name(args) %}[]{% endmacro %}
Example: The run_query_to_get_column_values() macro demonstrates both approaches—you can either exclude it or provide a stub that returns an empty array.
Key lesson: When SQLFluff fails on models using certain macros, check .sqlfluff configuration options rather than disabling linting entirely. Most macro conflicts can be resolved with proper configuration.
Result: SQLFluff successfully lints your entire project, even with advanced macro usage.
Summary:
This section teaches you how to automate DBT model documentation using YAML files and custom Python scripts.
Documentation approach:
1 model = 1 YAML file: Each .sql model has an associated documentation .yml file
Manual method (baseline):
Run: uv run dbt run-operation generate_model_yaml --args '{"model_name": "revenue_daily", "dataset_name": "dev_johanna"}'
Copy output and paste into documentation folder
Issue: Not very convenient for multiple models
Automated solution:
Create scripts/ folder for automation scripts
Build Python script: generate_model_yaml.py
Install dependency: uv add ruamel.yaml
Run: uv run python scripts/generate_model_yaml.py revenue_daily dev_johanna
Script automatically creates YAML file in correct location
Default test behavior:
First column automatically gets unique and not_null tests
Assumes first column is the primary/surrogate key (_id)
Result: Streamlined documentation workflow that generates properly structured YAML files with column descriptions and automatic tests.
Summary:
This section introduces pre-commit hooks, automated checks that run before code is committed to Git, starting with SQLFluff validation.
What you'll do:
Install pre-commit: Run uv run pre-commit install to activate hooks in your repository
Create .pre-commit-config.yaml: Define which hooks to run before commits
Configure SQLFluff hook:
Uses sqlfluff-fix to automatically lint and fix SQL code
Runs with your .sqlfluff configuration
fail_fast: true - Stops commit if SQL issues are found
Forces you to fix code quality issues before committing
Test the workflow:
Attempt a commit with SQL issues
Pre-commit hook catches problems automatically
SQLFluff fixes what it can, flags what needs manual changes
Prevents bad code from reaching the repository
Why pre-commit matters: Catches errors early, enforces standards automatically, and ensures all code meets quality requirements before it's shared with the team.
What's next: You'll add more hooks for testing models and documentation validation.
Summary:
This section completes your pre-commit pipeline by adding model testing and automatic documentation generation alongside SQLFluff validation.
Two new automated pre-commit hooks:
Model execution test :
Script: pre_commit_run_model.py
Runs dbt run --select on staged models in dev environment
Fails commit if model doesn't execute successfully
Catches SQL errors before they reach Git
Documentation generation :
Script: pre_commit_run_documentation.py
Automatically generates/updates YAML documentation for staged models
Ensures documentation stays in sync with code changes
No manual documentation step needed
Workflow behavior:
All hooks run sequentially (require_serial: true)
fail_fast: true - Stops at first failure
Only processes staged files (pass_filenames: true)
Result: Every commit is automatically validated for code quality, execution success, and documentation completeness—preventing broken or undocumented code from entering the repository.
Summary:
This section introduces DBT Cloud, a managed platform for scheduling, running, and monitoring your DBT projects without managing infrastructure.
What you'll learn:
Why DBT Cloud? Provides scheduling, CI/CD, documentation hosting, and job monitoring in one platform
Pricing strategy:
Start with the free tier (sufficient for learning and start-up projects)
Eventually upgrade to paid DBT Cloud OR migrate to open-source orchestrators (Airflow, Dagster)
Registration: Create your DBT Cloud account at getdbt.com
Service account connection:
Use the same Google Cloud service account from your local DBT setup
Upload the same JSON credentials file
Ensures consistent permissions across local and cloud environments
What's next: You'll configure your DBT Cloud project, connect your Git repository, and set up your first production job.
Key decision point: DBT Cloud simplifies operations but has costs at scale—understand your options early.
Summary:
This section demonstrates how to integrate your GitHub repository with DBT Cloud, enabling cloud-based job execution and CI/CD workflows.
What you'll do:
Connect GitHub account: Authorize DBT Cloud to access your GitHub repositories
Select repository: Choose your DBT project repository (e.g., test_dbt_project)
Why this matters:
Enables scheduled production runs in the cloud
Sets up CI/CD for automatic testing on pull requests
Provides centralized documentation hosting
Allows team collaboration without local setup
Result: Your DBT project is now accessible in DBT Cloud, ready for job scheduling and automated workflows.
Summary:
This section demonstrates how to create and schedule production jobs in DBT Cloud to automate your data pipeline.
What you'll do:
Create production environment:
Create your first job:
Command: dbt run (executes all models) in production environment
Test execution to verify it works
Verify production run:
Check job logs in DBT Cloud
Confirm tables created in production BigQuery dataset
Review execution time and model success/failure
Configure scheduling (optional):
Set daily, hourly, or custom schedules
Choose timezone and specific run times
Enable automatic pipeline updates
Result: Your DBT pipeline now runs automatically in production on a schedule, keeping your data warehouse updated without manual intervention.
Summary:
This section demonstrates how to set up automated CI testing on pull requests using DBT Cloud's CI Job feature to catch issues before merging to production.
What you'll configure:
Create CI Job in DBT Cloud:
Trigger: Automatically runs on every pull request
Commands: dbt run and dbt test
Scope: Only modified models (Slim CI) - faster, efficient testing
Temporary PR dataset:
Creates isolated dataset: dbt_cloud_pr_[PR_NUMBER]
Builds only changed models in this temporary space
Automatically cleaned up after PR closes
Data comparison workflow:
Compare PR dataset vs production dataset
Validate that changes don't break existing logic
Catch data discrepancies before merging
Demo the workflow:
Create a pull request with model changes
Watch CI job automatically trigger
Review test results and data in PR dataset
Result: Every pull request is automatically validated with real data execution, preventing breaking changes from reaching production.
Summary:
This section demonstrates how to set up automated deployment jobs that run immediately when pull requests are merged to main.
What you'll configure:
Create Merge Job in DBT Cloud:
Trigger: Automatically runs when PR is merged to main branch
Commands: dbt run and dbt test
Target: Production environment
Continuous deployment workflow:
Merge approved PR on GitHub
DBT Cloud automatically detects merge
Runs full production deployment immediately
Validates with tests
Why merge jobs matter:
Ensures production is updated immediately after code changes
No manual deployment steps needed
Catches any environment-specific issues quickly
Complements scheduled jobs (merge = immediate, scheduled = regular updates)
Demo the workflow:
Merge a pull request
Watch merge job automatically trigger
Verify production tables updated with changes
Result: Your production environment automatically updates whenever code is merged, creating a true continuous delivery pipeline.
Summary:
This section demonstrates how to generate and access interactive DBT documentation that provides a complete view of your data pipeline.
What you'll do:
Add documentation generation to jobs:
Tick the box dbt docs generate command to your prod DBT Cloud jobs
Generates documentation from your YAML files and SQL models
Access hosted documentation:
DBT Cloud automatically hosts documentation
Navigate to documentation section in DBT Cloud interface
View live, interactive documentation site
Explore documentation features:
Lineage graph: Visual dependency flow from sources to marts
Model details: Column descriptions, data types, tests
Source documentation: Raw data table information
Exposures: Downstream usage in dashboards/reports
Search functionality: Quickly find models and columns
Result: Your entire data pipeline is documented in an interactive, searchable website that shows dependencies, descriptions, and data lineage—making your project transparent and maintainable for the whole team.
Summary:
This section introduces Makefiles for automation, creating shortcuts for common tasks and automating pull request creation.
What you'll do:
Create Makefile: Define automation shortcuts for repetitive tasks
Add Python dependencies:
uv add gitpython - For Git operations in Python
uv add requests - For GitHub API calls
Configure SQLFluff automation:
Create make sqlfluff_fix command
Automatically runs SQLFluff on changed files
Set up GitHub API access:
Generate GitHub Personal Access Token (Settings → Developer Settings → Tokens)
Set repo permissions
Export token: export GITHUB_TOKEN="your_token_here"
Test automation workflow:
Intentionally push problematic code with --no-verify flag (bypass pre-commit)
Run automation to create PR with fixes
Verify PR created automatically on GitHub
Result: Streamlined workflow where common tasks (linting, PR creation) are automated with simple make commands.
Summary:
This section demonstrates how to automate SQLFluff code cleanup using GitHub Actions that run on a schedule, continuously maintaining code quality.
What you'll do:
Set up GitHub Secrets:
Navigate to: Repo → Settings → Secrets → Actions
Create new repository secret for GITHUB_TOKEN
Securely store authentication for automated PR creation
Create GitHub Action workflow:
Define scheduled trigger (e.g., daily, weekly)
Run make sqlfluff_fix across entire codebase
Create .github/worflows folder with sqlfluff_fix.yml workflow file
Continuous maintenance:
Runs without manual intervention
Catches code quality drift over time
Ensures codebase stays clean as team grows
Result: Your codebase is automatically scanned and cleaned regularly, maintaining consistent SQL quality without manual effort—especially useful as your project scales.
Summary:
This section demonstrates how to automate documentation generation for your entire DBT project using Makefiles and GitHub Actions that run on a schedule.
What you'll do:
Create Makefile command:
make documentation_from_prod - Generates YAML docs for all models
Runs your Python script across entire models/ directory
Uses production dataset to get actual column metadata
Build Python automation script (make_documentation_from_prod.py):
Scans all .sql files in models/ directory
Runs generate_model_yaml.py for each model
Excludes specific directories (e.g., to_remove folders)
Clones repo, commits changes, creates PR automatically
Set up GitHub Action:
Scheduled trigger
Sets up GCP credentials and DBT connection
Runs documentation script on production target
Creates PR with updated YAML files
Configure secrets:
Add GOOGLE_APPLICATION_CREDENTIALS_JSON for BigQuery access
Add REPO_GITHUB_TOKEN for PR creation
Result: Your documentation automatically stays in sync with production models—no manual YAML updates needed as your project evolves.
Summary:
This section demonstrates how to automatically identify orphaned tables in production that should be cleaned up to reduce costs and maintain data warehouse hygiene.
What you'll do:
Create DBT macro (list_tables_in_prod.sql):
Queries INFORMATION_SCHEMA.TABLES in production datasets
Returns list of all tables currently in BigQuery production
Outputs as CSV format
Build comparison script (make_cleaning_reco.py):
Runs DBT macro to get production table list
Scans models/, seeds/, and snapshots/ for .sql files
Compares production tables vs DBT project files
Identifies orphaned tables (in production but no .sql file)
Generates make_cleaning_reco.csv with deletion recommendations
Set up GitHub Action:
Scheduled
Connects to production BigQuery
Runs cleaning recommendation script
Creates PR with updated CSV files
Review and cleanup:
Review recommended tables for deletion
Verify they're truly orphaned (not created by other tools)
Delete unnecessary tables to reduce storage costs
Result: Automated weekly audits of your production datasets, catching tables that should be removed to maintain a clean, cost-efficient data warehouse.
Summary:
This section introduces cost monitoring principles and explains why tracking data warehouse expenses is critical for optimization and scalability.
Why cost monitoring matters:
Identify inefficient queries: High costs often indicate unoptimized models that need refactoring
Optimize workflow: Understand which models should use incremental strategies instead of full rebuilds
Enable scaling: Ensure your data stack remains cost-effective as data volume and business grow
Prioritize optimization work: Focus on the most expensive queries first for maximum impact
How to track costs:
Add SQL Commented Headers to all DBT models with metadata
Include dynamic variables:
job_name: Identifies which DBT Cloud job or local run created the query
run_type: Distinguishes between run, test, compile, or other commands
Note: This lecture focuses on query execution costs (compute). Storage costs are important but not covered here—consult your data warehouse documentation for storage optimization.
Result: Complete visibility into which DBT commands and models drive your costs, enabling data-driven optimization decisions.
Summary:
This section demonstrates how to automatically add metadata headers to all DBT queries using the +sql_header configuration in dbt_project.yml.
What you'll do:
Configure +sql_header by layer in dbt_project.yml:
Add to 01_staging, 02_intermediate, and 03_mart sections
Include JSON metadata: tool, layer, schema, job_name, run_type
Use DBT variables for dynamic values:
{{ target.schema }} - Captures dev vs prod environment
{{ var('job_name', 'Local run') }} - Identifies which job is running
{{ var('run_type', 'Unknown') }} - Tracks run/test/compile commands
Test the implementation:
Run dbt run to see generated SQL
Check that different targets (dev/prod) show correct schema
Result: Every query DBT generates now includes structured metadata in SQL comments, enabling precise cost tracking and query attribution in your data warehouse's query logs without modifying individual .sql files.
What's next: You'll query BigQuery's INFORMATION_SCHEMA to analyze costs by layer, job, and environment.
Summary:
This section demonstrates the final, comprehensive solution for cost monitoring by using query-comment and a custom macro to track ALL DBT operations, including tests.
The limitation of sql_header:
Only works for models (dbt run)
Doesn't apply to dbt test, seeds, snapshots, or other operations
Incomplete cost visibility
The solution: query_comment() macro
Create query_comment macro (macros/query_comment.sql):
Builds structured JSON metadata dictionary
Automatically detects layer from file path
Handles all resource types (models, tests, seeds, snapshots)
Sets appropriate run_type based on operation
Configure in dbt_project.yml:
yaml
query-comment:
comment: "{{ query_comment(node) }}"
append: false
Benefits over sql_header:
Works for all DBT operations (run, test, compile, seed, snapshot)
Tracks test costs separately with run_type: "test"
Single source of truth for metadata
Automatic layer detection
Clean JSON output using tojson()
Result: Complete cost visibility across your entire DBT project—every query (models, tests, everything) now includes structured metadata for comprehensive cost analysis and optimization.
Master the complete analytics engineering workflow by building a production-ready data stack from scratch using DBT (Data Build Tool), the industry-standard transformation framework trusted by data teams worldwide.
This comprehensive course takes you from zero to advanced DBT practitioner, covering everything needed to build, deploy, and maintain scalable data pipelines in real-world production environments. You'll learn the exact methodologies and best practices I've developed over 12+ years working across data analyst, data scientist, and analytics engineer roles in fast-growing startups.
What you'll build:
Complete three-layer data architecture (staging, intermediate, mart) following software engineering principles
Automated CI/CD pipelines with DBT Cloud for pull request testing and production deployments
Cost monitoring system to track and optimize data warehouse expenses
Self-healing testing framework with automated failure remediation
Production-grade incremental models for efficient data processing
Key topics covered:
DBT project setup with development/production environment separation
Granularity-based data modeling that scales from thousands to billions of rows
Version control workflows with Git and automated quality enforcement via pre-commit hooks
SQL linting with SQLFluff and automated documentation generation
Workflow automation using Makefiles and GitHub Actions
Query cost attribution and optimization strategies
Advanced DBT features: seeds, macros, snapshots, and custom tests
Who this is for: Data analysts transitioning to analytics engineering, data engineers building transformation layers, or anyone responsible for maintaining data pipelines serving hundreds of employees and millions of rows.
By the end, you'll have a battle-tested, production-ready data stack that actually works at scale—not just theory, but proven practices from real company environments.