
This course overview outlines preparation for the Databricks Certified Data Engineer Associate exam, covering the lakehouse platform, ETL with Spark SQL and Python, incremental data processing, production pipelines, and data governance.
Learn how Databricks provides a multi-cloud lakehouse platform built on Apache Spark, comprising the cloud service, runtime, and workspace layers, plus DBFS and Delta Lake support for batch and streaming analytics.
Learn to sign up for a 14-day free trial of Databricks on Azure, including creating a resource group, selecting the 14-day free premium option, and launching a workspace.
Navigate the Databricks workspace interface, including the left sidebar and workspace explorer, to organize notebooks, folders, and data assets across SQL, data engineering, and machine learning.
Import notebooks into the Databricks workspace via Git folders backed by a GitHub repository: clone the repository, access the course materials, and follow along to recreate the solutions.
Create a Databricks cluster by navigating to Compute, naming the cluster, and choosing a single-node or multi-node setup with a driver and workers, runtime version 13.3 LTS, and auto-termination.
Create and manage Databricks notebooks, switch languages with magic commands, use markdown for notes, run cells, and export, import, or revert revisions for modular workflows.
Explore Git folders (Databricks Repos) for source control by integrating with Git providers like GitHub, then clone, commit, push, and manage branches and pull requests.
Delta Lake is an open-source storage framework that brings reliability to data lakes and enables the lakehouse architecture through a transaction log that provides ACID transactions and consistent reads of Parquet data.
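The transaction-log idea described above can be sketched in plain Python. This is only a toy in-memory model under stated assumptions, not the Delta Lake API or its on-disk format: the class name `ToyDeltaTable`, its methods, and the file names are all hypothetical. It shows how an append-only log of "add file" and "remove file" actions lets every reader reconstruct a consistent snapshot, including an older one.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDeltaTable:
    """Toy model (hypothetical) of a table backed by an append-only commit log."""

    # Each commit records which data files it added and which it removed.
    log: list = field(default_factory=list)

    def commit(self, add=(), remove=()):
        """Append one commit to the log and return its version number."""
        self.log.append({"add": set(add), "remove": set(remove)})
        return len(self.log) - 1

    def snapshot(self, version=None):
        """Replay the log up to `version` to find the files a reader should see."""
        if version is None:
            version = len(self.log) - 1
        files = set()
        for entry in self.log[: version + 1]:
            files |= entry["add"]
            files -= entry["remove"]
        return sorted(files)


table = ToyDeltaTable()
table.commit(add=["part-0.parquet"])                             # version 0
table.commit(add=["part-1.parquet"])                             # version 1
table.commit(add=["part-2.parquet"], remove=["part-0.parquet"])  # version 2

print(table.snapshot())           # files visible at the latest version
print(table.snapshot(version=1))  # files visible at an earlier version
```

Because readers only ever see files listed by a completed commit, a half-written file never appears in a snapshot, which is the intuition behind consistent reads.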
Explore Delta Lake tables in the Hive metastore catalog, create a Delta table, perform multiple inserts, and review metadata and history via DESCRIBE DETAIL and DESCRIBE HISTORY.
Explore advanced Delta Lake features, including time travel with history and restore, performance optimization with file compaction and Z-order indexing, and garbage collection with VACUUM.
Practice advanced Delta Lake features in a hands-on session: time travel, OPTIMIZE and VACUUM, and restoring data from prior table versions.
Explore data file layout optimization in Databricks by using partitioning, Z-order indexing, and liquid clustering to accelerate queries through data skipping and efficient file organization.
Learn to create and manage databases and tables on Databricks, including managed and external tables, their locations, DESCRIBE EXTENDED output, and DROP behavior.
Learn to set up Delta tables with CTAS, infer schemas, rename columns, and apply NOT NULL constraints. Explore partitioning, external locations, and deep or shallow cloning to copy Delta tables.
Explore Databricks views, including stored, temporary, and global temporary views. Views are virtual tables defined by SQL queries; stored views persist in the database, while temporary and global temporary views are scoped to the session or cluster.
Create and query stored, temporary, and global temporary views from a smartphones table, demonstrating persistence across sessions and using SHOW TABLES to list tables and views in Databricks.
Query files in Databricks with Spark SQL across JSON, Parquet, and CSV formats; create Delta Lake tables via CTAS or external tables with the USING keyword, then load external data via temporary views.
Explore querying and ingesting files with Spark SQL on Databricks, reading JSON and CSV data, creating external and Delta tables, and handling schema, caching, and CTAS workflows.
Explore SQL for writing to Delta tables, using CREATE OR REPLACE TABLE, INSERT OVERWRITE, and MERGE INTO to overwrite, append, or upsert data while preserving ACID guarantees and time travel.
Explore advanced Spark SQL transformations on a bookstore dataset: parsing JSON strings, converting to struct types, flattening fields, exploding arrays, and applying joins, unions, and pivots.
Explore higher order functions and user defined functions (UDFs) in Spark SQL. Apply filter and transform on the books array, and define UDFs to reuse SQL logic across Spark sessions.
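The filter and transform operations on the books array mentioned above behave much like list comprehensions. The sketch below is a plain-Python analogy, not Spark SQL itself: the helper names `filter_array` and `transform_array`, the field names, and the 20% discount are hypothetical and chosen only for illustration.

```python
# Hypothetical order row whose `books` column is an array of structs.
order = {
    "order_id": "0001",
    "books": [
        {"book_id": "B01", "quantity": 1, "subtotal": 20.0},
        {"book_id": "B02", "quantity": 3, "subtotal": 75.0},
    ],
}


def filter_array(items, predicate):
    """Keep only the array elements for which the predicate is true."""
    return [item for item in items if predicate(item)]


def transform_array(items, func):
    """Apply a function to every array element and return the new array."""
    return [func(item) for item in items]


# Analogous to filtering the array with a lambda: books bought in 2+ copies.
multiple_copies = filter_array(order["books"], lambda b: b["quantity"] >= 2)

# Analogous to transforming the array with a lambda: a discounted subtotal.
discounted = transform_array(
    order["books"], lambda b: round(b["subtotal"] * 0.8, 2)
)

print(multiple_copies)
print(discounted)
```

In both cases the array stays inside its row; nothing is exploded into separate rows, which is the main appeal of higher-order functions over an explode-then-aggregate approach.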
Explore Spark Structured Streaming, which treats an infinite data stream as an unbounded table, using readStream and writeStream with micro-batches, triggers, checkpoints, Delta Lake integration, and exactly-once semantics.
Learn Spark Structured Streaming with a bookstore dataset, query a Delta table as a stream source using spark.readStream, and create streaming views for incremental processing.
Discover incremental data ingestion in Databricks using COPY INTO and Auto Loader to load only new files into a Delta table, with inferred schema, checkpointing, and exactly-once guarantees.
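The core idea of loading only new files can be sketched with a small plain-Python model. This is an illustration under stated assumptions, not how COPY INTO or Auto Loader are implemented: the function `ingest_new_files`, the in-memory `checkpoint` set, and the file names are hypothetical.

```python
def ingest_new_files(available_files, checkpoint, target_rows):
    """Load only files not yet recorded in `checkpoint`; return the files loaded."""
    new_files = [name for name in sorted(available_files) if name not in checkpoint]
    for name in new_files:
        target_rows.extend(available_files[name])  # append the file's rows
        checkpoint.add(name)                       # remember it was processed
    return new_files


# Hypothetical landing directory: file name -> rows in that file.
landing = {
    "01.parquet": [{"order_id": 1}],
    "02.parquet": [{"order_id": 2}],
}
checkpoint = set()  # stands in for persisted ingestion state
target = []         # stands in for the target table

first_run = ingest_new_files(landing, checkpoint, target)
second_run = ingest_new_files(landing, checkpoint, target)  # nothing new arrived

landing["03.parquet"] = [{"order_id": 3}]
third_run = ingest_new_files(landing, checkpoint, target)

print(first_run, second_run, third_run)
print(len(target))
```

Re-running the ingestion is idempotent here because the recorded state, not the directory listing alone, decides what gets loaded.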
Learn incremental data ingestion with Auto Loader, reading Parquet files via Spark Structured Streaming and loading new files into the Delta Lake table orders_updates.
Explore the multi-hop (medallion) architecture for lakehouse data, using bronze, silver, and gold layers to incrementally refine raw data into business-ready insights and support hybrid batch and streaming ETL.
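The bronze, silver, and gold refinement described above can be sketched in plain Python. This is a conceptual illustration only, with no Spark or Delta Lake involved: the function names, field names, and sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw records as they might arrive from a source system.
raw_events = [
    {"order_id": "1", "customer": "alice", "amount": "20.5"},
    {"order_id": "2", "customer": "bob", "amount": "not-a-number"},
    {"order_id": "3", "customer": "alice", "amount": "9.5"},
]


def to_bronze(events, source):
    """Bronze: keep the raw data as-is, adding only ingestion metadata."""
    return [{**event, "source_file": source} for event in events]


def to_silver(bronze):
    """Silver: cast types and drop records that fail basic validation."""
    silver = []
    for row in bronze:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip malformed amounts
        silver.append(
            {
                "order_id": int(row["order_id"]),
                "customer": row["customer"],
                "amount": amount,
            }
        )
    return silver


def to_gold(silver):
    """Gold: aggregate cleaned data into a business-level summary."""
    totals = defaultdict(float)
    for row in silver:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


bronze = to_bronze(raw_events, source="orders-01.json")
silver = to_silver(bronze)
gold = to_gold(silver)

print(len(bronze), len(silver))
print(gold)
```

Each layer reads only from the one before it, so a quality rule can change in the silver step without touching how raw data lands in bronze.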
Build a Delta Lake multi-hop pipeline with Auto Loader streaming, metadata enrichment, and bronze, silver, and gold tables, using a static customers lookup to enrich streaming data.
Explore Delta Live Tables (DLT) to build maintainable multi-hop data pipelines with bronze, silver, and gold layers, using Auto Loader to ingest Parquet data and enforce data quality via constraints.
Orchestrate multi-task Databricks jobs by chaining a land data notebook, a Delta Live Tables pipeline, and a results notebook, with scheduling, dependencies, and error repair.
Deploy and manage Databricks workflows across development, staging, and production with Databricks Asset Bundles, enabling testing, packaging, and CI/CD through a streamlined CLI and VS Code workflow.
Explore Databricks SQL (DBSQL) to run SQL and BI workloads at scale, create SQL warehouses, and build dashboards, queries, and alerts with unified governance.
Learn data object privileges in Databricks and how the governance model grants, denies, and revokes access to catalogs, schemas, tables, and views with privileges like SELECT and MODIFY.
Explore managing permissions in Databricks SQL by creating an HR DB database with an employees table and a Paris view, assigning HR Team privileges, and reviewing permissions via SHOW GRANTS and Data Explorer.
Unity Catalog provides centralized governance across workspaces and clouds, organizing data under a metastore in a three-level namespace (catalog, schema, table), with identity-based access, discovery, and lineage features.
Explore Unity Catalog in Databricks, verify metastore linking, manage catalogs and permissions, enable Delta Sharing, and trace data lineage across workspaces and regions.
Explore Databricks cluster best practices: comparing classic and serverless compute, choosing between all-purpose and jobs clusters, and leveraging instance pools and SQL warehouses for cost-efficient workloads.
Understand the Databricks associate-level data engineer certification overview, including exam format, topic distribution, scoring, and proctored online delivery.
If you are interested in becoming a Certified Data Engineer Associate from Databricks, you have come to the right place! This study guide will help you prepare for this certification exam.
By the end of this course, you should be able to:
Understand how to use the Databricks Lakehouse Platform and its tools, and the benefits of doing so, including:
Data Lakehouse (architecture, descriptions, benefits)
Data Science and Engineering workspace (clusters, notebooks, data storage)
Delta Lake (general concepts, table management and manipulation, optimizations)
Build ETL pipelines using Apache Spark SQL and Python, including:
Relational entities (databases, tables, views)
ELT (creating tables, writing data to tables, cleaning data, combining and reshaping tables, SQL UDFs)
Python (facilitating Spark SQL with string manipulation and control flow, passing data between PySpark and Spark SQL)
Incrementally process data, including:
Structured Streaming (general concepts, triggers, watermarks)
Auto Loader (streaming reads)
Multi-hop Architecture (bronze-silver-gold, streaming applications)
Delta Live Tables (benefits and features)
Build production pipelines for data engineering applications and Databricks SQL queries and dashboards, including:
Jobs (scheduling, task orchestration, UI)
Dashboards (endpoints, scheduling, alerting, refreshing)
Understand and follow best security practices, including:
Unity Catalog (benefits and features)
Entity Permissions (data object privileges)
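As a small illustration of the Python objective listed above (using string manipulation and control flow to drive Spark SQL), the sketch below builds a query string in plain Python. The function `build_select`, the table name, and the column names are hypothetical; in a Databricks notebook the resulting string would typically be passed to `spark.sql`, which is not called here so that the example runs anywhere.

```python
def build_select(table, columns, min_quantity=None):
    """Build a SELECT statement, adding a WHERE clause only when requested."""
    query = f"SELECT {', '.join(columns)} FROM {table}"
    if min_quantity is not None:
        # int() guards against non-numeric values slipping into the SQL text.
        query += f" WHERE quantity >= {int(min_quantity)}"
    return query


# Hypothetical table and column names, for illustration only.
all_orders = build_select("orders", ["order_id", "quantity"])
large_orders = build_select("orders", ["order_id", "quantity"], min_quantity=5)

print(all_orders)
print(large_orders)
# In a notebook, the string could then be run with: spark.sql(large_orders)
```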
With the knowledge you gain during this course, you will be ready to take the certification exam.
I am looking forward to meeting you!