Data Engineering with Google Data Fusion and BigQuery (CDAP)
What you'll learn
- Understand more about Google Cloud resources
- Use Google Data Fusion as an ETL tool
- Low-code Data Engineering
- Create Data Pipelines and DAGs
- Read and write data in Google BigQuery
- Read and write data in Google Cloud Storage
- Data transformations with low code and queries
- Some advanced SQL commands
Requirements
- GCP account
- Previous exposure to SQL
This is an INTRODUCTORY course to Google Cloud's low-code data integration tool, Data Fusion. Google Data Fusion is a fully managed data integration platform that lets data engineers efficiently create, deploy, and manage data pipelines.
One of the main reasons to use Google Data Fusion is its ease of use. With an intuitive and visual interface, data engineers can create complex data pipelines without the need for extensive coding. The drag-and-drop interface simplifies the process of data transformation and cleansing, allowing professionals to focus on business logic rather than worrying about detailed coding.
Another significant benefit of Google Data Fusion is its scalability. The platform runs on Google Cloud, which means it can handle large volumes of data with high-performance parallel processing. Data engineers can scale their processing capacity vertically or horizontally according to project needs, ensuring they can handle data demands at any scale.
Furthermore, Google Data Fusion seamlessly integrates with other services and products in the Google Cloud ecosystem. Data engineers can easily connect and integrate data pipelines with services such as BigQuery, Cloud Storage, Pub/Sub, and many others. This enables a cohesive and unified data architecture, facilitating data ingestion, storage, and analysis across multiple platforms.
In this course, you will learn:
- How Data Fusion works internally.
- What its benefits are.
- How to create a Data Fusion instance.
- Using Google Cloud Storage as a data input.
- Using BigQuery as a Data Lake (Bronze and Silver layers).
- Advanced BigQuery features: partitioned tables and the MERGE statement.
- Ingesting data from different sources.
- Transforming data with Wrangler (low code) and queries.
- Creating DAGs for data ETL (Extract, Transform, Load) and dependencies.
- Scheduling and inter-DAG dependencies.
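As a taste of the advanced BigQuery topics listed above, a MERGE statement can upsert rows from a Bronze (raw) staging table into a partitioned Silver table. Here is a minimal sketch in BigQuery SQL; the dataset, table, and column names (`bronze.orders_staging`, `silver.orders`, `order_id`, etc.) are hypothetical examples, not taken from the course:

```sql
-- Hypothetical Silver-layer table, partitioned by a DATE column.
CREATE TABLE IF NOT EXISTS silver.orders (
  order_id STRING,
  amount NUMERIC,
  order_date DATE
)
PARTITION BY order_date;

-- Upsert: update rows that already exist, insert the ones that don't.
MERGE silver.orders AS tgt
USING bronze.orders_staging AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN
  UPDATE SET amount = src.amount, order_date = src.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, order_date)
  VALUES (src.order_id, src.amount, src.order_date);
```

Partitioning by `order_date` lets BigQuery prune partitions at query time, which reduces both cost and latency for date-filtered queries.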
Who this course is for:
- Data Engineers
- Data Analysts
- Data Scientists
- Analytics Engineers
- Low Code Developers
- Python Developers looking to reduce coding overhead
- Open Source Fans
I'm a self-taught Senior Data Engineer and content creator. I moved from being a machine operator in my 30s into the data/IT industry. I can help early-career professionals chart their path to becoming data professionals, and offer advice for those who wish to live abroad and secure a sponsored work visa.
My current stack:
Data Integration / Processing -> Databricks | Dataflow | AWS Lambda | Data Fusion | Data Factory
Automation -> Power Platform | Power Automate | Power Apps
Databases -> Snowflake | BigQuery | SQL Server
Data Transformation -> dbt
Versioning / Repository -> Git | Azure DevOps
Programming -> SQL | Python | PySpark
Cloud Providers -> Azure | GCP | AWS
Task / Data Orchestration -> Airflow
BI -> Power BI | Qlik Sense
CI / CD -> GitLab CI
Containers -> Docker