
In this lecture, you’ll gain a clear understanding of the three primary data manipulation methodologies—ETL, ELT, and the hybrid ETL-and-ELT approach—and learn how to choose the right strategy based on your organization’s requirements. Through practical use-case examples and a side-by-side comparison, you’ll see where each method shines and what trade-offs to consider.
What You Will Learn:
The sequence and core characteristics of ETL (Extract → Transform → Load)
How ELT (Extract → Load → Transform) leverages modern cloud platforms for speed and scalability
The hybrid “ETL + post-load transformation” pattern and when it makes sense
Key differences in processing location, speed, complexity, and ideal data types
Real-world examples from finance, streaming analytics, and retail environments
Criteria for selecting the right methodology—compliance needs, data volume, latency, and infrastructure
Lecture Outline:
Introduction to Data Manipulation Methods
Definition of ETL, ELT, and hybrid approaches
Historical context: traditional vs. modern cloud architectures
Deep Dive: ETL (Extract → Transform → Load)
Pre-load data cleansing and structuring
Ideal for structured data and compliance-sensitive industries (e.g., finance, healthcare)
Limitations: scalability challenges, slower ingestion for large datasets
Deep Dive: ELT (Extract → Load → Transform)
Rapid ingestion into data lakes or warehouse systems
Post-load transformation using powerful cloud engines (BigQuery, Snowflake, Redshift)
Best suited for big data and real-time analytics on cloud platforms
Deep Dive: Hybrid Approach (Extract → Transform → Load → Transform)
Combining pre-load filtering with post-load enrichment
Balancing efficiency (fast ingestion) and flexibility (in-warehouse processing)
Considerations: increased orchestration complexity
Side-by-Side Comparison
AspectETLELTHybridProcessing LocationBefore loadingAfter loadingBoth before & after loadingIngestion SpeedModerate to slowFastModerateComplexityHigh orchestrationModerateVery high orchestrationIdeal Data TypeStructured, compliance-heavyBig data, cloud analyticsMixed structured & unstructured
Use-Case Scenarios
ETL: Financial institution enforcing compliance rules before warehousing
ELT: Streaming data loaded raw into BigQuery, transformed on demand
Hybrid: Retail company performing real-time filtering pre-load and aggregations post-load
Choosing Your Methodology
Aligning with data volume and latency requirements
Weighing infrastructure capabilities and cost considerations
Matching business priorities: compliance, real-time insights, or hybrid workflows
Summary & Key Takeaways
Recap of when to use ETL, ELT, or hybrid
Final decision factors: scalability, transformation timing, and orchestration overhead
By the end of this lecture, you’ll be equipped to recommend and implement the most suitable data ingestion and transformation workflow for any data-driven project.
In this lecture, you will learn how to evaluate and choose the most suitable data transfer solution for moving data into Google Cloud. We’ll begin by identifying the key factors—such as data volume, network constraints, time sensitivity, and compliance requirements—that influence your choice. You’ll then explore four major GCP transfer options: Storage Transfer Service, Transfer Appliance, BigQuery Data Transfer Service, and the gcloud CLI. Through clear comparisons, real-world scenarios, and best-practice guidelines, you’ll be equipped to select and plan efficient, secure, and cost-effective transfers for any workload.
Learning Objectives
By the end of this lecture, you will be able to:
List the main factors to consider when moving data between on-premises, cloud, and hybrid environments.
Describe the features and ideal use cases for each GCP transfer tool.
Compare and contrast the trade-offs of online vs. offline and batch vs. scheduled transfers.
Apply best-practice guidelines to plan a secure, reliable, and performant data migration.
Lecture Outline
Introduction: Why Transfer Strategy Matters
Impact of volume, bandwidth, and timelines on tool selection
Compliance and security considerations for sensitive data
Key Decision Factors
Data Volume & Size: Small files vs. petabyte-scale archives
Network Bandwidth & Reliability: When online transfers become impractical
Time Constraints: One-time bulk moves vs. recurring syncs
Security & Compliance: Encryption, audit trails, and regulatory controls
Overview of GCP Data Transfer Tools
Storage Transfer Service
Fully managed, online transfers from other clouds or on-premises
Scheduling options for backups and periodic syncs
Transfer Appliance
Physical device for large, offline bulk migrations
Secure courier workflow for multi-petabyte datasets
BigQuery Data Transfer Service
Automated ingestion into BigQuery from SaaS and storage sources
Simplifies analytics workflows with scheduled loads
gcloud Command-Line Tool
Ad hoc, CLI-driven transfers suitable for development and testing
Best for small-to-medium volumes and manual control
Deep Dive & Use-Case Examples
Storage Transfer Service
Scenario: Daily incremental backups from on-premises to GCS
Transfer Appliance
Scenario: Initial migration of 500 TB archival data when network is a bottleneck
BigQuery Data Transfer Service
Scenario: Scheduled import of marketing SaaS data for analytics
gcloud CLI
Scenario: Rapid prototyping of a new data pipeline in a dev environment
Best Practices for Planning a Transfer
Thoroughly assess bandwidth, cost, and time estimates
Validate security and compliance requirements in advance
Perform pilot transfers with a representative subset of data
Monitor, log, and test throughout the migration lifecycle
Summary & Next Steps
Recap of selection criteria and tool capabilities
How to incorporate transfer planning into your overall data-migration project
In this lecture, you will explore the most common file formats used in data engineering pipelines and learn how to select the right format for different stages of your workflow. You’ll examine each format’s characteristics—such as readability, schema support, compression, and performance—and discover best-practice use cases. By the end, you’ll be able to make informed decisions that balance storage efficiency, processing speed, and compatibility with your analytics tools.
Learning Objectives
Identify key factors that influence file-format selection
Compare the strengths and weaknesses of CSV, JSON, Parquet, Avro, and structured database tables
Match each format to appropriate use cases (e.g., ad-hoc exports, streaming, analytics, OLTP)
Apply simple guidelines to optimize storage, performance, and maintainability
Lecture Outline
Why File Format Matters
Impact on read/write efficiency and query speed
Role of schema enforcement and data consistency
Implications for compression, storage costs, and tool compatibility
Decision Factors
Read/Write Efficiency: Row vs. column orientation
Schema Support: Enforced vs. flexible vs. schema-evolution
Compression Capabilities: Text vs. binary formats
Compatibility: Integration with engines like BigQuery, Spark, and NoSQL stores
Ease of Use: Human-readable vs. specialized tools
File Formats in Detail
CSV
Pros: Universal support, human-readable, easy to edit
Cons: No schema enforcement, large file sizes, poor query performance
Use Cases: Quick exports, small datasets, data interchange
JSON
Pros: Flexible semi-structured schema, native for APIs and NoSQL
Cons: Repeated keys inflate size, inconsistent structures, slow analytics queries
Use Cases: RESTful services, event logs, document storage
Apache Parquet
Pros: Columnar storage, excellent compression, optimized for analytics
Cons: Not human-readable, requires bulk writes, specialized readers
Use Cases: Large-scale analytics, data lakes, BI queries
Apache Avro
Pros: Compact binary row format, schema evolution, fast serialization
Cons: Slower for analytical workloads, needs language-specific libraries
Use Cases: Streaming pipelines (Kafka, Pub/Sub), log archives
Structured Database Tables
Pros: Rigid schema enforcement, optimized OLTP/OLAP performance, indexing
Cons: Schema changes are difficult, scaling challenges for unstructured data
Use Cases: Transactional systems (CRM, ERP), relational analytics
Guidelines for Choosing a Format
For simple exports or human review → CSV
For flexible API exchanges or NoSQL ingestion → JSON
For high-performance analytics at scale → Parquet
For streaming pipelines with schema versioning → Avro
For transactional or enterprise relational workloads → Structured tables
Key Takeaways
No single “best” format—each has trade-offs
Align format choice with your data volume, query patterns, and compliance needs
Always test with representative data to validate performance and compatibility
In this lecture, you’ll learn how to select the most appropriate Google Cloud data extraction tool for your project’s needs. Efficient data extraction is the foundation of any successful migration, transformation, or analytics workflow, and choosing the right tool can save time, reduce costs, and improve performance.
What You’ll Learn:
Why Extraction Matters
Understand the critical role of extraction in data pipelines
Explore how extraction efficiency impacts downstream processing and analytics
Key Factors for Tool Selection
Data volume: Small vs. large datasets
Extraction frequency: One-off loads, scheduled batches, or continuous streaming
Source system: On-premises databases, Cloud Storage, SaaS applications, or streaming sources
Transformation needs: Simple ingestion vs. advanced ETL
Overview of Google Cloud Extraction Tools
Cloud Dataflow
Serverless, fully managed for both batch and real-time pipelines
Built on Apache Beam for complex ETL operations
Ideal for streaming logs, IoT events, and pre-load transformations
Considerations: Requires Java/Python coding and Beam SDK knowledge
BigQuery Data Transfer Service
No-code scheduling of batch imports into BigQuery
Connectors for Google Ads, Analytics, Cloud Storage, Amazon S3, and more
Best for regular, automated data loads from SaaS sources
Limitations: No in-flight transformation; not suited for real-time streams
Database Migration Service
Managed migration of relational databases with minimal downtime
Supports homogeneous (e.g., MySQL→MySQL) and select heterogeneous moves
Continuous replication ensures near-real-time synchronization
Use Cases: Lift-and-shift migrations of MySQL, PostgreSQL, SQL Server
Limitations: Data moves “as-is,” with no transformation options
Cloud Data Fusion
GUI-based ETL tool built on CDAP with extensive prebuilt connectors
Supports both batch and streaming workflows with low-code transforms
Perfect for complex pipelines without deep programming skills
Trade-offs: Higher cost than code-based tools; best for batch use cases
In this lecture, you will learn how to select the most suitable storage service in Google Cloud based on your application’s data characteristics and requirements. Choosing the right storage solution is critical for performance, cost efficiency, and scalability.
What You’ll Learn:
Key Decision Factors
Data structure: Structured (tables), semi-structured (JSON), or unstructured (images, logs)
Access pattern: Read-heavy, write-heavy, or both
Scalability needs: Predictable growth vs. unpredictable spikes
Latency requirements: Real-time vs. batch access
Transaction support: ACID compliance for OLTP workloads
Overview of GCP Storage Services
Cloud Storage
Object storage for unstructured data (images, videos, backups, logs)
Highly durable, multi-region replication, versioning, lifecycle management
Ideal for static website hosting, data archival, ML training datasets
BigQuery
Serverless, columnar data warehouse for large-scale analytics
SQL-based querying, real-time insights, built-in machine learning
Best for BI reporting, streaming event analysis, data-lakehouse scenarios
Cloud SQL
Managed relational service (MySQL, PostgreSQL, SQL Server)
ACID transactions, automatic backups, vertical scaling for small-to-medium workloads
Suited for web/mobile apps, traditional OLTP systems
Firestore
Serverless NoSQL document database with real-time sync
Hierarchical data modeling, offline support for mobile/web apps
Perfect for chat, user profiles, collaborative applications
Bigtable
Wide-column NoSQL for petabyte-scale, low-latency workloads
Excellent for time-series, IoT telemetry, ad-tech, personalization engines
Integrates with Dataflow and BigQuery for analytics
Spanner
Globally distributed, strongly consistent relational database
Horizontal scaling with ACID transactions across regions
Ideal for financial systems, global SaaS backends, mission-critical OLTP
In this lecture, you will learn how to select the optimal storage location type in Google Cloud to meet your application’s performance, availability, and compliance requirements. Understanding zonal, regional, dual-region, and multi-region options ensures low latency, high durability, and cost efficiency.
What You’ll Learn:
Key Decision Criteria
Latency & proximity: Place data close to your compute resources
Availability & durability: Trade off cost versus redundancy
Compliance & residency: Satisfy industry and legal requirements
Disaster recovery: Plan for automated failover and data resilience
Storage Location Types
Zonal (Single zone)
Benefits: Lowest cost and latency within one data center
Limitations: No built-in redundancy; data lost if zone fails
Use Cases: Temporary or non-critical data, single-zone workloads
Regional (Multiple zones within one region)
Benefits: Higher availability and durability; no cross-region transfer fees
Limitations: Vulnerable to a complete region outage
Use Cases: Analytics pipelines, regional applications requiring redundancy
Dual-Region (Two selected regions)
Benefits: Automated failover between regions; improved disaster recovery; lower inter-region latency for bi-regional apps
Limitations: Higher storage costs; less geographic coverage than multi-region
Use Cases: Business-critical systems with two-site redundancy, compliance requiring two distinct locations
Multi-Region (More than two regions across a broad area)
Benefits: Maximum availability, global content delivery, fastest access for worldwide users
Limitations: Highest cost; limited control over exact data placement
Use Cases: Global web applications, media streaming, content distribution networks
In this lecture, you will learn how to classify your data as structured, semi-structured, or unstructured, and then choose the most appropriate Google Cloud storage and processing services for each. Correctly identifying your data’s format ensures efficient storage, optimal performance, and cost-effective solutions for your applications.
What You’ll Learn:
Data Type Fundamentals
Structured data: Highly organized, predefined schema (tabular)
Semi-structured data: Flexible schema (JSON, key-value, hierarchical)
Unstructured data: No schema (binary or text blobs such as images, video, audio)
Use Cases & Service Mapping
Structured Data
Use Cases: Financial transactions, banking systems, CRM/ERP records
Key Characteristics: ACID compliance, SQL queries, relational schemas
Recommended GCP Services:
Cloud SQL (MySQL, PostgreSQL, SQL Server)
Cloud Spanner (global, strongly consistent relational)
BigQuery (serverless data warehouse for analytics)
Semi-Structured Data
Use Cases: Application logs, web/mobile event streams, user preferences
Key Characteristics: JSON or key-value pairs, flexible schema, occasional schema evolution
Recommended GCP Services:
Firestore (NoSQL document store with real-time sync)
Bigtable (wide-column for high-throughput, low-latency)
BigQuery (can ingest JSON; ideal for analysis)
Cloud Storage (generic object store for logs)
Unstructured Data
Use Cases: Media streaming (video, audio), image repositories, backups, scientific data
Key Characteristics: Large binary/text files, no inherent schema
Recommended GCP Services:
Cloud Storage (multi-tier object storage: Standard, Nearline, Coldline)
Cloud CDN (for global content delivery)
Summary of Service Selection
Structured → Cloud SQL / Spanner / BigQuery
Semi-Structured → Firestore / Bigtable / BigQuery / Cloud Storage
Unstructured → Cloud Storage (+ CDN), ML/AI processing
Key Takeaways:
Properly classify your data to align with service strengths.
Leverage managed GCP services to minimize operations and scale seamlessly.
Structured data thrives in relational and warehousing platforms; semi-structured data benefits from NoSQL and flexible analytics engines; unstructured assets fit best in object storage with CDN support.
In this hands-on lecture, you will master how to upload files from both your local Windows machine and the Cloud Shell environment into a Google Cloud Storage bucket using the gcloud command-line utility. You’ll gain practical experience with bucket creation, configuration, and file transfer—without writing any code.
What You’ll Learn:
Preparing Your Environment
Ensure you’re working in the correct GCP project
Understand the role of Cloud Shell versus your local machine
Verify that the Cloud Storage API is enabled
Creating and Configuring a Bucket
Choose a globally unique bucket name
Select a location type (regional, dual-region, or multi-region) based on availability and cost needs
Pick an appropriate storage class (Standard, Nearline, Coldline, Archive) to match access patterns
Apply uniform access controls and enable public-access prevention
Configure object versioning and retention policies for data protection
Uploading from the Cloud Console UI
Navigate to the Storage section in the Cloud Console
Create folders and drag-and-drop files directly into your bucket
Review object metadata—size, type, creation time, and access URLs
Transferring via Cloud Shell
Launch Cloud Shell and confirm your active project
Upload local files into the Cloud Shell environment
Use gcloud storage cp to copy files from Cloud Shell to your bucket
List buckets and objects with gcloud storage ls to verify successful transfers
Next Steps & Best Practices
Automate transfers using the Storage Transfer Service in future lectures
Secure your data with custom encryption keys or fine-grained ACLs
Monitor transfer logs and bucket usage for cost optimization
Key Takeaways:
You can upload objects interactively via the Console or programmatically via the gcloud CLI.
Proper bucket configuration—location, class, and access controls—is essential for performance, durability, and security.
Mastery of gcloud commands empowers you to integrate file transfers into scripts and CI/CD pipelines.
In this hands-on lecture, you will learn how to install and use the gcloud CLI on your local machine to upload individual files and entire directories into a Google Cloud Storage bucket. You’ll also discover how to keep a local folder and your bucket in sync, ensuring your cloud data always reflects your latest changes—without writing any code.
What You’ll Learn:
Setting Up gcloud on Your Local Machine
Install the Cloud SDK on Windows, macOS, or Linux
Authenticate with gcloud auth login and authorize your account
Configure your active project with gcloud config set project
Uploading Single Files
Identify the local file path for upload
Use gcloud storage cp to copy files from your desktop to GCS
Verify successful uploads by listing bucket contents
Synchronizing Folders with rsync
Understand the difference between cp and rsync operations
Use gcloud storage rsync to mirror a local directory into a bucket
Observe how rsync skips unchanged files and propagates deletions
Exploring gcloud Storage Commands
List buckets and objects with gcloud storage ls
Preview file contents in Cloud Storage using gcloud storage cat
Discover additional operations: bucket creation, object deletion, and more
Best Practices & Next Steps
When to prefer rsync over individual uploads
Managing versioning and retention for synced data
Automating uploads in scripts or CI/CD pipelines
Key Takeaways:
The gcloud CLI enables seamless, code-free transfers from both local machines and Cloud Shell.
cp is ideal for one-off file uploads; rsync excels at ongoing folder synchronization.
Mastery of these commands empowers you to integrate Cloud Storage into your daily workflows and automation pipelines.
In this hands-on lecture, you will learn how to migrate data between MySQL instances using Google Cloud’s fully managed Database Migration Service. You’ll understand the end-to-end workflow—from preparing source and destination instances to configuring a migration job—so you can perform one-time or continuous replication with minimal downtime.
What You’ll Learn:
Introduction to Database Migration Service
Purpose and benefits of a managed migration service
Supported source engines (on-premises, AWS RDS, Cloud SQL) and destination targets
Planning Your Migration
Choosing between one-time and continuous replication
Understanding prerequisites: network access, user permissions, and MySQL configurations
Labeling and organizing instances for clarity
Setting Up Source and Destination Instances
Creating Cloud SQL instances for both source and target databases
Configuring machine type, storage size, and public-IP access
Applying labels to distinguish “source” versus “destination”
Configuring Connection Profiles
Defining source connection details (host, port, username, password)
Selecting encryption options for in-transit and at-rest data
Testing connectivity to ensure smooth data transfer
Creating and Running a Migration Job
Naming your migration job and specifying source/destination profiles
Reviewing and validating prerequisites within the console
Monitoring job status: draft, running, and completion phases
Troubleshooting & Best Practices
Handling IP allow-lists and firewall rules
Cleaning up unused instances to control costs
Verifying data consistency post-migration
Key Takeaways:
Google’s Database Migration Service simplifies moving MySQL data with minimal downtime.
Proper planning—instance sizing, network access, and permissions—is essential for seamless migration.
You can choose a one-time dump or set up ongoing replication to keep source and target in sync.
In this lecture, you will learn how to verify connectivity, prepare sample data, and initiate your Google Cloud Database Migration Service job. We’ll cover all the essential steps—from IP whitelisting to simple DDL/DML operations on your source instance—so you can confidently start a one-time or continuous migration with minimal surprises.
What You’ll Learn:
Reviewing Your Migration Job Configuration
Inspect source and destination connection profiles
Understand job status icons and draft vs. running states
Testing Connectivity
Execute built-in “Test Job” checks to verify source reachability
Identify common connectivity failures and their causes
Whitelist migration service IPs on your Cloud SQL instance
Preparing Sample Data on the Source
Connect to your Cloud SQL source instance from Cloud Shell
Create a test database and sample table for migration
Insert and query a few rows to confirm data availability
Starting the Migration
Choose between “Create only” (draft) and “Start immediately” options
Monitor the initial replication events in the Console
Understand the difference between one-time import and continuous sync
Best Practices & Troubleshooting
Keep notes of external IPs used by the migration service
Use labels to distinguish migration jobs and instances
Clean up unused instances or test databases to avoid extra costs
Key Takeaways:
Testing your migration job before running it prevents downtime and errors.
Simple SQL commands on the source help you verify that the migration will move the correct data.
Whitelisting and network configuration are often the first hurdles—resolving them early ensures a smooth migration start.
In this lecture, you will learn how to confirm that your Cloud SQL migration has completed successfully, validate that all data has arrived intact in your destination instance, and then clean up temporary resources. By the end, you will feel confident that your one-time or continuous MySQL migration is fully validated and that your environment is tidy.
What You’ll Learn:
Monitoring Migration Progress
Identify the “Running” and “Completed” states in the Database Migration Service console
Interpret key metrics such as total duration and throughput
Validating Migrated Data
Whitelist the migration service’s outgoing IP addresses on your destination instance
Connect to your destination Cloud SQL as a MySQL client from Cloud Shell
List databases and tables to confirm schema presence
Run simple SELECT queries to verify row counts and sample content
Performing Post-Migration Cleanup
Delete the migration job once you’ve confirmed successful data transfer
Remove any test databases or tables you created for validation
Delete temporary Cloud SQL instances if they were solely for demonstration
Ensure you keep only production-needed resources to control costs
Best Practices & Tips
Label your migration jobs and instances clearly (e.g., “source,” “destination”)
Always test connectivity and IP whitelisting before running a job
Use one-time migrations for bulk imports and continuous replication for ongoing sync
Keep detailed notes of migration IPs and timings for audit and troubleshooting
Key Takeaways:
Successful migration isn’t over until you verify data integrity in the target database.
Proper cleanup of test jobs and instances prevents unnecessary charges.
Clear labeling and documentation help you manage multiple migrations in large environments.
In this hands-on lecture, you will discover how to automate the movement of data between Google Cloud Storage buckets using the Storage Transfer Service. Unlike manual gsutil or gcloud storage cp commands, Storage Transfer Service lets you set up repeatable, monitored, and configurable transfer jobs—ideal for large datasets or recurring workflows.
What You’ll Learn:
When to Use Storage Transfer Service
Advantages over one-off CLI copies (batch scheduling, filtering, monitoring)
Supported sources and destinations (GCS→GCS, AWS S3, Azure Blob)
Preparing Your Buckets
Creating source and destination buckets via Cloud Shell or Console
Understanding bucket naming rules and location/class settings
Configuring a Transfer Job
Selecting “Batch” mode for one-time or periodic transfers
Specifying source and target buckets
Applying object filters (prefix, last-modified time) to limit transferred files
Choosing delete options (retain source objects, remove extras at destination)
Running and Monitoring Transfers
Starting a transfer run on demand
Observing job progress and statistics (bytes moved, duration, skipped files)
Reviewing detailed logs and error counts in the Console
Managing Transfer Jobs
Editing schedules and filters without recreating jobs
Disabling or deleting transfer jobs when no longer needed
Cleaning up demo buckets to avoid unwanted costs
Key Takeaways:
Storage Transfer Service simplifies large-scale and recurring bucket-to-bucket transfers.
Batch scheduling and object filtering let you tailor jobs to your exact needs.
Built-in monitoring and logging give visibility into transfer performance and health.
In this hands-on lecture, you will learn how to use Google Cloud’s Storage Transfer Service to move objects from an Azure Blob Storage container into a Google Cloud Storage bucket. You’ll discover how to set up source and destination endpoints, generate and use a Shared Access Signature (SAS) for secure Azure access, and monitor transfer runs—all without writing any code.
What You’ll Learn:
Preparing Your Azure Source
Create or identify an Azure Storage Account and Blob container
Generate a Shared Access Signature (SAS) with appropriate permissions
Copy the SAS token and container details for use in GCP
Configuring in Google Cloud
Navigate to the Storage Transfer section in the GCP Console
Select “Azure Blob Storage” as the source and provide container name plus SAS token
Choose your target Cloud Storage bucket as the destination
Defining Transfer Options
Decide between one-time runs or scheduled, recurring transfers
Set object overwrite behavior and deletion policies
(Optional) Apply object path filters or “last modified” filters to limit transfers
Running and Monitoring the Job
Start transfer runs on demand
Track progress: bytes transferred, duration, skipped items, and error counts
Review run history and detailed job statistics in the Console
Cleanup and Best Practices
Delete transfer jobs when no longer needed to avoid clutter
Securely remove temporary Azure storage resources if used for testing
Plan robust SAS expiration and rotation policies for production
Key Takeaways:
Storage Transfer Service provides a reliable, scalable way to migrate data from Azure into GCP.
A properly scoped SAS token ensures secure, least-privilege access to your Azure data.
Built-in scheduling, filtering, and monitoring capabilities make it ideal for recurring cross-cloud workflows.
In this lecture, you will learn how to configure and execute a cross-cloud data transfer using Google Cloud’s Storage Transfer Service. You’ll see step-by-step how to set up your AWS S3 source, grant secure access via IAM, and map it to a Google Cloud Storage bucket—all without writing a single line of code.
What You’ll Learn:
Preparing Your AWS Source
Create or select an S3 bucket and upload sample objects
Understand how AWS IAM access keys and secret keys work
Set up an IAM user with S3 read permissions for the transfer
Configuring the Storage Transfer Job
Choose “Amazon S3” as the source and provide bucket details
Select your GCP bucket as the destination and create an optional subfolder
Authenticate using AWS access key ID and secret access key
Defining Transfer Options
Decide between one-time (“run now”) vs. scheduled, recurring transfers
Configure overwrite behavior to always replace changed objects
Opt to retain or delete source objects after transfer (if needed)
Executing and Monitoring Transfers
Launch the transfer run on demand
Track real-time status: bytes transferred, duration, errors, and skipped files
Review run history and performance metrics in the Console
Cleanup and Best Practices
Delete test transfer jobs and temporary buckets to control costs
Remove IAM users and credentials when no longer needed
Document your transfer configuration for future audits
Key Takeaways:
Storage Transfer Service provides a seamless, no-code method to move data from AWS to GCP.
Proper IAM setup and least-privilege access keys ensure secure cross-cloud transfers.
Built-in monitoring and scheduling features make Storage Transfer Service ideal for both one-off migrations and ongoing sync workflows.
In this hands-on lecture, you will learn how to move data from your local environment (or Cloud Storage) into BigQuery tables using the bq command-line interface. You’ll see each step—dataset and table creation, schema definition, CSV loading with header handling, and simple querying—so you can confidently manage your own data ingestion workflows.
What You’ll Learn:
Setting Up Your Environment
Launch and authenticate Cloud Shell
Verify your active project and BigQuery credentials
Creating a BigQuery Dataset and Table
Use bq commands to make a new dataset in your project
Define a table schema with field names and data types
Loading Data with bq load
Prepare a local CSV file and understand header row handling
Execute a bq load command to ingest CSV into your table
Skip the header row to avoid parsing errors
Validating Your Data
Preview table contents in the BigQuery Console
Run a simple SQL SELECT query to confirm record counts and values
Advanced Tips
Reference files stored in Cloud Storage instead of local paths
Explore bq help options for additional formats and flags
Organize datasets and tables for reuse in later lectures
Key Takeaways:
Mastering the bq CLI empowers you to automate and script BigQuery ingestion.
Proper schema definition and header handling prevent common loading errors.
You can adapt these steps to load data from local files, Cloud Storage, or other sources.
In this lecture, you will learn how to leverage Google Cloud’s Python client libraries to programmatically interact with Cloud Storage and BigQuery from a managed Colab Enterprise environment. By the end, you’ll be able to write simple Python scripts to upload files, create datasets and tables, and run queries—all without leaving your notebook.
What You’ll Learn:
Setting Up the Environment
Launching a Colab Enterprise runtime in your GCP project’s region
Connecting the notebook to the runtime and verifying authentication
Installing and importing the google-cloud-storage and google-cloud-bigquery libraries
Configuring Your Parameters
Defining your GCP project ID, target bucket name, dataset, and table variables
Referencing local or Cloud Storage file paths for data ingestion
Interacting with Cloud Storage
Initializing a Storage client in Python
Uploading local files to a GCS bucket programmatically
Listing and downloading objects from a bucket
Working with BigQuery
Creating a BigQuery client in Python
Programmatically creating a dataset and table
Loading data from GCS into a BigQuery table via the SDK
Running a simple query and retrieving results in Python
Best Practices & Next Steps
Managing credentials and environment variables securely
Structuring notebooks for readability and reuse
Extending your scripts into production workflows or automation
Key Takeaways:
The Python SDK allows you to script common Cloud Storage and BigQuery operations, eliminating manual console clicks.
A Colab Enterprise notebook provides a fully authenticated, temporary environment for experimentation without local setup.
By parameterizing project, bucket, and table names, you can reuse your code across environments and projects.
In this hands-on lecture, you will learn how to use Google Cloud’s Python client libraries to programmatically manage Cloud Storage and BigQuery resources. By the end, you’ll be able to upload files, create datasets, and perform other common tasks—all from within a Colab Enterprise notebook—streamlining your data ingestion workflows without ever touching the Console.
What You’ll Learn:
Setting Up Your Notebook Environment
Launch and connect a Colab Enterprise runtime in your GCP project
Install and import the google-cloud-storage and google-cloud-bigquery libraries
Clean up unused runtimes to control costs
Automating Cloud Storage Operations
Initialize a Storage client in Python
Write a reusable upload_to_gcs(bucket, local_path, blob_name) function
Upload local CSV files into specific folders within your GCS bucket
Handle path conventions and troubleshoot common mistakes
Managing BigQuery via Python
Initialize a BigQuery client in Python
Create a new dataset programmatically (e.g., ds_from_python)
Delete or check for existing datasets to ensure idempotency
Best Practices & Next Steps
Parameterize project, bucket, dataset, and table names for reuse
Structure your notebook with clear function definitions and comments
Explore additional SDK methods for table creation, data loading, and querying
Key Takeaways:
The Python SDK enables you to automate routine GCS and BigQuery tasks, reducing manual effort.
A Colab Enterprise notebook serves as a fully authenticated development environment with minimal setup.
By encapsulating operations in functions, you create reusable building blocks for larger data pipelines.
In this lecture, you will learn how to load multiple tables into BigQuery and apply SQL queries to uncover actionable insights. Using a sample sales dataset—comprising orders, products, users, and order status—you’ll practice writing queries for counts, aggregations, joins, and filters. By the end, you’ll be comfortable navigating the BigQuery Studio interface and translating business questions into SQL.
What You’ll Learn:
Loading Multiple Tables
Import CSV files (orders, products, users, status) directly into BigQuery tables
Use auto-schema detection and native table creation
Exploring Table Schemas
Preview field lists and sample rows in BigQuery Studio
Understand primary keys and foreign keys across tables
Crafting Basic Aggregate Queries
Write COUNT(*) queries to get total users, products, and orders
Display overall record distributions
Building Multi-Table Analyses
Perform joins between orders, users, and products
Summarize sales by product category or user segment
Using Filters and Window Functions
Filter orders by date range or status (e.g., delivered vs. canceled)
Apply ranking or running totals for time-series insights
Visualizing Query Results
Leverage BigQuery’s built-in charting options
Export result sets for external dashboards
Key Takeaways:
Loading, previewing, and querying tables in BigQuery Studio empowers rapid data exploration.
Basic SQL aggregates and joins reveal high-level metrics and relationships in your dataset.
BigQuery’s usage indicators help you monitor data scanned and optimize query costs.
In this hands-on lecture, you will learn how to transform raw sales, product, and user data into meaningful business insights using BigQuery’s SQL engine. You’ll practice writing efficient aggregation and join queries to calculate total revenue, rank products and customers, and prepare for time-series analysis—arming you with the skills to drive data-backed decisions.
What You’ll Learn:
Efficient Data Preview
Use LIMIT to inspect only a few rows and minimize data scanned
Understand BigQuery’s per-column scanning and cost implications
Calculating Aggregate Metrics
Compute overall sales revenue with SUM(total_amount)
Reduce scanned bytes by selecting only necessary columns
Product-Level Analysis
Join orders and products tables on product_id
Group by product name to find revenue per product
Swap grouping to category level for high-level insights
Use ORDER BY DESC to identify top revenue generators
Order Volume & Quantity Analysis
Count orders per product to determine popularity
Sum quantity per product to rank by units sold
Limit result sets to top five performers for concise reporting
Customer Spending Profiles
Join orders and users tables on user_id
Group by customer name and location to calculate total spend
Identify your highest-value customers at a glance
Preparing for Time-Series Trends
Leverage the order_date field for monthly or daily revenue trends
(Next lecture) Partition your queries by date and apply window functions
Key Takeaways:
Applying targeted SELECT and LIMIT clauses reduces query cost and accelerates exploration.
Aggregations combined with table joins unlock product, category, and customer insights.
Ranking and filtering results highlights your top performers, ready for dashboards or reports.
BigQuery’s query plan indicators help you monitor data scanned and optimize for cost.
In this lecture, you’ll build on basic aggregations to perform deeper analyses of your sales dataset—orders, products, users, and status—using BigQuery SQL. You’ll learn to extract monthly and daily trends, rank products and customers, compute percentages for order statuses, and identify inventory and supplier signals. Finally, you’ll prepare your result sets for visualization in BigQuery Studio.
What You’ll Learn:
Time-Series Aggregations
Group sales by year and month to reveal monthly revenue trends
Drill into a specific month (e.g., December 2023) and analyze daily sales
Count total orders per month, then sort descending to find peak periods
Customer & Product Rankings
Identify top customers by total spending and filter repeat buyers via HAVING
Rank products by order count and by total quantity sold, selecting the top five
Analyze user location distribution to highlight key markets
Inventory & Supplier Signals
Query products with low stock (e.g., quantity < 20) to prioritize replenishment
Join with the products table to compute total revenue per supplier
Category & Payment Analysis
Aggregate sales or units by product category to uncover best-selling segments
Group orders by payment mode to see which methods customers prefer
Order Status Insights
Calculate the percentage of orders in each status (delivered, pending, canceled)
Understand fulfillment health by comparing delivered vs. problem orders
Preparing for Visualization
Structure your query outputs (aliases, sorting, limits) for direct use in BigQuery Studio charts
Optimize queries by selecting only needed columns to reduce scanned data
Key Takeaways:
Time-based grouping (EXTRACT(YEAR/MONTH/DATE)) uncovers seasonal and daily patterns.
Joins and GROUP BY let you correlate transactions with user, product, and supplier metadata.
HAVING filters on aggregated results (e.g., repeat customers) for targeted insights.
Percentage calculations and order status breakdowns provide quick health checks on operations.
Well-structured SQL outputs feed seamlessly into BigQuery’s visualization tools for dashboarding.
In this lecture, you will learn how to turn your BigQuery analytical results into clear, compelling charts using Python and Matplotlib within a Colab Enterprise notebook. You’ll discover how to fetch query results directly from BigQuery, convert them into a Pandas DataFrame, and produce bar charts to highlight key insights—empowering you to communicate findings effectively.
What You’ll Learn:
Launching a Colab Enterprise Runtime
Spin up a new Vertex AI–backed notebook in your GCP project
Adjust notebook settings (font size, layout) for optimal visibility
Clean up runtimes when no longer needed to manage costs
Fetching BigQuery Data in Python
Install and import the google-cloud-bigquery, pandas, and matplotlib libraries
Initialize a BigQuery client in your notebook
Write and execute a SQL query (e.g., top customers by order count) programmatically
Converting Results to Pandas
Load the query response into a Pandas DataFrame
Inspect the DataFrame to verify columns and sample values
Prepare the data (e.g., sort or rename columns) for visualization
Creating Bar Charts with Matplotlib
Use plt.bar() to map categorical labels to numeric values
Customize chart elements: titles, axis labels, and rotation for readability
Display inline charts that automatically render in your notebook
Alternative: BigQuery Notebook Extensions
Explore the %bigquery magic command for inline querying
Compare the simplicity of built-in extensions versus full Python scripts
Key Takeaways:
Colab Enterprise provides a ready-to-use, authenticated environment for data visualization without local setup.
Programmatic access to BigQuery results via Python enables automated chart production for any query.
Simple Matplotlib commands are sufficient to create insightful bar charts that highlight your top metrics.
In this lecture, you will learn how to use BigQuery Data Transfer Service to automate the import of CSV data stored in Google Cloud Storage into BigQuery tables. You’ll walk through every step—from preparing an empty table with the correct schema to configuring and running a transfer job—so you can reliably ingest data for downstream analysis without manual uploads.
What You’ll Learn:
Preparing Your BigQuery Environment
Verify that your source CSV file (e.g., orders.csv) resides in a GCS bucket
Create an empty BigQuery table (orders2) with fields matching your CSV’s columns
Choose the appropriate data types and order of columns for seamless ingestion
Configuring a Transfer Job
Select “Google Cloud Storage” as the source in the BigQuery Data Transfer UI
Point to the specific bucket and file path for your CSV
Map the transfer to your target dataset and the newly created table
Define options such as header-row skipping, error thresholds, and append vs. overwrite behavior
Executing & Monitoring the Transfer
Run the job on-demand and observe its initial “running” status
Refresh the table preview to confirm that all records have been loaded
Inspect job details and logs to verify file matching, record counts, and any errors
Cleanup & Best Practices
Delete test transfer jobs and temporary tables when they’re no longer needed
Leverage scheduled transfers for recurring imports (daily, weekly, etc.)
Ensure schema evolution is handled gracefully by re-defining tables or using partitioned destinations
Key Takeaways:
BigQuery Data Transfer Service eliminates manual CSV loading by enabling automated, repeatable imports.
Proper schema definition and transfer configuration ensure data lands in BigQuery exactly as expected.
Monitoring job logs and record counts helps you catch issues early and maintain data integrity.
In this lecture, you will learn how to use BigQuery’s Data Transfer Service to copy entire tables from one BigQuery dataset to another—no manual exports or imports required. You’ll walk through the end-to-end process of configuring, running, monitoring, and cleaning up a one-time dataset copy job, enabling you to streamline data movement within BigQuery for analysis and reporting.
What You’ll Learn:
Introduction to BigQuery Data Transfer Service
Understand its role in migrating data into BigQuery from varied sources
Differentiate between dataset-copy transfers and GCS-to-BigQuery imports
Configuring a Dataset Copy Job
Navigate to the Data Transfer section in the BigQuery UI
Choose “Dataset copy” as your source type
Create or select destination and source datasets (e.g., sales_data → copy_sales)
Define a one-time (on-demand) run vs. recurring schedules
Job Execution & Monitoring
Launch the transfer manually and observe its “Running” status
Refresh the destination dataset to confirm new tables (orders, products, users) appear
Inspect transfer logs to verify table counts, file matching, and any errors
Managing Transfer Configurations
Edit transfer settings (e.g., retry behavior, service account) for future runs
Disable or delete transfer jobs when no longer needed
Clean up destination datasets or tables to avoid clutter and control costs
Best Practices & Tips
Grant the transfer service account least-privilege access to both source and destination datasets
Use descriptive names (e.g., ds_copy_sales) to track multiple transfer configs
Leverage scheduled transfers for regular backups or data refreshes
Key Takeaways:
BigQuery Data Transfer Service automates moving tables between datasets without manual scripting.
One-time or scheduled transfers let you keep data synchronized for reporting and analysis.
Proper monitoring, logging, and cleanup ensure reliable, cost-effective data operations.
In this lecture, you’ll explore two on-platform machine learning offerings in Google Cloud—BigQuery ML and AutoML—and learn when to choose each for your projects. You’ll gain a clear understanding of how BigQuery ML brings SQL-based modeling to your existing structured data in BigQuery, while AutoML provides low-code pipelines for both structured and unstructured data. By the end, you’ll be able to pick the right tool for forecasting, classification, clustering, and more.
What You’ll Learn:
Machine Learning on Google Cloud
Overview of Google’s integrated ML ecosystem: training, evaluation, and deployment
How scalability and data proximity drive performance in BigQuery
BigQuery ML Fundamentals
Build and run models using familiar SQL syntax—no Python required
Train on petabyte-scale tables without moving data out of BigQuery
Key use cases:
Time-series forecasting (e.g., demand prediction with ARIMA)
Customer segmentation via built-in clustering
AutoML Fundamentals
Low-code model creation for users with limited ML expertise
Automated feature engineering, hyperparameter tuning, and deployment
Key use cases:
Image classification (detecting defects in manufacturing)
Natural language processing (sentiment analysis on customer feedback)
Comparing BigQuery ML vs. AutoML
Data types: Structured only (BigQuery ML) vs. structured + unstructured (AutoML)
Skill requirements: SQL proficiency (BigQuery ML) vs. minimal ML knowledge (AutoML)
Integration & performance: Zero data movement and native SQL (BigQuery ML) vs. automated pipelines and specialized model support (AutoML)
Decision Criteria & Best Practices
When to leverage SQL-based modeling for rapid prototyping
When to adopt AutoML for vision, language, or advanced custom models
How to integrate models into your data pipelines for real-time predictions
Key Takeaways:
BigQuery ML lets you turn SQL queries into scalable machine learning workflows on your existing tables.
AutoML offers a no-code path for building vision and language models, plus structured-data classifiers and regressors.
Choosing the right tool depends on your data type, team skills, and the level of model customization you need.
In this lecture, you will learn how to systematically plan and execute a machine learning project from start to finish. By following a structured lifecycle—problem definition through deployment—you’ll ensure that your models are reliable, maintainable, and aligned with business goals.
What You’ll Learn:
1. Problem Definition
Clearly articulate the business or research question you want the model to solve
Establish measurable objectives and success criteria (e.g., accuracy thresholds)
Account for project constraints: budget, timeline, compute resources, and ethics
2. Data Collection
Identify relevant data sources: internal databases, public datasets, APIs, or streaming feeds
Plan how to ingest data (batch vs. streaming) into your environment
Assess data volume and quality requirements before moving to analysis
3. Data Preparation & Exploration
Clean the data: handle missing values, remove duplicates, and address outliers
Perform exploratory data analysis to uncover patterns and validate assumptions
Engineer new features and transform existing ones to improve model performance
4. Model Training
Choose appropriate algorithms based on problem type (classification, regression, clustering)
Split data into training, validation, and test sets to guard against overfitting
Use hyperparameter tuning techniques (grid search, random search) to optimize model parameters
5. Model Evaluation
Select relevant metrics (accuracy, precision, recall, F1-score) for your use case
Apply cross-validation and holdout testing to assess generalization
Monitor for underfitting vs. overfitting, and iterate on feature engineering or algorithm choice
6. Prediction & Deployment
Design inference pipelines for real-time (online) or batch scoring
Deploy models on Google Cloud AI Platform, in containers, or via serverless functions
Implement versioning to rollback or update models seamlessly
7. Monitoring & Maintenance
Continuously track model performance in production for data drift or degradation
Retrain or tune models periodically with fresh data
Document every experiment and maintain code/data version control for reproducibility
Best Practices
Rigorously document each step to ensure transparent, reproducible workflows
Use Git or similar tools to version both code and datasets
Encourage collaboration and clear communication between data engineers, scientists, and stakeholders
Treat ML as an iterative process—expect multiple cycles of refinement
Key Takeaways:
A well-defined lifecycle—from problem scoping through deployment—forms the backbone of successful ML projects.
Investing time in data quality, feature engineering, and robust evaluation early on prevents costly rework later.
Continuous monitoring and iteration keep your models accurate and aligned with changing business needs.
In this lecture, you’ll discover how to use Looker—Google Cloud’s enterprise BI platform—to explore, analyze, and visualize data without writing SQL. We’ll walk through accessing a temporary Looker instance via Cloud Skills Boost, loading sample datasets, building ad-hoc queries, and creating interactive visualizations and filters. By the end, you’ll know how to prototype dashboards and share insights with stakeholders.
What You’ll Learn:
Accessing Looker via Cloud Skills Boost
Obtain temporary credentials from a Looker lab in Cloud Skills Boost
Log into the Looker UI and navigate the main menus
Exploring Data with Looker
Use the Explore interface to browse sample datasets (e.g., Fruit Basket)
Understand dimensions (fruit type, color, weight) and measures (count, sum, average)
Preview data and view the underlying SQL Looker generates
Building Visualizations
Run queries to retrieve specific metrics (e.g., total price per fruit)
Switch between result formats: table, single value, and chart types
Customize charts (bar, column, single value) to highlight key figures
Applying Filters & Drill-Downs
Add filters (e.g., fruit color = “Green”) to focus your analysis
Create calculated fields on the fly for custom metrics
Drill down into dimensions to reveal more granular insights
Creating Dashboards & Sharing Results
Pin visualizations to a dashboard board for easy access
Schedule reports or download visualizations for offline sharing
Leverage pre-built Looker templates from the Marketplace
Key Takeaways:
Looker’s no-code Explore interface lets you query and visualize data quickly, even without SQL knowledge.
Temporary lab credentials through Cloud Skills Boost provide hands-on access to a full Looker environment.
Interactive filters, pivoting, and calculated fields enable deep, ad hoc analysis in minutes.
Dashboards and scheduling features simplify sharing insights across teams.
In this hands-on lecture, you will build a complete binary classification pipeline using BigQuery ML—no Python or external tools required. From ingesting CSV data to cleaning, exploring, training a logistic regression model, and making predictions, you’ll see how SQL statements power every step of the machine learning lifecycle directly in BigQuery.
What You’ll Learn:
Data Ingestion & Schema Detection
Load your customer dataset from CSV into a new BigQuery table
Verify and auto-detect column types: integers for age, income, score; string for gender
Data Cleaning & Exploration
Check for nulls to ensure data quality
Deduplicate records via CREATE TABLE … AS SELECT DISTINCT
Compute gender-based averages and counts to understand feature distributions
Model Creation with SQL
Define and train a logistic regression model using CREATE MODEL … OPTIONS(MODEL_TYPE='LOGISTIC_REG')
Specify label column (purchase) and data split ratio (80% train, 20% test)
Model Evaluation
Use ML.EVALUATE to obtain accuracy, precision, recall, and F1 scores
Examine the built-in confusion matrix to gauge model performance
Making Predictions
Apply ML.PREDICT on held-out data and on custom feature inputs
Retrieve predicted labels and probabilities for binary outcomes
Cleanup & Best Practices
Delete temporary tables and models to keep your project tidy
Leverage BigQuery ML’s zero-data-movement advantage for seamless, scalable ML
Key Takeaways:
BigQuery ML empowers you to conduct full ML workflows—data prep through prediction—using only SQL.
Logistic regression is an ideal starting point for simple binary classification tasks.
Inspecting data distributions and model metrics early ensures you catch quality issues before deployment.
In this lecture, you’ll learn how to select the most appropriate data transformation tool within Google Cloud for your specific workload—whether it’s batch analytics, streaming ETL, or complex workflow orchestration. We’ll compare five core services (Dataproc, Dataflow, Data Fusion, Composer, and Databricks) in terms of use cases, strengths, weaknesses, and cost considerations, equipping you to design efficient, scalable, and maintainable pipelines.
What You’ll Learn:
Key Selection Criteria
Evaluate data volume and velocity to determine batch vs. real-time needs
Assess transformation complexity and integration requirements
Balance ease of use, operational overhead, and budget constraints
Tool Overviews & Comparisons
Dataproc (Managed Spark/Hadoop)
Ideal for large-scale batch analytics and migrating legacy Hadoop jobs
Pros: Familiar ecosystem, rapid cluster provisioning
Cons: Cluster management overhead, not serverless
Dataflow (Serverless Apache Beam)
Unified stream and batch processing with auto-scaling
Pros: Real-time suitability, minimal infrastructure management
Cons: Higher cost on small jobs, limited local debugging
Data Fusion (GUI-Based ETL)
Drag-and-drop pipeline builder with prebuilt connectors
Pros: No-code interface for non-technical users, rapid prototyping
Cons: Less flexible for complex logic, higher price at scale
Cloud Composer (Managed Airflow)
Workflow orchestration for multi-step, dependency-driven pipelines
Pros: Powerful scheduling, extensible via Python DAGs
Cons: Steeper learning curve, more overhead for small projects
Dataform (SQL-Based Transformations)
SQL workflow management and version control for data warehouses
Pros: Leverages existing SQL skills, built-in lineage tracking
Cons: Limited to batch SQL transforms, not suited for real-time
Comparison Matrix
Real-time vs. batch suitability
Ease of use and required skill sets
Cost efficiency and operational complexity
Best-fit scenarios for each tool
Best Practices for Tool Selection
Align tool choice with business goals and SLA requirements
Prototype critical pipelines to validate performance and cost
Plan for future growth: ensure maintainability and team skill alignment
Key Takeaways:
Choosing the right transformation tool is critical for pipeline performance, cost control, and team productivity.
GCP offers both code-centric and low-code/no-code options—select based on your data volume, complexity, and existing skill sets.
Proof-of-concepts and clear criteria ensure you pick a solution that scales with your business needs.
In this lecture, you will learn the fundamental differences between ETL (Extract-Transform-Load) and ELT (Extract-Load-Transform), and understand when to apply each approach in your data pipelines. Through real-world use cases—ranging from compliance-driven banking systems to agile, large-scale e-commerce analytics—you’ll gain the criteria needed to align your integration strategy with business goals, data volume, and technology constraints.
What You’ll Learn:
Understanding ETL vs. ELT
ETL: transform data before loading into your warehouse—for strict compliance, real-time processing, or legacy integrations
ELT: load raw data first, then transform in-place—ideal for massive volumes, rapid ingestion, and agile analytics
Process Workflows & Characteristics
ETL pipelines cleanse, aggregate, and encrypt data externally before storage
ELT pipelines dump raw data into a scalable warehouse (e.g., BigQuery) and use SQL or Spark to shape it afterward
Key Use Cases
ETL:
Banking or healthcare systems requiring pre-load masking/encryption for GDPR/PCI compliance
Real-time fraud detection or risk assessment demanding immediate data cleansing
ELT:
E-commerce or media platforms with high-volume clickstream and transaction logs
Data science teams iterating on models, A/B tests, and custom transformations without re-extracting data
Comparative Advantages
Transformation Location: external (ETL) vs. in-warehouse (ELT)
Scalability & Speed: moderate throughput in ETL vs. high concurrency and speed in ELT
Flexibility & Cost: limited reruns but specialized control in ETL vs. on-demand reprocessing and lower infrastructure cost in ELT
Decision Criteria & Best Practices
Match your pipeline to data compliance, volume, and latency requirements
Prototype critical paths to validate performance and cost implications
Leverage managed services (Dataflow, Dataproc, Data Fusion, Dataform) that align with your chosen pattern
Document and version your transformation logic for auditability and reproducibility
Key Takeaways:
ETL gives you tight control over data quality and compliance before storage, but requires upfront processing and specialized tools.
ELT leverages powerful cloud warehouses for scalable, flexible transformations—perfect for analytics, machine learning, and rapid iteration.
The right choice depends on your business objectives, regulatory needs, data volume, and team expertise.
In this practical lecture, you’ll provision a fully managed Cloud Composer environment on Google Cloud and familiarize yourself with the Apache Airflow interface—laying the foundation for orchestrating robust data pipelines. By the end, you’ll understand how to configure Composer, manage its resources, and navigate the key Airflow components needed to deploy and monitor your DAGs.
What You’ll Learn:
Setting Up a Cloud Composer Environment
Choose the appropriate Composer version (Composer 2 with Airflow 2.x)
Define a service account with the right IAM roles (Composer Worker, Storage Object Admin)
Select environment size (small, medium, large or custom resource allocations)
Configure networking and security defaults for public access
Understanding Composer’s Google-Managed Components
How Composer spins up GKE clusters, Cloud SQL backends, and Cloud Storage buckets
The role of scheduler, DAG processor, web server, and worker resources
Navigating the Airflow UI
DAGs View: See all workflows, their schedules, and run statuses
Graph & Tree Views: Inspect task dependencies and execution history
Logs & Monitoring: Access scheduler, task, and webserver logs directly in the UI
Admin & Config: Review environment variables, Airflow configurations, and install PyPI packages on-the-fly
Key Composer Best Practices
Use a dedicated, least-privilege service account for production environments
Leverage environment labels and IAM policies for resource governance
Clean up unused environments and buckets to manage costs
Key Takeaways:
Cloud Composer abstracts away complex infrastructure, letting you focus on writing and running Airflow DAGs.
Proper sizing and IAM configuration are critical to ensure your workflows run reliably and securely.
The Airflow UI provides all the monitoring and debugging capabilities you need to maintain production pipelines.
In this hands-on session, you’ll learn how to create, deploy, and run your very first Apache Airflow DAG inside a Cloud Composer environment. You’ll start with a simple “Hello World” pipeline that uses built-in operators to demonstrate Composer’s basic workflow orchestration, and see how easy it is to schedule, monitor, and trigger tasks using the Airflow UI.
What You’ll Learn:
DAG Structure & Operators
Define default arguments and schedule intervals for your workflow
Use the DummyOperator to mark start and end points
Execute shell commands with the BashOperator (e.g., printing the date and a greeting)
Run Python functions via the PythonOperator
Deploying Your DAG
Upload your DAG file to the environment’s Cloud Storage bucket under the dags/ folder
Understand how Composer polls for new DAGs and makes them available in the UI
Running & Monitoring
View the DAG list and run history in the Composer Airflow web interface
Inspect the Graph View to see task dependencies and statuses
Access task logs directly in the UI to debug or verify outputs
Manual & Scheduled Triggers
Automatically schedule your DAG to run once per day
Manually trigger runs on-demand and observe the live status updates
Distinguish between scheduled vs. external runs
Key Takeaways:
Cloud Composer abstracts Airflow setup so you can focus on writing and orchestrating your workflows.
DAGs are simply Python scripts—uploading them to your Composer bucket makes them live in minutes.
The Airflow UI provides a complete overview of task execution, logs, and scheduling, enabling easy debugging and monitoring.
In this hands-on session, you’ll create a complete extract-load-transform (ELT) workflow in Cloud Composer, using Apache Airflow to orchestrate data movement and transformation. By the end, you’ll have a scheduled pipeline that ingests a CSV file of New York housing listings from Cloud Storage into BigQuery, then runs a summary query to generate an aggregated results table.
What You’ll Learn:
Pipeline Overview
Understand the end-to-end flow: extract CSV from Cloud Storage → load raw table in BigQuery → transform into a summary table
Authoring Your DAG
Define default arguments and schedule interval (daily run)
Use start/end markers with DummyOperator
Load data into BigQuery using the GCS to BigQuery operator
Execute a summary SQL query with the BigQueryInsertJob operator
Deploying to Composer
Upload your DAG file into the Composer environment’s dags/ bucket
Watch Airflow detect and register the new DAG automatically
Executing & Monitoring
Trigger the pipeline manually or let it run on its daily schedule
Inspect the Graph View for task dependencies and run statuses
View task logs to confirm successful data loads and query execution
Verifying Results
Explore the raw house_data table to see all 4,801 records in BigQuery
Preview the house_summary table to validate min/max prices and listing counts
Key Takeaways:
Cloud Composer makes it easy to orchestrate multi-step ELT jobs without managing infrastructure.
Airflow operators for BigQuery and GCS handle data movement and SQL execution with minimal code.
The Airflow UI provides full visibility into your DAG’s schedule, run history, and logs for quick debugging.
In this hands-on session, you’ll learn how to spin up a fully managed Apache Hadoop and Spark cluster using Google Cloud Dataproc. We’ll cover every step from selecting the right region and machine types based on your quota, to configuring a single-node cluster for rapid prototyping. You’ll also discover how to navigate the cluster’s web interfaces—YARN ResourceManager, Spark History Server, HDFS NameNode—and set up automated idle shutdown to control costs.
What You’ll Learn:
Cluster Creation Essentials
Choose a region with available CPU quota
Decide between single-node (for dev/test) and multi-node high-availability deployments
Select machine families (N4, C4) and resource size (vCPU & RAM)
Configure auto-deletion after idle periods to avoid unexpected charges
Understanding Dataproc Components
How Dataproc leverages Compute Engine under the hood
Role of the staging bucket for temporary data storage
IAM requirements for the service account running your cluster
Accessing Cluster Services
Launch the YARN ResourceManager UI to track job progress
Explore the Spark History Server for completed applications
Use the HDFS NameNode UI to browse your Hadoop file system
Best Practices & Cost Controls
Always verify regional quotas before provisioning
Use single-node clusters for quick experiments, and scale to multi-node for production
Set reasonable idle-timeout values to automatically tear down unused clusters
In this practical session, you’ll learn how to navigate and manage the Hadoop Distributed File System (HDFS) on your newly provisioned Dataproc cluster. We’ll cover everything from SSH’ing into the master node to basic file operations and using the cluster’s built-in web UIs for monitoring.
What You’ll Learn:
Accessing the Master Node
SSH into your Dataproc master VM directly from the Cloud Console
Confirm that the hdfs command-line tools are available
HDFS File Management
List top-level directories to see default folders like /tmp and /user
Create a new HDFS directory (e.g. /vehicle_sold)
Upload (put) a local text file into HDFS
Verify your upload by listing its contents and using cat to display the file
Navigating the HDFS UI
Open the HDFS NameNode web interface to browse directories graphically
Understand how HDFS paths map to the UI’s tree view
Cluster Web Interfaces
Visit the YARN ResourceManager to monitor running jobs and queue status
Explore the Spark History Server for completed Spark applications
Check MapReduce job history and HDFS health via the built-in web UIs
Next Steps
Prepare to run Hive queries on your HDFS data (covered in the following lecture)
Key Takeaways:
You now know how to perform essential HDFS operations from the command line and UI.
You’re familiar with the core web dashboards that help you track jobs, resource usage, and file system status.
This groundwork will let you confidently load, process, and analyze large datasets using Hadoop, Spark, and Hive on Dataproc.
In this hands-on session, we’ll explore how to use Hive—the SQL-based data warehousing layer built on HDFS—within your Dataproc cluster. You’ll learn to run Hive queries, manage databases and tables, and perform basic data operations entirely from the command line.
What You’ll Learn:
Accessing Hive on Dataproc
SSH into your Dataproc master node
Launch the Hive shell to begin interacting with your data
Database Management in Hive
Create a new Hive database (e.g. store_db)
Switch (“USE”) between Hive databases
List available databases
Table Operations
Define a Hive table schema for product sales data
Verify table creation with SHOW TABLES
Inspect table structure via DESCRIBE
Data Ingestion
Insert sample records into your Hive table
Understand how Hive jobs are launched under the hood
Query and validate the inserted data
Cleanup Tasks
Drop tables and databases when no longer needed
Exit the Hive shell gracefully
Key Takeaways:
You’ll be comfortable launching Hive on your Dataproc cluster and executing common DDL/DML statements.
You’ll understand the relationship between Hive databases, tables, and the underlying HDFS storage.
You’ll see how simple SQL-like commands translate into MapReduce or Tez jobs, enabling you to process large datasets in a familiar way.
In this hands-on session, you’ll learn how to run Apache Spark workloads on your Dataproc cluster—both interactively and as standalone jobs. We’ll cover the end-to-end process from launching a Spark shell to packaging and submitting a complete PySpark script.
What You’ll Learn:
Interactive Spark Development
Launching a PySpark shell on your Dataproc master node
Creating Spark DataFrames on the fly
Applying basic transformations and actions (filter, groupBy, count)
Inspecting results in real time
Packaging a PySpark Job
Writing a standalone Python script that initializes a SparkSession
Embedding your data-loading, transformation and aggregation logic
Properly stopping the SparkSession at the end of your script
Job Submission
Submitting your script with spark-submit to the cluster
Understanding how Dataproc manages the Spark context and resources
Monitoring your job’s progress and logs via the Cloud Console
Best Practices & Tips
When to use the interactive shell versus batch submission
Organizing your code for repeatable, production-ready jobs
Leveraging Dataproc’s autoscaling and configuration options for optimal performance
Key Takeaways:
You’ll be fluent in developing and testing Spark code interactively on Dataproc.
You’ll know how to package your work into a PySpark script and launch it as a managed cluster job.
You’ll understand the trade-offs between rapid experimentation and production workflows on Google Cloud.
In this practical demonstration, you’ll learn how to run your PySpark workloads on a Dataproc cluster using the Google Cloud Console’s built-in job submission UI. You’ll see how to prepare your data and code in Cloud Storage, configure the job parameters through the Console, and monitor execution—all without touching a command line.
What You’ll See & Do:
Prepare Your Inputs in Cloud Storage
Upload raw data and your PySpark script to a GCS bucket
Organize files into logical folders for data vs. code
Configure a Dataproc Job in the Console
Select your target cluster and region
Pick “PySpark” as the job type
Reference your script and data via gs:// URIs
Point the job to your output BigQuery dataset and table
Submit & Monitor the Job
Launch the job with a single click
Track its lifecycle (queued → running → success) in the Jobs pane
Review detailed logs and driver output from the Console
Verify Results in BigQuery
Refresh your BigQuery dataset to see newly created tables
Run simple queries to confirm your ETL/analysis logic worked as expected
Key Pointers:
GCS Organization: Keep data and code separate to simplify job definitions.
URI Accuracy: Double-check your gs:// paths to avoid “file not found” errors.
Cluster Selection: Ensure you pick the right cluster in the matching region.
Output Handling: Use BigQuery as a sink to immediately validate your transformations.
In this hands-on session, you’ll learn how to spin up a Dataproc cluster pre-configured with JupyterLab, and then use it as a full-featured development environment for PySpark. We’ll walk through uploading a CSV dataset to Cloud Storage, launching JupyterLab from the Dataproc UI, and performing end-to-end data ingestion, transformation, and export—all in familiar notebook cells.
What You’ll See & Do:
Enable Jupyter on Your Cluster
Create a single-node Dataproc cluster with the “Jupyter” optional component
Grant the default service account access so JupyterLab can start
Prepare Your Data in Cloud Storage
Upload the orders.csv file to your GCS bucket
Inspect its schema (order ID, product, quantity, price, date)
Launch JupyterLab from the Dataproc Console
Open the notebook interface with one click
Choose the PySpark kernel for seamless Spark integration
Read & Explore Data with PySpark
Initialize a SparkSession
Load the CSV into a DataFrame and display the first 20 rows
Perform simple transformations: add a total_price column, filter, group
Write Results to BigQuery & GCS
Save the raw DataFrame into a new BigQuery dataset/table
Push transformed results (e.g., revenue per category) back to BigQuery
Export the same transformed DataFrame as CSV files into Cloud Storage
Leverage the Notebook for HDFS & Hive
Run HDFS and Hive commands directly in notebook cells
Browse your cluster’s HDFS and interact with Hive metastore
Key Pointers:
Cluster Setup: Don’t forget to enable “Jupyter” and grant the Compute Engine service account the needed IAM role.
GCS Paths: Use gs:// URIs consistently for both data sources and sinks.
Kernel Choice: Select the PySpark kernel to get SparkContext and SparkSession automatically injected.
Notebook Workflow: Develop interactively, then export your final code as a script for production scheduling.
Resource Cleanup: Delete your Dataproc cluster when done to avoid unnecessary charges.
In this practical hands-on session, you’ll learn how to build a fully managed, serverless pipeline using Google Cloud Dataflow to ingest CSV data from Cloud Storage into BigQuery—without writing any code. We’ll leverage a built-in Dataflow template, supply only configuration files and a JavaScript UDF, and then watch our job spin up, execute, and tear down resources automatically.
What You’ll Learn & Do:
Prepare Your Assets in Cloud Storage
Upload the source orders.csv file containing fields (ID, product name, quantity, unit price, category)
Upload a JSON schema file that defines BigQuery column types
Upload a JavaScript UDF that transforms each CSV row into a JSON object
Create the BigQuery Destination
Spin up a new BigQuery dataset in your chosen region
Create an empty target table (no schema required—Dataflow will apply it)
Launch the Dataflow Job from a Template
Select the “Text Files on Cloud Storage to BigQuery” template in the Dataflow UI
Point to your GCS paths for the CSV, schema, and UDF
Specify your temporary staging bucket for intermediate data
Name and start the job in the region where you have available quota
Monitor & Understand Your Pipeline
Watch your job’s graph view, showing steps from reading input, applying the JS transform, converting to BigQuery rows, and writing to BigQuery
View detailed step metrics (CPU, memory) and logs in real time
Observe automatic VM provisioning and teardown—no manual cluster management
Verify the Results in BigQuery
Inspect the newly populated table schema and preview all 100+ records
Confirm that each CSV row was correctly parsed by your UDF and loaded
Key Pointers:
No-Code Ingestion: Dataflow’s templates handle the heavy lifting; you only supply paths and a simple UDF.
Serverless & Scalable: Dataflow auto-scales workers and deletes them when done, so you pay only for what you use.
Separation of Concerns: Keep your schema and transformation logic in separate files (JSON & JS) for easy updates.
Regional Alignment: Create your dataset and run your job in the same region to minimize data egress and ensure quota availability.
Monitoring Tools: Use both graph and table views in the Dataflow UI to pinpoint performance bottlenecks or errors.
In this hands-on demo, you’ll learn how to design and run a simple ETL (Extract-Transform-Load) pipeline entirely within the Dataflow UI—no coding required. We’ll start with a CSV file in Cloud Storage, apply two transformations (filter and aggregation), and write the final results into BigQuery.
What You’ll Learn & Do:
Configure Your Dataflow Job in the UI:
Give your job a descriptive name and choose the correct Google Cloud region for execution.
Select “Batch” mode to run the pipeline a single time.
Define the Source:
Point to your CSV file in Cloud Storage and specify the comma delimiter.
Learn how you can add multiple sources for more complex pipelines.
Add Transformations Without Code:
Filter Transformation: Toss out any rows where the quantity field is ≤ 3, ensuring only larger orders proceed.
Group-By Transformation: Group the filtered records by category and compute the sum of quantity in each group.
Chain these transforms together so that the output of the filter feeds into the aggregation.
Configure the Sink (Destination):
Create or select a BigQuery dataset.
Name your output table (e.g. orders_grouped_by_category).
Choose whether to overwrite or append, and let Dataflow handle table creation.
Run & Monitor Your Pipeline:
Launch the job and watch as Dataflow automatically provisions workers.
Use the Graph View to visualize each pipeline stage (read, filter, group, write) and inspect resource usage per step.
Verify automatic teardown of worker VMs after job completion—no manual cleanup needed.
Verify Results in BigQuery:
Inspect your output table to confirm only categories with large-quantity orders appear, each showing the correct summed total.
Key Pointers:
Drag-and-Drop Builder: Perfect for quickly assembling simple ETL workflows without writing pipelines in code.
Modular Transforms: Easily add, reorder, or remove filter, map, or aggregation steps to refine your data.
Serverless Scaling: Dataflow auto-scales workers up or down based on data volume and then deletes them, so you pay only for the processing you use.
Region & Quota Considerations: Always run your job in a region where you have sufficient CPU quota and your BigQuery dataset resides.
Visual Monitoring: The Dataflow UI provides both graphical and tabular overviews of your pipeline’s progress, resource consumption, and logs.
In this hands-on demo, you will learn how to build and run a streaming Dataflow job on Google Cloud Platform that continuously ingests messages from Pub/Sub and writes them into BigQuery. Unlike our previous batch demos, here data arrives in real time and is processed on the fly. By the end of this lecture, you’ll understand how to leverage Google-provided templates to accelerate your streaming pipelines with minimal configuration.
Learning Objectives
By the end of this lecture, you will be able to:
Differentiate streaming and batch processing modes in Dataflow
Configure a Pub/Sub topic and BigQuery table schema for streaming ingestion
Launch a Dataflow job from a pre-built streaming template
Monitor pipeline execution and throughput in the Dataflow UI
Publish test messages to Pub/Sub and verify them in BigQuery
Follow best practices for cleaning up resources to control costs
Prerequisites
A Google Cloud project with billing enabled
Permissions to create Pub/Sub topics, BigQuery tables, and Dataflow jobs
Basic familiarity with the GCP Console
Lecture Outline
Introduction to Streaming in Dataflow
Contrast with batch mode
When to choose streaming pipelines
Preparing Your Sink: BigQuery
Navigate to BigQuery in the GCP Console
Create (or select) a dataset for streaming results
Define a new table schema with a single data STRING field
Creating the Dataflow Job
Open the Dataflow “Create job from template” wizard
Choose the Pub/Sub → BigQuery template
Assign a job name (e.g., demo-stream-eu) and select your region
Point to your Pub/Sub topic and BigQuery output table
Select streaming mode and a processing guarantee (e.g., “exactly once”)
Setting Up Pub/Sub
Create a new topic (e.g., dataflow-topic) directly from the Dataflow wizard
Leave all default settings (message retention, encryption) intact
Starting and Observing the Pipeline
Submit the job and watch it transition from Queued to Running
Publish test messages in JSON format ({"data":"<your text>"}) at a fixed interval (e.g., five messages, one per second)
Monitor Dataflow steps (e.g., “Read from Pub/Sub”, “Write to BigQuery”) and inspect logs for throughput metrics
Verifying Results in BigQuery
Preview the table or run a simple SELECT * query
Observe incoming rows appear in near real time
Cleaning Up Resources
Stop and cancel the Dataflow job to avoid ongoing charges
Delete the Pub/Sub topic when no longer needed
Verify that any Compute Engine instances spawned by Dataflow are shut down
In this hands-on lecture, you will learn how to create a Dataform repository in Google Cloud, configure its service account, and establish a secure connection with a GitHub repository. We’ll cover every step—from provisioning the Dataform repo in the Cloud Console, granting the necessary IAM and Secret Manager roles, to initializing your first development workspace and syncing it with GitHub. By the end, you’ll have a fully linked environment ready for writing and versioning your Dataform scripts.
Learning Objectives
By the end of this lecture, you will be able to:
Create a new Dataform repository in BigQuery’s Dataform interface.
Assign IAM roles to the Dataform service account so it can execute jobs in BigQuery.
Use Secret Manager to store a GitHub access token securely and grant the service account secret-access permissions.
Connect Dataform to GitHub by configuring the repository URL and secret-based authentication.
Initialize and synchronize a Dataform workspace with the GitHub main branch.
Lecture Outline
Provisioning the Dataform Repository
Navigate to BigQuery → Dataform in the Cloud Console
Create a new repository (e.g., “demo-two”) in your chosen region
Select the default Dataform service account and KMS key for encryption
Configuring Service Account Permissions
Identify the Dataform service account email
Grant BigQuery Job User (or Admin) role for job execution
Grant Secret Manager Secret Accessor role to retrieve GitHub tokens
Storing the GitHub Token
Generate a fine-grained GitHub personal access token with repo read/write
Create a Secret Manager secret (e.g., dataform-token) and paste the token
Verify the service account can access the new secret
Linking Dataform to GitHub
In Dataform repository settings, enter the GitHub HTTPS URL
Reference the Secret Manager key for authentication
Test and confirm a successful connection
Initializing Your First Workspace
Create a development workspace (e.g., “quick-start-workspace”)
Understand that workspaces map to Git branches
Pull from the default branch and initialize the project files
Commit and push the auto-generated files to GitHub
Verification and Next Steps
Confirm that the GitHub repo now contains Dataform project files
Review how branch/workspace synchronization works
Prepare for upcoming lectures on writing Dataform SQLX code and running transformations
Key Takeaways
You can provision and encrypt a Dataform repository directly from the BigQuery interface.
Proper IAM roles and Secret Manager permissions are essential for secure, automated CI/CD workflows.
GitHub integration allows you to version-control your Dataform project and collaborate effectively.
Workspaces in Dataform correspond to Git branches, enabling safe development and review processes.
This foundational setup ensures you have a robust, secure environment for building, testing, and deploying your Dataform pipelines. In the next lecture, we’ll dive into writing Dataform code to create tables and orchestrate data transformations in BigQuery.
In this hands-on lecture, you will learn how to use Dataform to create and populate a BigQuery table entirely from your development workspace. We’ll walk through establishing a new branch (workspace), authoring a SQL-based Dataform definition to create a dataset and table, executing the action, troubleshooting region or dataset errors, and finally committing your changes back to GitHub. By the end, you’ll understand how Dataform simplifies table creation, data loading, and version control for your analytics workflows.
Learning Objectives
By the end of this lecture, you will be able to:
Set up a new Dataform development workspace (branch) for your table-creation work
Write a Dataform SQLX definition that:
Ensures the target dataset exists
Creates a new table (e.g., orders_data.orders)
Populates it with hard-coded sample rows
Execute a single action in Dataform and monitor its progress
Diagnose and resolve common execution errors (region mismatches, missing datasets)
Commit and push your Dataform code changes to GitHub for version control
Lecture Outline
Creating a New Workspace
Concept: workspaces map to Git branches
Steps to create a “create_table” workspace from the main branch
Authoring the Table Definition
File structure in the definitions/ folder
Writing a multi-statement SQLX file to:
CREATE SCHEMA IF NOT EXISTS for your project
CREATE TABLE IF NOT EXISTS in the orders_data dataset
INSERT hard-coded rows for demonstration
Configuring Workflow Settings
Ensuring workflow_config.json points to the correct EU location
Verifying dataset region matches your Dataform repository’s default location
Executing the Action
Selecting the “orders” definition in the Dataform UI
Choosing execution options: service account, include dependencies
Monitoring job status and viewing detailed logs
Troubleshooting Execution Errors
Addressing “dataset not found in region” by creating the dataset in BigQuery
Refreshing and re-running the action until successful
Verifying the Result in BigQuery
Navigating to the BigQuery console
Confirming the new table and its populated data
Version Control with GitHub
Committing your SQLX definition with a descriptive message
Pushing changes to the remote branch
Opening a pull request to merge into main
Key Takeaways
Dataform allows you to treat table creation and data loading as versioned code, streamlining collaboration.
Workspaces (branches) isolate your development work until you’re ready to merge.
Proper workflow configuration (location, project, dataset) is critical for error-free execution.
Committing and pushing changes keeps your analytics pipeline code in Git, supporting auditability and CI/CD practices.
This lecture empowers you to build repeatable, reviewable Dataform pipelines for BigQuery, turning manual SQL jobs into scalable, maintainable code.
In this hands-on session, you will build a complete ETL pipeline using Dataform to ingest raw CSV data from Google Cloud Storage, apply transformations, and load both cleaned and aggregated results into BigQuery. We’ll cover the entire workflow—from uploading files to GCS, writing Dataform SQLX definitions to load, filter, and summarize data, to executing and validating your pipeline. By the end, you’ll have a reusable pattern for data ingestion and transformation that leverages version control and automated runs.
Learning Objectives
By the end of this lecture, you will be able to:
Upload CSV input files to a dedicated Cloud Storage bucket and folder
Create staging (stg_sales), cleaned (clean_sales), and summary (sales_summary) tables in BigQuery via Dataform
Reference GCS CSV files in your Dataform definitions using gs:// URIs
Implement data quality transformations (e.g., filtering out non-positive quantities and prices)
Build aggregated summary tables grouped by business dimensions
Grant the Dataform service account the necessary IAM roles (Storage Object Viewer, BigQuery Job User) for seamless execution
Execute, monitor, and troubleshoot Dataform runs in the Cloud Console
Lecture Outline
Preparing Your Input Data
Organize and upload the sales.csv file into a dedicated dataform/ folder in GCS
Understand file URI usage (gs://bucket/dataform/sales.csv) in Dataform scripts
Defining Staging Table (stg_sales)
Write a SQLX definition to drop and recreate the raw staging table
Load all rows from the CSV into BigQuery, skipping header rows
Implementing Data Cleaning (clean_sales)
Apply filters to remove invalid records (quantity > 0 AND unit_price > 0)
Chain the transformation to depend on the staging table
Building Summary Aggregation (sales_summary)
Group cleaned data by category and calculate total quantity and revenue
Create a summary table ready for reporting
Configuring Service Account Permissions
Assign Storage Object Viewer role for GCS access
Ensure BigQuery Job User role for table creation and data loading
Executing and Monitoring the Pipeline
Select your “load_and_transform_sales” definition in Dataform
Run the combined action and watch job status and logs
Diagnose common errors (missing IAM roles, URI typos)
Validating Results in BigQuery
Verify that the staging, cleaned, and summary tables are populated correctly
Inspect data samples and aggregated results to confirm transformation logic
Version Control and Collaboration
Commit your Dataform SQLX files to the GitHub-backed repository
Push changes and create a pull request for peer review
Key Takeaways
Dataform streamlines ETL by treating ingestion and transformation logic as versioned code.
Proper IAM roles are essential for secure, automated data pipelines.
Splitting workloads into staging, cleaning, and summary tables supports modular, auditable workflows.
With Dataform’s Git integration, you ensure reproducibility and collaboration in your data projects.
This lecture equips you with a robust, reusable Dataform pattern for ingesting raw data, applying business logic, and producing analytics-ready tables in BigQuery.
In this practical lecture, you will learn how to provision a Google Cloud Data Fusion instance step by step. We’ll navigate the Cloud Console, enable necessary APIs, select the appropriate edition, assign required IAM roles, and troubleshoot common UI or quota issues. By the end of this session, you’ll have a fully functional Data Fusion environment ready for building and orchestrating enterprise data pipelines.
Learning Objectives
By the end of this lecture, you will be able to:
Enable the Data Fusion API and navigate to the Data Fusion service in the Cloud Console
Choose the right edition (Basic, Enterprise, or Developer) based on cost and feature needs
Understand pricing tiers and free usage quotas for each edition
Assign essential IAM roles (e.g., Data Fusion Runner, Dataproc Worker) to the Compute Engine service account
Identify and resolve common provisioning issues, such as regional capacity or quota constraints
Access and explore the Data Fusion UI once the instance is ready
Lecture Outline
Preparing for Instance Creation
Ensure the Data Fusion API is enabled in your project
Review edition options and pricing details
Understand the trade-off between Basic (free hours) versus Enterprise features
Launching a Data Fusion Instance
Navigate to Data Fusion in the Cloud Console
Fill in the instance name, region, and edition
(Optional) Add accelerators to boost pipeline performance
Granting Service Account Permissions
Identify the Compute Engine default service account
Assign Cloud Data Fusion Runner role to allow job execution
Assign Dataproc Worker role for underlying Dataproc tasks
Monitoring and Troubleshooting Provisioning
Recognize UI delays or form resets and how to retry
Handle region-specific capacity or quota errors by switching regions
Verify instance status until it becomes Running (20–30 minutes)
Accessing the Data Fusion UI
Authenticate with your user account
Navigate through the Data Fusion home screen
Overview of key features: wrangling, pipeline creation, and governance
Preparing for Next Steps
Upload sample sales.csv to a dedicated GCS folder for Data Fusion
Preview the dataset to be used in upcoming ETL pipelines
Key Takeaways
API Enablement: Always enable the Data Fusion API before creating an instance.
Edition Selection: Choose Basic for lightweight, cost-effective development; upgrade to Enterprise for advanced features.
IAM Roles: Grant Data Fusion Runner and Dataproc Worker roles to the service account to ensure smooth pipeline execution.
Region and Quota: If you encounter provisioning failures, switch regions or verify your project’s quotas.
UI Access: Once the instance is running, use the Data Fusion UI to design, test, and manage your data pipelines.
This lecture sets up your Data Fusion environment, laying the groundwork for building robust, GUI-driven ETL workflows in subsequent sessions.
In this practical lecture, you will design and deploy the first stage of an ETL pipeline using Google Cloud Data Fusion. We’ll cover how to read a CSV file from Cloud Storage, apply code-free wrangling transformations, and configure BigQuery as the data sink. You’ll gain hands-on experience with the Data Fusion Wrangler, explore common transformation operations, and understand how to deploy a batch pipeline that provisions a Dataproc cluster automatically.
Learning Objectives
By the end of this lecture, you will be able to:
Connect to your Cloud Storage bucket and select a CSV file as the pipeline source
Use the Wrangler interface to:
Load and preview raw CSV data
Apply transformations such as data masking, type casting, case conversion, and field derivation
Configure a batch pipeline with Data Fusion:
Define Cloud Storage as the source
Attach the Wrangler transform stage
Set BigQuery as the sink and specify dataset/table details
Validate pipeline configuration, deploy the Data Fusion pipeline, and monitor Dataproc cluster provisioning
Recognize how Data Fusion automates cluster management and scales compute resources
Lecture Outline
Preparing Your Source
Navigate to Data Fusion → Wrangler
Select Cloud Storage as the pipeline input
Point to the sales.csv file in your GCS bucket
Applying Code-Free Transformations
Data Masking: mask sensitive parts of a column (e.g., order dates)
Case Conversion: convert categorical fields to uppercase
Computed Field: multiply quantity and unit_price to create total_price
Previewing Results: inspect the transformed dataset in the Wrangler UI
Building the Batch Pipeline
Switch to the Pipeline view and choose Batch mode
Source Stage: confirm GCS input settings
Transform Stage: include your Wrangler logic
Sink Stage: add BigQuery, specifying the target dataset and table name (e.g., df_gcs)
Deploying and Running the Pipeline
Configure pipeline properties (name, description)
Deploy to provision a Dataproc cluster automatically
Run the pipeline and observe status updates
Monitoring Cluster & Job Execution
Open the Dataproc console to view cluster creation
Track job logs to ensure transformations are applied correctly
Understand how Data Fusion manages resources behind the scenes
Key Takeaways
Wrangler provides a no-code interface for common data transformations, speeding up pipeline development.
Data Fusion’s batch pipelines seamlessly integrate GCS and BigQuery without manual cluster management.
Proper configuration of source, transform, and sink stages is critical to successful ETL workflows.
Monitoring via the Dataproc console helps you validate that your pipeline runs as expected.
This lecture lays the groundwork for a robust, scalable ETL workflow. In the next session, we’ll explore advanced transformation patterns and automate pipeline scheduling in Data Fusion.
In this lecture, we’ll continue from deploying our Data Fusion pipeline to explore job execution details, monitor progress, and validate results in BigQuery. You’ll learn how to inspect Dataproc cluster and job statuses, interpret runtime logs, troubleshoot simple issues (such as unintended partitioning), and perform cleanup of your Data Fusion instance. This session ensures you understand the end-to-end operation of a production ETL pipeline without writing any code.
Learning Objectives
By the end of this lecture, you will be able to:
Monitor Dataproc cluster provisioning and confirm job startup in the Data Fusion UI
Explore job listings and runtime arguments to track execution progress
Access and interpret Hadoop/Dataproc logs for detailed insights and warnings
Verify the successful creation of target tables in BigQuery, including checking transformations
Identify configuration oversights (e.g., unwanted partitioning) and understand how to adjust them
Delete the Data Fusion instance to avoid unnecessary costs
Lecture Outline
Monitoring Job Execution
Navigating to the Data Fusion Jobs tab
Understanding status transitions: Provisioning → Running → Succeeded
Viewing runtime arguments, elapsed time, and warning messages
Inspecting Logs
Opening detailed Hadoop job logs for troubleshooting
Locating errors versus informational messages
How to use logs to confirm each transformation step
Validating BigQuery Output
Refreshing the BigQuery console to view newly created tables
Confirming that masked dates, uppercase categories, and calculated totals appear correctly
Recognizing accidental partitioned tables and noting best practices
Troubleshooting Common Issues
Addressing non-critical warnings in logs
Best practices for naming and avoiding unintended partitioning
Tips for adjusting pipeline configuration before rerunning
Instance Cleanup
Understanding Data Fusion instance billing implications
Steps to safely delete your Data Fusion instance and free associated resources
Key Takeaways
Active monitoring of cluster and job status in Data Fusion is essential for production pipelines.
Detailed logs provide visibility into each step and help quickly pinpoint issues.
Always verify output tables in BigQuery to ensure transformations ran as expected.
Clean up unused Data Fusion instances promptly to control costs.
This lecture cements your ability to operate, validate, and manage a no-code ETL pipeline in Google Cloud Data Fusion, guiding you from deployment through teardown.
In this foundational lecture, we explore Google Cloud Eventarc, a powerful serverless event routing service designed to simplify the integration of various Google Cloud products using an event-driven architecture.
This session lays the groundwork for the hands-on demos you'll build in the upcoming lectures by covering the essential concepts, architecture, and use cases of Eventarc.
Key Topics Covered:
What is Eventarc?
Understand the purpose of Eventarc and how it enables communication between Google Cloud services through events.
Core Capabilities:
Event-driven architecture with minimal configuration
Seamless integration between services like Cloud Storage, Pub/Sub, Cloud Run, and more
Essential Components Explained:
Event Sources: GCS, Pub/Sub, Firebase
Event Destinations: Cloud Run, Cloud Functions, Workflows, GKE
Types of Events: Object creation, message publishing, and more
Eventarc Models:
Trigger (Standard)
Best for simple point-to-point communication
Example: GCS → Cloud Run
Eventarc Bus
Advanced central hub for multiple producers and consumers
Ideal for microservices and multi-team architectures
Supports custom producers/subscribers
Eventarc Pipeline
Enables filtering, transformation, and conditional routing
Example: Forward only .csv files to Cloud Run
Suitable for preprocessing events before delivery
When to Use What:
Use Triggers for simple integrations
Use Bus for complex, multi-source routing
Use Pipeline for advanced filtering and event enrichment
Technical Comparison Chart:
Routing: Point-to-point vs. multi-service vs. conditional
Transformation Support: Only available in pipelines
UI Access Points: Trigger tab, Bus tab, Pipeline tab
Recommended Use Cases:
Triggers for most event-driven apps
Bus for scalable architectures
Pipelines for conditional or filtered event handling
What You’ll Learn Next:
Following this lecture, we’ll build two real-world hands-on demos using Eventarc:
Trigger GCS to Cloud Run – Automatically invoke a Cloud Run service when a file is uploaded to Cloud Storage.
GCS to BigQuery via Cloud Run – Capture GCS file uploads and write structured logs to BigQuery using Cloud Run.
These practical examples will reinforce your understanding of triggers and event-driven design on Google Cloud.
In this hands-on lecture, we take a practical deep dive into Google Cloud Eventarc by building an end-to-end event-driven pipeline. You'll learn how to configure Eventarc triggers to automatically invoke a Cloud Run service whenever a file is uploaded to a Google Cloud Storage (GCS) bucket.
This demo reinforces the core concepts introduced in the previous lecture and showcases how seamlessly Eventarc connects different Google Cloud services.
What You'll Learn in This Lecture:
- Real-world use case: Triggering Cloud Run on object upload to GCS
- Step-by-step setup: Creating GCS buckets, Cloud Run services, and Eventarc triggers
- Working with regions: Understanding quota and region compatibility
- Permission handling: Granting required IAM roles to Eventarc and service accounts
- Event type selection: Using google.cloud.storage.object.v1.finalized to detect new file uploads
- Trigger creation: Linking GCS bucket events to a Cloud Run service using Eventarc
- Debugging common errors: Region mismatch, permission propagation delays, and troubleshooting UI warnings
- Monitoring event delivery: Viewing Cloud Run logs to verify successful invocation
Key Setup Details:
Cloud Storage Bucket:
Created in the europe-north2 region
Stores uploaded files that will trigger the event
Cloud Run Service:
Basic "Hello World" container deployed
Configured to allow unauthenticated access for simplicity
Deployed in the matching region for smooth trigger compatibility
Eventarc Trigger:
Set to listen for finalized object creation events in the GCS bucket
Connects the GCS bucket (source) to the Cloud Run service (destination)
Why This Demo Matters:
This lecture provides a hands-on understanding of Eventarc triggers, ideal for beginners exploring event-driven architecture in Google Cloud. You'll experience the complete lifecycle from service setup to real-time event detection and processing—all through the Google Cloud Console.
By the end of this demo, you’ll be confident in:
Setting up triggers for serverless communication between services
Validating event delivery using logs
Managing permissions and troubleshooting configuration issues
In this hands-on session, we take a step beyond basic event detection and build a real-world data ingestion pipeline using Google Cloud Storage (GCS), Cloud Run, Eventarc, and BigQuery.
Unlike the previous demo, where we triggered a default container to verify file uploads, here you’ll create a fully functional Cloud Run service using a custom container that logs metadata about uploaded files (e.g., name, format, and upload time) directly into BigQuery.
This is the first part of a two-part lecture, focusing on setting up BigQuery, building the Cloud Run app using a custom Docker image, and configuring permissions and registries.
Key Learning Outcomes:
By the end of this lecture, you will understand how to:
✅ Design a Cloud Run service that captures GCS event data and logs it to BigQuery
✅ Build and containerize a Python (Flask) application from scratch using Cloud Shell
✅ Set up a BigQuery dataset and table for capturing upload logs
✅ Understand Eventarc's role in routing GCS events to your Cloud Run app
✅ Handle regional and permission-based configurations effectively
✅ Transition from Container Registry (deprecated) to Artifact Registry
Topics Covered in This Lecture:
BigQuery Setup:
Create a new dataset in the EU multi-region (eventarc_ds)
Define a table (upload_logs) with schema: file name, format, and upload time
Custom Cloud Run App:
Use Flask to build a lightweight event receiver
Parse GCS event payloads for metadata
Insert extracted data into BigQuery using the google-cloud-bigquery library
Containerization with Docker:
Create essential files:
main.py (Flask app)
requirements.txt (dependencies)
Dockerfile (container instructions)
Build Docker image using Cloud Shell
Image Deployment Challenges:
Attempt image upload to Google Container Registry (GCR)
Understand why GCR is deprecated and explore transition to Artifact Registry
Handle common errors like region mismatches and permission denials
Permissions & Configuration:
Set the active project using gcloud config
Grant necessary IAM roles to allow Cloud Run and Eventarc to interact
Understand default service accounts and their privileges
Why This Lecture Is Important:
This lecture bridges the gap between theory and practical implementation by guiding you through a realistic cloud-native event logging use case. It prepares you to build more advanced, data-driven microservices and provides valuable insight into Google Cloud’s evolving container management ecosystem.
Whether you're an aspiring cloud engineer or a professional looking to build event-driven applications, this lecture is a crucial milestone in your learning path.
In this continuation of Demo 2, we complete the implementation of an event-driven pipeline that captures file upload events in Google Cloud Storage (GCS) and logs the metadata into BigQuery, using a custom Cloud Run service and Eventarc triggers.
This lecture focuses on the deployment and testing phase of the custom container image using Artifact Registry (instead of the deprecated Container Registry), as well as troubleshooting common configuration issues such as environment variables, IAM permissions, and regional alignment.
By the end of this session, you will have a fully functional pipeline that listens for GCS object creation events and stores structured metadata (file name, format, and upload time) in BigQuery—powered entirely by serverless Google Cloud services.
Key Topics Covered:
✅ Artifact Registry Setup
Create a Docker-based repository in the EU region
Authenticate Docker CLI with Artifact Registry
Build and tag Docker images from custom code
Push Docker image to Artifact Registry
✅ Deploying Custom Container to Cloud Run
Select the correct region and container image
Configure environment variables to pass BigQuery table ID
Assign IAM permissions (BigQuery Admin role) to the default Cloud Run service account
Allow unauthenticated invocation for simplicity
✅ Creating and Testing Eventarc Trigger
Set up a trigger that listens for google.cloud.storage.object.v1.finalized events
Connect GCS bucket to the newly deployed Cloud Run service
Upload test files (e.g., CSV and JPEG) to GCS
Verify trigger execution via Cloud Run logs and BigQuery data insertion
✅ Troubleshooting & Debugging
Understand and resolve issues such as:
Missing environment variables (e.g., BigQuery table ID)
IAM permission errors for service accounts
Log-based diagnostics using Cloud Run and Eventarc UI
✅ Cleanup Process
Delete:
Artifact Registry repository
Cloud Run service
Eventarc trigger
What You’ll Learn:
How to deploy a custom-built container image to Cloud Run using Artifact Registry
How to handle Cloud Run environment variables for dynamic configuration
How to monitor, debug, and verify the flow of real-time data from GCS to BigQuery
How to implement an end-to-end event-driven ingestion pipeline on Google Cloud
Why This Lecture Matters:
This hands-on project gives you the practical skills to implement serverless event-driven architectures in real-world data engineering workflows. You will experience the full lifecycle—from container creation to deployment, event trigger configuration, and verification in BigQuery.
Whether you're building automated data logging systems, serverless ETL workflows, or reactive applications, this lecture gives you a production-grade template using GCS + Eventarc + Cloud Run + BigQuery.
This course is a comprehensive, hands-on learning path designed to help you prepare for the Google Cloud Associate Data Practitioner Certification, following the structure and objectives outlined in the official exam guide.
The certification targets individuals working with data in the cloud, requiring foundational skills in managing, processing, analyzing, and visualizing data using Google Cloud technologies.
In this course, you’ll learn to confidently work across various GCP services and develop a clear understanding of their practical use in end-to-end data workflows.
Key Focus Areas:
Data Preparation and Ingestion: Learn to differentiate between ETL, ELT, and ETLT, clean and transform datasets, and work with tools like Cloud Data Fusion and BigQuery.
Data Analysis and Visualization: Use BigQuery to explore datasets, interpret analytical results, and build impactful dashboards with Looker Studio. Learn to utilize BigQuery ML and AutoML for predictive insights.
Data Pipeline Orchestration: Implement and schedule data pipelines using Cloud Composer (Apache Airflow), Dataflow (Apache Beam), Dataform, and Dataproc.
Data Management: Understand when to use Cloud Storage, BigQuery, Cloud SQL, Firestore, Bigtable, Spanner, and AlloyDB, including considerations around cost, scale, and performance.
This course blends theory with practical labs, real-world scenarios, and project-based exercises to help you internalize concepts and gain confidence.
Whether you're aiming to clear the exam or build a strong data foundation in GCP, this course provides everything you need to succeed.