GCP Data Engineering-End to End Project-Healthcare Domain

Rating: 4.6 (503 reviews)

Industry-standard project in the healthcare domain using GCP services such as GCS, BigQuery, Dataproc, and Composer, together with GitHub and CI/CD

Created by Saidhul Shaik

Last updated 4/2025

English

What you'll learn

Understand an end-to-end data engineering project
Design and Implement Scalable ETL Pipelines for Healthcare Data
Implement key techniques such as incremental loads, SCD type 2, a metadata-driven approach, Medallion Architecture, error handling, CDM, and CI/CD
Develop and Deploy Data Solutions with CI/CD Practices

Course content

2 sections • 10 lectures • 7h 47m total length

Important Links (0:02)
Introductory Lecture to Understand Project (29:17)

Project Introduction - Overview, Architecture (38:55)
In this lecture, we introduce the real-world data engineering project. You’ll gain an understanding of the business problem, dataset, and the end-to-end architecture. We will cover:
Project scope and objectives
Architecture overview (Data flow from source to consumption)
Tools & technologies used: GCS, BigQuery, Dataproc, Airflow, Cloud Build
Expected outcomes and key learning points
Setting up the Data Sources - SQL DBs, GCS, BQ, Configs (48:20)
Before we begin processing data, we must configure our data sources. This lecture focuses on:
Setting up a SQL database (MySQL) as a transactional data source
Configuring Google Cloud Storage (GCS) as a landing zone
Creating BigQuery datasets and tables for structured storage
Setting up configuration files for dynamic pipeline management
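As an illustration of what "configuration files for dynamic pipeline management" can look like, here is a minimal sketch in plain Python. The config columns (`load_type`, `watermark_column`, `is_active`) and the table names are assumptions made for this example; the course's actual config layout is not shown in this listing.

```python
import csv
import io

# Hypothetical config content: one row per source table.
CONFIG_CSV = """source_db,table_name,load_type,watermark_column,is_active
hospital_a,patients,incremental,modified_date,1
hospital_a,departments,full,,1
hospital_b,encounters,incremental,modified_date,0
"""

def load_config(text):
    """Parse the config and keep only the rows flagged as active."""
    rows = csv.DictReader(io.StringIO(text))
    return [row for row in rows if row["is_active"] == "1"]

def build_extract_query(entry, last_watermark=None):
    """Build the SELECT for one table: a full scan, or only rows past the watermark."""
    query = f"SELECT * FROM {entry['table_name']}"
    if entry["load_type"] == "incremental" and last_watermark is not None:
        query += f" WHERE {entry['watermark_column']} > '{last_watermark}'"
    return query

entries = load_config(CONFIG_CSV)
queries = [build_extract_query(e, "2025-01-01 00:00:00") for e in entries]
for q in queries:
    print(q)
```

Adding a table to the pipeline then means adding a config row rather than writing new code. The query is built by string formatting, so this sketch assumes the config file is trusted input.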
Data Ingestion - Dataproc, PySpark, GCS Landing (1:33:25)
This lecture covers the ingestion of raw data into GCS using PySpark on Dataproc. Key topics include:
Creating and configuring a Dataproc cluster
Writing PySpark scripts for data extraction and transformation
Storing ingested data in GCS (Landing Zone)
Handling errors and logging
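The error-handling and logging pattern for a multi-table ingestion run can be sketched as follows. This is plain Python rather than the course's PySpark code: `fake_extract` stands in for the actual read from the SQL source and write to GCS, and the bucket name, path layout, and audit fields are assumptions for illustration.

```python
import logging
from datetime import date

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def landing_path(bucket, source, table, run_date):
    """Build a date-partitioned landing path so each run writes to its own folder."""
    return f"gs://{bucket}/landing/{source}/{table}/load_date={run_date.isoformat()}/"

def ingest_all(tables, extract_fn, run_date, bucket="example-bucket"):
    """Run extract_fn per table; log a failure and continue instead of aborting the run."""
    audit = []
    for source, table in tables:
        path = landing_path(bucket, source, table, run_date)
        try:
            row_count = extract_fn(source, table, path)
            audit.append({"table": table, "status": "SUCCESS", "rows": row_count, "path": path})
            log.info("Loaded %s rows from %s.%s", row_count, source, table)
        except Exception as exc:
            audit.append({"table": table, "status": "FAILED", "rows": 0, "path": path})
            log.error("Ingestion failed for %s.%s: %s", source, table, exc)
    return audit

def fake_extract(source, table, path):
    """Hypothetical stand-in for the real extract-and-write step."""
    if table == "broken_table":
        raise ConnectionError("source unavailable")
    return 100

audit = ingest_all(
    [("hospital_a", "patients"), ("hospital_a", "broken_table")],
    fake_extract,
    date(2025, 4, 1),
)
```

Collecting one audit record per table keeps a single failing source from hiding the status of the others, and the records could later be written to an audit table.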
GCS Landing to Bronze Layer - GCS, BigQuery (1:03:59)
Once data is landed in GCS, we need to structure and move it to the Bronze Layer. This session covers:
Creating partitioned and clustered tables in BigQuery
Using PySpark/SQL to load raw data from GCS to BigQuery
Applying basic data validation and schema enforcement
Automating ingestion with workflow scheduling
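Basic validation and schema enforcement on landed rows might look like the sketch below. It is plain Python, not BigQuery or PySpark, and the claim columns and their types are hypothetical.

```python
from datetime import date

# Hypothetical expected schema: column name -> function that casts the raw string.
EXPECTED_SCHEMA = {
    "claim_id": str,
    "claim_amount": float,
    "service_date": date.fromisoformat,
}

def enforce_schema(raw_row, schema):
    """Cast one raw row to the expected types; raise if a column is missing or malformed."""
    missing = [col for col in schema if col not in raw_row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Columns not named in the schema are dropped.
    return {col: cast(raw_row[col]) for col, cast in schema.items()}

def validate_batch(raw_rows, schema):
    """Split a batch into typed rows and rejected rows with the reason attached."""
    valid, rejected = [], []
    for row in raw_rows:
        try:
            valid.append(enforce_schema(row, schema))
        except (ValueError, TypeError) as exc:
            rejected.append({"row": row, "error": str(exc)})
    return valid, rejected

raw = [
    {"claim_id": "C1", "claim_amount": "120.50", "service_date": "2025-03-01"},
    {"claim_id": "C2", "claim_amount": "abc", "service_date": "2025-03-02"},
    {"claim_id": "C3", "claim_amount": "10"},
]
valid, rejected = validate_batch(raw, EXPECTED_SCHEMA)
```

Keeping the rejected rows together with their error message makes it possible to inspect and reprocess them later instead of silently losing data.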
Bronze to Silver Layer Processing - BigQuery (1:06:28)
The Silver Layer involves refining the raw data from the Bronze Layer by cleaning, deduplicating, and standardizing it. You’ll learn:
Transforming raw ingested data using BigQuery SQL
Handling schema evolution and deduplication
Ensuring data quality with checks and validation
Optimizing storage for analytical queries
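The cleaning and deduplication step can be illustrated with a small plain-Python sketch. In the course this work is done in BigQuery SQL; the logic below only mirrors the idea of quarantining rows with missing keys and keeping the latest record per key. Column names such as `patient_id` and `modified_date` are assumptions.

```python
def split_quarantine(rows, required):
    """Separate rows with a missing required value from rows that pass the check."""
    clean, quarantined = [], []
    for row in rows:
        if any(row.get(col) in (None, "") for col in required):
            quarantined.append(row)
        else:
            clean.append(row)
    return clean, quarantined

def deduplicate_latest(rows, key, order_by):
    """Keep one row per key: the one with the greatest order_by value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

bronze = [
    {"patient_id": "P1", "name": "Ann", "modified_date": "2025-01-01"},
    {"patient_id": "P1", "name": "Ann Lee", "modified_date": "2025-02-01"},
    {"patient_id": None, "name": "Unknown", "modified_date": "2025-01-15"},
    {"patient_id": "P2", "name": "Bob", "modified_date": "2025-01-10"},
]
clean, quarantined = split_quarantine(bronze, ["patient_id"])
silver = deduplicate_latest(clean, "patient_id", "modified_date")
```

The quarantine check runs first because a row without a key cannot be deduplicated meaningfully. ISO-formatted date strings compare correctly as text, which is why no date parsing is needed here.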
Silver to Gold Layer Processing - BigQuery (17:31)
The Gold Layer is where data is modeled for business intelligence and reporting. This lecture focuses on:
Creating fact and dimension tables for analytical queries
Applying aggregations and joins for meaningful insights
Performance tuning and indexing strategies
Best practices for creating Gold tables for reporting and dashboards
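To make the fact-and-dimension idea concrete, here is a minimal plain-Python sketch of a join followed by an aggregation, the kind of logic a Gold table query would express in SQL. The table shapes, column names, and the `collection_rate` metric are hypothetical examples, not taken from the course.

```python
from collections import defaultdict

def revenue_by_department(fact_rows, dim_dept):
    """Join fact rows to the department dimension and aggregate billed and paid amounts."""
    names = {d["dept_key"]: d["dept_name"] for d in dim_dept}
    totals = defaultdict(lambda: {"billed": 0.0, "paid": 0.0})
    for row in fact_rows:
        # Facts whose key has no dimension row are grouped under UNKNOWN.
        name = names.get(row["dept_key"], "UNKNOWN")
        totals[name]["billed"] += row["billed_amount"]
        totals[name]["paid"] += row["paid_amount"]
    return {
        name: {
            "billed": t["billed"],
            "paid": t["paid"],
            "collection_rate": round(t["paid"] / t["billed"], 2) if t["billed"] else None,
        }
        for name, t in totals.items()
    }

fact = [
    {"dept_key": 1, "billed_amount": 200.0, "paid_amount": 150.0},
    {"dept_key": 1, "billed_amount": 100.0, "paid_amount": 75.0},
    {"dept_key": 2, "billed_amount": 50.0, "paid_amount": 50.0},
]
dim = [
    {"dept_key": 1, "dept_name": "Cardiology"},
    {"dept_key": 2, "dept_name": "Radiology"},
]
kpis = revenue_by_department(fact, dim)
```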
Create DAGs - Workflow Orchestration - Airflow (28:46)
Now that we have an ETL pipeline, we need to automate it using Airflow. This session includes:
Setting up Cloud Composer (Managed Airflow)
Writing DAGs (Directed Acyclic Graphs) for workflow orchestration
Configuring dependencies, retries, and error handling
Scheduling and monitoring DAG execution
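The two ideas behind a DAG, dependency ordering and retries, can be shown without Airflow itself. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+) and is not Airflow code; the task names are assumptions based on the layers described in this course.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
PIPELINE = {
    "ingest_to_landing": set(),
    "landing_to_bronze": {"ingest_to_landing"},
    "bronze_to_silver": {"landing_to_bronze"},
    "silver_to_gold": {"bronze_to_silver"},
}

def run_with_retries(task_fn, retries):
    """Call task_fn, retrying up to `retries` extra times; return the attempts used."""
    attempts = 0
    while True:
        attempts += 1
        try:
            task_fn()
            return attempts
        except Exception:
            if attempts > retries:
                raise

def run_pipeline(graph, tasks, retries=2):
    """Run every task after its upstream tasks, in a valid dependency order."""
    order = list(TopologicalSorter(graph).static_order())
    attempts = {name: run_with_retries(tasks[name], retries) for name in order}
    return order, attempts

calls = {"count": 0}

def flaky():
    """Hypothetical task that fails on its first call only."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")

tasks = {name: (lambda: None) for name in PIPELINE}
tasks["landing_to_bronze"] = flaky
order, attempts = run_pipeline(PIPELINE, tasks)
```

`TopologicalSorter` raises `CycleError` if the graph contains a cycle, which is the "acyclic" requirement in Directed Acyclic Graph. In Airflow, the scheduler handles this ordering and the retry behavior is configured per task.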
CI/CD - GitHub, Cloud Build, Airflow (1:20:26)
For a production-grade pipeline, we need Continuous Integration and Deployment (CI/CD). This session teaches:
Version control and branching strategies with GitHub
Setting up Cloud Build for automated testing and deployment
Integrating Airflow DAGs with CI/CD pipelines
Best practices for deployment in GCP

Requirements

Basic knowledge of Python and SQL

Description

This project focuses on building a data lake in Google Cloud Platform (GCP) for Revenue Cycle Management (RCM) in the healthcare domain.
The goal is to centralize, clean, and transform data from multiple sources, enabling healthcare providers and insurance companies to streamline billing, claims processing, and revenue tracking.
GCP Services Used:
- Google Cloud Storage (GCS): Stores raw and processed data files.
- BigQuery: Serves as the analytical engine for storing and querying structured data.
- Dataproc: Used for large-scale data processing with Apache Spark.
- Cloud Composer (Apache Airflow): Automates ETL pipelines and workflow orchestration.
- Cloud SQL (MySQL): Stores transactional Electronic Medical Records (EMR) data.
- GitHub & Cloud Build: Enables version control and CI/CD implementation.
- CI/CD (Continuous Integration and Continuous Deployment): Automates deployment pipelines for data processing and ETL workflows.

Techniques involved:
- Metadata-driven approach
- SCD type 2 implementation
- CDM (Common Data Model)
- Medallion Architecture
- Logging and monitoring
- Error handling
- Optimizations
- CI/CD implementation
- Many more best practices
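Of the techniques listed, SCD type 2 is the easiest to misread, so here is a minimal sketch of its core logic in plain Python: when a tracked attribute changes, the current row is closed and a new current row is appended, so history is preserved. The column names (`start_date`, `end_date`, `is_current`) are a common convention assumed for this example; the course's own implementation is not shown in this listing.

```python
def apply_scd2(dimension, incoming, key, tracked, load_date):
    """Return a new dimension list with SCD type 2 history applied."""
    current = {r[key]: r for r in dimension if r["is_current"]}
    result = [dict(r) for r in dimension]  # copy so the input is not mutated
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # no change in tracked columns: keep the current row as-is
        if old is not None:
            # Close the existing current version of this key.
            for r in result:
                if r[key] == new[key] and r["is_current"]:
                    r["is_current"] = False
                    r["end_date"] = load_date
        # Append the new version (also covers keys seen for the first time).
        result.append({**new, "start_date": load_date, "end_date": None, "is_current": True})
    return result

dimension = [
    {"patient_id": "P1", "address": "Old St", "start_date": "2024-01-01",
     "end_date": None, "is_current": True},
]
incoming = [
    {"patient_id": "P1", "address": "New Ave"},
    {"patient_id": "P2", "address": "Elm Rd"},
]
updated = apply_scd2(dimension, incoming, "patient_id", ["address"], "2025-04-01")
```

After this run the dimension holds three rows: the closed "Old St" version of P1, the current "New Ave" version, and a first version of P2. Applying the same incoming batch again adds nothing, because no tracked column differs from the current rows.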

Data Sources:
- EMR (Electronic Medical Records) data from two hospitals
- Claims files
- CPT (Current Procedural Terminology) codes
- NPI (National Provider Identifier) data

Expected Outcomes:
- Efficient Data Pipeline: Automating the ingestion and transformation of RCM data.
- Structured Data Warehouse: Gold tables in BigQuery ready for analytical queries.
- KPI Dashboards: Insights into revenue collection, claims processing efficiency, and financial trends.

Who this course is for:

Aspiring data engineers and data professionals
Anyone preparing for data engineering interviews

Course content

Project Material: 2 lectures • 29min
GCP Data Engineering - Project: 8 lectures • 7hr 18min