
In this lecture, we introduce the real-world data engineering project. You’ll gain an understanding of the business problem, the dataset, and the end-to-end architecture. We will cover:
Project scope and objectives
Architecture overview (Data flow from source to consumption)
Tools & technologies used: GCS, BigQuery, Dataproc, Airflow, Cloud Build
Expected outcomes and key learning points
Before we begin processing data, we must configure our data sources. This lecture focuses on:
Setting up a SQL database (MySQL) as a transactional data source
Configuring Google Cloud Storage (GCS) as a landing zone
Creating BigQuery datasets and tables for structured storage
Setting up configuration files for dynamic pipeline management
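As a rough illustration of configuration-driven pipeline management, the sketch below reads a small CSV config and derives one extract query per active table. The column names (`database`, `table`, `load_type`, `watermark`, `is_active`) and the sample rows are hypothetical and are not taken from the course's actual config file.

```python
import csv
import io

# Hypothetical config content; in a real pipeline this would be a file,
# for example one stored in a GCS bucket.
CONFIG_CSV = """database,table,load_type,watermark,is_active
hospital_a,patients,incremental,updated_at,1
hospital_a,encounters,incremental,updated_at,1
hospital_a,departments,full,,1
hospital_b,patients,incremental,updated_at,0
"""


def load_active_tables(config_text):
    """Return one dict per active table, in file order."""
    reader = csv.DictReader(io.StringIO(config_text))
    return [row for row in reader if row["is_active"] == "1"]


def build_extract_query(entry, last_watermark=None):
    """Build a SELECT for a full or incremental load of one table.

    Table and column names come from a trusted config here; a production
    version should validate them before building SQL strings.
    """
    query = f"SELECT * FROM {entry['table']}"
    if entry["load_type"] == "incremental" and last_watermark is not None:
        query += f" WHERE {entry['watermark']} > '{last_watermark}'"
    return query


active = load_active_tables(CONFIG_CSV)
queries = [build_extract_query(e, "2024-01-01 00:00:00") for e in active]
for q in queries:
    print(q)
```

With this layout, adding or disabling a source table means editing a config row rather than changing pipeline code.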
This lecture covers the ingestion of raw data into GCS using PySpark on Dataproc. Key topics include:
Creating and configuring a Dataproc cluster
Writing PySpark scripts for data extraction and transformation
Storing ingested data in GCS (Landing Zone)
Handling errors and logging
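The error-handling and logging pattern for ingestion can be sketched without a cluster. In the example below, `fake_extract` and `fake_write` are stand-ins for the real Spark read and GCS write, and the bucket name and path layout are assumptions made for illustration only.

```python
import logging
from datetime import date

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")


def landing_path(bucket, source, table, run_date):
    """Build a date-stamped landing-zone path for one table (assumed layout)."""
    return f"gs://{bucket}/landing/{source}/{table}/{run_date:%Y/%m/%d}/{table}.json"


def ingest_table(source, table, extract, write, bucket, run_date):
    """Run one extract-and-write step and return an audit record instead of raising."""
    path = landing_path(bucket, source, table, run_date)
    try:
        rows = extract(table)
        write(rows, path)
        log.info("Ingested %d rows from %s.%s to %s", len(rows), source, table, path)
        return {"table": table, "status": "success", "rows": len(rows), "path": path}
    except Exception as exc:
        log.error("Ingestion failed for %s.%s: %s", source, table, exc)
        return {"table": table, "status": "failed", "rows": 0, "error": str(exc)}


# Stand-ins for the real source read and landing-zone write.
def fake_extract(table):
    if table == "broken":
        raise ConnectionError("source unavailable")
    return [{"id": 1}, {"id": 2}]


written = {}


def fake_write(rows, path):
    written[path] = rows


run_date = date(2024, 1, 15)
ok = ingest_table("hospital_a", "patients", fake_extract, fake_write, "my-bucket", run_date)
bad = ingest_table("hospital_a", "broken", fake_extract, fake_write, "my-bucket", run_date)
```

Returning an audit record per table lets one failed table be logged and reported while the remaining tables continue to load.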
Once data is landed in GCS, we need to structure and move it to the Bronze Layer. This session covers:
Creating partitioned and clustered tables in BigQuery
Using PySpark/SQL to load raw data from GCS to BigQuery
Applying basic data validation and schema enforcement
Automating ingestion with workflow scheduling
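To make the partitioning, clustering, and schema-enforcement topics concrete, here is a small sketch that generates BigQuery DDL for a Bronze table and checks a record against the expected columns. The table ID, column list, and helper names are hypothetical; only the `PARTITION BY` / `CLUSTER BY` clauses and the four-column clustering limit reflect BigQuery itself.

```python
def build_bronze_ddl(table_id, columns, partition_column, cluster_columns):
    """Return a CREATE TABLE statement with daily partitioning and clustering."""
    if len(cluster_columns) > 4:
        raise ValueError("BigQuery allows at most four clustering columns")
    column_lines = ",\n".join(f"  {name} {dtype}" for name, dtype in columns)
    cluster_clause = ", ".join(cluster_columns)
    return (
        f"CREATE TABLE IF NOT EXISTS `{table_id}` (\n"
        f"{column_lines}\n"
        f")\n"
        f"PARTITION BY DATE({partition_column})\n"
        f"CLUSTER BY {cluster_clause}"
    )


def missing_columns(record, columns):
    """Basic schema check: list expected columns absent from a record."""
    return [name for name, _ in columns if name not in record]


# Hypothetical Bronze schema for claims data.
columns = [
    ("claim_id", "STRING"),
    ("patient_id", "STRING"),
    ("claim_amount", "NUMERIC"),
    ("ingested_at", "TIMESTAMP"),
]
ddl = build_bronze_ddl(
    "my_project.bronze.claims", columns, "ingested_at", ["patient_id", "claim_id"]
)
print(ddl)
```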
The Silver Layer involves refining the raw data from the Bronze Layer by cleaning, deduplicating, and standardizing it. You’ll learn:
Transforming raw ingested data using BigQuery SQL
Handling schema evolution and deduplication
Ensuring data quality with checks and validation
Optimizing storage for analytical queries
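The cleaning and deduplication steps can be sketched in plain Python as follows. The field names and sample records are invented for illustration. In BigQuery SQL the same "keep the latest row per key" logic is commonly written with `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ... DESC)`.

```python
def standardize(record):
    """Trim whitespace and normalize casing on a few text fields."""
    cleaned = dict(record)
    cleaned["patient_id"] = cleaned["patient_id"].strip().upper()
    cleaned["gender"] = cleaned["gender"].strip().upper()[:1]
    return cleaned


def dedupe_latest(records, key, order_by):
    """Keep only the most recent record for each key value."""
    latest = {}
    for record in records:
        current = latest.get(record[key])
        if current is None or record[order_by] > current[order_by]:
            latest[record[key]] = record
    return list(latest.values())


# Hypothetical Bronze rows: the first two describe the same patient.
raw = [
    {"patient_id": " p001 ", "gender": "female", "updated_at": "2024-01-01"},
    {"patient_id": "P001", "gender": "F", "updated_at": "2024-02-01"},
    {"patient_id": "p002", "gender": " m", "updated_at": "2024-01-10"},
]
silver = dedupe_latest([standardize(r) for r in raw], "patient_id", "updated_at")
```

Standardizing before deduplicating matters here: without it, `" p001 "` and `"P001"` would be treated as different patients.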
The Gold Layer is where data is modeled for business intelligence and reporting. This lecture focuses on:
Creating fact and dimension tables for analytical queries
Applying aggregations and joins for meaningful insights
Performance tuning and indexing strategies (in BigQuery, chiefly partitioning and clustering)
Best practices for creating Gold tables for reporting and dashboards
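A minimal sketch of the fact-and-dimension idea: a claims fact table is joined to a provider dimension and aggregated by specialty. All table contents, keys, and the `collection_rate` metric are hypothetical examples, not the course's actual Gold model.

```python
from collections import defaultdict

# Hypothetical dimension table keyed by provider identifier.
dim_provider = {
    "NPI1": {"provider_name": "Provider A", "specialty": "Cardiology"},
    "NPI2": {"provider_name": "Provider B", "specialty": "Oncology"},
}

# Hypothetical fact table: one row per claim.
fact_claims = [
    {"claim_id": "C1", "provider_key": "NPI1", "billed": 1000.0, "paid": 800.0},
    {"claim_id": "C2", "provider_key": "NPI1", "billed": 500.0, "paid": 500.0},
    {"claim_id": "C3", "provider_key": "NPI2", "billed": 2000.0, "paid": 1000.0},
]


def revenue_by_specialty(facts, providers):
    """Join claims to providers and total billed/paid amounts per specialty."""
    totals = defaultdict(lambda: {"billed": 0.0, "paid": 0.0})
    for claim in facts:
        specialty = providers[claim["provider_key"]]["specialty"]
        totals[specialty]["billed"] += claim["billed"]
        totals[specialty]["paid"] += claim["paid"]
    return {
        specialty: {
            **t,
            "collection_rate": t["paid"] / t["billed"] if t["billed"] else None,
        }
        for specialty, t in totals.items()
    }


report = revenue_by_specialty(fact_claims, dim_provider)
```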
Now that we have an ETL pipeline, we need to automate it using Airflow. This session includes:
Setting up Cloud Composer (Managed Airflow)
Writing DAGs (Directed Acyclic Graphs) for workflow orchestration
Configuring dependencies, retries, and error handling
Scheduling and monitoring DAG execution
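The two core ideas behind orchestration, dependency ordering and retries, can be illustrated with the Python standard library alone. This sketch deliberately does not use Airflow's API; the task names and the `run_with_retries` and `run_pipeline` helpers are hypothetical and only mimic what a scheduler does.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of upstream tasks it depends on.
dependencies = {
    "ingest_to_landing": set(),
    "load_bronze": {"ingest_to_landing"},
    "build_silver": {"load_bronze"},
    "build_gold": {"build_silver"},
}


def run_with_retries(task, retries):
    """Call task(); on failure retry up to `retries` more times, then re-raise."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception:
            if attempt == retries:
                raise


def run_pipeline(tasks, dependencies, retries=2):
    """Run every task after its upstream tasks and return the execution order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_retries(tasks[name], retries)
    return order


# A task that fails once and then succeeds, to exercise the retry path.
attempts = {"load_bronze": 0}


def flaky_load_bronze():
    attempts["load_bronze"] += 1
    if attempts["load_bronze"] < 2:
        raise RuntimeError("transient failure")


tasks = {name: (lambda: None) for name in dependencies}
tasks["load_bronze"] = flaky_load_bronze
order = run_pipeline(tasks, dependencies)
```

In an actual DAG the same information is declared rather than executed by hand: the scheduler resolves the order and applies the retry policy.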
For a production-grade pipeline, we need Continuous Integration and Deployment (CI/CD). This session teaches:
Version control and branching strategies with GitHub
Setting up Cloud Build for automated testing and deployment
Integrating Airflow DAGs with CI/CD pipelines
Best practices for deployment in GCP
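One kind of automated check a CI step could run before deploying DAG files is a simple syntax validation. The sketch below is an assumption about what such a check might look like, not the course's actual Cloud Build configuration. The file names and contents are made up.

```python
import ast


def find_syntax_errors(sources):
    """Return {filename: message} for every source that fails to parse."""
    errors = {}
    for filename, code in sources.items():
        try:
            ast.parse(code, filename=filename)
        except SyntaxError as exc:
            errors[filename] = f"line {exc.lineno}: {exc.msg}"
    return errors


# Hypothetical DAG sources; a CI job would read these from the repository.
sample_sources = {
    "good_dag.py": "schedule = '@daily'\n",
    "bad_dag.py": "def broken(:\n    pass\n",
}
errors = find_syntax_errors(sample_sources)
for filename, message in errors.items():
    print(f"{filename}: {message}")
```

A build step would typically fail the pipeline when `errors` is non-empty, so a broken DAG never reaches the Airflow environment.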
This project focuses on building a data lake in Google Cloud Platform (GCP) for Revenue Cycle Management (RCM) in the healthcare domain.
The goal is to centralize, clean, and transform data from multiple sources, enabling healthcare providers and insurance companies to streamline billing, claims processing, and revenue tracking.
GCP Services and Tools Used:
Google Cloud Storage (GCS): Stores raw and processed data files.
BigQuery: Serves as the analytical engine for storing and querying structured data.
Dataproc: Used for large-scale data processing with Apache Spark.
Cloud Composer (Apache Airflow): Automates ETL pipelines and workflow orchestration.
Cloud SQL (MySQL): Stores transactional Electronic Medical Records (EMR) data.
GitHub & Cloud Build: Enables version control and CI/CD implementation.
CI/CD (Continuous Integration and Continuous Deployment): Automates deployment pipelines for data processing and ETL workflows.
Techniques Involved:
Metadata-driven approach
SCD (Slowly Changing Dimension) Type 2 implementation
CDM (Common Data Model)
Medallion Architecture
Logging and Monitoring
Error Handling
Optimizations
CI/CD implementation
Additional best practices
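Of the techniques listed, SCD Type 2 is the easiest to show in a few lines: when a tracked attribute changes, the current row is closed and a new current row is appended, so history is preserved. The column names (`valid_from`, `valid_to`, `is_current`) and the sample data below are assumptions chosen for illustration.

```python
def apply_scd2(dimension, incoming, key, tracked, load_date):
    """Return a new dimension list with SCD Type 2 changes applied."""
    result = [dict(row) for row in dimension]  # copy so the input is not mutated
    current = {row[key]: row for row in result if row["is_current"]}
    for record in incoming:
        existing = current.get(record[key])
        if existing is not None and all(existing[c] == record[c] for c in tracked):
            continue  # tracked attributes unchanged, nothing to do
        if existing is not None:
            existing["valid_to"] = load_date  # close the old version
            existing["is_current"] = False
        new_row = {**record, "valid_from": load_date, "valid_to": None, "is_current": True}
        result.append(new_row)
        current[record[key]] = new_row
    return result


# Hypothetical patient dimension with one current row.
dim = [
    {"patient_id": "P1", "address": "Old St", "valid_from": "2023-01-01",
     "valid_to": None, "is_current": True},
]
incoming = [
    {"patient_id": "P1", "address": "New Ave"},  # changed address
    {"patient_id": "P2", "address": "Elm Rd"},   # new patient
]
updated = apply_scd2(dim, incoming, "patient_id", ["address"], "2024-03-01")
```

After this run the dimension holds three rows: the closed historical row for `P1`, the new current row for `P1`, and the first row for `P2`.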
Data Sources:
EMR (Electronic Medical Records) data from two hospitals
Claims files
CPT (Current Procedural Terminology) codes
NPI (National Provider Identifier) data
Expected Outcomes:
Efficient Data Pipeline: Automating the ingestion and transformation of RCM data.
Structured Data Warehouse: Gold tables in BigQuery for analytical queries.
KPI Dashboards: Insights into revenue collection, claims processing efficiency, and financial trends.