Data Engineering with Google Data Fusion and BigQuery (CDAP)
What you'll learn
- Understand more about Google Cloud resources
- Use Google Data Fusion as an ETL tool
- Low-code Data Engineering
- Create Data Pipelines and DAGs
- Read and write data in Google BigQuery
- Read and write data in Google Cloud Storage
- Data transformations with low code and queries
- Some advanced SQL commands
Requirements
- GCP account
- Previous exposure to SQL
This is an INTRODUCTORY course to Google Cloud's low-code data integration tool, Data Fusion. Google Data Fusion is a fully managed data integration platform that lets data engineers efficiently create, deploy, and manage data pipelines.
One of the main reasons to use Google Data Fusion is its ease of use. With an intuitive and visual interface, data engineers can create complex data pipelines without the need for extensive coding. The drag-and-drop interface simplifies the process of data transformation and cleansing, allowing professionals to focus on business logic rather than worrying about detailed coding.
Another significant benefit of Google Data Fusion is its scalability. The platform runs on Google Cloud, which means it can handle large volumes of data with high-performance parallel processing. Data engineers can scale their processing capacity vertically or horizontally according to project needs, ensuring they can handle data demands at any scale.
Furthermore, Google Data Fusion seamlessly integrates with other services and products in the Google Cloud ecosystem. Data engineers can easily connect and integrate data pipelines with services such as BigQuery, Cloud Storage, Pub/Sub, and many others. This enables a cohesive and unified data architecture, facilitating data ingestion, storage, and analysis across multiple platforms.
In this course, you will learn:
- How Data Fusion works internally.
- What its benefits are.
- How to create a Data Fusion instance.
- Using Google Cloud Storage as a data input.
- Using BigQuery as a Data Lake (Bronze and Silver layers).
- Advanced BigQuery features: partitioned tables and the MERGE statement.
- Ingesting data from different sources.
- Transforming data with Wrangler (low code) and queries.
- Creating DAGs for data ETL (Extract, Transform, Load) and dependencies.
- Scheduling and inter-DAG dependencies.
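As a taste of the advanced BigQuery topics listed above, a MERGE statement can upsert rows from a Bronze (raw) staging table into a partitioned Silver table. Here is a minimal sketch in BigQuery SQL; the dataset, table, and column names (`bronze.orders_staging`, `silver.orders`, `order_id`, etc.) are hypothetical examples, not taken from the course:

```sql
-- Hypothetical Silver-layer table, partitioned by a DATE column.
CREATE TABLE IF NOT EXISTS silver.orders (
  order_id STRING,
  amount NUMERIC,
  order_date DATE
)
PARTITION BY order_date;

-- Upsert: update rows that already exist, insert the ones that don't.
MERGE silver.orders AS tgt
USING bronze.orders_staging AS src
  ON tgt.order_id = src.order_id
WHEN MATCHED THEN
  UPDATE SET amount = src.amount, order_date = src.order_date
WHEN NOT MATCHED THEN
  INSERT (order_id, amount, order_date)
  VALUES (src.order_id, src.amount, src.order_date);
```

Partitioning by `order_date` lets BigQuery prune partitions at query time, which reduces both cost and latency for date-filtered queries.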
Who this course is for:
- Data Engineers
- Data Analysts
- Data Scientists
- Analytics Engineers
- Low Code Developers
- Python Developers looking to reduce coding overhead
- Open Source Fans
I'm a self-taught Senior Data Engineer and content creator. I moved from being a machine operator in my 30s into the data/IT industry. I can help early-career professionals chart their path to becoming data professionals, and offer advice for those who wish to live abroad and secure a sponsored work visa.
My current stack:
Data Integration / Processing -> Databricks | Dataflow | AWS Lambda | Data Fusion | Data Factory
Automation -> Power Platform | Power Automate | Power Apps
Databases -> Snowflake | BigQuery | SQL Server
Data Transformation -> dbt
Versioning / Repository -> Git | Azure DevOps
Programming -> SQL | Python | PySpark
Cloud Providers -> Azure | GCP | AWS
Task / Data Orchestration -> Airflow
BI -> Power BI | Qlik Sense
CI / CD -> GitLab CI
Containers -> Docker