Transform Data into Insights with Dagster and Deepnote

Rating: 4.0 (20 reviews)

Data Engineering for Empowered Business Decisions: ETL, Exploration & Visualization

Created by Simon Szalai

Last updated 2/2023

English

English [Auto]

What you'll learn

Turn messy, real-world data into actionable insights.
Gain familiarity with tools such as Deepnote, Dagster, and Metabase.
Use Deepnote as a data engineering development environment.
Generate realistic development data for analysis and visualization.
Learn data exploration and preprocessing techniques using Python and SQL.
Clean and normalize data from various sources, such as relational databases, JSON, .xls files and more.
Set up Dagster to orchestrate your data pipeline.
Integrate the processing logic into a scalable ETL pipeline with Dagster.
Deploy your pipeline to Dagster Cloud (serverless).
Optimize processing through techniques such as parallelization or streamed processing.
Create powerful data visualizations using Metabase.

Course content

9 sections • 38 lectures • 8h 42m total length

Welcome to the World of Data Engineering (1:22)
In this data engineering course, led by Simon Szalai, you will learn how to build an ETL pipeline with Python and Dagster. You will use Dagster as the orchestrator, together with a cloud-based development environment and a database, to explore and visualize processed data. By the end of the course, you will have the skills to create similar pipelines for your own organization or clients and unlock the value of their data. Discover the role of a data engineer and why it's in high demand as data becomes increasingly valuable.
The Power of Clean, Organized Data (1:57)
In this video, I will show you how to identify valuable data as a data engineer. I'll use an e-commerce example to illustrate the difference between raw and processed data and how processing data can unlock valuable insights for companies. I will explain your role as a data engineer in transforming raw data into useful information and the importance of clear and self-explanatory column names and key metrics. By the end of this video, you will understand how to make data valuable for decision-making.
The Skills and Tools Needed to be a Successful Data Engineer (2:56)
In this video, we'll dive into the skills and tools needed to be a successful data engineer. The job is not easy: you'll need to work with data stored in multiple places and formats, including relational databases like PostgreSQL and MySQL, non-relational databases like MongoDB and DynamoDB, and files on local systems or cloud storage services like Amazon S3, Google Cloud Storage, or Azure Blob Storage. These files can be images, videos, audio files, binary files, or JSON and CSV files, and they can all have different columns. And that's just the beginning.
Once you've got your data, you'll need to make sure it's up to standard by checking for missing values, typos, invalid data, and columns with mixed types. If any data was input manually by humans, it will contain errors - files might have been uploaded to the wrong folders, different versions of the app that produced the data might have output different formats, and there may be bugs in the app that resulted in missing or misplaced data. And that's just the start.
Once you've cleaned and structured the data, you'll need to visualize it. You'll need to transfer the data to an environment where you can use visualization tools like Matplotlib, Plotly, or Seaborn. If you want a real-time dashboard, you'll need to look into something like Tableau. It sounds overwhelming, but it's not all bad news. It's 2022, and there are many platforms and open-source tools that make this job easier. You won't need to run your own servers or implement database connectors from scratch, which means you'll spend most of your time designing the system and implementing actual business logic. And that's much more rewarding and enjoyable.
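The quality checks mentioned in this lecture (missing values and columns with mixed types) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `profile_column` and the sample rows are hypothetical and are not taken from the course material.

```python
from collections import Counter


def profile_column(rows, column):
    """Summarize missing values and the Python types seen in one column."""
    values = [row.get(column) for row in rows]
    # Treat None, an absent key, and the empty string as "missing".
    present = [v for v in values if v is not None and v != ""]
    type_counts = Counter(type(v).__name__ for v in present)
    return {
        "missing": len(values) - len(present),
        "types": dict(type_counts),
        "mixed": len(type_counts) > 1,
    }


# Hypothetical sample: one float, one string typed by hand, two missing values.
rows = [
    {"order_id": 1, "amount": 12.5},
    {"order_id": 2, "amount": "7,90"},
    {"order_id": 3, "amount": None},
    {"order_id": 4},
]

report = profile_column(rows, "amount")
print(report)
```

A report like this makes it easier to decide which columns need cleaning before they are normalized.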
An ETL pipeline for Small and Medium-Sized Businesses (1:32)
Learn how to solve a common problem faced by small and medium-sized businesses in this course. I will show you how to implement a system that helps them improve efficiency and make data-driven decisions by interfacing with common data sources, cleaning and normalizing data, and building a real-time dashboard. Whether you're a data scientist, engineer, or freelancer, this course will help you impress your manager or secure compensation for your expertise.

Deepnote (8:53)
In this video, I show you how to use the Deepnote platform as a setup-free development environment. We cover the different types of machines available, including GPU options for working with large amounts of data. I also demonstrate how to select from various programming environments, including different versions of Python and TensorFlow optimized versions. Additionally, I show you how to customize your environment with a Dockerfile and how to enable and disable connections to the machine running your notebook. I also discuss the Linux terminal feature and the integrations available, such as connecting to Google Drive, Amazon S3, and various databases. Overall, Deepnote makes it easy to connect to different data sources and focus on your work, rather than setting up infrastructure.
Dagster (12:41)
Learn how to use Dagster, a cloud-native orchestrator, to build and manage your ETL pipelines. In this video, I will explain what an orchestrator is, why you need it, and how Dagster can help you monitor your jobs, schedule them, and ensure they are running smoothly. I will also show you some of Dagster's key features, such as the task-based workflow and the single pane of glass data platform. Additionally, I will share tips on how to use Dagster's Slack community for support.
I'll also mention that there are two versions of Dagster: the open-source version, which you can run and manage on your own servers, and the cloud version, which is a fully managed service. The cloud version is a good option if you don't want to spend time on deploying and maintaining infrastructure, and it also has a free trial. Additionally, I'll talk about the Dagster documentation, which is comprehensive and includes tutorials for getting started and deploying the open-source version.
Metabase (5:21)
In this video, we will discuss Metabase as a business intelligence tool. It can connect to different data sources; in this case, we will connect it to a Postgres database that already contains the preprocessed data. The main goal of the data pipeline is to process and transform the data so that it is accessible, useful, and easy to visualize. The great thing about Metabase is that it is a low-code tool, which means non-technical people can use it to create their own visualizations and do their own data exploration.
Other tools (2:54)
In this video, I will discuss Google Cloud, specifically Cloud SQL for PostgreSQL. Signing up for an account gives you $300 in free credits, which we will use for the course. I will also mention TablePlus, a database explorer that I use; it can connect to cloud or local databases and can be downloaded for different operating systems.

Building the Solution Architecture (8:51)
In this section of the course, I will be showing you how to build a solution architecture for our data pipeline. We will be using a Postgres database and Google Drive as input sources. The focus of this course is on building the pipeline and developing its structure, rather than on DevOps. We will be breaking down the pipeline into three stages: extract, transform, and load. The extract stage involves getting data from the source, the transform stage involves applying any necessary transformations such as normalizing or cleaning data, and the load stage involves saving the data to another database. We will also be visualizing the data using Metabase. The development process is divided into three sections: Business Logic Development, Dagster development, and Production. The reason for separating the development environment is to improve the speed of execution. We will be using Deepnote for fast development of the business and transformation logic.
Everything is connected to the same data source, even in the early stages of development, to ensure that the functions will run on all data sets and prevent errors. We will be using Dagster ops and jobs for local development, and deploying them to the cloud for production. It is important to use separate databases for development and production to prevent interference. We will be using a local Postgres database for speed and simplicity.
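The three stages described above (extract, transform, load) can be sketched as three plain Python functions before any orchestrator is involved. This is a minimal illustration, not the course's implementation: the function names, the field names, and the in-memory list standing in for the target database are all assumptions.

```python
def extract(source_rows):
    """Extract: read raw records from a source (here, an in-memory list)."""
    return list(source_rows)


def transform(rows):
    """Transform: normalize product names, convert amounts, drop empty rows."""
    cleaned = []
    for row in rows:
        if row.get("amount") in (None, ""):
            continue  # skip rows without an amount
        cleaned.append(
            {
                "product": row["product"].strip().lower(),
                "amount": round(float(row["amount"]), 2),
            }
        )
    return cleaned


def load(rows, target):
    """Load: append the cleaned rows to the target store and report the count."""
    target.extend(rows)
    return len(rows)


# Hypothetical raw input with inconsistent casing, whitespace, and types.
raw = [
    {"product": " Blueberry ", "amount": "4.50"},
    {"product": "Raspberry", "amount": None},
    {"product": "STRAWBERRY", "amount": 3},
]

warehouse = []  # stands in for the target database
loaded = load(transform(extract(raw)), warehouse)
print(loaded, warehouse)
```

Keeping each stage as a separate function is one way to develop the logic quickly in a notebook and wrap it in orchestrator steps later.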

Creating a PostgreSQL Database on Google Cloud (6:59)
To begin, we will create a trial account on the Google Cloud platform. You will need a credit or debit card, but don't worry: you will start with $300 of free credits, and your card will not be charged even if you use up that amount. After signing up, you can either create a new project or use the default project provided. Next, we will search for the Cloud SQL API to create our database. After creating the instance, we will specify a few settings such as the instance ID, password, and development version. We will also select a region that is closest to us and has the low CO2 option. Lastly, we will enable the public IP and add our network so that we can connect to the database. We will also set up TablePlus to connect and verify the database connection.
Generating Synthetic Data of a Hypothetical Client (8:20)
In this video, we will continue building our data pipeline by setting up Deepnote and creating a new project in our account. We will also configure the Postgres connection and Google Drive integration, which we will use later on in the course to generate data for a hypothetical company that grows and sells berries. They sell their produce in-person at markets, through a main shop on their farm, and even accept crypto as a form of payment. Our data pipeline will collect and normalize all this transaction data for easy use and visualization.
Explanation of the data generation process (optional) (14:47)
In this video, I explain in detail how the dummy data has been generated by the provided Jupyter notebook.
Verifying the Generated Data (6:00)

Preprocessing Relational Data: POS Transactions (40:15)
Preprocess relational POS transactions by hashing IDs with hashlib, mapping SKUs to product names using lookups, and implementing upsert writes via SQLAlchemy and pandas to_sql.
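The three steps named in this lecture (hashing IDs, mapping SKUs through a lookup, and upserting) can be sketched as follows. This is an illustration under stated assumptions, not the course's code: to stay self-contained it uses the standard-library `sqlite3` module in place of SQLAlchemy, pandas, and Postgres, and the SKU table, column names, and `pos` prefix are hypothetical.

```python
import hashlib
import sqlite3

# Hypothetical SKU lookup table; the real mapping comes from the course data.
SKU_TO_PRODUCT = {"BLU-250": "Blueberries 250g", "RSP-250": "Raspberries 250g"}


def hash_id(raw_id, prefix="pos"):
    """Return a stable SHA-256 hex digest for a source-specific ID."""
    return hashlib.sha256(f"{prefix}:{raw_id}".encode("utf-8")).hexdigest()


def preprocess(row):
    """Hash the ID, map the SKU to a product name, and cast the amount."""
    return {
        "id": hash_id(row["id"]),
        "product": SKU_TO_PRODUCT.get(row["sku"], "unknown"),
        "amount": float(row["amount"]),
    }


def upsert(conn, rows):
    """Insert rows, updating existing ones that share the same primary key."""
    conn.executemany(
        """
        INSERT INTO transactions (id, product, amount)
        VALUES (:id, :product, :amount)
        ON CONFLICT(id) DO UPDATE SET
            product = excluded.product,
            amount = excluded.amount
        """,
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id TEXT PRIMARY KEY, product TEXT, amount REAL)"
)

# The first run writes transaction 1; the second run corrects it and adds another.
upsert(conn, [preprocess({"id": 1, "sku": "BLU-250", "amount": "4.50"})])
upsert(
    conn,
    [
        preprocess({"id": 1, "sku": "BLU-250", "amount": "5.00"}),
        preprocess({"id": 2, "sku": "XXX", "amount": "2"}),
    ],
)

stored = conn.execute(
    "SELECT product, amount FROM transactions ORDER BY amount"
).fetchall()
print(stored)
```

Because the hashed ID is deterministic, re-running the pipeline on the same source rows updates existing records instead of duplicating them.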
Preprocessing Relational Data: Crypto Transactions (26:00)
Preprocessing JSON Data (15:17)
Preprocess JSON data from the Stripe API by extracting product descriptions, computing unit price, tax, and quantity, identifying unique transaction types, and constructing a transactions dataframe for Dagster and Deepnote workflows.
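A rough sketch of this kind of JSON flattening is shown below. The payload shape is a simplified, hypothetical stand-in and does not reproduce Stripe's actual response schema; the field names and the assumption that amounts are stored in cents are for illustration only.

```python
import json

# Hypothetical payload; real API responses are nested differently.
payload = json.loads(
    """
    [
      {"id": "ch_1", "type": "charge", "description": "Blueberries 250g",
       "quantity": 2, "amount_total": 1080, "tax": 80},
      {"id": "re_1", "type": "refund", "description": "Raspberries 250g",
       "quantity": 1, "amount_total": 540, "tax": 40}
    ]
    """
)


def to_transaction(record):
    """Flatten one record; amounts are assumed to be in minor units (cents)."""
    net = record["amount_total"] - record["tax"]
    return {
        "id": record["id"],
        "type": record["type"],
        "product": record["description"],
        "quantity": record["quantity"],
        "unit_price": net / record["quantity"] / 100,
        "tax": record["tax"] / 100,
    }


transactions = [to_transaction(record) for record in payload]

# Unique transaction types, useful for deciding how to treat each kind of row.
transaction_types = sorted({t["type"] for t in transactions})
print(transaction_types)
print(transactions)
```

A list of flat dictionaries like `transactions` can be passed to `pandas.DataFrame` to obtain the transactions dataframe.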
Preprocessing Excel Sheets: Loading Files from Google Drive (38:37)
Preprocessing Excel Sheets: Market Transactions (57:26)
Refactoring Business Logic: Challenge (11:24)
Refactoring Business Logic: Solution (7:42)
Unit Testing (8:15)

Overview of Dagster Concepts (17:51)
Set up Local Dagster Development (20:15)
Set up a local Dagster development environment by configuring Dagster Cloud, creating a virtual environment, installing dependencies, enabling the daemon, and running the UI and sample data pipelines locally.
Extracting Data (23:49)
Transforming and Loading Data (15:55)
Partitioned Processing (17:15)
Job Configuration (16:38)
Streamed Data Processing (18:23)
Processing Files (6:32)
Create a Dagster job that processes market transaction files by loading data from Google Drive, transforming it into a dataframe, and scheduling automatic runs.
Creating Dagster Schedules (4:32)
Creating Dagster Sensors (19:02)
Deploying to Dagster Cloud (17:54)

Requirements

Basic Python Knowledge

Description

Do you struggle with making data-driven decisions for your business due to scattered, inconsistent, and inaccessible data? This course is the solution! Learn to build a streamlined and efficient ETL pipeline that will allow you to turn data into actionable insights.

This course teaches you how to build a system that collects data from multiple sources, normalizes it, and stores it in a consistent and accessible format. You will learn how to extract data, explore and preprocess it, and ultimately visualize it to support better decision-making and optimize business processes.

Forget about big data and cluster management headaches: this course is designed to get you up and running quickly with a real-time ETL pipeline. With infrastructure costs under $50 a month, you can start seeing immediate results and return on investment for your clients or company.

In the first part of the course, I will walk you through the architecture and introduce you to the tools we will be using:

Deepnote, as a setup-free development environment
Dagster, as the pipeline orchestrator
Metabase, as a low-code data visualization platform

While the course will introduce you to the relevant features of Deepnote and Metabase, it is mostly focused on Dagster.

In the next part, we will get started by generating dummy sales data of a hypothetical company using Deepnote. The code will be provided for this. Once we have the data, the course will dive into data exploration and preprocessing techniques using Python and SQL in Deepnote, including cleaning and normalizing data from various sources such as relational and JSON data, Excel sheets, and more. We will implement the processing logic in Deepnote, then commit it to a Git repository that will be shared with Dagster.

In the following section, we will wrap the business logic with Dagster operations and jobs, then deploy them to Dagster Cloud (self-hosted option also available), which will allow you to manage everything from a single, unified view. In this section, you will also learn a few tricks to speed up and optimize processing, such as parallelization or streamed processing.

In the final section of this course, you'll bring your preprocessed data to life with Metabase. With a few simple clicks, even non-technical individuals will be able to create stunning, powerful visualizations that unlock the full potential of your data.

By the end of this course, you'll have a comprehensive understanding of the tools used and how they work together, empowering you to provide tangible benefits to your clients or company from day one, measured in thousands or tens of thousands of dollars.

The choice is yours - will you seize this opportunity to deliver massive benefits to your company or clients, and claim your fair share of the rewards?

Who this course is for:

Developers seeking to build scalable and efficient ETL pipelines.
Entrepreneurs looking to leverage data for business growth.
Data analysts and scientists who want to streamline their data processing workflow.
Business professionals looking to improve their data-driven decision-making abilities.
Students and recent graduates interested in a career in data engineering.
Data managers tasked with organizing and making data accessible for analysis.
Project managers looking to implement data-driven solutions for clients or company.
Individuals interested in learning cutting-edge tools and techniques in data engineering.

Course content

Introduction (4 lectures • 8min)

Exploring the Tools of the ETL pipeline: Deepnote, Dagster, and Metabase (4 lectures • 30min)

Designing the ETL Pipeline: From Data Sources to Dashboards (1 lecture • 9min)

Setting Up Your Development Environment and Generating Dummy Data (4 lectures • 36min)

Getting Started with Deepnote: An Introduction to Python and SQL for Data Explor (4 lectures • 45min)

Data Preprocessing in Deepnote: Cleaning and Normalizing Data (8 lectures • 3hr 25min)

Setting up the ETL pipeline with Dagster (11 lectures • 2hr 58min)

Visualizing Data in Metabase (1 lecture • 11min)

Bonus Content (1 lecture • 1min)
