Mastering Data Wrangling with PySpark in Databricks

Rating: 4.5 (66 reviews)

From Beginner to Pro: Learn Key Data Processing Skills and Machine Learning with PySpark in Databricks

Created by Gustavo R Santos

Last updated 10/2024

English

What you'll learn

Understand the fundamental concepts of PySpark and Databricks and their significance in the world of big data analytics.
Learn how to set up and configure your Databricks environment, including creating an account and managing clusters.
Explore PySpark's data structures, DataFrames, and Datasets, and learn to create and work with structured data.
Master the essential data manipulation techniques in PySpark, including selecting, filtering, transforming, aggregating, and handling missing data.
Discover how to use PySpark SQL for structured queries, compare it with DataFrame operations, and understand when to use each.
Learn the essentials of ETL (Extract, Transform, Load) processes with PySpark, including reading and writing data, data cleaning, and partitioning.
Gain an overview of PySpark's MLlib library and different types of machine learning tasks.
Dive into feature engineering, model selection, evaluation, and hyperparameter tuning for building robust machine learning models using PySpark.
Discover performance optimization techniques in PySpark, including data caching, broadcast variables, and query optimization.
Explore strategies for scaling PySpark workloads, including best practices for handling large datasets.

Course content

8 sections • 56 lectures • 6h 43m total length

Course Overview • 2:38
Explore PySpark and Databricks, set up the environment, and master data structures, data types, and DataFrames. Learn to transform, clean, and extract data, and optimize queries.
Notebooks • 0:03

Introduction to PySpark and Databricks • 9:27
Learn how PySpark uses in-memory cluster computing, DataFrames, and RDDs to analyze large datasets, with Databricks as a collaborative, notebook-based platform.
Setting up Your Databricks Environment • 3:49
Set up your Databricks environment by creating a Community Edition account, launching a cluster, and starting a notebook to connect to the cluster and run Python, SQL, Scala, or R.
Inside Databricks • 4:37
Take an inside look at Databricks, including notebooks, clusters, data, and the Spark UI. Learn to attach notebooks to a cluster, run code, and switch languages in notebooks.
Transformations vs Actions • 8:19
Learn how Spark uses lazy evaluation to optimize workflows by separating transformations, which create new DataFrames, from actions, which return calculated results; explore examples like filter, groupBy, and aggregates.

PySpark Data Structures • 7:17
Explore the basic data structures used with PySpark, including Python lists, tuples, and dictionaries, and RDDs versus DataFrames. Learn to parallelize lists into RDDs and create PySpark DataFrames from rows.
PySpark Data Structures
Schema and data types • 3:43
Define a schema and data types in PySpark, using StructField and StructType to map columns to double, integer, and string types, then apply it to correct the types of depth and price.
Creating DataFrames • 2:21
Create DataFrames in Databricks with PySpark by defining a schema with StructType and StructField for the date, product, store, and total fields, then calling spark.createDataFrame with the data and schema.
Creating DataFrames - Part 2 • 5:22
Explore multiple PySpark techniques to create DataFrames in Databricks, including using a struct schema, parallelizing lists into an RDD, building with createDataFrame from tuples, and pandas on Spark.
Importing PySpark Functions in Databricks • 4:16
Master importing PySpark functions in Databricks with from pyspark.sql.functions and optional aliasing, using col, countDistinct, sum, min, max, and mean; leverage window techniques over loops for big data.
What to import to a PySpark Session
Loading and Displaying Data in Databricks • 9:36
Learn to load the diamonds CSV dataset in Databricks with PySpark, set header to true, cast columns to float, select and display data, and visualize results.
Infer Schema • 1:08
The inferSchema option automatically infers data types when loading CSV datasets in Databricks, avoiding manual struct definitions and correcting types for price, x, y, z.
How to Load Data to Databricks

Data Manipulation with PySpark • 1:53
Develop data wrangling skills in PySpark by selecting data, slicing and filtering, handling missing values, joining datasets, and using groupBy and pivot tables for summarizing.
Selecting, Adding and Removing Columns • 12:00
Learn to select, add, and remove columns in the diamonds dataset with PySpark on Databricks, convert key fields to float, and rename or alias columns using withColumn and select.
Renaming Columns • 3:10
Learn how to rename columns in a Databricks PySpark DataFrame using withColumnRenamed or select with alias. See how these methods rename specific columns versus all columns.
Count, Count Distinct, Sort, Cast • 8:40
Master data wrangling with PySpark in Databricks by counting rows, estimating counts with samples, identifying distinct values, sorting by columns, and casting data types.
Filtering Data • 7:14
Apply filter and where to narrow datasets in PySpark on Databricks, selecting premium cuts and price thresholds to accelerate analysis and enable focused visualizations.
Filtering Contains and Like • 3:07
Apply contains and like text filters in PySpark on Databricks to select rows by substring, starts with, ends with, or middle patterns using the DataFrame filter method.
Between and isin • 5:27
Use isin and between to filter data in PySpark on Databricks, with value lists and date ranges, and use where as an alternative to filter.
Fill and Replace Values, Handling Missing Data • 10:53
Master filling and replacing missing values in PySpark on Databricks, using isNull and isnan, casting to integer, and DataFrame operations like replace, lit, and fill to clean data.
Handling Missing Data 2 • 6:51
Explore missing data handling in Databricks PySpark: drop with any, all, and thresh to keep rows with non-null values, and fill with mean or a string placeholder.
Content Check
Case When • 4:19
Leverage the case when function in PySpark on Databricks to replace many if statements, labeling data as expensive, regular, or cheap using when and isin.
Aggregating Data • 9:10
Group by cut and aggregate with PySpark in Databricks on the diamonds dataset to compute counts, mean (average) price, min, max, percentiles, and stats, then alias columns and sort by average price.
Pivot Table • 2:24
Master data wrangling with PySpark pivot tables by transforming long data to wide, grouping by cut and color, and using pivot to create color columns with min price insights.
Dealing with Date and Time • 9:11
Window • 11:17
Explore window functions in PySpark, derived from SQL, to compute ranks, rolling sums, and cumulative distributions, using window specs with partitionBy and orderBy on the diamonds dataset.
Content Check
Joining Datasets • 19:56
Master joining datasets in PySpark on Databricks with left, right, inner, full outer, and left anti join to combine stores and sales, then analyze total sales by city.
Percentile • 2:06
Learn how to use the percentile function in PySpark to compute medians and percentiles for price data, and build a distribution using 25th, 50th, 75th, and 95th percentiles.
Median (Update) • 1:05
PySpark SQL adds median in 3.5.0. The lecture compares median and percentile(50) on the diamonds dataset grouped by cut, showing equivalent results.
Other Useful Functions • 5:44
Explore PySpark SQL functions such as sample with seed, data dimensions, list values per row, greatest across columns, floor and ceiling, group by with collect_list, and file size checks.
Other Useful Functions Part 2 • 13:09
More useful functions in PySpark.
Data Caching • 4:27
Demonstrate data caching with PySpark in Databricks by caching a 60 million row grouped dataset to memory, revealing speedups for repeated filters, while noting caching may underperform on small datasets.
Saving Data to CSV • 5:47
Saving Data to Databricks File System • 4:48
Save files to the Databricks file system using parquet for better data type preservation and compression, then load with Spark to verify the saved data.
Exercises • 4:02
Practice data wrangling with PySpark in Databricks through hands-on exercises. Load, filter to ideal diamonds, create a price_tier, map tiers to numbers, perform a left join, and save to DBFS.
Exercises Solutions • 14:22
Load the diamonds dataset with spark.read.csv, filter to ideal cuts, create a price tier with when-otherwise, left join datasets, aggregate counts and mean price, and save as a Databricks table.

Query Optimization • 18:11
Optimize PySpark queries in Databricks by using DataFrames, filtering early, and avoiding user-defined functions. Cast big integers to text for exact filters and leverage partitioning to speed joins.
Cache and Persist • 8:16
Learn how cache and persist store Spark data on worker nodes to speed up computations, with lazy evaluation and actions triggering runs, choosing memory or disk.
Best practices for handling large datasets • 13:26
Explore handling large datasets in Databricks with PySpark by caching data and applying the bronze–silver–gold medallion architecture to clean, transform, and deliver ready-to-use data.

DataFrame API vs. SQL API • 3:55
Explore the DataFrame API vs SQL API in Databricks, highlighting when to use PySpark for complex transformations and machine learning, versus SQL for simple data manipulation and readability.
Working with SQL • 3:08
Explore working with SQL in Databricks by loading the diamonds data, registering a temp view, and querying with SQL to retrieve data and compute average price by cut.
Basic SQL Queries • 7:01
Learn basic SQL queries in Databricks on the diamonds dataset, grouping by cut and computing min, max, and percentile prices to explain how distribution affects averages.

Introduction to Machine Learning with PySpark • 6:21
Explore modeling with PySpark and MLlib, contrasting Spark ML and scikit-learn workflows, including vectorization, feature engineering, and pipelines for classification and regression.
MLlib Regression: Diamonds Prices • 10:46
Analyze the diamonds dataset with PySpark for exploratory data analysis, visualize distributions, and build a linear regression model to price diamonds using cut, color, and the x, y, z dimensions.
MLlib Regression: Diamonds Prices (2) • 14:20
Explore diamond prices with PySpark in Databricks, grouping by cut and color to compute total, mean, median, and price per carat. Reveal strong price drivers through carat correlations (about 0.92).
MLlib Regression: Diamonds Prices (3) • 11:48
Explore how to build a PySpark MLlib regression model in Databricks, using VectorAssembler and log transformations to predict diamond prices from carat and assess residuals.
ML Case 2 - Logistic Regression • 2:56
Build a logistic regression model in PySpark MLlib using a pipeline with VectorAssembler, StringIndexer, and OneHotEncoder to prepare the UCI Adult census data, handling missing values, for income prediction.
Feature engineering • 6:13
Apply feature engineering with a univariate selector to choose categorical and numerical variables, transform data into a vectorized frame, and output features and the income label.
Preparing Data for Modeling • 6:41
Explain handling the unbalanced output variable by transforming categorical features to numeric using StringIndexer and one-hot encoding, assembling vectorized features in a modeling pipeline.
Training and Evaluating Machine Learning Models • 5:16
Prepare features with string indexing and one-hot encoding, split data for training and testing, fit a logistic regression model, and evaluate precision, recall, and specificity from the confusion matrix.
Model Tuning • 10:57
Explore logistic regression in PySpark, tune thresholds and regularization, and use cross-validated parameter grids to balance precision and recall for imbalanced binary classification.
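
The evaluation metrics named above follow directly from the four cells of a binary confusion matrix. The plain-Python sketch below shows the arithmetic; the function name `confusion_metrics` and the example counts are hypothetical, and in PySpark the counts would typically come from grouping the predictions by label and prediction.

```python
def _ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute precision, recall, and specificity from confusion-matrix counts."""
    return {
        # Of everything predicted positive, how much was truly positive.
        "precision": _ratio(tp, tp + fp),
        # Of everything truly positive, how much was found.
        "recall": _ratio(tp, tp + fn),
        # Of everything truly negative, how much was correctly rejected.
        "specificity": _ratio(tn, tn + fp),
    }


# Hypothetical counts for an income classifier.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

On imbalanced data, raising the decision threshold usually trades recall for precision, which is why these three numbers are read together rather than relying on accuracy alone.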

Requirements

It is expected that the student has a basic knowledge of Python, such as data objects, loops and functions.

Description

Explore the world of big data analytics with our comprehensive course, 'Mastering Data Processing with PySpark in Databricks.'

In this course, we equip you with the practical skills and knowledge required to navigate the complexities of PySpark and Databricks, two industry-leading tools for efficient data processing, analysis, and the extraction of valuable insights from large datasets.

As technology evolves, access to big data gets easier each day, and professionals who can process and extract insights from large datasets are in demand at big tech companies. Learning how to use Databricks will help you become one of those professionals!

Gain practical skills in PySpark and Databricks to efficiently process, analyze, and extract valuable insights from vast datasets. Discover data processing, transformation, query optimization, and machine learning techniques, starting from the basics.

In the age of data-driven decision-making, understanding PySpark in Databricks is not just an advantage but a necessity. By enrolling in this course, you'll be poised to take your data analytics capabilities to the next level, making you a sought-after professional in a data-centric world.

Join us and take the first step towards optimizing your data processing skills.

By the end of this course, you will be ready to add PySpark to your resume!

Enroll today to enhance your data analytics capabilities and boost your career in the data-driven world!

Who this course is for:

Data Scientists who are new to PySpark and Databricks and need to get up to speed with this technology.
Professionals who are starting a new role and need to master Databricks for data analysis.
Enthusiasts and curious professionals eager to learn a new skill.

Course content

Introduction • 2 lectures • 3min

Getting Started with PySpark and Databricks • 4 lectures • 26min

Basics of PySpark • 7 lectures • 34min

Data Wrangling With PySpark • 24 lectures • 2hr 51min

Query Optimization • 3 lectures • 40min

Databricks SQL • 3 lectures • 14min

Machine Learning with PySpark • 9 lectures • 1hr 15min

Conclusion • 4 lectures • 41min
