Problem Solving using PySpark - Regression & Classification

Rating: 5.0 (1 review)

Gradient Boosted Trees, XGBoost, Spark NLP, Time Series, Prophet, Data Cleaning, Descriptive Statistics, Spark SQL

Created by Sathish Jayaraman

Last updated 6/2024

English

What you'll learn

Data analysis and descriptive statistics with PySpark - Learning to compute essential descriptive statistics for data understanding and summarization
Data Cleaning with PySpark
Predictive modeling with PySpark using Regression
Applying Classification techniques to a real world problem in PySpark
Text analytics using PySpark and Spark NLP
Time-Series modeling with PySpark and Prophet
Introduction to Spark SQL for data querying

Course content

8 sections • 35 lectures • 1h 48m total length

Introduction
Problem Solving with PySpark : Regression and Classification (2:49)
Explore real-world problems with PySpark, from descriptive statistics and data cleaning to regression and classification using gradient boosted trees and XGBoost, plus text analytics for sentiment classification and time series modeling.

Setting up PySpark Environment in Google Colab (3:54)
Set up a PySpark environment in Google Colab by installing Spark, Java, and dependencies and launching a Spark session, then compute descriptive statistics using aggregation functions.
Understanding Descriptive Statistics in PySpark (5:35)
Study descriptive statistics on a Spark DataFrame, using describe, agg, and approxQuantile to summarize a multivariate metro train dataset with temperature, motor current, and pressure measurements.
Understanding Data Filtering and Slicing in PySpark (3:23)
Filter and slice PySpark DataFrames, select column subsets, and filter by conditions such as motor current > 7. Convert to pandas for Seaborn and Plotly visualizations, such as plotting oil level over time.
Summary of Descriptive Statistics in PySpark and Quiz (1:41)
Review descriptive statistics in PySpark, including count, mean, stddev, min, max, median, and mode computed via group by and order by, with optional pandas conversion for Seaborn.
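
The measures listed in this section can be illustrated without a Spark cluster. The following is a minimal plain-Python sketch, not the course's PySpark code: the readings are hypothetical values invented for illustration, and the standard `statistics` module stands in for what `describe` and `agg` would report on a DataFrame column.

```python
import statistics

# Hypothetical motor-current readings; not taken from the course dataset.
motor_current = [5.5, 6.0, 6.0, 7.5, 8.0]

# The same measures the section lists for a single numeric column.
summary = {
    "count": len(motor_current),
    "mean": statistics.mean(motor_current),
    "stddev": statistics.stdev(motor_current),  # sample standard deviation
    "min": min(motor_current),
    "max": max(motor_current),
    "median": statistics.median(motor_current),
    "mode": statistics.mode(motor_current),
}

# Filtering by a condition, analogous to the "motor current > 7" example.
high_current = [x for x in motor_current if x > 7]

for name, value in summary.items():
    print(f"{name}: {value}")
print("readings above 7:", high_current)
```

On a real Spark DataFrame the median is typically approximated (for example with approxQuantile) rather than computed exactly as it is here.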

Introduction to Data Cleaning with PySpark (1:06)
Explore data cleaning in PySpark, tackling missing values, inaccuracies, duplicates, and outliers while imputing values. Follow along in the Colab notebook to practice the process and understand its complexity.
Setting up PySpark Environment for Data Cleaning on Google Colab (2:31)
Set up the PySpark environment in Google Colab with dependencies and Plotly, create a SparkSession, verify Spark 3.5.0, and load the dataset into a PySpark DataFrame for cleaning.
Understanding the Dataset : Explanatory Analysis and Data Cleaning with PySpark (4:18)
Perform exploratory analysis on the 550,068-row dataset to inspect schema, distinct values, and nulls across columns such as user_id, gender, occupation, marital status, and product categories, guiding PySpark cleaning.
PySpark Data Cleaning : Assessment of Null Values and Outliers (4:50)
Learn to clean PySpark data by assessing null values and outliers, imputing with zero for missing categories, checking duplicates, and using box plots with Plotly for visualization.
Data Cleaning with PySpark : Imputation Strategy Quiz (3:51)
Apply an imputation strategy in PySpark to replace missing values in the kindness column using the Imputer, exploring mean, median, and mode options, then verify with transform.
Introduction to Pivot Tables in PySpark (1:30)
Learn how pivot tables in PySpark transpose the city category column into multiple columns, enabling grouped analysis of purchases by age and city with the pivot function.
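
To make the pivot idea concrete, here is a small plain-Python sketch of what pivoting does: rows are grouped by age, and each distinct city category becomes its own column holding the summed purchases. The rows, the column meanings, and the helper name `pivot_sum` are all hypothetical; this illustrates the reshaping only and is not the course's notebook code.

```python
from collections import defaultdict

# Hypothetical rows: (age group, city category, purchase amount).
rows = [
    ("18-25", "A", 100),
    ("18-25", "B", 250),
    ("26-35", "A", 300),
    ("26-35", "A", 50),
    ("26-35", "C", 400),
]


def pivot_sum(rows):
    """Group by age and spread city categories into columns, summing purchases."""
    cities = sorted({city for _, city, _ in rows})
    # Each age group starts with every city column set to None ("no rows").
    totals = defaultdict(lambda: dict.fromkeys(cities))
    for age, city, purchase in rows:
        current = totals[age][city] or 0
        totals[age][city] = current + purchase
    return cities, dict(totals)


cities, table = pivot_sum(rows)
print("age", *cities)
for age, by_city in table.items():
    print(age, *(by_city[c] for c in cities))
```

Combinations with no rows stay empty (None here), which mirrors the null cells a Spark pivot produces for missing group and category pairs.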

Introduction to Regression and Classification Problems in PySpark (4:17)
Understanding the Data Set through Explanatory Analysis (4:07)
Examine the dataset through exploratory analysis, including shape, columns, and continuous features, and review descriptive statistics for tau1 through tau4.
Correlation Analysis and Data Preparation (2:56)
Analyze correlations to identify key features for PySpark regression, using df.corr to measure the correlation coefficients of tau1–tau4 and g1–g4 with stab, visualize heatmaps, and prepare the data with VectorAssembler.
Modeling the data using Gradient Boosted Trees Regression (3:25)
Explain how gradient boosted trees regression models the relationship between inputs and outputs by combining decision trees into an ensemble, training each new tree on the residuals of the trees before it, and evaluating the result with the R2 score.
Understanding Feature Importance (2:48)
Gradient Boosted Trees Regression - Quiz (2:25)
Solve a gradient boosted trees regression quiz using Seaborn box plots to analyze the tau3 and g2 distributions, identify medians and outliers, and explain the ensemble's improved performance.
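
The idea of training on residuals and scoring with R2 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not Spark's gradient boosted trees implementation: each "tree" is a one-split stump, the data is a hypothetical noisy step function, and the names `fit_stump`, `boost`, and `r2_score` as well as the round count and learning rate are invented for this example.

```python
def fit_stump(xs, residuals):
    """Return the one-split predictor that minimises squared error on residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def r2_score(ys, preds):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: a noisy step between x = 4 and x = 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = boost(xs, ys)
score = r2_score(ys, [model(x) for x in xs])
print(f"training R2: {score:.3f}")
```

Each round corrects only a fraction of the remaining error (the learning rate), which is why the ensemble improves gradually rather than memorising the data in one step. The score here is computed on the training data, so it says nothing about generalization.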

Classification Problem Statement : Supervised Machine Learning (3:39)
Explore supervised machine learning through a binary classification problem predicting electric grid stability using PySpark, including Spark, DataFrames, the ML pipeline, and classifiers like XGBoost and logistic regression.
Data Cleaning and Preparation for XGBoost Classification Model (2:59)
Import the dataset into a PySpark DataFrame, check for missing values and duplicates, review statistics, verify that the output column is balanced, index labels, and assemble features for XGBoost classification.
XGBoost Classification Model Pipeline using PySpark (2:52)
Summary of the segment on Spark XGBoost Classifier (0:39)
Explore how XGBoost classifiers balance performance and generalization in PySpark, tuning max depth, gamma, and the number of estimators to reduce overfitting, before moving on to text analytics.

Classification Model for Text Data (2:31)
Explore supervised learning for text data by applying classification with Spark NLP and PySpark in Google Colab, including data preprocessing, modeling, and visualization.
Understanding the Data for Text Classification (3:48)
Word Cloud : Text Analytics Quiz (2:03)
Practice text analytics with a quiz by inspecting the first five rows of a pandas DataFrame, filtering positive reviews, excluding stop words, and generating a word cloud for the classification pipeline.
Spark NLP Pipeline : Classification Model (4:52)
Explore a Spark NLP pipeline for sentiment classification, using a document assembler, tokenizer, BERT embeddings, sentence embeddings, and a deep learning classifier with training and evaluation.

Introduction to Time Series Analysis : Setting up the Google Colab Notebook (2:33)
Explore time series analytics with PySpark and Prophet in Google Colab by configuring a Spark session, installing Spark and Java dependencies, and loading data from Google Drive for exploratory analysis.
Explanatory Analysis and Data Cleaning (3:06)
Conduct exploratory analysis and data cleaning on a 913,000-row Spark dataset, inspecting the schema, checking duplicates and nulls, extracting the date range, and validating results with SQL and Plotly time series visuals.
Analysis of time series components using advanced visualization techniques (5:38)
Validate aggregation results with PySpark SQL, convert to pandas, and visualize the time series components (trend and seasonality) using Plotly, box plots, and seasonal decomposition, working toward forecasting with Prophet.
Use of Prophet Model for Time Series Forecasting (2:43)
Learn to forecast time series with Prophet, including data prep, renaming columns to ds and y, train-test split, model fitting, future data frames, and visualizing predictions with confidence intervals.
Time Series Forecasting - Quiz (1:34)
Apply time series forecasting with the Prophet model in PySpark to generate a 30-day forecast, display predictions and confidence intervals, and inspect results with the DataFrame tail.
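
The data preparation steps named in this section (renaming to ds and y, a train-test split, and a 30-day horizon) can be sketched in plain Python before any model is involved. The input column names "date" and "sales", the values, and the hold-out size are assumptions made for illustration; the sketch does not call Prophet itself.

```python
from datetime import date, timedelta

# Hypothetical daily sales; the "date" and "sales" column names are assumptions.
rows = [
    {"date": date(2024, 1, 1) + timedelta(days=i), "sales": 100 + i}
    for i in range(10)
]

# Prophet expects the timestamp column to be named "ds" and the value column "y".
series = [{"ds": r["date"], "y": r["sales"]} for r in rows]

# Chronological split: hold out the most recent observations as the test set.
series.sort(key=lambda r: r["ds"])
test_size = 3
train, test = series[:-test_size], series[-test_size:]

# Dates to forecast: the 30 days following the last training date.
last_train_day = train[-1]["ds"]
future = [last_train_day + timedelta(days=i) for i in range(1, 31)]

print("train ends:", last_train_day)
print("forecast covers:", future[0], "to", future[-1])
```

Splitting by time rather than at random keeps every test observation later than the training data, which is what makes the evaluation resemble a real forecast.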

Introduction to Spark SQL Querying (2:42)
Explore Spark SQL in Apache Spark using DataFrames to run queries: set up Colab, install dependencies, create a SparkSession, read zip code data from CSV, and compare SQL with PySpark.
Comparison of PySpark statements and Spark SQL Query (4:06)
Join in Spark SQL (5:34)
Apply inner, left, right, full outer, and left anti joins in Spark SQL to combine employee and department DataFrames using shared keys, verifying schemas and results.
Join in Spark SQL - Quiz (2:07)
Explore a left semi join in Spark SQL through a quiz, using the employee and department DataFrames and matching on department id to retrieve the corresponding rows from the left frame.
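
The difference between these join types is easiest to see on a few rows. The sketch below reproduces inner, left, left semi, and left anti join semantics in plain Python; the employee and department records, the field names, and the assumption that each department id appears once on the right side are all hypothetical and chosen only for illustration.

```python
# Hypothetical employee and department records.
employees = [
    {"name": "Ana", "dept_id": 10},
    {"name": "Ben", "dept_id": 20},
    {"name": "Cy", "dept_id": 50},  # no matching department
]
departments = [
    {"dept_id": 10, "dept_name": "Finance"},
    {"dept_id": 20, "dept_name": "Marketing"},
    {"dept_id": 30, "dept_name": "Sales"},  # no matching employee
]

# Assumes department ids are unique on the right-hand side.
dept_by_id = {d["dept_id"]: d for d in departments}

# Inner join: only employees with a matching department, columns from both sides.
inner = [
    {**e, **dept_by_id[e["dept_id"]]}
    for e in employees
    if e["dept_id"] in dept_by_id
]

# Left join: every employee, with None where no department matches.
left = [
    {**e, "dept_name": dept_by_id.get(e["dept_id"], {}).get("dept_name")}
    for e in employees
]

# Left semi join: matching employees only, and only the left-side columns.
left_semi = [e for e in employees if e["dept_id"] in dept_by_id]

# Left anti join: employees with no matching department.
left_anti = [e for e in employees if e["dept_id"] not in dept_by_id]

print("inner:", [row["name"] for row in inner])
print("left semi:", [row["name"] for row in left_semi])
print("left anti:", [row["name"] for row in left_anti])
```

A left semi join returns the same rows as an inner join in this example but drops the right-side columns, while a left anti join is its complement on the left table.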

Requirements

Basic knowledge of data science and ML principles will be helpful
Familiarity with Python to work with PySpark
A computer with internet to access course material

Description

This course is based on real-world problems in PySpark, covering data cleaning, descriptive statistics, and classification and regression modeling.

The first segment introduces descriptive statistics in PySpark: computing fundamental measures such as the mean and standard deviation, and generating an extended statistical summary.

The second segment covers cleaning data in PySpark: working with null values and redundant data, and imputing the null values.
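
As a rough illustration of imputation, the plain-Python sketch below fills missing entries with the mean, median, or mode of the observed values, the three options the course's imputation quiz explores. The helper name `impute` and the sample values are hypothetical, and this is not the PySpark Imputer itself.

```python
import statistics

# Maps each strategy name to the function that computes the fill value.
STRATEGIES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "mode": statistics.mode,
}


def impute(values, strategy="mean"):
    """Replace None entries with a statistic of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](observed)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries and one large value.
column = [2.0, None, 4.0, 4.0, None, 10.0]

print("mean:  ", impute(column, "mean"))
print("median:", impute(column, "median"))
print("mode:  ", impute(column, "mode"))
```

The large value pulls the mean above the median here, which is one reason the choice of strategy matters when a column contains outliers.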

The third segment is about predictive modeling with PySpark using Gradient Boosted Trees Regression.

The fourth and fifth segments apply classification techniques in PySpark. The fourth segment introduces the Spark XGB Classifier for a classification problem, and the fifth segment uses a deep learning model for text sentiment classification.

The sixth segment is about time series analytics and modeling using PySpark and Prophet.

The seventh segment introduces Spark SQL for data querying and analysis.

These segments also include advanced visualization techniques using the Seaborn and Plotly libraries: box plots to understand the distribution of the data and assess outliers, count plots to check how balanced the data is, a bar chart of feature importance from the Gradient Boosted Trees Regression model, a word cloud for text analytics, and time series plots to extract seasonality and trend components.

Each of these segments includes a Google Colab notebook that aligns with the lecture.

Who this course is for:

This course is suited for anyone interested in analytics using PySpark. It is particularly useful for analysts and engineers interested in Big Data, and for anyone with a basic knowledge of data science and ML principles.

Course content

Introduction: 2 lectures • 3min
Data analysis and descriptive statistics with PySpark: 4 lectures • 15min
Data Cleaning with PySpark: 6 lectures • 18min
Predictive modeling with PySpark using Regression: 6 lectures • 20min
Predictive Modeling with PySpark using Classification: 4 lectures • 10min
Text analytics using PySpark and Spark NLP: 4 lectures • 13min
Time Series Analysis and Forecast with PySpark and Prophet: 5 lectures • 16min
Introduction to Spark SQL: 4 lectures • 14min
