
Explore real-world problems with PySpark, from descriptive statistics and data cleaning to regression and classification using gradient boosted trees and XGBoost, plus text analytics for sentiment and time series modeling.
Set up a PySpark environment in Google Colab by installing Spark, Java, and dependencies and launching a Spark session, then compute descriptive statistics using aggregation functions.
Study descriptive statistics in PySpark on a Spark DataFrame, using describe, agg, and approxQuantile to summarize a multivariate metro train dataset with temperature, motor current, and pressure measurements.
Filter and slice PySpark data frames, select column subsets, and filter by conditions such as motor current > 7. Convert to pandas for Seaborn/Plotly visualizations, plotting oil level over time.
Explore descriptive statistics in PySpark, including count, mean, stddev, min, max, median, and mode computed via group by and order by, with optional pandas conversion for Seaborn.
Explore data cleaning in PySpark, tackling missing values, inaccuracies, duplicates, and outliers while imputing values. Follow along in the Colab notebook to practice the process and understand its complexity.
Set up the PySpark environment in Google Colab with dependencies and Plotly, create a SparkSession, verify Spark 3.5.0, and load the dataset into a PySpark data frame for cleaning.
Perform exploratory analysis on the 550,068-row dataset to inspect schema, distinct values, and nulls across columns such as user_id, gender, occupation, marital status, and product categories, guiding PySpark cleaning.
Learn to clean PySpark data by assessing null values and outliers, imputing with zero for missing categories, checking duplicates, and using box plots with Plotly for visualization.
Apply an imputation strategy in PySpark to replace missing values in the kindness column using the Imputer, exploring mean, median, and mode options, then verify with transform.
Learn how pivot tables in PySpark transpose the city category column into multiple columns, enabling grouped analysis of purchases by age and city with the pivot function.
Examine the dataset through exploratory analysis, including shape, columns, and continuous features, and review descriptive statistics for tau1 through tau4.
Analyze correlations to identify key features for PySpark regression, using df.corr to measure the correlation coefficients of tau1–tau4 and g1–g4 with stab, visualize them as heatmaps, and prepare the data with VectorAssembler.
Explain how gradient boosted trees regression models the relationship between inputs and outputs by combining decision trees into an ensemble, training each tree on the residuals, and evaluating with the R2 score.
Solve a gradient boosted trees regression quiz using Seaborn box plots to analyze tau3 and g2 distributions, identify medians and outliers, and explain the ensemble’s improved performance.
Explore supervised machine learning through a binary classification problem predicting electric grid stability using PySpark, including Spark, DataFrame, ML pipeline, and classifiers like XGBoost and logistic regression.
Import the dataset into a PySpark data frame, verify missing values and duplicates, review statistics, verify that the output column is balanced, index labels, and assemble features for XGBoost classification.
Explore how XGBoost classifiers balance performance and generalization in PySpark, tuning max depth, gamma, and estimators to reduce overfitting, with future focus on text analytics.
Explore supervised learning for text data by applying classification with Spark NLP and PySpark in Google Colab, including data preprocessing, modeling, and visualization.
Practice text analytics with a quiz by inspecting the first five rows of a pandas DataFrame, filtering positive reviews, excluding stop words, and generating a word cloud for the classification pipeline.
Explore a Spark NLP pipeline for sentiment classification in text analytics, using a document assembler, tokenizer, BERT embeddings, sentence embeddings, and a deep learning classifier with training and evaluation.
Explore time series analytics with PySpark and Prophet in Google Colab by configuring a Spark session, installing Spark and Java dependencies, and loading data from Google Drive for exploratory analysis.
Conduct exploratory analysis and data cleaning on a 913,000-row Spark dataset, inspecting schema, checking duplicates and nulls, extracting date range, and validating results with SQL and Plotly time series visuals.
Validate aggregation results with PySpark SQL, convert to pandas, and visualize the time series components of trend and seasonality using Plotly, box plots, and seasonal decomposition toward forecasting with Prophet.
Learn to forecast time series with Prophet, including data prep, renaming to ds and y, train-test split, model fitting, future data frames, and visualizing predictions with confidence intervals.
Apply time series forecasting with the Prophet model in PySpark to generate a 30-day forecast, display predictions and confidence intervals, and inspect results with the data frame tail.
Explore Spark SQL in Apache Spark using DataFrames to run queries, set up Colab, install dependencies, create a SparkSession, read zip codes data in CSV, and compare SQL with PySpark.
Apply inner, left, right, full outer, and left anti joins in Spark SQL to combine employee and department DataFrames using shared keys, verifying schemas and results.
Explore a left semi join in Spark SQL through a quiz, using the employee and department DataFrames and matching on department id to retrieve the corresponding rows from the left frame.
This course is based on real-world problems in PySpark, covering data cleaning, descriptive statistics, and classification and regression modeling.
The first segment introduces descriptive statistics in PySpark: computing fundamental measures such as mean and standard deviation, and generating an extended statistical summary.
The second segment covers cleaning data in PySpark: working with null values, removing redundant data, and imputing the null values.
The third segment is about predictive modeling with PySpark using Gradient Boosted Trees Regression.
The fourth and fifth segments are based on applying classification techniques in PySpark. The fourth segment introduces the Spark XGB Classifier for a classification problem, and the fifth segment uses a deep learning model for text sentiment classification.
The sixth segment is about time series analytics and modeling using PySpark and Prophet.
The seventh segment introduces Spark SQL for data querying and analysis.
These segments also include advanced visualization techniques with the Seaborn and Plotly libraries: box plots to understand the distribution of the data and assess outliers; count plots to check the balance in the proportion of data; bar charts to represent feature importance in the Gradient Boosted Trees Regression model; word clouds for text analytics; and time series plots to extract seasonality and trend components.
Each of these segments has a Google Colab notebook that aligns with the lecture.
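The idea behind the Gradient Boosted Trees Regression segment, building an ensemble by training each new tree on the residuals of the current prediction, can be illustrated without Spark. The sketch below is a simplified pure-Python stand-in. It assumes one-split decision stumps, squared error, and made-up data. It is not PySpark's GBTRegressor and not the course's code, and all function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (needs >= 2 distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def r2_score(ys, preds):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up training data: y = x^2 on x = 1..10.
xs = list(range(1, 11))
ys = [x * x for x in xs]

# Training R2 for ensembles of increasing size.
r2_by_trees = {}
for n in (0, 5, 50):
    model = fit_gbt(xs, ys, n_trees=n)
    r2_by_trees[n] = r2_score(ys, [predict(model, x) for x in xs])
print(r2_by_trees)
```

With zero trees the model predicts the mean, so the training R2 is 0. Each added stump fits what the ensemble still gets wrong, so the training R2 rises as trees are added. This says nothing about held-out data, where too many trees can overfit.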