Problem Solving using PySpark - Regression & Classification
Rating: 5.0 out of 5(1 rating)
27 students

Gradient Boosted Trees, XGBoost, Spark NLP, Time Series, Prophet, Data Cleaning, Descriptive Statistics, Spark SQL
Last updated 6/2024
English

What you'll learn

  • Data analysis and descriptive statistics with PySpark: computing essential statistics to understand and summarize data
  • Data Cleaning with PySpark
  • Predictive modeling with PySpark using Regression
  • Applying classification techniques to a real-world problem in PySpark
  • Text analytics using PySpark and Spark NLP
  • Time-Series modeling with PySpark and Prophet
  • Introduction to Spark SQL for data querying

Course content

8 sections • 35 lectures • 1h 48m total length
  • Introduction
  • Problem Solving with PySpark: Regression and Classification (2:49)

    Explore real-world problems with PySpark, from descriptive statistics and data cleaning to regression and classification using gradient boosted trees and XGBoost, plus text analytics for sentiment and time series modeling.

Requirements

  • Basic knowledge of data science and ML principles will be helpful
  • Familiarity with Python to work with PySpark
  • A computer with internet to access course material

Description

This course works through real-world problems in PySpark, covering data cleaning, descriptive statistics, and classification and regression modeling.

The first segment introduces descriptive statistics in PySpark: computing fundamental measures such as the mean and standard deviation, and generating an extended statistical summary.

The second segment covers data cleaning in PySpark: handling null values, removing redundant data, and imputing missing values.

The third segment covers predictive modeling with PySpark using Gradient Boosted Trees regression.

The fourth and fifth segments apply classification techniques in PySpark. The fourth segment introduces the Spark XGB Classifier for a classification problem, and the fifth uses a deep learning model for text sentiment classification.

The sixth segment covers time series analytics and modeling using PySpark and Prophet.

The seventh segment introduces Spark SQL for data querying and analysis.

These segments also include advanced visualization techniques using the Seaborn and Plotly libraries: box plots to examine the distribution of the data and assess outliers, count plots to check the balance of proportions in the data, a bar chart of feature importance from the Gradient Boosted Trees regression model, a word cloud for text analytics, and time series analysis to extract seasonality and trend components.

Each segment includes a Google Colab notebook that accompanies the lecture.

Who this course is for:

  • This course suits anyone interested in analytics with PySpark. It is particularly useful for analysts and engineers interested in Big Data, and for anyone with a basic knowledge of data science and ML principles.