PySpark Essentials for Data Scientists (Big Data + Python)

Name: PySpark Essentials for Data Scientists (Big Data + Python)
Rating: 4.4 (833 reviews)

Learn how to wrangle Big Data for Machine Learning using Python in PySpark taught by an industry expert!

Created by Layla AI

Last updated 5/2022

English

English [Auto]

What you'll learn

Use Python with Big Data on a distributed framework (Apache Spark)
Work with REAL datasets on realistic consulting projects
How to stream LIVE data from Twitter using Spark Structured Streaming
Learn how to create a "Pandora Like" app that classifies songs into genres using machine learning
Flag suspicious job postings using Natural Language Processing
Use machine learning to predict optimal cement strength and the factors that affect it
Classify Christmas cooking recipes using Topic Modeling (LDA)
Customer Segmentation using Gaussian Mixture Modeling (Clustering)
Use cluster analysis to develop a strategy designed to increase college graduation rates for underprivileged populations
How to use the k-means clustering algorithm to define a marketing outreach strategy
Integrate a UI to monitor your model training and development process with MLflow
Theory and application of cutting edge data science algorithms
Manipulate, Join and Aggregate Dataframes in Spark with Python
Learn how to apply Spark's machine learning techniques on distributed Dataframes
Cross Validation & Hyperparameter Tuning
Frequent Pattern Mining Techniques
Classification & Regression Techniques
Data Wrangling for Natural Language Processing
How to write SQL Queries in Spark

Course content

11 sections • 109 lectures • 17h 16m total length

Frequently Asked Questions0:18
Course Introduction8:52
This lecture will provide a general introduction to the course, and what will be covered.
Course Orientation5:13
Navigate the PySpark Essentials course in the Udemy environment with arrows, play/pause, bookmarks, and notes, and access transcripts, resources, datasets, and code in notebook or .py versions.
Course Materials Bulk Download0:38
Resources for Setting up PySpark0:07
This lecture contains links to several PySpark installation resources. The most basic installation requires installing Anaconda and PySpark locally; however, students also have the option to connect to a cluster if they want to get experience with a certain platform. Note that the material provided in this course will remain the same regardless of whether you choose to connect to a cluster. The only benefit of a cluster would be computational speed.
Python Cheatsheet Resources0:12
Introduction to PySpark16:08
In this lecture, students will learn about what PySpark is and why to use it. An overview of the Spark ecosystem will also be provided as a reference.
Transitioning from Python to PySpark Concept Review5:47
This lecture will walk students coming from a Python background through some of the changes they will experience during their transition to PySpark.
Transitioning from Python to PySpark Code Along Activity10:45
This lecture will walk students through the syntactical differences between Python and PySpark through a code along activity in two side-by-side Jupyter Notebooks. Both the Jupyter Notebooks and the dataset for this lecture have also been provided as attachments for those who would like to follow along.

Dataframe Essentials Concept Review8:16
In this lecture, we will go over what Dataframes are, how they are used and how we can manage them. We will also spend some time introducing some functionality that is unique to PySpark when it comes to handling big data, and then provide an overview of what will be covered in the next several code along activity lectures, where we will be getting into the more hands-on stuff.
Dataframe Essentials Concept Review Quiz
A little something to keep you going....
Read, Write and Validate Dataframes Code Along Activity30:16
This lecture will be a code along activity where students will learn how to create a PySpark instance as well as read, write and validate dataframes using a Jupyter Notebook, which is attached to this lecture along with several datasets that we will be working with.
Read, Write and Validate Data HW1:22
This lecture will provide students with a brief intro to the first PySpark coding homework assignment. The Jupyter Notebook guide for the homework assignment and the corresponding dataset have also been included as attachment resources to this lecture. I've also included a link that lists all the PySpark data types in case you need them.
Read, Write and Validate Data HW Solutions Code Review9:07
This lecture will walk students through my solution to the "Read, Write and Validate Data Homework Assignment" assigned in the previous lecture. The Jupyter Notebook containing my solutions and the dataset that corresponds to this lecture have both been provided as attachments for reference.
A little something to keep you going....
Search and Filter Dataframes Code Along Activity22:44
This lecture will walk students through the essentials of searching and filtering dataframes in PySpark through a code along activity. We will also introduce PySpark's SQL functions library here. The Jupyter Notebook and corresponding dataset have been provided as attachments in case students want to follow along on their own. Additionally, two external resources have been provided for students. The first is a "List of PySpark SQL Functions" for students to reference later on and to check out additional functions that were not covered in the lecture (there are a lot!). The second is a link to W3Schools, which is a SQL tutorial website (not specific to PySpark) that students can use if they want to learn more about SQL.
Search and Filter Dataframes HW1:48
This lecture will provide students with a brief introduction to the Search and Filter Dataframes homework assignment that corresponds to the previous lecture. The Jupyter Notebook for this exercise and the necessary dataset (same FIFA dataset as in the previous lecture) have also been attached to this lecture.
Search and Filter Dataframes HW Solution Code Review5:58
In this lecture, I will review my solutions to the Search and Filter Dataframes Homework assignment. The corresponding Jupyter Notebook for this lecture is also attached so students can follow along. Additionally, I have provided the dataset (also attached to the previous two lectures) that goes along with this lecture just in case students need it.
A little something to keep you going....
SQL Options in Spark/PySpark Code Along Activity19:02
This lecture will walk students through the various SQL options available in PySpark and how to use them. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture in case students want to follow along. I also added the same SQL resource that I attached to the previous lecture, since it will be helpful here too.
SQL Options in Spark/PySpark HW3:08
This lecture will provide students with a brief introduction to the SQL Options in Spark/PySpark homework assignment. The corresponding Jupyter Notebook and dataset have also been provided as an attachment to this lecture.
SQL Options in Spark/PySpark HW Solutions7:23
This lecture will walk students through my solutions to the SQL Options in Spark/PySpark Homework assignment. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture.
A little something to keep you going....

Manipulating Dataframes Code Along Activity49:37
This lecture will provide students with the foundational tools necessary to manipulate dataframes in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture in case students want to follow along.
Manipulating Dataframes HW4:13
This lecture will provide students with an overview of the Manipulating Dataframes HW assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments to this lecture for convenience.
Manipulating Dataframes HW Solution13:35
This lecture will provide students with my solution to the Manipulating Dataframes HW assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments to this lecture for convenience.
A little something to keep you going....
Aggregating Data in Dataframes Code Along Activity19:54
In this lecture, students will learn how to aggregate dataframes in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
Aggregating Data in Dataframes HW2:26
In this lecture, students will be provided with an introduction to the Aggregating Data in Dataframes homework assignment. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
Aggregating Data in Dataframes HW Solution7:11
In this lecture, I will walk students through my solutions to the "Aggregating Dataframes" homework assignment. I have also attached the corresponding Jupyter Notebook and dataset for convenience in case students want to follow along.
A little something to keep you going....
Joining and Appending Dataframes Code Along Activity30:31
In this lecture, students will learn how to join and append dataframes in PySpark. The corresponding Jupyter Notebook and datasets have also been provided as attachments to the lecture for convenience.
Joining and Appending Dataframes HW2:29
In this lecture, students will be provided with a brief introduction to the Joining and Appending Dataframes Homework assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments for convenience.
Joining and Appending Dataframes HW Solution Code Review11:24
In this lecture, I will review my solutions to the Joining and Appending Dataframes Homework assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments to this lecture for convenience.
A little something to keep you going....
Handling Missing Data in Dataframes Code Along Activity25:50
In this lecture, students will learn how to handle missing data in PySpark. I will cover PySpark's built-in functions as well as provide some useful functions for calculating missing data statistics and filling in missing values. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
Handling Missing Data in Dataframes HW2:53
In this lecture, students will be provided with an overview of the homework assignment for handling missing data in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
Handling Missing Data in Dataframes HW Solution5:42
In this lecture, I will be reviewing my solutions to the homework assignment for handling missing data in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
Dataframe Essentials Coding Master Review2:22
This lecture will provide students with an overview of how to best utilize the attached Dataframe Essentials Master Review Jupyter Notebook. There is no corresponding dataset, as the notebook is not meant to be run but rather serve as a coding reference guide.
A little something to keep you going....

Introduction to Machine Learning Concept Review21:34
This lecture is intended for students who are new to machine learning and need a quick primer before digging into the core concepts of machine learning in the subsequent lectures. We will go over the differences between supervised and unsupervised learning and several examples of each to provide students with a solid foundation moving forward.
Introduction to Machine Learning Quiz
Introduction to MLlib Concept Review5:48
In this lecture, students will be provided with a basic overview of what PySpark offers, the resources that are available to us from Spark, and a few introductory concepts that are essential to understand before we dig into the machine learning portion of this course. Links to the Spark MLlib Guide and Spark Documentation pages have also been provided for reference, attached to this lecture.
Model Selection and Tuning in MLlib Concept Review12:48
In this lecture, we will cover the required data format that Spark confines us to when using machine learning algorithms, as well as the concepts of cross validation, training and validation splitting methods, model selection and hyperparameter tuning, Spark pipelines, and a brief introduction to the custom function that I have created exclusively for this course that is going to make training and validating models a breeze!
Model Selection and Tuning in MLlib Quiz
Two Links to Bookmark0:20
A little something to keep you going....
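
To make the cross validation and hyperparameter tuning ideas from this section concrete, here is a minimal sketch in plain Python. It is not Spark's cross validator and it is not the custom function used in the course. The names k_fold_indices, cross_validate and fit_and_score are hypothetical helpers invented for this illustration, and the scoring function is a stand-in for "train a model on the training rows and score it on the validation rows".

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row lands in exactly one validation fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled rows into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


def cross_validate(n_rows, k, fit_and_score, param_grid):
    """Return the parameter value with the best mean validation score, plus all mean scores.

    fit_and_score(train_idx, valid_idx, param) is a hypothetical callback that
    trains on the training rows and returns a score (higher is better).
    """
    mean_scores = {}
    for param in param_grid:
        scores = [
            fit_and_score(train, valid, param)
            for train, valid in k_fold_indices(n_rows, k)
        ]
        mean_scores[param] = sum(scores) / len(scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores
```

Every candidate value in the grid is scored on the same k folds, and the value with the highest mean score wins. That is the general shape of tuning with cross validation, whichever library does the work.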

Introduction to Classification in MLlib Concept Review16:09
This lecture will be a brief refresher of concepts and terms within the classification algorithms that PySpark offers. First off, we will be going over what classification is and its various types. Then we will outline the pre-processing steps you need to take on your dependent variable in order for the algorithms to be able to process your data. Finally, we will dive into each one of the algorithms PySpark has for classification.
Classification in MLlib Quiz
A little something to keep you going....
Classification in MLlib Code Along Part 1: Data Formatting and Transformations35:03
In this lecture, students will learn how to format their data in order for Spark's MLlib to be able to process it as well as how to tackle common data transformations. The corresponding Jupyter Notebook and dataset have been provided as attachments for convenience.
Classification in MLlib Code Review Part 2.0: Train and Evaluate Models [Intro]2:36
In this lecture, students will learn about all the available algorithms for classification tasks in Spark's MLlib. Students will also learn how to use Spark's cross validator and parameter building functions, which will be paramount for your machine learning tasks on the job. We will be using the same Jupyter Notebook and dataset that were attached to the previous lecture.
Classification in MLlib Code Review Part 2.1: Train & Test Models [Logistic]13:50
In this lecture, we will review the coding for the Logistic Regression classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
Classification in MLlib Code Review Part 2.2: Train & Test Models [1 vs Rest]4:39
In this lecture, we will review the coding for the One vs. Rest classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
A little something to keep you going....
Classification in MLlib Code Review Part 2.3: Train & Test Models [Multilayer PC]7:42
In this lecture, we will review the coding for the Multi-Layer Perceptron classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
Classification in MLlib Code Review Part 2.4: Train & Test Models [Naive Bayes]4:55
In this lecture, we will review the coding for the Naive Bayes classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
Classification in MLlib Code Review Part 2.5: Train & Test Models [Linear SVM]4:44
In this lecture, we will review the coding for the Linear SVM classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
Classification in MLlib Code Review Part 2.6: Train & Test Models [Decision Tree]7:14
In this lecture, we will review the coding for the Decision Tree classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
Classification in MLlib Code Review Part 2.7: Train & Test Models [Random Forest]3:17
In this lecture, we will review the coding for the Random Forest classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
Classification in MLlib Code Review Part 2.8: Train & Test Models [GBT]3:46
In this lecture, we will review the coding for the Gradient Boosted Tree classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
A little something to keep you going....
BONUS: Add loop functions to your training and evaluation script15:16
This lecture is a continuation of the previous several lectures which will outline how students can add functions to help streamline their model training and evaluation process. The corresponding Jupyter Notebook is attached and the corresponding dataset can be found attached to the "Classification in MLlib Code Along Part 1: Data Formatting and Transformations" lecture.
BONUS: Leverage MLflow to better track and manage your results11:47
This lecture will walk students through the same coding content as the original Classification in MLlib Jupyter Notebook, with the addition of MLflow, which provides an interactive User Interface (UI) to view and sort your model results along with your hyperparameter tuning trials. MLflow is open source and can be installed in just a few seconds from your local terminal (pip install mlflow), so it is super practical. I've attached the Jupyter Notebook that corresponds to this lecture as well as some additional MLflow resources for students to learn more if they are so inclined.
Classification Project2:08
This lecture will provide students with an introduction to the classification project where both the Jupyter Notebook and dataset for the project have been provided as attachments.
Remember to be creative with this project!
Classification Project Solution21:22
In this lecture, I will review my solution to the classification project. The corresponding Jupyter Notebook and dataset have also been attached for convenience.

Introduction to Natural Language Processing9:29
In this lecture, students will learn the introductory concepts of natural language processing and begin to understand how it is used in the real world.
Introduction to Natural Language Processing Quiz
Natural Language Processing Concept Review [Part 1: Feature Transformers]9:42
In this lecture, students will learn about the concept of feature transformers as it relates to natural language processing in PySpark. Additionally, the link to the PySpark documentation page on Feature Transformers has also been provided as a resource to this lecture.
Natural Language Processing Concept Review [Part 2: Feature Extractors]9:15
In this lecture, students will learn about the concept of feature extraction as it relates to natural language processing in PySpark. An example of each feature extractor that PySpark provides will be given so that students gain a deeper understanding of how each method works. The link to the PySpark documentation that outlines the feature extractors available in PySpark has also been provided below as a resource. Students are encouraged to review this for more details.
Natural Language Processing Feature Extractors Quiz
A little something to keep you going....
Natural Language Processing Code Along Activity Part 1: Data Prep34:43
In this lecture, students will learn how to apply natural language processing techniques in PySpark. We will cover common regex techniques for cleaning data as well as how to tokenize our data using PySpark's built-in libraries in the first part of the lecture. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
Natural Language Processing Code Along Activity Part 2: Vectorize, Train & Eval24:34
In part two of this lecture, students will learn how to vectorize this dataframe in three different ways (Hashing TF, TF-IDF and Word2Vec) and then pass them into the classification train and evaluate function that was covered in the previous section. We will continue to use the same Jupyter Notebook and dataset that was attached to the previous lecture.
Natural Language Processing Project1:38
In this lecture, I will provide an introduction to the NLP project for this course. The corresponding Jupyter Notebook and dataset have also been provided as attachments.
Natural Language Processing Project Solution14:49
In this lecture, I will be providing my solution to the NLP project for this course. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
A little something to keep you going....
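
As a rough illustration of the TF-IDF vectorization mentioned in this section, here is a minimal plain Python sketch. It is not the course notebook and it does not call PySpark. The function name tf_idf is a hypothetical helper, the input is assumed to be already tokenized, and the smoothed IDF formula log((N + 1) / (df + 1)) is the one described in the Spark MLlib guide.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Term frequency is the raw count of a term in a document. Inverse document
    frequency uses the smoothed form log((N + 1) / (df + 1)), where N is the
    number of documents and df is how many documents contain the term.
    """
    n_docs = len(docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((n_docs + 1) / (df + 1)) for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in docs
    ]
```

A term that appears in every document gets a weight of zero, while terms that appear in only a few documents are weighted up. This is why TF-IDF features tend to highlight the words that distinguish one document from another.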

Regression in MLlib Concept Review8:50
This lecture will be a brief refresher of concepts and terms within the regression algorithms that PySpark offers. I’ll be providing several examples to make sure you are well equipped for the code along lectures that follow.
Regression in PySpark's MLlib
Regression in MLlib Code Review Introduction2:34
This lecture will provide students with a brief refresher of the overall objective of regression analysis and which algorithms PySpark has available for regression. The corresponding Jupyter Notebook and dataset that will be used for this lecture and the rest of the Regression Code Review lectures that follow are attached for convenience.
Regression in MLlib Code Review Part 1: Data Prep12:54
In this lecture, students will learn how to prep their dataframe to be able to run it through the next several regression algorithms. We cover how to vectorize the dataframe and how to correct for skewness and multicollinearity. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
Regression in MLlib Code Review Part 2.0: Linear Regression9:55
In this lecture, students will learn how to implement the Linear Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
A little something to keep you going....
Regression in MLlib Code Review Part 2.1: Decision Tree Regression4:46
In this lecture, students will learn how to implement the Decision Tree Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
Regression in MLlib Code Review Part 2.2: Random Forest Regression3:11
In this lecture, students will learn how to implement the Random Forest Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
Regression in MLlib Code Review Part 2.3: Gradient Boosted Tree Regression5:53
In this lecture, students will learn how to implement the Gradient Boosted Tree Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
A little something to keep you going....
BONUS: Add loop functions to your regression training and evaluation script12:49
In this lecture, students will learn how to add loop functions to the regression code that was previously reviewed. The corresponding Jupyter Notebook and dataset (same as in the last several lectures) have been provided as attachments.
Regression Project2:13
In this lecture, students will be given an introduction to the Regression Project, centered around making recommendations to a cement production company. The Jupyter Notebook and corresponding dataset have also been provided as attachments to this lecture.
And finally... have FUN with this project and LOVE what you do!
Regression Project Solution Code Along Activity52:55
In this lecture, I will be going over my solution to the Regression Project as a code along activity. I have also provided the Jupyter notebook that goes along with this lecture as an attachment.

Intro to Clustering in MLlib Concept Review11:59
In this lecture, students will learn about the concept of clustering within the PySpark domain. We will review the various algorithms that PySpark offers and common applications of the technique.
Clustering Concept Review Quiz
K-Means & Bisecting K-Means in MLlib Code Along Activity39:03
In this lecture, students will learn how to implement the K-means and Bisecting K-means algorithms in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
Latent Dirichlet Allocation in MLlib Code Along Activity27:25
In this lecture, students will learn how to apply the Latent Dirichlet Allocation algorithm in PySpark. The corresponding Jupyter Notebook and dataset have also been attached for convenience.
A little something to keep you going....
Gaussian Mixture Modeling in MLlib Code Along Activity32:50
In this lecture, students will learn how to implement the Gaussian Mixture Modeling technique in PySpark. The Jupyter Notebook and dataset for this code along activity have both been provided as attachments to this lecture for reference. An additional link to learn even more about GMM has also been provided as an attachment to the lecture.
Clustering Project Introduction5:07
In this lecture, students will be provided with an introduction to the Clustering Project as well as the data and Jupyter Notebook needed to complete the project.
Clustering Project Solution Code Review13:57
In this lecture, students will be provided with a detailed review of the solution I put together for the Clustering Project. The Jupyter Notebook with my detailed notes has also been provided as a downloadable resource attached to this lecture.
A little something to keep you going....

Frequent Pattern Mining in MLlib Concept Review9:25
In this lecture, students will learn about the concept of frequent pattern mining, as well as some applications of the technique and the various algorithms that PySpark offers to conduct the analysis.
Frequent Pattern Mining Concept Quiz
Frequent Pattern Mining Code Along Activity [Part 1: FP-Growth]29:38
In this lecture, students will learn how to implement in PySpark the FP-Growth algorithm introduced in the concept review lecture. This lecture will be a code along activity. The Jupyter Notebook as well as the dataset for this lecture have been provided as attachments to this lecture.
Frequent Pattern Mining Code Along Activity [Part 2: PrefixSpan]13:45
In this lecture, students will learn how to implement in PySpark the PrefixSpan algorithm introduced in the concept review lecture. This lecture is a continuation of the previous code along activity lecture and will be using the same Jupyter Notebook and dataset that were provided as attachments to the previous lecture.
A little something to keep you going....
Frequent Pattern Mining Project Introduction2:29
In this lecture, students will be provided with an introduction to the frequent pattern mining project. The Jupyter Notebook and dataset have both been provided in the zipped folder attached.
Frequent Pattern Mining Project Solution Code Review10:10
In this lecture, I will review my coding solution to the Frequent Pattern Mining Project. The Jupyter Notebook has been provided as an attachment to this lecture, and the data, if students need it, is attached to the previous lecture.

Intro to Spark Structured Streaming12:46
Explore Spark Structured Streaming to analyze real-time data with Dataframes and SQL, streaming from file, Kafka, socket, or rate sources with micro-batches and writing output to external storage.
Intro to Streaming Data Using Sockets19:10
Explore streaming data with sockets in Python by building a server listener and a client that exchange data over a host and port, including socket binding.
Twitter Structure Streaming Project Setup and Intro5:52
Set up the Twitter structured stream project by creating a Twitter developer account, generating API keys and access tokens, and preparing to run live hashtag counts in a Jupyter notebook.
Twitter Project Tweet Listener Setup25:46
Learn to set up a Python Twitter listener that authenticates with the Twitter API, streams tweets as JSON, and forwards them via a socket for aggregation.
Twitter Project Structured Stream Setup and Implementation21:55
Code for this lecture was attached to the "Twitter Structure Streaming Project Setup and Intro" lecture.
Additional Spark Structured Streaming Resources0:29
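
For a feel of the socket lecture in this section, here is a minimal sketch of a listener and a client exchanging lines of text over a host and port, using only Python's standard library. It is not the course notebook and it does not touch Twitter or Spark. The names serve_lines, read_all_lines and demo are hypothetical helpers, and the loopback host is an assumption made so the example runs on a single machine.

```python
import socket
import threading

HOST = "127.0.0.1"  # assumption: loopback only, for a local demo


def serve_lines(server_sock, lines):
    """Accept one client and send each line, newline-terminated."""
    conn, _addr = server_sock.accept()
    with conn:
        for line in lines:
            conn.sendall((line + "\n").encode("utf-8"))


def read_all_lines(host, port):
    """Connect to the listener and read until the server closes the connection."""
    chunks = []
    with socket.create_connection((host, port), timeout=5) as client:
        while True:
            data = client.recv(1024)
            if not data:  # empty bytes means the server closed the socket
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").splitlines()


def demo(lines):
    """Bind a listener, serve the lines from a background thread, and read them back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_sock:
        # Binding to port 0 asks the operating system for any free port.
        server_sock.bind((HOST, 0))
        server_sock.listen(1)
        port = server_sock.getsockname()[1]
        worker = threading.Thread(target=serve_lines, args=(server_sock, lines))
        worker.start()
        received = read_all_lines(HOST, port)
        worker.join()
    return received
```

Calling demo(["#spark", "#python"]) sends two lines through the socket and returns them on the client side. In a streaming project, the same pattern lets one process forward live records to a host and port that a stream reader is listening on.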

Requirements

Familiarity with Python is helpful but not required
Some background in data science is helpful but not required
A hunger to LEARN

Description

This course is for data scientists (or aspiring data scientists) who want to get PRACTICAL training in PySpark (Python for Apache Spark) using REAL WORLD datasets and APPLICABLE coding knowledge that you’ll use every day as a data scientist! By enrolling in this course, you’ll gain access to over 100 lectures, hundreds of example problems and quizzes and over 100,000 lines of code!

I’m going to provide the essentials of what you need to know to be an expert in PySpark by the end of this course, which I’ve designed based on my EXTENSIVE experience consulting as a data scientist for clients like the IRS, the US Department of Labor and United States Veterans Affairs.

I’ve structured the lectures and coding exercises for real world application, so you can understand how PySpark is actually used on the job. We are also going to dive into my custom functions that I wrote MYSELF to get you up and running in the MLlib API fast and make getting started building machine learning models a breeze! We will also touch on MLflow which will help us manage and track our model training and evaluation process in a custom user interface that will make you even more competitive on the job market!

Each section will have a concept review lecture as well as code along activities and structured problem sets for you to work through to help you put what you have learned into action, as well as the solutions to each problem in case you get stuck. Additionally, real world consulting projects have been provided in every section with AUTHENTIC datasets to help you think through how to apply each of the concepts we have covered.

Lastly, I’ve written up some condensed review notebooks and handouts of all the course content to make it super easy for you to reference later on. This will be super helpful once you land your first job programming in PySpark!

I can’t wait to see you in the lectures! And I really hope you enjoy the course! I’ll see you in the first lecture!

Who this course is for:

Data Scientists interested in learning PySpark
PySpark developers looking to strengthen their coding skills
Python developers who need to work with big data
Data Scientists who want to learn to work with big data

Course content

Course Introduction • 9 lectures • 48min

Dataframe Essentials: Read, Write, Validate & Explore • 14 lectures • 1hr 49min

Dataframe Essentials: Clean, Manipulate, Join, Aggregate • 17 lectures • 2hr 58min

Introduction to Spark MLlib • 5 lectures • 41min

Classification in MLlib • 19 lectures • 2hr 34min

Natural Language Processing in MLlib • 9 lectures • 1hr 44min

Regression in MLlib • 13 lectures • 1hr 56min

Clustering in PySpark • 8 lectures • 2hr 10min

Frequent Pattern Mining in MLlib • 6 lectures • 1hr 5min

Spark Structured Streaming • 6 lectures • 1hr 26min
