
This lecture will provide a general introduction to the course and an overview of what will be covered.
Navigate the PySpark essentials course in the DME environment with arrows, play/pause, bookmarks, and notes, and access transcripts, resources, datasets, and code in notebooks or .py versions.
This lecture contains links to several PySpark installation resources. The most basic installation requires installing Anaconda and PySpark locally; however, students also have the option to connect to a cluster if they want to get experience with a certain platform. Note that the material provided in this course will remain the same regardless of whether you choose to connect to a cluster or not. The only benefit here would be computational speed.
In this lecture, students will learn about what PySpark is and why to use it. An overview of the Spark ecosystem will also be provided as a reference.
This lecture will walk students coming from a Python background through some of the changes they will experience during their transition to PySpark.
This lecture will walk students through the syntactical differences between Python and PySpark through a code along activity in two side-by-side Jupyter Notebooks. Both the Jupyter Notebooks and the dataset for this lecture have also been provided as attachments for those who would like to follow along.
In this lecture, we will be going over what Dataframes are, how they are used and how we can manage them. We will also spend some time introducing some functionality that is unique to PySpark when it comes to handling big data, and then provide an overview of what will be covered in the next several code along activity lectures, where we will be getting into the more hands-on material.
This lecture will be a code along activity where students will learn how to create a PySpark instance as well as read, write and validate dataframes using a Jupyter Notebook, which is attached to this lecture along with several datasets that we will be working with.
This lecture will provide students with a brief intro to the first PySpark coding homework assignment. The Jupyter Notebook guide for the homework assignment and the corresponding dataset have also been included as attachment resources to this lecture. I've also included a link that lists all the PySpark data types in case you need them.
This lecture will walk students through my solution to the "Read, Write and Validate Data Homework Assignment" assigned in the previous lecture. The Jupyter Notebook containing my solutions and the dataset that corresponds to this lecture have both been provided as attachments for reference.
This lecture will walk students through the essentials of searching and filtering dataframes in PySpark through a code along activity. We will also introduce PySpark's SQL functions library here. The Jupyter Notebook and corresponding dataset have been provided as attachments in case students want to follow along on their own. Additionally, two external resources have been provided for students. The first is a "List of PySpark SQL Functions" for students to reference later on and to check out additional functions that were not covered in the lecture (there are a lot!). The second is a link to W3Schools, which is a SQL tutorial website (not specific to PySpark) that students can use if they want to learn more about SQL.
This lecture will provide students with a brief introduction to the Search and Filter Dataframes homework assignment that corresponds to the previous lecture. The Jupyter Notebook for this exercise and the necessary dataset (same FIFA dataset as in the previous lecture) have also been attached to this lecture.
In this lecture, I will review my solutions to the Search and Filter Dataframes Homework assignment. The corresponding Jupyter Notebook for this lecture is also attached so students can follow along. Additionally, I have provided the dataset (also attached to the previous two lectures) that goes along with this lecture just in case students need it.
This lecture will walk students through the various SQL options available in PySpark and how to use them. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture in case students want to follow along. I also added the same SQL resource that I attached to the previous lecture, since it will be helpful here too.
This lecture will provide students with a brief introduction to the SQL Options in Spark/PySpark homework assignment. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture.
This lecture will walk students through my solutions to the SQL Options in Spark/PySpark Homework assignment. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture.
This lecture will provide students with the foundational tools necessary to manipulate dataframes in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture in case students want to follow along.
This lecture will provide students with an overview of the Manipulating Dataframes HW assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments to this lecture for convenience.
This lecture will provide students with my solution to the Manipulating Dataframes HW assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments to this lecture for convenience.
In this lecture, students will learn how to aggregate dataframes in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
In this lecture, students will be provided with an introduction to the Aggregating Data in Dataframes homework assignment. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
In this lecture, I will walk students through my solutions to the "Aggregating Dataframes" homework assignment. I have also attached the corresponding Jupyter Notebook and dataset for convenience in case students want to follow along.
In this lecture, students will learn how to join and append dataframes in PySpark. The corresponding Jupyter Notebook and datasets have also been provided as attachments to the lecture for convenience.
In this lecture, students will be provided with a brief introduction to the Joining and Appending Dataframes Homework assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments for convenience.
In this lecture, I will review my solutions to the Joining and Appending Dataframes Homework assignment. The corresponding Jupyter Notebook and datasets have also been provided as attachments to this lecture for convenience.
In this lecture, students will learn how to handle missing data in PySpark. I will cover PySpark's built-in functions as well as provide some useful functions for calculating missing data statistics and filling in missing values. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
In this lecture, students will be provided with an overview of the homework assignment for handling missing data in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
In this lecture, I will be reviewing my solutions to the homework assignment for handling missing data in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments to this lecture for convenience.
This lecture will provide students with an overview of how to best utilize the attached Dataframe Essentials Master Review Jupyter Notebook. There is no corresponding dataset, as the notebook is not meant to be run but rather serve as a coding reference guide.
This lecture is intended for students who are new to machine learning and need a quick primer before digging into the core concepts of machine learning in the subsequent lectures. We will go over the differences between supervised and unsupervised learning and several examples of each to provide students with a solid foundation moving forward.
In this lecture, students will be provided with a basic overview of what PySpark offers, the resources that are available to us from Spark, and a few introductory concepts that are essential to understand before we dig into the machine learning portion of this course. Links to the Spark MLlib Guide and Spark Documentation pages have also been provided for reference, attached to this lecture.
In this lecture, we will cover the required data format that Spark confines us to when using machine learning algorithms, as well as the concepts of cross validation, training and validation splitting methods, model selection and hyperparameter tuning, and Spark pipelines. We will also briefly introduce the custom function that I have created exclusively for this course, which is going to make training and validating models a breeze!
This lecture will be a brief refresher of concepts and terms within the classification algorithms that PySpark offers. First off, we will be going over what classification is and its various types. Then we will outline the pre-processing steps you need to take on your dependent variable in order for the algorithms to be able to process your data. Finally, we will dive into each one of the algorithms PySpark has for classification.
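As a rough illustration of the cross validation idea mentioned above, here is a minimal pure-Python sketch of how a dataset's row indices can be divided into k folds, with each fold taking a turn as the validation set. This is not Spark's cross validator and not the course's custom function; the name `k_fold_indices` and its parameters are hypothetical and chosen only for this example.

```python
import random


def k_fold_indices(n_rows, k, seed=42):
    """Yield (train_idx, validation_idx) pairs for k-fold cross validation.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """
    indices = list(range(n_rows))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        # Training rows are all rows that are not in the validation fold.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=10, k=3))
for train, validation in splits:
    print(len(train), "training rows,", len(validation), "validation rows")
```

Each row lands in exactly one validation fold, so every row is used for validation once and for training in the remaining rounds.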
In this lecture, students will learn how to format their data in order for Spark's MLlib to be able to process it as well as how to tackle common data transformations. The corresponding Jupyter Notebook and dataset have been provided as attachments for convenience.
In this lecture, students will learn about all the available algorithms for classification tasks in Spark's MLlib. Students will also learn how to use Spark's cross validator and parameter building functions, which will be paramount for your machine learning tasks on the job. We will be using the same Jupyter Notebook and dataset that were attached to the previous lecture.
In this lecture, we will review the coding for the Logistic Regression classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the One vs. Rest classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the Multi-Layer Perceptron classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the Naive Bayes classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the Linear SVM classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the Decision Tree classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the Random Forest classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
In this lecture, we will review the coding for the Gradient Boosted Tree classifier algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Classification in MLlib Code Review Part 2.0..." lecture.
This lecture is a continuation of the previous several lectures which will outline how students can add functions to help streamline their model training and evaluation process. The corresponding Jupyter Notebook is attached and the corresponding dataset can be found attached to the "Classification in MLlib Code Along Part 1: Data Formatting and Transformations" lecture.
This lecture will walk students through the same coding content as the original Classification in MLlib Jupyter Notebook, with the addition of MLflow, which provides an interactive User Interface (UI) to view and sort your model results along with your hyperparameter tuning trials. MLflow is open source and can be installed in just a few seconds using your local terminal (pip install mlflow), so it is super practical. I've attached the Jupyter Notebook that corresponds to this lecture as well as some additional MLflow resources for students to learn more if they are so inclined.
This lecture will provide students with an introduction to the classification project where both the Jupyter Notebook and dataset for the project have been provided as attachments.
In this lecture, I will review my solution to the classification project. The corresponding Jupyter Notebook and dataset have also been attached for convenience.
In this lecture, students will learn the introductory concepts of natural language processing and begin to understand how it is used in the real world.
In this lecture, students will learn about the concept of feature transformers as it relates to natural language processing in PySpark. Additionally, the link to the PySpark documentation page on Feature Transformers has also been provided as a resource to this lecture.
In this lecture, students will learn about the concept of feature extraction as it relates to natural language processing in PySpark. Examples of each feature extractor available in PySpark will be given to provide students with a deeper understanding of how each method works. The link to the PySpark documentation that outlines the feature extractors available in PySpark has also been provided below as a resource. Students are encouraged to review this for more details.
In this lecture, students will learn how to apply natural language processing techniques in PySpark. We will cover common regex techniques for cleaning data as well as how to tokenize our data using PySpark's built-in libraries in the first part of the lecture. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
In part two of this lecture, students will learn how to vectorize this dataframe in three different ways (Hashing TF, TF-IDF and Word2Vec) and then pass them into the classification train and evaluate function that was covered in the previous section. We will continue to use the same Jupyter Notebook and dataset that were attached to the previous lecture.
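To give a feel for the first of those vectorization methods, here is a minimal pure-Python sketch of the hashing trick behind a hashed term-frequency vector. It is only a conceptual stand-in: the function name `hashing_tf`, the `num_features` value and the use of CRC32 as the hash are assumptions made for this example, and PySpark's own HashingTF transformer is implemented differently.

```python
import zlib


def hashing_tf(tokens, num_features=16):
    """Map a list of tokens to a fixed-length term-frequency vector.

    Hypothetical helper for illustration; CRC32 is used here only because
    it is deterministic across Python runs.
    """
    vector = [0] * num_features
    for token in tokens:
        # Hash the token and fold the hash into one of num_features buckets.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        # Count how many tokens landed in that bucket.
        vector[bucket] += 1
    return vector


doc = "spark makes big data simple and spark is fast".split()
vec = hashing_tf(doc)
print(vec)
```

Because the vector length is fixed up front, no vocabulary has to be built first. The trade-off is that two different words can collide in the same bucket, which becomes more likely as `num_features` gets smaller.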
In this lecture, I will provide an introduction to the NLP project for this course. The corresponding Jupyter Notebook and dataset have also been provided as attachments.
In this lecture, I will be providing my solution to the NLP project for this course. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
This lecture will be a brief refresher of concepts and terms within the regression algorithms that PySpark offers. I’ll be providing several examples to make sure you are well equipped for the code along lectures that follow.
This lecture will provide students with a brief refresher of the overall objective of regression analysis and which algorithms PySpark has available for regression. The corresponding Jupyter Notebook and dataset that will be used for this lecture and the rest of the Regression Code Review lectures that follow are attached for convenience.
In this lecture, students will learn how to prep their dataframe to be able to run it through the next several regression algorithms. We cover how to vectorize the dataframe and how to correct for skewness and multicollinearity. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
In this lecture, students will learn how to implement the Linear Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
In this lecture, students will learn how to implement the Decision Tree Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
In this lecture, students will learn how to implement the Random Forest Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
In this lecture, students will learn how to implement the Gradient Boosted Tree Regression algorithm in PySpark. The corresponding Jupyter Notebook and dataset are attached to the "Regression in MLlib Code Review Introduction" lecture.
In this lecture, students will learn how to add loop functions to the regression code that was previously reviewed. The corresponding Jupyter Notebook and dataset (same as in the last several lectures) have been provided as attachments.
In this lecture, students will be given an introduction to the Regression Project, centered around making recommendations to a cement production company. The Jupyter Notebook and corresponding dataset have also been provided as attachments to this lecture.
In this lecture, I will be going over my solution to the Regression Project as a code along activity. I have also provided the Jupyter notebook that goes along with this lecture as an attachment.
In this lecture, students will learn about the concept of clustering within the PySpark domain. We will review the various algorithms that PySpark offers and common applications of the technique.
In this lecture, students will learn how to implement the K-means and Bisecting K-means algorithms in PySpark. The corresponding Jupyter Notebook and dataset have also been provided as attachments for convenience.
In this lecture, students will learn how to apply the Latent Dirichlet Allocation algorithm in PySpark. The corresponding Jupyter Notebook and dataset have also been attached for convenience.
In this lecture, students will learn how to implement the Gaussian Mixture Modeling technique in PySpark. The Jupyter Notebook and dataset for this code along activity have both been provided as attachments to this lecture as a reference. An additional link to learn even more about GMM has also been provided as an attachment to the lecture.
In this lecture, students will be provided with an introduction to the Clustering Project as well as the data and Jupyter Notebook needed to complete the project.
In this lecture, students will be provided with a detailed review of the solution I put together for the Clustering Project. The Jupyter Notebook with my detailed notes has also been provided as a downloadable resource attached to this lecture.
In this lecture, students will learn about the concept of frequent pattern mining, as well as some applications of the technique and the various algorithms that PySpark offers to conduct the analysis.
In this lecture, students will learn how to implement in PySpark the FPGrowth algorithm introduced in the concept review lecture. This lecture will be a code along activity. The Jupyter Notebook as well as the dataset for this lecture have been provided as attachments to this lecture.
In this lecture, students will learn how to implement in PySpark the PrefixSpan algorithm introduced in the concept review lecture. This lecture is a continuation of the previous code along activity lecture, and will be using the same Jupyter Notebook and dataset that were provided as attachments to the previous lecture.
In this lecture, students will be provided with an introduction to the frequent pattern mining project. The Jupyter Notebook and dataset have both been provided in the zipped folder attached.
In this lecture, I will review my coding solution to the Frequent Pattern Mining Project. The Jupyter Notebook has been provided as an attachment to this lecture, and the data, if students need it, is attached to the previous lecture.
Explore Spark Structured Streaming to analyze real-time data with DataFrames and SQL, streaming from file, Kafka, socket, or rate sources with micro-batches and outputs to external storage.
Explore streaming data with sockets in Python by building a server listener and a client that exchange data over a host and port, including socket binding.
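For readers who want a preview of that server and client pattern, here is a minimal sketch that uses only Python's standard `socket` and `threading` modules. The host, the messages and the helper name `serve_once` are assumptions made for this example and are not taken from the course notebooks. Binding to port 0 asks the operating system to pick any free port.

```python
import socket
import threading

HOST = "127.0.0.1"  # assumption: loopback address, for local testing only


def serve_once(server_sock, messages):
    """Accept a single client and send each message on its own line."""
    conn, _addr = server_sock.accept()
    with conn:
        for msg in messages:
            conn.sendall((msg + "\n").encode("utf-8"))


# Server side: bind the socket to a host and port, then listen.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind((HOST, 0))  # port 0 lets the OS choose a free port
server.listen(1)
port = server.getsockname()[1]

# Run the listener in a background thread so the client can connect to it.
thread = threading.Thread(
    target=serve_once, args=(server, ["hello", "streaming", "world"])
)
thread.start()

# Client side: connect to the same host and port and read until the
# server closes the connection.
received = b""
with socket.create_connection((HOST, port)) as client:
    while True:
        chunk = client.recv(1024)
        if not chunk:
            break
        received += chunk

thread.join()
server.close()

lines = received.decode("utf-8").splitlines()
print(lines)
```

In a streaming setup, the client role would be played by whatever consumes the socket, while the server keeps pushing new lines as data arrives.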
Set up the Twitter structured stream project by creating a Twitter developer account, generating API keys and access tokens, and preparing to run live hashtag counts in a Jupyter notebook.
Learn to set up a Python Twitter listener that authenticates with the Twitter API, streams tweets as JSON, and forwards them via a socket for aggregation.
Code for this lecture was attached to the "Twitter Structure Streaming Project Setup and Intro" lecture.
This course is for data scientists (or aspiring data scientists) who want to get PRACTICAL training in PySpark (Python for Apache Spark) using REAL WORLD datasets and APPLICABLE coding knowledge that you’ll use every day as a data scientist! By enrolling in this course, you’ll gain access to over 100 lectures, hundreds of example problems and quizzes and over 100,000 lines of code!
I’m going to provide the essentials of what you need to know to be an expert in PySpark by the end of this course, which I’ve designed based on my EXTENSIVE experience consulting as a data scientist for clients like the IRS, the US Department of Labor and United States Veterans Affairs.
I’ve structured the lectures and coding exercises for real world application, so you can understand how PySpark is actually used on the job. We are also going to dive into my custom functions that I wrote MYSELF to get you up and running in the MLlib API fast and make getting started building machine learning models a breeze! We will also touch on MLflow which will help us manage and track our model training and evaluation process in a custom user interface that will make you even more competitive on the job market!
Each section will have a concept review lecture as well as code along activities and structured problem sets for you to work through to help you put what you have learned into action, along with the solutions to each problem in case you get stuck. Additionally, real world consulting projects have been provided in every section with AUTHENTIC datasets to help you think through how to apply each of the concepts we have covered.
Lastly, I’ve written up some condensed review notebooks and handouts of all the course content to make it super easy for you to reference later on. This will be super helpful once you land your first job programming in PySpark!
I can’t wait to see you in the lectures! And I really hope you enjoy the course! I’ll see you in the first lecture!