Udemy
PySpark Essentials for Data Scientists (Big Data + Python)
Rating: 4.4 out of 5 (833 ratings)
5,744 students

Learn how to wrangle Big Data for Machine Learning using Python in PySpark, taught by an industry expert!
Created by Layla AI
Last updated 5/2022
English

What you'll learn

  • Use Python with Big Data on a distributed framework (Apache Spark)
  • Work with REAL datasets on realistic consulting projects
  • How to stream LIVE data from Twitter using Spark Structured Streaming
  • Learn how to create a "Pandora Like" app that classifies songs into genres using machine learning
  • Flag suspicious job postings using Natural Language Processing
  • Use machine learning to predict optimal cement strength and the factors that affect it
  • Classify Christmas cooking recipes using Topic Modeling (LDA)
  • Customer Segmentation using Gaussian Mixture Modeling (Clustering)
  • Use cluster analysis to develop a strategy designed to increase college graduation rates for underprivileged populations
  • How to use the k-means clustering algorithm to define a marketing outreach strategy
  • Integrate a UI to monitor your model training and development process with MLflow
  • Theory and application of cutting edge data science algorithms
  • Manipulate, Join and Aggregate DataFrames in Spark with Python
  • Learn how to apply Spark's machine learning techniques on distributed DataFrames
  • Cross Validation & Hyperparameter Tuning
  • Frequent Pattern Mining Techniques
  • Classification & Regression Techniques
  • Data Wrangling for Natural Language Processing
  • How to write SQL Queries in Spark

Course content

11 sections • 109 lectures • 17h 16m total length
  • Frequently Asked Questions (0:18)
  • Course Introduction (8:52)

    This lecture will provide a general introduction to the course, and what will be covered.

  • Course Orientation (5:13)

    Navigate the PySpark Essentials course in the Udemy environment with arrows, play/pause, bookmarks, and notes, and access transcripts, resources, datasets, and code as notebooks or .py versions.

  • Course Materials Bulk Download (0:38)
  • Resources for Setting up PySpark (0:07)

    This lecture contains links to several PySpark installation resources. The most basic installation requires installing Anaconda and PySpark locally. However, students also have the option to connect to a cluster if they want to get experience with a certain platform. Note that the material provided in this course remains the same regardless of whether you choose to connect to a cluster. The only benefit of a cluster would be computational speed.

  • Python Cheatsheet Resources (0:12)
  • Introduction to PySpark (16:08)

    In this lecture, students will learn about what PySpark is and why to use it. An overview of the Spark ecosystem will also be provided as a reference.

  • Transitioning from Python to PySpark Concept Review (5:47)

    This lecture will walk students coming from a Python background through some of the changes they will experience during their transition to PySpark.

  • Transitioning from Python to PySpark Code Along Activity (10:45)

    This lecture will walk students through the syntactic differences between Python and PySpark through a code-along activity in two side-by-side Jupyter Notebooks. Both the Jupyter Notebooks and the dataset for this lecture have also been provided as an attachment for those who would like to follow along.
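
    As a rough illustration of the kind of syntactic difference this lecture covers, here is a minimal sketch that computes the same per-group average in plain Python and in PySpark. It is not taken from the course notebooks: the sample rows, the column names (genre, duration), and both function names are hypothetical. The PySpark version imports its dependencies inside the function, so the plain Python half still runs on a machine without PySpark or Java installed.

```python
from collections import defaultdict

# Hypothetical sample data (not from the course materials).
ROWS = [
    ("rock", 210.0),
    ("rock", 190.0),
    ("jazz", 300.0),
]


def average_by_genre_python(rows):
    """Plain Python: loop over the rows eagerly and average duration per genre."""
    totals = defaultdict(lambda: [0.0, 0])  # genre -> [running sum, count]
    for genre, duration in rows:
        totals[genre][0] += duration
        totals[genre][1] += 1
    return {genre: total / count for genre, (total, count) in totals.items()}


def average_by_genre_pyspark(rows):
    """PySpark: express the same aggregation on a distributed DataFrame.

    Requires pyspark and a Java runtime; both are imported lazily here so
    that the plain Python function above works without them.
    """
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")          # run locally on all available cores
        .appName("transition-sketch")  # hypothetical application name
        .getOrCreate()
    )
    df = spark.createDataFrame(rows, ["genre", "duration"])
    # groupBy/agg only describe the computation; nothing runs until collect().
    result = df.groupBy("genre").agg(F.avg("duration").alias("avg_duration"))
    averages = {row["genre"]: row["avg_duration"] for row in result.collect()}
    spark.stop()
    return averages
```

    Calling average_by_genre_python(ROWS) returns {'rock': 200.0, 'jazz': 300.0}. The PySpark function is meant to produce the same mapping, with the difference that the work is planned lazily and can be spread across a cluster.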

Requirements

  • Familiarity with Python is helpful but not required
  • Some background in data science is helpful but not required
  • A hunger to LEARN

Description

This course is for data scientists (or aspiring data scientists) who want to get PRACTICAL training in PySpark (Python for Apache Spark) using REAL WORLD datasets and APPLICABLE coding knowledge that you’ll use every day as a data scientist! By enrolling in this course, you’ll gain access to over 100 lectures, hundreds of example problems and quizzes and over 100,000 lines of code!

I’m going to provide the essentials you need to know to become an expert in PySpark by the end of this course, which I’ve designed based on my EXTENSIVE experience consulting as a data scientist for clients like the IRS, the US Department of Labor and United States Veterans Affairs.

I’ve structured the lectures and coding exercises for real world application, so you can understand how PySpark is actually used on the job. We are also going to dive into my custom functions that I wrote MYSELF to get you up and running in the MLlib API fast and make getting started building machine learning models a breeze! We will also touch on MLflow which will help us manage and track our model training and evaluation process in a custom user interface that will make you even more competitive on the job market!

Each section will have a concept review lecture, code-along activities, and structured problem sets for you to work through to help you put what you have learned into action, as well as the solutions to each problem in case you get stuck. Additionally, real-world consulting projects have been provided in every section with AUTHENTIC datasets to help you think through how to apply each of the concepts we have covered.

Lastly, I’ve written up some condensed review notebooks and handouts of all the course content to make it super easy for you to reference later on. This will be super helpful once you land your first job programming in PySpark!

I can’t wait to see you in the lectures! And I really hope you enjoy the course! I’ll see you in the first lecture!

Who this course is for:

  • Data Scientists interested in learning PySpark
  • PySpark developers looking to strengthen their coding skills
  • Python developers who need to work with big data
  • Data Scientists who want to learn to work with big data