Data Science with Spark

Get started with Spark for data science using this unique video tutorial
4.3 (2 ratings)
70 students enrolled
Created by Packt Publishing
Last updated 2/2017
  • 3.5 hours on-demand video
  • 1 Supplemental Resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Understand Spark and its ecosystem of packages for data science
  • Obtain and clean data before processing it
  • Understand Spark's machine learning algorithms and use them to build a simple pipeline
  • Work with interactive visualization packages in Spark
  • Apply data mining techniques on the available data sets
  • Build a recommendation engine
Requirements
  • You should be familiar with the basics of Spark programming. A basic knowledge about statistics and computational mathematics is expected. Knowledge of Python and Scala would be good, but is not essential.

The real power and value of Apache Spark lie in its speed and in its platform for executing data science tasks. Spark's unique strength is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization, letting data scientists tackle the complexities that come with raw, unstructured data sets. Spark also makes the transition from working on a single machine to working on a cluster nearly seamless, which makes data science tasks far more agile.

In this course, you’ll get a hands-on technical resource that will enable you to become comfortable and confident working with Spark for data science. We won't just explore Spark’s data science libraries; we’ll dive deeper and expand on the topics.

This course starts by taking you through Spark and the steps needed to build machine learning applications. You will learn to collect, clean, and visualize data coming from Twitter with Spark Streaming. Then you will get acquainted with Spark's machine learning algorithms and different machine learning techniques. You will also learn to apply statistical analysis and mining operations to our tweet dataset. Finally, the course ends by giving you some ideas for further analysis, including graph processing. By the end of the course, you will be able to do your data science work in a visual way that is comprehensive and appealing to business and other stakeholders.

About The Author

Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer, a social network for Data Scientists. He is passionate about using software and mathematics to help companies get insights from data.

His typical day includes building efficient data processing with advanced machine learning algorithms, SQL, streaming, and graph analytics. He also focuses heavily on visualization and sharing results.

He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events.

Who is the target audience?
  • This course is for anyone who wants to work with Spark on large and complex data sets.
Curriculum For This Course
40 Lectures
Your Spark and Visualization Toolkit
4 Lectures 16:20

This video provides a synopsis of the course.

Preview 03:52

Big data solutions such as Spark are hard to set up, time-consuming to learn, and obscure for non-technical users. The aim of this video is to give you just enough information to find your way through this complex ecosystem.

Origins and Ecosystem for Big Data Scientists, the Scala, Python, and R flavors

The goal of this video is to show the steps to install Spark on your laptop, which lets you easily test and prototype on a limited dataset. If you want to process really big data, you will need a cluster in the cloud; we will show you how to achieve that as well.

Install Spark on Your Laptop with Docker, or Scale Fast in the Cloud

This video shows you how to set up Apache Zeppelin, a tool to execute and visualize your Spark analysis. 

Apache Zeppelin, a Web-Based Notebook for Spark with matplotlib and ggplot2
First Steps with Spark Visualization
5 Lectures 25:29

Manipulate data with the core RDD API. 

Preview 08:16
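The RDD API composes transformations such as flatMap, map, and reduceByKey. Since a Spark cluster isn't assumed here, this plain-Python sketch mirrors the classic RDD word-count chain; the sample lines are made up for illustration.

```python
from collections import defaultdict

# Plain-Python analogue of: rdd.flatMap(split).map(word -> (word, 1)).reduceByKey(+)
lines = ["spark makes data science agile", "data science with spark"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n
```

In Spark the same three steps run in parallel across partitions, with the reduce step shuffling pairs by key.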

Load files with the more advanced Spark data structures, the Dataframe, and Dataset, and perform basic operations on the loaded data. 

Using Dataframe, Dataset, and SQL – Natural and Easy!
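Spark SQL lets you register a DataFrame as a table and query it with ordinary SQL. As a rough stand-in (using the stdlib sqlite3 module instead of Spark, since a cluster setup is out of scope here), the query shape looks like this; the table and columns are invented for illustration.

```python
import sqlite3

# In-memory table standing in for a tweets DataFrame
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (user TEXT, lang TEXT, retweets INTEGER)")
conn.executemany(
    "INSERT INTO tweets VALUES (?, ?, ?)",
    [("alice", "en", 3), ("bob", "fr", 1), ("carol", "en", 7)],
)

# The same GROUP BY you would pass to spark.sql(...) over a registered view
rows = conn.execute(
    "SELECT lang, COUNT(*) AS n FROM tweets GROUP BY lang ORDER BY lang"
).fetchall()
```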

Manipulate the rows and columns of Dataframe. This is needed to feature engineer your datasets. 

Manipulating Rows and Columns

Often, the dataset you receive is not what you expect it to be; you will need to convert its encoding and format.

Dealing with File Format

Visualization is a powerful way to make sense of data.

Visualizing More – ggplot2, matplotlib, and Angular.js at the Rescue
The Spark Machine Learning Algorithms
5 Lectures 31:08

Data scientists use complex algorithms to analyze data with high-level libraries. Spark ships with a complete, out-of-the-box solution for distributed machine learning.

Preview 08:01

Dataframes are not the best-suited data structures for data science algorithms, which mainly deal with matrices and vectors. Spark provides optimized distributed data structures for exactly this purpose, which we haven't used yet.

Wrapping Up Basic Statistics and Linear Algebra

Before applying the algorithms, we need to cleanse the datasets and create new features. 

Cleansing Data and Engineering the Features

With big data, we may have an explosion of features, which can cause noise. We need to be able to work on only the relevant features.

Reducing the Dimensionality

Data analysis involves multiple steps that must be executed in sequence; Spark fulfils this need with pipelines.

Pipeline for a Life
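The idea behind an ML pipeline is that each stage is fit on the data and then transforms it for the next stage. This minimal plain-Python sketch captures that fit/transform chaining; the stage classes here are illustrative and not Spark's actual API.

```python
# Two stateless example stages with the fit/transform contract
class Lowercase:
    def fit(self, data):
        return self
    def transform(self, data):
        return [s.lower() for s in data]

class Tokenize:
    def fit(self, data):
        return self
    def transform(self, data):
        return [s.split() for s in data]

class Pipeline:
    """Runs each stage in order, feeding its output to the next."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return data

result = Pipeline([Lowercase(), Tokenize()]).fit_transform(
    ["Hello Spark", "Data Science"]
)
```

In Spark, stages such as tokenizers, feature hashers, and estimators plug into the same pattern, so a whole analysis can be fit and reapplied as one object.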
Collecting and Cleansing the Dirty Tweets
4 Lectures 19:09

We want to connect to the Twitter API and store collected tweets in a persistent way. 

Preview 05:37

This video introduces streaming information and shows how to visualize it.

Streaming Tweets on a Map

Build a clean and usable reference dataset from the raw collected data.

Cleansing and Building Your Reference Dataset

We want an easy and fast view of our dataset. 

Querying and Visualizing Tweets with SQL
Statistical Analysis on Tweets
4 Lectures 19:10

The aim of this video is to understand your dataset better. 

Preview 07:16

The aim of this video is to validate statistical relevance. 

Validating Statistical Relevance

Algorithms sometimes have difficulty with large numbers of features, which can be redundant. We will see what SVD and PCA can bring to reduce the number of features for later modeling.

Running SVD and PCA
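The core idea of PCA is to find the direction of greatest variance, i.e. the leading eigenvector of the covariance matrix. This stdlib-only sketch finds it with power iteration on a toy 2-D dataset (invented points spread roughly along y = x); Spark computes the same decomposition at scale.

```python
import math

# Toy 2-D points spread mostly along the y = x direction
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.9), (5.0, 5.1)]

# Center the data
mx = sum(x for x, _ in points) / len(points)
my = sum(y for _, y in points) / len(points)
centered = [(x - mx, y - my) for x, y in points]

# 2x2 covariance matrix entries
n = len(points)
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# Power iteration: repeated multiplication converges to the top eigenvector
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = math.hypot(w[0], w[1])
    v = (w[0] / norm, w[1] / norm)
# v now approximates the first principal component, close to (0.71, 0.71)
```

Projecting each point onto v (and onto further components) is what reduces the dimensionality while keeping most of the variance.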

In this video, you will learn to extend the basic statistics for your needs. 

Extending the Basic Statistics for Your Needs
Extracting Features from the Tweets
4 Lectures 19:20

Free text is difficult for algorithms. We need to convert words into numerical values. 

Preview 07:23
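One standard way to turn words into numbers is the hashing trick, which Spark exposes as HashingTF: each word hashes to a bucket and the vector counts words per bucket. This plain-Python sketch uses the built-in hash() and an illustrative bucket count of 8 (not Spark's default; Spark also uses MurmurHash3 so vectors are stable across runs).

```python
NUM_BUCKETS = 8  # illustrative size; real feature vectors are much wider

def hash_tf(words, num_buckets=NUM_BUCKETS):
    """Count words per hash bucket, producing a fixed-size numeric vector."""
    vec = [0] * num_buckets
    for w in words:
        vec[hash(w) % num_buckets] += 1
    return vec

vec = hash_tf("spark is fast and spark is fun".split())
```

The vector length is fixed regardless of vocabulary size, which is what makes the approach practical on large text corpora.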

Standard Spark functions do not handle the specifics of tweets. Let's see how we can do a better job.

Dealing with Stemming, Syntax, Idioms and Hashtags

We need to enrich our dataset with more features that we generate, a common step in data science analysis. We will add sentiment features based on free-text feature vectors.

Detecting Tweet Sentiment

Once again, we want to generate new features from the ones we have. Topics can be considered, although they are not easy to extract from tweets.

Identifying Topics with LDA
Mine Data and Share Results
4 Lectures 18:37

The aim of this video is to visualize word clouds of data held in a Spark DataFrame.

Preview 05:30

We want an aggregated view of user locations, so we will preprocess the users' locations with Geohash.

Locating Users and Displaying Heatmaps with GeoHash
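Geohash encodes a latitude/longitude pair as a short string whose prefix identifies a grid cell, so truncating hashes aggregates nearby users into the same bucket. The scheme is a published algorithm (interleaved binary subdivision, base-32 output); this is a stdlib sketch of the encoder.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a point by alternately halving the longitude and latitude ranges."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # even-numbered bits refine longitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    # Pack each group of 5 bits into one base-32 character
    chars = []
    for i in range(0, len(bits), 5):
        n = 0
        for b in bits[i:i + 5]:
            n = (n << 1) | b
        chars.append(BASE32[n])
    return "".join(chars)
```

Because shorter hashes are prefixes of longer ones, grouping users by the first few characters yields the heatmap cells directly.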

Your data analysis has to be an iterative and collaborative process. 

Collaborating on the Same Note with Peers

We want to get feedback as soon as possible from the end-users and from our business stakeholders. 

Create Visual Dashboards for Your Business Stakeholders
Classifying the Tweets
4 Lectures 22:08

The aim of the video is to review the available Spark classification algorithms and to prepare the needed datasets.

Preview 07:25

The aim of this video is to train a Classification Model. 

Training a Logistic Regression Model
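Logistic regression fits a sigmoid over a weighted sum of features by gradient descent on the log-loss. This stdlib sketch trains on an invented one-dimensional toy set (negative x is class 0, positive x is class 1); it shows what Spark's logistic regression fits at scale, not Spark's API.

```python
import math

# Toy labeled data: (feature, label)
data = [(-2.0, 0), (-1.5, 0), (-0.5, 0), (0.5, 1), (1.5, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0
lr = 0.5  # learning rate, chosen for this tiny example
for _ in range(500):
    for x, y in data:
        p = sigmoid(w * x + b)
        # gradient of the log-loss for one example
        w -= lr * (p - y) * x
        b -= lr * (p - y)

predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x, _ in data]
```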

The aim of this video is to apply the Model on the Test Dataset and evaluate its performance. 

Evaluating Your Classifier
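Evaluation boils down to counting how predicted labels compare with true labels on the test set. These are the kinds of numbers Spark's evaluators compute over a test DataFrame, sketched here on an invented set of (label, prediction) pairs.

```python
# (true label, predicted label) pairs from a hypothetical test set
pairs = [(1, 1), (1, 0), (0, 0), (0, 0), (1, 1), (0, 1)]

tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Precision and recall matter more than raw accuracy when the classes are imbalanced, which is common with tweet data.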

This video aims at automating the Model Selection. 

Selecting Your Model
Clustering Users
3 Lectures 10:29

The aim of the video is to introduce clustering algorithms and run K-Means to identify clusters based on users' followers and friends.

Preview 05:12
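K-Means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. One such iteration on an invented toy set of (followers, friends) points, in stdlib Python; Spark's KMeans repeats this until convergence over distributed data.

```python
import math

points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 11.0)]
centroids = [(1.0, 1.0), (9.0, 9.0)]  # illustrative starting centroids

def closest(p, cs):
    """Index of the nearest centroid by Euclidean distance."""
    return min(range(len(cs)), key=lambda i: math.dist(p, cs[i]))

# Assignment step: attach each point to its nearest centroid
assignments = [closest(p, centroids) for p in points]

# Update step: move each centroid to the mean of its members
new_centroids = []
for i in range(len(centroids)):
    members = [p for p, a in zip(points, assignments) if a == i]
    new_centroids.append(
        (sum(x for x, _ in members) / len(members),
         sum(y for _, y in members) / len(members))
    )
```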

This video helps us cluster users by location with the 'ml' package.

Clustering Users by Location

The aim of this video is to show how to adapt to changing sources.

Running KMeans on a Stream
Your Next Data Challenges
3 Lectures 17:40

Recommendation is widely used on websites, and it needs to learn from preferences, which we don't have in our tweet dataset. The aim of this video is to show you alternative techniques to recommend users.

Preview 05:10

The aim of this video is to get a view of how users' mentions are linked.

Analyzing Mentions with GraphX
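A mention graph is just an edge list from author to mentioned user, and one of the simplest graph measures is the in-degree: who gets mentioned most. This plain-Python sketch (with made-up usernames) shows the flavor of what GraphX computes in a distributed way.

```python
from collections import Counter

# Directed edges: (author, mentioned user), invented for illustration
edges = [
    ("alice", "bob"),
    ("carol", "bob"),
    ("bob", "alice"),
    ("dave", "bob"),
    ("alice", "carol"),
]

# In-degree: number of incoming mentions per user
in_degree = Counter(dst for _, dst in edges)
most_mentioned, count = in_degree.most_common(1)[0]
```

GraphX generalizes this to PageRank, connected components, and other algorithms over the same vertex/edge representation.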

This course cannot cover every existing topic, and new topics pop up every day. Hence, we give you pointers on where to learn more.

Where to Go from Here
About the Instructor
Packt Publishing
3.9 Average rating
7,336 Reviews
52,405 Students
616 Courses
Tech Knowledge in Motion

Packt has been committed to developer learning since 2004. A lot has changed in software since then - but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live. And how to put them to work.

With an extensive library of content - more than 4000 books and video courses - Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.

From skills that will help you to develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource for making you a better, smarter developer.

Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.