Learning Path: Data Science With Apache Spark 2
3.2 (6 ratings)
67 students enrolled

Get started with Spark for large-scale distributed data processing and data science
Created by Packt Publishing
Last updated 2/2017
Current price: $10 Original price: $200 Discount: 95% off
30-Day Money-Back Guarantee
  • 9 hours on-demand video
  • 1 Supplemental Resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What Will I Learn?
  • Get to know the fundamentals of Spark 2.0 and the Spark programming model using Scala and Python
  • Know how to use Spark SQL and DataFrames using Scala and Python
  • Get an introduction to Spark programming using R
  • Develop a complete Spark application
  • Obtain and clean data before processing it
  • Understand Spark's machine learning algorithms to build a simple pipeline
  • Work with interactive visualization packages in Spark
  • Apply data mining techniques on the available datasets
  • Build a recommendation engine
View Curriculum
  • Requires basic knowledge of either Python or R

The real power and value proposition of Apache Spark lie in its speed and in its platform for executing data processing and data science tasks. Sounds interesting? Let’s see how easy it is!

Packt’s Video Learning Paths are a series of individual video products put together in a logical and stepwise manner such that each video builds on the skills learned in the video before it.

Spark is one of the most widely-used large-scale data processing engines and runs extremely fast. It is a framework that has tools that are equally useful for application developers as well as data scientists. Spark's unique use case is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations to allow data scientists to tackle the complexities that come with raw unstructured datasets.
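As a taste of the programming model described above, here is Spark's word-count "hello world" sketched in plain Python: a chain of transformations (flatMap, map, reduceByKey) followed by an action that materializes the result. The data and the helper function are invented for illustration; this is a sketch of the model, not actual Spark code.

```python
# A plain-Python sketch of Spark's transformation/action model (not Spark code).
lines = ["spark is fast", "spark is distributed"]

# flatMap: split every line into words
words = [w for line in lines for w in line.split()]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts per word
def reduce_by_key(pairs):
    out = {}
    for key, value in pairs:
        out[key] = out.get(key, 0) + value
    return out

counts = reduce_by_key(pairs)
print(counts)  # {'spark': 2, 'is': 2, 'fast': 1, 'distributed': 1}
```

In real Spark, each step above runs distributed across a cluster; the shape of the computation is the same.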

This Learning Path starts with an introductory tour of Apache Spark 2. We will look at the basics of Spark, introduce SparkR, then look at the charting and plotting features of Python in conjunction with Spark data processing, and finally take a thorough look at Spark's data processing libraries. We then develop a real-world Spark application. Next, we will help you become comfortable and confident working with Spark for data science by exploring Spark’s data science libraries on a dataset of tweets.

The goal of this course is to introduce you to Apache Spark 2 and teach you its data processing and data science libraries so that you are equipped with the skills required of modern data scientists.

This Learning Path is authored by some of the best in their fields.

Rajanarayanan Thottuvaikkatumana

Rajanarayanan Thottuvaikkatumana, or Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. His experience includes architecting, designing, and developing software applications. He has worked on various technologies including major databases, application development platforms, web technologies, and big data technologies. Currently he is building a next generation Hadoop YARN-based data processing platform and an application suite built with Spark using Scala.

Eric Charles

Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer, a social network for Data Scientists. His typical day includes building efficient processing with advanced machine learning algorithms, easy SQL, streaming and graph analytics. He also focuses a lot on visualization and result sharing. He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events. 

Who is the target audience?
  • Application developers, data scientists, and big data architects interested in harnessing the data processing power of Apache Spark will find this course very useful. As the implementations are shown in Scala and Python, some programming knowledge of these languages will be needed. This course is for anyone who wants to work with Spark on large and complex datasets. A basic knowledge of statistics and computational mathematics is expected.
  • With the help of real-world use cases covering the main features of Spark, this course offers an easy introduction to the framework. This practical, hands-on course covers the fundamentals of Spark needed to get to grips with data science through a single dataset. It then takes the next step up the learning curve for those comfortable with Spark programming who are looking to apply Spark in the field of data science.
Curriculum For This Course
85 Lectures
Apache Spark 2 for Beginners
45 Lectures 05:38:30

This video gives an overview of the entire course

Preview 04:30

This video will take you through the overview of Apache Hadoop. You will also explore the Apache Hadoop Framework and the MapReduce process.

An Overview of Apache Hadoop

By the end of this video, you will learn in depth about Spark and its advantages. You will also go through the Spark libraries and then dive into Spark Programming Paradigm.

Understanding Apache Spark

In this video, you will learn Python installation and also how to install R. Finally, you will be able to set up the Spark environment for your machine.

Installing Spark on Your Machines

Understand why side effects in program logic make it hard to get consistent results from a program or function, and why they make many applications very complex

Preview 08:44

Learn to process data using RDDs from the relevant data source, such as text files and NoSQL data stores

Data Transformations and Actions with RDDs

Learn to handle the tools for monitoring the jobs running in a given Spark ecosystem

Monitoring with Spark

Ability to explain the core concepts behind Spark's elementary data items.

The Basics of Programming with Spark

Ability to choose the appropriate Spark connector program and the appropriate API for reading data.

Creating RDDs from Files and Understanding the Spark Library Stack

What if you could not make use of the RDD-based Spark programming model as it requires some amount of functional programming? The solution to this is Spark SQL, which you will learn in this video.

Preview 09:38

This video will take you through the structure and internal workings of Spark SQL.

Anatomy of Spark SQL

This video will demonstrate two DataFrame programming models, one using SQL queries and the other using the DataFrame APIs for Spark.

DataFrame Programming

Spark SQL allows the aggregation of data. Instead of running SQL statements on a single data source located in a single machine, you can use SparkSQL to do the same on distributed data sources.

Understanding Aggregations and Multi-Datasource Joining with SparkSQL
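As a rough illustration of what such an aggregation computes, this plain-Python sketch mimics a GROUP BY with SUM over a tiny invented table. The column names and values are hypothetical; Spark SQL would run the same logic over distributed sources.

```python
# Plain-Python sketch of "SELECT city, SUM(amount) ... GROUP BY city".
# Invented rows for illustration; not Spark SQL code.
rows = [
    {"city": "London", "amount": 10},
    {"city": "Paris", "amount": 5},
    {"city": "London", "amount": 7},
]

totals = {}
for row in rows:
    totals[row["city"]] = totals.get(row["city"], 0) + row["amount"]

print(totals)  # {'London': 17, 'Paris': 5}
```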

This video will show you the methods used to create a Dataset, along with its usage, conversion of RDD to DataFrame, and conversion of DataFrame to dataset. You will also learn the usage of Catalog API in Scala and Python.

Introducing Datasets and Understanding Data Catalogs

This video will make you understand the necessity of SparkR and the basic data types in the R language.

Preview 08:09

You may encounter several situations where you need to convert an R DataFrame to a Spark DataFrame or vice versa. Let’s see how to do it.

DataFrames in R and Spark

This video will show you how to write programs with SQL and R DataFrame APIs.

Spark DataFrame Programming with R

In SQL, the aggregation of data is very flexible. The same is true in Spark SQL. Let’s see its use and the implementation of multi-datasource joins.

Understanding Aggregations and Multi-Datasource Joins in SparkR

This video will walk you through the Charting and Plotting Libraries and give a brief description of the application stack. You will also learn how to set up a dataset with Spark in conjunction with Python, NumPy, SciPy, and matplotlib.

Charting and Plotting Libraries and Setting Up a Dataset

There are several instances where you need to create various charts and plots to visually represent the various aspects of the dataset and then perform data processing, charting, and plotting. This video will enable you to do this with Spark.

Charts, Plots, and Histograms

This video will let you explore more on the different types of charts and bars, namely Stacked Bar Chart, Donut Chart, Box Plot, and Vertical Bar Chart. So, let’s do it!

Bar Chart and Pie Chart

Through this video, you will learn in detail about scatter plot and line graph using Spark. You will also see how to enhance scatter plot in depth.

Scatter Plot and Line Graph

Data sources generate data like a stream, and many real-world use cases require them to be processed in real time. This video will give you a deep understanding of Stream processing in Spark.

Data Stream Processing and Micro Batch Data Processing
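The micro-batch idea behind Spark Streaming can be sketched in plain Python: buffer incoming events into fixed-size batches, then run an ordinary batch computation on each one. Everything here is an invented toy (a list standing in for a stream, batch size standing in for the time interval), not Spark code.

```python
# Plain-Python sketch of micro-batch stream processing (illustrative only).
events = list(range(10))   # a pretend stream of ten events
batch_size = 4             # stand-in for the micro-batch time interval

# Split the stream into micro batches, then run a batch job on each one.
batches = [events[i:i + batch_size] for i in range(0, len(events), batch_size)]
results = [sum(batch) for batch in batches]   # a toy per-batch computation

print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(results)  # [6, 22, 17]
```

Spark Streaming does this continuously and distributed, but each micro batch is processed with the same batch APIs sketched here.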

These days, it is very common to have a central repository of application log events in many enterprises. Also, the log events are streamed live to data processing applications in order to monitor the performance of the running applications on a real-time basis. This video demonstrates the real-time processing of log events using a Spark Streaming data processing application.

A Log Event Processor

This video will let you know the different processing options that you can pick up in Spark to work in a smart way with any data.

Windowed Data Processing and More Processing Options

Kafka is a publish-subscribe messaging system used by many IoT applications to process a huge number of messages. Let’s see how to use it!

Kafka Stream Processing

When a Spark Streaming application is processing the incoming data, it is very important to have an uninterrupted data processing capability so that all the data that is getting ingested is processed. This video will take you through those tasks that enable you to achieve this goal.

Spark Streaming Jobs in Production

This video will let you know the basics of machine learning and understand the ability of Spark to achieve the goals of machine learning in an efficient manner.

Understanding Machine Learning and the Need for Spark

By the end of this video, you will be able to perform predictions on large datasets such as the wine quality dataset, which is widely used in data analysis.

Wine Quality Prediction and Model Persistence

Let’s use Spark to perform Wine classification by using various algorithms.

Wine Classification

Spam filtering is a very common use case that is used in many applications. It is ubiquitous in e-mail applications. It is one of the most widely used classification problems. This video will enable you to deal with this problem and show you the best approach to resolve it in Spark.

Spam Filtering

It is not very easy to get raw data in the appropriate form of features and labels in order to train the model. Through this video, you will be able to play with the raw data and use it efficiently for processing.

Feature Algorithms and Finding Synonyms

Graphs are widely used in data analysis. Let’s explore some commonly used graphs and their usage.

Understanding Graphs with Their Usage

Many libraries are available in the open source world. Giraph, Pregel, GraphLab, and Spark GraphX are some of them. Spark GraphX is one of the recent entrants into this space. Let’s dive into it!

The Spark GraphX Library

Just like any other data structure, a graph also undergoes lots of changes because of the change in the underlying data. Let’s learn to process these changes.

Graph Processing and Graph Structure Processing

Since the basic graph processing fundamentals are in place, now it is time to take up a real-world use case that uses graphs. Let’s take the tennis tournament's results for it.

Tennis Tournament Analysis

When searching the web using Google, pages that are ranked highly by its algorithm are displayed. In the context of graphs, instead of web pages, if vertices are ranked based on the same algorithm, lots of new inferences can be made. Let’s jump right in and see how to do this.

Applying PageRank Algorithm
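For the intuition behind this lecture, here is a minimal plain-Python power-iteration sketch of PageRank on an invented three-vertex graph. The graph and the 0.85 damping factor are the usual textbook choices, not course code; GraphX runs the same idea distributed.

```python
# PageRank repeatedly redistributes each vertex's rank to its out-neighbors.
# Tiny invented graph: vertex -> list of out-neighbors.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 for v in links}
damping = 0.85   # the conventional damping factor

for _ in range(50):
    contribs = {v: 0.0 for v in links}
    for vertex, neighbors in links.items():
        share = ranks[vertex] / len(neighbors)   # split rank among out-links
        for n in neighbors:
            contribs[n] += share
    ranks = {v: (1 - damping) + damping * c for v, c in contribs.items()}

# 'b' receives only half of a's rank, so it ends with the lowest rank
print({v: round(r, 3) for v, r in ranks.items()})
```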

In a graph, finding a subgraph consisting of connected vertices is a very common requirement with tremendous applications. This video will enable you to find the connected vertices, making it easy for you to work on the given data.

Connected Component Algorithm

GraphFrames is a new graph processing library available as an external Spark package developed by Databricks. Through this video, you will learn the concepts and queries used in GraphFrames.

Understanding GraphFrames and Its Queries

Application architecture is very important for any kind of software development. Lambda Architecture is a recent and popular architecture that's ideal for developing data processing applications. Let’s dive into it!

Lambda Architecture

In recent years, microblogging has brought the general public into the culture of blogging. Let’s see how we can work with it and have fun!

Micro Blogging with Lambda Architecture

Since the Lambda Architecture is a technology-agnostic architecture framework, when designing applications with it, it is imperative to capture the technology choices used in the specific implementations. This video does exactly that.

Implementing Lambda Architecture and Working with Spark Applications

You may need to use different coding styles and perform data ingestion. This video will enhance your knowledge and enable you to implement these tasks with ease.

Coding Style, Setting Up the Source Code, and Understanding Data Ingestion

This video will show you how to create the purposed views and queries discussed in the previous videos of this section.

Generating Purposed Views and Queries

Let’s explore custom data processes with this video!

Understanding Custom Data Processes
Data Science with Spark
40 Lectures 03:19:30

This video provides you a synopsis of the course.

Preview 03:52

Big data solutions such as Spark are hard to set up, time-consuming to learn, and obscure for non-technical users. The aim of this video is to give you just enough information to find your way through this complex ecosystem.

Origins and Ecosystem for Big Data Scientists, the Scala, Python, and R flavors

The goal of this video is to show the steps to install Spark on your laptop. This will allow you to easily test and prototype on a limited dataset size. If you want to process real big data, you will need a cluster in the cloud. We will also show you how to achieve this.

Install Spark on Your Laptop with Docker, or Scale Fast in the Cloud

This video shows you how to set up Apache Zeppelin, a tool to execute and visualize your Spark analysis.

Apache Zeppelin, a Web-Based Notebook for Spark with matplotlib and ggplot2

Manipulate data with the core RDD API.

Manipulating Data with the Core RDD API

Load files with the more advanced Spark data structures, the Dataframe and the Dataset, and perform basic operations on the loaded data.

Using Dataframe, Dataset, and SQL – Natural and Easy!

Manipulate the rows and columns of a Dataframe. This is needed to engineer features for your datasets.

Manipulating Rows and Columns

Often the received dataset is not what you expect it to be. You will need to convert the encoding and format.

Dealing with File Format

Visualization is a powerful way to make sense from data.

Visualizing More – ggplot2, matplotlib, and Angular.js at the Rescue

Data scientists use complex algorithms to analyze data with high-level libraries. Spark ships with a complete, out-of-the-box solution for distributed machine learning.

Discovering spark.ml and spark.mllib - and Other Libraries

Dataframes are not the best-suited data structures for data science algorithms, which mainly deal with matrices and vectors. Let's look at the Spark-optimized distributed data structures for this.

Wrapping Up Basic Statistics and Linear Algebra

Before applying the algorithms, we need to cleanse the datasets and create new features.

Cleansing Data and Engineering the Features

With big data, we may have an explosion of features which can cause noise. We need to be able to work on only the relevant features.

Reducing the Dimensionality

Data analysis works in multiple steps that are executed sequentially; Spark fulfils this need with pipelines.

Pipeline for a Life
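The pipeline idea can be sketched in a few lines of plain Python: each stage is a function, and the pipeline applies them in order, much as Spark ML's Pipeline chains its stages. The stage functions here are invented for illustration.

```python
# Plain-Python sketch of a pipeline: stages applied in sequence.
def lowercase(texts):
    return [t.lower() for t in texts]

def tokenize(texts):
    return [t.split() for t in texts]

def run_pipeline(stages, data):
    # Each stage's output is the next stage's input.
    for stage in stages:
        data = stage(data)
    return data

tokens = run_pipeline([lowercase, tokenize], ["Spark ML", "Data Science"])
print(tokens)  # [['spark', 'ml'], ['data', 'science']]
```

In Spark ML, the stages are transformers and estimators rather than plain functions, but the chaining is the same.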

We want to connect to the Twitter API and store collected tweets in a persistent way.

Streaming Tweets to Disk

Get introduced to and visualize streaming information.

Streaming Tweets on a Map

Build a clean and usable dataset from the raw collected data.

Cleansing and Building Your Reference Dataset

We want an easy and fast view of our dataset.

Querying and Visualizing Tweets with SQL

The aim of this video is to understand your dataset better.

Indicators, Correlations, and Sampling

The aim of this video is to validate statistical relevance.

Validating Statistical Relevance

Algorithms sometimes have difficulty with large numbers of features, which can be redundant. We will see what SVD and PCA can bring to reduce the number of features for later modeling.

Running SVD and PCA

In this video, you will learn to extend the basic statistics for your needs.

Extending the Basic Statistics for Your Needs

Free text is difficult for algorithms. We need to convert words into numerical values.

Analyzing Free Text from the Tweets
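To see concretely what "converting words into numerical values" means, here is a minimal plain-Python bag-of-words sketch on two invented tweets. Spark's own CountVectorizer and HashingTF do this at scale; this is only the idea.

```python
# Plain-Python bag-of-words: map free text to numeric count vectors.
tweets = ["spark is great", "great great stuff"]   # invented toy tweets

# Build a sorted vocabulary from all words seen.
vocabulary = sorted({w for t in tweets for w in t.split()})

# One count vector per tweet: occurrences of each vocabulary word.
vectors = [[t.split().count(w) for w in vocabulary] for t in tweets]

print(vocabulary)  # ['great', 'is', 'spark', 'stuff']
print(vectors)     # [[1, 1, 1, 0], [2, 0, 0, 1]]
```

These numeric vectors are what the downstream algorithms actually consume.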

Standard Spark functions do not handle the specifics of tweets. Let's see how we can do a better job.

Dealing with Stemming, Syntax, Idioms and Hashtags

We need to enrich our dataset with more features that we generate, a common process in data science analysis. We will add sentiment features based on free-text feature vectors.

Detecting Tweet Sentiment

Once again, we want to generate additional features from the ones we have. Topics can be considered, although they are not easy to apply to tweets.

Identifying Topics with LDA

The aim of this video is to visualize word clouds of data in Spark DataFrame.

Word Cloudify Your Dataset

We want an aggregated view of user locations. Therefore, we will preprocess the users' locations with Geohash.

Locating Users and Displaying Heatmaps with GeoHash

Your data analysis has to be an iterative and collaborative process.

Collaborating on the Same Note with Peers

We want to get feedback as soon as possible from the end-users and from our business stakeholders.

Create Visual Dashboards for Your Business Stakeholders

The aim of the video is to review the available Spark Algorithms to classify and prepare the needed datasets.

Building the Training and Test Datasets

The aim of this video is to train a Classification Model.

Training a Logistic Regression Model

The aim of this video is to apply the Model on the Test Dataset and evaluate its performance.

Evaluating Your Classifier

This video aims at automating the Model Selection.

Selecting Your Model

The aim of the video is to introduce clustering algorithms and run K-Means to identify a cluster based on User Followers and Friends.

Clustering Users by Followers and Friends
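The K-Means loop used here alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. A one-dimensional plain-Python sketch on invented values (not the course's follower data or Spark's implementation):

```python
# Plain-Python K-Means sketch, k = 2, one-dimensional invented points.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids = [1.0, 10.0]   # initial guesses

for _ in range(10):
    # Assignment step: index of the nearest centroid for each point.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: recompute each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [2.0, 11.0]
```

Spark's `ml` package runs the same two-step loop over distributed, multi-dimensional data.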

This video helps us to Cluster users by location with the 'ml' package.

Clustering Users by Location

The aim of this video is to show how to adapt to changing sources.

Running KMeans on a Stream

Recommendation is widely used on websites. It needs to learn from preferences, which we don't collect in our tweet dataset. The aim of this video is to show you alternative techniques to recommend users.

Recommending Similar Users

The aim of this video is to have a view on how the user's mentions are linked.

Analyzing Mentions with GraphX

This course cannot cover every existing topic, and new topics pop up every day. Hence, we give you pointers on where to learn more.

Where to Go from Here
About the Instructor
Packt Publishing
3.9 Average rating
7,282 Reviews
51,853 Students
616 Courses
Tech Knowledge in Motion

Packt has been committed to developer learning since 2004. A lot has changed in software since then - but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live, and how to put them to work.

With an extensive library of content - more than 4,000 books and video courses - Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.

From skills that will help you develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource for making you a better, smarter developer.

Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.