
Apache Spark for Big Data Analytics and Data Processing

Leverage the power of Apache Spark to perform efficient data processing and analytics on your data in real-time
3.8 (2 ratings)
39 students enrolled
Created by Packt Publishing
Last updated 12/2018
English
30-Day Money-Back Guarantee
This course includes
  • 7 hours on-demand video
  • 1 downloadable resource
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • Query your structured data using Spark SQL and work with the DataSets API
  • Analyze and process graph structures using Spark’s GraphX module
  • Train machine learning models with streaming data, and use them for making real-time predictions
  • Implement high-velocity streaming and data processing use cases while working with streaming API
  • Dive into MLlib, Spark's machine learning library, with its highly scalable algorithms
  • See how SparkR allows you to create and transform RDDs in R
  • See analytical use case implementations using MLlib, GraphX, and Spark Streaming
  • Examine a number of real-world use cases with hands-on projects
  • Build Hadoop and Apache Spark jobs that process data quickly and effectively
Course content
Expand all 90 lectures 07:06:47
+ Spark Analytics for Real-Time Data Processing
19 lectures 01:38:10

This video gives an overview of the entire course.

Preview 03:06

This video gives a complete introduction to Spark SQL, discusses the types of applications where Spark SQL is useful, and ends by examining Spark SQL's performance.

  • First, it introduces Spark SQL

  • Next, it explains the types of applications where Spark SQL is useful

  • Finally, it examines Spark SQL's performance

Spark SQL Introduction
09:04

This video explains the core abstractions used by Spark SQL's programming interfaces: SQLContext, HiveContext, SparkSession, Dataset, and DataFrame. A short code sketch follows below.

  • First, it covers the SQLContext in Spark 1.6 and 2.0

  • Next, it explains the HiveContext in Spark 1.6 and 2.0

  • Finally, it explains the concepts of Dataset and DataFrame

Spark SQL – Core Abstractions
06:48
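
To make these abstractions concrete, here is a minimal Scala sketch, assuming Spark 2.x run locally (the app name and sample data are invented for illustration). It shows how SparkSession subsumes the older SQLContext and HiveContext, and how a Dataset relates to a DataFrame:

import org.apache.spark.sql.SparkSession

object CoreAbstractions {
  def main(args: Array[String]): Unit = {
    // SparkSession (Spark 2.0+) replaces SQLContext and HiveContext as the entry point
    val spark = SparkSession.builder()
      .appName("core-abstractions") // hypothetical app name
      .master("local[*]")           // local mode, for experimentation only
      .getOrCreate()
    import spark.implicits._

    // A Dataset is a typed distributed collection; a DataFrame is simply Dataset[Row]
    val ds = Seq(("alice", 34), ("bob", 29)).toDS()
    val df = ds.toDF("name", "age")
    df.show()

    spark.stop()
  }
}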

This video explains how to create DataFrames from resilient distributed datasets (RDDs) and runs some code examples. A short sketch follows below.

  • First, it discusses creating DataFrames from resilient distributed datasets

  • Next, it executes a code sample that creates a DataFrame

Creating DataFrames from RDD
03:55
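
A minimal sketch of the RDD-to-DataFrame conversion, assuming Spark 2.x and an invented Person schema (toDF() infers the columns from the case class by reflection):

import org.apache.spark.sql.SparkSession

// Hypothetical schema; any top-level case class works
case class Person(name: String, age: Int)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-df").master("local[*]").getOrCreate()
    import spark.implicits._

    // Start from a plain RDD of case-class instances...
    val rdd = spark.sparkContext.parallelize(Seq(Person("alice", 34), Person("bob", 29)))

    // ...and convert it to a DataFrame with the schema inferred from Person
    val df = rdd.toDF()
    df.printSchema()
    df.show()
  }
}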

This video explains how to create DataFrames from different types of files and runs some code examples (see the sketch below).

  • First, it demonstrates creating DataFrames from CSV files

  • Next, it demonstrates creating DataFrames from JSON files

  • Finally, it covers creating DataFrames from Parquet and ORC files

Creating DataFrames from Files
07:20
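
A minimal shell-style sketch of the file readers, assuming a spark-shell session where the SparkSession is available as spark; the file paths are hypothetical:

// CSV: header handling and schema inference are opt-in options
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/people.csv")       // hypothetical path

// JSON: by default Spark expects one JSON object per line
val jsonDf = spark.read.json("data/people.json")

// Columnar formats (Parquet, ORC) carry their own schema
val parquetDf = spark.read.parquet("data/people.parquet")
val orcDf = spark.read.orc("data/people.orc")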

This video explains ways of creating DataFrames from different data sources. It also covers storing DataFrames and runs some code examples.

  • First, it discusses creating DataFrames from a Hive data source

  • Next, it explains creating DataFrames from a JDBC data source

  • Finally, it explains storing data in JSON/ORC files and via Hive and JDBC

Creating DataFrames from Data Sources
05:02

This video explains the DataFrame API's common operations, such as columns, dtypes, explain, printSchema, and registerTempTable, with demonstrations.

  • First, it covers the columns and dtypes operations

  • Next, it explains the explain and printSchema operations

  • Finally, it covers the registerTempTable operation

DataFrame API – Common Operations
03:02

This video explains the DataFrame API's query operations, such as aggregation, sampling, filter, groupBy, join, intersect, orderBy, and sort, with demonstrations. A short sketch follows below.

  • First, it covers the query operations for aggregation, sampling, and filter

  • Next, it explains the query operations for groupBy, join, and intersect

  • Finally, it covers the query operations for orderBy, sort, and distinct

DataFrame API – Query Operations
05:44
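
A small shell-style sketch of these query operations, assuming a spark-shell session (spark available) and invented toy data:

import org.apache.spark.sql.functions._
import spark.implicits._

val orders = Seq(
  ("alice", "books", 12.0),
  ("alice", "games", 25.0),
  ("bob",   "games", 40.0)
).toDF("user", "category", "amount")

// filter -> groupBy -> aggregate -> orderBy, chained on the DataFrame
orders.filter($"amount" > 10.0)
  .groupBy($"user")
  .agg(sum($"amount").as("total"))
  .orderBy($"total".desc)
  .show()

// join against a second DataFrame on the shared "user" column
val users = Seq(("alice", "US"), ("bob", "PL")).toDF("user", "country")
orders.join(users, "user").show()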

This video explains the DataFrame API's actions, such as limit, select, withColumn, selectExpr, count, describe, and collect, with demonstrations.

  • First, it covers the limit, select, and withColumn actions

  • Next, it explains the selectExpr, count, and describe actions

  • Finally, it covers the collect, show, and take actions

DataFrame API – Actions
04:38

This video explains the DataFrame API's built-in functions for collections, dates, times, math, and strings, which Spark SQL provides optimized for fast execution (see the sketch below).

  • First, it covers the built-in functions for collections

  • Next, it explains the built-in functions for dates and times

  • Finally, it covers the built-in functions for math and strings

DataFrame API – Built-In Functions
03:00
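
A shell-style sketch of a few built-in functions from org.apache.spark.sql.functions, again with invented data (assumes a spark-shell session):

import org.apache.spark.sql.functions._
import spark.implicits._

val posts = Seq(("Spark SQL", "2018-12-01"), ("Streaming", "2018-11-15"))
  .toDF("title", "published")

posts.select(
  upper($"title").as("title_upper"),          // string function
  length($"title").as("title_len"),           // string function
  to_date($"published").as("published_date"), // date/time function
  current_date().as("today"),                 // date/time function
  round(lit(3.14159), 2).as("pi")             // math function
).show()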

This video gives a complete introduction to Spark Streaming, DStreams, and the support for different data sources.

  • First, it explains why Spark Streaming is needed

  • Next, it explains how DStreams differ from RDDs

  • Finally, it explains the different data sources supported by Spark Streaming

Preview 03:56

This video walks through the complete code for a word count program and the steps for executing it, as a first example (reproduced in the sketch below).

  • First, it walks through the word count program code

  • Next, it explains the steps to run the word count program

Spark Streaming – Quick Example
05:54
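
The classic word count example is short enough to reproduce in full; this sketch assumes Spark 2.x and a local socket source you can feed with nc -lk 9999:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // At least two local threads: one for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-word-count")
    val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

    // Read lines from a local socket (start a test source with: nc -lk 9999)
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}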

This video explains the complete architecture of Spark Streaming, the concept of DStreams with an example, and streaming execution in Spark.

  • First, it walks through the Spark Streaming architecture in detail

  • Next, it explains the concept of DStreams with an example

  • Finally, it explains the details of Spark Streaming execution

Spark Streaming – Architecture
04:14

This video explains the different types of transformations available in Spark Streaming: stateless transformations and stateful transformations (see the sketch below).

  • First, it covers stateless transformations such as map(), filter(), and groupByKey()

  • Next, it explains windowed operations, a kind of stateful transformation

  • Finally, it explains the updateStateByKey() stateful transformation

Spark Streaming – Transformations
06:11
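
A sketch contrasting the three kinds of transformations on a word-count style DStream; it assumes Spark 2.x and the socket source from the earlier example (stateful operations require a checkpoint directory):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTransformations {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-transformations")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint/") // required by the stateful transformations below

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split("\\s+"))
      .map((_, 1))

    // Stateless: applied independently to each micro-batch
    val stateless = pairs.filter { case (word, _) => word.nonEmpty }

    // Windowed (stateful): counts over the last 30s, recomputed every 10s
    val windowed = pairs.reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))

    // updateStateByKey (stateful): running total per word across all batches
    val running = pairs.updateStateByKey[Int] { (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))
    }
    running.print()

    ssc.start()
    ssc.awaitTermination()
  }
}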

This video explains the different types of input sources available for Spark Streaming, such as sockets, files, Kafka, and Flume. It also explains the different output operations that are available.

  • First, it covers core input sources such as sockets, files, and Akka receivers

  • Next, it explains other input sources such as Flume and Kafka

  • Finally, it explains output operations such as save(), saveAsHadoopFiles(), and so on

Spark Streaming – Input Sources
04:52

This video briefly explains performance considerations for Spark Streaming, such as batch size, parallelism, garbage collection, and memory usage.

  • First, it covers tuning the batch size for Spark Streaming

  • Next, it explains using parallelism to improve Spark Streaming performance

  • Finally, it explains garbage collection and memory usage in streaming applications

Spark Streaming – Performance Considerations
04:04

The aim of this video is to explain best practices for handling high-velocity streams, such as using parallelism, scheduling, setting the right configuration for memory usage, and a few other tips.

  • First, it explains parallelism-based best practices

  • Next, it explains scheduling-based best practices

  • Finally, it explains memory-related configuration and a few other tips

Best Practices for High Velocity Streams
07:26

The aim of this video is to explain best practices for external data sources such as Flume, Kafka, sockets, and message queue protocols.

  • First, it discusses Flume in the context of streaming

  • Next, it discusses Kafka in the context of streaming

  • Finally, it explains the usage of sockets and message queue protocols

Best Practices for External Data Sources
05:04

The aim of this video is to explain design patterns that can be used to maintain global state and to use the foreachRDD output action in Spark Streaming (see the sketch below).

  • First, it explains patterns for maintaining global state within a streaming application

  • Next, it explains patterns for handling connections within the foreachRDD action

Design Patterns
04:50
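
The connection-handling pattern for foreachRDD is worth sketching: the goal is one connection per partition, created on the executor, never serialized from the driver. Here dstream is assumed to be an existing DStream[String], and ConnectionPool is a hypothetical helper in the spirit of the pattern recommended by the Spark Streaming documentation:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Borrow one connection per partition on the executor side, rather than
    // one per record (too slow) or one on the driver (not serializable)
    val connection = ConnectionPool.getConnection() // hypothetical pool
    partition.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)     // reuse across batches
  }
}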
Test Your Knowledge
5 questions
+ Advanced Analytics and Real-Time Data Processing in Apache Spark
49 lectures 03:24:03

This video gives an overview of the entire course.

Preview 02:57

The aim of this video is to delve into Spark Streaming architecture.

  • Understand micro batches

  • Compare latency versus throughput

  • Learn about failure recovery and checkpointing

Introducing Spark Streaming
04:17

The aim of this video is to look into the StreamingContext of a Spark Streaming application.

  • Create a Spark Streaming application

  • Create the base for streaming processing

Streaming Context
03:38

The goal of this video is to look into processing streaming data and understand how stream processing differs from batch processing.

  • Find out what unbounded data is

  • Find out how stream processing differs from batch processing

  • Process each event really fast

Processing Streaming Data
02:40

The goal of this video is to learn about use cases for Spark Streaming applications and know when to use streaming.

  • Find out why to use streaming, and its pros

  • Learn about stream use cases

Use Cases
03:12

The aim of this video is to look into Spark Streaming word count problem and solve it using Spark Streaming API.

  • Create a Spark Streaming word count

  • Test the streaming job

  • Learn how to write processing

Spark Streaming Word Count Hands-On
05:44

The goal of this video is to understand what the master URL in Spark Streaming context is.

  • Understand Spark architecture

  • Use master URL for submitting jobs

  • Find out what YARN is

Spark Streaming – Understanding Master URL
05:25

A streaming architecture needs a data source, and Apache Kafka often serves as the event queue, which makes a great data source for events. The goal of this video is to integrate Spark Streaming with Apache Kafka (see the sketch below).

  • Understand what Apache Kafka is

  • Use Apache Kafka as a data source for the Spark Streaming job

  • Learn about writing a DStream provider

Integrating Spark Streaming with Apache Kafka
05:26
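
A sketch of the direct Kafka integration, assuming the spark-streaming-kafka-0-10 artifact on the classpath, an existing StreamingContext ssc, a broker on localhost:9092, and a hypothetical "events" topic:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",           // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "spark-demo",                        // hypothetical consumer group
  "auto.offset.reset" -> "latest"
)

// Each record is a Kafka ConsumerRecord; here we keep only the value
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams)
)
stream.map(_.value).print()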

The aim of this video is to implement stateful stream processing that saves data to a Cassandra database and retrieves it, so that Cassandra can serve as a durable state store (a mapWithState sketch follows below).

  • Implement stateful stream processing

  • Use Cassandra as the state store

  • Use Spark Streaming's mapWithState to implement stateful processing

mapWithState Operation
06:56
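
A minimal mapWithState sketch, assuming Spark 2.x; it keeps a running count per key in Spark's own state store (persisting that state to Cassandra, as the video does, would happen in an additional output step):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object MapWithStateExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-with-state")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint/") // mapWithState requires checkpointing

    val events = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

    // Merge each new value into the running count kept in State[Int]
    val spec = StateSpec.function { (key: String, value: Option[Int], state: State[Int]) =>
      val newCount = state.getOption.getOrElse(0) + value.getOrElse(0)
      state.update(newCount)
      (key, newCount)
    }

    events.mapWithState(spec).print()

    ssc.start()
    ssc.awaitTermination()
  }
}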

The aim of this video is to learn about the transform and window operations in Spark Streaming.

  • Learn about transformations on the DStream

  • Learn to window events using the DStream API

Transform and Window Operation
02:52

In Spark, we often want to join data from multiple streams and then apply some processing to it. In this video, we will try to join two sources.

  • Join two streams

  • Test the joining

Join and Output Operations
02:40

In this video, we will learn about output operations and how to save results to a Kafka sink.

  • Understand what a sink is

  • Define a sink for the DStream

  • Save results from the Spark Streaming job to Kafka

Output Operations – Saving Results to Kafka Sink
02:55

In this video, we will get to know what event time is.

  • What processing time is

  • What ingestion time is

  • How to handle each of them

Handling Time in High Velocity Streams
04:58

Data sources often work with an at-least-once guarantee. In this video, we will connect external systems.

  • Implement deduplication logic

Connecting External Systems That Work with an At-Least-Once Guarantee – Deduplication
05:52

In this video, we will get to know how to handle events that are not in order.

  • How to verify the order of events

  • Implement sorting on a stream of events

Building Streaming Application – Handling Events That Are Not in Order
06:01

In this video, we will implement streaming processing that filters out bots.

  • Use the deduplication we implemented to make the streaming processing robust

  • Use the order verification we implemented to make the streaming processing robust

Filtering Bots from Stream of Page View Events
06:54

In this video, we will create a project using Spark MLlib.

  • Learn what we want to achieve

  • Analyze the input data

  • Prepare the input data to make it ready for input to ML models

Introducing Machine Learning with Spark
05:54

In this video, we will see how to represent text as a vector.

  • Transform text into a vector of numbers

Feature Extraction and Transformation
01:13

In this video, we will see algorithms for transforming text into a vector of numbers.

  • Understand bag-of-words

  • Understand Word2Vec

  • Learn about skip-gram

Transforming Text into Vector of Numbers – ML Bag-of-Words Technique
04:44

In this video, we will learn what supervised and unsupervised machine learning are (a logistic regression sketch follows below).

  • What logistic regression is

  • Walk through a simple logistic regression example

  • Implement a logistic regression model in Apache Spark

Logistic Regression
06:52
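
A compact logistic regression sketch with spark.ml, using a tiny hand-made dataset; in the course's use case, the feature vectors would come from text feature extraction rather than being typed in:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object LogRegExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("logreg").master("local[*]").getOrCreate()
    import spark.implicits._

    // Invented training rows: (label, feature vector)
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.1, 0.1)),
      (0.0, Vectors.dense(2.0, 1.0, -1.0)),
      (0.0, Vectors.dense(2.0, 1.3, 1.0)),
      (1.0, Vectors.dense(0.0, 1.2, -0.5))
    ).toDF("label", "features")

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(training)

    // Score the training set back, just to show the output columns
    model.transform(training).select("label", "probability", "prediction").show(false)
  }
}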

This video explains what cross-validation is.

  • How to split training and test data in a proper way

  • Implement cross-validation in Apache Spark

Model Evaluation
02:42

This video explains what clustering is.

  • Understand the Gaussian mixture model (GMM)

  • Cluster data using the post timestamp

  • How to use GMM in a proper way

Clustering
02:41

In this video, we will prepare data for clustering.

  • Use GMM to cluster posts by the time of the post

  • Implement the logic in Apache Spark

Gaussian Mixture Models
05:10

In this video, we will see what Singular Value Decomposition (SVD) is.

  • When we can use it

  • How to implement it in Spark using MLlib

Principal Component Analysis and Distributing the Singular Value Decomposition
03:15

In this video, we will look at the movie data source that will be used to train the model (an ALS sketch follows below).

  • Build collaborative filtering in Apache Spark

  • Use the Alternating Least Squares (ALS) algorithm

  • Recommend movies for a given user

Collaborative Filtering – Building Recommendation Engine
07:32
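
A sketch of ALS-based collaborative filtering with spark.ml, on invented (user, movie, rating) triples; a real run would load something like the MovieLens dataset instead:

import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object AlsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("als").master("local[*]").getOrCreate()
    import spark.implicits._

    val ratings = Seq(
      (0, 10, 4.0f), (0, 11, 1.0f),
      (1, 10, 5.0f), (1, 12, 2.0f),
      (2, 11, 3.0f), (2, 12, 4.0f)
    ).toDF("userId", "movieId", "rating")

    val als = new ALS()
      .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
      .setRank(5).setMaxIter(10).setRegParam(0.1)
      .setColdStartStrategy("drop") // avoid NaN predictions for unseen users/items

    val model = als.fit(ratings)

    // Top-3 movie recommendations for every user
    model.recommendForAllUsers(3).show(false)
  }
}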

In this video, we will see what a graph is.

  • What an edge is

  • What a vertex is

Introducing Spark GraphX – How to Represent a Graph?
03:04

This video explains Spark GraphX.

  • See the pros

  • Differentiate between graph-parallel and data-parallel systems

Limitations of Graph-Parallel System – Why Spark GraphX?
02:54

In this video, we will create a Spark project.

  • Import the GraphX library into Spark

  • Explain sbt

Importing GraphX
01:45

In this video, we will see what a property graph is (see the sketch below).

  • Use GraphX API to create a graph

  • Define edges

  • Define vertices

Create a Graph Using GraphX and Property Graph
05:03
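
A property-graph sketch with GraphX, using invented vertices and edges (assumes a spark-shell session, so sc is available):

import org.apache.spark.graphx.{Edge, Graph}

// Vertices: (VertexId, property); edges: Edge(srcId, dstId, property)
val vertices = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")
))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "follows"),
  Edge(3L, 1L, "follows")
))

val graph = Graph(vertices, edges)
println(s"vertices: ${graph.numVertices}, edges: ${graph.numEdges}")

// The triplet view joins each edge with both endpoint properties
graph.triplets.collect().foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))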

In this video, we will get to know the Graph API.

  • Look at and investigate the available operations

List of Operators
03:47

In this video, we will use the Graph API to experiment with graphs.

  • Explain operations on edges

  • Explain operations on vertices

Perform Graph Operations Using GraphX
04:08

In this video, we will use the triplet API.

  • Aggregate triplets to extract facts from a graph

Triplet View
03:27

In this video, we will create a subgraph of a graph.

  • Define properties of a subgraph

  • Extract subgraph

Perform Subgraph Operations
04:28

In this video, we will calculate neighbourhood averages.

  • Use neighbourhood aggregations from GraphX API

Neighbourhood Aggregations – Collecting Neighbours
03:42

In this video, we will count degrees of vertices.

  • Count the in-degree of a vertex

  • Count the out-degree of a vertex

Counting Degree of Vertex
04:21

In this video, we will optimize a graph by using caching.

  • Test code without caching

  • Test code with caching

Caching and Uncaching
03:57

In this video, we will define a graph structure in a file.

  • Load it into Spark using GraphLoader

GraphBuilder
02:52

In this video, we will take an RDD of all edges.

  • Take an RDD of all vertices

  • Perform operations using RDD API

Vertex and Edge RDD
03:31

In this video, we will see what connected components are.

  • Implement them in Spark GraphX

Structural Operators – Connected Components
03:03

In this video, we will see what the R language is.

  • What SparkR is

  • How to use SparkR

  • What the pros of SparkR are

Introduction to SparkR and How It’s Used?
04:14

In this video, we will set up SparkR in RStudio.

  • Use SparkR from RStudio

Setting Up from RStudio
01:56

In this video, we will create Spark DataFrames in SparkR.

  • Use SparkR console

Creating Spark DataFrames from Data Sources
03:32

In this video, we will use SparkR grouping.

  • Use SparkR aggregation

SparkDataFrame Operations – Grouping and Aggregation
02:46

This video uses dapply.

  • Use dapplyCollect

Run a Given Function on a Large Dataset Using dapply or dapplyCollect
04:13

In this video, we will use gapply.

  • Use gapplyCollect

Run a Given Function on a Large Dataset Grouping by Input Column(s) Using gapply or gapplyCollect
04:06

In this video, we will use distributed functions.

  • Use spark.lapply method

Run Local R Functions Distributed Using spark.lapply
02:04

In this video, we will use DataFrame SQL API.

  • Use SQL from SparkR

Running SQL Queries from SparkR
03:01

In this video, we will get to know about PageRank (see the sketch below).

  • Look at the input data

  • Calculate PageRank in Spark GraphX

  • Explain PageRank using Spark GraphX

PageRank Using Spark GraphX
06:53
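
A PageRank sketch on a small invented follower graph (assumes a spark-shell session, so sc is available); pageRank(tol) iterates until the ranks change by less than the given tolerance:

import org.apache.spark.graphx.{Edge, Graph}

val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1), Edge(1L, 3L, 1)
))
val graph = Graph(vertices, edges)

// Run PageRank to convergence and keep the per-vertex ranks
val ranks = graph.pageRank(0.0001).vertices

// Join names back onto the ranked vertex ids, highest rank first
vertices.join(ranks)
  .sortBy { case (_, (_, rank)) => -rank }
  .collect()
  .foreach { case (_, (name, rank)) => println(f"$name%-6s $rank%.4f") }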

In this video, we will create abandoned-cart logic.

  • Implement Streaming logic

Sending Real-Time Notifications to Users on an E-Commerce Site
08:46
Test Your Knowledge
5 questions
+ Big Data Analytics Projects with Apache Spark
22 lectures 02:04:34

This video provides an overview of the entire course.

Preview 02:12

This video will show how to analyze windows in the streaming world.

  • Find out ways of calculating top sellers in a moving window

Explaining Ways of Joining Datasets
07:56

This video will tell us how to create window-join logic.

  • Find top sellers

Developing Spark Algorithm for Joining/Windowing Datasets
11:05

This video will show how to write tests for top seller’s job.

  • Handle edge cases

  • Produce top seller result

Testing Logic in MapReduce Spark — Finding Top Sellers
03:55

This video shows how to enrich top-seller items with external data.

  • Emulate REST call for enrichment

  • Draw conclusions

Drawing Conclusions from Top Sellers Data
06:41

This video will tell us what Market Basket Analysis (MBA) is.

  • Learn what the goals of MBA are

Market Basket Analysis Goals
04:25

This video will teach us what we can achieve using MBA.

  • Learn what the MBA algorithms are

  • Learn about the applications of MBA algorithms

Where Are MBA Algorithms Useful?
03:45

This video will implement the MBA MapReduce algorithm in Spark.

  • Create an algorithm for finding associations

  • Test the algorithm

Implementing MBA MapReduce Algorithm in Spark
08:15

This video will show us how to find association rules between products.

  • Implement generation of association rules

  • Test the program

Finding Association Rules Between Products
06:55

This video will show how to create a project using Spark MLlib.

  • Learn what we want to achieve

  • Analyze the input data

  • Prepare the input data to make it ready for input to ML models

Analyzing Post for an Author
02:38

This video will deal with how to represent text as a vector.

  • Transform text into a vector of numbers

Extracting Information from Unstructured Text
04:35

This video will show how to extract information.

  • Check out algorithms for transforming text into a vector of numbers

  • Learn what bag-of-words and skip-gram are

Extracting Information via Spark DataFrame
05:20

This video will deal with sentiment analysis of posts using logistic regression.

  • Define logistic regression

  • Check out a simple logistic regression example

  • Implement a logistic regression model in Apache Spark

Sentiment Analysis of Posts Using Logistic Regression
05:24

This video will show how to find an author of a post.

  • Learn how to split training and test data in a proper way

  • Implement cross-validation in Apache Spark

Finding an Author of a Post
03:03

This video will show us how to build a recommendation system.

  • Define collaborative filtering

  • Learn how to implement it

Content-Based Recommendation Systems Explanation
04:35

In this video, we look at the correlation between movies and users.

  • Take a look at the input data set — movies and users

  • Implement collaborative filtering (CF) in Spark

Finding Correlation Between Movies and Users
04:14

This video will show how to test MapReduce logic in Spark.

  • Test CF engine

  • Cover simple test cases

Testing Logic in MapReduce Spark
07:56

This video will show how to find recommended movies for a given user.

  • Test the recommendation model

  • Validate the recommendation model

Finding Recommendation for Given User
05:24

This video will show how to solve the common friends problem using a graph approach.

  • Learn what a graph is

  • Learn what a vertex is

  • Learn what an edge is

Finding Common Friends Problem — Graph Approach
03:53

This video will show how to create a graph using GraphX and property graph.

  • Use GraphX API to create a graph

  • Define edges

  • Define vertices

Creating a Graph Using GraphX and Property Graph
09:32

This video will show how to examine available methods.

  • Look and investigate operations

  • Understand Graph API

Solution — Examining Available Methods
04:15

In this video, we will find the closest friend for a given user using PageRank.

  • Define PageRank

  • Look at the input data

  • Calculate PageRank in Spark GraphX

Finding the Closest Friend for a Given User Using PageRank
08:36
Test Your Knowledge
5 questions
Requirements
  • Basic understanding and functional knowledge of Apache Spark and big data are required.
Description

Today’s world witnesses a massive amount of data being generated every day, everywhere. As a result, many organizations are focusing on Big Data processing to handle large amounts of data in real time with maximum efficiency. This has made Apache Spark rapidly gain popularity in the Big Data market. If you want to get the most out of this trending Big Data framework for all your data processing needs, then go for this Learning Path.

This comprehensive 3-in-1 course focuses on performing data streaming and data analytics with Apache Spark. You will learn to load data from a variety of structured sources such as JSON, Hive, and Parquet using Spark SQL and schema RDDs. You will also build streaming applications and learn best practices for managing high-velocity streaming and external data sources. Next, you will explore the Spark machine learning libraries and GraphX, where you will perform graph processing and analysis. Finally, you will build projects that will help you put your learning into practice and get a strong hold on the topics.

Contents and Overview

This training program includes 3 complete courses, carefully chosen to give you the most comprehensive training possible.

The first course, Spark Analytics for Real-Time Data Processing, starts off by explaining Spark SQL. You will learn how to use the Spark SQL API and built-in functions with Apache Spark. You will also go through some interactive analysis and look at some integrations between Spark and Java/Scala/Python. Next, you will explore Spark Streaming, StreamingContext, and DStreams. You will learn how Spark Streaming works on top of the Spark core, thus inheriting its features. Finally, you will stream data and also learn best practices for managing high-velocity streaming and external data sources.

In the second course, Advanced Analytics and Real-Time Data Processing in Apache Spark, you will leverage the features of various components of the Spark framework to efficiently process, analyze, and visualize your data. You will then learn how to implement the high velocity streaming operation for data processing in order to perform efficient analytics on your real-time data. You will also analyze data using machine learning techniques and graphs. Next, you will learn to solve problems using machine learning techniques and find out about all the tools available in the MLlib toolkit. Finally, you will see some useful machine learning algorithms with the help of Spark MLlib and will integrate Spark with R.

The third course, Big Data Analytics Projects with Apache Spark, contains various projects that consist of real-world examples. The first project is to find top-selling products for an e-commerce business by efficiently joining datasets in the MapReduce paradigm. Next, a Market Basket Analysis will help you identify items likely to be purchased together and find correlations between items in a set of transactions. Moving on, you will learn about probabilistic logistic regression by finding the author of a post. Next, you will build a content-based recommendation system for movies, training a model to make predictions. Finally, you will use a MapReduce Spark program to calculate mutual friends on a social network.

By the end of this course, you will have a sound understanding of the Spark framework, which will help you in analyzing and processing big data in real time.

Meet Your Expert(s):

We have the best work of the following esteemed author(s) to ensure that your learning journey is smooth:

  • Nishant Garg has over 17 years of software architecture and development experience in various technologies, such as Java Enterprise Edition, SOA, Spring, Hadoop, Hive, Flume, Sqoop, Oozie, Spark, Shark, YARN, Impala, Kafka, Storm, Solr/Lucene, NoSQL databases (such as HBase, Cassandra, and MongoDB), and MPP databases (such as GreenPlum). He received his MS in software systems from the Birla Institute of Technology and Science, Pilani, India, and is currently working as a technical architect for the Big Data R&D Group at Impetus Infotech Pvt. Ltd. Previously, Nishant has enjoyed working with some of the most recognizable names in the IT services and financial industries, employing full software life cycle methodologies such as Agile and Scrum. Nishant has also undertaken many speaking engagements on big data technologies and is the author of Apache Kafka and HBase Essentials, Packt Publishing.

  • Tomasz Lelek is a Software Engineer and Co-Founder of InitLearn. He mostly programs in Java and Scala, and dedicates his time and effort to getting better at everything. He is currently diving into Big Data technologies. Tomasz is very passionate about everything associated with software development. He has been a speaker at a few conferences in Poland (Confitura and JDD) and at the Krakow Scala User Group, and he has conducted a live coding session at the GeeCON Conference. He was also a speaker at an international event in Dhaka. He is very enthusiastic and loves to share his knowledge.

Who this course is for:
  • This course is for software engineers, data scientists, big data developers, and big data analysts who are interested in big data processing and data analytics with Apache Spark.