Tuning Apache Spark: Powerful Big Data Processing Recipes

Name: Tuning Apache Spark: Powerful Big Data Processing Recipes
Rating: 4.0 (31 reviews)

Uncover the lesser known secrets of powerful big data processing with Spark and Kafka

Created byPackt Publishing

Last updated 2/2019

English

What you'll learn

How to attain a solid foundation in the most powerful and versatile technologies involved in data streaming: Apache Spark and Apache Kafka
Form a robust and clean architecture for a data streaming pipeline
Ways to implement the correct tools to bring your data streaming architecture to life
How to create robust processing pipelines by testing Apache Spark jobs
How to create highly concurrent Spark programs by leveraging immutability
How to solve repeated problems by leveraging the GraphX API
How to solve long-running computation problems by leveraging lazy evaluation in Spark
Tips to avoid memory leaks by understanding the internal memory management of Apache Spark
Troubleshoot real-time pipelines written in Spark Streaming

Course content

3 sections • 84 lectures • 12h 0m total length

The Course Overview6:22
This video provides an overview of the entire course.
Discovering the Data Streaming Pipeline Blueprint Architecture17:37
Introduce data streaming fundamentals and shape the data streaming blueprint architecture
Cover the big picture of data streaming
Talk about classifying, securing and scaling streaming systems
Shape via a diagram the data streaming blueprint architecture
Analyzing Meetup RSVPs in Real-Time5:58
Introduce the Meetup RSVPs stream and choose the technologies for implementing the data streaming blueprint architecture. See alternative technologies as well and how to decide between them
Access the Meetup RSVP stream online
Choose the proper technology for each tier of data streaming blueprint architecture
Explore the alternative technologies per tier and criteria for choosing between them properly
Running the Collection Tier (Part I – Collecting Data)20:39
After a brief overview of the Collection Tier, we have a general discussion about protocols, interaction patterns and issues involved in writing a Collection Tier.
Start with a brief overview about connecting to the source of data, push and pull mechanisms and lightweight business logic
Continue with protocols and interaction patterns
Finish with the problem of scaling the Collection Tier and WebSocket caused by the direct and persistent connection
Collecting Data Via the Stream Pattern and Spring WebSocketClient API6:50
Develop the Collection Tier part for ingesting Meetup RSVPs via Spring WebSocketClient API
Brief overview of WebSocket concept
Introduce Spring WebSocketClient API and its role in Collection Tier
Implementation the code
Explaining the Message Queuing Tier Role6:18
Explain why this tier, that apparently complicates and slows down the data streaming pipeline, is needed.
Tackle backpressure issue
Understand the data durability issue
Learn about data delivery semantics issue
Introducing Our Message Queuing Tier –Apache Kafka24:58
Apache Kafka is a powerful, but complex technology. This video represents a comprehensive introduction of the main Kafka concepts.
Understand cover overview, terminology, high-level architecture, topics and partitions
Explore producers and consumers, consumer groups, delivery semantics and durability
Install and configure Zookeeper and a Kafka broker
Running The Collection Tier (Part II – Sending Data)14:14
Send the collected data to Message Queuing Tier (Kafka) via Spring Cloud Stream, Kafka Binder API.
Introduce Spring Cloud Stream goal and architecture
Discuss about message binders, especially the Kafka Binder API via suggestive diagrams
Follow the Code for sending the collected data to the Message Queuing Tier
Dissecting the Data Access Tier18:18
Cover the main aspects of a Data Access Tier such as writing/reading the analyzed data to/from a long-term storage, in-memory databases/data-grids and memory. Discuss about caching strategies. Cover static and dynamic filtering depending on protocol.
See the overview of the Data Access Tier by answering to the question "what we can do with the analyzed data?"
Write and read the analyzed data to/from a long-term storage, in-memory databases/data-grids and memory
Cover caching strategies along with static and dynamic filtering depending on protocol
Introducing Our Data Access Tier – MongoDB11:13
Introduce MongoDB main headlines, justifying this election and prepare a MongoDB instance ready to go.
Learn MongoDB - What is it, why to use it and when to use it
Explore terminology, relational vs. document based, capped collection and scaling
Install and configure a localhost instance of MongoDB server and MongoDB Compass
Exploring Spring Reactive24:48
Clarify what is "reactive programming" and "reactive streams". Introduce Spring Reactive. Coding the MongoDB and Spring Reactive interaction.
Explain "reactive programming" and "reactive streams"
Introduce Spring Reactive Mono, Flux, WebFlux API and Spring Reactive Repositories via snippets of code
Know how to tie up MongoDB and Spring Reactive at code level via the ReactiveMongoTemplate API
Exposing the Data Access Tier in Browser9:45
Focus on implementing the UI part. The end-user or client is a HTML -JS based webpage capable to connect via Server Sent Events protocol to a reactive endpoint exposed via the Spring Reactive Flux API. Cover a bunch of communication patterns used in this situation.
Explain the theoretical headlines meant to clarify what we will do
Implement the UI part at code level
Discuss about publish-subscribe, RMI/RPC, Simple Messaging and Data Sync communication patterns
Diving into the Analysis Tier19:08
General overview of Analysis Tier. Cover main headlines and goals of this tier in a data streaming pipeline.
Explore the Continuous Query Model. specific to stream-processors
Explain why the Analysis Tier should run in a distributed fashion and touching high-level architectures of Apache Spark, Storm, Samza and Flink
Discover main features of a streaming process
Streaming Algorithms For Data Analysis29:13
Discover how the specific streaming algorithms looks like and have a flavor of the problems that these algorithms tries to solve. Theoretical cover four notorious streaming algorithms.
Talk about data stream query types and stream mining constrains
Explaining stream and event time. Introducing the window of data concept
Explore the concepts of Reservoir Sampling, HyperLogLog, Count-Min Sketch and Bloom Filter streaming algorithms
Introducing Our Analysis Tier – Apache Spark18:19
The goal of this video is like a check in list of Apache Spark headlines and to givea high-level overview of what Apache Spark is and how it works.
Understand what is Apache Spark and why to elect it
Know terminology, high-level architecture, Spark stack and Spark job architecture
Introduce RDDs, DataFrames, Datasets, checkpointing and monitoring
Plug-in Spark Analysis Tier to Our Pipeline9:47
Plug-in Apache Spark in our data streaming pipeline. More precisely, place the Analysis Tier (Spark) between Message Queuing Tier (Kafka) and Data Access Tier (MongoDB).
Cover aspects of running Spark on Windows
Write a Spark based kickoff application
Prepare this application to ingest data from Kafka and send it, after analysis, to MongoDB
Brief Overview of Spark RDDs25:07
Discover the RDD data structure specific to Apache Spark and be aware of its main characteristics. Implement the code lines needed to ingest Meetup RSVPs from Kafka in RDDs and write these RDDs in a MongoDB collection.
Introduce RDDs as a new data structure
Cover RDDs transformations actions and memory management
Write the code lines needed to pull RSVPs from Kafka to RDDs and sending them to a MongoDB collection
Spark Streaming28:37
Grasp a comprehensive guide of Spark Streaming. Theoretical and practical aspects are interleaved in order to cover Discretized Stream and Windowing as the two main headlines.
Cover theoretical part of DStreams, Receiver Thread, Windowing and Checkpointing
Write an application to pull RSVPs from Kafka to DStreams and send these DStreams to a MongoDB collection
Write an application to count RSVPs in a window length of 30 seconds with sliding interval of 5 seconds
DataFrames, Datasets and Spark SQL22:14
Tackle Spark SQL headlines, cover the powerful DataFrame and Dataset data structures via a comparison with RDDs and several examples, and write an application based on Spark SQL.
Have a brief overview of Spark SQL and a comprehensive comparison of RDDs vs. DataFrames vs. Datasets
Introduce DataFrames and Datasets API via examples
Write an application for filtering RSVPs by Australia venue via Spark SQL
Spark Structured Streaming32:37
The focus here is on discovering Spark Structured Streaming and developing an application sample.
Cover Structured Streaming processing model. Explain concepts: unbounded input table, user query, result table, output mode and triggers.
Discover windowed grouped aggregations, watermarking, sources and sinks and checkpointing.
Write an application for counting RSVPs by guests number in a window of 4 minutes with a sliding of 2 minutes and a watermark of 1 minute
Machine Learning in 7 Steps20:50
Provide the main set of knowledge about the topic in a soft-technical language and easy to assimilate.
Introduce Machine Learning concept via an example
Loop over the 7 steps meant to shape the big picture of how Machine Learning should tackle real problems
Have a final overview of Machine Learning and some Spark hints
MLlib (Spark ML)25:17
Spark MLlib (or Spark ML) is the Spark library for Machine Learning. The aim of this video is to discover all the main headlines of a Spark ML Pipeline. Implement an ML Pipeline for the House Price Forecast System discussed in the previous video.
Introduce Spark MLlib (Spark ML) main concept, Spark ML Pipeline, and see how data is flowing through an ML Pipeline
Cover Spark MLlib (Spark ML) operations: transformers, estimators, evaluators, etc.
Dissect Spark Pipeline and PipelineModel APIs and use them to Implement an ML Pipeline For The House Price Forecast System
Spark ML and Structured Streaming23:46
Combine the power of Spark ML and Structured Streaming in an example that trains a Logistic Regression model offline and later scoring online. Explore an example of online training and scoring via the RDD API. Discuss about the unreleased Streaming ML concept.
Introduce the Logistic Regression algorithm used in the further applications
Develop an application that trains the model offline and scores online on the Meetup RSVPs stream
Develop an application that trains and scores online on the Meetup RSVPs stream via the RDDs API
Spark GraphX6:41
Bring into discussion Spark GraphX, the Spark library dedicated to graphs and graphs-parallel computation.
Cover Spark GraphX headlines
Cover Spark GraphX API headlines
See a simple example
Fault Tolerance (HML)27:59
Provide the argumentation for choosing logging against checkpointing as the fault tolerance mechanism in streaming, to dissect the RBML, SBML and HML architectures and to implement HML in our streaming pipeline.
Explain why logging is better than checkpointing in a streaming pipeline
Have a bunch of meaningful diagrams to dissect the flow of data through RBML, SBML and HML
Provide the coding session for adding HML in our streaming pipeline via Spring Reactive and MongoDB
Kafka Connect4:19
The goal here is to provide another implementation for the SBML part via the Debezium Connector for MongoDB.
Get a Kafka Connect brief overview
Explore Debezium Connector for MongoDB brief overview
Understand theoretical aspects of implementing SBML logger with Debezium Connector For MongoDB
Securing Communication between Tiers10:18
Secure the communication between the Collection and the Message Queuing tiers and between the Analysis and the Message Queuing tiers.
Explore secure communication between Collection and Message Queuing tiers via SSL
Secure communication between Analysis and Message Queuing tiers via SSL.
Point SSL for Kafka inter-broker communication
Test Your Knowledge

The Course Overview2:18
This video provides an overview of the entire course.
Using Spark Transformations to Defer Computations to a Later Time4:32
In this video, we will use Spark transformations to defer computations to a later time.
Understand spark DAG creation
Execute DAG by issuing action
Defer decision about starting job until the last possible moment
Avoiding Transformations3:52
In this video, we will learn how to avoid transformations.
Understand groupBy API
Use cache() function
Avoid skewed partitions
Using reduce and reduceByKey to Calculate Results5:06
In this video, we will using reduce and reduceByKey to calculate results.
Understand reduce behavior
Use reduce() function
Use reduceByKey() function
Performing Actions That Trigger Computations5:17
In this video, we will understand what an action can be in Spark.
Get a walkthrough of all the actions
Perform tests
Reusing the Same RDD for Different Actions3:41
In this video, we reuse the same RDD for different actions.
Minimize execution time by reuse of RDD
Introduce caching
Perform tests
Delve into Spark RDDs Parent/Child Chain6:12
In this video, we will delve into Spark RDDs parent/child chain.
Learn how to extend RDD
Chain the new RDD with a parent
Test our custom RDD
Using RDD in an Immutable Way3:18
In this video, we will use RDD in an immutable way.
Understand DAG immutability
Create two leaves from one root RDD
Examine results from both leaves
Using DataFrame Operations to Transform It3:29
In this video, we will use DataFrame operations to transform the RDD.
Understand DataFrame immutability
Create two leaves from one root DataFrame
Add a new column by issuing transformation
Immutability in the Highly Concurrent Environment4:37
In this video, we will learn about immutability in the high-concurrent environment.
Understand cons of mutable collections
Create two threads that simultaneously modify mutable collection
Understand the reasoning about a concurrent program
Using Dataset API in an Immutable Way2:45
In this video, we will use Dataset API in the immutable way.
Understand dataset immutability
Create two leaves from one root dataset
Add new columns by issuing transformation
Detecting a Shuffle in a Processing4:42
In this video, we will learn to detect a shuffle in processing.
Load randomly partitioned data
Issue re-partition using meaningful partition key
Understand that shuffle occurs by explaining query
Testing Operations That Cause Shuffle in Apache Spark4:12
In this video, we will test operations that cause shuffle in Apache Spark.
Use join for two DataFrames
Use two DataFrames that are partitioned differently
Test join that causes shuffle
Changing Design of Jobs with Wide Dependencies3:12
In this video, we will change the design of jobs with wide dependencies.
Re-partition dataFrames using common partition key
Understand join with pre-partitioned data
Understand why we avoided shuffle
Using keyBy() Operations to Reduce Shuffle3:36
In this video, we will use keyBy() operation to reduce shuffle.
Load randomly partitioner data
Learn to pre-partition data in a meaningful way
Leverage keyBy() function
Using Custom Partitioner to Reduce Shuffle3:28
In this video, we will use a custom partitioner to reduce shuffle.
Implement custom partitioner
Use partitioner with partitionBy method on Spark
Validate that our data was partitioned properly
Saving Data in Plain Text4:57
In this video, we will learn to save data in plain text.
Load plain text data
Test your data
Leveraging JSON as a Data Format4:10
In this video, we will learn to save data in plain text.
Load plain text data
Test your data
Tabular Formats – CSV3:40
In this video, we will learn about tabular formats.
Load CSV data
Test your data
Using Avro with Spark4:15
In this video, we will learn to use Avro with Spark.
Understand how to save data in an Avro
Load Avro data
Test your data
Columnar Formats – Parquet3:44
In this video, we will learn to use columnar formats like Parquet.
Learn how to save data in Parquet
Load Parquet data
Test your data
Available Transformations on Key/Value Pairs4:23
In this video, we will learn the available transformations on key/value pairs.
Use countByKey()
Examine your data
Using aggregateByKey Instead of groupBy()4:53
In this video, we will learn to use aggregateByKey instead of groupBy().
Understand why we should not use groupByKey
Learn what aggregateByKey gives us
Implement logic using aggregateByKey
Actions on Key/Value Pairs3:14
In this video, we will learn to use Actions on Key/Value pairsUsing Accumulators.
Examine actions on K/V pairs
Use collect
Examine output for K/V RDD
Available Partitioners on Key/Value Data4:25
In this video, we will look at the available partitioners on key/value data.
Examine HashPartitioner
Examine RangePartitioner
Test your data
Implementing Custom Partitioner4:55
In this video, we will learn to implement custom partitioner.
Learn to implement range partitioning
Test our partitioner
Separating Logic from Spark Engine – Unit Testing4:15
In this video, we will be creating components with logic. Then we will perform unit testing of the components and lastly we will use case class from model.
Create component with logic
Perform unit testing of the component
Use case class from model
Integration Testing Using SparkSession3:26
In this video, we will acquire skills to perform integration testing using SparkSession.
Leverage SparkSession to integration testing
Use Unit tested component
Mocking Data Sources Using Partial Functions4:09
This video will take you through mocking data sources using partial functions.
Create spark component that read from Hive
Mocking component
Test mocked component
Using ScalaCheck for Property-Based Testing3:53
In this video, we will be using ScalaCheck for property-based testing.
Apply property based testing
Create property based test
Testing in Different Versions of Spark3:26
In the last video of the section, we will learn to apply the test in different versions of spark.
Resisting component to work with Spark pre-2.x
Mock testing pre-2.x
RDD mock testing
Creating Graph from Datasource3:33
In this video, we will learn to create a graph from datasource.
Create loader component
Revisit graph file format
Load Spark graph from file
Using Vertex API5:01
In this video, we will learn to use Vertex API.
Construct graph Using Vertex
Use Vertex
Leverage Vertex transformations
Using Edge API3:00
In this video, we will learn to use Edge API.
Construct graph using Edge
Use Edge
Leverage Edge transformations
Calculate Degree of Vertex3:57
In this video, we will learn to calculate degree of Vertex.
Calculate degree
In-Degree
Out-Degree
Calculate Page Rank4:49
In this video, we will learn to calculate Page Rank.
Load data about users
Load data about followers
Use PageRank to calculate rank of users
Test Your Knowledge

The Course Overview2:59
This video will give you an overview about the course.
Eager Computations: Lazy Evaluation4:55
In this video, we will be solving eager computations with lazy evaluation.
What is a Transformation?
Why are my transformations not executed?
Trigger transformations using actions
Caching Values: In-Memory Persistence6:39
In this video, we will be solving slow-running jobs by using in-memory Persistence.
Problem with data re-computation
Use the cache() function
Use the persistance() function
Unexpected API Behavior: Picking the Proper RDD API4:24
In this video, we will be alleviating unexpected API behavior by picking the proper RDD API.
How to speed up transform/filter queries
The ordering of operators matters
Performance test of our improvement
Wide Dependencies: Using Narrow Dependencies4:53
We will learn to reduce wide dependencies using narrow dependencies.
What is a narrow dependency?
What is a wide dependency?
How to avoid wide dependencies?
Making Computations Parallel: Using Partitions5:41
In this video, we will learn to solve slow jobs using partitions.
Examine the number of partitions of RDD
Use the coalesce() method
Use the repartition() method
Defining Robust Custom Functions: Understanding User-Defined Functions4:43
In this video, we will learn the technique of extending the DataFrame API with UDF functions.
Use the DataFrame API
Create a UDF Function
Register UDF for a usage in the DF API
Logical Plans Hiding the Truth: Examining the Physical Plans5:58
In this video, we will be understanding jobs by examining physical and logical plans
Examine logical and physical plans of DF
Examine execution plan of RDDs
Slow Interpreted Lambdas: Code Generation Spark Optimization4:17
In this video, we will be replacing slow interpreted lambdas using Spark Optimizer.
Delve into the Optimizer class
Bytecode generation
Avoid Wrong Join Strategies: Using a Join Type Based on Data Volume7:24
In this video, we will learn to avoid wrong join strategies by using a join type based on data volume.
Understand inner join
Understand left/right join
Understand outer join
Slow Joins: Choosing an Execution Plan for Join5:37
In this video, we will discover techniques to solve the slow joins problem by choosing the proper execution plan.
Use custom partitioner during join
How to join a smaller dataset with a bigger one?
Distributed Joins Problem: DataFrame API5:08
In this video, we will perform distributed joins using DataFrame.
Use DataFrame to perform join
Perform inner join
Perform outer/left/right join
TypeSafe Joins Problem: The Newest DataSet API3:39
In this video, we will perform distributed joins using DataSet.
How to perform type-safe joins?
Use DataSet to join
Minimizing Object Creation: Reusing Existing Objects5:04
In this video, we will make jobs memory efficient by reusing existing objects.
How to minimize object creation
Use aggregateByKey
Use mutable state passed to Spark API
Iterating Transformations – The mapPartitions() Method3:49
In this video, we will iterate over specific partitions by using mapPartition().
Understand what can be inside of a partition
Perform operations partition wise using mapPartitions
Slow Spark Application Start: Reducing Setup Overhead4:37
In this video, we will learn to debug Spark Start by introducing accumulators.
Use accumulators
Add metrics using accumulators
Performing Unnecessary Recomputation: Reusing RDDs4:06
In this video, we will explore ways to avoid recomputation with RDDs multiple times by using caching.
Use Spark API to favour resisability of RDDs
Use StorageLevel
Use Checkpointing
Repeating the Same Code in Stream Pipeline: Using Sources and Sinks6:33
In this video, we will create replaceable and reusable sink And source.
Detect Missing Values (NaN)
Leverage the IsNull() helper method
Install pandas
Long Latency of Jobs: Understanding Batch Internals5:03
In this video, we will be reducing the time of batch jobs using Spark micro-batch approach.
Make NaN meaningful to processing
Replacing NaN with scalar value
Fault Tolerance: Using Data Checkpointing3:23
In this video, we will make jobs fault tolerant by introducing a checkpoint mechanism.
Define the Ad Validator module
Understand what a backward fill is
Understand what a forward fill is
Maintaining Batch and Streaming: Using Structured Streaming Pros4:03
In this video, we will learn to create one code base for stream and batch using structured streaming.
Handle an outlier by replacing it with meaningful name
Implement logic using replace
Test Your Knowledge

Requirements

To pick up this course, you don’t need to be an expert with Spark. Customers should be familiar with Java or Scala.

Description

Video Learning Path Overview

A Learning Path is a specially tailored course that brings together two or more different topics that lead you to achieve an end goal. Much thought goes into the selection of the assets for a Learning Path, and this is done through a complete understanding of the requirements to achieve a goal.

Today, organizations have a difficult time working with large datasets. In addition, big data processing and analyzing need to be done in real time to gain valuable insights quickly. This is where data streaming and Spark come in.

In this well thought out Learning Path, you will not only learn how to work with Spark to solve the problem of analyzing massive amounts of data for your organization, but you’ll also learn how to tune it for performance. Beginning with a step by step approach, you’ll get comfortable in using Spark and will learn how to implement some practical and proven techniques to improve particular aspects of programming and administration in Apache Spark. You’ll be able to perform tasks and get the best out of your databases much faster.

Moving further and accelerating the pace a bit, You’ll learn some of the lesser known techniques to squeeze the best out of Spark and then you’ll learn to overcome several problems you might come across when working with Spark, without having to break a sweat. The simple and practical solutions provided will get you back in action in no time at all!

By the end of the course, you will be well versed in using Spark in your day to day projects.

Key Features

From blueprint architecture to complete code solution, this course treats every important aspect involved in architecting and developing a data streaming pipeline
Test Spark jobs using the unit, integration, and end-to-end techniques to make your data pipeline robust and bulletproof.
Solve several painful issues like slow-running jobs that affect the performance of your application.

Author Bios

Anghel Leonard is currently a Java chief architect. He is a member of the Java EE Guardians with 20+ years’ experience. He has spent most of his career architecting distributed systems. He is also the author of several books, a speaker, and a big fan of working with data.
Tomasz Lelek is a Software Engineer, programming mostly in Java and Scala. He has been working with the Spark and ML APIs for the past 5 years with production experience in processing petabytes of data. He is passionate about nearly everything associated with software development and believes that we should always try to consider different solutions and approaches before solving a problem. Recently he was a speaker at conferences in Poland, Confitura and JDD (Java Developers Day), and at Krakow Scala User Group. He has also conducted a live coding session at Geecon Conference. He is a co-founder of initlearn, an e-learning platform that was built with the Java language. He has also written articles about everything related to the Java world.

Who this course is for:

An Application Developer, Data Scientist, Analyst, Statistician, Big data Engineer, or anyone who has some experience with Spark will feel perfectly comfortable in understanding the topics presented. They usually work with large amounts of data on a day to day basis. They may or may not have used Spark, but it’s an added advantage if they have some experience with the tool.

Tuning Apache Spark: Powerful Big Data Processing Recipes

What you'll learn

Explore related topics

Course content

Data Stream Development with Apache Spark, Kafka, and Spring Boot27 lectures • 7hr 51min

Apache Spark: Tips, Tricks, & Techniques36 lectures • 2hr 26min

Troubleshooting Apache Spark21 lectures • 1hr 43min

Requirements

Description

Who this course is for: