Data Stream Development via Spark, Kafka and Spring Boot

Name: Data Stream Development via Spark, Kafka and Spring Boot
Rating: 4.5 (136 reviews)

Handle high volumes of data at high speed. Architect and implement an end-to-end data streaming pipeline

Created by Packt Publishing

Last updated 2/2019

English

English [Auto]

What you'll learn

Attain a solid foundation in the most powerful and versatile technologies involved in data streaming: Apache Spark and Apache Kafka
Form a robust and clean architecture for a data streaming pipeline
Implement the correct tools to bring your data streaming architecture to life
Isolate the most problematic tradeoff for each tier involved in a data streaming pipeline
Query, analyze, and apply machine learning algorithms to collected data
Display analyzed pipeline data via Google Maps on your web browser
Discover and resolve difficulties in scaling and securing data streaming applications

Course content

5 sections • 27 lectures • 7h 51m total length

The Course Overview (6:22)
This video provides an overview of the entire course.
Discovering the Data Streaming Pipeline Blueprint Architecture (17:37)
Introduce data streaming fundamentals and shape the data streaming blueprint architecture.
• Cover the big picture of data streaming
• Talk about classifying, securing, and scaling streaming systems
• Lay out the data streaming blueprint architecture in a diagram
Analyzing Meetup RSVPs in Real-Time (5:58)
Introduce the Meetup RSVPs stream and choose the technologies for implementing the data streaming blueprint architecture. Also review alternative technologies and how to decide between them.
• Access the Meetup RSVP stream online
• Choose the proper technology for each tier of the data streaming blueprint architecture
• Explore the alternative technologies per tier and the criteria for choosing between them

Running the Collection Tier (Part I – Collecting Data) (20:39)
After a brief overview of the Collection Tier, we have a general discussion about the protocols, interaction patterns, and issues involved in writing a Collection Tier.
• Start with a brief overview of connecting to the data source, push and pull mechanisms, and lightweight business logic
• Continue with protocols and interaction patterns
• Finish with the problem of scaling the Collection Tier and WebSocket, which is caused by the direct and persistent connection
Collecting Data Via the Stream Pattern and Spring WebSocketClient API (6:50)
Develop the part of the Collection Tier that ingests Meetup RSVPs via the Spring WebSocketClient API.
• Get a brief overview of the WebSocket concept
• Introduce the Spring WebSocketClient API and its role in the Collection Tier
• Implement the code
Explaining the Message Queuing Tier Role (6:18)
Explain why this tier, which apparently complicates and slows down the data streaming pipeline, is needed.
• Tackle the backpressure issue
• Understand the data durability issue
• Learn about the data delivery semantics issue
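As a rough illustration of the backpressure issue named above, the sketch below shows a fast producer filling a bounded in-memory buffer before a slow consumer reads from it. It is plain Python and not Kafka or course code; the buffer size, the `offer` helper, and the `rsvp-N` message names are all hypothetical.

```python
import queue

# Hypothetical bounded buffer standing in for a queuing tier (not Kafka itself).
buffer = queue.Queue(maxsize=3)

def offer(message):
    """Try to enqueue without blocking; report whether the buffer accepted it."""
    try:
        buffer.put_nowait(message)
        return True
    except queue.Full:
        # The buffer is full: the producer has to slow down, retry, or drop.
        return False

# A fast producer emits five messages before the consumer reads any.
accepted = [offer(f"rsvp-{i}") for i in range(5)]

# The slow consumer drains one message, which frees a slot for the producer.
first = buffer.get_nowait()
accepted_after_drain = offer("rsvp-5")

print(accepted, first, accepted_after_drain)
```

A durable message queue placed between the tiers plays the role of this buffer at a much larger scale, so the producer rarely has to be refused.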
Introducing Our Message Queuing Tier – Apache Kafka (24:58)
Apache Kafka is a powerful but complex technology. This video is a comprehensive introduction to the main Kafka concepts.
• Cover the overview, terminology, high-level architecture, topics, and partitions
• Explore producers and consumers, consumer groups, delivery semantics, and durability
• Install and configure Zookeeper and a Kafka broker
Running the Collection Tier (Part II – Sending Data) (14:14)
Send the collected data to the Message Queuing Tier (Kafka) via the Spring Cloud Stream Kafka Binder API.
• Introduce the goal and architecture of Spring Cloud Stream
• Discuss message binders, especially the Kafka Binder API, via illustrative diagrams
• Follow the code for sending the collected data to the Message Queuing Tier

Dissecting the Data Access Tier (18:18)
Cover the main aspects of a Data Access Tier, such as writing and reading the analyzed data to and from long-term storage, in-memory databases/data grids, and memory. Discuss caching strategies. Cover static and dynamic filtering depending on the protocol.
• Get an overview of the Data Access Tier by answering the question "What can we do with the analyzed data?"
• Write and read the analyzed data to and from long-term storage, in-memory databases/data grids, and memory
• Cover caching strategies along with static and dynamic filtering depending on the protocol
Introducing Our Data Access Tier – MongoDB (11:13)
Introduce the main points of MongoDB, justify this choice, and prepare a ready-to-use MongoDB instance.
• Learn what MongoDB is, why to use it, and when to use it
• Explore terminology, relational vs. document-based models, capped collections, and scaling
• Install and configure a localhost instance of MongoDB server and MongoDB Compass
Exploring Spring Reactive (24:48)
Clarify what "reactive programming" and "reactive streams" are. Introduce Spring Reactive. Code the interaction between MongoDB and Spring Reactive.
• Explain "reactive programming" and "reactive streams"
• Introduce the Spring Reactive Mono, Flux, and WebFlux APIs and Spring Reactive Repositories via code snippets
• Learn how to tie MongoDB and Spring Reactive together in code via the ReactiveMongoTemplate API
Exposing the Data Access Tier in Browser (9:45)
Focus on implementing the UI part. The end user or client is an HTML/JS-based web page that connects via the Server-Sent Events protocol to a reactive endpoint exposed via the Spring Reactive Flux API. Cover several communication patterns used in this situation.
• Explain the theoretical points that clarify what we will do
• Implement the UI part in code
• Discuss the publish-subscribe, RMI/RPC, Simple Messaging, and Data Sync communication patterns

Diving into the Analysis Tier (19:08)
General overview of the Analysis Tier. Cover the main points and goals of this tier in a data streaming pipeline.
• Explore the Continuous Query Model, which is specific to stream processors
• Explain why the Analysis Tier should run in a distributed fashion and touch on the high-level architectures of Apache Spark, Storm, Samza, and Flink
• Discover the main features of a streaming process
Streaming Algorithms For Data Analysis (29:13)
Discover what streaming algorithms look like and get a flavor of the problems they try to solve. Cover four well-known streaming algorithms in theory.
• Talk about data stream query types and stream mining constraints
• Explain stream time and event time, and introduce the concept of a window of data
• Explore the concepts of the Reservoir Sampling, HyperLogLog, Count-Min Sketch, and Bloom Filter streaming algorithms
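To give a feel for two of the algorithms named above, here is a minimal Python sketch of Reservoir Sampling (the classic Algorithm R) and a Bloom Filter. It is not taken from the course; the bit-array size, the number of hashes, the use of SHA-256 as the hash function, and the `m-N` member IDs are illustrative assumptions.

```python
import hashlib
import random

def reservoir_sample(stream, k, rng=random):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:                 # happens with probability k / (i + 1)
                sample[j] = item
    return sample

class BloomFilter:
    """Set membership with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 with a seed (assumption).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos] for pos in self._positions(item))

# Usage: sample 5 items from a stream, and remember which member IDs were seen.
sample = reservoir_sample(range(10_000), k=5, rng=random.Random(42))
seen = BloomFilter()
for member_id in ("m-1", "m-2", "m-3"):
    seen.add(member_id)
```

Both structures use memory that does not grow with the length of the stream, which is the constraint streaming algorithms are built around.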
Introducing Our Analysis Tier – Apache Spark (18:19)
The goal of this video is to serve as a checklist of the main Apache Spark points and to give a high-level overview of what Apache Spark is and how it works.
• Understand what Apache Spark is and why to choose it
• Learn the terminology, high-level architecture, Spark stack, and Spark job architecture
• Introduce RDDs, DataFrames, Datasets, checkpointing, and monitoring
Plug-in Spark Analysis Tier to Our Pipeline (9:47)
Plug Apache Spark into our data streaming pipeline. More precisely, place the Analysis Tier (Spark) between the Message Queuing Tier (Kafka) and the Data Access Tier (MongoDB).
• Cover aspects of running Spark on Windows
• Write a Spark-based kickoff application
• Prepare this application to ingest data from Kafka and send it, after analysis, to MongoDB
Brief Overview of Spark RDDs (25:07)
Discover the RDD data structure specific to Apache Spark and learn its main characteristics. Implement the code needed to ingest Meetup RSVPs from Kafka into RDDs and write these RDDs to a MongoDB collection.
• Introduce RDDs as a new data structure
• Cover RDD transformations, actions, and memory management
• Write the code needed to pull RSVPs from Kafka into RDDs and send them to a MongoDB collection
Spark Streaming (28:37)
Get a comprehensive guide to Spark Streaming. Theoretical and practical aspects are interleaved to cover Discretized Streams and Windowing as the two main topics.
• Cover the theory of DStreams, the Receiver Thread, Windowing, and Checkpointing
• Write an application to pull RSVPs from Kafka into DStreams and send these DStreams to a MongoDB collection
• Write an application to count RSVPs in a window length of 30 seconds with a sliding interval of 5 seconds
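The windowed count in the last bullet can be pictured with a small pure-Python sketch. It does not use the Spark API and is not the course application; it assumes event times are given in seconds since the stream started, and the `sliding_window_counts` name and the sample times are hypothetical.

```python
def sliding_window_counts(event_times, window=30, slide=5):
    """Every `slide` seconds, count the events in the last `window` seconds.

    Returns (window_end, count) pairs, where each window covers
    (window_end - window, window_end].
    """
    if not event_times:
        return []
    last = max(event_times)
    results = []
    end = slide
    while True:
        start = end - window
        count = sum(1 for t in event_times if start < t <= end)
        results.append((end, count))
        if end >= last:          # every event has been covered at least once
            break
        end += slide
    return results

# Hypothetical RSVP arrival times, in seconds since the stream started.
counts = sliding_window_counts([1, 2, 7, 12, 33])
print(counts)
```

Because the window (30 seconds) is longer than the slide (5 seconds), consecutive windows overlap, so the same RSVP is counted in several consecutive results until it falls out of the window.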
DataFrames, Datasets and Spark SQL (22:14)
Tackle the main points of Spark SQL, cover the powerful DataFrame and Dataset data structures via a comparison with RDDs and several examples, and write an application based on Spark SQL.
• Get a brief overview of Spark SQL and a comprehensive comparison of RDDs vs. DataFrames vs. Datasets
• Introduce the DataFrame and Dataset APIs via examples
• Write an application for filtering RSVPs by venues in Australia via Spark SQL
Spark Structured Streaming (32:37)
The focus here is on discovering Spark Structured Streaming and developing a sample application.
• Cover the Structured Streaming processing model and explain its concepts: unbounded input table, user query, result table, output mode, and triggers
• Discover windowed grouped aggregations, watermarking, sources and sinks, and checkpointing
• Write an application for counting RSVPs by number of guests in a window of 4 minutes with a slide of 2 minutes and a watermark of 1 minute
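The window, slide, and watermark settings in the last bullet can be sketched in plain Python as follows. This is a simplified model and not Spark code: it assumes integer event times in seconds, windows aligned to multiples of the slide, and a watermark that advances after every event (Spark advances it per trigger). The grouping by number of guests is left out, and the function names and sample times are hypothetical.

```python
def windows_for(event_time, window=240, slide=120):
    """Return every [start, end) window, aligned to `slide`, that contains event_time."""
    latest_start = (event_time // slide) * slide
    starts = range(latest_start, latest_start - window, -slide)
    return sorted((s, s + window) for s in starts if s <= event_time < s + window)

def count_by_window(event_times, window=240, slide=120, delay=60):
    """Count events per window, dropping events whose window the watermark has passed."""
    counts = {}
    max_seen = None
    for t in event_times:                      # processed in arrival order
        max_seen = t if max_seen is None else max(max_seen, t)
        watermark = max_seen - delay           # simplified: updated per event
        for w in windows_for(t, window, slide):
            if w[1] <= watermark:
                continue                       # too late for this window: dropped
            counts[w] = counts.get(w, 0) + 1
    return counts

# Hypothetical event times in arrival order; the final event (200) arrives late.
counts = count_by_window([130, 250, 370, 700, 200])
print(windows_for(250))
print(counts)
```

With a 4-minute window and a 2-minute slide, every event falls into two overlapping windows. The event at 200 seconds arrives after the watermark has moved to 640 seconds, so both of its windows are already closed and it is not counted.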
Machine Learning in 7 Steps (20:50)
Provide the core knowledge about the topic in lightly technical, easy-to-assimilate language.
• Introduce the Machine Learning concept via an example
• Walk through the 7 steps that shape the big picture of how Machine Learning should tackle real problems
• Get a final overview of Machine Learning and some Spark hints
MLlib (Spark ML) (25:17)
Spark MLlib (or Spark ML) is the Spark library for Machine Learning. The aim of this video is to discover the main points of a Spark ML Pipeline and to implement an ML Pipeline for the House Price Forecast System discussed in the previous video.
• Introduce the main concept of Spark MLlib (Spark ML), the Spark ML Pipeline, and see how data flows through an ML Pipeline
• Cover Spark MLlib (Spark ML) operations: transformers, estimators, evaluators, etc.
• Dissect the Spark Pipeline and PipelineModel APIs and use them to implement an ML Pipeline for the House Price Forecast System
Spark ML and Structured Streaming (23:46)
Combine the power of Spark ML and Structured Streaming in an example that trains a Logistic Regression model offline and later scores online. Explore an example of online training and scoring via the RDD API. Discuss the unreleased Streaming ML concept.
• Introduce the Logistic Regression algorithm used in the applications that follow
• Develop an application that trains the model offline and scores online on the Meetup RSVPs stream
• Develop an application that trains and scores online on the Meetup RSVPs stream via the RDD API
Spark GraphX (6:41)
Discuss Spark GraphX, the Spark library dedicated to graphs and graph-parallel computation.
• Cover the main points of Spark GraphX
• Cover the main points of the Spark GraphX API
• See a simple example

Fault Tolerance (HML) (27:59)
Argue for choosing logging over checkpointing as the fault tolerance mechanism in streaming, dissect the RBML, SBML, and HML architectures, and implement HML in our streaming pipeline.
• Explain why logging is better than checkpointing in a streaming pipeline
• Use diagrams to dissect the flow of data through RBML, SBML, and HML
• Provide the coding session for adding HML to our streaming pipeline via Spring Reactive and MongoDB
Kafka Connect (4:19)
The goal here is to provide another implementation for the SBML part via the Debezium Connector for MongoDB.
• Get a brief overview of Kafka Connect
• Get a brief overview of the Debezium Connector for MongoDB
• Understand the theoretical aspects of implementing an SBML logger with the Debezium Connector for MongoDB
Securing Communication between Tiers (10:18)
Secure the communication between the Collection and the Message Queuing tiers and between the Analysis and the Message Queuing tiers.
• Explore secure communication between the Collection and Message Queuing tiers via SSL
• Secure communication between the Analysis and Message Queuing tiers via SSL
• Point out SSL for Kafka inter-broker communication

Requirements

Having knowledge of the Spring framework will be an added benefit.

Description

Today, organizations have a difficult time working with huge numbers of datasets. In addition, data needs to be processed and analyzed in real time to yield insights. This is where data streaming comes in. As big data is no longer a niche topic, the skill set to architect and develop robust data streaming pipelines is a must for all developers. They also need to think about the entire pipeline, including the trade-offs for every tier.

This course starts by explaining the blueprint architecture for developing a completely functional data streaming pipeline and installing the technologies used. With the help of live coding sessions, you will get hands-on with architecting every tier of the pipeline. You will also handle specific issues encountered working with streaming data. You will input a live data stream of Meetup RSVPs that will be analyzed and displayed via Google Maps.

By the end of the course, you will have built an efficient data streaming pipeline and will be able to analyze its various tiers, ensuring a continuous flow of data.

About the Author

Anghel Leonard is currently a Java chief architect. He is a member of the Java EE Guardians with 20+ years’ experience. He has spent most of his career architecting distributed systems. He is also the author of several books, a speaker, and a big fan of working with data.

Who this course is for:

This course is perfect for Java developers and architects who want to design and write data streaming pipelines.

Course content

Introducing Data Streaming Architecture • 3 lectures • 30min

Deployment of Collection and Message Queuing Tiers • 5 lectures • 1hr 13min

Proceeding to the Data Access Tier • 4 lectures • 1hr 4min

Implementing the Analysis Tier • 12 lectures • 4hr 22min

Mitigate Data Loss between Collection, Analysis and Message Queuing Tiers • 3 lectures • 43min
