
This video provides an overview of the entire course.
Introduce data streaming fundamentals and shape the data streaming blueprint architecture
• Cover the big picture of data streaming
• Talk about classifying, securing and scaling streaming systems
• Shape the data streaming blueprint architecture via a diagram
Introduce the Meetup RSVPs stream and choose the technologies for implementing the data streaming blueprint architecture. Also look at alternative technologies and how to decide between them.
• Access the Meetup RSVP stream online
• Choose the proper technology for each tier of data streaming blueprint architecture
• Explore the alternative technologies per tier and criteria for choosing between them properly
After a brief overview of the Collection Tier, we have a general discussion about protocols, interaction patterns and issues involved in writing a Collection Tier.
Start with a brief overview of connecting to the source of data, push and pull mechanisms and lightweight business logic
Continue with protocols and interaction patterns
Finish with the problem of scaling the Collection Tier and WebSocket, caused by the direct and persistent connection
Develop the Collection Tier part for ingesting Meetup RSVPs via Spring WebSocketClient API
Brief overview of WebSocket concept
Introduce Spring WebSocketClient API and its role in Collection Tier
Implement the code
Explain why this tier, that apparently complicates and slows down the data streaming pipeline, is needed.
Tackle the backpressure issue
Understand the data durability issue
Learn about the data delivery semantics issue
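As a rough illustration of the backpressure issue listed above, here is a minimal Python sketch (not taken from the course, which is built on Java and Spring). It assumes a bounded in-memory buffer standing in for the Message Queuing Tier and a "drop when full" policy; the helper name `offer_all` and the sample RSVP identifiers are hypothetical.

```python
import queue


def offer_all(buffer, messages):
    """Try to enqueue each message without blocking.

    Returns (accepted, dropped). Dropping is only one possible policy;
    blocking the producer or spilling to disk are common alternatives.
    """
    accepted, dropped = [], []
    for message in messages:
        try:
            buffer.put_nowait(message)  # raises queue.Full when the buffer is full
            accepted.append(message)
        except queue.Full:
            dropped.append(message)
    return accepted, dropped


# A buffer that holds at most 3 messages; the producer offers 5 before
# any consumer has had the chance to drain it.
buffer = queue.Queue(maxsize=3)
accepted, dropped = offer_all(
    buffer, ["rsvp-1", "rsvp-2", "rsvp-3", "rsvp-4", "rsvp-5"]
)
```

A durable, partitioned log such as Kafka sidesteps this kind of loss by persisting messages so that slow consumers can catch up later, which is one reason the tier is worth its extra cost.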
Apache Kafka is a powerful but complex technology. This video is a comprehensive introduction to the main Kafka concepts.
Cover an overview, terminology, high-level architecture, topics and partitions
Explore producers and consumers, consumer groups, delivery semantics and durability
Install and configure Zookeeper and a Kafka broker
Send the collected data to Message Queuing Tier (Kafka) via Spring Cloud Stream, Kafka Binder API.
Introduce Spring Cloud Stream goal and architecture
Discuss message binders, especially the Kafka Binder API, via illustrative diagrams
Follow the code for sending the collected data to the Message Queuing Tier
Cover the main aspects of a Data Access Tier such as writing/reading the analyzed data to/from a long-term storage, in-memory databases/data-grids and memory. Discuss caching strategies. Cover static and dynamic filtering depending on protocol.
See the overview of the Data Access Tier by answering the question "what can we do with the analyzed data?"
Write and read the analyzed data to/from a long-term storage, in-memory databases/data-grids and memory
Cover caching strategies along with static and dynamic filtering depending on protocol
Introduce the main MongoDB headlines, justify this choice and prepare a MongoDB instance ready to go.
Learn MongoDB: what it is, why to use it and when to use it
Explore terminology, relational vs. document based, capped collection and scaling
Install and configure a localhost instance of MongoDB server and MongoDB Compass
Clarify what "reactive programming" and "reactive streams" are. Introduce Spring Reactive. Code the MongoDB and Spring Reactive interaction.
Explain "reactive programming" and "reactive streams"
Introduce Spring Reactive Mono, Flux, WebFlux API and Spring Reactive Repositories via snippets of code
Know how to tie together MongoDB and Spring Reactive at code level via the ReactiveMongoTemplate API
Focus on implementing the UI part. The end-user or client is an HTML/JS-based web page capable of connecting via the Server-Sent Events protocol to a reactive endpoint exposed via the Spring Reactive Flux API. Cover several communication patterns used in this situation.
Explain the theoretical headlines meant to clarify what we will do
Implement the UI part at code level
Discuss publish-subscribe, RMI/RPC, Simple Messaging and Data Sync communication patterns
General overview of the Analysis Tier. Cover the main headlines and goals of this tier in a data streaming pipeline.
Explore the Continuous Query Model specific to stream processors
Explain why the Analysis Tier should run in a distributed fashion and touch on the high-level architectures of Apache Spark, Storm, Samza and Flink
Discover main features of a streaming process
Discover what specific streaming algorithms look like and get a flavor of the problems these algorithms try to solve. Cover four well-known streaming algorithms at a theoretical level.
Talk about data stream query types and stream mining constraints
Explain stream time and event time. Introduce the window of data concept
Explore the concepts of Reservoir Sampling, HyperLogLog, Count-Min Sketch and Bloom Filter streaming algorithms
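To give a flavor of two of the algorithms named above, the following is a minimal Python sketch, not the course's own material. It assumes the classic "Algorithm R" form of Reservoir Sampling and a deliberately simple Bloom Filter that derives its hash positions from seeded SHA-256 digests; the sizes, the class and function names, and the sample RSVP keys are illustrative assumptions.

```python
import hashlib
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = item     # replace with probability k / (i + 1)
    return reservoir


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive num_hashes positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


sample = reservoir_sample(range(1000), k=10, rng=random.Random(42))

seen = BloomFilter()
seen.add("rsvp-1")
```

Both structures use memory that is fixed up front, regardless of how many items flow past, which is the constraint that sets streaming algorithms apart from their batch counterparts.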
The goal of this video is to serve as a checklist of Apache Spark headlines and to give a high-level overview of what Apache Spark is and how it works.
Understand what Apache Spark is and why to choose it
Know terminology, high-level architecture, Spark stack and Spark job architecture
Introduce RDDs, DataFrames, Datasets, checkpointing and monitoring
Plug Apache Spark into our data streaming pipeline. More precisely, place the Analysis Tier (Spark) between the Message Queuing Tier (Kafka) and the Data Access Tier (MongoDB).
Cover aspects of running Spark on Windows
Write a Spark-based kickoff application
Prepare this application to ingest data from Kafka and send it, after analysis, to MongoDB
Discover the RDD data structure specific to Apache Spark and be aware of its main characteristics. Implement the code needed to ingest Meetup RSVPs from Kafka into RDDs and write these RDDs to a MongoDB collection.
Introduce RDDs as a new data structure
Cover RDD transformations, actions and memory management
Write the code needed to pull RSVPs from Kafka into RDDs and send them to a MongoDB collection
Follow a comprehensive guide to Spark Streaming. Theoretical and practical aspects are interleaved in order to cover Discretized Streams and Windowing as the two main headlines.
Cover theoretical part of DStreams, Receiver Thread, Windowing and Checkpointing
Write an application to pull RSVPs from Kafka to DStreams and send these DStreams to a MongoDB collection
Write an application to count RSVPs in a window length of 30 seconds with a sliding interval of 5 seconds
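The 30-second window with a 5-second sliding interval can be pictured with a small plain-Python sketch. It is only an illustration of the windowing arithmetic, not the Spark Streaming API: it assumes arrival timestamps in whole seconds, half-open windows `[end - 30, end)`, and a hypothetical helper named `sliding_window_counts`.

```python
def sliding_window_counts(timestamps, until_s, window_s=30, slide_s=5):
    """Count events per sliding window.

    A window is evaluated every `slide_s` seconds and covers the
    half-open interval [end - window_s, end). Returns {window_end: count}.
    """
    counts = {}
    for end in range(slide_s, until_s + 1, slide_s):
        start = end - window_s
        counts[end] = sum(1 for t in timestamps if start <= t < end)
    return counts


# Hypothetical RSVP arrival times, in seconds since the stream started.
arrivals = [1, 4, 6, 29, 31]
counts = sliding_window_counts(arrivals, until_s=35)
# Each RSVP is counted in up to 30 / 5 = 6 consecutive windows.
```

Because consecutive windows overlap, the same RSVP contributes to several results, which is why a short sliding interval over a long window gives a smoothed, frequently refreshed count.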
Tackle Spark SQL headlines, cover the powerful DataFrame and Dataset data structures via a comparison with RDDs and several examples, and write an application based on Spark SQL.
Have a brief overview of Spark SQL and a comprehensive comparison of RDDs vs. DataFrames vs. Datasets
Introduce DataFrames and Datasets API via examples
Write an application for filtering RSVPs by Australia venue via Spark SQL
The focus here is on discovering Spark Structured Streaming and developing an application sample.
Cover Structured Streaming processing model. Explain concepts: unbounded input table, user query, result table, output mode and triggers.
Discover windowed grouped aggregations, watermarking, sources and sinks and checkpointing.
Write an application for counting RSVPs by number of guests in a window of 4 minutes with a slide of 2 minutes and a watermark of 1 minute
Provide the core knowledge about the topic in lightly technical, easy-to-assimilate language.
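The interplay of a 4-minute window, a 2-minute slide and a 1-minute watermark can be sketched in plain Python. This is a simplified model written for illustration, not Spark Structured Streaming code: it assumes the watermark is recomputed on every event as the maximum event time seen minus 60 seconds (Spark updates it per trigger), and that a contribution is dropped once its window's end is at or before the watermark. All names are hypothetical.

```python
from collections import Counter

WINDOW_S = 240     # 4-minute window
SLIDE_S = 120      # 2-minute slide
WATERMARK_S = 60   # 1-minute tolerance for late events


def window_starts(event_time_s):
    """Start times of every sliding window that contains the event."""
    start = (event_time_s // SLIDE_S) * SLIDE_S
    starts = []
    while start > event_time_s - WINDOW_S:
        starts.append(start)
        start -= SLIDE_S
    return sorted(starts)


def count_by_guests(events):
    """events: iterable of (event_time_s, guests) in arrival order.

    Returns a Counter keyed by (window_start, guests).
    """
    counts = Counter()
    max_event_time = None
    for event_time_s, guests in events:
        if max_event_time is None or event_time_s > max_event_time:
            max_event_time = event_time_s
        watermark = max_event_time - WATERMARK_S
        for start in window_starts(event_time_s):
            if start + WINDOW_S <= watermark:
                continue  # window already closed: late contribution dropped
            counts[(start, guests)] += 1
    return counts


# The third event (time 100) arrives after time 500 was seen, so it is
# far behind the watermark (440) and both of its windows are closed.
counts = count_by_guests([(130, 2), (500, 0), (100, 2)])
```

The watermark bounds how much window state must be kept: without it, every past window would have to stay open forever in case a very late RSVP showed up.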
Introduce Machine Learning concept via an example
Loop over the 7 steps meant to shape the big picture of how Machine Learning should tackle real problems
Have a final overview of Machine Learning and some Spark hints
Spark MLlib (or Spark ML) is the Spark library for Machine Learning. The aim of this video is to discover all the main headlines of a Spark ML Pipeline. Implement an ML Pipeline for the House Price Forecast System discussed in the previous video.
Introduce Spark MLlib (Spark ML) main concept, Spark ML Pipeline, and see how data is flowing through an ML Pipeline
Cover Spark MLlib (Spark ML) operations: transformers, estimators, evaluators, etc.
Dissect the Spark Pipeline and PipelineModel APIs and use them to implement an ML Pipeline for the House Price Forecast System
Combine the power of Spark ML and Structured Streaming in an example that trains a Logistic Regression model offline and later scores online. Explore an example of online training and scoring via the RDD API. Discuss the unreleased Streaming ML concept.
Introduce the Logistic Regression algorithm used in the applications that follow
Develop an application that trains the model offline and scores online on the Meetup RSVPs stream
Develop an application that trains and scores online on the Meetup RSVPs stream via the RDDs API
Bring Spark GraphX into the discussion, the Spark library dedicated to graphs and graph-parallel computation.
Cover Spark GraphX headlines
Cover Spark GraphX API headlines
See a simple example
Argue for choosing logging over checkpointing as the fault tolerance mechanism in streaming, dissect the RBML, SBML and HML architectures, and implement HML in our streaming pipeline.
Explain why logging is better than checkpointing in a streaming pipeline
Use diagrams to dissect the flow of data through RBML, SBML and HML
Provide the coding session for adding HML in our streaming pipeline via Spring Reactive and MongoDB
The goal here is to provide another implementation for the SBML part via the Debezium Connector for MongoDB.
Get a brief overview of Kafka Connect
Get a brief overview of the Debezium Connector for MongoDB
Understand theoretical aspects of implementing an SBML logger with the Debezium Connector for MongoDB
Secure the communication between the Collection and the Message Queuing tiers and between the Analysis and the Message Queuing tiers.
Explore secure communication between Collection and Message Queuing tiers via SSL
Secure communication between Analysis and Message Queuing tiers via SSL.
Point out SSL for Kafka inter-broker communication
Today, organizations have a difficult time working with huge numbers of datasets. In addition, data processing and analysis need to be done in real time to gain insights. This is where data streaming comes in. As big data is no longer a niche topic, having the skillset to architect and develop robust data streaming pipelines is a must for all developers. They also need to think about the entire pipeline, including the trade-offs for every tier.
This course starts by explaining the blueprint architecture for developing a completely functional data streaming pipeline and installing the technologies used. With the help of live coding sessions, you will get hands-on with architecting every tier of the pipeline. You will also handle specific issues encountered working with streaming data. You will input a live data stream of Meetup RSVPs that will be analyzed and displayed via Google Maps.
By the end of the course, you will have built an efficient data streaming pipeline and will be able to analyze its various tiers, ensuring a continuous flow of data.
About the Author
Anghel Leonard is currently a Java chief architect. He is a member of the Java EE Guardians with 20+ years’ experience. He has spent most of his career architecting distributed systems. He is also the author of several books, a speaker, and a big fan of working with data.