Introduction to streaming

Loony Corn
A free video tutorial from Loony Corn
An ex-Google, Stanford and Flipkart team
4.2 instructor rating • 73 courses • 131,787 students

Lecture description

Spark can process streaming data in near real time using DStreams. 

Learn more from the full course

From 0 to 1 : Spark for Data Science with Python

Get your data to fly using Spark for analytics, machine learning and data science

08:18:52 of on-demand video • Updated February 2018

  • Use Spark for a variety of analytics and Machine Learning tasks
  • Implement complex algorithms like PageRank or Music Recommendations
  • Work with a variety of datasets from Airline delays to Twitter, Web graphs, Social networks and Product Ratings
  • Use all the different features and libraries of Spark: RDDs, Dataframes, Spark SQL, MLlib, Spark Streaming and GraphX
Often you have data which is best used in real time: real-time status updates from a feed, stock ticker information. Such data comes in streams, and in order to deal with this data, traditional processing just isn't enough.

Let's take a specific example. Let's say you work for a payment service, maybe Apple Pay or Google Wallet or any of the others. The number of failed payments per second is a mission-critical metric. If you don't deal with this right away, it's possible that you're losing money every second. This is in complete contrast with weekly business reports. If you think about business reports, they're published at a certain frequency: quarterly, monthly, weekly, maybe even hourly. They have a specific frequency and a specified set of information in them. Such reports require batch processing. What is batch processing? The data is collected and stored in a database, or in HDFS if you're working in a distributed environment. You would then run processing tasks at the required frequency: if you want them hourly, then you would produce these hourly reports, and so on. Each processing task works on a batch of data: all the data that has come in in the last hour, or all the data that has accumulated over a week. Using a batch of data to produce these business reports works fine.

But the number of failed payments per second? If you want to monitor failed payments, batch processing simply isn't enough. It's not fast enough; it's not real-time enough. You can't afford to wait for this kind of data. If there is a spike in the number of failed payments, you need to know immediately, not in one week, not in a day, not even in an hour. You need to react within minutes or even seconds. That's how quickly you have to move to fix your systems so that you're not losing money. If you don't react quickly, there could be severe consequences for your business: loss of reputation, loss of face, customers who might not return, loss of revenue. Being slow to react to such a critical issue could be a death blow to your business. Mission-critical metrics need to be monitored in real time.

Payment systems, just like any other critical systems, have logs. These logs need to be processed as they are created, not after a day, not after an hour or a month. This is streaming data, and processing such data requires stream processing.

Huge volumes of data such as logs are typically processed by distributed computing systems. Distributed computing systems like MapReduce, and even Spark, typically process data from stable storage: data stored in a database, in HDFS, or in any other storage system. Data is read from disk, processed, and written back out to disk. Because they read data from disk, there would be too much lag while processing streaming data; they don't adapt themselves to streaming data. If you want stream processing for your streaming data, folks end up using separate systems: one for batch processing and one for stream processing. Even if the data is exactly the same, aggregating logs at regular intervals will be done using a batch processing system, and monitoring logs in real time for errors will be done using a stream processing system. A well-known system which is very popular for stream processing is Storm. It's a standalone stream processing application.
It works well for streams, but it's a separate system. Now, early on in this course we mentioned that Spark is a general-purpose engine and can solve all your data processing problems. We spoke about how good Spark is as a production system as well as a system which allows you to explore and investigate. What about stream processing? The Spark Streaming library provides stream processing capabilities within Spark, so you are not limited to batch processing. You can perform stream processing within Spark, which means that you can use the same engine, Spark, for both your batch processing and your stream processing needs. You remain in the same system and use it for different types of applications.

We've spoken about the programming abstraction in Spark: a typical Spark application uses RDDs, resilient distributed datasets, on which you perform operations. Spark Streaming applications use something called DStreams. DStreams in turn contain RDDs; we'll see what that means in just a second. DStreams are the basic abstraction we use for stream processing.

In order to understand stream processing, it's important to visualize a stream of data. Let's say you have a stream of incoming logs; these logs can be in any form, let's say each one is a string. Here is a stream: one is one log message, two is another log message, three, four, and they come in in stream form. How does a typical stream processing application work? A stream processing application will process each individual log as it comes in. So assume that it's sitting somewhere here and it manipulates data in some way. As the stream comes in, it will go through the stream processing app, be modified or reduced in some way, and be output from the stream processing app, also as a stream. This is typically how stream processing works.

In Spark, this entire stream of data is represented as a DStream. Remember that each bit of data here happens to be a log statement; this entire stream of logs is a DStream. In Spark, to understand stream processing, you need to get DStreams into your head. The DStream has a property called the batch interval. The batch interval is measured in seconds, and it determines what portion of this data is represented in one RDD: all data which arrived within the batch interval is grouped together into one RDD. Let's take an example to make this clear. Let's say that the batch interval is one second, and within one second two log messages come in. So two log messages come in, at a rate of two per second. Note the highlighted messages: these two will be in one RDD; there will be two log messages in one RDD. So you have RDDs one, two, three, as you can see highlighted on screen. The batch interval determines which batch of messages lives in one RDD, which messages are grouped together in one RDD.

We've just spoken about the DStream as a stream of data, but the DStream is actually a sequence of RDDs. Remember, all data that comes within a batch interval is one RDD; that is, there is one RDD for every batch interval, and the DStream is just a sequence of these RDDs. Now, the good thing about the DStream is that you can treat it like an RDD in many ways. You can apply transformations and actions on a DStream exactly like you would on an RDD. What happens is that these transformations and actions are then applied to each of the RDDs as they are created: as the RDDs in the DStream are created every batch interval, the transformations and actions are applied to them.
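To make this concrete, here is a minimal PySpark Streaming sketch (the socket source on localhost:9999 and the "ERROR" marker are just assumptions for illustration, not something from the lecture). The StreamingContext is created with a one-second batch interval, and a filter transformation is applied to the DStream just as you would apply it to an RDD:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # One-second batch interval: all lines arriving within each second
    # are grouped into a single RDD of the DStream.
    sc = SparkContext("local[2]", "LogStreamExample")
    ssc = StreamingContext(sc, 1)

    # Assumed source for illustration: log lines arriving on a local socket.
    lines = ssc.socketTextStream("localhost", 9999)

    # The filter transformation is applied to each RDD of the DStream
    # as it is created, one RDD per batch interval.
    errors = lines.filter(lambda line: "ERROR" in line)
    errors.count().pprint()  # print the error count for each batch

    ssc.start()
    ssc.awaitTermination()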
If you want to think about it logically, with DStreams Spark still does batch processing, except that the batches are now the RDDs that are created in real time, every batch interval. The batch processing is done on the individual RDDs in the DStream. So it's batch processing, but in real time, as the RDDs are created every batch interval.
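To see this "batch processing in real time" idea explicitly, here is another small sketch using foreachRDD, which hands you exactly one RDD per batch interval; the socket source and the PAYMENT_FAILED marker are again just assumptions for illustration:

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext("local[2]", "FailedPayments")
    ssc = StreamingContext(sc, 1)  # 1-second batch interval

    # Assumed source: payment log lines arriving on a local socket.
    logs = ssc.socketTextStream("localhost", 9999)

    def handle_batch(time, rdd):
        # Each batch interval produces exactly one RDD, so this function
        # runs once per batch: ordinary batch processing, in real time.
        failed = rdd.filter(lambda line: "PAYMENT_FAILED" in line).count()
        print("%s: %d failed payments in this batch" % (time, failed))

    logs.foreachRDD(handle_batch)

    ssc.start()
    ssc.awaitTermination()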