
A brief introduction to the course, and then we'll get your development environment for Spark and Scala all set up on your desktop. A quick test application will confirm Spark is working on your system!
Get set up with a Twitter developer account, and run your first Spark Streaming application to listen to and print out live Tweets as they happen!
We start our crash course in the Scala programming language by covering some basics of the language: types and variables, printing, and boolean comparisons.
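As a small taste of those basics, here is a minimal sketch in plain Scala. The object name `ScalaBasics` and its helpers are made up for illustration and are not taken from the course materials.

```scala
object ScalaBasics {
  // String interpolation builds the text we want to print.
  def describe(n: Int): String = s"count is $n"

  // A boolean comparison; == compares String contents in Scala.
  def isSpark(s: String): Boolean = s == "Spark"

  def main(args: Array[String]): Unit = {
    val name = "Spark"   // immutable value, type String is inferred
    var count: Int = 1   // mutable variable with an explicit type
    count += 1
    println(describe(count))
    println(isSpark(name))
  }
}
```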
Our Scala crash course continues, illustrating various means of flow control in Scala. For loops, do/while loops, while loops, etc.
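To give a feel for it, the sketch below shows an if/else expression, a for loop, and a while loop in plain Scala. The names (`FlowControl`, `sign`, `sumTo`, `halvings`) are hypothetical and chosen only for this example.

```scala
object FlowControl {
  // if/else is an expression in Scala, so it yields a value directly.
  def sign(n: Int): String =
    if (n < 0) "negative" else if (n == 0) "zero" else "positive"

  // A for loop over an inclusive range, accumulating a sum.
  def sumTo(n: Int): Int = {
    var total = 0
    for (i <- 1 to n) total += i
    total
  }

  // A while loop: count how many integer halvings it takes to reach 0.
  def halvings(n: Int): Int = {
    var x = n
    var steps = 0
    while (x > 0) {
      x /= 2
      steps += 1
    }
    steps
  }
}
```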
Scala is a functional programming language, and so understanding how functions work and are treated in Scala is hugely important! This lecture covers the fundamentals, and lets you put it into practice.
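For instance, functions in Scala can be passed around like any other value. The following is a minimal sketch with made-up names (`Functions`, `square`, `addOne`, `applyTwice`), not code from the lecture.

```scala
object Functions {
  // An ordinary method.
  def square(x: Int): Int = x * x

  // A function literal stored in a value.
  val addOne: Int => Int = _ + 1

  // A higher-order function: it takes another function as a parameter.
  def applyTwice(f: Int => Int, x: Int): Int = f(f(x))
}
```

Calling `Functions.applyTwice(Functions.square, 3)` squares 3 twice, and `Functions.applyTwice(Functions.addOne, 3)` adds one twice.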
We wrap up our Scala crash course with the data structures most commonly used in Spark with Scala: tuples, lists, and maps.
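Here is a rough sketch of all three structures in plain Scala. The sample values and helper names (`DataStructures`, `lengths`, `portOf`) are invented for illustration only.

```scala
object DataStructures {
  // A tuple groups a fixed number of values that may have different types.
  val topic: (String, Int) = ("tweets", 3)

  // An immutable list of values of one type.
  val tools: List[String] = List("spark", "scala", "kafka")

  // An immutable map from keys to values.
  val ports: Map[String, Int] = Map("kafka" -> 9092, "cassandra" -> 9042)

  // map transforms every element of a list.
  def lengths(xs: List[String]): List[Int] = xs.map(_.length)

  // getOrElse avoids an exception when a key is missing.
  def portOf(name: String): Int = ports.getOrElse(name, -1)
}
```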
Before you can learn about Spark Streaming, you need to understand how Spark itself works at a high level! This covers the why & how of Apache Spark, of which Spark Streaming is a component.
The fundamental object of Spark programming is the Resilient Distributed Dataset (RDD), which is used not just in Spark but also within Spark Streaming scripts. This lecture explains what RDDs are, and what you can do with them.
Let's walk through and actually run a simple Spark script that counts the number of occurrences of each word in a book.
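To show the shape of the computation, here is a sketch of a word count using plain Scala collections instead of Spark, so it runs with no cluster or Spark dependency. This is an assumption-laden stand-in, not the course's script: in Spark the same idea is usually expressed with `flatMap`, `map`, and `reduceByKey` on an RDD, and the names `WordCount` and `countWords` are hypothetical.

```scala
object WordCount {
  // Split each line into lowercase words, then count how often each word occurs.
  def countWords(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\W+")) // break lines into words
      .filter(_.nonEmpty)                   // drop empty tokens from leading punctuation
      .groupBy(identity)                    // group identical words together
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```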
We finally have all the pre-requisite knowledge to start talking about Spark Streaming itself in more detail! We'll cover how it works, what it's for, and its architecture.
Now that we know more, let's go revisit that first Spark Streaming application we ran in lecture 2, and dive into how it really works.
Windowing allows you to analyze streaming data over a sliding window of time, which lets you do much more than just transform streaming data and store it someplace else. We'll cover the concepts of the batch, window, and slide intervals, and how they work together to let you aggregate streaming data over some period of time.
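To make the three intervals concrete, here is a simplified model in plain Scala rather than Spark itself. Each element of the input stands for one batch, and the window and slide are measured in batches (in Spark Streaming both must be multiples of the batch interval). The names `WindowSketch` and `windowedSums` are made up, and the handling of a partial window at the end of the data is a property of this sketch, not a claim about Spark's behavior.

```scala
object WindowSketch {
  // batchCounts:   one aggregated value per batch interval
  // windowBatches: how many batches each window covers
  // slideBatches:  how many batches to advance between windows
  def windowedSums(batchCounts: Seq[Int], windowBatches: Int, slideBatches: Int): List[Int] =
    batchCounts
      .sliding(windowBatches, slideBatches) // overlapping groups of batches
      .map(_.sum)                           // aggregate within each window
      .toList
}
```

With five batches, a window of three batches, and a slide of two, the windows cover batches 1 to 3 and then 3 to 5, so the middle batch is counted in both.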
How can Spark Streaming do so much work continuously in a reliable manner? We'll uncover some of its tricks for reliability, as well as tips for configuring Spark Streaming to be as reliable as possible.
We'll build on our "print tweets" example to actually store the incoming Tweets to disk, and illustrate how Spark Streaming can handle file output.
Compute the average length of a Tweet, using windowing in Spark Streaming.
This is a fun one! We'll track the most popular hashtags in Twitter over time, and watch how they change in real time!
We'll simulate an incoming stream of Apache access logs, and use Spark Streaming to keep track of the most-requested web pages in real time!
This example will listen to an Apache access log stream, and raise an alarm if too many errors are returned by the server in real time.
We'll integrate Spark Streaming with Spark SQL, allowing us to run SQL queries on data as it is streamed in! Again we will use Apache logs as an example.
Spark 2.0 introduced experimental support for Structured Streaming, a new Dataset-based API for Spark Streaming that is bound to become increasingly important. Learn how it works.
As an example, we'll stream Apache access logs in from a directory, and use Structured Streaming to count up status codes over a one-hour moving window.
Apache Kafka is a popular and robust technology for publishing messages across a cluster on a large scale. We'll show how to get Spark Streaming to listen to Kafka topics, and process them in real time.
Flume is a popular technology for publishing log information at large scale, especially on a Hadoop cluster. We'll illustrate how to set up both push-based and pull-based Flume configurations with Spark Streaming, and discuss the tradeoffs of each.
Amazon's Kinesis Streaming service is basically Kafka on AWS. If you're working with an AWS/EC2 cluster, you'll want to know how to integrate Spark Streaming with Kinesis - and that's what this lecture covers.
What if you need to integrate Spark Streaming with some proprietary system that does not have an existing connection library? Well, you can always write your own Receiver class. This example shows you how and actually lets you build and run one.
Cassandra is a popular "NoSQL" database that can be used to provide fast access to massive data sets to real-time applications. Dumping data transformed by Spark Streaming into a Cassandra database can expose that data to your larger, real-time services. We'll show you how and actually run a simple example.
Spark has the ability to track arbitrary state across streams of data as they come in, such as web sessions, running totals, etc. This example shows you how it all works, and challenges you to track your own state using our example as a baseline.
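The core idea can be sketched in plain Scala as a fold: the state left over from earlier batches is combined with each new batch to produce updated state. This is only a conceptual model with hypothetical names (`RunningTotals`, `update`, `run`). It does not use Spark Streaming's own stateful operators, such as `updateStateByKey`.

```scala
object RunningTotals {
  type State = Map[String, Int]

  // Merge one batch of (key, value) events into the state carried over so far.
  def update(state: State, batch: Seq[(String, Int)]): State =
    batch.foldLeft(state) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }

  // Process batches in arrival order, starting from empty state.
  def run(batches: Seq[Seq[(String, Int)]]): State =
    batches.foldLeft(Map.empty[String, Int])(update)
}
```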
Spark Streaming integrates with some of Spark's MLlib (Machine Learning Library) capabilities. This example builds a real-time K-Means clustering application; it does unsupervised machine learning that continually gets better as more training data feeds into it.
Spark Streaming can also feed data in real time to linear regression models that get better over time as more data is fed into them. This example shows linear regression in action with Spark Streaming.
Your production applications won't be run from within the Scala IDE; you'll need to run them from a command line, and potentially on a cluster. The spark-submit command is used for this. We'll show you how to package up your application and run it using spark-submit from a command prompt.
If your Spark Streaming application has external library dependencies that might not be already present on every machine in your cluster, the SBT tool can manage those dependencies for you, and package them into the JAR file you run with spark-submit. We'll show you how it works with a real example.
We'll run our simple word count example on a real cluster, using Amazon's Elastic MapReduce service! This just shows you what's involved in running a Spark Streaming job on a real cluster as opposed to your desktop; there are a few parameters to spark-submit you need to worry about, and getting your scripts and data in the right place is also something you need to deal with.
Spark jobs rarely run perfectly, if at all, on the first try - some tuning and debugging is usually required, and arriving at the right scale of your cluster is also necessary. We'll cover some performance tips, and how to troubleshoot what's going on with a Spark Streaming job running on a cluster.
Want to learn more about Spark Streaming? Here are a few books and other resources I've found valuable.
Stream big data with Spark Streaming and Scala!
Tackle massive data sets!
Apply what you learn to real work right away!
Why choose Streaming Big Data with Spark Streaming and Scala (Hands-On)
Now updated for the IntelliJ IDE!
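"Big data" analysis is a popular and highly valuable skill. The thing is, "big data" never stops flowing! Spark Streaming is a new and rapidly developing technology for processing massive data sets as they are created. Why wait for a nightly analysis to run when you can keep your analysis updated in real time, all the time? Whether it's clickstream data from a big website, sensor data from a massive "Internet of Things" deployment, financial data, or something else, Spark Streaming is a powerful technology for transforming and analyzing that data continuously, right as it is created.
You'll be learning from a former engineer and senior manager at Amazon and IMDb.
Over the course, you'll work with real live Twitter data, simulated streams of Apache access logs, and even data used to train machine learning models! You'll write and run Spark Streaming jobs yourself, at home on your own computer. Toward the end of the course, we'll show you how to take those jobs to a real Hadoop cluster and run them in a production environment too.
This course is very hands-on and is built around achievable activities that reinforce what you learn. By the end, you'll be confidently writing Spark Streaming scripts in Scala, and you'll be ready to tackle massive amounts of data in a whole new way. You may be surprised at how much Spark Streaming makes possible!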
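In Streaming Big Data with Spark Streaming and Scala (Hands-On), you will learn to:
Take a crash course in the Scala programming language
Understand how Apache Spark operates on a cluster
Set up discretized streams with Spark Streaming and transform them as data is received
Use Structured Streaming to stream data into DataFrames in real time
Analyze streaming data over sliding windows of time
Maintain stateful information across streams of data
Connect Spark Streaming with highly scalable data sources such as Kafka, Flume, and Kinesis
Dump streams of data in real time to NoSQL databases such as Cassandra
Run SQL queries in real time on streamed data
Train machine learning models in real time with streaming data, and use them to make predictions that keep improving over time
Package, deploy, and run self-contained Spark Streaming code on a real Hadoop cluster using Amazon's managed cluster platform for running big data frameworks (Elastic MapReduce)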
After taking a lecture, you're welcome to post any questions about it in the Q&A, but please be sure to write them in English so that we can answer them. :)
See you in the course!