
Discover Apache Spark for Java developers, from core RDDs and Spark's programming models to Spark SQL and DataFrames, streaming with DStreams and Structured Streaming, and Kafka integration.
Spark enables in-memory processing, richer operations, and an execution plan that runs tasks in parallel, outperforming Hadoop MapReduce.
Explore Spark architecture, the driver and worker nodes, partitions and tasks, and how the execution plan (DAG) drives parallel processing of RDDs across clusters using HDFS or S3.
Learn how a reduce on an RDD aggregates a distributed data set across partitions using a reduce function, including Java 8 lambdas and double sums.
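The idea can be sketched in plain Java with no Spark dependency. This is an illustration under stated assumptions, not the course's code: the nested lists stand in for the partitions of an RDD, and the class and method names are hypothetical. In Spark's Java API the corresponding call is `reduce` on a `JavaRDD`, which takes a two-argument function. Because partial results from each partition are combined afterwards, that function should be associative and commutative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java illustration only: each inner list plays the role of one partition.
final class PartitionedReduceSketch {

    // Reduce every partition on its own, then reduce the partial results.
    static double reduce(List<List<Double>> partitions, BinaryOperator<Double> fn) {
        return partitions.stream()
                .filter(partition -> !partition.isEmpty())
                .map(partition -> partition.stream().reduce(fn).get())
                .reduce(fn)
                .orElseThrow(() -> new IllegalArgumentException("empty data set"));
    }

    public static void main(String[] args) {
        // Hypothetical sample data split across two "partitions".
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(1.0, 2.0),
                Arrays.asList(3.0, 4.0));

        // A Java 8 lambda that adds two doubles, used as the reduce function.
        double total = reduce(partitions, (a, b) -> a + b);
        System.out.println(total);
    }
}
```

The same lambda is applied both inside each partition and across the partial sums, which is why the grouping of the additions must not change the answer.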
Transform an RDD with the map operation to produce a new RDD of doubles by applying a mapping function to calculate square roots, illustrating immutability and Java 8 lambdas.
Learn to output RDD results to the console with foreach and a void function, printing values without creating a new RDD, and explore the Java 8 double-colon (::) method reference syntax.
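The double-colon syntax is a plain Java 8 feature, so it can be shown without Spark. In this small sketch the `forEach` helper and the class name are hypothetical, and a `java.util.List` stands in for the RDD. A method reference such as `System.out::println` is shorthand for a lambda that passes its argument straight to that method.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java illustration only; a java.util.List stands in for the RDD.
final class MethodReferenceSketch {

    // Applies a void action to every value and returns nothing,
    // which is the shape of a foreach-style operation.
    static <T> void forEach(List<T> values, Consumer<T> action) {
        for (T value : values) {
            action.accept(value);
        }
    }

    public static void main(String[] args) {
        List<Double> values = Arrays.asList(2.0, 3.0, 4.0);

        // Lambda form: names the parameter explicitly.
        forEach(values, value -> System.out.println(value));

        // Method reference form: the same action written with "::".
        forEach(values, System.out::println);
    }
}
```

Both calls print each value once; the second form simply omits the parameter name.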
Count elements in Spark using map and reduce on the square root RDD, producing a long count. Keep interim counts inside the RDD to enable further transformations.
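As a rough sketch of the counting pattern, the following plain-Java example maps every element to `1L` and then sums the ones. It uses `java.util.stream` as a stand-in for the RDD operations, and the class and method names are hypothetical. With Spark, the analogous steps would be `map` followed by `reduce` on a `JavaRDD`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration only; a java.util.List stands in for the RDD.
final class MapReduceCountSketch {

    // Map each input to its square root; the input list is left unchanged.
    static List<Double> squareRoots(List<Integer> inputs) {
        return inputs.stream()
                .map(value -> Math.sqrt(value))
                .collect(Collectors.toList());
    }

    // Count by mapping every element to 1L and summing the ones.
    static long count(List<Double> values) {
        return values.stream()
                .map(value -> 1L)
                .reduce(0L, Long::sum);
    }

    public static void main(String[] args) {
        // Hypothetical sample data.
        List<Double> roots = squareRoots(Arrays.asList(4, 9, 16, 25));
        System.out.println(roots);
        System.out.println(count(roots));
    }
}
```

Using `1L` rather than `1` keeps the running total a long, so the count does not overflow an int on large data sets.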
Learn how to fix a NotSerializableException in Spark when using RDD.foreach with System.out.println. Use collect to bring the data back to the driver JVM and iterate safely.
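One common cause of this exception can be reproduced with the JDK alone. The sketch below is an assumption about the mechanism, not a transcript of the lecture: `SerializableConsumer` is a hypothetical stand-in for a serializable function interface, and `canSerialize` is a hypothetical helper. A bound method reference such as `System.out::println` captures the `System.out` object, a `PrintStream`, which is not serializable. A lambda that only mentions `System.out` inside its body captures nothing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Consumer;

// Plain-Java illustration only; no Spark classes are used.
final class SerializableLambdaSketch {

    // Hypothetical stand-in for a function interface that must be serializable.
    interface SerializableConsumer<T> extends Consumer<T>, Serializable {}

    // Returns true if Java serialization accepts the function object.
    static boolean canSerialize(Object function) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(function);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Captures the System.out PrintStream instance when the reference is created.
    static SerializableConsumer<Double> boundMethodReference() {
        return System.out::println;
    }

    // Captures nothing; System.out is looked up each time the lambda runs.
    static SerializableConsumer<Double> plainLambda() {
        return value -> System.out.println(value);
    }

    public static void main(String[] args) {
        System.out.println("method reference serializable: " + canSerialize(boundMethodReference()));
        System.out.println("plain lambda serializable: " + canSerialize(plainLambda()));
    }
}
```

Calling `collect()` on a `JavaRDD` returns an ordinary `java.util.List` on the driver, so iterating over that list with `forEach(System.out::println)` involves no serialization at all. This is only sensible when the results fit in the driver's memory.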
Learn how to combine an integer and its square root into a single RDD using a Java object, and why tuples simplify multi-value records in Spark.
Explore Scala tuples, the lightweight, type-rich groupings used throughout the Spark API, showing how Tuple2 and larger tuples compare with Java usage, and how pair RDDs leverage these structures.
Explains how pair RDDs store key and value pairs to enable grouping by key, and demonstrates counting warnings, errors, and fatals from log data using Spark's pair RDD operations.
Transform the original log messages RDD into a JavaPairRDD using mapToPair, splitting each log line on the colon to produce a (level, date) key-value pair.
Count warnings, errors, and fatals by log level using a pair RDD and reduceByKey, summing counts by key instead of groupByKey, which can cause severe performance problems.
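The counting-by-key pattern can be approximated in plain Java. In this sketch a `Map` stands in for the pair RDD, the log lines are made-up sample data, and the class and method names are hypothetical. Each line is split on the colon to get its level, and a running total is kept per key. That is the shape of `mapToPair` followed by `reduceByKey`: no key ever needs its full list of values held in memory, which is what grouping first would require.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; a Map stands in for the pair RDD.
final class CountByLevelSketch {

    static Map<String, Long> countByLevel(List<String> logLines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : logLines) {
            // The text before the first colon is the log level (the key).
            String level = line.split(":")[0];
            // Emit (level, 1) and fold it into the running total for that key.
            counts.merge(level, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample log lines in a "LEVEL: date" layout.
        List<String> logLines = Arrays.asList(
                "WARN: Tuesday 4 September 0405",
                "ERROR: Tuesday 4 September 0408",
                "FATAL: Wednesday 5 September 1632",
                "ERROR: Friday 7 September 1854",
                "WARN: Saturday 8 September 1942");

        countByLevel(logLines).forEach(
                (level, count) -> System.out.println(level + " has " + count + " instances"));
    }
}
```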
Refactor verbose Java Spark code using a fluent API to chain map, reduceByKey, and foreach on an RDD, achieving a single, readable line with a concise tuple.
Explore flatMap in Spark and learn how it returns zero or more outputs per input, enabling splitting sentences into words and flattening results into a single RDD.
Apply the filter function to an RDD to remove junk, such as single characters, by returning true or false in a lambda, and chain operations efficiently.
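A minimal sketch of flatMap followed by filter, written with `java.util.stream` rather than Spark, looks like this. The class and method names are hypothetical and the sentences are made-up sample data. One difference to be aware of: in the Spark 2.x and later Java API, the function passed to `flatMap` returns an `Iterator`, whereas the stream version below returns a `Stream`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java illustration only; streams stand in for RDD operations.
final class FlatMapFilterSketch {

    static List<String> words(List<String> sentences) {
        return sentences.stream()
                // flatMap: each sentence yields zero or more words,
                // and the results are flattened into one sequence.
                .flatMap(sentence -> Arrays.stream(sentence.split(" ")))
                // filter: keep a word only when the lambda returns true.
                .filter(word -> word.length() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList(
                "WARN: Tuesday 4 September 0405",
                "a b c",
                "ERROR: Friday 7 September 1854");
        System.out.println(words(sentences));
    }
}
```

The second sample sentence contributes nothing to the output, which shows that an input may map to zero results once the junk is filtered out.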
Learn how to load text files into a Spark job with sc.textFile, create a string RDD, and split into words, addressing Windows-specific setup with winutils.
Analyze a combined subtitles file to automatically generate ten keywords for the course by performing a word count and filtering out boring words with a Util class.
Sorts a Spark pair RDD by value to reveal top keywords for the Docker course. Uses mapToPair, reduceByKey, and sortByKey to present the results.
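Since the steps listed end with a sort by key, one plausible reading is that the (word, count) pairs are flipped to (count, word) so that the count becomes the key. The plain-Java sketch below illustrates that flip and sort under this assumption. The class and method names are hypothetical, `Map.Entry` stands in for a tuple, and the counts are made-up sample data.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; Map.Entry stands in for a (key, value) tuple.
final class SortByValueSketch {

    static List<Map.Entry<Long, String>> topByCount(Map<String, Long> counts, int n) {
        // Flip each (word, count) pair into (count, word).
        List<Map.Entry<Long, String>> switched = new ArrayList<>();
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            switched.add(new AbstractMap.SimpleEntry<>(entry.getValue(), entry.getKey()));
        }
        // Sort by the new key (the count), largest first.
        switched.sort((a, b) -> Long.compare(b.getKey(), a.getKey()));
        // Keep only the first n results.
        return new ArrayList<>(switched.subList(0, Math.min(n, switched.size())));
    }

    public static void main(String[] args) {
        // Made-up word counts.
        Map<String, Long> counts = new HashMap<>();
        counts.put("docker", 5L);
        counts.put("container", 3L);
        counts.put("image", 9L);

        for (Map.Entry<Long, String> entry : topByCount(counts, 2)) {
            System.out.println(entry.getValue() + " appears " + entry.getKey() + " times");
        }
    }
}
```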
Explore Spark internals, debunk the belief that sorting an RDD requires coalescing to a single partition, and learn why foreach output across partitions can mislead.
Coalesce is the wrong solution for sorting in Spark; it forces a single partition and can cause incorrect results, so learn about partitions, shuffles, and take for accuracy.
Coalesce reduces partitions to improve performance when data is small, avoiding unnecessary shuffles and keeping correctness; use collect sparingly when results fit in the driver memory.
Get started with the amazing Apache Spark parallel computing framework - this course is designed especially for Java Developers.
If you're new to Data Science and want to find out how massive datasets are processed in parallel, then the Java API for Spark is a great way to get started, fast.
All of the fundamentals you need to understand the main operations you can perform in Spark Core, SparkSQL and DataFrames are covered in detail, with easy-to-follow examples. You'll be able to follow along with all of the examples, and run them on your own local development computer.
Included with the course is a module covering SparkML, an exciting addition to Spark that allows you to apply Machine Learning models to your Big Data! No mathematical experience is necessary!
And finally, there's a full 3-hour module covering Spark Streaming, where you will get hands-on experience of integrating Spark with Apache Kafka to handle real-time big data streams. We use both the DStream and the Structured Streaming APIs.
Optionally, if you have an AWS account, you'll see how to deploy your work to a live EMR (Elastic Map Reduce) hardware cluster. If you're not familiar with AWS you can skip this video, although it is still worth watching even if you don't code along.
You'll be going deep into the internals of Spark and you'll find out how it optimizes your execution plans. We'll be comparing the performance of RDDs vs SparkSQL, and you'll learn about the major performance pitfalls; avoiding them could save a lot of money on live projects.
Throughout the course, you'll be getting some great practice with Java Lambdas - a great way to learn functional-style Java if you're new to it.