Advanced Apache Spark for Data Scientists and Developers

Rating: 3.5 (59 reviews)

Apache Spark

Created by Adastra Academy

Last updated 1/2016

English

What you'll learn

Understand the functionality of Spark's four built-in libraries
Create real-world applications using Spark’s libraries
Understand how to develop, debug and optimize the performance of Spark applications

Course content

6 sections • 71 lectures • 2h 46m total length

Introduction to Apache Spark • 4:19
Spark Installation • 16:00
Spark Installation Quiz
IDE Installation • 14:00
IDE Installation Quiz

Introduction and Topics • 0:41
Overview of Spark Streaming • 1:17
Explore real-time data processing with Spark Streaming, a fault-tolerant extension of Apache Spark that reads data streams from sources such as Flume, Kafka, and Twitter and processes them in micro-batches.
Linking Input Sources • 0:52
Streaming Context • 1:15
Master the StreamingContext, the entry point for Spark Streaming, configured with a batch interval ranging from 500 milliseconds to several seconds; once the context has been started, no new computations can be added, and only one StreamingContext can be active per JVM.
Discretized Streams (DStreams) • 0:47
Input DStreams • 2:29
Hands-on Exercise 1: Spark Streaming • 11:00
Stateless Transformations on DStreams • 3:51
Stateful Transformations • 3:30
Explore stateful transformations in Spark Streaming, using window operations to compute running counts over sliding windows and updateStateByKey to maintain per-key statistics.
Hands-on Exercise 2: Spark Streaming • 6:00
Output Operations • 1:54
Spark Streaming executes its lazy transformations only when an output operation runs; learn to print streamed data and save it to databases and to files named with a prefix and suffix per batch.
Hands-on Exercise 3: Spark Streaming • 7:00
Checkpointing • 0:46
Explore Spark Streaming checkpointing for 24/7 operation: recover from system failures using checkpoints saved to reliable storage such as HDFS, and checkpoint periodically to avoid long recovery times.
Caching and Persisting • 0:44
Tuning and Debugging • 2:28
Section Topics • 0:32

Introduction to Spark SQL • 0:59
Spark SQL Overview • 6:48
Get an overview of Spark SQL: a distributed framework for structured and semi-structured data that enables SQL queries, reads and writes many formats, and provides access to Hive-compatible data sources via JDBC or ODBC.
The Spark Shell hands-on • 2:00
Hands-on Exercise 1: part a) Import CSV • 30:00
Schema Inference • 6:25
Data Query Select • 5:19
Data Query Select
DataFrame.Reader DataFrame.Writer • 8:11
Explore reading JSON Lines data from structured sources into a DataFrame, inferring or applying explicit schemas for nested structures and arrays, and writing to JSON or Parquet formats for Hive tables.
Hands-on Exercise 1: part b) Import JSON • 18:00
Data Query INNER JOINs • 6:40
Data Query INNER JOINs
Group By, Order By, Window Functions • 5:41
Explore aggregation and sorting in Spark SQL, using GROUP BY and ORDER BY to compute per-product totals, then apply window functions to rank and analyze data across categories.
Group By, Order By, Window Functions
Data Query OUTER JOINs, SEMI JOIN • 9:50
Data Query OUTER JOINs, SEMI JOIN
Custom UDF (User Defined Function) • 4:41
Custom UDF (User Defined Function)
API or SQL? • 3:43
Hands-on Exercise 2: Spark SQL • 18:00

Introduction and Topics • 0:41
Explore core machine learning concepts and algorithms, examine the advantages of the machine learning library for common methods, and engage in hands-on Spark examples with evaluation metrics.
Machine Learning • 1:17
MLlib • 2:32
Basic Statistics • 1:00
Use the MLlib statistics package to compute basic statistics (mean, variance, standard deviation, and non-zero counts), and explore correlations, stratified sampling, and hypothesis testing to assess statistical significance.
Optimization • 1:49
Classification • 6:20
Explore supervised learning with classification and regression, compare linear and non-linear models such as SVM, logistic regression, and decision trees, and evaluate them using training/testing splits and the ROC curve.
Hands-on Exercise 1: Spark MLlib: Classification • 12:00
Validation • 1:07
Regression • 2:18
Clustering • 3:51
Hands-on Exercise 2: Spark MLlib: Clustering • 12:00
Feature Extraction and Transformation • 1:00
Dimensionality Reduction • 5:23
Extract informative features by reducing dimensionality with PCA and SVD, projecting data into lower dimensions to minimize noise, speed up analysis, and enable visualization.
Collaborative Filtering • 0:55
Evaluation Metrics • 3:37
Review evaluation metrics for binary classifiers, including precision, recall, accuracy, and area under the curve, along with ROC estimation methods and clustering measures such as intra- and inter-cluster distance.

Introduction to Spark GraphX • 7:18
Graph creation examples • 2:00
Graph Operators Overview, Information about a Graph • 3:18
Information about a graph example • 1:00
Transform Graph Items • 2:35
Transform graph items examples • 1:00
Modify Graph Structure • 1:24
Modify graph structure example • 1:00
Graph Neighborhood Aggregations • 2:30
Explore graph neighborhood aggregations using collect neighbors to obtain neighbor IDs, with edge direction options and duplicate handling, then apply two-phase map-reduce messaging to update vertex attributes and return the graph.
Neighborhood Aggregations Examples • 2:00
Graph Algorithms • 2:36
Triangle Count Example • 1:00
Pregel - Graph Parallel Computation • 2:11
Pregel Example • 1:00
Optimized Graph Representation • 3:00
Hands-on Exercise: Spark GraphX • 23:00

Requirements

Completion of an introductory Apache Spark course. Adastra Academy's Introduction to Apache Spark for Developers and Engineers is recommended.
A beginner to intermediate understanding of the Scala programming language. Adastra Academy's Scala in Practice is recommended.
A basic understanding of Apache Hadoop and Big Data.

Description

Apache Spark is an open-source data processing engine. Spark is designed to provide fast processing of large datasets and high performance for a wide range of analytics applications. Unlike MapReduce, Spark enables in-memory cluster computing, which greatly improves the speed of iterative algorithms and interactive data mining tasks.

Adastra Academy’s Advanced Apache Spark includes illuminating video lectures, thorough application examples, a guide to installing the NetBeans Integrated Development Environment, and quizzes. Through this course, you will learn about Spark’s four built-in libraries (Spark Streaming, DataFrames (Spark SQL), MLlib, and GraphX) and how to develop, build, tune, and debug Spark applications. The course exercises will enable you to become proficient at creating fully functional real-world applications using the Apache Spark libraries. Unlike other courses, we give you the guided, ground-up approach to learning Spark that you need in order to become an expert.

Who this course is for:

Data Scientists
Developers
Data Engineers

Course content

Introduction to Advanced Apache Spark • 3 lectures • 34min

Tuning and Debugging • 7 lectures • 15min

Spark Streaming • 16 lectures • 45min

Spark SQL • 14 lectures • 2hr 6min

Spark MLlib • 15 lectures • 56min

Spark GraphX • 16 lectures • 57min
