
Examine Spark's runtime configuration as the key performance lever, and review the web user interface, log files, and Kryo serialization to improve memory use and performance.
Explore real-time data processing with Apache Spark Streaming, a fault-tolerant extension of Apache Spark that reads data streams from sources such as Flume, Kafka, and Twitter and processes them in micro-batches.
Master the streaming context as the entry point for Spark Streaming, with a batch interval ranging from 500 milliseconds to several seconds; once start is called, no new computations can be added, and only one active context can exist per JVM.
Explore stateful transformations in Spark Streaming, using window operations and update state by key to maintain running counts within sliding windows and per-key statistics across batches.
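The two stateful ideas named above can be sketched without a cluster. The following is a minimal plain-Python illustration, not Spark code: each inner list stands in for one micro-batch, and the helper names `update_state` and `windowed_counts` are hypothetical, chosen only for this example.

```python
from collections import Counter, deque


def update_state(state, batch):
    """Merge one micro-batch of keys into the running per-key counts."""
    new_state = Counter(state)
    new_state.update(batch)
    return new_state


def windowed_counts(batches, window_length, slide_interval):
    """Yield counts over the last `window_length` batches,
    emitting a result every `slide_interval` batches."""
    window = deque(maxlen=window_length)
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide_interval == 0:
            counts = Counter()
            for b in window:
                counts.update(b)
            yield counts


# Four micro-batches of words (made-up sample data).
batches = [["a", "b"], ["a"], ["b", "b"], ["c"]]

# Running count per key across all batches (the update-state-by-key idea).
running = Counter()
for batch in batches:
    running = update_state(running, batch)

# Counts over a sliding window of 2 batches, sliding by 1 batch.
windows = list(windowed_counts(batches, window_length=2, slide_interval=1))
```

In Spark Streaming the window length and slide interval are durations that must be multiples of the batch interval; here they are expressed as batch counts to keep the sketch small.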
Spark Streaming triggers the execution of lazy transformations only when an output operation runs, such as printing streamed data or saving it to databases and to files named with a prefix and suffix per batch.
Explore Spark checkpointing for 24/7 streaming, which enables recovery from system failures by saving data checkpoints to reliable storage such as HDFS, with periodic checkpoints to avoid long recovery times.
Get an overview of Spark SQL, a distributed framework for structured and semi-structured data that supports SQL queries, reads and writes many formats, and works with Hive-compatible data sources via JDBC or ODBC.
Explore reading JSON lines into a DataFrame from structured sources, inferring or applying explicit schemas for nested structures and arrays, and writing to JSON or Parquet formats for Hive tables.
Explore aggregation and sorting in Spark SQL, using group by and order by to compute per-product totals, then apply window functions to rank and analyze data across categories.
Explore core machine learning concepts and algorithms, examine the advantages of the machine learning library for common methods, and engage in hands-on Spark examples with evaluation metrics.
Leverage the MLlib statistics package to compute basic statistics (mean, variance, standard deviation, and non-zero counts), and explore correlations, stratified sampling, and hypothesis testing to assess data significance.
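As a rough illustration of what these per-column summary statistics mean, here is a small sketch using only Python's standard `statistics` module rather than Spark. The function name `column_summary` and the sample rows are assumptions made for this example, and the variance shown is the sample (n - 1) variance.

```python
import statistics


def column_summary(rows):
    """Compute per-column mean, sample variance, standard deviation,
    and the number of non-zero entries for a list of equal-length rows."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.mean(c) for c in columns],
        "variance": [statistics.variance(c) for c in columns],
        "stdev": [statistics.stdev(c) for c in columns],
        "num_nonzeros": [sum(1 for v in c if v != 0) for c in columns],
    }


# Three observations with two features each (made-up sample data).
rows = [(1.0, 0.0), (2.0, 4.0), (3.0, 0.0)]
summary = column_summary(rows)
```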
Explore supervised learning with classification and regression, compare linear and non-linear models such as SVM, logistic regression, and decision trees, and evaluate them using training/testing splits and the ROC curve.
Extract informative features by reducing dimensionality with PCA and SVD, projecting data into lower dimensions to minimize noise, speed up analysis, and enable visualization.
Explain evaluation metrics for binary classifiers, including precision, recall, accuracy, and area under the curve, plus ROC estimation methods, and clustering measures such as intra- and inter-cluster distance.
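Three of the binary-classifier metrics listed above follow directly from the confusion-matrix counts. The sketch below is plain Python with made-up labels and predictions; the function name `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def binary_metrics(labels, predictions):
    """Return precision, recall, and accuracy for 0/1 labels and predictions."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Made-up ground truth and classifier output.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 0, 0, 1, 0, 0, 0]
metrics = binary_metrics(labels, predictions)
```

With this sample data there are 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives, so precision is 2/3, recall is 1/2, and accuracy is 5/8.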
Explore graph neighborhood aggregations using collect neighbors to obtain neighbor IDs, with edge direction options and duplicate handling, then apply two-phase map-reduce messaging to update vertex attributes and return the graph.
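The neighborhood operations described above can be pictured with a small plain-Python sketch over an edge list. This is not GraphX code: the functions `collect_neighbor_ids` and `aggregate_messages`, the direction strings, and the sample edges are all assumptions made for illustration. The first function shows how edge direction changes the result and why duplicates can appear; the second shows the two phases, a send step per edge followed by a merge step per vertex.

```python
from collections import defaultdict


def collect_neighbor_ids(edges, direction="either"):
    """Collect neighbor IDs per vertex, following outgoing edges ("out"),
    incoming edges ("in"), or both ("either"). Duplicates are kept."""
    neighbors = defaultdict(list)
    for src, dst in edges:
        if direction in ("out", "either"):
            neighbors[src].append(dst)
        if direction in ("in", "either"):
            neighbors[dst].append(src)
    return dict(neighbors)


def aggregate_messages(edges, send, merge):
    """Phase 1 (map): `send` emits (target_vertex, message) pairs per edge.
    Phase 2 (reduce): `merge` combines messages arriving at the same vertex."""
    inbox = {}
    for src, dst in edges:
        for target, msg in send(src, dst):
            inbox[target] = merge(inbox[target], msg) if target in inbox else msg
    return inbox


# Directed edges as (source, destination) pairs (made-up sample graph).
edges = [(1, 2), (1, 3), (2, 3), (3, 1)]

out_neighbors = collect_neighbor_ids(edges, "out")
all_neighbors = collect_neighbor_ids(edges, "either")

# In-degree: every edge sends the message 1 to its destination; messages are summed.
in_degree = aggregate_messages(
    edges,
    send=lambda src, dst: [(dst, 1)],
    merge=lambda a, b: a + b,
)
```

Vertex 1 appears with neighbor 3 twice under "either" because the edges (1, 3) and (3, 1) both connect them, which is the kind of duplicate a caller may need to remove.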
Apache Spark is an open-source data processing engine. Spark is designed to provide fast processing of large datasets and high performance for a wide range of analytics applications. Unlike MapReduce, Spark enables in-memory cluster computing, which greatly improves the speed of iterative algorithms and interactive data mining tasks.
Adastra Academy’s Advanced Apache Spark includes illuminating video lectures, thorough application examples, a guide to installing the NetBeans Integrated Development Environment, and quizzes. Through this course, you will learn about Spark’s four built-in libraries (Spark Streaming, DataFrames (Spark SQL), MLlib, and GraphX) and how to develop, build, tune, and debug Spark applications. The course exercises will enable you to become proficient at creating fully functional real-world applications using the Apache Spark libraries. Unlike other courses, we give you the guided, ground-up approach to learning Spark that you need in order to become an expert.