Apache Spark SQL - Bigdata In-Memory Analytics Master Course

Master in-memory distributed computing with Apache Spark SQL. Leverage the power of Dataframe and Dataset Real life demo
4.5 (83 ratings)
465 students enrolled
Last updated 10/2019
English
Current price: $139.99 Original price: $199.99 Discount: 30% off
30-Day Money-Back Guarantee
This course includes
  • 4.5 hours on-demand video
  • 3 articles
  • 22 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Certificate of Completion
What you'll learn
  • Spark SQL Syntax, Component Architecture in Apache Spark
  • Dataset, Dataframes, RDD
  • Advanced features on interaction of Spark SQL with other components
  • Using data from various data sources like MS Excel, RDBMS, AWS S3 and NoSQL MongoDB
  • Using different file formats like Parquet, Avro and JSON
  • Table partitioning and bucketing
Requirements
  • Introduction to the Big Data ecosystem
  • Basics of SQL
Description

This course is designed for everyone from professionals with zero experience to already skilled practitioners who want to enhance their Spark SQL skills. Hands-on sessions cover the end-to-end setup of a Spark cluster on AWS as well as on local systems.

COURSE UPDATED PERIODICALLY SINCE LAUNCH. Last updated: December

What students are saying:

  • 5 stars: "This is classic. Spark related concepts are clearly explained with real life examples." - Temitayo Joseph

In a data pipeline, whether the incoming data is structured or unstructured, the final extracted data ends up in structured form, and that is what we work with at the last stage. SQL is a popular query language for analyzing structured data.

Apache Spark provides distributed in-memory computing. Spark ships with a built-in module called Spark SQL for structured data processing. Users can mix SQL queries with Spark programs, and Spark SQL integrates seamlessly with the other constructs of Spark.
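For illustration, here is a minimal sketch (not taken from the course material) of mixing SQL with the DataFrame API in the Scala spark-shell, where the spark SparkSession is predefined; people.json is a hypothetical input file:

    // Register a DataFrame as a temp view, then query it with SQL; the result is again a DataFrame
    val people = spark.read.json("people.json")          // hypothetical file
    people.createOrReplaceTempView("people")

    val adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.groupBy("age").count().show()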

Spark SQL can load and write data from various sources such as RDBMSs, NoSQL databases and cloud storage like S3, and it easily handles different data formats such as Parquet, Avro, JSON and many more.
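As a rough sketch of that reader/writer API (the paths and format choices below are illustrative assumptions, not course files):

    // Read one format, write another; other sources plug in through .format(...)
    val events = spark.read
      .format("json")
      .load("/data/events.json")          // hypothetical path

    events.write
      .format("parquet")
      .mode("overwrite")
      .save("/tmp/events_parquet")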

Spark provides two types of APIs:

Low-level API - RDD

High-level API - DataFrames and Datasets
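A small spark-shell sketch contrasting the two levels (the sample data and column names are made up for illustration):

    // Low-level API: an RDD of plain Scala objects
    val rdd = spark.sparkContext.parallelize(Seq(("nhl", 1), ("bike", 2)))
    val doubled = rdd.mapValues(_ * 2)

    // High-level API: a DataFrame whose schema the Catalyst optimizer can reason about
    import spark.implicits._
    val df = doubled.toDF("dataset", "games")
    df.printSchema()
    df.show()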

Spark SQL works very well with the other components of Spark, such as Spark Streaming, Spark Core and GraphX, because there is good integration between the high-level and low-level APIs.

The initial part of the course introduces the Lambda Architecture and the big data ecosystem. The remaining sections concentrate on reading and writing data between Spark and various data sources.

DataFrames and Datasets are the basic building blocks of Spark SQL. We will learn how to work with transformations and actions on RDDs, DataFrames and Datasets.
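For example, a minimal sketch of lazy transformations versus actions (the sample data is hypothetical):

    import spark.implicits._
    val trips = Seq(("SF", 12), ("SJ", 7), ("SF", 3)).toDF("city", "minutes")

    // Transformations are lazy: nothing is executed yet
    val sfTrips = trips.filter($"city" === "SF").select($"minutes")

    // Actions trigger execution of the whole lineage
    println(sfTrips.count())
    sfTrips.show()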

We also cover table optimization with partitioning and bucketing.
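A rough sketch of what that looks like through the DataFrameWriter (the trips DataFrame and its columns are assumptions for illustration):

    // bucketBy requires writing to a metastore-backed table via saveAsTable
    trips.write
      .partitionBy("start_city")        // one directory per distinct value
      .bucketBy(8, "bike_id")           // hash rows into 8 buckets within each partition
      .sortBy("bike_id")
      .mode("overwrite")
      .saveAsTable("trips_optimized")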

To facilitate the understanding of data processing, the following use cases have been included to show the complete data flow:

1) NHL Dataset Analysis

2) Bay Area Bike Share Dataset Analysis


Updates:

++ Apache Zeppelin notebook (Installation, configuration, Dynamic Input)

++ Spark Demo with Apache Zeppelin


Who this course is for:
  • Beginners who want to get started with Spark SQL on Apache Spark
  • Data Analysts and Big Data analysts
  • Those who want to leverage in-memory computing against structured data
Course content
46 lectures 04:21:39
+ Introduction
4 lectures 15:58
  • Need for Apache Spark

  • Various sub components of Apache Spark

  • Overview of distributed memory and role of structured data processing in Spark

Preview 01:56
  • Need for distributed storage

  • Need for distributed processing

  • Bottlenecks of traditional processing and network overhead

  • Sending the program close to the data

  • Processing the data where it resides

  • Role of in-memory data in iterative computing

  • Distributed in memory computing concepts

Preview 04:11
  • Introduction to Lambda Architecture

  • Three layers of Lambda architecture

  • Details on the Speed Layer, Batch Layer and Serving Layer

  • Role of model generation, Machine learning analytics, Big Data and Data Ingestion

  • Different sub components of Spark Stack and mapping to lambda architecture layers

  • Role of Spark SQL, MLlib and Spark Streaming

Lambda Architecture and Spark Stack
06:07
  • Introduction to Master Worker architecture

  • Introduction to different resource managers like Mesos, YARN and Spark Standalone

  • Spark standalone architecture

  • Role of Driver, Spark Context, Executor and tasks in Spark Master and Spark Workers

Master - Worker Architecture - Spark Standalone Cluster
03:44
+ Apache Spark - In memory computing
4 lectures 19:29
  • Architecture of Spark - Cluster mode of execution

  • Architecture of Spark - Client mode of execution

  • Difference between Cluster and Client mode of execution

  • Pros and cons of Cluster and Client mode of execution

  • Spark Driver execution location

Spark - Different Execution Mode
03:34
  • Introduction to Yet Another Resource Negotiator (YARN)

  • Introduction to its different components

  • Learn about the Node Manager, Application Master, Resource Manager and YARN gateway and their roles

  • Spark execution in YARN

Spark on YARN
03:04
  • Introduction to AWS EC2

  • Setup EC2 for Spark Installation

  • Setup security group

  • Configure private key / public key

  • Connect to EC2 with the private key using PuTTY

Prepare AWS EC2 Instance
06:48
  • Connect to EC2 Instance

  • Install required Java package

  • Download and unpack Spark package

  • Explore various files and folder structure in Spark

  • Start Spark Shell locally

  • Verify Spark Shell with spark context

  • Explore Spark Web UI with default configuration

  • Explore pyspark Shell

Spark Local Installation and Spark Shell Verification
06:03
+ Spark Different Modes of Execution
4 lectures 20:49
  • Introduction to Spark Session

  • Different contexts and their uses

  • Introduction to Spark Context, SQL Context, Streaming Context and Hive Context

  • Details on Spark Session in Spark Shell

  • Introduction to Spark-SQL Shell

Spark Session - Different Shell - Scala Shell and PySpark
05:46
  • Start and test spark shell in cluster mode

  • Start and test spark shell in client mode

  • Overview of spark application execution in Cluster and Client mode

Spark Cluster and Client Mode
09:40
  • Start a Spark standalone cluster with multiple machines

  • Start the Spark shell against the standalone cluster

  • Spark shell web UI for cluster

  • Execute sample job and verify DAG cycle and jobs in web UI

Spark Standalone Cluster - Spark Shell
02:35
  • Configure web ui to access log files

  • Access logs of cluster from web UI

Accessing log files
02:48
+ Low Level API
2 lectures 14:40
  • Understanding DataFrames

  • Internal working of DataFrames on distributed memory

  • Concept of handling each row as the generic Row class type

  • Creating a sample DataFrame

  • Visualize the DataFrame with the printSchema and show options

Dataframe
10:57
  • Understand Dataset

  • Difference between Dataset and Dataframe

  • Introduction to Case Class

  • Using Case Class with Dataset

  • Create and visualize Dataset

Preview 03:43
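A minimal sketch of the Dataset pattern this lecture covers, with a hypothetical Player case class:

    // A case class gives the Dataset a typed schema via an implicit Encoder
    case class Player(id: Int, name: String, team: String)

    import spark.implicits._
    val ds = Seq(Player(1, "Sidney", "PIT"), Player(2, "Connor", "EDM")).toDS()
    ds.printSchema()
    ds.filter(_.team == "PIT").show()     // typed lambda, unlike an untyped DataFrame filter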
+ Spark Components and Architecture
4 lectures 28:50
  • Various sub components of Apache Spark

  • Introduction to Spark SQL sub-components like the Catalyst Optimizer, DataFrames, Datasets, etc.

  • Introduction to other components like Streaming, MLlib, etc.

  • Roles played by various components in data ingestion

Preview 05:06
  • Purpose of partitioning the data

  • How partitioning works in Spark

  • Impact of partitioning the data

  • Visualize partitioned data and processing of partitioned data

Spark Partitions Introduction
08:24
  • Introduction to RDD transformations and actions

  • RDD and Directed Acyclic Graph (DAG) during transformation

  • Optimization using DAG cycle during transformation

Transformations and Actions
09:15
  • Introduction to Catalyst Optimizer

  • Purpose and logical architecture of Catalyst Optimizer

  • Logical and Physical plan selection and Catalyst optimizer role

  • Overview about logical optimization

  • Overview on physical optimization

Catalyst Optimizer
06:05
+ Data Ingestion - Data Sources
8 lectures 59:08
  • Starting the Spark shell with the MySQL connector driver

  • Reading data from a MySQL database

  • Writing data to a MySQL database

MySQL Read and Write
20:09
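As a hedged sketch of the JDBC read/write pattern (connection details, table names and the connector version below are placeholders, not the course's values):

    // Launch the shell with the driver, e.g. spark-shell --packages mysql:mysql-connector-java:5.1.47
    val employees = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/testdb")
      .option("dbtable", "employees")
      .option("user", "spark")
      .option("password", "****")
      .load()

    employees.write
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/testdb")
      .option("dbtable", "employees_copy")
      .option("user", "spark")
      .option("password", "****")
      .mode("append")
      .save()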
  • Reading a MySQL table into multiple partitions

  • Verify partition with web UI

  • Analyze the impact of partition in RDD

Preview 03:40
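A sketch of how such a partitioned JDBC read is usually expressed (the bounds, column and connection details are illustrative assumptions):

    val partitioned = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/testdb")
      .option("dbtable", "employees")
      .option("user", "spark")
      .option("password", "****")
      .option("partitionColumn", "id")      // numeric column to split on
      .option("lowerBound", "1")
      .option("upperBound", "100000")
      .option("numPartitions", "4")
      .load()

    println(partitioned.rdd.getNumPartitions)   // expected to report 4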
  • Understand MongoDB

  • Set up MongoDB with mLab

  • Import data into MongoDB

  • Start the Spark shell with the MongoDB connector

  • Read data from MongoDB

Read and Write data from MongoDB
09:30
  • Configure spark excel package

  • Start spark shell by including excel package

  • Read data from xls file

  • View and verify data

MS xlsx file as Datasource
05:07
  • Introduction on AWS S3

  • Configure Secret access and access key Id

  • Creating S3 bucket

  • Read data from S3

  • View and verify data

AWS Simple Storage Service S3 as Datasource
04:58
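A hedged sketch of reading from S3 via the s3a connector (assumes the hadoop-aws jars are on the classpath; the keys, bucket and file are placeholders):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY_ID")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_ACCESS_KEY")

    val stations = spark.read.option("header", "true").csv("s3a://my-bucket/data/stations.csv")
    stations.show(5)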
  • Introduction to JSON file

  • Read JSON file

  • View JSON file schema

  • Convert JSON to DataFrames

JSON File as Datasource
06:20
  • Introduction to avro format

  • Download and use spark avro package with spark shell

  • Read a JSON file and store as avro

  • Overview of different configuration options like compression, deflate level, etc.

  • Read and view avro files

Avro File as Datasource
05:42
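A sketch of the Avro round trip, assuming Spark 2.4+ where the spark-avro module is published as org.apache.spark:spark-avro (the file names and version are placeholders):

    // Launched as: spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.4
    spark.conf.set("spark.sql.avro.compression.codec", "deflate")
    spark.conf.set("spark.sql.avro.deflate.level", "5")

    val plays = spark.read.json("game_plays.json")         // hypothetical JSON input
    plays.write.format("avro").save("/tmp/game_plays_avro")

    spark.read.format("avro").load("/tmp/game_plays_avro").show(5)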
  • Introduction on Parquet files

  • Read json file and store as parquet file

  • Read and view parquet file

Preview 03:42
+ Working with Spark SQL Shell
9 lectures 50:42
  • Introduction to SparkSQL shell

  • Open SparkSQL shell

  • Difference between schema-on-read and schema-on-write

  • Create table and load data in managed table

  • Fetch records from managed table

Introduction to Spark SQL Shell
08:11
  • Spark warehouse directory purpose

  • Default spark warehouse directory

  • Customize warehouse directory

Customize Data Warehouse Dir
06:14
  • Overview on external table

  • Load CSV files to external table

  • Pros and Cons of external table

Create External Table with CSV
04:52
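A sketch of an unmanaged ("external") CSV table; the same DDL can be typed at the spark-sql> prompt (the path and columns are placeholders):

    spark.sql("""
      CREATE TABLE stations_ext (station_id INT, name STRING, city STRING)
      USING csv
      OPTIONS (path '/data/stations', header 'true')
    """)
    spark.sql("SELECT * FROM stations_ext LIMIT 5").show()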
  • Analyze NHL game data

  • Read NHL csv files

  • Create Dataframe from NHL data RDD

  • Analyze behavior of partitions, RDD, Performance with different queries

Use Case : NHL Game data Analysis
08:09
  • Create table in Parquet format

  • Access table data using Spark-SQL shell

  • Verify parquet files in data warehouse directory

Creating Table in Parquet Format
02:28
  • Overview of partition in Spark

  • Create partitioned table

  • Verify and execute query in partitioned data

Table Partition
04:28
  • Overview of Bucketing with partitions

  • Purpose and use of bucketing data

  • Create bucketed table

  • Load and analyse data from bucketed table

Table Bucketing
06:13
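A sketch of the corresponding DDL for partitioned and bucketed tables; the same statements also work at the spark-sql> prompt (table and column names are placeholders):

    spark.sql("""
      CREATE TABLE trips_part (trip_id INT, duration INT, start_city STRING)
      USING parquet
      PARTITIONED BY (start_city)
    """)

    spark.sql("""
      CREATE TABLE trips_bucketed (trip_id INT, duration INT, bike_id INT)
      USING parquet
      CLUSTERED BY (bike_id) INTO 8 BUCKETS
    """)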
  • Overview of views

  • Create views from existing table

  • Select data from views

Views
03:05
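A brief sketch of the views workflow, assuming an existing trips table (the names are placeholders):

    // A metastore view over an existing table
    spark.sql("CREATE VIEW long_trips AS SELECT * FROM trips WHERE duration > 3600")
    spark.sql("SELECT COUNT(*) FROM long_trips").show()

    // A session-scoped temporary view created from a DataFrame
    spark.read.parquet("/tmp/trips_parquet").createOrReplaceTempView("trips_tmp")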
  • Analyse Ford GoBike share data

  • Create dataframe from CSV file

  • Create required tables for Stations, Status, Trips and Weather

  • Load data to all the required tables

  • Analyse the data with various queries

Use Case: Bay Area Bike Share Data - FordGoBike
07:02
+ Visualization with Apache Zeppelin
8 lectures 51:43
  • Introduction to Apache Zeppelin

  • Architecture and various components of Apache Zeppelin

  • Overview on Zeppelin UI, Notebook, etc.,

Preview 03:28
  • Install Zeppelin from binary

  • Configure system requirements like JDK and JAVA_HOME

  • Zeppelin folder overview

  • Configure Zeppelin port

  • Start Zeppelin daemon

  • Zeppelin UI overview

Zeppelin Installation
05:25
  • Zeppelin UI Overview

  • Create and Import Notebook

  • Play with interpreter and settings

  • Accessing saved notebook

  • Configure, enable and disable interpreters

Zeppelin UI Overview
04:37
  • Hadoop Distributed File System (HDFS) overview

  • Access HDFS files list

  • Configure HDFS Interpreter

  • List HDFS files

Zeppelin Interpreter HDFS
04:47
  • Zeppelin notebook functionality overview

  • Connect to MySQL database

  • View different database available

  • List various tables

  • Try the notebook versioning functions

  • Compare different versions of the notebook

  • Overview on note permissions, Configuration, Interpreter settings and keyboard shortcuts

Preview 04:47
  • Introduction to Apache Hive and its components

  • Overview of configuration details to connect to Hive

  • Configuring JDBC Interpreter and required maven artifacts

  • Create and configure interpreter

  • Access hive database

  • Load data to hive tables

  • Query hive tables

  • Execute various hive queries and visualize

  • Arrange multiple query visualization

Zeppelin Interpreter Hive
12:41
  • Introduction to Apache Spark

  • Configure Spark Interpreter

  • Configuring Spark parameters in Interpreter

  • Execute various Spark SQL queries

  • Visualize Spark SQL query results

  • Arrange and visualize various query results

Zeppelin Interpreter Spark
08:03
  • Introduction to dynamic input forms

  • Creating and using various input elements

  • Discuss various scopes of input elements

  • Demo on various scopes

Zeppelin Dynamic Input elements
07:55
+ Data files and Other Resources
2 lectures 00:03
Data Files
00:02
Slides used in the session
00:01
+ Bonus Lecture
1 lecture 00:16
Special coupon to join my other courses
00:16