
Agenda of Course
What is Apache Spark?
Why Apache Spark?
Who uses Apache Spark?
Real Use Case of Apache Spark?
Get a Google Cloud Platform free trial for the first 90 days and use Google Cloud machines for practice.
You will not be charged. We will create an account and a cluster, then run spark-shell directly.
That's it, the setup is done.
https://console.cloud.google.com/
Another option: https://cloud.ibm.com/catalog
Create a big data cluster, check the Spark and Hadoop applications, then shut down the cluster
Components in Apache Spark
Actual working of Spark engine
Spark in local mode, standalone mode, client mode, and cluster mode
How exactly a Spark job is executed in the Spark engine
Configure a local machine running Unix/Linux (for example, Ubuntu)
Install Java 1.8.x
Install Hadoop 2.8.x
Install Hive 3.x or Hive 2.x
Install Spark 2.4.x
Install Scala 2.11.3
Set properties required to do a standalone setup
Spark Shell in local mode
Spark shell in google cloud environment
Scala Basics
What exactly are SparkContext and SparkSession
Tabs in Spark Job UI
For loops, match/case (Scala's counterpart to switch/case), var and val in Scala programming
How to read a file line by line in the Spark shell using Scala code
Special variables in Spark
Broadcast and accumulator
What is an RDD? RDD features
Transformation
Action
Logic & Demo
map, flatMap, reduceByKey
What is a Spark Dataset?
RDD vs DataFrame vs Dataset
How to analyze spark-shell commands in the Spark UI
Read CSV, JSON, XML, or text data in Spark
See https://github.com/databricks/spark-xml to learn more about XML file read and write operations
How to write data in JSON, Parquet, Avro, or text format to a local or Hadoop path
Spark configuration changes to add Hive details
Spark SQL working architecture
Create a Hive table with a CTAS query
Partitioning, bucketing, sorting, and the MSCK and REFRESH utilities in Spark SQL statements
Trick of the Day
Spark SQL and similar Dataset/Dataframe API functions
Hive operations in Spark SQL such as select, filter, where, group by, order by, agg, sum, count
How to write queries using window functions such as:
RANK()
DENSE_RANK()
ROW_NUMBER()
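To make the difference between these three window functions concrete, here is a minimal pure-Python sketch (no Spark required) that mimics what RANK(), DENSE_RANK(), and ROW_NUMBER() return over a single partition ordered by salary descending. The employee names and salaries are made-up sample data, and `window_ranks` is a hypothetical helper name for illustration, not a Spark API.

```python
from typing import List, Tuple


def window_ranks(rows: List[Tuple[str, int]]) -> List[Tuple[str, int, int, int, int]]:
    """Return (name, salary, rank, dense_rank, row_number), ordered by salary descending."""
    # Python's sort is stable, so tied rows keep their input order.
    # In Spark, the order of tied rows (and so their ROW_NUMBER) is not guaranteed.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    result = []
    rank = 0
    dense_rank = 0
    prev_salary = None
    for row_number, (name, salary) in enumerate(ordered, start=1):
        if salary != prev_salary:
            rank = row_number   # RANK skips ahead after a group of ties
            dense_rank += 1     # DENSE_RANK never leaves gaps
            prev_salary = salary
        result.append((name, salary, rank, dense_rank, row_number))
    return result


# Made-up sample data: two employees tie on the highest salary.
employees = [("asha", 900), ("ben", 800), ("chen", 900), ("dev", 700)]
ranked = window_ranks(employees)
for row in ranked:
    print(row)
```

The tied rows share rank 1 under both RANK() and DENSE_RANK(); the next distinct salary gets rank 3 under RANK() but 2 under DENSE_RANK(), while ROW_NUMBER() simply counts 1, 2, 3, 4.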
Data sampling: analyze big files using random sampling with the RAND() function and block sampling using SQL
Spark optimization technique 1: cache or persist your DataFrame or RDD in memory or on disk to speed up your actions
Different Types of Joins in Spark
Join with SQL Query and Dataset API
LEFT / RIGHT / INNER / FULL / LEFT SEMI / LEFT ANTI joins
How to read a join DAG in the Spark UI
Spark Property Details
https://spark.apache.org/docs/latest/configuration.html#viewing-spark-properties
Sort merge join, broadcast join, and bucket map join on Hive tables
DAG analysis of SQL queries
Which join to prefer and which configuration settings control it
How to calculate Spark executor and driver memory, cores, and the number of executors
How to set parallelism in a Spark job
Read this page after watching the video:
https://spark.apache.org/docs/latest/configuration.html#spark-properties
How to set Spark configuration parameters with spark-shell
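As a rough illustration of the executor sizing question above, here is a small Python sketch of one commonly cited rule of thumb. It is an assumption for illustration, not necessarily the method taught in the course: reserve 1 core and 1 GB per node for the OS and daemons, use about 5 cores per executor, keep one executor slot for the application master, and leave room for the memory overhead (Spark's documented default is 10% of executor memory with a 384 MiB minimum). The function name `size_executors` and the cluster figures are hypothetical.

```python
def size_executors(nodes: int, cores_per_node: int, mem_per_node_gb: int,
                   cores_per_executor: int = 5) -> dict:
    """Rule-of-thumb executor sizing (an assumed heuristic, not an official formula)."""
    usable_cores = cores_per_node - 1      # assumption: 1 core per node for OS/daemons
    usable_mem_gb = mem_per_node_gb - 1    # assumption: 1 GB per node for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for even one executor")
    # Assumption: keep one executor slot free for the application master.
    num_executors = max(1, executors_per_node * nodes - 1)
    container_gb = usable_mem_gb / executors_per_node
    # Container = heap + overhead, where overhead = max(384 MiB, 10% of heap).
    heap_gb = int(min(container_gb / 1.10, container_gb - 0.375))
    return {
        "num_executors": num_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


# Hypothetical cluster: 10 worker nodes, 16 cores and 64 GB RAM each.
plan = size_executors(nodes=10, cores_per_node=16, mem_per_node_gb=64)
print(
    "spark-shell --num-executors {num_executors} "
    "--executor-cores {executor_cores} "
    "--executor-memory {executor_memory_gb}g".format(**plan)
)
```

Treat the result as a starting point only; the right values depend on the workload and the cluster manager, and should be checked against the Spark UI.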
Spark code application
spark-submit command
We will create a Scala project with Maven in Eclipse that reads CSV, JSON, Parquet, or Avro data and
writes it to an HDFS external table path in overwrite mode.
The write step ensures that every output file is no larger than 128 MB, so that Hadoop blocks are not wasted on many small files.
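The 128 MB limit above comes down to simple arithmetic: divide the estimated output size by the target file size and round up. The Python sketch below shows only that calculation; the function name `output_file_count` is hypothetical, and the estimated output size is an assumption you would have to supply yourself, since the on-disk size depends on the format and compression.

```python
TARGET_FILE_BYTES = 128 * 1024 * 1024  # 128 MB target from the text above


def output_file_count(estimated_output_bytes: int,
                      max_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """Smallest file count such that an even split keeps each file <= max_file_bytes."""
    if estimated_output_bytes <= 0:
        return 1  # still write a single (possibly empty) file
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return (estimated_output_bytes + max_file_bytes - 1) // max_file_bytes


one_gib = 1024 * 1024 * 1024
print(output_file_count(one_gib))            # exactly 8 files of 128 MB
print(output_file_count(one_gib + 1))        # one extra byte needs a 9th file
print(output_file_count(300 * 1024 * 1024))  # 300 MB needs 3 files
```

In a Spark job, a count like this is typically passed to `repartition` before the write; the split is only even if the rows are distributed evenly, so individual files can still end up somewhat above or below the target.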
What is Spark Streaming?
Use cases where we need to use it
How does it work?
Spark UI for Spark Streaming Job
Scala code to run a Spark Streaming (RDD API) DStream application
Live demo on cloud: consume HDFS text files in real time and compute a word count, displaying the counts on the console and redirecting all output to a Unix stream.log file
How to use checkpoints in Spark
How to use broadcast variables and accumulators in Spark Streaming to skip irrelevant tokens in the word count, such as "is", "a", and ";"
What is Structured Streaming?
Why and where to use it
Watermarking in Spark Structured Streaming
Scala application that reads employee data from HDFS, counts the employees in each department, and displays
the department-wise counts on the console.
Solve a real-world problem by designing a big data application.
Walkthrough of the working model of a sample Spark application for Walmart / D-Mart
Which parameters to set to tune the executor
Which parameters to set to tune the driver
When we have lots of small files
When we have lots of big files
When we have a single large file
Understanding Spark executor memory: how operations such as shuffle joins or persist/cache use it
How to configure it to avoid out-of-memory errors
Preferring broadcast join over sort merge join, HashAggregate over SortAggregate, reduceByKey over groupByKey, etc.
API for join / count / isEmpty / take / head
Spark Streaming with HDFS or Kafka as the source
Spark Structured Streaming with HDFS or Kafka
I have worked as a Big Data Solution Designer in the IT industry for the last few years. I am putting all my learning and experience into this video series so that you can understand how the Spark ecosystem works, work like a professional big data engineer, and get a good job. The course is updated for the latest version.
Benefits of this course:
Enroll in this course and get end-to-end knowledge of Apache Spark + Spark SQL + Spark Streaming + Spark with Hive + real-world use cases + designing a big data project with the Spark ecosystem + use cases asked in interviews. This course is rare of its kind and covers even very fine details of Spark that are hard to find elsewhere online.
In this course you will learn Spark step by step, from the very basics to advanced Spark (as actually used in real-world projects), with the latest Spark version 3.x:
Spark setup, all file formats, Hive optimization concepts such as partitioning, bucketing, and joins, and expert-style Spark code review: all demo / interactive sessions
Google Cloud account setup for hands-on practice of all concepts
Spark SQL clauses: DISTRIBUTE BY, ORDER BY, CLUSTER BY, SORT BY
Scala coding basics
Eclipse application coding with Java 8 as a Maven project using the Spark API
Window functions such as rank, row_number, dense_rank: all demo / interactive sessions
RDD, Dataset, and DataFrame APIs
Different ways to create / insert data in Hadoop or Hive table
Spark Job Configuration Optimization
Spark Application DAG analysis and debugging using spark UI
Spark Streaming & Structured Streaming with Coding in Java
Performance techniques that big companies use to query data fast.
This course is a full package that explains even rarely used commands and concepts in Spark. After completing this course you won't find any Spark topic left uncovered. The course is designed around real implementations of Spark in live projects.
Additionally, you can download the step-by-step installation guide (doc) to install Scala and Apache Spark.