Big Data Foundation for Developers

Name: Big Data Foundation for Developers
Rating: 4.3 (13 reviews)

A hands-on course for developers on the popular big data tools Hadoop, Hive and Spark, including machine learning with Spark

Created by Ganapathi Devappa

Last updated 11/2022

English

What you'll learn

Apache Hadoop, Hive and Spark are very popular big data tools used by many organizations. Don't let your skills become obsolete.
Upskill yourself with in-demand big data and machine learning skills
Practice with 20 demos and more than 50 practice activities that push you beyond what you learn in the class to become a big data developer
You will implement machine learning techniques using Spark to solve business problems such as prediction, recommendation engines and anomaly detection.
By the end of this course, you will be able to set up a big data cluster, copy data to it and process that data with big data tools
Query big data using Hive and process it through dataframes in Spark
Store data in Parquet format to take advantage of predicate pushdowns, and chain multiple transformations of data, including windowing and pivoting
Includes an introduction to Scala for use with Spark

Course content

9 sections • 183 lectures • 9h 4m total length

Introduction (2:08)
Introduction to the course
Introduction continued (0:51)
Course prerequisites (0:53)
Course Structure (0:59)
Data Sizes in Big Data (1:59)
How is big data technology different? (7:21)
3 Vs of Big Data (2:47)
Big data case study (2:06)
Big Data Solution (2:33)
Big data solution stages (2:41)
Apache Hadoop (4:17)
Yarn (0:59)
Hive (1:29)
Spark (1:22)
Practice Activity
Things to remember (1:02)

Introduction (0:41)
Big Data and Hadoop (1:45)
Download Test Data (0:06)
Download config files (0:14)
Please download and unzip the enclosed configuration files. You can copy and paste from these files during the installation and setup of the various tools. Add the commands in bashrc_addon to the end of your .bashrc file on Linux. Make sure to change trainer1 in those commands to your Linux username.
HDFS Design Principles (4:14)
HDFS Components (1:39)
Demo: HDFS Block Distribution (6:27)
HDFS Placement Strategy (2:35)
HDFS Block Distribution with Racks (2:35)
HDFS Interfaces (1:28)
Hadoop Installation part 1 (6:30)
Hadoop Installation part 2 (5:03)
Hadoop Installation part 3 (9:47)
HDFS Examples (5:29)
HDFS Root & User Directories (2:29)
More HDFS Examples (3:07)
Who Stores What? (1:53)
Secondary Name Node (2:38)
HDFS Practice Activities (0:47)
Things to remember (0:58)
Solutions for practice activities (0:11)

Introduction (0:49)
Apache Hadoop (0:21)
Map Reduce (1:32)
Case Study: Distributed Processing (0:59)
Case Study: Distributed Processing 2 (1:43)
Map Reduce Diagram (4:06)
Map Reduce Architecture (1:46)
Stages of Map Reduce (1:59)
Map Tasks (2:13)
Map Reduce Example (1:49)
Mapper Class (3:09)
Reduce Tasks (3:17)
Reduce Class (2:20)
Map Reduce Examples (1:51)
Shuffle/Sort (1:28)
Map Reduce Daemons (4:26)
Hadoop Detailed Components (0:51)
Input Format (1:59)
Combiner (2:50)
Map Reduce Puzzle (3:02)
Map Reduce Driver Class (2:29)
Compile Map Reduce Program (3:12)
Demo and practice activity: Review and Compile Map Reduce program (4:46)
Practice Activities (1:53)
Summary (1:11)
Solutions for practice activities (0:25)
Please download the provided zip file, which contains the solutions for the Map Reduce practice activities. You can compile the programs using the instructions provided, but only after setting up Yarn in the next lesson.

Introduction (0:58)
Apache Hadoop (0:28)
Cluster Resources (1:32)
Yarn Architecture (4:10)
Resource Manager (0:55)
Scheduler (1:20)
Capacity Scheduler (4:14)
Node Manager (0:49)
Application Master (0:43)
Demo and Practice Activity: Yarn Installation/Setup (9:41)
Demo and Practice Activity: Map Reduce - YARN (3:24)
Practice Activities (1:40)
Summary (0:48)
Solutions for practice activities (0:07)
Please download the provided zip file, which contains the solutions for the Yarn practice activities. You can compile the programs using the instructions provided in the previous lesson.

Introduction (1:18)
Hive Features (2:11)
Hive Workflow (0:59)
Hive Query Example (0:59)
Demo and practice activity: Hive Installation Part 1 (7:58)
Demo and practice activity: Hive Setup (3:24)
Demo and practice activity: Connect to Hive using beeline (9:44)
Hive Metastore (1:54)
Hive Command Line Interface (0:57)
Hive Data Model (1:48)
Hive Partitions & Buckets (2:38)
Hive External Table (1:05)
Practice Activities (0:39)
Summary (0:43)
Solutions for practice activities (0:06)
Please download the provided zip file, which contains the solutions for the Hive practice activities. You need to connect to Hive using beeline as shown in the lesson.

Introduction (2:17)
What is Spark? (1:08)
Demo and practice activity: Spark Installation (5:05)
Demo and practice activity: Stop Hive server (1:15)
Demo and practice activity: Test spark shell (3:15)
Scala (1:40)
Download Scala Examples (0:12)
Please download the Scala examples file provided. This file is used in the demos of this class to illustrate the Scala language. You can copy and paste these commands into spark-shell to practice Scala. Some lines in the file may have been split in two; correct any resulting errors before running them.
Scala Example (3:41)
Components (3:12)
Data types (6:35)
Scala Operators (2:03)
Scala Statements (2:27)
Loop Statements (4:37)
Functions (4:24)
Anonymous Functions (4:24)
Arrays (3:06)
Collections (6:54)
Classes & Objects (4:01)
Companion Objects (2:50)
Case Classes (3:14)
Traits (2:17)
Place Holders and Higher Order Functions (8:01)
Practice Activities (3:40)
Summary (1:00)
Solutions for practice activities (0:05)
Please download the provided zip file, which contains the solutions for the Spark-Scala practice activities. Run these in spark-shell.

Introduction (1:54)
Spark Architecture (1:06)
RDD (1:10)
RDD Operations (2:25)
Performance (2:01)
Demo and Practice Activity: RDD Operations (14:55)
RDD Operations continued (4:04)
RDD Actions (2:23)
Spark SQL Interface (1:04)
Dataframes (0:41)
Dataframes continued (1:47)
Parquet Format (1:27)
Dataframes from Hive tables (1:03)
Demo and Practice Activity: Dataframes Part 1 (13:13)
Demo and Practice Activity: Dataframes Part 2 (9:11)
Demo and Practice Activity: Dataframes Part 3 (9:21)
Dataframe Operations (1:00)
Dataframe Transformations (2:58)
Dataframe Actions (1:51)
Aggregate Operations including Pivoting (1:28)
Demo and Practice Activity: Dataframes Part 4 (4:03)
Window Operations including Deduplications (4:15)
Demo and Practice Activity: Dataframes Part 5 (9:29)
Dataframe Parallelism (2:33)
Temporary Views (2:17)
Dataframe Caching (1:46)
Spark Web Interface (0:55)
Datasets (1:12)
Demo and Practice Activity: Datasets (6:22)
SQL Queries (1:13)
Demo and Practice Activity: SQL Queries (3:11)
User Defined Functions (2:18)
Demo and Practice Activity: User Defined Functions (4:23)
Demo: AWS EMR Serverless (22:30)
Big data processing is moving to the cloud, and many organizations are exploring serverless big data processing on platforms such as Amazon Web Services EMR Serverless. I added this demo so that you become familiar with a cloud platform for running big data jobs. It is a simple example but covers all the steps needed to run a Spark job. I hope this addition is useful for you. Please ignore the background noise in the video; there was a lot of construction next door.
Spark on Yarn (3:41)
Practice Activities (1:38)
Summary (1:14)
Solutions for practice activities (0:05)
Please download the provided zip file, which contains the solutions for the Spark practice activities. You can run these programs in spark-shell.

Introduction (1:52)
Machine Learning (1:51)
Types of Machine Learning (1:28)
Machine Learning Approaches (5:56)
Spark Machine Learning Library (1:26)
Data types in MLlib (1:59)
Types of Features (1:47)
Decision Tree Example (1:59)
Machine Learning Technique (1:01)
Demo and Practice Activity: Decision Tree (6:50)
One-hot Encoding (2:21)
Demo and Practice Activity: One-hot Encoding (6:48)
Pipeline Model (1:32)
Demo and Practice Activity: Pipeline Model (7:38)
Linear Regression (1:54)
Demo and Practice Activity: Linear Regression (7:08)
Anomaly Detection (2:02)
K-Means Clustering (1:03)
Demo and Practice Activity: Anomaly Detection (8:48)
Collaborative Filtering: Recommendation Engine (0:39)
Demo and Practice Activity: Recommendation Engine (6:36)
Practice Activities (1:43)
Summary (0:58)
Solutions for practice activities (0:05)
Please download the provided zip file, which contains the solutions for the Machine Learning practice activities. You can run these programs in spark-shell.

Conclusion and next steps (2:16)
AWS EMR and Step Functions (3:34)
Explains setting up big data on the cloud with AWS EMR and workflow orchestration with Step Functions.
Linux virtual machine on Windows (12:34)
Download and set up this Ubuntu Linux virtual machine on Windows; it comes loaded with all the big data software taught in this course. The virtual machine has a very low footprint (about 5GB to download and 8GB on disk) and can run with just 2GB of memory on Windows.
Use WinSCP to transfer files (1:26)
You can use WinSCP to securely copy files from Windows to your virtual machine and from the virtual machine to Windows.
Troubleshoot connection issue (1:44)
Sometimes the NAT address range in VMware and the virtual machine may cause a connection problem. This shows one technique to correct the issue.

Requirements

Development experience with Java or C++, database experience.
Access to a Linux environment or a Linux virtual machine on Windows.

Description

Apache Hadoop, Yarn, Hive and Spark are popular big data tools used by many organizations to develop big data analytics solutions. In this course, students develop big data applications using these tools to process data and derive valuable insights from it. By the end of the course, students will be able to set up a personal big data development environment, master the fundamental concepts of Hadoop, Yarn, Hive and Spark, copy data into and out of a big data cluster, process data using the Map/Reduce paradigm, run Map/Reduce and Spark jobs on Yarn, process big data using the Scala programming language in Spark, use RDDs and dataframes to process big data, store data in Parquet format, and use Spark's machine learning libraries to develop solutions such as decision trees, recommendation engines, linear regression and anomaly detection.

This is a hands-on development course with more than 50 practice activities. While Java knowledge is assumed, the fundamentals of Scala are taught so that you can write Scala code to process data in Spark. The course provides a foundation for developers to join big data development teams in their organizations.

Who this course is for:

Beginners who want to learn big data tools
Talented professionals who want to practice with the popular big data tool Spark and machine learning
IT professionals keen to upskill with a lot of big data practice

Course content

Introduction • 15 lectures • 33min
Lesson 2 Hadoop - HDFS • 21 lectures • 1hr 1min
Hadoop Map/Reduce • 26 lectures • 56min
YARN • 14 lectures • 31min
Hive • 15 lectures • 36min
Spark Scala • 25 lectures • 1hr 21min
Spark RDDs, Dataframes and SQL • 38 lectures • 2hr 28min
Spark Machine Learning • 24 lectures • 1hr 15min
Conclusion • 5 lectures • 22min
