Learn Spark and Hadoop Overnight on GCP

Rating: 4.1 (22 reviews)

Learn Hands-on by Building Your Own System on Spark and Hadoop

Created by CS PRO

Last updated 8/2018

English

What you'll learn

Setting Up Hadoop and Spark for E-Commerce Data Load and Operation
Up and Running With Spark on GCP

Course content

2 sections • 41 lectures • 3h 34m total length

Introduction to E-Commerce Data Load and Operation Setting Up Hadoop & Spark (3:19)
Explore big data in e-commerce by setting up Hadoop and Spark on Google Cloud Platform. Understand the HDFS architecture, high availability, and cold storage for enterprise data.
Data Explosion and Reduction in Storage Cost (4:24)
When Data is Referred as Big Data (5:50)
Discover how enterprises tackle big data with the three Vs: velocity, volume, and variety, using Hadoop to store and process vast data robustly for 10x returns.
Computer Science Behind Big Data Processing (2:57)
Understand theoretical computer science behind big data processing, analyzing loop-based time complexity and how divide-and-conquer and memory distribution power Hadoop to handle large datasets.
Hot and Cold Data (6:00)
Explore hot and cold data in enterprise architectures, storing hot data in ERP main memory for transactions and using Hadoop for cold data analytics with OLAP.
Hadoop Architecture (5:37)
Understand Hadoop architecture with a master name node and data nodes, enabling scalable, low-cost storage, high availability, and cloud-based ERP cold data management via IaaS.
Hadoop Cluster Data Operation (5:37)
Explore the Hadoop cluster architecture, including the name node and data nodes, and learn how files are written as 128 MB blocks with replication for fault tolerance.
High Availability and Replication for Enterprise Part 1 (3:57)
Enable high availability for enterprise systems by replicating data across multiple nodes with a configurable replication factor, guided by the name node, heartbeat monitoring, and rack awareness.
High Availability and Replication for Enterprise Part 2 (4:06)
Explore how Hadoop achieves high availability through replication, a standby name node, and a secondary name node, with ZooKeeper coordination and fsimage and log files on Google Cloud Platform.
GCP Dataproc and Modern Big Data Lifecycle (2:19)
Discover how Google Cloud Dataproc enables an end-to-end big data lifecycle and rapid Hadoop and Spark deployment in minutes.
Data Load into HDFS or Storage Bucket (3:36)
Learn how to load data into HDFS or a storage bucket on GCP, compare unified storage with HDFS, and spin up HDFS for scalable map-reduce processing.
Configuring and Running Hadoop in GCP with Dataproc (9:13)
SSH Inside the Master Node and HDFS Files System (6:16)
Learn to access the Spark environment on the GCP master node, explore Hadoop HDFS, start Spark and PySpark shells, use the web UI, and safely terminate clusters to save costs.
Summary - Part 1 (1:20)
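
The block and replication figures mentioned in the Hadoop Cluster Data Operation lecture can be made concrete with a little arithmetic. The following is a minimal Python sketch, not course material: the function name hdfs_footprint is hypothetical, and the replication factor of 3 is an assumption (it is the usual HDFS default, while the listing only says the factor is configurable).

```python
import math

BLOCK_SIZE_MB = 128  # block size named in the lecture description


def hdfs_footprint(file_size_mb, replication=3):
    """Return (block count, total block replicas, raw storage in MB).

    replication=3 is an assumed default for illustration only.
    """
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is copied to `replication` data nodes.
    replicas = blocks * replication
    # A partial last block only occupies the space of the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 8 24 3000
```

Under these assumptions, a 1000 MB file occupies 8 blocks, stored as 24 block replicas, so losing one data node still leaves two copies of every block.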

Introduction to Up and Running With Spark on GCP - Part 2 (3:34)
Learn to set up and run Spark on Google Cloud Platform, explore Spark architecture, data frames, transformations and actions, and access data from Hadoop on GCP through practical hands-on exercises.
SSH to Google VM Instance from Local Machine (8:08)
Processing With Spark (3:59)
Spark Inner Working - Part 1 (8:59)
Explore how Spark works: from driver program and Spark context to master-slave cluster management, map and reduce operations, and JVM-based execution across worker nodes on GCP.
Spark Inner Working - Part 2 (5:19)
Understand Spark's inner workings: RDD partitions, immutability, and transformations, then explore data frames with schemas that enable faster operations and dramatic speed gains on distributed data.
Running Spark on Google® Cloud Platform (5:12)
Learn to spin up a two-node Spark cluster on Google Cloud Platform, using an initialization action to start a Jupyter notebook, and automate provisioning via the REST API.
Launching Jupyter Notebook in Spark (3:48)
Launch a Spark cluster on GCP, access the master node via a VM, start a PySpark session and Jupyter notebook, and monitor the Hadoop cluster through the dashboard.
A Simple Introduction to RDD (12:40)
Discover Spark RDD basics, autocompletion, partitioning with parallelize, and map-collect lazy evaluation; note Python 3 differences and range concepts.
RDD Architecture in Spark (2:57)
Explore how Spark's RDD architecture distributes data across master and worker nodes on GCP, detailing partitions, memory, lazy evaluation, and map and collect driving computation.
RDD Transformation and Actions: Map and Collect Respectively - Part 1 (3:21)
Explore RDD transformation and actions with map and collect, detailing immutability, partitioning, and how a minus one function is applied and evaluated on Spark.
RDD Transformation and Actions: Map and Collect Respectively - Part 2 (1:37)
Explore how RDD transformations like map create new datasets and how the collect action reveals results across partitions, emphasizing immutability and practical use in Spark workloads.
RDD Transformation and Actions - Filter and Collect Respectively (11:22)
Explore RDD transformation and actions by applying filter and collect in Spark, learning lazy evaluation, partitioning, and performance optimization across large datasets.
RDD Transformation and Actions - Filters and Collect (5:07)
Explore RDD transformations and actions by using filter with a lambda, then retrieve elements with first and take, and sort to obtain top elements in ascending or descending order.
RDD Transformation and Actions - Collect and Reduce (5:33)
Explore RDD transformations and actions by performing reduce to compute a single value, with an emphasis on associative and commutative operators, and why subtraction is unsuitable.
RDD Advance Transformation and Actions - Flatmaps (4:50)
RDD Advance Transformation and Actions - FlatMaps Example (4:09)
RDD Advance Transformation and Actions - groupByKey and reduceByKey Basics (11:11)
Compare reduceByKey and groupByKey on RDDs, focusing on partitioning and shuffling, memory use, and performance. Learn when to prefer reduceByKey for large data and groupByKey for smaller, simpler workloads.
RDD Advance Transformation and Actions - groupByKey and reduceByKey Example (2:50)
Explore groupByKey and reduceByKey in Spark by creating key-value pairs, applying a lambda to sum values, and performing collect as a combined transformation and action.
RDD Advance Transformation and Actions - groupByKey and reduceByKey Example (6:28)
Explore advanced RDD transformations with groupByKey and reduceByKey by building sums from keys, using mapValues, and handling Spark errors to optimize performance.
RDD Caching, Persistence and Summary (6:10)
Explore RDD caching and persistence in Spark on GCP, using in-memory storage levels, replication, and persist and unpersist.
Why Dataframe and Basics of Dataframe (3:10)
Understand why data frames with a schema enable Spark to optimize access, outpacing datasets by turning partitions into a tabular structure for faster queries.
Installing Faker to Generate Random Data for Examples (2:59)
Creating Random User Data With Faker and Creating Dataframe (5:26)
Generate 100 random user records with Faker, including last name, first name, SSN, occupation, and age, then create a Spark DataFrame and inspect the schema.
Working with DataFrame and Understanding Functionalities - Part 1 (5:20)
Explore Spark data frames, demonstrating immutability, showing records, and filtering with a user defined function to select ages under 10, highlighting optimization and performance.
Working with DataFrame and Understanding Functionalities - Part 2 (2:22)
Explore data frames in Spark, distinguish actions from transformations, and use lazy evaluation to optimize operations, including selecting columns, dropping data, and performing group by with max and average.
Working with DataFrame and Understanding Functionalities - Part 3 (6:10)
Explore data frame operations with actions like show, desc, and first; sort by age to locate the max, and use count, distinct, max, min, and average to handle duplicates.
Working with DataFrame and Understanding Functionalities - Part 4 (7:20)
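
The Collect and Reduce lecture above notes that subtraction is unsuitable for reduce. A rough way to see why, without a cluster, is to simulate partitioned reduction in plain Python. This is an illustrative sketch rather than Spark code: the helpers partition and distributed_reduce are hypothetical names, and real Spark decides partition boundaries and merge order itself.

```python
from functools import reduce


def partition(data, num_partitions):
    """Split a list into contiguous chunks, one per simulated partition."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def distributed_reduce(data, func, num_partitions):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(func, chunk) for chunk in partition(data, num_partitions)]
    return reduce(func, partials)


numbers = [1, 2, 3, 4, 5, 6]


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


# Same data, same function, different partition counts.
for n in (1, 2, 3):
    print(n, distributed_reduce(numbers, add, n), distributed_reduce(numbers, sub, n))
```

Addition returns 21 for every partition count, while subtraction returns -19, 3, and 1 for one, two, and three partitions. An operator that is associative and commutative gives the same answer however the data is split, which is the property a distributed reduce relies on.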

Requirements

Basic knowledge of Hadoop and Spark is required

Description

This is a comprehensive hands-on course on Spark and Hadoop.

This course focuses on Big Data and the open source solutions around it.
We need these tools for the e-commerce end of Project CORE (Create your Own Recommendation Engine), a one-of-a-kind project for learning technology end to end.
We will explore Hadoop, one of the prominent Big Data solutions.
We will look at the why and the how of Hadoop and its ecosystem, its architecture and basic inner workings, and will also spin up our first Hadoop cluster in under 2 minutes on Google Cloud.
This course is used in Project CORE, a comprehensive project on hands-on technologies. In Project CORE you will learn more about building your own system on Big Data, Spark, Machine Learning, SAPUI5, Angular4, D3JS, and SAP® HANA®.
With this course you will get a brief understanding of Apache Spark™, a fast and general engine for large-scale data processing.
Spark is used in Project CORE to manage Big Data with the HDFS file system. We store 1.5 million book records in Spark and implement a collaborative filtering algorithm.
Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python and R shells.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Runs Everywhere - Spark runs on Hadoop, Mesos, standalone, or in the cloud.

Who this course is for:

Hadoop Learners
Hadoop Developers

Course content

Introduction: 14 lectures • 1hr 5min

Up and Running With Spark on GCP: 27 lectures • 2hr 30min