Spark Project on Cloudera Hadoop(CDH) and GCP for Beginners

Rating: 3.7 (124 reviews)

Building Data Processing Pipeline Using Apache NiFi, Apache Kafka, Apache Spark, Cassandra, MongoDB, Hive and Zeppelin

Created by PARI MARGU

Last updated 4/2021

English

What you'll learn

Complete Spark Project Development on Cloudera Hadoop and Spark Cluster
Fundamentals of Google Cloud Platform(GCP)
Setting up Cloudera Hadoop and Spark Cluster(CDH 6.3) on GCP
Features of Spark Structured Streaming using Spark with Scala
Features of Spark Structured Streaming using Spark with Python(PySpark)
Fundamentals of Apache NiFi
Fundamentals of Apache Kafka
How to use NoSQL databases such as MongoDB and Cassandra with Spark Structured Streaming
How to build Data Visualisation using Python
Fundamentals of Apache Hive and how to integrate with Apache Spark
Features of Apache Zeppelin
Fundamentals of Docker and Containerization

Course content

10 sections • 47 lectures • 10h 54m total length

Course Introduction (4:13)

Workaround for setting up Cloudera CDH on GCP (0:15)
Environment Setup Overview (1:16)
Create Free Trial Account in Google Cloud Platform(GCP) (9:52)
Create VM instance using Compute Engine in GCP (21:35)
Learn how to create a Google Compute Engine VM, choose region and machine type, install a Linux OS, and configure a static external IP with SSH keys and firewall rules.
Setting Up Single Node Cloudera Hadoop CDH 6.3 Cluster in GCP (32:45)
Install Apache NiFi on Single Node CDH 6.3 Cluster (6:29)
Install Apache Kafka on Single Node CDH 6.3 Cluster (15:33)
Install Apache Cassandra on Single Node CDH 6.3 Cluster (15:50)
Install MongoDB on Single Node CDH 6.3 Cluster (7:20)
Install and Configure PyCharm Community Edition for PySpark Application (17:09)
Install & Configure IntelliJ Community Edition for Spark with Scala Application (22:01)
Install IntelliJ IDEA community edition on Windows, set up a Spark with Scala project, add dependencies, and run a local Spark job from a main Scala object.

Introduction to Apache Kafka (11:32)
Key Concepts in Apache Kafka (13:38)
Apache Kafka Architecture (9:11)
Explore Kafka architecture in multi-node clusters, with producers publishing to topics, partitions, and consumer groups processing streams in real time, plus connectors for databases and file systems integration.
Kafka Producer with Hands-On (8:16)
Publish and consume messages with a Kafka producer in Python, creating a topic, configuring a broker, and running a consumer to receive real-time transaction data.
Kafka Consumer with Hands-On (8:04)

Project Architecture(Building Data Processing Pipeline) (9:00)
Generate Retail Data using Apache NiFi Data Pipeline(eCommerce Data Simulator) (14:04)
Spark Structured Streaming and Apache Kafka Integration (14:03)
Building Data Processing Pipeline with Spark Structured Streaming and Cassandra (47:10)
Building Data Processing Pipeline with Spark Structured Streaming and MongoDB (8:49)
Building Data Visualization using Python (14:38)
Project Demo (22:02)
Watch an end-to-end project demo showing a real-time data pipeline with Spark, MongoDB, and Kafka, deploying dashboards, building and running a Spark job, and monitoring streaming data.
How to Install Apache Zeppelin in CDH 6.3 Cluster (14:04)
Data Analysis using Spark SQL in Apache Zeppelin (9:03)

Requirements

Basic understanding of a programming language
Basic understanding of Apache Hadoop
Basic understanding of Apache Spark
Don't worry: Apache Hadoop and Apache Spark basics are also covered for the benefit of absolute beginners
Most important of all: a willingness to learn

Description

In the retail business, retail stores and eCommerce websites generate large amounts of data in real time.
This data must be processed in real time to generate insights that business teams can use to make decisions that increase sales and improve the customer experience.
Because the data is large and arrives in real time, we need to choose the right architecture with scalable storage and computation frameworks and technologies.
Hence, we build a data processing pipeline using Apache NiFi, Apache Kafka, Apache Spark, Apache Cassandra, MongoDB, Apache Hive and Apache Zeppelin to generate insights from this data.
The Spark project is built using Apache Spark with Scala and PySpark on a Cloudera Hadoop (CDH 6.3) cluster running on Google Cloud Platform (GCP).
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation and written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
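As a rough illustration of the MapReduce programming model mentioned in the Apache Hadoop paragraph, here is a minimal single-process sketch in Python. It is not taken from the course: the order records, their field names, and the function names are all hypothetical, and a real Hadoop or Spark job would distribute these phases across a cluster rather than run them in one process.

```python
from collections import defaultdict

# Hypothetical retail order events; the field names are illustrative only.
ORDERS = [
    {"order_id": 1, "category": "books", "amount": 12.5},
    {"order_id": 2, "category": "toys", "amount": 30.0},
    {"order_id": 3, "category": "books", "amount": 7.5},
]


def map_phase(orders):
    """Map: emit one (category, amount) pair per order."""
    for order in orders:
        yield order["category"], order["amount"]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: sum the amounts collected for each category."""
    return {key: sum(values) for key, values in grouped.items()}


def sales_by_category(orders):
    """Run the three phases in sequence on an in-memory list of orders."""
    return reduce_phase(shuffle_phase(map_phase(orders)))


if __name__ == "__main__":
    print(sales_by_category(ORDERS))
```

The same map, group-by-key, and aggregate shape is what a distributed engine parallelises; here it only shows the idea on three in-memory records.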

Who this course is for:

Beginners who want to learn the Apache Spark/Big Data project development process and architecture
Entry/Intermediate level Data Engineers and Data Scientists
Data Engineering and Data Science aspirants
Data enthusiasts who want to learn how to develop and run a Spark application on a CDH cluster
Anyone who really wants to become a Big Data/Spark developer

Course content

Introduction: 1 lecture • 4min

Big Data and Apache Hadoop Concepts: 3 lectures • 56min

Apache Spark Concepts: 2 lectures • 45min

Environment Setup: 11 lectures • 2hr 30min

Apache Spark Practical using Spark with Scala and PySpark: 4 lectures • 1hr 32min

Fundamentals of Apache NiFi: 4 lectures • 34min

Fundamentals of Apache Kafka: 5 lectures • 51min

Fundamentals of Apache Hive: 4 lectures • 11min

Spark Project Development using Spark with Scala and PySpark on CDH 6.3 Cluster: 9 lectures • 2hr 33min

Bonus Tutorial: 4 lectures • 59min
