Udemy
Spark Project on Cloudera Hadoop(CDH) and GCP for Beginners
Rating: 3.7 out of 5 (124 ratings)
1,598 students

Building Data Processing Pipeline Using Apache NiFi, Apache Kafka, Apache Spark, Cassandra, MongoDB, Hive and Zeppelin
Created by PARI MARGU
Last updated 4/2021
English

What you'll learn

  • Complete Spark Project Development on Cloudera Hadoop and Spark Cluster
  • Fundamentals of Google Cloud Platform (GCP)
  • Setting up Cloudera Hadoop and Spark Cluster (CDH 6.3) on GCP
  • Features of Spark Structured Streaming using Spark with Scala
  • Features of Spark Structured Streaming using Spark with Python (PySpark)
  • Fundamentals of Apache NiFi
  • Fundamentals of Apache Kafka
  • How to use NoSQL databases such as MongoDB and Cassandra with Spark Structured Streaming
  • How to build Data Visualisation using Python
  • Fundamentals of Apache Hive and how to integrate with Apache Spark
  • Features of Apache Zeppelin
  • Fundamentals of Docker and Containerization

Course content

10 sections, 47 lectures, 10h 54m total length
  • Course Introduction (4:13)

Requirements

  • Basic understanding of a programming language
  • Basic understanding of Apache Hadoop
  • Basic understanding of Apache Spark
  • Don't worry: Apache Hadoop and Apache Spark basics are also covered for the benefit of absolute beginners
  • Most important of all, a willingness to learn

Description

  • In the retail business, retail stores and eCommerce websites generate large amounts of data in real time.

  • This data needs to be processed in real time to generate insights that business teams can use to make decisions, increase sales in the retail market, and provide a better customer experience.

  • Because the data is large and arrives in real time, we need to choose the right architecture with scalable storage and computation frameworks/technologies.

  • Hence we build a data processing pipeline using Apache NiFi, Apache Kafka, Apache Spark, Apache Cassandra, MongoDB, Apache Hive and Apache Zeppelin to generate insights from this data.

  • The Spark project is built using Apache Spark with Scala and PySpark on a Cloudera Hadoop (CDH 6.3) cluster running on Google Cloud Platform (GCP).

  • Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

    Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation and written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

    Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

    A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
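
To give a feel for the streaming part of the pipeline described above, here is a minimal PySpark sketch, not the course's actual code. It assumes order events arrive on a Kafka topic as JSON and sums sales per store in 5-minute windows with Spark Structured Streaming. The topic name, broker address, and the fields `store_id`, `amount` and `event_time` are hypothetical and chosen only for illustration.

```python
from typing import Dict

# Hypothetical names used only for illustration (not taken from the course):
# the broker address, the topic name, and the order fields below.


def kafka_source_options(bootstrap_servers: str, topic: str) -> Dict[str, str]:
    """Build the options passed to Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def sales_per_store(spark, bootstrap_servers: str, topic: str):
    """Return a streaming DataFrame: total sales per store per 5-minute window."""
    # Imported lazily so this module still loads where PySpark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType, TimestampType,
    )

    # Assumed shape of one JSON order event.
    order_schema = StructType([
        StructField("store_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("event_time", TimestampType()),
    ])

    # Kafka delivers the payload as bytes in the "value" column.
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Decode the bytes to a string, parse the JSON, and flatten the fields.
    orders = (
        raw.select(
            F.from_json(F.col("value").cast("string"), order_schema).alias("order")
        ).select("order.*")
    )

    # The watermark bounds how late an event may arrive and still be counted.
    return (
        orders.withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "5 minutes"), "store_id")
        .agg(F.sum("amount").alias("total_sales"))
    )
```

With a running `SparkSession` and the Spark Kafka connector (`spark-sql-kafka-0-10`) on the classpath, something like `sales_per_store(spark, "localhost:9092", "orders").writeStream.outputMode("update").format("console").start()` would print the running totals. In a pipeline like the one described here, the console sink would be swapped for a Cassandra or MongoDB writer.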

Who this course is for:

  • Beginners who want to learn the Apache Spark/Big Data project development process and architecture
  • Entry/intermediate-level Data Engineers and Data Scientists
  • Data Engineering and Data Science aspirants
  • Data enthusiasts who want to learn how to develop and run Spark applications on a CDH cluster
  • Anyone who is really willing to become a Big Data/Spark Developer