
Discover what big data means through its three defining aspects: data size, velocity, and variety, and see how these drive processing in Hadoop and cloud-based environments.
Learn how to create a Google Compute Engine VM, choose region and machine type, install a Linux OS, and configure a static external IP with SSH keys and firewall rules.
Install IntelliJ IDEA Community Edition on Windows, set up a Spark project in Scala, add dependencies, and run a local Spark job from a main Scala object.
Explore Spark SQL concepts with hands-on exercises that transform data into DataFrames and tables, run queries, and build a simple Facebook program using notebooks.
Explore Apache NiFi core terminologies, including flow files, processors, process groups, connections, and ports, and learn how to build configurable data pipelines.
Explore Kafka architecture in multi-node clusters: producers publish to topics split into partitions, consumer groups process the streams in real time, and connectors integrate Kafka with databases and file systems.
Publish messages with a Kafka producer in Python and consume them: create a topic, configure a broker, and run a consumer to receive real-time transaction data.
Watch an end-to-end project demo showing a real-time data pipeline with Spark, MongoDB, and Kafka, deploying dashboards, building and running a Spark job, and monitoring streaming data.
In the retail business, retail stores and eCommerce websites generate large amounts of data in real time.
This data needs to be processed in real time to generate insights that business people can use to make decisions, increase sales in the retail market, and provide a better customer experience.
Since the data is huge and arrives in real time, we need to choose the right architecture with scalable storage and computation frameworks and technologies.
Hence we want to build a data processing pipeline using Apache NiFi, Apache Kafka, Apache Spark, Apache Cassandra, MongoDB, Apache Hive and Apache Zeppelin to generate insights from this data.
The Spark project is built using Apache Spark with Scala and PySpark on a Cloudera Hadoop (CDH 6.3) cluster running on Google Cloud Platform (GCP).
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation and written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
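As a rough illustration of how keyed messages land in topic partitions and how partitions are shared within a consumer group, here is a toy in-memory sketch in plain Python. It does not use a Kafka client or broker. The function names, the CRC32 hash, the sample store keys, and the round-robin assignment are all assumptions made for illustration. Kafka's own default partitioner hashes keys with murmur2, and its partition assignment strategy is configurable.

```python
import json
import zlib
from collections import defaultdict


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Toy stand-in for a Kafka partitioner: CRC32 is an assumption here,
    chosen only because it is deterministic and in the standard library.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def publish(topic_log, key, value, num_partitions):
    """Append a JSON-encoded message to the partition chosen for its key."""
    partition = choose_partition(key, num_partitions)
    topic_log[partition].append(json.dumps(value))
    return partition


def assign_partitions(num_partitions, consumers):
    """Spread partitions round-robin over the consumers of one group."""
    assignment = defaultdict(list)
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return dict(assignment)


# Hypothetical topic with 3 partitions: partition index -> ordered messages.
topic_log = defaultdict(list)
p1 = publish(topic_log, "store-17", {"store": "store-17", "amount": 25.0}, 3)
p2 = publish(topic_log, "store-17", {"store": "store-17", "amount": 12.5}, 3)

# Two consumers in one group split the three partitions between them.
assignment = assign_partitions(3, ["consumer-a", "consumer-b"])
```

In this sketch, both transactions for the same store key end up in the same partition, in the order they were published, and each partition is owned by exactly one consumer in the group.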
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.
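The MapReduce programming model mentioned above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API. The function names and the sample sales records are assumptions, and a real job would run the map and reduce phases in parallel across cluster nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (key, value) pair for one input record: (store, sale amount)."""
    store, amount = record
    return (store, amount)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result: total sales."""
    return key, sum(values)


def total_sales(records):
    """Run map, shuffle, and reduce in sequence over the input records."""
    pairs = [map_phase(record) for record in records]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical retail transactions: (store id, sale amount).
sales = [("store-1", 10.0), ("store-2", 4.0), ("store-1", 5.5), ("store-2", 6.0)]
totals = total_sales(sales)
```

Here `totals` maps each store to the sum of its sale amounts. The same three-step shape (map, group by key, reduce) applies when the input is split across many machines.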
A NoSQL (originally referring to "non-SQL" or "non-relational") database provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.