
Explore big data in e-commerce by setting up Hadoop and Spark on Google Cloud Platform. Understand the HDFS architecture, high availability, and cold storage for enterprise data.
Discover how enterprises tackle big data with the three Vs: velocity, volume, and variety, using Hadoop to store and process vast data robustly for 10x returns.
Understand theoretical computer science behind big data processing, analyzing loop-based time complexity and how divide-and-conquer and memory distribution power Hadoop to handle large datasets.
Explore hot and cold data in enterprise architectures, storing hot data in ERP main memory for transactions and using Hadoop for cold data analytics with OLAP.
Understand Hadoop architecture with a master name node and data nodes, enabling scalable, low-cost storage, high availability, and cloud-based ERP cold data management via IaaS.
Explore the Hadoop cluster architecture, including the name node and data nodes, and learn how files are written as 128 MB blocks with replication for fault tolerance.
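To make the block arithmetic concrete, here is a minimal sketch of how a file maps onto 128 MB blocks and how replication multiplies raw storage. The function name `hdfs_footprint` and the 1000 MB example file are hypothetical, and the replication factor of 3 is an assumption (it is the usual HDFS default, but it is configurable per cluster).

```python
import math

BLOCK_SIZE_MB = 128  # block size named in the lecture summary


def hdfs_footprint(file_size_mb, replication=3, block_size_mb=BLOCK_SIZE_MB):
    """Return (block count, size of last block in MB, raw storage in MB).

    Hypothetical helper for illustration only; replication=3 is an assumption.
    """
    if file_size_mb <= 0:
        raise ValueError("file_size_mb must be positive")
    blocks = math.ceil(file_size_mb / block_size_mb)
    # The final block only holds the bytes that are left over.
    last_block_mb = file_size_mb - (blocks - 1) * block_size_mb
    # Every block is stored `replication` times across the data nodes.
    raw_storage_mb = file_size_mb * replication
    return blocks, last_block_mb, raw_storage_mb


print(hdfs_footprint(1000))  # (8, 104, 3000)
```

A 1000 MB file therefore becomes seven full blocks plus one 104 MB block, and with three replicas it consumes roughly 3000 MB of raw disk across the cluster.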
Enable high availability for enterprise systems by replicating data across multiple nodes with a configurable replication factor, guided by the name node, heartbeat monitoring, and rack awareness.
Explore how Hadoop achieves high availability through replication, a standby name node, and a secondary name node, with ZooKeeper coordination and the fsimage and edit log files, on Google Cloud Platform.
Discover how Google Cloud Dataproc enables an end-to-end big data lifecycle and rapid Hadoop and Spark deployment in minutes.
Learn how to load data into HDFS or a storage bucket on GCP, compare unified storage with HDFS, and spin up HDFS for scalable map-reduce processing.
Learn to access the Spark environment on the GCP master node, explore Hadoop HDFS, start the Spark and PySpark shells, use the web UI, and safely terminate clusters to save costs.
Learn to set up and run Spark on Google Cloud Platform, explore Spark architecture, data frames, transformations and actions, and access data from Hadoop on GCP through practical hands-on exercises.
Explore how Spark works: from driver program and Spark context to master-slave cluster management, map and reduce operations, and JVM-based execution across worker nodes on GCP.
Understand Spark's inner workings: RDD partitions, immutability, and transformations, then explore data frames with schemas that enable faster operations and dramatic speed gains on distributed data.
Learn to spin up a two-node Spark cluster on Google Cloud Platform, using an initialization action to start a Jupyter notebook, and automate provisioning via the REST API.
Launch a Spark cluster on GCP, access the master node via a VM, start a PySpark session and Jupyter notebook, and monitor the Hadoop cluster through the dashboard.
Discover Spark RDD basics, autocompletion, partitioning with parallelize, and map-collect lazy evaluation; note Python 3 differences and range concepts.
Explore how Spark's RDD architecture distributes data across master and worker nodes on GCP, detailing partitions, memory, lazy evaluation, and map and collect driving computation.
Explore RDD transformation and actions with map and collect, detailing immutability, partitioning, and how a minus one function is applied and evaluated on Spark.
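The map-then-collect behaviour can be imitated in plain Python without a cluster. The sketch below is not Spark code: `LazyRDD` is a hypothetical toy class that only mimics how a transformation is recorded and an action triggers the work, and `minus_one` stands in for the minus one function mentioned above.

```python
class LazyRDD:
    """Toy stand-in for an RDD: immutable partitions plus deferred functions."""

    def __init__(self, partitions, pending=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._pending = tuple(pending)

    def map(self, fn):
        # Transformation: remember fn, compute nothing, return a NEW object.
        return LazyRDD(self._partitions, self._pending + (fn,))

    def collect(self):
        # Action: apply every remembered function partition by partition,
        # then gather the results into a single list.
        out = []
        for part in self._partitions:
            for x in part:
                for fn in self._pending:
                    x = fn(x)
                out.append(x)
        return out


calls = []  # records each time minus_one actually runs


def minus_one(x):
    calls.append(x)
    return x - 1


numbers = LazyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
shifted = numbers.map(minus_one)
print(calls)              # [] because map alone evaluates nothing
print(shifted.collect())  # [0, 1, 2, 3, 4, 5]
print(numbers.collect())  # [1, 2, 3, 4, 5, 6], the original is unchanged
```

In PySpark the same shape would be a `map` with a lambda followed by `collect()`, with the partitions living on worker nodes rather than in one process.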
Explore how RDD transformations like map create new datasets and how the collect action reveals results across partitions, emphasizing immutability and practical use in Spark workloads.
Explore RDD transformation and actions by applying filter and collect in Spark, learning lazy evaluation, partitioning, and performance optimization across large datasets.
Explore RDD transformations and actions by using filter with a lambda, then retrieve elements with first and take, and sort to obtain top elements in ascending or descending order.
Explore RDD transformations and actions by performing reduce to compute a single value, with an emphasis on associative and commutative operators, and why subtraction is unsuitable.
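Why subtraction misbehaves can be seen without Spark by reducing each partition separately and then combining the partial results. This is a plain-Python sketch of that idea, not Spark's implementation; `partitioned_reduce` and the partition layouts are hypothetical.

```python
from functools import reduce
import operator


def partitioned_reduce(partitions, op):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(op, part) for part in partitions if part]
    return reduce(op, partials)


data = [1, 2, 3, 4, 5, 6]
two_parts = [[1, 2, 3], [4, 5, 6]]
three_parts = [[1, 2], [3, 4], [5, 6]]

# Addition gives the same answer however the data is split.
print(partitioned_reduce(two_parts, operator.add))    # 21
print(partitioned_reduce(three_parts, operator.add))  # 21

# Subtraction depends on the split, so the result is not well defined.
print(partitioned_reduce(two_parts, operator.sub))    # 3
print(partitioned_reduce(three_parts, operator.sub))  # 1
print(reduce(operator.sub, data))                     # -19
```

Because the number of partitions is a tuning detail rather than part of the data, an operator whose result changes with the split cannot be trusted in a distributed reduce.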
Compare reduceByKey and groupByKey on RDDs, focusing on partitioning and shuffling, memory use, and performance. Learn when to prefer reduceByKey for large data and groupByKey for smaller, simpler workloads.
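The shuffling difference can be sketched in plain Python by counting how many records would leave each partition. This is a simplified model under stated assumptions, not Spark internals: `group_by_key` and `reduce_by_key` are hypothetical helpers, and "shuffled" here simply means records that cross partition boundaries.

```python
from collections import defaultdict


def group_by_key(partitions):
    """Ship every (key, value) pair across the shuffle, then group."""
    shuffled = [pair for part in partitions for pair in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)


def reduce_by_key(partitions, op):
    """Combine values per key inside each partition before the shuffle."""
    shuffled = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = op(local[key], value) if key in local else value
        shuffled.extend(local.items())  # one record per key per partition
    merged = {}
    for key, value in shuffled:
        merged[key] = op(merged[key], value) if key in merged else value
    return merged, len(shuffled)


partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

grouped, moved_group = group_by_key(partitions)
sums_from_groups = {k: sum(v) for k, v in grouped.items()}
sums, moved_reduce = reduce_by_key(partitions, lambda x, y: x + y)

print(sums_from_groups, moved_group)  # {'a': 8, 'b': 13} 6
print(sums, moved_reduce)             # {'a': 8, 'b': 13} 4
```

Both routes reach the same sums, but the pre-combining route moves fewer records, and the gap widens as the number of values per key grows.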
Explore groupByKey and reduceByKey in Spark by creating key-value pairs, applying a lambda to sum values, and performing collect as a combined transformation and action.
Explore advanced RDD transformations with groupByKey and reduceByKey by building sums from keys, using mapValues, and handling Spark errors to optimize performance.
Explore RDD caching and persistence in Spark on GCP, using in-memory storage levels, replication, and persist and unpersist.
Understand why data frames with a schema enable Spark to optimize access, outpacing datasets by turning partitions into a tabular structure for faster queries.
Generate 100 random user records with faker, including last name, first name, ssn, occupation, and age, then create a spark dataframe and inspect the schema.
Explore Spark data frames, demonstrating immutability, showing records, and filtering with a user defined function to select ages under 10, highlighting optimization and performance.
Explore data frames in Spark, distinguish actions from transformations, and use lazy evaluation to optimize operations, including selecting columns, dropping data, and performing group by with max and average.
Explore data frame operations with actions like show, desc, and first; sort by age to locate the max, and use count, distinct, max, min, and average to handle duplicates.
This is a comprehensive, hands-on course on Spark and Hadoop.
In this course we focus on Big Data and the open source solutions around it.
We require these tools for the e-commerce end of Project CORE (Create your Own Recommendation Engine), a one-of-a-kind project for learning technology end to end.
We will explore Hadoop, one of the prominent Big Data solutions.
We will look at the why and the how of Hadoop and its ecosystem, its architecture, and its basic inner workings, and we will also spin up our first Hadoop cluster in under 2 minutes on Google Cloud.
We are going to use this course in Project CORE, which is a comprehensive project on hands-on technologies. In Project CORE you will learn more about building your own system with Big Data, Spark, Machine Learning, SAPUI5, Angular4, D3JS, and SAP® HANA®.
With this course you will get a brief understanding of Apache Spark™, which is a fast and general engine for large-scale data processing.
Spark is used in Project CORE to manage Big Data with the HDFS file system. We store 1.5 million book records in Spark and implement a collaborative filtering algorithm.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.
Spark powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
Runs Everywhere - Spark runs on Hadoop, Mesos, standalone, or in the cloud.