Massive Data Workloads with Open Source Software

Rating: 5.0 (1 review)

Tips, Tools and Techniques for Data Aggregation, Storage, Processing, Analysis & Visualization with Open Source Software

Created by Israel Ekpo

Last updated 5/2021

English

English [Auto]

What you'll learn

Tips, tools, techniques and strategies for working with massive data workloads using open source software
Tools and strategies for aggregating events using open source software
Strategies for selecting open source storage solutions across various data store categories
Tools and strategies for processing real time and batch workloads with open source software
Strategies for analyzing and visualizing data
Optimizing for performance, reliability, security and cost

Course content

6 sections • 41 lectures • 3h 29m total length

Introduction • 5:05
Explore open source tools to aggregate, store, process, and analyze data across batch and streaming workloads. Build practical skills to select the right tools and strategies for real-world data challenges.
A Special Thank You and Appreciation • 3:20
Express gratitude to God, family, mentors, and Kickstarter supporters for their support during the eight years spent building a data course that teaches how to process, store, analyze, and visualize data, and invite feedback.

Overview of Data Sources and Workloads • 4:49
Explore the data sources fueling massive workloads, from Internet of Things devices and social networks to business applications, and how to capture, store, and analyze them effectively.
Strategies and Tools for Data Movement • 7:52
Identify data movement strategies and tools for migrating data between stores and formats, considering location, format (binary, CSV, Avro), transformation, and encryption. Distinguish bounded from unbounded data and select Kafka connectors.
Data Store Categories: The What and Why • 18:30
Choose the right data store for the job by evaluating relational databases, wide-column stores, key-value stores, document stores, graph stores, and object stores.
Selecting a DataStore Category
Tools and Strategies for Data Processing • 5:19
Explore batch and real-time data processing with open source tools, comparing bounded versus unbounded datasets. Learn how Spark, Beam, Flink, and Kafka enable streaming, aggregation, and real-time analytics.

Introduction • 2:25
Introduce open source solutions for massive data workloads, balancing batch and real-time processing for an online grocery store while selecting the right tools for storage, analysis, and visualization.
Challenges - Data Store Selection • 8:34
Choose the right data store to enforce relationships, normalize data, and support caching, flexible documents, and graph networks for a grocery data ecosystem.
Challenges - Data Processing and Analysis • 7:19
Examine bounded datasets handled by batch processing and enrichment with Spark, and unbounded streams handled in real time with Kafka Streams, to keep inventory and orders up to date.
Challenges - Data Consumption, Reporting and Visualization • 1:54
Consume stored data and share it with customers and partners in Avro or other formats, then use Kibana dashboards to visualize data from Elasticsearch and the API.
Summary • 13:05
Compare batch and streaming data workloads to show how open-source data stores (relational, document, columnar, and key-value) plus engines like Spark and Kafka enable real-time enrichment and visualization with Elasticsearch and Kibana.

Introduction • 8:31
Set up local and Azure cloud environments for containerized data workloads, provisioning resources, networking, and storage using Azure CLI, Docker, kubectl, Helm, Git, VS Code, and Postman.
Getting a Cloud Account with Microsoft Azure • 3:25
Create a free Azure account to receive a $200 credit for 30 days, then switch to pay-as-you-go, add a debit card, and scale down VMs to save costs.
Azure Account Validation • 1:53
Log in to validate your Azure account, add a payment method, and upgrade to pay-as-you-go, then request a region-specific limit increase to provision data stores for a Kubernetes cluster.
Cloud Account Quota Adjustment • 2:02
Increase virtual machine and CPU quotas for a region by submitting a subscription limit request through the portal (aim for 1000), then bookmark the page for future setups.
Tools and Integrated Development Environments • 11:13
Set up essential tools and IDEs, including version control, VS Code, IntelliJ, Azure CLI, and the Java, Scala, Hadoop, Spark, and Maven environments with correct PATH and JAVA_HOME.
Tools and IDEs Continued • 2:52
Learn to set up web development tools with WebStorm, choose a browser (Chrome or Edge), and install the CLI tools and package manager for Angular projects.
Azure CLI Login and Kickoff • 0:59
Download the Git repository, log in, and prepare your local development environment to set up the Kubernetes cluster using the Azure Resource Manager template.
Provisioning the Kubernetes Cluster • 6:25
Provision a two-tier Kubernetes cluster using a template and parameters, with one system node and three agent nodes, then execute the deploy script to create resources.
Kubernetes Cluster Validation • 3:19
Validate the newly created Kubernetes cluster, review resource groups and pool setup, retrieve cluster credentials, and use controls to run notebooks and scale the cluster for cost efficiency.
Overview of Helm • 4:26
Publish cluster resources with Helm as a single unit, using install, upgrade, and uninstall, and explore namespaces, services, and persistent volumes.
Setting up a Debugger Container on Kubernetes • 5:06
Set up and explore a debugger container on Kubernetes with Helm charts, inspect deployments in the debugger namespace, and deploy MySQL 5.6, Postgres, Cassandra, and MongoDB.
Infrastructure - MySQL Database (Relational Store) • 2:52
Set up the MySQL infrastructure using data definition language and data manipulation language files, install and verify the database, and confirm its IP address before proceeding to Cassandra.
Infrastructure - 3-node Cassandra Cluster (Wide Column Store) • 2:10
Set up a three-node Cassandra cluster using templates, run a single setup command from the project directory, and monitor the cluster as each node comes online.
Infrastructure - MongoDB Document Store • 4:51
Set up MongoDB using Helm charts, verify deployment status and external IP, and understand the role of persistent volumes in ensuring data durability before proceeding to a Redis installation.
Infrastructure - Redis (Key-Value Store) • 1:27
Set up Redis, verify it is running, and inspect the container status and internal versus external IP addresses for application communication; then proceed to the Elasticsearch and Kibana cluster.
Infrastructure - ElasticSearch and Kibana (Search) • 1:53
Learn to set up Elasticsearch and Kibana, monitor readiness across namespaces, manage their dependency, and use a persistent volume with the watch command before moving to the next setup.
Infrastructure - Neo4j (Graph Database) • 2:10
Set up Neo4j in standalone mode, accept the enterprise license, configure the login password, retrieve the password if needed, log into the database, and verify ports and external IP.
Infrastructure - Kafka, Zookeeper, Kafka Connect, Schema Registry & KSQL • 9:39
Learn to provision a Kafka ecosystem by installing Zookeeper, broker, Schema Registry, Kafka Connect, and KSQL in dependency order, verifying readiness before deployment.
Setup Validation and Local DNS Setup • 4:42
Validate the setup and implement a local DNS mapping that maps service IP addresses to domain names, using a script to generate host entries for both local and cluster environments.
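
The local DNS step in the last lecture above, turning service IP addresses into host entries, can be pictured with a minimal Python sketch. This is not the course's script: the function name `build_host_entries`, the `data.local` domain suffix, and the sample IP addresses are all assumptions made for illustration.

```python
# Hypothetical sketch: turn a service-name -> IP mapping into hosts-file lines.
# The domain suffix and addresses are illustrative, not taken from the course.

def build_host_entries(service_ips, domain="data.local"):
    """Return hosts-file text with one 'IP<TAB>name.domain' line per service."""
    lines = []
    for name, ip in sorted(service_ips.items()):  # sort for stable output
        lines.append(f"{ip}\t{name}.{domain}")
    return "\n".join(lines)


if __name__ == "__main__":
    services = {"redis": "10.0.0.12", "mysql": "10.0.0.11"}
    # The printed lines could be appended to a hosts file by hand.
    print(build_host_entries(services))
```

Sorting by service name keeps the generated entries stable between runs, which makes it easier to spot changes when a service receives a new IP address.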

Overview of Scenarios and Solutions • 4:15
The lecture surveys data stores from MySQL to Elasticsearch, MongoDB, and Cassandra, covers caching, full-text search, and graph databases, and discusses real-time and batch processing with Spark and streaming engines.
Relational Database Implementation - MySQL • 8:04
Set up a MySQL-based e-commerce and inventory database, including users, privileges, schemas, and sample data. Use a data generator app to simulate orders, shipments, and replenishments via Kafka streams.
Key Value Store Implementation - Redis • 3:42
Explore implementing a key-value cache with Redis to store and retrieve backend results, checking the cache before querying the database and managing keys with the Redis CLI.
Document Store Implementation - MongoDB • 1:53
Verify a running MongoDB instance, download a GitHub repo in a debugger container, and import dataset files to populate the database for a hands-on document store lesson.
Wide Column Store Implementation - Cassandra • 2:58
Load data into Cassandra by running setup commands, then log into the Cassandra environment, use the Cassandra Query Language shell (cqlsh) to access the cluster, create keyspaces, and load data.
Fulltext Search Server - ElasticSearch • 4:37
Log in to Elasticsearch and Kibana, verify credentials and endpoints, and explore indices with the Postman collection; monitor data flowing from database to Kafka to Elasticsearch with Kibana dashboards.
Graph Database - Neo4j and Cypher Query Language • 5:08
Explore graph databases with Neo4j and Cypher, set up a database, create nodes and connections, and query first- and third-degree relationships to understand data graphs.
Batch Analysis of Bounded Datasets - Apache Spark • 2:48
Learn how to perform batch analysis with Apache Spark by connecting to MongoDB, joining the product and product details collections, and creating an enriched dataset loaded into MongoDB.
Realtime Analysis of Unbounded Data Streams - Kafka and KSQL • 12:06
Learn real-time analysis of unbounded data streams with the Kafka ecosystem, including Zookeeper, brokers, producers, consumers, Schema Registry, Kafka Connect, and Kafka Streams to join and enrich data.
Reporting APIs and Data Visualization - SpringBoot APIs and D3.js • 4:43
Visualize data with Elasticsearch and Kibana dashboards using SpringBoot APIs and D3.js for interactive reports. Run sample reports against a database and explore REST API querying, Elasticsearch, Kafka, and dashboards.
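
The Redis lecture in this section describes checking the cache before querying the database. A minimal sketch of that pattern follows, assuming a plain Python dict stands in for Redis and a caller-supplied function stands in for the database query. The class name `CacheAside` and the sample product data are hypothetical and not taken from the course.

```python
# Hypothetical sketch of the check-cache-first pattern (often called cache-aside).
# A dict stands in for Redis; load_from_db stands in for a real database query.

class CacheAside:
    def __init__(self, cache, load_from_db):
        self.cache = cache                # any dict-like store
        self.load_from_db = load_from_db  # called only on a cache miss

    def get(self, key):
        if key in self.cache:             # cache hit: skip the database
            return self.cache[key]
        value = self.load_from_db(key)    # cache miss: query the backend
        self.cache[key] = value           # store the result for later lookups
        return value


if __name__ == "__main__":
    db_calls = []

    def fake_db(key):
        db_calls.append(key)              # record each backend query
        return {"sku": key, "name": "apples"}

    store = CacheAside({}, fake_db)
    store.get("sku-1")
    store.get("sku-1")
    print(len(db_calls))                  # the backend was queried only once
```

A real deployment would also need an expiry or invalidation policy so cached results do not go stale. That is left out here to keep the sketch short.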

Requirements

A computer with internet access is required.
Access to an Azure cloud account is necessary. You can use the free trial credit for this.

Description

Selecting the right tools, technologies and strategies for aggregating, processing and making sense of high-velocity, high-volume application log data from tens, hundreds or sometimes thousands of sources can be overwhelming, expensive, intimidating, stressful and frustrating. This course offers complete, hands-on instruction on how to aggregate, process, search and visualize massive log data using the open source software tools, frameworks and platforms available today.

Who this course is for:

Software Engineers, Data Engineers, Data Analysts, Data Scientists and Operations Engineers

Course content

Getting Started • 2 lectures • 8min

Foundations • 4 lectures • 37min

Challenges, Use Cases, Scenarios and Solutions • 5 lectures • 33min

Infrastructure & IDE Setup for the Course • 19 lectures • 1hr 20min

Implementation of Solutions for Scenarios and Use Cases • 10 lectures • 50min

Course Round Up and Summary • 1 lecture • 1min
