Building Big Data Pipelines with PySpark + MongoDB + Bokeh

Rating: 4.5 (71 reviews)

Build intelligent data pipelines with big data processing and machine learning technologies

Last updated 2/2020

English

English [Auto]

Course content

7 sections • 25 lectures • 5h 4m total length

Introduction (9:30)
Explore how to build an intelligent data pipeline with PySpark, MongoDB, and Bokeh that processes, analyzes, and visualizes earthquake data, covering ETL, predictive modeling, geospatial machine learning, dashboards, and a lightweight server.

Python Installation (3:16)
Install Python 3.7 on Windows, add it to the environment variables, and verify the installation, then install Anaconda to get Jupyter notebooks for building big data pipelines.
Installing Third Party Libraries (3:08)
Open the command prompt as administrator, upgrade pip, and install pandas, numpy, pymongo, and bokeh, which the pipeline uses for reporting and database integration.
Installing Apache Spark (12:06)
Install Apache Spark by downloading the Spark 3.0 preview built for Hadoop 2.7, extracting it to the C drive, and configuring SPARK_HOME and PATH. Then install the Hadoop winutils and update the PATH for Hadoop.
Installing Java (Optional) (4:35)
Install Java 8 (JDK 8) and verify it by running java -version; optional steps include creating a temporary Hive folder for older Spark versions before testing the installation.
Testing Apache Spark Installation (6:05)
Test the Apache Spark installation by launching the Spark shell, converting a sample array to an RDD and a DataFrame, and validating PySpark for pipeline development.
Installing MongoDB (3:47)
Download MongoDB Community Server for Windows, install it with the complete setup, configure the data and log directories and the service, then verify the installation by launching the Mongo shell and running db.
Installing NoSQL Booster for MongoDB (7:10)
Install NoSQL Booster for MongoDB, connect to a local instance, and create databases and collections to explore document-based storage in a JSON-like format and run basic queries.

Integrating PySpark with Jupyter Notebook (5:08)
Install and configure findspark to connect PySpark with a Jupyter notebook, initialize a Spark session, and verify the integration by creating and previewing a dataframe.
Data Extraction (19:23)
Extract data from GitHub, load it into a PySpark dataframe, and derive a year field. Group by year to count earthquakes, preparing the results for MongoDB storage.
Data Transformation (14:51)
Transform the data by casting latitude, longitude, depth, and magnitude to double, compute yearly maximum and average magnitudes, join the results, drop nulls, and load them into MongoDB for ML-ready pipelines.
Loading Data into MongoDB (13:15)
Configure the Mongo Spark connector in a Jupyter notebook, load the cleaned dataframes, and write them to MongoDB collections like quake and quake 3, then verify the results with NoSQL Booster.
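
The transformations described in the ETL lectures above (derive a year, cast magnitude to a number, drop nulls, then count earthquakes and compute the maximum and average magnitude per year) come down to a small amount of logic. The sketch below shows that logic in plain Python rather than PySpark, so it runs without a Spark installation. The sample rows, the field names Date and Magnitude, and the MM/DD/YYYY date format are assumptions made for illustration and are not taken from the course dataset.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; the real dataset's columns and date format may differ.
rows = [
    {"Date": "01/02/1965", "Magnitude": "6.0"},
    {"Date": "01/04/1965", "Magnitude": "5.8"},
    {"Date": "03/15/1966", "Magnitude": "7.1"},
    {"Date": "07/09/1966", "Magnitude": None},  # missing value, dropped below
]


def summarize_by_year(records):
    """Return {year: {"count", "max_magnitude", "avg_magnitude"}}.

    Rows with a missing date or magnitude are skipped, so "count" only
    covers rows that survive the null check.
    """
    magnitudes = defaultdict(list)
    for rec in records:
        if rec["Date"] is None or rec["Magnitude"] is None:
            continue  # drop rows with nulls
        year = datetime.strptime(rec["Date"], "%m/%d/%Y").year  # derive the year
        magnitudes[year].append(float(rec["Magnitude"]))  # cast text to float
    return {
        year: {
            "count": len(values),
            "max_magnitude": max(values),
            "avg_magnitude": sum(values) / len(values),
        }
        for year, values in sorted(magnitudes.items())
    }


summary = summarize_by_year(rows)
```

In a PySpark version, the same steps would typically be expressed as column casts followed by a group-by on the derived year column; the dictionary returned here plays the role of the per-year summary that gets written to MongoDB.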

Data Pre-processing (19:07)
Pre-process the earthquake data by cleaning and transforming it, load training data from MongoDB and test data from GitHub, align the fields, and prepare the training and testing sets.
Building the Predictive Model (12:16)
Apply a random forest regression model in PySpark to predict earthquake magnitude from latitude, longitude, and depth assembled into a feature vector, and evaluate it with RMSE on the test data.
Creating the Prediction Dataset (7:48)
Extract latitude, longitude, and the predicted magnitude from the prediction results, rename the prediction column to predicted magnitude, add the year 2017 and the RMSE, then save the dataset to MongoDB for reporting and dashboards.
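
The model in this section is scored with RMSE (root mean squared error), the square root of the average squared difference between actual and predicted magnitudes. As a minimal sketch of what that metric measures, the function below computes it in plain Python; the sample magnitudes are made-up values for illustration, and in the course itself the metric would come from Spark's evaluation tools rather than a hand-written function.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values, not results from the course dataset.
actual_magnitudes = [5.5, 6.0, 7.0]
predicted_magnitudes = [5.0, 6.0, 8.0]
error = rmse(actual_magnitudes, predicted_magnitudes)
```

A lower value means the predictions sit closer to the observed magnitudes, and because the errors are squared before averaging, a few large misses raise the score more than many small ones.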

Loading the Data Sources from MongoDB (16:44)
Read data from MongoDB, convert the collections to pandas data frames, and prepare the 2016 earthquakes data for visualization with Bokeh.
Creating a Map Plot (33:20)
Create an interactive geo map with Bokeh to display 2016 earthquakes and 2017 predictions, using Mercator projection coordinates, custom styling, tooltips, and a base map.
Creating a Bar Chart (9:21)
Create a bar chart of earthquakes by year from the quake frequency dataframe, building a Bokeh figure with labeled axes and tooltips that show the year and count.
Creating a Magnitude Plot (15:28)
Create a magnitude plot of maximum and average earthquake magnitudes by year from the quake data frame. Round the average to one decimal and add circles, tooltips, and legends.
Creating a Grid Plot (8:51)
Create a grid plot that displays the plots as a two-row dashboard in a web browser, combining the geo map, bar plot, and magnitude plot.
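
The map plot in this section relies on Mercator coordinates because tile base maps are drawn in Web Mercator meters rather than in degrees, so latitude and longitude have to be converted before points can be placed on the map. The following is a minimal sketch of the standard spherical Web Mercator formula; the function name to_web_mercator is a hypothetical helper chosen for illustration and is not part of Bokeh or of the course code.

```python
import math

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator, in meters


def to_web_mercator(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to Web Mercator (x, y) in meters.

    The projection is undefined at the poles; Web Mercator maps are
    conventionally clipped to about +/-85.05 degrees of latitude.
    """
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


# Example: convert a point at 35.0 N, 139.0 E.
x, y = to_web_mercator(35.0, 139.0)
```

The resulting x and y values can then be used as the plotted columns of a data frame in place of the raw longitude and latitude.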

Installing Visual Studio Code (5:02)
Install Visual Studio Code, choosing the 64-bit Windows version, then set up a project folder and install the Python extension to begin scripting data pipelines.
Creating the PySpark ETL Script (23:32)
Build a PySpark ETL script for the quakes pipeline that extracts the data, transforms it (drops unused fields, derives the year, and computes counts, average magnitude, and maximum magnitude), and loads it into MongoDB.
Creating the Machine Learning Script (30:03)
Create a Spark-based machine learning pipeline that predicts earthquake magnitudes by loading data from CSV and MongoDB, training a random forest regressor, and evaluating it with root mean squared error.
Creating the Dashboard Server (20:48)
Create a Flask dashboard server that shares a generated dashboard by hosting the HTML file on localhost:5000. Set up a virtual environment, install Flask, and organize the templates.

Welcome to the Building Big Data Pipelines with PySpark & MongoDB & Bokeh course. In this course we will build an intelligent data pipeline using big data technologies like Apache Spark and MongoDB.

We will build an ETLP pipeline, where ETLP stands for Extract, Transform, Load, and Predict. These are the stages our data has to go through in order to become useful. Once the data has gone through this pipeline, we will be able to use it to build reports and dashboards for data analysis.

The data pipeline will consist of data processing using PySpark, predictive modelling using Spark's MLlib machine learning library, and data analysis using MongoDB and Bokeh.

You will learn how to create data processing pipelines using PySpark
You will learn machine learning with geospatial data using the Spark MLlib library
You will learn data analysis using PySpark, MongoDB, and Bokeh inside a Jupyter notebook
You will learn how to manipulate, clean, and transform data using PySpark dataframes
You will learn basic geo mapping
You will learn how to create dashboards
You will also learn how to create a lightweight server to serve Bokeh dashboards