Big Data Analytics with PySpark + Tableau Desktop + MongoDB

Name: Big Data Analytics with PySpark + Tableau Desktop + MongoDB
Rating: 4.3 (31 reviews)

Integrating Big Data Processing tools with Predictive Modeling and Visualization with Tableau Desktop

Created by EBISYS R&D

Last updated 2/2020

English

What you'll learn

Tableau Data Visualization
PySpark Programming
Data Analysis
Data Transformation and Manipulation
Big Data Machine Learning
Geo Mapping with Tableau
Geospatial Machine Learning
Creating Dashboards

Course content

7 sections • 27 lectures • 4h 17m total length

Introduction 10:20
Build an end-to-end big data analytics pipeline with PySpark, MongoDB, and Tableau Desktop to transform earthquake data into summary tables, geo maps, dashboards, and predictive regression models with trend lines.

Python Installation 3:16
Installing Apache Spark 12:06
Install Apache Spark on Windows by downloading Spark 3.x with Hadoop 2.7, extracting the zip, and configuring SPARK_HOME and PATH; install winutils and set HADOOP_HOME and PATH.
Installing Java (Optional) 4:35
Install and verify Java 8 (JDK 8) on Windows, including downloading from Oracle, accepting the license, and checking with java -version; optional hive cache setup for older Spark versions.
Testing Apache Spark Installation 6:05
Test the Apache Spark installation by launching the Spark shell and converting a sample array to an RDD and a DataFrame to confirm the Python API works with PySpark.
Installing MongoDB 3:47
Download the MongoDB community server MSI for Windows, choose the latest 2.3 release, install as a complete setup with default data and log directories, then verify using the mongo shell.
Installing NoSQL Booster for MongoDB 7:10
Install NoSQL Booster to manage MongoDB, connect locally, and create a quake database with a dummy collection, then query documents using JavaScript via db.dummy.find.
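
The setup lectures above come down to two checks: the Spark and Hadoop environment variables point at real folders, and PySpark can turn a small array into an RDD and a DataFrame. The following is a minimal sketch of both checks, not the course's own code. The function names (`missing_spark_settings`, `smoke_test_pyspark`) and the app name are made up for illustration, and the second function only runs where PySpark is installed.

```python
import os
from pathlib import Path


def missing_spark_settings(env):
    """Return the required variables that are unset or do not point to a folder."""
    problems = []
    for name in ("SPARK_HOME", "HADOOP_HOME"):
        value = env.get(name)
        if not value or not Path(value).is_dir():
            problems.append(name)
    return problems


def smoke_test_pyspark(values):
    """Convert a Python list to an RDD and a DataFrame; return both row counts."""
    # Imported lazily so the environment check above works without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("install-check")  # hypothetical app name
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(values)
        # Wrap each value in a tuple so it becomes a one-column row.
        df = rdd.map(lambda v: (v,)).toDF(["value"])
        return rdd.count(), df.count()
    finally:
        spark.stop()


# Report which variables still need attention on this machine.
print("Unset or invalid:", missing_spark_settings(os.environ))
```

If the printed list is empty, calling `smoke_test_pyspark([1, 2, 3, 4, 5])` should return the same count twice, once from the RDD and once from the DataFrame.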

Integrating PySpark with Jupyter Notebook 5:08
Integrate PySpark with Jupyter Notebook by installing the findspark library, testing import and initialization, then creating a Spark session and a PySpark DataFrame to run data pipelines.
Data Extraction 19:23
Import a dataset from GitHub into a Spark DataFrame, clean unused columns, derive a year field from the date, and count records by year for MongoDB storage.
Data Transformation 14:51
Loading Data into MongoDB 13:15
Configure the MongoDB Spark connector in a Jupyter notebook, then load cleaned DataFrames into MongoDB with the Spark connector, overwriting data in the quake and quake 3 collections.
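
The extract, transform, and load steps in this section can be sketched roughly as follows. This is an illustration under stated assumptions rather than the course's notebook: the column names (`Date`, the entries of `UNUSED_COLUMNS`), the date pattern, the database and collection names, and the connector coordinates (`mongo-spark-connector_2.12:3.0.1`, which uses the `mongo` format name) are all assumptions, and the dataset is assumed to have been downloaded to a local CSV file first.

```python
# Assumed connector build for Spark 3.x; adjust to match your Spark version.
CONNECTOR_PACKAGE = "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1"

# Hypothetical names of columns that are not needed downstream.
UNUSED_COLUMNS = ["Depth Error", "Magnitude Error"]


def mongo_write_options(database, collection, host="127.0.0.1", port=27017):
    """Build the writer options for one target collection."""
    return {
        "uri": f"mongodb://{host}:{port}",
        "database": database,
        "collection": collection,
    }


def create_session(app_name="quake-etl"):
    """Locate the local Spark install and start a session with the connector."""
    import findspark

    findspark.init()  # adds the Spark install to sys.path
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")
        .appName(app_name)
        .config("spark.jars.packages", CONNECTOR_PACKAGE)
        .getOrCreate()
    )


def build_quake_frames(spark, csv_path):
    """Return the cleaned records and a per-year record count."""
    from pyspark.sql import functions as F

    df = spark.read.csv(csv_path, header=True).drop(*UNUSED_COLUMNS)
    # Assumes a 'Date' column formatted as MM/dd/yyyy.
    df = df.withColumn("Year", F.year(F.to_timestamp(F.col("Date"), "MM/dd/yyyy")))
    counts = df.groupBy("Year").count().withColumnRenamed("count", "Counts")
    return df, counts


def load_to_mongo(df, options):
    """Overwrite the target collection with the DataFrame contents."""
    df.write.format("mongo").mode("overwrite").options(**options).save()
```

A typical call sequence would be `spark = create_session()`, then `df, counts = build_quake_frames(spark, "database.csv")`, then `load_to_mongo(df, mongo_write_options("Quake", "quakes"))`; the file name and target names here are placeholders.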

Data Pre-processing 19:07
Perform data pre-processing for machine learning by loading earthquake data, creating training and test sets, and cleaning to prepare features. Rename fields, cast to doubles, and drop missing values.
Building the Predictive Model 12:16
Select latitude, longitude, and depth as features via a vector assembler, train a random forest regression model in a pipeline to predict earthquake magnitude, and evaluate with RMSE below 0.5.
Creating the Prediction Dataset 7:48
Create the prediction dataset by selecting latitude, longitude, and predicted magnitude, rename the field to predicted magnitudes, add year and RMSE columns for 2017, and load into MongoDB.
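
The three lectures above map onto a short Spark MLlib workflow, sketched below. It is a hedged outline, not the course's code: the column names (`Latitude`, `Longitude`, `Depth`, `Magnitude`, `Pred_Magnitude`) and function names are assumptions, the random forest uses default hyperparameters, and the pure-Python `root_mean_squared_error` helper is included only to show what the RMSE metric computes.

```python
import math

FEATURES = ["Latitude", "Longitude", "Depth"]  # assumed column names
LABEL = "Magnitude"                            # assumed label column


def root_mean_squared_error(actual, predicted):
    """Plain-Python RMSE, the metric the Spark evaluator reports."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def prepare(df):
    """Keep the model columns, cast them to double, and drop incomplete rows."""
    from pyspark.sql import functions as F

    columns = FEATURES + [LABEL]
    return df.select([F.col(c).cast("double").alias(c) for c in columns]).dropna()


def train_and_score(train_df, test_df):
    """Fit assembler + random forest as one pipeline; return predictions and RMSE."""
    from pyspark.ml import Pipeline
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    assembler = VectorAssembler(inputCols=FEATURES, outputCol="features")
    forest = RandomForestRegressor(featuresCol="features", labelCol=LABEL)
    model = Pipeline(stages=[assembler, forest]).fit(train_df)

    predictions = model.transform(test_df)
    evaluator = RegressionEvaluator(
        labelCol=LABEL, predictionCol="prediction", metricName="rmse"
    )
    return predictions, evaluator.evaluate(predictions)


def prediction_dataset(predictions, rmse, year=2017):
    """Shape the model output into the table that gets loaded into MongoDB."""
    from pyspark.sql import functions as F

    return (
        predictions.select("Latitude", "Longitude", "prediction")
        .withColumnRenamed("prediction", "Pred_Magnitude")
        .withColumn("Year", F.lit(year))
        .withColumn("RMSE", F.lit(rmse))
    )
```

Keeping the assembler and the regressor in a single `Pipeline` means the same feature assembly is applied at training and prediction time, which avoids mismatched feature vectors.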

Installing Visual Studio Code 3:20
Learn how to download and install Visual Studio Code on Windows, set up a project folder, open the editor, and install the Python extension to build data pipeline scripts.
Creating the PySpark ETL Script 24:20
Create a PySpark ETL script to extract, transform, and load earthquake data into MongoDB using a Spark session and DataFrames.
Creating the Machine Learning Script 26:37
Create a machine learning script to build and evaluate a random forest model from preprocessed quake data, train on MongoDB data, generate predictions, and store results for Tableau dashboards.

Installing Tableau 1:44
Installing MongoDB ODBC Drivers 4:59
Install the MongoDB ODBC drivers by downloading the BI connector for MongoDB, install the Visual C++ redistributable, and connect Tableau to MongoDB for reports, dashboards, and graphs.
Creating a System DSN for MongoDB 5:24
Create a system DSN for MongoDB using the ODBC driver to enable access from all applications. Start the MongoDB BI connector on port 3307, then test the connection.
Loading the Data Sources 5:38
Creating a Geo Map 12:33
Create a geo map visualization in Tableau using longitude and latitude, add predicted magnitude and year to tooltips, and customize colors, size, opacity, and map background for clear earthquake data insights.
Creating a Bar Chart 11:08
Creating a Magnitude Chart 12:59
Create a magnitude plot to visualize maximum and average earthquake magnitudes per year with a line graph in Tableau. Add styling, markers, tooltips, and a trend line.
Creating a Table Plot 3:40
Create a Tableau table plot from the quakes data by dragging the type field to rows and using a count measure to show earthquake types, then rename the sheet.
Creating a Dashboard 6:27
Create a multi-chart Tableau dashboard combining an earthquake map, magnitude plots by year, and a types table from cleaned MongoDB data via a data pipeline, with interactive location search.

Requirements

Basic Understanding of Python
Little or no understanding of GIS
Basic understanding of Programming concepts
Basic understanding of Data
Basic understanding of what Machine Learning is

Description

Welcome to the Big Data Analytics with PySpark + Tableau Desktop + MongoDB course. In this course we will be creating a big data analytics solution using big data technologies like PySpark for ETL, MLlib for Machine Learning as well as Tableau for Data Visualization and for building Dashboards.

We will be working with earthquake data, which we will transform into summary tables. We will then use these tables to train predictive models and predict future earthquakes. Finally, we will analyze the data by building reports and dashboards in Tableau Desktop.

Tableau Desktop is a powerful data visualization tool used for big data analysis and visualization. It allows for data blending, real-time analysis, and collaboration on data. No programming is needed for Tableau Desktop, which makes it an easy and powerful tool for creating dashboards, apps, and reports.

MongoDB is a document-oriented NoSQL database used for high-volume data storage. It stores data in a JSON-like format called documents and does not use row/column tables. The document model maps to the objects in your application code, making the data easy to work with.
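
To make the document model concrete, here is a small sketch of how one earthquake record might look as a Python dictionary, which is the shape a MongoDB document takes in application code. The field names and values are made up for illustration and are not taken from the course dataset.

```python
import json

# A hypothetical earthquake record; every field and value is illustrative.
quake_document = {
    "Date": "2017-01-02",
    "Type": "Earthquake",
    "Magnitude": 5.8,
    "Depth": 10.0,
    # Nested fields sit inside the same document instead of a separate table.
    "Location": {"Latitude": -36.5, "Longitude": 51.25},
}

# The document serializes to JSON text and back without losing its structure.
as_json = json.dumps(quake_document)
restored = json.loads(as_json)

print(restored["Location"]["Latitude"])
```

The nested `Location` field is the kind of structure that would need a join across tables in a row/column database.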

You will learn how to create data processing pipelines using PySpark
You will learn machine learning with geospatial data using the Spark MLlib library
You will learn data analysis using PySpark, MongoDB and Tableau
You will learn how to manipulate, clean and transform data using PySpark dataframes
You will learn how to create Geo Maps in Tableau Desktop
You will also learn how to create dashboards in Tableau Desktop

Who this course is for:

Python Developers at any level
Data Engineers at any level
Developers at any level
Machine Learning engineers at any level
Data Scientists at any level
GIS Developers at any level
The curious mind

Course content

Introduction • 1 lecture • 10min

Setup and Installations • 6 lectures • 37min

Data Processing with PySpark and MongoDB • 4 lectures • 53min

Machine Learning with PySpark and MLlib • 3 lectures • 39min

Creating the Data Pipeline Scripts • 3 lectures • 54min

Tableau Data Visualization • 9 lectures • 1hr 5min

Source Code and Notebook • 1 lecture • 1min
