Real-World Data Science with Spark 2

Rating: 3.9 (21 reviews)

Address Big Data challenges with the fast and scalable features of Spark.

Created by Packt Publishing

Last updated 4/2017

English

English [Auto]

What you'll learn

An introduction to Big Data and data science
Get to know the fundamentals of Spark 2
Understand Spark and its ecosystem of packages in data science
Consolidate, clean, and transform your data acquired from various data sources
Unlock the capabilities of various Spark components to perform efficient data processing, machine learning, and graph processing
Dive deeper and explore various facets of data science with Spark

Course content

15 sections • 55 lectures • 5h 35m total length

Course Introduction (4:23)
Master real-time data processing with Spark, explore its ecosystem of packages, and apply practical workflows through hands-on content, code examples, and assessments.
An introduction to Big Data (6:45)

An overview of Apache Hadoop (5:30)
Gain an overview of Apache Hadoop, its distributed file system (HDFS), and its MapReduce framework. Trace the evolution from MRv1 to MRv2 and see how YARN unifies processing on HDFS.
Understanding Apache Spark (5:07)
Explore Apache Spark and its in-memory processing advantages over Hadoop MapReduce, and learn about the Spark libraries, the DAG execution model, and multi-language support across cluster managers.
Install Spark on your laptop with Docker, or scale fast in the cloud (4:24)
Install Apache Spark on your laptop from a tarball or using Docker and run Spark jobs locally. Then learn to scale to the cloud with Azure, using a two-setup workflow.
Apache Zeppelin, a web-based notebook for Spark with matplotlib and ggplot2 (2:51)
Install and configure Apache Zeppelin, a web-based notebook for Spark with matplotlib and ggplot2, using binary packages, a Docker image, or a build from source, to access the Spark, R, and Python APIs.
The RDD API (13:58)
Test Your Knowledge

Data visualization (12:15)
Manipulating data with the core RDD API (7:24)
Explore the core RDD API by reading from disk, filtering and transforming data, and writing results back to disk using map, mapPartitionsWithIndex, and repartition.
Using DataFrame, Dataset, and SQL – natural and easy! (6:26)
Learn to work with data in Spark using DataFrame, Dataset, and SQL by loading data, creating schemas, registering tables, and performing transformations and visualizations.
Manipulating rows and columns (4:41)
Learn to manipulate rows and columns of Spark DataFrames by merging frames with union, filtering, sorting, grouping, and aggregating, then adding or renaming columns for feature engineering.
Dealing with file format (2:08)
Learn how to manage different file formats in Spark data processing, including CSV-to-JSON conversion, reading with type inference options, selecting columns, filtering, and writing new files.
Visualizing more – ggplot2, matplotlib, and Angular.js at the rescue (3:19)
Visualize data with ggplot2, matplotlib, and Angular.js in Zeppelin notebooks, using inline plots and interactive displays. Use Zeppelin's display system to build interactive forms and capture user input to run code.
References (0:11)
Test Your Knowledge

An introduction to machine learning (2:20)
Discovering spark.ml and spark.mllib - and other libraries (6:56)
Explore spark.ml and spark.mllib, the out-of-the-box libraries for classification, regression, clustering, and pipelines. Train models with SVM, split data for training and testing, and evaluate with ROC AUC.
Wrapping up basic statistics and linear algebra (9:49)
Wrap up basic statistics and linear algebra in Spark, covering random data generation, descriptive statistics, correlations, cross tabulation, and vector and matrix operations for distributed data.
Cleansing data and engineering the features (4:57)
Preprocess data by cleansing, parsing text, handling missing values and formatting issues, and joining sources, then extract and transform features to build robust data pipelines for analytics with Spark.
Reducing the dimensionality (4:03)
Explore dimensionality reduction with PCA, projecting correlated features into uncorrelated principal components to obtain a lower-dimensional representation and interpret explained variance.
Pipeline for a life (3:20)
Build a multi-stage spark.ml pipeline that combines a tokenizer, term frequency, and logistic regression to train, transform, and save reusable models on a DataFrame of labeled documents.
References (0:03)
Test Your Knowledge

Streaming tweets to disk (4:31)
Learn how to set up a Twitter app, obtain consumer keys and access tokens, and stream tweets using Spark Streaming to collect and save them to disk in micro-batches.
Streaming tweets on a map (3:54)
Stream tweets in real time, bind a Twitter stream to a Leaflet map using Zeppelin and Scala, and display geolocated tweets as they stream.
Cleansing and building your reference dataset (5:07)
Learn how to cleanse and build a reference dataset by loading and merging tweets, detecting language, and saving to Parquet format for scalable analysis.
Querying and visualizing tweets with SQL (3:54)
Explore a cleansed Twitter reference dataset with Spark SQL to query and visualize tweets, perform exploratory data analysis, and examine text length, timestamps, geolocations, and followers.

Indicators, correlations, and sampling (6:34)
Explore basic indicators and correlations using a statistics object, learn stratified sampling for real data, and compare exact versus approximate statistics with Spark.
Validating statistical relevance (3:19)
Assess statistical relevance in real-world datasets using Spark with chi-square goodness-of-fit and independence tests, the Kolmogorov–Smirnov test, and kernel density estimation, and compute p-values.
Running SVD and PCA (3:46)
Apply singular value decomposition and principal component analysis to a tweet feature matrix to reduce features for downstream modeling.
Extending the basic statistics to your needs (3:44)
Extend basic statistics to your needs by adding an operator to an existing RDD or by extending the RDD class, using extension classes and implicit conversions for custom KPI calculations.
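As a rough illustration of the stratified sampling mentioned in this section, the sketch below keeps each record with a probability that depends on its stratum (key). It is plain Python rather than Spark code, and it is not taken from the course: the function name sample_by_key, the fractions mapping, and the sample tweets are all assumptions made for this example.

```python
import random


def sample_by_key(pairs, fractions, seed=42):
    """Stratified Bernoulli sampling over (key, value) pairs.

    Each pair is kept independently with probability fractions[key];
    keys missing from `fractions` are dropped. The sample size per
    stratum is therefore approximate, not exact.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    return [(key, value) for key, value in pairs
            if rng.random() < fractions.get(key, 0.0)]


# Hypothetical data: tweets keyed by detected language.
tweets = [("en", f"tweet {i}") for i in range(1000)] + \
         [("fr", f"tweet {i}") for i in range(100)]

# Sample roughly 10% of the English tweets but keep every French tweet.
sample = sample_by_key(tweets, {"en": 0.1, "fr": 1.0})
```

Spark's pair RDDs provide a sampleByKey method built on the same idea; the version above only shows the per-stratum logic on an in-memory list.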

Analyzing free text from the tweets (6:27)
Learn to convert tweets into numerical feature vectors using bag of words, tokenization, stop word removal, and TF-IDF, enabling text data to feed machine learning models.
Dealing with stemming, syntax, idioms, and hashtags (5:05)
Discover how to customize text feature extraction for Twitter data with stemming, syntax handling, lemmatization, and a Twitter model. Build a tokenization and analysis pipeline to handle hashtags and mentions.
Detecting tweet sentiment (3:14)
Detect tweet sentiment using the Stanford NLP library, with token-based weights ranging from very positive to very negative and zero for neutral. Explore integrating lexicons to improve accuracy.
Identifying topics with LDA (2:38)
Identify topics in tweets using LDA: build a vocabulary, convert text to feature vectors, and fit an LDA model to map topics to terms.
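The bag-of-words and TF-IDF steps described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the course's Spark code: the stop-word list, the tokenizer regex, the smoothed idf formula log((N + 1) / (df + 1)), and the sample tweets are all choices made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "and", "of", "to"}


def tokenize(text):
    """Lowercase, keep hashtags and mentions as single tokens, drop stop words."""
    tokens = re.findall(r"[#@]?\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf is the raw term count in the document; idf is the smoothed
    log((N + 1) / (df + 1)), where N is the number of documents and
    df is the number of documents containing the term.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: count each term once per document.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


# Hypothetical tweets.
tweets = [
    "Spark is fast",
    "Spark streaming on Twitter",
    "#BigData and Spark",
]
vectors = tf_idf(tweets)
```

A term that appears in every tweet ("spark" here) gets a weight of zero, while rarer terms such as "fast" or "#bigdata" get positive weights, which is the property that makes TF-IDF vectors useful as model inputs.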

Word cloudify your dataset (4:42)
Explore creating word clouds from text data by calculating word counts, selecting top words, and rendering visuals with the R wordcloud library and Python's matplotlib.
Locating users and displaying heatmaps with GeoHash (3:51)
Locate users and aggregate their locations into heatmaps using GeoHash encoding. Create a geohash and render the map with the Leaflet library to visualize tweets around London.
Collaborating on the same note with peers (4:47)
Collaborate on the same note with peers over WebSocket, authenticate users, and set note permissions in Zeppelin. Explore the interpreter binding options for collaborative work: shared, scoped, and isolated.
Create visual dashboards for your business stakeholders (3:18)
Create visual dashboards and share live, interactive insights with business stakeholders by embedding widgets, using skins, and updating dashboards in real time via WebSockets.
Test Your Knowledge
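The GeoHash aggregation mentioned in this section can be illustrated with a short plain-Python sketch. It follows the standard GeoHash scheme (interleaved longitude and latitude bits, five bits per base-32 character); the function names, the default precision, and the sample coordinates are assumptions made for this example rather than the course's own code.

```python
from collections import Counter

# Standard GeoHash base-32 alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a GeoHash string."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def heatmap_counts(points, precision=5):
    """Count points per GeoHash cell; the counts can drive a heatmap layer."""
    return Counter(geohash_encode(lat, lon, precision) for lat, lon in points)


# Hypothetical tweet locations: two in central London, one in Denmark.
points = [(51.5074, -0.1278), (51.5074, -0.1278), (57.64911, 10.40744)]
counts = heatmap_counts(points, precision=3)
```

Shorter hashes give larger cells, so the precision argument controls how coarse the resulting heatmap is.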

Requirements

A basic knowledge of statistics and computational mathematics
Prior knowledge of Python and Scala would be beneficial

Description

Are you looking to expand your knowledge of data science operations in Spark? Are you a data scientist who wants to understand how algorithms are implemented in Spark, or a newcomer with minimal development experience who wants to learn about Big Data analytics? If so, this course is ideal for you. Let’s start this data science journey together.

When people want a way to process Big Data at speed, Spark is invariably the solution. With its ease of development (in comparison to the relative complexity of Hadoop), it’s unsurprising that it’s becoming popular with data analysts and engineers everywhere. It is one of the most widely-used large-scale data processing engines and runs extremely fast.

The aim of the course is to make you comfortable and confident at performing real-time data processing using Spark.

What is included?

This course is designed to give you the right and relevant information on Spark. The road ahead may be bumpy on occasion, and some topics may be more challenging than others, but I hope that you will embrace this opportunity and focus on the reward. Throughout this course, we will add many powerful techniques to your arsenal that will help you solve problems.

Let’s take a look at the learning journey. The course begins with the basics of Spark 2 and covers the core data processing framework and API, installation, and application development setup. Then, you’ll be introduced to the Spark programming model through real-world examples. Next, you’ll learn how to collect, clean, and visualize data coming from Twitter with Spark Streaming. Then, you will get acquainted with Spark machine learning algorithms and different machine learning techniques. You will also learn to apply statistical analysis and mining operations to your dataset. The course will give you ideas on how to perform analysis, including graph processing. Finally, we will take up an end-to-end case study and apply all that we have learned so far.

By the end of the course, you should be able to put your learnings into practice for faster, slicker Big Data projects.

Why should I choose this course?

Packt courses are carefully designed to deliver the best learning experience possible. This course is a blend of text, videos, code examples, and quizzes, which together make your learning journey more exciting and rewarding. This helps you learn a range of topics at your own speed and move towards your goal of learning the technology. We have prepared this course using extensive research and curation skills. Each section adds to the skills learned and helps you to achieve mastery of Spark.

This course is an amalgamation of sections that form a sequential flow of concepts covering a focused learning path presented in a modular manner. We have combined the best of the following Packt products:

Data Science with Spark by Eric Charles
Spark for Data Science by Bikramaditya Singhal and Srinivas Duvvuri
Apache Spark 2 for Beginners by Rajanarayanan Thottuvaikkatumana

Meet your expert instructors:

For this course, we have combined the best works of these extremely esteemed authors:

Eric Charles has 10 years of experience in the field of data science and is the founder of Datalayer, a social network for data scientists. He is passionate about using software and mathematics to help companies get insights from data.

Bikramaditya Singhal is a data scientist with about 7 years of industry experience. He is an expert in statistical analysis, predictive analytics, machine learning, Bitcoin, Blockchain, and programming in C, R, and Python. He has extensive experience in building scalable data analytics solutions in many industry sectors.

Srinivas Duvvuri is currently the senior vice president of development, heading the development teams for the fixed income suite of products at Broadridge Financial Solutions (India) Pvt Ltd. He also leads the Big Data and Data Science COE and is the principal member of the Broadridge India Technology Council.

Rajanarayanan Thottuvaikkatumana, Raj, is a seasoned technologist with more than 23 years of software development experience at various multinational companies. He has worked on various technologies including major databases, application development platforms, web technologies, and Big Data technologies.

Who this course is for:

This course is for anyone who wants to work with Spark on large and complex datasets.
Data analysts, data scientists, and Big Data architects interested in exploring the data processing power of Apache Spark will find this course very useful.

Course content

Big Data and Data Science (2 lectures • 11min)

The Spark Programming Model (5 lectures • 32min)

Spark SQL and DataFrames (2 lectures • 23min)

Data Analysis on Spark (4 lectures • 46min)

First Step with Spark Visualization (7 lectures • 36min)

The Spark Machine Learning Algorithms (7 lectures • 31min)

Collecting and Cleansing the Dirty Tweets (4 lectures • 17min)

Statistical Analysis on Tweets (4 lectures • 17min)

Extracting Features from the Tweets (4 lectures • 17min)

Mine Data and Share Results (4 lectures • 17min)
