Learning PySpark

Rating: 4.1 (189 reviews)

Building and deploying data-intensive applications at scale using Python and Apache Spark

Created by Packt Publishing

Last updated 4/2018

English

What you'll learn

Learn about Apache Spark and the Spark 2.0 architecture.
Understand RDD schemas, lazy execution, and transformations.
Explore how to sort and save the elements of an RDD.
Build and interact with Spark DataFrames using Spark SQL.
Create and explore various APIs to work with Spark DataFrames.
Learn how to change the schema of a DataFrame programmatically.
Explore how to aggregate, transform, and sort data with DataFrames.

Course content

5 sections • 49 lectures • 2h 28m total length

The Course Overview (5:52)
This video gives an overview of the entire course.
Brief Introduction to Spark (2:04)
The aim of this video is to explain Spark and its Python interface.
Apache Spark Stack (1:38)
The aim of this video is to provide a brief overview of the Apache Spark stack components.
Spark Execution Process (1:26)
The aim of this video is to briefly review the execution process.
Newest Capabilities of PySpark 2.0+ (1:56)
The aim of this video is to briefly review the newest features of Spark 2.0+.
Cloning GitHub Repository (1:56)
The aim of this video is to clone the GitHub repository for the course. Doing this will set up everything we need for the following videos.

Brief Introduction to RDDs (1:49)
In this video, we will provide a brief overview of one of the fundamental data structures of Spark: the RDD.
Creating RDDs (4:38)
In this video, we will learn how to create RDDs in many different ways.
Schema of an RDD (2:17)
In this video, we explore the advantages and disadvantages of an RDD’s lack of schema.
Understanding Lazy Execution (2:11)
Spark processes data lazily. In this video, we will learn why this is an advantage.
Introducing Transformations – .map(…) (3:57)
In this video, we will introduce lambdas and the .map(…) transformation.
Introducing Transformations – .filter(…) (2:23)
In this video, we will learn how to filter data from RDDs.
Introducing Transformations – .flatMap(…) (6:14)
In this video, we will explain the difference between the .flatMap(…) and .map(…) transformations, and we will learn to use .flatMap(…) to filter out malformed records.
Introducing Transformations – .distinct(…) (3:27)
In this video, we will explore what the .distinct(…) transformation does.
Introducing Transformations – .sample(…) (3:15)
In this video, we will learn how to sample data from RDDs.
Introducing Transformations – .join(…) (4:17)
In this video, we will learn how to join two RDDs.
Introducing Transformations – .repartition(…) (4:16)
In this video, we will explore how to effectively use repartitioning.
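
The difference between .map(…) and .flatMap(…) mentioned above can be pictured without a Spark cluster. The following is a minimal plain-Python sketch, not code from the course: the helpers `local_map` and `local_flat_map`, the parser `parse_record`, and the sample lines are all hypothetical and only imitate the semantics on a local list. The idea is that a parser returning an empty list for a bad line makes the flattening step drop that line.

```python
from itertools import chain


def parse_record(line):
    """Return [(name, age)] for a well-formed 'name,age' line, else []."""
    parts = line.split(",")
    if len(parts) != 2:
        return []
    name, age = parts[0].strip(), parts[1].strip()
    if not name or not age.isdigit():
        return []
    return [(name, int(age))]


def local_map(func, records):
    """Imitates .map(...): exactly one output element per input element."""
    return [func(record) for record in records]


def local_flat_map(func, records):
    """Imitates .flatMap(...): each returned list is flattened into the output."""
    return list(chain.from_iterable(func(record) for record in records))


# Hypothetical input: two well-formed and two malformed lines.
lines = ["Anna,34", "broken line", "Bob,27", "Eve,unknown"]

mapped = local_map(parse_record, lines)            # keeps empty lists for bad lines
flat_mapped = local_flat_map(parse_record, lines)  # bad lines disappear

print(mapped)
print(flat_mapped)
```

Assuming an RDD of text lines called `rdd`, the corresponding PySpark call would be `rdd.flatMap(parse_record)`.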

Introducing Actions – .take(…) (5:43)
In this video, we will focus on one of the most fundamental tools any data scientist can use: the .take(…) action.
Introducing Actions – .collect(…) (2:15)
In this video, we will learn when to use the .collect(…) action and when to avoid it.
Introducing Actions – .reduce(…) and .reduceByKey(…) (2:59)
In this video, we will learn two more fundamental methods from the MapReduce paradigm: .reduce(…) and .reduceByKey(…).
Introducing Actions – .count() (2:36)
In this video, we will learn how to count the number of records in an RDD.
Introducing Actions – .foreach(…) (1:51)
In this video, we will learn how to execute an action on each element of an RDD in each of its partitions.
Introducing Actions – .aggregate(…) and .aggregateByKey(…) (4:55)
In this video, we will explore how to aggregate the data within each partition first before collecting the results on the driver for the final aggregation.
Introducing Actions – .coalesce(…) (2:05)
In this video, we will learn when and why to use the .coalesce(…) method instead of .repartition(…).
Introducing Actions – .combineByKey(…) (3:11)
In this video, we will learn about the most flexible data reduction method.
Introducing Actions – .histogram(…) (1:50)
In this video, we will learn how to bin data into buckets.
Introducing Actions – .sortBy(…) (2:38)
In this video, we will learn how to sort data within an RDD.
Introducing Actions – Saving Data (3:10)
In this video, we will explore how to save data from an RDD.
Introducing Actions – Descriptive Statistics (2:14)
In this video, we will explore some basic descriptive statistics.
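
The two-stage pattern described for .aggregate(…) (reduce within each partition first, then merge the partial results) can be sketched in plain Python. This is an illustration under stated assumptions rather than Spark code: `local_aggregate` is a hypothetical helper, and the partitions are modeled as a list of lists. The argument order follows that of PySpark's `rdd.aggregate(zeroValue, seqOp, combOp)`.

```python
from functools import reduce


def local_aggregate(partitions, zero_value, seq_op, comb_op):
    """Imitates .aggregate(...) on a list of lists standing in for partitions."""
    # Stage 1: fold each partition independently, starting from the zero value.
    partials = [reduce(seq_op, partition, zero_value) for partition in partitions]
    # Stage 2: merge the per-partition results into the final value.
    return reduce(comb_op, partials, zero_value)


# Hypothetical data split across three partitions.
partitions = [[1, 2, 3], [4, 5], [6]]

# Accumulator is a (running_sum, running_count) pair.
zero = (0, 0)
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # add one element to an accumulator
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two accumulators

total, count = local_aggregate(partitions, zero, seq_op, comb_op)
mean = total / count
print(total, count, mean)
```

Keeping a (sum, count) pair rather than a running mean is a design choice: partial sums and counts can be merged exactly, whereas per-partition means cannot be combined without knowing each partition's size.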

Introduction (1:35)
In this video, we will provide a brief introduction to Spark DataFrames.
Creating DataFrames (4:16)
In this video, we will learn how to create DataFrames.
Specifying Schema of a DataFrame (6:00)
In this video, we will learn how to specify the schema of a DataFrame.
Interacting with DataFrames (1:36)
In this video, we will discuss different ways of interacting with DataFrames.
The .agg(…) Transformation (3:19)
In this video, we will learn how to use the .agg(…) method to aggregate data.
The .sql(…) Transformation (3:57)
In this video, we will learn how to use the .sql(…) transformation to interact with the data in a DataFrame.
Creating Temporary Tables (2:31)
In this video, we will learn how to create temporary views over a DataFrame.
Joining Two DataFrames (3:54)
In this video, we will learn how to join two DataFrames.
Performing Statistical Transformations (3:55)
In this video, we will learn how to calculate descriptive statistics in DataFrames.
The .distinct(…) Transformation (1:30)
In this video, we will learn how to retrieve distinct values from a DataFrame.

Schema Changes (6:28)
In this video, we will learn how to drop and rename columns and how to handle missing observations.
Filtering Data (1:31)
In this video, we will learn how to filter data.
Aggregating Data (2:34)
In this video, we will learn how to aggregate data.
Selecting Data (2:24)
In this video, we will learn how to select data from a DataFrame.
Transforming Data (1:40)
In this video, we will learn how to transform data.
Presenting Data (1:34)
In this video, we will learn how to present data.
Sorting DataFrames (1:00)
In this video, we will learn how to sort data contained within a DataFrame.
Saving DataFrames (4:28)
In this video, we will learn how to save DataFrames in a number of file formats.
Pitfalls of UDFs (3:38)
In this video, we will discuss the pitfalls of using pure Python user-defined functions.
Repartitioning Data (1:58)
In this video, we will learn how to repartition the data.

Requirements

A firm understanding of Python is expected to get the best out of the tutorial. Familiarity with Spark would also be helpful.

Description

Apache Spark is an open-source distributed engine for querying and processing data. In this tutorial, we provide a brief overview of Spark and its stack. This tutorial presents effective, time-saving techniques for leveraging the power of Python in the Spark ecosystem. You will start by getting a firm understanding of the Apache Spark architecture and how to set up a Python environment for Spark.

You'll learn about different techniques for collecting data and how to distinguish between, and understand, the techniques for processing it. Next, we provide an in-depth review of RDDs and contrast them with DataFrames. We provide examples of how to read data from files and from HDFS and how to specify schemas using reflection or programmatically (in the case of DataFrames). The concept of lazy execution is described, and we outline various transformations and actions specific to RDDs and DataFrames.

Finally, we show you how to use SQL to interact with DataFrames. By the end of this tutorial, you will have learned how to process data using Spark DataFrames and mastered techniques for collecting data through distributed data processing.
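
The lazy execution mentioned in the description can be illustrated with a small plain-Python analogy. This is a sketch under stated assumptions, not how Spark is implemented: the class `LazyPipeline` is hypothetical, its `map` and `filter` methods stand in for transformations (they only record a step), and `collect` stands in for an action (it finally runs the recorded steps).

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, func):
        # Transformation: returns a new pipeline with one more recorded step.
        return LazyPipeline(self._data, self._steps + (("map", func),))

    def filter(self, predicate):
        # Transformation: likewise only recorded.
        return LazyPipeline(self._data, self._steps + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded chain of steps actually evaluated.
        result = iter(self._data)
        for kind, func in self._steps:
            result = map(func, result) if kind == "map" else filter(func, result)
        return list(result)


calls = []  # records every element that square() is actually applied to


def square(x):
    calls.append(x)
    return x * x


pipeline = LazyPipeline(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # nothing has been computed yet

result = pipeline.collect()       # the action triggers the computation
print(calls_before_action, result, len(calls))
```

Deferring the work until an action is requested is what lets an engine see the whole chain of steps before running any of them.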

About the Author

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 12 years' international experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting.

Tomasz started his career in 2003 with LOT Polish Airlines in Warsaw, Poland while finishing his Master's degree in strategy management. In 2007, he moved to Sydney to pursue a doctoral degree in operations research at the University of New South Wales, School of Aviation; his research crossed boundaries between discrete choice modeling and airline operations research. During his time in Sydney, he worked as a Data Analyst for Beyond Analysis Australia and as a Senior Data Analyst/Data Scientist for Vodafone Hutchison Australia among others. He has also published scientific papers, attended international conferences, and served as a reviewer for scientific journals.

In 2015, he relocated to Seattle to begin his work for Microsoft. While there, he has worked on numerous projects involving solving problems in high-dimensional feature spaces.

Who this course is for:

If you are a Python developer keen to master hands-on techniques using the Apache Spark 2.x ecosystem in the best possible manner, this video is for you.

Course content

A Brief Primer on PySpark • 6 lectures • 15min

Resilient Distributed Datasets • 11 lectures • 39min

Resilient Distributed Datasets and Actions • 12 lectures • 35min

DataFrames and Transformations • 10 lectures • 33min

Data Processing with Spark DataFrames • 10 lectures • 27min
