Databricks and PySpark for Big Data: From Zero to Expert

Rating: 4.0 (215 reviews)

Complete course to learn Databricks, including PySpark, Dataframes, Machine Learning, Advanced Analytics and Streaming

Created by Data Data

Last updated 2/2024

English

What you'll learn

Processing Big Data with PySpark in Databricks
Databricks environment and Platform
ETL, Dataframes and data visualization in Databricks
PySpark in Databricks with RDDs, Spark Dataframes API or Spark SQL
Spark Column Expressions and Dataframe Aggregations
Spark Data Sources and Format types
Spark Architecture Concepts and Query Optimization
Advanced analytics and data visualization with Databricks
Machine Learning with Spark at Databricks
Spark Streaming at Databricks

Course content

20 sections • 94 lectures • 5h 29m total length

Course Material • 0:09
How to get the most out of the course • 15:27
Explore how to navigate the Udemy platform, access course material and certificates, and apply best practices, notes, and Q&A to learn Databricks and PySpark effectively.

Spark Fundamentals • 1:43
How Apache Spark works • 2:03
Explore how Apache Spark runs on a cluster with a driver and workers, where the Spark session coordinates tasks per partition, caches iteration data, and returns results.
Apache Spark ecosystem and official documentation • 5:02
Explore the Apache Spark ecosystem—from the Spark Core API and RDDs to Spark SQL, DataFrames, streaming, MLlib, and GraphX—and consult the official documentation at spark.apache.org for examples and guides.
PySpark: cluster management and architecture • 3:30
Explore PySpark, the Python API for Apache Spark, which runs code on clusters through a driver context, a master–slave architecture, and cluster managers such as standalone, Apache Mesos, Hadoop YARN, and Kubernetes.
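
The driver and worker flow described in this section can be pictured with a small plain-Python analogy. This is only a sketch, not Spark code: the helper names (`split_into_partitions`, `run_task`, `driver`) are hypothetical, and a thread pool stands in for the worker nodes of a real cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per partition."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """The work one task performs on one partition (here: a partial sum)."""
    return sum(partition)


def driver(data, num_partitions=4):
    """Play the driver: schedule one task per partition, then combine the results."""
    partitions = split_into_partitions(data, num_partitions)
    # The thread pool is an assumed stand-in for the cluster's workers.
    with ThreadPoolExecutor(max_workers=num_partitions) as workers:
        partial_results = list(workers.map(run_task, partitions))
    # Only the small partial results travel back to the driver.
    return sum(partial_results)


total = driver(list(range(1, 101)))
print(total)
```

The point of the analogy is the shape of the computation: the data is split once, each task sees only its own partition, and the driver combines small per-partition results rather than the raw data.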

Spark installation: downloading tools • 3:36
Install Spark on your local computer: download it from the official Apache Spark site, unzip it, and set the log4j level to ERROR; then install Anaconda and the winutils build that matches your Hadoop version.
Installing Spark: setting environment variables • 3:49
Install Spark on Windows: install the Java Development Kit on the C drive, unzip Spark, and set the environment variables (HADOOP_HOME, SPARK_HOME, JAVA_HOME) and the PATH needed to run Spark.
Running Spark at the prompt and jupyter notebook • 2:52
Launch and validate Spark from the Anaconda prompt and from a Jupyter notebook, ensuring PySpark installs, imports, and a Spark context initializes with the Spark UI accessible.

Fundamentals and advantages of DataFrames • 3:00
Explore how DataFrames provide a tabular data structure in Spark, enabling scalable processing of structured and semi-structured data, with missing-value imputation, query optimization, and multi-language support.
Characteristics of DataFrames and data sources • 2:03
Learn the characteristics of DataFrames (distributed, lazily evaluated, immutable, and fault-tolerant) alongside data sources such as CSV, JSON, XML, Parquet, RDDs, Hive, Cassandra, DFS, and local files.
Creating DataFrames in PySpark • 3:16
Create and work with dataframes in Apache Spark using PySpark by initializing a Spark session and building dataframes from data collections, Hive tables, and RDDs.
Operations with PySpark DataFrames • 6:21
Explore core PySpark DataFrame operations including count, columns, types, printSchema, select, filter, drop, withColumn, and group by aggregations with count, sum, max, min, and average, and sort by salary.
Different types of joins in DataFrames • 3:27
Explore the core Spark DataFrame joins (inner, left outer, right outer, full outer, left anti, and right anti) using the DataFrame join syntax on depth and id.
SQL queries in PySpark • 2:45
Learn to use the Spark SQL API to run SQL-like queries by registering a DataFrame as a temporary view tied to the Spark session, then apply sql, filter, and distinct.
Advanced functions for loading and exporting data in PySpark • 3:02
Learn how to create and export Spark DataFrames from Hive tables, CSV files, and JDBC sources using read, write, and save, with mode and path options.
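
The join types covered in this section differ only in which unmatched rows they keep. The sketch below shows that logic on plain Python lists of dicts. It is not Spark code: the `join` helper, the sample rows, and the `dept_id` column name are all assumptions made for illustration. In PySpark the join type is selected through the `how` argument of `DataFrame.join`.

```python
# Hypothetical sample data, not taken from the course.
employees = [
    {"name": "Ana", "dept_id": 10},
    {"name": "Luis", "dept_id": 20},
    {"name": "Eva", "dept_id": 99},  # no matching department
]
departments = [
    {"id": 10, "dept": "Sales"},
    {"id": 20, "dept": "IT"},
    {"id": 30, "dept": "HR"},  # no employees
]


def join(left, right, left_key, right_key, how="inner"):
    """Join two lists of dicts; supports inner, left_outer and left_anti."""
    if how not in {"inner", "left_outer", "left_anti"}:
        raise ValueError(f"unsupported join type: {how}")
    right_columns = {column: None for row in right for column in row}
    matches_by_key = {}
    for row in right:
        matches_by_key.setdefault(row[right_key], []).append(row)

    result = []
    for row in left:
        matches = matches_by_key.get(row[left_key], [])
        if how == "left_anti":
            # Keep only left rows without a match; right columns are never added.
            if not matches:
                result.append(dict(row))
        elif matches:
            result.extend({**row, **match} for match in matches)
        elif how == "left_outer":
            # Unmatched left rows survive, with None in the right-hand columns.
            result.append({**row, **right_columns})
    return result


inner = join(employees, departments, "dept_id", "id", "inner")
left_outer = join(employees, departments, "dept_id", "id", "left_outer")
left_anti = join(employees, departments, "dept_id", "id", "left_anti")
# A right anti join is a left anti join with the two sides swapped.
right_anti = join(departments, employees, "id", "dept_id", "left_anti")

print(left_anti)
print(right_anti)
```

Swapping the two sides is a simple way to get the "rows on the right with no match on the left" result from a left anti join.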

Advanced Features and Performance Optimization • 3:53
Explore advanced PySpark functions and performance optimization techniques, including cache and persist, UDFs, partitions, and DataFrame processing, to reduce redundant work and speed up transformations.
Broadcast join and caching • 5:09
Optimize Spark performance with broadcast joins and caching, tune thresholds, and manage memory and disk storage, then handle missing values with imputation.
User Defined Functions (UDF) and advanced SQL functions • 4:08
Learn to generate SQL expressions and UDFs in Spark to categorize salaries as low, medium, or high. Use expr, selectExpr, withColumn, and Python UDFs to apply the logic in DataFrames.
Handling and imputation of missing values • 2:51
Identify null values with isnull, filter and show them, replace nulls with the string 'invalid' using fillna, and drop rows with nulls using dropna, controlling behavior with how and subset.
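
The `how` and `subset` options mentioned for dropping nulls are easiest to understand on a tiny example. The following is a plain-Python sketch of that behavior, not the Spark implementation: `fill_nulls` and `drop_nulls` are hypothetical helpers, the rows are made up, and unlike Spark the sketch does not look at column types when filling.

```python
# Hypothetical rows; None plays the role of a null value.
rows = [
    {"name": "Ana", "dept": "Sales", "salary": 3000},
    {"name": "Luis", "dept": None, "salary": 2500},
    {"name": None, "dept": None, "salary": None},
]


def fill_nulls(rows, value, subset=None):
    """Replace None with `value` in the chosen columns (all columns if subset is None)."""
    return [
        {
            column: value if cell is None and (subset is None or column in subset) else cell
            for column, cell in row.items()
        }
        for row in rows
    ]


def drop_nulls(rows, how="any", subset=None):
    """Drop rows with nulls in the checked columns.

    how="any" drops a row if at least one checked column is None;
    how="all" drops it only if every checked column is None.
    """
    if how not in {"any", "all"}:
        raise ValueError(f"how must be 'any' or 'all', got {how!r}")
    has_nulls = any if how == "any" else all
    kept = []
    for row in rows:
        columns = subset if subset is not None else list(row)
        if not has_nulls(row[column] is None for column in columns):
            kept.append(row)
    return kept


filled = fill_nulls(rows, "invalid", subset=["name", "dept"])
only_complete = drop_nulls(rows)                      # how="any"
not_fully_empty = drop_nulls(rows, how="all")
with_salary = drop_nulls(rows, subset=["salary"])

print(len(only_complete), len(not_fully_empty), len(with_salary))
```

Restricting the fill to the text columns keeps the string 'invalid' out of the numeric salary column, which is why `subset` is passed explicitly here.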

Introduction to Databricks • 2:41
Explore how Databricks, built on Apache Spark, enables big data analytics, batch and streaming processing, and collaborative data science in cloud platforms like Azure, AWS, and Google Cloud.
Databricks Terminology and Databricks Community • 3:17
Learn Databricks terminology and how the free Community Edition limits resources, clusters, and workspace use, while covering notebooks, libraries, tables, clusters, and jobs.
Create a free Databricks account • 1:49
Create a free Databricks account by registering on the official site, choosing the Community Edition with up to 15 GB of storage and basic notebooks, and verifying your email to access the platform.

Introduction to the Databricks environment • 9:41
Learn to navigate the Databricks environment, create notebooks, clusters, and tables, manage data sources and jobs, and configure language, MLflow, and the model registry for data science.
First steps with Databricks • 6:06
Master the Databricks quickstart: create notebooks, run SQL and PySpark cells, build a Delta Lake table from CSV, and analyze diamonds by color with average price.

Databricks Utilities • 3:33
Learn to use Databricks utils to manage secrets, file systems, libraries, and notebooks, and apply the data summarize command to Spark DataFrames such as the diamonds dataset to reveal summary statistics.
Databricks Utils for managing File System and libraries • 3:42
Learn how Databricks utils manages file systems and Python libraries: performing file operations (cp, makedir, put, remove), viewing data (head), and installing and listing libraries (NumPy, pandas, TensorFlow).
Databricks Utils for notebooks, secrets and Widgets • 5:16
Explore Databricks utils to orchestrate modular notebooks, run and exit workflows, and manage secrets and widgets, including combo boxes, drop-downs, and text inputs for dynamic environments.

Creating and saving DataFrames in Databricks • 4:44
Create dataframes in PySpark, union them, and save to parquet in Databricks, using dbutils to delete existing files and then read and display the data.
Transformation and visualization of data in Databricks • 6:25
Transform nested data into a dataframe, apply filters and sorts, handle nulls, compute aggregations like sum and distinct counts, and visualize salaries using Pandas and Matplotlib in Databricks.

Fundamentals of Machine Learning with Spark • 2:17
Explore machine learning with Spark, covering supervised, unsupervised, and reinforcement learning, and build end-to-end pipelines with Spark ML dataframes, including ETL, development, and deployment.
Spark Machine Learning components • 2:14
Explore Spark MLlib components, including core algorithms like classification and regression, data transformation and feature selection, pipelines, and persistence to build end-to-end models across Scala, Java, and Python.
Stages in the development of a Machine Learning model • 4:53
Discover the stages of building a machine learning model with Spark, from ETL and data preparation to feature engineering, model training, and deployment, including VectorAssembler and StringIndexer.
Machine Learning Model Definition and Pipeline Development • 4:26
Define a logistic regression model for binary classification and build a PySpark pipeline with a one-hot encoder, label indexing, vector assembly, and logistic regression, then train, predict, and interpret probabilities.
Model evaluation with PySpark and Databricks • 2:56
The lecture demonstrates evaluating binary classification with PySpark and Databricks, using area under the curve and accuracy metrics via binary and multiclass evaluators to gauge performance.
Hyperparameter setting and logging in MLflow • 3:54
Tune logistic regression hyperparameters with a 3x3 grid via ParamGridBuilder and CrossValidator, while MLflow logs experiments and the AUC metric for reproducible results.
Predictions with new data and visualization of results • 4:12
Make predictions on new data using best hyperparameters from the cross validator and visualize results. Evaluate area under the curve and accuracy, and analyze predictions with SQL queries and graphs.
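
To see what a 3x3 hyperparameter grid with cross-validation costs, it helps to enumerate the combinations. The sketch below uses only the standard library. The parameter names `regParam` and `elasticNetParam` exist on Spark's logistic regression, but whether the course tunes these two, the candidate values, the fold count, and the `toy_auc` scoring function are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical candidate values; the course page does not list the real ones.
reg_param_values = [0.01, 0.1, 1.0]
elastic_net_values = [0.0, 0.5, 1.0]
num_folds = 3  # assumed number of cross-validation folds

# A 3x3 grid: every value of one parameter paired with every value of the other.
param_grid = [
    {"regParam": reg, "elasticNetParam": enet}
    for reg, enet in product(reg_param_values, elastic_net_values)
]

# Each combination is trained and scored once per fold.
total_fits = len(param_grid) * num_folds


def toy_auc(params):
    """Made-up score standing in for the mean cross-validated AUC."""
    return 1.0 - abs(params["regParam"] - 0.1) - 0.1 * params["elasticNetParam"]


# The best combination is the one with the highest mean score.
best_params = max(param_grid, key=toy_auc)

print(len(param_grid), total_fits)
print(best_params)
```

The multiplication is the practical takeaway: adding one more value to either parameter or one more fold increases the number of model fits quickly, which is one reason to log each run so results are not recomputed.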

Requirements

PySpark Fundamentals

Description

If you are looking for a hands-on, complete and advanced course to learn Databricks and PySpark, you have come to the right place.

Databricks is a data analytics platform powered by Apache Spark for data engineering, data science, and machine learning. Databricks has become one of the most important platforms to work with Spark, compatible with Azure, AWS and Google Cloud. This makes Databricks and Apache Spark some of the most in-demand skills for data engineers and data scientists, and some of the most valuable skills today. This course will teach you everything you need to know to position yourself in the Big Data job market.

This course is designed to prepare you to learn everything related to Databricks and Apache Spark, from the Databricks environment, platform and functionalities, to Spark SQL API, Spark Dataframes, Spark Streaming, Machine Learning, advanced analytics and data visualization in Databricks.

With complete training, downloadable study guides, hands-on exercises, and real-world use cases, this is the only course you'll ever need to learn Databricks and Apache Spark. You will learn Databricks, starting from the basics and moving to the most advanced functionalities. To do so, we will use visual presentations, with clear explanations and useful professional advice.

This course covers the following sections:

Introduction to Big Data and Apache Spark
Spark Fundamentals with Spark RDDs, Dataframes
Databricks environment
Advanced analytics and data visualization with Databricks
Machine Learning with Spark at Databricks
Spark Streaming at Databricks

If you're ready to improve your skills, increase your career opportunities, and become a Big Data expert, join today and get immediate and lifetime access to:

• Complete Guide to Databricks with Apache Spark (PDF e-book)

• Downloadable project files

• Practical exercises and questionnaires

• Databricks resources such as cheat sheets and summaries

• 1 to 1 expert support

• Course Q&A forum

See you there!

Who this course is for:

Anyone who wants to learn Databricks
Anyone who wants to learn advanced big data skills
Anyone who wants to build a career as a data engineer, data analyst or data scientist
Anyone interested in learning Apache Spark and PySpark for Big Data analytics
Anyone who wants to learn cutting-edge technology in data processing

Course content

Introduction to this course • 2 lectures • 16min

Introduction to Apache Spark and Big Data • 4 lectures • 12min

Installation of Spark on premises (Additional) • 3 lectures • 10min

Spark DataFrames and Apache Spark SQL • 7 lectures • 24min

Spark Advanced Features • 4 lectures • 16min

Databricks Fundamentals • 3 lectures • 8min

Databricks Platform • 2 lectures • 16min

Databricks Utilities • 3 lectures • 13min

ETL, Dataframes and data visualization in Databricks • 2 lectures • 11min

Machine learning with Databricks and Apache Spark • 7 lectures • 25min