Data Cleaning & Preprocessing in Python for Machine Learning

Rating: 4.4 (35 reviews)

Learn how to resolve Data Quality issues in Machine Learning & Data Science using Data Cleaning in Python Pandas.

Created by Ajatshatru Mishra

Last updated 7/2022

English

What you'll learn

You will learn how to detect and impute missing values in the data.
How to detect and rectify incorrect data types.
How to deal with Categorical Columns.
How to detect and replace incorrect values with correct ones.
How to use the apply method with lambda functions for advanced cleaning.
How to group the dataset by a particular column.
How to detect and remove outliers.
How to perform feature scaling.
How to clean and preprocess textual data for NLP.

Course content

4 sections • 31 lectures • 1h 34m total length

Introduction (1:16)
Learn to detect and resolve common data issues in real-world datasets using Python, including missing values, incorrect data types, feature scaling, normalization, handling categorical variables, structural problems, and outliers.
Curriculum (1:39)
Explore a three-section curriculum on data quality checks, fixing issues with imputation and data types, and NLP preprocessing like tokenization, stop words, and stemming.
Installation and Setup (2:03)

The Dataset (1:08)
Explore a dummy dataset that mimics website demographics, identify missing values and incorrect formats, and learn how to fix issues using Python for machine learning and analytics.
The Dataset File (0:03)
Finding Data types and Structure (5:25)
Learn how to perform data quality checks in Python by identifying missing values, duplicates, and data type issues, using Pandas and NumPy to inspect a data frame's shape and info.
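A minimal sketch of this kind of structural check is shown here. The small data frame and its column names are hypothetical stand-ins, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the course dataset; column names are assumptions.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34.0, np.nan, 27.0],
    "country": ["India", "USA", "India"],
    "joined": ["2021-03-15", "2021-04-01", "2021-05-20"],
})

n_rows, n_cols = df.shape   # (number of rows, number of columns)
column_types = df.dtypes    # dtype of each column
df.info()                   # prints dtypes and non-null counts per column
```

In this toy frame, the non-null count that info() reports for age is lower than the row count, which is the first hint of missing data.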
Using the "unique()" function for detecting anomalies (1:48)
Detecting Missing Values (1:36)
Detect missing values to assess data quality by counting missing values per column, tallying rows with missing data, and computing the overall missing percentage relative to the dataset length.
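The three counts described above could be computed roughly as follows. The data frame is a made-up example, and the percentage is taken here as rows with at least one missing value divided by the dataset length, which is one reading of the description.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names are assumptions.
df = pd.DataFrame({
    "age": [34.0, np.nan, 27.0, np.nan],
    "gender": ["F", "M", None, "M"],
    "monthly_visits": [12, 3, 7, 20],
})

missing_per_column = df.isna().sum()                   # missing count per column
rows_with_missing = int(df.isna().any(axis=1).sum())   # rows with at least one gap
pct_rows_missing = 100 * rows_with_missing / len(df)   # share of affected rows
```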
Detecting Duplicate Values (0:58)
Detect duplicated rows to assess data quality in raw data. Use the pandas duplicated method to report any duplicates, ensuring data confidence and preventing downstream costs.
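As a small illustration, assuming a hypothetical frame with one repeated row, the pandas duplicated method flags every row that repeats an earlier one:

```python
import pandas as pd

# Hypothetical data with one fully repeated row.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "country": ["India", "USA", "USA", "UK"],
})

duplicate_mask = df.duplicated()          # True where a row repeats an earlier row
n_duplicates = int(duplicate_mask.sum())  # number of repeated rows
deduplicated = df.drop_duplicates()       # keeps the first occurrence of each row
```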
Jupyter Notebook (0:03)
Detecting Data Issues

Replacing the Incorrect Values (3:41)
Imputing the Missing Values (4:52)
Learn to handle missing values by imputing with column means using a simple imputer, preserving data rather than dropping rows in data cleaning and preprocessing for machine learning.
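One way to sketch mean imputation is with scikit-learn's SimpleImputer. The columns and values below are invented for illustration, and the lecture's exact code may differ.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with gaps.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "monthly_visits": [10.0, 30.0, np.nan],
})

numeric_cols = ["age", "monthly_visits"]
imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```

No rows are dropped: each gap is filled with the mean of the observed values in its column.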
Dropping the Missing Values (1:29)
Removing Whitespaces (2:16)
Apply the strip method to remove leading and trailing white spaces from text columns like gender and country, ensuring consistent data entries for machine learning preprocessing.
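A short sketch of the idea, using made-up gender and country values with stray spaces:

```python
import pandas as pd

# Hypothetical text columns with leading and trailing spaces.
df = pd.DataFrame({
    "gender": [" Male", "Female ", " Male "],
    "country": ["India ", " USA", "India"],
})

for column in ["gender", "country"]:
    # .str.strip() removes whitespace at both ends of every entry
    df[column] = df[column].str.strip()
```

After stripping, "India " and "India" collapse into a single category instead of being counted as two.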
Dealing with Dates (3:22)
Transform date columns from strings to datetime using pandas to_datetime, after inspecting dtypes with info, and fix inconsistent entries like 15th March 2021 to enable reliable machine learning preprocessing.
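The conversion might look like the sketch below. The column name, the sample dates, and the choice to correct the odd entry by hand before parsing are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical date column stored as strings, with one inconsistent entry.
df = pd.DataFrame({"date": ["2021-03-01", "15th March 2021", "2021-03-20"]})

# Bring the inconsistent entry in line with the other rows first.
df["date"] = df["date"].replace({"15th March 2021": "2021-03-15"})

# With a uniform format, the whole column can be parsed in one call.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")
```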
Fixing the Data types (2:34)
Identify and fix data types by converting the age column from float to integer with astype, verify with info, and convert other columns to string as needed.
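A minimal sketch with invented values follows. Note that astype(int) raises an error if the column still contains missing values, so this step normally comes after imputing or dropping them.

```python
import pandas as pd

# Hypothetical columns; age has already had its missing values handled.
df = pd.DataFrame({"age": [34.0, 29.0, 27.0], "customer_id": [101, 102, 103]})

df["age"] = df["age"].astype(int)                  # float -> integer
df["customer_id"] = df["customer_id"].astype(str)  # identifier stored as text
df.info()                                          # verify the new dtypes
```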
Dealing with Anomalies (2:50)
Mapping Categorical to Numeric values (3:25)
Grouping the Data set (1:47)
Using Apply Lambda Method (4:58)
Learn to use the pandas apply method with a lambda function to classify monthly visits as normal or active by applying a function to each element and creating a new label column.
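The labelling step could be sketched like this. The cutoff of 10 visits and the column names are assumptions chosen for the example, not values taken from the course.

```python
import pandas as pd

# Hypothetical monthly visit counts.
df = pd.DataFrame({"monthly_visits": [2, 15, 8, 30]})

THRESHOLD = 10  # assumed cutoff between "normal" and "active"

# apply() calls the lambda once per element and collects the results
df["visitor_type"] = df["monthly_visits"].apply(
    lambda visits: "active" if visits > THRESHOLD else "normal"
)
```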
Converting Categorical Columns to Numeric (6:15)
Detecting and Removing Outliers (6:24)
Learn to detect and remove outliers in Python data cleaning by using z-scores and absolute values on value, monthly visits, and item purchased, while preserving customer ID and gender.
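A rough sketch of z-score filtering in plain pandas is given below. The data, the column names, and the common cutoff of 3 standard deviations are assumptions. The z-scores are computed only on the numeric measurement columns, so identifier columns pass through untouched.

```python
import pandas as pd

# Hypothetical data: 19 ordinary rows and one extreme monthly_visits value.
df = pd.DataFrame({
    "customer_id": range(1, 21),
    "monthly_visits": list(range(5, 24)) + [500],
    "items_purchased": [1, 2, 3, 4] * 5,
})

numeric_cols = ["monthly_visits", "items_purchased"]

# z-score: distance from the column mean in units of standard deviation
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()

# Keep a row only if every measured column is within 3 standard deviations.
keep = (z_scores.abs() < 3).all(axis=1)
cleaned = df[keep]
```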
Feature Scaling (4:47)
Scale all features to a 0–1 range using a min-max scaler after dropping customer ID and date, then fit and transform the data for machine learning.
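Assuming scikit-learn's MinMaxScaler and a small invented frame, the scaling step might look like this:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data; column names are assumptions.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "date": ["2021-03-01", "2021-03-15", "2021-03-20"],
    "age": [20, 30, 40],
    "monthly_visits": [0, 5, 20],
})

# Identifiers and dates are not features to scale, so drop them first.
features = df.drop(columns=["customer_id", "date"])

scaler = MinMaxScaler()  # maps each column to (x - min) / (max - min)
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
```

Each column's minimum becomes 0 and its maximum becomes 1, so features measured on different scales become comparable.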
Jupyter Notebook (0:03)
Data Cleaning and Preprocessing

Introduction (3:21)
NLP - Dataset (4:43)
Clean and preprocess textual data for NLP by loading a corona tweet dataset with pandas, handling encoding, and extracting original tweets and sentiment for NLP tasks using NLTK.
Tokenization (8:00)
Learn how tokenization converts tweets into tokens for sentiment analysis by breaking text into words and numbers, filtering out special characters, and applying a tokenization function to a data column.
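The idea can be sketched with a regular expression from the standard library. This is a simplified stand-in rather than the NLTK tokenizer the course may use, and the tweets and the OriginalTweet column name are assumptions.

```python
import re

import pandas as pd


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical tweets standing in for the course dataset.
tweets = pd.DataFrame({
    "OriginalTweet": ["Stay home, stay safe! #COVID19", "Prices up 20% @store :("],
})

# Apply the tokenization function to every entry of the text column.
tweets["tokens"] = tweets["OriginalTweet"].apply(tokenize)
```

Punctuation, hashtags, and mention symbols are filtered out, while words and numbers survive as separate tokens.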
Removing Stop words (5:21)
Stemming (4:07)
Combined Methods of Data Preprocessing in NLP (4:30)
Jupyter Notebook (0:03)

Requirements

Basic knowledge of Python.

Description

More often than not, real-world data is messy and can rarely be used directly. It needs a lot of cleaning and preprocessing before it can be used in analytics, machine learning, or other applications. Data cleaning can be a dirty job, which often requires a lot of effort and technical skills such as familiarity with Pandas and other libraries.

For most data cleaning, all you need is data manipulation skills in Python. In this course you will learn just that. This course has lectures, quizzes, and Jupyter notebooks, which will teach you to deal with real-world raw data. The course contains tutorials on a range of data cleaning techniques, such as imputing missing values, feature scaling, and fixing data type issues.

In this course you will learn:

How to detect and deal with missing values in the data.
How to detect and rectify incorrect data types.
How to deal with Categorical Columns.
How to detect and replace incorrect values with correct ones.
How to use the apply method with lambda functions for advanced cleaning.
How to group the dataset by a particular column.
How to detect and remove outliers.
How to perform feature scaling.
How to clean and preprocess textual data for NLP.

Who this course is for:

Data Analysts, Data Engineers, Machine Learning Engineers and Data Scientists.

Course content

Introduction and Setup: 3 lectures • 5min

Detecting Data Quality issues: 7 lectures • 11min

Data Cleaning and Preprocessing: 14 lectures • 49min

Data Cleaning and Preprocessing for NLP: 7 lectures • 30min