Name: Improving data quality in data analytics & machine learning
Rating: 4.6 (1005 reviews)

Created by Mike X Cohen

Last updated 6/2026

English

What you'll learn

Strategies for increasing data quality
Ways to assess data quality
Interpreting data visualizations
How to spot problems in data

Course content

9 sections • 45 lectures • 5h 23m total length

Is this course right for you? 6:44

Section summary 1:09
Explore what data really means and what we can and cannot infer from it, and learn how data quality drives the quality of data-based decisions.
Is data or are data?? 2:33
Clarify that "data" is plural and "datum" is singular, and emphasize correct subject-verb agreement when communicating about data analytics and machine learning.
On the origins and quality of data 6:32
Potential problems with data
GIGO (garbage in, garbage out) 3:29
Data quality influences data-driven decisions 3:55
Data quality drives every data-driven decision; rely on high-quality data rather than anecdotal stories, and collect or clean data to improve decision making.

Section summary 1:55
Data management 6:22
Data documentation 5:09
Data audits 7:56
What to include in data documentation
Data cleaning phases 2:59
Improve quality before getting data 8:56
Anticipate data problems by planning analyses and studying similar datasets before collection. Run pilot studies to test the setup and data quality, and seek expert input to improve quality up front.
Improve quality during data collection 5:10
Improve quality after data collection 5:20
Clean and transform collected data to improve quality before analysis. Compute data quality metrics, identify missing, corrupted, or outlier values, and apply removal, interpolation, or scaling as needed.
Improve quality during data analysis 3:14
Risks of biased results 7:32
When to maximize data quality

Section summary 0:33
Qualitative vs. quantitative quality assessments 10:15
Evaluating data quality by eye and by algorithm
Qualitative assessments via visual inspection 13:08
Explore how to use data visualization to identify quality issues in data, distinguishing numerical and categorical data, and interpreting bar plots, pie charts, box plots, histograms, and scatter plots.
Code: Visualizing data distributions 16:38
Variance assessments 6:41
Correlations and correlation matrices 16:27
Data error rates 4:42
Sample sizes 8:58
Code: Measuring data quality 12:20
Explore general data quality features for any dataset with pandas and seaborn, including missing values, duplicates, and correlations, and learn practical cleaning with drop duplicates and dropping missing values.
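
As a rough illustration of the checks described above, the sketch below counts missing values and duplicate rows, computes a correlation matrix, and then drops duplicates and rows with missing values. The DataFrame and its column names are hypothetical and are not taken from the course materials.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: row 4 duplicates row 0, and two values are missing.
df = pd.DataFrame({
    "height_cm": [170.0, 165.0, np.nan, 180.0, 170.0, 175.0],
    "weight_kg": [70.0, 60.0, 65.0, np.nan, 70.0, 82.0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Pairwise correlations (pandas skips missing values pair by pair).
correlations = df.corr()

# Simple cleaning: remove duplicate rows, then rows with any missing value.
cleaned = df.drop_duplicates().dropna()

print(missing_per_column)
print("duplicate rows:", n_duplicates)
print(correlations)
print(cleaned)
```

Dropping rows is only one option. Whether to remove, interpolate, or keep incomplete rows depends on the dataset and the planned analysis.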

Section summary 10:49
Z-score scaling 9:08
Min/max scaling 5:02
Binning (rounding) 12:21
Unit normalization 10:32
Rank transform 6:29
Nonlinear transformations 10:39
Apply nonlinear data transformations, such as log, square root, rank, and Fisher, to normalize non-Gaussian data and improve linear method validity.
Code: Transforming data 21:44
Apply z-score scaling and min-max scaling to align variable ranges, explore binning and unit normalization, and perform rank transforms in pandas and NumPy for cleaner data quality.
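
The transformations listed in this section can be sketched in a few lines of NumPy. This is a minimal example under stated assumptions, not the course's own code: the arrays are hypothetical, the z-score uses the sample standard deviation (ddof=1), the bin width of 5 is arbitrary, the rank transform assumes no tied values, and "Fisher" is taken to mean the Fisher z-transform (arctanh) applied to correlation-like values between -1 and 1.

```python
import numpy as np

# Hypothetical positive, right-skewed values with no ties.
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Z-score scaling: zero mean, unit standard deviation (sample std assumed).
z = (x - x.mean()) / x.std(ddof=1)

# Min/max scaling: map the smallest value to 0 and the largest to 1.
minmax = (x - x.min()) / (x.max() - x.min())

# Binning by rounding to the nearest multiple of an assumed bin width.
bin_width = 5.0
binned = np.round(x / bin_width) * bin_width

# Unit normalization: rescale so the vector has Euclidean length 1.
unit = x / np.linalg.norm(x)

# Rank transform: 1 for the smallest value, n for the largest (no ties assumed).
ranks = x.argsort().argsort() + 1

# Nonlinear transforms that compress a long right tail.
logged = np.log(x)
rooted = np.sqrt(x)

# Fisher z-transform for hypothetical correlation coefficients in (-1, 1).
r = np.array([-0.9, -0.5, 0.0, 0.5, 0.9])
fisher = np.arctanh(r)

print(z, minmax, binned, unit, ranks, logged, rooted, fisher, sep="\n")
```

Each transform changes the scale or shape of the values but, apart from binning, keeps their order, so the choice among them depends on what the later analysis assumes about the data.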

Section summary 1:18
What are outliers? 13:51
Define outliers and their ambiguity in univariate and multivariate data. Explain how they arise from noise, error, or natural variation. Compare removal with robust analysis approaches that handle leverage.
The z-score method 9:22
Identify and remove outliers with the z-score method. Convert data to z-scores, apply a three-standard-deviation threshold, and use iterative or modified z-score approaches for non-Gaussian data.
The modified z-score method 3:40
Dealing with missing data 6:26
Code: Dealing with bad or missing data 13:59
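
A minimal sketch of the two outlier methods named in this section follows. The data are hypothetical. The z-score version uses the three-standard-deviation threshold mentioned above; the modified z-score version uses the median and the median absolute deviation with the commonly cited constants 0.6745 and 3.5, which are assumptions here and may differ from what the course uses. Flagged values are marked as missing rather than deleted.

```python
import numpy as np

# Hypothetical measurements clustered around 10 with one extreme value.
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2,
                 9.8, 10.1, 9.9, 10.4, 9.6, 50.0])

# Z-score method: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std(ddof=1)
z_outliers = np.abs(z) > 3.0

# Modified z-score: use the median and the median absolute deviation (MAD),
# which are less affected by the outliers themselves.
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad   # 0.6745 is an assumed constant
mod_outliers = np.abs(modified_z) > 3.5       # 3.5 is an assumed threshold

# Mark flagged values as missing so later steps can skip or fill them.
cleaned = np.where(mod_outliers, np.nan, data)

print("z-score outliers:", data[z_outliers])
print("modified z-score outliers:", data[mod_outliers])
print("mean ignoring missing values:", np.nanmean(cleaned))
```

In small samples a single extreme value inflates the mean and standard deviation, so the plain z-score can miss outliers that the median-based version still flags.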

Requirements

Interest in working with data
Interest in knowing more about data quality
Some Python skills are useful for the optional coding videos

Description

All of our decisions are based on data. Our sense organs gather data, our memories are data, and our gut instincts are data. If you want to make good decisions, you need to have high-quality data.

This course is about data quality: What it means, why it's important, and how you can increase the quality of your data.

In this course, you will learn:

High-level strategies for ensuring high data quality, including terminology, data documentation and management, and the different research phases in which you can check and increase data quality.
Qualitative and quantitative methods for evaluating data quality, including visual inspection, error rates, and outliers. Python code is provided to show how to implement these visualizations and scoring methods using pandas, numpy, seaborn, and matplotlib.
Specific data methods and algorithms for cleaning data and rejecting bad or unusual data. As above, Python code is provided to show how to implement these procedures using pandas, numpy, seaborn, and matplotlib.

This course is for

Data practitioners who want to understand both the high-level strategies and the low-level procedures for evaluating and improving data quality.
Managers, clients, and collaborators who want to understand the importance of data quality, even if they are not working directly with data.

Who this course is for:

Data science practitioners
Data science students
Managers or colleagues who work with data practitioners

Course content

Introduction: 1 lecture • 7min

Download course materials (Python code): 1 lecture • 3min

Why data quality matters: 5 lectures • 18min

Ensuring high data quality: 10 lectures • 55min

Assessing data quality: 9 lectures • 1hr 30min

Data transformations: 8 lectures • 1hr 27min

Outliers and missing data: 6 lectures • 49min

Be a high-quality data scientist: 4 lectures • 15min

Bonus: 1 lecture • 1min
