Machine Learning with Imbalanced Data

Rating: 4.7 (859 reviews)

Learn to over-sample and under-sample your data, apply SMOTE, ensemble methods, and cost-sensitive learning.

Created by Soledad Galli, Train in Data Team

Last updated 9/2024

English

English, Spanish [Auto]

What you'll learn

Apply random under-sampling to remove observations from majority classes
Perform under-sampling by removing observations that are hard to classify
Carry out under-sampling by retaining observations at the boundary of class separation
Apply random over-sampling to augment the minority class
Create synthetic data to increase the number of minority-class examples
Implement SMOTE and its variants to synthetically generate data
Use ensemble methods with sampling techniques to improve model performance
Change the misclassification cost optimized by the models to accommodate minority classes
Determine model performance with the most suitable metrics for imbalanced datasets

Course content

10 sections • 106 lectures • 9h 15m total length

Course Curriculum Overview3:11
Learn the machine learning with imbalanced data course curriculum, covering imbalanced datasets, metrics, sampling methods (under- and over-sampling, SMOTE, ADASYN), ensembles, and cost-sensitive techniques, plus notebooks and datasets.
Course Material1:42
Use Jupyter notebooks to implement techniques for imbalanced datasets, complemented by videos, presentations, and online datasets. Read three articles to download code from GitHub, presentations, and datasets before you begin.
Code | Jupyter notebooks0:33
Presentations covered in the course0:19
Python package Imbalanced-learn0:08
Download Datasets0:08
Resources to learn machine learning skills0:40

Imbalanced classes - Introduction5:24
Explore imbalanced datasets, their minority and majority classes, class distributions, and degrees of imbalance, and learn why predicting rare events like fraud, medical diagnoses, equipment malfunction, or oil spills is hard.
Nature of the imbalanced class4:56
Explore how imbalanced classes affect learning by examining class distribution, imbalance ratio, and dataset size. Understand that class separability and within-class sub-clusters drive difficulty more than imbalance alone.
Approaches to work with imbalanced datasets - Overview3:59
Explore data-level, cost-sensitive, and ensemble approaches to imbalanced datasets, including over-sampling, under-sampling, and SMOTE. Use higher costs for minority misclassifications and ensemble methods to boost minority-class predictions.
Additional Reading Resources (Optional)0:10

Introduction to Performance Metrics3:22
Explore performance metrics for imbalanced data, including ROC-AUC, precision-recall curves, and confusion matrices, and learn how threshold choice and multiclass extensions impact metrics.
Accuracy4:21
Explore accuracy, the most common metric for model performance, defined as the number of correct predictions over the total number of predictions, and see why it fails with imbalanced data in binary classification, especially for the minority class.
Accuracy - Demo5:39
Compare the accuracy of random forest and logistic regression against a baseline majority-class predictor using scikit-learn, with an emphasis on minority-class detection, and conclude that logistic regression is best at detecting positives.
Precision, Recall and F-measure13:32
Explore precision, recall, and the F1 score for imbalanced data, focusing on minority class performance, threshold effects, and how to balance metrics for reliable model evaluation.
Install Yellowbrick0:07
Precision, Recall and F-measure - Demo10:04
Compare precision, recall, and F1 scores for imbalanced data using random forest and logistic regression, with scikit-learn and yellowbrick visuals, and optimize thresholds to improve minority-class performance.
Confusion tables, FPR and FNR6:03
Explore the confusion matrix, the false positive rate, and the false negative rate to understand how true positives, true negatives, false positives, and false negatives change with the model threshold.
Confusion tables, FPR and FNR - Demo7:32
Analyze confusion matrices, false positive and false negative rates for a random forest and logistic regression on imbalanced data, and explore thresholds to balance recall and precision.
Balanced Accuracy3:49
Explore how balanced accuracy addresses imbalanced datasets by averaging recall across classes. Examine the confusion matrix, including true positives and recall, to compare per-class performance beyond overall accuracy.
Balanced accuracy - Demo2:43
Estimate the balanced accuracy using scikit-learn on the imbalanced KDD 2004 dataset. Compare it to standard accuracy for the baseline majority-class model, random forest, and logistic regression.
Geometric Mean, Dominance, Index of Imbalanced Accuracy4:29
Explore alternative metrics for imbalanced data, including geometric mean, dominance, and the index of imbalanced accuracy, built from recall and true negative rate, with comparisons to precision and ROC-AUC.
Geometric Mean, Dominance, Index of Imbalanced Accuracy - Demo9:28
Apply geometric mean, dominance, and the index of imbalanced accuracy to compare random forest and logistic regression on imbalanced data, using recall, true negative rate, and balanced accuracy.
ROC-AUC7:26
Explore the ROC-curve and ROC-AUC to evaluate classifier performance across thresholds, compare models, and understand how threshold choice affects true positive rate, false positive rate, and AUC.
ROC-AUC - Demo4:46
Compare random forest and logistic regression using ROC-AUC score to evaluate discrimination, and plot ROC curves with scikit-learn and Yellowbrick to interpret class-specific performance.
Precision-Recall Curve11:19
Learn how the precision-recall curve analyzes imbalanced data, showing the precision-recall trade-off at different thresholds, and compare logistic regression and random forest.
Precision-Recall Curve - Demo3:53
Learn to plot precision-recall curves for imbalanced data using scikit-learn and Yellowbrick, compare random forest and logistic regression, and interpret average precision and area under the curve (AUC).
Additional reading resources (Optional)0:16
Probability4:32
Plot precision-recall curves with scikit-learn and Yellowbrick to compare random forest and logistic regression on the test set, showing logistic regression’s larger area under the curve.
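
The metrics covered in this section can all be derived from the four confusion-matrix counts. The following is a minimal sketch in plain Python, not the course's notebook code: the function name `imbalance_metrics` and the example counts are assumptions made for illustration. It shows why a majority-class baseline scores high on accuracy yet no better than chance on balanced accuracy.

```python
from math import sqrt


def imbalance_metrics(tp, fp, tn, fn):
    """Compute common metrics from binary confusion-matrix counts.

    Hypothetical helper for illustration; the positive class is the minority.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # true positive rate
    tnr = tn / (tn + fp) if (tn + fp) else 0.0          # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,    # false positive rate
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,    # false negative rate
        "balanced_accuracy": (recall + tnr) / 2,        # mean recall of both classes
        "g_mean": sqrt(recall * tnr),                   # geometric mean
    }


# Made-up test set: 990 negatives and 10 positives.
# A baseline that always predicts the majority class:
baseline = imbalance_metrics(tp=0, fp=0, tn=990, fn=10)
# A model that finds 8 of the 10 positives at the cost of 40 false alarms:
model = imbalance_metrics(tp=8, fp=40, tn=950, fn=2)

for name in ("accuracy", "recall", "balanced_accuracy", "g_mean"):
    print(f"{name}: baseline={baseline[name]:.3f} model={model[name]:.3f}")
```

With these counts the baseline has the higher accuracy, while recall, balanced accuracy, and the geometric mean all favour the model that actually detects the minority class.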

Under-Sampling Methods - Introduction5:21
Explore under-sampling methods to balance imbalanced data, comparing fixed and cleaning approaches, balancing ratios, and algorithms like random under-sampling, NearMiss, and instance hardness, and implement in Python with imbalanced-learn.
Random Under-Sampling - Intro4:23
Describe random under-sampling as a simple, data-agnostic method that balances classes by sampling from the majority until a ratio is reached, using Imbalanced-learn's RandomUnderSampler with a sampling strategy.
Random Under-Sampling - Demo10:11
Explore how random under-sampling balances imbalanced data with imbalanced-learn, preserves original distributions, and boosts model performance using a random forest on toy and real datasets.
Condensed Nearest Neighbours - Intro8:03
Explore condensed nearest neighbours for boundary-focused under-sampling, using 1-nearest-neighbour training to balance minority and majority classes, with one-vs-one multi-class handling.
Condensed Nearest Neighbours - Demo7:25
Apply condensed nearest neighbours under-sampling to imbalanced data, selecting majority-class samples near the minority class to train a KNN, and compare performance with a random forest using ROC-AUC.
Tomek Links - Intro4:43
Explore Tomek links, pairs of nearest-neighbor observations from opposite classes. Remove the majority observation or the entire Tomek link pair to reduce boundary noise and improve learning.
Tomek Links - Demo3:05
Apply Tomek Links from imbalanced-learn to under-sample imbalanced data, removing majority observations or full links, and compare model performance on well-separated and not well-separated datasets.
One Sided Selection - Intro4:38
One Sided Selection - Demo3:00
Apply One Sided Selection from imbalanced-learn to undersample the majority class and compare random forest performance on full versus undersampled data. Observe Tomek Links' effect on data and separation.
Edited Nearest Neighbours - Intro5:01
Discover edited nearest neighbors, a cleaning method that undersamples the majority class by removing samples near the boundary using a 3-nearest neighbors check.
Edited Nearest Neighbours - Demo4:02
Implement edited nearest neighbours with imbalanced-learn to undersample data, using 3 neighbors and mode or all criteria, then compare performance and retention of clearly majority class observations.
Repeated Edited Nearest Neighbours - Intro4:39
Explore repeated edited nearest neighbours, a cleaning method using 3-nearest-neighbours to iteratively remove samples whose neighbors disagree, until no more removals or max cycles, via imbalanced-learn fit_resample.
Repeated Edited Nearest Neighbours - Demo3:00
Apply repeated edited nearest neighbours to undersample the majority class using imbalanced-learn, compare with edited nearest neighbours, and observe impact on minority class separation and model performance.
All KNN - Intro6:16
Explore the all k nearest neighbours algorithm, an iterative under-sampling method that removes majority-class samples near class boundaries to balance imbalanced data.
All KNN - Demo5:50
Explore all KNN with imbalanced-learn, comparing it to edited and repeated edited nearest neighbours on toy datasets and a real dataset, and assess undersampling effects on minority classes.
Neighbourhood Cleaning Rule - Intro6:14
Learn how the neighbourhood cleaning rule under-samples the majority class by removing samples near the boundary with other classes, using edited nearest neighbours and a minority-focused cleaning step.
Neighbourhood Cleaning Rule - Demo1:55
Apply the neighbourhood cleaning rule with fit_resample to undersample the majority class, using a 3-neighbour rule and a two-to-one threshold.
NearMiss - Intro3:47
Explore NearMiss, a fixed under-sampling method from imbalanced-learn that balances majority and minority classes by selecting majority observations based on distance, using three versions.
NearMiss - Demo3:53
Instance Hardness - Intro9:20
Explore instance hardness threshold and filtering to remove hard observations, reduce class overlap, and use probability thresholds or percentiles with one-vs-rest for multiclass classification.
Instance Hardness Threshold - Demo16:21
Apply Instance Hardness Threshold to a binary classification task. Undersample with the threshold, compare ROC-AUC for random forest and logistic regression, using imbalanced-learn.
Instance Hardness Threshold Multiclass Demo7:44
Demonstrates instance hardness threshold for multiclass imbalanced data using a one-vs-rest random forest, creating a three-class toy dataset, filtering hard instances by class probabilities, and balancing via auto undersampling.
Undersampling Method Comparison7:44
Assess how undersampling methods affect ROC-AUC for a random forest trained on undersampled data across multiple datasets. Different datasets show varying gains from NearMiss and Instance Hardness.
Wrapping up the section5:18
Explore the trade-offs between fixed and cleaning under-sampling techniques for imbalanced data, and learn when to test multiple methods to improve model performance.
Setting up a classifier with under-sampling and cross-validation10:54
Train the random forest on the under-sampled training set and evaluate on the test set with the original class distribution using cross-validation and the average precision-recall score.
Summary Table0:28
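
Two of the ideas from this section, random under-sampling to a target ratio and Tomek links, can be sketched from scratch in a few lines. This is an illustration under stated assumptions, not the imbalanced-learn implementation: the function names, the `ratio` parameter, and the toy points are hypothetical, and the brute-force neighbour search only suits small datasets.

```python
import random
from math import dist


def random_under_sample(X, y, majority_label, ratio=1.0, seed=0):
    """Keep every non-majority row plus a random subset of the majority class.

    ratio = (non-majority rows) / (majority rows) wanted after resampling.
    """
    rng = random.Random(seed)
    maj = [i for i, label in enumerate(y) if label == majority_label]
    rest = [i for i, label in enumerate(y) if label != majority_label]
    n_keep = min(len(maj), round(len(rest) / ratio))
    keep = sorted(rest + rng.sample(maj, n_keep))
    return [X[i] for i in keep], [y[i] for i in keep]


def tomek_links(X, y):
    """Return index pairs that are mutual nearest neighbours with different labels."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


# Toy data: six majority points (label 0) and two minority points (label 1).
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.9, 0.9),
     (2.0, 2.0), (3.0, 3.0), (1.0, 1.0), (5.0, 5.0)]
y = [0, 0, 0, 0, 0, 0, 1, 1]

# Fixed under-sampling: draw majority rows at random until the classes are 1:1.
X_rus, y_rus = random_under_sample(X, y, majority_label=0, ratio=1.0)

# Cleaning under-sampling: drop only the majority member of each Tomek link.
links = tomek_links(X, y)
drop = {i if y[i] == 0 else j for i, j in links}
X_clean = [x for i, x in enumerate(X) if i not in drop]
y_clean = [label for i, label in enumerate(y) if i not in drop]
print(links, len(X_rus), len(X_clean))
```

The contrast mirrors the fixed versus cleaning distinction described above: the random sampler reaches a chosen class ratio regardless of where the points lie, while the Tomek-link step removes only the majority observation that sits right next to a minority observation and leaves the ratio largely unchanged.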

Over-Sampling Methods - Introduction3:41
Explore over-sampling methods to address imbalanced datasets by balancing the minority and majority classes. Learn random over-sampling, SMOTE and ADASYN, including SMOTEENC, and how sample generation targets hard-to-classify boundary cases.
Random Over-Sampling5:00
Explore random over-sampling to balance imbalanced data by duplicating minority-class observations to a 1:1 ratio, using imbalanced-learn's RandomOverSampler and fit_resample, and adjust balancing strategy for multiclass targets.
Random Over-Sampling - Demo4:55
Demonstrates how to apply random over-sampling with Imbalanced-learn's RandomOverSampler on a toy dataset created with make_blobs, and visualizes the balance using pandas, matplotlib, and seaborn.
ROS with smoothing - Intro6:39
Explore random over-sampling with smoothing to generate new minority-class examples by adding noise guided by class distribution, controlled by a shrinkage factor, and implemented with Imbalanced-learn's RandomOverSampler.
ROS with smoothing - Demo4:36
Explore random over-sampling with smoothing using imbalanced-learn's RandomOverSampler to create synthetic minority-class examples and observe dispersion changes as the shrinkage factor varies.
SMOTE9:26
Learn how SMOTE, the synthetic minority over-sampling technique, generates new minority class samples via interpolation between the five nearest neighbors to balance data without duplicating observations.
SMOTE - Demo2:35
Apply SMOTE to balance an imbalanced dataset by generating synthetic minority samples with imblearn, demonstrate with make_blobs, show fit_resample effects, and visualize the before/after distributions.
SMOTE-NC9:02
Learn how SMOTE-NC extends SMOTE to handle categorical variables, using KNN distances adjusted by the median standard deviation, and generate synthetic samples with imbalanced-learn's SMOTENC.
SMOTE-NC - Demo2:56
Demonstrates SMOTE-NC on a toy dataset with numerical and categorical features using imbalanced-learn, oversampling the minority class with fit_resample, preserving class ratios, and generating new synthetic samples.
SMOTE-N19:25
Learn how SMOTE-N extends SMOTE to work with categorical data using the value difference metric to identify the k nearest minority neighbors, and generate new samples by majority vote.
SMOTE-N Demo7:20
Explore how to compute distances between categorical variables with the value difference metric, encode with ordinal encoder, and apply SMOTE-N to balance a toy dataset of multiple categorical features.
ADASYN7:11
Explore ADASYN, a minority-class oversampling method that generates synthetic data from harder-to-classify samples with weighting, and contrast it with SMOTE using the full dataset, KNN, and class boundaries.
ADASYN - Demo3:17
Demonstrates implementing ADASYN with imbalanced-learn on artificial data in Python, oversampling the minority class with fit_resample, using the sampling strategy auto and neighbours to shape boundary-adjacent synthetic data.
Borderline SMOTE7:47
Borderline SMOTE refines SMOTE by generating synthetic minority samples near the decision boundary, using KNN to select the danger group and offering two variants for interpolation and extrapolation.
Borderline SMOTE - Demo3:13
Explore Borderline SMOTE with Imbalanced-learn by oversampling a toy dataset using Borderline SMOTE variations 1 and 2, observing synthetic samples near the decision boundary and the new class balance.
SVM SMOTE16:40
Learn how SVM SMOTE uses minority support vectors to interpolate or extrapolate new samples, and compare it with ADASYN and Borderline SMOTE for imbalanced data.
Resources on SVMs0:07
SVM SMOTE - Demo4:32
Apply SVM SMOTE to oversample the minority class, visualize the support vectors and the decision boundary of a linear SVM, and compare the original with the oversampled data.
K-Means SMOTE13:01
Explore k-means SMOTE, an intra-class cluster oversampling method that boosts minority regions by clustering data, filtering clusters by imbalance, and generating synthetic samples within clusters to reduce noise.
K-Means SMOTE - Demo3:29
Demonstrates K-means SMOTE with imbalanced-learn on a synthetic 1,200-sample dataset using make_blobs and KMeans, oversampling minority clusters to rebalance.
Over-Sampling Method Comparison5:50
Compare over-sampling methods such as random over-sampling, SMOTE, ADASYN, Borderline SMOTE, and SVM SMOTE on diverse datasets, evaluating ROC-AUC with a random forest to identify the best-performing approaches.
Wrapping up the section9:30
Compare over-sampling versus under-sampling for imbalanced data, discuss SMOTE and its variants (ADASYN, Borderline SMOTE, SVM SMOTE, K-means SMOTE), and emphasize distance metrics and practical trade-offs.
How to Correctly Set Up a Classifier with Over-sampling5:24
Learn how to correctly set up a classifier using over-sampling or under-sampling to address class imbalance, and evaluate performance on test data with original imbalance, integrating resampling into cross-validation.
Setting Up a Classifier - Demo4:13
Implement oversampling schemes with cross-validation using Imbalanced-learn make_pipeline and scikit-learn cross_validate; compare SMOTE and other oversamplers with a random forest and MinMaxScaler, highlighting varied dataset results.
Summary Table0:06
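
The interpolation step at the heart of SMOTE can be sketched without any library. The code below is a simplified illustration, assuming numerical features and Euclidean distance; the function name `smote_sketch` and the toy points are hypothetical, and this is not the imbalanced-learn implementation. Each synthetic point is placed at a random position on the segment between a minority observation and one of its k nearest minority neighbours (k defaults to five, as in the lecture description).

```python
import random
from math import dist


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Needs at least two minority observations, given as tuples of floats.
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numerical features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Because every new point lies between two existing minority observations, the synthetic data stays inside the region the minority class already occupies. As the lectures on setting up a classifier stress, resampling of this kind belongs on the training portion only, so that evaluation happens on data with the original class distribution.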

Combining Over and Under-sampling - Intro6:02
Combine over-sampling and under-sampling techniques to improve model performance on imbalanced data. Learn how SMOTE, edited nearest neighbours, and Tomek links mitigate noise and preserve information.
Combining Over and Under-sampling - Demo5:26
Combine over-sampling with under-sampling using SMOTEENN and SMOTETomek in Imbalanced-learn to rebalance datasets and reduce noisy minority observations.
Comparison of Over and Under-sampling Methods5:54
Compare combined over-sampling and under-sampling methods with stand-alone SMOTE variants to assess ROC-AUC performance across imbalanced datasets, with Borderline SMOTE and SVM SMOTE often excelling.
Combine over and under-sampling manually0:05
Wrapping up2:08
Conclude that there is no universal rule for over-sampling or under-sampling; test original data, random over-sampling, or combinations, while noting nearest-neighbour under-sampling scalability and distance metrics for categorical features.

Ensemble methods with Imbalanced Data4:49
Explore ensemble methods for imbalanced data by combining bagging and boosting with data level techniques such as under- and over-sampling, including hybrid ensembles and cost-sensitive strategies, with Python implementations.
Foundations of Ensemble Learning3:12
Discover how ensemble learning improves model performance by combining diverse classifiers that complement each other, using bagging and boosting techniques like random forests and AdaBoost to boost generalization.
Bagging3:04
Bagging, or bootstrap aggregating, creates datasets by sampling with replacement and trains classifiers on each. It combines predictions by averaging or voting to improve generalization and foster diverse, de-correlated models.
Bagging plus Over- or Under-Sampling5:38
Learn how bagging pairs with under- or over-sampling and SMOTE to balance imbalanced data, train diverse classifiers on balanced subsamples, and ensemble their predictions.
Boosting10:03
Learn how boosting builds sequential classifiers, weights misclassified observations, and combines predictions with classifier weights (AdaBoost and gradient boosting) to improve performance on imbalanced data.
Boosting plus Re-Sampling7:05
Explore boosting with data pre-processing to handle imbalanced data, including RUSBoost, SMOTEBoost, and the ADASYN-inspired RAMOBoost, focusing on under-sampling, synthetic samples, and weighted ensembles.
Hybrid Methods4:48
Explore hybrid methods that combine bagging, boosting, and resampling to boost model performance on imbalanced data, including Balanced Random Forests, EasyEnsemble, and BalanceCascade using AdaBoost.
Ensemble Methods - Demo9:59
Compare ensemble methods with and without resampling on imbalanced data, using random forests, boosting, bagging, and easy ensemble in scikit-learn and imbalanced-learn, and evaluate ROC-AUC across datasets.
Wrapping up5:31
This lecture wraps up the section: it compares ensemble methods for imbalanced data, shows boosting's generally strong performance, and highlights fast, scalable resampling methods such as random under-sampling. It also stresses the choice of metric.
Additional Reading Resources0:17
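
Bagging combined with under-sampling can be illustrated with a deliberately tiny example. This is a sketch under simplifying assumptions, not the scikit-learn or imbalanced-learn estimators: a single numerical feature, a threshold rule ("stump") as the base learner, and hypothetical function names and toy data. Each stump is trained on a bootstrap sample that draws the same number of observations from each class, and the ensemble predicts by majority vote.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold t with the fewest errors for the rule 'predict 1 if x > t'."""
    candidates = sorted(set(xs))
    best_t, best_err = candidates[0], len(xs) + 1
    for t in candidates:
        err = sum((x > t) != bool(label) for x, label in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def balanced_bagging(xs, ys, n_estimators=11, seed=0):
    """Train one stump per balanced bootstrap sample and return the thresholds."""
    rng = random.Random(seed)
    pos = [x for x, label in zip(xs, ys) if label == 1]
    neg = [x for x, label in zip(xs, ys) if label == 0]
    n = min(len(pos), len(neg))  # size of the smaller (minority) class
    thresholds = []
    for _ in range(n_estimators):
        # Sample with replacement, the same number of rows from each class.
        bag_pos = rng.choices(pos, k=n)
        bag_neg = rng.choices(neg, k=n)
        thresholds.append(fit_stump(bag_pos + bag_neg, [1] * n + [0] * n))
    return thresholds


def predict(thresholds, x):
    """Majority vote over the stumps."""
    votes = sum(x > t for t in thresholds)
    return int(2 * votes > len(thresholds))


# Toy data: 50 majority observations (label 0) and 5 minority observations (label 1).
xs = [i / 10 for i in range(50)] + [6.0, 6.5, 7.0, 7.5, 8.0]
ys = [0] * 50 + [1] * 5

thresholds = balanced_bagging(xs, ys)
print(predict(thresholds, 7.0), predict(thresholds, -1.0))
```

Every base learner sees a balanced sample, yet across the ensemble most majority observations still get used, which is the appeal of pairing bagging with random under-sampling.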

Cost-sensitive Learning - Intro7:27
Explore cost-sensitive learning and how misclassification costs shape the training objective, using cost matrices to balance imbalanced data and emphasize minority-class accuracy.
Types of Cost10:55
Explore misclassification cost, from constant error cost to conditional, time-based, and data and professional costs. Understand how these costs shape classifier decisions in imbalanced data contexts.
Obtaining the Cost4:28
Learn to obtain the cost matrix for cost-sensitive learning, using expert input or heuristic estimates, and tune costs as hyperparameters to balance minority and majority classes.
Cost Sensitive Approaches1:52
Explore cost-sensitive learning by contrasting direct training methods that embed misclassification costs with meta-learning approaches that adjust data or outputs, including pre-processing and post-processing techniques like meta-cost.
Misclassification Cost in Logistic Regression3:35
Explore how logistic regression minimises its cost function and incorporates misclassification costs through weighted logistic regression to address imbalanced data.
Misclassification Cost in Decision Trees4:02
Explore how to introduce misclassification cost into decision trees by weighting impurity functions (gini, entropy) with class weights. See how these costs extend to random forests for imbalanced data.
Cost Sensitive Learning with Scikit-learn7:13
Discover cost-sensitive learning in scikit-learn by applying class_weight or sample_weight to misclassification costs. See how balancing techniques and custom class penalties improve ROC-AUC.
Find Optimal Cost with hyperparameter tuning3:33
Learn to treat the misclassification cost as a hyperparameter and optimize it with grid search on a random forest using the KDD 2004 dataset, balancing ratio, and ROC-AUC evaluation.
Additional Reading Resources0:13
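
The effect of a misclassification cost on the training objective and on the decision threshold can be shown with a short worked example. The sketch below uses plain Python with hypothetical helper names and made-up numbers; it is not the course's code. The weight heuristic follows the formula scikit-learn documents for class_weight="balanced", n_samples / (n_classes * class count), and the threshold formula assumes that correct decisions cost nothing.

```python
from math import log


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency."""
    classes = sorted(set(y))
    return {c: len(y) / (len(classes) * y.count(c)) for c in classes}


def weighted_log_loss(y_true, p_pos, weights):
    """Log loss in which each observation is weighted by the cost of its class."""
    total = 0.0
    for label, p in zip(y_true, p_pos):
        p = min(max(p, 1e-15), 1 - 1e-15)  # avoid log(0)
        total += weights[label] * -(label * log(p) + (1 - label) * log(1 - p))
    return total / sum(weights[label] for label in y_true)


def cost_threshold(cost_fp, cost_fn):
    """Probability threshold that minimises expected cost.

    Predict positive when p * cost_fn > (1 - p) * cost_fp,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


# Made-up target: 90 negatives and 10 positives.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)

# A lazy model that outputs the base rate (0.1) for every observation.
p = [0.1] * len(y)
unweighted = weighted_log_loss(y, p, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(y, p, weights)

print(weights)
print(round(unweighted, 3), round(weighted, 3))
print(cost_threshold(cost_fp=1, cost_fn=9))
```

Under the balanced weights, the lazy model is penalised much more heavily because its errors on the minority class now count as much as all the majority observations together. Treating the weights as a hyperparameter, as the last lecture of the section suggests, amounts to searching over the values in this dictionary.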

Requirements

Knowledge of basic machine learning algorithms, i.e., regression, decision trees and nearest neighbours
Python programming, including familiarity with NumPy, Pandas and Scikit-learn
A Python and Jupyter notebook installation

Description

Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques which you can use with imbalanced datasets to improve the performance of your machine learning models.

If you are working with imbalanced datasets right now and want to improve the performance of your models, or you simply want to learn more about how to tackle data imbalance, this course will show you how.

We'll take you step-by-step through engaging video tutorials and teach you everything you need to know about working with imbalanced datasets. Throughout this comprehensive course, we cover almost every available methodology to work with imbalanced datasets, discussing their logic, their implementation in Python, their advantages and shortcomings, and the considerations to keep in mind when using each technique. Specifically, you will learn:

Under-sampling methods at random or focused on highlighting certain sample populations
Over-sampling methods at random and those which create new examples based on existing observations
Ensemble methods that leverage the power of multiple weak learners in conjunction with sampling techniques to boost model performance
Cost sensitive methods which penalize wrong decisions more severely for minority classes
The appropriate metrics to evaluate model performance on imbalanced datasets

By the end of the course, you will be able to decide which technique is suitable for your dataset, and/or apply the different methods and compare the improvement in performance they return on multiple datasets.

This comprehensive machine learning course includes over 100 lectures spanning more than 9 hours of video, and ALL topics include hands-on Python code examples which you can use for reference and for practice, and re-use in your own projects.

In addition, the code is updated regularly to keep up with new trends and new Python library releases.

So what are you waiting for? Enroll today, learn how to work with imbalanced datasets and build better machine learning models.

Who this course is for:

Data scientists and machine learning engineers working with imbalanced datasets
Data scientists who want to improve the performance of models trained on imbalanced datasets
Students who want to learn intermediate content on machine learning
Students working with imbalanced multi-class targets
