
Use Jupyter notebooks to implement techniques for imbalanced datasets, complemented by videos, presentations, and online datasets. Before you begin, read the three articles that explain how to download the code from GitHub, the presentations, and the datasets.
Explore imbalanced datasets, in which the minority and majority classes follow different distributions and the degree of imbalance varies, and learn why predicting rare events like fraud, medical diagnoses, equipment malfunction, or oil spills is hard.
Compare the accuracy of a random forest and a logistic regression against a baseline that always predicts the majority class, using scikit-learn. The emphasis is on minority-class detection, and the comparison concludes that logistic regression performs best on the positive class.
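As a minimal sketch of why the majority-class baseline matters, the snippet below scores a predictor that always returns the most frequent class. The 99:1 target is a hypothetical example, not one of the course datasets, and the code uses only the standard library rather than scikit-learn.

```python
from collections import Counter

# Hypothetical target with a 99:1 imbalance: 990 negatives (0), 10 positives (1).
y_true = [0] * 990 + [1] * 10

# Baseline: always predict the most frequent class seen in the target.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

# Overall accuracy: fraction of predictions that match the target.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall of the minority class: detected positives / all positives.
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
minority_recall = true_positives / y_true.count(1)

print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

The baseline reaches 99% accuracy while detecting none of the positives, which is why any model on such data should be compared against it.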
Compare precision, recall, and F1 scores for imbalanced data using random forest and logistic regression, with scikit-learn and yellowbrick visuals, and optimize thresholds to improve minority-class performance.
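The threshold idea can be sketched without any library: compute precision, recall, and F1 at several probability cut-offs and keep the cut-off with the best F1. The labels, scores, candidate thresholds, and function names below are hypothetical; in practice the threshold would be chosen on validation data, not on the test set.

```python
def precision_recall_f1(y_true, y_score, threshold):
    """Binarize scores at `threshold` and return (precision, recall, f1)."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, y_score, candidates):
    """Return the candidate threshold with the highest F1 score."""
    return max(candidates, key=lambda t: precision_recall_f1(y_true, y_score, t)[2])


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.35, 0.4, 0.45, 0.8]

default_metrics = precision_recall_f1(y_true, y_score, 0.5)
best_threshold = best_f1_threshold(y_true, y_score, [0.3, 0.35, 0.4, 0.45, 0.5])
tuned_metrics = precision_recall_f1(y_true, y_score, best_threshold)
print(default_metrics, best_threshold, tuned_metrics)
```

On this toy data the default 0.5 cut-off finds only one of the three positives, while a lower cut-off recovers all of them at the cost of one false positive.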
Explore how balanced accuracy addresses imbalanced datasets by averaging recall across classes. Examine what the confusion matrix reveals, including true positives and recall, to compare per-class performance beyond overall accuracy.
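Balanced accuracy as described here, the mean of the per-class recalls, can be written in a few lines. This is a standard-library sketch with hypothetical predictions; the function name is illustrative.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the recall obtained on each class."""
    recalls = []
    for label in sorted(set(y_true)):
        n_label = sum(1 for t in y_true if t == label)
        n_correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(n_correct / n_label)
    return sum(recalls) / len(recalls)


# Hypothetical result: 88 of 90 negatives and 4 of 10 positives are correct.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [1] * 4 + [0] * 6

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, balanced accuracy={bal_acc:.3f}")
```

Overall accuracy is dominated by the 90 negatives, whereas balanced accuracy gives the poorly detected positive class equal weight.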
Apply geometric mean, dominance, and the index of imbalanced accuracy to compare random forest and logistic regression on imbalanced data, using recall, true negative rate, and balanced accuracy.
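The three metrics can be sketched from recall (true positive rate) and true negative rate alone. The definitions below follow the commonly cited form, with dominance as TPR minus TNR and the index of imbalanced accuracy as the squared geometric mean scaled by (1 + alpha * dominance); the default `alpha=0.1` and the function names are assumptions for illustration.

```python
import math


def geometric_mean(tpr, tnr):
    """Geometric mean of recall (TPR) and true negative rate (TNR)."""
    return math.sqrt(tpr * tnr)


def dominance(tpr, tnr):
    """Positive when the positive class is better detected, negative otherwise."""
    return tpr - tnr


def index_balanced_accuracy(tpr, tnr, alpha=0.1):
    """Squared G-mean weighted by (1 + alpha * dominance); alpha is an assumed default."""
    return (1 + alpha * dominance(tpr, tnr)) * tpr * tnr


# Two hypothetical models with the same G-mean but opposite dominance.
favours_positives = index_balanced_accuracy(tpr=0.9, tnr=0.4)
favours_negatives = index_balanced_accuracy(tpr=0.4, tnr=0.9)
print(geometric_mean(0.9, 0.4), favours_positives, favours_negatives)
```

Both models share the same geometric mean, but the index rewards the one with higher recall on the positive class, which is usually the minority.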
Explore how random under-sampling balances imbalanced data with imbalanced-learn, preserves original distributions, and boosts model performance using a random forest on toy and real datasets.
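The logic of random under-sampling can be sketched in plain Python: every class is reduced, at random and without replacement, to the size of the smallest class. This illustrates the idea only and is not imbalanced-learn's implementation; the function name, seed, and toy data are hypothetical.

```python
import random
from collections import Counter


def random_under_sample(X, y, seed=0):
    """Randomly drop observations until every class matches the smallest one."""
    rng = random.Random(seed)
    indices_by_class = {}
    for i, label in enumerate(y):
        indices_by_class.setdefault(label, []).append(i)
    n_minority = min(len(idx) for idx in indices_by_class.values())

    keep = []
    for idx in indices_by_class.values():
        # Sampling without replacement keeps each retained row at most once.
        keep.extend(rng.sample(idx, n_minority))
    keep.sort()  # preserve the original row order
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 20 majority rows and 5 minority rows.
X = [(float(i), float(i % 3)) for i in range(25)]
y = [0] * 20 + [1] * 5

X_res, y_res = random_under_sample(X, y)
print(Counter(y_res))
```

Because the retained majority rows are a random subset, their feature distribution is expected to resemble the original one, at the price of discarding data.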
Explore Tomek links, pairs of nearest-neighbor observations from opposite classes. Remove the majority observation or the entire TomekLinks pair to reduce boundary noise and improve learning.
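A Tomek link can be found with a brute-force search: two observations form a link when each is the other's nearest neighbor and their classes differ. The sketch below assumes Euclidean distance and small data; the points and function names are hypothetical, and the removal step shows the variant that drops only the majority member.

```python
import math


def nearest_neighbor(i, X):
    """Index of the observation closest to X[i], excluding itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) that are mutual nearest neighbors with different classes."""
    links = []
    for i in range(len(X)):
        j = nearest_neighbor(i, X)
        if i < j and y[i] != y[j] and nearest_neighbor(j, X) == i:
            links.append((i, j))
    return links


# Hypothetical points: rows 3 (majority) and 4 (minority) sit next to each other.
X = [(0.0, 0.0), (0.4, 0.0), (1.0, 0.0), (3.0, 0.0), (3.2, 0.0), (5.0, 0.0)]
y = [0, 0, 0, 0, 1, 1]
majority_label = 0

links = tomek_links(X, y)

# Variant 1: remove only the majority-class member of each link.
to_drop = {i for pair in links for i in pair if y[i] == majority_label}
X_clean = [x for i, x in enumerate(X) if i not in to_drop]
y_clean = [label for i, label in enumerate(y) if i not in to_drop]
print(links, y_clean)
```

Dropping both members of each pair instead would only require collecting every index in `links`.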
Discover edited nearest neighbors, a cleaning method that undersamples the majority class by removing samples near the boundary using a 3-nearest neighbors check.
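The 3-nearest-neighbors check might be sketched as follows: a majority-class observation is removed when most of its three closest neighbors belong to another class. This uses a majority vote among the neighbors; a stricter variant would require all neighbors to agree. The data and function names are hypothetical.

```python
import math
from collections import Counter


def k_nearest(i, X, k):
    """Indices of the k observations closest to X[i], excluding itself."""
    others = sorted((j for j in range(len(X)) if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def edited_nearest_neighbours(X, y, majority_label, k=3):
    """Return the indices to keep after cleaning the majority class."""
    keep = []
    for i in range(len(X)):
        if y[i] == majority_label:
            votes = Counter(y[j] for j in k_nearest(i, X, k))
            if votes.most_common(1)[0][0] != y[i]:
                continue  # neighbors disagree with the label: drop the row
        keep.append(i)
    return keep


# Hypothetical 1-D data: row 4 is a majority point inside the minority cluster.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (10.7,), (10.0,), (10.5,), (11.0,)]
y = [0, 0, 0, 0, 0, 1, 1, 1]

kept = edited_nearest_neighbours(X, y, majority_label=0)
print(kept)
```

Only majority observations are candidates for removal, so the minority class is left intact.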
Apply the neighbourhood cleaning rule with fit_resample to undersample the majority class, using a 3-neighbour rule and a two-to-one threshold.
Explore instance hardness threshold and filtering to remove hard observations, reduce class overlap, and use probability thresholds or percentiles with one-vs-rest for multiclass classification.
Train the random forest on the under-sampled training set and evaluate on the test set with the original class distribution using cross-validation and the average precision-recall score.
Explore random over-sampling to balance imbalanced data by duplicating minority-class observations to a 1:1 ratio, using imbalanced-learn's RandomOverSampler and fit_resample, and adjust balancing strategy for multiclass targets.
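Random over-sampling to a 1:1 ratio can be illustrated by drawing minority rows with replacement until each class matches the largest one. This standard-library sketch shows the logic only and is not imbalanced-learn's RandomOverSampler; the function name, seed, and toy data are hypothetical.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate rows at random until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())

    X_res, y_res = list(X), list(y)  # all original rows are kept
    for label, n in counts.items():
        idx = [i for i, t in enumerate(y) if t == label]
        # Draw the missing rows with replacement from the same class.
        for i in rng.choices(idx, k=target - n):
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res


# Hypothetical multiclass toy data: 20, 5, and 8 rows per class.
X = [(float(i),) for i in range(33)]
y = [0] * 20 + [1] * 5 + [2] * 8

X_res, y_res = random_over_sample(X, y)
print(Counter(y_res))
```

Because the loop runs over every class, the same function balances multiclass targets; a different strategy would change how `target` is chosen per class.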
Explore random over-sampling with smoothing to generate new minority-class examples by adding noise guided by class distribution, controlled by a shrinkage factor, and implemented with imbalanced-learn's RandomOverSampler.
Explore random over-sampling with smoothing using imbalanced-learn's RandomOverSampler to create synthetic minority-class examples and observe dispersion changes as the shrinkage factor varies.
Borderline-SMOTE refines SMOTE by generating synthetic minority samples near the decision boundary, using KNN to select the danger group and offering two variants for interpolation and extrapolation.
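The danger-group selection can be sketched as follows, assuming the usual rule that a minority sample is in danger when at least half, but not all, of its k nearest neighbors belong to the majority class. The interpolation step shown is the plain SMOTE one; the data, `k=3`, seed, and function names are hypothetical.

```python
import math
import random


def k_nearest(i, X, k, candidates):
    """Indices among `candidates` of the k points closest to X[i]."""
    others = sorted((j for j in candidates if j != i),
                    key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def danger_group(X, y, minority_label, k=3):
    """Minority samples with k/2 <= (majority neighbors) < k."""
    danger = []
    for i, label in enumerate(y):
        if label != minority_label:
            continue
        neighbors = k_nearest(i, X, k, range(len(X)))
        n_majority = sum(1 for j in neighbors if y[j] != minority_label)
        if k / 2 <= n_majority < k:
            danger.append(i)
    return danger


def interpolate(x, neighbor, rng):
    """New point on the segment between x and one of its minority neighbors."""
    u = rng.random()
    return tuple(a + u * (b - a) for a, b in zip(x, neighbor))


# Hypothetical 1-D data: row 5 is isolated (noise), rows 6-7 sit on the border.
X = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (1.4,),
     (4.6,), (5.0,), (8.0,), (8.5,), (9.0,)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

danger = danger_group(X, y, minority_label=1)

rng = random.Random(0)
minority_idx = [i for i, label in enumerate(y) if label == 1]
synthetic = []
for i in danger:
    j = rng.choice(k_nearest(i, X, 3, minority_idx))
    synthetic.append(interpolate(X[i], X[j], rng))
print(danger, synthetic)
```

Samples whose neighbors are all majority are treated as noise and samples with mostly minority neighbors as safe, so neither seeds new points.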
Compare over-sampling versus under-sampling for imbalanced data, discuss SMOTE and variants (ADASYN, Borderline-SMOTE, SVM SMOTE, K-means SMOTE), and emphasize distance metrics and practical trade-offs.
Implement over-sampling schemes with cross-validation using imbalanced-learn's make_pipeline and scikit-learn's cross_validate; compare SMOTE and other over-samplers with a random forest and MinMaxScaler, highlighting how results vary across datasets.
Conclude that there is no universal rule for over-sampling or under-sampling; test original data, random over-sampling, or combinations, while noting nearest-neighbour under-sampling scalability and distance metrics for categorical features.
Explore ensemble methods for imbalanced data by combining bagging and boosting with data level techniques such as under- and over-sampling, including hybrid ensembles and cost-sensitive strategies, with Python implementations.
Bagging, or bootstrap aggregating, creates datasets by sampling with replacement and trains classifiers on each. It combines predictions by averaging or voting to improve generalization and foster diverse, de-correlated models.
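Bagging as described above can be sketched with a deliberately simple base learner. The one-dimensional threshold classifier, the toy data, the seed, and all function names below are hypothetical stand-ins chosen to keep the example dependency-free; any classifier could take the stump's place.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold t that best separates labels with the rule x >= t."""
    best_accuracy, best_threshold = -1.0, None
    for t in sorted(set(xs)):
        accuracy = sum((x >= t) == bool(label) for x, label in zip(xs, ys)) / len(ys)
        if accuracy > best_accuracy:
            best_accuracy, best_threshold = accuracy, t
    return best_threshold


def bagging_fit(xs, ys, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def bagging_predict(thresholds, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical data: the label is 1 when x is at least 12.
xs = list(range(20))
ys = [int(x >= 12) for x in xs]

ensemble = bagging_fit(xs, ys)
print(bagging_predict(ensemble, 2), bagging_predict(ensemble, 18))
```

Each stump sees a slightly different bootstrap sample and may learn a different threshold, which is the source of the diversity that voting then averages out.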
Explore boosting with data pre-processing to handle imbalanced data, including RUSBoost, SMOTEBoost, and the ADASYN-inspired RAMOBoost, focusing on under-sampling, synthetic samples, and weighted ensembles.
Explore hybrid methods that combine bagging, boosting, and resampling to boost model performance on imbalanced data, including Balanced Random Forests, EasyEnsemble, and BalanceCascade using AdaBoost.
Compare ensemble methods with and without resampling on imbalanced data, using random forests, boosting, bagging, and easy ensemble in scikit-learn and imbalanced-learn, and evaluate ROC-AUC across datasets.
Discover cost-sensitive learning in scikit-learn by applying class_weight or sample_weight to misclassification costs. See how balancing techniques and custom class penalties improve ROC-AUC.
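To make the class-weighting idea concrete, the sketch below computes weights with the formula scikit-learn documents for `class_weight="balanced"`, n_samples / (n_classes * class_count), and uses them to weight misclassifications. The `weighted_error` helper and the toy target are hypothetical illustrations, not part of scikit-learn.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weights inversely proportional to class frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}


def weighted_error(y_true, y_pred, class_weight):
    """Share of the total weight that falls on misclassified observations."""
    total = sum(class_weight[t] for t in y_true)
    wrong = sum(class_weight[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Hypothetical target with 990 negatives and 10 positives.
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # always predict the majority class

weights = balanced_class_weights(y_true)
plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
cost_sensitive_error = weighted_error(y_true, y_pred, weights)
print(weights, plain_error, cost_sensitive_error)
```

With these weights each missed positive costs about as much as 99 misclassified negatives, so ignoring the minority class is no longer a cheap strategy for the learner.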
Learn to treat the misclassification cost as a hyperparameter and optimize it with grid search on a random forest using the KDD 2004 dataset, balancing ratio, and ROC-AUC evaluation.
Welcome to Machine Learning with Imbalanced Datasets. In this course, you will learn multiple techniques which you can use with imbalanced datasets to improve the performance of your machine learning models.
If you are working with imbalanced datasets right now and want to improve the performance of your models, or you simply want to learn more about how to tackle data imbalance, this course will show you how.
We'll take you step-by-step through engaging video tutorials and teach you everything you need to know about working with imbalanced datasets. Throughout this comprehensive course, we cover almost every available methodology for working with imbalanced datasets, discussing the logic of each technique, its implementation in Python, its advantages and shortcomings, and what to consider when using it. Specifically, you will learn:
Under-sampling methods at random or focused on highlighting certain sample populations
Over-sampling methods at random and those which create new examples based on existing observations
Ensemble methods that leverage the power of multiple weak learners in conjunction with sampling techniques to boost model performance
Cost-sensitive methods which penalize wrong decisions more severely for minority classes
The appropriate metrics to evaluate model performance on imbalanced datasets
By the end of the course, you will be able to decide which technique is suitable for your dataset, and to apply the different methods to multiple datasets and compare the improvement in performance they return.
This comprehensive machine learning course includes over 50 lectures spanning more than 10 hours of video, and ALL topics include hands-on Python code examples which you can use for reference and for practice, and re-use in your own projects.
In addition, the code is updated regularly to keep up with new trends and new Python library releases.
So what are you waiting for? Enroll today, learn how to work with imbalanced datasets and build better machine learning models.