
Master classification modeling in Python with data prep, feature engineering, imbalanced data techniques, and evaluating K-nearest neighbors, logistic regression, and trees for credit risk at Maven National Bank.
Explore classification modeling in Python as part three of a five-part data science series covering data prep and EDA, regression, classification, unsupervised learning, and natural language processing.
Explore a project-based, hands-on course in Python data science that covers classification modeling, data prep, EDA, and model deployment with interactive demos, quizzes, and assignments.
Explore and visualize bank customer data, prepare it for modeling, and apply classification algorithms to predict high, medium, and low credit risk, then evaluate and interpret the best model.
Explore classification modeling from k nearest neighbors to logistic regression and tree-based models, use metrics for imbalanced data, and practice in Jupyter Notebooks or Google Colab.
Install and launch Anaconda to access Jupyter notebooks and create notebooks. Explore Google Colab as an online alternative and verify setup with a simple Python test.
Explore the data science landscape, compare it with other analytics fields, and walk through each phase of the data science workflow, highlighting supervised and unsupervised learning and common algorithms.
Learn how machine learning uses data-driven algorithms to enable computers to learn and predict, covering supervised learning (regression and classification) and unsupervised learning for pattern discovery, customer segmentation, and recommendations.
Discover the most common machine learning algorithms, covering supervised and unsupervised learning, regression, classification, and key models like k-NN, logistic regression, tree-based models, and neural networks.
The data science workflow guides you from scoping the project to gathering, cleaning, exploring data, selecting features, and modeling with machine learning. Share insights with end users and iterate.
Define the project scope and business problem to guide data gathering and data cleaning, then perform exploratory data analysis with profiling and visualization to derive actionable insights.
Move from data steps to modeling by splitting data, selecting and engineering features, and training and validating models. Emphasize simple, interpretable approaches for stakeholder buy-in, deployment, and business impact.
Explore classification modeling within the data science workflow, emphasizing predicting categorical targets and prioritizing data prep and EDA to get the most out of the modeling phase.
Data science uses data to make smart decisions with supervised and unsupervised learning. Classification predicts categorical outcomes like purchases, fraud, or positive/negative/neutral reviews.
Classification modeling is a supervised technique that predicts a categorical target from features, for example using age and income to predict a purchase. It covers terminology, goals, workflow, and a k-nearest neighbors example.
Explore how classification models balance prediction and inference to maximize predictive accuracy. Learn how target variables relate to features and how this informs model choice and feature engineering.
Learn the classification modeling workflow from data preparation and feature engineering to model selection, evaluation, and deployment, with algorithms like logistic regression, trees, and ensembles.
Explore classification modeling concepts, including predicting categorical values, target variables (Y), and feature inputs (X), while emphasizing predictive accuracy through data splitting, feature engineering, and model validation.
Explore data prep and EDA for classification, visualize the target and features, assess their relationships and multicollinearity, and prepare data with feature engineering and data splitting.
Explore how to define a binary target for classification using loan default status, credit score thresholds, and loan type mappings, with data exploration and transformations in Python and SQL.
Explore the target variable to assess class frequencies and imbalances with value counts and bar charts; express distributions as percentages and note the need for numeric encoding.
Explore features for classification using histograms and box plots on numeric features, and value counts with bar charts using pandas plot API for categorical features; note the 35–45 age spike.
Explore a loan default dataset with numeric and object features using histograms, box plots, and loops to visualize distributions and identify rare classes for cleaning.
Read the income data with pandas, convert sal stat to a binary target. Plot target frequencies and explore numeric features like age, capital gain, capital loss, and hours per week.
Read an income dataset with pandas, create a binary target via numpy.where, assess class imbalance, and explore numeric and categorical features with box and bar plots for classification modeling.
Explore how correlation reveals the strength and direction of relationships between numeric features and the target, using pandas to compute correlations and screen for multicollinearity in classification modeling.
Explore how to create and interpret a correlation matrix using df.corr and sns.heatmap to identify strong predictors like age and estimated salary, while recognizing nonpredictive features such as user ID.
Explore correlation basics using a loan data frame, visualize relationships with scatter plots and heatmaps, and quantify links via a correlation matrix, including loan amount, property value, and default.
Explore feature-target relationships using box plots for numeric features and bar charts for categorical features to identify strong predictors and reveal how means and distributions relate to the target variable.
Explore feature-feature relationships to detect multicollinearity and avoid redundant predictors in classification modeling, using correlation matrices, data visualizations, and pair plots such as age versus salary.
Explore feature relationships in the data by building a correlation heat map, a pair plot of numeric variables, and a function to visualize the average target rate by categorical levels.
Build a correlation heatmap from a correlation matrix, then use seaborn pair plots and bar plots to reveal how age, hours worked, and education influence earning over 50,000.
Engineer features by creating or modifying inputs to improve model performance, using domain knowledge to transform and combine columns, such as binary gender and dummy variables, for logistic regression.
Transform categorical data into numeric with dummy variables via one-hot encoding using pandas get_dummies, and learn how drop_first creates a reference level for logistic regression.
Group rare and related categories through binning to reduce dummy variables and model width, boosting interpretability in classification models by mapping eight diamond quality levels into three bins.
Explore quick feature engineering techniques, including scaling with mean and standard deviation, binning categorical variables, creating ratio features like loan amount to income, and preparing data for modeling with splits.
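The scaling and ratio features described above can be sketched in a few lines. This is a minimal standard-library illustration, not the course's own code; the toy loan rows and the names `standardize`, `loans`, and `loan_to_income` are hypothetical. It uses the population standard deviation, which is the convention scikit-learn's StandardScaler follows.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (ddof=0)
    return [(v - mu) / sigma for v in values]

# Hypothetical toy rows: (loan_amount, income)
loans = [(200_000, 50_000), (150_000, 75_000), (300_000, 60_000)]

# Ratio feature: loan amount relative to income
loan_to_income = [amount / income for amount, income in loans]

# Scaled version of the income column
scaled_income = standardize([income for _, income in loans])
```

In a real workflow the mean and standard deviation would be learned from the training split only and then reused on the test split.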
Prepare data for modeling by organizing it into a single data frame with features and a binary target. Tree-based models handle missing values; kNN and logistic regression require imputation.
Prepare data for modeling by creating dummy variables, grouping rare categories, and splitting 20% of data for testing, producing x_train, x_test, y_train, and y_test.
Group rare or ambiguous categorical levels into missing or other buckets, replace values, create dummy variables, and prepare training and test sets for modeling.
Explore features with histograms and box plots, bar charts for targets; use scatter plots and correlations to identify predictors and multicollinearity, perform feature engineering, and split data for classification modeling.
Learn the KNN workflow: split the data, standardize it with scikit-learn's StandardScaler, then fit and tune a KNN model and evaluate generalization on a test set.
Apply k nearest neighbors classification in Python using scikit-learn, emphasizing data split and scaling, and evaluate using accuracy scores to tune the k parameter via cross-validation to optimize model performance.
The lecture explains measuring classification accuracy by counting correct predictions, using scikit-learn's score, and tools like the confusion matrix to compare training and test performance.
Analyze the confusion matrix to compare predicted versus actual classes, identify true and false positives and negatives, derive accuracy, precision, and recall with Python and scikit-learn; visualize with seaborn heatmap.
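As a rough illustration of how accuracy, precision, and recall fall out of the four confusion matrix cells, here is a from-scratch sketch for a binary target. The labels and the helper name `confusion_counts` are made up for this example; scikit-learn's metrics module provides equivalents.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

# Hypothetical actual and predicted classes
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)   # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are real
recall = tp / (tp + fn)              # share of real positives that were caught
```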
Explore how accuracy can be misleading and diagnose model performance with a confusion matrix, illustrating true negatives, true positives, false positives, and false negatives using scikit-learn and seaborn.
Fit a k-nearest neighbors model with k=5 using age and hours per week, after scaling, and report train/test accuracy and confusion matrices; create a scatter plot colored by predictions.
Standardize age and hours per week with a standard scaler, train a five-nearest-neighbors model, and evaluate with accuracy and a confusion matrix; explore tuning k and features.
Tune hyperparameters to optimize k nearest neighbors classification, testing values of k from 1 to n and considering distance metrics like euclidean or manhattan and weights to improve validation performance.
Balance overfitting and underfitting by using validation data and cross-validation to gauge generalization, tune hyperparameters, features, and regularization, and evaluate accuracy, precision, and recall.
Tune the k in k-nearest neighbors with grid search cross-validation to optimize hyperparameters, compare distance metrics like Minkowski, Euclidean, and Manhattan, and report a test accuracy of 88.75%.
Compare soft probability scores to hard thresholds through a probability versus event rate plot. Bin scores into deciles, then compare mean probability to actual event rate.
Tune the k in a kNN classifier using all features, apply scaling, and employ cross-validated grid search to report test accuracy and generate a confusion matrix.
Tune a kNN model using all features to boost accuracy, but note the training-test gap. Cross-validation informs n_neighbors=25, raising test accuracy to 82.8%.
Explore the pros and cons of k-nearest neighbors (KNN) for classification, noting simplicity and interpretability for small data, and heavy computation and the curse of dimensionality for large data.
K-nearest neighbors is a distance-based classifier using Euclidean distance (with Manhattan as an alternative), k tuned by cross-validation, and evaluated via accuracy and confusion matrix before moving to logistic regression.
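To make the distance-based idea concrete, here is a small from-scratch sketch of KNN prediction with Euclidean distance and a majority vote. It is an illustration only, not the scikit-learn KNeighborsClassifier used in the course; the toy points and the name `knn_predict` are hypothetical, and the features are assumed to be on comparable scales already.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, point, k=3):
    """Predict the majority class among the k nearest training points."""
    # Pair each training point's Euclidean distance to `point` with its label
    by_distance = sorted(zip((dist(x, point) for x in train_X), train_y))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training data: two well-separated clusters
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = [0, 0, 0, 1, 1, 1]

pred_near_first = knn_predict(train_X, train_y, (2, 2), k=3)
pred_near_second = knn_predict(train_X, train_y, (6, 5), k=3)
```

Every prediction scans the whole training set, which is one way to see why KNN gets expensive on large data.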
Explore logistic regression for binary and multiclass classification in Python, covering the sigmoid probability curve, likelihood, fitting, scoring, regularization, tuning, and model interpretation.
Explore how logistic regression, a classification algorithm, uses a linear model to produce probabilities by transforming outputs through log odds (logit) to p in [0,1].
Shows how logistic regression converts a linear combination of features into a probability via a sigmoid function, with intercept and slope shaping odds and likelihood optimization.
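The conversion from a linear combination to a probability can be sketched as follows. The intercept and slope values are hypothetical and chosen only so that the curve crosses 0.5 at x = 40.

```python
from math import exp, log

def sigmoid(z):
    """Map any real number (the log odds) to a probability in (0, 1)."""
    return 1 / (1 + exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability back to log odds."""
    return log(p / (1 - p))

# Hypothetical fitted parameters for a single feature x
intercept, slope = -4.0, 0.1

def predict_proba(x):
    """Probability of the positive class for feature value x."""
    return sigmoid(intercept + slope * x)
```

Here `predict_proba(40)` sits at the midpoint of the S-shaped curve, and larger x values push the probability toward 1 because the slope is positive.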
Learn how logistic regression uses likelihood to fit model weights by maximizing predicted probabilities and comparing to the true labels; extend to multiple features and regularization.
Learn to apply multiple logistic regression to predict binary outcomes with several features, such as X1 and X2, an intercept, and beta weights for spam detection.
Split training/validation and test sets to assess performance, scale for regularization if needed, then fit and tune logistic regression with features and hyperparameters, and score on test.
Explore how to fit a logistic regression model in Python using scikit-learn, inspect coefficients and intercept, evaluate accuracy and confusion matrices on loan default data, and consider feature engineering ideas.
Interpret logistic regression coefficients as changes in log odds; positive coefficients increase probability and negative coefficients decrease it, with odds ratios given by e^beta.
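A quick numeric check of the odds-ratio interpretation, using made-up parameter values: increasing the feature by one unit multiplies the odds by e^beta, whatever the starting point.

```python
from math import exp

def proba(x, intercept, beta):
    """Logistic regression probability for a single feature x."""
    return 1 / (1 + exp(-(intercept + beta * x)))

def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

# Hypothetical intercept and coefficient (e.g., a per-year age effect)
intercept, beta = -3.0, 0.05

# Ratio of the odds at x = 41 to the odds at x = 40
ratio = odds(proba(41, intercept, beta)) / odds(proba(40, intercept, beta))
odds_ratio = exp(beta)  # the same quantity, read directly off the coefficient
```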
Fit a logistic regression model on income data using age and hours per week, prepare the data, and evaluate with a confusion matrix and accuracy on test data.
Fit a logistic regression with age and hours per week to predict income over 50k, evaluate with training and test accuracy and a confusion matrix, and interpret coefficients as odds.
Improve ad purchase prediction by applying feature engineering and selection, using binary cutoffs for age and salary and interaction terms to boost out-of-sample accuracy of logistic regression.
Explore regularization to curb overfitting through hyperparameter tuning, cross-validation, and penalty terms (L1, L2, elastic net) in linear models and logistic regression to improve out-of-sample performance.
Tune regularized logistic regression by scaling features and using grid search with cross-validation to optimize C and penalty types (L1 and L2) with the saga solver.
Scale data with standardization, run grid search over logistic regression C and penalty, note elastic net with L1 ratio, and use saga with max_iter 1000.
Fit a full logistic regression model with all features, scale the inputs, and tune regularization hyperparameters, using feature selection and engineering to build an MVP production model.
Fit a logistic regression model using all features and scale data. Tune regularization (C, L2) with grid search to improve test accuracy to 85.6% and preview multiclass logistic regression.
Explore multiclass logistic regression with the iris data set to predict flower species from petal and sepal measurements. See how three models handle classes, interpret coefficients, and assess accuracy.
Fit a multi-class logistic regression on the provided dataset, and report test accuracy and a confusion matrix to evaluate model performance for the upcoming projects.
Fit a multiclass logistic regression on credit data with three target classes, and note baseline class frequencies. Compare its accuracy to a random forest baseline and discuss misclassifications.
Master logistic regression as a linear, interpretable classifier with a logit link and 0–1 probabilities, forming an S-shaped curve. Explore regularization, multiclass extensions, and metrics beyond accuracy and confusion matrix.
Explore classification metrics for evaluating models, including accuracy, confusion matrix, precision, recall, and F1 score, plus ROC curves and AUC for soft classification thresholds and multiclass metrics.
Explore how to compute accuracy, precision, and recall using scikit-learn metrics, interpret confusion matrices, and compare untuned and tuned logistic regression to balance precision and recall.
Explore the F1 score, the harmonic mean of precision and recall, and learn how to compute it with scikit-learn, interpret imbalances, and compare untuned and tuned models.
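A small sketch of why the harmonic mean matters: unlike a simple average, F1 stays low when precision and recall are far apart. The helper name `f1_from_pr` is hypothetical; scikit-learn's own F1 function works from actual and predicted labels rather than from precomputed precision and recall.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_from_pr(0.5, 0.5)   # equal precision and recall
lopsided = f1_from_pr(0.9, 0.1)   # same simple average, very different F1
```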
Apply Python to calculate accuracy, precision, recall, and F1 scores for a logistic regression classifier, generate a confusion matrix, and practice metrics with the metrics assignments notebook on income data.
Shift the default 0.5 threshold to balance false positives and false negatives, optimizing precision, recall, and F1 for tasks like spam filtering, fraud detection, and targeted advertising.
Shift decision thresholds with predict_proba to balance precision and recall in a loan default model, analyze false negatives versus false positives, and preview the precision recall curve.
Visualize precision-recall curve to explore how thresholds trade off precision and recall with predicted probabilities. Use F1 curve to pick the threshold that balances metrics on test data.
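The threshold search behind an F1 curve can be sketched without any plotting: score every candidate threshold and keep the best one. The labels and probabilities below are invented for illustration, and `f1_at_threshold` is a hypothetical helper.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 score when predicting 1 for every probability >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced data: 3 positives out of 10
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]

# Candidate thresholds from 0.05 to 0.95 in steps of 0.05
thresholds = [i / 100 for i in range(5, 100, 5)]
best_threshold = max(thresholds, key=lambda t: f1_at_threshold(y_true, probs, t))
best_f1 = f1_at_threshold(y_true, probs, best_threshold)
default_f1 = f1_at_threshold(y_true, probs, 0.5)
```

On this toy data the best threshold lands below the default 0.5, trading a little precision for full recall.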
Plot precision-recall and F1 curves to compare thresholds, observe how precision rises as recall falls, and identify near-optimal points using training and test data, then preview ROC and AUC metrics.
Explore the ROC curve and AUC to evaluate probabilistic classifiers across thresholds, using true positive rate and false positive rate to illustrate threshold-agnostic ranking.
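One way to see why AUC is threshold-agnostic: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes it by brute-force pair counting on invented data. It is an illustration of that ranking interpretation, not how scikit-learn computes the metric internally.

```python
def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.80]

auc = roc_auc(y_true, scores)             # one negative outranks one positive
perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```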
Demonstrates evaluating a classifier with the ROC curve and AUC, using scikit-learn metrics to compute false positive and true positive rates from predicted probabilities, and plotting the curve.
Recap of key classification metrics for model tuning, including accuracy, precision, recall, F1, and ROC AUC, with guidance on when each metric matters, especially with imbalanced data.
Shift the model threshold to maximize the F1 score, plot precision, recall, and F1 versus threshold, and report metrics at the optimum. Plot the ROC curve and report the AUC.
Plot precision-recall and F1 vs threshold to identify the optimal threshold around 0.32 using training data, then compare precision, recall, F1, and ROC-AUC to understand trade-offs.
Explore multiclass confusion matrices to assess model predictions, compute per-class and weighted metrics (accuracy, precision, recall), and implement calculations with Python.
Explore multiclass metrics in Python with scikit-learn, passing actual y and predicted y to precision_score and setting average to None, macro, or weighted for imbalanced classes.
Learn to compute precision and recall by class for a fitted multiclass model, and derive accuracy and weighted-average precision and recall on test data.
Explore multi-class metrics in Python data science by calculating precision, recall, and F1 scores using scikit-learn, comparing class-level and weighted averages, and relating recall to accuracy.
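The link between recall and accuracy can be checked by hand: weighting each class's recall by its number of actual observations gives back overall accuracy. The sketch below uses invented three-class labels and a hypothetical helper, `recall_by_class`.

```python
from collections import Counter

def recall_by_class(y_true, y_pred):
    """Recall for each class: correct predictions / actual observations."""
    support = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / support[c] for c in support}

# Hypothetical actual and predicted credit risk classes
y_true = ["low", "low", "low", "low", "medium", "medium", "medium", "high", "high", "high"]
y_pred = ["low", "low", "low", "medium", "medium", "medium", "high", "high", "low", "high"]

recalls = recall_by_class(y_true, y_pred)
support = Counter(y_true)
n = len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
macro_recall = sum(recalls.values()) / len(recalls)              # every class counts equally
weighted_recall = sum(recalls[c] * support[c] for c in recalls) / n  # weighted by class size
```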
Use accuracy for balanced data, then leverage precision, recall, and F1 for imbalanced problems; adjust thresholds and compare ROC and AUC, with AUC near one indicating strong ranking.
Explore techniques for modeling imbalanced data, including oversampling, SMOTE, and undersampling, and tune thresholds and class weights to improve rare-event detection and overall metrics.
Choose the correct metric during scoping, then balance imbalanced data with tuning the decision threshold, sampling, and class weights to improve precision and recall.
Explore oversampling, duplicating minority class observations to balance data and improve a model's ability to distinguish the positive class, while noting risks of overfitting and larger data size.
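Random oversampling by duplication can be sketched in plain Python as below. This is an illustration only; the function name, its parameters, and the toy data are hypothetical, and `ratio` here means the desired minority-to-majority count after resampling.

```python
import random

def random_oversample(rows, labels, minority=1, ratio=1.0, seed=42):
    """Duplicate random minority rows until minority/majority reaches `ratio`."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_majority = len(labels) - len(minority_rows)
    n_needed = int(ratio * n_majority) - len(minority_rows)
    # Sample with replacement from the existing minority rows
    extra = [rng.choice(minority_rows) for _ in range(max(n_needed, 0))]
    return rows + extra, labels + [minority] * len(extra)

# Hypothetical data: 8 majority rows (label 0) and 2 minority rows (label 1)
rows = list(range(10))
labels = [0] * 8 + [1] * 2

new_rows, new_labels = random_oversample(rows, labels, ratio=1.0)
```

Resampling should be applied to the training split only, so that the test set still reflects the real class balance.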
Install imbalanced-learn and demo oversampling to balance the data by increasing the positive class using a random oversampler, then evaluate F1 improvements across ratios and threshold tuning, including SMOTE.
Explore SMOTE, a distance-based oversampling technique that creates synthetic minority observations near existing ones, reducing overfitting compared with duplication, while noting the higher computational cost and need for cross-validation.
Learn to apply SMOTE in Python with the imbalanced-learn library, configure the oversampling ratio (e.g., 4, 8, 16), and compare F1 scores to improve model performance.
Explore undersampling, or down sampling, to balance imbalanced data by randomly removing majority class rows, improving model performance while reducing data size and training time in Python.
Explore undersampling in Python using the imbalanced-learn library and its random undersampler to balance imbalanced data, adjust minority–majority ratios, and evaluate logistic regression with F1 score and threshold tuning.
Explore imbalance techniques by applying oversampling, undersampling, and SMOTE to an income data set, then tune logistic regression thresholds to improve F1 and related metrics.
Explore how to apply oversampling and undersampling methods, including SMOTE and random oversampling, to logistic regression, compare accuracy, precision, recall, and F1, and optimize thresholds.
Explore changing class weights in imbalanced data to improve logistic regression performance, comparing default, balanced, and four-to-one weighting, and evaluating with F1 and AUC.
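For intuition about what "balanced" weighting means, the sketch below reproduces the formula scikit-learn documents for its balanced class weight option: n_samples / (n_classes * class count). The labels are invented and the helper name is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * count) for c, count in counts.items()}

# Hypothetical 80/20 imbalanced target
labels = [0] * 80 + [1] * 20
weights = balanced_class_weights(labels)

# How much more a positive-class error costs than a negative-class error
positive_to_negative = weights[1] / weights[0]
```

With an 80/20 split, balanced weighting works out to the same four-to-one emphasis on the positive class as an explicit 4-to-1 weight.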
Tune class weights and thresholds to improve F1 and accuracy in a binary classifier. Use AUC, precision, and recall to compare balanced, 4-to-1, and threshold-based strategies.
Experiment with logistic regression by testing standard, balanced, and 4-to-1 class weights to maximize AUC, then tune the threshold to maximize F1 using the imbalanced data assignments notebook.
Compare three logistic regression models with no weighting, balanced weighting, and four-to-one positive weighting, and evaluate with accuracy, AUC, and F1 to determine if standard weighting suffices.
Recap strategies for imbalanced data: establish a baseline, tune the threshold, apply sampling methods (oversampling with SMOTE, undersampling), and adjust class weights, guided by cross-validation.
Predict the credit score category with a binary logistic regression model, merging good and standard credit, using data prep, EDA, SMOTE, and threshold tuning, evaluated by ROC AUC and F1.
Explore how decision trees classify data by splitting on features to maximize information gain and entropy reduction, then fit, visualize, and tune tree depth and leaf size.
Explore how entropy measures class impurity and guides decision tree splits, using churn prediction with last login time, age, lifetime value, and sign up date to maximize information gain.
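Entropy and information gain can be computed by hand for a single candidate split. The churn labels below are invented for illustration; a tree algorithm would evaluate many candidate splits like these and keep the one with the highest gain.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical churn labels (1 = churned) before splitting
parent = [1, 1, 1, 1, 0, 0, 0, 0]

# A weak split leaves both children mixed; a perfect split makes them pure
weak_gain = information_gain(parent, [1, 1, 1, 0], [1, 0, 0, 0])
perfect_gain = information_gain(parent, [1, 1, 1, 1], [0, 0, 0, 0])
```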
Predict with decision trees from root to leaf nodes, using soft and hard classifications, information gain splits, and hyperparameter tuning to prevent overfitting; compare random forest and gradient boosting ensembles.
Learn to build decision trees in Python with scikit-learn, tune hyperparameters, fit models, evaluate training versus test accuracy, and visualize splits with plot_tree to highlight age and salary.
Demonstrates preprocessing a loan default dataset, imputing numeric values with the mean and categorical values with the mode, training a scikit-learn decision tree, and examining data quality issues.
Learn how feature importance sums to one and reveals each feature's contribution to a tree model's accuracy, and identify low-importance features to remove from your model, potentially improving generalization.
Develop a decision tree classifier with max depth 3 using age, hours per week, and gender; evaluate accuracy, confusion matrix, and feature importance, and visualize the tree.
Train a simple decision tree on income data, evaluate train and test accuracy, precision, and recall, and examine feature importance (marital status, capital gain, hours, age, education) ahead of tuning.
Learn how decision tree hyperparameters like max_depth, min_samples_leaf, criterion, and class_weight control overfitting, with cross-validation and grid search to boost out-of-sample performance.
Apply grid search to tune a decision tree classifier, selecting max depth and min samples leaf, yielding a simpler, more interpretable model with improved test accuracy and highlighted feature importance.
Explore how decision trees use entropy to split data, visualize early splits, and tune hyperparameters and depth to prevent overfitting while evaluating feature importance.
This is a hands-on, project-based course designed to help you master the foundations for classification modeling and supervised machine learning in Python.
We’ll start by reviewing the Python data science workflow, discussing the primary goals & types of classification algorithms, and doing a deep dive into the classification modeling steps we’ll be using throughout the course.
You’ll learn to perform exploratory data analysis (EDA), leverage feature engineering techniques like scaling, dummy variables, and binning, and prepare data for modeling by splitting it into train, test, and validation datasets.
From there, we’ll fit K-Nearest Neighbors & Logistic Regression models, and build an intuition for interpreting their coefficients and evaluating their performance using tools like confusion matrices and metrics like accuracy, precision, and recall. We’ll also cover techniques for modeling imbalanced data, including threshold tuning, sampling methods like oversampling & SMOTE, and adjusting class weights in the model cost function.
Throughout the course, you'll play the role of Data Scientist for the risk management department at Maven National Bank. Using the skills you learn throughout the course, you'll use Python to explore their data and build classification models to accurately determine which customers have high, medium, and low credit risk based on their profiles.
Last but not least, you'll learn to build and evaluate decision tree models for classification. You’ll fit, visualize, and fine-tune these models using Python, then apply your knowledge to more advanced ensemble models like random forests and gradient boosted machines.
COURSE OUTLINE:
Intro to Data Science in Python
Introduce the fields of data science and machine learning, review essential skills, and introduce each phase of the data science workflow
Classification 101
Review the basics of classification, including key terms, the types and goals of classification modeling, and the modeling workflow
Pre-Modeling Data Prep & EDA
Recap the data prep & EDA steps required to perform modeling, including key techniques to explore the target, features, and their relationships
K-Nearest Neighbors
Learn how the k-nearest neighbors (KNN) algorithm classifies data points and practice building KNN models in Python
Logistic Regression
Introduce logistic regression, learn the math behind the model, and practice fitting models and tuning regularization strength
Classification Metrics
Learn how and when to use several important metrics for evaluating classification models, such as precision, recall, F1 score, and ROC-AUC
Imbalanced Data
Understand the challenges of modeling imbalanced data and learn strategies for improving model performance in these scenarios
Decision Trees
Build and evaluate decision tree models, algorithms that look for the splits in your data that best separate your classes
Ensemble Models
Get familiar with the basics of ensemble models, then dive into specific models like random forests and gradient boosted machines
__________
Ready to dive in? Join today and get immediate, LIFETIME access to the following:
9.5 hours of high-quality video
18 homework assignments
9 quizzes
2 projects
Python Data Science: Classification ebook (250+ pages)
Downloadable project files & solutions
Expert support and Q&A forum
30-day Udemy satisfaction guarantee
If you're a business intelligence professional or aspiring data scientist looking for an introduction to the world of classification modeling with Python, this is the course for you.
Happy learning!
-Chris Bruehl (Data Science Expert & Lead Python Instructor, Maven Analytics)
__________
Looking for our full business intelligence stack? Search for "Maven Analytics" to browse our full course library, including Excel, Power BI, MySQL, Tableau and Machine Learning courses!
See why our courses are among the TOP-RATED on Udemy:
"Some of the BEST courses I've ever taken. I've studied several programming languages, Excel, VBA and web dev, and Maven is among the very best I've seen!" Russ C.
"This is my fourth course from Maven Analytics and my fourth 5-star review, so I'm running out of things to say. I wish Maven was in my life earlier!" Tatsiana M.
"Maven Analytics should become the new standard for all courses taught on Udemy!" Jonah M.