Feature Engineering for Machine Learning

Rating: 4.5 (3801 reviews)

Learn imputation, variable encoding, discretization, feature extraction, how to work with datetime, outliers, and more.

Created by Soledad Galli, Train in Data Team

Last updated 3/2025

English

What you'll learn

Learn multiple techniques for missing data imputation.
Transform categorical variables into numbers while capturing meaningful information.
Learn how to deal with infrequent, rare, and unseen categories.
Learn how to work with skewed variables.
Convert numerical variables into discrete ones.
Remove outliers from your variables.
Extract useful features from dates and time variables.
Learn techniques used in organizations worldwide and in data competitions.
Increase your repertoire of techniques to preprocess data and build more powerful machine learning models.

Course content

19 sections • 202 lectures • 13h 23m total length

Course curriculum overview (5:39)
Course requirements (2:28)
How to approach this course (1:56)
Setting up your computer (0:14)
Resources to learn machine learning skills (0:23)

Variable characteristics (2:43)
Missing data (6:43)
Cardinality (5:03)
Rare labels (4:54)
Variable distribution (5:13)
Outliers (8:27)
Linear models assumptions (8:59)
Linear model assumptions - additional reading resources (optional) (0:35)
Variable magnitude (3:08)
Summary table (0:08)
Table illustrating the advantages and disadvantages of different machine learning algorithms, as well as their requirements in terms of feature engineering, and common applications.
Additional reading resources (0:30)

Basic imputation methods (3:52)
Mean or median imputation (4:53)
In this lecture, I describe what it means to replace missing values with the mean or median of the variable, the assumptions, advantages, and disadvantages of the method, and how it may affect the performance of machine learning algorithms.
Arbitrary value imputation (3:16)
Frequent category imputation (3:30)
Missing category imputation (1:22)
Adding a missing indicator (3:42)
Here I describe the process of adding one additional binary variable to flag the observations where data is missing.
Basic methods - considerations (11:15)
Basic imputation with pandas (6:45)
Basic imputation with pandas - demo (12:35)
Basic methods with Scikit-learn (9:44)
Mean or median imputation with Scikit-learn (10:53)
Arbitrary value imputation with Scikit-learn (3:57)
Frequent category imputation with Scikit-learn (4:38)
Missing category imputation with Scikit-learn (2:24)
Adding a missing indicator with Scikit-learn (4:59)
Imputation with GridSearch - Scikit-learn (8:24)
Basic methods with Feature-engine (7:19)
Mean or median imputation with Feature-engine (6:50)
Arbitrary value imputation with Feature-engine (3:16)
Frequent category imputation with Feature-engine (2:34)
Arbitrary string imputation with Feature-engine (3:24)
Adding a missing indicator with Feature-engine (4:52)
Wrapping up (2:19)
Treat: our movie pick (0:21)
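
As a rough illustration of two of the basic methods listed above, the following sketch combines median imputation with a missing indicator using pandas. It is not taken from the course notebooks: the toy data, the `age` column, and the helper `impute_with_indicator` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numerical variable with missing values.
train = pd.DataFrame({"age": [22.0, 35.0, np.nan, 58.0, np.nan, 41.0]})
test = pd.DataFrame({"age": [np.nan, 30.0]})

# Learn the imputation value from the training set only,
# then reuse it for any new data.
median_age = train["age"].median()


def impute_with_indicator(df, column, value):
    """Return a copy with a binary missing indicator and the NaNs filled."""
    out = df.copy()
    # 1 where the value was missing, 0 otherwise.
    out[column + "_na"] = out[column].isna().astype(int)
    out[column] = out[column].fillna(value)
    return out


train_t = impute_with_indicator(train, "age", median_age)
test_t = impute_with_indicator(test, "age", median_age)
```

The indicator is created before the values are filled, so the information about which observations were missing is kept alongside the imputed variable.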

Alternative imputation methods (2:59)
Complete Case Analysis (6:30)
In this lecture, I describe complete case analysis: what it is, what assumptions it makes, and the implications and consequences of handling missing values with this method.
CCA - considerations with code demo (3:45)
End of distribution imputation (4:14)
Random sample imputation (14:14)
In this lecture, I describe what random sample imputation is, its advantages, and the care that should be taken if this method is implemented in a business setting.
Random imputation - considerations with code (7:56)
Mean or median imputation per group (4:32)
CCA with pandas (5:19)
End of distribution imputation with pandas (5:24)
Random sample imputation with pandas (4:46)
Mean imputation per group with pandas (5:34)
CCA with Feature-engine (6:47)
End of distribution imputation with Feature-engine (5:13)
Random sample imputation with Feature-engine (2:25)
Continues from the previous lecture: in this lecture, I describe what random sample imputation is, its advantages, and the care that should be taken if this method is implemented in a business setting.
Imputation - Summary table (0:08)
Wrapping up (5:52)
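
Random sample imputation, mentioned above, can be sketched in pandas as follows. This is a minimal example under stated assumptions, not the course's own code: the `income` data and the helper `random_sample_impute` are hypothetical, and fixing the seed is one possible way to make the imputation reproducible.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a numerical variable with two missing values.
train = pd.DataFrame(
    {"income": [1200.0, np.nan, 3100.0, 2500.0, np.nan, 1800.0]}
)


def random_sample_impute(df, column, donor, seed=0):
    """Fill NaNs in `column` with values drawn at random from `donor`."""
    out = df.copy()
    missing = out[column].isna()
    # Draw as many observed values as there are missing entries.
    sampled = donor.dropna().sample(
        n=int(missing.sum()), replace=True, random_state=seed
    )
    # Assign by position so the donor's index does not interfere.
    out.loc[missing, column] = sampled.to_numpy()
    return out


# The donor values come from the training set.
train_t = random_sample_impute(train, "income", donor=train["income"])
```

Because the replacement values are drawn from the observed data, the imputed variable keeps roughly the same distribution as the original one. The fixed seed means repeated runs give the same result.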

Categorical encoding | Introduction (4:59)
One hot encoding (6:03)
One hot encoding with pandas (7:29)
One hot encoding with sklearn (11:06)
One hot encoding with Feature-engine (2:19)
One hot encoding with Category encoders (5:04)
Ordinal encoding (1:50)
Ordinal encoding with pandas (3:16)
Ordinal encoding with sklearn (4:05)
Ordinal encoding with Feature-engine (1:49)
Ordinal encoding with Category encoders (1:43)
Count or frequency encoding (3:11)
Count encoding with pandas (2:58)
Count encoding with Feature-engine (1:21)
Count encoding with Category encoders (1:42)
Unseen categories (11:35)
Wrapping up (3:03)
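
To make count and frequency encoding concrete, here is a small pandas sketch. The `city` data is hypothetical, and mapping unseen categories to 0 is just one possible choice, assumed here for illustration.

```python
import pandas as pd

# Toy data (hypothetical): a categorical variable in train and test sets.
train = pd.DataFrame(
    {"city": ["London", "Paris", "London", "Rome", "London", "Paris"]}
)
test = pd.DataFrame({"city": ["Paris", "Madrid"]})

# Learn the mappings from the training set only.
counts = train["city"].value_counts()                # London 3, Paris 2, Rome 1
freqs = train["city"].value_counts(normalize=True)   # fraction of observations

train["city_count"] = train["city"].map(counts)
train["city_freq"] = train["city"].map(freqs)

# "Madrid" was not seen during training, so map() returns NaN for it.
# Assumption for this sketch: treat unseen categories as a count of 0.
test["city_count"] = test["city"].map(counts).fillna(0).astype(int)
```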

Categorical encoding | Monotonic (5:09)
Ordered ordinal encoding (2:25)
Ordered ordinal encoding with pandas (8:11)
Ordered ordinal encoding with Feature-engine (2:36)
Mean encoding (1:34)
Mean encoding with pandas (4:39)
Mean encoding with Feature-engine (2:36)
Mean encoding with Category encoders (2:15)
Mean encoding plus smoothing (4:55)
Mean encoding plus smoothing - Category encoders (6:35)
Mean encoding plus smoothing - Feature-engine (6:15)
Weight of evidence (WoE) (4:36)
Weight of Evidence with pandas (9:47)
Weight of Evidence with Feature-engine (1:40)
Weight of Evidence with Category encoders (1:12)
Weight of evidence - gotchas (3:05)
Unseen categories (2:15)
Wrapping up (3:24)
Comparison of categorical variable encoding (9:09)
Additional reading resources (0:17)
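
The two target-based encodings named above can be sketched with pandas as follows. The `grade` and `default` columns are hypothetical, and the weight of evidence is computed here as the log of the ratio between each category's share of positives and its share of negatives. Sign conventions vary between references, so treat this definition as an assumption of the sketch.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): a categorical variable and a binary target.
train = pd.DataFrame({
    "grade": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "default": [1, 1, 0, 1, 0, 0, 1, 0],
})

# Mean encoding: replace each category with the target mean observed for it.
mean_map = train.groupby("grade")["default"].mean()

# Weight of evidence: share of all positives in the category divided by
# the share of all negatives in the category, on a log scale.
pos = train.loc[train["default"] == 1, "grade"].value_counts()
neg = train.loc[train["default"] == 0, "grade"].value_counts()
woe_map = np.log((pos / pos.sum()) / (neg / neg.sum()))

train["grade_mean"] = train["grade"].map(mean_map)
train["grade_woe"] = train["grade"].map(woe_map)
```

Every category in this toy data contains both target classes. A category with only one class would produce a division by zero or a log of zero, which has to be handled separately.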

Grouping rare labels (4:17)
One hot encoding of top categories (3:06)
OHE of top categories with pandas (5:33)
OHE of top categories with Feature-engine (2:14)
OHE of top categories with sklearn (5:35)
Rare label encoding (4:31)
In this lecture, I describe and compare two methods commonly used to replace rare labels. Rare labels are those categories within a categorical variable that contain very few observations, and may therefore affect the performance of tree-based machine learning algorithms.
In this lecture, I focus on variables with one predominant category.
Rare label encoding with pandas (8:12)
In this lecture, I describe and compare two methods commonly used to replace rare labels. Rare labels are those categories within a categorical variable that contain very few observations, and may therefore affect the performance of tree-based machine learning algorithms.
In this lecture, I focus on variables with few categories.
Rare label encoding with Feature-engine (1:39)
Wrapping up (2:20)
Categorical encoding - More... (3:31)
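
One common way to handle rare labels is to group them under a single new category. The pandas sketch below illustrates the idea; the `colour` data, the 20% threshold, the `"Rare"` replacement label, and the helper `group_rare` are all hypothetical choices made for this example.

```python
import pandas as pd

# Toy data (hypothetical): 10 observations, one of them a rare category.
train = pd.Series(["red"] * 6 + ["blue"] * 3 + ["green"], name="colour")
test = pd.Series(["blue", "green", "purple"], name="colour")

# Categories present in at least 20% of the training observations
# are considered frequent (the threshold is an arbitrary choice here).
freq = train.value_counts(normalize=True)
frequent_labels = freq[freq >= 0.2].index


def group_rare(series, frequent_labels, replacement="Rare"):
    """Keep frequent labels and replace every other label."""
    return series.where(series.isin(frequent_labels), replacement)


train_t = group_rare(train, frequent_labels)
# Labels unseen during training ("purple") are grouped as well.
test_t = group_rare(test, frequent_labels)
```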

Requirements

A Python installation.
Jupyter notebook installation.
Python coding skills.
Some experience with Numpy and Pandas.
Familiarity with machine learning algorithms.
Familiarity with Scikit-Learn.

Description

Welcome to Feature Engineering for Machine Learning, the most comprehensive course on feature engineering available online. In this course, you will learn about variable imputation, variable encoding, feature transformation, discretization, and how to create new features from your data.

Master Feature Engineering and Feature Extraction.

In this course, you will learn multiple feature engineering methods that will allow you to transform your data and leave it ready to train machine learning models. Specifically, you will learn:

How to impute missing data

How to encode categorical variables

How to transform numerical variables and change their distribution

How to perform discretization

How to remove outliers

How to extract features from date and time

How to create new features from existing ones

Create useful Features with Math, Statistics and Domain Knowledge

Feature engineering is the process of transforming existing features or creating new variables for use in machine learning. Raw data is rarely suitable for training machine learning algorithms as is, so data scientists devote a lot of time to data preprocessing. This course teaches you everything you need to know to leave your data ready to train your models.

While most online courses will teach you the very basics of feature engineering, like imputing variables with the mean or transforming categorical variables using one hot encoding, this course will teach you that, and much, much more.

In this course, you will first learn the most popular and widely used techniques for variable engineering, like mean and median imputation, one-hot encoding, transformation with logarithm, and discretization. Then, you will discover more advanced methods that capture information while encoding or transforming your variables to improve the performance of machine learning models.

You will learn methods like the weight of evidence, used in finance, and how to create monotonic relationships between variables and targets to boost the performance of linear models. You will also learn how to create features from date and time variables and how to handle categorical variables with a lot of categories.
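
As a brief illustration of creating features from date and time variables, the following pandas sketch derives a few of them from a timestamp. The `timestamp` column and the selection of derived features are hypothetical, chosen only for this example.

```python
import pandas as pd

# Toy data (hypothetical): two timestamps.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2024-01-15 08:30:00", "2024-07-06 22:05:00"]
    )
})

# The .dt accessor exposes the components of a datetime column.
df["month"] = df["timestamp"].dt.month
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday is 0
df["hour"] = df["timestamp"].dt.hour

# A derived binary feature: Saturday (5) or Sunday (6).
df["is_weekend"] = (df["day_of_week"] >= 5).astype(int)
```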

The methods that you will learn were described in scientific articles, are used in data science competitions, and are commonly utilized in organizations. And what’s more, they can be easily implemented by utilizing Python's open-source libraries!

Throughout the lectures, you’ll find detailed explanations of each technique and a discussion about their advantages, limitations, and underlying assumptions, followed by the best programming practices to implement them in Python.

By the end of the course, you will be able to decide which feature engineering technique you need based on the variable characteristics and the models you wish to train. And you will also be well placed to test various transformation methods and let your models decide which ones work best.

Step up your Career in Data Science

You’ve taken your first steps into data science. You know about the most commonly used prediction models. You've even trained a few linear regression or classification models. At this stage, you’re probably starting to find some challenges: your data is dirty, lots of values are missing, some variables are not numerical, and others extremely skewed. You may also wonder whether your code is efficient and performant or if there is a better way to program. You search online, but you can’t find consolidated resources on feature engineering. Maybe just blogs? So you may start to wonder: how are things really done in tech companies?

In this course, you will find answers to those questions. Throughout the course, you will learn multiple techniques for the different aspects of variable transformation, and how to implement them in an elegant, efficient, and professional manner using Python. You will leverage the power of Python’s open source ecosystem, including the libraries NumPy, Pandas, Scikit-learn, and special packages for feature engineering: Feature-engine and Category encoders.

By the end of the course, you will be able to implement all your feature engineering steps into a single elegant pipeline, which will allow you to put your predictive models into production with maximum efficiency.
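
A single pipeline of this kind might look like the following Scikit-learn sketch. It is one possible arrangement rather than the course's own pipeline: the toy data, the column names, the chosen steps, and the logistic regression model are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): one numerical and one categorical variable.
X = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, np.nan, 29.0, 63.0],
    "city": ["London", "Paris", np.nan, "Rome",
             "London", "Paris", "London", "Rome"],
})
y = [0, 0, 1, 1, 0, 1, 0, 1]

# Numerical variables: median imputation plus a missing indicator.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),
])

# Categorical variables: a "Missing" category, then one hot encoding
# that ignores categories not seen during training.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(X, y)

# New data passes through exactly the same steps as the training data.
new = pd.DataFrame({"age": [np.nan], "city": ["Madrid"]})
prediction = model.predict(new)
```

Keeping the imputation and encoding steps inside the pipeline means their parameters are learned from the training data only and reapplied unchanged at prediction time.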

Leverage the Power of Open Source

We will perform all feature engineering methods utilizing Pandas and Numpy, and we will compare the implementation with Scikit-learn, Feature-engine, and Category encoders, highlighting the advantages and limitations of each library. As you progress in the course, you will be able to choose the library you like the most to carry out your projects.

There is a dedicated Python notebook with code to implement each feature engineering method, which you can reuse in your projects to speed up the development of your machine learning models.

The Most Comprehensive Online Course for Feature Engineering

There is no one single place to go to learn about feature engineering. It involves hours of searching on the web to find out what people are doing to get the most out of their data.

That is why this course gathers plenty of techniques used worldwide for feature transformation, learnt from Kaggle and KDD data competitions, scientific articles, and the instructor’s experience as a data scientist. This course therefore provides a source of reference where you can learn new methods and also revisit the techniques and code needed to modify variables whenever you need to.

This course is taught by a lead data scientist with experience in the use of machine learning in finance and insurance, who is also a book author and the lead developer of a Python open source library for feature engineering. And there is more:

The course is constantly updated to include new feature engineering methods.

Notebooks are regularly refreshed to ensure all methods are carried out with the latest releases of the Python libraries, so the code keeps working as the libraries evolve.

The course combines videos, presentations, and Jupyter notebooks to explain the methods and show their implementation in Python.

The curriculum was developed over a period of four years with continuous research in the field of feature engineering to bring you the latest technologies, tools, and trends.

Want to know more? Read on...

This comprehensive feature engineering course contains over 200 lectures spread across more than 13 hours of video, and ALL topics include hands-on Python code examples that you can use for reference, practice, and reuse in your own projects.

REMEMBER, the course comes with a 30-day money-back guarantee, so you can sign up today with no risk.

So what are you waiting for? Enrol today and join the world's most comprehensive course on feature engineering for machine learning.

Who this course is for:

Data scientists who want to learn how to preprocess datasets in order to build machine learning models.
Data scientists who want to learn more techniques for feature engineering for machine learning.
Data scientists who want to improve their coding skills and programming practices for feature engineering.
Software engineers, mathematicians and academics switching careers into data science.
Data scientists interested in experimenting with various feature engineering techniques in data competitions.
Software engineers who want to learn how to use Scikit-learn and other open-source packages for feature engineering.

Course content

Welcome: 5 lectures • 11min
Course material: 4 lectures • 4min
Variable Types: 5 lectures • 15min
Variable Characteristics: 11 lectures • 46min
Missing Data Imputation - Basic: 24 lectures • 2hr 7min
Missing Data Imputation - Alternative Methods: 16 lectures • 1hr 26min
Multivariate Missing Data Imputation: 8 lectures • 30min
Categorical Encoding - Basic methods: 17 lectures • 1hr 14min
Categorical encoding - Monotonic: 20 lectures • 1hr 23min
Categorical encoding - Rare labels: 10 lectures • 41min
