Feature Engineering for Machine Learning

Transform the variables in your data and build better performing machine learning models
4.6 (1,414 ratings)
Course Ratings are calculated from individual students’ ratings and a variety of other signals, like age of rating and reliability, to ensure that they reflect course quality fairly and accurately.
9,071 students enrolled
Created by Soledad Galli
Last updated 7/2020
English
English [Auto]
Current price: $139.99 Original price: $199.99 Discount: 30% off
30-Day Money-Back Guarantee
This course includes
  • 10 hours on-demand video
  • 19 articles
  • 5 downloadable resources
  • Full lifetime access
  • Access on mobile and TV
  • Assignments
  • Certificate of Completion
What you'll learn
  • Learn multiple techniques for missing data imputation
  • Transform categorical variables into numbers while capturing meaningful information
  • Learn how to deal with infrequent, rare and unseen categories
  • Transform skewed variables so their distribution approximates the Gaussian
  • Convert numerical variables into discrete intervals
  • Remove outliers from your variables
  • Extract meaningful features from dates and time variables
  • Learn techniques used in organisations worldwide and in data competitions
  • Increase your repertoire of techniques to preprocess data and build more powerful machine learning models
Course content
124 lectures • 09:54:47 total length
+ Introduction
9 lectures 20:14
How to approach this course
01:09
Setting up your computer
01:32
Download Jupyter notebooks
00:52
Download datasets
01:02
Download course presentations
00:25
FAQ: Data Science, Python programming, datasets, presentations and more...
00:47
+ Variable Types
6 lectures 15:43
Variables | Intro
02:37
Numerical variables
05:02
Categorical variables
03:43
Date and time variables
01:56
Mixed variables
02:16
Bonus: More about the Lending Club dataset
00:09
Quiz about variable types
13 questions
+ Variable Characteristics
11 lectures 47:14
Variable characteristics
02:43
Missing data
06:44
Cardinality - categorical variables
05:03
Rare Labels - categorical variables
04:54
Linear models assumptions
09:52
Variable distribution
05:13
Outliers
08:25
Variable magnitude
03:08

A table illustrating the advantages and disadvantages of different machine learning algorithms, their requirements in terms of feature engineering, and common applications.

Bonus: Machine learning algorithms overview
00:10
Bonus: Additional reading resources
00:38
FAQ: How can I learn more about machine learning?
00:24
+ Missing Data Imputation
25 lectures 02:05:04
Introduction to missing data imputation
03:58

In this lecture, I describe complete case analysis: what it is, what assumptions it makes, and the implications and consequences of handling missing values with this method.

Complete Case Analysis
06:46
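
As a quick illustration of the idea (a minimal pandas sketch on made-up data, not the course notebook):

```python
import pandas as pd

# Toy dataset with missing values (illustrative only)
df = pd.DataFrame({
    "age": [25, None, 40, 31],
    "income": [50000, 62000, None, 48000],
})

# Complete case analysis: keep only the rows with no missing values.
# This implicitly assumes data are missing completely at random (MCAR);
# otherwise the retained sample may be biased.
complete_cases = df.dropna()
print(complete_cases)
```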

In this lecture, I describe what I mean by replacing missing values with the mean or median of the variable, the assumptions this makes, its advantages and disadvantages, and how it may affect the performance of machine learning algorithms.

Mean or median imputation
07:53
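
For intuition, a minimal pandas sketch of median imputation (made-up data; in practice the statistic should be learned from the training set only, to avoid leakage):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31, None]})

# Replace missing values with the variable's median, which distorts
# the distribution less than the mean when the data are skewed.
median_age = df["age"].median()
df["age"] = df["age"].fillna(median_age)
```
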
Arbitrary value imputation
06:42
End of distribution imputation
04:53
Frequent category imputation
06:56
Missing category imputation
04:05

In this lecture, I describe what random sample imputation is, its advantages, and the precautions to take if this method is to be implemented in a business setting.

Random sample imputation
14:17
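
A rough pandas sketch of the idea (illustrative data; the fixed seed is what makes the imputation reproducible in a business setting):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 31, None, 52]})

# Draw random replacements from the observed values, so the
# variable's distribution is preserved after imputation.
observed = df["age"].dropna()
missing_idx = df.index[df["age"].isna()]
sampled = observed.sample(n=len(missing_idx), replace=True, random_state=0)
sampled.index = missing_idx
df.loc[missing_idx, "age"] = sampled
```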

Here I describe how to add a binary indicator variable that captures which observations have missing data.

Adding a missing indicator
05:26
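
A minimal sketch of the idea on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, None]})

# Binary flag marking where the value was originally missing,
# added before imputing the variable itself (here with the median).
df["age_missing"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].median())
```
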
Mean or median imputation with Scikit-learn
10:33
Arbitrary value imputation with Scikit-learn
05:35
Frequent category imputation with Scikit-learn
03:48
Missing category imputation with Scikit-learn
02:46
Adding a missing indicator with Scikit-learn
04:06
Automatic determination of imputation method with Sklearn
08:25
Introduction to Feature-engine
05:10
Mean or median imputation with Feature-engine
04:44
Arbitrary value imputation with Feature-engine
03:12
End of distribution imputation with Feature-engine
04:40
Frequent category imputation with Feature-engine
01:59
Missing category imputation with Feature-engine
02:24

This lecture continues from the previous one: here I show how to implement random sample imputation with Feature-engine, and revisit the precautions to take if this method is to be implemented in a business setting.

Random sample imputation with Feature-engine
02:00
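
As a rough sketch of the transformer-style API (assuming a recent Feature-engine release, where the imputers live in `feature_engine.imputation`; check your installed version, as module paths have changed over time):

```python
import pandas as pd
from feature_engine.imputation import RandomSampleImputer  # path may vary by version

df = pd.DataFrame({"age": [25.0, None, 40.0, 31.0, None, 52.0]})

# fit() stores the observed values; transform() draws random replacements.
# The seed keeps the imputation reproducible across runs.
imputer = RandomSampleImputer(random_state=0)
df_imputed = imputer.fit_transform(df)
```
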
Adding a missing indicator with Feature-engine
03:11
Overview of missing value imputation methods
00:08
Conclusion: when to use each missing data imputation method
01:27
+ Multivariate Missing Data Imputation
1 lecture 00:03
Multivariate Imputation - COMING IN 2020
00:03
+ Categorical Variable Encoding
22 lectures 02:11:21
Categorical encoding | Introduction
06:49
One hot encoding
06:09
One-hot-encoding: Demo
14:12
One hot encoding of top categories
03:06
One hot encoding of top categories | Demo
08:35
Ordinal encoding | Label encoding
01:50
Ordinal encoding | Demo
08:08
Count or frequency encoding
03:11
Count encoding | Demo
04:33
Target guided ordinal encoding
02:50
Target guided ordinal encoding | Demo
08:30
Mean encoding
02:22
Mean encoding | Demo
05:31
Probability ratio encoding
06:13
Weight of evidence (WoE)
04:36
Weight of Evidence | Demo
12:38
Comparison of categorical variable encoding
10:36

In this lecture I describe and compare two methods commonly used to replace rare labels. Rare labels are the categories within a categorical variable that contain very few observations, and they can therefore affect the performance of tree-based machine learning algorithms.

In this lecture I focus on variables with one predominant category.

Rare label encoding
04:31
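
One simple way to picture this (a hypothetical pandas sketch; the 20% threshold is an arbitrary choice, not the course's):

```python
import pandas as pd

s = pd.Series(["blue", "blue", "blue", "red", "red", "green", "violet"])

# Group categories seen in fewer than 20% of observations
# under a single "Rare" label.
freq = s.value_counts(normalize=True)
rare = freq[freq < 0.20].index
s_encoded = s.where(~s.isin(rare), other="Rare")
```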

As in the previous lecture, I describe and compare two methods commonly used to replace rare labels, that is, categories with very few observations.

In this lecture I focus on variables with few categories.

Rare label encoding | Demo
10:25
Binary encoding and feature hashing
06:12
Summary table of encoding techniques
00:05
Bonus: Additional reading resources
00:18
+ Variable Transformation
4 lectures 23:05
Variable Transformation | Introduction
04:48
Variable Transformation with Numpy and SciPy
07:38
Variable Transformation with Scikit-learn
07:03
Variable transformation with Feature-engine
03:36
+ Discretisation
14 lectures 01:10:17
Discretisation | Introduction
03:01
Equal-width discretisation
04:06
Equal-width discretisation | Demo
11:18
Equal-frequency discretisation
04:13
Equal-frequency discretisation | Demo
07:16
K-means discretisation
04:13
K-means discretisation | Demo
02:43
Discretisation plus categorical encoding
02:54
Discretisation plus encoding | Demo
05:45
Discretisation with classification trees
05:05
Discretisation with decision trees using Scikit-learn
11:55
Discretisation with decision trees using Feature-engine
03:48
Domain knowledge discretisation
03:52
Bonus: Additional reading resources
00:08
+ Outlier Handling
7 lectures 33:06
Outlier Engineering | Intro
07:42
Outlier trimming
07:21

In this lecture I describe a common method to handle outliers in numerical variables: capping values at boundaries derived from the inter-quartile range (IQR). Methods like this are commonly used in surveys as well as in other business settings.

Outlier capping with IQR
06:24
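
For intuition, a minimal sketch of IQR-based capping on made-up data (1.5 is the customary proximity-rule factor):

```python
import pandas as pd

s = pd.Series([5, 7, 8, 9, 10, 11, 12, 95])  # 95 is an outlier

# Cap values that fall beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
s_capped = s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
```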

This lecture continues from the previous one: I describe another common method to handle outliers in numerical variables, this time capping values using the mean and standard deviation.

Outlier capping with mean and std
04:44
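
A companion sketch using the mean and standard deviation instead (the 3-standard-deviation rule of thumb assumes the variable is roughly Gaussian):

```python
import pandas as pd

s = pd.Series([5, 7, 8, 9, 10, 11, 12, 95])

# Cap values beyond 3 standard deviations from the mean.
mean, std = s.mean(), s.std()
s_capped = s.clip(lower=mean - 3 * std, upper=mean + 3 * std)
```
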
Outlier capping with quantiles
03:17
Arbitrary capping
03:33
Additional reading resources
00:05
+ Feature Scaling
14 lectures 52:22
Feature scaling | Introduction
03:43
Standardisation
05:30
Standardisation | Demo
04:38
Mean normalisation
04:01
Mean normalisation | Demo
05:20
Scaling to minimum and maximum values
03:23
MinMaxScaling | Demo
03:00
Maximum absolute scaling
02:59
MaxAbsScaling | Demo
03:44
Scaling to median and quantiles
02:45
Robust Scaling | Demo
02:03
Scaling to vector unit length
05:50
Scaling to vector unit length | Demo
05:17
Additional reading resources
00:09
Requirements
  • A Python installation
  • Jupyter notebook installation
  • Python coding skills
  • Some experience with Numpy and Pandas
  • Familiarity with Machine Learning algorithms
  • Familiarity with Scikit-Learn
Description

NEW! Updated in November 2020 for the latest software versions, including use of new tools and open-source packages, and additional feature engineering techniques.


Welcome to Feature Engineering for Machine Learning, the most comprehensive course on feature engineering available online. In this course, you will learn how to engineer features and build more powerful machine learning models.


Who is this course for?

So, you’ve taken your first steps into data science. You know the most commonly used prediction models, and perhaps you have even built a linear regression or a classification tree model. At this stage you’re probably starting to encounter some challenges: you realize that your data set is dirty, there are lots of missing values, some variables contain labels instead of numbers, others do not meet the assumptions of the models, and on top of everything you wonder whether you are coding things up the right way. To make matters more complicated, you can’t find many consolidated resources on feature engineering, beyond scattered blog posts. So you may start to wonder: how are things really done in tech companies?

This course will help you! This is the most comprehensive online course on feature engineering. You will learn a huge variety of engineering techniques used worldwide in different organisations and in data science competitions to clean and transform your data and variables.


What will you learn?

I have put together a fantastic collection of feature engineering techniques, based on scientific articles, white papers, data science competitions, and of course my own experience as a data scientist.

Specifically, you will learn:

  • How to impute your missing data

  • How to encode your categorical variables

  • How to transform your numerical variables so they meet ML model assumptions

  • How to convert your numerical variables into discrete intervals

  • How to remove outliers

  • How to handle date and time variables

  • How to work with different time zones

  • How to handle mixed variables which contain strings and numbers

Throughout the course, you will learn multiple techniques for each of these tasks, and you will learn to implement them in an elegant, efficient, and professional manner, using Python, NumPy, Scikit-learn, pandas and a special open-source package that I created especially for this course: Feature-engine.


At the end of the course, you will be able to implement all your feature engineering steps in a single and elegant pipeline, which will allow you to put your predictive models into production with maximum efficiency.
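
As a flavour of what such a pipeline can look like (a generic Scikit-learn sketch, not the course's exact solution; the steps and model are placeholders):

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# One object that imputes, scales and predicts, so the parameters
# learned on the training data are re-applied consistently in production.
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
# pipe.fit(X_train, y_train); pipe.predict(X_test)
```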


Want to know more? Read on...

In this course, you will initially become acquainted with the most widely used techniques for variable engineering, followed by more advanced and tailored techniques, which capture information while encoding or transforming your variables. You will also find detailed explanations of the various techniques, their advantages, limitations and underlying assumptions and the best programming practices to implement them in Python.


This comprehensive feature engineering course includes over 100 lectures spanning about 10 hours of video, and ALL topics include hands-on Python code examples which you can use for reference, for practice, and to re-use in your own projects.


REMEMBER, the course comes with a 30-day money back guarantee, so you can sign up today with no risk. So what are you waiting for? Enrol today, embrace the power of feature engineering and build better machine learning models.

Who this course is for:
  • Data Scientists who want to get started in pre-processing datasets to build machine learning models
  • Data Scientists who want to learn more techniques for feature engineering for machine learning
  • Data Scientists who want to improve their coding skills and best programming practices for feature engineering
  • Software engineers, mathematicians and academics switching careers into data science
  • Data Scientists who want to try different feature engineering techniques on data competitions
  • Software engineers who want to learn how to use Scikit-learn and other open-source packages for feature engineering