
Please download the resources (dataset and Jupyter files).
Some datasets are large, and importing them into the Python environment as-is makes further preprocessing slow. This lecture details how to reduce the size of the data by tweaking some of the parameters of the dataset.
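A minimal sketch of one common way to shrink a DataFrame, assuming the "parameters" in question are column dtypes (the lecture may cover other options). The data and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a small-range integer column and a repetitive text column.
df = pd.DataFrame({
    "age": np.arange(1000, dtype="int64") % 90,
    "city": ["Delhi", "Mumbai", "Pune", "Chennai"] * 250,
})

before = df.memory_usage(deep=True).sum()

# Downcast integers to the smallest integer type that holds the values.
df["age"] = pd.to_numeric(df["age"], downcast="integer")
# Store repeated strings as a categorical instead of individual strings.
df["city"] = df["city"].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"{before} bytes -> {after} bytes")
```

The same dtypes can also be passed to pd.read_csv through its dtype argument, so the larger representation is never created.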
This is part-2 of the first lecture
pandas_profiling is a package that automates EDA with just two lines of code and outputs an interactive HTML-based EDA report.
This is another package that automates the EDA with many more options. Please do explore it.
Some pandas methods work on DataFrames and some on Series. This lecture details the difference between the two and explains which to use where.
Data can contain invalid values such as '#' or 'none' that you want to treat as missing. Students will be able to convert these invalid values to NaN while importing the data.
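As a small illustration, assuming the data is imported with pd.read_csv, its na_values argument does this conversion at load time. The CSV content here is invented.

```python
import io
import pandas as pd

# Hypothetical CSV text; '#' and 'none' stand in for missing entries.
raw = io.StringIO("id,score,grade\n1,85,A\n2,#,B\n3,70,none\n")

# na_values lists extra strings that read_csv should parse as NaN.
df = pd.read_csv(raw, na_values=["#", "none"])

print(df.isna().sum())
```

Because '#' is read as NaN, the score column comes in as a numeric column instead of text.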
People generally use df.dropna() to get rid of missing data. The lecture explains why using it blindly is not a great idea.
Students will be able to use the parameters of df.dropna() to get rid of missing values intelligently.
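A short sketch, assuming the method meant here is pandas' DataFrame.dropna. It compares the default behaviour with the how, subset and thresh parameters on a made-up frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 7.0],
    "c": [9.0, np.nan, 11.0, 12.0],
})

all_dropped = df.dropna()            # drops every row that has any NaN
only_empty = df.dropna(how="all")    # drops only rows where every value is NaN
by_subset = df.dropna(subset=["c"])  # only checks column 'c' for NaN
by_thresh = df.dropna(thresh=2)      # keeps rows with at least 2 non-NaN values

print(len(df), len(all_dropped), len(only_empty), len(by_subset), len(by_thresh))
```

The default call keeps a single row here, while the parameterised calls keep three, which is why the default can throw away far more data than intended.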
Most datasets have some constant and quasi-constant variables. The lecture explains what these variables are and what to do with them.
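One possible way to spot such variables, sketched on invented data: measure the share of rows taken by the most frequent value in each column. The 0.98 cutoff is an assumption, not a value from the lecture.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["IN"] * 100,     # constant: a single value
    "flag": [0] * 99 + [1],      # quasi-constant: 99% of rows share one value
    "amount": list(range(100)),  # varies on every row
})

# Share of rows taken by the most frequent value in each column.
top_share = df.apply(lambda col: col.value_counts(normalize=True).iloc[0])

threshold = 0.98  # assumed cutoff for "quasi-constant"
to_drop = top_share[top_share >= threshold].index.tolist()
df_reduced = df.drop(columns=to_drop)

print(to_drop)
```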
Most datasets contain some textual data in the form of categorical variables. Because these are often entered manually, they can have multiple issues. Students will learn how to work with textual data and clean it.
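A minimal sketch of typical clean-up steps for a manually entered category column, using the pandas .str accessor. The values and the typo mapping are hypothetical.

```python
import pandas as pd

# Hypothetical manual inputs: stray spaces, mixed case, a typo and a missing value.
s = pd.Series([" Male", "male ", "MALE", "Female", "femle", None])

cleaned = (
    s.str.strip()                       # remove leading and trailing spaces
     .str.lower()                       # normalise case
     .replace({"femle": "female"})      # fix a known typo
)

print(cleaned.value_counts(dropna=False))
```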
Students will learn about the different filtering methods available.
Students will learn how to use column names to subset a dataset.
During preprocessing, lambda functions, apply and applymap can serve as alternatives to looping over rows. Students will be able to work with all three and learn the difference between them.
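A rough sketch of how the three differ, on a made-up frame. Note that recent pandas versions rename DataFrame.applymap to DataFrame.map, so the example picks whichever one is available.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3.0, 4.0]})

# Series.apply: the lambda receives one value at a time.
price_with_tax = df["price"].apply(lambda p: p * 1.1)

# DataFrame.apply with axis=1: the lambda receives one row at a time.
total = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Element-wise over the whole frame: DataFrame.map in newer pandas,
# DataFrame.applymap in older versions.
elementwise = getattr(df, "map", None) or df.applymap
rounded = elementwise(lambda v: round(v / 3, 2))

print(total.tolist())
```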
Students will be able to use if-else statements on the dataset to generate new columns
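One vectorised way to express if-else logic is NumPy's where and select, sketched here on invented marks. The lecture may use other approaches, such as apply with a custom function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"marks": [35, 62, 88]})

# Two-way rule: np.where(condition, value_if_true, value_if_false).
df["result"] = np.where(df["marks"] >= 40, "pass", "fail")

# Multi-way rule: conditions are checked in order and the first match wins.
conditions = [df["marks"] >= 80, df["marks"] >= 40]
choices = ["distinction", "pass"]
df["grade"] = np.select(conditions, choices, default="fail")

print(df)
```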
Students will be able to use text columns to subset data and create new columns
Students will learn how to manipulate dates and extract information from dates
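For illustration, a small sketch of date handling with pd.to_datetime and the .dt accessor. The order_date column is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2023-01-15", "2023-03-02", "2024-12-31"]})

# Convert text to a proper datetime column first.
df["order_date"] = pd.to_datetime(df["order_date"])

# Extract components through the .dt accessor.
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.month
df["weekday"] = df["order_date"].dt.day_name()

# Date arithmetic: days elapsed since the earliest order.
df["days_since_first"] = (df["order_date"] - df["order_date"].min()).dt.days

print(df)
```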
Students will be able to create Excel-like pivot tables.
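A minimal sketch with pd.pivot_table on made-up sales data: rows become regions, columns become products, and cells hold summed revenue.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100, 150, 200, 50, 300],
})

pivot = pd.pivot_table(
    sales,
    index="region",     # one row per region
    columns="product",  # one column per product
    values="revenue",
    aggfunc="sum",
    fill_value=0,       # show 0 instead of NaN for empty combinations
)

print(pivot)
```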
Students will learn to use groupby to aggregate data. They will also learn the difference between using aggregate with groupby and using transform with groupby to get different effects on DataFrames.
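The difference can be sketched on a small invented frame: agg returns one row per group, while transform returns one value per original row.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["x", "x", "y", "y", "y"],
    "score": [10, 20, 30, 40, 50],
})

# agg collapses each group to a single row.
team_mean = df.groupby("team")["score"].agg("mean")

# transform keeps the original shape, so the result can be stored as a new column.
df["team_mean"] = df.groupby("team")["score"].transform("mean")
df["diff_from_mean"] = df["score"] - df["team_mean"]

print(team_mean)
print(df)
```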
Students will learn
1- how to use groupby to rank rows within a group
2- how to find the difference between rows of the same column
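Both ideas can be sketched with groupby on made-up monthly sales: rank orders rows within each store, and diff subtracts the previous row of the same store.

```python
import pandas as pd

df = pd.DataFrame({
    "store": ["a", "a", "a", "b", "b"],
    "month": [1, 2, 3, 1, 2],
    "sales": [100, 130, 120, 80, 95],
})

# Rank within each store, highest sales first.
df["rank_in_store"] = df.groupby("store")["sales"].rank(ascending=False, method="dense")

# Sort so that diff compares each month with the previous month of the same store.
df = df.sort_values(["store", "month"])
df["change"] = df.groupby("store")["sales"].diff()

print(df)
```

The first row of each group has no previous row, so its change is NaN.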
The presentation details the step-by-step sequence to follow when doing feature engineering.
fit, transform and fit_transform are scikit-learn methods for transforming data. Students will learn the difference between them and how to use each.
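A small sketch of the three methods, using StandardScaler as an arbitrary example transformer on toy arrays.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
scaler.fit(X_train)                       # learns the mean and std from training data
train_scaled = scaler.transform(X_train)  # applies the learned parameters
test_scaled = scaler.transform(X_test)    # reuses them, without refitting on test data

# fit_transform is fit followed by transform on the same data.
same_as_above = StandardScaler().fit_transform(X_train)

print(scaler.mean_, test_scaled)
```

Calling fit_transform on the test set would relearn the parameters from the test data, which is usually not what is wanted.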
Students will be able to understand the various missing-value imputation methods in scikit-learn and how to use them on DataFrames.
Students will be able to understand the need for correlation analysis, understand the various statistics associated with correlation analysis and find the correlated variables
Students will be able to identify and treat outliers
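One widely used rule, shown here as an assumption since the lecture may use other methods, flags values beyond 1.5 times the interquartile range and caps them at the fences.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Identify: values outside the fences.
outliers = s[(s < lower) | (s > upper)]

# Treat: cap values at the fences instead of dropping the rows.
capped = s.clip(lower=lower, upper=upper)

print(outliers.tolist(), capped.max())
```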
Categorical variables must be encoded to convert textual data to numbers so that it can be fed into the model. In this lecture, students will learn about various encoders, when to use which, and how to use them.
Variables are scaled before modeling to give all of them comparable weight. In this lecture, students will learn about various scaling techniques, when to use which, and how to use them.
Pipeline and ColumnTransformer let users chain various feature engineering techniques. The lecture helps students create a pipeline and then use column transformers to perform various feature engineering tasks in one go. This makes the code more readable and concise.
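A minimal sketch of the idea, assuming a frame with one numeric and one categorical column. The column names and the chosen steps (median imputation, standard scaling, one-hot encoding) are illustrative choices only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Pune", "Delhi", "Pune", "Delhi"],
})

# Steps applied in order to the numeric column.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each column to the transformer that suits it.
preprocess = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

features = preprocess.fit_transform(df)
print(features.shape)
```

The result has one scaled numeric column plus one column per city, all produced by a single fit_transform call.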
Students will be able to write a simple function using Pandas functionalities. This function will allow users to perform multiple tasks with a single line of code and automate tasks which are performed repeatedly
Now we will add more functionality to our functions, such as if-else statements and loops over columns.
Students will be able to write printed output to a text file.
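Two standard-library ways to do this are sketched here: the file argument of print, and contextlib.redirect_stdout. The file name and temporary directory are hypothetical.

```python
import contextlib
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

# Hypothetical output location; a temp directory keeps the example self-contained.
path = os.path.join(tempfile.mkdtemp(), "report.txt")

with open(path, "w") as f:
    # Option 1: pass the file handle to print.
    print("Summary statistics", file=f)
    # Option 2: redirect everything printed inside this block to the file.
    with contextlib.redirect_stdout(f):
        print(df.describe())

# Read the file back to confirm what was written.
with open(path) as f:
    content = f.read()
```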
Students will be able to write multiple DataFrames to an Excel workbook, either in different sheets or in the same sheet.
Understand common errors such as AttributeError and NameError, and learn how to debug them.
Real-life data is dirty. This is why preprocessing tasks take approximately 70% of the time in the ML modeling process. Moreover, there is a lack of dedicated courses that deal with this challenging task.
Introducing "Data Science Course: Data Cleaning & Feature Engineering", a hardcore course completely dedicated to the most tedious task of machine learning modeling: data preprocessing.
If you want to enhance your data preprocessing skills to build better, higher-performing ML models, then this course is for you!
This course has been designed by experienced Data Scientists who will help you to understand the WHYs and HOWs of preprocessing.
I will walk you step by step through the process of data preprocessing. With every tutorial, you will develop new skills, improve your understanding of preprocessing challenges, and learn ways to overcome them.
It is structured in the following way:
Part 1 - EDA (Exploratory Data Analysis): Get insights into your dataset
Part 2 - Data Cleaning: Clean your data based on insights
Part 3 - Data Manipulation: Generating features, subsetting, working with dates, etc.
Part 4 - Feature Engineering: Get the data ready for modeling
Part 5 - Function writing with Pandas DataFrame
Bonus Section: A few Interview preparation tips and strategies for data science enthusiasts in the job hunt
Who this course is for:
Anyone who is interested in becoming efficient in data preprocessing
People who are learning data science and want to better understand the various nuances of data and its treatment
Budding data scientists who want to improve data preprocessing skills
Anyone who is interested in preprocessing part of data science
This course is not for people who want to learn machine learning algorithms