
•Preprocessing refers to the transformations applied to raw data before it is fed to a machine learning algorithm
•The quality of the data is important for training the model
•Sources – government databases, professional or company data sources (e.g. Twitter), your own company, etc.
•Data will almost never be in the format you need – use a pandas DataFrame to reformat it
•Columns to remove – columns with no values and duplicate columns (correlated columns, e.g. house size in both feet and metres)
•Learning algorithms understand only numbers, so text and images must be converted to numbers
•Unscaled or unstandardized data might lead to unacceptable predictions
•Check for null values
•Remove or impute them
•Check: df.isnull().values.any()
•Remove rows with any null value: df = df.dropna(how='any', axis=0)
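The check-then-drop steps above can be sketched as follows. This is a minimal illustration, assuming a small made-up DataFrame; the column names and values are hypothetical and not taken from the course material.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are made up for illustration.
df = pd.DataFrame({
    "size_sqft": [850.0, np.nan, 1200.0, 950.0],
    "bedrooms": [2.0, 3.0, np.nan, 2.0],
    "price": [150000.0, 210000.0, 260000.0, 170000.0],
})

# True if at least one cell in the frame is missing.
has_nulls = df.isnull().values.any()

# Missing values per column, useful when deciding whether to drop or impute.
nulls_per_column = df.isnull().sum()

# Drop every row that contains at least one missing value.
df_clean = df.dropna(how="any", axis=0)
```

Dropping rows is the simplest option, but in this toy frame it discards half of the data, which is why imputing is often considered instead.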
•Correlation: sometimes two features that are meant to measure different characteristics are influenced by a common mechanism, so they move together.
How to Handle Correlation:
•Remove one of the features
•Apply Principal Component Analysis (PCA)
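Removing one feature of a correlated pair might look like the sketch below. The data, the column names, and the 0.95 threshold are all assumptions chosen for illustration; a suitable threshold depends on the data set.

```python
import pandas as pd

# Hypothetical toy data: the same house size recorded in two units.
df = pd.DataFrame({
    "size_sqft": [850.0, 1100.0, 1200.0, 950.0, 1500.0],
    "bedrooms": [2, 3, 2, 2, 4],
})
df["size_sqm"] = df["size_sqft"] * 0.092903

# Absolute pairwise Pearson correlation between the numeric columns.
corr = df.corr().abs()

# For every pair above the threshold (assumed: 0.95), mark the later column.
threshold = 0.95
cols = list(corr.columns)
to_drop = set()
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if first not in to_drop and corr.loc[first, second] > threshold:
            to_drop.add(second)

# Keep one column from each highly correlated pair.
df_reduced = df.drop(columns=sorted(to_drop))
```

Here size_sqm is an exact rescaling of size_sqft, so it carries no extra information and is the column that gets dropped.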
•Adjusting Data Types - Inspect data types to see if there are any issues. Data should be numeric.
•If required, create new columns from the existing ones
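Inspecting and adjusting data types could be done along these lines. The frame below is hypothetical, and deriving a month column is just one assumed example of creating a new column.

```python
import pandas as pd

# Hypothetical toy data where numbers and dates were read in as text.
df = pd.DataFrame({
    "price": ["150000", "210000", "n/a", "170000"],
    "sold_on": ["2021-01-15", "2021-02-03", "2021-02-20", "2021-03-11"],
})

# Inspect the inferred type of each column.
print(df.dtypes)

# Convert text to numbers; values that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Parse the dates, then derive a new numeric column from them.
df["sold_on"] = pd.to_datetime(df["sold_on"])
df["sold_month"] = df["sold_on"].dt.month
```

Using errors="coerce" turns unparseable entries into missing values, so the null check should be run again after the conversion.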
Missing Data - Ways to Handle
•Drop rows
•Replace values (Impute)
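As an alternative to dropping rows, imputation might be sketched like this. Filling with the mean for numeric columns and the most frequent value for text columns is one common choice assumed here, not the only one; the data is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "size_sqft": [800.0, np.nan, 1200.0, 1000.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Numeric column: replace missing values with the column mean.
size_mean = df["size_sqft"].mean()
df["size_sqft"] = df["size_sqft"].fillna(size_mean)

# Text column: replace missing values with the most frequent value.
city_mode = df["city"].mode()[0]
df["city"] = df["city"].fillna(city_mode)
```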
•Feature scaling is a technique to standardize the independent features present in the data to a fixed range.
•It is performed during data preprocessing to handle features whose magnitudes, values, or units vary widely.
Disadvantage of not scaling:
•Without feature scaling, a machine learning algorithm tends to give more weight to larger values and less weight to smaller values, regardless of their units (e.g. 3000 metres would count as larger than 5 km).
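Two common ways to scale features, min-max scaling and standardization, can be written directly in pandas. This is a minimal sketch on hypothetical data; libraries such as scikit-learn provide equivalent transformers.

```python
import pandas as pd

# Hypothetical toy data with columns on very different scales.
df = pd.DataFrame({
    "size_sqft": [800.0, 1000.0, 1200.0, 1400.0],
    "bedrooms": [1.0, 2.0, 3.0, 4.0],
})

# Min-max scaling: map each column onto the fixed range [0, 1].
df_minmax = (df - df.min()) / (df.max() - df.min())

# Standardization: zero mean and unit (population) standard deviation per column.
df_standard = (df - df.mean()) / df.std(ddof=0)
```

After either transformation both columns live on comparable scales, so neither dominates simply because its raw numbers are larger.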
Label encoding: convert text values to numbers. Use it in the following situations:
•There are only two values for a column in your data. The values then become 0/1, effectively a binary representation
•The values have a relationship with each other where comparisons are meaningful (e.g. low < medium < high)
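Both situations above can be sketched with a plain dictionary mapping in pandas. The column names and categories are hypothetical; the point is that the order is spelled out explicitly rather than left to alphabetical sorting.

```python
import pandas as pd

# Hypothetical toy data: one two-valued column and one ordered column.
df = pd.DataFrame({
    "furnished": ["yes", "no", "yes", "no"],
    "condition": ["low", "high", "medium", "low"],
})

# Two-valued column: map to 0/1.
df["furnished_code"] = df["furnished"].map({"no": 0, "yes": 1})

# Ordered column: state the order explicitly so that low < medium < high holds.
condition_order = {"low": 0, "medium": 1, "high": 2}
df["condition_code"] = df["condition"].map(condition_order)
```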
One-hot encoding:
•Use when there is no meaningful comparison between the values in the column
•Creates a new column for each unique value of the specified feature in the data set
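Creating one column per unique value can be sketched with pandas' get_dummies. The city column below is a hypothetical example of a feature whose values have no meaningful order.

```python
import pandas as pd

# Hypothetical toy data: "city" has no natural ordering.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai", "Pune"],
    "price": [150, 210, 260, 170],
})

# One new indicator column per unique value of "city";
# the original text column is replaced by the indicators.
df_encoded = pd.get_dummies(df, columns=["city"], dtype=int)
```

Each row has a 1 in exactly one of the new city columns, so no artificial ordering between cities is introduced.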
This course is designed to teach the basic concepts of data preprocessing. Anyone can take it; no prior understanding of machine learning is required. The data preprocessing concepts and their implementation in Python are covered in detail.
Data quality is critical to a successful machine learning model, and data preprocessing is a prerequisite for machine learning. Raw data cannot be fed directly into machine learning algorithms. It is important to clean the data, analyze it, and transform it into a form that machine learning algorithms can understand.