Data Cleaning Techniques in Data Science & Machine Learning

Rating: 3.8 (9 reviews)

Explore all the concepts of Data Cleaning for AI & Data Science to become an expert with this complete online tutorial.

Created by Eduonix Learning Solutions

Last updated 1/2020

English

What you'll learn

Professional ways for handling the data
Standard visualization techniques such as histograms and scatter plots
How to locate discrepancies, and deal with issues

Course content

5 sections • 30 lectures • 4h 54m total length

Identifying the task • 8:57
Identify the task at hand by distinguishing supervised, unsupervised, and semi-supervised learning, and by deciding whether classification, prediction, or clustering applies to the data.
Model building • 7:22
Identify the data task, select the target variable and features, and apply preprocessing to build a model for classification, prediction, or clustering on a dataset like Iris.
Some common solutions • 12:45
Survey common supervised and unsupervised model types, including decision trees, random forests, boosting, regression, neural networks, SVMs, clustering, and PCA.
Training and test data • 13:13
Split data into training and test sets, shuffle with a random state to avoid bias, and use a validation set for hyperparameter tuning to ensure robust model generalization.
Cross validation • 5:03
Apply cross validation by using multiple folds, such as fivefold or tenfold, with leave-one-out for scarce data, to train on most data and test on a held-out fold.
Feature selection • 14:04
Identify irrelevant and redundant features to improve model accuracy and efficiency. Explore filter, wrapper, and embedded feature selection methods, including principal component analysis and feature engineering, to build lean models.
Accuracy measures • 16:22
Explore how confusion matrices support binary classification, and learn to calculate and interpret accuracy, specificity, sensitivity, precision, recall, F1 score, and area under the ROC curve using Python and scikit-learn.
Overfitting • 11:53
Explore how to identify and prevent overfitting by balancing model complexity, handling noise and outliers, and applying train-test splits and feature selection for robust data cleaning in machine learning.
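
As a rough illustration of the accuracy measures named in this section, the sketch below computes them from a confusion matrix in plain Python. The course itself uses scikit-learn; the function names and the label lists here are made up for the example and are not taken from the lectures.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def binary_metrics(y_true, y_pred, positive=1):
    """Derive the usual measures from the four confusion-matrix cells."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    def ratio(numerator, denominator):
        # Return 0.0 instead of dividing by zero on degenerate inputs.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # recall is the same quantity as sensitivity
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Illustrative labels only (1 = positive class, 0 = negative class).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Sensitivity and recall are two names for the same ratio, which is why the sketch computes it once.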

Reading the data • 10:55
This session demonstrates reading data from CSV and Excel files, loading datasets such as the breast cancer data into a data frame, handling headers, and renaming columns.
Structure of the data • 8:50
Examine the structure of the dataset: its size and shape, rows and columns, feature names, data types, and missing or unique values. Learn when to transpose for analysis.
Merging/Splitting • 7:51
Split data into input features and target, filter out personal information and unused columns, and merge datasets on a common ID column to align results for modeling.
Integrity check • 8:20
Learn to verify data integrity by checking completeness, consistency, accuracy, absence of duplicates, and up-to-dateness, while validating features, data types, ranges, and unique identifiers to prepare reliable datasets for analysis.
Knowing the domain • 4:47
Apply domain knowledge to guide model building and feature selection using data descriptions, such as the one for the Auto MPG dataset from the UCI repository, and consult domain experts to interpret features.
Range of variables • 9:29
Explore how to determine a variable's range with min and max, visualize distributions using bar charts, histograms, and box plots, and assess relationships via scatter plots and trend lines.
Inquiring dependencies • 7:47
Analyze how miles per gallon and car weight relate through a scatter plot, correlation coefficient, and polynomial fits of order 1 and 2 in Python using seaborn.
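
To make the dependency analysis above concrete, here is a minimal sketch of a correlation coefficient and an order-1 (straight-line) fit in plain Python. The lecture works in Python with seaborn; the helper names and the weight/mpg values below are invented for illustration and are not rows from the Auto MPG dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (a polynomial of order 1)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up values: heavier cars get fewer miles per gallon.
weights = [2000, 2500, 3000, 3500, 4000]
mpg = [34, 29, 24, 20, 15]

r = pearson_r(weights, mpg)
slope, intercept = fit_line(weights, mpg)
```

A value of `r` close to -1 indicates a strong negative linear relationship, which is what the downward trend line in a scatter plot of these points would show.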

Type of variables • 10:15
Define and classify variables as categorical, descriptive, or numeric. Explain binary, nominal, and ordinal types, and show how to use one-hot encoding for ML models.
More variable types • 17:06
Explore numeric variable types (discrete, continuous, and ordinal) and learn to transform and encode them with discretization and one-hot encoding for machine learning.
Single variable plots • 6:38
Explore how to visualize single-variable data using bar charts, histograms, and density plots, including decisions on bin size, spacing, and when to stack or align multiple variables.
Plotting interrelations • 3:44
Explore how scatter plots visualize relationships between two variables, reveal distributions, show a trend line, and indicate correlation and the strength of dependence.
Measuring correlations • 4:50
Explore how the Pearson correlation coefficient measures the linear relationship between two numeric variables, indicating positive or negative dependence and strength from -1 to 1 via plots and the formula.
Need for transformation • 5:50
Normalize each variable to a 0–1 scale by subtracting the minimum and dividing by the range, and apply a log transformation to skewed data to improve k-NN distances.
Discretizing features • 7:35
Discretize continuous features into intervals to reveal patterns and improve predictions. Select binning strategies (uniform, equal height, or k-means) and choose supervised or unsupervised methods.
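
The normalization and log transformation described in this section can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the skewed sample values are hypothetical, and `math.log1p` (the log of 1 + x) is one possible choice of log transform, picked here because it tolerates zeros.

```python
import math


def min_max_scale(values):
    """Rescale to the 0-1 range: subtract the minimum, divide by the range."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A constant feature has no range; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]


def log_transform(values):
    """Compress large values; assumes every value is greater than -1."""
    return [math.log1p(v) for v in values]


# Made-up skewed feature: one value is far larger than the rest.
skewed = [20, 30, 40, 50, 1000]

scaled_raw = min_max_scale(skewed)
scaled_log = min_max_scale(log_transform(skewed))
```

In `scaled_raw` the four small values are squeezed close to 0 by the single large one, so distances between them nearly vanish; in `scaled_log` they are spread out more evenly, which is the effect a distance-based method such as k-NN benefits from.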

Absurd or Missing values • 5:09
Explore how real-world data collection creates missing and absurd values, from placeholders to default entries, and learn data cleaning, type-correcting, and harmonizing features for robust machine learning.
Finding their distribution in the dataset • 16:35
Explore data cleaning techniques to identify and handle missing values, unify their formats, and visualize the distribution of missing values with Python, Pandas, and Seaborn.
Deciding what to do with them • 6:43
Decide to drop rows or columns with missing values or impute them using mean, median, mode, or nearest neighbor methods, guided by thresholds, data size, and feature type.
Looking for outliers • 8:33
Detect outliers with z-scores and nearest-neighbor methods, assess their impact on model performance, and decide when to exclude or note them during preprocessing and post-processing.
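
Two of the ideas in this section, median imputation and z-score outlier detection, can be sketched with Python's standard `statistics` module. The lectures use pandas and seaborn; the helper names, the sample values, and the threshold of 2.5 below are assumptions made for this example.

```python
import statistics


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


def zscore_outliers(values, threshold=3.0):
    """Return the indices whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > threshold]


# Made-up column with two missing entries and one suspicious value.
ages = [22, 25, None, 24, 23, 26, None, 25, 24, 95]

filled = impute_median(ages)
# In a sample this small a z-score cannot reach 3, so a lower
# threshold is used here; the right cut-off depends on the data.
outliers = zscore_outliers(filled, threshold=2.5)
```

The median is used rather than the mean because the suspicious value would otherwise pull the fill value upward. Whether the flagged entry should be dropped or only noted remains a judgment call, as the lecture description says.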

Exercise-1 • 10:46
Learn data cleaning on the Titanic dataset by loading the training data, inspecting shape and missing values, and dropping non-informative columns to predict survival.
Exercise-2 • 14:39
Visualize variable distributions with bar charts and histograms, apply log-scale transformations to reveal skewness, then explore survival trends by sex and class using color-coded plots.
Exercise-3 • 17:37
Explore data cleaning and feature selection on the breast cancer dataset by applying filter, wrapper, and embedded feature selection methods, handling missing values, and visualizing correlations.
Exercise-4 • 11:11
Impute missing Titanic values using the mean, median, or mode; encode sex, embarked, and cabin as numeric values; then apply k-NN imputation with fancyimpute and prepare features with one-hot encoding.
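
As a small, hedged illustration of the encoding steps mentioned in the last exercise, the sketch below maps a binary column to 0/1 and one-hot encodes a nominal column in plain Python. The three records are invented stand-ins, not actual Titanic rows, and `one_hot` is a hypothetical helper rather than a pandas or scikit-learn function.

```python
def one_hot(rows, column):
    """Replace a nominal column with one 0/1 indicator column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Invented records shaped like two Titanic columns.
passengers = [
    {"sex": "male", "embarked": "S"},
    {"sex": "female", "embarked": "C"},
    {"sex": "female", "embarked": "S"},
]

# A binary variable needs only a single 0/1 column.
sex_codes = {"female": 0, "male": 1}
numeric_sex = [{**row, "sex": sex_codes[row["sex"]]} for row in passengers]

# A nominal variable with no natural order gets one indicator per category.
encoded = one_hot(numeric_sex, "embarked")
```

One-hot encoding avoids implying an order among the embarkation ports, which a single integer code would do.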

Requirements

Basic Knowledge of Python

Description

Data cleaning is one of the most essential parts of data science and machine learning. To get the most out of your data, it must be clean, because unclean data makes ML models harder to train. In ML and data science, data cleaning generally means filtering and modifying your data so that it is easier to explore, understand, and model.

A good statistician or researcher can expect to spend as much as 90% of their time collecting or cleaning data to develop a hypothesis, and only the remaining 10% manipulating the data to analyze it or derive results. Despite this, data cleaning is rarely discussed or taught in detail in data science or ML courses. With the rise of big data and ML, data cleaning has become just as important as modeling.

Why should you learn Data Cleaning?

Improve decision making
Improve efficiency
Increase productivity
Remove errors and inconsistencies from the dataset
Identify missing values
Remove duplicates

Why should you take this course?

Data Cleaning is an essential part of Data Science & AI, and it has become an equally important skill for a programmer. It’s true that you will find hundreds of online tutorials on Data Science and Artificial Intelligence, but only a few of them cover data cleaning, and those that do often give just a basic overview. This online guide to data cleaning is split into 5 sections with nearly 5 hours of video, enough to teach the concepts from the very beginning. Enroll in this course now to learn all the concepts of Data Cleaning.

This course covers the basics of data cleaning, reading data, merging and splitting datasets, visualization tools, and locating and handling missing or absurd values. It also includes hands-on sessions in which you work through datasets to reinforce what you have learned.

Enroll in this course now to learn about data cleaning concepts and techniques in detail!

Who this course is for:

Students who want to learn the basics of Data Cleaning

Course content

Introduction • 8 lectures • 1hr 30min

Playing with the Data • 7 lectures • 58min

Variables and Correlations • 7 lectures • 56min

Missing Values and Outliers • 4 lectures • 37min

Exercises • 4 lectures • 54min