Data Cleaning in Python

Name: Data Cleaning in Python
Rating: 3.8 (143 reviews)

Preprocessing, structuring and normalizing data

Created by Taimoor khan

Last updated 8/2022

English

What you'll learn

Data cleaning or cleansing as a preprocessing step towards making the data more consistent and high quality before training predictive models.

Course content

10 sections • 65 lectures • 5h 43m total length

Introduction • 2:03
The lecture introduces the course and gives a general overview of what we are going to cover.
Quality of Data • 5:14
In this lecture, we discuss the characteristics of good-quality data. It is important to know these beforehand in order to set the criteria against which we will attempt to improve the quality of our data.
Missing Values, Noise and Outliers • 3:56
In this lecture, we introduce dataset preprocessing and the kinds of issues that can be found in data. Real-world data is expected to have all of these issues. In some of the datasets available online, such issues have already been taken care of, but not in all of them. Therefore, it is important to learn about these issues and rectify them before providing the data to train a model.
Examples of Anomalies • 9:16
In this lecture, we use a small example to discuss anomalies in a dataset, namely missing values and noise values, i.e., univariate outliers.
Instructor • 1:06
About the instructor

2.1.1 Anomaly Detection (Median) • 11:22
In this lecture, we discuss anomaly detection, i.e., the detection of univariate outliers, with the help of the median. The median gives us a range of normality that we can apply to all values of a feature.
2.2.1 Implementing Detection of Missing Values • 3:46
In this lecture, we implement the detection of missing values in the dataset.
2.2.2 Implementing Median based Detection (Global Context) • 3:40
In this lecture, we use the median approach to define the range of normality for the detection of noise values.
2.2.3 Implementing Median based Detection (Local Context) • 6:14
In this lecture, we detect noise values with the help of the median using the local context.
2.1.2 Anomaly Detection (Mean) • 6:08
In this lecture, we discuss detecting noise values with the help of the mean to determine the range of normality.
2.2.4 Implementing Mean based Detection of Noise Values • 5:30
In this lecture, we implement the detection of noise values with the help of the mean, considering both the local and the global context.
2.1.3 Anomaly Detection (Z-score) • 3:27
In this lecture, we discuss and evaluate the detection of a noise value or univariate outlier with the help of the z-score.
2.2.5 Implementing Z-score based Detection • 3:07
In this lecture, we implement the detection of noise values using the z-score based approach.
2.1.4 Anomaly Detection (Interquartile Range) • 4:58
In this lecture, we discuss and evaluate the interquartile range for the detection of noise values.
2.2.6 Implementing Interquartile Range for Noise Detection • 3:30
In this lecture, we implement the interquartile range to define the range of normality. We then compare all the values of the feature against its lower and upper limits to identify noise values that are very much unlike the rest of the values.
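
As a rough illustration of the detection lectures in this section, the sketch below flags missing values and applies the z-score and interquartile-range rules to one feature. It is not the course's own notebook code: the sample values, the z-score cut-off of 3 and the 1.5 x IQR fences are assumptions chosen for the example, and only the Python standard library is used.

```python
import statistics

# Hypothetical feature values; None marks a missing entry.
ages = [23, 25, 24, None, 26, 22, 25, 24, 23, 26, 25, 24, 23, 25, 26, 22, 24, 120]

# Missing values: record the positions where no value is present.
missing_idx = [i for i, v in enumerate(ages) if v is None]
present = [v for v in ages if v is not None]

# Z-score rule: flag values that lie more than z_limit standard
# deviations away from the mean (z_limit = 3 is an assumed cut-off).
z_limit = 3.0
mean = statistics.mean(present)
std = statistics.pstdev(present)
z_outliers = [v for v in present if abs(v - mean) / std > z_limit]

# IQR rule: the range of normality is [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
# (the factor 1.5 is a common convention, assumed here).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [v for v in present if v < lower or v > upper]

print("missing at:", missing_idx)
print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

With this sample, both rules flag only the value 120. The z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so in very small samples it can fail to flag anything; the IQR fences are less sensitive to that effect.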

3.1.1 Approaches to Handle Anomalies • 3:21
In this lecture, we discuss the types of approaches that are used for handling or processing anomalies in the data, particularly missing values and noise values, i.e., univariate outliers.
3.1.2 Deletion Strategy • 2:26
In this lecture, we discuss the deletion strategy as an approach for processing records or features, i.e., rows or columns, that contain anomalies.
3.2.1 Deleting Missing Values • 1:49
In this lecture, we handle the missing values with the deletion strategy, i.e., by getting rid of all the rows with missing values.
3.1.3 Global and Local Context • 3:27
In this lecture, we differentiate between the global and the local context for handling an anomaly. In the global context, all the values in the dataset are considered when handling an anomaly, while in the local (i.e., concept-restricted) context, only the instances that fall into the local context are used.
3.1.4 Replacement Strategy • 4:48
In this lecture, we discuss the replacement strategy for handling anomalies in both the global and the local context.
3.1.5 Statistical Measures • 13:27
In this lecture, we discuss the use of statistical measures such as the median, mode and mean for handling missing and noise values.
3.2.2 Implementing Imputation with Mode • 5:56
In this lecture, we demonstrate how and when to use the mode value for the imputation of a missing or noise value in the dataset.
3.2.3 Implementing Imputation with Median and Mean • 5:04
In this lecture, we make use of the statistical measures mean and median for the imputation of noise values in the dataset.
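
The deletion and replacement strategies described in this section could look roughly like the following in pandas. This is a minimal sketch under assumptions: the data frame, its column names (city, income, owns_car) and the choice of city as the local context are all made up for illustration.

```python
import pandas as pd

# Hypothetical dataset with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "income": [30.0, None, 34.0, 80.0, 90.0, None],
    "owns_car": ["yes", "yes", None, "no", "no", "no"],
})

# Deletion strategy: drop every row that has at least one missing value.
dropped = df.dropna()

# Replacement, global context: use the median of the whole column.
global_fill = df["income"].fillna(df["income"].median())

# Replacement, local context: use the median of the rows that share
# the same city (the assumed local context) as the missing value.
local_median = df.groupby("city")["income"].transform("median")
local_fill = df["income"].fillna(local_median)

# Mode is the natural choice for a categorical feature.
mode_fill = df["owns_car"].fillna(df["owns_car"].mode()[0])

print(dropped)
print(global_fill.tolist())
print(local_fill.tolist())
print(mode_fill.tolist())
```

In this toy data the global median (57.0) sits between the two cities and fits neither, while the local medians (32.0 for city A, 85.0 for city B) stay close to the neighbouring values, which is the motivation for distinguishing the two contexts.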

4.1.1 Multivariate Outliers • 2:50
In this lecture, we discuss multivariate outliers. These are records that contain more than one noise value, so that the whole object or instance is considered to be an outlier. The types of multivariate outliers and their detection techniques are discussed. Because they have issues with more than one feature, they are generally filtered out rather than having their values fixed.
4.1.2 Local Outlier Factor • 10:41
In this lecture, we discuss the use of the Local Outlier Factor for detecting multivariate outliers.
4.2.1 Implementing LOF for Outlier Detection • 6:05
In this lecture, we implement and demonstrate the use of the Local Outlier Factor for the detection of multivariate outliers in a dataset.
4.1.3 Clustering for Multivariate Outlier Detection • 6:32
In this lecture, we briefly introduce clustering and DBSCAN, a clustering technique that is very effective at identifying multivariate outliers. The working mechanism of the algorithm is also explained.
4.2.2 Implementing DBSCAN Clustering for Outlier Detection • 4:52
In this lecture, we implement the DBSCAN algorithm for clustering our data. Our interest, however, is in the points that the algorithm identifies as outliers and keeps out of the clusters.
4.1.3 Data Visualization for Outlier Detection • 5:38
In this lecture, we discuss how data visualization can help in detecting multivariate outliers in the data. We also discuss how to come up with good features that can be used for visualization.
4.2.3 Implementing Data Visualization • 2:52
In this lecture, we implement the visualization by plotting the first two non-ID numeric columns of the dataset. The plot helps to manually inspect the outliers in the data.
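
To make the two detection techniques of this section concrete, here is a small sketch that runs scikit-learn's LocalOutlierFactor and DBSCAN on the same points. The data (a 3 x 3 grid plus one distant point) and the parameter values n_neighbors=5, eps=1.5 and min_samples=3 are assumptions picked for this toy example, not settings taken from the course.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical two-feature dataset: nine points on a tight grid
# and one point (index 9) far away from all of them.
grid = [(x, y) for x in range(3) for y in range(3)]
X = np.array(grid + [(10, 10)], dtype=float)

# Local Outlier Factor: fit_predict returns -1 for outliers and 1 for inliers.
lof = LocalOutlierFactor(n_neighbors=5)
lof_labels = lof.fit_predict(X)
lof_outliers = np.where(lof_labels == -1)[0]

# DBSCAN: points that belong to no cluster receive the label -1.
db_labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)
db_outliers = np.where(db_labels == -1)[0]

print("LOF outlier indices:", lof_outliers.tolist())
print("DBSCAN outlier indices:", db_outliers.tolist())
```

Both methods should report the distant point at index 9. LOF compares the local density of each point with that of its neighbours, whereas DBSCAN simply leaves low-density points unclustered, so on real data the two lists need not agree and both depend heavily on the chosen parameters.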

5.1.1 Normalizing Text Anomalies • 5:27
This lecture introduces the various types of anomalies that can be found in textual data. It also demonstrates an example of what we want to achieve when textual data is passed through these normalization steps.
5.2.1 Lowercase, Whitespaces, Punctuations • 4:00
In this lecture, we implement the removal of whitespace and punctuation and the conversion of the text to lowercase.
5.2.2 Stopwords Removal • 3:16
In this lecture, we implement the removal of stop words from the dataset using the ENGLISH_STOP_WORDS list.
5.1.2 Regular Expressions • 12:53
In this lecture, we briefly discuss regular expression basics that should be sufficient for removing unwanted domain-specific stopwords.
5.2.4 Implementing Regular Expressions for Filtering Stopwords • 2:55
In this lecture, we make use of regular expressions for identifying and removing the domain-specific stopwords.
5.2.3 Stemming and Lemmatization • 8:53
In this lecture, we implement the stemming and lemmatization of words so that the number of features can be reduced while words having similar meanings are grouped together to be represented as a single feature.
Parts-of-speech (POS) Tagging • 4:24
In this lecture, we use the NLTK library to find the part-of-speech labels for the words in a sentence. POS tags serve many other purposes as well and are much more relevant in dialog and question-answering based systems. However, they are also frequently used for filtering out unwanted parts of speech. For example, in sentiment analysis, word types that are not expected to hold user opinions are filtered out by their part of speech rather than by looking for each word individually.
5.2.6 Text Segmentation and Tokenization • 3:53
In this lecture, we implement the separation of text into segments, i.e., sentences, and then of segments into words or tokens. For this purpose, we make use of the NLTK library, which is much more effective at identifying tokens and segments than splitting text on periods (.) and spaces.
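
The basic normalization steps of this section (lowercasing, whitespace and punctuation removal, stop-word filtering and regex-based removal of domain-specific terms) can be sketched with the standard library alone. Everything specific here is an assumption: the tiny STOP_WORDS set merely stands in for the ENGLISH_STOP_WORDS list used in the lecture, and the "ref" followed by digits pattern is a made-up example of a domain-specific stopword. Stemming, lemmatization and POS tagging are left out because they rely on NLTK.

```python
import re
import string

# Tiny stand-in stop-word set (assumption); the lecture uses ENGLISH_STOP_WORDS.
STOP_WORDS = {"the", "is", "a", "and", "of", "to", "in"}

# Hypothetical domain-specific stopwords: reference codes such as "ref123".
DOMAIN_PATTERN = re.compile(r"\bref\d+\b")

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Return the cleaned tokens of a piece of raw text."""
    text = text.lower()                    # lowercase
    text = DOMAIN_PATTERN.sub(" ", text)   # drop domain-specific terms
    text = text.translate(PUNCT_TABLE)     # strip punctuation
    tokens = text.split()                  # split() also collapses whitespace
    return [t for t in tokens if t not in STOP_WORDS]


raw = "  The delivery, REF123, is LATE and the box is damaged!  "
print(normalize(raw))
```

The domain pattern is applied before punctuation is stripped so that the word boundaries around the reference code are still intact when the regular expression runs.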

6.1.1 Structuring Textual Data • 4:46
The lecture introduces the need for structuring textual documents and converting raw text into numerical values that can be easily consumed by machine learning techniques.
6.1.2 Bag-of-Words (BoW) Approach • 3:57
In this lecture, we discuss the bag-of-words (BoW) approach for converting textual data into vectors. It is called the BoW approach because it loses the position-related information about words and retains only their frequency-related information.
6.1.3 Binary and TF-IDF Representation • 7:01
In this lecture, we discuss the representation schemes for converting textual data into a structured format by converting words or tokens to features and representing them through numbers for each document.
6.2.1 Implementing One Document Corpus Representation • 3:00
In this lecture, we take a single document and use a count vectorizer to convert it into a structured format. With such a small dataset, students can see how the words become columns and how the document holds a numeric value for each word, i.e., a feature showing its frequency of occurrence in the document.
6.2.2 Implementing Multi-doc Corpus Representation • 1:14
In this lecture, we extend our study to include more documents in the corpus and see how they can be converted into a structured format. The idea of starting with these small datasets is for students to be able to cross-check and verify the numbers.
6.2.3 Tuning Parameters to Improve Representation • 3:56
In this lecture, we tune the parameters of the representation to find suitable features, and a suitable number of them, for training a good model.
6.2.4 Implementing TF-IDF Representation Scheme • 3:54
In this lecture, we implement the TF-IDF based text representation scheme for structuring textual data. TF-IDF is one of the most commonly used text representation approaches.
6.2.5 Implementing Dummy Dataset Representation • 3:25
In this lecture, the process of vectorizing textual documents is applied to a dummy dataset with a few documents.
6.2.6 Implementing UCI Repository Dataset Representation • 2:28
In this lecture, the process of vectorizing textual documents is applied to a dataset from the UCI Repository.
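
The count, binary and TF-IDF representations discussed in this section can be compared on a corpus small enough to verify by hand. The sketch below uses scikit-learn's CountVectorizer and TfidfVectorizer with their default settings; the two example documents are assumptions, and the course notebooks may configure the vectorizers differently.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical two-document corpus.
docs = ["the cat sat", "the cat sat on the mat"]

# Bag-of-words counts: one column per word, one row per document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Column order follows the index stored for each word in vocabulary_.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

# Binary representation: 1 if the word occurs in the document, else 0.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

# TF-IDF: frequent-in-document words are weighted up,
# words that occur in many documents are weighted down.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs).toarray()

print(vocab)
print(counts)
print(binary)
print(tfidf.round(3))
```

In the second document "the" has a count of 2 but a binary value of 1, and under TF-IDF the word "mat", which appears in only one document, receives a larger weight than "cat", which appears in both.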

7.1.1 Why Feature Scaling • 4:50
In this lecture, we discuss the need for feature scaling.
7.1.2 Feature Normalization (Min Max Scaler) • 4:05
In this lecture, we discuss the normalization scheme for feature scaling and how it is affected by outliers.
7.2.1 Implementing Feature Normalization • 6:21
In this lecture, we demonstrate the implementation of feature scaling using the normalization strategy. It makes use of the MinMaxScaler class from sklearn.preprocessing.
7.1.3 Feature Standardization (Standard Scaler) • 2:11
In this lecture, we discuss how to use the standardization scheme for scaling the values of a feature.
7.2.2 Implementing Feature Standardization • 2:17
In this lecture, we implement the conversion of feature values into their standardized values using StandardScaler from sklearn.preprocessing.
7.1.4 Robust Feature Scaler • 4:13
In this lecture, we discuss another feature scaling scheme that makes use of the interquartile range and is therefore much less affected by outliers.
7.2.3 Implementation of Robust Scaler • 3:06
This lecture demonstrates the implementation of the Robust Scaler using sklearn.preprocessing.
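
The three scalers named in this section can be compared side by side on a single feature that contains one outlier. The sketch assumes default settings for MinMaxScaler, StandardScaler and RobustScaler from sklearn.preprocessing; the feature values are made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical single feature with one outlier (100).
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Normalization: (x - min) / (max - min), mapped to the range [0, 1].
minmax = MinMaxScaler().fit_transform(X)

# Standardization: (x - mean) / standard deviation.
standard = StandardScaler().fit_transform(X)

# Robust scaling: (x - median) / IQR, using the 25th and 75th percentiles.
robust = RobustScaler().fit_transform(X)

print("min-max :", minmax.ravel().round(3))
print("standard:", standard.ravel().round(3))
print("robust  :", robust.ravel().round(3))
```

Min-max scaling squeezes the four ordinary values into a narrow band below 0.04 because the outlier defines the maximum. Robust scaling centres on the median (3) and divides by the IQR (2), so the ordinary values become -1, -0.5, 0 and 0.5 while the outlier remains clearly separated at 48.5.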

8.1.1 Types of Features • 5:29
Structured datasets include two feature types: numeric features and categorical features. Ordinal categories can use integer encoding, while nominal categories require one-hot encoding before machine learning models can use them.
8.2.1 Handling Categorical Ordinal Features • 7:21
In this lecture, we use Pandas to deal with categorical ordinal features by replacing the string values with integers. The pandas function replace() is applied for this purpose to the data frame holding the dataset.
8.2.2 Categorical Nominal Features • 4:54
Categorical nominal features are converted into numeric features using one-hot encoding. This approach converts a feature into as many features as it has unique values. Among the newly added features, only one will have a value of 1 for a given record and the rest will be 0.
8.2.3 Text Sequence Encoding (for Deep Learning Models) • 10:25
Sequence encoding represents data as a sequence of integers so that position- and frequency-based information about the data items is retained. In order to use deep neural networks for text processing, the text is generally encoded as a sequence of integers. This requires assigning a unique integer value to each distinct data item and then replacing the items accordingly. Further, truncation and padding are required to ensure that all documents or objects are of equal length.
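
The three encodings of this section can be sketched as follows. The data frame, the ordering small < medium < large, and the helper encode_sequences are hypothetical. Two substitutions are worth naming: the ordinal step uses Series.map where the lecture uses replace(), and the sequence encoder is a plain-Python stand-in for whatever deep learning tokenizer the course uses, padding with 0 at the end and truncating from the end.

```python
import pandas as pd

# Hypothetical dataset with one ordinal and one nominal feature.
df = pd.DataFrame({
    "size": ["small", "large", "medium"],
    "color": ["red", "green", "blue"],
})

# Ordinal feature: integers that preserve the order of the categories.
size_order = {"small": 1, "medium": 2, "large": 3}
df["size_encoded"] = df["size"].map(size_order)

# Nominal feature: one new 0/1 column per unique value (one-hot encoding).
one_hot = pd.get_dummies(df["color"], prefix="color")


def encode_sequences(docs, max_len):
    """Map each token to an integer id, then pad or truncate to max_len."""
    vocab = {}
    encoded = []
    for doc in docs:
        ids = []
        for token in doc.split():
            if token not in vocab:
                vocab[token] = len(vocab) + 1  # 0 is reserved for padding
            ids.append(vocab[token])
        ids = ids[:max_len]                    # truncate long documents
        ids += [0] * (max_len - len(ids))      # pad short documents
        encoded.append(ids)
    return vocab, encoded


vocab, sequences = encode_sequences(
    ["good movie", "not a good movie at all"], max_len=4
)

print(df)
print(one_hot)
print(sequences)
```

Unlike the bag-of-words representation, the encoded sequences keep word order: the first document becomes [1, 2, 0, 0], and the second is cut to its first four tokens, [3, 4, 1, 2].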

Deductive Learning and Inductive Learning • 2:44
Deductive learning refers to the older expert systems, which required human experts to provide all their expertise in the form of rules. The computer program would use those rules to make intelligent decisions. However, this was laborious and time-consuming. Inductive learning, on the other hand, requires data and computation, both of which have become easily available, particularly due to the decline in hardware prices. In inductive learning, the model receives input and output data and learns patterns from them that are then used for decision-making. The latter is also called machine learning.
Learning from Features • 4:16
In machine learning, and supervised learning to be specific, the model learns by finding patterns in the values of individual features and their impact on the labels. If the value of a feature is on the higher side for one type of label and lower for another, it is preserved as a pattern. The predictions of multiple such patterns are combined to evaluate the label for a test case.
Machine Learning (Introduction) • 8:17
The lecture elaborates on the definition of machine learning to clarify what is meant when we claim that a model is learning. It covers how a model learns from experience or past data and how it is evaluated. The model attempts to improve its performance on the task according to the evaluation criteria provided, iteratively, until it cannot be improved any further in the given setup.
Supervised and Unsupervised Learning • 5:00
The lecture provides an overview of supervised and unsupervised learning. In supervised learning, the model attempts to find a pattern between the feature values and specific class labels. In unsupervised learning, and clustering in particular, the model evaluates the similarity among the feature values of different objects.
Pattern Recognition • 13:13
Identify patterns in two-feature data to separate class A from B, train models for unseen instances, and scale pattern recognition with visualization and data mining and machine learning approaches.
Machine Learning Project Pipeline • 19:18
A machine learning based solution may require many tasks beyond training the model on training data. These tasks depend on the nature of the data, its availability and the nature of the task. The general pipeline for building an end-to-end model that can be used to tackle a real-world problem is elaborated in this lecture. You may pick a specific technique for each module or step of the pipeline, or skip some steps depending on whether they are needed.

Data Acquisition from Webpages • 3:06
In this lecture, we implement the acquisition of useful data from a webpage into Python variables. At times it is desirable to prepare our own dataset, as the publicly available datasets may not cover the kind of analysis that a user wants to perform. Data acquisition may therefore be a task that precedes preprocessing. Using the text-based preprocessing that we discussed in the earlier module, the textual data can be cleaned of code and other unwanted sequences.
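
One possible way to pull the visible text of a webpage into a Python variable is sketched below with the standard library's html.parser. The lecture does not say which library it uses, so this is only an assumed approach; the TextExtractor class and the inline HTML string are hypothetical, and in practice the HTML would first be downloaded (for example with urllib.request) rather than written by hand.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text while skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style, with whitespace collapsed.
        cleaned = " ".join(data.split())
        if cleaned and not self._skip_depth:
            self.parts.append(cleaned)


# Hypothetical page source standing in for a downloaded webpage.
html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Prices</h1><script>var x = 1;</script>"
    "<p>Apples cost 2 dollars.</p></body></html>"
)

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print(text)
```

The extracted string can then be passed through the text normalization steps from the earlier module.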

Requirements

Basics of Python

Description

Data cleaning, or data cleansing, is very important for building intelligent automated systems. It is a preprocessing step that improves the validity, accuracy, completeness, consistency and uniformity of data, and it is essential for building reliable machine learning models that can produce good results. Otherwise, no matter how good the model is, its results cannot be trusted. Beginners in machine learning usually start with publicly available datasets that have already been thoroughly checked for such issues and are therefore ready to be used for training models and getting good results. Real-world data, however, is far from that. Common problems include missing values, noise values or univariate outliers, multivariate outliers, duplicated data, features that need standardizing or normalizing, and categorical features that need encoding. Datasets in raw form that have all of these issues cannot be put to good use without knowing the data cleaning and preprocessing steps. Data acquired directly from multiple online sources for building useful applications is even more exposed to such problems. Learning data cleansing skills therefore helps users perform useful analysis on their business data. The term 'garbage in, garbage out' captures this: unless the issues in the data are sorted out, the results will be unreliable no matter how efficient the model is.

In this course, we discuss the common problems with data coming from different sources. We also discuss and implement ways to resolve these issues effectively. Each concept has three components: theoretical explanation, mathematical evaluation and code. Lectures numbered *.1.* cover the theory and mathematical evaluation of a concept, while lectures numbered *.2.* cover its practical code. In *.1.*, the first (*) refers to the section number, while the second (*) refers to the lecture number within the section. All the code is written in Python using Jupyter Notebook.

Who this course is for:

The target students are beginners to data science and machine learning.


Course content

Introduction • 5 lectures • 22min

Detecting Missing and Noise Values (Univariate Outliers) • 10 lectures • 52min

Handling Missing and Noise Values (Univariate Outliers) • 8 lectures • 40min

Multivariate Outliers • 7 lectures • 40min

Anomalies in Textual Data • 8 lectures • 46min

Structuring Textual Documents • 9 lectures • 34min

Feature Scaling (Normalization) • 7 lectures • 27min

Handling Categorical Features • 4 lectures • 28min

Machine Learning Overview • 6 lectures • 53min

Data Acquisition • 1 lecture • 3min
