
Learn to detect and resolve common data issues in real-world datasets using Python, including missing values, incorrect data types, feature scaling, normalization, handling categorical variables, structural problems, and outliers.
Explore a three-section curriculum on data quality checks, fixing issues with imputation and data types, and NLP preprocessing like tokenization, stop words, and stemming.
Explore a dummy dataset that mimics website demographics, identify missing values and incorrect formats, and learn how to fix issues using Python for machine learning and analytics.
Learn how to perform data quality checks in Python by identifying missing values, duplicates, and data type issues, using Pandas and NumPy to inspect a data frame's shape and info.
Detect missing values to assess data quality by counting missing values per column, tallying rows with missing data, and computing the overall missing percentage relative to the dataset length.
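The three counts described above can be sketched as follows. This is a minimal illustration on a made-up DataFrame; the column names and values are assumptions, not the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "country": ["US", "UK", None, "IN"],
    "monthly_visits": [3, 10, 7, 2],
})

# Missing values per column.
missing_per_column = df.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(df.isna().any(axis=1).sum())

# Share of affected rows relative to the dataset length, in percent.
missing_row_pct = 100 * rows_with_missing / len(df)

print(missing_per_column)
print(rows_with_missing, missing_row_pct)
```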
Detect duplicated rows to assess data quality in raw data. Use the pandas duplicated method to report any duplicates, building confidence in the data and preventing downstream costs.
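A duplicate check of this kind might look like the sketch below, which uses the pandas `duplicated` and `drop_duplicates` methods on an invented DataFrame. The data and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one fully repeated row.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "country": ["US", "UK", "UK", "IN"],
})

# True for every row that repeats an earlier row exactly.
duplicate_mask = df.duplicated()
n_duplicates = int(duplicate_mask.sum())

# Keep the first occurrence of each row and drop the repeats.
deduplicated = df.drop_duplicates()

print(n_duplicates, len(deduplicated))
```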
Learn to handle missing values by imputing with column means using a simple imputer, preserving data rather than dropping rows in data cleaning and preprocessing for machine learning.
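Mean imputation can be sketched with scikit-learn's `SimpleImputer`. The example assumes scikit-learn is installed and uses invented numeric columns; it is one possible setup, not necessarily the course's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with gaps.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "monthly_visits": [4.0, 6.0, np.nan],
})

numeric_cols = ["age", "monthly_visits"]

# Replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy="mean")
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

print(df)
```

All three rows are kept, which is the advantage over dropping rows with missing values.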
Apply the strip method to remove leading and trailing whitespace from text columns like gender and country, ensuring consistent data entries for machine learning preprocessing.
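As a small sketch of the idea, the pandas string accessor `.str.strip()` can be applied column by column. The sample values below are made up.

```python
import pandas as pd

# Hypothetical text columns with stray spaces.
df = pd.DataFrame({
    "gender": [" Male", "Female ", " Male "],
    "country": ["US ", " UK", "US"],
})

# Remove leading and trailing whitespace from each text column.
for col in ["gender", "country"]:
    df[col] = df[col].str.strip()

print(df["gender"].unique(), df["country"].unique())
```

After stripping, " Male" and " Male " collapse into a single category instead of being treated as distinct values.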
Transform date columns from strings to datetime using pandas to_datetime, after inspecting dtypes with info, and fix inconsistent entries like 15th March 2021 to enable reliable machine learning preprocessing.
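One cautious way to handle this is to correct the known inconsistent entry first and then convert with an explicit format. The column name `date` and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical date column stored as strings, with one inconsistent entry.
df = pd.DataFrame({"date": ["2021-03-12", "2021-03-14", "15th March 2021"]})

# Rewrite the inconsistent entry into the format used by the other rows.
df["date"] = df["date"].replace({"15th March 2021": "2021-03-15"})

# Convert the strings to datetime; an explicit format keeps parsing predictable.
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d")

df.info()
```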
Identify and fix data types by converting the age column from float to integer with astype, verify with info, and convert other columns to string as needed.
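The conversion can be sketched with `astype` as below. The column names are illustrative, and the sketch assumes missing values have already been handled, since a float column containing NaN cannot be cast to a plain integer type.

```python
import pandas as pd

# Hypothetical columns; age arrives as float, customer_id as integer.
df = pd.DataFrame({
    "age": [25.0, 31.0, 40.0],
    "customer_id": [101, 102, 103],
})

# Cast age from float to integer (requires no NaN values in the column).
df["age"] = df["age"].astype(int)

# Treat the identifier as text rather than as a number.
df["customer_id"] = df["customer_id"].astype(str)

df.info()
```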
Learn to use the apply method with a lambda function in pandas to classify monthly visits as normal or active by applying a function to each element and creating a new label column.
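A minimal sketch of this labeling step follows. The threshold of 10 visits and the column names are assumptions chosen for the example, not values taken from the course.

```python
import pandas as pd

# Hypothetical visit counts.
df = pd.DataFrame({"monthly_visits": [2, 15, 8, 30]})

# Assumed cutoff: more than 10 visits per month counts as "active".
threshold = 10

# Apply the lambda to each element and store the result in a new column.
df["visitor_type"] = df["monthly_visits"].apply(
    lambda visits: "active" if visits > threshold else "normal"
)

print(df)
```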
Learn to detect and remove outliers in Python data cleaning by using z-scores and absolute values on value, monthly visits, and item purchased, while preserving customer ID and gender.
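The z-score filter can be sketched in plain pandas as shown below. The data, the column names, and the cutoff of 3 are assumptions. Note that pandas computes the sample standard deviation by default, so the scores can differ slightly from those of other libraries.

```python
import pandas as pd

# Hypothetical data: 19 ordinary rows plus one extreme monthly_visits value.
visits = [10, 12, 11, 9, 10, 11, 12, 10, 9, 11,
          10, 12, 11, 9, 10, 11, 12, 10, 9, 500]
df = pd.DataFrame({
    "customer_id": range(1, 21),
    "gender": ["F", "M"] * 10,
    "monthly_visits": visits,
    "items_purchased": [1, 2, 3, 2, 1] * 4,
})

# Only numeric measurement columns are scored; identifiers and categories are kept as-is.
numeric_cols = ["monthly_visits", "items_purchased"]

# z-score: distance from the column mean in units of standard deviation.
z_scores = (df[numeric_cols] - df[numeric_cols].mean()) / df[numeric_cols].std()

# Keep rows whose absolute z-score is below the assumed cutoff in every scored column.
keep = (z_scores.abs() < 3).all(axis=1)
cleaned = df[keep]

print(len(df), len(cleaned))
```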
Scale all features to a 0–1 range using a min-max scaler after dropping customer ID and date, then fit and transform the data for machine learning.
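A sketch of this step with scikit-learn's `MinMaxScaler` is shown below. It assumes scikit-learn is installed and that the remaining columns are numeric; the data is invented.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset with an identifier, a date, and two numeric features.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "date": pd.to_datetime(["2021-03-12", "2021-03-14", "2021-03-15"]),
    "age": [20, 30, 40],
    "monthly_visits": [0, 5, 10],
})

# Identifier and date columns are not meaningful to scale, so drop them first.
features = df.drop(columns=["customer_id", "date"])

# Fit the scaler and map each column onto the 0 to 1 range.
scaler = MinMaxScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(features),
    columns=features.columns,
    index=features.index,
)

print(scaled)
```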
Clean and preprocess textual data for NLP by loading a corona tweet dataset with pandas, handling encoding, and extracting original tweets and sentiment for NLP tasks using NLTK.
Learn how tokenization converts tweets into tokens for sentiment analysis by breaking text into words and numbers, filtering out special characters, and applying a tokenization function to a data column.
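To illustrate the idea without extra downloads, the sketch below uses a regular expression in place of an NLTK tokenizer. The `tokenize` helper, the `OriginalTweet` column name, and the sample tweets are all hypothetical.

```python
import re

import pandas as pd


def tokenize(text):
    """Split text into runs of letters and digits, dropping special characters."""
    return re.findall(r"[A-Za-z0-9]+", text)


# Hypothetical tweets; the column name is an assumption.
tweets = pd.DataFrame({
    "OriginalTweet": [
        "Stay home, stay safe!!! #COVID19",
        "Prices up 20% @ the store...",
    ]
})

# Apply the tokenization function to every row of the text column.
tweets["tokens"] = tweets["OriginalTweet"].apply(tokenize)

print(tweets["tokens"].tolist())
```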
More often than not, real-world data is messy and can rarely be used directly. It needs a lot of cleaning and preprocessing before it can be used in analytics, machine learning, or other applications. Data cleaning can be a dirty job, which often requires lots of effort and technical skills such as familiarity with Pandas and other libraries.
For most data cleaning, all you need is data manipulation skills in Python, and this course teaches exactly that. It includes lectures, quizzes, and Jupyter notebooks that teach you to deal with real-world raw data, with tutorials on a range of data cleaning techniques such as imputing missing values, feature scaling, and fixing data type issues.
In this course you will learn:
How to detect and deal with missing values in the data.
How to detect and rectify incorrect data types.
How to deal with Categorical Columns.
How to detect and replace incorrect values with correct ones.
How to use the apply method with lambda functions for advanced cleaning.
How to group the dataset by a particular column.
How to detect and remove outliers.
How to perform feature scaling.
How to clean and preprocess textual data for NLP.
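Of the topics listed above, grouping by a column is simple to illustrate in a few lines. The sketch below uses the pandas `groupby` method on invented data; the column names are assumptions.

```python
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame({
    "country": ["US", "UK", "US", "IN"],
    "monthly_visits": [4, 6, 8, 2],
})

# Group rows by country and compute the average visits within each group.
mean_visits = df.groupby("country")["monthly_visits"].mean()

print(mean_visits)
```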