
Explore what data really means and what we can and cannot infer from it, and learn how data quality drives the quality of data-based decisions.
Clarify that data is plural and datum is singular, and emphasize correct subject-verb agreement in data analytics and machine learning communication.
Data quality drives every data-driven decision; rely on high-quality data rather than anecdotal stories, and collect or clean data to improve decision making.
Anticipate data problems by planning analyses and studying similar datasets before collection. Run pilot studies to test the setup and data quality, and seek expert input to improve the design upfront.
Clean and transform collected data to improve quality before analysis. Compute data quality metrics, identify missing, corrupted, or outlier values, and apply removal, interpolation, or scaling as needed.
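As a rough illustration of the cleaning step above, the following sketch computes one simple quality metric (the fraction of missing values) and then compares removal with linear interpolation. The `signal` values are made up for illustration and are not taken from the course materials.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with two missing samples (illustrative data only).
signal = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])

# Simple data quality metric: fraction of values that are missing.
missing_fraction = signal.isna().mean()

# Option 1: remove the missing samples entirely.
removed = signal.dropna()

# Option 2: fill each gap by linear interpolation between its neighbors.
interpolated = signal.interpolate(method="linear")

print(f"Missing fraction: {missing_fraction:.2f}")
print(interpolated.tolist())
```

Removal shortens the series, while interpolation keeps its length but introduces estimated values, so the better choice depends on the analysis that follows.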
Explore how to use data visualization to identify quality issues in data, distinguishing numerical and categorical data, and interpreting bar plots, pie charts, box plots, histograms, and scatter plots.
Explore general data quality features of any dataset with pandas and seaborn, including missing values, duplicates, and correlations, and learn practical cleaning by dropping duplicates and missing values.
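A minimal sketch of these checks in pandas might look like the following. The DataFrame and its column names (`height`, `weight`) are hypothetical, chosen only to show one duplicated row and one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: row 2 repeats row 1, and row 4 has a missing height.
df = pd.DataFrame({
    "height": [1.60, 1.75, 1.75, 1.82, np.nan],
    "weight": [55.0, 72.0, 72.0, 80.0, 68.0],
})

# Count missing values per column.
n_missing = df.isna().sum()

# Count rows that exactly repeat an earlier row.
n_duplicates = df.duplicated().sum()

# Drop repeated rows, then drop any row that still contains a missing value.
clean = df.drop_duplicates().dropna()

# Pairwise Pearson correlations between the numeric columns.
corr = clean.corr()

print(n_missing)
print("Duplicate rows:", n_duplicates)
print(corr)
```

The correlation matrix can also be passed to seaborn's `heatmap` function for a visual overview.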
Apply nonlinear data transformations, such as the log, square root, rank, and Fisher z transforms, to make non-Gaussian data more nearly Gaussian and improve the validity of linear methods.
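The transforms named above can be sketched in a few lines of NumPy and pandas. The input values are invented for illustration, and the Fisher transform is shown in its usual form as the inverse hyperbolic tangent applied to correlation coefficients, which is an assumption about how the course uses it.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive data.
x = pd.Series([1.0, 2.0, 4.0, 8.0, 100.0])

log_x = np.log(x)    # requires strictly positive values
sqrt_x = np.sqrt(x)  # requires non-negative values
rank_x = x.rank()    # 1-based ranks; tied values receive their average rank

# Fisher z-transform: defined for correlation coefficients in (-1, 1).
r = np.array([-0.9, 0.0, 0.5])
fisher_z = np.arctanh(r)

print(rank_x.tolist())
print(fisher_z)
```

Each transform compresses large values differently, so it is worth plotting a histogram before and after to check whether the distribution actually moved closer to Gaussian.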
Apply z-score scaling and min-max scaling to align variable ranges, explore binning and unit normalization, and perform rank transforms in pandas and NumPy for cleaner data quality.
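As a small sketch of the two scaling methods, assuming a one-dimensional NumPy array of made-up values and the sample standard deviation (`ddof=1`):

```python
import numpy as np

# Hypothetical variable on an arbitrary scale.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# z-score scaling: subtract the mean, divide by the sample standard deviation.
z = (x - x.mean()) / x.std(ddof=1)

# Min-max scaling: map the smallest value to 0 and the largest to 1.
minmax = (x - x.min()) / (x.max() - x.min())

print(z)
print(minmax)
```

z-scoring gives a mean of 0 and a standard deviation of 1 but no fixed bounds, whereas min-max scaling fixes the range but is sensitive to extreme values.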
Define outliers and their ambiguity in univariate and multivariate data. Explain how they arise from noise, error, or natural variation. Compare removal with robust analysis approaches that handle leverage.
Identify and remove outliers with the z-score method. Convert data to z-scores, apply a threshold of three standard deviations, and use iterative or modified z-score approaches for non-Gaussian data.
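The z-score method can be sketched as follows. The data are synthetic, and the modified z-score shown here uses the median and the median absolute deviation with the commonly cited constant 0.6745 and threshold 3.5; those two values are conventions assumed for this example and may differ from the ones used in the course.

```python
import numpy as np

# Hypothetical data: 50 evenly spaced values plus one extreme value.
data = np.append(np.linspace(-1, 1, 50), 20.0)

# Standard method: flag points more than 3 standard deviations from the mean.
z = (data - data.mean()) / data.std(ddof=1)
z_outliers = np.abs(z) > 3
cleaned = data[~z_outliers]

# Modified z-score: uses the median and the median absolute deviation (MAD),
# which are less influenced by the outliers themselves.
median = np.median(data)
mad = np.median(np.abs(data - median))
modified_z = 0.6745 * (data - median) / mad  # 0.6745: assumed convention
mz_outliers = np.abs(modified_z) > 3.5       # 3.5: assumed threshold

print("z-score outliers:", data[z_outliers])
print("modified z-score outliers:", data[mz_outliers])
```

An iterative variant would recompute the mean and standard deviation on `cleaned` and repeat until no further points exceed the threshold.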
All of our decisions are based on data. Our sense organs gather data, our memories are data, and our gut instincts are data. If you want to make good decisions, you need to have high-quality data.
This course is about data quality: what it means, why it's important, and how you can increase the quality of your data.
In this course, you will learn:
High-level strategies for ensuring high data quality, including terminology, data documentation and management, and the different research phases in which you can check and increase data quality.
Qualitative and quantitative methods for evaluating data quality, including visual inspection, error rates, and outliers. Python code is provided to show how to implement these visualizations and scoring methods using pandas, NumPy, seaborn, and matplotlib.
Specific methods and algorithms for cleaning data and rejecting bad or unusual data. As above, Python code is provided to show how to implement these procedures using pandas, NumPy, seaborn, and matplotlib.
This course is for:
Data practitioners who want to understand both the high-level strategies and the low-level procedures for evaluating and improving data quality.
Managers, clients, and collaborators who want to understand the importance of data quality, even if they are not working directly with data.