What you'll learn
- How to use data cleaning (cleansing) as a preprocessing step that makes data more consistent and of higher quality before training predictive models.
Requirements
- Basics of Python
Description
Data cleaning, or data cleansing, is essential to building intelligent automated systems. It is a preprocessing step that improves the validity, accuracy, completeness, consistency and uniformity of data, and reliable machine learning models depend on it: no matter how good the model is, its results cannot be trusted if the data behind it is flawed. Beginners in machine learning usually start with publicly available datasets that have already been thoroughly analyzed and cleared of such issues, and are therefore ready to be used for training models and getting good results. Real-world data is far from that state.
Common problems include missing values, noisy values or univariate outliers, multivariate outliers, duplicated records, features that need standardizing or normalizing, and categorical features that need special handling. Raw datasets with these issues are of little use without knowledge of the data cleaning and preprocessing steps. Data acquired directly from multiple online sources to build useful applications is even more exposed to such problems. Learning data cleansing skills therefore helps users produce useful analyses of their business data. The phrase 'garbage in, garbage out' captures the point: unless the issues in the data are sorted out, the results will be unreliable, however efficient the model is.
In this course, we discuss the common problems found in data coming from different sources, and we discuss and implement effective ways to resolve them. Each concept has three components: theoretical explanation, mathematical evaluation and code. Lectures numbered *.1.* cover the theory and mathematical evaluation of a concept, while lectures numbered *.2.* cover its practical code. In *.1.*, the first (*) refers to the section number and the second (*) refers to the lecture number within that section. All the code is written in Python using Jupyter Notebook.
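As a rough illustration of several of the problems named in this description (duplicated records, missing values, univariate outliers, standardizing, categorical features), here is a minimal sketch. It is not taken from the course material: the toy records, the column names `age` and `color`, and every helper function are hypothetical, and the specific choices (median imputation, the 1.5 × IQR rule, z-scores, one-hot encoding) are just one common option for each step. It uses only the Python standard library so that it runs with basic Python; multivariate outliers are not covered.

```python
from statistics import mean, median, pstdev, quantiles

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 25, "color": "red"},
    {"age": None, "color": "blue"},   # missing value
    {"age": 31, "color": "red"},
    {"age": 31, "color": "red"},      # exact duplicate of the previous row
    {"age": 29, "color": "blue"},
    {"age": 120, "color": "red"},     # implausibly large value
    {"age": 27, "color": "blue"},
    {"age": 33, "color": "red"},
]


def drop_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def impute_median(rows, column):
    """Replace missing values in `column` with the median of the observed ones."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def drop_iqr_outliers(rows, column, k=1.5):
    """Drop rows whose value lies more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles([r[column] for r in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in rows if low <= r[column] <= high]


def standardize(rows, column):
    """Rescale `column` to zero mean and unit (population) standard deviation.

    Assumes the column is not constant; otherwise the deviation is zero.
    """
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**r, column: (r[column] - mu) / sigma} for r in rows]


def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = int(r[column] == c)
        encoded.append(new)
    return encoded


# One possible ordering of the steps; other orderings can be reasonable too.
cleaned = drop_duplicates(rows)
cleaned = impute_median(cleaned, "age")
cleaned = drop_iqr_outliers(cleaned, "age")
cleaned = standardize(cleaned, "age")
cleaned = one_hot(cleaned, "color")

for row in cleaned:
    print(row)
```

Each helper returns new records instead of modifying its input, so the raw data stays available for comparison after every step.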
Who this course is for:
- The target students are beginners in data science and machine learning.
Instructor
I have been a researcher and academic since 2011, and have a background of around 3 years in professional software development. As an Assistant Professor in the Computer Science faculty, I have taught various courses to undergraduate and graduate students. I am particularly interested in courses related to software design and development, databases, artificial intelligence, machine learning and data mining.
My PhD research is related to data science and computational linguistics. I have worked with large-scale textual data to build knowledge-based systems that are adaptive and evolve with growing needs, without having to be explicitly trained for a specific scenario. I have published papers in internationally recognized journals and conferences, in which we proposed solutions to real-world data analysis issues. I have supervised tens of projects that offered software-based solutions for social content analytics, recommendations and tracking evolving public interests.