
There are many definitions for Data Quality. Here is one of them.
In today’s era of data-driven decision making, data needs to be treated as an organizational asset; data without quality cannot serve any purpose. Data quality is an assessment of data’s fitness for purpose. Data quality is an essential characteristic that determines the reliability of data for making decisions. If the data is not trustworthy, then analytics and reporting that run on the data cannot be trusted.
To put it another way, if you have data quality, your data is capable of delivering the insight you hope to get out of it. Conversely, if you don’t have data quality, there is a problem in your data that will prevent you from using the data to do what you hope to achieve with it.
To illustrate the definition of Data Quality, let’s examine a few examples of real-world data quality challenges using the situations of
Is 100% Data Quality necessary?
Is it possible in the first place?
What are the different steps to be done to achieve 100% Data Quality?
This lecture discusses the different ways to measure Data Quality.
In order for the analyst to determine the scope of the underlying root causes and to plan the ways that tools can be used to address data quality issues, it is valuable to understand these common and core data quality dimensions.
Consistency means that data across all systems reflects the same information and is in sync across the enterprise. Examples:
Questions you can ask yourself: Are data values the same across the data sets? Are there any distinct occurrences of the same data instances that provide conflicting information?
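The consistency questions above can be turned into a simple check. The sketch below is only an illustration: the two systems, the customer IDs, and the helper name `find_conflicts` are hypothetical and not part of the course material.

```python
def find_conflicts(system_a, system_b):
    """Return keys held by both systems whose values disagree."""
    conflicts = {}
    # Only keys present in both systems can conflict with each other.
    for key in sorted(system_a.keys() & system_b.keys()):
        if system_a[key] != system_b[key]:
            conflicts[key] = (system_a[key], system_b[key])
    return conflicts


# Hypothetical e-mail addresses for the same customers in two systems.
crm = {"C001": "ann@example.com", "C002": "bob@example.com"}
billing = {
    "C001": "ann@example.com",
    "C002": "robert@example.com",
    "C003": "cy@example.com",
}

conflicts = find_conflicts(crm, billing)
print(conflicts)
```

Customer `C003` is not reported here because it exists in only one system; that is a completeness question rather than a conflict.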
Is all the requisite information available? Are data values missing, or in an unusable state? In some cases, missing data is irrelevant, but when the information that is missing is critical to a specific business process, completeness becomes an issue.
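One possible way to measure completeness is the share of records that carry a usable value for each field a business process depends on. This is a minimal sketch; the order records, the field names, and the rule that `None` and the empty string count as missing are assumptions made for illustration.

```python
def completeness(records, required_fields):
    """Return, per required field, the fraction of records with a usable value."""
    total = len(records)
    result = {}
    for field in required_fields:
        # Assumption: None and "" both count as missing.
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        result[field] = filled / total if total else 0.0
    return result


# Hypothetical order records.
orders = [
    {"order_id": 1, "email": "ann@example.com", "phone": None},
    {"order_id": 2, "email": "", "phone": None},
    {"order_id": 3, "email": "cy@example.com", "phone": "555-0100"},
    {"order_id": 4, "email": "di@example.com", "phone": None},
]

scores = completeness(orders, ["email", "phone"])
print(scores)
```

Whether a low score matters depends on the process: a missing phone number may be irrelevant for e-mail invoicing but critical for delivery scheduling.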
Timeliness refers to whether information is available when it is expected and needed. Timeliness of data is very important. This is reflected in:
Timeliness depends on user expectations. Online availability of data could be required for a room allocation system in hospitality, but nightly data could be perfectly acceptable for a billing system.
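Because timeliness depends on the expectation of each consumer, the same data can pass for one system and fail for another. The sketch below illustrates this. The timestamps and the two tolerances (5 minutes for room allocation, 24 hours for billing) are hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def is_timely(last_updated, now, max_age):
    """True when the data is no older than the age the consumer tolerates."""
    return now - last_updated <= max_age


# Hypothetical situation: the last load finished 10 hours ago.
now = datetime(2024, 1, 2, 9, 0)
last_load = datetime(2024, 1, 1, 23, 0)

# Assumed tolerances: near real-time for room allocation, nightly for billing.
room_ok = is_timely(last_load, now, timedelta(minutes=5))
billing_ok = is_timely(last_load, now, timedelta(hours=24))
print(room_ok, billing_ok)
```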
What data is missing important relationship linkages? The inability to link related records together may actually introduce duplication across your systems. Moreover, as more value is derived from analyzing connectivity and relationships, the inability to link related data instances together impedes this valuable analysis.
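Missing linkages can be found by looking for records whose reference points at nothing. The following sketch assumes a hypothetical `orders` list that refers to customers through a `customer_id` field; the names and data are illustrative only.

```python
def find_orphans(child_records, parent_ids, foreign_key):
    """Return child records whose foreign key matches no known parent."""
    known = set(parent_ids)
    return [r for r in child_records if r[foreign_key] not in known]


# Hypothetical data: customer C009 does not exist in the customer list.
customers = ["C001", "C002"]
orders = [
    {"order_id": 1, "customer_id": "C001"},
    {"order_id": 2, "customer_id": "C009"},
]

orphans = find_orphans(orders, customers, "customer_id")
print(orphans)
```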
This lecture covers the Validity Data Quality Dimension.
Accuracy is the degree to which data correctly reflects the real-world object or event being described. Examples:
Questions you can ask yourself: Do data objects accurately represent the “real world” values they are expected to model? Are there incorrect spellings of product or person names, addresses, and even untimely or not current data? These issues can impact operational and analytical applications.
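Accuracy can only be measured against something trusted to represent the real world. The sketch below assumes such a reference exists (here a hypothetical product master list) and reports the share of recorded values that match it exactly.

```python
def accuracy_rate(recorded, reference):
    """Return the fraction of recorded values that match the trusted reference.

    Only keys present in the reference are checked; returns None when
    nothing can be checked.
    """
    checked = [key for key in recorded if key in reference]
    if not checked:
        return None
    correct = sum(1 for key in checked if recorded[key] == reference[key])
    return correct / len(checked)


# Hypothetical product names; P2 is misspelled in the recorded data.
recorded = {"P1": "Widget", "P2": "Gadgte", "P3": "Gizmo", "P4": "Doohickey"}
reference = {"P1": "Widget", "P2": "Gadget", "P3": "Gizmo", "P4": "Doohickey"}

rate = accuracy_rate(recorded, reference)
print(rate)
```

Exact matching is a deliberate simplification; real names and addresses usually need normalization or fuzzy comparison before they are compared.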
This lecture covers examples of all the core dimensions we have discussed so far.
A quick review of the difference between Data Quality and Data Governance.
This lecture describes the Data Life Cycle. Although many different data life cycles have been described, this is the one generally accepted among Data Management professionals.
There are three main ways that data can be captured, and these are very important:
Data Maintenance is the focus of a broad range of data management activities. Because of this, Data Governance faces a lot of challenges in this area. Perhaps one of the most important is rationalizing how data is supplied to the end points for Data Synthesis and Data Usage, e.g. preventing proliferation of point-to-point transfers.
Data Derivation is discussed in this lecture.
This lecture discusses how data is used within the enterprise.
This lecture discusses how data is used within the enterprise and how it is shared with third-party vendors.
A data archive is simply a place where data is stored, but where no maintenance, usage, or publication occurs. If necessary, the data can be restored to an environment where one or more of these occur.
A data purge is the removal of every copy of a data item from the enterprise. Ideally, this will be done from an archive. A Data Governance challenge in this phase of the data life cycle is proving that the purge has actually been done properly.
Let's talk about the Data Quality Life Cycle.
Data profiling is an assessment of data values within a given data set for uniqueness, consistency, and logic – the three key data quality metrics.
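A basic profile of a single column might look like the sketch below. It covers only counts, nulls, distinct values, and duplicates, which speaks mainly to the uniqueness metric; the column values and the function name `profile_column` are hypothetical.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: size, missing values, distinct values, duplicates."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        # Values that occur more than once break uniqueness.
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
    }


# Hypothetical identifier column with one repeat and one missing value.
ids = ["A1", "A2", "A2", None, "A3"]

profile = profile_column(ids)
print(profile)
```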
In this lecture, we review the different data types that can be used to profile data.
A quick look at the difference between Data Profiling and Data Mining.
This lecture describes the different types of Data Profiling.
How do Business Expectations differ from Data Quality Expectations? The two sets of expectations should be aligned to form a Data Quality Framework.
This lecture covers the different impacts and costs of not managing Data Quality.
It is quite common for an existing Data Warehouse or Data Lake to already have Data Quality issues. In this lecture we review the different possibilities and how to avoid them.
This lecture discusses how the Enhance, Transform and Calculate phase, or the ETL phase, helps with Data Quality.
Data standardization is the critical process of bringing data into a common format that allows for collaborative research, large-scale analytics, and sharing of sophisticated tools and methodologies.
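As one small illustration of bringing data into a common format, the sketch below converts dates written in several styles to ISO 8601. The list of accepted input formats is an assumption for this example; in particular, reading `05/03/2024` as day/month/year is a choice that has to be confirmed for each source.

```python
from datetime import datetime

# Assumed input formats, tried in order. Day-first is an assumption here.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def standardize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # Try the next format.
    return None


# Hypothetical raw values from different source systems.
raw_dates = ["2024-03-05", "05/03/2024", " 20240305 ", "not a date"]

standardized = [standardize_date(value) for value in raw_dates]
print(standardized)
```

Returning `None` instead of guessing keeps unparseable values visible so they can be corrected at the source.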
Different cases to complete and correct the data are described in this lecture.
This lecture discusses the match and consolidate process that takes place once the data is in the Warehouse.
The different possible Data Quality roles in an enterprise are discussed here.
Data quality is not necessarily data that is devoid of errors. Incorrect data is only one part of the data quality equation. Managing data quality is a never-ending process. Even if a company gets all the pieces in place to handle today’s data quality problems, there will be new and different challenges tomorrow. That’s because business processes, customer expectations, source systems, and business rules all change continuously. To ensure high-quality data, companies need to gain broad commitment to data quality management principles and develop processes and programs that reduce data defects over time.
Much like any other important endeavor, success in data quality depends on having the right people in the right jobs. This course helps you understand key concepts, principles and terminology related to data quality and other areas in data management.