Visualization and Imputation of Missing Data

Learn to create numerous unique visualizations to better understand patterns of missing data in your data sample.
3.9 (11 ratings)
620 students enrolled
$19
$30
37% off
  • Lectures 38
  • Length 5 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion


About This Course

Published 11/2015 English

Course Description

There are many problems associated with analyzing data sets that contain missing data. However, there are various techniques to 'fill in,' or impute, missing data values with reasonable estimates based on the characteristics of the data itself and on the patterns of 'missingness.' Generally, techniques appropriate for imputing missing values in multivariate normal data are not as useful when applied to non-multivariate-normal data. This Visualization and Imputation of Missing Data course focuses on understanding patterns of 'missingness' in a data sample, especially in non-multivariate-normal data sets, and teaches one to use various appropriate imputation techniques to 'fill in' the missing data. Using the VIM and VIMGUI packages in R, the course also teaches how to create dozens of different and unique visualizations to better understand existing patterns of both the missing and imputed data in your samples.
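
To give a flavor of the kind of visualization the VIM package produces, here is a minimal R sketch using VIM's bundled sleep data set (a hedged illustration only; the specific plots and options covered in the lectures may differ):

    # install.packages("VIM")       # one-time install, if needed
    library(VIM)
    data(sleep, package = "VIM")    # mammal sleep data containing missing values

    # Aggregation plot: how much is missing per variable, and in which combinations
    aggr(sleep, numbers = TRUE, prop = c(TRUE, FALSE))

    # Margin plot: scatterplot of two variables with their missing values shown in the margins
    marginplot(sleep[, c("Sleep", "Dream")])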

The course both teaches the concepts and provides software to apply the latest non-multivariate-normal-friendly data imputation techniques, including: (1) Hot-Deck imputation: the sequential and random hot-deck algorithm; (2) the distance-based, k-nearest neighbor imputation approach; (3) individual, regression-based imputation; and (4) the iterative, model-based, stepwise regression imputation technique with both standard and robust methods (the IRMI algorithm). Furthermore, the course trains one to recognize patterns of missingness using many vibrant and varied visualizations created with the professional VIMGUI software included in the course materials and made available to all course participants.
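
For readers who want a preview, the four imputation techniques listed above correspond roughly to the following VIM functions (a hedged sketch based on the VIM package documentation; the arguments shown are common defaults, and the course materials may use different options and data):

    library(VIM)
    data(sleep, package = "VIM")    # example data set with missing values

    # (1) Hot-deck imputation (sequential / random hot-deck)
    imp_hd   <- hotdeck(sleep)

    # (2) Distance-based k-nearest-neighbor imputation
    imp_knn  <- kNN(sleep, k = 5)

    # (3) Individual, regression-based imputation of one variable from complete predictors
    imp_reg  <- regressionImp(Dream ~ BodyWgt + BrainWgt, data = sleep)

    # (4) Iterative, model-based, stepwise regression imputation (IRMI),
    #     with standard or robust estimation
    imp_irmi <- irmi(sleep, robust = TRUE)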

This course is useful to anyone who regularly analyzes large or small data sets that may contain missing data, including graduate students and faculty engaged in empirical research, as well as working professionals engaged in quantitative research and/or data analysis. The visualizations that are taught are especially useful for understanding the types of missingness that may be present in your data and, consequently, how best to deal with the missing data using imputation. The course includes the means to apply the appropriate imputation techniques, especially for non-multivariate-normal data sets, which tend to be the most problematic to impute.

The course author provides, free of charge with the course materials, his own unique VIMGUI toolbar developed with RGtk2 (the R bindings to the GTK+ toolkit). However, please note that both the VIMGUI package for R (developed with RGtk2) and the course author's VIMGUI toolbar application (also developed with RGtk2) may have some problems starting up properly on a Mac computer. So if you only have a Mac available to you, you may have some initial difficulties getting the applications to run properly.

What are the requirements?

  • Students will need to install R software, but ample instructions for doing so are provided.

What am I going to get from this course?

  • Use visualizations created by R software to identify patterns of 'missingness' in data sets and to impute reasonable values to replace the missing data.
  • Recognize and identify the different patterns of missing data and the relative severity of their likely consequences.
  • Learn to use the VIM and VIMGUI R packages to create unique, novel and vibrant images which promote the understanding of patterns of both missing and imputed data in a set of data.
  • Learn the different historical approaches to impute reasonable values for missing data and their relative advantages and disadvantages.
  • Learn the characteristics of: (1) Hot-Deck; (2) K-Nearest Neighbor; (3) Regression-Based; and (4) Iterative, Model-Based, Stepwise Regression (IRMI) imputation techniques to "fill in" missing data and when and how to implement them with provided software.

What is the target audience?

  • This course is useful for anyone analyzing large or small data sets that may contain missing data.
  • The course is useful for graduate students conducting quantitative, empirical research and/or practicing quantitative analytic professionals.
  • Please note that the VIMGUI software is built with the R-specific RGtk2 package (bindings to GTK+), which has been known to be problematic to run on a Mac computer.


Curriculum

Section 1: What is Missing Data? Imputation Approaches
What is this Course all About?
Preview
02:00
Introduction to Course and Materials
Preview
07:40
The Allstat-Gui and Vim-Gui Applications
Preview
03:29
What is Missing Data? (slides, part 1)
08:35

Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data. Missing data can occur because of nonresponse: no information is provided for several items, or no information is provided for a whole unit.

What is Missing Data? (slides, part 2)
04:28
What is Missing Data? (scripts, part 1)
06:57
What is Missing Data? (scripts, part 2)
07:08
Approaches to Handle Missing Data (slides, part 1)
08:15

In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid the pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can then be analysed using standard techniques for complete data.
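
A small R illustration of the contrast between listwise deletion and imputation, sketched with the VIM package's sleep data (used here purely for demonstration; the lecture's own examples may differ):

    library(VIM)
    data(sleep, package = "VIM")

    nrow(sleep)                   # total number of cases in the data set
    sum(complete.cases(sleep))    # cases with no missing values at all

    # Listwise deletion: every case with at least one NA is discarded
    deleted <- na.omit(sleep)
    nrow(deleted)                 # fewer cases remain, which may bias the analysis

    # Imputation (here k-nearest neighbors) fills in estimates and keeps every case
    imputed <- kNN(sleep, k = 5)
    nrow(imputed)                 # same number of cases as the original data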

Approaches to Handle Missing Data (slides, part 2)
04:20
Approaches to Handle Missing Data (slides, part 3)
07:33
Walk through Visualizations with VIMGUI (part 1)
07:22
Walk through Visualizations with VIMGUI (part 2)
05:49
Section 2: Missing and Imputed Visualizations
08:52

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.

k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-NN algorithm is among the simplest of all machine learning algorithms.

Both for classification and regression, it can be useful to assign weight to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.[2]

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

A shortcoming of the k-NN algorithm is that it is sensitive to the local structure of the data.
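
To make the idea concrete, here is a toy, self-contained R sketch of distance-weighted k-NN regression as described above (an illustration only, not the implementation used in the VIM package):

    # Predict a value for one query point x0 from observed pairs (x, y)
    knn_predict <- function(x, y, x0, k = 3) {
      d    <- abs(x - x0)              # distance from the query to every observed point
      near <- order(d)[seq_len(k)]     # indices of the k nearest neighbors
      w    <- 1 / (d[near] + 1e-8)     # inverse-distance (1/d) weights, guarded against d = 0
      sum(w * y[near]) / sum(w)        # weighted average of the neighbors' values
    }

    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
    knn_predict(x, y, x0 = 2.5, k = 3) # lies between the y-values observed near x = 2.5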

VIMGUI Data and Aggregation Plot
Preview
10:20
Customizing the Aggregation Plot
04:23
Histograms and Barplots
12:05
Spinograms and Splineplots
07:53
Boxplots with Imputed Values (part 1)
07:10
Boxplots with Imputed Values (part 2)
06:54
Marginplots and Enhanced Scatterplots
11:05
Section 3: VIM and VIMGUI Features; More Visualizations
Introduction to VIM and VIMGUI (slides)
Preview
09:36
VIM and VIMGUI Imputation Techniques (slides)
11:32
Introduction to K Nearest Neighbors (slides, part 1)
10:14
Introduction to K Nearest Neighbors (slides, part 2)
07:53
Marginplot Matrix Visualization
Preview
07:41
Scatterplot Matrix with Imputed Missings
06:56
Parallel Coordinates Plot
03:40
Matrix Plot
06:01
Mosaic Plot
08:02
Section 4: Reproducing VIM Package Visualizations in VIMGUI
Preparing VIMGUI Visualizations
Preview
08:54
Begin Replicating VIM Aggregation Plot Example
11:07
Continue Aggregation Plot Example
08:21
Finish Replicating VIM Aggregation Plot in VIMGUI
10:32
Histogram and Barplot (part 1)
07:19
Histogram and Barplot (part 2)
05:28
Histogram and Barplot (part 3)
12:29
Spinogram and Splineplot (part 1)
06:51
Spinogram and Splineplot (part 2)
08:11


Instructor Biography

Geoffrey Hubona, Ph.D., Professor of Information Systems

Dr. Geoffrey Hubona held full-time tenure-track, and tenured, assistant and associate professor faculty positions at 3 major state universities in the Eastern United States from 1993-2010. In these positions, he taught dozens of statistics, business information systems, and computer science courses to undergraduate, master's and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL (1993); an MA in Economics (1990), also from USF; an MBA in Finance (1979) from George Mason University in Fairfax, VA; and a BA in Psychology (1972) from the University of Virginia in Charlottesville, VA. He was a full-time assistant professor at the University of Maryland Baltimore County (1993-1996) in Catonsville, MD; a tenured associate professor in the department of Information Systems in the Business College at Virginia Commonwealth University (1996-2001) in Richmond, VA; and an associate professor in the CIS department of the Robinson College of Business at Georgia State University (2001-2010). He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-Present), online educational organizations that teach research methods and quantitative analysis techniques. These techniques include linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling. Dr. Hubona is an expert in the open-source R analytical software suite and in various PLS path modeling software packages, including SmartPLS. He has published dozens of research articles that explain and use these techniques for the analysis of data, and, with software co-development partner Dean Lim, has created a popular cloud-based PLS software application, PLS-GUI.

Ready to start learning?
Take This Course