Applied Multivariate Analysis with R

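Learn to use R software to conduct principal components analysis (PCA), multidimensional scaling (MDS), cluster analysis and exploratory factor analysis (EFA), and to estimate structural equation models (SEM).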

Created by Geoffrey Hubona, Ph.D.

Last updated 7/2020

English

What you'll learn

Conceptualize and apply multivariate skills and "hands-on" techniques using R software in analyzing real data.
Create novel and stunning 2D and 3D multivariate data visualizations with R.
Set up and estimate a Principal Components Analysis (PCA).
Formulate and estimate a Multidimensional Scaling (MDS) problem.
Group similar (or dissimilar) data with Cluster Analysis techniques.
Estimate and interpret an Exploratory Factor Analysis (EFA).
Specify and estimate a Structural Equation Model (SEM) using RAM notation in R.
Be knowledgeable about SEM simulation capabilities from the R SIMSEM package.

Course content

7 sections • 75 lectures • 12h 13m total length

Introduction to Multivariate Analysis (MVA) Course 11:40
This video presents an overview of the Applied Multivariate Analysis (MVA) course.
Materials for Section 1 Introduction to MV Data and Analysis 2:25
The materials used in the video lectures for Section 1 Introduction to Multivariate Data and Analysis are briefly explained and then provided as a .zip file download after the short video is presented.
What is "Multivariate Analysis"? 14:15
Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves the observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. Some of the applications include:

• To reduce a large number of variables to a smaller number of factors for data modeling.

• To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items that cross-load on more than one factor.

• To select a subset of variables from a larger set, based on which original variables have the highest correlations with some other factors.

• To create a set of factors to be treated as uncorrelated variables, as one approach to handling multicollinearity in procedures such as multiple regression.

In this "hands-on" course on applied multivariate analysis, we focus on how to actually use and conduct MVA analyses, using dozens of real data sets and R software. We examine the techniques and examples of principal components analysis, multidimensional scaling, cluster analysis, exploratory factor analysis, and an introduction to structural equation modeling.
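As a rough illustration of the first and last applications listed above (reducing many variables to a few components, and obtaining uncorrelated variables), here is a minimal sketch using base R's prcomp(). It assumes the built-in USArrests data set, which is not one of the course data sets, and the object names are chosen only for this example.

```r
# Illustrative data: the built-in USArrests data set (an assumption for this
# sketch; it is not taken from the course materials).
X <- USArrests

# PCA on standardized variables, i.e. on the correlation matrix.
pca <- prcomp(X, scale. = TRUE)

# Proportion of the total variance carried by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# The component scores are mutually uncorrelated, so they could stand in for
# collinear predictors in a later regression.
score_cor <- cor(pca$x)
print(round(score_cor, 3))
```

If the first one or two components carry most of the variance, they can serve as a lower-dimensional summary of the original variables.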
Missing Values and the Measure Dataset 8:20
Missing data is a serious problem in analyzing data sets because many statistical and mathematical functions fail, or return a missing result, when any observation has even one missing value. We explain and demonstrate why this is a problem using a 'body measures' dataset that we construct in R, and we show some "quick fixes" for getting around missing data in multivariate analysis.
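The kind of problem described above, and two common quick fixes, might look like the following sketch. The data frame and its values are hypothetical stand-ins for the course's 'body measures' data, and the object names are assumptions made for this example.

```r
# Hypothetical body-measures data (values invented for illustration only).
measure <- data.frame(
  chest = c(34, 37, 38, NA, 38),
  waist = c(30, 32, 30, 33, NA),
  hips  = c(32, 37, 36, 39, 33)
)

# A single NA makes the default summary missing as well.
print(mean(measure$chest))
print(cor(measure))

# Quick fix 1: ask the function to drop missing values.
mean_chest <- mean(measure$chest, na.rm = TRUE)

# Quick fix 2: keep only the rows with no missing values.
complete <- measure[complete.cases(measure), ]

# For correlation matrices, 'use' selects the deletion strategy:
# listwise (complete rows only) or pairwise (per pair of variables).
cor_complete <- cor(measure, use = "complete.obs")
cor_pairwise <- cor(measure, use = "pairwise.complete.obs")
print(round(cor_pairwise, 2))
```

Listwise deletion discards every row that has any missing value, while pairwise deletion uses all rows available for each pair of variables, so the two matrices generally differ.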
Other Multivariate Datasets 10:11
We create several multivariate data sets using R software. We use these data sets and others in the rest of the course.
Covariance, Correlation and Distance (part 1) 11:36
In probability theory and statistics, a covariance matrix (also known as dispersion matrix or variance–covariance matrix) is a matrix whose element in the i, j position is the covariance between the i-th and j-th elements of a random vector (that is, of a vector of random variables). Each element of the vector is a scalar random variable, either with a finite number of observed empirical values or with a finite or infinite number of potential values specified by a theoretical joint probability distribution of all the random variables.

The correlation matrix of n random variables X_1, ..., X_n is the n × n matrix whose i, j entry is corr(X_i, X_j). If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of the standardized random variables X_i / σ(X_i) for i = 1, ..., n. This applies to both the matrix of population correlations (in which case "σ" is the population standard deviation), and to the matrix of sample correlations (in which case "σ" denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix.

The correlation matrix is symmetric because the correlation between X_i and X_j is the same as the correlation between X_j and X_i.
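The relationships described above can be checked numerically. The following is a minimal sketch using base R and the built-in USArrests data (an assumption for illustration, not a course data set).

```r
# Illustrative numeric data (assumed for this sketch).
X <- as.matrix(USArrests)

S <- cov(X)   # sample covariance matrix
R <- cor(X)   # sample (product-moment) correlation matrix

# Rescaling the covariance matrix by the standard deviations gives R.
R_from_S <- cov2cor(S)

# The covariance matrix of the standardized variables is also R.
# scale() also centers each column, which does not change covariances.
Z <- scale(X)
R_from_Z <- cov(Z)

# R is symmetric, and its eigenvalues are non-negative
# (positive semidefinite).
eig <- eigen(R, symmetric = TRUE, only.values = TRUE)$values
print(round(R, 3))
print(round(eig, 3))
```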
Covariance, Correlation and Distance (part 2) 10:21
We continue our discussion of creating, estimating and using both covariance and correlation matrices in multivariate analysis using R software. We also introduce the concept of "distance" for finding similarities / differences among sets of variables.
Covariance, Correlation and Distance (part 3) 10:12
We continue our discussion of creating, estimating and using both covariance and correlation matrices in multivariate analysis using R software. We also introduce the concept of "distance" for finding similarities / differences among sets of variables.
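One common way to express "distance" between observations in R is the Euclidean distance computed by dist(). The sketch below assumes the built-in USArrests data and standardizes the variables first so that no single variable dominates; the lectures may use other data and other distance measures.

```r
# First five states only, standardized column by column (illustrative choice).
X <- scale(USArrests[1:5, ])

# Pairwise Euclidean distances between the rows.
d <- dist(X, method = "euclidean")
D <- as.matrix(d)
print(round(D, 2))

# The same distance computed by hand for the first two rows.
manual <- sqrt(sum((X[1, ] - X[2, ])^2))
print(manual)
```

Small distances indicate similar observations; the same distance matrix can later feed multidimensional scaling or hierarchical clustering.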
The Multivariate Normal Density Function 11:28
We describe, create (with simulation), demonstrate and visualize a multivariate normal (MVN) density function using R. In probability theory and statistics, the multivariate normal distribution or multivariate Gaussian distribution, is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One possible definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.
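A minimal sketch of simulating from a bivariate normal distribution with base R follows. The mean vector, covariance matrix, helper function dmvn() and all object names are assumptions chosen for this example, and the Cholesky approach shown is one of several ways to generate such data.

```r
set.seed(1)

n     <- 5000
mu    <- c(0, 2)                          # assumed mean vector
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 2.0), nrow = 2)    # assumed covariance matrix

# If Z has independent N(0, 1) columns and U = chol(Sigma), so that
# t(U) %*% U = Sigma, then Z %*% U has covariance Sigma.
Z <- matrix(rnorm(n * 2), nrow = n)
X <- sweep(Z %*% chol(Sigma), 2, mu, "+")

print(round(colMeans(X), 2))
print(round(cov(X), 2))

# Any linear combination a'X is univariate normal with mean a'mu
# and variance a' Sigma a.
a <- c(1, -1)
y <- drop(X %*% a)
print(c(mean = mean(y), var = var(y)))

# Hypothetical helper: the MVN density evaluated at a single point x.
dmvn <- function(x, mu, Sigma) {
  k <- length(mu)
  q <- mahalanobis(x, mu, Sigma)          # squared Mahalanobis distance
  exp(-0.5 * q) / sqrt((2 * pi)^k * det(Sigma))
}
print(dmvn(mu, mu, Sigma))                # density at the mean
```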
Setting Up Normality Plots 10:03
We demonstrate several R software graphical approaches to test for univariate and multivariate normality.
Drawing Normality Plots 13:41
We continue our illustrative cases and examples of creating normality plots in R software.
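Two widely used graphical checks are a normal Q-Q plot for a single variable and a chi-square plot of squared Mahalanobis distances for several variables jointly. The sketch below uses base R and the built-in USArrests data as an assumed example; it is not necessarily the approach taken in the lectures.

```r
# Illustrative data (assumed for this sketch).
X <- as.matrix(USArrests)
n <- nrow(X)
p <- ncol(X)

# Univariate check: normal Q-Q plot for one variable.
qqnorm(X[, "Murder"], main = "Normal Q-Q plot: Murder")
qqline(X[, "Murder"])

# Multivariate check: under multivariate normality, the squared Mahalanobis
# distances are approximately chi-square distributed with p degrees of freedom.
d2    <- mahalanobis(X, colMeans(X), cov(X))
chi_q <- qchisq(ppoints(n), df = p)

plot(chi_q, sort(d2),
     xlab = "Chi-square quantiles",
     ylab = "Ordered squared Mahalanobis distances",
     main = "Chi-square Q-Q plot")
abline(0, 1)
```

Points that fall well above the reference line at the upper end suggest outliers or departures from multivariate normality.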
Covariance, Correlation and Normality Exercises 7:09
This video lecture explains the three covariance, correlation and normality exercises for the first section of the applied MVA course.

Materials and Exercises for Visualizing Multivariate Data Section 2:46
The materials used in the video lectures for Section 2 Visualizing Multivariate Data are briefly explained and then provided as a .zip file download after the short video is presented.
Covariance and Correlation Matrices with Missing Data (part 1) 8:49
Covariance and Correlation Matrices with Missing Data (part 2) 10:45
Univariate and Multivariate QQPlots of Pottery Data 9:38
Converting Covariance to Correlation Matrices 15:20
Plots for Marginal Distributions 15:35
Outlier Identification 16:29
Chi, Bubble, and other Glyph Plots 14:12
Scatterplot Matrix 7:56
Kernel Density Estimators 13:41
3-Dimensional and Trellis (Lattice Package) Graphics 15:26
More Trellis (Lattice Package) Graphics 14:04
Bivariate Boxplot and ChiPlot Visualizations Exercises 2:16

Materials for Principal Components Analysis (PCA) Section 0:44
The materials used in the video lectures for Section 3 Principal Components Analysis (PCA) are briefly explained and then provided as a .zip file download after the short video is presented.
Bivariate Boxplot Visualization Exercise Solution 14:48
ChiPlot Visualization Exercise Solution 3:40
What is a "Principal Components Analysis" (PCA)? 11:13
PCA Basics with R: Blood Data (part 1) 9:18
PCA Basics with R: Blood Data (part 2) 10:51
PCA with Head Size Data (part 1) 8:01
PCA with Head Size Data (part 2) 9:31
PCA with Heptathlon Data (part 1) 7:40
PCA with Heptathlon Data (part 2) 10:03
PCA with Heptathlon Data (part 3) 13:04
PCA Criminal Convictions Exercise 1:21

Materials for Multidimensional Scaling Section 0:56
The materials used in the video lectures for Section 4 Multidimensional Scaling (MDS) are briefly explained and then provided as a .zip file download after the short video is presented.
PCA Criminal Convictions Exercise Solution 14:29
Introduction to Multidimensional Scaling 13:31
Classical Multidimensional Scaling (part 1) 14:50
Classical Multidimensional Scaling (part 2) 8:47
Classical Multidimensional Scaling: Skulls Data 17:46
Non-Metric Multidimensional Scaling Example: Voting Behavior 14:24
Non-Metric Multidimensional Scaling Example: WW II Leaders 9:08
Multidimensional Scaling Exercise: Water Voles 2:48

Materials for Cluster Analysis Section 1:13
The materials used in the video lectures for Section 5 Cluster Analysis are briefly explained and then provided as a .zip file download after the short video is presented.
MDS Water Voles Exercise Solution 13:55
Introduction to Cluster Analysis 10:50
Hierarchical Clustering Distance Techniques 10:35
Hierarchical Clustering of Measures Data 12:40
Hierarchical Clustering of Fighter Jets 10:15
K-Means Clustering of Crime Data (part 1) 13:06
K-Means Clustering of Crime Data (part 2) 6:36
Clustering of Romano-British Pottery Data 14:50
K-Means Classifying of Exoplanets 13:20
Model-Based Clustering of Exoplanets 12:34
Finite Mixture Model-Based Analysis 13:04
Cluster Analysis Neighborhood and Stripes Plots 10:07
K-Means Cluster Analysis Crime Data Exercise 0:35

Materials for Exploratory Factor Analysis (EFA) Section 0:40
The materials used in the video lectures for Section 6 Exploratory Factor Analysis are briefly explained and then provided as a .zip file download after the short video is presented.
K-Means Crime Data Exercise Solution 9:20
The solution to the K-Means exercise using the crime data is explained.
Introduction to Exploratory Factor Analysis (EFA) 14:38
In multivariate statistics, exploratory factor analysis (EFA) is a statistical method used to uncover the underlying structure of a relatively large set of variables. EFA is a technique within factor analysis whose overarching goal is to identify the underlying relationships between measured variables. It is commonly used by researchers when developing a scale (a scale is a collection of questions used to measure a particular research topic) and serves to identify a set of latent constructs underlying a battery of measured variables. It should be used when the researcher has no a priori hypothesis about factors or patterns of measured variables. Measured variables are any one of several attributes of people that may be observed and measured. An example of a measured variable would be the physical height of a human being. Researchers must carefully consider the number of measured variables to include in the analysis. EFA procedures are more accurate when each factor is represented by multiple measured variables in the analysis.
The factanal() Function Explained 7:34
The factanal() function in R performs maximum-likelihood factor analysis on a covariance matrix or data matrix.
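A minimal sketch of calling factanal() on a data matrix is shown below. The data are simulated here from two hypothetical latent factors purely for illustration; the variable names, loadings and sample size are assumptions, not course material.

```r
set.seed(42)
n <- 500

# Two hypothetical latent factors, each driving three observed variables.
f1 <- rnorm(n)
f2 <- rnorm(n)
dat <- data.frame(
  x1 = 0.8 * f1 + rnorm(n, sd = 0.6),
  x2 = 0.7 * f1 + rnorm(n, sd = 0.7),
  x3 = 0.6 * f1 + rnorm(n, sd = 0.8),
  x4 = 0.8 * f2 + rnorm(n, sd = 0.6),
  x5 = 0.7 * f2 + rnorm(n, sd = 0.7),
  x6 = 0.6 * f2 + rnorm(n, sd = 0.8)
)

# Maximum-likelihood factor analysis with two factors and varimax rotation.
fit <- factanal(dat, factors = 2, rotation = "varimax")

# Loadings (small values suppressed) and uniquenesses.
print(fit$loadings, cutoff = 0.3)
print(round(fit$uniquenesses, 3))
```

When only a covariance or correlation matrix is available, the covmat and n.obs arguments can be supplied in place of the data matrix.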
EFA Life Data Example 14:47
An example of estimating an EFA using R software with the life data provided in the materials.
EFA Drug Use Data Example 16:20
An example of estimating an EFA using R software with the drug use data provided in the materials.
Comparing EFA with Confirmatory Factor Analysis (CFA) 8:17
Both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are employed to understand shared variance of measured variables that is believed to be attributable to a factor or latent construct. Despite this similarity, however, EFA and CFA are conceptually and statistically distinct analyses.

The goal of EFA is to identify factors based on data and to maximize the amount of variance explained. The researcher is not required to have any specific hypotheses about how many factors will emerge, and what items or variables these factors will comprise. If these hypotheses exist, they are not incorporated into and do not affect the results of the statistical analyses. By contrast, CFA evaluates a priori hypotheses and is largely driven by theory. CFA analyses require the researcher to hypothesize, in advance, the number of factors, whether or not these factors are correlated, and which items/measures load onto and reflect which factors. As such, in contrast to exploratory factor analysis, where all loadings are free to vary, CFA allows for the explicit constraint of certain loadings to be zero.

EFA is sometimes reported in research when CFA would be a better statistical approach. It has been argued that CFA can be restrictive and inappropriate when used in an exploratory fashion. However, the idea that CFA is solely a “confirmatory” analysis may sometimes be misleading, as modification indices used in CFA are somewhat exploratory in nature. Modification indices show the improvement in model fit if a particular coefficient were to become unconstrained. Likewise, EFA and CFA do not have to be mutually exclusive analyses; EFA has been argued to be a reasonable follow up to a poor-fitting CFA model.
EFA Exercise 2:43
The correlation matrix given below represents the grading scores of 220 boys in six school subjects:

(1) French; (2) English; (3) History; (4) Arithmetic; (5) Algebra and (6) Geometry.

Find the two-factor solution from a maximum likelihood factor analysis. Interpret the factor loadings. Then plot these derived loadings and interpret again. Was it easier to interpret the factors by looking at the visualization? Finally, find a non-orthogonal (oblique) rotation that allows easier interpretation of the results by looking at the factor loadings directly, without the "visual utility" that is afforded by plotting the two-factor solution first.

#             French  English  History  Arithmetic  Algebra  Geometry
# French        1.00
# English       0.44     1.00
# History       0.41     0.35     1.00
# Arithmetic    0.29     0.35     0.16        1.00
# Algebra       0.33     0.32     0.19        0.59     1.00
# Geometry      0.25     0.33     0.18        0.47     0.46      1.00
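One possible way to set up this exercise in R is sketched below: the lower triangle listed above is entered into a full symmetric matrix and passed to factanal() through its covmat argument. The object names are assumptions, "promax" is used as one example of an oblique rotation, and this is a starting point rather than the course's own solution script.

```r
subjects <- c("French", "English", "History",
              "Arithmetic", "Algebra", "Geometry")

# Fill the lower triangle column by column, then mirror it.
R <- diag(6)
R[lower.tri(R)] <- c(0.44, 0.41, 0.29, 0.33, 0.25,   # with French
                     0.35, 0.35, 0.32, 0.33,         # with English
                     0.16, 0.19, 0.18,               # with History
                     0.59, 0.47,                     # with Arithmetic
                     0.46)                           # Geometry with Algebra
R <- R + t(R) - diag(6)
dimnames(R) <- list(subjects, subjects)

# Two-factor maximum-likelihood solution, unrotated.
fit_none <- factanal(covmat = R, factors = 2, n.obs = 220, rotation = "none")
print(fit_none$loadings)

# Plot the unrotated loadings to help interpretation.
L <- fit_none$loadings[, 1:2]
plot(L[, 1], L[, 2], type = "n", xlab = "Factor 1", ylab = "Factor 2")
text(L[, 1], L[, 2], labels = subjects)
abline(h = 0, v = 0, lty = 2)

# An oblique (non-orthogonal) rotation.
fit_promax <- factanal(covmat = R, factors = 2, n.obs = 220,
                       rotation = "promax")
print(fit_promax$loadings, cutoff = 0.3)
```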

Introduction to the SEM, QGraph and SIMSEM Course Section with Materials 2:53
Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). qgraph is a package that can be used to plot several types of graphs. It is mainly aimed at visualizing relationships in (psychometric) data as networks to create a clear picture of what the data actually looks like. SIMSEM is an R package developed for facilitating simulation and analysis of data within the structural equation modeling (SEM) framework.
Exploratory Factor Analysis (EFA) Exercise Solution 9:15
Solutions to the EFA exercise are provided in R scripts.
Specify and Estimate Drug Use SEM Model 11:56
Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). Specification means stating a model by formulating a statement about a set of parameters. A critical principle in model specification and evaluation is that all of the models we would be interested in specifying and evaluating are wrong to some degree. We must therefore define as an optimal outcome a finding that a particular model fits our observed data closely and yields a highly interpretable solution. Because we never consider all possible models, a finding that a particular model fits the observed data well and yields an interpretable solution can be taken to mean only that the model provides one plausible representation of the structure that produced the observed data.
Specify and Estimate Alienation SEM Model 5:15
Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). Specification means stating a model by formulating a statement about a set of parameters. A critical principle in model specification and evaluation is that all of the models we would be interested in specifying and evaluating are wrong to some degree. We must therefore define as an optimal outcome a finding that a particular model fits our observed data closely and yields a highly interpretable solution. Because we never consider all possible models, a finding that a particular model fits the observed data well and yields an interpretable solution can be taken to mean only that the model provides one plausible representation of the structure that produced the observed data.
QGraph Visualizations 6:43
qgraph is a package that can be used to plot several types of graphs. It is mainly aimed at visualizing relationships in (psychometric) data as networks to create a clear picture of what the data actually looks like.

Its most important use is to visualize correlation matrices as a network in which each node represents a variable and each edge a correlation. The color of an edge indicates the sign of the correlation (green for positive correlations and red for negative correlations) and its width indicates the strength of the correlation. Other statistics can also be used in the graph as long as negative and positive values are comparable in strength and zero indicates no relationship.

qgraph also comes with various functions to visualize other statistics and even perform analyses, such as EFA, PCA, CFA and SEM. The stable release of qgraph is available at CRAN, and the developmental version is available at GitHub. An article introducing the package in detail is available in the Journal of Statistical Software.

Since qgraph 1.3 the package also contains network model selection and estimation procedures.
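A correlation network of the kind described above might be drawn as in the following sketch. It assumes the qgraph package is installed (the call is skipped otherwise) and uses a few columns of the built-in mtcars data purely as an example; edge colors depend on the installed qgraph version and its settings.

```r
# Correlation matrix for a handful of variables (illustrative choice).
R <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
print(round(R, 2))

# Draw the network only if qgraph is available on this machine.
if (requireNamespace("qgraph", quietly = TRUE)) {
  # Nodes are variables; edge width reflects the absolute correlation.
  qgraph::qgraph(R, layout = "spring", labels = colnames(R))
} else {
  message("Package 'qgraph' is not installed; skipping the network plot.")
}
```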
SIMSEM Package Simulation Capabilities (part 1) 9:57
The SIMSEM R package has been developed for facilitating simulation and analysis of data within the structural equation modeling (SEM) framework. This package aims to help analysts create simulated data from hypotheses or from analytic results obtained from data. The simulated data can be used for different purposes, such as power analysis, model fit evaluation, and planned missing design. Students will gain an appreciation of how to use SIMSEM for these purposes.
SIMSEM Package Simulation Capabilities (part 2) 4:12
The SIMSEM R package has been developed for facilitating simulation and analysis of data within the structural equation modeling (SEM) framework. This package aims to help analysts create simulated data from hypotheses or from analytic results obtained from data. The simulated data can be used for different purposes, such as power analysis, model fit evaluation, and planned missing design. Students will gain an appreciation of how to use SIMSEM for these purposes.

Requirements

No specific knowledge or skills are required; however, it is helpful if students have some interest and aptitude in quantitative or statistical analysis.
Students will need to install the popular no-cost R Console and RStudio software (instructions provided).

Description

Applied Multivariate Analysis (MVA) with R is a practical, conceptual and applied "hands-on" course that teaches students how to perform various specific MVA tasks using real data sets and R software. It is an excellent and practical background course for anyone engaged with educational or professional tasks and responsibilities in the fields of data mining or predictive analytics; statistical or quantitative modeling (including linear, GLM and/or non-linear modeling); covariance-based Structural Equation Modeling (SEM) specification and estimation; and/or variance-based PLS Path Model specification and estimation.

Students learn about the nature of multivariate data and multivariate analysis. Students specifically learn how to create and estimate: covariance and correlation matrices; Principal Components Analyses (PCA); Multidimensional Scaling (MDS); Cluster Analysis; Exploratory Factor Analyses (EFA); and SEM model estimation. The course also teaches how to create dozens of different dazzling 2D and 3D multivariate data visualizations using R software. All software, R scripts, datasets and slides used in all lectures are provided in the course materials.

The course is structured as a series of seven sections, each addressing a specific MVA topic and each culminating in one or more "hands-on" exercises that students complete before proceeding, to reinforce the MVA concepts and skills presented. The course is an excellent vehicle to acquire "real-world" predictive analytics skills that are in high demand today in the workplace. The course is also a fertile source of relevant skills and knowledge for graduate students and faculty who are required to analyze and interpret research data.

Who this course is for:

Anyone interested in using multivariate analysis techniques as a basis for data mining, statistical modeling, and structural equation modeling (SEM) estimation.
Practicing quantitative analysis professionals including college and university faculty seeking to learn new multivariate data analysis skills.
Undergraduate students looking for jobs in predictive or business analytics fields.
Graduate students wishing to learn more applied data analysis techniques and approaches.

Course content

Introduction to Multivariate Data and Analysis • 12 lectures • 2hr 1min

Visualizing Multivariate Data • 13 lectures • 2hr 27min

Principal Components Analysis (PCA) • 12 lectures • 1hr 40min

Multidimensional Scaling (MDS) • 9 lectures • 1hr 37min

Cluster Analysis • 14 lectures • 2hr 24min

Exploratory Factor Analysis (EFA) • 8 lectures • 1hr 14min

Introduction to Structural Equation Modeling (SEM), QGraph, and SIMSEM • 7 lectures • 50min