
This video presents an overview of the Applied Multivariate Analysis (MVA) course.
The materials used in the video lectures for Section 1 Introduction to Multivariate Data and Analysis are briefly explained and then provided as a .zip file download after the short video is presented.
Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves the observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. Some of the applications include:
• To reduce a large number of variables to a smaller number of factors for data modeling.
• To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items which cross-load on more than one factor.
• To select a subset of variables from a larger set, based on which original variables have the highest correlations with some other factors.
• To create a set of factors to be treated as uncorrelated variables as one approach to handling multi-collinearity in such procedures as multiple regression.
In this "hands-on" course on applied multivariate analysis, we focus on how to actually conduct multivariate analyses, using dozens of real data sets and R software. We examine the techniques of principal components analysis, multidimensional scaling, cluster analysis, and exploratory factor analysis, each with worked examples, and give an introduction to structural equation modeling.
Missing data is a major problem in analyzing data sets because many statistical and mathematical functions fail, or return a missing result, when any observation has even one missing value. We explain and demonstrate why this is a problem using a 'body measures' dataset that we construct in R, and we show some "quick fixes" for getting around the problem of missing data in multivariate analysis.
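As a minimal sketch of the problem, the R code below builds a small body-measures data frame and shows how a single NA propagates through common functions, followed by two common quick fixes. The variable names and values are invented for illustration and are not the course's dataset.

```r
# Hypothetical body-measures data (values invented for illustration)
body <- data.frame(
  chest = c(34, 37, 38, 36, 38, NA),
  waist = c(30, 32, 30, 33, NA, 29),
  hips  = c(32, 37, 36, 39, 33, 38)
)

# By default, one missing value makes the whole result missing
mean(body$chest)   # NA
cov(body)          # entries involving chest or waist are NA

# Quick fix 1: drop the missing values for a single variable
mean(body$chest, na.rm = TRUE)

# Quick fix 2: listwise deletion, keeping only complete observations
complete <- na.omit(body)
nrow(complete)     # 4 of the 6 rows have no missing values
cov(complete)

# Quick fix 3: use all available pairs for each covariance entry
cov(body, use = "pairwise.complete.obs")
```

Listwise deletion discards whole observations, while pairwise deletion keeps more data but can produce a matrix that is not positive semidefinite, so which fix is appropriate depends on the analysis that follows.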
We create several multivariate data sets using R software. We use these data sets and others in the rest of the course.
In probability theory and statistics, a covariance matrix (also known as dispersion matrix or variance–covariance matrix) is a matrix whose element in the (i, j) position is the covariance between the i-th and j-th elements of a random vector (that is, of a vector of random variables). Each element of the vector is a scalar random variable, either with a finite number of observed empirical values or with a finite or infinite number of potential values specified by a theoretical joint probability distribution of all the random variables.
The correlation matrix of n random variables X1, ..., Xn is the n × n matrix whose (i, j) entry is corr(Xi, Xj). If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of the standardized random variables Xi / σ(Xi) for i = 1, ..., n. This applies to both the matrix of population correlations (in which case "σ" is the population standard deviation), and to the matrix of sample correlations (in which case "σ" denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix.
The correlation matrix is symmetric because the correlation between Xi and Xj is the same as the correlation between Xj and Xi.
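The properties above can be checked numerically. The following is a small sketch on simulated data (the variable names, means and standard deviations are arbitrary assumptions) using the base R functions cov(), cor(), cov2cor() and scale().

```r
set.seed(1)

# Simulated data: 50 observations on three variables (arbitrary parameters)
X <- cbind(
  height = rnorm(50, mean = 170, sd = 10),
  weight = rnorm(50, mean = 70,  sd = 12),
  age    = rnorm(50, mean = 40,  sd = 8)
)

S <- cov(X)   # sample covariance matrix
R <- cor(X)   # sample (product-moment) correlation matrix

# The correlation matrix is the covariance matrix of the standardized variables.
# scale() also centers each column, which does not change the covariances.
Z <- scale(X)
all.equal(cov(Z), R, check.attributes = FALSE)

# Equivalently, rescale the covariance matrix directly
all.equal(cov2cor(S), R)

# Symmetry and positive semidefiniteness (no negative eigenvalues)
isSymmetric(R)
eigen(R, symmetric = TRUE)$values
```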
We continue our discussion of creating, estimating and using both covariance and correlation matrices in multivariate analysis using R software. We also introduce the concept of "distance" for finding similarities / differences among sets of variables.
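To make the idea of "distance" concrete, here is a minimal sketch using the base R function dist() on three made-up observations (the names a, b, c and the measurements are illustrative only). It also shows why standardizing matters when variables are on different scales.

```r
# Three hypothetical observations measured on two variables
X <- matrix(
  c(170, 60,
    173, 64,
    170, 61),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), c("height", "weight"))
)

# Euclidean distances between all pairs of rows
d_raw <- dist(X)
d_raw

# Distances after standardizing each variable to unit variance,
# so that no single variable dominates because of its units
d_std <- dist(scale(X))
d_std

# A full symmetric matrix is often easier to read than the "dist" object
as.matrix(d_raw)
```

Small distances indicate similar observations and large distances indicate dissimilar ones; the same distance matrices are the input to methods such as multidimensional scaling and hierarchical clustering.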
We describe, create (with simulation), demonstrate and visualize a multivariate normal (MVN) density function using R. In probability theory and statistics, the multivariate normal distribution or multivariate Gaussian distribution, is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One possible definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.
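One way to simulate and visualize a bivariate normal distribution in base R is sketched below. The mean vector, the covariance matrix and the helper function dmvn() are assumptions made for this illustration; the course scripts may take a different approach, for example with a contributed package.

```r
set.seed(42)

mu    <- c(0, 0)                              # assumed mean vector
Sigma <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)  # assumed covariance matrix
n     <- 5000

# Simulate: transform independent standard normals with the Cholesky factor.
# chol() returns upper-triangular U with t(U) %*% U = Sigma, so Z %*% U has
# covariance Sigma.
Z <- matrix(rnorm(n * 2), ncol = 2)
X <- sweep(Z %*% chol(Sigma), 2, mu, "+")

colMeans(X)   # close to mu
cov(X)        # close to Sigma

# Hypothetical helper: MVN density evaluated at the rows of x
dmvn <- function(x, mu, Sigma) {
  k  <- length(mu)
  d2 <- mahalanobis(x, center = mu, cov = Sigma)  # squared distances
  exp(-0.5 * d2) / sqrt((2 * pi)^k * det(Sigma))
}

# Visualize the density on a grid as a contour plot over the simulated points
g    <- seq(-3, 3, length.out = 60)
dens <- outer(g, g, function(a, b) dmvn(cbind(a, b), mu, Sigma))
plot(X, pch = ".", xlab = "x1", ylab = "x2")
contour(g, g, dens, add = TRUE)
```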
We demonstrate several R software graphical approaches to test for univariate and multivariate normality.
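Two common graphical checks can be sketched in base R as follows: a normal Q-Q plot for one variable, and a chi-square Q-Q plot of squared Mahalanobis distances for the joint distribution. The data here are simulated and the variable names are placeholders, so this is an illustration rather than the course's own example.

```r
set.seed(7)

n <- 200
X <- cbind(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))  # simulated data

# Univariate check: points should fall close to the reference line
qqnorm(X[, "x1"], main = "Normal Q-Q plot for x1")
qqline(X[, "x1"])

# Multivariate check: under multivariate normality, the squared Mahalanobis
# distances are approximately chi-square with df = number of variables
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
q  <- qchisq(ppoints(n), df = ncol(X))

plot(q, sort(d2),
     xlab = "Chi-square quantiles",
     ylab = "Ordered squared Mahalanobis distances",
     main = "Chi-square Q-Q plot")
abline(0, 1)
```

Marked curvature or a few points far above the line in the second plot would suggest departures from multivariate normality or the presence of outliers.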
We continue our illustrative cases and examples of creating normality plots in R software.
This video lecture explains the three covariance, correlation and normality exercises for the first section of the applied MVA course.
The materials used in the video lectures for Section 2 Visualizing Multivariate Data are briefly explained and then provided as a .zip file download after the short video is presented.
The materials used in the video lectures for Section 3 Principal Components Analysis (PCA) are briefly explained and then provided as a .zip file download after the short video is presented.
The materials used in the video lectures for Section 4 Multidimensional Scaling (MDS) are briefly explained and then provided as a .zip file download after the short video is presented.
The materials used in the video lectures for Section 5 Cluster Analysis are briefly explained and then provided as a .zip file download after the short video is presented.
The materials used in the video lectures for Section 6 Exploratory Factor Analysis are briefly explained and then provided as a .zip file download after the short video is presented.
The solution to the K-Means exercise using the crime data is explained.
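For readers who want to see the general shape of a K-Means analysis, the sketch below uses R's built-in USArrests data as a stand-in. It is not the course's crime dataset, and the choice of three clusters is an arbitrary assumption made only for illustration.

```r
set.seed(123)

# Stand-in data: built-in USArrests, standardized so that no variable dominates
crime <- scale(USArrests)

# Within-group sum of squares for 1 to 6 clusters (for an "elbow" plot).
# For one cluster it equals the total sum of squares.
wss <- c(
  (nrow(crime) - 1) * sum(apply(crime, 2, var)),
  sapply(2:6, function(k) kmeans(crime, centers = k, nstart = 25)$tot.withinss)
)
plot(1:6, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-group sum of squares")

# Fit a three-cluster solution (k = 3 is an assumption, not a recommendation)
fit <- kmeans(crime, centers = 3, nstart = 25)
table(fit$cluster)

# Cluster means on the original scale help with interpretation
aggregate(USArrests, by = list(cluster = fit$cluster), FUN = mean)
```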
In multivariate statistics, exploratory factor analysis (EFA) is a statistical method used to uncover the underlying structure of a relatively large set of variables. EFA is a technique within factor analysis whose overarching goal is to identify the underlying relationships between measured variables. It is commonly used by researchers when developing a scale (a scale is a collection of questions used to measure a particular research topic) and serves to identify a set of latent constructs underlying a battery of measured variables. It should be used when the researcher has no a priori hypothesis about factors or patterns of measured variables. Measured variables are any one of several attributes of people that may be observed and measured. An example of a measured variable would be the physical height of a human being. Researchers must carefully consider the number of measured variables to include in the analysis. EFA procedures are more accurate when each factor is represented by multiple measured variables in the analysis.
The factanal() function in R performs maximum-likelihood factor analysis on a covariance matrix or data matrix.
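A minimal sketch of a factanal() call is shown below. It uses the ability.cov covariance matrix that ships with R rather than any of the course datasets, and the choice of two factors is made only to illustrate the interface.

```r
# ability.cov is a built-in list with a covariance matrix ($cov),
# variable means ($center) and the number of observations ($n.obs)
fit <- factanal(factors = 2, covmat = ability.cov)

# Loadings (varimax rotation is the default); small loadings are hidden
print(fit$loadings, cutoff = 0.3)

# Uniquenesses: the share of each variable's variance not explained by the factors
fit$uniquenesses

# Likelihood-ratio test that two factors are sufficient
fit$STATISTIC
fit$dof
fit$PVAL
```

When raw data are available, the data matrix can be passed as the first argument instead of covmat, which also allows factor scores to be requested through the scores argument.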
This lecture presents an example of estimating an EFA using R software with the life data provided in the materials.
This lecture presents an example of estimating an EFA using R software with the drug use data provided in the materials.
Both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are employed to understand shared variance of measured variables that is believed to be attributable to a factor or latent construct. Despite this similarity, however, EFA and CFA are conceptually and statistically distinct analyses.
The goal of EFA is to identify factors based on data and to maximize the amount of variance explained. The researcher is not required to have any specific hypotheses about how many factors will emerge, and what items or variables these factors will comprise. If these hypotheses exist, they are not incorporated into and do not affect the results of the statistical analyses. By contrast, CFA evaluates a priori hypotheses and is largely driven by theory. CFA analyses require the researcher to hypothesize, in advance, the number of factors, whether or not these factors are correlated, and which items/measures load onto and reflect which factors. As such, in contrast to exploratory factor analysis, where all loadings are free to vary, CFA allows for the explicit constraint of certain loadings to be zero.
EFA is sometimes reported in research when CFA would be a better statistical approach. It has been argued that CFA can be restrictive and inappropriate when used in an exploratory fashion. However, the idea that CFA is solely a “confirmatory” analysis may sometimes be misleading, as modification indices used in CFA are somewhat exploratory in nature. Modification indices show the improvement in model fit if a particular coefficient were to become unconstrained. Likewise, EFA and CFA do not have to be mutually exclusive analyses; EFA has been argued to be a reasonable follow up to a poor-fitting CFA model.
The correlation matrix given below represents the grades of 220 boys in six school subjects:
(1) French; (2) English; (3) History; (4) Arithmetic; (5) Algebra; and (6) Geometry.
Find the two-factor solution from a maximum likelihood factor analysis and interpret the factor loadings. Then plot the derived loadings and interpret them again. Was it easier to interpret the factors by looking at the visualization? Finally, find a non-orthogonal (oblique) rotation that makes the results easier to interpret from the factor loadings directly, without the "visual utility" that is afforded by plotting the two-factor solution first.
# French      1.00
# English     0.44  1.00
# History     0.41  0.35  1.00
# Arithmetic  0.29  0.35  0.16  1.00
# Algebra     0.33  0.32  0.19  0.59  1.00
# Geometry    0.25  0.33  0.18  0.47  0.46  1.00
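As a starting point for the exercise, the lower-triangular matrix above can be entered in R and passed to factanal() through its covmat argument. The sketch below shows one possible setup, using promax as one example of an oblique rotation; interpreting the output is left to the exercise, and the provided solution scripts may proceed differently.

```r
subjects <- c("French", "English", "History", "Arithmetic", "Algebra", "Geometry")

# Fill the lower triangle column by column, then mirror it to the upper triangle
R <- diag(6)
R[lower.tri(R)] <- c(0.44, 0.41, 0.29, 0.33, 0.25,   # French with the others
                     0.35, 0.35, 0.32, 0.33,         # English
                     0.16, 0.19, 0.18,               # History
                     0.59, 0.47,                     # Arithmetic
                     0.46)                           # Algebra with Geometry
R <- R + t(R) - diag(6)
dimnames(R) <- list(subjects, subjects)

# Two-factor maximum likelihood solution (varimax rotation by default)
fit_varimax <- factanal(covmat = R, factors = 2, n.obs = 220)
print(fit_varimax$loadings)

# Plot the loadings of each subject on the two factors
L <- unclass(fit_varimax$loadings)
plot(L[, 1], L[, 2], type = "n", xlab = "Factor 1", ylab = "Factor 2")
text(L[, 1], L[, 2], labels = subjects)

# An oblique (non-orthogonal) alternative
fit_promax <- factanal(covmat = R, factors = 2, n.obs = 220, rotation = "promax")
print(fit_promax$loadings)
```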
Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). qgraph is a package that can be used to plot several types of graphs. It is mainly aimed at visualizing relationships in (psychometric) data as networks to create a clear picture of what the data actually looks like. SIMSEM is an R package developed for facilitating simulation and analysis of data within the structural equation modeling (SEM) framework.
Solutions to the EFA exercises are provided in R scripts.
Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). Specification is formulating a statement about a set of parameters and stating a model. A critical principle in model specification and evaluation is the fact that all of the models that we would be interested in specifying and evaluating are wrong to some degree. We must therefore define as an optimal outcome a finding that a particular model fits our observed data closely and yields a highly interpretable solution. Because we cannot consider all possible models, such a finding can be taken to mean only that the model provides one plausible representation of the structure that produced the observed data.
qgraph is a package that can be used to plot several types of graphs. It is mainly aimed at visualizing relationships in (psychometric) data as networks to create a clear picture of what the data actually looks like.
Its most important use is to visualize correlation matrices as a network in which each node represents a variable and each edge a correlation. The color of an edge indicates the sign of the correlation (green for positive correlations and red for negative correlations) and its width indicates the strength of the correlation. Other statistics can also be used in the graph as long as negative and positive values are comparable in strength and zero indicates no relationship.
qgraph also comes with various functions to visualize other statistics and even perform analyses, such as EFA, PCA, CFA and SEM. The stable release of qgraph is available on CRAN, the development version is available on GitHub, and an article introducing the package in detail is available in the Journal of Statistical Software.
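A correlation network of this kind can be sketched as follows. The example uses a handful of variables from R's built-in mtcars data purely as a placeholder, and it only draws the plot if the qgraph package happens to be installed.

```r
# Correlation matrix for a few variables of the built-in mtcars data
# (a placeholder; any numeric data set could be used instead)
R <- cor(mtcars[, c("mpg", "hp", "wt", "qsec", "drat")])
round(R, 2)

# Draw the network only if qgraph is available
if (requireNamespace("qgraph", quietly = TRUE)) {
  # Each node is a variable; each edge is a correlation
  qgraph::qgraph(R, layout = "spring", labels = colnames(R))
} else {
  message("Package 'qgraph' is not installed; skipping the network plot.")
}
```

The "spring" layout places strongly correlated variables closer together, which often makes groups of related variables easier to spot than in the printed matrix.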
Since qgraph 1.3 the package also contains network model selection and estimation procedures.
The SIMSEM R package has been developed for facilitating simulation and analysis of data within the structural equation modeling (SEM) framework. This package aims to help analysts create simulated data from hypotheses or analytic results from obtained data. The simulated data can be used for different purposes, such as power analysis, model fit evaluation, and planned missing design. Students will have an appreciation of how to use SIMSEM for these purposes.
Applied Multivariate Analysis (MVA) with R is a practical, conceptual and applied "hands-on" course that teaches students how to perform various specific MVA tasks using real data sets and R software. It is an excellent and practical background course for anyone engaged with educational or professional tasks and responsibilities in the fields of data mining or predictive analytics, statistical or quantitative modeling (including linear, GLM and/or non-linear modeling), covariance-based Structural Equation Modeling (SEM) specification and estimation, and/or variance-based PLS Path Model specification and estimation. Students learn about the nature of multivariate data and multivariate analysis. Specifically, they learn how to create and estimate: covariance and correlation matrices; Principal Components Analyses (PCA); Multidimensional Scaling (MDS); Cluster Analysis; Exploratory Factor Analyses (EFA); and SEM models. The course also teaches how to create dozens of different dazzling 2D and 3D multivariate data visualizations using R software. All software, R scripts, datasets and slides used in the lectures are provided in the course materials. The course is structured as a series of seven sections, each addressing a specific MVA topic and each culminating in one or more "hands-on" exercises that students complete before proceeding, to reinforce the MVA concepts and skills presented. The course is an excellent vehicle to acquire "real-world" predictive analytics skills that are in high demand today in the workplace. The course is also a fertile source of relevant skills and knowledge for graduate students and faculty who are required to analyze and interpret research data.