Applied Multivariate Analysis with R

Learn to use R software to conduct PCAs, MDSs, cluster analyses, EFAs and to estimate SEM models.
4.0 (60 ratings)
1,784 students enrolled
$19
$50
62% off
  • Lectures 75
  • Length 12 hours
  • Skill Level All Levels
  • Languages English
  • Includes Lifetime access
    30 day money back guarantee!
    Available on iOS and Android
    Certificate of Completion

About This Course

Published 7/2015 English

Course Description

Applied Multivariate Analysis (MVA) with R is a practical, conceptual and applied "hands-on" course that teaches students how to perform specific MVA tasks using real data sets and R software. It is an excellent and practical background course for anyone engaged with educational or professional tasks and responsibilities in data mining or predictive analytics; statistical or quantitative modeling (including linear, GLM and/or non-linear modeling); covariance-based Structural Equation Modeling (SEM) specification and estimation; and/or variance-based PLS path model specification and estimation.

Students learn all about the nature of multivariate data and multivariate analysis. They specifically learn how to create and estimate: covariance and correlation matrices; Principal Components Analyses (PCA); Multidimensional Scaling (MDS); Cluster Analyses; Exploratory Factor Analyses (EFA); and SEM models. The course also teaches how to create dozens of dazzling 2D and 3D multivariate data visualizations using R software. All software, R scripts, datasets and slides used in the lectures are provided in the course materials.

The course is structured as a series of seven sections, each addressing a specific MVA topic and each culminating with one or more "hands-on" exercises for students to complete before proceeding, in order to reinforce the presented MVA concepts and skills. The course is an excellent vehicle for acquiring "real-world" predictive analytics skills that are in high demand in today's workplace. It is also a fertile source of relevant skills and knowledge for graduate students and faculty who are required to analyze and interpret research data.

What are the requirements?

  • No specific knowledge or skills are required.
  • Students will need to install the popular no-cost R Console and RStudio software (instructions provided).
  • However, it is helpful if students have some interest and aptitude in quantitative or statistical analysis.

What am I going to get from this course?

  • Conceptualize and apply multivariate skills and "hands-on" techniques using R software in analyzing real data.
  • Create novel and stunning 2D and 3D multivariate data visualizations with R.
  • Set up and estimate a Principal Components Analysis (PCA).
  • Formulate and estimate a Multidimensional Scaling (MDS) problem.
  • Group similar (or dissimilar) data with Cluster Analysis techniques.
  • Estimate and interpret an Exploratory Factor Analysis (EFA).
  • Specify and estimate a Structural Equation Model (SEM) using RAM notation in R.
  • Be knowledgeable about SEM simulation capabilities from the R SIMSEM package.

What is the target audience?

  • Anyone interested in using multivariate analysis techniques as a basis for data mining, statistical modeling, and structural equation modeling (SEM) estimation.
  • Practicing quantitative analysis professionals including college and university faculty seeking to learn new multivariate data analysis skills.
  • Undergraduate students looking for jobs in predictive or business analytics fields.
  • Graduate students wishing to learn more applied data analysis techniques and approaches.

Curriculum

Section 1: Introduction to Multivariate Data and Analysis
11:40

This video presents an overview of the Applied Multivariate Analysis (MVA) course.

02:25

The materials used in the video lectures for Section 1 Introduction to Multivariate Data and Analysis are briefly explained and then provided as a .zip file download after the short video is presented.

14:15

Multivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves the observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest. Some of the applications include:

• To reduce a large number of variables to a smaller number of factors for data modeling.

• To validate a scale or index by demonstrating that its constituent items load on the same factor, and to drop proposed scale items that cross-load on more than one factor.

• To select a subset of variables from a larger set, based on which original variables have the highest correlations with some other factors.

• To create a set of factors to be treated as uncorrelated variables, as one approach to handling multicollinearity in procedures such as multiple regression.

In this "hands-on" course on applied multivariate analysis, we focus on how to actually use and conduct MVA analyses, using dozens of real data sets and R software. We examine the techniques and examples of principal components analysis, multidimensional scaling, cluster analysis, exploratory factor analysis, and an introduction to structural equation modeling.

08:20

Missing data is a huge problem in analyzing data sets because many statistical and mathematical functions fail when individual observations have even one missing data element. We explain and demonstrate why this is a problem using a 'body measures' dataset that we construct in R, and we show some "quick fixes" for getting around missing data in multivariate analysis.
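As an illustration of the failure mode and the "quick fixes", here is a hedged sketch with a small hypothetical 'body measures'-style data frame (not the lecture's actual dataset):

```r
# Hypothetical 'body measures'-style data with one missing height.
bodym <- data.frame(height = c(170, 182, NA, 165),
                    weight = c( 68,  85, 77,  54))

mean(bodym$height)                         # NA: a single missing value poisons the result
mean(bodym$height, na.rm = TRUE)           # quick fix: drop NAs for this calculation

cov(bodym)                                 # NA wherever height is involved
cov(bodym, use = "pairwise.complete.obs")  # quick fix: use all complete pairs
na.omit(bodym)                             # quick fix: listwise deletion of incomplete rows
```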

10:11

We create several multivariate data sets using R software. We use these data sets and others in the rest of the course.

11:36

In probability theory and statistics, a covariance matrix (also known as dispersion matrix or variance–covariance matrix) is a matrix whose element in the i, j position is the covariance between the i th and j th elements of a random vector (that is, of a vector of random variables). Each element of the vector is a scalar random variable, either with a finite number of observed empirical values or with a finite or infinite number of potential values specified by a theoretical joint probability distribution of all the random variables.

The correlation matrix of n random variables X1, ..., Xn is the n × n matrix whose i,j entry is corr(Xi, Xj). If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of the standardized random variables Xi/σ(Xi) for i = 1, ..., n. This applies both to the matrix of population correlations (in which case σ is the population standard deviation) and to the matrix of sample correlations (in which case σ denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix.

The correlation matrix is symmetric because the correlation between Xi and Xj is the same as the correlation between Xj and Xi.
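The relationship between the two matrices is easy to verify in base R; the following sketch uses the built-in trees data rather than the course datasets:

```r
# Covariance vs. correlation matrices for R's built-in 'trees' data.
S <- cov(trees)            # sample variance-covariance matrix
R <- cor(trees)            # sample correlation matrix
cov2cor(S)                 # standardizing S recovers the correlation matrix
all.equal(cov2cor(S), R)   # TRUE: correlations are covariances of standardized variables
isSymmetric(R)             # TRUE: cor(Xi, Xj) equals cor(Xj, Xi)
```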

10:21

We continue our discussion of creating, estimating and using both covariance and correlation matrices in multivariate analysis using R software. We also introduce the concept of "distance" for finding similarities / differences among sets of variables.

10:12

We continue our discussion of creating, estimating and using both covariance and correlation matrices in multivariate analysis using R software. We also introduce the concept of "distance" for finding similarities / differences among sets of variables.
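A minimal sketch of the distance idea, again using the built-in trees data for illustration:

```r
# Euclidean distances between observations after standardizing the variables.
d <- dist(scale(trees), method = "euclidean")
round(as.matrix(d)[1:4, 1:4], 2)   # pairwise distances among the first four trees
```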

11:28

We describe, create (with simulation), demonstrate and visualize a multivariate normal (MVN) density function using R. In probability theory and statistics, the multivariate normal distribution or multivariate Gaussian distribution, is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One possible definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables each of which clusters around a mean value.
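A hedged simulation sketch using MASS::mvrnorm() (illustrative parameter values, not necessarily those used in the lecture):

```r
library(MASS)   # for mvrnorm()

# Simulate 1,000 draws from a bivariate normal with correlation 0.6.
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 1.0), nrow = 2)
set.seed(1)
X <- mvrnorm(n = 1000, mu = c(0, 0), Sigma = Sigma)

# Every linear combination of MVN components is univariate normal:
z <- X %*% c(0.3, 0.7)
qqnorm(z); qqline(z)
```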

10:03

We demonstrate several R software graphical approaches to test for univariate and multivariate normality.
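Two common graphical checks, sketched here with the built-in trees data (the lecture uses the course datasets):

```r
# 1. Univariate QQ plot for a single variable.
qqnorm(trees$Height); qqline(trees$Height)

# 2. Chi-squared QQ plot of squared Mahalanobis distances, which should be
#    roughly linear if the data are multivariate normal.
x  <- as.matrix(trees)
d2 <- mahalanobis(x, center = colMeans(x), cov = cov(x))
qqplot(qchisq(ppoints(nrow(x)), df = ncol(x)), sort(d2),
       xlab = "Chi-squared quantiles", ylab = "Ordered squared distances")
abline(0, 1)
```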

13:41

We continue our illustrative cases and examples of creating normality plots in R software.

07:09

This video lecture explains the three covariance, correlation and normality exercises for the first section of the applied MVA course.

Section 2: Visualizing Multivariate Data
02:46

The materials used in the video lectures for Section 2 Visualizing Multivariate Data are briefly explained and then provided as a .zip file download after the short video is presented.

Covariance and Correlation Matrices with Missing Data (part 1)
Preview
08:49
Covariance and Correlation Matrices with Missing Data (part 2)
10:45
Univariate and Multivariate QQPlots of Pottery Data
09:38
Converting Covariance to Correlation Matrices
15:20
Plots for Marginal Distributions
15:35
Outlier Identification
16:29
Chi, Bubble, and other Glyph Plots
14:12
Scatterplot Matrix
07:56
Kernel Density Estimators
13:41
3-Dimensional and Trellis (Lattice Package) Graphics
15:26
More Trellis (Lattice Package) Graphics
14:04
Bivariate Boxplot and ChiPlot Visualizations Exercises
02:16
Section 3: Principal Components Analysis (PCA)
00:44

The materials used in the video lectures for Section 3 Principal Components Analysis (PCA) are briefly explained and then provided as a .zip file download after the short video is presented.

Bivariate Boxplot Visualization Exercise Solution
14:48
ChiPlot Visualization Exercise Solution
03:40
What is a "Principal Components Analysis" (PCA) ?
Preview
11:13
PCA Basics with R: Blood Data (part 1)
09:18
PCA Basics with R: Blood Data (part 2)
10:51
PCA with Head Size Data (part 1)
08:01
PCA with Head Size Data (part 2)
09:31
PCA with Heptathlon Data (part 1)
07:40
PCA with Heptathlon Data (part 2)
10:03
PCA with Heptathlon Data (part 3)
13:04
PCA Criminal Convictions Exercise
01:21
Section 4: Multidimensional Scaling (MDS)
00:56

The materials used in the video lectures for Section 4 Multidimensional Scaling (MDS) are briefly explained and then provided as a .zip file download after the short video is presented.

PCA Criminal Convictions Exercise Solution
14:29
Introduction to Multidimensional Scaling
Preview
13:31
Classical Multidimensional Scaling (part 1)
14:50
Classical Multidimensional Scaling (part 2)
08:47
Classical Multidimensional Scaling: Skulls Data
17:46
Non-Metric Multidimensional Scaling Example: Voting Behavior
14:24
Non-Metric Multidimensional Scaling Example: WW II Leaders
09:08
Multidimensional Scaling Exercise: Water Voles
02:48
Section 5: Cluster Analysis
01:13

The materials used in the video lectures for Section 5 Cluster Analysis are briefly explained and then provided as a .zip file download after the short video is presented.

MDS Water Voles Exercise Solution
13:55
Introduction to Cluster Analysis
Preview
10:50
Hierarchical Clustering Distance Techniques
10:35
Hierarchical Clustering of Measures Data
12:40
Hierarchical Clustering of Fighter Jets
10:15
K-Means Clustering of Crime Data (part 1)
13:06
K-Means Clustering of Crime Data (part 2)
06:36
Clustering of Romano-British Pottery Data
14:50
K-Means Classifying of Exoplanets
13:20
Model-Based Clustering of Exoplanets
12:34
Finite Mixture Model-Based Analysis
13:04
Cluster Analysis Neighborhood and Stripes Plots
10:07
K-Means Cluster Analysis Crime Data Exercise
00:35
Section 6: Exploratory Factor Analysis (EFA)
00:40

The materials used in the video lectures for Section 6 Exploratory Factor Analysis are briefly explained and then provided as a .zip file download after the short video is presented.

09:20

The solution to the K-Means exercise using the crime data is explained.

14:38

In multivariate statistics, exploratory factor analysis (EFA) is a statistical method used to uncover the underlying structure of a relatively large set of variables. EFA is a technique within factor analysis whose overarching goal is to identify the underlying relationships between measured variables. It is commonly used by researchers when developing a scale (a scale is a collection of questions used to measure a particular research topic) and serves to identify a set of latent constructs underlying a battery of measured variables. It should be used when the researcher has no a priori hypothesis about factors or patterns of measured variables. Measured variables are any one of several attributes of people that may be observed and measured. An example of a measured variable would be the physical height of a human being. Researchers must carefully consider the number of measured variables to include in the analysis. EFA procedures are more accurate when each factor is represented by multiple measured variables in the analysis.

07:34

The factanal() function in R performs maximum-likelihood factor analysis on a covariance matrix or data matrix.
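A minimal factanal() sketch on R's built-in mtcars data (illustrative only; the lectures apply it to the life and drug-use datasets from the materials):

```r
# Maximum-likelihood factor analysis of a data matrix with factanal().
fit <- factanal(mtcars, factors = 2, rotation = "varimax")
print(fit, digits = 2, cutoff = 0.3)   # loadings, uniquenesses, and fit test
```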

14:47

This lecture presents an example of estimating an EFA using R software with the life data provided in the materials.

16:20

This lecture presents an example of estimating an EFA using R software with the drug use data provided in the materials.

08:17

Both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) are employed to understand shared variance of measured variables that is believed to be attributable to a factor or latent construct. Despite this similarity, however, EFA and CFA are conceptually and statistically distinct analyses.

The goal of EFA is to identify factors based on data and to maximize the amount of variance explained. The researcher is not required to have any specific hypotheses about how many factors will emerge, and what items or variables these factors will comprise. If these hypotheses exist, they are not incorporated into and do not affect the results of the statistical analyses. By contrast, CFA evaluates a priori hypotheses and is largely driven by theory. CFA analyses require the researcher to hypothesize, in advance, the number of factors, whether or not these factors are correlated, and which items/measures load onto and reflect which factors. As such, in contrast to exploratory factor analysis, where all loadings are free to vary, CFA allows for the explicit constraint of certain loadings to be zero.

EFA is sometimes reported in research when CFA would be a better statistical approach. It has been argued that CFA can be restrictive and inappropriate when used in an exploratory fashion. However, the idea that CFA is solely a “confirmatory” analysis may sometimes be misleading, as modification indices used in CFA are somewhat exploratory in nature. Modification indices show the improvement in model fit if a particular coefficient were to become unconstrained. Likewise, EFA and CFA do not have to be mutually exclusive analyses; EFA has been argued to be a reasonable follow up to a poor-fitting CFA model.

02:43

The correlation matrix given below represents grading scores of 220 boys in six school subjects:

(1) French; (2) English; (3) History; (4) Arithmetic; (5) Algebra and (6) Geometry.

Find the two-factor solution from a maximum likelihood factor analysis. Interpret the factor loadings. Then plot the derived loadings and interpret them again. Was it easier to interpret the factors by looking at the visualization? Finally, find a non-orthogonal rotation that allows easier interpretation of the results from the factor loadings directly, without the "visual utility" afforded by plotting the two-factor solution first.

            French  English  History  Arithmetic  Algebra  Geometry
French       1.00
English      0.44    1.00
History      0.41    0.35     1.00
Arithmetic   0.29    0.35     0.16      1.00
Algebra      0.33    0.32     0.19      0.59       1.00
Geometry     0.25    0.33     0.18      0.47       0.46     1.00
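As a starting point, the lower triangle can be entered into R and passed to factanal() through its covmat argument; the following is a sketch, not the official exercise solution:

```r
# Enter the lower-triangular correlations among the six school subjects.
subjects <- c("French", "English", "History", "Arithmetic", "Algebra", "Geometry")
R <- diag(6)
R[lower.tri(R)] <- c(0.44, 0.41, 0.29, 0.33, 0.25,  # French column
                     0.35, 0.35, 0.32, 0.33,        # English column
                     0.16, 0.19, 0.18,              # History column
                     0.59, 0.47,                    # Arithmetic column
                     0.46)                          # Algebra column
R <- R + t(R) - diag(6)                             # make the matrix symmetric
dimnames(R) <- list(subjects, subjects)

# Two-factor maximum-likelihood solution; n.obs = 220 boys.
fa <- factanal(covmat = R, factors = 2, n.obs = 220, rotation = "none")
print(fa$loadings, cutoff = 0.2)

# Plot the unrotated loadings, then try an oblique (non-orthogonal) rotation.
plot(fa$loadings, type = "n"); text(fa$loadings, labels = subjects)
fa_oblique <- factanal(covmat = R, factors = 2, n.obs = 220, rotation = "promax")
print(fa_oblique$loadings, cutoff = 0.2)
```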

Section 7: Introduction to Structural Equation Modeling (SEM), QGraph, and SIMSEM
02:53

Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). qgraph is a package that can be used to plot several types of graphs. It is mainly aimed at visualizing relationships in (psychometric) data as networks to create a clear picture of what the data actually looks like. SIMSEM is an R package developed for facilitating simulation and analysis of data within the structural equation modeling (SEM) framework.


09:15

Solutions to the EFA exercise are provided as R scripts.

11:56

Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). Specification means formulating a statement about a set of parameters and stating a model. A critical principle in model specification and evaluation is that every model we might be interested in specifying and evaluating is wrong to some degree. We must therefore define as an optimal outcome a finding that a particular model fits our observed data closely and yields a highly interpretable solution. Because we cannot consider all possible models, a finding that a particular model fits the observed data well and yields an interpretable solution means only that the model provides one plausible representation of the structure that produced the observed data.

05:15

Structural equation modeling (SEM) is a methodology for representing, estimating, and testing a network of relationships between variables (measured variables and latent constructs). Specification means formulating a statement about a set of parameters and stating a model. A critical principle in model specification and evaluation is that every model we might be interested in specifying and evaluating is wrong to some degree. We must therefore define as an optimal outcome a finding that a particular model fits our observed data closely and yields a highly interpretable solution. Because we cannot consider all possible models, a finding that a particular model fits the observed data well and yields an interpretable solution means only that the model provides one plausible representation of the structure that produced the observed data.
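A hypothetical sketch of RAM-style specification using John Fox's sem package (a made-up one-factor model, not the course's model; the paths, parameter names, and variables y1-y3 are assumptions for illustration):

```r
library(sem)   # assumes the sem package, which supports RAM/specifyModel() notation

# Each line is: path, parameter name, start value.
# NA as the parameter name fixes a value (here the factor variance is fixed to 1).
model <- specifyModel(text = "
F1 -> y1, lam1, NA
F1 -> y2, lam2, NA
F1 -> y3, lam3, NA
F1 <-> F1, NA, 1
y1 <-> y1, e1, NA
y2 <-> y2, e2, NA
y3 <-> y3, e3, NA
")

# With a sample covariance matrix S of y1..y3 and sample size N (placeholders):
# fit <- sem(model, S, N); summary(fit)
```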

06:43

qgraph is a package that can be used to plot several types of graphs. It is mainly aimed at visualizing relationships in (psychometric) data as networks to create a clear picture of what the data actually looks like.

Its most important use is to visualize correlation matrices as a network in which each node represents a variable and each edge a correlation. The color of an edge indicates the sign of the correlation (green for positive correlations and red for negative correlations), and its width indicates the strength of the correlation. Other statistics can also be shown in the graph as long as negative and positive values are comparable in strength and zero indicates no relationship.

qgraph also comes with various functions to visualize other statistics and even perform analyses such as EFA, PCA, CFA, and SEM. The stable release of qgraph is available on CRAN, the development version is available on GitHub, and an article introducing the package in detail has appeared in the Journal of Statistical Software.

Since qgraph 1.3 the package also contains network model selection and estimation procedures.
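A minimal qgraph sketch (using the built-in mtcars data for illustration):

```r
library(qgraph)   # assumes qgraph is installed from CRAN

# Visualize a correlation matrix as a network: nodes are variables; green
# edges are positive correlations, red edges negative, width reflects strength.
qgraph(cor(mtcars), layout = "spring", labels = colnames(mtcars))
```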

09:57

The SIMSEM R package was developed to facilitate simulation and analysis of data within the structural equation modeling (SEM) framework. The package helps analysts create simulated data from hypotheses or from analytic results obtained from real data. The simulated data can be used for different purposes, such as power analysis, model fit evaluation, and planned missing-data designs. Students will gain an appreciation of how to use SIMSEM for these purposes.

04:12

The SIMSEM R package was developed to facilitate simulation and analysis of data within the structural equation modeling (SEM) framework. The package helps analysts create simulated data from hypotheses or from analytic results obtained from real data. The simulated data can be used for different purposes, such as power analysis, model fit evaluation, and planned missing-data designs. Students will gain an appreciation of how to use SIMSEM for these purposes.
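A hedged SIMSEM sketch of a small Monte Carlo study, assuming lavaan-style model syntax (the one-factor population model and all parameter values are illustrative assumptions, not the lecture's):

```r
library(simsem)   # assumes the simsem package (which builds on lavaan)

# Population (data-generating) model: a standardized one-factor model.
popModel <- "
f1 =~ 0.7*y1 + 0.7*y2 + 0.7*y3 + 0.7*y4
f1 ~~ 1*f1
y1 ~~ 0.51*y1
y2 ~~ 0.51*y2
y3 ~~ 0.51*y3
y4 ~~ 0.51*y4
"

# Analysis model fitted to each simulated sample.
analyzeModel <- "f1 =~ y1 + y2 + y3 + y4"

out <- sim(nRep = 100, model = analyzeModel, n = 200,
           generate = popModel, lavaanfun = "cfa", seed = 123)
summary(out)   # fit-index distributions, parameter bias, power
```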

Instructor Biography

Geoffrey Hubona, Ph.D., Professor of Information Systems

Dr. Geoffrey Hubona held full-time tenure-track and tenured assistant and associate professor positions at three major state universities in the Eastern United States from 1993 to 2010. In these positions, he taught dozens of statistics, business information systems, and computer science courses to undergraduate, master's, and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL (1993); an MA in Economics (1990), also from USF; an MBA in Finance (1979) from George Mason University in Fairfax, VA; and a BA in Psychology (1972) from the University of Virginia in Charlottesville, VA. He was a full-time assistant professor at the University of Maryland Baltimore County (1993-1996) in Catonsville, MD; a tenured associate professor in the Department of Information Systems in the Business College at Virginia Commonwealth University (1996-2001) in Richmond, VA; and an associate professor in the CIS department of the Robinson College of Business at Georgia State University (2001-2010). He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-present), online educational organizations that teach research methods and quantitative analysis techniques. These techniques include linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling. Dr. Hubona is an expert in the open-source R software suite and in various PLS path modeling software packages, including SmartPLS. He has published dozens of research articles that explain and use these techniques for the analysis of data and, with software co-development partner Dean Lim, has created a popular cloud-based PLS software application, PLS-GUI.
