This "hands-on" course teaches one how to use the R software lavaan package to specify, estimate the parameters of, and interpret covariance-based structural equation (SEM) models that use latent variables. "lavaan" (note the purposeful use of lowercase "L" in 'lavaan') is an acronym for latent variable analysis, and the name suggests the long-term goal of the developer, Yves Rosseel: "to provide a collection of tools that can be used to explore, estimate, and understand a wide family of latent variable models, including factor analysis, structural equation, longitudinal, multilevel, latent class, item response, and missing data models." The course uses and executes many "live" examples (with included R scripts and datasets) using no-cost R and RStudio software to demonstrate and teach how to: (1) specify a SEM model in lavaan syntax; (2) fit and then evaluate your model; (3) perform a CFA; (4) impute and replace missing data; (5) estimate mediating and other indirect effects; (6) estimate and evaluate multigroup models, simultaneously establishing measurement invariance; and (7) specifying and estimating latent (growth) curve models, including the use of random (and latent) intercepts and slopes. The R lavaan package is world-class 'professional-grade' SEM software, used by thousands of SEM experts, graduate students, and college and university faculty around the world.
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years.
R is an implementation of the S programming language combined with lexical scoping semantics inspired by Scheme. S was created by John Chambers while at Bell Labs. There are some important differences, but much of the code written for S runs unaltered.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of the first two R authors and partly as a play on the name of S.
R is a GNU project. The source code for the R software environment is written primarily in C, Fortran, and R. R is freely available under the GNU General Public License, and pre-compiled binary versions are provided for various operating systems. R uses a command line interface; there are also several graphical front-ends for it.
Structural equation modeling (SEM) is a family of statistical methods designed to test a conceptual or theoretical model.[ Some common SEM methods include confirmatory factor analysis, path analysis, and latent growth modeling.[ The term "structural equation model" most commonly refers to a combination of two things: a "measurement model" that defines latent variables using one or more observed variables, and a "structural regression model" that links latent variables together. The parts of a structural equation model are linked to one another using a system of simultaneous regression equations.
SEM is widely used in the social sciences because of its ability to isolate observational error from measurement of latent variables. To provide a simple example, the concept of human intelligence cannot be measured directly as one could measure height or weight. Instead, psychologists develop theories of intelligence and write measurement instruments with items (questions) designed to measure intelligence according to their theory. They would then use SEM to test their theory using data gathered from people who took their intelligence test. With SEM, "intelligence" would be the latent variable and the test items would be the observed variables.
In statistics, path analysis is used to describe the directed dependencies among a set of variables. This includes models equivalent to any form of multiple regression analysis, factor analysis, canonical correlation analysis, discriminant analysis, as well as more general families of models in the multivariate analysis of variance and covariance analyses (MANOVA, ANOVA, ANCOVA).
In addition to being thought of as a form of multiple regression focusing on causality, path analysis can be viewed as a special case of structural equation modeling (SEM) – one in which only single indicators are employed for each of the variables in the causal model. That is, path analysis is SEM with a structural model, but no measurement model. Other terms used to refer to path analysis include causal modeling, analysis of covariance structures, and latent variable models.
In statistics, confirmatory factor analysis (CFA) is a special form of factor analysis, most commonly used in social research.[ It is used to test whether measures of a construct are consistent with a researcher's understanding of the nature of that construct (or factor). As such, the objective of confirmatory factor analysis is to test whether the data fit a hypothesized measurement model. This hypothesized model is based on theory and/or previous analytic research. CFA was first developed by Jöreskog and has built upon and replaced older methods of analyzing construct validity such as the MTMM Matrix as described in Campbell & Fiske (1959).
In confirmatory factor analysis, the researcher first develops a hypothesis about what factors s/he believes are underlying the measures s/he has used (e.g., "Depression" being the factor underlying the Beck Depression Inventory and the Hamilton Rating Scale for Depression) and may impose constraints on the model based on these a priori hypotheses. By imposing these constraints, the researcher is forcing the model to be consistent with his/her theory. For example, if it is posited that there are two factors accounting for the covariance in the measures, and that these factors are unrelated to one another, the researcher can create a model where the correlation between factor A and factor B is constrained to zero. Model fit measures could then be obtained to assess how well the proposed model captured the covariance between all the items or measures in the model. If the constraints the researcher has imposed on the model are inconsistent with the sample data, then the results of statistical tests of model fit will indicate a poor fit, and the model will be rejected. If the fit is poor, it may be due to some items measuring multiple factors. It might also be that some items within a factor are more related to each other than others.
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in four observed variables mainly reflect the variations in two unobserved variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modelled as linear combinations of the potential factors, plus "error" terms. The information gained about the interdependencies between observed variables can be used later to reduce the set of variables in a dataset. Computationally this technique is equivalent to low-rank approximation of the matrix of observed variables. Factor analysis originated in psychometrics and is used in behavioral sciences, social sciences, marketing, product management, operations research, and other fields that deal with data sets where there are large numbers of observed variables that are are thought to reflect a small number of latent variables.
Factor analysis is related to principal component analysis (PCA), but the two are not identical. Latent variable models, including factor analysis, use regression modelling techniques to test hypotheses producing error terms, while PCA is a descriptive statistical technique. There has been significant controversy in the field over the equivalence or otherwise of the two techniques (see exploratory factor analysis versus principal components analysis).[
In multivariate statistics, exploratory factor analysis (EFA) is a statistical method used to uncover the underlying structure of a relatively large set of variables. EFA is a technique within factor analysis whose overarching goal is to identify the underlying relationships between measured variables. It is commonly used by researchers when developing a scale (a scale is a collection of questions used to measure a particular research topic) and serves to identify a set of latent constructs underlying a battery of measured variables. It should be used when the researcher has no a priori hypothesis about factors or patterns of measured variables. Measured variables are any one of several attributes of people that may be observed and measured. An example of a measured variable would be the physical height of a human being. Researchers must carefully consider the number of measured variables to include in the analysis. EFA procedures are more accurate when each factor is represented by multiple measured variables in the analysis.
EFA is based on the common factor model. Within the common factor model, a function of common factors, unique factors, and errors of measurements expresses measured variables. Common factors inﬂuence two or more measured variables, while each unique factor inﬂuences only one measured variable and does not explain correlations among measured variables.
EFA assumes that any indicator/measured variable may be associated with any factor. When developing a scale, researchers should use EFA first before moving on to confirmatory factor analysis (CFA). EFA requires the researcher to make a number of important decisions about how to conduct the analysis because there is no one set method.
In statistics, imputation is the process of replacing missing data with substituted values. When substituting for a data point, it is known as "unit imputation"; when substituting for a component of a data point, it is known as "item imputation". Because missing data can create problems for analyzing data, imputation is seen as a way to avoid pitfalls involved with listwise deletion of cases that have missing values. That is to say, when one or more values are missing for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with a probable value based on other available information. Once all missing values have been imputed, the data set can then be analysed using standard techniques for complete data.
In statistics, a mediation model is one that seeks to identify and explicate the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third explanatory variable, known as a mediator variable. Rather than hypothesizing a direct causal relationship between the independent variable and the dependent variable, a mediational model hypothesizes that the independent variable influences the mediator variable, which in turn influences the dependent variable. Thus, the mediator variable serves to clarify the nature of the relationship between the independent and dependent variables. In other words, mediating relationships occur when a third variable plays an important role in governing the relationship between the other two variables.
Measurement invariance or measurement equivalence is a statistical property of measurement that indicates that the same construct is being measured across some specified groups. For example, measurement invariance can be used to study whether a given measure is interpreted in a conceptually similar manner by respondents representing different genders or cultural backgrounds. Violations of measurement invariance may preclude meaningful interpretation of measurement data. Tests of measurement invariance are increasingly used in fields such as psychology to supplement evaluation of measurement quality rooted in classical test theory. Measurement invariance is often tested in the framework of multiple-group confirmatory factor analysis (CFA).
Latent growth modeling is a statistical technique used in the structural equation modeling (SEM) framework to estimate growth trajectory. It is a longitudinal analysis technique to estimate growth over a period of time. It is widely used in the field of behavioral science, education and social science. It is also called latent growth curve analysis. The latent growth model was derived from theories of SEM. General purpose SEM software, such as OpenMx, lavaan (both open source packages based in R), AMOS, Mplus, LISREL, or EQS among others may be used to estimate the trajectory of growth.
Latent Growth Models represent repeated measures of dependent variables as a function of time and other measures. Such longitudinal data share the features that the same subjects are observed repeatedly over time, and on the same tests (or parallel versions), and at known times. In latent growth modeling, the relative standing of an individual at each time is modeled as a function of an underlying growth process, with the best parameter values for that growth process being fitted to each individual.
These models have grown in use in social and behavioral research since it was shown that they can be fitted as a restricted common factor model in the structural equation modeling framework.
Dr. Geoffrey Hubona held full-time tenure-track, and tenured, assistant and associate professor faculty positions at 3 major state universities in the Eastern United States from 1993-2010. In these positions, he taught dozens of various statistics, business information systems, and computer science courses to undergraduate, master's and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL; an MA in Economics, also from USF; an MBA in Finance from George Mason University in Fairfax, VA; and a BA in Psychology from the University of Virginia in Charlottesville, VA. He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-Present), online educational organizations that teach research methods and quantitative analysis techniques. These research methods techniques include linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling.