It is often both useful and revealing to create visualizations, plots and graphs of the multivariate data that is the subject of one's research project. Often, both pre-analysis and post-analysis visualizations can help one understand “what is going on in the data" in a way that looking at numerical summaries of fitted model estimates cannot. The lattice package in R is uniquely designed to graphically depict relationships in multivariate data sets.
This course describes and demonstrates this creative approach for constructing and drawing grid-based multivariate graphic plots and figures using R. Lattice graphics are characterized as multi-variable (3, 4, 5 or more variables) plots that use conditioning and paneling. Consequently, it is a popular approach for, and a good fit to visually present the results of multi-variable statistical model fitting. The appearance of most of the plots, graphs and figures are determined by panel functions, rather than by the high-level graphics function calls themselves. Further, the user of lattice graphics has extensive and comprehensive control over many more of the details and features of the visual plots, far greater control that is afforded by the base graphics approach in R. The method is based on trellis graphics which were popularized in the S language developed by Bell Labs.
The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs - graphs that display a variable or the relationship between variables, conditioned on one or more other variables.
The typical format is
where graph_type is selected from the listed below. formula specifies the variable(s) to display and any conditioning variables . For example ~x|A means display numeric variable x for each level of factor A.y~x | A*B means display the relationship between numeric variables y and x separately for every combination of factor A and B levels. ~x means display numeric variable x alone.
A trellis object, as returned by high level lattice functions like
xyplot, is a list with the
"class" attribute set to
"trellis". Many of the components of this list are simply the arguments to the high level function that produced the object. Among them are:
Box-and-whisker plots summarize the data using a few quantiles, and possibly some outliers. This summarizing can be important when the number of observations is large. When the number of observations per sample is small, it is often sufficient to simply plot the sample values side by side in a common scale. Such plots are known as strip plots, also referred to as univariate scatter plots. They are in fact very similar to the bivariate scatter plots.
An important subset of statistical data comes in the form of tables. Tables usually record the frequency or proportion of observations that fall into a particular category or combination of categories. They could also encode some other summary measure such as a rate (of binary events) or mean (of a continuous variable). In R, tables are usually represented by arrays of one (vectors), two (matrices), or more dimensions. To distinguish them from other vectors and arrays, they often have class “table”. The R functions table() and xtabs() can be used to create tables from raw data.
A scatter plot graphs two variables directly against each other in a Cartesian coordinate system. It is a simple graphic in the sense that the data are directly encoded without being summarized in any way; often the aspects that the user needs to worry about most are graphical ones such as whether to join the points by a line, what colors to use, and so on. Depending on the purpose, scatter plots can also be enhanced in several ways. In this chapter, we go over some of the variants supported by panel.xyplot(), which is the default panel function for both xyplot() and splom() (under the alias panel.splom()).
Scatter-plot matrices, produced by splom(), are exactly what the name suggests; they are a matrix of pairwise scatter plots given two or more variables. Conditioning is possible, but it is more common to call splom() with a data frame as its first argument.
Like scatter-plot matrices, parallel coordinates plots are hypervariate in nature, that is, they show relationships between an arbitrary number of variables. Their design is related to univariate scatter plots; in fact, they are basically univariate scatter plots of all variables of interest stacked parallel to each other (vertically in the implementation in lattice), with values that correspond to the same observation linked by line segments.
Trivariate displays encode three primary variables in a panel. There are four high-level functions in lattice that produce trivariate displays: cloud() creates three-dimensional scatter plots of unstructured trivariate data, whereas levelplot(), contourplot(), and wireframe() render surfaces or two dimensional tables evaluated on a systematic rectangular grid. Of these, cloud() and wireframe() are similar in that they both create two-dimensional projections of three-dimensional constructs, and they share several common arguments that control the details of the projection.
We begin with cloud(), which produces three-dimensional scatter plots. Most of the discussion in this section about projection and how to control it in cloud() applies to wireframe() as well.
The methods we used to plot regression surfaces using wireframe() can be easily adapted to mathematical surfaces.
Graphical parameters are often critical in determining the effectiveness of a plot. Such parameters include obvious ones such as colors, symbols, line types, and fonts for the various elements of a graph, as well as more subtle ones such as the length of tick marks or the amount of space separating different components of the graph. The parameters used in lattice displays are highly customizable. Many of them can be controlled directly by specifying suitable arguments in a high-level function call. Most derive their default values from a system of common global settings that can also be modified by the user. The latter approach has two primary benefits: it allows good global defaults to be specified, and it provides a consistent “look and feel” to lattice graphics while letting the user retain ultimate control.
Dr. Geoffrey Hubona held full-time tenure-track, and tenured, assistant and associate professor faculty positions at 3 major state universities in the Eastern United States from 1993-2010. Currently, he is a visiting associate professor of MIS at Texas A&M International University. In these positions, he taught dozens of various statistics, business information systems, and computer science courses to undergraduate, master's and Ph.D. students. He earned a Ph.D. in Business Administration (Information Systems and Computer Science) from the University of South Florida (USF) in Tampa, FL; an MA in Economics, also from USF; an MBA in Finance from George Mason University in Fairfax, VA; and a BA in Psychology from the University of Virginia in Charlottesville, VA. He is the founder of the Georgia R School (2010-2014) and of R-Courseware (2014-Present), online educational organizations that teach research methods and quantitative analysis techniques. These research methods techniques include linear and non-linear modeling, multivariate methods, data mining, programming and simulation, and structural equation modeling and partial least squares (PLS) path modeling.