In this tutorial, we show how to run an exploratory factor analysis with structured residuals (EFAST) model on a dataset with estimated volume on 68 regions of interest (ROIs) in both hemispheres. Data of 647 participants from the Cam-CAN cohort1 is included in the efast
package – synthesized through the synthpop
package2 to preserve individual subject’s privacy. The data can be loaded as follows:
Upon plotting the correlation matrix of this dataset, we can see that there are secondary diagonals. These indicate symmetry structure: the first 34 variables are the left-hemisphere ROIs, and the last 34 variables are their contralateral homologues. The strong secondary diagonals show that the contralateral homologues correlate strongly in their amount of volume. The correlation matrix can be plotted using the cplot()
function as follows (cplot()
is based on the corrplot
package3):
The goal of the EFAST routine is to extract \(M\) factors that explain the correlations among the ROIs as well as possible, while taking into account the symmetry. The resulting factors can then be used for further analysis, for example to compare different groups in their structural volume.
The efast
package is a wrapper around the lavaan
package4 with a convenient interface for EFAST models. In addition, it adds some detailed constraints on the parameters, specifically for the EFAST model, that aid in estimation. The parameters that need to be entered by the researcher are:
data
The dataset to be analyzedM
The number of factors to extractlh_idx
The column indices of the left-hemisphere variables in the datarh_idx
The column indices of the right-hemisphere variables in the dataAn EFAST model for the volume ROI data with 6 factors can thus be computed as follows:
The function allows for other arguments to be passed on to the lavaan
function as well, to change many aspects of model estimation (see ?lavoptions
for more information). Most notably, the rotation
can be changed here as well, if a different rotation than the default oblique geomin rotation is desired. The resulting object also inherits from the lavaan
class, meaning methods such as summary()
and print()
are, by default, implemented in this object.
Since an EFAST model is a structural equation model, the entire range of fit metrics is available (e.g, \(\chi^2\), AIC, BIC, CFI, TLI, RMSEA). They can be investigated using the fitmeasures()
command from lavaan
:
## npar fmin chisq
## 495.000 2.908 3762.998
## df pvalue baseline.chisq
## 1851.000 0.000 38406.398
## baseline.df baseline.pvalue cfi
## 2278.000 0.000 0.947
## tli nnfi rfi
## 0.935 0.935 0.879
## nfi pnfi ifi
## 0.902 0.733 0.948
## rni logl unrestricted.logl
## 0.947 -45934.661 -43190.394
## aic bic ntotal
## 92859.321 95073.133 647.000
## bic2 rmsea rmsea.ci.lower
## 93501.524 0.040 0.038
## rmsea.ci.upper rmsea.pvalue rmr
## 0.042 1.000 0.340
## rmr_nomean srmr srmr_bentler
## 0.340 0.340 0.340
## srmr_bentler_nomean crmr crmr_nomean
## 0.340 0.305 0.305
## srmr_mplus srmr_mplus_nomean cn_05
## 0.303 0.303 336.657
## cn_01 gfi agfi
## 344.097 0.774 0.713
## pgfi mfi ecvi
## 0.610 0.228 7.346
For example, we can see in the RMSEA value of 0.0399566 that the model fit is quite good – the model approximates the elements of the observed covariance matrix well. Additionally, the fit of different models with different numbers of factors can also be investigated to determine the optimal number of factors. Another model comparison that can be made is to a model without symmetry structure. This can be done in several ways, for example through the relative fit indices such as BIC and AIC, or through likelihood ratio tests.
Below we compare the EFAST model to an identically estimated EFA model with 6 factors.
A likelihood ratio test can be performed via the anova()
or the lavTestLRT()
function from the lavaan
package:
## Chi-Squared Difference Test
##
## Df AIC BIC Chisq Chisq diff Df diff Pr(>Chisq)
## efast_fit 1851 92859 95073 3763.0
## efa_fit 1885 96023 98085 6603.6 2840.7 34 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The resulting table shows that the EFAST model with 6 factors fits significantly better than the pure EFA model with 6 factors: \(\chi^2\)(34) = 2840.7, p < .001.
Two relevant methods for performing dimension reduction through EFAST are predict()
for extracting the per-participant factor values for further analysis, and the efast_loadings()
function, for extracting and displaying the factor loadings to give an interpretation to each of the factors. The loadings can optionally be displayed in a convenient symmetric way, with the loadings of the left and right hemispheres for each ROI side-by-side (note that while this aids interpretation of the loadings themselves, such as easily showing that factor 2 is relatively assymmetric, the bottom table with the proportions of explained variance now displays wrong values):
##
## Loadings:
## F1_lh F1_rh F2_lh F2_rh F3_lh F3_rh F4_lh F4_rh
## [1,] 0.803 0.746
## [2,]
## [3,] 0.521 0.379
## [4,] 0.808 0.592
## [5,]
## [6,] 0.303 0.330
## [7,] 0.624 0.447
## [8,] 0.431
## [9,]
## [10,] 0.367 0.351 0.321
## [11,]
## [12,] 0.650 0.610
## [13,]
## [14,] 0.673 0.488
## [15,]
## [16,] 0.548 0.652
## [17,] 0.708
## [18,]
## [19,] 0.550
## [20,] 0.860 0.719
## [21,] 0.560 0.527
## [22,]
## [23,] 0.735 0.537
## [24,]
## [25,] 0.419
## [26,]
## [27,] 0.307 0.399 0.330
## [28,]
## [29,] 0.451 0.305
## [30,] 0.367
## [31,]
## [32,]
## [33,]
## [34,]
##
## F1_lh F1_rh F2_lh F2_rh F3_lh F3_rh F4_lh F4_rh
## SS loadings 2.753 1.922 1.004 0.304 2.177 1.547 1.926 1.855
## Proportion Var 0.081 0.057 0.030 0.009 0.064 0.046 0.057 0.055
## Cumulative Var 0.081 0.138 0.167 0.176 0.240 0.286 0.342 0.397
Below, the factors scores for each person are stored. The predict()
function generates values (factor scores) for all latent variables in the model, and the first 6 of these are the EFA factors. As an example of further analysis, these scores can be used to generate 2 latent brain volume subgroups using k-means clustering: one group with relatively low volume, and one with relatively high volume.
## K-means clustering with 2 clusters of sizes 352, 295
##
## Cluster means:
## F1 F2 F3 F4 F5 F6
## 1 -0.6219663 -0.3880039 -0.5019330 -0.6063915 -0.6665400 -0.6117761
## 2 0.7421429 0.4629742 0.5989166 0.7235587 0.7953291 0.7299837
##
## Clustering vector:
## [1] 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 2 2 1 1 1 1
## [33] 1 1 2 1 2 1 2 1 1 1 2 2 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2
## [65] 2 1 2 2 1 2 1 1 2 1 2 2 2 1 2 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 1 1
## [97] 2 2 2 1 1 1 2 2 1 2 1 2 2 1 2 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1
## [129] 1 2 1 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 2 2 2 1 1 1 2 2
## [161] 1 2 2 2 1 2 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 2 1 2 1 1 1 2 2 1 1 1
## [193] 1 2 1 2 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 2 1 2 2 2 1 2
## [225] 2 2 2 1 1 2 2 1 1 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 2 1 1 1 1 1 2 2
## [257] 1 2 2 2 1 1 1 1 2 2 2 1 1 2 2 2 1 2 1 1 1 2 1 1 2 1 1 1 1 2 2 2
## [289] 1 1 2 1 2 2 1 1 1 2 2 2 2 1 1 2 2 1 2 2 2 2 1 1 1 2 1 1 2 2 1 2
## [321] 2 1 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 2 1 1 1 2 2 2 2 1 2
## [353] 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 2 1 1 2 2 1 1 1 1 2 1
## [385] 1 1 1 1 1 2 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2 1 2 1
## [417] 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 2 2 2 2 2 2 1 1 1 1 1 2 1
## [449] 1 1 1 2 2 1 2 1 2 1 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
## [481] 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 2 2 1 1 1 2 1 1 1 2 1 2 1 2 2 1 2
## [513] 2 1 2 2 2 1 1 1 2 2 1 1 2 1 2 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 2 1
## [545] 1 2 1 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2 1 2 1 1 1 2 1 1 2 2 1 1 2 1
## [577] 1 2 1 2 1 2 1 2 2 2 2 1 1 2 2 2 1 2 1 1 2 1 2 2 1 1 1 1 2 2 1 1
## [609] 2 2 1 1 1 1 1 2 2 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 1 1 1 2 2 2
## [641] 2 1 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 980.2925 1135.3208
## (between_SS / total_SS = 41.9 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
We can then plot the scores of the participants on the first two factors and show the clustering by using colours:
The ROI method factors themselves can be used to determine per ROI the amount of symmetry and, conversely, the amount of lateralization.
Through including lateralization index coefficients in the model code (see extending the EFAST model
below), the efast
package provides a nice convenience function for showing how lateralized each ROI is, with automatic standard errors based on the delta method. These standard errors can also be based on bootstrapping, for example, by changing the efast_hemi()
call to include se = "bootstrap"
. These lateralization indices – indicating how much residual variance in the observed variables cannot be explained by the symmetry structure – are computed for each ROI as \[\begin{equation}
\label{eq:li}
LI_i = 1 - \text{cor}(u^{lh}_i, u^{rh}_i)
\end{equation}\] where \(u^{lh}_i\) and \(u^{rh}_i\) are residuals given the trait factors of interest of the \(i^{th}\) ROI in the left and right hemisphere, respectively. The correlation \(\text{cor}(\cdot\,, \cdot)\) between these residuals represents the amount of symmetry, so the \(LI_i\) represents the of the \(i^{th}\) ROI in the two hemispheres.
## region est se z pvalue ci.lower ci.upper
## 1 ROI1 1.076 0.053 20.154 0.000 0.971 1.181
## 2 ROI2 0.898 0.040 22.578 0.000 0.820 0.976
## 3 ROI3 0.652 0.037 17.775 0.000 0.580 0.723
## 4 ROI4 0.694 0.045 15.554 0.000 0.607 0.782
## 5 ROI5 0.679 0.036 18.742 0.000 0.608 0.750
## 6 ROI6 0.563 0.033 16.910 0.000 0.498 0.628
## 7 ROI7 0.800 0.042 18.951 0.000 0.717 0.882
## 8 ROI8 0.504 0.031 16.376 0.000 0.444 0.565
## 9 ROI9 0.825 0.039 20.896 0.000 0.748 0.902
## 10 ROI10 0.479 0.030 15.995 0.000 0.420 0.538
## 11 ROI11 0.498 0.040 12.499 0.000 0.420 0.576
## 12 ROI12 0.646 0.039 16.707 0.000 0.570 0.722
## 13 ROI13 0.847 0.043 19.509 0.000 0.762 0.932
## 14 ROI14 0.484 0.033 14.886 0.000 0.420 0.548
## 15 ROI15 0.518 0.031 16.875 0.000 0.457 0.578
## 16 ROI16 0.744 0.043 17.505 0.000 0.661 0.828
## 17 ROI17 0.885 0.070 12.705 0.000 0.748 1.021
## 18 ROI18 0.890 0.044 20.230 0.000 0.804 0.977
## 19 ROI19 0.923 0.056 16.422 0.000 0.813 1.033
## 20 ROI20 0.706 0.054 13.072 0.000 0.600 0.811
## 21 ROI21 0.585 0.036 16.137 0.000 0.514 0.656
## 22 ROI22 0.832 0.039 21.142 0.000 0.754 0.909
## 23 ROI23 0.483 0.038 12.636 0.000 0.408 0.558
## 24 ROI24 1.300 0.487 2.670 0.008 0.346 2.254
## 25 ROI25 0.875 0.040 21.713 0.000 0.796 0.954
## 26 ROI26 0.446 0.028 15.746 0.000 0.390 0.501
## 27 ROI27 0.447 0.029 15.246 0.000 0.390 0.505
## 28 ROI28 0.566 0.038 14.865 0.000 0.491 0.640
## 29 ROI29 0.615 0.036 17.295 0.000 0.545 0.685
## 30 ROI30 0.579 0.034 17.164 0.000 0.512 0.645
## 31 ROI31 0.638 0.034 18.519 0.000 0.570 0.705
## 32 ROI32 0.742 0.038 19.761 0.000 0.668 0.816
## 33 ROI33 0.737 0.037 19.728 0.000 0.663 0.810
## 34 ROI34 0.373 0.027 13.682 0.000 0.320 0.427
The Cam-CAN cohort (2015). https://www.cam-can.org/↩︎
Beata Nowok, Gillian M. Raab, Chris Dibben (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11↩︎
Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot↩︎
Yves Rosseel (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. URL http://www.jstatsoft.org/v48/i02/.↩︎