Go back

Introduction

In this tutorial, we show how to run an exploratory factor analysis with structured residuals (EFAST) model on a dataset with estimated volume on 68 regions of interest (ROIs) in both hemispheres. Data of 647 participants from the Cam-CAN cohort1 is included in the efast package – synthesized through the synthpop package2 to preserve individual subject’s privacy. The data can be loaded as follows:

data("roi_volume")

Upon plotting the correlation matrix of this dataset, we can see that there are secondary diagonals. These indicate symmetry structure: the first 34 variables are the left-hemisphere ROIs, and the last 34 variables are their contralateral homologues. The strong secondary diagonals show that the contralateral homologues correlate strongly in their amount of volume. The correlation matrix can be plotted using the cplot() function as follows (cplot() is based on the corrplot package3):

cplot(cor(roi_volume))

The goal of the EFAST routine is to extract \(M\) factors that explain the correlations among the ROIs as well as possible, while taking into account the symmetry. The resulting factors can then be used for further analysis, for example to compare different groups in their structural volume.

Estimating an EFAST model

The efast package is a wrapper around the lavaan package4 with a convenient interface for EFAST models. In addition, it adds some detailed constraints on the parameters, specifically for the EFAST model, that aid in estimation. The parameters that need to be entered by the researcher are:

  • data The dataset to be analyzed
  • M The number of factors to extract
  • lh_idx The column indices of the left-hemisphere variables in the data
  • rh_idx The column indices of the right-hemisphere variables in the data

An EFAST model for the volume ROI data with 6 factors can thus be computed as follows:

efast_fit <- efast_hemi(
  data   = roi_volume,
  M      = 6,
  lh_idx = 1:34,
  rh_idx = 35:68
)

The function allows for other arguments to be passed on to the lavaan function as well, to change many aspects of model estimation (see ?lavoptions for more information). Most notably, the rotation can be changed here as well, if a different rotation than the default oblique geomin rotation is desired. The resulting object also inherits from the lavaan class, meaning methods such as summary() and print() are, by default, implemented in this object.

Model fit and model comparison

Since an EFAST model is a structural equation model, the entire range of fit metrics is available (e.g, \(\chi^2\), AIC, BIC, CFI, TLI, RMSEA). They can be investigated using the fitmeasures() command from lavaan:

fitmeasures(efast_fit)
##                npar                fmin               chisq 
##             495.000               2.908            3762.998 
##                  df              pvalue      baseline.chisq 
##            1851.000               0.000           38406.398 
##         baseline.df     baseline.pvalue                 cfi 
##            2278.000               0.000               0.947 
##                 tli                nnfi                 rfi 
##               0.935               0.935               0.879 
##                 nfi                pnfi                 ifi 
##               0.902               0.733               0.948 
##                 rni                logl   unrestricted.logl 
##               0.947          -45934.661          -43190.394 
##                 aic                 bic              ntotal 
##           92859.321           95073.133             647.000 
##                bic2               rmsea      rmsea.ci.lower 
##           93501.524               0.040               0.038 
##      rmsea.ci.upper        rmsea.pvalue                 rmr 
##               0.042               1.000               0.340 
##          rmr_nomean                srmr        srmr_bentler 
##               0.340               0.340               0.340 
## srmr_bentler_nomean                crmr         crmr_nomean 
##               0.340               0.305               0.305 
##          srmr_mplus   srmr_mplus_nomean               cn_05 
##               0.303               0.303             336.657 
##               cn_01                 gfi                agfi 
##             344.097               0.774               0.713 
##                pgfi                 mfi                ecvi 
##               0.610               0.228               7.346

For example, we can see in the RMSEA value of 0.0399566 that the model fit is quite good – the model approximates the elements of the observed covariance matrix well. Additionally, the fit of different models with different numbers of factors can also be investigated to determine the optimal number of factors. Another model comparison that can be made is to a model without symmetry structure. This can be done in several ways, for example through the relative fit indices such as BIC and AIC, or through likelihood ratio tests.

Below we compare the EFAST model to an identically estimated EFA model with 6 factors.

efa_fit <- efast_efa(data = roi_volume, M = 6)

A likelihood ratio test can be performed via the anova() or the lavTestLRT() function from the lavaan package:

lavTestLRT(efast_fit, efa_fit)
## Chi-Squared Difference Test
## 
##             Df   AIC   BIC  Chisq Chisq diff Df diff Pr(>Chisq)    
## efast_fit 1851 92859 95073 3763.0                                  
## efa_fit   1885 96023 98085 6603.6     2840.7      34  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The resulting table shows that the EFAST model with 6 factors fits significantly better than the pure EFA model with 6 factors: \(\chi^2\)(34) = 2840.7, p < .001.

Dimension reduction

Two relevant methods for performing dimension reduction through EFAST are predict() for extracting the per-participant factor values for further analysis, and the efast_loadings() function, for extracting and displaying the factor loadings to give an interpretation to each of the factors. The loadings can optionally be displayed in a convenient symmetric way, with the loadings of the left and right hemispheres for each ROI side-by-side (note that while this aids interpretation of the loadings themselves, such as easily showing that factor 2 is relatively assymmetric, the bottom table with the proportions of explained variance now displays wrong values):

lds <- efast_loadings(efast_fit, symmetry = TRUE)
# we only show the loadings for the first 4 factors to ensure it fits
print(lds, cutoff = 0.3)
## 
## Loadings:
##       F1_lh  F1_rh  F2_lh  F2_rh  F3_lh  F3_rh  F4_lh  F4_rh 
##  [1,]  0.803  0.746                                          
##  [2,]                                                        
##  [3,]                                            0.521  0.379
##  [4,]                              0.808  0.592              
##  [5,]                                                        
##  [6,]  0.303  0.330                                          
##  [7,]  0.624  0.447                                          
##  [8,]  0.431                                                 
##  [9,]                                                        
## [10,]  0.367                       0.351  0.321              
## [11,]                                                        
## [12,]                              0.650  0.610              
## [13,]                                                        
## [14,]  0.673  0.488                                          
## [15,]                                                        
## [16,]                                            0.548  0.652
## [17,]                0.708                                   
## [18,]                                                        
## [19,]                0.550                                   
## [20,]                              0.860  0.719              
## [21,]                                            0.560  0.527
## [22,]                                                        
## [23,]                                            0.735  0.537
## [24,]                                                        
## [25,]  0.419                                                 
## [26,]                                                        
## [27,]         0.307                              0.399  0.330
## [28,]                                                        
## [29,]  0.451  0.305                                          
## [30,]         0.367                                          
## [31,]                                                        
## [32,]                                                        
## [33,]                                                        
## [34,]                                                        
## 
##                F1_lh F1_rh F2_lh F2_rh F3_lh F3_rh F4_lh F4_rh
## SS loadings    2.753 1.922 1.004 0.304 2.177 1.547 1.926 1.855
## Proportion Var 0.081 0.057 0.030 0.009 0.064 0.046 0.057 0.055
## Cumulative Var 0.081 0.138 0.167 0.176 0.240 0.286 0.342 0.397

Below, the factors scores for each person are stored. The predict() function generates values (factor scores) for all latent variables in the model, and the first 6 of these are the EFA factors. As an example of further analysis, these scores can be used to generate 2 latent brain volume subgroups using k-means clustering: one group with relatively low volume, and one with relatively high volume.

pred <- predict(efast_fit)[,1:6]
clus <- kmeans(pred, 2)
clus
## K-means clustering with 2 clusters of sizes 352, 295
## 
## Cluster means:
##           F1         F2         F3         F4         F5         F6
## 1 -0.6219663 -0.3880039 -0.5019330 -0.6063915 -0.6665400 -0.6117761
## 2  0.7421429  0.4629742  0.5989166  0.7235587  0.7953291  0.7299837
## 
## Clustering vector:
##   [1] 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 2 1 2 1 1 2 1 2 1 1 1 2 2 1 1 1 1
##  [33] 1 1 2 1 2 1 2 1 1 1 2 2 1 1 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2
##  [65] 2 1 2 2 1 2 1 1 2 1 2 2 2 1 2 2 1 1 1 2 2 1 2 1 2 1 1 1 1 1 1 1
##  [97] 2 2 2 1 1 1 2 2 1 2 1 2 2 1 2 2 2 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1
## [129] 1 2 1 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 2 2 2 1 1 1 2 2
## [161] 1 2 2 2 1 2 1 1 1 2 1 2 2 2 1 1 1 2 1 1 1 2 1 2 1 1 1 2 2 1 1 1
## [193] 1 2 1 2 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 2 1 2 2 2 1 2
## [225] 2 2 2 1 1 2 2 1 1 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 2 1 1 1 1 1 2 2
## [257] 1 2 2 2 1 1 1 1 2 2 2 1 1 2 2 2 1 2 1 1 1 2 1 1 2 1 1 1 1 2 2 2
## [289] 1 1 2 1 2 2 1 1 1 2 2 2 2 1 1 2 2 1 2 2 2 2 1 1 1 2 1 1 2 2 1 2
## [321] 2 1 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 2 1 1 1 2 2 2 2 1 2
## [353] 1 1 2 1 2 1 1 1 1 2 2 1 1 1 2 2 1 1 1 2 2 2 1 1 2 2 1 1 1 1 2 1
## [385] 1 1 1 1 1 2 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2 1 2 1
## [417] 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 2 2 1 2 2 2 2 2 2 1 1 1 1 1 2 1
## [449] 1 1 1 2 2 1 2 1 2 1 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
## [481] 2 1 2 1 1 1 1 1 1 2 1 1 1 1 2 2 2 1 1 1 2 1 1 1 2 1 2 1 2 2 1 2
## [513] 2 1 2 2 2 1 1 1 2 2 1 1 2 1 2 1 1 2 2 2 2 2 2 2 1 1 1 1 1 1 2 1
## [545] 1 2 1 2 2 1 2 2 1 1 1 1 2 1 2 2 1 2 1 2 1 1 1 2 1 1 2 2 1 1 2 1
## [577] 1 2 1 2 1 2 1 2 2 2 2 1 1 2 2 2 1 2 1 1 2 1 2 2 1 1 1 1 2 2 1 1
## [609] 2 2 1 1 1 1 1 2 2 1 2 2 1 1 1 1 2 2 1 1 2 1 2 1 2 2 1 1 1 2 2 2
## [641] 2 1 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1]  980.2925 1135.3208
##  (between_SS / total_SS =  41.9 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

We can then plot the scores of the participants on the first two factors and show the clustering by using colours:

plot(pred[,c(1, 2)], bg = c("magenta", "orange")[clus$cluster],
     pch = 21, asp = 1, bty = "L")

Lateralization

The ROI method factors themselves can be used to determine per ROI the amount of symmetry and, conversely, the amount of lateralization.

Through including lateralization index coefficients in the model code (see extending the EFAST model below), the efast package provides a nice convenience function for showing how lateralized each ROI is, with automatic standard errors based on the delta method. These standard errors can also be based on bootstrapping, for example, by changing the efast_hemi() call to include se = "bootstrap". These lateralization indices – indicating how much residual variance in the observed variables cannot be explained by the symmetry structure – are computed for each ROI as \[\begin{equation} \label{eq:li} LI_i = 1 - \text{cor}(u^{lh}_i, u^{rh}_i) \end{equation}\] where \(u^{lh}_i\) and \(u^{rh}_i\) are residuals given the trait factors of interest of the \(i^{th}\) ROI in the left and right hemisphere, respectively. The correlation \(\text{cor}(\cdot\,, \cdot)\) between these residuals represents the amount of symmetry, so the \(LI_i\) represents the of the \(i^{th}\) ROI in the two hemispheres.

lateralization(efast_fit)
##    region   est    se      z pvalue ci.lower ci.upper
## 1    ROI1 1.076 0.053 20.154  0.000    0.971    1.181
## 2    ROI2 0.898 0.040 22.578  0.000    0.820    0.976
## 3    ROI3 0.652 0.037 17.775  0.000    0.580    0.723
## 4    ROI4 0.694 0.045 15.554  0.000    0.607    0.782
## 5    ROI5 0.679 0.036 18.742  0.000    0.608    0.750
## 6    ROI6 0.563 0.033 16.910  0.000    0.498    0.628
## 7    ROI7 0.800 0.042 18.951  0.000    0.717    0.882
## 8    ROI8 0.504 0.031 16.376  0.000    0.444    0.565
## 9    ROI9 0.825 0.039 20.896  0.000    0.748    0.902
## 10  ROI10 0.479 0.030 15.995  0.000    0.420    0.538
## 11  ROI11 0.498 0.040 12.499  0.000    0.420    0.576
## 12  ROI12 0.646 0.039 16.707  0.000    0.570    0.722
## 13  ROI13 0.847 0.043 19.509  0.000    0.762    0.932
## 14  ROI14 0.484 0.033 14.886  0.000    0.420    0.548
## 15  ROI15 0.518 0.031 16.875  0.000    0.457    0.578
## 16  ROI16 0.744 0.043 17.505  0.000    0.661    0.828
## 17  ROI17 0.885 0.070 12.705  0.000    0.748    1.021
## 18  ROI18 0.890 0.044 20.230  0.000    0.804    0.977
## 19  ROI19 0.923 0.056 16.422  0.000    0.813    1.033
## 20  ROI20 0.706 0.054 13.072  0.000    0.600    0.811
## 21  ROI21 0.585 0.036 16.137  0.000    0.514    0.656
## 22  ROI22 0.832 0.039 21.142  0.000    0.754    0.909
## 23  ROI23 0.483 0.038 12.636  0.000    0.408    0.558
## 24  ROI24 1.300 0.487  2.670  0.008    0.346    2.254
## 25  ROI25 0.875 0.040 21.713  0.000    0.796    0.954
## 26  ROI26 0.446 0.028 15.746  0.000    0.390    0.501
## 27  ROI27 0.447 0.029 15.246  0.000    0.390    0.505
## 28  ROI28 0.566 0.038 14.865  0.000    0.491    0.640
## 29  ROI29 0.615 0.036 17.295  0.000    0.545    0.685
## 30  ROI30 0.579 0.034 17.164  0.000    0.512    0.645
## 31  ROI31 0.638 0.034 18.519  0.000    0.570    0.705
## 32  ROI32 0.742 0.038 19.761  0.000    0.668    0.816
## 33  ROI33 0.737 0.037 19.728  0.000    0.663    0.810
## 34  ROI34 0.373 0.027 13.682  0.000    0.320    0.427

  1. The Cam-CAN cohort (2015). https://www.cam-can.org/↩︎

  2. Beata Nowok, Gillian M. Raab, Chris Dibben (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1-26. doi:10.18637/jss.v074.i11↩︎

  3. Taiyun Wei and Viliam Simko (2017). R package “corrplot”: Visualization of a Correlation Matrix (Version 0.84). Available from https://github.com/taiyun/corrplot↩︎

  4. Yves Rosseel (2012). lavaan: An R Package for Structural Equation Modeling. Journal of Statistical Software, 48(2), 1-36. URL http://www.jstatsoft.org/v48/i02/.↩︎