In Factor Analysis, we have measurements on \(k\) variables: \((X_1, X_2,\ldots, X_k)\). Data on these \(k\) variables are recorded for \(n\) individuals. Although there are many different measurements, we believe that the outcomes of these variables are manifestations of a few underlying latent factors. We would like to discover what those latent factors might be. As an example, let us consider a problem based on an example from the book "Analyzing Multivariate Data" by Lattin, Carroll and Green.

Problem

An automobile company wanted to design a new luxury car concept. To help position the new car, they wanted to understand what considerations may be driving consumers' perceptions of cars. They conducted a market research study in which 162 customers were surveyed about their perceptions of different car models. The customers were asked to rate familiar car brands on 9 dimensions, namely \[Luxury,~~Style,~~Reliability,~~Fuel~Economy,~~Safety, ~~Maintenance,~~Quality,~~Durability ~~and~~ Performance.\] The correlation matrix of the ratings is given below. The ratings on several dimensions are highly correlated, so a natural question is whether a smaller set of underlying dimensions can help explain these ratings.

Read the data correlation matrix

In this example, we have the correlation matrix of the ratings available to us (rather than the raw ratings themselves).

setwd('C:\\Users\\IIMA\\Google Drive\\SMDA\\SMDA2020\\3. LuxuryCars and BreakfastCereals')
C <- read.csv("luxury_car.csv")
C[upper.tri(C)] <- 0                          # keep only the lower triangle (and diagonal)

CC <- (C + t(C)) - diag(diag(as.matrix(C)))   # reflect it to get the full symmetric correlation matrix

rownames(CC) <- colnames(CC)                  # label the rows like the columns
library(corrplot)
## corrplot 0.84 loaded
corrplot(as.matrix(CC), method="number", number.cex=.75, tl.cex=.75)

#corrplot(as.matrix(CC), method="circle")
corrplot(as.matrix(CC), order="hclust", method="number", number.cex=.75, tl.cex=.75)

The factor model formulation

We have 9 variables, which we can label as \(X_1\) (Luxury), \(X_2\) (Style), \(\ldots\), \(X_9\) (Performance). For simplicity, suppose that there are only two underlying latent factors \((\xi_1, \xi_2)\) driving the way consumers rate the \(k\) dimensions. A simple mathematical formulation assumes that each of the \(k\) dimensions is linearly related to the factors and can be written as follows \[\begin{eqnarray} {X}_1 &=& \beta_{11}\xi_1 + \beta_{12}\xi_2 + \delta_1 \\ {X}_2 &=& \beta_{21}\xi_1 + \beta_{22}\xi_2 + \delta_2 \\ \vdots\\ {X}_k &=& \beta_{k1}\xi_1 + \beta_{k2}\xi_2 + \delta_k. \end{eqnarray}\]

The objective is then to estimate the \(\beta\) parameters and to assign scores on the factors \(\xi_1\) and \(\xi_2\) to each individual.

Assumptions
The model makes the following assumptions \[\begin{eqnarray} var(X_i)&=& 1 \mbox{ (i.e. data is centered and scaled)}\\ var(\xi_i)&=&1 \\ cov(\delta_i, \xi_l)&=&0\\ cov(\xi_l, \xi_m)&=&0 , ~~l\ne m\\ cov(\delta_i, \delta_j)&=&0, ~~ i\ne j. \end{eqnarray}\]

Idea for estimating the \(\beta\) parameters

The factor model can be written in matrix form as \[ \boldsymbol{X}= \boldsymbol{\beta}\boldsymbol{\xi} + \boldsymbol{\delta}, \mbox{ with } E[\boldsymbol{X}]=\boldsymbol{0}, ~~var(\boldsymbol{X})=\Sigma ,~~var(\boldsymbol{\xi})=I, ~~ cov(\boldsymbol{\xi}, \boldsymbol{\delta})=\boldsymbol{0}, ~~ var(\boldsymbol{\delta})= \Theta= diag(\sigma_1^2, ~\sigma_2^2, \ldots, \sigma_k^2)\] \[\mbox{where }\boldsymbol{X}= \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\X_k\end{bmatrix}, ~~\boldsymbol{\beta}= \begin{bmatrix}\beta_{11} & \beta_{12}\\ \beta_{21} & \beta_{22}\\ \vdots & \vdots \\ \beta_{k1} & \beta_{k2}\end{bmatrix}, ~~ \boldsymbol{\xi}= \begin{bmatrix} \xi_1 \\ \xi_2 \end{bmatrix} \mbox{ and } \boldsymbol{\delta}= \begin{bmatrix} \delta_1\\ \delta_2 \\ \vdots \\ \delta_k\end{bmatrix}.\] Since the factors and the errors are uncorrelated, \(var(\boldsymbol{X})=\boldsymbol{\beta}\,var(\boldsymbol{\xi})\,\boldsymbol{\beta}^T + var(\boldsymbol{\delta})\), so it follows from the model that \[\begin{eqnarray} \Sigma &= & \boldsymbol{\beta} \boldsymbol{\beta}^T + \Theta. \end{eqnarray}\]
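
To make this identity concrete, here is a small numerical illustration with a made-up 3-variable, 2-factor \(\boldsymbol{\beta}\) (hypothetical numbers, not the estimates from the car data): choosing \(\Theta\) so that the diagonal of \(\Sigma\) equals 1 produces a valid correlation matrix of the form \(\boldsymbol{\beta}\boldsymbol{\beta}^T + \Theta\).

# toy illustration of Sigma = beta %*% t(beta) + Theta (hypothetical numbers)
beta  <- matrix(c(0.8, 0.3,
                  0.7, 0.4,
                  0.2, 0.9), ncol = 2, byrow = TRUE)
Theta <- diag(1 - rowSums(beta^2))   # error variances chosen so that diag(Sigma) = 1
Sigma <- beta %*% t(beta) + Theta    # implied correlation matrix
round(Sigma, 2)
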
Principal Axis Method to obtain \(\boldsymbol{\beta}\)

If we knew \(\Theta\), we could use the spectral decomposition of \(\Sigma-\Theta\) and take the first \(2\) (in general, as many as the number of factors) eigenvectors and eigenvalues to obtain \(\boldsymbol{\beta}\). One starts with an initial value for \(\Theta\), either \(0\) or one based on the Squared Multiple Correlation (SMC) of each variable with the others. The solution is then obtained iteratively. The code below shows the first step of the iteration.

## initialize var(delta) = 0 and compute loadings from the first two eigenvectors and eigenvalues
test <- eigen(as.matrix(CC) - diag(rep(0, 9)))
## first step of the iteration: loadings = (eigenvectors) %*% sqrt(eigenvalues)
round(test$vectors[,1:2] %*% sqrt(diag(test$values[1:2])), 2)
##        [,1]  [,2]
##  [1,] -0.64  0.61
##  [2,] -0.65  0.40
##  [3,] -0.76 -0.24
##  [4,] -0.39 -0.71
##  [5,] -0.64  0.36
##  [6,] -0.62 -0.45
##  [7,] -0.81 -0.14
##  [8,] -0.74 -0.23
##  [9,] -0.77  0.18
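
For completeness, here is a minimal sketch of the full principal-axis iteration, written directly from the description above (it is not the internal code of any package, and the function name pa_iterate is made up for this illustration): replace the diagonal of the correlation matrix by the current communalities, take the first two eigenvectors and eigenvalues, recompute the communalities from the loadings, and repeat until they stabilize.

pa_iterate <- function(R, nfactors = 2, h2 = rep(1, nrow(R)), tol = 1e-6, max_iter = 100) {
  for (i in 1:max_iter) {
    R_red <- R
    diag(R_red) <- h2                              # reduced correlation matrix
    e <- eigen(R_red, symmetric = TRUE)
    L <- e$vectors[, 1:nfactors] %*% diag(sqrt(pmax(e$values[1:nfactors], 0)))
    h2_new <- rowSums(L^2)                         # updated communalities
    if (max(abs(h2_new - h2)) < tol) break
    h2 <- h2_new
  }
  list(loadings = round(L, 2), communality = round(h2_new, 2), iterations = i)
}
# starting from communalities equal to 1, as in the single step above:
# pa_iterate(as.matrix(CC), nfactors = 2)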

The above can be implemented using the fa() function from the psych package. It can be checked that the code below, with max.iter=1 and SMC=FALSE, gives the same output (possibly up to the signs of the loading columns).

library(data.table)
library(psych)
library(GPArotation)

# first iteration using fa()
fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "none", max.iter=1, SMC=FALSE, fm="pa")
## maximum iteration exceeded
## Factor loadings
print(loadings(fit), cutoff = 0)
## SMC
round(smc(as.matrix(CC), covar=FALSE),2)
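
As a check on these starting values, the SMC of each variable (its \(R^2\) when regressed on the remaining eight variables) can also be computed directly from the inverse of the correlation matrix; this should agree with smc() above.

R <- as.matrix(CC)
round(1 - 1/diag(solve(R)), 2)   # SMC of variable i equals 1 - 1/[R^{-1}]_{ii}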

Iterating to convergence, with SMC as starting values

fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "none", max.iter=100, SMC=TRUE, fm="pa")
fit
## Factor Analysis using method =  pa
## Call: fa(r = as.matrix(CC), nfactors = 2, rotate = "none", SMC = TRUE, 
##     max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              PA1   PA2   h2   u2 com
## Luxury      0.65 -0.63 0.82 0.18 2.0
## Style       0.60 -0.29 0.45 0.55 1.4
## Reliability 0.72  0.22 0.57 0.43 1.2
## FuelEconomy 0.34  0.53 0.40 0.60 1.7
## Safety      0.59 -0.26 0.41 0.59 1.4
## Maintenance 0.57  0.37 0.46 0.54 1.7
## Quality     0.78  0.15 0.63 0.37 1.1
## Durable     0.70  0.22 0.54 0.46 1.2
## Performance 0.73 -0.12 0.54 0.46 1.1
## 
##                        PA1  PA2
## SS loadings           3.72 1.10
## Proportion Var        0.41 0.12
## Cumulative Var        0.41 0.53
## Proportion Explained  0.77 0.23
## Cumulative Proportion 0.77 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 2 factors are sufficient.
## 
## The degrees of freedom for the null model are  36  and the objective function was  3.67
## The degrees of freedom for the model are 19  and the objective function was  0.13 
## 
## The root mean square of the residuals (RMSR) is  0.03 
## The df corrected root mean square of the residuals is  0.04 
## 
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    PA1  PA2
## Correlation of (regression) scores with factors   0.95 0.88
## Multiple R square of scores with factors          0.90 0.77
## Minimum correlation of possible factor scores     0.80 0.54
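
The "regression scores" referred to in the output above would be computed with weights \(W = R^{-1}\boldsymbol{\beta}\) (the standard regression-method formula, not a psych-specific routine). With only a correlation matrix available we cannot score the 162 respondents themselves, but the weights can still be inspected; a short sketch:

W <- solve(as.matrix(CC)) %*% fit$loadings   # regression-method score weights
round(W, 2)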

Communality and Factor Loadings

Communality is the proportion of variance in a variable that is explained by the factor model. Factor loadings are the correlations between the original variables and the factors. From the factor model we have \[\mbox{Communality of variable }X_i: ~~ 1- var(\delta_i)=\beta_{i1}^2 + \beta_{i2}^2 , \] \[\mbox{Factor loadings}:~~ Cov(X_i, \xi_1)= \beta_{i1}, ~~ Cov(X_i, \xi_2)= \beta_{i2}.\]

#communality
fit$communality
##      Luxury       Style Reliability FuelEconomy      Safety Maintenance 
##   0.8179224   0.4473630   0.5657471   0.4001636   0.4133675   0.4606012 
##     Quality     Durable Performance 
##   0.6314923   0.5357780   0.5413651
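
These values agree with the row sums of the squared loadings, as the communality formula above suggests.

round(rowSums(loadings(fit)^2), 2)   # should reproduce fit$communality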

The loadings tell us, in some sense, how strongly each variable loads on a particular factor. This helps us identify each factor, relate it to the context of the problem, and give it a name.

# loadings
print(loadings(fit), cutoff = .4)
## 
## Loadings:
##             PA1    PA2   
## Luxury       0.651 -0.627
## Style        0.602       
## Reliability  0.719       
## FuelEconomy         0.530
## Safety       0.589       
## Maintenance  0.571       
## Quality      0.780       
## Durable      0.697       
## Performance  0.726       
## 
##                  PA1   PA2
## SS loadings    3.718 1.096
## Proportion Var 0.413 0.122
## Cumulative Var 0.413 0.535

Rotation of solution: Better interpretation of factors

It is to be noted that if \(\boldsymbol{\beta}_{k\times 2}\) is a solution to the factor model, then so is \(\boldsymbol{\beta}U\), where \(U\) is any orthogonal matrix whose dimension equals the number of factors; the corresponding rotated factors are \(U^T \boldsymbol{\xi}\). Looking at the data with respect to the rotated factors may give a better interpretation. There are different methods of rotation: the varimax method tries to maximize the variance of the squared loadings within columns (factors), while quartimax tries to do so along the rows (variables).

fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "varimax", max.iter=100, SMC=TRUE, fm="pa")
print(loadings(fit), cutoff = .4)
## 
## Loadings:
##             PA1    PA2   
## Luxury       0.904       
## Style        0.642       
## Reliability         0.646
## FuelEconomy         0.625
## Safety       0.610       
## Maintenance         0.655
## Quality      0.477  0.635
## Durable             0.633
## Performance  0.620       
## 
##                  PA1   PA2
## SS loadings    2.538 2.276
## Proportion Var 0.282 0.253
## Cumulative Var 0.282 0.535
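
A rotation can also be carried out directly with varimax() from base R, applied to the unrotated loadings; the result should be very close to the fa() varimax solution above (possibly up to the order and signs of the columns), and the communalities are unchanged by rotation. A small cross-check:

unrot <- fa(r = as.matrix(CC), nfactors = 2, rotate = "none", max.iter = 100, SMC = TRUE, fm = "pa")
rot   <- varimax(unclass(loadings(unrot)))       # base-R varimax on the unrotated loadings
print(rot$loadings, cutoff = .4)
round(rowSums(unclass(rot$loadings)^2), 2)       # communalities, unchanged by rotation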

Alternate Estimation Method (MLE)

Apart from the principal axis method, the model can also be estimated by maximum likelihood.

fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "varimax", max.iter=100, SMC=TRUE, fm="ml")
print(loadings(fit), cutoff = .4)
## 
## Loadings:
##             ML1    ML2   
## Luxury       0.914       
## Style        0.644       
## Reliability         0.640
## FuelEconomy         0.604
## Safety       0.620       
## Maintenance         0.640
## Quality      0.454  0.659
## Durable             0.661
## Performance  0.588  0.430
## 
##                  ML1   ML2
## SS loadings    2.491 2.322
## Proportion Var 0.277 0.258
## Cumulative Var 0.277 0.535
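
As a cross-check, the same ML factor model can also be fitted with factanal() from base R, which works directly from a correlation matrix when the number of observations is supplied (here the 162 survey respondents). The loadings should be close to the ML solution above.

ml_check <- factanal(covmat = as.matrix(CC), factors = 2, n.obs = 162, rotation = "varimax")
print(ml_check$loadings, cutoff = .4)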