In Factor Analysis, we have measurements on \(k\) variables: \((X_1, X_2,\ldots, X_k)\). Data on these \(k\) variables are recorded for \(n\) individuals. Although there are many different measurements, we believe that the outcomes of these variables are manifestations of a few underlying latent factors. We would like to discover what those latent factors might be. As an example, let us consider a problem based on an example from the book "Analyzing Multivariate Data" by Lattin, Carroll and Green.
An automobile company wanted to design a new luxury car concept. To help position the new car, they wanted to understand what considerations may be driving consumers' perceptions of cars. They conducted a market research study; 162 customers were surveyed regarding their perceptions of different car models. They were asked to rate familiar car brands on 9 dimensions: namely \[Luxury,~~Style,~~Reliability,~~Fuel~Economy,~~Safety, ~~Maintenance,~~Quality,~~Durability ~~and~~ Performance.\] The correlation matrix between the ratings is given below. The ratings for different dimensions show some high correlations. So a natural question is whether a smaller set of underlying dimensions can help explain these ratings.
In this example, we have the correlation matrix available to us:
# set the working directory to the folder containing the data file
setwd('C:\\Users\\IIMA\\Google Drive\\SMDA\\SMDA2020\\3. LuxuryCars and BreakfastCereals')
# keep only the lower triangle and rebuild the full symmetric correlation matrix
C <- read.csv("luxury_car.csv")
C[upper.tri(C)] <- 0
CC <- (C + t(C)) - diag(diag(as.matrix(C)))  # the correlation matrix
rownames(CC) <- colnames(CC)
library(corrplot)
## corrplot 0.84 loaded
corrplot(as.matrix(CC), method="number", number.cex=.75, tl.cex=.75)
#corrplot(as.matrix(CC), method="circle")
corrplot(as.matrix(CC), order="hclust", method="number", number.cex=.75, tl.cex=.75)
Suppose each standardized rating follows a two-factor model \(X_i = \beta_{i1}\xi_1 + \beta_{i2}\xi_2 + \delta_i\), where \(\xi_1, \xi_2\) are the latent factors and \(\delta_i\) is a variable-specific error. The objective will then be to estimate the parameters \(\beta\) and to assign scores on the factors \(\xi_1\) and \(\xi_2\) to each individual.
If we knew \(\Theta\) (the diagonal matrix of error variances \(V(\delta_i)\)), we could take the spectral decomposition of \(\Sigma-\Theta\) and use the first \(2\) (in general, as many as the number of factors) eigenvectors and eigenvalues to obtain \(\boldsymbol{\beta}\). One starts with an initial value for \(\Theta\), either \(0\) or one based on the Squared Multiple Correlation (SMC). Then the solution is obtained iteratively. The code below shows the first step of the iteration.
##initialize var(delta)=0 and compute factors based on first two eigen vectors and eigen values
test<- eigen(as.matrix(CC)-diag(rep(0,9)))
## First step of iteration
round(test$vectors[,1:2]%*%sqrt(diag(test$values[1:2])),2)
## [,1] [,2]
## [1,] -0.64 0.61
## [2,] -0.65 0.40
## [3,] -0.76 -0.24
## [4,] -0.39 -0.71
## [5,] -0.64 0.36
## [6,] -0.62 -0.45
## [7,] -0.81 -0.14
## [8,] -0.74 -0.23
## [9,] -0.77 0.18
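Iterating this step to convergence can be sketched as a small helper function (an illustrative implementation, not psych's actual code; the name `paf2` and the synthetic demo matrix are ours): after each spectral decomposition, the communalities implied by the current loadings replace the diagonal of the reduced correlation matrix, and we repeat until they stabilize. Applied to `as.matrix(CC)`, it should reproduce the solution above up to column signs.

```r
# principal axis factoring: iterate communality updates until they stabilize
paf2 <- function(R, nf = 2, tol = 1e-6, max.iter = 100) {
  Rr <- R                                    # reduced correlation matrix (Theta = 0 start)
  for (i in 1:max.iter) {
    e  <- eigen(Rr, symmetric = TRUE)
    L  <- e$vectors[, 1:nf, drop = FALSE] %*%
          diag(sqrt(pmax(e$values[1:nf], 0)), nrow = nf)
    h2 <- rowSums(L^2)                       # communalities implied by current loadings
    if (max(abs(diag(Rr) - h2)) < tol) break
    diag(Rr) <- h2                           # replace the diagonal and iterate
  }
  L
}

# synthetic 4-variable, 2-factor correlation matrix for illustration
B <- matrix(c(0.8, 0.7, 0.1, 0.2, 0.1, 0.2, 0.8, 0.7), 4, 2)
R <- B %*% t(B) + diag(1 - rowSums(B^2))
round(rowSums(paf2(R)^2), 2)   # approaches the true communalities rowSums(B^2)
```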
The above can be implemented using the fa() function from the psych package in R. It can be checked that the code below with max.iter=1 gives the same output, up to the signs of the columns.
library(data.table)
library(psych)
library(GPArotation)
# first iteration using fa()
fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "none", max.iter=1, SMC=FALSE, fm="pa")
## maximum iteration exceeded
## Factor loadings
print(loadings(fit), cutoff = 0)
## SMC
round(smc(as.matrix(CC), covar=FALSE),2)
fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "none", max.iter=100, SMC=TRUE, fm="pa")
fit
## Factor Analysis using method = pa
## Call: fa(r = as.matrix(CC), nfactors = 2, rotate = "none", SMC = TRUE,
## max.iter = 100, fm = "pa")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PA1 PA2 h2 u2 com
## Luxury 0.65 -0.63 0.82 0.18 2.0
## Style 0.60 -0.29 0.45 0.55 1.4
## Reliability 0.72 0.22 0.57 0.43 1.2
## FuelEconomy 0.34 0.53 0.40 0.60 1.7
## Safety 0.59 -0.26 0.41 0.59 1.4
## Maintenance 0.57 0.37 0.46 0.54 1.7
## Quality 0.78 0.15 0.63 0.37 1.1
## Durable 0.70 0.22 0.54 0.46 1.2
## Performance 0.73 -0.12 0.54 0.46 1.1
##
## PA1 PA2
## SS loadings 3.72 1.10
## Proportion Var 0.41 0.12
## Cumulative Var 0.41 0.53
## Proportion Explained 0.77 0.23
## Cumulative Proportion 0.77 1.00
##
## Mean item complexity = 1.4
## Test of the hypothesis that 2 factors are sufficient.
##
## The degrees of freedom for the null model are 36 and the objective function was 3.67
## The degrees of freedom for the model are 19 and the objective function was 0.13
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.04
##
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## PA1 PA2
## Correlation of (regression) scores with factors 0.95 0.88
## Multiple R square of scores with factors 0.90 0.77
## Minimum correlation of possible factor scores 0.80 0.54
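The RMSR reported above can be related back to the model: the implied correlation matrix is \(\boldsymbol{\beta}\boldsymbol{\beta}^\top + \Theta\), and the RMSR is the root mean square of the off-diagonal residuals. A minimal sketch (the helper name `rmsr` is ours; with the objects above, `rmsr(as.matrix(CC), loadings(fit), fit$uniquenesses)` should reproduce the reported 0.03):

```r
# RMSR: compare observed R with the model-implied L L' + diag(u2)
rmsr <- function(R, L, u2) {
  res <- R - (L %*% t(L) + diag(u2))
  sqrt(mean(res[lower.tri(res)]^2))   # off-diagonal residuals only
}

# sanity check on an exact factor structure: residuals vanish
L  <- matrix(c(0.8, 0.7, 0.6, 0.1, -0.2, 0.3), 3, 2)
u2 <- 1 - rowSums(L^2)
rmsr(L %*% t(L) + diag(u2), L, u2)   # effectively 0
```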
Communality is the proportion of variance in a variable that is explained by the factor model. Factor loadings are the correlations between the original (standardized) variables and the factors. Based on the factor model we have \[\mbox{Communality for variable }X_i: ~~ 1- V(\delta_i)=\beta_{i1}^2 + \beta_{i2}^2 , \] \[\mbox{Factor loadings}:~~ Cov(X_i, \xi_1)= \beta_{i1}, ~~ Cov(X_i, \xi_2)= \beta_{i2}\]
#communality
fit$communality
## Luxury Style Reliability FuelEconomy Safety Maintenance
## 0.8179224 0.4473630 0.5657471 0.4001636 0.4133675 0.4606012
## Quality Durable Performance
## 0.6314923 0.5357780 0.5413651
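The communalities above are simply the row sums of squared loadings (with the full fit object, `rowSums(loadings(fit)^2)` matches `fit$communality`). A quick check using the rounded loadings of Luxury and Style from the unrotated solution:

```r
# communality = beta_i1^2 + beta_i2^2; uniqueness V(delta_i) = 1 - communality
B  <- matrix(c(0.65, 0.60, -0.63, -0.29), 2, 2)  # rounded loadings: Luxury, Style
h2 <- rowSums(B^2)     # communalities
u2 <- 1 - h2           # uniquenesses
round(h2, 2)           # 0.82 0.44, close to the h2 column reported above
```

The small discrepancy for Style (0.44 vs 0.45) comes only from rounding the loadings to two decimals.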
The loadings, in some sense, tell us how strongly each variable loads on a particular factor. This helps us identify each factor by relating it to the context of the problem, and so give the factor a name.
# loadings
print(loadings(fit), cutoff = .4)
##
## Loadings:
## PA1 PA2
## Luxury 0.651 -0.627
## Style 0.602
## Reliability 0.719
## FuelEconomy 0.530
## Safety 0.589
## Maintenance 0.571
## Quality 0.780
## Durable 0.697
## Performance 0.726
##
## PA1 PA2
## SS loadings 3.718 1.096
## Proportion Var 0.413 0.122
## Cumulative Var 0.413 0.535
It is to be noted that if \(\boldsymbol{\beta}_{k\times 2}\) is a solution to the factor model, then so is \(\boldsymbol{\beta}U\) for any orthogonal matrix \(U\) whose dimension equals the number of factors. Looking at the data in terms of the rotated factors \(U^\top \boldsymbol{\xi}\) may give a better interpretation. There are different methods for rotation: the varimax method tries to maximize the variance of the squared loadings within columns, while quartimax does so along the rows.
fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "varimax", max.iter=100, SMC=TRUE, fm="pa")
print(loadings(fit), cutoff = .4)
##
## Loadings:
## PA1 PA2
## Luxury 0.904
## Style 0.642
## Reliability 0.646
## FuelEconomy 0.625
## Safety 0.610
## Maintenance 0.655
## Quality 0.477 0.635
## Durable 0.633
## Performance 0.620
##
## PA1 PA2
## SS loadings 2.538 2.276
## Proportion Var 0.282 0.253
## Cumulative Var 0.282 0.535
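That an orthogonal rotation changes the loadings but not the fit can be checked numerically: \(\boldsymbol{\beta}U\) gives the same \(\boldsymbol{\beta}\boldsymbol{\beta}^\top\), and hence the same communalities. A small sketch (the angle \(\pi/6\) and the two loading rows are arbitrary choices for illustration):

```r
# an orthogonal U leaves beta beta' and the communalities unchanged
theta <- pi / 6
U <- matrix(c(cos(theta), -sin(theta),
              sin(theta),  cos(theta)), 2, 2)    # 2-d rotation, U %*% t(U) = I
B  <- matrix(c(0.65, 0.60, -0.63, -0.29), 2, 2)  # two loading rows from above
BU <- B %*% U
all.equal(rowSums(BU^2), rowSums(B^2))           # TRUE: same communalities
all.equal(BU %*% t(BU), B %*% t(B))              # TRUE: same fitted correlations
```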
Apart from the principal axis method, the model can also be estimated by maximum likelihood.
fit<- fa(r = as.matrix(CC), nfactors = 2,rotate = "varimax", max.iter=100, SMC=TRUE, fm="ml")
print(loadings(fit), cutoff = .4)
##
## Loadings:
## ML1 ML2
## Luxury 0.914
## Style 0.644
## Reliability 0.640
## FuelEconomy 0.604
## Safety 0.620
## Maintenance 0.640
## Quality 0.454 0.659
## Durable 0.661
## Performance 0.588 0.430
##
## ML1 ML2
## SS loadings 2.491 2.322
## Proportion Var 0.277 0.258
## Cumulative Var 0.277 0.535