In Factor Analysis, we have measurements on \(k\) variable: \((X_1, X_2,\ldots, X_k)\). Data on these \(k\) variables are recorded for \(n\) individuals. ALthough there are many different measurements, we believe that the outcomes of these variables are manifestations of a few underlying latent factors. We would like to discover what those latent factors might be. As an example, let us consider an problem based on an example from the book "Analyzing Multivariate Data" by Lattin, Carrol and Green.

problem

A company wanted to understand consumer considerations for their favourite cereal brands. The objective was to explain which brands consumers would consider purchasing as a function of underlying characteristics of the brands. A survey was conducted on 121 respondents and each respondent was asked to rate 3 of their favourite brands on 25 attributes. Can these ratings be understood through fewer dimensions?

Read the data correlation matrix

In this example, we have the data on ratings available to us

library(data.table)
library(psych)
library(GPArotation)
setwd('C:\\Users\\IIMA\\Google Drive\\SMDA\\SMDA2020\\3. LuxuryCars and BreakfastCereals')
X<- read.csv("rte_cereal.csv") # col1 subject id, col 2 cereal id

Checking the correlations

library(corrplot)
## corrplot 0.84 loaded
CC<-cor(X)
corrplot(as.matrix(CC), order="hclust", method="number", number.cex=.75, tl.cex=.75)

#### Determining the number of factors to be used in the factor model One way to determine the number of factors is by looking at the scree plot based on principal component analysis of the correlation matrix, which is a plot of largest to the smallest eigen values. Largest eigen value is the max variance possible from a linear combination of the variables. Second eigen value is the next best variance possible along a direction that is perpendicular to the direction of max variance and so on. A Thumb rule is to look for number of eigen values that are greater than 1.

fi<-princomp(X[,3:27], cor=TRUE)
screeplot(fi, type="l") 

#### SMC

round(smc(as.matrix(CC), covar=FALSE),2)
##    Subject     Cereal    Filling    Natural      Fibre      Sweet       Easy 
##       0.16       0.23       0.63       0.64       0.71       0.61       0.23 
##       Salt Satisfying     Energy        Fun       Kids      Soggy Economical 
##       0.45       0.62       0.58       0.49       0.66       0.33       0.37 
##     Health     Family   Calories      Plain      Crisp    Regular      Sugar 
##       0.75       0.60       0.43       0.43       0.42       0.53       0.67 
##      Fruit    Process    Quality      Treat     Boring Nutritious 
##       0.49       0.30       0.62       0.57       0.33       0.72

Estimate the factor model and interpret the factors

fit<- fa(X[,3:27], nfactors = 4,rotate = "varimax", covar=FALSE, max.iter=100, SMC=TRUE, fm="pa")  
fit$communality
##    Filling    Natural      Fibre      Sweet       Easy       Salt Satisfying 
##  0.5685942  0.6147138  0.7029907  0.6233801  0.1706555  0.4856290  0.6058884 
##     Energy        Fun       Kids      Soggy Economical     Health     Family 
##  0.5223303  0.4606682  0.7240315  0.2412347  0.3081858  0.7742947  0.5946442 
##   Calories      Plain      Crisp    Regular      Sugar      Fruit    Process 
##  0.4202885  0.4609137  0.3562410  0.3945663  0.7306293  0.4442474  0.2120224 
##    Quality      Treat     Boring Nutritious 
##  0.5487396  0.5899718  0.3372715  0.7282686
print(loadings(fit), cutoff = .4)
## 
## Loadings:
##            PA1    PA2    PA3    PA4   
## Filling     0.706                     
## Natural     0.753                     
## Fibre       0.821                     
## Sweet              0.702              
## Easy                                  
## Salt               0.686              
## Satisfying  0.626         0.423       
## Energy      0.660                     
## Fun                       0.415  0.480
## Kids                      0.850       
## Soggy                           -0.480
## Economical                0.416       
## Health      0.829                     
## Family                    0.761       
## Calories           0.627              
## Plain                           -0.656
## Crisp                            0.438
## Regular     0.613                     
## Sugar              0.816              
## Fruit                            0.442
## Process                               
## Quality     0.647                     
## Treat                            0.603
## Boring                          -0.506
## Nutritious  0.831                     
## 
##                  PA1   PA2   PA3   PA4
## SS loadings    5.203 2.660 2.467 2.290
## Proportion Var 0.208 0.106 0.099 0.092
## Cumulative Var 0.208 0.315 0.413 0.505

Assigning factor scores for each observation

The full utility of the factor model is realized when we assign factor scores for each observation, for then we can visualize how the different ceral brands stand with respect to the newly identified dimensions. One approach to assigning scores is by the Thurstone Method, which is as follows. We have the factor model in matrix notation as The factor model can be written in matrix form as \[ \boldsymbol{X}= \boldsymbol{\beta}\boldsymbol{\xi} + \boldsymbol{\delta}, \mbox{ with } E[\boldsymbol{X}]=\boldsymbol{0}, ~~var(\boldsymbol{X})=\Sigma ,~~var(\boldsymbol{\xi})=I, ~~ Cov(\boldsymbol{X}, \boldsymbol{\delta})=\boldsymbol{0}, ~~ cov(\boldsymbol{\delta})= \Theta= diag(\sigma_1^2, ~\sigma_2^2, \ldots, \sigma_k^2)\] We know that the factor loadings are the correlations between \(\boldsymbol{X}\) and \(\boldsymbol{\xi}\), so \[ \boldsymbol{\beta}= E\left(\boldsymbol{X}\boldsymbol{\xi}^T \right).\] In the Thurstone Methods, it is assumed that the factor scores are a lineat function of \(\boldsymbol{X}\), so \[ \boldsymbol{\xi} = B\boldsymbol{X}.\] It follows that \[ \boldsymbol{\beta}= E\left(\boldsymbol{X}\boldsymbol{\xi}^T \right)= E\left(\boldsymbol{X}\boldsymbol{X}^T B \right)= RB^T,\] where \(R= E\left(\boldsymbol{X}\boldsymbol{X}^T \right)\) is the correlation matrix. So, \[B=R^{-1} \boldsymbol{\beta}.\]

# Getting Factor scores for each observation in the data 

fit<- fa(X[,3:27], nfactors = 4,rotate = "varimax", covar=FALSE, max.iter=100, SMC=TRUE, fm="pa", scores="Thurstone")  
tt<- fit$scores

Positioning of cereals

AA1<- data.table(name=X[,2] , wholesome=tt[,1], artificial=tt[,2], popularity=tt[,3], mood=tt[,4] )
Agg<- as.matrix(AA1[,lapply(.SD,mean), by=name])
Aggnames<-c("Weetbix","Special K","Cerola", "Corn Flakes","Rice bubbles", "Vitabrits","Purina","All bran","Nutrigrain","Sustain","Just Right", "Komplete")


par(mfrow=c(2,2))
plot(Agg[,2], Agg[,3], xlab="wholesome", ylab="artificial", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,2], Agg[,3], labels=Aggnames, cex=.75)
                
plot(Agg[,2], Agg[,4], xlab="wholesome", ylab="popularity", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,2], Agg[,4], labels=Aggnames, cex=.75)


plot(Agg[,2], Agg[,5], xlab="wholesome", ylab="mood", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,2], Agg[,5], labels=Aggnames, cex=.75)


plot(Agg[,3], Agg[,5], xlab="artificial", ylab="mood", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,3], Agg[,5], labels=Aggnames, cex=.75)