In Factor Analysis, we have measurements on \(k\) variable: \((X_1, X_2,\ldots, X_k)\). Data on these \(k\) variables are recorded for \(n\) individuals. ALthough there are many different measurements, we believe that the outcomes of these variables are manifestations of a few underlying latent factors. We would like to discover what those latent factors might be. As an example, let us consider an problem based on an example from the book "Analyzing Multivariate Data" by Lattin, Carrol and Green.
A company wanted to understand consumer considerations for their favourite cereal brands. The objective was to explain which brands consumers would consider purchasing as a function of underlying characteristics of the brands. A survey was conducted on 121 respondents and each respondent was asked to rate 3 of their favourite brands on 25 attributes. Can these ratings be understood through fewer dimensions?
In this example, we have the data on ratings available to us
library(data.table)
library(psych)
library(GPArotation)
setwd('C:\\Users\\IIMA\\Google Drive\\SMDA\\SMDA2020\\3. LuxuryCars and BreakfastCereals')
X<- read.csv("rte_cereal.csv") # col1 subject id, col 2 cereal id
library(corrplot)
## corrplot 0.84 loaded
CC<-cor(X)
corrplot(as.matrix(CC), order="hclust", method="number", number.cex=.75, tl.cex=.75)
#### Determining the number of factors to be used in the factor model One way to determine the number of factors is by looking at the scree plot based on principal component analysis of the correlation matrix, which is a plot of largest to the smallest eigen values. Largest eigen value is the max variance possible from a linear combination of the variables. Second eigen value is the next best variance possible along a direction that is perpendicular to the direction of max variance and so on. A Thumb rule is to look for number of eigen values that are greater than 1.
fi<-princomp(X[,3:27], cor=TRUE)
screeplot(fi, type="l")
#### SMC
round(smc(as.matrix(CC), covar=FALSE),2)
## Subject Cereal Filling Natural Fibre Sweet Easy
## 0.16 0.23 0.63 0.64 0.71 0.61 0.23
## Salt Satisfying Energy Fun Kids Soggy Economical
## 0.45 0.62 0.58 0.49 0.66 0.33 0.37
## Health Family Calories Plain Crisp Regular Sugar
## 0.75 0.60 0.43 0.43 0.42 0.53 0.67
## Fruit Process Quality Treat Boring Nutritious
## 0.49 0.30 0.62 0.57 0.33 0.72
fit<- fa(X[,3:27], nfactors = 4,rotate = "varimax", covar=FALSE, max.iter=100, SMC=TRUE, fm="pa")
fit$communality
## Filling Natural Fibre Sweet Easy Salt Satisfying
## 0.5685942 0.6147138 0.7029907 0.6233801 0.1706555 0.4856290 0.6058884
## Energy Fun Kids Soggy Economical Health Family
## 0.5223303 0.4606682 0.7240315 0.2412347 0.3081858 0.7742947 0.5946442
## Calories Plain Crisp Regular Sugar Fruit Process
## 0.4202885 0.4609137 0.3562410 0.3945663 0.7306293 0.4442474 0.2120224
## Quality Treat Boring Nutritious
## 0.5487396 0.5899718 0.3372715 0.7282686
print(loadings(fit), cutoff = .4)
##
## Loadings:
## PA1 PA2 PA3 PA4
## Filling 0.706
## Natural 0.753
## Fibre 0.821
## Sweet 0.702
## Easy
## Salt 0.686
## Satisfying 0.626 0.423
## Energy 0.660
## Fun 0.415 0.480
## Kids 0.850
## Soggy -0.480
## Economical 0.416
## Health 0.829
## Family 0.761
## Calories 0.627
## Plain -0.656
## Crisp 0.438
## Regular 0.613
## Sugar 0.816
## Fruit 0.442
## Process
## Quality 0.647
## Treat 0.603
## Boring -0.506
## Nutritious 0.831
##
## PA1 PA2 PA3 PA4
## SS loadings 5.203 2.660 2.467 2.290
## Proportion Var 0.208 0.106 0.099 0.092
## Cumulative Var 0.208 0.315 0.413 0.505
The full utility of the factor model is realized when we assign factor scores for each observation, for then we can visualize how the different ceral brands stand with respect to the newly identified dimensions. One approach to assigning scores is by the Thurstone Method, which is as follows. We have the factor model in matrix notation as The factor model can be written in matrix form as \[ \boldsymbol{X}= \boldsymbol{\beta}\boldsymbol{\xi} + \boldsymbol{\delta}, \mbox{ with } E[\boldsymbol{X}]=\boldsymbol{0}, ~~var(\boldsymbol{X})=\Sigma ,~~var(\boldsymbol{\xi})=I, ~~ Cov(\boldsymbol{X}, \boldsymbol{\delta})=\boldsymbol{0}, ~~ cov(\boldsymbol{\delta})= \Theta= diag(\sigma_1^2, ~\sigma_2^2, \ldots, \sigma_k^2)\] We know that the factor loadings are the correlations between \(\boldsymbol{X}\) and \(\boldsymbol{\xi}\), so \[ \boldsymbol{\beta}= E\left(\boldsymbol{X}\boldsymbol{\xi}^T \right).\] In the Thurstone Methods, it is assumed that the factor scores are a lineat function of \(\boldsymbol{X}\), so \[ \boldsymbol{\xi} = B\boldsymbol{X}.\] It follows that \[ \boldsymbol{\beta}= E\left(\boldsymbol{X}\boldsymbol{\xi}^T \right)= E\left(\boldsymbol{X}\boldsymbol{X}^T B \right)= RB^T,\] where \(R= E\left(\boldsymbol{X}\boldsymbol{X}^T \right)\) is the correlation matrix. So, \[B=R^{-1} \boldsymbol{\beta}.\]
# Getting Factor scores for each observation in the data
fit<- fa(X[,3:27], nfactors = 4,rotate = "varimax", covar=FALSE, max.iter=100, SMC=TRUE, fm="pa", scores="Thurstone")
tt<- fit$scores
AA1<- data.table(name=X[,2] , wholesome=tt[,1], artificial=tt[,2], popularity=tt[,3], mood=tt[,4] )
Agg<- as.matrix(AA1[,lapply(.SD,mean), by=name])
Aggnames<-c("Weetbix","Special K","Cerola", "Corn Flakes","Rice bubbles", "Vitabrits","Purina","All bran","Nutrigrain","Sustain","Just Right", "Komplete")
par(mfrow=c(2,2))
plot(Agg[,2], Agg[,3], xlab="wholesome", ylab="artificial", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,2], Agg[,3], labels=Aggnames, cex=.75)
plot(Agg[,2], Agg[,4], xlab="wholesome", ylab="popularity", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,2], Agg[,4], labels=Aggnames, cex=.75)
plot(Agg[,2], Agg[,5], xlab="wholesome", ylab="mood", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,2], Agg[,5], labels=Aggnames, cex=.75)
plot(Agg[,3], Agg[,5], xlab="artificial", ylab="mood", pch=".", xlim=c(-1.25,1.25), ylim=c(-1.25,1.25))
text(Agg[,3], Agg[,5], labels=Aggnames, cex=.75)