Principal Component Analysis (PCA) is an unsupervised learning technique used for dimension reduction on multivariate data, i.e. data containing multiple variables. If \({\bf X}=(X_1,\ldots, X_k)\) is a vector of \(k\) variables, the problem in PCA is to find the linear combination \(l^T{\bf X}\) with maximum variance, subject to the constraint \(l^Tl=1\). Equivalently, the problem is to find the direction in \(k\)-dimensional space along which the data has maximum variance. It turns out that the maximum achievable variance equals the largest eigenvalue of the variance-covariance matrix of the data on \({\bf X}\), and the variance-maximizing direction is given by the eigenvector corresponding to that largest eigenvalue. Similarly, the eigenvector corresponding to the second largest eigenvalue gives the next best direction for capturing variance, and so on. To apply principal components meaningfully, a good understanding of eigenvalues, eigenvectors and the spectral decomposition from linear algebra is desirable.
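To make this concrete, here is a minimal sketch on simulated data (the data and names such as X_sim below are purely illustrative, not part of the HDI analysis): the variance of the projection onto the first eigenvector of the covariance matrix matches the largest eigenvalue.
set.seed(1)
Z <- matrix(rnorm(1000), ncol = 2)           # 500 observations on 2 variables
X_sim <- Z %*% matrix(c(2, 1, 0, 1), 2, 2)   # introduce correlation between the columns
e <- eigen(var(X_sim))                       # eigen decomposition of the covariance matrix
l <- e$vectors[, 1]                          # first eigenvector; unit length, so t(l) %*% l = 1
var(X_sim %*% l)                             # matches e$values[1], the largest eigenvalue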
Here, we will study the application of PCA in analyzing the Human Development Index. The Human Development Index (HDI) is an index used by the United Nations Development Programme (UNDP) to rank countries based on certain desirable metrics. More detail can be found at the UNDP website (http://hdr.undp.org/en/2019-report). The HDI is calculated as the simple geometric mean of indices along three dimensions \[ HDI = ( I_{Health} \times I_{Education} \times I_{Income})^{1/3} \] The index gives equal weight to the indices from the different dimensions. Two sub-indices go into the education index, and they too receive equal weights. We would like to explore whether there are better alternative ways to construct this index. Can we make the determination of the weights less ad hoc?
Before importing the data, we first need to set the working directory (i.e. the folder on our computer) from which the data will be read. The directory can be set using the setwd command. The path specified here points to the folder called "1.HDI" on my computer. You may create a folder with the same name on your computer and store the data file "HDI2018_Editedcsv.csv" within it.
setwd("C:\\Users\\IIMA\\Google Drive\\SMDA\\SMDA2020\\1.HDI")
We can import the csv file using the following command.
Data1 <- read.csv("HDI2018_Editedcsv.csv")
dim(Data1)
## [1] 189 15
head(Data1, n=2L)
## HDI.rank Country HDI Life.expectancy.at.birth
## 1 1 Norway 0.953 82.3
## 2 2 Switzerland 0.944 83.5
## Expected.years.of.schooling Mean.years.of.schooling
## 1 17.9 12.6
## 2 16.2 13.4
## Gross.national.income..GNI..per.capita GNI.per.capita.rank.minus.HDI.rank
## 1 68012 5
## 2 57625 8
## HDI.rank.1 Life.Index Exp.Years.Index Mean.Years.Index Education.Index
## 1 1 0.96 0.99 0.84 0.91
## 2 2 0.98 0.90 0.89 0.90
## GNI.Index HDI.1
## 1 0.99 0.9525220
## 2 0.96 0.9439976
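As a quick sanity check, we can recompute the geometric mean of the three sub-indices and compare it with the HDI columns in the data (assuming the index columns are stored at full precision; small discrepancies with the rounded values displayed above are to be expected).
# HDI as the geometric mean of the three sub-indices; compare with Data1$HDI.1
with(Data1, head((Life.Index * Education.Index * GNI.Index)^(1/3), n = 2))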
Our focus here will mainly be on the three individual indices related to Life, Education and Income (GNI). Let us first look at the data with respect to the Life Index alone. We can see that there is a lot of overlap; based on just one index, it is hard to distinguish between all the countries.
#accessing a particular column in the data
Data1$Life.Index
## [1] 0.96 0.98 0.97 0.95 0.94 0.97 0.99 0.96 0.97 0.95 0.94 0.96 0.92 0.95 0.95
## [16] 0.95 0.94 0.93 0.98 0.95 0.95 0.96 0.96 0.96 0.94 0.97 0.91 0.97 0.94 0.89
## [31] 0.94 0.93 0.89 0.88 0.95 0.84 0.90 0.88 0.88 0.84 0.84 0.95 0.88 0.92 0.86
## [46] 0.89 0.87 0.88 0.79 0.88 0.84 0.85 0.82 0.86 0.89 0.84 0.85 0.86 0.77 0.86
## [61] 0.82 0.83 0.92 0.86 0.84 0.90 0.85 0.90 0.78 0.87 0.82 0.84 0.92 0.88 0.83
## [76] 0.85 0.88 0.84 0.86 0.80 0.92 0.86 0.84 0.85 0.87 0.87 0.87 0.80 0.85 0.84
## [91] 0.86 0.78 0.76 0.83 0.84 0.86 0.86 0.82 0.82 0.79 0.73 0.89 0.89 0.85 0.79
## [106] 0.78 0.82 0.80 0.74 0.72 0.82 0.80 0.76 0.67 0.79 0.76 0.87 0.76 0.83 0.77
## [121] 0.83 0.79 0.86 0.86 0.82 0.72 0.83 0.79 0.69 0.75 0.76 0.76 0.83 0.78 0.72
## [136] 0.81 0.69 0.81 0.72 0.66 0.58 0.73 0.72 0.59 0.65 0.76 0.64 0.72 0.78 0.72
## [151] 0.59 0.78 0.70 0.71 0.78 0.64 0.52 0.73 0.53 0.67 0.71 0.62 0.63 0.73 0.68
## [166] 0.62 0.69 0.68 0.67 0.52 0.67 0.66 0.71 0.64 0.62 0.62 0.58 0.69 0.70 0.60
## [181] 0.66 0.59 0.63 0.50 0.58 0.51 0.57 0.51 0.62
n<-dim(Data1)[1]
n
## [1] 189
plot(rep(1,n), Data1$Life.Index)
text(rep(1,n), Data1$Life.Index, labels=Data1$Country, cex=.5)
If we use two indices, there is better separation.
plot(Data1$Life.Index, Data1$Education.Index)
text(Data1$Life.Index, Data1$Education.Index, labels=Data1$Country, cex=.5)
Of course, we could use all three indices and the separation would be even better! But here we want to see whether we can still use a single index and get good differentiation. We will use principal component analysis: specifically, we will create an index based on the first principal component of the covariance matrix of the three individual indices.
# 3 dimensional plots scatterplot3d
library(scatterplot3d)
with(Data1, scatterplot3d(Life.Index, Education.Index, GNI.Index))
It is to be noted that in some problems it may be more appropriate to work with centered and scaled data, i.e. every column \(x\) in the data is replaced with its z-score \(\frac{x-mean(x)}{sd(x)}\). This amounts to working with the correlation matrix instead of the covariance matrix. Typically, this is necessary when the different columns of the data are measured in widely different units. However, in our case the indices are already standardized, and our interest is in weights that apply directly to the indices taken as they are, not in their standardized form. So, we will work with the covariance matrix.
X<- with(Data1, cbind(Life.Index, Education.Index, GNI.Index))
Sigma <- var(X)
Sigma
## Life.Index Education.Index GNI.Index
## Life.Index 0.01380519 0.01702796 0.01738280
## Education.Index 0.01702796 0.03099982 0.02674235
## GNI.Index 0.01738280 0.02674235 0.03292561
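As an aside, we can check the equivalence mentioned above directly: the covariance matrix of the z-scored columns is exactly the correlation matrix of the original columns (shown for comparison only; we continue with the covariance matrix).
# var() of the standardized columns coincides with cor() of the originals
all.equal(var(scale(X)), cor(X), check.attributes = FALSE)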
Recall that the eigenvector corresponding to the maximum eigenvalue is the vector \(l\) we seek, and that Var(\(l^T{\bf X}\)) equals the maximum eigenvalue of the covariance matrix. In R, the eigen function returns the eigenvalues in descending order, along with the corresponding eigenvectors as columns.
eig<-eigen(Sigma)
eig
## eigen() decomposition
## $values
## [1] 0.069376819 0.005217144 0.003136651
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.4010995 -0.07520299 0.9129423
## [2,] -0.6368143 -0.69351187 -0.3369107
## [3,] -0.6584730 0.71650945 -0.2302769
Note that the total variance is the sum of the variances of the three variables, i.e. the trace of the covariance matrix. By a standard property of matrices, the trace equals the sum of the eigenvalues of the covariance matrix.
# eigen values
eig_v<- eig$values
eig_v
## [1] 0.069376819 0.005217144 0.003136651
# % variance explained by 1st principal component
eig_v[1]/sum(eig_v)
## [1] 0.8925289
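We can verify the trace property stated above directly, using only the objects already computed: the sum of the diagonal of the covariance matrix (the total variance) should equal the sum of the eigenvalues.
# total variance two ways: trace of Sigma and sum of the eigenvalues
sum(diag(Sigma))
sum(eig_v)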
The score for each row in the data is obtained by taking the dot product of that row of indices with the first principal component vector. A quick look at the eigenvectors shows that the latter two indices receive higher weight (in magnitude) than the first index, so the index we are going to create will not give equal weightage the way the HDI does.
eig_vectors <- eig$vectors
# first eigen vector
U1<- eig_vectors[,1]
#score
F1<-X %*% U1
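One caveat worth noting: an eigenvector is determined only up to sign, and the first eigenvector here came out all-negative, so the scores in F1 are negative for every country. Flipping the sign would give positive weights without changing the variance or the relative separation (a quick check):
# the sign of an eigenvector is arbitrary; negating U1 leaves the variance unchanged
all.equal(var(X %*% (-U1)), var(F1))
# a peek at the scores next to the country names
head(data.frame(Country = Data1$Country, Score = F1), n = 2)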
Let us look at the variance of the HDI that is used in practice and the new index we have computed.
# variance in HDI
var(Data1$HDI)
## [1] 0.02334725
# variance of our index
var(F1)
## [,1]
## [1,] 0.06937682
Let us also visually check if our score helps distinguish countries better than the HDI.
forboxplot<- rbind(cbind(1, Data1$HDI), cbind(2, F1))
## 1 is HDI and 2 is our score
boxplot(forboxplot[,2]~ forboxplot[,1], xlab="indices")
Our score seems slightly better in that it has more variability. Let us compare them by bringing them to the same mean.
forboxplot<- rbind(cbind(1, Data1$HDI/mean(Data1$HDI)), cbind(2, F1/mean(F1)))
boxplot(forboxplot[,2]~ forboxplot[,1], xlab="indices")
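The normalized boxplot has a numeric analogue: the coefficient of variation (standard deviation divided by the mean, taken in absolute value) compares the spread of the two indices on a scale-free basis. A quick check (the absolute value guards against the arbitrary sign of F1 noted earlier):
# coefficient of variation: a scale-free measure of spread
sd(Data1$HDI) / mean(Data1$HDI)
sd(F1) / abs(mean(F1))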
In the above analysis, we created an index as the dot product of the Life, Education and GNI indices with a weight vector, which amounts to a weighted arithmetic mean (up to a constant scalar multiple). However, the HDI uses a geometric mean, and our original question was whether we should use equal weights in that geometric mean. Is there a way we could modify the above analysis to check that?
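One natural route is to note that a weighted geometric mean is a weighted arithmetic mean on the log scale, since \(\log(I_1^{w_1} I_2^{w_2} I_3^{w_3}) = w_1\log I_1 + w_2\log I_2 + w_3\log I_3\). So repeating the PCA on the logarithms of the indices yields data-driven weights for a geometric-mean index. Below is a sketch of this idea; note that rescaling the weights to sum to one is a choice made here to mimic the HDI's convention, and departs from the unit-norm constraint used above.
# PCA on the log-indices: the first eigenvector gives weights for a
# weighted geometric mean of the original indices
logX <- log(X)                        # the indices are positive, so logs are defined
w <- eigen(var(logX))$vectors[, 1]
w <- w / sum(w)                       # rescale so the weights sum to 1 (HDI convention)
geo_index <- exp(logX %*% w)          # equals Life^w1 * Education^w2 * GNI^w3
var(geo_index)                        # compare with the variance of the HDI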