Principal Component Analysis (PCA) is an unsupervised learning technique used for dimension reduction of multivariate data, i.e. data containing multiple variables. If \({\bf X}=(X_1,..., X_k)\) are \(k\) variables, the problem in PCA is to find the linear combination \(l^T{\bf X}\) with maximum variance subject to the constraint \(l^Tl=1\). The problem can also be thought of as finding the direction in \(k\)-dimensional space along which the data has maximum variance. It turns out that the maximum variance attained is the largest eigenvalue of the variance-covariance matrix of the data on \({\bf X}\), and the variance-maximizing direction is the eigenvector corresponding to that largest eigenvalue. Similarly, the eigenvector corresponding to the second largest eigenvalue gives the next best direction for variance, and so on.
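As a small sketch of this eigenvalue connection, using synthetic data (not the heptathlon data discussed below), we can verify that the variance along the leading eigenvector of the covariance matrix equals the largest eigenvalue:

```r
# Sketch: the first PC direction is the eigenvector of the covariance
# matrix with the largest eigenvalue (synthetic data for illustration)
set.seed(1)
X <- matrix(rnorm(200 * 3), ncol = 3) %*%
     matrix(c(2, 1, 0,  0, 1, 0,  0, 0, 0.5), 3, 3)
S <- cov(X)
e <- eigen(S, symmetric = TRUE)  # eigenvalues returned in decreasing order
l1 <- e$vectors[, 1]             # variance-maximizing unit direction
sum(l1^2)                        # 1, satisfying the constraint l^T l = 1
drop(var(X %*% l1))              # equals the largest eigenvalue e$values[1]
```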

To apply principal components meaningfully, a good understanding of eigenvalues, eigenvectors and spectral decomposition from linear algebra is desirable. The procedure in R is as below. We will consider the heptathlon dataset from the HSAUR3 package in R. It contains results for the heptathlon, a seven-event combined event for women, which made its first appearance at the Olympics in 1984. The data consists of results on the seven events for 25 competitors in the 1988 Olympics.

library(HSAUR3)
## Loading required package: tools
data(heptathlon)
head(heptathlon)
##                     hurdles highjump  shot run200m longjump javelin run800m
## Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
## John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
## Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20
## Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24
## Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90
## Schulz (GDR)          13.75     1.83 13.50   24.65     6.33   42.82  125.79
##                     score
## Joyner-Kersee (USA)  7291
## John (GDR)           6897
## Behmer (GDR)         6858
## Sablovskaite (URS)   6540
## Choubenkova (URS)    6540
## Schulz (GDR)         6411
dim(heptathlon)
## [1] 25  8
#help(heptathlon)
with(heptathlon, plot(rep(1, dim(heptathlon)[1]), score, col="blue", xlab="", ylab="score"  ))
with(heptathlon, text(rep(1, dim(heptathlon)[1]), score, labels=row.names(heptathlon), offset=.5, pos=4, cex=.5))

We would like to distinguish the players based on their performance. Note that it is hard to do so based on just the final score. We will try to do it using the first principal component obtained from the data. In many situations, we don't want the determination of maximum variance to be influenced by some variables just because they are measured in different units. If so, it is possible that some large-valued variables will show large variance although they may not distinguish the records well. So, one approach is to standardize the variables in the data by subtracting the mean ("centering") and dividing by the standard deviation ("scaling"). It turns out that the covariance matrix of the standardized variables is just the correlation matrix of the original variables. Accordingly, in R, we can specify that the procedure use the correlation matrix instead of the covariance matrix.
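The claim that the covariance matrix of standardized variables equals the correlation matrix of the originals is easy to check numerically. A minimal sketch on synthetic data, where one column is deliberately put on a much larger scale:

```r
# Covariance of standardized data == correlation of original data
set.seed(2)
X <- matrix(rnorm(50 * 4), ncol = 4)
X[, 1] <- X[, 1] * 100      # one variable on a much larger scale
Z <- scale(X)               # center and scale each column
all.equal(cov(Z), cor(X), check.attributes = FALSE)  # TRUE
```

Without standardizing, the inflated first column would dominate the covariance matrix and hence the leading eigenvector, regardless of how informative it actually is.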

pcfit <- princomp(heptathlon, cor=TRUE)
summary(pcfit) 
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.3337239 1.0962319 0.72182793 0.67615707 0.49697892
## Proportion of Variance 0.6807834 0.1502156 0.06512945 0.05714855 0.03087351
## Cumulative Proportion  0.6807834 0.8309990 0.89612842 0.95327697 0.98415048
##                             Comp.6      Comp.7       Comp.8
## Standard deviation     0.271578567 0.221442919 0.0632795567
## Proportion of Variance 0.009219365 0.006129621 0.0005005378
## Cumulative Proportion  0.993369841 0.999499462 1.0000000000

As per this output, the percentage of total variance explained by the first principal component direction (i.e. the direction of maximum variance) is 68%. If we use two dimensions to represent the data, by taking the first two principal component directions, we can explain 83% of the variance.
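The proportions reported by summary() come directly from the component standard deviations: each component's variance is its squared standard deviation, divided by the total. A sketch on synthetic data:

```r
# Recover the "Proportion of Variance" row of summary() from sdev
set.seed(3)
X <- matrix(rnorm(40 * 5), ncol = 5)
fit <- princomp(X, cor = TRUE)
prop <- fit$sdev^2 / sum(fit$sdev^2)  # proportion of variance per component
cumsum(prop)                          # cumulative proportion, ending at 1
```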

The scree plot below is a pictorial visualization of the variance explained by each principal component.

plot(pcfit,type="lines")

The coefficient vectors of the linear combinations of data columns that yield the maximum variance (component 1), the next best variance (component 2), and so on are called the "loadings". They are obtained as below.

loadings(pcfit) 
## 
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## hurdles   0.407  0.178                0.113  0.795  0.382       
## highjump -0.339 -0.264 -0.367 -0.679                0.439 -0.145
## shot     -0.332  0.273  0.677 -0.123  0.494         0.223 -0.203
## run200m   0.370 -0.240        -0.363  0.665        -0.457  0.124
## longjump -0.412         0.141 -0.110 -0.198  0.572 -0.602 -0.252
## javelin          0.836 -0.471 -0.120  0.124        -0.170 -0.116
## run800m   0.338  0.239  0.395 -0.604 -0.491 -0.133 -0.105  0.194
## score    -0.426                                            0.893
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
## Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

Now, one just needs to compute \(l^T {\bf X}\) for every row \({\bf X}\) of the (standardized) data. Taking \(l\) as the loadings for the first component gives the linear combination of data columns that has maximum variance, taking \(l\) as the loadings for the second component gives the linear combination with the next best variance, etc.
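We can reproduce the scores by hand: standardize the data using the centering and scaling stored in the fit, then multiply by the loadings matrix. A sketch on synthetic data (note that princomp() scales by standard deviations computed with divisor \(n\), which is why we reuse the fit's own `center` and `scale` rather than recomputing them):

```r
# Reconstruct pcfit-style scores manually: scores = scale(X) %*% loadings
set.seed(4)
X <- matrix(rnorm(30 * 4), ncol = 4)
fit <- princomp(X, cor = TRUE)
Z <- scale(X, center = fit$center, scale = fit$scale)
manual <- Z %*% loadings(fit)   # l^T X for every row, for each component l
all.equal(unclass(manual), unclass(fit$scores),
          check.attributes = FALSE)  # TRUE
```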

pcfit$scores 
##                          Comp.1      Comp.2     Comp.3      Comp.4      Comp.5
## Joyner-Kersee (USA) -4.73644855  1.22539333  0.3793062 -0.02196748 -0.43729779
## John (GDR)          -3.28440666  0.50281397  0.9168877  0.48550552  0.71013213
## Behmer (GDR)        -3.04070072  0.66552951 -0.4674166  0.69468140 -0.10943894
## Sablovskaite (URS)  -1.59214658  0.69947622  0.6075720  0.14314333  0.46528122
## Choubenkova (URS)   -1.61876350  1.77837280 -0.1526263  0.85387062  0.69303571
## Schulz (GDR)        -1.20977434 -0.08669928 -0.6888783  0.20932582  0.75539868
## Fleming (AUS)       -1.20996826 -0.34829112 -0.0739671  0.49729341 -0.77976998
## Greiner (USA)       -1.00140330 -0.83695433  0.8290977  0.03048597  0.09353036
## Lajbnerova (CZE)    -0.61564151  0.14577624  0.1642336 -0.62889647  0.57671005
## Bouraga (URS)       -0.81676153 -0.55162879  0.1879048  0.68222247 -1.04182864
## Wijnsma (HOL)       -0.58675246 -1.43042703 -0.1400534 -0.41428025  0.30403239
## Dimitrova (BUL)     -1.15127240 -0.40509917 -0.0806038  0.49443290 -0.82612806
## Scheider (SWI)      -0.03088395  0.82771529 -2.0079814 -0.74761217 -0.02524885
## Braun (FRG)         -0.01831843  0.72956299 -0.3314211 -1.08727927 -0.19328575
## Ruotsalainen (FIN)   0.06751278  0.78163321 -0.9648576 -0.27359910 -0.19082726
## Yuping (CHN)        -0.11797365 -0.55321949  1.0867539 -1.66532434 -0.21631179
## Hagger (GB)          0.26570642 -1.77992363  0.5980197 -0.48177822 -0.05296740
## Brown (USA)          0.56258410  0.74800208 -0.3193688 -1.31527290 -0.51047191
## Mulliner (GB)        1.31010266 -0.63798124  0.7399102  0.59086483 -0.15373614
## Hautenauve (BEL)     1.29504614 -1.87374132  0.0133772  0.25941677  0.20475008
## Kytola (FIN)         1.65452526 -0.92398335 -0.6606445  0.21813098  0.51680688
## Geremias (BRA)       2.30639362 -0.07589554  0.6609723 -0.02562593  0.24570720
## Hui-Ing (TAI)        3.27988022 -0.64256494 -0.7655648  1.14143591 -0.47351027
## Jeong-Mi (KOR)       3.36476324 -0.93996824 -0.5850645  0.11590791  0.60288550
## Launa (PNG)          6.92470138  2.98210183  1.0544130  0.24491829 -0.15744742
##                          Comp.6       Comp.7        Comp.8
## Joyner-Kersee (USA)  0.35468155 -0.354294635  0.0499179341
## John (GDR)          -0.24937415 -0.147288228 -0.0130098300
## Behmer (GDR)         0.24560071  0.133260633  0.0288725786
## Sablovskaite (URS)  -0.09302220  0.494524686  0.0790232430
## Choubenkova (URS)   -0.13755103 -0.243099856 -0.0520519296
## Schulz (GDR)         0.35766289  0.108969683 -0.0287885207
## Fleming (AUS)       -0.08018676  0.143534214  0.0546954057
## Greiner (USA)        0.15345562 -0.033506531 -0.0180293624
## Lajbnerova (CZE)    -0.27680329  0.253591230  0.0041538209
## Bouraga (URS)       -0.39114739  0.014838280  0.1139690296
## Wijnsma (HOL)        0.35111531  0.188677157 -0.0014908073
## Dimitrova (BUL)     -0.26537935  0.077475405 -0.2527593382
## Scheider (SWI)       0.00371748 -0.036704338 -0.0027736911
## Braun (FRG)         -0.27883136 -0.047064168  0.0072713956
## Ruotsalainen (FIN)  -0.14243817 -0.139150766  0.0122289171
## Yuping (CHN)         0.28332519  0.177143125 -0.0324982135
## Hagger (GB)         -0.14116135 -0.533014857  0.0371422763
## Brown (USA)          0.07430305  0.005915267 -0.0035067589
## Mulliner (GB)        0.43652255 -0.079806513 -0.0456489065
## Hautenauve (BEL)     0.10638046 -0.087306001  0.0164130338
## Kytola (FIN)         0.07277970  0.128755033 -0.0033252945
## Geremias (BRA)      -0.65925197  0.216469848 -0.0047145336
## Hui-Ing (TAI)        0.19142764  0.211141285  0.0334977126
## Jeong-Mi (KOR)      -0.18266970 -0.391575777  0.0212358269
## Launa (PNG)          0.26684456 -0.061484177  0.0001760122

Let us now plot the players w.r.t. the first principal component and compare with the plot w.r.t. the overall score given in the data. To aid the comparison, we standardize both variables. We can see that the players are more easily distinguished if we use the first PCA component score.

std_overallscore<- with(heptathlon, (score - mean(score))/sd(score)) 
std_PCAscore1<- with(heptathlon, (pcfit$scores[,1] - mean(pcfit$scores[,1]))/sd(pcfit$scores[,1])) 
par(mfrow=c(1,2))
with(heptathlon, plot(rep(1, dim(heptathlon)[1]), std_overallscore, col="blue", xlab="", ylab="overall score"  , ylim=c(-3,3)))
with(heptathlon, text(rep(1, dim(heptathlon)[1]), std_overallscore, labels=row.names(heptathlon), offset=.5, pos=4, cex=.75))


with(heptathlon, plot(rep(1, dim(heptathlon)[1]), std_PCAscore1, col="red", xlab="", ylab="First PCA score" , ylim=c(-3,3) ))
with(heptathlon, text(rep(1, dim(heptathlon)[1]), std_PCAscore1, labels=row.names(heptathlon), offset=.5, pos=4, cex=.75))

Further, we can distinguish the players even more easily if we use two dimensions, viz. the first and second principal component scores.

with(heptathlon, plot(pcfit$scores[,1], pcfit$scores[,2], col="red",xlab="First PCA score",ylab="Second PCA score" , xlim=c(-5,5), ylim=c(-2.5,2.5) ))
with(heptathlon, text(pcfit$scores[,1], pcfit$scores[,2], labels=row.names(heptathlon), offset=.5, pos=4, cex=.75))

Here, we have discussed a package that can be used to implement Principal Component Analysis. Applying other techniques will similarly involve looking up the relevant commands, such as lda() for Linear Discriminant Analysis, fa() for Factor Analysis, etc. However, a conceptual understanding of the technique involved is important to ensure correct application of the tools in R.