The covariance matrix computed from the data is important for multivariate analysis techniques such as principal component analysis or factor analysis. The var() function in R can be used to directly compute the same in R. However, here we do a direct computation for clarity of understanding.
Let us consider the hepthalon data available in R. This is the scors of various contestants in the 1984 olympics. This data is organized in a matrix, i.e. rows and columns, with each row corresponding to a contestant and column to a score in particular event or overall score.
library(HSAUR3)
## Loading required package: tools
data(heptathlon)
head(heptathlon, n=2L)
## hurdles highjump shot run200m longjump javelin run800m
## Joyner-Kersee (USA) 12.69 1.86 15.80 22.56 7.27 45.66 128.51
## John (GDR) 12.85 1.80 16.23 23.65 6.71 42.56 126.12
## score
## Joyner-Kersee (USA) 7291
## John (GDR) 6897
To compute the covariance matrix of the numerical columns we will first extract the numerical columns into a matrix and compute the covraince matrix using the var() function.
X<-heptathlon
Cov<-var(X)
Cov
## hurdles highjump shot run200m longjump
## hurdles 0.5426500 -0.0465875 -0.7158125 0.5526083 -0.3186333
## highjump -0.0465875 0.0060750 0.0512550 -0.0368525 0.0289200
## shot -0.7158125 0.0512550 2.2257190 -0.9874603 0.5257018
## run200m 0.5526083 -0.0368525 -0.9874603 0.9400410 -0.3757313
## longjump -0.3186333 0.0289200 0.5257018 -0.3757313 0.2248773
## javelin -0.0202750 0.0005950 1.4228727 -1.1449063 0.1128357
## run800m 4.7594000 -0.3820250 -5.1904192 4.9583408 -2.7502933
## score -386.6004167 34.0000000 678.2173333 -476.6920000 256.2143333
## javelin run800m score
## hurdles -0.0202750 4.759400 -386.6004
## highjump 0.0005950 -0.382025 34.0000
## shot 1.4228727 -5.190419 678.2173
## run200m -1.1449063 4.958341 -476.6920
## longjump 0.1128357 -2.750293 256.2143
## javelin 12.5716773 0.589390 510.2418
## run800m 0.5893900 68.742142 -3642.2717
## score 510.2418333 -3642.271667 323157.8333
We will now compute the covariance matrix directly using X.
# number of data points
n<- dim(X)[1]
# vector of column means
mean_X<- colMeans(X)
mean_X
## hurdles highjump shot run200m longjump javelin run800m score
## 13.8400 1.7820 13.1176 24.6492 6.1524 41.4824 136.0540 6090.6000
# subtraxt column means from each respective column
Y<- as.matrix( X - cbind(rep(1,n)) %*% rbind(mean_X))
Cov_direct<- 1/(n-1) * t(Y) %*% Y
# direct
round(Cov_direct,2)
## hurdles highjump shot run200m longjump javelin run800m score
## hurdles 0.54 -0.05 -0.72 0.55 -0.32 -0.02 4.76 -386.60
## highjump -0.05 0.01 0.05 -0.04 0.03 0.00 -0.38 34.00
## shot -0.72 0.05 2.23 -0.99 0.53 1.42 -5.19 678.22
## run200m 0.55 -0.04 -0.99 0.94 -0.38 -1.14 4.96 -476.69
## longjump -0.32 0.03 0.53 -0.38 0.22 0.11 -2.75 256.21
## javelin -0.02 0.00 1.42 -1.14 0.11 12.57 0.59 510.24
## run800m 4.76 -0.38 -5.19 4.96 -2.75 0.59 68.74 -3642.27
## score -386.60 34.00 678.22 -476.69 256.21 510.24 -3642.27 323157.83
# var function
round(Cov,2)
## hurdles highjump shot run200m longjump javelin run800m score
## hurdles 0.54 -0.05 -0.72 0.55 -0.32 -0.02 4.76 -386.60
## highjump -0.05 0.01 0.05 -0.04 0.03 0.00 -0.38 34.00
## shot -0.72 0.05 2.23 -0.99 0.53 1.42 -5.19 678.22
## run200m 0.55 -0.04 -0.99 0.94 -0.38 -1.14 4.96 -476.69
## longjump -0.32 0.03 0.53 -0.38 0.22 0.11 -2.75 256.21
## javelin -0.02 0.00 1.42 -1.14 0.11 12.57 0.59 510.24
## run800m 4.76 -0.38 -5.19 4.96 -2.75 0.59 68.74 -3642.27
## score -386.60 34.00 678.22 -476.69 256.21 510.24 -3642.27 323157.83
Ofcourse, both computations give the same result.
The correlation matrix can be obtained by using the cor() function in R.
# computation using cor() function
cor_X<- cor(X)
round(cor_X,3)
## hurdles highjump shot run200m longjump javelin run800m score
## hurdles 1.000 -0.811 -0.651 0.774 -0.912 -0.008 0.779 -0.923
## highjump -0.811 1.000 0.441 -0.488 0.782 0.002 -0.591 0.767
## shot -0.651 0.441 1.000 -0.683 0.743 0.269 -0.420 0.800
## run200m 0.774 -0.488 -0.683 1.000 -0.817 -0.333 0.617 -0.865
## longjump -0.912 0.782 0.743 -0.817 1.000 0.067 -0.700 0.950
## javelin -0.008 0.002 0.269 -0.333 0.067 1.000 0.020 0.253
## run800m 0.779 -0.591 -0.420 0.617 -0.700 0.020 1.000 -0.773
## score -0.923 0.767 0.800 -0.865 0.950 0.253 -0.773 1.000
# Direct computation
Z<- scale(X) # this function subtracts the mean from each column and divides by column standard deviation
cor_X_direct<- 1/(n-1)* t(Z)%*%Z
round(cor_X_direct,3)
## hurdles highjump shot run200m longjump javelin run800m score
## hurdles 1.000 -0.811 -0.651 0.774 -0.912 -0.008 0.779 -0.923
## highjump -0.811 1.000 0.441 -0.488 0.782 0.002 -0.591 0.767
## shot -0.651 0.441 1.000 -0.683 0.743 0.269 -0.420 0.800
## run200m 0.774 -0.488 -0.683 1.000 -0.817 -0.333 0.617 -0.865
## longjump -0.912 0.782 0.743 -0.817 1.000 0.067 -0.700 0.950
## javelin -0.008 0.002 0.269 -0.333 0.067 1.000 0.020 0.253
## run800m 0.779 -0.591 -0.420 0.617 -0.700 0.020 1.000 -0.773
## score -0.923 0.767 0.800 -0.865 0.950 0.253 -0.773 1.000