The covariance matrix computed from the data is important for multivariate analysis techniques such as principal component analysis or factor analysis. The var() function in R can be used to directly compute the same in R. However, here we do a direct computation for clarity of understanding.

Example data

Let us consider the hepthalon data available in R. This is the scors of various contestants in the 1984 olympics. This data is organized in a matrix, i.e. rows and columns, with each row corresponding to a contestant and column to a score in particular event or overall score.

library(HSAUR3)
## Loading required package: tools
data(heptathlon)
head(heptathlon, n=2L)
##                     hurdles highjump  shot run200m longjump javelin run800m
## Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
## John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
##                     score
## Joyner-Kersee (USA)  7291
## John (GDR)           6897

Covariance Matrix using var() function

To compute the covariance matrix of the numerical columns we will first extract the numerical columns into a matrix and compute the covraince matrix using the var() function.

X<-heptathlon
Cov<-var(X)
Cov
##               hurdles   highjump        shot      run200m    longjump
## hurdles     0.5426500 -0.0465875  -0.7158125    0.5526083  -0.3186333
## highjump   -0.0465875  0.0060750   0.0512550   -0.0368525   0.0289200
## shot       -0.7158125  0.0512550   2.2257190   -0.9874603   0.5257018
## run200m     0.5526083 -0.0368525  -0.9874603    0.9400410  -0.3757313
## longjump   -0.3186333  0.0289200   0.5257018   -0.3757313   0.2248773
## javelin    -0.0202750  0.0005950   1.4228727   -1.1449063   0.1128357
## run800m     4.7594000 -0.3820250  -5.1904192    4.9583408  -2.7502933
## score    -386.6004167 34.0000000 678.2173333 -476.6920000 256.2143333
##              javelin      run800m       score
## hurdles   -0.0202750     4.759400   -386.6004
## highjump   0.0005950    -0.382025     34.0000
## shot       1.4228727    -5.190419    678.2173
## run200m   -1.1449063     4.958341   -476.6920
## longjump   0.1128357    -2.750293    256.2143
## javelin   12.5716773     0.589390    510.2418
## run800m    0.5893900    68.742142  -3642.2717
## score    510.2418333 -3642.271667 323157.8333

Covariance matrix direct computation

We will now compute the covariance matrix directly using X.

# number of data points
n<- dim(X)[1] 

# vector of column means
mean_X<- colMeans(X)
mean_X
##   hurdles  highjump      shot   run200m  longjump   javelin   run800m     score 
##   13.8400    1.7820   13.1176   24.6492    6.1524   41.4824  136.0540 6090.6000
# subtraxt column means from each respective column
Y<- as.matrix( X - cbind(rep(1,n)) %*% rbind(mean_X))

Cov_direct<- 1/(n-1) * t(Y) %*% Y 

# direct
round(Cov_direct,2)
##          hurdles highjump   shot run200m longjump javelin  run800m     score
## hurdles     0.54    -0.05  -0.72    0.55    -0.32   -0.02     4.76   -386.60
## highjump   -0.05     0.01   0.05   -0.04     0.03    0.00    -0.38     34.00
## shot       -0.72     0.05   2.23   -0.99     0.53    1.42    -5.19    678.22
## run200m     0.55    -0.04  -0.99    0.94    -0.38   -1.14     4.96   -476.69
## longjump   -0.32     0.03   0.53   -0.38     0.22    0.11    -2.75    256.21
## javelin    -0.02     0.00   1.42   -1.14     0.11   12.57     0.59    510.24
## run800m     4.76    -0.38  -5.19    4.96    -2.75    0.59    68.74  -3642.27
## score    -386.60    34.00 678.22 -476.69   256.21  510.24 -3642.27 323157.83
# var function
round(Cov,2)
##          hurdles highjump   shot run200m longjump javelin  run800m     score
## hurdles     0.54    -0.05  -0.72    0.55    -0.32   -0.02     4.76   -386.60
## highjump   -0.05     0.01   0.05   -0.04     0.03    0.00    -0.38     34.00
## shot       -0.72     0.05   2.23   -0.99     0.53    1.42    -5.19    678.22
## run200m     0.55    -0.04  -0.99    0.94    -0.38   -1.14     4.96   -476.69
## longjump   -0.32     0.03   0.53   -0.38     0.22    0.11    -2.75    256.21
## javelin    -0.02     0.00   1.42   -1.14     0.11   12.57     0.59    510.24
## run800m     4.76    -0.38  -5.19    4.96    -2.75    0.59    68.74  -3642.27
## score    -386.60    34.00 678.22 -476.69   256.21  510.24 -3642.27 323157.83

Ofcourse, both computations give the same result.

Correlation matrix direct computation

The correlation matrix can be obtained by using the cor() function in R.

# computation using cor() function
cor_X<- cor(X)
round(cor_X,3)
##          hurdles highjump   shot run200m longjump javelin run800m  score
## hurdles    1.000   -0.811 -0.651   0.774   -0.912  -0.008   0.779 -0.923
## highjump  -0.811    1.000  0.441  -0.488    0.782   0.002  -0.591  0.767
## shot      -0.651    0.441  1.000  -0.683    0.743   0.269  -0.420  0.800
## run200m    0.774   -0.488 -0.683   1.000   -0.817  -0.333   0.617 -0.865
## longjump  -0.912    0.782  0.743  -0.817    1.000   0.067  -0.700  0.950
## javelin   -0.008    0.002  0.269  -0.333    0.067   1.000   0.020  0.253
## run800m    0.779   -0.591 -0.420   0.617   -0.700   0.020   1.000 -0.773
## score     -0.923    0.767  0.800  -0.865    0.950   0.253  -0.773  1.000
# Direct computation
Z<- scale(X)  # this function subtracts the mean from each column and divides by column standard deviation
cor_X_direct<- 1/(n-1)* t(Z)%*%Z
round(cor_X_direct,3)
##          hurdles highjump   shot run200m longjump javelin run800m  score
## hurdles    1.000   -0.811 -0.651   0.774   -0.912  -0.008   0.779 -0.923
## highjump  -0.811    1.000  0.441  -0.488    0.782   0.002  -0.591  0.767
## shot      -0.651    0.441  1.000  -0.683    0.743   0.269  -0.420  0.800
## run200m    0.774   -0.488 -0.683   1.000   -0.817  -0.333   0.617 -0.865
## longjump  -0.912    0.782  0.743  -0.817    1.000   0.067  -0.700  0.950
## javelin   -0.008    0.002  0.269  -0.333    0.067   1.000   0.020  0.253
## run800m    0.779   -0.591 -0.420   0.617   -0.700   0.020   1.000 -0.773
## score     -0.923    0.767  0.800  -0.865    0.950   0.253  -0.773  1.000