After gaining a basic understanding of operations, data manipulation, simulations and plots in R, using R to apply various statistical techniques is not difficult. The catch, however, is that one needs a good conceptual understanding of each technique to ensure correct application.

Here, we will demonstrate the R commands/packages used for some common techniques. This document is not meant to teach these techniques but to familiarize the reader with the R commands and packages used to apply them.

Linear Regression

Linear regression is perhaps the first modeling technique one learns in supervised learning. Here we have a response \(y_i\) and explanatory variables \((x_{i1}, x_{i2},\ldots, x_{ik})\) collected over \(n\) subjects indexed \(i=1,2,\ldots, n\). Let us take a simple example where there is only one explanatory variable. The data are on the Old Faithful geyser in Yellowstone National Park, Wyoming, USA, and consist of the waiting time between eruptions and the duration of each eruption. This data set is available in the datasets package of R.

#install.packages('datasets')
library(datasets)
data(faithful)
Data<-faithful
Basic summaries of the data can be obtained as follows.
head(Data)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
summary(Data)
##    eruptions        waiting    
##  Min.   :1.600   Min.   :43.0  
##  1st Qu.:2.163   1st Qu.:58.0  
##  Median :4.000   Median :76.0  
##  Mean   :3.488   Mean   :70.9  
##  3rd Qu.:4.454   3rd Qu.:82.0  
##  Max.   :5.100   Max.   :96.0
with(Data,plot(waiting, eruptions))
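The scatter plot suggests a strong linear relationship between waiting time and eruption duration. As a quick quantitative check (a small sketch, not part of the original analysis), the sample correlation can be computed:

with(Data, cor(waiting, eruptions))  # sample correlation between waiting and eruptions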

A linear regression model is formulated as follows:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, ~~~ \epsilon_1, \ldots, \epsilon_n \stackrel{iid}{\sim} N(0, \sigma^2)\]

The estimation of the parameters \((\beta_0, \beta_1)\) is based on the least squares criterion:

\[ \min_{\beta_0, \beta_1} \sum_{i=1}^n(y_i - \beta_0 -\beta_1 x_i)^2 \]

The estimates, along with their standard errors, can be obtained in R. In R, statistical techniques such as linear regression are implemented through specific packages and specific commands. The command for linear regression is lm() (which stands for linear model) and is available in base R.

reg<- lm(eruptions ~ waiting, Data)
summary(reg)
## 
## Call:
## lm(formula = eruptions ~ waiting, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.29917 -0.37689  0.03508  0.34909  1.19329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.874016   0.160143  -11.70   <2e-16 ***
## waiting      0.075628   0.002219   34.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4965 on 270 degrees of freedom
## Multiple R-squared:  0.8115, Adjusted R-squared:  0.8108 
## F-statistic:  1162 on 1 and 270 DF,  p-value: < 2.2e-16
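Since this is simple linear regression, the lm() estimates can be checked against the closed-form least squares solution \(\hat\beta_1 = \sum_i(x_i-\bar x)(y_i - \bar y)/\sum_i (x_i - \bar x)^2\) and \(\hat\beta_0 = \bar y - \hat\beta_1 \bar x\). The following is a small verification sketch (not part of the original analysis).

x <- Data$waiting; y <- Data$eruptions
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
beta0_hat <- mean(y) - beta1_hat * mean(x)                              # intercept
c(beta0_hat, beta1_hat)  # should match coef(reg) shown above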

The predictions from the model can be obtained and plotted as follows.

prediction<-predict(reg)

with(Data, plot(waiting, eruptions))
with(Data,lines(waiting, prediction, col="blue"))
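predict() can also be applied to new observations. The sketch below uses two arbitrary, purely illustrative waiting times (60 and 90 minutes) and requests prediction intervals.

newdat <- data.frame(waiting = c(60, 90))              # hypothetical new waiting times
predict(reg, newdata = newdat, interval = "prediction") # point predictions with intervals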

A standard diagnostic check in regression to verify the linearity assumption is to examine the studentized residuals.

library(MASS)
sresid <- studres(reg)
with(Data, plot(prediction, sresid ))

Typically, if the linearity assumption is correct, there should not be any pattern in the residual plot. What we see here indicates the presence of some pattern and suggests there may be non-linearity. There are many techniques for rectifying such a violation of the linearity assumption, but we will not delve into them here.

Another diagnostic is the qq-plot, used to check the normality of the residuals. The qq-plot compares the theoretical quantiles (or percentiles) based on the normal distribution with the quantiles (or percentiles) of the (studentized) residuals. A standard assumption in linear regression is that the errors are normally distributed. If this assumption is reasonable, the qq-plot should produce a scatter that is close to a straight line.

qqnorm(sresid)
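A reference line through the first and third quartiles makes departures from the straight-line pattern easier to judge; it can be added to the plot above with qqline() (a small convenience addition).

qqline(sresid, col = "red")  # adds a reference line to the qq-plot drawn above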

We will just note here that regression with multiple explanatory variables is done just as easily. We will use the mtcars data to illustrate. Suppose we want to model miles per gallon (mpg) on all other variables.

library(datasets)
data(mtcars)

reg2<- lm(mpg ~ . , data=mtcars)
summary(reg2)
## 
## Call:
## lm(formula = mpg ~ ., data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
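Various pieces of a fitted lm object can be extracted with standard functions; a small sketch follows.

coef(reg2)            # estimated coefficients
confint(reg2)         # 95% confidence intervals for the coefficients
head(fitted(reg2))    # fitted values
head(residuals(reg2)) # residuals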

Principal Component Analysis (PCA)

Principal Component Analysis is an unsupervised learning technique used for dimension reduction of multivariate data, i.e. data containing multiple variables. If \({\bf X}=(X_1,..., X_k)\) are \(k\) variables, the problem in PCA is to find the linear combination \(l^T{\bf X}\) with maximum variance subject to the constraint \(l^Tl=1\). The problem can also be thought of as finding the direction in the \(k\)-dimensional space along which the data has maximum variance. It turns out that the maximum variance attained equals the largest eigenvalue of the variance-covariance matrix of the data on \({\bf X}\), and the variance-maximizing direction is given by the eigenvector corresponding to that largest eigenvalue. Similarly, the eigenvector corresponding to the second largest eigenvalue gives the next best direction for variance, and so on.
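This connection to the eigen-decomposition can be checked directly. Below is a minimal sketch on simulated two-variable data (the variables and their spreads are hypothetical, chosen only for illustration).

set.seed(1)
X <- cbind(rnorm(200, sd = 3), rnorm(200, sd = 1))  # simulated data with unequal spread
e <- eigen(cov(X))
l <- e$vectors[, 1]   # first eigenvector (unit length, so l'l = 1)
var(X %*% l)          # variance of the linear combination l'X ...
e$values[1]           # ... equals the largest eigenvalue of cov(X)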

To apply principal components meaningfully, a good understanding of the concepts of eigenvalues, eigenvectors and the spectral decomposition from linear algebra is desirable. The procedure to use in R is as below. We will consider the heptathlon dataset in the HSAUR3 package. It contains results of the women's heptathlon, a seven-event combined competition that made its first entry into the Olympics in 1984. The data consist of the results on the 7 events, along with the overall score, for 25 competitors in the 1988 Olympics.

library(HSAUR3)
## Warning: package 'HSAUR3' was built under R version 3.5.3
## Loading required package: tools
data(heptathlon)
head(heptathlon)
##                     hurdles highjump  shot run200m longjump javelin run800m
## Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
## John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
## Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20
## Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24
## Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90
## Schulz (GDR)          13.75     1.83 13.50   24.65     6.33   42.82  125.79
##                     score
## Joyner-Kersee (USA)  7291
## John (GDR)           6897
## Behmer (GDR)         6858
## Sablovskaite (URS)   6540
## Choubenkova (URS)    6540
## Schulz (GDR)         6411
dim(heptathlon)
## [1] 25  8
#help(heptathlon)
with(heptathlon, plot(rep(1, dim(heptathlon)[1]), score, col="blue", xlab="", ylab="score"  ))
with(heptathlon, text(rep(1, dim(heptathlon)[1]), score, labels=row.names(heptathlon), offset=.5, pos=4, cex=.5))

We would like to distinguish the players based on their performance. Note that it is hard to do so based on just the final score. We will try to do it using the first principal component obtained from the data. In many situations, we do not want the determination of maximum variance to be influenced by some variables just because they are measured in different units. If it is, some large-valued variables may show a large variance even though they do not distinguish the records well. So, one approach is to standardize the variables in the data by subtracting the mean ("centering") and dividing by the standard deviation ("scaling"). It turns out that the covariance matrix of the standardized variables is just the correlation matrix of the original variables. Accordingly, in R, we can specify that the procedure use the correlation matrix instead of the covariance matrix.
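This fact can be verified directly (a quick check, not part of the original analysis):

# The covariance matrix of the standardized columns equals the
# correlation matrix of the original columns (up to numerical precision).
all.equal(cov(scale(heptathlon)), cor(heptathlon))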

pcfit <- princomp(heptathlon, cor=TRUE)
summary(pcfit) 
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.3337239 1.0962319 0.72182793 0.67615707 0.49697892
## Proportion of Variance 0.6807834 0.1502156 0.06512945 0.05714855 0.03087351
## Cumulative Proportion  0.6807834 0.8309990 0.89612842 0.95327697 0.98415048
##                             Comp.6      Comp.7       Comp.8
## Standard deviation     0.271578567 0.221442919 0.0632795567
## Proportion of Variance 0.009219365 0.006129621 0.0005005378
## Cumulative Proportion  0.993369841 0.999499462 1.0000000000

As per this output, the percentage of total variance explained by the first principal component direction (i.e. the direction of maximum variance) is about 68%. If we use two dimensions to represent the data, by taking the first two principal component directions, then we can explain about 83% of the variance.
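These proportions can also be recovered directly from the eigenvalues of the correlation matrix (a small verification sketch):

ev <- eigen(cor(heptathlon))$values
ev / sum(ev)          # proportion of variance for each component
cumsum(ev) / sum(ev)  # cumulative proportion, matching the summary above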

The scree plot below is a pictorial visualization of the variances explained by each principal component.

plot(pcfit,type="lines")

The linear combinations of data columns that yield the maximum variance (component 1), next best variance (component 2) and so on are called the “loadings”. They are obtained as below.

loadings(pcfit) 
## 
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## hurdles   0.407  0.178                0.113  0.795  0.382       
## highjump -0.339 -0.264 -0.367 -0.679                0.439 -0.145
## shot     -0.332  0.273  0.677 -0.123  0.494         0.223 -0.203
## run200m   0.370 -0.240        -0.363  0.665        -0.457  0.124
## longjump -0.412         0.141 -0.110 -0.198  0.572 -0.602 -0.252
## javelin          0.836 -0.471 -0.120  0.124        -0.170 -0.116
## run800m   0.338  0.239  0.395 -0.604 -0.491 -0.133 -0.105  0.194
## score    -0.426                                            0.893
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings     1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
## Proportion Var  0.125  0.125  0.125  0.125  0.125  0.125  0.125  0.125
## Cumulative Var  0.125  0.250  0.375  0.500  0.625  0.750  0.875  1.000

Now, one just needs to compute \(l^T {\bf X}\) for every row \({\bf X}\) in the data. Taking \(l\) as the loadings for the first component gives the linear combination of the data columns that has maximum variance, taking \(l\) as the loadings for the second component gives the linear combination with the next best variance, and so on. These linear combinations are the component "scores".

pcfit$scores 
##                          Comp.1      Comp.2     Comp.3      Comp.4      Comp.5
## Joyner-Kersee (USA) -4.73644855  1.22539333  0.3793062 -0.02196748 -0.43729779
## John (GDR)          -3.28440666  0.50281397  0.9168877  0.48550552  0.71013213
## Behmer (GDR)        -3.04070072  0.66552951 -0.4674166  0.69468140 -0.10943894
## Sablovskaite (URS)  -1.59214658  0.69947622  0.6075720  0.14314333  0.46528122
## Choubenkova (URS)   -1.61876350  1.77837280 -0.1526263  0.85387062  0.69303571
## Schulz (GDR)        -1.20977434 -0.08669928 -0.6888783  0.20932582  0.75539868
## Fleming (AUS)       -1.20996826 -0.34829112 -0.0739671  0.49729341 -0.77976998
## Greiner (USA)       -1.00140330 -0.83695433  0.8290977  0.03048597  0.09353036
## Lajbnerova (CZE)    -0.61564151  0.14577624  0.1642336 -0.62889647  0.57671005
## Bouraga (URS)       -0.81676153 -0.55162879  0.1879048  0.68222247 -1.04182864
## Wijnsma (HOL)       -0.58675246 -1.43042703 -0.1400534 -0.41428025  0.30403239
## Dimitrova (BUL)     -1.15127240 -0.40509917 -0.0806038  0.49443290 -0.82612806
## Scheider (SWI)      -0.03088395  0.82771529 -2.0079814 -0.74761217 -0.02524885
## Braun (FRG)         -0.01831843  0.72956299 -0.3314211 -1.08727927 -0.19328575
## Ruotsalainen (FIN)   0.06751278  0.78163321 -0.9648576 -0.27359910 -0.19082726
## Yuping (CHN)        -0.11797365 -0.55321949  1.0867539 -1.66532434 -0.21631179
## Hagger (GB)          0.26570642 -1.77992363  0.5980197 -0.48177822 -0.05296740
## Brown (USA)          0.56258410  0.74800208 -0.3193688 -1.31527290 -0.51047191
## Mulliner (GB)        1.31010266 -0.63798124  0.7399102  0.59086483 -0.15373614
## Hautenauve (BEL)     1.29504614 -1.87374132  0.0133772  0.25941677  0.20475008
## Kytola (FIN)         1.65452526 -0.92398335 -0.6606445  0.21813098  0.51680688
## Geremias (BRA)       2.30639362 -0.07589554  0.6609723 -0.02562593  0.24570720
## Hui-Ing (TAI)        3.27988022 -0.64256494 -0.7655648  1.14143591 -0.47351027
## Jeong-Mi (KOR)       3.36476324 -0.93996824 -0.5850645  0.11590791  0.60288550
## Launa (PNG)          6.92470138  2.98210183  1.0544130  0.24491829 -0.15744742
##                          Comp.6       Comp.7        Comp.8
## Joyner-Kersee (USA)  0.35468155 -0.354294635  0.0499179341
## John (GDR)          -0.24937415 -0.147288228 -0.0130098300
## Behmer (GDR)         0.24560071  0.133260633  0.0288725786
## Sablovskaite (URS)  -0.09302220  0.494524686  0.0790232430
## Choubenkova (URS)   -0.13755103 -0.243099856 -0.0520519296
## Schulz (GDR)         0.35766289  0.108969683 -0.0287885207
## Fleming (AUS)       -0.08018676  0.143534214  0.0546954057
## Greiner (USA)        0.15345562 -0.033506531 -0.0180293624
## Lajbnerova (CZE)    -0.27680329  0.253591230  0.0041538209
## Bouraga (URS)       -0.39114739  0.014838280  0.1139690296
## Wijnsma (HOL)        0.35111531  0.188677157 -0.0014908073
## Dimitrova (BUL)     -0.26537935  0.077475405 -0.2527593382
## Scheider (SWI)       0.00371748 -0.036704338 -0.0027736911
## Braun (FRG)         -0.27883136 -0.047064168  0.0072713956
## Ruotsalainen (FIN)  -0.14243817 -0.139150766  0.0122289171
## Yuping (CHN)         0.28332519  0.177143125 -0.0324982135
## Hagger (GB)         -0.14116135 -0.533014857  0.0371422763
## Brown (USA)          0.07430305  0.005915267 -0.0035067589
## Mulliner (GB)        0.43652255 -0.079806513 -0.0456489065
## Hautenauve (BEL)     0.10638046 -0.087306001  0.0164130338
## Kytola (FIN)         0.07277970  0.128755033 -0.0033252945
## Geremias (BRA)      -0.65925197  0.216469848 -0.0047145336
## Hui-Ing (TAI)        0.19142764  0.211141285  0.0334977126
## Jeong-Mi (KOR)      -0.18266970 -0.391575777  0.0212358269
## Launa (PNG)          0.26684456 -0.061484177  0.0001760122
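As a check (a sketch only), the first-component scores can be reproduced by standardizing each column and applying the first loading vector; note that princomp() scales by standard deviations computed with divisor \(n\) rather than \(n-1\), so we adjust for that below.

n <- nrow(heptathlon)
sds_n <- apply(heptathlon, 2, sd) * sqrt((n - 1) / n)  # divisor-n standard deviations
Z <- scale(heptathlon, center = TRUE, scale = sds_n)   # standardized data
manual_score1 <- Z %*% loadings(pcfit)[, 1]            # l'X for the first loading vector
head(cbind(manual_score1, pcfit$scores[, 1]))          # the two columns should agree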

Let us now plot the players w.r.t. the first principal component and compare with the plot w.r.t. the overall score given in the data. To aid the comparison, we standardize both variables. We can see that the players are more easily distinguished if we use the first principal component score.

std_overallscore<- with(heptathlon, (score - mean(score))/sd(score)) 
std_PCAscore1<- with(heptathlon, (pcfit$scores[,1] - mean(pcfit$scores[,1]))/sd(pcfit$scores[,1])) 
par(mfrow=c(1,2))
with(heptathlon, plot(rep(1, dim(heptathlon)[1]), std_overallscore, col="blue", xlab="", ylab="overall score"  , ylim=c(-3,3)))
with(heptathlon, text(rep(1, dim(heptathlon)[1]), std_overallscore, labels=row.names(heptathlon), offset=.5, pos=4, cex=.75))


with(heptathlon, plot(rep(1, dim(heptathlon)[1]), std_PCAscore1, col="red", xlab="", ylab="First PCA score" , ylim=c(-3,3) ))
with(heptathlon, text(rep(1, dim(heptathlon)[1]), std_PCAscore1, labels=row.names(heptathlon), offset=.5, pos=4, cex=.75))

Further, we can more easily distinguish the players if we use two dimensions, viz. first and second principal component scores.

with(heptathlon, plot(pcfit$scores[,1], pcfit$scores[,2], col="red",xlab="First PCA score",ylab="Second PCA score" , xlim=c(-5,5), ylim=c(-2.5,2.5) ))
with(heptathlon, text(pcfit$scores[,1], pcfit$scores[,2], labels=row.names(heptathlon), offset=.5, pos=4, cex=.75))

Here, we have discussed packages that can be used to implement two techniques: Linear Regression and Principal Component Analysis. Applying other techniques will similarly involve looking up the relevant commands, such as lda() for Linear Discriminant Analysis, fa() for Factor Analysis, etc. However, a conceptual understanding of the technique involved is important to ensure correct application of the tools in R.