We will learn some data manipulations in R For that, we will use the data called “mtcars” within the “datasets” library of R. We will first get the data. One may check the help to understand the data description: Quoting the R help documentation: “The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models)”

Import data

If the data is stored as a csv file in a directory, we can import it using the read.table() command. It is good idea to map a folder in the computer as a working directory in R for the analysis so that we can export and import files to and from that directory. A preferred approach is to work with csv files for data.

setwd('C:\\Users\\IIMA\\Google Drive\\R introduction\\')

mtcars<-read.table("mtcars.csv", row.names=1, sep=",", header=TRUE)
View(mtcars)

By specifying row.names=1 , the first column will be treated as row names. If we do not specify this, the column will be imported as a separate column.

Since this data is available in the “datasets” library, we can obtain it directly within R as follows.

#install.packages("datasets")
library(datasets)
data(mtcars)
View(mtcars)
#help(mtcars)

Basic data manipulations

Let us look at a snapshot of first 10 rows of the data to understand what is in it. We can see that the row headings indicate the name of the car model and we can see the nature of other variables, e.g. vs is a 0-1 variable, mpg is continuous etc.

# dimension of data  i.e number of rows and columns

print(dim(mtcars))
## [1] 32 11
# snapshot of data
head(mtcars, n=10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Suppose we want to change the column names to make it more descriptive and helpful for interpretation, we can do it as follows.

# view the current column names

colnames(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
# to change the column names

colnames(mtcars)<- c("miles_per_gal", "cylinders", "displacement", "horse_power", "rear_axle_ratio", "weight", "1_4miletime", "straight_engine", "tansmission", "gears", "carburetors")

head(mtcars, n=2)
##               miles_per_gal cylinders displacement horse_power rear_axle_ratio
## Mazda RX4                21         6          160         110             3.9
## Mazda RX4 Wag            21         6          160         110             3.9
##               weight 1_4miletime straight_engine tansmission gears carburetors
## Mazda RX4      2.620       16.46               0           1     4           4
## Mazda RX4 Wag  2.875       17.02               0           1     4           4

We note that data is essentially a matrix of columns and rows. We can access the element in the 3rd row and 1st column, we can access it as below.

# accessing element in 3rd row and 1st column
mtcars[3,1]
## [1] 22.8
# accessing all elements of the 1st column
mtcars[,1]
##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
# accessing all elements of the 3rd row
mtcars[3,]
##            miles_per_gal cylinders displacement horse_power rear_axle_ratio
## Datsun 710          22.8         4          108          93            3.85
##            weight 1_4miletime straight_engine tansmission gears carburetors
## Datsun 710   2.32       18.61               1           1     4           1

Let us find out the average mileage in the data and compare it with that for MAZDA RX4. Note that for this we need to access the row names and check which rows are for this model.

# Accessing row names
rownames(mtcars)
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"
# obtaining the indices of rows for which row name is "MAzda RX4" 
row_list<-which(rownames(mtcars)=="Mazda RX4")
print(row_list)
## [1] 1
# Check how many rows satisfy the condition

length(row_list)
## [1] 1
# Obtain the details for that row

mtcars[row_list,]
##           miles_per_gal cylinders displacement horse_power rear_axle_ratio
## Mazda RX4            21         6          160         110             3.9
##           weight 1_4miletime straight_engine tansmission gears carburetors
## Mazda RX4   2.62       16.46               0           1     4           4
#  mileage for MAZDA RX4

# method 1
mtcars[row_list, 1]
## [1] 21
# method 2
mtcars$miles_per_gal[1]
## [1] 21
# Average mileage for all cars
mean(mtcars$miles_per_gal)
## [1] 20.09062

Suppose we want to extract the rows for all rows which are more “merc”.

row_names<- rownames(mtcars)

substr(row_names, 1,4)
##  [1] "Mazd" "Mazd" "Dats" "Horn" "Horn" "Vali" "Dust" "Merc" "Merc" "Merc"
## [11] "Merc" "Merc" "Merc" "Merc" "Cadi" "Linc" "Chry" "Fiat" "Hond" "Toyo"
## [21] "Toyo" "Dodg" "AMC " "Cama" "Pont" "Fiat" "Pors" "Lotu" "Ford" "Ferr"
## [31] "Mase" "Volv"
merc_list<- which(substr(row_names, 1,4)=="Merc")
print(merc_list)
## [1]  8  9 10 11 12 13 14
##rows for Merc

data_merc<- mtcars[merc_list,]

# average mileage for Merc
mean(data_merc[,1])
## [1] 19.01429

Suppose, for merc, we want to look at miles per gallon, horse power and weight columns in the data. It can be done in multiple ways.

data_merc[, c(1,4,6)]
##             miles_per_gal horse_power weight
## Merc 240D            24.4          62   3.19
## Merc 230             22.8          95   3.15
## Merc 280             19.2         123   3.44
## Merc 280C            17.8         123   3.44
## Merc 450SE           16.4         180   4.07
## Merc 450SL           17.3         180   3.73
## Merc 450SLC          15.2         180   3.78
#or

cbind(data_merc$miles_per_gal, data_merc$horse_power, data_merc$weight )
##      [,1] [,2] [,3]
## [1,] 24.4   62 3.19
## [2,] 22.8   95 3.15
## [3,] 19.2  123 3.44
## [4,] 17.8  123 3.44
## [5,] 16.4  180 4.07
## [6,] 17.3  180 3.73
## [7,] 15.2  180 3.78
#or 
with(data_merc, cbind(miles_per_gal, horse_power, weight))
##      miles_per_gal horse_power weight
## [1,]          24.4          62   3.19
## [2,]          22.8          95   3.15
## [3,]          19.2         123   3.44
## [4,]          17.8         123   3.44
## [5,]          16.4         180   4.07
## [6,]          17.3         180   3.73
## [7,]          15.2         180   3.78

Suppose we want to pull the records for Mazda, Mercedes or Hornet.

sub_list<- which(substr(row_names, 1,4 ) %in% c("Merc", "Mazd", "Horn"))
mtcars[sub_list,]
##                   miles_per_gal cylinders displacement horse_power
## Mazda RX4                  21.0         6        160.0         110
## Mazda RX4 Wag              21.0         6        160.0         110
## Hornet 4 Drive             21.4         6        258.0         110
## Hornet Sportabout          18.7         8        360.0         175
## Merc 240D                  24.4         4        146.7          62
## Merc 230                   22.8         4        140.8          95
## Merc 280                   19.2         6        167.6         123
## Merc 280C                  17.8         6        167.6         123
## Merc 450SE                 16.4         8        275.8         180
## Merc 450SL                 17.3         8        275.8         180
## Merc 450SLC                15.2         8        275.8         180
##                   rear_axle_ratio weight 1_4miletime straight_engine
## Mazda RX4                    3.90  2.620       16.46               0
## Mazda RX4 Wag                3.90  2.875       17.02               0
## Hornet 4 Drive               3.08  3.215       19.44               1
## Hornet Sportabout            3.15  3.440       17.02               0
## Merc 240D                    3.69  3.190       20.00               1
## Merc 230                     3.92  3.150       22.90               1
## Merc 280                     3.92  3.440       18.30               1
## Merc 280C                    3.92  3.440       18.90               1
## Merc 450SE                   3.07  4.070       17.40               0
## Merc 450SL                   3.07  3.730       17.60               0
## Merc 450SLC                  3.07  3.780       18.00               0
##                   tansmission gears carburetors
## Mazda RX4                   1     4           4
## Mazda RX4 Wag               1     4           4
## Hornet 4 Drive              0     3           1
## Hornet Sportabout           0     3           2
## Merc 240D                   0     4           2
## Merc 230                    0     4           2
## Merc 280                    0     4           4
## Merc 280C                   0     4           4
## Merc 450SE                  0     3           3
## Merc 450SL                  0     3           3
## Merc 450SLC                 0     3           3

So far we did subsetting based on character variables. Suppose we want to find out which cars give a good mileage, say above average ?

avg_mpg<-mean(mtcars$miles_per_gal)

rownames(mtcars)[which(mtcars[,1]>avg_mpg)]
##  [1] "Mazda RX4"      "Mazda RX4 Wag"  "Datsun 710"     "Hornet 4 Drive"
##  [5] "Merc 240D"      "Merc 230"       "Fiat 128"       "Honda Civic"   
##  [9] "Toyota Corolla" "Toyota Corona"  "Fiat X1-9"      "Porsche 914-2" 
## [13] "Lotus Europa"   "Volvo 142E"
# Ofcourse one can use multiple conditions

# "or"" condition
rownames(mtcars)[which((mtcars[,1]>avg_mpg) | (mtcars$horse_power>150))]   # i..e rows with any one condition is satisfied
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Duster 360"         
##  [7] "Merc 240D"           "Merc 230"            "Merc 450SE"         
## [10] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [13] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [16] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [19] "Camaro Z28"          "Pontiac Firebird"    "Fiat X1-9"          
## [22] "Porsche 914-2"       "Lotus Europa"        "Ford Pantera L"     
## [25] "Ferrari Dino"        "Maserati Bora"       "Volvo 142E"
# "and  "  condition
rownames(mtcars)[which((mtcars[,1]>avg_mpg) & (mtcars$horse_power>150))]   # i..e rows with both conditions satisfied
## character(0)
# "and  "  condition
rownames(mtcars)[which((mtcars[,1]>avg_mpg) & (mtcars$horse_power>110))]   # i..e rows with both conditions satisfied
## [1] "Lotus Europa"

Suppose we want to compute overall column-wise summary of the data

# Mean
apply(mtcars,2, mean)
##   miles_per_gal       cylinders    displacement     horse_power rear_axle_ratio 
##       20.090625        6.187500      230.721875      146.687500        3.596563 
##          weight     1_4miletime straight_engine     tansmission           gears 
##        3.217250       17.848750        0.437500        0.406250        3.687500 
##     carburetors 
##        2.812500
#SD
apply(mtcars,2, sd)
##   miles_per_gal       cylinders    displacement     horse_power rear_axle_ratio 
##       6.0269481       1.7859216     123.9386938      68.5628685       0.5346787 
##          weight     1_4miletime straight_engine     tansmission           gears 
##       0.9784574       1.7869432       0.5040161       0.4989909       0.7378041 
##     carburetors 
##       1.6152000

Suppose we want to do the summary of mileage by cylinder and engine type

# Mean mileage by cylinder and engine type
mpg_mean<-aggregate( miles_per_gal ~  cylinders + straight_engine, data=mtcars, FUN=mean  )
colnames(mpg_mean)[3]<-"mean_mpg"
print(mpg_mean)
##   cylinders straight_engine mean_mpg
## 1         4               0 26.00000
## 2         6               0 20.56667
## 3         8               0 15.10000
## 4         4               1 26.73000
## 5         6               1 19.12500
#SD of mileage by cylinder and engine type

mpg_sd<-aggregate( miles_per_gal ~  cylinders + straight_engine, data=mtcars, FUN=sd  )
colnames(mpg_sd)[3]<-"sd_mpg"

print(mpg_sd)
##   cylinders straight_engine    sd_mpg
## 1         4               0        NA
## 2         6               0 0.7505553
## 3         8               0 2.5600481
## 4         4               1 4.7481107
## 5         6               1 1.6317169
# count of records within each group
mpg_count<-aggregate( miles_per_gal ~  cylinders + straight_engine, data=mtcars, FUN=length  )
colnames(mpg_count)[3]<-"count_mpg"

print(mpg_count)
##   cylinders straight_engine count_mpg
## 1         4               0         1
## 2         6               0         3
## 3         8               0        14
## 4         4               1        10
## 5         6               1         4

Suppose we want to report sd for only those records with at least 5 records.

mpg_sub<- which(mpg_count[,3]>=5)

mpg_sd_sub<- mpg_sd[mpg_sub,]

print(mpg_sd_sub)
##   cylinders straight_engine   sd_mpg
## 3         8               0 2.560048
## 4         4               1 4.748111

Now let us create a data which will contain both mean and sd, but sd reported only when count is at least 5. We will do this by merging.

merged_data<-merge(mpg_mean, mpg_sd_sub, by.x= c("cylinders","straight_engine"), by.y= c("cylinders","straight_engine") , all.x="TRUE", all.y="TRUE")

print(merged_data)
##   cylinders straight_engine mean_mpg   sd_mpg
## 1         4               0 26.00000       NA
## 2         4               1 26.73000 4.748111
## 3         6               0 20.56667       NA
## 4         6               1 19.12500       NA
## 5         8               0 15.10000 2.560048

Here’s a way to sort the data in increasing order of average mileage

ord<-with(merged_data, order(mean_mpg, decreasing=FALSE))
print(merged_data[ord,])
##   cylinders straight_engine mean_mpg   sd_mpg
## 5         8               0 15.10000 2.560048
## 4         6               1 19.12500       NA
## 3         6               0 20.56667       NA
## 1         4               0 26.00000       NA
## 2         4               1 26.73000 4.748111

Exporting/Importing csv files

setwd('C:\\Users\\IIMA\\Google Drive\\R introduction\\')

#export
write.csv(mtcars, "exported_mtcars.csv")

#import
imported_data<-read.csv("exported_mtcars.csv")

# Note here that the rownames appear as s separate column. If we wanted it to be treated as row names then we would use read.table() as above.

View(imported_data)

Statistical summaries of data

Before doing any deep analysis with a dataset, it is good to get an idea for basic summaries such as minimum, maximum and quartiles. In order to demonstrate what happens with categorical variables, we also create one column for engine type which is non-numeric or categorical before doing the summary.

data(mtcars)
colnames(mtcars)<- c("miles_per_gal", "cylinders", "displacement", "horse_power", "rear_axle_ratio", "weight", "1_4miletime", "straight_engine", "tansmission", "gears", "carburetors")
mtcars$engtype<-"straight"
mtcars$engtype[which(mtcars$straight_engine==0)]<- "V shaped"

print(summary(mtcars))
##  miles_per_gal     cylinders      displacement    horse_power   
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##  rear_axle_ratio     weight       1_4miletime    straight_engine 
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##   tansmission         gears        carburetors      engtype         
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000   Length:32         
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000   Class :character  
##  Median :0.0000   Median :4.000   Median :2.000   Mode  :character  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812                     
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000                     
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

The way to look t categorical variables is to summarize the frequencies by each category.

print(table(mtcars$engtype))
## 
## straight V shaped 
##       14       18

When we are looking at data such as this, we would be curious about correlations between variables. We can compute it as below.

data(mtcars)
round(cor(mtcars),2)
##        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## mpg   1.00 -0.85 -0.85 -0.78  0.68 -0.87  0.42  0.66  0.60  0.48 -0.55
## cyl  -0.85  1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp -0.85  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp   -0.78  0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat  0.68 -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt   -0.87  0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec  0.42 -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs    0.66 -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am    0.60 -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear  0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb -0.55  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00

A better visualization of correlations can be obtained by using the corplot pakage

#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
C<-cor(mtcars)
corrplot(C, method="circle")

corrplot(C, method="number")