We will learn some data manipulations in R For that, we will use the data called “mtcars” within the “datasets” library of R. We will first get the data. One may check the help to understand the data description: Quoting the R help documentation: “The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models)”
If the data is stored as a csv file in a directory, we can import it using the read.table() command. It is good idea to map a folder in the computer as a working directory in R for the analysis so that we can export and import files to and from that directory. A preferred approach is to work with csv files for data.
setwd('C:\\Users\\IIMA\\Google Drive\\R introduction\\')
mtcars<-read.table("mtcars.csv", row.names=1, sep=",", header=TRUE)
View(mtcars)
By specifying row.names=1 , the first column will be treated as row names. If we do not specify this, the column will be imported as a separate column.
Since this data is available in the “datasets” library, we can obtain it directly within R as follows.
#install.packages("datasets")
library(datasets)
data(mtcars)
View(mtcars)
#help(mtcars)
Let us look at a snapshot of first 10 rows of the data to understand what is in it. We can see that the row headings indicate the name of the car model and we can see the nature of other variables, e.g. vs is a 0-1 variable, mpg is continuous etc.
# dimension of data i.e number of rows and columns
print(dim(mtcars))
## [1] 32 11
# snapshot of data
head(mtcars, n=10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Suppose we want to change the column names to make it more descriptive and helpful for interpretation, we can do it as follows.
# view the current column names
colnames(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
# to change the column names
colnames(mtcars)<- c("miles_per_gal", "cylinders", "displacement", "horse_power", "rear_axle_ratio", "weight", "1_4miletime", "straight_engine", "tansmission", "gears", "carburetors")
head(mtcars, n=2)
## miles_per_gal cylinders displacement horse_power rear_axle_ratio
## Mazda RX4 21 6 160 110 3.9
## Mazda RX4 Wag 21 6 160 110 3.9
## weight 1_4miletime straight_engine tansmission gears carburetors
## Mazda RX4 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 2.875 17.02 0 1 4 4
We note that data is essentially a matrix of columns and rows. We can access the element in the 3rd row and 1st column, we can access it as below.
# accessing element in 3rd row and 1st column
mtcars[3,1]
## [1] 22.8
# accessing all elements of the 1st column
mtcars[,1]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
# accessing all elements of the 3rd row
mtcars[3,]
## miles_per_gal cylinders displacement horse_power rear_axle_ratio
## Datsun 710 22.8 4 108 93 3.85
## weight 1_4miletime straight_engine tansmission gears carburetors
## Datsun 710 2.32 18.61 1 1 4 1
Let us find out the average mileage in the data and compare it with that for MAZDA RX4. Note that for this we need to access the row names and check which rows are for this model.
# Accessing row names
rownames(mtcars)
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
# obtaining the indices of rows for which row name is "MAzda RX4"
row_list<-which(rownames(mtcars)=="Mazda RX4")
print(row_list)
## [1] 1
# Check how many rows satisfy the condition
length(row_list)
## [1] 1
# Obtain the details for that row
mtcars[row_list,]
## miles_per_gal cylinders displacement horse_power rear_axle_ratio
## Mazda RX4 21 6 160 110 3.9
## weight 1_4miletime straight_engine tansmission gears carburetors
## Mazda RX4 2.62 16.46 0 1 4 4
# mileage for MAZDA RX4
# method 1
mtcars[row_list, 1]
## [1] 21
# method 2
mtcars$miles_per_gal[1]
## [1] 21
# Average mileage for all cars
mean(mtcars$miles_per_gal)
## [1] 20.09062
Suppose we want to extract the rows for all rows which are more “merc”.
row_names<- rownames(mtcars)
substr(row_names, 1,4)
## [1] "Mazd" "Mazd" "Dats" "Horn" "Horn" "Vali" "Dust" "Merc" "Merc" "Merc"
## [11] "Merc" "Merc" "Merc" "Merc" "Cadi" "Linc" "Chry" "Fiat" "Hond" "Toyo"
## [21] "Toyo" "Dodg" "AMC " "Cama" "Pont" "Fiat" "Pors" "Lotu" "Ford" "Ferr"
## [31] "Mase" "Volv"
merc_list<- which(substr(row_names, 1,4)=="Merc")
print(merc_list)
## [1] 8 9 10 11 12 13 14
##rows for Merc
data_merc<- mtcars[merc_list,]
# average mileage for Merc
mean(data_merc[,1])
## [1] 19.01429
Suppose, for merc, we want to look at miles per gallon, horse power and weight columns in the data. It can be done in multiple ways.
data_merc[, c(1,4,6)]
## miles_per_gal horse_power weight
## Merc 240D 24.4 62 3.19
## Merc 230 22.8 95 3.15
## Merc 280 19.2 123 3.44
## Merc 280C 17.8 123 3.44
## Merc 450SE 16.4 180 4.07
## Merc 450SL 17.3 180 3.73
## Merc 450SLC 15.2 180 3.78
#or
cbind(data_merc$miles_per_gal, data_merc$horse_power, data_merc$weight )
## [,1] [,2] [,3]
## [1,] 24.4 62 3.19
## [2,] 22.8 95 3.15
## [3,] 19.2 123 3.44
## [4,] 17.8 123 3.44
## [5,] 16.4 180 4.07
## [6,] 17.3 180 3.73
## [7,] 15.2 180 3.78
#or
with(data_merc, cbind(miles_per_gal, horse_power, weight))
## miles_per_gal horse_power weight
## [1,] 24.4 62 3.19
## [2,] 22.8 95 3.15
## [3,] 19.2 123 3.44
## [4,] 17.8 123 3.44
## [5,] 16.4 180 4.07
## [6,] 17.3 180 3.73
## [7,] 15.2 180 3.78
Suppose we want to pull the records for Mazda, Mercedes or Hornet.
sub_list<- which(substr(row_names, 1,4 ) %in% c("Merc", "Mazd", "Horn"))
mtcars[sub_list,]
## miles_per_gal cylinders displacement horse_power
## Mazda RX4 21.0 6 160.0 110
## Mazda RX4 Wag 21.0 6 160.0 110
## Hornet 4 Drive 21.4 6 258.0 110
## Hornet Sportabout 18.7 8 360.0 175
## Merc 240D 24.4 4 146.7 62
## Merc 230 22.8 4 140.8 95
## Merc 280 19.2 6 167.6 123
## Merc 280C 17.8 6 167.6 123
## Merc 450SE 16.4 8 275.8 180
## Merc 450SL 17.3 8 275.8 180
## Merc 450SLC 15.2 8 275.8 180
## rear_axle_ratio weight 1_4miletime straight_engine
## Mazda RX4 3.90 2.620 16.46 0
## Mazda RX4 Wag 3.90 2.875 17.02 0
## Hornet 4 Drive 3.08 3.215 19.44 1
## Hornet Sportabout 3.15 3.440 17.02 0
## Merc 240D 3.69 3.190 20.00 1
## Merc 230 3.92 3.150 22.90 1
## Merc 280 3.92 3.440 18.30 1
## Merc 280C 3.92 3.440 18.90 1
## Merc 450SE 3.07 4.070 17.40 0
## Merc 450SL 3.07 3.730 17.60 0
## Merc 450SLC 3.07 3.780 18.00 0
## tansmission gears carburetors
## Mazda RX4 1 4 4
## Mazda RX4 Wag 1 4 4
## Hornet 4 Drive 0 3 1
## Hornet Sportabout 0 3 2
## Merc 240D 0 4 2
## Merc 230 0 4 2
## Merc 280 0 4 4
## Merc 280C 0 4 4
## Merc 450SE 0 3 3
## Merc 450SL 0 3 3
## Merc 450SLC 0 3 3
So far we did subsetting based on character variables. Suppose we want to find out which cars give a good mileage, say above average ?
avg_mpg<-mean(mtcars$miles_per_gal)
rownames(mtcars)[which(mtcars[,1]>avg_mpg)]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive"
## [5] "Merc 240D" "Merc 230" "Fiat 128" "Honda Civic"
## [9] "Toyota Corolla" "Toyota Corona" "Fiat X1-9" "Porsche 914-2"
## [13] "Lotus Europa" "Volvo 142E"
# Ofcourse one can use multiple conditions
# "or"" condition
rownames(mtcars)[which((mtcars[,1]>avg_mpg) | (mtcars$horse_power>150))] # i..e rows with any one condition is satisfied
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Duster 360"
## [7] "Merc 240D" "Merc 230" "Merc 450SE"
## [10] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [13] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [16] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [19] "Camaro Z28" "Pontiac Firebird" "Fiat X1-9"
## [22] "Porsche 914-2" "Lotus Europa" "Ford Pantera L"
## [25] "Ferrari Dino" "Maserati Bora" "Volvo 142E"
# "and " condition
rownames(mtcars)[which((mtcars[,1]>avg_mpg) & (mtcars$horse_power>150))] # i..e rows with both conditions satisfied
## character(0)
# "and " condition
rownames(mtcars)[which((mtcars[,1]>avg_mpg) & (mtcars$horse_power>110))] # i..e rows with both conditions satisfied
## [1] "Lotus Europa"
Suppose we want to compute overall column-wise summary of the data
# Mean
apply(mtcars,2, mean)
## miles_per_gal cylinders displacement horse_power rear_axle_ratio
## 20.090625 6.187500 230.721875 146.687500 3.596563
## weight 1_4miletime straight_engine tansmission gears
## 3.217250 17.848750 0.437500 0.406250 3.687500
## carburetors
## 2.812500
#SD
apply(mtcars,2, sd)
## miles_per_gal cylinders displacement horse_power rear_axle_ratio
## 6.0269481 1.7859216 123.9386938 68.5628685 0.5346787
## weight 1_4miletime straight_engine tansmission gears
## 0.9784574 1.7869432 0.5040161 0.4989909 0.7378041
## carburetors
## 1.6152000
Suppose we want to do the summary of mileage by cylinder and engine type
# Mean mileage by cylinder and engine type
mpg_mean<-aggregate( miles_per_gal ~ cylinders + straight_engine, data=mtcars, FUN=mean )
colnames(mpg_mean)[3]<-"mean_mpg"
print(mpg_mean)
## cylinders straight_engine mean_mpg
## 1 4 0 26.00000
## 2 6 0 20.56667
## 3 8 0 15.10000
## 4 4 1 26.73000
## 5 6 1 19.12500
#SD of mileage by cylinder and engine type
mpg_sd<-aggregate( miles_per_gal ~ cylinders + straight_engine, data=mtcars, FUN=sd )
colnames(mpg_sd)[3]<-"sd_mpg"
print(mpg_sd)
## cylinders straight_engine sd_mpg
## 1 4 0 NA
## 2 6 0 0.7505553
## 3 8 0 2.5600481
## 4 4 1 4.7481107
## 5 6 1 1.6317169
# count of records within each group
mpg_count<-aggregate( miles_per_gal ~ cylinders + straight_engine, data=mtcars, FUN=length )
colnames(mpg_count)[3]<-"count_mpg"
print(mpg_count)
## cylinders straight_engine count_mpg
## 1 4 0 1
## 2 6 0 3
## 3 8 0 14
## 4 4 1 10
## 5 6 1 4
Suppose we want to report sd for only those records with at least 5 records.
mpg_sub<- which(mpg_count[,3]>=5)
mpg_sd_sub<- mpg_sd[mpg_sub,]
print(mpg_sd_sub)
## cylinders straight_engine sd_mpg
## 3 8 0 2.560048
## 4 4 1 4.748111
Now let us create a data which will contain both mean and sd, but sd reported only when count is at least 5. We will do this by merging.
merged_data<-merge(mpg_mean, mpg_sd_sub, by.x= c("cylinders","straight_engine"), by.y= c("cylinders","straight_engine") , all.x="TRUE", all.y="TRUE")
print(merged_data)
## cylinders straight_engine mean_mpg sd_mpg
## 1 4 0 26.00000 NA
## 2 4 1 26.73000 4.748111
## 3 6 0 20.56667 NA
## 4 6 1 19.12500 NA
## 5 8 0 15.10000 2.560048
Here’s a way to sort the data in increasing order of average mileage
ord<-with(merged_data, order(mean_mpg, decreasing=FALSE))
print(merged_data[ord,])
## cylinders straight_engine mean_mpg sd_mpg
## 5 8 0 15.10000 2.560048
## 4 6 1 19.12500 NA
## 3 6 0 20.56667 NA
## 1 4 0 26.00000 NA
## 2 4 1 26.73000 4.748111
setwd('C:\\Users\\IIMA\\Google Drive\\R introduction\\')
#export
write.csv(mtcars, "exported_mtcars.csv")
#import
imported_data<-read.csv("exported_mtcars.csv")
# Note here that the rownames appear as s separate column. If we wanted it to be treated as row names then we would use read.table() as above.
View(imported_data)
Before doing any deep analysis with a dataset, it is good to get an idea for basic summaries such as minimum, maximum and quartiles. In order to demonstrate what happens with categorical variables, we also create one column for engine type which is non-numeric or categorical before doing the summary.
data(mtcars)
colnames(mtcars)<- c("miles_per_gal", "cylinders", "displacement", "horse_power", "rear_axle_ratio", "weight", "1_4miletime", "straight_engine", "tansmission", "gears", "carburetors")
mtcars$engtype<-"straight"
mtcars$engtype[which(mtcars$straight_engine==0)]<- "V shaped"
print(summary(mtcars))
## miles_per_gal cylinders displacement horse_power
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## rear_axle_ratio weight 1_4miletime straight_engine
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## tansmission gears carburetors engtype
## Min. :0.0000 Min. :3.000 Min. :1.000 Length:32
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 Class :character
## Median :0.0000 Median :4.000 Median :2.000 Mode :character
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
The way to look t categorical variables is to summarize the frequencies by each category.
print(table(mtcars$engtype))
##
## straight V shaped
## 14 18
When we are looking at data such as this, we would be curious about correlations between variables. We can compute it as below.
data(mtcars)
round(cor(mtcars),2)
## mpg cyl disp hp drat wt qsec vs am gear carb
## mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
## cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
## disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
## hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
## drat 0.68 -0.70 -0.71 -0.45 1.00 -0.71 0.09 0.44 0.71 0.70 -0.09
## wt -0.87 0.78 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43
## qsec 0.42 -0.59 -0.43 -0.71 0.09 -0.17 1.00 0.74 -0.23 -0.21 -0.66
## vs 0.66 -0.81 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57
## am 0.60 -0.52 -0.59 -0.24 0.71 -0.69 -0.23 0.17 1.00 0.79 0.06
## gear 0.48 -0.49 -0.56 -0.13 0.70 -0.58 -0.21 0.21 0.79 1.00 0.27
## carb -0.55 0.53 0.39 0.75 -0.09 0.43 -0.66 -0.57 0.06 0.27 1.00
A better visualization of correlations can be obtained by using the corplot pakage
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.3
## corrplot 0.84 loaded
C<-cor(mtcars)
corrplot(C, method="circle")
corrplot(C, method="number")