The problem and data used here are adapted from the book "Analyzing Multivariate Data" by Lattin, Carroll and Green. The data are on a sample of single-family homes listed for sale during a month in 1986 in three real estate locations in the San Francisco Bay Area: Los Altos, Menlo Park and Palo Alto. The data contain three attributes: asking price, number of bedrooms and square footage (in thousands). The question of interest is whether the three communities differ with respect to these attributes.

Here, we use discriminant analysis to address this question.

library(kernlab)
library(MASS)

# Read the data and standardize the three attributes so they are on a
# comparable scale before the discriminant analysis
X1 <- read.csv('realestate.csv')
X <- data.frame(community = X1$community, scale(X1[2:4]))
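
Since scale() centers each column and divides it by its standard deviation, a quick sanity check (a minimal sketch, assuming the CSV loaded as above) is that the standardized attributes now have mean zero and unit standard deviation:

round(colMeans(X[, 2:4]), 10)  # means are (numerically) zero after centering
apply(X[, 2:4], 2, sd)         # standard deviations are all one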

# Assign a distinct plotting color to each community
namess <- unique(X$community)
colors <- ifelse(X$community == namess[1], "red",
                 ifelse(X$community == namess[2], "blue", "dark green"))

# Pairwise scatter plots of the three standardized attributes,
# colored by community
par(mfrow = c(2, 2))
for (pair in list(c(2, 3), c(3, 4), c(2, 4))) {
  plot(X[, pair[1]], X[, pair[2]],
       xlab = colnames(X)[pair[1]], ylab = colnames(X)[pair[2]],
       col = colors, main = "Three Locations")
}

LDA with flat prior

A flat prior treats each community as equally likely a priori (1/3 each), regardless of the sample proportions.

# Fit LDA with equal prior probabilities for the three communities
fit <- lda(community ~ ., data = X, prior = c(1/3, 1/3, 1/3),
           na.action = "na.omit", CV = FALSE)

# Resubstitution predictions: classify the training data itself
pred <- predict(fit, X)
predclass <- pred$class


# Confusion matrix: actual community (rows) vs. predicted class (columns)
ct <- table(as.matrix(X$community), predclass)

# Per-community correct classification rates
diag(prop.table(ct, 1))
##        LA        MP        PA 
## 0.7777778 0.6153846 0.6153846
ct
##     predclass
##      LA MP PA
##   LA  7  1  1
##   MP  1  8  4
##   PA  1  4  8
# Overall proportion correctly classified
sum(diag(prop.table(ct)))
## [1] 0.6571429
# Confusion matrix as proportions of all observations
prop.table(ct)
##     predclass
##              LA         MP         PA
##   LA 0.20000000 0.02857143 0.02857143
##   MP 0.02857143 0.22857143 0.11428571
##   PA 0.02857143 0.11428571 0.22857143
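
For comparison, here is a brief sketch (using the same data frame X and the resubstitution predictions predclass from above) that refits the model with lda's default prior, which uses the observed class proportions rather than a flat 1/3 each, and checks how often the two classification rules agree:

# Default prior = observed class proportions (about 0.26/0.37/0.37 here)
fit_prop <- lda(community ~ ., data = X)
# Fraction of observations on which the two priors give the same class
mean(predict(fit_prop, X)$class == predclass)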

Cross-validation (leave-one-out validation)

When the data set is small, we may not have enough observations to split into training and validation samples. In that case, leave-one-out validation can be used: each observation is left out in turn, the LDA classification rule is estimated from the remaining observations (i.e. without using that observation), and that rule is used to predict the left-out observation. The predictions so obtained are then compared with the actual classes.
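
To make the mechanics concrete, here is a minimal sketch of the leave-one-out loop. MASS's lda does this for us via CV = TRUE, as used below; results may differ slightly because each refit here uses the default prior of its own fold.

# Manual leave-one-out: refit without observation i, then predict i
loo_pred <- character(nrow(X))
for (i in seq_len(nrow(X))) {
  fit_i <- lda(community ~ ., data = X[-i, ])
  loo_pred[i] <- as.character(predict(fit_i, X[i, ])$class)
}
mean(loo_pred == as.character(X$community))  # leave-one-out accuracy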

# Leave-one-out cross-validated LDA: with CV = TRUE, fit$class holds the
# held-out class prediction for each observation
fit <- lda(community ~ ., data = X, na.action = "na.omit", CV = TRUE)

# Confusion matrix: actual community (rows) vs. cross-validated prediction
ct <- table(as.matrix(X$community), as.matrix(fit$class))
# Per-community correct classification rates under cross-validation
diag(prop.table(ct, 1))
##        LA        MP        PA 
## 0.4444444 0.5384615 0.5384615
# Total proportion correctly classified
sum(diag(prop.table(ct)))
## [1] 0.5142857
# Cross-validated confusion matrix as proportions of all observations
prop.table(ct)
##     
##              LA         MP         PA
##   LA 0.11428571 0.05714286 0.08571429
##   MP 0.02857143 0.20000000 0.14285714
##   PA 0.02857143 0.14285714 0.20000000

Is the model useful?

One way to answer this is to check what would have happened without the model: if we randomly assigned each observation to one of the three locations in proportion to the observed class frequencies, how would the classification accuracy compare with what we obtained from the LDA? The expected accuracy of such random assignment, known as the proportional chance criterion, is the sum of the squared class proportions.

# Observed class proportions
v <- prop.table(table(as.matrix(X$community)))
v
## 
##        LA        MP        PA 
## 0.2571429 0.3714286 0.3714286
# Proportional chance criterion: sum of squared class proportions
t(v) %*% v
##           [,1]
## [1,] 0.3420408
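
As a check on this analytic value, a small simulation (a sketch; the seed is arbitrary) randomly permutes the observed labels many times and measures agreement with the actual ones:

# Expected agreement of a random relabeling equals the sum of squared
# class proportions computed above (about 0.342)
set.seed(42)  # arbitrary seed, for reproducibility
actual <- as.character(X$community)
sim_acc <- replicate(10000, mean(sample(actual) == actual))
mean(sim_acc)

Both the resubstitution accuracy (0.657) and the leave-one-out accuracy (0.514) exceed this chance level of about 0.342, so the LDA model does carry information about community membership, though the cross-validated gain over chance is modest.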