Here we will implement discriminant analysis with multiple groups. We will look at the famous iris data of Fisher. The problem is to classify folowers into three different species, namely Setosa, Versicolor and Virginica, based on four characetristics: sepal length, sepal width, petal length and petal width.

library(kernlab)
library(MASS)

# iris data
data(iris)
X<-iris

Scatter plots with original variables

Let us visualize the data. There are four columns but let us look at the scatter plot w.r.t two columns.

speciesnames<-as.matrix(unique(X$Species))
colors <- ifelse(X$Species == speciesnames[1], "red", ifelse(X$Species == speciesnames[2],"blue", "dark green"))
plot(X[,1],X[,2], xlab=colnames(X)[1],ylab=colnames(X)[2], col= colors, main="Three Species of Iris"  )
legend(7,4.2, legend=c("setosa", "versicolor", "virginica"),   col=c("red","blue", "green"), pch=c(1,1,1))

Scatter plots with Fisher's discriminant functions

Note that since there are 3 specicies (3 groups), we will need two discriminant functions.

fit <- lda(Species ~ . , data=X, na.action="na.omit",CV=FALSE)

ldavar1<- as.matrix(X[,1:4])%*% (fit$scaling[,1])
ldavar2<- as.matrix(X[,1:4])%*% (fit$scaling[,2])

plot(ldavar1,ldavar2, xlab="First Discriminant Function",ylab="Second Discriminant Function", col= colors , main="axes as discriminant functions")

Confusion matrix

pred<-predict(fit, X)

pred[[1]][1:10]  # Classification based on maximum posterior probability
##  [1] setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
round(pred[[2]][1:10,],2)  # Posterior probability of belonging to group
##    setosa versicolor virginica
## 1       1          0         0
## 2       1          0         0
## 3       1          0         0
## 4       1          0         0
## 5       1          0         0
## 6       1          0         0
## 7       1          0         0
## 8       1          0         0
## 9       1          0         0
## 10      1          0         0
predclass<- as.matrix(pred[[1]])

# confusion Matrix on development sample
ct <- table(as.matrix(X$Species), predclass)
diag(prop.table(ct, 1))
##     setosa versicolor  virginica 
##       1.00       0.96       0.98
# total percent correct
sum(diag(prop.table(ct)))
## [1] 0.98
prop.table(ct)
##             predclass
##                   setosa  versicolor   virginica
##   setosa     0.333333333 0.000000000 0.000000000
##   versicolor 0.000000000 0.320000000 0.013333333
##   virginica  0.000000000 0.006666667 0.326666667