Iris Classification

An exploration of simple classification using caret with Edgar Anderson's famous Iris data set.

The goal of the project is to utilise caret alongside a variety of machine learning algorithms to correctly classify irises into their species group.

The Data

Anderson’s Iris data set records four measurements in centimeters for each flower, which serve as the predictors, together with the species:

Sepal Length
Sepal Width
Petal Length
Petal Width
Species
- iris setosa
- iris versicolor
- iris virginica

summary(iris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
##

The iris data set was chosen partly because it requires no pre-processing or transformation, which makes it ideal for a quick look at caret and classification, and partly because it is a classic classification data set.

Pairwise Correlation Matrix

First we will create a Pairwise Correlation Matrix to explore the variables and look for any relationships.

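library(ggplot2)
library(GGally)

# Current GGally versions take aesthetics via aes() rather than colour='Species'
ggpairs(iris,
        mapping = aes(colour = Species, alpha = 0.4),
        title = "Anderson's Iris Data -- 3 species")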

These pairwise scatterplots show a strong correlation between Petal.Length and Petal.Width. Sepal.Length is also correlated with both Petal.Length and Petal.Width.

We can also see a degree of separation between iris setosa and the other irises, but there is no clear separation between iris versicolor and iris virginica.
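The correlations read off the scatterplots can also be quantified. The following is a minimal sketch using base R's cor on the built-in iris data; the name iris_cor is just an illustrative choice and is not part of the original analysis.

```r
# Pearson correlations between the four numeric predictors
iris_cor <- cor(iris[, 1:4])
print(round(iris_cor, 2))

# The strongest pair: petal length against petal width
iris_cor["Petal.Length", "Petal.Width"]
```

On the full data set Petal.Length and Petal.Width correlate at roughly 0.96, while Sepal.Length and Sepal.Width are only weakly (and negatively) correlated.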

Comparing Sepal.Length and Sepal.Width

ggplot(iris, aes(x=Sepal.Length, y=Sepal.Width, colour=Species)) + 
    geom_point() +
    ggtitle("Sepal.Length vs Sepal.Width")

Comparing Sepal.Width and Sepal.Length whilst colouring according to species shows good separation between iris setosa and the other iris species, but again only a minor difference between iris versicolor and iris virginica.

Comparing Petal.Length and Petal.Width

ggplot(iris, aes(x=Petal.Width, y=Petal.Length, colour=Species)) + 
    geom_point() +
    ggtitle("Petal.Length vs Petal.Width")

Plotting Petal.Width against Petal.Length and colouring by species shows both strong correlation and pretty good separation between the 3 species, except for a few uncertain points where the green and blue groups (iris versicolor and iris virginica) meet.

Models

We’ll build some models and use them to classify species.

Test/Training Data

First we need to split our data set into test and training sets.

The function createDataPartition can be used to create a stratified random sample of the data.

inTrainingSet <- createDataPartition(iris$Species, p = .50, list = FALSE)
irisTrain <- iris[ inTrainingSet,]
irisTest  <- iris[ -inTrainingSet,]
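For a factor outcome such as Species, createDataPartition samples within each class, so both halves keep the same species balance. The base R sketch below imitates that idea without caret. It is not caret's actual implementation, and the seed value and the names rows_by_species, train_idx, train_sketch and test_sketch are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Row indices grouped by species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Draw half of each species' rows for training
train_idx <- unlist(lapply(rows_by_species, function(idx) {
  sample(idx, size = ceiling(0.5 * length(idx)))
}))

train_sketch <- iris[train_idx, ]
test_sketch  <- iris[-train_idx, ]

# Each species should appear 25 times in each half
table(train_sketch$Species)
table(test_sketch$Species)
```

Calling set.seed before createDataPartition in the code above would likewise make the split, and therefore the confusion matrices that follow, reproducible from run to run.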

Random Forest

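We can now use caret with a standard random forest method to create a model for classifying iris species.

# Fit a random forest through caret, with the same call pattern as the models below
m.rf <- train(Species ~ ., data=irisTrain, method="rf")

## Loading required package: randomForest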

Let’s check our model by using it on the unseen test data to predict the species based on the other variables.

pred.rf <- predict(m.rf, irisTest)

Now we can view a confusion matrix to see how well our random forest classified the data. Rows are the predicted species and columns are the true species.

cm.rf <- confusionMatrix(pred.rf, irisTest$Species)
kable(cm.rf$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	1
virginica	0	1	24

This is pretty good: all 25 iris setosa are classified correctly, 1 of the 25 iris versicolor is misclassified as iris virginica, and 1 of the 25 iris virginica is misclassified as iris versicolor.
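To make the reading of the table concrete, the sketch below re-enters the random forest confusion matrix by hand (rows as predictions, columns as true species, which is how caret lays out its table) and derives overall accuracy and per-class recall. The names cm_rf, accuracy and recall are illustrative and separate from the cm.rf object above.

```r
species <- c("setosa", "versicolor", "virginica")

# Confusion matrix copied from the random forest output:
# rows = predicted species, columns = true species
cm_rf <- matrix(c(25,  0,  0,
                   0, 24,  1,
                   0,  1, 24),
                nrow = 3, byrow = TRUE,
                dimnames = list(Prediction = species, Reference = species))

# Overall accuracy: correct predictions (the diagonal) over all test rows
accuracy <- sum(diag(cm_rf)) / sum(cm_rf)

# Per-class recall (sensitivity): correct predictions over true class size
recall <- diag(cm_rf) / colSums(cm_rf)

accuracy
recall
```

The same quantities are available directly from the caret object, in cm.rf$overall and cm.rf$byClass.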

C5.0

Now we will try classifying with the C5.0 algorithm.

m.C50 <- train(Species ~ ., data=irisTrain, method="C5.0" )

## Loading required package: C50

pred.C50 <- predict(m.C50, irisTest)
cm.C50 <- confusionMatrix(pred.C50, irisTest$Species)
kable(cm.C50$table)

	setosa	versicolor	virginica
setosa	23	0	0
versicolor	2	24	1
virginica	0	1	24

Naive Bayes

m.nb <- train(Species~., data=irisTrain, method="nb")

pred.nb <- predict(m.nb, irisTest)
cm.nb <- confusionMatrix(pred.nb, irisTest$Species)
kable(cm.nb$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	2
virginica	0	1	23

Naive Bayes with k-fold Cross Validation

Here we add k-fold (in this case 10-fold) cross validation when training our naive Bayes model.

train_control <- trainControl(method="cv", number=10)

m.nbkf <- train(Species~., data=irisTrain, trControl=train_control, method="nb")

pred.nbkf <- predict(m.nbkf, irisTest)
cm.nbkf <- confusionMatrix(pred.nbkf, irisTest$Species)
kable(cm.nbkf$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	2
virginica	0	1	23
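As an illustration of what 10-fold cross validation does with the training data, here is a base R sketch that partitions 75 rows into 10 folds so that every row is held out exactly once. caret builds its own folds internally, so this simple random assignment, the seed and the names fold_id, held_out and fit_rows are all assumptions made for the example.

```r
set.seed(1)  # arbitrary seed for a repeatable sketch

n <- 75   # size of the training set used in this post
k <- 10   # number of folds

# Assign each row to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: the held-out rows, and the rows used for fitting
held_out <- lapply(seq_len(k), function(f) which(fold_id == f))
fit_rows <- lapply(seq_len(k), function(f) which(fold_id != f))

# Fold sizes: 75 rows over 10 folds gives folds of 7 or 8 rows
sapply(held_out, length)
```

In caret the resampling scheme is used to estimate performance and to pick tuning parameters; the final model is then refit on the whole training set. If both schemes pick the same tuning parameters, that would explain why this model gives the same confusion matrix as the naive Bayes model trained with the default settings.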

k-nearest neighbours

m.knn <- train(Species~., data=irisTrain, method="knn")

pred.knn <- predict(m.knn, irisTest)
cm.knn <- confusionMatrix(pred.knn, irisTest$Species)
kable(cm.knn$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	25	2
virginica	0	0	23

Neural Network

m.nnet <- train(Species~.,data=irisTrain,method="nnet", trace=FALSE)

## Loading required package: nnet

pred.nnet <- predict(m.nnet, irisTest)
cm.nnet <- confusionMatrix(pred.nnet, irisTest$Species)
kable(cm.nnet$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	0
virginica	0	1	25

Neural Network with Bootstrapping

tc <- trainControl(method="boot",number=25)

m.nnet.bs <- train(Species~.,data=irisTrain,method="nnet",trControl=tc, trace=FALSE)

pred.nnet.bs <- predict(m.nnet.bs, irisTest)
cm.nnet.bs <- confusionMatrix(pred.nnet.bs, irisTest$Species)
kable(cm.nnet.bs$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	0
virginica	0	1	25
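trainControl(method="boot", number=25) asks caret to evaluate candidate models on 25 bootstrap resamples. The sketch below shows what a single bootstrap resample looks like in base R: rows are drawn with replacement, and the rows that were never drawn form the hold-out set. The seed and the names boot_rows and oob_rows are illustrative assumptions.

```r
set.seed(7)  # arbitrary seed for a repeatable sketch

n <- 75  # size of the training set used in this post

# One bootstrap resample: draw n row indices with replacement
boot_rows <- sample(seq_len(n), size = n, replace = TRUE)

# Rows that were never drawn form the out-of-bag (hold-out) set
oob_rows <- setdiff(seq_len(n), boot_rows)

length(unique(boot_rows))  # distinct rows used for fitting
length(oob_rows)           # rows left over for evaluation
```

On average about 63% of the rows appear in a bootstrap sample. Bootstrapping with 25 resamples is also train's default resampling scheme, so this configuration matches the plain neural network from the previous section.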

Random Forest with Leave One Out Cross Validation

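# Leave-one-out CV holds out each training row in turn, so no number is needed
tc <- trainControl(method="LOOCV")

# trace is an nnet option and has no effect on a random forest, so it is dropped here
m.rf.bs <- train(Species~.,data=irisTrain,method="rf",trControl=tc)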

pred.rf.bs <- predict(m.rf.bs, irisTest)
cm.rf.bs <- confusionMatrix(pred.rf.bs, irisTest$Species)
kable(cm.rf.bs$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	1
virginica	0	1	24

AdaBoost.M1

m.glmnet <- train(Species~., 
                  data=irisTrain, 
                  method="AdaBoost.M1")

## Loading required package: adabag

pred.glmnet <- predict(m.glmnet, irisTest)
cm.glmnet <- confusionMatrix(pred.glmnet, irisTest$Species)
kable(cm.glmnet$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	24	1
virginica	0	1	24

Using Pre-Processing (including PCA) with Neural Net

m.nnet.pp <- train(Species~.,
                data=irisTrain,
                method="nnet",
                preProcess=c("BoxCox", "center", "scale", "pca"),
                trace=FALSE)

pred.nnet.pp <- predict(m.nnet.pp, irisTest)
cm.nnet.pp <- confusionMatrix(pred.nnet.pp, irisTest$Species)
kable(cm.nnet.pp$table)

	setosa	versicolor	virginica
setosa	25	0	0
versicolor	0	23	4
virginica	0	2	21
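To get a feel for what the "center", "scale" and "pca" steps do to the predictors, here is a sketch using base R's prcomp. It runs on the full iris data rather than irisTrain and leaves out the Box-Cox step, so the numbers are only indicative. The 95% cut-off mirrors caret's default variance threshold for PCA pre-processing, and the names pca, var_explained, cum_explained and n_keep are illustrative.

```r
# Centre and scale the four predictors, then rotate onto principal components
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_explained <- cumsum(var_explained)
print(round(cum_explained, 3))

# Number of components needed to reach 95% of the variance
n_keep <- which(cum_explained >= 0.95)[1]
n_keep
```

On the full data the first two components already pass the threshold, so a network trained this way would see two inputs instead of four. Discarding the remaining components is one plausible reason for the weaker confusion matrix above, though this post does not test that directly.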

Results

Having used a variety of different models to classify the iris data set, we can now attempt to compare their performance. Comparing confusion matrices isn’t entirely straightforward, as models perform differently in different areas: some favour specificity, some sensitivity, and some another metric. I have chosen to use the Matthews Correlation Coefficient (MCC), which gives a single number in the range -1 to +1 and measures the correlation between the observed and predicted classifications. With three species, the multiclass generalisation of MCC applies.
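The code that fills the results table is not shown in this post. As a sketch of how the scores could be obtained, the multiclass form of MCC can be computed directly from a confusion table. The function name mcc_multiclass and the hand-entered matrix are illustrative assumptions; applied to the random forest table printed earlier, it reproduces the 0.96 listed for that model.

```r
# Multiclass Matthews correlation coefficient from a square confusion table
# (rows = predictions, columns = true classes)
mcc_multiclass <- function(cm) {
  cm <- as.matrix(cm)
  total     <- sum(cm)        # number of test samples
  correct   <- sum(diag(cm))  # correctly classified samples
  predicted <- rowSums(cm)    # how often each class was predicted
  truth     <- colSums(cm)    # how often each class truly occurred
  (correct * total - sum(predicted * truth)) /
    sqrt((total^2 - sum(predicted^2)) * (total^2 - sum(truth^2)))
}

# Random forest confusion matrix from earlier in the post
cm_rf <- matrix(c(25,  0,  0,
                   0, 24,  1,
                   0,  1, 24),
                nrow = 3, byrow = TRUE)

mcc_multiclass(cm_rf)  # 0.96
```

With the caret objects from the previous sections, a call such as mcc_multiclass(cm.rf$table) should give the same value, since the table element holds the counts shown above.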

results <- results %>% arrange(desc(MCC))
kable(results)

Name	MCC
Neural Net	0.9802614
Neural Net w Boot	0.9802614
KNN	0.9610256
Random Forest	0.9600000
Random Forest w LOOCV	0.9600000
AdaBoost.M1	0.9600000
Naive Bayes	0.9402508
Naive Bayes k-fold	0.9402508
C5.0	0.9209829
Neural Net PP	0.8809402

ggplot(results, aes(x=Name, y=MCC, fill=Name)) + 
    geom_bar(stat="identity") +
    ggtitle("Comparison of Classification Methods") + 
    theme(axis.text.x=element_blank(),
          axis.title.x=element_blank())

The table and graph show that all the models performed similarly, with MCC between 0.88 and 0.98. The neural network performed best (MCC = 0.98), followed by KNN (MCC = 0.96).

Another point of interest is that the worst model was also a neural network, this time trained on data pre-processed with BoxCox, centering, scaling and PCA, which suggests that inappropriate pre-processing can be worse than none.

The github repository for this project can be viewed here.
