Classification of Dog Breeds with Neural Networks

Overview

We’ll use the Stanford Dogs dataset to try building a neural network that classifies dogs by breed. For this purpose, we’ll use Keras with TensorFlow as a backend, all running on Google Colab. After training our network (and hopefully achieving some accuracy), we’ll apply our predictor to some of the dogs in my life. With 120 different breeds represented in the dataset, we should temper our expectations of how accurate our system will be. In fact, the highest baseline accuracy is only 22%.

Set Up

Fortunately, since we’re using Google Colab, we won’t need to worry about our development environment. We’ll even access the dataset by mounting the folder containing it from Google Drive; I’ve used rclone to sync the dataset into my drive. We’ll import the usual suspects:

import tensorflow as tf
from tensorflow import keras  # the snippets below refer to keras directly
import numpy as np
import matplotlib.pyplot as plt
from google.colab import drive

To actually access the dataset, we’ll need the following line:

drive.mount('/content/drive')

The images we need are all in the “images” folder, split into subdirectories with names like “n02085620-Chihuahua”. For our purposes, it will be sufficient to use some of Keras’s built-in functionality. We’ll get a training set:

train_set = keras.utils.image_dataset_from_directory(
    directory = '/content/drive/My Drive/Data-ML/images/',
    labels = 'inferred',
    label_mode = 'int', # integer labels, paired with sparse categorical cross-entropy loss
    batch_size = 32,
    subset='training',
    seed = 4,
    validation_split=0.3,
    image_size = (128,128)
)
train_set = train_set.prefetch(buffer_size=tf.data.AUTOTUNE)  # prefetch returns a new dataset, so reassign

and the dataset we’ll use to evaluate our results:

eval_set = keras.utils.image_dataset_from_directory(
    directory = '/content/drive/My Drive/Data-ML/images/',
    labels = 'inferred',
    label_mode = 'int', # integer labels, paired with sparse categorical cross-entropy loss
    batch_size = 32,
    subset='validation',
    seed = 4,
    validation_split=0.3,
    image_size = (128,128)
)

Note that we’ll be resizing all images to 128×128. I’ll spare some details here, but I will quickly note that these functions return a tf.data.Dataset. These utility functions hide a lot of the messy work, which is handy of course, but also a source of potential pitfalls. Note in particular that the subdirectory names become the category labels here.
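As a quick sanity check, the dataset returned by image_dataset_from_directory exposes the inferred labels through its class_names attribute; this attribute lives on the object returned by the utility function, so read it before chaining transformations such as prefetch. Using the eval_set from above:

class_names = eval_set.class_names
print(len(class_names))   # should be 120, one entry per breed
print(class_names[:3])    # e.g. ['n02085620-Chihuahua', ...]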

The Approach

Since we’re working with images, we expect that convolutional layers will be useful. We’ll try a “standard” approach:

input_layer = keras.Input(shape=(128, 128, 3))
x = keras.layers.Rescaling(1/255)(input_layer)

x = keras.layers.Conv2D(filters=128, kernel_size=(5, 5), strides=3, padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.LeakyReLU()(x)

x = keras.layers.Conv2D(filters=64, kernel_size=(3, 3), strides=2, padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.LeakyReLU()(x)

x = keras.layers.Conv2D(filters=32, kernel_size=(3, 3), strides=1, padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.LeakyReLU()(x)

x = keras.layers.Conv2D(filters=16, kernel_size=(3, 3), strides=1, padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.LeakyReLU()(x)
x = keras.layers.Flatten()(x)

x = keras.layers.Dense(units=320, activation='relu')(x)
out = keras.layers.Dense(units=120, activation='softmax')(x)

model = keras.Model(inputs=input_layer, outputs=out)

opt = keras.optimizers.Adam(learning_rate=1e-3)
model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])

A few gentle reminders about this snippet. Note that we’re using a softmax activation on the last layer, precisely because we’re going to interpret the output of our model as a probability distribution over breeds.

It turns out that training this model takes long enough that Google Colab tends to time out while executing, so I trained the model locally. After training for 20 epochs, the model achieved an accuracy of approximately 96% on the training set. Unfortunately, the accuracy on the evaluation set was a piddling 4.4%: our model is significantly over-fitted to the data. I had hoped that the batch normalization layers would help prevent overfitting, but they clearly did not. I’ll introduce “image augmentation” prior to the bulk of our network:

x = keras.layers.RandomFlip('horizontal')(x)
x = keras.layers.RandomRotation(0.1)(x)

I note that this mimics the approach used in the relevant TensorFlow tutorial. It effectively gives us a slightly more diverse dataset, which should mitigate some of the overfitting. Additionally, I’ll employ some dropout layers with the same goal; a sketch of where these pieces fit into the network follows below.
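To make the placement concrete, here is one way the revised front and back of the network could look, with the augmentation layers sitting between the rescaling layer and the first convolution, and dropout applied around the dense layers. The specific dropout rates are my own guesses rather than anything dictated by the approach:

input_layer = keras.Input(shape=(128, 128, 3))
x = keras.layers.Rescaling(1/255)(input_layer)

# Augmentation layers are only active during training; at inference they pass images through unchanged.
x = keras.layers.RandomFlip('horizontal')(x)
x = keras.layers.RandomRotation(0.1)(x)

x = keras.layers.Conv2D(filters=128, kernel_size=(5, 5), strides=3, padding='same')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.LeakyReLU()(x)

# ... the remaining convolutional blocks stay exactly as before ...

x = keras.layers.Flatten()(x)
x = keras.layers.Dropout(0.5)(x)   # dropout rate chosen for illustration only
x = keras.layers.Dense(units=320, activation='relu')(x)
x = keras.layers.Dropout(0.3)(x)
out = keras.layers.Dense(units=120, activation='softmax')(x)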

With these adjustments, we can achieve 10.4% accuracy on the evaluation set after about 40 epochs of training (the training and evaluation calls themselves are sketched below). While I’m not especially pleased with this result, I’m satisfied for the time being. I plan to revisit this topic and dataset in the future; my next project will be to generate dogs using variational autoencoders.
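For completeness, the training and evaluation runs use nothing beyond the standard Keras calls; roughly, they look like this:

history = model.fit(train_set, validation_data=eval_set, epochs=40)

# Report accuracy on the held-out split.
loss, acc = model.evaluate(eval_set)
print(f'evaluation accuracy: {acc:.3f}')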

Further Ideas

Throughout this article, we’ve been using log loss. Perhaps we could obtain “better” performance by using a loss function that recognizes that the similarity between breeds varies from pair to pair. For instance, our dataset includes three types of poodles; we might prefer that our system mislabel one poodle as another rather than call a poodle a retriever. We could create a loss function to reflect this, as sketched below.
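As a sketch of what that might look like, one simple option is a cost-sensitive loss built from a breed-to-breed cost matrix: the model’s predicted probabilities are weighted by how costly each possible mistake would be. The matrix below is just a uniform placeholder; in practice it would need to be filled in with something encoding which breeds we consider similar (e.g. small costs between the three poodle classes):

import tensorflow as tf

NUM_CLASSES = 120

# Placeholder cost matrix: cost[i, j] is the penalty for predicting breed j
# when the true breed is i. Here it is 0 on the diagonal and 1 elsewhere;
# a real version would assign smaller costs to pairs of similar breeds.
cost_matrix = tf.ones((NUM_CLASSES, NUM_CLASSES)) - tf.eye(NUM_CLASSES)

def cost_sensitive_loss(y_true, y_pred):
    # y_true: integer labels, shape (batch,); y_pred: softmax output, shape (batch, 120).
    y_true = tf.cast(tf.reshape(y_true, [-1]), tf.int32)
    costs = tf.gather(cost_matrix, y_true)   # per-example row of mistake costs
    return tf.reduce_mean(tf.reduce_sum(costs * y_pred, axis=-1))

Passing loss=cost_sensitive_loss to model.compile would then stand in for the sparse categorical cross-entropy used above.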

We might also achieve better results by first segmenting the images when possible. Some differences between breeds manifest mostly in the dogs’ faces, so it could be useful to first split these features off ‘manually’ and use dedicated models to analyze different parts of the animal.

One approach I would like to explore in the future is the “MAVEN” architecture introduced in the paper “Multi-Adversarial Variational Autoencoder Nets for Simultaneous Image Generation and Classification” by Imran and Terzopoulos.

Lastly, the easiest way to achieve better results would probably be to tune some of the hyperparameters of our current model. Why haven’t I attempted this? It simply doesn’t interest me here, and it might present something of a burden given my available computational resources.
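That said, for readers who do want to try it, a library like KerasTuner (not used anywhere in this article; this is purely illustrative) keeps the search loop short. A minimal sketch, with a deliberately trimmed-down model inside the build function just to keep the example compact:

import keras_tuner as kt
from tensorflow import keras

def build_model(hp):
    # Search over the dense-layer width and the learning rate; the convolutional
    # part here is a small stand-in for the full architecture defined earlier.
    inputs = keras.Input(shape=(128, 128, 3))
    x = keras.layers.Rescaling(1/255)(inputs)
    x = keras.layers.Conv2D(64, (3, 3), strides=2, padding='same', activation='relu')(x)
    x = keras.layers.Flatten()(x)
    x = keras.layers.Dense(hp.Int('dense_units', 128, 512, step=64), activation='relu')(x)
    outputs = keras.layers.Dense(120, activation='softmax')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(hp.Choice('lr', [1e-2, 1e-3, 1e-4])),
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=10)
tuner.search(train_set, validation_data=eval_set, epochs=5)
best_model = tuner.get_best_models(1)[0]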