250 Bird Species Image Classification using ResNets, Regularization and Data Augmentation in PyTorch

MOHSEN DEHHAGHI
11 min read · Jan 7, 2021
World Bird Species Diversity

In this Deep Neural Network project, I trained a ResNet9 neural network architecture to classify a diverse set of 250 bird species from the Kaggle 250 Bird Species dataset with over 96% accuracy. The dataset consists of 35,215 training images, 1,250 test images (5 per species), and 1,250 validation images (5 per species). All images are 224 x 224 x 3 color images in JPG format. The dataset also includes a “consolidated” image set that combines the training, test, and validation images into a single set. Here are some sample images from the dataset:

Image Samples from 250 Bird Species Dataset

Modeling Definition (Network Architecture)

Model with Residual Blocks and Batch Normalization

I leveraged one of the well-known improvements to the Convolutional Neural Network (CNN) model: the addition of a Residual block, which adds the original input back to the output feature map obtained by passing the input through one or more convolutional layers, as shown in the following figure:

Here is the implementation of a very simple Residual block:
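The original post shows this code as a screenshot; the following is a minimal PyTorch sketch of such a block (the channel count of 3 is illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added back to the original input."""
    def __init__(self, channels=3):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = F.relu(self.conv1(x))
        out = self.conv2(out)
        return F.relu(out + x)  # skip connection: add the input back to the output
```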

Addition of the Residual block produces a significant improvement in the performance of the model. Also, after each convolutional layer, I add a batch normalization layer, which normalizes the outputs of the previous layer.

For the purpose of this project, the ResNet9 architecture has been deployed as demonstrated in the following figure with the corresponding implementation:

Next, I developed the ResNet9 model by extending the ImageClassificationBase model and defining a conv_block helper containing the Convolutional Block that serves as the building block of the aforementioned architecture. The ResNet9 architecture has been implemented as follows:

Hidden Layers:

(64, 128, 128, 128, 256, 512, 512, 512, 250)

ResNet9 Neural Network Layers
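The implementation appeared as a screenshot in the original post; the sketch below reconstructs it to match the layer sizes listed above. The ImageClassificationBase class is also reconstructed here following the standard PyTorch pattern, since its definition is not shown in the post:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageClassificationBase(nn.Module):
    """Shared training/validation helpers (reconstructed; not shown in the post)."""
    def training_step(self, batch):
        images, labels = batch
        return F.cross_entropy(self(images), labels)  # training loss

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss = F.cross_entropy(out, labels)
        _, preds = torch.max(out, dim=1)
        acc = torch.tensor(torch.sum(preds == labels).item() / len(preds))
        return {'val_loss': loss.detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        return {'val_loss': torch.stack([x['val_loss'] for x in outputs]).mean().item(),
                'val_acc': torch.stack([x['val_acc'] for x in outputs]).mean().item()}

    def epoch_end(self, epoch, result):
        print("Epoch [{}], train_loss: {:.4f}, val_loss: {:.4f}, val_acc: {:.4f}".format(
            epoch, result['train_loss'], result['val_loss'], result['val_acc']))

def conv_block(in_channels, out_channels, pool=False):
    """Convolution -> Batch Normalization -> ReLU, with optional 2x2 max-pooling."""
    layers = [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
              nn.BatchNorm2d(out_channels),
              nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class ResNet9(ImageClassificationBase):
    def __init__(self, in_channels=3, num_classes=250):
        super().__init__()
        self.conv1 = conv_block(in_channels, 64)
        self.conv2 = conv_block(64, 128, pool=True)
        self.res1 = nn.Sequential(conv_block(128, 128), conv_block(128, 128))
        self.conv3 = conv_block(128, 256, pool=True)
        self.conv4 = conv_block(256, 512, pool=True)
        self.res2 = nn.Sequential(conv_block(512, 512), conv_block(512, 512))
        self.classifier = nn.Sequential(nn.MaxPool2d(4),
                                        nn.Flatten(),
                                        nn.Linear(512, num_classes))

    def forward(self, xb):
        out = self.conv1(xb)
        out = self.conv2(out)
        out = self.res1(out) + out   # residual connection
        out = self.conv3(out)
        out = self.conv4(out)
        out = self.res2(out) + out   # residual connection
        return self.classifier(out)
```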

Model Training

Before starting to train the model, I apply a number of small but important improvements to my fit function:

Learning Rate Scheduling: Instead of using a fixed learning rate, I deploy a Learning Rate Scheduler, which will change the learning rate after every batch of training. There are many strategies for varying the learning rate during training, and the one I use is called the “One Cycle Learning Rate Policy”, which involves starting with a low learning rate, gradually increasing it batch-by-batch to a high learning rate for about 30% of epochs, then gradually decreasing it to a very low value for the remaining epochs.

Reference: https://sgugger.github.io/the-1cycle-policy.html

Weight Decay: I also use weight decay, yet another regularization technique, which prevents the weights from becoming too large by adding an additional term to the loss function.

Reference: https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab

Gradient Clipping: Aside from the layer weights and outputs, it is also helpful to limit the values of gradients to a small range to prevent undesirable changes in parameters due to large gradient values. This simple yet effective technique is called Gradient Clipping.

Reference: https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48

To that aim, I define a fit_one_cycle function to incorporate the above-mentioned changes. I also record the learning rate used for each batch.

Model Training utilizing “One Cycle Learning Rate Policy”
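The fit_one_cycle function appeared as a screenshot; the sketch below reconstructs it using PyTorch's built-in OneCycleLR scheduler, and it records the learning rate used for each batch as described above:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)

def get_lr(optimizer):
    # Current learning rate, as set by the scheduler
    for param_group in optimizer.param_groups:
        return param_group['lr']

def fit_one_cycle(epochs, max_lr, model, train_loader, val_loader,
                  weight_decay=0, grad_clip=None, opt_func=torch.optim.SGD):
    torch.cuda.empty_cache()
    history = []
    # Optimizer with weight decay (penalizes large weights via the loss)
    optimizer = opt_func(model.parameters(), max_lr, weight_decay=weight_decay)
    # One Cycle scheduler: ramps the LR up for ~30% of the steps, then back down
    sched = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr, epochs=epochs, steps_per_epoch=len(train_loader))
    for epoch in range(epochs):
        model.train()
        train_losses, lrs = [], []
        for batch in train_loader:
            loss = model.training_step(batch)
            train_losses.append(loss)
            loss.backward()
            # Gradient clipping: cap gradient values to a small range
            if grad_clip:
                nn.utils.clip_grad_value_(model.parameters(), grad_clip)
            optimizer.step()
            optimizer.zero_grad()
            # Record the learning rate used for this batch, then step the scheduler
            lrs.append(get_lr(optimizer))
            sched.step()
        result = evaluate(model, val_loader)
        result['train_loss'] = torch.stack(train_losses).mean().item()
        result['lrs'] = lrs
        history.append(result)
        model.epoch_end(epoch, result)
    return history
```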

Before I begin training, I evaluate the instantiated model in order to see how it performs on the validation set with the initial set of parameters.

Initial Model Evaluation Prior To Training
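Assuming the helpers above and a validation DataLoader named valid_dl (a name used here for illustration), this initial check is a single call:

```python
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ResNet9(in_channels=3, num_classes=250).to(device)

# Evaluate the randomly initialized model on the validation set
history = [evaluate(model, valid_dl)]
print(history)  # val_acc should sit near random chance: 1/250 = 0.004
```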

The initial accuracy is around 0.4%, which is what one would naturally anticipate from a randomly initialized model (since it has a (1/250)*100 ~ 0.4% chance of getting a label right by guessing randomly out of 250 possible outcomes).

I use the following hyper-parameters (learning rate, no. of epochs, batch_size, etc.) to train my ResNet9 model architecture. Further down the training path, I play with these parameters to see whether it is possible to achieve higher accuracy in a shorter time.

Hyper-parameters Setup
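Concretely, the first run uses values along these lines (taken from the experiment log later in this post):

```python
import torch

epochs = 10
max_lr = 0.01
grad_clip = 0.1
weight_decay = 1e-4
opt_func = torch.optim.Adam  # Adam instead of plain SGD
batch_size = 32
```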

Now everything is in place to train the ResNet9 model. Instead of SGD (Stochastic Gradient Descent), I use the Adam optimizer, which leverages techniques like momentum and adaptive learning rates for faster training. I used the following resource as a reference:

Optimizers Reference: https://ruder.io/optimizing-gradient-descent/index.html

Training the ResNet9 Model
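With a training DataLoader named train_dl (again, an assumed name), the run itself is a single call:

```python
history += fit_one_cycle(epochs, max_lr, model, train_dl, valid_dl,
                         grad_clip=grad_clip,
                         weight_decay=weight_decay,
                         opt_func=opt_func)
```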

With the first trial of hyper-parameters, my ResNet9 model architecture achieved about 90% accuracy within just 10 minutes of training!

Next, I try tweaking variables associated with the data augmentations, network architecture & hyper-parameters to see whether it is possible to achieve better Model Accuracy and Model Loss in less training time, as follows.

Model Accuracy and Loss Plots

I plot the model losses and accuracies to check whether I am starting to hit the limits of how well my ResNet9 architecture can perform on this 250 Bird Species dataset. Building on this, I run a number of further training experiments to check whether there is scope for more improvement.

I also plot the Model Training and Validation Losses to investigate the training trend further down the improvement path.
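These plots come from small matplotlib helpers along these lines (a sketch; the exact plotting code is not shown in the post's screenshots):

```python
import matplotlib.pyplot as plt

def plot_accuracies(history):
    """Validation accuracy recorded at the end of each epoch."""
    plt.plot([x['val_acc'] for x in history], '-x')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title('Accuracy vs. No. of epochs')

def plot_losses(history):
    """Training and validation losses per epoch."""
    plt.plot([x.get('train_loss') for x in history], '-bx')
    plt.plot([x['val_loss'] for x in history], '-rx')
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.legend(['Training', 'Validation'])
    plt.title('Loss vs. No. of epochs')
```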

As evident from the above trend, my ResNet9 model is not yet suffering from overfitting to the training data. I tried removing Batch Normalization, Data Augmentation, and Residual Layers one by one to analyze their individual effect on the overfitting phenomenon.

Finally, I visualize how the One Cycle Learning Rate Scheduler is unfolding over time, batch-by-batch over all the epochs.

As anticipated, the One Cycle Learning Rate Scheduler starts at a low value, and gradually increases for 30% of the iterations to a defined maximum value of 0.01, followed by gradually decreasing to a very small value.
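Since fit_one_cycle stores the per-batch learning rates in each epoch's result, the schedule can be plotted by concatenating them (a sketch):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_lrs(history):
    """Learning rate recorded for every training batch, concatenated across epochs."""
    lrs = np.concatenate([x.get('lrs', []) for x in history])
    plt.plot(lrs)
    plt.xlabel('batch no.')
    plt.ylabel('learning rate')
    plt.title('Learning Rate vs. Batch no.')
```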

Record Experiment Model Performance on Test Dataset

Finally, I evaluate the model on the test dataset of 250 bird species and report its final performance in each experiment for the record.

Model Performance Metrics on Test Dataset
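Assuming the test images are wrapped in a DataLoader named test_dl, this final check is one more call to the evaluate helper:

```python
result = evaluate(model, test_dl)
print(result)  # loss and accuracy on the 1,250 held-out test images
```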

As shown in the results, my ResNet9 model architecture with the aforementioned set of hyper-parameters achieved over 96% accuracy within just 17 minutes over 20 epochs of training!

Improvement Strategy Algorithm

In each Training Phase, I check whether the obtained results are in line with the Accuracy & Loss Requirements.

To that aim, I record my Accuracy, Loss, and other performance metrics in each experiment in the section below, so that I can refer back to them and try a different architecture & hyper-parameters.

I need to try different network architectures (#hidden layers, size of each hidden layer, activation function) and hyperparameters (#epochs, LR) until I reach the desired Model Test Loss & Accuracy.

Experiment#1: Start with the initial size of the Nine Hidden Layers: (64, 128, 128, 128, 256, 512, 512, 512, 250), resulting in the initial Model Accuracy of {'val_acc': 0.916} and the initial Model Loss of {'val_loss': 0.297} under Training Time of train_time_1='10:22'.

Experiment#2: Doubling the batch_size from 32 to 64 to utilize a larger portion of the GPU RAM, while keeping the rest of the hyperparameters unchanged, resulted in a slight improvement in Training Time and Performance: it nudged the Model Accuracy up from {'val_acc': 0.916} to {'val_acc': 0.929} and dampened the Model Loss from {'val_loss': 0.297} down to {'val_loss': 0.239} under a shorter Training Time of train_time_2='8:49'.

Experiment#3: This time, however, increasing the batch_size from 64 to 400 to utilize an even larger portion of the GPU RAM, while keeping the rest of the hyperparameters unchanged, did not improve Performance: it dampened the Model Accuracy from {'val_acc': 0.929} down to {'val_acc': 0.915} and raised the Model Loss slightly from {'val_loss': 0.239} up to {'val_loss': 0.325} under an even shorter Training Time of train_time_3='8:18'.

Experiment#4: To get a smoother profile of Accuracy & Loss vs. No. of epochs, I relaxed the batch_size from 400 back to 256 (2^8) to allocate a power-of-two chunk of memory compatible with the GPU RAM's physical architecture, while keeping the rest of the hyperparameters unchanged. This indeed produced a Performance uptick: the Model Accuracy rose from {'val_acc': 0.915} up to {'val_acc': 0.932} and the Model Loss ticked down slightly from {'val_loss': 0.325} to {'val_loss': 0.234} under approximately the same Training Time of train_time_4='8:19'.

Experiment#5: Increasing the Maximum Learning Rate in the "One Cycle Learning Rate Policy" from max_lr = 0.01 to max_lr = 0.1, while keeping the rest of the hyperparameters unchanged, deteriorated the Performance substantially: it dampened the Model Accuracy from {'val_acc': 0.932} way down to {'val_acc': 0.618} and raised the Model Loss from {'val_loss': 0.234} way up to {'val_loss': 1.606} under approximately the same Training Time of train_time_5='8:30'.

Experiment#6: Reverting the Maximum Learning Rate from max_lr = 0.1 back to max_lr = 0.01, along with decreasing the Gradient Clipping from grad_clip = 0.1 down to grad_clip = 0.01 while keeping the rest of the hyperparameters unchanged, bounced the Performance back: it pulled the Model Accuracy from {'val_acc': 0.618} way up to {'val_acc': 0.933} and pushed the Model Loss from {'val_loss': 1.606} way down to {'val_loss': 0.238} under approximately the same Training Time of train_time_6='8:51'.

Experiment#7: Reverting the Gradient Clipping from grad_clip = 0.01 back to grad_clip = 0.1, along with raising the weight decay from weight_decay = 1e-4 to a moderate value of weight_decay = 1e-2 while keeping the rest of the hyperparameters unchanged, worsened the Performance significantly again: it suppressed the Model Accuracy from {'val_acc': 0.933} way down to {'val_acc': 0.652} and raised the Model Loss from {'val_loss': 0.238} way up to {'val_loss': 1.445} under approximately the same Training Time of train_time_7='8:31'.

Experiment#8: Reverting the weight decay from weight_decay = 1e-2 back to the initial value of weight_decay = 1e-4, along with switching the optimizer from Adam to SGD while keeping the rest of the hyperparameters unchanged, recovered much of the Performance: the Model Accuracy jumped from {'val_acc': 0.652} way up to {'val_acc': 0.87} and the Model Loss diminished from {'val_loss': 1.445} way down to {'val_loss': 0.546} under approximately the same Training Time of train_time_8='8:28'.

Experiment#9: Reverting the optimizer from SGD back to Adam, besides doubling the image resolution from image_size = 32 to image_size = 64, caused a GPU RAM "OUT OF MEMORY" error even with the minimal batch size of batch_size = 32 (down from batch_size = 256).

Experiment#10: Reverting the image resolution from image_size = 64 back to image_size = 32 and the batch size back to batch_size = 256, along with adding another Data Augmentation as transforms.RandomRotation(randomrotate, ...) while keeping the rest of the hyperparameters unchanged, rebounded the Performance: it lifted the Model Accuracy way up to {'val_acc': 0.932} and suppressed the Model Loss way down to {'val_loss': 0.249} under approximately the same Training Time of train_time_10='9:14'.

Experiment#11: Keeping transforms.RandomRotation(randomrotate, ...) together with adding another Data Augmentation as transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.1), while keeping the rest of the hyper-parameters unchanged, did not have a tangible positive effect on the Performance: the Model Accuracy slipped a bit from {'val_acc': 0.932} down to {'val_acc': 0.916} and the Model Loss rose slightly from {'val_loss': 0.249} up to {'val_loss': 0.298} under a relatively higher Training Time of train_time_11='11:30'.

Experiment#12: Abandoning the Data Augmentation of transforms.ColorJitter(...), besides doubling the No. of Training Epochs from epochs = 10 to epochs = 20 while keeping the rest of the hyperparameters unchanged, led to an upshift in Performance: the Model Accuracy moved higher from {'val_acc': 0.916} up to {'val_acc': 0.953} and the Model Loss fell from {'val_loss': 0.298} down to {'val_loss': 0.159} under a reasonable Training Time of train_time_12='17:20'.

Experiment#13: Commenting out the Data Augmentation of transforms.RandomRotation(randomrotate, ...), while keeping the rest of the hyperparameters unchanged, led to a minor uplift in Performance: the Model Accuracy ticked up from {'val_acc': 0.953} to {'val_acc': 0.962} and the Model Loss ticked down from {'val_loss': 0.159} to {'val_loss': 0.138} under an even shorter Training Time of train_time_13='17:01'.

Testing with Individual Test Dataset Images

While I have been tracking the overall accuracy of the model so far, it is also a good idea to look at its performance on some sample test images. Let's test my model with some sample images from the predefined test dataset of 1,250 test images (5 per species).

Helper Function predict_image to return predicted label for a single image tensor
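The helper in the post's screenshot follows the common pattern below; device, train_ds (an ImageFolder over the training split), and test_ds are names assumed here:

```python
import torch

def predict_image(img, model):
    # Convert the single image tensor into a batch of one and move it to the GPU
    xb = img.unsqueeze(0).to(device)
    # Forward pass to get class scores
    yb = model(xb)
    # Pick the index with the highest score
    _, preds = torch.max(yb, dim=1)
    # Map the class index back to a species name
    return train_ds.classes[preds[0].item()]

# Example usage on the first test image
img, label = test_ds[0]
print('Label:', test_ds.classes[label], '- Predicted:', predict_image(img, model))
```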

Identifying where the chosen model architecture performs poorly can help the designer improve the model, by collecting more training data, increasing/decreasing the complexity of the model, as well as changing the hyper-parameters.

Summary of the Employed Techniques and Further Reference

I am now ready to build upon this project and train numerous state-of-the-art deep learning models from scratch.

Here’s a summary of the different techniques used in this Deep Learning project to improve the model architecture's performance and reduce the training time:

  • Data normalization: I normalized the image tensors by subtracting the mean and dividing by the standard deviation of pixels across each channel. Normalizing the data prevents the pixel values from any one channel from disproportionately affecting the losses and gradients.

Reference: https://medium.com/@ml_kid/what-is-transform-and-transform-normalize-lesson-4-neural-networks-in-pytorch-ca97842336bd

  • Data Augmentation: I applied random transformations while loading images from the training dataset. Specifically, I padded each image by 4 pixels, then took a random crop of size 32 x 32 pixels, and then flipped the image horizontally with a 50% probability (the sketch after this list shows these transforms together with the normalization step).

Reference: https://www.analyticsvidhya.com/blog/2019/12/image-augmentation-deep-learning-pytorch/

  • Residual Connections: One of the key areas of improvement to my ResNet9 model was the addition of the residual block, which adds the original input back to the output feature map obtained by passing the input through one or more convolutional layers. I utilized the ResNet9 architecture to that aim.

Reference: https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec

  • Batch Normalization: After each convolutional layer, I added a Batch Normalization layer, which normalizes the outputs of the previous layer. This is somewhat similar to Data Normalization, except it is applied to the outputs of a layer, and the mean and standard deviation are learned parameters.

Reference: https://towardsdatascience.com/batch-normalization-and-dropout-in-neural-networks-explained-with-pytorch-47d7a8459bcd

  • Learning Rate Scheduling: Instead of using a fixed learning rate, I used a Learning Rate Scheduler, which will change the learning rate after every batch of training. There are many strategies for varying the learning rate during training, and I used the “One Cycle Learning Rate Policy”.

Reference: https://sgugger.github.io/the-1cycle-policy.html

  • Weight Decay: I added Weight Decay to the optimizer, yet another regularization technique which prevents the weights from becoming too large by adding an additional term to the loss function.

Reference: https://towardsdatascience.com/this-thing-called-weight-decay-a7cd4bcfccab

  • Gradient Clipping: Moreover, I added Gradient Clipping capability, which helps limit the values of gradients to a small range to prevent undesirable changes in model parameters owing to large gradient values during training.

Reference: https://towardsdatascience.com/what-is-gradient-clipping-b8e815cdfb48#63e0

  • Adam Optimizer: Instead of SGD (Stochastic Gradient Descent), I made use of the Adam optimizer which leverages techniques such as momentum and adaptive learning rates for faster training. There are many other optimizers to choose from and experiment with.

Reference: https://ruder.io/optimizing-gradient-descent/index.html
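Putting the normalization and augmentation bullets above together, the training-time transforms could look like the following sketch. The channel statistics shown are the widely used ImageNet values, standing in for the dataset-specific mean and standard deviation, which are not quoted in this post:

```python
import torchvision.transforms as tt

# Channel-wise mean and std; ImageNet values used here as a stand-in
stats = ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))

train_tfms = tt.Compose([
    tt.Resize((32, 32)),                                   # downscale to the 32 x 32 working size
    tt.RandomCrop(32, padding=4, padding_mode='reflect'),  # pad by 4 px, random 32 x 32 crop
    tt.RandomHorizontalFlip(),                             # horizontal flip with 50% probability
    tt.ToTensor(),
    tt.Normalize(*stats),                                  # subtract mean, divide by std per channel
])

# Validation/test images are only resized and normalized
valid_tfms = tt.Compose([tt.Resize((32, 32)), tt.ToTensor(), tt.Normalize(*stats)])
```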

As future work, I will try applying each technique independently to see how much each one affects the performance and training time. As I run more experiments, I will cultivate an intuition for picking the right architectures, data augmentation & regularization techniques.

In this post, I trained a ResNet9 neural network model to identify a wide variety of the world's bird species from the 250 Bird Species dataset with an accuracy of around 96%:

https://www.kaggle.com/gpiosenka/100-bird-species

However, I also noticed that it is quite challenging to improve the accuracy beyond 96%, owing to the model's limited capacity.
