Specializing a neural network with SVC
This article is a follow-up to this post, where we trained a CNN to recognise Devanagari characters.
Transfer learning is the practice of using knowledge already acquired to perform new tasks, and it’s awesome. After all, why would you start from scratch when the problem is almost solved already?
In the case of neural networks, one way to perform transfer learning is to re-train the last layers of a network. I’m not fond of this method, as it can feel unnatural to implement. I think a more explicit way to benefit from a trained network is to use it as a feature extractor by chopping off its last layers. Once you see your network as just a feature extractor, retraining the last layers means stacking a new network on top. But why should we limit ourselves to this possibility? If we have few training samples, why not use a more suitable algorithm, such as a support vector classifier?
This is exactly what we are going to do in this final chapter of our series on handwritten character recognition. Today, we will greatly improve our accuracy by training a CNN on the whole database (numerals, consonants and vowels), before replacing its last layers with support vector machines.
This article follows this one and presupposes that the same data structures are loaded in the workspace.
Merging the datasets
We start by merging the three sets. To do so, we increment the labels of the vowels by 10 (the number of numerals) and the labels of the consonants by 22 (number of numerals + vowels) to resolve conflicts between labels.
We then split the dataset into a training, a validation and a testing set, using the same method as before (stratifying, and using the same proportions).
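Assuming the three datasets live in NumPy arrays (the names X_num, y_num, etc., the helper names and the 80/10/10 proportions are my assumptions, not the article's exact code), the merging and splitting could look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def merge_datasets(X_num, y_num, X_vow, y_vow, X_con, y_con):
    """Merge the three datasets, offsetting labels to avoid collisions."""
    X = np.concatenate([X_num, X_vow, X_con])
    # Numerals keep labels 0-9, vowels become 10-21, consonants 22+.
    y = np.concatenate([y_num, y_vow + 10, y_con + 22])
    return X, y

def split(X, y, seed=0):
    """Stratified 80/10/10 split into training, validation and test sets."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed)
    return X_train, y_train, X_val, y_val, X_test, y_test
```

Stratifying keeps the class proportions identical in all three subsets, as in the previous articles.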
Building a new model
We then define a model that will be trained on the whole training dataset (numerals, consonants and vowels together). We now have a much larger dataset (over 9,000 images for training), and we will use data-augmentation. Because of that, we can afford a more complex model to better fit the increased diversity of our data. The new model is constructed as follows:
- We start with a convolutional layer with more filters (128).
- We put two dense layers of 512 nodes before the last layer, to construct a better representation of the features uncovered by the convolutional layers. We will keep these layers when specialising the model to one of the three datasets.
- Because the model is still quite simple (no more than 4 million parameters), we can afford to run numerous epochs during training on a GPU. Numerous epochs are also a good way to benefit fully from data-augmentation, as the model discovers new images at each iteration. However, to prevent overfitting, we put a dropout layer before each dense layer, and also one after the first convolutional layers.
The Keras implementation of the model is:
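Here is a sketch of such a model (the kernel sizes, dropout rates, the second convolutional layer and the 58-class output, counting 10 numerals, 12 vowels and 36 consonants, are my assumptions, not necessarily the article's exact architecture):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Dropout,
                                     Flatten, Dense)

def build_model(input_shape=(32, 32, 1), n_classes=58):
    """CNN as described above: a 128-filter convolutional layer, dropout
    after the convolutional blocks, and two 512-node dense layers, each
    preceded by a dropout layer."""
    model = Sequential([
        Conv2D(128, (3, 3), activation='relu', input_shape=input_shape),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation='relu'),
        MaxPooling2D((2, 2)),
        Dropout(0.25),          # dropout after the convolutional layers
        Flatten(),
        Dropout(0.5),           # dropout before the first dense layer
        Dense(512, activation='relu'),
        Dropout(0.5),           # dropout before the second dense layer
        Dense(512, activation='relu'),
        Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```

With these sizes, the model stays well under the 4 million parameter budget mentioned above.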
Fitting the model with data-augmentation
We will now fit the same model using data-augmentation. We use Keras’ ImageDataGenerator to dynamically generate new batches of images. We specify the transformations we want on augmented images:
- A small random rotation of the characters (maximum 15 degrees)
- A small random zoom (in or out), up to a maximum of 20% of the image size.
- We could add random translations, but they would be largely redundant here, since convolutional layers are naturally robust to small translations.
When using data-augmentation, we need to fit the model using a special function, fit_generator. We specify that we want to monitor the training with a non-augmented validation set by passing validation_data=(X_val, y_val). Finally, we save the weights only when the validation loss decreases, and we evaluate the accuracy on the testing set.
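This training step could be sketched as follows (the rotation and zoom ranges are the ones listed above; the batch size, epoch count, helper name and weights filename are my assumptions; recent Keras versions accept generators directly in fit, which replaces the older fit_generator):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint

def fit_augmented(model, X_train, y_train, X_val, y_val,
                  epochs=200, batch_size=32,
                  weights_path='best.weights.h5'):
    """Train `model` on augmented batches, keeping only the weights that
    achieve the lowest validation loss."""
    # The transformations described above: small rotations and zooms.
    datagen = ImageDataGenerator(rotation_range=15, zoom_range=0.2)
    # Save weights only when the validation loss improves.
    checkpoint = ModelCheckpoint(weights_path, monitor='val_loss',
                                 save_best_only=True,
                                 save_weights_only=True)
    # Validation is done on non-augmented images.
    return model.fit(datagen.flow(X_train, y_train, batch_size=batch_size),
                     steps_per_epoch=max(1, len(X_train) // batch_size),
                     epochs=epochs,
                     validation_data=(X_val, y_val),
                     callbacks=[checkpoint])
```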
We now remove the last two layers of our model (the last dense layer and the dropout layer before it). We then freeze the remaining layers to make them non-trainable, and we save our base model for future use.
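This chopping-and-freezing step could be sketched like this (the helper name make_feature_extractor is mine, not the article's):

```python
from tensorflow.keras.models import Model

def make_feature_extractor(model, n_removed=2, path=None):
    """Cut off the last `n_removed` layers (here the final dense layer and
    the dropout before it), freeze what remains, and optionally save it."""
    base = Model(inputs=model.input,
                 outputs=model.layers[-(n_removed + 1)].output)
    for layer in base.layers:
        layer.trainable = False   # frozen: no further weight updates
    if path:
        base.save(path)
    return base
```

The resulting model outputs the activations of the last remaining dense layer rather than class probabilities.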
We should now split our whole dataset back into its numerals, vowels and consonants components. This part is a little tedious but necessary, to ensure that we test and validate our specialised models on samples that were not seen during the previous training.
To do that, we define an extract_subset function that extracts the samples whose label falls within a given range. For instance, to extract only the numerals, we take all the samples with a label between 0 and 9.
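A minimal implementation of extract_subset (the exact signature is an assumption; labels are shifted back so that each subset starts at 0, matching the labels the specialised classifiers were originally trained on):

```python
import numpy as np

def extract_subset(X, y, low, high):
    """Keep the samples whose label lies in [low, high), and shift the
    labels back so that the subset's labels start at 0."""
    mask = (y >= low) & (y < high)
    return X[mask], y[mask] - low
```

For instance, extract_subset(X, y, 10, 22) would recover the vowels from the merged dataset.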
We have extracted the training, validation and testing sets for the three datasets. We can now load the pre-trained model using Keras’ load_model function and extract the activations of its last remaining layer. These activations can be seen as high-level features of our images.
One way to specialize our model would be to add another dense layer (with a softmax activation) on top. This is, in fact, equivalent to performing a logistic regression over the activations the last layer produces for each image (we will call these activations ‘bottleneck features’). We therefore suggest using a more powerful classifier instead of a final dense layer: we will train an SVC to predict the class of the character, given the bottleneck features as input.
Of course, our feature extractor will be the model trained on the whole dataset (with data-augmentation), with the last dense layer removed. It transforms each image into a vector of 512 high-level features that we can feed to our SVC.
We then merge the training and validation sets (so as to use K-fold cross-validation instead of a fixed validation set), and perform a grid search to find the best parameters for our SVC. The following function does so and returns the best SVC found during the grid search.
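A sketch of this step with scikit-learn (the parameter grid, the 5-fold cross-validation and the helper name are my assumptions, not necessarily the article's exact search):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def best_svc(features_train, y_train, features_val, y_val):
    """Grid-search an SVC over the bottleneck features, using the merged
    training and validation sets with K-fold cross-validation."""
    X = np.concatenate([features_train, features_val])
    y = np.concatenate([y_train, y_val])
    grid = GridSearchCV(SVC(),
                        param_grid={'C': [1, 10, 100],
                                    'gamma': ['scale', 0.001, 0.0001]},
                        cv=5, n_jobs=-1)
    grid.fit(X, y)
    return grid.best_estimator_
```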
These results far surpass what we obtained with a CNN trained from scratch:
| | SVC over bottleneck features | CNN from scratch | SVC with PCA |
| --- | --- | --- | --- |
| Numerals | 99.7% | 99.1% | |
| Vowels | 99.5% | 97.3% | |
| Consonants | 94.9% | 85.9% | |
On the vowels and the numerals, we achieve accuracies of 99.5% and 99.7%, improving on the previous CNN model by 2.2 and 0.6 points respectively. The results are even more spectacular for the consonants, where we improve our accuracy from 85.9% to 94.9% (+9.0 points).
Neural networks as feature extractors
The fact that a trained neural network can be used as a feature extractor is very useful. For image recognition, a popular technique consists in using pre-trained CNNs (such as Inception or VGG) to extract high-level features that can be fed to other machine learning algorithms. By doing so, I save myself the struggle of training a CNN and improve my accuracy, since these CNNs were trained on databases far bigger than mine.
To visualize this phenomenon, I trained another CNN, with a dense layer containing only two neurons somewhere in the middle. If we remove all the layers after this one, the output of the CNN is a vector of size two, representing the input image as a point in the plane. Please note that the final accuracy of this CNN is far lower: it is generally a bad idea to put such a tight bottleneck on the information flowing through a neural network.
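Such a bottlenecked CNN could be sketched as follows (the layer sizes, the relu activation on the two-neuron layer and the function name are my assumptions); the viewer sub-model maps each image to a point in the plane:

```python
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_bottleneck_cnn(input_shape=(32, 32, 1), n_classes=12):
    """CNN with a 2-neuron dense layer in the middle. After training the
    full model, `viewer` outputs the 2-d coordinates of each image."""
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(2, activation='relu', name='bottleneck'),  # the 2-d bottleneck
        Dense(n_classes, activation='softmax'),
    ])
    viewer = Model(model.input, model.get_layer('bottleneck').output)
    return model, viewer
```

Plotting the viewer's output for every image, one colour per class, gives the scatter plots shown below.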
The features discovered by this CNN are displayed below (one colour per class, logarithmic transformation applied):
Or, for clarity, if we keep only the vowels:
Here, we can see that the features extracted from the images are clustered by class: this is why they are so effective as inputs to another classifier.
We tested several ways to classify Devanagari characters. The first was a support vector classifier trained over the first 24 axes of a PCA. We then improved our accuracy by switching to a convolutional neural network, trained only on the relevant dataset (consonants, vowels or numerals). Finally, we trained another CNN on the whole dataset, using data-augmentation, to provide a powerful feature extractor, and trained one specialised SVC for each type of character over the high-level features extracted by this CNN. With this technique, we achieved an accuracy far superior to the other methods (99.7% for the numerals, 99.5% for the vowels and 94.9% for the consonants).
That’s the end of the series! Thanks for your attention, and I promise: no more Devanagari characters here ;)