Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data
The Flickr-sourced images of the Style dataset are visually very similar to the ImageNet dataset, on which the `bvlc_reference_caffenet` model was trained. Since that model works well for object category classification, we’d like to use its architecture for our style classifier. We also only have 80,000 images to train on, so we’d like to start with the parameters learned on the 1,000,000 ImageNet images, and fine-tune as needed. If we provide the `weights` argument to the `caffe train` command, the pretrained weights will be loaded into our model, matching layers by name.
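Conceptually, the matching-by-name rule means every layer whose name appears in the pretrained model reuses its learned weights, while any layer with a new name starts fresh. Here is a minimal sketch of that logic, with plain Python dicts standing in for Caffe’s weight blobs (this is illustrative, not the pycaffe API):

```python
# Sketch of Caffe's weight-loading rule: copy pretrained weights into the
# new net wherever layer names match; unmatched layers keep fresh weights.
# Dicts stand in for weight blobs -- not the actual Caffe API.

def load_pretrained(new_net, pretrained):
    """Copy weights from `pretrained` into `new_net`, matching by layer name."""
    loaded, fresh = [], []
    for name in new_net:
        if name in pretrained:
            new_net[name] = pretrained[name]  # reuse ImageNet-learned weights
            loaded.append(name)
        else:
            fresh.append(name)  # e.g. the renamed last layer stays random
    return loaded, fresh

pretrained = {"conv1": "imagenet-conv1", "fc8": "imagenet-fc8"}
new_net = {"conv1": "random", "fc8_flickr": "random"}
loaded, fresh = load_pretrained(new_net, pretrained)
print(loaded)  # ['conv1']
print(fresh)   # ['fc8_flickr']
```

Because `fc8` was renamed to `fc8_flickr`, its pretrained weights are simply never matched, which is exactly what we want for a layer whose output size has changed.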
Because we are predicting 20 classes instead of 1,000, we do need to change the last layer in the model. Therefore, we change the name of the last layer from `fc8` to `fc8_flickr` in our prototxt. Since there is no layer with that name in the `bvlc_reference_caffenet` model, that layer will begin training with random weights.
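The renamed layer might look like this in the net prototxt (a sketch of the relevant fields, not the full layer definition shipped with the example):

```
layer {
  name: "fc8_flickr"   # renamed from "fc8" so no pretrained weights are matched
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8_flickr"
  inner_product_param {
    num_output: 20     # 20 style classes instead of 1,000 ImageNet classes
  }
}
```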
We will also decrease the overall learning rate in the solver prototxt, but boost the `lr_mult` on the newly introduced layer. The idea is to have the rest of the model change very slowly with the new data, while letting the new layer learn fast. Additionally, we set `stepsize` in the solver to a lower value than if we were training from scratch, since we’re effectively far along in training and therefore want the learning rate to go down faster. Note that we could also entirely prevent fine-tuning of all layers other than `fc8_flickr` by setting their `lr_mult` to 0.
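Sketched in prototxt, the idea looks like this (the exact values are illustrative, not necessarily the ones shipped with the example):

```
# In the net prototxt: boost the per-layer learning rate of the new layer.
layer {
  name: "fc8_flickr"
  # ... type, bottoms, tops, inner_product_param as above ...
  param { lr_mult: 10 }  # learn fast; set other layers' lr_mult to 0 to freeze them
}

# In the solver prototxt: lower base_lr and stepsize relative to from-scratch training.
base_lr: 0.001    # reduced overall learning rate
stepsize: 20000   # decay the learning rate sooner, since we start far along
```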
All steps are to be done from the caffe root directory.
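The data-assembly script referred to below lives in the example directory. Assuming the standard Caffe tree layout, it is invoked roughly like this (the flag values here are illustrative; check the script’s `--help` for the options it actually accepts):

```shell
# Download a subset of Flickr Style images and write train/val file lists
# into data/flickr_style (script path per the standard Caffe example layout).
python examples/finetune_flickr_style/assemble_data.py \
    --workers=-1 --images=2000 --seed=1701
```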
This script downloads images and writes train/val file lists into `data/flickr_style`. The prototxts in this example assume this, and also assume the presence of the ImageNet mean file (run `get_ilsvrc_aux.sh` from `data/ilsvrc12` to obtain this if you haven’t yet).
We’ll also need the ImageNet-trained model, which you can obtain by running `./scripts/download_model_binary.py models/bvlc_reference_caffenet`.
Now we can train! The key to fine-tuning is the `-weights` argument in the command below, which tells Caffe that we want to load weights from a pre-trained Caffe model. (You can fine-tune in CPU mode by leaving out the `-gpu` flag.)
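Assuming the standard paths in the Caffe tree, the training invocation looks like this (the solver path and GPU id may differ in your setup):

```shell
# Fine-tune from the ImageNet-trained model: -weights loads pretrained
# parameters, matching layers by name, before training begins.
./build/tools/caffe train \
    -solver models/finetune_flickr_style/solver.prototxt \
    -weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel \
    -gpu 0
```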
For comparison, here is how the loss goes down when we do not start with a pre-trained model:
This model is only beginning to learn.
Fine-tuning can be feasible when training from scratch would not be, for lack of time or data. Even in CPU mode each pass through the training set takes ~100 s; GPU fine-tuning is of course faster still, and can learn a useful model in minutes or hours instead of days or weeks. Furthermore, note that the model has only trained on fewer than 2,000 instances: transfer learning a new task like style recognition from ImageNet pretraining can require much less data than training from scratch.
Now try fine-tuning to your own tasks and data!
The Flickr Style dataset as distributed here contains only URLs to images. Some of the images may be copyrighted. Training a category-recognition model for research/non-commercial use may constitute fair use of this data, but the result should not be used for commercial purposes.