Solver

    The Caffe solvers are:

    • Stochastic Gradient Descent (type: "SGD"),
    • AdaDelta (type: "AdaDelta"),
    • Adaptive Gradient (type: "AdaGrad"),
    • Adam (type: "Adam"),
    • Nesterov’s Accelerated Gradient (type: "Nesterov") and
    • RMSprop (type: "RMSProp")

    The solver

    • scaffolds the optimization bookkeeping and creates the training network for learning and test network(s) for evaluation,
    • iteratively optimizes by calling forward / backward and updating parameters,
    • (periodically) evaluates the test networks, and
    • snapshots the model and solver state throughout the optimization,

    where each iteration

    • calls network forward to compute the output and loss,
    • calls network backward to compute the gradients,
    • incorporates the gradients into parameter updates according to the solver method, and
    • updates the solver state according to learning rate, history, and method,

    to take the weights all the way from initialization to learned model.
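    The iterate-and-update loop can be illustrated on a toy one-dimensional problem (plain Python with a made-up loss, not Caffe's API):

```python
# Toy illustration of the solve loop: forward computes the loss, backward
# the gradient, and each iteration applies a parameter update, taking the
# weight from its initialization toward the learned value.
def forward(w):
    return 0.5 * (w - 3.0) ** 2   # toy loss with minimum at w = 3

def backward(w):
    return w - 3.0                # dL/dw

w, lr = 0.0, 0.1                  # initialization and learning rate
for it in range(200):
    loss = forward(w)
    grad = backward(w)
    w -= lr * grad                # plain SGD-style update
```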

    Like Caffe models, Caffe solvers run in CPU / GPU modes.

    The solver methods address the general optimization problem of loss minimization. For dataset $D$, the optimization objective is the average loss over all $|D|$ data instances throughout the dataset

    $$L(W) = \frac{1}{|D|} \sum_i^{|D|} f_W\left(X^{(i)}\right) + \lambda r(W)$$

    where $f_W\left(X^{(i)}\right)$ is the loss on data instance $X^{(i)}$ and $r(W)$ is a regularization term with weight $\lambda$. $|D|$ can be very large, so in practice, in each solver iteration we use a stochastic approximation of this objective, drawing a mini-batch of $N \ll |D|$ instances:

    $$L(W) \approx \frac{1}{N} \sum_i^N f_W\left(X^{(i)}\right) + \lambda r(W)$$

    The model computes $f_W$ in the forward pass and the gradient $\nabla f_W$ in the backward pass.

    The parameter update $\Delta W$ is formed by the solver from the error gradient $\nabla f_W$, the regularization gradient $\nabla r(W)$, and other particulars to each method.
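    A toy numeric check of the mini-batch approximation (made-up data and loss, not Caffe code):

```python
import random

# Each solver iteration replaces the average loss over all |D| instances
# with the average over a mini-batch of N << |D| sampled instances.
random.seed(0)
D = [float(x) for x in range(1000)]            # toy "dataset", |D| = 1000
f = lambda x: ((x - 500.0) / 1000.0) ** 2      # toy per-instance loss

full = sum(f(x) for x in D) / len(D)           # objective over all of D
batch = random.sample(D, 64)                   # mini-batch, N = 64
approx = sum(f(x) for x in batch) / len(batch) # stochastic approximation
```

    The mini-batch estimate is noisy but cheap; its expectation matches the full objective.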

    Stochastic Gradient Descent

    Stochastic gradient descent (type: "SGD") updates the weights $W$ by a linear combination of the negative gradient $\nabla L(W)$ and the previous weight update $V_t$. The learning rate $\alpha$ is the weight of the negative gradient. The momentum $\mu$ is the weight of the previous update.

    Formally, we have the following formulas to compute the update value $V_{t+1}$ and the updated weights $W_{t+1}$ at iteration $t+1$, given the previous weight update $V_t$ and current weights $W_t$:

    $$V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$$

    $$W_{t+1} = W_t + V_{t+1}$$

    The learning "hyperparameters" ($\alpha$ and $\mu$) might require a bit of tuning for best results. If you're not sure where to start, take a look at the "Rules of thumb" below, and for further information you might refer to Leon Bottou's Stochastic Gradient Descent Tricks [1].

    [1] L. Bottou. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade: Springer, 2012.
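    As a quick numeric check, the SGD update $V_{t+1} = \mu V_t - \alpha \nabla L(W_t)$, $W_{t+1} = W_t + V_{t+1}$ can be run on a toy quadratic loss (illustrative values, not Caffe code):

```python
# SGD with momentum on the toy loss L(W) = 0.5 * W^2 (so grad = W).
alpha, mu = 0.1, 0.9   # learning rate and momentum
W, V = 5.0, 0.0        # initial weight and previous update
for t in range(500):
    grad = W
    V = mu * V - alpha * grad   # V_{t+1} = mu * V_t - alpha * grad
    W = W + V                   # W_{t+1} = W_t + V_{t+1}
```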

    Rules of thumb for setting the learning rate $\alpha$ and momentum $\mu$

    A good strategy for deep learning with SGD is to initialize the learning rate $\alpha$ to a value around $\alpha \approx 0.01 = 10^{-2}$, and drop it by a constant factor (e.g., 10) throughout training when the loss begins to reach an apparent "plateau", repeating this several times. Generally, you probably want to use a momentum $\mu = 0.9$ or similar value. By smoothing the weight updates across iterations, momentum tends to make deep learning with SGD both stabler and faster.

    This was the strategy used by Krizhevsky et al. [1] in their famously winning CNN entry to the ILSVRC-2012 competition, and Caffe makes this strategy easy to implement in a SolverParameter, as in our reproduction of [1] at ./examples/imagenet/alexnet_solver.prototxt.
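    A SolverParameter implementing this kind of step schedule might look like the following sketch (illustrative values chosen to match the walkthrough in this section, not a copy of the shipped file):

```
base_lr: 0.01     # begin training at a learning rate of 0.01 = 1e-2
lr_policy: "step" # drop the learning rate in "steps" by a factor of gamma
gamma: 0.1        # ... namely a factor of 10 ...
stepsize: 100000  # ... every 100K iterations
max_iter: 350000  # train for 350K iterations total
momentum: 0.9
```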

    Under the above settings, we’ll always use momentum $\mu = 0.9$. We’ll begin training at a base_lr of $\alpha = 0.01 = 10^{-2}$ for the first 100,000 iterations, then multiply the learning rate by gamma ($\gamma$) and train at $\alpha' = \alpha \gamma = (0.01)(0.1) = 0.001 = 10^{-3}$ for iterations 100K-200K, then at $\alpha'' = 10^{-4}$ for iterations 200K-300K, and finally train until iteration 350K (since we have max_iter: 350000) at $\alpha''' = 10^{-5}$.

    Note that the momentum setting $\mu$ effectively multiplies the size of your updates by a factor of $\frac{1}{1 - \mu}$ after many iterations of training, so if you increase $\mu$, it may be a good idea to decrease $\alpha$ accordingly (and vice versa).

    For example, with $\mu = 0.9$, we have an effective update size multiplier of $\frac{1}{1 - 0.9} = 10$. If we increased the momentum to $\mu = 0.99$, we’ve increased our update size multiplier to 100, so we should drop $\alpha$ (base_lr) by a factor of 10.
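    A short numeric check of the $\frac{1}{1-\mu}$ factor (toy values, not Caffe code): with a constant gradient $g$, the momentum update converges to $-\alpha g / (1 - \mu)$, i.e., $\frac{1}{1-\mu}$ times the plain SGD step.

```python
# With a constant gradient, V converges to -alpha * g / (1 - mu).
alpha, mu, g = 0.01, 0.9, 1.0
V = 0.0
for _ in range(1000):
    V = mu * V - alpha * g
multiplier = V / (-alpha * g)   # approaches 1 / (1 - 0.9) = 10
```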

    Note also that the above settings are merely guidelines, and they’re definitely not guaranteed to be optimal (or even to work at all!) in every situation. If learning diverges (e.g., you start to see very large or NaN or inf loss values or outputs), try dropping the base_lr (e.g., base_lr: 0.001) and re-training, repeating this until you find a base_lr value that works.

    [1] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems, 2012.

    AdaDelta

    The AdaDelta (type: "AdaDelta") method (M. Zeiler [1]) is a “robust learning rate method”. It is a gradient-based optimization method (like SGD). The update formulas are

    $$
    (v_t)_i = \frac{\operatorname{RMS}((v_{t-1})_i)}{\operatorname{RMS}\left( \nabla L(W_t) \right)_{i}} \left( \nabla L(W_t) \right)_i
    $$

    $$
    \operatorname{RMS}\left( \nabla L(W_t) \right)_{i} = \sqrt{E[g^2] + \varepsilon}, \qquad
    E[g^2]_t = \delta E[g^2]_{t-1} + (1-\delta) g_t^2
    $$

    and

    $$
    (W_{t+1})_i = (W_t)_i - \alpha (v_t)_i.
    $$

    [1] M. Zeiler. ADADELTA: An Adaptive Learning Rate Method. arXiv preprint, 2012.

    AdaGrad

    The adaptive gradient (type: "AdaGrad") method (Duchi et al. [1]) is a gradient-based optimization method (like SGD) that attempts to “find needles in haystacks in the form of very predictive but rarely seen features,” in Duchi et al.’s words. Given the update information from all previous iterations $\left( \nabla L(W) \right)_{t'}$ for $t' \in \{1, 2, ..., t\}$, the update formulas proposed by [1] are as follows, specified for each component $i$ of the weights $W$:

    $$
    (W_{t+1})_i = (W_t)_i - \alpha \frac{\left( \nabla L(W_t) \right)_{i}}{\sqrt{\sum_{t'=1}^{t} \left( \nabla L(W_{t'}) \right)_i^2}}
    $$

    Note that in practice, for weights $W \in \mathcal{R}^d$, AdaGrad implementations (including the one in Caffe) use only $\mathcal{O}(d)$ extra storage for the historical gradient information (rather than the $\mathcal{O}(dt)$ storage that would be necessary to store each historical gradient individually).
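    A minimal sketch of the per-component update (toy quadratic loss, not Caffe code) also shows the $\mathcal{O}(d)$ bookkeeping: only a running sum of squared gradients is stored, one scalar per weight.

```python
import math

# AdaGrad on the toy loss L(W) = 0.5 * (W_0^2 + W_1^2), so grad_i = W_i.
alpha = 0.5
W = [4.0, -2.0]        # weights, d = 2
hist = [0.0, 0.0]      # O(d) running sums of squared gradients
for t in range(2000):
    grad = list(W)
    for i in range(len(W)):
        hist[i] += grad[i] ** 2
        W[i] -= alpha * grad[i] / math.sqrt(hist[i])
```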

    [1] J. Duchi, E. Hazan, and Y. Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. The Journal of Machine Learning Research, 2011.

    Adam

    The Adam method (type: "Adam"), proposed in Kingma et al. [1], is a gradient-based optimization method (like SGD). It includes “adaptive moment estimates” ($m_t, v_t$) and can be regarded as a generalization of AdaGrad. The update formulas are

    $$
    (m_t)_i = \beta_1 (m_{t-1})_i + (1-\beta_1)(\nabla L(W_t))_i,
    $$

    $$
    (v_t)_i = \beta_2 (v_{t-1})_i + (1-\beta_2)(\nabla L(W_t))_i^2,
    $$

    and

    $$
    (W_{t+1})_i = (W_t)_i - \alpha \frac{\sqrt{1-\beta_2^t}}{1-\beta_1^t} \frac{(m_t)_i}{\sqrt{(v_t)_i}+\varepsilon}.
    $$

    Kingma et al. [1] proposed to use $\beta_1 = 0.9, \beta_2 = 0.999, \varepsilon = 10^{-8}$ as default values. Caffe uses the values of momentum, momentum2, delta for $\beta_1, \beta_2, \varepsilon$, respectively.
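    The update with these defaults can be sketched on a toy one-dimensional problem (illustrative values, not Caffe code):

```python
import math

# Adam on the toy loss L(W) = 0.5 * W^2 (so grad = W), with the defaults
# beta1 = 0.9, beta2 = 0.999, eps = 1e-8; drives W from 3.0 toward 0.
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
W, m, v = 3.0, 0.0, 0.0
for t in range(1, 2001):
    grad = W
    m = beta1 * m + (1 - beta1) * grad       # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2  # second moment estimate
    step = alpha * (math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t))
    W -= step * m / (math.sqrt(v) + eps)     # bias-corrected update
```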

    [1] D. Kingma, J. Ba. Adam: A Method for Stochastic Optimization. International Conference for Learning Representations, 2015.

    Nesterov’s Accelerated Gradient

    Nesterov’s accelerated gradient (type: "Nesterov") was proposed by Nesterov [1] as an “optimal” method of convex optimization, achieving a convergence rate of $\mathcal{O}(1/t^2)$ rather than $\mathcal{O}(1/t)$; in practice it can also be a very effective method for optimizing certain types of deep learning architectures, as demonstrated by Sutskever et al. [2].

    The weight update formulas look very similar to the SGD updates given above:

    $$V_{t+1} = \mu V_t - \alpha \nabla L(W_t + \mu V_t)$$

    $$W_{t+1} = W_t + V_{t+1}$$

    What distinguishes the method from SGD is the weight setting $W$ on which we compute the error gradient $\nabla L(W)$: in NAG we take the gradient on weights with added momentum, $\nabla L(W_t + \mu V_t)$; in SGD we simply take the gradient $\nabla L(W_t)$ on the current weights themselves.
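    The difference is only where the gradient is evaluated, which a toy comparison makes concrete (illustrative values, not Caffe code):

```python
# SGD vs. NAG on the toy loss L(W) = 0.5 * W^2 (grad(W) = W): the sole
# difference is evaluating the gradient at the look-ahead point W + mu*V.
def grad(W):
    return W

alpha, mu = 0.1, 0.9
W_sgd = W_nag = 5.0
V_sgd = V_nag = 0.0
for t in range(300):
    V_sgd = mu * V_sgd - alpha * grad(W_sgd)               # gradient at W_t
    W_sgd += V_sgd
    V_nag = mu * V_nag - alpha * grad(W_nag + mu * V_nag)  # gradient at W_t + mu*V_t
    W_nag += V_nag
```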

    [1] Y. Nesterov. A Method of Solving a Convex Programming Problem with Convergence Rate $\mathcal{O}(1/k^2)$. Soviet Mathematics Doklady, 1983.

    [2] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the Importance of Initialization and Momentum in Deep Learning. Proceedings of the 30th International Conference on Machine Learning, 2013.

    RMSprop

    The RMSprop method (type: "RMSProp"), suggested by Tieleman in a Coursera course lecture, is a gradient-based optimization method (like SGD). The update formulas are

    $$
    \operatorname{MS}((W_t)_i) = \delta \operatorname{MS}((W_{t-1})_i) + (1-\delta)(\nabla L(W_t))_i^2
    $$

    $$
    (W_{t+1})_i = (W_t)_i - \alpha \frac{(\nabla L(W_t))_i}{\sqrt{\operatorname{MS}((W_t)_i)}}
    $$

    The default value of $\delta$ (rms_decay) is set to $\delta = 0.99$.
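    A toy sketch of this update (illustrative values, not Caffe code; the small eps guarding the division while MS warms up is an addition for numerical safety):

```python
import math

# RMSprop on the toy loss L(W) = 0.5 * W^2 with rms_decay delta = 0.99.
alpha, delta, eps = 0.01, 0.99, 1e-8
W, MS = 3.0, 0.0
for t in range(5000):
    grad = W
    MS = delta * MS + (1 - delta) * grad ** 2   # running mean square
    W -= alpha * grad / (math.sqrt(MS) + eps)   # RMS-normalized step
```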

    [1] T. Tieleman, and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Technical report, 2012.

    The solver scaffolding prepares the optimization method and initializes the model to be learned in Solver::Presolve().

    I0902 13:35:56.474978 16020 caffe.cpp:90] Starting Optimization
    I0902 13:35:56.475190 16020 solver.cpp:32] Initializing solver from parameters:
    test_iter: 100
    test_interval: 500
    base_lr: 0.01
    display: 100
    max_iter: 10000
    lr_policy: "inv"
    gamma: 0.0001
    power: 0.75
    momentum: 0.9
    weight_decay: 0.0005
    snapshot: 5000
    snapshot_prefix: "examples/mnist/lenet"
    solver_mode: GPU
    net: "examples/mnist/lenet_train_test.prototxt"

    Net initialization

    Loss

    I0902 13:35:56.728893 16020 net.cpp:170] loss needs backward computation.
    I0902 13:35:56.728909 16020 net.cpp:170] ip2 needs backward computation.
    I0902 13:35:56.728924 16020 net.cpp:170] relu1 needs backward computation.
    I0902 13:35:56.728953 16020 net.cpp:170] pool2 needs backward computation.
    I0902 13:35:56.728970 16020 net.cpp:170] conv2 needs backward computation.
    I0902 13:35:56.728984 16020 net.cpp:170] pool1 needs backward computation.
    I0902 13:35:56.728998 16020 net.cpp:170] conv1 needs backward computation.
    I0902 13:35:56.729014 16020 net.cpp:172] mnist does not need backward computation.
    I0902 13:35:56.729027 16020 net.cpp:208] This network produces output loss
    I0902 13:35:56.729071 16020 net.cpp:219] Network initialization done.
    I0902 13:35:56.729085 16020 net.cpp:220] Memory required for data: 5169924
    I0902 13:35:56.729277 16020 solver.cpp:156] Creating test net (#0) specified by net file: examples/mnist/lenet_train_test.prototxt

    Completion

    The actual weight update is made by the solver and then applied to the net parameters in Solver::ComputeUpdateValue(). The ComputeUpdateValue method incorporates any weight decay into the weight gradients (which currently just contain the error gradients) to get the final gradient with respect to each network weight. Then these gradients are scaled by the learning rate, and the update to subtract is stored in each parameter Blob’s diff field. Finally, the Blob::Update method is called on each parameter blob, which performs the final update (subtracting the Blob’s diff from its data).
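    The sequence just described (regularize, scale, store as diff, subtract) can be sketched in plain Python; the names and toy values below are illustrative only, not Caffe's actual C++/Python API:

```python
# Sketch of the update path for one parameter blob: fold weight decay into
# the error gradient, scale by the learning rate, store the result as the
# blob's diff, then subtract diff from data.
def compute_update_value(data, grads, lr, weight_decay):
    full_grads = [g + weight_decay * w for g, w in zip(grads, data)]
    return [lr * g for g in full_grads]          # the update to subtract

def blob_update(data, diff):
    return [w - d for w, d in zip(data, diff)]   # data <- data - diff

data = [1.0, -2.0]    # current weights
grads = [0.5, 0.5]    # error gradients from backward
diff = compute_update_value(data, grads, lr=0.1, weight_decay=0.01)
data = blob_update(data, diff)
```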

    The solver snapshots the weights and its own state during training in Solver::Snapshot() and Solver::SnapshotSolverState(). The weight snapshots export the learned model, while the solver snapshots allow training to be resumed from a given point. Training is resumed by Solver::Restore() and Solver::RestoreSolverState().

    Weights are saved without extension while solver states are saved with a .solverstate extension. Both files will have an _iter_N suffix for the snapshot iteration number.

    Snapshotting is configured by:

    # The snapshot interval in iterations.
    snapshot: 5000
    # File path prefix for snapshotting model weights and solver state.
    # Note: this is relative to the invocation of the `caffe` utility, not the
    # solver definition file.
    snapshot_prefix: "/path/to/model"
    # Snapshot the diff along with the weights. This can help debugging training
    # but takes more storage.
    snapshot_diff: false
    # A final snapshot is saved at the end of training unless
    # this flag is set to false. The default is true.
    snapshot_after_train: true