Synchronous SGD
Example code:
- data_parallel_model_test.py has a simple 2-GPU model.
- For a more complex model, see the Resnet-50 example.
Parallelizing a model is handled by the caffe2.python.data_parallel_model module. The model must be created using a ModelHelper, such as model_helper.ModelHelper.
The key is to split your model creation code into three functions. These functions construct the operators just as you would without parallelization; a minimal sketch combining them follows the list.
- input_builder_fun: adds the input operators to the model. Signature: function(model)
- forward_pass_builder_fun: this function adds the operators and layers to the network. It should return a list of loss blobs that are used for computing the loss gradient. The function is also passed an internally calculated loss_scale parameter that is used to scale your loss to normalize for the number of GPUs. Signature: function(model, loss_scale)
- param_update_builder_fun: this function adds the operators for applying the gradient update to parameters, for example a simple SGD update or a momentum parameter update. You should also instantiate the learning-rate and iteration blobs here. You can set this function to None if you are only running a forward pass and not learning. Signature: function(model)
- optimize_gradient_memory: if enabled, the memonger module is used to optimize memory usage of gradient operators by sharing blobs when possible. This can save a significant amount of memory and may help you run larger batches.
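For concreteness, here is a minimal sketch of how the three builder functions plug into data_parallel_model.Parallelize_GPU. The network (a single fully connected layer), the blob names, and the device list are illustrative assumptions, not part of the example code above.

```python
from caffe2.python import brew, data_parallel_model, model_helper, workspace

def add_inputs(model):
    # Assumes "data" and "label" are fed externally; per GPU the actual
    # blobs will be namescoped, e.g. "gpu_0/data", "gpu_1/data".
    model.net.AddExternalInput("data", "label")

def add_forward_pass(model, loss_scale):
    fc = brew.fc(model, "data", "fc", dim_in=64, dim_out=10)
    softmax, loss = model.net.SoftmaxWithLoss(
        [fc, "label"], ["softmax", "loss"])
    # Scale the loss so gradients summed over GPUs average correctly.
    loss = model.net.Scale(loss, "loss_scaled", scale=loss_scale)
    return [loss]

def add_param_update(model):
    iteration = brew.iter(model, "iter")
    # Negative base_lr because the update below adds lr * grad to the param.
    lr = model.net.LearningRate(
        [iteration], "lr", base_lr=-0.1, policy="fixed")
    one = model.param_init_net.ConstantFill([], "one", shape=[1], value=1.0)
    for param in model.GetParams():
        grad = model.param_to_grad[param]
        # Plain SGD: param += lr * grad.
        model.net.WeightedSum([param, one, grad, lr], param)

train_model = model_helper.ModelHelper(name="sync_sgd_sketch")
data_parallel_model.Parallelize_GPU(
    train_model,
    input_builder_fun=add_inputs,
    forward_pass_builder_fun=add_forward_pass,
    param_update_builder_fun=add_param_update,
    devices=[0, 1],
    optimize_gradient_memory=True,
)
workspace.RunNetOnce(train_model.param_init_net)
workspace.CreateNet(train_model.net)
```

Note how the loss is multiplied by loss_scale before being returned, so the gradients accumulated across GPUs still average to the single-GPU result.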
Under the hood, Caffe2 uses DeviceScope and NameScope to distinguish the parameters of each GPU. Each parameter is prefixed with a namescope such as “gpu_0/” or “gpu_5/”. Each blob created by the functions above is assigned to the correct GPU by a DeviceScope set by the data_parallel_model.Parallelize_GPU function. To checkpoint the model, only pick up the parameters prefixed with “gpu_0/” by calling model.GetParams("gpu_0"). CUDA NCCL ops are used to synchronize parameters between the GPUs.
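As a sketch of the checkpointing step, assuming the train_model from the sketch above, only the lead GPU's copy of each parameter needs to be fetched:

```python
from caffe2.python import workspace

# The per-GPU parameter copies are kept in sync, so saving the
# "gpu_0/" set alone is sufficient for a checkpoint.
params = train_model.GetParams("gpu_0")  # e.g. gpu_0/fc_w, gpu_0/fc_b
checkpoint = {str(p): workspace.FetchBlob(p) for p in params}
```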
Gloo is a Facebook Incubator project that helps manage multi-host, multi-GPU machine learning applications.
The more complex example referenced above contains code that uses Gloo; it is not specifically utilized in this sync SGD example, but support for it is present in the data_parallel_model module that the example uses.
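To illustrate how that multi-host path is wired up, the sketch below creates a Gloo rendezvous that would be passed to Parallelize_GPU; the Redis host, port, run id, and shard counts are placeholder assumptions, not values from this document.

```python
from caffe2.python import core, workspace

# Hypothetical two-machine setup; host, port, run id, and shard ids
# are placeholders. Each machine runs this with its own shard_id.
store_handler = "store_handler"
workspace.RunOperatorOnce(core.CreateOperator(
    "RedisStoreHandlerCreate", [], [store_handler],
    host="redis.example.com", port=6379, prefix="sync_sgd_run_0"))

rendezvous = dict(
    kv_handler=store_handler,
    shard_id=0,        # this machine's index
    num_shards=2,      # total number of participating machines
    engine="GLOO",
    exit_nets=None,
)

# Passing rendezvous=rendezvous to data_parallel_model.Parallelize_GPU
# adds Gloo allreduce ops that synchronize gradients across machines
# in addition to the NCCL synchronization across local GPUs.
```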