How to launch distributed training
Distributed training doesn’t work in a notebook, so first clean up your experiment notebooks and prepare a script to run the training. For instance, here is a minimal script that trains a wide ResNet on CIFAR10.
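A minimal sketch of such a script, assuming fastai v1's vision API (the `wrn_22` wide ResNet and `ImageDataBunch`); adapt it to the fastai version you are using:

```python
from fastai.vision import *
from fastai.vision.models.wrn import wrn_22

# Download CIFAR10 and build a DataBunch with standard augmentation
path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)

# Train a wide ResNet-22 with the 1cycle policy
learn = Learner(data, wrn_22(), metrics=accuracy)
learn.fit_one_cycle(10, 3e-3, wd=0.4)
```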
Your script is going to be executed in several processes, one running on each GPU. To make this work properly, add the following introduction between your imports and the rest of your code.
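Assuming the script will be started with PyTorch's `torch.distributed.launch` utility (which passes a `--local_rank` argument to each process it spawns), the added lines look like this:

```python
from fastai.distributed import *
import argparse
import torch

# Each process receives the index of the GPU it should use from the launcher
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()

# Pin this process to its GPU and join the process group
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')
```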
What we do here is import the necessary stuff from fastai (for later), create an argument parser that will intercept an argument named `local_rank` (which will contain the index of the GPU to use), then set our GPU accordingly. The last line is what PyTorch needs to set things up properly and know that this process is part of a larger group.
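Then, after creating your Learner, wrap it for distributed training. A minimal sketch, assuming fastai v1's `to_distributed` method and the `args.local_rank` parsed above (check the API of your fastai version):

```python
# Wrap the Learner so training is distributed across processes,
# using the GPU index parsed from the launcher
learn = learn.to_distributed(args.local_rank)
```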
This will add the additional callbacks that make sure your model and your data loaders are properly set up for distributed training.
Now you can save your script; here is what the full example looks like:
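A sketch of the complete script, combining the pieces above (fastai v1 API assumed):

```python
from fastai.vision import *
from fastai.vision.models.wrn import wrn_22
from fastai.distributed import *
import argparse
import torch

# Read the GPU index passed by the launcher and join the process group
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()
torch.cuda.set_device(args.local_rank)
torch.distributed.init_process_group(backend='nccl', init_method='env://')

# Data and model, as in the single-GPU script
path = untar_data(URLs.CIFAR)
ds_tfms = ([*rand_pad(4, 32), flip_lr(p=0.5)], [])
data = ImageDataBunch.from_folder(path, valid='test', ds_tfms=ds_tfms, bs=128).normalize(cifar_stats)
learn = Learner(data, wrn_22(), metrics=accuracy).to_distributed(args.local_rank)

learn.fit_one_cycle(10, 3e-3, wd=0.4)
```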
In your terminal, type the following line (adapt the number of processes to the number of GPUs you want to use, and replace the script name with yours, ending in .py).
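For instance, with PyTorch's distributed launcher, using a hypothetical script named main.py on 2 GPUs (newer PyTorch releases ship `torchrun` as the replacement for this module):

```bash
python -m torch.distributed.launch --nproc_per_node=2 main.py
```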
Since they all have the same gradients at this stage, they will all perform the same update, so the models will still be identical after this step. Training then continues with the next batch, until the desired number of iterations has been completed.
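To make that concrete, here is a rough, hypothetical sketch of the gradient synchronization the distributed wrapper performs for you after each backward pass (the helper name `average_gradients` is made up for illustration):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Sum each parameter's gradient across processes, then divide by the
    number of processes, so every GPU holds the same averaged gradient."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```

In practice PyTorch's DistributedDataParallel overlaps this communication with the backward pass, so you never call such a helper yourself.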