DISTRIBUTED TRAINING LAUNCHER
Users can start distributed training with commands like the following.
For example, to start training on a single machine with two GPUs (in the command sketches below, the script name train.py is a placeholder for the user's own training script):
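```shell
# Sketch: start two processes on this machine, one per GPU.
python3 -m oneflow.distributed.launch --nproc_per_node 2 ./train.py
```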
To start training across two machines instead, run a launch command on every machine. The master address must point to machine 0 (the address and port below are placeholders). Run on machine 0:
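```shell
# Sketch: machine 0 is node 0 of a two-node job with two GPUs per node.
python3 -m oneflow.distributed.launch \
    --nnodes 2 \
    --node_rank 0 \
    --nproc_per_node 2 \
    --master_addr 192.168.1.1 \
    --master_port 7788 \
    ./train.py
```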
Run on machine 1:
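```shell
# Sketch: machine 1 uses the same command, changing only --node_rank.
python3 -m oneflow.distributed.launch \
    --nnodes 2 \
    --node_rank 1 \
    --nproc_per_node 2 \
    --master_addr 192.168.1.1 \
    --master_port 7788 \
    ./train.py
```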
- --nnodes: the number of nodes (machines) taking part in the training
- --nproc_per_node: the number of processes to start on each node, which is recommended to be consistent with the number of GPUs on that node
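The launcher accepts further options as well, such as the master address and port used in the examples above. Assuming the module exposes the usual command-line help, they can be listed with:

```shell
# Print the full list of options supported by the launcher.
python3 -m oneflow.distributed.launch --help
```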
The Relationship between the Launch Module and the Parallel Strategy
The main function of oneflow.distributed.launch is to let users start distributed training more conveniently once the distributed program itself has been written; it saves the trouble of configuring each node in the cluster by hand.

However, oneflow.distributed.launch does not determine the parallel strategy. The parallel strategy is determined by how the data and the model are distributed, and by how they are placed on the physical devices.
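For example, a script started by the launcher performs data parallelism simply because it wraps its model with OneFlow's DistributedDataParallel; the launcher only starts the processes. The following is a minimal sketch of such a script, in which the model, data, and hyper-parameters are illustrative placeholders:

```python
import oneflow as flow
import oneflow.nn as nn
from oneflow.nn.parallel import DistributedDataParallel as ddp

# Every process started by oneflow.distributed.launch runs this same script;
# the launcher supplies rank and world-size information to each process.
rank = flow.env.get_rank()

# Placeholder model and data. The parallel strategy (data parallelism here)
# comes from wrapping the model with DistributedDataParallel, not from the launcher.
model = nn.Linear(4, 1).to("cuda")
model = ddp(model)

optimizer = flow.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Each process trains on its own (randomly generated) shard of data.
x = flow.randn(8, 4).to("cuda")
y = flow.randn(8, 1).to("cuda")

for step in range(10):
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if rank == 0:
        print(f"step {step}: loss {loss.item():.4f}")
```

Launching this script with the single-machine command shown above would start two such processes, whose gradients are synchronized by the DDP wrapper.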