DISTRIBUTED TRAINING LAUNCHER

    Users can start distributed training with a command of the following form:
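
    A sketch of the general form; the bracketed launch options and the training script name are placeholders:

        python3 -m oneflow.distributed.launch [launch options] training_script.py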

    For example, to start training on a single machine with two graphics cards:
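
    A minimal sketch, assuming the training script is saved as ddp_train.py (a placeholder name) and using the --nproc_per_node option described below:

        python3 -m oneflow.distributed.launch --nproc_per_node 2 ./ddp_train.py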

    Similarly, to start training on two machines, each with two graphics cards, run on machine 0:
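
    A sketch of the machine 0 command, assuming two nodes with two GPUs each; the address 192.168.1.1 and port 7788 are placeholders, and the --nnodes, --node_rank, --master_addr and --master_port options follow the torch.distributed.launch-style interface of the launch module:

        python3 -m oneflow.distributed.launch \
            --nnodes=2 \
            --node_rank=0 \
            --nproc_per_node=2 \
            --master_addr="192.168.1.1" \
            --master_port=7788 \
            ./ddp_train.py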

    Run on machine 1:
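
    The corresponding sketch for machine 1 differs only in --node_rank:

        python3 -m oneflow.distributed.launch \
            --nnodes=2 \
            --node_rank=1 \
            --nproc_per_node=2 \
            --master_addr="192.168.1.1" \
            --master_port=7788 \
            ./ddp_train.py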

    • --nnodes: the number of nodes (machines) participating in the training
    • --nproc_per_node: the number of processes to start on each node, which is recommended to match the number of GPUs on that node

    The Relationship between Launch Module and Parallel Strategy

    The main function of oneflow.distributed.launch is to let users start distributed training more conveniently once the distributed program has been written. It saves the trouble of manually setting up the relevant configuration (such as environment variables) on each node of the cluster.
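
    For comparison, a sketch of what that manual setup might look like without the launcher, assuming OneFlow reads the usual torch-style variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK, LOCAL_RANK); all values are placeholders, and each process on each node would need its own RANK and LOCAL_RANK:

        # On machine 0, for the process driving GPU 0 (illustrative values only)
        export MASTER_ADDR=192.168.1.1
        export MASTER_PORT=7788
        export WORLD_SIZE=4
        export RANK=0
        export LOCAL_RANK=0
        python3 ./ddp_train.py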

    However, oneflow.distributed.launch does not determine the parallel strategy. The parallel strategy is determined by how the data and the model are distributed across devices, and by their placement on the physical devices.
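
    As an illustration, the following sketch uses OneFlow's global tensor interface (placement and SBP); the ranks and SBP choices are arbitrary examples, and the script is assumed to be started by the launch module on two GPUs (e.g. --nproc_per_node 2):

        import oneflow as flow

        # Both tensors live on the same two devices.
        placement = flow.placement(type="cuda", ranks=[0, 1])

        # Splitting the input along dim 0 across the two devices corresponds to
        # data parallelism: each device holds half of the global batch.
        x = flow.randn(8, 4).to_global(placement=placement, sbp=flow.sbp.split(0))

        # Broadcasting the weight keeps a full replica on every device.
        w = flow.randn(4, 3).to_global(placement=placement, sbp=flow.sbp.broadcast)

        # The result inherits split(0); choosing different sbp/placement values
        # (e.g. splitting w instead) would yield a different parallel strategy
        # without changing the launch command at all.
        y = flow.matmul(x, w)
        print(y.placement, y.sbp)

    In other words, the launcher only spawns the processes; it is the placement and SBP (or, for DDP scripts, how the data and model are assigned to devices) that decide the parallel strategy.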
