Distributed Training
Through top-level design and engineering innovation, OneFlow makes distributed training as easy as possible. Users can run distributed training with OneFlow without making any special changes to the network structure or job logic. This is the most important feature that distinguishes OneFlow from other frameworks.
In this article, we will introduce:
- How to switch a program from running on a single machine to running on a distributed system.
- The concept and role of a node in OneFlow.
OneFlow's advantages in distributed training:
- OneFlow supports the consistent view: the whole network needs only one logical input and output.
- A mirrored view compatible with other frameworks is also provided, so users who are familiar with distributed training in other frameworks can learn to use it quickly.
- Only a few lines of configuration code are needed to switch a program from running on a single machine to running on a distributed system.
With OneFlow's distributed training interface, you only need a few lines of configuration to specify the IP addresses of the computing nodes and the number of devices, and the network can then be trained in a distributed way.
Here is an example of changing a program running on a single machine into one running on a distributed system with only a few lines of configuration.
Here is the framework of the single-machine training program. Because the code of each function is presented in the distributed program below, it is not listed in detail here.
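As a rough reference, a minimal single-machine skeleton might look like the sketch below; it assumes OneFlow's global-function style API, and the job name train_job, the small MLP, and the random placeholder data are illustrative rather than the exact code of the original script:

import numpy as np
import oneflow as flow
import oneflow.typing as tp

BATCH_SIZE = 100

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    # build a small network
    x = flow.reshape(images, (BATCH_SIZE, -1))
    hidden = flow.layers.dense(x, 512, activation=flow.nn.relu, name="hidden")
    logits = flow.layers.dense(hidden, 10, name="output")
    loss = flow.nn.sparse_softmax_cross_entropy_with_logits(labels, logits, name="loss")
    # optimization method and parameter configuration
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)
    return loss

if __name__ == "__main__":
    # call the job function and start training (random data as a stand-in for a real dataset)
    images = np.random.uniform(size=(BATCH_SIZE, 1, 28, 28)).astype(np.float32)
    labels = np.random.randint(0, 10, size=(BATCH_SIZE,)).astype(np.int32)
    for i in range(20):
        loss = train_job(images, labels)
        print("iter", i, "loss:", loss.mean())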
OneFlow provides interfaces related to distributed programs. We mainly use two of them:
- oneflow.config.gpu_device_num : sets the number of devices to use. This setting is applied to all machines.
- oneflow.env.ctrl_port : sets the port number used for communication. All machines use the same port.
In the following demo, we set all machines to use one device and port 9988 for communication. Users can change the configuration according to their actual situation.
# device number
flow.config.gpu_device_num(1)
# port number
flow.env.ctrl_port(9988)
Note that even if we only have a single machine with multiple GPU devices, we can still use flow.config.gpu_device_num to change a program from running on a single machine to running on a distributed system. In the code below, we use two GPU devices on one machine for distributed training:
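A minimal sketch of that configuration, assuming nothing else in the training job changes:

import oneflow as flow

# use two GPU devices on this single machine;
# no node list is needed because only one machine is involved
flow.config.gpu_device_num(2)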
Then we need to configure the network connections between the machines. In OneFlow, each machine in the distributed system is called a node.
The network information of each node is stored as a dict, in which the key "addr" corresponds to the IP address of that node. All nodes are stored in a list, which is passed to OneFlow by flow.env.machine. OneFlow then automatically creates the connections between the nodes.
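With the two example machines used in this article (192.168.1.12 and 192.168.1.11), the node list can be written as follows:

# each node is a dict; the "addr" key holds that node's IP address
nodes = [{"addr": "192.168.1.12"}, {"addr": "192.168.1.11"}]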
flow.env.machine(nodes)
It should be noted that node 0 in the list (192.168.1.12 in the code above) is called the master node. After the whole distributed training system starts, it builds the computation graph while the other nodes wait. When the construction of the graph is finished, all nodes receive a notice telling them which other nodes they need to contact, and they then work together in a decentralized way.
During training, the master node handles the standard output and stores the model, while the other nodes are only responsible for computation.
We can wrap the configuration code for distributed training in a function, which is easy to call:
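A sketch of such a helper, combining the device, port, and node settings shown above (the function name config_distributed is illustrative):

def config_distributed():
    print("distributed config")
    # number of devices used on each node
    flow.config.gpu_device_num(1)
    # communication port; the IP of each node is given in the node list
    flow.env.ctrl_port(9988)
    nodes = [{"addr": "192.168.1.12"}, {"addr": "192.168.1.11"}]
    flow.env.machine(nodes)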
After adding the configuration code, the program becomes a distributed training program. Just follow the same steps as in the single-machine program.
Compared with the single-machine training program, the distributed training program only needs to call one extra configuration function (such as config_distributed in the sketch above).
Distribution script:
Run the following on both 192.168.1.12 and 192.168.1.11:
wget https://docs.oneflow.org/code/basics_topics/distributed_train.py
python3 distributed_train.py
The results of the program will be displayed on 192.168.1.12.
Common problems when running distributed training:
- After running the distributed code, the program waits for a long time and does not display the calculation results.
- When running training in Docker, the program waits for a long time and does not show the calculation results.
- Using virtual network cards.