CONSISTENT VIEW

    Under the consistent view, users do not need to care about the details of computation and communication in a cluster. They can program as if on a single node, and OneFlow trains the model in a distributed way.

    OneFlow’s consistent view relies on several important concepts: Placement, SBP and SBP Signature.

    A Tensor in OneFlow has a placement attribute in the consistent view; the placement specifies which physical devices the Tensor is placed on.

    OneFlow automatically numbers the devices in the cluster. For example, if there are four hosts in a cluster and each host has eight cards, then the four hosts correspond to the IDs 0, 1, 2 and 3, and the cards on each host are numbered 0 to 7. To place a Tensor on the first four cards of machine 0, simply configure: placement("cuda", {0: [0, 1, 2, 3]}).
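    The numbering scheme above can be sketched in plain Python. This is only a hypothetical model of what a placement describes (a device type plus a mapping from machine ID to card IDs); the helper `expand_placement` is invented for illustration and is not part of OneFlow.

```python
# Hypothetical model of a placement: device type plus {machine_id: [card_ids]}.
# Plain Python for illustration only; this is not OneFlow's placement class.

def expand_placement(device_type, machine_to_devices):
    """Return every (device_type, machine_id, card_id) triple the placement covers."""
    devices = []
    for machine_id in sorted(machine_to_devices):
        for card_id in machine_to_devices[machine_id]:
            devices.append((device_type, machine_id, card_id))
    return devices

# The cluster from the text: four hosts (IDs 0..3) with eight cards each (0..7).
cluster = {machine_id: list(range(8)) for machine_id in range(4)}
all_devices = expand_placement("cuda", cluster)

# The first four cards on machine 0, mirroring placement("cuda", {0: [0, 1, 2, 3]}).
first_four = expand_placement("cuda", {0: [0, 1, 2, 3]})

print(len(all_devices))
print(first_four)
```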

    Placement makes it easy for OneFlow to support pipeline parallelism, and we’ll see examples of placement in other articles on this topic.

    SBP is a unique concept in OneFlow, which describes the mapping of data from a “Super Computing Device” perspective to data on real physical devices in a cluster. It is a combination of the initials of three words: split, broadcast, partial.

    In detail:

    • split means that the physical Tensor is obtained by splitting the logical Tensor along a certain dimension. An axis parameter indicates the dimension of the split. If the physical Tensors are concatenated along the split dimension, the logical Tensor is restored.
    • broadcast indicates that each physical Tensor is an exact copy of the logical Tensor.
    • partial indicates that although the physical Tensor has the same shape as the logical Tensor, each value in the physical Tensor is only a part of the value at the corresponding position in the logical Tensor. If you add the physical Tensors together position by position, you restore the logical Tensor. Besides sum, min, max and some other operations are also available for partial.
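    The three SBP types can be illustrated with a small plain-Python sketch for two devices. Tensors are modeled as nested lists, and the helpers `split`, `concat` and `add` are written here for illustration only; they are not OneFlow functions.

```python
# Plain-Python sketch of the three SBP types on two devices.
# Tensors are nested lists; every helper name here is illustrative only.

logical = [[1, 2, 3, 4],
           [5, 6, 7, 8]]

def split(tensor, axis, parts=2):
    """Cut a 2-D tensor into `parts` equal pieces along `axis` (0 or 1).
    Assumes the size along `axis` is divisible by `parts`."""
    if axis == 0:
        step = len(tensor) // parts
        return [tensor[i * step:(i + 1) * step] for i in range(parts)]
    step = len(tensor[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in tensor] for i in range(parts)]

def concat(pieces, axis):
    """Inverse of split: join the pieces along `axis`."""
    if axis == 0:
        return [row for piece in pieces for row in piece]
    return [sum((piece[r] for piece in pieces), []) for r in range(len(pieces[0]))]

def add(a, b):
    """Element-wise sum of two tensors with the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

# split: concatenating the pieces along the split axis restores the logical tensor.
rows = split(logical, 0)   # split(0)
cols = split(logical, 1)   # split(1)
assert concat(rows, 0) == logical
assert concat(cols, 1) == logical

# broadcast: every device holds a full copy of the logical tensor.
copies = [[row[:] for row in logical] for _ in range(2)]
assert all(copy == logical for copy in copies)

# partial sum: same shape on every device; values add up position by position.
part_0 = [[1, 0, 3, 0], [0, 6, 0, 8]]
part_1 = [[0, 2, 0, 4], [5, 0, 7, 0]]
assert add(part_0, part_1) == logical
```

    The particular decomposition used for the partial sum is arbitrary; any pair of same-shaped tensors that adds up to the logical tensor would do.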

    The figures below show some examples of SBP, including split(0), split(1), broadcast and partial sum.

    [Figure: SBP Example]

    When you create a Consistent Tensor, you can specify the SBP of the Tensor. An example appears in the next article: Consistent Tensor.

    SBP describes the mapping relationship between the data under the consistent view and the data on the physical devices. When doing distributed training, OneFlow distributes the data to the physical devices and computes the results according to the SBP attributes of the data.

    Let us illustrate this with matrix multiplication, looking at which combinations of input and output SBP are legal and which are illegal in a distributed system with two devices.

    Suppose that, from the consistent view, a matrix $A$ with the shape $(m, k)$ is multiplied by a matrix $B$ with the shape $(k, n)$ to get $Y$; the shape of $Y$ must be $(m, n)$.

    According to the rule of matrix multiplication, we can divide the matrix $A$ into two matrices $A_0$ and $A_1$ by dimension 0, with the shapes $(m_0, k)$ and $(m_1, k)$ respectively, where $m_0 + m_1 = m$:

    Device 1:

    $$A_0 \times B = Y_0, \qquad (m_0, k) \times (k, n) \rightarrow (m_0, n)$$

    Device 2:

    $$A_1 \times B = Y_1, \qquad (m_1, k) \times (k, n) \rightarrow (m_1, n)$$

    It is easy to see the relationship between the physical Tensors $A_0$, $A_1$ and the Tensor $A$ under the consistent view, and likewise the relationship between $Y_0$, $Y_1$ and the consistent view data $Y$ (both concatenations are along dimension 0):

    $$A = \mathrm{concat}(A_0, A_1), \qquad Y = \mathrm{concat}(Y_0, Y_1)$$
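    The two-device computation above can be checked numerically with a minimal plain-Python sketch. The names `A`, `B`, `Y`, the concrete shapes and the naive `matmul` helper are assumptions chosen for illustration; nothing here uses OneFlow itself.

```python
# Sketch: first input split along dim 0, second input copied to both devices,
# output recovered by concatenating along dim 0.

def matmul(a, b):
    """Naive product of nested lists with shapes (m, k) and (k, n)."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # shape (4, 2)
B = [[1, 0, 2], [0, 1, 3]]             # shape (2, 3)

# Split A by dimension 0 into two physical tensors of shape (2, 2).
A0, A1 = A[:2], A[2:]

Y0 = matmul(A0, B)   # computed on device 1, shape (2, 3)
Y1 = matmul(A1, B)   # computed on device 2, shape (2, 3)

# Concatenating along dim 0 (list concatenation of rows) gives the logical result.
Y = Y0 + Y1
assert Y == matmul(A, B)
```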

    In this way, the operation can be executed by distributing the data to each physical device, and the result is still correct from the consistent view. The long story above, described in SBP for the inputs $A$, $B$ and the output $Y$, is surprisingly simple:

    $A$ is split(0), $B$ is broadcast, and $Y$ is split(0).

    We can see that, for matrix multiplication, combining the SBP of the inputs and the output in this way is legal. Matrix multiplication has more than one valid SBP combination, such as:

    $A$ is broadcast, $B$ is split(1), and $Y$ is split(1).

    Or:

    $A$ is split(1), $B$ is split(0), and $Y$ is partial sum.
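    Both of these alternative combinations can be verified with a short plain-Python sketch. As before, the names `A`, `B`, the concrete values and the `matmul` helper are illustrative assumptions, not OneFlow code.

```python
# Sketch: two more valid SBP combinations for matrix multiplication on two devices.

def matmul(a, b):
    """Naive product of nested lists with shapes (m, k) and (k, n)."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

A = [[1, 2, 3, 4], [5, 6, 7, 8]]        # shape (2, 4)
B = [[1, 0], [0, 1], [2, 1], [1, 2]]    # shape (4, 2)
expected = matmul(A, B)

# Combination 1: first input broadcast, second input split(1) -> output split(1).
B_cols = [[row[:1] for row in B], [row[1:] for row in B]]   # two (4, 1) pieces
Y_pieces = [matmul(A, piece) for piece in B_cols]            # two (2, 1) pieces
Y_split1 = [Y_pieces[0][r] + Y_pieces[1][r] for r in range(len(A))]  # concat on dim 1
assert Y_split1 == expected

# Combination 2: first input split(1), second input split(0) -> output partial sum.
A_cols = [[row[:2] for row in A], [row[2:] for row in A]]   # two (2, 2) pieces
B_rows = [B[:2], B[2:]]                                      # two (2, 2) pieces
P0 = matmul(A_cols[0], B_rows[0])   # device 1: full shape, partial values
P1 = matmul(A_cols[1], B_rows[1])   # device 2: full shape, partial values
Y_partial = [[x + y for x, y in zip(r0, r1)] for r0, r1 in zip(P0, P1)]
assert Y_partial == expected
```

    In the second combination each device already produces a result of the full output shape; only the element-wise sum of the two is the correct product.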

    While we showed multiple valid SBP combinations above, not all SBP combinations are valid. For example, for matrix multiplication, if the inputs $A$ and $B$ are both split(0), then:

    $$A = \mathrm{concat}(A_0, A_1), \qquad B = \mathrm{concat}(B_0, B_1)$$

    Because the shapes of $A_0$, which is $(m_0, k)$, and $B_0$, which is $(k_0, n)$ with $k_0 < k$, do not meet the requirements of matrix multiplication, it is impossible to compute the matrix multiplication on the physical devices. We can say that the combination of $A$ as split(0) and $B$ as split(0) is illegal.
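    The shape mismatch can be made concrete with a small plain-Python check. The matrices and the helpers `shape` and `can_matmul` are hypothetical and exist only for this illustration.

```python
# Sketch: splitting both inputs along dim 0 leaves shapes that cannot be multiplied.

def shape(tensor):
    """Shape of a 2-D nested list as (rows, columns)."""
    return (len(tensor), len(tensor[0]))

def can_matmul(a, b):
    """Matrix multiplication needs columns of `a` to equal rows of `b`."""
    return shape(a)[1] == shape(b)[0]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]   # shape (4, 2)
B = [[1, 0, 2], [0, 1, 3]]             # shape (2, 3)

# Both inputs split along dim 0; look at the pieces held by the first device.
A0 = A[:2]   # shape (2, 2)
B0 = B[:1]   # shape (1, 3)

print(shape(A0), shape(B0))
print(can_matmul(A0, B0))   # inner dimensions 2 and 1 differ
print(can_matmul(A0, B))    # with the second input kept whole, shapes match
```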

    We define a specific, valid SBP combination of the inputs and outputs of an operator, like those shown above, as an SBP Signature of this operator.

    Every operator in OneFlow presets all of its possible SBP signatures according to the operator’s computation rules. The user only needs to set the placement and SBP attributes of the data; the process of selecting an SBP signature is transparent to the user.
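    One way to picture a preset list of signatures is as a lookup table. The sketch below is purely hypothetical: the table `MATMUL_SIGNATURES` and the function `infer_output_sbp` are invented here, they contain only the three matrix multiplication combinations named in this article, and they do not reflect how OneFlow stores or selects signatures internally.

```python
# Hypothetical table of matmul SBP signatures named in this article:
# (SBP of first input, SBP of second input) -> SBP of output.
# Not exhaustive, and not OneFlow's internal data structure.
MATMUL_SIGNATURES = {
    ("split(0)", "broadcast"): "split(0)",
    ("broadcast", "split(1)"): "split(1)",
    ("split(1)", "split(0)"): "partial sum",
}

def infer_output_sbp(sbp_a, sbp_b):
    """Return the output SBP for a listed signature, or None if it is not listed."""
    return MATMUL_SIGNATURES.get((sbp_a, sbp_b))

print(infer_output_sbp("split(0)", "broadcast"))
print(infer_output_sbp("split(1)", "split(0)"))
print(infer_output_sbp("split(0)", "split(0)"))   # the illegal combination is absent
```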

    Placement, SBP, and SBP Signature are the key guarantees of OneFlow’s distributed consistent view, which makes distributed training in OneFlow as simple as training on a single machine with a single card.

    In the next article, we’ll show you an example of programming under the consistent view.