4.2 All-to-All Broadcast and Reduction

All-to-all broadcast is a generalization of one-to-all broadcast in which all p nodes simultaneously initiate a broadcast. A process sends the same m-word message to every other process, but different processes may broadcast different messages. All-to-all broadcast is used in matrix operations, including matrix multiplication and matrix-vector multiplication. The dual of all-to-all broadcast is all-to-all reduction, in which every node is the destination of an all-to-one reduction (Problem 4.8). Figure 4.8 illustrates all-to-all broadcast and all-to-all reduction.

Figure 4.8. All-to-all broadcast and all-to-all reduction.

graphics/04fig09.gif

One way to perform an all-to-all broadcast is to perform p one-to-all broadcasts, one starting at each node. If performed naively, on some architectures this approach may take up to p times as long as a one-to-all broadcast. It is possible to use the communication links in the interconnection network more efficiently by performing all p one-to-all broadcasts simultaneously so that all messages traversing the same path at the same time are concatenated into a single message whose size is the sum of the sizes of individual messages.

The following sections describe all-to-all broadcast on linear array, mesh, and hypercube topologies.

4.2.1 Linear Array and Ring

While performing all-to-all broadcast on a linear array or a ring, all communication links can be kept busy simultaneously until the operation is complete because each node always has some information that it can pass along to its neighbor. Each node first sends to one of its neighbors the data it needs to broadcast. In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor.

Figure 4.9 illustrates all-to-all broadcast for an eight-node ring. The same procedure would also work on a linear array with bidirectional links. As with the previous figures, the integer label of an arrow indicates the time step during which the message is sent. In all-to-all broadcast, p different messages circulate in the p-node ensemble. In Figure 4.9, each message is identified by its initial source, whose label appears in parentheses along with the time step. For instance, the arc labeled 2 (7) between nodes 0 and 1 represents the data communicated in time step 2 that node 0 received from node 7 in the preceding step. As Figure 4.9 shows, if communication is performed circularly in a single direction, then each node receives all (p - 1) pieces of information from all other nodes in (p - 1) steps.

Figure 4.9. All-to-all broadcast on an eight-node ring. The label of each arrow shows the time step and, within parentheses, the label of the node that owned the current message being transferred before the beginning of the broadcast. The number(s) in parentheses next to each node are the labels of nodes from which data has been received prior to the current communication step. Only the first, second, and last communication steps are shown.

graphics/04fig10.gif

Algorithm 4.4 gives a procedure for all-to-all broadcast on a p-node ring. The initial message to be broadcast is known locally as my_msg at each node. At the end of the procedure, each node stores the collection of all p messages in result. As the program shows, all-to-all broadcast on a mesh applies the linear array procedure twice, once along the rows and once along the columns.

Algorithm 4.4 All-to-all broadcast on a p-node ring.

1.   procedure ALL_TO_ALL_BC_RING(my_id, my_msg, p, result) 
2.   begin 
3.      left := (my_id - 1) mod p; 
4.      right := (my_id + 1) mod p; 
5.      result := my_msg; 
6.      msg := result; 
7.      for i := 1 to p - 1 do 
8.          send msg to right; 
9.          receive msg from left; 
10.         result := result  msg; 
11.     endfor; 
12.  end ALL_TO_ALL_BC_RING

In all-to-all reduction, the dual of all-to-all broadcast, each node starts with p messages, each one destined to be accumulated at a distinct node. All-to-all reduction can be performed by reversing the direction and sequence of the messages. For example, the first communication step for all-to-all reduction on an 8-node ring would correspond to the last step of Figure 4.9 with node 0 sending msg[1] to 7 instead of receiving it. The only additional step required is that upon receiving a message, a node must combine it with the local copy of the message that has the same destination as the received message before forwarding the combined message to the next neighbor. Algorithm 4.5 gives a procedure for all-to-all reduction on a p-node ring.

Algorithm 4.5 All-to-all reduction on a p-node ring.

1.   procedure ALL_TO_ALL_RED_RING(my_id, my_msg, p, result) 
2.   begin 
3.      left := (my_id - 1) mod p; 
4.      right := (my_id + 1) mod p; 
5.      recv := 0; 
6.      for i := 1 to p - 1 do 
7.          j := (my_id + i) mod p; 
8.          temp := msg[j] + recv; 
9.          send temp to left; 
10.         receive recv from right; 
11.     endfor; 
12.     result := msg[my_id] + recv; 
13.  end ALL_TO_ALL_RED_RING

4.2.2 Mesh

Just like one-to-all broadcast, the all-to-all broadcast algorithm for the 2-D mesh is based on the linear array algorithm, treating rows and columns of the mesh as linear arrays. Once again, communication takes place in two phases. In the first phase, each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, all nodes collect messages corresponding to the nodes of their respective rows. Each node consolidates this information into a single message of size , and proceeds to the second communication phase of the algorithm. The second communication phase is a columnwise all-to-all broadcast of the consolidated messages. By the end of this phase, each node obtains all p pieces of m-word data that originally resided on different nodes. The distribution of data among the nodes of a 3 x 3 mesh at the beginning of the first and the second phases of the algorithm is shown in Figure 4.10.

Figure 4.10. All-to-all broadcast on a 3 x 3 mesh. The groups of nodes communicating with each other in each phase are enclosed by dotted boundaries. By the end of the second phase, all nodes get (0,1,2,3,4,5,6,7) (that is, a message from each node).

graphics/04fig11.gif

Algorithm 4.6 gives a procedure for all-to-all broadcast on a mesh. The mesh procedure for all-to-all reduction is left as an exercise for the reader (Problem 4.4).

Algorithm 4.6 All-to-all broadcast on a square mesh of p nodes.

1.   procedure ALL_TO_ALL_BC_MESH(my_id, my_msg, p, result) 
2.   begin 

/* Communication along rows */ 
3.      left := my_id - (my_id mod ) + (my_id - 1)mod ; 
4.      right := my_id - (my_id mod ) + (my_id + 1) mod ; 
5.      result := my_msg; 
6.      msg := result; 
7.      for i := 1 to  - 1 do 
8.          send msg to right; 
9.          receive msg from left; 
10.         result := result  msg; 
11.     endfor; 

/* Communication along columns */ 
12.     up := (my_id - ) mod p; 
13.     down := (my_id + ) mod p; 
14.     msg := result; 
15.     for i := 1 to  - 1 do 
16.         send msg to down; 
17.         receive msg from up; 
18.         result := result  msg; 
19.     endfor; 
20.  end ALL_TO_ALL_BC_MESH

4.2.3 Hypercube

The hypercube algorithm for all-to-all broadcast is an extension of the mesh algorithm to log p dimensions. The procedure requires log p steps. Communication takes place along a different dimension of the p-node hypercube in each step. In every step, pairs of nodes exchange their data and double the size of the message to be transmitted in the next step by concatenating the received message with their current data. Figure 4.11 shows these steps for an eight-node hypercube with bidirectional communication channels.

Figure 4.11. All-to-all broadcast on an eight-node hypercube.

graphics/04fig12.gif

Algorithm 4.7 gives a procedure for implementing all-to-all broadcast on a d-dimensional hypercube. Communication starts from the lowest dimension of the hypercube and then proceeds along successively higher dimensions (Line 4). In each iteration, nodes communicate in pairs so that the labels of the nodes communicating with each other in the i th iteration differ in the i th least significant bit of their binary representations (Line 5). After an iteration's communication steps, each node concatenates the data it receives during that iteration with its resident data (Line 8). This concatenated message is transmitted in the following iteration.

Algorithm 4.7 All-to-all broadcast on a d-dimensional hypercube.

1.   procedure ALL_TO_ALL_BC_HCUBE(my_id, my_msg, d, result) 
2.   begin 
3.      result := my_msg; 
4.      for i := 0 to d - 1 do 
5.          partner := my id XOR 2ⁱ; 
6.          send result to partner; 
7.          receive msg from partner; 
8.          result := result  msg; 
9.      endfor; 
10.  end ALL_TO_ALL_BC_HCUBE

As usual, the algorithm for all-to-all reduction can be derived by reversing the order and direction of messages in all-to-all broadcast. Furthermore, instead of concatenating the messages, the reduction operation needs to select the appropriate subsets of the buffer to send out and accumulate received messages in each iteration. Algorithm 4.8 gives a procedure for all-to-all reduction on a d-dimensional hypercube. It uses senloc to index into the starting location of the outgoing message and recloc to index into the location where the incoming message is added in each iteration.

Algorithm 4.8 All-to-all broadcast on a d-dimensional hypercube. AND and XOR are bitwise logical-and and exclusive-or operations, respectively.

1.   procedure ALL_TO_ALL_RED_HCUBE(my_id, msg, d, result) 
2.   begin 
3.      recloc := 0; 
4.      for i := d - 1 to 0 do 
5.          partner := my_id XOR 2ⁱ; 
6.          j := my_id AND 2ⁱ; 
7.          k := (my_id XOR 2ⁱ) AND 2ⁱ; 
8.          senloc := recloc + k; 
9.          recloc := recloc + j; 
10.         send msg[senloc .. senloc + 2ⁱ - 1] to partner; 
11.         receive temp[0 .. 2ⁱ - 1] from partner; 
12.         for j := 0 to 2ⁱ - 1 do 
13.             msg[recloc + j] := msg[recloc + j] + temp[j]; 
14.         endfor; 
15.     endfor; 
16.     result := msg[my_id]; 
17.  end ALL_TO_ALL_RED_HCUBE

4.2.4 Cost Analysis

On a ring or a linear array, all-to-all broadcast involves p - 1 steps of communication between nearest neighbors. Each step, involving a message of size m, takes time t_s + t_w m. Therefore, the time taken by the entire operation is

Equation 4.2

Similarly, on a mesh, the first phase of simultaneous all-to-all broadcasts (each among nodes) concludes in time . The number of nodes participating in each all-to-all broadcast in the second phase is also , but the size of each message is now . Therefore, this phase takes time to complete. The time for the entire all-to-all broadcast on a p-node two-dimensional square mesh is the sum of the times spent in the individual phases, which is

Equation 4.3

On a p-node hypercube, the size of each message exchanged in the i th of the log p steps is 2ⁱ^-1m. It takes a pair of nodes time t_s + 2ⁱ^-1t_wm to send and receive messages from each other during the i th step. Hence, the time to complete the entire procedure is

Equation 4.4

graphics/04fig17.gif

Equations 4.2, 4.3, and 4.4 show that the term associated with t_w in the expressions for the communication time of all-to-all broadcast is t_wm(p - 1) for all the architectures. This term also serves as a lower bound for the communication time of all-to-all broadcast for parallel computers on which a node can communicate on only one of its ports at a time. This is because each node receives at least m(p - 1) words of data, regardless of the architecture. Thus, for large messages, a highly connected network like a hypercube is no better than a simple ring in performing all-to-all broadcast or all-to-all reduction. In fact, the straightforward all-to-all broadcast algorithm for a simple architecture like a ring has great practical importance. A close look at the algorithm reveals that it is a sequence of p one-to-all broadcasts, each with a different source. These broadcasts are pipelined so that all of them are complete in a total of p nearest-neighbor communication steps. Many parallel algorithms involve a series of one-to-all broadcasts with different sources, often interspersed with some computation. If each one-to-all broadcast is performed using the hypercube algorithm of Section 4.1.3, then n broadcasts would require time n(t_s + t_wm) log p. On the other hand, by pipelining the broadcasts as shown in Figure 4.9, all of them can be performed spending no more than time (t_s + t_wm)(p - 1) in communication, provided that the sources of all broadcasts are different and n p. In later chapters, we show how such pipelined broadcast improves the performance of some parallel algorithms such as Gaussian elimination (Section 8.3.1), back substitution (Section 8.3.3), and Floyd's algorithm for finding the shortest paths in a graph (Section 10.4.2).

Another noteworthy property of all-to-all broadcast is that, unlike one-to-all broadcast, the hypercube algorithm cannot be applied unaltered to mesh and ring architectures. The reason is that the hypercube procedure for all-to-all broadcast would cause congestion on the communication channels of a smaller-dimensional network with the same number of nodes. For instance, Figure 4.12 shows the result of performing the third step (Figure 4.11(c)) of the hypercube all-to-all broadcast procedure on a ring. One of the links of the ring is traversed by all four messages and would take four times as much time to complete the communication step.

Figure 4.12. Contention for a channel when the communication step of Figure 4.11(c) for the hypercube is mapped onto a ring.

graphics/04fig18.gif