
12.5 Nonserial Polyadic DP Formulations

In nonserial polyadic DP formulations, in addition to processing subproblems at a level in parallel, computation can also be pipelined to increase efficiency. We illustrate this with the optimal matrix-parenthesization problem.

12.5.1 The Optimal Matrix-Parenthesization Problem

Consider the problem of multiplying n matrices, A1, A2, ..., An, where each Ai is a matrix with r_{i-1} rows and r_i columns. The order in which the matrices are multiplied has a significant impact on the total number of operations required to evaluate the product.

Example 12.3 Optimal matrix parenthesization

Consider three matrices A1, A2, and A3 of dimensions 10 x 20, 20 x 30, and 30 x 40, respectively. The product of these matrices can be computed as (A1 x A2) x A3 or as A1 x (A2 x A3). In (A1 x A2) x A3, computing (A1 x A2) requires 10 x 20 x 30 operations and yields a matrix of dimensions 10 x 30. Multiplying this by A3 requires 10 x 30 x 40 additional operations. Therefore the total number of operations is 10 x 20 x 30 + 10 x 30 x 40 = 18000. Similarly, computing A1 x (A2 x A3) requires 20 x 30 x 40 + 10 x 20 x 40 = 32000 operations. Clearly, the first parenthesization is desirable.
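The arithmetic in Example 12.3 can be checked with a short sketch. The `mult_cost` helper is a name chosen here for illustration, not something from the text:

```python
# Cost of multiplying an (a x b) matrix by a (b x c) matrix: a*b*c scalar operations.
def mult_cost(a, b, c):
    return a * b * c

# Dimensions: A1 is 10 x 20, A2 is 20 x 30, A3 is 30 x 40.
# (A1 x A2) x A3: the inner product (A1 x A2) is a 10 x 30 matrix.
left_first = mult_cost(10, 20, 30) + mult_cost(10, 30, 40)
# A1 x (A2 x A3): the inner product (A2 x A3) is a 20 x 40 matrix.
right_first = mult_cost(20, 30, 40) + mult_cost(10, 20, 40)
print(left_first, right_first)  # 18000 32000
```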


The objective of the parenthesization problem is to determine a parenthesization that minimizes the number of operations. Enumerating all possible parenthesizations is not feasible since there are exponentially many of them.

Let C[i, j] be the optimal cost of multiplying the matrices Ai, ..., Aj. This chain of matrices can be expressed as a product of two smaller chains, Ai, Ai+1, ..., Ak and Ak+1, ..., Aj. The chain Ai, Ai+1, ..., Ak results in a matrix of dimensions r_{i-1} x r_k, and the chain Ak+1, ..., Aj results in a matrix of dimensions r_k x r_j. The cost of multiplying these two matrices is r_{i-1} r_k r_j. Hence, the cost of the parenthesization (Ai, Ai+1, ..., Ak)(Ak+1, ..., Aj) is given by C[i, k] + C[k+1, j] + r_{i-1} r_k r_j. This gives rise to the following recurrence relation for the parenthesization problem:

Equation 12.7

C[i, j] = 0                                                    if i = j,
C[i, j] = min_{i <= k < j} { C[i, k] + C[k+1, j] + r_{i-1} r_k r_j }    if i < j.


Given Equation 12.7, the problem reduces to finding the value of C[1, n]. The composition of costs of matrix chains is shown in Figure 12.7. Equation 12.7 can be solved with a bottom-up approach that constructs the table C storing the values C[i, j]. The algorithm fills table C in an order corresponding to solving the parenthesization problem on matrix chains of increasing length. Visualize this by thinking of filling in the table diagonally (Figure 12.8). Entries in diagonal l correspond to the cost of multiplying matrix chains of length l + 1. From Equation 12.7, the value of C[i, j] is computed as min{C[i, k] + C[k+1, j] + r_{i-1} r_k r_j}, where k takes values from i to j - 1. Therefore, computing C[i, j] requires that we evaluate (j - i) terms and select their minimum. The computation of each term takes time tc, so the computation of C[i, j] takes time (j - i)tc. Thus, each entry in diagonal l can be computed in time l tc.
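The diagonal-by-diagonal, bottom-up fill described above can be sketched as follows. This is a minimal Python rendering, not the book's pseudocode; the function name and the 1-based indexing convention are choices made here:

```python
def matrix_chain_cost(r):
    """Minimum scalar multiplications for A1, ..., An, where Ai is r[i-1] x r[i].
    r has n + 1 entries: r[0], r[1], ..., r[n]."""
    n = len(r) - 1
    # C[i][j] = optimal cost of multiplying the chain Ai ... Aj (1-based i, j).
    C = [[0] * (n + 1) for _ in range(n + 1)]
    for l in range(1, n):                  # diagonal l: chains of length l + 1
        for i in range(1, n - l + 1):
            j = i + l
            # Evaluate the (j - i) terms of Equation 12.7 and take their minimum.
            C[i][j] = min(C[i][k] + C[k + 1][j] + r[i - 1] * r[k] * r[j]
                          for k in range(i, j))
    return C[1][n]

# The three matrices of Example 12.3: 10 x 20, 20 x 30, 30 x 40.
print(matrix_chain_cost([10, 20, 30, 40]))  # 18000
```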

Figure 12.7. A nonserial polyadic DP formulation for finding an optimal matrix parenthesization for a chain of four matrices. A square node represents the optimal cost of multiplying a matrix chain. A circle node represents a possible parenthesization.

graphics/12fig34.gif

Figure 12.8. The diagonal order of computation for the optimal matrix-parenthesization problem.

graphics/12fig36.gif

In computing the cost of the optimal parenthesization sequence, the algorithm first computes the (n - 1) chains of length two. This takes time (n - 1)tc. Similarly, computing the (n - 2) chains of length three takes time 2(n - 2)tc. In the final step, the algorithm computes one chain of length n, which takes time (n - 1)tc. Thus, the sequential run time of this algorithm is

Equation 12.8

T_S = sum_{l=1}^{n-1} (n - l) l tc = ((n^3 - n)/6) tc


The sequential complexity of the algorithm is Θ(n^3).
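As a sanity check on Equation 12.8, summing the per-diagonal costs numerically matches the closed form (n^3 - n)/6 · tc, confirming the Θ(n^3) bound. A small sketch with tc = 1:

```python
# Sequential time: diagonal l has (n - l) entries, each costing l * tc.
# The sum over l = 1 .. n-1 should equal the closed form tc * (n^3 - n) / 6.
def sequential_time(n, tc=1):
    return sum(l * (n - l) * tc for l in range(1, n))

for n in (2, 5, 10, 100):
    assert sequential_time(n) == (n ** 3 - n) // 6
```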

Consider the parallel formulation of this algorithm on a logical ring of n processing elements. In step l, each processing element computes a single element belonging to the lth diagonal. Processing element Pi computes the (i + 1)th column of table C. Figure 12.8 illustrates the partitioning of the table among the processing elements. After computing the assigned value of the element in table C, each processing element sends its value to all other processing elements using an all-to-all broadcast (Section 4.2). Therefore, the assigned value in the next iteration can be computed locally. Computing an entry in table C during iteration l takes time l tc because it corresponds to the cost of multiplying a chain of length l + 1. An all-to-all broadcast of a single word on n processing elements takes time ts log n + tw(n - 1) (Section 4.2). The total time required to compute the entries along diagonal l is l tc + ts log n + tw(n - 1). The parallel run time is the sum of the time taken over the computation of the n - 1 diagonals.

T_P = sum_{l=1}^{n-1} (l tc + ts log n + tw(n - 1))
    = (n(n - 1)/2) tc + (n - 1) ts log n + (n - 1)^2 tw


The parallel run time of this algorithm is Θ(n^2). Since the processor-time product is Θ(n^3), which is the same as the sequential complexity, this algorithm is cost-optimal.

When using p processing elements (1 ≤ p ≤ n) organized in a logical ring, if there are n nodes in a diagonal, each processing element stores n/p nodes. Each processing element computes the cost C[i, j] of the entries assigned to it. After this computation, an all-to-all broadcast sends the solution costs of the subproblems for the most recently computed diagonal to all the other processing elements. Because each processing element has complete information about subproblem costs at preceding diagonals, no other communication is required. The time taken for the all-to-all broadcast of n/p words is ts log p + tw n(p - 1)/p ≈ ts log p + tw n. The time to compute the n/p entries of the table in the lth diagonal is l tc n/p. The parallel run time is
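The per-diagonal cost expression above can be turned into a small performance model. This is illustrative only; the parameter values and function name are arbitrary choices, not from the text:

```python
from math import log2

def parallel_time(n, p, tc=1.0, ts=1.0, tw=1.0):
    """Model of the p-processor ring formulation: diagonal l costs
    l * tc * n/p for computation plus ts * log p + tw * n for the
    all-to-all broadcast (the upper bound used in the text)."""
    return sum(l * tc * n / p + ts * log2(p) + tw * n for l in range(1, n))

# The computation term grows as n^3/p while communication grows as n^2,
# so for fixed p the computation fraction approaches 1 as n grows.
n, p = 512, 16
total = parallel_time(n, p)
comp = sum(l * 1.0 * n / p for l in range(1, n))
print(comp / total)  # fraction of time spent computing
```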

T_P = sum_{l=1}^{n-1} (l tc n/p + ts log p + tw n)
    = (n^2(n - 1)/(2p)) tc + (n - 1) ts log p + n(n - 1) tw


In order terms, T_P = Θ(n^3/p) + Θ(n^2). Here, Θ(n^3/p) is the computation time and Θ(n^2) the communication time. If n is sufficiently large with respect to p, the communication time can be made an arbitrarily small fraction of the computation time, yielding linear speedup.

This formulation can use at most Θ(n) processing elements to accomplish the task in time Θ(n^2). This time can be improved by pipelining the computation of the costs C[i, j] on n(n + 1)/2 processing elements, with each processing element computing a single entry C[i, j] of table C. Pipelining works due to the nonserial nature of the problem: the computation of an entry on diagonal t depends not only on the entries on diagonal t - 1 but also on entries on all the earlier diagonals. Hence work on diagonal t can start even before work on diagonal t - 1 is completed.
