9.1 Consider the following technique for performing the compare-split operation. Let x1, x2, ..., xk be the elements stored at process Pi in increasing order, and let y1, y2, ..., yk be the elements stored at process Pj in decreasing order. Process Pi sends x1 to Pj. Process Pj compares x1 with y1 and then sends the larger element back to process Pi and keeps the smaller element for itself. The same procedure is repeated for pairs (x2, y2), (x3, y3), ..., (xk, yk). If for any pair (xl, yl), 1 ≤ l ≤ k, we have xl ≥ yl, then no more exchanges are needed. Finally, each process sorts its elements. Show that this method correctly performs a compare-split operation. Analyze its run time, and compare the relative merits of this method to those of the method presented in the text. Is this method better suited for MIMD or SIMD parallel computers?
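The following minimal Python sketch (an illustration written for this problem, not code from the text) simulates the technique in a single address space; in a message-passing setting each iteration of the loop would correspond to a send/receive pair between Pi and Pj:

def compare_split(x, y):
    # x: Pi's elements in increasing order; y: Pj's elements in decreasing order.
    assert len(x) == len(y)
    for l in range(len(x)):
        if x[l] >= y[l]:
            break                                  # later pairs need no exchange either
        x[l], y[l] = y[l], x[l]                    # Pj keeps the smaller, Pi gets the larger
    x.sort()                                       # finally, each process sorts its elements
    y.sort()
    return x, y

print(compare_split([1, 4, 6, 9], [8, 7, 3, 2]))   # -> ([6, 7, 8, 9], [1, 2, 3, 4])

Note that, as described above, Pi ends up holding the k larger elements and Pj the k smaller ones.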
9.2 Show that the ≤ relation, as defined in Section 9.1 for blocks of elements, is a partial ordering relation.
Hint: A relation is a partial ordering if it is reflexive, antisymmetric, and transitive.
9.3 Consider the sequence s = {a0, a1, ..., an-1}, where n is a power of 2. In the following cases, prove that the sequences s1 and s2 obtained by performing the bitonic split operation described in Section 9.2.1 on the sequence s satisfy the properties that (1) s1 and s2 are bitonic sequences, and (2) the elements of s1 are smaller than the elements of s2:
s is a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ an/2-1 and an/2 ≥ an/2+1 ≥ ··· ≥ an-1.
s is a bitonic sequence such that a0 ≤ a1 ≤ ··· ≤ ai and ai+1 ≥ ai+2 ≥ ··· ≥ an-1 for some i, 0 ≤ i ≤ n - 1.
s is a bitonic sequence that becomes increasing-decreasing after shifting its elements.
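As a reference point for experimenting with the three cases, here is a minimal sketch of the bitonic split of Section 9.2.1 (an illustration, assuming only that the split takes the element-wise minimum and maximum of the two halves of s):

def bitonic_split(s):
    # Element-wise min/max between the two halves of a bitonic sequence s
    # (n a power of 2) yields two bitonic halves s1 and s2.
    n = len(s)
    s1 = [min(s[i], s[i + n // 2]) for i in range(n // 2)]
    s2 = [max(s[i], s[i + n // 2]) for i in range(n // 2)]
    return s1, s2   # every element of s1 is <= every element of s2

# Case 1 of the problem: increasing then decreasing.
s = [1, 3, 5, 7, 8, 6, 4, 2]
print(bitonic_split(s))   # -> ([1, 3, 4, 2], [8, 6, 5, 7])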
9.4 In the parallel formulations of bitonic sort, we assumed that we had n processes available to sort n items. Show how the algorithm needs to be modified when only n/2 processes are available.
9.5 Show that, in the hypercube formulation of bitonic sort, each bitonic merge of sequences of size 2^k is performed on a k-dimensional hypercube and each sequence is assigned to a separate hypercube.
9.6 Show that the parallel formulation of bitonic sort shown in Algorithm 9.1 is correct. In particular, show that the algorithm correctly compare-exchanges elements and that the elements end up in the appropriate processes.
9.7 Consider the parallel formulation of bitonic sort for a mesh-connected parallel computer. Compute the exact parallel run time of the following formulations:
One that uses the row-major mapping shown in Figure 9.11(a) for a mesh with store-and-forward routing.
One that uses the row-major snakelike mapping shown in Figure 9.11(b) for a mesh with store-and-forward routing.
One that uses the row-major shuffled mapping shown in Figure 9.11(c) for a mesh with store-and-forward routing.
Also, determine how the above run times change when cut-through routing is used.
9.8 Show that the block-based bitonic sort algorithm that uses compare-split operations is correct.
9.9 Consider a ring-connected parallel computer with n processes. Show how to map the input wires of the bitonic sorting network onto the ring so that the communication cost is minimized. Analyze the performance of your mapping. Consider the case in which only p processes are available. Analyze the performance of your parallel formulation for this case. What is the largest number of processes that can be used while maintaining a cost-optimal parallel formulation? What is the isoefficiency function of your scheme?
9.10 Prove that the block-based odd-even transposition sort yields a correct algorithm.
Hint: This problem is similar to Problem 9.8.
9.11 Show how to apply the idea of the shellsort algorithm (Section 9.3.2) to a p-process mesh-connected computer. Your algorithm does not need to be an exact copy of the hypercube formulation.
9.12 Show how to parallelize the sequential shellsort algorithm for a p-process hypercube. Note that the shellsort algorithm presented in Section 9.3.2 is not an exact parallelization of the sequential algorithm.
9.13 Consider the shellsort algorithm presented in Section 9.3.2. Its performance depends on the value of l, which is the number of odd and even phases performed during the second phase of the algorithm. Describe a worst-case initial key distribution that will require l = Θ(p) phases. What is the probability of this worst-case scenario?
9.14 In Section 9.4.1 we discussed a parallel formulation of quicksort for a CREW PRAM that is based on assigning each subproblem to a separate process. This formulation uses n processes to sort n elements. Based on this approach, derive a parallel formulation that uses p processes, where p < n. Derive expressions for the parallel run time, efficiency, and isoefficiency function. What is the maximum number of processes that your parallel formulation can use and still remain cost-optimal?
9.15 Derive an algorithm that traverses the binary search tree constructed by the algorithm in Algorithm 9.6 and determines the position of each element in the sorted array. Your algorithm should use n processes and solve the problem in time Θ(log n) on an arbitrary CRCW PRAM.
9.16 Consider the PRAM formulation of the quicksort algorithm (Section 9.4.2). Compute the average height of the binary tree generated by the algorithm.
9.17 Consider the following parallel quicksort algorithm that takes advantage of the topology of a p-process hypercube-connected parallel computer. Let n be the number of elements to be sorted and p = 2^d be the number of processes in a d-dimensional hypercube. Each process is assigned a block of n/p elements, and the labels of the processes define the global order of the sorted sequence. The algorithm starts by selecting a pivot element, which is broadcast to all processes. Each process, upon receiving the pivot, partitions its local elements into two blocks, one with elements smaller than the pivot and one with elements larger than the pivot. Then the processes connected along the dth communication link exchange appropriate blocks so that one retains elements smaller than the pivot and the other retains elements larger than the pivot. Specifically, each process with a 0 in the dth (most significant) bit position of the binary representation of its process label retains the smaller elements, and each process with a 1 in the dth bit retains the larger elements. After this step, each process in the (d - 1)-dimensional subcube whose dth label bit is 0 will have elements smaller than the pivot, and each process in the other (d - 1)-dimensional subcube will have elements larger than the pivot. This procedure is performed recursively in each subcube, splitting the subsequences further. After d such splits - one along each dimension - the sequence is sorted with respect to the global ordering imposed on the processes. This does not mean that the elements at each process are sorted. Therefore, each process sorts its local elements by using sequential quicksort. This hypercube formulation of quicksort is shown in Algorithm 9.9. The execution of the algorithm is illustrated in Figure 9.21.

Algorithm 9.9 A parallel formulation of quicksort on a d-dimensional hypercube. B is the n/p-element subsequence assigned to each process.
procedure HYPERCUBE_QUICKSORT(B, n)
begin
   id := process's label;
   for i := d downto 1 do
   begin
      x := pivot;
      partition B into B1 and B2 such that B1 ≤ x < B2;
      if the ith bit of id is 0 then
      begin
         send B2 to the process along the ith communication link;
         C := subsequence received along the ith communication link;
         B := B1 ∪ C;
      endif
      else
      begin
         send B1 to the process along the ith communication link;
         C := subsequence received along the ith communication link;
         B := B2 ∪ C;
      endelse
   endfor
   sort B using sequential quicksort;
end HYPERCUBE_QUICKSORT
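A single-process Python simulation of Algorithm 9.9 may help in checking the analysis (an illustration, not the text's implementation: a real program would exchange blocks with message passing, and the pivot here is drawn at random from the subcube's elements, one plausible choice since the problem leaves pivot selection open):

import random

def hypercube_quicksort(blocks, d):
    # blocks[q] is the block held by process q; labels give the global order.
    p = 1 << d
    for i in range(d, 0, -1):                 # split along dimension d, then d-1, ...
        size = 1 << i                         # processes per participating subcube
        for base in range(0, p, size):        # each subcube splits independently
            pool = [e for q in range(base, base + size) for e in blocks[q]]
            pivot = random.choice(pool) if pool else 0
            for q in range(base, base + size // 2):
                partner = q + size // 2       # neighbor along the ith link
                both = blocks[q] + blocks[partner]
                blocks[q] = [e for e in both if e <= pivot]       # 0-bit side keeps B1 U C
                blocks[partner] = [e for e in both if e > pivot]  # 1-bit side keeps B2 U C
    for q in range(p):
        blocks[q].sort()                      # final local sequential sort
    return blocks

random.seed(1)
data = [random.sample(range(100), 4) for _ in range(8)]   # p = 8 processes, n/p = 4
print(hypercube_quicksort(data, d=3))

Concatenating the returned blocks in process-label order yields the fully sorted sequence, although the block sizes may become unbalanced.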
Analyze the complexity of this hypercube-based parallel quicksort algorithm. Derive expressions for the parallel run time, speedup, and efficiency. Perform this analysis assuming that the elements initially assigned to each process are uniformly distributed.
9.18 Consider the parallel formulation of quicksort for a d-dimensional hypercube described in Problem 9.17. Show that after d splits - one along each communication link - the elements are sorted according to the global order defined by the process labels.
9.19 Consider the parallel formulation of quicksort for a d-dimensional hypercube described in Problem 9.17. Compare this algorithm against the message-passing quicksort algorithm described in Section 9.4.3. Which algorithm is more scalable? Which algorithm is more sensitive to poor selection of pivot elements?
9.20 An alternative way of selecting pivots in the parallel formulation of quicksort for a d-dimensional hypercube (Problem 9.17) is to select all 2^d - 1 pivots at once, as follows:
Each process picks a sample of l elements at random.
All processes together sort the sample of l × 2^d items by using the shellsort algorithm (Section 9.3.2).
Choose 2^d - 1 equally spaced pivots from this sorted list.
Broadcast the pivots so that all processes know them.
How does the quality of this pivot selection scheme depend on l? Do you think l should be a function of n? Under what assumptions will this scheme select good pivots? Do you think this scheme works when the elements are not identically distributed on each process? Analyze the complexity of this scheme.
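A minimal centralized sketch of this sampling scheme is given below (an illustration: it sorts the combined sample in one place rather than with the parallel shellsort the problem specifies). The questions above about the choice of l can be explored empirically with it:

import random

def select_pivots(blocks, l):
    # blocks: list of 2**d per-process element lists; returns 2**d - 1 pivots.
    sample = []
    for b in blocks:
        sample.extend(random.sample(b, min(l, len(b))))  # l random elements per process
    sample.sort()
    p = len(blocks)
    step = len(sample) // p
    return [sample[(k + 1) * step - 1] for k in range(p - 1)]  # equally spaced picks

random.seed(0)
blocks = [[random.randrange(1000) for _ in range(64)] for _ in range(8)]
print(select_pivots(blocks, l=8))   # 7 pivots for a 3-dimensional hypercube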
9.21 Another pivot selection scheme for parallel quicksort on a hypercube (Problem 9.17) is as follows. During the split along the ith dimension, 2^(i-1) pairs of processes exchange elements. The pivot is selected in two steps. In the first step, each of the 2^(i-1) pairs of processes computes the median of its combined sequences. In the second step, the median of these 2^(i-1) medians is computed. This median of medians becomes the pivot for the split along the ith communication link. Subsequent pivots are selected in the same way among the participating subcubes. Under what assumptions will this scheme yield good pivot selections? Is this better than the median scheme described in the text? Analyze the complexity of selecting the pivot.
Hint: If A and B are two sorted sequences, each having n elements, then the median of A ∪ B can be found in time Θ(log n).
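One standard way to realize the hint, sketched here for equal-length sequences and the lower median (an illustration; the function name and tie conventions are ours): binary search on how many elements of A fall in the lower half of the union.

def median_of_union(A, B):
    n = len(A)                       # assumes len(A) == len(B) == n, both sorted
    lo, hi = 0, n                    # i = number of A-elements in the lower n
    while lo <= hi:
        i = (lo + hi) // 2
        j = n - i                    # B contributes the remaining j elements
        a_left  = A[i - 1] if i > 0 else float('-inf')
        a_right = A[i]     if i < n else float('inf')
        b_left  = B[j - 1] if j > 0 else float('-inf')
        b_right = B[j]     if j < n else float('inf')
        if a_left <= b_right and b_left <= a_right:
            return max(a_left, b_left)    # largest element of the lower half
        elif a_left > b_right:
            hi = i - 1                    # too many A-elements in the lower half
        else:
            lo = i + 1                    # too few A-elements in the lower half

print(median_of_union([1, 3, 8, 9], [2, 4, 6, 7]))   # -> 4

Each iteration halves the search range, giving the Θ(log n) bound the hint states.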
9.22 In the parallel formulation of the quicksort algorithm on shared-address-space and message-passing architectures (Section 9.4.3), each iteration is followed by a barrier synchronization. Is barrier synchronization necessary to ensure the correctness of the algorithm? If not, then how does the performance change in the absence of barrier synchronization?
9.23 Consider the message-passing formulation of the quicksort algorithm presented in Section 9.4.3. Compute the exact (that is, using ts, tw, and tc) parallel run time and efficiency of the algorithm under the assumption of perfect pivots. Compute the various components of the isoefficiency function of your formulation when
tc = 1, tw = 1, ts = 1
tc = 1, tw = 1, ts = 10
tc = 1, tw = 10, ts = 100
for cases in which the desired efficiency is 0.50, 0.75, and 0.95. Does the scalability of this formulation depend on the desired efficiency and the architectural characteristics of the machine?
9.24 Consider the following parallel formulation of quicksort for a mesh-connected message-passing parallel computer. Assume that one element is assigned to each process. The recursive partitioning step consists of selecting the pivot and then rearranging the elements in the mesh so that those smaller than the pivot are in one part of the mesh and those larger than the pivot are in the other. We assume that the processes in the mesh are numbered in row-major order. At the end of the quicksort algorithm, elements are sorted with respect to this order. Consider the partitioning step for an arbitrary subsequence illustrated in Figure 9.22(a). Let k be the length of this sequence, and let Pm, Pm+1, ..., Pm+k be the mesh processes storing it. Partitioning consists of the following four steps:

A pivot is selected at random and sent to process Pm. Process Pm broadcasts this pivot to all k processes by using an embedded tree, as shown in Figure 9.22(b). The root (Pm) transmits the pivot toward the leaves. The tree embedding is also used in the following steps.
Information is gathered at each process and passed up the tree. In particular, each process counts the number of elements smaller and larger than the pivot in both its left and right subtrees. Each process knows the pivot value and therefore can determine if its element is smaller or larger. Each process propagates two values to its parent: the number of elements smaller than the pivot and the number of elements larger than the pivot in the process's subtree. Because the tree embedded in the mesh is not complete, some nodes will not have left or right subtrees. At the end of this step, process Pm knows how many of the k elements are smaller and larger than the pivot. If s is the number of elements smaller than the pivot, then the position of the pivot in the sorted sequence is Pm+s.
Information is propagated down the tree to enable each element to be moved to its proper position in the smaller or larger partitions. Each process in the tree receives from its parent the next empty position in the smaller and larger partitions. Depending on whether the element stored at each process is smaller or larger than the pivot, the process propagates the proper information down to its subtrees. Initially, the position for elements smaller than the pivot is Pm and the position for elements larger than the pivot is Pm+s+1.
The processes perform a permutation, and each element moves to the proper position in the smaller or larger partition.
This algorithm is illustrated in Figure 9.23.
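The destination computation of steps 2-4 above can be summarized by the following flat Python sketch (an illustration only: it ignores the tree embedding and assumes distinct keys; the prefix counts play the role of the values passed down the tree in step 3):

def partition_destinations(elems, pivot, m=0):
    # elems[i] lives at process P(m+i); returns dest[i], the label offset of the
    # process that should hold elems[i] after the permutation.
    s = sum(e < pivot for e in elems)          # gathered up the tree in step 2
    dest, small_next, large_next = [], m, m + s + 1
    for e in elems:
        if e < pivot:
            dest.append(small_next); small_next += 1
        elif e > pivot:
            dest.append(large_next); large_next += 1
        else:
            dest.append(m + s)                 # the pivot's final position
    return dest

print(partition_destinations([7, 2, 9, 4, 5, 1, 8], pivot=5))   # -> [4, 0, 5, 1, 3, 2, 6]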

Analyze the complexity of this mesh-based parallel quicksort algorithm. Derive expressions for the parallel run time, speedup, and efficiency. Perform this analysis assuming that the elements initially assigned to each process are uniformly distributed.
9.25 Consider the quicksort formulation for a mesh described in Problem 9.24. Describe a scaled-down formulation that uses p < n processes. Analyze its parallel run time, speedup, and isoefficiency function.
9.26 Consider the enumeration sort algorithm presented in Section 9.6.1. Show how the algorithm can be implemented on each of the following:
a CREW PRAM
an EREW PRAM
a hypercube-connected parallel computer
a mesh-connected parallel computer.
Analyze the performance of your formulations. Furthermore, show how you can extend this enumeration sort to a hypercube to sort n elements using p processes.
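For reference, the counting idea underlying enumeration sort is sketched here sequentially (an illustration; a PRAM formulation assigns one process per comparison or per element):

def enumeration_sort(a):
    n = len(a)
    out = [None] * n
    for i in range(n):
        # rank = number of smaller elements, breaking ties by index for stability
        rank = sum(a[j] < a[i] or (a[j] == a[i] and j < i) for j in range(n))
        out[rank] = a[i]
    return out

print(enumeration_sort([5, 1, 4, 1, 9]))   # -> [1, 1, 4, 5, 9]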
9.27 Derive expressions for the speedup, efficiency, and isoefficiency function of the bucket sort parallel formulation presented in Section 9.5. Compare these expressions with the expressions for the other sorting algorithms presented in this chapter. Which parallel formulations perform better than bucket sort, and which perform worse?
9.28 Show that the splitter selection scheme described in Section 9.5 guarantees that the number of elements in each of the m buckets is less than 2n/m.
9.29 Derive expressions for the speedup, efficiency, and isoefficiency function of the sample sort parallel formulation presented in Section 9.5. Derive these metrics under each of the following conditions: (1) the p sub-blocks at each process are of equal size, and (2) the size of the p sub-blocks at each process can vary by a factor of log p.
9.30 In the sample sort algorithm presented in Section 9.5, all processes send p - 1 elements to process P0, which sorts the p(p - 1) elements and distributes splitters to all the processes. Modify the algorithm so that the processes sort the p(p - 1) elements in parallel using bitonic sort. How will you choose the splitters? Compute the parallel run time, speedup, and efficiency of your formulation.
9.31 How does the performance of radix sort (Section 9.6.2) depend on the value of r? Compute the value of r that minimizes the run time of the algorithm.
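To make the role of r concrete, here is a sequential sketch of radix sort with r-bit digits (an illustration of the standard scheme: b-bit keys are sorted in ⌈b/r⌉ stable counting passes over 2^r buckets, which is the trade-off at issue in Problem 9.31):

def radix_sort(keys, b=16, r=4):
    # ceil(b/r) passes; larger r means fewer passes but more buckets per pass.
    for shift in range(0, b, r):
        buckets = [[] for _ in range(1 << r)]
        for k in keys:
            buckets[(k >> shift) & ((1 << r) - 1)].append(k)
        keys = [k for bucket in buckets for k in bucket]   # stable gather
    return keys

print(radix_sort([503, 87, 9, 1024, 512, 2]))   # -> [2, 9, 87, 503, 512, 1024]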
9.32 Extend the radix sort algorithm presented in Section 9.6.2 to the case in which p processes (p < n) are used to sort n elements. Derive expressions for the speedup, efficiency, and isoefficiency function for this parallel formulation. Can you devise a better ranking mechanism?