9.7 Bibliographic Remarks

Knuth [Knu73] discusses sorting networks and their history. The question of whether a sorting network could sort n elements in time O(log n) remained open for a long time. In 1983, Ajtai, Komlós, and Szemerédi [AKS83] discovered a sorting network that could sort n elements in time O(log n) by using O(n log n) comparators. Unfortunately, the constant of their sorting network is quite large (many thousands), and thus the network is not practical. The bitonic sorting network was discovered by Batcher [Bat68], who also discovered the network for odd-even sort. These were the first networks capable of sorting n elements in time O(log² n). Stone [Sto71] maps the bitonic sort onto a perfect-shuffle interconnection network, sorting n elements by using n processes in time O(log² n). Siegel [Sie77] shows that bitonic sort can also be performed on the hypercube in time O(log² n). The block-based hypercube formulation of bitonic sort is discussed in Johnsson [Joh84] and Fox et al. [FJL+88]. Algorithm 9.1 is adopted from [FJL+88]. The shuffled row-major indexing formulation of bitonic sort on a mesh-connected computer is presented by Thompson and Kung [TK77]. They also show how the odd-even merge sort can be used with snakelike row-major indexing. Nassimi and Sahni [NS79] present a row-major indexed bitonic sort formulation for a mesh with the same performance as shuffled row-major indexing. An improved version of the mesh odd-even merge is proposed by Kumar and Hirschberg [KH83].

The compare-split operation can be implemented in many ways. Baudet and Stevenson [BS78] describe one way to perform this operation. An alternative way of performing a compare-split operation, based on a bitonic sort (Problem 9.1), that requires no additional memory was discovered by Hsiao and Menon [HM80].

The odd-even transposition sort is described by Knuth [Knu73]. Several early references to parallel sorting by odd-even transposition are given by Knuth [Knu73] and Kung [Kun80]. The block-based extension of the algorithm is due to Baudet and Stevenson [BS78]. Another variation of block-based odd-even transposition sort that uses bitonic merge-split is described by DeWitt, Friedland, Hsiao, and Menon [DFHM82]. Their algorithm uses p processes and runs in time O(n + n log(n/p)). In contrast to the algorithm of Baudet and Stevenson [BS78], which is faster but requires 4n/p storage locations in each process, the algorithm of DeWitt et al. requires only (n/p) + 1 storage locations to perform the compare-split operation.

The shellsort algorithm described in Section 9.3.2 is due to Fox et al. [FJL+88]. They show that, as n increases, the probability that the final odd-even transposition will exhibit worst-case performance (in other words, will require p phases) diminishes. A different shellsort algorithm based on the original sequential algorithm [She59] is described by Quinn [Qui88].

The sequential quicksort algorithm is due to Hoare [Hoa62]. Sedgewick [Sed78] provides a good reference on the details of the implementation and how they affect its performance. The random pivot-selection scheme is described and analyzed by Robin [Rob75]. The algorithm for sequence partitioning on a single process was suggested by Sedgewick [Sed78] and used in parallel formulations by Raskin [Ras78], Deminet [Dem82], and Quinn [Qui88]. The CRCW PRAM algorithm (Section 9.4.2) is due to Chlebus and Vrto [CV91].
Many other quicksort-based algorithms for PRAM and shared-address-space parallel computers have been developed that can sort n elements in time Θ(log n) by using Θ(n) processes. Martel and Gusfield [MG89] developed a quicksort algorithm for a CRCW PRAM that requires space O(n³) on the average. An algorithm suited to shared-address-space parallel computers with fetch-and-add capabilities was discovered by Heidelberger, Norton, and Robinson [HNR90]. Their algorithm runs in time Θ(log n) on the average and can be adapted for commercially available shared-address-space computers. The hypercube formulation of quicksort described in Problem 9.17 is due to Wagar [Wag87]. His hyperquicksort algorithm uses the median-based pivot-selection scheme and assumes that the elements in each process have the same distribution. His experimental results show that hyperquicksort is faster than bitonic sort on a hypercube. An alternate pivot-selection scheme (Problem 9.20) was implemented by Fox et al. [FJL+88]. This scheme significantly improves the performance of hyperquicksort when the elements are not evenly distributed in each process. Plaxton [Pla89] describes a quicksort algorithm on a p-process hypercube that sorts n elements in time O((n log n)/p + (n log^(3/2) p)/p + log³ p log(n/p)). This algorithm uses a parallel selection algorithm that runs in time O((n/p) log log p + log² p log(n/p)) to determine the perfect pivot selection. The mesh formulation of quicksort (Problem 9.24) is due to Singh, Kumar, Agha, and Tomlinson [SKAT91a]. They also describe a modification to the algorithm that reduces the complexity of each step by a factor of Θ(log p).

The sequential bucket sort algorithm was first proposed by Isaac and Singleton in 1956. Hirschberg [Hir78] proposed a bucket sort algorithm for the EREW PRAM model. This algorithm sorts n elements in the range [0 ... n - 1] in time Θ(log n) by using n processes. A side effect of this algorithm is that duplicate elements are eliminated. The algorithm requires space Θ(n²). Hirschberg [Hir78] generalizes this algorithm so that duplicate elements remain in the sorted array. The generalized algorithm sorts n elements in time Θ(k log n) by using n^(1+1/k) processes, where k is an arbitrary integer.

The sequential sample sort algorithm was discovered by Frazer and McKellar [FM70]. The parallel sample sort algorithm (Section 9.5) was discovered by Shi and Schaeffer [SS90]. Several parallel formulations of sample sort for different parallel architectures have been proposed. Abali, Ozguner, and Bataineh [AOB93] presented a splitter selection scheme that guarantees that the number of elements ending up in each bucket is n/p. Their algorithm requires time O((n log n)/p + p log² n), on average, to sort n elements on a p-process hypercube. Reif and Valiant [RV87] present a sample sort algorithm that sorts n elements on an n-process hypercube-connected computer in time O(log n) with high probability. Won and Sahni [WS88] and Seidel and George [SG88] present parallel formulations of a variation of sample sort called bin sort [FKO86].

Many other parallel sorting algorithms have been proposed. Various parallel sorting algorithms can be efficiently implemented on a PRAM model or on shared-address-space computers. Akl [Akl85], Borodin and Hopcroft [BH82], Shiloach and Vishkin [SV81], and Bitton, DeWitt, Hsiao, and Menon [BDHM84] provide a good survey of the subject. Valiant [Val75] proposed a sorting algorithm for a shared-address-space SIMD computer that sorts by merging. It sorts n elements in time O(log n log log n) by using n/2 processes. Reischuk [Rei81] was the first to develop an algorithm that sorted n elements in time Θ(log n) for an n-process PRAM. Cole [Col88] developed a parallel merge-sort algorithm that sorts n elements in time Θ(log n) on an EREW PRAM. Natvig [Nat90] has shown that the constants hidden behind the asymptotic notation are very large. In fact, the Θ(log² n) bitonic sort outperforms the Θ(log n) merge sort as long as n is smaller than 7.6 × 10²²! Plaxton [Pla89] has developed a hypercube sorting algorithm, called smoothsort, that runs asymptotically faster than any previously known algorithm for that architecture. Leighton [Lei85a] proposed a sorting algorithm, called columnsort, that consists of a sequence of sorts followed by elementary matrix operations. Columnsort is a generalization of Batcher's odd-even sort. Nigam and Sahni [NS93] presented an algorithm based on Leighton's columnsort for reconfigurable meshes with buses that sorts n elements on an n²-process mesh in time O(1).