Input/Output: Algorithms and Architectures

Lucian Popa

Abstract

Modern processors are improving their speed at a very fast rate. Memory design technologies try to keep pace with the rapid evolution of processors and, combined with caching techniques, they partly succeed. However, disk speed is not improving at the same rate. On the contrary, seek time, latency, and bandwidth are almost the same as ten years ago, and there is not much hope for the future. To prevent I/O from becoming a significant slowdown in computer systems, alternative solutions must be found. The paper presents an overview of software and architectural techniques that try to overcome the I/O problem: disk arrays (RAID), external-memory algorithms, interconnection networks, and parallel I/O, in their relation to uniprocessor and multiprocessor architectures.

1 Introduction

The users of computers are enjoying unprecedented growth in the speed of processors (CPUs). Amdahl related CPU speed and main memory size using the following rule: "Each CPU instruction per second requires one byte of main memory." This suggests that memory capacity should grow at the same rate as CPU speed, in order not to become a bottleneck. And this has indeed happened so far. The speed of main memory has also kept pace with the growth in CPU speed (caching is used in practically all computer systems of today). However, disk speed is not improving at the same rate. On the contrary, seek time, latency, and bandwidth are almost the same as ten years ago, and there is not much hope for significant improvements in the future. All of this suggests that the gap between processor performance and I/O performance is widening continuously. Moreover, there is a shift in the applications: faster microprocessors make it possible to attack new problems, which need faster access to larger datasets. Therefore, solutions for higher-performance I/O systems must be found: better architectures (parallel disks to increase the bandwidth, good interconnection networks between the disks and the processing unit(s)) and better algorithms (to reduce the number of I/O operations needed to solve problems). This paper presents an overview of some of the work that has been done in recent years in this area.

The paper is organized as follows. The next section briefly reviews what seems to be the main recent innovation in I/O architecture: RAID. Section 3 focuses on external sorting, one of the most important algorithmic components of many I/O-intensive applications, and describes a parallel disk model and an optimal sorting algorithm for this model, Greed Sort. Section 4 describes several other external-memory algorithms: list ranking, Euler tour, lowest common ancestors, connected components, minimum spanning tree, etc. It also points out that computing an arbitrary permutation has, in most typical cases, the same I/O complexity as sorting in external memory (as opposed to main memory), and therefore optimal algorithms for problems requiring arbitrary permutations can be designed by making use of optimal sorting algorithms. It also describes a technique for designing efficient external-memory algorithms by simulating PRAM algorithms. Section 5 discusses several issues in interconnecting multiple CPUs, reviewing several MIMD architectures (shared memory and mainly distributed memory - hypercubes). Efficient algorithms for arbitrary permutations and for sorting are described in detail for the hypercube architecture. An important application of these is in designing a connection network that can be reconfigured dynamically, in real time. These ideas seem to be applicable when disks as well as CPUs are interconnected, so that the problem of data placement can be supported efficiently. Moreover, they can provide support for adapting the algorithms described in the previous sections (where the model is a uniprocessor with multiple disks) to the case of parallel disks connected to a multiprocessor. The conclusions are left for the last section.

2 Parallel Disks

The central idea surveyed in this paper is how to use parallelism among multiple disks to improve aggregate I/O performance. In principle, a processing system with n disks attached should be able to transfer data to and from the disks at a bandwidth n times higher than in the case of a single disk (see Fig. 1).

Figure 1: Parallel disks.

However, in practice, there are many problems that need to be overcome in order to achieve this peak bandwidth. The first, and probably the most difficult among them, is how to distribute data across the disks so that all the disks are accessed evenly. While in general this depends on the specific application, in some cases a distribution of data in which all disks store equal portions of the data will ensure a uniform distribution of accesses. An important concept here is data striping: logically contiguous pieces of data, usually equal in size, are stored on each disk in turn, circularly. This brings several benefits:

- Large chunks of data can be transferred in a single I/O (by accessing in parallel all the disks on which the data is spread; ideally, all the disks are involved).

- Small, independent data accesses to different disks can be done simultaneously. This is ideal for a transaction processing system.

- Uniform load balancing for several important applications: transaction processing and supercomputer applications (where large transfers are usually needed). However, there are other applications where more sophisticated data placement algorithms are needed.

A second problem with parallel disks is vulnerability to failures. We will see in the next subsection how RAID solves this. Finally, a third problem comes from the fact that one key motivation for parallel disks is to increase data parallelism for high-performance computing (multiprocessors). And here we have several questions that are still open:

- What is a suitable multi-processor, multi-disk architecture?

- What interconnection networks are useful?

- Who controls the disks?

2.1 Redundant Arrays of Inexpensive Disks (RAID)

Disk arrays were first proposed in 1987 by Patterson, Gibson and Katz ([11]) as a way to use parallelism between multiple disks to improve aggregate I/O performance. Disk arrays stripe data across multiple disks and access them in parallel to achieve both higher data transfer rates on large data accesses and higher I/O rates on small data accesses. As we have already pointed out, data striping achieves uniform load balancing in many situations, by eliminating hot spots that would otherwise saturate a small number of disks while the majority of disks remain idle. Typically, the array of disks appears to the computer as a single logical disk.

Large disk arrays, however, are highly vulnerable to disk failures; a disk array with a hundred disks is a hundred times more likely to fail than a single disk. That is, the Mean Time Between Failures (MTBF) of the array will be equal to the MTBF of an individual disk divided by the number of disks in the array. Because of this, the MTBF of an array of disks would be much too low for many applications. The obvious solution is to employ redundancy in the form of error-correcting codes to tolerate disk failures. This typically allows a redundant disk array to have a much higher MTBF than that of a single disk. Redundancy, however, has negative consequences: all write operations must update the redundant information, therefore the performance of writes in redundant disk arrays can be significantly worse than the performance of writes in non-redundant disk arrays.

Redundant disk arrays thus employ two orthogonal concepts: data striping for improved performance and redundancy for improved reliability. Striping improves aggregate I/O performance by allowing multiple small I/Os and large single I/Os to be serviced in parallel. Disk-array organizations differ in two basic aspects:

- The granularity of data interleaving: fine-grained or coarse-grained. Fine-grained disk arrays conceptually interleave data in relatively small units (bits, for example), so that all I/O requests, regardless of their size, access all of the disks in the array. This results in very high data transfer rates for all I/O requests, but has the disadvantage that only one I/O request can be in service at any given time and all disks must spend time positioning for every request. Coarse-grained disk arrays interleave data in relatively large units, so that small I/O requests need access only a small number of disks while large requests can access all the disks in the array. This allows multiple small requests to be serviced simultaneously while still allowing large requests to see a high transfer rate.

- The method and pattern in which the redundant information is computed and distributed across the disk array. The method for computing the redundant information is, in most cases, parity (with a few exceptions that use Hamming or Reed-Solomon codes). There are two basic schemes for distributing the redundant information: the first one concentrates the redundant information on a small number of disks, while the second one distributes it uniformly across all of the disks. The second one is in general preferred because it avoids hot spots and other load balancing problems.

2.2 Basic RAID Organizations

There are several ways in which data striping and redundancy can be combined in disk arrays, and selecting between these schemes involves complex tradeoffs between reliability, performance, and cost. The most common schemes are usually known as RAID levels.

2.2.1 Non-Redundant (RAID Level 0)

The data is split across all the disks, but there is no redundancy involved at all (therefore this level has the lowest cost). It offers the best write performance, since it never needs to update redundant information. However, it does not have the best read performance: redundancy schemes that duplicate data, such as mirroring, can perform better on reads by scheduling requests on the disk with the shortest expected seek and rotational delays. Reliability is the main problem: any single disk failure results in data loss. Non-redundant arrays are widely used in supercomputing environments, where performance and capacity, rather than reliability, are the primary concerns.


2.2.2 Mirrored (RAID Level 1)

This level provides redundancy by writing all data to two disks (therefore there are always two copies of the information). When data is read, it can be retrieved from the disk with the shorter queueing, seek, and rotational delays. Reads are faster but writes are slower when compared to a single disk. If a disk fails, the other copy is used to service requests. Mirroring has a high cost, since it uses twice as many disks as a non-redundant disk array. It is frequently used in database applications where availability and transaction rate are more important than storage efficiency.

2.2.3 Memory-Style ECC (RAID Level 2)

In a way similar to how memory systems provide recovery from failed components, this RAID level typically uses Hamming error-correcting codes. In one version of this scheme, four data disks require three redundant disks. In general, the number of redundant disks is proportional to the logarithm of the total number of disks in the system (therefore storage efficiency increases as the number of data disks increases). The essential thing to note here is that multiple redundant disks are needed to identify the failed disk, but only one disk is enough to recover the lost information. Unlike memory components, disk controllers can easily identify which disk has failed. Thus, one can use a single parity disk rather than a set of redundant disks (see the next level).

2.2.4 Bit-Interleaved Parity (RAID Level 3)

Data is interleaved bit-wise over the data disks, and a single parity disk is added to tolerate any single disk failure. Each read request accesses all data disks, and each write request accesses all data disks and the parity disk. Thus, only one request can be serviced at a time. Because the parity disk contains only parity and no data, it cannot participate in reads, resulting in slightly lower read performance than for schemes that distribute both parity and data over all disks. Bit-interleaved parity arrays are frequently used in applications that require high bandwidth but not high I/O rates.

2.2.5 Block-Interleaved Parity (RAID Level 4)

This level is similar to the previous one except that data is interleaved across disks in blocks of arbitrary size rather than in bits. The size of these blocks is called the striping unit. Read requests smaller than the striping unit access only a single data disk. Write requests must, in addition to updating the requested data blocks, compute and update the parity block. For large writes that touch blocks on all disks, parity is easily computed by exclusive-or-ing the new data for each disk (one access to all disks is needed). For small writes that update only one data disk, the new parity is computed in the following way:

new parity = (old data xor new data) xor old parity

Therefore, a small write to a single disk need not access all the disks. It needs only two disks: the data disk to which the write is addressed and the parity disk. However, it needs four accesses: two to read the old data and the old parity, one to write the new data, and one to write the new parity. This is referred to as the read-modify-write procedure. Because a block-interleaved parity array has only one parity disk, which must be updated on all write operations, the parity disk can easily become a bottleneck. Hence, the next RAID level (which distributes the parity) is always preferred to this level.
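To make the read-modify-write procedure concrete, here is a minimal sketch (not from the paper; disks are modeled simply as in-memory byte arrays) of the small-write parity update:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def small_write(data_disk: bytearray, parity_disk: bytearray,
                offset: int, new_data: bytes) -> None:
    """Read-modify-write update of one data block and the parity block:
    four accesses in total (read old data, read old parity,
    write new data, write new parity)."""
    end = offset + len(new_data)
    old_data = bytes(data_disk[offset:end])      # access 1: read old data
    old_parity = bytes(parity_disk[offset:end])  # access 2: read old parity
    # new parity = (old data xor new data) xor old parity
    new_parity = xor_blocks(xor_blocks(old_data, new_data), old_parity)
    data_disk[offset:end] = new_data             # access 3: write new data
    parity_disk[offset:end] = new_parity         # access 4: write new parity
```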

2.2.6 Block-Interleaved Distributed-Parity (RAID Level 5)

The block-interleaved distributed-parity disk array eliminates the parity disk bottleneck present in the previous level by distributing the parity uniformly over all of the disks. An additional advantage of distributing the parity is that it also distributes the data over all of the disks rather than over all but one. This allows all disks to participate in servicing read operations, in contrast to redundancy schemes with dedicated parity disks, in which the parity disk cannot participate in servicing read requests. Block-interleaved distributed-parity arrays have the best small-read, large-read, and large-write performance of any redundant disk array. Small write requests, however, are somewhat inefficient compared with redundancy schemes such as mirroring, due to the need to perform read-modify-write operations to update parity. RAID Level 5 appears very attractive for both supercomputer applications (with large transfers) and transaction processing (many small, independent I/O requests).
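As an illustration of distributing the parity, a minimal sketch of one possible rotated parity placement; this particular layout is an assumption for illustration, not taken from the paper:

```python
def raid5_layout(stripe: int, n_disks: int):
    """Return (parity_disk, data_disks) for one stripe under a simple
    rotated (left-symmetric-style) parity placement: the parity block
    moves to a different disk on every stripe."""
    parity_disk = (n_disks - 1 - stripe) % n_disks
    data_disks = [d for d in range(n_disks) if d != parity_disk]
    return parity_disk, data_disks

# Example: with 5 disks, parity rotates over disks 4, 3, 2, 1, 0, 4, ...
for s in range(6):
    p, data = raid5_layout(s, 5)
    print(f"stripe {s}: parity on disk {p}, data on disks {data}")
```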

2.3 Comments on RAID

Main points:

- Reliability: all of the RAID levels, with the exception of the first one (Level 0), have very good reliability, which makes them attractive especially for database applications.

- High data rate (large transfers) and high I/O rate (transaction processing), i.e., high throughput. RAID Level 5 is very attractive for both transaction processing systems (which need a high I/O rate) and applications which need a high data rate (supercomputer applications). RAID Level 1 (mirrored disks) also offers very good performance in terms of both throughput and reliability, but only if it is not too expensive to devote half of the disks to backup copies.

- Low cost (usable storage capacity, inexpensive PC-class disks). Compared to traditional single large disks, RAID offers significant advantages for the same cost.

Questions:

- Data placement (in order to reduce contention and to distribute accesses to data evenly). An interesting issue here is whether it is appropriate to use RAID as a single logical disk (data placement is built-in and transparent to the application program) or to use a "parallel" RAID, in which the application program does its own data placement. It seems that there are many situations which require that data placement be done by the software components working on top of the RAID implementation (if a RAID is used). We will see one such example in the next section: in Greed Sort, the optimal sorting algorithm in the presence of parallel disks, it is essential that each disk can be accessed as a separate, independent entity.

- Interconnection of RAID to the main processing system (I/O bus, I/O network?). A related issue here is the fact that in a "classic" RAID a single, centralized controller is used to control the array, and this can easily become a vulnerable point in the system. We will look in Section 5 at a parallel RAID architecture where this problem disappears (see also [2]).

- CPU parallelism vs. I/O parallelism (data parallelism, SMP, MPP). How can a multiprocessor be connected to a (parallel) RAID? What is the best architecture (shared-nothing, shared-disk)? What would be a good cost model that considers the I/O cost, the CPU cost and the interconnection network cost?

3 Sorting on Parallel Disks

The architectural solution that RAID offers to the I/O problem needs to be complemented with software solutions in order to be useful in real systems. One important component of information processing, widely used in database algorithms (join, aggregation, duplicate removal, etc.) and in other important algorithms (see the next section), is sorting. Of particular importance is external sorting, in which the records to be sorted are too numerous to fit in the processor's main memory and instead must be stored on disk. The bottleneck in external sorting is usually the time needed for the I/O operations. How this bottleneck can be reduced in the presence of parallel disk architectures by using good external sorting algorithms is the subject of this section. Having many disks that can be accessed in parallel can greatly increase the bandwidth to the I/O system. The challenge is therefore to take advantage of this increased bandwidth by making certain that the items to be read and written during I/Os are evenly distributed over the disks. Let us first review the most common external sorting algorithm for one disk.

External sorting for one disk. All sorting algorithms actually used in database systems use merging ([5]), i.e., the input data are written into initial sorted runs and then merged into larger and larger runs until only one run is left, the sorted output. If M is the number of records that can fit in main memory, and the number of records to be sorted is N, then approximately N/M initial runs can be created by reading M records at a time, sorting them in main memory, and writing them to the disk. Next, an R-way merge sort will sort the entire file in ⌈log_R(N/M)⌉ merge passes. The merge factor R can be determined in the following way. Assume that the block size is B records. Then BR memory space can be used for input buffers (for reading R runs at a time), and B memory space can be used as the output buffer. The remaining M − BR − B space can be used to organize a priority heap to perform the merge. A priority queue needs approximately R space to hold R records. Therefore R = M − BR − B, which implies R = ⌊(M − B)/(B + 1)⌋. At each merge pass all the records are read from the disk and then written back to disk, that is, 2N/B I/Os are needed. Therefore, the total number of I/Os is approximately 2(N/B)⌈log_{M/B}(N/M)⌉, or Θ((N/B) · log(N/M)/log(M/B)), if we assume that M ≫ B.
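A small sketch that evaluates these formulas for the merge factor, the number of merge passes, and the resulting I/O count; the numeric parameters below are hypothetical, not from the paper:

```python
import math

def external_merge_sort_cost(N: int, M: int, B: int):
    """Single-disk external merge sort: merge factor R, number of merge
    passes, and approximate number of I/Os spent in the merge passes
    (2N/B I/Os per pass; initial run formation is not counted here)."""
    R = (M - B) // (B + 1)        # R input buffers of B records + output buffer + R-entry heap
    runs = math.ceil(N / M)       # number of initial sorted runs
    passes = math.ceil(math.log(runs, R)) if runs > 1 else 0
    ios = 2 * (N // B) * passes
    return R, passes, ios

# Hypothetical sizes: 10^9 records, 10^7 records of memory, blocks of 10^3 records.
R, passes, ios = external_merge_sort_cost(10**9, 10**7, 10**3)
print(R, passes, ios)   # R is roughly M/B when M >> B; here a single merge pass suffices
```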

External sorting on parallel disks. In [10], Nodine and Vitter give an optimal external sorting algorithm for a parallel disk model (see Fig. 2). In this model, in a single (parallel) I/O operation, each of the D disks can simultaneously transfer one block of B records to or from main memory. Thus, D blocks can be transferred per I/O, but only if no two of the blocks reside on the same disk.

Figure 2: The parallel disk model.

The parameters of the sorting problem, besides B and D, are M, the number of records that can fit in main memory, and N, the number of records to be sorted. It is assumed that M < N (i.e., the problem is too large to fit in memory) and that 1 ≤ DB ≤ ⌊(M − M^β)/2⌋ for a fixed β < 1 (i.e., the total number of records that can be transferred in a single parallel I/O, DB, cannot exceed about half of the memory size). The measure of performance is the number of parallel I/Os (the CPU time is ignored).

Lower Bound. Aggarwal and Vitter, in [1], considered a simplified parallel block transfer model for sorting, in which D physical blocks, each consisting of B contiguous records, can (always) be transferred simultaneously in a single I/O into a main memory capable of holding M records. There is only one logical disk, capable of always providing a bandwidth of DB records per I/O. That is, issues like data distribution and access pattern, which are crucial for multiple disks, are hidden in their model. They proved that the lower bound on the number of I/Os needed to sort N records under their model is:

Ω( (N/(DB)) · log(N/B) / log(M/B) ).

Now, the parallel disk model of Nodine and Vitter is a generalization of Aggarwal and Vitter's model, in the sense that it can provide the peak bandwidth of DB records per I/O only under some special organization and distribution of data across the disks. Therefore, any sorting algorithm for this model cannot perform better than an optimal algorithm for the simplified model. Hence, the above lower bound still holds. Nodine and Vitter proved that it can be achieved. But before looking at the optimal algorithm, let us take a look at the straightforward adaptation of the previously described algorithm to the case of multiple disks. We will see that this adaptation is not optimal, and we shall briefly investigate where the difficulty comes from when trying to find an optimal algorithm. We will refer to this adaptation as Merge Sort whenever we compare it to the optimal algorithm. In order to achieve maximum bandwidth, records (and runs) are striped circularly across the D disks. The input buffers and the output buffer now have size DB. The priority queue has the same size as before, i.e., R. It follows that the merge factor R is given by R = ⌊(M − DB)/(DB + 1)⌋ (note that in order to have R ≥ 2 we must have DB ≤ M/3). If we assume M ≫ DB, then R ≈ M/DB. It follows that the total number of I/Os needed is Θ((N/DB) · log(N/M)/log(M/DB)), which is less than optimal. This is mainly because the merge factor R depends on the number of disks D. If we can change the algorithm in such a way that the merge factor is increased to about M/B, then the resulting algorithm will be optimal. The difficulty is then how to increase the number of runs that can be merged in a single step (which also increases the number of input buffers) and still keep enough information in main memory to perform the merge. So, the key is better memory usage.

Data Striping. The method of storing data over the disks is data striping. This means that the first block of records in a file appears on block 1 of disk 1, the second block on block 1 of disk 2, and so on, until block D appears on block 1 of disk D. Then the file cycles back to block 2 of disk 1. In general, the i-th block of the file appears on block ⌊(i − 1)/D⌋ + 1 of disk ((i − 1) mod D) + 1.
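A minimal sketch of this mapping (1-based indices, as in the text):

```python
def stripe_location(i: int, D: int):
    """Disk and block (both 1-based) on which the i-th block of a striped file
    resides: disk ((i-1) mod D) + 1, block floor((i-1)/D) + 1."""
    disk = ((i - 1) % D) + 1
    block = ((i - 1) // D) + 1
    return disk, block

# With D = 3 disks, file blocks 1..7 map to:
# (1, 1) (2, 1) (3, 1) (1, 2) (2, 2) (3, 2) (1, 3)
print([stripe_location(i, 3) for i in range(1, 8)])
```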

Greed Sort. The algorithm is essentially a merge sort. However, there is a substantial difference in the way the merge is done. At each merge phase, R runs, which are striped across all the D disks, are approximately merged on each disk individually. That is, on each disk, the slices of the R runs that reside on that disk are merged into a single list for that disk. This list is not sorted, and therefore the concatenation of all the lists for all disks is not sorted. However, the concatenation has a nice locality property: if an element x is smaller than another element y but appears after y in the list, then the position of x exceeds the position of y by at most a fixed amount L. Then, by performing an additional phase, in which overlapping sequences of the list are successively sorted, the approximately sorted list can be exactly sorted. An informal description of the algorithm is as follows:

1. Start by creating N/M initial runs of size M. These runs are created by repeatedly reading in M records (from all D disks in parallel), sorting them in main memory, and writing them back to the disks (striping them across all the disks).

2. R = √(M/B)/2 input runs are merged at a time to form larger runs, which will be used as input runs for the next pass. This is repeated until there is only one sorted run. One merge has two steps, as we said above:

   - an approximate merge (consisting of individual merges which are done in parallel for each disk), and

   - a global sort (inter-disk) producing the sorted run.

Let us consider the example in Fig. 3(a), with R = 3, D = 2, and B = 2, to explain the algorithm in more detail. As can be seen, R1, R2, and R3 are sorted (and striped circularly over the disks D1 and D2). The individual merge of the R runs, on each disk, is done in the following way. At each step, the first blocks of the runs for the particular disk are considered (they form the set of candidate blocks for that disk). Among these candidate blocks, two blocks are of particular interest (the best blocks): the block with the smallest minimum, and the block with the smallest maximum (see Fig. 3(b)). These two blocks are read into memory (in parallel for all disks) and merged. The smallest B records are written to an output list for the disk, while the largest B records are written back at the front of the run from which the smallest minimum was taken. In the case that the block with the smallest minimum and the block with the smallest maximum coincide, the single best block is written to the output list. In both cases, there is one run which decreases in size by one block, and all the runs remain sorted. The next two configurations for our example are shown in Fig. 4. Continuing in this way, an output list is produced for each disk (the first step of the merge).

Recall that in Merge Sort R blocks, one for each run, are read into memory and merged to produce a larger run; a priority queue in memory is then used to select the current smallest element among all runs. In Greed Sort, only two blocks are read and merged at a time: the block with the smallest minimum and the block with the smallest maximum. This is obviously not enough to produce a sorted run. While the smallest minimum is indeed the smallest current element among all runs, and it will be produced first in the output list, the other elements will not necessarily be produced in sorted order. However, choosing the block with the smallest maximum as the other best block limits the "distance" between the correct position of an element in sorted order and the actual position of that element in the output list. For example, assume that the two best blocks for a disk are b_m (the block with the smallest minimum) and b_M (the block with the smallest maximum). Let x_1 be the minimum element of b_m and x_M the maximum element of b_M. Then, when b_m and b_M are merged, the smallest half of the elements is written to the output list, with x_1 in the first position. For any element x ≠ x_1 written to the output, the number of elements y < x that do not occur before x in the output sequence cannot be larger than RB (x itself is at most x_M, therefore y < x_M; but there are at most RB elements smaller than x_M in all the runs). To make things more precise, we need a definition.

Definition 3.1. A sequence is called L-regressive if, for any two elements x < y, y does not precede x by more than L positions in the sequence.

It can be shown that the (global) output list for all disks is RDB-regressive (after the approximate merge step). Moreover, we have the following two lemmas:

Lemma 3.2. If a list is L-regressive, then every element is at most L positions away from its correct sorted location.

Lemma 3.3. If every element in a list is within a distance L of its sorted location, then a series of sorts of size 2L, beginning at every L-th location, suffices to complete the sort.
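To see why Lemma 3.3 works, here is a minimal in-memory sketch (not the external-memory version used by Greed Sort): sorting overlapping windows of size 2L that start at every L-th position turns a list whose elements are each within L of their sorted location into a fully sorted one.

```python
def fix_regressive(seq, L):
    """Sort overlapping windows of size 2L, starting at positions 0, L, 2L, ...
    If every element is within L positions of its sorted location,
    the result is fully sorted (Lemma 3.3)."""
    a = list(seq)
    for start in range(0, len(a), L):
        a[start:start + 2 * L] = sorted(a[start:start + 2 * L])
    return a

# Example with L = 3: every element is at most 3 positions from its sorted place.
print(fix_regressive([2, 1, 4, 3, 6, 5, 8, 7, 9], 3))   # fully sorted output
```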

Now it is clear how the (global) approximately sorted output list can be fixed in order to produce a sorted run (this is the second step of the merge). The only remaining problem is to use a sorting algorithm whose cost for this second step does not exceed the complexity of the first step of the merge phase. To compute the number of parallel I/Os needed to complete the first step of the merge phase, we need to take a closer look at the data structures used to implement it.

Data structures for Greed Sort. We only need to examine the first unprocessed block of each run on a given disk to find the block with the smallest minimum and the block with the smallest maximum on that disk. We can keep two priority queues for each disk: one using the blocks' largest values as keys, and the other using the blocks' smallest values as keys. We only need to keep one element per block in each priority queue (the smallest element and the largest element, respectively). Therefore the space needed for each priority queue is O(R). Since there are D disks, the total space for the priority queues is O(DR). Only two blocks (the best blocks) are merged at a time for each disk, therefore the space needed for input buffers is O(DB). We need to verify that the amount of space needed for these data structures does not exceed the size of primary memory, M.
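For concreteness, a minimal in-memory sketch (one disk, hypothetical data, not taken from the paper) of the approximate-merge step described above; the two priority queues of the real algorithm are replaced here by direct scans over the candidate blocks:

```python
from collections import deque

def approximate_merge_one_disk(runs, B):
    """One disk's approximate merge in Greed Sort.
    runs: list of deques holding the (sorted) slices of the R runs on this disk.
    Repeatedly merge the block with the smallest minimum and the block with the
    smallest maximum; the smaller half goes to the output list, the larger half
    goes back to the front of the run that supplied the smallest minimum."""
    output = []
    while any(runs):
        candidates = [r for r in runs if r]
        bm_run = min(candidates, key=lambda r: r[0])                    # smallest minimum
        bM_run = min(candidates, key=lambda r: r[min(B, len(r)) - 1])   # smallest maximum
        bm = [bm_run.popleft() for _ in range(min(B, len(bm_run)))]
        if bM_run is bm_run:                     # the two best blocks coincide
            output.extend(bm)
            continue
        bM = [bM_run.popleft() for _ in range(min(B, len(bM_run)))]
        merged = sorted(bm + bM)
        output.extend(merged[:B])                # smallest B records go to the output list
        for x in reversed(merged[B:]):           # largest records go back to the front of
            bm_run.appendleft(x)                 # the run with the smallest minimum
    return output

# Hypothetical slices of R = 3 runs on one disk, block size B = 2.
runs = [deque([1, 8, 9, 16]), deque([2, 3, 10, 11]), deque([4, 5, 6, 7])]
print(approximate_merge_one_disk(runs, 2))   # approximately sorted output list
```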

Figure 3: Greed Sort with R = 3, D = 2, and B = 2. (a) Three input runs R1, R2, and R3. (b) The candidate blocks and the best blocks for each disk.

Figure 4: The next two merge configurations. The output lists are shown immediately to the right of the disks.


The number of runs that we need to merge in order to obtain optimal performance is √(M/B)/2. Therefore, the priority queues need D√(M/B) space for the keys. At first glance, it seems that if D = Θ(M) and B = 1, then Θ(M^{3/2}) storage space would be required in primary memory, which is clearly impossible. However, a partial disk striping method can be used throughout the course of the algorithm, while giving up only a constant factor in performance. Assume that D = D(M) grows faster than M^ε, for some fixed 0 < ε < 1/2. We can cluster the D disks into D' = M^ε clusters of D/D' disks each, synchronized together. Each of the D' clusters acts like a logical disk with block size B' = BD/D'. Then, the number of primary storage locations needed for the keys is at most

D'√(M/B') = M^ε √(M/B') = O(M^{ε+1/2}).

The expression for the number of I/Os for the entire algorithm remains the same, up to a constant factor k:

k · (N/(D'B')) · log(N/B')/log(M/B') = O( (N/(DB)) · log(N/B)/log(M/B) ).

Take β = ε + 1/2. The amount of memory needed for buffers is 2DB, so the total amount of memory used is at most 2DB + M^β ≤ M, since DB ≤ ⌊(M − M^β)/2⌋.

Complexity of Greed Sort. First, we examine how many times a block needs to be read into primary memory. At most twice is necessary for the actual merge. One more read is needed for maintaining the priority queues (each time an element is replaced in a priority queue, a new block must be read from disk to find its smallest and largest values). Each record is written back to disk at most twice. It is easy to see now that in this step all N records are read from and written to the disks no more than a constant number of times (three). Therefore, the number of parallel I/Os for the first step of the merge is O(N/DB).

The algorithm used to fix each approximate output list, in the second step of the merge, is Column Sort (a detailed description is given in [8]). The justification for choosing Column Sort is that it can sort N ≤ D√(MB) records with O(N/DB) parallel I/Os. Briefly, Column Sort works as follows. We have to sort a list of 2L records, where L = RDB is the regressive factor discussed before. This list is striped circularly across the D disks, one block of B records per disk. During the execution of Column Sort, the list is viewed as a matrix with r = M rows and c = N/M columns. This ensures that a column of the matrix can fit in main memory (each column is sorted repeatedly in main memory during the course of the algorithm). There are eight steps in total. The odd-numbered steps are all the same: sort all the records in each column. The complexity of such a step is clearly O(N/DB) I/Os (each column is read into main memory, sorted, and then written back to disk). Steps 2 and 4 are essentially a transpose operation; it was shown in [12] how to perform this transpose in O(N/DB) I/Os. Finally, steps 6 and 8 consist of a shift by r/2 of all the records in the matrix, in column-major order. This can be accomplished mainly by reading DB records at a time into main memory and writing in their positions the previously saved DB records, which takes O(N/DB) I/Os. Therefore, the number of I/Os needed by each Column Sort step is O(N/DB). For the correctness of Column Sort see [8].

At each merge step, sorts of size 2L are performed against the approximately merged list for R runs. Assume that the list has k records (k increases after each merge phase). We need to perform O(k/L) sorts, each of size 2L and therefore taking O(L/DB) I/Os, so the number of I/Os needed to fix the entire list is O(k/DB). This is for merging R runs into a larger run. In general, however, more than one merge has to be performed in a merge phase (there are more than R runs, except for the last merge phase; by a merge phase we mean the processing of an entire level of the R-ary tree describing the computation). Roughly speaking, there are O(N/k) merge operations in one phase, each involving a list of k records. Thus, the total number of I/Os in one merge phase is O(N/DB). The overall complexity of Greed Sort is then O((N/DB) log_R(N/M)) = O((N/DB) · log(N/M)/log(M/B)) = O((N/DB) · log(N/B)/log(M/B)).

When there is only one disk, i.e., D = 1, Greed Sort has the same asymptotic complexity as Merge Sort. However, the constant factor for Greed Sort is much larger than that for Merge Sort. Recall that the merge factor in Merge Sort is R ≈ M/B, which is larger than the optimal merge factor in Greed Sort, R = √(M/B)/2. Thus, for a single disk, Merge Sort should be preferred over Greed Sort.

An extension of the model for multiprocessors. The Greed Sort algorithm also applies to a model in which each of the D disks is controlled by a separate CPU with an internal memory capable of storing M/D records. The D CPUs are connected by a network that allows some basic operations, including sorting of M records in the internal memories. Again, the bottleneck in this model is expected to be the I/O, not the network interconnecting the CPUs. This architecture is commonly called a shared-nothing architecture, and it has been studied heavily by the parallel database research community. Greed Sort can be immediately adapted to work on this model, with the same number of I/Os. However, if the network cost is not negligible, a suitable cost model must be introduced in order to analyze the complexity of the algorithm in terms of network communication; CPU processing cost might also be taken into consideration for some applications. As we have seen, a merge step in Greed Sort consists of two steps: the first one, producing the approximately merged list, works independently for each disk (and therefore for each CPU, and does not need any network access), while the second one, the sorting of the approximately merged list, has to be done globally. This last step involves sorting columns of M records (by the Column Sort routine), and since the memory is now distributed, we need to worry about good algorithms and suitable interconnection networks for sorting in distributed-memory multiprocessors. This will be the focus of Section 5.

4 External-Memory Algorithms

In [4], a collection of techniques is shown to produce optimal external-memory algorithms for an impressive number of computational problems. Most of these algorithms use sorting as a subroutine, and their I/O complexity is a function (linear, in most cases) of the number of I/Os needed to perform sorting. If the sorting algorithm used is optimal, then all these algorithms are optimal too. And we have seen in the previous section an algorithm for external sorting with parallel disks that meets the lower bound given by Aggarwal and Vitter in [1], and therefore is optimal. Without describing in detail all of the algorithms discussed in [4] (they are described only very briefly there, too!), it is worthwhile to point out the important facts that make external-memory algorithms different from their internal-memory counterparts, and some of the techniques that are used to produce optimal external-memory algorithms.

4.1 The Computation Model

The model of [4] is quite similar to the parallel disk model used in [10] by Nodine and Vitter. We have the same parameters: N (the total number of records), M (the size of memory measured in records), B (the size of a block in records), and D (the number of disks in the system). It is assumed that M < N and that 1 ≤ DB ≤ M/2. As in [10], an I/O is defined to be the process of simultaneously reading or writing D blocks of data, one to or from each of the D disks. The total amount of data transferred in an I/O is thus DB records. Again, as in [10], whenever data is stored in sorted order, it is striped across all the disks. Two primitives are shown to be fundamental for many computational problems in external memory: scanning and sorting. The I/O complexities of these two primitives are proportional to

scan(x) = x/(DB),

i.e., reading all of the input data (x items) takes at least x/(DB) I/Os, since we can read at most DB items in a single I/O, and

sort(x) = (x/(DB)) · log_{M/B}(x/B),

which is the minimal number of I/Os, shown by Aggarwal and Vitter, needed to sort x items (and also the number of I/Os needed by Greed Sort, as we have just seen).
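A small sketch that evaluates the two primitives for concrete (hypothetical) parameter values:

```python
import math

def scan_ios(x: int, D: int, B: int) -> float:
    """scan(x) = x / (D*B): I/Os to read x items with D disks and blocks of B records."""
    return x / (D * B)

def sort_ios(x: int, D: int, B: int, M: int) -> float:
    """sort(x) = (x / (D*B)) * log_{M/B}(x / B): the I/O bound for sorting x items."""
    return (x / (D * B)) * (math.log(x / B) / math.log(M / B))

# Hypothetical parameters: N = 10^10 records, M = 10^8, B = 10^4, D = 16.
N, M, B, D = 10**10, 10**8, 10**4, 16
print(scan_ios(N, D, B), sort_ios(N, D, B, M))
```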


4.2 Lower Bounds: Linear Time vs. Permutation Time

The first interesting idea given in [4] is a way to establish lower bounds. In many cases it is useful to look at the complexity of problems in terms of the number of permutations that need to be performed in order to solve them. In an ordinary RAM, any permutation of N items can be produced in linear time, i.e., O(N). On an N-processor PRAM, it can be done in constant time (see also the next section). In both cases, the work complexity is O(N), which is no more than it takes to examine all the input. This is no longer the case in external memory: it is not generally possible to perform arbitrary permutations in a linear number, O(scan(N)), of I/Os. (For example, if we try the naive way of scanning each block of items and then writing each item of the input block to its new position on disk, in some output block, this output block might need to be updated again later, since items from other input blocks may need to be written there as well.) The following lower bound for performing arbitrary permutations in external memory was also shown by Aggarwal and Vitter in [1]:

perm(N) = Θ( min{ N/D, sort(N) } ).

When M or B is extremely small, the N/D term (which equals B · scan(N)) may be smaller than sort(N). In the case where B and D are constants, permutations can indeed be performed in linear time. However, in typical I/O systems the sort(N) term is smaller than the N/D term. For example, consider B = 10^4 and M = 10^8. Then sort(N) < N/D as long as N < 10^{40,004}, which is incredibly large. Now, as shown in [4], there are many computational problems that need Ω(perm(N)) I/Os (i.e., perm(N) is a lower bound) for which upper bounds of O(sort(N)) can be shown. Together with the above lower bound for permutations, this shows that these upper bounds also give optimal algorithms for these problems whenever perm(N) = Θ(sort(N)).
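A small check of when the sort(N) term beats the N/D term, using the example parameters above; the comparison is done in logarithms to avoid astronomically large numbers (sort(N) < N/D reduces to log(N/B)/log(M/B) < B, i.e., N < B · (M/B)^B):

```python
import math

def sort_beats_ND(log10_N: float, M: float, B: float) -> bool:
    """True when sort(N) < N/D; D cancels out, and the condition reduces to
    log10(N) < log10(B) + B * log10(M/B)."""
    return log10_N < math.log10(B) + B * math.log10(M / B)

B, M = 1e4, 1e8
print(math.log10(B) + B * math.log10(M / B))   # crossover at log10(N) = 40004
print(sort_beats_ND(12, M, B))                 # N = 10^12: sort(N) is the smaller term
print(sort_beats_ND(50000, M, B))              # absurdly large N: N/D is the smaller term
```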

To summarize, to design optimal algorithms for problems requiring arbitrary permutations it is useful (in many cases) to use sorting as a subroutine. Note that this does not work in the ordinary RAM model, where sorting takes Ω(n log n), while many problems requiring arbitrary permutations can be solved in linear time.

4.3 PRAM Simulation

The second interesting idea in [4] is a technique for designing I/O-efficient algorithms based on the simulation of parallel algorithms. More interestingly, I/O-optimal algorithms can be generated by simulating, in external memory, PRAM algorithms that are not work-optimal. The PRAM algorithms simulated typically have geometrically decreasing numbers of active processors and very small constant factors in their running times. This makes them ideal for I/O simulation, since inactive processors do not need to be simulated, and therefore optimal and practical I/O algorithms can be obtained. The following theorems are important.

Theorem 4.1. A PRAM algorithm that uses N processors and O(N) space and runs in (parallel) time T can be simulated in external memory in O(T · sort(N)) I/O operations.

Theorem 4.2. A PRAM algorithm that solves a problem of size N by using N processors and O(N) space, and in which after each stage of time T both the number of processors and the number of memory cells that will ever be used again are reduced by a constant factor, can be simulated in external memory in O(T · sort(N)) I/O operations.

The proofs are simple. First, a lemma is needed.

Lemma 4.3. Let A be a PRAM algorithm that uses N processors and O(N) space. Then a single step of A can be simulated in O(sort(N)) I/Os.

Proof Sketch. To simulate the PRAM memory, we keep an array of size O(N) on the disk(s), occupying O(scan(N)) blocks. In a single step, each PRAM processor reads O(1) operands from memory, performs some computation, and then writes O(1) results to memory. To provide the operands for the simulation, we sort a copy of the contents of the PRAM memory based on the indices of the processors for which they will be operands in this step. This allows us to read the operands for all the processors in a single scanning step (if we do not do the sorting, a block may need to be read more than once!). Then we perform the computation for each processor being simulated and write the results to the disk(s). Finally, we sort the results of the computation based on the memory addresses to which the PRAM processors would store them, and then scan this list and the old copy of memory and merge them to produce the new copy of memory. The whole process uses O(1) scans and O(1) sorts, and thus takes O(sort(N)) I/Os.

The first theorem follows immediately from the lemma, since to simulate an entire PRAM algorithm we simply need to simulate all T steps. For the second theorem, we use the lemma again. The first stage consists of T steps, therefore it can be simulated in O(T · sort(N)) I/Os. Thus, the total number of I/Os is given by the recurrence

I(N) = O(T · sort(N)) + I(cN), for some constant c < 1,

and it is easy to see that the solution is O(T · sort(N)).

4.4 List Ranking

We exemplify with the list ranking algorithm in external memory. First, a lower bound of Ω(perm(N)) can be shown: in [4] it is proven that the proximate neighbors problem needs Ω(perm(N)) I/Os, and then, by a reduction from the proximate neighbors problem, it can be shown that list ranking needs Ω(perm(N)) I/Os as well (we do not give the details here). Second, an optimal algorithm for list ranking in external memory is obtained by simulating a PRAM algorithm for list ranking. We first need to give some details about the PRAM list ranking algorithm which will be used (see [6]).

List ranking problem: Consider a linked list L of n nodes whose order is specified by an array S such that S(i) contains a pointer to the node following node i on L, for 1 ≤ i ≤ n. We assume that S(i) = 0 when i is the end of the list. We are not concerned with the actual information stored in the nodes of the list. The list-ranking problem is to determine the distance of each node i from the end of the list; that is, we would like to compute an array R such that R(i) is equal to the distance of node i from the end of L.

The obvious sequential algorithm is trivially linear. To obtain good parallel algorithms, completely different strategies are needed. There is a simple parallel algorithm that uses the pointer jumping technique (also called parallel prefix) and runs in O(log n) parallel time but with a total number of operations W(n) = O(n log n) (therefore, it is not work-optimal). Since this algorithm will be used in the following algorithm, we say a few words about it. We keep an additional array of pointers Q (let us call Q(i) the descendant of i), and each Q(i) is initially equal to the successor S(i). The ranks are all set to 1, except for the rank of the last node, which is set to 0. The algorithm has ⌈log n⌉ stages, and in each stage the descendant Q(i) of a node i is updated to be the descendant's descendant. The ranks are updated at the same time, by adding the rank of the current descendant to the current rank of the node. As the technique is applied repeatedly, the descendant of a node becomes closer and closer to the end of the list (the distance between a node i and its descendant Q(i) doubles unless the descendant of the descendant is already the end of the list). At the end, all the descendants are equal to the end of the list and the rank of each node contains the distance of the node from the end of the list.

The strategy to reduce the number of operations of the algorithm in order to achieve work-optimality has three steps:

1. Contract the original list L until only O(n/log n) nodes remain, using only W(n) = O(n) operations.

2. Apply the pointer jumping technique on the short list of the remaining nodes. Notice that since the size of the list is now O(n/log n), the number of operations used will be linear, while the time will still be O(log n).

3. Restore the original list and rank all the nodes removed in step 1 (using again only O(n) operations).

The main difficulty lies in performing step 1 in O(log n) time using a linear number of operations.
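Before turning to the contraction step, here is a minimal sequential sketch of the pointer-jumping ranking used in step 2 (indices are as in the text, with S(i) = 0 marking the end of the list; on a PRAM, all nodes are updated in parallel within each of the ⌈log n⌉ stages):

```python
import math

def pointer_jump_rank(S):
    """List ranking by pointer jumping.
    S: dict mapping node i -> successor (0 marks the end of the list).
    Returns R: dict mapping node i -> distance from the end of the list."""
    nodes = list(S)
    R = {i: (0 if S[i] == 0 else 1) for i in nodes}   # initial ranks
    Q = dict(S)                                       # Q(i): current descendant of i
    for _ in range(math.ceil(math.log2(max(len(nodes), 2)))):
        newR, newQ = dict(R), dict(Q)
        for i in nodes:                               # done in parallel on a PRAM
            if Q[i] != 0:
                newR[i] = R[i] + R[Q[i]]              # add the descendant's rank
                newQ[i] = Q[Q[i]]                     # jump to the descendant's descendant
        R, Q = newR, newQ
    return R

# Hypothetical list 5 -> 3 -> 1 -> 4 -> 2 (node 2 is the last node).
S = {5: 3, 3: 1, 1: 4, 4: 2, 2: 0}
print(pointer_jump_rank(S))   # {5: 4, 3: 3, 1: 2, 4: 1, 2: 0}
```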

Figure 5: Removing the nodes of an independent set. (a) The initial list; the ranks are given in brackets, and the nodes selected for removal are labeled with *. (b) The list after contraction.

While there is a way to do this, yielding a work-optimal list-ranking algorithm running in O(log n) time (see [6] for details), a simpler, slower algorithm will suffice for our purposes (an optimal external-memory algorithm for list ranking will be derived from it). This simpler algorithm performs step 1 in O(log n log log n) time with a linear number of operations, and the entire algorithm also runs in O(log n log log n) time with W(n) = O(n). The method for shrinking the list L consists of removing a selected set of nodes from L and updating the intermediate R values of the remaining nodes. The key to a fast parallel implementation is choosing an independent set of nodes to remove. A set I of nodes is independent if, whenever i ∈ I, S(i) ∉ I. We can remove each node i ∈ I by adjusting the successor pointer of the predecessor of i. Since I is independent, no two nodes to be removed are adjacent, so this process can be done in parallel for all the nodes in I.

Finding an independent set. This can be done by coloring the nodes of the list L. A k-coloring of L is a mapping from the set of nodes of L into {0, 1, ..., k − 1} such that no two adjacent nodes are assigned the same color. There is a well-known parallel algorithm for 3-coloring a ring (see [6]) which runs in O(log n) time and uses O(n) operations (and, therefore, is work-optimal). Using a 3-coloring of L we can obtain a large independent set in the following way. A node u is a local minimum with respect to this coloring if the color of u is smaller than the colors of its predecessor and its successor. Then, the set of local minima is an independent set of size Ω(n), and it can be identified in O(1) parallel time using O(n) operations (by comparing, in parallel, the color of each node with the colors of its predecessor and its successor).

Removing nodes of an independent set. We show how this is done on a concrete example (see Figure 5). In practice, for the purposes of our list-ranking algorithm, it is enough to label each removed node as removed and to store at the same position in the array the information about the successor and the rank that the node had before removal. This information will be used later, when the original list is restored and the nodes in the independent set I need to be ranked properly. Also, at this phase, we need to adjust the successor of the predecessor of each removed node. (Without loss of generality, we assume that the predecessor array P is available. Otherwise, it can be computed in O(1) time with O(n) operations.) Finally, the rank of the predecessor of each removed node is updated by adding to it the rank of the removed node. All this can be done in O(1) parallel time using O(n) operations.

Contracting the list to O(n/log n) size. By applying the 3-coloring procedure once, identifying the independent set, and removing all of its nodes, the list is reduced to size cn, for some 0 < c < 1.

Figure 6: List ranking. (a) The initial list; the initial ranks R are given in brackets, and a 3-coloring is shown in parentheses. (b), (c) The lists obtained during the contraction process. (d) Restoration of the nodes removed from the list in (c). (e) The final result.

Hence, this process can be repeated Θ(log log n) times to produce a list of size at most n/log n. To estimate the total running time and the total work, let n_k be the size of the list at the beginning of iteration k, for 0 ≤ k < l, where l = Θ(log log n) is the total number of iterations. The size of the list is reduced by a factor c at each iteration, therefore n_k ≤ c^k n. At iteration k + 1, the 3-coloring algorithm runs in O(log n_k) time using O(n_k) operations, and the other computations done at iteration k + 1 take O(1) time using O(n_k) operations. Thus the total time is

O(log n_0 + log n_1 + ... + log n_{l−1}) = O(log n + log(cn) + ... + log(c^{l−1} n)) = O(log(n^l · c^{l(l−1)/2})) = O(l log n) = O(log n log log n),

and the total work is simply

O(n_0 + n_1 + ... + n_{l−1}) = O(n(1 + c + ... + c^{l−1})) = O(n).

Restoring the original list and ranking the remaining nodes. This is done simply by reversing the process performed during the contraction of the list; it has the same number of iterations as the contraction phase. In the first iteration, the nodes that were removed during the last iteration of the contraction phase are reinserted into the list. For each reinserted node, its rank is updated (the new rank is the sum of its old rank and the rank of its successor in the list after insertion). Figure 6 shows a complete example. It is now clear that the entire ranking algorithm runs in O(log n log log n) time using O(n) operations.

An optimal external-memory list ranking algorithm. For the purposes of our discussion it is useful to rephrase the above algorithm in a recursive manner. Initially, the ranks are initialized (to 1 for nodes which are not the end of the list and to 0 for the end of the list). Then we proceed recursively: first, find a large independent set, of size Ω(n). Then remove these nodes and obtain a list of size cn, with 0 < c < 1 (as shown above). Recursively solve the ranking problem for the new list. After the recursive call, reinsert the removed nodes into the list and update the ranks (again, as explained above). The removal of the nodes of the independent set takes O(1) parallel time and uses O(n) space, and similarly for reinserting these nodes after the recursive call. Therefore, these steps can be simulated in O(sort(n)) I/Os in external memory (according to Theorem 4.1). Now, if we can produce a large independent set in O(sort(n)) I/Os, then the entire algorithm takes O(sort(N)) I/Os (we obtain a recurrence similar to the one in Theorem 4.2). Note that we do not really need to use the pointer-jumping technique as in the PRAM algorithm; we can simply recurse until the list is of size O(1), and the resulting algorithm will still be I/O-optimal.

The simplest way to produce a large independent set is a randomized approach. We scan along the input and flip a coin for each vertex v. We assign a sex to each vertex: male if heads appears, female otherwise. Then, the independent set consists of all the males with females as successors. The expected size of the independent set is (N − 1)/4. In parallel this can be done in O(1) time using O(n) space. Again, we can apply Theorem 4.1 and obtain an external-memory algorithm for constructing an Ω(n)-size independent set in O(sort(n)) I/Os. A deterministic way to produce an independent set of size Ω(n) in O(sort(n)) I/Os uses an external-memory version of the PRAM algorithm for 3-coloring a ring (which is more complicated, but based on the same ideas of PRAM simulation, and we do not describe it here).
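A minimal sketch of the randomized independent-set selection (written sequentially here; on a PRAM the coin flips and the checks are done in parallel in O(1) time):

```python
import random

def random_independent_set(S):
    """S: dict node -> successor (0 marks the end of the list).
    Flip a coin for every node; keep every 'male' (heads) node whose
    successor is 'female' (tails). Expected size is about (N - 1) / 4,
    and no node and its successor can both be chosen."""
    heads = {i: random.random() < 0.5 for i in S}
    return {i for i, s in S.items()
            if heads[i] and s != 0 and not heads[s]}

# Hypothetical list 5 -> 3 -> 1 -> 4 -> 2.
S = {5: 3, 3: 1, 1: 4, 4: 2, 2: 0}
print(random_independent_set(S))   # e.g. {3, 4}: an independent set of list nodes
```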

4.5 Additional Applications

Euler tour. Finding Euler tours in trees is an important technique used for efficient parallel computation of many functions on trees (see [6]): the range-minima problem, the lowest common ancestor problem, rooting a tree, computing the number of descendants of a node in a tree, etc. This, in conjunction with the above idea of obtaining good external-memory algorithms from known PRAM algorithms, suggests that it is important to have good algorithms for finding Euler tours when external memory is used. We briefly describe first the Euler tour technique, and then show that an Euler tour of a tree can be produced in Θ(sort(N)) I/Os, where N is the number of vertices in the tree. Let T = (V, E) be a given tree and let T' = (V, E') be the directed graph obtained from T when each edge (u, v) ∈ E is replaced by two arcs <u, v> and <v, u>. Since the indegree of each vertex of T' is equal to its outdegree, T' is an Eulerian graph; that is, it has a directed circuit that traverses each arc exactly once. In [6] it is shown how to produce an Euler tour in O(1) parallel time using O(n) operations, where n = |V|, under the assumption that the input tree is given in a certain representation. This representation is based on the following observation. An Euler tour of T' = (V, E') can be defined by specifying the successor function s mapping each arc e ∈ E' into the arc s(e) ∈ E' that follows e on the tour. A suitable successor function can be defined as follows: for each vertex v ∈ V of the tree T = (V, E), we fix a certain ordering on the set of vertices adjacent to v, for example, adj(v) = <u_0, ..., u_{d-1}>, where d is the degree of v. We define the successor of each arc e = <u_i, v> to be s(<u_i, v>) = <v, u_{(i+1) mod d}>, for 0 ≤ i ≤ d−1. It can easily be seen (by an inductive proof) that s defines an Euler tour of T'. A suitable representation that allows producing the Euler tour in O(1) parallel time is then as follows. T is represented by using a circular adjacency list for each vertex, L[v] = <u_0, ..., u_{d-1}>. Additionally, the node containing vertex u_i in L[v] (this node uniquely represents the arc <v, u_i>) contains a pointer to the node containing v in L[u_i] (which uniquely represents the arc <u_i, v>). This allows computing the successor of a given arc <u, v> by following the pointer from the node containing v in L[u] to the node containing u in L[v] and then taking the next node of L[v]. Therefore an Euler tour of T can be produced in O(1) time using O(n) operations on a PRAM. By Theorem 3.1, we immediately obtain an external-memory algorithm with O(sort(n)) I/Os. A lower bound of Ω(sort(n)) I/Os is shown in [4] for the Euler tour problem, therefore the above algorithm is optimal.

As a simple application, we show how to compute (optimally) the vertex level in a rooted tree, for both the PRAM and external memory, by using the Euler tour technique. We are given a tree T = (V, E) represented as above and a special vertex r ∈ V as the root. Additionally, each vertex v in the tree contains the parent information p(v) when T is rooted at r. (Computing the parent information is commonly referred to as the tree rooting problem. It can be solved in parallel time O(log n) with O(n) operations on a PRAM by using the Euler tour technique, in a very similar way to computing the vertex level; see [6] for more details.) Let L[r] = <u_0, ..., u_{d-1}> be the adjacency list of r. We define an Euler path to be the list EP = (e_1 = <r, u_0>, e_2 = s(e_1), ..., <u_{d-1}, r>). (It can be seen that this list corresponds to a depth-first search of the tree T starting at vertex r.) Note that the Euler path can be produced by simply setting s(<u_{d-1}, r>) = 0 (i.e., 0 is used to mark the end of the list). We want to compute level(v) of each vertex v, which is the distance (number of edges) between v and the root r. We assign w(<p(v), v>) = +1 and w(<v, p(v)>) = −1, for each v in the tree, and perform a parallel prefix computation on the list EP (this can be done optimally in O(log n) time with O(n) operations using a slightly modified version of the optimal list-ranking algorithm, the one that was mentioned but not explained in the previous subsection). Then, we set level(v) to be equal to the prefix sum at <p(v), v>. The overall parallel time is clearly O(log n) while the number of operations is O(n).
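As a small sequential illustration of this weighting scheme (a sketch with variable names of our own; the parallel algorithm would obtain the prefix sums by list ranking rather than by a running sum), assigning +1 to the arcs <p(v), v> and −1 to the arcs <v, p(v)> along the Euler path yields the levels directly:

def levels_by_euler_path(children, root):
    """Walk the Euler path of the tree rooted at `root`, give weight +1 to
    downward arcs and -1 to upward arcs, and read level(v) off the running
    prefix sum at the downward arc <p(v), v>."""
    level = {root: 0}
    prefix = 0
    stack = [(root, iter(children.get(root, [])))]
    while stack:
        v, it = stack[-1]
        child = next(it, None)
        if child is None:
            stack.pop()
            prefix -= 1          # arc <v, p(v)> has weight -1 (harmless extra step at the root)
            continue
        prefix += 1              # arc <v, child> has weight +1
        level[child] = prefix
        stack.append((child, iter(children.get(child, []))))
    return level

ch = {0: [1, 2], 1: [3, 4]}
print(levels_by_euler_path(ch, 0))   # {0: 0, 1: 1, 3: 2, 4: 2, 2: 1}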

The external-memory version of the previous vertex-level algorithm is immediate. The Euler tour is computed in O(sort(n)) I/Os as we have seen before, while the list ranking can also be done in O(sort(n)) I/Os (see the previous subsection). The assignment of the weights can be done in O(sort(n)) I/Os by Theorem 3.1 (since it needs O(1) parallel time and O(n) operations).

Lowest Common Ancestor problem. The lowest common ancestor lca(u, v) of two vertices u and v of a rooted tree is the node w that is an ancestor of both u and v and is farthest from the root. The lowest common ancestor problem (LCA problem) is the problem of preprocessing T such that each query lca(u, v) can be answered in O(1) sequential time. There are two special cases in which the LCA problem can be solved immediately:

1. T is a simple path. Then, computing the distance of each vertex from the root allows us to answer any lca(u, v) query in constant time by comparing the distances of u and v from the root.

2. T is a complete binary tree. Determining the inorder number of each vertex is sufficient to guarantee the handling of each LCA query in constant time, as follows. Identify the nodes of the tree with their inorder numbers. To find lca(x, y), express x and y as binary numbers and find the leftmost bit, say bit i, in which they disagree; that is, the leftmost i−1 bits z_1 z_2 ... z_{i−1} of x and y are identical, and the i-th bits are different. Then, lca(x, y) is equal to the number whose binary representation is z_1 z_2 ... z_{i−1} 1 0 ... 0.

For the general case, the LCA problem is reduced to the range-minima problem, by using the Euler tour technique, in the following way. We start by determining an Euler tour of T (as we have seen before). We replace each arc <u, v> in the tour by the vertex v, and insert the root in front of the list. We obtain the Euler array A of T. If |V| = n, then the size of A is m = 2n − 1. This can be done in O(1) parallel time, using O(n) operations. Next, we compute the array B = level(A) containing, for each element v of A, its corresponding level in T (as defined in the previous paragraph). This can be done in O(log n) parallel time with O(n) operations (as we have just seen). We need two more pieces of information for each vertex v: l(v), the index of the leftmost appearance of v in A, and r(v), the index of the rightmost appearance of v in A. Given the array A, the element a_i = v is the leftmost appearance of v in A if and only if level(a_{i−1}) = level(v) − 1 (recall that, the way our Euler tour is produced, it corresponds to a depth-first search of T). Similarly, the element a_i = v is the rightmost appearance of v in A if and only if level(a_{i+1}) = level(v) − 1. Therefore, l(v) and r(v) can be produced in O(1) time, using a linear number of operations. The reduction of the LCA problem to the range-minima problem is justified by the following lemma (easily verified):

Lemma 4.4 Let u and v be two arbitrary nodes in a rooted tree T = (V, E). Then:

1. u is an ancestor of v iff l(u) < l(v) ≤ r(v) < r(u).
2. u and v are not in an ancestor-descendant relation iff either r(u) < l(v) or r(v) < l(u).
3. If r(u) < l(v), then lca(u, v) is the vertex with the minimum level over the interval [r(u), l(v)].

The reduction takes in total O(log n) parallel time using O(n) operations on a PRAM. In external memory, we have seen how to compute the Euler tour in O(sort(n)) I/Os and the level of each vertex in O(sort(n)) I/Os (by using list ranking). l(v) and r(v) can be computed in O(sort(n)) I/Os, too (Theorem 3.1, again). Thus, the whole reduction takes O(sort(n)) I/Os. Next, we briefly describe how to solve the range-minima problem in external memory.
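Putting these pieces together, the following sequential sketch (function names are ours) builds the Euler array, the levels, and the l/r indices for a toy tree, and answers lca(u, v) exactly as Lemma 4.4 prescribes; the range minimum is taken by a plain scan here, standing in for the preprocessed range-minima structure described next.

def lca_structures(children, root):
    """children: dict vertex -> list of children.  Build the Euler array A,
    the levels, and the leftmost/rightmost occurrence indices l, r."""
    A, level, l, r = [root], {root: 0}, {root: 0}, {}
    def dfs(v):
        for c in children.get(v, []):
            level[c] = level[v] + 1
            A.append(c)                 # arc <v, c> contributes c
            l[c] = len(A) - 1
            dfs(c)
            A.append(v)                 # arc <c, v> contributes v
        r[v] = len(A) - 1 if children.get(v) else l[v]
    dfs(root)
    return A, level, l, r

def lca(u, v, A, level, l, r):
    if l[u] <= l[v] and r[v] <= r[u]:
        return u                        # u is an ancestor of v (case 1 of Lemma 4.4)
    if l[v] <= l[u] and r[u] <= r[v]:
        return v
    lo, hi = (r[u], l[v]) if r[u] < l[v] else (r[v], l[u])
    return min(A[lo:hi + 1], key=lambda x: level[x])   # case 3: range minimum by level

# toy tree rooted at 0: 0 -> 1, 2 ; 1 -> 3, 4
ch = {0: [1, 2], 1: [3, 4]}
A, level, l, r = lca_structures(ch, 0)
print(lca(3, 4, A, level, l, r), lca(4, 2, A, level, l, r))   # 1 0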

Range-minima problem. We are given an array A of size n = 2^l and we need to preprocess A such that the following query can be answered in O(1) sequential time: given any interval [k, j], where 1 ≤ k < j ≤ n, find the minimum of {a_k, a_{k+1}, ..., a_j}.

The common solution is to construct a complete binary tree, in which each leaf corresponds to an element of the array A, and each internal node v corresponds to the elements of the array in the subtree rooted at v. To find the minimum over the interval [k, j], first, the lowest common ancestor v of the two leaves containing k and j has to be identified (as shown before, this can be done in O(1) sequential time). Next, the subarray associated with v is of the form A_v = {a_r, ..., a_k, ..., a_j, ..., a_s}, where r ≤ k < j ≤ s. Let u and w be, respectively, the left and the right children of v. Then, the subarrays A_u and A_w associated with u and w partition A_v into, say, A_u = {a_r, ..., a_k, ..., a_p} and A_w = {a_{p+1}, ..., a_j, ..., a_s}, for some k ≤ p < j. The element we seek is the minimum of the following two elements: the minimum of the suffix {a_k, ..., a_p} of A_u and the minimum of the prefix {a_{p+1}, ..., a_j} of A_w. Therefore, for each node v, it is sufficient to store the suffix minima and the prefix minima of the subarray associated with v. This can be generalized to a complete k-ary tree, with two modifications. First, the lowest common ancestor v of the two leaves containing k and j has k children rather than just two. This implies that the subarray associated with v is partitioned into k subarrays, each corresponding to the subtree rooted at one child. Therefore, the interval [k, j] spans, in general, more than two of the k subarrays, say l subarrays. To answer the range-minimum query we need to compute the minimum of l elements: a suffix minimum for the first subarray, the minima of the "middle" subarrays, and a prefix minimum for the last subarray. Hence, for each node v, we have to store, in addition to the prefix and suffix minima, the minimum of the subarray associated with v. The second modification from the basic algorithm concerns how to compute the lowest common ancestor in constant time when k > 2. This is possible by making use of an appropriate numbering of the nodes of the tree and the representation of these numbers in radix k instead of 2. Finally, we need to check that the range-minimum query is indeed answered in O(1) sequential time. This is true as long as k is bounded (otherwise, a query is answered in O(k) sequential time). A simple O(sort(N)) external-memory algorithm to construct a search tree as described above for answering range-minimum queries is sketched in what follows. To simplify, assume that D = 1. We choose k to be M/B. There are O(N/B) leaves, each a block storing B data items. The search tree is a complete (M/B)-ary tree with O(log_{M/B}(N/B)) levels. The prefix and suffix minima, and the minimum of the subarray associated with a node, can be computed level by level, starting from the leaves. The information needed at the nodes of some level is computed using a part of the information stored at their children on the previous level (the amount of information needed at each moment fits into main memory). Each level needs to be written once and read once. The amount of information stored at each level is O(N), thus we need O(N/B) I/Os per level. This gives a total of O((N/B) log_{M/B}(N/B)) I/Os, i.e., O(sort(N)).
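For the binary-tree case, the preprocessing and the query can be sketched as follows (our own helper names; the level at which the two leaves meet is found by a loop here, whereas the paper obtains it in O(1) with the complete-binary-tree LCA trick), assuming the array length is a power of two:

def build_range_min(a):
    """For every node of the complete binary tree over a (identified by the pair
    (size, start) of the subarray it covers), store prefix and suffix minima."""
    n = len(a)
    pre, suf = {}, {}
    size = 1
    while size <= n:
        for start in range(0, n, size):
            block = a[start:start + size]
            p, m = [], float("inf")
            for x in block:
                m = min(m, x)
                p.append(m)
            s, m = [], float("inf")
            for x in reversed(block):
                m = min(m, x)
                s.append(m)
            pre[(size, start)] = p
            suf[(size, start)] = s[::-1]
        size *= 2
    return pre, suf

def range_min(pre, suf, k, j):
    """Minimum of a[k..j] (0-based, k < j): climb until k and j fall under one
    node, then combine a suffix minimum of its left child with a prefix minimum
    of its right child."""
    size = 1
    while (k // size) != (j // size):
        size *= 2
    half = size // 2
    left_start = (k // size) * size
    return min(suf[(half, left_start)][k - left_start],
               pre[(half, left_start + half)][j - left_start - half])

a = [3, 1, 4, 1, 5, 9, 2, 6]
pre, suf = build_range_min(a)
assert range_min(pre, suf, 2, 6) == min(a[2:7])  # == 1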

Connected components and minimum spanning trees for dense graphs. Another important application

of the PRAM simulation technique is finding the connected components of an undirected graph G = (V, E). A simple sequential algorithm, based on depth-first search, runs optimally in O(|V| + |E|), by keeping a counter and systematically visiting all the vertices accessible from a starting vertex, labeling each of these vertices with the value of the counter. When all the accessible vertices have been visited, the counter is incremented and the process is repeated starting from another (unvisited) vertex, if any is left. The vertices belonging to the same connected component receive the same label. Depth-first search is essentially a sequential process, and finding an efficient parallel algorithm is more difficult. We briefly describe here an optimal PRAM algorithm (see [6]) for dense graphs running in O(log² n) parallel time using O(n²) operations. The main idea is the following: identify the connected components by an iterative procedure where, at the beginning of each iteration, the available vertices are partitioned into groups such that all the vertices in a group belong to the same connected component. During each iteration, some groups with adjacent vertices are merged into larger groups. The algorithm terminates when no additional groups can be merged. The number of such groups (those adjacent to other groups) decreases at least by a factor of two at each iteration. We therefore have O(log n) iterations (we start

with groups containing one vertex), where n = |V|. At each iteration the groups are identified by one of their vertices (the one with the smallest number in the group, for example). We construct a new, smaller graph containing only the representatives of the groups (called supervertices). There is an edge between two supervertices if and only if there is any edge between a vertex of the group represented by one supervertex and a vertex of the group represented by the other supervertex. The process is then repeated for the smaller graph containing the supervertices. How are vertices merged into groups? First, each vertex is associated with the vertex that is adjacent to it and has the smallest number. This defines a function from vertices to vertices. This function, let us call it C, induces a partition of the initial graph into directed pseudotrees (trees with one additional cycle), by assigning an arc <v, C(v)> for each pair (v, C(v)). Each of these trees can be assigned a root (the vertex with the smallest number) and all the vertices in the tree can be made aware of this root by applying the pointer jumping technique. At the end of the merge process, all vertices belong to some group. This process takes O(log n_k) parallel time, using O(n_k log n_k) operations, where n_k is the number of vertices (supervertices) in the graph at the beginning of iteration k. After the groups are identified, the roots are the new supervertices. The adjacency matrix of the new graph can be computed from the adjacency matrix of the previous graph in the following way: any two adjacent vertices (in the previous graph) belonging to two different groups write a value of 1 in the new adjacency matrix, in the position corresponding to the two representatives of their groups. This requires a concurrent-write capability, since more than one pair of vertices can detect that they are adjacent and belong to the same two groups. It can be done in O(1) parallel time using O(n_k²) operations. Computing the function C can be done in O(log n_k) time with O(n_k²) operations using an optimal prefix algorithm. Thus, each iteration takes O(log n_k) parallel time using O(n_k²) operations. Remember that n_{k+1} ≤ n_k / 2. A simple calculation gives an overall time complexity of O(log² n) with O(n²) operations.

The external-memory algorithm for connected components can be obtained from the PRAM algorithm in the following way. The first iteration is simulated in O(sort(n²)) I/Os: for each vertex, computing the smallest adjacent vertex (i.e., the function C) can be done using a version of the external-memory list ranking algorithm in O(sort(n²)) (or an external-memory simulation of an optimal PRAM prefix-minimum algorithm); determining the root of each pseudotree can be done in O(sort(n)), again using a version of the external-memory list ranking algorithm. Computing the new adjacency matrix for the next iteration can be done in O(sort(n²)) (use Theorem 3.1). We are then left with a smaller problem of size at most n/2. The recurrence has the solution O(sort(n²)) (similar to the result of Theorem 3.2). This is the same as the upper bound given in [4] for the case of dense graphs.

The minimum spanning tree problem can be solved by a variation of the above PRAM algorithm, with the same complexities. The given graph is assumed to be connected. The algorithm (Sollin's algorithm) begins with the forest (V, ∅) and grows trees on subsets of V until there is a single tree containing all the vertices. During each iteration, the minimum-weight edge incident on each tree is selected. The new edges are added to the current forest, say F_s, to obtain a new forest, F_{s+1}. This process is continued until there is only a single tree. The number of trees in F_{s+1} is at most one-half the number of trees in F_s, so we have, as before, O(log n) iterations. The strategy of the algorithm is similar to the connected components algorithm. The only essential difference is the definition of the function C: for each vertex v, C(v) is defined to be the vertex that is connected to v by the minimum-weight edge. This ensures that only edges that indeed occur in the MST are selected. (The important lemma is: if V = ∪_i V_i is a partition of V, then for each i the minimum-weight edge connecting a vertex in V_i to a vertex in V − V_i belongs to the MST.) The external-memory algorithm for finding the minimum spanning tree follows in the same way as before. Its complexity is O(sort(n²)).
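One grouping iteration of this algorithm is easy to mimic sequentially; the sketch below (names are ours, and a simple walk stands in for the pointer-jumping step) lets each vertex pick its smallest-numbered neighbor and then collapses every resulting pseudotree to its smallest vertex:

def one_grouping_iteration(adj):
    """adj: dict vertex -> set of adjacent vertices (vertices are integers).
    Each vertex points to its smallest neighbor; following these arcs leads into a
    2-cycle, and the group representative is the smallest vertex reached."""
    C = {v: min(nbrs) for v, nbrs in adj.items() if nbrs}
    rep = {}
    for v in adj:
        if v not in C:
            rep[v] = v                       # isolated vertex forms its own group
            continue
        seen, u = {v}, v
        while C[u] not in seen:              # walk (sequential stand-in for pointer jumping)
            u = C[u]
            seen.add(u)
        rep[v] = min(seen)
    return rep

adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: {4}, 4: {3}}
print(one_grouping_iteration(adj))           # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}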

Figure 7: Two classes of multi-processor multi-disk architectures: the shared-disk architecture (shared-memory shared-disk and distributed-memory shared-disk, with hosts reaching the disks through I/O processors connected by a fast interconnection network) and the distributed-disk architecture (shared-memory distributed-disk and distributed-memory distributed-disk, as in shared-nothing parallel databases and the TickerTAIP parallel RAID).

5 Multi-Processor Multi-Disk Architectures

As we have seen, RAID offers an attractive alternative to single large-capacity disks. In very large database systems, it is very common for data to be spread over multiple disks, not only in distributed or parallel architectures but also in more common uniprocessor architectures. A problem that appears very frequently in relation with the I/O operation of parallel disks is the optimal placement of data, in order to minimize:

- I/O cost
- network cost (for distributed and some of the parallel architectures)
- processing cost

and to take advantage of the potential higher bandwidth. The most important algorithmic problems related to data placement are permutation and sorting of data. The architectural counterparts of these are the interconnection networks between processing elements and disks.

5.1 Parallel Permutation and Sorting Algorithms

In [9] four SIMD architectures are surveyed in relation with the problem of permutation and sorting of data:

1. Shared Memory Model (SMM)
2. Mesh Connected Computer (MCC)
3. Cube Connected Computer (CCC)
4. Perfect Shuffle Computer (PSC)


Algorithms for arbitrary permutation and sorting are described for the CCC architecture (they can be easily adapted for the PSC architecture as well). One important question that can be raised is whether and how they can be adapted to the case when the data resides on disks. What kind of interconnection networks between processors and disks are needed in addition to the processor-processor interconnection network? Which is better to choose: shared-disk or shared-nothing architectures? (For the algorithms analyzed here, the memory is distributed.) What algorithms can be developed for the shared-memory case with parallel disks?

5.1.1 Permutation and Sorting Algorithms: Some Results

Input. N = 2^n records. Initially, record i is in PE(i), 0 ≤ i < N. Each record has a field A(i). For the permutation problem, A(i) ∈ [0, N−1], 0 ≤ i < N.

Output. For sorting: a rearrangement of the N records into non-decreasing order of A(i), one record per PE. For permutation: record i is relocated to PE(A(i)), 0 ≤ i < N.

- Permutation on SMM: O(1), with N PEs.
  - The algorithm assumes a CREW PRAM, in which there is a sufficiently complex interconnection network between the shared memory and the PEs to allow simultaneous memory accesses by several PEs. PE(i) first writes its record into location A(i) of the common memory and then reads back the record in location i.
- Sorting on SMM:
  - O(log² N), with N PEs (Batcher)
  - O(k log N), with N^{1+1/k} PEs (Preparata)
- Permutation on CCC:
  - O(log² N), with N PEs
  - O(k log N), with N^{1+1/k} PEs (Nassimi and Sahni, based on MSD radix sort)
- Permutation on PSC:
  - O(log² N), with N PEs
  - O(k log N), with N^{1+1/k} PEs (Nassimi and Sahni)
- Sorting on CCC:
  - O(log² N), with N PEs (Batcher's bitonic merge sort)
  - O(k log N), with N^{1+1/k} PEs (Nassimi and Sahni, similar to Preparata's algorithm for SMM)
- Sorting on PSC:
  - O(log² N), with N PEs (Batcher's bitonic merge sort)
  - O(k log N), with N^{1+1/k} PEs (Nassimi and Sahni)

As immediate consequences of the Nassimi and Sahni algorithms for permuting and sorting, we obtain (1) data broadcasting and (2) transfers between any N PEs among all N^{1+1/k} PEs in O(k log N). The next two sections discuss the algorithms given by Nassimi and Sahni for permuting and sorting on CCCs and PSCs.


5.1.2 Permutation

We assume a CCC or a PSC having N^{1+1/k} PEs, where k = n/m for some integer m, 1 ≤ m ≤ n. These PEs are arranged into a 2^m × 2^n array and are indexed in row-major order. First, let us recall the structure of the interconnection network for both the CCC and the PSC:

- CCC (hypercube): The number of processors in a hypercube is p = 2^q. Let i_{q−1} ... i_0 be the binary representation of i, for i ∈ [0, p−1], and let i^(b) be the number whose binary representation is i_{q−1} ... i_{b+1} ī_b i_{b−1} ... i_0, where ī_b is the complement of i_b, 0 ≤ b < q. Then, PE(i) is connected to PE(i^(b)) for all 0 ≤ b < q. Two observations about the hypercube are worth mentioning:
  - The degree of a processor (i.e., the number of links starting from a processor) is q = log p. The diameter (the maximum distance between any two processors) is also q = log p.
  - The hypercube has a recursive structure. We can extend a q-dimensional cube to a (q+1)-dimensional cube by connecting corresponding processors of two q-dimensional cubes. The nodes of the first cube will have the most significant bit equal to 0, while the nodes of the second cube will have the most significant bit equal to 1.
- PSC: In a PSC, each processor has only three links, as opposed to a hypercube. PE(i) is connected to PE(i^(0)), PE(i_{q−2} ... i_0 i_{q−1}), and PE(i_0 i_{q−1} ... i_1). These three links are called exchange, shuffle, and unshuffle, respectively. The diameter is still O(log p) (as for the hypercube), but with only 3p/2 links in total, as opposed to (p log p)/2 links for the hypercube.
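To make the two topologies concrete, the following small sketch (function names are ours) computes the neighbors of a node under both connection schemes:

def hypercube_neighbors(i, q):
    """Neighbors of node i in a q-dimensional hypercube: flip each bit b."""
    return [i ^ (1 << b) for b in range(q)]

def psc_neighbors(i, q):
    """Neighbors of node i in a perfect shuffle computer:
    exchange (flip bit 0), shuffle (cyclic left shift of the q-bit index),
    and unshuffle (cyclic right shift)."""
    mask = (1 << q) - 1
    exchange = i ^ 1
    shuffle = ((i << 1) | (i >> (q - 1))) & mask
    unshuffle = ((i >> 1) | ((i & 1) << (q - 1))) & mask
    return [exchange, shuffle, unshuffle]

q = 3
for i in range(1 << q):
    print(i, hypercube_neighbors(i, q), psc_neighbors(i, q))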

Before discussing the main algorithm, we need to describe four algorithms that will be widely used for both permuting and sorting (we describe them for the CCC, but they can be implemented in a very similar way on the PSC, with the same cost). Three of the algorithms, broadcast, concentrate, and spread, are particular cases of routing: the first one broadcasts a value to all nodes in the hypercube, the second one is used to "concentrate" elements, while the third is used to "spread" elements along a hypercube. The fourth algorithm, rank, will be used to rank each node of a hypercube. In all four algorithms, we assume a hypercube with p = 2^q nodes. For permuting and sorting, these algorithms will be used on particular subcubes (hence, they will need to be slightly modified in order to adjust the indices; we omit the details here).

1. broadcast: O(q) steps. Input: A vector A(k) such that A(i) ≠ null, and A(j) = null for all j ≠ i. Result: Element A(i) is replicated in all the nodes. In the first step of the algorithm, PE(i) sends its data to PE(i^(0)) (the two PEs are neighbors, i.e., at distance 1). In the second step, the two PEs send their data to the PEs whose indices have the same binary representation except for bit 1, and so on. In general, after b steps (1 ≤ b ≤ q) 2^b PEs have the data, and in step b+1 they propagate it to the 2^b PEs that have the same bits in the binary representation of their indices except for bit b. After exactly q steps, all the 2^q PEs have received the data.

Figure 8: Ranking for a hypercube with q = 3: (a) an input configuration in which a few nodes hold non-null values (marked *) and (b) the resulting rank values.

Figure 9: rank: the recursive call for q = 3: (a) the ranks R and counts S immediately after the recursive calls with q = 2 and (b) the values at the end of this recursive call (q = 3).

2. rank: O(q) steps. Input: A vector A(i), i ∈ [0, 2^q − 1], some elements of which may be null. Result: A vector R(i), such that R(i) is the number of nodes with indices j < i that hold a non-null element A(j) (see Fig. 8). Besides the ranks, the algorithm computes, as an auxiliary result, the count S(i) of all PEs in the cube that hold a non-null element. The algorithm can be described recursively as follows (remember our observation regarding the recursive structure of a hypercube). Divide the 2^q PEs into two groups: the first 2^{q−1} PEs (having 0 as the value of bit q−1), and the last 2^{q−1} PEs (having 1 as the value of bit q−1). Recursively compute the rank (and the auxiliary count) in each of the two groups, in parallel (see Fig. 9(a)). Then, each processor in the first group, say P1, has one corresponding processor in the second group, P2, the one with the same bits as P1 except for bit q−1. P1 sends the count of its group to P2, and P2 adds this value to its previously computed rank (relative to its group) to obtain its new rank (relative to the whole cube). The rank that P1 holds is already the rank relative to the whole cube (so it does not need any adjustment). However, both P1 and P2 must update their count (with respect to the whole cube). Hence, there is also communication from P2 to P1 (see Fig. 9(b)). The base case of the recursion (groups of one PE) consists of setting the rank to 0 and the count to either 1, if the PE holds a record, or 0 otherwise.

3. concentrate: O(q) steps. Input: A vector R(i), i ∈ [0, 2^q − 1], such that: (1) either R(i) = null or R(i) ∈ [0, 2^q − 1], for all i ∈ [0, 2^q − 1]; (2) for all j, l ∈ [0, 2^q − 1], if R(j), R(l) ≠ null then |j − l| ≤ |R(j) − R(l)|.

Result: Each non-null element R(i) is routed from the original node i to node R(i). The algorithm goes successively through each bit b from 0 to q−1, and in step b each non-null element R is routed to the processor whose index agrees with R on bit b (only if they differ; otherwise R remains in place) (see Fig. 10). Notice that collisions cannot occur, precisely because of property (2) satisfied by the elements R(i). Suppose that two elements originating in nodes j and l collide in node i, in iteration b. Then it must be the case that

i = (j_{q−1..b+1}, R(j)_{b..0}) = (l_{q−1..b+1}, R(l)_{b..0}).

The equality of the high-order parts implies |j − l| < 2^{b+1}, while the equality of the low-order parts (together with R(j) ≠ R(l)) implies |R(j) − R(l)| ≥ 2^{b+1}. Hence |j − l| < |R(j) − R(l)|, a contradiction.

4. spread: O(q) steps.

Figure 10: concentrate with q = 3: (a) the initial configuration and (b)-(d) the positions of the routed elements after iterations b = 0, 1, 2.

Figure 11: spread with q = 3: (a) the initial configuration and (b) the final configuration.

Input: A vector R(i), i ∈ [0, 2^q − 1], such that: (1) either R(i) = null or R(i) ∈ [0, 2^q − 1], for all i ∈ [0, 2^q − 1]; (2) for all j, l ∈ [0, 2^q − 1], if R(j), R(l) ≠ null then |R(j) − R(l)| ≤ |j − l|.

Result: Each non-null element R(i) is routed from the original node i to node R(i). This is the inverse of concentrate (see Fig. 11). The algorithm starts from bit q−1 and iterates down to bit 0 (the reverse order of concentrate). Correctness (no collisions) is ensured by property (2).
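The following sketch simulates concentrate and spread sequentially, moving each element one bit position per iteration exactly as described above; it only illustrates the routing order, not the parallel implementation, and the names and the small example are ours.

def route(R, q, bit_order):
    """Route each non-null value R[i] from node i to node R[i], fixing one bit
    of the current position per iteration, in the given bit order."""
    pos = {i: r for i, r in enumerate(R) if r is not None}  # current node -> destination
    for b in bit_order:
        new_pos = {}
        for node, dest in pos.items():
            # move to the neighbor that agrees with dest on bit b (if it differs)
            target = node if (node >> b) & 1 == (dest >> b) & 1 else node ^ (1 << b)
            assert target not in new_pos, "collision: property (2) violated"
            new_pos[target] = dest
        pos = new_pos
    return pos

def concentrate(R, q):
    return route(R, q, range(q))              # bits 0, 1, ..., q-1

def spread(R, q):
    return route(R, q, reversed(range(q)))    # bits q-1, ..., 1, 0

# a small example with three elements holding ranks 0, 1, 2
print(concentrate([None, 0, None, 1, None, None, 2, None], q=3))   # {0: 0, 1: 1, 2: 2}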

Finally, we can describe the permutation algorithm. Remember that the elements A(i), i ∈ [0, 2^n − 1], reside in the first row of the 2^m × 2^n array. We consider the values of A(j), where j ≥ 2^n, to be null initially. The algorithm is essentially a parallel version of MSD radix sort (our permutation is also a particular case of sorting). The radix used is 2^m. The algorithm has ⌈n/m⌉ phases, one for each digit of A(i). After s phases, the array A is sorted with respect to its most significant s digits (using radix 2^m). Hence, after ⌈n/m⌉ phases the array A is sorted. We describe the first phase (sorting with respect to the most significant digit in radix 2^m, or equivalently the most significant m bits in the binary representation of A(i)). It consists of the following steps:

1. spread: for each column i, i ∈ [0, 2^n − 1], the element A(i) is routed along the column to the row D_1(i), where D_1(i) is the value of the first digit of A(i). There is only one such non-null element in the column, and this step can be accomplished by a simple spread along the column (see Fig. 12(b)). This step takes Θ(m) parallel time.

2. rank: for each node i, the number of records located in PEs with indices less than i is computed. This can be done by computing the rank within the row and then adding the number of records below that row. The latter number is simply the number of rows below the current row, i.e., ⌊i/2^n⌋, multiplied by the number of records within a row, i.e., 2^{n−m}. This step takes Θ(n) parallel time.

3. concentrate along each row: each element is routed along its row to the column given by the previously computed rank (see Fig. 12(c)) (Θ(n) parallel time).

4. concentrate along each column: each element, in each column, is routed to the first row (see Fig. 12(d)). This takes Θ(m) parallel time.

Figure 12: Permutation of 8 elements with 4 x 8 PEs (only the first phase is shown): (a) the initial configuration, (b) after the spread step, (c) after ranking and concentration along each row, and (d) after concentration along each column.

After the first phase, records are sorted with respect to the first digit. Now, the ordering with respect to the rest of the digits can be done independently for groups of records having the same value of the first digit. In other words, it is enough to sort within each group with respect to the rest of the digits, and the final sequence will be sorted. Sorting within each group is performed in a similar way, in each 2^m × 2^{n−m} subcube (the 4 x 2 subcubes in Fig. 12(d)). As an observation, the last concentration step (along the column) need not be done for each phase, but instead can be done once at the end. For the spread step to work, it is not necessary for A(i) to be located in the first row.
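A compact sequential stand-in for the whole permutation algorithm is sketched below (our own names and bookkeeping): each phase groups the records by the next radix-2^m digit of their destination and then recurses independently inside each group, mirroring the spread/rank/concentrate phases performed in parallel on the 2^m × 2^n array.

def msd_radix_permute(A, n, m):
    """Record i must end up at position A[i].  Each phase handles the next
    radix-2^m digit of the destination address; later phases work independently
    inside each group, as the CCC algorithm does on 2^m x 2^(n-m) subcubes."""
    def sort_group(records, bits_left):
        if bits_left <= 0 or len(records) <= 1:
            return records
        width = min(m, bits_left)
        shift = bits_left - width
        buckets = [[] for _ in range(1 << width)]
        for r in records:
            digit = (A[r] >> shift) & ((1 << width) - 1)
            buckets[digit].append(r)
        return [r for b in buckets for r in sort_group(b, shift)]
    return sort_group(list(range(len(A))), n)

A = [5, 2, 1, 4, 0, 7, 6, 3]      # the 8-element example of Fig. 12
perm = msd_radix_permute(A, n=3, m=2)
assert all(A[r] == j for j, r in enumerate(perm))
print(perm)                        # perm[j] is the record now held by PE(j)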

Complexity. We can describe the complexity of the above algorithm by the following recurrence (we consider the optimized version of the algorithm, in which the last concentration step is done only once, at the end):

T(2^n) = c1 m + c2 n + c3 + T(2^{n−m}), if n > m
T(2^n) = c1 n + c2 n + c1 n + c3', if n ≤ m

where the last term c1 n in the second equation accounts for the final concentration step along the column done at the end. Also, in the second equation, the first term is c1 n instead of c1 m since only the first 2^n of the rows are used (there is only one digit left, and it has n bits instead of m). Unfolding the recurrence, we obtain:

T(2^n) = c1 m + c2 n + c3 + T(2^{n−m})
T(2^{n−m}) = c1 m + c2 (n − m) + c3 + T(2^{n−2m})
...
T(2^{n−(k−1)m}) = c1 (n − (k−1)m) + c2 (n − (k−1)m) + c1 (n − (k−1)m) + c3'

where k = ⌈n/m⌉. This yields

T(2^n) = (k−1) c1 m + c1 (n − (k−1)m) + c2 (kn − (1 + ... + (k−1))m) + c1 (n − (k−1)m) + c3 (k−1) + c3'
       = c1 n + c2 (kn − ((k−1)/2) n) + c1 m + c3 (k−1) + c3' = O(kn).

Since n = log N, we obtain T(N) = O(k log N).

5.1.3 Sorting

Again, we assume a CCC with 2^{n+m} PEs organized into a 2^m × 2^n array (for the PSC, the algorithm can be easily modified in the same way as for permutation). The records are initially located on the first row.

Figure 13: Merge of subsequences S0, S1, S2, S3: distribution of the subsequences within the square matrix (m = 2). (a) Broadcast along the column for each subsequence S_l; (b) broadcast along the row for each subsequence T_p; (c) each block (l, p) compares subsequences S_l and T_p.

For simplicity, assume that n is divisible by m and take k = n/m. The algorithm is essentially a parallel merge sort and it has n/m phases. Each phase starts with sorted sequences of size 2^r and ends with sorted sequences of size 2^{r+m} (r is 0 initially), by performing a 2^m-way merge. The way in which the merge is done is the core of the algorithm. The idea of the merge is simple: consider 2^m consecutive subsequences of size 2^r that need to be merged. They reside in a 2^m × 2^{r+m} matrix, which can be viewed as a 2^m × 2^m square matrix in which each element consists of a block of 2^r nodes. During the merge, elements of each subsequence S_l (l ∈ [0, 2^m − 1]) need to be compared against elements of each subsequence S_p (p ∈ [0, 2^m − 1]). This is performed in element (l, p) of the square matrix. To do this, each subsequence S_l, which initially resides in element (0, l) of the square matrix, is first broadcast along its column (see Fig. 13(a)). Second, the copy of S_p on the diagonal, call it S_{p,p}, is copied into a new subsequence (into a new set of registers, for example) T_p. Third, the subsequence T_p is broadcast along its row (see Fig. 13(b)). In this way, each element (l, p) of the square matrix is able to compare subsequence S_l with subsequence T_p (see Fig. 13(c)). All these distributions of subsequences within the square matrix take Θ(m) parallel time. Then, the merge itself has the following steps:

1. Counting. For each element A_i of subsequence S_l determine the number C_ip of elements in subsequence T_p which are: (a) no greater than A_i, if p < l; (b) to the left of A_i, if p = l; (c) less than A_i, if p > l. This computation is performed in block (l, p) of the square matrix and will be detailed shortly.

2. Ranking. The rank R_i of the element A_i is its final position in the sorted sequence. This is simply the sum (over the column) of the counts for that element, i.e., R_i = Σ_p C_ip. This can be done in Θ(m) time in a way similar to the broadcast operation. The ranks are stored in the diagonal blocks.

3. Routing. This is a spread step which routes all records within each diagonal block (p, p) of the square matrix to the nodes corresponding to their ranks. It is indeed a spread, since all the elements preserve their order after the routing. The parallel time to do this is, again, Θ(m).

4. Concentrate. All records are concentrated to the first row. The parallel complexity is O(m) and, as in the permutation case, the algorithm can be optimized to do this step only once, at the end.

Now, we detail how the counting is done. Each 2^r-block contains two sorted subsequences S and T. For each element S(i) in the block, we have to find the corresponding count R(i). There are three cases:

1. The block is a diagonal block. Then S and T are identical. The count, R, for the i-th S value in the block is i.

2. The block is at the left of the diagonal, that is, the block contains subsequences S_l and T_p with p > l. The count for the i-th S value is the number of values in T_p that are less than it.

3. The block is at the right of the diagonal (the block contains subsequences S_l and T_p with p < l). The count for the i-th S value is the number of values in T_p that are no greater than it.

For the last two cases a merge is used to determine the count. For a left block, a stable merge is used (elements that are equal do not change their initial relative order). For a right block, we need to modify the merge algorithm in order to take care of equal values. Let us consider a left block. Figure 14 shows two sequences S and T. It is easy to see that if the i-th S value is located in position j after the merge, then its R value is j − i. The third row of Figure 14 gives the R values for the S sequence.
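As a small concrete check of this counting rule (with example values of our own, not the ones of Figure 14), the count of the i-th S element in a left block is its position j in the stable merge minus i:

def left_block_counts(S, T):
    """For each S[i], count the T values strictly less than S[i] by a stable merge:
    if S[i] lands at position j in the merged sequence, its count is j - i."""
    merged = sorted([(v, 0, idx) for idx, v in enumerate(S)] +
                    [(v, 1, idx) for idx, v in enumerate(T)])   # ties: S before T
    counts = [None] * len(S)
    for j, (_, src, idx) in enumerate(merged):
        if src == 0:
            counts[idx] = j - idx
    return counts

S = [0, 4, 6, 9]
T = [1, 3, 4, 8]
print(left_block_counts(S, T))   # [0, 2, 3, 4]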

Figure 14: Sketch of Count: the S and T sequences before and after the merge, together with the resulting R values for the S sequence.

The S and T sequences are merged using Batcher's bitonic merge [7]. The S and T sequences in each 2^r-block form a sequence of length 2^{r+1} (if we consider the T sequence as following the S sequence). This sequence is not bitonic. However, it can be made bitonic by reversing the T sequence (a sequence of nondecreasing numbers followed by a sequence of nonincreasing numbers forms a bitonic sequence). A bitonic sequence of length 2^p can be arranged into nondecreasing order as follows:

for i := p - 1 downto 0 do
    compare pairs of elements 2^i apart and interchange them if the one on the left is larger than the one on the right
end

It is easy to see that this bitonic merge can be implemented on a CCC in Θ(p) time. Also, the T sequence (of length 2^r) can be reversed in Θ(r) time. Therefore, the entire counting step takes Θ(r) time.
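A direct sequential transcription of this compare-exchange loop, assuming the input sequence is already bitonic (helper names are ours):

def bitonic_to_sorted(a):
    """Sort a bitonic sequence of length 2^p in place using the compare-exchange
    pattern above: distances 2^(p-1), ..., 2, 1."""
    n = len(a)
    d = n // 2
    while d >= 1:
        for i in range(n):
            j = i + d
            if j < n and (i // d) % 2 == 0 and a[i] > a[j]:
                a[i], a[j] = a[j], a[i]
        d //= 2
    return a

S, T = [0, 4, 6, 9], [1, 3, 4, 8]
print(bitonic_to_sorted(S + T[::-1]))   # merge of S and T: [0, 1, 3, 4, 4, 6, 8, 9]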

Complexity of sort. As can be seen from the above, the complexity of a merge step is Θ(r + m), where r = 0, m, 2m, ..., n − m (there are n/m merge steps). This gives a total time complexity of Θ(k log N).

As an observation, we can compare this algorithm with the one that would be obtained by applying Batcher's merge sort directly for each merge step. Batcher's merge sort can be used to merge only two sequences (as opposed to 2^m sequences in the above algorithm). Therefore, we need log N merge phases. We start with sequences of length 2^0, then continue with sequences of length 2^1, ..., 2^n. To merge two sequences of length 2^r we need Θ(r) time. Therefore the total time complexity is Θ(1 + ... + n) = Θ(n²) = Θ(log² N), which is definitely worse than the algorithm presented before. (Of course, by using Batcher's merge, we would need only N processors.) This algorithm based on Batcher's bitonic merge was well known before.

5.2 Generalized Connection Networks

An (N, N) generalized connection network (GCN) is a switching network capable of connecting any subset of its N inputs to any subset of its N outputs. Important parameters:

- contact pairs - the total number of edges in the graph representing the GCN
- delay - the maximum number of edges on any input-output path
- setup time - the time needed to decide for each edge (switch) whether it is on or off for a given connection
- fan-in and fan-out - respectively, the indegree and the outdegree of the graph representing the GCN

An N × N cross-bar switch is an (N, N) GCN with O(N²) contact pairs and O(1) delay (the delay is O(log N) if the fan-in and the fan-out are bounded). Nassimi and Sahni propose an (N, N) GCN with O(k N^{1+1/k} log N) contact pairs and O(k log N) delay that is "easy" to set up. This means that it could be advantageous in situations where the switch settings have to be computed on-line (total delay = setup time + delay). There are many situations in database query processing when data needs to be redistributed (repartitioned) among disks during the execution of parallel join algorithms. The GCN described can perform better for "long" records by using pipelining (the second and subsequent words suffer only O(1) delay; this could be useful for disks, since the minimal unit of transfer for disks is one block). How can such an interconnection network be compared with the usual I/O buses commonly used to interconnect processors/memories and disks? Cost-performance issues need to be taken into account.

5.2.1 A New Generalized Connection Network

The GCN proposed by Nassimi and Sahni is constructed recursively as shown in Figure 15. As we will see, there is a strong correspondence between the components of the network and the "software" components of the permutation and sorting algorithms given before. There are N = 2^n inputs and outputs, m is a parameter in the range [1, n], and M = 2^m. A higher value for m results in a GCN with more edges and less delay. The fan-out and the fan-in of the GCN are restricted to 2. A (1, M)-generalizer is a full binary tree with M leaf vertices (and log M + 1 levels of vertices). Its switches are always on, since the function of the generalizer is to make M copies of its input. An (N, N/M)-concentrator has N inputs and N/M outputs. Its function is to connect any p of its inputs to its top p outputs, 0 ≤ p ≤ N/M. The connection of inputs to outputs is one-to-one and preserves the relative order. The construction of an (N, N/M)-concentrator is based on an (N, N)-concentrator, which is described first. An (N, N)-concentrator has N inputs and N outputs. Its purpose is to connect any subset of its inputs to a consecutive subset of its outputs. The connection has to be one-to-one and to preserve the relative order of the inputs (i.e., it has to be monotone). The (N, N)-concentrator can be described recursively, as suggested in Figure 16. There are log N + 1 columns of N vertices. If we denote by V^b(i) the vertex i in column b, 0 ≤ b ≤ log N, 0 ≤ i ≤ N − 1, then the edges are as follows: for each vertex V^b(i), 0 ≤ b < log N, there are two edges incident from it. One goes to V^{b+1}(i) and the other to V^{b+1}(i^(b)). The routing of the inputs to the outputs is exemplified in Figure 17 (only the on edges are shown). Each input i that has to be connected to some output has a rank R_i which indicates the corresponding output. We want to connect input i to output R_i. The path <i, R_i> is determined from left to right using the binary representation of R_i.

Figure 15: Recursive construction of an (N, N, M)-GCN: N (1, M)-generalizers, followed by M (N, N/M)-concentrators, each feeding an (N/M, N/M, M)-GCN.

Bit b of R_i determines the edge from column b to column b+1: a value of 0 means the upper edge, while a value of 1 means the lower edge (see Figure 17). As an aside, we have to check that there are no conflicts in an (N, N)-concentrator for any desired connection (this is not detailed in the paper of Nassimi and Sahni). The essential thing to remember is that the outputs must be consecutive (otherwise, conflicts can occur). Indeed, suppose we have a sequence of length at most N of consecutive numbers r_1, ..., r_m, not necessarily less than N. These are the ranks associated with a subset of the inputs of an (N, N)-concentrator. Consider the numbers r'_1, ..., r'_m obtained by keeping only the first n = log N bits (the least significant). This second sequence has the property that r'_{i+1} = (r'_i + 1) mod N (i.e., the numbers in the second sequence are pseudo-consecutive). We want to route, without conflict, each input with rank r_i to the output r'_i. Our (N, N)-concentrator is recursively constructed by interconnecting two (N/2, N/2)-concentrators (see Figure 16). Suppose r_1, ..., r_k correspond to inputs of the upper (N/2, N/2)-concentrator, while r_{k+1}, ..., r_m correspond to the lower (N/2, N/2)-concentrator, and consider q_1, ..., q_k and q_{k+1}, ..., q_m the two sequences obtained from r_1, ..., r_k and r_{k+1}, ..., r_m, respectively, by keeping only the first n−1 bits. These two new sequences are sequences of pseudo-consecutive numbers, too. Therefore, we can assume inductively that the two subsets of inputs corresponding to the two q-sequences can be routed without conflict in the corresponding two (N/2, N/2)-concentrators. Now, the only possibility of a conflict in the (N, N)-concentrator is that two outputs, one (with index q_i, 1 ≤ i ≤ k) coming out of the upper (N/2, N/2)-concentrator and the other (with index q_j, k+1 ≤ j ≤ m) coming out of the lower (N/2, N/2)-concentrator, are mapped to the same output of the (N, N)-concentrator. But this is possible only if q_i = q_j and the most significant bit of r'_i is the same as the most significant bit of r'_j. Moreover, the least significant n−1 bits of r'_i and r'_j have to be the same (since they are the binary representations of q_i and q_j). It follows immediately that r'_i = r'_j. But in the original sequence r_j > r_i. Therefore, it must be the case that r_j − r_i ≥ 2^n = N, which means that the original sequence has more than N elements, a contradiction. The base case, when n = 1, is trivial. This actually proves more than we need: any connection in which the outputs are pseudo-consecutive can be obtained without conflict. Now, we need to show how to obtain an (N, N/M)-concentrator from an (N, N)-concentrator. Only the top N/M outputs of the (N, N)-concentrator are needed, so we remove the edges (switches) that cannot lead to any of these outputs.
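The routing rule (bit b of R_i selects the edge leaving column b) is easy to trace in code. The sketch below (our own helper, with an example of our own choosing) follows each input through the log N columns of an (N, N)-concentrator and checks that no two paths share a vertex when the ranks are consecutive:

def concentrator_paths(ranks, n):
    """ranks: dict input_index -> desired output (rank).  For each input, return
    the sequence of vertices (column, row) visited; bit b of the (reduced) rank
    selects the row in column b+1."""
    N = 1 << n
    paths = {}
    for i, r in ranks.items():
        row, path = i, [(0, i)]
        for b in range(n):
            row = (row & ~(1 << b)) | ((r % N) & (1 << b))
            path.append((b + 1, row))
        paths[i] = path
    used = {}
    for i, path in paths.items():          # verify: no vertex used by two paths
        for v in path:
            assert used.setdefault(v, i) == i, "conflict at vertex %s" % (v,)
    return paths

# consecutive ranks 3, 4, 5 assigned to inputs 1, 4, 6 of an (8, 8)-concentrator
for i, p in concentrator_paths({1: 3, 4: 4, 6: 5}, n=3).items():
    print(i, "->", p)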

Figure 16: Recursive structure of an (N, N)-concentrator: inputs 0, ..., N/2−1 feed one (N/2, N/2)-concentrator and inputs N/2, ..., N−1 feed a second one.

Figure 17: A connection on an (8, 8)-concentrator (only the on edges are shown).

Figure 18 shows an (8, 2)-concentrator constructed from an (8, 8)-concentrator. We can see that an (8, 2)-concentrator is really four (2, 2)-concentrators connected by two full binary trees of height 3 (these binary trees are the reverse of (1, M)-generalizers). In general, we can obtain an (N, N/M)-concentrator from M (N/M, N/M)-concentrators connected together by N/M binary trees of height log M + 1 (see Figure 19). These trees are called (M, 1)-concentrators. All switches in an (M, 1)-concentrator need only one state, that is, on. (At most one of the inputs to an (M, 1)-concentrator will be active; the other inputs will not get connected through the preceding stage of (N/M, N/M)-concentrators. So, an (M, 1)-concentrator can safely connect all of its inputs to its output.) We have seen that an (N, N/M)-concentrator has log N levels of edges. When N > M, only the first log(N/M) levels (corresponding to the (N/M, N/M)-concentrators) need to be set up. The remaining log M levels (corresponding to the (M, 1)-concentrators) are always on. The setup rule for the case N = M is slightly different. From Figure 15 we see that an (N, N, N)-GCN (which is called an N × N crossbar) consists of a stage of (1, N)-generalizers followed by a stage of (N, 1)-concentrators. (The third stage of Figure 15 becomes null as N = M.) All switches in the (1, N)-generalizers are on, as always. Therefore, the (N, 1)-concentrator stage cannot have all of its switches on. Each (N, 1)-concentrator needs to connect at most one of its inputs to its output, as all of its inputs are now active. This is done by setting up the first level of edges in an (N, 1)-concentrator; the remaining log N − 1 levels of edges will still be on.

Figure 18: An (8, 2)-concentrator, obtained from an (8, 8)-concentrator by removing the switches that cannot reach the top two outputs.

To complete the recursive construction of the (N, N, M)-GCN, we need to specify the GCN for 1 ≤ N < M. When 1 < N < M, the (N, N, M)-GCN is replaced by an (N, N, N)-GCN. And a (1, 1, M)-GCN is a null switch, as stated earlier. In summary, the (N, N, M)-GCN consists of ⌈n/m⌉ stages of generalizers followed by concentrators. The correspondence between the GCN and the permutation algorithm described earlier is easily seen. Suppose that the GCN is to be used to perform a permutation on its inputs. The (1, M)-generalizers correspond to the spread along the column in the permutation algorithm. (The M outputs of a generalizer may be regarded as the PEs in a column of a CCC.) The (N, N/M)-concentrators perform the concentration of records along each row of N = 2^n PEs to groups of N/M = 2^{n−m} PEs. From this point on, the GCN (and the permutation algorithm) routes the N/M outputs of each (N, N/M)-concentrator independently.

Complexity of the network. We are interested in the delay, D(n, m), and the total number of edges, E(n, m), of an (N, N, M)-GCN. The delay in a (1, M)-generalizer is the height of the binary tree, i.e., log M = m. The delay in an (N, N/M)-concentrator is the delay in an (N/M, N/M)-concentrator plus the delay in an (M, 1)-concentrator, that is, log(N/M) + log M = log N = n. Therefore we obtain the following recurrence for the entire GCN:

D(n, m) = m + n + D(n − m, m), if n > m
D(n, m) = 2n, if n ≤ m

It can be shown that D(n, m) ≤ (1/2)(⌈n/m⌉ + 3)n, with equality when n is divisible by m. Therefore, D(n, m) = O(k log N), where k = n/m.

For the number of edges, we observe that each (1, M)-generalizer and each (M, 1)-concentrator is a full binary tree with 2 + 2² + ... + 2^m = 2(M − 1) edges. For an (N, N)-concentrator we can write the recurrence corresponding to its recursive structure and obtain 2N log N edges. Thus, an (N/M, N/M)-concentrator has 2(N/M) log(N/M) edges. So, from Figure 19 we see that the number of edges in an (N, N/M)-concentrator is M · 2(N/M) log(N/M) + (N/M) · 2(M − 1) = 2N log(N/M) + 2(M − 1)N/M. Since there are N (1, M)-generalizers and M (N, N/M)-concentrators in the first stage, the recurrence for our GCN is:

E(n, m) = 4N(M − 1) + 2MN log(N/M) + M E(n − m, m), if n > m
E(n, m) = 4N(N − 1), if n ≤ m

It can be shown that E(n, m) < ⌈n/m⌉(log(N/M) + 4)MN. We obtain, immediately, that E(n, m) = O(kN^{1+1/k} log N).
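The two recurrences are easy to evaluate numerically. The short script below (our own helper functions, with the recurrences written exactly as above) prints the delay and the edge count for a few parameter choices and checks them against the stated bounds:

import math

def delay(n, m):
    """D(n, m) = m + n + D(n - m, m) for n > m, and 2n for n <= m."""
    return 2 * n if n <= m else m + n + delay(n - m, m)

def edges(n, m):
    """E(n, m) = 4N(M - 1) + 2MN log(N/M) + M E(n - m, m) for n > m, 4N(N - 1) otherwise."""
    N, M = 1 << n, 1 << m
    if n <= m:
        return 4 * N * (N - 1)
    return 4 * N * (M - 1) + 2 * M * N * (n - m) + M * edges(n - m, m)

for n, m in [(12, 3), (12, 4), (12, 6), (12, 12)]:
    N, M, k = 1 << n, 1 << m, math.ceil(n / m)
    d, e = delay(n, m), edges(n, m)
    assert d <= (k + 3) * n / 2              # delay bound
    assert e < k * ((n - m) + 4) * M * N     # edge-count bound
    print(f"n={n:2d} m={m:2d}: k={k} delay={d:3d} edges={e}")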

31

(N/M, N/M)concentrator

(M, 1)concentrator

(N/M, N/M)concentrator

(M, 1)concentrator

. . .

. . .

(N/M, N/M)concentrator

(M, 1)concentrator

Figure 19: An (N; N=M )-concentrator.

Setup algorithm for one-to-one connections. As stated earlier, the structure of the GCN parallels very closely

the permutation algorithm for a CCC. An (N, N, M)-GCN can be set up to perform connections which are one-to-one (in particular, permutations) by running the permutation algorithm on a CCC with M × N PEs. We will show this on a concrete example. Consider the (4, 4, 2)-GCN of Figure 20. Input i has to be routed to output A_i. First we describe the setup for the first stage of the (4, 4, 2)-GCN, i.e., the part of the network before the two (2, 2, 2)-GCNs. This parallels the first phase of the permutation algorithm. The dotted-line edges are not real edges in the network; their endpoints coincide. The (1, 2)-generalizers have all edges in the on state, as stated earlier. Therefore, their inputs are copied to the outputs, as shown in the figure. This corresponds to a broadcast along each column of the 2 × 4 array of PEs. The upper (4, 2)-concentrator corresponds to the first row of PEs in the 2 × 4 array, while the lower (4, 2)-concentrator corresponds to the second row. The correspondence between the vertices of the (4, 2)-concentrators and the PEs is shown in the figure. Recall that in the permutation algorithm, each record in each column is routed to the row with index equal to the most significant digit in radix 2^m. In our case, m = 1, thus each record has to be routed to exactly one of the two rows of the array. In GCN terms, this means that only one of the two copies of each input has to be propagated farther in the network (the input is propagated in the upper (4, 2)-concentrator if the most significant bit is 0 and in the lower (4, 2)-concentrator if the most significant bit is 1). This is done in the following way: each vertex in the first level of vertices of the (4, 2)-concentrators has a corresponding P_j and an input i. If the most significant bit of i (bit 1 of i) differs from the most significant bit of j (bit 2 of j), then disable both edges going out of that vertex. The result is shown in Figure 21. Notice that after this, the inputs are already grouped in the "correct" groups of N/M. We only need to route them out to the next level of (2, 2, 2)-GCNs. To do this, the first n − m iterations (1 in our case) of the concentration along the row in the permutation algorithm are enough. (Recall that the concentration algorithm worked by iterating over the bits b from 0 to n − 1, routing each record to the processor that agreed on bit b with the destination index (rank) of the record. But here the inputs are already grouped, so we need only fix the least significant n − m bits.) Iteration b of concentrate determines the settings for switches from level b of vertices to level b + 1 of vertices in the (N, N/M)-concentrators, based on the ranks computed by the CCC. In our example, we need only one iteration (with b = 0), after the ranks of the four inputs are computed.

Figure 20: An example setup for a (4, 4, 2)-GCN: input i is to be routed to output A_i.

Figure 21: An example setup for a (4, 4, 2)-GCN (cont.).

33

Ai

3 P0

3

1

1 0 1 2

0

P1 P2

0

(2,2,2)-GCN

P3

3 P4 3

1 P5 0 2

2

P6

(2,2,2)-GCN

P7

2

For the vertex corresponding to P1, we see that the rank of the input is 0 (so bit 0 is 0, different from bit 0 of P1), therefore we need to route the input to P0. For the vertex corresponding to P2, the rank is 1 (bit 0 is 1, different from bit 0 of P2), therefore we need to route the input to P3. The other vertices are handled similarly. The result is shown in Figure 22 (only the on edges are shown). The setup algorithm continues recursively with the two (2, 2, 2)-GCNs; we do not detail it further. In conclusion, an (N, N, M)-GCN can be set up by running the permutation algorithm on an M × N CCC in time O(k log N).

6 Conclusion

We have seen that there is a large variety of algorithmic results that can be put together to work on a parallel disk architecture. The I/O-optimal sorting algorithm, Greed Sort, and other I/O-optimal algorithms for some important computational problems (list ranking, connected components, minimum spanning tree, lowest common ancestor, etc.) have been described. There exist good techniques that seem applicable to designing I/O-efficient algorithms for other problems as well. How to combine all these things together to obtain suitable models for different applications would be the next step to investigate. We have seen, for example, that Greed Sort can be easily adapted for a shared-nothing multiprocessor / parallel disk architecture, preserving the same I/O complexity. Any network connecting the processors which is capable of performing sufficiently fast (internal-memory) sorting would be suitable for the purpose of running Greed Sort on such a shared-nothing architecture. In particular, an M × N hypercube having N disks connected to the first row of N CPUs would probably be sufficient (a PSC would be even better, since the number of links in the interconnection network is smaller than in the case of a hypercube, and therefore cheaper). Generalized connection networks also seem very attractive for interconnecting the CPUs/parallel disks in this model, in particular the one proposed by Nassimi and Sahni, since it can be reconfigured quickly. Blocks of records can be moved from one disk to another through the internal memories of the CPUs that control those disks, and through the GCN. The cost of such a move would have two components: the cost of the two I/O operations, and the cost of the network communication. A good cost model taking into account both the I/O and network components as well as the CPU component for a shared-nothing architecture would be useful.

Figure 23: Multi-processor multi-disk architectures: the shared-disk architecture (shared-memory shared-disk and distributed-memory shared-disk, with hosts reaching the disks through I/O processors connected by a fast interconnection network) and the distributed-disk architecture (shared-memory distributed-disk and distributed-memory distributed-disk, as in shared-nothing parallel databases and the TickerTAIP parallel RAID).

Other architectures might be considered as well: shared-memory multiprocessors are very common these days, and connecting them to a parallel disk system will certainly be needed. A more pictorial overview of the possible multi-processor multi-disk architectures is given in Fig. 23. We summarize here what seem to be their main advantages and disadvantages:

- Shared-disk architecture:
  - Maintenance of redundancy is easy
  - Data can be directly transferred from one disk to another without going through any host
  - Desired: a fast connection between any subset of I/O processors and any subset of disks (the interconnection network of Nassimi and Sahni can come into play here!)
- Distributed-disk architecture:
  - Redundancy is a problem!
- Greed Sort: easily adapted to both types of architectures if the parallel computer can perform fast sorting

As a simple and attractive application of these ideas, Fig. 24 describes how the interconnection network of Nassimi and Sahni can be used to connect parallel disks to an existing hypercube architecture. We end with a list of open questions which we believe are worthwhile to investigate.

- It might be fruitful to apply these ideas (especially the algorithms part) to the parallel databases area.
- How can data placement be done by the application while still being able to maintain redundancy?
- What is the best architecture when multiprocessors are connected to parallel disks?
  - commercial shared-memory multiprocessors?

Figure 24: A hypercube with parallel disks: the disks are attached through I/O processors connected by Nassimi and Sahni's network to an N^{1/k} × N hypercube (PSC).

- How can we maintain redundancy when distributed disks are involved?
- For some of these architectures, good cost models would need to include:
  - CPU cost
  - network cost
  - I/O cost
- Algorithms need to be redesigned for some of these models

References

[1] A. Aggarwal and J. S. Vitter. 1988. The input/output complexity of sorting and related problems. Comm. ACM, 31, 9 (Sept.), pp. 1116-1127.

[2] P. Cao, S. B. Lim, S. Venkataraman, and J. Wilkes. 1994. The TickerTAIP Parallel RAID Architecture. ACM Trans. on Computer Systems, Vol. 12, No. 3, Aug. 1994, pp. 236-269.

[3] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. 1994. RAID: high-performance, reliable secondary storage. ACM Computing Surveys, Vol. 26, No. 2, June 1994, pp. 145-185.

[4] Y. J. Chiang, M. T. Goodrich, E. F. Grove, R. Tamassia, D. E. Vengroff, and J. S. Vitter. 1995. External-memory graph algorithms. 6th ACM-SIAM Symp. on Discrete Algorithms (SODA), 1995.

[5] G. Graefe. 1993. Query Evaluation Techniques for Large Databases. ACM Computing Surveys, Vol. 25, No. 2, June 1993, pp. 73-170.

[6] J. JaJa. 1992. An Introduction to Parallel Algorithms. Addison-Wesley Publishing Company, Inc.

[7] D. E. Knuth. 1973. The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass., 1973.

[8] T. Leighton. 1985. Tight bounds on the complexity of parallel sorting. IEEE Trans. Comput. C-34, 4 (Apr.), pp. 344-354.

[9] D. Nassimi and S. Sahni. 1982. Parallel permutation and sorting algorithms and a new generalized connection network. Journal of the ACM, Vol. 29, No. 3, July 1982, pp. 642-667.

[10] M. H. Nodine and J. S. Vitter. 1995. Greed sort: optimal deterministic sorting on parallel disks. Journal of the ACM, Vol. 42, No. 4, July 1995, pp. 919-933.

[11] D. A. Patterson, G. Gibson, and R. H. Katz. 1988. A case for redundant arrays of inexpensive disks (RAID). Proceedings of the ACM SIGMOD International Conference on Management of Data (Chicago, Ill., June 1-3), ACM, New York, pp. 109-116.

[12] J. S. Vitter and E. A. M. Shriver. 1990. Optimal disk I/O with parallel block transfer. Proceedings of the 22nd Annual ACM Symposium on Theory of Computing (Baltimore, Md., May 12-14), ACM, New York, pp. 159-169.

