Parallel sorting algorithms for declustered data

E. Schikuta
Institute of Applied Computer Science and Information Systems, Department of Data Engineering, University of Vienna
A-1010 Vienna, Rathausstr. 19/4, Austria
[email protected]

Abstract

Sorting is one of the most important operations in database systems, and its efficiency can drastically influence the overall system performance. To speed up database systems, parallelism is applied to the execution of the data administration operations. Current parallel database systems use conventional parallel hardware architectures, which employ a highly parallel software architecture. Thereby a common technique is to decluster the data sets among a number of parallel and independent disk drives. In this paper we revisit parallel sorting algorithms in the context of declustered data of parallel database systems. We adapt the well-known and well-studied parallel sort-merge and bitonic-sort algorithms for declustered data and compare their performance analytically. We show that the adapted bitonic sort outperforms the adapted sort-merge algorithm for declustered data, in resemblance to the results of their conventional counterparts.

Keywords: Parallel Database Systems, Parallel Sorting, Declustered Data Set, Algorithm Analysis

1 Introduction

Since the development of highly specialized database machines has failed and these systems were replaced by conventional parallel hardware architectures, research on the design of parallel software architectures has gained an important role in database research in the last few years.

Generally, parallelism is employed in parallel database systems by declustering [5] and/or operator parallelization [7], as inter-operator and intra-operator parallelism. Inter-operator parallelism is realized by executing different operators of a partitioned query execution plan in parallel. In contrast, intra-operator parallelism executes the same operation in parallel on a partitioned data set among multiple processors, as is well known from the SPMD (single program multiple data) paradigm.

Sorting is a very important operation in a database system. It is one of the most frequently executed operators in query execution plans generated by database query optimizers. Therefore its performance influences dramatically the performance of the whole database system [8], [9]. Generally, sort algorithms can be divided into internal and external algorithms [4]. An external sorting algorithm is necessary if the data set is too large to fit in main memory. Obviously this is the common case in database systems. Sorting algorithms are very well analyzed and examined for sequential architectures. Internal sorting algorithms for parallel architectures have also been investigated (e.g. [2], [11], [10]), but the topic of external sorting for parallel computer architectures has not yet received adequate consideration. An analytical approach by the author of this paper can be found in [12], which sketches the parallel algorithmic design approach but neglects a thorough algorithm definition and performance analysis. Also the effect of pipelined parallelism is not touched.

The structure of the paper is as follows. In section 2 the model of the underlying parallel hardware architecture is described, and in the following section the notion of a declustered data set common to parallel database machines. The 4th section gives the basic idea of a declustered sorting approach and introduces the notation and metric of the analysis.
The common parallel external sorting algorithms, parallel binary sort-merge [3] and block bitonic sort [1], are defined. Further, the variations of these algorithms for declustered data sets are presented and their performance is analyzed. Section 5 gives the results of the performance comparison. Finally, in section 6 these results are discussed and conclusions for the design of database query optimizers are drawn.

[Figure 1: Parallel architecture model — processing units P1, P2, ..., Pn connected via an interconnection network to the mass storage system]

2 Architectural Framework

It was our aim to put our analytical framework on a very general basis. Our parallel computer model consists of only three describing components:

- processing units (P1, P2, ..., Pn),
- disks (mass storage system), and
- an interconnection network

Figure 1 gives a sketch of our framework. It covers all types of physical disk architectures. Generally 2 different types can be distinguished, local and global disks. On the one hand, each processor is physically connected with one local disk. This can be found in a number of systems, such as the IBM SP-1 or the iPSC/2 hypercube. On the other hand, systems like the Intel Paragon have a global (common) set of disks for all processors. A very important assumption in our model is that the relations of the database system are too large to fit into the total main memory of the processing units. The consequence is that sorting has to be done externally

[Figure 2: Range declustering — contiguous key ranges (e.g. A-F, G-J, K-L, M-R, S-Z) stored on Disk1 through Disk5]

and the I/O costs are an influencing factor on the system performance.

3 Declustered Data Sets

Declustering of a relation is the distribution of the tuples of a relation to two or more disk drives, according to one or more attributes and a distribution criterion. Declustering allows a database system to increase the bandwidth of the I/O operations by reading and writing multiple disks in parallel. Three basic declustering schemes can be distinguished:

- range declustering (contiguous parts of a data set are stored on the same disk)
- round-robin declustering (records are spread consecutively among the disks)
- hash declustering (the location of a record is defined by a hash function)

Figure 2 gives an example of range declustering. The records belonging to a compact interval of one attribute domain are stored on the same disk. In the following we will exploit the range declustering property and develop sorting algorithms based on this paradigm. Further, we will compare their performance with their conventional parallelized versions.
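The three schemes can be sketched as small mapping functions from a record to a disk identifier. The helper names and the toy boundaries below are illustrative, not from the paper:

```python
# Sketch of the three basic declustering schemes: each maps a record
# (or its position) to a disk identifier.

def range_decluster(key, boundaries):
    """Range declustering: contiguous key intervals map to the same disk.
    `boundaries` holds the upper bound of each disk's interval."""
    for disk, upper in enumerate(boundaries):
        if key <= upper:
            return disk
    return len(boundaries)  # keys above the last boundary go to the last disk

def round_robin_decluster(record_position, num_disks):
    """Round-robin declustering: consecutive records cycle through the disks."""
    return record_position % num_disks

def hash_decluster(key, num_disks):
    """Hash declustering: a hash function determines the record's disk."""
    return hash(key) % num_disks

# Example: range declustering of names over 5 disks, as in figure 2
boundaries = ["F", "J", "L", "R"]        # disks 0..4 hold A-F, G-J, K-L, M-R, S-Z
print(range_decluster("D", boundaries))  # disk 0 (interval A-F)
print(range_decluster("Q", boundaries))  # disk 3 (interval M-R)
print(round_robin_decluster(7, 5))       # disk 2
```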

4 Parallel Sorting Algorithms

Besides the approach to design and implement a specialized parallel algorithm to reach higher performance, a much simpler approach is possible with declustered data sets. The basic idea is to exploit the inherent parallelism of declustered data sets. This enables the usage of sequential operators in parallel instead of a parallelized operator (similar to the SPMD paradigm).

The range declustering schema partitions the value space of the records into disjoint hyperplanes R1, R2, ..., Rn. The data records are grouped according to the declustering schema with respect to the attribute values in one or a number of the domains [6]. The declustering schema is expressed by a function d(r), which calculates the disk identifier for each data record. The following properties are valid:

    d(r) → i, where r ∈ R                         (declustering function)
    Ri = {x | x ∈ R ∧ d(x) = i}                   (declustered sets R1...Rn)
    R = ∪i Ri and Ri ∩ Rj = ∅ for i ≠ j           (disjoint property)

The disjoint property of the declustered sets Ri is well suited for the development of intra-operational parallel algorithms, where multiple processors are working on the same relational operation but each with a different set of tuples.

We chose the parallel sort-merge and bitonic-sort algorithms analyzed in [3] because of their clear design and their well-studied properties, which lead to broad acceptance in practice. Bitton et al. proved that the bitonic sort generally outperforms the sort-merge algorithm. This is caused by the disadvantageous sub-optimal phase, where not all available processors can be used by the algorithm.

The disjoint property of the declustered data sets can be used to define two specialized algorithms based on the ideas mentioned in [3]. The declustered variants of the algorithms sort the hyperplanes separately in parallel. The sorted sets are then joined according to the inverse declustering function. Thus the whole value space is sorted. The basic idea is described in figure 3. The linear join phase will be neglected in the analysis of the algorithms, because the join phase can be executed as a by-product of the sorting phase and consumes no additional runtime expenses.
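The partition properties of the declustering function can be checked mechanically. The sketch below (toy relation and declustering function of our own choosing) builds the declustered sets Ri and verifies the disjoint property:

```python
# Verify the declustering properties: the sets Ri partition R.
# (Illustrative sketch; the relation and declustering function are made up.)

R = list(range(20))              # a toy relation of 20 records
num_disks = 4
d = lambda r: r % num_disks      # a round-robin declustering function

# Build the declustered sets Ri = {x in R | d(x) = i}
Ri = {i: {x for x in R if d(x) == i} for i in range(num_disks)}

# Disjoint property: R is the union of the Ri, and the Ri are pairwise disjoint
assert set().union(*Ri.values()) == set(R)
assert all(Ri[i].isdisjoint(Ri[j])
           for i in range(num_disks) for j in range(num_disks) if i != j)
print("declustered sets form a partition of R")
```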

FOR all hyperplanes DO PARALLEL
    bitonic-sort(hyperplane) or sort-merge(hyperplane)
END
JOIN LINEARLY(all sorted hyperplanes)

Figure 3: Algorithm schema for declustered data sets
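A minimal runnable sketch of this schema (function names are our own; Python's built-in sort stands in for the per-hyperplane bitonic-sort or sort-merge): because range-declustered hyperplanes cover disjoint, ordered key ranges, concatenating the sorted hyperplanes in range order yields the globally sorted relation.

```python
from concurrent.futures import ThreadPoolExecutor

def sort_declustered(hyperplanes):
    """Sort each hyperplane independently, then join linearly.

    `hyperplanes` is a list of record lists, one per disk, produced by range
    declustering: hyperplane i holds strictly smaller keys than hyperplane
    i+1, so concatenating the sorted hyperplanes gives the global order."""
    with ThreadPoolExecutor() as pool:        # FOR all hyperplanes DO PARALLEL
        sorted_planes = list(pool.map(sorted, hyperplanes))
    joined = []                               # JOIN LINEARLY
    for plane in sorted_planes:
        joined.extend(plane)
    return joined

planes = [[3, 1, 2], [6, 4, 5], [9, 7, 8]]    # range-declustered toy relation
print(sort_declustered(planes))               # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```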

4.1 Parameters of the analysis

The metric for the definition of the analytical formulas for the execution times of the algorithms is the C2p unit. The C2p unit is the basic notation in [3]. It is a "2-page" operation consisting of reading 2 sorted pages from disk, merging them and writing the resulting block of the size of 2 pages back to disk. It is defined by

    C2p = 2Cr + Cm + 2Cw        (1)

where Cr is the average cost of reading a sorted page into main memory, Cm of merging the records in sorted order and Cw of writing the resulting sorted pages to the disks. The definition of the C2p unit as the basic metric is very convenient for the analysis. All algorithms are reducible to the basic processor operation of merging 2 ordered sequences of disk pages into 1 ordered sequence of double length. Such a sequence is called a `run'.
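The basic processor operation that the C2p metric charges for can be sketched as follows (a toy model in which pages are small sorted lists; the names and the page size are illustrative):

```python
import heapq

PAGE_SIZE = 4  # records per page (arbitrary toy value)

def merge_runs(run_a, run_b):
    """Merge two sorted runs (lists of sorted pages) into one run of
    double length. Each pair of pages processed corresponds to one
    C2p unit: 2 page reads + merge + 2 page writes."""
    records = heapq.merge(
        (r for page in run_a for r in page),
        (r for page in run_b for r in page))
    merged = list(records)
    # repaginate the merged record stream into sorted pages
    return [merged[i:i + PAGE_SIZE] for i in range(0, len(merged), PAGE_SIZE)]

run_a = [[1, 4, 6, 9], [11, 12, 15, 20]]
run_b = [[2, 3, 5, 7], [8, 10, 13, 14]]
print(merge_runs(run_a, run_b))
# [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 20]]
```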

4.2 Conventional parallel sorting

In the following section the algorithms are concisely defined. The commonly known parallel sort-merge and bitonic sort algorithms are only sketched. (For a more comprehensive discussion see the references.) In the analysis of their performance the number of processors is denoted by p and the number of disk pages by n.

4.2.1 Parallel binary sort merge

The parallel binary sort-merge algorithm (SM) can be divided into a sub-optimal, an optimal and a post-optimal phase, according to whether not enough, sufficiently many or too many processors are available during the execution of the algorithm. In the following, x = n/p denotes the number of pages per processor and ld the logarithm to base 2.

During the sub-optimal phase pairs of longer and longer runs (starting with length 1) are merged until the number of runs is equal to twice the number of processors. There are ld(n/2p) stages and each stage needs n/2p C2p operations per processor, therefore

    suboptimal(x) = (x/2) ld(x/2)        (2)

In the optimal phase each processor merges exactly 2 runs of length x/2, which results in

    optimal(x) = x/2        (3)

The post-optimal phase merges the remaining runs into one sorted run. During this phase pipelining of the merging of the runs can be employed to increase performance. We will also analyze a "blocked" version, where the merging of runs starts only after the preceding merges have finished. This gives the possibility to exploit idle processors to sort other hyperplanes of the declustered data set. The time expenses for this phase can be calculated by

    postoptimal_pipelined(x) = (x p)/2 + 2 (ld(p) - 1)        (4)

    postoptimal_blocked(x) = x + 2x + ... + (p/2) x = x (p - 1)        (5)

For the pipelined version the whole costs are therefore

    SMP(x) = suboptimal(x) + optimal(x) + postoptimal_pipelined(x)        (6)

and analogously for the blocked version

    SMB(x) = suboptimal(x) + optimal(x) + postoptimal_blocked(x)        (7)

4.2.2 Block bitonic sort
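The phase costs above can be tabulated in a few lines. Note that the equations are reconstructed here from a partly illegible source, so the code is a sketch of the cost model, not the authoritative formulas:

```python
from math import log2 as ld

# Cost model of the parallel binary sort-merge in C2p units,
# following the reconstructed equations (2)-(7); x = n/p pages per processor.

def suboptimal(x):
    return (x / 2) * ld(x / 2)                 # eq. (2)

def optimal(x):
    return x / 2                               # eq. (3)

def postoptimal_pipelined(x, p):
    return x * p / 2 + 2 * (ld(p) - 1)         # eq. (4)

def postoptimal_blocked(x, p):
    return x * (p - 1)                         # eq. (5)

def SMP(x, p):
    return suboptimal(x) + optimal(x) + postoptimal_pipelined(x, p)  # eq. (6)

def SMB(x, p):
    return suboptimal(x) + optimal(x) + postoptimal_blocked(x, p)    # eq. (7)

# Example: n = 32 pages on p = 4 processors, i.e. x = 8
print(SMP(8, 4))   # 8 + 4 + 18 = 30.0 C2p units
print(SMB(8, 4))   # 8 + 4 + 24 = 36.0 C2p units
```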

The block-bitonic sort algorithm (BS) consists of a parallel comparison-exchange operation and a transfer operation. This is accomplished by one processor, which processes 2 pages and creates a `lower' and a `higher' page of the sorted 2-page block. All processor units have to be interconnected by a perfect shuffle interconnection schema. The preprocessing stage to produce 2p blocks (the maximum for the bitonic phase) of length n/2p can be performed by a parallel binary sort-merge. The costs consist therefore of a sub-optimal phase and the bitonic phase, which is

    bitonic(x) = (x/2) ld(2p) (ld(2p) + 1) / 2        (8)

This results in a total cost of

    BS(x) = suboptimal(x) + bitonic(x) = (x/2) ld(x/2) + (x/2) ld(2p) (ld(2p) + 1) / 2        (9)
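The bitonic cost can be sketched in the same way (again following the equations as reconstructed here, so treat the constants as an assumption): the bitonic merge network on 2p blocks has ld(2p)(ld(2p)+1)/2 comparison-exchange stages, and each stage costs x/2 C2p units per processor.

```python
from math import log2 as ld

# Cost of the block bitonic sort in C2p units, following the
# reconstructed equations (8)-(9); x = n/p pages per processor.

def suboptimal(x):
    return (x / 2) * ld(x / 2)                  # preprocessing phase, eq. (2)

def bitonic(x, p):
    stages = ld(2 * p) * (ld(2 * p) + 1) / 2    # stages of the bitonic network
    return (x / 2) * stages                     # eq. (8)

def BS(x, p):
    return suboptimal(x) + bitonic(x, p)        # eq. (9)

# Example: n = 32 pages on p = 4 processors, i.e. x = 8
print(BS(8, 4))   # 8 + 4 * 6 = 32.0 C2p units
```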

4.3 Parallel sorting for declustered data

The following section defines declustered versions of the above presented algorithms, employing the parallelization idea of section 2, and presents the formulas for the execution costs.

4.3.1 Declustered-Sort-Merge

We distinguish between a declustered pipelined and a declustered blocked version of the sort-merge algorithm (DSMP and DSMB, respectively). According to the idea presented in the algorithm schema, each hyperplane of the declustered data set is sorted independently. We assume without loss of generality an even record distribution. Therefore each hyperplane has approximately the same size of n/d pages. This leads for the pipelined case to a cost function of

    DSMP(x) = d SMP(x/d)        (10)

The number of hyperplanes of the declustered data set is denoted by d. An assumption is that the number of pages of a hyperplane is larger than the number of available processors, which seems appropriate for a typical database relation. From this it follows that the sorting of the hyperplanes has to be done sequentially (i.e. the multiplier d).

In the blocked SM version fewer processors are necessary during the post-optimal phase. The execution flow of one post-optimal phase can be represented schematically by a graph shaped as a triangle standing on one angle, where the end of the phase is represented by the bottom angle of the triangle (figure 4). For each hyperplane such an execution flow triangle exists. These triangles can be executed in parallel, dependent only on the available number of processors. In figure 5 the parallel processing of 2 hyperplanes is shown. The work load of the processors in the post-optimal phase of the DSMB is better balanced than in the pipelined version. Every time a processor is freed by the post-optimal phase of one hyperplane, it can be used in the post-optimal phase of another hyperplane. Two post-optimal phases can use a given number of processors rather equally over the time. It can easily be seen that the number of execution levels increases, for any number of processors, constantly by 2 to sort two hyperplanes in parallel. Therefore the calculation time of the post-optimal phase of the DSMB algorithm is approximately divided by 2 (besides an increase by the constant factor 2, which can be neglected for a large number of processors). The execution time of the SMB with parallel post-optimal phase can be expressed by

    SMB(x) = suboptimal(x) + optimal(x) + postoptimal_blocked(x)/2        (11)

Thus the total execution time is

    DSMB(x) = d SMB(x/d)        (12)

[Figure 4: Post-optimal phase — execution-flow triangle of the processors over the flow of data pages]

4.3.2 Declustered-Bitonic-Sort

Analogously to (12) a formula for the declustered bitonic sort (DBS) can be developed, which is

    DBS(x) = d BS(x/d)
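Putting the pieces together, the declustered cost functions simply wrap the conventional ones with the sequential multiplier d for the d hyperplanes. The sketch below carries the same caveat as the earlier ones: the underlying equations are reconstructions, so the concrete numbers illustrate the model, not the paper's exact figures.

```python
from math import log2 as ld

# Reconstructed conventional cost functions in C2p units (x = n/p)
def suboptimal(x):         return (x / 2) * ld(x / 2)
def optimal(x):            return x / 2
def post_pipelined(x, p):  return x * p / 2 + 2 * (ld(p) - 1)
def post_blocked(x, p):    return x * (p - 1)
def SMP(x, p):             return suboptimal(x) + optimal(x) + post_pipelined(x, p)
def BS(x, p):
    return suboptimal(x) + (x / 2) * ld(2 * p) * (ld(2 * p) + 1) / 2

# Declustered variants for d hyperplanes, sorted one after the other:
def DSMP(x, p, d):
    return d * SMP(x / d, p)                    # eq. (10)

def SMB_parallel_post(x, p):                    # eq. (11): halved post-optimal phase
    return suboptimal(x) + optimal(x) + post_blocked(x, p) / 2

def DSMB(x, p, d):
    return d * SMB_parallel_post(x / d, p)      # eq. (12)

def DBS(x, p, d):
    return d * BS(x / d, p)

# Example: x = 16 pages per processor, p = 4 processors, d = 2 hyperplanes
print(DSMP(16, 4, 2))   # 2 * SMP(8, 4) = 2 * 30 = 60.0
print(DBS(16, 4, 2))    # 2 * BS(8, 4)  = 2 * 32 = 64.0
```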

[Figure 5: Parallel processing of 2 hyperplanes in the post-optimal phase — ld(p) + 1 levels for sorting hyperplane 1, 2 additional levels for sorting hyperplane 2; the interleaved entries mark the processing of pages of hyperplane 1 and hyperplane 2]

5 Results

All presented algorithms are bound to the main term of (n/p) ld(n/2p) C2p time units. Thus, when O(p) ≤ O(ld n), an optimal speedup is reached in comparison to a sequential external merge sort. Comparing the execution times in C2p units of the three conventional parallel algorithms (SMP, SMB and BS) and the 3 adapted declustered algorithms (DSMP, DSMB and DBS, respectively), the following relation can be discovered:

    DSMB(n) > SMB(n) > SMP(n) > DSMP(n) > BS(n) > DBS(n)

The actual results are shown in figure 6. An analytic proof by a mathematical comparison of the formulas gives the same result. Thus the analysis shows that the DBS algorithm outperforms all other algorithms.

6 Discussion and Conclusion

[Figure 6: Graphical comparison of the sorting algorithms — execution time in C2p units versus the number of pages (in 1000s) for SMP(x), SMB(x), BS(x), DBS(x), DSMP(x) and DSMB(x)]

The variants of the bitonic sort are superior to the sort-merge algorithms for normal and declustered data sets. This extends the result in [3] to declustered data sets too. An important result is that the declustered version of the bitonic sort outperforms the conventional one. This shows that it is advantageous to exploit the inherent parallelism of declustered data sets. In spite of the execution of 2 post-optimal phases in parallel, the declustered sort-merge has a worse performance than its pipelined variants (the conventional and the declustered ones). These are important results for the construction of query execution plans in parallel database systems to choose the right sorting method. Obviously the bitonic sort variants guarantee better performance. Dependent on the data distribution (conventional or declustered) the most suitable algorithm has to be chosen, BS or DBS respectively. Summing up, it could be shown that a better performance can be reached by sorting based on intra-operational parallelism than on inter-operational parallelism. This corresponds to similar results stated in [7].

References

[1] K.E. Batcher, Sorting networks and their applications, Proc. of the 1968 Spring Joint Computer Conference (Atlantic City, NJ, Apr. 30 - May 2), vol. 32, AFIPS Press, Reston, VA, pp. 307-314
[2] G. Baudet, D. Stevenson, Optimal Sorting Algorithms for Parallel Computers, IEEE Transactions on Computers, 27, 1, pp. 84-87, January 1978
[3] D. Bitton, H. Boral, D.J. DeWitt, W.K. Wilkinson, Parallel algorithms for the execution of relational database operations, ACM Trans. Database Systems, 8, 3, pp. 324-353, 1983
[4] D. Bitton, D. DeWitt, D.K. Hsiao, J. Menon, A Taxonomy of Parallel Sorting, ACM Computing Surveys, 16, 3, pp. 287-318, September 1984
[5] D.J. DeWitt, J. Gray, Parallel Database Systems: The Future of Database Processing or a Passing Fad?, SIGMOD Record, 19, 4, Dec. 1990
[6] S. Ghandeharizadeh, D.J. DeWitt, W. Qureshi, A performance analysis of alternative multiattribute declustering strategies, Proc. of SIGMOD 92
[7] W. Hong, M. Stonebraker, Optimization of Parallel Query Execution Plans in XPRS, Proc. 1st Int. Conf. on Parallel and Distributed Information Systems, 1991
[8] B.R. Iyer, D.M. Dias, System Issues in Parallel Sorting for Database Systems, Proc. Int. Conf. on Data Engineering, pp. 246-255, 1990
[9] M. Jarke, J. Koch, Query Optimization in Database Systems, ACM Computing Surveys, 16, 2, pp. 111-152, 1984
[10] F. Meyer auf der Heide, A. Wigderson, The Complexity of Parallel Sorting, SIAM Journal of Computing, 16, 1, pp. 100-107, February 1987
[11] K. Sado, Y. Igarashi, Some Parallel Sorts on a Mesh-Connected Processor Array and Their Time Efficiency, Journal of Parallel and Distributed Computing, 3, pp. 398-410, 1986
[12] E. Schikuta, Parallel Relational Database Algorithms Revisited for Range Declustered Data Sets, Proc. Int. Symp. on Parallel Architectures, Algorithms and Networks (ISPAN), pp. 25-32, IEEE Computer Soc. Press, 1994