Results of Parallel Implementations of the Selection Problem Using Sisal†

Marc Daumas (a) & Paraskevas Evripidou (b)

(a) Laboratoire de l'Informatique du Parallélisme, École Normale Supérieure de Lyon, 69364 Lyon Cedex 07 (France)
(b) Dept. of Computer Science and Engineering, Southern Methodist University, Dallas, TX 75275

Abstract

This paper presents an in-depth analysis of the parallel implementation of four of the standard selection algorithms using a functional language on a number of multiprocessors and supercomputers. Three of the algorithms (Randomized Search, Binary Search, and Divide & Conquer Search) are based on the partition paradigm. The fourth one is a modified version of the Batcher sort. All routines were able to sustain good speed-up and high efficiency, even with a large number of processors. Efficiency higher than 86% was obtained with a configuration close to the maximum number of processors.

1 Introduction

In the late 19th century, Lewis Carroll proposed in "Lawn Tennis Tournaments" an algorithm to find both the best and the second-best player among all the participants of a tennis tournament in optimal time. This is the first appearance of a selection algorithm in the literature. Given a set $X = x_1 \ldots x_n$, there is a permutation $\pi$ which sorts the set, i.e. $x_{\pi(1)} \le \ldots \le x_{\pi(n)}$, or, more formally, $i < j \Rightarrow x_{\pi(i)} \le x_{\pi(j)}$. The $i$-th order statistic of $X$ is $x_{\pi(i)}$. A variant of this problem is to search for the subset of order statistics $\{x_{\pi(j)} \mid j = 1 \ldots i\}$. This paper presents the results of parallel implementations of various selection algorithms using Sisal, a high-level functional language. We have investigated parallel constructions of four of the most popular selection algorithms. The first three implementations are built around the partition paradigm; the last implementation is based on a sorting algorithm. We were able to achieve good performance in many cases. The chosen target multiprocessors and supercomputers provide a good subset of the machines where Sisal is available: a Silicon Graphics, a Sequent Symmetry S81, and a Cray 2.

† This material is based upon work supported in part by the SMU University Research Council under Grant No. 4-21002 and by the PRC "Architecture de Machines Nouvelles" of the French Ministère de l'Éducation Nationale.

Machine             Processing elements   # PE   Communication
DECStation          MIPS                    1    (Uni-processor)
CRAY 2              Custom (Vector)         8    Round-robin shared memory
SGI                 MIPS                    4    Bus shared memory
Sequent Symmetry    Intel i386             20    Bus shared memory

Table 1: Target machines

Creating correct, determinate parallel programs is very difficult. Even after extensive research and development, automatic parallelizing compilers for imperative languages have not met expectations. Extending imperative languages like C with constructs that allow the explicit expression of parallelism has proven difficult and error-prone as well: the extensions often limit programmer productivity and hinder analysis. Functional languages like Sisal [5], on the other hand, implicitly support the development of determinate, machine-independent parallel software. The programmer is not responsible for managing shared data, interprocess communication, or process scheduling; such tasks are the responsibility of the compiler and the run-time system. Furthermore, correctness is not an issue: Sisal programs that run correctly on a single processor are guaranteed to run correctly on any supported multiprocessor. As a result of recent advances [2], Sisal programs can now execute at least as fast as imperative programs on a single processor and do much better when multiple processors are used. In Section 2, we present the programming environment and the machines used. The algorithms and their implementations are presented in Section 3. In order to allow comparisons between different machines, as well as comparisons within a group of results on the same machine, most of the test results are gathered in Section 4.

2 Programming Environment

Sisal (Streams and Iterations in a Single-Assignment Language) [5] is a high-level functional dataflow language; compilers for Sisal have been implemented on a large number of multiprocessors and supercomputers. The most advanced of these implementations is the Optimizing Sisal Compiler (OSC) and its multiprocessor microtasking environment, which executes on the Cray, Encore, Alliant, SGI and Sequent multiprocessors [2]. The functional paradigm is based on mathematical principles: the programmer describes what to solve and leaves the implementation details to the compiler, operating system, and hardware. The language exposes implicit parallelism through data independence, and guarantees determinate results via side-effect-free semantics. OSC has a single front-end; its output is a machine-independent directed acyclic graph. The graph is fed to a machine-dependent back-end to create target code. Aside from decoupling the front-end compiler from the back-end code generator, the graph form is also used for machine-independent optimization of the program.

We present the machines used for our experiments in Table 1. A workstation provides a baseline evaluation of the performance of the algorithms: we used a DECStation 3100 with a MIPS R3000 processor and 24 MB of main memory. First presented in 1985, the CRAY-2 is able to execute compound operations with a clock cycle of 4.1 ns, leading, for the largest configuration, to a peak performance close to 4 GFLOPS. The main memory is made of four quadrants of 32 banks each, for a total of 256 Mwords. The processors are built around four parallel units: two floating-point units and two integer units. Two MIMD machines that represent the two ends of the high-granularity multiprocessor spectrum were also used for our experiments. The SGI includes only four very high performance MIPS processors, whereas the Sequent Symmetry has twenty high-performance Intel i386 processors. In both cases, the resources around the processors (bus, memory) are close to being saturated; adding more processors would yield at best a small performance increase.

3 The Selection Algorithms

The first three solutions chosen for our experiments are built around the partition paradigm. The first one, the randomized search, is described in [4]; the expected execution time of this algorithm is linear. This solution is very close to the quick-sort algorithm. Binary search is the second algorithm used. Its running time is $O(n \log \Delta)$; however, $\Delta$, the width of the initial search interval, can easily be kept very small in most real-time applications by reusing the result of the previous execution. If this technique does not apply, a pre-sampling mechanism like the one described in [3] will always work. The third solution, described in [1], is a divide & conquer algorithm, which is known to have $O(n)$ time complexity. The last solution is based on a sorting algorithm. The routine is a modification of the Batcher sort; this reduces the effective work down to the minimum necessary.
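Since the three partition-based variants share the same skeleton and differ only in how the pivot guess is produced, a minimal sequential C sketch of partition-based selection (quickselect) may help fix ideas. The code and names below are ours, not the paper's Sisal implementation; note also that the paper's sequential Randomized Search takes the first element as its guess, while this sketch uses a random index.

#include <stdlib.h>

/* Swap two array cells. */
static void swap(int *x, int *y) { int t = *x; *x = *y; *y = t; }

/* Return the i-th smallest element of a[lo..hi] (0-based rank i).
   Expected linear time; quadratic in the worst case. */
int quickselect(int *a, int lo, int hi, int i)
{
    while (lo < hi) {
        int p = lo + rand() % (hi - lo + 1);   /* pivot guess */
        swap(&a[p], &a[hi]);                   /* park pivot at the end */
        int pivot = a[hi], s = lo;
        for (int j = lo; j < hi; j++)          /* Lomuto partition pass */
            if (a[j] < pivot)
                swap(&a[j], &a[s++]);
        swap(&a[s], &a[hi]);                   /* pivot lands at final rank s */
        if      (i < s) hi = s - 1;            /* answer is left of the pivot  */
        else if (i > s) lo = s + 1;            /* answer is right of the pivot */
        else            return pivot;
    }
    return a[lo];
}

Each iteration discards the part of the array that cannot contain the i-th order statistic, which is exactly the structure that the parallel versions of Section 3.1 distribute over the PEs.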

3.1 The Partition Paradigm

The partition paradigm was first described in the literature by Hoare in [4]. The idea became famous as the basis of quick-sort (also in [4]). In the same paper, Hoare introduced the find routine, which returns the i-th order statistic of the array using the just-defined partition function. The standard algorithm was tested, but the compiler was not able to extract enough parallelism. If loop slicing is used, at each step the array $V_n$ is divided among all the PEs; one partition step is executed, and the array $V_{n+1}$ is constructed.

- On a shared-memory machine, it is impossible to construct a packed version of $V_{n+1}$. The compiler must use a mask technique since the array is not packed.
- On a distributed-memory machine, the data movement is handled through communications, which are very expensive.

As a consequence, each PE only works with a subset of its initial assignment. To make this visible to the compiler, we have decided to use a set of arrays U, one array for each PE, so that the compiler allocates one array to each processor. The functions p_split and p_size are the parallel counterparts of the functions split and array_size. They are very easy to implement from the original functions; we present only p_split here. We have underscored the modifications to the sequential algorithm.

function p_partition (St: state; U: twodim; i: integer returns integer)
  let
    temp       := guess (St, U, i)
    L, M, R    := p_split (U, temp)
    sl, sm, sr := p_size (L), p_size (M), p_size (R)
  in
    if sl >= i then
      p_partition (St.temp.L, L, i)
    elseif sl + sm >= i then
      temp
    else
      p_partition (St.temp.R, R, i - sl - sm)
    end if
  end let
end function

function p_split (U: twodim; pivot: integer returns twodim, twodim, twodim)
  for V in U
    ul, um, ur := split (V, pivot)
  returns
    array of ul
    array of um
    array of ur
  end for
end function
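For concreteness, here is what the three-way split inside p_split does on each PE's local array, rendered as a sequential C sketch; split3 and array_t are our hypothetical names, not part of the paper's code.

#include <stdlib.h>

typedef struct { int *v; size_t n; } array_t;

/* Pack the elements of `in` that are below, equal to, and above `pivot`
   into three freshly allocated arrays (allocation checks omitted in this
   sketch; the caller frees the three buffers). */
static void split3(array_t in, int pivot, array_t *lo, array_t *mid, array_t *hi)
{
    lo->v  = malloc(in.n * sizeof(int));
    mid->v = malloc(in.n * sizeof(int));
    hi->v  = malloc(in.n * sizeof(int));
    lo->n = mid->n = hi->n = 0;
    for (size_t i = 0; i < in.n; i++) {
        int x = in.v[i];
        if      (x <  pivot) lo->v[lo->n++]   = x;
        else if (x == pivot) mid->v[mid->n++] = x;
        else                 hi->v[hi->n++]   = x;
    }
}

Because each PE writes only into its own three output arrays, no masking and no inter-PE data movement is required; this is precisely the property that the set of per-PE arrays U exposes to the compiler.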

The Randomized Search in the sequential environment returns the first element of the array. The parallel guess_random function returns the first element of one of the nonempty arrays in the set.

function guess_random (U: twodim; i: integer returns integer)
  let
    V0 := for V in U
          returns value of V when array_size (V) > 0
          end for
  in
    V0[1]
  end let
end function

We have set the number of steps of the Binary Search to approximately 20. In real applications, such as radar tracking, the level of the previous selection can give a fairly accurate idea of the bounds of the binary search. In some other types of applications, the user may have a good knowledge of the distribution. Finally, a pre-sampling method similar to the one described in [3] can always be used. This step ensures that the $i$-th order statistic is within $[a \ldots b]$ if $St \equiv (a, b)$. The Fast Binary Search is an improved version of this algorithm: it stops as soon as we can solve the problem "Find a value $g$ such that $x_{\pi(i)} \le g < x_{\pi(i+1)}$". The algorithm uses the state $St$ defined with the partition function. The guess_binary function described here returns $(a + b)/2$, and the state is updated by

$$(a, b).x.L \equiv (a, x) \qquad \text{and} \qquad (a, b).x.R \equiv (x, b).$$
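The stopping test above can be made concrete with a simplified sequential C rendering (ours; it counts elements instead of physically partitioning them, and takes a 1-based rank i over an integer key range):

#include <stddef.h>

/* Return the i-th order statistic of v[0..n-1] (1-based i), assuming it
   lies in the integer interval [a, b], i.e. St = (a, b). */
int binary_select(const int *v, size_t n, int i, int a, int b)
{
    while (a < b) {
        int g = a + (b - a) / 2;          /* guess_binary: midpoint of (a, b) */
        size_t le = 0;
        for (size_t j = 0; j < n; j++)    /* one partition-like counting pass */
            if (v[j] <= g) le++;
        if ((int)le >= i) b = g;          /* order statistic is <= g: state (a, g) */
        else              a = g + 1;      /* order statistic is  > g: state (g+1, b) */
    }
    return a;
}

With $\Delta = b - a$ the width of the initial interval, the loop makes about $\log_2 \Delta$ passes over the data, matching the $O(n \log \Delta)$ bound quoted in Section 3; the paper's choice of roughly 20 steps corresponds to $\Delta \approx 10^6$.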
The Linear Search on a sequential machine ensures that each iteration reduces the array by at least $3n/10 - 6$ elements. This method was described in [1], where the authors established a recurrence equation of the form

$$T(n) \le T\!\left(\frac{n}{5}\right) + T\!\left(\frac{7n}{10} + 6\right) + O(n),$$

which induces $T(n) = O(n)$.
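The standard substitution argument behind this bound (our addition, following [1]): assume inductively that $T(m) \le cm$ for every $m < n$, and bound the $O(n)$ term by $an$. Then

$$T(n) \le \frac{cn}{5} + c\left(\frac{7n}{10} + 6\right) + an = \frac{9cn}{10} + 6c + an \le cn,$$

where the last inequality holds as soon as $c \ge 20a$ and $n \ge 120$, since then $6c + an \le cn/10$. Hence $T(n) = O(n)$.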
function guess_linear (V: onedim; i: integer returns integer)
  let
    comp   := reduce5 (V)
    sorted := sort_sel (V)
  in
    if array_size (V) <= 5 then
      sorted[i]
    else
      guess_linear (comp, array_size (comp) / 2)
    end if
  end let
end function

The reduce5 function scatters the array V into a set of 5-tuples and computes the median of each 5-tuple. The result, comp, is the set of those medians. This function induces a lot of array copying. Array handling is one of the major handicaps of the functional programming paradigm. Moreover, it is even harder to guide the compiler here, since our analyses have shown that we cannot execute this procedure without a certain amount of array copying. Using imperative languages, we were able to limit array copying to a negligible amount; the Sisal compiler, however, detects a potential conflict and, in order to guarantee determinacy automatically, introduces too much copying work. We were interested in implementing, on top of our parallel partition, a Parallel Linear Search that would guarantee a worst-case linear execution time. The new pivot value is obtained by running the sequential linear function on the PE where the largest number of values remains. This ensures that at least $n_p/2$ elements are removed from the set at each iteration, where $n_p \ge n/p$ is the size of that PE's array.
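For concreteness, here is a sequential C sketch of the behavior just described for reduce5 (our reconstruction; the actual Sisal source is not shown in the paper):

#include <stdlib.h>

/* qsort comparator for ints. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Cut v[0..n-1] into 5-tuples (the last one possibly shorter), take the
   median of each, and return the freshly allocated array of medians. */
static int *reduce5(const int *v, size_t n, size_t *out_n)
{
    size_t m = (n + 4) / 5;                  /* number of tuples */
    int *med = malloc(m * sizeof(int));      /* allocation check omitted */
    for (size_t g = 0; g < m; g++) {
        int tmp[5];
        size_t len = (g * 5 + 5 <= n) ? 5 : n - g * 5;
        for (size_t j = 0; j < len; j++)
            tmp[j] = v[g * 5 + j];
        qsort(tmp, len, sizeof(int), cmp_int);   /* sort the tuple ...      */
        med[g] = tmp[len / 2];                   /* ... and keep its median */
    }
    *out_n = m;
    return med;
}

In C the tuples can be formed in place; in Sisal, the scatter and the per-tuple medians each materialize new arrays, which is the copying overhead discussed above.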

3.2 The Sort Based Function

Sorting part of the set to retrieve only the i-th order statistic is usually a waste of time. However, this solution may apply for an especially small set, or when i is comparable to the number of PEs. We have implemented a modified version of the Batcher sort (function batcher_sel shown below) that keeps only the first k values in all the sub-strings, k being the parameter of batcher_sel. We present here part of the Sisal code; the functions high (i, p) and low (i, p) return $\lfloor i/p \rfloor$ and $i \bmod p$ respectively.

function batcher_sel (V: dim1; k: integer returns dim1)
  ...
  while q > 1 repeat
    q := old q / 2;
    S := for x in old S at i
         returns array of
           if (high (i, q * 2) + q + low (i, q) < n) & (low (i, p) < 2 * k) then
             if (i - high (i, q * 2) - low (i, q)) = 0 then
               min (x, old S[i + q])
             else
               max (x, old S[i + q])
             end if
           else
             x
           end if
         end for;
  ...
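For readers unfamiliar with the Batcher network, here is a compact sequential C rendering of the standard, untruncated odd-even mergesort that batcher_sel builds on (a textbook version for power-of-two sizes; this is our illustration, not the paper's code):

/* Compare-exchange: order a[i] and a[j]. */
static void cmp_exchange(int *a, int i, int j)
{
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge the two sorted halves of the slice a[lo..lo+n-1], looking only
   at elements r apart (r = 1 on the initial call). */
static void oddeven_merge(int *a, int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        oddeven_merge(a, lo, n, m);        /* even subsequence */
        oddeven_merge(a, lo + r, n, m);    /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            cmp_exchange(a, i, i + r);
    } else {
        cmp_exchange(a, lo, lo + r);
    }
}

/* Sort a[lo..lo+n-1]; n must be a power of two. */
static void oddeven_mergesort(int *a, int lo, int n)
{
    if (n > 1) {
        int m = n / 2;
        oddeven_mergesort(a, lo, m);
        oddeven_mergesort(a, lo + m, m);
        oddeven_merge(a, lo, n, 1);
    }
}

The network performs the same $\Theta(n \log^2 n)$ compare-exchanges regardless of the data, which is what makes it attractive on parallel machines; batcher_sel additionally skips the exchanges that cannot affect the first k values of each substring.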

Machine            # PE    1K    2K    4K    8K   16K   32K   64K
DECStation           1      6    14    25    57   148   376  1009
CRAY 2               1      5    10    18    36    71   141   286
                     2      5     7    12    21    38    73   144
                     4      4     6     9    13    25    62   114
                     8      5     7    10    11    25    55     -
Silicon Graphics     1      3     6    13    29    75   186   544
                     2      8     9    16    24    52   116   281
                     4     14    15    23    34    51    99   188
Sequent Symmetry     1     40    79   156   329   729  1632  3860
                     2     32    50    92   172   349   754  1669
                     4     29    40    61   103   185   367   775
                     8     37    45    58    80   126   209   395
                    16     69    78    93   120   133   194   282

Table 2: Fast Binary Search on data sets of 1024 to 64K values (times in ms; a dash marks a configuration with no reported time)

4 Test Results

The average execution times over 100 runs of the Fast Binary Search are presented in Table 2, for each machine, each effective number of processors used, and different sizes of the input set. Table 3 presents a synthesis of these results in three figures (formalized just below):

- The Relative Performance is the maximum, over all set sizes, of the ratio of the execution time on the DECStation to the execution time on the machine, with any number of processors available.
- The Maximum Speed-up sustained, regardless of efficiency. The speed-up is computed from the execution times of the same program on one processor and on any number of processors.
- The highest speed-up sustained with High Efficiency. The efficiency is the speed-up divided by the number of processors.
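In symbols (a standard formulation, ours rather than the paper's): with $T_p$ the execution time on $p$ processors,

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}.$$

For example, on the Sequent Symmetry with 64K values, Table 2 gives $T_1 = 3860$ ms and $T_{16} = 282$ ms, hence $S_{16} = 3860/282 = 13.69$ and $E_{16} = 13.69/16 \approx 86\%$, the values reported in Table 3.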

The results on the Sequent Symmetry are excellent: we were able to obtain a speed-up higher than 10 with still good efficiency. On all the other machines, the efficiency drops quickly when more than two processors are used; this effect is amplified by the type of machine (a low number of processors). Execution times of the other routines are presented in Table 4. The Binary Search achieves equivalent performance, and the Randomized Search is a little slower. The Batcher Selection is an order of magnitude slower on all the machines; moreover, the algorithm does not have the linear behavior of the partition-based functions. We tried to run the experiment for very large arrays, but we were limited by both time and space: most of the compiled codes ran within 2 MB of memory, but for some of the largest sets more memory was necessary.

                     Performance    Max speed-up                High efficiency
Machine              (relative)     # PEs   Sp.-up   Eff.       # PEs   Sp.-up   Eff.
DECStation              1.00          -       -        -          -       -        -
CRAY 2                  8.85          8      2.56     32%         2      1.99     99%
SGI                     5.85          4      2.89     72%         2      1.94     97%
Sequent Symmetry        3.58         16     13.69     86%        16     13.69     86%

Table 3: Relative performance

                          Binary              Randomized     Linear          Batcher
Machine            # PE   1K    8K    64K     1K    8K       1K     8K      1K     8K
DECStation           1     7    60   1010     11   112       228   6137     439   5706
CRAY 2               1     7    38    291      9    62       233   6063     118   1498
                     2     7    35    149      7    41         -      -      63    756
                     4     7    15    121      6    22         -      -      33    447
                     8     9    26    122      7    23         -      -      30    459
Silicon Graphics     1     4    31    539      6    55       117   3044     186   2279
                     2    11    28    279     11    46       130   1818     127   1512
                     4    22    45    178     32    76        71    744      92    988
Sequent Symmetry     1    46   339   3862     66   569      1164      -    1106      -
                     2    47   183   1668     50   284      2142      -     541   6865
                     4    48   116    782     41   173       949      -     302   3631
                     8    68   101    408     55   128       639   3985     173   1863
                    16   131   148    330     99   157       706   2603     131   1005

Table 4: Selection on data sets of 1K, 8K or 64K values (times in ms; a dash marks a configuration with no reported time)

A space-requirements analysis shows that the array can be updated in place. For an array of N = 1024 elements, an optimal C code needs approximately 8 KB; the Sisal code asks for 70 KB, which implies a poor memory efficiency of 11%. The code involves a lot of array manipulation: whereas the compiler creates a new array for each step of any of the three searches, update-in-place techniques are efficiently applied to the more regular Batcher selection, whose memory usage for the same input size is around 15 KB (53%). On the CRAY machine, we were able to exploit inter-processor parallelism (good speed-up), but we have not been able to obtain much vectorized code (poor overall performance): the OSC compiler does not take advantage of the two integer vector units, and loop vectorization is only applied to floating-point operations. The Linear Search results are very poor compared to the other guess strategies. This is due to the large amount of array copying necessary for the reduce5 operation. Sisal is already fully capable of obtaining good efficiency on loop constructs and on automatic loop concurrentization; however, not all array constructs are well exploited by the current compiler. The reduce5 function could have been implemented in Sisal with an acceptable efficiency, but that would have resulted in a loss of code readability.

This was the case with the Batcher selection: in order to avoid array copying, the implemented code is far from the intuitive implementation. However, future versions of the OSC compiler are expected to address these points adequately. The monitoring tools of OSC are not yet user-friendly. In a control-flow environment, monitoring a program is easy, although it may be expensive in time; in a dataflow environment, some actions are conditioned by others, and actions may take place in many different orders. The newly incorporated time function of the compiler provides the user with a first useful method, and we have developed dedicated programs to process the output it generates.

5 Conclusion

These results clearly indicate that the Optimizing Sisal Compiler is fully capable of detecting and exploiting parallelism. This is true even though the selection problem is not a scientific application, which is the major focus of the Sisal development team. The code for all implementations ran without any modification on every machine. Furthermore, the user does not have to worry about explicitly parallelizing the code; as the Sisal compilers improve on target code generation, data structure handling, and communication protocols, Sisal programs will achieve better and better performance. It has long been argued by the proponents of functional languages that learning the parallel extensions of imperative languages, and how to use them, amounts to learning a new language. Our experience from these experiments is that learning Sisal likewise amounted to learning a new language for one of the authors; however, after learning Sisal, it took less time to write the programs than it did with the parallel C versions. We are now working on an evaluation of a Sisal compiler for a massively parallel SIMD machine, the MasPar MP-1. The sequential part of an application would execute on the front-end; only the parallel part would execute on the MasPar parallel unit. This approach is close to the one adopted by the Fortran 90 development team at MasPar: we should be able to take advantage of their study of many common concurrentization problems. It also makes it easier to link to the highly optimized Fortran libraries available on the MasPar machine for scientific computation.

References

[1] M. Blum, R. W. Floyd, V. Pratt, R. L. Rivest and R. E. Tarjan, "Time bounds for selection", Journal of Computer and System Sciences, Vol. 7(4), August 1973.

[2] J. Feo, D. Cann and R. Oldehoeft, "A Report on the Sisal Language Project", Journal of Parallel and Distributed Computing, Vol. 10(4), December 1990.

[3] R. W. Floyd and R. L. Rivest, "Expected time bounds for selection", Communications of the ACM, Vol. 18(3), March 1975.

[4] C. A. R. Hoare, "Partition", "Quicksort" & "Find" (Algorithms 63-65), Communications of the ACM, Vol. 4(7), July 1961.

[5] J. McGraw et al., "Sisal: Streams and Iterations in a Single Assignment Language, Language Reference Manual, Version 1.2", Lawrence Livermore National Laboratory, March 1985.
