Parallel Algorithms for Workstation Clusters
K. Efe, P. Uthayopas, V. Krishnamoorthy, and T. Dechsakulthorn
The Center for Advanced Computer Studies, University of Southwestern Louisiana, Lafayette, LA 70504
Abstract
We investigate the potential of workstation clusters for use in high performance computation for some selected applications. Currently, the network speed found in most existing systems is quite low, but higher speed networks are already emerging in the market. We present four parallel algorithms that performed astonishingly well on a cluster of workstations connected by Ethernet. Three of these are algorithms for sorting, matrix multiplication, and all-pairs shortest path problems. The fourth algorithm solves computationally intensive numeric problems that require little communication. Since these numeric computations are easy to parallelize, they serve well for testing operating system strategies for designing a network-wide scheduler that utilizes idle workstations. These results suggest that future progress in network speeds can make workstation clusters serious competitors in high performance computation.
Key words: parallel algorithms, sorting, matrix multiplication, graph problems, all-pairs shortest paths, workstation clusters, Ethernet.
1 Introduction
Workstation clusters connected by local area networks have been widely available for more than a decade, but they have begun to receive substantial attention from parallel computing researchers only recently. While there is a large body of literature on various aspects of distributed computing that directly apply to workstation clusters, there are only a few papers that develop parallel algorithms for workstation clusters. In this paper, we study the performance of four specific algorithms developed for workstation clusters. The system we used consisted of 16 Sparc-2 workstations connected by Ethernet. While the low speed of Ethernet creates a major obstacle to high performance computation, we do not consider this a weakness for the purposes of this paper, because the basic lessons that we learned will not change. If we obtain good speed-up rates when using Ethernet as the communication medium, this provides enough evidence that future systems should perform better when higher speed networking is available.

This research is funded by a grant from LEQSF.

Three of the problems considered here are sorting, matrix multiplication, and the computation of all-pairs shortest paths in a graph. These are among the best known problems in the literature as they have substantial applicability and relevance. There are several papers devoted to each, and we have no doubt that more papers will continue to improve on the earlier methods for solving these problems. Our reason for choosing these problems is that they are hard to parallelize, in the sense that there is no obvious best way to parallelize these computations on a given architecture; this fact explains the sheer number of papers that address these problems.

The fourth problem we used in this study is the numerical solution of differential equations in domains that can contain singularities. Solution of these equations requires little communication while the amount of required computation is large. These problems also allow partitioning the computation between any number of workstations with almost linear speed-up. Our reason for including these computations in this study is that they allow us to evaluate the performance of workstation clusters at the system level rather than the task level. At the task level we are concerned about the speed-up for just one task, but at the system level we are concerned about the mean response time over several tasks.

To evaluate the first three algorithms mentioned above, we had to use the workstation cluster for one computation at a time. This is due to two reasons. First, sharing the limited bandwidth of Ethernet between the subtasks of several tasks causes serious interference on the network. Second, if the subtasks of two tasks share some subset of workstations, both tasks experience a major slowdown. The first concern may (or may not) be alleviated by future progress in networking technology, but the second concern will still remain.
To remedy this problem, we implemented simple scheduling techniques that coordinate the use of idle workstations, and we used this scheduler to allocate workstations for the fourth problem when a stream of such tasks may arrive in an unpredictable pattern. Due to the small amount of communication required, such numerical computations serve well to evaluate system-wide scheduling strategies for allocating workstations when communication interference is not a major concern, e.g. by using a non-broadcast based medium such as an ATM switch.

I/O bandwidth is another factor that has a major effect on the performance of parallel systems. Yet, this factor is often omitted in the literature when reporting the speed-up amounts obtained. To further strengthen our point, our reported speed-up amounts include the I/O time spent sending the initial data from a fileserver to the set of workstations, and returning the results to the fileserver when the computation is complete.
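To illustrate why including the I/O time matters, the following minimal Python sketch compares speed-up with and without the data transfer terms. The function name `speedup` and all timing values are hypothetical and chosen only for illustration; they are not measurements from this study.

```python
def speedup(t_sequential, t_send, t_compute, t_return):
    """Speed-up when the parallel time includes sending the input data,
    computing on the workstations, and returning the results."""
    t_parallel = t_send + t_compute + t_return
    return t_sequential / t_parallel


# Hypothetical timings in seconds (assumed values, not measured data).
t_seq = 100.0         # sequential run on the fileserver
t_comp = 12.5         # ideal compute time on 8 workstations (100 / 8)

compute_only = speedup(t_seq, 0.0, t_comp, 0.0)   # ignores I/O entirely
with_io = speedup(t_seq, 5.0, t_comp, 5.0)        # 5 s to send, 5 s to return

print(f"speed-up ignoring I/O:  {compute_only:.2f}")
print(f"speed-up including I/O: {with_io:.2f}")
```

Under these assumed numbers the I/O terms alone reduce the speed-up from 8 to roughly 4.4, which is why a speed-up figure that omits them can look considerably better than what a user would observe end to end.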
2 System Description
Our hardware system consisted of 16 Sparc-2 workstations connected by Ethernet. All workstations had the same CPU speed and similar memory sizes. The sequential algorithm was run on the fileserver, which has direct access to the required data without using the network. We note that the fileserver is much faster than the other computers, and therefore the speed-up amounts reported in this paper are rather pessimistic. The reason for using the fileserver for sequential execution is that it is close to the data. Thus, it can actually run large programs even when the required memory size is larger than what is available on the computer. Using any of the other computers for sequential execution would not be possible for large problem sizes, since paging across the network would be required.

We implemented the parallel algorithms using PVM 3.3.4 and tested them for various problem sizes. The system is organized in a master-slave hierarchy where the master process resides on the network fileserver, as it needs to access the data stored there. The slave processes running on the set of workstations receive the data from the master, process the data, and return the results to the master process for storing.
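The master-slave organization described above can be sketched as follows. This is only an illustration of the pattern, not the PVM implementation: Python threads stand in for the workstations, ordinary function calls stand in for PVM messages, and the names `master` and `slave` as well as the sum-of-squares workload are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def slave(chunk):
    """Hypothetical slave task: process one block of data and return a result."""
    return sum(x * x for x in chunk)


def master(data, num_slaves):
    """Split the data into contiguous blocks, hand one block to each slave,
    and collect the results in block order."""
    # Ceiling division, with a floor of 1 so that empty input is handled.
    block = max(1, (len(data) + num_slaves - 1) // num_slaves)
    chunks = [data[i:i + block] for i in range(0, len(data), block)]
    # Threads are a stand-in for workstations; in PVM the blocks would be
    # sent over the network and the results received as messages.
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        return list(pool.map(slave, chunks))


results = master(list(range(10)), num_slaves=4)
print(results, sum(results))
```

In this sketch the master is the only component that touches the full data set, which mirrors the role of the fileserver: every block passes through it once on the way out and once on the way back.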
3 Sorting
Parallel algorithms for sorting have received a lot of attention in the literature. Implementations on existing systems such as the NCube had to make do with a limited number of processors due to its physical design, while the number of keys to be sorted may be very large. Then, if there are n keys to be sorted and there are p processors, where p < n, each node holds w = n/p keys before and after the sorting algorithm. During the steps of the sorting algorithm, the number of keys held in a processor should not grow arbitrarily large due to memory constraints.

Most of the sorting algorithms that use this model in the literature fall into three categories: merge-based, bucket-based, or a mix of both. In the merge-based approaches [14, 17] processors independently sort their values (e.g. by using a sequential quicksort algorithm) and then they merge the sorted lists into a single sorted list. In this approach, the merging process is the limiting factor on the speed-up. If enough care is not taken to maximize the overlap of communication between different pairs of processors, the last few steps of the merging process offer little parallelism, as in the method of [14], and the speed-up is not more than about 5 or 6 regardless of how many processors are used.

In the bucket-based approaches [16] (also see Chapter 6 of [20]) we determine a set of pivots (or splitters) before the sorting algorithm starts. If data is initially distributed among the set of processors, some pre-processing and communication between processors
is needed in order to determine a good set of pivots that is globally agreed upon. Then, these pivots are used by each processor to divide its list into p buckets such that no element in bucket i is larger than any element in bucket i + 1, for i = 1, ..., p - 1. These buckets are then sent to the processors that will contain them at the end of the sorting algorithm (i.e. every processor sends its ith bucket to processor i). Upon receiving these buckets, all the processors sort their data locally and we are done.

In the methods based on a mix of the two methods above [1, 27], processors initially sort the values they have independently. If initially the keys are distributed equally between the processors, they will complete this step at about the same time. Then, by using some method of pivot selection, a global set of splitters is obtained and communicated to all of the processors. Each processor then splits its local data into p buckets and sends them to their final destinations, where they are merged to obtain the final sorted order.

In the last two methods, the selection of the set of pivots is the most critical step. If the pivot values do not divide the data equally, the subsequent steps will not complete at the same time due to unbalanced load distribution. A recent paper [1], which implements the mix of the two methods on a hypercube architecture, presents a method of determining the set of pivots that guarantees equal splitting of the data. The critical information that helps in that paper is the fact that the first step sorts the data locally in each processor.

In our implementation for workstation networks we used a bucket-based approach. To select the set of pivots we used a technique that has been reported to perform well when the keys are uniformly distributed [20]. In this method, we randomly pick s samples, where s >> p, from the input data that resides on the fileserver. We then sort the samples sequentially and pick p - 1 values equally spaced apart in the sorted order. We use these values as our splitters to form the buckets sequentially on the fileserver. Once the p buckets are determined and filled, they are sent to p processors and then sorted independently by using a quicksort algorithm. If the sorted values need not be returned to the fileserver, then we are done. Otherwise, the fileserver just needs to concatenate the results returned by the processors.

Note that our measure of the running time includes the time spent for sampling, sorting the samples and determining the splitters, retrieving the data from the fileserver and separating them into buckets, sending the buckets to the processors, sorting these independently, and then sending the results back to the fileserver and storing the result. The time spent for initial sampling and sorting is relatively small, and independent of the size of the input data. In other words, regardless of how large the input data is, the quality of the set of pivots improves fast initially with increasing s for small s values. But beyond a certain level, increasing s has a negligible effect on the quality of the set of pivots.

The initial processing done for splitting the data into buckets introduces only a minor additional overhead on the part of the fileserver,
and it can be done while reading the data from the file. After the results are returned from the workstations, the final concatenation step requires no additional overhead over storing the data. The most time-consuming overhead here is the time spent sending the data from the fileserver to the set of processors, since this step cannot be parallelized. To make matters worse, this overhead is linearly proportional to the amount of input data. However, this step must be performed whenever the data initially resides in one place rather than distributed among the p processors, and this is true for any parallel computer.

[Plot: "Parallel Bucket Sort"; speed-up curves for 2, 4, 8, and 16 workstations.]

Figure 1: Expected behavior of speed-up for increasing number of processors. [Axes: speed-up versus p; the plot carries the label "log n".]

Assuming that s