Parallel Sorting on Heterogeneous Platforms

Gabriel Mateescu
National Research Council
100 Sussex Drive, Ottawa K1A 0R6, Canada
[email protected]
Abstract

We present a method for load balancing parallel sorting on heterogeneous networks of workstations and clusters. Load balancing is achieved by exploiting information about the available throughput of the processors. First, the problem is partitioned into subproblems such that the times taken by the processors to solve the subproblems are balanced; determining the partition involves solving a nonlinear system for the subproblem sizes. Second, the data are sorted by each process and merged by choosing a processor topology that minimizes the critical path.
1. Background

Application load balancing has long been an important aspect of improving the efficiency of parallel applications. The advent of heterogeneous clusters and networks of workstations makes parallel application balancing important not only for long-running iterative simulations, but also for common applications such as sorting.

We consider the problem of integer sorting using merge sort [1, 2]. While iterative codes may either balance the load on the fly or rebalance to account for changes in subproblem size across iterations, parallel mergesort must achieve balance before the sorting starts. Parallel sorting algorithms have been designed for real or model architectures such as the hypercube interconnect or the PRAM [6, 3].

We propose a parallel mergesort method in which the data set is partitioned into subsets such that the times taken by the processors to sort the subsets are balanced. The sorted subsets are then merged using a binary tree processor communication topology. Our work is original in that it provides a method for partitioning the problem which employs accurate run-time throughputs, by harnessing information from, and interacting with, the resource management systems and system monitoring tools available on contemporary platforms.
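The binary-tree merge topology pairs the faster half of the processors with the slower half at each stage, so that the fast processors receive the data and perform the merge (the pairing rule is given in Section 5). The following is an illustrative Python sketch of that pairing schedule, not the paper's MPI implementation; it assumes P is a power of two:

```python
import math

def merge_schedule(P):
    """Pairing schedule for the tree merge: at stage s, processor k
    (0 <= k < P/2**(s+1)) receives and merges the data of processor
    P/2**s - k - 1, so only the faster half stays active."""
    stages = []
    for s in range(int(math.log2(P))):
        active = P >> s  # number of processors still holding data
        pairs = [(k, active - k - 1) for k in range(active // 2)]
        stages.append(pairs)
    return stages
```

For P = 8, stage 0 pairs (0, 7), (1, 6), (2, 5), (3, 4), so the four fastest processors receive and merge; stage 1 pairs (0, 3), (1, 2); stage 2 pairs (0, 1), leaving the fully merged data on processor 0.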
2. Description of the Method

Let N be the size of the data set to be sorted, and let P be the number of processes. We assume that each process runs on a distinct processor. The size of the subproblem assigned to processor k is N_k. The main steps of the parallel mergesort algorithm are:

(i) Label each processor with its rank in order of decreasing throughput, such that processor 0 has the highest throughput and processor P-1 has the lowest. Let t_k be the throughput of processor k, where 0 <= k < P.

(ii) Determine the size of the subproblem assigned to each processor such that all processors sort their data subsets in the same time, i.e., find {x_k}, 0 <= k < P, such that

    x_0 log(x_0) / t_0 = x_1 log(x_1) / t_1 = ... = x_{P-1} log(x_{P-1}) / t_{P-1}    (1)

and

    x_0 + x_1 + ... + x_{P-1} = N.    (2)

The actual sizes of the subproblems are N_k = ⌊x_k⌋ for 0 <= k < P-1, and N_{P-1} = N - (N_0 + ... + N_{P-2}).

(iii) Assign to each process the data subset to be sorted, as defined by the sizes computed at (ii), then merge the data sorted by each process.
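Equations (1) and (2) can be solved numerically; the paper uses Newton's method (Section 4). As an illustrative alternative, the following Python sketch exploits the fact that each x_k grows monotonically with the common per-processor time, and uses nested bisection. The function name, bracketing bounds, and iteration counts are our own choices, not from the paper:

```python
import math

def balanced_partition(N, t):
    """Split N items across processors with throughputs t so that
    x_k * log2(x_k) / t_k is (approximately) equal for all k,
    per equations (1)-(2); the log base only changes a constant factor."""
    def x_of(C, tk):
        # Solve x * log2(x) / tk == C for x in [2, N] by bisection;
        # the left-hand side is increasing in x on this interval.
        lo, hi = 2.0, float(N)
        for _ in range(200):
            mid = 0.5 * (lo + hi)
            if mid * math.log2(mid) / tk < C:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    # Outer bisection on the common time value C: the total size
    # sum_k x_of(C, t_k) increases with C, so bisect until it equals N.
    lo, hi = 0.0, N * math.log2(N) / min(t)
    for _ in range(200):
        C = 0.5 * (lo + hi)
        if sum(x_of(C, tk) for tk in t) < N:
            lo = C
        else:
            hi = C
    x = [x_of(0.5 * (lo + hi), tk) for tk in t]
    # Round the first P-1 sizes; the last one absorbs the remainder.
    sizes = [round(v) for v in x[:-1]]
    sizes.append(N - sum(sizes))
    return sizes
```

For example, with two processors of throughputs 2.0 and 1.0 and N = 1000, the faster processor receives the larger share (roughly 645 items), and the two per-processor sorting times x_k log2(x_k) / t_k agree to well under one percent.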
3. Execution Environment

Both the problem partitioning and the communication topology in the merge stage of the algorithm are aware of the processor throughputs. Each process computes the current throughput of its execution processor using the expression t_k = T_k (1 - u_k), where u_k and T_k are the utilization and the peak throughput of processor k, respectively. The utilization u_k is collected using a reporting tool such as the System Activity Report (sar) utility [7], which is available in most UNIX
and Linux distributions. The peak throughput of interest for integer sorting is the rate of integer operations, which is obtained from benchmarks such as SPEC CPU2000 [8]. We assume that a workload management system such as the Portable Batch System (PBS) [4, 5] or the Load Sharing Facility (LSF) [9] is used, and that scheduling is configured so that, if another job increases the current CPU utilization, the ratios of the utilizations do not change dramatically.

4. Partitioning

We compute a partition that provides accurate load balancing as follows. Relations (1) and (2) define a nonlinear system of equations with unknowns {x_k}, 0 <= k < P, which can be written as f(x) = 0, where x = (x_0, ..., x_{P-1}), f = (f_0, ..., f_{P-1}),

    f_k = x_k log(x_k) / t_k - x_{k+1} log(x_{k+1}) / t_{k+1},  for 0 <= k < P-1,

and f_{P-1} = x_0 + ... + x_{P-1} - N. Newton's method gives the iterates x^(i), i >= 0, with

    x^(i+1) = x^(i) - (J^(i))^{-1} f(x^(i)),

where J^(i) is the Jacobian matrix defined by

    J^(i)(j, k) = ∂f_j(x)/∂x_k evaluated at x = x^(i), for 0 <= j, k < P.

The initial guess for the iterative scheme is the partition that is the exact solution for a problem with asymptotic complexity O(N), which is easily found to be x_j^(0) = N t_j / (t_0 + ... + t_{P-1}), 0 <= j < P.

The vector y^(i) = (J^(i))^{-1} f(x^(i)) is found as the solution of the linear system J^(i) y^(i) = f(x^(i)), which is solved using an in-place LU factorization of J^(i). The band structure of the Jacobian allows the LU factorization to be performed in O(P) arithmetic operations, assuming no pivoting is necessary.

5. Merging

Data sets sorted by the processors are merged using a binary tree communication topology. In the binary tree, edges represent communication and nodes represent processors merging the data. A cost is defined for each edge as the communication time between the nodes connected by the edge, and a cost is defined for each node as the time to merge the local data set with the incoming data set. The time to complete the merge is given by the largest cost sum along a path from the root to a leaf; a path with maximum cost is called a critical path.

The cost of an edge depends on several factors, including the type of interconnect, the current amount of traffic over the interconnect, the size of the buffers used by the communication software, the size of the messages, and whether communication is pipelined or not. In this paper, we assume that all edges have equal cost, and we set up the processor communication such that the critical path contains only processors in the top half of the throughput ranking: when merging data from 2p processors, the faster p processors receive the data and perform the merge.

Merging is performed in log2 P stages. At stage s, processor k merges its sorted data with the data of processor P/2^s - k - 1, where 0 <= s <= log2 P - 1 and 0 <= k <= P/2^(s+1) - 1. Along the critical path, the number of arithmetic operations for merging is bounded above by N log2 P, and the computation time for merging all data sets is at most N log2 P / t_{P/2}.

6. Results and Conclusion

We have implemented the method using MPI communication and have executed the program on two personal computers, one with a 333 MHz Pentium II processor and the other with a 733 MHz Pentium III processor. Sorting 32 million integers on these processors took 90 seconds with the proposed method and 135 seconds with even partitioning; the overhead of computing the partition was less than 2%. The significant improvement brought about by throughput-aware partitioning makes our approach attractive. Other tuning techniques, such as pipelining the communication during the merge stage, can further improve the parallel efficiency.

References

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms. Addison-Wesley, Reading, MA, 1983.
[2] S. G. Akl. Parallel Sorting Algorithms. Academic Press, 1985.
[3] R. Cole. Parallel merge sort. SIAM Journal on Computing, 17(4):770-785, Aug. 1988.
[4] R. Henderson. Job scheduling under the Portable Batch System. In D. G. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 949 of Lecture Notes in Computer Science, pages 279-294. Springer-Verlag, 1995.
[5] J. Jones. NAS requirements checklist for job queueing/scheduling software. Technical Report NAS-96-003, NASA, April 1996.
[6] F. T. Leighton. Introduction to Parallel Algorithms and Architectures. Morgan Kaufmann, San Mateo, CA, 1992.
[7] J. Peek, T. O'Reilly, and M. Loukides. Unix Power Tools. O'Reilly, Sebastopol, CA, 1997.
[8] SPEC. Standard Performance Evaluation Corporation. http://www.spec.org, 2000.
[9] S. Zhou. LSF: Load sharing in large-scale heterogeneous distributed systems. In Proc. Workshop on Cluster Computing, 1992.

Proceedings of the 16th Annual International Symposium on High Performance Computing Systems and Applications (HPCS'02) 0-7695-1626-2/02 $17.00 © 2002 IEEE