Parallel Computing in Paderborn: The SFB 376 “Massive Parallelism – Algorithms, Design Methods, Applications”*

Friedhelm Meyer auf der Heide, Thomas Decker
Department of Mathematics and Computer Science and Heinz Nixdorf Institute, University of Paderborn, Germany
e-mail: {fmadh, decker}@uni-paderborn.de
http://www.uni-paderborn.de/cs/{fmadh, decker}.html
1 Introduction

A major research area at the University of Paderborn is parallel computing. Besides computer scientists, researchers in mathematics, electrical and mechanical engineering, and manufacturing technology also employ the computation power of parallel and distributed systems. Further, many institutions of our university focus on research related to this topic: the Paderborn Center for Parallel Computing (PC²) offers support for the efficient, comfortable use of parallel machines not only to users of the Paderborn or other universities, but also to users in international industries. Parallel computing in the Heinz Nixdorf Institute and its DFG Graduate College ranges from basic research to applications in manufacturing technology. The C-LAB, a joint venture with Siemens Nixdorf Informationssysteme AG, contributes design methodology for complex distributed real-time systems. All these activities are supported within numerous projects by, e.g., the DFG, the BMBF, the EU, and industry.

The SFB 376 “Massive Parallelism” has become the central research organization coordinating the activities in parallel computing in Paderborn and conducting the basic research in this area. It aims to develop methods that fully exploit the computation power of large parallel systems, and to make such methods easily usable for applications in science, engineering, and manufacturing technology. The project integrates three major parts: A: Algorithms, B: Design methods, and C: Applications. All these parts are strongly related to each other. On the one hand, the new algorithmic techniques and design methods provide an important basis for the application-oriented parts of the project. On the other hand, the demands of the applications motivate many algorithmic and methodical problems. Further, the applications play an important role in the evaluation of new algorithms and design methods.
This project structure demands cooperation and interdisciplinary research between experts in the different application areas, methodology-oriented researchers, and algorithms researchers. We will now describe the different parts in more detail.
* This work is supported by the DFG Sonderforschungsbereich 376 “Massive Parallelität – Algorithmen, Entwurfsmethoden, Anwendungen” and by the EU ESPRIT Long Term Research Project 20244 (ALCOM-IT). More information about the SFB can be found on our web pages at http://www.uni-paderborn.de/SFB376.
A: Algorithms. Designing algorithms that fully exploit the computation power of large parallel systems is much more complicated than in the sequential case:

– The design of parallel algorithms for a large number of processors often requires new algorithmic approaches (examples: combinatorial optimization, fluid dynamics). A parallelization of sequential methods often does not lead to satisfactory results.
– Even algorithms which exploit the “natural” parallelism of the underlying problems are difficult to implement on massively parallel systems because of their highly dynamic behavior concerning process generation and communication (e.g., adaptive finite element methods, event-driven simulations).
– Using networks of workstations as one single parallel machine imposes further problems. Due to the heterogeneity of the computation and communication hardware, different control and communication mechanisms coexist within one application. Sometimes it is even essential to use different algorithms on different architectures.

Within this part of the SFB, we design and analyze protocols for basic services like load balancing, routing, and data management in processor networks. Furthermore, we develop algorithms for realizing data structures, as well as graph, geometry, and computational algebra algorithms. All of this is made available to users in and beyond the SFB as easy-to-use libraries.

B: Design methods. In this part of the project we investigate techniques and tools which support the design, realization, and the comfortable and efficient use of massively parallel systems. Utilization is assisted on the hardware side as well as on the software side. The leading idea is that we can increase the effectiveness of efficient algorithmic approaches

– by supporting the design of very complex, naturally parallel, reactive technical systems with real-time constraints,
– by a tool system automating the systematically solvable tasks occurring in the design process of parallel applications. Thereby, developers of parallel applications who are not familiar with the design of efficient parallel algorithms gain access to reusable parallelization know-how.

Within this part of the SFB, we develop design methods for massively parallel real-time systems as well as tools for the development and implementation of parallel applications. These systems use (and sometimes build on) results from part A, and are strongly connected to application areas within part C.

C: Applications. In this part, we work on applications of massively parallel systems with high economic and scientific relevance. Due to their complexity, boundary conditions like timing constraints, and their dynamic behavior, these applications pose great challenges to the design and algorithmic methods. The following criteria are common to all applications investigated in this part:
– Every application field is highly relevant from both the economic and the scientific point of view.
– The applications are broadly scattered across different disciplines outside the area of classic scientific problems.
– The problems lead to computational demands which are beyond the capabilities of standard computer systems.
– Every application is highly dynamic with respect to load generation, communication, and data access patterns.

Therefore the applications represent an important tool for measuring the quality of the algorithms and design methods developed in parts A and B. In particular, we push the development of parallel self-organizing mechatronic systems and applications in the area of artificial intelligence. In addition, we work on production planning as well as on parallel extensions of the computer algebra system MuPAD.

In the following we concentrate on describing our algorithmic research on developing, analyzing, and implementing programming platforms designed for large parallel machines (including massively parallel architectures and SCI clusters). In particular, we present techniques and libraries for load balancing and virtual global variables, together with their applications in the projects integrated in our SFB. This work is mainly done within the project A2 (Meyer auf der Heide, Monien) “Universal basic services”.
2 Universal basic services

In this part of the SFB, two important basic services are studied: load balancing and data management in large parallel networks. As the developed methods should be appropriate for a large spectrum of applications, they have to be able to adapt to the different application demands as well as to the capabilities of the underlying hardware. This universality can only be achieved by algorithms which take into consideration specific characteristics of the application and of the architecture. For example, a universal load balancing service should offer different methods and adapt them to the specific demands of the application. For data management systems and for load balancing algorithms, different characteristics of the application and of the architecture are relevant for selecting appropriate strategies. In the next sections we take a closer look at these services.

The data and task management system Daisy makes these services available by integrating tools for the simulation of shared memory (DIVA) and for load balancing (VDS) in a single comprehensive library. A beta release of Daisy is available for Parsytec's PARIX and PowerMPI programming models. With the improved thread support of MPI-2 and PVM 3.4, Daisy will also become available on workstation clusters.

2.1 Load balancing

The problem. Massively parallel computers have been shown to be very efficient at solving problems that can be partitioned into tasks with static computation and communication patterns. However, there exists a large class of problems that have unpredictable computational requirements and/or irregular communication patterns. To efficiently solve this kind of problem with parallel computers, it is necessary to perform load balancing operations at runtime [14]. In contrast to static load balancing problems, where a priori knowledge about the dynamic behavior of the application is available [12], dynamic algorithms have to place the load items on-line. Consequently, the application is influenced not only by the obtained load balancing quality but also by the overhead imposed by the balancing activities. Therefore we have to optimize the trade-off between load balancing quality on the one hand and balancing overhead on the other. To do this, we consider the properties of the architecture (communication bandwidth, message startup costs, latency) and of the application (e.g., granularity). A very important parameter of the application is the demanded load balancing quality. Depending on the application, it may be necessary to demand a perfect load balance across the processors, or only a minimization of the idle times.

Dynamic strategies. We distinguish between scenarios where migration is possible and those where objects can only be placed once (dynamic mapping). Dynamic mapping algorithms are often used for process placement because in many systems the migration of processes is very costly or even impossible. In [13] we presented a universal dynamic mapping algorithm which is able to adapt the mapping overhead to the granularity of the application and to the communication cost imposed by the architecture. A parallel version of a similar mapping process was introduced in [6]. Particularly for the applications considered in the SFB project, we have scenarios where the migration of load items is possible, which allows applying completely different balancing algorithms.
In these cases, load items can often be described by data packets which can be migrated simply by sending them from one processor to another. Here, the selection of the load balancing strategy depends mainly on the demanded balancing quality. If the total set of load items processed during a distributed computation does not depend on the schedule determined by the balancing layer, i.e., on the order and location in which the items are processed, the maximum speedup can be reached if the idle times of the nodes are minimized. For example, this is the case in tree-structured computations like divide-and-conquer applications, which decompose the problem to be solved into parts that depend directly on the problem instance itself. For this kind of load balancing (load sharing), randomized work stealing leads to very good results in theory [4] as well as in practice [15]. However, the load items generated by a distributed computation may also depend significantly on the order in which they are processed. We find this phenomenon in many search algorithms used in artificial intelligence and operations research. In best-first branch-and-bound, for example, the processing order is defined by the quality of the objects (partial solutions). When applications of this kind are parallelized, it is not only important to ensure that all processors are busy; some form of qualitative load balancing is also necessary to make sure that all processors are working on good partial solutions, and thus to prevent the processors from doing ineffective work (work not processed by a sequential best-first algorithm).
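To make the load-sharing case concrete, the following sketch simulates randomized work stealing round by round: each idle processor picks a random victim and steals one task. The task model (a task of depth d spawns two tasks of depth d−1, as in a divide-and-conquer tree) and all function names are illustrative assumptions, not the algorithms of [4] or [15].

```python
import random

def work_steal_simulation(num_procs, initial_tasks, split, seed=0):
    """Round-based simulation of randomized work stealing.

    Each round, every processor with local work processes one task
    (possibly spawning subtasks); every idle processor picks a random
    victim and steals one of its oldest tasks, if there are any.
    Returns the number of rounds until all work is done.
    """
    rng = random.Random(seed)
    deques = [[] for _ in range(num_procs)]
    deques[0] = list(initial_tasks)          # all work starts on one node
    rounds = 0
    while any(deques):
        rounds += 1
        for p in range(num_procs):
            if deques[p]:
                task = deques[p].pop()       # work on the newest local task
                deques[p].extend(split(task))
            else:
                victim = rng.randrange(num_procs)
                if deques[victim]:
                    # steal the oldest task (closest to the root, hence largest)
                    deques[p].append(deques[victim].pop(0))
    return rounds
```

For a binary task tree of depth 3 (15 tasks in total), a single processor needs exactly 15 rounds, while several stealing processors finish in fewer rounds once the work has spread.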
In load sharing algorithms, processors can only have two states: idle or not idle. Qualitative algorithms directly take the load states of the processors into consideration. Based on comparisons of these states, load is migrated from “source” processors with high load to “sink” processors with low load. The various algorithms for this setting differ in the point in time at which they become active, in the strategies used to select the processors which exchange information about their load states, and in the amount of load which is migrated.

In [25] we analytically compared two well-known qualitative local balancing techniques: the dimension exchange (DE) and the diffusion (DF) method. The DE method balances the load with each neighbor iteratively, whereas the DF method balances with all neighbors in one step. It was shown that, depending on the capabilities of the architecture, both techniques have advantages. Assume that w_i(t) represents the load of node i at time t, and w̄(t) represents the average load at time t. It was proved that the expected value of the system imbalance factor Φ(t) = Σ_{i=1}^{N} (w_i(t) − w̄(t))² (time t, N nodes) is smaller for DE than for DF if it is possible to communicate with more than one node simultaneously (multi-port model), and larger otherwise (in the single-port model). Here it was assumed that the load is generated by identically distributed random variables. Consequently, the first method is preferable if the communication hardware is able to support multi-port communication efficiently, and the latter should be used if only one-port communication is possible. Further, it was shown that for synchronous scenarios, where no load generation takes place during the balancing phase, the expected value of the system imbalance factor of the DE method is always smaller than that of the DF method, independently of the communication model.
In addition to the experimental evaluation conducted in [25], we evaluated the practical relevance of the results using a branch-and-bound algorithm for the set partitioning problem. Both methods clearly outperformed simple approaches which select only one neighbor for balancing [17, 26].
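The imbalance factor and a diffusion step can be sketched as follows. The ring topology and the diffusion parameter α are illustrative assumptions, chosen only to show that a DF-style step conserves the total load while shrinking Φ; this is not the exact setting analyzed in [25].

```python
def imbalance(loads):
    """System imbalance factor: sum of squared deviations from the mean load,
    i.e. the quantity written Φ(t) = Σ_i (w_i(t) − w̄(t))² in the text."""
    avg = sum(loads) / len(loads)
    return sum((w - avg) ** 2 for w in loads)

def diffusion_step(loads, alpha=0.5):
    """One synchronous diffusion (DF-style) step on a ring: each node
    exchanges a fraction of the load difference with both neighbors at once."""
    n = len(loads)
    new = list(loads)
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):
            new[i] += alpha * (loads[j] - loads[i]) / 2
    return new
```

Starting from all load on one node of a 4-node ring, one step moves half of it to the neighbors; the total load is conserved and the imbalance factor strictly decreases.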
Implementation and application. The virtual data space tool VDS simulates a global data space for objects stored in distributed heaps, stacks, and other abstract data types [11]. The work packets are spread over the distributed processors as evenly as possible with respect to the incurred balancing overhead. Among other algorithms, VDS integrates the methods described above for qualitative load balancing as well as for load sharing. Within the SFB, VDS is applied to the parallelization of an application from the area of artificial intelligence. Further, we are using VDS inside another “A” project dealing with problems of computational algebra like real root isolation.

2.2 Data Management

The problem. The provision of shared memory in systems with distributed memory substantially supports comfortable and efficient programming. For example, it is possible to store variables as is done in sequential programs and at the same time to make them accessible from other processors. Other data objects are, for example, pages or
cache lines in a virtual shared memory system, shared files in a distributed file system, or media information (video, audio, text, graphics) on a media server. The efficiency of such systems significantly depends on the bandwidth of the architecture. We have to distinguish between systems with high bandwidth and systems with low bandwidth. In the former case, the dominating problem is the contention at the memory modules (i.e., the number of requests at each module) [5, 7, 8, 9, 16]; in the latter case we have to consider the network congestion. Here we have to preserve data locality in order to reduce the communication load in the network. A survey of approaches for both scenarios is given in [22].

Strategies for systems with limited bandwidth. Most work concerning data management in parallel and distributed systems investigates either hashing- or caching-based strategies. Hashing distributes the shared objects uniformly at random among the memory modules, which yields an even distribution of the data and therefore achieves a good load balance. However, uniform hashing gives up any locality in the pattern of read and write accesses. Caching exploits locality by placing or moving copies of the objects at or close to the accessing processors. The basic idea is that this minimizes the distances and therefore decreases the total communication load. The main problem is that minimizing distances can produce bottlenecks in the system, e.g., if many objects are placed on a central processor in the network. For the simulation of shared memory on MPPs or NOWs, the routing mechanism is the bottleneck of the system, i.e., we have to focus on data management in parallel processor systems in which the processors are connected by a relatively sparse network. Each processor is assumed to have its own local memory module, so that shared objects have to be distributed among these modules.
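The contrast between hashed placement and a congested central placement can be made concrete with a small sketch. Objects are hashed to home modules on a linear processor array, and the congestion of the resulting routing load is compared with that of placing every object on one central processor. The linear-array topology and all names here are illustrative assumptions.

```python
import hashlib

def home_by_hashing(obj, num_procs):
    """Uniform hashing: pseudo-randomly assign each object a home module."""
    digest = hashlib.sha256(obj.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_procs

def congestion(accesses, placement, num_procs):
    """Route each (processor, object) access along a linear array and
    return the maximum number of packets crossing any single link."""
    links = [0] * (num_procs - 1)        # link i joins processors i and i+1
    for proc, obj in accesses:
        lo, hi = sorted((proc, placement[obj]))
        for link in range(lo, hi):
            links[link] += 1
    return max(links)

# One access per processor to its own object: placing everything on
# processor 0 turns the first link into a bottleneck, while hashing
# spreads the objects (and hence the routing load) over the array.
objs = [f"var{p}" for p in range(8)]
accesses = [(p, objs[p]) for p in range(8)]
central = {o: 0 for o in objs}
hashed = {o: home_by_hashing(o, 8) for o in objs}
```

For the central placement, all seven remote processors send across the first link, so its congestion is 7; the hashed placement typically keeps every link well below that.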
Such a setting, with processors connected by a relatively sparse network, is typical for most of today's parallel computers, including the Parsytec GCel and GCpp, Intel Paragon, Fujitsu AP1000, and Cray T3D and T3E. The processors in all these systems are connected by mesh or torus networks. Clearly, the larger the number of processors in these systems, the more the communication bandwidth becomes the bottleneck, because the bisection width of these networks increases less than the number of processors.

For this scenario, hashing yields an even distribution of the data among the processors and also an even distribution of the communication or routing load among the links in the network. Several hashing-based strategies have been analyzed in the context of PRAM simulation. For instance, Ranade [24] describes a hashing-based PRAM emulation for the direct butterfly network. He shows that an N-processor PRAM can be emulated by an N-processor butterfly network with slowdown O(log N). This scheme can also be adapted to other networks, which, e.g., yields an N-processor PRAM simulation for the √N × √N mesh with slowdown O(√N). This slowdown is optimal for general PRAM simulations because of the √N bisection width of the mesh. Nevertheless, it is completely unsatisfying for applications that include locality. This shows that the main drawback of uniform hashing is that it gives up any locality in the pattern of read and write accesses.

In order to exploit locality, we have to minimize the communication load. This can be done by minimizing the distances from the accessing processors to the accessed objects. This problem is widely studied in the context of file allocation and distributed paging; see, e.g., [1, 19, 2]. A survey on these topics is given in [3]. Clearly, minimizing the distances minimizes the total communication load. Unfortunately, it can also increase the congestion, i.e., the maximum number of data packets which have to cross the same link. The congestion describes the worst bottleneck of the system and therefore gives a lower bound on the execution time of a given application. Moreover, several results on store-and-forward and wormhole routing (see, e.g., [18, 21, 10, 23]) indicate that this value is also a good approximation of an upper bound on the execution time of coarse-grained applications with high communication load and low synchronization requirements. This shows the importance of considering the congestion rather than the total communication load.

In [20] we presented static and dynamic placement strategies for acyclic networks as well as for multidimensional meshes. Furthermore, we developed static strategies for indirect networks like Clos networks or fat-trees. All these strategies aim to minimize the congestion. The static strategy maps the objects to the modules according to some knowledge of the access pattern of a given application. The dynamic strategy makes all placement decisions on-line, i.e., it has no knowledge about the access patterns beforehand; it is a combined hashing and caching strategy. Both strategies can work either with or without redundancy. We compare the achieved congestion with that of an optimal strategy and show that it is close to optimal.

Implementation and application. The distributed variables library DIVA provides functions for simulating shared memory on distributed systems. The idea is to provide an access mechanism to distributed variables rather than to memory pages or single memory cells. The variables can be created and released at runtime. Once a global variable is created, each participating processor in the system has access to it. For latency hiding, reads and writes can be performed in two separate function calls.
The first call initiates the variable access and the second call waits for its completion. The time between initiation and completion of a variable access can be hidden by other local instructions or variable accesses. Currently, we are working on making the DIVA library usable for a parallelization of the computer algebra system MuPAD, in cooperation with one of the application projects. For this, several protocols for managing global variables, including those mentioned above, have been implemented and incorporated into DIVA.
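The split-phase access pattern just described can be sketched as follows. This is not the DIVA API; it is a minimal illustration of the initiate/complete idea, with a thread pool standing in for the remote communication layer and all names chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor
import time

class GlobalVariable:
    """Sketch of split-phase (initiate/complete) variable access in the
    spirit of DIVA's latency hiding; all names here are hypothetical."""

    _pool = ThreadPoolExecutor(max_workers=4)   # stands in for the network

    def __init__(self, value=None):
        self._value = value

    def _remote_read(self):
        time.sleep(0.01)                        # simulated communication latency
        return self._value

    def read_init(self):
        """First call: initiate the access and return a handle immediately."""
        return self._pool.submit(self._remote_read)

    @staticmethod
    def read_complete(handle):
        """Second call: wait for the access to complete and return the value."""
        return handle.result()

# Latency hiding: start the read, overlap it with local work, then complete.
v = GlobalVariable(42)
handle = v.read_init()
local = sum(range(1000))          # useful local computation during the fetch
result = GlobalVariable.read_complete(handle)
```

The local computation between `read_init` and `read_complete` runs while the simulated fetch is in flight, which is exactly the overlap the two-call interface is designed to enable.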
References

1. B. Awerbuch, Y. Bartal, and A. Fiat: Competitive distributed file allocation, Proc. of the 25th ACM Symp. on Theory of Computing (STOC), pages 164–173, 1993.
2. B. Awerbuch, Y. Bartal, and A. Fiat: Distributed paging for general networks, Proc. of the 7th ACM Symp. on Discrete Algorithms (SODA), pages 574–583, 1996.
3. Y. Bartal: Survey on distributed paging, Proc. of the Dagstuhl Workshop on On-line Algorithms, 1996.
4. R. D. Blumofe, C. E. Leiserson: Scheduling Multithreaded Computations by Work Stealing, Proc. 36th Symp. on Foundations of Computer Science (FOCS '95), pp. 356–368, 1995.
5. P. Berenbrink, F. Meyer auf der Heide, V. Stemann: Fault Tolerant Shared Memory Simulations, Proc. 13th Symp. on Theoretical Aspects of Computer Science, pp. 181–192, 1996.
6. P. Berenbrink, F. Meyer auf der Heide, K. Schröder: Allocating Weighted Jobs in Parallel, Proc. of the 9th ACM Symp. on Parallel Algorithms and Architectures (SPAA '97), to appear.
7. A. Czumaj, F. Meyer auf der Heide, V. Stemann: Shared memory simulations with triple-logarithmic delay, Proc. of the 3rd European Symposium on Algorithms (ESA), pp. 46–59, 1995.
8. A. Czumaj, F. Meyer auf der Heide, V. Stemann: Simulating Shared Memory in Real Time: On the Computation Power of Reconfigurable Architectures, Technical Report SFB tr-rsfb-96-006, Paderborn University, Jan. 1996.
9. A. Czumaj, F. Meyer auf der Heide, V. Stemann: Contention Resolution in Hashing Based Shared Memory Simulations, Technical Report SFB tr-rsfb-96-005, Paderborn University, Dec. 1996; and: Information and Computation, to appear.
10. R. Cypher, F. Meyer auf der Heide, C. Scheideler, and B. Vöcking: Universal algorithms for store-and-forward and wormhole routing, Proc. of the 26th ACM Symp. on Theory of Computing (STOC), pages 356–365, 1996.
11. T. Decker: Virtual Data Space - A Universal Load Balancing Scheme, Proc. 4th Int. Symp. on Solving Irregularly Structured Problems in Parallel (IRREGULAR '97), 1997, to appear.
12. T. Decker, R. Diekmann: Mapping of Coarse-Grained Applications onto Workstation-Clusters, Proc. 5th Euromicro Workshop on Parallel and Distr. Processing, pp. 5–12, 1997.
13. T. Decker, R. Diekmann, R. Lüling, B. Monien: Towards Developing Universal Dynamic Mapping Algorithms, Proc. 7th IEEE Symp. on Parallel and Distr. Processing, pp. 456–459, 1995.
14. R. Diekmann, B. Monien, R. Preis: Load Balancing Strategies for Distributed Memory Machines, in F. Karsch, H. Satz (eds.): “Multi-Scale Phenomena and their Simulation”, World Scientific, 1997, to appear.
15. R. Feldmann, P. Mysliwietz, B. Monien: A fully distributed chess program, Advances in Computer Chess VI, Ellis Horwood Publishers, pp. 1–27, 1991.
16. R. Karp, M. Luby, F. Meyer auf der Heide: Efficient PRAM simulation on a distributed memory machine, Algorithmica (16), pp. 517–542, 1996.
17. R. Lüling: Lastverteilung zur effizienten Nutzung paralleler Systeme (Load balancing for the efficient use of parallel systems), Ph.D. Thesis, Shaker-Verlag, 1996, to appear.
18. F. T. Leighton, B. M. Maggs, A. G. Ranade, and S. B. Rao: Randomized routing and sorting on fixed-connection networks, Journal of Algorithms (17), pp. 157–205, 1994.
19. C. Lund, N. Reingold, J. Westbrook, and D. Yan: On-line distributed data management, Proc. of the 2nd European Symposium on Algorithms (ESA), 1996.
20. B. Maggs, F. Meyer auf der Heide, B. Vöcking, and M. Westermann: Exploiting locality for networks of limited bandwidth, Techn. Report tr-rsfb-97-042, University of Paderborn, 1997.
21. F. Meyer auf der Heide, B. Vöcking: A packet routing protocol for arbitrary networks, Proc. 12th Symp. on Theoretical Aspects of Computer Science (STACS), pages 291–302, 1995.
22. F. Meyer auf der Heide, B. Vöcking: Static and dynamic data management in networks, Proc. of Euro-Par '97, to appear.
23. R. Ostrovsky and Y. Rabani: Universal O(congestion + dilation + log^{1+ε} n) local control packet switching algorithms, Proc. of the 29th ACM Symp. on Theory of Computing (STOC), to appear, 1997.
24. A. G. Ranade: How to emulate shared memory, Proc. of the 28th IEEE Symp. on Foundations of Computer Science (FOCS), pages 185–194, 1987.
25. C.-Z. Xu, B. Monien, R. Lüling, F. C. M. Lau: An Analytical Comparison of Nearest Neighbour Algorithms for Load Balancing in Parallel Computers, Proc. of the International Parallel Processing Symposium (IPPS '95), pp. 472–479, 1995.
26. C.-Z. Xu, S. Tschöke, B. Monien: Performance Evaluation of Load Distribution Strategies in Parallel Branch and Bound Computations, Proc. 7th Symposium on Parallel and Distributed Processing (SPDP '95), pp. 402–405, 1995.