On the Implementation of Virtual Shared Memory

Wolf Zimmermann and Holger Kumm
Institut für Programmstrukturen und Datenorganisation
Universität Karlsruhe
P.O. Box 6980, 76128 Karlsruhe, Germany
Abstract

The field of parallel algorithms has demonstrated that a machine model with virtual shared memory is easy to program. Most efforts in this field have concentrated on the PRAM model. Theoretical results show that a PRAM can be simulated optimally on an interconnection network. We describe implementations of some of these PRAM simulations and discuss their performance.
1 Introduction

The field of parallel algorithms has demonstrated that a machine model with virtual shared memory is easy to program. Most efforts in this field have concentrated on the PRAM model, and efficient parallel algorithms have been developed for most application domains [5]. A v-PRAM has v processors and a global shared memory. Every processor can access the shared memory, and communication between two processors is accomplished by accessing it. Each processor may additionally have a local memory which cannot be accessed by other processors.

The PRAM model is based on the unrealistic assumption that an access to the global memory costs constant time. More realistic computation models are interconnection networks with distributed memory. Therefore, in order to make the advantages of the PRAM model available on parallel machines, a shared memory has to be simulated on interconnection networks. One part of this implementation is the distribution of the shared memory. In practice, this distribution is determined by data dependencies. However, for parallel algorithms where the communication pattern depends on the state of the PRAM (e.g. list ranking and parallel graph algorithms [5]), it is usually impossible to determine such dependencies. Here, we answer the questions of how to distribute the shared memory and how to realize shared-memory accesses in this case.
The theory provides methods for the universal simulation of PRAMs on interconnection networks with distributed memory. All these methods consist of two components: the distribution of the shared memory and the realization of memory accesses by routing algorithms. The methods are either deterministic [13, 15, 1, 12, 4] or probabilistic [13, 11, 3, 16, 7, 14]. All the deterministic simulations use the same memory organization: a cell of the shared memory is stored in multiple copies on different processors of the interconnection network. In [12] it is shown that a PRAM-step of a p-PRAM can be simulated on a p-processor mesh of trees in time O(log^2 p / log log p). However, all the proposed concepts rely on a memory organization scheme which is not explicitly known; in fact, only the existence of such a scheme is shown. The probabilistic simulations distribute the shared memory randomly. The most interesting simulation in this respect is that of Valiant [16]. It performs one PRAM-step of a v-PRAM on a p-processor hypercube, butterfly, or cube-connected-cycles network (CCC) in expected time O(max(log p, v/p)). It is therefore theoretically optimal if v > p log p, i.e. if the number of virtual processors is much larger than the number of physical processors (a realistic assumption). The next section describes the basic ideas of this simulation. The last section discusses performance measurements of the simulation on the MasPar MP-1 (16384 processors).
2 The Simulation Scheme

The implemented PRAM simulation is based on the simulation of [16]. In this section we discuss three topics: the memory organization, the implementation of memory accesses, and the implementation of a PRAM-step. We assume that the address space of the virtual shared memory is Z_m, where m is polynomial in the number of virtual processors. We also assume that m is prime.
2.1 The Memory Organization

The memory organization uses the universal class of hash functions of [13]:

    H = { a_0 + a_1*x + ... + a_d*x^d mod m : a_i in Z_m }

The simulation randomly selects an h in H and maps the memory cell with address i to the processor with address h(i) mod p. In [16] the degree of these hash functions is d = log p. In [7], however, a constant degree is used; the simulation of a step of a v-PRAM on a p-processor hypercube (butterfly, CCC) then also costs expected time O(v/p), but now v > p^(1+eps) for some eps > 0 is required. In fact, the simulation results imply that good results can be obtained even for a constant degree of the hash function below this bound. Let X be the maximal number of memory requests a single processor has to answer during the simulation of any PRAM-step. It is possible to show Pr[X >= 4v/p] = o(1/p) if v = p log p [16, Corollary 4.2].

After mapping the shared memory to the distributed memory, an efficient memory organization inside each local memory has to be developed. For this purpose we also use hashing. In [16] the same universal class of hash functions is used, with collision resolution by chaining; Valiant shows that with high probability O(log p) memory requests can be answered. The perfect hashing scheme of [2], however, guarantees constant-time access, and we therefore use this scheme. Altogether, it is possible to simulate a step of a v-PRAM in expected time O(v/p) on a complete p-processor network.
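To make the distribution concrete, the following C sketch evaluates a hash function from H by Horner's rule and maps a shared-memory cell to its processor. It is only an illustration under stated assumptions (degree 3, which Section 3 reports to work well; 64-bit arithmetic without overflow protection); the names hash_t, hash_eval, cell_to_processor, and hash_select are ours, not taken from the original implementation.

    #include <stdint.h>
    #include <stdlib.h>

    #define D 3  /* degree of the hash polynomial (assumption: 3, cf. Section 3) */

    typedef struct {
        uint64_t a[D + 1]; /* random coefficients a_0 .. a_D from Z_m */
        uint64_t m;        /* prime modulus, polynomial in v          */
        uint64_t p;        /* number of physical processors           */
    } hash_t;

    /* Evaluate h(i) = (a_0 + a_1*i + ... + a_D*i^D) mod m by Horner's rule.
       Assumes m is small enough that r * x does not overflow 64 bits.    */
    static uint64_t hash_eval(const hash_t *h, uint64_t i)
    {
        uint64_t x = i % h->m;
        uint64_t r = h->a[D];
        for (int k = D - 1; k >= 0; k--)
            r = (r * x + h->a[k]) % h->m;
        return r;
    }

    /* The memory cell with address i is stored on processor h(i) mod p. */
    static uint64_t cell_to_processor(const hash_t *h, uint64_t i)
    {
        return hash_eval(h, i) % h->p;
    }

    /* Randomly select an h from H, as done when (re)organizing the memory. */
    static void hash_select(hash_t *h, uint64_t m, uint64_t p)
    {
        h->m = m;
        h->p = p;
        for (int k = 0; k <= D; k++)
            h->a[k] = (((uint64_t)rand() << 32) ^ (uint64_t)rand()) % m;
    }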
2.2 Implementation of Memory Access

An access to shared-memory address i is realized by sending a memory request to processor h(i). Processor h(i) answers the request via its local perfect hash function k_{h(i)}. If the memory request is a write request, processor h(i) writes the data into its local memory cell k_{h(i)}(i). If it is a read request, the processor reads from this cell and sends the answer to the origin of the request.

Thus, there is a need to explore routing algorithms. In [16] it is shown that on a butterfly with p processors and p log p switching nodes, O(log p) requests per processor can be answered in expected time O(log p). If each node is a processor, the routing algorithm of [14] still satisfies the same time bound. The packet-switching network of the MasPar MP-1 is a grid; therefore, we explore routing algorithms on grids. It should be mentioned that pure oblivious or pure greedy routing algorithms for bounded-degree p-processor networks with limited buffer capacity have, even for partial permutation routing, a lower bound of Omega(p) [6]. To avoid this problem, either probabilistic routing [9] or deterministic non-oblivious routing [10] achieves optimal performance. We implemented both routing strategies. These routing algorithms need deterministic and expected time O(sqrt(p)), respectively, for routing a permutation. Putting the results on the memory organization and on routing on grids together yields expected time O(v/sqrt(p)) for one PRAM-step, if v >= p log p.
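The following C sketch shows how such a request might be represented and served at processor h(i). The helpers local_slot (standing for the perfect hash function k_{h(i)} of [2], whose construction is not shown), local_mem, and send are hypothetical placeholders for machinery described elsewhere in the paper.

    #include <stdint.h>

    typedef enum { REQ_READ, REQ_WRITE } req_kind_t;

    typedef struct {
        req_kind_t kind;
        uint64_t   addr;   /* shared-memory address i           */
        uint64_t   data;   /* payload for writes / reply value  */
        uint64_t   origin; /* processor that issued the request */
    } request_t;

    extern uint64_t local_slot(uint64_t addr);        /* perfect hash k_{h(i)}, O(1) [2] */
    extern uint64_t local_mem[];                      /* this processor's local memory   */
    extern void     send(uint64_t dest, request_t r); /* hand a packet to the router     */

    /* Executed by processor h(i) for each incoming memory request. */
    void serve_request(request_t r)
    {
        uint64_t s = local_slot(r.addr);
        if (r.kind == REQ_WRITE) {
            local_mem[s] = r.data;  /* write into local cell k_{h(i)}(i)     */
        } else {
            r.data = local_mem[s];  /* read the cell and send the answer     */
            send(r.origin, r);      /* back to the origin of the request     */
        }
    }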
2.3 Implementation of a PRAM-step

We synchronize either explicitly after each PRAM-step, or by counting steps on the network. The latter method uses the probabilistic time bound above: we run the simulation of one PRAM-step for c*v/sqrt(p) network steps, for some constant c. If the PRAM-step is completed within this time, we can start with the next one. If it is not completed (this is discovered during the execution of the next PRAM-step), we reset the computation and reorganize the shared memory with a newly selected random hash function. If the number p of processors is small, the first method is chosen, because synchronization of all processors is fast. Otherwise, if synchronization is expensive, the second method seems preferable. A minimal sketch of this timeout-and-rehash loop is given below.

When implementing a PRAM-step, we have to map PRAM-processors to the physical processors. The computation to be done in a PRAM-step is almost the same as in a RAM: first reading the operands, second performing the computation, and third writing the result. In all theoretical simulations a PRAM-processor is mapped to a physical processor which simulates these steps. We call this the central three-phase simulation. It emerged that in practice this choice is the worst.

The computations performed by a PRAM-processor can also be distributed over several processors. Instead of sending read requests for the operands, one can send a packet containing the command first to the processor holding the first operand, from there to the processor holding the second operand, and then to the processor which holds the result address; this last processor also performs the computation. We call this scheme the distributed three-phase simulation. Finally, instead of sending one packet to the first operand and then to the second, we can send two packets at the same time to the first and second operand, respectively. From the processors holding the operands, the packets are sent to the processor to which the result address has been mapped; as in the distributed three-phase simulation, this processor performs the computation. We call this scheme the distributed two-phase simulation. For the distributed simulations, the number of packets in the network is reduced by factors of 2/5 and 1/5, respectively. Figure 1 shows the three different simulation schemes.
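A minimal C sketch of the second synchronization method (counting network steps, with reset and rehashing on overflow) might look as follows; route_one_step, reset_computation, and redistribute_memory are hypothetical placeholders, and hash_t/hash_select refer to the sketch in Section 2.1. Unlike the actual implementation, which discovers the overflow during the next PRAM-step, the sketch detects it immediately.

    #include <math.h>
    #include <stdint.h>

    extern int  route_one_step(void);           /* advance the network by one step;
                                                   returns 0 once all packets arrived */
    extern void reset_computation(void);
    extern void redistribute_memory(hash_t *h); /* rebuild the memory under the new h */

    /* Simulate one PRAM-step with a step budget of c * v / sqrt(p). */
    void simulate_pram_step(hash_t *h, uint64_t v, uint64_t p, double c)
    {
        uint64_t budget = (uint64_t)(c * (double)v / sqrt((double)p));
        for (;;) {
            int pending = 1;
            for (uint64_t t = 0; t < budget && pending; t++)
                pending = route_one_step();
            if (!pending)
                return;                 /* completed in time: start the next step */
            reset_computation();        /* overflow: reset the computation,       */
            hash_select(h, h->m, h->p); /* draw a new random hash function, and   */
            redistribute_memory(h);     /* reorganize the shared memory; retry    */
        }
    }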
[Figure 1: Possible implementations of a PRAM-step: the central 3-phase, distributed 3-phase, and distributed 2-phase simulation schemes, with the numbered packet movements for operand 1, operand 2, and the result.]

As the MasPar MP-1 has two interconnection networks, a two-dimensional torus and an expanded delta-network, we were able to study both networks. The expanded delta-network is a circuit-switching network with a pre-defined (greedy) routing scheme; Figure 2 shows the expanded delta-network as it is implemented in the MasPar MP-1. On the torus, it is possible to implement different routing algorithms.

[Figure 2: An expanded delta-network, built from stages of 4:4 and 64:16 crossbars.]
3 Performance

Our performance measurements aimed to determine the constant factors of the PRAM simulation, to compare different routing algorithms, to compare theoretical and practical behaviour, to study the influence of good hash functions, and to study the different addressing modes.
We implemented the following routing algorithms: a deterministic greedy routing algorithm on grids and tori, respectively; probabilistic greedy routing algorithms [9]; and the optimal k-k routing algorithm of Kunde and Tensi [8]. Figure 3 shows the time needed for a PRAM-step simulated by 256 MasPar processors, where the PRAM-step uses direct addresses for its result as well as for its operands. Figure 3 also shows that the distributed 2-phase simulation scheme is more efficient than the 3-phase schemes. The global routing network (the expanded delta-network) is the most efficient, due to its low diameter. The time to simulate a routing step is on the order of seconds. This is partly due to the long store and load times on the MP-1; on the other hand, the simulation could be sped up substantially if a local MIMD concept (i.e. concurrent execution of if-branches, concurrent communication in different directions) were provided. In order to determine a more machine-independent constant, we also counted the routing steps.
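As an illustration of the simplest of these algorithms, one routing decision of a deterministic greedy (dimension-order) router on an s x s torus can be sketched in C as follows. This is only our reading of "greedy on torus"; the implemented algorithms, in particular [8] and [9], involve additional machinery (queues, randomization, sorting phases) that is omitted here.

    /* First correct the x coordinate along the shorter ring direction, then y. */
    typedef struct { int x, y; } node_t;

    /* Step of +1 or -1 along a ring of size s, taking the shorter way round. */
    static int ring_step(int from, int to, int s)
    {
        int fwd = ((to - from) % s + s) % s; /* forward distance, 0..s-1 */
        return (fwd <= s - fwd) ? 1 : -1;
    }

    /* Next node on the greedy route from cur towards dst on the torus. */
    node_t next_hop(node_t cur, node_t dst, int s)
    {
        if (cur.x != dst.x)
            cur.x = ((cur.x + ring_step(cur.x, dst.x, s)) % s + s) % s;
        else if (cur.y != dst.y)
            cur.y = ((cur.y + ring_step(cur.y, dst.y, s)) % s + s) % s;
        return cur;
    }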
[Figure 3: Running time of a PRAM operation for the different implementations of the PRAM simulation (direct addresses for operands and results, 256 processors). Time per PRAM-step in seconds (1 to 10) plotted against the number of virtual processors (1000 to 8000); curves for the 2-phase and 3-phase simulations with probabilistic, deterministic (Kunde), and greedy routing, greedy on torus, and the global router.]
[Figure 4: Number of routing steps used by a PRAM operation for the different implementations of the PRAM simulation (direct addresses for operands and results, 256 processors). Routing steps per PRAM-step (100 to 1000) plotted against the number of virtual processors (1000 to 8000); curves as in Figure 3.]
In Figure 4 we do this for the same routing algorithms and simulation schemes as in Figure 3. This number is high for the routing algorithms on grids; on the global network the situation is much better. Hence, for two-dimensional grids and tori, the constant factors appear to be too large for practical PRAM simulations. Low-diameter networks such as butterflies and other hypercubic networks, on the other hand, decrease the running time by a factor of sqrt(p). Observe that Figures 3 and 4 show the predicted theoretical behaviour, i.e. the time required for simulating a PRAM-step increases linearly with the number of virtual processors.

Other experiments showed that choosing polynomials of degree three for the universal class of hash functions yields good distributions of the shared memory. Larger degrees result in too much computation time for the evaluation of the hash function; smaller degrees result in too unbalanced memory distributions.

Until now we have studied only the use of direct addresses. For indirect addresses we consider, as an example, the performance of moving data. Table 1 shows the relative performance of data moves. These factors were determined theoretically; surprisingly enough, they are also matched by the practical implementations, the difference from the predicted behaviour being less than 5%. The values were determined for the deterministic greedy routing algorithms, but they also hold for the other routing algorithms.

Finally, we consider the amount of time spent in pure computation. As Figure 5 shows, the only simulation with an overhead of about 80% is the one using the global router network; all other simulations have an overhead of 99.5%. Surprisingly, the queue organization accounts for a large part of the overhead. This can be explained partially by the MP-1 architecture. First, a parallel local data move already requires time on the order of seconds. Second, the SIMD architecture does not allow an independent local organization of the queues; an MIMD architecture would provide a gain in speed with respect to the queue organization. For greedy algorithms, queues can become quite large. The algorithm of Kunde and Tensi reduces the queue length, but it works in independent phases (including a local sorting); these phases can overlap, yet on a SIMD architecture this overlapping cannot be implemented efficiently. Finally, the different phases of a PRAM simulation may also overlap as long as no write conflicts occur and as long as reading from a memory cell is done before writing into it. Again, in SIMD mode it is impossible to profit from this overlapping. Moreover, before writes are executed each processor has to wait until all processors have received their data, and the next PRAM-step can only be executed when the current step is finished. Hence, many processors spend a considerable amount of time waiting; this is also partially contained in the queue organization.

Thus, we have four sources of inefficiency: first, the MasPar MP-1 architecture; second, the organization of hashing and queueing in software; third, the network diameter; and fourth, the synchronizations during and after each PRAM-step. We suggest hardware support for routing and for the evaluation of universal hash functions. For hardware-implemented hash functions the class of [13, Example 1] is suitable: the modulo computations and multiplications involve operands of the form 2^i and can therefore be implemented by suitable shifts and masks (a small illustration follows at the end of this section). The results also show that low-diameter networks are required; the diameter of two-dimensional arrays is far too high. For machines with up to 729 nodes, a three-dimensional array is the preferred architecture; above 1024 nodes, a network with logarithmic diameter is best.

Finally, a large part of the inefficiency stems from the PRAM model itself. An asynchronous execution would reduce the waiting times of the processors dramatically. Unfortunately, the correctness of most PRAM algorithms depends on the synchronous model. Therefore, efforts must be made in two directions: first, the hardware has to support the computation of hash functions, routing, queueing, and synchronization; second, asynchronous parallel algorithms have to be investigated.
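To illustrate the shifts-and-masks remark: when moduli and multipliers are powers of two, the arithmetic of a hash evaluation becomes multiply-free, as in the following C sketch. This only demonstrates the arithmetic; the actual class of hash functions is the one defined in [13, Example 1].

    #include <stdint.h>

    /* x mod 2^i is a bit mask; x * 2^i is a left shift. */
    static inline uint64_t mod_pow2(uint64_t x, unsigned i) { return x & ((1ULL << i) - 1); }
    static inline uint64_t mul_pow2(uint64_t x, unsigned i) { return x << i; }

    /* One multiply-free evaluation step r := (r * 2^i + a) mod 2^j,
       the kind of operation a hardware hash unit would chain.      */
    static inline uint64_t hash_step(uint64_t r, uint64_t a, unsigned i, unsigned j)
    {
        return mod_pow2(mul_pow2(r, i) + a, j);
    }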
Acknowledgments

We thank Peter Fletcher, Arne Frick, Gerhard Goos, Welf Löwe, Martin Trapp, and Achim Weisbrod for many fruitful discussions. We also apologize for the inconvenience the MasPar users in Karlsruhe experienced during our experiments.
addressing mode                   multiplication factor
operand          result           2-phase simulation   3-phase simulation
direct           direct           1                    1
indirect         direct           3/2                  5/3
direct           indirect         3/2                  5/3
indirect         indirect         2                    7/3
direct           local register   1                    2/3
indirect         local register   3/2                  4/3
local register   direct           1/2                  1/3
local register   indirect         1                    1

Table 1: Multiplication factors for the routing time of data moves using different addressing modes
[Figure 5: Overhead in PRAM simulations: breakdown of the time per PRAM-step into hashing, sorting, queue organization, other organization, network, and computation, for the global router, greedy (3-phase), greedy (2-phase), probabilistic, and deterministic (Kunde) implementations.]
References

[1] H. Alt, T. Hagerup, K. Mehlhorn, and F. P. Preparata. Deterministic simulation of idealized parallel computers on more realistic ones. SIAM Journal on Computing, 16:808–835, 1987.

[2] G. V. Cormack, R. N. S. Horspool, and M. Kaiserswerth. Practical perfect hashing. The Computer Journal, 28(1):54–58, 1985.

[3] M. Dietzfelbinger and F. Meyer auf der Heide. A new universal class of hash functions and dynamic hashing in real time. Technical Report 67, Universität-Gesamthochschule Paderborn, Fachbereich Mathematik-Informatik, April 1990.

[4] S. W. Hornick and F. P. Preparata. Deterministic P-RAM simulation with constant redundancy. Information and Computation, 92:81–96, 1991.

[5] R. M. Karp and V. Ramachandran. Parallel algorithms for shared memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Vol. A, pages 871–941. MIT Press, 1990.

[6] D. Krizanc. Oblivious routing with limited buffer capacity. Journal of Computer and System Sciences, 43:317–327, 1991.

[7] C. P. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71:95–132, 1990.

[8] M. Kunde and T. Tensi. (k,k) routing on multidimensional mesh-connected arrays. Journal of Parallel and Distributed Computing, 11(2):146–155, 1991.

[9] T. Leighton. Average case analysis of greedy routing algorithms on arrays. In Proceedings of the 2nd Annual ACM Symposium on Parallel Algorithms and Architectures, pages 2–10. ACM, 1990.

[10] T. Leighton, F. Makedon, and I. Tollis. A 2n-2 step algorithm for routing in an n x n mesh. In Proceedings of the 1st Annual ACM Symposium on Parallel Algorithms and Architectures, pages 328–335. ACM, 1989.

[11] F. Luccio, A. Pietracaprina, and G. Pucci. A probabilistic simulation of PRAMs on a bounded degree network. Information Processing Letters, 28:141–147, 1988.

[12] F. Luccio, A. Pietracaprina, and G. Pucci. A new scheme for the deterministic simulation of PRAMs in VLSI. Algorithmica, 5:529–544, 1990.

[13] K. Mehlhorn and U. Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Informatica, 21:339–374, 1984.

[14] A. Ranade. How to emulate shared memory. Journal of Computer and System Sciences, 42:307–326, 1991.

[15] E. Upfal and A. Wigderson. How to share memory in a distributed system. Journal of the ACM, 34:116–127, 1987.

[16] L. G. Valiant. General purpose parallel architectures. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, Vol. A, pages 945–971. MIT Press, 1990.