Concurrent Object-Oriented Description Frameworks for Massively Parallel Computing

Akinori Yonezawa, Masahiro Yasugi*, Kenjiro Taura, Tomio Kamada
E-mail: {yonezawa,yasugi,tau,[email protected]
Department of Information Science, University of Tokyo
1 Introduction
We studied the design, implementation, and application of software systems for the massively parallel computer being designed and built at the RWCP. Our approach is based on the concurrent object-oriented paradigm. Last year we designed a concurrent object-oriented programming (COOP) language called ABCL/ST; this year we developed its language systems and environments, including optimizing compilers[6, 7, 4, 5], runtime systems[2, 3], and debuggers[1].

2 Language Systems
Improvements of the ABCL/ST compiler for EM-4
The main target machine of the ABCL/ST compiler is the EM-4, a fine-grained data-driven parallel computer developed at the Electrotechnical Laboratory, which is regarded as an architectural prototype of the RWC-1 machine. We improved the previously developed compiler in terms of functionality, usability, modularity, and optimization. We developed a novel technique, the Plan-Do style compilation technique for eager data transfer[6, 7], which is suitable for fine-grained architectures such as the EM-4, and used it to improve the ABCL/ST compiler for the EM-4.

An ABCL/ST prototype compiler for the KSR
It is quite important to make the compiler portable, since our final target machine is the one at the RWCP. In cooperation with the University of Manchester, our laboratory ported the ABCL/ST compiler to the KSR1 parallel computer at that university. The hardware architecture of the KSR series provides a shared-memory image, using the ALLCACHE engine, on top of memory (cache) distributed among the processors. Inter-processor communication must therefore go through shared memory: messages are enqueued to and dequeued from shared message queues. This usually requires mutex locks to guarantee that at most one processor modifies a queue at a time; however, mutex locks are expensive operations, especially in a distributed-memory environment. To cope with this problem, we devised a lockless runtime architecture and developed a prototype compiler for the KSR1. Preliminary performance results show a speedup of 2-3 times.

* Present Affiliation: Dept. of Computer and Systems Engineering, Kobe University

C code generation
While the compiler for the EM-4 generates assembly code, the compiler for the KSR1 generates C code, because the details of the KSR processor architecture are not available to users. We designed an abstract machine that provides the EM-4's packet scheduling features. The fixed packet size makes the scheduler simpler and faster. The ABCL/ST compiler manages execution contexts using the heap and a fixed number of C variables (as many as the underlying hardware registers), both to speed up access to instance variables and to avoid spilling register values. Global labels for context switching are implemented as pairs of a function name and a label number (as already proposed for ABCL/f [5]).

ABCL/1 code generation
We added ABCL/1 code generation to the ABCL/ST compiler; thus we can use the programming environment of ABCL/1, such as its interpreter and debugger.

3 Programming Environment
Concurrent programs often exhibit nondeterministic behavior because the execution order of concurrent events may be arbitrary. Such indeterminacy makes it difficult to find the causes of program errors. We developed a debugging scheme that facilitates (1) replay of a specific execution with a minimal amount of logged information, and (2) detection of indeterminacy in message arrival order. We evaluated its performance and usability through a prototype debugging system for ABCL/f on the AP1000 multicomputer with 32-1024 nodes. Our debugging scheme and its implementation technique are also effective for other massively parallel language systems.
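The replay idea can be illustrated with a small Python sketch. It is not the scheme of [1]; it only shows, under simplifying assumptions, how logging the sender of each arriving message (and not its payload) is enough to force the same arrival order in a second run. The function names `record_run`, `replay_run`, and `diverging_receivers` are hypothetical, a seeded random choice stands in for nondeterministic scheduling, and messages between one sender and one receiver are assumed to stay in FIFO order.

```python
import random
from collections import defaultdict, deque


def _channels(messages):
    """Group (sender, receiver, payload) tuples into per-pair FIFO channels."""
    channels = defaultdict(deque)
    for sender, receiver, payload in messages:
        channels[(sender, receiver)].append(payload)
    return channels


def record_run(messages, seed):
    """Deliver messages in a scheduler-dependent order and log arrivals.

    The log maps each receiver to the list of sender ids in arrival order.
    Payloads are not logged.
    """
    rng = random.Random(seed)  # assumption: models nondeterministic timing
    channels = _channels(messages)
    mailboxes, log = defaultdict(list), defaultdict(list)
    while channels:
        sender, receiver = rng.choice(sorted(channels))
        queue = channels[(sender, receiver)]
        mailboxes[receiver].append(queue.popleft())
        log[receiver].append(sender)
        if not queue:
            del channels[(sender, receiver)]
    return dict(mailboxes), dict(log)


def replay_run(messages, log):
    """Re-deliver the same messages, forcing the logged arrival order."""
    channels = _channels(messages)
    return {
        receiver: [channels[(sender, receiver)].popleft() for sender in senders]
        for receiver, senders in log.items()
    }


def diverging_receivers(log_a, log_b):
    """Receivers whose arrival order differs between two recorded runs."""
    return {r for r in log_a.keys() | log_b.keys() if log_a.get(r) != log_b.get(r)}


if __name__ == "__main__":
    msgs = [("a", "r", 1), ("b", "r", 2), ("a", "r", 3), ("b", "q", 4)]
    boxes, log = record_run(msgs, seed=1)
    print("replay matches recorded run:", replay_run(msgs, log) == boxes)
```

Comparing the logs of two recorded runs with `diverging_receivers` reports only indeterminacy that was actually observed; it does not prove that other receivers are order-independent.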
4 Applications and Performance Measurements

Description of 3-D N-body problem in ABCL/ST
We have already proposed a concurrent object-oriented algorithm for the N-body problem[8]. The algorithm is based on the Barnes and Hut O(N log N) sequential algorithm, and its time complexity is O(log N) using O(N) concurrent objects. This year we described the three-dimensional version of this algorithm in ABCL/ST. We increased the problem size (# of particles) that we can solve on the EM-4 by writing the program so that memory for pending messages is saved. This is realized by using ACK messages between the sender and the receiver(s). It is implemented as ABCL/1's now-type (RPC-style) message passing, while concurrency is maintained by returning ACK messages before processing request messages.

Performance Measurements with 3-D N-body application

# of particles    EM-4 (sec)    Workstation (C) (sec)
      500            1.16             2.07
     1000            3.15             9.38
     1500            4.53            12.5
     2000            6.28            15.6
     2500            8.65            19.5
     3000           11.3             23.8

Table 1: Performance Measurements with 3-D N-body application
We benchmarked the system by measuring the execution time of the 3-D N-body application on the EM-4 and on workstations (Super SPARC, 36 MHz). The workstation code was produced by compiling the C code generated by the ABCL/ST compiler with gcc 2.6.3. The EM-4 consists of 80 PEs running at 12.5 MHz and has no FPUs; we therefore estimate that the workstation is about 20 times faster than a single EM-4 PE for real-number computation. Under this estimate, the 80 PEs give a speedup of about 36 times (N=500) or 60 times (N=1000) relative to a single PE (see Table 1).

Other applications
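The speedup figures quoted for the N-body benchmark can be reproduced from Table 1. The sketch below assumes, as the text does, that the workstation is about 20 times faster than a single EM-4 PE, so the single-PE time is estimated as 20 times the workstation time. The names `TABLE_1`, `WORKSTATION_VS_ONE_PE`, and `estimated_speedup` are illustrative only.

```python
# Execution times in seconds copied from Table 1:
# number of particles -> (EM-4 with 80 PEs, workstation running the C code)
TABLE_1 = {
    500: (1.16, 2.07),
    1000: (3.15, 9.38),
    1500: (4.53, 12.5),
    2000: (6.28, 15.6),
    2500: (8.65, 19.5),
    3000: (11.3, 23.8),
}

# The text's estimate: one workstation is roughly 20x faster than one EM-4 PE.
WORKSTATION_VS_ONE_PE = 20


def estimated_speedup(num_particles):
    """Estimated single-PE time divided by the measured 80-PE EM-4 time."""
    em4_time, workstation_time = TABLE_1[num_particles]
    single_pe_time = WORKSTATION_VS_ONE_PE * workstation_time
    return single_pe_time / em4_time


if __name__ == "__main__":
    for n in sorted(TABLE_1):
        print(f"N={n:4d}  estimated speedup = {estimated_speedup(n):.1f}")
```

For N=500 and N=1000 the computed values round to the 36 and 60 reported in the text. The result depends entirely on the factor of 20, which is an estimate and not a measurement.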
We constructed a visualization program for implicit functions (solutions of equations) as an application of ABCL/ST. This program finds a `solution pixel' in the region to be visualized by recursively dividing a region whenever the estimated probability that it contains solutions is higher than a given threshold. The program also divides a region if it is adjacent to a solution pixel, so that contiguous solution pixels can be found even with a lower threshold, which reduces the number of wasteful divisions. This means that the algorithm is not simply recursive, and thus concurrent objects are required to represent the intermediate nodes of the division trees.

References
[1] T. Kamada. A study on debugging schemes for concurrent programs on massively parallel processors. Master's thesis, Department of Information Science, University of Tokyo, Mar. 1994.
[2] T. Kamada, S. Matsuoka, and A. Yonezawa. Efficient parallel global garbage collection on massively parallel computers. In Proc. of Supercomputing, pages 79-88. IEEE, 1994.
[3] T. Kamada, S. Matsuoka, and A. Yonezawa. Efficient parallel global garbage collection on massively parallel computers. In Proc. of Joint Symposium on Parallel Processing (JSPP), pages 33-40, May 1994. (in Japanese).
[4] K. Taura, S. Matsuoka, and A. Yonezawa. ABCL/f: A future-based polymorphic typed concurrent object-oriented language - its design and implementation -. In G. Blelloch, M. Chandy, and S. Jagannathan, editors, Proceedings of the DIMACS workshop on Specification of Parallel Algorithms, pages 275-292, 1994.
[5] K. Taura, S. Matsuoka, and A. Yonezawa. StackThreads: An abstract machine for scheduling fine-grain threads on stock CPUs. In Springer Lecture Notes in Computer Science No. 907, pages 121-136, Mar. 1995.
[6] M. Yasugi, S. Matsuoka, and A. Yonezawa. The plan-do style compilation technique for eager data transfer in thread-based execution. In Proc. of the IFIP WG10.3 International Conference on Parallel Architectures and Compilation Techniques, Montreal, Canada, pages 57-66, Aug. 1994.
[7] M. Yasugi, S. Matsuoka, and A. Yonezawa. The plan-do style compilation technique for eager data transfer in thread-based execution. IPSJ SIG Notes 94-PRG-18 (SWoPP'94), 94(65):9-16, July 1994. (in Japanese).
[8] M. Yasugi and A. Yonezawa. Towards performance evaluation of a N-body problem algorithm in an object-oriented concurrent language on a data-driven parallel computer. In Object-Oriented Computing II (WOOC'93), pages 147-154. Kindai Kagaku Sha, Apr. 1994. (in Japanese).