AN IMPLEMENTATION AND PERFORMANCE EVALUATION OF LANGUAGE WITH FINE-GRAIN THREAD CREATION ON SHARED MEMORY PARALLEL COMPUTER

YOSHIHIRO OYAMA* KENJIRO TAURA TOSHIO ENDO AKINORI YONEZAWA
Department of Information Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033 Japan
{oyama, tau, endo, [email protected]}

* JSPS Research Fellow

Abstract

We implemented two applications with irregular parallelism in (1) C with a thread library and (2) our concurrent language Schematic, which supports efficient fine-grain dynamic thread creation and dynamic load balancing. We compared the two approaches, focusing on program description cost and performance. Schematic not only achieves common programming practices seen in C, such as task queue management, with much smaller description cost, but also incorporates advanced optimizations for synchronization such as inter-thread communication in registers. The case study shows that Schematic can describe irregular applications more naturally and can achieve high performance: Schematic runs about 2.8 times slower than C in a sequential environment, and its speedup on a 64-processor environment is comparable to that of C.

Keywords: compiler-controlled fine-grain multithreading, application study, irregular parallelism

1 INTRODUCTION

One of the largest challenges for parallel computing is to solve problems with irregular parallelism efficiently. Typical irregular problems have the following characteristics:

1. The data structure being computed is determined dynamically, and the amount of computation for each part of the structure cannot be predicted.
2. Multiple threads executed in parallel have complicated data dependencies.

Focusing on these characteristics, this paper compares two approaches to solving irregular applications on symmetric multiprocessors: one uses C and a thread library, and the other uses a fine-grain multithread language. Many researchers have proposed fine-grain multithread languages to liberate programmers from the heavy programming burden of low-level languages such as C and to let them describe irregular structures naturally. The paradigm aims to provide a convenient programming environment in which programmers can create threads aggressively to expose parallelism without worrying about thread creation cost. A naive multithread implementation, however, cannot serve as a realistic programming alternative because of inherent overheads such as a large amount of thread scheduling and synchronization. Though many researchers have proposed techniques to reduce this overhead, few have compared the resulting performance with popular alternatives such as C. There are also few pieces of research that demonstrate the effectiveness of multithread languages considering programming effort and performance together [1]. In this research, we implement a fine-grain multithread language on a symmetric multiprocessor using efficient implementation techniques and compare it with C on two applications, focusing on description cost and sequential/parallel performance.

2 SAMPLE PROBLEMS

We adopt RNA secondary structure prediction (RNA) as a program with characteristic 1, and a parallel CFG parser (CKY) as a program with characteristic 2.

RNA RNA is essentially a tree traversal problem with pruning. Parallelism is exposed by traversing nodes concurrently; other parallelism is not considered. High performance cannot be obtained in this kind of problem with a simple use of coarse-grain threads, which typically have a large creation cost. Consider the naive solution in C of creating a thread for each subtree whose depth is less than some threshold. Since pruning makes the shape of the tree difficult to predict, this solution cannot guarantee a load balance that correctly reflects the number of processors. If too shallow a threshold is used, we fail to utilize all processors; if too deep a threshold is used, a large thread creation cost is paid unnecessarily.

CKY CKY is essentially a calculation of the elements of an upper triangular matrix. The matrix size we use is about 100. Only the values of the diagonal elements are given. The calculation of each element depends on the values of all elements to its left in the same row and below it in the same column. Each element is calculated in parallel; other parallelism is not considered. The naive solution in C of creating a coarse-grain thread for each element exhibits low performance because of the overhead of a large number of thread creations and terminations.
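To make the dependence concrete, the following is a minimal sequential sketch, not the parser itself: combine is a placeholder for the real grammar step, and only the dependence pattern (left in the row, below in the column) is taken from the description above.

```c
#include <stdio.h>

#define N 100

static double cell[N][N];          /* upper triangle: cell[i][j] with i <= j */

/* Hypothetical combine step; the real parser merges grammar derivations here. */
static double combine(double left, double below) { return left * below; }

int main(void) {
    for (int i = 0; i < N; i++) cell[i][i] = 1.0;   /* only the diagonal is given */

    /* Walk diagonals outward from the main diagonal toward the vertex. */
    for (int d = 1; d < N; d++) {
        for (int i = 0; i + d < N; i++) {
            int j = i + d;
            double acc = 0.0;
            /* cell[i][j] depends on cell[i][k] (left in row i)
               and cell[k+1][j] (below in column j). */
            for (int k = i; k < j; k++)
                acc += combine(cell[i][k], cell[k + 1][j]);
            cell[i][j] = acc;
        }
    }
    printf("cell[0][%d] = %g\n", N - 1, cell[0][N - 1]);
    return 0;
}
```

Every cell off the diagonal therefore has a whole row segment and a whole column segment as inputs, which is what makes a one-thread-per-cell decomposition both natural and synchronization-heavy.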

3 C AND A THREAD LIBRARY

Generally, programmers do not create coarse-grain threads dynamically; instead, they usually split a coarse task into finer pieces and provide a task queue, a shared data structure for managing tasks and balancing workloads among processors. Furthermore, they pay much attention to the implementation of synchronization. If the synchronization delay is expected to be small or no other task is available, most programmers adopt busy-wait synchronization, because it is easy to implement and its overhead is small when a synchronization succeeds immediately. Otherwise, they adopt blocking synchronization, in which a processor switches to another task when a synchronization fails.
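As a rough illustration only, the two choices look like this with POSIX threads; the names sync_flag, wait_spin, wait_block, and signal_done are ours and are not taken from the programs evaluated in this paper.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared completion flag plus the machinery needed for the blocking variant. */
typedef struct {
    atomic_int      done;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} sync_flag;

/* Busy-wait: cheap when the producer finishes almost immediately. */
static void wait_spin(sync_flag *f) {
    while (!atomic_load_explicit(&f->done, memory_order_acquire))
        ;                                   /* spin; keeps the processor busy */
}

/* Blocking: the waiter sleeps so the processor can run another task. */
static void wait_block(sync_flag *f) {
    pthread_mutex_lock(&f->mu);
    while (!atomic_load(&f->done))
        pthread_cond_wait(&f->cv, &f->mu);
    pthread_mutex_unlock(&f->mu);
}

static void signal_done(sync_flag *f) {
    pthread_mutex_lock(&f->mu);
    atomic_store_explicit(&f->done, 1, memory_order_release);
    pthread_cond_broadcast(&f->cv);
    pthread_mutex_unlock(&f->mu);
}

static sync_flag flag = { .done = 0, .mu = PTHREAD_MUTEX_INITIALIZER,
                          .cv = PTHREAD_COND_INITIALIZER };

static void *producer(void *arg) { (void)arg; signal_done(&flag); return NULL; }

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);
    wait_block(&flag);   /* wait_spin(&flag) when the wait is known to be short */
    pthread_join(t, NULL);
    return 0;
}
```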

RNA A fixed number of threads are created as workers at the beginning of the program. A task queue is provided to store runnable tasks. Instead of calling the traverse function directly, we enqueue tasks (i.e., structures storing the arguments necessary for the call) for traversing nodes. Since a single global queue may cause frequent contention, each thread keeps a local queue, from which it normally acquires tasks. If its queue is empty, a thread steals a task from another queue. Application programmers must endure tedious descriptions such as task queue management, mutual exclusion, and creating/terminating threads, including termination detection.
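A hedged sketch of that structure follows; the names (task, worker_queue, steal) are invented for illustration, the queue uses a plain lock rather than a lock-free deque, and overflow handling and termination detection are omitted.

```c
#include <pthread.h>

/* A task packages the arguments of the traverse call instead of calling it directly. */
typedef struct {
    void *node;            /* subtree to traverse */
    int   depth;
} task;

#define QCAP 4096

/* One queue per worker: the owner pushes/pops at the tail,
   thieves remove from the head, all under the same lock. */
typedef struct {
    task            buf[QCAP];
    int             head, tail;
    pthread_mutex_t mu;
} worker_queue;

static void queue_init(worker_queue *q) {
    q->head = q->tail = 0;
    pthread_mutex_init(&q->mu, NULL);
}

static void push_local(worker_queue *q, task t) {
    pthread_mutex_lock(&q->mu);
    q->buf[q->tail++ % QCAP] = t;             /* sketch: no overflow check */
    pthread_mutex_unlock(&q->mu);
}

static int pop_local(worker_queue *q, task *out) {
    int ok = 0;
    pthread_mutex_lock(&q->mu);
    if (q->tail > q->head) { *out = q->buf[--q->tail % QCAP]; ok = 1; }
    pthread_mutex_unlock(&q->mu);
    return ok;
}

/* Called by an idle worker on some victim's queue. */
static int steal(worker_queue *victim, task *out) {
    int ok = 0;
    pthread_mutex_lock(&victim->mu);
    if (victim->tail > victim->head) { *out = victim->buf[victim->head++ % QCAP]; ok = 1; }
    pthread_mutex_unlock(&victim->mu);
    return ok;
}

int main(void) {
    worker_queue q0, q1;
    queue_init(&q0); queue_init(&q1);
    push_local(&q0, (task){ .node = NULL, .depth = 0 });
    task t;
    if (!pop_local(&q1, &t) && steal(&q0, &t)) {
        /* worker 1 found nothing locally and stole worker 0's task */
    }
    return 0;
}
```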

CKY A fixed number of threads are created at the beginning of the program. Each thread traverses the matrix in search of an element not yet calculated and starts the calculation if it finds one. Programmers are responsible for preventing multiple threads from calculating the same element. The traversal begins with the elements closest to the diagonal and moves toward the vertex. This order is expected (and confirmed) to make the synchronization delay so small that a dependent thread can wait for a necessary value by busy-waiting.
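The worker loop might be organized roughly as below; this is our own sketch (atomic claimed/ready flags, an empty compute_cell, four threads), not the code measured in the paper.

```c
#include <pthread.h>
#include <stdatomic.h>

#define N 100

static atomic_int claimed[N][N];   /* set when some thread takes cell (i,j) */
static atomic_int ready[N][N];     /* set when cell (i,j) has been computed */

/* Placeholder for the real grammar combination over cell (i,j). */
static void compute_cell(int i, int j) { (void)i; (void)j; }

static void *worker(void *arg) {
    (void)arg;
    /* Scan diagonal by diagonal, nearest to the main diagonal first. */
    for (int d = 1; d < N; d++) {
        for (int i = 0; i + d < N; i++) {
            int j = i + d;
            /* Try to claim the cell; exactly one thread wins it. */
            if (atomic_exchange(&claimed[i][j], 1) != 0)
                continue;
            /* Busy-wait until every dependency in row i and column j is ready.
               The traversal order keeps this wait short in practice. */
            for (int k = i; k < j; k++) {
                while (!atomic_load(&ready[i][k])) ;       /* left in the row     */
                while (!atomic_load(&ready[k + 1][j])) ;   /* below in the column */
            }
            compute_cell(i, j);
            atomic_store(&ready[i][j], 1);
        }
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) atomic_store(&ready[i][i], 1);  /* diagonal is given */
    pthread_t th[4];
    for (int t = 0; t < 4; t++) pthread_create(&th[t], NULL, worker, NULL);
    for (int t = 0; t < 4; t++) pthread_join(th[t], NULL);
    return 0;
}
```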

4 MULTITHREAD LANGUAGE

Schematic [2] is a concurrent object-oriented extension to Scheme. It supports asynchronous function calls and flexible synchronization via first-class communication media. The key primitives are as follows.

(future (f a1 a2 ...)) executes (f a1 a2 ...) and its "continuation" in parallel. The expression immediately returns a communication medium called a channel. The evaluated value of (f a1 a2 ...) will eventually be sent to the channel.

(touch r) tries to receive a value from channel r. If r has a value, the value is consumed and returned as the result of the touch expression. Otherwise, the execution suspends until a value arrives.

Typically, threads are created by future where parallelism is exposed and are synchronized by touch.
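Schematic's primitives are Scheme expressions; purely as a conceptual analogue, the following C sketch mimics their visible behavior with one OS thread per future call and an invented channel type. It is a coarse-grain stand-in and is not how Schematic implements future and touch.

```c
#include <pthread.h>
#include <stdio.h>

/* A channel carries one value from the callee to whoever touches it. */
typedef struct {
    pthread_t tid;
    long      value;
} channel;

typedef long (*fun1)(long);

struct call { fun1 f; long arg; channel *ch; };

static void *run_call(void *p) {
    struct call *c = p;
    c->ch->value = c->f(c->arg);   /* the evaluated value is "sent" to the channel */
    return NULL;
}

/* (future (f a)): start f(a) asynchronously and return a channel at once. */
static channel *future_call(fun1 f, long a, channel *ch, struct call *c) {
    c->f = f; c->arg = a; c->ch = ch;
    pthread_create(&ch->tid, NULL, run_call, c);
    return ch;
}

/* (touch r): wait until the channel has a value and return it. */
static long touch(channel *ch) {
    pthread_join(ch->tid, NULL);
    return ch->value;
}

static long fib(long n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

int main(void) {
    channel ch; struct call c;
    future_call(fib, 30, &ch, &c);        /* runs concurrently with the caller */
    printf("fib(30) = %ld\n", touch(&ch));
    return 0;
}
```

In Schematic itself a future call is far cheaper because no OS thread is created; the analogy only concerns the semantics that programmers see.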

4.1 EXECUTION SCHEME

4.1.1 TASK MANAGEMENT

IMPLEMENTATION Schematic uses processor-local scheduling stacks, which keep runnable closures in LIFO order. In a sequential function call, we push onto the stack a closure executing the "continuation", store the arguments in the correct locations, and jump to the callee body. Future calls are exactly the same as sequential ones except that the continuation may be stolen by another processor. When the callee returns, the closure at the top of the stack is popped and executed; the return value, if any, is applied to the closure. Our load distribution scheme is based on a message-passing variant of Lazy Task Creation [3]. It uses no global task queue; instead, an idle processor steals closures from another processor's stack.
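A single-processor sketch of that calling convention follows, with invented closure and stack names and with stealing omitted; it is an illustration of the push/pop discipline, not the generated code.

```c
#include <stdio.h>

/* A runnable closure: code pointer plus a captured environment word.
   The return value of the callee, if any, is passed in as v. */
typedef struct { void (*code)(long env, long v); long env; } closure;

#define STK 1024
static closure stack[STK];
static int top;                        /* per-processor in the real scheme */

static void push(closure c) { stack[top++] = c; }

/* After the callee returns, the closure at the stack top is popped
   and applied to the return value. */
static void resume(long retval) {
    closure c = stack[--top];
    c.code(c.env, retval);
}

/* The continuation of x + f(10): env carries x, v carries f's result. */
static void add_cont(long x, long v) { printf("%ld\n", x + v); }

static long f(long a) { return a * 2; }

int main(void) {
    long x = 5;
    /* Calling f: push the continuation, run the callee, then resume.
       A future call looks the same, except the pushed continuation
       may be stolen and run by an idle processor. */
    push((closure){ .code = add_cont, .env = x });
    long r = f(10);
    resume(r);           /* prints 25 */
    return 0;
}
```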

COMPARISON Task management and load balancing in Schematic are very similar to those in typical C programs: the operation executed in a future call can be regarded as putting a task into the task queue. Schematic programmers have only to change sequential calls into future calls to achieve that kind of task management and load balancing. As in other multithread languages, they are free from the troublesome programming burdens necessary in low-level languages.

4.1.2 EFFICIENT SYNCHRONIZATION ON REGISTER

IMPLEMENTATION Schematic uses unboxed channels [4] for communicating the return values of future calls. A channel is not created in memory at the moment a future call is executed. Instead, the callee expresses a channel (a pair of a flag and a return value) in registers and eventually gives it to the continuation closure. The flag represents the location of the return value (register or memory) and whether the callee could return a value. The caller accesses the appropriate location according to the flag and obtains the return value, if any. When the continuation is stolen, a channel is dynamically created in memory. This way the return values of future calls can be communicated in registers in most cases [5].

COMPARISON Most existing multithread languages realize every inter-thread communication in memory, so programs that use asynchronous calls intensively are more costly than programs with few asynchronous calls. This discourages programmers from creating many threads. In Schematic, future calls are as cheap as sequential ones. For example, a Fibonacci program calling its two recursive children by future exhibits almost the same performance as its sequential version despite the aggressive use of asynchronous calls. To realize similar register communication in C, we would have to write additional code that branches according to the location of a return value at every point the optimization is used. Moreover, task stealing would need to create a communication medium in memory dynamically and force the tasks in the queue to reference that medium. Such an optimization makes a program so complex that it becomes hard to maintain.
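Conceptually, and only conceptually (the real code is produced by the compiler, and whether the reply struct actually stays in registers depends on the calling convention), the callee/caller protocol resembles this hand-written C sketch; all names are ours.

```c
#include <stdio.h>
#include <stdlib.h>

/* What the callee hands back: a flag plus either the value itself or a
   pointer to a heap channel.  A small two-word struct like this is
   typically returned in registers on modern ABIs, which is the point. */
enum where { IN_REG, IN_MEMORY, NOT_YET };

typedef struct { enum where flag; long payload; } reply;

typedef struct { int full; long value; } heap_channel;   /* used only when stolen */

/* Fast path: the callee finished, so the value rides back in registers. */
static reply callee_fast(long a) {
    return (reply){ .flag = IN_REG, .payload = a * a };
}

/* Slow path: the continuation was stolen, so a channel is made in memory
   and the reply carries its address instead of the value. */
static reply callee_stolen(long a) {
    heap_channel *ch = malloc(sizeof *ch);
    ch->full = 1; ch->value = a * a;
    return (reply){ .flag = IN_MEMORY, .payload = (long)ch };
}

static long caller(reply r) {
    switch (r.flag) {                 /* the caller branches on the flag */
    case IN_REG:    return r.payload;
    case IN_MEMORY: {
        heap_channel *ch = (heap_channel *)r.payload;
        long v = ch->value;
        free(ch);
        return v;
    }
    default:        return -1;        /* NOT_YET: would suspend in the real scheme */
    }
}

int main(void) {
    printf("%ld %ld\n", caller(callee_fast(6)), caller(callee_stolen(7)));
    return 0;
}
```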

4.1.3 EFFICIENT SYNCHRONIZATION BY CODE DUPLICATION

IMPLEMENTATION Schematic generates two specialized versions of code from one continuation of a synchronization. One version shares its register assignment with the preceding code and is used when a synchronization succeeds immediately, whereas the other has its own assignment and is used when a synchronization succeeds after a failure. In the former case, the continuation is executed immediately without overheads such as register moves. In the latter case, a continuation closure is stored in the channel and another closure on the stack is popped and executed. (Our scheme has a potential problem of increasing code size; the current compiler reduces code size considerably by duplicating code selectively based on heuristics [5].)

COMPARISON In ordinary blocking synchronization implementations, only one version is generated from one continuation, and either the after-success part or the after-failure part jumps to this shared version. Either part pays the overhead of fitting itself into the single calling convention of the shared code. A similar code duplication can indeed be written in C, but the program becomes complicated and difficult to maintain, especially for nested synchronizations. The point to observe is that Schematic programmers have only to use touch to benefit from this sophisticated synchronization mechanism; the compiler and runtime are fully responsible for the low-level management.
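The following hand-written C sketch imitates what the duplicated code might look like at a touch site; all names (chan, cont_closure, pop_and_run) are ours, the scheduling-stack interaction is reduced to a stub, and the "register sharing" of the fast path is only suggested by the fall-through structure.

```c
#include <stdio.h>

typedef struct cont_closure {
    void (*code)(struct cont_closure *, long);
    long x;                                   /* live variable captured on failure */
} cont_closure;

typedef struct {
    int           full;
    long          value;
    cont_closure *waiter;     /* continuation parked here on a failed touch */
} chan;

static void pop_and_run(void) { /* would pop the next closure off the scheduling stack */ }

/* Slow version of the continuation: its own calling convention,
   live variables reloaded from the closure rather than from registers. */
static void cont_slow(cont_closure *c, long v) { printf("%ld\n", c->x + v); }

static cont_closure parked;    /* static for the sketch; the real closures live elsewhere */

static void touch_site(chan *r, long x) {
    if (r->full) {
        long v = r->value;
        /* Fast version: falls through, so x and v can stay wherever the
           preceding code already had them (e.g. in registers). */
        printf("%ld\n", x + v);
    } else {
        /* Failure: park a closure holding the live state in the channel,
           then switch to another task. */
        parked = (cont_closure){ .code = cont_slow, .x = x };
        r->waiter = &parked;
        pop_and_run();
    }
}

int main(void) {
    chan r = { .full = 1, .value = 7 };
    touch_site(&r, 35);        /* immediate success: fast path prints 42 */
    return 0;
}
```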

4.2 DESCRIPTION OF PROBLEMS

RNA The traversing function uses future to traverse child nodes irrespective of the tree depth, since a future call is very cheap. This is a remarkable contrast with the naive solution in C. When a child is traversed, a continuation, which may be stolen by an idle processor, is pushed onto the stack; this operation is all the task queue management there is. The difference compared with C is that a parent can communicate with its children in registers: no channel is created in memory for traversing a child, and the parent basically reads a register to know whether a child has completed its calculation.

Figure 1: EXECUTION TIME ON UNIPROCESSOR. (Normalized execution time of C and Schematic on RNA and CKY.)
Figure 2: EFFECT OF THE OPTIMIZATIONS. (Normalized execution time of RNA and CKY on 1 PE and 50 PEs, with and without synchronization on registers and with and without code duplication.)

CKY At the beginning of the program, a large number of threads, one to calculate each element, are created. Each matrix element is a channel which itself keeps the calculated value. To write an element, we send a value to the appropriate channel; to read an element, we receive a value from the channel. Threads are thus synchronized automatically by touching channels. Forming the matrix out of channels lets us describe the synchronization in CKY very easily and naturally. If a receive succeeds immediately, the stack is not manipulated during the synchronization and more efficient code, which uses a single register assignment before and after the synchronization, is executed; in this case the synchronization behaves like busy-waiting. If a receive fails, the processor switches to another task as in ordinary blocking implementations.

5 PERFORMANCE EVALUATION

The parallel machine used for the experiments is a Sun Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64) running Solaris 2.5.1. The C programs use the Solaris thread library, which provides user-level threads. Figure 1 shows the execution time on a uniprocessor. Schematic executes RNA and CKY about 2.8 times slower than C. The difference is expected to arise from the immature implementation and from inherent overheads such as tagging objects and checking the flag returned by function calls. Incorporating further optimizations in register allocation and closure conversion will make the difference much smaller.

Figure 3: SCALABILITY OF EACH SOLUTION. (Speedup versus number of PEs, up to about 60, for C(RNA), C(CKY), Schematic(RNA), and Schematic(CKY).) Speedups are with respect to a sequential execution of each program.

Figure 2 shows that the combination of the two optimizations results in about a four-fold speedup and is thus remarkably effective. Figure 3 shows that the speedups of the Schematic programs are comparable to those of C. In RNA, both C and Schematic exhibit scalable performance. In CKY, both hit a performance ceiling at about a 15-fold speedup. Some experiments in [6] give clues as to why the speedup of CKY levels off.

6 RELATED WORK

This research is much influenced by [1], which evaluates the "programmability" of the object-oriented language ICC++ in a case study on seven irregular applications. We share their spirit of demonstrating the usefulness of high-level parallel languages. [1] focuses mainly on computation granularity, namespace management, and low-level concurrency, while we chiefly discussed the synchronization mechanism and thread management based on Lazy Task Creation.

Concurrent logic languages such as Fleng [7] and non-strict dataflow languages such as Id [8] also create a large number of fine-grain threads. [7, 8] reduce the frequency of dynamic thread scheduling by merging multiple threads based on a static analysis of inter-thread dependencies. Schematic utilizes code duplication to achieve a similar effect; their techniques and ours can complement each other.

Cilk [9] is a multithread extension to C which also realizes a great many fine-grain thread creations at low cost. Cilk's synchronization specification is, however, much more restricted: only one pattern is allowed, in which a parent thread waits for the termination of its children. It is therefore much harder to write CKY in Cilk.

The compilation technique of Schematic for intra-node thread scheduling and for synchronization on registers is based on StackThreads [4]. Unlike Schematic, StackThreads is implemented on distributed memory machines, and programmers are fully responsible for load balancing.

7 CONCLUSION

We compared programming in a fine-grain multithread language with programming in C. The case study showed that our language Schematic can describe irregular applications more naturally and attain good performance. Some common programming practices used in low-level languages, such as the explicit management of fine-grain tasks, are realized by Schematic implicitly. Moreover, it incorporates some advanced optimizations that are troublesome to implement in low-level languages: (1) synchronization on registers and (2) a code duplication technique that combines the advantages of busy-wait and blocking synchronization. Schematic runs about 2.8 times slower than C in a sequential environment, and its speedup in a parallel environment is comparable to that of C.

References

[1] A. A. Chien, J. Dolby, B. Ganguly, V. Karamcheti, and X. Zhang. Evaluating High Level Parallel Programming Support for Irregular Applications in ICC++. In Proceedings of ISCOPE '97, volume 1343 of LNCS, pp. 33–40, 1997.

[2] K. Taura and A. Yonezawa. Schematic: A Concurrent Object-Oriented Extension to Scheme. In Proceedings of OBPDC '95, volume 1107 of LNCS, pp. 59–82, 1996.

[3] E. Mohr, D. A. Kranz, and R. Halstead Jr. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. IEEE Transactions on Parallel and Distributed Systems, 2(3):264–280, 1991.

[4] K. Taura and A. Yonezawa. Fine-grain Multithreading with Minimal Compiler Support: A Cost Effective Approach to Implementing Efficient Multithreading Languages. In Proceedings of PLDI '97, pp. 320–333, 1997.

[5] Y. Oyama, K. Taura, and A. Yonezawa. An Efficient Compilation Framework for Languages Based on a Concurrent Process Calculus. In Proceedings of Euro-Par '97, volume 1300 of LNCS, pp. 546–553, 1997.

[6] Y. Oyama, K. Taura, T. Endo, and A. Yonezawa. An Implementation and Performance Evaluation of Language with Fine-Grain Thread Creation on Shared Memory Parallel Computer. Technical report, University of Tokyo, 1998 (to appear).

[7] T. Araki and H. Tanaka. Static Granularity Optimization of a Committed-Choice Language Fleng. In Proceedings of Euro-Par '97, volume 1300 of LNCS, pp. 1191–1200, 1997.

[8] K. E. Schauser, D. E. Culler, and S. C. Goldstein. Separation Constraint Partitioning: A New Algorithm for Partitioning Non-Strict Programs into Sequential Threads. In Proceedings of POPL '95, pp. 259–272, 1995.

[9] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An Efficient Multithreaded Runtime System. In Proceedings of PPoPP '95, pp. 207–216, 1995.
