EXECUTING PARALLEL PROGRAMS WITH SYNCHRONIZATION BOTTLENECKS EFFICIENTLY

Y. OYAMA,[a] K. TAURA, A. YONEZAWA

Department of Information Science, Faculty of Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
E-mail: {oyama, tau, [email protected]}

We propose a scheme within which parallel programs with potential synchronization bottlenecks run efficiently. In straightforward implementations that use basic locking schemes, the execution time of the program parts with bottlenecks increases significantly as the number of processors increases. Our scheme makes the parallel performance of the bottleneck parts of programs close to the sequential performance, while maintaining the efficiency with which the nonbottleneck parts run. Experiments with a 64-processor SMP and a 128-processor DSM machine confirmed that parallel programs implemented with our scheme perform much better than parallel programs implemented with other widely-used locking schemes.

1 Introduction

This paper deals with concurrent invocations of methods affecting objects shared among multiple processors.[b] Execution schemes are generally classified into the following two kinds.

local-based execution: A method is (basically) executed by the processor that invokes the method.

owner-based execution: A method is (basically) executed by the owner of the object. Here we define the owner of an object as the processor that updated the object most recently.

Local-based execution is commonly used in systems that run on shared-memory multiprocessors and in systems that run on distributed-memory machines and may create replicas of objects. Owner-based execution is commonly used in systems running on distributed-memory machines and not creating replicas.

[a] Research Fellow of the Japan Society for the Promotion of Science.
[b] In this paper we use the term object in a wide sense, meaning a set of data protected by mechanisms maintaining its consistency. We use the term method to denote a sequence of operations that potentially reference or update an object. Our scheme can thus be applied not only to concurrent object-oriented languages but also to a wide range of parallel languages.

The reasons that local-based execution is usually a better choice when systems run on shared-memory machines have to do with both time-efficiency and space-efficiency:

time-efficiency: Local-based execution is the more suitable when an object is not updated frequently, because each processor can then execute methods efficiently using its cached copy of the object, and the data structures for representing computation do not have to be moved from one processor to another.

space-efficiency: Naive owner-based execution may result in excessive demands on memory, because processors can rapidly send repeated requests to an owner, resulting in the creation of a huge number of data structures containing the information needed for the execution of the requested methods. See the work on Cilk [1] for a theoretical background on space-efficiency.

Local-based execution, however, is not always the best choice. If an object is updated frequently by multiple processors, for example, local-based execution will be subject to serious slowdowns caused by overheads such as cache misses in reading the object. This work deals with programs in which a number of concurrent methods are invoked on one object at the same time and the time spent in the object dominates the overall execution time. Objects of this kind are called bottlenecks in this paper.[c] Owner-based execution can be advantageous with regard to bottlenecks because only one processor updates an object, minimizing the cost of referencing and updating the object. On the other hand, owner-based execution incurs the overhead of moving the data structures for representing computation from one processor to another.

[c] Elsewhere they are sometimes called hot spots.

This paper describes a novel scheme for executing parallel programs with potential synchronization bottlenecks efficiently. The major context of this research is the implementation of fine-grain multithread languages (e.g., concurrent object-oriented languages) on symmetric multiprocessors (SMPs) or distributed shared-memory (DSM) machines.

The total sequential execution time of a program with bottlenecks is the sum of the execution time of the bottleneck parts and that of the other parts. Parallelization is expected to shorten the latter but not the former. Consider a program in which 20 percent of the code must be executed sequentially. Since we can then decrease the execution time by at most a factor of five using multiple processors, we might expect the total execution time to decrease this much as we increase the number of processors.
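This is Amdahl's law; in symbols (notation ours, not the paper's), with sequential fraction $s$ and $p$ processors:

\[
S(p) = \frac{1}{s + (1-s)/p} \longrightarrow \frac{1}{s} \quad (p \to \infty),
\qquad s = 0.2 \;\Rightarrow\; S_{\max} = 5.
\]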

Actually, however, the execution time in bottlenecks increases dramatically and the overall performance is extremely low. We observed that the performance improved as the number of processors increased up to some point, but degraded steadily thereafter. See the speedup curves of a program with a naive implementation in Figs. 8, 9, and 10 (spin and block in the figures). The curves never become flat, and the execution time increases significantly when the number of processors is excessive. As will be exemplified in Sec. 2, a wide range of parallel programs have potential bottlenecks, and it is not rare for a program to run on too many processors.

Our goal is to make the total execution time in a parallel environment close to the time needed to sequentially execute only the bottleneck parts. Our scheme therefore significantly reduces the performance degradation due to bottlenecks while keeping the penalty paid in nonbottleneck parts very small. Note that our scheme deals well with an object that becomes a bottleneck only in some program phases and is found to be a bottleneck at runtime (i.e., an object that cannot be clearly determined to be a bottleneck when the program is being compiled). Our scheme is implemented on top of a compiler and a runtime system of a high-level language, so application programmers do not have to worry about its implementation details. To sum up, our scheme contributes to the reduction of programming effort and to higher performance, especially in programs where it is difficult to predict which objects are bottlenecks.

We implemented our scheme in the concurrent object-oriented language Schematic [2, 3, 4] and confirmed its effectiveness in experiments using a 64-processor SMP and a 128-processor DSM machine. The experiments compared our scheme with other object implementation techniques using simple locking algorithms and showed that our scheme makes the performance loss due to bottlenecks extremely small.

The remainder of the paper is organized as follows: Section 2 gives some examples of parallel programs with potential synchronization bottlenecks. Section 3 explains basic object implementation schemes and their problems, and Sec. 4 describes our scheme. Section 5 gives our experimental results, and Sec. 6 compares our work with related work. Section 7 concludes by briefly summarizing the paper and mentioning future work.

2 Examples of Bottlenecks

Our study, which gives faster execution in bottleneck objects, would be useless if programs with bottlenecks were rare. There are, however, many parallel programs containing bottlenecks that are difficult to eliminate. Some examples of bottlenecks are shown below.

MT-unsafe libraries: An MT-unsafe (MultiThread-unsafe) function may be executed incorrectly when it is called by multiple threads at the same time. It is sometimes difficult or troublesome to make MT-unsafe functions into MT-safe ones. The ad-hoc solution we sometimes adopt is serializing the calls to MT-unsafe functions with some locks (see the sketch after this list). This solution, however, may give us another source of bottlenecks.

I/O functions: Concurrent calls to I/O functions are often serialized explicitly even if the functions are MT-safe. For example, simultaneous calls to printf are mutually excluded so that the outputs to a console or a file are put in order. That kind of serialization tends to become a source of potential bottlenecks because I/O operations are generally very slow.

stub objects in distributed systems: Some programs use one object as the representative of a site and leave all intersite communication to that object. That kind of object can become a bottleneck.

shared global variables: Some programs collect statistical information by using shared global variables and some simple mutex mechanism to count the occurrences of interesting phenomena. If the phenomena occur frequently, the variables will become a bottleneck.

The essential solution for gaining high efficiency in these potential bottleneck parts is to revise the algorithm and enable multiple methods to be executed on those parts concurrently. It is, however, sometimes impractical to rewrite a program because of, say, high program development cost or binary distribution of MT-unsafe libraries.
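As an illustration of the ad-hoc serialization mentioned under MT-unsafe libraries, one might guard the unsafe call with a single global lock. This is only a sketch; mt_unsafe_fn is a hypothetical stand-in for a real library function.

    #include <pthread.h>

    /* Serialize all calls to a hypothetical MT-unsafe library function
       with one global lock; the lock itself can become a bottleneck. */
    static pthread_mutex_t lib_lock = PTHREAD_MUTEX_INITIALIZER;

    void safe_call(int arg) {
        pthread_mutex_lock(&lib_lock);
        mt_unsafe_fn(arg);              /* hypothetical MT-unsafe function */
        pthread_mutex_unlock(&lib_lock);
    }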

3 Naive Execution Schemes and their Problems

This section describes two straightforward schemes for implementing objects. One scheme gives local-based execution and is implemented with spin-locks. The other gives owner-based execution and is implemented with simple blocking locks. We describe a detailed implementation algorithm for both schemes and clarify their problems.

3.1 Spin-locks

The memory area for a spin-lock is added to each object. The area contains a flag to distinguish the following two states of an object: (1) a processor is executing a method of the object, or (2) no processor is executing a method of the object. Objects in the former state contain a locked flag and objects in the latter contain a free flag. In lock acquisition, each processor executes a busy loop trying to atomically change a free flag in the lock area into a locked one. If the processor succeeds in the update, it executes a method and then unlocks the spin-lock by writing a free flag in the area.

The implementation with spin-locks gives local-based execution, where different processors execute methods on an object one after another by themselves. Since different processors can update an object, the cache miss penalty in object references can become very large for frequently-updated objects. This scheme has further disadvantages. Busy-waiting wastes CPU time and memory bandwidth. Processors in a multiprogramming environment are forced to spin for a very long time without doing any useful work while a lock-holding OS process[d] is descheduled. This phenomenon is widely known as convoying. Finally, a spin-lock is an impractical alternative in programs using thread systems without fair scheduling and preemption, because busy-waiting may incur a deadlock.
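For concreteness, a minimal spin-lock of this kind might look as follows in C11 atomics (a sketch with our naming, not the paper's implementation; it also applies the test-and-test-and-set refinement described in Sec. 5.1):

    #include <stdatomic.h>

    typedef struct { atomic_int flag; } spinlock_t;   /* 0 = free, 1 = locked */

    static void spin_acquire(spinlock_t *l) {
        for (;;) {
            /* read first, so the busy loop mostly hits the local cache */
            while (atomic_load_explicit(&l->flag, memory_order_relaxed) != 0)
                ;
            /* atomic test-and-set: try to turn the free flag into a locked one */
            if (atomic_exchange_explicit(&l->flag, 1, memory_order_acquire) == 0)
                return;
        }
    }

    static void spin_release(spinlock_t *l) {
        atomic_store_explicit(&l->flag, 0, memory_order_release);
    }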

3.2 Simple Blocking Locks

Figures 1 and 2 show the object representation and the method execution algorithm of the naive implementation using simple blocking locks. A spin-lock, a flag, and a queue are associated with each object. A flag area keeps either a locked flag or a free flag; a locked flag indicates that a method on the object is being executed.

Figure 1: Object layout in the scheme using simple blocking locks. A flag area holds the flag indicating whether or not a method of the object is being executed. A head area and a tail area hold the reference to the queue storing contexts associated with the object. Simultaneous operations on a queue and a flag are mutually excluded with the auxiliary spin-lock using a spin area. The fields v1, ..., vn represent object data (also known as instance variables).

[d] In this paper an OS process is a process controlled by the operating system.

void method(lock, a1, ..., am) {
  if (lock->flag == FREE) {
    acquire_spinlock(lock->spin);
    if (lock->flag == FREE) {
      lock->flag = LOCKED;
      release_spinlock(lock->spin);
      BODY(a1, ..., am);
      release_blockinglock(lock);
      return;
    } else {
      release_spinlock(lock->spin);   /* unlock temporarily */
    }
  }
  cont = make_context(a1, ..., am);
  acquire_spinlock(lock->spin);       /* reacquire spin-lock */
  if (lock->flag == LOCKED) {         /* the object is still locked */
    enqueue(cont, lock);
    release_spinlock(lock->spin);
  } else {                            /* the flag changed to FREE during make_context */
    lock->flag = LOCKED;
    release_spinlock(lock->spin);
    BODY'(cont);                      /* execute the context immediately without enqueueing it */
    release_blockinglock(lock);
  }
}

inline void release_blockinglock(lock) {
  while (1) {
    acquire_spinlock(lock->spin);
    if (is_queue_empty(lock)) {       /* the queue is empty */
      lock->flag = FREE;
      release_spinlock(lock->spin);
      return;
    }
    cont = dequeue(lock);             /* the queue has contexts */
    release_spinlock(lock->spin);
    BODY'(cont);
  }
}

Figure 2: Method execution algorithm using simple blocking locks. BODY represents the function call to a method or the code of a method body inline-expanded at that point. BODY' is the same as BODY except that it reads context data at the beginning. The definitions of enqueue, dequeue, and is_queue_empty are omitted because they are straightforward.

A free flag indicates that no method on the object is being executed. A queue area contains the reference to a queue managing data structures that hold the information necessary for executing a method (typically a method ID and the arguments of the method). In the rest of this paper a data structure of this kind is called a context. A flag area and a queue area are updated exclusively with an auxiliary spin-lock; under contention, processors must wait their turn. For simplicity, the following description omits the spin-locking/unlocking operations, the details of which are clear in Fig. 2.

Each processor executes a method as follows. First it reads the flag associated with an object. If this read fetches a free flag, the processor writes a locked flag in the lock area and executes the method. A processor executing a method of an object is called the owner of the object. If the read fetches a locked flag, the processor creates a context for the method and puts it into the queue in the queue area of the object. After a processor finishes executing a method of an object, it checks the queue in the object. If the queue is empty, the processor writes a free flag in the flag area. If the queue is not empty, the processor picks a context out of the queue and executes the method represented by that context.

This algorithm gives local-based execution in nonbottleneck parts of the program and owner-based execution in bottlenecks. Specifically, in bottlenecks, only one processor (the owner) is likely to execute multiple methods represented by contexts consecutively. Hence the overhead of cache misses in reading an object is smaller in this scheme. The following overheads, however, make the owner's execution slower and the overall performance worse.

1. An owner cannot dequeue a context while other processors are manipulating the flag and the queue (recall that their simultaneous operation is excluded with a spin-lock). Put differently, an owner must wait its turn to obtain exclusive access to the queue. The waiting time will be large because execution by nonowners tends to be slow, owing to the cache misses that occur during attempts to manipulate a flag area, a queue area, and a spin-lock area.

2. An owner must pay a penalty to acquire and release a spin-lock every time it manipulates the flag or the queue. This penalty is a result of cache misses during attempts to reference and update the lock area, because multiple processors manipulate the lock area.

3. Cache misses occur when an owner reads context data, because the contexts were created by different processors.

These overheads must be minimized, because the penalty paid by an owner directly affects the overall performance (i.e., the critical path) in many programs. Note that the problems described above are common to owner-based execution with a naive implementation and are not limited to programs using simple blocking locks. Problems 1 and 2 are general problems associated with mutual exclusion for operating a "task queue," and problem 3 is the general loss of memory locality due to the migration of computation.

Figure 3: Object representation in our scheme. The three diagrams show an object in the (a) free, (b) locked, and (c) conflict states; diagram (c) shows the object with three contexts in its list. The strings "free" and "locked" represent the free and locked flags, respectively. The slots v1, ..., vn represent object data. The slots a1, ..., an serve to exchange the values necessary for the execution of a method.

4 Our Scheme

4.1 Basic Framework

Algorithm. A one-word lock area is attached to each object (Fig. 3), and an object is in one of three states: free, locked, or conflict. Objects in the free state have no method being executed. Objects in the locked state have a method being executed and no contexts to be executed. Objects in the conflict state have a method being executed and some contexts to be executed. Objects are created in the free state. Objects in the free and locked states respectively hold in their lock areas a free flag and a locked flag. Both flags can be distinguished from references to a context. An object in the conflict state has in its lock area the reference to a list of contexts to be executed.

Our method execution algorithm is shown in pseudo-code in Fig. 4. The algorithm assumes that the target machine has atomic compare-and-swap and swap instructions, which seem to be available on most modern architectures.[e]

void method(lock, a1, ..., am) {
  if (compare&swap(lock, FREE, LOCKED)) {   /* the object was locked successfully */
    BODY(a1, ..., am);
    release_lock(lock);
  } else {
    cont = make_context(a1, ..., am);
    if (insert(cont, lock) == NOT_INSERTED) {
      BODY'(cont);          /* execute the context immediately without inserting it */
      release_lock(lock);
    }
  }
}

inline void release_lock(lock) {
  while (1) {
    if (compare&swap(lock, LOCKED, FREE))
      return;               /* the object was unlocked successfully */
    /* contexts exist */
    cont = LOCKED;
    swap(lock, cont);       /* detach the list of contexts and store it in cont */
    /* one processor executes the multiple contexts in turn */
    while (cont != NULL) {
      BODY'(cont);
      cont = cont->next;
    }
  }
}

int insert(cont, lock) {
  while (1) {               /* repeat until a compare&swap succeeds */
    v = *lock;              /* read the lock area */
    switch (v) {
    case FREE:
      if (compare&swap(lock, v, LOCKED)) return NOT_INSERTED;
      break;
    case LOCKED:
      cont->next = NULL;    /* make a singleton list */
      if (compare&swap(lock, v, cont)) return INSERTED;
      break;
    default:                /* contexts exist */
      cont->next = v;
      if (compare&swap(lock, v, cont)) return INSERTED;
    }
    /* reached when a compare&swap failed */
  }
}

Figure 4: Our method execution algorithm. compare&swap(address, old, new) writes the value of the variable new into address and returns "true" if the value pointed to by address is equal to old; otherwise it does not update any memory and returns "false." swap(address, var) writes the value of the variable var into address and assigns the value originally pointed to by address to var. Both instructions execute their sequence of operations atomically.
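As one possible realization (ours, not the paper's): on a machine exposing these primitives through C11 atomics, the two instructions of Fig. 4 could be expressed as follows.

    #include <stdatomic.h>
    #include <stdbool.h>

    /* compare&swap(address, old, new): atomically write new only if
       *address == old; report whether the write happened. */
    static bool cas_word(_Atomic(void *) *address, void *old, void *new_val) {
        return atomic_compare_exchange_strong(address, &old, new_val);
    }

    /* swap(address, var): atomically exchange *address with var and
       return the value originally stored at address. */
    static void *swap_word(_Atomic(void *) *address, void *var) {
        return atomic_exchange(address, var);
    }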

A processor attempting to execute a method first issues a compare-and-swap instruction that writes a locked flag in the lock area atomically only if the lock area has a free flag. If this attempt is successful, the processor (now the lock-holder of the object) starts to execute the method. If the attempt is unsuccessful, the processor creates a context for the method (called C below) and inserts it into the list on the object. To insert a context, a processor performs one of the following operations according to the value in the lock area: (1) If the lock area contains a free flag, the processor does not insert the context C but itself becomes the new owner; it locks the object by writing a locked flag in the lock area with a compare-and-swap instruction and executes the method represented by C. (2) If the lock area contains a locked flag, the processor writes the reference to C in the lock area with a compare-and-swap instruction; the context C then forms a singleton list of contexts. (3) If the lock area points to a list of contexts, the processor puts C at the head of the list: the processor first makes C point to the context currently at the head and then atomically writes the reference to C in the lock area with a compare-and-swap instruction (Fig. 5).

Figure 5: Inserting a context into a list.

The processor that finishes executing a method issues a compare-and-swap instruction that writes a free flag in the lock area atomically only if the lock area has a locked flag. If the instruction succeeds, the object has been unlocked. If the instruction fails, the lock area points to a list of contexts, and the processor detaches the whole list of contexts from the object by using a swap instruction (Fig. 6). It then executes the detached contexts in "last-inserted-first-executed" order, without any mutex operations to manipulate the list.

[e] On hardware that lacks these instructions they need to be simulated. On the R10000, for example, they are simulated by combining load-linked and store-conditional instructions. Our algorithm can be used as is in that case. But if a simulated swap instruction may often be disturbed by other processors and cannot complete its operation within a finite number of instructions, a priority mechanism giving higher priority to the owner's swap operations may need to be implemented.

Figure 6: Detaching the whole list of contexts from an object.

The algorithm uses an atomic compare-and-swap to insert a context, whereas it uses an atomic swap to detach contexts. If multiple processors attempt to insert a context at the same time, only one insertion succeeds and the other processors try again. A detachment operation, on the other hand, always succeeds within a constant number of instructions. This asymmetry gives the owner higher priority in manipulating the lock area.

Advantages. Our scheme has the following advantages.

1. An owner can always detach the list of contexts from an object in a constant number of instruction steps and is never put into competition with other processors.

2. Once an owner detaches a list of contexts from an object, it performs no synchronization for the manipulation of the list.

The former advantage reduces the time needed for one synchronization operation, and the latter reduces the number of synchronization operations. As a result, the owner's execution can be made much faster and the critical path of a program may also be shortened. From a different point of view, detaching a list of contexts can be regarded as a thread-scheduling heuristic for efficient execution in bottleneck objects, because the operation ensures that a processor is always assigned to one of the blocked methods represented by the contexts. Another potential advantage is that any OS process can always proceed with its work (insertion or detachment) even when other OS processes are descheduled. In the two straightforward schemes described in Sec. 3, processes cannot go beyond a spin-lock acquisition point while a spin-lock holder is descheduled.

Limitations and Simple Improvements. Our algorithm, in contrast to a queue-based implementation, manages contexts in LIFO order. It must be modified if the order of method invocations needs to be preserved in method execution. One simple extension that ensures FIFO scheduling is to reverse the list of contexts just after it is detached (see the sketch below); this, however, adds overhead. Note that our scheme achieves high performance in exchange for giving up any guarantee of fair scheduling.

A potential problem with our scheme is that a large amount of memory may be allocated for a long list of contexts. As mentioned in Sec. 1, owner-based schemes can in general use much more memory than local-based ones. Our scheme should be improved to let programs that in the current implementation create a long list of contexts run with less memory. One way to do this would be to forbid non-owners to lengthen a list beyond some threshold and to make them busy-wait until the too-long list is detached from the object.
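A sketch of the FIFO extension just mentioned, in the style of Fig. 4 (context_t and its next field are our names for the context structure):

    /* Reverse the detached LIFO list so the oldest context runs first. */
    context_t *reverse_contexts(context_t *cont) {
        context_t *prev = NULL;
        while (cont != NULL) {
            context_t *next = cont->next;
            cont->next = prev;
            prev = cont;
            cont = next;
        }
        return prev;    /* head of the reversed (FIFO) list */
    }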

4.2 Optimization during the Execution of Detached Contexts

Prefetching Context Data. Prefetch instructions are provided by some modern processors. They can be used to read memory data and load it into the cache, and a programmer expects the prefetch to proceed in parallel with the subsequent instructions. Our scheme prefetches the contents of the context that will be executed next. In the code in Fig. 4, we insert code that prefetches the context pointed to by cont->next just above BODY'(cont) in release_lock. By doing this we expect that the context data executed next is loaded into the cache during the current method execution.

Some conditions must be met for this improvement to be effective. One is that the execution time of a context is greater than the time required to prefetch the next context. Another is that the current method does not overwrite the prefetched cache lines. If these conditions (and some others) are met, we can eliminate the cache-miss overhead in reading contexts.
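Concretely, the loop in release_lock might become the following (a sketch; __builtin_prefetch is the GCC/Clang intrinsic, one way a prefetch instruction can be emitted):

    while (cont != NULL) {
        if (cont->next != NULL)
            __builtin_prefetch(cont->next);   /* warm the cache for the next context */
        BODY'(cont);
        cont = cont->next;
    }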

Assigning Object Data to Registers. While detached contexts are executed, (some or all of the) object data is "cached" in registers (Fig. 7). This optimization turns references to and updates of (cache) memory into references to and updates of registers. In the code in Fig. 4, just after the contexts are detached (just below the swap in release_lock), the object data is loaded into registers collectively in advance. The object data, which may be updated during the execution of a context, is passed to the next context through registers. After all the detached contexts have been executed (just below the while loop in release_lock), the final object data is written back into the object.

Figure 7: Object data is assigned to registers during the execution of detached contexts; only the collective load at the start and the write-back at the end access memory.
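A sketch of the transformation, with two representative fields v1 and v2 (the names and the modified BODY' signature are ours):

    /* Cache object data in locals (registers) across the detached contexts. */
    int r1 = obj->v1, r2 = obj->v2;     /* collective load, just after the swap */
    while (cont != NULL) {
        BODY'(cont, &r1, &r2);          /* the method body reads/writes r1, r2 */
        cont = cont->next;
    }
    obj->v1 = r1; obj->v2 = r2;         /* write the final values back */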

4.3 Comparison with Program Description in Low-Level Languages

Our scheme frees programmers from any need to worry about which objects become bottlenecks. Our runtime system implicitly performs optimized owner-based execution in bottleneck parts, whereas, in nonbottleneck parts, each processor executes methods by itself efficiently, as in local-based schemes.

Now consider the approach of describing applications with potential bottlenecks in a low-level language such as C. One of the most promising programming practices is to predict the degree of contention for each object, and to use local-based execution for nonbottleneck objects and owner-based execution for bottleneck objects. Unfortunately, describing owner-based execution in a low-level language is tedious: a programmer has to explicitly define the data structure representing a context for each object and each method, because the information stored in a context depends on the application. What is worse, bottlenecks cannot always be identified statically: whether or not an object becomes a bottleneck depends on the number of processors and on runtime parameters. A program becomes extremely complicated if the programmer prepares for all possibilities.

5 Performance Evaluation

5.1 Experiments with Schematic Programs

To evaluate the effectiveness of our scheme, we implemented it and the other, straightforward schemes in the concurrent object-oriented language Schematic [2, 3, 4] and evaluated their performance experimentally on an SMP and a DSM machine: an Ultra Enterprise 10000 (UltraSPARC 250 MHz × 64, Solaris 2.6) and an Origin 2000 (R10000 195 MHz × 128, IRIX 6.5). We used the following application programs.

raytrace: A raytracing program that calculates the RGB value of each pixel in parallel. This is a typical application that has a number of independent tasks executable in parallel but has a potential bottleneck object associated with I/O operations. Every time a processor calculates one RGB value, it sends the value to the file object shared among processors by invoking exclusive methods.[f]

RNA: A program used to predict the secondary structure of a protein. It essentially solves a tree-traversal problem with pruning. A counter object is created to count the number of traversed nodes. Each processor invokes an increment method on the counter object every time it traverses one node.

counter: A tiny counter program, an extreme example for assessing performance in bottlenecks. It creates one counter object containing ten integers and repeatedly invokes a method that increments all the integers by one. The number of methods invoked is constant regardless of the number of processors. An increment method performs dummy computation, which makes the object a bottleneck.

[f] Another application reads the file concurrently and paints a window using the values of the pixels already sent to the file object.

The execution time on the Ultra Enterprise 10000 is shown in Fig. 8, and the execution time on the Origin 2000 is shown in Fig. 9. We compared several algorithms: spin-locks (spin in the figures), simple blocking locks (block), spin-blocks (sp. bl.), our scheme without the compile-time optimizations described in Sec. 4.2 (detach), and our scheme with those optimizations (reg. + pref.). Spin-blocks are almost the same as simple blocking locks except that a processor trying to obtain a lock does not branch to the creation of a context until it has read the lock area one hundred times. Spin-locks use a test-and-test-and-set algorithm[g] to avoid wasting memory bandwidth.

[g] Before a processor tries to acquire the lock with an atomic test-and-set operation, the algorithm checks, with a nonatomic operation, whether or not the processor can acquire the lock.
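The spin-block policy can be sketched as a small addition to the fast path of Fig. 2 (the bound of 100 reads follows the description above):

    /* Spin-block: poll the flag up to 100 times before blocking. */
    for (i = 0; i < 100; i++) {
        if (lock->flag == FREE)
            break;          /* retry the locking path of Fig. 2 */
    }
    /* if still locked after 100 reads, create a context and enqueue it as in Fig. 2 */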

Figure 8: Measured execution time of Schematic programs (Ultra Enterprise 10000). Three plots (raytrace, RNA, counter) show execution time (msec) versus the number of processors for no bot., spin, block, sp. bl., detach, and reg.+pref.

Figure 9: Measured execution time of Schematic programs (Origin 2000). Three plots (raytrace, RNA, counter) show execution time (msec) versus the number of processors for no bot., spin, block, sp. bl., detach, and reg.

Since the prefetch optimization has not yet been implemented on the Origin 2000, the execution time shown in Fig. 9 was obtained with only the other optimization, caching object data in registers (reg.). We also show the execution time of the program that omits the method invocations to the bottleneck (no bot.). The curves of no bot. are useful for estimating the time spent executing the nonbottleneck parts of the program. In all experiments, dummy space was inserted between the different areas of an object in order to avoid false sharing: we allocated distinct cache lines for a lock area, a flag area, a queue area, and an object data area.

With all schemes except ours, the execution time for counter and RNA increased significantly as the number of processors increased, on both the Ultra Enterprise 10000 and the Origin 2000. We found simple blocking locks useless for efficient execution of bottleneck parts. Our scheme kept the performance degradation smaller than the other schemes did.

The raytrace performance difference between our scheme and the others was relatively small, and we investigated the reason. As is obvious from the curve of no bot., the execution time in the bottleneck (an abstract file object) dominates the whole execution time. More detailed experiments revealed that a long queue (or list) of contexts was created for the file object but that the manipulation of contexts seldom conflicted. Raytrace invokes methods on the bottleneck object less frequently than the other application programs tested, because the nonbottleneck parts, such as floating-point arithmetic, take a relatively long time. To summarize, we found that simple blocking locks gave good performance when a program has an object whose execution time dominates the total execution time but methods on that object are invoked infrequently. In that case, spin-locks gave the worst raytrace performance among the compared schemes.

The effect of the optimizations described in Sec. 4.2 was too small to be recognized clearly. We think that prefetch operations are implemented prematurely in Schematic, and we are now investigating a sophisticated way to insert prefetch instructions effectively. We could scarcely identify the effect of the optimization assigning object data to registers. We think this was because the difference in access latency between registers and the L1 cache was very small on both parallel machines. The optimization will be more effective when the target hardware has a slower L1 cache or when the cost of a reference to an object is high because of runtime checks or multiple indirect pointers.

Figure 10: Measured execution time of C programs (Ultra Enterprise 10000). The counter plot shows execution time (msec) versus the number of processors for spin, getone, block, detach, sp. bl., reg.+pref., and block (det.).

5.2 Experiments with C Programs

The superiority of our scheme to conventional schemes was also confirmed in experiments with C programs. Figure 10 shows the performance of a counter program written in C; the program uses the Solaris thread library. We implemented two additional locking algorithms not present in the Schematic programs. One uses the same data structure as simple blocking locks but detaches all contexts from an object and executes them without manipulating the auxiliary spin-lock (block (det.)). The other uses the same data structure as our scheme, but the owner picks out only one context at a time using a compare-and-swap (getone; see the sketch below).

We got results similar to the ones we got with the programs written in Schematic: our scheme clearly outperformed the others. Notably, we could observe the performance improvement due to the two compile-time optimizations: prefetching contexts achieved about 15% speedup at maximum, and assigning object data to registers gave about 5% speedup at maximum. In the counter program implemented with simple blocking locks, the waiting overhead for the auxiliary spin-locks was by far the largest. On 60 processors, the waiting overhead occupied 69% of the owner's execution time in the program block and 49% of the owner's execution time in the program block (det.).

Finally, we report the execution time on a uniprocessor. Since every lock operation succeeds immediately under that condition, the time indicates the performance of uncontended lock acquisition and release. This experiment used a different version of the counter program, one in which the dummy computation in a method body was eliminated and each processor invoked a larger number of methods. The program with spin-locks, the one with simple blocking locks, and the one with our scheme took 641, 1025, and 810 milliseconds, respectively, to execute. The penalty paid in nonbottleneck objects in our scheme thus turned out to be smaller than that paid with simple blocking locks.
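For reference, the getone variant mentioned above might look as follows in the pseudocode style of Fig. 4 (a sketch; the structure names and casts are ours):

    /* getone: pop a single context with compare&swap instead of
       detaching the whole list with swap. */
    context_t *get_one(lock) {
        while (1) {
            v = *lock;                    /* read the lock area */
            if (v == FREE || v == LOCKED)
                return NULL;              /* no blocked contexts */
            next = ((context_t *)v)->next;
            /* if this was the last context, the object stays locked */
            if (compare&swap(lock, v, next != NULL ? next : LOCKED))
                return (context_t *)v;
            /* reached when the compare&swap failed; retry */
        }
    }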

6 Related Work

There is a large literature on concurrent objects on which plenty of methods are invoked in parallel. Most previous work [5, 6, 7, 4, 8], however, focuses on exposing parallelism between nonexclusive methods and pays little attention to the performance degradation in contended objects. Our study deals with exclusive methods and is orthogonal to theirs.

Adaptive replication [9] is a program transformation technique for adaptively replicating synchronization bottlenecks. The transformed program detects the objects on which multiple processors perform updates at the same time, and the program then creates replicas of those objects accordingly. Their compiler finds the set of operations that can be performed on replicas, called replicatable operations. Replicatable operations are essentially the ones whose contributions to replicas can be combined to yield the same final result as serially accumulating the contributions into the original object. The synchronization bottlenecks in RNA and counter can be eliminated with their technique because the exclusive update operations in those applications are replicatable. On the other hand, their technique is inapplicable to raytrace because the exclusive operation on the bottleneck has a side effect and hence is determined to be non-replicatable. Our technique is complementary to theirs because it can reduce the performance degradation due to conflicts of non-replicatable operations.

The lock coarsening technique [10] statically detects computations that repeatedly release and reacquire the same lock and applies program transformations that eliminate the intermediate release and acquire operations. Both our technique and theirs fuse multiple critical sections into one, thus reducing the frequency of locking operations. While their static technique deals with consecutive critical sections within one thread, our dynamic technique deals with the consecutive execution of critical sections caused by concurrent invocations on a synchronization bottleneck.

Stack swapping [11] by Toyama et al. addresses the low locality of memory references in non-owners' local-based method execution, in the context of implementing a concurrent language on DSM computers. The stack-swapping scheme executes the methods whose "size" is larger than some

threshold in an owner-based way, keeping the space-efficiency of local-based execution. The goal of their work is similar to ours: to reduce the performance degradation resulting from moving the data structures that represent computation from one processor to another. We focus on the case in which computation moves as a result of contention on bottleneck objects, whereas they focus primarily on computation movement triggered by a method that reads from and writes to a remote memory for a long time. Their scheme and ours are complementary to each other.

MCS lock [12] amortizes the performance loss in "hot" spin-locks, where concurrent operations conflict very frequently. The main idea of the technique is to give each processor a private spin area and to notify a release event to exactly one processor. Like our scheme, the technique is intended to maintain efficiency for programs executed by too many processors. Unlike our scheme, the MCS lock is essentially a spin-lock, and it does not solve the problem of low locality in object manipulation, which is common to local-based execution.

The bimodal object-locking algorithm [13] by Onodera et al. provides an efficient implementation of Java monitors. Their scheme is both time-efficient and space-efficient, especially when concurrent method invocations do not conflict. Our work was much influenced by this algorithm and its predecessor Thin Locks [14]: both of those schemes and ours provide space-efficiency by using one word (or less) to represent a lock, store a flag in the lock area when there is no contention and a reference to larger data structures otherwise, and update the lock area using a compare-and-swap instruction. The bimodal object-locking algorithm and Thin Locks, however, do not take into consideration the locality of memory references during method execution.

Meta-lock [15] is a very clever technique for implementing Java monitors. It eliminates many of the problems that occur when Thin Locks are used. One of the most remarkable characteristics of their algorithm is that it does not introduce busy-waiting, which is very important when many processors can contend for the same lock. Though the meta-lock achieves ultrafast synchronization in uncontended objects, it does not in contended ones. In particular, if we use their scheme straightforwardly, the multiple blocked threads chained to a bottleneck object may be executed by different processors rather than by one processor. This is undesirable because it cannot enjoy high cache locality. Moreover, waking up a blocked thread involves the relatively heavy operations of mutexes and condition variables. Note that it is not easy to incorporate owner-based execution like ours into Java monitors, because Java has a much more complicated synchronization model than the one we used in this paper.


7 Conclusion and Future Work

We have developed a scheme for efficiently executing parallel programs with potential synchronization bottlenecks. It makes the parallel execution time of the entire program very close to the time necessary to execute only the bottleneck parts sequentially. More concretely, we have developed two runtime implementation techniques and two compile-time optimizations. One of the implementation techniques detaches the whole list of contexts from an object, and the other gives higher priority to the owner of an object by using compare-and-swap. One optimization prefetches context data; the other assigns object data to registers. Measuring the performance of programs with synchronization bottlenecks running on an SMP and a DSM machine, we confirmed that our scheme keeps the execution-time increase due to parallel processing very small, even where programs with spin-locks or simple blocking locks get much slower as the number of processors increases. Furthermore, our scheme also gives remarkably fast execution in programs where concurrent invocations of methods never conflict.

Finally, we offer some directions for further work. One is to investigate the reason for the slight slowdown that was more conspicuous on the Origin 2000. Another is to develop a system that controls the number of processors automatically according to the execution status. The potential problem of using a large amount of memory, explained in Sec. 4.1, should also be addressed. As already discussed, one simple extension would switch to local-based execution dynamically, even in bottlenecks, when the memory space used for contexts exceeds some threshold. Devising an algorithm for this switching, however, is not a trivial task. If this potential problem of memory use is eliminated, we will be able to enjoy a convenient programming environment in which we need not worry about either the slowdowns or the large memory usage caused by synchronization bottlenecks.

Acknowledgments

Kenichi Hagihara and Noriyuki Fujimoto provided very helpful comments on our work, and our laboratory associates made many useful suggestions.

References

1. Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, 1994.

2. Yoshihiro Oyama, Kenjiro Taura, Toshio Endo, and Akinori Yonezawa. An Implementation and Performance Evaluation of a Language with Fine-Grain Thread Creation on a Shared Memory Parallel Computer. In Proceedings of the 1998 International Conference on Parallel and Distributed Computing and Systems (PDCS '98), pages 672-675, 1998.

3. Yoshihiro Oyama, Kenjiro Taura, and Akinori Yonezawa. An Efficient Compilation Framework for Languages Based on a Concurrent Process Calculus. In Proceedings of Euro-Par '97 Parallel Processing, volume 1300 of Lecture Notes in Computer Science, pages 546-553, 1997.

4. Kenjiro Taura and Akinori Yonezawa. Schematic: A Concurrent Object-Oriented Extension to Scheme. In Proceedings of the Workshop on Object-Based Parallel and Distributed Computation (OBPDC '95), volume 1107 of Lecture Notes in Computer Science, pages 59-82, 1996.

5. Greg Barnes. A Method for Implementing Lock-Free Shared Data Structures. In Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '93), pages 261-270, 1993.

6. Andrew A. Chien. Concurrent Aggregates (CA). The MIT Press, 1991.

7. Andrew A. Chien, Udey Reddy, John Plevyak, and Julian Dolby. ICC++: A C++ Dialect for High Performance Parallel Computing. In Proceedings of the Second JSSST International Symposium on Object Technologies for Advanced Software (ISOTAS '96), volume 1049 of Lecture Notes in Computer Science, pages 76-95, 1996.

8. Masahiro Yasugi, Shigeyuki Eguchi, and Kazuo Taki. Eliminating Bottlenecks on Parallel Systems using Adaptive Objects. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT '98), pages 80-87, 1998.

9. Martin C. Rinard and Pedro C. Diniz. Eliminating Synchronization Bottlenecks in Object-Based Programs Using Adaptive Replication. In Proceedings of the 1999 ACM International Conference on Supercomputing (ICS '99), pages 83-92, 1999.

10. Pedro C. Diniz and Martin C. Rinard. Synchronization Transformations for Parallel Computing. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '97), pages 187-200, 1997.

11. Sumio Toyama, Yoshihiro Oyama, Kenjiro Taura, and Akinori Yonezawa. Enabling Explicit Task Placement on Lazy Task Creation. Information Processing Society of Japan Transactions on Programming, 40(SIG1 (PRO2)):1-12, 1999. (In Japanese.)

12. John M. Mellor-Crummey and Michael L. Scott. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, 1991.

13. Tamiya Onodera and Kiyokuni Kawachiya. A Study of Locking Objects with Bimodal Fields. In Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '99), 1999.

14. David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin Locks: Featherweight Synchronization for Java. In Proceedings of the 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '98), pages 258-268, 1998.

15. Ole Agesen, David Detlefs, Alex Garthwaite, Ross Knippel, Y. S. Ramakrishna, and Derek White. An Efficient Meta-lock for Implementing Ubiquitous Synchronization. In Proceedings of the 14th Annual ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA '99), 1999.
