Compilation Techniques for Fair Execution of Shared Memory Parallel Programs over a Network of Workstations

Yosi Ben-Asher, Haifa University, Haifa, Israel.
[email protected]
Reinhard Wilhelm, Universität des Saarlandes, Germany.
[email protected]

Abstract
Compiler technologies are crucial for the efficient execution of sequential programs. This is not yet true for parallel programs, where the operating system performs most of the work, resulting in increased overhead for scheduling and distributed shared memory simulation. In this work we suggest simple compilation techniques that can be used to guarantee efficient execution of shared memory parallel programs. In particular, we address the difficulties involved in supporting preemption of activities in compiled code, which is crucial for fair execution of shared memory programs. The main novelty of this work is a comprehensive approach to compilation, where efficiency is guaranteed whenever preemption is not really needed. The compiler is used to insert explicit context-switch instructions, so that preemption is supported only where required. In addition, our compilation scheme guarantees that the cactus stack is embedded in the regular stack of each machine, such that most of the activities are spawned as regular function calls or loop iterations. Finally, the virtual address space is partitioned between the different machines, and shared memory is supported by inserting "special" code before "heavy" loops. This code samples the iteration space of these heavy loops before they are executed, so that most of the memory pages used by the loops can be fetched in advance. The number of page-faults occurring as a result of the DSM simulation is thus reduced. The virtual memory mechanism is also used to reclaim the unused frames ("holes") that are left on the stack. The page-table in each machine is managed using special assembler instructions inserted by the compiler.
Index terms: compilation techniques, the ParC language, parallel programming, shared memory.
1 Introduction

We consider the problem of designing a compiler for shared memory parallel programs. The target architecture is a cluster of workstations/PCs (NOW) connected by a fast network allowing the machines to exchange messages. Today's communication technology already supports high rates of communication. For instance, with the Myrinet switch [1] a set of 8 machines can communicate at a rate of 1Gb per second. This technology gives rise to the hope that shared memory programs can be efficiently executed on low cost clusters. Hence, we wish to support a strong model of parallel programming, including: asynchronous execution, shared memory, dynamic spawning of processes, and a distinction between local/external memory references. ParC [5] is a simple extension of C which has these features, and is therefore used as the source language for our compiler. However, many other languages like Split-C [9], Cilk [23], Linda [2], Swarm [24], Pc [7], and Concurrent C [8] can also be realized using the proposed paradigm. In general, we wish to compile a parallel program into object code that will be executed concurrently by every machine in the cluster. Existing systems like MAXI, PVM, and MPI are characterized by the fact that they work in an operating system mode, so that all the decisions are made during execution in an "on-line" mode. Using a compiler for parallel programs means that most of the decisions are made "off-line" before the execution begins. It is not surprising that most of the existing systems for parallel processing work in
an operating system mode, as making optimal choices during run-time is easier than doing so at compilation time. An advantage of compilers over operating systems is that the low level code produced as output can handle the machine resources more efficiently. For example, the stack pointer in a context-switch operation can be changed directly, as opposed to the use of system calls in the "on-line" approach. Our goal is to develop a comprehensive method for compilation of parallel programs, in which all the basic compilation techniques for shared memory parallel programs have been addressed. We identify and define the set of analyses that are essential to optimize the code produced by the above compiler and show how they can be used by it. However, the data-flow analysis that is needed to realize some of these optimizations (like range analysis for arrays or identifying heavy loops) is beyond the scope of this article, and is a subject for future research. If successful, our techniques can be used to implement a compiler for parallel execution of shared memory programs on any NOW. We look for the parallel equivalents of the known compilation techniques for sequential programs (e.g., using stacks for realizing function calls). Following is a list of basic requirements that should be addressed by the compiler:

Activity creation and termination. Usually, the number of activities (threads) generated during execution of parallel programs is significantly larger than the number of available processors or machines. Clearly, most of these activities can be generated using regular function calls or loop iterations. Such a technique is usually called Lazy Threads and is used to save the overhead involved in the creation of activities on a remote machine (including the overhead for allocating a new stack). Similarly, the termination of an activity should be realized by the return instruction.

Preemption. Switching between activities is crucial for maintaining the fairness requirement of ParC, so that while-loops whose termination depends on shared variables will eventually terminate. Preemption is usually realized using interrupts from the clock; a compiler can analyze the program and insert explicit context-switch instructions in suitable places in the code, saving the overhead of unnecessary context-switches and the overhead of the interrupt handler.

Cactus or loose stack. ParC supports dynamic creation of activities, and preemption requires that the operating system maintain a set of separate stacks, one for each activity that is currently being executed. This increases the overhead of the execution, as many instructions are wasted in allocating/deallocating and maintaining the separate stacks. In addition, as the stack allocated to each activity must be of a fixed length, some protection mechanisms are needed to detect stack overflow, which adds to the general overhead. The proposed compiler should be able to embed the cactus-stack directly into the regular stack without increasing the overhead of maintaining the cactus stack. The main difficulty here is reclaiming the "holes" that are left on the stack by activities that have terminated. We use the page-table mechanism of the underlying machine to move such holes to the top of the stack so that their physical pages can be reused.

Referencing non-local variables. Nesting of parallel constructs in ParC requires that direct reference to nonlocal variables (those defined in external constructs) be made possible and fast.
This situation resembles the need to access nonlocal variables in Pascal-like languages. Compiler methods for supporting nonlocal references might work; however, for efficiency's sake, we have chosen a different method to realize nonlocal references. In the proposed scheme, an activity is represented by a function whose arguments are pointers to the external shared variables used by that activity.

Scheduling and load-balancing. Usually the operating system performs this task by maintaining a global FIFO queue of activities that are ready to run; when a machine runs out of work, it fetches its next activity from that queue. It is well known that using a global queue guarantees optimal load-balancing (as the gap between the loads of any two machines cannot exceed the maximal length of an activity). However, extensive overhead is required for maintaining this queue and synchronizing the accesses to it. Using a compiler implies that the scheduling should be determined at compilation time. The reader is referred to Polychronopoulos's book [22] describing a variety of optimization techniques for scheduling. We use a simple scheme, in which the compiler generates code that divides the new
activities equally between the machines. This is based on the assumption that activities generated by the same parallel construct are of equal "length" (i.e., perform the same amount of work, such as multiplying rows and columns in matrix multiplication). As a result, each machine receives an equal portion of activities that can be executed one after the other as regular iterations of a loop. Only if the current activity forks or performs a context switch is it inserted into a ready-queue. This forms a simple variation of the stacklet technique used in [15].

Shared memory simulation. Shared memory is usually realized by dividing the memory between the different machines and then using the page-fault mechanism to fetch memory pages which are not available on the current machine. This "on-line" solution incurs extensive overhead, as frequent movements of memory pages might overload the network. An off-line solution is hard to obtain, since it requires determining the pattern of memory references at compilation time and inserting explicit instructions to fetch memory pages at suitable places in the code. The proposed solution is based on the idea that some of the page-faults can be avoided if the compiler inserts code that samples the iteration space of large loops and pre-fetches pages that will be needed by the loop, as sketched below.
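As a rough illustration of this sampling idea (a sketch only; prefetch_page() is a hypothetical stand-in for whatever prefetch request the DSM kernel exposes, and PAGE_SIZE is the page size), the code the compiler inserts in front of a heavy loop that scans an array a[0..n-1] could look like this:

    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* Hypothetical DSM hook: ask the kernel to fetch the page holding addr. */
    extern void prefetch_page(void *addr);

    /* Compiler-inserted sampling code: touch one element per page of a[0..n-1]
     * before the heavy loop runs, instead of faulting inside the loop.        */
    static void sample_and_prefetch(double *a, size_t n)
    {
        size_t stride = PAGE_SIZE / sizeof(double);   /* elements per page */
        for (size_t i = 0; i < n; i += stride)
            prefetch_page(&a[i]);
        if (n > 0)
            prefetch_page(&a[n - 1]);                 /* last, possibly partial, page */
    }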
1.1 Comparison with related work
There are numerous works which deal with efficient implementations of parallel programming languages using the operating system approach, namely multi-threaded packages like PVM [13], MAXI [12], and Linda [14]. Compilers for parallel languages are rarer, used mostly for functional/data-flow languages (e.g., [28]), where the natural parallelism and the challenge of implementing strictness analysis and lazy evaluation motivated most of the works. Several abstract parallel machines and suitable compilation schemes were developed, e.g., TAM [10]. A most comprehensive work, with a detailed background of relevant compilation techniques for parallel programs, is the work of Goldstein et al. [15]. They use lazy threads, stacklets, and synchronizers to design a compiler that produces efficient code for parallel languages (in particular Id90 [19]). It is therefore sufficient to compare our techniques with the ones presented in [15]. Using function calls for activity creation is a well known technique; however, the specific variant of passing addresses of shared variables as parameters is novel and will allow us to simplify the realization of the cactus stack. As in [15], we try to implement lazy threads, using, however, a very different technique. An activity in [15] is always executed locally using a regular function call. Only if there is a work-request from another machine will the activity migrate to a remote machine. This is done by using multiple entrance points in the code and in the frames of new activities. Such methods require non-preemptive execution ([5]); otherwise, there will be "holes" left on the stack when preempted activities terminate. In our method, we assume that there are P machines and the spawn generates n activities. In this case, the compiler can distribute one activity to each machine. This activity will locally spawn n/P activities that will be executed as function calls or loop iterations. Hence, for a ParC type of language, the lazy thread mechanism can be replaced by a fixed scheduling technique. Finally, since ParC's activities are preemptive, activity termination will create unused space on the stack. This problem is solved by using the page-fault mechanism of the virtual memory management to move physical pages to the top of the stack, so that the holes in the stack can be reused. Both methods require that the stack-size needed for an activity be computed at compilation time. We use a calling-graph method to estimate this size and describe some heuristics to overcome the problem of estimating the stack-size in case of recursive calls.
1.2 The abstract parallel machine
The abstract parallel machine that might be used as a target language for the compiler is a collection of several common abstract machines that have been extended with special instructions for communication (via active messages [29]) and for memory management. Each machine has an additional stack (called the message-stack) that is used to receive messages from other machines. The `SEND' instruction is used to send/push k bytes to another machine. It is assumed that these k bytes contain the calling sequence of a function. Thus, the operation of receiving a message is equivalent to a function call whose address and
parameters are taken from the message-stack instead of the regular stack. The instructions for memory management are based on the ability to change the page-table (PT) of the current machine (or process). In particular, they include instructions for obtaining the physical address and the index in the page table of the page matching a given virtual address (vadd). Instructions that modify access protections of virtual pages (validating/invalidating) are also included. Finally, page-fault interrupts caused by referencing a non-valid virtual address can be handled by indicating a function that will be activated when a page-fault occurs. The arguments to this function will be the virtual address that caused the trap and the type of memory access, namely read or write. These special instructions are described in the following table, where SP is the stack-pointer, PC the program counter, FP the frame-pointer, PT the page-table and mstack the message stack:

    mnemonic    function
    SAVE        push(FP); push(SP); push(PC);
    SSP         SP = pop();
    RESUME      FP = pop(); SP = pop(); PC = pop();
    SEND        mach = pop(); msize = pop(); m = pop(msize); push(m, mstack[mach]);
    RECEIVE     fun = pop(mstack); copy the rest of the message onto the stack; call fun;
    PENTRY      vadd = pop(); index = entry_in_PT(vadd); push(index);
    GPAGE       vadd = pop(); page = physical_address(vadd); push(page);
    PUPDATE     index = pop(); vadd = pop(); PT[index].vadd = vadd;
    TOUCH       S = pop(); n = pop(); while(n-- > 0){ vadd = pop(); if(vadd ∉ PT) S[k++] = vadd; }
    PSEND       S = pop(); n = S[0]; while(n-- > 0) SEND(S[n]);
    RESTORE     S = pop(); n = S[0]; while(n-- > 0) validate(S[n]);
    RET_PAGES   S = pop(); n = S[0]; while(n-- > 0) SEND(get_page(S[n]));
    PFault      pop the function name;

It is assumed that RESUME is an atomic instruction that modifies the PC, FP and SP simultaneously. Evidently, these instructions can be used to implement distributed shared memory (DSM); hence, we assume that the underlying abstract machine supports DSM and memory pages are automatically transferred between the individual machines. It is the goal of the compiler to generate code that will minimize the number of page-faults generated by the DSM simulation. Such a compiler can produce three types of outputs: an assembler file for an abstract parallel machine, an actual assembler file that can be executed directly on the machines (e.g., 8086 assembler), and a C file that can be compiled on any type of machine. Each type of output has a different task. The abstract parallel machine code will be executed by an interpreter on every individual machine. This will slow down the performance; however, it can be used in a heterogeneous cluster, similar to a Java interpreter. In addition, fancy types of instructions, like the ones that change the page-table or move memory pages between the different machines, are easier to implement using an interpreter. The actual assembler code is designed for maximal performance and is therefore specific to one type of architecture, and will probably disable some of the operating system functions during execution. The C file will rely on system calls to implement communication and memory management. It will be portable and can be implemented using a translator, since most of the sequential parts of the program can be copied to the output as is.
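As a rough C sketch of what RECEIVE amounts to (the msg_t layout, mstack_pop() and the fixed-size argument array are assumptions; the abstract machine actually copies a raw calling sequence onto the stack), receiving a message is simply an indirect call:

    /* Assumed message layout: an activity function plus its (pointer) arguments. */
    typedef void (*activity_fn)(void **args, int nargs);

    typedef struct {
        activity_fn fun;       /* address of the activity function   */
        int         nargs;     /* number of arguments in the message */
        void       *args[8];   /* the argument values (pointers)     */
    } msg_t;

    /* Hypothetical blocking pop from this machine's message stack. */
    extern msg_t *mstack_pop(void);

    /* The effect of RECEIVE: treat the incoming message as a ready-made
     * calling sequence and invoke the function it names.               */
    static void receive_one(void)
    {
        msg_t *m = mstack_pop();
        m->fun(m->args, m->nargs);
    }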
2 A Short Introduction to ParC

ParC [5] is a parallel programming language which extends C to support asynchronous, shared memory parallel programming. The ParC language is based on the integration of parallel and sequential constructs, and on a special modification of the scoping rules of C for a parallel environment. The language introduces two types of parallel constructs [5]. Parallel blocks, introduced by the keyword Pblock, explicitly list the statements that can be executed in parallel. Parallel for loops, introduced by Pfor, indicate that all iterations of the loop should be executed in parallel. The latter has another version, called lparfor, which
automatically chunks iterations together. The number of chunks created equals the number of processors. The parallel "things" that are created by these constructs are called activities. Both constructs terminate only when the last constituent activity has completed. Scoping rules are usually used to determine which statement can access which variables. Usually, a statement can access variables declared along the branch of its syntactic nesting. ParC extends this notion to determine which variables are shared and which are private/local to certain activities. Variables declared inside an activity are local to it, yet shared by its children. Local variables declared inside a Pfor or Pblock are replicated, and each activity has a distinct copy. The following figure demonstrates the scoping rules and the dynamic run-time structure of a simple program. Note that g2 is duplicated four times; hence, each /*A*/ activity updates a different copy. However, they all access the same (and only) copy of g1. The sync command is a barrier synchronization between activities spawned by the same construct. It guarantees that a /*B*/ will read g2 only after the corresponding /*A*/ activity has stored a value in it.
    int g1;
    Pfor int i; 1; 4; 1; {
        int g2;
        Pblock {
            /* A */
            int g3;
            g2 = g1 + g3;
            sync;
        } : {
            /* B */
            int g4;
            sync;
            g4 = g2;
        } epar
    } epar

Nested parallel constructs and the execution graph with its shared variables: g1 has a single copy shared by all activities; each of the four Pfor iterations (i = 1,...,4) has its own copy of g2, shared by that iteration's /*A*/ and /*B*/ activities, which in turn declare the local variables g3 and g4.
It is assumed that the implementation of ParC guarantees that references to local variables (by the activity in which they were defined) are indeed performed locally, i.e., they are performed by a processor to its local memory. Hence, ParC supports a non-uniform memory access model [6]. In addition to the use of scoping rules, ParC supports locality by allowing the user to allocate memory on a specific machine using malloc() and to re-access that memory using the mapped versions of Pfor and Pblock [5]. It follows that any realization of ParC cannot allow activity migration if it intends to support locality of references. Finally, the semantics of ParC requires some notion of fairness, namely, that no activity can be starved forever, and that in every stage of the computation there is a finite time interval in which ready-to-run activities will be served. Consequently, any realization of ParC should use preemption (also referred to as context-switches); otherwise, activities might spin forever waiting for other activities to set some values of shared variables. This requirement is a severe restriction on obtaining efficient compilation, as it requires the compiler to simulate a ready-queue holding the activities that need to be executed next.
3 Inserting explicit context switch instructions
An important aspect of executing parallel code is to overcome the problem of choosing the correct execution order for activities whose termination depends on the execution of other activities. Typically, this can happen when shared memory is used, and one activity waits for the results of another activity. Consider, for example, the following two loops with mutual dependence (left side):
    int x=0, y=1;
    pparblock {
        while(y) {
            while(x%2 == 0);
            x++;
            if(x > 1000) y = 0;
        }
    } : {
        while(y) {
            while(x%2 == 1);
            x++;
        }
    } epar

Mutual dependence

    int x=0, y=1;
    pparblock {
        int k=0;
        while(y) {
            k++; if(k%100 == 0) csw();
            while(x%2 == 0) { k++; if(k%100 == 0) csw(); }
            x++;
            if(x > 1000) y = 0;
        }
    } : {
        int k=0;
        while(y) {
            k++; if(k%100 == 0) csw();
            while(x%2 == 1) { k++; if(k%100 == 0) csw(); }
            x++;
        }
    } epar

Inserting explicit CSW instructions
Clearly, the only way for the above activities (left side) to terminate is to execute them concurrently. However, it may happen that both activities are allocated to the same machine. In this case, we must stop the execution of at least one activity and continue with the next, requiring that either preemption or context-switch be used. Operating systems use interrupts from the clock in order to perform context switches between activities. Interrupt-driven context-switches have greater execution overhead than compiler-inserted context-switch calls placed at the right places in the code. This is depicted in the above figure (right side), where a context switch is executed every 100 iterations. The idea is not to perform context-switches too frequently, as this operation has high overhead. The optimal choice of where and when to insert explicit context-switches is a complex subject; however, some simple heuristics can be used. For example, explicit context-switch instructions can be inserted only inside while-loops. Moreover, the quantum time for executing the context switch can be set proportionally to the number of iterations executed so far. The following code gives two alternatives for inserting context-switch instructions. The left-side code uses a counter and performs the context-switch every fixed number of iterations. This type of solution is useful if the number of iterations executed by the loop is relatively small compared to the total number of steps in the program. The code on the right side doubles the quantum time (cstime) between two consecutive context-switch instructions every time the number of iterations executed so far exceeds cstime^2. This is useful if the while-loop is "heavy" and we do not want to use too many context-switch instructions.

    k++;
    if(k%100 == 0) csw();

Fixed quantum time for csw()

    k++;
    if(k%cstime == 0) {
        if(k > cstime*cstime)
            cstime += cstime;
        csw();
    }

Proportional increase of the quantum time
Not every while-loop is used for synchronization; hence, the compiler need not insert context-switch instructions in such loops. Special analysis is required to check that the condition of a given while-loop does not depend on shared variables. Note that spinning (performing busy-wait) on shared variables can also be implemented using recursive calls. But, since the memory of a parallel system is finite, any program using such a recursion will be halted by the system anyway. Thus there is no need to insert context-switches in recursive calls. Another reason for not using context-switches in recursive calls is that it would lead to "breadth-first" scheduling, which might blow up the memory, compared to "depth-first" scheduling, which is more efficient.
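The text does not fix a particular implementation of csw(). As a minimal sketch, the following user-level round-robin (built here on POSIX ucontext rather than the abstract SAVE/SSP/RESUME instructions, with a two-entry ready "queue") runs the two mutually dependent activities above to completion; a real runtime would also remove terminated activities from the queue and serve the message stack while switching.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    #define STKSZ (64 * 1024)

    /* A minimal user-level round-robin between two activities; csw() is the
     * explicit context switch the compiler would insert into while-loops.   */
    static ucontext_t main_ctx, ctx[2];
    static int cur = 0;

    static void csw(void)
    {
        int prev = cur;
        cur = (cur + 1) % 2;
        swapcontext(&ctx[prev], &ctx[cur]);
    }

    /* The two mutually dependent activities of the example above. */
    static volatile int x = 0, y = 1;

    static void act_a(void)
    {
        while (y) {
            while (x % 2 == 0) csw();   /* busy-wait, but yield the processor */
            x++;
            if (x > 1000) y = 0;
        }
        printf("A done, x=%d\n", x);
    }

    static void act_b(void)
    {
        while (y) {
            while (x % 2 == 1) csw();
            x++;
        }
        printf("B done, x=%d\n", x);
    }

    int main(void)
    {
        getcontext(&ctx[0]);
        ctx[0].uc_stack.ss_sp   = malloc(STKSZ);
        ctx[0].uc_stack.ss_size = STKSZ;
        ctx[0].uc_link          = &ctx[1];      /* when A ends, resume B    */
        makecontext(&ctx[0], act_a, 0);

        getcontext(&ctx[1]);
        ctx[1].uc_stack.ss_sp   = malloc(STKSZ);
        ctx[1].uc_stack.ss_size = STKSZ;
        ctx[1].uc_link          = &main_ctx;    /* when B ends, resume main */
        makecontext(&ctx[1], act_b, 0);

        cur = 0;
        swapcontext(&main_ctx, &ctx[0]);        /* start activity A */
        return 0;
    }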
4 Using function calls to spawn activities

In this section we discuss the basic ideas involved in spawning activities using function calls or loop iterations. A more comprehensive scheme which takes preemption into account will be presented in the next section. As described in the introduction, one of the goals of a compiler for a parallel language is to guarantee that most of the activities will be created locally using function calls or loop iterations. In this way the overhead for spawning new activities is minimized. Actually, this includes two sub-goals: reducing the number of messages needed to distribute new activities, i.e., using p = #machines messages to spawn n activities instead of one message for every new activity; and using fast mechanisms to create new activities on the current machine, once the spawning message has been received. Another related task involved in creating efficient code is the efficient addressing of local or external variables inside activities. This task is well performed by conventional compilers, using fixed offsets from the frame-pointer to address local variables/parameters of functions and procedures [30].

The above goals are easily realized if the compiler represents an activity (say A) by an "activity function" fA(...), such that spawning A is realized by calling fA(...). In this way the variables that are defined in A will be realized as local variables of fA, allowing fast access to local variables defined inside activities. However, since ParC allows arbitrary nesting of parallel constructs, A may reference external shared variables that are defined in outer parallel constructs. This situation resembles the case of Pascal-like languages where nested definitions of functions are allowed. External references to global variables are usually realized in two ways:

1. an auxiliary array called the "display" is constructed in the activation record of every function, containing offsets to the frame-pointers of all the relevant outer nested functions. The display is regenerated on every function call, by copying and modifying the display array of the caller. Function calls thus become inefficient; however, external references can be made fast, using 2-3 instructions [21].

2. a chain of "static link" pointers or offsets is constructed on the stack, such that each static link points to the frame of the function in which the current function is defined. The calling sequence is efficient; however, a reference to an external variable requires a sequence of instructions that follows the chain of static links to get the right frame-pointer.

External references are very common in ParC programs, mainly since ParC supports nesting of parallel constructs as a way to obtain a mapping of local variables to local memories (see section 2). It is therefore important to address global variables efficiently. Both the display and the static-link methods can be used. However, we have chosen instead the following option. Let v1,...,vk be the set of external variables that are referenced by an activity A. The compiler will replace every reference to vi by a reference through a pointer pvi, and pass the addresses of v1,...,vk through pv1,...,pvk as arguments to fA. The definition of fA is therefore fA(int *pv1,...,int *pvk){...}. It is invoked by calling fA(&v1,...,&vk). Thus, external references are realized using parameters that point to the right variables.
This method is only useful if the number of external variables used by an activity is small, yet their nesting distance is not negligible. This, we believe, is the common case in ParC programs. Another argument for not using the display or the static-link methods in our case is as follows. It is desirable to implement the compiler such that the target code is C and not assembler, so that we only have to deal with the additional parallel constructs. Clearly, both the static-link and display methods work with offsets and require that the target code be at the assembler level. Another crucial point is that both the static-link and display methods assume that the stack segment is contiguous. However, in a parallel execution, some of the addresses of external variables will refer to stack segments that are allocated on different machines. What is the meaning of offsets in this case? Note that it might be possible to extend the static-link and display methods to work with parallel execution; however, passing parameters is extremely simple and therefore more natural to use.
Consider the following translation of a Pblock that spawns two activities, such that the first activity references one external variable i, and the second references two external variables i and j. Each activity is translated into a function that gets a pointer to each of its external variables as an argument. In the spawning sequence, the function name fA and its arguments &v1,...,&vk are passed as parameters to the send() instruction. The first argument of send() is the machine id where the activity should be executed. The rest of send(pid,f,....)'s arguments include the function address, the number of parameters, and the parameters of the f()s themselves. Using send(pid,f,....) allows us to put the calling sequence for f1 and f2 on the message-stack of each machine. In general, the allocation of activities generated by a Pblock to machines can be done at random, as a simple way to guarantee load-balancing (see [31]).

    f(int n)
    {
        int i, j;
        Pblock {
            int x;
            x = i;
            if(n > 1) f(n-1);
        } : {
            int y;
            j = y + i;
        } epar
    }

nested parblock example

    f(int n)
    {
        int i, j, r;
        r = rand()%PNUM;
        send(r, f1, 2, &i, &n);
        r = rand()%PNUM;
        send(r, f2, 2, &i, &j);
    }

the spawning sequence

    f1(i, n)
    int *i, *n;
    {
        int x;
        x = *i;
        if(*n > 1) f(*n-1);
    }

    f2(j, i)
    int *j, *i;
    {
        int y;
        *j = y + *i;
    }

activity functions
Note that the compiler might be able to detect that, in the scope of a parallel construct, a shared variable is "read-only" for all the activities that use it. Hence, it does not have to pass a pointer to that variable; instead, it can pass a copy of it. For example, i is "read-only" in both functions, so the compiler need not use the indirect reference *i. The analysis needed to realize this optimization is complex, as it must determine that it is safe to pass a shared variable as read-only. Consider, for example, the program x = 1; Pblock { while(x > 0); } : { x = 0; } epar; here, the variable x in the while(x > 0); activity must be passed by reference &x (even though x is not modified by while(x > 0);) so that the activity will see the update x = 0 when it happens. Note that using indirect references *x need not be expensive, as some machines (e.g., the X86 processors) support indirect references at the assembler level. Another possible optimization is not to pass shared global variables as parameters, since global variables have one absolute address that can be computed at compile time. Passing pointers to external variables is necessary only if the address cannot be computed at compile time, which is the case with shared variables that are stored on the stack.

Another important issue in generating efficient code for parallel programs is to determine the scheduling that will occur during the execution, i.e., the allocation of activities to machines and the order of their execution. As explained in the introduction, the operating system usually uses a global queue to solve this problem. Combined with context switches and task-migration, this method obtains an optimal ratio of load-balancing. However, it is very costly in terms of overhead. A compiler can determine the scheduling at compile time, simply by determining where to send new activities that are generated by a spawn. As explained in the introduction, the proposed scheduling uses a simple rule, namely:

activities spawned by a parblock construct are randomly distributed between the different machines;

activities generated by a Pfor are divided into #machines chunks, which are distributed between the different machines. Each chunk is represented by an activity function (called a "selector function") that will locally execute an equal part (given by the parameters fr,to in the following code) of the Pfor's activities as loop iterations. We assume that the number of computations involved with each iteration
of a Pfor is the same. Thus, an equal distribution of these iterations between the different machines is a reasonable strategy to preserve load-balancing. In addition, the realization of the above scheduling should guarantee that the next instruction after spawning will be executed only after the last spawned activity has terminated. The spawning sequence of a Pfor contains a counter (mcounter) that is set to the number of machines minus one (accounting for the local chunk that is executed directly by the spawning activity). The address of mcounter is passed to every invocation of the selector function along with mid, the machine id of the spawning activity. Before terminating, the selector function checks whether its machine id is equal to mid. If the answer is yes, then the spawning activity waits until mcounter becomes zero; otherwise, it sends a done() message to mid that will decrease mcounter by one. The waiting itself is done by busy-waiting or by receiving and executing new messages. Consider, for example, the translation of a program to multiply two matrices in parallel, C = A × B, where A, B and C are N×N matrices (a sketch of a selector function for this loop is given after the code):
    Pfor int i; 0; N-1; 1; {
        Pfor int j; 0; N-1; 1; {
            int k;
            for(k=0; k<N; k++)
                C[i][j] += A[i][k]*B[k][j];
        } epar
    } epar
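A minimal sketch of the selector function that the compiler might emit for the outer Pfor above, under the chunking scheme just described. The names fselector, send_done, my_machine_id() and serve_one_message() are assumptions standing in for the runtime facilities described in the text, the inner Pfor is executed sequentially for brevity, and the matrix dimension N is fixed only to keep the sketch self-contained:

    enum { N = 256 };                       /* matrix dimension (assumed) */

    extern int  my_machine_id(void);
    extern void send_done(int mid, int *mcounter);  /* remote decrement of mcounter */
    extern void serve_one_message(void);            /* receive and run one message  */

    /* One chunk [fr,to] of the outer Pfor, executed as ordinary loop iterations.
     * mid is the machine id of the spawning activity, mcounter the address of
     * its outstanding-chunk counter.                                            */
    void fselector(int fr, int to, int mid, int *mcounter,
                   double (*A)[N], double (*B)[N], double (*C)[N])
    {
        for (int i = fr; i <= to; i++)
            for (int j = 0; j < N; j++) {
                C[i][j] = 0.0;
                for (int k = 0; k < N; k++)
                    C[i][j] += A[i][k] * B[k][j];
            }

        if (my_machine_id() == mid) {
            /* the spawning machine: wait until all other chunks have reported */
            while (*mcounter > 0)
                serve_one_message();        /* busy-wait or execute new messages */
        } else {
            send_done(mid, mcounter);       /* tell the spawner this chunk is done */
        }
    }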
    if(current_stack + next_size >= stack_size) {
        stack_size = max(2*stack_size, next_size);
        get_pages(next_size);
        set_stack();
    }
    current_stack += next_size;

code for dynamic stack allocation

Inserting tests for stack allocation in the cycles of a calling graph (nodes a, b, c, d, e, f)
The code for re-using memory pages works as follows. The system starts by allocating a fixed number of physical pages, used to construct a list of free pages. In addition, the virtual address space is logically divided between the different machines, so that each machine will use only a fraction of the virtual addresses. In this way the variables allocated on the stack of each activity have unique addresses. The result is that the memory is shared. Referencing a virtual address which is not in the current page table will cause a page fault. This allows the virtual machine to bring the correct page from the remote machine where the virtual address resides. Finally, we assume that we have bypassed the protection mechanism of the underlying operating system such that we can update the entries of the page table. The allocation of stack space is performed in the main loop of fselector_k(), before it starts an activity. A global variable (denoted by vadd) contains the last virtual address used for the stack on each machine. There is no need to use special allocation if the activity does not contain a csw() instruction, as such an activity will simply return to the main loop. An additional stack segment right above the current SP will be allocated for an activity which might perform a context-switch. The number of physical pages for a preempted activity, as determined by the static analysis, is allocated, and a linked list containing the description of these pages is produced. The page-table is updated such that the stack-frame to be used by this activity will start in the next page after vadd. This procedure is repeated every time control passes an allocation test. Thus the list of pages used for the stack-frames of the current activity expands and the page table is updated. When the activity terminates, the list of physical pages used for its stack is added to the list of free pages so that they can be reused. Since the stack of every activity starts on a new page, no two activities share the same page, and the pages can be safely reused. Note that a deadlock may occur if none of the existing activities can allocate enough pages for their stacks. However, such a deadlock is an indication that the number of physical pages allocated for the program is not sufficient, and a suitable request for allocating more physical pages should be sent to the operating system.
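A minimal sketch of this allocation scheme (the names page_t, map_page(), alloc_stack() and release_stack() are assumptions; map_page() stands for the PUPDATE-style page-table update described earlier, and under-allocation when the free list runs dry corresponds to the deadlock situation mentioned above):

    #include <stddef.h>

    #define PAGE_SIZE 4096

    /* A free list of physical page frames, filled once at start-up. */
    typedef struct page { struct page *next; unsigned long phys; } page_t;

    static page_t *free_pages;
    static unsigned long vadd;            /* last virtual address used for the stack */

    extern void map_page(unsigned long vaddr, unsigned long phys);

    /* Allocate npages for a new activity stack, starting on a fresh page
     * boundary right after vadd; return the list so the pages can be
     * recycled when the activity terminates.                               */
    static page_t *alloc_stack(int npages, unsigned long *stack_base)
    {
        unsigned long base = (vadd + PAGE_SIZE) & ~(unsigned long)(PAGE_SIZE - 1);
        page_t *used = NULL;

        for (int i = 0; i < npages && free_pages; i++) {
            page_t *p = free_pages;
            free_pages = p->next;
            map_page(base + (unsigned long)i * PAGE_SIZE, p->phys);
            p->next = used;
            used = p;
        }
        *stack_base = base;
        vadd = base + (unsigned long)npages * PAGE_SIZE - 1;
        return used;
    }

    /* On termination, give the pages back to the free list for reuse. */
    static void release_stack(page_t *used)
    {
        while (used) {
            page_t *p = used;
            used = used->next;
            p->next = free_pages;
            free_pages = p;
        }
    }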
7 Distributed shared memory management

There are two types of solutions to the problem of simulating shared memory using the set of local memories of several machines connected by a local network. The first type is Distributed Shared Memory (DSM) systems, which use the virtual memory to simulate a single address space out of the separate address spaces of the individual machines. Typically, DSM systems [20] modify (or bypass) the operating system so that memory pages can be exchanged between the different machines. The virtual address space of the program is partitioned between the different machines by invalidating a fraction of the virtual address space of each machine (using the page table mechanism). The union of all valid addresses in each machine forms the shared address space. Thus, any attempt to access an address whose page is not valid will cause a page-fault, allowing the DSM system to move this page from a remote machine (where it is currently valid) to the current machine, making the address valid there so that the execution can resume. However, DSM systems require [26, 32, 3] extensive page migration between the different machines and complicated protocols to maintain consistency of shared memory references. In particular, there is usually a mechanism allowing each machine to determine the location of any given page. This type of system usually exhibits large overheads and often fails to work well [27]. Several weak consistency models have been developed in which memory pages can be duplicated so that most of the memory references will be to a local copy. One of the latest models is the Lazy Release Consistency of Keleher [17], where updating of multiple copies of the same page is performed only at synchronization points. The updates require each machine to select only the
addresses (in the copied pages) that were changed and send them to the owner of that page. The mechanism needed to collect these addresses increases the overhead involved with Lazy Release Consistency. Using page faults to implement DSM can be characterized as an on-line method, since pages are moved from one machine to the other based on memory references made during the execution. The second type of solution is based on compiler technology and uses a static partition of the shared variables (in particular arrays) over the different machines. The static partition of the shared variables implies that the location (machine-id) of each data item is known at compilation time. Thus the compiler translates non-local memory references to messages sent to the owner of the relevant addresses. This method is called "the owner computes" [16], and is subject to several drawbacks. In particular, the granularity of the resulting messages might be too small, so that too many messages flood the system. In addition, using the owner-computes rule requires the ability to make an optimal partition of the data (usually arrays) that can maximize locality of shared memory references.

The solution proposed here is based on the DSM method; however, there are some traces of the owner-computes method as well. As in the owner-computes method, the virtual address space is partitioned between the machines. Thus the location of a virtual address remains fixed during the execution. The location of each page can be easily computed, saving the need to maintain a mechanism to locate a given page. Pages are moved from one machine to the other. However, they must be returned to their owner before they can be reused. This scheme does not include pages that are marked as read-only; such pages need not be returned to their owner. It is the responsibility of the compiler to insert RET_PAGES instructions so that the set of currently non-owned pages (the ones brought to the current machine due to page-faults) will be sent back to their original owner. This creates an extremely simple DSM kernel that supports only page-migration. No special treatment is needed to support consistency of the shared memory (including synchronization points). Any request for a page (or set of pages) must block the machine that issued it until the machine gets the page or set of pages. The system should prevent deadlocks caused by several machines each holding a set of pages needed by the other machines. This is usually done using randomness, assigning a probability to release the memory pages accumulated so far. Two different ways ("distributed" and "centralized") of sharing memory segments are used. In distributed sharing, the virtual addresses of the segment are partitioned between the different machines, such that each machine invalidates all the addresses of all other parts, while in centralized sharing only one machine owns the segment, and all other machines invalidate the set of pages of this segment. In the proposed scheme, the code segment is duplicated and is not shared. The stack segment of each machine, the segments allocated by malloc operations, and the global variables (excluding arrays) are centralized. Only global arrays are distributively shared. Thus each machine has a separate, consecutive stack segment allocated at the beginning of the execution, yet shared by all the other machines (in case a pointer to it has been passed to another machine through a global variable).
This stack segment can be enlarged during execution using the services of the operating system, namely by allocating a large segment and "returning" most of its physical pages to the operating system right after the allocation (such services are, for example, supported on Windows NT). There are several ways in which we can use compile-time analysis to improve the performance of the shared memory simulation. One thing that a compiler can easily do is to replace explicit references to global variables that are made outside the scope of loops by an update message to the owner of the variable, e.g., g = x is translated to send(mach0, update, &g, x), where mach0 is the machine owning g. In this way the overhead of fetching the page that contains both g and other variables that create false-sharing with g can be avoided. Note that an update message should block the current machine until a confirmation message is received, in case another machine synchronizes with the current machine and then accesses g. In that case, if the update message has not yet been received, the value of g obtained by this machine might be inconsistent. Detecting indirect references to global variables (e.g., a = &g; *a = x) is hard and requires complicated pointer analysis [25, 18]; however, in principle some references of this form can be detected and replaced by update messages. In addition to the above optimization, the compiler can also detect that the array references inside a given loop are read-only and signal the DSM kernel to bring copies of the pages, so that other references to these pages will not block the rest of the machines.
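A minimal sketch of this translation under the static partition of the virtual address space (PAGES_PER_MACHINE, send_update() and wait_ack() are assumed names for the partition constant and the messaging layer):

    #include <stdint.h>

    #define PAGE_SIZE 4096
    #define PAGES_PER_MACHINE 4096          /* size of each machine's slice (assumed) */

    extern void send_update(int owner, void *addr, int value);
    extern void wait_ack(void);             /* block until the owner confirms */

    /* With the address space statically partitioned, the owner of any
     * address can be computed rather than looked up.                    */
    static int owner_of(void *addr)
    {
        uintptr_t page = (uintptr_t)addr / PAGE_SIZE;
        return (int)(page / PAGES_PER_MACHINE);
    }

    /* The compiler-generated replacement for "g = x" outside loops:
     * instead of faulting in g's page, send the new value to g's owner. */
    static void remote_assign(int *g, int x)
    {
        send_update(owner_of(g), g, x);
        wait_ack();     /* keep the write ordered w.r.t. later synchronization */
    }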
Another relatively simple optimization that a compiler can perform to increase efficiency is to reorder the iterations of loops that are executed in parallel and access the same arrays. The idea is that each loop will execute its iterations in a different order, thus reducing the chance that two loops will reference the same part of the array simultaneously. Recall that in our scheme there is only one copy of every page, so that simultaneous references to the same page must be served sequentially, slowing down the execution. Such simultaneous references can occur when the same code is executed in parallel by different machines, particularly if the activities of a Pfor statement contain heavy loops that do not depend on the Pfor index. This is the case with the following code, which counts the occurrences of each element of an array in a matrix. The compiler can detect that the iterations of the inner loop are independent and reorder them such that each machine starts its execution from a different place, chosen at random (see the sketch below).

    int M[n][n], B[n], C[n];
    pparfor int i; 0; n-1; 1; {
        int j, k;
        for(j=0; j<n; j++)
            for(k=0; k<n; k++)
                if(M[j][k] == B[i]) C[i]++;
    } epar
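A minimal sketch of such a reordered activity body (the counting semantics, the name count_occurrences and the fixed value of n are assumptions used only to make the sketch self-contained): each machine walks the rows of M in a rotated order starting from a random row, so that different machines are unlikely to request the same pages of M at the same time.

    #include <stdlib.h>

    enum { n = 256 };                    /* problem size (assumed) */

    int M[n][n], B[n], C[n];

    /* Body of one Pfor activity (index i), with the scan of M started at a
     * random row: rotating the row order on every machine reduces the chance
     * that two machines fault on the same page of M simultaneously.          */
    void count_occurrences(int i)
    {
        int start = rand() % n;
        for (int jj = 0; jj < n; jj++) {
            int j = (start + jj) % n;    /* rotated row order */
            for (int k = 0; k < n; k++)
                if (M[j][k] == B[i])
                    C[i]++;
        }
    }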