In the Digest of Papers, COMPCON Spring '93, San Francisco, California, February 22–26, 1993, pp. 165–174.
© 1993 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material
for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
Supporting a Dynamic SPMD Model in a Multi-Threaded Architecture

Herbert H.J. Hum
Dept. of Electrical & Computer Engineering, Concordia University, 1455 de Maisonneuve W., Montreal, Canada H3G 1M8

Guang R. Gao
School of Computer Science, McGill University, 3480 University St., Montreal, Canada H3A 2A7
Abstract

The SPMD (Single-Program Multiple-Data) model has gained acceptance as a programming model for scientific array-intensive computations on distributed memory machines. Recently, researchers have been extending the SPMD model to handle programs which operate on recursively-defined dynamic data structures; such models are commonly referred to as Dynamic SPMD (DSPMD) models. In this paper, we examine existing Dynamic SPMD models and investigate how to efficiently exploit temporal and physical locality in the traversal of and operations on dynamic data structures on a multithreaded architecture. In particular, we propose an extension of a DSPMD model and present a new multi-threaded architecture to support the model.
1 Introduction

The SPMD (Single-Program Multiple-Data) model has gained acceptance as a programming model for array-intensive scientific computations on distributed memory machines. In the last few years, we have seen a rapidly increasing number of research projects in this area, with the majority of them focused on the Single-Program Multiple-Data (SPMD) [16] model and on the data parallel paradigm [11]. Under the SPMD model, a mapping of elements of the data structures of a computation onto the processor set is specified. The same program is run at each node, with its behaviour modified as necessary according to the node's identity. A computation executed according to the SPMD model advances by performing alternate computation and communication phases, in which the communication phases serve to synchronize
the processors. Often each node performs the computation associated with defining (initializing or updating) values of the set of array elements assigned to it by the mapping. In general these computations may require input values that have been computed at other nodes. During a communication phase, values are moved from the nodes where they are computed to the nodes where they are needed for subsequent computation. The model is implemented by mapping the program onto a multiprocessor model, either manually or by using a data parallel compilation strategy. Most of the research projects in this compilation strategy are focused on developing compilers for sequential programming languages with extensions for user-specified partitioning of data. Compilers take such programs and automatically generate parallel code with messages for communication among processors. As an example, the SPMD model provides a good abstraction for the array computation features of such languages as FORTRAN 90 [20] and for languages with data distribution directives, such as FORTRAN D [6, 19]. In parallel with the work on developing the SPMD model for scientific Fortran programs, there have been some recent proposals for appropriate data parallel models, called Dynamic SPMD (DSPMD) models, for the more general types of programs written in C-like languages, i.e., languages which allow dynamic data structures [17, 8]. Since the accesses of dynamic data structures are less regular than those of arrays, the distinction between a computation and a communication phase may have to be relaxed so that the execution can be performed more efficiently. Effectively overlapping computations and communications is non-trivial in a conventional von Neumann processor node, and thus some efficient architectural support for processing multiple threads of control appears
necessary. A thread of control is a sequence of instructions which may be executed in parallel with other threads of control. Two fundamental issues in multiprocessing using von Neumann style architectures are well known [3]: memory latency, which is unavoidable in parallel machines, and the cost of synchronization, which is high on a von Neumann machine that generally keeps a large processor state for currently executing tasks. Access to information held in the memories of remote nodes can lead to high and unpredictable latencies due to network delays, thus rendering all static instruction scheduling techniques ineffective. Facing such challenges of high-performance computation, multi-threaded processor node architectures represent a promising alternative to RISC architectures and their multiple-issue extensions such as VLIW, superscalar, and superpipelined architectures [15], where compile-time instruction scheduling techniques factor heavily in performance. Multithreaded processors rely less heavily on static instruction scheduling since there is some mechanism responsible for dynamically synchronizing threads of control while computations proceed. A survey on the evolution of multithreaded architectures can be found in [5]. In this paper, we examine some existing Dynamic SPMD models and investigate how to efficiently exploit temporal and physical locality in the traversal of and operations on dynamic data structures on multithreaded architectures. In particular, we propose an extension of a DSPMD model and present a new multithreaded architecture which can efficiently support the model. Based on our previous experience with argument-fetching dataflow architectures [4] such as the MDFA [7] and SAM [12, 13], the new architecture uses a new thread management mechanism as well as a cache organization to overlap long-latency remote memory accesses with other operations to keep the processor usefully busy. The detailed design of the cache memory itself and its coherence issues are beyond the scope of this paper and will appear in a companion publication. In section 2, we briefly review the extant dynamic SPMD models. Section 3 will discuss a new dynamic SPMD model which is an extension of an existing one. A multi-threaded architecture for supporting the new model is proposed in section 4. Finally, conclusions and future directions are drawn in section 5.
2 Dynamic SPMD Models

Many proposed SPMD compilation techniques have focused on scientific programs where arrays are the dominant data structures. The structure of such programs often lends itself to parallelization, and a static distribution of arrays under the SPMD model is natural. Moreover, many techniques for analyzing scientific programs with arrays have long been proposed and are readily available for SPMD code generation. But when dealing with dynamic data structures, we are faced with a different scenario. First, these data structures (such as lists, trees, etc.) can be partitioned and parallel tasks created to operate on the partitions; this methodology has been used in many divide-and-conquer type programs. Moreover, dynamic data structures are constructed and allocated at runtime in a manner which is quite different from the creation of static array structures. The problem has to do with making all processors cognizant of the global names of elements of the data structures. A recent paper by Gupta [8] describing a dynamic SPMD model suggests a mechanism for addressing the problem of global names so that an approach similar to the array methods can be used. That is, during the communication phase, every processor must know where the elements of the data structures reside so that it can traverse the structure. As we can see, the constant creation and deletion of data structure elements can cause inefficiencies in the form of large communication requirements. It would appear that applying runtime resolution to dynamic data structures to maintain the computation-communication phases of conventional SPMD models may not be the proper strategy. Perhaps the boundaries between phases have to be less distinct so that the execution can be more efficient. Recently, another interesting SPMD model that handles dynamic data structures has been proposed [17]. This approach is based on the notion of threads of computation propagating from processor to processor according to the physical location of the data. As in the SPMD approaches, the programmer is responsible for mapping data to processors; that is, in an allocation instruction, a processor id is specified to indicate where the data is to reside. However, rather than having the processor determine whether it owns some data before the start of a computation, it is more natural to send computations explicitly to the processors that own the data. To support the migration of computations, special load and store instructions are employed in which a trap handler can be invoked
when the memory address of the request is non-local. It is the responsibility of the trap handler to send the computation away. To introduce parallelism into the code, so that a processor which has sent its computation away has something else to process, the code is partitioned into threads at compile time by using the future mechanism [9] found in some parallel Lisp implementations. By having multiple threads of control (or computation), some of which are communicating while others are computing, the phases of communication and computation are now overlapped. However, two issues need to be addressed in their model. The first problem is that always migrating a thread to the processor which owns the data can lead to inefficiencies. In some instances, it would be more efficient to simply fetch the remote data, especially when only one remote datum is required by the computation. The second problem is the reliance on the programmer to perform load balancing (via the assignment of structure elements to processors). If the shape of the data structure (e.g., a balanced or unbalanced tree) depends on the input data, how can the programmer balance the load without knowing in advance what the input data are? Lastly, there is a problem with the proposed implementation of their model on von Neumann processors. Relying on conventional von Neumann architectures to provide thread switching and migration can lead to inefficiencies: a processor must halt execution and send a continuation, consisting of an entire activation frame on the runtime stack, a program counter, and the register contents, to another processor if a thread must be migrated. Furthermore, some expensive runtime stack management needs to be performed (i.e., portions of the stack must be copied to temporary storage) so that an enabled thread does not overwrite the stack frames of a waiting thread. In the next section, we present an extension to this model which addresses the first two issues. In section 4, we will propose the use of a multi-threaded architecture to support our DSPMD model so that the last issue, implementation, can be addressed.
3 A New Dynamic SPMD Model
To address the issues of the Rogers-Reppy-Hendren DSPMD model, we introduce a more flexible set of load/store instructions. In fact, we introduce three load instructions: load.r, load.l, and load.x. The store instruction is renamed store.x. In a `load.r rn, e_addr' instruction, the thread is automatically
migrated when the data at the memory address specified by e_addr is not in the local processor but resides in a remote processor. Thus load.r stands for "LOAD, but if not local, then send to Remote". If a copy of the data resides locally, say in a local cache memory, even though its effective address is non-local, then the load.r instruction performs the load and the thread remains local. The load.l instruction stands for "LOAD, if not local, then get data and execute Locally". Thus when the effective address (e_addr) specifies a remote memory location, the data is fetched from the remote node and the thread continues to execute locally. In the case where a copy of the remote data is stored locally, no remote fetch is required. For the load.x instruction, the thread may or may not be executed locally, depending on some strategy adopted by the execution hardware; thus load.x stands for "LOAD, don't care (X) where thread executes". For instance, the hardware could adopt the policy that if the data exists locally, the thread executes locally; if the data is remote, the data is fetched when the local processor is not too busy, and otherwise the thread is sent to the remote processor.¹ The `store.x e_addr, rn' instruction specifies that if e_addr points to a remote memory location, then the thread must be migrated; otherwise the thread stays resident. This implies that our model enforces the owner write rule, where the processor which owns the data is responsible for initializing or updating it.² Note that these four new instructions are for accessing physically distributed data structures. For data which are produced and consumed only locally, the regular load and store instructions can be used. By introducing special instructions for accessing distributed data structures, communication efficiency and some hardware-assisted load balancing can be achieved. For instance, the first problem we mentioned in the review of the Rogers-Reppy-Hendren model can be overcome by using the load.l instruction. Moreover, hardware-assisted load balancing can be performed by using the load.x instruction on data structures.
¹ The `busy factor' of a processor can be indicated by how many threads are currently being processed.
² The owner write rule is different from the owner compute rule in that another processor can compute the value, but when it comes time to store the value, the processor which owns the value must perform the write.
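As an illustration of the intended behaviour of these four instructions, the following C sketch shows one plausible dispatch policy a node might implement. It is only a sketch: the helper routines (is_local, in_local_cache, local_load, local_store, fetch_remote, migrate_thread, node_busy) are hypothetical stand-ins for hardware and runtime mechanisms, not part of the model defined above.

    typedef unsigned long gaddr;        /* global address: node id + offset within that node */
    enum mem_op { LOAD_R, LOAD_L, LOAD_X, STORE_X };

    /* Hypothetical hooks standing in for hardware/runtime mechanisms. */
    extern int  is_local (gaddr ea);            /* does this node own the address?      */
    extern int  in_local_cache (gaddr ea);      /* is a copy of the datum cached here?  */
    extern int  node_busy (void);               /* `busy factor' test of footnote 1     */
    extern long local_load (gaddr ea);
    extern void local_store (gaddr ea, long v);
    extern long fetch_remote (gaddr ea);        /* request the datum from its owner     */
    extern void migrate_thread (gaddr ea);      /* ship the current thread to the owner */

    /* Returns nonzero if the thread stays on this node after the access. */
    int access_distributed (enum mem_op op, gaddr ea, long *reg)
    {
        if (is_local (ea)) {                         /* owned here: an ordinary access  */
            if (op == STORE_X) local_store (ea, *reg);
            else               *reg = local_load (ea);
            return 1;
        }
        if (op != STORE_X && in_local_cache (ea)) {  /* cached copy satisfies any load  */
            *reg = local_load (ea);
            return 1;
        }
        switch (op) {
        case LOAD_R:                    /* load.r: send the thread to the data's owner  */
        case STORE_X:                   /* store.x: owner write rule forces migration   */
            migrate_thread (ea);
            return 0;
        case LOAD_L:                    /* load.l: fetch the datum, thread stays local  */
            *reg = fetch_remote (ea);
            return 1;
        case LOAD_X:                    /* load.x: fetch if not too busy, else migrate  */
            if (node_busy ()) { migrate_thread (ea); return 0; }
            *reg = fetch_remote (ea);
            return 1;
        }
        return 1;
    }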
3.1 Introducing Parallelism
In our DSPMD model, we will keep the future mechanism for introducing parallelism into the code. A future is embodied in a future cell created by a futurecall
function. The futurecall eventually returns a value in the cell. Other threads of control which require the value in a future cell perform a touch on the cell. If the cell is empty, the thread is suspended and resumed once the cell has been filled. An example of using futures in C code is shown below.

    int TreeAdd (tree *t)
    {
      if (t == NULL)
        return 0;
      else {
        tree *t_left, *t_right;
        future_cell f_left, f_right;

        /* next statement can trap */
        t_left = t->left;
        f_left = futurecall (TreeAdd, t_left);

        /* next statement can trap */
        t_right = t->right;
        f_right = futurecall (TreeAdd, t_right);
        return (touch(f_left) + touch(f_right) + t->val);
      }
    }
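As a rough illustration of the future-cell discipline used above, the sketch below models a future cell in C as a handle to a record with a presence flag. The suspend_on, resume_waiters, and fill routines are hypothetical; in the model presented here, touch is not a library call but is realized implicitly through backto instructions and synchronization slots, as described below.

    typedef struct future_cell_rec {
        int  full;      /* has the futurecall deposited its result yet? */
        long value;     /* valid only when full is nonzero              */
    } *future_cell;     /* a future cell is referred to through a handle */

    extern void suspend_on (future_cell f);      /* hypothetical: park the current thread  */
    extern void resume_waiters (future_cell f);  /* hypothetical: wake threads parked on f */

    /* touch: wait until the cell has been filled, then return its value. */
    long touch (future_cell f)
    {
        while (!f->full)
            suspend_on (f);     /* suspended, resumed once the cell is filled */
        return f->value;
    }

    /* fill: performed when the futurecall's callee returns its result. */
    void fill (future_cell f, long v)
    {
        f->value = v;
        f->full  = 1;
        resume_waiters (f);
    }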
The TreeAdd function traverses a binary tree and sums all the values in the nodes of the tree. A futurecall is invoked for each child of the node currently being processed. In this manner, multiple threads of control can be created which recursively descend the tree and finally return a sum. In our model, we introduce `spawn n' and `backto i, j' instructions to support futures. A futurecall to a function located at instruction label m can be translated to

       spawn n
       call m
       backto i, j
    n:               ;start of another thread

The spawn instruction need not be inserted if no other thread of control has to be started before the function call. Before we explain the spawn and backto operations, we briefly describe the DSPMD model at runtime. At runtime, each function/future call is represented by an activation frame. Since we allow threads of control from multiple function instances to be active, our DSPMD model maintains a tree of activation frames, as opposed to the conventional runtime stack of activation frames/records. Each activation frame is structured as in figure 1 and is identified by a frame pointer fp. Two slots in the frame are reserved for frame linkage: the caller PC and caller FP slots.
[Figure 1: Layout of an activation frame. A frame in the tree of frames contains the caller PC and caller FP slots, the local vars. & future slots, and the synchronization slots; each synchronization slot holds a sync cnt, a reset cnt, and a thread ptr.]
Another group of slots is for storing values local to the activation and values in future cells, called future slots. Lastly, there are synchronization slots, where each slot consists of three fields: one containing a synchronization count (sync cnt); another, a reset count; and the last, a thread pointer, i.e., the address of the first instruction of a thread. Both the synchronization and future slots are operated upon by backto instructions. In a `backto i, j' instruction, a value in a register, say by convention r0, is stored in the future slot pointed to by (fp + i), where fp is the frame pointer of the currently executing thread.³ Then the synchronization slot at (fp + j) is operated upon: the synchronization count is decremented and, if it reaches zero, the count is set to the reset count and a thread of control specified by the current frame pointer and the thread pointer in the slot is spawned, i.e., another thread of control is created. The reason for resetting the synchronization count is so that a thread can be re-activated; this is very useful if the thread is contained within a loop structure. Lastly, the current thread, the one executing the backto instruction, is terminated. In a `spawn n' instruction, a thread of control specified by the frame pointer of the currently executing thread and the instruction address n is spawned. The thread which executed the spawn continues executing.

³ Each thread must execute within the context of an activation frame.
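The effect of a backto instruction can be summarized by the following C sketch, under the simplifying assumptions that an activation frame is an array of word-sized slots and that a synchronization slot is kept as an unpacked struct (section 3.2 uses a packed 32-bit encoding); spawn_thread is a hypothetical stand-in for handing a (frame pointer, thread pointer) pair to the scheduler.

    typedef long word;

    typedef struct {
        int  sync_cnt;    /* remaining backto arrivals before the thread fires  */
        int  reset_cnt;   /* value sync_cnt is restored to when it reaches zero */
        word thread_ptr;  /* address of the first instruction of the thread     */
    } sync_slot;

    extern void spawn_thread (word *fp, word thread_ptr);   /* hypothetical scheduler hook */

    /* backto i, j: deposit r0 into future slot fp[i], then synchronize on slot fp[j]. */
    void backto (word *fp, int i, int j, word r0)
    {
        fp[i] = r0;                              /* store the produced value            */
        sync_slot *s = (sync_slot *) &fp[j];
        if (--s->sync_cnt == 0) {
            s->sync_cnt = s->reset_cnt;          /* allow the thread to be re-activated */
            spawn_thread (fp, s->thread_ptr);    /* enable the waiting thread           */
        }
        /* the thread executing the backto terminates here */
    }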
3.2 TreeAdd in Our DSPMD Model

Now we will use the TreeAdd function to illustrate our DSPMD model. Below is the pseudo-assembly code of TreeAdd:

     0: bneq r0, 3             ;goto 3 if t != null
     1: set r0, 0
     2: ret
     3: set r1, 0x4800000E
     4: store [fp+5], r1       ;initialize sync. slot
     5: store [fp+2], r0       ;save t in frame
     6: load.r r0, [r0+1]      ;r0 = t->left
     7: spawn 10
     8: call 0, 6              ;call TreeAdd on t_left
     9: backto 3, 5
    10: load r0, [fp+2]
    11: load.r r0, [r0+2]      ;r0 = t->right
    12: call 0, 6              ;call TreeAdd on t_right
    13: backto 4, 5
    14: load r0, [fp+3]
    15: load r1, [fp+4]
    16: add r0, r0, r1         ;r0 = r0 + r1
    17: load r1, [fp+2]
    18: load r1, [r1]          ;r1 = t->val
    19: add r0, r0, r1
    20: ret
[Figure 2: Layout of TreeAdd's activation frame. The frame has six slots, indexed from fp: slot 0, caller PC; slot 1, caller FP; slot 2, t; slot 3, future t->left; slot 4, future t->right; slot 5, the synchronization slot holding (2, 2, 14).]
TreeAdd can be partitioned into three threads: instructions 0 to 9 form a thread, another from 10 to 13, and the last one from 14 to 20. The hexadecimal value in line 3 corresponds to a synchronization count of 2, a reset count of 2, and a thread pointer indicating instruction 14; the slot is a 32-bit word where the most significant three bits form the synchronization count, the next three bits form the reset count, and the last twenty-six bits form the thread pointer. The parameters of a `call' instruction are the address of the first instruction of the function and the size of the activation frame, respectively. The layout of TreeAdd's activation frame, which requires six slots, is shown in figure 2. Note that the thread starting at instruction 14 corresponds to the statement in the C code containing the `touch' operations. In our DSPMD model, the touch operations are implicitly executed by our backto instructions. In the TreeAdd example, thread 14 does not start until both future values have been returned: both backto instructions (9 and 13) specify the same synchronization slot containing the thread pointer 14.
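As a worked check of this 3/3/26-bit packing, the small C program below decodes the constant 0x4800000E used at line 3 of the listing; the field widths are those given above, and the rest is illustrative.

    #include <stdio.h>
    #include <stdint.h>

    /* 32-bit synchronization slot: 3-bit sync count, 3-bit reset count, 26-bit thread pointer. */
    #define SYNC_CNT(w)   (((w) >> 29) & 0x7u)
    #define RESET_CNT(w)  (((w) >> 26) & 0x7u)
    #define THREAD_PTR(w) ((w) & 0x03FFFFFFu)

    int main (void)
    {
        uint32_t w = 0x4800000E;   /* the word stored into TreeAdd's synchronization slot */
        printf ("sync cnt = %u, reset cnt = %u, thread ptr = %u\n",
                (unsigned) SYNC_CNT (w), (unsigned) RESET_CNT (w), (unsigned) THREAD_PTR (w));
        /* prints: sync cnt = 2, reset cnt = 2, thread ptr = 14 */
        return 0;
    }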
4 Architectural Support for the Dynamic SPMD Model
A multi-threaded node architecture [1, 2, 10, 13, 14] is best characterized as an architecture which attempts to efficiently support the execution of multiple threads of control by providing multiple copies of the fast memory required to execute multiple active threads, along with mechanisms for storing suspended threads, selecting an active thread to process, and processing synchronization events. In a multi-threaded processor, the sending of a continuation can be done concurrently with the execution of another active thread, and thus the thread-switching overhead can be masked. Moreover, a tree of activation frames instead of a stack of frames is used to support concurrent activations of multiple functions. In this section, we introduce a simple multi-threaded architecture model to support efficient switching between waiting and ready threads. Then a new caching mechanism, called a global-cache, for handling requests for distributed data structure elements will be introduced.
4.1 A Multi-Threaded Architecture Model

Our model of a multiprocessor consists of multi-threaded node processors based on the Argument-Fetching Dataflow Model [4] and an interconnection network which links all of the nodes. A processor node contains a synchronization unit, which handles the synchronization between threads, and an execution unit, which processes the instructions within the threads.⁴ The Multi-Threaded node Architecture (MTA) thus consists of a Synchronization Unit (SU) and an Execution Unit (EU) (fig. 3), where buffers separate the units so that each unit can have a relatively independent throughput rate. Both the SU and EU access the local memory of the processor node; however, the aggregate of the local memories of all nodes represents a global memory address space.

⁴ In the Argument-Fetching Dataflow Model, the synchronization unit, called an Instruction Scheduling Unit, is responsible for scheduling and synchronizing dataflow actors, and the execution unit processes dataflow actors; an actor is a single instruction.
[Figure 3: The Multi-Threaded Node Architecture. A processor node contains an EU and an SU sharing the node's local memory, with connections to and from the interconnection network.]
(Therefore, a memory address consists of a node id and the actual address within the local memory of that node.) The SU is responsible for processing synchronization signals emitted by the EU and communication messages from the interconnection network. It is the responsibility of the interconnection network to deliver messages to their destination processor nodes.
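Since the local memories together form a single global address space, a memory address can be viewed as a (node id, local address) pair. The sketch below shows one plausible encoding and the locality test that the load.r/load.l/load.x/store.x instructions of section 3 rely on; the 8-bit node-id field is an assumption made for illustration, not something fixed by the architecture.

    #include <stdint.h>

    #define NODE_BITS 8                                      /* assumed width of the node-id field */
    #define NODE(a)    ((unsigned) ((a) >> (32 - NODE_BITS)))
    #define OFFSET(a)  ((a) & ((UINT32_C(1) << (32 - NODE_BITS)) - 1))

    extern unsigned this_node;          /* id of the processor node performing the access */

    /* A global address is local when its node-id field names this node. */
    int is_local (uint32_t addr)
    {
        return NODE (addr) == this_node;
    }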
4.2 The Synchronization Unit

The SU is responsible for issuing fire signals to the EU. A fire signal consists of a frame pointer fp and the address ip of the first instruction of the thread to be executed. The fire signal is actually put into a ready thread pool by the SU (the ready thread pool is a common area in local memory shared by the SU and EU), and it is the responsibility of the EU to fetch a fire signal from the pool. The SU determines which threads are ready to be executed by processing incoming synchronization signals issued by the EU and communication messages from the interconnection network. The synchronization signals processed by the SU are the spawn and sync signals, corresponding to the spawn instruction and to part of the backto instruction, respectively. (Spawn and backto were introduced in section 3. We will see later how the sync signal forms part of the backto instruction when we discuss the EU.) Incidentally, the spawn and sync signals can only be issued by the local EU. When the SU encounters a spawn signal, it immediately puts the fire signal into the ready thread pool. As for the sync signal, (fp + ss_off) points to the synchronization slot to be operated upon. During the processing of the sync signal, the synchronization count is decremented; if it reaches zero, the count is set to the reset count and a fire signal consisting of fp and the thread pointer in the synchronization slot is put into the ready thread pool.
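Putting the above together, the SU's handling of the two local synchronization signals can be pictured as the C sketch below, which turns spawn and sync signals into fire signals in the ready thread pool. The queueing routine and the unpacked synchronization-slot layout (the same simplification used in the backto sketch of section 3.1) are illustrative assumptions.

    typedef long word;

    typedef struct { word *fp; word ip; } fire_signal;   /* frame pointer + first instruction */

    typedef struct {
        int  sync_cnt, reset_cnt;
        word thread_ptr;
    } sync_slot;

    extern void enqueue_ready (fire_signal f);   /* hypothetical: append to the ready thread pool */

    /* spawn signal: the named thread becomes ready immediately. */
    void su_spawn (word *fp, word ip)
    {
        fire_signal f = { fp, ip };
        enqueue_ready (f);
    }

    /* sync signal: operate on the synchronization slot at fp + ss_off. */
    void su_sync (word *fp, int ss_off)
    {
        sync_slot *s = (sync_slot *) (fp + ss_off);
        if (--s->sync_cnt == 0) {
            s->sync_cnt = s->reset_cnt;              /* re-arm the slot         */
            fire_signal f = { fp, s->thread_ptr };   /* fire the waiting thread */
            enqueue_ready (f);
        }
    }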
The SU is also responsible for processing communication messages from the interconnection network. There are three types of communication messages: data retrieval, synchronization, and thread migration. There are two data retrieval messages: get and data. The get message is sent from the EU of one processor to the processor owning the requested data; it is of the form ⟨...⟩. The synchronization message corresponds to the backto instruction in the DSPMD model when the future call has been migrated to another processor node. We will see later how the synchronization message can fill its fields with the appropriate values when we discuss how functions are invoked in section 4.4. When the SU processes a synchronization message, the value carried by the message is stored at memory address (fp + fut_off) and the synchronization slot at (fp + ss_off) is operated upon just as if a sync signal had been processed; that is, the synchronization count is decremented, checked for zero, etc. The astute reader will note that the sync signal described above corresponds to part of a backto instruction when the future call is processed locally. A thread migration message is of the form