The Gardens Prototype: An Extensible Architecture Supporting Adaptive Parallel Computation

Paul Roe and Clemens Szyperski
School of Computing Science, Queensland University of Technology, Brisbane, Australia
{p.roe, [email protected]}
Abstract
The Gardens system is designed to create a virtual high performance computer from a network of idle workstations. The network presents a dynamic set of idle workstations to the Gardens system; therefore Gardens must efficiently support adaptive parallel computation. Furthermore, to support experimentation and different paradigms of parallel computation, which require customised policies, the Gardens system is extensible. That is, it has a component-oriented architecture supporting plug-in system components, such as schedulers and load balancers.
Keywords High-Performance Computing, Networks of Workstations, Adaptive Parallelism, Extensible Architectures
1 Introduction

Many engineers and scientists require high performance (parallel) computing facilities. A highly economical alternative to dedicated and centralised facilities is to create a virtual high performance computer from idle networked workstations. Utilising currently unused workstations for parallel computing has the advantage of using existing infrastructure (investment) which is already installed, maintained, and paid for! Tremendous computing resources are installed in modern networks of high performance workstations, and workstations are, on average, largely unused (in any one minute during peak time, 60% of all machines are on average idle [2]). This paper describes the Gardens system architecture, which creates a virtual high performance computer from a network of idle workstations.
One of the main architectural differences between modern massively parallel systems and networks of workstations is the interconnecting network. Only recently have networks become available that deliver the high performance (high bandwidth and low latency) required for efficient parallel computing across such networks. (Asynchronous Transfer Mode (ATM) delivers in terms of bandwidth, but current and foreseeable implementations have truly disappointing latency characteristics; other technologies, such as Myrinet and ParaStation, do not share this disadvantage and are commercially available.) In fact, in terms of hardware technology such configurations are very close to supercomputers such as the IBM SP2. The Gardens testbed comprises a network of Sparc workstations and a Myrinet [16] high performance network (a switched network with 1.2 Gbit/s bandwidth and about 10 µs latency per link). The utilisation of a dynamic set of workstations requires adaptive parallelism [4]. Adaptive parallelism entails the automatic and transparent adaptation of parallel programs to available workstations; this contrasts with conventional parallel computing, which assumes a fixed set of processing elements.
2 Adaptive Parallelism

Adaptive parallelism must make highly efficient use of available resources in order to be successful. Conventionally a program is allocated a static set of processors and its computation is statically partitioned across these. Since we are considering a processor set which will change dynamically, we need an adaptive solution. Such a solution must adapt to the processor set (set of workstations) contracting or expanding; in the extreme there may be no processors on which to run a program. When a processor set expands or contracts, computation needs to be redistributed accordingly. That is, computation must be dynamically redistributed to take account of changes in workstations' status. Efficient use of workstations entails coexistence with regular workstation users. Adaptive parallel computation must be transparent to users in order for it to be accepted: users will not accept a noticeable delay for their workstation to be returned from adaptive use to their own use. Thus resource discovery and release is a key problem for effective adaptive computation. Two important issues in traditional parallel computing are load balancing and locality. In an adaptive parallel setting these are especially important since computation is frequently being redistributed. Hence, in the face of computation redistribution, locality must be preserved and load balance maintained, otherwise the benefits of parallel computing will be lost. Despite utilising a high performance network, our architecture is still unbalanced in favour of computation. To rebalance the system, latency masking is desirable in order to fully utilise processors; the Charm project has demonstrated this to good advantage [17].
3 Gardens Design Rationale

Based on our target of supporting adaptive parallel computing on high performance networks of workstations, we isolated the following main design goals:
Performance: We address performance at the levels of programming model, programming language, compiler, libraries, and run-time to achieve high performance.

Extensible Architecture: By employing an extensible architecture we avoid hard-wiring critical mechanisms or policies, so that a number of classes of parallel algorithms, as well as experiments with various mechanisms and policies, can be supported.

Coexistence with Workstation Use: The Gardens parallel computing environment must peacefully coexist with regular workstation use.

Common Programming Model: Gardens provides a common programming model which efficiently and safely supports different paradigms of parallel computation.

Platform Independence: Within reasonable bounds, our architecture should be portable across a number of platforms. Support of heterogeneous environments is a future goal.
In the following subsections we address each of these goals.
3.1 Performance
Performance is the primary motivation for parallel computing. It is important to improve over using a single workstation and to do so with reasonable efficiency given a certain availability of workstations. We aim to address performance at all levels, that is: programming model, programming language, compiler, run-time system, and library support. This paper addresses the general system architecture. Local overheads, ie, the time that processors spend on "administrative" tasks such as preparing for communication or switching contexts, grow with the number of processors and thus directly affect the maximum efficiency. It is therefore very important to minimise local overheads. Direct access to communication hardware is one of the important ways to reduce overheads; avoiding kernel mechanisms and preemption for context switching is another.
3.2 Extensible Architecture
Our architecture is extensible. There are two reasons for rejecting a traditional monolithic design. Firstly, we want to support a number of different parallel programming paradigms, in particular task and data parallelism. However, beyond the common framework the particular support required for good performance can be quite specific to a particular paradigm. Secondly, we want to design a flexible infrastructure for experiments with adaptive parallel programming. A future goal is the support of interactive applications requiring ad-hoc data-driven extensibility, eg, a parallel programming workbench. An extensible architecture is one where mechanisms and policies are characterised abstractly and where components implementing mechanisms and policies can be plugged in and replaced as required.
3.3 Coexistence with Workstation Use
Utilising currently unused workstations for parallel computing has the advantage of using existing infrastructure (investment) which is already installed and maintained. To be successful, the adaptive parallel computing environment must peacefully coexist with regular workstation use. There are a number of quite separate issues that need to be considered.
- Acquisition of a workstation must not hinder the workstation's direct user.
- Acquisition of a workstation must not have any negative effect on the workstation's integrity.
- Workstations must be released quickly when required by the workstation's direct user.
- The parallel execution environment must coexist with the workstation's native operating system.
3.3.1 Workstation Acquisition
There is an interesting conflict between the availability of resources on workstations, the needs of interactive users, and the needs of parallel programs. Workstations are barely utilised on average, but interactive users require significant processing resources for short bursts of time. Even low priority but resource-intensive background tasks can interfere with interactive use and are usually not welcomed for this reason (for example, being CPU intensive, ie, fully consuming time slices, or using large areas of memory, causing swapping of user pages, leads to unwanted degradation of responsiveness; frequently using the local disk leads to unwanted audible distraction). At the same time, parallel programs of sufficiently fine granularity do not tolerate even brief unavailability of participating processors. We distinguish two categories of parallel load: (1) tasks that can be quickly sent to another processor and that are guaranteed to be either easy to retreat or to terminate within a short time; (2) tasks that take unbounded time to execute and might build up significant state that would be expensive to regenerate. Tasks of the former class could be spontaneously sent to processors that are currently lightly loaded. Tasks of the latter class require an estimation that
makes it highly likely that a selected workstation will remain unused long enough to amortise the cost of shipping and possibly retreating tasks. Currently, we only consider tasks of the more general second category, with a special approach to measuring user presence that trades resource utilisation for likely efficiency. The first category leaves room for future improvements. An interactive user might be using a workstation very lightly for an extended period of time, for example while browsing documents displayed on the screen. However, the same user might want the full workstation performance available at any time, for example to perform a complex document editing operation. Current systems, such as Condor [8] or Customs [18], detect "idle" workstations using user-set profiles and OS-provided process tables. If a parallel environment acquires a workstation based on such statistical evidence, it must be able to release the workstation again very quickly when required by the user. Simply suspending execution is not good enough for parallel computing. To avoid these problems, it is necessary to "ask" users for permission. For this purpose we use a mechanism that already detects inactive users, that interacts with users in an unobtrusive way, and that is accepted by most users: a screen saver. We modified a standard screen saver to, after an adjustable period of continuous "screen saving", stop all activities and inform the parallel computing environment that the host workstation is currently not directly used. This approach works well. The timeout before the screen saver releases the machine sets a time window in which the user can disable the screen saver again; users are already in the habit of doing this, eg, by moving the mouse. With a timeout of a few seconds it becomes likely that the machine will be available for an extended period of time (we are currently conducting measurements to statistically substantiate this claim). An extension could consider currently active processes and a user profile, to further restrict the release of a machine.
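The acquisition protocol just described can be pictured as a small state machine driven by screen-saver events. The following is a minimal sketch in Python (for illustration only; the actual Gardens components are written in Oberon-2, and the names ScreenSaverMonitor, offer_workstation, request_release, and release_timeout_s are invented for this example).

```python
import time

class ScreenSaverMonitor:
    """Illustrative sketch of the screen-saver-driven acquisition protocol.

    After the screen saver has run continuously for release_timeout_s
    seconds, the workstation is offered to Gardens; any user activity
    triggers an immediate release request.
    """

    def __init__(self, gcp, release_timeout_s=10.0):
        self.gcp = gcp                      # local Gardens Control Process (hypothetical handle)
        self.release_timeout_s = release_timeout_s
        self.saving_since = None            # when continuous screen saving began
        self.acquired = False               # is the workstation currently lent to Gardens?

    def on_screen_saver_started(self):
        self.saving_since = time.time()

    def on_user_activity(self):
        # The user is back: give the machine back as quickly as possible.
        self.saving_since = None
        if self.acquired:
            self.gcp.request_release()      # GCP asks the local GAP to unload its tasks
            self.acquired = False

    def poll(self):
        # Called periodically; offers the machine once the timeout has elapsed.
        if (not self.acquired and self.saving_since is not None
                and time.time() - self.saving_since >= self.release_timeout_s):
            self.gcp.offer_workstation()    # GCP may now create a GAP here
            self.acquired = True
```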
3.3.2 Tasking System
We execute all parallel application code within a single process of the host operating system on each acquired workstation. This process runs at a high priority to effectively block all other non-critical processes; we therefore avoid reliance on the local OS scheduler. Within the one application process we provide our own tasking system. The system supports a single parallel application per process and can thus rely on cooperative multitasking, greatly reducing the context switching overhead. We provide direct access to the communication hardware of the high performance network. This is an obvious safety and security problem. Therefore we rely on a safe programming language, a safe set of trusted low-level libraries, and a trusted compiler to establish safety at no run-time cost. Hence, the programming language itself must be safe, ie, respect memory invariants, and the language must support the construction of safe libraries, ie, libraries that can strictly enforce invariants of their own.
3.4 Common Programming Model
Our programming model should support different paradigms of parallel computing. The model also needs to be sufficiently close to hardware to achieve predictably high performance; that is, we require a simple performance model for the programmer to build on. Of the two common models, data parallelism and task parallelism, tasking models are the more general. In a tasking model, tasks form the units of work and therefore the units of redistribution to adapt to changing processor sets. For a simple and uniform programming model with flexible allocation of tasks to processors, tasks must not depend on their current location. In particular, task functionality must not depend on whether tasks are allocated on the same or on different machines. To maintain location transparency, tasks communicate using messages, even when located on the same machine. It is left to the compiler and run-time system to make good use of the physically shared memory, such that "local" messages are implemented with minimum overhead (we are investigating type system support to almost completely eliminate overheads in this case [19, 15]). Consider having exactly one application task per processor. In the presence of a dynamic processor set, a parallel program would have to explicitly handle dynamic remapping of data and control flow. A model with multiple tasks per processor can adapt by redistributing tasks. To mask latencies, the one-task-per-processor approach must use asynchronous communication. A model with multiple tasks per processor could also support synchronous communication and mask latencies by switching contexts. For these two reasons, we support multiple tasks per processor (note that this covers the single-task-per-processor configuration as a special case). In order to adapt, balance load, and mask latencies, we need a certain number of tasks per processor. At the same time, to minimise context switching overheads, this number needs to be minimised. The optimal working point depends on the particular parallel application and potentially on the current phase of the computation. We thus need to adaptively control the degree of parallelism. Another major issue is the control of locality. We do not expect generic solutions to the problems of controlling parallelism and locality. Instead, we expect the program itself to provide useful guidance. Since the resulting burden on the programmer can be significant, we intend to provide library abstractions for common cases. We chose Oberon-2 [10] as the implementation language as it is efficient, type safe, modular, and supports extensibility, all of which are required for our architecture.
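To make the tasking model concrete, the sketch below shows the style of programming model described above in Python (this is not the Gardens API; the names Task, Runtime, spawn, send, and check are invented, and the real system is implemented in Oberon-2). The key points it illustrates are location-transparent message passing and a combined receive/context-switch point.

```python
# Illustrative sketch of the tasking model described above (not the Gardens API).
# Tasks never assume co-location: all communication goes through messages,
# and the runtime is free to place or migrate tasks as the processor set changes.

class Task:
    def __init__(self, name):
        self.name = name
        self.inbox = []

    def handle(self, msg):
        """Message handler; overridden by application tasks."""
        raise NotImplementedError

class Runtime:
    """Stand-in for the run-time of one GAP: owns its local tasks."""
    def __init__(self):
        self.tasks = {}

    def spawn(self, task):
        # The real system may create the task locally, remotely, or delay it
        # (see Section 4.2); here we always create it locally.
        self.tasks[task.name] = task
        return task.name

    def send(self, dest, msg):
        # Location-transparent send: a local delivery here, but a network
        # message in the real system when dest lives on another GAP.
        self.tasks[dest].inbox.append(msg)

    def check(self):
        # Combined message reception and (cooperative) context switch point.
        for task in self.tasks.values():
            while task.inbox:
                task.handle(task.inbox.pop(0))
```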
3.5 Platform Independence
Performance is the prime criterion. However, a certain degree of platform independence is clearly useful. We use two approaches to isolate the parallel
execution environment from the host OS and the high performance network used. We achieve a degree of OS independence by relying only on Unix-style processes with minimal OS support. While we are currently targeting Solaris, other Unix-like OSs, such as Linux, and Windows/NT are in our sights. To achieve network independence and performance we use Active Messages (AM) [20]. For future versions we shall consider using Active Messages 2, since AM does not support out-of-band communication; without out-of-band support, we have to use an Ethernet for asynchronous system control. AM is available for many platforms, including some supercomputers such as the Thinking Machines CM5. A version for the Myrinet is currently under development; we are using a beta version. A version running on top of TCP/IP has also been released; the latter is useful for bootstrapping and prototyping and, in addition, may support latency-tolerant applications. While our design aims at supporting a number of different platforms, we initially restrict each instance of our system to run on a "homogeneous island". In the future, we plan to support heterogeneous networks as well. However, the problems that need to be solved to do so are certainly difficult.
4 Architecture

Based on our design introduced in Section 3, we now briefly describe the Gardens architecture. In Section 5 below we then present two instantiations of the architecture: a prototype based on a Myrinet and a simulator.
Figure 1: The Gardens Project Structure (GAPs and GCPs on workstations connected by a dedicated high performance network and an Ethernet).

Figure 1 illustrates the basic Gardens process structure. The architecture uses two kinds of processes: Gardens Application Processes (GAPs) and Gardens Control Processes (GCPs). Each machine that is known to Gardens runs a GCP. Once a machine has been acquired, a GAP is created and loaded with the application code. Upon release of a workstation, a GAP unloads its tasks to other contributing workstations and eventually terminates. A GCP is the local Gardens "daemon" representing the workstation within a Gardens system, and an agent for out-of-band communication with the local GAP. For example, the modified screen saver (see Section 3.3.1) communicates with the GCP.
The GCPs communicate with each other to maintain a consistent picture of the global state: for example, the current status of participating workstations. A simple approach might use a central "home" GCP, while a more robust and scalable solution would be to use a distributed agreement protocol. Each GCP that has a local GAP running informs its GAP of changes to this state. A parallel application is spread across a number of GAPs, one on each contributing machine. The tasking and communication system, all other mechanism and policy modules, and the application tasks themselves execute in the GAPs. On acquisition of new machines, the local GCP creates a new GAP and loads all required code. There is at most one GAP per processor, ie, at most one parallel application per processor. We can therefore avoid the complex problem of application co-scheduling [2].
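The simpler of the two approaches, a central "home" GCP, might be organised roughly as follows. This is a speculative Python sketch only; HomeGCP, report, update_global_state, and on_processor_set_change are invented names, not part of the Gardens implementation.

```python
# Sketch of global-state maintenance with a central "home" GCP (one of the
# two approaches mentioned above); all class and method names are invented.

class HomeGCP:
    def __init__(self):
        self.status = {}        # workstation name -> "idle" | "in_use" | "acquired"
        self.gcps = []          # GCPs registered with the home GCP

    def register(self, gcp):
        self.gcps.append(gcp)

    def report(self, workstation, new_status):
        # A GCP reports a local change (eg its screen saver released the machine).
        self.status[workstation] = new_status
        for gcp in self.gcps:
            gcp.update_global_state(dict(self.status))

class GCP:
    def __init__(self, name, home):
        self.name = name
        self.gap = None         # local GAP, if a parallel application is running here
        home.register(self)

    def update_global_state(self, status):
        # Forward changes to the local GAP so it can adapt (eg migrate tasks).
        if self.gap is not None:
            self.gap.on_processor_set_change(status)
```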
4.1 Layered Architecture of GAPs
Figure 2: Layered Extensible Architecture of GAPs (from bottom to top: platform abstractions holding local task state, global state, and the task map; mechanisms; policies; and application tasks).

A GAP has a four-layer architecture, see Figure 2. In the following we first explain the layers from the bottom up and then provide some examples of components that would plug into the two middle layers. The lowest layer abstracts from the platform. It provides the basic tasking and communication infrastructure, including the state of local tasks. It has fixed mechanisms to create, activate, and delete tasks, to yield control, and to send "active" messages to named tasks. (With active messages the sender specifies which message handler of the receiving task will be called on message arrival. Hence, there is no primitive to explicitly receive messages; instead, messages may be received by handler up-call whenever a message is sent or control is yielded.) For this purpose the platform abstraction layer maintains a partial mapping of tasks to processors which allows for direct addressing of all locally known tasks. Finally, the layer holds the global state as seen by the local GAP and provides an interface to the GCP services.
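The fixed mechanisms of the lowest layer, including the handler up-call style of active-message delivery, can be sketched as follows. This is an illustrative single-process Python sketch under our own naming (PlatformLayer, send_active_message, yield_control, etc.); it is not the Gardens interface, and remote delivery over the network is omitted.

```python
# Sketch of the platform abstraction layer's fixed mechanisms (names invented).
# There is no explicit receive: handlers are up-called when messages arrive,
# which in this single-process sketch happens on send and on yield_control.

class PlatformLayer:
    def __init__(self, local_processor):
        self.local_processor = local_processor
        self.task_map = {}          # partial mapping: task id -> processor
        self.local_tasks = {}       # task id -> task object (local task state)
        self.pending = []           # queued active messages (task id, handler, payload)

    def create_task(self, task_id, task):
        self.local_tasks[task_id] = task
        self.task_map[task_id] = self.local_processor

    def activate_task(self, task_id):
        # Mark a previously created task as runnable (scheduling detail omitted).
        self.local_tasks[task_id].runnable = True

    def delete_task(self, task_id):
        del self.local_tasks[task_id]
        del self.task_map[task_id]

    def send_active_message(self, task_id, handler_name, payload):
        # The sender names the handler to run on arrival at the receiving task.
        self.pending.append((task_id, handler_name, payload))
        self._deliver()             # a real network layer would do this asynchronously

    def yield_control(self):
        # Cooperative scheduling point: deliver messages via handler up-calls.
        self._deliver()

    def _deliver(self):
        while self.pending:
            task_id, handler_name, payload = self.pending.pop(0)
            if task_id in self.local_tasks:             # remote delivery omitted here
                getattr(self.local_tasks[task_id], handler_name)(payload)
```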
The two middle layers are populated by a number of mechanisms and policies. The interface between mechanisms and policies is mechanism-specific; where a separation of a specific policy from its mechanisms is not required, this interface need not exist. Note that policies are also called via up-calls from the lowest layer. This is used to respond to external requests, possibly relayed via the local GCP. The top layer contains all application tasks. It has a standard interface to the policy layer supporting task creation, message sending, and checking (a combination of potential message reception and context switching). In addition, the application interface may be extended by the policy modules used, as will become clear in the examples below.
4.2 Policy and Mechanism Components
Policies are required to decide what to do in order to serve requests, whether issued by the application or by the system. Since the various policies tend to interact, all of them may usefully be implemented in a single policy module. Important areas that need to be addressed by policies are:
- Task allocation to processors
- Task scheduling per processor
- Load balancing
- Control of parallelism
- Control of locality
- Communication between tasks
A policy selects a mechanism that determines how to implement a particular service. A mechanism can be used by several different policies; it is also quite common for a single policy to use a number of mechanisms. Mechanisms build on the services of the platform abstraction layer. For example, consider the task allocation service. An application requests the creation of a new task. The responsible policy component can then decide to (1) delay actual task creation, (2) create a new local task, or (3) create a new remote task. For a policy component that considers all three options, mechanisms are required to (1) create "embryonic" tasks, (2) create local tasks, and (3) create remote tasks. Another important example is load balancing. The underlying mechanism implements task migration between processors. Different migration mechanisms might be used for load balancing to adapt to changing processor sets and for load balancing on a processor set that remains unchanged.
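A task-allocation policy that considers all three options could look roughly like the following Python sketch. It is purely illustrative: the mechanism calls (create_embryonic, create_local, create_remote), the policy-state fields, and the load thresholds are assumptions, not Gardens interfaces.

```python
# Sketch of a task-allocation policy choosing between the three options
# described above; all names and thresholds are invented for illustration.

def allocate_task(policy_state, mechanisms, task_spec):
    if policy_state.local_load < policy_state.low_watermark:
        # Enough spare capacity here: create the task locally.
        return mechanisms.create_local(task_spec)

    remote = policy_state.least_loaded_remote_processor()
    if remote is not None:
        # Another GAP is clearly less loaded: create the task there.
        return mechanisms.create_remote(remote, task_spec)

    # Everybody is busy: delay actual creation by recording an "embryonic" task
    # that can be hatched later, locally or remotely, when load drops.
    return mechanisms.create_embryonic(task_spec)
```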
4.3 Extensibility
The platform abstraction layer introduces objects to represent local state (task descriptors), global state (processor set), and a partial mapping from tasks to processors, see Figure 2. Each of the mechanisms may need to attach further information to each of these entities. Likewise, policy components may need to attach further information. We allow for the extension of lower-layer entities in higher layers. Components used in higher layers change from configuration to configuration; they can also change over time if the needs of an application change between computational phases. To achieve all of this we use Oberon's subtyping, which is type-safe (unlike C++).
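The style of layer-wise extension can be illustrated as below, with Python subclassing standing in for Oberon-2's type-safe record extension; the descriptor classes and their fields are invented for this example and do not correspond to the actual Gardens types.

```python
# The idea of layer-wise extension, shown with Python subclassing as a
# stand-in for Oberon-2 type extension; all names are invented.

class TaskDescriptor:
    """Lowest layer: local task state known to the platform abstraction."""
    def __init__(self, task_id, processor):
        self.task_id = task_id
        self.processor = processor

class MigratableTaskDescriptor(TaskDescriptor):
    """A migration mechanism attaches the extra state it needs per task."""
    def __init__(self, task_id, processor, stack_segment, heap_segment):
        super().__init__(task_id, processor)
        self.stack_segment = stack_segment
        self.heap_segment = heap_segment

class LoadBalancedTaskDescriptor(MigratableTaskDescriptor):
    """A load-balancing policy extends the descriptor again with its own data."""
    def __init__(self, task_id, processor, stack_segment, heap_segment):
        super().__init__(task_id, processor, stack_segment, heap_segment)
        self.recent_cpu_time = 0.0   # used to decide which tasks to move
```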
5 Prototypes

In this section we describe two prototypes that are both instantiations of our architecture. In particular, we restricted the prototypes as follows:

- Homogeneous platform
- No fault tolerance

Both prototypes are currently being completed. Performance evaluations are not yet available, but will be included in the final version of this paper.
5.1 Myrinet Prototype
The Myrinet prototype executes on the Gardens testbed, a network of Sparc workstations connected by a Myrinet. To simplify implementation, we use a master GCP to coordinate the Gardens system. We currently use task migration only to adapt to changing processor sets. To implement task migration, a global virtual address space for stacks and heaps is dynamically partitioned across processors. By using an SPMD model, where code is available on all processors, we can migrate tasks by communicating task state, stack, and heap. Data and function pointers remain valid across machines (provided the executable is statically linked). Code is actually made available to all machines via NFS over the Ethernet. Tasks are migrated by synchronising all GCPs, effectively freezing all GAPs, and then copying all tasks from workstations to be released to others. Newly acquired workstations are only populated by new tasks as they are created by the running application. We expect to use refined task migration algorithms and better load balancing in future versions of the prototype.
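The stop-and-copy migration step can be summarised by the following Python sketch. It only illustrates the coordination described above; the functions and attributes (freeze, thaw, receive_task, load, and the naive least-loaded placement) are assumptions for the example, not the prototype's actual algorithm.

```python
# Sketch of the stop-and-copy migration step described above: all GAPs are
# frozen, tasks on released workstations are copied elsewhere, then everyone
# resumes. Function and attribute names are invented for illustration.

def migrate_on_release(gaps, released, remaining):
    # 1. Synchronise: freeze every GAP so no task runs or communicates.
    for gap in gaps:
        gap.freeze()

    # 2. Copy all tasks off the workstations being released. Because stacks
    #    and heaps live in a globally partitioned virtual address space and
    #    the code is the same everywhere (SPMD), copying the task state,
    #    stack, and heap is enough; pointers stay valid on the target machine.
    for gap in released:
        for task in gap.tasks():
            target = min(remaining, key=lambda g: g.load())   # naive placement
            target.receive_task(task.state(), task.stack(), task.heap())
        gap.terminate()

    # 3. Resume the surviving GAPs.
    for gap in remaining:
        gap.thaw()
```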
5.2 Simulator
In our architecture the lowest layer fully abstracts the platform and also the location of the GCPs. In a sense, this layer can be seen as a "backplane" that can be replaced. This is done in our second Gardens prototype, a simulator running on a single workstation. The simulator produces trace files for all activities and performs measurements; for example, the average and worst degrees of parallelism and locality are measured. The simulator's lowest layer actually supports several logical GAPs executing in a single process. Mechanisms such as task migration are thus trivialised. Future versions of the simulator may instead use a separate process for each simulated GAP. The simulator is used to develop mechanism and policy modules; these modules are "plug compatible" with the Myrinet prototype. The main advantage of using the simulator is the reproducibility of results and independence of the Gardens testbed hardware.
6 Related Work

There are many projects investigating parallel computing across networks of workstations, including NOW (Network Of Workstations) at UC Berkeley [2] and WOW (World of Workstations) at the Humboldt University of Berlin [12]. NOW is the most ambitious, trying to simultaneously improve sequential and parallel performance while sticking with off-the-shelf hardware and system software. Our approach is more modest in not aiming at improvements for normal sequential workload. WOW aims at the support of responsiveness, ie, the combination of hard real-time and fault tolerance [9], both of which are not main foci of Gardens. Current WOW efforts are built on top of the Mach OS. However, only a few projects have explicitly addressed the architecture necessary to support adaptive parallelism. Essentially two approaches have been taken. The first is data distribution, where data is the unit of computation which is distributed and redistributed. Some projects which have taken this approach are: Data Parallel C [11], CHAOS [6], Piranha [4], Application Data Movement [13], DOME [3], and Adaptive Parallel Abstractions [14]. Essentially these systems only support data parallelism. The second, more general, approach is task distribution, where tasks are distributed and redistributed. Gardens has taken this approach, as have UPVM and MPVM [5, 7]. UPVM and MPVM support PVM-style message passing and task migration. UPVM supports a fixed set of lightweight user-level tasks which can be migrated (SPMD-style parallelism). MPVM relies on full Unix processes, which, although rather heavyweight, can be migrated. More distantly related work includes Condor [8] and Nimrod [1]; effectively, both of these exploit extremely coarse grain parallelism with little or no inter-process communication, and while they are adaptive they do not support parallel programming.
Acknowledgements

This work was partially supported by an ARC grant (ARCSG 55, 7056).
References

[1] D Abramson et al. Nimrod: A tool for performing parametised simulations using distributed workstations. In Proc, 4th IEEE Symp on High Performance Distributed Computing, August 1995.
[2] T E Anderson, D E Culler and D A Patterson et al. A case for NOW (Networks of Workstations). IEEE Micro, February 1995.
[3] J Arabe, A Beguelin, B Lowekamp, E Seligman, M Starkey and P Stephan. Dome: Parallel programming in a heterogeneous multi-user environment. In Proceedings of the International Parallel Processing Symposium, 1996.
[4] N Carriero, E Freeman and D Gelernter. Adaptive parallelism and Piranha. IEEE Micro, pages 40-49, January 1995.
[5] J Casas et al. MPVM: A migration transparent version of PVM. Technical Report CS/E 95-002, Dept of Computer Science, Oregon Graduate Institute, 1995.
[6] G Edjlali, G Agrawal, A Sussman, J Humphries and J Saltz. Compiler and runtime support for programming in adaptive parallel environments. Technical Report CS-TR-3510, University of Maryland, 1995.
[7] R Konuru, J Casas, S Otto, R Prouty and J Walpole. A user-level process package for PVM. In Proc, Scalable High Performance Computing Conf, pages 48-55, May 1994.
[8] M J Litzkow, M Livny and M W Mutka. Condor: A hunter of idle workstations. Technical Report CSTR 730, Computer Science Dept, University of Wisconsin-Madison, 1987.
[9] M Malek. A consensus-based framework for responsive computer system design. In Proc, NATO Advanced Study Institute on Real-Time Systems. Springer-Verlag, October 1992.
[10] H Mössenböck. Object-Oriented Programming in Oberon-2. Springer-Verlag, 1993.
[11] N Nedeljkovic and M J Quinn. Data-parallel programming on a network of heterogeneous workstations. Concurrency: Practice and Experience, Volume 5, Number 4, June 1993.
[12] A Polze and M Malek. Parallel computing in a world of workstations. In Proc, 7th IASTED/ISMM Intl Conf on Parallel and Distributed Computing and Systems, October 1995.
[13] R Prouty, S Otto and J Walpole. Adaptive execution of data parallel computations on networks of workstations. Technical Report CSE-94-012, Department of Computer Science and Engineering, Oregon Graduate Institute of Science and Technology, March 1994.
[14] P Roe. Abstractions for adaptive data parallelism. In Proc, 3rd Australasian Conf on Parallel and Real-Time Systems (PART'96), 1996.
[15] P Roe. Constant parameters via generalised collections. Submitted for publication, 1996.
[16] C Seitz. Myrinet: a gigabit per second local-area network. IEEE Micro, February 1995.
[17] W W Shu and L V Kale. Chare kernel: a runtime support system for parallel computations. Journal of Parallel and Distributed Computing, Volume 11, pages 198-211, 1990.
[18] A Stolcke. Pmake (includes Customs). Source code available from ftp.ICSI.Berkeley.EDU:/pub/ai/stolcke/software, October 1989.
[19] C Szyperski and J Gough. The role of programming languages in the life-cycle of safe systems. In Proc, Intl Conf on Safety Through Quality, Kennedy Space Center, FL, USA, October 1995.
[20] T von Eicken, D Culler, S C Goldstein and K E Schauser. Active messages: A mechanism for integrated communication and computation. In Proc, 19th Intl Symp on Computer Architecture, May 1992.