ARCADE: An Architectural Basis for Distributed Computing
A. Banerji, M.R. Casey, D. L. Cohn, P. M. Greenawalt, D. C. Kulkarni, J. E. Saldanha, J. M. Tracey Distributed Computing Research Laboratory Department of Computer Science and Engineering University of Notre Dame Notre Dame, IN 46556 Technical Report 92-3
March, 1992
ABSTRACT
Modern computer interconnections typically include a wide variety of systems. They may have different hardware architectures and operating system structures. Normally, they are underutilized, resulting in significant latent power. ARCADE is a new architectural basis for distributed computing systems intended to exploit this power. It defines machine-independent abstractions and services which can be used to build distributed applications on such interconnections. They have been successfully implemented as both a stand-alone microkernel and as a set of add-in services for conventional operating systems. The abstractions are based on high-level language models, making them easily accessible and quite efficient. The services can transparently cross even heterogeneous machine boundaries. The architecture proposes a new approach to distributed shared memory and to global resource identification. The resulting structure has proven to be a powerful set of building blocks for a variety of system services and applications.
1. Introduction
An operating system hides the details of a computer’s hardware and presents a platform on which applications are built. Distributed operating systems extend this functionality across machine boundaries, even between dissimilar machines. They should be designed to facilitate the development of distributed applications which exploit the latent compute power of an interconnection. Ideally, applications should not have to deal with machine boundaries or problems of heterogeneity. Indeed, operating systems should promote the construction of concurrent applications and present a natural means of cooperation. ARCADE, an ARChitecture for A Distributed Environment, has proven to be a basis for such systems.
The platform that is seen by the applications is what we call the architecture of the operating system. An architecture presents a set of abstractions, or generalized elements removed from their implementations. Application software uses these abstractions through the system-independent service interface.
What should the nature of these abstractions be? If they are based on familiar programming concepts, they will be easy to use. If they are powerful, they will be good building blocks. If they are implementation independent, it will be possible to support them on a variety of systems. In order to facilitate distributed applications, the abstractions must offer an accessible cooperation tool; exploiting the power of an interconnection requires support for concurrent execution entities.
ARCADE’s abstractions are based directly on high-level language constructs, making them familiar, powerful and portable. Such constructs are the basis of modern software and are natural programming tools. They reflect the basic elements of computation and serve as effective building blocks. They are defined at a high-level and have been implemented in several different ways. ARCADE’s principal cooperation tool is based on distributed shared memory and it uses a process-like execution entity that is easily migrated.
Usually, distributed functionality is either grafted onto a standard operating system or developed as a stand-alone system. The first approach, referred to as an add-in, typically adds a sophisticated set of services on top of the operating system. The latter approach often layers the operating system functionality onto a thin software layer of seminal services, called a micro-kernel. The nature of ARCADE’s abstractions allows both implementation options and facilitates full cooperation between them.
We have found ARCADE to be portable and easy to use. We have realized it as a micro-kernel and as an add-in to three different operating systems. Our applications use a common set of services across heterogeneous interconnections with metaphors that are familiar to conventional programmers. This paper describes what ARCADE is, how we have built it and what we have learned by using it. The next section introduces the architectural components and Section 3 details how a programmer uses and manipulates them. Various implementation issues are discussed next, and a summary of several distributed computing experiments follows. A discussion of the lessons we have learned is in Section 6. Section 7 positions the contributions of ARCADE in the spectrum of current research on distributed computing, and the paper ends with some concluding remarks.
2. The ARCADE Architecture
ARCADE models the physical world as a combination of execution entities, passive resources and external events. The execution abstraction extends the classical process idea by adding hierarchical names and redefining resource ownership. The principal resource is based on a high-level view of data combined with metadata, information about the data itself. The metadata allows the definition of a generalized identification mechanism which acts as a global pointer, spanning heterogeneous machine boundaries. Execution entities cooperate through message passing, shared data and synchronization events. A binary signaling mechanism combines external events and synchronization events to control execution state. Each element of the architecture includes security provisions which can allow construction of secure operating systems [De89].
In many respects, the ARCADE task abstraction closely resembles other kernel-level active abstractions, such as the OS/2 process or the Mach task. However, it is a portable construct well suited to a heterogeneous, distributed environment. It can be viewed as a thread of execution with an address space that holds code and data. Queues buffer notification information from other tasks. A signaling mechanism, based on input and output lines, allows tasks to exchange synchronization
information. This synchronization information is processed by a simulated programmable array logic mechanism, or PAL, to control the task state.
ARCADE supports applications made up of tasks that run on an interconnection of computers. While it effectively hides the boundaries between systems, it retains the notion of a machine. It assumes that each machine has a unique name within the interconnection and that task names are centrally controlled on the machine. Thus, tasks have globally-unique, hierarchically structured logical names. For convenience, they are also given fixed-length unique identifiers, known as UIDs, to facilitate identification.
The keystone of ARCADE is the passive data unit abstraction which is modeled on a high-level language’s notion of data. It associates structure and other information with the data to provide a variety of services. All data in a task’s address space is contained in data units which can be used as messages or as distributed shared memory. The structure information allows automatic translation between heterogeneous machines. Data unit locks are available as synchronization tools for flexible control of shared data coherency.
The data unit link abstraction is essentially a global reference mechanism. It provides a location-independent handle for data units residing throughout a distributed system. Since they span machine boundaries, links can be used to build distributed dynamic data structures.
3. Abstractions and Services
This section presents a programmer’s view of ARCADE. Figure 1 shows the set of resources owned by an ARCADE task. Some of these can be manipulated by the task’s thread of execution and some are fixed when the task is created. Most ARCADE services deal with these resources, and the following discussion indicates how the resources and services are related.
The task’s address space is the set of all currently valid memory addresses. In a full ARCADE implementation, the entire address space is comprised of data units, including both code and data. However, for an add-in implementation, data units comprise only a portion of the address space.

[Figure 1 - Task Resources: a task’s address space of data units with their locks (L) and data unit links, its input and output lines, the simulated PAL with its run/sleep (R/S) and live/die (L/D) outputs, the input queue and asynchronous queue of NPDUs, and the fixed attributes name, UID, privilege vector, key and security level.]

A task creates a data unit with the allocate() service, specifying its structure, rather than its size. ARCADE maps a block of memory into the requesting task’s address space, returning a pointer to the data unit. When a task no longer needs a data unit, the release() system call may be used to unmap the associated memory.
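The following C sketch illustrates this allocate/release pattern. The header name, the structure-descriptor constant and the exact service signatures are assumptions for illustration; the paper names the services but does not give their C bindings.

    /* Hypothetical sketch: <arcade.h>, POINT_DESC and the exact prototypes of
     * allocate() and release() are assumed, not taken from the ARCADE sources. */
    #include <arcade.h>

    struct point { long x, y; };

    int make_point(void)
    {
        /* The service is given a structure description, not a byte count, so the
         * system can later translate the data between heterogeneous machines. */
        struct point *p = (struct point *) allocate(POINT_DESC);
        if (p == NULL)
            return -1;

        p->x = 3;
        p->y = 4;

        /* Unmap the data unit from the address space when it is no longer needed. */
        release(p);
        return 0;
    }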
Data unit links, an ARCADE data type, are used to build data structures of multiple data units. A task may invoke setlink(), specifying an address within a data unit, to cause the link to point at the data unit. The setlink() service internally associates a global identifier with the link and stores it in the metadata.
A task may send a data unit as a message to another task. The move() service removes the data unit from the address space of the sender and makes it available to a receiver. Data unit movement is a three step process. First, the target data unit is unmapped and a notification packet data unit (NPDU) is created. The NPDU contains a link to the target data unit and is placed in the input queue of the destination task. An input queue is a first-in, first-out queue of notification packet data units.
The receiving task can map the first NPDU in its input queue into its address space with the receive() service. It can then reference the data unit link and use the access() service to place the target data unit in its address space.
If the sender uses share() rather than move(), the data unit will not be removed from its address space. Then, when it is received, both tasks have direct access to it. Thus, it has become essentially distributed shared memory, or DSM. For many applications, shared data is an ideal cooperation paradigm.
The three step data movement process has advantages. It prevents the sending task from having to wait until the entire data unit has been transferred; the task only blocks until the NPDU is added to the target’s input queue. Also, the destination task may selectively accept incoming data units since the NPDU identifies the sending task.
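A minimal sketch of this exchange, assuming hypothetical C bindings for the move(), share(), receive() and access() services and an NPDU structure that exposes the link to the target data unit; none of these declarations are taken from the ARCADE sources.

    #include <arcade.h>   /* assumed header */

    struct msg { long value; };

    /* Sender: hand a data unit to another task identified by its UID. */
    void send_to(task_uid_t peer)
    {
        struct msg *m = (struct msg *) allocate(MSG_DESC);
        m->value = 42;
        move(m, peer);       /* m is unmapped here; an NPDU holding a link to it
                                is appended to the peer's input queue */
        /* share(m, peer) would instead leave m mapped in both tasks, turning
           the data unit into distributed shared memory. */
    }

    /* Receiver: pick up the notification, then map the data unit itself. */
    long receive_one(void)
    {
        struct npdu *n = receive();                        /* first NPDU in the input queue */
        struct msg *m = (struct msg *) access(n->target);  /* follow the data unit link */
        return m->value;
    }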
Sometimes it is desirable for the arrival of an NPDU to interrupt the task’s thread. Therefore, each task also has an asynchronous queue (AQ). Unlike the input queue, the arrival of an NPDU at an empty asynchronous queue invokes specially installed handler code which receives the NPDU.
Tasks have binary output lines which they can set high or low. The first line is reserved; it remains high throughout the task’s life and goes low when the task dies. A second reserved line goes high when the task’s input queue is not empty. All other lines can be used to send event-like information to other tasks.
A task has input lines which are used for synchronization and event notification. It can connect these to other task’s outputs and to lines associated with data units or hardware interrupts. The value of the input lines can be read with readip(). More importantly, they are the inputs to a simulated
programmable array logic (PAL) which can be programmed to generate run/sleep (R/S) and live/die (L/D) controls. These controls can suspend the task or cause it to terminate.
Since data units can be accessed by several tasks simultaneously, ARCADE provides coherency control. It defines application-controlled locks (denoted L in the figure) for each data unit. As with traditional databases, read-locks and write-locks are defined. For each data unit, the write-lock is unique, but read-locks are shared. When a task uses lockdu() to establish a lock, ARCADE assures that the task sees the most recent version of the data unit. Updates are propagated when a write lock is released, so the data does not change while read locks are active.
There are two special output lines associated with each data unit. One is high when the write lock is held and the other is high when there are active read locks. This helps detect deadlocks and notifies tasks when data is updated.
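A sketch of coherent access through these locks, assuming READ_LOCK/WRITE_LOCK constants and an unlockdu() counterpart to lockdu(); the paper names lockdu() but not the full C interface.

    #include <arcade.h>   /* assumed header */

    struct counter { long value; };

    void bump(struct counter *c)
    {
        lockdu(c, WRITE_LOCK);    /* blocks until the unique write lock is granted;
                                     ARCADE first refreshes c to its latest version */
        c->value += 1;
        unlockdu(c, WRITE_LOCK);  /* releasing the write lock propagates the update */
    }

    long current(struct counter *c)
    {
        long v;
        lockdu(c, READ_LOCK);     /* shared read lock: the data unit cannot change
                                     while any read locks are active */
        v = c->value;
        unlockdu(c, READ_LOCK);
        return v;
    }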
Other task resources, such as its name, are fixed at creation. The name of a root task consists of the name of the machine on which it was created followed by a unique task name component. For example, the task command on machine pclsys47 would be named \pclsys47:command. A normal task’s name is the concatenation of a unique child name with its parent’s name. Thus, \pclsys47:command\grep is a child of \pclsys47:command. Since tasks may migrate and may be created remotely, a name does not necessarily indicate the machine on which the task is executing. The services getuid() and getlname() can be used to translate names and UIDs.
The complete ARCADE service set is quite powerful and access to some services must be restricted. Therefore, each task has a privilege vector that indicates which of the ARCADE services that task can invoke. At creation, a task is given a subset of its creator’s privileges. In this way, some tasks can perform operating system functions while others can only act as user applications.
A security level is associated with each task and data unit. This allows ARCADE to enforce a full complement of security restrictions and to audit all attempted security breaches. The security level is metadata which tasks cannot access so the weaknesses associated with capabilities are avoided.
A task’s key is used in its creation and when the task is moved between machines. It is specified by a task’s creator and can be shared with others. The key is necessary to invoke the migrate() service and relocate the task. The task creation services, makeroot() and maketask(), and the migrate() service must be invoked on the machine where the task is to run. Since access to these services is controlled by the privilege vector, it is possible to make them fully secure.
4. Implementation Issues
This section deals with the four current implementations of the ARCADE architecture [Co92]. ARCADE/386 is a micro-kernel that runs on 80386-based IBM PS/2s. ARCADE/CMS, ARCADE/OS2 and ARCADE/Mach are add-ins to the VM/CMS, OS/2 and Mach operating systems respectively. The common features of all implementations will be discussed first. Afterwards, micro-kernel specific implementation issues will be described, followed by an analysis of the add-in implementations.
4.1 Common features
In order to support the ARCADE interface, some form of service provider must exist on each participating machine. The service provider’s main functions are to maintain resource information about the system, provide locking services, control the replication and coherency of data units and communicate with other service providers. Their methods of doing so are described below.
4.1.1 Resource Information
The ARCADE abstractions require that the service provider keep track of a variety of resource information. The information concerns data units, data unit links, tasks and machines. Some is visible to tasks, while much of it is private to the service provider.
Every data unit has a structure, security level, and size associated with it when it is allocated. This metadata, along with a task’s access rights to the data unit, can be seen by tasks. However, the physical address, the current lock status, the version number and the globally unique data unit id cannot. Together, this information allows the system to find, lock, translate and update data units in the system.
To maintain security/integrity within the system, all the information about data unit links is hidden from tasks. Service providers internally associate data unit ids and machine locations with link values when performing setlink() calls. This information can be transparently used on a later access() to find the data unit and map it into the calling task’s address space.
Service providers must also be able to fulfill service requests concerning names. Thus, for all local tasks, they must know the mapping from the hierarchical logical name to the UID and vice versa. They must also be able to traverse a task’s family tree. This means that they remember the parent of each task as well as what children it has created.
Also, because required information may not be local, service providers must maintain a list of the names of other machines in the interconnection. This list can be given to tasks when requested, but other data about the machines remains private. This includes communication addresses and indications of whether remote machines are running.
4.1.2 Locking
Data unit locking is currently implemented with a lock manager on each machine. Each manager is responsible for handling lock requests for the data units that were allocated on its machine. Therefore, all requests are forwarded from local machines to the machine managing the lock.
In addition to the visible services that lock managers provide, they hide a very important one. When a lock is granted, the lock manager returns the version number of the most recent copy of the data unit. This allows a remote service provider to know if the requester’s data unit must be refreshed before the lock can be granted, thereby preventing the use of stale data.
4.1.3 Replication and Coherency
When a data unit is shared, a replica is created on the target machine, and coherency must be maintained among the copies. When a task releases a write lock, the service provider makes an additional hidden copy of the data unit. This prevents any local tasks from modifying the new "master" copy. The service provider then broadcasts an invalidation message to signify that the data unit has changed. If the data unit id in the message matches that of a local data unit, then the receiving service provider must obtain an updated copy of the "master" copy. In this way, the illusion of shared data can be maintained. For small data units with no data unit links, it is possible to include the new data along with the invalidation notice.
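The receiving side of this invalidation traffic might be handled roughly as follows; the message layout, the version field and the helper functions are assumptions, not the actual service provider code. Recall from the previous subsection that the lock manager returns the current version number when a lock is granted, so a stale replica can be refreshed at that point.

    #include <arcade.h>   /* assumed header; duid_t is an assumed type */

    struct invalidation { duid_t id; unsigned version; };
    struct replica      { duid_t id; unsigned version; int stale; };

    /* Called when an invalidation broadcast arrives from another service provider. */
    void handle_invalidation(const struct invalidation *msg)
    {
        struct replica *r = find_local_replica(msg->id);   /* assumed helper */
        if (r == NULL)
            return;                 /* no local copy of this data unit */

        if (msg->version > r->version)
            r->stale = 1;           /* fetch a fresh copy from the "master" before
                                       the next lock on this data unit is granted */
    }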
4.1.4 Communication
Clearly, each service provider must be able to communicate with the other service providers in the interconnection. To provide a consistent communication mechanism across both add-in and micro-kernel implementations, a protocol supporting multiplexed messages was chosen. All the implementations use either full UDP/IP or a subset of UDP services and a well-known UDP port. Currently, two implementations support only UDP-like communication on a token ring; the others provide communication across any hardware supported by the system’s UDP libraries.
4.2 Micro-kernel implementation
ARCADE/386, the micro-kernel implementation of ARCADE, is a complete implementation of the architecture. The kernel has absolute control over the machine, so tasks correspond exactly to their definition in the ARCADE architecture [De89]. Even complex task migration has been handled in this implementation [Tr91].
The micro-kernel has its own scheduler and dispatching mechanism. In order to suspend and dispatch tasks, it maintains information for each task’s input and output lines. Whenever an output line changes, it propagates the change to all connected input lines. Whenever an input line changes, it recomputes the two outputs of the task’s PAL, determining whether it is time to suspend, resume or kill the task.
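A sketch of that recomputation step, assuming the simulated PAL is stored as two truth tables indexed by the task’s input-line vector; the actual kernel data structures are not described here, so everything below is illustrative.

    #define N_INPUTS 8                      /* assumed number of input lines */

    struct task {
        unsigned inputs;                    /* current values of the input lines */
        unsigned char run[1u << N_INPUTS];  /* simulated PAL: run/sleep (R/S) output */
        unsigned char live[1u << N_INPUTS]; /* simulated PAL: live/die (L/D) output */
    };

    /* Assumed scheduler entry points. */
    void kill_task(struct task *), resume_task(struct task *), suspend_task(struct task *);

    /* Called whenever a connected output line changes one of this task's inputs. */
    void input_changed(struct task *t, unsigned line, int value)
    {
        if (value)
            t->inputs |= 1u << line;
        else
            t->inputs &= ~(1u << line);

        if (!t->live[t->inputs])
            kill_task(t);                   /* L/D output went low: terminate */
        else if (t->run[t->inputs])
            resume_task(t);                 /* R/S output says run */
        else
            suspend_task(t);                /* R/S output says sleep */
    }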
4.3 Add-in implementations
The three add-in implementations of ARCADE show an evolution of style and intent. The first provided important lessons about the architecture and the relative importance of its components. The other two are quite similar and will be discussed together.

The key design decision for each implementation was how ARCADE services were to be provided. The service provider had to handle service requests and transparently update data units. It had to be able to communicate with service providers on other implementations. Unlike a micro-kernel implementation, it could not be designed from scratch. Rather, it had to be built from the components provided by the host operating system.
4.3.1 ARCADE/CMS
The first host operating system, VM/CMS, defines a set of virtual machines, each of which may run its own operating system. CMS is a single-threaded, user-oriented operating system for such a machine. The ARCADE/CMS add-in uses a kernel extension to CMS as the basic service provider. This set of resident routines is activated by a service request and has access to the task’s address space.
The routines interact with the token ring card to send UDP-like messages to other implementations. An ARCADE/CMS task can access all usual CMS services, including file systems and user interfaces.
The initial ARCADE/CMS design attempted to implement the entire set of ARCADE services, including the full task and data unit abstractions. It was found that data units and data unit links and the services needed to create and use them were simple to implement. However, the same was not true for the task abstraction. Since CMS does not support multitasking, multiple tasks could not be created on one virtual machine. ARCADE/CMS cannot dispatch or suspend a task, and the PAL component was not implemented. ARCADE/CMS showed that since each operating system has its own concept of an element of execution, it can be quite difficult to impose the ARCADE definition of a task.
However, the lack of a full task abstraction did not significantly diminish the ability of the VM/CMS task using ARCADE/CMS services to cooperate with ARCADE/386 tasks. This showed that the data unit abstraction was a powerful cooperation mechanism, even between tasks running on different hardware architectures and operating systems. Therefore, ARCADE/OS2 and ARCADE/Mach are principally based on the data unit and data unit link abstractions.
4.3.2 ARCADE/OS2 and ARCADE/Mach
Both of these implementations use a user-level data unit server task (DUST) as the service provider. An application task requests ARCADE services by doing interprocess communication with DUST. In OS/2 the IPC mechanism is named pipes while RPC messages are used in Mach. For each new task, a thread is started in DUST which remains active until the task terminates.
Data units are in memory shared between the application task and DUST. If a service request deals with a data unit, the DUST thread must be able to access it. When data units are updated, DUST must
be able to write them without disturbing the task. DUST communicates with other service providers through calls to standard UDP/IP library code.
Two new task identification services were added to the standard ARCADE service set for these implementations. The enroll() service allows a task to request a logical name and must be its first ARCADE service request. The service provider verifies that the name is unique and assigns it to the task. With unenroll(), a task releases the name and ceases to be recognized by the service provider.
5. Experiences
This section reports on a variety of experiences with the four ARCADE implementations. The first subsection treats the construction of an operating system on top of the ARCADE kernel. High-level constructs built on ARCADE tasks, including transactions and objects, are summarized next. The third subsection presents a solution to a numerical problem that involves close cooperation between tasks running on different machines. The next subsection looks at an example of a distributed data structure and is followed by an example of heterogeneous cooperation. The section ends with a discussion of processor sharing through task migration using the group manager concept.
5.1 System Tasks
As a micro-kernel, ARCADE/386 has been called on to support a variety of system tasks. The first application of ARCADE was a basic operating system built on top of ARCADE/386 [Tr89]. This Kernel Operating System (KOS) includes such services as device support, a file system, a command interpreter and a loader. It was built as a set of normal ARCADE tasks using the abstractions and services described above. Subsequently, a version of KOS was designed to address security needs. The Secure ARCADE-Based Operating System (SABOS) follows the stringent security requirements of the "Orange Book" [Be90,Do83].
KOS’s device support allows multiple tasks to share physical hardware without interfering with each other. KOS’s three main device support tasks are the console manager (conman), the file system (filesys), and timer (timer). The conman task maps a single physical console, consisting of a screen and keyboard, to multiple logical consoles, each with its own screen and keyboard. User tasks, even remote user tasks, may request logical consoles from conman. A task on one machine can manipulate a console on a second machine by using standard library routines.
The filesys task mediates user task requests for file access, converting them into a series of disk reads and writes. These are forwarded to the proper disk driver task. The disk format is the same FAT-based structure used by DOS and OS/2.
File access requests need not be only from local tasks. A task can operate on a remote file by dealing with the remote filesys task. Performance data for such remote operations was gathered, and KOS is comparable to or better than DOS and a network file system. Its performance relative to DOS 3.3 and IBM LAN Manager is shown in Figure 2. All measurements were made with a 16 MHz PS/2 model 80 as the local machine and a 25 MHz PS/2 model 70 as the remote machine.

[Figure 2 - Copy Time Ratios: KOS time / DOS time versus file size in KBytes (0 to 600) for local to remote, local to local and remote to remote copies.]

These relative performance figures indicate that the ARCADE inter-task cooperation abstractions and mechanisms can be implemented efficiently and can serve as an excellent basis for operating systems. This is particularly interesting given that:
• KOS and ARCADE are research prototypes.
• DOS is single threaded while KOS runs in a multitasking environment with the overhead of context switches.
• DOS file services are obtained through software interrupts while KOS services require intertask interactions.
The KOS file system can be viewed as a simplistic, location-dependent distributed file system. A user task can use the KOS cfs (change file system) command to direct its file system requests to the filesys task on another machine. A true location-independent distributed file system (DFS) has been built above KOS and its file system [Sm90]. DFS spans multiple machines and shields the user from file location concerns. It uses replicated files to provide fault tolerance and data units to support heterogeneity.
DFS extends the file abstraction by including structure information along with the data. Data units and their metadata are saved and retrieved as units, rather than as streams of bytes. A file written by a machine of one architecture can be correctly read by a machine of another. An entire binary tree can be saved and retrieved without either the reader or writer knowing its configuration or size.
5.2 Building Block Projects
A Nested Transaction Subsystem (NTS) has been implemented to add the transaction paradigm to ARCADE [Ku91]. While the kernel provides coherency among data unit replicas, consistency constraints that span multiple data units must be handled above the kernel. NTS offers both serializability and recoverability. Serializability guarantees that interleaved accesses to data units by different transactions are equivalent to some serial order among the transactions. Recoverability allows partial changes to be undone and ensures that committed transactions are never lost. An NTS task uses blocking locks for serializability and DFS files allow recoverability by making data units persistent.
Two projects have evaluated language-level support for ARCADE abstractions. One, ABC, augmented standard C to hide ARCADE-specific details, while a second created support for object-oriented programming in ARCADE. ABC is implemented as a preprocessor which converts a superset of C into a combination of C code and ARCADE service calls [Bn90]. It allows data units to be viewed essentially as simple C structures. The type information for the data units is automatically generated from the struct declaration by the preprocessor. In addition, data unit links appear as C pointers and the preprocessor inserts the required access() calls.
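The ABC syntax is not reproduced here, so the fragment below only suggests the flavor of the approach: the programmer writes ordinary-looking C, the preprocessor generates the type information from the struct declaration, and access() calls are inserted where a data unit link, seen as a C pointer, is dereferenced.

    /* Illustrative only; not actual ABC source or generated output. */
    struct node {
        int value;
        struct node *next;   /* a C pointer to the programmer, a data unit link underneath */
    };

    int sum(struct node *head)
    {
        int total = 0;
        struct node *n;
        for (n = head; n != NULL; n = n->next) {
            /* the preprocessor would insert an access() call here so that the
               linked data unit is mapped before it is dereferenced */
            total += n->value;
        }
        return total;
    }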
The object oriented environment based on ARCADE is designed for "programming in the large" [Br89]. Active objects are built on the ARCADE task abstraction. Five types of inter-object communication are seen by the programmer:
• tell() is a method invocation that does not expect a reply.
• submit() is an invocation that blocks waiting for a reply.
• ask() is a non-blocking invocation which expects a reply.
• reply() is the response to ask() and submit().
• forward() lets its destination reply() to the original request.
A preprocessor translates each of these requests into ARCADE service calls. Thus, the programmer sees object oriented services which are really ARCADE-based.
The combination of high-level language support for distributed programming and the nested transaction subsystem shows that data units provide a wide range of programming possibilities. The versatility of the ARCADE service set demonstrates that ARCADE is a powerful basis for distributed systems.
5.3 Closely Coupled Numerical Calculation
Red/black successive over-relaxation (RBSOR) is a technique for approximating the solution to the system of simultaneous equations that arises from discretizing a differential equation such as Laplace’s equation. It has been used as an example for many distributed systems, including Amber [Ch89] and Orca [Ba89]. It iteratively evaluates the
steady-state temperature distribution of a rectangular plate, given a fixed set of edge temperatures. The plate is viewed as a checkerboard. In a given iteration, the temperature of each red square is computed as the average of its surrounding black squares; the temperature of each black square is then computed as the average of its surrounding red squares. Through repeated iterations, the algorithm converges to a steady state.
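A minimal single-machine sketch of the iteration just described follows; it is the standard red/black scheme with simple averaging, omitting the over-relaxation factor and the convergence test, and is not taken from the ARCADE implementation.

    #define NROWS 100
    #define NCOLS 640

    static double t[NROWS][NCOLS];   /* edge rows and columns hold the fixed temperatures */

    /* Update every square of one color as the average of its four neighbors.
     * A square at (i, j) is "red" when (i + j) is even and "black" when it is odd. */
    static void sweep(int color)
    {
        int i, j;
        for (i = 1; i < NROWS - 1; i++)
            for (j = 1; j < NCOLS - 1; j++)
                if ((i + j) % 2 == color)
                    t[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] +
                                      t[i][j-1] + t[i][j+1]);
    }

    void iterate(int n)
    {
        int k;
        for (k = 0; k < n; k++) {
            sweep(0);    /* red squares from their black neighbors */
            sweep(1);    /* then black squares from their red neighbors */
        }
    }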
To distribute the calculation, the plate is subdivided into horizontal strips; each strip is assigned to a calculator task on a separate processor. Calculators exchange the temperatures on the strip boundaries and an indication of whether the algorithm has converged. In our implementation, the points in each region are represented by five data units: the interior, the upper and lower red boundary squares, and the upper and lower black boundary squares.

[Figure 3 - RBSOR Calculation Regions: a strip’s edge, computation region and interior boundaries, with the exterior boundaries supplied by neighboring calculators.]
A calculator task uses the temperatures on the squares just outside its strip to calculate the temperatures in its interior region. The fixed edge temperature is provided at startup, and the variable external boundary information comes from neighboring calculator tasks. This is illustrated in Figure 3. The heavy line delineates the strip assigned to a typical calculator. Shaded rectangles are exterior boundary data units; unshaded areas represent data units written by this calculator and used as exterior boundaries by its neighbors.
Figure 4 shows three speedup curves for a 640 x 100 grid. For the no overlap curve, each calculator task computes all of its red squares, sends the red interior boundaries to its neighbors and waits to receive the red exterior boundaries before computing the black squares.
The second curve, labeled overlap, uses the method described in [Ch89]. A calculator task computes the red squares in its internal computation region while waiting for the update of its exterior black boundaries. When a data unit arrives from the task’s neighbor with new black boundary data, the task proceeds to calculate and update its interior boundary. This causes communication and calculation to overlap in time, resulting in better speed-up numbers.

[Figure 4 - RBSOR Speed-up curves: speedup ratio versus number of processors (1 to 8) for the asynchronous, overlap and no overlap versions.]
The asynchronous realization allows the calculator tasks to use "stale" exterior boundaries if updates have not arrived in time. The tasks calculate all of the red squares without waiting. They then toggle a write-lock on the interior red boundary data unit to initiate update propagation by the ARCADE service provider. Black square calculations are then performed in the same manner. Since the calculator tasks may use old exterior boundary values, the calculation may take more iterations. However, without the communication delays, convergence is achieved faster. The convergence properties of this algorithm, including the stale data case, are discussed in [BT89]. Perhaps more importantly, the elimination of checks for whether or not an update has occurred leads to significantly simpler code.
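In a calculator task, the lock-toggle idiom of the asynchronous version might look roughly like this; the lockdu()/unlockdu() bindings are the same assumed ones used earlier.

    #include <arcade.h>   /* assumed header, as before */

    struct boundary;      /* data unit holding one boundary row (fields omitted) */

    /* After computing all red squares, publish the red interior boundaries and
     * continue immediately; releasing the write lock triggers propagation. */
    void publish(struct boundary *upper_red, struct boundary *lower_red)
    {
        lockdu(upper_red, WRITE_LOCK);
        unlockdu(upper_red, WRITE_LOCK);   /* the service provider pushes the new values */

        lockdu(lower_red, WRITE_LOCK);
        unlockdu(lower_red, WRITE_LOCK);

        /* No wait for the neighbors' exterior boundaries: the next sweep simply
         * uses whatever values are currently mapped, stale or not. */
    }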
5.4 Distributed Dynamic Data Structures
Since data unit links are essentially global pointers, they can be used to create dynamic data structures which span machine boundaries. This section describes a distributed binary tree containing a sorted list. A task at any node can look up, insert or delete entries from the list with minimal impact on similar activities at other nodes. Each vertex of the tree is a data unit; it contains the vertex data, as
well as data unit links to the left and right children. Each task processing the tree shares the root data unit [Co91].
Figure 5 shows a typical distributed tree being processed by two tasks. To look up an entry, a task begins by acquiring a read lock on the root. It then searches the tree, read-locking nodes in the process. At each step, it accesses the target data unit, acquires a read lock on it, releases the read lock on its grandparent and, if the grandparent is not the root, releases the grandparent data unit. Thus, in the figure, Task 2 has looked up the entry U. At any time, a task has a maximum of four data units in its address space and holds read locks on no more than three. Address mappings are denoted by the shaded regions in the figure.

[Figure 5 - Distributed Tree Structure: Tasks 1 and 2 traversing a tree rooted at M with vertices C, D, K, P, S and U; shaded regions mark the data units each task currently has mapped.]
Adding a vertex to the tree follows the same pattern. A task traverses the tree, starting at the root, until it finds the insertion point. The task waits until it can upgrade its lock on the insertion point data unit to a write lock. Once the write lock has been granted, the task allocates a data unit for the new vertex and sets the data unit link in the parent to point at it. Deleting a vertex is only slightly more complex. All changes to the tree take place in the subtree that grows from the parent of the outgoing vertex. Therefore, once the task has a write lock on the parent data unit, it can eventually acquire any additional write locks needed to properly restructure the tree.
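A sketch of the lookup traversal just described, with the same assumed C bindings for lockdu(), unlockdu(), access() and release(); the dulink_t type and the null-link test are also assumptions.

    #include <arcade.h>   /* assumed header */

    struct vertex {
        int key;
        dulink_t left, right;   /* data unit links to the children */
    };

    /* Returns the mapped vertex holding key, or NULL if it is absent.  The caller
     * is left holding the remaining read locks, as in the scheme described above. */
    struct vertex *lookup(struct vertex *root, int key)
    {
        struct vertex *parent = NULL, *cur = root;
        lockdu(root, READ_LOCK);

        while (cur != NULL && cur->key != key) {
            dulink_t next = (key < cur->key) ? cur->left : cur->right;
            struct vertex *child = nulllink(next) ? NULL
                                                  : (struct vertex *) access(next);
            if (child != NULL)
                lockdu(child, READ_LOCK);

            if (parent != NULL) {               /* parent is the child's grandparent */
                unlockdu(parent, READ_LOCK);
                if (parent != root)
                    release(parent);            /* unmap it; the root stays mapped */
            }
            parent = cur;
            cur = child;
        }
        return cur;
    }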
The performance of the distributed tree was compared to that of a tree object, similar to the implementation described for Orca [Ba90]. The test consisted of measuring the time it took for eight nodes to insert a given number of vertices into the tree. Timing results are presented in Table 1. Our early results were disappointing, as they indicated that the distributed tree algorithm required
approximately twice as long as the tree object to insert a given number of vertices. We suspected that this slowdown was caused by the communication currently required to assert read locks in ARCADE.

    Tree Size     Object            Distributed Implementation
    (Vertices)    Implementation    Read Locks    No Read Locks
        800             17               31              16
       1600             35               66              30
       2400             52              116              46
       3200             69              150              61
       4000             86              181              80
       8000            188              376             156
      16000            355              852             328

Table 1 - Seconds to Insert All Vertices
We confirmed this suspicion by timing a special version of the program whose vertex insertion algorithm did not acquire read locks. This modified algorithm operates correctly only if vertex insertion and deletion do not take place concurrently. These results are also listed in Table 1. As shown, the distributed tree without read locking performs slightly better than the tree object. Thus, the time to acquire read locks is a significant problem with the original distributed tree algorithm.
Given this experience, we are now considering modifications to the current implementation of locks that will reduce communication penalties and improve performance when dealing with DSM. The distributed tree algorithm illustrates the benefits of data unit links and shared data units in simplifying the construction of distributed applications that use dynamic data structures.
5.5 Heterogeneous Interoperation
The first use of data units on all four ARCADE implementations was an evaluation of a Mandelbrot set. In this multitask application, several machines cooperatively calculate values for separate regions
of a space. Actual calculation times have been measured for combinations of the four operating systems.
Originally written for ARCADE/386, the application consists of five task types: user-interface, manager, displayer, progress reporter and worker. The first four handle overhead and worker tasks do the calculation. The manager task sends a data unit to each worker indicating which portion of the space it is to calculate. If these data units cross from a 386-based machine to the System/370 or vice versa, the data is transparently translated. The worker calculates the values for its assigned region, sets a data unit link in the original data unit to point at the location of these values and sends that to the displayer. The manager then assigns a new calculation region.
For this experiment, one ARCADE machine supported the user-interface, manager, displayer and progress reporter. Workers were then created on various combinations of other machines and the total time to complete the calculation was recorded. The ARCADE and OS/2 machines were all 16 MHz 80386-based IBM PS/2s. Mach was run on a 25 MHz 80386-based PC and VM/CMS executed on an IBM 9370 Model 60 [Co92].
To establish a base-line, the entire problem was solved on a single machine running ARCADE/386. The calculation took four minutes and thirty-five seconds. The same problem was then distributed to multiple ARCADE machines and a speed-up curve was calculated. The speed-up was essentially linear up to eight machines. Linear speedup is not surprising since each subregion can be calculated independently. A bottleneck seemed to be appearing at the displayer task which was drawing a picture of the region on its screen. Each of the other operating systems was then evaluated. Table 2 lists the times for a single worker task running on each of the operating systems. The first column shows the total time to solve the problem; the second illustrates the variance in compute time caused by different compilers. The slow VM/CMS compute time is due to a variable alignment problem unrelated to data units.
    System     Total Time    Math Time
    ARCADE        4:35         4:08
    OS/2          4:58         4:08
    Mach          7:57         6:12
    VM/CMS        8:11         6:58

Table 2 - Completion time for a single worker

The multiple machine test involved workers on ARCADE, OS/2, Mach and VM/CMS. If the workers progressed at the same rates as in Table 2, the test should have taken one minute and thirty seconds, with the ARCADE task calculating 33% of the space, the OS/2 task 30%, the Mach task 19% and the VM/CMS task 18%. The actual time was one minute and thirty-three seconds, implying a 3% cooperation penalty.
5.6 Group Management
A major motivation for distributed applications is to take advantage of the latent compute power of underutilized workstations. However, the load on workstations is dynamic, varying widely over the course of a day. Thus, a distributed application should be able to adapt to the changing environment. This means that if an application has placed a task on a user’s machine and the user needs the full compute power of that machine, the task should be moved. With the ARCADE micro-kernel, tasks are well defined entities which interact in a location-independent way. Therefore, they can be moved from machine to machine relatively easily. This has led to the creation of a processor sharing scheme based on task migration. The scheme is designed to make optimal use of distribution while maintaining fairness between applications [Tr91].
To mechanize task migration, a new ARCADE service, migrate(), was defined. This lets a task on a given machine move a remote task to the local machine if the key of the remote task is known. By only allowing local tasks to move work onto a machine, and with proper use of the privilege vector, it is possible for an operating system to control access to the computing resource.
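A sketch of how the receiving side might use the service; the migrate() parameters shown (the remote task’s UID and its key) and the surrounding types are assumptions consistent with the description above, not the actual taskman code.

    #include <arcade.h>   /* assumed header */

    /* Invoked on the machine that is to host the task, e.g. by taskman in
     * response to a groupman request.  Both parameter types are assumed. */
    int pull_task(task_uid_t remote, task_key_t key)
    {
        /* Succeeds only if the caller's privilege vector permits migrate()
         * and the supplied key matches the one given at task creation. */
        return migrate(remote, key);
    }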
Each application is considered a group of tasks and is registered with a group manager task, groupman, on its "home" machine. There is a groupman task on each machine in the interconnection and they cooperate to distribute tasks and assure fairness. The monitor task on each machine determines the local load and availability. For example, if a user wants exclusive control of his or her own machine, a single keystroke alerts monitor to have tasks removed. The groupman tasks agree on how to apportion the available computing resource. For example, they may give each group equal access to computing or they may use a priority system. Finally, a task manager, taskman, on each machine responds to groupman requests and moves tasks onto the local machine.
A simple example demonstrates the operation of the group manager approach. Two applications were run on an interconnection of eight machines. The RBSOR problem of Section 5.3 was distributed to all eight machines. If nothing else had happened, it would have finished in 8.1 minutes. Figure 6 shows that after 400 seconds, a Mandelbrot space calculation was started. RBSOR was reduced to four machines and the remaining four were assigned to Mandelbrot. Mandelbrot took 100 seconds, rather than the 85 it would have required if the four machines had been idle. RBSOR eventually recaptured the four freed machines and completed in 667 seconds. This is close to the minimum of 552 seconds, and the 115 seconds of overhead shows as white space on the figure.

[Figure 6 - Adaptive Processor Sharing: number of machines assigned to RBSOR and Mandelbrot versus time in seconds, from the start of RBSOR through the start and completion of the Mandelbrot calculation.]
6. Discussion
As illustrated in the previous sections, using a high-level language model for the data abstraction provides two major advantages. First, it permits implementation on diverse hardware and operating
system platforms and leaves room for platform-specific optimizations. Second, the abstraction is an excellent building block for applications ranging from operating systems to partial differential equation solvers.
6.1 High-Level Abstractions
Our experience has shown that an architecture based on high-level abstractions can be efficiently implemented on a variety of platforms. Separating the underlying hardware or operating system from the abstractions leaves ample room for implementation-specific optimizations. In its micro-kernel realization, ARCADE supports an operating system with performance and functionality comparable to commercial systems. Single-system application performance for both the micro-kernel implementation and the add-in versions also compares favorably with their commercial counterparts.
The data unit emulates a programmer’s view of data and facilitates cooperation between elements of a distributed application. For example, when data units span machines, they reduce the "impedance mismatch" between classic, hardware-oriented distributed shared memory and the programmer’s language-oriented view of data. Therefore, they are able to offer transparent translation, application-controlled coherency and user-selected granularity in ways that make sense to the programmer, allowing clean, efficient code.
In particular, data units are useful in three contexts:
• Modules of data within a program
• Messages for exchange of data
• Shareable units for cooperation
The last two are natural outgrowths of the first. Any data unit created for holding related data can be moved or shared as it is without marshalling or extra packaging. A programmer looks at data as a collection of values so it is natural that data unit services mesh well with programming language constructs.
However, the data unit abstraction is basic enough to be used as a building block. Sophisticated programming models and tools can be effectively supported on top of the data unit abstraction. As we have seen, C programmers can treat data units as if they were heap-allocated structures and data unit links as global pointers using the ABC preprocessor. They are also the basis for an object-oriented extension to C for distributed systems.
In addition to these language-oriented enhancements, the data unit abstraction can be generalized to increase its usefulness. For example, the addition of persistence and atomicity has allowed construction of a transaction system. Thus, data units are excellent building blocks for richer abstractions and programming paradigms.
The ARCADE abstractions of tasks and data units are also easy to use. This applies both to normal applications and to system software. The system software projects also attest to the completeness of the ARCADE architecture.
6.2 Asynchronous Algorithms and Dynamic Processor Sharing
Data sharing is a natural cooperation mechanism that is widely used in uniprocessors and shared-memory multiprocessors. For an iterative application with physically shared data, a change made in one iteration automatically becomes visible in the next iteration. With global variables, the appropriate declaration makes a variable visible to all relevant modules. Applications do not concern themselves with updating their data. This leads to simpler code which makes sharing an attractive alternative to messaging for distributed systems.
Clearly, distributed sharing must be built on top of a messaging system. Thus, it seems inherently inefficient. However, it can actually provide a significant performance boost. Algorithms which can run asynchronously need not waste time waiting for messages with partial results; rather, they can use stale data. This may require more iterations, but each iteration will take less time. While it is possible to carefully hand-code and fine-tune a message-based implementation to do the same thing, it is painful, ad hoc and error-prone.
Light-weight threads are a natural model for some asynchronous applications, but for others shared data makes more sense. A file server concurrently handling multiple requests works well with threads; a partial differential equation solver is better served by data sharing. Sharing avoids the strong coupling of control and data transfer inherent in thread-based RPC.
The task and data unit abstractions encapsulate the components of a task. Thus, an ARCADE task can be easily migrated between machines. This allows an ARCADE application to dynamically utilize additional resources as they become available. It can expand to harness the power of idle workstations and contract when they become active. Since the semantics of migration are defined independently of the application, the group manager concept can provide a generic sharing mechanism.
6.3 Synchronization
ARCADE’s principal synchronization and event handling mechanism is the PAL, a software analog of a hardware device. This powerful new abstraction can define compound events as logical combinations of simple events. It can also be used as a building block for distributed versions of classic synchronization tools. However, even with special programming tools, working with it has not been easy. Indeed, most ARCADE applications have based their synchronization on input queue arrivals and shared data unit locks. The PAL abstraction is probably too foreign to most programmers, particularly those who do not need its full power. Therefore, we are investigating synchronization mechanisms that more closely match the programmer’s high-level viewpoint.
6.4 Data Unit Links
Data unit links can be viewed as a general purpose identification mechanism. Initially, we have used them to create global pointers. Pointers are extremely valuable in high-level programming languages
and data unit links provide the same capability between heterogeneous machines. They work well as global pointers for distributed data structures and as data unit handles. A single data unit with a link, say acting as a node in a graph, can give access to a complete structure. We feel that data unit links are the basis for a general reference mechanism that is more powerful than pointers.
Unlike capabilities or file names, data unit links are meaningful only in the context of the task to which they have been given. They cannot be transferred. Hence, they preserve security constraints on the flow of information. All initial accesses to data units must pass through the kernel, so even the most complex security structures can be built.
Data unit links are ideally suited to the needs of distributed object-oriented databases. Additional semantics, such as versioning, code sharing and method binding through links, are under active study.
7. Relationship to other work
ARCADE makes several important contributions to the current discussion of distributed computing. Perhaps the three most significant are:
• The nature of distributed computing abstractions
• Structured heterogeneous distributed shared memory
• Security and resource identification mechanisms
This section discusses these contributions and their relationship to on-going work in distributed computing.
7.1 Nature of Abstractions
The ARCADE service interface is designed to support both system-software and user applications. Other systems, including Mach [Ac86], Chorus [Ar89] and Amoeba [Mu90], also hide the hardware but are intended to support only system-software, not applications. For example, Mach was initially
designed as a micro-kernel to support 4.3 BSD. Although other operating systems have now been built on Mach, it is a difficult interface for applications. With Chorus, the nucleus normally presents its abstractions to a set of system servers, collectively called a subsystem. The subsystem, in turn, provides operating system services which are seen by application processes.
Micro-kernels have been used to realize portable distributed systems. For example, Mach has been implemented on 80386s, 680x0s and SPARCs, among others, and Chorus is available on multiple platforms. However, while it is possible to mimic standard operating system services on top of them [Go90], they cannot co-exist with standard operating systems. Their abstractions are close to the hardware and implementations typically need complete control.
There are systems that offer abstractions close to application programming constructs. Amber’s [Ch89] passive and active objects allow the programmer to develop flexible object-oriented programs. Concert [Ye89] extends standard programming languages such as C and PL/1 to implement heterogeneous cooperative peer-processing. Such systems are typically implemented on top of an existing operating system, Amber on Topaz and Concert on OS/2 and VM/370. However, they confine the programmer to a particular programming model, Amber with object-oriented programming and Concert with RPC.
The ARCADE abstractions strike a balance. Data units and data unit links are high enough for multiple cooperating implementations and low enough to support a variety of computational models.
7.2 Data Units and Distributed Shared Memory
Distributed shared memory has been implemented in hardware [Le90], as operating system software [Li86] and through compiler generated code [Ba89][Ni91]. The major DSM design issues are granularity of shared data, coherence protocol and support for heterogeneity [Ni91]. Ivy [Li86] classically assumes shared data is totally unstructured, using hardware-dependent page-based granularity. Linda’s shared data is a tuple space [Ah88][Ca90], defining application-dependent tuple-based granularity. Munin [Ct91] structures its shared data on the basis of variables in the source language. Most sharing schemes commit to either a page or a data object as a unit of granularity, but not both. However, depending on the data usage patterns, either approach may be best. Thus, it can be desirable to support both types of granularity. This is only possible when, as in ARCADE, the abstraction-level unit of coherency is a data object.
Coherence protocols can be classified on the basis of synchronization points in a sequence of shared accesses [Ni91]. With Ivy’s strict coherence, every read or write is a synchronization point. Munin’s release consistency is based on acquire and release operations which are similar to ARCADE’s lock and unlock. Clouds [Cn92,Ra89] offers both strict and weak coherency.
Weaker coherency typically increases the concurrency of shared data accesses, but its use depends on the application’s ability to tolerate stale data. Therefore, application-specific coherence policies [An92] can be more efficient. Applications can use ARCADE’s advisory locks to control the coherency semantics. In fact, when it makes sense, applications can choose to ignore some synchronization points and use data that may be incoherent.
Several projects have extended the DSM abstraction to heterogeneous environments. In Mermaid [Zh90], for example, memory is shared in pages and a page contains data of only one type. Whenever a page is moved between two architecturally different systems, the data is converted to the appropriate format. Since the unit of coherency is a page, several restrictions apply. The size of each supported data type must be uniform and the translation process is not entirely transparent. Agora [Bi87], on the other hand, provides a multi-language structured shared data facility which can span heterogeneous architectures. However, shared data is accessible only through a set of access functions and it may not contain references to other data objects. While these projects make important contributions, none offers the flexibility and ease of use available with data units and data unit links.
7.3 Security and Resource Identification
A fairly common resource identification and protection mechanism is a capability [Ta86]. Mach, Amoeba, Chorus and Concert all support capabilities which contain encoded access rights to a resource. Capabilities are not context sensitive; a thread which possesses one has the access rights, regardless of how it was obtained. They can be passed by one thread to another, allowing unconstrained access to the resource. Other identification mechanisms, such as global virtual addresses as in Amber, have trouble supporting heterogeneity.
ARCADE associates security information with all of its abstractions, including data unit links. Data unit links are context sensitive, preventing illegal transfers of access rights. Once a task has access to a data unit, the data unit link can be optimized to a pointer.
8. Conclusions
ARCADE offers improvements on current distributed system technology in three major areas. First, the experiment proves the efficacy of basing architectural components on high-level language constructs. This approach has led to a powerful, heterogeneous and portable set of abstractions. These abstractions have been efficiently implemented in several different ways. Most importantly, it is easy to program distributed applications using the ARCADE abstractions.
ARCADE has proposed an evolution for the concept of distributed shared memory. By associating structure information with the shared entity, it has been tied to the higher-level notion of data. The semantics of data units allow a variety of coherency protocols, with locks introducing fine-grained, application-controlled coherency.
The data unit link abstraction is much more than just a global addressing mechanism. While our principal use of it has been as a pointer-like variable, its associated metadata gives it great potential
as a general reference mechanism. For example, it can be naturally applied to object oriented databases. Additional semantics, including overloading and executability, are obvious extensions.
The ARCADE project has been a valuable exercise in the development of distributed systems. It has shown that powerful distributed programming constructs can span hardware architectures, machine boundaries, operating systems and implementation strategies. It proposes a universal cooperation paradigm which is suitable for a broad spectrum of systems.
9. References
[Ac86] Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A. and Young, M., Mach: A New Kernel Foundation for UNIX Development, Proceedings Summer USENIX, July 1986.
[Ah88] Ahuja, S., et al., Linda and Friends, IEEE Computer, Vol. 19, No. 8, August 1986, pp. 26-34.
[An92] Ananthanarayanan, R., et al., Application Specific Coherence Control for High Performance Distributed Shared Memory, to appear in Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, 1992.
[Ar89] Armand, F., et al., Distributing UNIX Brings it Back to its Original Virtues, Proceedings of the Workshop on Experiences with Distributed and Multiprocessor Systems, October 1989, pp. 153-174.
[Ba89] Bal, H., The Shared Data-Object Model as a Paradigm for Programming Distributed Systems, Ph.D. Dissertation, Vrije Universiteit, 1989.
[Ba90] Bal, H., et al., Experience With Distributed Programming in Orca, Proceedings of the International Conference on Computer Languages, 1990.
[Bn90] Banerji, A., A New Programming Model for Distributed Applications, Master’s Thesis, University of Notre Dame, July 1990.
[Be90] Bellon, M., A New Approach to Secure Distributed Operating System Design, Master’s Thesis, University of Notre Dame, April 1990.
[BT89] Bertsekas, D. and Tsitsiklis, J., Parallel and Distributed Computation: Numerical Methods, Prentice Hall, Englewood Cliffs, NJ, 1989.
[Bi87] Bisiani, R., et al., Heterogeneous Parallel Processing: The Agora Shared Memory, Technical Report CMU-CS-87-112, Carnegie Mellon University, March 1987.
[Br89] Bryden, T., C Language Support for Object-Oriented Programming in ARCADE, Master’s Thesis, University of Notre Dame, November 1989.
[Ca90] Carriero, N. and Gelernter, D., How to Write Parallel Programs, MIT Press, 1990.
[Ct91] Carter, J., et al., Implementation and Performance of Munin, Proceedings of the 13th ACM Symposium on Operating System Principles, October 1991, pp. 152-164.
[Ch89] Chase, J., et al., The Amber System: Parallel Programming on a Network of Multiprocessors, Proceedings of the 12th ACM Symposium on Operating Systems Principles, December 1989, pp. 147-158.
[Ce88] Cheriton, D., The V Distributed System, Communications of the ACM, Vol. 31, No. 3, March 1988, pp. 314-333.
[Cn92] Chen, R. and Dasgupta, P., Integrating Consistency Control and Distributed Shared Memory: The Travails of an Implementation, to appear in Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, 1992.
[Co91] Cohn, D., et al., Using Kernel-Level Support for Distributed Shared Data, Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, March 1991, pp. 247-260.
[Co92] Cohn, D., et al., A Universal Distributed Programming Paradigm for Multiple Operating Systems, to appear in Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems, 1992.
[Da88] Dasgupta, P., et al., The Clouds Distributed Operating System: Functional Description, Implementation Details and Related Work, Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988, pp. 2-9.
[De89] Delaney, W., The ARCADE Distributed Environment: Design, Implementation and Analysis, Ph.D. Dissertation, University of Notre Dame, April 1989.
[Do83] Department of Defense Trusted Computer System Evaluation Criteria, Department of Defense Computer Security Center, Ft. George G. Meade, MD, CSC-STD-001-83, Library No. s225,711, August 1983.
[Fo88] Forin, A., Barrera, J., Young, M. and Rashid, R., Design, Implementation, and Performance Evaluation of a Distributed Shared Memory Server for Mach, Technical Report CMU-CS-88-165, Carnegie Mellon University, August 1988.
[Go90] Golub, D., et al., Unix as an Application Program, Proceedings of the Summer USENIX Conference, June 1990.
[Ku91] Kulkarni, D., Nested Transaction Support for Reliable Distributed Applications, Master’s Thesis, University of Notre Dame, July 1991.
[Le90] Lenoski, D., et al., The Directory Based Cache Coherence Protocol for the DASH Multiprocessor, Proceedings of the 17th Annual International Symposium on Computer Architecture, IEEE, 1990, pp. 148-159.
[Li86] Li, K., Shared Virtual Memory on Loosely Coupled Multiprocessors, Ph.D. Dissertation, Yale University, YALEU/DCS/RR-492, September 1986.
[Mu90] Mullender, S., et al., Amoeba: A Distributed Operating System for the 1990s, IEEE Computer, Vol. 23, No. 5, May 1990, pp. 44-53.
[Ni91] Nitzberg, B. and Lo, V., Distributed Shared Memory: A Survey of Issues and Algorithms, IEEE Computer, Vol. 24, No. 8, August 1991, pp. 52-60.
[Ra89] Ramachandran, U., Ahamad, M. and Khalidi, M., Coherence of Distributed Shared Memory: Unifying Synchronization and Data Transfer, Proceedings of the 1989 International Conference on Parallel Processing, Volume II, August 1989, pp. 160-169.
[Sm90] Smith, E., A New Distributed File System Architecture for ARCADE, Master’s Thesis, University of Notre Dame, June 1990.
[Ta86] Tanenbaum, A., et al., Using Sparse Capabilities in a Distributed Operating System, Proceedings of the 6th International Conference on Distributed Operating Systems, IEEE, 1986, pp. 558-563.
[Tr89] Tracey, K., The Design and Implementation of an ARCADE-based Operating System, Master’s Thesis, University of Notre Dame, April 1989.
[Tr91] Tracey, K., Processor Sharing for Cooperative Multi-task Applications, Ph.D. Dissertation, University of Notre Dame, May 1991.
[Ye89] Yemini, S., et al., CONCERT: A High-Level-Language Approach to Heterogeneous Distributed Systems, Proceedings of the Ninth International Conference on Distributed Computing, June 1989, pp. 162-171.
[Zh90] Zhou, S., et al., Extending Distributed Shared Memory to Heterogeneous Environments, Proceedings of the 10th International Conference on Distributed Computing Systems, IEEE, 1990, pp. 30-37.