IMPLEMENTING THE COMANDOS ARCHITECTURE

Jose Alves Marques (2), Roland Balter (1), Vinny Cahill (4), Paulo Guedes (2), Neville Harris (4), Chris Horn (4), Sacha Krakowiak (3), Andre Kramer (4), John Slattery (4) and Gerard Vandome (1).

(1) Bull Research Centre, c/o IMAG-Campus, BP 53X, 38041 Grenoble CEDEX, France. (2) Instituto de Engenharia de Sistemas e Computadores, Rua Alves Redol 9-2, 1000 Lisboa, Portugal. (3) Laboratoire de Genie Informatique, IMAG-Campus, BP 53X, 38041 Grenoble CEDEX, France. (4) Distributed Systems Group, Department of Computer Science, Trinity College Dublin, Ireland.
ABSTRACT

The fundamental goal of the three year ESPRIT project 834, COMANDOS, is to identify and construct an integrated platform for programming distributed applications which may manipulate persistent - i.e. long-lived - data. The intention is eventually to provide such a platform running on a range of machines from different vendors. The COMANDOS project has already defined the global architecture of such a platform [Horn87]. This architecture is specified as a number of co-operating functional components. A subset of the components constitutes the COMANDOS kernel, which provides the minimum functionality expected of a COMANDOS system. This paper describes the three different implementations of the COMANDOS kernel being undertaken within the current project. The differing goals of each implementation, as well as the particular hardware environments targeted, are summarised. The different approaches being followed in each implementation are outlined, followed by a preliminary presentation of the results of each implementation to date.
1 INTRODUCTION

The fundamental goal of the three year ESPRIT project 834, COMANDOS, is to identify and construct an integrated platform for programming distributed applications which may manipulate persistent - i.e. long-lived - data. The intention is eventually to provide such a platform running on a range of machines from different vendors. Although the main intended application domain is office systems, we expect that the COMANDOS platform may be valuable as a basis for integrated information systems in such application domains as CAD, software factories and manufacturing administration. COMANDOS itself does not provide end-user applications, but rather a basis for the development of these. The essential features of the COMANDOS platform - the "virtual machine" interface - are:
- support for distributed and concurrent processing in a loosely coupled LAN and WAN environment;
- an extensible and distributed data management system, which can be tailored to specific application areas;
- tools to monitor and administer the distributed environment.

Further aspects of the project include tools to aid in office systems design and maintenance, particularly in the light of operational experience [Horn88], and interworking with existing data management systems constructed independently of the COMANDOS model [COMANDOS88c].

The programming environment provided by COMANDOS is intended to be multi-lingual, and a range of programming languages is expected to be used. To interact fully with, and thus exploit, the COMANDOS environment, a range of primitives will be available via libraries. Existing programs are supported without requiring recoding, although relinking with standard environment libraries may be necessary: such programs may not be able to benefit fully from the COMANDOS platform. Nevertheless, it is obviously crucial that existing applications, for example UNIX applications, can be supported easily.

Consideration of our environment and virtual machine primitives has convinced us that it is useful to provide a language in which the concepts of the COMANDOS virtual machine are faithfully reflected. The language embodies the main features of the COMANDOS virtual machine, i.e. the type model and the computational model. Some of its features may be regarded as "syntactic sugar", reducing the burden on the application programmer who is making extensive use of the COMANDOS program libraries: ensuring that parameters are correctly managed across a series of library calls; ensuring that a series of such calls is indeed meaningful (i.e. semantically correct); and providing syntactic constructs which automatically result in a number of calls for frequently used cases. Moreover, other features of the virtual machine, such as typing and inheritance, can only be expressed in linguistic terms. The resulting language, Oscar, designed within the project and currently being implemented, is described in [COMANDOS88a, COMANDOS88b].

The execution environment for COMANDOS applications is provided by a low-level kernel and a set of additional services running, like applications, above the kernel. Although the kernel supports distributed processing, its functionality is fundamentally richer than that of, for example, UNIX [Bach86], Mach [Jones86], the V-kernel [Cheriton84] or Amoeba [Mullender85], in that it also provides a basis for common data management services required by many applications. Examples of the additional support provided include atomic transactions and recovery; decomposition and reconstruction of complex data entities so as to accelerate associative retrieval; location transparency as a default mechanism, but with the ability to determine the precise (current) location of some entity, and to direct execution to particular sites; and a number of monitoring and control points through which the system can be remotely administered.

Compatibility with international and de facto standards is obviously critical. For interworking, the COMANDOS kernel assumes the availability of both CL and CO (Class 4) ISO Transport, over ISO IP. Above the CL-Transport an RPC service - the COMANDOS Inter-Kernel Message Service - is used by the kernel. Both CL and CO associations are available to applications. An emerging de facto system interface of particular relevance is UNIX X/OPEN.
In one of the COMANDOS prototypes, the COMANDOS kernel interface is being hosted on top of a UNIX implementation, and the normal UNIX interface is available to applications alongside the COMANDOS kernel primitives. In the other COMANDOS prototypes, the COMANDOS kernel is being implemented in privileged machine mode: in these cases, many of the UNIX services have to be provided above the COMANDOS kernel. Our prototypes of the COMANDOS kernel are thus exploring a number of implementation strategies in parallel. This paper gives a brief review and status report on these prototypes.

In Grenoble, Bull and the Laboratoire de Genie Informatique are jointly implementing the COMANDOS kernel interface as a series of C libraries running in user mode on top of a UNIX V2.2 kernel with certain BSD extensions. Their implementation has been christened Guide. Using UNIX as a host, Guide has been able to develop faster than the other prototypes, but with relatively poor performance.

At Dublin, Trinity College are implementing the COMANDOS kernel in privileged machine mode on both the NS32000-based Trinity Workstation and Digital VAX architectures. Their implementation is known as Oisin. While Oisin is aimed at relatively sophisticated multi-user hardware, it was felt that it would also be interesting to develop a minimal implementation aimed at single-user machines, with the intention that these could interact with larger machines to gain transparent access to the full range of COMANDOS services. At Lisbon, INESC are developing a version of the kernel on i286-based PC/AT compatibles: their implementation is known as IK.

In summary, the COMANDOS consortium is prototyping the COMANDOS kernel interface in a number of operating environments, both as a "guest" layer on top of UNIX, and in native mode on both relatively sophisticated machines and on PCs. The intention is to conduct an appraisal and analysis of the relative success (including performance) of these prototypes before the conclusion of the current three year project contract.
2 THE COMANDOS VIRTUAL MACHINE

As noted above, the project aims to provide a complete interface allowing the development of large distributed applications. A complete description of the COMANDOS model is provided in [COMANDOS87]. In this section we briefly recall some of its main features. The COMANDOS Virtual Machine can be considered to be composed of two main entities:

- A Type Model providing support for the description of the abstract structure and behavioural properties of objects.
- A Computational Model defining the interface with the operating system entities which provide the functionality needed in a distributed environment: distributed processing, including object invocation, and support for fault tolerance.
2.1 Type Model

The objective of providing efficient support for both programming and database conceptual modelling had a strong influence on the definition of the Type Model. Moreover, the distributed nature of the system introduces the problem of heterogeneous implementations, which is not usually considered in type models for homogeneous environments.

In COMANDOS an object is generally composed of its internal state, or instance data, and a set of operations or methods. Each object is an instance of an implementation which provides the code for the object's operations. Furthermore, each object has an associated type which describes the visible properties and behaviour of the object. An important initial decision was to exploit static type checking as much as possible, for early detection of programming errors and to improve code efficiency.

A type represents the external interface of an object in the system. Types are specifications of interfaces: each type may then be associated with (possibly) several implementations. The possibility of having multiple implementations allows different algorithms, or different code for heterogeneous machines, to be associated with the same interface. Subtyping is supported by the model as a way of specialising interfaces, allowing incremental development of software and the use of an instance of a type as if it were of another type. The conformance rules and the manipulation of object references are detailed in [COMANDOS87].

To support database modelling it is important to be able to manage groups of similarly typed objects. The concept of class was introduced to simplify the management of large collections of objects. A class represents a set of objects having the same properties. Four basic operations may be performed on a class: insertion of new objects, removal, test for membership and inspection of the extension; a sketch of this interface is given below. In addition to these generic operations, the type over which a class is defined may have some user-defined operations which may be used in conjunction with queries.
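As an illustration only, the four generic class operations might be rendered as the following C interface; the types and names here are our own invention, not part of the COMANDOS definition:

    /* Hedged sketch of the four generic class operations; illustrative only. */
    typedef void Object;
    typedef struct Class Class;

    int      class_insert(Class *c, Object *o);      /* insert a new object   */
    int      class_remove(Class *c, Object *o);      /* remove an object      */
    int      class_member(Class *c, Object *o);      /* test for membership   */
    Object **class_extension(Class *c, int *count);  /* inspect the extension */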
2.2 Computational Model

In COMANDOS the difference between active and passive objects is that the former may change their internal state independently of any invocation. Active objects exist in the form of Jobs and Activities. A Job may be considered as a multi-processor, multi-node virtual machine. A job may contain one or many activities executing in parallel. Activities are the fundamental active objects and represent distributed threads of control, similar to processes in a centralised system. Jobs and activities are distributed objects, and may span several nodes. In each node visited by any of the activities of the job, the system maintains a local context for the job containing the objects used at that node.

Most COMANDOS objects are passive. Passive objects may belong to one of several categories: Atomic, Synchronised, or Non-synchronised. Atomicity is an attribute of an object fixed at its creation time which cannot be changed afterwards. The classical properties of atomic transactions are guaranteed for these objects when used within a transaction. Within transactions there may be objects for which no specific synchronisation is required; thus, to reduce the overhead introduced by transactional mechanisms, only atomic objects acquire the properties of atomic transactions. The model is intended to support both short and long duration transactions. To improve efficiency, sub-transactions and intermediate checkpoints were incorporated in the model.
Synchronised objects have internal mechanisms for synchronising accesses. A synchronised object can only be mapped at a single node, to which all accesses are routed. When invoking an object not located in the current context, the system decides whether the job should diffuse to the node where the object is located or instead map the object into the local context. This decision should, in principle, take into account all the attributes which may be associated with objects (synchronisation properties, fixed location, association with a given job, size, etc.).

Channels are a special type of object for communication. A channel is defined by a connection between a source and a sink object, allowing the transfer of a particular type of object. The rationale for this concept is to provide an optimised mechanism for transferring large objects between activities, and also to provide a uniform concept for handling input/output.
2.3 Architecture

Uniform access to objects is a key point in the COMANDOS model. Globally known objects have a unique system-wide identifier designated a Low Level Identifier (LLI). The LLIs form an address space in which invocation of objects is executed transparently. Transparency is supported for access to both remote and stored objects. Similarly, the single-level storage model of the COMANDOS architecture hides the distinction between long-lived - stored - objects and short-lived objects from application programmers.

The Architecture is an abstract implementation structure, consisting of a set of distributed functional entities which provide support for the COMANDOS virtual machine interface. For more detail on the COMANDOS architecture than is provided here, refer to [Horn87].
2.3.1 Virtual Object Memory (VOM)

Objects are mapped into the virtual address space of a job when they are referenced. Objects are mapped out of virtual memory by the system when the job terminates or when space is needed for other objects. All objects are potentially persistent. Persistence is not a static attribute of an object; instead, an object is considered persistent if it is reachable from an eternal persistent root.

When an object is invoked, the VOM analyses an internal table to determine if the object is already in the required context; if not, it tries to locate it, as it may already be active in some other context. In each storage node a table identifies the objects normally stored there but currently mapped in the context of a job on some node. This table is used to detect whether the object is already in some context of the job or, in the case of a synchronised or atomic object, where a live image of the object is currently located.
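The lookup order just described can be summarised by the following C sketch. Everything here - the types, the helper functions and their names - is hypothetical, intended only to make the sequence of table lookups concrete:

    /* Hypothetical sketch of the VOM lookup order on an object invocation. */
    typedef struct { unsigned long container, generation; } LLI;

    void *context_table_lookup(LLI obj);   /* this job's context, this node  */
    void *node_table_lookup(LLI obj);      /* any context on this node       */
    int   storage_node_locate(LLI obj);    /* table kept at the storage node */
    void *bind_into_context(LLI obj, void *addr);
    void *diffuse_or_map(LLI obj, int node);
    void *map_from_storage(LLI obj);

    void *vom_resolve(LLI obj)
    {
        void *addr;
        int   node;

        /* 1. Already bound in the current job context on this node? */
        if ((addr = context_table_lookup(obj)) != NULL)
            return addr;

        /* 2. Mapped on this node by another job? */
        if ((addr = node_table_lookup(obj)) != NULL)
            return bind_into_context(obj, addr);

        /* 3. Ask the storage node whether a live image is mapped elsewhere
         *    (relevant for synchronised and atomic objects). */
        node = storage_node_locate(obj);
        if (node >= 0)
            return diffuse_or_map(obj, node);

        /* 4. Not active anywhere: fetch from secondary storage and map. */
        return map_from_storage(obj);
    }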
2.3.2 Storage System (SS)

Secondary storage is managed by the SS and may be seen as a set of containers. A container is a logical entity with a unique identifier, which may be implemented by several physical containers. Each container is organised as a set of segments. A segment is a contiguous storage unit. Provision has been made for optimising access to related objects by grouping them in clusters. An object is normally stored entirely within a segment, but exceptionally large objects may be partitioned over a set of segments. Garbage collection in a distributed object-oriented system is a particularly difficult problem. COMANDOS provides a compromise between efficient resource management and algorithm complexity by using ageing [COMANDOS87].
2.3.3 Activity Manager (AM)

The Activity Manager is responsible for the low-level functions of the system. It controls the run-time environment, giving support for jobs, activities, semaphores, timers and triggers. A distinction is made between the local hardware-dependent kernel and the distributed entity which provides the support for the COMANDOS abstractions. The local kernel may be a dedicated implementation for COMANDOS or any low-level kernel providing process management and synchronisation facilities.
2.3.4 Communication Subsystem (CS)

All the architecture components are distributed and use the Communication Subsystem to communicate with their remote peers. The CS offers two transport services, one dedicated to efficient remote invocations and another optimised for the transfer of large amounts of data.
2.3.5 Remaining Components

The Transaction Manager (TM) is responsible for implementing the support for atomic objects, in particular concurrency control and recovery procedures for aborts and node failures.

The Type Manager (TpM) is the component responsible for maintaining information about types and the relationships between them. The TpM will assist language compilation and the object management system. Mappings from certain languages to the TpM interface will be provided.

Finally, the Object Data Management System (ODMS) provides the functions related to the management of classes and queries on objects in the object base. The ODMS is responsible for three main functions: support of the query language; management of classes; and implementation of distributed location schemes for class members.
3 KERNEL IMPLEMENTATIONS

In the following sections we describe the three implementations of the COMANDOS kernel introduced in section 1. The kernel includes only five of the components of the full COMANDOS architecture, i.e. the VOM, SS, AM, CS and TM. However, the TM component of the kernel is not yet being implemented in any of the pilot implementations described here.

The first GUIDE prototype has been designed to provide, as quickly as possible, a minimum basis for supporting distributed applications specified in terms of the COMANDOS model, and to identify problems raised by the implementation of an object-oriented architecture above Unix. Three Unix System V features (shared memory, message queues and semaphores) and the Unix BSD communication facilities (sockets) are used extensively in GUIDE. Mechanisms providing optimal support for object orientation will be investigated in the proposed next phase of the project (COMANDOS-2) as extensions to the Unix kernel.

The main goals of the IK implementation of the COMANDOS kernel are:

- To provide an implementation of COMANDOS on commercial, widespread and inexpensive machines, allowing the results of the project to be widely diffused.
- To evaluate the suitability of a segmented memory management system, as provided by the iAPX286, to support object-oriented systems efficiently.

In the IK prototype the main concern is to gain experience in the efficient management of volatile and persistent objects in a distributed environment. Therefore, IK will provide efficient invocation of objects in virtual memory and transparent access to objects located at any node of the system. Support for local secondary storage is provided; however, this implementation is primarily intended to access information residing on more powerful machines.

The chief goal of the Oisin kernel is to provide an efficient implementation of the COMANDOS model on relatively sophisticated hardware. The chief features which distinguish Oisin from Guide and IK are as follows:

- kernel-mode exploitation of a demand-paged virtual memory environment;
- use of clustering to reduce i/o operations and accelerate object invocations;
- a multi-level i/o subsystem, in which peripherals, i/o controllers and bus couplers have separate drivers: each application-level i/o request is mapped dynamically to a path of devices as necessary to reach the target peripheral;
- absence of a UNIX-style i/o buffer pool, instead exploiting all available physical memory effectively as an i/o cache.

As noted in section 1, it is our intention to compare the three implementations both qualitatively and quantitatively: rather than mirroring strategies across the three implementations, it seemed more prudent to explore techniques in parallel and to exchange experiences.
3.1 GUIDE Implementation
3.1.1 Introduction

This section describes the principles of the Bull/LGI implementation of the COMANDOS kernel on top of Unix System V. This implementation is called GUIDE (standing for "Grenoble Universities Integrated Distributed Environment"). The main decision in this implementation was to map COMANDOS activities onto Unix processes. The alternative choice would have been to map COMANDOS jobs onto Unix processes and to implement activities as "lightweight processes" within a Unix process. The availability of shared memory made the first solution more attractive, since it allows easy sharing of objects and system tables between activities and between jobs on each node. The implementation of the main components of the GUIDE kernel is briefly described in the following sections.
3.1.2 Virtual Object Memory

In the COMANDOS model (cf. section 2.2 above), a job is defined as a multi-processor, multi-node virtual machine. A job is defined by its virtual address space (possibly spanning several physical nodes), which is shared by the activities of the job. The management of jobs and activities is described in more detail in section 3.1.4. On a given node, an activity is represented by a Unix process, whose virtual address space is divided into three areas: the private area, the shared area and the stack area.

- The private area is divided into two zones: the first zone is the Unix .text zone, which contains the GUIDE kernel (this zone is shared by all the activities running on the node); the second zone is the Unix .data zone, which contains the kernel data and the binary code of currently used implementations.

- The shared area is divided into three zones: the first zone contains the Context Object Table (COT), which describes objects mapped into the corresponding job on this node (this zone is shared by all the activities of this job). The second zone contains the Node Object Table (NOT), which describes objects currently loaded on the current node (this zone is shared by all the activities on the node). The COT may be viewed as a window on the NOT. Each table is implemented by means of a Unix System V shared memory segment. The third zone contains the data of the objects which are currently mapped on this node (this zone is also shared by all the jobs at the node). Each object is loaded within a Unix System V shared memory segment. An object shared by several jobs may be attached at different virtual addresses in each job, but at the same virtual address in all the activities of the same job. Concurrent accesses to shared objects are controlled by means of Unix System V semaphores.
- The stack area is private to each activity.

An object loaded in VOM is kept resident as long as shared memory segments are available. The NOT contains a binding counter (the number of contexts in which the object is currently mapped) for each object. When this counter becomes zero, the object can be withdrawn from the NOT, and the associated shared memory segment reallocated (NOT entries are reallocated on a LIFO basis). An ordinary or synchronised object is stored back into the SS when it is no longer bound within any context. An atomic object which has been modified within a transaction is stored into the SS when the associated transaction is committed.

Object invocation is performed within the process implementing the calling activity. When an object not yet bound within the job is invoked, a search for the object is carried out within the VOM. If the object is found within the NOT on the local node (i.e. the object has already been mapped on the local node), it is then bound within the context of the calling job and the invocation is performed. If the object is not yet mapped on any node, then it is mapped on the local node and the invocation proceeds as above. If the object is already mapped on another node, and the current job is not already present on that node, the job and the activity are first diffused to the target node. The object invocation is then carried out on the remote node. Run-time support for object invocation is provided by the GUIDE kernel via the ObjectCall kernel primitive:

    ObjectCall (v_object: view; ref_impl: reference; method_index: integer; param_block: address);
where: v_object is a view which points to the called object; ref_impl is a reference to the implementation which contains the required operation (which may be different from the implementation of the object, because of inheritance); method_index is the index of the operation within that implementation; and param_block is the address of a block which contains the parameters.

The format of the parameter block is as follows: the first entry contains the number of parameters; subsequent entries contain the parameters themselves - each entry contains one of the following: a) the address of a view that points to a parameter (this is the general case), or b) the value of the parameter (for integers and characters only), or c) the address of the parameter (for strings only). Passing addresses for views and strings is an optimisation for the case of a local invocation.
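The following C fragment sketches how a caller might lay out such a parameter block. The View type, the entry width and the example names are our assumptions; only the block layout - a count followed by one entry per parameter - comes from the description above:

    /* Illustrative construction of an ObjectCall parameter block. */
    typedef void View;                 /* opaque handle onto an object      */

    typedef union {
        long  value;                   /* integers and characters, by value */
        View *view;                    /* general case: address of a view   */
        char *string;                  /* strings, by address               */
    } ParamEntry;

    typedef struct {
        long       nparams;            /* first entry: number of parameters */
        ParamEntry entry[8];           /* subsequent entries                */
    } ParamBlock;

    extern void ObjectCall(View *v_object, void *ref_impl,
                           int method_index, void *param_block);

    void example_call(View *target, void *impl, View *arg_view)
    {
        ParamBlock pb;
        pb.nparams         = 3;
        pb.entry[0].view   = arg_view;   /* object parameter, via a view */
        pb.entry[1].value  = 42;         /* integer passed by value      */
        pb.entry[2].string = "hello";    /* string passed by address     */
        ObjectCall(target, impl, /* method_index */ 0, &pb);
    }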
3.1.3 Secondary Storage

The storage system is implemented using Unix file systems (virtual disks). A file system is a disk partition which is accessed by the Unix kernel as an independent logical device. The conversion between logical device addresses and physical addresses is performed by the disk driver. A COMANDOS physical container is mirrored by a Unix file system. The Unix file systems which correspond to physical containers are accessed in character mode. This means that the corresponding I/O is synchronous, and that the Unix buffer cache is not used. Instead, a cache is managed by the SS in order to optimise disk accesses: there is a single cache on each node where at least one physical container is supported, and this cache is common to all the physical containers located on this node.

The SS is composed of two main parts: the client side, which performs SS primitives, and the server side, which deals with container management. The client side is linked with VOM processes, while the server side is composed of one or several dedicated Unix processes. The server side is composed of three main modules: the cache module, the physical container module and the logical container module. The cache module is very similar to the Unix buffer cache and thus makes disk I/O more efficient. The physical container module manages disk blocks within a physical container, and the logical container module handles accesses to objects within logical and physical containers. Communication between a client process and an SS server process is performed using BSD sockets in datagram mode. Communication between server processes on the same node uses the object cache.

Physical containers are organised in 512-byte blocks. An object descriptor is stored in one block. Since the size of an object descriptor is less than the block size, the remaining space in the descriptor block may be used to store the corresponding object data. If the object size is greater than this remaining space, its contents are stored in independent data blocks. The addresses of these blocks are stored in the remaining part of the descriptor block (instead of the data for a small object). The first few addresses are addresses of direct data blocks, and the last two addresses refer to indirect data blocks (a single indirection for the first one and a double indirection for the second one). This mechanism is very similar to the one used in the Unix file system. The resulting maximum object size is 8,287 Kbytes.

Logical containers are organised as a hierarchy. A logical container is a son of the logical container into which objects that have not been used for a certain time are aged. This mechanism applies also to the father logical container, which is itself the son of another logical container, and so on up to a root logical container. In the GUIDE implementation, there is a single root logical container into which all unused objects are ultimately aged.

Replication on a per-physical-container basis is supported in the GUIDE implementation. An object stored within a logical container is replicated on each physical container of this logical container. An object is available as long as at least one of its host physical containers is still operational. The number of physical containers of a logical container can be increased or decreased dynamically. When a new physical container is added to a logical container, it is first synchronised with the already active physical containers.
There is always a master physical container, in which the objects are modified, and secondary physical containers, to which the updates are propagated. When the master physical container becomes unavailable, one of the secondary physical containers is elected as the new master, and the users of the system should be unaffected.
A limited versioning mechanism is implemented within the GUIDE kernel. This mechanism is provided by the ability to create a new object with another object as its initial value. The newly created copy is kept as the old version, while the original object continues as the new version. Thus other objects referencing the object will always reference its latest version (by means of the existing reference), and old versions can be made available to the user by a higher-level versioning mechanism.
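As a back-of-envelope check of the 8,287 Kbyte limit quoted above: assuming 4-byte block addresses (so 128 addresses fit in a 512-byte block) and 62 direct addresses - a figure the paper does not state, but which the quoted limit implies - the arithmetic works out exactly:

    /* Check of the 8,287 Kbyte maximum object size quoted above.
     * The 62 direct addresses are an inference, not documented. */
    #include <stdio.h>

    int main(void)
    {
        const long block     = 512;                   /* block size in bytes */
        const long per_block = block / 4;             /* addresses per block */
        const long direct    = 62;                    /* direct data blocks  */
        const long single    = per_block;             /* via single indirect */
        const long dbl       = per_block * per_block; /* via double indirect */

        long max_bytes = (direct + single + dbl) * block;
        printf("max object size = %ld Kbytes\n", max_bytes / 1024); /* 8287 */
        return 0;
    }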
3.1.4 Job and Activity Management

As mentioned in the introduction, an activity is represented by a Unix process on each node where it has invoked an object. The same process represents a given activity on a node, regardless of the number of calls it has made on that node. A job is created on a node which is called its initial node, and a Unix process is associated with each activity on each node. The system guarantees the consistency of the virtual address spaces of the Unix processes which correspond to the activities belonging to a given job, as described in section 3.1.2.

In addition to the processes associated with activities, there is a daemon process on each node which implements job and activity management. This daemon is in charge of forking the processes which represent activities, and of managing the filiation relationships between them. It "cleans up" the remaining processes after a job has terminated. In addition, the daemon process performs various monitoring functions.
3.1.5 Communication Subsystem

Communications are performed using sockets in datagram mode (UDP/IP) for communication between nodes, and System V2.2 message queues for communication within a given node. Remote communication is used for peer-to-peer exchanges between the components of the kernel or storage subsystem on different nodes, and as a basis for the remote invocation mechanism. The first version of RPC uses the Sun XDR protocol.
3.2 Oisin

The TCD implementation is being primarily targeted at the NS32332-based Trinity Workstation, with a port of that implementation to Digital uVAX-IIs. Coding is primarily in Modula-2, although both C and assembler are occasionally used. The development environment is 4.2 BSD (NS Genix and Digital Ultrix). A brief overview of the NS32000 and VAX architectures follows, for those readers unfamiliar with them.

The NS32000 is a 32-bit architecture specifically designed to support high-level language (HLL) compilers [National86]. The NS32000 also provides two protection modes: user level and supervisor level. Transitions to supervisor level are made for traps and interrupts.
Demand-paged virtual memory, using two-level page tables, is also supported. The current standard MMU chip is the NS32082, which supports a virtual address space of 16 Mbytes with a 512-byte page size. The position of the current level 1 page table is given by one of a pair of page table base registers, and it can be located arbitrarily in physical memory. A level 1 page table is 1 Kbyte and contains the locations of 256 level 2 tables. Each level 2 table is 512 bytes and contains the locations of 128 physical pages. Each Page Table Entry (PTE) includes a Valid bit, a Referenced bit, a Modified bit and two protection level bits.

The VAX-11 family is a 32-bit architecture with a 32-bit virtual address space. Demand-paged virtual memory using single-level page tables is provided. The page size is 512 bytes. There are four distinct virtual address spaces: P0, P1, System and reserved. P0 and P1 define the virtual address space for each process, while the system space is a context switch independent space used by the kernel. Each of these spaces is 1 Gigabyte. Each region has a base register containing the base address of the page table for that region, and a length register indicating the length of the page table.
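The NS32082 figures above imply a 24-bit virtual address (hence the 16 Mbyte space) split into an 8-bit level 1 index, a 7-bit level 2 index and a 9-bit page offset. The following C fragment, purely illustrative, makes the split explicit:

    /* 256 level 1 entries x 128 level 2 entries x 512-byte pages = 16 Mbytes.
     * The PTE layout itself is simplified away here. */
    typedef unsigned long vaddr;

    struct split { unsigned idx1, idx2, offset; };

    struct split split_va(vaddr va)
    {
        struct split s;
        s.offset = va & 0x1FF;          /* bits 0-8  : byte within page    */
        s.idx2   = (va >> 9)  & 0x7F;   /* bits 9-15 : level 2 table slot  */
        s.idx1   = (va >> 16) & 0xFF;   /* bits 16-23: level 1 table slot  */
        return s;
    }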
3.2.1 Clustering

Consideration of a number of sample programs for the COMANDOS environment, expressed on paper in Oscar, led us to believe that clustering of objects would be an important element of an efficient kernel implementation. A cluster is a set of contiguously "stored" objects. When mapped into VOM, the objects are contiguous; however, a (large) cluster may be mapped to several contiguous ranges of disk blocks in an SS container, possibly in different disk cylinders. If any object in a cluster is accessed by an application, then the entire cluster is mapped into virtual memory. Depending on currently available physical memory, a range of pages including, but not limited to, the faulted page will be initially retrieved.

Object identifiers in Oisin fall into two categories. Inter-cluster references are similar to LLIs in Guide and IK: they consist of a logical container number and a unique "generation" number within that container. They also contain a hint field indicating the cluster in that container where the target object will most likely be found. Objects may also be migrated between logical containers, as explained in the COMANDOS Architecture report [COMANDOS87]. Intra-cluster references are adopted when the target object is inaccessible from outside its cluster. Dereferencing an intra-cluster identifier is more efficient than an inter-cluster one: careful administration of precisely which objects are placed in the same cluster will obviously directly affect overall performance.

Several clusters may be mapped simultaneously into the same virtual address space. A cluster may be simultaneously mapped into different address spaces at the same node (processor). A cluster may contain a single object, in which case it is similar to an object with an LLI in Guide or in IK.
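The shapes of the two reference categories might look as follows in C; the field widths, and in particular the content of an intra-cluster reference, are our assumptions rather than the Oisin definitions:

    struct inter_cluster_ref {          /* usable from any cluster          */
        unsigned long container;        /* logical container number         */
        unsigned long generation;       /* unique within that container     */
        unsigned long cluster_hint;     /* likely cluster of the target     */
    };

    struct intra_cluster_ref {          /* valid only inside one cluster    */
        unsigned long offset;           /* e.g. position of the target
                                           object within the mapped cluster */
    };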
3.2.2 Virtual Memory
The virtual memory subsystem is reasonably conventional. It does, however, contain a number of Oisin-specific mechanisms related to the efficient handling of clusters. Modified pages replaced in physical memory are first placed on a paging device and, if not faulted back after some time, are lazily purged by the kernel back to their home positions in the SS. Maintenance of free paging space (and indeed of free disk blocks in a disk cylinder) is via a buddy algorithm implemented on a bitmap. File-type accesses are automatically supported by direct machine-level reads and writes [Daley68].
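To illustrate the buddy-on-a-bitmap idea (this is our sketch, not Oisin code): allocations are a power of two of blocks, aligned to their own size, so a freed run coalesces with its buddy implicitly - the bitmap simply shows the whole aligned run as free:

    #include <string.h>

    #define NBLOCKS 1024
    static unsigned char freebit[NBLOCKS];   /* 1 = block is free; the map
                                                starts all-zero in this sketch */

    static int run_free(int b, int order)    /* are 2^order blocks at b free? */
    {
        int i;
        for (i = 0; i < (1 << order); i++)
            if (!freebit[b + i]) return 0;
        return 1;
    }

    /* Allocate 2^order contiguous blocks, aligned to their own size as a
     * buddy system requires; returns the first block, or -1 if none. */
    int buddy_alloc(int order)
    {
        int b;
        for (b = 0; b + (1 << order) <= NBLOCKS; b += (1 << order))
            if (run_free(b, order)) {
                memset(freebit + b, 0, (size_t)1 << order);
                return b;
            }
        return -1;
    }

    /* Freeing just sets the bits back; buddies coalesce implicitly,
     * since run_free then sees the larger aligned run as free. */
    void buddy_free(int b, int order)
    {
        memset(freebit + b, 1, (size_t)1 << order);
    }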
3.2.3 Devices

In UNIX, a single device driver may need to interact with several hardware units: this is particularly true of disk drivers, which may for example interact with a DMA controller, one or more bus adaptors (for example a SCSI interface), as well as the disk controller itself. In sophisticated i/o systems, there may in fact be several alternative routes from the CPU to a particular peripheral, and traffic ought to be distributed to balance load and achieve fault tolerance. In Oisin, we have been influenced by the i/o support of Digital's VMS [Kenah84]: each hardware unit has its own private device driver, and I/O requests follow a designated path from unit to unit as appropriate. In principle, different I/O requests may follow different paths to the same target device. Each I/O operation is defined by an I/O Request Packet and an associated Completion Packet: such packets are exchanged between devices using software interrupts.
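A hypothetical C sketch of this multi-level scheme: each request carries its remaining path of drivers, and every driver hands the packet on to the next unit. All names here are illustrative, not the Oisin data structures:

    #define MAX_PATH 4

    struct driver;

    struct io_request {
        int            op;                 /* read, write, ...              */
        void          *buffer;
        long           length;
        struct driver *path[MAX_PATH];     /* e.g. DMA controller, bus      */
        int            next;               /*      adaptor, disk controller */
    };

    struct driver {
        const char *name;
        void (*start)(struct driver *self, struct io_request *req);
    };

    /* Pass the request to the next driver on its designated path.
     * In Oisin this hand-off happens via a software interrupt. */
    void io_forward(struct io_request *req)
    {
        struct driver *d = req->path[req->next++];
        d->start(d, req);
    }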
3.2.4 Physical Memory and Synchronisation

Rather than reserving a portion of physical memory as an I/O cache, as in UNIX [Bach86], we allow potentially any page of physical memory to be used as a buffer for I/O as required. Each job executing at a node has its own private context, which is implemented as a single virtual address space and which is shared by all activities of the job executing at that node. Each activity is represented by one or more lightweight processes executing in this context. Processes are provided by the Oisin kernel, which implements two-level scheduling of jobs and activities. Oisin also provides semaphores to synchronise concurrent accesses by processes.
3.2.5 Communication Subsystem

The CS is being implemented as a path of device modules, in effect as an I/O path in the multi-level I/O subsystem. The Inter-Kernel Message protocol is treated as a single pseudo-device, as is the combination of the Transport and Internet layers. The target device in the path for the datagram service is the Ethernet device driver.
3.3 IK
3.3.1 Introduction

The INESC kernel, named IK, provides a single-user COMANDOS environment to run on commercial personal computers (PC/ATs based on the Intel iAPX286). Implementation is currently under way using Olivetti PE28, Olivetti M28 and Sperry IT machines, both as development and as target machines.
3.3.2 Brief Introduction to the iAPX286

The iAPX286 microprocessor [Intel83] is a 16-bit CPU with an integrated MMU, and has three main features which influenced our design:

- Segmented memory management, with a maximum segment size of 64 Kbytes. Address translation is performed through one of two tables, a system-wide Global Descriptor Table (GDT) or a per-task Local Descriptor Table (LDT). Instructions that modify the current segment are restartable; an exception is raised if the segment is not present in memory, allowing segments to be loaded into memory on demand.
- Hardware-supported processes, described by particular segments. Process switching is performed automatically by special machine instructions.
- Four levels of protection. Privileged levels can only be accessed via interrupts or gates.
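For readers unfamiliar with the 286: a selector - the 16-bit value a program loads into a segment register - names an entry in one of these descriptor tables plus a privilege level. The fragment below decodes the published selector format (standard 286 architecture, not IK code):

    struct selector {
        unsigned index;     /* bits 3-15: descriptor table index (max 8192) */
        unsigned use_ldt;   /* bit 2    : 0 = GDT, 1 = per-task LDT         */
        unsigned rpl;       /* bits 0-1 : requested privilege level         */
    };

    struct selector decode_selector(unsigned short sel)
    {
        struct selector s;
        s.rpl     = sel & 0x3;
        s.use_ldt = (sel >> 2) & 0x1;
        s.index   = sel >> 3;
        return s;
    }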
3.3.3 Internal Structure of the Kernel

We split the kernel into two parts: the Kernel User and the Kernel Supervisor. The Kernel User runs in user mode and handles invocation of objects already mapped in virtual memory. A call to the Kernel Supervisor is only issued if the object is not present in virtual memory.

The Kernel Supervisor, which runs in the most privileged mode of the processor, is internally composed of a number of processes which communicate by exchanging messages. Global kernel data is shared via the GDT. Currently, there is one process for the Virtual Object Memory Manager (VOM) and the Activity Manager (AM), one for the Storage Subsystem and one for the Communications Subsystem. Device drivers are also implemented by independent processes. This organisation of the kernel into different processes simplifies development and debugging, because each component is isolated in a process, preventing it from corrupting the data of other components. Interfaces between different components are better defined, allowing independent testing. Performance is not compromised, because these processes are lightweight and benefit from the hardware support for fast context switching.
3.3.4 Virtual Object Memory (VOM)

Objects in IK are composed of three logical areas: a header, a data area and a reference list. Headers are accessed very frequently and must not be modified by user-mode software, so they are stored contiguously in a write-protected segment called the Header Table. The data area and the reference list are both stored in the same virtual memory segment. This constitutes a major point of our implementation: an object corresponds to a segment. Objects are mapped as a unit in the VOM; thus they are either completely mapped in the VOM or not mapped at all. When an object is mapped in the VOM, it is either present in primary memory or saved in the swap area.

This mapping has several advantages and also some drawbacks. Memory management is simplified, because the logical structure of an object is mapped directly onto the memory concept supported by the hardware. In terms of virtual memory, an object is described only by its virtual address. When objects are mapped and segments are resident in primary memory, this implementation is very efficient, because all code executes in user mode, without the intervention of the Kernel Supervisor. In this case the code of an object runs in a way similar to a conventional resident process. The invocation time of a mapped object is close to that of a conventional procedure call. Small objects are mapped efficiently in virtual memory, because a reduced amount of I/O is necessary to read them from disk. For large objects this solution has the drawback of requiring the whole segment to be read, even when only a small part is actually used. Another inconvenience is that object size is limited to the maximum size of a physical segment (64 Kbytes). Large objects may be decomposed into a hierarchy of smaller objects, at the expense of some performance degradation.

Objects may be mapped anywhere in the virtual address space of an activity, so the code of an object may not make assumptions about the value of the code segment. This corresponds to the small model on the iAPX286: an object has a single code segment and a single data segment, and must neither make assumptions about nor modify the contents of these segment registers. This provides an easy way to share objects between different jobs: each job has a private copy of the object's header, and all headers reference the same virtual segment through different virtual addresses. The Kernel Supervisor maintains the consistency of the different copies of the header.
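The layout just described might be captured by a C struct such as the following; the field names and widths are our assumptions, only the three-area division and the 64 Kbyte segment limit come from the text above:

    typedef struct { unsigned container, generation; } LLI;

    struct object_header {              /* one entry in the Header Table,
                                           write-protected from user mode   */
        LLI            lli;             /* global identity of the object    */
        unsigned short segment;         /* virtual segment holding the data
                                           area + reference list (<= 64 KB) */
        unsigned short data_size;       /* where the reference list starts  */
        unsigned short nrefs;           /* entries in the reference list    */
    };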
3.3.5 Storage Subsystem (SS)

The Storage Subsystem of IK is simple, because its target machines have small disks. Basically, it provides the ability to store and retrieve objects given their LLI. Objects are read and written as a single unit. Whenever possible, objects are stored on disk as a contiguous segment.
3.3.6 Activity Manager (AM)

Activities are supported by processes. Activities of the same job execute in the same address space by sharing a common LDT. As process switching is inexpensive, several activities of a job executing at the same node are similar to concurrent lightweight threads of execution within the same address space. Conventional semaphores are used for synchronisation.

When an activity diffuses, a new process is always created at the remote node to handle the invocation. If the activity diffuses back to the origin node, two processes will coexist there, but only one is eligible for execution; the other awaits the return of the remote invocation. At each node an activity stores the node from which it was called; if it diffuses to yet another node, it also stores the node to which it diffused. This calling chain is used to locate the nodes visited by the activity when performing some operation on it (e.g. killing the activity).
3.3.7 Communication Subsystem (CS)

Communication in COMANDOS takes place between the kernels on distinct nodes. The types of messages exchanged between kernels are queries, notifications and requests for services. Two special cases are requests for operations to be carried out at remote nodes and requests for bulk data to be transmitted across the network.

The CS offers a Standard Communication Interface which provides both a connection-oriented transport (ISO Transport Class 4 - ISO 8073) and a connectionless transport (Draft ISO DIS 8602). These services may be used directly by any component of the kernel. However, the CS is normally used via its Inter-Kernel Message service (IKM). The IKM lies above the transport layer and provides an efficient packet protocol for both announcements and RPC-like communication between kernels at different nodes. Normally it uses datagrams to transmit the information, but a connection may be used if a large amount of data has to be sent. In remote invocations or queries to a remote kernel, the protocol is optimised to use a single packet in each direction, thus reducing protocol overheads.

Support for heterogeneous machines is provided. Information is transmitted in its raw form and is translated only at the destination node. One advantage of this approach is that for communication between homogeneous machines no translation has to be done. Each component uses a table-driven translator, avoiding the need to send extra information describing the contents of the message.
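Purely as an illustration of the single-packet, receiver-translates scheme described above - the actual IKM header layout is not specified in this paper - such a packet might carry fields like these:

    enum ikm_kind { IKM_QUERY, IKM_NOTIFY, IKM_REQUEST, IKM_REPLY };

    struct ikm_packet {                 /* hypothetical IKM packet shape    */
        unsigned char  kind;            /* announcement or RPC-like call    */
        unsigned char  arch;            /* sender's machine type: receiver
                                           translates only if it differs    */
        unsigned short request_id;      /* matches a reply to its request   */
        unsigned short length;          /* bytes of raw, untranslated data  */
        unsigned char  data[1];         /* payload follows                  */
    };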
4 STATUS OF IMPLEMENTATIONS
4.1 GUIDE Implementation

At the present date (May 1988), version 1 of the GUIDE implementation is running on the Bull SPS7/300 and the Matra-Datasysteme MS-3. Both machines are based on 68020 processors, and the supporting system is a version of Unix System V, plus socket communication from BSD 4.2. Version 1 is essentially a single-node implementation, the aim of which is to test the integration of the basic mechanisms of the kernel, and the integration between the kernel and secondary storage management. The following features are implemented:

- Support for jobs and activities. Currently, a single activity per job is supported, and a node may support several independent jobs.
- Support for the Virtual Object Memory. The basic mechanism for object binding and local operation invocation is implemented, including dynamic linking to the required code of the operation. Shared objects are not yet supported.
- Single-container secondary storage. A container is implemented on a single node. The secondary storage is integrated with the kernel via the object fault mechanism.

In addition, a first version of a C pre-processor for the GUIDE language (a subset of the OSCAR language) has been implemented. This allows us to exercise the basic system mechanisms by executing compiled programs, including object invocations. This version includes neither inheritance nor support for concurrent activities, which are both expected to be provided by October 1988, together with a first multi-node version of the kernel and secondary storage.

This first version has been experimented with since mid-March 1988. Most of the work has been devoted to performance tuning. Since the basic mechanisms of the kernel are implemented on top of Unix, the performance is certainly inferior to that of an implementation on a bare machine. However, our goal is to achieve a performance level which would allow us to run realistic applications with acceptable response times. Preliminary results so far indicate that this goal is achievable.
4.2 Oisin Implementation

At the present date (July 1988) a version of Oisin is running on an NS32332. This version is a single-node implementation. The following features are implemented:

- Support for the Virtual Object Memory. The basic mechanism for object binding and local operation invocation is implemented, including dynamic linking to the required code of the operation. Shared objects are not yet supported.
- Persistent memory, so that when an application program terminates, its results linger in secondary storage without requiring additional effort on behalf of the programmer.
- Multiple jobs, each containing multiple activities. Synchronisation is provided via semaphores.
- Single-container secondary storage. A container is implemented on a single node. The secondary storage is integrated with the kernel via the object fault mechanism.
- Object clustering, within both the Virtual Object Memory and the Storage System.
- The I/O subsystem, for disk accesses. Terminal and Ethernet drivers are being coded.
- A Name Service, providing a UNIX-like directory hierarchy, with each directory being an object. A command interpreter, using the Name Service, is also implemented, allowing objects to be interactively invoked and examined.

In addition, support for Modula-2 application programs has been implemented, allowing us to write COMANDOS implementation objects in Modula-2 and exercise the Oisin kernel. A full syntax analyser for Oscar is also completed, and semantic analysis and code generation for the basic constructs of the language are now being attempted.

The following are some preliminary performance results to date (on a 10 MHz NS32332):

- Internal Procedure Call and Return using the JSR machine instruction: 4 microsecs.
- External Procedure Call and Return using the CXP machine instruction: 8 microsecs. (The CXP instruction is normally used by a compiler to aid separate compilation.)
- Basic intra-cluster object invocation: 21 microsecs.
- Basic inter-cluster object invocation: 310 microsecs.
- Towers of Hanoi execution time (9 discs) in standard Modula-2 running under UNIX 4.2 BSD: 4.2 secs.
- Ditto using intra-cluster calls running under Oisin, using stub routines for Modula-2 object invocation: 5.8 secs.

Our immediate plans are to complete and integrate the Communications Subsystem, allowing us to provide support for job diffusion. We also wish to optimise the kernel further, including the times for basic object invocation. Finally, we intend to attempt the port to a uVAX-II in the autumn.
4.3 IK Implementation

At the present date (May 1988) a version of IK is running on a PC/AT. This version is a single-node implementation, which provides the following features:

- Support for the Virtual Object Memory. The basic mechanism for object binding and local operation invocation is implemented, including dynamic linking to the required code of the operation. Shared objects are not yet supported.
- A single activity and a single job, although several processes exist inside the kernel.
- Single-container secondary storage. A container is implemented on a single node. The secondary storage is integrated with the kernel via the object fault mechanism.

Application programs are (carefully) coded in C, with calls to a small library to interface with the system. A major part of the work was devoted to the development of the run-time support for the kernel processes. However, this has proven to have been worth the effort, because some kernel processes, like the SS and the CS, may be developed and partially debugged in MS-DOS, using all its debugging tools, and later integrated into the kernel with minor changes.

Preliminary performance results indicate that the main design choices are appropriate, although much more experience is still needed before drawing conclusions. On a PC/AT running at 8 MHz, the message passing primitive takes 800 microsecs to send a 64-byte message and perform the process switch. Object invocation in virtual memory currently takes 100 microsecs, but no attempt has yet been made to optimise this code (e.g. by writing an assembly language routine).
5 RELATED WORK AND FUTURE PLANS

In summary, COMANDOS is conceptualising, designing and, most importantly, implementing a vendor-independent platform for distributed processing, including management of long-lived data, programming language support and on-line administration. To date, no such vendor-independent infrastructure exists. OSI may be used as a basis for the interconnection of equipment from various vendors, but not as a common platform for the numerous interacting applications required in the integrated electronic organisation. The availability of such a vendor-independent integrated platform for programming distributed applications, coupled with data management, operating in an environment of heterogeneous machines, would be a significant advance over current exploitations of OSI. It would also advance the state of the art in distributed systems.

UNIX is an accepted vendor-independent system interface. Recent extensions to UNIX, such as the Sun Network File System [Sandberg86] or PCTE [Bourguignon85], extend aspects of the UNIX interface to distributed environments. However, none of them provides at the same time all of the facilities which characterise the COMANDOS programming interface: the distributed objects considered in the Sun Network File System are basically limited to UNIX files; the PCTE OMS (Object Management System) provides a simple object store, and its type system is rather weak. We consider COMANDOS a significant evolutionary development of UNIX.

A major achievement of the project so far has been the definition of an Architecture of a Virtual Machine for general distributed processing. This architecture emphasises the integration of technology from distributed operating systems, distributed programming languages and distributed databases. This integration objective is achieved through the use of an object-oriented approach. It should be noted that a similar approach has been adopted in the UK Advanced Network Systems Architecture (ANSA) [ANSA87], and is currently appearing in international standardisation activities such as the ECMA DASE proposal and the ISO ODP workitem. More specifically, these projects are considering a computational model which is close to that of COMANDOS. Although COMANDOS should not be seen primarily as an architecture definition project, it is expected to draw from the ongoing prototyping phase significant experience on the interaction between the computational model and other aspects of the virtual machine interface, most notably those for data management. This implementation experience should be valuable input to the standardisation process.

A number of prototypes are expected at the end of the current three year project (February 1989):

- Three implementations of the COMANDOS kernel, as described in section 3. The comparison between the UNIX-based kernel and the native kernels should provide useful information about the limits of an approach based on the extension of the standard UNIX interface.
- Partial implementations of some basic system services for data management, type management and object naming, running on top of the kernel. These services extend the COMANDOS virtual machine to provide high-level facilities for building distributed applications.
- Two implementations of the COMANDOS language (OSCAR): one provided as an extension of an existing language (pre-processing to C), and a full compiler for a subset of the language. The language will also be used for programming some of the system services mentioned above.

Starting from the results of this first three-year experience, a second phase of the COMANDOS project is foreseen in the framework of ESPRIT II. The overall objective of this second phase is to consolidate and integrate the various work items available at the end of the current phase, in order to provide a full implementation of the COMANDOS programming platform and virtual machine. The main aspects of this project may be summarised as follows:

- Provide a full compiler for the COMANDOS language.
- Implement the COMANDOS system (with full capabilities) both on bare hardware and on an industrial low-level distributed kernel (Chorus [Zimmermann84]).
- Extend the X/OPEN UNIX kernel to support the COMANDOS features efficiently, and consequently to enhance UNIX towards the COMANDOS interface.
- Support basic administrative facilities, such as standard distributed directory services.
- Finally, provide a testbed application, both for checking the suitability of the COMANDOS model and language, and for evaluating the internal mechanisms provided by the COMANDOS system.
ACKNOWLEDGEMENTS

The authors wish to acknowledge, in particular, the contribution of those involved in the various kernel implementations: BULL: ?????; INESC: ????; LGI: Dominique Decouchant, Andrzej Duda, Hiep Nguyen Van, Michel Riveill and Xavier Rousset de Pina; TCD: Edward Finn and Gradimir Starovic.

The authors also wish to acknowledge all those who have contributed to the Object-Oriented working group of the COMANDOS project: BULL: ????; IEI: E. Bertino, R. Gagliardi, G. Mainetto, C. Meghini, F. Rabitti, and C. Thanos; INESC: ???; LGI: M. Meysembourg, C. Roisin, and R. Scioville; Nixdorf: G. Mueller, and K. Profrock; Olivetti: A. Billocci, A. Converti, M. Farusi, L. Martino, and C. Tani; TCD: A. Donnelly, A. El-Habbash, F. Naji, A. O'Toole, B. Tangney, B. Walsh and I. White.
REFERENCES

[ANSA87] ANSA, "The ANSA Reference Manual: release 00.03", 1987.
[Bach86] M. Bach, "The Design of the UNIX Operating System", Prentice-Hall, 1986.
[Bourguignon85] Bourguignon, "Overview of PCTE: A Basis for a Portable Common Tool Environment", in Proceedings of ESPRIT Technical Week, September 1985.
[Cheriton84] D. Cheriton, "The V-Kernel: A Software Base for Distributed Systems", IEEE Software, Vol. 1, No. 2, April 1984.
[COMANDOS87] COMANDOS, "Object Oriented Architecture", D2-T2.1-870904.
[COMANDOS88a] COMANDOS, "OSCAR Preliminary Programming Language Manual", D1-T3.2.3.2-880331.
[COMANDOS88b] COMANDOS, "Tutorial Introduction to OSCAR", D1-T3.2.3.2-880331.
[COMANDOS88c] COMANDOS, "COMANDOS Integration System", February 1988.
[Daley68] R. Daley and J. Dennis, "Virtual Memory, Processes, and Sharing in MULTICS", Comm. ACM, Vol. 11, No. 5, May 1968, pp. 306-312.
[Decouchant88] D. Decouchant, A. Duda, A. Freyssinet, M. Riveill, X. Rousset de Pina, R. Scioville and G. Vandome, "An Implementation of an Object-Oriented Distributed System Architecture on Unix", Proc. EUUG Conference, Lisbon, October 1988.
[Horn87] C. Horn and S. Krakowiak, "Object Oriented Architecture for Distributed Office Systems", in ESPRIT '87: Achievements and Impact, North-Holland, 1987.
[Horn88] C. Horn, A. Ness and F. Reim, "Construction and Management of Distributed Office Systems", Proceedings of EURINFO '88, Athens, May 1988.
[Intel83] Intel, "iAPX286 Operating Systems Writer's Guide", 1983.
[Jones86] M.B. Jones and R.F. Rashid, "Mach and Matchmaker: Kernel and Language Support for Object-Oriented Distributed Systems", Proc. First ACM Conference on Object-Oriented Programming Systems, Languages and Applications (OOPSLA), Portland, September 1986, pp. 67-77.
[Kenah84] L. Kenah and S. Bate, "VAX/VMS Internals and Data Structures", Digital Press, 1984.
[Lampson81] B.W. Lampson, "Atomic Transactions", in "Distributed Systems - Architecture and Implementation", Springer-Verlag, 1981, pp. 246-264.
[Mullender85] S. Mullender, "Principles of Distributed Operating System Design", Ph.D. Thesis, Vrije Universiteit, Amsterdam, October 1985.
[National86] National Semiconductor Corporation, "NS32000 Series Databook", 1986.
[Sandberg86] R. Sandberg, "The Sun Network Filesystem: Design, Implementation and Experience", Spring EUUG Conference, 1986.
[Walker83] B. Walker, G. Popek, R. English, C. Kline and G. Thiel, "The LOCUS Distributed Operating System", Proc. of the 9th ACM Symposium on Operating Systems Principles, 1983.
[Zimmermann84] H. Zimmermann, M. Guillemont, G. Morisset and J.S. Banino, "Chorus: a Communication and Processing Architecture for Distributed Systems", RR 328, INRIA, Rocquencourt, September 1984.