Design of the Distributed ProcessBase Architecture

William Brodie-Tyrrell‡, Henry Detmold‡, Katrina Falkner‡, Matt Lowry‡, Ron Morrison*, Dave Munro‡, Stuart Norcross*, Travis Olds‡, Zengping Tian*, Francis Vaughan‡

‡ Department of Computer Science, University of Adelaide
* School of Mathematical and Computational Sciences, University of St. Andrews

Abstract
ProcessBase is an environment designed to support process modelling languages. The environment consists of a language, its interpreter and a persistent object store. Currently the environment supports concurrency through a multi-threading library; however, only a single interpreter instantiation exists as a supported architecture. ProcessBase is a simple language that nonetheless provides many sophisticated features, including first-class procedures, strong typing, extension through library interfaces, hyper-programming and linguistic reflection, multi-threaded execution and compliance. This document describes the design of a distributed ProcessBase architecture. The motivations behind the creation of this architecture are the exploration of compliance in a distributed setting and experimentation with distribution models and distributed garbage collection mechanisms.

Introduction
ProcessBase [MBG+99] is an environment designed to support process modelling languages. This environment consists of a language, its interpreter and a persistent object store. The ProcessBase language is a simple language that provides many sophisticated features, including first-class procedures, strong typing, extension through library interfaces, hyper-programming and linguistic reflection [Kir92], multi-threaded execution and compliance [MBG+00]. This document describes the design of a distributed ProcessBase environment and the initial set of experiments to be undertaken in the development of prototype systems.

Model of Computation
Figure 1 shows the model of computation for a distributed ProcessBase system. In this model, a single ProcessBase computation consists of multiple thread closures (T), which execute within a single shared address space. This address space is distributed over multiple physical sites (PS); residing on each physical site are a cache and a set of ProcessBase interpreters (I). An interpreter is defined as an agent of execution for ProcessBase instructions, consisting of an operating system thread running against a

physical site. Underneath the physical site layer resides a persistent object store [ABC+83]; a coherency layer is required to maintain object consistency within the store and cache system.

[Figure 1: Model of Computation — ProcessBase threads (T) run on interpreters (I) at physical sites (PS), each site holding a cache, above the coherency layer and the store.]

Multiple ProcessBase threads may execute within a single interpreter. Each interpreter can be viewed as a single operating system thread that time-slices between the ProcessBase-level threads allocated to it.

The Process of Distribution
A distributed computation consists of its persistent threads and data, together with any new threads and data that it creates. The distribution of these threads is controlled by the scheduling system designated for the computation. Multiple scheduling systems may be available, and a computation may define its own scheduling policy. Thus, the distribution of threads over physical sites is implicit to the computation, but explicit to the scheduling system.

When a computation starts, all persistent threads are loaded from the persistent store and placed on sites according to the currently invoked scheduling policy. The execution of a thread is not bound to a particular physical site. If the application creates new threads, the placement of these threads is also controlled by the scheduling policy. Although the specific location of threads is not important, the application must be able to signify where relationships exist between threads, and between threads and data, that may affect placement decisions. It is possible, at the application level, to bind objects together, providing a hint to the scheduling system as to a desired placement. These hints may not always be obeyed if they conflict strongly with the scheduling policy chosen by the application.

A thread is in exactly one of the following states at any time:
• Running: the thread is currently executing.
• Runnable: the thread may execute, but is not currently doing so.
• Suspended: the thread has been explicitly suspended and cannot be made runnable until it has been explicitly resumed.
• Blocked: the thread has been implicitly suspended (for example, by waiting for I/O) and should not be made runnable until it has been unblocked.
• Killed: the thread has been terminated before completion.
A newly created thread is initially in the suspended state and must be explicitly resumed to start execution.

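As an illustration only, the following Python sketch models the thread states listed above and an interpreter time-slicing between the runnable threads allocated to it. All names (ThreadState, PBThread, Interpreter) are hypothetical and are not part of the ProcessBase implementation.

# Hypothetical sketch of the thread states and of an interpreter
# time-slicing between its ProcessBase-level threads.
from enum import Enum, auto
from collections import deque

class ThreadState(Enum):
    RUNNING = auto()
    RUNNABLE = auto()
    SUSPENDED = auto()   # explicitly suspended; needs an explicit resume
    BLOCKED = auto()     # implicitly suspended, e.g. waiting for I/O
    KILLED = auto()      # terminated before completion

class PBThread:
    def __init__(self, closure):
        self.closure = closure              # procedure closure defining the thread
        self.state = ThreadState.SUSPENDED  # new threads start suspended

    def resume(self):
        if self.state == ThreadState.SUSPENDED:
            self.state = ThreadState.RUNNABLE

class Interpreter:
    """One OS thread multiplexing many ProcessBase-level threads on one site."""
    def __init__(self):
        self.run_queue = deque()

    def add(self, thread):
        self.run_queue.append(thread)

    def step(self, quantum):
        """Give the next runnable thread one execution quantum (round robin)."""
        for _ in range(len(self.run_queue)):
            thread = self.run_queue.popleft()
            self.run_queue.append(thread)
            if thread.state == ThreadState.RUNNABLE:
                thread.state = ThreadState.RUNNING
                thread.closure(quantum)     # execute some instructions
                if thread.state == ThreadState.RUNNING:
                    thread.state = ThreadState.RUNNABLE
                return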

Experiment 1: Simple model
The first experimental model is a simple model designed to explore implementation difficulties and to obtain an initial system that facilitates distributed ProcessBase application development. The proposed initial design of the distributed ProcessBase environment makes the following assumptions and simplifications with respect to the general model of computation.
• The number of sites involved in a computation cannot change during that computation.
• There will be one interpreter per physical site.
• Sites are fault free, and partial failure cannot occur within the system.
• Sequential consistency is provided by the coherency layer to ensure that there are never any invalid copies of data within the system.
• A global stabilise is provided.
• No thread migration mechanisms are currently supported.
• Interrupts can be implemented using semaphores in order to take advantage of the multi-threaded nature of the system.
• One scheduling policy is defined.
In this experiment, compliance can be investigated through thread placement and object clustering.

Experiment 2: Extended implementation model
The second stage of development of the distributed ProcessBase environment will extend the flexibility of the system. This stage of development makes the following alterations to the base assumptions.
• The number of sites involved in the computation may change during the computation.
• There may be multiple interpreters per site.
• Relaxed consistency models will be explored.
• Thread migration is possible, due to both explicit user-level requests and interpreter intervention.
• Application-level scheduling policies can be defined.

Experiment 3: Compliance experimentation model
This experiment takes the extended implementation model and explores mechanisms for compliance in a distributed environment.


In this experiment, compliance can be investigated through the development of flexible coherency mechanisms, load balancing of threads, thread clustering and object clustering.

Document Outline
When developing a distributed system, mechanisms must be provided to identify the physical sites involved in the system and to communicate between these sites. In a system with both implicit and explicit thread placement, mechanisms for remote thread creation and manipulation at both levels must be supported. Specialised thread scheduling and synchronisation techniques may also need to be developed. These issues are discussed in the following sections relative to the general model of computation. In addition, some aspects and requirements for the first experiment are identified.

Resource Discovery and Communication
The resource discovery and communications system makes two assumptions that affect the design of the distributed ProcessBase system:
• Each site should be able to discover all other sites in the network with a minimum of static configuration.
• Each site is able to communicate with any other site that is available.
The development of experiment 1 requires an additional assumption:
• Each site knows when all other sites have joined the system.

The proposed model for resource discovery is configuration sharing. Configuration sharing is a scheme whereby each site has static knowledge of a possible set of neighbour sites, forming a region of initial connectivity. Each of those neighbour sites has its own region of connectivity. The union of the regions of connectivity forms the complete network; if this union is connected then all sites will be available to a single computation. If a supercluster is required, then the name of at least one machine in each of the other clusters should be stored in the shared configuration of each cluster. In this way, each machine coming up will attempt communication with a site in each cluster and, from those initial contacts, discover the complete set of live sites in all clusters.

The configuration sharing scheme has a high cost in terms of message complexity (lists of sites are passed across the network when sites come online or go offline), but it is much more flexible and robust than either static configuration or broadcast. A single small configuration file is required for each cluster participating in the computation, containing at a minimum the name of a single machine in every cluster. The only time this file needs to change is when a cluster is added to the computation.

Design of Resource Discovery Model
An initial implementation will provide, as configuration information, the name of the site supporting the persistent object store. Once each site can communicate with the store, it can obtain the list of all sites known to the system. In later designs, this can be extended to support more flexible configuration.
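To illustrate the configuration-sharing idea, the following Python sketch (hypothetical, not part of the ProcessBase implementation) computes the set of sites reachable from one site's static neighbour list by taking the transitive union of the regions of connectivity. In the real system this union is discovered through live contact rather than from a static map.

# Hypothetical sketch: the complete network is the transitive union of
# each site's statically known region of connectivity.
def discover(start, neighbours):
    """neighbours maps a site name to its statically configured neighbour set."""
    live = {start}
    frontier = [start]
    while frontier:
        site = frontier.pop()
        for other in neighbours.get(site, set()):
            if other not in live:
                live.add(other)        # newly discovered site
                frontier.append(other)
    return live

# Example: two clusters, each cluster's configuration naming one machine in the other.
config = {
    "a1": {"a2", "b1"}, "a2": {"a1"},
    "b1": {"b2", "a1"}, "b2": {"b1"},
}
assert discover("a2", config) == {"a1", "a2", "b1", "b2"}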


Data Structures
Each site has a table of the currently live sites, with each entry containing the addressing and operating system structures required to communicate with that node. The current design uses TCP sockets, so the information stored in each table entry is a file handle and an IP address.

Boot
When a site starts it reads its configuration information, which may be provided as a shared file or as command line arguments. This information provides a list of machines to be contacted initially; those machines supply the new site with a list of the currently known live machines. This list populates the site table, and the new site then establishes connections to each of the sites in the list so that messages can be passed later in the computation if necessary.

Incoming Connection
When a site receives an incoming connection from a new site, it receives from that new site the list of sites with which the new site is in communication. The existing site returns to the new site a list of nodes that are live and not yet known to the new site.
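A minimal sketch of the site table and the join exchange described above follows; the class and function names, and the simplified representation of table entries, are assumptions made for illustration (the actual design stores a file handle and IP address per entry and runs over TCP).

# Hypothetical sketch of the per-site table of live sites and of the
# handling of an incoming connection from a newly started site.
class SiteTable:
    def __init__(self):
        self.entries = {}                       # site name -> (ip_address, handle)

    def known(self):
        return set(self.entries)

    def add(self, name, address, handle=None):
        self.entries[name] = (address, handle)

def handle_incoming_join(table, new_site_name, new_site_address, new_site_known):
    """Called when a new site connects; returns the live sites it does not yet know."""
    unknown_to_new_site = table.known() - set(new_site_known)
    table.add(new_site_name, new_site_address)  # record the newcomer as live
    return unknown_to_new_site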

Thread Placement and Scheduling
The system consists of a number of sites running one or more interpreters, with each interpreter able to support one or more threads. Placement is handled by the scheduling system, using the currently defined scheduling policy. An application is able to influence the scheduling policy by defining bindings between threads, and between threads and data. Two options have been proposed for the scheduling mechanism, with the common requirement that the mechanism support the multiplexing of an arbitrary number of threads over a limited number of operating system threads (initially, one) on each site:
• Scheduling of threads is implemented in replaceable library code.
• Scheduling of threads is implemented within the interpreter.
The proposed model is that the scheduling policy be integrated into the interpreter for initial experimentation, and that further experimentation investigate compliant scheduling processes involving library-level code.

Thread Placement Model
Explicit thread placement is provided through the createThread method, which can be invoked on any given physical site. Exposure to physical site information is limited to the scheduling and placement libraries. Figure 2 depicts the proposed model for thread placement: a single computation distributed over two physical sites, with each site supporting two interpreters. Each site contains a reference to a queue on which createThread requests for that site are kept. Each interpreter resident on a site is able to respond to createThread requests and pull items off the queue. Invocation of the createThread function (taking as an argument a procedure closure that defines the thread) causes a new createThread element to be added to the site's queue. A daemon thread executing on each site is responsible for responding to createThread requests using local thread creation mechanisms.

[Figure 2: Model of Thread Placement — a single computation over two physical sites (PS), each with a cache, two interpreters (I), ProcessBase threads (T) and a per-site queue of createThread (cT) requests.]

Site and Thread Abstractions
A thread is defined through the following view:

type Thread is view [
    suspend:   fun ();
    resume:    fun ();
    kill:      fun ();
    getStatus: fun () -> int;
]

The following functions complete the user-visible thread and site library:

let yield = fun ()

which allows a thread to prematurely yield its scheduled execution quantum;

let getCurrentThread = fun () -> Thread

which returns the Thread object representing the currently executing thread;

let bind = fun (thr: vector(Thread))

which binds the vector of threads together with respect to placement decisions; and

let bind = fun (thr: Thread, obj: Any)

which binds the given thread to the object specified by the Any, with respect to placement decisions.

The scheduler has access to a specialised library that supports access to site information and the ability to invoke remote thread creation. This library is defined through the Site abstraction. The virtual site abstraction is the following view:

type Site is view [
    createThread: fun (fn: fun ()) -> Thread
]


The createThread function in this view creates a new suspended thread on the site, executing the code specified in the parameter. The following function returns a vector of all sites that are currently part of the system:

let getAllSites = fun () -> vector(Site)
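The per-site createThread queue and its servicing daemon (the cT queue of Figure 2) can be sketched as follows. This is a hedged Python model of the mechanism, not the ProcessBase library: the class name, the use of an in-process queue, and the local_create callback are assumptions.

# Hypothetical sketch of the per-site createThread queue: requests are
# enqueued on the target site and serviced by a daemon thread using the
# local thread-creation mechanism.
import queue
import threading

class SiteSketch:
    def __init__(self, name, local_create):
        self.name = name
        self.requests = queue.Queue()         # per-site queue of createThread requests
        self.local_create = local_create      # local thread-creation mechanism
        threading.Thread(target=self._daemon, daemon=True).start()

    def create_thread(self, closure):
        """Enqueue a request; the resulting thread starts in the suspended state."""
        done = threading.Event()
        box = {}
        self.requests.put((closure, box, done))
        done.wait()
        return box["thread"]

    def _daemon(self):
        while True:
            closure, box, done = self.requests.get()
            box["thread"] = self.local_create(closure)   # e.g. build a suspended thread
            done.set()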

Installation
Before the system can support threads or sites, the store must be initialised with library code containing an implementation of threads and sites. This is supported by running the (single-site, single-threaded) interpreter against the store in a special installation mode. In this mode, the instruction returning the root of persistence returns the real root of persistence, rather than the logical root of persistence as in normal execution. As a result, programs run in this mode can:
• Initialise or update the systemBoot function location in the root. This function is called when the interpreter (in normal mode) begins to run, and supports library-specific boot operations.
• In the closure of either of these functions, save a reference to the root or any other normally hidden object.

System Boot Process
A distributed ProcessBase system has the following constituent processes:
• A store.
• A distinguished interpreter.
• One or more subsidiary interpreters.
Booting the system involves the following steps, implemented outside ProcessBase:
• The store is started.
• The interpreters are started. The IP address of the store site and the number of interpreters (interpCount) are passed as command line parameters. Each interpreter blocks until sufficient interpreters have been initialised; this process is supervised by the distinguished interpreter.
• The distinguished interpreter creates the physical site table (PST) and assigns it into a field of the absolute root of persistence.
• The distinguished interpreter executes the ProcessBase boot phase by calling the systemBoot function stored in the absolute root of persistence. Execution of the systemBoot function is neither pre-emptable by other ProcessBase execution nor subject to scheduling.
• Computation begins.
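A minimal sketch of the distinguished interpreter's side of this boot sequence is given below. The registration call, the shape of the physical site table (PST) and the representation of the absolute root are assumptions made purely for illustration.

# Hypothetical sketch of the boot phase run by the distinguished interpreter.
def boot_distinguished(store, interp_count, system_boot):
    # Block until the expected number of interpreters has registered.
    sites = []
    while len(sites) < interp_count:
        sites.append(store.accept_interpreter())       # hypothetical registration call

    # Create the physical site table (PST) and assign it into the absolute root.
    store.absolute_root["PST"] = {site.name: site for site in sites}

    # ProcessBase boot phase: systemBoot runs to completion, un-pre-empted and unscheduled.
    system_boot(store.absolute_root)

    # Ordinary (scheduled) computation can now begin.
    return sites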

Addressing
We propose that each object be addressed by a location-independent, unique identifier. Newly created objects are named by their cache address (CA). A cache address is unique within the cache and is site dependent. When an inter-site reference to a new object is created, that object is then given a persistent identifier (PID). A PID is globally unique and does not specify a location. Both the new object's CA and its PID are valid addresses. During stabilisation (a global procedure) each remaining object is provided with a PID; all CAs are overwritten by PIDs.

If an object is resident in a physical site's cache, access to that object should, for efficiency reasons, go straight to that physical site rather than through a centralised service. For this reason, a mechanism must be provided that maps PIDs to site identifiers (SIDs) for all objects that are resident in a cache. An object that is resident only in the store does not require such a mapping. A SID cache is maintained on each physical site to provide these mappings. Mappings must also be provided for local comparisons between CAs and PIDs.

Additional Requirements
The system must support object duplication. Given an inter-site reference, a local site can import a copy of the referenced object to reduce dereference costs. Further requirements are imposed by the need to maintain coherency in duplicated mutable objects and to maintain referential integrity in the presence of a garbage collector. The implementation of a coherency protocol is simplified if we can specify and identify an owner site for an object in the distributed cache. To maintain referential integrity, the garbage collector requires that the system provide identity at the object level.

Design of Addressing Model
Comparisons
The ProcessBase virtual machine (VM) implements object equality testing through reference comparison. The proposed distributed system has two object addressing mechanisms:
• CA - cache address: the object's physical location in the local site's cache.
• PID - persistent identifier.
Comparisons between PIDs and between CAs are trivial; a comparison between a PID and a CA can be accomplished by comparing the PID to the PID held in the object referred to by the CA.

Dereferences
The cache performing a remote dereference of a PID needs to contact the store (or a coherency protocol) only once to find out the cache (if any) in which the object resides. After that point, the SID is stored in the location field with the PID, or in the PID to SID cache. All further dereferences can use the cached SID (unless the cache is small and the SID entry has expired). Once the remote cache is known, the PID is sent to that cache and a result is returned. The remote cache must look up the object's actual address from the PID.

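The dereference path described above can be sketched as follows. This is an illustrative Python model under assumed interfaces: the store's locate/fetch calls, the send_to_site callback and the class name are all hypothetical.

# Hypothetical sketch of a remote dereference: the store (or coherency
# protocol) is consulted once to locate the holding cache, after which the
# SID is cached locally and reused for later dereferences.
class RemoteDereferencer:
    def __init__(self, store, send_to_site):
        self.store = store                  # answers "which cache holds this PID?"
        self.send_to_site = send_to_site    # sends a PID to a remote cache, returns the object
        self.pid_to_sid = {}                # local PID -> SID cache

    def dereference(self, pid):
        sid = self.pid_to_sid.get(pid)
        if sid is None:
            sid = self.store.locate(pid)    # contact the store only on a miss
            if sid is not None:
                self.pid_to_sid[pid] = sid
        if sid is None:
            return self.store.fetch(pid)    # object resident only in the store
        return self.send_to_site(sid, pid)  # remote cache looks up its CA from the PID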

Address Translation
All nodes must perform two types of address translation: PID to CA and PID to SID.

PID to CA

Translations from PIDs to CAs and from PIDs to SIDs are similar: both require a hashing mechanism of some sort that has an entry for each object referred to by that type of address. The size of each table depends upon the frequency of checkpointing and the rate of garbage collection at that node: frequent checkpoints mean more objects are assigned PIDs and require room in the PID hash.

PID to SID at a Site
The PID to SID table needs to cache the SID for all PIDs commonly dereferenced at that site. The size required depends upon the fraction of references that are remote, and upon the fraction of those references that are to mutable (and therefore not duplicatable) objects. The hash need not be complete (it need not store all translations previously used): a smaller hash may cause some locations to be repeatedly resolved, which may or may not be an acceptable price for a significant reduction in the space required for this table.
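An incomplete PID to SID table of the kind described above could be realised as a bounded cache; the sketch below uses least-recently-used eviction, which is an assumption on our part since the design does not mandate any particular replacement policy.

# Hypothetical sketch of an incomplete PID -> SID table: a bounded cache
# whose misses simply force re-resolution via the store or coherency protocol.
from collections import OrderedDict

class PidToSidCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.table = OrderedDict()           # PID -> SID, ordered by recency of use

    def lookup(self, pid):
        sid = self.table.get(pid)
        if sid is not None:
            self.table.move_to_end(pid)      # mark as recently used
        return sid                           # None means the caller must re-resolve

    def insert(self, pid, sid):
        self.table[pid] = sid
        self.table.move_to_end(pid)
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)   # evict the least recently used entry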

PID to SID in the Store
Translating a PID to a SID can be performed either through coherency mechanisms, or through contact with the store (requiring serialisation of requests). In the latter case, the store must maintain a hash of objects that have been faulted out. When an object is first requested by a cache, the store remembers the cache that faulted the object. When another cache requests the object, the store can forward the request to the appropriate cache, which replies to the cache making the request.

Object Migration
When an object migrates (changes home node), the PID to SID caches at each site need to be updated to reflect the changed SID for that PID. On dereference, the old home site can either reply to the requesting cache, indicating that the object has moved on, or forward the request to what it believes is the new home site. Alternatively, the store (or coherency protocol) can notify all sites that an object has moved and cause the immediate update of the PID to SID caches at each site. The former option (forwarding pointers) requires that PID to cache address hash entries be maintained at each site for no-longer-local objects until the next checkpoint (when the PID to SID caches at other sites are flushed or updated atomically). The latter requires a degree of synchronicity in the motion of the object: dereferences of pointers to the object must be blocked while the object is in motion.
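The forwarding-pointer option can be sketched as below; the function signature, the dictionaries standing in for the local hashes, and the forward_to callback are illustrative assumptions rather than the design itself.

# Hypothetical sketch of forwarding pointers: an old home site keeps an
# entry for each no-longer-local object (until the next checkpoint) and
# forwards dereference requests to the believed new home.
def handle_dereference(local_objects, forwarded, pid, forward_to):
    """Handle a PID dereference arriving at a (possibly old) home site."""
    if pid in local_objects:
        return local_objects[pid]          # object is still resident here
    if pid in forwarded:
        new_home = forwarded[pid]          # forwarding entry kept until the next checkpoint
        return forward_to(new_home, pid)   # forward to the believed new home site
    return None                            # unknown here: the caller falls back to the store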

Summary
This document describes the design of a distributed ProcessBase architecture. Issues such as resource discovery, addressing, thread placement and scheduling are introduced, and initial suggestions as to their form and policy are made as a specification for a prototype system. A set of experiments is outlined; these experiments progress from an initial prototype with simplistic policies and protocols to an environment suitable for experimentation with collection algorithms and compliance. This document is intended as a starting point for further design: future implementations need not be bound by the decisions made here. Further, it is expected that, while undertaking these experiments, a greater understanding of the relevant issues will lead to more suitable designs and policies.

References

[ABC+83] Atkinson, M.P., Bailey, P.J., Chisholm, K.J., Cockshott, W.P. & Morrison, R. "An Approach to Persistent Programming". Computer Journal 26, 4 (1983), pp 360-365.

[Kir92] Kirby, G.N.C. Reflection and Hyper-Programming in Persistent Programming Systems. PhD Thesis, University of St Andrews, 1992.

[MBG+99] Morrison, R., Balasubramaniam, D., Greenwood, M., Kirby, G.N.C., Mayes, K., Munro, D.S. & Warboys, B.C. ProcessBase Reference Manual (Version 1.0.6). Universities of St Andrews and Manchester Report, 1999.

[MBG+00] Morrison, R., Balasubramaniam, D., Greenwood, R.M., Kirby, G.N.C., Mayes, K., Munro, D. & Warboys, B.C. "A Compliant Persistent Architecture". Software, Practice & Experience 30, 4 (2000), Special Issue on Persistent Object Systems, pp 363-386.

