A Detailed Description of Off++, a Distributed Adaptable µkernel

Francisco J. Ballesteros

Fabio Kon

Roy H. Campbell

Department of Computer Science, University of Illinois at Urbana-Champaign
Report No. UIUCDCS-R-97-2035, UILU-ENG-97-1748
August 1997

Abstract

The Off++ distributed adaptable µkernel is a minimal µkernel whose only task is to safely multiplex and export the distributed hardware present in the network. It is designed to be used as a basis for distributed user-level OS services. This technical report describes the design and implementation of Off++, the object-oriented redesign of the Off µkernel [1, 3]. Off++ extends the functionality provided by Off in order to provide basic support for the 2K operating system [5], which we are currently building. This document is meant to be the starting point for the literate implementation of the system.

Contents

1 Introduction
    1.1 Motivation
    1.2 The Off µkernel
        1.2.1 Shuttles. Distributed Process Services
        1.2.2 Portals. Interprocess communication
        1.2.3 D-TLBs. Distributed Memory Management
    1.3 Literate programming
    1.4 How to read this document
    1.5 Tools

2 System structure
    2.1 Exporting system objects
        2.1.1 Traps and the OO model
        2.1.2 Tying portals and methods together
    2.2 Resource building blocks
        2.2.1 Protection
        2.2.2 Reference counting
        2.2.3 Synchronizing system resources
        2.2.4 Sequencers
    2.3 System resources
        2.3.1 Resources for plain users
        2.3.2 Identifiers
        2.3.3 Containers and resource units
        2.3.4 Resource availability
        2.3.5 Architecture awareness support
        2.3.6 Resource Allocators
        2.3.7 Allocation statistics
        2.3.8 Fixed and Block allocators
        2.3.9 Resource revocation
    2.4 Hardware resources
        2.4.1 Hardware resources for plain users
    2.5 Abstract resources
        2.5.1 Abstract resources for plain users
    2.6 Domains and resource allocation

3 The node
    3.1 The node interface
    3.2 Navigation
    3.3 Miscellaneous
    3.4 Nodes for plain users
    3.5 System booting

4 Exporting the hardware
    4.1 Memory banks
        4.1.1 Memory banks and page frames for plain users
    4.2 Input/Output
        4.2.1 Input/Output for plain users
    4.3 Traps and interrupts
        4.3.1 Traps and interrupts for plain users
    4.4 DMA lines
        4.4.1 DMA lines for plain users
    4.5 Processors and processor pools
        4.5.1 Processors and processor pools for plain users

5 Implementing abstract resources
    5.1 Shuttles
        5.1.1 Shuttles for plain users
        5.1.2 Shuttle properties
        5.1.3 Shuttles and hardware events
    5.2 Portals
        5.2.1 Portals for plain users
    5.3 Distributed Memory Managers
        5.3.1 DTLBs for plain users

A Index of Chunks

B Index of Identifiers

List of Figures

2.1 System calls
2.2 Initial resource hierarchy (not used)
2.3 Basic resource hierarchy
2.4 Containers and allocators
2.5 Architecture awareness support: Navigators and Inspectors
2.6 Storage of Portals for two different applications
2.7 Hardware resource containers and elementary units
2.8 System abstract resource containers and elementary units
5.1 Relationship among Shuttles, Portals, DTLBs and other resources
5.2 Shuttle activations: system calls and up-calls

Chapter 1

Introduction

1.1 Motivation

The well-known definition of an operating system is “the software that securely abstracts and multiplexes physical resources” [19]. Nothing in this definition requires those resources to be contained in a single node. So why are our distributed operating systems based on microkernels which essentially multiplex just local resources? Obviously, system services can be distributed later when using a (centralized) microkernel. Indeed, that can be done even when using a monolithic system [16]. But this does not solve the actual problem: the system itself is not distributed and does not transparently multiplex both local and remote resources.

Secondly, a major drawback of current distributed operating systems is their lack of adaptability. It is known that adaptability can be achieved using a minimal microkernel as a foundation for the operating system [8, 7, 4, 10]. If the microkernel is centralized, adapting system services to particular requirements may harm the distribution of those services, because they are distributed on top of the microkernel and this distribution is not supported by the microkernel itself. The reason is that user extensions may fail to preserve properties found in the distributed hardware, and there will be no lower layer supporting them. If the microkernel is itself distributed, even simple system extensions may still benefit from system distribution because the lower layer is already distributed. This problem is evident from the fact that existing system services must be modified and/or re-implemented to add new distributed services to a typical microkernel-based distributed system [6, 20, 13].

The 2K Operating System is being built to explore the combination of distribution and adaptability. 2K is built from two main pieces:

• A distributed adaptable µkernel (DAMN¹) exporting the hardware resources available in the network without imposing any particular OS structure.
• A customizable ORB used to plug in OS services constructed as a set of distributed objects.

¹ Distributed Adaptable Micro-Nucleus

We expect the combination of both elements to yield a simple, easy-to-use, distributed and flexible OS. By adopting a strict architecture self-awareness philosophy, we will allow system and user modules to be conscious of their own physical and logical architecture. The system will thus be able to adapt itself, optimizing its performance and reliability.

This document describes the object-oriented redesign of the Off µkernel [1, 3, 2], named Off++, and is meant to be the starting point for its literate implementation (see section 1.3). Off++ also adds new functionality to the µkernel in order to provide basic support for 2K. This new functionality includes support for architecture self-awareness, system object browsing and inspection (see subsections 2.3.5 and 2.3.7), and flexible user-defined domains (see section 2.6). Multiple protection domains, multi-user operation, and multitasking are optional features which can be omitted in those nodes where they are not useful.

1.2 The Off µkernel

The three basic abstractions provided by Off (shuttles, portals and DTLBs) are described briefly in this section so that the reader can better understand the following chapters.

1.2.1 Shuttles. Distributed Process Services

Process services are quite simple in Off. The only abstraction provided is the Shuttle. A shuttle is an extensible hardware context. Initially, it consists only of a program counter and a stack pointer, though it can be safely extended later on to include other pieces of context or properties (such as general purpose registers, address spaces, privilege levels, etc.).

1.2.2 Portals. Interprocess communication

The basic interprocess communication mechanism provided by Off is the Portal. A Portal can be seen as a “distributed interrupt line” or a “network-wide gate”. Portals have the basic utility of interrupts, i.e. they can be invoked transparently to make a handler perform some task. Portals do not implement buffering and do not specify whether synchronous messages, asynchronous messages or RPCs are to be used. The mechanisms and policies used to locate and invoke portals over the network are left up to the user. User-provided transport and location protocols are used to extend the portal mechanism transparently over the network. In this way traps, interrupts, exceptions and user messages can be accessed and adapted throughout the whole system.

1.2.3 D-TLBs. Distributed Memory Management

The only thing done by Off is to export the network hardware, and this also holds for memory management. Off implements a Distributed Software TLB. The user can establish translations from virtual to distributed physical memory addresses, and the address translation hardware is safely multiplexed by Off among the competing applications.

1.3 Literate programming

As this document is a hard copy of a literate program [14], we think it is worth saying something about this technique. We do so by citing the comp.programming.literate newsgroup FAQ:

“Literate programming is the combination of documentation and source together in a fashion suited for reading by human beings. In fact, literate programs should be enjoyable reading, even inviting! [...] In general, literate programs combine source and documentation in a single file. Literate programming tools then parse the file to produce either readable documentation or compilable source. The WEB style of literate programming was created by D.E. Knuth during the development of his TeX typesetting software.

All the original work revolves around a particular literate programming tool called WEB. Knuth says:

The philosophy behind WEB is that an experienced system programmer, who wants to provide the best possible documentation of his or her software products, needs two things simultaneously: a language like TeX for formatting, and a language like C for programming. Neither type of language can provide the best documentation by itself; but when both are appropriately combined, we obtain a system that is much more useful than either language separately.

The structure of a software program may be thought of as a web that is made up of many interconnected pieces. To document such a program we want to explain each individual part of the web and how it relates to its neighbours. The typographic tools provided by TeX give us an opportunity to explain the local structure of each part by making that structure visible, and the programming tools provided by languages such as C or Fortran make it possible for us to specify the algorithms formally and unambiguously.

By combining the two, we can develop a style of programming that maximizes our ability to perceive the structure of a complex piece of software, and at the same time the documented programs can be mechanically translated into a working software system that matches the documentation.”

What follows has been written in a mixture of LaTeX and C++. The noweb literate programming tool has been used to obtain a LaTeX-formatted version of the document as well as the C++ code, which can later be compiled.

1.4 How to read this document

In what follows, each description of a part of the system may be followed by a chunk of code like this one:

7a  ⟨Example chunk of code. 7a⟩≡
        class A_Class { ... };
    This code is used in chunk 7b.
    Defines: A_Class, used in chunk 7b.

In this example, the name of the chunk is “Example chunk of code.” The number to the right of the chunk name, also found in the left margin, identifies the chunk and is used in cross-references. We will refer to this number as the “chunk number”. Also, when an interesting entity is defined by a chunk (A_Class in the example), this is stated below the chunk in a small font. The chunk number is made of the page number (identifying the page on which the chunk is defined) and, if necessary, a letter to distinguish between different chunks on the same page. Using this number, one can quickly locate any chunk in the document.

Chunks of code can include other chunks. In this case, you can use the number enclosed with the chunk name between angle brackets (i.e. “⟨” and “⟩”) to quickly locate the code for the chunk being included. In this example, the chunk shown above is included in chunk 7b; we then use the chunk numbers found in the left margin to locate it.

7b  ⟨A chunk including another. 7b⟩≡
        // Some code...
        ⟨Example chunk of code. 7a⟩
        // And we also use A_Class
        A_Class an_object;
    Root chunk (not used in this document).
    Uses A_Class 7a.

Finally, a chunk may be continued in a different part of the document. In this case it is stated explicitly below the chunk in a small font, as can be seen below.

7c  ⟨Continuing chunk of code. 7c⟩≡
        //Mary had
    This definition is continued in chunk 7d.
    Root chunk (not used in this document).

7d  ⟨Continuing chunk of code. 7c⟩+≡
        //a little lamb.

Note the use of the “+” to state that the chunk is a continuation.

Finally, we suggest reading this document as it is. Do not try to read it in the order understood by a C++ compiler. The references below each chunk will allow you to navigate through this program’s web.

1.5 Tools

Several tools have been used to implement Off++. They are all you need to build a system image from the on-line source for this document.

• LaTeX [15] has been used to typeset documents, including the printed version of the literate source code.
• C++ [18] is the programming language employed. The GNU C and C++ compiler, GCC, is the one we use.
• As a support for literate programming, noweb [17] has proven to be a nice tool.
• The development environment is still a GNU-based Linux system known as RedHat GNU/Linux.
• Last, but not least (indeed, we should write “most”), we use the OSKit [9] for most of the low-level hardware glue code.

All these tools were already used in the construction of the original Off prototype.

Chapter 2

System structure

2.1 Exporting system objects

System services are made of a collection of system objects exported to users. The model is actually more generic: portals are used whenever a protection domain wants to export certain services in a secure way to alien users. For each object being exported, a set of methods is made available by means of portals.

2.1.1 Traps and the OO model

The Off++ µkernel is an object-oriented system. However, the boundary between the user and the kernel is procedural: portals are used to perform system calls (implemented in turn by means of traps), as can be seen in figure 2.1.

[Figure 2.1: System calls. Labels: application objects; system object wrappers; trap; portal calls; Off++ microkernel; kernel objects.]

The path for a system call proceeds as follows:

1. A user object calls a system object method.
2. The system object is actually a wrapper that translates the method invocation into a portal invocation.
3. The portal invocation transfers the user control flow to the kernel.
4. Inside the kernel, a system object method (the one wrapped by the wrapper object in step 1) is called as part of the portal delivery mechanism.

Only step 3 is procedural. Both user and kernel objects behave as if they were simply calling other objects, but whenever a protection domain has to be crossed, a portal is used. In this way, no remote method invocation has to be built inside the kernel, which can therefore be kept simple. Thus, the only actual system calls are those needed to implement the portal delivery mechanism. The remaining system services are first provided by means of portals and then wrapped at user level by objects.
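The four-step path can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the names `portal_invoke`, `KernelCounter`, `CounterWrapper` and `demo_system_call` are hypothetical, and the trap of step 3 is simulated by an ordinary function call through a table.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>

// Hypothetical portal table: portal id -> kernel-side handler.
// In a real kernel the transfer would be a trap, not a function call.
using portal_id_t = std::uint32_t;
using handler_t = std::function<int(int)>;

static std::map<portal_id_t, handler_t> g_portals;

// Step 3: the only procedural step. Returns -1 if the portal is unknown.
int portal_invoke(portal_id_t id, int arg) {
    auto it = g_portals.find(id);
    return it == g_portals.end() ? -1 : it->second(arg);
}

// Step 4: the kernel object whose method is finally run.
class KernelCounter {
    int total_ = 0;
public:
    int add(int n) { total_ += n; return total_; }
};

// Steps 1-2: user-level wrapper exposing the same method as the kernel object.
class CounterWrapper {
    portal_id_t portal_;
public:
    explicit CounterWrapper(portal_id_t p) : portal_(p) {}
    int add(int n) { return portal_invoke(portal_, n); }
};

// Binds one kernel object to portal 7 and calls it twice through the wrapper.
int demo_system_call() {
    static KernelCounter kernel_obj;
    g_portals[7] = [](int n) { return kernel_obj.add(n); };
    CounterWrapper user_obj(7);
    user_obj.add(2);
    return user_obj.add(3);
}
```

The user code only ever sees `CounterWrapper::add`; the portal table is the single place where the protection boundary would be crossed.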

2.1.2 Tying portals and methods together

Whenever an object wants to export part of its interface to the outer world it must create one portal per method. To automate the task, each class with methods being exported should be labeled ENTRY. In this way a compiler can generate the glue code between the portals and the methods being exported. All the methods of such a class are considered to be exported to the user. In particular, the compiler will automatically generate the implementation of this method:

10  ⟨Off private methods for exported objects. 10⟩≡
        // Creates portals to access class methods
        void export(void);
    Root chunk (not used in this document).

The code generated for export will create one portal per method. The portal handler will be set up so that methods are invoked transparently: the program counter will point to the method code, and the maximum expected number of message words will be set to the size on the stack of the method arguments. The task of extracting the arguments is done automatically by the portal invocation mechanism and the compiler, because portal messages are always copied from the (caller) stack to the (callee) stack.

Exported methods are expected to return a non-zero value carrying the error code if the invocation fails, or zero otherwise. This can be used by user-level wrappers to raise exceptions. The error code is returned by delivering a “system call failed” (SCERR) event to the shuttle issuing the system call. The handler for that event may set the error code inside the user application so that an exception can be raised. The benefit of using events for error notification is that we save a register and permit “optimistic” system usage, where most of the time the system does not need to return anything to the user.

It should be noted that the object constructor must call export, or nothing will actually be exported. Portal identifiers are not stored because the order in which exported system objects are created is always the same. Thus, the algorithm used to construct the portal identifiers ensures that, given the node identifier, portal numbers will always be the same; that is, they are node-wide constants.

Although it is not part of the kernel, the compiler mentioned above can generate user wrappers too. These wrappers are objects with the same signature as the system object being exported, but with just those methods labeled ENTRY. Each wrapper method simply invokes the kernel portal and raises an exception on portal return whenever the return code is non-zero. Note that many of these wrappers are to be built dynamically as users obtain access to system resources. They can cache the identifier of the resource being wrapped and use that identifier to perform system calls.

To make it clear which system services are exported to the user, and also to avoid in-kernel authentication (e.g. unnecessary access checks when operations are requested by the kernel itself), we will wrap every object being exported to the user. Those wrappers will be named with the prefix off_u¹ and will be described right after the objects they wrap, to make it clear which methods are exported to the user.
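A generated wrapper of the kind described above might look roughly like the following sketch. Everything here is an assumption made for illustration: `KernelObject`, `UserWrapper`, `off_SyscallError` and `try_set` are hypothetical names, the SCERR event is modeled as a plain return value, and a table of member pointers stands in for the portals. Since `export` is a reserved word in standard C++, the sketch uses the hypothetical name `export_methods`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical kernel-side object. Exported methods return 0 on success
// and a non-zero error code on failure, as the text describes.
class KernelObject {
public:
    int set_value(int v) {
        if (v < 0) return 22;   // arbitrary non-zero error code
        value_ = v;
        return 0;
    }
    int value() const { return value_; }
private:
    int value_ = 0;
};

// One "portal" per exported method: here just a table of member pointers.
using method_t = int (KernelObject::*)(int);
std::vector<method_t> export_methods() {
    return { &KernelObject::set_value };
}

// Hypothetical exception type raised by user-level wrappers.
struct off_SyscallError : std::runtime_error {
    int code;
    explicit off_SyscallError(int c)
        : std::runtime_error("system call failed: " + std::to_string(c)), code(c) {}
};

// User wrapper: raises an exception whenever the return code is non-zero.
class UserWrapper {
public:
    explicit UserWrapper(KernelObject &k) : kernel_(k), portals_(export_methods()) {}
    void set_value(int v) {
        int err = (kernel_.*portals_[0])(v);  // stands in for a portal invocation
        if (err != 0) throw off_SyscallError(err);
    }
private:
    KernelObject &kernel_;
    std::vector<method_t> portals_;
};

// Returns the error code seen by the caller, or 0 if the call succeeded.
int try_set(UserWrapper &w, int v) {
    try { w.set_value(v); return 0; }
    catch (const off_SyscallError &e) { return e.code; }
}
```

The successful path returns nothing to the caller, which mirrors the “optimistic” usage the text argues for; only the failing path carries data back.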

2.2 Resource building blocks

There are some basic interfaces which must be present on almost every system object. They were initially inherited by the Resource² class, which models system resources, as can be seen in figure 2.2, but we will be using aggregation instead (see figure 2.3) to avoid multiple inheritance³.

¹ The “u” stands for “user”.
² Although in the code we will be using the prefix off_, we will omit it in this text.
³ Fabio seems to hate it.

[Figure 2.2: Initial resource hierarchy (not used). Classes shown: AccessChecked, RefCounted, Lockable, Resource (freeze(), melt(), new(), delete(), dump()), CompResource, Node, ResUnit.]

2.2.1 Protection

To protect object access we define a generic set of access operations (read, write, execute, delete, and protect in the current implementation⁴) and for each one we have a Protection object specifying the protection for that operation.

12  ⟨Off access operations. 12⟩≡
        // Operations which can be protected.
        enum off_op_t { OFF_OP_R=0x1, OFF_OP_W=0x2, OFF_OP_X=0x4,
                        OFF_OP_D=0x8, OFF_OP_P=0x10 };
        const natural_t OFF_NOPS = 5;
    Root chunk (not used in this document).
    Defines: off_op_t, used in chunk 13.

⁴ Although nothing prevents future extensions to consider per-resource operations.

[Figure 2.3: Basic resource hierarchy. Classes shown: Resource (freeze(), melt(), new(), delete(), dump()), RefCounter, AccessChecker, Protection, CompResource, Lock, ResUnit, Node, Sequencer.]

Individual operations can be combined (usually by a bit-or operation) to specify an access mode (e.g. OFF_OP_R|OFF_OP_X for the “read and execute” access mode). Also, note that natural_t and the other basic types used throughout the document have to be defined in a portable way (i.e. ensuring that they all have the same size across different platforms).

13  ⟨Off access mode. 13⟩≡
        typedef off_op_t off_mode_t;
    Root chunk (not used in this document).
    Defines: off_mode_t, used in chunks 14, 15a, 50a, 70, 72b, 74b, and 75b.
    Uses off_op_t 12.
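As a small illustration of combining operations into an access mode: in C++ the expression OFF_OP_R|OFF_OP_X has type int, so it cannot be assigned to an off_mode_t without a conversion. The sketch below assumes a helper `operator|` and a helper `mode_includes`; neither appears in the original code.

```cpp
#include <cassert>

// Same values as the access-operation chunk above.
enum off_op_t { OFF_OP_R = 0x1, OFF_OP_W = 0x2, OFF_OP_X = 0x4,
                OFF_OP_D = 0x8, OFF_OP_P = 0x10 };
typedef off_op_t off_mode_t;

// Assumed helper (not in the original): bit-or that keeps the enum type.
inline off_mode_t operator|(off_op_t a, off_op_t b) {
    return static_cast<off_mode_t>(static_cast<unsigned>(a) |
                                   static_cast<unsigned>(b));
}

// Assumed helper: true if every operation in `wanted` is present in `mode`.
inline bool mode_includes(off_mode_t mode, off_mode_t wanted) {
    return (static_cast<unsigned>(mode) & static_cast<unsigned>(wanted))
           == static_cast<unsigned>(wanted);
}
```

With these helpers, `off_mode_t rx = OFF_OP_R | OFF_OP_X;` compiles and `mode_includes(rx, OFF_OP_W)` is false.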

Three different objects are involved in protection:

• A Protection object specifies the protection of a given Resource for different access modes.
• A Rights object shows the access rights held by the user accessing the resource.
• AccessChecker objects implement the policy by which access is granted or denied for a given Protection, access Rights and access mode.

The meaning of Protection is thus defined by the AccessChecker object. Every time the system wants to know whether access is granted or denied, it calls the resource's access_check method (which is delegated to the AccessChecker together with the user Rights and access mode). To change the protection, the (also delegated) method protect is called instead.

14a ⟨Off access checker. 14a⟩≡
        // Checks access for user operations on kernel resources
        //
        class off_AccessChecker {
        public:
            // Checks for rights for the given access mode
            boolean_t access_check( off_mode_t m, const off_Rights &r,
                                    const off_Protection &p ) const;

            // Change protection
            void protect(off_Protection &old, const off_Protection &new_prot,
                         off_mode_t m);
        };
    Root chunk (not used in this document).
    Uses off_mode_t 13.

Resources that need to be protected will delegate their protection control to a system-wide AccessChecker. Different implementations of the off_Protection interface will provide various protection mechanisms, including access control lists and capabilities. The protection information will be kept in _prot⁵ inside each object being protected.

14b ⟨Off private members for protected objects. 14b⟩≡
        off_Protection _prot; // protection for this resource
    This code is used in chunk 18a.

14c ⟨Off public methods for protected objects. 14c⟩≡
        inline boolean_t access_check( off_mode_t m, const off_Rights &r );
        inline void protect(const off_Protection &p, off_mode_t m);
    This code is used in chunk 18a.
    Uses off_mode_t 13.

⁵ We will prefix any data member of a class with the initial letter of the class and an underscore (“_”). Those members which may be in more than one class, like the reference counter, will be prefixed by just an underscore.

The second one is exported to system users.

15a ⟨Other public methods of off_uResource. 15a⟩≡
        void protect(const off_Protection &p, off_mode_t m, const off_Rights &r);
    This definition is continued in chunk 24b.
    This code is used in chunk 19a.
    Uses off_mode_t 13.

The specific Protection mechanism being used can be changed without disturbing the rest of the code. Once the best protection mechanism is known empirically, the system can be optimized by hardwiring it and avoiding dynamic dispatch.

Initially, a Protection is defined as a big random number per operation. A Rights object is also a big random number. They are used as capabilities. The AccessChecker considers access to be granted if Rights matches the numbers found in Protection for the access mode specified. We believe that any other protection model can be quickly incorporated.

Finally, it should be clear that every system object exporting a subset of methods to the user is responsible for calling access_check with the appropriate access mode at the beginning of every exported operation. Otherwise protection will not take effect.
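One possible reading of this capability scheme is sketched below. The layout is an assumption: `Protection` is taken to hold one 64-bit key per operation, `Rights` a single key, and access is granted only if the key matches for every operation selected in the mode. The original types off_Protection and off_Rights are not defined in this part of the document, so the struct names and fields here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Assumed layout: one secret number per protectable operation (OFF_NOPS = 5).
const unsigned OFF_NOPS = 5;
struct Protection { std::array<std::uint64_t, OFF_NOPS> key; };
struct Rights { std::uint64_t key; };

// Grants access only if the rights match the key of every operation
// requested in `mode` (bit i of mode selects operation i).
bool access_check(unsigned mode, const Rights &r, const Protection &p) {
    if (mode == 0 || (mode >> OFF_NOPS) != 0) return false;  // empty or unknown bits
    for (unsigned i = 0; i < OFF_NOPS; ++i)
        if ((mode & (1u << i)) && p.key[i] != r.key) return false;
    return true;
}
```

An exported operation would begin with such a check and return an error code when it fails, as the previous paragraph requires.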

2.2.2 Reference counting

Resources exported by the kernel will also be used by other kernel resources (e.g. page frames are used by address translations). To avoid deleting a resource while it is being used, and to provide basic support for garbage collection, we employ reference counters (RefCounter objects). When a RefCounter drops to zero, an UNUSED exception may be raised to notify users of unreferenced (maybe unused) resources. This feature is actually included as a built-in service in every Resource, so that we can save some typing and execution time by using foo->reference() instead of foo->reference_counter->reference(). The implementation is so simple that we have folded and inlined it.

15b ⟨Off public methods for reference counting objects. 15b⟩≡
        // References (unreferences) an object
        void reference( void ) { _rc++; };
        void unreference( void ) { if (!--_rc) unused(); };
        natural_t get_num_refs(void) { return _rc; };
    This code is used in chunk 18a.

The only subtle point is that we will use the virtual method unused to raise any exception and then destroy the object.

15c ⟨Off protected methods for reference counting objects. 15c⟩≡
        // Raises UNUSED exceptions and destroys the object
        virtual void unused(void) {;}
    This code is used in chunk 18a.

And of course...

15d ⟨Off private members for reference counting objects. 15d⟩≡
        natural_t _rc; // # of references
    This code is used in chunk 18a.
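Put together, the three reference counting chunks behave as in the following self-contained sketch. The class names `RefCounted` and `PageFrame` and the definition of `natural_t` are assumptions; the derived class merely records the notification instead of raising an UNUSED exception or destroying itself.

```cpp
#include <cassert>

typedef unsigned long natural_t;  // assumed definition

class RefCounted {
public:
    virtual ~RefCounted() {}
    void reference() { _rc++; }
    void unreference() { if (!--_rc) unused(); }
    natural_t get_num_refs() const { return _rc; }
protected:
    // Hook called when the last reference goes away.
    virtual void unused() {}
private:
    natural_t _rc = 0;  // # of references
};

// Hypothetical resource that only records that it became unused.
class PageFrame : public RefCounted {
public:
    bool was_unused = false;
protected:
    void unused() override { was_unused = true; }
};
```

Note that, as in the original chunk, calling `unreference()` on an object whose counter is already zero would wrap the unsigned counter; callers are expected to pair every `unreference()` with an earlier `reference()`.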

2.2.3 Synchronizing system resources

To synchronize access to system objects, we will use spin locks and bureaucracy⁶ most of the time.

Locks

The access pattern to system objects in a system call is:

1. Call make_available() to bring any needed remote object here.
2. With the required objects in the local node, lock them either for reading or for writing (the usual multiple-readers, single-writer semantics apply).
3. Perform the operation (which should now be non-blocking) and return.

Thus, resources must include a read/write lock. The member variable _lock will be included in every “lockable” object.

16a ⟨Off private members for lockable objects. 16a⟩≡
        rw_lock_t _lock; // The lock
    This code is used in chunk 18a.

Again, we include this as built-in functionality to save some typing and time.

16b ⟨Off public methods for lockable objects. 16b⟩≡
        // Spin (un)lock on this for reading
        lock_state_t r_lock(void);
        void r_unlock(lock_state_t old_state);
        // Spin (un)lock on this for writing
        lock_state_t w_lock(void);
        void w_unlock(lock_state_t old_state);
    This code is used in chunk 18a.

⁶ As it is done in real life.

The value returned by the lock routines (the lock state) holds the state to be restored when the resource is released through the unlock routines, e.g. the interrupt mask (are interrupts enabled or disabled?) and any other piece of “global” state changed by the lock.

Kernel bureaucracy

In order to wait for a resource to reach a given state, or to do long-term waiting due to resources being temporarily unavailable, we use Bureaucrats. Each resource that may cause long-term waiting (abstract resources) will provide one or more Bureaucrats so that shuttles block on them until some new state is reached. Bureaucrats can be understood as condition variables but can be implemented by other means. The use of condition variables is not an obstacle to system distribution because we will be using lists of identifiers (and not lists of objects) to represent the set of shuttles waiting on condition variables. Thus a single list may span several nodes.

17a ⟨Off bureaucrat. 17a⟩≡
        class off_Bureaucrat {
        public:
            // Puts 'who' to sleep on this bureaucrat.
            void wait(shtl_id_t who);
            // Signals this condition. Just one will be awakened.
            void signal(void);
            // Signals this condition. Everyone will be awakened.
            void signal_all(void);
        };
    Root chunk (not used in this document).
    Defines: off_Bureaucrat, used in chunk 71.
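The idea of queuing identifiers rather than objects can be sketched as follows. This is a minimal single-threaded model, not the kernel implementation: the class name `Bureaucrat`, the integer `shtl_id_t`, the FIFO wake-up order, the `awakened()` accessor and the returned counts are all assumptions added so that the behavior can be observed (the original `signal` and `signal_all` return void).

```cpp
#include <cassert>
#include <deque>

typedef unsigned shtl_id_t;  // assumed: shuttle identifiers are plain integers

// Keeps identifiers, not object pointers, so the waiters may live on any node.
class Bureaucrat {
public:
    void wait(shtl_id_t who) { waiting_.push_back(who); }

    // Wakes one waiter (FIFO); returns how many were awakened.
    unsigned signal() {
        if (waiting_.empty()) return 0;
        wake(waiting_.front());
        waiting_.pop_front();
        return 1;
    }

    // Wakes every waiter; returns how many were awakened.
    unsigned signal_all() {
        unsigned n = 0;
        while (signal()) ++n;
        return n;
    }

    const std::deque<shtl_id_t> &awakened() const { return awake_; }
private:
    // Stand-in for making the shuttle runnable again (possibly remotely).
    void wake(shtl_id_t who) { awake_.push_back(who); }

    std::deque<shtl_id_t> waiting_;
    std::deque<shtl_id_t> awake_;
};
```

Because only identifiers are stored, the `wake` step is the single place where a remote node would have to be contacted.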

2.2.4 Sequencers

Some objects will need to atomically obtain unique sequence numbers to create unique identifiers (for other objects contained inside). Again, this functionality is incorporated into the aggregate.

17b ⟨Off private members for sequencing objects. 17b⟩≡
        off_seq_t _seq; // Next sequence nb.
    Root chunk (not used in this document).

17c ⟨Off protected methods for sequencing objects. 17c⟩≡
        // Gets a new sequence number. Uses lock() to ensure atomicity
        off_seq_t get_seq(void);
    Root chunk (not used in this document).
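A sequencer with the same contract can be sketched as shown below. Note the swapped-in technique: the original comment says get_seq uses the resource lock to ensure atomicity, whereas this sketch uses an atomic fetch-and-add, which gives the same guarantee that no two callers obtain the same number. The class name `Sequencer` and the 32-bit width of `off_seq_t` are assumptions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

typedef std::uint32_t off_seq_t;  // assumed width

class Sequencer {
public:
    // Returns the current sequence number and advances the counter atomically.
    off_seq_t get_seq() { return _seq.fetch_add(1); }
private:
    std::atomic<off_seq_t> _seq{0};  // Next sequence nb.
};
```

Numbers are handed out starting from zero; uniqueness holds until the counter wraps around.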

2.3 System resources

Each resource in the system supports a set of well-defined operations: those provided by the building blocks we have seen before, and a few others, namely freeze, to get the state of the object and defer any further invocations; melt, to recreate a frozen object; and dump, which returns a printable representation of the resource state.

⟨Off resource. 18a⟩≡
  class off_Resource {
  private:
    ⟨Off private members for protected objects. 14b⟩
    ⟨Off private members for reference counting objects. 15d⟩
    ⟨Off private members for lockable objects. 16a⟩
    ⟨Other private members of off_Resource. 18b⟩
  protected:
    ⟨Off protected methods for reference counting objects. 15c⟩
  public:
    ⟨Off public methods for lockable objects. 16b⟩
    ⟨Off public methods for reference counting objects. 15b⟩
    ⟨Off public methods for protected objects. 14c⟩
    // Every resource (be it simple or compound) is able to be
    // dumped for inspection and frozen/melted.
    virtual void *freeze(void *to, size_t size)=0;
    virtual void melt(void *from, size_t size)=0;
    inline boolean_t is_frozen(void);
    virtual char *dump(char *buf, size_t size) const;
    ⟨Other public methods of off_Resource. 24a⟩
  };
Root chunk (not used in this document). Defines: off_Resource, used in chunks 19–21, 25, and 36.
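To make the freeze/melt/dump protocol concrete, here is a minimal sketch of a resource with one integer of state. Everything except the method names freeze, melt, is_frozen and dump is hypothetical: the class CounterResource, the meaning of freeze's return value, the const qualifier on melt's buffer, and the choice to model "defer any further invocations" by simply refusing work while frozen are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical resource holding one integer of state.
class CounterResource {
public:
    explicit CounterResource(int v = 0) : value_(v), frozen_(false) {}

    // Copies the state into 'to' and marks the object frozen.
    // Returns the first byte past the saved image, or nullptr if it
    // does not fit (this return convention is an assumption).
    void *freeze(void *to, std::size_t size) {
        if (size < sizeof value_) return nullptr;
        std::memcpy(to, &value_, sizeof value_);
        frozen_ = true;
        return static_cast<char *>(to) + sizeof value_;
    }

    // Recreates the state from an image produced by freeze().
    void melt(const void *from, std::size_t size) {
        if (size < sizeof value_) return;
        std::memcpy(&value_, from, sizeof value_);
        frozen_ = false;
    }

    bool is_frozen() const { return frozen_; }

    // Writes a printable representation of the state into 'buf'.
    char *dump(char *buf, std::size_t size) const {
        std::snprintf(buf, size, "counter=%d frozen=%d", value_, frozen_ ? 1 : 0);
        return buf;
    }

    // Ordinary operation; refused while the object is frozen.
    bool increment() {
        if (frozen_) return false;
        ++value_;
        return true;
    }

private:
    int value_;
    bool frozen_;
};
```

A frozen image can be melted into a different object, which is the property that lets a resource be recreated elsewhere.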

Also, we should mention that some resource members are initialized at allocation time (with the information provided by system users). That information includes the protection for the resource and an identifier for the user-level entity responsible for the object after its allocation; we call that entity the object domain. Since the Portal is the IPC mechanism provided by the kernel, the domain identifier is in fact a portal identifier of type off_prtl_id_t. This lets the kernel perform upcalls to either user-level or in-kernel domain servers, which will be responsible for allocated system resources. Besides, the meaning of the domain abstraction is defined by the user and not by the kernel, as will be discussed in section 2.6.

⟨Other private members of off_Resource. 18b⟩≡
  off_prtl_id_t r_domain; // Domain for the resource
This code is used in chunk 18a.


2.3.1 Resources for plain users

These are the common entry points for every system resource.

⟨Off resource for users. 19a⟩≡
  ENTRY class off_uResource {
  private:
    off_Resource &u_resource;
  public:
    void *freeze(void *to, size_t size, const off_Rights &r);
    void melt(void *from, size_t size, const off_Rights &r);
    boolean_t is_frozen(const off_Rights &r);
    char *dump(char *buf, size_t size, const off_Rights &r) const;
    ⟨Other public methods of off_uResource. 15a⟩
  };
Root chunk (not used in this document). Defines: off_uResource, used in chunks 22 and 38b. Uses off_Resource 18a.

2.3.2 Identifiers

Off uses two kinds of identifiers:

• off_id_t, composed of creation node:sequence number:implementor index and used for mobile objects (including resource containers and abstract resources),
• and off_eu_id_t, composed of off_id_t:name-used-in-hw, for the elementary hardware units being exported.

The size of each field is defined as shown.

⟨Off identifiers. 19b⟩≡
  typedef unsigned16_t  off_node_t;   // Node descriptor
  typedef unsigned32_t  off_seq_t;    // Sequence number
  typedef unsigned16_t  off_slot_t;   // Slot number
  typedef unsigned long off_offset_t; // Elementary unit offset number
This definition is continued in chunk 20a. Root chunk (not used in this document).

Thus, identifiers are typed:

⟨Off identifiers. 19b⟩+≡
  // Relocatable top level identifier
  struct off_id_t {
    off_node_t i_node : 16; // Creation node for the object
    off_seq_t  i_seq  : 32; // Sequence number
    off_slot_t i_slot : 16; // Implementor's descriptor for the object
  };
  // Hardware resource elementary unit identifier
  struct off_eu_id_t {
    off_id_t     e_container; // id for its container
    off_offset_t e_offset;    // Object id relative to the container.
  };
Defines: off_eu_id_t, used in chunks 25, 31, 32, 41, 42, 44c, 49c, and 51c. off_id_t, used in chunks 20b, 22a, 25, 26a, 33, 34b, 36, 59, 62, and 64–66.

2.3.3 Containers and resource units

Resources are arranged as recursive containers, with the node as the outermost one. These containers delegate allocation to an Allocator object which can (and will) be further wrapped⁷ to delegate usage statistics to a bookkeeper. See figure 2.4.

⟨Off compound resource. 20b⟩≡
  // Compound resource. Uses an allocator to allocate resource units.
  class off_CompResource : public off_Resource {
  private:
    // Allocator used for this container.
    off_Allocator *c_pool; // Resource unit pool
    ⟨Other private members of off_CompResource. (never defined)⟩
  public:
    // Named with 'relocatable' identifiers
    off_id_t get_id(void) const;
    // Gives a pointer to the container's allocator
    off_Allocator *get_allocator(void) const;
  };
Root chunk (not used in this document). Defines: off_CompResource, used in chunks 31, 33a, 65c, and 67. Uses off_Allocator 27b, off_id_t 20a, and off_Resource 18a.

7 cf. the wrapper or decorator design pattern (see [11], pp 175.)

[Figure 2.4: Containers and allocators. Labels recoverable from the diagram: Allocator, BKAllocator, FixedAllocator, BlockAllocator, CompResource, HWCompResource, AbsCompResource, Processor, RelocTbl; operations grow() and make_available(); lookup note: if (object in allocator) found, else if (reloctlb.lookup(obj)) notify(), else missing().]

Specific resource units maintain a reference to their container. They also redefine the new operator so that it can be used to allocate resource units from a container's pool. This way we can avoid memory fragmentation. The reference to the container and the redefinition of new will be declared in subclasses of ResUnit to avoid explicit type casts.

⟨Off elementary resource unit. 21⟩≡
  // Elementary resource unit.
  class off_ResUnit : public off_Resource { };
Root chunk (not used in this document). Defines: off_ResUnit, used in chunks 32a and 33b. Uses off_Resource 18a.

Compound resources and resource units for plain users

Compound resources export to users the allocation information found in the CompResource allocator. To do so, they export the get_allocator method. They also export get_id.

⟨Off compound resource for users. 22a⟩≡
  ENTRY class off_uCompResource : public off_uResource {
  public:
    off_id_t get_id(const off_Rights &r) const;
    off_Allocator *get_allocator(const off_Rights &r) const;
  };
Root chunk (not used in this document). Defines: off_uCompResource, used in chunks 32c and 34a. Uses off_Allocator 27b, off_id_t 20a, and off_uResource 19a.

For ResUnit system objects, we export to users nothing beyond what a Resource already exports.

⟨Off elementary resource unit for users. 22b⟩≡
  ENTRY class off_uResUnit : public off_uResource { };
Root chunk (not used in this document). Defines: off_uResUnit, used in chunks 32b and 34b. Uses off_uResource 19a.

2.3.4 Resource availability

When users issue a system call, it is targeted to a particular system object. The system call is processed in the local node only when that object is local. In any other case we forward the system call to a remote node by raising a FWD exception to the user.

Assuming the target system object is local, so that the system call proceeds, it is not guaranteed that every resource needed by that system call is present in the local node. To deal with remote resources, every resource container (CompResource) implements a make_available method so that whenever a resource is needed it can be fetched to the local node (if that is really required). Besides, when the container being considered is an AbsCompResource (a container of abstract resources), we have to take into account that the resources contained inside it (abstract resources) are able to move around. Thus, in this case we employ a relocation table to cache known object locations.

In every CompResource, make_available proceeds as follows:

1. Using the resource identifier, it tries to locate the resource in the allocator used by the container (using the slot field of the resource identifier). If the resource is found there, then there is nothing more to do.
2. If the resource was not present in the allocator and we are considering abstract resources, the relocation table has to be consulted to find out if we know of any object relocation. If we find an entry, we now know where the object is and we are done.
3. If we don't know where the object is at this point, we assume it is located in its creation node (known by looking in the resource identifier). Thus, we suspect where the object is and raise a MISSING exception, providing the expected location to the user so that the remote resource can be fetched and brought to the local container.
   (a) If the object was indeed at its expected location (it did not move around), the system call is able to proceed as soon as the MISSING exception handler completes successfully. Should the handler fail, the system call is aborted and an ILL (illegal instruction) exception is raised to the Shuttle domain.
   (b) If the object was not at its expected location, we will notice it as soon as the user is notified by the remote resource container (where we expected to find the resource). In this case, a location algorithm will be run by the user. The resulting location should be inserted in the container relocation table, if any.

We should note that resource "relocation" matters mostly for portals, shuttles and DTLBs. Elementary resource units are not able to move around, so they will always be at their creation nodes.

Besides, because the system can be heterogeneous, a remote resource brought to the local node may be meaningless to us or require some sort of marshaling (e.g. when it corresponds to a different architecture). In this case, an XDT exception is raised in order to request from a trusted user translator an image of the remote object in the native format.
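The lookup order followed by make_available (allocator first, then relocation table, then creation node) can be sketched as below. This is an illustration under stated assumptions, not the Off++ code: the Container, Id, Lookup and Outcome names are hypothetical, standard containers replace the real allocator and relocation table, and the MISSING exception is modelled as a return value carrying the expected location.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <tuple>

using off_node_t = std::uint16_t;

// Simplified identifier: creation node, sequence number, slot.
struct Id {
    off_node_t node;
    std::uint32_t seq;
    std::uint16_t slot;
    bool operator<(const Id &o) const {
        return std::tie(node, seq, slot) < std::tie(o.node, o.seq, o.slot);
    }
};

enum class Outcome { Found, Relocated, Missing };

struct Lookup {
    Outcome outcome;
    off_node_t where;  // node where the resource is, or is expected to be
};

// Hypothetical container: local_ stands for the allocator contents and
// reloc_ for the relocation table of an abstract resource container.
class Container {
public:
    explicit Container(off_node_t self) : self_(self) {}
    void add_local(const Id &id) { local_.insert(id); }
    void note_relocation(const Id &id, off_node_t n) { reloc_[id] = n; }

    Lookup make_available(const Id &id) const {
        if (local_.count(id))                  // step 1: in the allocator
            return {Outcome::Found, self_};
        auto it = reloc_.find(id);             // step 2: known relocation
        if (it != reloc_.end())
            return {Outcome::Relocated, it->second};
        return {Outcome::Missing, id.node};    // step 3: suspect creation node
    }

private:
    off_node_t self_;
    std::set<Id> local_;
    std::map<Id, off_node_t> reloc_;
};
```

In case 3(b), the location found by the user's algorithm would be recorded with note_relocation so that later lookups stop at step 2.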

2.3.5 Architecture awareness support

Each resource will be wrapped with a Navigator and an Inspector. This way, the OS built on top of the µkernel will be able to transparently navigate through its components and later on ask about resource attributes (e.g. bandwidth, persistence, etc.). The class hierarchies of Navigators and Inspectors mimic that of Resource, so we omit them. Every resource will include two methods, get_navigator and get_inspector, to provide a pointer to its specific navigator and inspector. Both of them will be friends of the object considered and will be able to access its state in order to support their services.

⟨Other public methods of off_Resource. 24a⟩≡
  // Returns the navigator for the resource.
  virtual off_Navigator *get_navigator(void) const;
  // Returns the inspector for the resource.
  virtual off_Inspector *get_inspector(void) const;
This code is used in chunk 18a. Uses off_Inspector 26a and off_Navigator 25.

By querying the navigator, references to other system resources may be obtained, and get_navigator or get_inspector may be used again. The scheme works as shown in the example of figure 2.5:

1. The user has a reference to a memory bank.
2. Using get_navigator, a reference to its navigator is obtained. This navigator implements the Navigator interface.
3. Browsing with the navigator, a page frame is located.
4. Its method get_inspector is used to obtain a reference to an object implementing the Inspector interface. A common protocol can be used now to find out about the page frame persistence, size, etc.

Of course, both get_navigator and get_inspector are exported to system users.

⟨Other public methods of off_uResource. 15a⟩+≡
  off_Navigator *get_navigator(const off_Rights &r ) const;
  off_Inspector *get_inspector(const off_Rights &r ) const;
This code is used in chunk 19a. Uses off_Inspector 26a and off_Navigator 25.
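The four steps above can be sketched with a few stand-in classes. This is a simplified illustration, not the Off++ hierarchy: rights checking is omitted, each resource acts as its own navigator and inspector (the report uses separate friend classes), a leaf returns nothing when navigated, and the attribute names "size" and "frames" are invented for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

struct Navigator;
struct Inspector;

// Minimal stand-ins for the report's Resource, Navigator and Inspector.
struct Resource {
    virtual ~Resource() {}
    virtual Navigator *get_navigator() = 0;
    virtual Inspector *get_inspector() = 0;
};

struct Navigator {
    virtual ~Navigator() {}
    virtual Resource *get_first() = 0;  // nullptr when there are no components
    virtual Resource *get_next() = 0;   // nullptr at the end of the iteration
};

struct Inspector {
    virtual ~Inspector() {}
    virtual unsigned long get_nat_attr(const char *name) const = 0;
};

// A page frame: a leaf with a single natural attribute, "size".
struct PFrame : Resource, Navigator, Inspector {
    unsigned long size;
    explicit PFrame(unsigned long s) : size(s) {}
    Navigator *get_navigator() override { return this; }
    Inspector *get_inspector() override { return this; }
    Resource *get_first() override { return nullptr; }
    Resource *get_next() override { return nullptr; }
    unsigned long get_nat_attr(const char *name) const override {
        return std::strcmp(name, "size") == 0 ? size : 0;
    }
};

// A memory bank: a container whose components are page frames.
struct MBank : Resource, Navigator, Inspector {
    std::vector<PFrame> frames;
    std::size_t pos = 0;
    Navigator *get_navigator() override { return this; }
    Inspector *get_inspector() override { return this; }
    Resource *get_first() override { pos = 0; return get_next(); }
    Resource *get_next() override {
        return pos < frames.size() ? &frames[pos++] : nullptr;
    }
    unsigned long get_nat_attr(const char *name) const override {
        return std::strcmp(name, "frames") == 0 ? frames.size() : 0;
    }
};

// Steps 1 to 4: navigate a resource and inspect each direct component.
unsigned long total_component_size(Resource &r) {
    unsigned long total = 0;
    Navigator *nav = r.get_navigator();
    for (Resource *c = nav->get_first(); c != nullptr; c = nav->get_next())
        total += c->get_inspector()->get_nat_attr("size");
    return total;
}
```

The function never names MBank or PFrame, which is the point of the common Navigator and Inspector interfaces.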


[Figure 2.5: Architecture awareness support: Navigators and Inspectors. Labels recoverable from the diagram: Navigator, Resource, Inspector, MBank, MBankNavigator, MBankInspector, PFrame, PFrameInspector, NullNavigator; each resource offers get_navigator and get_inspector; the numbers 1 to 4 mark the steps of the example.]

A Navigator provides methods to select and look up components.

⟨Off resource navigator. 25⟩≡
  // A resource navigator
  ENTRY class off_Navigator {
  public:
    // Returns a reference to the resource being navigated
    off_Resource *get_current(void);
    // Returns the first element.
    off_Resource *get_first(void);
    // Advances the iteration and returns the next element.
    off_Resource *get_next(void);
    // Selects a component by name.
    off_Resource *operator[](char *name);
    // by identifiers
    off_Resource *operator[](const off_eu_id_t &id);
    off_Resource *operator[](off_id_t id);
  };
Root chunk (not used in this document). Defines: off_Navigator, used in chunk 24. Uses off_eu_id_t 20a, off_id_t 20a, and off_Resource 18a.

An Inspector has two main methods to look up attributes, either by name or by index. Attribute indexes will vary from 0 to some arbitrary value. The first invalid attribute is guaranteed to have a NULL name and type. To avoid usage of void *, users are requested to specify the expected type of an attribute. As it is unrealistic to force users to know in advance which attributes can be present, a method nameof is provided to obtain the name of a given attribute, and a method typeof is provided to obtain its type.

⟨Off resource inspector. 26a⟩≡
  // A resource inspector
  ENTRY class off_Inspector {
  public:
    // Returns the name of an attribute at [idx]
    char *nameof( natural_t idx) const;
    // Returns the type of an attribute
    off_attr_kind_t typeof(natural_t idx) const;
    // Look up a boolean attribute by name or by index
    virtual boolean_t get_bool_attr(char *name) const { return FALSE; }
    virtual boolean_t get_bool_attr(natural_t idx) const { return FALSE; }
    // Look up a natural attribute by name or by index
    virtual natural_t get_nat_attr(char *name) const { return 0; }
    virtual natural_t get_nat_attr(natural_t idx) const { return 0; }
    // Look up a string attribute by name or by index
    virtual char *get_str_attr(char *name) const { return NULL; }
    virtual char *get_str_attr(natural_t idx) const { return NULL; }
    // Look up an id attribute by name or by index
    virtual off_id_t get_id_attr(char *name) const { return OFF_ID_NULL; }
    virtual off_id_t get_id_attr(natural_t idx) const { return OFF_ID_NULL; }
  };
Root chunk (not used in this document). Defines: off_Inspector, used in chunk 24. Uses off_attr_kind_t 26b and off_id_t 20a.

The default implementation is to ignore the arguments and return a dummy value, so that such functions can be ignored: not every resource has attributes of every type. Thus, only nameof, typeof and one of the four previous pairs of methods must be provided by any subclass. The valid attribute types are

⟨Off attribute type ids. 26b⟩≡
  enum off_attr_kind_t {
    OFF_BOOL_ATTR,
    OFF_STR_ATTR,
    OFF_INT_ATTR,
    OFF_ID_ATTR
  };
Root chunk (not used in this document). Defines: off_attr_kind_t, used in chunk 26a.
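Because the first invalid index is marked by a NULL name, a client can discover attributes without knowing them in advance. The sketch below shows such a loop. TableInspector, Attr, AttrKind and names_of_kind are hypothetical names; the methods are called name_of and kind_of rather than nameof and typeof only to avoid clashing with keywords that some compiler dialects reserve.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum AttrKind { BOOL_ATTR, STR_ATTR, INT_ATTR, ID_ATTR };

struct Attr {
    const char *name;  // nullptr marks the end of the table
    AttrKind kind;
};

// Hypothetical inspector backed by a table that ends in a null-named entry.
class TableInspector {
public:
    explicit TableInspector(const Attr *table) : table_(table) {}
    // Name of the attribute at idx; nullptr for the first invalid index.
    const char *name_of(std::size_t idx) const { return table_[idx].name; }
    // Kind of the attribute at idx (only meaningful for valid indexes).
    AttrKind kind_of(std::size_t idx) const { return table_[idx].kind; }

private:
    const Attr *table_;
};

// Collects the names of all attributes of a given kind, stopping at the
// first index whose name is null.
std::vector<std::string> names_of_kind(const TableInspector &in, AttrKind k) {
    std::vector<std::string> out;
    for (std::size_t i = 0; in.name_of(i) != nullptr; ++i)
        if (in.kind_of(i) == k) out.push_back(in.name_of(i));
    return out;
}
```

Once a name and kind are known, the caller would pick the matching typed accessor (for instance the string one for a STR_ATTR entry), which is how the interface avoids void *.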

Finally, any resource should have at least these attributes defined:

⟨Off attributes. 27a⟩≡
  enum {
    OFF_ATTR_NULL = 0, // Used to mark the end of the attr list
    OFF_ATTR_NAME,     // Name of this resource
    OFF_ATTR_CLASS,    // Class for this resource
    OFF_ATTR_DOM,      // Resource domain
    OFF_ATTR_ID,       // Resource id (or container id)
    OFF_ATTR_OFFSET,   // Resource offset in a container or 0
    OFF_ATTR_URL,      // URL for the resource literate source code
  };
Root chunk (not used in this document).

2.3.6 Resource Allocators

Every allocator will provide the same interface, so that we can replace it at will.

⟨Off allocator. 27b⟩≡
  signature off_Allocator {
    // Returns a chunk of raw memory.
    // Will try to allocate n chunks starting at the i-th item.
    // i == 0 means allocate anywhere.
    void *allocate(natural_t n=1, natural_t i=0);
    // Deallocates the chunk pointed by p
    void deallocate(void *p);
  };
Root chunk (not used in this document). Defines: off_Allocator, used in chunks 20b and 22a.

The signature keyword is a GNU C++ extension which permits the use of subtype polymorphism separately from inheritance. Some dynamic dispatching is saved this way. However, it can be considered to be an abstract base class or interface implemented by every allocator.

2.3.7 Allocation statistics

Statistics are maintained by bookkeeping allocators (BKAllocators), which are composed by wrapping⁸ the basic Allocator with some bookkeeping methods. These operations are inlined.

⟨Off bookkeeping allocator. 28⟩≡
  class off_BKAllocator {
  private:
    natural_t b_nfree;    // # of free nodes
    natural_t b_nalloc;   // # of allocated nodes
    natural_t b_maxalloc; // Maximum # of ever allocated nodes
    natural_t b_nallocrq; // # of alloc requests
    natural_t b_nfreerq;  // # of free requests
  public:
    // Get some statistics
    natural_t get_nfree(void) const {return b_nfree;}
    natural_t get_nalloc(void) const {return b_nalloc;}
    natural_t get_maxalloc(void) const {return b_maxalloc;}
    natural_t get_num_allocrq(void) const {return b_nallocrq;}
    natural_t get_num_freerq(void) const {return b_nfreerq;}
    // Wrapped allocator bookkeeping methods
    inline void *allocate(natural_t n=1, natural_t i=0);
    inline void deallocate(void *p);
  };
Root chunk (not used in this document). Defines: off_BKAllocator, used in chunk 29.

8 As the allocator was an interface or signature, no inheritance is needed.
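The wrapping idea can be illustrated in standard C++ with a template that forwards to any type offering allocate and deallocate. This is a sketch under assumptions: the template stands in for the GNU signature construct, MallocAllocator and its 64-byte unit size are invented, the placement argument i is dropped, and the counters track live chunks rather than individual units.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical basic allocator; any type with these two methods would do.
struct MallocAllocator {
    void *allocate(std::size_t n = 1) { return std::malloc(n * 64); }
    void deallocate(void *p) { std::free(p); }
};

// Bookkeeping wrapper: updates the statistics and forwards each request.
template <typename Alloc>
class BKAllocator {
public:
    explicit BKAllocator(Alloc &a) : inner_(a) {}

    void *allocate(std::size_t n = 1) {
        ++nallocrq_;
        void *p = inner_.allocate(n);
        if (p != nullptr) {
            ++nalloc_;
            if (nalloc_ > maxalloc_) maxalloc_ = nalloc_;
        }
        return p;
    }

    void deallocate(void *p) {
        ++nfreerq_;
        if (p != nullptr) --nalloc_;
        inner_.deallocate(p);
    }

    std::size_t get_nalloc() const { return nalloc_; }
    std::size_t get_maxalloc() const { return maxalloc_; }
    std::size_t get_num_allocrq() const { return nallocrq_; }
    std::size_t get_num_freerq() const { return nfreerq_; }

private:
    Alloc &inner_;              // the wrapped allocator
    std::size_t nalloc_ = 0;    // chunks currently allocated
    std::size_t maxalloc_ = 0;  // high-water mark of nalloc_
    std::size_t nallocrq_ = 0;  // number of allocate requests
    std::size_t nfreerq_ = 0;   // number of deallocate requests
};
```

Since the wrapper has the same two methods as the allocator it wraps, it can be used wherever the plain allocator is expected.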


2.3.8 Fixed and Block allocators

The FixedAllocator is an Allocator with a fixed allocation scheme. Its implementation may vary, but in general it will consist of a fixed array and a linked list of free nodes. This class will provide a generic implementation which can be later optimized by specialized allocators.

⟨Off fixed allocator. 29a⟩≡
  class off_FixedAllocator: public off_BKAllocator {
  public:
    // Allocates (deallocates) units from a fixed store.
    void *allocate(natural_t n=1, natural_t i=0);
    void deallocate(void *p);
  };
Root chunk (not used in this document). Uses off_BKAllocator 28.

The BlockAllocator is an allocator which can grow dynamically.

⟨Off block allocator. 29b⟩≡
  class off_BlockAllocator: public off_BKAllocator {
  public:
    // Allocates (deallocates) units from a growing store
    void *allocate(natural_t n=1, natural_t i=0);
    void deallocate(void *p);
    // Resizes the store
    void grow(natural_t n);
  };
Root chunk (not used in this document). Uses off_BKAllocator 28.

Even though it is not required for the understanding of the rest of this document, we now say a few words about the implementation of BlockAllocator. In Off ++, it is used to allocate abstract system resources (shuttles, portals and DTLBs). It is desirable to avoid static limits on the amount of resources an application can allocate, so each application should be able to allocate, for instance, as many shuttles and as many portals as it wants. To implement that, we chose to allocate abstract system resources in virtual memory.

However, adding this flexibility to the allocator could cause problems when a single application allocates many resources (e.g. many portals), harming the performance of other applications. Our implementation of the BlockAllocator helps solve this problem by allocating elementary units (i.e. shuttles, portals and DTLBs) so that those of a particular application can be paged out without disturbing other applications. The kernel virtual memory may look like the one depicted in figure 2.6. In this situation, the large number of VM pages used to store the portals of one application (xfig) does not prevent the other (gcc) from freely allocating its own portals.

As the system does not know what a "user application" is, it is not possible to make the allocator use a different page for each application. However, the BlockAllocator can use the resource domain identifier as a hint. So, what we actually do is use a different set of virtual memory pages to allocate resources for each application domain.

[Figure 2.6: Storage of Portals for two different applications. Labels recoverable from the diagram: Portal storage, one page, Kernel virtual memory, paged out, gcc's portals, xfig's portals.]

2.3.9 Resource revocation

Resources are not revoked by the Off ++ µkernel. Instead, an UNAVAILABLE exception is raised for the exhausted resource, so that users can run whatever algorithm is needed to make more units of that resource available. The exception trigger is the memory allocator being used. System allocators are provided with an Exhausted object at instantiation time for that purpose. The Exhausted object provides a single method, notify, to let allocators trigger the proper action (by default, raising an UNAVAILABLE exception).

⟨Off Exhausted resource revocation trigger. 30⟩≡
  signature off_Exhausted {
  public:
    // Notify that resource is exhausted
    void notify(void);
  };
Root chunk (not used in this document).
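As a rough illustration of this arrangement, the pool below receives its exhaustion notifier at construction time and calls it when no unit is free. The names are hypothetical: NotifyingPool is not an Off++ class, std::function stands in for the Exhausted object, and a failed allocation is reported with -1 instead of an UNAVAILABLE exception.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Fixed pool that notifies a caller-supplied object when it runs out of units.
class NotifyingPool {
public:
    NotifyingPool(std::size_t units, std::function<void()> exhausted)
        : used_(units, false), exhausted_(std::move(exhausted)) {}

    // Returns the index of a free unit, or -1 after notifying exhaustion.
    long allocate() {
        for (std::size_t i = 0; i < used_.size(); ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return static_cast<long>(i);
            }
        }
        exhausted_();  // plays the role of Exhausted::notify()
        return -1;
    }

    // Marks unit i as free again.
    void deallocate(long i) { used_[static_cast<std::size_t>(i)] = false; }

private:
    std::vector<bool> used_;           // one flag per unit
    std::function<void()> exhausted_;  // action to run on exhaustion
};
```

The notifier is where a user-level policy would release or reclaim units before the allocation is retried; the pool itself never revokes anything.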


[Figure 2.7: Hardware resource containers and elementary units. Labels recoverable from the diagram: HWCompResource, HWResUnit, Node, MBank, PFrame, IOBank, IOport, EventTbl, Event, DMA, DMALine, ProcTbl, Processor, ProcSlot, TrapTbl, Trap, IrqTbl, Irq.]

2.4 Hardware resources

Hardware resources are not able to move and are always located at the same place. If they grow, they grow one container at a time (i.e. when new disks or pluggable memory banks are added to the system). Hardware resource containers may move by replacing a whole container by another. So, usually, hardware containers will use a fixed allocator. The hardware resources exported by the kernel are arranged as recursive containers starting from Node. See figure 2.7.

⟨Off hardware resource container. 31⟩≡
  // A hardware resource container. Units will be allocated
  // within it.
  class off_HWCompResource : public off_CompResource {
  public:
    // We may wish to cache a resource unit
    void make_available(off_eu_id_t u);
  };
Root chunk (not used in this document). Defines: off_HWCompResource, used in chunks 40, 42c, 44b, 49b, and 51b. Uses off_CompResource 20b and off_eu_id_t 20a.

Hardware resource units employ a particular naming scheme. They do not use off_id_t identifiers as the remaining system resources do. They use elementary unit identifiers (off_eu_id_t) instead. These are built from an off_id_t (naming the container) and an offset locating the object inside the container. The offset matches the resource physical name, i.e. the one understood by the hardware.

⟨Off hardware resource unit. 32a⟩≡
  // A hardware resource unit.
  // Uses virtualized identifiers
  // and a fixed allocation scheme.
  class off_HWResUnit : public off_ResUnit {
  public:
    // Gets the identifier for this unit.
    off_eu_id_t get_id(void) const;
  };
Root chunk (not used in this document). Defines: off_HWResUnit, used in chunks 41b, 43a, 45c, 50a, and 52. Uses off_eu_id_t 20a and off_ResUnit 21.

2.4.1 Hardware resources for plain users

Users can ask for the identifier of a given resource unit.

⟨Off hardware resource unit for users. 32b⟩≡
  ENTRY class off_uHWResUnit : public off_uResUnit {
  public:
    off_eu_id_t get_id(const off_Rights &r) const;
  };
Root chunk (not used in this document). Defines: off_uHWResUnit, used in chunks 42a, 44a, 48b, 51a, and 54b. Uses off_eu_id_t 20a and off_uResUnit 22b.

Compound hardware resources do not export anything.

⟨Off hardware resource container for users. 32c⟩≡
  ENTRY class off_uHWCompResource : public off_uCompResource {
  public:
  };
Root chunk (not used in this document). Defines: off_uHWCompResource, used in chunks 41d, 43b, 50c, and 54a. Uses off_uCompResource 22a.

2.5 Abstract resources

Shuttles, Portals, and DTLBs (see 1.2) are system abstractions, and are able to move around. Consequently, abstract resource containers (AbsCompResources) provide a method make_available which uses a relocation table, so that system code may ask one of the system containers to make a particular resource unit available at the local node. They are arranged, as physical resources are, as recursive containers, with the Node as the top-level container (see figure 2.8).

[Figure 2.8: System abstract resource containers and elementary units. Labels recoverable from the diagram: AbsResUnit, AbsCompResource, Node, DMM, DTLB, PrtlSrv, Prtl, ShtlSrv, Shtl.]

⟨Off system server. 33a⟩≡
  // A system resource container.
  class off_AbsCompResource : public off_CompResource {
    // System containers use a block allocator and a relocation cache.
  public:
    // We may wish to fetch a resource unit
    void make_available(off_id_t u);
  };
Root chunk (not used in this document). Defines: off_AbsCompResource, used in chunks 55, 69b, and 73. Uses off_CompResource 20b and off_id_t 20a.

⟨Off system resource. 33b⟩≡
  // A system resource unit. It's relocatable by nature.
  class off_AbsResUnit : public off_ResUnit {
  public:
    off_id_t get_id(void) const;
  };
Root chunk (not used in this document). Defines: off_AbsResUnit, used in chunk 59. Uses off_id_t 20a and off_ResUnit 21.

2.5.1 Abstract resources for plain users

The wrappers for abstract resources are so simple that we will not discuss them.

⟨Off system server for users. 34a⟩≡
  ENTRY class off_uAbsCompResource : public off_uCompResource {
  public:
  };
Root chunk (not used in this document). Defines: off_uAbsCompResource, used in chunks 60b, 72a, and 75a. Uses off_uCompResource 22a.

⟨Off system resource for users. 34b⟩≡
  ENTRY class off_uAbsResUnit : public off_uResUnit {
  public:
    off_id_t get_id(void) const;
  };
Root chunk (not used in this document). Defines: off_uAbsResUnit, used in chunks 61a and 75b. Uses off_id_t 20a and off_uResUnit 22b.

2.6 Domains and resource allocation

The kernel implements neither "processes" nor any other "resource-container" abstraction. Users allocate system resources and release them back to the kernel when they are no longer needed. None of the system resources is responsible for containing the resources of a given application. Besides, as we will see, the system resource that is the most similar to the traditional process abstraction (the "shuttle") does not contain the resources accessed during its execution.

Thus, how are bookkeeping tasks accomplished? And how are resources released upon application crashes? Certainly, a user-level (i.e. a non-µkernel) process or "application" abstraction is needed. To help with this task, the kernel delegates resource bookkeeping to its users. Users should implement resource domains containing the set of resources used by a user application. This concept is introduced in the system in order to help solve the following three problems:

• How to release resources when the application using them dies.
• How to recognize who is responsible for a given resource.
• How to efficiently protect and share system and user resources.

Whenever a user issues a resource allocation request, it provides a target domain identifier tdi for that new resource. Then, the kernel entitles that resource to the domain identified by tdi (i.e. it sets the r_domain of the Resource being allocated to the supplied domain identifier, as defined in 2.3). When the resource finally becomes unused, the kernel announces this using its domain identifier as a portal name, so that users can update their allocation tables. Thus, domain identifiers must be portal identifiers.

As we said, the meaning of a domain is defined by the user, not by the kernel (although the DM can be loaded into the kernel for efficiency). The 2K operating system will utilize this basic mechanism for implementing the concept of nested domains. From a single abstraction, it will be possible to implement replacements for traditional OS abstractions like group of users, user, process group, process, and thread. This will be possible because nothing forbids domains to persist, so that a permanent user can be seen as a persistent domain. 2K will also be able to support new abstractions like, for example, temporary users. They would be implemented as non-persistent domains and could be created by regular users, inheriting a subset of the creator's permissions. This would add flexibility compared to existing systems, in which usually only a "super-user" is allowed to create new users.

A resource could also be logically moved from one domain to another by changing its r_domain. There is nothing in the system architecture forbidding such a feature.
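The division of labour between kernel and domain can be sketched as follows. This is a toy model, not Off++ code: ToyKernel, DomainManager, ResId, bind_portal, reassign and domain_of are hypothetical names, portals are reduced to callbacks, and the 32-bit width chosen for off_prtl_id_t is an assumption.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <set>
#include <utility>

using off_prtl_id_t = std::uint32_t;  // assumed width for a portal identifier
using ResId = std::uint32_t;          // hypothetical resource name

// Toy kernel: records r_domain per resource and performs an upcall through
// the domain's portal when the resource is released.
class ToyKernel {
public:
    using Upcall = std::function<void(ResId)>;

    void bind_portal(off_prtl_id_t portal, Upcall handler) {
        portals_[portal] = std::move(handler);
    }

    // Allocation request carrying the target domain identifier (tdi).
    ResId allocate(off_prtl_id_t tdi) {
        ResId id = next_++;
        r_domain_[id] = tdi;
        return id;
    }

    // Logically moves a resource to another domain by changing r_domain.
    void reassign(ResId id, off_prtl_id_t new_domain) {
        auto it = r_domain_.find(id);
        if (it != r_domain_.end()) it->second = new_domain;
    }

    // Releases the resource and notifies its domain through the portal.
    void release(ResId id) {
        auto it = r_domain_.find(id);
        if (it == r_domain_.end()) return;
        off_prtl_id_t dom = it->second;
        r_domain_.erase(it);
        auto p = portals_.find(dom);
        if (p != portals_.end()) p->second(id);
    }

    off_prtl_id_t domain_of(ResId id) const { return r_domain_.at(id); }

private:
    ResId next_ = 1;
    std::map<ResId, off_prtl_id_t> r_domain_;
    std::map<off_prtl_id_t, Upcall> portals_;
};

// User-level bookkeeping: the allocation table of one domain.
struct DomainManager {
    std::set<ResId> owned;
    void track(ResId id) { owned.insert(id); }
    void on_released(ResId id) { owned.erase(id); }
};
```

The kernel keeps only one identifier per resource; what a domain means, and what happens when one of its resources goes away, is decided entirely by the code behind the portal.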


Chapter 3

The node

3.1 The node interface

A Node object is responsible for bringing the system into operation and also for shutting it down (i.e. to either halt, reboot, or suspend it). Node is also a singleton¹ used as a placeholder for node-wide system properties (e.g. the portal for the authentication server used to establish trust relationships among distributed kernels (auth), and the portal for the external data translator used to translate foreign kernel objects to the native architecture (xdt), are placed here). Being the outermost starting point in the container hierarchy, its id (obtained from the method get_id) is all the user needs to know to start looking for available system resources.

⟨Off node. 36⟩≡
  // An Off Node.
  // Can be also handled as a resource: freeze, melt, etc.
  class off_Node : public off_Resource {
  private:
    ⟨Off private members for sequencing objects. (never defined)⟩
    ⟨Other off_Node private members. (never defined)⟩
  protected:
    ⟨Off protected members for sequencing objects. (never defined)⟩
    ⟨Other off_Node protected methods. 37a⟩
  public:
    // Returns the node identifier
    off_id_t get_id(void) const;
    // Returns or sets the authorization server and
    // the external data translator portals.
    off_prtl_t get_auth(void) const;
    off_prtl_t get_xdt(void) const;
    void set_auth(off_prtl_t p);
    void set_xdt(off_prtl_t p);
    ⟨Other off_Node public methods. 38a⟩
    // Halts, reboots, or suspends this node
    void halt( char *msg="System On." );
    void reboot( void );
    void suspend( void );
  };
Root chunk (not used in this document). Uses off_id_t 20a and off_Resource 18a.

1 see the singleton design pattern in [11], pp 127.

The Node can be also considered to be a façade² (see [11]) for node-wide system operations.

3.2 Navigation

To support efficient browsing of node components, the Node has a set of methods that provide access to its components. Most of them will be simply inlined. For example, for memory banks we have:

⟨Other off_Node protected methods. 37a⟩≡
  // Returns the number of specific containers found at this node
  inline natural_t get_num_mbanks(void) const;
  // Returns a pointer to a local memory bank.
  inline off_MBank *get_mbank( natural_t id = 0 ) const;
This definition is continued in chunk 37b. This code is used in chunk 36. Uses off_MBank 40.

The same holds for gaining access to the remaining resource containers.

⟨Other off_Node protected methods. 37a⟩+≡
  // Get a pointer to a hardware resource container
  inline off_IOBank *get_iobank( void ) const;
  inline off_DMA *get_dma( void ) const;
  inline off_ProcTbl *get_proctbl( void ) const;
  inline off_ShtlSrv *get_shtlsrv( void ) const;
  inline off_PrtlSrv *get_prtlsrv( void ) const;
  inline off_DMM *get_dmm( void ) const;
This code is used in chunk 36. Uses off_DMA 49b, off_DMM 73, off_IOBank 42c, off_ProcTbl 51b, off_PrtlSrv 69b, and off_ShtlSrv 55.

2 Although right now it is not a façade, it will centralize access to node-wide services like serial debugging, message logging, etc.

Besides, a generic navigator specialized for the Node will support the common interface for navigation. That is primarily for system users and not for the kernel itself.

3.3 Miscellaneous

There are some global operations that do not fit well anywhere else. The following one, for example, redirects system output within the kernel to the serial line. Output can be redirected at any time.

⟨Other off_Node public methods. 38a⟩≡
  // Use the serial console?
  void use_serial_console( boolean_t doit = TRUE );
This code is used in chunk 36.

3.4 Nodes for plain users

The wrapper for node services exports almost everything, as can be seen.

⟨Off node for users. 38b⟩≡
  ENTRY class off_uNode : public off_uResource {
  public:
    off_prtl_t get_auth(const off_Rights &r) const;
    off_prtl_t get_xdt(const off_Rights &r) const;
    void set_auth(off_prtl_t p, const off_Rights &r);
    void set_xdt(off_prtl_t p, const off_Rights &r);
    void halt( char *msg="System On." );
    void reboot( const off_Rights &r );
    void suspend( const off_Rights &r );
    off_IOBank *get_iobank( const off_Rights &r ) const;
    off_DMA *get_dma( const off_Rights &r ) const;
    off_ProcTbl *get_proctbl( const off_Rights &r ) const;
    off_ShtlSrv *get_shtlsrv( const off_Rights &r ) const;
    off_PrtlSrv *get_prtlsrv( const off_Rights &r ) const;
    off_DMM *get_dmm( const off_Rights &r ) const;
    // Use the serial console?
    void use_serial_console( const off_Rights &r, boolean_t doit = TRUE );
  };
Root chunk (not used in this document). Uses off_DMA 49b, off_DMM 73, off_IOBank 42c, off_ProcTbl 51b, off_PrtlSrv 69b, off_ShtlSrv 55, and off_uResource 19a.

3.5 System booting

The system entry point, which is called by the OSKit after basic hardware initialization, is named main. Its arguments come from the secondary boot loader.

⟨Off main entry point. 39⟩≡
  // main system entry point.
  int main(int argc, char *argv[]);
Root chunk (not used in this document).

Chapter 4

Exporting the hardware

4.1 Memory banks

Physical memory is distributed among different memory containers named mbanks. Each mbank supports allocation and deallocation of physical memory inside a single memory bank. To do so, it provides the alloc and free methods. It also provides means to locate a page frame either by address or by page frame number (PFN).

40 ⟨Off memory bank. 40⟩≡
    // A bank of memory
    class off_MBank : public off_HWCompResource {
    public:
        // Allocates n contiguous page frames.
        off_PFrame *alloc(off_Protection *prot, natural_t n=1, off_pg_id_t at=OFF_EU_ID_NULL);
        // Deallocates the n contiguous page frames starting at pf.
        void free(off_PFrame *pf, natural_t n=1 );
        //Gets a page frame from its physical address
        off_PFrame *operator[](off_pg_id_t pfn) const;
        //Gets the size of pages inside the bank
        vm_size_t get_pgsize(void);
    };

Root chunk (not used in this document). Defines: off_MBank, used in chunk 37a. Uses off_HWCompResource 31, off_PFrame 41b, and off_pg_id_t 42b.
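The allocation interface of a memory bank can be illustrated with a small stand-alone model. This is only a sketch under stated assumptions: PageId, PAGE_ID_NULL, PageFrame and ToyMBank are hypothetical stand-ins (not the Off++ types), protection arguments are left out, and a first-fit search is assumed because the report does not say which policy off_MBank uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for off_pg_id_t and OFF_EU_ID_NULL.
using PageId = std::size_t;
constexpr PageId PAGE_ID_NULL = static_cast<PageId>(-1);

// Hypothetical page frame: just a number and an allocation flag.
struct PageFrame {
    PageId pfn;     // page frame number inside the bank
    bool   in_use;  // true while the frame is allocated
};

class ToyMBank {
public:
    explicit ToyMBank(std::size_t nframes) {
        for (PageId i = 0; i < nframes; ++i) frames_.push_back({i, false});
    }

    // Allocates n contiguous frames (first fit, an assumed policy).
    // When `at` is given, only that starting position is tried.
    // Returns the first frame of the run, or nullptr on failure.
    PageFrame *alloc(std::size_t n = 1, PageId at = PAGE_ID_NULL) {
        if (n == 0 || n > frames_.size()) return nullptr;
        if (at != PAGE_ID_NULL && at >= frames_.size()) return nullptr;
        PageId first = (at == PAGE_ID_NULL) ? 0 : at;
        PageId last  = (at == PAGE_ID_NULL) ? frames_.size() - n : at;
        for (PageId s = first; s <= last && s + n <= frames_.size(); ++s) {
            if (!run_is_free(s, n)) continue;
            for (PageId i = s; i < s + n; ++i) frames_[i].in_use = true;
            return &frames_[s];
        }
        return nullptr;
    }

    // Releases the n contiguous frames starting at pf.
    void free(PageFrame *pf, std::size_t n = 1) {
        for (PageId i = pf->pfn; i < pf->pfn + n && i < frames_.size(); ++i)
            frames_[i].in_use = false;
    }

    // Locates a frame by its page frame number.
    PageFrame *operator[](PageId pfn) {
        return pfn < frames_.size() ? &frames_[pfn] : nullptr;
    }

private:
    bool run_is_free(PageId start, std::size_t n) const {
        for (PageId i = start; i < start + n; ++i)
            if (frames_[i].in_use) return false;
        return true;
    }
    std::vector<PageFrame> frames_;
};
```

The `at` argument mirrors the optional position parameter of alloc in the chunk above: leaving it at its null value lets the bank pick the run, while a concrete value requests a specific starting frame.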


where

41a ⟨Off page frame identifier. 41a⟩≡
    typedef off_eu_id_t off_pg_id_t;

This definition is continued in chunk 42b. Root chunk (not used in this document). Defines: off_pg_id_t;, used in chunk 42b. Uses off_eu_id_t 20a and off_pg_id_t 42b.

PFrame maintains the information needed about each page frame installed in the system. Its methods are just a means to access and update that state.

41b ⟨Off page frame. 41b⟩≡
    class off_PFrame : public off_HWResUnit {
    public:
        // Gets the page frame bits.
        pg_bits_t get_bits(void);
        // Sets the page frame bits and returns the old ones.
        pg_bits_t set_bits(pg_bits_t b);
    };

Root chunk (not used in this document). Defines: off_PFrame, used in chunk 40. Uses off_HWResUnit 32a.

where pg_bits_t corresponds to the bits maintained by the address translation machinery.

41c ⟨Off page frame bits data type. 41c⟩≡
    typedef pt_entry_t pg_bits_t;

Root chunk (not used in this document).

4.1.1 Memory banks and page frames for plain users

These are the wrappers for memory banks and page frames:

41d ⟨Off memory bank for users. 41d⟩≡
    ENTRY class off_uMBank : public off_uHWCompResource {
    public:
        off_uPFrame *alloc(off_Protection *prot, natural_t n=1, off_pg_id_t at=OFF_EU_ID_NULL);
        void free(off_uPFrame *pf, const off_Rights &r, natural_t n=1 );
        off_uPFrame *fetch(off_pg_id_t pfn,const off_Rights &r) const;
        vm_size_t get_pgsize(const off_Rights &r);
    };

Root chunk (not used in this document). Uses off_pg_id_t 42b, off_uHWCompResource 32c, and off_uPFrame 42a.


42a ⟨Off page frame for users. 42a⟩≡
    ENTRY class off_uPFrame : public off_uHWResUnit {
    public:
        pg_bits_t get_bits(const off_Rights &r);
        pg_bits_t set_bits(pg_bits_t b, const off_Rights &r);
    };

Root chunk (not used in this document). Defines: off_uPFrame, used in chunk 41d. Uses off_uHWResUnit 32b.

where

42b ⟨Off page frame identifier. 41a⟩+≡
    typedef off_eu_id_t off_pg_id_t;

Defines: off_pg_id_t, used in chunks 40, 41, 50a, 51a, 74b, and 75b. Uses off_eu_id_t 20a and off_pg_id_t; 41a.

4.2 Input/Output

IO bank objects are merely used to allocate ports, as I/O should be able to operate (when permitted) from user level.

42c ⟨Off IO bank. 42c⟩≡
    //An IO bank
    class off_IOBank : public off_HWCompResource {
    public:
        // Allocates n contiguous IO ports.
        off_IOPort *alloc(off_Protection *prot,natural_t n=1, off_io_id_t at=OFF_EU_ID_NULL);
        // Deallocates the n contiguous IO ports starting at iop.
        void free( off_IOPort *iop, natural_t n=1 );
        //Gets an IO port from its port number.
        off_IOPort *operator[](off_io_id_t iop) const;
    };

Root chunk (not used in this document). Defines: off_IOBank, used in chunks 37b and 38b. Uses off_HWCompResource 31, off_io_id_t 42d, and off_IOPort 43a.

where

42d ⟨Off IO identifier. 42d⟩≡
    typedef off_eu_id_t off_io_id_t;

Root chunk (not used in this document). Defines: off_io_id_t, used in chunks 42c and 43b. Uses off_eu_id_t 20a.


43a ⟨Off IO port. 43a⟩≡
    // An IO port
    class off_IOPort : public off_HWResUnit {
    public:
        // Input/Output 8, 16, 32 or 64 bit words through the port.
        unsigned8_t  in8(void);
        void         out8(unsigned8_t o);

        unsigned16_t in16(void);
        void         out16(unsigned16_t o);

        unsigned32_t in32(void);
        void         out32(unsigned32_t o);

        unsigned64_t in64(void);
        void         out64(unsigned64_t o);
    };

Root chunk (not used in this document). Defines: off_IOPort, used in chunk 42c. Uses off_HWResUnit 32a.

4.2.1 Input/Output for plain users

Users may access IO ports using these wrappers.

43b ⟨Off IO bank for users. 43b⟩≡
    ENTRY class off_uIOBank : public off_uHWCompResource {
    public:
        off_uIOPort *alloc(off_Protection *prot,natural_t n=1, off_io_id_t at=OFF_EU_ID_NULL);
        void free( off_uIOPort *iop, const off_Rights &r, natural_t n=1 );
        off_uIOPort *fetch(off_io_id_t iop,const off_Rights &r) const;
    };

Root chunk (not used in this document). Uses off_io_id_t 42d, off_uHWCompResource 32c, and off_uIOPort 44a.


44a ⟨Off IO port for users. 44a⟩≡
    ENTRY class off_uIOPort: public off_uHWResUnit {
    public:
        unsigned8_t  in8(const off_Rights &r);
        void         out8(unsigned8_t o,const off_Rights &r);

        unsigned16_t in16(const off_Rights &r);
        void         out16(unsigned16_t o,const off_Rights &r);

        unsigned32_t in32(const off_Rights &r);
        void         out32(unsigned32_t o,const off_Rights &r);

        unsigned64_t in64(const off_Rights &r);
        void         out64(unsigned64_t o,const off_Rights &r);
    };

Root chunk (not used in this document). Defines: off_uIOPort, used in chunk 43b. Uses off_uHWResUnit 32b.

4.3 Traps and interrupts

Both traps and interrupts are allocated using per-processor event tables. Each processor has an associated trap table and an associated interrupt table. Traps may be caused by the hardware or by the µkernel itself. Software interrupts are also available.

44b ⟨Off event table. 44b⟩≡
    // An event table
    class off_EventTbl : public off_HWCompResource {
    protected:
        // Allocates n contiguous events.
        off_Event *alloc( off_Protection *prot, natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL );
        // Deallocates the n contiguous events starting at t.
        void free( off_Event *t, natural_t n=1 );
        //Gets an event from its number.
        off_Event *fetch(off_ev_id_t n) const;
    };

Root chunk (not used in this document). Defines: off_EventTbl, used in chunk 45. Uses off_Event 45c, off_ev_id_t 44c, and off_HWCompResource 31.

where

44c ⟨Off event identifier. 44c⟩≡
    typedef off_eu_id_t off_ev_id_t;

Root chunk (not used in this document). Defines: off_ev_id_t, used in chunks 44, 45, 54b, and 69a. Uses off_eu_id_t 20a.


45a ⟨Off trap table. 45a⟩≡
    // Definition of traps for the current processor
    class off_TrapTbl: public off_EventTbl {
    public:
        // Allocates n contiguous Traps.
        off_Trap *alloc( off_Protection *prot, natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL );
        // Deallocates the n contiguous traps starting at t.
        void free( off_Trap *t, natural_t n=1 );
        //Gets a trap from its number.
        off_Trap *operator [](off_ev_id_t n) const;
    };

Root chunk (not used in this document). Defines: off_TrapTbl, used in chunks 53 and 67. Uses off_EventTbl 44b, off_ev_id_t 44c, and off_Trap 46c.

45b ⟨Off interrupt table. 45b⟩≡
    // Definition of interrupts for the current processor.
    class off_IntTbl: public off_EventTbl {
    public:
        // Allocates n contiguous interrupts.
        off_Irq *alloc( off_Protection *prot, natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL );
        // Deallocates the n contiguous interrupts starting at i.
        void free( off_Irq *i, natural_t n=1 );
        //Gets an Irq from its number.
        off_Irq *operator [](off_ev_id_t n) const;
    };

Root chunk (not used in this document). Defines: off_IntTbl, used in chunk 53. Uses off_EventTbl 44b, off_ev_id_t 44c, and off_Irq 47a.

An event table is a set of Events. Each event is identified by a unique id and a reason.

45c ⟨Off event. 45c⟩≡
    // An event, it could be a trap, an interrupt, ...
    class off_Event: public off_HWResUnit {
    public:
        // Sets h as the handler for this event
        void set_handler( off_EventHandler h);
        // Gets the handler.
        off_EventHandler get_handler(void);
    };

Root chunk (not used in this document). Defines: off_Event, used in chunks 44b, 46c, and 47a. Uses off_EventHandler 46a and off_HWResUnit 32a.


Events are handled by EventHandlers, which can rely either on portals or, when delivered inside the kernel, on function calls.

46a ⟨Off event handler. 46a⟩≡
    class off_EventHandler {
    public:
        // Calls the handler
        virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... )=0;
    };

    class off_FnEventHandler : public off_EventHandler {
    private:
        void (*handler)(off_event_t ev, off_ev_reason_t reason, ...);
    public:
        // Calls the handler
        virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... );
    };

    class off_PrtlEventHandler : public off_EventHandler {
    private:
        // Its handler is
        // void (*handler)(off_event_t ev, off_ev_reason_t reason, ...);
        off_prtl_id_t handler;
    public:
        // Calls the handler
        virtual void operator()( off_event_t ev, off_ev_reason_t reason, ... );
    };

Root chunk (not used in this document). Defines: off_EventHandler, used in chunks 45c and 52. Uses off_event_t 46b.
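The two handler flavours can be pictured with a reduced model. The sketch below is an illustration only: EventNo, ReasonNo, PortalNo and the Toy* classes are hypothetical stand-ins, the variadic arguments are dropped, and a portal invocation is modelled as appending a record to a log, since the portal mechanism itself is described elsewhere in the report.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for off_event_t, the reason type and off_prtl_id_t.
using EventNo  = unsigned;
using ReasonNo = unsigned;
using PortalNo = unsigned;

// Common interface: the dispatcher only sees operator().
struct ToyEventHandler {
    virtual ~ToyEventHandler() = default;
    virtual void operator()(EventNo ev, ReasonNo reason) = 0;
};

// In-kernel delivery: a plain function call.
class ToyFnEventHandler : public ToyEventHandler {
public:
    explicit ToyFnEventHandler(void (*fn)(EventNo, ReasonNo)) : fn_(fn) {}
    void operator()(EventNo ev, ReasonNo reason) override { fn_(ev, reason); }
private:
    void (*fn_)(EventNo, ReasonNo);
};

// Record of a (simulated) portal invocation.
struct PortalCall {
    PortalNo portal;
    EventNo  ev;
    ReasonNo reason;
};

// Delivery outside the kernel: the event is forwarded through a portal,
// simulated here by logging the call.
class ToyPrtlEventHandler : public ToyEventHandler {
public:
    ToyPrtlEventHandler(PortalNo p, std::vector<PortalCall> &log)
        : portal_(p), log_(log) {}
    void operator()(EventNo ev, ReasonNo reason) override {
        log_.push_back({portal_, ev, reason});
    }
private:
    PortalNo portal_;
    std::vector<PortalCall> &log_;
};

// Example in-kernel handler used to observe function-call delivery.
static int g_fn_calls = 0;
inline void count_call(EventNo, ReasonNo) { ++g_fn_calls; }

// The dispatcher does not care which kind of handler it was given.
inline void deliver(ToyEventHandler &h, EventNo ev, ReasonNo reason) {
    h(ev, reason);
}
```

Keeping the choice behind a virtual call means the code that raises an event stays the same whether the handler lives in the kernel or behind a portal.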

where

46b ⟨Off events and reasons data types. 46b⟩≡
    typedef natural_t off_event_t;
    typedef natural_t off_reason_t;

Root chunk (not used in this document). Defines: off_event_t, used in chunk 46a.

In turn, events may be either Traps or Irqs (interrupts).

46c ⟨Off Trap. 46c⟩≡
    class off_Trap: public off_Event {
    private:
        off_mdepTrap &t_mdep;
    };

Root chunk (not used in this document). Defines: off_Trap, used in chunk 45a. Uses off_Event 45c.


Although traps cannot be raised artificially, interrupts can (e.g. to notify users of interrupts “caused by the kernel” rather than by the hardware).

47a ⟨Off Irq. 47a⟩≡
    class off_Irq: public off_Event {
    private:
        off_mdepIrq &i_mdep;
    public:
        void raise(void);
        ⟨Other public methods of off_Irq. 47b⟩
    };

Root chunk (not used in this document). Defines: off_Irq, used in chunk 45b. Uses off_Event 45c.

Whenever an interrupt happens, we must choose either to evict the current shuttle and dispatch the interrupt to its handler, or to defer interrupt delivery in favor of the current shuttle. To make this choice, both shuttles and interrupts have priority levels. See section 5.1.3 for a discussion of interrupt priorities.

47b ⟨Other public methods of off_Irq. 47b⟩≡
    //Sets/Gets the priority level for this interrupt.
    void set_prty(off_pl_t prty);
    off_pl_t get_prty(void);

This code is used in chunk 47a. Uses off_pl_t 47c.
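The evict-or-defer decision can be sketched as follows. The comparison rule is an assumption made only for illustration (a larger number means a higher priority, and an interrupt preempts only when its level is strictly above that of the current shuttle); the actual policy is the one discussed in section 5.1.3. PriorityLevel, ToyIrq and IrqGate are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using PriorityLevel = unsigned;  // hypothetical stand-in for off_pl_t

struct ToyIrq {
    unsigned      number;  // interrupt number
    PriorityLevel prty;    // priority level of the interrupt
};

class IrqGate {
public:
    explicit IrqGate(PriorityLevel current) : current_(current) {}

    // Returns true when the interrupt should evict the current shuttle.
    // Otherwise the interrupt is deferred and kept pending.
    // Assumed rule: strictly higher level preempts.
    bool arrive(const ToyIrq &irq) {
        if (irq.prty > current_) return true;
        pending_.push_back(irq);
        return false;
    }

    // Called when a shuttle with level p becomes current. Returns the
    // deferred interrupts that can now be delivered, in arrival order.
    std::vector<ToyIrq> switch_to(PriorityLevel p) {
        current_ = p;
        std::vector<ToyIrq> ready, still_pending;
        for (const ToyIrq &i : pending_)
            (i.prty > p ? ready : still_pending).push_back(i);
        pending_ = still_pending;
        return ready;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    PriorityLevel       current_;
    std::vector<ToyIrq> pending_;
};
```

In this model set_prty and get_prty would simply write and read the prty field that the gate compares against the level of the running shuttle.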

where

47c ⟨Off interrupt priority level data type. 47c⟩≡
    typedef natural_t off_pl_t;

Root chunk (not used in this document). Defines: off_pl_t, used in chunks 47b and 49a.


We have seen a few machine-dependent objects such as off_mdepEvState, off_mdepTrap, and off_mdepIrq. They are defined for a particular architecture, and their meaning should be clear from the context. Machine-dependent objects are always prefixed by off_mdep and will be described along with the system implementation.

Finally, there are some “traps” defined by Off++ and not by the hardware. Some of them have appeared before and some will appear later on.

48a ⟨Off Virtual traps. 48a⟩≡
    // Virtual (Off++ raised) exceptions
    enum off_ex_t {
        OFF_EX_UNUSED,      // resource no longer referenced
        OFF_EX_MISSING,     // resource missing (remote?)
        OFF_EX_UNAVAILABLE, // resource exhausted
        OFF_EX_RELOC,       // resource has moved
        OFF_EX_XDT,         // translation to native format needed
        OFF_EX_FWD,         // request should be processed remotely
        OFF_EX_SCERR        // system call failed
    };

Root chunk (not used in this document).

4.3.1 Traps and interrupts for plain users

On the one hand, users can interact with trap and interrupt tables through methods provided by Shuttles and Processors. Thus, Shuttle (see section 5.1) and Processor (see section 4.5) behave as facades, because they provide simple entry points to handle both trap and interrupt tables. On the other hand, individual events (trap and interrupt entries) are exported to users by means of a wrapper class.

48b ⟨Off event for users. 48b⟩≡
    ENTRY class off_uEvent : public off_uHWResUnit {
    public:
        void set_handler( off_prtl_id_t h, const off_Rights &r);
        off_prtl_id_t get_handler(const off_Rights &r) const;
    };

Root chunk (not used in this document). Defines: off_uEvent, used in chunks 48c and 49a. Uses off_uHWResUnit 32b.

There is nothing specific for user traps as of this day.

48c ⟨Off Trap for users. 48c⟩≡
    ENTRY class off_uTrap : public off_uEvent {
    public:
    };

Root chunk (not used in this document). Defines: off_uTrap, used in chunks 54b and 69a. Uses off_uEvent 48b.


On the other hand, interrupts export a few methods.

49a ⟨Off Irq for users. 49a⟩≡
    ENTRY class off_uIrq: public off_uEvent {
    public:
        void raise(const off_Rights &r);
        void set_prty(off_pl_t prty, const off_Rights &r);
        off_pl_t get_prty(const off_Rights &r) const;
    };

Root chunk (not used in this document). Defines: off_uIrq, used in chunk 54b. Uses off_pl_t 47c and off_uEvent 48b.

4.4 DMA lines

DMA objects are used to allocate DMALines. Note that DMA lines are kept in the kernel, although they could have been treated as a device (and kept in userland). The reason is efficiency: we do not want protection domain crossings just to program a DMA line. Also, we believe that no flexibility is lost by placing this small piece of code in the kernel.

49b ⟨Off DMA table. 49b⟩≡
    // Allocation of DMA lines
    class off_DMA: public off_HWCompResource {
    public:
        // Allocates n DMA lines.
        off_DMALine *alloc(off_Protection *prot,natural_t n=1, off_dma_id_t at=OFF_EU_ID_NULL );
        // Deallocates the n contiguous DMA lines starting at d.
        void free( off_DMALine *d, natural_t n=1 );
        //Gets a DMA line from its number.
        off_DMALine *operator [](natural_t n) const;
    };

Root chunk (not used in this document). Defines: off_DMA, used in chunks 37b and 38b. Uses off_dma_id_t 49c, off_DMALine 50a, and off_HWCompResource 31.

where

49c ⟨Off DMA line identifier. 49c⟩≡
    typedef off_eu_id_t off_dma_id_t;

Root chunk (not used in this document). Defines: off_dma_id_t, used in chunks 49b and 50c. Uses off_eu_id_t 20a.


DMA lines are not very well modeled; instead, we provide a model of DMA quite close to that of Intel-based machines. Future ports of Off to other architectures should provide feedback on how DMA lines should be modeled.

50a ⟨Off DMA line. 50a⟩≡
    typedef off_mode_t dma_mode_t;
    class off_DMALine: public off_HWResUnit {
    public:
        // Enables/disables this dma line.
        void enable(void);
        void disable(void);
        // Programs a DMA operation.
        void program(off_pg_id_t addr, vm_offset_t length, dma_mode_t mode);
        // Gets # of copied or pending bytes.
        vm_offset_t get_copied(void);
        vm_offset_t get_pending(void);
    };

Root chunk (not used in this document). Defines: off_DMALine, used in chunk 49b. Uses dma_mode_t 50b, off_HWResUnit 32a, off_mode_t 13, and off_pg_id_t 42b.

where

50b ⟨Off DMA line mode data type. 50b⟩≡
    typedef natural_t dma_mode_t;

Root chunk (not used in this document). Defines: dma_mode_t, used in chunks 50a and 51a.

4.4.1 DMA lines for plain users

DMA related wrappers are straightforward.

50c ⟨Off DMA table for users. 50c⟩≡
    ENTRY class off_uDMA : public off_uHWCompResource {
    public:
        off_uDMALine *alloc(off_Protection *prot, natural_t n=1, off_dma_id_t at=OFF_EU_ID_NULL);
        void free(off_uDMALine *pf, const off_Rights &r, natural_t n=1 );
        off_uDMALine *fetch(off_dma_id_t d,const off_Rights &r) const;
    };

Root chunk (not used in this document). Uses off_dma_id_t 49c, off_uDMALine 51a, and off_uHWCompResource 32c.


51a ⟨Off DMA line for users. 51a⟩≡
    ENTRY class off_uDMALine : public off_uHWResUnit {
    public:
        // Enables/disables this dma line.
        void enable(const off_Rights &r);
        void disable(const off_Rights &r);
        // Programs a DMA operation.
        void program(off_pg_id_t addr, vm_offset_t length, dma_mode_t mode, const off_Rights &r);
        // Gets # of copied or pending bytes.
        vm_offset_t get_copied(const off_Rights &r);
        vm_offset_t get_pending(const off_Rights &r);
    };

Root chunk (not used in this document). Defines: off_uDMALine, used in chunk 50c. Uses dma_mode_t 50b, off_pg_id_t 42b, and off_uHWResUnit 32b.

4.5 Processors and processor pools

The processor table can be used to allocate a whole processor to certain processes. Usually the node singleton will be the owner of every processor in the system and will delegate ownership to user-level (or in-kernel) schedulers. Processor pools provide Processors. Usually a processor will be allocated to a user-level (or dynamically loaded in-kernel) scheduler, which will in turn implement a processor allocation policy.

51b ⟨Off processor table. 51b⟩≡
    class off_ProcTbl: public off_HWCompResource {
    public:
        // Allocates n processors.
        off_Processor *alloc(off_Protection *prot,natural_t n=1, off_proc_id_t at=OFF_EU_ID_NULL);
        // Deallocates the n contiguous processors starting at p.
        void free( off_Processor *p, natural_t n=1 );
        //Gets a processor from its number.
        off_Processor *operator [](off_proc_id_t n) const;
    };

Root chunk (not used in this document). Defines: off_ProcTbl, used in chunks 37b and 38b. Uses off_HWCompResource 31, off_Processor 52, and off_proc_id_t 51c.

where

51c ⟨Off processor identifier. 51c⟩≡
    typedef off_eu_id_t off_proc_id_t;

Root chunk (not used in this document). Defines: off_proc_id_t, used in chunks 51b and 54a. Uses off_eu_id_t 20a.


To help schedulers, processors provide some methods which we now discuss. First, a new shuttle can be explicitly installed in the processor using the switch_to method. This avoids using run queues when they are not needed. Thus, multiprocessing is an optional feature: dedicated single-threaded nodes can avoid run queues altogether (no processor multiplexing code will run).

When run queues are used, a run queue is an ordered collection of shuttle identifiers. Each identifier is accompanied by the number of clock ticks it is expected to run. The queue is run in a round-robin fashion until an event of interest to the scheduler occurs. To tell the processor when to stop the run queue and notify the scheduler, notify should be used. It specifies which events cause the stop and who should be notified (either by portal or, inside the kernel, by direct function call). In any case, get_current identifies the shuttle being run (or the last shuttle run, when idle).

52 ⟨Off processor. 52⟩≡
    class off_Processor : public off_HWResUnit {
    private:
        ⟨Other private members of off_Processor. 53b⟩
    public:
        // Installs a new shuttle
        void switch_to(off_Shtl &new_shtl);
        // Executes a run queue
        void run(off_shtl_id_t s[], off_offset_t ticks[], natural_t n);
        // Arranges for the processor to notify the owner on these events
        void notify( off_sevent_t events, off_EventHandler sched);
        // Returns the current shuttle
        off_shtl_id_t get_current(void);
        ⟨Other public methods of off_Processor. 53c⟩
    };

Root chunk (not used in this document). Defines: off_Processor, used in chunk 51b. Uses off_EventHandler 46a, off_HWResUnit 32a, off_sevent_t 53a, and off_Shtl 59.
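The round-robin walk over a run queue can be sketched with a tiny model. It is a sketch under assumptions: ShuttleNo, RunSlot and ToyRunQueue are hypothetical names, the queue is non-empty, every entry has at least one tick, and one call to tick() stands for one clock tick. Stopping on scheduling events is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using ShuttleNo = unsigned;  // hypothetical stand-in for off_shtl_id_t

// One run queue entry: a shuttle and the ticks it is expected to run.
struct RunSlot {
    ShuttleNo id;
    unsigned  ticks;
};

class ToyRunQueue {
public:
    // Precondition (assumed): slots is non-empty and every ticks >= 1.
    explicit ToyRunQueue(std::vector<RunSlot> slots)
        : slots_(std::move(slots)) {}

    // Consumes one clock tick and returns the shuttle that ran during it.
    // When a shuttle has used its ticks, the next entry takes over, and
    // the queue wraps around after the last entry.
    ShuttleNo tick() {
        ShuttleNo running = slots_[pos_].id;
        if (++used_ >= slots_[pos_].ticks) {
            used_ = 0;
            pos_ = (pos_ + 1) % slots_.size();
        }
        return running;
    }

private:
    std::vector<RunSlot> slots_;
    std::size_t pos_  = 0;  // index of the entry currently running
    unsigned    used_ = 0;  // ticks consumed by the current entry
};
```

With a queue holding shuttle 10 for two ticks and shuttle 20 for one tick, successive ticks run 10, 10, 20 and then start over.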


After the processor has invoked the scheduler to notify it of the event, the scheduler may either install a new run queue or just adjust the members of the current one. For now, the only events defined are the following (NB: these are not hardware events, but scheduling events).

53a ⟨Off scheduling events. 53a⟩≡
    // Scheduling related events.
    enum off_sevent_t {
        OFF_SEV_TRAP=0x01, // Trap occurred.
        OFF_SEV_IRQ =0x02, // Interrupt occurred.
        OFF_SEV_YLD =0x04, // Current shuttle blocked.
        OFF_SEV_BLK =0x08, // Other than current shuttle blocked.
        OFF_SEV_AWK =0x10, // A shuttle got awakened.
        OFF_SEV_END =0x20  // Run queue exhausted.
    };

Root chunk (not used in this document). Defines: off_sevent_t, used in chunks 52 and 54b.
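Because the scheduling events are distinct bits, a scheduler can register interest in several of them at once. The following sketch shows that use of the values; ToySchedEvent and ToyNotifier are hypothetical names, and treating the events argument of notify as an OR-ed mask is an assumption suggested by the power-of-two values rather than something the report states.

```cpp
#include <cassert>

// Same numeric values as off_sevent_t, under hypothetical names.
enum ToySchedEvent : unsigned {
    SEV_TRAP = 0x01,  // Trap occurred.
    SEV_IRQ  = 0x02,  // Interrupt occurred.
    SEV_YLD  = 0x04,  // Current shuttle blocked.
    SEV_BLK  = 0x08,  // Other than current shuttle blocked.
    SEV_AWK  = 0x10,  // A shuttle got awakened.
    SEV_END  = 0x20   // Run queue exhausted.
};

class ToyNotifier {
public:
    // Registers the OR-ed set of events the scheduler wants to hear about.
    void notify(unsigned events) { mask_ = events; }

    // True when the event should stop the run queue and reach the scheduler.
    bool wants(ToySchedEvent ev) const { return (mask_ & ev) != 0; }

private:
    unsigned mask_ = 0;  // no event is of interest until notify is called
};
```

A scheduler interested only in yields and queue exhaustion would call notify(SEV_YLD | SEV_END); every other event would then leave the run queue untouched.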

Although it is an implementation issue, we note that the run queue should be kept in the memory of the scheduler. The page holding the queue should be locked so that the processor can maintain a pointer to the run queue. In this way a user-level scheduler may adjust the run queue just by writing to its own memory, without further system calls.

Finally, as we said when we discussed traps and interrupts, each Processor contains two event tables, one for traps and another one for interrupts.

53b ⟨Other private members of off_Processor. 53b⟩≡
    off_TrapTbl p_traps; // Processor trap table.
    off_IntTbl p_irqs;   // Processor interrupt table.

This code is used in chunk 52. Uses off_IntTbl 45b and off_TrapTbl 45a.

The trap and interrupt tables of a given processor can be obtained with public methods of the Processor class:

53c ⟨Other public methods of off_Processor. 53c⟩≡
    //Returns a reference to the trap/interrupt event table.
    off_TrapTbl *&get_traps_ref(void);
    off_IntTbl *&get_irqs_ref(void);

This code is used in chunk 52. Uses off_IntTbl 45b and off_TrapTbl 45a.


4.5.1 Processors and processor pools for plain users

The processor pool wrapper is defined as follows:

54a ⟨Off processor table for users. 54a⟩≡
    class off_uProcTbl : public off_uHWCompResource {
    public:
        off_uProcessor *alloc(off_Protection *prot, natural_t n=1, off_proc_id_t at=OFF_EU_ID_NULL);
        void free(off_uProcessor *pf, const off_Rights &r, natural_t n=1 );
        off_uProcessor *fetch(off_proc_id_t d,const off_Rights &r) const;
    };

Root chunk (not used in this document). Uses off_proc_id_t 51c, off_uHWCompResource 32c, and off_uProcessor 54b.

Processors for users also include a few methods coming from trap and interrupt tables.

54b ⟨Off processor for users. 54b⟩≡
    ENTRY class off_uProcessor : public off_uHWResUnit {
    public:
        // Installs a new shuttle
        void switch_to(off_shtl_id_t &new_shtl, const off_Rights &proc_r, const off_Rights &shtl_r);
        // Executes a run queue
        void run(off_shtl_id_t s[], off_offset_t ticks[], natural_t n,
                 const off_Rights &proc_r, const off_Rights shtl_r[]);
        // Arranges for the processor to notify the owner on these events
        void notify( off_sevent_t events, off_prtl_id_t sched, const off_Rights &proc_r);
        // Returns the current shuttle
        off_shtl_id_t get_current(const off_Rights &proc_r);

        off_uTrap *alloc_trap(off_Protection *prot, natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
        void free_trap(off_uTrap *t, const off_Rights &r, natural_t n=1 );
        off_uTrap *fetch_trap(off_ev_id_t t,const off_Rights &r) const;

        off_uIrq *alloc_irq(off_Protection *prot, natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
        void free_irq(off_uIrq *t, const off_Rights &r, natural_t n=1 );
        off_uIrq *fetch_irq(off_ev_id_t t,const off_Rights &r) const;
    };

Root chunk (not used in this document). Defines: off_uProcessor, used in chunk 54a. Uses off_ev_id_t 44c, off_sevent_t 53a, off_uHWResUnit 32b, off_uIrq 49a, and off_uTrap 48c.


Chapter 5

Implementing abstract resources

The overall picture is depicted in figure 5.1. The relationships depicted there will become clear as this chapter proceeds.

[Figure 5.1: Relationship among Shuttles, Portals, DTLBs and other resources. Elements shown: EventHandler, PrtlEventHandler, FnEventHandler, IrqTbl, TrapTbl, Processor (runq, current), Shtl, ShtlSrv, Prtl, ShtlPSet, ShtlProp, DMM, DTLB, id_t, eu_id_t.]

5.1 Shuttles

Shuttles provide flows of control which can be extended to support user-level or in-kernel process abstractions. The ShtlSrv (there is one per node) maintains a pool of shuttles. Thus, the main task of the ShtlSrv is to allocate shuttles.

55 ⟨Off shuttle server. 55⟩≡
    // A shuttle server
    class off_ShtlSrv: public off_AbsCompResource {
    public:
        // Allocates a shuttle.
        off_Shtl *alloc(const off_Protection &prot, natural_t n=1, off_shtl_id_t at=OFF_SHTL_NULL);
        // Deallocates a shuttle.
        void free(off_shtl_id_t *p, natural_t n=1);
        // Locates a shuttle by its number.
        off_Shtl *operator [](off_shtl_id_t id);
        ⟨Other public methods of off_ShtlSrv. 63⟩
    };

Root chunk (not used in this document). Defines: off_ShtlSrv, used in chunks 37b and 38b. Uses off_AbsCompResource 33a and off_Shtl 59.


A shuttle has a processor context and a set of properties (as described below), and it is the only schedulable entity provided by the kernel. Shuttles run when installed in the run queue of a given processor (as we saw in section 4.5). As we will see in the next chapter, shuttles move from one protection domain to another, including the kernel protection domain, by means of portals. Besides, the use of upcalls from kernel to user space means that the set of kernel and user activations¹ (see figure 5.2) will be mixed. Thus, a user shuttle may call the kernel and the kernel may upcall back to the user during the system call; the upcall routine may in turn issue new system calls, and so on.

[Figure 5.2: Shuttle activations: system calls and up-calls. The diagram shows a user stack and a kernel stack of activation frames, starting at main(), linked by system calls and upcalls.]

We may be interested in three different processor contexts when referring to a shuttle:

1. The current register set, i.e. the last saved register set, which holds register values at the time the shuttle was last preempted. This is the one used when we are interested in the “current” state of the shuttle, no matter what its privilege level is.
2. The last user register set, i.e. the topmost user context saved in the stack of activations (system calls/upcalls). This is the one used when we are interested either in the user state at the last system call or in the state to be returned to the user.
3. The last kernel register set, i.e. the topmost kernel context saved in the stack of activations (system calls/upcalls). This is the one used when we are interested in the state of a shuttle while proceeding inside the kernel.

¹ By activation we mean entry points or cross-domain calls.


The first register set will always match one of the two latter ones.

58 ⟨Off off_Shtl register accessors. 58⟩≡
    // Get a pointer to current, last user, or last kernel
    // registers in the stack.
    off_mdepPRegs *get_regref(void);
    off_mdepPRegs *get_uregref(void);
    off_mdepPRegs *get_kregref(void);

This code is used in chunk 59.
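The relation among the three register sets can be made concrete with a small model of the activation stack. This is a sketch only: CpuMode, SavedRegs and ActivationStack are hypothetical names, and a saved context is reduced to a privilege mode plus a program counter.

```cpp
#include <cassert>
#include <vector>

enum class CpuMode { User, Kernel };

// Hypothetical saved context: privilege mode plus a program counter.
struct SavedRegs {
    CpuMode  mode;
    unsigned pc;
};

class ActivationStack {
public:
    // A system call or an upcall pushes a new saved context.
    void push(SavedRegs r) { frames_.push_back(r); }
    // Returning from the call or upcall pops it.
    void pop() { frames_.pop_back(); }

    // The current register set is the most recently saved context.
    const SavedRegs *current() const {
        return frames_.empty() ? nullptr : &frames_.back();
    }

    // The last user (or kernel) register set is the topmost context
    // saved in the given mode.
    const SavedRegs *last(CpuMode m) const {
        for (auto it = frames_.rbegin(); it != frames_.rend(); ++it)
            if (it->mode == m) return &*it;
        return nullptr;
    }

private:
    std::vector<SavedRegs> frames_;
};
```

Since the topmost frame is either a user or a kernel context, current() always coincides with last(CpuMode::User) or last(CpuMode::Kernel), which is the property stated above.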


A reference to the current processor context can be obtained with get_regref. The last user and kernel² registers are accessible by means of get_uregref and get_kregref respectively. As these methods return a reference to the register storage, no “set” methods are provided.

Shuttles may also wait on abstract system resources for some reason. When a shuttle goes to sleep it should call block. Later on, someone else may call ready to unblock it. These methods may cause scheduling events (SEV_BLK, SEV_YLD, and SEV_AWK; see section 4.5) to be delivered to the processor scheduler so that it can reschedule if desired. Another effect of calling block is that (even if no event is ever raised) the shuttle will be ignored by its processor until a subsequent call to ready is performed. The Processor where the shuttle is running checks the shuttle’s blocked flag and ignores the shuttle if it is set. The blocked flag means “unable to run” and can be used not only to block a shuttle, but also during shuttle initialization, during shuttle dismantling, etc. It should not be confused with the more abstract “blocked” state used in traditional process abstractions.

59 ⟨Off shuttle. 59⟩≡
    // A shuttle.
    class off_Shtl : public off_AbsResUnit {
    private:
        ⟨Other off_Shtl private members. 64c⟩
    public:
        ⟨Off off_Shtl register accessors. 58⟩
        ⟨Off off_Shtl property accessors. 64b⟩

        // Blocks due to some reason related to the given resource.
        // [May notify the scheduler]
        void block(off_id_t culprit, off_object_t culprit_type);
        // Awakes this shuttle.
        // [May notify the scheduler]
        void ready(void);
        // Is this shuttle blocked?
        inline boolean_t is_blocked(void) const;
        ⟨Other public methods of off_Shtl. 60a⟩
    };

Root chunk (not used in this document). Defines: off_Shtl, used in chunks 52, 55, 62, 65b, 67, 68, 70, and 74a. Uses off_AbsResUnit 33b and off_id_t 20a.

² As the kernel can be preempted, kernel registers will be saved in the kernel stack as user registers are.


Shuttles must not run on more than one processor at the same time. To ensure this, running shuttles are flagged, and every processor checks the flag before switching to a new shuttle. The following new methods do the job:

60a ⟨Other public methods of off_Shtl. 60a⟩≡
    // Checks/Sets if this shuttle is running.
    inline boolean_t is_running(void);
    inline void run(void);
    inline void stop(void);

This code is used in chunk 59.
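One way to make the check and the flagging safe when several processors race for the same shuttle is to merge them into a single atomic step. The sketch below shows that idea; it is not the Off++ implementation. FlaggedShuttle and try_run are hypothetical names, and the use of std::atomic is an assumption, since the report only lists is_running, run and stop.

```cpp
#include <atomic>
#include <cassert>

class FlaggedShuttle {
public:
    // Is this shuttle currently running on some processor?
    bool is_running() const { return running_.load(); }

    // Tries to claim the shuttle for the calling processor.
    // Succeeds only if the shuttle was not already flagged as running,
    // so two processors can never both obtain it.
    bool try_run() {
        bool expected = false;
        return running_.compare_exchange_strong(expected, true);
    }

    // Clears the flag when the processor switches away from the shuttle.
    void stop() { running_.store(false); }

private:
    std::atomic<bool> running_{false};
};
```

A processor would call try_run before switching to a shuttle and skip the shuttle when the call fails.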

5.1.1 Shuttles for plain users

60b ⟨Off shuttle server for users. 60b⟩≡
    ENTRY class off_uShtlSrv: public off_uAbsCompResource {
    public:
        // Allocates a shuttle.
        off_uShtl *alloc(const off_Protection &prot, natural_t n=1, off_shtl_id_t at=OFF_SHTL_NULL);
        // Deallocates a shuttle.
        void free(off_uShtl *p, const off_Rights &r, natural_t n=1);
        // Locates a shuttle by its number.
        off_uShtl *fetch(off_shtl_id_t id, const off_Rights &r);
        ⟨Other public methods of off_uShtlSrv. 66b⟩
    };

Root chunk (not used in this document). Uses off_uAbsCompResource 34a and off_uShtl 61a.


61a ⟨Off shuttle for users. 61a⟩≡
    ENTRY class off_uShtl : public off_uAbsResUnit {
    public:
        off_mdepPRegs get_reg(const off_Rights &r);
        off_mdepPRegs get_ureg(const off_Rights &r);
        off_mdepPRegs get_kreg(const off_Rights &r);
        void set_reg( off_mdepPRegs regs, const off_Rights &r);
        void set_ureg(off_mdepPRegs regs, const off_Rights &r);
        void set_kreg(off_mdepPRegs regs, const off_Rights &r);

        ⟨Off off_uShtl property accessors. (never defined)⟩

        // Blocks due to some reason related to the given resource.
        // [May notify the scheduler]
        void block(off_prtl_id_t culprit, const off_Rights &r);
        // Awakes this shuttle
        // [May notify the scheduler]
        void ready(const off_Rights &r);
        // Is this shuttle blocked?
        boolean_t is_blocked(const off_Rights &r) const;
        ⟨Other public methods of off_uShtl. 66a⟩
    };

Root chunk (not used in this document). Defines: off_uShtl, used in chunk 60b. Uses off_uAbsResUnit 34b.

5.1.2 Shuttle properties

Properties are used to specify which resources (e.g. DTLBs, IO maps, etc.) should be readily available for the shuttle to run. Specific resource instances are identified by a property value. Properties are named by values of type off_prop_t and their values by system identifiers of type off_id_t.

61b ⟨Off shuttle property identifier. 61b⟩≡
    // Properties
    typedef natural_t off_prop_t;

Root chunk (not used in this document). Defines: off_prop_t, used in chunks 63–66.


The context switch code is guaranteed to call a property switch function, pswitch, for every property changing its value (i.e. for every resource whose instance must be changed) in the context switch. Two additional functions, pset and pclr, are provided for those cases when the hardware is switching the context automatically but some work should be done to install (or deinstall) a property value in a given shuttle. Any resource provider implementing this interface³ can be considered to be a property provider.

62 ⟨Off shuttle property server. 62⟩≡
    // The interface of a property server.
    signature off_ShtlPropSrv {
        // Switches property values.
        // Returns either 0 or an error code.
        int pswitch(off_id_t from, off_id_t to, off_Shtl &s);
        // Set or clear the property at a given shuttle.
        int pset(off_id_t pval, off_Shtl &s, const off_Rights &r);
        int pclr(off_id_t pval, off_Shtl &s);
        // Should we call to pswitch?
        boolean_t needs_switch(void);
    };

Root chunk (not used in this document). Defines: off_ShtlPropSrv, used in chunks 63 and 74a. Uses off_id_t 20a and off_Shtl 59.

³ An abstract class could do the job, but GNU C++ signatures provide subtype polymorphism independently from the inheritance hierarchy.
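The rule that pswitch is called only for properties whose value changes can be sketched as below. The sketch makes several assumptions: an abstract class replaces the GNU signature (as the footnote suggests is possible), the shuttle argument is dropped, property sets are modelled with std::map, and PropNo, ResNo, RES_NONE, ToyPropProvider, RecordingProvider and switch_properties are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

using PropNo = unsigned;          // hypothetical stand-in for off_prop_t
using ResNo  = unsigned;          // hypothetical stand-in for off_id_t
constexpr ResNo RES_NONE = 0;     // hypothetical "no value" identifier

// Abstract class standing in for the off_ShtlPropSrv signature.
struct ToyPropProvider {
    virtual ~ToyPropProvider() = default;
    virtual int  pswitch(ResNo from, ResNo to) = 0;  // 0 or an error code
    virtual bool needs_switch() const = 0;           // call pswitch at all?
};

// Provider that records the switches it is asked to perform.
struct RecordingProvider : ToyPropProvider {
    explicit RecordingProvider(bool needs) : needs_(needs) {}
    int pswitch(ResNo from, ResNo to) override {
        calls.push_back({from, to});
        return 0;
    }
    bool needs_switch() const override { return needs_; }
    std::vector<std::pair<ResNo, ResNo>> calls;
private:
    bool needs_;
};

using PropValues = std::map<PropNo, ResNo>;  // ordered by property id

inline ResNo value_of(const PropValues &s, PropNo p) {
    auto it = s.find(p);
    return it == s.end() ? RES_NONE : it->second;
}

// Calls pswitch for each property whose value differs between the
// outgoing and incoming sets, skipping providers that need no switch.
// Returns the number of pswitch calls, or -1 if one of them failed.
inline int switch_properties(const PropValues &from, const PropValues &to,
                             const std::map<PropNo, ToyPropProvider *> &providers) {
    int switched = 0;
    for (const auto &entry : providers) {
        ToyPropProvider *prov = entry.second;
        if (!prov->needs_switch()) continue;
        ResNo a = value_of(from, entry.first);
        ResNo b = value_of(to, entry.first);
        if (a == b) continue;
        if (prov->pswitch(a, b) != 0) return -1;
        ++switched;
    }
    return switched;
}
```

Only providers that both need switching and see a changed value do any work, which keeps the common case of a context switch between similar shuttles cheap.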


The needs switch method is provided to save some calls to pswitch. Those properties which don’t need to implement pswitch may provide a trivial (i.e. empty) implementation and arrange for needs switch to return true. When a new property is being defined, the shuttle server accepting the definition will use needs switch to determine whether it should arrange for pswitch to be called or not. To define the property being implemented as a shuttle property, a property implementor must contact ask the shuttle server to define it. The ShtlSrv must include now this method to support property definitions. Indeed, we include two new methods. One to define properties implemented inside the µkernel (which may use a simple procedure call) and another one to define properties which might be implemented outside the µkernel (which must use portals). 63

⟨Other public methods of off_ShtlSrv. 63⟩≡
  // Defines a (new) shuttle property.
  void define_kprop(off_ShtlPropSrv &implementor, off_prop_t id, off_Protection &p);
  void define_uprop(off_prtl_id_t implementor, off_prop_t id, off_Protection &p);
  void undefine_prop(off_prop_t id, const off_Rights &r);
This code is used in chunk 55.
Uses off_prop_t 61b and off_ShtlPropSrv 62.
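To make the role of needs_switch at definition time more concrete, here is a minimal sketch of a shuttle server that records, once per property, whether pswitch has to be called. The class ShtlSrvSketch, its boolean return values, the helper calls_pswitch and the omission of protection arguments are all assumptions for illustration; the real methods return void and take an off_Protection.

```cpp
#include <cassert>
#include <map>

using off_prop_t = unsigned;   // assumed representation of a property id

// Hypothetical in-kernel provider, reduced to its needs_switch answer.
struct PropSrv {
    bool wants_switch;
    bool needs_switch() const { return wants_switch; }
};

class ShtlSrvSketch {
public:
    // Records the definition; returns false if the id is already defined.
    bool define_kprop(PropSrv &impl, off_prop_t id) {
        if (defs_.count(id)) return false;
        defs_[id] = Def{&impl, impl.needs_switch()};  // asked once, cached
        return true;
    }
    // Removes a definition; returns false if the id was not defined.
    bool undefine_prop(off_prop_t id) { return defs_.erase(id) == 1; }
    // True when context switch code must call pswitch for this property.
    bool calls_pswitch(off_prop_t id) const {
        auto it = defs_.find(id);
        return it != defs_.end() && it->second.call_switch;
    }
private:
    struct Def { PropSrv *impl; bool call_switch; };
    std::map<off_prop_t, Def> defs_;
};
```

Caching the answer at definition time keeps the query off the context switch path, which seems to be the point of providing needs_switch separately.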


If a foreign shuttle server ever finds a shuttle using an unknown property, it will ask the other shuttle servers on the network for the property definition.

Property values are stored in property sets, of type ShtlPSet. A ShtlPSet is a collection of property values ordered by property id, so that context switches can be done quickly. To save more time in context switches, ShtlPSets may be shared; thus they must be reference counted.

⟨Off shuttle property set. 64a⟩≡
  // A set of property values
  class off_ShtlPSet {
  private:
    ⟨Off private members for reference counting objects. (never defined)⟩
  protected:
    ⟨Off protected methods for reference counting objects. (never defined)⟩
  public:
    ⟨Off public methods for reference counting objects. (never defined)⟩
    // Adds a new property to this set.
    void add(off_prop_t p, off_id_t val);
    // Deletes a property from this set.
    void del(off_prop_t p);
    // Gets the value of the given property or OFF_SHTL_NULL.
    off_id_t operator[](off_prop_t p) const;
    // Gets the number of the last property in the set.
    off_prop_t get_maxprop(void);
  };
Root chunk (not used in this document).
Defines: off_ShtlPSet, used in chunks 64, 65c, 70, and 72b.
Uses off_id_t 20a and off_prop_t 61b.
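One possible way to keep such a set ordered by property id is a sorted vector with binary search. The sketch below is only that: PSetSketch, NULL_VALUE and the chosen integer types are assumptions, and reference counting is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using off_prop_t = unsigned;        // assumed representation
using off_id_t = unsigned long;     // assumed representation
constexpr off_id_t NULL_VALUE = 0;  // stand-in for the report's null id

class PSetSketch {
public:
    // Adds a property, or overwrites its value if it is already present.
    void add(off_prop_t p, off_id_t val) {
        auto it = lower(p);
        if (it != vals_.end() && it->first == p) it->second = val;
        else vals_.insert(it, Entry(p, val));  // insertion keeps the order
    }
    // Removes a property; does nothing if it is absent.
    void del(off_prop_t p) {
        auto it = lower(p);
        if (it != vals_.end() && it->first == p) vals_.erase(it);
    }
    // Value of the given property, or NULL_VALUE if it is not in the set.
    off_id_t operator[](off_prop_t p) const {
        auto it = std::lower_bound(vals_.begin(), vals_.end(), p, before);
        return (it != vals_.end() && it->first == p) ? it->second : NULL_VALUE;
    }
    // Highest property id in the set, or 0 when the set is empty.
    off_prop_t get_maxprop() const {
        return vals_.empty() ? 0 : vals_.back().first;
    }
private:
    using Entry = std::pair<off_prop_t, off_id_t>;
    static bool before(const Entry &e, off_prop_t p) { return e.first < p; }
    std::vector<Entry>::iterator lower(off_prop_t p) {
        return std::lower_bound(vals_.begin(), vals_.end(), p, before);
    }
    std::vector<Entry> vals_;  // always sorted by property id
};
```

Because both sets are sorted, a context switch could walk two of them in a single merge-like pass, which is presumably why the ordering matters.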

Each shuttle has its own (possibly shared) ShtlPSet, which can be accessed through the accessor get_psetref.

⟨Off off_Shtl property accessors. 64b⟩≡
  inline off_ShtlPSet *&get_psetref(void);
This definition is continued in chunk 65.
This code is used in chunk 59.
Uses off_ShtlPSet 64a.

Some time can be saved in context switches by avoiding the comparison of property arrays. Thus, the shuttle does not include the ShtlPSet itself but a reference to it. Property sets can then be shared, and a single pointer comparison tells whether two shuttles have the same property arrays.

⟨Other off_Shtl private members. 64c⟩≡
  off_ShtlPSet *s_pset; // Shuttle properties.
This code is used in chunk 59.
Uses off_ShtlPSet 64a.
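The pointer comparison and the reference counting it relies on can be sketched as follows. PSet, Shuttle, share_props and same_properties are hypothetical names, and the set is reduced to its reference count; this is an assumption-laden illustration, not the Off++ implementation.

```cpp
#include <cassert>

// Hypothetical property set, reduced to its reference count.
struct PSet {
    int refs = 1;
};

// Hypothetical shuttle holding only the pointer to its property set.
struct Shuttle {
    PSet *s_pset = nullptr;
};

// Makes dst share the property set of src, releasing dst's previous set.
inline void share_props(Shuttle &dst, const Shuttle &src) {
    if (dst.s_pset == src.s_pset) return;
    if (dst.s_pset && --dst.s_pset->refs == 0) delete dst.s_pset;
    dst.s_pset = src.s_pset;
    if (dst.s_pset) ++dst.s_pset->refs;
}

// Fast path of a context switch: identical pointers mean identical
// property values, so no per-property comparison is needed.
inline bool same_properties(const Shuttle &from, const Shuttle &to) {
    return from.s_pset == to.s_pset;
}
```

Two distinct sets with equal contents would still fail the pointer test and fall back to the per-property comparison, so the shortcut is safe but not exhaustive.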


To provide more convenient entry points for users, some methods are included to get and set property values.

⟨Off off_Shtl property accessors. 64b⟩+≡
  // Gets/Sets the value for a given property.
  off_id_t get_prop(off_prop_t p) const;
  void set_prop(off_prop_t p, off_id_t val);
This code is used in chunk 59.
Uses off_id_t 20a and off_prop_t 61b.

Another method arranges for every property to be shared between two shuttles.

⟨Off off_Shtl property accessors. 64b⟩+≡
  // Resets the properties so that they are shared with another shuttle.
  void dup_props(const off_Shtl &s);
This code is used in chunk 59.
Uses off_Shtl 59.

Not every property needs to call its switch function on a context switch. Properties that do not need switch functions can be placed last in the array, so that the comparison stops as soon as the first property without a switch function is reached.

Finally, although we have not mentioned it, properties can also support user-level context: a user-level function can be arranged to be called at the beginning of every quantum. That function can establish user-level context pieces if needed (e.g. set a pointer to per-thread data, or establish a temporary memory map so that every thread has a private (secure) store). When such a property is used, its property value might identify the (user-space) argument to the function setting the user-level context.

Allocation of ShtlPSets

Property sets are stored in a ShtlPSetSrv used internally by the shuttle server. It is considered a composite resource, but it is neither a hardware resource container nor an abstract resource container.

⟨Off shuttle property sets server. 65c⟩≡
  class off_ShtlPSetSrv : public off_CompResource {
  public:
    // Allocates a PSet
    off_ShtlPSet *alloc(void);
    // Deallocates a PSet.
    void free(off_ShtlPSet *p);
  };
Root chunk (not used in this document).
Uses off_CompResource 20b and off_ShtlPSet 64a.


Shuttle properties for plain users

For system users, properties are handled by shuttles. Thus there are a few more methods in uShtl.

⟨Other public methods of off_uShtl. 66a⟩≡
  off_id_t get_prop(off_prop_t p, const off_Rights &r) const;
  void set_prop(off_prop_t p, off_id_t val, const off_Rights &shtl_r, const off_Rights &prop_r);
  void dup_props(const off_shtl_id_t &s, const off_Rights &this_r, const off_Rights &s_r);
This definition is continued in chunk 69a.
This code is used in chunk 61a.
Uses off_id_t 20a and off_prop_t 61b.

Property definition is also available for system users. The uShtlSrv provides this service.

⟨Other public methods of off_uShtlSrv. 66b⟩≡
  void define_prop(off_prtl_id_t implementor, off_prop_t id, const off_Protection &r);
  void undefine_prop(off_prop_t id, const off_Rights &r);
This code is used in chunk 60b.
Uses off_prop_t 61b.


5.1.3 Shuttles and hardware events

Shuttles and Traps

As we have seen in section 4.5, each Processor includes both a TrapTbl and an IntTbl. Since in certain circumstances we may wish to dispatch traps on a per-shuttle basis, there is a predefined shuttle property (implemented by the shuttle trap table server, ShtlTrapTblSrv) defining the traps in which the current shuttle has an interest.

⟨Off Shuttle trap table server. 67⟩≡
  class off_ShtlTrapTblSrv : public off_CompResource {
  public:
    // Allocates a trap table.
    off_TrapTbl *alloc(void);
    // Deallocates a trap table.
    void free(off_TrapTbl *t);
    // To implement the trap table property:
    // Switches property values.
    // Returns either 0 or an error code.
    int pswitch(off_trapt_id_t from, off_trapt_id_t to, off_Shtl &s);
    // Set or clear the property at a given shuttle.
    int pset(off_trapt_id_t pval, off_Shtl &s, const off_Rights &r);
    int pclr(off_trapt_id_t pval, off_Shtl &s);
    // Should we call to pswitch?
    boolean_t needs_switch(void);
  };
Root chunk (not used in this document).
Uses off_CompResource 20b, off_Shtl 59, and off_TrapTbl 45a.


Thus, to have certain traps “redefined” for certain shuttles, a TrapTbl should be allocated and installed as a property in those shuttles. Note that traps already “allocated” in the Processor trap table will not be defined on a per-shuttle basis. Whenever a trap occurs, the processor trap table has priority over the shuttle trap table (e.g. shuttles may not bypass the DMM handling of page fault traps even if they include an entry for such a trap in their trap tables). When used, trap tables are considered part of the shuttle itself: the trap server is never seen outside the shuttle server.

Shuttles and interrupts

Whenever an interrupt happens, we must choose either to evict the current shuttle and dispatch the interrupt to its handler, or to defer interrupt delivery in favor of the current shuttle. To make the choice, both shuttles and interrupts have priority levels. On a processor interrupt, the lower⁴ priority wins (considering the priority level of the executing shuttle and that of the interrupt). To arbitrate, no shuttle may set a priority level lower than the lowest existing interrupt priority level, and no interrupt may be adjusted to a priority level lower than the lowest existing shuttle interrupt priority level. In short, when users request a priority level (be it for a shuttle or for an interrupt) the policy is FCFS; e.g. if interrupt i already has priority level n, no shuttle may set its interrupt priority level to a number below n+1. To implement this feature we must include priority levels as a shuttle property predefined by the shuttle server.

⟨Off Shuttle interrupt priority level server. 68⟩≡
  class off_ShtlIntPrtySrv {
  public:
    // Switches property values.
    // Returns either 0 or an error code.
    int pswitch(off_pl_id_t from, off_pl_id_t to, off_Shtl &s);
    // Set or clear the property at a given shuttle.
    int pset(off_pl_id_t pval, off_Shtl &s, const off_Rights &r);
    int pclr(off_pl_id_t pval, off_Shtl &s);
    // Should we call to pswitch?
    boolean_t needs_switch(void);
  };
Root chunk (not used in this document).
Uses off_Shtl 59.

⁴ To follow the fine tradition of UNIX systems.
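The arbitration between a running shuttle and an incoming interrupt can be sketched with two small functions. This is one reading of the rules under stated assumptions: levels are plain unsigned numbers where a smaller number wins, ties favor the running shuttle, and "lower than the lowest" is taken numerically (the n+1 example in the text would suggest a strict comparison instead). The names level_t, interrupt_preempts and level_allowed are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

using level_t = unsigned;  // assumed: smaller number = wins arbitration

// True when the interrupt should evict the running shuttle.
// Assumption: on equal levels the running shuttle keeps the processor.
inline bool interrupt_preempts(level_t shuttle_level, level_t irq_level) {
    return irq_level < shuttle_level;
}

// First-come, first-served check: a requested level is accepted only if
// it is not numerically below the lowest level already held by the
// other kind of client (interrupts for a shuttle request, and vice versa).
inline bool level_allowed(level_t requested, const std::vector<level_t> &others) {
    if (others.empty()) return true;
    return requested >= *std::min_element(others.begin(), others.end());
}
```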


The interrupt priority is considered to be inside the shuttle. The interrupt priority server is never seen outside the shuttle server.

Shuttle traps and shuttle interrupt priorities for plain users

The trap table and the interrupt priority are considered (and seen) as predefined properties by system users. The methods get_prop, set_prop and dup_props are well behaved with respect to them. Users may set the value of the interrupt priority level property without further operations (they do not need to allocate priority levels, as these are predefined). Trap tables, however, are neither allocated nor deallocated by users; in fact, users do not even see them. To handle per-shuttle trap definitions, the shuttle acts as a facade for its users, mimicking the behavior of Processor in this respect.

⟨Other public methods of off_uShtl. 66a⟩+≡
  off_uTrap *alloc_trap(off_Protection *prot, natural_t n=1, off_ev_id_t at=OFF_EU_ID_NULL);
  void free_trap(off_uTrap *t, const off_Rights &r, natural_t n=1);
  off_uTrap *fetch_trap(off_ev_id_t t, const off_Rights &r) const;
This code is used in chunk 61a.
Uses off_ev_id_t 44c and off_uTrap 48c.
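The precedence of the processor trap table over a per-shuttle table can be illustrated with a small lookup routine. TrapTblSketch, handler_t (with 0 meaning "no handler") and resolve_trap are hypothetical names chosen for this sketch; real trap tables hold handler objects rather than integers.

```cpp
#include <cassert>
#include <map>

using off_ev_id_t = unsigned;  // assumed representation of a trap id
using handler_t = int;         // hypothetical handler token; 0 = none

struct TrapTblSketch {
    std::map<off_ev_id_t, handler_t> entries;
    // Handler registered for trap t, or 0 if the trap is undefined here.
    handler_t lookup(off_ev_id_t t) const {
        auto it = entries.find(t);
        return it == entries.end() ? 0 : it->second;
    }
};

// Processor entries win; the shuttle table is consulted only for traps
// that the processor table leaves undefined.
inline handler_t resolve_trap(const TrapTblSketch &processor,
                              const TrapTblSketch *shuttle,
                              off_ev_id_t t) {
    if (handler_t h = processor.lookup(t)) return h;
    return shuttle ? shuttle->lookup(t) : 0;
}
```

With this ordering, a shuttle entry for a trap already claimed by the processor table (a page fault, say) is simply never reached.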

5.2 Portals

A portal is a communication endpoint. It can be attached to a handler and also invoked like an interrupt. Portals are contained in the portal server, an instance of the PrtlSrv class whose main task is to allocate portals.

⟨Off portal server. 69b⟩≡
  class off_PrtlSrv : public off_AbsCompResource {
  public:
    // Allocates a portal.
    off_Prtl *alloc(const off_Protection &p, natural_t n=1, off_prtl_id_t at=OFF_PRTL_NULL);
    // Deallocates a portal
    void free(off_Prtl *p, natural_t n=1);
    // Gets a portal by its number
    off_Prtl *operator[](off_prtl_id_t p) const;
  };
Root chunk (not used in this document).
Defines: off_PrtlSrv, used in chunks 37b and 38b.
Uses off_AbsCompResource 33a and off_Prtl 70.


Portals support both protected control transfers (PCTs) and asynchronous message delivery. When used for PCTs, the sender shuttle changes its properties on the fly and becomes a flow of control in the receiver's protection domain. After the handler completes, the shuttle recovers its original state and continues its execution. Callees may, at any time, delegate the execution of PCTs to other servers. In this case, a single shuttle can cross several protection domains and come back directly to its original site. To delegate a PCT reply to a different server, pct_pass can be used.

Finally, when no reply is ever expected, deliver should be used instead. It simply saves (in the callee context) what users are unable to save safely (like eflags on Intel architectures) and installs a simple activation frame for the handler on the receiver's shuttle. In this case, the caller resumes execution as soon as the callee has been notified; the message has been sent but may not have been delivered yet.

The handler information present in the portal is basically a program counter (pc), a pointer to a pool of stack pointers (spp), an access mode (mode), and an (optional) set of properties (props). Each property value in props causes the sender's property to be set to that value during portal invocation. The remaining sender properties keep their original values. As the address space property (see section 5.3) will most likely be present, that property is factored out of the property array to gain some performance. On nodes where the µkernel is compiled without support for multiple protection domains, a null value is used and the property is ignored.

When the props set is not present in the portal, the portal does not support PCTs but only asynchronous message delivery. The handler information above is enough to support PCTs, but asynchronous message delivery also requires the portal to have a shuttle identifier. Thus, when deliver is used, the handler (known from the portal information) runs not on the sender shuttle but on the receiver. If the receiver shuttle identifier is not specified, the portal does not support asynchronous delivery.

⟨Off portal. 70⟩≡
  class off_Prtl : public AbsResUnit {
  private:
    ⟨Other off_Prtl private members. 71⟩
  public:
    // (Re)sets the portal handler.
    void set_hndlr(vm_offset_t pc, vm_offset_t spp, off_mode_t mode, off_dtlb_id_t vas,
                   natural_t maxmsgsz=0, const off_ShtlPSet *props=NULL,
                   off_shtl_id_t s=OFF_SHTL_NULL);
    // (Locally) delivers a message to this portal using a PCT.
    // (May reset the stack)
    void pct(off_Shtl &sender, void *msg, natural_t msgsz,
             void *reply, natural_t replysz, natural_t tmout) const;
    // Pass the PCT to a different portal.
    // Reply to the initial PCT will be issued from there. (May reset the stack)
    void pct_pass(off_Shtl &sender, off_prtl_id_t p) const;
    // Delivers an event to a portal (one-way PCT).
    // (May reset the stack)
    void deliver(off_Shtl &sender, natural_t rq_nwords, void *rq_msg, natural_t tmout) const;
  };
Root chunk (not used in this document).
Defines: off_Prtl, used in chunk 69b.
Uses off_mode_t 13, off_Shtl 59, and off_ShtlPSet 64a.
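The way a PCT overlays the portal's props on the sender's properties and later restores them can be sketched as below. PropMap and pct_sketch are hypothetical, properties are modelled as a plain map, and the handler is assumed to return a value; none of this reflects how the µkernel actually stores or switches properties.

```cpp
#include <cassert>
#include <map>

using off_prop_t = unsigned;     // assumed representation
using off_id_t = unsigned long;  // assumed representation
using PropMap = std::map<off_prop_t, off_id_t>;

// Runs handler with the portal's property values laid over the sender's
// own, then restores the sender's original values before returning.
// Assumption: handler returns a (non-void) result.
template <typename Handler>
auto pct_sketch(PropMap &sender, const PropMap &portal_props, Handler handler)
    -> decltype(handler(sender)) {
    const PropMap saved = sender;             // state to recover afterwards
    for (const auto &kv : portal_props)       // only listed properties change
        sender[kv.first] = kv.second;
    auto result = handler(sender);            // runs "in the receiver domain"
    sender = saved;                           // shuttle recovers its state
    return result;
}
```

Properties not listed in the portal keep the sender's values during the call, matching the description of props above.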

Portals may be used to enforce object access control. When the handler is attached to the portal, the access mode permitted through this portal is specified. The kernel does not enforce it. Instead, the access mode is placed in a pre-specified register so that the user handler can check the validity of the access. For example, a file object may create a read-only access point using a portal specifying OFF_OP_R as the only permitted access mode. We could have omitted this access mode feature from portals, but it does not change the portal semantics and it avoids calls that would fail to proceed further.

The stack pool pointer (spp) portal member points to an array of initial stack pointers provided by the handler. Each incoming invocation uses, as its own user stack, one of the stacks found there. The pool of stacks can be of any length, but it is assumed that the last stack pointer is invalid and has the value 1. Upon portal invocation, the kernel selects one of the non-zero pointers found in the pool and then resets it to 0. Since a stack pointer with value 0 is ignored, this is a safe mechanism to avoid race conditions during stack selection. To allow the stack to be reused, the user-level handler must (atomically) write the original stack pointer value back to the stack pool.

When there are no more free stacks, any further call on the portal blocks. The blocked shuttles are queued using the p_wstack private member (with shuttle identifiers instead of pointers, so that the list can span several nodes). When a new stack is made available, they are awakened.

⟨Other off_Prtl private members. 71⟩≡
  off_Bureaucrat p_wstack; // No stack available on a PCT portal
This code is used in chunk 70.
Uses off_Bureaucrat 17a.
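The stack pool protocol lends itself to a short sketch using atomic exchange. Here the pool is an array of std::atomic values living in one address space, which is an assumption (the real pool sits in user memory and is read by the kernel); take_stack, release_stack, STACK_TAKEN and POOL_END are hypothetical names, and the terminator value 1 is taken from the text as printed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uintptr_t STACK_TAKEN = 0;  // slot currently in use
constexpr std::uintptr_t POOL_END = 1;     // terminator, as stated in the text

// Claims one free stack. Returns its stack pointer and stores the slot
// index in *slot, or returns 0 when every stack is busy (the caller
// would then block, as described for p_wstack).
inline std::uintptr_t take_stack(std::atomic<std::uintptr_t> *pool, std::size_t *slot) {
    for (std::size_t i = 0; ; ++i) {
        std::uintptr_t sp = pool[i].load();
        if (sp == POOL_END) return 0;     // reached the end, nothing free
        if (sp == STACK_TAKEN) continue;  // busy slot, try the next one
        // exchange is atomic: only one claimant can read the non-zero value.
        sp = pool[i].exchange(STACK_TAKEN);
        if (sp != STACK_TAKEN) { *slot = i; return sp; }
    }
}

// Handler side: atomically writes the original stack pointer back.
inline void release_stack(std::atomic<std::uintptr_t> *pool, std::size_t slot,
                          std::uintptr_t sp) {
    pool[slot].store(sp);
}
```

If two invocations race for the same slot, the loser sees 0 from the exchange and moves on to the next slot, which is the race avoidance the text relies on.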


5.2.1 Portals for plain users

The portal server has a “conventional” wrapper.

⟨Off portal server for users. 72a⟩≡
  ENTRY class off_uPrtlSrv : public off_uAbsCompResource {
  public:
    off_uPrtl *alloc(const off_Protection &prot, natural_t n=1, off_prtl_id_t at=OFF_PRTL_NULL);
    void free(off_uPrtl *p, natural_t n=1, const off_Rights &r);
    off_uPrtl *operator [](off_prtl_id_t id, const off_Rights &r);
  };
Root chunk (not used in this document).
Uses off_uAbsCompResource 34a and off_uPrtl 72b.

The portal wrapper is the most special one in the kernel. Only the non-delivering methods are exported to users in the usual way. The methods used for message delivery and PCTs are implemented without wrapper services (because they implement the portal delivery mechanism that the wrappers themselves use).

⟨Off portal for users. 72b⟩≡
  ENTRY class off_uPrtl : public uAbsResUnit {
  public:
    // (Re)sets the portal handler.
    void set_hndlr(vm_offset_t pc, vm_offset_t spp, off_mode_t mode, off_dtlb_id_t vas,
                   natural_t maxmsgsz=0, const off_ShtlPSet *props=NULL,
                   off_shtl_id_t s=OFF_SHTL_NULL,
                   const off_Rights &prtl_r, const off_Rights &shtl_r,
                   const off_Rights &dtlb_r);
  };
Root chunk (not used in this document).
Defines: off_uPrtl, used in chunk 72a.
Uses off_mode_t 13 and off_ShtlPSet 64a.


5.3 Distributed Memory Managers

Off++ memory management is based on Distributed TLBs (DTLBs), where TLB stands for Translation Lookaside Buffer, a cache of virtual-to-physical memory address translations [12]. Each Distributed Memory Manager (DMM, for short) is actually a pool of DTLBs. However, DTLBs are an optional feature. Machines dedicated to a single application (like a dedicated file server, an embedded controller, etc.) can use a single protection domain for efficiency. A DMM multiplexes the address translation hardware among the existing DTLBs in cooperation with the shuttle server (as there is usually one processor register used to identify the protection domain).

⟨Off DMM. 73⟩≡
  class off_DMM : public off_AbsCompResource {
  public:
    // Allocates a DTLB.
    off_DTLB *alloc(const off_Protection &prot, natural_t n=1, off_dtlb_id_t at=OFF_DTLB_NULL);
    // Deallocates a DTLB.
    void free(off_DTLB *DTLB, natural_t n=1);
    //Gets a DTLB from its number.
    off_DTLB *operator [](off_dtlb_id_t id);
    //Gets the size of pages being translated.
    vm_size_t get_pgsize(void);
    ⟨Off off_DMM shuttle property methods. 74a⟩
  };
Root chunk (not used in this document).
Defines: off_DMM, used in chunks 37b and 38b.
Uses off_AbsCompResource 33a and off_DTLB 74b.


As DTLBs are valid shuttle property values, the DMM also implements the ShtlPropSrv interface.

⟨Off off_DMM shuttle property methods. 74a⟩≡
  // Property interface routines (off_ShtlPropSrv signature).
  // Switches property values.
  // Returns either 0 or an error code.
  int pswitch(off_dtlb_id_t from, off_dtlb_id_t to, off_Shtl &s);
  // Set or clear the property at a given shuttle.
  int pset(off_dtlb_id_t pval, off_Shtl &s, const off_Rights &r);
  int pclr(off_dtlb_id_t pval, off_Shtl &s);
  // Should we call to pswitch?
  boolean_t needs_switch(void);
This code is used in chunk 73.
Uses off_Shtl 59 and off_ShtlPropSrv 62.

A DTLB is a set of translations. Not every existing translation needs to be present in it: the DTLB should be considered a cache of the translations being used. However, on architectures with page tables, the cache may actually hold a copy of every existing translation to local memory. Page faults and the remaining events for the DTLB are delivered to its owner (see off_Resource for details).

⟨Off DTLB. 74b⟩≡
  class off_DTLB : public AbsResUnit {
  public:
    // Installs a set of (contiguous) address translations.
    void install(vm_offset_t va, off_pg_id_t pa, off_mode_t access_mode, natural_t n=1);
    // Deinstalls a set of (contiguous) address translations.
    void invalidate(vm_offset_t va, natural_t n=1);
    // Changes the access mode bits for the given translations
    void set_mode(vm_offset_t va, off_mode_t access_mode, natural_t n=1);
  };
Root chunk (not used in this document).
Defines: off_DTLB, used in chunk 73.
Uses off_mode_t 13 and off_pg_id_t 42b.


Note that these addresses can also refer to remote memory. Remote translations make intensive use of the DMM’s relocation table (one such table is present in every AbsCompResource object, as we saw in section 2.5).
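To illustrate the idea of a DTLB as a cache of translations with install, invalidate and set_mode operations, here is a minimal software model. DTLBSketch, its translate method, the fixed page size kPageSize, the MODE_R and MODE_W bits and the 64-bit stand-in types are assumptions made for the example; remote memory and the relocation table are not modelled.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using VAddr = std::uint64_t;    // stands in for vm_offset_t
using FrameId = std::uint64_t;  // stands in for off_pg_id_t
using Mode = unsigned;          // stands in for off_mode_t (bit mask)
constexpr VAddr kPageSize = 4096;        // hypothetical page size
constexpr Mode MODE_R = 1, MODE_W = 2;   // hypothetical access bits

class DTLBSketch {
public:
    // Installs n contiguous translations starting at va -> pa.
    void install(VAddr va, FrameId pa, Mode mode, unsigned n = 1) {
        for (std::uint64_t i = 0; i < n; ++i)
            map_[va / kPageSize + i] = Entry{pa + i, mode};
    }
    // Deinstalls n contiguous translations starting at va.
    void invalidate(VAddr va, unsigned n = 1) {
        for (std::uint64_t i = 0; i < n; ++i) map_.erase(va / kPageSize + i);
    }
    // Changes the access mode of n installed translations starting at va.
    void set_mode(VAddr va, Mode mode, unsigned n = 1) {
        for (std::uint64_t i = 0; i < n; ++i) {
            auto it = map_.find(va / kPageSize + i);
            if (it != map_.end()) it->second.mode = mode;
        }
    }
    // Returns true and fills *pa on a hit with sufficient access rights;
    // a false result models the event delivered to the DTLB owner.
    bool translate(VAddr va, Mode wanted, FrameId *pa) const {
        auto it = map_.find(va / kPageSize);
        if (it == map_.end() || (it->second.mode & wanted) != wanted) return false;
        *pa = it->second.frame;
        return true;
    }
private:
    struct Entry { FrameId frame; Mode mode; };
    std::map<VAddr, Entry> map_;  // keyed by virtual page number
};
```

A miss in this model corresponds to a translation that exists but is not cached, which the owner would resolve by calling install again.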

5.3.1 DTLBs for plain users

Users can handle their DTLBs through the DTLB and DMM wrappers.

⟨Off DMM for users. 75a⟩≡
  ENTRY class off_uDMM : public off_uAbsCompResource {
  public:
    // Allocates a DTLB.
    off_uDTLB *alloc(const off_Protection &prot, natural_t n=1, off_dtlb_id_t at=OFF_DTLB_NULL);
    // Deallocates a DTLB.
    void free(off_uDTLB *DTLB, natural_t n=1, const off_Rights &r);
    //Gets a DTLB from its number.
    off_uDTLB *operator [](off_dtlb_id_t id, const off_Rights &r);
    //Gets the size of pages being translated.
    vm_size_t get_pgsize(const off_Rights &r);
  };
Root chunk (not used in this document).
Uses off_uAbsCompResource 34a and off_uDTLB 75b.

⟨Off DTLB for users. 75b⟩≡
  ENTRY class off_uDTLB : public off_uAbsResUnit {
  public:
    // Installs a set of (contiguous and w/ the same access rights)
    // address translations.
    void install(vm_offset_t va, off_pg_id_t pa, off_mode_t access_mode, natural_t n=1,
                 const off_Rights &dtlb_r, const off_Rights &pa_r);
    // Deinstalls a set of (contiguous) address translations.
    void invalidate(vm_offset_t va, natural_t n=1, const off_Rights &r);
    // Changes the access mode bits for the given translations
    void set_mode(vm_offset_t va, off_mode_t access_mode, natural_t n=1,
                  const off_Rights &dtlb_r, const off_Rights &pa_r);
  };
Root chunk (not used in this document).
Defines: off_uDTLB, used in chunk 75a.
Uses off_mode_t 13, off_pg_id_t 42b, and off_uAbsResUnit 34b.


Appendix A

Index of Chunks

⟨A chunk including another. 7b⟩
⟨Continuing chunk of code. 7c⟩
⟨Example chunk of code. 7a⟩
⟨Off access checker. 14a⟩
⟨Off access mode. 13⟩
⟨Off access operations. 12⟩
⟨Off allocator. 27b⟩
⟨Off attribute type ids. 26b⟩
⟨Off attributes. 27a⟩
⟨Off block allocator. 29b⟩
⟨Off bookkeeping allocator. 28⟩
⟨Off bureaucrat. 17a⟩
⟨Off compound resource. 20b⟩
⟨Off compound resource for users. 22a⟩
⟨Off DMA line. 50a⟩
⟨Off DMA line for users. 51a⟩
⟨Off DMA line identifier. 49c⟩
⟨Off DMA line mode data type. 50b⟩
⟨Off DMA table. 49b⟩
⟨Off DMA table for users. 50c⟩
⟨Off DMM. 73⟩
⟨Off DMM for users. 75a⟩
⟨Off DTLB. 74b⟩
⟨Off DTLB for users. 75b⟩
⟨Off elementary resource unit. 21⟩
⟨Off elementary resource unit for users. 22b⟩
⟨Off event. 45c⟩
⟨Off event for users. 48b⟩
⟨Off event handler. 46a⟩
⟨Off event identifier. 44c⟩
⟨Off event table. 44b⟩
⟨Off events and reasons data types. 46b⟩
⟨Off Exhausted resource revocation trigger. 30⟩
⟨Off fixed allocator. 29a⟩
⟨Off hardware resource container. 31⟩
⟨Off hardware resource container for users. 32c⟩
⟨Off hardware resource unit. 32a⟩
⟨Off hardware resource unit for users. 32b⟩
⟨Off identifiers. 19b⟩
⟨Off interrupt priority level data type. 47c⟩
⟨Off interrupt table. 45b⟩
⟨Off IO bank. 42c⟩
⟨Off IO bank for users. 43b⟩
⟨Off IO identifier. 42d⟩
⟨Off IO port. 43a⟩
⟨Off IO port for users. 44a⟩
⟨Off Irq. 47a⟩
⟨Off Irq for users. 49a⟩
⟨Off main entry point. 39⟩
⟨Off memory bank. 40⟩
⟨Off memory bank for users. 41d⟩
⟨Off node. 36⟩
⟨Off node for users. 38b⟩
⟨Off off_DMM shuttle property methods. 74a⟩
⟨Off off_Shtl property accessors. 64b⟩
⟨Off off_Shtl register accessors. 58⟩
⟨Off off_uShtl property accessors. (never defined)⟩
⟨Off page frame. 41b⟩
⟨Off page frame bits data type. 41c⟩
⟨Off page frame for users. 42a⟩
⟨Off page frame identifier. 41a⟩
⟨Off portal. 70⟩
⟨Off portal for users. 72b⟩
⟨Off portal server. 69b⟩
⟨Off portal server for users. 72a⟩
⟨Off private members for lockable objects. 16a⟩
⟨Off private members for protected objects. 14b⟩
⟨Off private members for reference counting objects. 15d⟩
⟨Off private members for reference counting objects. (never defined)⟩
⟨Off private members for sequencing objects. 17b⟩
⟨Off private members for sequencing objects. (never defined)⟩
⟨Off private methods for exported objects. 10⟩
⟨Off processor. 52⟩
⟨Off processor for users. 54b⟩
⟨Off processor identifier. 51c⟩
⟨Off processor table. 51b⟩
⟨Off processor table for users. 54a⟩
⟨Off protected members for sequencing objects. (never defined)⟩
⟨Off protected methods for reference counting objects. 15c⟩
⟨Off protected methods for reference counting objects. (never defined)⟩
⟨Off protected methods for sequencing objects. 17c⟩
⟨Off public methods for lockable objects. 16b⟩
⟨Off public methods for protected objects. 14c⟩
⟨Off public methods for reference counting objects. 15b⟩
⟨Off public methods for reference counting objects. (never defined)⟩
⟨Off resource. 18a⟩
⟨Off resource for users. 19a⟩
⟨Off resource inspector. 26a⟩
⟨Off resource navigator. 25⟩
⟨Off scheduling events. 53a⟩
⟨Off shuttle. 59⟩
⟨Off shuttle for users. 61a⟩
⟨Off Shuttle interrupt priority level server. 68⟩
⟨Off shuttle property identifer. 61b⟩
⟨Off shuttle property server. 62⟩
⟨Off shuttle property set. 64a⟩
⟨Off shuttle property sets server. 65c⟩
⟨Off shuttle server. 55⟩
⟨Off shuttle server for users. 60b⟩
⟨Off Shuttle trap table server. 67⟩
⟨Off system resource. 33b⟩
⟨Off system resource for users. 34b⟩
⟨Off system server. 33a⟩
⟨Off system server for users. 34a⟩
⟨Off Trap. 46c⟩
⟨Off Trap for users. 48c⟩
⟨Off trap table. 45a⟩
⟨Off Virtual traps. 48a⟩
⟨Other off_Node private members. (never defined)⟩
⟨Other off_Node protected methods. 37a⟩
⟨Other off_Node public methods. 38a⟩
⟨Other off_Prtl private members. 71⟩
⟨Other off_Shtl private members. 64c⟩
⟨Other private members of off_CompResource. (never defined)⟩
⟨Other private members of off_Processor. 53b⟩
⟨Other private members of off_Resource. 18b⟩
⟨Other public methods of off_Irq. 47b⟩
⟨Other public methods of off_Processor. 53c⟩
⟨Other public methods of off_Resource. 24a⟩
⟨Other public methods of off_Shtl. 60a⟩
⟨Other public methods of off_ShtlSrv. 63⟩
⟨Other public methods of off_uResource. 15a⟩
⟨Other public methods of off_uShtl. 66a⟩
⟨Other public methods of off_uShtlSrv. 66b⟩

Appendix B

Index of Identifiers

A_Class: 7a, 7b
dma_mode_t: 50a, 50b, 51a
main: 39
off_AbsCompResource: 33a, 55, 69b, 73
off_AbsResUnit: 33b, 59
off_AccessChecker: 14a
off_Allocator: 20b, 22a, 27b
OFF_ATTR_CLASS: 27a
OFF_ATTR_DOM: 27a
OFF_ATTR_ID: 27a
off_attr_kind_t: 26a, 26b
OFF_ATTR_NAME: 27a
OFF_ATTR_NULL: 27a
OFF_ATTR_OFFSET: 27a
OFF_ATTR_URL: 27a
off_BKAllocator: 28, 29a, 29b
off_BlockAllocator: 29b
OFF_BOOL_ATTR: 26b
off_Bureaucrat: 17a, 71
off_CompResource: 20b, 31, 33a, 65c, 67
off_DMA: 37b, 38b, 49b
off_dma_id_t: 49b, 49c, 50c
off_DMALine: 49b, 50a
off_DMM: 37b, 38b, 73
off_DTLB: 73, 74b
off_eu_id_t: 20a, 25, 31, 32a, 32b, 41a, 42b, 42d, 44c, 49c, 51c
off_Event: 44b, 45c, 46c, 47a
off_EventHandler: 45c, 46a, 52
off_event_t: 46a, 46b
off_EventTbl: 44b, 45a, 45b
off_ev_id_t: 44b, 44c, 45a, 45b, 54b, 69a
OFF_EX_FWD: 48a
off_Exhausted: 30
OFF_EX_MISSING: 48a
OFF_EX_RELOC: 48a
OFF_EX_SCERR: 48a
off_ex_t: 48a
OFF_EX_UNAVAILABLE: 48a
OFF_EX_UNUSED: 48a
OFF_EX_XDT: 48a
off_FixedAllocator: 29a
off_FnEventHandler: 46a
off_HWCompResource: 31, 40, 42c, 44b, 49b, 51b
off_HWResUnit: 32a, 41b, 43a, 45c, 50a, 52
OFF_ID_ATTR: 26b
off_id_t: 20a, 20b, 22a, 25, 26a, 33a, 33b, 34b, 36, 59, 62, 64a, 65a, 66a
off_Inspector: 24a, 24b, 26a
OFF_INT_ATTR: 26b
off_IntTbl: 45b, 53b, 53c
off_IOBank: 37b, 38b, 42c
off_io_id_t: 42c, 42d, 43b
off_IOPort: 42c, 43a
off_Irq: 45b, 47a
off_MBank: 37a, 40
off_mode_t: 13, 14a, 14c, 15a, 50a, 70, 72b, 74b, 75b
off_Navigator: 24a, 24b, 25
off_Node: 36
OFF_NOPS: 12
OFF_OP_D: 12
OFF_OP_P: 12
OFF_OP_R: 12
off_op_t: 12, 13
OFF_OP_W: 12
OFF_OP_X: 12
off_PFrame: 40, 41b
off_pg_id_t;: 41a, 42b
off_pg_id_t: 40, 41a, 41d, 42b, 50a, 51a, 74b, 75b
off_pl_t: 47b, 47c, 49a
off_Processor: 51b, 52
off_proc_id_t: 51b, 51c, 54a
off_ProcTbl: 37b, 38b, 51b
off_prop_t: 61b, 63, 64a, 65a, 66a, 66b
off_Prtl: 69b, 70
off_PrtlEventHandler: 46a
off_PrtlSrv: 37b, 38b, 69b
off_reason_t: 46b
off_Resource: 18a, 19a, 20b, 21, 25, 36
off_ResUnit: 21, 32a, 33b
off_sevent_t: 52, 53a, 54b
off_Shtl: 52, 55, 59, 62, 65b, 67, 68, 70, 74a
off_ShtlIntPrtySrv: 68
off_ShtlPropSrv: 62, 63, 74a
off_ShtlPSet: 64a, 64b, 64c, 65c, 70, 72b
off_ShtlPSetSrv: 65c
off_ShtlSrv: 37b, 38b, 55
off_ShtlTrapTblSrv: 67
OFF_STR_ATTR: 26b
off_Trap: 45a, 46c
off_TrapTbl: 45a, 53b, 53c, 67
off_uAbsCompResource: 34a, 60b, 72a, 75a
off_uAbsResUnit: 34b, 61a, 75b
off_uCompResource: 22a, 32c, 34a
off_uDMA: 50c
off_uDMALine: 50c, 51a
off_uDMM: 75a
off_uDTLB: 75a, 75b
off_uEvent: 48b, 48c, 49a
off_uHWCompResource: 32c, 41d, 43b, 50c, 54a
off_uHWResUnit: 32b, 42a, 44a, 48b, 51a, 54b
off_uIOBank: 43b
off_uIOPort: 43b, 44a
off_uIrq: 49a, 54b
off_uMBank: 41d
off_uNode: 38b
off_uPFrame: 41d, 42a
off_uProcessor: 54a, 54b
off_uProcTbl: 54a
off_uPrtl: 72a, 72b
off_uPrtlSrv: 72a
off_uResource: 19a, 22a, 22b, 38b
off_uResUnit: 22b, 32b, 34b
off_uShtl: 60b, 61a
off_uShtlSrv: 60b
off_uTrap: 48c, 54b, 69a
pg_bits_t;: 41c

Bibliography

[1] Francisco J. Ballesteros. Off - Un Nuevo Enfoque en la Construccion de Sistemas Operativos Distribuidos. PhD thesis, Facultad de Informatica, Universidad Politecnica de Madrid, 1998. (being submitted, defense pending).
[2] Francisco J. Ballesteros and Luis L. Fernández. Off web site. http://www.gsyc.inf.uc3m.es/off, 1996.
[3] Francisco J. Ballesteros and Luis L. Fernández. The Network Hardware is the Operating System. In Proceedings of the 6th Hot Topics on Operating Systems (HotOS-VI), Cape Cod, MA (USA), May 1997.
[4] B.N. Bershad, S. Savage, P. Pardyak, E.G. Sirer, M. Fiuczynski, D. Becker, S. Eggers, and C. Chambers. Extensibility, safety and performance in the SPIN operating system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles. ACM, December 1995.
[5] Roy H. Campbell, Francisco J. Ballesteros, Fabio Kon, Ashish Singhai, Dulcineia Carvalho, and Robert Moore. 2k: A distributed adaptable operating system. http://choices.cs.uiuc.edu/2k, August 1997.
[6] John B. Carter, Dilip Khandekar, and Linus Kamb. Distributed shared memory: Where we are and where we should be headed. In Proceedings of the 5th Workshop on Hot Topics in Operating Systems, 1995.
[7] D. Cheriton and K. Duda. A caching model of operating system kernel functionality. In Proceedings of the First Symposium on Operating Systems Design and Implementation, pages 179–193, November 1994.
[8] D. Engler, M. F. Kaashoek, and J. O’Toole. The Operating System Kernel as a Secure Programmable Machine. In Proc. of the 6th SIGOPS European Workshop, pages 62–67, Wadern, Germany, Sept 1994. ACM SIGOPS.
[9] Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin, and Olin Shivers. The flux os toolkit: A substrate for kernel and language research. In Proceedings of the 16th SOSP, Saint-Malo, France, October 1997. ACM.
[10] Bryan Ford, Mike Hibler, Jay Lepreau, Patrick Tullmann, Godmar Back, and Stephen Clawson. Microkernels Meet Recursive Virtual Machines. In Proc. OSDI, October 1996.
[11] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[12] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Francisco, California, second edition, 1996.
[13] Takuro Kitayama, T. Nakajima, and Hideyuki Tokuda. RT-IPC: An IPC Extension for Real-Time Mach. In Proceedings of the 2nd Microkernel and Other Kernel Architectures. USENIX, 1993.
[14] Donald E. Knuth. Literate Programming. Center for the Study of Language and Information, Stanford University, 1992.
[15] Leslie Lamport. A Document Preparation System: LaTeX. Addison-Wesley, ISBN 0-201-15790-X.
[16] R. Pike, D. Presotto, K. Thompson, and H. Trickey. Plan 9 from Bell Labs. In UKUUG Proceedings of the Summer 1990 Conference, London (England), July 1990.
[17] Norman Ramsey. Literate programming simplified. IEEE Software, 11(5):97–105, September 1994.
[18] Bjarne Stroustrup. The C++ Programming Language. Addison Wesley, 1986.
[19] Andrew S. Tanenbaum. Operating Systems: Design and Implementation. Prentice-Hall, 1987.
[20] H. Tokuda, T. Nakajima, and P. Rao. Real-Time Mach: Towards a Predictable Real-Time System. In Proceedings of the 1st USENIX Mach Workshop. USENIX, Oct 1990.
