Chapter 2

A Reflective Architecture for Real-Time Operating Systems

John A. Stankovic and Krithi Ramamritham
This chapter describes how the notion of a reflective architecture can serve as a central principle for building complex and flexible real-time systems, contributing to making them more dependable. A reflective system is one that reasons about and reflects upon its own current state and that of the environment to determine the right course of action. By identifying reflective information, exposing it to application code, and retaining it at run time, a system is capable of providing predictable performance with respect to timing constraints, of reacting flexibly to changing dynamics within the system and in the environment, of being more robust to violations of initial assumptions and to faults, of evolving more gracefully over time, and of supporting better monitoring, debugging, and understanding of the system. Advantages of this approach are given, and details of a specific implementation of a reflective architecture in a real-time operating system are provided. Several open questions are also presented.
2.1 Introduction

In the field of complex real-time systems, it is commonly understood that we want integrated, system-wide solutions in which design, implementation, testing, monitoring, dependability, and validation (both functional and timing) are all addressed. However, building real-time systems for critical applications and showing that they meet functional, fault tolerance, and timing requirements are complex tasks. At the heart of this complexity lie several opposing factors:

- the desire for predictability versus the need for flexibility to handle nondeterministic environments, failures, and system evolution,
- the need for abstraction to handle complexity versus the need to include implementation details in order to assess timing properties,
- the need for efficient performance and low cost versus understandability, and
- the need for autonomy (to deal better with faults and scaling) versus the need for cooperation (to achieve application semantics).

This chapter argues that one good approach for addressing such opposing requirements is a reflective system architecture that exposes the correct meta-level information. Exactly what this information should be and how it should be supported will be the subject of research for many years. However, in this chapter we provide details of our current view of this information and its implementation structures within a real-time operating system. Once a system has this information, it can be dynamically altered as system conditions change (increasing the coverage of the system dynamically), as the system evolves over time, or as it is ported to another platform. In particular, the system is built with the idea that it cannot know the environment completely, but must nevertheless be dependable.

The hypothesis is that with a reflective architecture the necessary flexibility for complex real-time systems can be attained. Further, we argue that this powerful type of flexibility contributes to many other required properties, including understandability, analyzability (and therefore predictability), high performance, monitoring, debugging, and dependability. While reflection contributes to addressing all these issues, in the interest of space we focus primarily on showing how reflection supports the first two points listed above. We note that much of what is discussed in this chapter has been implemented in the Spring kernel [22], and is now being used on a robotic automated assembly testbed in non-deterministic environments.

In Section 2.2 we briefly discuss the notion of reflection and present the state of the art.
In Section 2.3 we outline a methodology for real-time systems in which reflection is a central principle. In Section 2.4 we identify the ingredients of such a reflective architecture and provide details of its implementation in a real-time operating system. In Section 2.5 we summarize our contributions and the status of the implementation, and identify required future work.
2.2 State of the Art — Reflection

Reflection, as a concept, is not new, and almost all systems have some reflective information in them. However, identifying reflection as a key architectural principle and exploiting it in real-time systems is very new. In simple terms, reflection can be defined as the computational process of reasoning about and acting upon the system itself. Normally, a computation solves an application problem such as sorting a file or filtering important signals out of radar returns. If, in addition, there is a system computation dealing with the dynamic setting of the parameters and policies of the sorting process or the filtering process itself, then we have a meta-level structure and possibilities for reflection. Consequently, when using a reflective architecture, we can consider a system to have a computational part and a reflective part. The
reflective part is a self-representation of the system, exposing the system state and the semantics of the application. For example, for the filtering process this might include choosing the most appropriate filter, setting parameters for the process, checking ranges of input data, executing it on the most suitable processor, and so on.

To best support flexibility over the long lifetimes of complex real-time systems, we believe that it is absolutely essential to have general architectural structures capable of holding the current state and semantics of both the application and the system software, primitives for changing that information, and general (modifiable) on-line methods for using that information. In other words, such reflective information can be monitored, used in decision making during system operation, and easily changed, thereby supporting many flexibility features. Much research is still required to identify exactly what reflective information is important, how to represent it, and to determine the performance implications of using reflection, including its impact on the computational part of the system.

Reflection has appeared in AI systems, e.g., within the context of rule-based and logic-based languages [12, 21]. More recently it is being used in object-oriented languages and object-oriented databases in order to increase their flexibility [6, 13]. Combining object-oriented programming with reflection seems very important, since object-oriented programming supports abstraction and good design, and adding reflection supplies flexibility. Reflection has also been touted as valuable to distributed systems for supporting transparency and flexibility [24], but details and open research problems have not been worked out. Reflection has also been used to some extent in our past work on real-time systems [23]. This chapter discusses a greatly expanded view of the impact of reflection on real-time systems.
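To make the distinction concrete, here is a minimal sketch in Python (not Spring code; all class and attribute names are illustrative) of a filtering computation paired with a meta-level that inspects the input and resets the filter's parameters:

```python
class Filter:
    """The computational part: a simple moving-average filter."""
    def __init__(self, window):
        self.window = window  # state deliberately exposed to the meta-level

    def apply(self, samples):
        out = []
        for i in range(len(samples)):
            chunk = samples[max(0, i - self.window + 1):i + 1]
            out.append(sum(chunk) / len(chunk))
        return out

class MetaLevel:
    """The reflective part: reasons about the filter and its input,
    then adjusts the filter's parameters before the computation runs."""
    def configure(self, filt, samples):
        spread = max(samples) - min(samples)
        filt.window = 5 if spread > 10 else 2  # noisy input -> wider smoothing
        return filt

filt = MetaLevel().configure(Filter(window=2), [0, 20, 1, 19, 2])
print(filt.window)  # -> 5: the meta-level widened the window for the noisy input
```

The essential point is that the meta-level operates on a representation of the computation (here, the exposed `window` attribute), not on the signal itself.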
One problem in discussing reflection is that the term is used somewhat differently in three different areas: AI, object-oriented programming, and object-oriented databases. The differences emerge from two issues: first, it is a matter of degree as to what is considered reflective and what is considered computational, which is not always easy to identify; and second, some definitions require reflection to be able to generate code at run time, similar to the way LISP has a duality between code and data. In this chapter we explicitly identify what information and processing we consider reflective for complex, dependable, real-time, distributed computing systems.
2.3 Methodology

While most of this chapter deals with reflection, here we briefly present an overview of a design methodology for real-time systems. We do this, in part, to emphasize that reflection by itself is not sufficient to address the problems raised in the Introduction, and, in part, to place what it does provide in perspective. Reflection can be used independently of the overall design paradigm, e.g., it can be used with an object-oriented design and implementation or with a more classical procedure-oriented design and implementation. Because of the obvious advantages of an object-oriented design at the functional level, we briefly discuss a vision of how a real-time system could be developed under this approach. This would contribute to dependability. However, significant performance questions still exist for object-oriented real-time systems. In particular, open questions remain in the areas of the interaction of concurrency and inheritance, the overheads for supporting objects, and the predictability of these run-time structures, which has not been established nor even seriously considered. For these reasons, in the actual prototype reflective system that we have developed [14, 15, 22], we currently employ a procedure-based design and programming paradigm which is extended to integrate with reflection.
A vision for the design of complex real-time systems might begin with the use of a concurrent object-oriented approach, subject to the integrating theme that the system is reflective. Concurrent programming support is required to handle the highly
asynchronous, event-driven nature of many real-time systems. The object-oriented aspects provide modularity, abstraction, information hiding, library modules with different properties, and so on. The hierarchies created within object classes can contain the application-level reflective information and proceed from more abstract levels down to specific implementations, where the implementations may have varying run-time costs and different fault semantics as a function of the actual compilers, operating systems, and hardware used. Information on the system configuration and implementation is supplied by a separate language such as our System Description Language (SDL) [14]. In this way the (programming-level) objects can encapsulate the high-level timing and fault tolerance requirements as well as the application-level reflective information, and the SDL captures implementation details and system-level reflective information. Various tools then combine information from these languages for analysis of predictability and for creating a reflective run-time system.

Reflection supports flexibility, but flexibility is a double-edged sword. Too much flexibility can destroy predictability. Consequently, control must be exercised over the use of flexibility, both via constraints at design time and via the use of scheduling paradigms which provide for predictability. We believe that the object-oriented paradigm may provide the structure necessary to constrain builders of critical applications, even when we allow a high degree of flexibility (supported by reflection). And dynamic real-time scheduling can produce a run-time, system-wide schedule for executing the objects to meet timing and fault requirements. Since the reflective paradigm should be used across all levels of the system, what information is reflective is application dependent.
However, most complex real-time systems require features found in operating systems, so we can apply this architecture to the operating system in a more generic manner than for higher-level application code; we do this in the remainder of this chapter.
2.4 A Reflective Architecture

Can a flexible real-time system architecture be significantly different from a general timesharing system architecture? It can be, and it must be, to handle the added complexity of time constraints, flexible operation, and dependability. As systems get large, we can no longer handcraft solutions, but require architectures that support algorithmic analysis along the functional, time, and dependability dimensions. This analysis can and should be done for a dynamic system, as a completely static approach relies too much on a priori identification of fault and load hypotheses, which are invariably wrong in complex, critical applications.

One problem we have in real-time systems is that the time dimension reaches across all levels of abstraction. Reflection has an advantage here in that at design time, at programming time, and at run time, one can reach across those layers. For example, if the programmer knows that the system retains task importance, worst-case execution time as a function of system state, and fault semantics for use at run time, then that user may program policies that make adaptive use of this information, including exact execution-time costs. This allows for more efficient use of resources, all the information can be dynamically altered, and new policies can be added more easily than if one policy had been chosen a priori and mapped to a priority, where all information on how that priority was derived is lost!

Building a real-time system based on a reflective architecture means that first we must identify reflective information regarding the system. This information includes:

- the importance of a task or group of tasks, and how tasks' importance relates to each other and to system modes,
- time requirements (deadlines, periods, jitter, etc.),
- time profiles, such as worst-case execution times or formulas depicting the execution time of a module,
- resource needs,
- precedence constraints,
- communication requirements,
- objectives or goals of the system,
- consistency and integrity constraints,
- policies to guide system-wide scheduling,
- fault tolerance requirements and policies to guide adaptive fault tolerance,
- policies to guide tradeoff analyses, and
- performance monitoring information.

Implementation structures in the operating system then retain this information, and primitives allow it to be dynamically changed.
We have implemented this in the Spring kernel [22] by defining process control blocks where much of the above information is kept, along with other data structures that keep more system-wide information, such as properties of groups of processes.
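As a sketch of what such a process control block might retain, the following Python structure holds the categories of information listed above; the field names and the `update` primitive are illustrative, not the actual Spring kernel layout:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ReflectiveTaskInfo:
    """Reflective information retained per task (illustrative field names)."""
    name: str
    importance: int                   # semantic value of completing the task
    deadline: float                   # deadline in seconds
    period: Optional[float] = None    # None for aperiodic tasks
    wcet: float = 0.0                 # worst-case execution time
    resources: Tuple[str, ...] = ()   # resources needed besides the CPU
    precedes: Tuple[str, ...] = ()    # tasks that must run after this one
    fault_policy: str = "none"        # e.g. "primary-backup", "n-copies-voter"

    def update(self, **changes):
        """Primitive for dynamically altering reflective information at run time."""
        for key, val in changes.items():
            setattr(self, key, val)

t = ReflectiveTaskInfo("sensor_read", importance=8, deadline=0.05,
                       period=0.01, wcet=0.002, resources=("adc",))
t.update(importance=9, fault_policy="primary-backup")  # altered as conditions change
```

Because the information survives into the run-time structure, on-line policies can read it (e.g., a scheduler consulting `wcet` and `deadline`) and modify it as the system or environment changes.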
2.4.1 Specification of Reflective Information
Given that we choose to use a reflective architecture, we must be able to specify the reflective information. It is obvious that in real-time systems we must be able to specify real-time and fault tolerance requirements. To keep system costs reasonable and address the needs of individual tasks, we would also like to be able to specify adaptive fault tolerance policies. Further, we want to be able to analyze the system to meet predictability requirements and to allow flexibility at run time. To do this, what is required is the ability to separate what can be specified at a higher, more abstract level from implementation details. However, in a real-time system the implementation details are critical to the analysis and cannot be overlooked or relegated to unimportant details. As a result, we are developing three related and integrated languages that are then used by various tools for design and analysis. The three complementary languages are as follows.

First, we need a high-level programming language that is efficient in producing source code and capable of specifying reflective information such as timing requirements (deadlines, periods, etc.) and the need for guarantees with respect to these requirements. In the spirit of research, we have developed two versions of such languages, Spring-C [15] and Real-Time Concurrent C (RTCC) [5]. Spring-C is currently implemented and used with our current overall reflective system implementation. However, it is very simple. In parallel, together with colleagues at AT&T Bell Labs, we have developed a more sophisticated real-time language with the necessary features for timing specification and guarantee semantics, but it has not been integrated with the Spring kernel.

Second, we require a system description language that provides enough details about the implementation so that careful, detailed, and accurate timing analysis can be performed.
We have already designed and implemented this language (mentioned above), called the System Description Language (SDL) [14].

Third, we require a notation for specifying fault tolerance requirements on a task-by-task basis, where it is possible to describe the adaptive management of redundancy. In this area, we have a preliminary design of such a language, called Fault Tolerant Entities for Real-Time (FERT) [2]. At the present time we want to keep the timing specification in the high-level language and the fault tolerance specification of redundancy strategies in another language, for reasons of separation of concerns and abstraction. An open research question is whether this separation is suitable or whether we should merge the specifications of time and fault tolerance into one language.

In SDL we can specify details regarding the network, each node in the network, memory layout, bus characteristics, device characteristics and locations, kernel features, and other aspects related to performing accurate timing analysis. As we add timing requirements, extensions will be required, and descriptions of primitives dealing with support for fault tolerance are also necessary. We would also like to add greater support for describing the environment which is being controlled by the real-time system. This feature is severely lacking in state-of-the-art tools and languages today.

FERT allows a designer to treat each FERT object as a fault-tolerant entity
with protection boundaries (fault containment units), initially without worrying about timing and redundancy issues. The designer of the FERT itself then specifies a set of application modules as part of a single entity. The application modules represent user code for redundant operations, or voters, or more general adjudication code. Each of these application modules would be written in Spring-C or RTCC with associated timing requirements. In addition to application modules, the FERT designer also supplies one or more adaptive control policies using four basic primitives which explicitly interact with the scheduling and analysis algorithms, both off-line analysis and on-line algorithms that provide dynamic guarantees. Finally, FERTs carefully control input and output via PORTS. Timing, task value information, and so on, can be input via the PORTS and used by the adaptive control strategies. We consider more details of FERTs in Sections 2.4.3 and 2.4.4.

Therefore, when programming with real-time languages, in our case Spring-C (or eventually RTCC), the System Description Language (SDL) [14], and FERT [2], we identify and provide reflective information and write code that modifies it. Tools can use this information for analysis and design, knowing that such information is also available at run time. We now provide more specific examples of the reflective architecture for scheduling, fault tolerance, and their interaction.
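The chapter describes what SDL records but not its syntax; the following Python structure is only an illustration of that kind of information (all hardware values are hypothetical), together with a toy example of how a tool might derive a timing bound from it:

```python
# Illustrative system description: the kinds of facts SDL captures
# (network, nodes, memory, devices, kernel features), not SDL itself.
system_description = {
    "network": {"type": "point-to-point", "bandwidth_mbps": 10},
    "nodes": {
        "node1": {
            "cpu": {"clock_mhz": 25},
            "memory": {"ram_kb": 4096, "layout": "shared-bus"},
            "devices": [{"name": "adc0", "bus": "vme", "access_time_us": 4}],
            "kernel": {"tick_us": 1000, "planning_scheduler": True},
        },
    },
}

def worst_case_device_read(desc, node, dev):
    """Toy timing analysis: a tool derives a bound from the description."""
    d = next(x for x in desc["nodes"][node]["devices"] if x["name"] == dev)
    return d["access_time_us"]

print(worst_case_device_read(system_description, "node1", "adc0"))  # -> 4
```

The point is that implementation details live in a machine-readable description, so analysis tools and the run-time system consult the same facts rather than assumptions buried in application code.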
2.4.2 Real-Time Scheduling
Most real-time kernels provide a fixed-priority scheduling mechanism. This works when tasks' priorities are fixed. In general, however, this mechanism is inadequate because many systems require dynamic priorities, and mapping a dynamic scheme onto a fixed-priority mechanism can be very inefficient; significant information can be lost in the process. In other words, the run-time system has no information as to how the fixed priority was calculated, e.g., it might have been some weighted formula that combined importance, deadline, and resource needs.

Fixed-priority scheduling is also incomplete because it deals only with the CPU resource. Since we are interested in when a task completes, we must consider all the resources that a task requires. An integrated view of resource management should be part of the reflective architecture interface, including the ability to specifically identify the needed resources and to reserve them for the (future) time when they will be needed. So, reservations of sets of resources should be part of the architecture.

Fixed-priority scheduling also has missing functionality because it substitutes a single priority number for what may be a set of issues, such as the semantic value of completing the task, the timing constraint of the task, and the fault properties of the task. Further, a fixed priority ignores the fact that semantic information is often dynamic, i.e., a function of the state of the system. When using priority-based scheduling, the analysis either assumes a static system and shows that logically the system works (but assumption coverages are often unknown, and if something unexpected occurs, then the system is uncategorized), or assumes a completely dynamic approach with average-case performance; this is unacceptable for critical applications.

In order to deal with predictability versus flexibility, there is a need for multi-level, multi-dimensional scheduling algorithms that explicitly categorize the performance, including when there is system degradation or unexpected events. The algorithms are multi-level in the sense that we categorize tasks into critical (missing the deadline causes loss of life or total system failure), essential (these tasks have hard deadlines and no value is accrued if the deadline is missed, but the tasks are not critical), soft real-time (task values drop after the deadline, but not to zero or some negative number), and non-real-time (these tasks have no deadlines). Each category has its own performance metric. Critical tasks must be shown to meet their deadlines and fault requirements in the worst case, based on the most intelligent assessment of environmental conditions that we can make at deployment time; but in addition we must be able to dynamically borrow processing power and resources from the other classes of work so as to understand how much additional load and how many faults can be handled. This additional work can be categorized in the worst case, e.g., where all tasks run to their worst-case times, and in the average case, where tasks execute to average execution times and it is likely that even greater unexpected loads can be handled. Of course, no one wants to run the system under these difficult loads, but if such loads arise, the approach allows for a categorization of what happens under excess load and of the likelihood of being able to continue. This is in contrast to some static solutions, where any excess critical load is guaranteed to cause failure! It is also true that in many static designs all tasks are treated as critical tasks, thereby increasing the cost of the system or even making it infeasible due to the combinatorial explosion of schedules that have to be accounted for in large, complex, dynamic environments. We have developed and analyzed an algorithm that handles these classes of tasks and borrows time from non-critical tasks in a categorized manner [4].
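The four task classes can be summarized by the value a task contributes as a function of its completion time. This Python sketch uses illustrative value curves; the chapter defines the classes only qualitatively:

```python
def value(task_class, base_value, completion, deadline):
    """Value accrued by completing at time `completion` (illustrative curves)."""
    if task_class == "critical":
        # Missing the deadline means system failure; must be guaranteed off-line.
        return base_value if completion <= deadline else float("-inf")
    if task_class == "essential":
        # Hard deadline: no value after the deadline, but not catastrophic.
        return base_value if completion <= deadline else 0.0
    if task_class == "soft":
        # Value decays after the deadline rather than dropping to zero at once.
        if completion <= deadline:
            return base_value
        return max(0.0, base_value - (completion - deadline))
    if task_class == "non-real-time":
        return base_value  # no deadline at all
    raise ValueError(task_class)

print(value("soft", 10, completion=12, deadline=10))  # -> 8.0
```

A multi-level scheduler can use such per-class metrics directly, rather than collapsing them into a single priority number.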
This borrowing algorithm permits a degree of safety, because critical tasks are guaranteed; robustness, because unexpected events (of a certain class, i.e., excess load) are handled in a categorized manner; and availability, because the system gracefully degrades rather than failing catastrophically. Details of the algorithm and its analysis are beyond the scope of this chapter, but can be found in [4]. It is important to note that while this algorithm has these nice attributes, it assumes a very simple task model. It is necessary to derive new algorithms with the same attributes, but with more sophisticated task characteristics.

Essential tasks, soft real-time tasks, and non-real-time tasks each have a probabilistic performance metric, and these can be a function of load. For example, in the absence of failures and overloads it may be that 100% of essential tasks also make their deadlines, but this degrades as unexpected loads occur. Interesting approaches have been developed, including [19], where dynamic arrivals are accounted for when doing static allocation of critical periodic tasks. Other possibilities include bounding the performance, as discussed in [7, 10].

The algorithms are multi-dimensional [17] in that they must consider all the resources needed by a task, not just the CPU, and they must consider precedence constraints and communication requirements, not just independent tasks. Providing algorithm support for this level of scheduling improves productivity and reduces errors compared to having a very primitive priority mechanism and requiring the designer to map tasks to priorities while accounting for worst-case blocking times over
resources and interrupts. While the details of this type of algorithm are beyond the scope of this chapter, what is important about it is that it dynamically uses reflective information about the tasks' requirements (importance, deadline, precedence constraints, resource requirements, fault semantics, etc.). See [17, 19, 20] for more details.

We also use scheduling in planning mode as opposed to myopic scheduling.¹ When tasks arrive, the planning-based scheduling algorithm uses the reflective information about the active tasks, creates a full schedule for them, and predicts whether one or more deadlines will be missed. If deadlines would be missed, error handling can occur before the deadline is missed, often simplifying error recovery. Further, because reflective information is available, the decision as to how to handle the predicted timing failure can be made more intelligently and as a function of the current state of the system, as opposed to some a priori chosen policy. Planning-based scheduling, with its inherent advantages, is implemented in the Spring kernel [17, 22], and a scheduling co-processor has been implemented to reduce its run-time overhead [3].
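A minimal sketch of the planning idea, for a single resource and deadline-ordered tasks (the actual Spring algorithm also handles resource reservations and precedence constraints):

```python
def plan(tasks, now=0.0):
    """Build a full tentative schedule and predict missed deadlines in advance.

    tasks: list of (name, wcet, deadline) tuples.
    Returns (schedule, predicted_misses), where schedule is a list of
    (name, start, finish) and predicted_misses names tasks whose deadlines
    the plan shows will be missed -- known before they actually pass.
    """
    schedule, misses, t = [], [], now
    for name, wcet, deadline in sorted(tasks, key=lambda x: x[2]):
        start, t = t, t + wcet
        schedule.append((name, start, t))
        if t > deadline:
            misses.append(name)  # error handling can begin now, not after the miss
    return schedule, misses

tasks = [("a", 2.0, 3.0), ("b", 2.0, 6.5), ("c", 3.0, 6.0)]
schedule, misses = plan(tasks)
print(misses)  # -> ['b']  (predicted before b's deadline actually passes)
```

Contrast this with a myopic scheduler, which would dispatch `a` and discover `b`'s problem only when its deadline had already passed.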
2.4.3 Fault Tolerance
Many real-time systems require adaptive fault tolerance [8] in order to operate in complex, highly variable environments, to keep costs low (rather than a brute-force static approach that replicates everything regardless of environmental conditions), and so that redundancy and control can be tailored to the individual application functions as required by those functions. The reflective architecture is a suitable structure for adaptive fault tolerance. We now briefly discuss some of the details of our current view of a reflective architecture for adaptive fault tolerance, based on a three-level framework.

At the highest level of this framework is the overall design process, which starts with the specification of the physical inputs and outputs to and from the external world, a first specification of the important functional blocks in the system and the flow of data and interactions among them, a first definition of timing requirements (periods, deadlines, and response times of event-response chains), an identification of which functions are critical, indicating that their execution must be guaranteed off-line, and a specification of mode changes. This top-level functional design ignores the issue of software redundancy.

To manage redundancy, we introduce an intermediate level of design decomposition inside the functional blocks and above the application code. Redundancy is added inside the individual functional blocks, using a general scheme called FERT (Fault-Tolerant Entity for Real Time) [2] which adheres to the reflective architecture approach. The FERT level extends previous fault tolerance work [1, 11, 16] in two main ways: by including in the notation features that explicitly address real-time constraints, and by a flexible and adaptable control strategy for managing redundancy. The third level of the framework is the coding of the application modules themselves. The three levels must be consistent with each other.

¹Myopic scheduling refers to those algorithms which only choose what the next task to execute should be. At run time these algorithms have no concept of the total load, or of whether any or all of the tasks are likely to miss their deadlines.

While it is the highest level of the framework that creates the intermodule (inter-FERT) system structure, we now concentrate on the FERT level itself, because this is where many reflective aspects of adaptive fault tolerance reside. The reflective aspects of FERT can be divided into two parts: the reflective information itself and the overall structure. Information regarding timing constraints, the importance of tasks (specified as values and penalties), levels and types of redundancy, and adaptive control of redundancy are all part of the reflective information. All this information is visible at design, implementation, and run time, allowing it to be dynamically updated and used by on-line policies. At design time, the generic design notation can specify information such as whether n copies and a voter, or primary backups, or imprecise computations are required for each individual FERT, and can notify the scheduling mechanisms (both off-line and on-line algorithms) of the relative importance of tasks, their timing requirements, and their worst-case and average-case use of resources. This specification is part of the reflective architecture which links the design to both off-line analysis and on-line scheduling.

Additionally, a FERT has its own structure, which is exposed to all levels. The structure consists of (1) ports, through which all inputs and outputs pass, (2) application modules, which implement the functionality and any voting or adjudication of the FERT, if needed, and (3) a control module, which specifies how application modules interact with each other and with the run-time system. The control part is meant to specify adaptive strategies that take into account available resources, deadlines, importance, and observed faults.
The control part uses four generic primitives which we believe should be visible in a reflective architecture supporting dependable computing:

- possible: asks whether a collection of tasks can be feasibly scheduled; multiple possible requests can be issued in parallel; this allows dynamic analyzability with respect to meeting timing constraints;
- exec: identifies a list of tasks that must be executed subject to various constraints; typically used by the control module when finally deciding what should execute, given the results of the possible queries;
- unused: identifies resources which were planned to be used but are no longer required, e.g., because the initial version of the task completed successfully and the backups are no longer needed; this improves the performance of the system;
- output: finally commits the produced results.

Even though we have developed a notation for adaptive fault tolerance under time constraints, many open research questions remain, including:

- how allocation decisions for the redundant copies are made and specified,
- how system-wide considerations will be included, rather than the entity-by-entity redundancy and timing design now supported,
- determining what information must be retained at run time (what should be the interface to the run-time kernel),
- the definition and support of replica determinism,
- the recovery semantics that should exist within the FERT itself, and
- controlling the flexibility provided by FERT: this flexibility is very powerful, but if not controlled, analysis will be impossible; therefore, we need to develop controls on the specification of flexible operations that allow high adaptivity and reliability, yet remain predictable and analyzable.

In summary, the reflective architecture for supporting adaptive fault tolerance permits the designer to specify time, importance, redundancy, and control information, knowing that the run-time structure will retain this information so that the implementer of adaptive control policies can use and modify it. In particular, the interaction with the off-line and on-line schedulers permits more flexibility while preserving timing-related predictability. Since the intent of this section is to focus on reflective architectures for fault tolerance, we are brief and do not discuss many other issues related to fault tolerance. Interested readers should see [2].
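The following Python sketch shows how a control policy might use the four primitives above; the scheduler is a crude stub standing in for the kernel's on-line guarantee algorithm, and all names and signatures are illustrative rather than the actual FERT notation:

```python
class SchedulerStub:
    """Stand-in for the kernel's on-line guarantee algorithm (illustrative)."""
    def __init__(self, capacity):
        self.capacity = capacity  # remaining schedulable time budget

    def possible(self, tasks):
        """Can this collection of (name, wcet) tasks be feasibly scheduled?"""
        return sum(wcet for _, wcet in tasks) <= self.capacity

    def exec_(self, tasks):      # `exec`: commit to running these tasks
        self.capacity -= sum(wcet for _, wcet in tasks)

    def unused(self, tasks):     # `unused`: release resources no longer needed
        self.capacity += sum(wcet for _, wcet in tasks)

def control_policy(sched):
    """Prefer triple modular redundancy; degrade to a single copy if infeasible."""
    tmr = [("copy1", 3), ("copy2", 3), ("copy3", 3), ("voter", 1)]
    single = [("copy1", 3)]
    chosen = tmr if sched.possible(tmr) else single  # a `possible` query
    sched.exec_(chosen)
    if chosen is tmr:
        sched.unused([("copy3", 3)])  # e.g. two copies already agreed
    return [name for name, _ in chosen]  # results then committed via `output`

print(control_policy(SchedulerStub(capacity=5)))  # -> ['copy1']
```

The key property is that the degradation decision is made at run time from current resource availability, not frozen into the design.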
2.4.4 Integrating Scheduling and Fault Tolerance
Scheduling must explicitly address real-time constraints for guaranteeing deadlines for redundant copies, voters, error handlers, and so on. It must also address different classes of tasks, such as critical tasks, hard-deadline but not critical tasks, soft-deadline tasks, and non-real-time tasks. Scheduling will also be part of the off-line analysis, e.g., to establish minimum levels of guarantee for critical tasks. It is necessary to integrate the off-line scheduling decisions with the dynamic operation of the system in a manner such that the off-line guarantees are not violated at run time, and such that the system maximizes its effectiveness beyond the minimum guaranteed part. Off-line support is necessary to create a priori guarantees that the minimum performance and reliability of the system are achieved. This may be accomplished in a number of ways. Here we briefly discuss one way to accomplish this task using a form of reservation of flexible time slots. The off-line algorithm works with the following assumptions:

- the Spring-C (or RTCC), SDL, and FERT languages describe the timing and fault behavior of modules on an individual basis, as well as other module requirements such as precedence constraints and general resource requirements;
- a system-level specification details the minimum level of guaranteed performance and reliability required;
- the workload requirements are specified; and
- some knowledge of the run-time algorithms and environment is utilized.
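To make these assumptions concrete, the sketch below writes down the kinds of inputs the off-line algorithm would consume as plain Python data. The field names, module names, and numbers are hypothetical inventions for illustration; this is not actual SDL or FERT syntax.

```python
# Illustrative inputs to the off-line guarantee algorithm, as plain data.
# All field names, module names, and values are hypothetical.

module_specs = {
    "read_sensor": {"wcet_ms": 2, "period_ms": 20,
                    "resources": ["bus"], "precedes": ["filter"]},
    "filter":      {"wcet_ms": 5, "period_ms": 20,
                    "resources": ["cpu"], "precedes": ["actuate"]},
    "actuate":     {"wcet_ms": 1, "period_ms": 20, "resources": ["bus"],
                    "fault_model": "fail_stop", "replicas": 2},
}

# System-level specification: the minimum guaranteed level the off-line
# algorithm must reserve for, plus which modules are critical.
system_spec = {
    "min_guaranteed_utilization": 0.40,
    "critical_modules": ["actuate"],
}

def critical_utilization(specs, critical):
    """CPU utilization that must be reserved off-line for the critical
    modules, counting every replica."""
    return sum(specs[m]["wcet_ms"] * specs[m].get("replicas", 1)
               / specs[m]["period_ms"] for m in critical)
```

The off-line algorithm would check figures such as this against the system-level specification before searching for a feasible allocation.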
The environment information includes the hardware resources available and a distributed, real-time, fault-tolerant system kernel that has (i) a global time base, (ii) run-time data structures that contain the flexibility and adaptability requirements of the application, (iii) predictable primitives, (iv) run-time scheduling support, and (v) a basic guarantee paradigm which uses on-line planning. The off-line guarantee algorithm takes as input all the information listed above and attempts to find a feasible allocation and schedule for the modules that are part of the minimum guaranteed requirements, while also accounting for other requirements in order to obtain good performance beyond the minimum. In particular, the interaction between the FERT specifications and the off-line algorithm is as follows. FERTs are typed as being critical or non-critical. All critical FERTs are guaranteed by the off-line scheduling algorithm. In many cases a critical FERT will have a single strategy defined, and therefore this is what must be guaranteed. If more than one strategy is defined for a critical FERT, then the minimum strategy is a priori guaranteed by the off-line algorithm, and at run time, if possible, a more comprehensive and valuable strategy may be dynamically guaranteed each time the FERT executes. Non-critical FERTs are dynamically guaranteed using the options specified in the FERT, but some overall time and resource availability may be guaranteed for all non-critical FERTs. Since this algorithm executes off-line and serves the critical tasks, significant compute time can and should be devoted to this problem. If the heuristic is having difficulty in producing feasible allocations and schedules, the designer can choose to add resources to the system or modify requirements and re-run the algorithm.
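The critical-FERT rule just described, where the minimum strategy is guaranteed off-line and a more valuable strategy may be upgraded to at run time when feasible, can be sketched as follows. The FERT name, strategy list, and scalar worst-case-execution-time cost model are hypothetical.

```python
# Sketch of the critical-FERT guarantee rule. Names and the single scalar
# cost per strategy are invented for illustration.

FERTS = {
    "engine_ctrl": {"critical": True,
                    "strategies": [  # ordered least to most comprehensive
                        {"name": "error_handler", "wcet": 2},
                        {"name": "primary_backup", "wcet": 6},
                        {"name": "triple_modular", "wcet": 9}]},
}

def offline_guarantee(ferts):
    """A priori guarantee: reserve the minimum (first) strategy of every
    critical FERT."""
    return {f: spec["strategies"][0] for f, spec in ferts.items()
            if spec["critical"]}

def runtime_upgrade(fert, spare_capacity):
    """Each time the FERT executes, pick the most comprehensive strategy
    that fits the spare capacity, never worse than the guaranteed minimum."""
    feasible = [s for s in FERTS[fert]["strategies"]
                if s["wcet"] <= spare_capacity]
    return feasible[-1] if feasible else FERTS[fert]["strategies"][0]
```

With spare capacity 7 the FERT is upgraded from the guaranteed error handler to a primary/backup strategy; with spare capacity 1 it falls back to the off-line guaranteed minimum.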
The output is a flexible timetable with earliest and latest scheduled start times and finish times for all the critical tasks and their minimum redundant copies and/or voters, together with the idle intervals. Various heuristics for static allocation and scheduling exist in the literature [9, 18, 19, 26]. We base the discussion of what is required in a new heuristic on a set of extensions to the heuristic found in [19]. That heuristic is able to schedule complex periodic tasks in distributed systems. It handles periodic time constraints, worst-case execution time, general resource needs, precedence constraints, communicating tasks, and replication requirements. The communicating tasks, when allocated across nodes of the distributed system, are scheduled in conjunction with a time-slotted subnet. The algorithm as it now exists has been evaluated by simulation. Further, extensions to the algorithm have been developed which attempt to balance load and spread out (in time) scheduled tasks to avoid clustered computation time, which could cause long latency. In other words, some results have been developed which account for the dynamic operation of the system when performing the static allocation and scheduling off-line. We need to enhance the current algorithm in the following ways: considering aperiodic tasks with minimum guarantees, enhancing the fault semantics to include those supported by FERTs, accounting for the dynamics in a more sophisticated manner, including creating a window for each statically guaranteed task composed of an earliest
start time, latest start time, and deadline, addressing tasks of different importance, and addressing mode changes.

The on-line algorithm runs in parallel with application processes. It uses the a priori generated flexible timetable (which accounts for all resources, not just the CPU) to insert newly invoked work (above the minimum reserved). If newly invoked work can be placed in the table, then the work is dynamically guaranteed; else various actions are taken based on the current policy. The planner also uses the on-line descriptions of the fault behaviors of the active FERTs, compiled from their control components. For example, suppose that a certain FERT is invoked, and the on-line scheduling algorithm (a planner) identifies the preferred strategy as requiring a primary and two backups to be scheduled prior to the deadline, all on different nodes. If the planner can find open intervals for this requirement in time, then that strategy is dynamically guaranteed. If not, the planner applies the designer-specified action; e.g., the information associated with this FERT might indicate just to abort the FERT, or alternatively it might indicate that a strategy consisting of a simple error handler with no redundancy should be scheduled. The planner for a given system must be sufficiently powerful to support the level of adaptability of fault semantics specified by the designer, and in general, planners on different nodes must cooperate to find feasible task assignments to time slots, including subnet time slots. A key problem is making the planner fast enough (and bounded) to be usable in many systems. However, the cost of distributed planning may be reduced by using a scheduling chip [3].
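A minimal model of this on-line planning step is sketched below, assuming per-node free time intervals derived from the flexible timetable's idle intervals. The interval representation, node names, and fallback label are invented for illustration and do not reflect the Spring planner's real structures.

```python
# Sketch of on-line planning: place `copies` replicas of a task on distinct
# nodes before the deadline, or roll back and apply the designer-specified
# fallback. Free intervals stand in for the timetable's idle intervals.

def place(free, wcet, deadline):
    """Reserve the first free interval on any node that fits `wcet` units
    before `deadline`; return (node, start) or None."""
    for node, intervals in free.items():
        for i, (lo, hi) in enumerate(intervals):
            if min(hi, deadline) - lo >= wcet:
                intervals[i] = (lo + wcet, hi)  # consume the front of the gap
                return (node, lo)
    return None

def plan_fert(free, wcet, deadline, copies=3):
    """Try the preferred strategy: `copies` replicas on distinct nodes, all
    before the deadline. On failure, undo reservations and return the
    designer-specified fallback (here a bare error handler)."""
    placements, used = [], set()
    snapshot = {node: list(iv) for node, iv in free.items()}
    for _ in range(copies):
        spot = place({n: iv for n, iv in free.items() if n not in used},
                     wcet, deadline)
        if spot is None:
            free.update(snapshot)          # roll back partial reservations
            return "error_handler_only"    # designer-specified fallback
        used.add(spot[0])
        placements.append(spot)
    return placements
```

If all three nodes have room before the deadline, the preferred primary-plus-two-backups strategy is dynamically guaranteed; otherwise the partial reservations are undone and the fallback applies, as the text describes.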
2.5 Conclusions

Current real-time systems platforms present inadequate, incomplete, and missing functionality in their architectures, including at the interface to scheduling and for fault tolerance. As a result of these poor interfaces, critical real-time systems are difficult to design, maintain, analyze, and understand. The systems tend to be inflexible, and productivity is low. We need to raise the level of functionality that the run-time platform provides in order to better support portability, productivity, and lower costs. A reflective architecture has potential in these areas. Further, in order to support dependable real-time systems in complex and unfriendly environments, the current goals, policies, and state of the system must be available for dynamic access and modification. However, the dynamics must be carefully controlled. Establishing designs and solutions that engineer good tradeoffs among the opposing factors listed in the Introduction will be difficult and the subject of research for many years. However, the reflective architecture has many nice properties that provide a structure for engineering those tradeoffs. More specifically, in this chapter we presented a view of how complex real-time systems for critical applications might be built based on a reflective architecture.
While many systems contain various amounts of information that are used dynamically, in many cases such information and its use are ad hoc and piecemeal. Here we identified specific reflective information, as well as describing integrated and more complete architectural designs for both real-time and fault tolerance requirements. We discussed how using reflection as an architectural principle enhances certain aspects of dependability because it (i) addresses time constraints at all levels of the system so that more accurate timing analysis can be done even early in the design, (ii) facilitates the use of scheduling algorithms that are robust even under violations of the initial assumptions, (iii) supports planning-based scheduling that allows early detection of timing errors and application of state- and time-dependent recovery strategies, and (iv) supports adaptive fault tolerance so that there is flexible management of redundancy subject to time constraints. While final verification that this approach will succeed in practice remains to be demonstrated (i.e., it has not been used on real safety-critical applications), we have demonstrated many aspects of it in many ways. For example, a System Description Language (SDL) has been designed and implemented [14], along with extensions to the programming language C, called Spring-C. A more sophisticated real-time language, RTCC, has also been designed. These languages provide specification of and access to reflective information. A compiler [15] has been implemented that accumulates this information and makes it part of the run-time structures of the Spring kernel [22]. The kernel contains the planning-based scheduling algorithms [17] that dynamically create predictable execution, and to lower the overhead of on-line planning, a hardware scheduling chip has been designed and implemented [3].
Many aspects of the algorithms and implementation details have been studied and shown to be valuable, both via simulation [4, 17] and in an actual distributed testbed system composed of three multiprocessor nodes (15 processors) [22]. The adaptive fault tolerance aspects of the reflective architecture have not been implemented, and are still in the design stage [2]. Finally, systems which support flexibility and adaptability are usually more difficult to understand and control. Without good constraints and guidelines a designer could program haphazardly, making the system almost impossible to analyze. Future work includes developing good engineering tradeoffs between what is necessary for adaptability and what is necessary for analyzability. It is also necessary to embed the approach we discussed here in an overall design methodology, such as object-oriented programming.
Acknowledgments

This work has been supported, in part, by NSF under Grants IRI 9208920 and CDA 8922572, by ONR under Grant N00014-92-J-1048, and by CNR-IEI under the PDCS project. We would like to thank Doug Niehaus for his various contributions to the project, including major parts of the design and implementation of the system. Many others in the Spring project also deserve thanks; we thank them all. We also thank A. Bondavalli and L. Strigini for their contributions to our joint work on adaptive fault tolerance.
References

[1] A. Bondavalli, F. Giandomenico, and J. Xu, A Cost Effective and Flexible Scheme for Software Fault Tolerance, Univ. of Newcastle, TR 372, Feb. 1992.
[2] A. Bondavalli, J. Stankovic, and L. Strigini, Adaptable Fault Tolerance for Real-Time Systems, Proc. Third International Workshop on Responsive Computer Systems, Sept. 1993.
[3] W. Burleson, J. Ko, D. Niehaus, K. Ramamritham, J. Stankovic, Wallace, and C. Weems, The Spring Scheduling Co-processor: A Scheduling Accelerator, Proc. ICCD, Cambridge, MA, Oct. 1993.
[4] G. Buttazzo and J. Stankovic, RED: A Robust Earliest Deadline Scheduling Algorithm, Proc. Responsive Systems Workshop, Sept. 1993.
[5] N. Gehani and K. Ramamritham, Real-Time Concurrent C: A Language for Programming Dynamic Real-Time Systems, Real-Time Systems, Vol. 3, No. 4, Dec. 1991.
[6] M. Ibrahim, editor, Proc. OOPSLA '91 Workshop on Reflection and Metalevel Architectures in Object Oriented Programming, Oct. 1991.
[7] R. Karp, On-line versus Off-line Algorithms: How Much Is It Worth to Know the Future?, Information Processing, North-Holland, Amsterdam, Vol. 1, 1992.
[8] K. Kim and T. Lawrence, Adaptive Fault Tolerance: Issues and Approaches, Proc. Second IEEE Workshop on Future Trends in Distributed Computing Systems, Cairo, Egypt, 1990.
[9] C. Koza, Scheduling of Hard Real-Time Tasks in the Fault Tolerant Distributed Real-Time System MARS, Proc. Fourth IEEE Workshop on Real-Time Operating Systems, Boston, pp. 31-36, 1987.
[10] G. Le Lann, Why Should We Keep Using Precambrian Design Approaches at the Dawn of the Third Millennium?, Proc. Workshop on Large, Distributed, Parallel Architecture Real-Time Systems, March 1993.
[11] C. Liu, A General Framework for Software Fault Tolerance, PDCS Second Year Project Report, ESPRIT Project 3092, 1991.
[12] P. Maes, Concepts and Experiments in Computational Reflection, OOPSLA '87, SIGPLAN Notices, Vol. 22, No. 12, pp. 147-155, 1987.
[13] S. Matsuoka, T. Watanabe, Y. Ichisugi, and A. Yonezawa, Object Oriented Concurrent Reflective Architectures, 1992.
[14] D. Niehaus, J. Stankovic, and K. Ramamritham, The Spring System Description Language, UMASS TR-93-08, 1993.
[15] D. Niehaus, Program Representation and Translation for Predictable Real-Time Systems, Proc. RTSS, pp. 43-52, Dec. 1991.
[16] D. Powell et al., The Delta-4 Approach to Dependability in Open Distributed Computing Systems, Proc. 18th International Symposium on Fault Tolerant Computing, Tokyo, Japan, 1988.
[17] K. Ramamritham, J. Stankovic, and P. Shiah, Efficient Scheduling Algorithms for Real-Time Multiprocessor Systems, IEEE Transactions on Parallel and Distributed Systems, Vol. 1, No. 2, pp. 184-194, Apr. 1990.
[18] K. Ramamritham, Scheduling Complex Periodic Tasks, Proc. International Conference on Distributed Computing Systems, 1990.
[19] K. Ramamritham and J. Adan, Providing for Dynamic Arrivals During the Static Allocation and Scheduling of Periodic Tasks, Proc. Real-Time Operating Systems and Software, May 1993.
[20] K. Ramamritham and J. Stankovic, Scheduling Strategies Adopted in Spring: An Overview, chapter in Foundations of Real-Time Computing: Scheduling and Resource Management, edited by Andre van Tilborg and Gary Koob, Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 277-306, 1991.
[21] B. Smith, Reflection and Semantics in a Procedural Language, MIT TR 272, 1982.
[22] J. Stankovic and K. Ramamritham, The Spring Kernel: A New Paradigm for Real-Time Systems, IEEE Software, Vol. 8, No. 3, pp. 62-72, May 1991.
[23] J. Stankovic, On the Reflective Nature of the Spring Kernel, invited paper, Proc. Process Control Systems '91, Feb. 1991.
[24] R. Stroud, Transparency and Reflection in Distributed Systems, position paper for Fifth ACM SIGOPS European Workshop, April 1992.
[25] B. Wyatt, K. Kavi, and S. Hufnagel, Parallelism in Object Oriented Languages: A Survey, IEEE Software, Vol. 6, Nov. 1992.
[26] J. Xu and D. Parnas, Scheduling Processes with Release Times, Deadlines, Precedence, and Exclusion Relations in Real-Time Systems, IEEE Transactions on Software Engineering, Vol. 16, pp. 360-369, 1990.