Submitted to Symposium on Reliable Distributed Systems, SRDS'98

Dynamic Optimization in Hazard Detection for Continuous Safety-Critical Systems

Beth Plale

Karsten Schwan

College of Computing
Georgia Institute of Technology
Atlanta, Georgia 30332
{beth,schwan}@cc.gatech.edu

Abstract

Safety critical systems are pervasive in modern society. Financial systems, transportation systems, medical record retrieval systems, and air traffic control systems all could potentially threaten economic, property, or personal safety. However, this class of safety critical system is not amenable to the existing approaches to achieving safe software. Hence a new approach to software safety is needed: one that can accommodate distributed systems, that may contain COTS components, and that may consist of components not all of which were designed to be used in safety-critical settings. In response to this need, we have developed a software hazard detection tool that we argue increases the safety level of continuous safety critical systems, in part by employing dynamic behavior to enhance the flexibility of hazard detection and to expand the potential for optimizations of the detection process.

Keywords: software safety, safety-critical systems, on-line monitoring, distributed systems, query optimization

1 Introduction

1.1 Safety-critical Systems

Safety critical systems are pervasive in modern society and their numbers are growing. A safety critical system is defined as any system wherein a problem or hazard occurring in the system could potentially compromise economic, property, or personal safety. The more traditional safety-critical systems are often embedded and real time, with an emphasis on timing deadlines and predictable performance at the task and communication levels. Examples are well known: digital flight control systems, nuclear power plants, collision avoidance systems on board an aircraft, and computerized signaling systems for controlling commuter trains. There are other systems that are clearly safety-critical but that deviate from our traditional (i.e., dated) understanding of safety-critical systems for one reason or another: they may not have hard real-time deadlines or may not be embedded. There are cardiac care products such as patient monitoring devices for monitoring an intensive-care patient's vital signs, or implanted devices such as defibrillators. Then there is the ammunition control software system (ACS), which manages ammunition holdings. ACS is not real-time, but it is clearly safety critical in that incorrect storage combinations can lead to massive explosions. Still, all the examples fall into categories commonly associated with safety: military and medical applications. But the recent growth in safety-critical systems is attributed to two factors: the first

is that computers in safety-critical roles are pervading all areas of our lives and our understanding of the potential dangers of this pervasiveness is increasing. For example, our definition of safety-critical systems has expanded to include economic damage. To illustrate, in a recent keynote speech, Kenneth Birman observed that a growing number of software components that were never intended to be safety-critical are today appearing in safety critical settings [3]. For example, an emergency room physician may access a remotely located patient record for information essential to an accurate diagnosis. A delay in responding to the request could potentially endanger the patient. Examples in addition to the medical record retrieval system include transportation systems, financial systems, and electronic, telephone, and telecommunications networks. The Automated Vehicle/Highway System (AVHS) [19] is a transportation system under development at UC Berkeley in which car-like vehicles travel in strictly controlled platoons so as to minimize the distance between each vehicle. One obvious safety critical issue is crash prevention, particularly when vehicles change lanes or merge on or off the road. Computerized financial systems are considered safety critical not because of the threat to personal safety, but because of the potentially catastrophic harm to the economy should such a system fail. Electronic, telephone, and telecommunications networks are also safety critical systems if one considers that loss of service removes the capability to summon emergency services. Similarly, loss of long-distance service can cause serious disruption of business activities and an ensuing loss of revenue [14].
We distinguish this broader class of safety-critical applications, which we call continuous safety critical systems, from the more traditional safety-critical systems by the following characteristics:

- they are long-running,
- they consist of components distributed over multiple physical resources,
- they may evolve, and
- they may include commercial off-the-shelf (COTS) components.

A long-running system may be up for weeks, months, or even years. It is highly likely that the nature of potential hazards will change over time as the environment changes. The components of this class of system can be geographically distributed, as with the emergency room physician accessing a remote patient record database. These systems may evolve in the sense that new tasks/objects may be instantiated to support an evolving environment. Finally, these systems may include COTS components, such as the remote database management system of the example above. The database management system was probably not originally intended as a safety critical component, so techniques for achieving an intrinsically safe system, often employing formal methods, were likely not employed. Moreover, unlike real-time systems, continuous safety-critical systems are time dependent, but only some of their components may have real-time constraints. The remainder of the system does not have the strict guarantees required of a real-time system, and its effort is considered to be best-effort as opposed to guaranteed-response [10]. Finally, unlike implanted devices, which are inaccessible once installed (e.g., defibrillators), continuous safety-critical systems are accessible. This emerging class of safety critical systems is less amenable to existing approaches to achieving safe software (i.e., formal methods and guaranteed deadlines) than are traditional safety-critical applications. Clearly there is opportunity for other approaches to improving safety.

1.2 Hazard Detection

Our research addresses the problem of achieving safer software through a software hazard detection approach. Our response to the problem is Cnet, a language-based approach to on-line hazard detection that allows the user to specify constraints in the form of queries in a temporal database query language.

Figure 1: The system architecture.

The architecture of the system is shown in Figure 1. The constraint compiler accepts sensor and constraint specifications. Sensor specifications are transformed into sensors, which are used to monitor the safety-critical application. Constraint specifications are optimized before the compiler generates intermediate code as a set of commands to the Cnet library. The library builds a run-time instantiation for each query. The run-time instantiations execute as part of the analysis tool. The analysis tool uses monitoring to extract event data from the application. It analyzes the information on-line against the set of constraints it maintains, then reacts, possibly in the form of actions taken in the application, when one or more constraints are violated. The analysis tool also responds to user commands initiated through the constraint management interface. The interface allows the user to modify the constraint set through dynamic constraint addition, etc. For purposes of an overview, solid lines indicate run-time control flow whereas dotted lines indicate static, pre-runtime relationships, though in reality the lines blur. The safety-critical application used in our work is an autonomous robotics application described in the next section.

Cnet's software-based approach to hazard detection addresses the unique needs of continuous safety-critical systems in a number of ways. First, the unique interface between the compiler and library, in the form of a compiler-generated script of commands, makes the approach well suited to long-running and evolving applications by facilitating constraint set adaptations. Adaptations can occur through user commands that alter the constraint set (e.g., add, remove) or through algorithmic adaptations in which statistical run-time data is used to automatically trigger query reoptimization. This dynamic behavior is key to the analysis tool's responsiveness to the unique needs of continuous safety-critical systems.
Second, the language-based approach allows for the specification of a general set of hazards, including temporal constraints and constraints with complex conditions specified over multiple components. Additionally, the language-based approach enables the application of constraint optimizations, either dynamically or during constraint creation. Finally, the external nature of the approach facilitates efficient use with distributed application components, as the analysis can be done on a machine independent from the application components. The external nature also allows the tool's use regardless of the language in which the safety-critical application was written, subject to the needs of instrumentation. The analysis tool is able to analyze any uniform event stream, regardless of source.

The contributions of this paper include the application of and algorithms for statistics-based optimizations to achieve dynamic constraint reoptimization, performance evaluation results for Cnet, and a preliminary discussion of a distributed Cnet. We motivate the work by describing the continuous safety-critical application that provoked the work and by cursorily describing the query language by means of an example constraint taken from the application. The run-time environment is briefly introduced in Section 3 to provide sufficient background for a subsequent discussion of dynamic optimization and management of hazards. Performance results appear in Section 4, followed by a discussion of the distributed extension of the work in Section 5. Related research appears in Section 6, and conclusions and future work in Section 7.

2 Hazard Detection for a Robotics Application

The sample application used in our work is a multiagent reactive robotic system simulation developed by Balch and Arkin [2] in their work on the Autonomous Robot Architecture (AuRA) at the Georgia Institute of Technology. The authors tested their strategy through an iteration of simulation and instantiation on real systems. Our work is in applying hazard detection to the robotics application, where a set of autonomous robots navigate toward a goal across unmapped terrain, 'reacting' to stimuli in the environment. The robot group performs one of three tasks: forage, consume, and graze. During forage, a robot wanders an environment, looking for attractors. Upon encountering an attractor, it moves toward the attractor, attaches itself, and returns the object to a specified home base. During consume, a robot also wanders the environment, looking for attractors. Unlike forage, though, after attachment a robot performs work on the object in place. In the final task, graze, discrete attractors are not involved; the object is to completely cover, or visit, the environment (akin to mowing the lawn). A robot searches for an area not grazed, moves toward it, then grazes until the entire environment (or some percentage thereof) has been covered.

2.1 Constraint Specification

The language we employ has a high level of user familiarity, being an SQL variant, but was selected primarily for its prescriptive nature, which gives us much freedom in applying optimizations tailored to an on-line environment. The language incorporates features from the active database rule language Starburst [24] for its meta statements, and from the relational temporal query language ATSQL2 [23] for specifying constraint conditions. A simple constraint for the robotics example appears in Figure 2.

CREATE RULE C:1 ON RobotRadEv, RobotDistEv
IF SELECT RadiatedRobotEv d.idSelf r.rad
   FROM RobotRadEv as r, RobotDistEv as d
   WHERE r.id = d.idSelf and d.dist < 10 and r.rad >= 200
THEN STEER changeRobotRepulsionSphere RobotRadEv.id 30

Figure 2: Radiation level of robot exceeds 200 R and another robot approaches within 10 ft. of radiating robot.

In Figure 2, the create rule clause creates a constraint named C:1 with the event sources RobotRadEv and RobotDistEv. Both event sources are generated by sensors in the application. The IF clause delineates the rule's condition. The event sources appear in the FROM clause to define the aliases r and d. The WHERE clause specifies the selection condition. The condition r.id = d.idSelf is a join condition corresponding to a Cartesian product operation followed by a selection operation. The constraint selects the events that satisfy the condition of the WHERE clause, then projects the result on the d.idSelf and r.rad attributes for the derived event RadiatedRobotEv. The THEN statement, the final clause of the rule, delineates the action list, which in this example is a single STEER action that causes the changeRobotRepulsionSphere function to be invoked in the robotic application for the robot named by the attribute RobotRadEv.id and with the goal gain force value of 30. A more complete description of the robotics application and language is given in [18].
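The relational semantics of C:1 can be illustrated in ordinary code. Below is a minimal Python sketch (all function and variable names are ours, not Cnet's) of the Cartesian product, selection, and projection steps the rule describes:

```python
# Sketch of rule C:1's condition: pair up RobotRadEv and RobotDistEv events,
# keep pairs satisfying the distance and radiation thresholds, then project
# onto the attributes of the derived RadiatedRobotEv event.
def evaluate_c1(rad_events, dist_events):
    derived = []
    for r in rad_events:                      # Cartesian product ...
        for d in dist_events:
            if (r["id"] == d["idSelf"]        # ... followed by selection
                    and d["dist"] < 10
                    and r["rad"] >= 200):
                # projection onto d.idSelf and r.rad
                derived.append({"idSelf": d["idSelf"], "rad": r["rad"]})
    return derived

rad = [{"id": 1, "rad": 250}, {"id": 2, "rad": 50}]
dist = [{"idSelf": 1, "dist": 8}, {"idSelf": 2, "dist": 3}]
# Only robot 1 both radiates >= 200 R and has a robot within 10 ft.
print(evaluate_c1(rad, dist))
```

The nested loops make explicit why Cnet's optimizer works to push selections below the Cartesian product: any event discarded early never reaches the expensive pairing step.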

3 Adaptive Hazard Detection

A significant contribution of this paper is the adaptive capability inherent in the approach, which makes it uniquely responsive to the evolving nature of continuous applications. Specifically, the language-based approach makes optimization possible, and the portable, intermediate code generated by the compiler, together with the Cnet library, opens the way for dynamic manipulation of the optimized code. Cnet offers two means by which the current constraint set can be altered, one under algorithmic control and the other under user control. The algorithmic approach, called dynamic constraint optimization, is based on the concept that statistical data accumulated at run-time can be used to obtain an optimized version of a query. User-controlled constraint management is provided through an external user interface task that forwards user-directed commands to the analysis tool. Through the interface the user is able to add, remove, alter, activate, and deactivate constraints. Before describing the adaptive capabilities, we briefly review the run-time environment and event-handling mechanisms [20].

3.1 Run-time Environment

Hazard detection is implemented in terms of nodes, queues between nodes, and a dispatcher controlling overall execution. Integral to the approach is a language and compiler for writing constraint specifications and optimizing the resulting parse trees. Through calls to the Cnet library, an executable version of the parse tree, called a node, is created. A node is a collection of select, project, and Cartesian product operations. The first operation of every node is a gate operation. The gate operation serves as both an entry-point to the node and point-of-control for the node. Point-of-control activity includes activating or deactivating the node, triggering cost data updates, or deleting the node. Nodes are connected as a DAG where an arc exists between two nodes A and B if node B requires an input event type that is generated by A. Communication between the analysis tool, user interface, and application is achieved using a communication infrastructure. Analysis tool execution is handled by a dispatcher. Cnet connects to the application via a communication and monitoring infrastructure. Monitoring is accomplished through Falcon [7], a tool providing low-perturbation and low-latency monitoring support. Falcon, running as one or more separate threads in the application, communicates with the application via shared buffers. Falcon extracts event data from the buffer and sends it to the analysis tool via DataExchange [5], a communication and binary I/O infrastructure offering a publish-subscribe model and fast heterogeneous data transfer, and services such as dynamic connection of clients, information filtering, and control flow.
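The node-linking rule above — an arc from A to B whenever B consumes an event type that A produces — reduces to a small computation over the nodes' input and output event lists. A sketch (the representation and names are our own, for illustration only):

```python
def link_nodes(nodes):
    """Build DAG arcs: arc (A, B) if node B requires an event type produced by A.

    `nodes` maps a node name to a pair (input event types, output event types).
    """
    arcs = []
    for a, (_, a_out) in nodes.items():
        for b, (b_in, _) in nodes.items():
            # a's outputs intersecting b's inputs means b depends on a
            if a != b and a_out & b_in:
                arcs.append((a, b))
    return arcs

nodes = {
    "A": ({"RobotRadEv"}, {"DerivedEv"}),
    "C": ({"DerivedEv", "RobotDistEv"}, set()),
}
print(link_nodes(nodes))   # [('A', 'C')]: C consumes the derived events A emits
```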

The analysis tool interacts with the infrastructure by registering its interest in a set of event types. When the application generates events, the events are received at the infrastructure level and decoded if necessary before being passed to the analysis tool. Steering requests originating in the analysis tool are similarly handled.

3.2 Event Handling

Event data is handled by a dispatcher. Designing an event handling agent allows us to better accommodate multisource hazards and dynamic adaptation. Multisource hazards are serviced by a dispatcher capable of receiving and handling events regardless of source location. Dynamic adaptation, which is the analysis tool's response to the evolving nature of the application, is facilitated by the dispatcher's ability to integrate user- and algorithmically-generated commands into its event handling process. Fairness is ensured by a rule processing algorithm largely based on depth-first search [18]. The dispatcher's primary function is controlling Cnet execution. As such, it is responsible for:

- maintaining a record of the current set of active nodes,
- linking nodes, and
- maintaining information about the relationship between nodes.

More importantly, however, the dispatcher acts as an event servicing agent, processing events from a number of sources:

- event data from the application,
- management (i.e., ADD, DELETE) requests from the constraint management interface,
- constraint replace requests from the optimizer, and
- action requests from the nodes themselves.

Event flow to and from the dispatcher is depicted in Figure 3, where the shaded area represents tasks internal to the analysis tool. Event data from the application comprises the significant majority of all events processed by the dispatcher. Other event types, though occurring less frequently, are no less important. User requests originating in the constraint management interface take the form of requests to add or remove constraints from the system. Periodic triggering of dynamic query reoptimization may occur on the basis of run-time data accumulated during execution. The optimizer notifies the dispatcher when reoptimization has completed. It is the function of the dispatcher to guide actual replacement of the old constraint with the optimized version.
Nodes themselves have actions to be executed when a constraint is violated; the dispatcher enacts the actions on behalf of nodes.
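The dispatcher's role as an event-servicing agent might be sketched as a loop over a single queue of tagged events, one handler per source. This is our own simplification for illustration, not Cnet's code:

```python
from collections import deque

def dispatch(queue, handlers):
    """Drain a queue of (kind, payload) events, routing each to its handler.

    The kinds mirror the event sources listed above: application event data,
    management requests, replace requests from the optimizer, and node actions.
    """
    serviced = []
    while queue:
        kind, payload = queue.popleft()
        handlers[kind](payload)
        serviced.append(kind)
    return serviced

log = []
handlers = {
    "app_event":  lambda p: log.append(("forwarded", p)),
    "management": lambda p: log.append(("managed", p)),
    "replace":    lambda p: log.append(("replaced", p)),
    "action":     lambda p: log.append(("acted", p)),
}
q = deque([("app_event", "RobotRadEv#1"), ("management", "ADD C:2")])
print(dispatch(q, handlers))   # ['app_event', 'management']
```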

3.3 Dynamic Constraint Optimization

Dynamic query reoptimization accumulates statistical information about application data at run-time and employs heuristics to determine when query reoptimization should be triggered. The technique potentially offers significant benefits. For instance, when statistical data about application data is considered during the reorganization of a query's select operations, even a simple reordering of select operations can dramatically reduce the number of events processed by a node. A number of components work together to achieve dynamic constraint optimization. They are, as shown in Figure 7, the executable nodes (e.g., C:1), a cost data manager, a query optimizer, and the dispatcher. The node C:1 collects statistical data about the attributes on which its condition is based and writes the data to the cost data file. The cost data manager determines when reoptimization should occur by comparing incoming cost data against the existing data. The query

optimizer reoptimizes queries based on stored copies of the parse tree. Storing the parse tree obviously eliminates the need for re-parsing the original query. The dispatcher responds to requests from the optimizer to alter a node.

Figure 3: Event dispatching in the analysis tool.

Cost Factor   Description
COLCARD       number of distinct data elements
HIGHKEY       highest value
LOWKEY        lowest value

Figure 4: Cost factors.

3.3.1 Cost Functions

A cost function of a predicate P, CF(P), is the fraction of events resulting from the predicate restriction P [17]. Estimating cost functions is done under a number of assumptions, including uniform distribution of individual attribute values and independent joint distributions of values from any two unallied attributes. Cost functions are commonly used in relational databases during query plan selection to estimate the I/O cost of a query. I/O costs include the access cost to secondary storage, the storage cost, and the cost of storing intermediate files [6]. For Cnet's application, we do not care about I/O costs and instead formulate cost functions that are used during query reoptimization. The set of cost factors and cost functions we employ are shown in Figures 4 and 5, respectively. We illustrate their use with the following example. Suppose a condition has two predicates, r.id = 10 and r.rad > 200, and the original optimization placed the predicate r.rad > 200 lower (earlier) in the parse tree than the other. Cost data obtained during run time reveals the cost function results shown in Figure 6. The low cost function value of the predicate r.id = 10 indicates that this predicate is far more likely than the other to result in discarded events. The optimizer can then use this data to push the predicate lower in the parse tree.

Predicate Type   Cost Function
Pred1 = value    1 / COLCARD
Pred1 > value    (HIGHKEY - value) / (HIGHKEY - LOWKEY)
Pred1 < value    (value - LOWKEY) / (HIGHKEY - LOWKEY)

Figure 5: Cost functions.

Attribute Name   COLCARD   HIGHKEY   LOWKEY
r.rad            1000      999       0
r.id             24        25        1

Predicate        Cost Function
r.id = 10        0.042
r.rad > 200      0.800

Figure 6: Cost data obtained at run-time.
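Applying the Figure 5 formulas to the Figure 6 cost factors reproduces the listed selectivities. A small sketch (the helper names are ours):

```python
def cf_equals(colcard):
    # Pred = value: under the uniform-distribution assumption, each of the
    # COLCARD distinct values is equally likely.
    return 1.0 / colcard

def cf_greater(value, highkey, lowkey):
    # Pred > value: fraction of the key range that lies above `value`.
    return (highkey - value) / (highkey - lowkey)

print(round(cf_equals(24), 3))             # 0.042  -> r.id = 10
print(round(cf_greater(200, 999, 0), 3))   # 0.8    -> r.rad > 200
```

The low value for r.id = 10 quantifies the intuition in the text: that predicate discards roughly 96% of events, so it pays to evaluate it first.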

3.3.2 Optimization Algorithm

Referring again to Figure 7, a node during execution stores cost data for the attributes in its predicates. Since the cost data for dynamic query reoptimization needs to be reasonably close but not necessarily up-to-the-minute, we employ a simple heuristic: upon every nth event processed, the node sends an update message to the cost data manager. The cost data manager compares the incoming cost data against the existing values. Here again we employ a heuristic: if any one of the values differs by some epsilon, then return 'trigger satisfied'. The cost data manager returns the status indicating whether or not to trigger reoptimization. If the trigger status indicator is set, the node triggers reoptimization by sending a request to the optimizer, supplying its own name as a parameter. The optimizer executes as a separate thread waiting for requests. When a reoptimization request arrives, the optimizer retrieves the parse tree based on the node name and reoptimizes based on the algorithm given in Figure 8. The algorithm compares the cost factor of the parent select node to the child's. If the cost factor of the parent is less than that of the child, the parent and child swap positions in the parse tree. Upon completion the optimizer generates a new script, saves the parse tree, and issues a command to the dispatcher to alter the existing node. The dispatcher notifies the node to delete itself but save its accumulated state to prevent missed violations. It then invokes the Tcl interpreter to add the new node and bind the saved history to it.
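The two heuristics — update the cost data manager every nth event, and trigger reoptimization when any cost factor drifts by more than some epsilon — might look like the following. This is a sketch under our own naming; the concrete values of n and epsilon are tunable and not taken from the paper:

```python
def should_update(events_processed, n=1000):
    # Heuristic 1: send cost data to the manager upon every nth event.
    return events_processed % n == 0

def trigger_satisfied(incoming, existing, epsilon=0.05):
    # Heuristic 2: 'trigger satisfied' if any cost factor differs by more
    # than epsilon from the value the manager already holds.
    return any(abs(incoming[k] - existing[k]) > epsilon for k in existing)

old = {"COLCARD": 24, "HIGHKEY": 25, "LOWKEY": 1}
new = {"COLCARD": 24, "HIGHKEY": 40, "LOWKEY": 1}
print(trigger_satisfied(new, old))   # True: HIGHKEY drifted well past epsilon
```

Because the comparison happens only every nth event, a node pays the bookkeeping cost at a bounded rate regardless of event volume, matching the "reasonably close but not up-to-the-minute" requirement.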

3.4 Dynamic Constraint Management

User-controlled constraint management is an important part of dynamic constraint management in that it provides the means through which a user can alter the state of the analysis tool based on increased understanding of the system, changing demands on the system, or a changing environment. The user manages the constraint set by creating rules in the constraint language. As described in [18], the language provides five rules: the create rule, alter rule, drop rule, activate rule, and deactivate rule, all of which are available to the user for managing the active constraint set. Constraint addition occurs when the constraint management interface accepts a create rule from the user and invokes the compiler to generate a script. Once the constraint is encoded as a


Figure 7: Flow of control in dynamic constraint optimization.

optimizeNodes(.., SelectNode * parent, SelectNode * child) {
    if (parent.costFunction() < child.costFunction()) {
        push parent below child
        return child as new parent   // parent and child swap positions
    }
    return parent                    // no change to parse tree
}

Figure 8: Optimization algorithm considering cost data.

script, the constraint management interface generates an ADD command for the dispatcher, as shown in Figure 9. The dispatcher invokes the Tcl interpreter, passing it the script file name. The interpreter executes the script, the execution of which builds the node, C, in the example. The new node automatically registers its existence with the dispatcher, providing it a list of input and output event lists. For an add to be successful, the input event list must include only previously defined event types; otherwise the add fails. The dispatcher uses the event lists to make connections between nodes. As shown in Figure 9, C needs one event type from the dispatcher and one generated by A. Control then returns to the dispatcher and event processing resumes. Events arriving at the dispatcher from the application in which the new node is interested are immediately forwarded to the new node.

A constraint is removed from the net in a similar way. The constraint management interface generates a command from a drop rule; the dispatcher locates the node based on the node name and invokes a method in the node to delete itself. A node to be deleted cannot have other nodes dependent upon it for events. In other words, continuing to use Figure 9 as an example, a subsequent command to remove node A would fail, since node C requires the derived events generated by A. The alter rule replaces one constraint with another, user-supplied, version. Replacement preserves and transfers state between the old constraint and the new. For the activation and deactivation rules, the user interface generates commands that are handled by the dispatcher in the same manner


Figure 9: Dynamic constraint addition.

as it handles activate and deactivate action requests generated by the nodes.
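The rule that a node cannot be deleted while other nodes depend on it for events reduces to a simple check over the DAG. A minimal illustrative sketch, with our own names and representation:

```python
def can_remove(node, nodes):
    """A node is removable only if no other node consumes an event type it produces.

    `nodes` maps a node name to a pair (input event types, output event types).
    """
    _, produced = nodes[node]
    return not any(produced & inputs
                   for name, (inputs, _) in nodes.items()
                   if name != node)

nodes = {
    "A": ({"RobotRadEv"}, {"DerivedEv"}),
    "C": ({"DerivedEv"}, set()),
}
print(can_remove("A", nodes))   # False: C depends on A's derived events
print(can_remove("C", nodes))   # True: nothing downstream of C
```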

4 Evaluation

Evaluation was performed to demonstrate the effectiveness of optimization. Before evaluating optimizations directly, we microbenchmark system performance, then show how times computed from component times compare to measured times for a typical mix of queries. Microbenchmarking is performed on each component operation (e.g., gate, select, project, Cartesian product). The component times are then used to predict the overall cost of a query. The benchmark numbers are obtained using a mix of twelve constraints, where the constraint set varies depending on the component. For the gate operation, every constraint consists of a single gate operation accepting a single primitive event. The gate operation serves as both an entry-point to the node and point-of-control for the node, but point-of-control activity (e.g., cost data collection and updates) does not occur during microbenchmarking. That is, gate simply dequeues an event and returns. For the other operations, the constraints in the set consist of a gate operation plus a homogeneous collection of one or more operations, where each constraint accepts either a primitive event or a derived event. For example, of the 12 constraints in the select benchmark, eight have one select operation and four have two; and eight of the twelve constraints accept a primitive event while four accept a derived event. The respective times are given in Table 1. Benchmarking was run for 16872 application-generated events on a Sun UltraSPARC 1. The time, shown in Table 1, obtained for the gate operation may be misleading. The gate operation, when not servicing point-of-control requests, in fact does very little. Factored into the time shown is the overhead of the dispatcher: the time for control to return from a gate operation to the dispatcher, for the dispatcher to either invoke the next node or to obtain a new event, queue the event, then transfer control to the first node in the list.
As shown in the table, projection is a costly operation with respect to the others. The cost is reasonable considering that projection creates a new event for every event it receives. The microbenchmarks can be used to predict the execution time of a set of realistic queries.

operation            execution time (microseconds)
gate                 4.7
projection           8.6
selection            1.6
Cartesian product    0.5

Table 1: Microbenchmark results.
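Predicting a query's cost from the microbenchmarks is a matter of weighting per-operation counts by the per-operation unit times. A sketch using the Table 1 values (the operation counts below are invented for illustration, not the paper's measured mix):

```python
# Table 1 unit costs, in microseconds per operation.
UNIT_COST = {"gate": 4.7, "projection": 8.6, "selection": 1.6, "cartesian": 0.5}

def computed_time_seconds(op_counts):
    """Predict total execution time (seconds) from per-operation counts."""
    return sum(UNIT_COST[op] * n for op, n in op_counts.items()) / 1e6

# Hypothetical counts for a small run:
counts = {"gate": 100_000, "projection": 20_000,
          "selection": 50_000, "cartesian": 40_000}
print(computed_time_seconds(counts))   # predicted seconds for this mix
```

This is exactly the source of the "computed execution time" column in Table 2; the 9% gap to measured time reflects tool activity the per-operation model cannot account for.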


Figure 10: Cnet for typical mix.

The query mix to which we apply the benchmark values is referred to as the typical mix, a set of 12 queries accepting three event types. The structure of the Cnet for the typical mix is shown in Figure 10. The queries are all fairly simple and are all optimized. As can be seen from Table 2, the 16872 events resulted in 362,569 operations (i.e., the number of times a single select, project, etc. operation was executed). Additionally, the measured execution time is 9% greater than the computed execution time, indicating there is activity in the analysis tool for which the benchmark values are unable to account. We suspect this is due, in part, to caching effects.

total ops   measured execution time (seconds)   computed execution time (seconds)
362,569     1.51988                             1.39385

Table 2: Typical mix.

In measuring the benefits of optimization, the evaluation criterion used is the total time to execute a query, which is the time measured from when the first event is available until the last event is received. Queries used in the evaluation execute no actions. The results, for example Table 3, list the execution time and number of operations performed for the constraint in optimized

and unoptimized forms. Each constraint is applied to an event stream consisting of 16,872 events generated by the robotics application. The final row shows the operations breakdown by percentage, where (g/p/s/cp) indicates the ordering (gate/project/select/Cartesian product).

Optimization A            optimized       unoptimized
execution time (secs)     0.14427         1.08329
number ops                34,070          865,411
% by op (g/p/s/cp)        .49/0/.51/0     .02/0/.11/.87

Optimization B            optimized       unoptimized
execution time (secs)     0.3019          5.65136
number ops                39,471          2,477,718
% by op (g/p/s/cp)        .42/.42/.15/0   .01/0/.45/.54
Table 3: Optimization A, a complex constraint. Table 4: Optimization B, constraint C:1.

Optimization A is a complex query consisting of three input event types and a complex condition. The optimization yielded an 87% reduction in time and a 96% reduction in the number of operations. As can be seen from the operations breakdown, whereas in the unoptimized version 87% of the operations were attributed to Cartesian product, no Cartesian product operations were performed in the optimized version. This indicates the optimization policy of pushing selects down the parse tree is effective: the pushed select operations very effectively reduced the number of events reaching Cartesian product. Optimization B, shown in Table 4, is the result of the optimization of C:1. The results show similar benefits of optimization: a 95% reduction in time and a 98% reduction in the number of operations performed. As can be seen from the operations breakdown, the number of project operations increased, serving to reduce the size of the event processed by subsequent operations. The reductions seen as a result of optimization agree with those in Table 3 and also with results observed in other experimental runs.
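The effect of the push-down policy can be sketched abstractly: evaluating single-stream select conditions before a Cartesian product, rather than filtering its output, avoids generating pairs that are immediately discarded. The streams and predicates below are illustrative; operation counts follow the same convention as the tables (one count per select or product application).

```python
# Compare operation counts with the select applied after vs. before a
# Cartesian product. Streams and predicates are illustrative.

def unoptimized(left, right, cond):
    """Product first, then one select per joined pair."""
    ops, out = 0, []
    for l in left:
        for r in right:
            ops += 2                    # one product op + one select op
            if cond(l, r):
                out.append((l, r))
    return out, ops

def optimized(left, right, lsel, rsel, residual):
    """Push the per-stream select conditions below the product."""
    ops, lf, rf = 0, [], []
    for l in left:
        ops += 1                        # pushed select on the left stream
        if lsel(l):
            lf.append(l)
    for r in right:
        ops += 1                        # pushed select on the right stream
        if rsel(r):
            rf.append(r)
    out = []
    for l in lf:
        for r in rf:
            ops += 2                    # product + residual select
            if residual(l, r):
                out.append((l, r))
    return out, ops

left, right = list(range(100)), list(range(100))
res1, n1 = unoptimized(left, right, lambda l, r: l < 5 and r < 5 and l == r)
res2, n2 = optimized(left, right, lambda l: l < 5, lambda r: r < 5,
                     lambda l, r: l == r)
assert res1 == res2
print(n1, n2)                           # 20000 250
```

With highly selective pushed conditions, the product sees 5 x 5 events instead of 100 x 100, the same kind of collapse the operations breakdown in Table 3 shows.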

4.1 Optimizations Between Nodes

Optimizations between nodes can achieve gains in the execution time for a set of constraints. One such optimization technique is factorization, in which a subexpression common to two or more queries is factored out and a new query built from it that directs its events to the original queries. Table 5 provides an example of the effects of factorization. The timing results seem to indicate little benefit is accrued as a result of filtering. Though the total number of operations is reduced by more than half in the filtered version, the execution time drops only 4.6%. Insight into the problem

Filter A                  filter          no filter
execution time (secs)     0.22191         0.23224
number ops                33,345          72,171
% by op (g/p/s/cp)        .33/.33/.33/0   .31/.07/.62/0

Table 5: Filtering results.

can be obtained by looking at the operations breakdown, particularly at the percentage attributed to projection, which increased significantly from the no-filter version to the filter version. In the no-filter version, projects made up 29% of the total time and 7% of the total operations. In the filter case, projects accounted for a significant 67% of the total time and 33% of the total operations. These numbers are important because projects are expensive relative to the other operations. The time gained through the rather substantial filtering that took place (not evident from the numbers presented) was partially lost to the additional projection operations. The example indicates the benefit of filtering depends on a number of variables. The first is the filtering level of the factored-out select conditions: the benefit of select conditions that filter fewer events is offset by the cost of additional operations (e.g., project, gate) and of communication between nodes. Since information about the filtering level of a select condition can only be determined through run-time accumulation of cost data, we suspect optimization by factorization would benefit from dynamic optimization. The second factor is the location of the filter: were the filter located closer to the source, or even at the application, additional benefits in the form of lower latency and lessened network demand would accrue.

Note to reviewers: the performance evaluation conducted to date has involved static optimizations. We fully anticipate having results demonstrating the efficacy of dynamic optimizations available by the time the final copy is due.
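The factorization tradeoff can be sketched in miniature. Here a hypothetical condition (speed > 90) common to two queries becomes one shared select node whose output feeds both residual queries; event fields, thresholds, and counts are all illustrative:

```python
# Factorization sketch: the select condition common to two queries is
# factored into a single node feeding both residual queries. All event
# fields and thresholds are hypothetical.

events = [{"speed": i, "dist": i % 10} for i in range(100)]

# Unfactored: each of the two queries evaluates its full condition.
u1 = [e for e in events if e["speed"] > 90 and e["dist"] < 5]
u2 = [e for e in events if e["speed"] > 90 and e["dist"] > 7]
unfactored_ops = 2 * len(events)                  # 200 condition evaluations

# Factored: one shared select; residual conditions run on survivors only.
shared = [e for e in events if e["speed"] > 90]   # the factored-out node
f1 = [e for e in shared if e["dist"] < 5]
f2 = [e for e in shared if e["dist"] > 7]
factored_ops = len(events) + 2 * len(shared)      # 100 + 2*9 = 118

assert (f1, f2) == (u1, u2) and factored_ops < unfactored_ops
```

The saving depends entirely on how aggressively the shared select filters: had it passed most events, the extra node's operations (and, between machines, its communication) could outweigh the saving, which is the behavior Table 5 exhibits.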

5 Towards a Distributed Cnet

We begin by addressing practical concerns that may have arisen in the mind of the reader thus far. First, since the application class is characterized as long-running, is it not a limitation that sensors must be specified initially and instrumentation points added prior to compilation? We answer, yes, the monitoring framework currently employed requires that instrumentation points be inserted prior to compilation, but other, more dynamic instrumentation techniques such as Paradyn's binary editing [8] could be used as well.

Second, is it not a limitation that the temporal query language does not support the specification of timing constraint violations such as occur in real-time systems [15]? To this we answer that there may be more appropriate hazard detection approaches for hard-deadline real-time systems. The generality of our approach makes it well suited to the broader class of safety-critical systems where formal methods and timing guarantees may be realistic on only a small portion of the system, if at all.

Third, how does the analysis tool deal with event ordering when receiving events from distributed entities? The analysis tool employs an outside reorder process to impose a partial order on the event stream.

Finally, is it possible for Cnet to be flooded with events arriving from multiple, distributed sources? Though we have no firm supporting numbers, it is highly realistic to expect that the analysis tool could become flooded. We consider it a limitation of the current implementation that Cnet runs as a simple threaded application in a single address space, and we address this issue by discussing the direction in which we are taking this work.

The Cnet library is structured as a set of objects having an 'is a member of' relationship between levels of the hierarchy, as illustrated in Figure 11. At the lowest level of the hierarchy are operations (e.g., select, project, and Cartesian product).
A node is a collection of operations, but with an identity unto itself in that it can respond to commands to deactivate itself, remove itself, and so on. In addition, a node knows which are its enter and exit operations and what its input and output event types are. A cnet object is a collection of one or more nodes; it is cognizant of its member nodes and of the points at which it communicates with the outside world, and it possesses a list of input/output event types. As will



[Figure 11: cnet 1 contains nodes C:1, C:2, and C:3; the nodes in turn contain select and project operations]

Figure 11: Object relationships in Cnet.

be discussed, we also tag each Cnet with a condition to aid in the placement process.
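The hierarchy of Figure 11 can be sketched as three object layers. The class and method names below are illustrative, not the actual Cnet library API:

```python
# A minimal sketch of the Cnet object hierarchy: operations at the
# lowest level, nodes as collections of operations, and a cnet object
# as a collection of nodes tagged with a placement condition.
# Names are illustrative, not the actual Cnet library API.

class Operation:
    """A single relational operation (select, project, Cartesian product)."""
    def __init__(self, kind, fn):
        self.kind = kind
        self.fn = fn
    def apply(self, events):
        return self.fn(events)

class Node:
    """A collection of operations with an identity unto itself."""
    def __init__(self, name, ops, in_types, out_types):
        self.name, self.ops = name, list(ops)
        self.in_types, self.out_types = set(in_types), set(out_types)
        self.enter, self.exit = self.ops[0], self.ops[-1]   # enter/exit operations
        self.active = True
    def deactivate(self):
        self.active = False
    def process(self, events):
        for op in self.ops:
            events = op.apply(events)
        return events

class Cnet:
    """One or more nodes, plus the condition used during placement."""
    def __init__(self, nodes, condition=None):
        self.nodes = {n.name: n for n in nodes}
        self.condition = condition
    @property
    def in_types(self):
        return set().union(*(n.in_types for n in self.nodes.values()))

# Hypothetical usage: a one-node cnet selecting nearby-robot events.
sel = Operation("select", lambda evs: [e for e in evs if e["dist"] < 5])
c1 = Node("C:1", [sel], {"ROBOT_DIST_EV"}, {"DIST_GOAL_EV"})
net = Cnet([c1], condition=lambda attrs: attrs.get("end", 0) < 18)
```
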

[Figure 12: processes A (exports events a, b; machine I; start=0, end=17) and B (exports events a, c; machine II; start=18, end=36) register with the component registrar; cnets C:1, C:2, and C:3 are placed by the placement daemon, which exchanges messages such as "add C:3", "may want to fragment C:3", and "fragment C:3" with the registrar and the reoptimizer; C:2 carries the placement condition "if end < 10 and ..."]
Figure 12: Placing Cnets.

This hierarchical structure makes distribution of constraints straightforward. Distributed Cnets require that we address several issues not previously considered:

- partitioning the set of nodes amongst Cnets,
- communication between Cnets, and
- optimally placing a Cnet.

We focus our discussion on the last of these. Placing a Cnet must be done both at startup and when constraints are dynamically added to the system. Our design achieves placement through the use of a component registrar and a placement daemon, both shown in Figure 12. The placement daemon registers new Cnets and

invokes the reoptimizer as needed. The component registrar is a server that maintains input/output event information about each active entity in the environment, including both application and Cnet components. Precedent for such a server exists in the DataExchange/PBIO [5] format server. The format server is needed when a non-connected communication protocol such as ATM or UDP is used. PBIO always sends meta-information describing a message format before any messages of that type are sent. In a non-connected environment, a sender registers a format with the format server prior to sending any messages of that type; the receiver contacts the format server for a description of the message received so that it can decode the message. Our component registrar, which could be implemented as additional functionality on top of the DataExchange format server, requires that entities register themselves and at the same time provide information useful for constraint placement: input/output event types, and attributes/conditions.

It is reasonable to expect that the component registrar will be able to make many of its decisions based on event type only. But there are cases where multiple data sources supply the same event types; in these cases, more detailed information in the form of attributes/conditions is needed. We describe the placement algorithm in the context of the example in Figure 12. Placing cnet C:1 is straightforward because only process A exports the event type b, so query C:1 is placed on machine I. Placing C:2 requires additional information, since it requires event type a, which is exported by both A and B. By applying C:2's placement condition against the attribute list of A and then of B, one has further information with which to make a placement decision.

As an example, suppose A and B are part of a global simulation of the Earth's atmosphere, structured such that the atmosphere is partitioned into 37 horizontal layers, with A computing species transport for layers 0-17 and B simulating the remainder. The attributes stored at the component registrar for A and B would include starting and ending layer numbers. The condition associated with C:2 is then applied against the attributes; in Figure 12, the condition would be true for process A but not for B.

Another case is illustrated by the placement of C:3, which requires one event generated by A on host I and another by C:2 on II. In this case the component registrar does not have enough information, even with the attributes and conditions. The registrar has the option of sending a response to the placement daemon asking that C:3 be fragmented; the placement daemon can then invoke the reoptimizer to refragment the cnet and reissue an add command for the separate pieces.

Optimizing placement can be achieved by applying a paradigm from database query optimization, an approach being explored in joint work with Richard Snodgrass. The idea is this: if one can abstract away details about components (e.g., cnets, application components) until the component types have been reduced to just a handful, then it becomes possible to express the components and their relationships as an abstract syntax tree upon which cost-based optimizations can be applied.
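The placement decision can be sketched as a lookup over the registrar's contents. The registry entries mirror Figure 12; the condition used (end < 18) is a hypothetical stand-in for the condition elided in the figure:

```python
# Placement sketch: components register exported event types and
# attributes; a cnet is placed on the host of a component that exports
# all of its input types and satisfies its placement condition.
# Registry contents mirror Figure 12; the API is illustrative.

registry = {
    "A": {"host": "I",  "exports": {"a", "b"}, "attrs": {"start": 0,  "end": 17}},
    "B": {"host": "II", "exports": {"a", "c"}, "attrs": {"start": 18, "end": 36}},
}

def place(inputs, condition=None):
    """Hosts of registered components that export every needed event
    type and (when a condition is given) satisfy it."""
    hosts = []
    for info in registry.values():
        if not inputs <= info["exports"]:
            continue                      # component lacks a needed event type
        if condition is None or condition(info["attrs"]):
            hosts.append(info["host"])
    return hosts

# C:1 needs event type b; only A exports it, so C:1 lands on machine I.
print(place({"b"}))                       # ['I']
# C:2 needs a, exported by both A and B; its condition disambiguates.
print(place({"a"}, condition=lambda at: at["end"] < 18))
```

When the result is empty or ambiguous even with conditions, as for C:3, the daemon falls back to fragmenting the cnet as described above.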

6 Related Work

Afjeh et al. [1] propose a monitoring and control system for observing the progress of a parallel propulsion system simulation and for steering the simulation. Monitoring data is obtained from the simulation via one or more watchdog processes and then forwarded to the analysis component. Analysis is achieved through an expert system in which rule satisfaction is used to activate steering. Such an expert-system-based approach may not be appropriate for the detection of hazards in the continuous applications we are targeting, for several reasons. First, while Cnet detection and analysis is easily distributed, this is not possible with an expert-system-based approach. Second, rules in expert systems are typically designed to capture domain-specific knowledge, which is acquired through

significant setup times.

Brockmeyer and Jahanian [4] incorporate a monitoring and assertion-checking tool into the Modechart Toolset (MT). The monitoring and assertion checking is performed on trace data from symbolic executions of real-time specifications written in Modechart, the same language used for the specification of assertions. Assertions can include temporal properties, can be user supplied, or can be based on provided classes. The assertions are executed together with the original specification, generating an execution trace that highlights the violation of an assertion. This approach, although similar to ours in its support of temporal and complex specifications, cannot be directly applied to on-line hazard detection.

In Liu and Pu's work on continual queries [13], a client specifies a continual query over information stored in a distributed environment such as the Internet. The objective is to compute the query and return the entire resulting relation upon the first triggering only; on subsequent triggerings, only change information is returned. The primary objective of the work, to minimize the amount of information returned, is most effectively met when the returned relation is large. Contrast this with the primary goal of hazard detection: fast response in the event of constraint violation. Fast response requires query reevaluation upon every event data arrival. Because constraints are infrequently violated, there often is no returned relation, or if there is, it is a single event. The objectives of the two approaches are entirely different.

Snodgrass [21] develops an information-based approach to modeling program behavior that treats monitoring information (runtime data, states of processes, states of processors, messages, etc.) as relations. The program states and activities observed by the monitor are specified by a high-level query language called TQuel (considerably refined in [22]). Based, in part, on the relational approach originated by Snodgrass, Ogle et al.
explore application-dependent and on-line monitoring of parallel and distributed applications [16]. Kilpatrick and Schwan [9] further refine the language-based approach for specifying application-dependent monitoring requests; Kilpatrick also explores the idea of integrating all components of a parallel programming environment using the uniform Entity-Relationship information model. As evidenced by this earlier work, the relational model is an appropriate formalism for processing monitoring information. Prior applications of the formalism, however, were geared primarily toward performance evaluation. Whereas performance evaluation is oriented toward the collection and organization of information for what is often post-mortem analysis, hazard detection is oriented toward information filtering: an overwhelming percentage of the information received is discarded. This shift in focus results in improved performance over earlier work. In addition, prior approaches to monitoring were static; that is, they required that all constraints be known at compile time. Given the exploratory and 'what-if' potential of safety constraints, application of the relational model to hazard detection must be accompanied by adaptation techniques.

Leveson's work in the early 1980s [12] is an early recognition of the need for run-time checking for hazard prevention. Her synchronous approach, though, requires embedding all constraints in the application. This is not necessary in the Cnet architecture, where both synchronous and asynchronous checking may be done.

7 Conclusion and Future Work

Continuous safety-critical systems are becoming common in modern society. Growing along with their proliferation is an awareness that existing approaches, often based on formal techniques, are not enough to provide the required assurance of safety. Needed is a complementary detection approach. Our research meets the needs of continuous safety-critical systems by providing an external, language-based approach to hazard detection. The language-based approach, with its prescriptive

language, leaves much room for tailored optimizations, both static and statistical, or dynamic, in nature. The logically external approach encapsulates queries, facilitating both the distribution of analysis components and query reoptimization.

There are a number of promising ongoing and future research topics. Ongoing work includes distributing the analysis components to increase the scalability of the approach and to take advantage of available resources for analysis. We are also currently measuring the performance of the statistical optimization techniques and algorithms. Future work includes expanding the cost data and cost function heuristics to better utilize the dynamic flexibility built into the implementation, and applying the analysis tool to a virtual environment. For instance, the Iowa Driving Simulator (IDS) [11], a fully immersive ground-vehicle simulator, can place a driver in a highly realistic driving environment. Hazard conditions in such an environment are the same as in real life, but without the attendant risk of harm or loss.

References

[1] A. Afjeh, P. Homer, H. Lewandowski, J. Reed, and R. Schlichting. Development of an intelligent monitoring and control system for a heterogeneous numerical propulsion system simulation. In Proc. 28th Annual Simulation Symposium, Phoenix, AZ, April 1995.

[2] Tucker Balch and Ronald Arkin. Communication in reactive multiagent robot systems. Autonomous Robots, 1(1):27-52, 1994.

[3] Kenneth Birman. Keynote speech. 6th IEEE Int'l Symposium on High Performance Distributed Computing (HPDC-6), 1997.

[4] Monica Brockmeyer, Farnam Jahanian, Connie Heitmeyer, and Bruce Labaw. An approach to monitoring and assertion-checking of real-time specifications in Modechart. In Workshop on Parallel and Distributed Real-Time Systems, April 1996.

[5] Greg Eisenhauer, Beth (Plale) Schroeder, and Karsten Schwan. DataExchange: High performance communication in distributed laboratories. In Proceedings Ninth IASTED Int'l Conference on Parallel and Distributed Computing Systems, October 1997.

[6] Ramez Elmasri and Shamkant B. Navathe. Fundamentals of Database Systems. Addison-Wesley, 2nd edition, 1994.

[7] Weiming Gu, Greg Eisenhauer, Eileen Kraemer, Karsten Schwan, John Stasko, Jeffrey Vetter, and Nirupama Mallavarupu. Falcon: On-line monitoring and steering of large-scale parallel programs. In Proceedings of FRONTIERS'95, February 1995.

[8] Jeffrey K. Hollingsworth, Barton P. Miller, Marcelo J. R. Goncalves, Oscar Naim, Zhichen Xu, and Ling Zheng. MDL: A language and compiler for dynamic program instrumentation. In Proceedings International Conference on Parallel Architectures and Compilation Techniques, November 1997.

[9] Carol Kilpatrick, Karsten Schwan, and David Ogle. Using languages for capture, analysis and display of performance information for parallel and distributed applications. In Proceedings 1990 Int'l Conference on Programming Languages, 1990.

[10] Herman Kopetz and Paulo Verissimo. Real-time and dependability concepts. In Sape Mullender, editor, Distributed Systems. Addison-Wesley, 2nd edition, 1993.

[11] John Kuhl, Douglas Evans, Yiannis Papelis, Richard Romano, and Ginger Watson. The Iowa Driving Simulator: An immersive research environment. Computer, 28(7):35-41, July 1995.

[12] Nancy G. Leveson and Timothy J. Shimeall. Safety assertions for process-control systems. In Proceedings 13th Int'l Symposium on Fault Tolerant Computing, pages 236-240, June 1983.

[13] Ling Liu, Calton Pu, Roger Barga, and Tong Zhou. Differential evaluation of continual queries. Technical Report TR95-17, Department of Computer Science, University of Alberta, 1996.

[14] John McLean and Constance Heitmeyer. High assurance computer systems: A research agenda, safety track report. In America in the Age of Information, National Science and Technology Council Committee on Information and Communications Forum, 1995.

[15] Aloysius K. Mok and Guangtian Liu. Early detection of timing constraint violation at runtime. In Proceedings of the 18th Annual Real-Time Systems Symposium. IEEE Computer Society Press, 1997.

[16] David Ogle, Karsten Schwan, and Richard Snodgrass. Application-dependent dynamic monitoring of distributed and parallel systems. IEEE Transactions on Parallel and Distributed Systems, 4(7):762-778, July 1993.

[17] Patrick O'Neil. Database: Principles, Programming, Performance. Morgan Kaufmann, 1994.

[18] Beth Plale and Karsten Schwan. Cnet: A language-based approach to on-line constraint analysis. Technical Report GIT-CC-97-36, College of Computing, Georgia Institute of Technology, 1997. http://www.cc.gatech.edu/tech reports.

[19] Anuj Puri and Pravin Varaiya. Driving safely in smart cars. June 1995.

[20] Beth (Plale) Schroeder, Sudhir Aggarwal, and Karsten Schwan. Software approach to hazard detection using on-line analysis of safety constraints. In Proceedings 16th Symposium on Reliable and Distributed Systems, pages 80-87. IEEE Computer Society, October 1997.

[21] Richard Snodgrass. A relational approach to monitoring complex systems. ACM Transactions on Computer Systems, 6(2):157-196, May 1988.

[22] Richard T. Snodgrass, editor. The TSQL2 Temporal Query Language. Kluwer Academic Publishers, 1995.

[23] Richard T. Snodgrass, Michael H. Bohlen, Christian S. Jensen, and Andreas Steiner. Adding valid time to SQL/temporal. ISO/IEC JTC1/SC21/WG3 DBL MAD-146r2, November 1996.

[24] Jennifer Widom and Stefano Ceri, editors. Active Database Systems. Morgan Kaufmann, 1996.
