
The Simplex Architecture for Safe On-Line Control System Upgrades

D. Seto, B. Krogh, L. Sha, and A. Chutinan

Software Engineering Institute, Carnegie Mellon University, Pittsburgh, PA 15213
Dept. of Electrical and Computer Eng., Carnegie Mellon University, Pittsburgh, PA 15213
dseto, [email protected] [email protected], [email protected]

Abstract

In this paper, we describe the Simplex architecture, a real-time software technology that supports the safe, reliable introduction of control system upgrades while the system is running. We present the Simplex architecture from a control application point of view. In particular, we discuss the fault-tolerance feature of the technology, introduce its basic structure in control systems, and investigate the control issues that arise when the technology is employed. Application of the Simplex architecture is demonstrated for a plasma-enhanced chemical vapor deposition (PECVD) system, a standard process in semiconductor manufacturing. We conclude the paper with a discussion of the potential impact that the Simplex architecture can make on future control applications.

This research was supported in part by the Office of Naval Research, by the Software Engineering Institute of Carnegie Mellon University, by NSSN, by ISC, and JSF, and by Rockwell International.

1 Introduction

Many large-scale systems, such as defense systems, flight control systems, and process control systems, to name a few, are now controlled by computers or computer networks. With computing technology and sophisticated control methods advancing so rapidly, there are strong demands for control systems to adopt changes. Additionally, to lower cost and maintain continuous technical support from vendors, more and more COTS (commercial off-the-shelf) components are being used, which will inevitably require upgrades. To introduce new technology and functionality into complex systems, incremental evolution is usually more desirable than upgrading the entire system. Therefore, upgradability and evolvability will become essential features for improving complex computer-controlled systems.

One of the most attractive features of computer-controlled systems should be the ease with which they can be modified to incorporate improvements and new capabilities. This advantage is hardly seen in present computer control systems, however. So-called "legacy control systems" persist in being used because of the difficulties involved in control system upgrades. Obstacles to modifying legacy control systems include:

- It is not clear how the modifications may affect the legacy code. In many legacy systems, users may know only how the code works as it is, but lack knowledge of the details of the software. Even worse, it may not be possible to get technical support for the legacy code.
- Newly developed software often does not have a realistic simulation environment for testing. When the real system is large and complex, duplication of the system for software testing is often not possible.
- Even when new control software can be tested off-line, it is impossible to test the code under all conditions that may arise in the operation of the real system. Some problems only crop up in the real environment. For example, in real-time control systems, timing problems often occur only in the real-time implementation.
- It may be very time consuming to shut down the system, install the upgrade, and restart the system to test it. This is unacceptable to companies with tight production schedules: it costs too much to lose production time. Even in a laboratory environment, system shut-down and start-up can be prohibitively time consuming.

In summary, because of these obstacles, system upgrades may not be implemented despite the potential improvement in the system, because they may result in reduced reliability and availability. Therefore, how to maintain system reliability and availability is the key issue in upgrading control systems. It would be desirable to be able to make the software changes in a safe and reliable fashion while the system is running. The Simplex architecture [1, 4, 6, 7], a real-time software technology developed at the Carnegie Mellon University Software Engineering Institute, is designed for this purpose.

The Simplex architecture is built upon the fundamental concept of analytic redundancy, which refers to redundant components in a system with dissimilar specifications, designs, and implementations, but some analytically related function. The application of analytic redundancy for software fault tolerance in control systems was introduced by Bodson et al. in [1]. We give a brief review here.

In computer-controlled systems, a controller is a software implementation of a control algorithm. A controller is said to be reliable if it controls the physical system correctly with respect to a set of specifications within a known set of operating conditions. An upgrade to a reliable controller is a new controller intended to provide the same function as the reliable controller with perhaps some additional features or improved performance. In the control system upgrade scenario, we refer to the reliable controller whose reliability has been proved through extensive use as the baseline controller, and the upgrade controller, which introduces new but untested features, is called the experimental controller. The baseline controller and experimental controller are analytically redundant in the sense that both of them will control the physical system to satisfy certain common requirements.

In the Simplex architecture, the baseline controller and the experimental controller run simultaneously, with the experimental controller actually controlling the system initially. The controller outputs and the state of the physical system are monitored as the system runs. If the experimental controller generates an output that is outside acceptable limits, or drives the system to an undesirable state, either of which signifies possible errors in the controller, it will be disabled and the baseline controller will take over. While the baseline controller is controlling the system, the experimental controller can be investigated and fixed, and then re-installed to take back control. This process can be repeated until the reliability of the experimental controller reaches the same level as that of the baseline controller. By using the baseline controller to protect the system while the experimental controller explores the upgrade features and improves its reliability, high reliability and high availability can be achieved during the system upgrade without shutting down the system. The Simplex architecture is the real-time environment that supports this capability. General discussion of the Simplex architecture as a software technology can be found in [4, 6, 7].
In this paper, we describe the details of the Simplex implementation. We present the basic structure of the Simplex architecture as it is employed in control systems with a single processor. Based on this structure, we discuss the different types of controller faults that the Simplex architecture can handle as well as the approaches used to detect the faults. A control switching logic is established to govern the selection of the appropriate controller to control the physical system. By following the switching rules, the system is guaranteed to be able to recover from the problems caused by a faulty controller, and therefore the overall system is protected from failure caused by faulty upgrades. This is demonstrated by an application to the Carnegie Mellon plasma-enhanced chemical vapor deposition (PECVD) system.

The paper is organized as follows. In Section 2, we introduce the basic structure of the Simplex architecture and describe the types of faults it can handle. In Section 3, we describe the fault detection mechanism based on the trajectories of the physical system in its state space, and derive the control switching logic that governs which controller is chosen to control the physical system at each sampling period. In Section 4, we present the PECVD system to illustrate the complete control system upgrade method using the Simplex. The concluding section summarizes the current status of the Simplex architecture and its implications for control systems of the future.

2 Structure and Operation of the Simplex Architecture

From the perspective of control application developers, the Simplex architecture is a collection of on-line software modification tools, real-time process management and communication services, and the fault-tolerant "middleware" that lies between the application code and the real-time operating system. Multiple real-time processes are scheduled to run simultaneously by the Rate Monotonic Algorithm (RMA) [5]. Each of the processes is a runtime instance of a software module and can be implemented as a replacement unit, the basic building block of the Simplex architecture. Replacement units are designed so that the unit processes and their connections to the other processes can be added and deleted on-line. Such dynamic binding makes it possible to replace software modules at runtime, and therefore enables runtime insertion and removal of the experimental controller, which is implemented as a replacement unit process. Rate monotonic scheduling guarantees that a software module can be replaced or modified in real time while other processes still meet their deadlines [5]. Inter-process communication is realized by the real-time publisher and subscriber facility [3], which enables the processes to dynamically publish and subscribe to the information they need from each other. The key feature of the architecture is that it allows the upgrades, which are implemented as software modules, to be inserted into or removed from the overall system on-line.

2.1 Basic Simplex Components

When the Simplex architecture is employed in control systems, its basic structure contains a user interface module, a decision module, an I/O module, and three controllers: a baseline controller, a safety controller, and an experimental controller. As described above, the experimental controller (EC) contains the system upgrade to be tested. For instance, it might be an advanced control algorithm that would further improve the system performance, but whose reliability is uncertain. The baseline controller (BC) controls the system reliably in some known domain in the state space of the physical system. Because the enhanced features of the experimental controller may drive the physical system into a larger domain of the state space than the one that the baseline controller can handle, we introduce yet another controller, the safety controller (SC). It is a highly reliable controller that, in a known domain of the physical system state space (larger than the domain for the baseline controller), is certain to drive the physical system back to the operating domain of the baseline controller.

Although the safety controller can stabilize the physical system in a larger domain of the state space, it may not be able to achieve the control objective of the system. On the other hand, the baseline controller is designed to achieve the control objectives, but must operate within a smaller domain of the state space. Therefore, the safety controller and the baseline controller are complementary to each other in terms of the states they can handle and the control goals they can reach: the baseline controller will improve the system performance, in the sense of achieving the control objective, if the safety controller can drive the physical system to a state from which the baseline controller can take over.

The decision module is the central part of the architecture. It evaluates the performance of the physical system and selects a control command from the controllers at each sampling period. Although all three controllers may be running and generating commands simultaneously, only one of the commands will be chosen to be sent to the physical system. The controller whose command is used to control the physical system is called the active controller. The I/O module reads the measurements from the physical system and distributes them to all other processes, and receives the control command from the decision module and sends it to the physical system. The user interface module gets commands from the user and processes them, initializes the replacement unit management, and manages the processes. Fig. 1 shows the complete basic structure of the Simplex architecture, with the arrows indicating the directions of data flow.

[Figure 1: Basic structure of the Simplex architecture for safe on-line control system upgrades. Block labels: Safety Control Module, User Interface and Update Management, Decision Module, I/O Module, Baseline Control Module, Experimental Control Module, Physical System.]

The Simplex architecture is usually configured in two parts, a high assurance kernel and the application controllers. The former consists of the user interface, I/O module, decision module, and a safety controller, while the latter includes a baseline controller and an experimental controller. The high assurance kernel is certified to be highly reliable and its modules are not replaceable. The application controllers, on the other hand, provide the chance to improve the system performance, and they are implemented as replaceable units.

The overall system operates as follows. The I/O process, driven by a timer, samples the physical system at a prescribed sampling rate. At each sample, it acquires the sensor data and distributes them to the other processes. Upon receiving the data from the I/O process, the decision process starts to evaluate the state of the physical system and waits for the results from the control processes. Based on the evaluation, it selects one of the control commands to send to the I/O process, which then sends the command to the physical system. The control processes begin their computation of control commands as soon as they receive the data from the I/O process, and send their results to the decision process. All the computation and analysis in the decision and control processes must be done within the sampling period, and they are all event driven by the arrival of the data from the I/O process.

What we described above is the most basic structure of the Simplex architecture for control system upgrades. By supporting a real-time network communication protocol, the Simplex architecture can also be implemented on an analytically redundant multi-machine network. As illustrated in Fig. 2, in the multi-machine implementation each processor in the network runs a copy of the Simplex, but the processors may have different functionality and reliability. For example, one of the processors may run only the safety controller, with certified reliability in terms of control software and hardware. Application controllers may be implemented in the other processors, referred to as Fault Tolerant Processors (FTP), with analytical redundancy. In the multi-machine implementation, the Simplex architecture provides protection against hardware failures as well as software failures in the FTPs. If the controller on a given FTP is the active controller and a software or hardware failure occurs in that FTP, active control is switched to the safety controller or to one of the controllers on another FTP. The remainder of the paper will focus only on the single-machine implementation of the Simplex architecture.
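The data flow of one sampling period in the single-machine case can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the real architecture runs the modules as concurrent real-time processes, whereas here the controllers are called one after another; the names `collect_commands`, `select_command`, `run_cycle`, the three toy control laws, and the state limit are all hypothetical; and the selection rule is a simplified stand-in, not the switching logic derived in Section 3.

```python
from typing import Callable, Dict, Optional, Tuple

Controller = Callable[[float], float]  # measurement -> control command


def collect_commands(measurement: float,
                     controllers: Dict[str, Controller]) -> Dict[str, Optional[float]]:
    """Distribute one measurement to every controller and gather the results.

    A controller that raises is recorded as None, i.e. "no output this period".
    """
    commands: Dict[str, Optional[float]] = {}
    for name, controller in controllers.items():
        try:
            commands[name] = controller(measurement)
        except Exception:
            commands[name] = None
    return commands


def select_command(measurement: float,
                   commands: Dict[str, Optional[float]],
                   limit: float = 1.0) -> Tuple[str, Optional[float]]:
    """Toy decision rule (an assumption, not the paper's switching logic):
    prefer the experimental command, then the baseline command, but only
    while the measured state is within the limit; otherwise use safety."""
    if abs(measurement) <= limit:
        for name in ("experimental", "baseline"):
            if commands.get(name) is not None:
                return name, commands[name]
    return "safety", commands["safety"]


def run_cycle(measurement: float,
              controllers: Dict[str, Controller]) -> Tuple[str, Optional[float]]:
    """One sampling period: sample, compute all commands, pick one to send."""
    return select_command(measurement, collect_commands(measurement, controllers))


def faulty_experimental(x: float) -> float:
    return 1.0 / (x - x)  # deliberate bug: division by zero


controllers = {
    "safety": lambda x: -2.0 * x,
    "baseline": lambda x: -1.0 * x,
    "experimental": faulty_experimental,
}
```

With the faulty experimental controller installed, `run_cycle(0.5, controllers)` falls back to the baseline command, and `run_cycle(2.0, controllers)` selects the safety command because the measurement is outside the assumed limit.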

[Figure 2: A multi-machine networked implementation of the Simplex architecture. Block labels: FTP 1 through FTP N (BC, EC, Decision, I/O), NETWORK, SC, Decision, I/O, RIC, Physical System.]

2.2 Faults in Real-Time Control Software

While the Simplex architecture can be configured to handle component failures, such as operating system (OS) failure or hardware failure, we shall concentrate only on software faults in this paper. The software faults that the Simplex is able to detect and tolerate can be classified into two types, timing faults and semantic faults.

A timing fault is an application-level fault from a real-time system perspective. As the control processes implement the control algorithms, each of them has a deadline to meet, usually determined by the sampling rate of the control system. If a process has not finished the computation of its control command by the end of the sampling period, we say the process has missed its deadline. In this case, no output from that control process is available within the specified time period. While the choice of sampling rate is a design decision that needs to be made carefully so that all the computation finishes within the sampling period, mistakes could be made when the upgrade is introduced. Timing faults can also occur when the controller contains faulty code, such as a divide by zero or an infinite loop. The Simplex applies the rate monotonic algorithm to schedule the runtime processes [5]. This allows processes to be dynamically added to, or removed from, the OS run queue. The scheduling algorithm itself guarantees that each process completes its computation before its deadline in the absence of faults. By assigning the highest priority among all control processes to the safety controller, it is certain that there is at least one fault-free control command available to control the physical system.

Semantic faults are application-domain specific. They occur when a controller generates commands that lead to a violation of the control specifications for the physical system. Various problems could result in this type of fault, e.g., bad gains for a linear controller, uninitialized variables, or improper updating of internal states. Detection of semantic faults is based on an evaluation of the physical system behavior, i.e., the trajectories of the physical system in its state space. Constraints on the physical system operation, such as limits on control inputs and state variables, determine a region in the state space of the physical system that we call the safety region (to be defined precisely in the next section). The safety region has the property that all the states inside the region satisfy the physical restrictions on the system state, and that, starting from any state inside the region, the trajectories of the physical system can be kept within the region by the available safety and baseline controllers. A semantic fault is detected when the trajectory of the physical system under the experimental controller has the potential of leaving the safety region.

In addition to tolerating timing faults and semantic faults, it is also important to prevent faults in the experimental controller from causing problems in the other system modules. Run-time error containment is realized via process address space protection. As the control algorithms are implemented in processes, each process runs in its own address space. The address space protection is provided by the underlying OS through memory partition management. Such protection guarantees that any fault occurring in one process will be contained in that process and will have no effect on the other parts of the software.
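The fault-free schedulability claim for rate monotonic scheduling can be illustrated with the classical utilization bound for n periodic tasks, n(2^(1/n) - 1). The text cites [5] for the algorithm but does not state this bound, so the sketch below brings in a standard result from the scheduling literature as an assumption; the function names and the task sets are hypothetical, and the test is sufficient but not necessary (a task set that fails it may still be schedulable).

```python
from typing import List, Tuple

Task = Tuple[float, float]  # (worst-case execution time, period), same time unit


def rm_utilization_bound(n: int) -> float:
    """Classical rate monotonic utilization bound for n periodic tasks."""
    return n * (2.0 ** (1.0 / n) - 1.0)


def total_utilization(tasks: List[Task]) -> float:
    """Fraction of processor time demanded by the task set."""
    return sum(exec_time / period for exec_time, period in tasks)


def rm_schedulable(tasks: List[Task]) -> bool:
    """Sufficient (not necessary) test: utilization at or below the bound."""
    return total_utilization(tasks) <= rm_utilization_bound(len(tasks))


def rm_priority_order(tasks: List[Task]) -> List[Task]:
    """Rate monotonic priorities: the shorter the period, the higher the priority."""
    return sorted(tasks, key=lambda task: task[1])
```

For example, three hypothetical processes with execution times 1, 2, and 5 and periods 10, 20, and 50 have a total utilization of about 0.3 and pass the test, while two processes that each need 5 time units every 10 do not.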

3 The Simplex Switching Logic

In this section we present a model of the Simplex switching logic for protection against timing and semantic faults. The former are detected by checking whether the controllers meet their deadlines, and the latter are detected by applying limits to the controller outputs and measured physical system state variables based on an abstraction of the physical system behavior.

3.1 Abstraction of the Continuous Dynamics

Consider a physical system described by the state equation

$$\dot{x}(t) = f(x(t), u(x(t), t)) \qquad (1)$$

with state constraints:

$$q_1(x) \le 0, \ldots, q_l(x) \le 0, \quad l < n \qquad (2)$$

and control constraints:

$$p_1(u) \le 0, \ldots, p_r(u) \le 0, \quad r < m \qquad (3)$$

where $x(\cdot) \in \mathbb{R}^n$ and $u(\cdot) \in \mathbb{R}^m$ are the state and control input of the physical system, respectively. The control law $u(\cdot)$ can be either open-loop or closed-loop. The state and control constraints together give the physical constraints on the physical system, which are usually treated as hard constraints. The physical constraints reflect operating limits for physical devices or other considerations such as safety or lack of sufficient knowledge to operate the physical system outside of these boundaries. Soft constraints may also exist, reflecting regions within which certain desired levels of control performance can be maintained. Violations of these performance-related limits do not necessarily threaten the safety or viability of the physical system, however. In this paper we focus on the application of the Simplex architecture to provide protection against semantic faults related to the hard physical constraints.

To provide protection against semantic faults, it is necessary to identify a region in the state space where the safety controller can control the system without violating the physical constraints. It is also necessary to identify the region within which the baseline controller can be applied without violating the physical constraints, so that control can be switched from the safety controller to the baseline controller at the appropriate time. These regions are defined as follows.

Let $F \subset \mathbb{R}^n$ denote the set of admissible states satisfying the state constraints (2), i.e.,

$$F = \{x : q_1(x) \le 0, \ldots, q_l(x) \le 0\},$$

and let $U \subset \mathbb{R}^m$ denote the set of admissible controls satisfying the control constraints (3), i.e.,

$$U = \{u : p_1(u) \le 0, \ldots, p_r(u) \le 0\}.$$

We define an operational region (OR) for a given control law $u$, which takes values from $U$, to be a subset $O_u \subset F$ such that if $u$ is applied starting from any state in $O_u$, the resulting trajectory for the physical system remains in $O_u$ and satisfies the control objective for $u$. By characterizing the ORs for different control laws, we would like to establish a control switching logic based on which ORs the trajectory of the physical system falls in. The OR just defined, however, may not be sufficient for this purpose in digital control, where a control delay of one sampling period is inevitable. To take the sampling period into account, we define a restricted operational region (ROR) as follows. Let $T$ be the sampling period of the system and let $\phi_u(t_0, x_0, t)$ be the solution of (1) at $t > t_0$ with $u$ taking values from $U$ and $(t_0, x_0)$ the initial condition. Then a restricted operational region $R_u$ of the control law $u$ is defined as a subset of $O_u$ such that for all $x \in R_u$, $\phi_v(t_0, x, t_0 + T) \in O_u$ for all $t_0 > 0$ and all $v \in U$. Figure 3 illustrates the concept of ORs and RORs for various control laws. Using the above concepts, the philosophy of the Simplex switching logic can now be described.
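The restricted operational region can be made concrete with a scalar example. The sketch below is an illustration only: the plant dx/dt = A*x + v, the control bound, the interval taken as the operational region, and the sampling period are all hypothetical values chosen for this example, and the control is assumed to be held constant over each sampling period. A state belongs to the ROR if it lies in the OR and every admissible constant control leaves it inside the OR one sampling period later.

```python
import math

A = 1.0       # hypothetical unstable plant: dx/dt = A*x + v
V_MAX = 1.0   # admissible controls: |v| <= V_MAX
R_OR = 1.0    # assumed operational region O_u = [-R_OR, R_OR]
T = 0.1       # sampling period


def state_after_period(x: float, v: float, period: float = T) -> float:
    """Exact solution of dx/dt = A*x + v after one period with constant v."""
    growth = math.exp(A * period)
    return growth * x + (growth - 1.0) / A * v


def in_ror(x: float) -> bool:
    """True if x is in the OR and stays in it after one period for every admissible v."""
    if abs(x) > R_OR:
        return False
    # The end state is affine in v, so checking the two extreme controls suffices.
    return all(abs(state_after_period(x, v)) <= R_OR for v in (-V_MAX, V_MAX))


def ror_radius() -> float:
    """Closed-form half-width of the ROR interval for this scalar example."""
    growth = math.exp(A * T)
    return (R_OR - (growth - 1.0) / A * V_MAX) / growth
```

For these values the ROR is an interval of half-width roughly 0.81, strictly inside the OR: the margin between the two is the distance the worst admissible control can push the state in one sampling period.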

[Figure 3: Illustrations of controllers' (restricted) operational region and safety region. Labels: restricted operational region for tracking, restricted operational region for stabilization, $R_{u_s}$, OR of safety controller, safety region, $S_{u_s}$, recoverable safety region.]

The safety controller will be designed with the control objective of keeping the physical system from violating the physical constraints and of bringing the system back to a state where the baseline controller can be used. Specifically, let $u_s$ and $u_b$ be the safety control law and baseline control law, respectively. Then the safety region for the physical system is defined as the ROR of $u_s$, denoted $R_{u_s}$. If all the trajectories of the physical system starting in $R_{u_s}$ can be driven by $u_s$ to a subset $S_{u_s} \subset R_{u_s}$, then the safety region is said to be recoverable to $S_{u_s}$. Since the objective of the safety controller is to return the system to a state from which the baseline controller can take over, it is necessary that $S_{u_s}$ be contained in the ROR for the baseline controller, that is, $S_{u_s} \subset R_{u_b}$. The Simplex switching strategy can then be described as follows: monitor the state of the physical system when the experimental controller is active; if the state reaches the boundary of $R_{u_s}$, switch to $u_s$; then switch to $u_b$ when the state reaches $R_{u_b}$. This switching sequence is illustrated in Fig. 4. If this strategy can be implemented, a given state of the physical system is said to be safe if it is inside $R_{u_s}$. Otherwise it is unsafe.

The abstraction of the continuous dynamics described above specifies two regions in the state space of the physical system, which correspond to different attributes of the trajectory. First, the safety region implies that all the trajectories starting within it will remain inside the region of admissible states and that the corresponding control will comply with the control constraints. When the trajectory of the physical system, driven by an application controller, is about to leave the safety region, the safety controller takes over control. Second, the subset of the safety region to which the safety region is recoverable makes the improvement of the system performance possible. In fact, if a restricted operational region of the baseline controller, $R_{u_b}$, is prescribed, and the safety controller is designed such that the safety region is recoverable to $R_{u_b}$, then the baseline controller can be made active once the safety controller steers the system trajectory into $R_{u_b}$. In the sense of providing fine tuning with respect to the control goal, the baseline controller is complementary to the safety controller.

The control algorithms for the various controllers in the Simplex may have different performance specifications. While the application controllers are usually designed to achieve the given control objectives, the safety controller may not need to fulfill any particular requirement other than to maintain the boundedness of the trajectories of the physical system. In other words, an application controller emphasizes reaching the control goal of the system, whereas the safety controller focuses on providing safety protection to the system. This is the fundamental difference between the two types of controllers in the Simplex. For the baseline controller, a prescribed operational region is required if it is desired to switch from the safety controller to the baseline controller. To make the Simplex work efficiently in control system upgrades, the safety controller should be designed so that the safety region is large enough to cover the trajectories of interest. It is this design that makes it possible for the experimental controller to explore new features despite the risk of faults.
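The switching strategy (experimental controller until the state leaves the safety region, then the safety controller, then the baseline controller once the state is inside the baseline ROR) can be tried out on a toy scalar plant. Everything numeric in the sketch below is an assumption made for illustration: the plant dx/dt = x + u with the control held constant over each sampling period, the control bound, the two region radii, the three control laws, and the deliberately faulty experimental law that pushes the state outward.

```python
import math

T = 0.1                  # sampling period
GROWTH = math.exp(T)     # exact one-period growth for dx/dt = x + u
GAIN = GROWTH - 1.0      # exact one-period effect of a constant control
U_MAX = 1.0              # control constraint |u| <= U_MAX
R_SAFE = 0.8             # assumed safety region R_us = [-0.8, 0.8]
R_BASE = 0.3             # assumed ROR of the baseline controller


def clip(u: float) -> float:
    """Enforce the control constraint."""
    return max(-U_MAX, min(U_MAX, u))


CONTROLLERS = {
    "EC": lambda x: 0.5,              # faulty upgrade: constant outward push
    "SC": lambda x: clip(-3.0 * x),   # safety controller: strong pull to the origin
    "BC": lambda x: clip(-2.0 * x),   # baseline controller: gentler regulation
}


def simulate(x0: float = 0.1, steps: int = 60):
    """Return a list of (active controller, state) pairs, one per sampling period."""
    x, mode, history = x0, "EC", []
    for _ in range(steps):
        if mode == "EC" and abs(x) >= R_SAFE:
            mode = "SC"    # state left the safety region: safety controller takes over
        elif mode == "SC" and abs(x) <= R_BASE:
            mode = "BC"    # state recovered into the baseline ROR
        history.append((mode, x))
        x = GROWTH * x + GAIN * CONTROLLERS[mode](x)
    return history
```

Running `simulate()` shows the faulty experimental controller driving the state outward, the safety controller pulling it back once it crosses the assumed safety boundary, and the baseline controller then regulating it toward the origin, with the state staying inside [-1, 1] throughout.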

3.2 Control Switching Logic In the Simplex basic structure, the behavior of the physical system is governed by the controller selected by the decision module, in which a control switching logic is established for safety protection and performance re nement. After the upgrade controller, implemented as the experimental controller, is integrated into the control system, it will be in charge of controlling the physical system. If the experimental controller drives the system safely, i.e., does not drive the system trajectories out of the safety region, and the control goal, stabilizing at an equilibrium state in this case, is being achieved, it will remain in control. If the experimental control contains a bug which results in the system trajectory going out of the safety region, the safety controller will take over the control as this happens, and remain in control until the system trajectory enters the operational region of the baseline controller. Then the baseline controller becomes active. In addition to the trajectories of the physical system, the user interface provides a way to manually a ect the selection of the active controller by changing the availability of the application controllers. Conceptually, the user's command, the availability of the application controllers, the physical system trajectories, and the control switching logic are the key blocks in the Simplex to determine the active controller. Fig. 5 shows the ow of in uence among them. While the structural relation and the conceptual in uence among the blocks remain the same 11

initial state

switch to safety controller recovered trajectory switch to baseline controller

X safe trajectory equilibrium state

ROR of experimental controller (not known) Safety Region ROR of baseline controller

Figure 4: Possible control switchings.

User’s command

Availability of the application controllers

Control switching logic

Trajectories of the physical system

Figure 5: The ow of in uence among the key blocks of the Simplex architecture. from application to application, the detailed speci cation in each block and how to tolerate faults when they are detected are application dependent. In this paper, we have the following speci cations: (1) As an application controller is active and a fault is detected, the safety controller will take over as the active controller; (2) When an application controller changes from active to inactive because of a fault it contains, its output will be disabled until the user re-enables it. (3) As the safety controller is active and the state of the physical system falls in the operational region of the baseline controller, the active controller switches to the baseline controller. (4) If both the experimental controller and the baseline controller are running with valid control commands, the experimental controller will be selected as the active controller. The availability of the application controller directly in uences the selection of the active controllers. We de ne three discrete states for each application controller, ENABLED, DISABLED, and TERMINATED, which respectively represent that the controller is running and its output can be chosen to be sent to the physical system; the controller is running but its output is disabled; and the controller is destroyed. When a controller is destroyed, all of the resources it has been allocated are released. The state transition of an application controller depends on the user's commands and 12

DISABLE or A_TO_NA

ENABLED DESTROY ENABLE CREATE

DISABLED

TERMINATED DESTROY

Figure 6: State transition diagram of an application controller. if the controller changes from active to inactive. In particular, the events which may cause a change of state of an application controller can be summarized in the set CREATE, DESTROY, EANBLE, DISABLE, A TO NAg

f

where CREATE/DESTROY are user commands to start/terminate the processes in which the controller is implemented, ENABLE/DISABLE are user commands to enable/disable the controller's output, and A_TO_NA is the event that occurs when the controller changes from active to inactive. A state transition diagram, reflecting only the changes of a controller's discrete state, is given in Fig. 6.

The control switching logic is designed to tolerate timing faults and semantic faults. A timing fault from an application controller is detected if the controller misses its computation deadline. Therefore, by monitoring the messages received in the decision module, a timing fault is detected if no message is available from the controller. In addition to timing faults, the state of the application controller must be checked to determine whether there is a valid control command from that controller. We therefore define a boolean variable bc_ready (ec_ready) for the baseline controller (experimental controller) which is TRUE when the controller has a valid output before the deadline and its state is ENABLED. A semantic fault is detected by checking whether the control value is valid and the state of the physical system is safe, i.e., whether the state is inside the safety region. Define a boolean variable safe whose value is TRUE when the physical system is safe. By design, the physical system under the safety controller will always be safe. Since a switch from the safety controller to the baseline controller is required, we define a boolean variable in_ORBC which is TRUE when the state of the physical system is inside the operational region of the baseline controller.

The control switching logic can be established as an assignment policy of the active controller. To this end, we define the active controller state to be in the set

{BASELINE, EXPERIMENTAL, SAFETY}

to represent the controller which is in control of the system. The state transition of the active controller is determined by the values of the boolean variables bc_ready, ec_ready, safe, and in_ORBC. Fig. 7 shows the state transitions of the active controller, which occur when the boolean expressions on the transition arcs are TRUE.

Figure 7: The state transition diagram of the active controller. States: SAFETY, BASELINE, EXPERIMENTAL. Arc labels: (!safe) or (!bc_ready) & (!ec_ready); (!safe) or (!ec_ready); ec_ready; in_ORBC & bc_ready & (!ec_ready); safe & ec_ready.

We have now completely established the control switching logic to determine the active controller. Implementation of this logic amounts to coding the state transition diagrams in Fig. 6 and Fig. 7. To summarize, the control switching logic is an assignment policy for the active controller, which is completely determined by two state transition diagrams. The changes from application to application affect only the details of the state transition diagrams, such as the abstraction of the dynamics of the physical system and/or the number of application controllers. In the next section, we present a real-world example to demonstrate what has been developed.
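As an illustration of what coding the active-controller diagram might look like, the following Python sketch implements one possible reading of Fig. 7. The boolean names follow the text; the function and enum names are hypothetical. The source and target state of each arc are an assumption, since only the arc labels are given here, and the arc labeled only ec_ready is left out because its endpoints are not clear. The expression (!safe) or (!bc_ready) & (!ec_ready) is read with & binding tighter than or, which is also an assumption.

```python
from enum import Enum


class Active(Enum):
    BASELINE = "BASELINE"
    EXPERIMENTAL = "EXPERIMENTAL"
    SAFETY = "SAFETY"


def controller_ready(output_before_deadline: bool, enabled: bool) -> bool:
    """bc_ready / ec_ready: valid output before the deadline and state ENABLED."""
    return output_before_deadline and enabled


def next_active(active: Active, safe: bool, bc_ready: bool,
                ec_ready: bool, in_orbc: bool) -> Active:
    """Return the active controller for the next control period.

    The assignment of arc labels to state pairs is assumed, not taken
    from the diagram itself.
    """
    if active is Active.EXPERIMENTAL:
        # Assumed arc EXPERIMENTAL -> SAFETY: (!safe) or (!ec_ready)
        if (not safe) or (not ec_ready):
            return Active.SAFETY
        return Active.EXPERIMENTAL
    if active is Active.BASELINE:
        # Assumed arc BASELINE -> SAFETY: (!safe) or (!bc_ready) & (!ec_ready)
        if (not safe) or ((not bc_ready) and (not ec_ready)):
            return Active.SAFETY
        # Assumed arc BASELINE -> EXPERIMENTAL: safe & ec_ready
        if safe and ec_ready:
            return Active.EXPERIMENTAL
        return Active.BASELINE
    # Assumed arc SAFETY -> BASELINE: in_ORBC & bc_ready & (!ec_ready)
    if in_orbc and bc_ready and (not ec_ready):
        return Active.BASELINE
    return Active.SAFETY
```

Calling next_active once per control period with freshly evaluated flags keeps the decision logic free of any controller-specific detail, which matches the observation that only the details of the diagrams change between applications.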

4 Simplex Application: Plasma Enhanced Chemical Vapor Deposition

Plasma enhanced chemical vapor deposition (PECVD) is a standard unit process for the deposition of insulation and passivation layers in microelectronic circuits. A custom PECVD reactor was constructed at Carnegie Mellon University to study the use of in situ sensing and control of plasma species concentrations during the deposition of silicon nitride. As illustrated in Fig. 8, the CMU PECVD reactor has an integrated quadrupole mass spectrometer which directly samples the species in the plasma. The computer-controlled system and instrumentation for the PECVD reactor are illustrated in Fig. 9. This PC-based system regulates the concentrations of disilane and triaminosilane and the plasma DC bias by changing the set points for the silane gas mass flow, RF power, and total reactor pressure as process inputs. Linear-quadratic Gaussian (LQG) optimal control was used to design the original multivariable controller. Results from the research on in situ sensor-based control using this system can be found in [2]. Here we present the application of the Simplex architecture to this system to support the implementation and testing of a new model-predictive control (MPC) algorithm [8].

Figure 8: PECVD reactor with integrated quadrupole mass spectrometer. Labeled parts: gas inlet, 250 µm aperture, 0.40 cm aperture, line to main pump, two turbopumps, quadrupole mass spectrometer.

Figure 9: PECVD reactor instrumentation and computer control system. Labeled parts: PC running LabWindows/CVI; PC connections (IEEE 488, RS232, analog line, digital line); multigas controller; NH3 and SiH4 mass flow controllers; pressure controller; temperature controller; RF power supply and matching network; valve and pump switches; ppt interface; ethernet (campus network); plasma reactor and instrumentation.

In modifying the control code for the PECVD system to implement the MPC algorithm we experienced many of the obstacles one encounters in the development and maintenance of real-time control software: the existing LQG control code was embedded in the "legacy code" that had evolved over the course of the project; the only test bed for the real-time implementation was the actual PECVD reactor, a one-of-a-kind system; experiments were time consuming because system startup required more than one-half hour to let the QMS warm up, and cleaning the chamber between runs could take a couple of hours; and safety was a serious matter, since silane is a highly explosive gas and bugs in the control code could therefore lead to dangerous conditions if the system was not carefully monitored.

Figure 10: Simplex computer to support control algorithm modifications for the PECVD system. Labeled parts: PECVD reactor; LabWindows/CVI PC; Simplex PC; ethernet link carrying controller data; signals: mass flows, RF power, temperature, etc.; plasma species, DC bias.

To apply the Simplex architecture to the PECVD system, the computer control system was modified as illustrated in Fig. 10. The control portion of the code was removed from the LabWindows/CVI system and ported to a computer running Simplex. The user instrumentation interfaces remained intact on the original PC. An ethernet connection supported the exchange of data between the LabWindows/CVI PC and the Simplex PC.

The LQG controller was used as the baseline controller for the Simplex implementation. It was well tested and controlled the PECVD reactor reliably. The safety controller was designed to bring the process to its initial nominal operating point by driving the process inputs along smooth exponential trajectories from their initial values (when the safety controller is turned on) to their nominal values. Switching rules were implemented based on limits on the process inputs being computed by the control algorithm (silane mass flow, RF power, pressure), and limits on the allowable values for the measured process variables. These limits were selected to reflect equipment and process operating constraints. When any of these limits was violated under the experimental controller, Simplex would switch control to the safety controller until the process outputs had settled to their nominal operating values. Then, the baseline controller would be switched in to drive the process back to the desired operating set points.

The Simplex architecture made it possible to install and test the new MPC software without having to worry about the process going out of control due to errors in the MPC code. This was extremely useful because of the complexity of the MPC algorithm, which includes the solution of a constrained nonlinear optimization problem during each control period.
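The safety controller and the limit-based switching rules described above can be sketched as follows. This is only an illustration under stated assumptions: the first-order form u(t) = u_nom + (u_0 - u_nom) exp(-t / tau), the time constant tau_s, the function names, and the variable names and numeric values in the usage lines are all hypothetical, since the text specifies only "smooth exponential trajectories" and does not list the actual limits.

```python
import math


def safety_input(u_start: float, u_nominal: float,
                 elapsed_s: float, tau_s: float) -> float:
    """Process input along an exponential path from u_start to u_nominal.

    elapsed_s is the time since the safety controller was turned on;
    tau_s is an assumed time constant (not given in the paper).
    """
    return u_nominal + (u_start - u_nominal) * math.exp(-elapsed_s / tau_s)


def within_limits(values: dict, limits: dict) -> bool:
    """True if every limited variable lies inside its [low, high] range."""
    return all(low <= values[name] <= high
               for name, (low, high) in limits.items())


# Hypothetical usage: names and numbers are placeholders, not reactor data.
limits = {"rf_power": (0.0, 25.0), "silane_flow": (0.0, 8.0)}
command = {"rf_power": 30.0, "silane_flow": 5.0}
if not within_limits(command, limits):
    # A limit violation under the experimental controller hands control
    # to the safety controller, which ramps each input toward nominal.
    rf_power = safety_input(u_start=30.0, u_nominal=20.0,
                            elapsed_s=1.0, tau_s=5.0)
```

Because the exponential term decays monotonically, each input approaches its nominal value without overshoot, which is one simple way to obtain the smooth return to the nominal operating point described in the text.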
Porting the optimization code to the real-time control system was not a trivial task, and even after the code was working it took considerable experimentation and evaluation to tune the MPC algorithm parameters to achieve acceptable performance. With Simplex, this experimentation could be carried out quite efficiently since the code could be installed and run while the PECVD reactor was running. When problems occurred, Simplex returned control to the baseline LQG controller (via the safety controller). The MPC code could be modified to correct bugs or adjust parameters while the process continued to run, and then be re-installed for another test.

Two examples of the Simplex operation are presented here. The first example illustrates the Simplex response to a limit violation by a measured process output variable. Figure 11 shows the trajectory of the disilane along with an indication of the active controller at each instant. The points at which Simplex switched the active controller are indicated by the vertical dashed lines. During the initial part of this run, the baseline controller is operating and the disilane is being regulated to its set point value. The MPC controller is then installed (indicated by the grey line rising in the plot below the disilane value), and the disilane swings up and then quickly decreases below its lower limit (indicated by the dashed line). At this point Simplex switches in the safety controller, which takes the process smoothly back to its nominal operating point, and then switches control back to the baseline controller. The trajectory of the triaminosilane for this run is shown in Fig. 12. Although the triaminosilane does not violate its limits, it varies radically from its desired set point. Clearly there is a problem with the MPC software.

Figure 11: Disilane concentration and Simplex controller for run with limits in Table 1. (Plot: Disilane, run 4-1; vertical axis in Torr, horizontal axis in sec.)

Figure 12: Triaminosilane concentration and Simplex controller for run with limits in Table 1. (Plot: Triaminosilane, run 4-1; vertical axis in Torr, horizontal axis in sec.)

The second example illustrates the Simplex action when the MPC code misses a deadline. Since the MPC algorithm includes a nonlinear optimization routine, it is possible that the optimization iterations will not converge within the sampling period, leading to a time-out condition. The process inputs and outputs are shown in Fig. 13 for a run where the MPC code executes for a few cycles and then the safety controller is invoked because the MPC code does not return control values in time. In this case, the process inputs and outputs vary widely, indicating that the MPC algorithm is not controlling the process well, but there are no limit violations before control is taken away from the experimental controller.

Figure 13: Process inputs and outputs for Simplex reaction to a timeout condition on the PC code. (Panels, all for run 3-2: Pressure, DC Bias, Silane, Disilane, Triaminosilane, RF Power; axis units appearing in the plots: torr, volts, sccm, watts; horizontal axes in sec.)
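The time-out condition can be illustrated with a small sketch of how a decision module might wait for the experimental controller's output. It assumes the controller posts its command to an in-process mailbox modeled with Python's standard queue.Queue; the function name, the mailbox, and the deadline value are hypothetical and stand in for whatever messaging mechanism the actual Simplex implementation uses.

```python
import queue


def read_command(mailbox: "queue.Queue", deadline_s: float):
    """Wait up to deadline_s for a control command from a controller.

    Returns (command, True) if a message arrived before the deadline,
    or (None, False) to signal a timing fault.
    """
    try:
        return mailbox.get(timeout=deadline_s), True
    except queue.Empty:
        # No message before the deadline: the controller missed it,
        # for example because an optimization did not converge in time.
        return None, False


# Hypothetical usage with a 10 ms deadline.
mailbox = queue.Queue()
mailbox.put(1.5)                                # controller delivered on time
on_time = read_command(mailbox, 0.01)           # message available
late = read_command(mailbox, 0.01)              # nothing posted: timing fault
```

The second element of the returned pair can feed directly into the ready flag for that controller, so a missed deadline removes the controller from consideration even when no process limit has been violated.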

5 Conclusion

This paper describes the application of the Simplex architecture to support safe, reliable on-line software upgrades in computer-controlled systems. As computing and control technologies advance at a rapid pace, control systems with the ability to adopt changes safely and reliably will be required. Upgradability and evolvability are particularly desirable in complex, safety-critical applications. The Simplex architecture is designed to meet this challenge.

In this paper, the basic structure of Simplex is presented to support analytically redundant controllers. We have presented the philosophy behind the Simplex switching rules for tolerating timing faults and semantic faults. An abstraction of the continuous dynamics of the physical system is developed to provide criteria for semantic fault detection. A control switching logic is established to handle control switching when the upgraded controller is detected to be faulty while it is controlling the physical system. The basic structure of Simplex, as well as its conceptual components, is invariant in most control applications. The use of the Simplex architecture for control system upgrades is demonstrated by a real-world example, the control of plasma enhanced chemical vapor deposition.

In addition to supporting safe, reliable, on-line control software upgrades, the Simplex architecture can be generalized to a broad range of control applications, and we consider three examples here. First, it can be applied to control systems for the purpose of fault tolerance. Its fault-tolerance capability is not limited to control software. It can be extended to support fault tolerance in general, protecting against, for instance, operating system failures, sensor failures, actuator failures, and component aging, provided that analytically redundant backup modules are available on line. When faults are detected, operation can be smoothly transferred from the failed component to the redundant component. When the failed component is repaired, it can be inserted back into the system.

The Simplex architecture also provides a way to develop control algorithms on-line. With the fault-tolerance facility, any control algorithm, even one that has not been fully tested, can be implemented in the Simplex architecture, and the operation of the system will again be protected by control algorithms of proven reliability. By repeatedly making changes to the algorithm being tested, its reliability can be raised to the same level as that of the protecting algorithm, while its functionality or the system performance under its control can be improved.

Finally, the Simplex architecture can support control system reconfiguration. In complex control systems, the control objectives are often achieved by a group of control algorithms, which include nominal controls operating with nominal system parameters, and others designed to take care of abnormal situations, such as device degradation or environmental variations. The control system is required to reconfigure itself automatically to provide appropriate control, and this can be achieved with the Simplex architecture. Simplex is not limited to the set of currently running controllers; it allows replacements at run-time. When a control task needs to be performed and the required controller is not among those currently running, one of the running controllers that will not be needed in the near future can be swapped out and the needed one can be made available.

As the Simplex architecture is a technology that is still maturing, the issues introduced here are still under investigation. How to systematically design the safety control and derive the corresponding safety region is also a focus of control research.

Acknowledgment

Some of the material presented in this paper came out of various discussions within the Simplex team at SEI, and we thank everyone on the team for their contributions.

References

[1] M. Bodson, J. Lehoczky, R. Rajkumar, L. Sha, and J. Stephan, "Analytic redundancy for software fault-tolerance in hard real-time systems". In Foundations of Dependable Computing: Paradigms for Dependable Applications, G. M. Koob and C. G. Lau, Eds. Kluwer Academic Publishers, 1994.

[2] T. J. Knight, D. W. Greve, X. Cheng, and B. H. Krogh, "Real-time multivariable control of PECVD silicon nitride film properties". In the IEEE Trans. on Semiconductor Manufacturing, vol. 10, no. 1, pp. 137-146, Feb. 1997.

[3] R. Rajkumar, M. Gagliardi, and L. Sha, "The Real-Time Publisher/Subscriber IPC Model for Distributed Real-Time Systems: Design and Implementation". In the Proceedings of the 1st IEEE Real-Time Technology and Applications Symposium, May 1995.

[4] J. Rivera, A. Danylyszyn, C. Weinstock, L. Sha, and M. Gagliardi, "An Architectural Description of the Simplex Architecture". Technical Report CMU/SEI-96-TR-006, March 1996.

[5] L. Sha, R. Rajkumar, and S. Sathaye, "Generalized Rate Monotonic Scheduling Theory: A Framework of Developing Real-Time Systems". In the IEEE Proceedings, January 1994.

[6] L. Sha, R. Rajkumar, and M. Gagliardi, "A Software Architecture for Dependable and Evolvable Industrial Computing Systems". Technical Report CMU/SEI-96-TR-005, July 1996.

[7] L. Sha, R. Rajkumar, and M. Gagliardi, "Evolving Dependable Real-Time Systems". In the Proceedings of the 17th IEEE Aerospace Applications Conference, February 1996.

[8] X. Cheng and B. H. Krogh, "A New Approach to Guaranteed Stability for Receding Horizon Control". In the Preprints 13th IFAC World Congress, vol. C, pp. 433-438, July 1996.