Automated Learning and Monitoring of Limit Functions

Dennis DeCoste

Monitoring and Diagnosis Technology Group
Jet Propulsion Laboratory / California Institute of Technology
4800 Oak Grove Drive; Pasadena, CA 91109; USA
http://www-aig.jpl.nasa.gov/home/decoste/
[email protected]

To appear in Proceedings of i-SAIRAS-97, Japan, July 1997

Abstract

In practice, automated monitoring of spacecraft relies heavily on limit-sensing and simulation. Unfortunately, limit-sensing tends to be too imprecise (missing alarms) and simulation tends to be too precise (false alarms). To help overcome those disadvantages, we present an anytime algorithm called envelope learning and monitoring via error relaxation (ELMER). It can incrementally generate successively tighter high/low limit functions (envelopes), essentially moving from the wide static limits typical of limit-sensing toward the precise predictions of simulations, while avoiding unacceptable false alarm rates. We summarize the techniques and motivations underlying ELMER and illustrate its performance on telemetry data from the NASA spacecraft TOPEX.

1 Introduction

Automated, reliable monitoring of spacecraft engineering sensor data, for both detection of anomalies and verification of expected behaviors, is very important for achieving low-cost mission operations. In practice, automated monitoring of spacecraft relies heavily on limit-sensing and simulation. Limit-sensing compares time-evolving sensor values against high and low alarm limits. For the sake of minimizing the run time cost of monitoring, the number of false "nuisance alarms", and the cost of (manual) limit-determination, these limits are typically predefined, static, and relatively wide (e.g. "red-lines"). Alternatively, simulation dynamically computes tight limits, at the cost of requiring dynamic models that are expensive to create, test, maintain, and execute. Unfortunately, limit-sensing tends to be too imprecise (i.e. failing to detect anomalies) and simulation tends to be too expensive and too precise (i.e. prone to false alarms). The use of expert systems, qualitative reasoning (e.g. [4], [7]), and Bayes nets (e.g. [9]) can offer valuable context-sensitive alternatives, but these currently require significant manual knowledge engineering effort.

Given the large volumes of engineering data that both testbeds and deployed spacecraft typically provide, it is natural to consider employing machine learning methods to help reduce the need for manual knowledge engineering. During early stages of flight, relatively wide and predetermined red-lines for each sensor might suffice. For example, many classes of subtle anomalies are more likely to arise during later stages of a mission, as components begin to degrade. This might be especially true if the mission goes into extended mission mode and begins to exceed the expected lifetime of some components. Indeed, in a "cheaper, quicker" mission life cycle framework, degradations and failures might become more common. Such anomalies might become more acceptable if early detection allows preventive operation to reduce serious impact on the main mission objectives. After several weeks or months of flight, tighter limits based on training over data from that early historic behavior could be learned and used. An additional advantage of such an approach is that those trained limits would reflect the specific idiosyncrasies of a spacecraft that typically only become clear once it is put into actual flight. Clearly, one of the biggest challenges facing the use of machine learning techniques in such a scenario is to properly account for the context-conditional uncertainty in the limits, since future contexts may not be sufficiently similar to that early training data.

With such a scenario and challenges in mind, we have recently developed a new anytime approach called envelope learning and monitoring via error relaxation (ELMER) [5]. Given historic and/or simulated data, it incrementally and automatically learns tighter high/low limit functions (envelopes). ELMER essentially starts with the wide red-line limits characteristic of traditional limit-sensing and gradually refines them towards the precise predictions of simulations. During this refinement process, the resulting limit functions will generally be successively tighter, while maintaining false alarm rates below user-specified tolerances.

ELMER is motivated by our intuitions that simple limit-sensing can perform well, if we find limits that are context-sensitive functions and we limit-sense across multiple perspectives of the raw data. In our early exposure to Space Shuttle operations, for example, we discovered that operators often slightly nudged the red-lines for some sensors up or down between flights, to account for both nuisance false alarms and for empirical realizations that some limits were excessively wide. ELMER can be viewed in part as a means for automating such a process via machine learning techniques.

2 Example: Bounding TOPEX Data

To illustrate ELMER's performance, we present an example using recent data from the NASA TOPEX spacecraft. Figure 2 plots across time the test data of 56 sensors for 1000 samples, representing behavior over almost three hours (i.e. the sampling rate is 10 seconds). We will refer to the sensor plotted in column c, row r of this figure as sensor S_{c,r}. Figure 3 plots the high envelope that ELMER learned for target sensor S_{2,9}, based on almost five days of training data for these same 56 sensors and for particular P and R learning parameters (discussed shortly). This high envelope is the result of a linear regression for which the target output is the value of that sensor at time t+1 and the inputs are the values at time t of a subset of the available sensors. Following standard statistical practice, we normalized each sensor's data to mean 0 and variance 1. We will summarize how ELMER selects such input subsets and regresses with them in the next section.

For comparison, Figure 4 plots the result of traditional linear least-squares regression using singular value decomposition (SVD). Also plotted are error bars based on standard methods for computing input-conditional variance for linear regression [1]. The input selections and regressions using SVD took a total of about 240 seconds of training time on a Sparc20, compared to about 350 seconds for high envelope bounding. The relatively tight fit of both regressions suggests that our input selection heuristics worked reasonably well.

The high envelope selected more inputs (9) than the SVD case (7). This is common and reasonable, since ELMER's relaxed non-alarm error encourages the high envelope to stay high except in those contexts for which the high bound can be lower without introducing excessive (false) alarms during training. Indeed, the regression weight for the bias term was significantly higher for the high envelope (0.95) than for the SVD fit (0.02), with more weights (4 versus 2) being negative. In both cases, the target sensor itself was automatically selected first and obtained the highest final regression weight of any sensor, which is again very common and reasonable. Note that the mean non-zero non-alarm and alarm errors are of similar magnitude (0.196 and 0.17) for SVD, whereas for the high bound there is a significant bias against alarm error (1.068 vs 0.26). There are many more and larger alarms for the SVD fit than for the high envelope.

As the plots of this example illustrate, variance-based error bars are often considerably wider than ELMER's envelopes. This illustrates a key difference between ELMER's envelopes and variance-based error bars: ELMER essentially strives to directly learn the input-conditional bounds of what it has seen before, while still not being so tight as to cause (false) alarms on the training data.
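To make the SVD baseline concrete, the following sketch (ours, not from the paper) fits a least-squares model and computes the standard input-conditional variance error bars in the spirit of [1]. The data here is synthetic; the sensor count, noise level, and seed are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for normalized telemetry: predict a sensor at t+1
# from 7 inputs at time t (all normalized to mean 0, variance 1).
N, D = 1000, 7
X = rng.standard_normal((N, D))
X[:, 0] = 1.0                                # constant bias input
y = X @ rng.standard_normal(D) + 0.1 * rng.standard_normal(N)

# Least-squares fit (np.linalg.lstsq is SVD-based).
w, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ w

# Standard input-conditional predictive variance for linear regression:
# sigma^2 * (1 + x^T (X^T X)^{-1} x) for each input vector x.
sigma2 = resid @ resid / (N - D)
XtX_inv = np.linalg.pinv(X.T @ X)
half_width = 2.0 * np.sqrt(sigma2 * (1.0 + np.einsum('ij,jk,ik->i', X, XtX_inv, X)))

print(f"mean |residual|: {np.mean(np.abs(resid)):.3f}, "
      f"mean error-bar half-width (2 sd): {half_width.mean():.3f}")
```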

3 Summary of ELMER Techniques

We define the bounds estimation problem as follows:

Definition 1 (Bounds estimation) Given a set of patterns P, each specifying values for inputs x_1, ..., x_D and target y generated from the true underlying function y = f(x_1, ..., x_d) + ε, learn high and low approximate bounds y_H = f_H(x_1, ..., x_h) and y_L = f_L(x_1, ..., x_l), such that y_L ≤ y ≤ y_H generally holds for each pattern, according to given cost functions.

We allow any 1 ≤ l ≤ d, 1 ≤ h ≤ d, d ≥ 1, D ≥ 0, making explicit both our expectation that some critical inputs of the generator may be completely missing from our patterns and our expectation that some pattern inputs may be irrelevant or useful in determining only one of the bounds.¹ We also make the standard assumption that the inputs are noiseless whereas the target has Gaussian noise defined by ε.

For each envelope sensor S and time t+1, we compare its actual sensed value at t+1 (y) to its envelope predictions (y_H and y_L). To simplify discussion, we will usually discuss learning only high bounds y_H; the low bounds case is essentially symmetric. An alarm occurs when output y_H is below the target y, and a non-alarm occurs when y_H ≥ y. We will call these alarm and non-alarm patterns, denoted by sets P_a and P_n respectively, where N = |P| = |P_a| + |P_n|.

ELMER relaxes the standard least-squares regression cost function to include parameters which allow balancing the cost of (false) alarms against the cost of imprecision, as defined in Figure 1. We can favor non-alarms (i.e. looseness) over alarms (incorrectness) by using values for the penalty (P) factors where P_{Ha} > P_{Hn}. This is analogous to the more common use of nonstandard loss functions to perform risk minimization in classification tasks [1]. P_{Ha} = P_{Hn} = 1 and R = R_{Ha} = R_{Hn} gives us the standard class of Minkowski-R errors [1], where R=2 gives standard sum-of-squares error and R=1 gives classic robust estimation (which reduces the effects of outliers).

We focus here on linear regression (i.e. y_H = Σ_{i=1}^{h} w_i x_i) to perform bounds estimation, both for simplicity and because our larger work stresses heuristics for identifying promising explicit nonlinear input features, such as product terms (e.g. [14]). Nevertheless, this approach generalizes to nonlinear regression as well.

¹ We assume x_1 is a constant bias input of 1 which is always provided. For meaningful comparisons, other inputs with effective weight of zero are not counted in these dimensionality numbers D, d, h, and l.
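As a concrete reading of these definitions, the following minimal sketch (our illustration, with made-up values) splits patterns into the alarm and non-alarm sets P_a and P_n for a high envelope and checks N = |P_a| + |P_n|.

```python
import numpy as np

# Made-up envelope predictions vs. actual sensed values.
y_true = np.array([0.2, -0.5, 1.3, 0.9, -0.1])
y_high = np.array([0.5,  0.0, 1.0, 1.5,  0.3])

alarm = y_high < y_true          # P_a: high bound fell below the target
nonalarm = ~alarm                # P_n: y_H >= y
assert alarm.sum() + nonalarm.sum() == len(y_true)   # N = |P_a| + |P_n|
print(f"alarms={alarm.sum()} non-alarms={nonalarm.sum()} "
      f"false-alarm rate={alarm.mean():.2f}")
```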


$$
e_H = \begin{cases} P_{Hn}\,(y_H - y)^{R_{Hn}} & \text{if } y_H \ge y \\ P_{Ha}\,(y - y_H)^{R_{Ha}} & \text{if } y_H < y \end{cases}
\qquad
E_H = \frac{1}{N}\sum_{p \in P} e_H, \quad E_{|H|} = \frac{1}{N}\sum_{p \in P} |e_H|
$$

$$
e_L = \begin{cases} P_{Ln}\,(y - y_L)^{R_{Ln}} & \text{if } y_L \le y \\ P_{La}\,(y_L - y)^{R_{La}} & \text{if } y_L > y \end{cases}
\qquad
E_L = \frac{1}{N}\sum_{p \in P} e_L, \quad E_{|L|} = \frac{1}{N}\sum_{p \in P} |e_L|
$$

Parameters: $P_{Ha}, P_{Hn}, P_{La}, P_{Ln} \ge 0$; $R_{Ha}, R_{Hn}, R_{La}, R_{Ln} \ge 1$.

Figure 1: Asymmetric cost functions for high and low bounds.
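A direct transcription of Figure 1 into code might look like the sketch below (our rendering; the default parameter values are simply those used later in Figure 3). It computes the per-pattern relaxed high-envelope errors e_H and their mean E_H.

```python
import numpy as np

def high_envelope_error(y_high, y, P_Hn=1e-4, P_Ha=1.0, R_Hn=2.0, R_Ha=10.0):
    """Relaxed per-pattern error e_H from Figure 1 and its mean E_H.

    Non-alarm (y_H >= y): slack above the target is penalized weakly.
    Alarm    (y_H <  y): undershooting the target is penalized strongly.
    P_Hn = P_Ha = 1 with R_Hn = R_Ha = R recovers the Minkowski-R family
    (R=2 sum-of-squares, R=1 classic robust estimation).
    """
    e = np.where(y_high >= y,
                 P_Hn * np.abs(y_high - y) ** R_Hn,
                 P_Ha * np.abs(y - y_high) ** R_Ha)
    return e, e.mean()

# Made-up values; the middle pattern is an alarm (0.8 < 1.0).
e, E = high_envelope_error(np.array([0.4, 0.8, 0.1]), np.array([0.0, 1.0, -0.5]))
print(e, E)
```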

Figure 2: TOPEX time-series data for 56 sensors.

We use this cost function to perform linear regression using the classic batch Newton method [1], which specializes to performance analogous to that of SVD in the special case of a symmetric cost function (i.e. P_{Hn}=1, P_{Ha}=1, R_{Hn}=2, R_{Ha}=2 and P_{Ln}=1, P_{La}=1, R_{Ln}=2, R_{La}=2). For R and P values which give asymmetric error, Newton's method results in quick convergence (typically less than 100 epochs in our experience). Even for asymmetric R and P values, we have found the weights resulting from SVD to be useful as initial seed weights. For example, such seeding allows P_{Ln}=0 to give a non-trivial result for lower-bounding. Otherwise, with initial weights of all 0, P_{Ln}=0 would result in zero error (since E_{|L|} would be 0) and thus training would immediately stop with the constant result of y_L=0. Details of our use of Newton's method are given in [6].

We have also experimented with online learning, where each envelope function adapts not only the weights of its linear sum of inputs, but also its P and R parameters. In that context, our cost function can be viewed as enforcing two separate learning rates, one for alarms and one for non-alarms. This allows us to bias learning primarily toward quick avoidance of future false alarms, with a secondary bias toward gradually reducing envelope wideness.

The complexity of linear regression is cubic in the number of inputs. For our application domains the number of sensors and transforms we wish to consider as inputs is typically in the hundreds or thousands. Thus, to achieve anytime performance, it is particularly important that ELMER use heuristics to focus on promising candidate inputs. Indeed, given that the main goal in ELMER is not to necessarily fit the data precisely, suboptimal but cheaper and reasonable input subsets can often suffice.

For input subset selection, we currently use greedy heuristics analogous to those used for incremental addition of hidden units in constructive neural network learning (e.g. [8]). Namely, we compute scores for each candidate sensor based on their correlations with the error residual of the current regression network. We add the candidate with highest correlation to the input set and rerun the batch Newton method to learn new (tighter) envelopes. We stop adding new inputs when the error on the validation data set starts to increase,


[Figure 3 plot: three panels for sensor CESAFPE2 (HI envelope = dashed line), samples 0-1000. Top: sensor data with the learned high envelope. Middle: non-alarm err (min=0, mean=1.037, max=4.06; err>0: mean=1.068, var=0.31). Bottom: alarm err (min=0, mean=0.0078, max=0.59; err>0: mean=0.26, var=0.037).]
Figure 3: Example of learned envelope for TOPEX sensor.

In the top plot, the dashed line shows the high envelope and the darker line shows the target sensor data. The middle plot shows the ELMER relaxed error values for non-alarms and the bottom plot shows the error for alarms. Note the logarithmic scales for the error plots. ELMER parameter values: P_{Hn}=0.0001, P_{Ha}=1, R_{Hn}=2, R_{Ha}=10.

since that suggests that overfitting is beginning to occur.

Our approach also requires search over the space of possible R and P values. Currently we have no optimal method of performing this search. Instead, we spot-check combinations of various commonly useful values and pick the best result. For example, these choices typically include R_{Hn} ∈ {1, 2}, R_{Ha} ∈ {2, 10, 20}, P_{Hn} ∈ {1, 0.1, 0.01, 1e-3, 1e-4, 1e-5, 1e-10, 1e-15, 0}, and P_{Ha} ∈ {1, 1000}. The selection criteria involve not only the relative alarm versus non-alarm errors, but also other user preferences, such as acceptable levels of false alarm rates for each sensor.
4 Summary of Key Ideas

4.1 Envelopes Are Context-Sensitive Limit Functions

For each sensor S, each ELMER envelope is a function whose output at time t predicts a bound on the value of S at time t+1. Candidate inputs can include not only S and other sensors, but also transforms of them, such as means, mins, maxs, and derivatives over various time-windows and at various time lags. Given an initial set of inputs (often simply the value of S at time t) and a set of candidate inputs, ELMER uses linear-time greedy feature selection search to incrementally learn linear weighted sums which define context-sensitive limits on S's historic behavior. In contrast to predominately non-linear function approximation approaches, such as neural networks, ELMER strives to find simple-yet-sufficient approximations, both for learning efficiency and for comprehensibility of results.
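To illustrate the kinds of candidate inputs described above, this sketch (ours; the helper name, window size, and lags are arbitrary choices) derives lagged values, trailing-window means/mins/maxs, and a finite-difference derivative from a raw sensor series.

```python
import numpy as np

def transform_features(x, window=6, lags=(1, 2, 3)):
    """Build candidate envelope inputs from a 1-D sensor series x.
    Every feature is aligned with x (leading edge padded with x[0])."""
    n = len(x)
    feats = {'bias': np.ones(n), 'value': x}
    for k in lags:
        feats[f'lag{k}'] = np.concatenate([np.full(k, x[0]), x[:-k]])
    padded = np.concatenate([np.full(window - 1, x[0]), x])
    win = np.lib.stride_tricks.sliding_window_view(padded, window)
    feats['mean'] = win.mean(axis=1)   # trailing-window statistics
    feats['min'] = win.min(axis=1)
    feats['max'] = win.max(axis=1)
    feats['deriv'] = np.gradient(x)    # finite-difference derivative
    return feats

# Usage: stack into the candidate matrix expected by input selection.
x = np.sin(np.linspace(0, 20, 1000))
Xc = np.column_stack(list(transform_features(x).values()))
print(Xc.shape)   # (1000, 9)
```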

4.2 Monitoring Warrants Asymmetry

ELMER balances the cost of false alarms against the cost of imprecision by using a parametric asymmetric cost function for which standard least-squares regression is just one special case.
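A degenerate example shows the effect of the asymmetry. Fitting a single constant c under a weighted squared loss (our toy illustration, not from the paper) yields roughly the mean when the two penalties are equal, but approaches an upper bound on the data when alarms are penalized far more heavily than non-alarms:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y = np.random.default_rng(2).standard_normal(5000)

def cost(c, P_n, P_a):
    # Squared error on both sides; alarms (c < y) weighted more via P_a.
    return np.where(c >= y, P_n * (c - y) ** 2, P_a * (y - c) ** 2).mean()

sym  = minimize_scalar(lambda c: cost(c, 1.0, 1.0)).x   # near mean(y) ~ 0
asym = minimize_scalar(lambda c: cost(c, 1e-4, 1.0)).x  # near the top of y
print(f"symmetric: {sym:.2f}  asymmetric: {asym:.2f}  max(y): {y.max():.2f}")
```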

4.3 Multiple Weak Perspectives versus Single Strong One

Most alternative time-series prediction approaches focus on learning a single time-series with overall least mean squared error. However, for monitoring tasks, we argue that it may often be more useful to learn multiple time-series bounds, even if each is relatively weak.


[Figure 4 plot: panels for sensor CESAFPE2 (SVD fit = dashed line), samples 0-1000. Top: sensor data with the regression fit. Non-alarm err (min=0, mean=0.092, max=2.53; err>0: mean=0.196, var=0.138). Alarm err (min=0, mean=0.092, max=2.43; err>0: mean=0.17, var=0.11). Bottom: error-bar panel, range roughly -20 to 20.]
Figure 4: Traditional least-squares fit with error bars.

In the top plot, the light dashed lines show the regression mean and the dark line shows the target sensor data. The bottom plot shows light vertical error bars (2 standard deviations), with the target sensor data plotted darkly on top. ELMER parameter values: P_{Hn}=1, P_{Ha}=1, R_{Hn}=2, R_{Ha}=2.

The idea is that wider envelopes may be easier to learn/represent/evaluate (and with fewer false alarms), yet still be adequate for detecting anomalies, as long as there are some transformations for which a given anomaly is so drastic that even weak envelopes on that perspective allow detection. This motivates using ELMER to learn many envelopes, each involving an assortment of input and output transforms. For example, transforms such as derivatives and maximums can often be very valuable targets for anomaly detection, for which relatively simple context-sensitive bounds can often be learned.
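A minimal monitoring loop over multiple weak perspectives might look like the following sketch (ours; the transforms and constant limits are placeholders): a sample is flagged if any perspective's envelope is violated, so an anomaly that is drastic under even one transform is caught.

```python
import numpy as np

def monitor(window, envelopes):
    """envelopes: list of (name, transform, high_fn, low_fn) tuples, where
    transform maps the raw window to one perspective (value, derivative,
    max, ...) and high_fn/low_fn are that perspective's limit functions."""
    alarms = []
    for name, transform, high_fn, low_fn in envelopes:
        v = transform(window)
        if v > high_fn(window) or v < low_fn(window):
            alarms.append(name)
    return alarms   # non-empty => at least one weak envelope tripped

# Toy usage with constant (red-line style) limits on two perspectives.
w = np.array([0.1, 0.2, 0.15, 2.8])        # sudden jump at the end
envs = [('value', lambda s: s[-1],         lambda s: 1.0, lambda s: -1.0),
        ('deriv', lambda s: s[-1] - s[-2], lambda s: 0.5, lambda s: -0.5)]
print(monitor(w, envs))                    # ['value', 'deriv']
```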

5 Discussion

5.1 Related Work

As noted earlier, work on variance-based error bars (e.g. [2], [11], [10]) is clearly related. More generally, ELMER is performing a special version of probability density estimation, which we call bounds estimation. Alternatives, such as mixture models (e.g. [1]) and kernel density estimation (e.g. [17]), strive to more accurately model how the probability varies across the output range, conditional on the inputs. We generally expect that accurate estimation of probability densities will typically require significantly more training data and time, since those techniques promise to deliver more than ELMER (i.e. accurate probability spreads, as opposed to simple bounds). Furthermore, until such methods converge on accurate estimates, their intermediate results are not particularly useful or reliable. In contrast, each ELMER envelope strives merely to identify a single pair of input-conditional bounds within which the sum probability is approximately 1.


In this way, ELMER can avoid the typically higher training costs of those more general probability density approaches, which we argue are not necessarily worth the expense for many monitoring tasks. ELMER allows for incremental, anytime bounding, such as starting with relatively small P_{Hn} factors (i.e. learning relatively wide envelopes), and then gradually increasing those factors until the false alarm rate of the resulting (tighter) envelope becomes unacceptably high.
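That incremental, anytime usage can be sketched as a simple schedule over P_{Hn} (our illustration; fit_fn is a hypothetical fitting routine like the one sketched in Section 3): tighten stepwise and keep the last envelope whose held-out false-alarm rate stays within tolerance.

```python
import numpy as np

def anytime_tighten(fit_fn, X, y, X_val, y_val, tol=0.01,
                    schedule=(1e-10, 1e-5, 1e-3, 1e-1, 1.0)):
    """Raise the non-alarm penalty P_Hn stepwise (tightening the high
    envelope); keep the last weights whose held-out false-alarm rate is
    within `tol`. fit_fn(X, y, P_n) -> weight vector (hypothetical)."""
    best = None
    for P_n in schedule:
        w = fit_fn(X, y, P_n)
        if np.mean(X_val @ w < y_val) > tol:   # false-alarm rate too high
            break
        best = w                               # anytime: usable at any point
    return best
```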

5.2 Status

We are currently evaluating and refining our ELMER techniques across a wide variety of NASA domains, including EUVE, Deep Space Network, New Millennium DS-1, TOPEX, Space Shuttle, and Pluto Express.

5.3 Future Work

Our near-term plans for extending our ELMER framework include incorporating and evaluating greedy feature construction methods (e.g. [14]), causal induction methods (e.g. [13], [3], [16]), and training data compression/subsampling methods (such as support vectors, [12], [15]). In all of these cases, the key motivations are to increase the autonomy, accuracy, and efficiency with which ELMER can incrementally learn bounds suitable for complex monitoring tasks from large historic databases of time-series engineering data.

6 Acknowledgements

Alex Gray, John Szijjarto, Mike Foster, Mike Turmon, and Jay Wyatt all provided significant feedback on these ideas. The research described in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[2] Christopher M. Bishop and Cazhaow S. Qazaz. Regression with input-dependent noise: A Bayesian treatment. Advances in Neural Information Processing Systems 9, 1997.

[3] Paul R. Cohen, Dawn E. Gregory, Lisa Ballesteros, and Robert St. Amant. Two algorithms for inducing structural equation models from data. In Fifth International Workshop on Artificial Intelligence and Statistics, 1995. (See http://www-eksl.cs.umass.edu/papers/fbd-ftc-aistats95 94-80.ps).

[4] Dennis DeCoste. Dynamic across-time measurement interpretation. Artificial Intelligence, 51:273-341, 1991.

[5] Dennis DeCoste. Learning and monitoring with families of high/low envelope functions: Maximizing predictive precision while minimizing false alarms. Technical Report D-13418, Jet Propulsion Laboratory / California Institute of Technology, March 1996.

[6] Dennis DeCoste. Bounds estimation via regression with asymmetric cost functions. Submitted to IJCAI-97, January 1997.

[7] Daniel Dvorak and Benjamin Kuipers. Model-based monitoring of dynamic systems. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 1238-1243, August 1989.

[8] Scott Fahlman and C. Lebiere. The cascade-correlation learning architecture. Advances in Neural Information Processing Systems 2, 1990.

[9] Eric Horvitz and Matthew Barry. Display of information for time-critical decision making. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, 1995. (See ftp://ftp.research.microsoft.com/pub/ejh/vista.ps).

[10] Herbert Kay and Benjamin Kuipers. Numerical behavior envelopes for qualitative models. In Proceedings of the Eleventh National Conference on Artificial Intelligence, 1993.

[11] David A. Nix and Andreas S. Weigend. Learning local error bars for nonlinear regression. Advances in Neural Information Processing Systems 7, 1995.

[12] B. Scholkopf, C. Burges, and V. Vapnik. Extracting support data for a given task. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, 1995.

[13] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. Springer-Verlag, 1993. (See http://hss.cmu.edu/html/departments/philosophy/TETRAD.BOOK/book.html).

[14] Richard S. Sutton and Christopher J. Matheus. Learning polynomial functions by feature construction. In Proceedings of the Eighth International Workshop on Machine Learning, 1991.

[15] V. Vapnik, S. Golowich, and A. Smola. Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9, 1997.

[16] T. S. Verma and J. Pearl. An algorithm for deciding if a set of observed independencies has a causal explanation. In Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence, 1992.

[17] Andreas S. Weigend and Ashok N. Srivastava. Predicting conditional probability distributions: A connectionist approach. International Journal of Neural Systems, 6, 1995.
