ISSC 2002, Cork.
June 25–26
Gaussian Processes for Modelling of Dynamic Non-linear Systems

Gregor Gregorčič and Gordon Lightbody
Department of Electrical Engineering
University College Cork
IRELAND
E-mail: [email protected]

Abstract — Parametric multiple model techniques have recently been proposed for the modelling of non-linear systems and for use in non-linear control. Research effort has focused on issues such as the selection of the structure, constructive learning techniques, computational issues, the curse of dimensionality and off-equilibrium behaviour. To reduce these problems, the use of non-parametric modelling approaches has been proposed. This paper introduces the Gaussian process prior approach for the modelling of non-linear dynamic systems. Issues such as the selection of the input space dimension and multi-step ahead prediction are discussed. The Gaussian process modelling technique is demonstrated on a simulated example of a non-linear hydraulic system.

Keywords — Gaussian processes, Non-linear modelling, Covariance function, Covariance matrix, Hyperparameters, Prediction.
I Introduction

One approach for the global modelling of non-linear systems is so-called black-box modelling, in which the process model is approximated using multiple weighted models combined to form a global model [1]. This modelling structure can be represented as in equation (1):

$$\hat{y}(k+1) = \sum_{i=1}^{N} \varphi_i\, M_i \tag{1}$$

where ŷ(k+1) is the one-step ahead predicted output, N is the number of local models, ϕ_i is the ith validity function and M_i is the ith local model. The choice of validity function and local models determines the type of the multiple model network structure. Well known structures of this type are neural networks [2] and fuzzy rule-based models [3]. For example, if the local model M_i in equation (1) is replaced with the scalar weight M_i = w_i, and the validity function ϕ_i is chosen as the Gaussian basis function ϕ_i = ϕ_i(ψ(k)), where ψ(k) is the full data vector at the kth time sample, then equation (1) yields the structure of the well known Radial Basis Function network (RBFN):

$$\hat{y}(k+1) = \sum_{i=1}^{N} \varphi_i\big(\psi(k)\big)\, w_i \tag{2}$$
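As an illustration, a minimal sketch of the RBFN prediction of equation (2) might look as follows. The centres, widths and weights are assumed to have been obtained by some training procedure; their names here are hypothetical.

```python
import numpy as np

def rbfn_predict(psi, centres, widths, weights):
    """One-step ahead RBFN prediction, equation (2).

    psi     : full data vector psi(k), shape (d,)
    centres : basis function centres, shape (N, d)
    widths  : basis function widths, shape (N,)
    weights : scalar weights w_i, shape (N,)
    """
    # Gaussian basis (validity) functions phi_i(psi(k))
    phi = np.exp(-0.5 * np.sum((psi - centres) ** 2, axis=1) / widths ** 2)
    # Weighted sum of the scalar local models M_i = w_i
    return phi @ weights
```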
The local modelling approach has been proposed to increase transparency as well as to reduce the effects of the curse of dimensionality. Instead of placing a large number of "meaningless" local models all over the operating space, the operating space is partitioned into a number of operating regimes. For each regime, a linear local model is identified. There are then many ways of combining these local models to provide a global model [1]. In the so-called supervised switching approach [4], an expert system known as a supervisor determines from process data which model best represents the process at a particular time. Use of this approach for control can be limited because of the stability issues highlighted in [5]. In the blended local models network [6], instead of switching between the models, a combination of local model outputs is used, in similar fashion to the radial basis function network. The scalar weights w_i in equation (2) are replaced with a linear local model identified for the ith operating regime. The number of local models is typically dramatically reduced in comparison with the types of models mentioned earlier. The validity functions ϕ_i are no longer functions of the full data vector ψ(k), but are functions of some scheduling vector φ(k), which defines the operating point of the system. The operating point can change when a disturbance hits the system. If the disturbance is a measurable variable, it can be treated as an additional input, and the dynamics of the disturbance can then be modelled separately. By simply adding an additional dimension to the scheduling vector, this disturbance model can be
then included in the global model. The Local Model Network can be represented as follows:

$$\hat{y}(k+1) = \sum_{i=1}^{N} \varphi_i\big(\phi(k)\big)\, \psi(k)^{T}\, \Theta_i \tag{3}$$

where Θ_i is the vector of parameters of the ith local model.
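As a concrete illustration of equation (3), here is a minimal sketch of a blended local model network prediction. The choice of normalised Gaussian validity functions on the scheduling vector is an assumption for illustration, and all names are hypothetical.

```python
import numpy as np

def lmn_predict(psi, phi_sched, centres, widths, Theta):
    """Local Model Network prediction, equation (3).

    psi       : full data vector psi(k), shape (d,)
    phi_sched : scheduling vector phi(k), shape (s,)
    centres   : validity function centres, shape (N, s)
    widths    : validity function widths, shape (N,)
    Theta     : local model parameter vectors Theta_i, shape (N, d)
    """
    # Gaussian validity functions evaluated on the scheduling vector
    rho = np.exp(-0.5 * np.sum((phi_sched - centres) ** 2, axis=1) / widths ** 2)
    validity = rho / rho.sum()      # normalisation, cf. the bias issue noted in [1]
    local_outputs = Theta @ psi     # psi(k)^T Theta_i for each operating regime
    return validity @ local_outputs
```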
If the non-linear system is represented as a surface above the operating space, then the product ψ(k)^T Θ_i can be seen as a linear approximation of the surface, i.e. a plane touching the surface at the operating point. The validity functions ϕ_i(φ(k)) can be optimised or set, taking into account prior knowledge of the modelled system. Problems associated with these modelling techniques are: the decomposition of the operating space, the selection of the scheduling variable, the selection of the structure of the local models, the issue of off-equilibrium behaviour [7] and the possible bias caused by overlapping and normalising of the validity functions [1]. For control use, the off-equilibrium problem has been partly reduced using velocity linearisation [8]. The local modelling technique presented in equation (3), as well as velocity linearisation, is based on some set of parameters. In the last few years much research has focused on non-parametric approaches to modelling. The Gaussian process prior approach, introduced in [9] and recently revised in [10], is presented in this paper as one of the non-parametric alternatives for the modelling of non-linear dynamic systems. The advantage of the confidence information in the prediction of a Gaussian process model over a linear regression model is highlighted in this paper. The importance of the selection of the input space dimension is highlighted for the modelling of a hydraulic system. Multi-step ahead prediction using a one-step ahead prediction Gaussian process model is also examined.
II Gaussian Processes
A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A detailed derivation of Gaussian processes based on the Bayesian framework can be found in [11]. Consider some noisy input/output set of data D. The full matrix Φ_N of N d-dimensional input vectors is constructed as follows:

$$\Phi_N = \begin{bmatrix}
u_1(1) & u_2(1) & \cdots & u_d(1) \\
u_1(2) & u_2(2) & \cdots & u_d(2) \\
\vdots & \vdots & & \vdots \\
u_1(k) & u_2(k) & \cdots & u_d(k) \\
\vdots & \vdots & & \vdots \\
u_1(N) & u_2(N) & \cdots & u_d(N)
\end{bmatrix}$$
The scalar outputs are stacked in the output vector y_N:

$$\boldsymbol{y}_N = \left[y(1), y(2), \ldots, y(k), \ldots, y(N)\right]^{T}$$
The aim is to construct the model and then, at some new input vector $u^{T}(N+1) = [u_1(N+1), u_2(N+1), \ldots, u_d(N+1)] \notin D$, find the distribution of the corresponding output y(N+1). A general model for the set of data can be written as:

$$y(k) = y_{nf}\big(u(k)\big) + \eta(k) \tag{4}$$

where y(k) is a noisy output, y_nf is the modelling function, which produces the noise-free output from the input vector u(k), and η(k) is additive noise. The prior over the space of possible functions to model the data can be defined as P(y_nf | α), where α is some set of hyperparameters. A prior over the noise, P(η | β), can also be defined, where η is the vector of noise values η = [η(1), η(2), ..., η(k), ..., η(N)]^T and β is a set of hyperparameters. The probability of the data given the hyperparameters α and β can be written as:

$$P\left(\boldsymbol{y}_N \mid \Phi_N, \alpha, \beta\right) = \int \mathrm{d}y_{nf}\, \mathrm{d}\boldsymbol{\eta}\; P\left(\boldsymbol{y}_N \mid \Phi_N, y_{nf}, \boldsymbol{\eta}\right) P\left(y_{nf} \mid \alpha\right) P\left(\boldsymbol{\eta} \mid \beta\right) \tag{5}$$

Define the matrix $\Phi_{N+1} = \left[\Phi_N^{T},\, u(N+1)\right]^{T}$ and the vector $\boldsymbol{y}_{N+1} = \left[\boldsymbol{y}_N^{T},\, y(N+1)\right]^{T}$. The conditional distribution of y_{N+1} can then be written as:

$$P\left(\boldsymbol{y}_{N+1} \mid D, \alpha, \beta, \Phi_{N+1}\right) = \frac{P\left(\boldsymbol{y}_{N+1} \mid \alpha, \beta, \Phi_{N+1}\right)}{P\left(\boldsymbol{y}_N \mid \alpha, \beta, \Phi_N\right)} \tag{6}$$
This conditional distribution can be used to make a prediction about y(N+1). The integral in equation (5) is complicated; the standard approach to solving such problems is given in [11] and [12]. The approach based on Gaussian process priors gives an exact analytic form of equation (5) and facilitates exact Bayesian analysis using matrix manipulations. The Gaussian process is fully represented by its mean and its covariance function C(·), which produces the covariance matrix C. In this paper, a zero-mean distribution is assumed:

$$\boldsymbol{y}_N = \left[y(1), y(2), \ldots, y(k), \ldots, y(N)\right]^{T} \sim \mathcal{N}(\boldsymbol{0}, \mathbf{C}) \tag{7}$$
Obviously not all data can be modelled as a zero-mean process; if the data is properly scaled and detrended, however, this assumption is reasonable. A prediction at the point y(N+1) can be made using the conditional distribution [11], which for a Gaussian process is also Gaussian [10]:

$$\hat{y}(N+1) \sim \mathcal{N}\left(\mu_{\hat{y}(N+1)},\, \sigma^2_{\hat{y}(N+1)}\right), \quad \text{where:}$$
$$\mu_{\hat{y}(N+1)} = \boldsymbol{v}_{N+1}^{T}\, \mathbf{C}_N^{-1}\, \boldsymbol{y}_N \tag{8}$$
$$\sigma^2_{\hat{y}(N+1)} = \nu - \boldsymbol{v}_{N+1}^{T}\, \mathbf{C}_N^{-1}\, \boldsymbol{v}_{N+1}$$

C_N, of size N × N, and v_{N+1}, of size N × 1, are the covariance matrices and ν is a scalar, constructed as follows:

$$\mathbf{C}_N = \begin{bmatrix} C_{1,1} & \cdots & C_{1,N} \\ \vdots & C_{m,n} & \vdots \\ C_{N,1} & \cdots & C_{N,N} \end{bmatrix}, \quad C_{m,n} = C\big(u(m), u(n)\big)$$
$$\boldsymbol{v}_{N+1} = \left[C\big(u(1), u(N+1)\big), \ldots, C\big(u(N), u(N+1)\big)\right]^{T} \tag{9}$$
$$\nu = C\big(u(N+1), u(N+1)\big)$$
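A minimal sketch of the prediction equations (8), given the quantities of equation (9) as NumPy arrays, might look as follows. A Cholesky solve is used instead of an explicit matrix inverse, a standard numerical choice that is an implementation detail rather than part of the paper's formulation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_predict(C_N, v, nu, y_N):
    """Gaussian process prediction, equation (8).

    C_N : covariance matrix of the training inputs, shape (N, N)
    v   : covariances between training inputs and u(N+1), shape (N,)
    nu  : scalar covariance C(u(N+1), u(N+1))
    y_N : vector of noisy training outputs, shape (N,)
    """
    chol = cho_factor(C_N)               # avoids forming C_N^{-1} explicitly
    mean = v @ cho_solve(chol, y_N)      # mu = v^T C_N^{-1} y_N
    var = nu - v @ cho_solve(chol, v)    # sigma^2 = nu - v^T C_N^{-1} v
    return mean, var
```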
C(·) is the covariance function of the inputs only. Any covariance function that generates a non-negative definite covariance matrix for any set of input points can be chosen [12], which offers the ability to include prior knowledge in the model. The following covariance function works well:
$$C\big(u(m), u(n)\big) = w_0 \exp\left(-\frac{1}{2} \sum_{l=1}^{d} w_l \big(u_l(m) - u_l(n)\big)^2\right) + w_\eta\, \delta(m, n) \tag{10}$$

where θ = [w_0, w_1, ..., w_d, w_η]^T is the vector of hyperparameters and d is the dimension of the input space. The effect of the hyperparameters on the covariance function is discussed in section b). To be able to make a prediction using equation (8), the hyperparameters have to be provided either as prior knowledge or trained from the training data. One possible technique is to maximise the likelihood of the training data under the given covariance function [11, 12].
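As an illustrative sketch, the covariance function of equation (10) and the construction of C_N, v_{N+1} and ν from equation (9) might be implemented as follows. The hyperparameter values are left as arguments; the function and argument names are this sketch's own.

```python
import numpy as np

def cov(um, un, w0, w, w_eta, same_point=False):
    """Covariance function of equation (10) for input vectors um, un."""
    c = w0 * np.exp(-0.5 * np.sum(w * (um - un) ** 2))
    # The noise term w_eta * delta(m, n) contributes only when m == n
    return c + (w_eta if same_point else 0.0)

def build_covariances(Phi_N, u_new, w0, w, w_eta):
    """Construct C_N, v_{N+1} and nu of equation (9)."""
    N = Phi_N.shape[0]
    C_N = np.array([[cov(Phi_N[m], Phi_N[n], w0, w, w_eta, same_point=(m == n))
                     for n in range(N)] for m in range(N)])
    v = np.array([cov(Phi_N[m], u_new, w0, w, w_eta) for m in range(N)])
    nu = cov(u_new, u_new, w0, w, w_eta, same_point=True)
    return C_N, v, nu
```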
a) Linear Regression vs. Gaussian Process Models

Consider a single input, single output system represented by equation (4). Let the function be y_nf = Ku(k), where K is an unknown gain. For each pair of data {u(k), y(k)}, k = 1...N, the parameter K can be calculated. Since the y(k) data is corrupted with noise, every calculation gives a different value of K (Fig. 1a). The parameter K̂ can be estimated using the least-squares algorithm. This approach is equivalent to fitting the best line through the set of data D. The prediction at the point u(N+1) can then be written as:

$$\hat{y}(N+1) = \hat{K} \cdot u(N+1) \tag{11}$$

A different approach can be taken for the prediction of y(N+1) at the point u(N+1) using the same set of data. Assuming that the outputs y(k) have a joint normal (Gaussian) distribution¹, the Gaussian process prior is placed directly on the space of functions (in this case on the space of gains K_1, ..., K_N) that can model the data. The difference between equation (11) and equation (8) is obvious. Given K̂, equation (11) gives the prediction ŷ(N+1) at the new input u(N+1). However, the prediction based on Gaussian processes is more powerful. Given C_N, v_{N+1} and ν (which are functions of all the inputs and of the hyperparameters) and the vector of noisy outputs y_N, the prediction at the new input u(N+1) has a Gaussian distribution. That means that at the point u(N+1) the output y(N+1) is most likely to be the mean prediction μ_{ŷ(N+1)}. The level of confidence in this prediction is given by the variance σ²_{ŷ(N+1)} about the mean. The variance can be seen as "error bars" for this prediction (Fig. 1b).

¹ The set of outputs y(1), ..., y(N) can be seen as the set of random variables K_1 u(1), ..., K_N u(N).
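A short usage sketch tying the pieces together on this gain example, in the spirit of Fig. 1: the data, gain and hyperparameter values are made up for illustration, and gp_predict and build_covariances are the sketches given above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy data from y = K*u + noise, as in Fig. 1a (values are illustrative)
K_true, noise_std = 2.0, 0.1
u = rng.uniform(0.0, 1.0, size=20)
y = K_true * u + noise_std * rng.normal(size=20)

# Least-squares gain estimate and the point prediction of equation (11)
K_hat = (u @ y) / (u @ u)
u_new = 0.5
y_ls = K_hat * u_new

# Gaussian process prediction of equation (8) with assumed hyperparameters
Phi_N = u.reshape(-1, 1)
C_N, v, nu = build_covariances(Phi_N, np.array([u_new]),
                               w0=1.0, w=np.array([1.0]), w_eta=noise_std**2)
mean, var = gp_predict(C_N, v, nu, y)
print(y_ls, mean, np.sqrt(var))   # point estimate vs. mean with an error bar
```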
Fig. 1: a) Training set of data. b) The mean μ_{ŷ(N+1)} of the prediction distribution is the predicted value of ŷ(N+1) at u(N+1); the variance σ²_{ŷ(N+1)} gives error bars on the prediction.
b) Introduction to Gaussian Process Modelling of Dynamic Processes
Consider a first order discrete dynamic system described by the non-linear function:

$$y(k+1) = 0.95 \tanh\big(y(k)\big) + \sin\big(u(k)\big) + \eta(k) \tag{12}$$

where u(k) and y(k) are the input and the output of the system, y(k+1) is the one-step ahead predicted output and η(k) is white noise. To model the system using a Gaussian process model, the dimension of the input space must be selected. The problem is similar to the structure selection of an ARX model [13]. Since the system of equation (12) is of order one, the input space is two dimensional: one present input and one present output are required to make a one-step ahead prediction of a first order system. The selection of the input space is discussed in more detail in section c). The system was simulated to generate representative data and a Gaussian process model was trained as a one-step ahead prediction model. The training data points were arranged as follows:
$$\Phi_N = \begin{bmatrix}
u(1) & y(1) \\
u(2) & y(2) \\
\vdots & \vdots \\
u(k) & y(k) \\
\vdots & \vdots \\
u(N) & y(N)
\end{bmatrix} \tag{13}$$
$$\boldsymbol{y}_N = \left[y(2), y(3), \ldots, y(k+1), \ldots, y(N)\right]^{T}$$
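A sketch of generating data from system (12) and arranging it as in equation (13) is given below; the excitation signal is an arbitrary choice for illustration, not the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate the system of equation (12)
T, noise_var = 150, 0.01
u_sig = rng.uniform(-4.0, 4.0, size=T)     # illustrative excitation signal
y_sig = np.zeros(T + 1)
for k in range(T):
    y_sig[k + 1] = (0.95 * np.tanh(y_sig[k]) + np.sin(u_sig[k])
                    + np.sqrt(noise_var) * rng.normal())

# Arrange training points as in equation (13):
# input rows [u(k), y(k)], targets y(k+1)
Phi_N = np.column_stack([u_sig, y_sig[:-1]])
y_N = y_sig[1:]
```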
The noise variance was 0.01 and the sampling time was 0.1 sec. The training data is shown in Fig. 2.

Fig. 2: Training data set.

The maximum likelihood framework was used to determine the hyperparameters. Initially, the hyperparameters were set randomly and a conjugate gradient optimisation algorithm was used to search for their optimal values. The following set of hyperparameters was found: θ = [w_1, w_2, w_0, w_η]^T = [0.1971, 0.1796, 1.6931, 0.0104]^T. The hyperparameters w_1 and w_2 allow a different distance measure for each input dimension. For less important inputs, the corresponding w_l (where l = 1...d) will become small and the model will ignore that input. In this case w_1 and w_2 are close together, which means that the inputs u(k) and y(k) have equal weight in the prediction. The hyperparameter w_0 gives the overall scale of the local correlation, and w_η is the estimated variance of the noise. In this case the added noise η(k) was white; if the noise is correlated, the covariance function (10) can be modified as shown in [14]. A more detailed description of the hyperparameters can be found in [11] and [12].

This Gaussian process model was then tested on a validation data set. Fig. 3 shows the input u(k), the true noise-free output and the prediction ŷ(k+1). In the range of the operating space where the model was trained (white region in Fig. 3), the predicted output fits the true output well. As soon as the input moves away from the well-modelled region, the prediction no longer fits the true output. In the time interval between 8 and 11 seconds, the system was driven well into an untrained region of the operating space. The shaded band around the prediction represents error bars of ± one standard deviation from the predicted mean. The wider error bars in this region indicate that the model is less certain about its prediction.

Fig. 3: Validation of the model. In the range away from the operating space spanned by the training data, the prediction does not fit the true noise-free output well. The less certain prediction in this region is confirmed by the wider error bars.
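The maximum likelihood training described above can be sketched as follows: the negative log marginal likelihood of the training data, $-\log P(\boldsymbol{y}_N \mid \Phi_N, \theta) = \frac{1}{2}\boldsymbol{y}_N^T \mathbf{C}_N^{-1}\boldsymbol{y}_N + \frac{1}{2}\log|\mathbf{C}_N| + \frac{N}{2}\log 2\pi$, is minimised with a conjugate gradient method. SciPy's CG optimiser with numerical gradients is used here as a stand-in for the authors' optimiser, and the log-parameterisation keeping the hyperparameters positive is this sketch's assumption.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def neg_log_likelihood(log_theta, Phi_N, y_N):
    """Negative log marginal likelihood of the data under covariance (10)."""
    d = Phi_N.shape[1]
    w0 = np.exp(log_theta[0])
    w = np.exp(log_theta[1:1 + d])
    w_eta = np.exp(log_theta[-1])
    # Covariance matrix C_N of equation (9) with covariance function (10)
    diff = Phi_N[:, None, :] - Phi_N[None, :, :]
    C_N = w0 * np.exp(-0.5 * np.sum(w * diff ** 2, axis=2)) + w_eta * np.eye(len(y_N))
    chol = cho_factor(C_N)
    data_fit = 0.5 * y_N @ cho_solve(chol, y_N)
    complexity = np.sum(np.log(np.diag(chol[0])))   # equals 0.5 * log det C_N
    return data_fit + complexity + 0.5 * len(y_N) * np.log(2 * np.pi)

# Random initialisation, then a conjugate gradient search, as in the paper
d = Phi_N.shape[1]
theta0 = np.log(np.random.uniform(0.1, 1.0, size=d + 2))
res = minimize(neg_log_likelihood, theta0, args=(Phi_N, y_N), method='CG')
w0, w, w_eta = np.exp(res.x[0]), np.exp(res.x[1:1 + d]), np.exp(res.x[-1])
```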
The noise-free version of the system of equation (12) can be presented as a 3-D surface y(k+1) = f(u(k), y(k)), as shown in Fig. 4a. Fig. 4b shows the approximation of this surface by the Gaussian process model. Contour and gradient plots of the true process function and its Gaussian process model can be seen in Fig. 5. The dots represent the training set of data. It can be seen in the magnified portion of the plot that the model represents the true function well in the region where the model has been trained. Away from that region, however, the model is no longer a good representation of the true function.

Fig. 4: a) True process surface y(k + 1) = f(u(k), y(k)), b) Gaussian process approximation of the process surface.
Fig. 5: a) Contour and gradient plot of the true process function, b) Gaussian process approximation of the process function. The magnified portion of the plot shows a close-up of the operating space where the model was trained.
c) Gaussian Process Modelling of a Hydraulic Process
A hydraulic positioning system, consisting of a mass connected through a damper to a hydraulic force actuator, was used as a non-linear modelling example. The input to the system is the voltage applied to the proportional valve. The model of the actuator can be written as:

$$F = \frac{p_1}{A_1} - \frac{p_2}{A_2}, \quad \text{where:}$$
$$\dot{p}_1 = \frac{B}{V_1}\left(\xi f(v_i)\sqrt{p_s - p_1} - A_1 \dot{x}\right) \tag{14}$$
$$\dot{p}_2 = \frac{B}{V_2}\left(-\xi f(v_i)\sqrt{p_2} - A_1 \dot{x}\right)$$

p_1 and p_2 are the pressures in the actuator cylinder and p_s is the supply pressure. Here f(v_i) is assumed to be a linear function of the input voltage v_i. The output of the system is the force F produced by the actuator. The system is fourth order, and the output force is a non-linear function of the mass position x, its velocity ẋ, and the input voltage. A Simulink model of the hydraulic system was used to generate the training and the validation data sets.

To examine the effect of the selection of the dimension of the input space on the prediction, eleven Gaussian process models were trained as one-step ahead predictors. For each of them a different dimension of the input space was chosen, and the cost over the test set of data was calculated. The cost was formulated as:

$$J_D = \sum_{k=1}^{N} e^2(k), \quad \text{where} \quad e(k) = y(k) - \hat{y}(k), \quad D = \text{dimension of the input space} \tag{15}$$

The dimension of the input space in this case corresponds to the number of vectors representing past inputs and outputs:

$$\begin{aligned}
d = 2 &\Rightarrow \Phi_N = \big[u(k),\, y(k)\big] \\
d = 4 &\Rightarrow \Phi_N = \big[u(k),\, y(k),\, u(k-1),\, y(k-1)\big] \\
d = 6 &\Rightarrow \Phi_N = \big[u(k),\, y(k),\, u(k-1),\, y(k-1),\, u(k-2),\, y(k-2)\big] \\
&\;\;\vdots
\end{aligned} \tag{16}$$

Fig. 6 shows the cost versus the dimension of the input space; a sketch of this dimension-selection loop is given below. The cost does not change significantly once the dimension increases above eight. A dimension of eight implies that the present and three delayed vectors of input/output pairs were used for the prediction. Since the modelled system was fourth order, the eight dimensional input space was expected to give the best dimension/cost trade-off. This illustrates the analogy between dimension selection for the Gaussian process model and structure selection for the ARX model.
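A sketch of the dimension-selection experiment of equations (15) and (16): train_gp is a hypothetical wrapper around the likelihood maximisation sketched earlier, gp_predict_mean composes the earlier build_covariances and gp_predict sketches, and u_train, y_train, u_test, y_test stand for the logged input/output signals.

```python
import numpy as np

def make_regressors(u_sig, y_sig, n_lags):
    """Rows [u(k), y(k), ..., u(k-n_lags), y(k-n_lags)], targets y(k+1)."""
    rows, targets = [], []
    for k in range(n_lags, len(u_sig) - 1):
        row = []
        for j in range(n_lags + 1):
            row += [u_sig[k - j], y_sig[k - j]]
        rows.append(row)
        targets.append(y_sig[k + 1])
    return np.array(rows), np.array(targets)

def gp_predict_mean(Phi_tr, y_tr, theta, x):
    """Mean prediction at a single regressor x (rebuilds C_N; fine for a sketch)."""
    w0, w, w_eta = theta
    C_N, v, nu = build_covariances(Phi_tr, x, w0, w, w_eta)
    return gp_predict(C_N, v, nu, y_tr)[0]

# Cost J_D of equation (15) for input space dimensions d = 2, 4, ..., 22
costs = {}
for n_lags in range(11):                   # eleven models, d = 2 * (n_lags + 1)
    Phi_tr, y_tr = make_regressors(u_train, y_train, n_lags)
    Phi_te, y_te = make_regressors(u_test, y_test, n_lags)
    theta = train_gp(Phi_tr, y_tr)         # hypothetical ML training wrapper
    y_hat = np.array([gp_predict_mean(Phi_tr, y_tr, theta, x) for x in Phi_te])
    costs[2 * (n_lags + 1)] = np.sum((y_te - y_hat) ** 2)
```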
Fig. 6: Cost vs. dimension of the input space
Fig. 7 shows the predictions of first order, fourth order and twelfth order Gaussian process models, together with the true noise-free output. The first order model does not provide a good fit, whilst the fourth and twelfth order models provide a much better fit to the true output.

Fig. 7: Prediction of the first order, fourth order and twelfth order Gaussian process models, and the true noise-free output.

In control applications, usually more than a one-step ahead prediction is needed. The fourth order Gaussian process model, trained as a one-step ahead predictor for the hydraulic actuator described in section c), was used as a multi-step ahead prediction model. The cost was formulated as:
$$J_s = \sum_{k=1}^{N} e^2(k), \quad \text{where} \quad e(k) = y(k) - \hat{y}(k), \quad s = \text{number of steps ahead predicted} \tag{17}$$

Fig. 8 shows the cost versus the number of steps ahead predicted. As expected, the one-step ahead prediction gives the lowest cost, but the cost does not increase significantly as the prediction horizon increases. After the maximum cost, corresponding to a three-step ahead prediction, the cost decreases; this is due to the fact that the predicted noise is fed through the model, which has a filtering effect on the multi-step ahead prediction. This can be seen clearly in Fig. 9. A sketch of this multi-step ahead prediction scheme is given below.
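A sketch of multi-step ahead prediction with the one-step ahead model: the predicted output is fed back into the regressor in place of the measured output. Here gp_predict_mean, Phi_tr, y_tr and theta are the names from the earlier dimension-selection sketch.

```python
import numpy as np

def multi_step_predict(u_future, regressor, theta, steps):
    """Iterate the one-step ahead GP model `steps` times.

    u_future  : known future inputs u(k), u(k+1), ...
    regressor : current regressor [u(k), y(k), u(k-1), y(k-1), ...]
    """
    x = np.array(regressor, dtype=float)
    predictions = []
    for s in range(steps):
        y_hat = gp_predict_mean(Phi_tr, y_tr, theta, x)   # one-step prediction
        predictions.append(y_hat)
        # Shift the delayed input/output pairs and feed the prediction back
        x = np.concatenate(([u_future[s], y_hat], x[:-2]))
    return np.array(predictions)
```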
Fig. 8: Cost vs. number of steps ahead predicted.

Fig. 9: To make a multi-step ahead prediction, the one-step ahead predicted output is fed back to the model input. This has a filtering effect on the noise.
III Conclusions

This paper introduced the Gaussian process modelling approach as a non-parametric approach to the modelling of non-linear dynamic systems. The formulation of the Gaussian process model was explained. It was shown that, as well as the prediction, the Gaussian process model also provides information about the confidence in that prediction. This can be useful in model based control applications, for example to determine when to switch from a finely tuned, model based controller to a robust, detuned PID controller. On the simulated example of the force actuator, the importance of selecting the dimension of the input space was examined. Multi-step ahead prediction using a one-step ahead prediction Gaussian process model was also examined.

Remark 1: Since the prediction of the model output requires the inversion of the covariance matrix C_N, of size N × N, the use of the Gaussian process model can become problematic for large amounts of data.

Remark 2: In this paper it was assumed that the output data was corrupted with white noise. The noise contribution was included in the covariance function (10) as w_η δ(m, n), where w_η is the estimate of the noise variance and δ(m, n) is the Kronecker delta. If the input data is also corrupted with noise, the Gaussian process model might be biased. The modification of the covariance function for correlated output noise shown in [14] suggests that the covariance function can be adapted to input noise as well as to output noise.

References
[1] R. Murray-Smith and T. A. Johansen (Eds.). Multiple Model Approaches to Modelling and Control. Taylor and Francis, London, 1997.

[2] K. S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4–27, Mar. 1990.

[3] T. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man and Cybernetics, SMC-15(1):116–132, Jan.–Feb. 1985.

[4] K. S. Narendra, J. Balakrishnan, and M. K. Ciliz. Adaptation and learning using multiple models, switching, and tuning. IEEE Control Systems Magazine, 15(3):37–51, Jun. 1995.

[5] R. N. Shorten, F. Ó Cairbre, and P. F. Curran. On the dynamic instability of a class of switching systems. In Proceedings of the IFAC Conference on Artificial Intelligence in Real Time Control, 2000.

[6] R. Shorten, R. Murray-Smith, and R. Bjørgan. On the interpolation of local models in blended multiple model structure. International Journal of Control, 72(7/8):620–628, 1999.

[7] R. Murray-Smith, T. A. Johansen, and R. Shorten. On transient dynamics, off-equilibrium behaviour and identification in blended multiple model structures. In European Control Conference, 1999.

[8] D. J. Leith and W. E. Leithead. Analytic framework for blended multiple model systems using linear local models. International Journal of Control, 72(7/8):605–619, 1999.

[9] A. O'Hagan. Curve fitting and optimal design for prediction (with discussion). Journal of the Royal Statistical Society B, 40(1):1–42, 1978.

[10] C. K. I. Williams. Prediction with Gaussian processes: From linear regression to linear prediction and beyond. In Learning and Inference in Graphical Models, 1998.

[11] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis, University of Cambridge, 1997.

[12] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 514–520. MIT Press, 1996.

[13] L. Ljung. System Identification: Theory for the User. Prentice Hall PTR, 2nd edition, Jan. 1999.

[14] R. Murray-Smith and A. Girard. Gaussian process priors with ARMA noise models. In Proceedings of the Irish Signals and Systems Conference, Maynooth, pages 147–152, June 2001.