4SID LINEAR REGRESSION

Bo Wahlberg and Magnus Jansson
S3 - Automatic Control, Royal Institute of Technology
S-100 44 Stockholm, SWEDEN
Email: [email protected]
Abstract
Recently, State Space Subspace System Identification (4SID) has been suggested as an alternative to more traditional prediction error system identification, such as ARX least squares estimation. The aim of this note is to analyse the connections between these two different approaches to system identification. The conclusion is that 4SID can be viewed as a linear regression multi-step ahead prediction error method, with certain rank constraints. This allows us to analyse 4SID methods within the standard framework of system identification and linear regression estimation. For example, it is shown that ARX models have nice properties in terms of 4SID identification. From a linear regression model, estimates of the extended observability matrix are found. Results from an asymptotic analysis are presented, i.e. explicit formulas for the asymptotic variances of the pole estimation error are given. From these expressions, some difficulties in choosing user specified parameters are pointed out in an example.

Keywords: System identification, parameter estimation, subspace methods.

1 Introduction

This work is a continuation of the efforts in Viberg et al. [10] and [11] to analyse and understand properties of State Space Subspace System Identification (4SID). This concept has been proposed and analysed in a series of papers by, e.g., De Moor and Van Overschee [6], [7], [8], Verhaegen [9], and Larimore [1], [2]. The schemes for subspace system identification to be considered are very attractive since they estimate a state-space representation directly from input-output data. This is done without requiring a canonical parametrization, as is often necessary in more traditional prediction error methods [3], [5]. Furthermore, they use reliable numerical tools such as QR-decomposition and Singular Value Decomposition (SVD), which leads to numerically effective implementations. Hence, the nonlinear iterative optimization schemes that usually are necessary in the traditional framework are avoided. However, at the same time, this introduces new problems in terms of understanding the properties of the algorithms. In the last few years several new 4SID algorithms have been tested out both in practice and by computer simulations. The results have been compared with results obtained with classical tools. In general, the 4SID approach seems to perform very well, even close to optimal. The motivation of our work is to shed some light on the understanding of 4SID methods. We will try to point out similarities and differences with traditional techniques, to explain why they work well, and perhaps to find situations where they do not. This may also lead to improved algorithms.

This paper compares the 4SID methods with traditional linear regression. Properties of ARX models in 4SID are studied in some detail. Subspace estimation from the linear regression model is discussed and different approaches are related to previously reported estimates. A result from an asymptotic analysis of the pole estimation error is also presented. Finally, a numerical example is included to highlight one problem in choosing some user provided parameters.

2 Models and Assumptions
Consider a linear time-invariant system with m input signals and l output signals. Assume that the dynamics of the system can be described by an nth order state space realization

$$x_{t+1} = A x_t + B u_t + w_t \qquad (1)$$
$$y_t = C x_t + D u_t + v_t. \qquad (2)$$

Here $x_t$ is the state vector, $u_t$ is the vector of input signals, $w_t$ represents the process noise, $y_t$ is the vector of output signals, and $v_t$ is additive measurement noise. The measurable input signals, $u_t$, are assumed to be independent of $w_t$ and $v_t$. Introduce the $(l\alpha \times 1)$ vector of stacked outputs as

$$y_\alpha(t) = \left[\, y_t^T \ \ y_{t+1}^T \ \cdots \ y_{t+\alpha-1}^T \,\right]^T \qquad (3)$$

where $\alpha > n$ is a user specified integer. Similarly, define the vectors of stacked future inputs, process and measurement noises as $u_\alpha(t)$, $w_\alpha(t)$, and $v_\alpha(t)$, respectively. The following input-output equation is then easily derived from the state space realization

$$y_\alpha(t) = \Gamma_\alpha x_t + \Phi_\alpha u_\alpha(t) + n_\alpha(t) \qquad (4)$$

where the term $n_\alpha(t) = \Psi_\alpha w_\alpha(t) + v_\alpha(t)$ contains
the noise, and where we have introduced the block matrices

$$\Gamma_\alpha = \begin{bmatrix} C \\ CA \\ \vdots \\ CA^{\alpha-1} \end{bmatrix} \qquad (5)$$

$$\Phi_\alpha = \begin{bmatrix} D & 0 & \cdots & 0 & 0 \\ CB & D & \ddots & & 0 \\ \vdots & \ddots & \ddots & \ddots & \vdots \\ CA^{\alpha-2}B & CA^{\alpha-3}B & \cdots & CB & D \end{bmatrix} \qquad (6)$$

$$\Psi_\alpha = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0 \\ C & 0 & \ddots & & 0 \\ \vdots & \ddots & \ddots & & \vdots \\ CA^{\alpha-2} & \cdots & C & 0 & 0 \end{bmatrix}. \qquad (7)$$

Notice that $\Gamma_\alpha$ is the extended observability matrix, and hence of rank n provided the system is observable. The key idea of subspace identification is to estimate the n-dimensional range space of $\Gamma_\alpha$. This can be done with more or less clever projections on the data. The framework can be a bit difficult in terms of advanced numerical linear algebra. Let us instead put the problem into a traditional linear regression framework.

Definition. The 4SID linear regression model is defined by

$$y_\alpha(t) = L_1 y_\beta(t-\beta) + L_2 u_\beta(t-\beta) + L_3 u_\alpha(t) + e(t). \qquad (8)$$

Here $\beta$ is another design variable whose interpretation will become clear later. The key observation is that if the noise vector $e(t)$ is orthogonal to $y_\beta(t-\beta)$, $u_\beta(t-\beta)$ and $u_\alpha(t)$, then

$$\mathrm{rank}\,[L_1\ L_2] = n \qquad (9)$$
$$[L_1\ L_2] = \Gamma_\alpha [K_1\ K_2]. \qquad (10)$$

This means that the range space of $[L_1\ L_2]$ equals the range space of the observability matrix. This observation is the backbone of 4SID. The explicit relation to linear regression will hopefully make certain things easier to understand. In the following two subsections we will sketch the connection of different systems to the linear regression model.
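The rank property (9)-(10) is easy to verify numerically. The following is a minimal sketch of our own (not part of the paper): an arbitrary noise-free second order SISO system with D = 0 is simulated, the unconstrained least squares estimate of [L1 L2 L3] is computed, and the rank and range space of [L1 L2] are compared with those of the extended observability matrix. The system matrices and the choices of alpha and beta are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta, N = 2, 4, 4, 2000

# An arbitrary stable second order SISO system, simulated without noise (D = 0).
A = np.array([[1.5, -0.7], [1.0, 0.0]])
B = np.array([1.0, 0.0])
C = np.array([[1.0, 0.5]])

u = rng.standard_normal(N)
x = np.zeros(n)
y = np.zeros(N)
for t in range(N):
    y[t] = (C @ x).item()
    x = A @ x + B * u[t]

def stack(s, t, k):
    """Stacked vector [s_t, s_{t+1}, ..., s_{t+k-1}]."""
    return s[t:t + k]

# Regressor [y_beta(t-beta); u_beta(t-beta); u_alpha(t)] and regressand y_alpha(t).
ts = range(beta, N - alpha + 1)
Phi = np.array([np.concatenate([stack(y, t - beta, beta),
                                stack(u, t - beta, beta),
                                stack(u, t, alpha)]) for t in ts])
Yst = np.array([stack(y, t, alpha) for t in ts])

# Unconstrained least squares estimate of [L1 L2 L3] in (8).
Lhat, *_ = np.linalg.lstsq(Phi, Yst, rcond=None)
L12 = Lhat[:2 * beta, :].T                     # [L1 L2], shape (alpha, 2*beta)

# Rank and range space: compare with the extended observability matrix (5).
Gamma = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(alpha)])
U, sv, _ = np.linalg.svd(L12)
print("singular values of [L1 L2]:", np.round(sv, 6))   # only n significantly nonzero
Un = U[:, :n]
print("distance from range(Gamma) to range([L1 L2]):",
      np.linalg.norm(Gamma - Un @ (Un.T @ Gamma)))       # close to zero
```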
2.1 Deterministic Systems
For the deterministic case, with no noise, a deadbeat observer can be constructed that estimates the state exactly in n steps. Hence the state vector can be obtained as

$$x_t = K_1 y_\beta(t-\beta) + K_2 u_\beta(t-\beta), \qquad \beta \geq n. \qquad (11)$$

Using this in (4) we obtain the 4SID linear regression form. Here we directly notice that the size of $\beta$ corresponds to the observer dynamics.
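As an illustration of the deadbeat observer argument behind (11), here is a small sketch of our own (not from the paper) for an arbitrarily chosen observable second order system: a gain K placing all eigenvalues of A - KC at the origin is computed with Ackermann's formula, and the observer state then agrees with the true state after n steps.

```python
import numpy as np

n = 2
A = np.array([[1.5, -0.7], [1.0, 0.0]])
B = np.array([1.0, 0.0])
C = np.array([[1.0, 0.5]])

# Deadbeat observer gain via Ackermann's formula: all eigenvalues of A - KC
# are placed at the origin, so (A - KC)^n = 0.
O = np.vstack([C, C @ A])                       # observability matrix
K = np.linalg.matrix_power(A, n) @ np.linalg.solve(O, np.array([0.0, 1.0]))
print("eigenvalues of A - KC:", np.linalg.eigvals(A - np.outer(K, C)))

# Run the system and the observer from different initial states (no noise);
# the estimation error vanishes (to machine precision) after n steps.
rng = np.random.default_rng(1)
x = np.array([1.0, -2.0])
xh = np.zeros(n)                                # deliberately wrong initial guess
for t in range(4):
    u = rng.standard_normal()
    y = (C @ x).item()
    xh = (A - np.outer(K, C)) @ xh + B * u + K * y
    x = A @ x + B * u
    print(f"t={t}: state error = {np.abs(x - xh).max():.2e}")
```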
2.2 Stochastic Systems
Any observable stochastic system can be written in 4SID linear regression form, with the key property (10) holding. The idea is just to use an optimal stochastic observer (Kalman filter). Let

$$\hat{x}_t = K_1 y_\beta(t-\beta) + K_2 u_\beta(t-\beta) + K_3 u_\alpha(t) \qquad (12)$$

be the linear mean square error optimal estimate of $x_t$, given $y_\beta(t-\beta)$, $u_\beta(t-\beta)$ and $u_\alpha(t)$. Then $x_t - \hat{x}_t$ is orthogonal to $y_\beta(t-\beta)$, $u_\beta(t-\beta)$ and $u_\alpha(t)$. Hence,

$$y_\alpha(t) = \Gamma_\alpha \hat{x}_t + \Phi_\alpha u_\alpha(t) + \Gamma_\alpha (x_t - \hat{x}_t) + n_\alpha(t)$$
$$= L_1 y_\beta(t-\beta) + L_2 u_\beta(t-\beta) + L_3 u_\alpha(t) + e(t) \qquad (13)$$

with $e(t) = \Gamma_\alpha (x_t - \hat{x}_t) + n_\alpha(t)$, which has the desired properties. In [7], the relation between subspace projections and Kalman filtering is discussed in detail.
3 ARX 4SID Linear Regression Models
Consider the SISO ARX model,

$$A(q) y_t = B(q) u_t + z_t \qquad (14)$$
$$A(q) = 1 + a_1 q^{-1} + \ldots + a_n q^{-n} \qquad (15)$$
$$B(q) = b_1 q^{-1} + \ldots + b_n q^{-n} \qquad (16)$$

where $q^{-1}$ is the delay operator, $q^{-1} u_t = u_{t-1}$, and $z_t$ is zero mean white noise with variance $\lambda$. The extension to MIMO ARX models is straightforward. A most important property of ARX models is that the corresponding k-step ahead predictor has finite memory. Rewrite the polynomial model into observer canonical innovation form

$$x_{t+1} = A x_t + B u_t + K z_t \qquad (17)$$
$$y_t = C x_t + z_t \qquad (18)$$

where

$$A = \begin{bmatrix} -a_1 & 1 & 0 & \cdots & 0 \\ -a_2 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ -a_{n-1} & 0 & \cdots & & 1 \\ -a_n & 0 & \cdots & & 0 \end{bmatrix} \qquad (19)$$
$$B = [\, b_1 \ \ldots \ b_n \,]^T \qquad (20)$$
$$K = [\, -a_1 \ \ldots \ -a_n \,]^T \qquad (21)$$
$$C = [\, 1 \ 0 \ \ldots \ 0 \,]. \qquad (22)$$
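The observer canonical innovation form (17)-(22) is straightforward to construct from the ARX coefficients. The sketch below is our own illustration with an arbitrary second order example; it also checks the nilpotency of A - KC that is used in the next paragraph.

```python
import numpy as np

def arx_to_innovation_form(a, b):
    """Observer canonical innovation form (17)-(22) from the coefficients
    a = [a_1, ..., a_n] of A(q) and b = [b_1, ..., b_n] of B(q)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    A = np.zeros((n, n))
    A[:, 0] = -a                       # first column: -a_1, ..., -a_n
    A[:-1, 1:] = np.eye(n - 1)         # shifted identity above the diagonal
    B = b.copy()
    K = -a.copy()
    C = np.zeros(n)
    C[0] = 1.0
    return A, B, K, C

# Example: A(q) = 1 - 1.5 q^{-1} + 0.7 q^{-2}, B(q) = q^{-1} + 0.5 q^{-2}.
A, B, K, C = arx_to_innovation_form([-1.5, 0.7], [1.0, 0.5])
AKC = A - np.outer(K, C)
print(AKC)                                # the shifted identity of eq. (23)
print(np.linalg.matrix_power(AKC, 2))     # the zero matrix: A - KC is nilpotent
```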
Observe that

$$A - KC = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & & 1 \\ 0 & 0 & \cdots & & 0 \end{bmatrix} \qquad (23)$$

is nilpotent, $(A - KC)^n = 0$, and the corresponding observer

$$\hat{x}_{t+1} = (A - KC)\hat{x}_t + B u_t + K y_t \qquad (24)$$

converges after n steps to the true state vector, i.e.

$$x_t = K_1 y_\beta(t-\beta) + K_2 u_\beta(t-\beta), \qquad \beta \geq n \qquad (25)$$

even if we have noise! Observe that we have the same expression as in the deterministic case. Hence $L_3 = \Phi_\alpha$ and $e(t) = n_\alpha(t)$. This allows us to characterize the noise,

$$e(t) = H_\alpha z_\alpha(t) + z_\alpha(t) \qquad (26)$$

where

$$H_\alpha = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ h_1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ h_{\alpha-1} & \cdots & h_1 & 0 \end{bmatrix}. \qquad (27)$$

Here $\{h_k\}$ is the impulse response of $1/A(q)$, i.e.

$$\frac{1}{A(q)} = 1 + h_1 q^{-1} + h_2 q^{-2} + \ldots \qquad (28)$$

The k-step ahead predictor is given by

$$\hat{y}_{t|t-k} = G_k(q) y_{t-k} + B(q) F_k(q) u_t \qquad (29)$$

where

$$G_k(q) = g_0 + g_1 q^{-1} + \ldots + g_{n-1} q^{-n+1} \qquad (30)$$
$$F_k(q) = 1 + f_1 q^{-1} + \ldots + f_{k-1} q^{-k+1} \qquad (31)$$

satisfy

$$\frac{1}{A(q)} = F_k(q) + q^{-k} \frac{G_k(q)}{A(q)} \qquad (32)$$
$$\Rightarrow \quad 1 = A(q) F_k(q) + q^{-k} G_k(q). \qquad (33)$$

Note that the k-step ahead prediction error equals $F_k(q) z_t$. This reveals that the 4SID regression model for ARX systems is nothing else but a combined k-step ahead prediction error model, $k = 1, \ldots, \alpha$,

$$y_\alpha(t) = \left[\, \hat{y}_{t|t-1}^T \ \ldots \ \hat{y}_{t+\alpha-1|t-1}^T \,\right]^T + n_\alpha(t). \qquad (34)$$

The coefficients of $L_1$, $L_2$ and $L_3$ are directly related to the coefficients of $G_k(q)$, $F_k(q)$ and $B(q)$. A couple of other useful observations are:
- $[\, \hat{y}_{t|t-1}^T \ \ldots \ \hat{y}_{t+\alpha-1|t-1}^T \,]^T$ is linear in the coefficients of $B(q)$. Hence, given an estimate of $A(q)$, the coefficients of $B(q)$ can be found using linear regression.
- The observability matrix is determined by the coefficients of $A(q)$. Moreover, the nullspace of the extended observability matrix can be linearly parameterized by the coefficients of $A(q)$.

For ARX models the noise $n_\alpha(t)$ is a stationary moving average process. The covariance of $n_\alpha(t)$ can be quite badly scaled, e.g. if the one step ahead prediction error is much smaller than the $\alpha$-step ahead prediction error. Hence, one should apply a weighting in the least squares fit, if one does not want to exaggerate the long range prediction properties. Since we cannot expect the residuals to be white, we should not expect efficient estimates using 4SID regression least squares estimation. This is also known from the fact that the one step ahead prediction is always best in terms of efficiency.
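To make the predictor polynomials concrete, here is a small sketch of our own (not the authors' code) that computes $F_k(q)$ and $G_k(q)$ from a given $A(q)$ by the polynomial division in (32)-(33); the coefficients of $F_k(q)$ are simply the first k impulse response coefficients of $1/A(q)$ in (28).

```python
import numpy as np

def k_step_predictor_polynomials(a, k):
    """F_k(q) and G_k(q) of (30)-(33) for A(q) = 1 + a_1 q^{-1} + ... + a_n q^{-n};
    a = [1, a_1, ..., a_n]. Returns (f, g) with f = [1, f_1, ..., f_{k-1}] and
    g = [g_0, ..., g_{n-1}]."""
    a = np.asarray(a, float)
    n = len(a) - 1
    # Long division: h_j are the impulse response coefficients of 1/A(q), eq. (28).
    h = np.zeros(k)
    h[0] = 1.0
    for j in range(1, k):
        h[j] = -sum(a[i] * h[j - i] for i in range(1, min(j, n) + 1))
    f = h                                  # F_k(q): the first k terms of 1/A(q)
    # 1 - A(q)F_k(q) only has powers q^{-k}, ..., q^{-(n+k-1)}; these give G_k(q).
    rem = -np.convolve(a, f)
    rem[0] += 1.0
    g = rem[k:k + n]
    return f, g

a = np.array([1.0, -1.5, 0.7])             # an arbitrary stable A(q)
for k in (1, 2, 4):
    f, g = k_step_predictor_polynomials(a, k)
    # Check the identity (33): 1 = A(q) F_k(q) + q^{-k} G_k(q).
    lhs = np.convolve(a, f)
    lhs[k:k + len(g)] += g
    unit = np.zeros_like(lhs)
    unit[0] = 1.0
    print(f"k={k}: F_k = {np.round(f, 3)}, G_k = {np.round(g, 3)},",
          "identity (33) holds:", np.allclose(lhs, unit))
```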
4 4SID Regression Estimation
Let us now go back to the linear regression model (8), and suppose data are given for $t = 1, \ldots, N$. The unconstrained least squares 4SID linear regression estimate is then given by

$$\min_{L_1, L_2, L_3} \ \sum_{t=1+\beta}^{N-\alpha+1} \left\| y_\alpha(t) - [L_1 y_\beta(t-\beta) + L_2 u_\beta(t-\beta) + L_3 u_\alpha(t)] \right\|^2. \qquad (35)$$

Notice that this is a standard least squares linear regression problem. We have, however, not used the fact that we know a lot about the structure of $L_1$, $L_2$ and $L_3$. For example, $L_3$ is lower triangular Toeplitz for deterministic and ARX systems. The least squares criterion is still quadratic in the elements of $L_3$ if this structure is imposed. Unfortunately, such an approach does not permit the simple recovery of the range space of the observability matrix as in the 4SID technique. Of course, we can write $\{L_i\}$ as functions of the system parameters, taking all structure into account. However, we then obtain a quite complicated nonlinear optimization problem. It makes more sense to use a weighted least squares cost function, which takes into account the difference in the k-step ahead prediction error variances,

$$\min_{L_1, L_2, L_3} \ \sum_{t=1+\beta}^{N-\alpha+1} \left\| y_\alpha(t) - [L_1 y_\beta(t-\beta) + L_2 u_\beta(t-\beta) + L_3 u_\alpha(t)] \right\|_W^2. \qquad (36)$$

Here $\|x\|_W^2 = x^T W x$. Using (8), the observations can be modeled by the matrix equation

$$Y_\alpha = L_1 Y_\beta + L_2 U_\beta + L_3 U_\alpha + E \qquad (37)$$
where the block Hankel matrix of outputs is defined as

$$Y_\alpha = [\, y_\alpha(1+\beta) \ \cdots \ y_\alpha(N-\alpha+1) \,] \qquad (38)$$

and similarly for $U_\alpha$, $Y_\beta$, $U_\beta$ and $E$; the past data matrices $Y_\beta$ and $U_\beta$ are built from $y_\beta(t-\beta)$ and $u_\beta(t-\beta)$ over the same range of t. The weighting matrix $W$ is omitted in the following, but can be incorporated by scaling with $W^{1/2}$ from the left. The least squares cost function then equals
$$\left\| Y_\alpha - [L_1 Y_\beta + L_2 U_\beta + L_3 U_\alpha] \right\|_F^2 \qquad (39)$$

where subscript F denotes the Frobenius norm. The unrestricted $L_3$ minimizing (39) is then given by

$$\hat{L}_3 = [Y_\alpha - L_1 Y_\beta - L_2 U_\beta] U_\alpha^T (U_\alpha U_\alpha^T)^{-1}. \qquad (40)$$

Introduce the projection matrix
$$\Pi^\perp = I - U_\alpha^T (U_\alpha U_\alpha^T)^{-1} U_\alpha. \qquad (41)$$

Substituting the estimate $\hat{L}_3$ into (39) leads to the cost function

$$\left\| Y_\alpha \Pi^\perp - [L_1 Y_\beta \Pi^\perp + L_2 U_\beta \Pi^\perp] \right\|_F^2. \qquad (42)$$
It is now possible to optimize this compressed cost function with respect to $L_1$ and $L_2$, and then in a second step take into account that $[L_1\ L_2] = \Gamma_\alpha [K_1\ K_2]$. This can be done by singular value decomposition (SVD). The minimal value of (42) is obtained for

$$[L_1\ L_2] = Y_\alpha \Pi^\perp P^T (P \Pi^\perp P^T)^{-1} \qquad (43)$$

where

$$p(t) = \begin{bmatrix} y_\beta(t-\beta) \\ u_\beta(t-\beta) \end{bmatrix}, \qquad P = \begin{bmatrix} Y_\beta \\ U_\beta \end{bmatrix} \qquad (44)$$

contains the "past" data. A possible estimate of the extended observability matrix is then given from the n principal left singular vectors of (43). Here a relation to the N4SID method [7] can be seen. The matrix used for subspace estimation in N4SID is $[L_1\ L_2] P$. We shall not elaborate further on this. A more natural approach is to directly minimize the cost function (42) subject to the rank condition on $[L_1\ L_2]$. This can be done as follows. Define
$$J = (P \Pi^\perp P^T)^{-1/2} \qquad (45)$$
$$L = [L_1\ L_2] \qquad (46)$$
$$K = [K_1\ K_2] J^{-1}. \qquad (47)$$

Then

$$\left\| Y_\alpha \Pi^\perp - [L_1 Y_\beta \Pi^\perp + L_2 U_\beta \Pi^\perp] \right\|_F^2 = \left\| Y_\alpha \Pi^\perp - L P \Pi^\perp \right\|_F^2$$
$$= \left\| Y_\alpha \Pi^\perp P^T J - \Gamma_\alpha K \right\|_F^2 + \left\| Y_\alpha \Pi^\perp \right\|_F^2 - \left\| Y_\alpha \Pi^\perp P^T J \right\|_F^2. \qquad (48)$$

Introduce the SVD of the matrix
$$Y_\alpha \Pi^\perp P^T J = Q \Sigma S^T. \qquad (49)$$

Then the optimal rank n value of $\Gamma_\alpha$ is obtained by

$$\hat{\Gamma}_\alpha = \text{first } n \text{ columns of } Q \Sigma^{1/2}. \qquad (50)$$

With a weighting matrix $W$ one would take the SVD of $W^{1/2} Y_\alpha \Pi^\perp P^T J$, pick the first n columns and then multiply with $W^{-1/2}$. Equation (50) is exactly the PO-MOESP method of Verhaegen [9]. If the weighting matrix, $W$, is chosen as
$$W = (Y_\alpha \Pi^\perp Y_\alpha^T)^{-1} \qquad (51)$$

we have the canonical variate analysis (CVA) solution of Larimore, see [1], [2] and also [8]. We call this the CVA solution even if Larimore does not explicitly use the extended observability matrix. Actually, a generalization of the canonical variate analysis method is proposed in [1], [2], which replaces the weighting (51) with a general matrix. In [12] it was conjectured that (49), by the general theory of instrumental variables, would be the optimal matrix to use for subspace estimation. Note that there is a typing error in [12]. There, in our notation, it reads $Y_\alpha \Pi^\perp P^T J^2$, but it was supposed to be $Y_\alpha \Pi^\perp P^T J$. For a closer look at the connection between different 4SID methods we refer to [8] and [12].
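To make the algebra concrete, the following is a rough numerical sketch of our own (not the authors' implementation) of the unweighted estimate (50): block data matrices as in (38) and (44) are formed from simulated data, the future inputs are projected out with $\Pi^\perp$ from (41), J is computed as in (45), and $\hat{\Gamma}_\alpha$ is read off from the SVD (49). The simulated system, the noise level, and the choices of alpha and beta are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, alpha, beta, N = 2, 5, 5, 4000

# True SISO system in innovation form (D = 0), used only to generate data.
A = np.array([[1.5, -0.7], [1.0, 0.0]])
B = np.array([1.0, 0.5])
C = np.array([[1.0, 0.0]])
K = np.array([0.5, -0.3])

u = rng.standard_normal(N)
z = 0.1 * rng.standard_normal(N)                  # innovations
x = np.zeros(n)
y = np.zeros(N)
for t in range(N):
    y[t] = (C @ x).item() + z[t]
    x = A @ x + B * u[t] + K * z[t]

def stacked_data_matrix(s, start, depth, cols):
    """Columns are [s_t, ..., s_{t+depth-1}] for t = start, ..., start+cols-1."""
    return np.array([s[start + j: start + j + depth] for j in range(cols)]).T

cols = N - alpha - beta + 1
Ya = stacked_data_matrix(y, beta, alpha, cols)    # Y_alpha ("future" outputs)
Ua = stacked_data_matrix(u, beta, alpha, cols)    # U_alpha ("future" inputs)
Yb = stacked_data_matrix(y, 0, beta, cols)        # Y_beta  ("past" outputs)
Ub = stacked_data_matrix(u, 0, beta, cols)        # U_beta  ("past" inputs)
P = np.vstack([Yb, Ub])                           # past data matrix, eq. (44)

# Projection (41) applied implicitly: M Pi = M - (M Ua^T)(Ua Ua^T)^{-1} Ua.
def project(M):
    return M - (M @ Ua.T) @ np.linalg.solve(Ua @ Ua.T, Ua)

YaPi, PPi = project(Ya), project(P)

# J = (P Pi P^T)^{-1/2} from (45), via an eigendecomposition.
M = PPi @ P.T
M = 0.5 * (M + M.T)
w, V = np.linalg.eigh(M)
J = V @ np.diag(w ** -0.5) @ V.T

# SVD (49) and estimate (50) of the extended observability matrix.
Q, sv, _ = np.linalg.svd(YaPi @ P.T @ J)
Gamma_hat = Q[:, :n] * np.sqrt(sv[:n])
print("singular values:", np.round(sv, 3))        # the first n should dominate

# The estimate should span (approximately) the same column space as Gamma_alpha.
Gamma = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(alpha)])
Qh, _ = np.linalg.qr(Gamma_hat)
print("subspace mismatch:", np.linalg.norm(Gamma - Qh @ (Qh.T @ Gamma)))
```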
5 Analysis

In this section we will briefly state the result of an asymptotic analysis. The expressions given come from a straightforward extension of the analysis by Viberg, Ottersten, Wahlberg and Ljung [10], [11]. For characterizing the quality of the estimates a canonical parameter set has to be considered. Here the system poles are chosen. Suppose an estimate of the extended observability matrix is found by (50). A common way to obtain the system matrix, A, is to use the shift invariance structure of $\Gamma_\alpha$. Thus,

$$\hat{A} = (J_1 \hat{\Gamma}_\alpha)^\dagger (J_2 \hat{\Gamma}_\alpha) \qquad (52)$$
where $(\cdot)^\dagger$ denotes the pseudo-inverse and the selection matrices $J_i$ are defined by

$$J_1 = [\, I_{(\alpha-1)l} \ \ 0_{(\alpha-1)l \times l} \,] \qquad (53)$$
$$J_2 = [\, 0_{(\alpha-1)l \times l} \ \ I_{(\alpha-1)l} \,]. \qquad (54)$$
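A small self-contained sketch of our own (not from the paper) of the shift invariance step (52)-(54): a subspace estimate of $\Gamma_\alpha$ is only determined up to a right multiplication by an invertible matrix T, but $\hat{A} = (J_1 \Gamma_\alpha T)^\dagger (J_2 \Gamma_\alpha T) = T^{-1} A T$, so the eigenvalues, i.e. the poles, are still recovered.

```python
import numpy as np

rng = np.random.default_rng(3)
n, alpha, l = 2, 5, 1

A = np.array([[1.5, -0.7], [1.0, 0.0]])
C = np.array([[1.0, 0.0]])

# Extended observability matrix and a mock "subspace estimate" of it:
# any basis Gamma @ T of the same range space, here with a random invertible T.
Gamma = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(alpha)])
T = rng.standard_normal((n, n))
Gamma_hat = Gamma @ T

# Selection matrices (53)-(54): drop the last / the first block row.
J1 = np.hstack([np.eye((alpha - 1) * l), np.zeros(((alpha - 1) * l, l))])
J2 = np.hstack([np.zeros(((alpha - 1) * l, l)), np.eye((alpha - 1) * l)])

# Shift invariance estimate (52): A_hat = (J1 Gamma_hat)^+ (J2 Gamma_hat).
A_hat = np.linalg.pinv(J1 @ Gamma_hat) @ (J2 @ Gamma_hat)

print("true poles:     ", np.sort_complex(np.linalg.eigvals(A)))
print("estimated poles:", np.sort_complex(np.linalg.eigvals(A_hat)))
```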
In the following it is assumed that the system is given in its diagonal form, i.e. A is diagonal with the system poles, $\lambda_k$, on the diagonal. It can be shown that $\sqrt{N}\,\tilde{\lambda}_k$, where $\tilde{\lambda}_k = \hat{\lambda}_k - \lambda_k$ denotes the pole estimation error, has a limiting zero mean Gaussian distribution. The elements of the asymptotic covariance, $N\,\mathrm{E}[\tilde{\lambda}_k \tilde{\lambda}_l]$ and $N\,\mathrm{E}[\tilde{\lambda}_k \tilde{\lambda}_l^*]$, are given by explicit expressions obtained by extending the analysis in [10], [11].