The Generalized Cross Validation Filter

Giulio Bottegal^a and Gianluigi Pillonetto^b

^a Department of Electrical Engineering, TU Eindhoven, Eindhoven, The Netherlands (e-mail: [email protected])

^b Department of Information Engineering, University of Padova, Padova, Italy (e-mail: [email protected])

arXiv:1706.02495v1 [stat.ML] 8 Jun 2017

⋆ This research has been partially supported by the MIUR FIRB project RBFR12M3AC ("Learning meets time: a new computational approach to learning in dynamic systems") and by the Progetto di Ateneo CPDA147754/14 ("New statistical learning approach for multi-agents adaptive estimation and coverage control"). This paper was not presented at any IFAC meeting. Corresponding author: Gianluigi Pillonetto, Ph. +390498277607.

Abstract Generalized cross validation (GCV) is one of the most important approaches used to estimate parameters in the context of inverse problems and regularization techniques. A notable example is the determination of the smoothness parameter in splines. When the data are generated by a state space model, like in the spline case, efficient algorithms are available to evaluate the GCV score with complexity that scales linearly in the data set size. However, these methods are not amenable to on-line applications since they rely on forward and backward recursions. Hence, if the objective has been evaluated at time t − 1 and new data arrive at time t, then O(t) operations are needed to update the GCV score. In this paper we instead show that the update cost is O(1), thus paving the way to the on-line use of GCV. This result is obtained by deriving the novel GCV filter which extends the classical Kalman filter equations to efficiently propagate the GCV score over time. We also illustrate applications of the new filter in the context of state estimation and on-line regularized linear system identification. Key words: Kalman filtering; generalized cross-validation; on-line system identification; inverse problems; regularization; smoothness parameter; splines

1 Introduction

Linear state space models assume the form

x_{k+1} = A_k x_k + ω_k,
y_k = C_k x_k + e_k,   (1)

where x_k is the state at instant k, y_k is the output, while ω_k and e_k are random noises. The matrices A_k and C_k regulate the state transition and the observation model at instant k. This kind of model plays a central role in the analysis and design of discrete-time systems [17]. Applications abound and include tracking, navigation and biomedicine. In on-line state estimation, the problem is the reconstruction of the values of x_k from measurements of y_k collected over time. When the matrices A_k and C_k and the noise covariances are known, the optimal linear estimates are efficiently returned by the classical Kalman filter [1]. However, in


many circumstances there can be unknown model parameters that also need to be inferred from data in an on-line manner, e.g. variance components or entries of the transition/observation matrices. One can interpret such parameters as additional states. Then, the extended Kalman filter [16] or more sophisticated stochastic techniques, such as particle filters and Markov chain Monte Carlo [10,22,3,9], can be used to track the filtered posterior. Another technique consists of propagating the marginal likelihood of the unknown parameters via a bank of filters [1, Ch. 10]. In this paper, we will show that another viable alternative is an approach known in the literature as generalized cross validation (GCV) [12]. In the statistics and inverse problems literature, GCV is widely used in off-line contexts to estimate unknown parameters entering regularized estimators [5,37,40]. This approach was initially used to tune the smoothness parameter in ridge regression and smoothing splines [14,12,33]. GCV is now also popular in machine learning, where it is used to improve the generalization capability of regularized kernel-based approaches [34,8], such as regularization networks, which contain spline regression as a special case [31,11]. To introduce GCV in our state space context, we first recall that smoothing splines are closely linked to state space models of m-fold integrated Wiener processes [25]; it then appears natural to extend GCV to general state space models. To this end, assume that measurements y_k have been


collected up to instant t and stacked in the vector Y_t. Denote with Ŷ_t the vector containing the optimal linear output estimates¹ and use H_t to denote the influence matrix satisfying


Ŷ_t = H_t Y_t. Then, the parameter estimates achieved by GCV minimize

GCV_t = S_t / (t (1 − δ_t/t)²),   (2)

where S_t is the sum of squared residuals, i.e. S_t = ‖Ŷ_t − Y_t‖²,

and δ_t are the degrees of freedom, given by the trace of H_t, i.e. δ_t = Tr(H_t). In the objective (2), the term S_t accounts for the goodness of fit, while δ_t takes values in [0, t] and measures model complexity. In fact, in nonparametric regularized estimation, the degrees of freedom δ_t can be seen as the counterpart of the number of parameters entering a parametric model [20,13,26]. GCV is supported by important asymptotic results; also, for finite data set sizes it often turns out to be a good approximation of the output mean squared error [7]. It is worth stressing that such properties have been derived without postulating the correctness of the prior models describing the output data [38,39]. In control, this means that GCV can compensate for possible modeling mismatch affecting the state space description.

Despite these nice features, the use of GCV within the control community appears limited, in particular in on-line contexts. One important reason is the following. For state space models, there exist efficient algorithms which, for a given parameter vector, return the GCV score with O(t) operations [18,4]; see also [15,35,21] for procedures dedicated to smoothing splines. All of these techniques are, however, not suited to on-line computations, since they rely on forward and backward recursions. Hence, if GCV_{t−1} is available and new data arrive at time t, another O(t) operations are needed to obtain GCV_t. In this paper, we will instead show that the update cost is O(1), thus paving the way to a more pervasive on-line use of GCV. This result is obtained by deriving the novel GCV filter, which consists of an extension of the classical Kalman equations. Thanks to it, one can run a bank of filters (possibly in parallel) to efficiently propagate GCV over a grid of parameter values. This makes the proposed GCV filter particularly suitable for applications where the measurement model admits a state space description whose dynamics depend on a few parameters; see e.g. the next section for an application in numerical differentiation. In this framework, an implementation of the GCV filter via a bank of parallel filters turns out to be computationally attractive.

The paper is organized as follows. In Section 2, first some additional notation is introduced; then, the GCV filter is presented. Its asymptotic properties are discussed in Section 3. In Section 4 we illustrate some applications, including smoothing splines and on-line regularized linear system identification with the stable spline kernel used as a stochastic model for the impulse response [28,29]. Conclusions end the paper, while the correctness of the GCV filter is shown in the Appendix.
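Before turning to the filter, it is useful to make the quantities entering (2) concrete. The sketch below (a minimal Python illustration with hypothetical names, not the authors' implementation) evaluates the GCV score in batch form from the influence matrix; its cost grows with t, which is precisely what the GCV filter avoids.

```python
import numpy as np

def gcv_score(H, y):
    """Batch evaluation of the GCV objective (2) for a linear
    smoother with influence matrix H, i.e. y_hat = H @ y."""
    t = y.size
    y_hat = H @ y
    S = np.sum((y - y_hat) ** 2)   # sum of squared residuals S_t
    dof = np.trace(H)              # degrees of freedom delta_t
    return S / (t * (1.0 - dof / t) ** 2)
```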

2 The GCV filter

2.1 State space model

First, we provide full details about our measurement model. We use x ∼ (a, b) to denote a random vector x with mean a and covariance matrix b. Then, our state space model is defined by

x_{k+1} = A_k x_k + ω_k,   (3a)
y_k = C_k x_k + e_k, k = 1, 2, ...,   (3b)
x_1 ∼ (μ, P_0),   (3c)
ω_k ∼ (0, Q_k),   (3d)
e_k ∼ (0, γ),   (3e)

where the initial condition x_1 and all the noises {ω_k, e_k}_{k=1,2,...} are mutually uncorrelated. We do not specify any particular distribution for these variables, since the GCV score does not depend on the particular noise distribution.² If x_1, ω_k, e_k are Gaussian, then the Kalman filter provides the optimal state estimate in the mean-square sense; in the other cases, the Kalman filter corresponds to the best linear state estimator [1]. In addition, just to simplify the notation, the measurements y_k are assumed scalar, so that γ represents the noise variance. Some of the parameters in (3) may be unknown; they could enter A_k, C_k, Q_k and P_0, but we do not stress this possible dependence, so as to keep the formulas readable. The matrix P_0 is assumed to be independent of γ. The parameter γ is typically unknown, being connected to the ratio between the measurement noise variance and the variance of the driving noise; it corresponds to the regularization parameter in the smoothing spline context described in the example below.


Example 1 (Smoothing splines [30]) Function estimation and numerical differentiation are often required in various applications; these include input reconstruction in nonlinear dynamic systems, as described e.g. in [30]. Assume that one is interested in determining the first m derivatives of a continuous-time signal measured with non-uniform sampling periods T_k. Modeling the signal as an m-fold integrated Wiener process, one obtains the stochastic interpretation of m-th order smoothing splines [40]. In particular, one can use (3) to represent the signal dynamics, with the state collecting the signal and its derivatives: the transition matrix is lower triangular with entries [A_k]_{ij} = T_k^{i−j}/(i−j)! for i ≥ j, the driving noise covariance has entries

[Q_k]_{ij} = T_k^{i+j−1} / ((i−1)! (j−1)! (i+j−1)),

and C selects the signal component of the state. For instance, for m = 2 (the cubic spline case used in Section 4.1, where the signal is the second state entry),

A_k = [1 0; T_k 1],   Q_k = [T_k T_k²/2; T_k²/2 T_k³/3],   C = [0 1].

Such a model depends on the measurement noise variance γ, making this application particularly suited for the GCV filter.

¹ The components of Ŷ_t are thus given by C x̂_{k|t}, where the smoothed state x̂_{k|t} can be obtained, for any t, with O(t) operations by a fixed-interval Kalman smoothing filter [32,19].

² Of course, GCV may turn out not to be effective if the noises are highly non-Gaussian. Different approaches, like particle filters, should instead be used if linear estimators perform poorly, e.g. due to multimodal noise distributions.
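As an illustration of Example 1, the following sketch builds the matrices of (3) for the m-fold integrated Wiener process. The state ordering (signal in the last entry) is an assumption consistent with Section 4.1, where, for m = 2, the noiseless output is the second state entry; the function name is illustrative.

```python
import numpy as np
from math import factorial

def spline_state_space(T, m=2):
    """Matrices of model (3) for an m-fold integrated Wiener process
    sampled with period T (Example 1); illustrative sketch."""
    A = np.zeros((m, m))
    Q = np.zeros((m, m))
    for i in range(1, m + 1):
        for j in range(1, m + 1):
            if i >= j:
                A[i - 1, j - 1] = T ** (i - j) / factorial(i - j)
            Q[i - 1, j - 1] = T ** (i + j - 1) / (
                factorial(i - 1) * factorial(j - 1) * (i + j - 1))
    C = np.zeros((1, m))
    C[0, -1] = 1.0                 # observe the signal (last entry)
    return A, C, Q

# m = 2 (cubic splines): A = [[1, 0], [T, 1]], C = [0, 1],
# Q = [[T, T**2/2], [T**2/2, T**3/3]].
```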

2.2 The GCV filter

The GCV filter equations are now reported. Below, x̂_k denotes the optimal linear one-step-ahead state prediction, having covariance P_k; its dynamics are regulated by the classical Kalman filter via (5a), (5c) and the Riccati equation (5e).

Initialization:

x̂_1 = μ,   ζ̂_1 = 0,   (4a)
P_1 = P_0,   Σ_1 = 0,   (4b)
δ_1 = 1 − γ (C_1 P_0 C_1^T + γ)^{-1},   (4c)
S_1 = γ² (y_1 − C_1 μ)² / (C_1 P_0 C_1^T + γ)²,   (4d)
GCV_1 = S_1 / (1 − δ_1)².   (4e)

Update, for k = 1, 2, ...:

K_k = A_k P_k C_k^T (C_k P_k C_k^T + γ)^{-1},   (5a)
G_k = [A_k Σ_k C_k^T − K_k (C_k Σ_k C_k^T + 1)] (C_k P_k C_k^T + γ)^{-1},   (5b)
x̂_{k+1} = A_k x̂_k + K_k (y_k − C_k x̂_k),   (5c)
ζ̂_{k+1} = (A_k − K_k C_k) ζ̂_k + G_k (y_k − C_k x̂_k),   (5d)
P_{k+1} = (A_k − K_k C_k) P_k (A_k − K_k C_k)^T + γ K_k K_k^T + Q_k,   (5e)
Σ_{k+1} = (A_k − K_k C_k) Σ_k (A_k − K_k C_k)^T + K_k K_k^T,   (5f)
δ_{k+1} = δ_k + 1 − γ (C_{k+1} Σ_{k+1} C_{k+1}^T + 1) / (C_{k+1} P_{k+1} C_{k+1}^T + γ),   (5g)
S_{k+1} = S_k + γ² (C_{k+1} Σ_{k+1} C_{k+1}^T + 1) (y_{k+1} − C_{k+1} x̂_{k+1})² / (C_{k+1} P_{k+1} C_{k+1}^T + γ)² + 2γ² (y_{k+1} − C_{k+1} x̂_{k+1}) C_{k+1} ζ̂_{k+1} / (C_{k+1} P_{k+1} C_{k+1}^T + γ),   (5h)
GCV_{k+1} = (k + 1) S_{k+1} / (k + 1 − δ_{k+1})².   (5i)

It is apparent that the difference w.r.t. the classical Kalman filter is the presence of the additional state ζ̂_k (of the same dimension as x̂_k), whose dynamics are regulated by (5d). Note that this equation is driven by the same innovation y_k − C_k x̂_k, but the Kalman gain K_k of (5a) is replaced by the gain G_k defined in (5b). In turn, such gain depends on Σ_k, which is propagated over time through the modified version of the Riccati equation given by (5f). The degrees of freedom δ_k, the sum of squared residuals S_k and the GCV score are then updated through (5g), (5h) and (5i).

Fig. 1. Block scheme of the GCV filter: the nonlinear blocks f and g in the bottom row are defined, respectively, by (5h) and (5i), while δ_k is recursively computed by (5g).
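A compact implementation of (4)-(5) helps fix the bookkeeping. The following Python sketch (scalar output and time-invariant matrices for brevity; names are illustrative and this is not the authors' reference implementation, a MATLAB one being linked in the Conclusions) processes one measurement per call at O(1) cost per sample.

```python
import numpy as np

class GCVFilter:
    """Sketch of the GCV filter (4)-(5): scalar output, fixed
    candidate gamma, time-invariant (A, C, Q). Attributes mirror
    the paper: x ~ x_hat_k, z ~ zeta_hat_k, P ~ P_k, Sig ~ Sigma_k,
    dof ~ delta_k, ssr ~ S_k."""

    def __init__(self, A, C, Q, P0, gamma, mu):
        self.A, self.C, self.Q, self.g = A, C, Q, gamma
        self.x = mu.astype(float)
        self.z = np.zeros_like(self.x)
        self.P = P0.astype(float)
        self.Sig = np.zeros_like(self.P)
        self.k = 0
        self.dof = self.ssr = self.gcv = None

    def update(self, y):
        """Process one scalar measurement; return the GCV score."""
        A, C, g = self.A, self.C, self.g
        s = float(C @ self.P @ C.T) + g        # C P C' + gamma
        q = float(C @ self.Sig @ C.T) + 1.0    # C Sig C' + 1
        e = float(y - C @ self.x)              # innovation
        if self.k == 0:                        # (4c)-(4d)
            self.dof = 1.0 - g / s
            self.ssr = g**2 * e**2 / s**2
        else:                                  # (5g)-(5h)
            self.dof += 1.0 - g * q / s
            self.ssr += (g**2 * q * e**2 / s**2
                         + 2.0 * g**2 * e * float(C @ self.z) / s)
        self.k += 1
        self.gcv = self.k * self.ssr / (self.k - self.dof)**2   # (5i)
        # propagate x_hat, zeta_hat, P and Sigma via (5a)-(5f)
        K = (A @ self.P @ C.T) / s                               # (5a)
        G = (A @ self.Sig @ C.T - K * q) / s                     # (5b)
        AKC = A - K @ C
        self.x = A @ self.x + (K * e).ravel()                    # (5c)
        self.z = AKC @ self.z + (G * e).ravel()                  # (5d)
        self.P = AKC @ self.P @ AKC.T + g * (K @ K.T) + self.Q   # (5e)
        self.Sig = AKC @ self.Sig @ AKC.T + K @ K.T              # (5f)
        return self.gcv
```

Running one such object per candidate value of γ (possibly in parallel) gives the bank of filters mentioned above, with the best current γ selected at any time as the minimizer of the stored scores; a time-varying C_k can be handled by overwriting the stored C between updates, as done in the sketch of Section 4.3.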

3 Asymptotic properties

3.1 Asymptotic behavior of the GCV filter

We first consider the case where the state space model (3) is time-invariant, i.e. the matrices A_k, C_k, Q_k are constant in k. The structure of the equations governing the GCV filter permits to easily understand its asymptotic behaviour. In particular, exploiting well-known properties of the Kalman filter [1], the following result is obtained (see the Appendix for a proof).

Proposition 1 Assume that the system (3) is time-invariant, stabilizable and detectable. Then, for any P_0 we have

lim_{k→∞} P_k = P̄   and   lim_{k→∞} Σ_k = Σ̄,

where P̄ and Σ̄ are the unique symmetric and positive semidefinite matrices solving, respectively, the algebraic Riccati equation

P̄ = A P̄ A^T + Q − A P̄ C^T (C P̄ C^T + γ)^{-1} C P̄ A^T   (6)

and the Lyapunov equation

Σ̄ = (A − K̄C) Σ̄ (A − K̄C)^T + K̄ K̄^T,   (7)

where K̄ = A P̄ C^T (C P̄ C^T + γ)^{-1}. In addition, all the roots of the matrix A − K̄C are inside the unit circle, so that the (asymptotic) GCV filter is asymptotically stable.

Properties of the GCV filter can be also characterized in the time-varying case. In particular, following Section 2 of [2], one can first replace stabilizability and detectability with the assumptions of uniform stabilizability and detectability. Then, following the same reasonings contained in the proof of Proposition 1, Theorem 5.3 in [2] ensures the uniform exponential stability of the GCV filter.

3.2 Fast regularization parameter tuning and the smoothing ratio

Proposition 1 also leads to a new computationally appealing approach to tune the regularization parameter γ, e.g. in smoothing splines. In particular, consider the scenario described in [21], where an unknown function has to be reconstructed by spline regression from equally spaced noisy samples. When the assumptions in Proposition 1 hold true, it is possible to compute off-line the gains

K̄ = A P̄ C^T (C P̄ C^T + γ)^{-1},   Ḡ = [A Σ̄ C^T − K̄ (C Σ̄ C^T + 1)] (C P̄ C^T + γ)^{-1},

with P̄ and Σ̄ defined, respectively, by (6) and (7). Then, one can exploit the asymptotic (suboptimal) GCV filter, with the guarantee that the objective values will converge to the exact GCV scores as k increases. Moreover, in off-line contexts this approach appears computationally appealing even when compared to the many GCV-based spline algorithms developed in the last decades [41,40,18,15,35]. Furthermore, [21] defined the asymptotic smoothing ratio as lim_{k→∞} δ_k/k, also providing an interesting closed-form expression for the cubic spline case, useful to further speed up the tuning of γ. For the general case, we notice that Proposition 1 also gives a numerical procedure to compute the asymptotic smoothing ratio (for different values of γ). In fact, if (3) is stabilizable and detectable, combining (5g) and Proposition 1 we obtain

lim_{k→∞} δ_k/k = 1 − γ (C Σ̄ C^T + 1) / (C P̄ C^T + γ).
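Numerically, P̄, Σ̄ and the asymptotic smoothing ratio can be obtained by simply iterating (5e)-(5f) to convergence; the following sketch (illustrative, assuming a time-invariant, stabilizable and detectable model as in Proposition 1) does exactly that.

```python
import numpy as np

def asymptotic_gcv_gains(A, C, Q, gamma, n_iter=2000):
    """Approximate P_bar, Sigma_bar, K_bar, G_bar of Section 3 by
    iterating (5e)-(5f), together with the asymptotic smoothing
    ratio lim delta_k / k."""
    n = A.shape[0]
    P = np.eye(n)                      # any initialization works (Prop. 1)
    S = np.zeros((n, n))
    for _ in range(n_iter):
        s = float(C @ P @ C.T) + gamma
        K = (A @ P @ C.T) / s
        AKC = A - K @ C
        P = AKC @ P @ AKC.T + gamma * (K @ K.T) + Q    # (5e) -> (6)
        S = AKC @ S @ AKC.T + K @ K.T                  # (5f) -> (7)
    s = float(C @ P @ C.T) + gamma
    q = float(C @ S @ C.T) + 1.0
    K = (A @ P @ C.T) / s
    G = (A @ S @ C.T - K * q) / s
    ratio = 1.0 - gamma * q / s        # lim delta_k / k
    return P, S, K, G, ratio
```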

4 Numerical Examples

4.1 Spline example

We consider the reconstruction of the function exp(sin 8t), taken from [29], from samples collected at 400 instants t_i randomly generated from a uniform distribution on [0, 1]. The measurement noise is Gaussian with standard deviation equal to 0.3. We model f as the two-fold integral of white noise, setting m = 2 in the time-varying state space model reported in Example 1. This corresponds to reconstructing f using cubic smoothing splines [40]. We use Z_t to denote the vector containing the noiseless outputs (corresponding to the second entries of {x_k}_{k=1}^t) and denote the average of Z_t by the scalar quantity Z̄_t. Then, the performance measure is the percentage fit

F_t = 100% (1 − ‖Z_t − Ẑ_t‖ / ‖Z_t − 1 Z̄_t‖),   (8)

where Ẑ_t is the estimate of Z_t obtained through the Kalman smoother [1], and 1 is a column vector with all entries equal to 1. The following two different estimators Ẑ_t are tested:

• GCV: this approach estimates γ exploiting the GCV filter. More specifically, the GCV score is propagated over a grid containing 100 values of γ logarithmically spaced on the interval [10^{-2}, 10^4]. Then, at any t the estimate Ẑ_t is computed by a Kalman smoothing filter which exploits the γ_t that minimizes GCV_t.
• Oracle: the same as GCV except that γ_t maximizes the fit F_t. Note that this approach is not implementable in practice, since it uses an oracle that knows the noiseless (unavailable) output Z_t.

The left panel of Fig. 2 displays the noiseless output (solid line), the measurements (◦) and the function estimate returned by GCV (dashed line), which turns out to be close to f. The right panel also shows that the GCV filter is able to track well, in an on-line manner, the time course of the γ returned by Oracle.


Fig. 2. Cubic spline example - Section 4.1. Left: noiseless output (solid line), measurements (◦) and cubic spline estimate obtained by GCV (dashed line). Right: Estimated regularization parameter γt , as a function of time, obtained by Oracle maximizing the fit Ft in eq. 8 (solid line) and by GCV minimizing the score GCVt computed by eq. 5i (dashed line).


Fig. 3. Model mismatch example - Section 4.2. Left: noiseless output (solid line), measurements (◦) and smoothed output obtained by GCV (dashed line). Right: Estimated noise variance γt , as a function of time, obtained by Oracle (solid line) and by GCV (dashed line).

4.2 GCV capability to compensate for model mismatch

We consider the following discrete-time model (see also [23, Section 6]):

x_{k+1} = [0.7 0; 0.1 1] x_k + ω_k,   (9a)
y_k = [0 1] x_k + e_k,   (9b)

with zero-mean Gaussian noises of covariances

Q = [11.81 0.625; 0.625 11.81],   γ = 30.

We will use data generated by this model to test the capability of the GCV filter to compensate for mismatches between the true system and the model used to track the data by tuning γ in an on-line manner. As in the previous example, Z_t is the vector containing the first t noiseless outputs (which are the second entries of {x_k}_{k=1}^t) and the performance measure is (8). The following three different estimators Ẑ_t are tested:

• GCV: this approach uses a wrong transition covariance given by

Q̃ = Q + [0 0; 0 100],

and then estimates γ exploiting the GCV filter over a grid with 100 values logarithmically spaced on [10^{-2}, 10^4].

Then, at any t the estimate Ẑ_t is computed by a Kalman smoothing filter which exploits the γ_t minimizing GCV_t.
• Oracle: the same as GCV except that γ_t maximizes the fit F_t in (8).
• Nominal: the estimate Ẑ_t is returned by a Kalman smoothing filter defined by the nominal wrong covariance Q̃ and γ = 30. Thus, this approach does not try to compensate for model mismatch, since it does not tune γ from data.

The left panel of Fig. 3 displays the noiseless output (solid line), the measurements (◦) and the smoothed output obtained by GCV (dashed line), which appears close to the truth. In the right panel, one can also see the trajectory in time of the γ_t returned by Oracle and by GCV. One can appreciate the capability of the GCV filter to compensate for the modelling mismatch by tracking a regularization parameter leading to a high fit F_t. To further support these findings, we have also performed a Monte Carlo study of 100 runs. During each run, 200 output measurements are generated using (9) and Z_t is reconstructed by GCV, Oracle and Nominal. From the MATLAB boxplots of the 100 fits (8) reported in Fig. 4, the robustness of GCV emerges clearly: its performance is in fact very close to that of the oracle-based procedure.

Fig. 4. Model mismatch example - Section 4.2. Boxplots of the 100 fits F_200, as defined in eq. 8, obtained after a Monte Carlo study by the estimators Oracle, GCV and Nominal.

4.3 On-line regularized linear system identification

Now, we consider a linear system identification problem where the aim is to estimate an unknown impulse response from input-output measurements. Assuming a high-order FIR model, the outputs collected up to instant t, stacked in the (column) vector Y_t, satisfy

Y_t = Φ_t g + E_t,   (10)

where g denotes the m-dimensional vector whose components are the impulse response coefficients, the regression matrix Φ_t is defined by the input samples, and E_t is the measurement noise vector, which we assume white and Gaussian.

To solve this problem, we use the kernel-based approach originally proposed in [28,27,6]. The impulse response estimate is given by

arg min_{g∈R^m} ‖Y_t − Φ_t g‖² + γ g^T P_0^{-1} g.   (11)

It makes use of the regularization matrix P_0 induced by the so-called first-order spline kernel, i.e. its (i, j) entry is

[P_0]_{ij} = α^{max(i,j)},   0 ≤ α < 1,

where α is a hyperparameter which regulates the rate of decay to zero of the components of g. We refer the reader also to [29] for further details on the advantages of (11) over classical parametric approaches.

In real applications, both γ and α are unknown. Since we consider a situation where g has to be estimated on-line, we will estimate these two hyperparameters by the GCV filter. To do that, we first notice that (11) corresponds to the maximum a posteriori (MAP) estimator of g and, under the stated Gaussian assumptions, also to its minimum mean-square error (MMSE) estimator. The estimate of g can then be computed using the Kalman filter. In fact, the state space model is

x_{k+1} = x_k,
y_k = C_k x_k + e_k, k = 1, 2, ...,
x_1 ∼ (0, P_0),
e_k ∼ (0, γ),   (12)

where the state vector is the stochastic model for g (with x_k = g for any k), y_k and e_k are the k-th entries of Y_t and E_t, respectively, and C_k is the k-th row of Φ_t. We define a grid in the plane (γ, α) taking the points such that α is in the set {0.5, 0.6, ..., 0.9, 0.95, 0.99}, while γ assumes values in a logarithmically spaced grid of 20 points between 10^{-2} and 10^3. In this way, the grid consists of 140 points. We run 140 GCV filters in parallel, each corresponding to one of the points; when a new measure y_k (and C_k) is available, we update the GCV score of each pair (γ, α), selecting the one giving the minimum score. A sketch of such a bank of filters is given below.
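The sketch below illustrates this bank of filters in Python, reusing the GCVFilter sketch of Section 2.2 (again an illustration under stated assumptions, not the authors' code); the grid values follow the text, C_k is time-varying and is overwritten before each update, and, for brevity, the selection is shown only at the final time, whereas in the experiments it is refreshed at every t.

```python
import numpy as np

def first_order_spline_P0(m, alpha):
    """Regularization matrix of (11): [P0]_ij = alpha**max(i, j)."""
    idx = np.arange(1, m + 1)
    return alpha ** np.maximum.outer(idx, idx)

def run_bank(Y, Phi, gammas, alphas):
    """Bank of GCV filters on the (gamma, alpha) grid for model (12):
    static state x_k = g, C_k = k-th row of Phi."""
    m = Phi.shape[1]
    best_score, best_pair = np.inf, None
    for alpha in alphas:
        P0 = first_order_spline_P0(m, alpha)
        for gamma in gammas:
            f = GCVFilter(np.eye(m), Phi[:1], np.zeros((m, m)),
                          P0, gamma, np.zeros(m))
            for y_k, c_k in zip(Y, Phi):
                f.C = c_k.reshape(1, -1)      # time-varying C_k
                score = f.update(y_k)
            if score < best_score:            # selection at final t
                best_score, best_pair = score, (gamma, alpha)
    return best_pair, best_score

gammas = np.logspace(-2, 3, 20)
alphas = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
```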

We test the obtained GCV filter for on-line regularized system identification on a set of 100 Monte Carlo runs. At any run, a random impulse response of length m = 200 is generated using the same mechanism described in [24, Section 7.4]. The generated system is fed with a white noise sequence of unit variance. Note that this type of input is persistently exciting and guarantees the observability of the system (12) (see e.g. [2]), avoiding the covariance windup phenomenon [36]. The standard deviation of the measurement noise is that of the 200 noiseless outputs divided by 10. We assume the system is at rest (the input is equal to zero) prior to the data collection. The performance of the estimator is evaluated by means of the fit (as a function of time)

F_t = 100% (1 − ‖g^i − ĝ^i_t‖ / ‖g^i − 1 ḡ^i‖),   (13)

where g^i is the impulse response generated at the i-th Monte Carlo run, ḡ^i its mean, and ĝ^i_t its estimate (the impulse response estimate is a function of the time instant t).

An example of one of the Monte Carlo runs is given in Fig. 5, which shows the evolution in time of the impulse response estimate and its fit. About 50 measurements suffice for the GCV filter to achieve an appreciable fit. The overall results of the Monte Carlo experiment are summarized in Fig. 6, which depicts the average fit of the impulse responses as a function of time. It can be seen that, after a short transient phase, the fit increases monotonically and achieves a high average value.

5 Conclusions

The novel filter presented here allows one to propagate the GCV score efficiently over time. Hence, unknown parameters entering a state space model can now be estimated in an on-line manner by resorting to one of the most important techniques used for parameter estimation. The asymptotic properties of the GCV filter also provide a new, very efficient way to estimate the regularization parameter, e.g. in smoothing splines. Applications of the new filter have been illustrated using artificial data regarding function estimation and on-line regularized linear system identification.

A MATLAB implementation of the GCV filter is available at the web page http://www.dei.unipd.it/~giapi/.

6 Appendix

6.1 Derivation of the GCV filter

Without loss of generality, we set the initial system condition to zero, i.e. μ = 0. We also use X_t, Y_t and E_t to denote the column vectors containing the states, the outputs and the measurement noises up to instant t, i.e.

X_t = [x_1^T ... x_t^T]^T,   Y_t = [y_1 ... y_t]^T,   E_t = [e_1 ... e_t]^T.

Then, it holds that

Y_t = O_t X_t + E_t,

where O_t = diag{C_1, ..., C_t} is the regression matrix built using the measurement matrices C_k, k = 1, ..., t. We also use W_t and V_t to denote the state and output covariance matrices, i.e.

W_t := Var(X_t),   (14)
V_t := Var(Y_t) = O_t W_t O_t^T + γ I_t,   (15)

where I_t is the t×t identity matrix. Note that, using the above notation, the smoothed estimate of Y_t, already encountered in Section 1, is

Ŷ_t = O_t W_t O_t^T V_t^{-1} Y_t,   (16)

so that the degrees of freedom at instant t turn out to be

δ_t = Tr(O_t W_t O_t^T V_t^{-1}).   (17)

The following simple lemma is useful for our future developments.

Lemma 2 One has

δ_t = t − γ ∂ log det V_t / ∂γ,   (18)
S_t = −γ² ∂(Y_t^T V_t^{-1} Y_t) / ∂γ.   (19)

Proof: In view of (15), we start by noticing that

γ V_t^{-1} = I_t − O_t W_t O_t^T V_t^{-1}.   (20)

Then, (18) is obtained from the following equalities

γ ∂ log det V_t / ∂γ = γ Tr(V_t^{-1} ∂V_t/∂γ) = γ Tr(V_t^{-1}) = Tr(I_t − O_t W_t O_t^T V_t^{-1}) = t − δ_t,   (21)

where the last two passages exploit (20) and (17), respectively. Eq. (19) is instead proved as follows:

−γ² ∂(Y_t^T V_t^{-1} Y_t)/∂γ = γ² Y_t^T V_t^{-2} Y_t = Y_t^T (I_t − O_t W_t O_t^T V_t^{-1})^T (I_t − O_t W_t O_t^T V_t^{-1}) Y_t = ‖Y_t − Ŷ_t‖² = S_t,

where the second and third equalities exploit (20) and (16), respectively. □
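As a numerical sanity check (not part of the paper), identity (18) can be verified by finite differences, building V_t from (14)-(15) for a small time-invariant instance of (3); all names below are illustrative.

```python
import numpy as np

def check_lemma2(A, C, Q, P0, gamma, t=25, h=1e-6):
    """Finite-difference check of (18): delta_t should equal
    t - gamma * d(log det V_t)/d(gamma)."""
    n = A.shape[0]

    def build(g):
        # prior covariances P_i of x_i and joint covariance W_t
        covs = [P0]
        for _ in range(t - 1):
            covs.append(A @ covs[-1] @ A.T + Q)
        W = np.zeros((t * n, t * n))
        for i in range(t):
            blk = covs[i]              # Cov(x_i, x_j) = P_i (A^T)^(j-i)
            for j in range(i, t):
                W[i*n:(i+1)*n, j*n:(j+1)*n] = blk
                W[j*n:(j+1)*n, i*n:(i+1)*n] = blk.T
                blk = blk @ A.T
        O = np.kron(np.eye(t), C)      # O_t = diag{C, ..., C}
        OWO = O @ W @ O.T
        return OWO + g * np.eye(t), OWO

    V, OWO = build(gamma)
    dof = np.trace(OWO @ np.linalg.inv(V))           # delta_t via (17)
    ld_plus = np.linalg.slogdet(build(gamma + h)[0])[1]
    ld_minus = np.linalg.slogdet(build(gamma - h)[0])[1]
    fd = t - gamma * (ld_plus - ld_minus) / (2 * h)  # rhs of (18)
    return dof, fd                                   # the two should match
```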



Fig. 5. On-line system identification - Section 4.3. Left: true impulse response (solid line) and GCV estimates obtained at time instants k = 10, 50, 200. Right: Fit obtained by GCV as a function of time.

Fig. 6. On-line system identification - Section 4.3. Average of the GCV fits (solid line) ± one standard deviation (dashed line), as a function of time, obtained after a Monte Carlo study. At any of the 100 runs a new impulse response was randomly generated as detailed in [24, Section 7.4].

The dynamics of the matrix P_k in the GCV filter are regulated by the discrete-time Riccati equation (5e), which can be also rewritten as

P_{k+1} = A_k P_k A_k^T + Q_k − A_k P_k C_k^T (C_k P_k C_k^T + γ)^{-1} C_k P_k A_k^T.

It is now easy to see that the matrix Σ_k entering the GCV filter is the partial derivative of P_k w.r.t. γ. In fact, differentiating this Riccati recursion and adopting the notation Σ_k := ∂P_k/∂γ, one has

Σ_{k+1} = A_k Σ_k A_k^T − A_k Σ_k C_k^T (C_k P_k C_k^T + γ)^{-1} C_k P_k A_k^T − A_k P_k C_k^T (C_k P_k C_k^T + γ)^{-1} C_k Σ_k A_k^T + A_k P_k C_k^T (C_k P_k C_k^T + γ)^{-2} C_k P_k A_k^T (C_k Σ_k C_k^T + 1).

Exploiting the definition of K_k and rearranging the terms, the recursive formula (5f) is obtained.

Now, consider the dynamics of the predicted state

x̂_{k+1} = A_k x̂_k + A_k P_k C_k^T (C_k P_k C_k^T + γ)^{-1} (y_k − C_k x̂_k).

We now show that ζ̂_k is the partial derivative of x̂_k w.r.t. γ. In fact, differentiating the above equation and adopting the notation ζ̂_k := ∂x̂_k/∂γ, one obtains

ζ̂_{k+1} = A_k ζ̂_k + A_k Σ_k C_k^T (C_k P_k C_k^T + γ)^{-1} (y_k − C_k x̂_k) − A_k P_k C_k^T (C_k Σ_k C_k^T + 1)(C_k P_k C_k^T + γ)^{-2} (y_k − C_k x̂_k) − A_k P_k C_k^T (C_k P_k C_k^T + γ)^{-1} C_k ζ̂_k.

This, combined with the definition of K_k, leads to the recursive formula (5d). Now, exploiting well-known properties of the innovations sequence {y_k − C_k x̂_k}_{k=1}^t, whose variances are {C_k P_k C_k^T + γ}_{k=1}^t, and recalling that Σ_k := ∂P_k/∂γ, we have

∂ log det V_t / ∂γ = ∑_{k=1}^t ∂ log(C_k P_k C_k^T + γ)/∂γ = ∑_{k=1}^t (C_k Σ_k C_k^T + 1)/(C_k P_k C_k^T + γ).

Then, the recursive formula (5g) for the degrees of freedom δ_k is obtained by combining the above equation and (18).

Still using properties of the innovations sequence, and recalling that ζ̂_k := ∂x̂_k/∂γ, one also has

−∂(Y_t^T V_t^{-1} Y_t)/∂γ = −∑_{k=1}^t ∂[(y_k − C_k x̂_k)² (C_k P_k C_k^T + γ)^{-1}]/∂γ = ∑_{k=1}^t [(C_k Σ_k C_k^T + 1)/(C_k P_k C_k^T + γ)² (y_k − C_k x̂_k)² + 2 C_k ζ̂_k (y_k − C_k x̂_k)/(C_k P_k C_k^T + γ)].

This equation, in combination with (19), proves the correctness of the update rule (5h) for the sum of squared residuals S_k and completes the derivation.

6.2 Proof of Proposition 1

If the system (3) is stabilizable and detectable, then standard properties of the algebraic Riccati equation (6) ensure that P̄ is symmetric and positive semidefinite and that the Kalman filter, corresponding to (5a), (5c) and (5e), is asymptotically stable (see [1, p. 77]). Because the Kalman filter is asymptotically stable, the matrix A − K̄C has all the eigenvalues inside the unit circle, ensuring that (7) admits a unique positive semidefinite solution [1, p. 67]. The filter state transition matrix which regulates the joint dynamics of x̂_k and ζ̂_k is

[A − K_k C, 0; −G_k C, A − K_k C],

and so it also has all eigenvalues inside the unit circle, at least for sufficiently large k. □
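The claim Σ_k = ∂P_k/∂γ underlying the derivation of Section 6.1 can also be checked numerically by finite differences (again an illustrative sketch, not from the paper):

```python
import numpy as np

def check_sigma(A, C, Q, P0, gamma, steps=30, h=1e-6):
    """Numerical check that the Sigma_k produced by (5f) equals
    dP_k/dgamma, the key fact behind the derivation."""

    def riccati(g):
        P = P0.copy()
        for _ in range(steps):
            s = float(C @ P @ C.T) + g
            K = (A @ P @ C.T) / s
            AKC = A - K @ C
            P = AKC @ P @ AKC.T + g * (K @ K.T) + Q    # (5e)
        return P

    P, Sig = P0.copy(), np.zeros_like(P0)
    for _ in range(steps):
        s = float(C @ P @ C.T) + gamma
        K = (A @ P @ C.T) / s
        AKC = A - K @ C
        Sig = AKC @ Sig @ AKC.T + K @ K.T              # (5f)
        P = AKC @ P @ AKC.T + gamma * (K @ K.T) + Q    # (5e)
    fd = (riccati(gamma + h) - riccati(gamma - h)) / (2 * h)
    return np.max(np.abs(Sig - fd))                    # should be ~0
```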

References

[1] B.D.O. Anderson and J.B. Moore. Optimal Filtering. Prentice-Hall, Englewood Cliffs, N.J., USA, 1979.
[2] B.D.O. Anderson and J.B. Moore. Detectability and stabilizability of time-varying discrete-time linear systems. SIAM Journal on Control and Optimization, 19(1):20–32, 1981.
[3] C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
[4] C.F. Ansley and R. Kohn. Efficient generalized cross-validation for state space models. Biometrika, 74(1):139–148, 1987.
[5] M. Bertero. Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics, 75:1–120, 1989.
[6] T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularizations and Gaussian processes - revisited. Automatica, 48(8):1525–1535, 2012.
[7] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31:377–403, 1979.
[8] T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.
[9] R. Frigola, F. Lindsten, T.B. Schön, and C.E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS), 2013.
[10] W.R. Gilks, S. Richardson, and D.J. Spiegelhalter. Markov Chain Monte Carlo in Practice. Chapman and Hall, London, 1996.
[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.
[12] G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.
[13] T.J. Hastie, R.J. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Data Mining, Inference and Prediction. Springer, Canada, 2001.
[14] A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
[15] M.G. Hutchinson and F.R. De Hoog. Smoothing data with spline functions. Numerische Mathematik, 47:99–106, 1985.
[16] A. Jazwinski. Stochastic Processes and Filtering Theory. Dover, 1970.
[17] R.E. Kalman. A new approach to linear filtering and prediction problems. Trans. of the ASME - Journal of Basic Engineering, 82:35–45, 1960.
[18] R. Kohn and C.F. Ansley. A fast algorithm for signal extraction, influence and cross validation in state space models. Biometrika, 76(1):65–79, 1989.
[19] L. Ljung and T. Kailath. A unified approach to smoothing formulas. Automatica, 12(2):147–157, 1976.
[20] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4:415–447, 1992.
[21] G. De Nicolao, G. Ferrari Trecate, and G. Sparacino. Fast spline smoothing via spectral factorization concepts. Automatica, 36:1733–1739, 2000.
[22] B. Ninness and S. Henriksen. Bayesian system identification via MCMC techniques. Automatica, 46(1):40–51, 2010.
[23] H. Ohlsson, F. Gustafsson, L. Ljung, and S. Boyd. State smoothing by sum-of-norms regularization. In 49th IEEE Conference on Decision and Control (CDC), pages 2880–2885. IEEE, 2010.
[24] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, and L. Ljung. Regularized linear system identification using atomic, nuclear and kernel-based norms: The role of the stability constraint. Automatica, 69:137–149, 2016.
[25] G. Pillonetto and A. Chiuso. Fast computation of smoothing splines subject to equality constraints. Automatica, 45(12):2842–2849, 2009.
[26] G. Pillonetto and A. Chiuso. Tuning complexity in regularized kernel-based regression and linear system identification. Automatica, 58(2):106–117, 2015.
[27] G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: a nonparametric Gaussian regression approach. Automatica, 47(2):291–305, 2011.
[28] G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81–93, 2010.
[29] G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: a survey. Automatica, 50(3):657–682, 2014.
[30] G. Pillonetto and M.P. Saccomani. Input estimation in nonlinear dynamic systems using differential algebra concepts. Automatica, 42:2117–2129, 2006.
[31] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78:1481–1497, 1990.
[32] H.E. Rauch, F. Tung, and C.T. Striebel. Maximum likelihood estimates of linear dynamic systems. AIAA Journal, 3(8):1145–1150, 1965.
[33] J. Rice. Choice of smoothing parameter in deconvolution problems. Contemporary Mathematics, 59:137–151, 1986.
[34] B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[35] B.W. Silverman. Some aspects of the spline approach to nonparametric regression curve fitting. Journal of the Royal Statistical Society, 47:1–52, 1985.
[36] B. Stenlund and F. Gustafsson. Avoiding windup in recursive parameter estimation. Preprints of Reglermöte 2002, pages 148–153, 2002.
[37] A. Tarantola. Inverse Problem Theory and Methods for Model Parameter Estimation. SIAM, Philadelphia, 2005.
[38] G. Wahba. Bayesian confidence intervals for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B (Methodological), 45(1):133–150, 1983.
[39] G. Wahba. A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem. The Annals of Statistics, 13(4):1378–1402, 1985.
[40] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, 1990.
[41] H.L. Weinert. Fast Compact Algorithms and Software for Spline Smoothing. Springer, 2013.