
Skew-t Inference with Improved Covariance Matrix Approximation

arXiv:1603.06216v1 [cs.SY] 20 Mar 2016

Henri Nurminen, Tohid Ardeshiri, Robert Piché, and Fredrik Gustafsson

Abstract—Filtering and smoothing algorithms for linear discrete-time state-space models with skew-t distributed measurement noise are presented. The proposed algorithms improve upon our earlier proposed filter and smoother, which use the mean field variational Bayes approximation of the posterior distribution for a skew-t likelihood and a normal prior. Our simulations show that the proposed variational Bayes approximation gives a more accurate approximation of the posterior covariance matrix than our earlier proposed method. Furthermore, the novel filter and smoother outperform our earlier proposed methods and conventional low-complexity alternatives in both accuracy and speed.


Index Terms—skew t, skewness, t-distribution, robust filtering, Kalman filter, variational Bayes, RTS smoother, truncated normal distribution

I. INTRODUCTION

Asymmetric and heavy-tailed noise processes are present in many inference problems. In radio signal based distance estimation [1]–[3], for example, obstacles cause large positive errors that dominate over the symmetrically distributed errors from other sources [4]. The skew t-distribution [5]–[7] is the generalization of the t-distribution that has the modeling flexibility to capture both the skewness and the heavy-tailedness of such noise processes. To exemplify this, Fig. 1 illustrates the contours of the likelihood function for three independent range measurements where some of the measurements are positive outliers. In this example, skew-t, t, and normal likelihoods are compared. The skew-t likelihood gives a more realistic spread of the probability mass than the normal and t likelihoods.

Filtering and smoothing algorithms for linear discrete-time state-space models with skew-t measurement noise using a variational Bayes (VB) method are presented in [8]. This filter is applied to indoor localization with real ultra-wideband data in [9]. This letter proposes improvements to the filter and smoother proposed in [8]. Analogous to [8], the measurement noise is modeled by the skew t-distribution, and the proposed filter and smoother use a VB approximation of the posterior. However, the main contributions of this letter are (1) a new factorization of the approximate posterior distribution, (2) the application of an existing method for approximating the statistics of a truncated multivariate normal distribution (TMND), and (3) a proof of optimality for a truncation ordering in the approximation of the moments of the TMND. A TMND is a multivariate normal distribution whose support is restricted (truncated) by linear constraints and that is re-normalized to integrate to unity.

H. Nurminen and R. Piché are with the Department of Automation Science and Engineering, Tampere University of Technology (TUT), PO Box 692, 33101 Tampere, Finland (e-mails: [email protected], [email protected]). H. Nurminen receives funding from the TUT Graduate School, the Foundation of Nokia Corporation, and Tekniikan edistämissäätiö. T. Ardeshiri and F. Gustafsson are with the Department of Electrical Engineering, Linköping University, 58183 Linköping, Sweden (e-mails: [email protected], [email protected]). T. Ardeshiri receives funding from the Swedish Research Council (VR), project Scalable Kalman Filters.

Fig. 1. The contours of the likelihood function for three range measurements for the normal (left), t (middle) and skew-t (right) measurement noise models are presented. The t and skew-t likelihoods can handle one outlier (upper row), while only the skew-t model can handle the two positive outlier measurements (bottom row) due to its asymmetry. The likelihoods’ parameters are selected such that the first two moments of the normal, t and skew-t PDFs coincide.

The aforementioned contributions improve the estimation performance by reducing the covariance underestimation common to most VB inference algorithms [10, Chapter 10]. To our knowledge, VB approximations have been applied to the skew t-distribution only in our work [8], [9] and by Wand et al. [11].

II. PROBLEM FORMULATION

Consider the linear and Gaussian state evolution model

$$x_{k+1} = A x_k + w_k, \qquad w_k \overset{\mathrm{iid}}{\sim} N(0, Q), \tag{1a}$$
$$p(x_1) = N(x_1; x_{1|0}, P_{1|0}), \tag{1b}$$

where $N(\cdot;\mu,\Sigma)$ denotes a (multivariate) normal PDF with mean $\mu$ and covariance matrix $\Sigma$; $A \in \mathbb{R}^{n_x \times n_x}$ is the state transition matrix; and $x_k \in \mathbb{R}^{n_x}$, indexed by $1 \le k \le K$, is the state to be estimated with initial prior distribution (1b), where the subscript "$a|b$" is read "at time $a$ using measurements up to time $b$". Further, consider the measurements $y_k \in \mathbb{R}^{n_y}$ to be governed by the measurement equation

$$y_k = C x_k + e_k, \qquad [e_k]_i \overset{\mathrm{iid}}{\sim} \mathrm{ST}(0, R_{ii}, \Delta_{ii}, \nu_i), \tag{2}$$

where the measurement noise distribution is a product of independent univariate skew t-distributions. This model is justified in applications where one-dimensional data from different sensors can be assumed to have statistically independent noise [9]. The PDF and the first two moments of the skew t-distribution can be found in [9] and [12], respectively. The model (2) admits the hierarchical representation

$$y_k \,|\, x_k, u_k, \Lambda_k \sim N(C x_k + \Delta u_k, \Lambda_k^{-1} R), \tag{3a}$$
$$u_k \,|\, \Lambda_k \sim N_+(0, \Lambda_k^{-1}), \tag{3b}$$
$$[\Lambda_k]_{ii} \sim G\!\left(\tfrac{\nu_i}{2}, \tfrac{\nu_i}{2}\right), \tag{3c}$$

where $R \in \mathbb{R}^{n_y \times n_y}$ is a diagonal matrix whose diagonal elements' square roots $\sqrt{R_{ii}}$ are the spread parameters of the skew t-distribution in (2); $\Delta \in \mathbb{R}^{n_y \times n_y}$ is a diagonal matrix whose diagonal elements $\Delta_{ii}$ are the shape parameters; $\nu \in \mathbb{R}^{n_y}$ is a vector whose elements $\nu_i$ are the degrees of freedom; $C \in \mathbb{R}^{n_y \times n_x}$ is the measurement matrix; $\{w_k \in \mathbb{R}^{n_x} \mid 1 \le k \le K\}$ and $\{e_k \in \mathbb{R}^{n_y} \mid 1 \le k \le K\}$ are mutually independent noise sequences; the operator $[\cdot]_{ij}$ gives the $(i,j)$ entry of its argument; and $\Lambda_k$ is a diagonal matrix with a priori independent random diagonal elements $[\Lambda_k]_{ii}$. Also, $N_+(\mu, \Sigma)$ is the TMND with the closed positive orthant as support, location parameter $\mu$, and squared-scale matrix $\Sigma$. Furthermore, $G(\alpha, \beta)$ is the gamma distribution with shape parameter $\alpha$ and rate parameter $\beta$. Models where the measurement noise components are vector-valued with independently multivariate skew-t distributed noises [5]–[7], [13]–[15] require only a straightforward modification to the update of the approximate posterior of $\Lambda_k$ in the proposed filtering and smoothing algorithms.
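To make the hierarchical representation concrete, the following Python sketch draws samples of one noise component $[e_k]_i$ by simulating (3c), (3b), and (3a) in turn. It is a minimal illustration of the model only, not part of the proposed algorithms; the function name and the parameter values are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_skew_t_noise(R_ii, Delta_ii, nu_i, n=100_000):
    """Sample [e_k]_i ~ ST(0, R_ii, Delta_ii, nu_i) via the hierarchy (3):
    Lambda ~ Gamma(nu/2, rate nu/2), u | Lambda ~ N_+(0, Lambda^{-1}),
    e | u, Lambda ~ N(Delta*u, Lambda^{-1} R)."""
    lam = rng.gamma(shape=nu_i / 2, scale=2 / nu_i, size=n)     # (3c); rate -> scale
    u = np.abs(rng.normal(0.0, np.sqrt(1 / lam)))               # (3b); scalar N_+ is half-normal
    return Delta_ii * u + rng.normal(0.0, np.sqrt(R_ii / lam))  # (3a), one component

e = sample_skew_t_noise(R_ii=1.0, Delta_ii=5.0, nu_i=4.0)
print(e.mean() > np.median(e))  # True: positive skew, as in Fig. 1
```

Because the common mixing variable $[\Lambda_k]_{ii}$ scales both the truncated-normal and the normal part, heavy tails and skewness arise from the same mechanism.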


Bayesian smoothing means finding the smoothing posterior $p(x_{1:K}, u_{1:K}, \Lambda_{1:K} \,|\, y_{1:K})$. In [8], the smoothing posterior is approximated by a factorized distribution of the form $q^{[8]} \triangleq q_x(x_{1:K})\, q_u(u_{1:K})\, q_\Lambda(\Lambda_{1:K})$. Subsequently, the approximate posterior distributions are computed using the VB approach, which minimizes the Kullback–Leibler divergence (KLD) $D_{\mathrm{KL}}(q \| p) \triangleq \int q(x) \log \frac{q(x)}{p(x)}\, \mathrm{d}x$ [16] of the true posterior from the factorized approximation. That is, $D_{\mathrm{KL}}(q^{[8]} \| p(x_{1:K}, u_{1:K}, \Lambda_{1:K} \,|\, y_{1:K}))$ is minimized in [8]. The numerical simulations in [8] manifest the covariance underestimation of the VB approach, which is a known weakness of the method [10, Chapter 10]. The aim of this letter is to reduce the covariance underestimation of the filter and smoother proposed in [8] by removing independence approximations from the posterior approximation.

III. PROPOSED SOLUTION

Using Bayes' theorem, the state evolution model (1), and the likelihood (3), the joint smoothing posterior PDF can be derived as in [8]. This posterior is not analytically tractable. We propose to seek an approximation of the form

$$p(x_{1:K}, u_{1:K}, \Lambda_{1:K} \,|\, y_{1:K}) \approx q_{xu}(x_{1:K}, u_{1:K})\, q_\Lambda(\Lambda_{1:K}), \tag{4}$$

Table I O PTIMAL R ECURSIVE T RUNCATION TO THE P OSITIVE O RTHANT 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15:

Inputs: µ, Σ, and the set of the truncated components’ indices T while T 6= ∅ do √ k ← argmin √ i {µi / Σii | i ∈ T } ξ ← µk / Σkk if Φ(ξ) does not underflow to 0 then  ← φ(ξ)/Φ(ξ) . φ is the PDF of N (0, 1), Φ its CDF √ µ ← µ + (/ Σkk ) · Σ:,k Σ ← Σ − ((ξ + 2 )/Σkk ) · Σ:,k Σk,: else √ µ ← µ + (−ξ/ Σkk ) · Σ:,k . limξ→−∞ ( + ξ) = 0 [22] Σ ← Σ − (1/Σkk ) · Σ:,k Σk,: . limξ→−∞ (ξ + 2 ) = 1 [22] end if T ← T \{k} end while Outputs: µ and Σ; ([µ, Σ] ← rec_trunc(µ, Σ, T ))

the numerical quadrature of [20] in 2 and 3 dimensional cases and the quasi-Monte Carlo method of [21] for the dimensionalities 4−25. However, these methods can be prohibitively slow. Therefore, we approximate the TMND’s moments using the fast recursive algorithm suggested in [22], [23]. The method is initialized with the original normal density whose parameters are then updated by applying one linear constraint at a time. For each constraint, the mean and covariance matrix of the once-truncated normal distribution are computed analytically, and the once-truncated distribution is approximated by a nontruncated normal with the updated moments. The result of the recursive truncation depends on the order in which the constraints are applied. Finding the optimal order of applying the truncations is a combinatorial problem. Hence, we choose a greedy approach, whereby the constraint to be applied is chosen from among the remaining constraints so that the resulting once-truncated normal is closest to the true TMND. By Lemma 1, the optimal constraint in KLD-sense is the one that truncates the most probability. The obtained algorithm with the optimal processing sequence for computing the mean and covariance of a given normal distribution truncated to the positive orthant is given in Table I. Lemma 1. Let p(z) be a TMND with the support {z ≥ 0} and q(z) = N (z; µ, Σ). Then,   argmin DKL p(z) c1i q(z)[[zi ≥ 0]] = argmin √µΣi , (6) i

where the factors in (11) are specified by qˆxu , qˆΛ = argmin DKL (qN ||p(x1:K , u1:K , Λ1:K |y1:K )) qxu ,qΛ

and where qN , qxu (x1:K , u1:K )qΛ (Λ1:K ). Hence, x1:K and u1:K are not approximated as independent as in [8] because they can be highly correlated a posteriori [8]. The analytical solutions for qˆxu and qˆΛ are obtained by cyclic iteration of log qxu (·) ← E [log p(y1:K , x1:K , u1:K , Λ1:K )] + cxu

(5a)

log qΛ (·) ← E [log p(y1:K , x1:K , u1:K , Λ1:K )] + cΛ

(5b)



qxu

where the expected values on the right hand sides are taken with respect to the current qxu and qΛ [10, Chapter 10] [17], [18]. Also, cxu and cΛ are constants with respect to the variables (x1:K , u1:K ) and Λ1:K , respectively. Computation of the expectation in (5b) requires the first two moments of a TMND, because the support of u1:K is the nonnegative orthant. These moments can be computed using the formulas presented in [19]. They require evaluating the CDF (cumulative distribution function) of general multivariate normal distributions. The M ATLAB function mvncdf implements

i

ii

where µi is the ith element of µ, Σii is the ith R diagonal element of Σ, [[·]] is the Iverson bracket, and ci = q(z)[[zi ≥ 0]] dz.   Proof: DKL p(z) c1i q(z)[[zi ≥ 0]] Z ∞ + =− p(z) log( c1i q(z)[[zi ≥ 0]]) dz 0 Z ∞ + = log ci − p(z) log q(z) dz − 1 = log ci , 0 +

where = means equality up to an additive constant. Since ci is an increasing function of √µΣi the proof follows. ii  The recursion (5) is convergent to a local optimum [10, Chapter 10]. However, there is no proof of convergence available when the moments of the TMND are approximated. In spite of lack of a convergence proof the iterations did not diverge in the numerical simulations presented in section IV. The derivations for the expectations of (5) are presented in the appendixes. In the smoother, the update (5a) includes
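A possible Python implementation of Table I is sketched below; the name rec_trunc follows the table, SciPy's standard normal PDF and CDF stand in for $\phi$ and $\Phi$, and the underflow branch uses the limits quoted from [22]. This is a sketch under those assumptions, not a reference implementation.

```python
import numpy as np
from scipy.stats import norm

def rec_trunc(mu, Sigma, T):
    """Approximate the mean and covariance of N(mu, Sigma) truncated to
    {z_i >= 0 for i in T}, applying one constraint at a time in the
    KLD-optimal (most-probability-truncating) order of Lemma 1."""
    mu, Sigma = np.array(mu, float), np.array(Sigma, float)
    T = set(T)
    while T:
        k = min(T, key=lambda i: mu[i] / np.sqrt(Sigma[i, i]))  # line 3
        s = np.sqrt(Sigma[k, k])
        xi = mu[k] / s
        col = Sigma[:, k].copy()
        Phi = norm.cdf(xi)
        if Phi > 0.0:                                    # Phi(xi) did not underflow
            eps = norm.pdf(xi) / Phi
            mu = mu + (eps / s) * col                    # line 7
            Sigma = Sigma - ((xi * eps + eps**2) / Sigma[k, k]) * np.outer(col, col)
        else:                                            # limits for xi -> -inf [22]
            mu = mu + (-xi / s) * col
            Sigma = Sigma - (1.0 / Sigma[k, k]) * np.outer(col, col)
        T.discard(k)
    return mu, Sigma
```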

Table II
SMOOTHING FOR SKEW-t MEASUREMENT NOISE

1: Inputs: $A$, $C$, $Q$, $R$, $\Delta$, $\nu$, $x_{1|0}$, $P_{1|0}$ and $y_{1:K}$
2: $A_z \leftarrow \begin{bmatrix} A & 0 \\ 0 & 0 \end{bmatrix}$, $C_z \leftarrow [\,C\ \ \Delta\,]$
▷ initialization
3: $\Lambda_{k|K} \leftarrow I_{n_y}$ for $k = 1 \cdots K$
4: repeat
▷ update $q_{xu}(x_{1:K}, u_{1:K})$ given $q_\Lambda(\Lambda_{1:K})$
5: for $k = 1$ to $K$ do
6:   $Z_{k|k-1} \leftarrow \mathrm{blockdiag}(P_{k|k-1}, \Lambda_{k|K}^{-1})$
7:   $K_z \leftarrow Z_{k|k-1} C_z^T (C P_{k|k-1} C^T + \Delta \Lambda_{k|K}^{-1} \Delta^T + \Lambda_{k|K}^{-1} R)^{-1}$
8:   $\tilde{z}_{k|k} \leftarrow \begin{bmatrix} x_{k|k-1} \\ 0 \end{bmatrix} + K_z (y_k - C x_{k|k-1})$
9:   $\tilde{Z}_{k|k} \leftarrow (I - K_z C_z) Z_{k|k-1}$
10:  $[z_{k|k}, Z_{k|k}] \leftarrow$ rec_trunc($\tilde{z}_{k|k}$, $\tilde{Z}_{k|k}$, $\{n_x+1 \cdots n_x+n_y\}$)
11:  $x_{k|k} \leftarrow [z_{k|k}]_{1:n_x}$, $P_{k|k} \leftarrow [Z_{k|k}]_{1:n_x,1:n_x}$
12:  $x_{k+1|k} \leftarrow A x_{k|k}$
13:  $P_{k+1|k} \leftarrow A P_{k|k} A^T + Q$
14: end for
15: for $k = K-1$ down to 1 do
16:  $G_k \leftarrow Z_{k|k} A_z^T Z_{k+1|k}^{-1}$
17:  $z_{k|K} \leftarrow z_{k|k} + G_k (z_{k+1|K} - A_z z_{k|k})$
18:  $Z_{k|K} \leftarrow Z_{k|k} + G_k (Z_{k+1|K} - Z_{k+1|k}) G_k^T$
19:  $x_{k|K} \leftarrow [z_{k|K}]_{1:n_x}$, $P_{k|K} \leftarrow [Z_{k|K}]_{1:n_x,1:n_x}$
20:  $u_{k|K} \leftarrow [z_{k|K}]_{n_x+(1:n_y)}$, $U_{k|K} \leftarrow [Z_{k|K}]_{n_x+(1:n_y),\,n_x+(1:n_y)}$
21: end for
▷ update $q_\Lambda(\Lambda_{1:K})$ given $q_{xu}(x_{1:K}, u_{1:K})$
22: for $k = 1$ to $K$ do
23:  $\Psi_k \leftarrow (y_k - C_z z_{k|K})(y_k - C_z z_{k|K})^T R^{-1} + C_z Z_{k|K} C_z^T R^{-1} + u_{k|K} u_{k|K}^T + U_{k|K}$
24:  $[\Lambda_{k|K}]_{ii} \leftarrow \frac{\nu_i + 2}{\nu_i + [\Psi_k]_{ii}}$
25: end for
26: until converged
27: Outputs: $x_{k|K}$ and $P_{k|K}$ for $k = 1 \cdots K$

Table III
FILTERING FOR SKEW-t MEASUREMENT NOISE

1: Inputs: $A$, $C$, $Q$, $R$, $\Delta$, $\nu$, $x_{1|0}$, $P_{1|0}$ and $y_{1:K}$
2: $C_z \leftarrow [\,C\ \ \Delta\,]$
3: for $k = 1$ to $K$ do
▷ initialization
4:  $\Lambda_{k|k} \leftarrow I_{n_y}$
5:  repeat
▷ update $q_{xu}(x_k, u_k) = N([\,x_k; u_k\,]; z_{k|k}, Z_{k|k})$ given $q_\Lambda(\Lambda_k)$
6:   $Z_{k|k-1} \leftarrow \mathrm{blockdiag}(P_{k|k-1}, \Lambda_{k|k}^{-1})$
7:   $K_z \leftarrow Z_{k|k-1} C_z^T (C P_{k|k-1} C^T + \Delta \Lambda_{k|k}^{-1} \Delta^T + \Lambda_{k|k}^{-1} R)^{-1}$
8:   $\tilde{z}_{k|k} \leftarrow \begin{bmatrix} x_{k|k-1} \\ 0 \end{bmatrix} + K_z (y_k - C x_{k|k-1})$
9:   $\tilde{Z}_{k|k} \leftarrow (I - K_z C_z) Z_{k|k-1}$
10:  $[z_{k|k}, Z_{k|k}] \leftarrow$ rec_trunc($\tilde{z}_{k|k}$, $\tilde{Z}_{k|k}$, $\{n_x+1 \cdots n_x+n_y\}$)
11:  $x_{k|k} \leftarrow [z_{k|k}]_{1:n_x}$, $P_{k|k} \leftarrow [Z_{k|k}]_{1:n_x,1:n_x}$
12:  $u_{k|k} \leftarrow [z_{k|k}]_{n_x+(1:n_y)}$, $U_{k|k} \leftarrow [Z_{k|k}]_{n_x+(1:n_y),\,n_x+(1:n_y)}$
▷ update $q_\Lambda(\Lambda_k) = \prod_{i=1}^{n_y} G\big([\Lambda_k]_{ii}; \frac{\nu_i}{2}+1, \frac{\nu_i + [\Psi_k]_{ii}}{2}\big)$ given $q_{xu}(x_k, u_k)$
13:  $\Psi_k \leftarrow (y_k - C_z z_{k|k})(y_k - C_z z_{k|k})^T R^{-1} + C_z Z_{k|k} C_z^T R^{-1} + u_{k|k} u_{k|k}^T + U_{k|k}$
14:  $[\Lambda_{k|k}]_{ii} \leftarrow \frac{\nu_i + 2}{\nu_i + [\Psi_k]_{ii}}$
15:  until converged
16:  $x_{k+1|k} \leftarrow A x_{k|k}$
17:  $P_{k+1|k} \leftarrow A P_{k|k} A^T + Q$
18: end for
19: Outputs: $x_{k|k}$ and $P_{k|k}$ for $k = 1 \cdots K$

In the smoother, the update (5a) includes a forward filtering step of the Rauch–Tung–Striebel smoother (RTSS) [24], where the first filtering posterior is a TMND. The TMND is approximated as a multivariate normal distribution whose parameters are obtained using the recursive truncation. This approximation enables recursive forward filtering and the use of the RTSS's backward smoothing step, which gives normal approximations to the marginal smoothing posteriors $q_{xu}(x_k, u_k) \approx N([\,x_k; u_k\,]; z_{k|K}, Z_{k|K})$. After the iterations converge, the variables $u_{1:K}$ are integrated out to get the approximate smoothing posteriors $q_x(x_k) = N(x_k; x_{k|K}, P_{k|K})$, where the parameters $x_{k|K}$ and $P_{k|K}$ are the output of the skew t smoother (STS) algorithm in Table II.

STS can be restricted to an online recursive algorithm to synthesize a filter, which is summarized in Table III. In the filter, the output of a filtering step is also a TMND, which in analogy to STS is approximated by a multivariate normal distribution to obtain a recursive algorithm. Using recursive truncation, the TMND is approximated by a normal distribution $q_x(x_k) = N(x_k; x_{k|k}, P_{k|k})$ whose parameters are the outputs of the skew t filter (STF) algorithm in Table III.

IV. SIMULATIONS

Our numerical simulations use satellite navigation pseudorange measurements of the model

$$[y_k]_i \,|\, x_k \sim \mathrm{ST}(\|s^i - [x_k]_{1:3}\| + [x_k]_4,\ 1\,\mathrm{m},\ \delta\,\mathrm{m},\ 4), \tag{7}$$

where $s^i$ is the $i$th satellite's position, $[x_k]_4$ is a bias with prior $N(0, (0.75\,\mathrm{m})^2)$, and $\delta$ is the skewness parameter. The linearization error is negligible because the satellites are far away. The state model is a random walk with process covariance $Q = \mathrm{diag}((q\,\mathrm{m})^2, (q\,\mathrm{m})^2, (0.2\,\mathrm{m})^2, 0)$, where $q$ is a parameter. A satellite constellation of the Global Positioning System provided by the International GNSS Service [25] is used, with 8 measured satellites. The RMSE is computed for $[x_k]_{1:3}$.
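As a concrete companion to Table III, the sketch below runs one time step of the filter. It assumes the rec_trunc sketch above is available and that nu is a length-$n_y$ array; it is a condensed illustration under those assumptions, not the authors' reference implementation.

```python
import numpy as np

def stf_step(x_pred, P_pred, y, A, C, Delta, R, nu, Q, n_vb=5):
    """One step of the skew-t filter (Table III): VB iteration between
    q_xu (lines 6-12) and q_Lambda (lines 13-14), then the time update."""
    nx, ny = len(x_pred), len(y)
    Cz = np.hstack([C, Delta])
    Rinv = np.linalg.inv(R)
    lam = np.ones(ny)                                   # Lambda_{k|k} <- I
    for _ in range(n_vb):                               # "repeat ... until converged"
        Linv = np.diag(1.0 / lam)
        Z_pred = np.block([[P_pred, np.zeros((nx, ny))],
                           [np.zeros((ny, nx)), Linv]])
        S = C @ P_pred @ C.T + Delta @ Linv @ Delta.T + Linv @ R
        Kz = Z_pred @ Cz.T @ np.linalg.inv(S)           # line 7
        z_t = np.concatenate([x_pred, np.zeros(ny)]) + Kz @ (y - C @ x_pred)
        Z_t = (np.eye(nx + ny) - Kz @ Cz) @ Z_pred      # line 9
        z, Z = rec_trunc(z_t, Z_t, range(nx, nx + ny))  # truncate u >= 0
        u, U = z[nx:], Z[nx:, nx:]
        r = y - Cz @ z
        Psi = np.outer(r, r) @ Rinv + Cz @ Z @ Cz.T @ Rinv + np.outer(u, u) + U
        lam = (nu + 2.0) / (nu + np.diag(Psi))          # line 14
    x, P = z[:nx], Z[:nx, :nx]
    return A @ x, A @ P @ A.T + Q, x, P                 # prediction and filtered moments
```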

A. Computation of TMND statistics

In this subsection we study the computation of the moments of the untruncated components of a TMND. One state and one measurement vector per Monte Carlo replication are generated from the model (7) with $\nu = \infty$ degrees of freedom (corresponding to a skew-normal likelihood), prior $x \sim N(0, \mathrm{diag}(\rho\,\mathrm{m}^2, \rho\,\mathrm{m}^2, (0.22\,\mathrm{m})^2, (0.1\,\mathrm{m})^2))$, and 10 000 replications. The compared methods are the recursive truncation with the optimal truncation order (RTopt) and with a random order (RTrand), the variational Bayes (VB), and the analytical formulas of [19] using the MATLAB function mvncdf (MVNCDF). In RTrand, one of the non-optimal constraints is chosen at each truncation. VB is an update of the skew t variational Bayes filter (STVBF) [8] where $\Lambda_k = I$ and the VB iteration is terminated when the position estimate changes by less than 0.005 m or at the 1000th iteration.

Fig. 2 shows the distributions of the distance from the estimate of the bootstrap particle filter (PF) with 100 000 samples. The box levels are the 5 %, 25 %, 50 %, 75 %, and 95 % quantiles, and the asterisks show minimum and maximum values. With small $\rho$ and $\delta$, the differences between RTrand, RTopt, and MVNCDF are small. With large $\delta$ there are statistically significant differences, as the p-values of the two-sided Wilcoxon signed rank test in Fig. 2 show. RTopt outperforms RTrand in the cases with high skewness, which reflects the result of Lemma 1. MVNCDF is more accurate than RTopt in the cases with high skewness, but MVNCDF's computational load is roughly 40 000 times that of RTopt. This justifies the use of the recursive truncation approximation.

The approximation of the posterior covariance matrix is tested by studying the normalized estimation error squared (NEES) values [26, Ch. 5.4.2] shown in Fig. 3. If the covariance matrix is correct, the expected value of NEES is the state dimensionality 3 [26, Ch. 5.4.2]. VB gets large NEES values when $\delta$ is large, which indicates that VB underestimates the covariance matrix. RTopt and RTrand give NEES values closest to 3, so the recursive truncation provides the most accurate covariance matrix approximation.

Fig. 2. With large $\delta$ values RTopt is closer to PF than RTrand but less accurate than the computationally heavy MVNCDF (upper row). p-values of the two-sided Wilcoxon signed rank test (bottom row) show that the differences from RTopt are significant with large $\delta$. (left) $\rho = 1^2$, (right) $\rho = 20^2$.

Fig. 3. RTopt's NEES is closest to the optimal value 3, so recursive truncation gives the most realistic covariance matrix. (left) $\rho = 1^2$, (right) $\rho = 20^2$.
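For reference, a minimal sketch of the NEES statistic used in Fig. 3, with illustrative variable names; under a consistent filter, the expectation of this quantity equals the state dimensionality.

```python
import numpy as np

def nees(x_true, x_est, P):
    """Normalized estimation error squared, err^T P^{-1} err."""
    err = np.asarray(x_est) - np.asarray(x_true)
    return float(err @ np.linalg.solve(P, err))
```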

Fig. 4. STF converges in five iterations. The required number of PF particles can be 10 000. (left) $q = 0.5$, $\delta = 5$, (right) $q = 5$, $\delta = 5$.

Fig. 5. STF outperforms the comparison methods with skew-t-distributed noise. RMSE differences in per cent of the STF's RMSE. The relative differences increase as $\delta$ is increased. (left) $q = 0.5$, (right) $q = 5$.

Fig. 6. STF outperforms TVBF and STVBF with UWB noise. RMSE differences in per cent of the STF's RMSE.

B. Skew-t inference

In this section, the proposed skew t filter (STF) is compared with state-of-the-art filters using numerical simulations of a 100-step trajectory. The compared methods are a bootstrap-type PF, the STVBF [8], the t variational Bayes filter (TVBF) [27], and the Kalman filter (KF) with measurement validation gating [26, Ch. 5.7.2] that discards the measurement components whose normalized innovation squared is larger than the $\chi^2_1$ distribution's 99 % quantile. The TVBF's and KF's parameters are numerically optimized maximum expected likelihood parameters. The results are based on 1000 Monte Carlo replications.

Fig. 4 illustrates the convergence of the filters' iterations. The figure shows that the proposed STF converges within 5 VB iterations and outperforms the other filters except the PF already with 2 VB iterations. Furthermore, Fig. 4 shows that STF's converged state is close to the PF's converged state in RMSE, and the PF can require as many as 10 000 particles to outperform STF. STF also converges faster than STVBF when the process variance parameter $q$ is large. With a small $q$, STVBF with a small number of VB iterations can give a lower RMSE than the converged STVBF. The reason for this is probably that in the first iterations STVBF accommodates outliers by decreasing the $\Lambda_k$ estimates, which also affects the covariance, while in the later iterations the $u_k$ estimates are increased, which makes the mean more accurate but underestimates the covariance.

Fig. 5 shows the distributions of the RMSE differences from the STF's RMSE as percentages of the STF's RMSE. STF clearly has the smallest RMSE when $\delta \geq 3$. Unlike STVBF, the new STF improves accuracy even with small $q$, which can be explained by the improved covariance approximation.

Fig. 6 shows the results of a test where the measurement noise in (7) is generated from the histogram distribution of the UWB time-of-flight data set used in [9]. The filters use the maximum likelihood parameters fitted to the data set numerically with the degrees-of-freedom parameters fixed to 4.

Fig. 7. Five STS iterations give the converged state's RMSE. (left) $q = 0.5$, $\delta = 5$, (right) $q = 5$, $\delta = 5$.

The proposed method STF has the lowest RMSE also in this test, which shows that the method is robust to deviations from the assumed distribution and thus usable with real data.

The proposed smoother is also tested with measurements generated from (7). The compared smoothers are the proposed skew t smoother (STS), the skew t variational Bayes smoother (STVBS) [8], the t variational Bayes smoother (TVBS) [27], and the RTSS with 99 % measurement validation gating [24]. Fig. 7 shows that STS has a lower RMSE than the smoothers based on symmetric distributions. Furthermore, STS's VB iteration converges in five iterations, so it is faster than STVBS.

V. CONCLUSIONS

We have proposed a novel approximate filter and smoother for linear state-space models with a heavy-tailed and skewed measurement noise distribution. The algorithms are based on the variational Bayes approximation, where some posterior independence approximations are removed from the earlier versions of the algorithms to avoid significant underestimation of the posterior covariance matrix. The removal of the independence approximations is enabled by the recursive truncation algorithm for approximating the mean and covariance matrix of a truncated multivariate normal distribution. An optimal processing sequence is given for the recursive truncation.


APPENDIX A
DERIVATIONS FOR THE SMOOTHER

We derive the expectations for the iterations of the variational Bayes smoother approximating the joint smoothing density

$$p(x_{1:K}, u_{1:K}, \Lambda_{1:K} \,|\, y_{1:K}) \propto p(x_{1:K}, u_{1:K}, \Lambda_{1:K}, y_{1:K}) \tag{8}$$
$$= p(x_1) \prod_{l=1}^{K-1} p(x_{l+1} | x_l) \prod_{k=1}^{K} p(y_k | x_k, u_k, \Lambda_k)\, p(u_k | \Lambda_k)\, p(\Lambda_k) \tag{9}$$
$$= N(x_1; x_{1|0}, P_{1|0}) \prod_{l=1}^{K-1} N(x_{l+1}; A x_l, Q) \cdot \prod_{k=1}^{K} \left\{ N(y_k; C x_k + \Delta u_k, \Lambda_k^{-1} R)\, N_+(u_k; 0, \Lambda_k^{-1}) \prod_{i=1}^{n_y} G\!\left([\Lambda_k]_{ii}; \tfrac{\nu_i}{2}, \tfrac{\nu_i}{2}\right) \right\}, \tag{10}$$

which is approximated by a factorized probability density function (PDF) of the form

$$p(x_{1:K}, u_{1:K}, \Lambda_{1:K} \,|\, y_{1:K}) \approx q_{xu}(x_{1:K}, u_{1:K})\, q_\Lambda(\Lambda_{1:K}). \tag{11}$$

The VB solutions $\hat{q}_{xu}$ and $\hat{q}_\Lambda$ can be obtained by cyclic iteration of

$$\log q_{xu}(x_{1:K}, u_{1:K}) \leftarrow \mathrm{E}_{q_\Lambda}[\log p(y_{1:K}, x_{1:K}, u_{1:K}, \Lambda_{1:K})] + c_{xu} \tag{12a}$$
$$\log q_\Lambda(\Lambda_{1:K}) \leftarrow \mathrm{E}_{q_{xu}}[\log p(y_{1:K}, x_{1:K}, u_{1:K}, \Lambda_{1:K})] + c_\Lambda \tag{12b}$$

where the expected values are taken with respect to the current $q_{xu}$ and $q_\Lambda$, and $c_{xu}$ and $c_\Lambda$ are constants with respect to the variables $[\,x_k; u_k\,]$ and $\Lambda_k$, respectively [10, Chapter 10], [17]. This appendix gives the derivations for one iteration of (12). For brevity, all constant values are denoted by $c$. The logarithm of the joint smoothing distribution is

$$\log p(x_{1:K}, u_{1:K}, \Lambda_{1:K}, y_{1:K}) = \log N(x_1; x_{1|0}, P_{1|0}) + \sum_{l=1}^{K-1} \log N(x_{l+1}; A x_l, Q) + \sum_{k=1}^{K} \left\{ \log N(y_k; C x_k + \Delta u_k, \Lambda_k^{-1} R) + \log N_+(u_k; 0, \Lambda_k^{-1}) \right\} + \sum_{k=1}^{K} \sum_{i=1}^{n_y} \log G\!\left([\Lambda_k]_{ii}; \tfrac{\nu_i}{2}, \tfrac{\nu_i}{2}\right). \tag{13}$$

A. Derivations for $q_{xu}$

Eq. (12a) gives

$$\log q_{xu}(x_{1:K}, u_{1:K}) = \log N(x_1; x_{1|0}, P_{1|0}) + \sum_{l=1}^{K-1} \log N(x_{l+1}; A x_l, Q) + \sum_{k=1}^{K} \mathrm{E}_{q_\Lambda}\!\left[\log N(y_k; C x_k + \Delta u_k, \Lambda_k^{-1} R) + \log N_+(u_k; 0, \Lambda_k^{-1})\right] + c \tag{14}$$
$$= \log N(x_1; x_{1|0}, P_{1|0}) + \sum_{l=1}^{K-1} \log N(x_{l+1}; A x_l, Q) - \frac{1}{2} \sum_{k=1}^{K} \mathrm{E}_{q_\Lambda}\!\left[(y_k - C x_k - \Delta u_k)^T R^{-1} \Lambda_k (y_k - C x_k - \Delta u_k) + u_k^T \Lambda_k u_k\right] + c \tag{15}$$
$$= \log N(x_1; x_{1|0}, P_{1|0}) + \sum_{l=1}^{K-1} \log N(x_{l+1}; A x_l, Q) - \frac{1}{2} \sum_{k=1}^{K} \left\{ (y_k - C x_k - \Delta u_k)^T R^{-1} \Lambda_{k|K} (y_k - C x_k - \Delta u_k) + u_k^T \Lambda_{k|K} u_k \right\} + c \tag{16}$$
$$= \log N(x_1; x_{1|0}, P_{1|0}) + \sum_{l=1}^{K-1} \log N(x_{l+1}; A x_l, Q) + \sum_{k=1}^{K} \left\{ \log N(y_k; C x_k + \Delta u_k, \Lambda_{k|K}^{-1} R) + \log N(u_k; 0, \Lambda_{k|K}^{-1}) \right\} + c \tag{17}$$
$$= \log N\!\left( \begin{bmatrix} x_1 \\ u_1 \end{bmatrix}; \begin{bmatrix} x_{1|0} \\ 0 \end{bmatrix}, \begin{bmatrix} P_{1|0} & O \\ O & \Lambda_{1|K}^{-1} \end{bmatrix} \right) + \sum_{l=1}^{K-1} \log N\!\left( \begin{bmatrix} x_{l+1} \\ u_{l+1} \end{bmatrix}; \begin{bmatrix} A & O \\ O & O \end{bmatrix} \begin{bmatrix} x_l \\ u_l \end{bmatrix}, \begin{bmatrix} Q & O \\ O & \Lambda_{l+1|K}^{-1} \end{bmatrix} \right) + \sum_{k=1}^{K} \log N\!\left( y_k; [\,C\ \ \Delta\,] \begin{bmatrix} x_k \\ u_k \end{bmatrix}, \Lambda_{k|K}^{-1} R \right) + c, \quad u_{1:K} \geq 0, \tag{18}$$


where $\Lambda_{k|K} \triangleq \mathrm{E}_{q_\Lambda}[\Lambda_k]$ is derived in Section A-B, and $u_{1:K} \geq 0$ means that all the components of all $u_k$ are required to be nonnegative for each $k = 1 \cdots K$. Up to the truncation of the $u$ components, $q_{xu}(x_{1:K}, u_{1:K})$ thus has the same form as the joint smoothing posterior of a linear state-space model with the state transition matrix $\tilde{A} \triangleq \begin{bmatrix} A & O \\ O & O \end{bmatrix}$, process noise covariance matrix $\tilde{Q}_k \triangleq \begin{bmatrix} Q & O \\ O & \Lambda_{k+1|K}^{-1} \end{bmatrix}$, measurement model matrix $\tilde{C} \triangleq [\,C\ \ \Delta\,]$, and measurement noise covariance matrix $\tilde{R}_k \triangleq \Lambda_{k|K}^{-1} R$. Let us denote the PDFs related to this state-space model by $\tilde{p}$. It would be possible to compute the truncated multivariate normal posterior of the joint smoothing distribution $\tilde{p}([\,x_{1:K}; u_{1:K}\,] \,|\, y_{1:K})$, and account for the truncation of $u_{1:K}$ to the positive orthant using the recursive truncation. However, this would be impractical with large $K$ due to the large dimensionality $K \times (n_x + n_y)$. A feasible solution is to approximate each filtering distribution in the Rauch–Tung–Striebel smoother's (RTSS [24]) forward filtering step with a multivariate normal distribution by

$$\tilde{p}(x_k, u_k \,|\, y_{1:k}) = \frac{1}{C}\, N\!\left( \begin{bmatrix} x_k \\ u_k \end{bmatrix}; z_{k|k}^0, Z_{k|k}^0 \right) \cdot [u_k \geq 0] \tag{19}$$
$$\approx N\!\left( \begin{bmatrix} x_k \\ u_k \end{bmatrix}; z_{k|k}, Z_{k|k} \right) \tag{20}$$

for each $k = 1 \cdots K$, where $[u_k \geq 0]$ is the Iverson bracket notation

$$[u_k \geq 0] = \begin{cases} 1, & \text{if all components of } u_k \text{ are non-negative} \\ 0, & \text{otherwise} \end{cases},$$

$C$ is the normalization factor, and $z_{k|k} \triangleq \mathrm{E}_{\tilde{p}}[[\,x_k; u_k\,] \,|\, y_{1:k}]$ and $Z_{k|k} \triangleq \mathrm{Var}_{\tilde{p}}[[\,x_k; u_k\,] \,|\, y_{1:k}]$ are approximated using the recursive truncation. Given the multivariate normal approximations of the filtering posteriors $\tilde{p}(x_k, u_k \,|\, y_{1:k})$, by Lemma 2 the backward recursion of the RTSS gives multivariate normal approximations of the smoothing posteriors $\tilde{p}(x_k, u_k \,|\, y_{1:K})$. The quantities required in the derivations of Section A-B are the expectations of the smoother posteriors $x_{k|K} \triangleq \mathrm{E}_{q_{xu}}[x_k]$ and $u_{k|K} \triangleq \mathrm{E}_{q_{xu}}[u_k]$, and the covariance matrices $Z_{k|K} \triangleq \mathrm{Var}_{q_{xu}}[\,x_k; u_k\,]$ and $U_{k|K} \triangleq \mathrm{Var}_{q_{xu}}[u_k]$.

Lemma 2. Let $\{z_k\}_{k=1}^K$ be a linear–Gaussian process, and $\{y_k\}_{k=1}^K$ a measurement process such that

$$z_1 \sim N(z_{1|0}, Z_{1|0}) \tag{21}$$
$$z_k \,|\, z_{k-1} \sim N(A z_{k-1}, Q) \tag{22}$$
$$y_k \,|\, z_k \sim \text{(a known distribution)} \tag{23}$$

with the standard Markovianity assumptions. Then, if the filtering posterior $p(z_k | y_{1:k})$ is a multivariate normal distribution for each $k$, then for each $k < K$

$$z_k \,|\, y_{1:K} \sim N(z_{k|K}, Z_{k|K}), \tag{24}$$

where

$$z_{k|K} = z_{k|k} + G_k (z_{k+1|K} - A z_{k|k}), \tag{25}$$
$$Z_{k|K} = Z_{k|k} + G_k (Z_{k+1|K} - A Z_{k|k} A^T - Q) G_k^T, \tag{26}$$
$$G_k = Z_{k|k} A^T (A Z_{k|k} A^T + Q)^{-1}, \tag{27}$$

and $z_{k|k}$ and $Z_{k|k}$ are the mean and covariance matrix of the filtering posterior $p(z_k | y_{1:k})$.

Proof: The proof is mostly similar to the proof of [29, Theorem 8.2]. First, assume that

$$z_{k+1} \,|\, y_{1:K} \sim N(z_{k+1|K}, Z_{k+1|K}) \tag{28}$$

for some $k < K$. The joint conditional distribution of $z_k$ and $z_{k+1}$ is then

$$p(z_k, z_{k+1} \,|\, y_{1:k}) = p(z_{k+1} | z_k)\, p(z_k | y_{1:k}) \qquad \text{(Markovianity assumption)} \tag{29}$$
$$= N(z_{k+1}; A z_k, Q)\, N(z_k; z_{k|k}, Z_{k|k}) \tag{30}$$
$$= N\!\left( \begin{bmatrix} z_k \\ z_{k+1} \end{bmatrix}; \begin{bmatrix} z_{k|k} \\ A z_{k|k} \end{bmatrix}, \begin{bmatrix} Z_{k|k} & Z_{k|k} A^T \\ A Z_{k|k} & A Z_{k|k} A^T + Q \end{bmatrix} \right), \tag{31}$$

so by the conditioning rule of the multivariate normal distribution

$$p(z_k \,|\, z_{k+1}, y_{1:k}) = N(z_k;\, z_{k|k} + G_k (z_{k+1} - A z_{k|k}),\, Z_{k|k} - Z_{k|k} A^T (A Z_{k|k} A^T + Q)^{-1} A Z_{k|k}) \tag{32}$$
$$= N(z_k;\, z_{k|k} + G_k (z_{k+1} - A z_{k|k}),\, Z_{k|k} - G_k (A Z_{k|k} A^T + Q) G_k^T). \tag{33}$$

We use this formula in

$$p(z_k, z_{k+1} \,|\, y_{1:K}) = p(z_k | z_{k+1}, y_{1:K})\, p(z_{k+1} | y_{1:K}) \tag{34}$$
$$= p(z_k | z_{k+1}, y_{1:k})\, p(z_{k+1} | y_{1:K}) \qquad \text{(Markovianity assumption)} \tag{35}$$
$$= N(z_k;\, z_{k|k} + G_k (z_{k+1} - A z_{k|k}),\, Z_{k|k} - G_k (A Z_{k|k} A^T + Q) G_k^T)\, N(z_{k+1}; z_{k+1|K}, Z_{k+1|K}) \tag{36}$$
$$= N\!\left( \begin{bmatrix} z_k \\ z_{k+1} \end{bmatrix}; \begin{bmatrix} z_{k|k} + G_k (z_{k+1|K} - A z_{k|k}) \\ \bullet \end{bmatrix}, \begin{bmatrix} G_k Z_{k+1|K} G_k^T + Z_{k|k} - G_k (A Z_{k|k} A^T + Q) G_k^T & \bullet \\ \bullet & \bullet \end{bmatrix} \right) \tag{37}$$
$$= N\!\left( \begin{bmatrix} z_k \\ z_{k+1} \end{bmatrix}; \begin{bmatrix} z_{k|k} + G_k (z_{k+1|K} - A z_{k|k}) \\ \bullet \end{bmatrix}, \begin{bmatrix} Z_{k|k} + G_k (Z_{k+1|K} - A Z_{k|k} A^T - Q) G_k^T & \bullet \\ \bullet & \bullet \end{bmatrix} \right), \tag{38}$$

so

$$p(z_k \,|\, y_{1:K}) = N(z_k;\, z_{k|k} + G_k (z_{k+1|K} - A z_{k|k}),\, Z_{k|k} + G_k (Z_{k+1|K} - A Z_{k|k} A^T - Q) G_k^T) \tag{39}$$
$$= N(z_k; z_{k|K}, Z_{k|K}). \tag{40}$$

Because $z_K \,|\, y_{1:K} \sim N(z_{K|K}, Z_{K|K})$, and because (28) implies (40), the statement holds by the induction argument. ∎
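The backward recursion (25)–(27) transcribes directly into code; the sketch below assumes lists of filtered means and covariances are available and uses hypothetical names, for illustration only.

```python
import numpy as np

def rts_backward(z_filt, Z_filt, A, Q):
    """RTS backward pass of Lemma 2: given filtering moments for k = 0..K-1,
    return the smoothed moments from (25)-(27)."""
    z_s, Z_s = z_filt[-1].copy(), Z_filt[-1].copy()
    out = [(z_s, Z_s)]
    for k in range(len(z_filt) - 2, -1, -1):
        Z_pred = A @ Z_filt[k] @ A.T + Q                # Z_{k+1|k}
        G = Z_filt[k] @ A.T @ np.linalg.inv(Z_pred)     # gain (27)
        z_s = z_filt[k] + G @ (z_s - A @ z_filt[k])     # mean (25)
        Z_s = Z_filt[k] + G @ (Z_s - Z_pred) @ G.T      # covariance (26)
        out.append((z_s, Z_s))
    return out[::-1]                                    # index k -> (z_{k|K}, Z_{k|K})
```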

B. Derivations for $q_\Lambda$

Eq. (12b) gives

$$\log q_\Lambda(\Lambda_{1:K}) = \mathrm{E}_{q_{xu}}\!\left[ \sum_{k=1}^{K} \left\{ \log N(y_k; C x_k + \Delta u_k, \Lambda_k^{-1} R) + \log N_+(u_k; 0, \Lambda_k^{-1}) \right\} \right] + \sum_{k=1}^{K} \sum_{i=1}^{n_y} \log G\!\left([\Lambda_k]_{ii}; \tfrac{\nu_i}{2}, \tfrac{\nu_i}{2}\right) + c. \tag{41}$$

Therefore, $q_\Lambda(\Lambda_{1:K}) = \prod_{k=1}^{K} q_\Lambda(\Lambda_k)$, where

$$\log q_\Lambda(\Lambda_k) = -\frac{1}{2}\, \mathrm{E}_{q_{xu}}\!\left[\mathrm{tr}\{(y_k - C x_k - \Delta u_k)(y_k - C x_k - \Delta u_k)^T R^{-1} \Lambda_k\}\right] \tag{42}$$
$$\quad - \frac{1}{2}\, \mathrm{E}_{q_{xu}}\!\left[\mathrm{tr}\{u_k u_k^T \Lambda_k\}\right] + \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i}{2} [\Lambda_k]_{ii} \right) + c \tag{43}$$
$$= -\frac{1}{2}\, \mathrm{tr}\!\left\{\left((y_k - C x_{k|K} - \Delta u_{k|K})(y_k - C x_{k|K} - \Delta u_{k|K})^T + [\,C\ \ \Delta\,] Z_{k|K} \begin{bmatrix} C^T \\ \Delta^T \end{bmatrix}\right) R^{-1} \Lambda_k\right\} - \frac{1}{2}\, \mathrm{tr}\{(u_{k|K} u_{k|K}^T + U_{k|K}) \Lambda_k\} + \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i}{2} [\Lambda_k]_{ii} \right) + c \tag{44}$$
$$= \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i + [\Psi_k]_{ii}}{2} [\Lambda_k]_{ii} \right) + c, \tag{45}$$

where

$$\Psi_k = (y_k - C x_{k|K} - \Delta u_{k|K})(y_k - C x_{k|K} - \Delta u_{k|K})^T R^{-1} + [\,C\ \ \Delta\,] Z_{k|K} \begin{bmatrix} C^T \\ \Delta^T \end{bmatrix} R^{-1} + u_{k|K} u_{k|K}^T + U_{k|K}. \tag{46}$$

Therefore,

$$q_\Lambda(\Lambda_k) = \prod_{i=1}^{n_y} G\!\left([\Lambda_k]_{ii}; \frac{\nu_i}{2} + 1, \frac{\nu_i + [\Psi_k]_{ii}}{2}\right). \tag{47}$$

Note that only the diagonal elements of the matrix $\Psi_k$ are required. In the derivations of Section A-A, $\Lambda_{k|K} \triangleq \mathrm{E}_{q_\Lambda}[\Lambda_k]$ is required. $\mathrm{E}_{q_\Lambda}[\Lambda_k]$ is a diagonal matrix with the diagonal elements

$$[\Lambda_{k|K}]_{ii} = \frac{\nu_i + 2}{\nu_i + [\Psi_k]_{ii}}. \tag{48}$$


APPENDIX B
DERIVATIONS FOR THE FILTER

Suppose that at time index $k$ the measurement vector $y_k$ is available, and the prediction PDF $p(x_k | y_{1:k-1})$ is

$$p(x_k \,|\, y_{1:k-1}) = N(x_k; x_{k|k-1}, P_{k|k-1}). \tag{49}$$

Then, using Bayes' theorem, the joint filtering posterior PDF is

$$p(x_k, u_k, \Lambda_k \,|\, y_{1:k}) \propto p(y_k, x_k, u_k, \Lambda_k \,|\, y_{1:k-1}) \tag{50}$$
$$= p(y_k | x_k, u_k, \Lambda_k)\, p(x_k | y_{1:k-1})\, p(u_k | \Lambda_k)\, p(\Lambda_k) \tag{51}$$
$$= N(y_k; C x_k + \Delta u_k, \Lambda_k^{-1} R)\, N(x_k; x_{k|k-1}, P_{k|k-1})\, N_+(u_k; 0, \Lambda_k^{-1}) \prod_{i=1}^{n_y} G\!\left([\Lambda_k]_{ii}; \tfrac{\nu_i}{2}, \tfrac{\nu_i}{2}\right). \tag{52}$$

This posterior is not analytically tractable. We seek an approximation of the form

$$p(x_k, u_k, \Lambda_k \,|\, y_{1:k}) \approx q_{xu}(x_k, u_k)\, q_\Lambda(\Lambda_k). \tag{53}$$

The VB solutions $\hat{q}_{xu}$ and $\hat{q}_\Lambda$ can be obtained by cyclic iteration of

$$\log q_{xu}(x_k, u_k) \leftarrow \mathrm{E}_{q_\Lambda}[\log p(y_k, x_k, u_k, \Lambda_k \,|\, y_{1:k-1})] + c_{xu} \tag{54a}$$
$$\log q_\Lambda(\Lambda_k) \leftarrow \mathrm{E}_{q_{xu}}[\log p(y_k, x_k, u_k, \Lambda_k \,|\, y_{1:k-1})] + c_\Lambda \tag{54b}$$

where the expected values on the right hand sides of (54) are taken with respect to the current $q_{xu}$ and $q_\Lambda$, and $c_{xu}$ and $c_\Lambda$ are constants with respect to the variables $[\,x_k; u_k\,]$ and $\Lambda_k$, respectively [10, Chapter 10], [17]. In Sections B-A and B-B the derivations for the variational solution (54) are given. For brevity, all constant values are denoted by $c$ in the derivations. The logarithm of the joint filtering posterior, which is needed for the derivations, is given by

$$\log p(y_k, x_k, u_k, \Lambda_k \,|\, y_{1:k-1}) = -\frac{1}{2}(y_k - C x_k - \Delta u_k)^T R^{-1} \Lambda_k (y_k - C x_k - \Delta u_k) - \frac{1}{2}(x_k - x_{k|k-1})^T P_{k|k-1}^{-1} (x_k - x_{k|k-1}) - \frac{1}{2} u_k^T \Lambda_k u_k + \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i}{2} [\Lambda_k]_{ii} \right) + c, \quad u_k \geq 0, \tag{55}$$

where $u_k \geq 0$ means that every component of $u_k$ is non-negative.

A. Derivations for $q_{xu}$

Using equation (54a) we obtain

$$\log q_{xu}(x_k, u_k) = -\frac{1}{2}\, \mathrm{E}_{q_\Lambda}\!\left[(y_k - C x_k - \Delta u_k)^T R^{-1} \Lambda_k (y_k - C x_k - \Delta u_k)\right] - \frac{1}{2}(x_k - x_{k|k-1})^T P_{k|k-1}^{-1}(x_k - x_{k|k-1}) - \frac{1}{2}\, \mathrm{E}_{q_\Lambda}[u_k^T \Lambda_k u_k] + c \tag{56}$$
$$= -\frac{1}{2}\left(y_k - [\,C\ \ \Delta\,]\begin{bmatrix} x_k \\ u_k \end{bmatrix}\right)^T R^{-1} \Lambda_{k|k} \left(y_k - [\,C\ \ \Delta\,]\begin{bmatrix} x_k \\ u_k \end{bmatrix}\right) - \frac{1}{2}\left(\begin{bmatrix} x_k \\ u_k \end{bmatrix} - \begin{bmatrix} x_{k|k-1} \\ 0 \end{bmatrix}\right)^T \begin{bmatrix} P_{k|k-1} & 0 \\ 0 & \Lambda_{k|k}^{-1} \end{bmatrix}^{-1} \left(\begin{bmatrix} x_k \\ u_k \end{bmatrix} - \begin{bmatrix} x_{k|k-1} \\ 0 \end{bmatrix}\right) + c, \quad u_k \geq 0, \tag{57}$$

where $\Lambda_{k|k} \triangleq \mathrm{E}_{q_\Lambda}[\Lambda_k]$ is derived in Section B-B. Hence,

$$q_{xu}(x_k, u_k) \propto N\!\left(y_k; [\,C\ \ \Delta\,]\begin{bmatrix} x_k \\ u_k \end{bmatrix}, \Lambda_{k|k}^{-1} R\right) N\!\left(\begin{bmatrix} x_k \\ u_k \end{bmatrix}; \begin{bmatrix} x_{k|k-1} \\ 0 \end{bmatrix}, \begin{bmatrix} P_{k|k-1} & 0 \\ 0 & \Lambda_{k|k}^{-1} \end{bmatrix}\right) \cdot [u_k \geq 0], \tag{58}$$

where $[u_k \geq 0]$ is the Iverson bracket. By the Kalman filter's [28] measurement update, this becomes

$$q_{xu}(x_k, u_k) = \frac{1}{C}\, N\!\left(\begin{bmatrix} x_k \\ u_k \end{bmatrix}; z_{k|k}^0, Z_{k|k}^0\right) \cdot [u_k \geq 0], \tag{59}$$


where

$$z_{k|k}^0 = \begin{bmatrix} x_{k|k-1} \\ 0 \end{bmatrix} + K_k (y_k - C x_{k|k-1}), \tag{61}$$
$$Z_{k|k}^0 = (I - K_k [\,C\ \ \Delta\,]) \begin{bmatrix} P_{k|k-1} & 0 \\ 0 & \Lambda_{k|k}^{-1} \end{bmatrix}, \tag{62}$$
$$K_k = \begin{bmatrix} P_{k|k-1} C^T \\ \Lambda_{k|k}^{-1} \Delta^T \end{bmatrix} (C P_{k|k-1} C^T + \Delta \Lambda_{k|k}^{-1} \Delta^T + \Lambda_{k|k}^{-1} R)^{-1}. \tag{63}$$

The first and second moments $x_{k|k} \triangleq \mathrm{E}_{q_{xu}}[x_k]$, $u_{k|k} \triangleq \mathrm{E}_{q_{xu}}[u_k]$, $Z_{k|k} \triangleq \mathrm{Var}_{q_{xu}}[\,x_k; u_k\,]$, and $U_{k|k} \triangleq \mathrm{Var}_{q_{xu}}[u_k]$ are required in the derivation of $q_\Lambda$ in Section B-B, and they can be approximated using the recursive truncation algorithm. For the linear–Gaussian time update to be analytically tractable, the marginal distribution $q_{xu}(x_k)$ is approximated as a normal distribution

$$q_{xu}(x_k) = \int q_{xu}(x_k, u_k)\, \mathrm{d}u_k \approx N(x_{k|k}, P_{k|k}), \tag{64}$$

where $P_{k|k} \triangleq \mathrm{Var}_{q_{xu}}[x_k]$.

B. Derivations for $q_\Lambda$

Using equation (54b) we obtain

$$\log q_\Lambda(\Lambda_k) = -\frac{1}{2}\, \mathrm{E}_{q_{xu}}\!\left[\mathrm{tr}\{(y_k - C x_k - \Delta u_k)(y_k - C x_k - \Delta u_k)^T R^{-1} \Lambda_k\}\right] \tag{65}$$
$$\quad - \frac{1}{2}\, \mathrm{E}_{q_{xu}}\!\left[\mathrm{tr}\{u_k u_k^T \Lambda_k\}\right] + \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i}{2} [\Lambda_k]_{ii} \right) + c \tag{66}$$
$$= -\frac{1}{2}\, \mathrm{tr}\!\left\{\left((y_k - C x_{k|k} - \Delta u_{k|k})(y_k - C x_{k|k} - \Delta u_{k|k})^T + [\,C\ \ \Delta\,] Z_{k|k} \begin{bmatrix} C^T \\ \Delta^T \end{bmatrix}\right) R^{-1} \Lambda_k\right\} - \frac{1}{2}\, \mathrm{tr}\{(u_{k|k} u_{k|k}^T + U_{k|k}) \Lambda_k\} + \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i}{2} [\Lambda_k]_{ii} \right) + c \tag{67}$$
$$= \sum_{i=1}^{n_y} \left( \frac{\nu_i}{2} \log[\Lambda_k]_{ii} - \frac{\nu_i + [\Psi_k]_{ii}}{2} [\Lambda_k]_{ii} \right) + c, \tag{68}$$

where

$$\Psi_k = (y_k - C x_{k|k} - \Delta u_{k|k})(y_k - C x_{k|k} - \Delta u_{k|k})^T R^{-1} + [\,C\ \ \Delta\,] Z_{k|k} \begin{bmatrix} C^T \\ \Delta^T \end{bmatrix} R^{-1} + u_{k|k} u_{k|k}^T + U_{k|k} \tag{69}$$

and the moments $x_{k|k}$, $u_{k|k}$, $Z_{k|k}$, and $U_{k|k}$ are derived in Section B-A of this report. Therefore,

$$q_\Lambda(\Lambda_k) = \prod_{i=1}^{n_y} G\!\left([\Lambda_k]_{ii}; \frac{\nu_i}{2} + 1, \frac{\nu_i + [\Psi_k]_{ii}}{2}\right). \tag{70}$$

Note that only the diagonal elements of the matrix $\Psi_k$ are required. In the derivations of Section B-A, $\Lambda_{k|k} \triangleq \mathrm{E}_{q_\Lambda}[\Lambda_k]$ is required. $\mathrm{E}_{q_\Lambda}[\Lambda_k]$ is a diagonal matrix with the diagonal elements

$$[\Lambda_{k|k}]_{ii} = \frac{\nu_i + 2}{\nu_i + [\Psi_k]_{ii}}. \tag{71}$$


REFERENCES

[1] F. Gustafsson and F. Gunnarsson, "Mobile positioning using wireless networks: possibilities and fundamental limitations based on available wireless network measurements," IEEE Signal Processing Magazine, vol. 22, no. 4, pp. 41–53, July 2005.
[2] B.-S. Chen, C.-Y. Yang, F.-K. Liao, and J.-F. Liao, "Mobile location estimator in a rough wireless environment using extended Kalman-based IMM and data fusion," IEEE Transactions on Vehicular Technology, vol. 58, no. 3, pp. 1157–1169, March 2009.
[3] M. Kok, J. D. Hol, and T. B. Schön, "Indoor positioning using ultra-wideband and inertial measurements," IEEE Transactions on Vehicular Technology, vol. 64, no. 4, 2015.
[4] K. Kaemarungsi and P. Krishnamurthy, "Analysis of WLAN's received signal strength indication for indoor location fingerprinting," Pervasive and Mobile Computing, vol. 8, no. 2, pp. 292–316, 2012, special issue: Wide-Scale Vehicular Sensor Networks and Mobile Sensing.
[5] M. D. Branco and D. K. Dey, "A general class of multivariate skew-elliptical distributions," Journal of Multivariate Analysis, vol. 79, no. 1, pp. 99–113, October 2001.
[6] A. Azzalini and A. Capitanio, "Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution," Journal of the Royal Statistical Society. Series B (Statistical Methodology), vol. 65, no. 2, pp. 367–389, 2003.
[7] A. K. Gupta, "Multivariate skew t-distribution," Statistics, vol. 37, no. 4, pp. 359–363, 2003.
[8] H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson, "Robust inference for state-space models with skewed measurement noise," IEEE Signal Processing Letters, vol. 22, no. 11, pp. 1898–1902, November 2015.
[9] H. Nurminen, T. Ardeshiri, R. Piché, and F. Gustafsson, "A NLOS-robust TOA positioning filter based on a skew-t measurement noise model," in International Conference on Indoor Positioning and Indoor Navigation (IPIN), October 2015, pp. 1–7.
[10] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.
[11] M. P. Wand, J. T. Ormerod, S. A. Padoan, and R. Frühwirth, "Mean field variational Bayes for elaborate distributions," Bayesian Analysis, vol. 6, no. 4, pp. 847–900, 2011.
[12] S. K. Sahu, D. K. Dey, and M. D. Branco, "Erratum: A new class of multivariate skew distributions with applications to Bayesian regression models," Canadian Journal of Statistics, vol. 37, no. 2, pp. 301–302, 2009.
[13] S. K. Sahu, D. K. Dey, and M. D. Branco, "A new class of multivariate skew distributions with applications to Bayesian regression models," Canadian Journal of Statistics, vol. 31, no. 2, pp. 129–150, 2003.
[14] T.-I. Lin, "Robust mixture modeling using multivariate skew t distributions," Statistics and Computing, vol. 20, pp. 343–356, 2010.
[15] S. X. Lee and G. J. McLachlan, "EMMIXuskew: An R package for fitting mixtures of multivariate skew t distributions via the EM algorithm," Journal of Statistical Software, vol. 55, no. 12, pp. 1–22, November 2013.
[16] T. M. Cover and J. Thomas, Elements of Information Theory. John Wiley and Sons, 2006.
[17] D. G. Tzikas, A. C. Likas, and N. P. Galatsanos, "The variational approximation for Bayesian inference," IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 131–146, November 2008.
[18] M. J. Beal, "Variational algorithms for approximate Bayesian inference," Ph.D. dissertation, Gatsby Computational Neuroscience Unit, University College London, 2003.
[19] G. Tallis, "The moment generating function of the truncated multi-normal distribution," Journal of the Royal Statistical Society. Series B (Methodological), vol. 23, no. 1, pp. 223–229, 1961.
[20] A. Genz, "Numerical computation of rectangular bivariate and trivariate normal and t probabilities," Statistics and Computing, vol. 14, pp. 251–260, 2004.
[21] A. Genz and F. Bretz, "Comparison of methods for the computation of multivariate t probabilities," Journal of Computational and Graphical Statistics, vol. 11, no. 4, pp. 950–971, 2002.
[22] T. Perälä and S. Ali-Löytty, "Kalman-type positioning filters with floor plan information," in 6th International Conference on Advances in Mobile Computing and Multimedia (MoMM). New York, NY, USA: ACM, 2008, pp. 350–355.
[23] D. J. Simon and D. L. Simon, "Constrained Kalman filtering via density function truncation for turbofan engine health estimation," International Journal of Systems Science, vol. 41, no. 2, pp. 159–171, 2010.
[24] H. E. Rauch, C. T. Striebel, and F. Tung, "Maximum likelihood estimates of linear dynamic systems," Journal of the American Institute of Aeronautics and Astronautics, vol. 3, no. 8, pp. 1445–1450, August 1965.
[25] J. M. Dow, R. Neilan, and C. Rizos, "The international GNSS service in a changing landscape of global navigation satellite systems," Journal of Geodesy, vol. 83, no. 7, p. 689, February 2009.
[26] Y. Bar-Shalom, X. R. Li, and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Theory, Algorithms and Software. John Wiley & Sons, 2001.
[27] R. Piché, S. Särkkä, and J. Hartikainen, "Recursive outlier-robust filtering and smoothing for nonlinear systems using the multivariate Student-t distribution," in IEEE International Workshop on Machine Learning for Signal Processing (MLSP), September 2012.
[28] R. E. Kalman, "A new approach to linear filtering and prediction problems," Transactions of the ASME–Journal of Basic Engineering, vol. 82, no. Series D, pp. 35–45, 1960.
[29] S. Särkkä, Bayesian Filtering and Smoothing. Cambridge, UK: Cambridge University Press, 2013.
