State-Space Inference and Learning with Gaussian Processes

Ryan Turner (Engineering, Cambridge)
Seattle, WA, March 5, 2010
Joint work with Marc Deisenroth and Carl Edward Rasmussen
Outline
- Motivation for dynamical systems
- Expectation Maximization (EM)
- Gaussian Processes (GP)
- Inference
- Learning
- Results
Motivation

[Block diagram: a system's position and velocity are observed through a noisy measurement device (sensor), g(position, noise); a filter recovers p(position, velocity) from these measurements, and a controller uses this belief to set the throttle.]

Estimating (latent) states from noisy measurements.
Setup

[Graphical model: a chain of latent states x_{t−1} → x_t → x_{t+1} linked by the transition function f, each state emitting a measurement z_{t−1}, z_t, z_{t+1} through g.]

x_t = f(x_{t−1}) + w,   w ∼ N(0, Q)
y_t = g(x_t) + v,       v ∼ N(0, R)

x: latent state, y: measurement.
Learning: find f and g using y_{1:T}.
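To make the setup concrete, here is a minimal simulation sketch of such a model; the particular f, g, and noise levels below are arbitrary illustrative choices, not the system used in the experiments.

```python
# Simulate a 1-D nonlinear state-space model x_t = f(x_{t-1}) + w, y_t = g(x_t) + v.
import numpy as np

def simulate_ssm(T, Q=0.1, R=0.1, seed=0):
    rng = np.random.default_rng(seed)
    f = lambda x: 0.5 * x + 5.0 * x / (1.0 + x ** 2)  # transition (illustrative choice)
    g = lambda x: np.sin(x)                           # measurement (illustrative choice)
    x = np.zeros(T)
    y = np.zeros(T)
    x[0] = rng.normal()
    for t in range(T):
        if t > 0:
            x[t] = f(x[t - 1]) + rng.normal(scale=np.sqrt(Q))  # system noise w ~ N(0, Q)
        y[t] = g(x[t]) + rng.normal(scale=np.sqrt(R))          # measurement noise v ~ N(0, R)
    return x, y

x_true, y_obs = simulate_ssm(T=100)  # latent states (hidden in practice) and measurements
```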
The Goal
Learn the NLDS in a nonparametric and probabilistic fashion via an EM algorithm. This requires inference (filtering and smoothing) and prediction in nonlinear dynamical systems (NLDS) using moment matching:
- filtering: find the distribution p(x_t | y_{1:t})
- smoothing: find the distribution p(x_t | y_{1:T})
- prediction: find the distribution p(y_{t+1} | y_{1:t})
Gaussian process inference and learning (GPIL) algorithm
Expectation Maximization
EM iterates between two steps, the E-step and the M-step:
- E-step (inference step): find the posterior distribution p(X | Y, Θ).
- M-step: maximize the expected log-likelihood Q = E_X[log p(X, Y | Θ)] with respect to Θ.
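A schematic of the resulting loop (a sketch only; e_step and m_step stand in for the model-specific inference and maximization routines described on the following slides):

```python
# Generic EM skeleton: alternate inference (E-step) and parameter updates (M-step).
def em(y, theta_init, e_step, m_step, n_iter=50):
    theta = theta_init
    for _ in range(n_iter):
        posterior = e_step(y, theta)   # E-step: posterior over latent states p(X | Y, theta)
        theta = m_step(y, posterior)   # M-step: maximize Q = E_X[log p(X, Y | theta)]
    return theta
```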
Pictorial introduction to Gaussian process regression

[Sequence of plots of f(x) against x (x from −5 to 5, f(x) from −4 to 4) illustrating Gaussian process regression.]
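For reference, a minimal numpy sketch of the kind of GP posterior these plots illustrate, using a squared-exponential kernel; the hyperparameters and training data are illustrative only.

```python
# Exact GP regression with a squared-exponential kernel (1-D inputs).
import numpy as np

def sq_exp(a, b, ell=1.0, sf2=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return sf2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    """Posterior mean and covariance of f at x_test given noisy observations."""
    K = sq_exp(x_train, x_train) + noise * np.eye(len(x_train))
    Ks = sq_exp(x_train, x_test)
    Kss = sq_exp(x_test, x_test)
    mean = Ks.T @ np.linalg.solve(K, y_train)
    cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

x_tr = np.array([-4.0, -2.0, 0.0, 1.5, 3.5])   # illustrative training inputs
y_tr = np.sin(x_tr)                            # illustrative training targets
mu, cov = gp_posterior(x_tr, y_tr, np.linspace(-5, 5, 50))
```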
Existing Methods for nonlinear systems
- Extended Kalman Filter (EKF) [Maybeck, 1979]
- Unscented Kalman Filter (UKF) [Julier and Uhlmann, 1997]
- Assumed Density Filter (ADF) [Boyen and Koller, 1998, Opper, 1998]
- Radial Basis Functions (RBF) [Ghahramani and Roweis, 1999]
- Neural networks [Honkela and Valpola, 2005]
- Other GP approaches: GPDM and GPBF [Wang et al., 2008, Ko and Fox, 2009b]
- GPs for filtering in the context of the UKF and the EKF [Ko and Fox, 2009a], and the ADF [Deisenroth et al., 2009]
The GP-ADF
[Graphical model: a training sequence of states x_{τ−1}, x_τ, x_{τ+1} with measurements y_{τ−1}, y_τ, y_{τ+1} is used to learn f(·) and g(·); filtering is then performed on a test sequence x_{t−1}, x_t, x_{t+1} with measurements y_{t−1}, y_t, y_{t+1}.]
Advantages of GPIL

Model f and g with GPs: f ∼ GP_f, g ∼ GP_g. GPs account for three uncertainties:
- system noise
- measurement noise
- model uncertainty

- Integrates out the latent states (not MAP), unlike [Wang et al., 2008, Ko and Fox, 2009b].
- Tractable algorithm for approximate inference (smoothing) in GP state-space models.
- Learning without ground-truth observations x_i of the latent states.

[Small GP regression plot of f(x) against x.]
E-Step: Forward sweep
[Diagram of one forward-sweep step:]
1) Time update: predict the next hidden state, p(x_{t−1} | z_{1:t−1}) → p(x_t | z_{1:t−1}), via f.
2) Predict the measurement: p(x_t | z_{1:t−1}) → p(z_t | z_{1:t−1}), via g.
3) Measure z_t and form the hidden state posterior p(x_t | z_{1:t}) (measurement update).
Backward sweep also analytic
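A sketch of one forward-sweep step under these Gaussian approximations; predict_f and predict_g are hypothetical helpers returning moment-matched GP predictions, and the scalar case is shown for clarity rather than the exact implementation in the paper.

```python
# One step of the forward sweep: time update, measurement prediction, Gaussian conditioning.
def forward_step(mu_prev, var_prev, z_t, predict_f, predict_g):
    # 1) time update: moments of p(x_t | z_{1:t-1}) from the transition GP
    mu_x, var_x = predict_f(mu_prev, var_prev)
    # 2) measurement prediction: moments of p(z_t | z_{1:t-1}) and cross-covariance cov(x_t, z_t)
    mu_z, var_z, cov_xz = predict_g(mu_x, var_x)
    # 3) measurement update: Gaussian conditioning gives p(x_t | z_{1:t})
    gain = cov_xz / var_z
    mu_post = mu_x + gain * (z_t - mu_z)
    var_post = var_x - gain * cov_xz
    return mu_post, var_post
```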
Predictions Using Moment Matching
[Plot: the moment-matched Gaussian predictive distribution of x_{t+1} as a function of the uncertain input (x_t, u_t).]
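The idea in sketch form: propagate a Gaussian input through the GP and keep only the first two moments of the output. The talk computes these moments analytically; the snippet below approximates them by Monte Carlo purely for illustration.

```python
# Monte Carlo approximation of moment matching: mean and variance of f(x) for Gaussian x.
import numpy as np

def moment_match_mc(f, mu_x, var_x, n=100000, seed=0):
    rng = np.random.default_rng(seed)
    samples = f(rng.normal(mu_x, np.sqrt(var_x), size=n))  # f must accept numpy arrays
    return samples.mean(), samples.var()  # moments of the matched Gaussian approximation

mu, var = moment_match_mc(np.sin, mu_x=0.5, var_x=0.2)  # illustrative nonlinearity
```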
M-Step
[Graphical model, as on the Setup slide: latent chain x_{t−1} → x_t → x_{t+1} via f, each state emitting a measurement z_{t−1}, z_t, z_{t+1} via g.]
Pseudo-training data

[Scatter plot over the input range [−2, 2]: pseudo-inputs α_1, …, α_7 with corresponding pseudo-targets β_1, …, β_7.]
Why We Need Pseudo-training Data
[Graphical model: the latent chain x_{t−1}, x_t, x_{t+1} with measurements y_{t−1}, y_t, y_{t+1}; a pseudo-training set (α, β) parameterizes the transition GP and (ξ, υ) the measurement GP.]

GP_f and GP_g are not full GPs, but rather sparse GPs.
Why We Need Pseudo-training Data
- x_t → x_{t+1} given α and β is a GP prediction.
- x_{t−1} is an (uncertain) test input; α and β form a standard GP training set.
- x_{t+1} ⊥ x_{t−1} | x_t, α, β (Markov property).

Without a pseudo-training set, the corresponding statement x_{t+1} ⊥ x_{t−1} | x_t, f conditions on the ∞-dimensional object f, which is intractable.
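As a loose illustration (not the algorithm as specified in the paper): once f is represented by a finite pseudo-training set (α, β), the transition is just an ordinary GP prediction, so the gp_posterior sketch from the GP regression slide applies directly; the values below are made up.

```python
# Conditioning the transition on the finite pseudo set (alpha, beta) only;
# gp_posterior is the sketch defined earlier, all numbers are illustrative.
import numpy as np

alpha = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # pseudo-inputs
beta = np.tanh(alpha)                          # pseudo-targets (made up)
x_t = np.array([0.3])                          # current latent state as the test input
mu_next, cov_next = gp_posterior(alpha, beta, x_t)  # moments of p(x_{t+1} | x_t, alpha, beta)
```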
The Auxiliary Function

Using the factorization properties of the model, we decompose Q into

Q = E_X[log p(X, Y | Θ)]
  = E_X[log p(x_1 | Θ)]
    + Σ_{t=2}^{T} E_X[log p(x_t | x_{t−1}, Θ)]   (transition terms)
    + Σ_{t=1}^{T} E_X[log p(y_t | x_t, Θ)]       (measurement terms)
The Transition Contribution

Up to an additive constant,

E_X[log p(x_t | x_{t−1}, Θ)]
  = −(1/2) Σ_{i=1}^{M} ( E_X[(x_{ti} − μ_i(x_{t−1}))² / σ_i²(x_{t−1})]   (data fit term)
                        + E_X[log σ_i²(x_{t−1})] )                       (complexity term)

We approximate the data fit term by

E_X[(x_{ti} − μ_i(x_{t−1}))² / σ_i²(x_{t−1})] ≈ E_X[(x_{ti} − μ_i(x_{t−1}))²] / E_X[σ_i²(x_{t−1})]

and lower bound the EM lower bound using Jensen's inequality,

E_X[log σ_i²(x_{t−1})] ≤ log E_X[σ_i²(x_{t−1})].
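A small numeric sketch of the approximated per-step, per-dimension transition term, assuming the required moments have already been computed; the argument names are illustrative.

```python
# Approximate transition contribution for one time step and one output dimension.
import numpy as np

def transition_term(e_sq_err, e_var):
    """e_sq_err ~ E_X[(x_ti - mu_i(x_{t-1}))^2], e_var ~ E_X[sigma_i^2(x_{t-1})]."""
    data_fit = e_sq_err / e_var   # approximation to E[(x - mu)^2 / sigma^2]
    complexity = np.log(e_var)    # upper bound on E[log sigma^2] (Jensen's inequality)
    return -0.5 * (data_fit + complexity)
```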
Synthetic Data

[Plot of f(x) against x on x ∈ [−3, 3]: ground truth, posterior mean, and pseudo-targets of the learned transition model.]
Snow Data
[Plot of the learned model for the snowfall data (snowfall in log-cm): posterior mean and pseudo-targets.]
Quantitative Results
Method     NLL synth.       RMSE synth.   NLL real         RMSE real
TIM        2.21 ± 0.0091    2.18          1.47 ± 0.0257    1.01
Kalman     2.07 ± 0.0103    1.91          1.29 ± 0.0273    0.783
ARGP       1.01 ± 0.0170    0.663         1.25 ± 0.0298    0.793
NDFA       2.20 ± 0.00515   2.18          14.6 ± 0.374     1.06
GPDM       3330 ± 386       2.13          N/A              N/A
GPIL ⋆     0.917 ± 0.0185   0.654         0.684 ± 0.0357   0.769
UKF        4.55 ± 0.133     2.19          1.84 ± 0.0623    0.938
EKF        1.23 ± 0.0306    0.665         1.46 ± 0.0542    0.905
GP-UKF     6.15 ± 0.649     2.06          3.03 ± 0.357     0.884
Conclusions
- GPs provide a flexible distribution over nonlinear dynamical systems.
- Filtering and smoothing based on moment matching.
- Learning the dynamical system (even without ground-truth latent states).
References

Boyen, X. and Koller, D. (1998). Tractable inference for complex stochastic processes. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI 1998), pages 33–42, San Francisco, CA, USA. Morgan Kaufmann.

Deisenroth, M. P., Huber, M. F., and Hanebeck, U. D. (2009). Analytic moment-based Gaussian process filtering. In Bottou, L. and Littman, M. L., editors, Proceedings of the 26th International Conference on Machine Learning, pages 225–232, Montreal, Canada. Omnipress.

Ghahramani, Z. and Roweis, S. (1999). Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems 11, pages 599–605.

Honkela, A. and Valpola, H. (2005). Unsupervised variational Bayesian learning of nonlinear models. In Saul, L. K., Weiss, Y., and Bottou, L., editors, Advances in Neural Information Processing Systems 17, pages 593–600. MIT Press, Cambridge, MA.

Julier, S. J. and Uhlmann, J. K. (1997). A new extension of the Kalman filter to nonlinear systems. In Proceedings of AeroSense: 11th Symposium on Aerospace/Defense Sensing, Simulation and Controls, pages 182–193, Orlando, FL, USA.

Ko, J. and Fox, D. (2009a). GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90.

Ko, J. and Fox, D. (2009b). Learning GP-BayesFilters via Gaussian process latent variable models. In Proceedings of Robotics: Science and Systems, Seattle, USA.

Maybeck, P. S. (1979). Stochastic Models, Estimation, and Control, volume 141 of Mathematics in Science and Engineering. Academic Press, Inc.

Opper, M. (1998).