Identification of Gaussian Process State Space Models


arXiv:1705.10888v1 [stat.ML] 30 May 2017

Stefanos Eleftheriadis∗

Thomas F.W. Nicholson∗

Marc Peter Deisenroth∗†

James Hensman∗

Abstract The Gaussian process state space model (GPSSM) is a non-linear dynamical system, where unknown transition and/or measurement mappings are described by GPs. Most research in GPSSMs has focussed on the state estimation problem. However, the key challenge in GPSSMs has not been satisfactorily addressed yet: system identification. To address this challenge, we impose a structured Gaussian variational posterior distribution over the latent states, which is parameterised by a recognition model in the form of a bi-directional recurrent neural network. Inference with this structure allows us to recover a posterior smoothed over the entire sequence(s) of data. We provide a practical algorithm for efficiently computing a lower bound on the marginal likelihood using the reparameterisation trick. This additionally allows arbitrary kernels to be used within the GPSSM. We demonstrate that we can efficiently generate plausible future trajectories of the system we seek to model with the GPSSM, requiring only a small number of interactions with the true system.

1 Introduction

State space models can effectively address the problem of learning patterns and predicting behaviour in sequential data. Due to their modelling power they are applicable in many domains of science and engineering, such as robotics, finance and neuroscience [Brown et al., 1998]. Most research and applications have focussed on linear state space models, for which solutions for inference (state estimation) and learning (system identification) are well established [Kalman, 1960, Ljung, 1999]. In this work, we are interested in non-linear state space models. In particular, we consider the case where a Gaussian process (GP) is responsible for modelling the underlying dynamics. This model is widely known as the Gaussian process state space model (GPSSM) and possesses some desirable properties. Due to the non-parametric nature of GPs, the GPSSM promises to be effective in learning from small datasets. Hence, in many situations it can be advantageous over well-known parametric models (e.g., recurrent neural networks, RNNs), as in various reinforcement learning tasks where we are interested in efficiently learning a controller from observations of multiple short episodes. In such settings, we naturally need to deal with a lack of data initially, whereas classical system identification methods [Ljung, 1999] require many data points to find the underlying model. This is where the probabilistic nature of GPs can prove critical. By using a GP for the latent transitions, we can get away with an approximate model and learn a distribution over functions. Hence, we can potentially account for model errors whilst quantifying uncertainty, as discussed and empirically shown by Schneider [1997] and Deisenroth et al. [2015]. This ensures that the system will not become overconfident in regions of the space where data are not available. Although this seems attractive, proper system identification with the GPSSM is a challenging task.
This is due to the un-identifiability problems caused by the fact that both states and transition functions are unknown. Hence, most work has focused only on state estimation of the GPSSM. In this paper, we focus on addressing the challenge of system identification and, based on recent work by Frigola et al. [2014], we propose a novel inference method for learning the GPSSM. Specifically, we approximate the entire process of the state transition function by employing the framework of variational inference. Moreover, we assume a Markov-structured Gaussian posterior distribution over the latent states. The variational posterior can be naturally combined with a recognition model based on bi-directional recurrent neural networks, which facilitates smoothing of the posterior over the entire sequence of data. We present an efficient algorithm based on the reparameterisation trick for computing the lower bound on the marginal likelihood. This significantly accelerates learning of the model and allows for the use of arbitrary non-smooth kernel functions.

∗ PROWLER.io, Cambridge, UK. † Department of Computing, Imperial College London, UK.

2 Gaussian process state space models

In the following, we present the GPSSM, a dynamical system whose building blocks are Gaussian processes. We consider a dynamical system

$$\mathbf{x}_t = f(\mathbf{x}_{t-1}, \mathbf{a}_{t-1}) + \boldsymbol{\epsilon}_f, \qquad \mathbf{y}_t = g(\mathbf{x}_t) + \boldsymbol{\epsilon}_g, \tag{1}$$

where $t$ indexes time, $\mathbf{x} \in \mathbb{R}^D$ is a latent state, $\mathbf{a} \in \mathbb{R}^P$ are control signals (actions) and $\mathbf{y} \in \mathbb{R}^O$ are measurements/observations. Furthermore, we assume i.i.d. Gaussian system/measurement noise $\boldsymbol{\epsilon}_{(\cdot)} \sim \mathcal{N}(\mathbf{0}, \sigma^2_{(\cdot)} \mathbf{I})$. The state-space model eq. (1) can be fully described by the measurement and transition functions, $g$ and $f$, respectively.

The key idea of a GPSSM is to model the transition function $f$ and/or the measurement function $g$ in eq. (1) using Gaussian processes. A Gaussian process (GP) is a distribution over functions and is fully specified by a mean function $\eta(\cdot)$ and a covariance function $k(\cdot, \cdot)$ [see e.g. Rasmussen and Williams, 2006]. The covariance function allows us to encode basic structural assumptions about the class of functions we want to model, e.g., smoothness, periodicity or stationarity. A common choice for a covariance function is the radial basis function (RBF)

$$k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\big(-\tfrac{1}{2} (\mathbf{x} - \mathbf{x}')^\top \boldsymbol{\Lambda}^{-1} (\mathbf{x} - \mathbf{x}')\big),$$

which encodes the assumption that the underlying function is infinitely differentiable. Here, $\sigma_f$ is a parameter controlling the amplitude of the function, and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1^2, \dots, \lambda_D^2)$ is a diagonal matrix of squared length-scales $\lambda_i$.

In this work we make use of two important properties of GPs. Let $f(\cdot)$ denote a Gaussian process random function, and $\mathbf{X} = [\mathbf{x}_i]_{i=1}^N$ be a series of points in the domain of that function. Then, any finite subset of function evaluations is jointly Gaussian distributed: if $\mathbf{f} = [f(\mathbf{x}_i)]_{i=1}^N$ then

$$p(\mathbf{f} \mid \mathbf{X}) = \mathcal{N}\big(\mathbf{f} \mid \boldsymbol{\eta}(\mathbf{X}), \mathbf{K}_{XX}\big), \tag{2}$$

where the matrix $\mathbf{K}_{XX}$ contains evaluations of the kernel function at all pairs of datapoints in $\mathbf{X}$, and $\boldsymbol{\eta}(\mathbf{X})$ is a vector containing evaluations of the prior mean function at all points. This property leads to the widely used GP regression model: if additive Gaussian noise is assumed, the marginal likelihood can be computed in closed form, enabling learning of the kernel parameters.
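As a rough illustration of the RBF covariance and of the finite-dimensional Gaussian in eq. (2), the following NumPy sketch builds a kernel matrix and draws joint samples of function values. The function names, the jitter term added for numerical stability and the default hyper-parameters are assumptions made for this example only; they are not taken from the paper's implementation.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma_f=1.0, lengthscales=None):
    """RBF covariance with Lambda = diag(lengthscales**2); X1 is (N, D), X2 is (M, D)."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    A = X1 / lengthscales
    B = X2 / lengthscales
    # Squared scaled distances, clamped at zero to absorb rounding error.
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(sq, 0.0))

def sample_gp_prior(X, mean_fn, kernel, n_samples=1, jitter=1e-6, seed=0):
    """Draw samples of f = [f(x_i)] from N(eta(X), K_XX), as in eq. (2)."""
    rng = np.random.default_rng(seed)
    K = kernel(X, X) + jitter * np.eye(len(X))   # jitter is an assumed stabiliser
    L = np.linalg.cholesky(K)
    eps = rng.standard_normal((len(X), n_samples))
    return mean_fn(X)[:, None] + L @ eps

# Example: a one-dimensional input grid with a mean function that returns its input.
X = np.linspace(-2.0, 6.0, 30)[:, None]
F = sample_gp_prior(X, lambda X: X[:, 0], rbf_kernel, n_samples=3)
```

Each column of `F` is one joint draw of the function values at the grid points.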
The second important GP property is that the conditional distribution of a Gaussian process is another Gaussian process. If we observe the values $\mathbf{f}$ at the input locations $\mathbf{X}$, then we predict the values elsewhere on the GP at $\mathbf{x}_\star$ using the conditional

$$f(\mathbf{x}_\star) \mid \mathbf{f} \sim \mathcal{GP}\big(\eta(\mathbf{x}_\star) + k(\mathbf{x}_\star, \mathbf{X}) \mathbf{K}_{XX}^{-1} (\mathbf{f} - \boldsymbol{\eta}(\mathbf{X})),\; k(\mathbf{x}_\star, \mathbf{x}_\star) - k(\mathbf{x}_\star, \mathbf{X}) \mathbf{K}_{XX}^{-1} k(\mathbf{X}, \mathbf{x}_\star)\big). \tag{3}$$

In the GPSSM, we are presented with neither values of the function on which to condition nor inputs to the function, since the hidden states $\mathbf{x}_t$ are latent. The challenge of inference in the GPSSM lies in jointly inferring the latent variables $\mathbf{x}$ and fitting the Gaussian process dynamics $f(\mathbf{x})$.

In the GPSSM, we place independent GP priors on the transition function $f$ in the state-space model from eq. (1) for each output dimension of $\mathbf{x}_{t+1}$, and collect realisations of those functions in the random variables $\mathbf{f}$, such that

$$f_d(\cdot) \sim \mathcal{GP}\big(\eta_d(\cdot), k_d(\cdot, \cdot)\big), \qquad \mathbf{f}_t = [f_d(\tilde{\mathbf{x}}_{t-1})]_{d=1}^D \qquad \text{and} \qquad p(\mathbf{x}_t \mid \mathbf{f}_t) = \mathcal{N}(\mathbf{x}_t \mid \mathbf{f}_t, \sigma_f^2 \mathbf{I}), \tag{4}$$

where we used the short-hand notation $\tilde{\mathbf{x}}_t = [\mathbf{x}_t, \mathbf{a}_t]$ to collect the state-action pair at time $t$. In this work we use a mean function that keeps the state constant, so $\eta_d(\tilde{\mathbf{x}}_t) = x_t^{(d)}$.
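The conditional in eq. (3) can be sketched numerically as follows. This is a minimal illustration for a one-dimensional state with a mean function that returns its input; the helper names `rbf` and `gp_conditional`, the unit kernel hyper-parameters and the jitter term are assumptions of this example rather than part of the model definition.

```python
import numpy as np

def rbf(X1, X2, sigma_f=1.0, ell=1.0):
    """Isotropic RBF kernel between the rows of X1 (N, D) and X2 (M, D)."""
    d = X1[:, None, :] - X2[None, :, :]
    return sigma_f**2 * np.exp(-0.5 * np.sum((d / ell) ** 2, axis=-1))

def gp_conditional(X_star, X, f, mean_fn, kernel, jitter=1e-6):
    """Mean and covariance of f(X_star) given function values f at X, as in eq. (3)."""
    Kxx = kernel(X, X) + jitter * np.eye(len(X))   # assumed jitter for stability
    Ksx = kernel(X_star, X)
    Kss = kernel(X_star, X_star)
    L = np.linalg.cholesky(Kxx)
    # K_XX^{-1} (f - eta(X)) via two triangular systems.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f - mean_fn(X)))
    V = np.linalg.solve(L, Ksx.T)                  # L^{-1} k(X, x*)
    mean = mean_fn(X_star) + Ksx @ alpha
    cov = Kss - V.T @ V
    return mean, cov

# Example: condition on four function values; far from the data the
# prediction falls back to the prior mean and prior variance.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
eta = lambda X: X[:, 0]
f = eta(X) + np.sin(X[:, 0])
mean_far, cov_far = gp_conditional(np.array([[100.0]]), X, f, eta, rbf)
```

In the GPSSM neither `X` nor `f` is observed, which is why this conditional cannot be applied directly and an approximate inference scheme is needed.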

To reduce some of the un-identifiability problems of GPSSMs, we assume a linear measurement mapping $g$ so that the data conditional is

$$p(\mathbf{y}_t \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{y}_t \mid \mathbf{W}_g \mathbf{x}_t + \mathbf{b}_g, \sigma_g^2 \mathbf{I}). \tag{5}$$

The linear observation model $g(\mathbf{x}) = \mathbf{W}_g \mathbf{x} + \mathbf{b}_g + \boldsymbol{\epsilon}_g$ is not limiting, since a non-linear $g$ could be replaced by additional dimensions in the state space [Frigola-Alcade, 2015].

2.1 Related work

State estimation in GPSSMs has been proposed by Ko and Fox [2009a] and Deisenroth et al. [2009] for filtering and by Deisenroth et al. [2012] and Deisenroth and Mohamed [2012] for smoothing, using both deterministic (e.g., linearisation) and stochastic (e.g., particles) approximations. These approaches did not focus on system identification (parameter learning) but on inference in learned GPSSMs. This can be attributed to the fact that learning the state transition function $f$ without observing the system's true state $\mathbf{x}$ is challenging. Towards this end, Wang et al. [2008], Ko and Fox [2009b] and Turner et al. [2010] proposed methods for learning GPSSMs based on maximum likelihood estimation. Frigola et al. [2013] followed a fully Bayesian treatment of the problem and proposed an inference mechanism based on particle Markov chain Monte Carlo. Specifically, they first obtain sample trajectories from the smoothing distribution that could be used to define a predictive density via Monte Carlo integration. Then, conditioned on this trajectory, they sample the model's hyper-parameters. The downside of this approach is the expensive inference, whose computational cost scales proportionally to the length of the time series and the number of particles. In order to tackle this inefficiency, Frigola et al. [2014] suggested a hybrid inference approach combining variational inference and sequential Monte Carlo. Using the sparse variational framework of Titsias [2009] to approximate the GP led to a tractable distribution over the state transition function that is independent of the length of the time series.

An alternative to learning a state-space model is to follow an autoregressive strategy, as in Murray-Smith and Girard [2001], Likar and Kocijan [2007], Turner [2011], Roberts et al. [2013] and Kocijan [2016], in order to model the function as a direct mapping from previous to current observations. However, such an autoregressive structure can be problematic: as it is learned over the true observations and not in a latent space, noise is propagated through the system during inference. To alleviate this, Mattos et al. [2015] proposed the recurrent GP, a non-linear dynamical model that resembles a deep GP mapping from observed inputs to observed outputs, with an autoregressive structure on the intermediate latent states. They further followed the idea of Dai et al. [2015] and introduced a recognition model to approximate the true posterior of the latent state. More specifically, they suggested using an RNN to model the means of the variational approximate distribution at each layer as a function of past latent states from previous layers. A downside of this approach is the need to update the approximation to the posterior after observing new data. Hence, performing inference for future points under the recurrent GP requires feeding the future actions forward through the RNN in order to propagate uncertainty towards the outputs. Another issue stems from the model's inefficiency in analytically computing expectations of the kernel functions under the approximate posterior when dealing with high-dimensional latent states.

Recently, Al-Shedivat et al. [2016], inspired by the promising results of the recurrent GP [Mattos et al., 2015], introduced a recurrent structure to the manifold GP [Calandra et al., 2016]: they proposed to use an LSTM in order to map the observed inputs onto a non-linear manifold, which is the space that the GP actually operates on. For efficient inference they follow an approximate inference scheme based on Kronecker products over Toeplitz-structured kernels.

A common theme in the above work is the attempt to fit models to noisy sequences. This causes uncertainty to appear on both inputs and outputs of the GP. McHutchon and Rasmussen [2011] proposed to cope with this by taking a local linearisation of the function and using it to propagate uncertainty from the inputs to the output of the GP, where it can be dealt with more naturally.

3 Inference

Our inference scheme uses variational Bayes [see e.g. Beal, 2003, Blei et al., 2017]. We first define the form of the approximation to the posterior (denoted $q(\dots)$), and then derive the evidence lower bound (ELBO), with respect to which the posterior approximation is optimized in order to minimize the Kullback-Leibler divergence between the approximate and true posteriors. We detail how the ELBO is estimated in a stochastic fashion and optimized using gradient-based methods, and describe how the form of the approximate posterior is given by a recurrent neural network. The graphical models of the GPSSM and our proposed approximation are shown in Figure 1.

Figure 1: The GPSSM with the GP state transition functions (left), and the proposed approximation with the recognition model in the form of a bi-RNN (right). Black arrows show conditional dependencies of the model, red arrows show the data-flow in the recognition model.

3.1 Posterior approximation

Following [Frigola et al., 2014], we adopt a variational approximation to the posterior, assuming factorisation between the latent functions $f(\cdot)$ and the state trajectories $\mathbf{X}$. However, unlike Frigola et al.'s work, we do not run particle MCMC to approximate the state trajectories, but instead assume that the posterior over states is given by a Markov-structured Gaussian distribution parameterised by a recognition model (see section 3.3). In concordance with [Frigola et al., 2014], we adopt a sparse variational framework to approximate the GP. The sparse approximation allows us to deal with both (a) the unobserved nature of the GP inducing inputs and (b) any potential computational scaling issues with the GP by controlling the number of inducing points in the approximation.

The variational approximation to the GP posterior is formed as follows. Let $\mathbf{Z} = [\mathbf{z}_1, \dots, \mathbf{z}_M]$ be some selected points in the same domain as $\tilde{\mathbf{x}}$. For each Gaussian process $f_d(\cdot)$, we collect evaluations of the function at $\mathbf{Z}$ into the inducing variables $\mathbf{u}_d = [f_d(\mathbf{z}_m)]_{m=1}^M$, meaning that the density of $\mathbf{u}_d$ under the GP prior is $\mathcal{N}(\boldsymbol{\eta}_d(\mathbf{Z}), \mathbf{K}_{zz})$. We make a mean-field variational approximation to the posterior for $\mathbf{U}$, taking the form $q(\mathbf{U}) = \prod_{d=1}^D \mathcal{N}(\mathbf{u}_d \mid \boldsymbol{\mu}_d, \boldsymbol{\Sigma}_d)$. The variational posterior of the rest of the points on the GP is assumed to be given by the same conditional distribution as the prior:

$$f_d(\cdot) \mid \mathbf{u}_d \sim \mathcal{GP}\big(\eta_d(\cdot) + k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} (\mathbf{u}_d - \boldsymbol{\eta}_d(\mathbf{Z})),\; k(\cdot, \cdot) - k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} k(\mathbf{Z}, \cdot)\big). \tag{6}$$

Integrating this expression with respect to the prior distribution $p(\mathbf{u}_d) = \mathcal{N}(\boldsymbol{\eta}_d(\mathbf{Z}), \mathbf{K}_{zz})$ gives the GP prior in eq. (4). Integrating with respect to the variational distribution $q(\mathbf{U})$ gives our approximation to the posterior process $f_d(\cdot) \sim \mathcal{GP}\big(\mu_d(\cdot), v_d(\cdot, \cdot)\big)$, with

$$\mu_d(\cdot) = \eta_d(\cdot) + k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} (\boldsymbol{\mu}_d - \boldsymbol{\eta}_d(\mathbf{Z})), \qquad v_d(\cdot, \cdot) = k(\cdot, \cdot) - k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} [\mathbf{K}_{zz} - \boldsymbol{\Sigma}_d] \mathbf{K}_{zz}^{-1} k(\mathbf{Z}, \cdot). \tag{7}$$
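The posterior process in eq. (7) can be evaluated at a batch of test inputs with a few lines of linear algebra. The sketch below treats a single output dimension $d$; the function names, the isotropic RBF kernel and the jitter added to $\mathbf{K}_{zz}$ are assumptions made for illustration, not details of the authors' implementation.

```python
import numpy as np

def rbf(X1, X2, sigma_f=1.0, ell=1.0):
    """Isotropic RBF kernel between the rows of X1 (N, D) and X2 (M, D)."""
    d = X1[:, None, :] - X2[None, :, :]
    return sigma_f**2 * np.exp(-0.5 * np.sum((d / ell) ** 2, axis=-1))

def sparse_posterior(X_star, Z, mu, Sigma, mean_fn, kernel, jitter=1e-6):
    """Mean mu_d(X_star) and covariance v_d(X_star, X_star) from eq. (7)."""
    Kzz = kernel(Z, Z) + jitter * np.eye(len(Z))   # assumed jitter for stability
    Ksz = kernel(X_star, Z)
    Kss = kernel(X_star, X_star)
    A = np.linalg.solve(Kzz, Ksz.T).T              # k(., Z) K_zz^{-1}, shape (N*, M)
    mean = mean_fn(X_star) + A @ (mu - mean_fn(Z))
    cov = Kss - A @ (Kzz - Sigma) @ A.T
    return mean, cov

# Example: three inducing points in a one-dimensional input space.
Z = np.array([[0.0], [1.5], [3.0]])
X_star = np.linspace(-1.0, 4.0, 7)[:, None]
eta = lambda X: X[:, 0]                            # mean function returning its input
mu = np.array([1.0, 2.0, 2.5])                     # variational mean of u_d
Sigma = 0.01 * np.eye(3)                           # variational covariance of u_d
mean, cov = sparse_posterior(X_star, Z, mu, Sigma, eta, rbf)
```

Two limiting cases give a quick sanity check: setting `mu` to the prior mean and `Sigma` to $\mathbf{K}_{zz}$ recovers the prior, while `Sigma = 0` recovers the exact conditional in eq. (6) for a fixed value of the inducing variables.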

The approximation to the state trajectory posterior is assumed to have a Gauss-Markov structure:

$$q(\mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_0 \mid \mathbf{m}_0, \mathbf{L}_0 \mathbf{L}_0^\top\big), \qquad q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t \mid \mathbf{A}_t \mathbf{x}_{t-1}, \mathbf{L}_t \mathbf{L}_t^\top\big). \tag{8}$$

This distribution is specified through a single mean vector $\mathbf{m}_0$, a series of square matrices $\mathbf{A}_t$, and a series of lower-triangular matrices $\mathbf{L}_t$. Our assumption that the state trajectory can be modelled as Gaussian is not true in general, though our experiments suggest that the approximation is successful.

With the approximating distributions for the variational posterior defined in eq. (7) and (8), we are ready to derive the evidence lower bound (ELBO) on the model's marginal likelihood. Following [Frigola-Alcade, 2015] equation 5.10, the ELBO is given by

$$\begin{aligned}
\mathrm{ELBO} ={}& \mathbb{E}_{q(\mathbf{x}_0)}[\log p(\mathbf{x}_0)] + H[q(\mathbf{X})] - \mathrm{KL}[q(\mathbf{U}) \,\|\, p(\mathbf{U})] \\
&+ \mathbb{E}_{q(\mathbf{X})}\Big[\sum_{t=1}^{T} \sum_{d=1}^{D} \Big(-\frac{1}{2\sigma_f^2} v_d(\tilde{\mathbf{x}}_{t-1}, \tilde{\mathbf{x}}_{t-1}) + \log \mathcal{N}\big(x_t^{(d)} \mid \mu_d(\tilde{\mathbf{x}}_{t-1}), \sigma_f^2\big)\Big)\Big] \\
&+ \mathbb{E}_{q(\mathbf{X})}\Big[\sum_{t=1}^{T} \log \mathcal{N}\big(\mathbf{y}_t \mid g(\mathbf{x}_t), \sigma_g^2 \mathbf{I}_O\big)\Big],
\end{aligned} \tag{9}$$

where $\mathrm{KL}[\cdot \| \cdot]$ is the Kullback-Leibler divergence between two distributions, and $H[\cdot]$ denotes the entropy of a distribution. Note that with the above formulation we can naturally deal with data from multiple episodes, since the ELBO factorises across independent episodes. We can now learn the GPSSM by optimising the ELBO w.r.t. the parameters of the model and the variational parameters. A full derivation is provided in the supplementary material.

The form of the ELBO justifies the Markov structure that we have assumed for the variational distribution $q(\mathbf{X})$: we see that the latent states only interact over pair-wise time steps $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$, so adding further structure to $q(\mathbf{X})$ is unnecessary.

3.2 Efficient computation of the ELBO

To compute the ELBO in eq. (9), we need to compute expectations w.r.t. $q(\mathbf{X})$. Frigola et al. [2014] showed that for the RBF kernel the relevant expectations can be computed in closed form in a similar way to Titsias and Lawrence [2010]. To allow for general kernels, we propose to use the reparameterisation trick [Kingma and Welling, 2013] instead: by sampling a single trajectory from $q(\mathbf{X})$ and evaluating the integrand in eq. (9), we obtain an unbiased estimate of the ELBO. To draw a sample from the Gauss-Markov structure in eq. (8), we first sample $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $t = 0, \dots, T$, and then apply the recursion

$$\mathbf{x}_0 = \mathbf{m}_0 + \mathbf{L}_0 \boldsymbol{\epsilon}_0, \qquad \mathbf{x}_t = \mathbf{A}_t \mathbf{x}_{t-1} + \mathbf{L}_t \boldsymbol{\epsilon}_t. \tag{10}$$
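The recursion in eq. (10) translates directly into code. The following NumPy sketch draws one trajectory from $q(\mathbf{X})$; the array layout (stacking $\mathbf{A}_1, \dots, \mathbf{A}_T$ and $\mathbf{L}_1, \dots, \mathbf{L}_T$ along the first axis) and the function name are assumptions of this example. In a gradient-based implementation the same recursion would be written in an automatic-differentiation framework so that gradients flow through the sample to the variational parameters.

```python
import numpy as np

def sample_trajectory(m0, L0, A, L, rng):
    """Draw x_0, ..., x_T from the Gauss-Markov q(X) using eq. (10).

    m0: (D,) initial mean; L0: (D, D) factor of the initial covariance;
    A, L: (T, D, D) arrays holding A_t and L_t for t = 1, ..., T.
    """
    T, D = A.shape[0], m0.shape[0]
    X = np.empty((T + 1, D))
    X[0] = m0 + L0 @ rng.standard_normal(D)
    for t in range(1, T + 1):
        eps = rng.standard_normal(D)               # epsilon_t ~ N(0, I)
        X[t] = A[t - 1] @ X[t - 1] + L[t - 1] @ eps
    return X

# Example: a contracting two-dimensional chain with small noise.
rng = np.random.default_rng(0)
T, D = 5, 2
m0 = np.array([1.0, -1.0])
A = np.tile(0.9 * np.eye(D), (T, 1, 1))
L = np.tile(0.1 * np.eye(D), (T, 1, 1))
X = sample_trajectory(m0, 0.1 * np.eye(D), A, L, rng)   # shape (T + 1, D)
```

Because all randomness enters through the standard normal draws, the sample is a deterministic function of the variational parameters given the noise, which is what makes the single-sample ELBO estimate differentiable.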

This simple estimator of the ELBO can then be used in optimisation using stochastic gradient methods; we have used the Adam optimizer [Kingma and Ba, 2014]. It may initially seem counter-intuitive to use a stochastic estimate of the ELBO where one is available in closed form, but this approach offers two distinct advantages. First, computation is dramatically reduced: our scheme requires $O(TD)$ storage in order to evaluate the integrand in eq. (9) at a single sample from $q(\mathbf{X})$. A scheme which computes the integral in closed form requires $O(TM^2)$ storage (where $M$ is the number of inducing variables in the sparse GP) in order to store the sufficient statistics of the kernel evaluations. The second advantage is that we are no longer restricted to the RBF kernel, but can use any valid kernel for inference and learning in GPSSMs. The reparameterisation trick also allows us to perform batched updates of the model parameters, amounting to doubly stochastic variational inference [Titsias and Lázaro-Gredilla, 2014], which we experimentally found to improve run-time and sample-efficiency.

Some of the elements of the ELBO in eq. (9) are still available in closed form. To reduce the variance of the estimate of the ELBO we exploit this where possible. The entropy of the Gauss-Markov structure is $H[q(\mathbf{X})] = \frac{(T+1)D}{2} \log(2\pi) + \frac{(T+1)D}{2} + \sum_{t=0}^{T} \log\det(\mathbf{L}_t)$, since each of the $T+1$ Gaussian factors in eq. (8) contributes $\frac{D}{2} \log(2\pi e) + \log\det(\mathbf{L}_t)$; the expected data likelihood (last term in eq. (9)) can be computed easily given the marginals of $q(\mathbf{X})$, which are given by

$$q(\mathbf{x}_t) = \mathcal{N}(\mathbf{m}_t, \boldsymbol{\Sigma}_t), \qquad \mathbf{m}_t = \mathbf{A}_t \mathbf{m}_{t-1}, \qquad \boldsymbol{\Sigma}_t = \mathbf{A}_t \boldsymbol{\Sigma}_{t-1} \mathbf{A}_t^\top + \mathbf{L}_t \mathbf{L}_t^\top, \tag{11}$$

and the necessary Kullback-Leibler divergences can also be computed analytically: we use the implementations from GPflow [Matthews et al., 2017].

3.3 A recurrent recognition model

The variational distribution of the latent trajectories in eq. (8) has a large number of parameters ($\mathbf{A}_t$, $\mathbf{L}_t$) that grows with the length of the dataset. Further, if we wish to train a model on multiple episodes (independent data sequences sharing the same dynamics), then the number of parameters grows further still. To alleviate this, we propose to use a recognition model in the form of a bi-directional recurrent neural network (bi-RNN). A bi-RNN is a combination of two independent RNNs operating in opposite directions along the sequence. Each network is specified by two weight matrices $\mathbf{W}$ acting on a hidden state $\mathbf{h}$:

$$\mathbf{h}^{(f)}_{t+1} = \phi\big(\mathbf{W}^{(f)}_h \mathbf{h}^{(f)}_t + \mathbf{W}^{(f)}_{\tilde{y}} \tilde{\mathbf{y}}_t + \mathbf{b}^{(f)}_h\big), \tag{12}$$

$$\mathbf{h}^{(b)}_{t-1} = \phi\big(\mathbf{W}^{(b)}_h \mathbf{h}^{(b)}_t + \mathbf{W}^{(b)}_{\tilde{y}} \tilde{\mathbf{y}}_t + \mathbf{b}^{(b)}_h\big), \tag{13}$$

where $\tilde{\mathbf{y}}_t$ denotes the concatenation of the observed data and control actions $[\mathbf{y}_t, \mathbf{a}_t]$ and the superscripts denote the direction of the RNN. The activation function $\phi$ acts on each element of its argument separately; we use the tanh function. In our experiments we found that using gated recurrent units [Cho et al., 2014] improved performance of our model. We now make the parameters of the Gauss-Markov structure dependent on the sequences $\mathbf{h}^{(f)}$, $\mathbf{h}^{(b)}$, so that

$$\mathbf{A}_t = \mathrm{reshape}\big(\mathbf{W}_A [\mathbf{h}^{(f)}_t; \mathbf{h}^{(b)}_t] + \mathbf{b}_A\big), \qquad \mathbf{L}_t = \mathrm{reshape}\big(\mathbf{W}_L [\mathbf{h}^{(f)}_t; \mathbf{h}^{(b)}_t] + \mathbf{b}_L\big). \tag{14}$$

The parameters of the Gauss-Markov structure $q(\mathbf{X})$ are now almost completely encapsulated in the recurrent recognition model as $\mathbf{W}^{(f,b)}_h$, $\mathbf{W}^{(f,b)}_{\tilde{y}}$, $\mathbf{W}_A$, $\mathbf{W}_L$, $\mathbf{b}^{(f,b)}_h$, $\mathbf{b}_A$, $\mathbf{b}_L$. We only need to infer the parameters of the initial state, $\mathbf{m}_0$, $\mathbf{L}_0$; this is where we utilise the functionality of the bi-RNN structure. Instead of directly learning the initial state $q(\mathbf{x}_0)$, we can now obtain it indirectly via the output of the backward RNN. Another nice property of the proposed recognition model is that $q(\mathbf{X})$ is now recognised from both future and past observations, since the proposed bi-RNN recognition model can be regarded as a forward and backward sequential smoother of our variational posterior.
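To make the data flow of eqs. (12) to (14) concrete, here is a small NumPy sketch of the forward pass of such a recognition model with plain tanh units. Several details are assumptions of this example and not taken from the paper: the hidden states start at zero, `out[t]` stores the hidden state after consuming $\tilde{\mathbf{y}}_t$ (which simplifies the index alignment of eqs. (12) and (13)), the lower-triangular structure of $\mathbf{L}_t$ is imposed with `np.tril`, and the parameter dictionary layout is hypothetical. No constraint is placed on the diagonal of $\mathbf{L}_t$.

```python
import numpy as np

def rnn_pass(Y, W_h, W_y, b_h, reverse=False):
    """Run a tanh RNN over Y (T, O); out[t] is the state after consuming Y[t]."""
    T, H = Y.shape[0], W_h.shape[0]
    h = np.zeros(H)                                # assumed zero initial state
    out = np.empty((T, H))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(W_h @ h + W_y @ Y[t] + b_h)
        out[t] = h
    return out

def gauss_markov_params(Y, params, D):
    """Map a sequence Y to the matrices A_t and L_t of eq. (14)."""
    hf = rnn_pass(Y, *params["fwd"])
    hb = rnn_pass(Y, *params["bwd"], reverse=True)
    h = np.concatenate([hf, hb], axis=1)           # [h_t^(f); h_t^(b)], shape (T, 2H)
    A = (h @ params["W_A"].T + params["b_A"]).reshape(-1, D, D)
    L = (h @ params["W_L"].T + params["b_L"]).reshape(-1, D, D)
    return A, np.tril(L)                           # keep only the lower triangle

# Example with random weights: T steps, O-dimensional inputs, H hidden units.
rng = np.random.default_rng(0)
T, O, H, D = 4, 3, 8, 2
init = lambda *shape: 0.5 * rng.standard_normal(shape)
params = {
    "fwd": (init(H, H), init(H, O), init(H)),
    "bwd": (init(H, H), init(H, O), init(H)),
    "W_A": init(D * D, 2 * H), "b_A": init(D * D),
    "W_L": init(D * D, 2 * H), "b_L": init(D * D),
}
A, L = gauss_markov_params(rng.standard_normal((T, O)), params, D)
```

Because the backward pass reaches the first time step only after reading the whole sequence, the parameters at early time steps depend on later observations, which is the smoothing behaviour described above.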

4 Experiments

In this section, we benchmark the proposed GPSSM approach on data from one illustrative example and two challenging non-linear data sets. Our aim is to explicitly demonstrate that we can: (i) cheaply and effortlessly benefit from the use of non-smooth kernels with our approximate inference and accurately model non-smooth transition functions; (ii) successfully learn non-linear dynamical systems even when we do not have access to the true state (partially observed inputs); (iii) sample future trajectories and generate plausible future states of the system even when trained with a small number of episodes of fully and partially observed inputs.

4.1 Non-linear system identification

We first apply our approach to a synthetic dataset generated broadly according to [Frigola et al., 2014]. The data is created using a non-linear, non-smooth transition function with additive state and observation noise according to $p(x_{t+1} \mid x_t) = \mathcal{N}(f(x_t), \sigma_f^2)$ and $p(y_t \mid x_t) = \mathcal{N}(x_t, \sigma_g^2)$, where

$$f(x) = \begin{cases} x + 1, & \text{if } x < 4, \\ 13 - 2x, & \text{otherwise.} \end{cases} \tag{15}$$
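A dataset of this kind can be simulated in a few lines. The sketch below uses the transition function of eq. (15) together with the settings reported for this experiment ($\sigma_f^2 = 0.01$, $\sigma_g^2 = 0.1$, 200 episodes of length 10). The distribution of the initial state is not specified in the text, so the uniform draw used here is purely an assumption, as are the function names.

```python
import numpy as np

def kink(x):
    """Transition function of eq. (15): slope +1 below x = 4, slope -2 above."""
    return np.where(x < 4.0, x + 1.0, 13.0 - 2.0 * x)

def simulate(n_episodes=200, length=10, var_f=0.01, var_g=0.1, seed=0):
    """Simulate latent states X and noisy observations Y, each of shape (episodes, length)."""
    rng = np.random.default_rng(seed)
    X = np.empty((n_episodes, length))
    X[:, 0] = rng.uniform(-2.0, 6.0, size=n_episodes)   # assumed initial-state distribution
    for t in range(1, length):
        noise = np.sqrt(var_f) * rng.standard_normal(n_episodes)
        X[:, t] = kink(X[:, t - 1]) + noise
    Y = X + np.sqrt(var_g) * rng.standard_normal(X.shape)
    return X, Y

X, Y = simulate()
```

The two branches of `kink` meet at $x = 4$, where both give 5, so the function is continuous but not differentiable there. Only `Y` would be passed to the GPSSM; `X` plays the role of the unobserved true state.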

In our experiments we set the noise variances to $\sigma_f^2 = 0.01$ and $\sigma_g^2 = 0.1$, and generate 200 sequences (episodes) of length 10 that were used as the observed data for training the GPSSM. We used 2 inducing points (initialised uniformly across the range of the input data) for approximating the GP and 20 hidden units for the recurrent recognition model. In this experiment, we demonstrate the ability to perform inference under the proposed model with arbitrary non-smooth kernel functions.³ Here we compare the following kernels: RBF, additive composition of the RBF (initial $\ell = 10$) and Matern ($\nu = \frac{1}{2}$, initial $\ell = 0.1$) [Rasmussen and Williams, 2006], arc-cosine (order 0) [Cho and Saul, 2009], and the MGP kernel [Calandra et al., 2016] (depth 5, hidden dimensions [3, 2, 3, 2, 3], tanh activation, Matern ($\nu = \frac{1}{2}$) compound kernel). All kernels are additively composed with the constant kernel.

The learnt GP state transition functions are shown in Figure 2. As we can see, by using the non-smooth kernels we are able to learn accurate transitions and model the instantaneous dynamical change, as opposed to the smooth transition learnt with the RBF. Note that all non-smooth kernels place inducing points directly on the peak (at $x_t = 4$) to model the kink, whereas the RBF kernel explains this behaviour as a longer-scale wiggliness of the posterior process. An interesting finding is that when using a kernel without the RBF component the GP posterior quickly reverts to the mean function ($\eta(x) = x$) as we move away from the data: the short length-scales that enable these kernels to model the instantaneous change prevent them from extrapolating downwards in the transition function. The composition of the RBF and Matern kernels benefits from both long and short length-scales. Therefore, it can learn the instantaneous change while also extrapolating to unseen regions of the function's domain. The posteriors can be viewed across a longer range of the function space in the supplementary material.

4.2 Modelling cart-pole dynamics

In this experiment, we demonstrate the efficacy of the proposed GPSSM on learning the non-linear dynamics of the cart-pole system from [Deisenroth and Rasmussen, 2011]. The system is composed of a cart running on a track, with a freely swinging pendulum attached to it. The state of the system consists of the cart's position and velocity, and the pendulum's angle and angular velocity, while a horizontal force (action) $a \in [-10, 10]$ N can be applied to the cart. We used the data-efficient reinforcement learning algorithm from [Deisenroth and Rasmussen, 2011] to learn a feedback controller that swings the pendulum up and balances it in the inverted position in the middle of the track. We collected trajectory data from 16 trials during learning; each trajectory/episode was 4 s (40 time steps) long. The 16th episode serves as the test data.

³ This is the benefit of using the reparameterisation trick for approximating the ELBO (see section 3.2).

[Figure 2 panels: RBF, RBF + Matern, Arc-cosine, MGP; each plots $x_{t+1}$ against $x_t$ and shows the GP posterior, the inducing points and the ground truth.]

Figure 2: Visualisation of the GP state transition function learnt with different kernels. The true underlying function is given by eq. (15).

When training the GPSSM for the cart-pole system we used data up to the first 15 episodes. We used 100 inducing points to approximate the GP function with a Matern $\nu = \frac{1}{2}$ kernel and 50 hidden units for the recurrent recognition model. The learning rate for the Adam optimiser was set to $10^{-3}$. We qualitatively assess the performance of our model by feeding the control sequence of the last episode to the GPSSM in order to generate future responses.

In Figure 3, we demonstrate the ability of the proposed GPSSM to learn the underlying dynamics of the system from different numbers of episodes with fully and partially observed data. In the top row, the GPSSM observes the full 4D state, while in the bottom row we train the GPSSM with only the cart's position and the pendulum's angle observed (i.e., velocities are hidden). In both cases, sampling long-term trajectories based on only 2 episodes for training does not result in plausible future trajectories. However, we could model part of the dynamics after training with only 8 episodes (320 time steps of interaction with the system), while training with 15 episodes (600 time steps in total) allowed the GPSSM to produce trajectories similar to the ground truth. It is worth emphasising that the GPSSM could recover the unobserved velocities in the latent states, which resulted in smooth transitions of the cart and swinging of the pendulum. Hence, the simulated behaviour was close to the ground truth. Detailed fittings for each episode and learnt latent states with observed and hidden velocities are provided in the supplementary material.

We also ran experiments using lagged actions, where the current partially observed state is affected by the action two time steps earlier. The results (provided in the supplementary material) show that we are able to sample future trajectories with an accuracy similar to that obtained with time-aligned actions. This indicates that our model is able to learn a compressed representation of the full state and previous inputs, essentially 'remembering' the lagged actions.

4.3 Modelling double-pendulum dynamics

Similarly to the previous experiment, in this section we learn and model the dynamics of the double pendulum system from [Deisenroth et al., 2015]. The double pendulum is a two-link robot arm with two actuators. The state of the system consists of the angles and the corresponding angular velocities of the inner and outer link, respectively, while different torques $a_1, a_2 \in [-2, 2]$ Nm can be applied to the two actuators. The task of swinging the double pendulum up and balancing it in the upwards position is extremely challenging. First, it requires the interplay of two correlated control signals (i.e., the torques). Second, the behaviour of the system, when left to evolve freely, is chaotic. We follow a similar approach as with the cart-pole dataset and learn the underlying dynamics from episodic data (15 episodes, 30 time steps long each). Training of the GPSSM was performed with data from up to 14 episodes, while always demonstrating the learnt underlying dynamics on the last episode, which serves as the test set. We used 200 inducing points to approximate the GP function with a Matern $\nu = \frac{1}{2}$ kernel and 80 hidden units for the recurrent recognition model. The learning rate for the Adam optimiser was set to $10^{-3}$. The difficulty of the task is evident in Figure 4, where we can see that even after observing 14 episodes we cannot accurately predict the system's future behaviour for more than 15 time steps (i.e., 1.5 s). It is worth noting again that we can generate reliable simulations even though we observe only the pendulums' angles.

[Figure 3 panels: 2 episodes (80 time steps in total), 8 episodes (320 time steps in total), 15 episodes (600 time steps in total); each plots cart position, angle and control signal against time step (0 to 40).]

Figure 3: Predicting the cart's position and pendulum's angle behaviour from the cart-pole dataset by applying the control signal of the testing episode to sampled future trajectories from the proposed GPSSM. Learning of the dynamics is demonstrated with observed (upper row) and hidden (lower row) velocities and with increasing number of training episodes. Ground truth is denoted with the marked lines.

[Figure 4 panels: 2 episodes (60 time steps in total), 8 episodes (240 time steps in total), 14 episodes (420 time steps in total); each plots inner angle, outer angle, inner torque and outer torque against time step (0 to 30).]

Figure 4: Predicting the behaviour of the inner and outer pendulum's angle from the double-pendulum dataset by applying the control signals of the testing episode to sampled future trajectories from the proposed GPSSM. Learning of the dynamics is demonstrated with hidden angular velocities (see supplementary material for observed velocities) and with increasing number of training episodes. Ground truth is denoted with the marked lines.

5 Conclusion

We have proposed a novel inference mechanism for the GPSSM in order to address the challenge of non-linear system identification. We derived a variational lower bound for approximating the entire process and introduced a Gaussian posterior distribution over the latent states, which induces a Markov structure. By exploiting the reparameterisation trick in our inference we achieve computational efficiency during training, while benefiting from the ability to use non-smooth kernel functions. We have provided experimental evidence that our approach can identify latent dynamics, even from partial observations, while requiring only small data sets for this challenging task.

References

Maruan Al-Shedivat, Andrew G. Wilson, Yunus Saatchi, Zhiting Hu, and Eric P. Xing. Learning scalable deep kernels with recurrent structure. arXiv preprint arXiv:1610.08936, 2016.

Matthew J. Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University of London, London, UK, 2003.

David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, (accepted), 2017.

Emery N. Brown, Loren M. Frank, Dengda Tang, Michael C. Quirk, and Matthew A. Wilson. A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells. Journal of Neuroscience, 18(18):7411–7425, 1998.

Roberto Calandra, Jan Peters, Carl E. Rasmussen, and Marc P. Deisenroth. Manifold Gaussian processes for regression. In Proceedings of the IEEE International Joint Conference on Neural Networks, 2016.

KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. CoRR, abs/1409.1259, 2014.

Youngmin Cho and Lawrence K. Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 342–350, 2009.

Zhenwen Dai, Andreas Damianou, Javier González, and Neil Lawrence. Variational auto-encoded deep Gaussian processes. In International Conference on Learning Representations (ICLR), 2015.

Marc P. Deisenroth and Shakir Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems (NIPS), pages 2618–2626, 2012.

Marc P. Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning (ICML), pages 465–472, 2011.

Marc P. Deisenroth, Marco F. Huber, and Uwe D. Hanebeck. Analytic Moment-based Gaussian process filtering. In Proceedings of the 26th International Conference on Machine Learning (ICML), pages 225–232, June 2009.

Marc P. Deisenroth, Ryan D. Turner, Marco Huber, Uwe D. Hanebeck, and Carl E. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.

Marc P. Deisenroth, Dieter Fox, and Carl E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS), pages 3156–3164, 2013.

Roger Frigola, Yutian Chen, and Carl E. Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems (NIPS), pages 3680–3688, 2014.

Roger Frigola-Alcade. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge, 2015.

Rudolf E. Kalman. A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering, 82(Series D):35–45, 1960.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2013.

Jonathan Ko and Dieter Fox. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90, July 2009a.

Jonathan Ko and Dieter Fox. Learning GP-BayesFilters via Gaussian process latent variable models. In Proceedings of Robotics: Science and Systems, June 2009b.

Juš Kocijan. Modelling and control of dynamic systems using Gaussian process models. Springer, 2016.

Bojan Likar and Juš Kocijan. Predictive control of a gas-liquid separation plant based on a Gaussian process model. Computers & chemical engineering, 31(3):142–152, 2007.

Lennart Ljung. System identification: Theory for the user. Prentice Hall, 1999.

Alexander G. de G. Matthews. Scalable Gaussian process inference using variational methods. PhD thesis, Cambridge University, 2017.

Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In The 19th International Conference on Artificial Intelligence and Statistics, volume 51, pages 231–239. JMLR Workshop and Conference Proceedings, 2016.

Alexander G. de G. Matthews, Mark van der Wilk, Tom Nickson, Keisuke Fujii, Alexis Boukouvalas, Pablo León-Villagrá, Zoubin Ghahramani, and James Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.

César Lincoln C. Mattos, Zhenwen Dai, Andreas Damianou, Jeremy Forth, Guilherme A. Barreto, and Neil D. Lawrence. Recurrent Gaussian processes. In International Conference on Learning Representations (ICLR), 2015.

Andrew McHutchon and Carl E. Rasmussen. Gaussian process training with input noise. In Advances in Neural Information Processing Systems (NIPS). The MIT Press, 2011.

Roderick Murray-Smith and Agathe Girard. Gaussian process priors with ARMA noise models. In Irish Signals and Systems Conference, pages 147–152, 2001.

Carl E. Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, Cambridge, MA, USA, 2006.

Stephen Roberts, Michael Osborne, Mark Ebden, Steven Reece, Neale Gibson, and Suzanne Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A, 371(1984):20110550, 2013.

Jeff G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. In Advances in Neural Information Processing Systems (NIPS), 1997.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5, pages 567–574, 2009.

Michalis K. Titsias and Neil D. Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume 9 of JMLR W&CP, pages 844–851, 2010.

Michalis K. Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for nonconjugate inference. In Proceedings of the International Conference on Machine Learning (ICML), pages 1971–1979, 2014.

Ryan D. Turner. Gaussian processes for state space models and change point detection. PhD thesis, University of Cambridge, Cambridge, UK, 2011.

Ryan D. Turner, Marc P. Deisenroth, and Carl E. Rasmussen. State-space inference and learning with Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), volume JMLR: W&CP 9, pages 868–875, 2010.

Jack M. Wang, David J. Fleet, and Aaron Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.

A Derivation of the ELBO

This appendix contains three parts: we first explicate the joint distribution of the model and data, $p(\mathbf{X}, f(\cdot), \mathbf{Y})$; then we describe the variational approximation to the model posterior, $q(\mathbf{X}, f(\cdot))$; finally, we show how they combine to produce the ELBO. Table 1 provides some nomenclature.

Table 1: Nomenclature used in this derivation

$t = 0 \ldots T$: time steps, indexed by $t$
$d = 1 \ldots D$: dimension of hidden states $\mathbf{x}_t$, indexed by $d$
$O$: dimension of the observed data
$m = 1 \ldots M$: number of inducing variables, indexed by $m$

$\mathbf{x}_t$: hidden state at time $t$, $\mathbf{x}_t \in \mathbb{R}^D$
$\mathbf{a}_t$: control input (action) at time $t$, $\mathbf{a}_t \in \mathbb{R}^P$
$\tilde{\mathbf{x}}_t$: concatenation of control input and state at $t$
$\mathbf{y}_t$: observation at time $t$, $\mathbf{y}_t \in \mathbb{R}^O$
$\tilde{\mathbf{y}}_t$: concatenation of control input and observation at $t$

$\mathbf{X}$: collection of hidden states, $= [\mathbf{x}_t]_{t=0}^{T}$
$\mathbf{Y}$: collection of observations, $= [\mathbf{y}_t]_{t=0}^{T}$

$\sigma_f^2$: variance of state transition noise
$\sigma_n^2$: variance of observation noise

$f_d(\cdot)$: the $d$th Gaussian process (GP)
$f(\cdot)$: collection of GPs, $= [f_d(\cdot)]_{d=1}^{D}$
$\eta_d(\cdot)$: prior mean function of the $d$th GP
$k_d(\cdot, \cdot)$: prior covariance function of the $d$th GP
$\mu_d(\cdot)$: posterior mean function of the $d$th GP
$v_d(\cdot, \cdot)$: posterior covariance function of the $d$th GP

$\mathbf{Z}$: locations of variational pseudo-inputs
$\mathbf{u}_d$: evaluations of the $d$th GP at the pseudo-inputs, $\mathbf{u}_d = [f_d(\mathbf{z}_m)]_{m=1}^{M}$
$\mathbf{U}$: collection, $\mathbf{U} = [\mathbf{u}_d]_{d=1}^{D}$
$\boldsymbol{\mu}_d$: variational posterior mean of $\mathbf{u}_d$
$\boldsymbol{\Sigma}_d$: variational posterior covariance of $\mathbf{u}_d$

$\mathbf{A}_t$: variational transition matrix of $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$
$\mathbf{L}_t$: triangular square root of the variational covariance of $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$
$\mathbf{m}_t$: variational mean of the marginal $q(\mathbf{x}_t)$
$\mathbf{S}_t$: variational covariance of the marginal $q(\mathbf{x}_t)$

A.1 Model joint distribution

Here we define the joint distribution of the Gaussian processes $f$, the latent states $\mathbf{x}$ and the data $\mathbf{y}$. The Gaussian processes have prior means $\eta_d(\cdot)$ and prior covariances $k_d(\cdot, \cdot)$:

$$p(f_d(\cdot)) = \mathcal{GP}\big(\eta_d(\cdot), k_d(\cdot, \cdot)\big), \quad d = 1 \ldots D. \tag{16}$$

We note that placing a measure $p$ on the function $f$ causes some measure-theoretic discrepancies. Nonetheless, the derivation holds following a more theoretical consideration of the problem [Matthews et al., 2016], and the intuition given by our derivation is correct. The initial state is assumed to be drawn from a standard normal distribution:

$$p(\mathbf{x}_0) = \mathcal{N}(\mathbf{0}, \mathbf{I}_D). \tag{17}$$

The state transition depends on the Gaussian processes:

$$p(\mathbf{x}_t \mid \mathbf{x}_{t-1}, f(\cdot)) = \mathcal{N}\big(\mathbf{x}_t \mid f(\tilde{\mathbf{x}}_{t-1}), \sigma_f^2 \mathbf{I}_D\big). \tag{18}$$

We assume a linear-Gaussian observation model:

$$p(\mathbf{y}_t \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{y}_t \mid \mathbf{W}_g \mathbf{x}_t + \mathbf{b}_g, \sigma_n^2 \mathbf{I}_O\big). \tag{19}$$
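To make the generative process in (17) to (19) concrete, the following minimal NumPy sketch simulates one trajectory. It is an illustration only: a fixed function `f` stands in for a draw from the GP prior, the function name `simulate` and its arguments are hypothetical, and the ordering of state and action inside the concatenated input is an assumption. Observations are generated for t = 1...T, matching the products in the joint density.

```python
import numpy as np

def simulate(f, actions, W_g, b_g, sigma2_f, sigma2_n, rng):
    """Sample one trajectory (X, Y) for a fixed transition function f.

    actions: array of shape (T, P) holding a_0 ... a_{T-1}.
    W_g, b_g: parameters of the linear emission, shapes (O, D) and (O,).
    """
    T = actions.shape[0]
    O, D = W_g.shape
    X = np.empty((T + 1, D))
    X[0] = rng.standard_normal(D)                    # x_0 ~ N(0, I_D), eq. (17)
    for t in range(1, T + 1):
        # Assumed ordering of the concatenation: state first, then action.
        x_tilde = np.concatenate([X[t - 1], actions[t - 1]])
        noise = np.sqrt(sigma2_f) * rng.standard_normal(D)
        X[t] = f(x_tilde) + noise                    # eq. (18)
    obs_noise = np.sqrt(sigma2_n) * rng.standard_normal((T, O))
    Y = X[1:] @ W_g.T + b_g + obs_noise              # eq. (19), for t = 1...T
    return X, Y
```

Setting both noise variances to zero recovers the deterministic rollout of `f`, which is a convenient sanity check before any inference is attempted.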

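The joint density is then

$$p(f, \mathbf{X}, \mathbf{Y}) = \prod_{d=1}^{D} p(f_d(\cdot)) \; p(\mathbf{x}_0) \prod_{t=1}^{T} p(\mathbf{y}_t \mid \mathbf{x}_t) \prod_{t=1}^{T} p(\mathbf{x}_t \mid f, \mathbf{x}_{t-1}). \tag{20}$$

A.2 Approximate posterior distribution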

We will use variational Bayes to approximate the posterior distribution over $f$ and $\mathbf{X}$, whilst simultaneously obtaining a bound on the marginal likelihood (the ELBO), which will be used to train the parameters of the model, including covariance function parameters, noise variances and the parameters $\mathbf{W}_g$, $\mathbf{b}_g$ of the linear output mapping.

The posterior over Gaussian processes takes the form of a sparse GP. We introduce a series of $M$ variational inducing points $\mathbf{Z} = [\mathbf{z}_m]_{m=1}^{M}$, which lie in the same domain as $\tilde{\mathbf{x}}$. Following convention, the values of the $d$th function at those points are denoted $\mathbf{u}_d = [f_d(\mathbf{z}_m)]_{m=1}^{M}$. Note that the variables $\mathbf{u}$ are not auxiliary variables, but part of the original model specification, being part of the GP. We assume a variational posterior of the form

$$q(\mathbf{U}) = \prod_{d=1}^{D} \mathcal{N}\big(\mathbf{u}_d \mid \boldsymbol{\mu}_d, \boldsymbol{\Sigma}_d\big). \tag{21}$$

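The remainder of the GPs conditioned on $\mathbf{u}$ are assumed to take the same form as the GP prior conditional. That is,

$$q(f_d(\cdot) \mid \mathbf{u}_d) = p(f_d(\cdot) \mid \mathbf{u}_d) = \mathcal{GP}\big(\eta_d(\cdot) + k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} (\mathbf{u}_d - \boldsymbol{\eta}_d(\mathbf{Z})),\; k(\cdot, \cdot) - k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} k(\mathbf{Z}, \cdot)\big). \tag{22}$$

Marginalising with respect to $\mathbf{u}_d$ leads to our approximation to the GP:

$$q(f_d(\cdot)) = \mathcal{GP}\big(\mu_d(\cdot), v_d(\cdot, \cdot)\big), \tag{23}$$

with

$$\mu_d(\cdot) = \eta_d(\cdot) + k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} (\boldsymbol{\mu}_d - \boldsymbol{\eta}_d(\mathbf{Z})), \tag{24}$$

$$v_d(\cdot, \cdot) = k(\cdot, \cdot) - k(\cdot, \mathbf{Z}) \mathbf{K}_{zz}^{-1} [\mathbf{K}_{zz} - \boldsymbol{\Sigma}_d] \mathbf{K}_{zz}^{-1} k(\mathbf{Z}, \cdot). \tag{25}$$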
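The posterior mean and covariance in (24) and (25) translate directly into a few lines of linear algebra. The sketch below is a minimal NumPy illustration, not the GPflow implementation used in our experiments: the helper names `rbf` and `sparse_gp_posterior`, the choice of an RBF kernel, the default mean function (the first input coordinate) and the jitter value are all assumptions made for the example.

```python
import numpy as np

def rbf(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel, a stand-in for any valid kernel k."""
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sparse_gp_posterior(Xnew, Z, mu, Sigma, kernel=rbf,
                        mean_fn=lambda X: X[:, 0], jitter=1e-8):
    """Evaluate mu_d(Xnew) and v_d(Xnew, Xnew) as in eqs. (24)-(25).

    mu, Sigma: variational mean (M,) and covariance (M, M) of u_d.
    mean_fn:   prior mean eta_d; the default is an assumed placeholder.
    """
    Kzz = kernel(Z, Z) + jitter * np.eye(len(Z))   # jitter for numerical stability
    Kxz = kernel(Xnew, Z)
    Kxx = kernel(Xnew, Xnew)
    # A = k(., Z) Kzz^{-1}; Kzz is symmetric, so solve against Kxz^T and transpose.
    A = np.linalg.solve(Kzz, Kxz.T).T
    mean = mean_fn(Xnew) + A @ (mu - mean_fn(Z))   # eq. (24)
    cov = Kxx - A @ (Kzz - Sigma) @ A.T            # eq. (25)
    return mean, cov
```

Two limiting cases are useful checks: with $\boldsymbol{\mu}_d = \boldsymbol{\eta}_d(\mathbf{Z})$ and $\boldsymbol{\Sigma}_d = \mathbf{K}_{zz}$ the prior is recovered, and with $\boldsymbol{\Sigma}_d = \mathbf{0}$ the posterior variance at the inducing points vanishes.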

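The approximation to the posterior over state trajectories is given a Gauss-Markov structure of the form

$$q(\mathbf{X}) = q(\mathbf{x}_0) \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1}), \tag{26}$$

where

$$q(\mathbf{x}_0) = \mathcal{N}(\mathbf{x}_0 \mid \mathbf{m}_0, \mathbf{L}_0 \mathbf{L}_0^\top), \tag{27}$$

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t \mid \mathbf{A}_t \mathbf{x}_{t-1}, \mathbf{L}_t \mathbf{L}_t^\top). \tag{28}$$

The complete set of variational parameters is then $\mathbf{Z}$, $\{\boldsymbol{\mu}_d, \boldsymbol{\Sigma}_d\}_{d=1}^{D}$, $\mathbf{m}_0$, $\mathbf{L}_0$, $\{\mathbf{A}_t, \mathbf{L}_t\}_{t=1}^{T}$. The parameters of $q(\mathbf{X})$ are reconfigured to be the output of an RNN recognition model (see main text), whilst we optimise the parameters controlling $f(\cdot)$ directly.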
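The Gauss-Markov structure in (26) to (28) makes both sampling and marginalisation simple forward recursions. The following NumPy sketch illustrates this under stated assumptions: the function names `sample_q` and `marginals` are hypothetical, the variational parameters are passed in as plain arrays rather than produced by a recognition network, and `A[t-1]`, `L[t-1]` hold the parameters of time step t.

```python
import numpy as np

def sample_q(m0, L0, A, L, rng, n_samples=1):
    """Reparameterised samples from q(X); A and L have shape (T, D, D).

    Each sample is a deterministic function of standard normal noise, so
    gradients could flow through it to the variational parameters.
    """
    T, D = A.shape[0], m0.shape[0]
    eps = rng.standard_normal((n_samples, T + 1, D))
    X = np.empty((n_samples, T + 1, D))
    X[:, 0] = m0 + eps[:, 0] @ L0.T                              # eq. (27)
    for t in range(1, T + 1):
        X[:, t] = X[:, t - 1] @ A[t - 1].T + eps[:, t] @ L[t - 1].T  # eq. (28)
    return X

def marginals(m0, L0, A, L):
    """Means and covariances of the marginals q(x_t), t = 0...T."""
    means, covs = [m0], [L0 @ L0.T]
    for t in range(A.shape[0]):
        means.append(A[t] @ means[-1])
        covs.append(A[t] @ covs[-1] @ A[t].T + L[t] @ L[t].T)
    return np.stack(means), np.stack(covs)
```

The marginal recursion follows from (28) by the usual rules for linear transformations of Gaussians; the sampler is the form needed for Monte Carlo estimates of expectations under $q(\mathbf{X})$.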

The joint posterior then factors as

$$q(f(\cdot), \mathbf{X}) = \prod_{d=1}^{D} q(f_d(\cdot)) \, q(\mathbf{X}). \tag{29}$$

A.3 The ELBO

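Having specified the forms of the model and the approximate posterior, we are ready to derive the ELBO. Following the standard variational Bayes methods, we write

$$\text{ELBO} = \mathbb{E}_{q(\mathbf{X}) q(f(\cdot))}\left[\log \frac{p(\mathbf{Y} \mid \mathbf{X})\, p(\mathbf{X} \mid f(\cdot))\, p(f(\cdot))}{q(\mathbf{X})\, q(f(\cdot))}\right]. \tag{30}$$

We will split the ELBO into four parts, dealing with each in turn:

$$\text{ELBO} = \underbrace{\mathbb{E}_{q(\mathbf{X})}\big[\log p(\mathbf{Y} \mid \mathbf{X})\big]}_{\text{part 1}} + \underbrace{\mathbb{E}_{q(\mathbf{X}) q(f(\cdot))}\big[\log p(\mathbf{X} \mid f(\cdot))\big]}_{\text{part 2}} \underbrace{{} - \mathbb{E}_{q(\mathbf{X})}\big[\log q(\mathbf{X})\big]}_{\text{part 3}} + \underbrace{\mathbb{E}_{q(f(\cdot))}\left[\log \frac{p(f(\cdot))}{q(f(\cdot))}\right]}_{\text{part 4}}. \tag{31}$$

Part 1 This expression can be computed straightforwardly in closed form due to our choice of a linear-Gaussian emission $g(\mathbf{x})$. Let $\mathbf{m}_t$, $\mathbf{S}_t$ be the mean and covariance of the marginal $q(\mathbf{x}_t)$ (Table 1), computed via the recursion $\mathbf{m}_t = \mathbf{A}_t \mathbf{m}_{t-1}$, $\mathbf{S}_t = \mathbf{A}_t \mathbf{S}_{t-1} \mathbf{A}_t^\top + \mathbf{L}_t \mathbf{L}_t^\top$ implied by (28), and recall the form of the linear emission function $g(\mathbf{x}_t) = \mathbf{W}_g \mathbf{x}_t + \mathbf{b}_g$:

$$\mathbb{E}_{q(\mathbf{X})}\big[\log p(\mathbf{Y} \mid \mathbf{X})\big] = \mathbb{E}_{q(\mathbf{X})}\left[\sum_{t=1}^{T} \log \mathcal{N}\big(\mathbf{y}_t \mid g(\mathbf{x}_t), \sigma_n^2 \mathbf{I}_O\big)\right] \tag{32}$$

$$= \sum_{t=1}^{T} \mathbb{E}_{q(\mathbf{x}_t)}\Big[\log \mathcal{N}\big(\mathbf{y}_t \mid \mathbf{W}_g \mathbf{x}_t + \mathbf{b}_g, \sigma_n^2 \mathbf{I}_O\big)\Big] \tag{33}$$

$$= \sum_{t=1}^{T} \left[\log \mathcal{N}\big(\mathbf{y}_t \mid \mathbf{W}_g \mathbf{m}_t + \mathbf{b}_g, \sigma_n^2 \mathbf{I}_O\big) - \frac{1}{2\sigma_n^2} \operatorname{tr}\big(\mathbf{W}_g^\top \mathbf{W}_g \mathbf{S}_t\big)\right]. \tag{34}$$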
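Each summand of (34) is the Gaussian log-density evaluated at the marginal mean, minus a trace correction for the marginal covariance. The sketch below shows one such term in NumPy; it is an illustration only, and the function name `expected_log_lik` and its argument names are hypothetical.

```python
import numpy as np

def expected_log_lik(y, m, S, W, b, sigma2_n):
    """One term of eq. (34): E[log N(y | W x + b, sigma2_n I)] for x ~ N(m, S).

    y: observation (O,); m, S: mean (D,) and covariance (D, D) of q(x_t);
    W, b: linear emission parameters; sigma2_n: observation noise variance.
    """
    O = y.shape[0]
    resid = y - W @ m - b
    # Log-density of the observation evaluated at the marginal mean.
    log_density_at_mean = (-0.5 * O * np.log(2.0 * np.pi * sigma2_n)
                           - 0.5 * resid @ resid / sigma2_n)
    # Correction for the uncertainty in x_t.
    trace_term = 0.5 * np.trace(W.T @ W @ S) / sigma2_n
    return log_density_at_mean - trace_term
```

Because the log-density is quadratic in $\mathbf{x}_t$, the result is exact; no sampling is needed for this part of the bound.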

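In practice we defer this simple computation to the variational_expectations functionality in GPflow [Matthews et al., 2017].

Part 2 This expression cannot be computed in closed form without restriction to the RBF kernel as in [Frigola-Alcade, 2015]. We eliminate the integral with respect to $f$ here, and then use the reparameterisation trick to estimate the integral with respect to $\mathbf{X}$ (see main text).

$$\text{part 2} = \mathbb{E}_{q(\mathbf{X}) q(f(\cdot))}\big[\log p(\mathbf{X} \mid f(\cdot))\big] \tag{35}$$

$$= \mathbb{E}_{q(\mathbf{X}) q(f(\cdot))}\left[\log \Big(p(\mathbf{x}_0) \prod_{t=1}^{T} \mathcal{N}\big(\mathbf{x}_t \mid f(\tilde{\mathbf{x}}_{t-1}), \sigma_f^2 \mathbf{I}_D\big)\Big)\right] \tag{36}$$

$$= \mathbb{E}_{q(\mathbf{x}_0)}\big[\log p(\mathbf{x}_0)\big] + \mathbb{E}_{q(\mathbf{X}) q(f(\cdot))}\left[\sum_{t=1}^{T} \sum_{d=1}^{D} \log \mathcal{N}\big(x_t^{(d)} \mid f_d(\tilde{\mathbf{x}}_{t-1}), \sigma_f^2\big)\right] \tag{37}$$

$$= \mathbb{E}_{q(\mathbf{x}_0)}\big[\log p(\mathbf{x}_0)\big] + \mathbb{E}_{q(\mathbf{X})}\left[\sum_{t=1}^{T} \sum_{d=1}^{D} \Big(\log \mathcal{N}\big(x_t^{(d)} \mid \mu_d(\tilde{\mathbf{x}}_{t-1}), \sigma_f^2\big) - \tfrac{1}{2} \sigma_f^{-2} v_d(\tilde{\mathbf{x}}_{t-1}, \tilde{\mathbf{x}}_{t-1})\Big)\right], \tag{38}$$

which matches the term in the main text. The step from (37) to (38) uses the fact that, for a Gaussian variable $f \sim \mathcal{N}(\mu, v)$, $\mathbb{E}[\log \mathcal{N}(x \mid f, \sigma^2)] = \log \mathcal{N}(x \mid \mu, \sigma^2) - \tfrac{1}{2} \sigma^{-2} v$.

Part 3 This corresponds to the entropy of $q(\mathbf{X})$. It is straightforward to derive:

$$-\mathbb{E}_{q(\mathbf{X})}\big[\log q(\mathbf{X})\big] = H[q(\mathbf{X})] = \frac{(T+1)D}{2} \log(2\pi e) + \sum_{t=0}^{T} \log(\det(\mathbf{L}_t)). \tag{39}$$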
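The entropy in (39) depends only on the triangular factors of the conditional covariances, not on the means or transition matrices. A minimal NumPy sketch follows; the function name `entropy_q` is hypothetical, and the code assumes that each factor is triangular, so that its determinant is the product of its diagonal (the absolute value guards against negative diagonal entries).

```python
import numpy as np

def entropy_q(L0, L):
    """Entropy of the Gauss-Markov q(X) as in eq. (39).

    L0: triangular factor of the initial covariance, shape (D, D).
    L:  triangular factors of the conditional covariances, shape (T, D, D).
    """
    factors = np.concatenate([L0[None], L], axis=0)   # shape (T + 1, D, D)
    n_steps, D = factors.shape[0], factors.shape[1]
    # log|det L_t| of a triangular matrix is the sum of log|diagonal entries|.
    diagonals = np.diagonal(factors, axis1=1, axis2=2)
    log_dets = np.sum(np.log(np.abs(diagonals)))
    return 0.5 * n_steps * D * np.log(2.0 * np.pi * np.e) + log_dets
```

The same value is obtained by building the full joint covariance of the trajectory and evaluating the usual Gaussian entropy, but the expression above costs only a sum over diagonals.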

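Part 4 This final part is the negative Kullback-Leibler divergence between the (approximate) posterior and prior GPs. We first note that it can be written as a sum across dimensions $d$, and then that each GP $f_d(\cdot)$ can be factored into two parts: $p(f_d(\cdot) \mid \mathbf{u}_d)\, p(\mathbf{u}_d)$, and similarly for $q$. This results in

$$\mathbb{E}_{q(f(\cdot))}\left[\log \frac{p(f(\cdot))}{q(f(\cdot))}\right] = \sum_{d=1}^{D} \mathbb{E}_{q(f_d(\cdot))}\left[\log \frac{p(f_d(\cdot))}{q(f_d(\cdot))}\right] \tag{40}$$

$$= \sum_{d=1}^{D} \mathbb{E}_{q(f_d(\cdot) \mid \mathbf{u}_d) q(\mathbf{u}_d)}\left[\log \frac{p(f_d(\cdot) \mid \mathbf{u}_d)\, p(\mathbf{u}_d)}{q(f_d(\cdot) \mid \mathbf{u}_d)\, q(\mathbf{u}_d)}\right]. \tag{41}$$

Since we have defined the posterior conditional $q(f_d(\cdot) \mid \mathbf{u}_d)$ to match the prior conditional, the two terms cancel, resulting in

$$\mathbb{E}_{q(f(\cdot))}\left[\log \frac{p(f(\cdot))}{q(f(\cdot))}\right] = \sum_{d=1}^{D} \mathbb{E}_{q(\mathbf{u}_d)}\left[\log \frac{p(\mathbf{u}_d)}{q(\mathbf{u}_d)}\right] \tag{42}$$

$$= -\sum_{d=1}^{D} \mathrm{KL}\big[q(\mathbf{u}_d) \,\|\, p(\mathbf{u}_d)\big]. \tag{43}$$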

Since the result is a Kullback-Leibler divergence between two finite-dimensional normal distributions, it is computed straightforwardly. Although this notation is somewhat sloppy (since the sets of variables fd (·) and ud overlap), the result is correct. Matthews [2017] contains a more careful and significantly more technical derivation.
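For reference, each summand of (43) is the standard closed-form divergence between two multivariate normals, here $q(\mathbf{u}_d) = \mathcal{N}(\boldsymbol{\mu}_d, \boldsymbol{\Sigma}_d)$ and $p(\mathbf{u}_d) = \mathcal{N}(\boldsymbol{\eta}_d(\mathbf{Z}), \mathbf{K}_{zz})$. The NumPy sketch below illustrates the computation; the function name `kl_gauss` is hypothetical, and a production implementation would typically work with Cholesky factors throughout.

```python
import numpy as np

def kl_gauss(mu_q, Sigma_q, mu_p, Sigma_p):
    """KL[N(mu_q, Sigma_q) || N(mu_p, Sigma_p)] for M-dimensional Gaussians."""
    M = mu_q.shape[0]
    Lp = np.linalg.cholesky(Sigma_p)
    # Mahalanobis term: (mu_p - mu_q)^T Sigma_p^{-1} (mu_p - mu_q).
    diff = np.linalg.solve(Lp, mu_p - mu_q)
    mahalanobis = diff @ diff
    # Trace term: tr(Sigma_p^{-1} Sigma_q).
    trace_term = np.trace(np.linalg.solve(Sigma_p, Sigma_q))
    # Log-determinants of both covariances.
    logdet_p = 2.0 * np.sum(np.log(np.diag(Lp)))
    logdet_q = np.linalg.slogdet(Sigma_q)[1]
    return 0.5 * (trace_term + mahalanobis - M + logdet_p - logdet_q)
```

Part 4 of the ELBO is then the negative sum of `kl_gauss` over the $D$ output dimensions, with `mu_p` and `Sigma_p` given by the prior mean and kernel matrix at the inducing inputs.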

B Full visualisation of synthetic 1D dataset

[Figure 5 panels: RBF, RBF + Matern, Arc-cosine, MGP. Axes: $x_t$ (horizontal) against $x_{t+1}$ (vertical). Legend: GP posterior, ground truth.]

Figure 5: Visualisation of the learned GP transition functions across a greater domain of the function. It can be seen that all models revert to the mean function (defined as the identity function) away from the data. The short lengthscales of the Arc-cosine and MGP (compounded with a Matern kernel) that are used to fit the kink of the true transition function mean that they almost instantaneously revert to the mean function. The longer length scales of the RBF-containing kernels mean that we revert much more slowly to the mean.

C Modelling double-pendulum dynamics

[Figure 6 panels: 2 episodes, 8 episodes, 14 episodes. Curves: inner angle, outer angle, inner torque, outer torque. x-axis: time step.]

Figure 6: Predicting the chaotic behaviour of the inner and outer pendulum’s angle from the double pendulum dataset by applying the control signals of the testing episode to sampled future trajectories from the proposed GPSSM. Learning of the dynamics is demonstrated with observed (upper row) and hidden (lower row) angular velocities and with increasing number of training episodes. Ground truth is denoted with the marked lines.

D Learnt latent states for cart-pole

Below we provide the learnt latent states for the cart-pole dataset with observed and hidden velocities. It is worth noting that the model has recovered similar structure for both cases.

[Figure 7 panels: Episode1 to Episode15. x-axis ticks from 0 to 40.]

Figure 7: Learnt latent states for the cart-pole dataset with observed velocities.

[Figure 8 panels: Episode1 to Episode15. x-axis ticks from 0 to 40.]

Figure 8: Learnt latent states for the cart-pole dataset with hidden velocities.

E Cart-pole training data fitting

Below we provide detailed fittings on the training episodes for the cart-pole dataset.

[Figure 9 panels: Episode0 to Episode14. Curves: cart position, angle. x-axis ticks from 0 to 40.]

Figure 9: Detailed fittings per episode for the cart-pole dataset with observed velocities.

[Figure 10 panels: Episode0 to Episode14. Curves: cart position, angle. x-axis ticks from 0 to 40.]

Figure 10: Detailed fittings per episode for the cart-pole dataset with hidden velocities.

F Lagged action cart-pole result

[Figure 11 plot: includes the control signal; x-axis: time step.]

Figure 11: Results using lagged inputs on the cart-pole data
