A VARIATIONAL APPROACH TO PATH ESTIMATION AND PARAMETER INFERENCE OF HIDDEN DIFFUSION PROCESSES∗

arXiv:1508.00506v1 [math.OC] 3 Aug 2015

TOBIAS SUTTER†, ARNAB GANGULY‡, AND HEINZ KOEPPL§

Abstract. We consider a hidden Markov model, where the signal process, given by a diffusion, is only indirectly observed through some noisy measurements. The article develops a variational method for approximating the hidden states of the signal process given the full set of observations. This, in particular, leads to systematic approximations of the smoothing densities of the signal process. The paper then demonstrates how an efficient inference scheme, based on this variational approach to the approximation of the hidden states, can be designed to estimate the unknown parameters of stochastic differential equations. Two examples at the end illustrate the efficacy and the accuracy of the presented method.

Key words. Variational inference, stochastic differential equations, diffusion processes, hidden Markov model, optimal control

AMS subject classifications. 62M05, 60J60, 60H10, 49J15

1. Introduction. Diffusion processes modeled by stochastic differential equations (SDEs) appear in several disciplines ranging from mathematical finance to systems biology. For example, in systems biology stochastic differential equations are used for efficient modeling of the states of the chemical species in a reaction system when they are present in high abundance [30]. Oftentimes, the state of the system or the signal process is not directly observed, and inference of the state trajectories and parameters of the system has to be achieved based on noisy partial observations. Typically, in such a scenario, the observation data are conveniently modeled as a function of the hidden state corrupted with independent additive noise. However, generalizations of this basic setup, which, for example, could include stronger coupling between the hidden signal and the observation processes, are often used for modeling more complex phenomena. In such a model, optimal filtering theory concerns itself with recurrent estimation of the current state of the hidden signal process given the observation data up to the present time. This is particularly useful in tracking problems, where the estimate of the current location of an object needs to be constantly updated as new noisy information flows in. On the other hand, optimal smoothing involves the class of methods which can be used to reconstruct any past state of the signal process given a set of measurements up to the present time. More specifically, given the signal process $X$ and the observation process $Y$, filtering theory entails computation of conditional expectations of the form $\mathbb{E}[\phi(X_t)\mid\mathcal{F}_t^Y]$, where $\{\mathcal{F}_t^Y\}$ denotes the filtration generated by the process $Y$. The $\sigma$-algebra $\mathcal{F}_t^Y$ contains all the information about the observation process $Y$ up to the present time $t$. Smoothing, however, involves evaluation of conditional expectations of the form $\mathbb{E}[\phi(X_s)\mid\mathcal{F}_t^Y]$, where $s < t$. The smoothing techniques can also be viewed as tools for estimating the current state given a data set which includes future observations. This interpretation is particularly relevant in statistics, where such techniques are essentially the means

∗ This work was supported by the ETH grant (ETH-15 12-2).
† Department of Electrical Engineering and Information Technology, ETH Zurich, Switzerland ([email protected]).
‡ Department of Mathematics, Louisiana State University, USA ([email protected]).
§ Department of Electrical Engineering and Information Technology, TU Darmstadt, Germany ([email protected]).


of computing certain posterior conditional densities given the observation set. The present article focusses on a variational approach to this smoothing problem and later employs the method for the estimation of parameters of diffusion processes.

Evaluation of such conditional expectations or densities is quite difficult, since they are often solutions of suitable (stochastic) partial differential equations. These are usually infinite-dimensional problems and analytical solutions are generally impossible. Hence, effort has been directed toward developing a variety of numerical schemes for efficient approximation of these conditional densities. While Markov chain Monte Carlo methods for inference use a discretization of the given SDE for writing down an approximate likelihood [1, 19, 24], particle methods approximate the (posterior) conditional densities by suitably weighted point masses [5, 11, 13]. However, these methods often rely on a suitable discretization of the problem which is mostly done in an ad-hoc way. Since a theoretical framework for obtaining approximations is not present, the approximation error might be difficult to quantify. In contrast, the present paper focusses on a variational approach to this estimation problem. The main idea in such a method is to approximate the (posterior) conditional probability distribution of the system's state (given the observed data) by an appropriate Gaussian distribution, where the optimal parameters for the Gaussian distribution are obtained by minimizing the relative entropy (or Kullback-Leibler distance) between the posterior process and a suitable approximating SDE. Earlier works like [2-4] considered the case when the signal process is modeled by an SDE with a constant diffusion term. The advantage of working with a constant diffusion term is that the approximating SDE then simply has a linear drift, so that its marginals are Gaussian. This simple expression of the SDE with a linear drift makes the subsequent optimization problem for finding suitable parameters for the approximating SDE easier. However, since most physical phenomena cannot be realistically modeled by SDEs with a constant diffusion term, there is a pressing need for extending the approach to general SDEs. One natural but naive approach in this regard could be to freeze the diffusion term at an appropriate value, that is, to take the zeroth-order expansion of the diffusion coefficient. Although simple to implement, the efficacy of this method is not guaranteed by theoretical results and will vary from case to case, and a reasonable error analysis might require unreasonably restrictive conditions on the model. Instead, the present article delves much deeper into the problem and develops methods for finding the optimal approximating SDE such that the relative entropy between it and the true posterior process is minimized, subject to the condition that the marginals of the former follow Gaussian distributions. The main obstacle that needs to be overcome in this approach stems from the fact that, unlike in the previous case, the approximating SDE here cannot be taken to be the one with a linear drift, and a suitable expression for it needs to be found so that the marginals are still Gaussian. This has been achieved in Theorem 4.5. In fact, our work outlines the most general techniques for approximating the posterior density by any density from the exponential family or a mixture of exponential families.

In this connection we would like to note that the reason for requiring that the marginals follow a Gaussian distribution, or more generally a distribution from the exponential family, is that this results in a finite-dimensional smoother which can be used for approximating a wide range of distributions. It should be noted that the variational method considered here is different from the so-called extended Kalman filter (EKF) in two ways: first, the EKF is employed


for filtering problems; but more importantly, the EKF starts by linearizing the signal (prior) SDE and then freezing its diffusion term, while the variational approach is concerned with the approximation of the posterior SDE. Therefore, even though in the constant-diffusion-term case the approximating SDE happens to have a linear drift, thus resulting in a Gaussian smoother, it is not based on the same philosophy as the EKF. And as mentioned before, in the non-constant diffusion term case, although our method can be used to obtain a finite-dimensional smoother, in particular a Gaussian smoother, it completely avoids any form of linearization of the given SDE or subsequent freezing of the diffusion term.

In our paper this variational approximation method has been formulated as an optimal control problem. The advantage of this theoretical framework is that necessary conditions for global optimality are then obtained by employing the Pontryagin maximum principle. This leads to considerable computational advantages of the variational method compared to numerically solving the underlying (stochastic) PDEs, which is highlighted by two examples.

The later part of the paper focusses on the important topic of parameter inference of SDEs. The above scheme of estimating the hidden states and the smoothing densities is cleverly used in designing an efficient method for estimating parameters of SDEs. In particular, the paper proposes an iterative EM-type algorithm which aims to compute approximate maximum likelihood estimates of the parameters in a tractable way. Two illustrative examples, which are important in mathematical finance, demonstrate the accuracy and efficiency of the proposed algorithms. Future projects will address more complicated models.

The layout of this article is as follows: In Section 2 we formally introduce the problem setting. The variational approximation idea is motivated in Section 3, leading to a specific class of optimization problems that is addressed in Section 4. It is then reformulated in Section 5 as an optimal control problem and necessary conditions for optimality are derived. Section 6 explains how the variational approximation can be used to infer unknown parameters of the model. Section 7 discusses the presented variational approximation in the context of a discrete time measurement model. The theoretical results are applied in Section 8 to two examples: a geometric Brownian motion and the Cox-Ingersoll-Ross process. We finally conclude with some remarks and directions for future work in Section 9. Certain technical proofs are relegated to the appendix.

Notation. Hereafter, $I_n$ is the $n$-dimensional identity matrix and $E_i$ is the $n\times n$ matrix whose $ii$-th entry is one and whose other entries are zero. We let $\mathrm{Sym}(n,\mathbb{R})$ and $\mathrm{GL}(n,\mathbb{R})$ be, respectively, the sets of symmetric and invertible $n\times n$ matrices with real entries. For matrices $A,B\in\mathbb{R}^{n\times n}$, let $\langle A,B\rangle := \mathrm{tr}(A^\top B)$ denote the Frobenius inner product. For a vector $b\in\mathbb{R}^n$ and a positive definite matrix $A$, we employ the norm $\|b\|_A := \sqrt{b^\top A^{-1}b}$. We define the standard $n$-simplex as $\Delta_n := \{x\in\mathbb{R}^n : x\ge 0,\ \sum_{i=1}^n x_i = 1\}$. Let $\mathcal{C} := C([0,T],\mathbb{R}^n)$ denote the space of continuous functions on $[0,T]$ taking values in $\mathbb{R}^n$. Let $S$ be a metric space, equipped with its Borel $\sigma$-field $\mathcal{B}(S)$. The space of all probability measures on $(S,\mathcal{B}(S))$ is denoted by $\mathcal{P}(S)$. The relative entropy (or Kullback-Leibler divergence) between any two probability measures $\mu,\nu\in\mathcal{P}(S)$ is defined as
$$ D(\mu\|\nu) := \begin{cases} \int \log\big(\tfrac{d\mu}{d\nu}\big)\,d\mu, & \text{if } \mu\ll\nu, \\ +\infty, & \text{otherwise,} \end{cases} $$
where $\ll$ denotes absolute continuity of measures and $\tfrac{d\mu}{d\nu}$ is the Radon-Nikodym derivative. The relative entropy is always nonnegative, and is equal to zero if and only if $\mu\equiv\nu$. By convention, measurable means Borel-measurable in the sequel. Given an $S$-valued random variable $X$ with $\mathrm{Law}(X)=\mu\in\mathcal{P}(S)$, let $\mathbb{E}_\mu[X]$ denote the expectation of $X$.

2. Model setup. As usual, we will work on a complete probability space $(\Omega,\mathcal{F},\mathbb{P})$ equipped with a filtration $\{\mathcal{F}_t\}$ satisfying the usual conditions, that is, $\{\mathcal{F}_t\}$ is complete, right continuous and contains all the $\mathbb{P}$-null sets. The basic objects in our study consist of a signal process $X$ and an observation process $Y$, both of which are assumed to be $\{\mathcal{F}_t\}$-adapted. The unobserved signal process $X$ is modeled by the following stochastic differential equation describing the state evolution of a dynamical system:
$$ dX_t = f(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad X_0 = x_0, \qquad 0\le t\le T, \tag{2.1} $$

where $f:\mathbb{R}^n\to\mathbb{R}^n$, $\sigma:\mathbb{R}^n\to\mathbb{R}^{n\times n}$, and $W$ is an $n$-dimensional Brownian motion independent of $x_0$. The observation process $Y$ is modeled as noisy measurements of some function of the signal process $X$. Mathematically, $Y$ is defined as
$$ Y_t = \int_0^t h(X_s)\,ds + B_t, \tag{2.2} $$
where $h:\mathbb{R}^n\to\mathbb{R}^n$ is called the observation function and $B$ is an $n$-dimensional Brownian motion independent of $x_0$ and $W$.

Assumption 2.1. We stipulate that (i) $f$ and $\sigma$ are globally Lipschitz; and (ii) $h$ is twice continuously differentiable.

It is known [17] that under Assumption 2.1 there exists a unique strong solution to the SDE (2.1). Given the observed data up to some time $T$, $\{Y_s : s\le T\}$, the goal of the paper is to outline an approximation method for the smoothing density $P_S(x,t)$, which is the conditional probability density of $X_t$ given $\{Y_s : s\le T\}$. In other words, the smoothing density is defined by the equation
$$ \mathbb{E}\big[\phi(X_t)\mid\mathcal{F}_T^Y\big] = \int \phi(x)\,P_S(x,t)\,dx, \tag{2.3} $$
up to a.s. equivalence, where $\phi$ is any bounded measurable function from $\mathbb{R}^n$ to $\mathbb{R}$ and $\{\mathcal{F}_t^Y\}$ denotes the filtration generated by the process $Y$. More generally, we will be interested in approximating the full conditional probability measure on the path space $\mathcal{C}\equiv C([0,T],\mathbb{R}^n)$. To describe this mathematically, assume that a regular conditional probability measure $\mathbb{P}(\cdot\mid\mathcal{F}_T^Y)$ is chosen. Then there exists a measurable probability kernel $y\in\mathcal{C}\mapsto\Pi_{\mathrm{post}}(\cdot,y)\in\mathcal{P}(\mathcal{C})$ such that for any measurable set $A\subset\mathcal{C}$,
$$ \mathbb{P}\big(X_{[0,T]}\in A\mid\mathcal{F}_T^Y\big) = \Pi_{\mathrm{post}}(A, Y_{[0,T]}). $$
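To make the setup concrete, the pair (2.1)-(2.2) can be simulated directly with an Euler-Maruyama scheme. The sketch below is purely illustrative and not part of the method developed in this paper; the functions f, sigma, h and all numerical values in the usage example are hypothetical placeholders.

```python
import numpy as np

def simulate_hidden_diffusion(f, sigma, h, x0, T, n_steps, seed=0):
    """Euler-Maruyama simulation of the signal (2.1) and the observation (2.2).

    f, h  : callables R^n -> R^n; sigma : callable returning an n x n matrix.
    Returns the time grid, the signal path X and the observation path Y.
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    n = x0.shape[0]
    t = np.linspace(0.0, T, n_steps + 1)
    X = np.zeros((n_steps + 1, n)); X[0] = x0
    Y = np.zeros((n_steps + 1, n))          # Y_0 = 0 by definition of (2.2)
    for k in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=n)   # signal noise increment
        dB = rng.normal(scale=np.sqrt(dt), size=n)   # observation noise increment
        X[k + 1] = X[k] + f(X[k]) * dt + sigma(X[k]) @ dW
        Y[k + 1] = Y[k] + h(X[k]) * dt + dB
    return t, X, Y

# Hypothetical one-dimensional example: mean-reverting signal, identity observation.
t, X, Y = simulate_hidden_diffusion(
    f=lambda x: -x,
    sigma=lambda x: 0.5 * np.eye(1),
    h=lambda x: x,
    x0=np.array([1.0]), T=1.0, n_steps=1000)
```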

Given the observation process up to time $T$, $Y_{[0,T]}$, we now describe a characterization of the probability measure $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$, which will play a pivotal role for our purposes. The probability measure $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$ is actually the distribution of a diffusion process $\bar{X}^T$ on $\mathcal{C}$, and the latter is obtained by a modification of the original signal process $X$:
$$ d\bar{X}_t^T = g(\bar{X}_t^T,t)\,dt + \sigma(\bar{X}_t^T)\,d\bar{W}_t, \qquad \bar{X}_0^T = x_0, \tag{2.4} $$
where $\bar{W}$ is an $\{\mathcal{F}_t\}$-adapted Brownian motion that is independent of $Y$. Notice that the diffusion coefficient of the above SDE (which we will henceforth call the posterior SDE or posterior diffusion) is the same as that of the original SDE, and the drift of this posterior SDE is time-dependent and is obtained as
$$ g(x,t) := f(x) + a(x)\nabla\log v(x,t), \tag{2.5} $$
where $a(x) := \sigma(x)\sigma(x)^\top$. We give details about the (random) function $v$ a little later, but the important point to note here is that the new drift function is the old drift function with an extra additive term, and the observation process $Y_{[0,T]}$ enters into the characterization of $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$ only through $v$.

To see this characterization of $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$, we first look at the usual filtering density $P_F(x,t)$, which is naturally defined by
$$ \mathbb{E}\big[\phi(X_t)\mid\mathcal{F}_t^Y\big] = \int \phi(x)\,P_F(x,t)\,dx. \tag{2.6} $$
Under suitable technical conditions, the filter density $P_F$ satisfies the Kushner-Stratonovich equation (for example, see [5, 18, 28]). For our purposes, however, it is convenient to work with the unnormalized filter density $p(x,t)$, that is, $P_F(x,t) = p(x,t)\big(\int_{\mathbb{R}^n} p(x,t)\,dx\big)^{-1}$, which satisfies the so-called Zakai equation [31]
$$ \begin{cases} dp(x,t) = \mathcal{A}^* p(x,t)\,dt + p(x,t)\,h(x)^\top dY_t \\ p(x,0) = p_0(x). \end{cases} \tag{2.7} $$
Here $p_0$ denotes the density of $x_0$ and $\mathcal{A}^*$ is the adjoint of the infinitesimal generator of the process $X$, given by $\mathcal{A}\psi(x) = \sum_i f_i(x)\frac{\partial}{\partial x_i}\psi(x) + \frac{1}{2}\sum_{i,j} a_{i,j}(x)\frac{\partial^2}{\partial x_i\partial x_j}\psi(x)$ for $\psi\in C_0^2(\mathbb{R}^n,\mathbb{R})$. We next consider the backward stochastic partial differential equation (SPDE)
$$ \begin{cases} dv(x,t) = -\mathcal{A}v(x,t)\,dt - v(x,t)\,h(x)^\top dY_t \\ v(x,T) = 1. \end{cases} \tag{2.8} $$
Conditions for the existence of solutions to (2.7) and (2.8) can be found in [25]. It is well known [25, Corollary 3.8] that the smoothing density can be expressed as
$$ P_S(x,t) = \frac{p(x,t)\,v(x,t)}{\int_{\mathbb{R}^n} p(x,t)\,v(x,t)\,dx}. \tag{2.9} $$
Now by using (2.7), (2.8) and (2.9), it can be shown that the smoothing density solves the following Kolmogorov forward equation (see Appendix A for a detailed derivation)
$$ \Big(\frac{\partial}{\partial t} + \sum_i \frac{\partial}{\partial x_i}\,g_i(x,t) - \frac{1}{2}\sum_{i,j}\frac{\partial^2}{\partial x_i\partial x_j}\,a_{ij}(x)\Big)P_S(x,t) = 0, \tag{2.10} $$
with the drift term $g$ defined by (2.5). In other words, the conditional probability measure $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$ on $\mathcal{C}$ is induced by the diffusion process $\bar{X}^T$ as defined in (2.4).

Evaluating $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$ is what is known as the path estimation problem. Except for a few simple cases, the SPDEs that are involved in this estimation of the


hidden path, are analytically intractable. The variational approach that we undertake in this paper actually has the goal of approximating $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$. Toward this end, a natural objective is to approximate $\Pi_{\mathrm{post}}(\cdot,Y_{[0,T]})$ by a probability measure such that the corresponding marginals of the latter come from a known family of distributions (e.g., the exponential family). As a result, the marginal of this approximating probability measure at time $t$ approximates the smoothing density $P_S(x,t)$. The procedure adopted in this article involves finding the optimal parameters of this approximating distribution by minimizing the relative entropy between the posterior distribution and the approximating one.

3. Variational approximation: Motivation. Let $\Pi_{\mathrm{prior}}$ denote the distribution of the original signal process $X$ on $\mathcal{C}$, that is, for a measurable $A\subset\mathcal{C}$, $\Pi_{\mathrm{prior}}(A) \equiv \mathbb{P}\big(X_{[0,T]}\in A\big)$. Define the two terms
$$ H_T(X_{[0,T]},y) := -h(X_T)^\top y_T + \int_0^T y_s^\top dh(X_s) + \frac{1}{2}\int_0^T \|h(X_s)\|^2\,ds, \tag{3.1} $$
$$ I(H_T(\cdot,y)) := -\log\int \exp\big(-H_T(\cdot,y)\big)\,d\Pi_{\mathrm{prior}}. \tag{3.2} $$
Let $y$ be a sample path of the observation process $Y$ on the interval $[0,T]$. Then notice that by the pathwise Kallianpur-Striebel formula (or the Bayes formula), we have
$$ \frac{d\Pi_{\mathrm{post}}(\cdot,y)}{d\Pi_{\mathrm{prior}}} = \frac{\exp(-H_T(\cdot,y))}{\int \exp(-H_T(\cdot,y))\,d\Pi_{\mathrm{prior}}} = \frac{\exp(-H_T(\cdot,y))}{L(y)}, $$
where $L(y) = \int \exp(-H_T(\cdot,y))\,d\Pi_{\mathrm{prior}}$. Consequently, $L(y)$ can be interpreted naturally as the likelihood of the path $y$, or equivalently, $I(H_T(\cdot,y))$ is viewed as the negative log-likelihood of the sample path $y$. Now for any probability measure $Q$ on $C([0,T],\mathbb{R})$ (which will be called the approximating probability measure in the sequel), the relative entropy between $Q$ and $\Pi_{\mathrm{post}}(\cdot,y)$ can be expressed by the following lemma.

Lemma 3.1. $D\big(Q\|\Pi_{\mathrm{post}}(\cdot,y)\big) = -I(H_T(\cdot,y)) + D\big(Q\|\Pi_{\mathrm{prior}}\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big]$.

Proof. The proof essentially follows the one in [29, Lemma 2.2.1]. Splitting the relative entropy and using the pathwise Kallianpur-Striebel formula yields
$$ \begin{aligned} D\big(Q\|\Pi_{\mathrm{prior}}\big) &= \int\Big(\log\frac{dQ}{d\Pi_{\mathrm{post}}(\cdot,y)} + \log\frac{d\Pi_{\mathrm{post}}(\cdot,y)}{d\Pi_{\mathrm{prior}}}\Big)\,dQ \\ &= D\big(Q\|\Pi_{\mathrm{post}}(\cdot,y)\big) + \int\log\frac{d\Pi_{\mathrm{post}}(\cdot,y)}{d\Pi_{\mathrm{prior}}}\,dQ \\ &= D\big(Q\|\Pi_{\mathrm{post}}(\cdot,y)\big) + \int\log\frac{\exp(-H_T(\cdot,y))}{\int \exp(-H_T(\cdot,y))\,d\Pi_{\mathrm{prior}}}\,dQ \\ &= D\big(Q\|\Pi_{\mathrm{post}}(\cdot,y)\big) - \mathbb{E}_Q\big[H_T(\cdot,y)\big] - \log\int\exp(-H_T(\cdot,y))\,d\Pi_{\mathrm{prior}}. \end{aligned} $$

Mitter and Newton [22] provide an information-theoretic interpretation of this result. They interpret the term (3.2) as the total information available to the estimator $Q$ through the sample path $y$. On the other hand, they call the quantity $\mathcal{F}(Q,y) := D\big(Q\|\Pi_{\mathrm{prior}}\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big]$ the apparent information of the estimator.


By non-negativity of the relative entropy, $\mathcal{F}(Q,y)\ge I(H_T(\cdot,y))$ with equality if and only if $Q = \Pi_{\mathrm{post}}(\cdot,y)$. In this sense, a suboptimal estimator appears to have access to more information than is actually available. Since the total information $I(H_T(\cdot,y))$ does not depend on $Q$, minimizing the relative entropy between $Q$ and $\Pi_{\mathrm{post}}(\cdot,y)$ over a class of probability measures $Q$ is equivalent to minimizing the apparent information $\mathcal{F}(Q,y)$. This motivates considering an approximating distribution $Q$ on $\mathcal{C}$ that is characterized as the solution to the following optimization problem:

Problem 3.2. Minimize the objective function $D\big(Q\|\Pi_{\mathrm{prior}}\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big]$ subject to
(i) $Q$ is a probability distribution induced by an SDE of the form
$$ dZ_t = u(Z_t,t)\,dt + \sigma(Z_t)\,dW_t, \qquad Z_0 = x_0, \qquad 0\le t\le T; \tag{3.3} $$
(ii) the marginals of $Q$ at time $t$, i.e., the distribution of $Z_t$, belong to a chosen family of distributions.

We will show in the remainder of this article how Problem 3.2 can be restated as an optimal control problem, which leads to a standard formulation of necessary optimality conditions in terms of Pontryagin's maximum principle. Note that the objective function of Problem 3.2 is known to be strictly convex with respect to $Q$ [12]. The constraint (ii) restricts the feasible set of approximating distributions $Q$ to a convex set. It is, however, also coupled with the first constraint (i), which parametrizes the feasible set of distributions in terms of the drift function $u$. This coupling is investigated in Section 4; in particular, Theorem 4.5 characterizes the set of all drift terms $u$ such that the distribution induced by (3.3) has finite-dimensional marginals that belong to a given family of distributions. Hence, Problem 3.2 can alternatively be interpreted as minimizing the objective function over a class of drift functions $u$ that induce $Q$ via (3.3) and such that $Q$ satisfies constraint (ii). For example, if the goal is to approximate the posterior distribution $\Pi_{\mathrm{post}}$ by a distribution $Q$ whose marginals are normal distributions, then one aims to find a drift term $u$ such that the objective function is minimized and such that the solution $Z_t$ to (3.3) admits a normal distribution.

Remark 3.3. We discuss the behaviour of Problem 3.2 when some of the constraints are removed.
1. If the constraint (ii) is omitted, i.e., the relaxed optimization problem of minimizing $D\big(Q\|\Pi_{\mathrm{prior}}\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big]$ subject to $Q$ being a probability distribution induced by (3.3) is considered, then a necessary condition for optimality (derived from the Euler-Lagrange equations) is given by (2.5); see [3] for details.
2. If both constraints (i) and (ii) are removed, i.e., we consider the problem of minimizing $D\big(Q\|\Pi_{\mathrm{prior}}\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big]$ subject to $Q$ being a probability distribution on $C([0,T],\mathbb{R})$, then the unique optimizer is given by $\Pi_{\mathrm{post}}$, which follows directly from Lemma 3.1.

The objective function in Problem 3.2, in particular the relative entropy between the approximating distribution $Q$ and the prior distribution $\Pi_{\mathrm{prior}}$, can be simplified, since due to the constraint (i) the underlying SDEs (3.3) and (2.1) share the same diffusion coefficient. In view of (3.3) and (2.1), consider the two SDEs
$$ dX_t = f(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad X_0 = x_0, \quad 0\le t\le T, $$
$$ dZ_t = u(Z_t,t)\,dt + \sigma(Z_t)\,dW_t, \qquad Z_0 = x_0, \quad 0\le t\le T, $$
with $u:\mathbb{R}^n\times\mathbb{R}\to\mathbb{R}^n$, $f:\mathbb{R}^n\to\mathbb{R}^n$, $\sigma:\mathbb{R}^n\to\mathbb{R}^{n\times n}$, $W$ an $n$-dimensional Brownian motion independent of $x_0$, and both SDEs satisfying Assumption 2.1. Let $(\Omega,\mathcal{F}_T,P)$ be a probability space, where $\mathcal{F}_T$ is the sigma algebra $\sigma(W_s : s\le T)$, and let $\Pi_{\mathrm{prior}}$ and $Q$ denote the laws of $X_t$ and $Z_t$ with respect to $P$. It follows by Girsanov's Theorem [23] that
$$ \mathbb{E}_Q\Big[\log\frac{dQ}{d\Pi_{\mathrm{prior}}}\Big] = \mathbb{E}_Q\Big[\frac{1}{2}\int_0^T v(s,\omega)^\top v(s,\omega)\,ds\Big], $$
where $v(s,\omega) = \sigma(Z_s(\omega))^{-1}\big(u(X_s(\omega),s) - f(X_s(\omega))\big)$. Therefore, the relative entropy between $Q$ and $\Pi_{\mathrm{prior}}$ is
$$ D\big(Q\|\Pi_{\mathrm{prior}}\big) = \mathbb{E}_Q\Big[\frac{1}{2}\int_0^T \|u(X_s,s)-f(X_s)\|^2_{a(X_s)}\,ds\Big], $$
where $\|u(x,s)-f(x)\|^2_{a(x)} := (u(x,s)-f(x))^\top a(x)^{-1}(u(x,s)-f(x))$. Hence, the objective function in Problem 3.2 can be expressed as
$$ \begin{aligned} D\big(Q\|\Pi_{\mathrm{prior}}\big) & + \mathbb{E}_Q\big[H_T(\cdot,y)\big] \\ &= \mathbb{E}_Q\Big[\int_0^T\Big(\frac{1}{2}\|u(X_t,t)-f(X_t)\|^2_{a(X_t)} + y_t^\top\Big(\nabla h(X_t)\,u(X_t,t) + \frac{1}{2}\sigma(X_t)^\top\nabla^2 h(X_t)\sigma(X_t)\Big) + \frac{1}{2}\|h(X_t)\|^2\Big)dt\Big] - y_T^\top\,\mathbb{E}_Q\big[h(X_T)\big], \end{aligned} \tag{3.4} $$
where the last equality is due to Fubini's Theorem and Itô's Lemma. The two coupling constraints (i) and (ii) in Problem 3.2 are studied in the next section and will finally allow us to reformulate Problem 3.2 as an optimal control problem.
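To illustrate the Girsanov identity above, the relative entropy term $D(Q\|\Pi_{\mathrm{prior}})$ can be estimated by Monte Carlo, simulating paths under the drift $u$ and averaging the weighted drift mismatch. This is only a sketch for the scalar case; the drifts, diffusion and numerical values are hypothetical, and the time discretization introduces an additional bias.

```python
import numpy as np

def relative_entropy_mc(u, f, sigma, x0, T, n_steps, n_paths, seed=0):
    """Monte Carlo estimate of D(Q || Pi_prior) via the Girsanov formula:
    E_Q[ (1/2) * int_0^T ||u(Z_s, s) - f(Z_s)||^2_{a(Z_s)} ds ],
    with paths of Z simulated under the drift u (Euler-Maruyama, 1-D case)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    total = 0.0
    for _ in range(n_paths):
        z, acc = x0, 0.0
        for k in range(n_steps):
            s = k * dt
            a = sigma(z) ** 2                      # a(z) = sigma(z)^2 in 1-D
            acc += 0.5 * (u(z, s) - f(z)) ** 2 / a * dt
            z += u(z, s) * dt + sigma(z) * rng.normal(scale=np.sqrt(dt))
        total += acc
    return total / n_paths

# Illustrative (hypothetical) choice of prior drift f and candidate drift u.
est = relative_entropy_mc(u=lambda x, t: -2.0 * x, f=lambda x: -x,
                          sigma=lambda x: 0.3, x0=1.0,
                          T=1.0, n_steps=200, n_paths=500)
print(est)
```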

4. Multi-dimensional SDE with prescribed marginal law. This section establishes conditions on the drift function in the approximate SDE (3.3) such that the induced marginal distributions evolve in a given exponential family.

Definition 4.1 (Exponential family). Let $\mathcal{H}_1,\ldots,\mathcal{H}_m$ be Hilbert spaces and let $\mathcal{H} = \prod_{i=1}^m\mathcal{H}_i$ be endowed with the inner product $\langle\cdot,\cdot\rangle$. Let the functions $c_i:\mathbb{R}^n\to\mathcal{H}_i$ for $i=1,\ldots,m$ be linearly independent, have at most polynomial growth, be twice continuously differentiable, and denote $c(x) = (c_1(x),\ldots,c_m(x))$. Assume that the convex set
$$ \Gamma := \Big\{\Theta\in\mathcal{H} : \psi(\Theta) = \log\int\exp\big(\langle\Theta,c(x)\rangle\big)\,dx < \infty\Big\} $$
has non-empty interior. Then
$$ \mathrm{EM}(c) = \{p(\cdot,\Theta),\ \Theta\in\Lambda\}, \qquad p(x,\Theta) := \exp\big(\langle\Theta,c(x)\rangle - \psi(\Theta)\big), $$
where $\Lambda\subseteq\Gamma$ is open, is called an exponential family of probability densities.


Definition 4.2 (Mixture of exponential families). Let $\mathrm{EM}(c^{(i)})$ for $i=1,\ldots,k$ be exponential families according to Definition 4.1. Then
$$ \mathrm{EM}(c^{(1)},\ldots,c^{(k)}) = \Big\{\sum_{\ell=1}^k \nu_\ell\,p_\ell(\cdot,\Theta^{(\ell)}) : p_\ell(\cdot,\Theta^{(\ell)})\in\mathrm{EM}(c^{(\ell)}),\ \nu\in\Delta_k\Big\} $$
is called a mixture of $k$ exponential families of probability densities.

Consider the stochastic differential equation
$$ dX_t = u(X_t,t)\,dt + \sigma(X_t)\,dW_t, \qquad X_0 = x_0, \qquad 0\le t\le T, \tag{4.1} $$

where $u:\mathbb{R}^n\times\mathbb{R}\to\mathbb{R}^n$, $\sigma:\mathbb{R}^n\to\mathbb{R}^{n\times d}$ and $W$ is a $d$-dimensional Brownian motion independent of $x_0$.

Assumption 4.3.
1. The SDE (4.1) satisfies Assumption 2.1.
2. The initial condition $x_0$ has a density $p_0$ that is absolutely continuous with respect to the Lebesgue measure and has finite moments of any order.
3. The unique solution $X_t$ to (4.1) admits a density $p(x,t)$ that is absolutely continuous with respect to the Lebesgue measure and that satisfies the Kolmogorov forward equation.

Problem 4.4. Let there be given a mixture of exponential families $\mathrm{EM}(c^{(1)},\ldots,c^{(k)})$, an initial density $p_0$ contained in $\mathrm{EM}(c^{(1)},\ldots,c^{(k)})$, a diffusion term $\sigma$, and let $a(\cdot) := \sigma(\cdot)\sigma(\cdot)^\top$. Let $\mathcal{U}(x_0,\sigma)$ denote the set of all drifts $u$ such that $x_0$, $u$, $\sigma$ and the related SDE (4.1) satisfy Assumption 4.3. Assume $\mathcal{U}(x_0,\sigma)$ to be non-empty. Then, given a curve $t\mapsto p(\cdot,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})$ in $\mathrm{EM}(c^{(1)},\ldots,c^{(k)})$, find a drift in $\mathcal{U}(x_0,\sigma)$ whose related SDE has a solution with marginal density $p(\cdot,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})$.

Theorem 4.5. Given the assumptions and notation of Problem 4.4, consider the SDE (4.1) with drift term
$$ u_i(x,t) = \frac{1}{2}\sum_{j=1}^n \frac{\partial}{\partial x_j}a_{ij}(x) + \frac{1}{2}\sum_{j=1}^n a_{ij}(x)\,\frac{\frac{\partial}{\partial x_j}p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})}{p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})} - \frac{1}{p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})}\sum_{\ell=1}^k \nu_\ell\,p_\ell(x,\Theta_t^{(\ell)})\,\Big\langle\dot{\Theta}_t^{(\ell)},\,I_i^{(\ell)}(x)\Big\rangle, $$
for $i=1,\ldots,n$, where
$$ I_i^{(\ell)}(x) := \int_{-\infty}^{x_i}\varphi_i^{(\ell)}\big((x_{-i},\xi_i),\Theta_t^{(\ell)}\big)\exp\Big(\big\langle\Theta_t^{(\ell)},\,c^{(\ell)}(x_{-i},\xi_i) - c^{(\ell)}(x)\big\rangle\Big)\,d\xi_i, \tag{4.2} $$
$(x_{-i},\xi_i) := (x_1,\ldots,x_{i-1},\xi_i,x_{i+1},\ldots,x_n)^\top$ and the functions $\varphi_i^{(\ell)}:\mathbb{R}^n\times\mathcal{H}\to\mathcal{H}$ for all $\ell=1,\ldots,k$ satisfy
$$ \sum_{i=1}^n\Big\langle\dot{\Theta}_t^{(\ell)},\,\varphi_i^{(\ell)}\big((x_{-i},\xi_i),\Theta_t^{(\ell)}\big)\Big\rangle\Big|_{\xi_i=x_i} = \Big\langle\dot{\Theta}_t^{(\ell)},\,c^{(\ell)}(x) - \nabla_\Theta\psi_\ell(\Theta_t^{(\ell)})\Big\rangle. \tag{4.3} $$
If $u\in\mathcal{U}(x_0,\sigma)$, then the SDE (4.1) solves Problem 4.4, i.e., $X_t$ has a density
$$ p_{X_t}(x) = \sum_{\ell=1}^k \nu_\ell\exp\Big(\big\langle\Theta_t^{(\ell)},c^{(\ell)}(x)\big\rangle - \psi_\ell(\Theta_t^{(\ell)})\Big), \qquad \text{for all } t\le T. $$


The proof is provided in Appendix B.

Remark 4.6.
1. For the non-mixture and one-dimensional case ($k=n=1$), the result is known [7] and coincides with Theorem 4.5. Furthermore, it can be seen from the proof in [7], and by invoking the existence and uniqueness theorem for ODEs, that the drift function $u$ is uniquely determined.
2. For the multi-dimensional case ($n>1$), the drift function is no longer unique, as there exist multiple choices for $\varphi_i^{(\ell)}$ (for example, $\varphi_i^{(\ell)}(x,\Theta_t^{(\ell)}) := \delta_{ij}\big(c^{(\ell)}(x) - \nabla_\Theta\psi_\ell(\Theta_t^{(\ell)})\big)$ for all $j\in\{1,\ldots,n\}$ are feasible choices, as they satisfy (4.3)). This gives rise to the natural question of whether there exists a particular choice of $\varphi_i^{(\ell)}$ such that the integral terms $I_i^{(\ell)}$ in (4.2) admit closed-form expressions. In Section 4.1 (Proposition 4.7), we derive such functions $\varphi_i^{(\ell)}$ for the mixture of multivariate normal densities.
3. In a non-mixture setting ($k=1$), the drift function simplifies to
$$ u_i(x,t) = \frac{1}{2}\sum_{j=1}^n\frac{\partial}{\partial x_j}a_{i,j}(x) + \frac{1}{2}\sum_{j=1}^n a_{i,j}(x)\Big\langle\Theta_t,\frac{\partial c(x)}{\partial x_j}\Big\rangle - \Big\langle\dot{\Theta}_t,\,\int_{-\infty}^{x_i}\varphi_i\big((x_{-i},\xi_i),\Theta_t\big)\exp\big(\langle\Theta_t,c(x_{-i},\xi_i)-c(x)\rangle\big)\,d\xi_i\Big\rangle, $$
where the functions $\varphi_i$ have to satisfy (4.3).

As remarked, the drift term proposed in Theorem 4.5 consists of the integral terms (4.2), which depend on the particular exponential families considered. In the following, we restrict ourselves to the mixture of multivariate normal densities and show that these integral terms, and hence the drift function, admit a closed-form expression.

4.1. Mixture of multivariate normal densities. Consider the family of multivariate Gaussian distributions with mean $m\in\mathbb{R}^n$ and covariance matrix $S\in\mathrm{Sym}(n,\mathbb{R})$, which can be expressed in terms of Definition 4.1 as follows. Let the Hilbert space $\mathcal{H} = \mathbb{R}^n\times\mathbb{R}^{n\times n}$ be endowed with the inner product $\langle(a,A),(b,B)\rangle = a^\top b + \mathrm{tr}(A^\top B)$ and define
$$ \Theta = (\eta,\theta) := \Big(S^{-1}m,\ -\frac{1}{2}S^{-1}\Big)\in\mathcal{H}, \qquad c:\mathbb{R}^n\to\mathcal{H},\quad c(x) = (x, xx^\top), \tag{4.4} $$
$$ \psi:\mathcal{H}\to\mathbb{R}, \qquad \psi(\Theta) = -\frac{1}{4}\mathrm{tr}(\eta\eta^\top\theta^{-1}) + \frac{1}{2}\log\det\Big(-\frac{1}{2}\theta^{-1}\Big) + \frac{n}{2}\log(2\pi). $$
A direct computation, using $\mathrm{tr}(\eta\eta^\top\theta^{-1}) = \eta^\top\theta^{-1}\eta$, leads to
$$ p(x,\Theta) = \exp\big(\langle c(x),\Theta\rangle - \psi(\Theta)\big) = \frac{1}{(2\pi)^{n/2}(\det S)^{1/2}}\exp\Big(-\frac{1}{2}(x-m)^\top S^{-1}(x-m)\Big). $$
We point out again that for the proposed variational method it is favourable if the approximating SDE (3.3) has a drift function that admits a closed-form expression. Furthermore, since the drift function is not unique (cf. Remark 4.6), among all feasible solutions characterized by the $\varphi_i^{(\ell)}$ functions, we want to find one that can be computed analytically. The latter turns out to be a difficult task and depends heavily on the specific exponential family chosen. From now on, we consider the exponential family of multivariate normal probability densities given by (4.4). In this setting, it is possible to find functions $\varphi_i^{(\ell)}$ such that the integral terms (4.2), and therefore the drift function, can be computed in closed form.

Proposition 4.7. For the mixture of multivariate normal densities, one possible choice for the drift function proposed by Theorem 4.5 is
$$ u(x,t) = \frac{1}{2}\,\mathrm{div}\,a(x) + \frac{\sum_{\ell=1}^k \nu_\ell\,p_\ell(x,\Theta_t^{(\ell)})\Big(\frac{1}{4}\theta_t^{(\ell)-1}\dot{\theta}_t^{(\ell)}\theta_t^{(\ell)-1}\eta_t^{(\ell)} - \frac{1}{2}\theta_t^{(\ell)-1}\dot{\eta}_t^{(\ell)} - \frac{1}{2}\theta_t^{(\ell)-1}\dot{\theta}_t^{(\ell)}x + a(x)\big(\frac{1}{2}\eta_t^{(\ell)} + \theta_t^{(\ell)}x\big)\Big)}{p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})}. $$
The proof is provided in Appendix C.

Remark 4.8. For the non-mixture setting the drift term simplifies to
$$ u(x,t) = \frac{1}{2}\,\mathrm{div}\,a(x) + \frac{1}{4}\theta_t^{-1}\dot{\theta}_t\theta_t^{-1}\eta_t - \frac{1}{2}\theta_t^{-1}\dot{\eta}_t - \frac{1}{2}\theta_t^{-1}\dot{\theta}_t x + a(x)\Big(\frac{1}{2}\eta_t + \theta_t x\Big), $$

which in the special case of a constant diffusion term is a linear function, as one would expect.

We introduce the following ansatz for the drift function
$$ u(x,t) = \frac{1}{2}\,\mathrm{div}\,a(x) + \frac{\sum_{\ell=1}^k \nu_\ell\,p_\ell(x,\Theta_t^{(\ell)})\Big(A_t^{(\ell)} + B_t^{(\ell)}x + a(x)\big(C_t^{(\ell)} + D_t^{(\ell)}x\big)\Big)}{p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})}, \tag{4.5} $$
where $B_t^{(\ell)}, D_t^{(\ell)}\in\mathbb{R}^{n\times n}$ and $A_t^{(\ell)}, C_t^{(\ell)}\in\mathbb{R}^n$ for all $\ell=1,\ldots,k$. The coefficients $A_t^{(\ell)}$, $B_t^{(\ell)}$, $C_t^{(\ell)}$ and $D_t^{(\ell)}$ cannot be chosen arbitrarily; they are coupled according to Proposition 4.7. By comparing the coefficients of Proposition 4.7 and (4.5) one gets
$$ A_t^{(\ell)} = \frac{1}{4}\theta_t^{(\ell)-1}\dot{\theta}_t^{(\ell)}\theta_t^{(\ell)-1}\eta_t^{(\ell)} - \frac{1}{2}\theta_t^{(\ell)-1}\dot{\eta}_t^{(\ell)}, \qquad B_t^{(\ell)} = -\frac{1}{2}\theta_t^{(\ell)-1}\dot{\theta}_t^{(\ell)}, \qquad C_t^{(\ell)} = \frac{1}{2}\eta_t^{(\ell)}, \qquad D_t^{(\ell)} = \theta_t^{(\ell)}. $$
Hence, one directly sees that the four parameters $A_t^{(\ell)}$, $B_t^{(\ell)}$, $C_t^{(\ell)}$ and $D_t^{(\ell)}$ for all $\ell=1,\ldots,k$ are coupled via the two ODEs
$$ \frac{dC_t^{(\ell)}}{dt} = -D_t^{(\ell)}A_t^{(\ell)} - B_t^{(\ell)\top}C_t^{(\ell)}, \tag{4.6} $$
$$ \frac{dD_t^{(\ell)}}{dt} = -2D_t^{(\ell)}B_t^{(\ell)}. \tag{4.7} $$
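The coupling between the coefficient identification above and the ODEs (4.6)-(4.7) can be checked numerically in the scalar, non-mixture case. The Gaussian path $(m_t, S_t)$ below is an arbitrary, hypothetical example used only for this sanity check.

```python
import numpy as np

# Hypothetical smooth Gaussian path (mean m_t, variance S_t), 1-D case.
def m(t):  return 1.0 + 0.5 * t
def S(t):  return 0.2 + 0.1 * t**2

def eta(t):   return m(t) / S(t)          # eta_t   = S_t^{-1} m_t
def theta(t): return -0.5 / S(t)          # theta_t = -S_t^{-1} / 2

def num_diff(g, t, h=1e-6):               # central finite difference
    return (g(t + h) - g(t - h)) / (2 * h)

def coeffs(t):
    """Coefficients A_t, B_t, C_t, D_t from the comparison of (4.5) with
    Proposition 4.7 (scalar, non-mixture case)."""
    th, et = theta(t), eta(t)
    thd, etd = num_diff(theta, t), num_diff(eta, t)
    A = 0.25 * thd * et / th**2 - 0.5 * etd / th
    B = -0.5 * thd / th
    C = 0.5 * et
    D = th
    return A, B, C, D

t0 = 0.7
A, B, C, D = coeffs(t0)
dC = num_diff(lambda t: coeffs(t)[2], t0)
dD = num_diff(lambda t: coeffs(t)[3], t0)
print(np.isclose(dC, -D * A - B * C, atol=1e-4))   # ODE (4.6)
print(np.isclose(dD, -2 * D * B, atol=1e-4))       # ODE (4.7)
```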

Note that the parametrization introduced in (4.5) provides relatively simple expressions for the mean and variance of the variational approximation derived in the next section (Section 4.2). In the authors' opinion, this parametrization therefore helps to keep the notation simple.
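As a further sanity check, the natural parametrization (4.4) of the Gaussian family can be verified numerically against the usual density formula. The test point and moments below are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Check that p(x, Theta) = exp(<Theta, c(x)> - psi(Theta)) with Theta as in (4.4)
# reproduces the usual N(m, S) density.  Values of m, S and x are arbitrary.
m = np.array([0.5, -1.0])
S = np.array([[1.0, 0.3], [0.3, 2.0]])
Sinv = np.linalg.inv(S)
eta, theta = Sinv @ m, -0.5 * Sinv                      # Theta = (eta, theta)

def psi(eta, theta):
    n = eta.size
    return (-0.25 * np.trace(np.outer(eta, eta) @ np.linalg.inv(theta))
            + 0.5 * np.log(np.linalg.det(-0.5 * np.linalg.inv(theta)))
            + 0.5 * n * np.log(2 * np.pi))

x = np.array([1.0, 0.2])
inner = eta @ x + np.sum(theta * np.outer(x, x))        # <Theta, c(x)>, c(x) = (x, x x^T)
print(np.isclose(np.exp(inner - psi(eta, theta)),
                 multivariate_normal(mean=m, cov=S).pdf(x)))
```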


4.2. Equations for mean and variance. Theorem 4.5 provides an explicit formula for the drift term in the approximating SDE (4.1), which simplifies to (4.5) in the case of multivariate normal marginal densities. Therefore, the mean and variance of the approximating SDE (4.1) are characterized via the following two ODEs.

Theorem 4.9. Consider the SDE (4.1) with drift term $u$ given by (4.5), such that the solution $X_t$ has a marginal density $p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})\in\mathrm{EM}(c^{(1)},\ldots,c^{(k)})$ that is an arbitrary convex combination of densities $p_\ell(x,\Theta_t^{(\ell)})\in\mathrm{EM}(c^{(\ell)})$ for $\ell=1,\ldots,k$. Let $m_t^{(\ell)}$ and $S_t^{(\ell)}$ denote the mean and variance of $X_t$ with respect to $p_\ell(x,\Theta_t^{(\ell)})$. Then,
$$ \frac{dm_t^{(\ell)}}{dt} = \frac{1}{2}\mathbb{E}_{p_\ell}\big[\mathrm{div}\,a(X)\big] + A_t^{(\ell)} + B_t^{(\ell)}m_t^{(\ell)} + \mathbb{E}_{p_\ell}\big[a(X)\big]\,C_t^{(\ell)} + \mathbb{E}_{p_\ell}\big[a(X)D_t^{(\ell)}X\big] \tag{4.8} $$
and
$$ \begin{aligned} \frac{dS_t^{(\ell)}}{dt} = {} & \frac{1}{2}\mathbb{E}_{p_\ell}\big[X\,\mathrm{div}\,a(X)^\top\big] + \frac{1}{2}\mathbb{E}_{p_\ell}\big[\mathrm{div}\,a(X)\,X^\top\big] - \frac{1}{2}m_t^{(\ell)}\,\mathbb{E}_{p_\ell}\big[\mathrm{div}\,a(X)\big]^\top \\ & - \frac{1}{2}\mathbb{E}_{p_\ell}\big[\mathrm{div}\,a(X)\big]\,m_t^{(\ell)\top} + \mathbb{E}_{p_\ell}\big[a(X)\big] + S_t^{(\ell)}B_t^{(\ell)\top} + B_t^{(\ell)}S_t^{(\ell)} \\ & + \mathbb{E}_{p_\ell}\big[X C_t^{(\ell)\top}a(X)\big] + \mathbb{E}_{p_\ell}\big[a(X)C_t^{(\ell)}X^\top\big] - m_t^{(\ell)}C_t^{(\ell)\top}\,\mathbb{E}_{p_\ell}\big[a(X)\big] \\ & - \mathbb{E}_{p_\ell}\big[a(X)\big]\,C_t^{(\ell)}m_t^{(\ell)\top} + \mathbb{E}_{p_\ell}\big[X X^\top D_t^{(\ell)}a(X)\big] + \mathbb{E}_{p_\ell}\big[a(X)D_t^{(\ell)}X X^\top\big] \\ & - m_t^{(\ell)}\,\mathbb{E}_{p_\ell}\big[X^\top D_t^{(\ell)}a(X)\big] - \mathbb{E}_{p_\ell}\big[a(X)D_t^{(\ell)}X\big]\,m_t^{(\ell)\top}. \end{aligned} \tag{4.9} $$
The proof is provided in Appendix D. Note that given $m_t^{(\ell)}$ and $S_t^{(\ell)}$, the mean and variance of $X_t$ can be expressed as $m_t = \sum_{\ell=1}^k \nu_\ell m_t^{(\ell)}$ and $S_t = \sum_{\ell=1}^k \nu_\ell S_t^{(\ell)} + \sum_{\ell=1}^k \nu_\ell m_t^{(\ell)}m_t^{(\ell)\top} - \big(\sum_{\ell=1}^k \nu_\ell m_t^{(\ell)}\big)\big(\sum_{\ell=1}^k \nu_\ell m_t^{(\ell)}\big)^\top$, respectively.
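The combination formula above for the overall mixture mean and variance translates directly into code; the small helper below is a straightforward sketch (the example weights and moments are arbitrary).

```python
import numpy as np

def mixture_mean_cov(nu, means, covs):
    """Combine component means m^(l) and covariances S^(l) of a mixture with
    weights nu into the overall mean m_t and covariance S_t as stated above."""
    nu = np.asarray(nu, dtype=float)
    means = np.asarray(means, dtype=float)      # shape (k, n)
    covs = np.asarray(covs, dtype=float)        # shape (k, n, n)
    m = np.einsum("l,li->i", nu, means)
    second = (np.einsum("l,lij->ij", nu, covs)
              + np.einsum("l,li,lj->ij", nu, means, means))
    return m, second - np.outer(m, m)

# Arbitrary two-component example in dimension 2.
m_t, S_t = mixture_mean_cov([0.3, 0.7],
                            [[0.0, 1.0], [2.0, -1.0]],
                            [np.eye(2), 0.5 * np.eye(2)])
```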

Remark 4.10. If the coefficients $\nu_i$ in the convex combination of the marginal density $p(x,\Theta_t^{(1)},\ldots,\Theta_t^{(k)})$ in Theorem 4.9 are fixed a priori, the ODEs (4.8) and (4.9) are only sufficient for describing $m_t^{(\ell)}$ and $S_t^{(\ell)}$. Oftentimes, however, one is interested in choosing those coefficients a posteriori, for example by solving an auxiliary optimization problem, which is the setting of Theorem 4.9, and then the ODEs are necessary and sufficient.

We have studied how to reformulate the constraints (i) and (ii) of Problem 3.2 by deriving an expression for the drift term of the approximating SDE (3.3). In the case that the marginals in (ii) are restricted to a mixture of multivariate normal densities, this reformulation reduces to the ODEs (4.6), (4.7), (4.8) and (4.9).

5. Optimal control problem formulation. In this section, we show that Problem 3.2, using the results derived from Theorem 4.5, can be reformulated as a standard optimal control problem (OCP), which is conceptually similar to [22] (note that [22] addresses a related problem, whose main difference compared to the presented method is that the variational characterization considered there is exact). Therefore, the presented variational approximation method for the path estimation problem for SDEs can be expressed as an OCP and as such leads to a standard formulation of necessary global optimality conditions in terms of Pontryagin's maximum principle. Consider the vector spaces $\hat{\mathcal{V}} := \mathbb{R}^n\times\mathbb{R}^{n\times n}$, $\hat{\mathcal{Z}} := \mathbb{R}^n\times\mathrm{Sym}(n,\mathbb{R})\times\mathbb{R}^n\times\mathrm{Sym}(n,\mathbb{R})$ and define the trajectories
$$ [0,T]\ni t\mapsto v^{(\ell)}(t) := (A_t^{(\ell)}, B_t^{(\ell)})\in\hat{\mathcal{V}}, \qquad [0,T]\ni t\mapsto z^{(\ell)}(t) := (m_t^{(\ell)}, S_t^{(\ell)}, C_t^{(\ell)}, D_t^{(\ell)})\in\hat{\mathcal{Z}}, $$

for $\ell=1,\ldots,k$. We introduce the state variable $z(t) := \big(z^{(1)}(t),\ldots,z^{(k)}(t)\big)\in\prod_{\ell=1}^k\hat{\mathcal{Z}} =: \mathcal{Z}$ and the control variable $v(t) := \big(v^{(1)}(t),\ldots,v^{(k)}(t)\big)\in\prod_{\ell=1}^k\hat{\mathcal{V}} =: \mathcal{V}$ for $t\in[0,T]$.

As a first step, in view of the cost functional (3.4) of Problem 3.2, the so-called Lagrangian
$$ \mathbb{E}_Q\Big[\frac{1}{2}\|u(X_t,t)-f(X_t)\|^2_{a(X_t)} + y_t^\top\Big(\nabla h(X_t)\,u(X_t,t) + \frac{1}{2}\sigma(X_t)^\top\nabla^2 h(X_t)\sigma(X_t)\Big) + \frac{1}{2}\|h(X_t)\|^2\Big] \tag{5.1} $$
is expressed as a function of only $z(t)$, $v(t)$ and $t$. This step, while being exact in many cases, may require an approximation. In the case that the marginals of $Q$ are mixtures of normal densities, the expectation of any polynomial in $X_t$ can be expressed as a function of its mean and variance. If the diffusion term $\sigma$ is a polynomial, and no mixture is considered ($k=1$), the drift function $u$, according to (4.5), is a polynomial. In case the argument in (5.1) is not a polynomial, but smooth enough, one can always approximate it by an appropriate polynomial. We refer to Section 8 to see how the Lagrangian can be derived for two concrete examples. Consider a Lagrangian
$$ L:[0,T]\times\mathcal{Z}\times\mathcal{V}\to\mathbb{R}, \qquad L(t,z(t),v(t)) \approx (5.1), $$
where $\approx$ indicates that, in order to express the term (5.1) by the state and control variables only, an approximation might be needed, as explained above. Similarly to the Lagrangian, in view of the cost functional (3.4), we introduce a terminal cost $F:\mathcal{Z}\to\mathbb{R}$ by
$$ F(z(T)) \approx -y_T^\top\,\mathbb{E}_Q\big[h(X_T)\big]. $$
Under the assumption that the diffusion term $\sigma$ is a polynomial, the ODEs derived in the previous section can be expressed in standard form. We define the function $h:\mathcal{Z}\times\mathcal{V}\to\mathcal{Z}$ by
$$ h(z(t),v(t)) = \Big(h_1^{(1)}(z(t),v(t)),\ldots,h_4^{(1)}(z(t),v(t)),\ \ldots,\ h_1^{(k)}(z(t),v(t)),\ldots,h_4^{(k)}(z(t),v(t))\Big), $$
where
$$ \frac{dm_t^{(\ell)}}{dt} = \frac{dz_1^{(\ell)}}{dt}(t) = h_1^{(\ell)}(z(t),v(t)), \qquad \frac{dS_t^{(\ell)}}{dt} = \frac{dz_2^{(\ell)}}{dt}(t) = h_2^{(\ell)}(z(t),v(t)), $$
$$ \frac{dC_t^{(\ell)}}{dt} = \frac{dz_3^{(\ell)}}{dt}(t) = h_3^{(\ell)}(z(t),v(t)), \qquad \frac{dD_t^{(\ell)}}{dt} = \frac{dz_4^{(\ell)}}{dt}(t) = h_4^{(\ell)}(z(t),v(t)), $$


for $\ell=1,\ldots,k$ are given by (4.8), (4.9), (4.6) and (4.7). Thus, we have shown so far in this article that Problem 3.2 can be reformulated as the following optimal control problem
$$ \begin{aligned} \underset{v\in\mathcal{M}([0,T],\mathcal{V})}{\text{minimize}}\quad & J(v) = \int_0^T L(t,z(t),v(t))\,dt + F(z(T)) \\ \text{subject to}\quad & \dot{z}(t) = h(z(t),v(t)), \quad t\in[0,T] \text{ a.e.}, \\ & z(0) = z_0, \end{aligned} \tag{5.2} $$

where $\mathcal{M}([0,T],\mathcal{V})$ denotes the space of measurable functions from $[0,T]$ to $\mathcal{V}$.

5.1. Maximum principle. We derive necessary conditions for global optimality of the optimization problem (5.2) that are provided by the Pontryagin maximum principle (PMP). Since the control set $\mathcal{V}$ is unbounded, we need an extended setting of the standard PMP; see [9, Section 22.4] for a comprehensive survey. It requires some further assumptions.

Assumption 5.1. Let the process $(z^\star(t), v^\star(t))_{t\in[0,T]}$ be a local minimizer for the OCP (5.2) that satisfies
(i) the function $F$ is continuously differentiable;
(ii) the functions $h$ and $L$ are continuous and admit derivatives relative to $z$ which are themselves continuous in all variables $(t,z,v)$;
(iii) there exist $\varepsilon>0$, a constant $c$, and a summable function $d$ such that for almost every $t\in[0,T]$, we have
$$ |z-z^\star(t)|\le\varepsilon \ \Rightarrow\ |\nabla_z(h,L)(t,z,v^\star(t))| \le c\,|(h,L)(t,z,v^\star(t))| + d(t). $$
Note that Assumption 5.1(iii) is implied if $|\nabla_z h(t,z,v)| + |\nabla_z L(t,z,v)| \le c\,(|h(t,z,v)| + |L(t,z,v)|) + d(t)$ holds for all $v\in\mathcal{V}$ when $z$ is restricted to a bounded set, which is satisfied by many systems. Moreover, the condition automatically holds if $v^\star$ happens to be bounded.

Lemma 5.2 (PMP [9, Theorem 22.2]). Given Assumption 5.1, let the process $(z^\star(t), v^\star(t))_{t\in[0,T]}$ be a local minimizer for the problem (5.2). Then there exists an absolutely continuous function $p:[0,T]\to\mathcal{Z}$ satisfying
1. the adjoint equation $\dot{p}(t) = -\nabla_z\langle p(t), h(z^\star(t),v^\star(t))\rangle - \nabla_z L(t,z^\star(t),v^\star(t))$ for almost every $t\in[0,T]$;
2. the transversality condition $p(T) = \nabla_z F(z(T))$;
3. the maximum condition
$$ \langle p(t), h(z^\star(t),v^\star(t))\rangle + L(t,z^\star(t),v^\star(t)) = \inf_{v\in\mathcal{V}}\ \langle p(t), h(z^\star(t),v)\rangle + L(t,z^\star(t),v) $$
for almost every $t\in[0,T]$.

Remark 5.3.
1. Given that an optimal process $(z^\star,v^\star)$ exists (existence can be assured by standard existence results; see for example [9, Theorem 23.11]), the maximum condition 3 can be used to derive a feedback law
$$ v^\star(t)\in\arg\min_{v\in\mathcal{V}}\ \langle p(t), h(z^\star(t),v)\rangle + L(t,z^\star(t),v). $$


2. Lemma 5.2 basically leads to a boundary value problem with initial conditions for the states and terminal conditions for the adjoint states, which provides necessary conditions for global optimality of Problem 3.2.

We summarize the method described so far to approximate the smoothing density. It basically consists of the following three steps, which provide a solution to Problem 3.2:
Step 1 Fix a mixture of exponential families of probability densities, e.g., the mixture of multivariate normal densities. Theorem 4.5, which simplifies to Proposition 4.7 for multivariate normal densities, characterizes the approximate posterior SDE (3.3) whose solution admits marginal densities evolving in the chosen mixture of exponential families.
Step 2 Given the approximate posterior SDE (3.3), we derive an optimal control formulation of Problem 3.2. For the mixture of multivariate normal densities, this derivation is presented in Sections 4 and 5 and finally leads to the OCP (5.2).
Step 3 Necessary conditions for optimality of the OCP (5.2), and hence for Problem 3.2, can be derived from Pontryagin's maximum principle and result in a boundary value problem. Under the assumption that the smoothing density at terminal time is available, the necessary conditions reduce to an ordinary initial value problem.

5.2. Computational complexity. If the smoothing density at terminal time $T$ is known, the PMP (Lemma 5.2) reduces to an ordinary initial value problem, which can be solved numerically much more efficiently than (S)PDEs. Therefore, the major computational difficulty of the presented variational approach lies in estimating the smoothing density at terminal time. A straightforward, though clearly not efficient, method for this is to solve the Zakai equation (2.7), as explained in Section 2, which we used in the numerical examples in Section 8. As such, whereas the standard PDE approach for computing a smoothing density requires solving a Zakai equation and the SPDE (2.8), the presented variational approach relies on only a Zakai equation and the mentioned initial value problem. This can be seen as a significant reduction in the computational effort required and is demonstrated by two numerical examples in Section 8, Table 1. Moreover, for future work, we aim to study the derivation of an estimator for the marginal smoothing density at terminal time without solving a Zakai equation, which would then allow us to apply the proposed variational approximation method to high-dimensional problems; see Section 9 for more details. Another idea to circumvent the estimation of this terminal condition is to use an alternative approach to the PMP for characterizing a solution to the OCP (5.2), briefly described in the following remark.

Remark 5.4 (Semidefinite programming). Solutions to the OCP (5.2) can be characterized via the so-called weak formulation, which consists of an infinite-dimensional linear program; see [21, Chapter 10] for details. Therefore, numerical approximation schemes for such infinite-dimensional linear programs, which have been studied in the literature, can be employed to solve Problem 3.2. This approach seems particularly promising when the data of the OCP (dynamics and costs) are described by polynomials, as then the seminal Lasserre hierarchy, based on solving a sequence of semidefinite programs, is applicable [20, 21].
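To give an idea of the two-point boundary value structure produced by the maximum principle (cf. Remark 5.3), the following toy example solves a scalar linear-quadratic instance with SciPy's collocation solver. It is not the OCP (5.2) of this paper; the dynamics, cost and boundary conditions form a standard textbook test case chosen only to illustrate the structure.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Toy problem: minimize int_0^1 (z^2 + v^2)/2 dt subject to zdot = v, z(0) = 1.
# The maximum condition gives v*(t) = -p(t), so the PMP reduces to the BVP
# zdot = -p, pdot = -z with z(0) = 1 and p(1) = 0 (transversality).
def rhs(t, y):                 # y[0] = z, y[1] = p
    return np.vstack([-y[1], -y[0]])

def bc(ya, yb):
    return np.array([ya[0] - 1.0, yb[1]])

t = np.linspace(0.0, 1.0, 50)
sol = solve_bvp(rhs, bc, t, np.zeros((2, t.size)))
print(sol.status, sol.y[0, -1])
```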
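Regarding the estimation of the terminal smoothing density discussed in Section 5.2, a crude way to solve the one-dimensional Zakai equation (2.7) is an explicit finite-difference scheme. The sketch below is a rough illustration only (no claim of stability or accuracy beyond the basic explicit time-step restriction), and all function names are placeholders.

```python
import numpy as np

def zakai_explicit_fd(f, sig, h, p0, x, dY, dt):
    """Rough explicit finite-difference scheme for the 1-D Zakai equation (2.7):
       dp = [ -d/dx (f p) + 1/2 d^2/dx^2 (sigma^2 p) ] dt + p h dY.
    x : uniform spatial grid, dY : observation increments per time step.
    Returns the unnormalized filter density on the grid at every time step.
    Note: dt must satisfy an explicit-scheme stability condition (dt ~ dx^2)."""
    dx = x[1] - x[0]
    fx, a, hx = f(x), sig(x) ** 2, h(x)
    P = np.zeros((len(dY) + 1, len(x)))
    P[0] = p0(x)
    for k, dy in enumerate(dY):
        p = P[k]
        dflux = np.gradient(fx * p, dx)                  # d/dx (f p)
        diff = np.gradient(np.gradient(a * p, dx), dx)   # d^2/dx^2 (a p)
        p_new = p + dt * (-dflux + 0.5 * diff) + p * hx * dy
        p_new[0] = p_new[-1] = 0.0                       # absorbing boundary
        P[k + 1] = np.maximum(p_new, 0.0)
    return P
```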


6. Parameter inference. The goal of this section is to outline the use of the techniques developed so far for path estimation for the inference of parameters in a hidden Markov model. We consider a class of dynamical systems
$$ dX_t^\kappa = f(X_t^\kappa,\kappa)\,dt + \sigma(X_t^\kappa,\kappa)\,dW_t, \qquad X_0^\kappa = x_0, \qquad 0\le t\le T, \tag{6.1} $$
parametrized by $\kappa$. The observation process can be modeled by (2.2), but as discussed in the next section, the approach discussed below can also be used, with necessary modifications, for a discrete observation process.

As a natural notation, for each parameter $\kappa$, the probability distribution of $X_{[0,T]}^\kappa$ on $\mathcal{C}$ will be denoted by $\Pi_{\mathrm{prior}}^\kappa$. Given a sample path $\{y_t : 0\le t\le T\}$ of the observation process $Y_{[0,T]}$, the objective is to select an optimal $\kappa^\star\in\mathbb{R}^d$ such that the observation process $(Y_t)_{t\in[0,T]}$ in (2.2) has a high probability of reproducing the given data $y$. This is basically the inference scheme based on classical maximum likelihood estimation, and we propose an algorithm along the lines of the expectation maximization (EM) algorithm (see [8] for a comprehensive survey), which aims to obtain the optimal $\kappa^\star$ through multiple iterations. Recalling (3.2), for each $\kappa$, we define $I^\kappa(H_T(\cdot,y)) := -\log\int\exp(-H_T(\cdot,y))\,d\Pi_{\mathrm{prior}}^\kappa$. As already noted in Section 3, for each parameter $\kappa$, the term $I^\kappa(H_T(\cdot,y))$ provides the total information available through the sample path $y$, and can be interpreted as the negative log-likelihood of $y$ given the parameter $\kappa$. However, minimizing (even evaluating) this negative log-likelihood function can be a hard problem. But, as mentioned in Section 3, Lemma 3.1 and non-negativity of the relative entropy together imply that an upper bound on this negative log-likelihood term is given by the apparent information, $\mathcal{F}(Q,\kappa) := D\big(Q\|\Pi_{\mathrm{prior}}^\kappa\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big]$. The advantage of this observation is that this upper bound on the negative log-likelihood function is also the objective function in Problem 3.2, for which the program for finding the minimizer $Q$ is by now well-established. Therefore, instead of minimizing the actual negative log-likelihood, we minimize an upper bound of it.

The path to find the right parameter $\kappa$ corresponding to the sample path $y$ is now quite standard in statistics. After initialization of the parameter $\kappa$, we find the optimal $Q$ by solving Problem 3.2, and then, in the subsequent step, for this $Q$ we obtain the optimal parameter $\kappa$ by minimizing $\mathcal{F}(Q,\kappa)$. This yields an iterative EM-type algorithm whose details are given below.

EM-type algorithm
  initialize i = 0, κ_i := κ̂_0
  while i ≤ M
    Step 1: compute Q_i by solving Problem 3.2 with parameter κ_i
    Step 2: update the parameter as κ_{i+1} ∈ arg min_κ F(Q_i, κ)
    Step 3: set i → i + 1
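A schematic implementation of the EM-type algorithm is sketched below. The two inner routines are deliberately left abstract: solve_problem_32 stands for any solver of Problem 3.2 for fixed kappa, and apparent_information for an evaluator of F(Q, kappa); both names are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def em_type_algorithm(solve_problem_32, apparent_information,
                      kappa0, max_iter=20, bounds=(0.0, 10.0)):
    """Iterate the two steps of the EM-type algorithm for a scalar parameter."""
    kappa = kappa0
    history = [kappa]
    for _ in range(max_iter):
        Q = solve_problem_32(kappa)                      # Step 1 (E-like step)
        res = minimize_scalar(                           # Step 2 (M-like step)
            lambda k: apparent_information(Q, k), bounds=bounds, method="bounded")
        kappa = res.x
        history.append(kappa)                            # Step 3: iterate
    return kappa, history
```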

Remark 6.1. Analyzing the convergence of the above algorithm and the consistency of the corresponding estimator is the next important step and will be addressed in our future projects. We refer to Section 8 for a numerical visualization of this variational parameter inference method applied to two examples, and to Section 9 for a discussion of the convergence and consistency of the estimator as a topic of further research.


7. Discrete time measurement model. In most practical examples, the measurements of physical quantities are processed by computers, and as such the available data are obtained only at discrete times, potentially restricted to a low number. The goal of this section is to outline how the discussed variational approximation scheme adapts naturally to such cases with obvious modifications. In this case the signal process (2.1) is observed through noisy measured data $y := \{y_k\}_{k=1}^N$ at discrete times $t_1\le t_2\le\ldots\le t_N\le T$. The canonical model for the observation process is thus given by
$$ Y_k = h(X_k,t_k) + \rho_k, \qquad \text{for } k=1,\ldots,N, \tag{7.1} $$
where $X_k := X_{t_k}$, $h:\mathbb{R}^n\times\mathbb{R}\to\mathbb{R}^n$ is a measurable function, the $\rho_k$ are $\mathbb{R}^n$-valued i.i.d. Gaussian random variables with zero mean and covariance $R_k$, and they are independent of $x_0$ and $\sigma(W_s : s\le T)$. We consider $m$ such that $t_m\le t<t_{m+1}$ and, similarly to Section 2, define the filter density $p$ and smoothing density $P_S$ by
$$ \mathbb{E}\big[\phi(X(t))\mid Y_1,\ldots,Y_m,x_0\big] = \int \phi(x)\,p(x,t)\,dx, \tag{7.2} $$
$$ \mathbb{E}\big[\phi(X(t))\mid Y_1,\ldots,Y_N,x_0\big] = \int \phi(x)\,P_S(x,t)\,dx, \tag{7.3} $$
where $\phi$ is any measurable function from $\mathbb{R}^n$ to $\mathbb{R}$. It is well known (see [15, Appendix] for a derivation) that the smoothing density can be expressed as
$$ P_S(x,t) = \frac{p(x,t)\,v(x,t)}{\int_{\mathbb{R}^n} p(x,t)\,v(x,t)\,dx}, \tag{7.4} $$
where $p(x,t)$ and $v(x,t)$ in between the observation times are the solutions to the
$$ \text{Kolmogorov forward equation:}\quad \begin{cases} dp(x,t) = \mathcal{A}^* p(x,t)\,dt \\ p(x,0) = p_0(x), \end{cases} \tag{7.5} $$
$$ \text{Kolmogorov backward equation:}\quad \begin{cases} dv(x,t) = -\mathcal{A}v(x,t)\,dt \\ v(x,T) = 1, \end{cases} \tag{7.6} $$
punctuated by jumps at the data points $t_k$ for $k=1,\ldots,N$:
$$ p(x,t_k^+) \propto p(x,t_k^-)\exp\Big(y_k^\top R_k^{-1}h(x,t_k) - \frac{1}{2}h(x,t_k)^\top R_k^{-1}h(x,t_k)\Big), \tag{7.7} $$
$$ v(x,t_k) \propto v(x,t_k^+)\exp\Big(y_k^\top R_k^{-1}h(x,t_k) - \frac{1}{2}h(x,t_k)^\top R_k^{-1}h(x,t_k)\Big). \tag{7.8} $$
Similar to the continuous time measurement model, it can be shown that the smoothing density solves the Kolmogorov forward equation given by (2.10), with drift function $g(x,t) := f(x) + a(x)\nabla\log v(x,t)$, where $v$ is the solution to (7.6). As before, we denote the prior probability measure by $\Pi_{\mathrm{prior}}(A) = \mathbb{P}\big(X_{[0,T]}\in A\big)$ and the posterior probability measure, induced by the solution to (2.10), by $\Pi_{\mathrm{post}}(A,Y) = \mathbb{P}\big(X_{[0,T]}\in A\mid\mathcal{F}_T^Y\big)$, where $\mathcal{F}_T^Y = \sigma(x_0,Y_1,\ldots,Y_N)$. Let $y_k$ denote a realization of the observation process at the time $t_k$. The variational approximation derived in Section 3, and in particular Problem 3.2, carries over to the discrete time observation setting considered here. As


before, the path to the objective function starts from Lemma 3.1, which holds in this case with
$$ H_T(X,y) := \sum_{i=1}^N\Big(\frac{1}{2}\big\|R_i^{-1}h(X_i,t_i)\big\|^2 - y_i^\top R_i^{-1}h(X_i,t_i)\Big). \tag{7.9} $$
One way to see this is to recast the discrete model in the traditional setup of Section 2 and then use the Kallianpur-Striebel theorem. To do this, first assume without loss of generality that $R_k = I$. Define the function $\bar{h}:\mathcal{C}\times[0,T]\to\mathbb{R}^n$ by
$$ \bar{h}(x,t) = \sum_k (t_{k+1}-t_k)^{-1/2}\,h\big(x\circ\eta(t),\eta(t)\big)\,\mathbf{1}_{\{t_k\le t<t_{k+1}\}}, $$
for which
$$ \int_0^T \frac{1}{2}\|\bar{h}(X,s)\|^2\,ds - \int_0^T \bar{h}(X,s)^\top d\tilde{Y}(s) \overset{\mathrm{Law}}{=} \sum_{k=1}^N\Big(\frac{1}{2}\|h(X_k,t_k)\|^2 - Y_k^\top h(X_k,t_k)\Big), $$
which leads to (7.9). Therefore, in this case the objective function in Problem 3.2 can be expressed as
$$ D\big(Q\|\Pi_{\mathrm{prior}}\big) + \mathbb{E}_Q\big[H_T(\cdot,y)\big] = \mathbb{E}_Q\Big[\int_0^T\Big(\frac{1}{2}\|u(X_t,t)-f(X_t)\|^2_{a(X_t)} + \iota(X_t,t)\Big)dt\Big], \tag{7.10} $$
where
$$ \iota(X_t,t) = \sum_{i=1}^N\Big(y_i^\top R_i^{-1}h(X_i,t_i) - \frac{1}{2}\big\|R_i^{-1}h(X_i,t_i)\big\|^2\Big)\,\delta(t-t_i). \tag{7.11} $$

Section 4 is independent of the considered measurement model, and by following Section 5 we arrive at an optimal control problem (5.2), where the cost functional is replaced by (7.10). The derivation of necessary conditions for global optimality of the


optimization problem (5.2), compared to the continuous time measurement model, is here somewhat nonstandard, due to the Dirac delta terms (7.11) involved in the Lagrangian. However, the problem can be seen as an OCP with so-called intermediate constraints, for which an extension of the PMP is available [14].

Assumption 7.1. Let the process $(z^\star(t),v^\star(t))_{t\in[0,T]}$ be a local minimizer for the optimal control problem (5.2) that satisfies
(i) Assumptions 5.1(i) and (ii);
(ii) $v^\star$ is measurable and essentially bounded.

Lemma 7.2 (Extended PMP). Let the process $(z^\star(t),v^\star(t))_{t\in[0,T]}$ be a local minimizer for the problem (5.2). Given Assumption 7.1, there exists an absolutely continuous function $p:[0,T]\to\mathcal{Z}$ satisfying
1. the adjoint equation $\dot{p}(t) = -\nabla_z\langle p(t), h(z^\star(t),v^\star(t))\rangle - \nabla_z L(t,z^\star(t),v^\star(t))$ for almost all $t\in[0,T]$;
2. the transversality conditions $p(t_i) = p(t_i^-) - \nabla_z\mathbb{E}_Q\big[\iota(X,t_i)\big]$ for $i=1,\ldots,N$ and $p(T) = 0$;
3. the maximum condition
$$ \langle p(t), h(z^\star(t),v^\star(t))\rangle + L(t,z^\star(t),v^\star(t)) = \sup_{v\in\mathcal{M}([0,T],\mathcal{V})}\ \langle p(t), h(z^\star(t),v(t))\rangle + L(t,z^\star(t),v(t)) $$
for almost all $t\in[0,T]$.
Proof. Follows directly from [14], when transforming problem (5.2) into an OCP with intermediate constraints.

Remark 7.3.
1. Note that the data (measurements) enter the expression through the cost function, namely the term (7.11), which is nonzero only at the measurement times $\{t_i\}_{i=1}^N$ and leads to jumps in the adjoint state. Furthermore, as described in Remark 5.3, the maximum condition 3 can be used to derive a feedback law.
2. Lemma 7.2 basically leads to a boundary value problem that provides necessary conditions for optimality of Problem 3.2. See Section 5.2 for a discussion of how to numerically solve it. We refer to the numerical examples in Section 8 for the performance of such a solution.

8. Simulation results. In this section, we present two examples to illustrate the performance of the variational approximation method introduced. Both examples have important applications in mathematical finance. As a first example, we consider the geometric Brownian motion, which is used to model stock prices in the Black-Scholes model [26]. The second example is concerned with the Cox-Ingersoll-Ross process, which is often used for describing the evolution of interest rates [10].

8.1. Geometric Brownian motion. Consider as underlying system a one-dimensional geometric Brownian motion (GBM)
$$ dX_t = \kappa X_t\,dt + \lambda X_t\,dW_t, \qquad X_0 = x_0 \sim \log\mathcal{N}(\mu,\sigma), \tag{8.1} $$
for $0\le t\le T$, and assume that the available data are noisy observations $\{y_k\}_{k=1}^N$ at times $t_k$, modeled by the observation process $Y_k = X_{t_k} + \rho_k$, where $\{\rho_k\}_{k=1}^N$ are i.i.d. normal random variables with zero mean, standard deviation $R$ and $t_N = T$. We compare the smoothing density, obtained by solving the respective PDEs, against the variational approximation presented in this article.
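The data-generating model of this example can be simulated as follows. The sketch uses the parameter values quoted below (kappa = 1, lambda = 0.1, R = 0.25, mu = 0, sigma = 0.25), while the horizon T and the number of observations N are not specified in the text and are chosen here purely for illustration.

```python
import numpy as np

# Simulate the GBM (8.1) with Euler-Maruyama and discrete noisy observations.
rng = np.random.default_rng(1)
kappa, lam, R, mu, sig = 1.0, 0.1, 0.25, 0.0, 0.25   # values quoted in the text
T, n_steps, N_obs = 0.2, 2000, 10                    # illustrative choices

dt = T / n_steps
x = np.exp(rng.normal(mu, sig))                      # x0 ~ logN(mu, sigma)
path = np.empty(n_steps + 1); path[0] = x
for k in range(n_steps):                             # dX = kappa X dt + lam X dW
    x += kappa * x * dt + lam * x * rng.normal(scale=np.sqrt(dt))
    path[k + 1] = x

obs_idx = np.linspace(1, n_steps, N_obs, dtype=int)  # t_1 < ... < t_N = T
t_obs = obs_idx * dt
y_obs = path[obs_idx] + rng.normal(scale=R, size=N_obs)   # Y_k = X_{t_k} + rho_k
```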


PDE approach. As explained in Section 7, the smoothing density can be characterized by (7.4) as the (normalized) product of two densities $v$ and $p$. The first density satisfies the Kolmogorov backward equation (7.6) with jump conditions (7.8) at the measurement times and terminal condition $v(x,T) = \frac{1}{\sqrt{2\pi}R}\exp\big(\frac{-(x-y_N)^2}{2R^2}\big)$. Its marginals are shown in Figure 1a. The second density, called the filter density, is given by the Kolmogorov forward equation (7.5) with jump conditions (7.7) and initial condition $p(x,0) = \frac{1}{\sqrt{2\pi}\,x\sigma}\exp\big(\frac{-(\log x-\mu)^2}{2\sigma^2}\big)$, which is given by (8.1). Its marginals are shown in Figure 1b. The smoothing density is depicted in Figure 1c as the solid line.

Parameter values. We take $\kappa = 1$ and $\lambda = 0.1$. Also we take $R = 0.25$, and for the initial distribution of $X_0$ we take $\mu = 0$ and $\sigma = 0.25$.

Variational approximation. In order to approximate the smoothing density by a normal density, according to Theorem 4.5, the drift function for the approximating SDE (3.3) has to be chosen as
$$ u(x,t) = A_t + (\lambda^2 + B_t)x + \lambda^2 x^2(C_t + D_t x). \tag{8.2} $$
To express the Lagrangian in (7.10) as a function of only the state variables and control inputs, one can see directly from (8.1) and (8.2) that the first two inverse moments of $X_t$ with respect to $Q$ need to be approximated. Due to the non-negativity of the GBM, we use the approximations $\mathbb{E}_Q\big[X_t^{-1}\big] \approx \mathbb{E}_Q\big[X_t\big]^{-1} = m_t^{-1}$ and $\mathbb{E}_Q\big[X_t^{-2}\big] \approx \mathbb{E}_Q\big[X_t^2\big]^{-1} = (S_t + m_t^2)^{-1}$, whose accuracy has been investigated in [16]. Note that Assumption 7.1 can easily be verified to hold if we restrict the optimizers in (5.2) to bounded controls. We solve the ODE system obtained from Lemma 7.2 under the assumption that the smoothing density at terminal time $T$ is available; see Section 5.2 for a discussion of this assumption. The solution is depicted in Figure 1c as the dashed line. Finally, Figure 1d shows the relative entropy between the smoothing density obtained by the PDE approach and the variational method, and hence reflects the accuracy of the variational approximation.

Parameter inference. We consider the case where the drift parameter $\kappa$ in (8.1) is assumed to be unknown. Figure 1e shows the performance of the EM-type algorithm introduced in Section 6 for an initial guess $\hat{\kappa}_0 = 4$ of the unknown parameter. It can be seen that the estimate $\hat{\kappa}$ is quite close to the true value of $\kappa = 1$, proving the efficacy of our algorithm. Also, the algorithm converges quite rapidly.

8.2. Cox-Ingersoll-Ross. Consider as underlying system a Cox-Ingersoll-Ross (CIR) process
$$ dX_t = \kappa(b - X_t)\,dt + \lambda\sqrt{X_t}\,dW_t, \qquad X_0 = x_0 \sim \mathcal{N}(\mu,\sigma), \tag{8.3} $$

for $0\le t\le T$, and assume that the available data are noisy observations $\{y_k\}_{k=1}^N$ at times $t_k$, modeled by an observation process $Y_k = X_{t_k} + \rho_k$,

where ρk are i.i.d. normal random variables with zero mean, standard deviation R and tN = T . We compare the smoothing density, obtained by solving the respective PDEs, against the variational approximation presented in this article.
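As for the GBM example, the CIR model (8.3) can be simulated with a full-truncation Euler scheme (the positive part inside the drift and the square root keeps the scheme well defined). The parameter values below are hypothetical, since they are not specified at this point in the text.

```python
import numpy as np

# Full-truncation Euler simulation of the CIR process (8.3); all values illustrative.
rng = np.random.default_rng(2)
kappa, b, lam = 1.0, 1.0, 0.2
mu, sig = 1.0, 0.1
T, n_steps = 1.0, 2000

dt = T / n_steps
x = rng.normal(mu, sig)                       # x0 ~ N(mu, sigma)
path = np.empty(n_steps + 1); path[0] = x
for k in range(n_steps):
    xp = max(x, 0.0)                          # truncation keeps sqrt well defined
    x = x + kappa * (b - xp) * dt + lam * np.sqrt(xp) * rng.normal(scale=np.sqrt(dt))
    path[k + 1] = x
```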

[Figure 1 about here]

Fig. 1: Geometric Brownian motion: Comparison of the PDE solution (solid) versus the variational approach (dashed). Panels: (a) density v(x, t), (b) filter density p(x, t), (c) smoothing density, (d) relative entropy, (e) parameter inference, κ̂ = 1.1867. The considered numerical values are: κ = 1, λ = 0.1, R = 0.15, T = 0.2 s, µ = 0, σ = 0.25, N = 4, t_1 = T/4, t_2 = T/2, t_3 = 3T/4 and t_4 = T.

PDE approach. As explained in Section 7, the smoothing density can be characterized as the (normalized) product of two densities v and p. The first density satisfies the Kolmogorov backward equation (7.6) with jump conditions (7.8) at the measurement times and terminal condition v(x, T) = (1/(√(2π)R)) exp(−(x − y_N)²/(2R²)). Its marginals are shown in Figure 2a. The second density, called the filter density, is given by the Kolmogorov forward equation (7.5) with jump conditions (7.7) and initial condition p(x, 0) = (1/(√(2π)σ)) exp(−(x − µ)²/(2σ²)), which corresponds to the initial distribution in (8.3). Its marginals are shown in Figure 2b. The smoothing density is depicted in Figure 2c as the solid line.

Parameter values. We take κ = 1, b = 0.3 and λ = 0.2. Also we take R = 0.1, and for the initial distribution of X_0 we take µ = 1 and σ = 0.1.

Variational approximation. In order to approximate the smoothing density by a normal density, according to Theorem 4.5, the drift function for the approximating SDE (3.3) has to be chosen as

(8.4)   u(x, t) = ½λ² + A(t) + B(t)x + λ²x(C(t) + D(t)x).
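As an illustration of how the parametric drift (8.4) enters a numerical implementation, the following sketch (a hypothetical helper, not from the paper) evaluates u(x, t) for given time-varying coefficients A, B, C, D, which in the paper are obtained from the ODE system of Lemma 7.2; the coefficient functions used below are arbitrary placeholders.

```python
import numpy as np

# Sketch of the parametric drift (8.4) for the CIR example, with the
# time-varying coefficients A, B, C, D supplied as callables (placeholders
# here; in the paper they come from the ODE system of Lemma 7.2).
lam = 0.2

def u(x, t, A, B, C, D):
    """Drift of the approximating SDE (3.3) for diffusion a(x) = lam**2 * x."""
    return 0.5 * lam**2 + A(t) + B(t) * x + lam**2 * x * (C(t) + D(t) * x)

# Example evaluation with arbitrary placeholder coefficient functions.
A = B = C = D = lambda t: 0.1 * np.exp(-t)
print(u(1.0, 0.05, A, B, C, D))
```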

To express the Lagrangian in (7.10) as a function of only the state variables and control inputs, one can see directly from (8.3) and (8.4) that the first inverse moment of X_t with respect to Q needs to be approximated. Due to the non-negativity of the CIR process, we use E_Q[X_t⁻¹] ≈ E_Q[X_t]⁻¹ = m_t⁻¹, whose approximation quality has been studied in [16]. Assumption 7.1 can easily be verified to hold if we restrict the optimizers in (5.2) to bounded controls. We solve the ODE system obtained from Lemma 7.2, assuming that we have access to the smoothing density at terminal time T; see Section 5.2 for a discussion of that assumption. The variational approximation to the smoothing density is depicted in Figure 2c as the dashed line. Finally, Figure 2d shows the relative entropy between the smoothing density obtained by the PDE solution and the variational approximation.

Parameter inference. We consider the case where the drift parameter κ in (8.3) is unknown. Figure 2e shows the performance of the EM algorithm introduced in Section 6 for an initial guess κ̂_0 = 4 of the unknown parameter. It can be seen that the estimator κ̂ is quite close to the true value of κ = 1, which demonstrates the efficacy of our algorithm. Also, the algorithm converges very fast.

Table 1: Runtime comparison. The presented variational approach for approximating the smoothing density is compared with the standard PDE approach for the two examples in Sections 8.1 and 8.2. All simulations are performed on a 2.3 GHz Intel Core i7 processor with 8 GB RAM using Matlab.

                            Geometric Brownian motion   Cox-Ingersoll-Ross
Forward PDE (7.5)           2.02 s                      1.33 s
Backward PDE (7.6)          2.87 s                      2.38 s
ODE                         0.10 s                      0.23 s
PDE approach⁵               4.89 s                      3.70 s
Variational approach⁶       2.12 s                      1.55 s

⁵ Consists of solving the two PDEs (7.5) and (7.6).
⁶ Consists of solving the PDE (7.5) and the ODE system in order to solve Problem 3.2.

Table 1 summarizes the runtimes of the two numerical examples above. It can be seen that the ODEs provided by the maximum principle can be solved roughly one order of magnitude faster than the backward PDE (7.6), which is the reason for the speedup of the variational approach compared to the PDE approach. Moreover, it is highlighted that the main computational effort in the variational approach is needed to estimate the marginal smoothing density at terminal time, which is done by solving the filter PDE (7.5). If, however, the filter density at terminal time could be estimated in a more efficient way, e.g., by using an MCMC method, the proposed variational approximation method could be applicable to high-dimensional problems.
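The relative entropy reported in Figures 1d and 2d can be evaluated numerically once both densities are tabulated on a common spatial grid; a minimal sketch (illustrative only, with hypothetical inputs, not the paper's code) is:

```python
import numpy as np

# Illustrative sketch (not the paper's code): relative entropy D(q || p) between
# two densities tabulated on a common grid, as used to compare the PDE smoothing
# density with its variational approximation (Figures 1d and 2d).
def relative_entropy(q, p, dx):
    """Approximate D(q||p) = int q log(q/p) dx by the trapezoidal rule."""
    mask = q > 0
    integrand = np.zeros_like(q)
    integrand[mask] = q[mask] * np.log(q[mask] / p[mask])
    return np.trapz(integrand, dx=dx)

# Example: two Gaussian densities on a grid.
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]
q = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
p = np.exp(-0.5 * (x - 0.3)**2 / 1.1**2) / (1.1 * np.sqrt(2 * np.pi))
print(relative_entropy(q, p, dx))   # close to the analytic KL between the Gaussians
```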

[Figure 2 about here]

Fig. 2: Cox-Ingersoll-Ross: Comparison of the PDE solution (solid) versus the variational approach (dashed). Panels: (a) density v(x, t), (b) filter density p(x, t), (c) smoothing density, (d) relative entropy, (e) parameter inference, κ̂ = 1.4469. The considered numerical values are: λ = 0.2, κ = 1, b = 0.3, µ = 1, σ = 0.1, R = 0.1, T = 0.3 s, N = 2, t_1 = T/2 and t_2 = T.

9. Conclusion. The paper is devoted to a variational method for estimating paths of a signal process in a hidden Markov model. In particular, this leads to approximations of the smoothing density, which can be used to reconstruct any past state of the signal process given a full set of observations. A crucial fact that plays an important role in our method is that the smoothing distribution is induced by a posterior SDE which itself is a modification of the original signal process. The presented variational approach proposes an approximate SDE which minimizes the relative entropy between the posterior SDE and a class of SDEs whose marginals belong to a chosen mixture of exponential families. In the simplest case of normal marginals and a posterior SDE with constant diffusion term, the approximating SDE consists of a linear drift and constant diffusion term, which is well known. It is shown that the prescribed approximation scheme can be formulated as an optimal control problem, and necessary conditions for global optimality are obtained by the Pontryagin maximum principle. The resulting numerical methods have considerable computational advantages over numerically solving the underlying (S)PDEs, which is highlighted by two examples. The developed approximation scheme is then used for designing an efficient method for parameter inference for SDEs.

Future work. For future work, as mentioned in Section 5.2, we aim to study how to efficiently estimate the filter density at terminal time T. Then, the presented variational approximation method reduces to solving an ordinary initial value problem that is tractable even in relatively large dimensions, compared to PDEs. Alternatively, it would be interesting to study numerical methods specifically tailored to the boundary value problems resulting from the maximum principle, such as the shooting method; see [27] for a comprehensive summary. This should also lead to a significant reduction in computation time compared to the presented examples. Also, the approach of solving the optimal control problem via its weak formulation, which, given polynomial problem data, reduces to solving a hierarchy of semidefinite programs, would be expected to be computationally attractive, as pointed out in Remark 5.4. Our future projects will also delve into analyzing the convergence of the EM-type algorithm used for parameter inference as well as the properties of the obtained estimators. We will also focus on refining the basic inference algorithm to get better efficiency and speed. One promising path in this direction would be the design of suitable adaptive EM-type algorithms. It is also conceivable that the ideas mentioned in the paper can be combined with suitable MCMC schemes to get better accuracy and efficiency in high-dimensional models.
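As a pointer to the class of boundary value problems mentioned above, the toy example below is purely illustrative (it is not connected to the paper's ODE system and uses SciPy's collocation-based solver rather than a shooting method); it only indicates the kind of two-point boundary value problem that such solvers address.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Illustrative toy two-point BVP: y'' = -y with y(0) = 0, y(pi/2) = 1,
# whose exact solution is y(t) = sin(t).
def rhs(t, y):
    # y[0] = y, y[1] = y'
    return np.vstack([y[1], -y[0]])

def bc(ya, yb):
    return np.array([ya[0], yb[0] - 1.0])

t = np.linspace(0.0, np.pi / 2, 50)
y0 = np.zeros((2, t.size))
sol = solve_bvp(rhs, bc, t, y0)
print(sol.status, sol.sol(np.pi / 4)[0])   # status 0 and value close to sin(pi/4)
```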

Appendix A. Derivation of Equation (2.10). We consider the one-dimensional case; an extension to the multi-dimensional case is straightforward. According to (2.9) the smoothing density is given by P_S(x, t) = K(t)p(x, t)v(x, t), where K(t) := ( ∫_{R^n} p(x, t)v(x, t)dx )⁻¹. The main idea is to recall that the process (K(t))_{t∈[0,T]} is known to be almost surely constant [25, Theorem 3.2]. Therefore

∂P_S(x, t)/∂t = K(t)p(x, t) ∂v(x, t)/∂t + K(t)v(x, t) ∂p(x, t)/∂t
  = [P_S(x, t)/v(x, t)] ( −f(x)v′(x, t) − ½a(x)v″(x, t) − v(x, t)h(x)⊤ dY_t )
  + v(x, t) ( −( f(x)P_S(x, t)/v(x, t) )′ + ½( a(x)P_S(x, t)/v(x, t) )″ + [P_S(x, t)/v(x, t)] h(x)⊤ dY_t ).

The proof follows by a straightforward computation. We compute in a preliminary step

( f(x)P_S(x, t)/v(x, t) )′ = [1/v²(x, t)] ( ( f′(x)P_S(x, t) + f(x)P_S′(x, t) ) v(x, t) − f(x)P_S(x, t)v′(x, t) ),

and

( a(x)P_S(x, t)/v(x, t) )″ = [1/v(x, t)] ( a″(x)P_S(x, t) + 2a′(x)P_S′(x, t) + a(x)P_S″(x, t) )
  − [1/v²(x, t)] ( 2v′(x, t)a′(x)P_S(x, t) + 2v′(x, t)a(x)P_S′(x, t) + a(x)P_S(x, t)v″(x, t) )
  + [1/v³(x, t)] 2a(x)P_S(x, t)v′(x, t)².

Using these two preliminaries, we get

∂P_S(x, t)/∂t
  = −f′(x)P_S(x, t) − f(x)P_S′(x, t) + a′(x)P_S′(x, t) + ½( a″(x)P_S(x, t) + a(x)P_S″(x, t) )
    − [1/v(x, t)] ( a′(x)v′(x, t)P_S(x, t) + a(x)v″(x, t)P_S(x, t) + a(x)v′(x, t)P_S′(x, t) )
    + [1/v²(x, t)] a(x)v′(x, t)²P_S(x, t)
  = − ( f′(x)P_S(x, t) + f(x)P_S′(x, t) + [1/v(x, t)]( a′(x)v′(x, t)P_S(x, t) + a(x)v″(x, t)P_S(x, t) + a(x)v′(x, t)P_S′(x, t) − [1/v(x, t)] a(x)v′(x, t)²P_S(x, t) ) )
    + ½( a′(x)P_S(x, t) + a(x)P_S′(x, t) )′
  = − ( [ f(x) + a(x) v′(x, t)/v(x, t) ] P_S(x, t) )′ + ½( a(x)P_S(x, t) )″
  = − ( [ f(x) + a(x)( log v(x, t) )′ ] P_S(x, t) )′ + ½( a(x)P_S(x, t) )″,

and as such (2.10) holds.
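The final algebraic step of the derivation above can be verified symbolically; the following sketch (illustrative, not part of the paper) checks the identity with SymPy for generic smooth functions f, a, v and P.

```python
import sympy as sp

# Symbolic check (illustrative) of the last algebraic step in Appendix A:
# the expanded expression for dP_S/dt equals
# -[(f + a*(log v)') * P]' + (1/2)*(a*P)''.
x = sp.symbols('x')
f, a, v, P = (sp.Function(n)(x) for n in ('f', 'a', 'v', 'P'))

lhs = (-f.diff(x)*P - f*P.diff(x) + a.diff(x)*P.diff(x)
       + sp.Rational(1, 2)*(a.diff(x, 2)*P + a*P.diff(x, 2))
       - (a.diff(x)*v.diff(x)*P + a*v.diff(x, 2)*P + a*v.diff(x)*P.diff(x))/v
       + a*v.diff(x)**2*P/v**2)
rhs = -((f + a*sp.log(v).diff(x))*P).diff(x) + sp.Rational(1, 2)*(a*P).diff(x, 2)

print(sp.simplify(lhs - rhs))   # expected: 0
```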

Appendix B. Proof of Theorem 4.5. Consider an arbitrary curve t ↦ p(·, Θ_t^(1), …, Θ_t^(k)) evolving in EM(c^(1), …, c^(k)). Define a diffusion

dZ_t = u(Z_t, t)dt + σ(Z_t)dB_t,   Z_0 = x_0,

with the given diffusion coefficient a(·) = σ(·)σ(·)⊤. Clearly the density of Z_t coincides with p(·, Θ_t^(1), …, Θ_t^(k)) if and only if p(·, Θ_t^(1), …, Θ_t^(k)) satisfies the Kolmogorov forward equation for Z_t, i.e.,

(B.1)   ∂p(x, Θ_t^(1), …, Θ_t^(k))/∂t = − Σ_{i=1}^n ∂/∂x_i ( u_i(x, t) p(x, Θ_t^(1), …, Θ_t^(k)) ) + ½ Σ_{i=1}^n Σ_{j=1}^n ∂²/∂x_i∂x_j ( a_{ij}(x) p(x, Θ_t^(1), …, Θ_t^(k)) ).

We will show in two steps that (B.1) holds for the proposed drift term. Consider the decomposition u_i(x, t) = g_i(x, t) + γ_i(x, t) for all i = 1, …, n, where

(B.2)   g_i(x, t) := ½ Σ_{j=1}^n ∂a_{ij}(x)/∂x_j + ½ Σ_{j=1}^n a_{ij}(x) [ ∂p(x, Θ_t^(1), …, Θ_t^(k))/∂x_j ] / p(x, Θ_t^(1), …, Θ_t^(k))

and

(B.3)   γ_i(x, t) := − [1 / p(x, Θ_t^(1), …, Θ_t^(k))] Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) ⟨Θ̇_t^(ℓ), I_i^(ℓ)(x)⟩.

We use the shorthand notation p(x, Θ_t^(1:k)) := p(x, Θ_t^(1), …, Θ_t^(k)).

Claim B.1. The functions g_i defined in (B.2), i = 1, …, n, satisfy

Σ_{i=1}^n ∂/∂x_i ( g_i(x, t) p(x, Θ_t^(1:k)) ) = ½ Σ_{i=1}^n Σ_{j=1}^n ∂²/∂x_i∂x_j ( a_{ij}(x) p(x, Θ_t^(1:k)) ).

Proof. Write p := p(x, Θ_t^(1:k)). Claim B.1 follows from a straightforward computation using the product rule:

Σ_i ∂/∂x_i ( g_i(x, t) p ) = Σ_i ( [∂g_i(x, t)/∂x_i] p + g_i(x, t) [∂p/∂x_i] )
  = ½ Σ_i Σ_j ( [∂²a_{ij}(x)/∂x_i∂x_j] p + [∂a_{ij}(x)/∂x_i][∂p/∂x_j] + a_{ij}(x)[∂²p/∂x_i∂x_j] − a_{ij}(x)[∂p/∂x_i][∂p/∂x_j]/p
               + [∂a_{ij}(x)/∂x_j][∂p/∂x_i] + a_{ij}(x)[∂p/∂x_i][∂p/∂x_j]/p )
  = ½ Σ_i Σ_j ( [∂²a_{ij}(x)/∂x_i∂x_j] p + [∂a_{ij}(x)/∂x_i][∂p/∂x_j] + [∂a_{ij}(x)/∂x_j][∂p/∂x_i] + a_{ij}(x)[∂²p/∂x_i∂x_j] )
  = ½ Σ_i Σ_j ∂²/∂x_i∂x_j ( a_{ij}(x) p ).

Claim B.2. The functions γ_i defined in (B.3), i = 1, …, n, satisfy

∂p(x, Θ_t^(1:k))/∂t = − Σ_{i=1}^n ∂/∂x_i ( γ_i(x, t) p(x, Θ_t^(1:k)) ).

Proof. On the one hand,

∂p(x, Θ_t^(1:k))/∂t = Σ_{ℓ=1}^k ν_ℓ ∂p_ℓ(x, Θ_t^(ℓ))/∂t = Σ_{ℓ=1}^k ν_ℓ ⟨Θ̇_t^(ℓ), c^(ℓ)(x) − ∇_θψ_ℓ(Θ_t^(ℓ))⟩ p_ℓ(x, Θ_t^(ℓ)).

On the other hand, by (4.2),

p_ℓ(x, Θ_t^(ℓ)) I_i^(ℓ)(x) = ∫_{−∞}^{x_i} φ_i^(ℓ)((x_{−i}, ξ_i), Θ_t^(ℓ)) exp[⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i)⟩ − ψ_ℓ(Θ_t^(ℓ))] dξ_i,

so that ∂/∂x_i ( p_ℓ(x, Θ_t^(ℓ)) I_i^(ℓ)(x) ) = φ_i^(ℓ)(x, Θ_t^(ℓ)) p_ℓ(x, Θ_t^(ℓ)). Hence

∂/∂x_i ( γ_i(x, t) p(x, Θ_t^(1:k)) ) = − Σ_{ℓ=1}^k ν_ℓ ⟨Θ̇_t^(ℓ), φ_i^(ℓ)(x, Θ_t^(ℓ))⟩ p_ℓ(x, Θ_t^(ℓ)),

and therefore

− Σ_{i=1}^n ∂/∂x_i ( γ_i(x, t) p(x, Θ_t^(1:k)) ) = Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) Σ_{i=1}^n ⟨Θ̇_t^(ℓ), φ_i^(ℓ)(x, Θ_t^(ℓ))⟩
  = Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) ⟨Θ̇_t^(ℓ), c^(ℓ)(x) − ∇_θψ_ℓ(Θ_t^(ℓ))⟩ = ∂p(x, Θ_t^(1:k))/∂t,

where we used (4.3). The two claims imply (B.1) and hence complete the proof.

Appendix C. Proof of Proposition 4.7. The proof basically requires Theorem 4.5 and two additional lemmas. We first propose in Lemma C.1 a choice of functions φ_i^(ℓ) that satisfy (4.3) in Theorem 4.5. Then we show in a second step, in Lemma C.2, that for this choice the integral terms (4.2) admit a closed-form expression. We start with a few preparatory results that are needed to prove Proposition 4.7. Note that ∇_Θψ(Θ) = (∇_ηψ(Θ), ∇_θψ(Θ)) ∈ H, and recall that according to [6, p. 631], for A ∈ R^{n×m}, B ∈ R^{m×n} and X ∈ GL(n, R),

d/dX tr(AX⁻¹B) = −X⁻¹BAX⁻¹,   d/dX log det(AX⁻¹B) = −X⁻¹B(AX⁻¹B)⁻¹AX⁻¹,

and therefore

(C.1)   ∇_ηψ(Θ) = −½ θ⁻¹η,
(C.2)   ∇_θψ(Θ) = ¼ θ⁻¹ηη⊤θ⁻¹ − ½ θ⁻¹ = ½ θ⁻¹( ½ ηη⊤θ⁻¹ − I_n ).

Lemma C.1. Consider the functions φ_i^(ℓ) : R^n × H → H, where φ_i^(ℓ) = (φ_{1,i}^(ℓ), φ_{2,i}^(ℓ)) with φ_{1,i}^(ℓ) : R^n × H → R^n and φ_{2,i}^(ℓ) : R^n × H → R^{n×n} for ℓ = 1, …, k. Let φ_{1,i}^(ℓ) and φ_{2,i}^(ℓ) be defined as

φ_{1,i}^(ℓ)((x_{−i}, ξ_i), Θ_t^(ℓ)) := (θ_t^(ℓ))⁻¹ E_i θ_t^(ℓ) ( c_1^(ℓ)(x_{−i}, ξ_i) − ∇_ηψ_ℓ(Θ_t^(ℓ)) ),
φ_{2,i}^(ℓ)((x_{−i}, ξ_i), Θ_t^(ℓ)) := (θ_t^(ℓ))⁻¹ E_i ( θ_t^(ℓ)( c_2^(ℓ)(x_{−i}, ξ_i) − ∇_θψ_ℓ(Θ_t^(ℓ)) )θ_t^(ℓ) − ½ θ_t^(ℓ)(x_{−i}, ξ_i)η_t^(ℓ)⊤ + ½ η_t^(ℓ)(x_{−i}, ξ_i)⊤θ_t^(ℓ) ) (θ_t^(ℓ))⁻¹.

Then, for all ℓ = 1, …, k,

Σ_{i=1}^n ⟨Θ̇_t^(ℓ), φ_i^(ℓ)((x_{−i}, ξ_i), Θ_t^(ℓ))⟩ |_{ξ_i = x_i} = ⟨Θ̇_t^(ℓ), c^(ℓ)(x) − ∇_θψ_ℓ(Θ_t^(ℓ))⟩.

Proof of Lemma C.1. According to (4.4) we have

Σ_{i=1}^n ⟨Θ̇_t^(ℓ), φ_i^(ℓ)(x, Θ_t^(ℓ))⟩ = Σ_{i=1}^n ⟨η̇_t^(ℓ), φ_{1,i}^(ℓ)(x, Θ_t^(ℓ))⟩ + Σ_{i=1}^n ⟨θ̇_t^(ℓ), φ_{2,i}^(ℓ)(x, Θ_t^(ℓ))⟩,

consisting of the two components. Since Σ_{i=1}^n E_i = I_n, the first component equals

Σ_i ⟨η̇_t^(ℓ), (θ_t^(ℓ))⁻¹E_iθ_t^(ℓ)( c_1^(ℓ)(x) − ∇_ηψ_ℓ(Θ_t^(ℓ)) )⟩ = ⟨η̇_t^(ℓ), c_1^(ℓ)(x) − ∇_ηψ_ℓ(Θ_t^(ℓ))⟩,

and the second component equals

Σ_i ⟨θ̇_t^(ℓ), (θ_t^(ℓ))⁻¹E_i( θ_t^(ℓ)( c_2^(ℓ)(x) − ∇_θψ_ℓ(Θ_t^(ℓ)) ) − ½ θ_t^(ℓ)xη_t^(ℓ)⊤(θ_t^(ℓ))⁻¹ + ½ η_t^(ℓ)x⊤ )⟩
  = ⟨θ̇_t^(ℓ), c_2^(ℓ)(x) − ∇_θψ_ℓ(Θ_t^(ℓ))⟩ − ½ ⟨θ̇_t^(ℓ), xη_t^(ℓ)⊤(θ_t^(ℓ))⁻¹⟩ + ½ ⟨θ̇_t^(ℓ), (θ_t^(ℓ))⁻¹η_t^(ℓ)x⊤⟩
  = ⟨θ̇_t^(ℓ), c_2^(ℓ)(x) − ∇_θψ_ℓ(Θ_t^(ℓ))⟩,

where we used in the last step that

⟨θ̇_t^(ℓ), xη_t^(ℓ)⊤(θ_t^(ℓ))⁻¹⟩ = tr( θ̇_t^(ℓ)( xη_t^(ℓ)⊤(θ_t^(ℓ))⁻¹ )⊤ ) = tr( θ̇_t^(ℓ) xη_t^(ℓ)⊤(θ_t^(ℓ))⁻¹ ) = tr( θ̇_t^(ℓ)( (θ_t^(ℓ))⁻¹η_t^(ℓ)x⊤ )⊤ ) = ⟨θ̇_t^(ℓ), (θ_t^(ℓ))⁻¹η_t^(ℓ)x⊤⟩,

which follows from the observation that for A ∈ Sym(n, R) and B ∈ R^{n×n}, tr(AB⊤) = tr(AB).

Lemma C.2. For i = 1, …, n, j = 1, 2 and ℓ = 1, …, k consider

I_{j,i}^(ℓ)(s_i, x) := ∫_{−∞}^{s_i} φ_{j,i}^(ℓ)((x_{−i}, ξ_i), Θ_t^(ℓ)) exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i) − c^(ℓ)(x)⟩ dξ_i,

where the functions φ_{j,i}^(ℓ) are chosen according to Lemma C.1. Then

I_{1,i}^(ℓ)(s_i, x) = ½ (θ_t^(ℓ))⁻¹ e_i exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, s_i) − c^(ℓ)(x)⟩,
I_{2,i}^(ℓ)(x_i, x) = ¼ (θ_t^(ℓ))⁻¹ e_i ( 2θ_t^(ℓ)x − η_t^(ℓ) )⊤ (θ_t^(ℓ))⁻¹.

Note that I_i^(ℓ)(x) = (I_{1,i}^(ℓ)(x_i, x), I_{2,i}^(ℓ)(x_i, x)), where I_i^(ℓ)(x) is the function defined in (4.2) and e_i denotes the canonical basis vectors of R^n.

Proof of Lemma C.2. Inserting the definition of φ_{1,i}^(ℓ) and using (C.1) gives

I_{1,i}^(ℓ)(s_i, x) = ½ (θ_t^(ℓ))⁻¹ ∫_{−∞}^{s_i} E_i ( 2θ_t^(ℓ)(x_{−i}, ξ_i) + η_t^(ℓ) ) exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i) − c^(ℓ)(x)⟩ dξ_i.

The substitution z := ⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i) − c^(ℓ)(x)⟩, whose derivative with respect to ξ_i equals e_i⊤( 2θ_t^(ℓ)(x_{−i}, ξ_i) + η_t^(ℓ) ), then leads to

I_{1,i}^(ℓ)(s_i, x) = ½ (θ_t^(ℓ))⁻¹ e_i ∫_{−∞}^{⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, s_i) − c^(ℓ)(x)⟩} exp(z) dz = ½ (θ_t^(ℓ))⁻¹ e_i exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, s_i) − c^(ℓ)(x)⟩,

where we used that θ_t^(ℓ) ≺ 0, since θ_t^(ℓ) = −½ (S_t^(ℓ))⁻¹ and the inverse of a negative definite matrix is negative definite. For the second term, inserting the definition of φ_{2,i}^(ℓ) and using (C.2) yields

I_{2,i}^(ℓ)(x_i, x) = ¼ (θ_t^(ℓ))⁻¹ ∫_{−∞}^{x_i} E_i ( ( 2θ_t^(ℓ)(x_{−i}, ξ_i) + η_t^(ℓ) )( 2θ_t^(ℓ)(x_{−i}, ξ_i) − η_t^(ℓ) )⊤ + 2θ_t^(ℓ) ) (θ_t^(ℓ))⁻¹ exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i) − c^(ℓ)(x)⟩ dξ_i.

By expanding terms and using integration by parts together with the first assertion of this lemma,

I_{2,i}^(ℓ)(x_i, x) = ½ I_{1,i}^(ℓ)(x_i, x)( 2θ_t^(ℓ)x − η_t^(ℓ) )⊤(θ_t^(ℓ))⁻¹ − ½ ∫_{−∞}^{x_i} I_{1,i}^(ℓ)(ξ_i, x)( 2θ_t^(ℓ)e_i )⊤(θ_t^(ℓ))⁻¹ dξ_i + ½ ∫_{−∞}^{x_i} (θ_t^(ℓ))⁻¹E_i exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i) − c^(ℓ)(x)⟩ dξ_i.

By the first assertion, ½ I_{1,i}^(ℓ)(ξ_i, x)( 2θ_t^(ℓ)e_i )⊤(θ_t^(ℓ))⁻¹ = ½ (θ_t^(ℓ))⁻¹E_i exp⟨Θ_t^(ℓ), c^(ℓ)(x_{−i}, ξ_i) − c^(ℓ)(x)⟩, so the last two integrals cancel and

I_{2,i}^(ℓ)(x_i, x) = ½ I_{1,i}^(ℓ)(x_i, x)( 2θ_t^(ℓ)x − η_t^(ℓ) )⊤(θ_t^(ℓ))⁻¹ = ¼ (θ_t^(ℓ))⁻¹ e_i ( 2θ_t^(ℓ)x − η_t^(ℓ) )⊤(θ_t^(ℓ))⁻¹.

Proof of Proposition 4.7. We decompose the function u_i(x, t), given by Theorem 4.5, into u_i(x, t) = g_i(x, t) + γ_i(x, t) for all i = 1, …, n, with g_i and γ_i defined as in (B.2) and (B.3). As a preliminary step, by invoking Lemma C.2,

⟨Θ̇_t^(ℓ), I_i^(ℓ)(x)⟩ = ⟨η̇_t^(ℓ), I_{1,i}^(ℓ)(x_i, x)⟩ + ⟨θ̇_t^(ℓ), I_{2,i}^(ℓ)(x_i, x)⟩
  = ⟨η̇_t^(ℓ), ½ (θ_t^(ℓ))⁻¹e_i⟩ + ⟨θ̇_t^(ℓ), ¼ (θ_t^(ℓ))⁻¹e_i( 2θ_t^(ℓ)x − η_t^(ℓ) )⊤(θ_t^(ℓ))⁻¹⟩
  = ½ η̇_t^(ℓ)⊤(θ_t^(ℓ))⁻¹e_i + ½ e_i⊤(θ_t^(ℓ))⁻¹θ̇_t^(ℓ)x − ¼ e_i⊤(θ_t^(ℓ))⁻¹θ̇_t^(ℓ)(θ_t^(ℓ))⁻¹η_t^(ℓ).

Therefore,

γ_i(x, t) = − [1 / p(x, Θ_t^(1:k))] Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) ( ½ η̇_t^(ℓ)⊤(θ_t^(ℓ))⁻¹e_i + ½ e_i⊤(θ_t^(ℓ))⁻¹θ̇_t^(ℓ)x − ¼ e_i⊤(θ_t^(ℓ))⁻¹θ̇_t^(ℓ)(θ_t^(ℓ))⁻¹η_t^(ℓ) ),

and, in vector form,

γ(x, t) = − [1 / p(x, Θ_t^(1:k))] Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) ( ½ (θ_t^(ℓ))⁻¹η̇_t^(ℓ) + ½ (θ_t^(ℓ))⁻¹θ̇_t^(ℓ)x − ¼ (θ_t^(ℓ))⁻¹θ̇_t^(ℓ)(θ_t^(ℓ))⁻¹η_t^(ℓ) ).

Furthermore,

∂p(x, Θ_t^(1:k))/∂x_j = Σ_{ℓ=1}^k ν_ℓ ⟨Θ_t^(ℓ), ∂c^(ℓ)(x)/∂x_j⟩ exp[⟨Θ_t^(ℓ), c^(ℓ)(x)⟩ − ψ_ℓ(Θ_t^(ℓ))]
  = Σ_{ℓ=1}^k ν_ℓ ⟨Θ_t^(ℓ), ( e_j, e_jx⊤ + xe_j⊤ )⟩ exp[⟨Θ_t^(ℓ), c^(ℓ)(x)⟩ − ψ_ℓ(Θ_t^(ℓ))]
  = Σ_{ℓ=1}^k ν_ℓ ( η_t^(ℓ)⊤e_j + 2e_j⊤θ_t^(ℓ)x ) p_ℓ(x, Θ_t^(ℓ)),

and therefore

g_i(x, t) = ½ Σ_{j=1}^n ∂a_{ij}(x)/∂x_j + ½ Σ_{j=1}^n a_{ij}(x) [ Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ))( η_t^(ℓ)⊤e_j + 2e_j⊤θ_t^(ℓ)x ) ] / p(x, Θ_t^(1:k))
  = ½ div a_{i·}(x) + ½ [ Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) ( a(x)( η_t^(ℓ) + 2θ_t^(ℓ)x ) )_i ] / p(x, Θ_t^(1:k)),

leading to

g(x, t) = ½ div a(x) + ½ [ Σ_{ℓ=1}^k ν_ℓ p_ℓ(x, Θ_t^(ℓ)) a(x)( η_t^(ℓ) + 2θ_t^(ℓ)x ) ] / p(x, Θ_t^(1:k)).

Note that our choice of the φ_i^(ℓ) satisfies (4.3), as shown in Lemma C.1, which completes the proof.

Appendix D. Proof of Theorem 4.9.

Lemma D.1. For an SDE of the form (4.1), the mean m_t and covariance matrix S_t of X_t satisfy

(D.1)   dm_t = E[u(X_t, t)] dt,
        dS_t = ( E[X_tu(X_t, t)⊤] + E[u(X_t, t)X_t⊤] + E[σ(X_t)σ(X_t)⊤] − m_tE[u(X_t, t)]⊤ − E[u(X_t, t)]m_t⊤ ) dt.

Proof. The equation for the mean is trivial. For the variance let Y_t := X_tX_t⊤. According to Itô's lemma [23],

dY_t = X_t( u(X_t, t)dt + σ(X_t)dB_t )⊤ + ( u(X_t, t)dt + σ(X_t)dB_t )X_t⊤ + σ(X_t)σ(X_t)⊤ dt,

and similarly

d( m_tm_t⊤ ) = ( m_tE[u(X_t, t)]⊤ + E[u(X_t, t)]m_t⊤ ) dt.

Hence,

dS_t = E[dY_t] − d( m_tm_t⊤ ) = ( E[X_tu(X_t, t)⊤] + E[u(X_t, t)X_t⊤] + E[σ(X_t)σ(X_t)⊤] − m_tE[u(X_t, t)]⊤ − E[u(X_t, t)]m_t⊤ ) dt.

Lemma D.2. Mean m_t and variance S_t satisfy

m_t = Σ_{ℓ=1}^k ν_ℓ m_t^(ℓ)   and   S_t = Σ_{ℓ=1}^k ν_ℓ S_t^(ℓ) + Σ_{ℓ=1}^k ν_ℓ m_t^(ℓ)m_t^(ℓ)⊤ − ( Σ_{ℓ=1}^k ν_ℓ m_t^(ℓ) )( Σ_{ℓ=1}^k ν_ℓ m_t^(ℓ) )⊤.

Proof. The statement for the mean is straightforward. For the variance,

S_t = E_p[XX⊤] − m_tm_t⊤ = Σ_{ℓ=1}^k ν_ℓ E_{p_ℓ}[XX⊤] − ( Σ_ℓ ν_ℓ m_t^(ℓ) )( Σ_ℓ ν_ℓ m_t^(ℓ) )⊤
  = Σ_{ℓ=1}^k ν_ℓ ( E_{p_ℓ}[XX⊤] − m_t^(ℓ)m_t^(ℓ)⊤ ) + Σ_{ℓ=1}^k ν_ℓ m_t^(ℓ)m_t^(ℓ)⊤ − ( Σ_ℓ ν_ℓ m_t^(ℓ) )( Σ_ℓ ν_ℓ m_t^(ℓ) )⊤
  = Σ_{ℓ=1}^k ν_ℓ S_t^(ℓ) + Σ_{ℓ=1}^k ν_ℓ m_t^(ℓ)m_t^(ℓ)⊤ − ( Σ_ℓ ν_ℓ m_t^(ℓ) )( Σ_ℓ ν_ℓ m_t^(ℓ) )⊤.

Proof of Theorem 4.9. Consider a drift function u(x, t) given by (4.5). In view of Lemma D.1,

dm_t/dt = ½ E_p[div a(X)] + E_p[ Σ_ℓ ν_ℓ p_ℓ(X, Θ_t^(ℓ))A_t^(ℓ) / p(X, Θ_t^(1:k)) ] + E_p[ Σ_ℓ ν_ℓ p_ℓ(X, Θ_t^(ℓ))B_t^(ℓ)X / p(X, Θ_t^(1:k)) ]
  + E_p[ Σ_ℓ ν_ℓ p_ℓ(X, Θ_t^(ℓ))a(X)C_t^(ℓ) / p(X, Θ_t^(1:k)) ] + E_p[ Σ_ℓ ν_ℓ p_ℓ(X, Θ_t^(ℓ))a(X)D_t^(ℓ)X / p(X, Θ_t^(1:k)) ]
  = Σ_{ℓ=1}^k ν_ℓ ( ½ E_{p_ℓ}[div a(X)] + A_t^(ℓ) + B_t^(ℓ)m_t^(ℓ) + E_{p_ℓ}[a(X)]C_t^(ℓ) + E_{p_ℓ}[a(X)D_t^(ℓ)X] ),

which can be rewritten according to Lemma D.2 as

(D.2)   Σ_{ℓ=1}^k ν_ℓ dm_t^(ℓ)/dt = Σ_{ℓ=1}^k ν_ℓ ( ½ E_{p_ℓ}[div a(X)] + A_t^(ℓ) + B_t^(ℓ)m_t^(ℓ) + E_{p_ℓ}[a(X)]C_t^(ℓ) + E_{p_ℓ}[a(X)D_t^(ℓ)X] ).

Note that (D.2) has to hold for all ν_ℓ ≥ 0 such that Σ_{ℓ=1}^k ν_ℓ = 1. Therefore,

(D.3)   dm_t^(ℓ)/dt = ½ E_{p_ℓ}[div a(X)] + A_t^(ℓ) + B_t^(ℓ)m_t^(ℓ) + E_{p_ℓ}[a(X)]C_t^(ℓ) + E_{p_ℓ}[a(X)D_t^(ℓ)X].

For the variance, we have according to Lemma D.2

dS_t/dt = Σ_{ℓ=1}^k ν_ℓ dS_t^(ℓ)/dt + Σ_{ℓ=1}^k ν_ℓ ( (dm_t^(ℓ)/dt)m_t^(ℓ)⊤ + m_t^(ℓ)(dm_t^(ℓ)/dt)⊤ ) − ( (dm_t/dt)m_t⊤ + m_t(dm_t/dt)⊤ ).

This implies

Σ_{ℓ=1}^k ν_ℓ dS_t^(ℓ)/dt = dS_t/dt + Σ_{ℓ=1}^k ν_ℓ ( (dm_t^(ℓ)/dt)( m_t − m_t^(ℓ) )⊤ + ( m_t − m_t^(ℓ) )(dm_t^(ℓ)/dt)⊤ ),

where dS_t/dt is given, according to Lemma D.1, by

dS_t/dt = E[X_tu(X_t, t)⊤] + E[u(X_t, t)X_t⊤] + E[σ(X_t)σ(X_t)⊤] − (dm_t/dt)m_t⊤ − m_t(dm_t/dt)⊤.

Therefore,

(D.4)   Σ_{ℓ=1}^k ν_ℓ dS_t^(ℓ)/dt = E[X_tu(X_t, t)⊤] + E[u(X_t, t)X_t⊤] + E[σ(X_t)σ(X_t)⊤] − Σ_{ℓ=1}^k ν_ℓ ( (dm_t^(ℓ)/dt)m_t^(ℓ)⊤ + m_t^(ℓ)(dm_t^(ℓ)/dt)⊤ ).

Recall that dm_t^(ℓ)/dt is given by (D.3). Next, we compute

E[Xu(X, t)⊤] = Σ_{ℓ=1}^k ν_ℓ ( ½ E_{p_ℓ}[X div a(X)⊤] + m_t^(ℓ)A_t^(ℓ)⊤ + ( m_t^(ℓ)m_t^(ℓ)⊤ + S_t^(ℓ) )B_t^(ℓ)⊤ + E_{p_ℓ}[XC_t^(ℓ)⊤a(X)] + E_{p_ℓ}[XX⊤D_t^(ℓ)⊤a(X)] ),

E[u(X, t)X⊤] = Σ_{ℓ=1}^k ν_ℓ ( ½ E_{p_ℓ}[div a(X) X⊤] + A_t^(ℓ)m_t^(ℓ)⊤ + B_t^(ℓ)( m_t^(ℓ)m_t^(ℓ)⊤ + S_t^(ℓ) ) + E_{p_ℓ}[a(X)C_t^(ℓ)X⊤] + E_{p_ℓ}[a(X)D_t^(ℓ)XX⊤] ),

and

E_p[σ(X)σ(X)⊤] = E_p[a(X)] = Σ_{ℓ=1}^k ν_ℓ E_{p_ℓ}[a(X)],

such that by evaluating (D.4) and by recalling that it has to hold for all convex combinations, we get

dS_t^(ℓ)/dt = ½ E_{p_ℓ}[X div a(X)⊤] + ½ E_{p_ℓ}[div a(X) X⊤] − ½ m_t^(ℓ)E_{p_ℓ}[div a(X)]⊤ − ½ E_{p_ℓ}[div a(X)]m_t^(ℓ)⊤ + E_{p_ℓ}[a(X)]
  + S_t^(ℓ)B_t^(ℓ)⊤ + B_t^(ℓ)S_t^(ℓ) + E_{p_ℓ}[XC_t^(ℓ)⊤a(X)] + E_{p_ℓ}[a(X)C_t^(ℓ)X⊤] − m_t^(ℓ)C_t^(ℓ)⊤E_{p_ℓ}[a(X)] − E_{p_ℓ}[a(X)]C_t^(ℓ)m_t^(ℓ)⊤
  + E_{p_ℓ}[XX⊤D_t^(ℓ)⊤a(X)] + E_{p_ℓ}[a(X)D_t^(ℓ)XX⊤] − m_t^(ℓ)E_{p_ℓ}[X⊤D_t^(ℓ)⊤a(X)] − E_{p_ℓ}[a(X)D_t^(ℓ)X]m_t^(ℓ)⊤.

Acknowledgments. The authors are grateful to Debasish Chatterjee, John Lygeros, and Peyman Mohajerin Esfahani for helpful discussions and pointers to references.

REFERENCES

[1] C. Andrieu, A. Doucet, and R. Holenstein, Particle Markov chain Monte Carlo methods, Journal of the Royal Statistical Society. Series B. Statistical Methodology, 72 (2010), pp. 269–342.
[2] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor, Gaussian process approximations of stochastic differential equations, in Gaussian Processes in Practice, vol. 1 of JMLR Proceedings, 2007, pp. 1–16.
[3] C. Archambeau and M. Opper, Approximate inference for continuous-time Markov processes, in Bayesian Time Series Models, Cambridge University Press, 2011, pp. 125–140.
[4] C. Archambeau, M. Opper, Y. Shen, D. Cornford, and J. Shawe-Taylor, Variational inference for diffusion processes, in Advances in Neural Information Processing Systems 20, MIT Press, Cambridge, MA, 2008, pp. 17–24.
[5] A. Bain and D. Crisan, Fundamentals of Stochastic Filtering, Stochastic Modelling and Applied Probability, Springer, New York, 2009.
[6] D. S. Bernstein, Matrix Mathematics, Princeton University Press, 2nd ed., 2009.
[7] D. Brigo, On SDEs with marginal laws evolving in finite-dimensional exponential families, Statistics and Probability Letters, 49 (2000), pp. 127–134.
[8] O. Cappé, E. Moulines, and T. Rydén, Inference in Hidden Markov Models, Springer Series in Statistics, Springer, New York, 2005. With Randal Douc's contributions to Chapter 9 and Christian P. Robert's to Chapters 6, 7 and 13, with Chapter 14 by Gersende Fort, Philippe Soulier and Moulines, and Chapter 15 by Stéphane Boucheron and Elisabeth Gassiat.
[9] F. Clarke, Functional Analysis, Calculus of Variations and Optimal Control, vol. 264 of Graduate Texts in Mathematics, Springer, London, 2013.
[10] J. C. Cox, J. E. Ingersoll, Jr., and S. A. Ross, A theory of the term structure of interest rates, Econometrica, 53 (1985), pp. 385–407.
[11] D. Crisan and T. Lyons, A particle approximation of the solution of the Kushner–Stratonovitch equation, Probab. Theory Related Fields, 115 (1999), pp. 549–578.
[12] I. Csiszár, I-divergence geometry of probability distributions and minimization problems, Ann. Probability, 3 (1975), pp. 146–158.
[13] P. Del Moral, J. Jacod, and P. Protter, The Monte-Carlo method for filtering with discrete-time observations, Probab. Theory Related Fields, 120 (2001), pp. 346–368.
[14] A. Dmitruk and A. Kaganovich, The hybrid maximum principle is a consequence of Pontryagin maximum principle, Systems and Control Letters, 57 (2008), pp. 964–970.
[15] G. L. Eyink, A variational formulation of optimal nonlinear estimation, ArXiv Physics e-prints, (2000).
[16] N. L. Garcia and J. L. Palacios, On inverse moments of nonnegative random variables, Statist. Probab. Lett., 53 (2001), pp. 235–239.
[17] O. Kallenberg, Foundations of Modern Probability, Probability and its Applications (New York), Springer-Verlag, New York, 2nd ed., 2002.
[18] H. J. Kushner, Dynamical equations for optimal nonlinear filtering, J. Differential Equations, 3 (1967), pp. 179–190.
[19] H. J. Kushner and P. Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time, vol. 24 of Applications of Mathematics (New York), Springer-Verlag, New York, 2nd ed., 2001. Stochastic Modelling and Applied Probability.
[20] J. B. Lasserre, Global optimization with polynomials and the problem of moments, SIAM Journal on Optimization, 11 (2001), pp. 796–817.
[21] J. B. Lasserre, Moments, Positive Polynomials and Their Applications, vol. 1 of Imperial College Press Optimization Series, Imperial College Press, London, 2010.
[22] S. K. Mitter and N. J. Newton, A variational approach to nonlinear estimation, SIAM J. Control Optim., 42 (2003), pp. 1813–1833 (electronic).
[23] B. Øksendal, Stochastic Differential Equations, Universitext, Springer, Berlin; Heidelberg, 6th ed., 2003.
[24] G. Pagès and H. Pham, Optimal quantization methods for nonlinear filtering with discrete-time observations, Bernoulli, 11 (2005), pp. 893–932.
[25] É. Pardoux, Équations du filtrage non linéaire, de la prédiction et du lissage, Stochastics, 6 (1981/82), pp. 193–231.
[26] A. N. Shiryaev, Essentials of Stochastic Finance, vol. 3 of Advanced Series on Statistical Science & Applied Probability, World Scientific Publishing Co., Inc., River Edge, NJ, 1999. Facts, models, theory; translated from the Russian manuscript by N. Kruzhilin.
[27] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, vol. 12 of Texts in Applied Mathematics, Springer-Verlag, New York, 3rd ed., 2002. Translated from the German by R. Bartels, W. Gautschi and C. Witzgall.
[28] R. L. Stratonovich, Conditional Markov processes, Theory of Probability & Its Applications, 5 (1960), pp. 156–178.
[29] R. van Handel, Filtering, Stability, and Robustness, PhD thesis, California Institute of Technology, 2007.
[30] D. J. Wilkinson, Stochastic Modelling for Systems Biology, Chapman & Hall/CRC Mathematical and Computational Biology Series, Chapman & Hall/CRC, Boca Raton, FL, 2006.
[31] M. Zakai, On the optimal filtering of diffusion processes, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 11 (1969), pp. 230–243.