Lecture Notes on Diffusion and Stochastic Differential Equations

Uffe Høgsbro Thygesen August 19, 2016

Preface

These notes are written for the course 02425 Diffusion and Stochastic Differential Equations, which is being offered at the Technical University of Denmark by DTU Compute and DTU Aqua. The course is aimed at students at an advanced stage in the M.Sc.&Eng. programme or in the Ph.D. programme. It is assumed that students are familiar with a broad range of subjects in applied mathematics and probability, notably ordinary differential equations, elementary notions of partial differential equations, probability at an elementary level (not measure-theoretic), and stochastic processes, including Markov chains. These notes are in continuous development. I will be thankful for any comments, corrections, and suggestions for improvements.

Uffe Høgsbro Thygesen Charlottenlund, August 2016


Contents

1 Introduction

2 Diffusive transport and random walks
  2.1 Diffusive transport
    2.1.1 The conservation equation
    2.1.2 Fick’s laws
    2.1.3 Diffusive spread of a point source
    2.1.4 Diffusive attenuation of waves *
  2.2 The motion of a single molecule
  2.3 Mathematical Brownian motion
  2.4 Advective and diffusive transport
  2.5 Monte Carlo simulation of particle motion
  2.6 Stochastic differential equations: A preview
  2.7 Relative importance of advection and diffusion *
  2.8 Diffusion in more than one dimension *
  2.9 Conclusion
  2.10 Additional exercises

3 Probability spaces
  3.1 Stochastic experiments
  3.2 Random variables
  3.3 Expectation is integration
  3.4 Information as σ-algebras
    3.4.1 Borel’s paradox
  3.5 Conditional expectations
    3.5.1 Borel’s paradox resolved
    3.5.2 Properties of the conditional expectation
    3.5.3 Conditional distributions and variances
  3.6 Linear spaces of random variables
  3.7 Conclusion
  3.8 Notes and references *
  3.9 Additional exercises

4 Stochastic processes
  4.1 Discrete-time Gaussian white noise
  4.2 Brownian motion
    4.2.1 Properties of Brownian motion
  4.3 Filtrations and accumulation of information
  4.4 Markov processes and stochastic state space models
  4.5 Martingales
  4.6 Summary
  4.7 Notes and references *
    4.7.1 Convergence of random variables
    4.7.2 The finite-dimensional distributions
    4.7.3 Wiener measure
  4.8 Exercises

5 Linear dynamic systems
  5.1 Linear systems with deterministic inputs
  5.2 Linear systems driven by noise
  5.3 First and second order statistics of the response
  5.4 Stationary processes
    5.4.1 Stable systems driven by stationary processes
  5.5 The white noise limit
  5.6 Integrated white noise is Brownian motion
  5.7 Linear systems driven by white noise
  5.8 Summary
  5.9 Notes and references
  5.10 Exercises

6 Stochastic integrals
  6.1 Models based on integral equations
  6.2 From white noise to stochastic integrals
  6.3 Integrals: Riemann, Lebesgue, and Stieltjes
  6.4 The problem: ∫_0^t B_s dB_s
  6.5 The Itō integral
  6.6 Examples of Itō integrals
  6.7 Relaxing the L2 constraint
  6.8 Itō processes and solutions to SDEs
    6.8.1 The Euler method
  6.9 Integration w.r.t. semimartingales *
  6.10 The Stratonovich integral

7 Stochastic calculus
  7.1 The chain rule of deterministic calculus
  7.2 Itō’s lemma: The stochastic chain rule
  7.3 Analytical solutions of SDE’s and stochastic integrals
  7.4 Dynamics of derived quantities
  7.5 Coordinate transformations
    7.5.1 The Lamperti transform
    7.5.2 The scale function
  7.6 The chain rule in Stratonovich calculus
  7.7 Time change

8 SDEs: Existence and uniqueness
  8.1 Uniqueness of solutions
    8.1.1 Non-uniqueness: The falling ball
    8.1.2 Lipschitz continuity implies uniqueness
  8.2 Existence of solutions
    8.2.1 Linear bounds rule out explosions
    8.2.2 The Picard iteration and the proof of existence
  8.3 Itō versus Stratonovich equations
  8.4 Additional exercises

9 The Kolmogorov equations
  9.1 Diffusions are Markov processes
  9.2 Transition probabilities
  9.3 The backward Kolmogorov equation
  9.4 The forward Kolmogorov equation
  9.5 An abstract formulation of the Kolmogorov equations *
  9.6 The stationary distribution
  9.7 Detailed balance, no-flux, and reversibility
  9.8 Chapter conclusion
  9.9 Notes and references
    9.9.1 The strong Markov property
    9.9.2 Do the transition probabilities admit densities?
    9.9.3 Random walks with heterogeneous diffusivity
    9.9.4 Numerical computation of transition probabilities

10 State estimation
  10.1 Observation models and the state likelihood
  10.2 The filtering principle: Time update and data update
  10.3 The smoothing filter
  10.4 Sampling typical tracks
  10.5 Likelihood inference
  10.6 The Kalman filter
  10.7 Fast sampling and continuous-time filtering
    10.7.1 The stationary filter
  10.8 Estimating states and parameters as a mixed-effect model
  10.9 Chapter conclusion

11 Expectations to the future
  11.1 The generator and expected rate of change
  11.2 The Dynkin formula
  11.3 Expected point of exit
    11.3.1 Does a scalar diffusion exit right or left?
  11.4 Analysis of a singular boundary point
  11.5 Recurrence of Brownian motion
  11.6 The expected time to exit
    11.6.1 Exit time in a sphere, and the diffusive time scale
  11.7 Expectations of general rewards

12 Numerical simulation of sample paths
  12.1 The Euler scheme
  12.2 The strong order: The Euler method for geometric Brownian motion
  12.3 Strong order analysis of the Euler scheme
  12.4 The weak order
  12.5 The Milstein scheme
  12.6 The stochastic Heun method
  12.7 Stability and implicit schemes
  12.8 Discussion

13 Dynamic optimization
  13.1 Markov decision problems
    13.1.1 The backward iteration: Dynamic programming
    13.1.2 Optimal foraging and Gilliam’s rule
  13.2 Controlled diffusions and performance objectives
  13.3 Verification and the Hamilton-Jacobi-Bellman equation
  13.4 Multivariate linear-quadratic control
  13.5 Performance evaluation and the proof of the verification theorem
  13.6 Steady-state control problems
  13.7 A case: Fisheries management
  13.8 Summary
  13.9 Notes and references
    13.9.1 Numerical analysis of the HJB equation

14 Solutions to selected exercises

Bibliography

CHAPTER

1

Introduction

Ars longa, vita brevis.
(Hippocrates, c. 400 BC)

A stochastic differential equation can, informally, be viewed as a differential equation in which a stochastic “noise” term appears, such as

\[ \frac{dX_t}{dt} = f(X_t) + \xi_t . \tag{1.1} \]

Here, Xt is the state, at time t, of the system we have in mind, and most often we want to find or characterize this X, i.e. solve the equation for Xt as a function of time. The solution is a stochastic process which is called a diffusion. f is a function which describes the dynamics of the system, and ξt is a stochastic process which affects the dynamics. First, let us consider two examples of phenomena which can be modeled with stochastic differential equations.

Motion of a particle embedded in a fluid flow

Figure 1.1 displays water flowing around a cylinder.1 In the absence of diffusion, water molecules will follow the streamlines. A small particle will follow the same streamlines, but it is also subject to diffusion, i.e. random collisions with neighboring molecules which cause it to deviate from the streamlines. Collisions are frequent, but each causes only a small displacement, so the resulting path is erratic. In the absence of diffusion, we can find the trajectory of the particle by solving the ordinary differential equation

1 The flow used here is irrotational, i.e. potential flow. Mathematically, this is convenient even if, physically, it may not be the most meaningful choice.


Figure 1.1. A small particle embedded in the flow around a cylinder. The thin lines are streamlines, i.e. the paths a water molecule follows, neglecting diffusion. The thick black line is a simulated random trajectory of a particle, which is transported with the flow, but at the same time subjected to molecular diffusion.

\[ \frac{dX_t}{dt} = u(X_t) . \]

Here, Xt ∈ R² is the position in the plane of the particle at time t. u(·) is the flow field, so that u(x) is a vector in the plane indicating the speed and direction of the water flow at position x. To obtain a unique solution, this equation needs an initial condition such as X0 = x0 where x0 is the known position at time 0. The trajectory {Xt : t ∈ R} is exactly a streamline. To take the molecular diffusion into account, i.e. the seemingly random motion of a particle due to collisions with fluid molecules, we add “white noise” to the equation

\[ \frac{dX_t}{dt} = u(X_t) + \xi_t . \tag{1.2} \]

Here, ξt is a two-vector, each element of which is white noise; a stochastic process with the property that the noise signals at two distinct points of time are independent. If this seems vague, there is a good reason for it: A major goal of this book is to develop the precise meaning of ξt and equation (1.2). The main hurdle is to reconcile the paradox that we have postulated a differential equation for the trajectory Xt, yet this trajectory is nowhere differentiable, as suggested by figure 1.1. However, it is easy to explain how the trajectory has been simulated: The process is discretized, i.e. divided into small time intervals. At each time step, the old

position is first advected along the streamline, then shifted randomly by adding a perturbation ∆X which is sampled from a bivariate Gaussian distribution with mean 0 and variance 2D∆t on each coordinate. This, we shall see, is the Euler method for solving a stochastic differential equation numerically and thus simulating the trajectories of the particle. The solution to a stochastic differential equation is termed a diffusion process. This term emphasizes the connection between stochastic differential equations and physical transport involving Fickian diffusion. So the movement of a particle in a fluid is a phenomenon which can be modeled with a stochastic differential equation. For such models, we shall consider questions such as: For a given initial position of the particle, what is the probability distribution of its position at some later time? What is the probability that it hits the cylinder? How close do we expect the particle to get to the cylinder?
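The simulation procedure described above (advect along the flow, then add a Gaussian perturbation with variance 2D∆t on each coordinate) can be sketched in a few lines of R. The sketch below is only illustrative: it uses the textbook potential-flow velocity field around a unit cylinder with unit free-stream speed as an assumed stand-in for the flow in figure 1.1, and the diffusivity, time step and starting point are likewise assumptions, not values used by the author.

# Minimal sketch: Euler-type simulation of a particle advected by potential flow
# around a unit cylinder and perturbed by molecular diffusion.
# Assumptions: U, a, D, dt, nsteps and the starting point are illustrative only;
# velocity from the complex potential w(z) = U*(z + a^2/z), so u - i*v = U*(1 - a^2/z^2).
set.seed(1)
U <- 1; a <- 1; D <- 0.05
dt <- 0.01; nsteps <- 1500
z <- complex(real = -5, imaginary = 0.5)        # assumed initial position X_0
path <- complex(length.out = nsteps + 1); path[1] <- z
for (i in 1:nsteps) {
  vel <- Conj(U * (1 - a^2 / z^2))              # flow velocity u + i*v at the current position
  dB  <- complex(real = rnorm(1, sd = sqrt(dt)),
                 imaginary = rnorm(1, sd = sqrt(dt)))   # Brownian increment, variance dt per coordinate
  z   <- z + vel * dt + sqrt(2 * D) * dB        # advect, then perturb by N(0, 2*D*dt) per coordinate
  path[i + 1] <- z
}
plot(Re(path), Im(path), type = "l", asp = 1, xlab = "x", ylab = "y")

Note that this sketch makes no attempt to handle the (rare) event that the simulated particle ends up inside the cylinder; it only illustrates the structure of the discretization.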

Price of a stock

Let Xt denote the price at time t in euros of the stock of a large company. It would be lucrative to be able to predict future prices, but unfortunately prices do not evolve predictably. Figure 1.2 shows the simulated price of a stock, as well as predictions made at time t = 5 based on the history up to that point. The model here is the simplest one conceivable: An investor expects the price to grow with a constant rate, such that

\[ E\{X_{t+1} \mid X_t\} = (1 + r) X_t . \]

Here, the left hand side means the conditional expectation of X_{t+1}, given that Xt is known. r is the return rate over one time step. We can take the time step to be one day. If the return rate r is greater than what can otherwise be obtained, then buying this asset could be considered a good investment. Unfortunately, stocks with high expected return rates tend to also have large risks of loss - otherwise there would be few sellers, or the price would jump up immediately. The simplest model of risk is obtained by specifying the conditional variance of X_{t+1}, for example as

\[ V\{X_{t+1} \mid X_t\} = \sigma^2 X_t^2 . \]

The reason we model the variance as proportional to Xt² is that prices are relative, so that if prices double, then also the uncertainty doubles and hence the variance quadruples. One model which displays these statistics is the stochastic recursion

\[ X_{t+1} = X_t (1 + r + \sigma W_t) , \]

where Wt has mean 0 and variance 1, and subsequent Wt’s are independent. A convenient shorthand for this is the stochastic difference equation


Figure 1.2. The simulated price of a stock. At time t = 5, the future is predicted based on the information available at that time. The solid line indicates the prediction while the grey zone shows (marginal) 67 % confidence regions.


\[ \Delta X_t = r X_t\,\Delta t + \sigma X_t W_t , \]

where ∆Xt = X_{t+1} − Xt and ∆t = 1. But the market never sleeps, so we would be interested in a continuous-time model which can also predict the price at intermediate times. Such a model is, informally, obtained by dividing the difference equation with the time step:

\[ \frac{dX_t}{dt} = r X_t + \sigma X_t \xi_t , \]

where ξt is a continuous-time noise process which takes the place of Wt/∆t. We will discuss this equation later, as well as its solution, which is called Geometric Brownian Motion. In particular, we need to consider the relationship between the continuous-time noise process ξt, the discrete-time noise process Wt, and the time step ∆t.

Much of the early theory of stochastic differential equations was developed for applications in physics, but more recently, applications to economics have been an important driver. For such a model, we would want to ask questions such as: What is the distribution of the price at a later time? How can we estimate the parameters (r, σ) based on historical observations? What is the probability that the price ever drops below a given threshold? If we buy and sell assets according to a given strategy, what is the distribution of the resulting profit?
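As a small illustration of the discrete-time model, the following R sketch simulates the recursion X_{t+1} = Xt (1 + r + σ Wt) with independent Gaussian Wt and overlays the prediction E{Xt | X0} = X0 (1 + r)^t. The values of r, σ, X0 and the horizon are illustrative assumptions, not the values behind figure 1.2.

# Minimal sketch (all numbers are assumed, illustrative values):
# simulate X_{t+1} = X_t * (1 + r + sigma * W_t) with W_t ~ N(0,1), independent.
set.seed(1)
r     <- 0.001     # expected return per time step (assumed)
sigma <- 0.02      # relative volatility per time step (assumed)
X0    <- 1         # initial price (assumed)
n     <- 250       # number of time steps
W <- rnorm(n)                              # independent noise terms with mean 0, variance 1
X <- c(X0, X0 * cumprod(1 + r + sigma * W))  # apply the recursion step by step
plot(0:n, X, type = "l", xlab = "Time", ylab = "Price")
lines(0:n, X0 * (1 + r)^(0:n), lty = 2)    # the prediction E{X_t | X_0} = X_0 (1+r)^t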

Overview of the theory

The most fundamental element of theory concerns how to assign a precise meaning to a stochastic differential equation. For this construction, we must first develop a stochastic calculus, most importantly a stochastic integral (Itō’s integral) and a stochastic version of the chain rule (Itō’s formula). We can then build the foundation of stochastic differential equations, and answer fundamental questions such as when a unique solution exists. Solutions to stochastic differential equations can rarely be found analytically; we will show a few important examples where solutions can be found, and other situations where we can derive properties of the solution analytically, even if we cannot determine the solution itself in detail. In most cases, however, if we want the solution we must rely on numerical simulation. We will discuss algorithms for this purpose.

Diffusion processes, i.e. solutions to stochastic differential equations, are Markov processes and can therefore be characterized in terms of their transition probabilities. We will examine these transition probabilities, which satisfy a Kolmogorov equation of advection-diffusion type.

One important motivation for considering stochastic differential equations is to match models with data in order to estimate parameters and states and to make predictions about the future. We will discuss techniques for estimation, in particular Hidden Markov Models and the Kalman filter.

Another important motivation is decision-making in dynamic systems with uncertainty. We will discuss stochastic dynamic optimization, and in particular the dynamic programming

principle, and demonstrate how it can be applied to a variety of situations in engineering, finance, and analysis. However, before we start developing this theory, it is useful to develop some intuition for diffusion processes. First, we consider in some detail the motion of particles suspended in fluid flow. Next, we revisit basic probability, including its measure-theoretic foundation. Then, we turn to stochastic processes, and in particular Brownian motion, which is the simplest example of a diffusion process, and at the same time pivotal in the general theory. There is no doubt that stochastic differential equations are becoming more widely applied in many fields of applied science and engineering, and that by itself justifies their study. From a modeler’s perspective, it is an attractive element in the toolbox, because all our understanding of processes and dynamics can be summarized in the drift term f in (1.1), while the noise term ξt manifests that our models are always incomplete descriptions of actual systems. The mathematical theory ties together several branches of mathematics - dynamics of ordinary and partial differential equations, measure and probability, statistics, and optimization. As you develop an intuition for stochastic differential equations, you will establish interesting links between subjects that may at first seem unrelated, such as physical transport processes and propagation of noise. I have found it immensely rewarding to study these equations and their solutions. My hope is that you will, too.


CHAPTER

2

Diffusive transport and random walks

Summary

Diffusion is a transport process that takes place in fluids like air and water, and even in solids. It is caused by the erratic and unpredictable motion of molecules due to collisions with other molecules. When viewed at a large scale compared to the individual molecule, diffusion moves material from regions with high concentration to regions with low concentration. See figure 2.1.

Mathematically, diffusion concerns the concentration C = C(x, t) and the flux J(x, t). The diffusive flux quantifies the transport and satisfies Fick’s first law: J = −D ∂C/∂x. Here, D is a material constant termed the diffusivity. Fick’s second law states that mass is conserved while being redistributed in space, so that the concentration field C evolves in time according to the diffusion equation

\[ \frac{\partial C}{\partial t} = -\frac{\partial J}{\partial x} = \frac{\partial}{\partial x}\!\left( D \frac{\partial C}{\partial x} \right) . \tag{2.1} \]

Diffusion often acts in concert with advection, where material is also transported with a fluid flow field u. The resulting advection-diffusion equation is

\[ \frac{\partial C}{\partial t} = -\frac{\partial}{\partial x}\!\left( uC - D \frac{\partial C}{\partial x} \right) , \]

where uC is the advective flux.

Diffusion is consistent with each single molecule moving according to a stochastic process: If we pick one molecule randomly, we can use the diffusion equation to compute the probability distribution of the position of this molecule at some later stage. This corresponds to modeling the motion of an individual molecule with a Markov process, the transition probabilities of which are governed by the diffusion equation. Thus, the microscale interpretation of diffusion is that each molecule performs a random walk.

Molecular diffusion is fascinating and relevant in its own right, but has even greater applicability because it serves as a reference and an analogy to other modes of dispersal; for example


Figure 2.1. Two-dimensional simulation of a solute diffusing in a container. Top panels: Concentration fields. Bottom panels: Position of 100 molecules. Left panels: Initially, the solute is concentrated at the center. Right panels: After some time, diffusion has dispersed the solute so that it is more evenly distributed over the container. The bottom right panel also includes the trajectory of one molecule (grey line and end point indicated with red cross); notice its irregular appearance.


of particles in turbulent flows, or of animals which move unpredictably. At an even greater level of abstraction, a molecule moving randomly in physical space is an archetypal example of a dynamic system moving randomly in a general state space.

2.1 Diffusive transport

In this section we model how a substance spreads in space due to molecular diffusion. To make things concrete, you may think of smoke in still air, or dye in still water. The substance is distributed over a one-dimensional space R. Let µt([a, b]) denote the amount of material present in the interval [a, b] at time t. Mathematically, this µt is a measure; it measures the material in the interval [a, b]. We may measure the substance in terms of number of molecules or moles, or in terms of mass, but we choose to let µt be dimensionless. We assume that µt admits a density, which is the concentration C(·, t) of the substance, so that the amount of material present in any interval [a, b] can be found as the integral of the concentration over the interval:

\[ \mu_t([a,b]) = \int_a^b C(x,t)\,dx . \]

The density has unit per length; if the underlying space had been two or three dimensional, then C would have unit per length squared or cubed. The objective of this section is to pose the diffusion equation (2.1), which governs the time evolution of this concentration C.

2.1.1 The conservation equation

We first establish the conservation equation

\[ \frac{\partial C}{\partial t} + \frac{\partial J}{\partial x} = 0 , \tag{2.2} \]
which expresses that mass is redistributed in space by continuous movements, but neither created, lost, nor teleported. The substance is being transported over the space. Transport is by continuous movements only - no molecules jump instantaneously between separated regions in space. The transport may then be quantified with the flux J(x, t), which is the net transport, from left to right, across the point x at time t. The flux J has physical dimension per time - it is the net amount of material that crosses the point x per unit time, from left to right. Considering the amount of material in the interval [a, b], we see that it is only changed by the net influx at the two endpoints, i.e.

\[ \frac{d}{dt}\mu_t([a,b]) = J(a,t) - J(b,t) . \]


Figure 2.2. Conservation in one dimension. The total mass in the interval [a, b] is µt([a, b]) = ∫_a^b C(x, t) dx, corresponding to the area of the shaded region. The net flow into the interval [a, b] is J(a) − J(b).

Assume that the flux J is differentiable in x; then

\[ J(a,t) - J(b,t) = -\int_a^b \frac{\partial J}{\partial x}(x,t)\,dx . \]

On the other hand, since µt is given as an integral, we can find the rate of change by differentiating under the integral sign:

\[ \frac{d}{dt}\mu_t([a,b]) = \int_a^b \frac{\partial C}{\partial t}(x,t)\,dx . \]

Combining these two expressions for the rate of change of material in [a, b], we obtain:

\[ \int_a^b \left( \frac{\partial C}{\partial t}(x,t) + \frac{\partial J}{\partial x}(x,t) \right) dx = 0 . \]

Since the interval [a, b] is arbitrary, we can conclude that the integrand is identically 0, or

\[ \frac{\partial C}{\partial t}(x,t) + \frac{\partial J}{\partial x}(x,t) = 0 . \]

This is known as the conservation equation. To obtain a more compact notation, we often omit the arguments (x, t), and we use a dot (as in Ċ) for the time derivative and a prime (as in J') for the spatial derivative. Thus, we can state the conservation equation compactly as

\[ \dot C + J' = 0 . \]

2.1.2 Fick’s laws

Fick’s first law for pure diffusion states that the diffusive flux is proportional to the concentration gradient:

\[ J(x,t) = -D\,\frac{\partial C}{\partial x}(x,t) , \quad \text{or simply} \quad J = -D C' . \tag{2.3} \]

This means that diffusion will move matter from regions of high concentration to regions of low concentration. The constant of proportionality, D, is termed the diffusivity and has dimensions length squared per time. The diffusivity depends on the diffusing substance, on the background material it is diffusing in, and on the temperature. See table 2.1 for examples of diffusivities.

Process                                       Diffusivity [m²/s]
Salt ions in water at room temperature        1 × 10⁻⁹
Smoke particle in air at room temperature     2 × 10⁻⁵
Carbon atoms in iron at 1250 K                2 × 10⁻¹¹

Table 2.1. Examples of diffusivities

Biography: Adolph Eugen Fick (1829-1901) was a German pioneer in biophysics with a background in mathematics, physics, and medicine. Interested in transport in muscle tissue, he used transport of salt in water as a convenient model system. In a sequence of papers around 1855, he reported on experiments as well as a theoretical model of transport, namely Fick’s laws, which was derived as an analogy to conduction of heat. An historical account of the theory of diffusion is given in (Philibert, 2006).

Fick’s first law is empirical, although we will shortly see that it is consistent with a microscopic model of molecule motion. Combining Fick’s first law (2.3) with the conservation equation (2.2) yields Fick’s second law, the diffusion equation (2.1), which we restate as

\[ \dot C = (D C')' . \tag{2.4} \]

This law predicts, for example, that the concentration will decrease at a peak, i.e. where C' = 0 and C'' < 0. In many physical situations, the diffusivity D is constant in space. In this case we may write Fick’s second law as

\[ \dot C = D C'' \quad \text{when } D \text{ is constant in space,} \tag{2.5} \]

i.e., the rate of increase of concentration is proportional to the spatial curvature of the concentration. However, constant diffusivity is a special situation, and the general form of the diffusion equation is (2.4).

Exercise 2.1: For the situation in figure 2.2, will the amount of material in the interval [a, b] increase or decrease in time? Assume that (2.5) applies, i.e. the transport is diffusive and the diffusivity D is constant in space.

For the diffusion equation to admit a unique solution, we need an initial condition C(x, 0) and spatial boundary conditions. Typical boundary conditions are Dirichlet conditions which fix the concentration C, and Robin conditions which fix the flux J at the boundary. In many situations the domain is unbounded so that the boundary condition concerns the limit |x| → ∞.

2.1.3 Diffusive spread of a point source

We now turn to an important situation where the diffusion equation admits a simple solution which can be stated in closed form:

1. The spatial domain is the entire real line R.
2. The diffusivity D is constant in space.
3. The fluxes vanish in the limit |x| → ∞.

Consider the initial condition that one unit of material is located at position x0, i.e.

\[ C(x,0) = \delta(x - x_0) , \]

where δ is the Dirac delta. The solution is then a Gaussian bell curve:

\[ C(x,t) = \frac{1}{\sqrt{2Dt}}\,\phi\!\left( \frac{x - x_0}{\sqrt{2Dt}} \right) . \tag{2.6} \]

Here φ(·) is the probability density function of a standard Gaussian variable,

\[ \phi(x) = \frac{1}{\sqrt{2\pi}}\,\exp\!\left( -\frac{1}{2} x^2 \right) . \]


Figure 2.3. Pure diffusion. The concentration field at times t = 1, 5, 10 with diffusivity D = 1 and a unit amount of material, which initially is located at the point x = 0.
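The closed-form solution (2.6) is easy to evaluate numerically, since it is nothing but a Gaussian density with mean x0 and variance 2Dt. The following minimal R sketch reproduces the setting of figure 2.3 (D = 1, unit point source at x0 = 0); the grid of x-values is an arbitrary choice.

# Minimal sketch: evaluate the point-source solution (2.6) at t = 1, 5, 10,
# using that (2.6) equals the Gaussian density with mean x0 and variance 2*D*t.
D <- 1; x0 <- 0
x <- seq(-10, 10, length.out = 401)
C <- function(x, t) dnorm(x, mean = x0, sd = sqrt(2 * D * t))
plot(x, C(x, 1), type = "l", xlab = "Position", ylab = "Concentration")
lines(x, C(x, 5), lty = 2)
lines(x, C(x, 10), lty = 3)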



Figure 2.4. Square root relationship between time and diffusive length scale.

Thus the substance is distributed according to a Gaussian distribution with mean x0 and standard deviation √(2Dt), see figure 2.3. This standard deviation is a characteristic length scale of the concentration field which measures (half the) width of the plume. We see that length scales with the square root of time, or equivalently, time scales with length squared (figure 2.4). This scaling implies that molecular diffusion is often to be considered a small-scale process: on longer time scales or larger spatial scales other phenomena may take over and be more important. We will return to this point later, in section 2.7.

Exercise 2.2: Insert the solution (2.6) into the diffusion equation and verify that it satisfies the equation. In which sense does the solution also satisfy the initial condition?

Exercise 2.3: Compute the diffusive length scale for smoke in air, and for salt in water, for various time scales between 1 second and 1 day.

The solution (2.6) is a fundamental solution (or Green’s function) with which we may also construct the solution to general initial conditions. To see this, let H(x, x0, t) denote the solution C(x, t) corresponding to the initial condition C(x, 0) = δ(x − x0), i.e. (2.6). Since the diffusion equation is linear, a linear combination of initial conditions results in the same linear combination of solutions. In particular, we may write a general initial condition as a linear combination of Dirac deltas:

\[ C(x,0) = \int_{-\infty}^{+\infty} C(x_0,0)\,\delta(x - x_0)\,dx_0 . \]

We can then determine the response at time t from each of the deltas, and integrate the responses up:


\[ C(x,t) = \int_{-\infty}^{+\infty} C(x_0,0)\,H(x,x_0,t)\,dx_0 . \tag{2.7} \]

Figure 2.5. Attenuation of spatial waves by diffusion. Here the diffusivity is D = 1 and the terminal time is T = 1. Left panel: A long wave is attenuated slowly. Right panel: A shorter wave is attenuated more quickly.

Note that here we did not use the specific form of the fundamental solution; only linearity of the diffusion equation and existence of the fundamental solution. In fact, this technique works also when diffusivity varies in space and when advection is effective in addition to diffusion, as well as for a much larger class of problems. However, when the diffusivity is constant in space, we get a very explicit result, namely that the solution is the convolution of the initial condition with the fundamental solution:

\[ C(x,t) = \int_{-\infty}^{+\infty} \frac{1}{(4\pi Dt)^{1/2}}\,\exp\!\left( -\frac{1}{2}\,\frac{|x-x_0|^2}{2Dt} \right) C(x_0,0)\,dx_0 . \]

2.1.4 Diffusive attenuation of waves *

Another important situation which admits solutions in closed form is the diffusion equation (2.5) with the initial condition

\[ C(x,0) = \sin kx , \]

where k is a wave number, related to the wavelength L by the formula kL = 2π. In this case the solution is

\[ C(x,t) = \exp(-\lambda t)\,\sin kx \quad \text{with } \lambda = D k^2 . \tag{2.8} \]

Exercise 2.4: Verify this solution.

Thus, harmonic waves are eigenfunctions of the diffusion operator, i.e. they are attenuated exponentially while preserving their shape. Note that the decay rate λ is quadratic in the wave number k. Another way of expressing the same scaling is that the half-time of the attenuation is

\[ T_{1/2} = \frac{\log 2}{\lambda} = \frac{L^2 \log 2}{4\pi^2 D} , \]

i.e., the half-time is quadratic in the wave length. We recognize the square root/quadratic relationship between temporal and spatial scales from figure 2.4.

Recall that we used the fundamental solution (2.6) to obtain the response of a general initial condition. We can do similarly with the harmonic solution (2.8), although we need to add the cosines or, more conveniently, use complex exponentials. Specifically, if the initial condition is square integrable, then it can be decomposed into harmonics as

\[ C(x,0) = \int_{-\infty}^{+\infty} \tilde C(k,0)\,e^{ikx}\,dk , \]

where \( \tilde C(k,0) \) is the (spatial) Fourier transform

\[ \tilde C(k,0) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} C(x,0)\,e^{-ikx}\,dx . \]

(Careful! Different authors use slightly different definitions of the Fourier transform.)

Now, each wave component exp(ikx) is attenuated to exp(−Dk²t + ikx), so the Fourier transform of C(x, t) is

\[ \tilde C(k,t) = \tilde C(k,0)\,e^{-Dk^2 t} . \]

We can now find the solution C(x, t) by the inverse Fourier transform:

\[ C(x,t) = \int_{-\infty}^{+\infty} \tilde C(k,0)\,e^{-Dk^2 t + ikx}\,dk . \]

One interpretation of this result is that short-wave fluctuations (large |k|) in the initial condition are smoothed out rapidly while long-wave fluctuations (small |k|) persist longer; the 16

solution is increasingly dominated by longer and longer waves which decay slowly as the short waves disappear.
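A quick numerical check of the attenuation rate in (2.8): the R sketch below evolves an initial sine wave with a simple explicit finite-difference scheme on an assumed periodic domain (an illustrative discretization of my own choosing, not a method from the text) and compares the remaining amplitude with exp(−Dk²t).

# Minimal sketch (assumptions: periodic domain [0, 2*pi), explicit FTCS scheme):
# attenuate C(x,0) = sin(k*x) numerically and compare with the rate D*k^2 in (2.8).
D  <- 1                     # diffusivity
k  <- 2                     # wave number (wavelength L = 2*pi/k)
nx <- 200                   # spatial grid points
x  <- seq(0, 2 * pi, length.out = nx + 1)[-(nx + 1)]
dx <- x[2] - x[1]
dt <- 0.25 * dx^2 / D       # time step respecting the stability limit D*dt/dx^2 <= 1/2
C  <- sin(k * x)            # initial condition
nsteps <- round(0.1 / dt)
for (i in 1:nsteps) {
  lap <- (c(C[-1], C[1]) - 2 * C + c(C[nx], C[-nx])) / dx^2   # periodic discrete Laplacian
  C <- C + D * dt * lap
}
c(numerical = max(C), theory = exp(-D * k^2 * nsteps * dt))   # amplitudes should agree closely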

2.2 The motion of a single molecule

Figure 2.6. A random walk model of molecular motion. Left: A close-up where individual transitions are visible. Right: A zoom-out where the process is indistinguishable from Brownian motion.

We can accept Fick’s first equation as an empirical fact, but we would like to connect it to our microscopic understanding. In this section we present a caricature of a microscopic mechanism which can explain Fickian diffusion. The essential idea in the model is that each individual molecule moves in an erratic and unpredictable fashion, due to the exceedingly large number of collisions with other molecules, so that only a probabilistic description of its trajectory is feasible. This phenomenon is called Brownian motion.


Biography: Robert Brown (1773 - 1858) was a Scottish botanist. In 1827 he studied pollen immersed in water under a microscope, and observed an erratic motion of the grains. Unexplained at the time, the motion is now attributed to seemingly random collisions between the pollen and water molecules. The physical phenomenon and its mathematical model are named Brownian motion, although many other experimentalists and theoreticians contributed to our modern understanding of the phenomenon.

Let Xt denote the position at time t of one specific molecule, for example a certain smoke particle in air. This Xt is a real-valued random variable; we still consider one dimension only. Every now and then, the molecule is hit by an air molecule which causes a displacement ±k. This may happen at regularly spaced points of time, a time step h apart, and does so independently of what has happened previously.1 In summary, the position {Xt : t ≥ 0} is a random walk:

\[ X_{t+h} = \begin{cases} X_t + k & \text{w.p. (with probability) } p \\ X_t & \text{w.p. } 1 - 2p \\ X_t - k & \text{w.p. } p \end{cases} . \]

Here, p is a real parameter in the interval (0, 1/2]. Notice that the displacement in one time step, X_{t+h} − Xt, has mean 0 and variance 2k²p. Exercise: Verify this! Since we assume that the displacement in each time step is independent, the central limit theorem applies. After many time steps, the displacement will therefore follow a distribution which is well approximated by a Gaussian, X_{nh} ∼ N(0, 2k²pn), i.e.,

\[ P(x_1 \le X_{nh} \le x_2) \approx \int_{x_1}^{x_2} \frac{1}{\sqrt{2k^2 p n}}\,\phi\!\left( x/\sqrt{2k^2 p n} \right) dx , \]

where φ is still the p.d.f. of a standard Gaussian variable. Next, assume that we let not just one particle start at the origin, but a large number N, and that each molecule moves independently. According to the law of large numbers, the number of particles present between x1 and x2 at time nh will be approximately

\[ N \int_{x_1}^{x_2} \frac{1}{\sqrt{2k^2 p n}}\,\phi\!\left( x/\sqrt{2k^2 p n} \right) dx . \]

1 If you object that the collision will not cause a fixed displacement, but rather a change in velocity, you are right. However, the simple picture is nevertheless more useful at this point; a hand-waving argument is that the velocity decays to 0 due to viscous friction and that the molecule drifts a certain distance during this decay. A more complete treatment would consider the velocity process in detail. The simple model was used by Einstein (1905); one extension that takes velocity into account was presented by Uhlenbeck and Ornstein (1930) and is known as the Ornstein-Uhlenbeck process.

Notice that this agrees with equation (2.6), assuming that we take N = 1, D = k²p/h and t = nh. We see that our cartoon model of molecular motion is consistent with the results of the diffusion equation, if we assume that

1. the molecules take many small jumps, i.e. h and k are small, so that the Central Limit Theorem is applicable at any length and time scale that our measuring devices can resolve, and

2. there is a large number of molecules present and they behave independently, so that we can ignore that the number of molecules in a region will necessarily be integer, and use the law of large numbers to neglect the variance on this number.

Both of these assumptions are reasonable in many situations where molecular systems are observed on a macroscale, for example in our everyday lives.
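As a numerical sanity check of the central limit argument above, the following R sketch simulates many independent random walkers and compares the empirical distribution of the displacement after n steps with the Gaussian N(0, 2k²pn). The values of k, p, n and the number of walkers are illustrative assumptions.

# Minimal sketch: random walk with steps +k, 0, -k (probabilities p, 1-2p, p),
# compared after n steps with the Gaussian N(0, 2*k^2*p*n).
set.seed(1)
k <- 0.1; p <- 0.25; n <- 1000; nwalkers <- 10000   # assumed, illustrative values
steps <- matrix(sample(c(-k, 0, k), n * nwalkers, replace = TRUE,
                       prob = c(p, 1 - 2 * p, p)), nrow = n)
X <- colSums(steps)                                 # displacement of each walker after n steps
c(empirical = var(X), theory = 2 * k^2 * p * n)     # the variances should agree
hist(X, freq = FALSE, breaks = 50)
curve(dnorm(x, 0, sqrt(2 * k^2 * p * n)), add = TRUE)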

2.3 Mathematical Brownian motion

If we trace the motion of one molecule in one dimension, we may model this motion as one outcome of a stochastic process {Bt : t ≥ 0}. If we measure the motion with a coarse resolution compared to the collisions with other molecules, then the stochastic process will have the Markov property: Even if we have observed the motion up to some time t, then our prediction of the future depends only on the position at time t (given the present, the future is independent of the past). We will discuss Markov processes more formally later, in chapter 4. Markov processes are specified by their transition probabilities, and the transition probabilities for this particular Markov process are Gaussian due to the Central Limit Theorem, since the transitions are the result of a large number of collisions. In short, the conditional distribution of B_{t+h} given Bt is Gaussian:

\[ B_{t+h} \mid B_t \sim N(B_t, 2Dh) , \]

i.e.,

\[ P(B_{t+h} \in [x_1, x_2] \mid B_t) = \int_{x_1}^{x_2} \frac{1}{\sqrt{2Dh}}\,\phi\!\left( \frac{x - B_t}{\sqrt{2Dh}} \right) dx . \]

A Markov process {Bt : t ∈ R̄+} which has these properties for all t, h is called (mathematical) Brownian motion. Note that, physically, these properties should only hold when the time lag h is large compared to the time between molecular collisions, so mathematical Brownian motion is only an appropriate model of physical Brownian motion at coarse scale, i.e. for large h. Mathematical Brownian motion is a fundamental process. It is simple enough that many questions regarding its properties can be given explicit and interesting answers, and we shall see several of these later, in chapter 4.

Biography: Albert Einstein (1879-1955). Einstein’s contributions to the theory of Brownian motion were fundamental, although perhaps less known to the general public than his theory of relativity or his work on the photoelectric effect. Starting in his Annus Mirabilis, 1905, he published a sequence of papers which explained qualitatively and quantitatively Brownian motion as the result of molecular collisions. This work made it possible to end the debate whether atoms really exist.

In many situations we choose to work with standard Brownian motion, where we take 2D = 1 so that the displacement B_{t+h} − Bt has variance equal to the time step h. If we insist that time has the physical unit of seconds s, notice that this means that Bt has the physical unit of √s!
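Standard Brownian motion is straightforward to simulate on a time grid, since its increments are independent Gaussians with variance equal to the time step. A minimal R sketch (the horizon and grid size are arbitrary choices):

# Minimal sketch: simulate standard Brownian motion on [0, T] by summing
# independent Gaussian increments with variance dt.
set.seed(1)
T <- 10; n <- 10000
dt <- T / n
B <- c(0, cumsum(rnorm(n, mean = 0, sd = sqrt(dt))))   # B_0 = 0, increments ~ N(0, dt)
plot(seq(0, T, by = dt), B, type = "l", xlab = "Time", ylab = "B")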

2.4 Advective and diffusive transport

In many physical situations, diffusion is not the sole transport mechanism. For example, if a particle in a fluid has higher density than the surrounding fluid, then the movement of the particle will have a bias downwards. Similarly, if the fluid is flowing, then the particle will have a tendency to follow the flow. Mathematically, both of these situations amount to a directional bias in the movement, so we can focus on the latter. Let the flow field be u(x, t). If we use Xt to denote the position of a fluid element at time t, then Xt satisfies the differential equation

\[ \frac{dX_t}{dt} = u(X_t, t) . \]

Consider again a solute which is present in the fluid, and as before let C(x, t) denote the concentration of the solute at position x and time t. If the material is a perfectly passive tracer (i.e., material is conserved and transported with the bulk motion of the fluid), then the flux of material is the advective flux:


Figure 2.7. A fluid flow in one dimension which transports a substance. The plume is advected to the right with the flow; at the same time it diffuses out. The diffusivity is D = 1, the advection is u = 2.5, and the terminal time is T = 4.

\[ J_A(x,t) = u(x,t)\,C(x,t) . \]

If in addition molecular diffusion is in effect, then according to Fick’s first law (2.3) this gives rise to a diffusive flux J_D = −DC'. We may assume that these two transport mechanisms operate independently, so that the total flux is the sum of the advective and diffusive fluxes:

\[ J(x,t) = u(x,t)\,C(x,t) - D(x)\,\frac{\partial C}{\partial x}(x,t) , \]

or simply J = uC − DC'. Inserting this into the conservation equation (2.2), we obtain the advection-diffusion equation for the concentration field:

\[ \dot C = -(uC - DC')' . \tag{2.9} \]

It is not possible to solve this equation analytically, except in simple cases. One such simple case is when u and D are constant, the initial condition is a Dirac delta, C(x, 0) = δ(x − x0) where x0 is a parameter, and the flow vanishes as |x| → ∞. Then we obtain:

\[ C(x,t) = \frac{1}{\sqrt{4\pi Dt}}\,\exp\!\left( -\frac{(x - ut - x_0)^2}{4Dt} \right) , \tag{2.10} \]

which corresponds to the probability density function of a Gaussian variable with mean x0 + ut and variance 2Dt. We see that the advection shifts the mean with constant rate, as would be the case if there had been no diffusion, and the diffusion gives rise to a linearly growing variance while preserving the Gaussian shape, as in the case of pure diffusion. This solution is important, but keep in mind that it is a very special case: Later we will see that in general, when the flow is not constant, it will affect the variance, and the diffusion will affect the mean.

Exercise 2.5:
1. Verify the solution (2.10).
2. Solve the advection-diffusion equation (2.9) on the real line with constant u and D with the initial condition C(x, 0) = sin(kx) or, if you prefer, C(x, 0) = exp(ikx).
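As a numerical companion to the solution (2.10), the following R sketch checks it by Monte Carlo: it sums independent Gaussian increments with mean u·h and variance 2Dh and compares the resulting histogram with the Gaussian density with mean x0 + ut and variance 2Dt. The values of u, D and t mirror figure 2.7; the step size and the number of particles are assumptions.

# Minimal sketch: Monte Carlo check of (2.10) for constant u and D.
set.seed(1)
u <- 2.5; D <- 1; x0 <- 0          # as in figure 2.7
h <- 0.01; t <- 4                  # assumed step size; terminal time as in figure 2.7
n <- round(t / h); npart <- 5000   # assumed number of particles
X <- x0 + colSums(matrix(rnorm(n * npart, mean = u * h, sd = sqrt(2 * D * h)), nrow = n))
hist(X, freq = FALSE, breaks = 50, xlab = "x")
curve(dnorm(x, mean = x0 + u * t, sd = sqrt(2 * D * t)), add = TRUE)   # closed form (2.10)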

2.5 Monte Carlo simulation of particle motion

The microscale interpretation of the advection-diffusion equation (2.9) is that each particle is being advected with the fluid, but at the same time subjected to random collisions with other molecules. Thus each particle performs a biased random walk. As one would expect from the Central Limit Theorem, the exact recursion of the random walk is not important, as long as the time step h is small and the mean and variance of each step corresponds to the advection and diffusion. When the diffusivity D and the flow u are constant in space and time, one random walk which is consistent with the advection-diffusion equation is to sample the increments ∆X = X_{t+h} − Xt from a Gaussian distribution

\[ \Delta X \sim N(u \cdot h,\ 2Dh) . \]

The displacement over n such steps will be Gaussian with mean ut and variance 2Dt where t = nh, i.e. agree with the solution of the advection-diffusion equation, also if the time step h is not small.

Now we let the flow u = u(x) vary in space, but still have D constant. We may then ask if we can simulate the trajectory of a single particle in this flow, say over the time interval [0, T]. To do this, we may divide the time interval into N subintervals

\[ 0 = t_0, t_1, \ldots, t_N = T , \]

so that we obtain sufficient temporal resolution for our purpose. Then, we sample the trajectory recursively. We start by sampling X0 from the initial distribution C(·, 0). Next we sample the position of the molecule at time t1. To do this, solve the advection-diffusion equation φ̇ = −(uφ − Dφ')' for t ∈ [0, t1] with the initial condition φ(x, 0) = δ(x − X0). This equation is the same as the advection-diffusion equation (2.9) but now in φ(x, t), which is the concentration, not of all molecules, but of those molecules which at time 0 are at or near X0.

These molecules will at time t1 be distributed according to φ(·, t1), so we pick one of these molecules at random. This completes the first step in a recursive scheme for sampling the trajectory, where we loop over i = 0, 1, . . . , N − 1:

1. Given X_{t_i}, solve the advection-diffusion equation φ̇ = −(uφ − Dφ')' for t ∈ [t_i, t_{i+1}] subject to the initial condition φ(x, t_i) = δ(x − X_{t_i}), applied to the start of the interval, i.e. for t = t_i. This φ is the concentration of those molecules which at time t_i are at or near X_{t_i}.

2. Sample X_{t_{i+1}} from this φ(·, t_{i+1}).

3. Set i := i + 1 and repeat.

With this algorithm, the sampled process {X_{t_i} : i = 0, . . . , N} is given by a stochastic recursion and is therefore a Markov process: At each iteration, we compute the next position from the current position only, independently of the earlier parts of the trajectory which brought the particle to the current position. The transition probabilities in this Markov process are given by the advection-diffusion equation (2.9). We say that this process {X_{t_i} : i = 0, . . . , N} is a sampled diffusion process.

The algorithm is not easy to implement, in general, because we have to solve a partial differential equation at each time step. If the time step is small, we can instead use a simpler recursion: At each time step i, we sample X_{t_{i+1}} from a Gaussian in which the old position X_{t_i} enters as a parameter:

\[ X_{t_{i+1}} \mid X_{t_i} \sim N\big( X_{t_i} + u(X_{t_i}) \cdot \Delta t_i,\ 2D \cdot \Delta t_i \big) , \tag{2.11} \]

where ∆t_i = t_{i+1} − t_i. The idea is here that over the short time interval, the process does not move far and therefore the flow u can be considered constant, so that the solution to the advection-diffusion equation can be approximated with a Gaussian density, i.e. (2.10). As the time step in this recursion goes to 0, this approximation becomes more accurate so that the p.d.f. of Xt will approach the solution C(·, t) to the advection-diffusion equation (2.9). This hopefully seems plausible, although we are far from ready to prove it.

One way to simulate from the Gaussian distribution in (2.11) is to start with standard Brownian motion {Bt : t ≥ 0}, and then set

\[ X_{t_{i+1}} = X_{t_i} + u(X_{t_i}, t) \cdot \Delta t_i + \sqrt{2D} \cdot \Delta B_{t_i} , \tag{2.12} \]

where ∆B_{t_i} = B_{t_{i+1}} − B_{t_i}. Exercise: Verify that with this recursion, the conditional distribution of X_{t_{i+1}} given X_{t_i} is as specified by (2.11). This equation (2.12) says that the displacement over the time interval ∆t_i has two contributions: One from the advection, which - when the time interval is short - can be approximated as u(X_{t_i}) · ∆t_i, and one from an underlying Brownian motion which perturbs the particle

away from the trajectory of a fluid element. Note that, if diffusion is absent (D = 0), this recursion is the Euler method of solving the ordinary differential equation dXt/dt = u(Xt), so the recursion (2.12) can be seen as a discrete-time version of an ordinary differential equation, with an additional perturbation due to the noise term, at each time step. The equation (2.12) is a stochastic difference equation; for a given initial condition X0, and a given realization of the Brownian motion Bt, it defines the trajectory Xt recursively. A convenient shorthand is

\[ \Delta X_t = u(X_t, t)\,\Delta t + \sqrt{2D}\,\Delta B_t . \tag{2.13} \]

This, as we will see later, is the Euler method for solving numerically a stochastic differential equation governing the motion of a single molecule.
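A minimal R sketch of the recursion (2.12)/(2.13) with a spatially varying flow is given below. The linear flow field u(x) = −x, the diffusivity, the time step and the initial condition are illustrative assumptions chosen only to show the structure of the scheme.

# Minimal sketch of the recursion (2.12): advection step plus Brownian perturbation.
# Assumptions: u(x) = -x, D, dt, nsteps and X_0 are illustrative values.
set.seed(1)
u <- function(x) -x
D <- 0.5; dt <- 0.01; nsteps <- 2000
X <- numeric(nsteps + 1); X[1] <- 2                    # assumed initial condition
for (i in 1:nsteps) {
  dB <- rnorm(1, mean = 0, sd = sqrt(dt))              # Brownian increment over dt
  X[i + 1] <- X[i] + u(X[i]) * dt + sqrt(2 * D) * dB   # advect, then add the diffusive perturbation
}
plot(seq(0, nsteps * dt, by = dt), X, type = "l", xlab = "Time", ylab = "Position")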

2.6 Stochastic differential equations: A preview

At this point we make a very coarse sketch of the development over the next chapters, where we let the time step tend to 0 in the stochastic difference equation (2.13) and thus reach a stochastic differential equation

\[ dX_t = u(X_t, t)\,dt + \sqrt{2D}\,dB_t . \]

The aim of this body of theory is to ensure that as the time step goes to 0, we do in fact reach a limit, that this limit corresponds to the way diffusion acts on an ensemble of molecules (section 2.1.2), and to develop tools to analyze this limit. To this end, it is tempting to rewrite the difference equation (2.13) as

\[ \frac{\Delta X_t}{\Delta t} = u(X_t, t) + \sqrt{2D}\,\frac{\Delta B_t}{\Delta t} , \tag{2.14} \]

and let the time step go to 0, reaching

\[ \text{“}\ \frac{dX_t}{dt} = u(X_t, t) + \sqrt{2D}\,\frac{dB_t}{dt}\ \text{”} . \]

However, as indicated by the quotation marks, Brownian motion Bt is not differentiable! This should be evident from figure 2.6; we will return to this later, in section 4.2.1. The derivative dBt/dt can loosely be identified with continuous-time white noise, which leads to the interpretation that a stochastic differential equation is an ordinary differential equation driven by white noise. Much useful intuition can be gained from this interpretation, but mathematically speaking, it would require too much effort to give a precise meaning to this equation. A more useful approach is to consider the integral version of the equation. To this end, notice first that the solution to the stochastic recursion (2.13) satisfies

\[ X_{t_n} = X_0 + \sum_{i=0}^{n-1} u(X_{t_i}, t)\,\Delta t_i + \sqrt{2D}\,B_{t_n} . \]

If the time step goes to zero, the sum converges to an integral

\[ X_t = X_0 + \int_0^t u(X_s)\,ds + \sqrt{2D}\,B_t . \]

This is, strictly, a stochastic integral equation, but it is customary to call it a stochastic differential equation. In this course we will establish conditions on the flow field u under which it admits a unique solution, Xt, which will be called a diffusion process. Such diffusion processes constitute a subclass of the Markov processes for which the transition probabilities are governed by advection-diffusion equations; see precise statements later.

It is also fair to point out where the difficulties arise: Here, we assumed that the diffusivity D is constant, and that simplifies matters greatly. Without this assumption, we will show that the recursion must be replaced by

\[ \Delta X_t = \big( u(X_t, t) + D'(X_t, t) \big)\,\Delta t + \sqrt{2D(X_t, t)}\,\Delta B_t , \]

which corresponds to

\[ X_{t_n} = X_0 + \sum_{i=0}^{n-1} \big( u(X_{t_i}, t) + D'(X_{t_i}, t) \big)\,\Delta t_i + \sum_{i=0}^{n-1} \sqrt{2D(X_{t_i}, t)}\,\Delta B_i . \]

The difficulty with this equation is not so much that the term D' appears, but that the last sum cannot be written as a simple function of Bt. Rather, when the time steps become small, we obtain the limit

\[ X_t = X_0 + \int_0^t \big( u(X_s, t) + D'(X_s, t) \big)\,ds + \int_0^t \sqrt{2D(X_s)}\,dB_s . \tag{2.15} \]

The last integral is not an ordinary Riemann integral, but rather a new type of integral, termed an Itō integral. In chapter 6 we look at this Itō integral in detail and generality; define it and describe its properties, and with that in hand develop the theory of the stochastic differential equation

\[ dX_t = \big( u(X_t, t) + D'(X_t, t) \big)\,dt + \sqrt{2D(X_t, t)}\,dB_t , \]

which is nothing but a shorthand for the integral equation (2.15).


Figure 2.8. Advective (dashed) and diffusive (solid) length scale as function of time scale.

2.7 Relative importance of advection and diffusion *

We have now introduced two transport processes, advection and diffusion, which may be in effect simultaneously. It is useful to assess the relative importance of the two. First, consider a single molecule subject to constant advection u and diffusion D. At time t, the advection has moved the molecule a distance |u|t, while the diffusive length √ scale - the root mean square distance that the molecule has been moved by diffusion - is 2Dt. These length scales are shown in figure 2.8. Notice that initially, when time t is sufficiently small, the diffusive length scale is larger than the advective length scale, while for sufficiently large time t the advective length scale dominates. This justifies our earlier claim that diffusion is most powerful at small scales. The two length scales equal when

t = 2D / u² .

Instead of fixing time and computing associated length scales, one may fix a certain length scale L and ask about the corresponding time scales associated with advective and diffusive transport: The advective time scale is L/u while the diffusive time scale is L²/(2D). We define the Péclet number as the ratio between the two:

Pe = 2 · (Diffusive time scale) / (Advective time scale) = Lu / D .

It is common to include the factor 2 in order to obtain a simpler final expression, but note that different authors may include different factors. Regardless of the precise numerical value,

a large Péclet number means that the diffusive time scale is larger than the advective time scale. In this situation, advection is a more effective transport mechanism than diffusion at the given length scale, i.e. the transport is dominated by advection. Conversely, if the Péclet number is near 0, diffusion is more effective than advection at the given length scale. Such considerations may suggest simplifying the model by omitting the least significant term, and when done cautiously, this can be a good idea.

The analysis in this section has assumed that u and D were constant. When this is not the case, it is customary to use "typical" values of u and D to compute the Péclet number. This can be seen as a useful heuristic, but can also be justified by the non-dimensional versions of the transport equations, where the Péclet number enters. Of course, exactly which "typical values" are used for u and D can be a matter of debate, but this debate most often affects digits and not orders of magnitude. Even the order of magnitude of the Péclet number is a useful indicator of whether the transport phenomenon under study is dominated by diffusion or advection.
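As a quick illustration (not from the original notes; the numerical values of u and D are made up), the Péclet number and the crossover time 2D/u² can be computed as follows:

    def peclet(L, u, D):
        """Peclet number Pe = L*u/D for length scale L, speed u, diffusivity D."""
        return L * u / D

    u, D = 1e-2, 1e-5                    # illustrative values: m/s and m^2/s
    t_cross = 2 * D / u**2               # time where advective and diffusive length scales meet
    for L in [1e-4, 1e-2, 1.0]:
        print(f"L = {L:8.1e} m:  Pe = {peclet(L, u, D):8.1e}")

With these values, transport at sub-millimetre scales is diffusion-dominated while transport at metre scales is advection-dominated.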

2.8 Diffusion in more than one dimension *

In the previous sections we have considered only one spatial dimension, and so now we extend to consider two, three, or in general n spatial dimensions. Consider again the one-dimensional situation in figure 2.2. In n dimensions, we consider a region V ⊂ Rn instead of an interval [a, b]. Let µt(V) denote the amount of the substance present in this region. This measure can be written in terms of a volume integral of the density C:

µt(V) = ∫_V C(x, t) dx .

Here x = (x1, . . . , xn) is a vector in Rn and dx is the volume of an infinitesimal volume element. The concentration C now has physical unit per volume, i.e. SI unit m⁻ⁿ, since µt(V) should still be dimensionless. The flux J(x, t) is now a vector field, i.e. a vector-valued function of space and time; in terms of coordinates we have J = (J1, . . . , Jn). The defining property of the flux J is that the net rate of exchange of matter through a surface ∂V is

∫_{∂V} J(x, t) · ds(x) .

Here, ds is the surface element at x ∈ ∂V, a vector normal to the surface. The flux has SI unit m^(−n+1) s⁻¹. Conservation of mass now means that the rate of change in the amount of matter present inside V is exactly balanced by the rate of transport over the boundary ∂V:

∫_V Ċ(x, t) dx + ∫_{∂V} J(x, t) · ds(x) = 0     (2.16)

where ds is directed outward. This balance equation compares a volume integral with a surface integral. To proceed, we convert the surface integral to another volume integral, using the divergence theorem (also known as the Gauss theorem), according to which

∫_V ∇ · J dx = ∫_{∂V} J · ds .

In words, the total flow out of the control volume equals the total divergence integrated over the control volume. Here, in terms of coordinates, the divergence is

∇ · J = ∂J1/∂x1 + · · · + ∂Jn/∂xn .

Thus, substituting the surface integral in (2.16) with a volume integral, we obtain

∫_V [ Ċ(x, t) + ∇ · J(x, t) ] dx = 0 .

Since the control volume V is arbitrary, we get

Ċ + ∇ · J = 0 ,     (2.17)

which is the conservation equation in n dimensions, in differential form.

Figure 2.9. The conservation equation and the divergence. The net outflow of the ∆x-by-∆y control volume is ∆Jx ∆y + ∆Jy ∆x. With ∆Jx = ∂Jx/∂x · ∆x and ∆Jy = ∂Jy/∂y · ∆y, this reduces to div J · ∆x · ∆y.

Exercise 2.6: If you are uncomfortable with the divergence operator ∇ · J = ∂J1/∂x1 + ∂J2/∂x2 + · · · + ∂Jn/∂xn, consider the two-dimensional case (n = 2), using x and y for the coordinates. Take a rectangular ∆x-by-∆y control volume as in figure 2.9. Write up the flow across each of the four sides of the boundary, replacing the flow J = (Jx, Jy) with its Taylor expansion around the center (x, y), and discarding terms of higher order than ∆x ∆y. Next, compute the rate of increase of total mass in the control volume, ∫_V Ċ dx dy, also replacing Ċ with its truncated Taylor expansion around the center (x, y). From this, derive the conservation equation Ċ = −∂Jx/∂x − ∂Jy/∂y.

Fick's first law in n dimensions relates the diffusive flux to the gradient of the concentration field:

J = −D∇C ,

where ∇C is the gradient, the vector field with coordinates (∂C/∂x1, . . . , ∂C/∂xn). In many situations, the diffusivity D is still a scalar material constant, so that the relationship between concentration gradient and diffusive flux is invariant under rotations. In this case we say that the diffusion is isotropic. However, in general D is a matrix (or a tensor, if we do not make explicit reference to the underlying coordinate system). Then the diffusive flux is not necessarily parallel to the gradient, and its strength depends on the direction of the gradient. Such anisotropic situations arise when the diffusion takes place in an anisotropic material, or when the diffusion is not molecular but caused by other mechanisms such as turbulence. Anisotropic diffusion is also the standard situation when the diffusion model does not describe transport in a physical space, but rather stochastic dynamics in a general state space of a dynamic system. Fick's second law can now be written

Ċ = ∇ · (D∇C) .

When the diffusivity is constant and isotropic, this reduces to Ċ = D∇²C. Here ∇² is the Laplacian ∇² = ∂²/∂x1² + · · · + ∂²/∂xn², a measure of curvature.
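As a sanity check (not part of the original notes), the following Python/sympy sketch verifies symbolically that the isotropic point-source solution in two dimensions satisfies Ċ = D∇²C:

    import sympy as sp

    x, y, t, D = sp.symbols('x y t D', positive=True)
    # Fundamental solution of the isotropic diffusion equation in two dimensions
    C = sp.exp(-(x**2 + y**2) / (4 * D * t)) / (4 * sp.pi * D * t)
    lhs = sp.diff(C, t)                                # time derivative, C-dot
    rhs = D * (sp.diff(C, x, 2) + sp.diff(C, y, 2))    # D times the Laplacian
    print(sp.simplify(lhs - rhs))                      # prints 0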

To take advection into account, we assume a flow field u with coordinates (u1, . . . , un). The advective flux is now uC and the advection-diffusion equation is

Ċ = −∇ · (uC − D∇C) .     (2.18)

2.9 Conclusion

Diffusion is a transport mechanism, as we have seen in section 2.1. Its mathematical model is stated in terms of the concentration field and the flux. Fick's laws tell us how to compute the flux for a given concentration field, and therefore specify the temporal dynamics of the

concentration field. This is the classical approach to diffusion, in the sense of 19th century physics.

A microscopic model of diffusion contains an exceedingly large number of molecules which each move erratically and unpredictably, due to collisions with other molecules. This statistical mechanical image is entirely consistent with the continuous fields of classical diffusion, but also brings attention to the motion of a single molecule. We model this motion as a stochastic process to capture its unpredictable nature, and give it the name diffusion process. It is a Markov process, the transition probabilities of which are governed by the advection-diffusion equation. So the probability density function associated with a single molecule is advected with the flow while diffusing out due to unpredictable collisions, in the same way the overall concentration of molecules is advected while diffusing. This is a very useful picture to have in mind, because it allows us to apply physical intuition to diffusion processes. Also, special solutions, formulas and even software developed for chemical transport by diffusion (or heat transfer, by analogy) can be used for diffusion processes.

We can simulate the trajectory of a diffusing molecule with a stochastic recursion (section 2.5). This provides a Monte Carlo particle tracking approach to solving the diffusion equation, which is useful in science and engineering. This is particularly so in high-dimensional spaces or complex geometries where numerical solution of partial differential equations is difficult (and analytical solutions are unattainable). This stochastic recursion can be written as a stochastic difference equation (2.13). This equation contains a perturbation from the path of a fluid element, representing the displacement of the molecule due to intermolecular collisions. In this way, the motion of a molecule can be seen as a dynamic system (namely, the movement of a fluid element) which is perturbed by stochastic noise (namely, molecular collisions). The advected and diffusing molecule becomes a reference example of stochastic dynamics; by studying it, we learn about the influence of stochastic noise on other dynamic systems. The noise term can be modeled as the increment of standard Brownian motion, scaled appropriately using the diffusivity. So from the realized trajectory of a Brownian particle (a molecule undergoing standard Brownian motion), we can compute the realized trajectory of a molecule in an advective flow while subject to diffusion. With this approach to diffusion, the trajectory of the diffusing molecule is the focal point, while the classical focal points (concentrations, fluxes, and the advection-diffusion equation that connects them) become secondary, derived objects.

In the chapters to come, we will depart from the physical notion of diffusion, in order to develop the mathematical theory, ending with the stochastic integral equation (2.15). While going through this construction, it is useful to have figure 2.1 and the image of a diffusing molecule in mind. If a certain piece of mathematical machinery seems abstract, it may be enlightening to consider the question: How can this help describe the trajectory of a diffusing molecule?

2.10 Additional exercises

Exercise 2.7:

1. Solve the diffusion equation (2.5) on the real line with a "Heaviside step" initial condition

C(x, 0) = 0 for x < 0,   C(x, 0) = 1 for x > 0 .

As boundary conditions, take lim_{x→+∞} C(x, t) = 1 and lim_{x→−∞} C(x, t) = 0. Hint: If you cannot guess the solution, use the formula (2.7) and massage the integral into a form that resembles the definition of the cumulative distribution function.

2. Consider the diffusion equation (2.5) on the positive half-line x ≥ 0 with initial condition C(x, 0) = 0 and boundary conditions C(0, t) = 1, C(∞, t) = 0. Hint: Utilize the solution of the previous question, and the fact that in that question, C(0, t) = 1/2 for all t > 0.

Exercise 2.8: Error bounds are an important theme in this book, and for this purpose it is inconvenient that the Gaussian cumulative distribution function does not admit a closed-form expression. It is useful to have simple bounds on the tail probabilities in the Gaussian distribution. Useful upper and lower bounds are (Karatzas and Shreve, 1997)

x/(1 + x²) · φ(x) ≤ 1 − Φ(x) ≤ φ(x)/x ,

which hold for x ≥ 0. Here, as always, Φ(·) is the c.d.f. of a standard Gaussian variable X ∼ N(0, 1), so that 1 − Φ(x) = P(X ≥ x) = ∫_x^∞ φ(y) dy with φ(·) being the density, φ(x) = (1/√(2π)) e^(−x²/2).

1. Plot the tail probability 1 − Φ(x) for 0 ≤ x ≤ 6. Include the upper and lower bound. Repeat, in a semi-logarithmic plot.

2. Show that the bounds hold. Hint: Show that the bounds hold as x → ∞, and that the differential version of the inequality holds for x ≥ 0 with reversed inequality signs.
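A small numerical check of these bounds (not part of the notes; the chosen x values are arbitrary) can be done with the erfc relation stated after Exercise 2.9:

    import math

    def phi(x):            # standard Gaussian density
        return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

    def tail(x):           # 1 - Phi(x), via the complementary error function
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    for x in [0.5, 1.0, 2.0, 4.0, 6.0]:
        lower = x / (1.0 + x * x) * phi(x)
        upper = phi(x) / x
        print(f"x={x:3.1f}:  {lower:.3e} <= {tail(x):.3e} <= {upper:.3e}")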

Exercise 2.9: Consider pure diffusion in n > 1 dimensions with a scalar diffusion D, and a point initial condition C(x, 0) = δ(x − x0) where x ∈ Rn and δ is the Dirac delta in n dimensions. Show that each coordinate can be treated separately, and thus that the solution is a Gaussian in n dimensions corresponding to the n coordinates being independent, i.e.

C(x, t) = Π_{i=1}^{n} (1/√(2Dt)) φ( e_i·(x − x0) / √(2Dt) ) = (4πDt)^(−n/2) exp( −|x − x0|² / (4Dt) )

where e_i is the ith unit vector.

The error function. The physics literature often prefers the "error function" to the standard Gaussian cumulative distribution function. The error function is defined as

erf(x) = (2/√π) ∫_0^x e^(−s²) ds ,

and the complementary error function is

erfc(x) = 1 − erf(x) = (2/√π) ∫_x^∞ e^(−s²) ds .

These are related to the standard Gaussian distribution function Φ(x) by

erf(x) = 2Φ(√2 x) − 1 ,    Φ(x) = 1/2 + (1/2) erf(x/√2) ,
erfc(x) = 2 − 2Φ(√2 x) ,   Φ(x) = 1 − (1/2) erfc(x/√2) .

CHAPTER 3

Probability spaces

Summary

Stochastic differential equations require a formal axiomatic foundation of probability theory. This foundation is measure-theoretic. We start by defining a stochastic experiment in which a random element (the realization) of a sample space is chosen. Next, events are statements about the experiment to which we assign probabilities. We define random variables as functions which map the realization to a real number. Thus, analysis of random variables can be seen as analysis of functions, and we can apply existing results about real-valued functions. For example, expectation corresponds to integration. Next, we model information as the set of questions a person is able to answer, and we define conditional expectation given some information. I assume that you have seen this material before, but in a different guise. In science and engineering curricula, probability is typically taught elementary, i.e. without measure theory, and engineers and applied scientists are typically neither familiar nor comfortable with σ-algebras and other objects from measure theory. You should not “un-learn” the elementary approach, but only recognize that for some theoretical constructions (case in point, stochastic differential equations) the elementary approach does not put enough firm ground under probabilistic arguments and calculations. The measure theoretic approach provides this firm ground, is consistent with the elementary approach, and once one penetrates what seems to be a canopy of definitions, one may even say that the framework is natural. Finally, the mathematical literature on stochastic differential equations is written in the language of measure theory, so the vocabulary is necessary for anyone working in this field, even if ones interest is applications rather than theory. 33

Figure 3.1. In a stochastic experiment, Chance picks an outcome ω from the sample space Ω. An event A occurs (equivalently, is true) if ω ∈ A, so for the realization ω in the figure, the event A did not occur.

3.1 Stochastic experiments

The most basic concept in probability theory is the stochastic experiment. One of the most simple experiments is that of rolling a die. In this section we make the mathematical notion of a stochastic experiment precise. First, we need the set of all possible outcomes. We will call this the sample space, typically denoted Ω. The elements in Ω are the outcomes or the realizations, which we typically denote ω. For the die experiment, where we care only about the number of eyes the die shows, we can identify Ω with the set {1, 2, 3, 4, 5, 6}. A few other examples: The Bernoulli experiment: When tossing a coin, the sample space can be taken to be {0, 1}. Linear regression: A classical statistical model is Yi = xi θ + σEi where Y1 , . . . , Yn are observed random variables, θ and σ are parameters, xi are non-random covariates, and E1 , . . . , En are independent measurement errors, each Gaussian distributed with mean 0 and variance 1. For this model, we can take the sample space to be Ω = Rn and identify the outcome ω ∈ Ω with the measurement errors, i.e. ω = (e1 , . . . , en ). Stochastic differential equations: We will be working with the equation

dXt = f(Xt) dt + g(Xt) dBt

with initial condition X0 = x, where Xt and Bt are scalars for each t. We shall see in the following chapters that the sample space can be taken to be C(R̄+, R), the set of continuous real-valued functions defined on [0, ∞), and the stochastic experiment is to pick the Brownian

motion {Bt : t ≥ 0} from this set. Physically, we can think of an infinite collection of Brownian molecules released at the origin at time 0, and the stochastic experiment is to pick one random molecule from this ensemble and follow its trajectory. This trajectory in turn drives the dynamics of the state X.

The next thing we need are events. Events are statements about the outcome, such as "The die showed 2", or "The die showed an even number". Once the experiment has been performed, the statement is either true or false. Mathematically, an event A is a subset of Ω containing those outcomes for which the statement is true. The statement "The die showed 2" corresponds to the event {2}, while the truth set of the statement "The die shows an even number" is the event {2, 4, 6}. For the linear regression model, an event is a subset of Rn. One example is the event "all the E's are positive", corresponding to Rn+. For the stochastic differential equation, an example of an event is "X1 ≤ 0 and X2 ≥ 0". Although it is not at all straightforward to say more explicitly which outcomes lie in this set, we shall learn to compute the probability of such an event.

This brings us to probabilities: The point of the stochastic model is exactly to assign probabilities to each event. If the die is fair, then P({2}) = 1/6 while P({2, 4, 6}) = 1/2. For the linear regression model, probabilities are obtained by integrating the probability density function over the event:

P(A) = ∫_A ρ(ω) dω .     (3.1)

Here ρ is the joint density of the random variables Ei, in our case a multivariate Gaussian. For example, P({ω = (e1, . . . , en) ∈ Rn+}) = 2⁻ⁿ.

Now that the vocabulary is in place - sample space, outcomes, events, probabilities - we need to specify the mathematical properties of these objects. First, which events do we consider? For the example of the fair die, it is simple: Any subset of Ω is allowed, including the empty set and Ω itself. Moreover, the probability of an event depends only on the number of elements in it, P(A) = |A|/6. For the linear regression model, we could start by trying to make an event out of each and every subset A of Rn, and use the formula (3.1) to compute its probability. Unfortunately, it turns out that there are subsets A of Rn that are so pathological that we cannot integrate even a constant function over them. You can accept this as a curious mathematical fact, or you can look up the "Vitali set" in (Billingsley, 1995) or on Wikipedia. We need to exclude such "non-measurable" subsets of Rn: They do not correspond to events. When not every set A ⊂ Ω can be an event, which ones should be? Some events are required for the theory to be useful. For example, in the scalar case Ω = R, we want intervals to correspond to events, so that the statement ω ∈ [a, b] is an event for any a and b. But

regardless of which basic events we would prefer to have, it is imperative that our logic machinery works: If A is an event, then the complementary set Ac must also be an event, so that the statement “not A” is valid. Next, if also B is an event, then the intersection A ∩ B must also be an event, so that the statement “A and B” is valid. More generally, in stochastic processes we often consider infinite sequences of events, for instance when analyzing convergence. So if {Ai : i ∈ N} is a sequence of events, then the statement “for each integer i, the statement Ai holds” should be valid. In terms of subsets of sample space Ω, this means that A1 ∩ A2 ∩ . . . must be an event. Let F denote the collection of events which we consider. Mathematically speaking, the requirements on F that we have just argued for, means that F is a σ-algebra: Definition 3.1.1 (σ-algebra of events). Given a sample space Ω, a σ-algebra F of events is a family of subsets of Ω for which:

1. The certain event is included, Ω ∈ F.

2. For each event A ∈ F, the complementary set Ac is also an event, Ac ∈ F.

3. Given a sequence of events {Ai ∈ F : i ∈ N}, there is an event that all Ai occur, i.e. ∩i Ai ∈ F.

Given a sequence of events {Ai ∈ F : i ∈ N}, also the union ∪i Ai is an event. Exercise: Verify this!. We often say, for short, that σ-algebras are characterized by being closed under countable operations of union and intersection. Example 3.1.1 (The Borel algebra). A specific σ-algebra which we will encounter frequently in this book, is related to the case Ω = R. We previously argued that the intervals [a, b] should be events for the theory to be useful in many situations. The smallest σ-algebra which contains the intervals is called the Borel-algebra and denoted B(R) or simply B. The Borel algebra contains all closed sets, all open sets, and all sets which can be obtained by a countable number of operations of union and intersection applied to a countable collection of open sets. This collection of sets is large enough to contain the sets one encounters in practice, and the fact that the Vitali set and other non-Borel sets exist is more an excitement to mathematicians than a nuisance to practitioners. In the case Ω = Rn , we require F to include (hyper)rectangles of the form [a1 , b1 ] × [a2 , b2 ] × . . . × [an , bn ], for ai , bi ∈ R and also use the name Borel-algebra, B(Rn ), for the smallest σ-algebra that contains these hyper-rectangles.


Biography: Félix Édouard Justin Émile Borel (1871-1956) was a French mathematician who was one of the leading figures in the development of measure theory. He applied measure theory to real functions of real variables, as well as to the foundations of probability. He was also politically active, argued for a united Europe, was Minister of the Navy, and was imprisoned during World War II for assisting the Resistance.

Having outcomes and events in place, we need to assign a probability P(A) to each event A. The way we do this must be consistent:

Definition 3.1.2 (Probability measure). A probability measure is a map P : F 7→ [0, 1] with the following two properties:

1. P is normalized: The certain event has probability 1; P(Ω) = 1.

2. P is countably additive: Given a countable family of events Ai which are mutually exclusive, i.e. Ai ∩ Aj = ∅ for i ≠ j, we have P(∪i Ai) = Σi P(Ai).

Since P assigns a real non-negative number to each set and is additive, P is a measure. Since it is normalized in the sense that P(Ω) = 1, it is a probability measure. It seems natural that forming the union of disjoint sets corresponds to adding probabilities, but it still has some interesting consequences. We may investigate some of these in a simple context, namely Ω = [0, 1] and P([a, b]) = b − a for 0 ≤ a ≤ b ≤ 1. Probability then equals length, the measure is called Lebesgue measure, and this is a uniform distribution on [0, 1]. Note:

1. We can only require additivity to hold for (countable) sequences of sets; it does not necessarily hold for uncountable infinite families of sets. For example, the sample space Ω = [0, 1] is the uncountable union of the singletons {x} where x ∈ [0, 1]. Each of these singletons has probability measure P({x}) = 0, while their union has measure P(∪x {x}) = P(Ω) = 1. So when there are an uncountable number of parts, the whole is more than a sum of its parts.

2. Similarly, the collection of events F is closed only under countable unions. For example, let A ⊂ [0, 1] be a non-measurable set such as the Vitali set. For each x ∈ [0, 1], let Ax = {x} if x ∈ A and Ax = ∅ otherwise. Then each Ax is measurable but ∪{Ax : x ∈ [0, 1]} is not, since this union is simply A.

This means that some questions related to continuous-time phenomena can become quite technical. For example, for a continuous-time stochastic process {Xt : t ∈ R+ } (which we introduce in chapter 4), even if it is an event that Xt is continuous at a given point t0 , this does not necessarily mean that it is an event that Xt is continuous over an interval t ∈ [0, 1], since this involves an uncountable number of events. So an effort is needed to make sure that we can assign a probability to the statement “Xt is a continuous function of time t”. 3. The rational numbers form a countable set Q. So the set [0, 1] ∩ Q has measure 0, i.e. when sampling a real number uniformly from [0, 1], the probability of getting a rational number is 0. This holds despite the rationals being dense in [0, 1] (every real number can be approximated with a rational number with arbitrary accuracy). So almost no real numbers are rational, but every real number is almost rational. We are now done. Our mathematical model of a stochastic experiment involves a sample space Ω, a family of events F, and a probability measure P, which all satisfy the assumptions in the previous. Together, the triple (Ω, F, P) is called a probability space. Exercise 3.10: Consider the plane R2 with its Borel algebra B(R2 ). Show that the set A = {(x, y) ∈ R2 : x + y ≤ 0} is Borel.
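As a computational aside (not from the original notes), the probability 2⁻ⁿ computed above for the linear regression example can be checked by simple Monte Carlo; the sample size and seed are arbitrary:

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 3, 200_000
    E = rng.standard_normal((reps, n))          # reps outcomes omega = (e1, ..., en)
    freq = np.mean(np.all(E > 0, axis=1))       # empirical P(all errors positive)
    print(freq, 2.0**(-n))                      # both approximately 0.125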

3.2 Random variables

A random variable is a quantity which depends on the outcome of the stochastic experiment; so its mathematical representation is a function defined on Ω. The special case of real-valued variables is important; then we have X : Ω 7→ R. It is a great strength to have defined random variables as functions on sample space: We are very familiar with functions, and now we can apply all this machinery to random variables. For example, we know several ways a sequence of functions can converge to a limit (most importantly pointwise, and in Lp norm), and each of these corresponds immediately to convergence of random variables.

Now that we have seen that not all subsets of Ω are necessarily valid events, it is fair to ask if all functions Ω 7→ R are valid random variables. The answer is no; we can construct random variables that are so pathological that we would not be able to analyze them. For example, let A be a non-measurable subset of Ω, i.e. A ∉ F, and take X to be the indicator function of A:

X(ω) = 1A(ω) = 1(ω ∈ A) = 1 if ω ∈ A, and 0 otherwise.

Then the statement X = 1 is not an event and has no probability assigned to it! Note: This does not mean that the statement X = 1 has probability 0; it means that we cannot assign any probability to it, zero or non-zero, in a meaningful way. This is not a good start for the analysis of a random variable, so even if this X is a real-valued function on sample space, it does not qualify to be a random variable. To avoid such degenerate cases, we require that the statement "X ∈ [a, b]" corresponds to an event, for any a and b. In short:

X −1 ([a, b]) ∈ F .

Figure 3.2. A real-valued random variable is a map X : Ω 7→ R such that the pre-image X −1 ([a, b]) of any interval [a, b] is an event, i.e. an element in F.

Here, X −1 (B) denotes the pre-image of B under X: Definition 3.2.1 (Pre-image). Given a function X : Ω 7→ R, and a set B ⊂ R, the pre-image is X −1 (B) = {ω ∈ Ω : X(ω) ∈ B} .

Note that the pre-image is very different from the inverse of a function, although we use the same notation: The pre-image maps subsets of R to subsets of Ω. In most cases in this book, X maps a high-dimensional sample space Ω to a low-dimensional space such as R, so the function X will not be invertible. However, should the function X happen to be invertible, then the pre-image of a singleton {x} is a singleton {ω}. The condition that X −1 ([a, b]) ∈ F implies that X −1 (B) ∈ F for any Borel set B ⊂ R. In words, the pre-image under X of any Borel set B is an event, i.e. an element in F. We say that X is (Borel) measurable w.r.t. F. When it is clear from the context which σ-algebra we refer to, we often simply say that X is measurable. In summary, and generalizing to the multidimensional case: Definition 3.2.2 (Random variable). A Rd -valued random variable is a mapping X : Ω 7→ Rd which is measurable, i.e. X −1 (B) ∈ F for any Borel set B ∈ B(Rd ). 39

A word about notation: We often omit the ω argument. For instance, P(X ≤ 0) is really a shorthand for P({ω ∈ Ω : X(ω) ≤ 0}). Now that we are on firm ground, we can define objects which are familiar from elementary probability. The cumulative distribution function (c.d.f.) of a real-valued random variable is FX (x) = P(X ≤ x) . If FX is absolutely continuous, then its derivative is defined except on a set of real with (Lebesgue) measure 0; we define the probability density function fX (x) as this derivative. Exercise 3.11: 1. Let X be a real-valued random variable and let g : R 7→ R be Borel-measurable, i.e. g −1 (B) ∈ B for any B ∈ B. Show that Z : Ω 7→ R given by Z(ω) = g(X(ω)) is a random variable. 2. It may be shown (Billingsley, 1995) that any continuous function g : R 7→ R is Borel measurable. Use this to show that X 2 is a random variable when X is.

3.3 Expectation is integration

We now set out to state the measure-theoretic definition of the expectation of a random variable. Recall that in the elementary (non-measure theoretic) approach to probability, we define expectation EX of a continuous random variable as an integral ∫ x fX(x) dx where fX is the probability density, while the integral is replaced with a sum in the case of a discrete random variable. In the measure-theoretic approach, we seek a definition of the expectation EX which is consistent with the elementary definition, but which concerns the basic objects, i.e. the probability space (Ω, F, P) and the random variable X, rather than a derived object such as the probability density function fX. First, consider the case of a "simple" random variable Xs, i.e. one that attains a finite number of possible values x1, . . . , xn. Then the elementary definition of the expectation is

EXs = Σ_{i=1}^{n} xi P(Xs = xi)

and this definition is applicable in the measure-theoretic construction as well. Notice that the right hand side can be seen as the integral over Ω of a piecewise constant function Xs : Ω 7→ R (figure 3.3).

W.p. 1 means “with probability 1”, i.e. P(0 ≤ Xs ≤ X) = 1. Another way of saying the same is “almost surely”, abbreviated “a.s.”.

Next, consider an arbitrary non-negative random variable X. Then we may construct a simple random variable Xs which is a lower bound on X, i.e. such that 0 ≤ Xs(ω) ≤ X(ω) w.p. 1.

Figure 3.3. Expectation as integrals over sample space Ω. Here, Ω = [0, 1]. A non-negative random variable X is bounded below by a simple random variable Xs. If the probability measure is uniform, then expectation corresponds to area under the curve; e.g. EXs corresponds to the gray area.

It is reasonable to require that EX, if it exists, satisfies EX ≥ EXs. Also, it is reasonable to require that if Xs is a "good" approximation of X, then EXs is "near" EX. This leads us to define

EX = ∫_Ω X(ω) dP(ω) = sup{EXs : Xs simple, 0 ≤ Xs ≤ X}

So expectation EX of a non-negative random variable is an integral over the sample space Ω of the function X : Ω 7→ R, with respect to the probability measure. This Lebesgue integral is defined by approximating the function X with simple functions from below. Note that the expectation may be +∞. Finally, for a random variable X which attains both positive and negative values, we define the positive part X⁺ = X ∨ 0 and the negative part X⁻ = (−X) ∨ 0. Note that X = X⁺ − X⁻. We now define the expectation

EX = EX⁺ − EX⁻   if EX⁺ < ∞ and EX⁻ < ∞ .

We may state the condition that both positive and negative part have finite expectation more compactly: We require that E|X| < ∞. Exercise: Why do we know that |X| is a random variable? This expectation has the nice properties we are used to from elementary probability, and which we expect from integrals - see the notes at the end of this chapter. A much more in-depth discussion of expectations and integrals over sample space can be found in e.g. (Williams, 1991) or (Billingsley, 1995). A technical advantage of defining expectation as an integral over the sample space is that it covers both the discrete case where X(Ω) is a countable set, and the continuous case where e.g. X(Ω) = R, so there is no need to state every result in both a continuous version and a discrete version. This definition of expectation is also consistent with a Lebesgue-Stieltjes integral

EX = ∫_{−∞}^{+∞} x dFX(x)

where we integrate, not over sample space Ω, but over the possible values of X, i.e. the real axis. Also this definition covers both the continuous and discrete case in one formula.

Exercise 3.12: Define Ω = [0, 1], F = B(Ω), and let P be the uniform distribution on Ω. Let G : R̄+ 7→ [0, 1] be a nonincreasing function. Let X(ω) = G−1(ω) (or more precisely, X(ω) = inf{x ∈ R̄+ : G(x) ≤ ω}).

2. Show that

Z EX =

Z X(ω) P(dω) =





G(x) dx 0

geometrically, by showing that the two integrals describe the area of the same set in the (x, G(x)) plane (or in the (X, ω) plane). This is a convenient way of computing expectations in some situations.
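The identity in Exercise 3.12 can also be illustrated numerically (this sketch is not part of the notes and is of course not a proof); here G(x) = e^(−x), so X(ω) = −log(ω) and both quantities should be approximately 1:

    import numpy as np

    rng = np.random.default_rng(1)
    U = rng.uniform(size=500_000)
    X = -np.log(U)                    # X = G^{-1}(U) with G(x) = exp(-x)
    print(X.mean())                   # approx 1

    dx = 0.001
    x = np.arange(0.0, 50.0, dx)
    print((np.exp(-x) * dx).sum())    # integral of G over [0, infinity), approx 1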

3.4 Information as σ-algebras

We now consider the situation where an observer has partial knowledge about the outcome of a stochastic experiment. This is a very important situation in statistics as well as in stochastic processes. The key to modeling such information mathematically is to describe which questions about the outcome the observer is able to answer. A question corresponds to an event, namely the set of those outcomes for which the answer is “yes” (we assume that this set of outcomes is in fact an event). We say that the observer can resolve an event, if the observation always lets him answer the corresponding question, i.e. if he knows if the event has occurred or not. For example, when tossing two dice, the sample space is Ω = {1, . . . , 6}2 ; we will denote the outcomes 11, 12, . . . , 16, 21, . . . , 66. See figure 3.4. Now assume an observer does not see the dice, but is told the maximum of the two. Which events can he resolve? He will certainly know if the event {11} is true; this will be the case iff the maximum is 1. Similar, the maximum being 2 corresponds to the event {12, 21, 22}. Generally, he can resolve the event {1z, z1, 2z, z2, . . . , zz} for any z ∈ {1, . . . , 6}. There are certainly events which he cannot resolve, for example the event that the first die shows one: He will not generally know if this event is true, only if he is told that the maximum is 1. Using the symbol H for all the events that he can resolve, we note that H will be a σ-algebra. For example, if he knows whether events A, B occurred, then he also knows if A ∩ B occurred, assuming that he knows how to apply logic. The σ-algebra H will be contained in the original system of events F. In summary: The information available to an observer is described by a set of events H, which is a sub-σ-algebra to F. An observer with information H will know the realized value of a random variable X, if and only if X is H-measurable. If the observer has obtained the information by measuring some real-valued random variable Y , then the information available is the σ-algebra generated by Y , i.e. H = Y −1 (B) = {A ⊂ Ω : A = Y −1 (B) for some B ∈ B} In that case, a lemma due to Doob and Dynkin (see e.g. (Williams, 1991)) states that the observer will know the realized value of X, if and only if it is possible to compute X from Y . To be precise: 43

Figure 3.4. The sample space corresponding to the experiment of tossing two dice. Purple regions illustrate the information σ-algebra H generated by an observation of the maximum of the two dice: In each purple solid region, the maximum of the two dice is constant. Any union of these purple regions is an event in H. The open gray ellipse contains the event "The first die shows one", which is not contained in H.

Lemma 3.4.1 (Doob-Dynkin). Let X and Y be Rn-valued random variables on a probability space (Ω, F, P). Then X is measurable w.r.t. σ(Y) if and only if there exists a Borel measurable function g : Rn 7→ Rn such that X(ω) = g(Y(ω)) for all ω ∈ Ω.

Maybe you think that this is unnecessary formalism; that a statement such as "The observer has observed Y = y" is sufficient. In this case, consider the following example, which is a slightly modified version of Borel's paradox.

3.4.1 Borel's paradox

We consider a stochastic experiment of randomly picking a point in the plane so that the x and y coordinates are i.i.d. standard Gaussians (figure 3.5). An observer reports to us that the point lies on the x-axis, but no other information. Based on this information, what is our conditional expectation of the squared distance to the origin? Letting S denote the squared distance to the origin S(ω) = R2 (ω) = X 2 (ω) + Y 2 (ω), one argument is as follows:

E(S|Y = 0) = E(X² + Y² | Y = 0)                 (3.2)
           = E(X² | Y = 0) + E(Y² | Y = 0)       (3.3)
           = E(X²) = 1                           (3.4)

Figure 3.5. Borel's paradox. A random point has Cartesian coordinates (X, Y) and polar coordinates (R, Θ). (X, Y) follows a standard bivariate Gaussian; the probability density is indicated with colors and the gray circle encloses 50 % of the probability. Given that the point is observed to lie on the x-axis (thick solid line), what is the mean square distance ES = ER² to the origin?

where we have used the fact from elementary probability that if X and Y are independent, then E(X²|Y = 0) = E(X²). Another argument is as follows: The observation that the point lies on the x-axis is information about the angle Θ from the x-axis to the vector (X, Y): When the point is on the x-axis, Θ must be 0 or π. Since the distribution is rotationally symmetric, S and Θ are independent. Hence

E(S|Θ ∈ {0, π}) = E(S) = 2 .

So what is the conditional expectation of S, 1 or 2? To resolve this seeming paradox, we must first define conditional expectations precisely. If you are impatient, have a sneak peek at the resolution in section 3.5.1.

3.5 Conditional expectations

A conditional expectation is what is expected by an observer who has only partial information about the outcome of the stochastic experiment. Since we use σ-algebras to model information, we aim to define conditional expectations such as E{X|H} where X is a random variable on (Ω, F, P) and H is a sub-σ-algebra to F, describing the information. First, the conditional expectation E{X | H} must be a random variable: If we repeat the experiment, then observer gets different information so his expectation will be different. Second, E{X | H} must depend only on the information available to the observer. Phrased in the language of measure theory, E{X | H} must be measurable w.r.t. H. If the information stems from a measurement of Y , i.e. H = σ(Y ), then this means that there must exist some (measurable) function g such that E{X | Y } = g(Y ), which is reasonable: We must be able to compute the conditional expectation from the available data. Note that we allow the shorthand E{X|Y } for E{X|σ(Y )}. We now “just” need to determine which value the random variable E{X | H} attains for each ω; or in the case H = σ(Y ), we just need to determine the function g such that E{X|Y } = g(Y ). Let us go back to elementary probability. If Y attains values y1 , y2 , . . . with non-zero probability, then

g(yi) = E{X|Y = yi} = E{X · 1(Y = yi)} / P(Y = yi)

This equation only makes sense because P(Y = yi ) > 0. But if we multiply both sides with P(Y = yi ), we obtain an “integral” version which holds trivially also when P(Y = yi ) = 0. 46

We use the identity g(yi) · P(Y = yi) = E{g(Y) · 1(Y = yi)} to obtain a more appealing form:

E{g(Y) · 1(Y = yi)} = E{X · 1(Y = yi)} .

This serves as our definition of conditional expectation with respect to any information σ-algebra H:

Definition 3.5.1. Given a random variable X on (Ω, F, P) such that E|X| < ∞, and a sub-σ-algebra H of F, the conditional expectation of X w.r.t. H is the almost surely unique random variable Z = E{X|H} which is measurable w.r.t. H and for which E{Z · 1H} = E{X · 1H} holds for any H ∈ H.
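To connect the definition to computation, here is a small Python sketch (not from the notes) that computes a conditional expectation by brute force in the two-dice experiment, here observing the sum of the dice rather than the maximum; it simply averages X over each atom of the information σ-algebra:

    import itertools

    # Two fair dice: sample space Omega = {1,...,6}^2 with uniform probability.
    Omega = list(itertools.product(range(1, 7), repeat=2))
    X = {w: w[0] for w in Omega}            # face value of the first die
    S = {w: w[0] + w[1] for w in Omega}     # observed quantity: the sum

    # E{X | sigma(S)}: average X over each atom {S = s} of the information sigma-algebra.
    Z = {}
    for s in set(S.values()):
        atom = [w for w in Omega if S[w] == s]
        Z.update({w: sum(X[w2] for w2 in atom) / len(atom) for w in atom})

    print(Z[(2, 5)])    # E{X | S = 7} = 3.5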

Exercise 3.13: Consider again the stochastic experiment of tossing two dice in figure 3.4. Let X and Y be the face value of the first and second die, respectively. Let H be the information obtained by observing the maximum of the two, H = σ(max(X, Y)). Let Z be the conditional expectation of X given H, Z = E{X|H}. Compute Z(ω) for each ω and display the results in a two dimensional table similar to figure 3.4.

Exercise 3.14 (Conditional expectations):

1. For which σ-algebra H does it always hold that E{X|H} = X?

2. For which σ-algebra H does it always hold that E{X|H} = EX?

Exercise 3.15 (Independence): We say that two events A and B are independent if P(A ∩ B) = P(A) · P(B). We say that two σ-algebras G and H are independent if events A and B are independent whenever A ∈ G and B ∈ H. We say that two random variables X and Y are independent if σ(X) and σ(Y) are independent. Show that if X and Y are independent and E|X| < ∞, then E{X|Y} = EX. Hint: Assume first that X is simple.

In most situations, the information H is obtained by observing a random variable Y. Figure 3.6 offers a graphical illustration of this situation.

Figure 3.6. Conditional expectation w.r.t. σ(Y): Horizontal strips A = Y −1 ([a, b]) = {ω = (x, y) : a ≤ y ≤ b} generate σ(Y) and are the typical elements in σ(Y). The conditional expectation of X w.r.t. Y corresponds to averaging X w.r.t. P over such a thin horizontal strip.

The definition 3.5.1 hides an implicit theorem, namely that the conditional expectation is well defined in the sense that it exists and is almost surely unique. See (Billingsley, 1995) or (Williams, 1991). The conditional expectation is only "almost surely unique" since it is defined in terms of expectations, and therefore can be modified on an H-measurable set of P-measure 0 and still satisfy the definition. So whenever we write an equation involving realizations of the conditional expectation, we should really add the qualification "almost surely". We do not do this. In many situations, the conditional expectation can be made surely unique. For example, when the information stems from measurements of a continuous random variable Y such that H = σ(Y), then there may exist a continuous g such that E{X|H} = g(Y); in this case g is unique. This is reassuring, since from a modeller's perspective it would be worrying if conclusions depended discontinuously on an observed random variable, or were not uniquely defined. We will therefore assume that g is chosen to be continuous whenever possible. This allows us to use the notation

E(X|Y = y)

meaning “g(y) where g(Y ) = E(X|Y ) (w.p. 1) and g is taken to be continuous”.

3.5.1 Borel's paradox resolved

The root of Borel's paradox is that the event on which we condition, namely that the point (X, Y) lies on the x-axis, has probability 0. According to elementary probability, we cannot condition on events of probability 0. The paradox is resolved by formulating a precise model of the experiment, including the observation. This should make the statement "the point is observed to lie on the x-axis" mathematically precise, and this allows us to reach a definite answer. The sample space is the plane Ω = R2, the system of events is the Borel algebra B(R2), and the probability P is given by the two-dimensional standard Gaussian density, i.e.

P(A) = ∫_A (1/(2π)) e^(−(x² + y²)/2) dx dy   for A ∈ B(R2) .

We have the following random variables for ω = (x, y) ∈ R2 :

X(ω) = x ,  Y(ω) = y ,  S(ω) = x² + y² ,  and  Θ(ω) = ∠(x, y) .

You may want to verify that with this construction, X and Y are i.i.d. and N(0, 1). S is χ²(2), i.e. chi-squared distributed with two degrees of freedom, and Θ is uniform on [0, 2π). Finally, S and Θ are independent.

Next, we express the information available to the observer as a σ-algebra H, so that we condition not on Y = 0 or on Θ ∈ {0, π}, but on H. How does the observer know that the point lies on the x-axis, by observing Y or by observing Θ? In the first case, the relevant conditional expectation is E(S|Y). We obtain

E(S|Y) = E(X² + Y²|Y) = E(X²|Y) + E(Y²|Y) = 1 + Y²

since X and Y are independent, and since Y² is σ(Y)-measurable. Here, we use the properties of conditional expectations that are familiar from elementary probability, and which we will state shortly, in theorem 3.5.1.

Exercise 3.16: Verify from first principles that 1 + Y² satisfies all the requirements of E(S|Y) in definition 3.5.1.

It follows that E(S|Y) = 1 for any ω such that Y(ω) = 0 - disregarding the possibility that E(S|Y) is discontinuous at the x-axis. In short, E(S|Y = 0) = 1. Now, in the second case where we have measured Θ to be near 0 or π, the relevant conditional expectation is E(S|Θ). We find E(S|Θ) = 2 regardless of the observed value of Θ, because S and Θ are independent. You may want to verify this.

We see that both answers E(S|Y = 0) = 1 and E(S|Θ ∈ {0, π}) = 2 are correct, even though Y = 0 is the same event as Θ ∈ {0, π}. Simply stating that "the point lies on the x-axis" is not enough to choose between them. We must also describe the measurement process, i.e. specify the information H.

Figure 3.7. Borel's paradox resolved. Top panel: The set |Y| ≤ ε. Averaging S over this set, and letting ε → 0, yields E(S|Y = 0) = 1. Bottom panel: The set |Θ| ≤ ε ∨ |Θ − π| ≤ ε. Averaging S over this set, and letting ε → 0, yields E(S|Θ = 0 ∨ Θ = π) = 2. Note that the bottom set puts more weight on points far from the origin, i.e. with large S. Hence E(S|Θ = 0) > E(S|Y = 0).

This is not just mathematical sophism. From a practical point of view, we never measure that the point (X, Y) is on the x-axis; we measure that the point is so close to the axis that we cannot distinguish it from the axis. The difference between the two σ-algebras, and thus the two different results, arises from different models of the measurement uncertainty: In the first, we have a measurement error on Y so that

E(S | −ε < Y < ε) = 1 + O(ε) .

In the other, we have a measurement error on the angle Θ so that

E(S | −ε < Θ < ε) = 2 + O(ε) .

So the mathematician may appreciate that Borel's paradox is elucidated by the measure-theoretic foundation of probability. The engineer or applied scientist may appreciate that it is not enough to report an observation; we should also report the observation process and, in particular, the nature of the observation error.
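A quick Monte Carlo sketch (not part of the notes; the sample size and ε are arbitrary) illustrating the two limits above:

    import numpy as np

    rng = np.random.default_rng(2)
    N, eps = 2_000_000, 0.02
    X, Y = rng.standard_normal(N), rng.standard_normal(N)
    S = X**2 + Y**2
    Theta = np.arctan2(Y, X)                         # angle in (-pi, pi]

    near_axis_Y = np.abs(Y) < eps                    # measurement error on Y
    near_axis_T = np.minimum(np.abs(Theta), np.pi - np.abs(Theta)) < eps  # error on the angle

    print(S[near_axis_Y].mean())   # approx 1
    print(S[near_axis_T].mean())   # approx 2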

3.5.2 Properties of the conditional expectation

Some useful properties of conditional expectations are summarized in the following theorem. The proofs are fairly straightforward and a good exercise. Theorem 3.5.1. Given a probability space (Ω, F, P), a sub-σ-algebra H of F, and random variables X and Y such that E|X| < ∞ and E|Y | < ∞. Then: 1. E{aX + bY |H} = aE{X|H} + bE{Y |H} for a, b ∈ R (Linearity of conditional expectation) 2. Let G be a σ-algebra on Ω such that F ⊃ G ⊃ H. Then E[E{X|G}|H] = E[X|H]. This is the Tower property. 3. EE{X|H} = EX. This is a special case of the Tower property which we will make frequent use of. 4. E{X|H} = X if and only if X is H-measurable. 5. E{XY |H} = XE{Y |H} whenever X is H-measurable.

3.5.3 Conditional distributions and variances

Recall that in elementary probability we start by defining conditional probabilities and densities, and then derive conditional expectations from there. In the measure-theoretic approach, the conditional expectation of a random variable is the most fundamental among all the conditional ones. From the conditional expectation we can define other conditional statistics, such as

1. the conditional probability of an event: P(A|H) = E{1A|H},

2. the conditional distribution function of a random variable: FX|H(x) = P(X ≤ x|H),

3. the conditional density of a random variable: fX|H(x) = (d/dx) FX|H(x), wherever it exists, and

4. the conditional variance of a random variable X such that E|X|² < ∞: V{X|H} = E{X²|H} − (E{X|H})².

These conditional statistics are all H-measurable random variables. In particular, when H is generated by a random variable, Y , any of these statistics will be functions of Y . Exercise 3.17: Show that if X is H-measurable and E|X|2 < ∞, then V{X|H} = 0. It is useful to be able to manipulate conditional variances. Two fundamental formulas are the following: Let X and Y be random variables such that E|X|2 , E|Y |2 and E|XY |2 all are finite. If furthermore Y is H-measurable, then V{XY |H} = Y 2 V{X|H} and V{X + Y |H} = V{X|H} Exercise: Verify these two formulas. These formulas generalize the well known formulas for V(aX) = a2 VX, V (a + X) = VX where a is a constant. Exercise: Explain in which way this is a generalization. They can be understood in the way that given the information in H, Y is known and can hence be treated as if deterministic. We have a very useful decomposition formula, which is a variance version of the simple Tower property: VX = EV{X|H} + VE{X|H} .

(3.5)

In the next section, we shall see a different formulation of the same decomposition, interpreted in terms of estimators and estimation errors.

Exercise 3.18: Verify the variance decomposition formula.

Exercise 3.19: Let {Xi : i ∈ N} be a collection of independent random variables, each Gaussian distributed with mean µ = 1 and variance σ² = 2. Let N be a random variable, independent of all Xi, and Poisson distributed with mean λ = 5. Finally, define Y = Σ_{i=0}^{N} Xi. Determine the mean and variance of Y.
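Here is a small Monte Carlo check of the decomposition (3.5) in a toy model (not part of the notes, and deliberately not the model of Exercise 3.19):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 1_000_000
    Y = rng.standard_normal(n)
    X = Y**2 + rng.standard_normal(n)       # given Y, X ~ N(Y^2, 1)

    # E{X|Y} = Y^2 and V{X|Y} = 1, so (3.5) predicts VX = E[1] + V(Y^2) = 1 + 2 = 3.
    print(X.var())                          # approx 3
    print(1.0 + (Y**2).var())               # approx 3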

3.6 Linear spaces of random variables

We should now be used to the thought that random variables are functions defined on sample space. One useful consequence of this setup is that many standard results from analysis of functions can be applied to random variables. When working with functions, it is useful to restrict oneself to functions with simplifying properties - for example, continuous functions or bounded functions. The same applies when

working with random variables: It is convenient to consider only random variables within a particular linear space. First, it is easy to see that all random variables (on a given sample space Ω, w.r.t. a given σ-algebra F) form a linear (vector) space. That is to say: If X1 and X2 are random variables defined on Ω and c1 and c2 are real numbers, then also X : Ω 7→ R given by X(ω) = c1 X1 (ω) + c2 X2 (ω) for ω ∈ Ω is a random variable. Exercise: Show that X defined in this way is measurable w.r.t. F. Hint: Show first that X −1 ([a, b]) ∈ F for a, b ∈ R. This linear space can be equipped with a norm. A norm which is convenient for random variables is L2 , i.e. root mean square:

‖X‖2 = √(EX²) = √( ∫_Ω X²(ω) P(dω) )

The linear space L2(Ω, F, P) denotes those random variables on (Ω, F, P) for which this norm is finite. In many applications, this space is large enough to contain all variables of interest, yet the space has many nice properties. Most importantly, the norm can be written in terms of an inner product, i.e. ‖X‖2 = √⟨X, X⟩ where ⟨X, Y⟩ = EXY. This means that many results from standard Euclidean geometry apply, which is extremely powerful. For example, we have the following result relating mean-square estimation, conditional expectation, and orthogonal projection in L2:

,

ˆ + VX ˜ VX = VX

ˆ is the least squares estimator of X, i.e. if Z is any L2 random variable which is 4. X ˆ − X)2 . measurable w.r.t. H, then E(Z − X)2 ≥ E(X Proof. That the estimator is unbiased follows directly from the Tower property: 53

ˆ = EE{X|H} = EX EX To show that estimator and estimation error are uncorrelated:

˜X ˆ = E(X − X) ˆ X ˆ EX ˆ X|H} ˆ = EE{(X − X) h i ˆ ˆ = E XE{X − X|H} h i ˆ ·0 =E X =0 ˆ − X|H} = X ˆ − E{X|H} = 0. The decomposition of 2-norms (or variance) follows since E{X directly from this orthogonality, and is essentially the variance decomposition we established in the previous section. Finally, let Z be H-measurable. Then

ˆ + (X ˆ − X))2 E(Z − X)2 = E((Z − X) ˆ 2 + E(X ˆ − X)2 + 2E[(Z − X)( ˆ X ˆ − X)] = E(Z − X) ˆ is a standard trick, which appears on several Adding and subtracting the candidate solution X occasions when working in L2 , also in filtering and optimal control. Now, for the last term we have

ˆ X ˆ − X)] = EE{(Z − X)( ˆ X ˆ − X)|H} E[(Z − X)( h i ˆ ˆ − X|H} = E (Z − X)E{ X h i ˆ ·0 = E (Z − X) =0 ˆ are H-measurable. It follows that since Z and X ˆ 2 + E(X ˆ − X)2 ≥ E(X ˆ − X)2 E(Z − X)2 = E(Z − X) ˆ w.p. 1. and we see that equality holds if and only if Z = X

This is a projection result in the following sense: The information H defines a linear sub-space of L2 , namely those random variables which are H-measurable. The random variable X can 54

ˆ resides in this linear sub-space, now be decomposed into two orthogonal terms: The one, X, ˜ resides in the orthogonal complement. while the other, X, Although the L2 space is convenient, there are situations where we do not want to restrict ourselves to random variables with finite variance. In those situations we may consider in stead the L1 -norm Z |X(ω)| P(dω)

kXk1 = E|X| = Ω

and the space L1 (Ω, F, P) of those random variables which have finite L1 -norm. More generally, we have the Lp norms:

p 1/p

kXkp = (E|X |)

Z

p

1/p

|X(ω)| P(dω)

= Ω

which each define Lp (Ω, F, P), i.e. a linear space of those real-valued random variables for which the norm in question is finite. When we let p → ∞, we obtain the L∞ -norm kXk∞ = ess sup |X(ω)| ω∈Ω

The “ess” stands for “essential” and indicates that X(ω) is allowed to exceed kXk∞ on an event of probability 0. Exercise 3.20: Show that if p ≥ q ≥ 1 and X ∈ Lp , then X ∈ Lq .

3.7

Conclusion

In mathematical analysis, non-measurable sets and non-measurable functions can be regarded as esoteric phenomena: Few people outside the mathematics departments know of their existence. Even at the mathematics departments, only very few people take enough interest in them to actually study them. It is a strength of measure theory that the classes of measurable sets and functions are large enough to contain everything one will ever encounter in applications. Also in probability, the Borel algebra may be regarded as a sufficiently large collection of sets, and it has no practical implications that there are subsets of R which are not Borel. However, as soon as one takes partial information into account in the form of a sub-σ-algebra H, then it becomes an everyday phenomenon to encounter events or random variables which are not measurable w.r.t. H. So when we require that random variables are measurable w.r.t. F, it can be considered a technical condition which does not limit the applicability of the theory. On the other hand, if a given random variable X is H-measurable, then this states that the observer who has access to the information H is also able to determine the realized value of X. 55

Technical term

Interpretation

Basic σ-algebra F

All events, i.e. all statements about the outcome of the stochastic experiment that we consider.

Information σ-algebra H ⊂ F

The information available to an observer; i.e. the questions that he can answer.

X is H-measurable

The information in H is enough to determine the realized value of X.

H = σ(X)

The information in H is (or could be) obtained by observing X.

G⊂H

Any question that can be answered with G can also be answered with H.

X ∈ L1

X has finite mean.

X ∈ L2

X has finite mean square (and thus finite variance)

X⊥Y

X and Y are uncorrelated (and are L2 ).

X⊥ ⊥Y

X and Y are independent. Table 3.1. A summary of technical terms and their interpretation

56

In the teaching of probability, it is a on-going debate when, if at all, students should be introduced to the measure-theoretic foundation. The elementary approach to probability, i.e. without measure theory, is sufficient for many applications, both in statistics and in stochastic processes; for example Markov chains. And the measure theoretic language, and train of thought, takes time to get used to and even more time to master! For most students in science and engineering, the time is better spent with issues that relate more directly to applications. However, for continuous-time continuous-space processes, such as diffusion processes, the elementary approach is not firm enough. To develop the theory of diffusion processes, and the stochastic differential equations that govern them, we need the axiomatic and rigorous foundation of measure theory. Even students who focus on applications, will one day need to read a journal article which uses the measure-theoretic language. In this chapter, we have not constructed this foundation brick by brick. That construction is involved enough to require the better part of a book (such as (Billingsley, 1995) or (Williams, 1991)) and the better part of a course. But at least we have introduced the language and outlined the principles, and this allows us to develop the theory of stochastic differential equations using the standard terminology, which is measure-theoretic. Some more elements of the construction are provided in the following notes.

3.8 Notes and references *

The material in this chapter is standard text-book material, mainly taken from (Billingsley, 1995; Royden, 1988; Williams, 1991). Below we provide a collection of results which will be useful in later chapters. It may be skipped at a first reading without disturbing the flow, but we will refer to the material later.

Monotonicity

Convergence of a sequence can be a tricky business, but convergence of a monotonic sequence is easy. Consider first sequences of sets. Let {An : n ∈ N} be an increasing sequence of events, that is A1 ⊂ A2 ⊂ · · ·. In that case we can define the limit as

  lim_{n→∞} An = ⋃_{n∈N} An .

For a decreasing sequence of sets, i.e. when A1 ⊃ A2 ⊃ · · ·, we define the limit as ⋂_{n∈N} An.

Lemma 3.8.1. If {An : n ∈ N} is an increasing, or decreasing, sequence of sets, then

  P(lim_{n→∞} An) = lim_{n→∞} P(An) .

Proof. Consider first the case of an increasing sequence. Introduce the set differences D1 = A1, Dn = An \ An−1 for n ≥ 2. Then {Dn : n ∈ N} is a disjoint sequence of sets such that

  lim_{n→∞} An = ⋃_{n∈N} Dn .

The countable additivity of probability measures then implies that

  P(lim_{n→∞} An) = Σ_{n∈N} P(Dn) = lim_{n→∞} Σ_{m=1}^{n} P(Dm) = lim_{n→∞} P(An) ,

which is what we had to show; the last equality uses that D1, . . . , Dn partition An. Finally, if the sequence {An : n ∈ N} is decreasing, then we consider instead the sequence {An^c : n ∈ N}, which is increasing.

For sequences of random variables, we have the Monotone Convergence Theorem:

Theorem 3.8.2. Given random variables X and {Xi : i ∈ N} on a probability space (Ω, F, P) such that for almost all ω:

  0 ≤ X1(ω) ≤ X2(ω) ≤ · · · and Xi(ω) → X(ω) as i → ∞ .

Then EXi → EX. Note that the limit may be EX = ∞.

A proof can be found in (Williams, 1991).

Tail events and the Borel-Cantelli lemmas

We consider a sequence of events {An : n ∈ N} on a probability space (Ω, F, P). We are interested in tail events, i.e. those that concern the asymptotics of the sequence {An}:

Definition 3.8.1 (Tail σ-algebra). Given a sequence {An : n ∈ N} of events, the tail σ-algebra is

  T := ⋂_{m∈N} σ({An : n ≥ m}) .

So a tail event is one which can be resolved from the tail sequence {An : n ≥ m} for any m ∈ N. Convergence of random variables is the most important situation in which tail events are of interest. A particular tail event is "eventually", also called the set limit inferior:

  lim inf_{n→∞} An = {An, eventually} := ⋃_{m∈N} ⋂_{n≥m} An .

For a particular outcome ω, the events An occur eventually if there exists an m (which depends on ω) such that ω ∈ An for all n ≥ m. The notation lim inf deserves an explanation: Recall the definition of lim inf for a function f : R ↦ R:

  lim inf_{x→∞} f(x) = lim_{y→∞} inf_{x≥y} f(x) ,

and recall that the function y ↦ inf_{x≥y} f(x) is an increasing function which, for each y, provides a lower bound on the tail of f. In the same way, Bm := ⋂_{n≥m} An is an increasing sequence of sets which is a lower bound on the tail, in the sense that it is contained in all An for n ≥ m. Taking the union of all Bm for m ∈ N corresponds to taking the limit m → ∞, as discussed above. (Another connection is through indicator functions: 1(ω ∈ lim inf An) = lim inf 1(ω ∈ An).) We have the following easy result:

Lemma 3.8.3. If lim inf_{n→∞} P(An) = 0, then P(lim inf_{n→∞} An) = 0.

Proof. Let Bm = ⋂_{n≥m} An be as above. Then P(Bm) ≤ P(An) for all n ≥ m, so P(Bm) ≤ inf_{n≥m} P(An) ≤ lim inf_{n→∞} P(An) = 0. Hence P(Bm) = 0 for all m, and also P(⋃_{m∈N} Bm) = 0, from which the conclusion follows.

Another tail event is "infinitely often", also called the set limit superior:

  lim sup_{n→∞} An = {An, infinitely often} := ⋂_{m∈N} ⋃_{n≥m} An .

For a particular outcome ω, the events An occur infinitely often if for any m ∈ N there exists an n ≥ m such that ω ∈ An.

Exercise 3.21 Set limit inferior and superior:

1. Show that lim inf_{n→∞} An ⊂ lim sup_{n→∞} An holds for any sequence {An : n ∈ N} of events.

2. Construct an example of a sequence {An : n ∈ N} such that lim inf_{n→∞} An = ∅ while lim sup_{n→∞} An = Ω.

3. Construct an example of a sequence {An : n ∈ N} such that lim inf_{n→∞} An = lim sup_{n→∞} An. The example should not be too trivial, i.e. all the An should not be identical. Note: In this case we define the set limit

  lim_{n→∞} An = lim inf_{n→∞} An = lim sup_{n→∞} An .

4. Show that lim inf_{n→∞} An^c = (lim sup_{n→∞} An)^c.

Since the set limit superior is larger than the set limit inferior, the condition P(An) → 0 is not sufficient for P(lim sup_{n→∞} An) = 0, as the following example shows:

Exercise 3.22: Let Ω = [0, 1], let F be the usual Borel algebra, and let the measure P be Lebesgue measure, i.e. length. Then consider the sequence An given by

  [0, 1],
  [0/2, 1/2], [1/2, 2/2],
  [0/4, 1/4], [1/4, 2/4], [2/4, 3/4], [3/4, 4/4],
  · · ·

Show that for this sequence P(An) → 0 while P(lim sup An) = 1.

In many situations related to convergence, the event An is a "bad" event and we want to ensure that there is probability 0 that the events An occur infinitely often. Equivalently, the event An^c is a "good" event and we want to make sure that with probability 1 the events An^c occur eventually. For this a useful result is the (first) Borel-Cantelli lemma:

Lemma 3.8.4 (Borel-Cantelli I). If Σ_{n=1}^∞ P(An) < ∞, then P(lim sup_{n→∞} An) = 0.

Proof. Let Cm = ⋃_{n≥m} An. Then

  P(Cm) ≤ Σ_{n≥m} P(An)

and hence P(Cm) → 0 as m → ∞. Finally, since lim sup_{n→∞} An = ⋂_{m∈N} Cm, it follows that

  P(lim sup_{n→∞} An) ≤ P(Cm) for all m ∈ N

and therefore P(lim sup_{n→∞} An) = 0.

Given this lemma, it is reasonable to ask what can be concluded if the sum Σ_{n=1}^∞ P(An) diverges. Without further qualification, not much can be said. Exercise: Construct an example of a sequence such that Σ P(An) diverges and where lim inf_{n→∞} An = Ω, and one where lim sup_{n→∞} An = ∅. However, if we add that the events are independent, then a much stronger conclusion can be drawn:

Lemma 3.8.5 (Borel-Cantelli II). Let An be a sequence of independent events such that Σ_{n=1}^∞ P(An) = ∞. Then P(lim sup_{n→∞} An) = 1.

Proof. (Williams, 1991) Let Cm = ⋃_{n≥m} An. We aim to show that P(Cm) = 1 for all m; equivalently that

  P(Cm^c) = 0

for all m. But

  Cm^c = ⋂_{n≥m} An^c .

Using independence, and the fact that exp(x) ≥ 1 + x for all x, we obtain (writing pn = P(An))

  P(Cm^c) = Π_{n≥m} P(An^c) = Π_{n≥m} (1 − pn) ≤ Π_{n≥m} exp(−pn) = exp(−Σ_{n≥m} pn) = 0

since the sum diverges.

Properties of the expectation

Here, we simply state for reference a number of properties of the expectation. You are probably familiar with most of them, if not all, already.

Theorem 3.8.6.

1. (Linearity) Let a, b ∈ R and let X, Y be random variables with E|X| < ∞, E|Y| < ∞. Then E(aX + bY) = aEX + bEY.

2. (Markov inequality) Let X be a non-negative random variable and let c ≥ 0. Then EX ≥ c · P(X ≥ c).

3. (Jensen's inequality) Let X be a random variable with E|X| < ∞ and let g : R ↦ [0, ∞) be convex. Then Eg(X) ≥ g(EX).

4. (Fatou's lemma) Let {Xn : n ∈ N} be a sequence of non-negative random variables. Then E lim inf_{n→∞} Xn ≤ lim inf_{n→∞} EXn.

5. (Schwarz inequality) Let X, Y be random variables such that E|X|² < ∞, E|Y|² < ∞. Then

  |EXY| ≤ E|XY| ≤ √(EX² · EY²)

or, in L2 terminology, |⟨X, Y⟩| ≤ ⟨|X|, |Y|⟩ ≤ ‖X‖₂ · ‖Y‖₂.

3.9 Additional exercises

Fundamental probability

Exercise 3.23 Independence vs. pairwise independence: Let X and Y be independent and identically distributed Bernoulli variables taking values on {−1, 1} with probability parameter p = 1/2, i.e. P(X = −1) = P(X = 1) = 1/2. Let Z = XY. Show that X, Y and Z are pairwise independent, but not all independent.

Conditioning

Exercise 3.24 Conditional expectation, graphically: Consider the probability space (Ω, F, P) with Ω = [0, 1]², F the usual Borel algebra on Ω, and P the Lebesgue measure, i.e. area. For ω = (x, y) ∈ Ω, define X(ω) = x, Y(ω) = y, and Z(ω) = x + y.

1. Sketch level sets (contour lines) for X, Y, Z, E{Z|X} and E{X|Z}.

2. Define and sketch (continuous) g and h such that E{Z|X} = g(X) and E{X|Z} = h(Z).

Exercise 3.25: Fred rolls a die and observes the outcome. He tells George and Harry whether the number of eyes is odd or even. He also tells George whether the number of eyes is greater or smaller than 3.5. He then asks George and Harry to estimate the number of eyes (using conditional expectation). For each outcome, what is George's estimate? What is Harry's estimate? What is Harry's estimate of George's estimate? What is George's estimate of Harry's estimate?

Exercise 3.26 More information implies less variance, on average: Let X be a random variable on (Ω, F, P) such that VX < ∞, and let H ⊂ G be sub-σ-algebras of F. Then it holds that

  E[V{X|G}|H] ≤ V{X|H} .

1. Explain in words this statement, and why it is plausible.

2. Show that the statement is true. Hint: Decompose X into the three terms E{X|H}, E{X|G} − E{X|H}, and X − E{X|G}, and establish orthogonality properties of these terms.

3. Construct an example for which V{X|G} > V{X|H} is an event with positive probability.

4. Show the more general variance version of the Tower property:

  V{X|H} = E{V{X|G}|H} + V{E{X|G}|H} .


Figure 3.8. Venn diagram illustrating conditional independence. Probability corresponds to surface area. A and B are conditionally independent given Y, but not conditionally independent given ¬Y.

Exercise 3.27: Show the following result: Let X and Y be jointly Gaussian distributed stochastic variables taking values in Rm and Rn with mean µX and µY, respectively, and with VX = P, Cov(X, Y) = Q, VY = R. Assume R > 0. Then

  E{X|Y} = µX + QR⁻¹(Y − µY) and V{X|Y} = P − QR⁻¹Q′ .

Finally, the conditional distribution of X given Y is Gaussian.

Exercise 3.28 Conditional independence: We say that two events A and B are conditionally independent given a third event Y, if

  P{A ∧ B|Y} = P{A|Y} · P{B|Y} .

P{A ∧ B|G} = P{A|G} · P{B|G} 63

(almost surely; recall that all these terms are G-measurable random variables). We say that two random variables X and Y are conditionally independent given a σ-algebra G, if the events {X ∈ A} and {Y ∈ B} are conditionally independent given G, for any choice of A and B.

Assume that the joint density of three real-valued random variables X, Y, Z can be written as

  f(x, y, z) = fZ(z) · fX|Z(x, z) · fY|Z(y, z) .

Show that X and Y are conditionally independent given (the σ-algebra generated by) Z. Give an example to show that the events X ∈ A and Y ∈ B are not necessarily conditionally independent given any event G ∈ σ(Z). Hint: You may want to choose an example where G = Ω; i.e. conditional independence does not imply unconditional independence.
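Returning to exercise 3.27: the conditioning formulas can also be checked numerically. The following is a minimal Monte Carlo sketch (not part of the original exercise) for the scalar case m = n = 1, with arbitrary illustrative values of µX, µY, P, Q and R; the regression of X on Y estimates E{X|Y}, and the residual variance estimates V{X|Y}.

```python
import numpy as np

# Monte Carlo check of the Gaussian conditioning formulas (exercise 3.27), scalar case.
rng = np.random.default_rng(0)
muX, muY = 1.0, -2.0
P, Q, R = 4.0, 1.5, 2.0                       # VX, Cov(X, Y), VY; note Q**2 < P*R

X, Y = rng.multivariate_normal([muX, muY], [[P, Q], [Q, R]], size=200_000).T

slope_theory, var_theory = Q / R, P - Q**2 / R     # E{X|Y} slope and V{X|Y}
slope_hat = np.cov(X, Y)[0, 1] / np.var(Y)
resid = X - (muX + slope_hat * (Y - muY))

print(slope_theory, slope_hat)      # both approximately 0.75
print(var_theory, np.var(resid))    # both approximately 2.875
```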

The Gaussian distribution

Exercise 3.29 Moments in the Gaussian distribution: Consider a standard Gaussian variable, X ∼ N(0, 1). Show that the moments of X are given by the following formula:

  E|X|^p = √(2^p/π) Γ(p/2 + 1/2) .

Hint: Write up the integral defining the moments, use symmetry, and substitute u = ½x². Recall the definition of the gamma function Γ(x) = ∫_0^∞ t^{x−1} e^{−t} dt.

In particular, show that E|X| = √(2/π) ≈ 0.798, E|X|² = 1, E|X|³ = √(8/π), E|X|⁴ = 3. Double-check the results by Monte Carlo simulation.
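A minimal Monte Carlo double-check of the moment formula might look as follows (a sketch; the sample size and seed are arbitrary choices):

```python
import numpy as np
from math import gamma, pi, sqrt

# Check E|X|^p = sqrt(2**p / pi) * Gamma(p/2 + 1/2) for X ~ N(0, 1), p = 1, ..., 4.
rng = np.random.default_rng(1)
X = rng.standard_normal(1_000_000)

for p in (1, 2, 3, 4):
    theory = sqrt(2**p / pi) * gamma(p / 2 + 1 / 2)
    estimate = (np.abs(X) ** p).mean()
    print(f"p = {p}: theory = {theory:.4f}, Monte Carlo = {estimate:.4f}")
```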


CHAPTER 4

Brownian motion and other stochastic processes

A stochastic process is an indexed collection of random variables Xt(ω), where ω ∈ Ω, as always, represents the outcome of the stochastic experiment while the index t in our context represents time. Our primary interest is the continuous-time case where t ∈ R, and our main example in this chapter is Brownian motion. We construct Brownian motion mathematically and examine some of its many interesting properties - first and foremost, the fact that Brownian motion scales with the square root of time. Brownian motion is a prime example of two important classes of processes:

1. Markov processes, section 4.4. These include diffusion processes, i.e. the solutions to stochastic differential equations, and are intimately related to the paradigm of state space models.

2. Martingales, section 4.5. These are generalized unbiased random walks and are instrumental in the theory of stochastic differential equations, and indeed in modern probability.

In the following chapter, we will focus on stationary processes, where the statistics of the process are time-invariant. Even if this stationarity is a special situation, it is greatly simplifying, common in applications, and provides an important frame of reference. The relationship between these classes of processes is shown graphically in figure 4.1.

Figure 4.1. The basic classes of processes considered. "B.M." is Brownian motion.

Stochastic processes involve an infinite number of random variables and therefore require a more complicated sample space Ω than do e.g. statistical models with only a finite number of variables. The sample space Ω is typically either a space of sequences ω : N ↦ Rⁿ, or a space of functions ω : R̄₊ ↦ Rⁿ. It is the complexity of these sample spaces that requires and justifies the rigor and precision of the measure-theoretic approach to probability.

The notion of information is crucial in stochastic processes, and typically there is a qualitative difference between information available about past events and about future events. To formalize this, we need a notion of how information develops over time. This is the concept of filtrations of the sample space, section 4.3. This is a key ingredient in the development of Markov processes and martingales, and also in the applications involving prediction, estimation or hindcasting which we will turn to in later chapters.

4.1 Discrete-time Gaussian white noise

The objective of this section is to illustrate with an example how we may construct a stochastic process from its statistics. The word construct indicates that we need to define an underlying probability space, so that the stochastic process is given as a family of random variables on this probability space. The example is discrete-time white noise, a simple process that consists of a sequence of i.i.d. (independent and identically distributed) random variables {Xi : i ∈ N}. We take the distribution of each to be Gaussian:

  Xi ∼ N(0, 1) .

One sample path of a Gaussian white noise process is shown in figure 4.2. Simulating and visualizing a stochastic process is an efficient way to understand its behavior before going into mathematical analysis; this should not be underestimated.

Figure 4.2. A simulated sample path of discrete-time Gaussian white noise.

To construct Gaussian white noise as a stochastic process, the obvious choice is to take Ω = R^N, i.e. an outcome ω is a sequence

  ω = (x1, x2, . . .)

where xi ∈ R for any i ∈ N. We can then identify the outcome ω with the sample path of the process. More precisely, the stochastic process {Xi : i ∈ N} is defined as

  Xi(ω) = xi  where ω = (x1, x2, . . .) and i ∈ N.

This is the analogue of the standard set-up for a single real-valued variable X, where we take Ω = R and define X(ω) = ω. It remains to define the σ-algebra and the probability measure P on Ω. The σ-algebra F is the smallest σ-algebra which contains all (cylinder) sets of the form {ω : Xi(ω) ∈ B} for any i ∈ N and any Borel set B ⊂ R; this is the minimal way of making sure that P(Xi ∈ B) is well defined. We may now define the probability measure P by two requirements:

1. Xi ∼ N(0, 1) for any i ∈ N, i.e. P(Xi ∈ B) = ∫_B dΦ(x).

2. For any n ∈ N and any n distinct natural numbers (i1, i2, . . . , in) ⊂ N, the random variables Xi1, Xi2, . . . , Xin are independent.

At first, it may not seem obvious that these two properties define P(A) for any event A ∈ F. After all, F is a fairly large collection of sets which each can be quite complicated. But any event A ∈ F can be formed by a countable number of operations of union and intersection applied to the fundamental events {Xi ∈ B}. For each of these operations, the probability of the resulting event can be found from the two requirements (Gaussianity and independence) in combination with the general properties of probability measures (see definition 3.1.2). We see that our definition of the σ-algebra is consistent with our definition of the probability measure. So P(A) is in fact defined, and unambiguously so, for each event A ∈ F.

Exercise 4.30: Check that with this construction, the statement "the sample path contains a finite number of positive values" is an event which has probability 0. Hint: If you wish, you may at first show that for any N ∈ N, the statement "Xi is negative for all i > N" is an event with probability 0.

We may now perform operations on the white noise sequence and thereby obtain a new stochastic process. A simple and important example of this is the cumulative sum {Bi : i ∈ N0} given by:

  B0 = 0 ,  Bi = Σ_{j=1}^{i} Xj for i ∈ N .

This is a discrete-time random walk; in fact, it is Brownian motion sampled at integer time points. This process is an approximation to Brownian motion; a simulated sample path is shown in figure 4.3. Notice how we may keep the underlying probability triple constant, while defining more and more stochastic processes on this probability space. Of course, any property of these processes can be derived from the properties of the elementary underlying probability space.
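As a concrete illustration (a sketch, not part of the original text), the white noise sequence and its cumulative sum can be simulated in a few lines; Python with NumPy and matplotlib is assumed here, and the sample size is an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate discrete-time Gaussian white noise X_1, ..., X_n and the random walk
# B_0 = 0, B_i = X_1 + ... + X_i, cf. the construction above.
rng = np.random.default_rng(42)
n = 20
X = rng.standard_normal(n)
B = np.concatenate(([0.0], np.cumsum(X)))

plt.plot(range(1, n + 1), X, "o", label="white noise $X_i$")
plt.plot(range(n + 1), B, "o-", label="random walk $B_i$")
plt.xlabel("Time i"); plt.legend(); plt.show()
```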

4.2 Brownian motion

We now return to Brownian motion as described in section 2.3, aiming to construct it properly as a stochastic process. Brownian motion is a key process in the study of stochastic differential equations, for (at least) two reasons: First, it is the simplest example of a diffusion process, in that there is no advection and the diffusivity is constant. Stated differently, it is the solution of the simplest stochastic differential equation, and therefore serves as the main illustrative example. Second, as we outlined in section 2.6, a general stochastic differential equation involves an integral, the Itô integral, with respect to Brownian motion. Knowing Brownian motion is necessary to understand the behavior of this integral.

Figure 4.3. A simulation of the interpolated cumulative sum of discrete-time Gaussian white noise. When restricted to integer time points, this agrees with Brownian motion. The linear interpolation makes the process an approximation to Brownian motion.

Definition 4.2.1 (Brownian motion). Brownian motion on R is a stochastic process {Bt : t ≥ 0} on some probability space (Ω, F, P), which satisfies the following properties:

1. The process starts at B0 = 0.

2. The increments of Bt are independent. Specifically, let time points 0 ≤ t0 < t1 < t2 < . . . < tn be given and let the corresponding increments be ∆Bi = Bti − Bti−1 where i = 1, . . . , n. Then these increments ∆B1, . . . , ∆Bn are independent.

3. The increments are Gaussian with mean 0 and variance equal to the time lag: Bt − Bs ∼ N(0, t − s) whenever 0 ≤ s ≤ t.

4. For almost all realizations ω, the sample path t ↦ Bt(ω) is continuous.

Sometimes we also use the word Brownian motion to describe the shifted-and-scaled process αBt + β for α, β ∈ R. In that case we call the case α = 1, β = 0 standard Brownian motion. Similarly, if B0 = β ≠ 0, then we speak of Brownian motion starting at β.

Although we now know the defining properties of Brownian motion, it is not yet clear if there actually exists a process with these properties. Fortunately, we have the following theorem:

Theorem 4.2.1. Brownian motion exists. I.e., there exists a probability triple (Ω, F, P) and a stochastic process {Bt : t ≥ 0} which together satisfy the conditions in definition 4.2.1.

In many situations, we do not need to know what the probability triple is, but it can be illuminating. The standard choice is to take Ω to be the space C([0, ∞)) of continuous real-valued functions R̄₊ ↦ R, and to identify the realization ω with the sample path of the Brownian motion, i.e. Bt(ω) = ω(t). The σ-algebra F is the Borel algebra on C([0, ∞)), the smallest σ-algebra which makes Bt measurable for each t ≥ 0. The probability measure P is fixed by the statistics of Brownian motion. This construction is called canonical Brownian motion. The probability measure P on C([0, ∞)) is called Wiener measure, after Norbert Wiener.

Biography: Norbert Wiener (1894-1964). An American wonder kid who obtained his Ph.D. degree at the age of 17. His contributions to mathematics were both pure and applied. He introduced the characterization of stochastic processes in terms of probability measures on spaces of sample paths. His work on Brownian motion (1926) explains why this process is often referred to as the "Wiener process". During World War II he worked on automation of anti-aircraft guns; this work led to what is now known as the Wiener filter for noise removal. He fathered the theory of "cybernetics", which formalized the notion of feed-back, and stimulated work on artificial intelligence.

This construction agrees with the interpretation we have offered earlier: Imagine an infinite collection of Brownian particles released at the origin at time 0. Each particle moves along a continuous trajectory; for each possible continuous trajectory there is a particle which follows that trajectory. Now pick one of these particles at random. The statistical properties of Brownian motion specify what we mean by a "random" particle. We will elaborate on the construction of Brownian motion in section 4.7.3, but for now we shall simply be satisfied with the claim that it exists, and instead turn to some of its several important properties.

4.2.1 Properties of Brownian motion

Finite-dimensional distributions

A stochastic process involves an infinite number of random variables. To characterize its distribution, one must restrict attention to a finite number of variables. The so-called finite-dimensional distributions do exactly this. Take an arbitrary finite number n and n time points 0 ≤ t1 < t2 < · · · < tn; then we must specify the joint distribution of the vector random variable

  B̄ = (Bt1, Bt2, . . . , Btn) .

For Brownian motion we get that the distribution of this vector is Gaussian with mean

  EB̄ = (0, 0, . . . , 0)

and covariance

  E B̄ᵀB̄ =
    [ t1  t1  · · ·  t1 ]
    [ t1  t2  · · ·  t2 ]
    [  ⋮   ⋮   ⋱    ⋮  ]
    [ t1  t2  · · ·  tn ]

The expression for the covariance can be summarized with the statement that Cov(Bs, Bt) = s whenever 0 ≤ s ≤ t.

Exercise 4.31: Prove that this is the joint distribution of B̄. Hint: Start by using the properties of the increments to find the distribution of the vector (Bt1, Bt2 − Bt1, . . . , Btn − Btn−1). Then use that B̄ can be found from this vector by multiplying with the n-by-n matrix A with entries

  Aij = 1 if i ≥ j, and Aij = 0 else.

Exercise 4.32 Simulation of Brownian motion: Write a function which simulates Brownian motion on an interval [0, T]. The function should take as input a partition 0 ≤ t1 < · · · < tn = T, and should then sample the increments Bti − Bti−1 ∼ N(0, ti − ti−1) for i = 1, . . . , n. Finally it should compute and return B0, Bt1, Bt2, . . . , Btn−1, BT. Test the function by simulating sufficiently many replicates of (B0, B1/2, B3/2, B2) to verify the covariance of this vector, and the distribution of B2. Save the function for future use; a possible sketch is given below. Note: Other methods for simulating Brownian motion will be given in the exercises at the end of the chapter.

Exercise 4.33 Brownian motion is continuous in the mean square: Show that Brownian motion is continuous in the mean square at any t ≥ 0, i.e.

  lim_{h→0} E|Bt+h − Bt|² = 0 .
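A possible sketch for exercise 4.32 (one of many reasonable implementations; the function name rBM, the convention that the input partition includes t0 = 0, and the test values are illustrative choices):

```python
import numpy as np

def rBM(t, rng=None):
    """Simulate Brownian motion at the time points in t, with t[0] == 0."""
    rng = np.random.default_rng() if rng is None else rng
    dB = rng.normal(scale=np.sqrt(np.diff(t)))      # increments ~ N(0, t_i - t_{i-1})
    return np.concatenate(([0.0], np.cumsum(dB)))   # B_0 = 0, then the partial sums

# Quick test: replicates of (B_0, B_1/2, B_3/2, B_2). The sample covariance should be
# close to Cov(B_s, B_t) = min(s, t), and the variance of B_2 close to 2.
t = np.array([0.0, 0.5, 1.5, 2.0])
B = np.array([rBM(t) for _ in range(10_000)])
print(np.cov(B, rowvar=False))
```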

Self-similarity

Brownian motion is a self-similar process: if we rescale time, we can also rescale the motion so that we recover the original process. Specifically, if Bt is Brownian motion, then so is α⁻¹Bα²t, for any α > 0. Exercise: Verify this claim. This means that Brownian motion itself possesses no characteristic time scales, which makes it an attractive component in models. Notice that the rescaling is linear in space and quadratic in time, in agreement with the scaling properties of diffusion (section 2.1.3 and, in particular, figure 2.4).

A graphical signature of self-similarity is seen in figure 4.4. The sample paths themselves are not self-similar, i.e. they each appear differently under the three magnifications. However, they are statistically indistinguishable. If the axis scales had not been provided in the figure, it would not be possible to infer the zoom factor from the panels.

A useful consequence of self-similarity is that the moments of Brownian motion also scale with time:

  E|Bt|^p = E|√t B1|^p = t^{p/2} · E|B1|^p .

Numerical values can be found for these moments (exercise 3.29). But in many situations these numerical values are not important, only the scaling relationships t^{p/2}. To remember these scaling relationships, it is useful to keep in mind that the physical unit of Brownian motion is the square root of time; then the scalings follow from dimensional analysis.

Properties of the increments

From the definition of Brownian motion, it is clear that the increments Bt+h − Bt are stationary (i.e. follow the same distribution regardless of t ≥ 0 for fixed h), have mean 0, and are independent for non-overlapping intervals. These properties are key in the analysis of Brownian motion, and also imply that Brownian motion is a prime example of a Markov process and a martingale, as we shall see shortly.

Figure 4.4. Self-similarity of Brownian motion. The three panels show the same realizations of Brownian motion, but at three different magnifications.

However, the independence of increments also implies that Brownian motion is, although continuous, a very erratic process. This is evident from figure 4.4. One mathematical expression of this feature is that Brownian motion has unbounded total variation. To explain this, consider Brownian motion on the interval [0, 1], and consider a partition of this interval:

Definition 4.2.2 (Partition of an interval). Given an interval [S, T], we define a partition ∆ as an increasing sequence S = t0 < t1 < · · · < tn = T. For a partition ∆, let #∆ be the number of sub-intervals, i.e. #∆ = n, and let the mesh of the partition be the length of the largest sub-interval, |∆| = max{ti − ti−1 : i = 1, . . . , n}. Define the sum

  V∆ = Σ_{i=1}^{#∆} |Bti − Bti−1| .

We call this a discretized total variation associated with the partition ∆. We define the total variation of Brownian motion on the interval [0, 1] as the limit in probability V = lim sup_{|∆|→0} V∆ as the partition becomes finer so that its mesh vanishes, whenever the limit exists. Then it can be shown that V = ∞, w.p. 1, which agrees with the discrete-time simulation in figure 4.5.


Figure 4.5. Estimating the variation (left panel) and the quadratic variation (right panel) of one sample path of Brownian motion on [0, 1] with discretization. Notice that as the time discretization becomes finer, the total variation appears to diverge to infinity, while the quadratic variation appears to converge to 1. These observations agree with the theoretical analysis.


Exercise 4.34: Consider a regular partition ∆ = {0, 1/n, 2/n, . . . , 1} where n ∈ N. Show that EV∆ = √(2n/π), so that EV∆ → ∞ as n → ∞. Hint: Use exercise 3.29.

One consequence of the unbounded total variation is that the length of the Brownian path is infinite. I.e., a particle that performs Brownian motion (in 1, 2, or more dimensions) will travel an infinite distance in finite time, almost surely. A physicist would be concerned about this property: It implies that a Brownian particle has infinite speed and infinite kinetic energy. The explanation is that the path of a physical particle differs from mathematical Brownian motion on the very fine scale. The difference may be insignificant in a specific application such as finding out where the particle is going, so that Brownian motion is a useful model, but the difference explains that physical particles have finite speeds while Brownian particles do not.

In turn, Brownian motion has finite quadratic variation:

Definition 4.2.3 (Quadratic variation). The quadratic variation of a process {Xs : 0 ≤ s ≤ t} is the limit

  [X]t = lim_{|∆|→0} Σ_{i=1}^{#∆} |Xti − Xti−1|²  (limit in probability)

whenever it exists. Here, ∆ is a partition of the interval [0, t].

Then the law of large numbers shows that the quadratic variation of Brownian motion is [B]t = t, for 0 ≤ t. To appreciate these results, notice that for a differentiable function f : R ↦ R, the total variation over the interval [0, 1] is finite, and in fact equals ∫_0^1 |f′(t)| dt, while the quadratic variation is 0. This means that the sample paths of Brownian motion are not differentiable (w.p. 1). In fact, another testimony to the erratic nature of Brownian motion is that the sample paths are nowhere differentiable, w.p. 1.

Exercise 4.35: Show that Brownian motion is not differentiable in the mean square at any given point, i.e. show that the difference quotients

  (1/h)(Bt+h − Bt)

do not have a limit in the mean square as h ↓ 0, at any t ≥ 0. Hint: Compute the variance of the difference quotient and let h ↓ 0. Note: See also exercise 4.39.
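To make figure 4.5 concrete, here is a minimal simulation sketch (not part of the original text): on one simulated path on [0, 1], the discretized total variation keeps growing as the partition is refined, while the discretized quadratic variation settles near 1.

```python
import numpy as np

rng = np.random.default_rng(3)
n_max = 2**20
dB = rng.normal(scale=np.sqrt(1.0 / n_max), size=n_max)
B = np.concatenate(([0.0], np.cumsum(dB)))        # one sample path of B on [0, 1]

for n in (2**10, 2**14, 2**18, 2**20):
    incr = np.diff(B[:: n_max // n])              # increments over a regular partition with n steps
    print(f"n = {n:7d}: total variation = {np.abs(incr).sum():8.1f}, "
          f"quadratic variation = {(incr**2).sum():.3f}")
```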

The maximum over a finite interval

How far to the right of the origin does Brownian motion move in a given finite time interval [0, t]? Define the maximum

  St = max{Bs : 0 ≤ s ≤ t}.

The following theorem shows a surprising connection between the distribution of St and the distribution of Bt:

Theorem 4.2.2 (Distribution of the maximum of Brownian motion). For any t, x > 0, we have

  P(St ≥ x) = 2P(Bt ≥ x) = 2Φ(−x/√t)     (4.1)

where, as always, Φ is the cumulative distribution function of a standard Gaussian variable.

Note that the maximum process {St : t ≥ 0} is also self-similar; for example α⁻¹Sα²t has the same distribution as St whenever α > 0.

Sketch. First, since the path of the Brownian motion is continuous and the interval [0, t] is closed and bounded, the maximum is actually attained and St is measurable. Next, notice that

  P(St ≥ x) = P(St ≥ x, Bt ≤ x) + P(St ≥ x, Bt ≥ x) − P(St ≥ x, Bt = x)

and for the last term we find P(St ≥ x, Bt = x) ≤ P(Bt = x) = 0. Now, consider a realization ω for which Bt(ω) ≥ x. Let τ = τ(ω) be the "hitting time", a random variable defined by

  τ(ω) = inf{s : Bs(ω) = x} .

This is the first, but definitely not the last, time we encounter such a hitting time; see section 4.3. Note that τ(ω) ≤ t since we assumed that Bt(ω) ≥ x > 0 = B0, and since the sample path is continuous. Define the reflected trajectory (see figure 4.6)

  Bs^(r)(ω) = Bs(ω)       for 0 ≤ s ≤ τ(ω),
  Bs^(r)(ω) = 2x − Bs(ω)  for τ(ω) ≤ s ≤ t.

Figure 4.6. The reflection argument used to derive the distribution of the maximum St. When Bt first hits the level x = 2, we cut the trajectory and reflect the rest of the trajectory.

We see that each sample path with St ≥ x, Bt ≥ x corresponds in this way to exactly one sample path with St ≥ x, Bt ≤ x. Moreover, the reflection operation does not change the absolute values of the increments, and therefore the original and the reflected sample path are equally likely realizations of Brownian motion. This is the argument that works straightforwardly in the case of a discrete-time random walk on Z (Grimmett and Stirzaker, 1992), but for Brownian motion some care is needed to make the statement and the argument precise. The key is to partition the time interval into ever finer grids; see (Rogers and Williams, 1994a) or (Karatzas and Shreve, 1997). Omitting the details of this technicality, we reach the conclusion

  P(St ≥ x, Bt ≥ x) = P(St ≥ x, Bt ≤ x)

and therefore

  P(St ≥ x) = 2P(St ≥ x, Bt ≥ x) .

Now if the end point exceeds x, then obviously the process must have hit x, so Bt ≥ x ⇒ St ≥ x. Hence P(St ≥ x, Bt ≥ x) = P(Bt ≥ x) and therefore P(St ≥ x) = 2P(Bt ≥ x).

In many situations involving stochastic differential equations we need to bound the effect of random perturbations. It is therefore quite useful that the maximum value of the Brownian motion follows a known distribution with finite moments.

Exercise 4.36 Bounds on |Bt|: If we want to know how far the Brownian motion has moved from the origin, then our interest is in |Bt| rather than Bt.

1. Show that we have the quick-and-dirty bound P(max{|Bs| : 0 ≤ s ≤ t} ≥ x) ≤ 4P(Bt ≥ x).

2. Examine the bounds with Monte Carlo: Simulate M sample paths of the process {Bt : t ∈ {0, 1, . . . , N}}. Plot the empirical estimate of the survival function P(max{B0, B1, . . . , BN} ≥ x) as function of x. Add the theoretical result P(SN ≥ x) to the plot; see the sketch after this exercise. Note: Statistical accuracy requires M ≥ 1000 sample paths, roughly. When N is large (1000, roughly) there is good agreement between the two curves. When N is small (e.g., 10) there is noticeable difference. Why?

3. Plot the empirical estimate P(max{|B0|, |B1|, . . . , |BN|} ≥ x) as function of x. Add the bound from item 1 above. Take N = 1000. Explain why there is good agreement for large x but poor agreement for small x.
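A possible sketch for item 2 of exercise 4.36 (the values of M and N are arbitrary choices in the spirit of the exercise; SciPy is assumed for the Gaussian c.d.f.):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(7)
M, N = 2000, 1000
B = np.cumsum(rng.standard_normal((M, N)), axis=1)   # B_1, ..., B_N for M sample paths
running_max = np.maximum(B.max(axis=1), 0.0)         # include B_0 = 0

x = np.linspace(0, 4 * np.sqrt(N), 200)
empirical = [(running_max >= xi).mean() for xi in x]
theory = 2 * norm.cdf(-x / np.sqrt(N))               # P(S_N >= x) from theorem 4.2.2

plt.plot(x, empirical, label="Monte Carlo")
plt.plot(x, theory, "--", label=r"$2\Phi(-x/\sqrt{N})$")
plt.xlabel("x"); plt.ylabel("survival function"); plt.legend(); plt.show()
```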


Brownian motion is null-recurrent

Brownian motion in one dimension always hits any given point on the real line, and always returns to the origin again, but the expected time until it does so is infinite. To state this property precisely, let x ≠ 0 be an arbitrary point and define again the hitting time τ(ω) to be the first time the sample path hits x, i.e. τ = inf{t > 0 : Bt = x}. By convention the infimum over an empty set is infinity, so a finite τ means that the sample path hits x while τ = ∞ means that the sample path never hits x.

Theorem 4.2.3 (Hitting time distribution of Brownian motion). The distribution of τ is given by

  P{τ ≤ t} = 2Φ(−|x|/√t) .

In particular, P(τ < ∞) = 1 and Eτ = ∞.

Proof. Assume that x > 0; the case x < 0 follows using symmetry. Then, recall from our discussion of the maximum St = max{Bs : 0 ≤ s ≤ t}, that τ ≤ t ⇔ St ≥ x and in particular

  P(τ ≤ t) = P(St ≥ x) = 2Φ(−x/√t) .

Now it is clear that P(τ ≤ t) → 1 as t → ∞, so τ is finite w.p. 1. On the other hand, the probability density function of τ is

  P(τ ∈ dt)/dt = fτ(t) = dP(τ ≤ t)/dt = x t^{−3/2} φ(x/√t) ,

which is plotted in figure 4.7. Notice the heavy power-law tail with a slope of −3/2, which indicates a divergent expectation:

  Eτ = ∫_0^∞ t fτ(t) dt = ∞ .

To show this more formally, note that fτ(t) ≥ (x/5) t^{−3/2} whenever t ≥ x². Hence

  Eτ = ∫_0^∞ t fτ(t) dt ≥ ∫_{x²}^∞ (x/5) t^{−1/2} dt = ∞ .

To recap, Brownian motion reaches any given level x almost surely, although the expected time until it does so is infinite. In the next section, concerning asymptotics, we will show that once it has reached the level x, it will almost surely reach the origin again at some later point. The alert reader may have noticed that we have not treated the case x = 0! The next two properties will show that if the Brownian motion starts at B0 = 0, then almost surely, it will hit the origin again immediately after.
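Theorem 4.2.3 can be examined with a rough Monte Carlo experiment (a sketch, not from the notes). Note that simulation on a discrete grid makes the paths hit the level slightly later than true Brownian motion does, so the empirical probabilities are biased slightly downwards.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(11)
x, T, dt, M = 1.0, 20.0, 1e-3, 2000
n = int(T / dt)

hit = np.full(M, np.inf)
for m in range(M):
    B = np.cumsum(rng.normal(scale=np.sqrt(dt), size=n))
    idx = np.argmax(B >= x)          # first index at or above the level (0 if never reached)
    if B[idx] >= x:
        hit[m] = (idx + 1) * dt

for t in (1.0, 5.0, 20.0):
    print(t, np.mean(hit <= t), 2 * norm.cdf(-x / np.sqrt(t)))   # empirical vs. theory
```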


Figure 4.7. The p.d.f. of the hitting time τ = inf{t : Bt = 1}. Left panel: The initial part of the curve. Right panel: The tail of the curve. Notice the log-scales. Included is also a straight line corresponding to a power law decay ∼ t−3/2 .

Asymptotics and the law of the iterated logarithm

We know that Brownian motion Bt scales with the square root of time in the sense that Bt/√t is identically distributed for all t > 0, in fact follows a standard Gaussian distribution. We are now concerned with the behavior of the sample path of Bt/√t in the limit t → ∞.

Theorem 4.2.4 (The law of the iterated logarithm).

  lim sup_{t→∞} Bt / √(2 t log log t) = 1 ,

with probability one.

This is a very useful result: It states quite precisely how far from the origin the Brownian motion will deviate, in the long run, and this can be used to derive asymptotic properties and bounds on more general diffusion processes. Since Brownian motion is symmetric, it follows immediately that almost surely

  lim inf_{t→∞} Bt / √(2 t log log t) = −1 .

Now, since the path of the Brownian motion is continuous and, loosely said, makes never-ending excursions to ±√(2 t log log t), it also follows that the sample path almost always re-visits

Invariant under time-inversion If {Bt : t ≥ 0} is Brownian motion, then also the process {Wt : t ≥ 0} given by W0 = 0 ,

Wt = tB1/t for t > 0

,

is Brownian motion. Exercise 4.38: Show that this {Wt } satisfies the conditions in the definition of Brownian motion. Hint: To establish continuity of Wt at t = 0, use the law of the iterated logarithm, in particular the results established in exercise 4.37. This result is particularly useful, because it can be used to connect properties in the limit t → ∞ with properties in the limit t → 0. For example, from the discussion of the law of the iterated logarithm we learned that Brownian motion almost always revisits the origin in the long run. By time inversion, it then follows that Brownian motion almost always revisits the origin immediately after time 0. To be precise, with probability 1 there exists a sequence tn such that tn → 0 and Btn = 0. Exercise 4.39: Following up on exercise 4.35, show that the sample paths of Brownian motion are not differentiable at the origin, almost surely. Specifically, show that 1 lim sup Bt = +∞ , t t&0

1 lim inf Bt = −∞ t&0 t

almost surely. Hint: Use time inversion and the law of the iterated logarithm, in particular exercise 4.37. 1

You may want to verify the stationarity, i.e. that EXs , VXs and EXs Xs+h do not depend on time s. Here h ≥ 0 is a time lag. We will discuss stationary processes further in chapter 5.

81

3 2 1 0

t −3

−2

−1

Bt

0

100

200

300

400

500

log(t)

Figure 4.8. Rescaled Brownian motion in logarithmic time. Included are also the growing bounds from the law of the iterated logarithm.


4.3 Filtrations and accumulation of information

Information is a key concept in the study of stochastic processes, both in the theoretical construction of stochastic differential equations and for practical applications such as estimation, prediction, or hindcasting. Recall that we used a σ-algebra of events to model information. This is a static concept, i.e. the information does not change with time. When information changes with time, we obtain a family of σ-algebras, parametrized by time. Our interest is accumulation of information obtained by new observations, and not, for example, loss of information due to limited memory. We therefore define a filtration to be a family of σ-algebras, i.e. {F t : t ∈ R}, which is increasing in the sense that

  F s ⊂ F t whenever s < t .

We can think of a filtration as the information available to an observer who monitors an evolving stochastic experiment - as time progresses, this observer is able to answer more and more questions about the stochastic experiment. In our context, the information F t almost invariably comes from observation of some stochastic process Xt, so that

  F t = σ(Xs : 0 ≤ s ≤ t) .

In this case we say that the filtration {F t : t ≥ 0} is generated by the process {Xt : t ≥ 0}. A related situation is that the information F t is sufficient to determine Xt, for any t ≥ 0. This is the dynamic version of the property that a random variable X is measurable w.r.t. a given σ-algebra H. If Xt is F t-measurable, for any t ≥ 0, then we say that the process {Xt : t ≥ 0} is adapted to the filtration {F t : t ≥ 0}. Since a filtration is an increasing family of σ-algebras, it follows that also earlier values of the stochastic process are measurable, i.e. Xs is measurable with respect to F t whenever 0 ≤ s ≤ t. Of course, a process is adapted to its own filtration, i.e. the filtration it generates.

We shall make use of this notion, processes that are adapted to a filtration, for two distinct purposes: First, as we discussed loosely on page (25), stochastic differential equations involve stochastic integrals. To develop the integral, in chapter 6, we need to specify which integrands are admitted. We shall see that a key requirement is that they are adapted to an underlying filtration related to the Brownian motion. Second, in chapter 10 we turn to the dynamic estimation problem, the so-called filtering problem, where an observer monitors one process and based on this estimates another process on-line. Since the filter cannot make use of information before it is available, the estimate must be adapted to the filtration generated by the measurements.

Another notion relevant to filtrations concerns phenomena that take place at random times. Let τ be a random time, i.e. a random variable taking values in R̄₊ ∪ {∞}. We allow the value τ = ∞ to indicate that the phenomenon never occurs. An important distinction is whether we recognize the phenomenon immediately when it occurs, or only afterward have, in hindsight,

realized that the phenomenon occurred. We use the term Markov time, or stopping time, to describe the first situation: Definition 4.3.1 (Markov time, stopping time). A random variable τ taking values in [0, ∞] is denoted a Markov time (or a stopping time) (w.r.t. F t ) if the event {ω : τ (ω) ≤ t} is contained in F t for any t ≥ 0. Probably the most important example of stopping times are hitting times, which we have already encountered: If {Xt : t ≥ 0} is a stochastic process taking values in Rn and F t its filtration and B ⊂ Rn is a Borel set, then the time of first entrance τ = inf{t ≥ 0 : Xt ∈ B} is a stopping time (with respect to F t ). Recall that by convention we take the infimum of an empty set to be ∞. On the other hand, the time of last exit sup{t ≥ 0 : Xt ∈ B} is not in general a stopping time, since we need to know the future in order to tell if the process will ever enter B again. Biography: Andrei Andreyevich Markov (1856-1922) was a Russian mathematician who introduced the class of stochastic processes which we today know as Markov chains. His purpose was to show that a weak law of large numbers could hold, even if the random variables were not independent. Markov lived and worked in St. Petersburg, and was influenced by Pafnuty Chebyshev, who taught him probability theory.

4.4 Markov processes and stochastic state space models

I assume that you have already encountered the Markov property in a basic course on stochastic processes: Given the present, the future is independent of the past. To state this precisely using the terminology we have established, we start with a filtration {F t : t ≥ 0}.


Figure 4.9. Graphical model of a Markov process (w.r.t. its own filtration) evaluated at a set of time points 0 ≤ t0 < t1 < · · · < tn. The Markov property implies, for example, that Xt3 and Xt1 are conditionally independent given Xt2, so there is no arrow from Xt1 to Xt3.

Definition 4.4.1 (Markov process). Given a probability space (Ω, F, P) and a filtration {F t : t ≥ 0}, a process {Xt : t ≥ 0} taking values on Rn is said to be a Markov process if

1. it is adapted to the filtration F t, and

2. E{h(Xt)|F s} = E{h(Xt)|Xs} holds (w.p. 1) for any t ≥ s ≥ 0. Here h : Rn ↦ R is an arbitrary (but Borel-measurable) real-valued function.

Note that the definition includes a filtration. This implies that two observers with access to different information, i.e. different filtrations, may disagree on whether a process {Xt} is Markov. If we just say that a stochastic process {Xt} is Markov without specifying the filtration, then we take the filtration to be the one generated by the process itself, F t = σ({Xs : 0 ≤ s ≤ t}), i.e. the observer in question obtains his information by monitoring the process {Xt} and nothing else. An equivalent way to phrase the Markov property is that the future state Xt is conditionally independent of the history F s, given the state Xs. This implies that the variables Xt1, . . . , Xtn have a dependence structure as depicted in figure 4.9. The law of a Markov process (i.e., the finite-dimensional distributions) can therefore be specified by transition probabilities, which give the conditional distribution of Xt given Xs, for any 0 ≤ s < t.

Theorem 4.4.1. Brownian motion is a Markov process.

Proof. Note that there is no reference to any filtration, so we take the filtration to be the one generated by the process itself. Therefore, the first condition of definition 4.4.1 is trivially satisfied. Now, let 0 ≤ s ≤ t. The intuition should be clear: Since Brownian motion has independent increments, an observer who has access to Bs cannot improve his prediction of Bt by using information about Bu for 0 ≤ u ≤ s. More stringently, we must show that the conditional distribution of Bt given F s is identical to the conditional distribution of Bt given Bs. To show this, we use that the distribution of a random variable, say X, is uniquely determined by its moment generating function M(k) = E exp(kX) where k ∈ R. We get

  E{e^{kBt}|F s} = e^{kBs} E{e^{k(Bt−Bs)}|F s} = e^{kBs} E e^{k(Bt−Bs)} = e^{kBs} E{e^{k(Bt−Bs)}|Bs} = E{e^{kBt}|Bs} .

Here we have used the laws of conditional expectations, that exp(kBs) is F s-measurable as well as Bs-measurable, and that the increment Bt − Bs is independent of F s and of Bs. Since this holds for any k, it follows that the two conditional moment generating functions are identical. Hence the two conditional distributions are identical, and hence the conditional expectation of h(Bt) is the same, whether we condition on F s or on σ(Bs).

Note that we do not use other properties of Brownian motion than independence of increments, so the theorem actually holds for any such process (provided the involved expectations exist).

Remark 4.4.1 (Diffusions as Markov processes). The notion of Markov processes is very important in our context, since diffusion processes - the solution to stochastic differential equations - are Markov processes. Indeed, one approach to diffusion processes would have been to define them as Markov processes for which the transition probabilities are governed by an advection-diffusion equation. This is not the approach we follow in this book, but let us briefly outline it. Consider as in chapter 2 the advection-diffusion equation in Rn

  ∂C/∂t = −∇ · (uC − D∇C) .

Here C : Rn × R ↦ R is the density, u : Rn ↦ Rn is a flow field, and D : Rn ↦ Rn×n is a diffusion tensor field which we for simplicity assume is positive definite; we ignore additional technical assumptions on the data as well as boundary conditions. Then consider the fundamental solution H(t, x; x0) corresponding to the initial condition C(x, 0) = δ(x − x0). We can use this fundamental solution as transition densities, i.e. construct a Markov process {Xt : t ≥ 0} for which the transition probabilities are

  P(Xt ∈ B | Xs = x0) = ∫_B H(t − s, x; x0) dx .

An entire theory of diffusion processes develops from this property. However, in the following we will follow the complementary approach which takes as its starting point stochastic integrals.

Markov processes formalize the notion of a state space model in the stochastic setting: The state of a system is a set of variables which are sufficient statistics of the history of the system, so that any prediction about the future can be stated using only the current values of the state variables.

For a given real-world system, it is therefore a relevant question which variables together constitute the state. More precisely, which stochastic processes related to the system do together constitute a multivariate Markov process? For simple physical systems, the answer is often supplied by the laws of physics.

Example 4.4.2. For a mechanical particle subject to Newton’s laws, the position itself is not a Markov process: To predict future positions, we need not just current position, but also the current velocity. In turn, the position and velocity together form a state vector. However, as we saw in chapter 2, the particle may be small and embedded in a fluid, we may observe the motion with a coarse resolution, and we may be interested in the motion on larger scales. In this situation, the velocity process is fluctuating too rapidly to be measurable, and the instantaneous velocity would not provide much information about the future displacement, which is determined by the unpredictable future collisions with other particles rather than inertia. Hence, the position itself can be considered a Markov process, with only negligible error. For more complex systems (a human body seen as a physiological system, an ocean seen as an ecosystem, or a nation seen as an economic system), it is a non-trivial modeling question how many and which state variables are needed, and the answer typically requires some simplification which is justified by the limits in the intended use of the model.

A particular feature of stochastic dynamic models is that they require additional "noise states" in order to describe adequately the temporal correlation of fluctuations; these states cannot always be given a direct physical interpretation.

One great advantage of the Markov property, and state space models in greater generality, is that a large number of analysis questions can be reduced to analysis on state space, rather than sample space. For Markov chains with a finite number of states, this means that computations are done in terms of vectors, representing probability distributions or functions on state space, and linear operators on these, i.e. matrices. For stochastic differential equations and diffusion processes, the main computational tool is partial differential equations which govern probability densities and functions on state space.

Given a non-Markov process, it is possible to extend the state space to include extra information so that the extended process is Markov. In many situations, the conceptual and computational advantages that come with the Markov property outweigh the extra complexity that comes with the expansion of state space.

Exercise 4.40: Let {Bt : t ≥ 0} be Brownian motion and consider the integral Xt = ∫_0^t Bs ds. Show that {Xt : t ≥ 0} is not a Markov process (w.r.t. its own filtration). Start with a verbal argument; then make it precise, i.e. find t ≥ s ≥ 0 and h which violate the Markov condition. Next, show that the pair {(Bt, Xt) : t ≥ 0} is a Markov process w.r.t. its own filtration.
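For exercise 4.40, it can be instructive to simulate the pair (Bt, Xt); the following sketch (not part of the exercise) approximates the integral with a left-endpoint Euler sum on a fine grid.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
dt, T = 1e-3, 10.0
n = int(T / dt)
B = np.concatenate(([0.0], np.cumsum(rng.normal(scale=np.sqrt(dt), size=n))))
X = np.concatenate(([0.0], np.cumsum(B[:-1] * dt)))   # X_{t+dt} approx X_t + B_t * dt

t = np.linspace(0, T, n + 1)
plt.plot(t, B, label="$B_t$")
plt.plot(t, X, label="$X_t = \\int_0^t B_s \\, ds$")
plt.xlabel("t"); plt.legend(); plt.show()
```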

4.5 Martingales

Martingales are a class of stochastic processes which can be seen as general unbiased random walks. Martingales are important in the study of stochastic differential equations, primarily because the Itô integral (which we previewed in section 2.6 and will develop in chapter 6) is a martingale.² This is useful, because surprisingly many conclusions can be drawn from the martingale property. In this section we introduce the martingales and demonstrate some of these conclusions.

Definition 4.5.1. Given a probability space (Ω, F, P) with filtration {F t : t ≥ 0}, a stochastic process {Mt : t ≥ 0} is a martingale (w.r.t. F t and P) if

1. The process Mt is adapted to the filtration F t.

2. For all times t ≥ 0, E|Mt| < ∞.

3. E{Mt|F s} = Ms whenever t ≥ s ≥ 0.

The first condition states that the σ-algebra F t contains enough information to determine Mt, for each t ≥ 0. The second condition ensures that the expectation in the third condition exists. The third condition, which is referred to as "the martingale property", expresses that this is an unbiased random walk: At time s, the conditional expectation of the future increment Mt − Ms is 0. The time argument t can be discrete, t ∈ N, or continuous, t ∈ R+. If we just say that Mt is a martingale, and it is obvious from the context which probability measure should be used to compute the expectations, then it is understood that the filtration {F t : t ≥ 0} is the one generated by the process itself, i.e. F t = σ(Ms : s ≤ t).

Exercise 4.41: Show that Brownian motion is a martingale w.r.t. its own filtration.

Exercise 4.42: Show that the process {Bt² − t : t ≥ 0} is a martingale w.r.t. its own filtration.

Exercise 4.43: Let Xi be independent random variables for i = 1, 2, . . . such that E|Xi| < ∞ and EXi = 0. Show that the process {Mi : i ∈ N} given by Mi = Σ_{j=1}^{i} Xj is a martingale.

Exercise 4.44 Doob's martingale: Let X be a random variable such that E|X| < ∞, and let {F t : t ≥ 0} be a filtration. Show that the process {Mt : t ≥ 0} given by Mt = E{X|F t}

² The term martingale has an original meaning which is fairly far from its usage in stochastic processes: A martingale can be a part of a horse's harness, a piece of rigging on a tall ship, or even a half belt on a coat; such martingales provide control and hold things down. Gamblers in 18th century France used the term for a betting strategy where one doubles the stake after a loss; if the name should indicate that this controls the losses, then it is quite misleading. In turn, the accumulated winnings (or losses) in a fair game is a canonical example of a stochastic process with the martingale property.


is a martingale w.r.t. {F t : t ≥ 0}. Biography:

Joseph Leo Doob (1910-2004) was an American mathematician who greatly influenced the theory of stochastic processes in the middle of the 1900’s. He developed the theory of martingales and connected potential theory to the theory of stochastic processes. His book (Doob, 1953) on stochastic processes is considered a classic.

In the context of gambling, martingales are often said to characterize fair games: if Mt is the accumulated winnings of a gambler at time t, and {Mt : t ≥ 0} is a martingale, then the game can be said to be fair. In this context, an important result is that it is impossible to beat the house on average, i.e. obtain a positive expected gain, by quitting the game early. Lemma 4.5.1. On a probability space (Ω, F, P), let the process {Mt : t ≥ 0} be a continuous martingale with respect to a filtration {F t : t ≥ 0}. Let τ be a stopping time. Then the stopped process {Mt∧τ : t ≥ 0} is a martingale with respect to {F t : t ≥ 0}. In particular, E(Mt∧τ |F 0 ) = M0 . Here, ∧ is the “min” symbol: a ∧ b = min(a, b).

Proof. Note that the stopped process Mt∧τ equals the original process Mt before the stopping time τ , i.e. for t ≤ τ . After the stopping time τ , the stopped process Mt∧τ stays constant and equal to Mτ . With this, it is clear that Mt∧τ is F t -measurable, for any t ≥ 0. To verify that E|Mt∧τ | < ∞ and that E{Mt∧τ |F s } = Ms∧τ , consider first the case where τ is discrete, i.e. takes value in an increasing deterministic sequence {ti : i ∈ N} w.p. 1. Assume that E|Mti ∧τ | < ∞, then

  E|Mti+1∧τ| = E|1(τ ≤ ti)Mτ + 1(τ > ti)Mti+1| ≤ E|1(τ ≤ ti)Mτ∧ti| + E|1(τ > ti)Mti+1| < ∞

and

  E{Mti+1∧τ | F ti} = E{1(τ ≤ ti)Mτ + 1(τ > ti)Mti+1 | F ti} = 1(τ ≤ ti)Mτ + 1(τ > ti)Mti = Mτ∧ti .

It follows by iteration that {Mti∧τ : i ∈ N} is a martingale. We outline the argument for the general case where τ is not necessarily discrete: We approximate τ with discrete stopping times {τn : n ∈ N} which converge monotonically to τ, for each ω. For each approximation τn, the stopped process Mt∧τn is a martingale, and in the limit n → ∞ we find

  E{Mt∧τ | F s} = lim_{n→∞} E{Mt∧τn | F s} = Ms∧τ

which shows that {Mt∧τ : t ≥ 0} is also a martingale.

Of course, it is crucial that τ is a stopping time, and not just any random time: It is not allowed to sneak a peek at a future loss and stop before it occurs. It is also important that the option is to quit the game early, i.e. the game always ends no later than a fixed time t. Consider, for example, the stopping time τ = inf{t : Bt ≥ 1} where Bt is Brownian motion. Then τ is finite almost surely (since Brownian motion is recurrent, section 4.2.1), and of course Bτ = 1 so that, in particular, EBτ ≠ B0. But this stopping time τ is not bounded, so stopping at τ is not a strategy to quit early.

Although this result should seem fairly obvious - except perhaps to die-hard gamblers - it has a somewhat surprising corollary: It bounds (in probability) the maximum value of the sample path of a non-negative martingale:

Theorem 4.5.2 (The martingale inequality). Let {Mt : t ≥ 0} be a non-negative continuous martingale such that the initial value M0 is deterministic. Then
\[ P\!\left( \max_{s\in[0,t]} M_s \ge c \right) \le \frac{M_0}{c} . \]

This inequality is key in stochastic stability theory, where it is used to obtain bounds on the solutions of stochastic differential equations. In that context we shall see a slightly stronger statement.

Proof. Define the stopping time τ = inf{t : Mt ≥ c} and consider the stopped process {Mt∧τ : t ≥ 0}. This is a martingale, and therefore
\[ M_0 = E M_{t\wedge\tau} . \]
By Markov's inequality,
\[ E M_{t\wedge\tau} \ge c\, P(M_{t\wedge\tau} \ge c) . \]
Combining, we obtain M0 ≥ c P(Mt∧τ ≥ c). Noting that Mt∧τ ≥ c if and only if max{Ms : 0 ≤ s ≤ t} ≥ c, the conclusion follows.
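To make the bound concrete, here is a small numerical check in R (an addition to these notes, not part of the original text). It uses the non-negative martingale Mt = exp(Bt − t/2), which reappears in exercise 4.50; the step size, horizon and threshold are arbitrary choices.

```r
set.seed(1)
h <- 1e-3; tmax <- 1; nsteps <- tmax / h
c0 <- 2                                      # the level c in the inequality
exceeds <- replicate(1e4, {
  B <- cumsum(rnorm(nsteps, sd = sqrt(h)))   # Brownian motion at times h, 2h, ..., tmax
  M <- exp(B - (1:nsteps) * h / 2)           # the martingale M_t = exp(B_t - t/2), M_0 = 1
  max(c(1, M)) >= c0
})
mean(exceeds)                                # empirical P(max M >= c); the bound is M_0/c = 0.5
```

The empirical frequency should stay below the bound M0/c; the bound is typically far from tight.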

Exercise 4.45: A gambler plays repeated rounds of a fair game. At each round, he decides the stakes. He can never bet more than his current fortune, and he can never lose more than he bets. His initial fortune is 1. Show that the probability that he ever reaches a fortune of 100 is no greater than 1 %.

The definition of a martingale concerns only expectations, so a martingale does not necessarily have finite variance. However, many things are simpler if the variances do in fact exist, i.e. if E|Mt |2 < ∞ for all t ≥ 0. In the remainder of this section, {Mt : t ≥ 0} is such a martingale with finite variance for each t.

Exercise 4.46: Show that if {Mt : t ≥ 0} is a martingale such that E|Mt |2 < ∞ for all t, then the increments Mt − Ms and Mv − Mu are uncorrelated, whenever 0 ≤ s ≤ t ≤ u ≤ v. Hint: When computing the covariance E(Mv − Mu )(Mt − Ms ), condition on F u .

Exercise 4.47: Show that if {Mt : t ≥ 0} is a martingale such that E|Mt |2 < ∞ for all t, then the variance is increasing. Specifically, let 0 ≤ s ≤ t, then VMs ≤ VMt .

Hint: For example, use the variance version of the Tower property (page 52), evaluating VMt by conditioning on F s .

Since the variance is increasing, an important characteristic of an L2 martingale is how fast and how far the variance increases. To illustrate this, we may ask what happens as t → ∞. Clearly the variance must either diverge to infinity, or converge to a limit, limt→∞ VMt < ∞. A very useful result is that if the variance converges, then also the process itself converges. Specifically:

Theorem 4.5.3 (Martingale convergence, L2 version). Let {Mt : t ≥ 0} be a continuous martingale such that the variance {VMt : t ≥ 0} is bounded. Then there exists a random variable M∞ such that Mt → M∞ w.p. 1 and in L2 .

Proof. Let 0 = t1 < t2 < · · · be an increasing divergent sequence of time points, i.e. ti → ∞ as i → ∞. Then we claim that {Mtn : n ∈ N} is a Cauchy sequence. To see this, let ε > 0 be given. We must show that there exists an N such that for all n, m > N, kMtn − Mtm k2 < ε. But this is easy: Choose N such that VMtN > limt→∞ VMt − ε²; then, since the increments are uncorrelated (exercise 4.46), kMtm − Mtn k2² = VMtm − VMtn < ε² for all m > n > N.

It follows from the completeness of L2 that there exists an M∞ ∈ L2 such that Mtn → M∞ in mean square. Moreover, it is easy to see that this M∞ does not depend on the particular sequence {ti : i ∈ N}. We omit the proof that the limit is also w.p. 1. This proof uses quite different techniques; see (Williams, 1991).

4.6

Summary

A stochastic process is a family of random variables such as {Xt : t ≥ 0}. For fixed time t, we obtain a random variable Xt : Ω 7→ Rn . For fixed realization ω, we obtain a sample path ¯ + 7→ Rn . In the framework we have presented in this chapter, these two notions X· (ω) : R are equally important and two sides of the same coin. Properties of the sample paths are at the core of our study. We have seen that stochastic processes can be seen as evolving stochastic experiments. These require a large sample space Ω, typically a function space, and a filtration {F t : t ≥ 0} which describes how information is accumulated as new rounds of the experiment is being performed and more dice come to rest. We have seen that, using the measure-theoretic basis, we obtain a rigorous framework for stochastic processes. Choosing the σ-algebra as the one generated by the random variables Xt for t ≥ 0, many of the technicalities regarding measurability resolve themselves. The issues of different versions of stochastic processes remain and is treated in the notes following this summary; this manifests itself in the question if Brownian motion has continuous sample paths. However, choosing canonical Brownian motion, where the sample space consists of continuous functions which we identify with the sample path of Brownian motion, resolves this issue: Brownian motion, in our definition, has continuous sample paths. We have used Brownian motion as the running example because it is a fundamental process in the study of diffusions and stochastic differential equations. Among its many interesting properties, the most important to appreciate is arguably its self-similarity and that distance scales with the square root of time. Brownian motion is the prime example of a Markov process, and of a martingale. The Markov property is central in the study of stochastic differential equations, because diffusions are Markov processes. Historically, the first approach to diffusions was by considering Markov processes whose transition probabilities are governed by advection-diffusion equations. On the other hand, the following chapters concern the modern approach which relies on martingale theory, since the stochastic integrals we consider in the next chapter inherit the martingale property from the underlying Brownian motion. 92

4.7

Notes and references *

The material in this chapter is standard text-book material and taken mostly from (Williams, 1991; Rogers and Williams, 1994a; Øksendal, 1995; Karatzas and Shreve, 1997). In the following we discuss some additional aspects of this material. This concerns convergence of random variables, the finite-dimensional distributions of a stochastic process, and the construction of Brownian motion. The material can be omitted without disturbing the flow.

4.7.1

Convergence of random variables

When considering stochastic processes, one of the first questions that emerges is convergence. We may be interested in asymptotic questions such as the Law of Large Numbers or the Central Limit Theorem, and for continuous-time processes, we may also ask about the continuity of the process. For both situations, we need convergence of sequences of random variables. But exactly what does it mean that a sequence of random variables converges? I.e., what does Xi → X as i → ∞ mean, where X and the sequence Xi , i ∈ N are random variables on (Ω, F, P)? Since random variables are functions on Ω, there are several modes of convergence, at least as many as in standard analysis. The most common modes of convergence are given in the following:

Definition 4.7.1. We assume that random variables X and {Xi : i ∈ N} are defined on a probability triple (Ω, F, P) and these random variables take values in Rn . We say that, as i → ∞,

1. Xi → X almost surely (a.s.) if {ω : Xi (ω) → X(ω)} is an event with probability 1. An equivalent notion is convergence with probability 1 (w.p. 1).
2. Xi → X in Lp if kXi − Xkp → 0. This is equivalent to E|Xi − X|p → 0, provided that p < ∞. When p = 1 we use the term convergence in the mean and when p = 2 we say convergence in the mean square. The cases p = 1, p = 2 and p = ∞ are the most common.
3. Xi → X in probability if P(|Xi − X| > ε) → 0 as i → ∞ for any ε > 0.
4. Xi → X in distribution if

P(Xi ∈ B) → P(X ∈ B) for any Borel set B such that P(X ∈ ∂B) = 0. We also say that Xi → X in law, or that the sequence Xi converges weakly to X. Note that several of the definitions use the norm | · | in Rn ; recall that it does not matter which norm in Rn we use, since all norms on Rn are equivalent. So in a given situation, we may choose the most convenient norm. In most situations, this will either be the Euclidean 2-norm |x|2 = |x21 + · · · + x2n |1/2 , the max norm |x|∞ = max{|x1 |, . . . , |xn |}, or the sum norm |x|1 = |x1 | + · · · + |xn |. Regarding random variables as functions on Ω, almost sure convergence corresponds to pointwise convergence (except possibly on a set of measure 0). Note that there may be realizations ω for which the convergence does not happen. For example, for the Gaussian white noise process {Xi : i ∈ N} considered in the previous, there are realizations ω for which Xi > 1 for any i ∈ N, and any such realization may seem to violate the Strong Law of Large Numbers (which we recap in a minute). But each realization (obviously) has probability 0; also the event {ω : ∀i ∈ N : Xi (ω) > 1} has probability 0. Indeed, the Strong Law of Large Numbers states only that convergence occurs with probability 1, not that it happens for every realization. Regarding convergence in distribution, the requirement that P(X ∈ ∂B) = 0 cannot be disregarded: Consider for example the sequence Xi ∼ N (0, i−2 ). Then P(Xi = 0) = 0 for all i but the weak limit X has P(X = 0) = 1. For scalar random variables, the requirement is that the distribution functions Fi (x) = P(Xi ≤ x) converge pointwise to F (x) = P(X ≤ x) at any point x where F is continuous.

Classical convergence theorems

Let us recall some famous convergence theorems. These theorems should be familiar, but note the different modes of convergence:

Theorem 4.7.1 (Central limit theorem of Lindeberg-Lévy). Let {Xi : i ∈ N} be a sequence of independent and identically distributed random variables with mean µ and variance 0 < σ2 < ∞. Then
\[ \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu) \to N(0,1) \quad \text{in distribution} \]
as n → ∞.

Theorem 4.7.2 (Weak law of large numbers). Let {Xi : i ∈ N} be a sequence of independent and identically distributed random variables with mean µ. Then
\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu \quad \text{in probability} \]


as n → ∞.

Figure 4.10. Modes of convergence for random variables, illustrated by sets of sequences. For example, a sequence which converges in probability also converges in distribution.

Theorem 4.7.3 (Strong law of large numbers). Let {Xi : i ∈ N} be a sequence of independent and identically distributed random variables with mean µ and variance σ2 < ∞. Then
\[ \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu \quad \text{almost surely and in } L^2 \]
as n → ∞.
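As a quick illustration (added here as a side note, not in the original text), the following R snippet contrasts the two statements for exponential variables: the running mean settles down to µ = 1 (law of large numbers), while the centred sum scaled by √n keeps fluctuating with an approximately standard Gaussian distribution (central limit theorem).

```r
set.seed(1)
n <- 1e4
X <- rexp(n, rate = 1)                    # i.i.d. with mean 1 and variance 1
S <- cumsum(X)
running.mean <- S / (1:n)                 # law of large numbers: converges to 1
clt.stat <- (S - (1:n)) / sqrt(1:n)       # CLT: approximately N(0,1) for large n
par(mfrow = c(1, 2))
plot(running.mean, type = "l", xlab = "n", ylab = "sample mean"); abline(h = 1, lty = 2)
plot(clt.stat, type = "l", xlab = "n", ylab = "(S_n - n)/sqrt(n)")
```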

Relationship between modes of convergence

As the following theorem states, the different modes of convergence are not completely independent (see also figure 4.10).

Theorem 4.7.4. Given a sequence {Xi : i ∈ N} of random variables, and a candidate limit X, all on a probability space (Ω, F, P) and taking values in Rn .

1. If Xi → X in Lp and p ≥ q ≥ 1, then Xi → X in Lq . In particular, convergence in the mean square (in L2 ) implies convergence in the mean (in L1 ).
2. If Xi → X in Lp , then Xi → X in probability.

3. If Xi → X almost surely, then Xi → X in probability.
4. If Xi → X in probability, then Xi → X in distribution.

It is useful to think through situations where variables converge in one sense but not in another, because it illuminates the difference between the modes of convergence. For example:

Example 4.7.1.

1. If Xi and X are all independent and identically distributed, then Xi → X in distribution but in no other sense. In fact, convergence in distribution concerns the distributions only and not the random variables themselves. Therefore, we often specify the limit as a distribution, rather than a random variable, saying that a sequence of random variables converges towards a certain distribution. Of course, the central limit theorem is a prime example of this mode of convergence.

2. Almost sure convergence does not in general imply convergence of moments. A standard counter-example considers a uniform distribution on [0, 1], i.e. Ω = [0, 1], F the usual Borel algebra on [0, 1], and P(B) = |B| for B ∈ F. Now let
\[ X_i(\omega) = i \cdot \mathbf{1}\!\left(\omega \in [0, \tfrac{1}{i}]\right) = \begin{cases} i & \text{when } 0 \le \omega \le \tfrac{1}{i}, \\ 0 & \text{else.} \end{cases} \]
Then Xi → 0 w.p. 1 and hence also in probability, but E|Xi |p = i^{p−1} does not converge to 0 for any p ≥ 1. Another example, which concerns Brownian motion and is related to stability theory, is the subject of exercise 4.50.

3. In turn, convergence in moments does not in general imply almost sure convergence. A standard counter-example makes use of the same probability space and a sequence of random variables constructed as indicator variables as follows:
\[ X_1 = \mathbf{1}_{[0,\frac{1}{2}]},\quad X_2 = \mathbf{1}_{[\frac{1}{2},\frac{2}{2}]},\quad X_3 = \mathbf{1}_{[0,\frac{1}{4}]},\quad X_4 = \mathbf{1}_{[\frac{1}{4},\frac{2}{4}]},\quad X_5 = \mathbf{1}_{[\frac{2}{4},\frac{3}{4}]},\quad X_6 = \mathbf{1}_{[\frac{3}{4},\frac{4}{4}]},\quad \ldots \]

Then Xi → 0 in Lp for 1 ≤ p < ∞ and hence also in probability, but not with probability 1: For every ω and every n ∈ N, there exists an i > n such that Xi (ω) = 1. Note that this is essentially the same example as in exercise 3.22. We will see another example of convergence in L2 but not w.p. 1 in the following, when discussing the Law of the Iterated Logarithm for Brownian motion.

With some extra qualifications, however, the converse statements can be made to hold:

• If Xi converges weakly to a deterministic limit, i.e. Xi → X where X is a constant function on Ω, then the convergence is also in probability.

• Monotone convergence: If Xi (ω) for each ω is a non-negative non-decreasing function of i ∈ N which converges to X(ω), then either EXi → ∞ and EX = ∞, or X ∈ L1 and Xi → X in L1 .

• Dominated convergence: If there is a bound Y ∈ L1 (Ω, F, P), a random variable such that |Xi (ω)| ≤ Y (ω) for each ω and each i, and Xi → X almost surely, then X ∈ L1 and the convergence is in L1 .

• Fast convergence: If Xi → X in probability "fast enough", then the convergence is also almost sure. Specifically, if for all ε > 0
\[ \sum_{i=1}^{\infty} P(|X_i - X| > \epsilon) < \infty , \]
then Xi → X almost surely. This follows from the first Borel-Cantelli lemma. This also implies that if Xi → X in Lp , 1 ≤ p < ∞, fast enough so that \(\sum_{i=1}^{\infty} E|X_i - X|^p < \infty\), then Xi → X almost surely.

• Convergence of a subsequence: If Xi → X in probability, then there exists an increasing subsequence {ni : i ∈ N} such that Xni → X converges fast and hence also almost surely. For example, for the sequence in example 4.7.1, item 3, the sequence {X_{2^i} : i ∈ N} converges to 0 almost surely.

When the bound Y in the dominated convergence theorem is a constant, the theorem is called the bounded convergence theorem.

Regarding convergence in Lp , a situation that appears frequently is that we are faced with a sequence of random variables {Xn : n ∈ N} and aim to show that it converges to some limit X which is unknown to us. A useful property of the Lp spaces is that they are complete: If the sequence Xn has the Cauchy property that the increments tend to zero, i.e.
\[ \sup_{m,n > N} \|X_m - X_n\|_p \to 0 \quad \text{as } N \to \infty , \]

then there exists a limit X ∈ Lp such that Xn → X as n → ∞. Recall (or prove!) that a convergent sequence is necessarily Cauchy; the word "complete" indicates that the spaces Lp also include the limits of Cauchy sequences, so that a sequence is Cauchy if and only if it converges.

4.7.2

The finite-dimensional distributions

A general definition of finite-dimensional distributions is the following:

Definition 4.7.2. For a stochastic process {Xt : t ≥ 0} taking values in Rn , the finite-dimensional distributions are the family of functions
\[ \rho_{t_1, t_2, \ldots, t_m}(B_1, B_2, \ldots, B_m) = P(X_{t_1} \in B_1, X_{t_2} \in B_2, \ldots, X_{t_m} \in B_m) \]

where m ∈ N, and ti ∈ R and Bi ∈ B(Rn ) for i ∈ {1, . . . , m} with ti ≠ tj for i ≠ j. The finite-dimensional distributions of a stochastic process satisfy two consistency conditions: First, the probability is invariant under permutations of the pairs (ti , Bi ). For example, with m = 2 we must have

\[ \rho_{t_1,t_2}(B_1, B_2) = \rho_{t_2,t_1}(B_2, B_1) . \]
More generally, it suffices to specify the finite-dimensional distributions for the case t1 < t2 < · · · < tm . Second, the probabilities reduce trivially when some of the sets Bi are the entire space Rn . For example, with m = 2 we must have
\[ \rho_{t_1,t_2}(B_1, \mathbf{R}^n) = \rho_{t_1}(B_1) . \]
The perhaps surprising but powerful fact is that these two requirements are enough to characterize valid finite-dimensional distributions:

Theorem 4.7.5 (Kolmogorov's extension theorem). Given a family of functions ρ which satisfy the two consistency conditions above, there exists a probability space (Ω, F, P) and a stochastic process {Xt : t ≥ 0} on this probability space, such that the functions ρ are the finite-dimensional distributions of X.

A proof can be found in (Karatzas and Shreve, 1997). The sample space Ω can always be taken to be the set of all functions R̄+ → Rn . In this case the stochastic process Xt is given by Xt (ω) = ω(t). The system of events F is the Borel algebra, i.e., the smallest σ-algebra which contains Xt^{-1}(B) for any Borel set B and any t ≥ 0.

Finite-dimensional distributions contain much information about the stochastic process, but not all information. If two stochastic processes (not necessarily defined on the same probability space) have the same finite-dimensional distributions, then we say that they are versions of each other. The following example shows that different versions may have quite different sample paths, despite sharing finite-dimensional distributions:

Example 4.7.2. Let Ω = [0, 1], let F be the usual Borel algebra on this interval, and let P be the Lebesgue measure on Ω. Consider the two real-valued stochastic processes {Xt : t ∈ [0, 1]} and {Yt : t ∈ [0, 1]} on this probability space:
\[ X_t(\omega) = 0 , \qquad Y_t(\omega) = \begin{cases} 1 & \text{when } t = \omega, \\ 0 & \text{else.} \end{cases} \]

Note that X and Y have identical finite-dimensional distributions, but different sample paths (w.p. 1). In particular, the sample path of X is continuous w.p. 1, while the sample path of Y is discontinuous, w.p. 1.

4.7.3

Wiener measure

We have stated that Brownian motion exists, but we have not been explicit about how to construct it. In the following we sketch the most common construction: The finite-dimensional distributions of Brownian motion are specified in the definition 4.2.1. According to Kolmogorov’s extension theorem 4.7.5, there exists a probability triple (Ω, F, P) and a stochastic process {Bt : t ≥ 0} defined on it which has these finite-dimensional distributions. Moreover, ¯ + 7→ R. we may take the sample space to be the set of functions R This resolves the defining properties of Brownian motion (definition 4.2.1), except the requirement that the sample paths are continuous. This requirement is included in the definition for two reasons: First, Brownian motion models the movement of a particle, so on physical grounds we would like to restrict attention to continuous sample paths. Second, from a mathematical point of view, discontinuous sample paths would be a nuisance when we develop the stochastic calculus in the following chapters. At present, our candidate sample space includes ¯ + 7→ R and therefore the entire horror cabinet of counterexamples in real all functions R analysis, and we would like to exclude as many of these more or less pathological functions as possible. In short, we do insist on continuous sample paths, even if it implies some extra work to show that continuous Brownian motion exists. It is instructive to think for a moment about the difficulties with this question. Recall that example 4.7.2 shows that one cannot conclude from the finite-dimensional distributions, whether the sample paths of a process are continuous. This means that we should not ask if Brownian motion, as constructed in the previous, has continuous sample paths, but rather if we may take the sample paths to be continuous without violating the finite-dimensional distributions. Similarly, note that from the finite-dimensional distributions, it follows that the process Bt is Lp -continuous for any 0 < p < ∞, which means that E|Bt+h − Bt |p → 0 as h ↓ 0 for all t ≥ 0. But as we have seen, convergence in Lp does not imply almost sure convergence, so we cannot yet say if the sample path is continuous at a given t ≤ 0, almost surely. And even if this were the case, it would be a much more difficult question if the sample path is continuous at all t ≥ 0, almost surely. The problem is that this statement involves an uncountable number of events (one for each t ≥ 0) and therefore we need to show that it is an event, i.e. included in F. The issue is resolved by a theorem due to Kolmogorov and Chentsov, referred to as Kolmogorov’s continuity theorem, which concerns a general stochastic process {Xt : t ≥ 0}. This theorem assumes that the moments of the increments can be bounded in terms of the time lag, i.e. there exists a constant C > 0 and exponents p > 0 and q > 1 such that 99

E|Xt − Xs |p ≤ C|t − s|q holds for all 0 ≤ s ≤ t. The theorem then states that there exists a version of {Xt : t ≥ 0} which has continuous sample paths. See (Karatzas and Shreve, 1997). Since Brownian motion is self-similar and scales with the square root of time, the theorem applies to {Bt : t ≥ 0} with e.g. q = 2 and p = 4. In the our previous construction ¯ + 7→ R, we can now of Brownian motion, where the sample space contains all functions R assign probability 0 to the set of discontinuous functions. Next, we exclude the discontinuous ¯ + 7→ R. functions from the sample space, which then becomes the set of continuous functions R We still identify the realization with the sample path, so that Brownian motion is given by Bt (ω) = ω(t). This particular construction is called canonical Brownian motion. In this way, the finite-dimensional distributions of Brownian motion induce a measure on the set of continuous functions. This measure is called Wiener measure.

4.8

Exercises

Simulation of Brownian motion

Exercise 4.48 The Brownian bridge: A shortcoming of the way we have simulated Brownian motion so far, is that we have to choose the time step h at the outset. So far, we have no method for refining the time step. One way to obtain this is by the Brownian bridge, which is a Brownian motion in an interval [0, T ] conditional on the values at the endpoint, i.e. conditional on BT . First, recall that if two random variables (X, Y ) are jointly Gaussian, i.e.
\[ \begin{pmatrix} X \\ Y \end{pmatrix} \sim N\!\left( \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \begin{pmatrix} \Sigma_{XX} & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_{YY} \end{pmatrix} \right) ,
\]



then the conditional distribution of X given Y is Gaussian with (conditional) mean E{X|Y } = µX + ΣXY Σ_{YY}^{-1}(Y − µY ), and (conditional) variance V{X|Y } = ΣXX − ΣXY Σ_{YY}^{-1} ΣYX .

1. Given a partition 0 = t0 < t1 < · · · < tn = T, consider the conditional distribution of the Brownian motion at one time point ti , given all the others. Show that
\[ B_{t_i} \mid \{B_{t_j} : j \neq i\} \;\sim\; N\!\left( B_{t_{i-1}} + \frac{t_i - t_{i-1}}{t_{i+1} - t_{i-1}}\,(B_{t_{i+1}} - B_{t_{i-1}}),\; \frac{(t_i - t_{i-1})(t_{i+1} - t_i)}{t_{i+1} - t_{i-1}} \right) , \]
i.e., the conditional mean linearly interpolates the two neighbors, while the conditional variance is a quadratic function which has slope ±1 at the neighboring points.

2. Based on this, write a function which takes as input a partition t0 < t1 < · · · < tn and associated values of the Brownian motion Bt0 , Bt1 , . . . , Btn , and which returns as output a finer partition s0 < s1 < · · · < s2n+1 along with simulated values of the Brownian motion Bs0 , Bs1 , . . . , Bs2n+1 (a sketch of this step is given after the exercise). Here, the finer partition includes also all mid-points, i.e.
\[ s_0 = t_0, \quad s_1 = \frac{t_0 + t_1}{2}, \quad s_2 = t_1, \quad s_3 = \frac{t_1 + t_2}{2}, \]

. . . , s2n+1 = tn

3. Use this function iteratively to simulate Brownian motion on the interval [0, 1] in the following way: First, simulate B0 = 0 and B1 ∼ N (0, 1). Then, conditional on this, simulate B0 , B1/2 , B1 using your function. Then, conditionally on these, simulate B0 , B1/4 , B1/2 , B3/4 , B1 . Continue in this fashion until you have simulated Brownian motion with a temporal resolution of h = 1/512. Plot the resulting trajectory.

4. Yet another way of simulating a Brownian bridge uses a basic result regarding conditional simulation in Gaussian distributions. For two jointly Gaussian random variables (X, Y ) as in the start of the exercise, we can simulate X from the conditional distribution given Y as follows:
(a) Compute the conditional mean, E{X|Y }.
(b) Sample (X̄, Ȳ) from the joint distribution of (X, Y ), and compute the residual X̃ = X̄ − E{X̄|Ȳ}.
(c) Return X* = E{X|Y } + X̃.
Check that the conditional distribution of this X* given Y is identical to the conditional distribution of X given Y . Then write a function which inputs a partition 0 = t0 < t1 < · · · < tn = T and a value of BT , and which returns a sample path of the Brownian bridge B0 , Bt1 , . . . , BT which connects the two end points.
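A minimal R sketch of the mid-point refinement in item 2 (the function name refineBM and the plotting choices are mine, not from the notes); it uses the conditional distribution derived in item 1, which at a mid-point reduces to the average of the two neighbours and a quarter of the interval length.

```r
refineBM <- function(t, B) {
  tm <- (head(t, -1) + tail(t, -1)) / 2        # mid-points of each interval
  mu <- (head(B, -1) + tail(B, -1)) / 2        # conditional mean at the mid-points
  v  <- diff(t) / 4                            # conditional variance at the mid-points
  Bm <- rnorm(length(tm), mean = mu, sd = sqrt(v))
  list(t = c(rbind(head(t, -1), tm), tail(t, 1)),   # interleave old and new points
       B = c(rbind(head(B, -1), Bm), tail(B, 1)))
}
## usage: start from B_0 = 0, B_1 ~ N(0,1) and refine 9 times (resolution 1/512)
path <- list(t = c(0, 1), B = c(0, rnorm(1)))
for (k in 1:9) path <- refineBM(path$t, path$B)
plot(path$t, path$B, type = "l", xlab = "t", ylab = "B")
```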

Exercise 4.49 The Wiener expansion: Another way to simulate Brownian motion is by using the frequency domain. The outline is as follows: First, we simulate harmonics (i.e., sine and cosine functions) with random amplitude. Then, we add them to obtain an approximation to white noise. Finally we integrate to obtain an approximation to Brownian motion. For compactness, we work with complex-valued Brownian motion {Bt : 0 ≤ t ≤ 2π}, i.e. the real and imaginary parts are independent and each standard Brownian motion, and we restrict time to the interval [0, 2π].

Generate 2N + 1 complex-valued random Gaussian variables {Vk : k = −N, . . . , N } such that the real and imaginary parts are independent and standard Gaussians. A reasonable value of N for illustrative purposes is N = 16. Then generate an approximation to white noise as
\[ W_t^{(N)} = \frac{1}{\sqrt{2\pi}} \sum_{k=-N}^{N} V_k e^{ikt} . \]
Check (by analysis or by simulation) that the real and imaginary parts of W_t^{(N)} are independent and distributed as N (0, (2N + 1)/(2π)).

(N )

Evaluate W_t^{(N)} on a partition of the time interval [0, 2π] and plot the empirical autocovariance function. Comment on the degree to which it resembles a Dirac delta. Compute a sample path of approximate Brownian motion by integration:
\[ B_t^{(N)} = \int_0^t W_s^{(N)} \, ds . \]
The integration of each harmonic should preferably be done analytically, i.e.
\[ \int_0^t e^{i0s} \, ds = t, \qquad \int_0^t e^{iks} \, ds = \frac{1}{ik}\left(e^{ikt} - 1\right) \quad \text{for } k \neq 0 . \]

1 ikt (e − 1) for k 6= 0. ik

Plot the real part of the sample path. Write a function which takes a number of time points, and which returns a sample of (real-valued) Brownian motion evaluated on those time points, using the calculations above. Verify the function by simulating 1,000 realizations of (B1 , B1.5 , B2 ) and computing the empirical mean and the empirical covariance matrix.

Check that the real part and imaginary part of B_t^{(N)} are independent and identically distributed. Then check, analytically, that E|B_t^{(N)}|^2 = 2t.
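The following R sketch generates one approximate sample path by integrating each harmonic analytically as above. It is an illustration under assumed choices (N = 16 and a 501-point time grid), not a full solution of the exercise.

```r
set.seed(1)
N <- 16
k <- -N:N
V <- complex(real = rnorm(2*N + 1), imaginary = rnorm(2*N + 1))  # random coefficients V_k
t <- seq(0, 2*pi, length.out = 501)
## integral of exp(i k s) over [0, t]: t for k = 0, (exp(i k t) - 1)/(i k) otherwise
intE <- sapply(k, function(kk)
  if (kk == 0) t else (exp(1i*kk*t) - 1) / (1i*kk))               # 501 x (2N+1) matrix
B <- as.vector(intE %*% V) / sqrt(2*pi)                            # complex-valued B_t^(N)
plot(t, Re(B), type = "l", xlab = "t", ylab = "Re B")
```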

Convergence of random variables

Exercise 4.50

Convergence w.p. 1, but not in L2 :

Let {Bt : t ≥ 0} be Brownian motion and define, for t ≥ 0,
\[ X_t = \exp\!\left(B_t - \tfrac{1}{2}t\right) . \]
1. Show that Xt → 0 almost surely as t → ∞, but that E|Xt |2 → ∞.
2. Simulate a number of realizations of Xt and explain in words how Xt can converge to 0 almost surely while diverging in L2 .
Note: This process {Xt : t ≥ 0} is one example of geometric Brownian motion; we will return to this process repeatedly.

Exercise 4.51

Continuity of stochastic processes:

1. Let {Nt : t ≥ 0} be a Poisson process with unit intensity, i.e. a Markov process N0 = 0 and with transition probabilities given by Nt |Ns being Poisson distributed with mean t − s for 0 ≤ s ≤ t. Show that {Nt } is continuous in the mean square but that almost no sample paths are continuous. 2. Let V be a real-valued random variable such that E|V 2 | = ∞ and define the stochastic process {Xt : t ≥ 0} by Xt = V · t. Show that {Xt } has continuous sample paths but is not continuous in the mean square.

Exercise 4.52 The (second) arcsine law: Consider Brownian motion on the time interval [0, 1]. Define τ as the last time the process hits 0: τ = sup{t ∈ [0, 1] : Bt = 0}.
1. Show that
\[ P\{\tau \le t\} = \frac{2}{\pi} \arcsin \sqrt{t} \]
for 0 ≤ t ≤ 1.
2. Estimate the distribution function of τ using Monte Carlo, by simulating N = 1,000 sample paths of Brownian motion on [0, 1] and reporting the last time the sign is changed. Use a time step of h = 0.001. Plot the empirical distribution function and compare with the analytical expression.

Exercise 4.53 The strong Markov property of Brownian motion: Recall the definition 4.4.1 of the Markov property. Note that in this definition, the "current" time t is deterministic. In many applications it is useful to allow the current time to be random. Show, by repeating the proof of theorem 4.4.1, that for Brownian motion
\[ E\{h(B_{\tau+t}) \mid \mathcal{F}_\tau\} = E\{h(B_{\tau+t}) \mid B_\tau\} \]
holds, whenever h is a measurable function, τ ≥ 0 is a stopping time w.r.t. the filtration {F t }, and t ≥ 0 is deterministic. Then, let τ be the stopping time τ = inf{t ≥ 0 : Bt ≥ 1}. What is the distribution of Bτ+1 , conditional on F τ , and unconditionally? We shall return to such strong Markov properties later, in chapter 9.


CHAPTER

5

Linear dynamic systems

The theory of systems of linear differential equations, with exogenous random inputs, is an important special case in the theory of stochastic processes. It is possible to obtain simple formulas for the mean and covariance structure. In the stationary case, when the statistics of the driving noise do not depend explicitly on time, the statistics can also be described in frequency domain, i.e. in terms of variance spectra. If the driving noise is fast compared to the time scales of the linear system, and to the time scales of interest to the modeler, then one may approximate the driving noise with a white noise signal. White noise is an idealization which can - informally - be seen as the velocity of a particle undergoing Brownian motion. White noise corresponds to a spectral density which is a constant function of frequency, and to a Dirac delta-correlated autocovariance function. Linear systems driven by white noise constitute the simplest class of stochastic differential equations. Linear models are important in practical applications, because they can quickly give explicit results. If a linear model is reasonable, or can at least be used as a first approximation, then it is typically worthwhile to start the analysis there. In this chapter, we cover the key elements in linear dynamic systems driven by noise (nonwhite or white): State-space formalism, impulse response, frequency response, autocovariance functions, and variance spectra. I assume that you have seen much of this material in a course on linear systems - indeed, this material typically makes up the better part of such a course - so the exposition is concise.

5.1

Linear systems with deterministic inputs

Let us start by briefly recapitulating the basic theory of linear time-invariant systems driven by additive inputs. A simple example of such a system is a mass connected to a wall with a spring and a damper, and subject to an external force, see figure 5.1. The governing equations

are
\[ \frac{dQ_t}{dt} = V_t , \tag{5.1} \]
\[ m\,\frac{dV_t}{dt} = -k Q_t - c V_t + F_t . \tag{5.2} \]

This can be written in the standard vector-matrix notation, where the state is Xt = (Qt , Vt )⊤ and the input is Ut = Ft :
\[ \dot X_t = A X_t + G U_t . \tag{5.3} \]
The system matrices are
\[ A = \begin{pmatrix} 0 & 1 \\ -k/m & -c/m \end{pmatrix}, \qquad G = \begin{pmatrix} 0 \\ 1/m \end{pmatrix} . \]

 0 . 1/m

For such a linear system with exogenous input, a fundamental property is the impulse response. This is the (fundamental) solution to the system which is at rest before time t = 0, i.e. Qt = Vt = 0 for t < 0, and subject to a force Ft = pδ(t) where δ is the Dirac delta and p is a constant, the momentum. Physically, this input corresponds to a force of large magnitude which is applied over a short time period starting at time 0, so that the momentum mVt is effectively instantaneously changed from 0 to p. Figure 5.2 shows the impulse response of the position. In general, the impulse response of the system (5.3) corresponding to the input Ut = δ(t) is
\[ h(t) = \begin{cases} 0 & \text{for } t < 0, \\ \exp(At)\, G & \text{for } t \ge 0. \end{cases} \]

Of equal importance is the frequency response: If we apply a force of the form Ft = F0 cos ωt, and wait until transients have died out, then the system will respond with a periodic motion Xt = F0 a(ω) cos(ωt + φ(ω)). The frequency of the response is the same as that of the applied force, but the amplitude F0 a(ω) is proportional to the magnitude of the applied force, and the constant of proportionality a(ω) depends on the frequency. Also the phase shift φ(ω) depends on frequency ω. It is convenient to write this as Xt = F0 Re(H(ω) exp(iωt)) where H(ω) is the complex-valued frequency response given by a(ω) = |H(ω)|, φ(ω) = ∠H(ω). Figure 5.2 shows the frequency response of the position. The two descriptors, impulse response and frequency response, contain the same information: The frequency response is the Fourier transform of the impulse response. For the system (5.3), the frequency response is
\[ H(\omega) = \int_{-\infty}^{+\infty} h(t)\, e^{-i\omega t}\, dt = \int_0^{\infty} e^{At} G\, e^{-i\omega t}\, dt = (i\omega I - A)^{-1} G \tag{5.4} \]

0

where i is the imaginary unit and I is a unit matrix of the same dimensions as A. Here we have assumed that transients do indeed die out, i.e. exp(At) → 0 as t → ∞, such that the impulse


Figure 5.1. A mass-spring-damper system driven by an exogenous force is a simple example of a linear system.

response is L2 and the integral converges. This amounts to A being stable, i.e. all eigenvalues have negative real parts. The frequency response can conveniently be found by searching for solutions of (5.3) of the form Ut = exp(iωt), Xt = H(ω) exp(iωt).

For a general forcing Ut applied at t ≥ 0, and an initial condition X0 = x0 , the system admits the unique solution
\[ X_t = e^{At} x_0 + \int_0^t h(t-s)\, U_s\, ds = e^{At} x_0 + \int_0^t e^{A(t-s)} G\, U_s\, ds . \tag{5.5} \]

¯ (ω) : ω ∈ R} If the forcing {Ut : t ≥ 0} is square integrable, then its Fourier transform {U exists. If furthermore A is stable, then also the response {Xt : t ≥ 0} will have a Fourier ¯ transform {X(ω) : ω ∈ R}. The two will be related by

¯ ¯ (ω) X(ω) = (iωI − A)−1 x0 + H(ω) U This formula expresses that each angular frequency ω can be examined independently. The ¯ response X(ω) is decomposed in a response to the initial condition and a response to the driving input. The response to the input is obtained by multiplying each frequency in the ¯ (ω) with H(ω), which specifies the amplification and phase shift of an input with that input U frequency.

5.2

Linear systems driven by noise

We now turn to the situation where the input is not deterministic, but a stochastic process. For the mass-spring-damper example, we assume that the driving force is a stochastic process {Ft : t ≥ 0} generated in the following way: The force is piecewise constant and changes value at the points in a Poisson process with mean interarrival time τ . At a point of change, a new force is sampled from a Gaussian distribution with mean 0 and variance σ 2 , independently of all other variables. 107

The matrix exponential: The homogeneous linear system X˙ t = AXt with initial condition X0 = x0 has the unique solution Xt = exp(At)x0 where exp(At) is termed the matrix exponential. This is in fact one way to define the matrix exponential. In principle, the matrix exponential may be computed through its Taylor series ∞

X1 1 exp(At) = I + At + (At)2 + · · · = (At)i 2 i! i=0

but the series converges quite slowly and should not be used directly; it is only useful when t is small. Better algorithms are described in (Moler and Van Loan, 2003) and are implemented in good environments for scientific computing. The matrix exponential should not be confused with element-wise exponential; (eA )ij does not in general equal eAij . If you use Matlab or R, compare the two functions exp and expm. If A admits the eigenvalue decomposition A = T ΛT −1 , then exp(At) = T exp(Λt)T −1 and if Λ is diagonala with diagonal elements λi , then exp(Λt) is a diagonal matrix with diagonal elements exp(λi t). It may also be written as exp(At) =

n X

vi eλi t ui

i=1

where ui and vi are left and right eigenvectors of A corresponding to the eigenvalue λi , normalized so that ui vi = 1. I.e. Avi = vi λi and ui A = λi ui . To connect the two formulations, note that Λ = diag(λ1 , . . . , λn ), vi are the columns of T , and ui are the rows of T −1 . These formulas highlights the central importance of eigenvalues and eigenvectors when solving linear systems of differential equations with constant coefficients. a

Similar results exist when A cannot be diagonalized, i.e. when Λ is a Jordan matrix.

108

0.2 0.5

0.6 0.4

0.5

1.0

1.5

2.0

2.5

2.0

2.5

−0.4 0

5

10

15

20

25

Time [s]

−1.5 −3.0

Phase [rad]

0.0

0.2

Angular frequency [rad/s]

−0.2

Position [m]

2.0

Frequency response Amplitude [m]

Impulse response

0.5

1.0

1.5

Angular frequency [rad/s]

Figure 5.2. Impulse response and frequency response for the mass-spring-damper system. Parameters are m = 1 kg, k = 1 N/m, and c = 0.4 Ns/m. The applied impulse has magnitude 1 Ns. Note the peak in the amplitude response at a frequency of 1 rad/s, corresponding to the damped oscillations of period 2π in the impulse response. Note also that at slower frequencies, the response is in phase with the excitation, while at faster frequencies, the response is lagging behind at eventually in counterphase.

Exercise 5.54: Construct a probability space (Ω, F, P) and define the process {Ft : t ≥ 0} on this space, such that it has the desired statistics. Exercise 5.55: Show that this noise process {Ft : t ≥ 0} is Markov (w.r.t. its own filtration). Figure 5.3 shows one realization of the force Ft , along with the response of the system (Qt , Vt ). Here, the mean time between jumps in the applied force is 20 seconds, so in most cases the systems falls to rest before the force changes again, and the step response of the system is clearly visible. Exercise 5.56: Show that the state of the mass-spring-damper system itself, i.e. (Qt , Vt ), does not have the Markov property. Then show that when we combine this state with the state of the noise subsystem, i.e. Ft , we obtain a combined state vector Xt = (Qt , Vt , Ft ) such that {Xt : t ≥ 0} is a Markov process. This exercise demonstrates a more general technique for state-space models: When we connect systems in state-space form, the state of the resulting system is the collection of the states of the components. Stated differently, the state space of the connected system is the Cartesian product of the state spaces of the components. 109

4 3 2 1 −1

0

1

2

−2

−1

0

1

2 −2

0

q v F

0

50

100

150

200

250

Time Figure 5.3. A force with random steps, and the associated response of the mass-springdamper system. Top panel: Position Q [m]. Middle panel: Velocity V [m/s]. Bottom panel: Force F [N].

110

5.3

First and second order statistics of the response

It is instructive to simulate a driving input and the response it causes, but it is also cumbersome. We would like to have simpler and more general analysis tools for computing the statistics of the response, without knowing the realization of the force, but solely from the statistics of the force. We now develop these tools for the first and second order statistics of the response, i.e. the mean value and the covariance structure. We assume that the driving process {Ut : t ≥ 0} is a stochastic process with well defined mean u ¯(t) = EUt and autocovariance ρU (s, t) = E(Us − u ¯(s))(Ut − u ¯(s))> . We assume also that the initial condition x0 is a deterministic vector, and finally we assume that {Ut : t ≥ 0} is so well-behaved that we can, for each realization of {Ut : t ≥ 0}, compute the corresponding realization of the solution {Xt : t ≥ 0} by means of the solution formula (5.5). In this formula (5.5), we can take expectation on both sides. Fubini’s theorem allows as to commute expectation and time integration, so we obtain for the mean value µt = EXt : Z

At

µ(t) = e x0 +

t

eA(t−s) G¯ u(s) ds

0

Differentiating with respect to time, we obtain an ordinary differential equation governing the mean value: d µ(t) = Aµ(t) + G¯ u(t) dt In addition, the initial condition µ(0) = x0 is given. We see that we can obtain the governing ordinary differential equation for the mean value simply by taking expectation in (5.3). ˜t = Next, we aim to obtain the covariance ρX (s, t) = E(Xs − µ(s))(Xt − µ(t))> . Using U ˜ t = Xt − x Ut − u ¯(t) and X ¯(t) for the deviations of the processes from their mean values, we first write integral formulas for the deviation at time s and t:

˜s = X

Z

s

e

A(s−v)

˜v dv and X ˜t = GU

0

Z

t

˜w dw eA(t−w) GU

0

Combining the two, and commuting the expectation and integration over time, we obtain

˜sX ˜ t> = ρX (s, t) = EX

Z 0

sZ t

eA(s−v) GρU (v, w)G> eA

> (t−w)

dw dv

(5.6)

0

This is a useful result, but it can be made much more explicit in the stationary situation. Let us first introduce stationarity in general; later we will return to this expression. 111

5.4

Stationary processes

A stationary stochastic process is one where the statistics do not depend on time; note that this does not at all imply that the sample paths are all constant functions of time. There are several notions of stationarity, but for our purpose at this point we only need the wide notion: Definition 5.4.1 (Wide-sense stationary process). A stochastic process {Xt : t ≥ 0} taking values in Rn is said to be wide-sense stationary, if E|Xt |2 < ∞ for all t ≥ 0, and EXs = EXt ,

> > EXs Xs+h = EXt Xt+h

for all s, t, h ≥ 0. Wide-sense stationarity is also referred to as weak stationarity or second-order stationarity. For a wide-sense stationary process with mean µ = EXt , the autocovariance depends only on the time lag, so we define the autocovariance function ρX : R 7→ Rn×n by Careful! Different authors the ρX (h) = E(Xt − µ)(Xt+h − µ)> . word autocovariance Here t may be chosen arbitrarily, subject to the restriction that Xt and Xt+h are defined! function in slightly Example 5.4.1 (Auto-covariance of the driving force Ft ). Consider again the force process different {Ft : t ≥ 0} described in section 5.2. Its mean is µ = EFt = 0, for any t ≥ 0. To derive its meanings. autocovariance function ρF (h) = EFt Ft+h for h ≥ 0, we condition on F t and use the simple Tower property: EFt Ft+h = E(E{Ft Ft+h |F t }) = E(Ft E{Ft+h |F t }) We can compute the conditional expectation E{Ft+h |F t } as follows: If the force does not change between time t and time t + h, then Ft+h = Ft . This has probability exp(−h/τ ), since the time to next event in a Poisson process is exponentially distributed. On the other hand, if the force changes between time t and time t + h, then the conditional expectation of Ft+h is 0. Combining, we get E{Ft+h |F t } = Ft exp(−h/τ ) and thus ρF (h) = σ 2 exp(−|h|/τ )

(5.7)

Here we have used symmetry to obtain also autocovariance at negative lags. This form is archetypal: It contains a time constant τ and a variance, σ 2 . In this example, the noise process {Ft : t ≥ 0} decorrelates exponentially so that the autocovariance function ρF is L2 . In this case,1 we can define the variance spectrum SF as the Fourier transform 1

According to the Wiener-Khinchin theorem, this requirement can be relaxed.

112

2.00 0.50 0.20 0.05

0.10

Spectrum SF(ω)

2 1

τ = 0.25 τ=1

0.02

0

A.c.f. ρF(h)

3

1.00

4

τ = 0.25 τ=1

−4

−2

0

2

4

0.1

0.2

0.5

1.0

2.0

5.0

10.0

Frequency ω

Time lag h

Figure 5.4. Autocovariance function and variance spectrum of the piecewise constant force process {Ft : t ≥ 0}. Shown are results for two values of the time constant τ ; the variance σ 2 as adjusted so that σ 2 τ = 1. Note, in the spectrum, that the cutoff frequency 1/τ marks a transition from low-frequency behavior to high-frequency behavior.

113

Z

+∞

ρF (h) exp(−iωh) dh

SF (ω) =

.

−∞

With the particular autocovariance function (5.7), we get the spectrum Z

+∞

ρF (h) exp(−iωh) dh =

SF (ω) = −∞

2σ 2 τ 1 + ω2τ 2

(5.8)

as shown in figure 5.4. This form is again archetypal: There is a low-frequency response 2σ 2 τ , which expresses the strength of slow oscillations present in the response, a cut-off frequency ω which indicates the transition from slow to fast modes, and a roll-off at high frequencies where the response decays with ω 2 . When the process is scalar, as in this case, we may replace the complex exponential exp(−iωh) with the cosine cos(ωh) since the autocorrelation function is even, and the spectrum is realvalued. To justify the name variance spectrum, note that by the inverse Fourier transform

1 ρF (t) = 2π

Z

+∞

SF (ω) exp(iωt) dω −∞

and in particular 1 VFt = ρF (0) = 2π

Z

+∞

SF (ω) dω −∞

We see that the variance spectrum SF (ω) decomposes the variance of Ft into contributions from cycles of different frequencies.

5.4.1

Stable systems driven by stationary processes

We now return to a linear system (5.3), such as the mass-spring-damper system, driven by a noise process {Ut : t ≥ 0}, such as the force process, and assume that two conditions are met: First, the system is exponentially stable, i.e. all eigenvalues of A have negative real part, so that the effect of the initial condition and old inputs vanishes as time progresses. Second, the driving input {Ut : t ≥ 0} has stationary covariance structure, i.e. the covariance function ρU (s, t) depends only on the time lag t − s. In this case the covariance structure of the solution {Xt : t ≥ 0} will also approach a stationary situation where the autocovariance function ρX (s, t) depends only on the time lag. Writing ρU (t − s) for ρU (s, t) and ρX (t − s) for ρX (s, t), we obtain Z

sZ t

ρX (s, t) = 0

eA(s−v) GρU (w − v)G> eA

0

114

> (t−w)

dw dv

and for s, t → ∞ with l = t − s fixed, this converges to Z

∞Z ∞

ρX (l) = 0

eAv GρU (l + v − w)G> eA

>w

dw dv

0

Exercise: Verify this convergence, using that exp(Av) converges exponentially to zero as v → ∞, and that ρU (·) is bounded by ρU (0). This is a useful result, but it is even more instructive in frequency domain. Taking Fourier transform of the autocovariance function ρX (·), we obtain the variance spectrum 1 SX (ω) = 2π

Z

+∞

ρX (l) exp(−iωl) dl −∞

With standard methods for Fourier transforms we obtain SX (ω) = H(−ω) · SU (ω) · H > (ω)

(5.9)

Exercise: Verify this! In case where Xt and Ut are scalar, we get the simpler formula SX (ω) = |H(ω)|2 · SU (ω) In words, the variance contribution from a given frequency ω depends on its presence in U and its magnification through system dynamics. Example 5.4.2. We apply the result (5.9) to the mass-spring-damper system in section 5.2. Combining the variance spectrum (5.8) of the force process with the frequency response (5.4) of the mass-spring-damper system, we obtain the variance spectrum of position and velocity in figure 5.5. Note the resonance peak at of the system at 1 rad/s, and the roll-off in F and X beginning at 0.05 rad/s due to the time constant τ .

5.5

The white noise limit

For the example in the previous section, the mass-spring-damper system had a resonance frequency at 1 rad/s while the driving force was constant of periods of 20 s, on average. In short, the driving force was slow compared to the system dynamics. The stochastic differential equations we are interested in are characterized by the opposite: The driving noise is fast compared to system dynamics. For the example of molecular diffusion, we consider molecular collisions to be a process that occurs at small spatial and temporal scales. Figure 5.6 shows spectra for the situation where the force applied to the mass-spring-damper system is faster than the system dynamics. Specifically, the resonance frequency is still 1 rad/s corresponding to a period of 2π s, but now the mean time between force jumps is τ = 0.1 s. 115

1e+01



● ●





● ●

1e−01 0.01

0.02

0.05

0.10

0.20



●● ●

●●

●●● ●●● ● ● ●● ● ● ● ● ● ● ●

●● ● ● ●● ● ●● ●● ●●● ● ● ●● ● ●● ●●● ●● ● ● ●●● ●● ● ●● ●● ●● ● ● ●● ●● ●● ● ● ● ● ●●● ●● ●● ● ● ● ●● ● ● ● ● ● ●● ●● ● ● ● ● ● ● ●●● ● ● ● ●● ● ● ●● ● ●●● ● ● ● ● ● ● ●● ● ●● ● ● ●● ● ● ●● ● ●●● ● ● ● ● ● ●

0.50

1.00

2.00

0.200

Frequency [rad/s] ● ●











0.020 5.00 0.002



●●● ● ●● ●● ● ●● ●● ● ●● ● ● ●● ● ●● ● ●● ● ● ● ●● ● ●● ● ● ●●● ● ●● ● ●● ● ● ●● ● ● ● ●●



0.01



0.02



0.05



0.10●

0.20 ●

0.50

1.00

2.00



Frequency [rad/s] ● ●

● ●





● ●

0.50

● ●







0.05

Velocity Svv[m2s] Force SFF[m2s]





● ● ●

1e−03

Position Sxx[m2s]

Variance spectra

0.01

0.02

0.05

0.10

0.20

0.50

● ● ● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ● ●● ● ●●●●●●●●●●●● ● ● ●● ● ● ●●● ● ● ● ● ●● ● ●●● ● ●● ● ● ● ● ● ●●● ● ●●● ● ●● ●● ● ● ●●●●● ● ●● ● ●● ● ●●● ●

1.00

2.00

Frequency [rad/s]

Figure 5.5. Variance spectrum of position, velocity, and force for the mass-springdamper system. Lines are the analytical expressions. Dots are estimated spectra based on simulation of the process.

116

Variance spectra ●















● ● ●



● ●









● ●







●●

●●

●● ●

●● ●●

●●



●●





0.1

0.2

0.5 ●



1.0

● ● ●

2.0

● ●

●●





● ●



●●

●●

●●









● ●●



●●



0.1



20.0



● ●

10.0

●● ●

●● ●





0.2

0.5

●●



●●





1.0

2.0

5.0

●●

●●



●●



10.0

●●

















20.0



Frequency [rad/s]

0.50

2.001e−04







1e−02

5.0



Force SFF[m2s]

●●

● Frequency ● [rad/s]







●●





Velocity Svv[m2s]



1e−03



1e+00 1e−07

Position Sxx[m2s]







● ●





● ●





● ●

● ● ● ● ● ● ● ●

● ● ● ●





●●



●●●● ●● ●

●●●

●●●



●●●

0.02

0.10









●●



●●

●●

●●

●●

●● ●

0.1

0.2

0.5

1.0

2.0

5.0

10.0

20.0

Frequency [rad/s]

Figure 5.6. Variance spectrum of position, velocity, and force for the mass-springdamper system. As figure 5.5 but for τ = 0.1 s.

117

In terms of the spectra, we see that in the frequency range up to, say, 5 rad/s, we can approximate the spectrum of the driving force with a constant function SF (ω) ≈ 2σ 2 τ for ω  1/τ Moreover, we see that the total variance of the response Qt is not very sensitive to the spectrum of the force F at frequencies larger than 5 rad/s, since the frequency response of the system is small for such high frequencies. Therefore we may as well ignore the details of the spectrum SF at high frequencies, and approximate SF (ω) with the constant 2σ 2 τ for all frequencies ω. This is called the white noise approximation: Recall that when characterizing colors, white light has the property that all frequencies or wavelengths contribute equally to the energy. By analogy, a white noise signal is one where all frequencies contribute equally to the variance or power. White noise signals are an idealization, just as is white light: Such a signal would have infinite variance. But it is a useful approximation; when the force changes rapidly compared to the system dynamics, we may approximate the force with a white noise signal. The approximation is valid as long as we operate in the frequency range of the system, i.e. at frequencies ω  1/τ . We find that the spectrum of the position is well approximated by SQ (ω) ≈ 2σ 2 τ |H(ω)|2 for ω  1/τ Approximating the force with white noise amounts to letting τ → 0, but at the same time letting σ 2 → ∞ so that the spectrum SF (0) = 2σ 2 τ remains constant at frequency 0. At any other frequency ω, we have (pointwise) convergence SF (ω) → SF (0) as τ → 0. In terms of the autocovariance function of the driving force, which was ρF (h) = σ 2 exp(−|h|/τ ) we see that this corresponds to approximating the autocovariance function with a Dirac delta: ρF (h) → 2σ 2 τ · δ(h) In the limit, ρF (h) vanishes for any non-zero h, and therefore the time-domain characterization of white noise is independence, i.e. F (s) and F (t) are uncorrelated for any s 6= t.

5.6

Integrated white noise is Brownian motion

In this section we show the connection between white noise and Brownian motion: Brownian motion can, formally, be regarded as integrated white noise. Stated differently, white noise can - formally - be seen as the derivative of Brownian motion. 118

First, we investigate the difference quotient of Brownian motion. Let {Bt : t ≥ 0} be Brownian motion, let a time lag k be given, and define the stochastic process {Xt : t ≥ 0} by Xt =

1 (Bt+k − Bt ) k

This Xt is a difference quotient, and for small k we may think of {Xt : t ≥ 0} as an approximation to the (non-existing) derivative of Brownian motion. Exercise: Show that {Xt : t ≥ 0} is wide sense stationary, has mean 0, and the following autocovariance function:

ρX (h) =

k − |h| ∨0 k2

.

This autocorrelation function is shown in figure 5.7. Note that as the time lag k decreases towards 0, the a.c.f. approaches a Dirac delta. This justifies the useful but imprecise statement that the derivative of Brownian motion is delta-correlated. Figure 5.7 displays also the spectrum of the difference quotient {Xt : t ≥ 0}. The analytical expression for this spectrum is  SX (ω) =

ωk 2 1−cos ω 2 k2 1

for ω 6= 0, for ω = 0.

Note that the spectrum at frequency 0 is SX (0) = 1 for any k, since the a.c.f. integrates to 1 for any k. Note also that as k → 0, the spectrum SX (ω) converges to the constant 1, for any frequency ω, in agreement with the a.c.f. approaching a Dirac delta. To recap, a constant spectrum is also called “white”, with a coarse analogy to light in the range of wavelengths visible to the human eye. This is the motivation behind the statement“the derivative of Brownian motion is white noise”. This statement is useful but should, of course, not be taken too literally since Brownian motion is not differentiable. Now, conversely, consider a white noise signal Ut with mean 0 and autocovariance function ρU (h) = δ(h), which is to say that its variance spectrum is SU (ω) = 1. The following derivation is purely formal, so try not to be disturbed by the fact that such a signal does not exist! Instead, define the integral process {Bt : t ≥ 0} Z

t

Bt =

Us ds 0

and consider the covariance structure of {Bt : t ≥ 0}. We can apply formula (5.6) with A = 0, G = 1 to get ρB (s, t) = EBs Bt = s 119

0.15

10

0.10

k= 0.1 k= 1

0

0.00

2

0.05

4

Spectrum SX(ω)

8 6

A.c.f. ρX(h)

k= 0.1 k= 1

−2

−1

0

1

2

−10

−5

0

5

10

Frequency ω

Time lag h

Figure 5.7. Autocorrelation function (left) and variance spectrum (right) of the difference quotient of Brownian motion. In particular, VBt = t. By stationarity, we get V(Bt −Bs ) = t−s for any 0 < s < t. Exercise: Show that the increments of {Bt : t ≥ 0} are uncorrelated. I.e., assume 0 < s < t < v < w, and show that E(Bt − B − s)(Bw − Bv ) = 0. By appeal to the Central Limit Theorem, Bt is Gaussian. We see that {Bt : t ≥ 0} agrees with our definition 4.2.1 of Brownian motion. This formal calculation justifies the statement “Brownian motion is integrated white noise”. Again, this statement is useful but should not be taken too literally since continuous time white noise does not exist as a stochastic process in our sense.

5.7

Linear systems driven by white noise

We now return to the linear system driven by noise X˙ t = AXt + GUt

(5.10)

We are interested in the limiting case where Ut approaches white noise. We will call this limit a linear stochastic differential equation in the narrow sense, because the drift term AXt is linear in the state, and the noise term GUt is independent of the state. We say that the noise enters additively. 2 Here, we approach this limit indirectly since white noise does not exist as a stochastic process. We first integrate the equation w.r.t. dt to obtain 2

This is in contrast to linear SDE’s in the wide sense, where the noise term depends linearly on the state.

120

Z

t

Xt − X0 =

Z

t

Us ds

AXs ds + G 0

0

Now, as U approaches white noise, the last integral will approach Brownian motion t

Z Xt − X0 =

AXs ds + GBt 0

Since this equation involves Brownian motion and not white noise, it does not suffer from the problem that white noise does not exist. We shall see that it is the right starting point for a general theory of stochastic differential equations. The solution to the equation (5.10) can be written t

Z

exp(A(t − s)GUs ds

Xt = exp(At) X0 + 0

and, letting U approach white noise and substituting dBs for Us ds, we can write this as Z

t

exp(A(t − s)G dBs

Xt = exp(At) X0 + 0

The last term is a stochastic integral, an integral with respect to Brownian motion, and a major milestone in this course is to develop this integral. In this situation, however, the integrand exp(A(t − s)) is smooth enough that we can interpret the integral as a standard Riemann-Stieltjes integral.

Finally, let us identify the covariance structure of this Xt. From (5.6), we obtain

    ρX(s, t) = ∫_0^s e^{A(s−v)} GG^T e^{A^T(t−v)} dv

using that the autocovariance function of the noise is ρU(v, w) = δ(v − w). It is convenient to first look at the variance at time t, Σ(t) = ρ(t, t):

    Σ(t) = ∫_0^t e^{A(t−v)} GG^T e^{A^T(t−v)} dv

Differentiating with respect to t, we obtain

    dΣ(t)/dt = AΣ(t) + Σ(t)A^T + GG^T     (5.11)

This is a linear matrix differential equation, known as the differential Lyapunov equation. Together with the initial condition Σ(0) = 0 it determines the variance function. See exercise 5.58 for an alternative derivation, and exercise 5.59 for methods for finding the solution numerically. With Σ(·) in hand, we can find the autocovariance function using the Markov property of X:

    ρ(s, t) = E X̃s X̃t^T                      (5.12)
            = E( E{ X̃s X̃t^T | Xs } )          (5.13)
            = E( X̃s E{ X̃t^T | Xs } )          (5.14)
            = E( X̃s X̃s^T e^{A^T(t−s)} )       (5.15)
            = Σ(s) · e^{A^T(t−s)}             (5.16)

Of special interest is the stationary case where Σ(t) does not depend on t. To reiterate, this does not mean that the process X̃t is constant, but rather that its statistics are constant. The stationary variance is given by an algebraic Lyapunov equation

    AΣ + ΣA^T + GG^T = 0

which expresses that Σ is an equilibrium of the differential Lyapunov equation. It can be shown that this equation has a unique solution if A contains no eigenvalues on the imaginary axis; moreover, if A is exponentially stable (all eigenvalues in the open left half plane), then the unique solution Σ is positive semidefinite. In this case, Σ(t) → Σ as t → ∞. Moreover, Σ will be positive definite if the noise excites the system (more precisely, if the pair (A, G) is controllable); a sufficient (but far from necessary) condition is that G is square and invertible. Exercise: Formulate and verify these statements in the scalar case. In summary, in this common situation - a stable system all the dynamics of which are excited by the noise - the process Xt will approach a stationary process. The stationary variance is Σ, the unique solution to the algebraic Lyapunov equation, and the autocovariance function is ρ(h) = Σ exp(A^T h) for h ≥ 0.
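As a concrete numerical illustration of this (not part of the original notes), the stationary variance and autocovariance can be computed with a few lines of Python, assuming numpy and scipy are available; the matrices A and G below are arbitrary example values.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, expm

# Arbitrary example system: a stable drift matrix A and a noise intensity G.
A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])
G = np.array([[1.0, 0.0],
              [0.0, 0.5]])

# Stationary variance: solve the algebraic Lyapunov equation A S + S A^T + G G^T = 0.
Sigma = solve_continuous_lyapunov(A, -G @ G.T)

# Stationary autocovariance rho(h) = Sigma exp(A^T h) for a lag h >= 0.
h = 0.3
rho = Sigma @ expm(A.T * h)
print(Sigma)
print(rho)
```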

5.8 Summary

In this chapter, we have seen that linear systems (of ordinary differential equations) driven by random inputs make a quite tractable class of stochastic dynamic systems. We can work out the mean and autocovariance structure quite explicitly, and even if these two statistics do not fully describe a stochastic process, they may be sufficient for a given purpose. In the stationary case, where systems are stable and we assume that the effect of a distant initial condition has decayed, we obtain explicit formulas, which can conveniently be stated in frequency domain: the spectrum of the output is obtained by multiplying the spectrum of the input with the squared magnitude of the frequency response. This holds in the scalar case; the multivariate case follows straightforwardly.

When the noise fluctuates fast relative to the system dynamics, it may be an advantage to approximate it with white noise. In the time domain, this corresponds to approximating the autocovariance function with a Dirac delta, while in the frequency domain, it corresponds to approximating the variance spectrum with a constant function. It should be kept in mind that white noise only exists as an idealization. Linear systems driven by white noise are particularly simple to analyze with respect to variance structure; the Lyapunov equation is a key element. Such linear systems are a simple special case of stochastic differential equations; many of the intricacies of the general theory are not visible at first sight in the linear case. This theory is highly useful in practice.

5.9 Notes and references

Decorrelation time: For a stationary scalar stochastic process Xt, the decorrelation time T is defined as the integral of its autocorrelation function from 0 to ∞, or equivalently, half the integral from −∞ to +∞. In terms of the autocovariance function r(t),

    T = (1/2) ∫_{−∞}^{+∞} r(t) dt / r(0)

For the Ornstein-Uhlenbeck process, the decorrelation time equals the time constant. Exercise: Verify this. In general, the decorrelation time gives one important indication of the temporal scales which characterize the process. Note that the decorrelation time can be written

    T = S(0) / (2 r(0))

where S(ω) is the variance spectrum of the process. For a perfect band-limited process with cut-off frequency (bandwidth) ω0, i.e. a process for which the spectrum is

    S(ω) = S0 if |ω| ≤ ω0, and S(ω) = 0 otherwise,

the decorrelation time is T = π/(2ω0). Exercise: Verify this. Hence, the decorrelation time is the time to complete a quarter cycle for the fastest frequencies present in the process.

Although the decorrelation time is a useful descriptor, one should be careful with its interpretation for oscillating processes. Oscillations contribute to the variance r(0), but average out in the long run and so do not contribute to S(0). As a result, oscillations tend to decrease the decorrelation time. For an extreme example, consider Xt = sin(ωt + Φ) where ω is fixed and Φ is a random variable, distributed uniformly over [0, 2π). This is a stationary process with decorrelation time 0, although the process never decorrelates, in the sense that the autocorrelation

function does not converge to 0 as the time lag goes to ±∞. Exercise: Verify this. If you are uncomfortable with the fact that the autocovariance function is not square integrable, consider instead a process whose spectrum vanishes except when |ω| ∈ [ω1, ω2], where 0 < ω1 < ω2.
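To make the decorrelation time concrete, the following small Python sketch (not part of the original notes) checks numerically that an Ornstein-Uhlenbeck autocovariance r(t) = σ² exp(−λ|t|) has decorrelation time 1/λ, as claimed above; the parameter values are arbitrary.

```python
import numpy as np

lam, sigma = 2.0, 1.5                     # arbitrary example parameters
t = np.linspace(0.0, 50.0 / lam, 200001)  # fine grid; the tail beyond ~50/lam is negligible
r = sigma**2 * np.exp(-lam * t)           # OU autocovariance for t >= 0

T = np.trapz(r, t) / r[0]                 # integral of the autocorrelation from 0 to infinity
print(T, 1.0 / lam)                       # the two agree up to discretization error
```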

5.10 Exercises

Exercise 5.57: Consider an RC-circuit (resistor-capacitor) given by the differential equation

    dQt/dt = −(1/(RC)) Qt + (1/R) Ut

where Qt is the charge in the capacitor at time t and Ut is the voltage at time t. Assume that the driving noise Ut is stationary, has mean 0 and root mean square σ = 1 V, and that its decorrelation time is τ.

1. Sketch a simple autocovariance function of Ut which is consistent with these statistics. Hint: Try a form r_UU(t) = r_UU(0) · exp(−λ|t|). Sketch the variance spectrum S_UU(ω).
2. Pose a linear stochastic differential equation, driven by standard white noise, which is consistent with the a.c.f. of question 1.
3. Derive and sketch the impulse response h(t) and the frequency response H(ω) of the linear system which has Ut as input and Qt as output.
4. Sketch the variance spectrum S_QQ(ω) of Qt, in the two situations: (a) τ ≪ RC and (b) τ ≫ RC. For the remainder, we assume τ ≪ RC.
5. Specify a white noise signal which is a reasonable approximation to Ut. State a limit on the parameters σ, τ corresponding to the approximation.
6. State a first-order linear stochastic differential equation driven by white noise which governs Qt.
7. Assume that Q0 = 0. Sketch VQt as a function of time t.
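As a numerical companion to exercise 5.57 (a sketch only, not a solution key, and not part of the original notes), one can simulate the circuit driven by an Ornstein-Uhlenbeck voltage with variance σ² and decorrelation time τ, and compare the regimes τ ≪ RC and τ ≫ RC. The Python code below uses a simple Euler scheme; all parameter values are arbitrary.

```python
import numpy as np

def simulate_rc(R=1.0, C=1.0, tau=0.01, sigma=1.0, T=20.0, dt=1e-3, seed=0):
    """Euler simulation of dQ/dt = -Q/(R*C) + U/R, where U is an OU process
    with stationary variance sigma^2 and decorrelation time tau."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    Q = np.zeros(n + 1)
    U = np.zeros(n + 1)
    for i in range(n):
        dB = rng.normal(0.0, np.sqrt(dt))
        # OU noise: dU = -(U/tau) dt + sigma*sqrt(2/tau) dB
        U[i + 1] = U[i] - U[i] / tau * dt + sigma * np.sqrt(2.0 / tau) * dB
        Q[i + 1] = Q[i] + (-Q[i] / (R * C) + U[i] / R) * dt
    return np.linspace(0.0, T, n + 1), Q, U

t, Q, U = simulate_rc(tau=0.01)   # tau much smaller than RC: U acts nearly as white noise
print(np.var(Q))
```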

Exercise 5.58 The differential Lyapunov equation revisited: An alternative derivation of the differential Lyapunov equation (5.11),

    dΣ(t)/dt = AΣ(t) + Σ(t)A^T + GG^T,

is as follows: Assume that Σ(t) = VXt is given. We aim to find Σ(t + h). To this end, we use the formula

    Xt+h = Xt + ∫_t^{t+h} AXs ds + G(Bt+h − Bt)

If the time step h is short, then it is reasonable to approximate the integral with AXt · h, obtaining:

    Xt+h ≈ Xt + AXt · h + G(Bt+h − Bt)

We will justify this Euler discretization later. For now, take it for granted, and use it to find the variance Σ(t + h) of Xt+h. Divide by h and let h → 0 to find a differential equation for Σ(t).

Exercise 5.59 Numerical solution of the differential Lyapunov equation: Consider again the differential Lyapunov equation (5.11)

    dΣ(t)/dt = AΣ(t) + Σ(t)A^T + GG^T

governing the variance-covariance matrix for the linear system dXt = AXt dt + G dBt. Here, A, Σ(t) and GG^T are n-by-n matrices.

1. Show the following: If there exists an S = S^T such that AS + SA^T + GG^T = 0, then the solution can be written

       Σ(t) = S − e^{At} S e^{A^T t}

   Note: If A is exponentially stable (i.e., all eigenvalues have negative real parts), then S is guaranteed to exist, will be non-negative definite, and is the steady-state variance-covariance matrix of the process, i.e. Σ(t) → S as t → ∞.

2. Show that the differential Lyapunov equation can be written as

       ds_t/dt = M s_t + g

   where s_t is a column vector made from the entries of Σ(t) by stacking columns on top of each other, g is a column vector made from GG^T in the same way, and M is the n²-by-n² matrix

       M = A ⊗ I + I ⊗ A

   where ⊗ is the Kronecker product, and I is an n-by-n identity matrix.

3. Show that the solution, when M is invertible, is

       s_t = M^{−1}(e^{Mt} − I) g + e^{Mt} s_0

4. Show that the solution can be written as

       [ g   ]          [ g   ]              [ 0  0 ]
       [ s_t ] = e^{Pt} [ s_0 ]   where  P = [ I  M ]

   Hint: Show that (g, s_t) satisfies the linear ODE

       d/dt [ g   ]   [ 0         ]
            [ s_t ] = [ g + M s_t ]

5. Implement the solution in a function in your favorite programming language. Test the function; use for example a simple system where we know the solution beforehand, or use asymptotics where t ↓ 0 and t → ∞. Save the function for future use. (A sketch in Python follows below.)
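For item 5, a minimal Python implementation of the block-matrix approach from item 4 could look as follows (a sketch under the stated assumptions, not part of the original notes); it assumes numpy and scipy, and the test at the end is an arbitrary scalar example where the solution is known in closed form.

```python
import numpy as np
from scipy.linalg import expm

def lyapunov_variance(A, G, t):
    """Solve dSigma/dt = A Sigma + Sigma A^T + G G^T with Sigma(0) = 0 at time t,
    via the matrix exponential of the block matrix P = [[0, 0], [I, M]]."""
    n = A.shape[0]
    I = np.eye(n)
    M = np.kron(A, I) + np.kron(I, A)            # vectorized Lyapunov operator
    g = (G @ G.T).reshape(-1, order="F")         # vec(G G^T), stacking columns
    s0 = np.zeros(n * n)                         # vec(Sigma(0)) = 0
    P = np.block([[np.zeros((n * n, n * n)), np.zeros((n * n, n * n))],
                  [np.eye(n * n),            M                      ]])
    z = expm(P * t) @ np.concatenate([g, s0])
    return z[n * n:].reshape((n, n), order="F")  # unstack columns to get Sigma(t)

# Scalar test: A = -1, G = 1 gives Sigma(t) = (1 - exp(-2 t)) / 2.
A = np.array([[-1.0]]); G = np.array([[1.0]])
print(lyapunov_variance(A, G, 2.0), (1 - np.exp(-2 * 2.0)) / 2)
```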

Exercise 5.60 The noisy harmonic oscillator: Consider the linear SDE

    dXt = AXt dt + σ dBt

where Xt ∈ R², {Bt : t ≥ 0} is two-dimensional standard Brownian motion, and

    A = [ −λ  −ω ]
        [  ω  −λ ]

1. Simulate a sample path of the system, taking σ = 1, ω = 1, starting at X0 = 0 and on the time interval [0, 30]. Use different levels of λ, e.g. 1/10, 1/2, 1. (A simulation sketch is given after this exercise.)

2. Show that

       exp(At) = e^{−λt} [ cos ωt  −sin ωt ]
                         [ sin ωt   cos ωt ]

3. Show that the transition probabilities of {Xt} are Gaussian and given by

       Xt | Xs ∼ N( exp(A(t − s)) Xs , (σ²/(2λ)) (1 − e^{−2λ(t−s)}) I )

   where I is the 2-by-2 identity matrix.

4. Assume that {Xt} is stationary. Compute and plot the autocovariance function of {Xt}, for the same parameters as in question 1.

Note: This system is a useful model of a stochastic process that oscillates periodically, albeit perturbed by noise. The period is 2π/ω. In stationarity, the two state variables are uncorrelated and have variance σ²/(2λ). The ratio ω/λ controls the signal-to-noise ratio, i.e. how visible the oscillations are through the noise.
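The following is a minimal sketch for question 1 (not part of the original notes), using a simple Euler-Maruyama scheme in Python; the step size and random seed are arbitrary choices.

```python
import numpy as np

def simulate_oscillator(lam, omega=1.0, sigma=1.0, T=30.0, dt=0.001, seed=0):
    """Euler-Maruyama simulation of dX = A X dt + sigma dB,
    with A = [[-lam, -omega], [omega, -lam]] and X_0 = 0."""
    rng = np.random.default_rng(seed)
    A = np.array([[-lam, -omega], [omega, -lam]])
    n = int(T / dt)
    X = np.zeros((n + 1, 2))
    dB = rng.normal(0.0, np.sqrt(dt), size=(n, 2))   # Brownian increments ~ N(0, dt)
    for i in range(n):
        X[i + 1] = X[i] + A @ X[i] * dt + sigma * dB[i]
    return np.linspace(0.0, T, n + 1), X

# The three damping levels from question 1.
for lam in (0.1, 0.5, 1.0):
    t, X = simulate_oscillator(lam)
    print(lam, X[-1])
```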

CHAPTER 6

Integration w.r.t. Brownian motion

Differential equations, driven by white noise, do not admit differentiable solutions, due to the irregularity of white noise. This problem can be by-passed in the case of linear systems, but needs to be addressed for non-linear systems. A fruitful approach, which we follow, is to rewrite the differential equations in integral form. This leads us to the problem of how to understand an integral, when the integrand is white noise. The central point in Kiyosi Itô's construction is that such integrals should be understood as integrals with respect to Brownian motion. However, they cannot be understood as simple Riemann-type integrals familiar from deterministic calculus, because the sample paths of Brownian motion do not have bounded variation. Therefore we must develop a new integral, which defines what it means to integrate with respect to Brownian motion. This is the Itô integral, which we present in this chapter. In this development, and in contrast to the Riemann integral, we need to make a choice as to where we evaluate the integrand when we evaluate the contribution from a small sub-interval. In the Itô integral, we choose to evaluate the integrand at the left end-point. In this chapter, we motivate the Itô integral, state the definition, and show its properties. The two most important properties of the Itô integral are, first, that it is a martingale, and second, the Itô isometry, which relates the variance of the integral to the mean square of the integrand.

6.1 Models based on integral equations

Many of us are more used to modeling dynamic systems through differential equations than through integral equations. For example, consider again the mass-spring-damper system from chapter 5, depicted in figure 6.1: A body with mass m has position Qt ∈ R and velocity Vt. It is subject to an external force Ft, a linear spring force −kQt, and a linear damping force −cVt. In chapter 5 we followed standard practice and stated the equations of motion as two coupled ODEs

Figure 6.1. A mass-spring-damper system can be modeled with differential equations or integral equations.

    dQt/dt = Vt                          (6.1)
    m dVt/dt = −kQt − cVt + Ft           (6.2)

However, we could equally well model this system with two coupled integral equations:

    Qt = Q0 + ∫_0^t Vs ds ,    mVt = mV0 + ∫_0^t (−kQs − cVs) ds + Pt

where we have introduced the momentum Pt supplied to the particle by the external force Ft during the time interval [0, t], given by

    Pt := ∫_0^t Fs ds

The ordinary differential equations are mathematically equivalent to the integral equations, as long as the external force F : [0, ∞) 7→ R is continuous: Two differentiable functions Qt , Vt : [0, ∞) 7→ R satisfy the ordinary differential equations if and only if they satisfy the integral equations. Notice that the integral equations have direct physical interpretation; in particular, the change in momentum of the mass equals the momentum delivered by the wall and the external force. An advantage of the integral version is that it applies also when the momentum P : [0, ∞) 7→ R is a non-differentiable function. In a deterministic context, a simple example of a nondifferentiable momentum function is a step function, corresponding to an impulsive force at some time t0 . This situation can be treated with elementary methods, since the momentum is differentiable except at one singular point, so we can solve the equation classically for t < t0 and for t > t0 , and then glue the solution together at t = t0 . This was what we did to establish the impulse response in section 5.1. A more difficult example, and one of greater relevance to our agenda, is the case where the momentum P : [0, ∞) 7→ R is a sample path of Brownian motion. Exercise: Consider the situation where the momentum Pt originates from many collisions between the mass and small atmospheric molecules, occurring in the time 128

interval [0, t]. Describe why it may be appropriate to model {Pt : t ≥ 0} as Brownian motion in this situation. In general, since Brownian motion is not differentiable, we use integral equations to model dynamic systems driven by noise, and not differential equations.

6.2 From white noise to stochastic integrals

Let us now attempt to rewrite the equation

    dXt/dt = f(Xt) + g(Xt) ξt     (6.3)

as an integral equation, where ξt should be white noise. Recall from chapter 5 that Brownian motion can loosely be seen as integrated white noise, so Brownian motion should satisfy the simplest example of such an equation:

    dBt/dt = ξt

corresponding to f ≡ 0, g ≡ 1. Now, as we saw in chapter 5, Brownian motion is not differentiable, so the left hand side is not well defined, and white noise is not a stochastic process in the sense of chapter 4, so the right hand side is not well defined either. This problem carries over to the more general equation (6.3): there is no hope of finding a solution in the classical sense, i.e. a stochastic process {Xt : t ≥ 0} with differentiable sample paths. So let us make use of the standard trick and rewrite the differential equation in integral form. Integrating equation (6.3) formally, we get

    Xt − X0 = ∫_0^t f(Xs) ds + ∫_0^t g(Xs) ξs ds

Now, pretend momentarily that ξt is a continuous function with anti-derivative Bt. Then we would have ξs ds = dBs, i.e.

    Xt − X0 = ∫_0^t f(Xs) ds + ∫_0^t g(Xs) dBs     (6.4)

where the last integral would be a Stieltjes integral. We will discuss Stieltjes integrals in the next section; if you are not comfortable with the notion, then you may consider this simply a substitution of variable in the integral. To proceed, we need to define the integral

    It := ∫_0^t g(Xs) dBs     (6.5)

This definition, due to Itô, is the topic of section 6.5. A similar route to the same integral equation is based on the Euler discretisation of the original deterministic ordinary differential equation:

    X^(h)_{(i+1)h} = X^(h)_{ih} + f(X^(h)_{ih}) h

Here, the subscript indicates the time, while the superscript indicates that X^(h)_t is an approximation of the true solution Xt, and that the approximation error depends on the time step h. Now, we add a noise term to the right hand side, to perturb the solution away from the deterministic one:

    X^(h)_{(i+1)h} = X^(h)_{ih} + f(X^(h)_{ih}) h + g(X^(h)_{ih}) ξ^(h)_i     (6.6)

Here, {ξ^(h)_i : i ∈ N} is a discrete-time white noise process. The statistics of these noise terms ξ^(h)_i should of course depend on the chosen time step h. To see how, consider again the simplest case f ≡ 0, g ≡ 1, where we would like the solution to be Brownian motion, X^(h)_t = Bt. Then we have

    ξ^(h)_i = B_{(i+1)h} − B_{ih} ∼ N(0, h)

Notice in particular that the variance of the noise term ξ^(h)_i is linear in the time step h, consistent with the diffusive scaling between space and time we have seen previously in sections 2.1.3 and 4.4. Now, we can rewrite the difference equation (6.6) as

    X^(h)_{(i+1)h} − X^(h)_{ih} = f(X^(h)_{ih}) h + g(X^(h)_{ih}) (B_{(i+1)h} − B_{ih})     (6.7)

Now, sum up this equation for i ∈ {0, . . . , n}, where (n + 1)h = T, to get

    X^(h)_T − X^(h)_0 = ∑_{i=0}^n f(X^(h)_{ih}) h + ∑_{i=0}^n g(X^(h)_{ih}) (B_{(i+1)h} − B_{ih})     (6.8)

We now take the right hand side to be an approximation of the integrals

    ∑_{i=0}^n f(X^(h)_{ih}) h ≈ ∫_0^T f(X^(h)_t) dt ,
    ∑_{i=0}^n g(X^(h)_{ih}) (B_{(i+1)h} − B_{ih}) ≈ ∫_0^T g(X^(h)_t) dBt .     (6.9)

Notice, in the dBt-integral, that the contribution from a time interval [ih, (i + 1)h] is obtained as the product of the "measure" of the interval, B_{(i+1)h} − B_{ih}, and the integrand evaluated at the left endpoint, i.e. g(X^(h)_{ih}). This is the Itô integral, which we will develop shortly.

With these approximations, we may hope that the solution X^(h)_t of the difference equation (6.8) converges as the time step h tends to zero, in which case we will say that the limit Xt satisfies the integral equation

    XT = X0 + ∫_0^T f(Xt) dt + ∫_0^T g(Xt) dBt

which - at least formally - is the same as (6.4). To summarize, a useful way to treat differential equations driven by white noise is to rewrite them as integral equations (6.4) where one integral (6.5) is with respect to Brownian motion. Moreover, based on the Euler approximation, we see that a useful way to understand this integral is through its approximation, where the contribution from an interval is obtained by evaluating the integrand at the left endpoint of the interval.
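To make the discretization (6.6) concrete, here is a minimal Python sketch (not part of the original notes) for a scalar example with f(x) = −x and g(x) = 0.5; both choices, as well as the step size, are arbitrary.

```python
import numpy as np

def euler_path(f, g, x0=1.0, T=1.0, h=1e-3, seed=0):
    """Left-endpoint discretization (6.6):
    X_{(i+1)h} = X_{ih} + f(X_{ih}) h + g(X_{ih}) (B_{(i+1)h} - B_{ih})."""
    rng = np.random.default_rng(seed)
    n = int(T / h)
    X = np.empty(n + 1)
    X[0] = x0
    dB = rng.normal(0.0, np.sqrt(h), size=n)   # Brownian increments, variance h
    for i in range(n):
        X[i + 1] = X[i] + f(X[i]) * h + g(X[i]) * dB[i]
    return X

path = euler_path(lambda x: -x, lambda x: 0.5)
print(path[-1])
```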

6.3 Integrals: Riemann, Lebesgue, and Stieltjes

Let us recapitulate some standard integrals to see if we can base stochastic calculus on these integrals. Consider a (deterministic) function g : [0, t] → R, where we interpret the argument as time. Recall the familiar Riemann integral of g w.r.t. time,

    ∫_0^t g(s) ds

which can be approximated with the Riemann sum

    ∫_0^t g(s) ds ≈ ∑_{i=1}^{n(∆)} g(t*_i) · (ti − ti−1)

Here we have partitioned the time interval [0, t] into a partition ∆ = {ti : i = 0, . . . , n} where 0 = t0 < t1 < · · · < tn = t; n(∆) is the number of sub-intervals. We evaluate the integrand at some point t*_i ∈ [ti−1, ti] in each sub-interval. The function g is said to be Riemann integrable if the Riemann sum converges to the same limit as the mesh |∆| = max{ti − ti−1 : i = 1, . . . , n} converges to 0, irrespective of where we evaluate the integrand, t*_i. In particular, continuous functions are Riemann integrable over compact intervals. From chapter 3, recall that we may also integrate the function g with respect to a measure µ defined on the time axis, rather than with respect to time itself. I.e., µ measures intervals (and other subsets) of the time axis. In this case an approximating Riemann sum is

    ∫_0^t g(s) µ(ds) ≈ ∑_{i=1}^n g(t*_i) µ((ti−1, ti])

This generalizes the Riemann integral, which we recover when the measure µ is "length", i.e. µ((ti−1, ti]) = ∆ti = ti − ti−1. One way to define measures on the real line is through an increasing function F : [0, ∞) → R. This is the idea in the Stieltjes integral. We define the measure as the increment in F: µ((s, t]) = F(t) − F(s) whenever 0 ≤ s ≤ t. Often, F will be continuous, but discontinuities are useful for example in probability theory when considering random variables which have a mixture of continuous and discrete distributions. In those situations, we require that F is right-continuous with left limits. (This is called "càdlàg", from the French "continue à droite, limite à gauche". You should consider the end points s, t carefully and convince yourself that this measure treats point masses, i.e. discontinuities in F, correctly.) In either case we say that we integrate the function g with respect to F, and use the Stieltjes notation:

    ∫_0^t g(s) dF(s) := ∫_0^t g(s) µ(ds)

With F(t) = t we recover the Lebesgue measure of length and thus the standard Riemann integral. If F increases rapidly in a certain region, then this region will have a large measure and will be weighed up in the integral.

Example 6.3.1. Let F(t) = t² for t ≥ 0. Then F is differentiable with derivative f(t) = 2t, so the measure µ([t, t + h)) can be approximated as f(t) · h + o(h). With g(t) = exp(−t), we get

    ∫_0^∞ exp(−t) dF(t) = ∫_0^∞ exp(−t) 2t dt = · · · = 2 .

What makes this example simple is that the measure µ admits a density f with respect to length, i.e. µ((a, b]) = ∫_a^b f(t) dt. This allows us to re-write the µ-integral as a Riemann integral, since the integrand exp(−t) is continuous. An important situation is when F is the cumulative distribution function of a non-negative random variable X. In that case we have, for example,

    ∫_0^{+∞} x dF(x) = EX .

Exercise 6.61: Give a probabilistic interpretation to the integral

    ∫_0^∞ t² dF(t)  with  F(t) = 1 − exp(−λt),

and state its value.

In the Stieltjes integral, we can allow the integrator F to be non-monotonous, as long as it can be written F(t) = F⁺(t) − F⁻(t) where F⁺ and F⁻ are non-decreasing real-valued continuous (or, more generally, càdlàg) functions. In that case we define

    ∫_0^t g(s) dF(s) = ∫_0^t g(s) dF⁺(s) − ∫_0^t g(s) dF⁻(s)

Functions F which can be decomposed in this way are exactly those which have bounded variation (Royden, 1988, p. 103). An example of such an integral is:

    ∫_0^t sin s d(sin s) = ∫_0^t sin s cos s ds = (1/2) ∫_0^t sin 2s ds = (1/4)[1 − cos 2t]

One could perhaps hope that the Stieltjes integral would allow us to integrate with respect to Brownian motion, for each realization, and thus give meaning to (6.5). Alas, recall that almost no sample paths of Brownian motion have bounded variation (section 4.2.1), so the Stieltjes integral is not defined when the integrator F is a sample path of Brownian motion. This is the reason we need the construction of the It¯o integral. The next section illustrates why we cannot interpret integrals with respect to Brownian motion as Stieltjes integrals.

6.4 The problem: ∫_0^t Bs dBs

Let us turn to one of the difficulties we encounter when integrating with respect to Brownian motion. Consider integrating Brownian motion with respect to itself, i.e.

    ∫_0^t Bs dBs .

Motivated by the Riemann integral, we may try to evaluate this integral numerically by discretization. To this end, we partition the interval [0, t]:

    0 = t0 < t1 < t2 < · · · < tn = t

For each of the intervals [ti−1, ti], we could for example evaluate the integrand at the left endpoint ti−1, or at the right endpoint ti. Set I_t^L equal to the approximating integral using the left endpoint,

    I_t^L = ∑_{i=1}^n B_{ti−1} (B_{ti} − B_{ti−1})

and correspondingly I_t^R is the approximating integral using the right endpoint,

    I_t^R = ∑_{i=1}^n B_{ti} (B_{ti} − B_{ti−1})

If the Riemann integral were well defined, then these two approximations would converge to each other as the time discretisation becomes finer, and that limit would be the integral. Figure 6.2 shows the two approximations, as functions of t, and for two values of ∆t. Note that the difference between the two discretizations appears to grow with time, but does not seem very sensitive to the time step ∆t. In fact,

    I_t^R − I_t^L = ∑_{i=1}^n (B_{ti} − B_{ti−1})²

Exercise: Show this! Taking expectation, and using that increments in Brownian motion satisfy E(Bt − Bs)² = |t − s|, we get that E(I_t^R − I_t^L) = t. Moreover, as the discretization becomes finer, the difference between the two discretizations is exactly the quadratic variation of Brownian motion, I_t^R − I_t^L → [B]_t = t, a.s. We see that the Riemann integral fails because Brownian motion fluctuates violently, even on the small scale of the interval [ti−1, ti], so that the quadratic variation ∑_{i=1}^n (B_{ti} − B_{ti−1})² does not vanish in the limit of fine partitions, while the total variation ∑_{i=1}^n |B_{ti} − B_{ti−1}| diverges. Moreover, if we try to apply elementary calculus and solve the integral by substitution u = Bt, then we would get

    ∫_0^t Bs dBs = ∫_0^{Bt} u du = (1/2) u² |_0^{Bt} = (1/2) Bt²

Here we have used that B0 = 0. If you are uncomfortable with solving Riemann-Stieltjes integrals by substitution, then go ahead and pretend that Bt is differentiable with derivative Wt, rewrite the integral as ∫_0^t Bs Ws ds, and then do the substitution u = Bt, du = Wt dt. This analytical "result" for the integral is also included in figure 6.2. Note that it lies perfectly in between the two numerical approximations. In fact, this analytical expression corresponds to the Stratonovich integral (section 6.10), where the integrand is evaluated at the mid point.


Figure 6.2. Top panel: A realization of Brownian motion Bt for t ∈ [0, 1]. Bottom panel: Numerical approximation of ∫_0^t Bs dBs, evaluating the integrand at left end points (I_t^L) and right end points (I_t^R). Each is plotted with a fine grid (∆t = 2^{−13}, thick gray line) and a coarse grid (∆t = 2^{−9}, thin black line). Included is also Bt²/2 (solid).
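The experiment behind figure 6.2 is easy to repeat; the following Python sketch (not part of the original notes) computes the left- and right-endpoint approximations on one grid and confirms that their difference is close to t, with Bt²/2 lying between them. Grid size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 2**13, 1.0
dt = T / n
dB = rng.normal(0.0, np.sqrt(dt), size=n)    # Brownian increments
B = np.concatenate([[0.0], np.cumsum(dB)])   # B at the grid points

I_left = np.sum(B[:-1] * dB)     # integrand evaluated at left endpoints
I_right = np.sum(B[1:] * dB)     # integrand evaluated at right endpoints

print(I_right - I_left)                  # close to the quadratic variation [B]_T = T = 1
print(I_left, 0.5 * B[-1]**2, I_right)   # B_T^2/2 lies between the two approximations
```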


Figure 6.3. The Itô integral can be seen as an operator that maps an integrand {Gt : t ≥ 0} to an integral {It = ∫_0^t Gs dBs : t ≥ 0}, using the integrator {Bt : t ≥ 0}. The information available to this operator at time t is F_t; this must include the history of {Gt} and {Bt} and is then enough to determine It.

In summary: In contrast to the Riemann integral, it does matter where we evaluate the integrand. Moreover, when we evaluate the integrand in either of the end points, we cannot solve the integral by substitution using the classical chain rule of calculus. The approach of Itô is to choose the numerical approximation based on evaluating the integrand at the left end-points. We present this development in section 6.5, including the properties of the resulting Itô integral. With this choice, we need to develop a stochastic calculus which replaces standard calculus; this is the topic of chapter 7. A central result is the stochastic chain rule known as Itô's lemma (section 7.2).

6.5 The Itô integral

To recapitulate the previous sections, we have argued that dynamic systems driven by (white) noise should be described by integral equations, where one of the integrals is with respect to Brownian motion. We have shown that this integral cannot be understood in the usual Riemann sense. The solution to this problem is to construct a new integral, the Itô integral, which is a stochastic integral with respect to Brownian motion. Here is a preview of the final result:

Definition 6.5.1 (Itô integral for continuous L2 integrands). Let {Bt : t ≥ 0} be Brownian motion on a filtered probability space (Ω, F, {F_t : t ≥ 0}, P). Let {Gt : t ≥ 0} be a stochastic process which is adapted to the filtration {F_t}, has continuous sample paths, and has locally integrable variance, i.e. E ∫_0^t |Gs|² ds < ∞ for any t > 0. Then we define the Itô integral of {Gt} over [S, T] with respect to {Bt} as

    I = ∫_S^T Gt dBt = lim_{|∆|→0} ∑_{i=1}^n G_{ti−1} (B_{ti} − B_{ti−1})

where the limit is in the mean square. In most applications, our integrands are indeed continuous and have locally integrable variance, so this is the definition we need. However, from a mathematical point of view we need a slightly stronger definition, which is given in theorem 6.5.2 in the following.

First, we need to define which functions we will integrate, i.e. the admissible integrands:

Definition 6.5.2 (L2 Itô integrable process). Let (Ω, F, {F_t : t ≥ 0}, P) be a filtered probability space. We say that a real-valued stochastic process {Gt : t ≥ 0} is L2 Itô integrable, if it

1. is adapted to the filtration {F_t : t ≥ 0},
2. for each t ≥ 0, the function G : Ω × [0, t] → R is measurable with respect to the product algebra F × B([0, t]), and
3. has locally integrable variance, i.e. E ∫_0^t |Gs|² ds < ∞ for any t > 0.

The first condition, that of adaptation, should be natural now: The filtration {F_t} is the information available to the integrator, and clearly the integrator must know the function it is aiming to integrate. The second condition has to do with the technicalities of integration and rules out, for example, deterministic integrands which are not measurable functions of time. It ensures that integrals such as ∫_0^t Gs(ω) ds are measurable functions of the realization ω, i.e. proper random variables. The condition is satisfied if the integrand has right-continuous sample paths, or left-continuous sample paths, but can also be relaxed (Karatzas and Shreve, 1997), and the integrands that appear in applications generally meet this requirement. The third condition is a bound in L2 norm which allows us to use the machinery of convergence in L2 spaces when constructing the integral and deriving its properties. We have previously seen that restricting variables to L2 makes many things simpler, for example when discussing martingales. In most applications, this condition is natural. Nevertheless, it is slightly unsatisfying, so for this reason, and in order to have a more complete theory, we will relax it in the following (section 6.7).

Biography: Kiyosi Itô was born in 1915 and died in 2008 in Japan; he held positions at several Japanese universities and also at Princeton, Aarhus, and Cornell. He made fundamental contributions to the mathematical theory of stochastic processes, besides his seminal work in what we now call Itô calculus. Between 1938 and 1950 he developed the stochastic integral and the associated calculus, inspired by the work of Kolmogorov, Levy, and Doob. His approach to diffusion processes focused on sample paths and the stochastic differential equations they satisfy, as opposed to the works of Kolmogorov and Feller which focused on transition probabilities.

We first need a particularly simple class of integrands, namely the elementary processes, which have bounded and piecewise constant sample paths, where the points of discontinuity are deterministic.







1













3

4

−2

−1

0



−3

GT BT IT 0

1

2

5

T

RT Figure 6.4. The It¯ o integral S Gt dBt of an elementary function {Gt : t ≥ 0} w.r.t. Brownian motion {Bt : t ≥ 0}, as function of the upper limit T . Here, S = 0.75.


Definition 6.5.3 (Elementary process). A process {Gt : S ≤ t ≤ T} is said to be elementary, if:

1. There exists a bound K > 0, such that |Gt| ≤ K for all t ∈ [S, T].
2. There exists a deterministic sequence S = t0 < t1 < · · · < tn = T, such that the sample paths of Gt are constant on each interval [ti−1, ti). Thus, there exists a sequence of random variables {γi : i ∈ N} such that

       Gt = ∑_{i∈N} γi 1(t ∈ [ti−1, ti))

We introduce elementary processes because it is fairly unambiguous how to define the integral of such an elementary process with respect to Brownian motion. This is a useful first step. Later, we will approximate any L2 Itô integrable process with an elementary one, and use this to define the Itô integral for general (non-elementary) integrable processes.

Definition 6.5.4 (Itô integral of an elementary process). Assume that the process {Gt : t ≥ 0} is elementary (as in definition 6.5.3) and L2 Itô-integrable, and that {Bt : t ≥ 0} is Brownian motion (w.r.t. (Ω, F, {F_t}, P)). Then we define the Itô integral of {Gt} w.r.t. {Bt} as

    ∫_S^T Gt dBt = ∑_{i=1}^n γi (B_{S∨ti∧T} − B_{S∨ti−1∧T})

for 0 ≤ S ≤ T. Here, S ∨ t ∧ T = min(max(S, t), T) is the projection of t onto the interval [S, T]. Notice that for such an elementary process, we evaluate the integrand at the left end point when computing the contribution from a sub-interval [ti−1, ti).

We can now show some basic properties of the Itô integral of an elementary process:

Lemma 6.5.1 (Properties of the Itô integral). Let {Ft : t ≥ 0} and {Gt : t ≥ 0} be elementary Itô integrable processes. Then the following holds:

1. Additivity: ∫_S^U Gt dBt = ∫_S^T Gt dBt + ∫_T^U Gt dBt for 0 ≤ S ≤ T ≤ U.
2. Linearity: ∫_S^T (aFt + bGt) dBt = a ∫_S^T Ft dBt + b ∫_S^T Gt dBt when a, b ∈ R.
3. Continuity: The process {It : t ≥ 0}, where It = ∫_0^t Gs dBs, is continuous in the mean square and a.s.
4. Martingale property: {It : t ≥ 0} is a martingale (w.r.t. {F_t : t ≥ 0} and P).
5. The Itô isometry: E|∫_S^T Gt dBt|² = ∫_S^T E|Gt|² dt.

Z

T

Gt dBt |F S } = 0; in particular E

E{

Gt dBt = 0 S

S

We can now define the integral of any L2 It¯o integrable process, by approximating the integrand with elementary processes. The main reason this works is the It¯o isometry, which essentially says that the It¯ o integral is a continuous functional; i.e. given a convergent sequence of integrands, the integral of the limit equals the limit of the integrals. Theorem 6.5.2 (The It¯ o integral). Let {Gt : t ≥ 0} be L2 It¯ o integrable on the interval (n) [S, T ]. Then there exists a sequence {Gt : S ≤ t ≤ T, n ∈ N} of elementary It¯ o integrable processes which converges to {Gt } in L2 (Ω × [S, T ]). Let {I (n) : n ∈ N} be the It¯ o integrals

I

(n)

Z

T

(n)

=

Gt

S

dBt

then this sequence converges in the mean square, and the limit depends on {Gt } but not on (n) the particular sequence {Gt }. We define the It¯ o integral of {Gt } as this limit: Z

T

I= S

Gt dBt := lim I (n) (limit in the mean square) n→∞

Finally, this integral has the properties in lemma 6.5.1. Proof. We sketch the proof only; see (Øksendal, 2010) or (Karatzas and Shreve, 1997) for details. First, we show that any L2 It¯ o integrable process {Gt : S ≤ t ≤ T } can be approximated by a bounded L2 It¯ o integrable process: To this end, define {GB t : S ≤ t ≤ T } by GB t = Gt · 1(|Gt | < K) We must show that by choosing K suitably, we can make {GB t } arbitrarily close to {Gt }. To this end, we let G denote the stochastic process {Gt : S ≤ t ≤ T } seen as a function on Ω × [S, T ], and k · k denote the L2 -norm for such functions, i.e. s kGk =

Z

T

E S

Then 140

|Gt |2 dt

.

Z

B 2

T

kG − G k = E

|Gt |2 · 1(|Gt | ≥ K) dt

S

and clearly this converges to 0 as K → ∞, since {Gt } has locally integrable variance. Next, any bounded L2 It¯ o integrable process {GB t : S ≤ t ≤ T } can be approximated by a continuous bounded L2 It¯ o integrable process {GL t : S ≤ t ≤ T }: GL t

Z

t

=

GB s H(t

 − s) ds with H(∆t) =

S

λ, 0

if 0 ≤ ∆t ≤ λ1 , else .

B Notice that this {GL t } is obtained from {Gt } by smoothing out the sample paths using a running average, i.e. a box smoothing kernel. Since the kernel is causal, {GL t } is {F t }L adapted. It is easy to see (Exercise: ! ) that the sample paths of {Gt } are continuous; in fact Lipschitz continuous with Lipschitz constant 2Kλ. Moreover, the fundamental theorem of B calculus implies that all sample paths of {GL t } converge to those of {Gt }, pointwise for almost all t ∈ [S, T ] as λ → ∞. The bounded convergence theorem then implies that kGL −GB k → 0.

Finally, any (uniformly in ω) Lipschitz continuous L2 It¯o integrable process {GL t } can be approximated by an elementary It¯ o integrable process

GSt =

n X

GL ti−1 · 1(t ∈ [ti−1 , ti ))

i=1

where ∆ = {ti : i = 0, . . . , n} is a partition of [S, T ]. Notice that {GSt } is F t -adapted since we use the value at the left end-point, GL ti−1 , for each subinterval [ti−1 , ti ). Exercise: Show √ S L that kG − G k ≤ c |∆| T − S, where c is the Lipschitz constant. This shows that any L2 It¯ o integrable process can be approximated by an elementary one: Given a L2 It¯ o integrable process {Gt }, we must find an elementary process {GSt } such that S B kG − G k < . To this end, first find a bounded process {GB t } such that kG − G k < /3. L L B Then find a Lipschitz continuous process {Gt } such that kG −G k < /3, and finally find an elementary process {GSt } such that kGS − GL k < /3. Then, the triangle inequality ensures that kG − GS k < . Now, we have established that the It¯o integral restricted to elementary processes is a contin(n) uous operator. Thus, when a sequence of elementary processes {Gt : n ∈ N} converge, then also the integrals I (n) converge. Moreover, the limit is independent of the particular sequence RT of elementary processes. This defines the integral I = S Gt dBt . To show that the properties in lemma 6.5.1 also holds for general (non-elementary but L2 It¯o integrable) integrands, we simply notice that they hold for each element in a sequence of approximating elementary processes, and carry over in the limit. Remark 6.5.1. When the integrand {Gt : S ≤ t ≤ T } has continuous sample paths, we can approximate it with the process Γt given by 141

Γt = Gti−1 for ti−1 ≤ t < ti . Here, ∆ is a partition S = t0 < t1 < · · · < tn = T . Note that {Γt } will be adapted if {Gt } is. The approximation of the It¯ o integral is then Z

T

Γt dBt = S

n X

Gti−1 (Bti − Bti−1 )

i=1

so that

Z

n(∆)

T

I=

X

Gt dBt = lim

|∆|→0

S

Gti−1 (Bti − Bti−1 ) .

i=1

This is the discrete-time approximation of the It¯ o integral.

6.6

Examples of It¯ o integrals

Example 6.6.1. When studying the narrow-sense linear system in section 5.7, we found the solution

Xt = eAt X0 +

Z

t

eA(t−s) G dBs

0

We assume that the initial condition X0 is deterministic. Using the Martingale property of the It¯ o integral, we see that EXt = eAt X0 To find the variance of Xt , we use the It¯ o isometry:

Z

t A(t−s)



VXt = V e G dBs 0 Z t > = eA(t−s) GG> eA (t−s) ds 0

This agrees with what we found studying the Lyapunov equation (section 5.7 and exercise 5.58). A realization of the integral is seen in figure 6.5. Here, A = −1 and G = 1. Since Xt arises as a linear combination of Gaussian random variables, Xt is also Gaussian, so its distribution is determined by the mean and variance.

142


t

Figure 6.5. A realization of the process {Xt } from example 6.6.1 (solid). Included is also plus/minus the standard deviation of Xt (dashed).


Example 6.6.2. We have already considered the integral of Brownian motion with respect to itself (section 6.4). We can now recognize the It¯ o integral as the limit of the left-hand Riemann sums: t

Z

Bs dBs = lim

|∆|→0

0

n X

Bti−1 · (Bti − Bti−1 )

i=1

where, as always, ∆ is a partition of the interval in question, here (0 = t0 , t1 , . . . , tn = t). Manipulating the sum, we find n X

n

Bti−1

i=1

1 1X · (Bti − Bti−1 ) = Bt2 − (Bti − Bti−1 )2 2 2 i=1

Letting the mesh |∆| go to zero, and using the quadratic of the variation Brownian motion [B]t = t, we find t

Z 0

1 1 Bs dBs = Bt2 − t 2 2

From the properties of Gaussian variables (exercise 3.29), we find Z E 0

t

1 1 Bs dBs = t − t = 0, 2 2

which agrees with theorem 6.5.1. For the variance, we find: Z V 0

t

1 1 1 1 Bs dBs = E(Bt2 − t)2 = (EBt4 − 2tEBt2 + t2 ) = (3t2 − 2t2 + t2 ) = t2 4 4 4 2

which agrees with the It¯ o isometry: Z V

t

Z

2

Z

E|Bs | ds =

Bs dBs = 0

t

0

0

t

1 s ds = s2 2

A sample path of the integral is seen in figure 6.6.

6.7

Relaxing the L2 constraint

One of our requirements for a process {Gt : t ≥ 0} to be L2 Itô integrable is that E ∫_0^t |Gs|² ds < ∞, i.e. locally integrable variance. Since we have argued that Itô's construction is based on sample paths, this assumption is somewhat unsatisfying: When integrating


t

Figure 6.6. A realization of the process {It = ∫_0^t Bs dBs} from example 6.6.2 (solid). Included is also plus/minus the standard deviation of It (dashed).


one sample path, why should the rest of the ensemble matter? Indeed, the assumption about finite integrated variance can be relaxed: Definition 6.7.1 (It¯ o integrable process). We say that a real-valued process {Gt : t ≥ 0} is It¯ o integrable (w.r.t. (Ω, F, {F t }, P)) if 1. is adapted to the filtration {F t : t ≥ 0}, 2. for each t ≥ 0, the function G : Ω × [0, t] 7→ R is measurable with respect to the product algebra F × B([0, t]), and 3. has locally square integrable sample paths, i.e.   Z t 2 P ∀t ≥ 0 : |Gs | ds < ∞ = 1 0

It can now be shown that the It¯ o integral, as defined in theorem 6.5.2, can be extended to also cover It¯ o integrable processes, so that the name is chosen appropriately. We skip over the details; see (Øksendal, 2010; Karatzas and Shreve, 1997; Rogers and Williams, 1994b). The idea is to approximate the integrand with L2 integrable processes, and show that the integrals converge in probability. Example 6.7.1. A perhaps trivial example where we need the extension, is the integral t

Z

G dBs = G Bt 0

where G is a F 0 -measurable random variable (i.e., constant in time) with infinite variance. A more demanding example is the integral Z

t

1

2

Bs e 2 Bs dBs

0

While we cannot give a closed form expression for this integral, at least we now know that it is well defined for all t ≥ 0. This extended It¯ o integral is not necessarily a martingale, because the expectation may not exist, but it is a local martingale: Definition 6.7.2 (Local martingale). We say that a process {Mt : t ≥ 0} is a local martingale (w.r.t. (Ω, F, {F t }, P)) if there exists a sequence of Markov times {τi : i ∈ N} such that 1. the sequence {τi : i ∈ N} is increasing and divergent, i.e. 0 ≤ τ1 ≤ τ2 ≤ · · · , and τi → ∞ as i → ∞, (a.s.) and 2. for any i ∈ N, the stopped process {Mt∧τi } is a martingale. 146

The typical sequence of Markov times consists of exit times: τi = inf{t ≥ 0 : |Mt | > i} Thus, we localize the process by stopping it when it exits a sphere, and next let the sphere grow in radius. Recall that if {Mt } is a martingale, then also each stopped process is a martingale, so martingales are also local martingales. In most of the applications in this book, the integrands do in fact have locally integrable variance, so we rarely need this extension of the It¯o integral. However, it is required to have a complete theory, and also there are stochastic differential equations where the solutions grow so rapidly that they break any L2 bound. In those cases, we need this extended It¯o integral.

6.8

It¯ o processes and solutions to SDEs

Now, recall our aim of constructing a model for the path of a particle embedded in aR fluid and t subject to random perturbations from collisions with molecules. The It¯o integral 0 Gs dBs is a key component in this model. But since the It¯o integral is a martingale, it can only produce an unbiased random walk by itself. To include any bias in the path from bulk fluid flow - or, as we shall see, from spatial fluctuations in the diffusivity - we include a drift term {Ft : t ≥ 0}, and consider the resulting path: Z Xt = X0 +

t

Z Fs ds +

0

t

Gs dBs

(6.10)

0

where {Gs } is an It¯ o integrable process, which we will term the intensity. We use the shorthand dXt = Ft dt + Gt dBt for such a process; this shorthand does not refer to the initial position X0 , which is required to fully specify the process. If the time step h is small, then the conditional distribution of Xt+h given F t is approximately normal with mean and variance given by E{Xt |F t } = Xt + Ft · h + o(h),

V{Xt |F t } ≈ |Gt |2 · h + o(h)

provided that the drift {Ft } and the intensity {Gt } are continuous and bounded.2 Definition 6.8.1 (It¯ o process). We say that a process {Xt : t ≥ 0} of the form (6.10) is an It¯ o process (w.r.t. (Ω, F, {F t }, P)), provided that 1. the initial condition X0 is F 0 -measurable, 2

This holds under much weaker conditions that boundedness, but requires that |Fs | and |Gs | do not grow unbounded “too quickly” for s ≥ t.

147

2. {Ft } is {F t }-adapted, 3. {Ft } is locally integrable, almost surely, i.e. Z t |Fs | ds < ∞) = 1 P(∀t : 0

4. {Gt } is It¯ o integrable (in the sense of definition 6.7.1) Although It¯ o processes are not tied up to the application of particles moving in fluids, it is useful to think of an It¯ o process as the position of such a particle. In that case, {Ft } is the random “drift” term responsible for the mean change in position of {Xt }, while {Gt } is the random intensity of the unbiased random walk resulting from collisions with fluid molecules. 3

With this image in mind, we will expect the random drift Ft and intensity Gt acting on the particle to depend on the position Xt of the particle at the time. But seen as integrands, we only require that they are adapted, i.e. at time t, the observer who has access to the σ-algebra F t knows the current drift Ft and intensity Gt that affects the particle. Recall that {Xt } is adapted, so this observer will also know the current position of the particle. Our main motivation for defining It¯ o processes is that they serve as solutions for stochastic differential equations: Definition 6.8.2 (Solution of a stochastic differential equation). Let there be given a filtered probability space (Ω, F, {F t : t ≥ 0}, P) with a Brownian motion {Bt : t ≥ 0} and a stochastic process {Xt : t ≥ 0}. We say that {Xt } satisfies the (It¯ o) stochastic differential equation dXt = f (Xt , t) dt + g(Xt , t) dBt if {Xt } is an It¯ o process dXt = Ft dt + Gt dBt where {Ft : t ≥ 0} and {Gt : t ≥ 0} satisfy the conditions of definition 6.8.1, and Ft = f (Xt , t) and Gt = g(Xt , t). In that case we call {Xt } an It¯ odiffusion. Remark 6.8.1. The theory of stochastic differential equations operate with several definitions of solutions. The one given here is denoted a strong solution. In these notes, we will not make use of other types of solutions. Example 6.8.2 (Brownian motion with drift). The process {Xt : t ≥ 0} given by Xt = x0 + ut + σBt satisfies the It¯ o stochastic differential equation dXt = u dt + σ dBt . Here, the initial condition X0 = x0 is arbitrary. This process corresponds to the solution (2.10) of the advection-diffusion equation (2.9) with constant flow u and diffusivity D = σ 2 /2: The process describes the motion of a single particle, while the advection-diffusion equation describes the evolution of the density of particles. 3 It is tempting to equate Ft with the bulk fluid flow at the position of the particle, and relate Gt to the diffusivity at the particle, but this is only true when the diffusivity in constant in space. We shall return to this issue later, in section 9.4.

148

6.8.1

The Euler method

We now present a numerical method for simulating sample paths of the solution of a stochastic differential equation. The simplest method to do this is the Euler method, also known as the Euler-Maruyama method in the context of stochastic differential equations. For the It¯ o equation dXt = f (Xt , t) dt + g(Xt , t) dBt the method approximates the solution on a mesh 0 = t0 < t1 · · · < tn = t by ˆt − X ˆ t = f (X ˆ t , ti−1 ) · (ti − ti−1 ) + g(X ˆ t , ti−1 ) · (Bt − Bt ) X i i−1 i−1 i−1 i i−1 for i = 1, . . . , n. Notice that this is consistent with the way that we approximated It¯o integrals by discretization. We will discuss more sophisticated numerical methods later. Exercise 6.62: Consider the Euler discretization in the previous. Define the approximated integrands ˆ t ), Fˆt = f (ti , X i

ˆ t = g(ti , X ˆ t ) for ti ≤ t < ti+1 G i

and the interpolated solution ˆ t = Xt + Fˆt · (t − ti ) + G ˆ t · (Bt − Bt ) for ti ≤ t < ti+1 X i i Show that ˆ t = Fˆt dt + G ˆ t dBt dX

Exercise 6.63: Let {Xt } be a scalar It¯o process given by dXt = Ft dt + Gt dBt . Assume that {Xt } is stationary, i.e. all statistics of {Xt } are invariant under translation of time. Show that EFt = 0.

6.9

Integration w.r.t. semimartingales *

We introduced the notation dXt = Ft dt + Gt dBt for an It¯o process as a shorthand for 149

t

Z Xt − X0 =

Fs ds + Gs dBs 0

but the notation suggests that dXt , dt and dBt are objects belonging to the same class. For example, provided Gt > 0, it is tempting to re-write formally 1 Ft dXt − dt = dBt Gt Gt which would then be a shorthand for Z

t

0

1 dXs − Gs

t

Z 0

Fs ds = Bt − B0 Gs

This, however, requires that we can integrate with respect to {Xt : t ≥ 0}, i.e. that It¯ o integrals are defined not just for integrating with respect to Brownian motion, but also with respect to an It¯ o process {Xt }! This leads to the question: For which classes of integrators can we repeat the construction of the It¯ o integral and still have an operational theory? It turns out that as long as the integrator is a semimartingale, the construction of the It¯o integral remains valid. Definition 6.9.1 (Semimartingale). A process {Xt : t ≥ 0} is a semimartingale (w.r.t. (Ω, F, {F t }, P)) if it can be written as a sum of two processes Xt = Mt + At where {Mt : t ≥ 0} is a local martingale and {At : t ≥ 0} is a {F t }-adapted c` adl` ag process of locally bounded variation. Let {Ht : t ≥ 0} be a stochastic process. We define the It¯o integral of {Ht } with respect to {Xt } as the limit Z

t

Hs dXs = lim 0

|∆|→0

n X

Hti−1 · (Xti − Xti−1 ) (limit in probability)

i=1

whenever it exists. In that case we say that {Ht } is It¯o integrable with respect to {Xt }, or simply that {Ht } is {Xt }-integrable. We consider only the case where {Xt } has continuous sample paths. In that case, it turns out that a sufficient condition for integrability is that {Ht } has continuous sample paths, is {F t }-adapted, and Z

t

|Hs |2 d[X]s < ∞

0

150

for all t > 0 and almost all ω ∈ Ω. Remember that the quadratic variation of Brownian motion equals time, so this condition reduces to our previous notion of It¯o integrability when Xt = Bt . Semimartingales are a fairly large class of stochastic processes, so this is a quite general theory of stochastic integration. For our purpose, it is important that any It¯o process is a semimartingale: The It¯ o integral t

Z

Gs dBs 0

is a local martingale, and the dt-integral

Z

t

Fs ds 0

has bounded variation. In particular, time {t : t ≥ 0} is a semimartingale, as is Brownian motion {Bt : t ≥ 0}. We can now see that the equation

dXt = Ft dt + Gt dBt is not just a shorthand defining Xt in terms of an integral, but in fact an equation among semimartingales seen as integrators, which implies that for any process {Ht : t ≥ 0}, we have

Ht dXt = Ht Ft dt + Ht Gt dBt provided that {Ht } is {Xt }-integrable, {Ft Ht } is {t}-integrable, and {Ht Gt } is {Bt }-integrable. Exercise 6.64 Numerical integration w.r.t. an Ito process: Choose your favorite drift term {Ft } and noise intensity {Gt }. Then, construct numerically a sample path of the It¯o process {Xt } with dXt = Ft dt + Gt dBt . Then, reconstruct the driving Brownian motion numerically, i.e. solve dWt = (Gt )−1 (dXt − Ft dt). Does {Wt } equal {Bt }?
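A possible numerical starting point for this exercise is sketched below in Python (not part of the original notes); the chosen drift Ft = sin(t) and intensity Gt = 1 + Bt² are arbitrary examples, and the reconstruction simply inverts the Euler increments, so at this discretized level the recovered noise matches the input exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10000, 1.0
dt = T / n
t = np.linspace(0.0, T, n + 1)

dB = rng.normal(0.0, np.sqrt(dt), size=n)
B = np.concatenate([[0.0], np.cumsum(dB)])

F = np.sin(t)        # arbitrary drift F_t
G = 1.0 + B**2       # arbitrary positive intensity G_t

# Forward: Euler approximation of dX = F dt + G dB.
dX = F[:-1] * dt + G[:-1] * dB
# Backward: reconstruct the driving noise, dW = (dX - F dt) / G.
dW = (dX - F[:-1] * dt) / G[:-1]
W = np.concatenate([[0.0], np.cumsum(dW)])

print(np.max(np.abs(W - B)))   # zero up to round-off
```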

6.10

The Stratonovich integral

151

Biography: Ruslan Leontievich Stratonovich (1930-1997) was born and lived in Moscow. He studied physics and engineering, focusing on noise in physical systems. Besides the stochastic calculus which is centered around the Stratonovich integral, his most important contribution was a general technique for filtering in non-linear dynamic systems, which includes the Kalman-Bucy filter as a special case.

We have stressed that the definition of the It¯o integral involves Riemann sums where the integrand is evaluated at the left end-point of each sub-interval, and we saw in section 6.4 that this choice has an impact on the resulting integral. It should then be clear that we could define an entire family of stochastic integrals, parametrized by where we choose to evaluate the integrand. In this family, the most prominent member beside the It¯o integral is the Stratonovich integral, where we evaluate the integrand at the mid-point, or equivalently use the trapezoidal rule

Z

n(∆)

t

Gs ◦ dBs = lim 0

|∆|→0

X 1   Gti−1 + Gti Bti − Bti−1 (limit in probability) 2 i=1

where, as always, ∆ is a (deterministic) partition 0 = t0 < t1 < · · · < tn = t of the interval [0, t], n(∆) is the number of sub-intervals in the partition, and |∆| = max{tt − ti−1 : i = 1, . . . , n(∆)} is the mesh. Exercise 6.65 Numerical Stratonovich integration: Write a function, which takes as input a time partition t0 < t1 < · · · < tn , as well as the integrand {Gti : i = 0, . . . , n} and the Brownian motion {Bti : i = 0, . . . , n} sampled R tn at these time points, and which returns (an approximation to) the Stratonovich integral t0 Gt ◦ dBt sampled at the same time points. R1 Verify the function by computing the integral 0 Bt ◦ dBt and comparing it to the theoretical result 12 B12 , as in figure 6.2. How does the Stratonovich integral relate to the It¯o integral? The difference between the two integrals originate from the different treatment of fine-scale fluctuations in the two processes, the integrator {Bt } and the integrand {Gt }. To quantify this, we introduce: Definition 6.10.1 ((Quadratic) covariation between two processes). Let {Lt } and {Mt } be two It¯ o processes. We define the quadratic covariation [L, M ]t on the time interval [0, t] as the limit in probability [L, M ]t = lim

|∆|→0

n X (Lti − Lti−1 )(Mti − Mti−1 ) i=1

whenever it exists. Now, comparing the discrete-time approximations of the It¯o and the Stratonovich integral, we see that 152

Z

t

Z Gs ◦ dBs −

0

0

t

n

X 1 Gs dBs = lim (Gti − Gti−1 )(Bti − Bti−1 ) 2 |∆|→0 i=1

1 = [G, B]t 2

(6.11)

Exercise 6.66: Use this result to verify (once again) that Z

t

Z Bs ◦ dBs = 0

0

t

1 Bs dBs + t 2

Notice that in this case the covariation is deterministic. Then, consider a differentiable function h and show the more general result Z

t

Z h(Bs ) ◦ dBs =

0

0

t

1 h(Bs ) dBs + 2

Z

t

h0 (Bs ) ds

0

Note that, in general, the difference between the two integrals is a random variable. If the integrand is of bounded total variation, then the covariation vanishes, [G, B]t = 0, and the It¯o and Stratonovich integral are identical. Exercise: Show this! In the most general case, where the integrand is an adapted process but not necessarily of the form Gt = h(Bt ), it may not be possible to get more explicit result for the relationship between the two integrals than (6.11). If this sounds frustrating, recall that in general we do not have explicit results for either of the two integrals! Our focus will be mostly on the It¯ o integral, primarily because of its martingale property, which is connected to the simple Euler method for approximating solutions to stochastic differential equations. This property is key in the theoretical development. However, there are two reasons why the Stratonovich integral is a popular alternative to the It¯o integral in applications, besides the perhaps natural and symmetric choice of the mid-point: First, as we shall see in chapter 7, the stochastic calculus that results from the Stratonovich integral appears simpler and closer to ordinary (deterministic) calculus; this is a particular advantage when the application involves many coordinate transformations. Therefore, we will also make use of the Stratonovich integral in some examples to compare with deterministic results. Second, the Stratonovich integral emerges as the limit, when the integrator {Bt } is bandlimited noise but approximates white noise: Exercise 6.67 The Stratonovich integral and band-limited noise: Recall from (N ) exercise 4.49 the Wiener expansion of Brownian motion. As in that exercise, let Bt be the approximation of Bt based on frequencies −N, −N + 1, . . . , N . 1. Verify by stochastic simulation and numerical Stratonovich integration that Z 1 1 (N ) Bs(N ) dBs(N ) = (B1 )2 2 0 153

Note: These are Riemann integrals for each realization ω ∈ Ω, so it does not matter where we evaluate the integrand, at least in the fine-mesh limit. You may want to verify this, too. 2. Verify the same equation analytically.


CHAPTER 7

Stochastic calculus

The Itô integral is the key component in the theory of diffusion processes and stochastic differential equations. Nevertheless, in practical computations we very rarely compute Itô integrals explicitly. Instead, we almost invariably use Itô's lemma, a stochastic version of the chain rule, to deduce relationships between stochastic processes. Itô's lemma in its simplest form states that if {X_t : t ≥ 0} is a scalar Itô process given by dX_t = F_t dt + G_t dB_t, and {Y_t : t ≥ 0} is a new scalar stochastic process defined by Y_t = h(X_t), then {Y_t} is again an Itô process which satisfies
\[
dY_t = h'(X_t)\, dX_t + \frac{1}{2} h''(X_t)\, (dX_t)^2 = h'(X_t)\, [F_t\, dt + G_t\, dB_t] + \frac{1}{2} h''(X_t)\, G_t^2\, dt .
\]
Compared to standard calculus, the terms involving the second derivative h'' are unfamiliar. They are due to the fact that Itô processes in general, like Brownian motion, do not have bounded variation. In this chapter we state Itô's lemma in several different forms, and give examples of applications of the result. The main applications are:

• To solve stochastic differential equations analytically, in the few situations where this is possible.
• To change coordinates. For example, the equation may describe the motion of a particle in Cartesian or polar coordinates.
• To find the dynamics of quantities that are derived from the state, such as the energy of a particle.

Itô's lemma tells us how to transform the dependent variable. We also give formulas for how to transform the independent variable, i.e. change the units of time.

7.1 The chain rule of deterministic calculus

Let us first recapitulate the well-known chain rule from ordinary (deterministic) calculus. Let {X_t : t ≥ 0} be a C¹ real-valued function with derivative dX_t/dt = X_t', and let h : R × R → R be a C¹ function. Define Y_t = h(t, X_t); then, according to the chain rule, {Y_t : t ≥ 0} is C¹ with derivative
\[
Y_t' = \frac{dY_t}{dt} = \frac{\partial h}{\partial t}(t, X_t) + \frac{\partial h}{\partial x}(t, X_t)\, X_t' .
\]
Formally, we may multiply this with dt and obtain
\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x}\, dX_t ,
\]
which is an equation involving differentials, and which we can write in integral form:
\[
Y_t = Y_0 + \int_0^t \frac{\partial h}{\partial t}(s, X_s)\, ds + \int_0^t \frac{\partial h}{\partial x}(s, X_s)\, dX_s
    = Y_0 + \int_0^t \left[ \frac{\partial h}{\partial t}(s, X_s) + \frac{\partial h}{\partial x}(s, X_s)\, X_s' \right] ds .
\]
More generally, we can use the chain rule to transform integrals by substitution:
\[
\int_0^t H_s\, dY_s = \int_0^t H_s\, \frac{\partial h}{\partial t}(s, X_s)\, ds + \int_0^t H_s\, \frac{\partial h}{\partial x}(s, X_s)\, dX_s
    = \int_0^t H_s \left[ \frac{\partial h}{\partial t}(s, X_s) + \frac{\partial h}{\partial x}(s, X_s)\, X_s' \right] ds ,
\]
where {H_s : s ≥ 0} is an arbitrary integrable test function. Again, this is under the assumption that {X_t} is C¹.

7.2 Itô's lemma: The stochastic chain rule

The deterministic chain rule applied in the previous section because we assumed that {X_t} is smooth. However, in the stochastic case, if {X_t} were a sample path of Brownian motion, then {X_t}

would not be of bounded variation (a.s.), and we cannot assume that the deterministic chain rule holds. Indeed, with the example
\[
h(t, x) = \frac{1}{2} x^2 , \qquad X_t = B_t ,
\]
we found
\[
Y_t = h(t, X_t) = \int_0^t X_s\, dX_s + \frac{1}{2} t = \int_0^t X_s\, dX_s + \int_0^t \frac{1}{2}\, ds ,
\]
and the second term does not originate from standard calculus. This extra term originates from the quadratic variation of the process {X_t}. It turns out that for processes with non-vanishing quadratic variation, the chain rule includes double derivatives and quadratic terms. To describe this, we adopt the notational convention
\[
dL_t\, dM_t = d[L, M]_t
\]
where {L_t} and {M_t} are two Itô processes and [L, M]_t is their covariation. With this convention, we get
\[
dt^2 = 0, \qquad dt\, dB_t = 0, \qquad dB_t\, dB_t^\top = I_d\, dt \tag{7.1}
\]

when {B_t} is d-dimensional Brownian motion and I_d ∈ R^{d×d} is the identity matrix. We can now state the chain rule of stochastic calculus:

Theorem 7.2.1 (Itô's lemma). Let {X_t : t ≥ 0} be an Itô process in R^n on (Ω, F, {F_t}, P), i.e. dX_t = F_t dt + G_t dB_t, where {B_t : t ≥ 0} is d-dimensional Brownian motion. Let h : R^n × R → R be differentiable w.r.t. time t and twice differentiable w.r.t. x, with continuous derivatives. Define Y_t = h(X_t, t); then
\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x}\, dX_t + \frac{1}{2}\, dX_t^\top \frac{\partial^2 h}{\partial x^2}\, dX_t \tag{7.2}
\]
\[
\;\; = \left( \frac{\partial h}{\partial t} + \frac{\partial h}{\partial x} F_t + \frac{1}{2} \operatorname{tr}\!\left[ G_t^\top \frac{\partial^2 h}{\partial x^2} G_t \right] \right) dt + \frac{\partial h}{\partial x} G_t\, dB_t \tag{7.3}
\]
Here, recall that the trace of a quadratic n-by-n matrix A is the sum of the diagonal elements, A_{11} + · · · + A_{nn}.

Proof. We outline the structure of the proof only; see e.g. (Øksendal, 2010). Assume for simplicity that the function h does not depend on time t, and that {X_t} is scalar. Let ∆ = (t_0, t_1, . . . , t_n) be a partition of [0, t] and write
\[
Y_t = Y_0 + \sum_{i=0}^{n-1} \Delta Y_i \quad \text{with } \Delta Y_i = Y_{t_{i+1}} - Y_{t_i} .
\]
Now, to evaluate the increment ∆Y_i, Taylor expand the function h around X_{t_i} to get
\[
\Delta Y_i = h'(X_{t_i})\, \Delta X_i + \frac{1}{2} h''(X_{t_i})\, (\Delta X_i)^2 + R_i
\]
where ∆X_i = X_{t_{i+1}} − X_{t_i} and R_i is the remainder term. The proof now proceeds by first showing that as the partition becomes finer, the sum of the residual terms R_i vanishes. Next, we show that as the partition becomes finer, the sum of the terms \frac{1}{2} h''(X_{t_i}) (\Delta X_i)^2 converges to the integral
\[
\int_0^t \frac{1}{2} h''(X_s)\, d[X]_s
\]
where [X]_t is the quadratic variation process of {X_t}. In greater generality, a sum of terms H_{t_i} (\Delta L_i)(\Delta M_i) converges to the integral
\[
\int_0^t H_s\, d[L, M]_s .
\]
This justifies our notation dL_t dM_t = d[L, M]_t and allows us to generalize the proof to the case where {X_t} is a multivariate Itô process. Finally, we allow h to depend on time. To this end, we consider the joint process {Y_t} with Y_t = (t, X_t); recall that the process {t} is an Itô process, even if a rather trivial one, so that {Y_t} is also an Itô process.

Exercise 7.68: Show the equality between equations (7.2) and (7.3), using dL_t dM_t = d[L, M]_t.

Remark 7.2.1. Notice that the (conditional) mean increment E{∆Y | F_t} is given by three terms: the t-derivative of h, the x-gradient of h in combination with the drift F_t, and an "extra" term involving the x-curvature of h in combination with the noise intensity G_t of the process {X_t}. This extra term is necessary because Brownian motion scales with the square root of time. In turn, the conditional variance V{∆Y | F_t} originates from the x-derivative of h in combination with the (conditionally) random term in ∆X, i.e. G_t ∆B.


Exercise 7.69 Numerical verification of Itô's lemma: Apply Itô's lemma to write Y_t = B_t^3 as an Itô process. Next, simulate a sample path of Brownian motion on [0, 1]. Plot {Y_t : 0 ≤ t ≤ 1}, computed both directly and by solving the integrals in the Itô process formulation numerically. Repeat with different time steps, holding the sample path of Brownian motion constant.

We have stated Itô's lemma for a scalar-valued function h. When the function h maps the state X_t to a vector Y_t ∈ R^m, i.e. h : R^n × R → R^m, we may apply Itô's lemma to each coordinate Y_t^{(i)} of Y_t at a time, i.e. we apply the previous formula to Y_t^{(i)} = h_i(X_t, t) for i = 1, . . . , m.

Example 7.2.2. Let {X_t : t ≥ 0} be a vector-valued Itô process given by dX_t = F_t dt + G_t dB_t and let Y_t = T X_t where T is a matrix. Then dY_t = T F_t dt + T G_t dB_t. Being even more specific, let {X_t} satisfy the linear SDE dX_t = A X_t dt + G dB_t and assume that T is square and invertible. Then {Y_t} satisfies the linear SDE dY_t = T A T^{-1} Y_t dt + T G dB_t.
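A small simulation sketch in the spirit of Exercise 7.69 (one possible way to carry out the exercise, not the only one): by Itô's lemma, Y_t = B_t^3 satisfies dY_t = 3 B_t^2 dB_t + 3 B_t dt, and the sketch compares the direct computation with a left-point (Euler) discretization of the integrals.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10000
t = np.linspace(0.0, 1.0, n + 1)
dt = np.diff(t)
dB = rng.normal(0.0, np.sqrt(dt))
B = np.concatenate(([0.0], np.cumsum(dB)))

# Direct computation of Y_t = B_t^3
Y_direct = B**3

# Ito-process form from Ito's lemma: dY = 3 B^2 dB + 3 B dt
Y_ito = np.concatenate(([0.0], np.cumsum(3.0 * B[:-1]**2 * dB + 3.0 * B[:-1] * dt)))

print(np.max(np.abs(Y_direct - Y_ito)))   # small, and decreases as n grows
```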

7.3 Analytical solutions of SDEs and stochastic integrals

Itô's lemma applies to transformations of Itô processes, i.e. the drift {F_t} and the noise intensity {G_t} can be arbitrary stochastic processes, as long as they satisfy the technical requirements. However, in most of the applications we are interested in, the Itô process is a solution to a stochastic differential equation. For these situations, an important application of Itô's lemma is to solve Itô stochastic differential equations analytically. This is possible only for a small class of equations and integrals, but these special cases play a prominent role due to their tractability.

Exercise 7.70 Geometric Brownian motion: Show that X_t = x exp((r − ½σ²)t + σB_t) satisfies the so-called wide-sense linear stochastic differential equation
\[
dX_t = r X_t\, dt + \sigma X_t\, dB_t , \qquad X_0 = x .
\]
Find the mean and variance of X_t, using the properties of the log-normal distribution.

Exercise 7.71: Show that X_t = tanh B_t satisfies
\[
dX_t = -X_t (1 - X_t^2)\, dt + (1 - X_t^2)\, dB_t , \qquad X_0 = 0 .
\]

Exercise 7.72: Show that X_t = (x^{1/3} + \frac{1}{3} B_t)^3 satisfies
\[
dX_t = \frac{1}{3} X_t^{1/3}\, dt + X_t^{2/3}\, dB_t .
\]

The log-normal distribution. A random variable Y is said to be log-normal (or log-Gaussian) distributed with location parameter µ and scale parameter σ > 0, i.e. Y ∼ LN(µ, σ²), if X = log Y is Gaussian with mean µ and variance σ².

Property          Expression
Mean EY           exp(µ + ½σ²)
Variance VY       (exp(σ²) − 1) exp(2µ + σ²)
C.d.f. F_Y(y)     Φ(σ⁻¹(log(y) − µ))
P.d.f. f_Y(y)     σ⁻¹ y⁻¹ φ(σ⁻¹(log(y) − µ))
Median            exp(µ)
Mode              exp(µ − σ²)

We can now return to the narrow-sense linear stochastic differential equation which we considered in chapter 5:
\[
dX_t = -\lambda X_t\, dt + \sigma\, dB_t \tag{7.4}
\]
with the initial condition X_0 = x. In that chapter we postulated that the solution could be found with methods from ordinary calculus, and was
\[
X_t = x e^{-\lambda t} + \int_0^t \sigma e^{-\lambda(t-s)}\, dB_s . \tag{7.5}
\]
The reason that the correct solution can be found with methods from ordinary calculus is that the integrand in the Itô integral is a smooth function of time; therefore we expect that no "correction terms" are needed (and, in addition, it does not matter whether we take the Itô or the Stratonovich interpretation of the integral). We now verify this solution. We cannot use Itô's lemma directly, because the integrand in the Itô integral depends on the upper limit t. To circumvent this difficulty, we introduce a transformed version of X_t, namely the process Y_t given by
\[
Y_t = h(t, X_t) = e^{\lambda t} X_t \quad \text{where } h(t, x) = e^{\lambda t} x .
\]
Using Itô's lemma, we find that if X_t satisfies the original SDE (7.4), then this transformed process must satisfy

\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x}\, dX_t + \frac{1}{2} \frac{\partial^2 h}{\partial x^2}\, (dX_t)^2
     = \lambda e^{\lambda t} X_t\, dt + e^{\lambda t} (-\lambda X_t\, dt + \sigma\, dB_t)
     = e^{\lambda t} \sigma\, dB_t .
\]
This is a stochastic differential equation for which we can easily write up the solution. Since Y_0 = h(0, X_0) = x, we find
\[
Y_t = x + \int_0^t e^{\lambda s} \sigma\, dB_s .
\]
We can now back-transform:
\[
X_t = e^{-\lambda t} Y_t = x e^{-\lambda t} + \int_0^t e^{-\lambda(t-s)} \sigma\, dB_s ,
\]

which was the postulated solution.

The solution (7.5) is called the Ornstein-Uhlenbeck process. Uhlenbeck and Ornstein introduced it in 1930 as a model for the velocity of a molecule under diffusion; compared to Brownian motion it has the advantage that it predicts finite velocities! It is a model that one encounters frequently in applications, whether in physics, engineering, biology or finance.

Exercise 7.73: Consider a vector process {X_t ∈ R^n : t ≥ 0} which satisfies the narrow-sense linear SDE dX_t = (A X_t + w_t) dt + G dB_t with initial condition X_0 = x, where the external input w_t is F_t-adapted and of bounded variation. Show that X_t can be written as
\[
X_t = e^{A t} x + \int_0^t e^{A(t-s)} \left( w_s\, ds + G\, dB_s \right) .
\]
Hint: Follow the reasoning for the Ornstein-Uhlenbeck process in the previous, i.e. start by defining Y_t = h(t, X_t) with h(t, x) = exp(−At)x.

Exercise 7.74: Consider the scalar time-varying equation dX_t = λ_t X_t dt + σ_t dB_t with initial condition X_0 = x, where {λ_t : t ≥ 0} and {σ_t : t ≥ 0} are deterministic functions of bounded variation. Show that the solution is

\[
X_t = e^{F_t} x + \int_0^t e^{F_t - F_s} \sigma_s\, dB_s \quad \text{where } F_t = \int_0^t \lambda_s\, ds .
\]

In summary, for linear stochastic differential equations, there is a unique solution which has a closed-form expression. As we shall see in the following, the situation is not quite so desirable for general non-linear stochastic differential equations.
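As an illustration of the closed-form solution (7.5), here is a small simulation sketch (an assumed illustration using an Euler-Maruyama discretization; not part of the original text) which compares the exact expression for the Ornstein-Uhlenbeck process with a direct numerical solution of (7.4) driven by the same Brownian path.

```python
import numpy as np

rng = np.random.default_rng(3)
lam, sigma, x0 = 1.0, 0.5, 2.0
n = 20000
t = np.linspace(0.0, 5.0, n + 1)
dt = np.diff(t)
dB = rng.normal(0.0, np.sqrt(dt))

# Euler-Maruyama for dX = -lam*X dt + sigma dB
X = np.empty(n + 1)
X[0] = x0
for i in range(n):
    X[i + 1] = X[i] - lam * X[i] * dt[i] + sigma * dB[i]

# Closed form (7.5) at the terminal time, with a left-point discretization of the integral
T = t[-1]
X_exact_T = x0 * np.exp(-lam * T) + np.sum(sigma * np.exp(-lam * (T - t[:-1])) * dB)

print(X[-1], X_exact_T)   # close; the difference vanishes as the step size shrinks
```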

7.4 Dynamics of derived quantities

A second common application of Itô's lemma is the situation where a dynamic system evolves according to a stochastic differential equation, and we are interested in some prescribed quantity defined on state space. As an example, consider the two coupled stochastic differential equations
\[
dX_t = V_t\, dt, \qquad dV_t = -u'(X_t)\, dt - \mu V_t\, dt + \sigma\, dB_t
\]
where {X_t} represents the position of a physical system in configuration space, while {V_t} is the velocity. Here, u(·) is a potential defined on configuration space, so that −u'(x) is the (mass specific) force acting on the system when in configuration x. This system generalizes the mass-spring-damper system we studied in chapter 5 and section 6.1 (subject to a white noise force), where the potential corresponding to a linear spring is u(x) = kx²/(2m). Define the (mass specific) potential and kinetic energies in the system at time t:
\[
U_t = u(X_t), \qquad T_t = \frac{1}{2} V_t^2 .
\]
Then with Itô's lemma, these can be written as Itô processes
\[
dU_t = u'(X_t)\, V_t\, dt ,
\]
\[
dT_t = V_t\, dV_t + \frac{1}{2} (dV_t)^2 = \left( -V_t\, u'(X_t) - \mu V_t^2 + \frac{1}{2}\sigma^2 \right) dt + \sigma V_t\, dB_t .
\]
Finally, define the total energy in the system:
\[
E_t = U_t + T_t = u(X_t) + \frac{1}{2} V_t^2 .
\]
Then the total energy can be written as an Itô process:
\[
dE_t = dU_t + dT_t = \left( \frac{1}{2}\sigma^2 - \mu V_t^2 \right) dt + \sigma V_t\, dB_t .
\]
You should verify that these expressions follow from Itô's lemma. We see that the noise gives rise to stochastic fluctuations in the energy through the term σV_t dB_t: When the random force, corresponding to σ dB_t, is in the same direction as the velocity V_t, the kinetic energy and the total energy of the system increase, and conversely when this force is directed against the velocity. But perhaps more counter-intuitively, the presence of noise gives rise to an expected increase in total energy through the term ½σ² dt.

We now aim to examine possible stationary solutions to these equations. Stationarity here means that the statistics are invariant to time translations; note that this does not mean that sample paths are constant in time! At this point we have no way of ensuring that stationary solutions exist or are relevant in any sense, but with an appeal to physics we would expect this to be the case, so we simply assume it. Recall from exercise 6.63 that for an Itô process to be stationary, the drift term, i.e. the dt-integrand, must have expectation 0. Applying this to the total energy E_t, we see that stationarity requires
\[
E\left( \frac{1}{2}\sigma^2 - \mu V_t^2 \right) = 0 \quad \Leftrightarrow \quad E\left\{ \frac{1}{2} V_t^2 \right\} = \frac{\sigma^2}{4\mu} .
\]
If a stationary solution exists, it must therefore have an expected kinetic energy of σ²/(4µ), which expresses a balance between the energy supplied by the noise and the energy dissipated by the viscous damping µ. We shall return to this later, where we will also derive a corresponding expression for the potential energy in stationarity.

Exercise 7.75: Continuing exercise 6.63, let {X_t} be a scalar Itô process given by dX_t = F_t dt + G_t dB_t. Assume that {X_t} is stationary, i.e. all statistics of {X_t} are invariant under translation of time. Show that E(2 X_t F_t + G_t^2) = 0. Hint: Apply the result of exercise 6.63 to Y_t = X_t^2.
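A small simulation sketch of the energy balance (an assumed illustration, using a quadratic potential u(x) = ½kx² so that u'(x) = kx; the parameter values are arbitrary): simulating the position-velocity system with Euler-Maruyama and checking that the time-averaged kinetic energy approaches σ²/(4µ).

```python
import numpy as np

rng = np.random.default_rng(4)
k, mu, sigma = 1.0, 0.5, 0.8
dt, n = 2e-3, 500_000
X, V = 0.0, 0.0
kinetic = np.empty(n)

for i in range(n):
    dB = rng.normal(0.0, np.sqrt(dt))
    X, V = X + V * dt, V + (-k * X - mu * V) * dt + sigma * dB
    kinetic[i] = 0.5 * V**2

# Time average of the kinetic energy vs. the theoretical stationary value
print(kinetic[n // 2:].mean(), sigma**2 / (4 * mu))
```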

7.5 Coordinate transformations

A final frequent application of Itô's lemma is to change coordinates in the underlying state space. Consider, for example, Brownian motion on the circle, which is a process taking values on the unit circle in the plane. In polar coordinates (r, θ), the process is easy to describe:

The radius r is constant and equal to 1, while the angle θ is Brownian motion, Θ_t = B_t. Transforming to Cartesian coordinates, we find:
\[
X_t = \cos \Theta_t = \cos B_t , \qquad Y_t = \sin \Theta_t = \sin B_t .
\]
Itô's lemma yields that (X_t, Y_t) = (cos B_t, sin B_t) satisfies
\[
\begin{pmatrix} dX_t \\ dY_t \end{pmatrix}
=
\begin{pmatrix} -\sin B_t\, dB_t - \frac{1}{2} \cos B_t\, dt \\ \cos B_t\, dB_t - \frac{1}{2} \sin B_t\, dt \end{pmatrix}
\]
and substituting cos B_t = X_t, sin B_t = Y_t, we can rewrite this as
\[
\begin{pmatrix} dX_t \\ dY_t \end{pmatrix}
=
\begin{pmatrix} -Y_t \\ X_t \end{pmatrix} dB_t
- \frac{1}{2} \begin{pmatrix} X_t \\ Y_t \end{pmatrix} dt .
\]
This is an Itô stochastic differential equation governing the process {(X_t, Y_t) : t ≥ 0}, i.e. the motion in Cartesian coordinates. It is useful to think about this geometrically: The dB_t-term is orthogonal to the position (X_t, Y_t), as we would expect of motion along a tangent. However, since the dB_t-term is of order √dt, it acts to project the particle along the tangent, which would increase the radius. To balance this, the dt-term contracts the particle towards the origin. Viewing this as motion in the plane, the diffusivity is anisotropic: There is no diffusivity in the radial direction, since the radius stays constant, but a constant diffusivity in the tangential direction. Hence the diffusivity is a matrix (or a tensor), and not just a scalar. Moreover, since the direction of the tangent changes with position, so does the diffusivity matrix; i.e. the diffusivity is inhomogeneous.
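A simulation sketch (an assumed illustration, not part of the original text) comparing the exact representation (cos B_t, sin B_t) with an Euler-Maruyama discretization of the Cartesian SDE above; note that the Euler scheme only keeps the radius close to 1, not exactly at 1, with an error that shrinks with the step size.

```python
import numpy as np

rng = np.random.default_rng(5)
n, dt = 100000, 1e-4
dB = rng.normal(0.0, np.sqrt(dt), n)
B = np.concatenate(([0.0], np.cumsum(dB)))

# Exact process on the circle at the terminal time
X_exact, Y_exact = np.cos(B[-1]), np.sin(B[-1])

# Euler-Maruyama on the Cartesian SDE
X, Y = 1.0, 0.0
for i in range(n):
    X, Y = (X - Y * dB[i] - 0.5 * X * dt,
            Y + X * dB[i] - 0.5 * Y * dt)

print(X, Y, X_exact, Y_exact)   # close for small dt
print(np.hypot(X, Y))           # radius stays near 1
```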

7.5.1 The Lamperti transform

While it is useful to be able to change between different coordinate systems, and one should ideally be able to obtain the same results regardless of the choice of coordinates, this raises the question which coordinate system is most convenient for a given purpose. In the previous example, an argument in favor of polar coordinates is that the diffusivity is constant in this coordinate system, i.e. the noise is additive. As we shall see later, statistics and numerics are simpler in this additive case. We can therefore ask whether it is possible, for a given stochastic differential equation, to change coordinates so that in the transformed system the noise is additive. If so, we say that the transformation is the Lamperti transform. Consider the scalar stochastic differential equation
\[
dX_t = f(X_t)\, dt + g(X_t)\, dB_t
\]

and assume that g(x) > 0 for all x. Then define the Lamperti transformed coordinate {Y_t} by Y_t = h(X_t) where
\[
h(x) = \int^x \frac{1}{g(v)}\, dv \tag{7.6}
\]
and thus
\[
h'(x) = \frac{1}{g(x)}, \qquad h''(x) = -\frac{g'(x)}{g^2(x)} .
\]
Then Itô's lemma yields

\[
dY_t = h'(X_t)\, dX_t + \frac{1}{2} h''(X_t)\, g^2(X_t)\, dt
     = \frac{f(X_t)}{g(X_t)}\, dt + dB_t - \frac{1}{2} g'(X_t)\, dt
     = \left( \frac{f(h^{-1}(Y_t))}{g(h^{-1}(Y_t))} - \frac{1}{2} g'(h^{-1}(Y_t)) \right) dt + dB_t .
\]
Note that in the Lamperti transformed coordinate Y_t, the noise is additive, as required. Lamperti transforms are possible also in many multivariate situations, but not always.

Exercise 7.76 Lamperti transforming geometric Brownian motion: Consider again the SDE
\[
dX_t = r X_t\, dt + \sigma X_t\, dB_t .
\]
Identify the Lamperti transform and write the SDE that governs the transformed coordinate.

7.5.2 The scale function

For scalar processes, it is also possible to transform the coordinates so that the transformed process is driftless, i.e. a martingale. Specifically, consider again the general scalar Itô equation dX_t = f(X_t) dt + g(X_t) dB_t, where we assume, in addition to Lipschitz continuity, that g is bounded away from 0, i.e. there exists an ε > 0 such that g(x) > ε for all x. Now, introduce the transformed coordinate Y_t = s(X_t), where we have yet to determine the transform s. Then {Y_t} is an Itô process with
\[
dY_t = \left( f(X_t)\, s'(X_t) + \frac{1}{2} g^2(X_t)\, s''(X_t) \right) dt + s'(X_t)\, g(X_t)\, dB_t .
\]
We now aim to choose the transform s : R → R so that the drift f s' + ½g²s'' vanishes. Introduce φ = s'; then the equation in φ is
\[
f \varphi + \frac{1}{2} g^2 \varphi' = 0
\]
with the solution
\[
\varphi(x) = \exp\left( - \int^x \frac{2 f(y)}{g^2(y)}\, dy \right)
\]
where, again, the absence of a lower limit in the integral indicates that this may be chosen arbitrarily, i.e. log φ is any antiderivative of −2f/g². From φ we may then find the transformation s through
\[
s(x) = \int^x \varphi(y)\, dy .
\]
We saw that there were two arbitrary constants involved in s, one for each integral. These two correspond to every affine transformation Z_t = a Y_t + b also being driftless. The resulting s is known as the scale function. We will return to this function later; it appears, for example, in the analysis of boundary points. Transforming the state so that the drift vanishes, i.e. so that the transformed process is a martingale, is in some sense complementary to the Lamperti transform: The former simplifies the drift term, while the latter simplifies the noise intensity term.

Exercise 7.77: Consider Brownian motion with drift, i.e. dX_t = µ dt + σ dB_t. Determine the scale function s and the governing equation for the transformed coordinate Y_t, and explain in words how {Y_t} can be driftless when {X_t} has constant drift.
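Both transforms can be computed symbolically for concrete models. The following sketch (an assumed illustration using sympy; the two example models, square-root noise g(x) = γ√x and the Ornstein-Uhlenbeck drift f(x) = −λx with constant noise σ, are chosen only for illustration) evaluates the Lamperti transform (7.6) and the function φ = s' from above.

```python
import sympy as sp

x, v, y = sp.symbols("x v y", positive=True)
lam, sig, gam = sp.symbols("lambda sigma gamma", positive=True)

# Lamperti transform h(x) = int^x dv / g(v), here for g(x) = gamma*sqrt(x)
g = gam * sp.sqrt(v)
h = sp.integrate(1 / g, (v, 0, x))      # one antiderivative; the lower limit is arbitrary
print(sp.simplify(h))                   # 2*sqrt(x)/gamma

# phi = s' for the Ornstein-Uhlenbeck model f(x) = -lambda*x, g(x) = sigma
f = -lam * y
phi = sp.exp(-sp.integrate(2 * f / sig**2, (y, 0, x)))
print(sp.simplify(phi))                 # exp(lambda*x**2/sigma**2)
```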

7.6 The chain rule in Stratonovich calculus

We now establish the corresponding chain rule for Stratonovich integrals. For simplicity, we consider scalar processes. So let {X_t : t ≥ 0} be a semimartingale given in terms of a Stratonovich integral:
\[
dX_t = F_t\, dt + G_t \circ dB_t
\]
and, as before, consider the image of this process under a smooth map h:
\[
Y_t = h(t, X_t) .
\]
Aiming to write {Y_t} in terms of a Stratonovich integral, we make a detour over the Itô calculus. We first rewrite {X_t} as an Itô process, as in (6.11):
\[
dX_t = F_t\, dt + G_t\, dB_t + \frac{1}{2} d[G, B]_t .
\]
Now Itô's lemma yields:

\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x}\, dX_t + \frac{1}{2} \frac{\partial^2 h}{\partial x^2}\, (dX_t)^2 .
\]
Here, the last term on the right is an ordinary dt-integral, while the middle term is an Itô integral. We rewrite this integral as a Stratonovich integral:
\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x} \circ dX_t - \frac{1}{2}\, d\!\left[ \frac{\partial h}{\partial x}, X \right]_t + \frac{1}{2} \frac{\partial^2 h}{\partial x^2}\, G_t^2\, dt .
\]
Now, a slight extension of (6.11) yields that
\[
d\!\left[ \frac{\partial h}{\partial x}, X \right]_t = \frac{\partial^2 h}{\partial x^2}(t, X_t)\, d[X, X]_t = \frac{\partial^2 h}{\partial x^2}(t, X_t)\, G_t^2\, dt
\]
so that the last two terms cancel:
\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x} \circ dX_t .
\]
Notice that this is what we would expect if we naively applied the chain rule from deterministic calculus. It can be written explicitly as
\[
dY_t = \frac{\partial h}{\partial t}\, dt + \frac{\partial h}{\partial x} F_t\, dt + \frac{\partial h}{\partial x} G_t \circ dB_t .
\]
Exercise 7.78: Consider again Brownian motion on the circle, i.e. Θ_t = B_t, in Cartesian coordinates X_t = cos Θ_t, Y_t = sin Θ_t. Derive Stratonovich equations that govern (X_t, Y_t).

Exercise 7.79 Lamperti transforming a Stratonovich equation: Consider the scalar Stratonovich equation
\[
dX_t = f(X_t)\, dt + g(X_t) \circ dB_t
\]
where g(x) > 0 for all x. Verify that to transform to coordinates where the noise is additive, we must use the same Lamperti transform as in the Itô case, i.e. (7.6). Next, write the Itô equation which governs {X_t}, and Lamperti transform this equation. Do we arrive at the same resulting equation?

7.7 Time change

Itô's lemma allows us to change the coordinate system used to describe the state space, i.e. the dependent variable, viewing the sample path as a function of time. In some situations it is equally desirable to change coordinates on the time axis, i.e. the independent variable. A first and perhaps trivial, but illustrative, example of this is to change the unit of time. Consider a stochastic differential equation dX_t = f(X_t) dt + g(X_t) dB_t where we take time t to have units of seconds. Now, let u be time measured in hours, i.e. u = t/3600, and define Y_u = X_t = X_{3600u}. Aiming to define standard Brownian motion in units of hours, and recalling the scaling of Brownian motion in section 4.2.1, we see that the process {W_u : u ≥ 0} given by
\[
W_u = \frac{1}{60} B_{3600 u}
\]
is standard Brownian motion. Thus, the governing equation for {Y_u : u ≥ 0} is
\[
dY_u = 3600 \cdot f(Y_u)\, du + 60 \cdot g(Y_u)\, dW_u .
\]
An easy way to remember this rescaling of time is to recall that the physical unit of B_t is √s while the physical unit of W_u is √hour. This is perhaps even easier in terms of the diffusivity g²(x)/2.

Random time change. We now generalize the previous to situations where the time change is dynamic and possibly even random. For example, a process may run with a speed which depends on temperature, with the temperature evolving randomly. To this end, define transformed time {U_t : t ≥ 0} by
\[
dU_t = H_t\, dt, \qquad U_0 = 0
\]
where {H_t : t ≥ 0} is an {F_t}-adapted process such that H_t > 0. Assume that t ↦ U_t defines an invertible map from R̄_+ onto itself, and let T_u = inf{t ≥ 0 : U_t ≥ u} be the inverse map, so that we can transform back and forth between original time t and transformed time u, for each realization. This will hold, for example, if H_t is bounded above and below. Next, define the process {W_u : u ≥ 0} by
\[
W_u = \int_0^{T_u} \sqrt{H_t}\, dB_t .
\]
Then {W_u : u ≥ 0} is standard Brownian motion. Exercise: Show this. Now, let the Itô process {X_t : t ≥ 0} be given by dX_t = F_t dt + G_t dB_t; then the process {Y_u : u ≥ 0} given by Y_u = X_{T_u} satisfies the Itô stochastic differential equation
\[
dY_u = \frac{F_{T_u}}{H_{T_u}}\, du + \frac{G_{T_u}}{\sqrt{H_{T_u}}}\, dW_u .
\]
Note that, as always, the time integral and the Itô integral rescale differently, due to the scaling properties of Brownian motion.

Exercise 7.80: Let {B_t : t ≥ 0} be Brownian motion. Define, for t > 0,
\[
X_t = \frac{1}{\sqrt{t}} B_t \quad \text{and} \quad Y_u = X_t \text{ where } t = \exp(u) .
\]
Show that {Y_u : u ≥ 0} is governed by
\[
dY_u = -\frac{1}{2} Y_u\, du + dW_u ,
\]
i.e., the rescaled process {Y_u : u ≥ 0} is an Ornstein-Uhlenbeck process.

A common situation, when the Itô process is in fact a solution of a stochastic differential equation, is that the time change depends on the state itself. So consider a process {X_t : t ≥ 0} given by the stochastic differential equation

dX_t = f(X_t) dt + g(X_t) dB_t and consider the time change dU_t = h(X_t) dt. Then the process {Y_u : u ≥ 0} satisfies the Itô stochastic differential equation
\[
dY_u = \frac{f(Y_u)}{h(Y_u)}\, du + \frac{g(Y_u)}{\sqrt{h(Y_u)}}\, dW_u .
\]

Example 7.7.1 (Kinesis). Given a spatially varying diffusivity D(x) > 0 (with x ∈ R^d) we can define a kinesis process {X_t : t ≥ 0} as the solution to the equation
\[
dX_t = \sqrt{2 D(X_t)}\, dB_t
\]
where {B_t : t ≥ 0} is d-dimensional standard Brownian motion. This process X_t is unbiased (i.e., a martingale) but it will spend more time in regions where the diffusivity is low (we will return to this when discussing transition probabilities and, in particular, stationary distributions, in chapter 9). Define the time change dU_t = 2 D(X_t) dt; then we have
\[
dY_u = \frac{\sqrt{2 D(X_{T_u})}}{\sqrt{2 D(X_{T_u})}}\, dW_u = dW_u ,
\]
i.e., {Y_u} is Brownian motion. In other words, the process {X_t} is a random time change of Brownian motion.
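A simulation sketch of the kinesis example (an assumed illustration in one dimension, with D(x) = 0.5 + 0.4 sin(x) as an arbitrary positive diffusivity): the Euler-Maruyama discretization below checks that the sample mean stays near the initial value, reflecting the martingale property, even though the diffusivity varies in space.

```python
import numpy as np

rng = np.random.default_rng(6)

def D(x):
    # Arbitrary positive, spatially varying diffusivity (illustration only)
    return 0.5 + 0.4 * np.sin(x)

n_paths, n_steps, dt = 5000, 2000, 1e-3
X = np.zeros(n_paths)
for _ in range(n_steps):
    dB = rng.normal(0.0, np.sqrt(dt), n_paths)
    X = X + np.sqrt(2.0 * D(X)) * dB

print(X.mean())   # close to the initial value 0: the kinesis process is unbiased
```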

Exercise 7.81: Consider a scalar diffusion process {Xt : t ≥ 0} given by dXt = f (Xt ) dt + g(Xt ) dBt and let s(x) be the associated scale function. Let {Yu : u ≥ 0} be a time-changed process given by dUt = h(Xt ) dt. Show that s(·) is also the scale function of {Yu }.


CHAPTER 8

SDEs: Existence and uniqueness

With what we know now, if we are given a process {X_t : t ≥ 0} that satisfies the "usual" conditions, then we can (at least in principle) compute the left and right hand sides of the integral equation
\[
X_t = X_0 + \int_0^t f(s, X_s)\, ds + \int_0^t g(s, X_s)\, dB_s \tag{8.1}
\]

and so verify if {X_t : t ≥ 0} satisfies the equation. Recall that we use the shorthand
\[
dX_t = f(t, X_t)\, dt + g(t, X_t)\, dB_t \tag{8.2}
\]

for this equation; that we call this a stochastic differential equation, and that we call the process {X_t} a diffusion process. In this chapter, we address the inverse problem, namely solving the stochastic differential equation (8.2). By solving, we mean finding a diffusion process {X_t : t ≥ 0}, given the data f, g, the initial condition X_0 = x, and the Brownian motion {B_t}, such that the integrals in (8.1) are well defined and the equation holds. This {X_t} is called a strong solution. Solving the stochastic differential equation (8.2) involves the following questions:

1. Does there exist a solution?
2. In that case, is it unique?
3. In that case, can we find it?

As soon as we consider stochastic differential equations outside a small standard catalog, it is rather unusual to actually be able to find explicit solutions for stochastic differential

equations. This should not be a surprise; the same holds for ordinary differential equations and partial differential equations. Therefore the following two questions become tremendously important:

4. If we cannot find the solution, can we approximate the solution, numerically or analytically?
5. Can we characterize the solution, i.e. examine its quantitative and qualitative properties?

8.1 Uniqueness of solutions

We are used to, and prefer, initial value problems having a unique solution, i.e., that given a model (f, g), an initial condition x, and a Brownian motion B_t (with underlying probability space and filtration), there exists exactly one process X_t such that
\[
dX_t = f(X_t, t)\, dt + g(X_t, t)\, dB_t \tag{8.3}
\]
and such that X_0 = x. Although this is the desirable situation, it is not always the case. A standard counterexample from ordinary differential equations is the following:

8.1.1 Non-uniqueness: The falling ball

Here is an example in which there is no unique solution. A ball is at rest at time t = 0. We use X_t to denote its vertical position at time t; the axis is directed downwards. The potential energy of the ball at time t is −mgX_t where m is the mass of the ball and g is the gravitational acceleration. The kinetic energy is ½mV_t² where V_t = dX_t/dt is the vertical velocity downwards. Energy conservation dictates
\[
\frac{1}{2} m V_t^2 - m g X_t = 0 .
\]
We can isolate V_t to obtain
\[
V_t = \frac{dX_t}{dt} = \sqrt{2 g X_t} \tag{8.4}
\]
where we have used physical reasoning to discard the negative solution. This is a first order ordinary differential equation in {X_t}. We may write it in the standard form Ẋ_t = f(X_t, t) where f(x, t) = √(2gx). With the initial condition X_0 = 0, one solution is
\[
X_t = \frac{1}{2} g t^2 .
\]
But is this solution unique? No, another solution is X_t = 0. In fact, for any non-negative t_0, we can construct a solution
\[
X_t = \begin{cases} 0 & \text{for } t \le t_0 \\ \frac{1}{2} g (t - t_0)^2 & \text{for } t > t_0 . \end{cases}
\]
The physical interpretation of these solutions is that we hold the ball until time t_0 and then let go. Note that for each parameter t_0, this expression defines a continuously differentiable function of time t, so it is a valid solution in every mathematical sense. Physically, the reason why uniqueness fails is that when the particle is at a standstill, we may apply a force without doing work, by holding the particle. So energy conservation does not determine the force and hence the motion. Mathematically, the problem is that the right hand side of the ODE, f(x, t) = √(2gx), has a singularity at x = 0: the derivative is
\[
\frac{\partial f}{\partial x}(x, t) = \sqrt{g / (2x)}
\]
which approaches ∞ as x ↓ 0.

Exercise 8.82: What would happen if you tried to solve the differential equation (8.4) with the initial condition X_0 = 0, using numerical methods, e.g. in Matlab or R? If you cannot guess the result, then try it!

As the following exercise shows, the same phenomenon can occur in a stochastic setting.

Exercise 8.83: Show that, given any deterministic T ≥ 0, the process
\[
X_t = \begin{cases} 0 & \text{for } 0 \le t \le T \\ (B_t - B_T)^3 & \text{for } t > T \end{cases}
\]
satisfies the Stratonovich stochastic differential equation
\[
dX_t = 3 X_t^{2/3} \circ dB_t .
\]
Note that we have used the Stratonovich calculus so that the calculations in the deterministic and the stochastic case appear more analogous. There is not a straightforward physical interpretation of this example, but mathematically, it contains a similar singularity at the origin in that the function R → R given by x ↦ 3x^{2/3} has a derivative that diverges at the origin.
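As a hedged illustration of Exercise 8.82 (one possible numerical experiment, here in Python with scipy rather than Matlab or R; the exact outcome depends on the solver): starting exactly at X_0 = 0, a standard ODE solver typically returns the resting solution X_t ≡ 0, because the right-hand side evaluates to zero there, whereas starting slightly above zero recovers the parabolic falling solution.

```python
import numpy as np
from scipy.integrate import solve_ivp

g = 9.82

def rhs(t, x):
    return np.sqrt(np.maximum(2.0 * g * x, 0.0))

sol0 = solve_ivp(rhs, (0.0, 2.0), [0.0])     # starts exactly at 0
sol1 = solve_ivp(rhs, (0.0, 2.0), [1e-12])   # starts just above 0

print(sol0.y[0, -1])                     # stays at 0: the solver picks the resting solution
print(sol1.y[0, -1], 0.5 * g * 2.0**2)   # close to the falling solution g*t^2/2
```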

Figure 8.1. Lipschitz continuity. The function x ↦ x + sin x + cos 2x is Lipschitz continuous with Lipschitz constant 4, as can be shown by differentiating. For each point x, this restricts the graph to a cone, indicated by green for the point x = −1.

8.1.2 Lipschitz continuity implies uniqueness

Non-uniqueness of solutions is a phenomenon which should be avoided in the modeling process. We see from the examples that non-uniqueness may appear at singularities, i.e. points where the derivatives ∂f/∂x or ∂g/∂x diverge. In order to rule out singularities and thus also non-uniqueness, it is standard to make a fairly harsh assumption, namely that the model is Lipschitz continuous (see also figure 8.1).

Definition 8.1.1 (Lipschitz continuity). Consider a function f from one normed space X to another Y. We say that f is Lipschitz continuous if there exists a constant K > 0 such that for any x_1, x_2 ∈ X:
\[
|f(x_1) - f(x_2)| \le K |x_1 - x_2| .
\]
A few remarks:

1. The normed spaces we encounter in this book are finite-dimensional; in fact they can be written R^{n×m} for natural n and m. In this case it does not matter which norm we choose, since they are all equivalent (i.e., they can be bounded in terms of each other).
2. A differentiable function is Lipschitz if and only if the derivative is bounded. However, a Lipschitz function does not have to be differentiable; for example, the function x ↦ |x| is Lipschitz but not differentiable at 0.

If the stochastic differential equation (8.3) is time-invariant, i.e. f and g do not depend on t, then Lipschitz continuity is exactly what we need. For time-varying systems, a simple extension which suffices for most purposes is uniform Lipschitz continuity:

Theorem 8.1.1. Assume that f and g satisfy a locally uniform Lipschitz condition: For all T there exists a K(T) > 0 such that
\[
|f(x, t) - f(y, t)| \le K(T) \cdot |x - y| , \qquad |g(x, t) - g(y, t)| \le K(T) \cdot |x - y|
\]
whenever 0 ≤ t ≤ T and for all x, y. Then there can exist at most one solution {X_t : t ≥ 0} to the stochastic differential equation
\[
dX_t = f(X_t, t)\, dt + g(X_t, t)\, dB_t , \qquad X_0 = x
\]
(in the sense that if {X_t : t ≥ 0} and {Y_t : t ≥ 0} are two solutions, then X_t = Y_t for all t, almost surely). Furthermore, let {X_t : t ≥ 0} be a solution and let {Y_t : t ≥ 0} be a solution to the same stochastic differential equation with a different initial condition, i.e.
\[
dY_t = f(Y_t, t)\, dt + g(Y_t, t)\, dB_t , \qquad Y_0 = y ;
\]
then the two solutions can diverge at most exponentially from each other, i.e.
\[
E|Y_t - X_t|^2 \le |x - y|^2 \cdot \exp\big( t\, [2 K(T) + K^2(T)] \big) \quad \text{for } 0 \le t \le T .
\]

Proof. Let X_t and Y_t both satisfy the stochastic differential equation (8.3). Introduce the processes
\[
D_t = X_t - Y_t, \quad F_t = f(X_t, t) - f(Y_t, t), \quad G_t = g(X_t, t) - g(Y_t, t), \quad S_t = |X_t - Y_t|^2 = |D_t|^2 .
\]
Then, by the Itô formula, {D_t} and {S_t} are Itô processes, and dD_t = F_t dt + G_t dB_t and

The Grönwall-Bellman inequality. This lemma makes precise the statement that "linear bounds on dynamics imply exponential bounds on solutions". The original formulation, due to T.H. Grönwall, considers a differentiable function v : [0, ∞) → R. If this v satisfies v'(t) ≤ a(t)v(t) for all t ≥ 0, where a : [0, ∞) → R is continuous, then also
\[
v(t) \le v(0) \exp\left( \int_0^t a(s)\, ds \right)
\]
holds for all t ≥ 0. To see this, write v'(t) = a(t)v(t) + f(t) where f(t) ≤ 0. Then v(t) = v(0) \exp\int_0^t a(s)\, ds + \int_0^t \exp(\int_s^t a(u)\, du)\, f(s)\, ds. Now note that the second term is non-positive. This result was extended by R. Bellman, who relaxed the assumption that v is differentiable, and therefore considered an integral form. For our purpose, a simple version suffices: If v : [0, ∞) → R is a continuous function satisfying
\[
v(t) \le v(0) + \int_0^t a(s)\, v(s)\, ds
\]
for all t ≥ 0, then also
\[
v(t) \le v(0) \exp\left( \int_0^t a(s)\, ds \right)
\]
holds for all t ≥ 0.

\[
dS_t = 2 D_t \cdot dD_t + |dD_t|^2 = 2 D_t \cdot F_t\, dt + 2 D_t \cdot G_t\, dB_t + \operatorname{tr}[G_t G_t^\top]\, dt .
\]
Exercise: Verify this! It follows that
\[
E S_t = S_0 + E \int_0^t \left( 2 D_s \cdot F_s + \operatorname{tr}[G_s G_s^\top] \right) ds .
\]
Now, the Lipschitz bounds on f and g, and the Cauchy-Schwarz inequality |⟨a, b⟩| ≤ ‖a‖ ‖b‖, imply
\[
E S_t \le S_0 + \int_0^t \left( 2 K(T) + K^2(T) \right) E S_s\, ds
\]
and in turn the Grönwall-Bellman inequality implies that
\[
E S_t \le S_0\, e^{[2 K(T) + K^2(T)]\, t}
\]
which is the second part of the theorem. Now, assume that X_0 = Y_0 so that S_0 = 0. Then it follows that S_t = 0 for all t, i.e. the solutions X_t and Y_t agree w.p. 1 for each t. By continuity it follows that the two solutions agree for all t, w.p. 1.

8.2 Existence of solutions

Given a stochastic differential equation, does it have a solution? This is the question of existence. At first glance, it may seem obvious that we would like our model to have a solution, and that is indeed often the case. The most interesting reason a stochastic differential equation may fail to have a solution is that there is a solution which is only defined up to some (random) point of time, where the solution explodes. It is instructive to begin with a deterministic example:

Example 8.2.1 (Explosion in an ordinary differential equation). The ODE
\[
\dot{x} = 1 + x^2 , \qquad x_0 = 0
\]
has the (unique maximal) solution x_t = tan t, which is defined for 0 ≤ t < π/2. An explosion occurs at t = π/2 where x_t → ∞.

Explosions also occur in continuous-time Markov chains taking discrete values. For example, consider the "birth" process {X_t : t ≥ 0} where X_t ∈ N and the only state transitions are

from one state x to x + 1. Assume that the process starts in state X_0 = 1 and that the rate of the transition x ↦ x + 1 is x². Thus, the process accelerates in the sense that transitions to the next level occur faster and faster, in expectation. This process can model growth of a population¹, where X_t is the number of animals at time t, but can also be used to model nuclear chain reactions. For this model, the expected sojourn time in state x is 1/x². Define τ_n as the time of arrival to state n, i.e. τ_n = inf{t ≥ 0 : X_t = n}; then the expected arrival time to state n is Eτ_n = \sum_{x=1}^{n-1} x^{-2}. This expected arrival time remains bounded as n increases. In fact, as n → ∞, we have Eτ_n → 1.077, approximately. Now, define τ = lim_{n→∞} τ_n. We can say that an explosion occurs if τ < ∞. We see that Eτ ≈ 1.077; in particular, explosion occurs at a finite time, almost surely.

Example 8.2.2 (Explosion in a Stratonovich SDE). Motivated by the previous example, we consider the Stratonovich SDE
\[
dX_t = (1 + X_t^2) \circ dB_t . \tag{8.5}
\]

With the chain rule of Stratonovich calculus, we can easily verify² that this equation has the solution X_t = tan B_t, which is well defined until the stopping time τ given by
\[
\tau = \inf\{ t : |B_t| \ge \pi/2 \} .
\]
The time τ is the (random) time of explosion. See figure 8.2.

8.2.1 Linear bounds rule out explosions

In some applications, explosions are an important part of the dynamics and indeed the focus of the analysis. However, at this point we prefer to only consider stochastic differential equations which do in fact admit solutions defined on the entire time axis. Therefore we aim to rule out the possibility of explosions. It turns out that the same principle which gave us uniqueness of solutions can also be used to rule out explosions and thus guarantee existence of a solution, global in time. The principle is that linear bounds on dynamics imply exponential bounds on solutions. This gives both existence and uniqueness, and later on it will be used in the analysis of numerics and of stability.

¹ In that case, the model says that animals meet randomly and such encounters give rise to an offspring. Here we ignore the distinction between males and females.
² Here we are skipping the "detail" that the chain rule assumes the process to be well defined for all t ≥ 0. The way around this obstacle is to consider the stopped process {X_{t∧T} : t ≥ 0} where T = inf{t ≥ 0 : |X_t| ≥ K}, i.e., we stop the process when it leaves the interval [−K, K]. This is a well-behaved process. Then we let K → ∞.

Figure 8.2. Explosion at a random time. Three sample paths of the process (8.5), up to the random time τ of explosion.


Theorem 8.2.1. Consider the Itô stochastic differential equation
\[
dX_t = f(X_t, t)\, dt + g(X_t, t)\, dB_t \tag{8.6}
\]
with initial condition X_0 = x. Assume that the data (f, g) satisfies the Lipschitz condition for uniqueness, and in addition that there is some C > 0 such that the bound
\[
x^\top f(x, t) \le C \cdot (1 + |x|^2) , \qquad |g(x, t)|^2 \le C \cdot (1 + |x|^2)
\]
holds for all x ∈ R^n and all t ∈ [0, T]. Then there exists a unique solution {X_t : 0 ≤ t ≤ T} which is {F_t}-adapted, has continuous sample paths, and satisfies the norm bound
\[
E \int_0^T |X_t|^2\, dt < \infty .
\]
In all that is to follow, we will routinely assume that this theorem applies.

8.2.2 The Picard iteration and the proof of existence

It is instructive to consider first an example of how we can construct the solution of an SDE. The method is the so-called Picard iteration, where we make a guess on the solution, and use the guess to compute a better guess. Consider the Stratonovich SDE
\[
dX_t = r X_t\, dt + \sigma X_t \circ dB_t \tag{8.7}
\]
and the initial condition X_0 = 1. Our first (somewhat naive) guess on the solution is the constant process {X_t^{(0)} : t ≥ 0} given by X_t^{(0)} = 1, which at least satisfies the initial condition. Next, we insert this guess on the right hand side of the SDE, and thus obtain an improved guess:
\[
X_t^{(i)} = 1 + \int_0^t r X_s^{(i-1)}\, ds + \sigma X_s^{(i-1)} \circ dB_s \quad \text{for } i = 1, 2, \ldots \tag{8.8}
\]
With the Stratonovich calculus, we get:
\[
X_t^{(1)} = 1 + \int_0^t r\, ds + \int_0^t \sigma \circ dB_s = 1 + r t + \sigma B_t \tag{8.9, 8.10}
\]
\[
X_t^{(2)} = 1 + \int_0^t r (1 + r s + \sigma B_s)\, ds + \int_0^t \sigma (1 + r s + \sigma B_s) \circ dB_s
         = 1 + (r t + \sigma B_t) + \frac{1}{2} (r t + \sigma B_t)^2 \tag{8.11, 8.12}
\]

Figure 8.3. The Picard iteration used to construct the solution of the Stratonovich SDE (8.7).

and, in general by induction, we see that
\[
X_t^{(n)} = \sum_{i=0}^{n} \frac{1}{i!} (r t + \sigma B_t)^i .
\]
It is a good exercise in Stratonovich calculus to verify this. We now recognize this as the truncated Taylor expansion of an exponential function, i.e.
\[
X_t^{(n)} \to \exp(r t + \sigma B_t)
\]
for all t and all ω. It can be shown that the convergence is also in the mean square. Figure 8.3 shows the first few iterates.
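A simulation sketch of the Picard iteration for (8.7) (an assumed illustration; the Stratonovich integrals are approximated with the midpoint rule): each iterate is computed from the previous one on a fixed Brownian path and compared to the limit exp(rt + σB_t).

```python
import numpy as np

rng = np.random.default_rng(7)
r, sigma = 0.5, 0.5
n = 5000
t = np.linspace(0.0, 3.0, n + 1)
dt = np.diff(t)
dB = rng.normal(0.0, np.sqrt(dt))
B = np.concatenate(([0.0], np.cumsum(dB)))

X = np.ones(n + 1)                                   # X^(0) = 1
for _ in range(12):                                  # Picard iterates X^(1), ..., X^(12)
    drift = np.concatenate(([0.0], np.cumsum(r * X[:-1] * dt)))
    noise = np.concatenate(([0.0], np.cumsum(sigma * 0.5 * (X[:-1] + X[1:]) * dB)))
    X = 1.0 + drift + noise

err = np.max(np.abs(X - np.exp(r * t + sigma * B)))
print(err)   # small; shrinks further with more iterations and a finer grid
```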

For the general Itô SDE (8.6), we construct again the Picard iteration. First, we use the bounds on f and g to show that each iterate remains bounded in L². Then, we show that the sequence is a Cauchy sequence, hence convergent in L². Finally, we show that the limit satisfies the stochastic differential equation. The proof can be found in (Øksendal, 2010).

8.3 Itô versus Stratonovich equations

In the previous, we have given a theorem for existence and uniqueness of Itô equations, but several of our examples have been Stratonovich equations. This raises three questions:

1. What is the relationship between Itô equations and Stratonovich equations?
2. Should we develop the mathematical framework on the foundation of the Itô integral, or of the Stratonovich integral?
3. Which guidelines exist for choosing between an Itô equation and a Stratonovich equation to model a given system?

To answer the first question, we consider an Itô diffusion process {X_t : t ≥ 0} which satisfies the Itô stochastic differential equation

\[
dX_t = f(X_t)\, dt + g(X_t)\, dB_t . \tag{8.13}
\]
Then this process also satisfies a related Stratonovich equation
\[
dX_t = \bar{f}(X_t)\, dt + g(X_t) \circ dB_t . \tag{8.14}
\]
Notice that the noise intensity g(·) is the same in the two equations, but that the drift terms differ. In the case of scalar Brownian motion, the relationship between the two drift terms is:
\[
f(x) = \bar{f}(x) + \frac{1}{2} \frac{\partial g}{\partial x}(x)\, g(x) . \tag{8.15}
\]
Here, ∂g/∂x is the Jacobian of g.

Example 8.3.1. Consider the process {X_t = exp B_t : t ≥ 0}. By the chain rule of Stratonovich calculus, {X_t} satisfies the Stratonovich equation dX_t = X_t ∘ dB_t. Inserting, we find f̄(x) = 0, g(x) = x and thus f(x) = ½x, so {X_t} also satisfies the Itô equation dX_t = ½X_t dt + X_t dB_t. This is, of course, in agreement with what we find by applying Itô's lemma.

To see the relationship (8.15), note that

\[
X_t = X_0 + \int_0^t \bar{f}(X_s)\, ds + g(X_s) \circ dB_s
    = X_0 + \int_0^t \bar{f}(X_s)\, ds + g(X_s)\, dB_s + \frac{1}{2} [g(X), B]_t .
\]
Now,
\[
[g(X), B]_t = \int_0^t d[g(X), B]_s = \int_0^t \frac{\partial g}{\partial x}(X_s)\, d[X, B]_s = \int_0^t \frac{\partial g}{\partial x}(X_s)\, g(X_s)\, ds
\]
so that
\[
X_t = X_0 + \int_0^t \bar{f}(X_s)\, ds + g(X_s)\, dB_s + \frac{1}{2} \int_0^t \frac{\partial g}{\partial x}(X_s)\, g(X_s)\, ds ,
\]
or
\[
dX_t = \left( \bar{f}(X_t) + \frac{1}{2} \frac{\partial g}{\partial x}(X_t)\, g(X_t) \right) dt + g(X_t)\, dB_t .
\]
Since this involves Itô integrals, we can now compare with (8.13). The result (8.15) follows.

In the case of m-dimensional Brownian motion, we treat each component separately, i.e. we write the Itô equation as

\[
dX_t = f(X_t)\, dt + \sum_{k=1}^{m} g_k(X_t)\, dB_t^{(k)}
\]
and correspondingly for the Stratonovich equation. Rewriting each of the m stochastic integrals, we then find the relationship between the Itô drift f and the Stratonovich drift f̄:
\[
f(x) = \bar{f}(x) + \frac{1}{2} \sum_{k=1}^{m} \frac{\partial g_k}{\partial x}(x)\, g_k(x)
\]
or, written out explicitly element-wise:
\[
f_i(x) = \bar{f}_i(x) + \frac{1}{2} \sum_{k=1}^{m} \sum_{j=1}^{n} \frac{\partial g_{ik}}{\partial x_j}(x)\, g_{jk}(x) .
\]

With this formula, we can convert back and forth between Itô and Stratonovich equations, i.e. for a given Itô equation we can find a Stratonovich equation which has the same solution, and conversely. Therefore, turning to the second question in the opening of this section, we could base the mathematical framework on either the Itô integral or the Stratonovich integral; either way we would describe the same class of processes. Developing the mathematical machinery for one of the interpretations, the results can then readily be transformed to also apply to the other. For example, we do not have to state existence and uniqueness results for both Itô equations and Stratonovich equations. In this book, like in most texts, we base ourselves on the Itô interpretation. The main motivation for this choice is the martingale property of the Itô integral. Also, the simplest numerical method for simulating sample paths of a diffusion process is the explicit Euler method for Itô equations. Despite this choice, there are applications where the analyst may prefer the Stratonovich calculus. The simpler chain rule is a good reason to choose Stratonovich calculus when many transformations will be done, e.g. when changing coordinates.

However, this does not yet answer the third question that the modeler may raise, whether to use Itô or Stratonovich equations to describe a given system. In many situations, we know the drift term from physical laws, so that the starting point for the model is an ordinary differential equation, dX_t = f(X_t) dt. If we then want to add state-dependent noise, should this be Itô noise, g(X_t) dB_t, or Stratonovich noise, g(X_t) ∘ dB_t? A main argument for choosing Itô noise in this situation could be that the instantaneous rate of change of the mean EX_t is then given by the original drift term f. On the other hand, the Stratonovich interpretation has the nice property that it appears as the limit when the noise is band-limited with increasing band-width. This stems from the property that the Stratonovich integral arises as the limit when we approximate the integrator, Brownian motion, with its harmonic expansion (exercise 6.67). Finally, in some situations we would like the noise to mimic Fickian diffusion. This leads directly to neither Itô nor Stratonovich models; we will return to this case in chapter 9. Of course, if time series data is available for the system, then one can use statistics to infer whether the Itô interpretation or the Stratonovich interpretation, or something else, matches the observations best.
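A small symbolic sketch of the conversion formula (8.15) (an assumed illustration using sympy, not part of the original text), applied to the geometric Brownian motion of Example 8.3.1: given the Stratonovich drift f̄ and noise intensity g, it returns the equivalent Itô drift.

```python
import sympy as sp

x, r, sigma = sp.symbols("x r sigma", positive=True)

def ito_drift_from_stratonovich(f_bar, g):
    # Scalar version of (8.15): f = f_bar + (1/2) * g'(x) * g(x)
    return sp.simplify(f_bar + sp.Rational(1, 2) * sp.diff(g, x) * g)

# Example 8.3.1: dX = X o dB, i.e. f_bar = 0, g = x  ->  Ito drift x/2
print(ito_drift_from_stratonovich(0, x))

# Geometric Brownian motion written as a Stratonovich equation:
# dX = (r - sigma**2/2) X dt + sigma X o dB has Ito drift r*X
print(ito_drift_from_stratonovich((r - sigma**2 / 2) * x, sigma * x))
```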

8.4 Additional exercises

Exercise 8.84: Brownian motion on the n-sphere. Let X_t ∈ R^n satisfy the Itô SDE
\[
dX_t = -\frac{n-1}{2} X_t\, dt + (I - X_t X_t^\top)\, dB_t
\]
where B_t is Brownian motion in R^n.

1. Show that the sphere |x| = 1 is invariant, i.e. if |X_0| = 1 then |X_t| = 1 for all t > 0 (a.s.). Hint: Set S_t = |X_t|² and use Itô's lemma to derive an SDE for S_t.
2. Show that X_t is rotationally symmetric in law (isotropic), i.e., let U be a unitary matrix (such that U U^⊤ = U^⊤ U = I) and set Y_t = U X_t; then X and Y are Markov processes with the same laws.
3. Let X_0 = x be the initial condition, assuming |x| = 1. Find the expectation EX_t. Hint: As we will show later, the mean satisfies the equation dEX_t = −(n − 1)/2 · EX_t dt.
4. Derive the Stratonovich equation which {X_t} satisfies.

Exercise 8.85 The Brownian bridge: If we measure the position of a Brownian particle at times 0 and T > 0, what can we say about the path it took between these two points in time? The answer to this question is the Brownian bridge. Assume that we have measured B_0 = a and B_T = b.

1. Derive the conditional distribution of B_t given B_0 and B_T, for 0 ≤ t ≤ T. Plot E{B_t | B_T} as a function of t with, say, T = 1, a = 0 and b = 1. Include marginal confidence limits, i.e. E{B_t | B_T} ± √(V{B_t | B_T}). Hint: Find the joint distribution of B_t and B_T given B_0 and use standard conditioning in Gaussian distributions.
2. Let X_t be the solution to
\[
dX_t = \frac{b - X_t}{T - t}\, dt + dW_t
\]
where {W_t} is standard Brownian motion. Show that EX_t = E{B_t | B_T = b} and that VX_t = V{B_t | B_T = b} for 0 ≤ t ≤ T. Note: It can be shown that the finite-dimensional conditional distribution of B_t given B_T = b agrees with the finite-dimensional distribution of X_t, for 0 ≤ t ≤ T, using that the two processes are both Markov and share the same Gaussian transition probabilities.

Exercise 8.86: Show that
\[
E\left\{ \int_0^t B_s\, ds \,\Big|\, B_t \right\} = E\left\{ \int_0^t s\, dB_s \,\Big|\, B_t \right\} = \frac{1}{2} t B_t .
\]
Hint: You may use the properties of the Brownian bridge, and the product rule d(tB_t) = B_t dt + t dB_t. Compute the corresponding variances.

Exercise 8.87: The Cox-Ingersoll-Ross process (Cox, Ingersoll Jr, and Ross, 1985) is given by the stochastic differential equation
\[
dX_t = \lambda (\xi - X_t)\, dt + \gamma \sqrt{X_t}\, dB_t
\]
with λ > 0, ξ > 0, γ > 0. This model may describe interest rates, or demographic noise in population dynamics.

1. Show that existence and uniqueness is guaranteed as long as X_t > 0. Argue heuristically that if the process hits x = 0, then it will immediately be repelled back to X_t > 0, so that existence and uniqueness holds.
2. Assuming that existence and uniqueness hold, and that X_0 = x > 0, show that the mean µ_t = EX_t is ξ + (x − ξ) exp(−λt). Assuming x = ξ, show that the variance is Σ_t = VX_t = (1 − exp(−2λt)) γ²ξ/(2λ). Derive the limit of Σ_t as t → ∞.


CHAPTER 9

Transition probabilities: The Kolmogorov equations

Summary

Itô diffusions, the solutions to Itô stochastic differential equations, are Markov processes. Therefore an important characterization of them is their transition probabilities. The transition densities p(s ↦ t, x ↦ y) are governed by partial differential equations of the advection-diffusion type, known as the Kolmogorov equations. The forward Kolmogorov equation, which in the scalar case is
\[
\dot{\varphi} = -(f \varphi)' + \left( \tfrac{1}{2} g^2 \varphi \right)'' ,
\]
governs the transition probability density φ(t, y) as a function of terminal state y and time t, for fixed initial condition X_s = x. This equation describes how probability is redistributed in space as time marches forward. The forward Kolmogorov equation can be written in advection-diffusion form:
\[
\dot{\varphi} = -(u \varphi - D \varphi')'
\]
where D(x) = ½g²(x) is the diffusivity field, and u(x) is the advective flow field, related to the drift term f(x) in the stochastic differential equation by
\[
u(x) = f(x) - D'(x) .
\]
This confirms the connection between diffusive transport, as we encountered in chapter 2, and stochastic differential equations: If particles move independently and the motion of each individual particle is governed by a stochastic differential equation, then the density of particles is governed by an advection-diffusion equation.

The forward Kolmogorov equation has an immediate transport interpretation, and therefore appeals to physical intuition. From a probabilistic point of view, we can equally well focus on the backward Kolmogorov equation, which in the scalar case is
\[
\dot{\psi} + \psi' f + \tfrac{1}{2} g^2 \psi'' = 0 ,
\]
and which governs the transition probabilities ψ(s, x) as a function of initial state x and time s, for fixed terminal time t and position y. Also the backward equation can be written in an advective-diffusive form:
\[
\dot{\psi} + \psi' u + (D \psi')' = 0 .
\]
If the particle is observed at the terminal time and the initial condition is an unknown deterministic parameter, then the backward Kolmogorov equation can be used to find the likelihood function of the initial condition. More generally, the backward Kolmogorov equation can be used to determine expectations of future quantities. The backward and forward equations are adjoint, so they contain the same information. While most students find the forward equation more familiar, we will see in the following chapters that the backward equation is key to applications such as stability analysis, performance evaluation, and dynamic optimization, so the effort of familiarizing oneself with the backward view is well spent.

9.1 Diffusions are Markov processes

Recall that the Markov property loosely can be stated as "given the present, the future is independent of the past". More precisely:

Theorem 9.1.1. Let {X_t : t ≥ 0} be a stochastic process defined on a probability space (Ω, F, P) and taking values in R^n. Assume that X satisfies the stochastic differential equation dX_t = f(X_t, t) dt + g(X_t, t) dB_t as well as the initial condition X_0 = x. Here, {B_t : t ≥ 0} is Brownian motion with respect to a filtration {F_t : t ≥ 0}, and f and g satisfy the sufficient conditions for existence and uniqueness of solutions to the stochastic differential equation. Then the process {X_t : t ≥ 0} is Markov, i.e.
\[
E^x \{ h(X_t) \mid F_s \} = E^x \{ h(X_t) \mid X_s \} = E^{X_s, s} h(X_t)
\]
holds for any 0 ≤ s ≤ t and any Borel measurable test function h : R^n → R such that E^x |h(X_t)| < ∞.

We should perhaps clarify the notation in E^{X_s, s} h(X_t): First, consider the same stochastic differential equation dY_t = f(Y_t, t) dt + g(Y_t, t) dB_t but with initial condition Y_s = y. By existence and uniqueness, E h(Y_t) is a function of the initial condition (y, s); use the notation η(y, s) = E^{y,s} h(X_t) to denote this function. Now, E^{X_s, s} h(X_t) = η(X_s, s).

The theorem should come as no surprise: Consider the initial value problem consisting of the stochastic differential equation and fixing X_s:
\[
X_t = X_s + \int_s^t f(X_u)\, du + \int_s^t g(X_u)\, dB_u .
\]
By uniqueness of solutions, X_t depends only on the initial condition X_s and the Brownian motion {B_u − B_s : u ∈ [s, t]}. In particular, including the remaining information F_s does not change expectations. A complete proof can be found in (Øksendal, 1995).

9.2 Transition probabilities

Since diffusions are Markov processes, a key to their description is the transition probabilities P^{s,x}(X_t ∈ B), i.e., if the process starts with X_s = x, what is the probability that at a later time t > s it will reside in a given Borel set B? We will for the moment assume that X_t admits a probability density function p(s ↦ t, x ↦ y) so that
\[
P^{s,x}(X_t \in B) = \int_B p(s \mapsto t, x \mapsto y)\, dy .
\]
This function p is called the transition density. Being a function of four variables, it is a quite complex object. To simplify, we may fix the initial condition (s, x) and get the density of X_t
\[
\varphi(t, y) = p(s \mapsto t, x \mapsto y) .
\]
On the other hand, if we fix the terminal condition (t, y), we get
\[
\psi(s, x) = p(s \mapsto t, x \mapsto y)
\]
which is the probability of ending in (an infinitesimal region around) the target state y, seen as a function of the initial condition. We can think of ψ as the likelihood function of the initial condition, in case the initial condition is unknown but the terminal position X_t has been

measured. Note that ψ(s, ·) does not necessarily integrate to one; ψ(s, ·) is not a probability density of the initial state X_s. When the dynamics (f, g) are time invariant, the transition densities p will depend on s and t only through the time lag t − s. It is however convenient to keep all four arguments. In rare situations we can determine the transition probabilities from the solution of the stochastic differential equation itself:

Example 9.2.1. If X_t is Brownian motion in one dimension, then
\[
p(s \mapsto t, x \mapsto y) = \frac{1}{\sqrt{2\pi(t-s)}} \exp\left( -\frac{1}{2} \frac{(x-y)^2}{t-s} \right) .
\]

Example 9.2.2. For the Ornstein-Uhlenbeck process in R given by the linear stochastic differential equation dX_t = −λX_t dt + σ dB_t with λ, σ > 0, we have earlier established that given the initial condition X_s = x, X_t is Gaussian distributed with mean
\[
\mu(s, t) := E^{s,x} X_t = e^{-\lambda(t-s)} x
\]
and variance
\[
\Sigma(s, t) := V^{s,x} X_t = \frac{\sigma^2}{2\lambda} \left( 1 - e^{-2\lambda(t-s)} \right) .
\]
This means that the transition density is
\[
p(s \mapsto t, x \mapsto y) = \frac{1}{\sqrt{2\pi \Sigma(s, t)}}\, e^{-\frac{1}{2}(y - \mu(s,t))^2 / \Sigma(s,t)}
= \frac{1}{\sqrt{2\pi \frac{\sigma^2}{2\lambda}(1 - e^{-2\lambda(t-s)})}} \exp\left( -\frac{1}{2} \frac{(y - e^{-\lambda(t-s)} x)^2}{\frac{\sigma^2}{2\lambda}(1 - e^{-2\lambda(t-s)})} \right) .
\]
In figures 9.1 and 9.2 we see these transition densities plotted. Figure 9.1 shows the "forward" view of the transition density, i.e. we fix the initial condition X_s = x and plot the p.d.f. of X_t for increasing t. Note that the variance in the distribution quickly widens and that the mean approaches the steady-state mean 0 exponentially. The density converges towards the stationary distribution, a Gaussian with mean 0 and variance σ²/(2λ). Figure 9.2 shows the "backward" view on the transition density, i.e. we fix the terminal position X_t = y and plot p(s ↦ t, x ↦ y) as a function of x for different s. One way to understand these functions is to consider the problem of estimating the initial condition X_s = x based on an observation X_t = y. Then, the function x ↦ p(s ↦ t, x ↦ y) is the likelihood function of the unknown parameter x. Note that this likelihood function flattens, that its mode diverges, and that the likelihood function converges (pointwise) to a constant: At very large time lags t − s, the likelihood of the

observation X_t = y equals the stationary probability density at y, regardless of the initial condition, because the initial condition no longer holds information about the observation. To see these conclusions from the transition densities, note that the mode of the likelihood function is
\[
\hat{x}_s := \arg\max_x\, p(s \mapsto t, x \mapsto y) = e^{\lambda(t-s)} y ,
\]
which diverges exponentially as the time lag t − s increases, that the maximum is
\[
\max_x\, p(s \mapsto t, x \mapsto y) = \frac{1}{\sqrt{2\pi \Sigma(s, t)}} = \frac{1}{\sqrt{2\pi \frac{\sigma^2}{2\lambda} (1 - e^{-2\lambda(t-s)})}} ,
\]
which approaches an asymptote, and that the Fisher information, the curvature of the log-likelihood, is
\[
I := -\frac{\partial^2 \log p(s \mapsto t, x \mapsto y)}{\partial x^2} = \frac{e^{-2\lambda(t-s)}}{\Sigma(s, t)} = \frac{1}{\frac{\sigma^2}{2\lambda} \left( e^{2\lambda(t-s)} - 1 \right)}
\]
which vanishes as the time lag t − s increases.

Example 9.2.3. For the narrow-sense linear SDE in R^n, dX_t = (A X_t + w_t) dt + G dB_t, where w_t is deterministic, we have previously established an integral expression for the solution. For given initial condition (s, x), X_t is Gaussian with a mean value µ(t) and a variance Σ(t), which can be determined from their governing ordinary differential equations:
\[
\frac{d}{dt} \mu(t) = A \mu(t) + w_t
\]
with initial condition µ(s) = x, and hence the solution
\[
\mu(t) = \exp(A(t-s))\, x + \int_s^t \exp(A(t-u))\, w_u\, du .
\]
For the variance Σ(t), we found in section 5.7 the differential Lyapunov equation
\[
\frac{d}{dt} \Sigma(t) = A \Sigma(t) + \Sigma(t) A^\top + G G^\top
\]

0.6 0.4 0.2 0.0

Transition density p(0−>t,x−>y)

0.8

1.0

Forward transition probabilities

−4

−2

0

2

4

Terminal position y

Figure 9.1. The transition probabilities for the Ornstein-Uhlenbeck process, as functions of final position y for different t, and for λ = σ = 1, s = 0, x = −3.

192

0.6 0.4 0.2 0.0

Transition density p(s−>0,x−>y)

0.8

1.0

Backward transition probabilities

−10

−5

0

5

10

Initial position x

Figure 9.2. The transition probabilities for the Ornstein-Uhlenbeck process, as functions of initial position x for different s, and for λ = σ = 1, t = 0, y = −1.

193

with the initial condition Σ(s) = 0 (see also exercise 5.59). If the variance Σ(t) is positive definite, then Xt admits a density (see section 9.9.2 for a discussion of when this is the case). This is the well-known density of a multivariate Gaussian distribution, which evaluates at y to   1 1 > −1 p(s 7→ t, x 7→ y) = exp − (y − µ(t)) Σ (t)(y − µ(t)) 2 (2π)d/2 |Σ(t)|1/2 where µ(t) and Σ(t) are found by solving the governing ordinary differential equations. The fact that these linear models have transition probabilities in closed form makes them very tractable, and therefore they have a predominant place in applications. Regardless, it should be clear that such closed-form solutions of transition probabilities is the exception rather than the rule. The normal situation is the other way around: We do not know the trajectories, but hope to be able to characterize them in terms of transition probabilities. For the majority of situations, where transition probabilities cannot be determined from the process itself, we need governing equations from which we can find the transition probabilities. The objective of this chapter is to derive partial differential equations which govern these transition probabilities. We shall see that they are second order equations of the advectiondiffusion type. They are known as the Kolmogorov equations.

9.3

The backward Kolmogorov equation

Biography: Andrei Nikolaevich Kolmogorov (1903-1987). One of the most prominent mathematicians of the 20th century, he contributed to the foundations of probability theory and stochastic processes, to filtering and prediction, and to the description of turbulence. His approach to diffusion processes was to consider continuous-time, continuous-space Markov processes for which the transition probabilities are governed by advection-diffusion equations.

It turns out that the simplest way to establish the Kolmogorov equations is the following: First, we establish the backward equation which governs ψ(x, t), i.e. the transition densities as a function of the initial condition. Then, we obtain the forward equation using a duality argument. So, consider again the process {Xt : t ≥ 0} from theorem 9.1.1. We stop the process at a fixed (deterministic) time T , and evaluate the function h at XT . Here, h is an arbitrary bounded C 2 test function on state space. We now define the process {Yt : 0 ≤ t ≤ T } given by Yt = E{h(XT ) | F t } 194

i.e., the expected terminal reward, conditional on the information obtained by observing the Brownian motion up to time t. First, notice that Yt must be a Doob’s martingale w.r.t. F t (example 4.44). To repeat the argument and be explicit, for 0 ≤ s ≤ t ≤ T we have

E{Yt | F s } = E{E[h(XT ) | F t ] | F s } = E{h(XT ) | F s } = Ys

.

Here we have used that the information available at time s is also available at time t, i.e. F s ⊂ F t , which allows us to use the Tower property of conditional expectations. Exercise: Verify that Y also possesses the other defining properties of a martingale. Next, note that the Markov property of the process {Xt : t ≥ 0} implies that Yt can only depend on F t through Xt , i.e. Yt is Xt -measurable. Therefore there must exist a function k(x, t) such that Yt = k(Xt , t) = Ex,t h(XT ) . This implies that Yt is in itself an It¯ o process satisfying

∂k ∂k 1 ∂2k dt + dX + (dX)2 ∂t ∂x 2 ∂x2 ∂k ∂k + Lk) dt + g dBt = ( ∂t ∂x

dYt =

by It¯o’s lemma. Here, we have omitted the detail about smoothness of k, the arguments to k(Xt , t), and we have introduced the differential operator L given by

(Lk)(x, t) =

∂k 1 ∂2k (x, t)f (x, t) + trg(x, t)g > (x, t) 2 (x, t) ∂x 2 ∂x

defined for functions k(x, t) which are twice differentiable in x. Now, since Yt is a martingale, we have ∂k + Lk = 0 . ∂t We summarize the findings: Theorem 9.3.1 (Kolmogorov’s backward equation). Let h : Rn 7→ R be bounded and C 2 . Then Ex h(XT ) 195

is found as k(x, 0) where k(x, t) satisfies the backward Kolmogorov equation ∂k + Lk = 0 ∂t

(9.1)

with terminal condition k(x, T ) = h(x). Furthermore, if the transition probabilities admit a density p(s 7→ t, x 7→ y), then ∂ψ + Lψ = 0 ∂s where the terminal condition (t, y) is arbitrary and ψ(s, x) = p(s 7→ t, x 7→ y). Example 9.3.1 (Likelihood estimation of the initial condition). If we have observed the process y = XT (ω) for some T > 0 and want to estimate the initial condition X0 = x, then the likelihood function is Λ(x) = p(0 7→ T, x 7→ y) assuming that the transition probabilities admit a density p. To determine this likelihood, we solve the Kolmogorov backward equation ∂k + Lk = 0 ∂t for t ∈ [0, T ], with terminal conditional k(t, x) = δ(x−y), a Dirac delta. Then, Λ(x) = k(0, x). More generally, assume that at time T or later we take one or several measurements such that, if the process had started at time T with XT = y, the likelihood function of y would be l(y). Assume now that the process actually started earlier at X0 = x. The likelihood function of x can now be written as x

Z l(y)p(0 7→ T, x 7→ y) dx

Λ(x) = E l(XT ) = X

To compute this, we solve the Kolmogorov backward equation ∂k + Lk = 0 ∂t for t ∈ [0, T ], with terminal conditional k(t, x) = l(x). Then, Λ(x) = k(x, 0).

9.4

The forward Kolmogorov equation

We now turn to the forward equations, i.e. partial differential equations that govern the transition probabilities φ(t, y) = p(s 7→ t, x 7→ y) as functions of the end point t, y, for a given initial condition Xs = x. These equations will also govern the probability density of the state Xt as a function of time, when the initial condition is a random variable. 196

Theorem 9.4.1. Under the same assumptions as in theorem 9.1.1, assume additionally that the distribution of Xt admits a probability density for all t in a neighborhood of s, and use φt (x) to denote this density at x. Then the forward Kolmogorov equation φ˙ + ∇ · (f φ) − ∇ · ∇(Dφ) = 0 holds for all t ≥ s and for all x ∈ Rn . Here, all functions are evaluated at t, x, and the diffusivity matrix is 1 D(x, t) = g(x, t)g > (x, t) 2 The forward Kolmogorov equation expresses how probability is redistributed in space by means of the advective and diffusive transport processes. To make this transparent, we introduce the advective flow vector field u = f − ∇D

.

Now the forward equation can be written in the conservative advection-diffusion form φ˙ = −∇ · (uφ − D∇φ) which expresses conservation of probability mass: The local rate of increase is minus the divergence of the flux J = uφ − D∇φ. This flux in turn has an advective contribution uφ and a diffusive contribution −D∇φ, as in chapter 2. Proof. Let Yt be as in the previous section, then we can compute the expectation of Yt = k(Xt , t) by integration over state space: Z EYt =

φt (x)k(x, t) dx X

Since Yt is a martingale, this expectation is independent of time. Differentiating with respect to time, we obtain: Z

˙ − φLk dx = 0 φk

(9.2)

X

where we have omitted arguments and used k˙ = −Lk. We consider the term writing out Lk:

Z

Z φ·(

φLk dx = X

X

∂k 1 ∂2k f + trgg > 2 ) dx ∂x 2 ∂x

197

R

X φLk

dx,

For the first term, we find

Z

Z

k∇ · (f φ) dx

φf · ∇k dx = − X

X

by integration by parts, or using the divergence theorem on the vector field kφf . Here we have used that the boundary term is zero at |x| = ∞ is zero: There can be no flow to infinity, since the process cannot escape to infinity in finite time. For the second term, we define the diffusivity matrix D = 21 gg > and find Z φtrD X

∂2k dx = − ∂x2

Z ∇(φD) · ∇k dx X

once again using integration by parts, or the divergence theorem on the vector field φD∇k, and omitting boundary terms. Here, ∇(φD) is the vector field with elements

∇(φD)i =

X ∂(φDij ) j

∂xj

Repeating, now using the divergence theorem on the vector field k∇(φD), we arrive at

∂2k φtr(D 2 ) dx = − ∂x X

Z

Z X

∂φ 12 trgg > ∂k dx = ∂x ∂x

Z k∇ · ∇(φD) dx X

Inserting this in equation (9.2), we obtain Z

(φ˙ + ∇ · (f φ) + ∇ · ∇(φD))k dx = 0

X

Since h is arbitrary, we conclude that φ˙ + ∇ · (f φ) − ∇ · ∇(Dφ) = 0 which is the result we pursued.

9.5

An abstract formulation of the Kolmogorov equations *

This section requires theory from functional analysis which is not used elsewhere in this course. It may be skipped without disturbing the flow. 198

The relationship between the forward and backward equation can be stated very compactly in functional analytical terms. We sketch this, neglecting technicalities, and considering only the time invariant case where f (x, t) = f (x) and g(x, t) = g(x). For fixed time t, the density φt (·) is in L1 (X, R) since it must integrate to 1. Likewise, if h is bounded, then kt (·) is in L∞ (X, R) for fixed t. Recall that these two spaces are dual; a function in L1 can be seen as a linear operator on L∞ , and vise versa. The unconditional expectation of h(XT ) can be written, conditioning on Xt : Z Eh(XT ) = EE{h(XT )|Xt } = hφt , kt i =

φt (x)kt (x) dx X

Since this cannot depend on t, we must have

h

d d φt , kt i + hφt , kt i = 0 dt dt

Now, we write the backward equation as −k˙ t = Lkt

: Kolmogorov’s backward equation

Using this and letting L∗ denote the (Hilbert) adjoint of L, given by hL∗ φ, ki = hφ, Lki for all φ, k, arrive at

h

d φt , kt i = hL∗ φt , kt i dt

Since this must hold for all φ and k, we must have φ˙ t = L∗ φ

: Kolmogorov’s forward equation

What remains is to establish the expression for the adjoint L∗ in terms of the dynamics (f, g). To this end, it is helpful to note that the advection operator k 7→ u·∇k has adjoint φ 7→ ∇(uφ) and that the diffusion operator k 7→ ∇ · (D∇k) is self-adjoint. We can write the solutions to these equations compactly as φt = exp(L∗ t)φ0

,

kt = exp(L(T − t))kT

where exp(Lt) and exp(L∗ t) are the dual semigroups generated by L and L∗ . Note the notational, and conceptual, similarity to the formulas for continuous Markov chains: The generator is now a differential operator L rather than a matrix G; transition probabilities are obtained as exp(Lt) or exp(Gt), and finally the duality of the forward and backward Kolmogorov equation consists of finding the adjoint operator of L rather than transposing G (or equivalently multiplying vectors on G from right vs. left). 199

9.6

The stationary distribution

In many situations the forward Kolmogorov equation admits a stationary density φ(x), which by definition is one that satisfies −∇ · (uφ − D∇φ) = 0 If the initial condition X0 is sampled from the stationary distribution, then Xt will also follow the stationary distribution, for any t > 0. Stationary distributions are as important for stochastic differential equations as equilibria are for ordinary differential equations; in many applications the main concern is the stationary distribution, and sometimes only very little extra characterization of the process {Xt : t ≥ 0} is needed. In general, a stochastic differential equation may admit many stationary distributions. However, in most of the models we study there can be only one stationary distribution. Example 9.6.1 (The general scalar SDE). Consider the scalar equation dXt = f (Xt ) dt + g(Xt ) dBt . The stationary forward Kolmogorov equation is 1 −(φf )0 + g 2 φ00 = 0 2 which we rewrite in advection-diffusion form: −(φu − Dφ0 )0 = 0 with D(x) = 12 g 2 (x), u(x) = f (x) − D0 (x), as usual. We can integrate this once to obtain uφ − Dφ0 = j where j is an arbitrary integration constant. This constant can be interpreted as the flux of probability in stationarity: d P(Xt > x) = j dt for any x ∈ R. If the process {Xt : t ≥ 0} is stationary, then this flux must equal 0. We elaborate on this point in section 9.7. Proceeding, and assuming D(x) > 0 for all x, we find Z x  u(y) 1 φ(x) = exp dy Z x0 D(y) where the reference point x0 is fixed arbitrarily, and the constant Z is also chosen arbitrarily. We hope to be able to chose Z so that φ integrates to 1 and thus is a probability density function. Whether this is possible depends on the system, i.e. f and g or equivalently u and D.

200

Exercise 9.88: Continuing the previous exercise, show that the stationary distribution of Xt can be written in terms of the drift f as Z x  Z x  1 1 f (y) 1 2 2f (y) φ(x) = exp dy = exp dy 2 Z D(x) Z g 2 (x) x0 D(y) x0 g (y) where Z is a (new) normalization constant. Next, let {Yu : u ≥ 0} be a time-changed process p dYu = h(Yu )f (Yu ) du + h(Yu )g(Yu ) dBu . Show that the stationary distribution of Yu is φ(y)/h(y). Example 9.6.2 (The Gibbs canonical distribution). We consider the SDE in Rn dXt = −∇U (Xt ) dt + σ dBt where U is a given potential on state space and σ is a scalar constant; {Bt : t ≥ 0} is n-dimensional Brownian motion. The forward Kolmogorov equation for this equation is φ˙ = −∇ · (−φ∇U − D∇φ) where the diffusivity is D = 21 σ 2 . A candidate stationary distribution is φ(x) =

1 exp(−U (x)/D) Z

where Z is a normalization constant. If the function x 7→ exp(−U (x)/D) is integrable, then Z can be chosen so that φ integrates to 1, and this φ is in fact a stationary distribution. Exercise: Verify this! Such distributions are called Gibbs canonical distributions in statistical physics; the normalization constant is called the partition function. The distribution has maximum where the potential U has minimum; the diffusivity D determines how flat the maximum is. Example 9.6.3 (The linear SDE). Consider the linear SDE in Rn dXt = AXt dt + G dBt where A ∈ Rn×n , G ∈ Rn×m and {Bt : t ≥ 0} is m-dimensional Brownian motion. Assume that all eigenvalues of A have negative real part. Then there is a stationary distribution; it is Gaussian with mean 0 and variance Σ which is the unique solution to AΣ + ΣA> + GG> = 0

.

This equation is known as the algebraic Lyapunov equation. To verify this result, it is possible to show that the unnormalized density 1 φ(x) = exp(− x> Σ−1 x) 2 201

satisfies the forward Kolmogorov equation. However, the calculations are tedious, and particular care must be taken if Σ is singular: In that case the stationary distribution does not admit a density w.r.t. Lebesgue measure, but is concentrated on a linear subspace of Rn . An easier approach is the following: Remember that we have previously shown that the mean µt = EXt satisfies the ODE d µt = Aµt dt In particular, if EX0 = 0, then EXt = 0 for all t ≥ 0. Similarly, the variance Σt = VXt = E(Xt − µt )(Xt − µt )> satisfies d Σt = AΣt + Σt A> + GG> dt and if the initial variance Σ0 = Σ where AΣ + ΣA> + GG> = 0, then the variance remains constant. Finally, we know that the solution can be written

Xt = eAt X0 +

Z

t

eA(t−s) G dBs

0

which implies that if X0 is Gaussian, then also Xt is Gaussian. In summary, if X0 is Gaussian with mean 0 and variance Σ, then so is Xt for t ≥ 0. In the previous, it may not be clear why we need the condition that all eigenvalues of A have negative real part. To see this, consider the scalar equation dXt = Xt dt + dBt . If we attempt to compute the steady-state variance using the formulas above, we find Σ = −1/2! This is a symptom of the fact that the system is unstable, so Xt diverges to infinity, and no steady-state exists. In generality, the stability condition on A ensures that the algebraic Lyapunov equation AΣ + ΣA> + GG> = 0 admits a unique solution which is non-negative definite. Example 9.6.4 (Kinesis). Consider the stochastic differential equation in Rn

dXt =

p 2D(Xt ) dBt

where {Bt : t ≥ 0} is Brownian motion in Rn and D : Rn → R is smooth and non-negative. Then a candidate stationary distribution is

φ(x) =

1 1 Z D(x)

where Z is a normalization constant. If Z can be chosen so that φ integrates to 1, then φ is in fact a stationary distribution. To see this, note that the forward Kolmogorov equation is 202

φ˙ = ∇2 (D φ)

.

Note that the process accumulates where the diffusivity is low. This {Xt : t ≥ 0} is a martingale, i.e. an unbiased random walk, but it is not pure Fickian diffusion, since pure Fickian diffusion has a uniform steady-state.

Exercise 9.89: Consider the Cox-Ingersoll-Ross process p dXt = λ(ξ − Xt ) dt + γ Xt dBt with λ, ξ, γ > 0. Show that in stationarity, Xt is Gamma distributed with rate parameter ω = 2λ/γ 2 and shape parameter ν = 2λξ/γ 2 , i.e. density φ(x) =

ω ν ν−1 −ωx x e . Γ(ν)

Derive the mean and variance in stationarity and compare with the results of exercise 8.87. Note: Also the transition probabilities are available in closed form. Exercise 9.90: Consider the Stratonovich equation dXt = (I −

1 Xt Xt> ) ◦ dBt |Xt |2

where Xt ∈ Rn and {Bt } is n-dimensional Brownian motion. Show that |Xt |2 is constant along trajectories. In particular, there cannot exist a unique stationary distribution. Note: The process {Xt } is called Brownian motion on the n-sphere. In generality, ergodic theory, which is outside our scope, addresses the question if a unique stationary distribution exist.

9.7

Detailed balance, no-flux, and reversibility

We continue the example 9.6.2 concerning the diffusive motion in a potential dXt = −∇U (Xt ) dt + σ dBt Assume that Xt is distributed according to the Gibb’s canonical distribution Xt ∼ exp(−U (x)/D) and compute the flux of probability: J = φ∇U − D∇φ = φ∇U − D(−(∇U )/D)φ = 0 203

i.e., not only is the flow divergence free (as required by stationarity), but in fact 0. This means that in steady state, the net exchange over any surface is zero. This is called detailed balance. For comparison, stationarity just implies that the net exchange over any closed surface vanishes (using the divergence theorem). Thinking in terms of an ensemble of particles that move randomly according to the diffusion process, this means that the number of particles that cross a given element of the surface in one direction, equals the number of particles that cross the same element in the opposite direction, so that the net flux of particles across the element is zero. This, in turn, means that the probability that a given particle moves from one region A to another region B in a specified time t, exactly equals the probability that this particle moves from B to A in the same time. This, together with the Markov property, implies that the statistics of the process are preserved if we change the direction of time. This is a quite remarkable property. Example 9.7.1. We have earlier considered a positively buoyant particle suspended in a water column, the vertical position of which satisfied the stochastic differential equation

dXt = v dt + σ dBt with v > 0. The solution is constrained to the interval [0, H] by reflection at the boundaries. For this system, the potential energy is −vXt , and so the stationary distribution is the canonical distribution

φ(x) =

1 exp(vx/D) Z

where Z is the normalization constant that ensures that φ integrates to 1 over [0, H]. We conclude that the stationary process describing the motion of this particle satisfies detailed balance and hence the process is time reversible. This is perhaps surprising as the particle has a preference for moving upwards, which one could think introduces an asymmetry in time. However, in stationarity the random fluctuations must push the particle down just as much as buoyancy push its towards the surface; the end result is that the motion is symmetric in time. In general, detailed balance is a special case of stationarity. However, for scalar processes, any stationary process must necessarily satisfy detailed balance: Loosely, if the flux were constant in space but non-zero, then the particle must be absorbed at the right boundary and re-emerge at the left boundary (or vice versa). Since we have not introduced such exotic boundary behavior, we conclude that stationarity for one-dimensional processes implies detailed balance. This corresponds to the result for Markov chains, that a stationary process on an acyclic graph must satisfy detailed balance (Grimmett and Stirzaker, 1992). For diffusions in higher dimensions, there is a generalized concept of detailed balance where we allow some state variables to change sign as we reverse time. See (Gardiner, 1985). 204

9.8

Chapter conclusion

It¯o diffusion processes, the solutions to stochastic differential equations, are Markov processes and their transition probabilities are governed by partial differential equations of the advection-diffusion type. This link is tremendously important in applications. Many practical problems regarding stochastic differential equations reduce to problems of analysis on state space, which involve solving a partial differential equation. Partial differential equations are as instrumental in the analysis of diffusions, as linear algebra and matrix analysis are in the analysis of Markov chains. The link is also quite natural, bearing in mind our initial motivation of following a diffusing molecule in fluid flow, and is anticipated by using the name “It¯o diffusion” for solutions of stochastic differential equations. Historically, the first approach to diffusion processes was actually to consider Markov processes with transition probabilities governed by advectiondiffusion equations, and the connection - established by It¯o - with stochastic integrals and stochastic differential equations came later.

9.9 9.9.1

Notes and references The strong Markov property

In the definition of the Markov property, we compared two conditional expectations at a time t: Conditioning on the present, vs. conditioning on the past. It is important that the “present” time t is deterministic. To deal with the situation where the present time is chosen randomly, we have the notion of strong Markov processes: Definition 9.9.1 (Strong Markov process). A Markov process {Xt : t ≥ 0} taking values in Rn is said to be a strong Markov process if E{h(Xτ +t )|F τ } = E{h(Xτ +t )|Xτ } holds for any stopping time τ , any test function h : Rn → R and any time lag t > 0. Here, F τ is the information obtained by observing the Brownian motion up to time τ ; more precisely, the σ-algebra generated by Bs∧τ for s ≥ 0. It¯o diffusions are strong Markov processes: Theorem 9.9.1. Let {Xt : t ≥ 0} be an It¯ o diffusion as in theorem 9.1.1. Then {Xt : t ≥ 0} is a strong Markov process. See (Øksendal, 1995) for the proof. 205

0.4



y

0.1

0.2

0.3

X0

● 0.0



−10

−5

0

5

10

x

Figure 9.3. A diffusion on this curve is a Markov process, but not a strong Markov process: If the process starts at X0 is stopped at the crossing point, then the future depends on whether it reached the point from the left or from the right.

The strong Markov property is relevant for questions such as: “Once the process hits a given region, what happens next?”. In this case the stopping time τ would be the hitting time. In the discrete time case, the ordinary Markov property implies the strong Markov property. However, continuous-time Markov processes need not in general be strong Markov processes. An example is diffusion on a self-intersecting curve such as the one in figure 9.3. The curve is parametrized by the function f : R 7→ R given by f (t) = (t+20 t φ0 (t), φ(t)) where, as always, φ is the standard Gaussian p.d.f. The specifics of this parametrization are not important; what matters is that it is “almost one-to-one” in that there is only one self-intersecting point, and that it is smooth with bounded derivative so that all technical assumptions hold. Now, consider the process {Xt : t ≥ 0} taking values in R2 given by Xt = f (Bt ) where {Bt : t ≥ 0} is standard Brownian motion. Then Xt is an It¯o process, by It¯o’s lemma. But Xt is not a strong Markov process: If we define the “present” as the stopping time τ where the point hits the point of intersection, then the future depends on the past. This Xt is, however, a Markov process. The reason is that when we consider any deterministic time t, the probability that Xt is at the intersection point is P(τ = t) = 0. Now, recall that the Markov property E{h(Xt )|F s } = E{h(Xt )|Xs } only needs to hold with probability 1. Notice that this process Xt is not an It¯o diffusion, and cannot be written as the solution of a stochastic differential equation (what would the drift and diffusivity be at the intersection point?), so this counter-example does not violate theorem 9.9.1. 206

9.9.2

Do the transition probabilities admit densities?

The transition probabilities may be degenerate in the sense that all probability is concentrated on a line, plane or another set in the state space with Lebesgue measure 0. In this case the transition probabilities do not admit probability densities with respect to Lebesgue measure. One example is Brownian motion on the n-sphere (exercise 9.90): The r-sphere S = {ξ : |ξ| = r} is invariant, for all r > 0, so if the initial condition X0 = x is on the sphere, then Px {Xt ∈ S} = 1 for any t > 0, despite S having Lebesgue measure 0. For a stochastic differential equation with an initial condition x, there are, in general, two situations: Either there exists an invariant manifold of dimension less than n in which the solution remains, or the transition probabilities admit densities. A sufficient condition for a density to exist is that g(x)g > (x) > 0 at each x; i.e. the diffusivity tensor is non-singular, so that diffusion acts in all directions. This condition is not necessary. An example is the linear process dXt = Vt dt

,

dVt = −Vt + dBt

i.e., the position of a particle the velocity of which is an Ornstein-Uhlenbeck process. Here, diffusion is singular since it acts only in the v-direction. Nevertheless, the (Xt , Vt ) is bivariate Gaussian distributed for all t > 0 with non-singular covariance matrix, and so admits a density. For general linear time-invariant systems, we have the following result: Theorem 9.9.2. Consider the linear stochastic differential equation dXt = AXt dt + G dBt with initial condition X0 = x ∈ Rn . Let Σ(t) be the covariance matrix of Xt . Then the following are equivalent: 1. There exists a time t > 0 such that Σ(t) > 0. 2. For all times t > 0 it holds that Σ(t) > 0. 3. For all left eigenvectors p 6= 0 of A, pG 6= 0. 4. The so-called controllability matrix [G, AG, A2 G, . . . , An−1 G] has full row rank. If these hold, we say that the pair (A, G) is controllable. The proof of this theorem can be found, for example, in (Zhou, Doyle, and Glover, 1996). In the nonlinear case, the geometry is harder to analyze, and more regularity assumptions have to be made. However, we still have the equivalence between a) The transition probabilities admitting densities b) The lack of existence of an invariant sub-manifold c) The (local) controllability of the system, seeing the noise as an exogenous input. See (Rogers and Williams, 1994b). 207

9.9.3

Random walks with heterogeneous diffusivity

One important application of stochastic differential equations is to do Monte Carlo simulation of advection-diffusion equations. Specifically, given the advection-diffusion equation C˙ = −∇(uC − D∇C) we would like to devise a stochastic particle tracking algorithm: Distributing a batch of particles random in space according to the initial distribution C(·, 0), we would like to have a stochastic algorithm for their motion such that at a later time t, the particles are distributed according to C(·, t). Earlier, in chapter 2, we saw that the advection-diffusion model was consistent with a random walk model: Under the assumption that D is constant in space, we showed that a step in the random walk starting at position x should have mean u(x) ∆t and covariance matrix 2D(x) ∆t. We now see that the advection-diffusion equation is the forward Kolmogorov equation of the stochastic differential equation √ dXt = u dt +

2D dBt

and the random walk model is the Euler scheme for this stochastic differential equation. So stochastic differential equations provide the link between the advection-diffusion equation and the discrete-time stochastic recursion. When the diffusivity is not constant in space, we first find the stochastic differential equation for which the advection-diffusion equation is the forward Kolmogorov equation: dXt = (u + ∇D) dt +

√ 2D dBt

and then do an Euler scheme of this SDE, to obtain Xt+h = Xt + (u(Xt ) + ∇D(Xt )) h +

p

2D(Xt ) (Bt+h − Bt )

The term ∇D pushes particles in direction of high diffusivity; without this term, particles will tend to aggregate in regions with low diffusivity, which is not in agreement with Fickian diffusion.

9.9.4

Numerical computation of transition probabilities

For systems in one, two, and three dimensions, it is feasible to compute the transition probabilities numerically by discretizing the governing partial differential equations. There exist a large number of available software packages for this purpose, ranging from commercial industrial-grade packages to community-driven open source projects. Many of 208

these packages are designed for problems in fluid mechanics or thermodynamics, but can be applied to stochastic differential equations because the Kolmogorov equations are similar to the equations governing temperature fields and concentrations of solutes. In this section we introduce the principles for discretizing the Kolmogorov equations. For technical simplicity, we focus on the one-dimensional case. There are many different approaches to numerical analysis of partial differential equations, the most important being the finite difference methods, the finite element methods, and the spectral methods. For the purpose of analyzing transition probabilities of stochastic differential equations, it is attractive to use a finite volume method (Versteeg and Malalasekera, 1995): Such a method keeps track of the integral of the unknown function over a grid cell. In probabilistic terms, this corresponds to approximating the diffusion process with a Markov chain which jumps between neighboring grid cells. For an It¯o equation in advection-diffusion form dXt = (u + ∇D)dt +

√ 2D dB

we concentrate on the forward Kolmogorov equation C˙ = −∇ · J = −∇ · (uC − D∇C) When discretizing this equation with the finite volume method, we obtain an algorithm based on the following principles: 1. We divide the domain into a finite number of cells. 2. We represent the probability distribution only by the probability of each cell, i.e. the integral of the density C over each cell. 3. At each point of time, we compute the flux J at the interfaces between neighboring grid cells. 4. We use this flux to update the probability of each grid cell. Although these principles apply also to problems in more than one dimension, we focus on the one-dimensional case where the domain is a bounded interval of reals.

Discretization of space We divide the domain real axis into neighbouring grid cells I1 , I2 , . . . , IN . We need some notation: • Let xi−1/2 denote the interface between cell Ii and Ii−1 ; likewise, xi+1/2 is the interface between cell Ii and Ii+1 . 209

• Let Pi (t) be the probability that the process Xt is in grid cell i, i.e. the integral of C over grid cell i Z Pi (t) = C(x, t) dx Ii

• Let Ci (t) be the average probability density in grid cell i, i.e. Ci =

Pi (t) where |Ii | = xi+1/2 − xi−1/2 |Ii |

• Let ui−1/2 denote u(xi−1/2 ) and let Di−1/2 denote D(xi−1/2 ). • Let Ci−1/2 , Ji−1/2 and ∇Ci−1/2 denote (estimates of) the concentration, flux, and concentration gradient at xi−1/2 . We shall describe ways of obtaining these estimates.

Redistribution of probability Conservation of probability mass yields d Pi = Ji−1/2 − Ji+1/2 dt

(9.3)

The flux J is the sum of an advective flux and a diffusive flux: A D Ji−1/2 = Ji−1/2 + Ji−1/2

(9.4)

A D where the advective flux is Ji−1/2 = ui−1/2 Ci−1/2 and the diffusive flux is Ji−1/2 = −Di−1/2 ∇Ci−1/2 .

Approximation of fluxes So far we have only stated balance of mass. The discretization and the approximation enters when we estimate these fluxes, using only the probability mass Pi in each grid cell. For the diffusive flux, the natural approximation is obtained by estimating the gradient ∇Ci−1/2 from the difference in concentrations in cell i and i − 1: D Ji−1/2 := Di−1/2 2

Ci−1 − Ci Ci−1 − Ci = 2Di−1/2 |Ii | + |Ii−1 | xi+1/2 − xi−3/2

(9.5)

This estimate would be accurate if the concentration profile is linear in x over the two grid cells Ii−1 ∪ Ii = [xi−3/2 , xi+1/2 ]. A For the advective flux Ji−1/2 = ui−1/2 Ci−1/2 there are two principles, differing in how we estimate the concentration at the interface: The upwind scheme, and the central scheme.

210

The upwind scheme In the upwind scheme we estimate the concentration at the interface as the average concentration in that grid cell which is ”upwind” of the interface. If advection u is positive, this would be the left cell. We get:  Ci−1/2 :=

Ci Ci−1

if ui−1/2 < 0 if ui−1/2 > 0

Then, the advective flux can be written compactly as ui−1/2 Ci−1/2 := (ui−1/2 ∨ 0)Ci−1 + (ui−1/2 ∧ 0)Ci

(9.6)

where, as always, a ∨ b = max(a, b) and a ∧ b = min(a, b).

The central scheme In the second order central scheme, we estimate the concentration at the interface as a weighted average of the concentrations in the two neighboring cells:

Ci−1/2 :=

Ci−1 |Ii | + Ci |Ii−1 | |I1 | + |Ii−1 |

(9.7)

Also this approximation would be exact, if the true concentration profile was a linear function of space x in the two cells. The upwind scheme is a first-order scheme, so has larger truncation error than the second order central scheme. On the other hand, the second order central scheme has the unfortunate property that even if a grid cell is empty, it may result in a net flow out of the cell. Therefore it may predict negative concentrations. This may also give rise to stability problems, i.e. introduce spatial waves of concentrations the amplitude of which grows unbounded.

Boundary cells We have derived these formulas assuming that the interface is interior, so it remains to specify what to do at the left and right boundaries, i.e. how to specify J1/2 and JN +1/2 . There are three common principles: 1. Periodic boundaries. Here, we set J1/2 = JN +1/2 so what moves out of the left exterior boundary enters through the right. We compute this flux according to the same formulas as for interior cells, i.e., (9.4), (9.5), and either (9.6) or (9.7), where we use P0 := PN and PN +1 := P1 . 211

2. No-flux boundaries. Here we set J1/2 = JN +1/2 = 0. This corresponds to modelling the exterior boundaries as impenetrable walls, so that the state particle reflects when hitting the boundary. 3. Absorbing boundaries. Here we compute the fluxes J1/2 and JN +1/2 using the same formulas as for interior cells, where we assume that there is an empty cell to the left of cell 1, and similarly assume that there is an empty cell to the right of cell N . Absorbing boundaries correspond to homogeneous Dirichlet condition, while a no-flux boundary corresponds to a homogeneous Robin condition.

Solution of the discretized equations Regardless of which of these schemes we use at the boundary, and whether we use the upwind or the central scheme, we are now able to compute the flux J across each cell interface from the probability mass in each grid cell. Using the method of lines, we have discretized the partial differential equation to N coupled ordinary differential equations, one for each grid cell governing Pi (t). Conceptually, this corresponds to approximate the diffusion process with a continuous-time Markov chain which keep tracks, not of the precise location of the particle, but only which grid cell the particle is currently residing in. The generator of this Markov chain is the discretized version of the generator of the diffusion process. These coupled ordinary differential equations can be solved numerically in many ways. If the system is time invariant, and the number of grid cells is reasonably low, then the matrix exponential is a feasible and simple solution. In other cases, you may use the standard ODE solver in your favorite software package for scientific computing. The ordinary differential equations may also be discretized in time. If one uses the explicit Euler method, a very simple scheme results. This corresponds to approximating the diffusion process with a discrete-time Markov chain. However, the explicit Euler method requires very small time steps: For a uniform grid with grid size ∆x, the time step ∆t must satisfy 2D∆t < (∆x)2 or negative probabilities may occur. If further D∆t > (∆x)2 , the numerical solution becomes unstable so that probabilities may diverge to ±∞. For fine grids, this bound on the time step means that the explicit Euler method becomes prohibitively expensive. Fully implicit methods, or semi-implicit methods where only the diffusive fluxes is resolved implicitly, have better stability properties and are useful in practice, in particular since one may argue that a very accurate solution of the ordinary differential equations is unnecessary, since these equations are already an approximation.

212

CHAPTER

10

State estimation

In this chapter we consider the following problem: We have a diffusion process {Xt : t ≥ 0}. We envision an observer who, at certain points of time t, takes measurements which provide some information about the current state Xt . We aim to give the distribution of state, conditional on these measurements. This is what we call the problem of state estimation. We start by considering a general, possibly non-linear and time-varying, It¯o equation dXt = f (Xt , t) dt + g(Xt , t) dBt

(10.1)

We assume that we take measurements at times 0 ≤ t1 < · · · < tN = T and denote the measurements Y1 , Y2 , . . . , YN . These are random variables. We consider an observer who monitors the measurements; the information available to this observer at time t is measurements taken no later than time t, i.e. the σ-algebra G t = σ({Yi : ti ≤ t}) Note that {G t : t ≥ 0} is a filtration. A graphical model of the situation is seen in figure 10.1. The graphic illustrates that our problem is one of inference in a Hidden Markov Model (Zucchini and MacDonald, 2009): The state process {Xt : t ≥ 0}, sub-sampled at the times {t0 = 0, t1 , . . . , tN = T }, constitute a discrete-time Markov process {Xti : i = 0, . . . , N }. The states are unobserved or “hidden”, but we aim to infer them indirectly from measurements {Yi : i = 1, . . . , N }. State estimation in stochastic differential equations, using this Hidden Markov Model approach, involves the following steps: 213

X0

Gti

Xt1

···

Xti

Xti+1

···

XT

Y1

···

Yi

Yi+1

···

YN

Figure 10.1. A graphical model of the random variables in the filtering problem. An arrow from one random variable to another indicates that the model is specified in terms of the conditional distribution of the latter given the former: The measurement equation specifies the conditional distribution of Yi given Xti , and the stochastic differential equation specifies the conditional distribution of Xti+1 given Xti . The shaded region contains the measurements in Gti , i.e. those which are used for the state estimate ψi and the state prediction φi+1 1. Specification of the model, in particular the conditional distribution of observations given the state. 2. A recursive filter, which runs forward in time and processes one measurement at a time. This filter can run on-line in real time, i.e. process measurements as they become available, although it is probably most often used in off-line studies. This filter consists of a time update, which addresses the link between state Xti−1 and state Xti in figure 10.1, and a data update, which addresses the link between state Xti and observation Yi . 3. This filter yields state estimates, such as E{Xti |G ti }, and state predictions E{Xti+1 |G ti }. Along with these conditional means come variances and, more generally, the entire conditional distributions. 4. With the filter in hand, we are able to evaluate the likelihood of any unknown underlying parameters in the stochastic differential equation, or in the measurement process, and therefore estimate such parameters. 5. A second recursion improves the state estimates to include also information based on future observations, i.e. form conditional expectations such as E{Xti |G T }. This is the smoothing filter. It can, of course, only be used in off-line studies where past values of the state are hindcasted. 6. Finally, one may be interested in not just the marginal conditional distribution of Xti given G T , but in the entire joint distribution of {Xt : 0 ≤ t ≤ T } given G T . It is possible to draw samples from this distribution, i.e. simulating typical tracks, and to identify the mode of the distribution, i.e. the most probable track. In the following we address these steps one at a time. To make the presentation more specific, we will use a running example, namely state estimation in the Cox-Ingersoll-Ross process. Specifically, we consider the process dXt = (1 − Xt ) dt + 214

p

Xt dBt

(10.2)

4 3 2 0

1

Abundance

0

5

10

15

20

Time Figure 10.2. Simulation of the Cox-Ingersoll-Ross process (10.2). For the filtering example, we assume that this state process is unobserved and aim to estimate it. which we interpret as the abundance of a biological population. Figure 10.2 displays the simulation of the process. In the following sections, we assume that this trajectory is unobserved but aim to estimate it from available data.

10.1

Observation models and the state likelihood

We now focus on the observations Yi . Ultimately, we aim to apply Bayes’ theorem to find the conditional distribution of Xti given the measurement Yi . For example, if we observe Yi (ω) = yi , and no other information, then we find the conditional density of Xti as 1 fXti |Yi (x|yi ) = fXti (x)fYi |Xti (yi |x) c R Here c is the normalization constant fYi (yi ) = fXti (x)fYi |Xti (yi |x) dx. We have assumed that the joint distribution of Xti and Yi is continuous. We see from this example that information in the measurement Yi about the state Xti is quantified by the function fYi |Xti (yi |x). We introduce the term state likelihood function, and the symbol li (x), for this function: 215

Poisson

Round−off

0 2 4 6 8

0.4

li(x)

0.0

0.0

0.00

0.10

li(x) 0.2

li(x)

0.4

0.8

0.20

Gaussian

0 2 4 6 8

x

x

0 2 4 6 8 x

Figure 10.3. Three examples of state likelihood functions: To the left, an observation Y = 3 with Y ∼ N (X, 1/2). In the middle, an observation Y = 3 with Y ∼ Poisson(X). To the right, an observation Y = 3 with Y being X rounded to the nearest integer. Note that these are likelihood functions of the state X and not probability density functions of X.

li (x) = fYi |Xti (yi |x) Note that to the observer, yi = Yi (ω) is a known quantity at time ti and thereafter, so li (x) is known, for each value of the state x, at time ti . In more technical terms, the li (x) is a G ti -measurable random variable ω 7→ fYi |Xti (Yi (ω)|x), for each x. A common situation is that the measurements Yi are noisy observations of some function of the state Xti , e.g.

Yi = c(Xi ) + s(Xi )ξi where the function c(x) specifies the quantity which is measured, and where ξi is Gaussian measurement errors with mean 0 and variance 1, which are independent of each other and of the Brownian motion {Bt : t ≥ 0}. In this situation, the state likelihood functions are   1 1 (yi − c(x))2 li (x) = √ exp − 2 s(x)2 2πs(x) But many other situations are possible. For example, the conditional distribution of the random variable Yi given Xti may be Poisson distributed with a mean which depends on Xti , i.e. 216

o

ooo

o

oo

o oo ooooo o o o o

0

oooooooo ooooooooooo ooooooooo ooooooo ooooooooooooooooooo oooo ooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooooo 0

5

10

15

0.0

0.5

o ooo o

1.0

3 2

o o ooo oooooooooo ooo oooo ooooooo oo ooo

Observations

oooo ooo

2.0

oo

1.5

o oo

1

Abundance

4

5

2.5

o

20

Time Figure 10.4. The observed time series {Yi } for the Cox-Ingersoll-Ross example, plotted along with the assumed unobserved states {Xt }.

li (x) =

(µ(x))yi −µ(x) e yi !

where the function µ : X 7→ R describes how the conditional mean of Yi depends on Xti . Or we can have observed simply whether Xti is in some region A of state space or not:  li (x) =

1(x ∈ A) 1(x 6∈ A)

if yi = 1 if yi = 0

Regardless of the specific way measurements Yi are taken, we see that this information is condensed to a state likelihood function li (x). This function describes the information about the state Xti which is acquired at time ti . In agreement with figure 10.1 we also assume likelihood independence, i.e. the joint likelihood of all unobserved states Xti is the product of the li ’s. For our example of the Cox-Ingersoll-Ross process (10.2), we assume that the state is measured at regular time intervals ti = ih with a sampling interval h = 0.1. At each sampling time, a random variable Yi is observed where Yi |Xti ∼ Poisson(vXti ) . 217

The interpretation is that we count the number of organisms in a small volume v. Figure 10.4 shows the simulated data set {Yi }, overlaid with the true states {Xt }. Note that the individual observation contains very little information about the state; in order to estimate the states, it is important to take the state dynamics into account so that the state is estimated not just from a single observation, but from all available observations.

10.2

The filtering principle: Time update and data update

Our objective is now to combine the process model, the It¯o stochastic differential equation (10.1) governing Xt , and the observation model, the state likelihood functions li (·), so as to obtain estimates of the state Xti based on the information G ti . The approach is to process or “filter” the measurements one at a time. This recursive approach to estimation also allows us to run the filter on-line, i.e. do the processing in real time as the measurements are taken. To this end, we introduce two conditional distributions of the state Xti , which differ in how many observations are available to estimate of Xti . First, the predicted distribution φi (x) is the p.d.f. of Xti given G ti−1 , i.e. all measurements taken strictly prior to time ti . Similarly, the estimated distribution ψi (x) is the p.d.f. of Xti given G ti , i.e. all measurements taken no later than time ti . The principle is to tie the predictions φi and the estimates ψi together in a recursion. To this end, notice that the estimate ψi and the prediction φi+1 both are based on observations available at time ti . In the time interval [ti , ti+1 ], the process Xt evolves according to the stochastic differential equation. Therefore, let ρ(x, t) be the p.d.f. of Xt , conditional on G ti , and evaluated at point x. Then, for t ≥ ti , ρ is governed by the forward Kolmogorov equation ρ˙ = −∇ · (uρ − D∇ρ)

(10.3)

where we have used the advection-diffusion form of the forward Kolmogorov equation, with D(x, t) = 12 g(x, t)g > (x, t) and u(x, t) = f (x, t) − ∇D(x, t). We solve this equation for t ∈ [ti , ti+1 ] subject to the initial condition ρ(x, ti ) = ψi (x). Then, we find the next state prediction as φi+1 (x) = ρ(x, ti+1 ). We use the name time update to denote this step, which changes the time index of the estimated state without changing the information on which the estimation is based. In the important special case of a time-invariant stochastic differential equation, i.e. when the drift term f and noise term g in (10.1) do not depend on time t, we can write formally ∗ (t i+1 −ti )

φi+1 = eL

218

ψi

At the time ti+1 the new observation Yi+1 (ω) = yi+1 becomes available to the observer. This causes us to modify the distribution of the state, using Bayes’ rule

ψi+1 (x) =

1 ci+1

φi+1 (x)li+1 (x)

(10.4)

where ci+1 is the normalization constant Z ci+1 =

φi+1 (x)li+1 (x) dx X

This normalization constant is the probability density of the next observation, conditional on current information, and evaluated at the actual measurement. This step, which changes the information with which we estimate Xti , is called the data update. We summarize the algorithm: 1. Start at t0 = 0 with ψ0 (·). Set i = 0. 2. Time update: Solve the forward Kolmogorov equation (10.3) on t ∈ [ti , ti+1 ] with initial condition ρ(x, ti ) = ψi (x). Advance time to ti+1 and set φi+1 (x) equal to the solution ρ(x, ti+1 ). 3. Data update: Compute ψi+1 (·) from Bayes’ rule (10.4). 4. Advance i := i + 1 and go to step 2. Figure 10.5 illustrates the time update and the data update for the Cox-Ingersoll-Ross model. Here, we have discretized the state space and solved the equations numerically. In the left panel, we see the time update at a certain time point ti . The estimated distribution ψi is concentrated at lower states than the stationary mean x = 1, so the time update shifts the distribution slightly to the right towards the stationary distribution, and also widens it. The time step 0.1 is small relative to the decorrelation time 1 of the process, so the effects of the time update is slight, but nevertheless still important. To the right we see the effect of the date update. At time ti+1 , a measurement Yi+1 = 1 is made available. Since Yi is Poisson distributed with mean vXti , with v = 0.5, the maximum of the state likelihood is obtained at x = 2. This is much larger than the mean in the predicted distribution φi+1 , so the data update shifts the distribution even further to the right. Figure 10.6 displays the estimated state, defined as the mean in the estimated distribution ψi , as a function of time. Included is also the true state {Xt } as well as lower and upper confidence limits on Xti , derived as 16.6 % and 83.3 % percentiles in the distribution ψi , respectively. Notice that the estimated state follows the true state reasonably accurately, and in particular that the true state is within the confidence intervals most of the time. Notice also that the estimated state appears to lag a bit behind the true state. This is because the estimated state is based on past measurements. 219

0

1

2

3

4

1

2

x

3

0.1

0.2 0

0.0

0.0

φi+1 li+1 ψi+1 4

x

Figure 10.5. The time update (left panel) and the data update (right panel) for the Cox-.Ingersoll-Ross process, illustrated for a single time step where the estimated state is low but a positive measurement Yi arrives.

2 0

1

x

3

4

True state Estimated state

0

5

10

15

20

t Figure 10.6. Estimated (dashed) and true (solid) state trajectory for the Cox-IngersollRoss process. The estimate is the mean in the estimated distribution ψi . Included is also 95 % confidence intervals, derived from the estimated distribution ψi . 220

State likelihood

0.3

1.5 0.5

P.d.f.

1.0

1.5 1.0 0.5

P.d.f.

0.0

ψi φi+1

While this recursive Bayes algorithm in principle solves the filtering problem, it remains to be described how in practice to solve the forward Kolmogorov equation, and how to do the Bayes’ update. When the state space has low dimension (up to three, say) we can solve the equations directly using numerical discretization such as a finite volume method. This is the approach we use in our example with the Cox-Ingersoll-Ross process. Depending on the specifics of the model, there may be other techniques for solving these equations: 1. The case of linear stochastic differential equations with linear observations and Gaussian noise. Here we can give the solution analytically in closed form, or at least characterize them in terms of solutions to ordinary differential equations. This leads to the Kalman filter which we describe in section 10.6. 2. In other situations we can approximate the solutions with Gaussians. This holds if the system is close to linear, i.e. if the non-linearities are weak at the scale given by the covariance of the state estimate. The resulting algorithm is the Extended Kalman filter or variants, e.g. the unscented Kalman filter, or filters which include higher-order corrections. 3. In some situations it is convenient to use Monte Carlo to estimate the solutions. This leads to the particle filter. Also hybrid techniques are possible.

10.3

The smoothing filter

We return to the general nonlinear stochastic differential equation with discrete-time measurements. At this point we have characterized φi and ψi , i.e. the p.d.f. of Xti conditional on measurements taken strictly before, or no later than, time ti . We now describe the so-called smoothing step, which aims to improve on the estimate ψi by including also measurements taken later than time ti . This is typically only relevant in an offline situation, where we want to hindcast the process. With the state estimate ψi (·) in hand, the way to include future measurements is the following: We consider ψi the prior density of Xti . Next, we derive the likelihood of all future measurements, seen as a function of x, the realized value of Xti . Finally, we combine prior distribution and likelihood using Bayes’ rule. To compute these likelihood functions, we perform a recursion over time. This is similar to the predictive filter in that it contains a time update step and a data update step, but it runs in reverse time. Specifically, viewing Xti as an unknown parameter, let µi (x) denote the likelihood of all strictly future measurements µi (x) = fYi+1 ,Yi+2 ,...,YN |Xti (yi+1 , yi+2 , . . . , yN , x) where we use the general notation that fY |X (y, x) denotes the conditional probability density 221

of Y given X, evaluated where Y = y and X = x. Similarly let λi (x) denote the likelihood of all measurements taken at ti or later λi (x) = fYi ,Yi+1 ,...,YN |Xti (yi , yi+1 , . . . , yN , x) We aim to tie these likelihood functions together in a recursion. First, if we have computed µi (x), then we can compute also λi (x) by including the likelihood of the measurement taken at time ti−1 : λi (x) = µi (x) · li (x) This is the analogy of the data update in the predictive filter. Next, with λi (·) in hand we need to compute µi−1 (·). Now, by the properties of the conditional expectation we have µi−1 (x) = Ex,ti−1 λi (Xti ) This means that we can find µi−1 by solving the backward Kolmogorov equation −h˙ = u · ∇h + ∇ · (D∇h) for t ≤ ti together with the terminal condition h(x, ti ) = λi (x). The interpretation of this function h is in fact the likelihood of measurements taken at time ti or later, viewing Xt = x as the initial condition. Then

µi−1 (x) = h(x, ti−1 ) This is the analogy of the time update in the predictive filter. In the important special case of a time-invariant stochastic differential equation, we can allow the convenient notation µi−1 (x) = eL·(ti −ti−1 ) λi (x) These two steps, the time update and the data update, are iterated backward in time, starting with

λN (x) = lN (x) Having completed the recursion, we find the posterior distribution

πi (x) =

1 ψi (x)µi (x) ki 222

2 0

1

X

3

4

Smoothed state Estimated state

0

5

10

15

20

t Figure 10.7. Estimated (dashed) and smoothed (solid) state trajectory for the CoxIngersoll-Ross process. Included is also 95 % confidence intervals, derived from the smoothed distribution πi . which is the conditional distribution of Xti given all measurements, past, present, and future. Here ki is a normalization constant ensuring that the πi sum to one. (Of course we could equally well have used φi (x)λi (x) after normalization). Figure 10.7 displays the smoothed estimate for the Cox-Ingersoll-Ross process. For comparison, we have also included the estimate from figure 10.6. Comparing the two figures, note that the uncertainty associated with the estimates only reduces slightly by including also future observations in the estimate. On the other hand, the smoothed estimate does not display the lagging that the estimate has.

10.4 Sampling typical tracks

The smoothing filter provides us with the conditional distribution of X_{ti} given all measurements. These distributions are marginal in time; for example, they do not specify the joint conditional distribution of X_{ti} and X_{ti+1}. Ultimately, we would like to know the conditional law of the stochastic process {X_t : t ∈ [0, T]} given all measurements, but on the other hand this is too complicated an object to operate with. The notable exception to this rule is the linear-Gaussian case, where the conditional law is given by the estimate µ_{t|T} and the autocovariance structure of the estimation error. For general, non-linear models, we have to be satisfied with the ability to sample "typical" tracks (X_0, X_{t1}, ..., X_T) from the posterior distribution. Such simulated trajectories are often very useful for communicating the results of the filtering problem, in particular to non-specialists. They can also be used to make Monte Carlo estimates of statistics that are otherwise difficult to compute, for example the distribution of the maximum max{X_0, X_{t1}, ..., X_T}, conditional on measurements. An algorithm for this is as follows (a code sketch follows below):

1. First, sample ξ_T from π_N(·).

2. Next, for each i ∈ {N−1, N−2, ..., 0}, do

   (a) Compute the conditional distribution of ξ_{ti} given measurements and X_{ti+1} = ξ_{ti+1}. Unnormalized, this distribution has density at x

       ψ_i(x) · p(t_i → t_{i+1}, x → ξ_{ti+1})

       where p are the transition probabilities, which are most conveniently found by solving Kolmogorov's backward equation governing p(s → t_{i+1}, x → ξ_{ti+1}) as a terminal value problem, with the terminal condition p(t_{i+1} → t_{i+1}, x → ξ_{ti+1}) = δ(x − ξ_{ti+1}).

   (b) Sample ξ_{ti} from this distribution.

In addition to the marginal distribution of the state at each time, and to the sampled trajectories, it is useful to compute the mode of the joint distribution of X_0, X_{t1}, ..., X_T conditional on measurements. This is called the most probable track. Maximizing the posterior distribution over all possible trajectories is a (deterministic) dynamic optimization problem which can be solved with dynamic programming, and the resulting algorithm is known as the Viterbi algorithm.
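Continuing the grid-based sketch from above (again with the hypothetical objects P and psi, and with piN denoting the smoothed distribution at the last measurement time), one realization of a typical track can be drawn as follows:

```r
## Sample one "typical" track on the grid by backward simulation (sketch).
sample_track <- function(P, psi, piN) {
  N   <- nrow(psi)
  nx  <- ncol(psi)
  idx <- integer(N)                              # grid cell index at each time
  idx[N] <- sample(nx, 1, prob = piN)            # 1. sample the final state from pi_N
  for (i in (N - 1):1) {                         # 2. walk backwards in time
    w <- psi[i, ] * P[[i]][, idx[i + 1]]         # psi_i(x) * p(t_i -> t_{i+1}, x -> xi_{i+1})
    idx[i] <- sample(nx, 1, prob = w / sum(w))   #    sample from the normalized weights
  }
  idx                                            # indices into the spatial grid
}
```

Repeating the call gives an ensemble of tracks from which, for example, the conditional distribution of the maximum can be estimated.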

10.5 Likelihood inference

If the underlying model of system dynamics, or the measurement process, includes unknown parameters, then we can use the filter to estimate these parameters. In fact, maximum likelihood estimation of unknown parameters corresponds to tuning the predictive filter. To see this, assume that the parameters f and g in the stochastic differential equation (10.1) depend on some unknown parameter vector θ, which may also enter in the state likelihood functions l_i(x). Now, the likelihood function L(θ) is the joint probability density function of the observation variables Y_1, ..., Y_N, evaluated at the measured values y_1, ..., y_N, and for the given parameter θ. This joint p.d.f. can be written in terms of the conditional densities. First, we single out the first measurement:

f_{Y1,...,YN}(y_1, ..., y_N; θ) = f_{Y1}(y_1; θ) · f_{Y2,...,YN | Y1}(y_2, ..., y_N, y_1; θ)

where we use the notation fX|Y (x, y; θ) for the conditional density of X given Y , evaluated at x, y and for parameter θ. Next, in the second term we single out Y2 , and continuing this recursion we get

f_{Y1,...,YN}(y_1, ..., y_N; θ) = ∏_{i=1}^{N} f_{Yi | Y1,...,Yi−1}(y_1, ..., y_i; θ)

The terms in this product are exactly the normalization constants ci+1 in the data update of the filter (10.4). Thus, the likelihood of the unknown parameter can be written

L(θ) = ∏_{i=0}^{N−1} c_{i+1}(θ)

where we have stressed that the normalization constants depend on θ because the predictions and/or the state likelihood functions l_i(x) do. Thus, maximizing the likelihood function corresponds to tuning the predictive filter so that it predicts the next measurement optimally. While this in principle allows us to perform maximum likelihood estimation, the question remains how to compute the maximum likelihood estimate, and how to assess the properties of the estimator. There is a large specialized literature on this subject, as well as on alternative approaches to parameter estimation which lead to simpler algorithms and/or analysis. A word of caution must be issued. Maximum likelihood estimation is, generally, a good approach when the data generating system belongs to the family of models in which we identify an estimate. On the other hand, the maximum likelihood estimator may not have good robustness properties, i.e. it may not perform well when the actual data generating system does not belong to the model family. For example, if the sampling frequency is high, then measurement errors are typically not perfectly white. Any color in the measurement noise may not affect the state estimation severely for known parameters, but may render the parameter estimate useless because the estimator tries to match dynamics in the measurement noise rather than system dynamics. A useful solution to this problem is to tune a predictor with a longer prediction horizon than from one measurement to the next; this is often a reasonable measure of model performance. In summary, since model families are often gross simplifications of true data generating systems, it is imperative to make a careful model validation after estimating parameters.
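To make the connection concrete, here is a hedged R sketch of how the negative log-likelihood can be accumulated from the normalization constants of the data updates, again in a grid (HMM) discretization. The functions predict_step (advancing the density one measurement interval, e.g. by a matrix-vector product with a transition matrix) and density_like (returning l_i on the grid) are assumed given; their names are illustrative only.

```r
## Negative log-likelihood of theta, accumulated from the filter's
## normalization constants (sketch; grid-discretized filter assumed).
negloglik <- function(theta, y, grid, predict_step, density_like, phi0) {
  N   <- length(y)
  phi <- phi0                               # prior density of the state, on the grid
  nll <- 0
  for (i in 1:N) {
    phi <- predict_step(phi, theta, i)      # time update: predicted density phi_i
    li  <- density_like(y[i], grid, theta)  # state likelihood l_i(x)
    ci  <- sum(phi * li)                    # normalization constant of the data update
    nll <- nll - log(ci)                    # each c_i contributes a factor to L(theta)
    phi <- phi * li / ci                    # data update: filtered density psi_i
  }
  nll                                       # minimize this over theta, e.g. with optim()
}
```

The maximum likelihood estimate can then be found by passing this function to a numerical optimizer such as optim.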

10.6 The Kalman filter

Biography: Rudolf Emil Kalman. Born 1930 in Hungary; his family emigrated to the United States in 1943. A pioneer of "Modern Control Theory", i.e. state space methods in dynamic systems and control. Developed the discrete-time "Kalman" filter in 1958, and together with Richard Bucy, the continuous-time version in 1961. Other important contributions are the "Kalman axioms" for dynamic systems, realization theory for linear systems, and the recognition that Lyapunov stability theory would be useful for the analysis and design of control systems.

We now consider the linear system

dX_t = (AX_t + u_t) dt + G dB_t     (10.5)

where u_t is a deterministic function of time. At times {t_i : i = 1, ..., N} we take the measurements Y_i = CX_{ti} + DV_i, where {V_i} is a sequence of Gaussian variables with mean 0 and unit variance matrix, independent of each other and of the Brownian motion {B_t : t ≥ 0}. As before, we let G_t be the information available at time t, i.e. the σ-algebra generated by Y_i for t_i ≤ t. To perform the time update, assume that conditionally on G_{ti}, X_{ti} is Gaussian with mean µ_{ti|ti} = E{X_{ti} | G_{ti}} and variance Σ_{ti|ti} = V{X_{ti} | G_{ti}}. Then conditional on the same information, the distribution of X_t remains Gaussian for t ∈ [t_i, t_{i+1}] with a conditional mean given by the vector ordinary differential equation

(d/dt) µ_{t|ti} = Aµ_{t|ti} + u_t

and a conditional variance given by the matrix ODE

(d/dt) Σ_{t|ti} = AΣ_{t|ti} + Σ_{t|ti} A′ + GG′

By advancing these ODEs to time t_{i+1}, we have completed the time update. To do the data update, we can use a standard result about conditioning in multivariate Gaussians, rather than doing the computations explicitly. To this end, note first that conditional on G_{ti}, X_{ti+1} and Y_{i+1} are jointly normal with mean

E{ (X_{ti+1}, Y_{i+1}) | G_{ti} } = ( µ_{ti+1|ti}, Cµ_{ti+1|ti} )

and covariance matrix

V{ (X_{ti+1}, Y_{i+1}) | G_{ti} } =
  [ Σ_{ti+1|ti}      Σ_{ti+1|ti} C′
    CΣ_{ti+1|ti}     CΣ_{ti+1|ti} C′ + DD′ ]

Next, recalling a standard result about conditional distributions in multivariate Gaussians (exercise 3.27), we can now summarize the Kalman filter for linear systems with discrete-time measurements:

1. Time update: Advance time from t_i to t_{i+1} by solving the equations

   (d/dt) µ_{t|ti} = Aµ_{t|ti} + u_t
   (d/dt) Σ_{t|ti} = AΣ_{t|ti} + Σ_{t|ti} A′ + GG′

   for t ∈ [t_i, t_{i+1}] with initial conditions µ_{ti|ti} and Σ_{ti|ti}.

2. Data update: Compute the so-called Kalman gain, i.e. the matrix

   K_{i+1} = Σ_{ti+1|ti} C′ (CΣ_{ti+1|ti} C′ + DD′)^{−1}

   and then include information at time t_{i+1} as follows:

   µ_{ti+1|ti+1} = µ_{ti+1|ti} + K_{i+1} (y_{i+1} − Cµ_{ti+1|ti})
   Σ_{ti+1|ti+1} = Σ_{ti+1|ti} − K_{i+1} CΣ_{ti+1|ti}
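For readers who prefer code, the following is a minimal R sketch of one predict/update cycle, where the time-update ODEs are integrated with a crude fixed-step Euler loop purely for brevity (in practice one would use a proper ODE solver, or the matrix-exponential approach of the remark below).

```r
## One cycle of the discrete-time Kalman filter (sketch).
## mu, Sigma  : current conditional mean (column vector) and variance
## A, G, C, D : model matrices; u: input over the interval; h: sampling interval
kalman_cycle <- function(mu, Sigma, y_next, A, G, C, D, u, h, nsub = 100) {
  dt <- h / nsub
  for (k in 1:nsub) {                        # time update: integrate the two ODEs
    mu    <- mu + (A %*% mu + u) * dt
    Sigma <- Sigma + (A %*% Sigma + Sigma %*% t(A) + G %*% t(G)) * dt
  }
  S <- C %*% Sigma %*% t(C) + D %*% t(D)     # covariance of the predicted measurement
  K <- Sigma %*% t(C) %*% solve(S)           # Kalman gain
  mu    <- mu + K %*% (y_next - C %*% mu)    # data update of the mean
  Sigma <- Sigma - K %*% C %*% Sigma         # data update of the variance
  list(mean = mu, var = Sigma)
}
```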

Remark 10.6.1. When the sampling interval h = t_{i+1} − t_i is constant, the time update can be done more efficiently as follows: First, before the iteration starts, compute the matrix exponential exp(Ah) and solve the Lyapunov matrix ODE

(d/dt) S(t) = AS(t) + S(t)A′ + GG′

with initial condition S(0) = 0, for t ∈ [0, h]. See exercise 5.59 for how to do this. Then

Σ_{ti+1|ti} = e^{Ah} Σ_{ti|ti} e^{A′h} + S(h)

To see this, use the general rule VX_{ti+1} = EV{X_{ti+1} | X_{ti}} + VE{X_{ti+1} | X_{ti}}, where expectation and variance are conditional on G_{ti}. Thus we can advance the variance matrix without solving a matrix ODE at each time step; matrix multiplication suffices. Depending on the driving term u_t, the same may be possible for the mean value. For example, if A is invertible and u_t is constant and equal to u over each time step (t_i, t_{i+1}), then

µ_{ti+1|ti} = e^{Ah} µ_{ti|ti} + A^{−1}(e^{Ah} − I)u

so also the vector ODE for µ_{t|ti} need not be solved numerically; rather, the computations reduce to matrix algebra. The theory of Kalman filtering for linear systems is quite complete and addresses many other issues than deriving the basic algorithm as we have just done. Just to give a few examples, there exist conditions under which the filter converges to a steady state, the special case of periodic systems has been investigated, as has the robustness of the filter to various types of model errors.
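A hedged sketch of this precomputation (integrating the fundamental matrix and the Lyapunov ODE once, with a simple Euler loop standing in for the method of exercise 5.59) could look as follows:

```r
## Precompute exp(A h) and S(h) once, so each time update is matrix algebra (sketch).
precompute_update <- function(A, G, h, nsub = 1000) {
  n   <- nrow(A)
  dt  <- h / nsub
  Phi <- diag(n)                       # will approximate exp(A h)
  S   <- matrix(0, n, n)               # will approximate S(h)
  GG  <- G %*% t(G)
  for (k in 1:nsub) {
    Phi <- Phi + A %*% Phi * dt                     # d Phi/dt = A Phi,  Phi(0) = I
    S   <- S + (A %*% S + S %*% t(A) + GG) * dt     # Lyapunov ODE,      S(0) = 0
  }
  list(Phi = Phi, S = S)               # then Sigma_pred = Phi Sigma Phi' + S
}
```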

10.7 Fast sampling and continuous-time filtering

So far we have discussed the Kalman filter with measurements that are taken at discrete points of time. In the early days of Kalman filtering, the filter would often be implemented with analog electric circuits, so truly operating in continuous time. Nowadays, even if the filter runs on digital hardware and in discrete time, the sampling rate may be so fast that the filter effectively runs in continuous time. In other situations we have not yet determined the sampling time, but start by examining the continuous-time filter and later choose the sampling frequency to be "fast enough". For these reasons we investigate in this section the Kalman filter when the measurements are available in continuous time. A model of continuous-time measurements is

Z_t = CX_t + Dw_t

and we wish to estimate X_t based on observations of Z_s for s ≤ t. Here, w_t is a noise signal which we assume is white and of unit strength. To make this model fit into our general framework, we must replace it with its integrated version

dY_t = CX_t dt + D dW_t     (10.6)

where W_t is Brownian motion. Thus we aim to estimate X_t based on G_t, the σ-algebra generated by Y_s for s ≤ t. The easiest way to do this is to first consider a discretized version, obtained by sampling Y_t at regular intervals nh for n ∈ N, where h is the sampling time. This discretized problem we can solve with the material of the previous section. Then we let the sampling time h tend to 0. To this end, assume that we at time t = nh have access to the measurements Y_h, Y_{2h}, ..., Y_{nh}, and that based on this information, X_t has conditional mean µ_{nh|nh} and variance Σ_{nh|nh}. Performing the time update as in the previous section, we get

µ_{nh+h|nh} = µ_{nh|nh} + Aµ_{nh|nh} h + u_{nh} h + o(h)

and

Σ_{nh+h|nh} = Σ_{nh|nh} + (AΣ_{nh|nh} + Σ_{nh|nh} A′ + GG′) h + o(h)

We now turn to the data update. At time nh + h the new information is Y_{nh+h}, but since we already know Y_{nh}, we can equivalently say that the new information is

∆Y_{nh} = Y_{nh+h} − Y_{nh} = CX_{nh} h + D(W_{nh+h} − W_{nh}) + o(h)

This agrees with the form in the previous section, except that the covariance of the measurement error W_{nh+h} − W_{nh} is h·I rather than I, where I, as usual, is the identity matrix. Making the obvious rescaling, and neglecting higher order terms, the data update is given by a Kalman gain

K_{nh} = Σ_{nh+h|nh} C′ h (CΣ_{nh+h|nh} C′ h² + DD′ h)^{−1}

We now make the assumption that DD′ > 0. This means that all measurements are noisy, and it implies that the Kalman gain has a well-defined limit as h ↘ 0:

K_{nh} = Σ_{nh|nh} C′ (DD′)^{−1} + O(h)

With this Kalman gain, the data update for the mean becomes

µ_{nh+h|nh+h} = µ_{nh+h|nh} + K_{nh} (∆Y_{nh} − Cµ_{nh+h|nh} h + o(h))

and for the variance

Σ_{nh+h|nh+h} = Σ_{nh+h|nh} − K_{nh} C Σ_{nh+h|nh} h + o(h)

Combining with the time update, we get

µ_{nh+h|nh+h} = µ_{nh|nh} + Aµ_{nh|nh} h + u_{nh} h + K_{nh} (∆Y_{nh} − Cµ_{nh|nh} h + o(h))

and

Σ_{nh+h|nh+h} = Σ_{nh|nh} + (AΣ_{nh|nh} + Σ_{nh|nh} A′ + GG′ − K_{nh} C Σ_{nh|nh}) h + o(h)

Letting the time step h tend to zero, we can summarize the analysis:

Theorem 10.7.1. Consider the filtering problem consisting of the state equation (10.5) and the continuous-time observation equation (10.6) with DD′ > 0. In the limit h ↘ 0, the state estimate satisfies the Itô stochastic differential equation

dµ_{t|t} = (Aµ_{t|t} + u_t) dt + K_t (dY_t − Cµ_{t|t} dt)

Here, the Kalman gain is

K_t = Σ_{t|t} C′ (DD′)^{−1}

The variance of the estimation error satisfies the so-called Riccati equation, an ordinary matrix differential equation

(d/dt) Σ_{t|t} = AΣ_{t|t} + Σ_{t|t} A′ + GG′ − Σ_{t|t} C′ (DD′)^{−1} C Σ_{t|t}

10.7.1 The stationary filter

If the system parameters A, G, C, D are constant in time, then we can characterize the asymptotic behavior of the Kalman gain K_t, the variance Σ_{t|t}, and the estimation error X̃_t := X_t − µ_{t|t}. For simplicity, we assume that the pair (A, C) is observable:

Definition 10.7.1. Consider the matrix pair (C, A) where A ∈ R^{n×n} and C ∈ R^{m×n}. The pair is said to be observable, if the following two equivalent conditions hold:

1. All right eigenvectors v ≠ 0 of A satisfy Cv ≠ 0.

2. The so-called observability matrix

   [ C ; CA ; CA² ; ... ; CA^{n−1} ]

   is injective (i.e., has full column rank).

The observability condition says that all eigenmodes of the system are visible in the output Y_t; this means that in the absence of noise we would be able to reconstruct the state perfectly using continuous-time measurements Y_t over any interval t ∈ [0, T]. When noise is present, it is sufficient to prevent the variance from growing beyond bounds. In fact, under this condition, it can be shown that the estimation variance Σ_{t|t} = E X̃_t X̃_t′ converges to an asymptotic value Σ, which solves the algebraic Riccati equation (ARE)

AΣ + ΣA′ + GG′ − ΣC′(DD′)^{−1}CΣ = 0

Next, we further assume that the pair (A, G) is controllable:

Definition 10.7.2. Consider the matrix pair (A, G) where A ∈ R^{n×n} and G ∈ R^{n×k}. The pair is said to be controllable, if the following three equivalent conditions hold:

1. All left eigenvectors p ≠ 0 of A satisfy pG ≠ 0.

2. The so-called controllability matrix [G, AG, A²G, ..., A^{n−1}G] is surjective (i.e., has full row rank).

3. The pair (G′, A′) is observable.

This condition says that all dynamics of the system are excited by the noise B_t. Since also all measurements are noisy (DD′ > 0), the variance Σ_{t|t} of the estimation error will be positive definite for any t > 0, and also the limiting variance Σ = lim_{t→∞} Σ_{t|t} is positive definite.

Theorem 10.7.2. Consider the filtering problem (10.5), (10.6) with DD′ > 0, (C, A) observable and (A, G) controllable. Then the steady-state variance Σ of the estimation error exists and is positive definite. Moreover, it is the maximal solution to the algebraic Riccati equation in P

AP + PA′ + GG′ − PC′(DD′)^{−1}CP = 0

i.e., any P that satisfies this equation has P ≤ Σ. Finally, Σ is the unique stabilizing solution, i.e. the only solution P to this equation such that A − PC′(DD′)^{−1}C is asymptotically stable.

The estimation error X̃_t = X_t − µ_{t|t} satisfies the stochastic differential equation

dX̃_t = (A − KC)X̃_t dt + G dB_t − KD dW_t

and, in particular, has a stationary distribution with mean 0 and variance Σ. Note that the dynamics of the estimation error, i.e. its autocorrelation function and its decorrelation time, depend on A − KC. In particular, the eigenvalues of this matrix contain useful information.

10.8 Estimating states and parameters as a mixed-effect model

An entirely different approach to state estimation is to formulate the model as a general statistical model where the unobserved states {X_{ti}} are considered random effects, and to use general numerical methods for inference in this model. One benefit of this approach is that it becomes a minor extension to estimate states and system parameters in one sweep. This approach has become feasible in recent years thanks to the availability of powerful software for such models. Here, we use the R package Template Model Builder (TMB) by (Kristensen et al., 2016). Let us illustrate this approach with the same example of state estimation in the Cox-Ingersoll-Ross process (10.2), now written as

dX_t = λ(ξ − X_t) dt + γ√X_t dB_t

Let θ = (λ, ξ, γ) be the vector of system parameters; note that we consider the sample volume v known. Let φ(x_{t0}, x_{t1}, ..., x_T, y_1, ..., y_N; θ) denote the joint probability density of all states and observations, for a given set of system parameters θ. This can be written as

φ = ∏_{i=1}^{N} p(t_{i−1} → t_i, x_{ti−1} → x_{ti}) · l_i(x_{ti})

where we have omitted the arguments of φ. The likelihood of the system parameters θ, for a given set of observations y_1, ..., y_N, is then obtained by integrating out the unobserved states {X_{ti}}, using the law of total probability:

L(θ) = ∫_X ··· ∫_X φ(x_{t0}, x_{t1}, ..., x_T, y_1, ..., y_N; θ) dx_{t0} ··· dx_{tT}

and the maximum likelihood estimate of the system parameters is

θ̂ = arg max_θ L(θ).

The approach we pursue is to find this estimate with numerical optimization. To this end, we also need a numerical method for evaluating the likelihood function, in particular for integrating out the unobserved states. The method we employ is the Laplace approximation, which approximates the integrand with a multivariate Gaussian. With this approximation, the integral is approximated with an expression that involves the mode

x̂(θ) = arg max_{x_{t0},x_{t1},...,x_T} φ(x_{t0}, x_{t1}, ..., x_T, y_1, ..., y_N; θ)

as well as the curvature of the integrand at that mode. This means that the computationally intractable task of integration in a very high-dimensional space is simplified to maximization and computation of derivatives. Once the maximum likelihood estimate θ̂ has been found, our state estimate is x̂(θ̂). These steps have all been implemented in TMB, so our only task is to specify the joint density of states and observations. Table 10.1 shows the results of the parameter estimation. Note that the parameter estimates are quite uncertain, but that the true values are within confidence limits, with the possible exception of γ. Figure 10.8 shows the estimated states from TMB and, for comparison, the estimates from the HMM method employed in the previous sections. Note that TMB makes use of a Gaussian approximation of the conditional distribution of the states, so there is no distinction between expectation, mode and median in TMB. This is in contrast to the HMM method, where we find the full posterior distribution and can distinguish between these statistics. Note that the two approaches, TMB and HMM, give fairly consistent results. One difference between the two is that the HMM estimates appear smoother and in particular do not have the "dips" in periods with zero observations that TMB does. The plausible explanation for this difference is that the state estimates from TMB are based on estimated system parameters, as opposed to the HMM method, where we have used the true system parameters. In particular, TMB estimates the rate parameter λ to 1.65, which implies that TMB believes system dynamics to be faster than HMM does.

Parameter   Estimate ± s.d.   True value
λ           1.7 ± 1.7         1.0
ξ           0.7 ± 1.7         1.0
γ           2.1 ± 1.0         1.0

Table 10.1. Parameter estimation in the Cox-Ingersoll-Ross model. For each parameter, the "true" value used in the simulation, the maximum likelihood estimate, and the standard deviation on that estimate as derived from Fisher information.

Figure 10.8. Comparison of state estimates with Template Model Builder (thick solid line) and the Hidden Markov Model method. For the HMM method, we show the mean (solid), mode (dashed), and median (dotted) in the smoothed distribution.


10.9 Chapter conclusion

An important area of application of diffusions and stochastic differential equations is statistical analysis of time series. For these problems, stochastic differential equations offer a framework in which mechanistic understanding of system dynamics can be combined with rigorous treatment of data. We have emphasized the "recursive filtering" approach to this analysis. Discretizing the diffusion process to a discrete Markov model, corresponding to numerical solution of Kolmogorov's equations, means that we approximate the filtering problem for the diffusion process with a Hidden Markov Model. The resulting numerical algorithms are very direct translations of the mathematical formulas. This "brute force" approach is instructive, but only feasible for low-dimensional problems: One dimension is straightforward, two dimensions require a bit of work, three dimensions are challenging both for the modeller and for the CPU, and higher dimensions are prohibitively laborious. Problems with very high dimensions are only feasible in the linear paradigm; the Kalman filter can be implemented routinely with hundreds of states and, with some attention to implementation, even much higher numbers of dimensions. However, the Kalman filter is restricted to linear problems, or to problems that can be approximated with linear ones as in the extended Kalman filter or similar. These filtering approaches to estimation in diffusion models have been well established since the 1960s. The alternative approach, that of treating the problem as a general mixed-effect model, is much more recent, and it is not yet entirely clear which algorithms are applicable or most efficient in which situations. Markov chain methods apply in principle to any problem, but they may be very tricky to get to work in practice. Numerical optimization of likelihood functions, using the Laplace approximation, is powerful but limited to situations where the posterior distributions are well approximated with Gaussian ones. This could imply that this approach is useful for the same problems as extended Kalman filters, in which case the choice between the two methods is primarily a matter of the time it takes for the modeler to implement the model and for the computer to conduct the computations. Estimation in diffusion processes is a large area, and there are several established textbooks on the matter, and a steadily increasing number of journal papers both on methods and applications. This chapter serves only as a first introduction to the topic, but has the ambition that the reader will be able to see the connection between different methods that at first glance may appear very different.


CHAPTER 11

Expectations to the future

For a diffusion X_t, we have seen that the forward Kolmogorov equation governs the distribution of the state, and that the backward Kolmogorov equation governs the likelihood of future observations. In this chapter, we take a closer look at the backward Kolmogorov equation. Loosely, the differential operator which appears in the backward Kolmogorov equation, and in Itô's formula, is the "generator" of the diffusion. It returns the expected rate of change of the function it is applied to, and it governs expectations to the future in general. Such expectations appear in a number of applications; in this chapter we focus on when and where the process is expected to reach a boundary. Later we shall see that the generator is also pivotal in stability analysis, optimal stopping, and optimal control. Throughout this chapter, X_t is a time-homogeneous Itô diffusion in R^n given by

dX_t = f(X_t) dt + g(X_t) dB_t

where f and g are globally Lipschitz continuous, which guarantees existence and uniqueness of a solution to this stochastic differential equation for a given initial condition (theorem 8.2.1).

11.1 The generator and expected rate of change

We now consider a function h(x) defined on state space, which is independent of time, and ask the question how the expectation

Eh(X_t) evolves as a function of time. The answer to this question is provided by Itô's lemma: If h(x) is sufficiently smooth and Y_t = h(X_t), then Y_t is an Itô process given by

dY_t = (∂h/∂x) dX_t + (1/2) tr[(∂²h/∂x²) dX_t dX_t′]
     = [(∂h/∂x) f + (1/2) tr((∂²h/∂x²) gg′)] dt + (∂h/∂x) g dB_t

where f and g are evaluated at Xt and h is evaluated at Xt . Now, we make use of the “backward” differential operator L, which we introduced in Kolmogorov’s backward equation, and which is given by

(Lh)(x) = (∂h/∂x) f + (1/2) tr((∂²h/∂x²) gg′)

whenever h ∈ C². Here, all functions on the right hand side are evaluated at x. Now, we can write Itô's lemma as

dY_t = Lh dt + (∂h/∂x) g dB_t

which, of course, is a shorthand for the integral form

Y_T = Y_0 + ∫₀ᵀ (Lh)(X_t) dt + ∫₀ᵀ (∂h/∂x)(X_t) g(X_t) dB_t

Recall that the Itô integral is a martingale, and therefore has expectation 0, provided that the integrand (∂h/∂x) g (is adapted, continuous and) has locally integrable variance. A sufficient condition for this is that h is C₀²(Rⁿ), i.e. twice continuously differentiable and with bounded support. We make this assumption, noting that it is far from necessary. In that case, we have

E^x ∫₀ᵀ (∂h/∂x) g dB_t = 0

and thus

E^x Y_T = Y_0 + E^x ∫₀ᵀ (Lh)(X_t) dt     (11.1)

If T is small, then the integrand is well approximated by its value at time 0, since h is smooth and sample paths are continuous. From this we find

E^x Y_T = Y_0 + (Lh)(x) T + o(T)

and so

(Lh)(x) = lim_{T↘0} (1/T) E^x (h(X_T) − h(x))     (11.2)

We take the right hand side as the definition of the generator of the diffusion:

Definition 11.1.1. The generator A of a time-homogeneous diffusion X_t is the operator given by

(Ah)(x) = lim_{T↘0} (1/T) E^x (h(X_T) − h(x))

whenever the limit exists. We let D_A denote the set of functions h for which the limit is defined, for all x.

Combining with the previous, we get the following:

Theorem 11.1.1. Let h ∈ C₀²; then h ∈ D_A and Ah = Lh.

We see that for a given function h, we may compute the initial rate of change of Eh(X_t) when starting in x as (Ah)(x), i.e. the generator applied to h, evaluated at x. When h ∈ C₀², this rate of change has two origins: The gradient of h in concert with the drift f, which results in the term ∇h · f, and the curvature of h in concert with the diffusion D, which gives rise to the term tr D ∂²h/∂x².

Example 11.1.1 (Mean square distance to the origin for Brownian motion). Let X_t be Brownian motion in Rⁿ, i.e. we take f = 0 and g = I, an n-by-n identity matrix. Then the backward operator is

(Lh)(x) = (1/2) tr(H(h)(x)) = (1/2) ∇²h(x)

where H(h)(x) is the Hessian matrix of h at x, i.e. an n-by-n matrix with entries

H_{ij}(x) = (∂²h/∂x_i ∂x_j)(x).

Now, consider the function h : Rⁿ → R given by h(x) = |x|². This function does not have compact support, but it turns out that the contributions from the tails are finite, so it is in the domain of the generator and Ah = Lh. We find

(Lh)(x) = n

so E^x |X_t|² = |x|² + nt. This agrees, of course, with B_t ∼ N(x, I·t).
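A quick Monte Carlo check of this example in R (simulating the normal increments directly, purely for illustration):

```r
## Check of Example 11.1.1: for Brownian motion in R^n started at x,
## E|X_t|^2 should equal |x|^2 + n t.
set.seed(1)
n <- 3; t <- 2; x <- c(1, 0, -1)
M <- 1e5                                       # number of realizations
X <- matrix(rnorm(M * n, sd = sqrt(t)), M, n)  # Brownian increments, one row per realization
X <- X + rep(x, each = M)                      # shift by the initial condition x
mean(rowSums(X^2))                             # simulated E|X_t|^2, close to ...
sum(x^2) + n * t                               # ... the theoretical value |x|^2 + n t
```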

11.2 The Dynkin formula

The result in the previous section can be generalized to the situation where the terminal time T is replaced by a stopping time τ.

Theorem 11.2.1 (Dynkin). Let h ∈ C₀²(Rⁿ) and let τ be a stopping time such that E^x τ < ∞. Then

E^x h(X_τ) = h(x) + E^x ∫₀^τ (Lh)(X_s) ds

This formula can be seen as a stochastic version of the fundamental theorem of calculus, which states that if F is absolutely continuous with derivative f, then F(b) = F(a) + ∫_a^b f(x) dx for a < b. In the following sections, we look at applications of this result.

Proof. Define Y_t = h(X_t) − ∫₀ᵗ (Lh)(X_s) ds; then Y_t is an Itô process which satisfies

dY_t = Lh(X_t) dt − Lh(X_t) dt + (∂h/∂x) g dB_t = (∂h/∂x) g dB_t

In particular, Y_t is an Itô integral and therefore a martingale (theorem 6.5.2 on page 140). It follows that E^x Y_τ = Y_0 = h(x) (lemma 4.5.1 on page 89), which can be restated as in the theorem.

11.3 Expected point of exit

We consider the diffusion X_t evolving on a domain Ω ⊂ R^d, and stop the process when it hits the boundary ∂Ω. We ask the question: Depending on the initial condition, where on the boundary do we expect the process to hit? To make the question precise, we define a function c(x) on the boundary, and aim to evaluate E^x c(X_τ) where

τ = inf{t : X_t ∉ Ω}

For simplicity, we assume throughout that the diffusivity is non-singular everywhere, D(x) > 0 for all x ∈ Ω̄, and that the domain Ω is bounded. This also implies that the time of exit τ has finite expectation, E^x τ < ∞ for all x ∈ Ω. Now, assume that we have found a function h on Ω̄ such that

Lh(x) = 0 for x ∈ Ω, and h(x) = c(x) for x ∈ ∂Ω

Then, by the Dynkin formula

h(x) = E^x h(X_τ)

Conversely, E^x c(X_τ) must be smooth and so satisfy the equation Lh = 0. In summary, E^x c(X_τ) is uniquely characterized as the solution to the Dirichlet problem Lh = 0, h|_{∂Ω} = c.

11.3.1 Does a scalar diffusion exit right or left?

Consider, as an example, Brownian motion with drift, i.e. dXt = u dt + σ dBt where Xt ∈ R and u and σ are positive constants. Let the domain be (0, l) and let τ be the time of exit of the domain, i.e. τ = inf{t : Xt 6∈ (0, l)}. We aim to determine the probability that the process exits to the left, i.e. the probability that Xτ = 0, as a function of the initial condition X0 = x. Define c(0) = 1 and c(l) = 0, then we can characterize this probability as h(x) = Ex c(Xτ ) = Px {Xτ = 0} To determine the function h, we make use of the Dynkin formula, which states that h is governed by the boundary value problem

u (dh/dx) + D (d²h/dx²) = 0 for x ∈ (0, l),   h(0) = 1,   h(l) = 0

where the diffusivity is D = σ²/2. It is a standard exercise to solve this second order linear ordinary differential equation; the solution is

h(x) = P^x{X_τ = 0} = [exp(−ux/D) − exp(−ul/D)] / [1 − exp(−ul/D)]

We can once again introduce the Péclet number Pe = ul/D; then the solution can be written as

h(x) = [exp(−(x/l) Pe) − exp(−Pe)] / [1 − exp(−Pe)]

Note the exponentially decaying term exp(−(x/l) Pe). For large Péclet numbers, i.e. when D/u ≪ l, a verbal characterization of the solution is a diffusive boundary layer around x = 0 in which there is a significant probability of exiting to the left. This boundary layer occupies a fraction 1/Pe of the domain, i.e. it has the width l/Pe = D/u. Outside the boundary layer, the process is nearly certain to exit to the right.

Exercise 11.91: Show that when the Péclet number goes to 0 from above, the probability of exit to the left converges to 1 − x/l. (Hint: Approximate the exponentials with the leading terms in their Taylor expansions.) Next, show that this limit satisfies the governing equation for pure diffusion, i.e. when u = 0.

Figure 11.1. The probability of exit to the right for Brownian motion with drift, u = D = 1.

We now aim to generalize this example to any scalar diffusion process X_t. Assume that the process starts inside an interval [0, l] such that g does not vanish in this interval. What is the probability that the process exits to the right? More precisely, define

τ = inf{t : X_t ∉ (0, l)}

What is then h(x) = P^x(X_τ = l)? We now know that h is governed by the boundary value problem

Lh(x) = h′f + (1/2) g² h″ = 0 on (0, l),   h(0) = 0,   h(l) = 1.

We have previously, in section 7.5.2, studied this equation, and found that the full solution could be written h(x) = c₁ s(x) + c₂ where c₁ and c₂ are arbitrary real constants. Here, s is a scale function

s(x) = ∫^x φ(y) dy

where

φ(x) = exp( ∫^x −2f(y)/g²(y) dy ).

Notice that we have not specified the lower limit of the integration; we may choose this arbitrarily. We fix the coefficients c₁ and c₂ through the boundary conditions h(0) = 0, h(l) = 1:

c₁ s(0) + c₂ = 0,   c₁ s(l) + c₂ = 1

and find

c₁ = 1/(s(l) − s(0)),   c₂ = −s(0)/(s(l) − s(0))

Inserting into the general solution, we find

h(x) = (s(x) − s(0)) / (s(l) − s(0))

Exercise 11.92: Consider an unbiased random walk on the interval [0, l] with spatially varying diffusivity, i.e. f = 0 while g > 0 is not constant. Sketch the function h(x) = P^x(X_τ = l).

Exercise 11.93: Consider pure diffusion on the interval [0, l] with a diffusivity D(x) = g²(x)/2 which increases with x. Consider h(l/2), the probability of exit to the right, given that the process starts in the center. Is this probability greater or smaller than 1/2?
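When f and g are such that the scale function cannot be found in closed form, the exit probability can be evaluated numerically. Below is a hedged R sketch that approximates the integrals defining φ and s by simple left-endpoint Riemann sums on a grid; the example parameters reproduce the Brownian-motion-with-drift case of figure 11.1 (u = D = 1 on (0, 10)).

```r
## Numerical evaluation of h(x) = (s(x) - s(0)) / (s(l) - s(0)) for a scalar
## diffusion with drift f and noise intensity g (sketch).
exit_right_prob <- function(f, g, l, nx = 1000) {
  x   <- seq(0, l, length.out = nx)
  dx  <- x[2] - x[1]
  I   <- cumsum(c(0, (-2 * f(x) / g(x)^2)[-nx])) * dx   # inner integral of -2 f / g^2
  phi <- exp(I)                                         # scale density phi(x)
  s   <- cumsum(c(0, phi[-nx])) * dx                    # scale function s(x), with s(0) = 0
  data.frame(x = x, h = (s - s[1]) / (s[nx] - s[1]))
}

## Example: Brownian motion with drift, u = D = 1 on (0, 10), as in figure 11.1
res <- exit_right_prob(f = function(x) rep(1, length(x)),
                       g = function(x) rep(sqrt(2), length(x)), l = 10)
head(res)
```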

11.4 Analysis of a singular boundary point

We continue the analysis of whether a diffusion process exits right or left. Consider the process {X_t} given by the Itô SDE

dX_t = µ dt + σ√X_t dB_t     (11.3)

on the interval [a, b] with 0 < a < b. Note that since a > 0, theorem 8.2.1 guarantees existence and uniqueness of a solution up to exit from the interval (a, b), when the initial condition X_0 = x is in that interval. However, we are interested in the limit when a → 0. The point x = 0 is singular in the sense that g(0) = 0, g′(0) = ∞, so we are not guaranteed existence and uniqueness when a = 0, and it is yet unclear what happens in the limit a → 0. The scale function is

s(x) = ∫^x exp( −∫^y 2µ/(σ²z) dz ) dy
     = ∫^x exp( −(2µ/σ²) log y ) dy
     = ∫^x y^{−2µ/σ²} dy
     = (1/ν) x^ν

with ν = 1 − 2µ/σ². So the probability of exit at b before at a is

h(x) = P(X_τ = b) = (x^ν − a^ν) / (b^ν − a^ν)

when starting at X_0 = x.

Figure 11.2. The probability of exit to the right of (a, 1) for the model (11.3) for b = 1 and different values of a (a = 10⁻⁸, 0.001, 0.01, 0.1). Left panel: With µ = 1, σ² = 1, the probability approaches 1 as a → 0. Right panel: With µ = 0.25, σ² = 1, the probability of exit to the right approaches √x as a → 0.

We can now ask: What happens if a → 0? Figure 11.2 shows the function h(x; a) = P(X_τ = 1) for b = 1, different values of a, and for two sets of system parameters. In the left panel, where µ = 1, σ² = 1, we see that P(X_τ = 1) → 1 as a → 0. From the expression for h, we see that this will be the case when 1 − 2µ/σ² < 0, i.e. when µ > σ²/2. The interpretation is that when the drift away from the singular point x = 0 is stronger than the noise, the singular point will never be reached; when a is very close to the singular point, we are almost certain to exit at 1 rather than at a. In contrast, in the right panel we have µ = 1/4, σ² = 1, and thus µ < σ²/2. We see that P(X_τ = 1) → √x when a → 0. With a slight jump to the conclusion, this implies that with a = 0, there is a probability of √x of exiting at the right end 1, and a probability of 1 − √x of exiting to the left, at the singular point. The interpretation is now that the drift is relatively weak, and random fluctuations due to the noise may cause the state to actually hit the singular point at the origin. This example was tractable because the drift and noise terms are sufficiently simple that the scale function can be given in closed form, and even with a simple expression. But we see that the conclusion follows from a local analysis of the scale function s near a singular point x₀. Specifically, if the scale function diverges as x → x₀, then the singular point is repelling in the sense that as a → x₀, the probability of reaching a vanishes. Conversely, if the scale function converges as x → x₀, then the singular point x₀ is attainable in the sense that there is a positive probability of sample paths which converge to x₀ in finite or infinite time. See (Gard, 1988) for precise statements and elaboration. Moreover, the behaviour of the scale function near the singular point depends only on the local behaviour of f and g near x₀. Let us illustrate with a couple of examples.

Example 11.4.1 (The Cox-Ingersoll-Ross process). Consider again the Cox-Ingersoll-Ross process given by

dX_t = λ(ξ − X_t) dt + γ√X_t dB_t

where the origin x = 0 is a singular point. Near x = 0, the behaviour of the scale function is identical to that of equation (11.3) with µ = λξ, σ = γ. We see that the origin x = 0 is attainable if λξ < γ²/2.

Example 11.4.2 (Geometric Brownian motion). Consider geometric Brownian motion governed by the Itô equation

dX_t = rX_t dt + σX_t dB_t

Here, the scale function agrees with that of equation (11.3) with µ = r, σ = σ. So we find the same conclusion: The origin is attainable if r < σ²/2. The two models, geometric Brownian motion and equation (11.3), have the same scale function because the two models are time changes of each other. Following section 7.7, let {X_t} be geometric Brownian motion and introduce scaled time U_t with dU_t = X_t dt. Then, in scaled time, the process {Y_u} is given by

dY_u = r du + σ√Y_u dW_u

i.e. equation (11.3). Note that as the process approaches the origin, the rescaled time {U_t} slows down. In fact, when the drift is weak, r < σ²/2, geometric Brownian motion converges to 0 as t → ∞ (compare exercise 7.70), whereas the process given by equation (11.3) may converge to 0 in finite time.

11.5 Recurrence of Brownian motion

The analysis of exit points is very tractable in the scalar case, but here is a simple example which is also tractable in higher dimensions. Let Xt be a diffusion in Rd and let A ⊂ Rd . Let τ be the time of first entry into A, i.e. τ = inf{t : Xt ∈ A}. Here we use the convention that the infimum over an empty set is ∞, i.e. τ = ∞ if the process never enters A. We then say that A is recurrent if Px {τ < ∞} = 1 for any x ∈ Rd ; i.e., regardless of the initial condition, we are certain to enter A at some point. Otherwise A is transient. The process itself is said to be recurrent, if any set A with non-empty interior is recurrent. We now ask if Brownian motion in Rd is recurrent. To this end, we first investigate if the sphere {x : |x| ≤ r} is recurrent for given r > 0. To this end, stop the process when it either hits the “inner” sphere {x : |x| ≤ r} or leaves the “outer” sphere {x : |x| ≤ R} for given R; later we let R → ∞. Thus, define the following exit times:

τ_r = inf{t : |X_t| ≤ r},   τ_R = inf{t : |X_t| ≥ R},   τ = min{τ_r, τ_R}.

Define the probability of hitting the inner sphere first:

h(x) = P^x{|X_τ| = r}

i.e., for a given initial condition X_0 = x, h(x) is the probability that the process hits the inner sphere with radius r before it hits the outer sphere with radius R. To determine this function h, we use the Dynkin formula, which states that h satisfies ∇²h = 0 along with the boundary conditions h(x) = 1 on the inner sphere {x : |x| = r} and h(x) = 0 on the outer sphere {x : |x| = R}. Due to spherical symmetry, h can be a function of |x| only, i.e. there is some function φ so that h(x) = φ(|x|). Writing the Laplacian in spherical coordinates, this function φ must satisfy

φ″(ρ) + ((d−1)/ρ) φ′(ρ) = 0 for ρ ∈ (r, R)

along with boundary conditions φ(r) = 1, φ(R) = 0. Linearly independent solutions to this equation are φ = 1 and φ = ρ^{2−d} when d ≠ 2; in two dimensions, d = 2, the latter is replaced by log ρ. The solution to the boundary value problem is then obtained as a linear combination of these solutions, determining the coefficients so as to satisfy the boundary conditions. In two dimensions, d = 2, we find

φ(ρ) = [log(R/r) − log(ρ/r)] / log(R/r)

while in all other dimensions, d ≠ 2, we find

φ(ρ) = (R^{2−d} − ρ^{2−d}) / (R^{2−d} − r^{2−d})

These expressions hold when we start between the inner and outer sphere, i.e. for ρ such that r < ρ < R. We can now, for fixed ρ = |x| and r, let R → ∞. For d = 1 we get

φ(ρ) = (R − ρ)/(R − r) = (1 − ρ/R)/(1 − r/R) → 1

This means that as the outer sphere goes to infinity, we are certain to hit the inner sphere first. Hence, we are certain to hit the inner sphere at some point. We conclude that in one dimension, the interval {x : |x| ≤ r} is recurrent. Now, to see that Brownian motion in itself is recurrent, let A be any set with non-empty interior. Then A contains an interval. By invariance under translations, this interval is recurrent; thus also A is recurrent. Since A was arbitrary, we conclude that the process itself, Brownian motion in one dimension, is recurrent. For d = 2 we get

φ(ρ) = [log(R/r) − log(ρ/r)] / log(R/r) = 1 − log(ρ/r)/log(R/r) → 1

and so we reach the same conclusion, i.e. Brownian motion in two dimensions is recurrent. For d ≥ 3 we get

φ(ρ) = (R^{2−d} − ρ^{2−d}) / (R^{2−d} − r^{2−d}) → (r/ρ)^{d−2}

and so, in three dimensions or more, there is a non-zero probability of never hitting the inner sphere, provided that we start outside it. Hence the sphere is transient and thus Brownian motion is itself transient, in three dimensions or higher.

11.6 The expected time to exit

We continue the analysis of the diffusion Xt in the previous section, and now turn to the question of when the process exits the domain. To this end, let h be a smooth function on Ω such that

Lh + 1 = 0,   h|_{∂Ω} = 0     (11.4)

Then by the Dynkin formula,

0 = E^x h(X_τ) = h(x) − E^x τ

and so h(x) = E^x τ. In turn, E^x τ must be a smooth function of x and therefore satisfy (11.4), which expresses that time passes by with (expected) rate 1.

Exercise 11.94 (Exit time of diffusion with drift on the line): Consider again the stochastic differential equation dX_t = u dt + σ dB_t on the domain (0, l), where l, u and σ are positive constants. For given initial condition X_0 = x, what is the expected time to exit of the domain?

Figure 11.3. The expected time to exit for Brownian motion with drift, u = D = 1.

11.6.1 Exit time in a sphere, and the diffusive time scale

When does pure diffusion in R^d exit a sphere? Let X_t = σB_t where B_t is Brownian motion in R^d, and let τ be the time of exit of the sphere with center at the origin and with radius R, i.e. τ = inf{t : |X_t| ≥ R}. As before, let h be the expected time of exit as a function of the initial condition, i.e. h(x) = E^x τ. By spherical symmetry, h can depend only on |x|. We write h(x) = k(|x|) and find that the governing equation D∇²h(x) + 1 = 0, where D = σ²/2, can be written as

k″(ρ) + ((d−1)/ρ) k′(ρ) + 1/D = 0

and, with the boundary conditions k(R) = 0, k′(0) = 0, we find

k(ρ) = (R² − ρ²) / (2dD)

Thus, in particular, E0 τ = R2 /2dD. This is the “diffusive time scale” corresponding to the length scale R, i.e. the expected time it takes to move a distance R from the starting point.

11.7 Expectations of general rewards

We can generalize the two previous sections as follows: We still consider the diffusion X_t in the domain Ω and stop it when it reaches the boundary ∂Ω. We integrate a "running reward" γ along the trajectory, and at the time of exit we receive an additional "terminal reward" κ. The total reward is then

J = ∫₀^τ γ(X_t) dt + κ(X_τ)

where τ = inf{t : X_t ∉ Ω}. Then h(x) = E^x J is characterized as the unique solution to

Lh + γ = 0 for x ∈ Ω,   h(x) = κ(x) for x ∈ ∂Ω

So we may evaluate such expected rewards by solving a partial differential equation. This may be used in analysis of controlled systems, where the objective of the analysis is to examine how accurately the system follows a prescribed trajectory and reaches its intended target state. It may also be used for financial applications, where the running reward and terminal reward are payments associated with a financial derivative, and the objective of the analysis is to determine the expected payments, which in turn determine the fair price of the derivative.
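In one dimension, the boundary value problem can be solved with a few lines of code. The following is a hedged R sketch using central finite differences for the equation f h′ + D h″ + γ = 0 with Dirichlet boundary values; with γ = 1 and κ = 0 it returns the expected time to exit, which can be compared with the analytical result of exercise 11.94.

```r
## Solve f h' + D h'' + gamma = 0 on (a, b) with h(a) = kappa_a, h(b) = kappa_b
## by central finite differences on a regular grid (1D sketch).
solve_reward <- function(f, D, gamma, kappa_a, kappa_b, a, b, nx = 400) {
  x  <- seq(a, b, length.out = nx)
  dx <- x[2] - x[1]
  xi <- x[2:(nx - 1)]                        # interior grid points
  n  <- length(xi)
  lower <- -f(xi) / (2 * dx) + D(xi) / dx^2  # coefficient of h_{j-1}
  diag0 <- -2 * D(xi) / dx^2                 # coefficient of h_j
  upper <-  f(xi) / (2 * dx) + D(xi) / dx^2  # coefficient of h_{j+1}
  A <- diag(diag0)
  A[cbind(2:n, 1:(n - 1))] <- lower[-1]
  A[cbind(1:(n - 1), 2:n)] <- upper[-n]
  rhs <- -gamma(xi)
  rhs[1] <- rhs[1] - lower[1] * kappa_a      # move known boundary values to the RHS
  rhs[n] <- rhs[n] - upper[n] * kappa_b
  h <- solve(A, rhs)
  data.frame(x = x, h = c(kappa_a, h, kappa_b))
}

## Expected exit time for Brownian motion with drift, u = D = 1 on (0, 10)
res <- solve_reward(f = function(x) rep(1, length(x)), D = function(x) rep(1, length(x)),
                    gamma = function(x) rep(1, length(x)),
                    kappa_a = 0, kappa_b = 0, a = 0, b = 10)
```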


CHAPTER 12

Numerical simulation of sample paths

For stochastic differential equations, we most often cannot write up the solutions explicitly, neither in terms of the paths nor in terms of the probability distributions. This should come as no surprise: It also holds for (deterministic) ordinary and partial differential equations. Therefore, numerical simulation of sample paths plays an important role in any practical application involving diffusion and stochastic differential equations. In fact, even when we can write up the solution explicitly, we may learn much about the system by inspecting sample paths, and the easiest way to generate these sample paths is often through numerical simulation. Numerical simulation involves discretization of space and/or time. Here, we focus on discrete-time continuous-space approximations of sample paths; in most applications this is the most useful approach. Recall that we discussed discretization of space in chapter 9, when studying transition probabilities. For ordinary differential equations, strong standard libraries exist, involving e.g. high-order Runge-Kutta schemes, implicit schemes, and adaptive step control. The situation is less favorable for stochastic differential equations: Although some libraries have appeared (the sde package for R (Iacus, 2008), SDELab for Matlab (Gilsing and Shardlow, 2007)), most practitioners still write their own code. One reason for this is that practical applications often involve aspects such as reflection or stopping, which are rarely addressed by public libraries. In this chapter we present a few basic algorithms and notions for simulation of sample paths. There is an extensive and technical literature on the topic. A good place to start is (Kloeden and Platen, 1995).

12.1 The Euler scheme

We consider the Itô stochastic differential equation

dX_t = f(X_t) dt + g(X_t) dB_t

with initial condition X_0 = x. Throughout this chapter we assume the setting of chapter 8, i.e. we are guaranteed existence and uniqueness of solutions. We have already encountered the stochastic Euler scheme with a time step h:

X^{(h)}_{t+h} = X^{(h)}_t + f(X^{(h)}_t) h + g(X^{(h)}_t)(B_{t+h} − B_t)

which is used to find X^{(h)}_h, X^{(h)}_{2h}, ..., which in turn is an approximation of X_h, X_{2h}, .... It is possible to use a general time discretization 0 = t_0 < t_1 < ... < t_n = T rather than the equidistant one. Step size control is possible and recommended (Lamba, 2003), although it is not used as widely as in the deterministic case. The Euler scheme is a straightforward generalization of the Euler scheme for ordinary differential equations. This generalization is due to (Maruyama, 1955), and hence the scheme is also known as the Euler-Maruyama scheme (Kloeden and Platen, 1995). The motivation for the scheme is the integral form of the stochastic differential equation: We approximate the integrand with a constant between time t and t + h:

Z

t+h

Xt+h = Xt +

Z

t+h

f (Xs ) ds + t

Z ≈ Xt +

g(Xs ) dBs

(12.1)

g(Xt ) dBs

(12.2)

t t+h

Z f (Xt ) ds +

t

t+h

t

In the following we will analyse the properties of the Euler-Maruyama scheme.

12.2

The strong order: The Euler method for geometric Brownian motion

The performance of a numerical scheme is commonly done in terms of the order of the scheme, which characterizes the rate with which the discretization error converges to 0 as the time step tends to 0. This, of course, assumes that the scheme is consistent, i.e. the discretized solution does indeed converge to the true solution. For stochastic processes, things are complicated by the fact that there exist different modes of convergence of random variables, as we have seen in chapter 4.7.1. Here, we focus on convergence in the mean, which leads to the so-called strong order. 252

To define the strong order of any numerical method for discretizing a stochastic differential equation, consider the initial condition X0 = x fixed, and assume that we wish to simulate (h) XT . Use a regular grid with size h and denote the resulting approximation XT . (h)

We are interested in the error E(h) = |XT − XT |, which is also a stochastic variable. We hope that when h → 0, E(h) converges to 0 in the mean. It is instructive to first consider a specific case, namely geometric Brownian motion dXt = rXt dt + σXt dBt with initial condition X0 = 1. The solution is, as we have seen before 1 Xt = exp((r − σ 2 )t + σBt ) 2 The Euler discretization is given by the recursion (h)

(h)

Xt+h = Xt

(h)

+ rXt

(h)

(h)

h + σXt (Bt+h − Bt ) = Xt (1 + rh + σ(Bt+h − Bt ))

so that the solution is

(h)

Xnh =

n−1 Y

(1 + rh + σ(B(i+1)h − Bih ))

(12.3)

i=0

Figure 12.1 shows one realization of geometric Brownian motion along with its Euler discretization, for parameters r = 0.5, σ = 0.5 and with a time step of h = 0.1. To estimate the error, we first simulate a large number (N = 104 ) of realizations of Brownian motion on a fine temporal grid (h = 2−12 ). For each realization, we compute the Euler approximation as well as the analytical solution. Comparing the two, we compute the root mean square error. Next, we sub-sample the Brownian motion on a coarser grid. This increases the time step with a factor 2. With this coarser grid, we repeat the computations of the Euler discretized solution and the error. We repeat subsampling and computation of error on ever coarser grids. The results are seen in figure 12.2, where we have plotted the root mean square error as function of the time step in a double logarithmic plot so that power relationships display as straight lines. The figure also include a line with a slope of 0.5, corresponding to a square root relationship between the time step and the error: E|E(h)| ∼

√ h

We see that the experimental results are in good agreement with a square root scaling. In fact, as we will see in the next section, this is the theoretical prediction. We say that the Euler scheme, in general, has strong order 0.5. 253

2.2

Analytical Euler



1.6

Solution X

1.8

2.0





● ●

1.4

● ● ●

1.2



1.0





0.0

0.2

0.4

0.6

0.8

1.0

Time

Figure 12.1. Geometric Brownian motion and its Euler discretization.

254


Figure 12.2. The strong error of the Euler scheme for geometric Brownian motion: Plotting the mean absolute error vs. the time step in a double logarithmic plot, a straight line with slope 0.5 indicates that the error scales with the square root of the time step. The plot is based on 10,000 realizations.

Compare this to the situation for ordinary differential equations, where the Euler scheme has order 1. For stochastic differential equations it may be shown that the dominating term in the error in general is the approximation

∫_t^{t+h} g(X_s) dB_s ≈ g(X_t)(B_{t+h} − B_t)

However, in the particular case where g is constant, this is an exact equality and not an approximation. As a result, it may be shown that when g is constant, the Euler scheme has strong order 1. Nevertheless, compared to the situation within other fields of numerical analysis, the strong order 0.5 of the Euler scheme is a sign of very slow convergence to the true solution, as the time step becomes smaller. For example, in ordinary differential equations we are used to Runge-Kutta schemes of order 4. In practice, this means that very small time steps may be needed in order to obtain a given level of accuracy. Despite this low performance, the Euler scheme has practical value, primarily because it is so simple to implement.

12.3 Strong order analysis of the Euler scheme

We return to the general SDE and its Euler discretization, simulating the SDE from time 0 to time T. At time t, we have the analytical solution X_t and the approximation X^{(h)}_t based on Euler discretization with a time step h. At time t + h, the true solution is

X_{t+h} = X_t + ∫_t^{t+h} f(X_s) ds + ∫_t^{t+h} g(X_s) dB_s

while the Euler approximation is

X^{(h)}_{t+h} = X^{(h)}_t + f(X^{(h)}_t) h + g(X^{(h)}_t) (B_{t+h} − B_t)

We can now write an equation governing the error X_{t+h} − X^{(h)}_{t+h}:

X_{t+h} − X^{(h)}_{t+h} = X_t − X^{(h)}_t + ∫_t^{t+h} [f(X_s) − f(X_t)] ds + ∫_t^{t+h} [g(X_s) − g(X_t)] dB_s

We see that even if there is no approximation error at time t, so that X_t − X^{(h)}_t = 0, a local error will be introduced so that X_{t+h} − X^{(h)}_{t+h} ≠ 0, in general. Next, as soon as a local error has been introduced so that X_t − X^{(h)}_t ≠ 0, this error will be amplified or attenuated as

time marches forward, depending on system dynamics, to yield the error at later times and ultimately at the final time T. For now, we focus on the local error, aiming to identify the leading terms in this error and their order. So assume that X^{(h)}_t = X_t. When s is near t, the difference between g(X_s) and g(X_t) can be approximated as

g(X_s) − g(X_t) = g′(X_t)∆X + O((∆X)²)

according to Itô's lemma. Here ∆X = X_s − X_t ≈ f(X_t)(s − t) + g(X_t)(B_s − B_t). The dominating term in ∆X is B_s − B_t; this has an r.m.s. value which scales as √(s − t). (On the short time scale, noise is dominating the signal, or diffusion is dominating advection.) Keeping just this term, the error introduced at this time step can be approximated as

∫_t^{t+h} g′(X_t)g(X_t)(B_s − B_t) dB_s     (12.4)

Since the terms involving g are constants, we recognize the integral of Brownian motion with respect to itself:

∫_t^{t+h} g′(X_t)g(X_t)(B_s − B_t) dB_s = (1/2) g′(X_t)g(X_t)((B_{t+h} − B_t)² − h)     (12.5)

This term has mean 0 and variance (1/2)(g′g)²h², and hence it is of first order in h, assuming the general case where neither g nor g′ vanish. I.e.,

E^{x,t} |X_{t+h} − X^{(h)}_{t+h}| = C(x) · h¹ + o(h¹)     (12.6)

The exponent of h, i.e. 1, is the strong order of the local error. In the case of additive noise where g 0 (x) = 0 for all x, one must turn attention to the next error term. It can be shown that this term is of second order in the time step.

Local errors versus global errors

So far we have analyzed the local error introduced at a given time step from t to t + h, but ultimately we want to quantify the global error, i.e. the error on X_T. We must therefore take into account how previous errors are kept and accumulated along the trajectory. As long as local errors are small, the global error can be approximated as a weighted sum of the local errors from each time step: Effectively, we may view the local errors as perturbations of the true solution; the weights depend on whether the system dynamics, linearized around the true solution, amplifies or attenuates the error from time t and until the final time T. The weights are therefore properties of the system, independent of the local errors themselves, and can therefore be considered constants in the error analysis; the precise value of these constants does not affect the order of the scheme. Moreover, when several local error terms are in play, we can determine the global error corresponding to each term, and simply add them up, thanks to linearity. Despite this, the amplification of the local error does of course influence the magnitude of the error associated with any finite time step. This amplification is related to the Lyapunov exponent of the true solution, which is bounded by the Lipschitz constant of the system dynamics f and g. These constants therefore determine the actual time step needed to obtain a given accuracy. Nevertheless, the focus in the analysis is on the order of the scheme. With this insight, we now return to the error term (12.5), which had mean 0 and variance ∼ h². The global error arising from this term is the weighted sum of T/h independent terms, and therefore has mean 0 and variance ∼ h. Hence, the mean absolute (global) error in the Euler scheme scales with √h. We conclude that the Euler scheme has strong order 1/2.

12.4 The weak order

In many situations, the individual sample path is not of interest, only statistics over many realizations. For example, we may use Monte Carlo to compute some "analytical" property of the model, such as the distribution of X_T. In this situation, the strong order analysis is an overly harsh measure of the performance of a scheme. In this section we consider an alternative, the weak order, which measures the speed of convergence in distribution as the time step tends to 0. To this end, it is practical to assume that the objective of the simulation study is to determine E^x k(X_T) for a given function k. This test function k could be a low-order polynomial, i.e. we are interested in the moments of X_T, or an indicator function of some set, i.e. we are interested in determining the probability of reaching a given region in state space. For now, we consider k a given but arbitrary polynomial. We say that the scheme has weak order p, if for each such k there exists a constant C > 0 such that

| E^x k(X^{(h)}_T) − E^x k(X_T) | ≤ C h^p

holds. Notice that the weak order is concerned only with the statistics of the solution; the actual sample paths may be different, and it is not required that X^{(h)}_T → X_T in any stronger sense than convergence in distribution. The weak order of a given scheme is clearly at least as great as the strong order. Exercise: Show this!
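For geometric Brownian motion and the test function k(x) = x, the exact value E X_T = exp(rT) (for X_0 = 1) is known, so the weak error is easy to estimate by Monte Carlo. A small R sketch (keeping in mind that the Monte Carlo noise itself must be small compared with the discretization error for the comparison to be meaningful):

```r
## Weak-error sketch for geometric Brownian motion with k(x) = x.
set.seed(1)
r <- 0.5; sigma <- 0.5; T <- 1; h <- 0.01; n <- T / h
M  <- 1e5                                   # number of realizations
xT <- rep(1, M)                             # all paths start at X_0 = 1
for (k in 1:n) {
  xT <- xT * (1 + r * h + sigma * rnorm(M, sd = sqrt(h)))   # vectorized Euler step
}
abs(mean(xT) - exp(r * T))                  # estimated weak error (plus Monte Carlo noise)
```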


Figure 12.3. The weak error of the Euler scheme for geometric Brownian motion: In a double logarithmic plot, a straight line with slope 1 indicates that the error scales with the time step. The plot is based on 100,000 realizations.
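The figure can be reproduced in spirit by a small Monte Carlo experiment. The following Python sketch (my own, not the code behind the figure) estimates the weak error for geometric Brownian motion with the test function k(x) = x, for which E^x k(XT) = x e^{rT} is known in closed form; note that the Monte Carlo error must be kept small compared to the bias, which is why many realizations are used.

import numpy as np

# Sketch: weak error of the Euler scheme for geometric Brownian motion, k(x) = x,
# i.e. the bias |E X_T^(h) - E X_T| with E X_T = X0*exp(r*T) known analytically.
r, s, X0, T = 0.5, 0.8, 1.0, 1.0
rng = np.random.default_rng(2)

def euler_weak_error(h, M=400000):
    X = np.full(M, X0)
    for _ in range(round(T / h)):
        X = X + r * X * h + s * X * rng.normal(0.0, np.sqrt(h), M)
    return abs(X.mean() - X0 * np.exp(r * T))

for h in [0.02, 0.01, 0.005]:
    print(h, euler_weak_error(h))   # should decrease roughly like h (weak order 1)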


12.5  The Milstein scheme

In the analysis of Itô's lemma, we derived the formula (dBt)² = dt. When we analyzed the Euler scheme, we found that the leading term in the local error was due to the fact that this formula only holds in the limit ∆t → 0. In other words, the local error (12.5) included the term ½ g g′ ((∆B)² − h), where g′ is the derivative dg/dx. The Milstein scheme modifies the Euler scheme by explicitly including this leading term in the update. The algorithm is, in one dimension:

Xt+h = Xt + f(Xt, t) h + g(Xt, t) ∆B + ½ g g′ ((∆B)² − h)

Note that the last term can be seen as a correction term, which has expectation zero and variance proportional to h², conditional on Xt. When the state Xt is multidimensional but the Brownian motion is scalar, the Milstein scheme is

Xt+h = Xt + f h + g ∆B + ½ (∇g) g ((∆B)² − h)

Here, ∇g is the Jacobian of g, i.e. a matrix with (i, j)-entry ∂gi/∂xj. All functions f, g and ∇g are evaluated at (Xt, t).

Example 12.5.1. We have previously studied Brownian motion on the unit circle from an analytic point of view. This process is useful to illustrate geometrically the principles in the different schemes, by investigating how the discretization lifts the process off the unit circle. It should be noted that it is quite easy to modify the schemes to make the process stay on the circle, for example simply by projecting the process onto the circle at each time step, so our use of this example is more for illustrative purposes than practical. One Itô stochastic differential equation which governs this process is

dXt = −½ Xt dt − Yt dBt ,   dYt = −½ Yt dt + Xt dBt

for which the solution corresponding to X0 = 1, Y0 = 0 is

Xt = cos Bt ,   Yt = sin Bt

Taylor expanding the solution around (Xt , Yt ) with respect to the Brownian motion Bt , we get



(Xt+h, Yt+h) = (Xt, Yt) + (−Yt, Xt)(Bt+h − Bt) + ½ (−Xt, −Yt)(Bt+h − Bt)² + O((Bt+h − Bt)³)    (12.7)



Figure 12.4. The Milstein scheme for Brownian motion on the unit circle. One step in the Milstein scheme is shown, starting at position (Xt , Yt ) = (1, 0). The increment ∆B in the Brownian motion is shown exceedingly large for clarity. Note that the Milstein scheme returns a point on the supporting parabola (dashed), rather than on the actual circle.

To construct the Milstein scheme for this process, we first identify ∇g:

∇g = [ 0  −1 ]
     [ 1   0 ]

and thus (∇g)g = (−x, −y). The Milstein scheme is then

(Xt+h, Yt+h) = (Xt, Yt) − ½ (Xt, Yt) h + (−Yt, Xt) ∆B − ½ (Xt, Yt)((∆B)² − h)
             = (Xt, Yt) + (−Yt, Xt) ∆B − ½ (Xt, Yt)(∆B)²

We see that this corresponds to neglecting the terms in the Taylor expansion (12.7) which are of third order or higher in ∆B. In turn, replacing the last term involving (∆B)² with its expectation h, we are back at the Euler scheme.
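As a rough numerical companion to this example (my own sketch, with illustrative step size and sample size), the Euler and Milstein updates can be compared by monitoring how far the discretized process drifts off the unit circle; the Milstein correction keeps the radius error at a higher order in the step size.

import numpy as np

# Sketch: Euler vs. Milstein for Brownian motion on the unit circle,
# dX = -X/2 dt - Y dB, dY = -Y/2 dt + X dB, started at (1, 0).
T, h = 1.0, 0.01
rng = np.random.default_rng(3)

def simulate(milstein):
    x, y = 1.0, 0.0
    for _ in range(round(T / h)):
        dB = rng.normal(0.0, np.sqrt(h))
        corr = (dB ** 2 - h) if milstein else 0.0     # Milstein correction term
        x, y = (x - 0.5 * x * h - y * dB - 0.5 * x * corr,
                y - 0.5 * y * h + x * dB - 0.5 * y * corr)
    return abs(np.hypot(x, y) - 1.0)                  # distance from the circle at T

print("Euler   :", np.mean([simulate(False) for _ in range(2000)]))
print("Milstein:", np.mean([simulate(True)  for _ in range(2000)]))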

12.6  The stochastic Heun method

The Heun method is a predictor-corrector scheme for ordinary differential equations. It may be generalized to stochastic differential equations, most conveniently in the Stratonovich case. So consider the Stratonovich equation

dXt = f(Xt, t) dt + g(Xt, t) ∘ dBt    (12.8)

First we form the predictor with an Euler step:

Zt+h = Xt + f(Xt, t) h + g(Xt, t)(Bt+h − Bt)

Next, we use this predictor to modify our estimates of the drift and diffusion terms:

f̄ = ½ (f(Xt, t) + f(Zt+h, t + h)) ,   ḡ = ½ (g(Xt, t) + g(Zt+h, t + h))

Then we use these modified estimates of drift and diffusion to make the final update:

Xt+h = Xt + f̄ h + ḡ (Bt+h − Bt)

The Heun method is consistent with the Stratonovich equation (12.8). The strong order of the scheme is 1 when the Brownian motion is scalar, or when the Brownian motion is multidimensional but the noise structure is commutative (Rümelin, 1982). In the deterministic case g = 0, the Heun method has second-order global error. This is an indication that the scheme is likely to perform well when the noise g is weak.

Example 12.6.1. We consider again Brownian motion on the unit circle, now described with the Stratonovich equation

d(Xt, Yt) = (−Yt, Xt) ∘ dBt

The predictor step returns the point

Zt+h = (Xt − Yt ∆B, Yt + Xt ∆B)

The diffusion term evaluated at Zt+h is

g(Zt+h) ∆B = (−Yt − Xt ∆B, Xt − Yt ∆B) ∆B


Figure 12.5. The Heun scheme for Brownian motion on the unit circle. One step in the Heun scheme is shown, starting at position (Xt, Yt) = (1, 0). The increment ∆B in the Brownian motion is shown exceedingly large for clarity. Note that the Heun scheme returns a point on the supporting parabola (dashed), rather than on the actual circle.


Taking the average, we obtain

(Xt+h, Yt+h) = (Xt − Yt ∆B − ½ Xt (∆B)², Yt + Xt ∆B − ½ Yt (∆B)²)



Note that this exactly agrees with the Milstein scheme for the corresponding Itô equation, which we examined in example 12.5.1. To apply the Heun method to an Itô equation, we must first transform the equation to an equivalent Stratonovich equation. It is possible to use other weightings of the old and the new time step, see (Iacus, 2008), but then we need derivatives. An advantage of the scheme is that it allows one to assess the local error at each time step by comparing the drift f and noise term g evaluated at (Xt, t) and at (Zt+h, t + h). This can be used for adaptive time step control.
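A minimal Python sketch of a single Heun step for a scalar Stratonovich equation follows (the helper name and the linear test equation are my own illustrative choices, not taken from the notes):

import numpy as np

# Sketch: one predictor-corrector (Heun) step for dX = f(X,t) dt + g(X,t) o dB.
def heun_step(f, g, x, t, h, dB):
    z = x + f(x, t) * h + g(x, t) * dB          # Euler predictor
    fbar = 0.5 * (f(x, t) + f(z, t + h))        # averaged drift
    gbar = 0.5 * (g(x, t) + g(z, t + h))        # averaged diffusion
    return x + fbar * h + gbar * dB             # corrected update

# Example: the Stratonovich test equation dX = -X dt + X o dB.
rng = np.random.default_rng(4)
x, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    x = heun_step(lambda x, t: -x, lambda x, t: x, x, t, h,
                  rng.normal(0.0, np.sqrt(h)))
    t += h
print(x)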

12.7  Stability and implicit schemes

Let us first recall the motivation for implicit schemes in the numerical solution of ordinary (deterministic) differential equations. Consider the first order deterministic equation Ẋt = −λXt where λ > 0; the solution is Xt = X0 exp(−λt). The explicit Euler scheme for this equation is Xt+h = (1 − λh)Xt, which has the solution Xnh = X0 (1 − λh)^n. In the previous sections we have focused on the order of the scheme, which concerns the error as the time step tends to zero. In contrast, we now ask how the error Xt^(h) − Xt depends on t for a given time step h. We see that when λh > 1, the numerical approximation Xt^(h) becomes an oscillatory function of t, in contrast to the actual solution which has constant sign. Furthermore, when λh > 2, the numerical approximation becomes unstable, so that |Xt^(h)| → ∞ as t → ∞, in contrast to the true solution Xt which tends to 0 as t → ∞. This also implies that the error between the true solution and the numerical approximation is unstable and diverges. The conclusion is that the Euler scheme should never be used with time steps larger than 2/λ; if the sign is important, then the bound on the time step is h < 1/λ. However, it is not always practical to use the small time steps that are dictated by the stability properties of the Euler scheme. Consider the two-dimensional system

Ẋt = [ −1   0 ] Xt
     [  0  −λ ]

where λ is large and positive. For stability, we must use a time step smaller than 2/λ. The slow dynamics has a time scale of 1, so we need λ/2 time steps to resolve the slow dynamics. If λ is on the order of, say, 10^6, the Euler method seems prohibitively inefficient. This system is the simplest example of a so-called stiff system, i.e. a system which contains both fast dynamics (numerically large eigenvalues) and slow dynamics (numerically small eigenvalues). Many real-world systems are stiff in this sense, and if one does not explicitly avoid stiffness in the modeling process, the models end up stiff as well. Stiff models also arise when discretizing partial differential equations with the method of lines, i.e. discretizing space but keeping time continuous. In contrast to the trivial two-dimensional example we have just presented, it is not always straightforward to separate the fast dynamics from the slow dynamics, so we need numerical methods which can handle stiff systems. The simplest approach to handle a stiff system is to use the implicit Euler method. For the general nonlinear system in R^n, Ẋt = f(Xt), we approximate the time derivative as (Xt − Xt−h)/h. We get (Xt − Xt−h)/h = f(Xt), i.e.

Xt − h f(Xt) = Xt−h

This equation must be solved for Xt at each time step, which is the downside of the scheme. On the other hand, it has nice stability properties. To see this, consider the linear case where f(x) = Ax. Then the implicit scheme reads (I − Ah)Xt = Xt−h or

Xt = (I − Ah)^{−1} Xt−h

The important point is not so much that we can invert the matrix I − Ah once and for all and thus reach an efficient implementation, but rather that we see that the discrete-time system is stable whenever A is, regardless of h > 0. Exercise: Verify this by finding the eigenvalues of (I − Ah)^{−1}, assuming you know the eigenvalues of A. Even if we cannot reach quite such a strong conclusion for the general nonlinear system, the implicit Euler method still has much more favorable stability properties than the explicit scheme. In the stochastic situation, the arguments for implicit schemes are the same, but the situation is complicated by the stochastic integral.

Problem 12.7.1. Consider an Itô equation

dXt = f(Xt) dt + g(Xt) dBt

Demonstrate that a naive implicit scheme

Xt − Xt−h = f(Xt) h + g(Xt)(Bt − Bt−h)

can be inconsistent, i.e. the discretized solution does not converge to the true solution as the time step h tends to 0.

So implicit schemes for stochastic differential equations, in most cases, only treat the deterministic part of the equation implicitly, to avoid problems with consistency - and, indeed, existence and uniqueness of the discrete time update. The simplest such scheme is

Xt^(h) = Xt−h^(h) + f(Xt^(h)) h + g(Xt−h^(h))(Bt − Bt−h)

A further discussion of implicit schemes can be found in (Kloeden and Platen, 1995; Higham, 2001).
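A minimal Python sketch of such a drift-implicit (semi-implicit) Euler scheme follows; the stiff model dX = −λX³ dt + σ dB, the parameter values, and the use of a few Newton iterations to solve the implicit equation in each step are my own illustrative assumptions.

import numpy as np

# Sketch: semi-implicit Euler for dX = -lam*X^3 dt + sig dB. Each step solves
# x = x_prev - h*lam*x^3 + sig*dB for x by Newton's method (the map is monotone).
lam, sig, h, T = 1e3, 0.5, 0.01, 1.0
rng = np.random.default_rng(5)

def step(x_prev, dB):
    rhs = x_prev + sig * dB
    x = x_prev                               # initial guess
    for _ in range(20):
        F = x + h * lam * x ** 3 - rhs       # residual of the implicit equation
        x = x - F / (1.0 + 3 * h * lam * x ** 2)
    return x

x = 1.0
for _ in range(round(T / h)):
    x = step(x, rng.normal(0.0, np.sqrt(h)))
print(x)   # stays bounded even though h*lam >> 2 would make the explicit scheme blow up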

12.8  Discussion

High-performance numerical algorithms for stochastic differential equations are not as readily available as is the case for deterministic ordinary differential equations. The algorithms are substantially more complicated than their deterministic counterparts, and standard libraries are not as developed. In addition, stochastic simulation experiments often include extra elements such as handling of stopping time events, which make it more complicated to design general-purpose simulation algorithms. Therefore, practitioners rely to a greater extent on their own low-level implementations. While the Euler method is often preferred due to its simplicity, its performance is not convincing, as indicated by the strong order 0.5. Even going to a Milstein scheme, or in the case of Stratonovich equations, the Heun method, can lead to substantial improvements. In many applications, the objective of the stochastic simulation is to investigate some statistical property of the solution. In this case, the weak order may be the most relevant. However, sensitivity studies are often most effectively performed on individual sample paths, to block the effect of the realization. In such situations, it is the strong order of the scheme which is more relevant. While the order of a scheme is an important characteristic, it should be kept in mind that our ultimate interest is to obtain the highest possible accuracy within limits on the effort or - conversely - to reach a specified accuracy with the least effort possible. This trade-off involves not just the order of the scheme, but also the constant multiplier in the error term, possible stability bounds, and the number of replicates used to reduce statistical uncertainty. A final factor is the complexity of the algorithm and the resulting coding effort; although this is more difficult to quantify, it often limits the space of solutions.

Additional topics, which we have not covered here, include step size control (Iacus, 2008) and variance reduction (Kloeden and Platen, 1995). Other useful references are (Higham, 2001) and (Burrage, Burrage, and Tian, 2004).


CHAPTER 13  Dynamic optimization

One area of application for stochastic differential equations is decision making: Consider a dynamic system where we have the possibility to affect the dynamics, but we do not have complete control; partly because some state variables are beyond our direct control, and partly because external disturbances also affect the system. We seek guidance on how we should affect the dynamics in order to achieve some stated objective. In this chapter we cast this problem as one of optimization, and provide the solution of that optimization problem. These problems appear in many different areas of applications: Control engineering aims to build feedback control systems in order to reduce harmful oscillations in wind turbine blades, have a fighter jet reach operational altitude and speed as quickly as possible, or operate a chemical plant with the highest possible output without jeopardizing the plant itself. In quantitative finance, an investor may aim to continuously manage his portfolio so as to maximize the expected income subject to bounds on acceptable risk. Every living organism, from bacteria to humans, has a behavior which has been selected through evolution; a useful model is that this behavior maximizes the ability of that organism to generate offspring, and this model allows us to understand the behavior we observe. In summary, the theory of dynamic optimization helps us to design new systems in the widest sense of the word, as well as to analyze the behavior of existing systems and decision makers. The theory we present in this chapter for dynamic optimization problems is based on the so-called value function, which depends on the state as well as on the time. This value function quantifies the expected future performance of the controlled process, under the assumption that the control is optimal. This value function generalizes the expectations to the future that we studied earlier in chapter 11, and it satisfies a “backward” partial differential equation known as the Hamilton-Jacobi-Bellman equation which generalizes the backward Kolmogorov equation of that chapter. In the theory of dynamic optimization for stochastic differential equations, a certain amount of technicalities become critical. These stem, most importantly, from questions such as whether the value function is smooth, whether an optimal decision exists or only near-optimal ones,


and if so, whether the optimal decisions depend smoothly on the state and on the time. These technicalities tend to overshadow the conceptual simplicity of the approach. To avoid this, we start the presentation by considering a discrete Markov Decision Problem where these technicalities do not appear. This should illuminate the structure of the solution. Thereafter, we apply this structure to our original problem, i.e. dynamic optimization for stochastic differential equations.

Figure 13.1. Graphical network illustrating the interdependence between the random variables U1, X1, U2, X2, . . . , Ut, Xt, Xt+1, . . . , UT−1, XT in the Markov Decision Problem.

13.1  Markov decision problems

In this section we consider a simpler version of the dynamic optimization problem: We take time to be discrete instead of continuous, and we take the state space to be discrete and finite instead of Euclidean. The resulting problem of dynamic optimization for a finite discrete-time Markov chain is commonly referred to as a Markov decision problem. This simplification allows us to focus on the core principle in the problem and its solution, rather than the technicalities. An added benefit is that the dynamic optimization problem for diffusion processes can be discretized to a Markov decision problem. Consider a stochastic process {Xt : t ∈ T} evolving on the state space X = {1, . . . , N} in discrete time, T = {1, 2, . . . , T}. At each time t we have to choose a decision variable Ut ∈ U where U is the set of permissible controls. This set U could be a finite set, or a closed interval of reals; we require that U is compact. At time t we must take this decision based on the current state, i.e. Ut must be Xt-measurable. This means that there must be a function µ : X × T → U such that Ut = µ(Xt, t). We refer to this µ as the control strategy. The initial state X1 has a prescribed distribution. Conditional on the current state and control, the next state Xt+1 is random with probabilities

P{Xt+1 = j | Xt = i, Ut = u} = P^u_{ij}

and independent of past states and controls, i.e. of Xs and Us for s = 1, . . . , t − 1. We assume that these transition probabilities P^u_{ij} are continuous functions of u ∈ U, for each i, j. The interaction between the random variables is illustrated in figure 13.1. Having described the state dynamics, we now turn to the objective of the decision-making. We assume that there is issued a reward at each time t ∈ T, which depends on the state Xt

and the control Ut at that time, as well as on time t itself. I.e., we assume a reward function

h : X × U × T → R

which is continuous in u ∈ U for each x ∈ X and each t ∈ T, such that the total reward issued is

∑_{t∈T} h(Xt, Ut, t)

We assume that the objective is to maximize the expected reward, i.e. we aim to identify

max_µ E^µ ∑_{t∈T} h(Xt, Ut, t)    (13.1)

as well as the maximizing argument µ. Here the superscript in Eµ indicates that the distribution of Xt and Ut depends on µ. The maximum is over all functions µ : X × T 7→ U. Notice that this is a compact set, and that the expected reward is a continuous function of µ, so the maximum is guaranteed to exist.

13.1.1  The backward iteration: Dynamic programming

It is perfectly true, as the philosophers say, that life must be understood backwards. But they forget the other proposition, that it must be lived forwards. - Søren Kierkegaard, Journals IV A 164 (1843). Often paraphrased as “Life can only be understood backwards, but it must be lived forwards.”

We can now state the celebrated dynamic programming solution to the Markov decision problem 13.1, due to Bellman. What makes the problem difficult is that we are maximizing over a space which could potentially be very large, viz. the space of all admissible controls µ : X × T 7→ U. The approach to this complex optimization problem is to iteratively break it into simpler sub-problems.


Biography:

Richard Ernest Bellman (1920-1984) was an American applied mathematician who was central to the rise of “modern control theory”, i.e. state space methods in systems and control. His most celebrated contribution is dynamic programming and the principle of optimality, which both concern dividing complex decision-making problems into more, but simpler, subproblems. A Ph.D. from Princeton, he spent the majority of his career at the RAND Corporation.

Theorem 13.1.1 (Dynamic programming). Let the value function V : X × T → R be given by the recursion

V(x, s) = max_{u∈U} E^{Xs=x, Us=u} [ h(x, u, s) + V(Xs+1, s + 1) ]    (13.2)

for s = 1, . . . , T − 1, and the terminal condition

V(x, T) = max_{u∈U} h(x, u, T).    (13.3)

Then

V(x, 1) = max_µ E^{µ, X1=x} ∑_{t∈T} h(Xt, Ut, t).

Moreover, the optimal value is obtained with any control strategy µ such that

µ(x, s) ∈ arg max_{u∈U} E^{Xs=x, Us=u} [ h(x, u, s) + V(Xs+1, s + 1) ].

Note that the dynamic programming equation (13.2) can be written more explicitly in terms of the transition probabilities as 



V(x, s) = max_{u∈U} [ h(x, u, s) + ∑_{y∈X} P^u_{xy} V(y, s + 1) ].    (13.4)
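Equation (13.4) translates directly into a short backward recursion. The following Python sketch is my own illustration (with the control set discretized to K alternatives and the arrays P and h supplied by the user); it is not part of the original notes.

import numpy as np

# Sketch: backward dynamic programming for a finite MDP, following (13.2)-(13.4).
# P[k] is the N x N transition matrix under the k-th control; h[x, k, t] is the reward.
def solve_mdp(P, h):
    K, N = P.shape[0], P.shape[1]
    T = h.shape[2]
    V = np.zeros((N, T))
    mu = np.zeros((N, T), dtype=int)              # index of an optimal control
    V[:, T - 1] = h[:, :, T - 1].max(axis=1)      # terminal condition (13.3)
    mu[:, T - 1] = h[:, :, T - 1].argmax(axis=1)
    for t in range(T - 2, -1, -1):
        # Q[x, k] = h(x, u_k, t) + sum_y P^{u_k}_{xy} V(y, t+1)
        Q = h[:, :, t] + np.stack([P[k] @ V[:, t + 1] for k in range(K)], axis=1)
        V[:, t] = Q.max(axis=1)
        mu[:, t] = Q.argmax(axis=1)
    return V, mu

In the foraging example below, the same recursion can instead use the analytical maximizer rather than a discretized control set.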

Proof. We claim that

V(x, s) = max_µ E^{µ, Xs=x} ∑_{t=s}^{T} h(Xt, Ut, t)

for all (x, s) ∈ X × {1, . . . , T}. Clearly it holds for s = T due to the terminal value (13.3). So assume that it holds on X × {s + 1}. We aim to show that it then also holds on X × {s}. Write

E^{µ, Xs=x} ∑_{t=s}^{T} h(Xt, Ut, t) = h(x, u, s) + E^{µ, Xs=x} ∑_{t=s+1}^{T} h(Xt, Ut, t)

where u = µ(x, s). Now, use the simple Tower property on the expectation, conditioning on Xs+1:

E^{µ, Xs=x} ∑_{t=s+1}^{T} h(Xt, Ut, t) = E^{Us=u, Xs=x} E^{µ_{s+1}^T} [ ∑_{t=s+1}^{T} h(Xt, Ut, t) | Xs+1 ]

Here, the conditional expectation is written E^{µ_{s+1}^T} to indicate that it only depends on the restriction of µ to X × {s + 1, . . . , T}, and that it is independent of the initial condition Xs = x, due to the Markov property and the conditioning on Xs+1. We then get

V(x, s) = max_{u, µ_{s+1}^T} ( h(x, u, s) + E^{Us=u, Xs=x} E^{µ_{s+1}^T} [ ∑_{t=s+1}^{T} h(Xt, Ut, t) | Xs+1 ] )
        = max_u ( h(x, u, s) + E^{Us=u, Xs=x} max_{µ_{s+1}^T} E^{µ_{s+1}^T} [ ∑_{t=s+1}^{T} h(Xt, Ut, t) | Xs+1 ] )
        = max_u ( h(x, u, s) + E^{Us=u, Xs=x} V(Xs+1, s + 1) )

which was to be shown. The theorem follows.

The result shows that we can solve the Markov Decision Problem by solving the Dynamic Programming equation (13.2). During this process we identify both the optimal control strategy µ, and the value. The Dynamic Programming equation is a recursion, progressing backward in time, starting from the terminal condition (13.3). Let us illustrate the result with an example.

13.1.2  Optimal foraging and Gilliam's rule

Consider an animal who can be in one of two states, x = 1 meaning that the animal is dead while x = 2 means that it is still alive. At each time step, the animal can choose to forage with an effort u ∈ [0, 1]. For a given foraging effort u, the animal will harvest the amount of energy ρ(u) = u/(u0 + u) during the time step, assuming that it is alive at the start of the time step. Notice that this function is increasing in u, but decelerating (concave), i.e. the returns are diminishing, and the parameter u0 determines (roughly) when the returns start to diminish. The animal will survive to the next time step with probability π(u) = p · (1 − u) where p ∈ (0, 1) is a constant parameter describing the maximum attainable survival probability. The harder this animal feeds, the smaller is its probability of surviving. It will die at time T, no matter what. It aims to maximize the total amount of energy harvested over its entire lifetime. To translate this verbal description to a specific model, we see that we have X = {1, 2}, U = [0, 1], T = {1, . . . , T} and a harvest function

h(x, u, t) = ρ(u) · 1(x = 2) = u/(u0 + u) · 1(x = 2)

while the transition probabilities are

P(u) = [ 1          0     ]   =   [ 1              0           ]
       [ 1 − π(u)   π(u)  ]       [ 1 − p·(1 − u)  p·(1 − u)   ]



Inserting into the dynamic programming equation (13.2), or equivalently into (13.4), we find

V(1, s) = 0,   V(2, s) = sup_u [ρ(u) + π(u) · V(2, s + 1)]    (13.5)

This value function V(2, s) denotes the amount of energy that an animal which is alive at time s can expect to harvest over its remaining life. For the particular functional forms of ρ(u) and π(u), we can find the maximum analytically. Define temporarily v = V(2, s). The function to be maximized is concave for u > −u0, so we first ignore the constraint u ∈ [0, 1] and identify the unique stationary point larger than −u0, i.e.

Arg sup_{u>−u0} [ρ(u) + π(u) · v] = √(u0/(p·v)) − u0

Exercise: Check that the function we maximize is concave, that this is indeed the unique stationary point, and that it solves the maximization problem. Next, we identify the maximum over the interval U = [0, 1] by projecting this stationary point onto that interval, i.e.

u*(v) = Arg sup_{u∈U} [ρ(u) + π(u) · v] = max{ 0, min{ 1, √(u0/(p·v)) − u0 } }

At this point we can solve the backward dynamic programming equation recursively for s = T − 1, T − 2, . . . , 1:

V(2, s) = ρ(Us) + π(Us) · V(2, s + 1)   where Us = u*(V(2, s + 1))

with the terminal condition UT = 1, V(2, T) = ρ(1). The solution is shown in figure 13.2. Notice that under the optimal strategy, the effort Ut is fairly constant until time t approaches the terminal time T, at which point the effort increases. Correspondingly, the value V(2, t) of a live individual is fairly constant until time t approaches the terminal time, at which point the value decreases. For this problem, and for many other problems of dynamic optimization, it is possible to characterize the steady-state behavior more simply and explicitly. By this we mean the strategy at a fixed time t, say t = 1, letting the horizon T tend to infinity. We expect that when we let T → ∞, V(2, t) will approach a stationary state where V(2, t) = V(2, t + 1). In that case, the dynamic programming equation becomes

V(2, t) = sup_u [ρ(u) + π(u) · V(2, t)]

or

sup_u [ρ(u) − µ(u) · V(2, t)] = 0

where µ(u) = 1 − π(u) is the mortality, i.e. the probability of dying during the next time step. Since µ(u) ≥ 0 and the maximum equals 0, we can divide by µ(u) to obtain

V(2, t) = sup_u [ ρ(u)/µ(u) ]

I.e., the animal should maximize harvest rate over mortality. This principle is known as Gilliam's rule (Gilliam and Fraser, 1987), see also (Sainmont et al., 2015). It can also be found with elementary arguments: If the horizon is infinite and the foraging effort is constant, then the lifetime is geometrically distributed with parameter µ(u), so has expectation 1/µ(u). The expected harvest over the remainder of the lifetime of the animal is therefore ρ(u)/µ(u), which is what the animal should maximize. It can be interpreted geometrically as in figure 13.3: The animal aims to maximize harvest and minimize mortality, and the optimal trade-off between these two opposing objectives is in this case to maximize their ratio, which corresponds graphically to finding a line through the origin which is a tangent to the curve {(µ(u), ρ(u)) : u ≥ 0}. With this understanding of the steady-state problem, we can re-examine the solution of the dynamic optimization problem in figure 13.2: When the horizon is far out in the distance, the animal should effectively behave as if it was optimizing for steady state. As the initial time s approaches the terminal time T, the optimal behavior begins to be affected by the finite horizon and thus deviates from what is optimal under steady state. Here, the behavior becomes increasingly risk-willing. The reason for this is that the animal's time is about to run out; having a shorter time to harvest in, its expected future harvest decreases below what would be obtained in steady state. As a result, the value of being alive decreases, and in the trade-off (13.5) between harvest and survival, the animal becomes more focused on harvest and less focused on survival.
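The recursion and Gilliam's steady-state rule are easy to compare numerically. The following Python sketch is my own illustration; the parameter values u0 = 0.2, p = 0.9 and T = 20 are assumptions and not necessarily those behind figure 13.2.

import numpy as np

# Sketch: backward recursion for the foraging problem, and Gilliam's rule for comparison.
u0, p, T = 0.2, 0.9, 20
rho = lambda u: u / (u0 + u)            # energy harvested per time step
pi  = lambda u: p * (1.0 - u)           # survival probability
u_star = lambda v: np.clip(np.sqrt(u0 / (p * v)) - u0, 0.0, 1.0)

V = rho(1.0)                            # terminal value V(2, T), with U_T = 1
for s in range(T - 1, 0, -1):
    u = u_star(V)
    V = rho(u) + pi(u) * V              # V(2, s)
print("V(2,1) for horizon T =", T, ":", V)

# Gilliam's rule: in steady state the animal maximizes rho(u)/mu(u), mu(u) = 1 - pi(u).
grid = np.linspace(0.0, 1.0, 10001)
print("sup_u rho/mu =", np.max(rho(grid) / (1.0 - pi(grid))))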


Figure 13.2. Dynamic programming solution of the optimal harvest problem (13.5). Upper panel: The optimal foraging effort as a function of time. Lower panel: The resulting value as a function of time.



Figure 13.3. Graphical solution of the steady-state version of the optimal harvest problem (13.5): The harvest ρ(u) plotted against the mortality µ(u), parametrized by the foraging effort u. The steady-state optimum is found by maximizing harvest over mortality, which corresponds to identifying a line through the origin which is tangent to the curve. Included is the steady-state optimal harvest and mortality (marked with +) as well as transients of these two quantities computed from the optimal harvest in figure 13.2 (open circles).


13.2  Controlled diffusions and performance objectives

At this point we return to diffusion processes, aiming to pose a dynamic optimization problem similar to the Markov decision problem, and to determine the solution using dynamic programming. We consider a controlled diffusion {Xt : t ≥ 0} taking values in X = R^n given by

dXt = f(Xt, Ut) dt + g(Xt, Ut) dBt

with initial condition X0 = x. Here {Bt : t ≥ 0} is standard Brownian motion, as always, and {Ut : t ≥ 0} is the control signal, which we consider a decision variable: At each time t we must choose Ut from some specified set U of permissible controls. At this point we restrict attention to state feedback controls given by

Ut = µ(Xt , t) for some function µ : X × T 7→ U such that the closed-loop system

dXt = f (Xt , µ(Xt , t)) dt + g(Xt , µ(Xt , t)) dBt

(13.6)

satisfies the conditions in chapter 8 for existence and uniqueness of a solution {Xt}. This solution will then be a Markov process, so another name for state feedback controls Ut = µ(Xt, t) is Markov controls. The controlled system terminates when the state exits a set G ⊂ X, or when the time reaches T, i.e. we have a stopping time

τ = min{ T, inf{t ∈ [0, T] : Xt ∉ G} }

At this point a reward is issued,

k(Xτ, τ) + ∫_0^τ h(Xt, Ut, t) dt

The first term, k(Xτ , τ ), is called a terminal reward and depends on the terminal time and state, while the integrand in the second integral term is called a running reward and is being accumulated along the trajectory until termination. For a given Markov control strategy {Ut = µ(Xt , t) : t ∈ T}, and a given initial condition Xs = x, we can assess the performance objective given by

J(x, µ, s) = E^{µ, Xs=x} [ k(Xτ, τ) + ∫_s^τ h(Xt, Ut, t) dt ]

The control problem we consider is to determine the control signal {Ut}, or equivalently the control strategy µ, which maximizes J(x, µ, s), if such an optimal control signal exists. Notice that our original interest is the case where the initial time is zero, s = 0, but that we include the initial time as a parameter in order to prepare for a dynamic programming solution. Instrumental in the solution is the generator of the controlled diffusion. For a fixed control u ∈ U, define the generator L^u as

(L^u V)(x) = ∂V/∂x (x) · f(x, u) + ½ tr[ g^T(x, u) ∂²V/∂x² (x) g(x, u) ]

while for a control strategy µ : X → U, define the “closed-loop” generator L^µ as

(L^µ V)(x) = ∂V/∂x (x) · f(x, µ(x)) + ½ tr[ g^T(x, µ(x)) ∂²V/∂x² (x) g(x, µ(x)) ].

13.3  Verification and the Hamilton-Jacobi-Bellman equation

Let us jump to the first conclusion and state the verification theorem which allows us to conclude that we have solved the optimization problem, provided that we have identified a solution to the dynamic programming equation.

Theorem 13.3.1. Let V : G × T → R be C^{2,1} and satisfy the Hamilton-Jacobi-Bellman equation

∂V/∂t + sup_{u∈U} [L^u V + h] = 0    (13.7)

on the interior G° × T°, along with

V = k on (∂G × T°) ∪ (G° × {T})

Let µ* : G × T → U be such that

sup_{u∈U} [L^u V + h] = L^{µ*} V + h

on G° × T°, and assume that with this µ*, the closed-loop system (13.6) satisfies the conditions in theorem 8.2.1. Then, for all x ∈ G and all s ∈ T,

V(x, s) = sup_µ J(x, µ, s) = J(x, µ*, s)

This theorem provides a strategy for solving stochastic dynamic optimization problems: First, try to solve the Hamilton-Jacobi-Bellman equation, and in doing so, identify the optimal control strategy µ*. If this succeeds, then the theorem states that we have solved the optimization problem. Because the solution involves a terminal value problem governing the value function,

Maximization vs. minimization. Optimization problems can come as maximization problems, i.e. max_x f(x) or sup_x f(x), or as minimization problems, i.e. min_x f(x) or inf_x f(x). We only need to develop theory for one of the two: Assume that we have a theory, or a software routine, for the maximization problem max_x f(x) but want to minimize the function g(x); then we use that

min_x g(x) = − max_x (−g(x)).

An important issue in optimization is convexity. We say that a minimization problem is convex if we aim to minimize a convex function defined on a convex set. The corresponding key property for a maximization problem is that we aim to maximize a concave function over a convex set.

just as was the case for the Markov Decision Problem of section 13.1, we refer to the solution provided by theorem 13.3.1 as a dynamic programming solution. We will prove the theorem in section 13.5, but let us first look at a classical situation where this strategy works.

13.4  Multivariate linear-quadratic control

An important special case of control problems is the linear-quadratic regulator (LQR) problem, where system dynamics are linear

dXt = AXt dt + F Ut dt + G dBt

(13.8)

while the control objective is to minimize the quadratic functional

J(x, U) = E^x [ ∫_0^T ( ½ Xt^T Q Xt + ½ Ut^T R Ut ) dt + ½ XT^T P XT ]    (13.9)

Here, Xt ∈ R^n, Ut ∈ R^m, while {Bt : t ≥ 0} is l-dimensional standard Brownian motion. The terminal time T > 0 is fixed. The matrix dimensions are A ∈ R^{n×n}, F ∈ R^{n×m}, G ∈ R^{n×l}, Q ∈ R^{n×n}, P ∈ R^{n×n}, R ∈ R^{m×m}. We assume Q ≥ 0, R > 0, P ≥ 0. We guess that the value function is quadratic in the state:

V(x, t) = ½ x^T Wt x + wt

where Wt ∈ R^{n×n}, Wt ≥ 0, while wt is scalar. The HJB equation then reads

½ x^T Ẇt x + ẇt + inf_u [ x^T Wt (Ax + Fu) + ½ tr(G^T Wt G) + ½ x^T Q x + ½ u^T R u ] = 0

Differentiating the bracket w.r.t. u, we see that the minimizing u is given by x^T Wt F + u^T R = 0, i.e., the optimal control signal is

u* = −R^{−1} F^T Wt x

where we have used that R = R^T and Wt = Wt^T. This optimal control u* depends linearly on the state x, with a gain −R^{−1} F^T Wt which depends on the value function, i.e. on Wt. Inserting in the HJB equation and collecting terms, we get

½ x^T [ Ẇt + 2Wt(A − F R^{−1} F^T Wt) + Q + Wt F R^{−1} F^T Wt ] x + ẇt + ½ tr(G^T Wt G) = 0

while the terminal condition is

½ x^T P x = ½ x^T WT x + wT.

Now notice that x^T 2Wt A x = x^T (Wt A + A^T Wt) x, where the matrix in the bracket is symmetric. Then, recall that if two quadratic forms x^T S1 x and x^T S2 x agree for all x, and S1 and S2 are symmetric, then S1 = S2. We see that the HJB equation is satisfied for all x, t iff the following two hold:

1. The matrix function {Wt : 0 ≤ t ≤ T} satisfies

Ẇt + Wt A + A^T Wt − Wt F R^{−1} F^T Wt + Q = 0    (13.10)

along with the terminal condition WT = P. This matrix differential equation is termed the Riccati equation.

2. The off-set {wt : 0 ≤ t ≤ T} satisfies the scalar ordinary differential equation

ẇt + ½ tr(G^T Wt G) = 0

along with the terminal condition wT = 0.

We summarize the result:

Theorem 13.4.1 (LQR control). The LQR problem of minimizing the quadratic cost (13.9) w.r.t. the control strategy {Ut : 0 ≤ t ≤ T}, subject to the system dynamics (13.8), is solved by the linear static state feedback control

Ut = µ(Xt, t) = −R^{−1} F^T Wt Xt

where {Wt : 0 ≤ t ≤ T} is governed by the Riccati equation (13.10), with terminal condition WT = P. The associated cost is Φ(x) = ½ x^T W0 x + w0, where {wt : 0 ≤ t ≤ T} is found by

wt = ∫_t^T ½ tr(G^T Ws G) ds

Notice that the Riccati equation (13.10) does not involve the noise intensity G. So the optimal control strategy is independent of the noise intensity, but the noise determines the optimal cost through wt. The advantage of the linear quadratic framework is that the problem reduces to the Riccati matrix differential equation (13.10). So instead of a partial differential equation on R^n × [0, T], we face n(n + 1)/2 scalar ordinary differential equations; here we have used the symmetry of Wt and solve only for, say, the upper triangular part of Wt. This reduction from a PDE to a set of ODEs allows one to include a large number of states in the problem: While solving the Hamilton-Jacobi-Bellman partial differential equation (13.7) for a general nonlinear problem becomes numerically challenging even in three dimensions, an LQR problem can have hundreds of states.
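As a minimal sketch of this reduction (the matrices and the horizon below are arbitrary illustrative choices), the Riccati equation can be integrated backward in time with a simple explicit step, yielding the time-varying feedback gain and the cost offset:

import numpy as np

# Sketch: backward integration of the Riccati equation (13.10) and the LQR gain.
A = np.array([[0.0, 1.0], [-1.0, -0.2]])
F = np.array([[0.0], [1.0]])
G = np.array([[0.0], [0.5]])
Q = np.eye(2); R = np.array([[0.1]]); P = np.zeros((2, 2))
T, dt = 10.0, 1e-3
Rinv = np.linalg.inv(R)

W, w = P.copy(), 0.0
for _ in range(round(T / dt)):                  # march from t = T down to t = 0
    dW = W @ A + A.T @ W - W @ F @ Rinv @ F.T @ W + Q
    W = W + dt * dW                             # since dW/dt = -(W A + A'W - W F R^-1 F'W + Q)
    w = w + dt * 0.5 * np.trace(G.T @ W @ G)    # accumulates w_0 = int_0^T (1/2) tr(G'W G) dt
K0 = Rinv @ F.T @ W                             # feedback gain at time 0: U = -K0 X
print(K0, w)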

13.5  Performance evaluation and the proof of the verification theorem

We now state the proof of the verification theorem 13.3.1. Let µ : X × T → U be a given control strategy such that the closed loop dynamics (13.6) satisfy the usual assumptions. Let V : X × T → R be C^{2,1} and satisfy

∂V/∂s + L^µ V + h(x, µ(x, s), s) = 0 for (x, s) ∈ G × T°

along with the boundary condition V = k on (∂G × T°) ∪ (G° × {T}). Then Dynkin's lemma, theorem 11.2.1, states that this V governs the expected pay-off

V(x, s) = E^{µ, Xs=x} [ ∫_s^τ h(Xt, Ut, t) dt + k(Xτ, τ) ]

In particular, let µ be the control µ* in the verification theorem 13.3.1 and let V be the value function. Then V is the expected pay-off with the control µ*. Now, let µ1 be any other control strategy such that the closed loop system satisfies the usual assumptions. Then the HJB equation ensures that

∂V/∂s + L^{µ1} V + h(x, µ1(x, s), s) ≤ 0

By Dynkin's lemma, it follows that

E^{µ1, Xs=x} [ ∫_s^τ h(Xt, Ut, t) dt + k(Xτ, τ) ] ≤ V(x, s)

It follows that this control strategy µ1 results in an expected pay-off which is no higher than what is obtained with the strategy µ*. We conclude that the strategy µ* is optimal. This concludes the proof.

13.6  Steady-state control problems

In many practical applications, the control horizon T is exceedingly large compared to the time constants of the controlled system. In that case, we may pursue the limit T → ∞. This situation is called an “infinite horizon control problem”. If also the system dynamics f, g and the running pay-off h are independent of time t, they often give rise to steady-state control strategies in which the optimal control strategy µ does not depend on time. This situation can come in two flavors: Transient control problems and stationary control problems. In a transient control problem, the control mission ends when the state exits the region G of the state space, and in the infinite horizon case we assume that this takes place before time runs out at t = T. Examples of applications include an autopilot that lands an aircraft: The mission ends when the aircraft is at standstill on the runway, safe and sound, and not at a specified time. In this situation, we seek a value function V : X → R which satisfies

sup_{u∈U} [L^u V + h] = 0 for x ∈ G°,   V = k on ∂G.

Example 13.6.1 (Swim left or right?). Consider the controlled diffusion on X = R

dXt = Ut dt + σ dBt

with G = [−1, 1], and the performance criterion

∫_0^τ (1 + Ut²) dt

where τ = inf{t : Xt ∉ G} is the time of exit. This criterion should be minimized in expectation. We can imagine a swimmer in a river, who is taken randomly left or right by currents, modeled by the noise term dBt. The swimmer should decide in which direction to swim and how strongly; he wants to make it to the bank ∂G = {−1, 1} as quickly as possible but at the same time minimizing the swimming effort. The running cost therefore reflects a trade-off between time (the constant term 1) and effort (the term u²). The Hamilton-Jacobi-Bellman equation is

inf_u [ u ∂V/∂x + ½ σ² ∂²V/∂x² + 1 + u² ] = 0

We see that the optimal control strategy is u = µ*(x) = −½ ∂V/∂x. Inserting this in the HJB equation, we find

½ σ² ∂²V/∂x² − ¼ (∂V/∂x)² + 1 = 0 for x ∈ (−1, 1)

along with boundary conditions V (−1) = V (1) = 0. The solution of this equation can be found using a symbolic computational tool such as Maple:

V(x) = 2σ² log( cosh(1/σ²) / cosh(x/σ²) )
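In the spirit of the symbolic verification used in exercise 2.2, the following small Python/sympy sketch (my own, with s2 standing for σ²) can be used to confirm that this V satisfies the reduced HJB equation and the boundary conditions:

import sympy as sp

# Sketch: check that V(x) = 2*s2*log(cosh(1/s2)/cosh(x/s2)) solves
# (1/2)*s2*V'' - (1/4)*(V')^2 + 1 = 0 with V(-1) = V(1) = 0.
x = sp.symbols('x')
s2 = sp.symbols('s2', positive=True)
V = 2 * s2 * sp.log(sp.cosh(1 / s2) / sp.cosh(x / s2))
ode = sp.Rational(1, 2) * s2 * sp.diff(V, x, 2) - sp.Rational(1, 4) * sp.diff(V, x) ** 2 + 1
print(sp.simplify(ode))                                        # prints 0
print(sp.simplify(V.subs(x, 1)), sp.simplify(V.subs(x, -1)))   # prints 0 0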

We plot the solution for σ² = 1/2 in figure 13.4. The optimal control can be described as follows: In the middle of the river, the swimmer should stay calm and choose u = 0, waiting to see if the random currents take him left or right. He should then start to swim towards the nearest bank, with a determination that grows as he approaches that bank and is confident that random currents will not take him back to the middle of the river again. Close to the bank, the noise becomes irrelevant; the optimal swimming speed asymptotes to 1, which would also be the optimal speed if there were no noise.

In the stationary control problem, on the other hand, the closed loop system (13.6) with the time-invariant control strategy Ut = µ(Xt) admits a stationary solution {Xt : t ≥ 0}, and the assumption is that the process mixes sufficiently fast compared to the terminal time T, so that the optimization problem concerns this stationary process. In this case, the boundary ∂G should never be reached, and the terminal pay-off k becomes irrelevant. This corresponds to finding solutions to the HJB equation (13.7) of the form

V(x, t) = V0(x) − γ · t

where γ ∈ R is the expected running payoff of the stationary process, while the off-set V0(x) indicates whether a state x is more or less favorable than average. Inserting this particular form into the HJB equation (13.7), we see that this off-set V0 : X → R must satisfy

sup_{u∈U} [L^u V0 + h] = γ.    (13.11)

Figure 13.4. Value function (left) and optimal strategy (right) for the “swim left or right” example 13.6.1.

Here, we stress that we seek a matching pair of V0 and γ. In general, there can be many such pairs, and we should seek the maximal γ. A minor detail is that this equation can at most specify V0 up to an additive constant; we typically address this by requiring that V0 has maximum 0. In this case, there will be an “optimal state” x such that V0(x) = 0; this could for example correspond to an optimal operating point, and other states will be measured on how far they are from this optimum in the sense of less pay-off. In the case of dynamic minimization problems, the corresponding normalization is that V0 has minimum 0.

Example 13.6.2 (Stationary scalar LQR control). Consider the controlled scalar diffusion {Xt : t ≥ 0} given by the linear SDE

dXt = (aXt + f Ut) dt + g dBt

with f, g ≠ 0, where the performance objective is to minimize

∫_0^T ( ½ q Xt² + ½ Ut² ) dt

with q > 0. We pursue the limit T → ∞. We guess a solution V0(x) = ½ S x² to the stationary HJB equation (13.11), which then reads

min_u [ x S (ax + f u) + ½ g² S + ½ q x² + ½ u² ] = γ

Minimizing w.r.t. u, inserting this optimal u, and collecting terms which are independent of and quadratic in x, we find

u = −S f x,   γ = ½ g² S,   S a − ½ S² f² + ½ q = 0.

The last equation is quadratic in S. It is termed the algebraic Riccati equation. It admits two solutions:

S1 = (a − √(a² + f² q)) / f² ,   S2 = (a + √(a² + f² q)) / f²

For each of these two solutions, we can aim to compute the corresponding stationary running cost γ. We see that S1 is negative, irrespective of the sign of a, which should correspond to a negative stationary expected running cost γ. This clearly cannot be the case, since the running cost is non-negative by definition. The explanation for this is that the closed-loop system corresponding to S1 is

dXt = (a − S1 f²) Xt dt + g dBt = √(a² + f² q) Xt dt + g dBt

This system is unstable and therefore does not admit a solution which is a stationary process. In turn, inserting the solution S2, we find a positive expected stationary running cost γ, and closed-loop dynamics

dXt = (a − S2 f²) Xt dt + g dBt = −√(a² + f² q) Xt dt + g dBt

which are stable. To summarize, the stationary HJB equation has more than one solution, and the relevant one is the maximal one, which is also the unique one which leads to stable closed-loop dynamics, and thus a stationary controlled process. The following theorem generalizes this example.

Theorem 13.6.1. Consider the LQR problem (13.8), (13.9) and let T → ∞. Assume that the pair (A, Q) is detectable, i.e., for any right eigenvector v of A such that the corresponding eigenvalue λ has Re λ ≥ 0, it holds that Qv ≠ 0. Then, the optimal state feedback is

Ut = µ(Xt) = −R^{−1} F^T S Xt

where S is the unique positive semidefinite solution to the algebraic Riccati equation

S A + A^T S − S F R^{−1} F^T S + Q = 0

with the property that A − F R^{−1} F^T S is stable. This solution is also maximal in the sense that any other symmetric solution S1 to the algebraic Riccati equation has S1 ≤ S. The associated time-averaged running cost is

½ tr(G^T S G)

See e.g. (Doyle et al., 1989). The theory of linear-quadratic control is very well developed, both theoretically and numerically, and has been applied to a large suite of real-world problems.

13.7  A case: Fisheries management

We now turn attention to an economic problem: How to design an optimal fishing policy. We assume the biomass of a fish stock grows according to the stochastic differential equation

dXt = [Xt(1 − Xt) − Ut] dt + σ Xt dBt

where {Xt : t ≥ 0} is the biomass, {Ut : t ≥ 0} is the catch rate, which is our decision variable, {Bt : t ≥ 0} is standard Brownian motion, and σ is the level of noise in the population dynamics. This is a non-dimensionalized model where we have rescaled time t so that the specific growth rate of a small unfished population is 1, and rescaled biomass x such that the carrying capacity - i.e., the equilibrium point in absence of fishing and noise - is 1. We assume first that the objective of the management is to maximize the expected catch Ut integrated over a long time. The steady-state Hamilton-Jacobi-Bellman equation is

sup_{u≥0} [ (x(1 − x) − u) ∂V0/∂x + ½ σ² x² ∂²V0/∂x² + u ] = γ

We maximize only over non-negative u, because we cannot unfish (stock rebuilding programmes do exist, but not in our model). We can then see that an optimal strategy must satisfy

u = 0 when ∂V0/∂x > 1 ,   u = ∞ when ∂V0/∂x < 1;

when ∂V0/∂x = 1, any u is optimal. In words, we should either not fish at all, or fish as hard as possible, depending on ∂V0/∂x. Clearly there is no Markov strategy which satisfies this and at the same time satisfies our assumptions for existence and uniqueness of solutions. It is possible to interpret this necessary condition, but here, we prefer to change the model so that an optimum actually exists. The problem with the current model formulation is that the instantaneous maximization problem in the HJB equation is linear in u. Therefore, there will always be optimal controls u on the boundary of U - in this case, 0 and ∞. Notice that changing U to a bounded interval doesn't solve the problem, because the optimal control may then depend discontinuously on the state x. This suggests that our formulation of the problem


was too simplistic. From a modeling perspective, we need to include the disadvantages of exceedingly large values of u, and rapid changes in u. Here, we take a simple solution: We argue that exceedingly large catches u will flood the market and reduce the price. One model that takes this into account is to assume that the cumulated profit is

∫_0^T √(Ut) dt.

Our choice of square root is rather arbitrary; it serves our purpose. With this performance criterion, the HJB equation becomes

sup_{u≥0} [ (x(1 − x) − u) ∂V0/∂x + ½ σ² x² ∂²V0/∂x² + √u ] = γ.    (13.12)

Now, the optimal control is, whenever ∂V0/∂x > 0,

u = µ*(x) = 1 / (4 (V0′(x))²)    (13.13)

We solve the HJB equation numerically by time-stepping backwards until steady state is reached. The solution is seen in figure 13.5, thin lines. Artifacts of the numerical method are seen at the boundaries of the computational domain X = [0, 4]. The optimal strategy Ut = µ*(Xt) is practically indistinguishable from a quadratic relationship between biomass Xt and catch rate Ut, except near the boundaries of the computational domain. In fact, this problem does have a simple analytical solution. Notice from (13.13) that if the optimal control is quadratic in x, then V0′(x) must be of the form a/x, i.e. we guess a solution

V0(x) = a log x + b ,   u = µ*(x) = x²/(4a²).

Exercise 13.95: Show that this V0, µ*, γ solves the HJB equation (13.11) iff a = ½, γ = ½(1 − ½σ²), so that u = µ*(x) = x². This solution is included in figure 13.5, with bold lines. The offset b, which is arbitrary, has been chosen to match the numerical solution at low biomass. In this case we found an analytical solution to the HJB equation. Examples with analytical solutions play a prominent role in the literature, but it should be clear that they are exceptions rather than the rule. From a modeling perspective, it would be an extreme restriction if we had to confine ourselves to models with analytical solutions, and simple numerical methods are an important element of the toolbox. Nevertheless, analytical solutions provide insight into the problem. For the closed-loop system and in the absence of noise, the equilibrium state is x = 1/2. The noise will perturb the system


Figure 13.5. The value function and the optimal strategy for the fisheries management problem. Thin lines: Numerical solution. Bold lines: The analytical solution.


away from this state, but the optimal control attempts to bring the system back to this state. What is particular about the state x = 1/2 is that the surplus production x(1 − x), and hence the harvest, is maximized at this point. In the literature of fisheries management, this is known as the Maximum Sustainable Yield solution. A simulation of the system is shown in figure 13.6. The simulation starts at the unexploited equilibrium, X0 = 1. In addition to the optimal policy Ut = Xt², the figure includes two simpler policies: First, the constant catch policy Ut = 1/4. Without any noise in the system, this policy would be able to maintain Maximum Sustainable Yield and a population at Xt = 1/2. However, this equilibrium solution would be unstable, and the noise in the population dynamics (or any other perturbation) drives the system away from the equilibrium and causes the population to crash and reach Xt = 0, at which point the population is extinct and the fishery closes. Next, the figure includes the policy Ut = Xt/2. We can call this a constant effort policy, since we are removing a fixed fraction of the biomass in each small time interval, regardless of the biomass. In the absence of noise, this would also lead to the Maximum Sustainable Yield solution Xt = 1/2, Ut = 1/4. In contrast to the constant catch policy, this solution is stable, so that noise gives rise to fluctuations around this equilibrium. Compared to the optimal policy Ut = Xt², we see that the constant effort policy leads to lower biomass most of the time, and consequently also to lower catches and profits. The optimal policy is more consistent in relaxing the fishing effort in bad years, allowing the biomass to rebuild, and also at exploiting good years to the fullest.

Exercise 13.96: The value function V0(x) tends to −∞ as x ↘ 0. Explain how this should be understood. Hint: Start by considering the time-varying optimization problem and the corresponding value function V(x, t) for x near 0.

For this case of fisheries management, the optimal control and the value function are independent of the noise level. The noise, in fact, only affects the problem through the steady-state expected pay-off, γ = ½(1 − ½σ²). Because the noise moves the system away from the optimal operating point, it reduces the expected payoff. Notice that the limit γ = 0 is reached when σ² = 2; recall from the analysis of geometric Brownian motion that this is the stability threshold for the equation dXt = Xt dt + σXt dBt. I.e., when the noise level σ² exceeds 2, the population will crash and converge to 0, even in the absence of fishing, so that no profit can be obtained from the system in steady state. The fact that the noise intensity does not affect the value function, or the optimal control, appears because we chose the noise intensity to be linear in the state, so this is not a general feature.
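A rough simulation sketch in the spirit of figure 13.6 follows (my own Python code, not the code behind the figure); the three policies are driven by a common realization of the Brownian increments, the biomass is truncated at zero, and the catch is set to zero once the stock has collapsed.

import numpy as np

# Sketch: dX = [X(1-X) - U] dt + sigma*X dB under three policies, common noise path.
sigma2, h, T = 0.5, 0.001, 20.0
n = round(T / h)
rng = np.random.default_rng(7)
dB = rng.normal(0.0, np.sqrt(h), n)

policies = {
    "constant catch":  lambda x: 0.25 if x > 0 else 0.0,
    "constant effort": lambda x: 0.5 * x,
    "optimal":         lambda x: x ** 2,
}
for name, policy in policies.items():
    x, harvest = 1.0, 0.0
    for k in range(n):
        u = policy(x)
        x = max(0.0, x + (x * (1.0 - x) - u) * h + np.sqrt(sigma2) * x * dB[k])
        harvest += u * h
    print(name, "final biomass:", round(x, 3), "total catch:", round(harvest, 3))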

13.8  Summary

Optimal control problems appear in a range of applications, where the objective is to design a dynamic system which performs optimally. These cover traditional control engineering applications, i.e. technical systems, as well as financial and management problems. They also appear in situations where we do not aim to design a system, but rather to understand an existing decision maker, for example to predict future decisions.


Figure 13.6. Simulation of the fisheries management system with σ² = 1/2 and under three management regimes: Constant catch Ut = 1/4 until collapse, constant effort Ut = Xt/2, and the optimal policy Ut = Xt². The three simulations are done with the same realization of Brownian motion.


We have focused on the technique of Dynamic Programming for solving such problems. In the case of stochastic differential equations, dynamic programming amounts to solving the Hamilton-Jacobi-Bellman equation. This equation generalizes the backward Kolmogorov equation and, more specifically, Dynkin's lemma that we studied in chapter 11: When analyzing a given control system, i.e. when U is a singleton, we can determine its performance by solving this backward equation. When we include the choice of control, this adds the “sup” in the Hamilton-Jacobi-Bellman equation (13.7), and this equation reduces the problem of dynamic optimization to a family of static optimization problems, where we trade off instantaneous gains h(x, u, t) against future gains (L^u V)(x, t). The presentation in this chapter is meant as a first introduction, and a number of important issues have been omitted. The most important one is the characterization theorem, which states that the value function satisfies the HJB equation, provided it is smooth. See (Øksendal, 2010). In many problems, however, we can expect that the solution is not smooth. The framework of viscosity solutions to partial differential equations addresses this issue. As we have stated, problems with analytical solutions play a prominent role in the literature, but a numerical toolbox is essential to anyone interested in non-idealized applications. Here, we have only sketched one numerical approach, which is to discretize the problem to a Markov Decision Problem which can then be solved straightforwardly. Several other methods exist, both for discretization and for approximating the original problem with one that can be solved analytically. We encourage the interested reader to consult the specialized literature; a good starting point is (Kushner and Dupuis, 1992).

13.9  Notes and references

13.9.1  Numerical analysis of the HJB equation

Here we describe the numerical analysis of the Hamilton-Jacobi-Bellman equation using time marching, based on the fisheries management example from section 13.7. The objective is to compute the value function V(x, t) and the control strategy µ*(x, t) for t ∈ [0, T]. While the state in principle should be any non-negative number, for numerical work we truncate to x ∈ [0, L]. We solve the HJB equation as a terminal value problem using time marching. At each time step, two operations must be done:

1. From the current value function, compute the optimal control at each state, i.e. find

µ*(x, t) = arg max_u [L^u Vt + h]

2. Based on this control strategy, iterate the value function backwards in time.

Finding the optimal control strategy at the current time t can, in general, involve both pen-and-paper work and numerical optimization. Here, with

L^u V + h = V′ · x(1 − x − u) + ½ σ² x² V″ + √(xu)

(note that in this formulation, u denotes the harvest per unit biomass, so that the catch rate is xu), we find that L^u V + h is concave in u for u ≥ 0 and initially increasing, so that the maximizing u is found as the unique stationary point, i.e. µ*(x) = 1/(4 (V′(x))² x). The derivative V′ is found using a finite difference approximation. In order to avoid excessive controls at the boundary, which can be introduced by numerical boundary effects, we bound u < U. To do the time update, it is possible to use the Euler method, i.e.

V(x, t − h) = V(x, t) + [L^{µ*} V(x, t) + √(x µ*(x))] · h

where µ* = µ*(·, t) is the strategy found in the first step. However, the Euler method often requires excessively small time steps. An alternative is to use an implicit Euler method, or to make use of the matrix exponential, which allows arbitrarily large time steps.
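The following Python sketch is a rough illustration of this time-marching procedure (the grid sizes, the bound on the control, the horizon, and the handling of the boundaries are my own choices, and no attempt is made to suppress the boundary artifacts mentioned above):

import numpy as np

# Sketch: explicit time marching for the fisheries HJB, V(x,T) = 0, marching backward.
# Here u is the harvest per unit biomass, so the catch rate is x*u and the reward sqrt(x*u).
sigma2, L, Umax = 0.5, 4.0, 2.0
dx, dt, T = 0.1, 5e-4, 10.0
x = np.arange(dx, L + dx / 2, dx)            # interior grid, avoiding x = 0
V = np.zeros_like(x)

for _ in range(round(T / dt)):
    Vx  = np.gradient(V, dx)                 # finite-difference V'
    Vxx = np.gradient(Vx, dx)                # finite-difference V''
    u = np.clip(1.0 / (4.0 * np.maximum(Vx, 1e-6) ** 2 * x), 0.0, Umax)
    rate = Vx * x * (1.0 - x - u) + 0.5 * sigma2 * x ** 2 * Vxx + np.sqrt(x * u)
    V = V + dt * rate                        # V(x, t - dt) = V(x, t) + dt*(L^u V + h)

u_final = np.clip(1.0 / (4.0 * np.maximum(np.gradient(V, dx), 1e-6) ** 2 * x), 0.0, Umax)
print(u_final[(x > 0.5) & (x < 2.0)])        # away from the boundaries, roughly u ~ x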


CHAPTER 14  Solutions to selected exercises

Exercise 2.1: The concentration profile is decreasing with x, so the diffusive flux is from left to right, i.e. positive. The slope is decreasing in magnitude, i.e. the curve is steeper at x = a than at x = b, so J(a) is larger than J(b). So there is a net influx into the interval [a, b], and the amount of material increases. More generally, the concentration profile is convex, i.e. C″ > 0, so from (2.5) we see that Ċ is positive; the concentration increases everywhere in the region plotted.

Exercise 2.2: The verification is most easily done with a computer algebra system such as Maple, Mathematica, or sage. The following piece of sage does the verification:

var('x t D x0')
phi(x) = 1/sqrt(2*pi)*exp(-x^2/2)
C(x,t) = phi((x-x0)/sqrt(2*D*t))/sqrt(2*D*t)
res = factor(diff(C(x,t),t) - D*diff(C(x,t),x,x))
print(res)

The factor is there to simplify the expression. When run, the code will output the result 0, which verifies that the left and right hand sides of the equation (2.6) agree. The initial condition is satisfied in the sense that the solution C(x, t) converges to a Dirac delta as t ↘ 0. Here, the convergence is in measure or in L¹ norm.

Exercise 2.3: We define the diffusive length scale L as the standard deviation in the plume, i.e. √(2DT) where T is the time scale of interest. We get:

Process                    Diffusivity (m²/s)   1 sec          1 minute       1 hour         1 day
Salt in water at 293 K     1 × 10⁻⁹             4.5 × 10⁻⁵ m   3.5 × 10⁻⁴ m   2.7 × 10⁻³ m   1.3 × 10⁻² m
Smoke in air at 293 K      2 × 10⁻⁵             6.3 × 10⁻³ m   4.9 × 10⁻² m   3.8 × 10⁻¹ m   1.9 × 10⁰ m
Carbon in iron at 1250 K   2 × 10⁻¹¹            6.3 × 10⁻⁶ m   4.9 × 10⁻⁵ m   3.8 × 10⁻⁴ m   1.9 × 10⁻³ m
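As a quick sanity check, the table can be reproduced with a few lines of R (a sketch; it uses only the diffusivities listed above):

## Diffusive length scale sqrt(2 D T) for each process and time scale.
D <- c(salt = 1e-9, smoke = 2e-5, carbon = 2e-11)        # m^2/s
T <- c(sec = 1, minute = 60, hour = 3600, day = 86400)   # s
signif(outer(D, T, function(D, T) sqrt(2 * D * T)), 2)   # lengths in metres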

Note that these length scales are all quite small, by everyday measures.

Exercise 2.4: We have Ċ = −λC and C″ = −k²C. Combining, we get Ċ = (λ/k²) C″ which, with D = λ/k², agrees with the diffusion equation governing C.

Exercise 3.10: The idea is to start with rectangular (box) sets in the plane and construct the set A using countably many set operations. So define

B_{nm} = {(x, y) : x > m/n, y > −m/n} for natural n and integer m. You should sketch B_{nm} if you can't visualize it directly. Clearly B_{nm} is Borel. Now define B = ∪_{n∈N} ∪_{m∈Z} B_{nm}; then B is also Borel. Now it is easy to see that (x, y) ∈ B if and only if x + y > 0. If you don't agree, then you should set out to find n, m such that (x, y) ∈ B_{nm}, for given (x, y) with x + y > 0. So A = R² \ B. Hence A is also Borel.

Exercise 3.11: 1. Let B ⊂ R be Borel. Then g⁻¹(B) is also Borel and therefore Z⁻¹(B) = X⁻¹(g⁻¹(B)) ∈ F. This shows that Z : Ω → R is measurable, and hence a random variable. 2. Write Z(ω) = g(X(ω)) with g(x) = x². Then g is continuous and hence Borel measurable. It follows that Z = X² is a measurable function Ω → R and hence a random variable.

Exercise 3.12: We have, clearly

P(X > x) = P(G⁻¹(ω) > x) = P(ω < G(x)) = G(x)

since G(·) is non-increasing. To see that EX = ∫_0^∞ G(x) dx, it is instructive to consider a specific example. Take G(x) = exp(−x). The following figure illustrates the situation.

[Figure: plot of G(x) (or equivalently ω) against x (or equivalently G⁻¹(ω)), with the area under the curve shaded gray.]

By definition,

EX = ∫ X(ω) P(dω) = ∫_0^1 G⁻¹(ω) dω

which corresponds to the gray shaded area. Finding the area of this set by integrating along the abscissa (using Fubini's theorem), we see that

EX = ∫_0^∞ G(x) dx

Note that this result is standard, but is usually verified by integration by parts.

Exercise 3.14: 1. This corresponds to an observer who always knows the realized value of the random variable X. This could be the omniscient observer, H = F, or the observer that simply observes X, i.e. H = σ(X), or anything "in between". 2. This corresponds to an observer whose information does not lead to changes in the expected value of X. This could be the null observer, H = {∅, Ω}. It is also possible to construct examples where the observer does get information about X, but this information does not affect the expected value. For example, let X be standard Gaussian and let H = σ(|X|).

Exercise 3.15: We must show that Z = EX satisfies the conditions in definition 3.5.1. Clearly Z is Y-measurable, since it is deterministic. Next, we must show that

E{Z · 1_H} = E{X · 1_H}

for every H ∈ σ(Y). The left hand side equals Z · P(H) since Z is deterministic. Assume that X is simple, i.e. P(X = x_i) = p_i where x_1 < · · · < x_n and p_1 + · · · + p_n = 1. Then

E{X · 1_H} = Σ_{i=1}^n x_i P{H ∧ (X = x_i)}

By independence, P{H ∧ (X = x_i)} = P{H} p_i, so the right hand side evaluates to P(H) · EX. To show the result for general X, we approximate X with a sequence of simple random variables.

Exercise 3.16: By the Doob-Dynkin lemma, 1 + Y² is a random variable which is measurable w.r.t. σ(Y). To verify the condition ∫_A (1 + Y²) dP(ω) = ∫_A S dP(ω) for A ∈ H, note that A must be a cylinder set, i.e. a set of the form A = {ω = (x, y) : y ∈ B} for some Borel set B. It is useful to think of B as an interval so that A is a horizontal strip, A = {ω = (x, y) : a ≤ y ≤ b} (compare figure 3.6 on page 48). We then get:

∫_A S(ω) dP(ω) = ∫_a^b ∫_{−∞}^{+∞} (x² + y²) φ(x) φ(y) dx dy
             = ∫_a^b ∫_{−∞}^{+∞} (1 + y²) φ(x) φ(y) dx dy
             = ∫_A (1 + Y²(ω)) dP(ω)

since ∫_{−∞}^{+∞} x² φ(x) dx = 1 = ∫_{−∞}^{+∞} 1 · φ(x) dx. Note that these calculations remain valid if we replace the interval [a, b] with a general Borel set B.

Exercise 3.17: If X is H-measurable, then also X² is H-measurable. We get V{X|H} = E{X²|H} − (E{X|H})² = X² − X² = 0.

Exercise 3.18:

VX = EX² − (EX)²
   = E E{X²|H} − (E E{X|H})²
   = E V{X|H} + E (E{X|H})² − (E E{X|H})²
   = E V{X|H} + V E{X|H}
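A quick Monte Carlo illustration of this decomposition (the choice of H and of the conditional distribution of X below is an assumption made only for the example):

## Check VX = E V{X|H} + V E{X|H} with H generated by a Bernoulli(0.3) variable
## and X | H ~ N(H, 1), so that E V{X|H} = 1 and V E{X|H} = 0.3*0.7.
set.seed(1)
H <- rbinom(1e6, 1, 0.3)
X <- rnorm(1e6, mean = H, sd = 1)
c(var(X), 1 + 0.3 * 0.7)   # the two numbers should agree closely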

Exercise 3.19: By the simple Tower property we have EY = E E{Y|N} = E µN = µλ. For the variance, we find

VY = E V{Y|N} + V E{Y|N} = E Nσ² + V µN = (µ² + σ²)λ
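These moments are easy to confirm by simulation; a sketch with illustrative parameter values:

## Y is a sum of N i.i.d. N(mu, sigma^2) terms with N ~ Poisson(lambda).
set.seed(1)
mu <- 2; sigma <- 0.5; lambda <- 3
Y <- replicate(1e5, sum(rnorm(rpois(1, lambda), mean = mu, sd = sigma)))
c(mean(Y), mu * lambda)                 # EY = mu * lambda
c(var(Y), (mu^2 + sigma^2) * lambda)    # VY = (mu^2 + sigma^2) * lambda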

Exercise 3.20: We must show that E|X|^q < ∞. To see this, note that |x|^q ≤ 1 + |x|^p for any x ∈ R, so E|X|^q ≤ 1 + E|X|^p < ∞.

Exercise 3.25: The following table shows the various conditional expectations. To find e.g. E{X|G}, an easy approach is to consider the atomic sets in G, i.e. {1, 3}, {2}, {4, 6}, {5}, and average X over each of these sets.

x   E{X|G}   E{X|H}   E{E{X|G}|H}   E{E{X|H}|G}
1      2        3          3             3
2      2        4          4             4
3      2        3          3             3
4      5        4          4             4
5      5        3          3             3
6      5        4          4             4

Exercise 3.26: 1. The situation can be described by two observers, G and H, with information G and H, respectively, so that G knows everything that H knows. V{X|H} describes observer H's uncertainty about X. Since H does not know all the information that G has, H cannot say for sure whether V{X|G} ≤ V{X|H}, but since H knows that G has access to more information, H expects V{X|G} to be smaller than V{X|H}. 2. The clue is to decompose X into orthogonal components: The first is H's estimate, E{X|H}. The next term, E{X|G} − E{X|H}, is G's correction to H's estimate. The last term, X − E{X|G}, is G's estimation error. These terms are orthogonal, also conditionally on H. To see this for H's estimate and G's correction, multiply the terms, take conditional expectation w.r.t. H, and

use the properties of conditional expectation (taking out what is known, and the Tower property):

E[E{X|H}(E{X|G} − E{X|H})|H] = E{X|H} · (E[E{X|G}|H] − E{X|H}) = 0

It follows that H expects to be able to explain less variance than G:

E[(E{X|G})²|H] = (E{X|H})² + E[(E{X|G} − E{X|H})²|H] ≥ (E{X|H})²

With this inequality on the variance explained by the two observers, we can go to the definition of the conditional variances: V{X|H} = E{X²|H} − (E{X|H})² and V{X|G} = E{X²|G} − (E{X|G})². Taking conditional expectation of the latter w.r.t. H, we get

E[V{X|G}|H] = E{X²|H} − E{(E{X|G})²|H} ≤ E{X²|H} − (E{X|H})² = V{X|H}.

3. This situation may seem counter-intuitive, but can occur if the extra information available to G is that X is more uncertain than on average. A simple example is the following: Let Y be a Bernoulli variable with parameter p ∈ (0, 1) and let X|Y be Gaussian distributed with mean 0 and variance Y. Let G = σ(Y) and let H = {∅, Ω}, i.e. H has no information about the outcome of the stochastic experiment. Then V{X|H} = V{X} = E V{X|Y} + V E{X|Y} = p, but V{X|G} = Y, which exceeds p when Y = 1.
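A small simulation illustrating this example (p = 0.2 is an arbitrary illustrative value):

## Y ~ Bernoulli(p), X | Y ~ N(0, Y): the variance given Y = 1 exceeds the
## unconditional variance p.
set.seed(1)
p <- 0.2
Y <- rbinom(1e6, 1, p)
X <- rnorm(1e6, mean = 0, sd = sqrt(Y))
c(var(X), p)       # V{X|H} for the trivial H
var(X[Y == 1])     # V{X|G} on the event Y = 1; close to 1 > p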

Exercise 3.27: Without loss of generality, we can assume µ_X = 0 and µ_Y = 0. Set Z := QR⁻¹Y. Define X̃ := X − Z; this is the estimation error when using Z as an estimate of X based on Y. Then X̃ is uncorrelated with Y:

E X̃Y′ = E XY′ − E ZY′ = Q − E QR⁻¹YY′ = Q − QR⁻¹R = 0

This shows that Z = E{X | Y}. To show the result for the conditional variance, we evaluate the variance directly:

V{X | Y} = V{X̃} = V(X − QR⁻¹Y) = P − 2QR⁻¹Q′ + QR⁻¹RR⁻¹Q′ = P − QR⁻¹Q′

where the first equality comes from X = Z + X̃, where Z is known given Y and X̃ is independent of Y. Finally, to see that the conditional distribution is Gaussian, it suffices to note that the logarithm of the conditional density of X is a quadratic form in x.

Exercise 3.28: Let A and B be two Borel sets and let G = σ(Z). We must show that

P{X ∈ A, Y ∈ B|G} = P{X ∈ A|G} · P{Y ∈ B|G}

The conditional joint density of (X, Y) is f_{X|Z}(x, Z) · f_{Y|Z}(y, Z). (Recall that this is a Z-measurable random variable.) Evaluating the left hand side, we get

∫_{A×B} f_{X|Z}(x, Z) · f_{Y|Z}(y, Z) dx dy = ∫_A f_{X|Z}(x, Z) dx · ∫_B f_{Y|Z}(y, Z) dy

which is the right hand side. This shows that X and Y are conditionally independent given Z.

A simple counterexample, which demonstrates that conditional independence given a σ-algebra does not imply conditional independence given any event in that σ-algebra, is as follows. Let Z ∼ U(0, 1), X|Z ∼ U(0, Z), Y|Z ∼ U(Z, 1), and let X and Y be conditionally independent given Z. Then X and Y are not independent. For example, P{X > 1/2, Y < 1/2} = 0 although P{X > 1/2} > 0 and P{Y < 1/2} > 0. So X and Y are not conditionally independent given the event Ω ∈ G.

Exercise 3.29: We find

E|X|^p = ∫_{−∞}^{+∞} |x|^p (1/√(2π)) exp(−x²/2) dx
       = 2 ∫_0^∞ x^p (1/√(2π)) exp(−x²/2) dx          (3.6)
       = √(2/π) ∫_0^∞ (2u)^{p/2−1/2} exp(−u) du       (3.7)
       = √(2^p/π) Γ(p/2 + 1/2)                        (3.8)

The following R snippet evaluates the result numerically and compares with Monte Carlo: absX n, then clearly B_n ⊂ B_{nm} for all m > n. Hence P(B_n) ≤ P(B_{nm}) for all m > n, and hence P(B_n) = 0. Now, the statement "the sample path contains a finite number of positive values" corresponds to the event (!) A = ∪_{n≥1} B_n. Clearly we have P(A) ≤ Σ_{n≥1} P(B_n) = 0.


Exercise 4.32: The following R code implements one way to simulate Brownian motion, and does the verification. rBM + GGᵀ dt.

Exercise 6.61: The integral corresponds to Eτ², where F(t) = P{τ ≤ t}; i.e., τ is exponentially distributed with mean 1. From the properties of that distribution, we know that Eτ² = Vτ + (Eτ)² = 1 + 1 = 2. A direct evaluation of the integral is

∫_0^∞ t² dF(t) = ∫_0^∞ t² exp(−t) dt = Γ(3) = 2
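The same integral can be checked numerically in one line of R:

integrate(function(t) t^2 * exp(-t), lower = 0, upper = Inf)$value   # approximately 2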

Exercise 7.68: Inserting dX_t = F_t dt + G_t dB_t in (7.2), the equation between (7.2) and (7.3) reduces to

½ dX_tᵀ (∂²h/∂x²) dX_t = ½ tr[G_tᵀ (∂²h/∂x²) G_t] dt.

To see this:

½ dX_tᵀ (∂²h/∂x²) dX_t = ½ tr[(∂²h/∂x²) dX_t dX_tᵀ]
                       = ½ tr[(∂²h/∂x²) d[X, Xᵀ]_t]
                       = ½ tr[(∂²h/∂x²) G_t d[B, Bᵀ]_t G_tᵀ]
                       = ½ tr[(∂²h/∂x²) G_t I dt G_tᵀ]
                       = ½ tr[G_tᵀ (∂²h/∂x²) G_t] dt.


Exercise 7.70: We apply Itô's lemma to X_t = h(t, B_t) where h(t, b) = x exp((r − ½σ²)t + σb). We find

∂h/∂t = (r − ½σ²)h,   ∂h/∂b = σh,   ∂²h/∂b² = σ²h

so that

dX_t = (∂h/∂t) dt + (∂h/∂b) dB_t + ½ (∂²h/∂b²) dt
     = (r − ½σ²)X_t dt + σX_t dB_t + ½σ²X_t dt
     = rX_t dt + σX_t dB_t

as required. Next, log X_t is Gaussian with mean log x + (r − ½σ²)t and variance σ²t, i.e. the transition probabilities of X_t are log-Gaussian distributions

X_t ∼ LN(log x + (r − ½σ²)t, σ²t).

It follows from the properties of the log-Gaussian distribution that X_t has mean

EX_t = x exp((r − ½σ²)t + ½σ²t) = x exp(rt)

and variance

VX_t = x²(exp(σ²t) − 1) exp(2(r − ½σ²)t + σ²t) = x²(exp(σ²t) − 1) exp(2rt).
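A Monte Carlo check of these moments (the parameter values below are illustrative):

## Simulate X_t = x exp((r - sigma^2/2) t + sigma B_t) directly and compare.
set.seed(1)
x <- 1; r <- 0.1; sigma <- 0.3; t <- 2; n <- 1e6
Xt <- x * exp((r - sigma^2 / 2) * t + sigma * rnorm(n, sd = sqrt(t)))
c(mean(Xt), x * exp(r * t))                                 # mean
c(var(Xt), x^2 * (exp(sigma^2 * t) - 1) * exp(2 * r * t))   # variance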

Exercise 7.73: Using the hint, define Y_t = h(t, X_t) = exp(−At)X_t; then

Y_t = x + ∫_0^t e^{−As} (w_s ds + G dB_s)

so dY_t = e^{−At}(w_t dt + G dB_t). By Itô's lemma, this implies

dX_t = e^{At} dY_t + A e^{At} Y_t dt = AX_t dt + w_t dt + G dB_t

as required.


Exercise 7.74: Define Y_t = h(t, X_t) where h(t, x) = e^{−F_t} x; then

Y_t = x + ∫_0^t e^{−F_s} σ_s dB_s

or dY_t = e^{−F_t} σ_t dB_t. By Itô's lemma applied to X_t = g(F_t, Y_t) with g(f, y) = exp(f) y, using ∂²g/∂y² = 0 and (dF_t)² = dF_t · dY_t = 0, we get

dX_t = X_t dF_t + e^{F_t} dY_t = λ_t X_t dt + σ_t dB_t

So X_t = g(F_t, Y_t) = e^{F_t} x + ∫_0^t e^{F_t − F_s} σ_s dB_s satisfies the stochastic differential equation.

Exercise 7.76: The transform is

h(x) = ∫^x 1/(σv) dv = σ⁻¹ log x

The transformed process is {Y_t : t ≥ 0} given by Y_t = σ⁻¹ log X_t, which is governed by the SDE

dY_t = (rX_t/(σX_t) − ½σ) dt + dB_t = (r/σ − ½σ) dt + dB_t

Alternatively, we could prefer not to scale with σ, so define a transformed process Z_t = log X_t, corresponding to the SDE

dZ_t = (r − ½σ²) dt + σ dB_t

which has constant noise intensity, although that constant is not equal to 1.

Exercise 7.78: For X_t we get dX_t = −sin Θ_t ◦ dB_t = −Y_t ◦ dB_t, while for Y_t we get dY_t = cos Θ_t ◦ dB_t = X_t ◦ dB_t.

Combining, we have

(dX_t, dY_t)ᵀ = (−Y_t, X_t)ᵀ ◦ dB_t

Note that this suggests that the increment (dXt , dYt ) is (random) orthogonal to the position (Xt , Yt ) - compare with Brownian motion on the circle (section 7.5).


Exercise 7.79: We apply the transformation Y_t = h(X_t) with

h(x) = ∫^x 1/g(v) dv

and find

dY_t = (f(X_t)/g(X_t)) dt + dB_t

Next, we rewrite the original equation governing {X_t} as an Itô equation:

dX_t = (f(X_t) + ½ g(X_t) g′(X_t)) dt + g(X_t) dB_t

When we Lamperti transform this equation with the same transformation, we find

dY_t = [f(X_t)/g(X_t) + ½ g′(X_t) − ½ g′(X_t)] dt + dB_t = (f(X_t)/g(X_t)) dt + dB_t

in agreement with the direct transformation applied to the Stratonovich equation. Notice that the noise intensity is constant, as was the purpose of the transformation, and that this implies that the Itô and Stratonovich interpretations of the equation coincide. This explains why we reach the same result.

Exercise 7.80: We have X_t = g(t, B_t) with g(t, b) = b/√t, so Itô's lemma gives

dX_t = −½ B_t t^{−3/2} dt + t^{−1/2} dB_t = −½ X_t t^{−1} dt + t^{−1/2} dB_t

With F_t = −½ X_t t^{−1} and G_t = t^{−1/2}, and with the time change U_t = log t, we get H_t = 1/t and

dY_u = (F_{T_u}/H_{T_u}) du + (G_{T_u}/√(H_{T_u})) dW_u = −½ Y_u du + dW_u

Thus, by rescaling both the dependent and the independent variable, we can transform Brownian motion to an Ornstein-Uhlenbeck process.

Exercise 8.85: 1. Let {B_t} be standard Brownian motion. The joint distribution of B_t and B_T, for 0 ≤ t ≤ T, is Gaussian with mean (0, 0) and variance-covariance matrix

[ t  t ]
[ t  T ]

It follows from standard conditioning in Gaussian distributions (exercise 3.27) that the conditional distribution of B_t given B_T = b is Gaussian with mean bt/T and variance t − t²/T = t(1 − t/T).

2. Since the drift is linear in the state, the mean µ_t = EX_t satisfies the ordinary differential equation

dµ_t/dt = (b − µ_t)/(T − t).

Inserting µ_t = bt/T, we see that this satisfies the ordinary differential equation. Likewise, the variance Σ_t = VX_t is governed by the equation

dΣ_t/dt = −2Σ_t/(T − t) + 1

and we see that Σ_t = t(1 − t/T) satisfies this equation.

Exercise 8.86: By the Brownian bridge, E{B_s | B_t} = sB_t/t for 0 ≤ s ≤ t. So E{∫_0^t B_s ds | B_t} = ∫_0^t (sB_t/t) ds = (B_t/t) · ½t² = ½ t B_t. By the product formula, ∫_0^t s dB_s = tB_t − ∫_0^t B_s ds, so E{∫_0^t s dB_s | B_t} = ½ t B_t.

Exercise 9.88: The stationary distribution can be written as

φ(x) = (1/Z) exp( ∫_{x0}^x u(y)/D(y) dy )
     = (1/Z) exp( ∫_{x0}^x (f(y) − D′(y))/D(y) dy )
     = (1/Z) exp( ∫_{x0}^x f(y)/D(y) dy − log D(x) + log D(x_0) )
     = (D(x_0)/(Z · D(x))) exp( ∫_{x0}^x f(y)/D(y) dy )

Redefining Z := Z/D(x_0), we obtain the desired result. Next, let ψ(y) be the stationary density of Y_u; then we use that the diffusivity of the process {Y_u} is ½(√h(y) · g(y))² = h(y)D(y), and thus

ψ(y) = (1/Z) · 1/(h(y)D(y)) · exp( ∫_{y0}^y u(x)/D(x) dx ) = (1/Z) · (1/h(y)) · φ(y)

which is what should be shown. Note that this result can be explained as follows: The sample path of {Y_u : u ≥ 0} visits exactly the same points as {X_t : t ≥ 0}, but the time the process {Y_u : u ≥ 0} spends in a certain region dy is a factor h(y) smaller than the time that {X_t : t ≥ 0} spends in the same region, due to the time change.

Exercise 9.90: Define h(x) = |x|² and Y_t = h(X_t). According to the chain rule of Stratonovich calculus, we have

dY_t = (∂h/∂x) ◦ dX_t = 2X_tᵀ ◦ dX_t = 0

Thus any sphere is invariant.

Exercise 11.91: We fix x and consider h(x; Pe) in the limit where the Péclet number goes to zero. We apply l'Hospital's rule and find

h(x; Pe) → (−x/l + 1)/1 = 1 − x/l

For pure diffusion (Pe = 0), the governing equation is Dh″ = 0, which requires h(x) to be a straight line.

Exercise 11.92: Since f = 0, we find φ(x) = 1. Thus s(x) = x and h(x) = x/l. The fact that the diffusivity varies affects the time it takes to reach the boundary, but does not affect the eventual outcome.

Exercise 11.93: The diffusion X_t solves the Itô SDE dX_t = f(X_t) dt + g(X_t) dB_t with f = D′ and g = √(2D). Note that f is positive. We claim that then h is concave, i.e. h″ < 0. To see this, note first that φ is positive. This means that the scale function s is increasing. Therefore also c_1 is positive and hence h is increasing, which should not come as a surprise: The further one starts to the right, the greater is the probability of exit to the right. Now, from the equation f h′ + ½ g² h″ = 0 we find

h″ = −2f h′/g²

Therefore, in any region where f is positive, h″ must be negative and thus h is concave. In this case, this applies to the entire interval [0, L]. The graph of a concave function lies above any chord, and hence h(L/2) > (h(0) + h(L))/2 = 1/2. So the process is more likely to exit at the boundary point where the diffusivity is high. This result adds to our understanding that pure Fickian diffusion, when the diffusivity varies with space, is biased towards regions with high diffusivity.
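This bias can also be seen in simulation. Below is an R sketch with an assumed increasing diffusivity D(x) = 1 + x on [0, 1]; for this choice the scale function s(x) = ∫ dx/D(x) = log(1 + x) gives h(1/2) = log(1.5)/log 2 ≈ 0.59, and the empirical exit frequency should be close to that value. The time step and number of paths are likewise illustrative.

## Euler-Maruyama simulation of dX = D'(X) dt + sqrt(2 D(X)) dB started at L/2,
## counting exits through the right (high-diffusivity) boundary.
set.seed(1)
L <- 1; dt <- 1e-3; npaths <- 1000
Dfun <- function(x) 1 + x
exit_right <- replicate(npaths, {
  x <- L / 2
  while (x > 0 && x < L)
    x <- x + 1 * dt + sqrt(2 * Dfun(x) * dt) * rnorm(1)   # D'(x) = 1 for this D
  x >= L
})
mean(exit_right)    # roughly 0.58, clearly above 1/2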

Exercise 11.94: Let τ be the time of exit, τ = inf{t : X_t ∉ (0, l)}, and let h be the expected time of exit, h(x) = E^x τ, for x ∈ (0, l). Then h is governed by the equation

u h′ + D h″ + 1 = 0

and the boundary condition h(0) = h(l) = 0. To solve this, we first set k = h′ and find uk + Dk′ + 1 = 0, which has the general solution k = −1/u + c_1 exp(−xu/D), where c_1 is an arbitrary integration constant. By integration, the general solution for h is

h(x) = −x/u − (c_1 D/u) exp(−xu/D) + c_2

where also c_2 is arbitrary. Inserting boundary conditions and solving for c_1 and c_2, we get

h(x) = (l − x)/u − (l/u) · [exp(−(x/l)Pe) − exp(−Pe)]/[1 − exp(−Pe)]

This solution is shown in figure 11.3. The first term, (l − x)/u, is the advective time scale: the time it takes a particle to travel a distance l − x when moving with constant speed u. The second term can be seen as a correction, effective in the diffusive boundary layer, which takes into account the event that the process exits quickly to the left rather than traversing the domain and exiting to the right.

Exercise 12.94: The problem is that the term g(X_t)(B_t − B_{t−h}) is not consistent with the Itô integral. To demonstrate this, it is convenient to choose an example where an analytical solution is available, and where g(x) is not a constant function. So consider geometric Brownian motion dX_t = aX_t dt + σX_t dB_t with the solution, as we know,

X_t = X_0 exp((a − ½σ²)t + σB_t)

The naive implicit scheme has

X_t^{(h)} − X_{t−h}^{(h)} = aX_t^{(h)} h + σX_t^{(h)} (B_t − B_{t−h})

or

X_t^{(h)} = (1 − ah − σ(B_t − B_{t−h}))⁻¹ X_{t−h}^{(h)}

Notice that the solution X_t^{(h)} may change sign during a time step, and that we may attempt to divide by 0. Even if division by 0 happens with probability 0 at each time step, it is still an indication that this scheme is fragile.

Moreover, the conditional density of X_t^{(h)} given X_{t−h}^{(h)} has heavy tails that decay as x⁻², implying that E{|X_t^{(h)}| | X_{t−h}^{(h)}} = ∞. For comparison, the true solution has

E{X_t | X_{t−h}} = X_{t−h} · [1 + ah] + o(h)

We see that this fully stochastic implicit scheme is not consistent with the Itô solution. In hindsight, this should come as no surprise: the way we evaluate the integrand here is not consistent with the Itô integral.
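The heavy tail is easy to provoke numerically. The following R sketch takes a single, deliberately large step h = 1 with illustrative parameters, so that both the sign change and the inflated mean of |X^{(h)}| become visible:

## One naive fully implicit step versus the exact GBM solution, over many realizations.
set.seed(1)
a <- 0.1; sigma <- 0.5; h <- 1; x0 <- 1; n <- 1e5
dB <- rnorm(n, sd = sqrt(h))
Ximpl  <- x0 / (1 - a * h - sigma * dB)               # naive implicit step
Xexact <- x0 * exp((a - sigma^2 / 2) * h + sigma * dB)
mean(Ximpl < 0)                     # the implicit step changes sign with noticeable probability
c(mean(abs(Ximpl)), mean(Xexact))   # the first average is inflated and unstable (E|X| is infinite)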


Bibliography

Billingsley, P. (1995). Probability and Measure. Wiley-Interscience, third edition.
Burrage, K., P.M. Burrage, and T. Tian (2004). Numerical methods for strong solutions of stochastic differential equations: an overview. Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences 460(2041): 373–402.
Cox, J.C., J.E. Ingersoll Jr, and S.A. Ross (1985). A theory of the term structure of interest rates. Econometrica: Journal of the Econometric Society pp. 385–407.
Doob, J.L. (1953). Stochastic Processes. John Wiley & Sons.
Doyle, J.C., K. Glover, P.P. Khargonekar, and B.A. Francis (1989). State-space solutions to standard H2 and H∞ control problems. IEEE Transactions on Automatic Control 34: 831–847.
Einstein, A. (1905). On the motion of small particles suspended in liquids at rest required by the molecular-kinetic theory of heat. Annalen der Physik 17: 549–560. Translated into English by A.D. Cowper and reprinted by Dover (1956).
Gard, T.C. (1988). Introduction to Stochastic Differential Equations, Vol. 114 of Monographs and Textbooks in Pure and Applied Mathematics. Marcel Dekker.
Gardiner, C.W. (1985). Handbook of Stochastic Models. Springer, second edition.
Gilliam, J.F. and D.F. Fraser (1987). Habitat selection under predation hazard: test of a model with foraging minnows. Ecology 68(6): 1856–1862.
Gilsing, H. and T. Shardlow (2007). SDELab: A package for solving stochastic differential equations in MATLAB. Journal of Computational and Applied Mathematics 205(2): 1002–1018. Special issue on evolutionary problems.
Grimmett, G.R. and D.R. Stirzaker (1992). Probability and Random Processes. Oxford University Press, second edition.
Higham, D.J. (2001). An algorithmic introduction to numerical simulation of stochastic differential equations. SIAM Review 43(3): 525–546.
Iacus, S.M. (2008). Simulation and Inference for Stochastic Differential Equations: With R Examples. Springer Verlag.

Karatzas, I. and S.E. Shreve (1997). Brownian Motion and Stochastic Calculus. Springer, second edition.
Kloeden, P.E. and E. Platen (1995). Numerical Solution of Stochastic Differential Equations. Springer.
Kristensen, K., A. Nielsen, C.W. Berg, and H. Skaug (2016). Template Model Builder, TMB. J. Stat. Softw. To appear.
Kushner, H.J. and P.G. Dupuis (1992). Numerical Methods for Stochastic Control Problems in Continuous Time, Vol. 24 of Applications of Mathematics. Springer-Verlag.
Lamba, H. (2003). An adaptive timestepping algorithm for stochastic differential equations. Journal of Computational and Applied Mathematics 161(2): 417–430.
Maruyama, G. (1955). Continuous Markov processes and stochastic equations. Rendiconti del Circolo Matematico di Palermo 4(1): 48–90.
Moler, C. and C. Van Loan (2003). Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. SIAM Review 45(1): 3–49.
Øksendal, B. (1995). Stochastic Differential Equations - An Introduction with Applications. Springer-Verlag, third edition.
Øksendal, B. (2010). Stochastic Differential Equations - An Introduction with Applications. Springer-Verlag, sixth edition.
Philibert, J. (2006). One and a Half Century of Diffusion: Fick, Einstein, Before and Beyond. Diffusion Fundamentals 4: 6–1.
Rogers, L.C.G. and D. Williams (1994a). Diffusions, Markov Processes, and Martingales, Vol. 1: Foundations. Cambridge University Press.
Rogers, L.C.G. and D. Williams (1994b). Diffusions, Markov Processes, and Martingales, Vol. 2: Itô Calculus. Cambridge University Press.
Royden, H.L. (1988). Real Analysis. Macmillan, New York, third edition. First edition published in 1963.
Rüemelin, W. (1982). Numerical treatment of stochastic differential equations. SIAM Journal on Numerical Analysis 19(3): 604–613.
Sainmont, J., K.H. Andersen, U.H. Thygesen, Ø. Fiksen, and A.W. Visser (2015). An effective algorithm for approximating adaptive behavior in seasonal environments. Ecological Modelling 311: 20–30.
Uhlenbeck, G.E. and L.S. Ornstein (1930). On the theory of Brownian motion. Phys. Rev. 36: 823–841.
Versteeg, H.K. and W. Malalasekera (1995). An Introduction to Computational Fluid Dynamics: The Finite Volume Method. Prentice Hall, Harlow, England.
Williams, D. (1991). Probability with Martingales. Cambridge University Press.

Zhou, K., J. Doyle, and K. Glover (1996). Robust and Optimal Control. Prentice Hall.
Zucchini, W. and I.L. MacDonald (2009). Hidden Markov Models for Time Series: An Introduction Using R. CRC Press.


Index

Conditional distribution, 51 in the Gaussian case, 63 Conditional expectation, 46 Properties of, 51 Conditional independence, 63 Conditional variance, 51 Conservation equation, 10 Continuity of a stochastic process, 103 Control Infinite horizon, 281 Convergence of random variables, 93 Covariation of two processes, 152 Cox-Ingersoll-Ross process, 186 Attainability of the origin, 244 State estimation, 215 Stationary distribution, 203 Cumulative distribution function (c.d.f.), 40 Cylinder set, 296

Adapted process, 83 Additive noise, 120 Adjoint operator, 199 Advection-diffusion equation, 21 Advective flux, 20 Almost surely (a.s.), 42 Backward Kolmogorov equation, 195, 196 Bellman, Richard E., 270 Borel algebra, 36 Borel measurable function, 39 Borel's paradox, 44, 48 Borel, Émile, 37 Brown, Robert, 18 Brownian bridge, 100, 185 Brownian motion, 18, 19 As a transformed Ornstein-Uhlenbeck process, 169 Canonical, 100 Definition, 68 exit time of sphere, 247 Hitting time distribution, 79 in Rn, recurrence of, 244 is a Markov process, 85 on the n-sphere, 203 on the circle, 164 Physical unit, 20 Simulation of, 100 Standard, 20 with drift, 148

Data update in state estimation, 219 Decomposition of variance, 52 Decorrelation time, 123, 124 Detailed balance, 204 Differential Lyapunov equation, 121, 125, 191 Diffusion Acting on a point source, 12 Acting on a wave, 16 in n dimensions, 27 Macroscopic description, 11–12 Microscopic model, 18 Diffusion equation, 11

càdlàg function, 132 Chain rule in Itô calculus, 157 in Stratonovich calculus, 166 Complementary error function, 32

Gibbs canonical distribution, 201 Gilliam’s rule, 271 Green’s function of diffusion equation, 14

Diffusion Itô, 148 Diffusion process Sampled, 23 Diffusivity Examples of, 11 Doob, Joseph Leo, 89 Drift of an Itô process, 147 Dynamic programming For controlled diffusions, 277 For Markov Decision Problems, 270 Dynkin's formula, 238

Hamilton-Jacobi-Bellman equation, 277 Hidden Markov model and state estimation, 213 Hitting time distribution for Brownian motion, 79 Impulse response, 107 of a linear system, 106 Independence, 47 Conditional, 63 Infinite horizon control, 281 Intensity of an Itô process, 147 Iterated logarithm Law of, 80 Itô diffusion, 148 Itô integral, 140 properties, 139 Itô isometry, 139 Itô process, 147 Stationary, 148, 162 Itô's lemma, 157 Itô, K., 137

Einstein, Albert, 20 Error function, 32 Euler method for Itô SDE's, 251 Euler scheme for advection-diffusion, 24 Events, 35 Exit probabilities, 239 Expectation Properties of, 61 Expectation of a random variable, 42 Exponential of a matrix, 108 Extension theorem Kolmogorov, 98 Fick's first law, 11 Fick's second law, 11 Fick, Adolph Eugen, 11 Filtration Ft of a probability space, 83 Finite-dimensional distributions, 71, 98 Forward Kolmogorov equation, 197 Frequency response, 107 of a linear system, 106 Fundamental solution of a linear system, 106 of diffusion equation, 14

Kalman filtering with continuous-time measurements, 228 with discrete-time measurements, 226 Kinesis and time change, 170 Stationary distribution, 202 Kolmogorov equation Backward, 195 Forward, 197 Kolmogorov equations Backward, 196 Duality of, 199

Gamma function, 64 Generator Definition of, 237 Geometric Brownian motion, 102 Attainability of the origin, 244 Transition probabilities, 159

Noisy oscillator, 126 Norm of random variables, 53 Norms on Rn , 94

Kolmogorov extension theorem, 98 Lamperti transform, 165 Law of the iterated logarithm, 80 Lebesgue-Stieltjes integral, 42 Linear dynamic system, 107 Linear SDE Wide-sense, 159 Linear stochastic differential equation Narrow-sense, 120, 160 Linear system of ordinary differential equations, 106 Lipschitz continuity, 174 Log-normal distribution, 159 Lp spaces of random variables, 52 Lyapunov equation Differential, 121, 125, 191

Optimal foraging, 271 Ornstein-Uhlenbeck process, see also Narrow-sense linear stochastic differential equation, 161, 190 As transformed Brownian motion, 169 Regularity of transition densities, 207 Oscillator Noisy, 126 Parameter estimation, 224 Partition, 73 of an interval, 152 Partition function, 201 Péclet number, 26 and exit probabilities, 239 and exit times, 309 Picard iteration, 180 Probability, 35 Probability measure, 37 Probability triple (Ω, F, P), 38

Markov control, 276 Markov decision problem, 268 Markov process, 85 Strong, 103 Markov property, 85 Definition, 189 of Brownian motion, 85 of diffusions, 189 Strong, 103 Markov time, 83 Markov, Andrei Andreyevich, 84 Martingale Definition of, 88 property of the Itô integral, 139 Martingale convergence, 91 Martingale inequality, 90 Matrix exponential, 107, 108 Maximum sustainable yield, 288 Measurable function, 39 Mesh of a partition, 73, 152 Monotone convergence theorem, 58 Monte Carlo simulation of advection-diffusion, 24

Quadratic covariation of two processes, 152 Quadratic variation, 75 Random variable, 38, 39 Simple, 40 Random walk of molecules, 18 Recurrence of Brownian motion in Rn, 244 Scale function, 241 Scale function, 165 σ-algebra of events, F, 36 as a model of information, 43 Simple random variable, 40 Smoothing filter, 221 State estimation and prediction, 218 State feedback control strategy, 276 State hindcasting, 221 State-space model, 106 Stationary Itô process, 148, 162 Stationary process

Narrow-sense linear stochastic differential equation, 120 Regularity of transition densities, 207 Noise Additive, 120 White, 66

Wide-sense, 112 Stieltjes integral, 132 Stochastic differential equation Itô vs. Stratonovich interpretation, 182 Narrow-sense linear, 160 Linear, 161 Solution of, 148 Wide-sense linear, 159 Stochastic differential equations Existence of a solution, 177 Stochastic experiment, 34 Stochastic process Continuity, 103 Versions of, 98 Stopped process Mt∧τ, 89 Stopping time, 83 Stratonovich integral, 151 Stratonovich interpretation of an SDE, 182 Strong Markov property of Brownian motion, 103 Strong order

of a numerical scheme, 252 Tail σ-algebra, 58 Time change, 168 Time update in state estimation, 219 Tower property for variances, 62 of conditional expectations, 51 Trace of a matrix, 157 Transfer function, 107 Truth set, 35 Uniqueness of solutions, 172 Variance Decomposition of, 52 Variance spectrum, 114 White noise, 66 Wide-sense linear SDE, 159 Wide-sense stationary process, 112 Wiener measure, 70, 100 With probability 1 (w.p. 1), 42
