Optimal quantization methods for nonlinear filtering with discrete-time observations ` Gilles PAGES
Huyˆen PHAM
Laboratoire de Probabilit´es et Mod`eles Al´eatoires CNRS, UMR 7599 Universit´e Paris 6 e-mail:
[email protected]
Laboratoire de Probabilit´es et Mod`eles Al´eatoires CNRS, UMR 7599 Universit´e Paris 7 e-mail:
[email protected] and CREST, Laboratoire de Finance-Assurance
Abstract We develop an optimal quantization approach for numerically solving nonlinear filtering problems associated with discrete-time or continuous-time state process and discrete-time observations. Two quantization methods are proposed: a marginal quantization and a Markovian quantization of the signal process. The approximate filters are explicitly solved by a finite-dimensional forward procedure. A posteriori error bounds are stated and we show that the approximate error terms are minimal at some specific grids that may be computed off-line by a stochastic gradient method based on Monte Carlo simulations. Some numerical experiments are carried out: the convergence of the approximate filter as the accuracy of the quantization increases and its stability when the latent process is mixing are emphasized .
Key words: Nonlinear filtering, Markov chain, Euler scheme, vector quantization, stochastic gradient descent, stationary signal, numerical approximation. MSC Classification (2000): 60G35, 93E11, 65C20, 65N50.
1
1
Introduction
We address the following nonlinear discrete time filtering problem. In the sequel, all the random variables are defined on a probability space (Ω, F, P). The signal process is an Rd -valued Markov chain {Xk , k ∈ N} with known probability transition Pk (x, dx ), k ≥ 1 (i.e. the transition from time k − 1 to time k). The initial law of X0 is known and denoted by µ. We have noisy observations {Yk , k ∈ N∗ } valued in Rq and our aim is to compute at some time n ≥ 1 the conditional law ΠY,n of Xn given the observations Y = (Y1 , . . . , Yn ). In other words, we wish to calculate the conditional expectations ΠY,n f
= E [ f (Xn )| Y1 , . . . , Yn ] ,
(1.1)
for all reasonable functions f on Rd . Throughout the paper, we fix the observations Y = (Y1 , . . . , Yn ) tat y = (y1 , . . . , yn ) and we write Πy,n for ΠY,n . The initial value Y0 is assumed for simplicity to be nonrandom, equal to zero for convenience. We consider an observation process (or design) where the pair (Xk , Yk )k∈N is a Markov chain and such that for all k ≥ 1: (H)
The law of Yk conditional on (Xk−1 , Yk−1 , Xk ) admits a density y −→ gk (Xk−1 , Yk−1 , Xk , y ).
Notice that the probability transition of the Markov chain (Xk , Yk )k∈N is then given by Pk (x, dx )gk (x, y, x , y )dy . An example of the observation scheme considered above is the model: Xk = Fk (Xk−1 , εk ),
k = 1, . . . , n,
(1.2)
Yk = Gk (Xk−1 , Yk−1 , Xk , ηk ),
k = 1, . . . , n,
(1.3)
where (εk )k and (ηk )k are two i.i.d. independent sequences of random variables, and Fk , Gk are measurable functions. The pair (Xk , Yk )k is then Markovian with respect to the filtration generated by (εk , ηk )k . The basic assumption concerning Gk and (ηk )k is that for each k, x, x ∈ Rd , y ∈ Rq , the variable Gk (x, y, x , ηk ) admits a density y → gk (x, y, x , y ). An explicit solution to problem (1.1) can be found only in very special cases: essentially when the signal-observation model forms a linear-Gaussian system, leading to the wellknown Kalman-Bucy filter. In the general case, the nonlinear filtering problem (1.1) leads to a dynamical system in the infinite dimensional space of measures, and we have to search for approximate solutions. Actually, approximations to (1.1) have been studied by various authors. We refer for example to [Kushner 1977], [Di Masi and Runggaldier 1982] or [Di Masi et al. 1985] for approaches related to the one followed here. In these papers, for an observation scheme in the particular form Yk = G(Xk ) + ηk , k = 1, . . . , n, the method consists basically in approximating the signal Markov chain (Xk )k (or in the continuous-time problem, the signal diffusion (Xt )t ) by a finite state space Markov chain. 2
This reduces the nonlinear filtering problem to an approximate resolution by an iterative finite-dimensional system. In these methods, the space grid is fixed prior to any computation and regardless of the structure of the Markov chain. So, from a computational viewpoint, they are effective only for low dimensions of the signal state space. On the other hand, although some bounds are obtained in [Di Masi and Runggaldier 1982] and [Di Masi et al. 1985], they are not sharp. Moreover, they essentially yield the convergence of the approximate filter to the true filter but do not provide an estimate of the rate of convergence. We propose in this paper an approximation of the filter based on an optimal quantization approach. Basically this means relying on a spatial discretization of the dynamics of the signal (Xk )1≤k≤n optimally “fitted” to its probabilistic features. Let us be more specific by considering the case of a single random vector X. If one wishes to approximate X by a random vector taking its values in a finite grid Γ := {x1 , . . . , xN }, we consider its projection ProjΓ (X) on the grid following the closest neighbour rule. Then, the resulting mean Lp -error (p ≥ 1) is X − ProjΓ (X)p = min1≤i≤N |X − xi |p . It only depends on the distribution PX of X and the grid Γ. For historical reasons, ProjΓ (X) is often called the quantization of the r.v. X by the grid Γ and the induced error, the Lp -mean quantization error (see [Graf and Luschgy 2000]). This quantity has been extensively investigated in signal processing and information theory since the early 50’s. Thus, the Lp -mean quantization error is continuous as a function of the grid Γ and reaches a minimum over all the grids Γ with size at most N . Furthermore, following Zador’s Theorem (see [Graf and Luschgy 2000]), min X − ProjΓ (X)p = c(PX , p) × c(d)N − d + o(N − d ) 1
1
Γ,|Γ|=N
as N goes to infinity. If PX has an absolutely continuous component, c(PX , p) > 0; its value is known whereas that of the universal constant c2 (d) remains unknown (see [Graf and Luschgy 2000] for bounds and asymptotics). On the other hand, except in some very specific cases like the uniform distribution over the unit interval, no closed form is available for the optimal grids that achieve the minimal quantization error of a probability distribution. In fact, no rigorous result is available to describe precisely the geometric structure or “shape” of such an optimal grid. However, using the integral representation of X − ProjΓ (X)pp one derives a stochastic gradient descent that converges towards some (at least locally) optimal grids. The distribution of ProjΓ (X) and the resulting quantization error can be obtained as a by-products, especially when in the quadratic case p = 2 (see, e.g. [Pag`es 1997]). Simulations like those carried out in [Pag`es and Printems 2003] for the 2-dim Gaussian distribution confirm what could be a priori expected: the heavier an area is weighted by the quantized distribution, the more points it contains. A first application to numerical probability is proposed in [Pag`es 1997] for numerical integration: if Γ∗ = {x∗,1 , . . . , x∗,N } is an optimal grid for the quadratic quantization of X and if f : Rd → R is C 1 with Lipschitz continuous differential Df then, f (x∗,i ) pi with pi = P(ProjΓ∗ (X) = xi ) Ef (ProjΓ∗ (X)) = 1≤i≤N
|Ef (ProjΓ∗ (X)) − Ef (X)| ≤ [Df ]Lip X − ProjΓ∗ (X)2 = O 2
3
1 N 2/d
.
this shows that weak approximation by quantization can be strictly more performing than strong approximation, so that quantization method may overperform Monte Carlo simulation at least up to 4 dimensions. Numerical experiments carried out in [Pag`es and Printems 2003] even suggest that this naive approach is in fact pessimistic, in particular for not too large values of N . Transferring these ideas to a Markovian dynamics (Xk )k can be done to approximate efficiently the transition distributions i.e. L(Xk | Xk−1 ) and the joint distributions (Xk , Xk−1 ). Two different methods can be implemented : the first approach gives the preference to the approximation of the marginal distributions of the signal Xk at every time k = 0, . . . , n, the second one enhances the preservation of the dynamics, namely the Markov property. In the first case, one approximates at each time k the signal Xk by its marginal optimal quantization: k := ProjΓ (Xk ), k = 0, . . . , n, X k where the grids Γk minimize the Lp -quantization error Xk − ProjΓk (Xk )p among the k )0≤k≤n no longer has the grids with size Nk for every k = 0, . . . , n. The sequence (X Markov property. Then, one defines the approximate quantized filter by simply replacing in the forward explicit formula of the nonlinear filter the conditional law of Xk+1 given k+1 given X k . This approach was originally introduced Xk by the conditional law of X in [Bally and Pag`es 2003] and [Bally et al. 2001] to discretize reflected backward stochastic differential equations. In the Markovian approach, and for a signal-observation model in the form (1.2)-(1.3), one sets k = ProjΓ (Fk (X k−1 , εk )), X 0 = ProjΓ (X0 ). X 0 k k )0≤k≤n remains a Markov chain with respect to the same filtration So, the sequence (X as (Xk ) but is no longer the best Lp -marginal approximation of (Xk )0≤k≤n . Then one n conditional on the considers as an approximate quantized filter the nonlinear filter of X process k−1 , Yk−1 , X k , ηk ), k = 1, . . . , n. Yk = Gk (X This Markovian quantization approach was introduced in [Pag`es et al. 2004a] to approximate numerically some stochastic control problems for multi-dimensional Markov chains. As far as filtering is concerned, both quantization approaches make possible the analysis of the error under some appropriate Lipschitz continuity assumptions on the underlying Markov dynamics. The a priori error bounds are expressed using the quantization errors k−1 , εk ) Zk − ProjΓk (Zk )P , where Zk = Xk in the marginal approach and Zk = F (X in the Markovian approach. Although the methods of proof are significantly different, the a priori error bounds look quite similar for both methods suggesting a balance between the positive and negative features of the two approximation methods. An extensive discussion and comparison of both quantization methods, marginal and quantization, is carried out in [Pag`es et al. 2004b]. For a detailed description of the algorithms we refer to [Bally and Pag`es 2003] and [Pag`es et al. 2004a] in which the methods have been originally introduced. In Section 6, we analyze the practical aspects of the algorithm, in terms of complexity. The most interesting feature of the quantization approach is that, once an optimal
4
quantization of the signal (Xk )0≤k≤n has been processed and kept off-line, it instantaneously produces for any set of observations some reproducible deterministic results. This underlines that the set of observations is reasonably likely since the approximate filter distribution is structurally supported by the quantization of X. This can be implemented with multi-dimensional signal processes, at least up to 4-dimension and possibly higher. This restriction on the dimension comes from the fact that, for a given size N of the quantization, the error does depend on the dimension d of the signal like for numerical integration. In the special case of a stationary signal, the marginal quantization of the whole Markov chain reduces to that of its stationary distribution, so that the the quantization optimization phase – which is clearly – the most demanding one is divided by a factor n in terms of duration and storing. In the recent past years, a technique for approximating the nonlinear filtering problem has received much attention: it is a Monte-Carlo method based on interacting particle systems, see [Del Moral 1998], [Del Moral et al. 2001], [Florchinger and LeGland 1992] and [Crisan and Lyons 1997]; this is a typical “on-line” method (the whole process involves the observation set y and the function f ) while quantization is typically an off-line method (the demanding part of the computations can be stored, those depending on y and f being instantaneous). So it is rather difficult to define a comparison protocol since their field of applications are quite different. It may also be interesting to keep the same filter for close observation samples in order to spare heavy computations. This can be processed by replacing the observations by some discrete values. This point of view is investigated for real-valued observations in [Newton 2000a] and [Newton 2000b]: some functional weak convergence results are stated toward the original filter. In a framework where both X and the observation process are quantized, some a priori error bounds are derived in [Sellami 2004]. The paper is organized as follows. Some preliminaries on nonlinear filtering are provided in Section 2. In particular, we recall the known forward inductive formula for the filter, and also the (less known) backward formula. In Sections 3 and 4, we study the approximate filter by marginal and Markovian quantization respectively, including an explicit error analysis. In Section 5, the convergence of both quantized approximating filters is established. In Section 6, we show how to get optimal grids for both quantization methods, and we discuss their respective qualities and drawbacks from a practical viewpoint, especially concerning the optimization phase. We point out that the marginal quantization approach can be significantly simplified from a computational viewpoint in the important case of stationary signal process. We discuss in Section 7 how our results may be applied to the case of discretely observed diffusions. Finally, we provide in Section 8 several numerical experiments. Firstly, we compare our approximate filter with the explicit Kalman-Bucy filter: one one hand its convergence – with some rate – as the size of the quantization increases is confirmed and on the other hand its stability as n increases. 
Then, we evaluate the approximate filter by quantization in a state model with multiplicative Gaussian noise arising in stochastic volatility models: a converging behaviour is emphasized as the quantization accuracy increases, although no reference value is available. Notations: • For every Borel function f : Rd → R set f ∞ = sup |f (x)|
and
[f ]Lip = sup
x=x
x∈Rd
5
|f (x) − f (x )| . |x − x |
• We use the traditional notation for transition kernels: if P is a bounded transition kernel and f is a bounded measurable function, we write P f for the bounded measurable function P f (x) := f (x )P (x, dx ).
2
Nonlinear filtering: preliminaries and remarks
In this section, we recall some useful facts about nonlinear filtering. Using the Markov property of the pair (X, Y ) and the Bayes formula, one derives the Kallianpur-Striebel formula for the the filter: Πy,n f
:=
πy,n f , πy,n 1
(2.1)
where πy,n is the so-called unnormalized filter defined by πy,n f
=
f (xn )µ(dx0 )
n
gk (xk−1 , yk−1 , xk , yk )Pk (xk−1 , dxk ),
(2.2)
k=1
E [f (Xn )Ly,n ] , n := gk (Xk−1 , yk−1 , Xk , yk ) =
with
Ly,n
(2.3) (2.4)
k=1
(convention y0 = 0). Notice that πy,n 1 = E [Ly,n ] = φn (y).
(2.5)
where φn (y) is the value of the density function φn of (Y1 , . . . , Yn ) with respect to the Lebesgue measure at the observed values y = (y1 , . . . , yn ) ∈ (Rq )n : In the sequel, we shall denote for notational convenience gy,k (x, x ) = gk (x, yk−1 , x , yk ), k ≥ 1. The unnormalized filter can be written using a family of bounded transition kernel Hy,k , k = 1, . . . , n, defined on bounded measurable functions f : Rd → R by Hy,k f (x) := E [ f (Xk )gy,k (x, Xk ) |Xk−1 = x] = f (x )gy,k (x, x )Pk (x, dx ), x ∈ Rd . For convenience we also define
Hy,0 f (x) := πy,0 f = E[f (X0 )] =
f (x0 )µ(dx0 ), x ∈ Rd .
Then, one shows that the unnormalized filter at time k, πy,k := E[f (Xk )Ly,k ], satisfies the inductive formula πy,k f = πy,k−1 Hy,k f, k = 1, . . . , n (2.6) so that πy,n = Hy,0 ◦ Hy,1 ◦ · · · ◦ Hy,n .
6
(2.7)
Equation (2.6) is called the forward expression of the filter. One also derives from the “symmetric” expression (2.7) a backward expression for the filter which turns out to be a useful tool for proofs. Namely πy,n f = uy,−1 (f ) where uy,−1 (f ) is defined as the final value of the backward induction uy,n (f )(x) = f (x),
uy,k−1 (f ) = Hy,k uy,k (f ), k = 0, . . . , n.
(2.8)
Note that in fact uy,k = Hy,k−1 ◦ · · · ◦ Hy,n f . We shall replace the true filter by a computable approximate filter constructed. This will follow from a spatial discretization of the signal process (Xk )k based on optimal quantization. We will propose two types of quantization – marginal and Markovian – leading y,k of the transition kernel Hy,k both based on (2.6). Then to different approximations H we will compute in both cases an approximate distribution π y,n, of the unnormalized filter y,k f . πy,n using a quantized form of the forward: expression (2.6) π y,k f = π y,k−1 H
3
Approximate filter by marginal quantization
3.1
Method
In this section, one considers a marginal quantization of the Markov chain (Xk )k , time by time i.e k = ProjΓ (Xk ), 0 ≤ k ≤ n, X (3.1) k where Γk , k = 0, . . . , n are grids consisting of Nk points xik in Rd , i = 1, . . . , Nk , to be optimized later and ProjΓk denotes the closest neighbour projection on Γk . Notice that k )k is not a Markov chain We construct an approximate filter based on the process (X an approximation of the probability transition Pk (xk , dxk+1 ) of Xk+1 given Xk by the k+1 given X k : probability transition matrix Pk := [Pkij ] of X Pkij
k = xj | X k−1 = xi ], i = 1, . . . , Nk−1 , j = 1, . . . , Nk . = P[X k−1 k
(3.2)
In other words, we approximate the transition kernel Hy,k by the quantized transition kernel y,k given by: H y,k := H
Nk
ij δ i H y,k x
k−1
, k = 1, . . . , n,
(3.3)
j=1
with
ij H y,k
=
= gy,k (xik−1 , xjk )Pkij , i = 1, . . . , Nk−1 , j = 1, . . . , Nk ,
for k = 1, . . . , n, so that, for every function f : Γk → R,
k−1 ) := E gy,k (X k−1 , X k )f (X k ) | X k−1 , k = 1, . . . , n. y,k f (X H Finally, we set y,0 = H
N0
P0i δxi
0
0 = xi ], i = 1, . . . , N0 . with P0i := P[X 0
i=1
7
(3.4)
We then define the approximate unnormalized filter π y,n =
Nn
i δ y,n xin i=1 π
by
y,0 ◦ · · · ◦ H y,n . π y,n = H It is easily computed by the following forward induction: Nk−1 ij i y,0 , π y,k := π y,k = πy,k−1 H y,k−1 H π y,0 = H
, k = 1, . . . , n.
y,k
i=1
(3.5)
j=1,...,Nk
y,n is then given by: The approximate filter Π y,n = Π
Nn
i δ xi Π y,n n
with
i Π Y,n =
Nn
i=1
3.2
i π y,n
i y,n i=1 π
, i = 1, . . . , Nn .
Error analysis
y,n in terms of In this paragraph, we will estimate the accuracy of the approximate filter Π the marginal quantization errors on the signal ∆k 1 , k = 0, . . . , n, defined by: ∆k = Xk − ProjΓk (Xk ).
(3.6)
k ) is not a Markov chain. We shall impose some Lipschitz conNote that the process (X ditions on the Markov transition of Xk and on the conditional law Yk given Xk−1 , Yk−1 , Xk . To be more precise, we first recall some definitions. We say that a probability transition P on Rd is C-Lipschitz for some positive real constant C if for every Lipschitz function ϕ on Rd with ratio [ϕ]Lip , P ϕ is Lipschitz and [P ϕ]Lip ≤ C[ϕ]Lip . Then, we may define the Lipschitz ratio [P ϕ]Lip , ϕ nonzero Lipschitz continuous function < +∞. [P ]Lip = sup [ϕ]Lip (A1)
The Markov transition operators Pk (x, dx ), k = 1, . . . , n, are Lipschitz, so that [P ]Lip := max [Pk ]Lip < +∞. k=0,...,n
(i) For every k = 1, . . . , n, the functions gk are bounded on Rd × Rq × Rd × Rq and we set Kg := maxk=1,...,n gk ∞ . (ii) For every k = 1, . . . , n, there exist two Borel functions (A2) [gk1 ]Lip , [gk2 ]Lip : Rq ×Rq → R+ such that, for all x, x , x , x ∈ Rd and y, y ∈ Rq , |gk (x, y, x , y ) − gk ( x, y, x , y )| ≤ [gk1 ]Lip (y, y )|x − x | + [gk2 ]Lip (y, y )|x − x |. An essential tool for the proof of theorem 3.1 below is to introduce the sequence functions ( uy,k (f ))−1≤k≤n which is the quantized counterpart of the sequence (uk (f ))−1≤k≤n defined in (2.8) as the backward expression of the filter: mimicking this backward dynamic formula, one recursively define the u y,k (f ) on Γk , k = 0, . . . , n, by u y,n (f ) = f,
on the grid Γn ,
y,k+1 u y,k+1 (f ) on the grid Γk−1 , k = −1, . . . , n − 1. u y,k (f ) = H 8
The approximate unnormalized filter π y,n is then given by: = u y,−1 (f )
π y,n f
y,n f | = |uy,−1 (f ) − u y,−1 (f )|. so that |πy,n f − π Theorem 3.1 Assume that (A1) and (A2) hold. Then, for every bounded Lipschitz continuous functions f on Rd and each n-tuple of observations y = (y1 , . . . , yn ), we have, for every p ≥ 1, n Kgn |Πy,n f − Πy,n f | ≤ Bkn (f, y, p)∆k p . (3.7) φn (y) ∨ φn (y) k=0
with φn (y) := π y,n 1,
(3.8) f ∞ 1 [gk+1 [f ]Lip+2 ]Lip (yk , yk+1 )+[gk2 ]Lip (yk−1 , yk ) (3.9) Bkn (f, y, p) := (2 − δ2,p )[P ]n−k Lip Kg n f ∞ [gj1 ]Lip (yj−1 , yj ) + [P ]Lip [gj2 ]Lip (yj−1 , yj ) . +(2 − δ2,p ) [P ]j−(k+1) Lip Kg
j=k+1
(Convention: g0 = gn+1 ≡ 0 and δr,p is for the usual Kronecker symbol) Remark 3.1 Note that φn (y) = π y,n 1 =
Nn
i π y,n
i=1
is the normalizing factor of the approximate filter distribution so that (3.7) produces a completely computable error bound. Remark 3.2 The interesting case for the general Lp -bounds is the case p = 2 where the coefficients Dnk (f, y, p) are smaller than in the L1 -case (other bounds are trivial since the Lp -norm is nondecreasing as a function of p). Remark 3.3 If one introduces [g]Lip :=
max
sup ([gk1 ]Lip (y, y ) ∨ [gk2 ]Lip (y, y )), the
k=1,...,n y,y ∈Rq
Bkn (f, y, p) is upper-bounded by the simpler coefficient: n (f, p) B k
:= (2 − δ2,p )[P ]Lip [f ]Lip n−k
with the usual convention
1 u−1
[P ]Lip + 1 f ∞ n−k +2 [g]Lip 2 + (2 − δ2,p ) ([P ]Lip − 1) . Kg [P ]Lip − 1
× (um − 1) = m when u = 1 and m ∈ N.
Remark 3.4 If the Lipschitz condition (A2)(ii) is weakened into a local Lipschitz one, (A2)(ii ) For every k = 1, . . . , n, there exist two Borel functions [gk1 ]Liploc , [gk2 ]Liploc : Rq × Rq → R+ such that, for all x, x , x , x ∈ Rd and y, y ∈ Rq , gk (x, y, x , y ) − gk ( x, y, x , y ) ≤ [gk1 ]Liploc (y, y )(1 + |x| + |x | + | x| + | x |)|x − x | x| + | x |)|x − x |, +[gk2 ]Liploc (y, y )(1 + |x| + |x | + |
9
then we may state a similar estimate for the approximate filter as in Theorem 3.1: for every p ≥ 1 and every p , q ∈ (1, ∞), 1/p + 1/q = 1: y,n f ≤ Πy,n f − Π
n Kgn n B k (p, pq , f )∆k pp , φn (y) k=0
with
and
n B k (p, r, f )
= (2 − δ2,p )[P ]n−k [f ]Lip Lip [P ]Lip +1 f ∞ +2 [g]Lip 2+(2 − δ2,p ) − 1) Mn (r), ([P ]n−k Lip Kg [P ]Lip − 1 r , r ≥ 1. Mn (r) = 1 + 2Xr + 2X
Here we have set [g]Liploc := max q = max X k q . and X
sup ([gk1 ]Liploc (y, y )∨[gk2 ]Liploc (y, y )), Xq = max Xk q
k=1,...,n y,y ∈Rq
k=0,...,n
k=0,...,n
is an optimal quadratic quantization one shows (see [Graf and Luschgy 2000] Furthermore, when X k = E(Xk | X k ) so that X k r ≤ Xk r for every r ≥ 1. or (B.7) in Appendix B) that X Hence, one may take Mn (r) = 1 + 4Xr , r ≥ 1. We will see in Section 7 (and Appendix A) that Assumption (A2) is satisfied by the conditional density of certain discretely observed diffusion models.
To get the announced error bound, we need three steps. The first one in lemma 3.1 below makes an abstract connection between errors in the unnormalized world and the normalized world (it is used in next section too). The second one, in Lemma 3.2 yields some bound for the Lipschitz coefficient of the functions uy,k (f ) defined in (2.8). In the third step – k )p by a which is the proof of the theorem itself –, we will bound uy,k (f )(Xk ) − u y,k (f )(X k )|. backward induction, having in mind that |πy,n f − π y,n f | = |uy,−1 (f )(Xk ) − u y,−1 (f )(X Lemma 3.1 Let (µy ) and (νy ) two families of finite positive measures on a measurable space (E, E). Assume that there exist two symmetric functions R and S defined on the set of positive finite measures such that, for every bounded Lipschitz function f dµy − f dνy ≤ f ∞ R(µy , νy ) + [f ] S(µy , νy ) (3.10) Lip then
dµy f dν 1 y µy (E) − νy (E) ≤ µy (E) ∨ νy (E) 2f ∞ R(µy , νy ) + [f ]Lip S(µy , νy ) .
Proof: dµy f dνy µy (E) − νy (E) ≤ ≤ ≤
1 | dµy − f dνy | 1 + |f |dνy − µy (E) µy (E) νy (E) νy (E) f ∞ R(µy , νy ) + [f ]Lip S(µy , νy ) + f ∞ − 1 µy (E) µy (E) f ∞ R(µy , νy ) + [f ]Lip S(µy , νy ) + f ∞ |νy (E) − µy (E)| µy (E) 10
Now
|µy (E) − νy (E)| ≤ 1 × R(µy , νy ) so that f dµy f dνy 1 µy (E) − νy (E) ≤ µy (E) 2f ∞ R(µy , νy ) + [f ]Lip S(µy , νy ) .
One completes the proof by a symmetry argument.
Lemma 3.2 Assume (A1) and (A2) hold. Let (yk )k=1,...,n be a generic observation. Then, for every bounded Lipschitz continuous function f , the functions uy,k (f ) defined by (2.8) are bounded Lipschitz continuous as well, with Lipschitz coefficient [uy,k (f )]Lip satisfying 2 ]Lip [uy,k (f )]Lip ≤ [Pk+1 ]Lip Kg [uy,k+1 (f )]Lip + uy,k+1 (f )∞ [gy,k+1 1 ]Lip , +uy,k+1 (f )∞ [gy,k+1
uy,k (f )∞
and
≤ Kgn−k f ∞ ,
k = 0, . . . , n − 1, k = 0, . . . , n.
In particular, for every k ∈ {0, . . . , n}, [uy,k (f )]Lip ≤ ([P ]Lip Kg )n−k [f ]Lip + f ∞ Kgn−(k+1)
n−k
1 2 [P ]−1 ([gk+ ]Lip + [P ]Lip [gk+ ]Lip ). Lip
=1
For notational convenience, we will temporarily drop the dependency in the function f and in the observation sequence y in the proofs below. Proof. One derives the first two formulae from the recursive definition (2.8) of the uk ’s: uk (x) = E [ gk+1 (x, Xk+1 )uk+1 (Xk+1 )| Xk = x] = gk+1 (x, x )uk+1 (x )Pk+1 (x, dx ) and from the Lipschitz property of the transitions Pk (x, dx ): [uk ]Lip
1 ≤ [Pk+1 ]Lip sup [x → gk+1 (x, x )uk+1 (x )]Lip + uk+1 ∞ [gk+1 ]Lip . x∈Rd
Now un ∞ = f ∞ and uk ∞ ≤ Kg uk+1 ∞ so that uk ∞ ≤ Kgn−k f ∞ . Hence [uk ]Lip
2 1 ≤ A[uk+1 ]Lip + B Kg−(k+1) [gk+1 ]Lip + CKg−(k+1) [gk+1 ]Lip
with A = [P ]Lip Kg , B := [P ]Lip f ∞ Kgn and C := f ∞ Kgn . Standard computations complete the proof. k )p , one prok (X Proof of Theorem 3.1. To obtain an upper-bound for uk (Xk ) − u ceeds by induction. Set temporarily for every k ≤ n − 1 and every xk , xk+1 , xk+1 ∈ Rd , ϕ(xk , xk+1 , xk+1 ) := gk+1 (xk , xk+1 )uk+1 (xk+1 ). Then, k )p = E(ϕ(Xk , Xk+1 , Xk+1 ) | Xk ) − E(gk+1 (X k , X k+1 ) k+1 ) | X k )p uk (Xk ) − u k (X uk+1 (X k ) is a Lp -contraction, one gets for every k ∈ {0, . . . , n − 1}, Using that E( . | X k )p ≤ E(ϕ(Xk , Xk+1 , Xk+1 ) | Xk )−E(ϕ(X k , X k+1 , Xk+1 ) | X k )p k (X uk (Xk ) − u k , X k+1 , Xk+1 )−gk+1 (X k , X k+1 ) k+1 )p +ϕ(X uk+1 (X k , X k+1 , Xk+1 ) | X k )p (3.11) ≤ E(ϕ(Xk , Xk+1 , Xk+1 ) | Fk )−E(ϕ(X k+1 )p . +Kg uk+1 (Xk+1 ) − u k+1 (X 11
Now k , X k+1 , Xk+1 ) | X k )p E(ϕ(Xk , Xk+1 , Xk+1 ) | Xk ) −E(ϕ(X k )p + E(uk (Xk ) | X k ) − E(ϕ(X k , X k+1 , Xk+1 ) | X k )p . (3.12) ≤ uk (Xk )−E(uk (Xk ) | X Then
k )p uk (Xk ) − E(uk (Xk ) | X
k )p + E(uk (X k ) − uk (Xk ) | X k )p ≤ uk (Xk ) − uk (X k )p ≤ 2uk (Xk ) − uk (X
since conditional expectation is a Lp -contraction. When p = 2, k ) = min uk (Xk ) − ψ(X k ) , ψ(X k ) ∈ L2 ≤ uk (Xk )−uk (X k ) . uk (Xk )−E(uk (Xk ) | X 2 2 2 k being σ(Xk )-measurable, E(uk (Xk ) | X k ) = E(ϕ(Xk , Xk+1 , Xk+1 ) | X k ), so that Now X k ) − E(ϕ(X k , X k+1 , Xk+1 ) | X k )p E(uk (Xk ) | X
k , X k+1 , Xk+1 )p ≤ ϕ(Xk , Xk+1 , Xk+1 ) − ϕ(X k , X k+1 )p ≤ uk+1 ∞ gk+1 (Xk , Xk+1 ) − gk+1 (X
Consequently, plugging into Equation (3.12) yields k , X k+1 , Xk+1 ) | X k )p E(ϕ(Xk , Xk+1 , Xk+1 ) | Xk ) − E(ϕ(X k p ≤ (2 − δp,2 )[uk ]Lip Xk − X 1 k p + [g 2 ]Lip Xk+1 − X k+1 p . +uk+1 ∞ [gk+1 ]Lip Xk − X k+1 k , plugging into EquaSince we are dealing with marginal quantization ∆k = Xk − X tion (3.11) yields the following induction: for every k ∈ {0, . . . , n − 1}, k )p k (X uk (Xk ) − u with and
≤
k+1 )p + αk ∆k p + βk+1 ∆k+1 p Kg uk+1 (Xk+1 ) − u k+1 (X
1 αk := (2 − δp,2 )[uk ]Lip + uk+1 ∞ [gk+1 ]Lip , 0 ≤ k ≤ n − 1,
βk := [gk2 ]Lip uk ∞ , 1 ≤ k ≤ n.
For notational convenience, one sets αn := (2 − δp,2 )[f ]Lip (in fact αn = [f ]Lip would always be suitable) and β0 := 0. Then, standard computations using Lemma 3.2 yield 0 ) y,n f | = u0 (f )(X0 ) − u 0 (f )(X |πy,n f − π 1 n 0 )p ≤ ≤ u0 (f )(X0 ) − u 0 (f )(X Ckn (f, y, p)∆k p k=0
where, for every 0 ≤ k ≤ n, Ckn (f, y, p) := Kgk−1 (αk Kg + βk ), 1 ]Lip + [gk2 ]Lip ) = (2 − δ2,p )Kgk [uk ]Lip + Kgn−1 f ∞ ([gk+1 [f ]Lip ≤ Kgn (2 − δ2,p )[P ]n−k Lip !" n−k f ∞ 1 1 2 [gk+1 ]Lip +[gk2 ]Lip+(2 − δ2,p ) [gk+m + [P ]m−1 ]Lip +[P ]Lip [gk+m ]Lip Lip Kg m=1
One concludes by Lemma 3.1. 12
4
Approximate filter by Markovian quantization
4.1
Method
This method is based on the Markovian quantization developed in [Pag`es et al. 2004a]. We assume that the signal-observation model is given by (1.2)-(1.3). At each time k = 0, . . . , n, we are given a grid Γk consisting of Nk points xik in Rd , i = 1, . . . , Nk , to be optimized later k , Yk )k defined by: on. We then consider the Markovian process (X k−1 , εk ) , k = 1, . . . , n k = ProjΓ Fk (X (4.1) X k k−1 , Yk−1 , X k , ηk ), k = 1, . . . , n, Yk = Gk (X
(4.2)
0 = ProjΓ (X0 ) and Y0 = Y0 = 0. Here ProjΓ still denotes the closest neighbour with X 0 k projection on Γk . The idea is now to approximate the filter Πy,n by the discrete conditional y,n of X n given that the observations Y = (Y1 , . . . , Yn ) are fixed to y = (y1 , . . . , yn ). law Π n is valued in the finite grid Γn made of Nn points xi , i = 1, . . . , Nn , the discrete Since X n y,n is characterized by its weights Π iy,n = P[X n = xin |Y = y], i = probability measure Π 1, . . . , Nn : for any bounded measurable function f on Rd , we have y,n f Π
=
Nn
i . f (xin )Π y,n
i=1
i y,n = Nn Π In other words, Π i=1 y,n δxin , where δx is the Dirac mass at x. By same arguments k , Yk )k , we have as in Section 2, using Bayes rule and Markov property of (X i π y,n Nn i , i = 1, . . . , Nn , π y,n
i=1 = E 1Xn =xi Ly,n , i = 1, . . . , Nn ,
i Π y,n = i π y,n
where
n
y,n = L
and
n
k−1 , yk−1 , X k , yk ). gk (X
(4.3) (4.4) (4.5)
k=1
From an algorithmic viewpoint, the unnormalized approximate filter π y,n may be computed either in a forward or backward induction in view of 1) or 2) of Section 2. We describe here the forward procedure which is less costly in terms of complexity. We denote 0 , i.e. Pi = P[X 0 = xi ], i = 1, . . . , N0 , and for by P0 = (P0i )i=1,...,N0 the probability law of X 0 0 k = 1, . . . , n, by (Pk )k the probability transition matrix of the finite state space Markovian k )k , i.e. process (X Pkij
k = xj |X k−1 = xi ], i = 1, . . . , Nk−1 , j = 1, . . . , Nk . = P[X k−1 k
(4.6)
y,k by: We introduce the transition matrix H ij = gy,k (xi , xj )Pij , i = 1, . . . , Nk−1 , j = 1, . . . , Nk , (4.7) H k−1 k y,k k Nn i y,n δxin by the following forward for k = 1, . . . , n. We then compute explicitly π y,n = i=1 π algorithm: π y,0
= P0 ,
Nk−1 j π y,k
=
i ij π H y,k y,k−1 , j = 1, . . . , Nk , k = 1, . . . , n.
i=1
13
(4.8)
4.2
Error analysis
y,n in terms of the In this paragraph, we estimate the quality of the approximate filter Π Markovian quantization errors on the signal ∆k 1 , k = 0, . . . , n, defined by: k−1 , εk ) − ProjΓ Fk (X k−1 , εk ) , k ≥ 1 ∆k = Fk (X (4.9) k ∆0 = X0 − ProjΓ0 (X0 ), k = 0.
(4.10)
We make the following Lipschitz assumptions on the model (1.2)-(1.3). (A1’)
For each k = 1, . . . , n, there exists a positive constant [Fk ]Lip such that ∀ x, x ∈ Rd ,
Fk (x, εk ) − Fk ( x, εk )1
≤ [Fk ]Lip |x − x |.
We then set [F ]Lip = max [Fk ]Lip . k=1,...,n
Comments on assumptions (A1’) and (A1): Note that [Pk ]Lip ≤ [Fk ]Lip . In fact it may happen for some models that [Pk ]Lip < ∞ = [Fk ]Lip which means that the field of application of the marginal quantization is wider than that of Markovian quantization, at least theoretically. For example, set Xk+1 = sign(Xk − εk+1 )G(Xk , εk+1 ), where (εk )k is an i.i.d. sequence, Pε1 (du) = g(u) λq (du) (λq lebesgue measure on Rq ) and (x, u) → G(x, u) is Lipschitz continuous in x uniformly in u with ratio [G]Lip . Then, one easily shows that [P ]Lip ≤ [G]Lip < +∞ whereas x → F (x, ε1 ) = sign(x − ε1 )G(x, ε1 ) is not even continuous so that (A1 ) is not satisfied. We also rely on assumption (A2) introduced in Section 3 for the marginal quantization. Remark 4.1 Beyond natural discrete time dynamics, we notice that Assumption (A1’) is satisfied by a usual discretization scheme of Gaussian diffusions like the Euler scheme. On the other hand, Assumption (A2) often needs to be slightly strengthened to encompass diffusion discretization schemes. This is the aim of Remarks 4.3 and 3.4 which provide a setting often fulfilled by the Euler scheme of nondegenerate diffusions. Theorem 4.1 Assume that (A1’) and (A2) hold. Then, for every bounded Lipschitz continuous function f : Rd → R and any sequence of observed values y = (y1 , . . . , yn ) ∈ (Rq )n , we have the a-posteriori estimation y,n f ≤ Πy,n f − Π
n Kgn Ank (f, y)∆k 1 . φn (y) ∨ φn (y) k=0
14
(4.11)
with
φn (y) := π y;n 1, Ank (f, y)
=
(4.12)
[F ]n−k [f ]Lip + 2 Lip +2
f ∞ 2 [g ] (yk−1 , yk ) Kg k Lip
(4.13)
n 1 f ∞ [F ]j−k−1 [gj ]Lip (yj−1 , yj ) + [F ]Lip [gj2 ]Lip (yj−1 , yj ) , Lip Kg j=k+1
(with the convention that the sum in (4.13) is zero for k = n). Remark 4.2 By introducing [g]Lip := max
sup ([gk1 ]Lip (y, y )∨[gk2 ]Lip (y, y )), the Ank (f, y)
k=1,...,n y,y ∈Rq
is bounded by the simpler quantity n (f ) = [F ]n−k [f ] + 2f ∞ [g] A k Lip Lip Lip Kg with the usual convention that
1 m u−1 (u
[F ]Lip + 1 n−k ([F ]Lip − 1) + 1 , [F ]Lip − 1
− 1) = m if u = 1 and m ∈ N.
Remark 4.3 If the Lipschitz condition (A2)(ii) is weakened into (A2)(ii ) like in Section 3 and if (A1) is strengthened into the slightly more stringent condition than (A1’), (A1’p )
For each k = 1, . . . , n, there exists a positive constant [Fk ]Lip such that x, εk )p ≤ [Fk ]Lip |x − x |, Fk (x, εk ) − Fk (
for all p ∈ (1, ∞) and x, x ∈ Rd , then we may state a similar estimate for the approximate filter as in Theorem 4.1: for each p ∈ (1, ∞) (and 1/p + 1/q = 1): Πy,n f − Πy,n f ≤
with and
n Kgn A¯nk (q, f )∆k p , φn (y) ∨ φn (y) k=0 [F ]Lip + 1 2f ∞ n n−k n−k [g]Lip A¯k (q, f ) = [F ]Lip [f ]Lip + ([F ]Lip − 1) + 1 Nn (q), Kg [F ]Lip − 1 n n−1 ∆l q . Nn (q) = 1 + 4Xq + (1 + [F ]Lip ) [F ]Lip ∨ 1 l=0
Here we have set [g]Lip := max Xk q .
max
sup ([gk1 ]Liploc (y, y ) ∨ [gk2 ]Liploc (y, y )) and Xq =
k=0,...,n y,y ∈Rq
k=0,...,n
In order to prove Theorem 4.1, we need the following Lemmata. Lemma 4.1 Assume (A2) holds. Then, for all y1 , . . . , yn ∈ Rq , we have n n−1 k−1 | + [g 2 ] (yk−1 , yk )|Xk − X k |. ≤ K L − L [gk1 ]Lip (yk−1 , yk )|Xk−1 − X y,n y,n g k Lip k=1
15
n on y1 , . . . , yn . Proof. In order to alleviate notations, we omit the dependence of Ln and L From (2.4) and (4.5), we have for all k = 1, . . . , n, k = k−1 , yk−1 , X k , yk ) Lk−1 Lk − L gk (Xk−1 , yk−1 , Xk , yk ) − gk (X k−1 , yk−1 , X k , yk )(Lk−1 − L k−1 ). +gk (X From the boundedness condition (A2)(i) on gk , we have Lk−1 ≤ Kgk−1 . Hence, by Assumption (A2)(ii), we get k−1 1 2 − L ] (y , y )|X − X |+[g ] (y , y )|X − X | +K − L [g ≤ K L L k g k−1 k k−1 k−1 k k k−1 . g k Lip k−1 k k Lip k−1 k 0 = 1, we obtain the required result by induction. Noting that L0 = L
Lemma 4.2 Assume that (A1’) holds. Then for each k = 0, . . . , n, we have k Xk − X 1
k
≤
[F ]k−j ∆j 1 . Lip
j=0
k , and (4.9) of ∆k , we obviously Proof. From the definitions (1.2) and (4.1) of Xk and X get for each k ≥ 1: # # # k ≤ # F Xk − X (X , ε ) − F ( X , ε ) # k k−1 k k k−1 k # + ∆k 1 . 1 1
k−1 , we then obtain: By Assumption (A1’) and since εk is independent of Xk−1 and X k Xk − X 1
k−1 + ∆k . ≤ [Fk ]Lip Xk−1 − X 1 1
0 = ∆0 , we conclude by backward induction. Recalling that X0 − X 1 1
Proof of Theorem 4.1: From expressions (2.3) and (4.4), one derives that n )L y,n ]| y,n f | = |E[f (Xn )Ly,n ] − E[f (X |πy,n f − π y,n | + [f ] E[|Xn − X n |L y,n ] ≤ f ∞ E|Ly,n − L lip y,n | + [f ] K n Xn − X n . ≤ f ∞ E|Ly,n − L g 1 lip
Lemmata 3.1,4.1 and 4.2 complete the proof.
5
Convergence of the quantized filters. Optimization
In both approaches the error analysis leads to some a priori error bounds (4.11) and (3.7) with the same structure, from which one derives a slightly looser upper-bound given by y,n f | ≤ f ∞ ∨ [f ] |Πy,n f − Π Lip
n Kgn Dkn (y, p)∆k p , φn (y) k=0
for all p ∈ [1, ∞), where ∆k = Zk − ProjΓk (Zk ) is the difference between a simulated random variable, say Zk and its projection ProjΓk (Zk ) following the closest neighbour rule 16
onto the grid Γk with size |Γk | = Nk the Markovian quantization approach $ Bkn (f0 , y, p) n Dk (y, p) = Ank (f0 , y)
(in the marginal quantization method Zk = Xk , in k−1 , εk )). Here Zk = F (X for the marginal quantization, for the Markovian quantization(p = 1)
|x| where f0 (x) = 1+|x| (so that f0 ∞ = [f0 ]Lip = 1). If one assumes that, at every time k = 0, 1, . . . , n, the grid Γk is Lp -optimal i.e. minimizes the mean Lp -quantization error among all grids with size Nk , then the asymptotic of optimal quantization stated in Zador’s Theorem 8.1 implies that, for every k = 0, . . . , n, there is a positive real constant θk := θ(p, PZk , d) such that −1/d ∆k p ≤ θk Nk
so that
y,n f | ≤ f ∞ ∨ [f ] |Πy,n f − Π Lip
n Kgn −1/d θk Dkn (y, p)Nk . φn (y)
(5.1)
k=0
Theorem 5.1 Let n ≥ 1. In both marginal and Markovian settings, the optimally quantized approximate filters converge toward the true filter as min1≤k≤n Nk goes to infinity. Application to optimal dispatching: If some numerical bounds d¯nk and θ¯k are available for Dkn (y, p) (locally) uniformly in the observations y = (y1 , . . . , yn ) and for θk (which requires some information on the density of Zk ), then one may easily solve numerically the optimal allocation problem min
N0 +···+Nn =N
n
−1/d θ¯k d¯nk (y, p)Nk
(5.2)
k=0
to optimally dispatch the N points among the n + 1 time steps. For more details we refer to [Bally and Pag`es 2003] and [Pag`es et al. 2004a] in which this phase has been carried out in different settings. In some situations, like the marginal quantization of a stationary signal, it may happen that alternative approaches turn out to be more efficient: optimizing only one huge grid and its transition parameters and then replicate it at every time produces better results. Application to discretized diffusions: We now discuss the convergence of the quantized filter when n also goes to infinity. This asymptotic is relevant especially when the signal (Xk )0≤k≤n is a time-discretization with step h = T /n of a continuous-time signal (Xt )0≤t≤T . For example, if Xt follows a diffusion process: dXt = b(Xt )dt + σ(Xt )dWt , with W a standard Brownian motion, one may discretize it by an Euler scheme: % T T Xk+1 = F (Xk , εk+1 ) := Xk + b(Xk ) + σ(Xk ) εk+1 , n n where (εk )k is a Gaussian white noise process. Standard computations show that when b and σ are Lipschitz, condition (A1’) is satisfied with: [F ]Lip
= 1+ 17
c , n
for some positive constant independent of n. We then easily see that Dkn (y, p), k = 0, . . . , n, ¯ := is bounded by a constant independent of k, n. Therefore if one simply assigns Nk = N N/(n + 1) points at each grid Γk , k = 0, . . . , n, (5.1) provides a rate of convergence for the approximate filters of order: Kgn n + 1 . ¯ d1 φn (y) N This has to be compared with the rate of convergence obtained by the particle Monte-Carlo ¯ interacting particles, see [Del Moral et al. 2001]: methods using N Kgn n 1 . ¯ 12 φn (y) N
6
Some facts about practical implementation
6.1
General features, complexity
At this stage, it is important to mention when and how a quantization method can be implemented. First, one must keep in mind that it is an off-line method: a significant part of the computations can be carried out and kept off-line. In fact things need to be done that way to make the method fully competitive. A natural framework to implement the quantization approach is to assume that the probabilistic features of the state process (Xk ) does not change too fast and/or too often. This is not a real restriction when one thinks about some usual applications of filtering problems (surge of the sea, satellite pursuit, financial modelling based on stochastic volatility,. . . ). Moreover, the more functions f one needs to estimate for a given set of observations, the more efficient the method becomes. This can be easily understood when one describes in more details the three phases of the filter approximation. Phase I (Off-line optimization): This phase is devoted to the construction of the weighted optimal quantization tree of the Markov process (Xk ), given that it contains a total number of N points. This means – specifying the sizes Nk of the grids Γk , 1 ≤ k ≤ n (a priori dispatching). – optimizing the every grid Γk := {xik , 1 ≤ i ≤ Nk }, 1 ≤ k ≤ n that is solving the optimization problem min Xk − ProjΓk (Xk )2 |ΓK |≤Nk
k = xj | X k−1 = xi ), 1 ≤ i ≤ Nk , – computing the transition weights Pkij = P(X k−1 1 ≤ j ≤ Nk+1 , k = 0, 1, . . . n. The dispatching is made a priori, based either on the minimization of the theoretical error bounds (see [Bally and Pag`es 2003] or [Pag`es et al. 2004b]) by solving (5.2) or on more specific features of the state process (see the stationary case below). The optimization of the grids results from a stochastic gradient descent called Competitive Learning vector Quantization based on Monte Carlo simulations of the (Xk ). The computation of 18
the transition weights and of the quantization error are carried out either simultaneously or by a new Monte Carlo simulation (see [Bally and Pag`es 2003]). They have been extensively investigated in former papers and we refer to them for a precise description of the procedure (see e.g. [Bally and Pag`es 2003], [Pag`es et al. 2004b]). This optimization phase is computationally the most demanding. This means less than 10 minutes of CPU time to compute the whole quantization tree (“height” n = 20, size N = 20 000) of a(n asymptotically) non-stationary 4-dimensional process using a 1 GHz micro-processor. However, in many cases this phase can be significantly shortened: when, (Xk ) is a stationary process only one grid is necessary (see section 6.2 below), which drastically cuts down the procedure by a factor n. Furthermore, if X is a Gaussian process, a library of optimal grids with various sizes is now available for the d-dimensional normal distributions N (0; Id ) (see [Pag`es and Printems 2003], the files are available at the URL – www.proba.jussieu.fr/pageperso/pages.html or – www.univ-paris12.fr/www/labos/cmup/homepages/printems). The crucial fact is that the optimization phase does not depend on the observations which explains why its results can be kept off-line. Phase II (Computation of the quantized filter distribution): One computes the j )1≤j≤Nn by plugging the observation vector y = (y1 , . . . , yn ) and the weight vector ( πy,n transition weights Pkij into the forward representations of the approximate filter i.e. (3.5) or (4.8). The theoretical complexity of the quantization tree descent is n−1 k=0 Nk Nk+1 which is nN 2 N2 N at least (n+1)2 ≈ n (when Nk = n+1 , k = 0, . . . , n). In practice, many transitions are 0 (when xik−1 and xjk are remote) and every knot xik of the tree has approximately the same number ν of “active connections” with knots of time k + 1. So, the resulting complexity, ¯ where after an appropriate pruning of the quantization tree, is approximately ν × n × N ¯ N = N/(n + 1) is the average number of points per time step. In all our numerical 1 experiments, this phase is almost instantaneous (less than 10 second with N = 20 000 points in dimension d = 4 with the same 1 GHz micro-processor). This distribution approximation phase does not depend on the function f . y,n f ): one computes for every (needed) function f Phase III (Computation of Π
y,n (dx) = f (x)Π
Nn
i f (xin )Π n
i=1
The complexity of this phase is proportional to Nn and is negligible (although depending on f ). ¯ new particles following In particle methods, at every time step, one needs to simulate N ¯ a (weighted) empirical measure (with a support of size N ). This requires first to compute ¯ random numbers and then to the weights of the empirical measure, secondly to generate N ¯ simulate by an inverse distribution function methods N appropriately distributed numbers. ¯ log(N ¯ )) comparisons The average complexity of the last phase cannot be lower than O(N (see [Devroye 1986], indeed a naive approach yields the a.s. worst possible complexity that 19
¯ 2 )) which makes an average total number of (n + 1)N ¯ log(N ¯ ) comparisons and O(N ¯) is O(N multiplications. It may happen for some observation vectors that the optimal filter and the prior distribution of the process X assign some masses at significantly different areas of the space, making the algorithm less efficient. One way to prevent this problem is to quantize the observation process to evaluate the likelihood of an observation vector (see [Sellami 2004]).
6.2
A special case of interest: marginal quantization of a stationary signal
In the case where the signal (Xk )0≤k≤n is a stationary Markov chain with distribution ν, the optimal Lp -quantization whole chain clearly amounts to that of its stationary & 1 of the ' N distribution ν. Let Γ := x ,...,x be a N := N/(n + 1)-optimal grid i.e. such that X − ProjΓ (X)p =
min
Γ⊂Rd , |Γ|≤N
X − ProjΓ (X)p .
k := Γ, 0 ≤ k ≤ n make up the optimal quantization of the chain. The companion Then Γ parameters are that is P0 = Proj (X0 ) – the quantization of the distribution ν induced by Γ, Γ
– and a single transition matrix 1 = x 0 = x j | X i ), Pkij = P1ij := P(X
0 ≤ i, j ≤ N .
The size of the parameters to be stored is obviously divided by a factor n (or the possible quantization size for the distribution ν and the transition matrix is multiplied by n). This ability to take into account the stationarity of the signal process is an interesting feature of the optimal quantization approach which seems not to be shared by other numerical methods.
7
Discretely observed diffusions
In this section, we discuss how our previous results can be applied when the signalobservation process evolves according to a stochastic differential equation in the form: dXt = b(Xt )dt + σ(Xt )dWt , X0 µ
(7.1)
dYt = β(Xt , Yt )dt + γ(Xt , Yt )dBt , Y0 = 0,
(7.2)
where W is a d-dimensional Brownian motion independent of the q-dimensional Brownian motion B, µ is a known distribution, and b, β, σ, γ are known functions. We set Σ(x) = σ(x)σ(x)∗ and Λ(x, y) = γ(x, y)γ(x, y)∗ . We assume that x → Σ(x) and (x, y) → Λ(x, y) 1 1 are uniformly nondegenerate functions and we denote by x → Σ 2 (x) and (x, y) → Λ 2 (x, y) their square roots functions(1 ) which are clearly uniformly nondegenerate. The process (X, Y ) is Markov with a transition semi-group denoted by (Rt )t . 1
Every nonnegative symmetric S matrix admits a unique square root S 1/2 satisfying S 1/2 is nonnegative, symmetric, S 1/2 S 1/2 = S and S 1/2 commutes with S.
20
We consider here that the sample path (Yt ) is observed at n discrete times with regular sampling interval, say 1. Our aim is then to compute the filter Πy,n of Xn conditional on the observations (Y1 , . . . , Yn ) settled at y = (y1 , . . . , yn ). The sequences (Xk , Yk )k∈N and (Xk )k∈N are (homogeneous) Markov chain with transitions R1 (x, y, dx , dy ) and P (x, dx ) = R1 (x, y, dx , Rq ). Under suitable conditions on the coefficients of the diffusion (7.1)-(7.2), for example if the functions b, σ, β, γ are twice differentiable with bounded derivatives of all orders up to 2, the transition R1 (x, y, dx , dy ) admits a density (x , y ) → r(x, y, x , y ). Hence, we are in the situation (H) of the introduction: the law of Yk conditional on (Xk−1 , Yk−1 , Xk ) = (x, y, x ) admits density y → g(x, y, x , y ) given by g(x, y, x , y ) =
r(x, y, x , y ) p(x, x )
where p(x, x ) = r(x, y, x , y )dy is the density of the transition P (x, dx ). But we do not know explicitly the density r (and so g) and we have to approximate it by an Euler scheme. We closely follow here the arguments of [Del Moral et al. 2001]. More precisely, for a step size 1/m, and given a starting point (x, y) ∈ Rd × Rq , we define by induction the variables (m)
X(x)0
= x,
(m)
(m)
X(x)i+1 = X(x)i (m)
Y (x, y)0
(m)
+ b(X(x)i
)
1 (m) εi+1 + σ(X(x)i ) √ , m m
= y,
(m)
(m)
Y (x, y)i+1 = Y (x, y)i
(m)
+ β(X(x)i
(m)
, Y (x, y)i
)
1 (m) (m) ηi+1 + γ(X(x)i , Y (x, y)i ) √ , m m
for i = 0, . . . , m − 1, where the (εi )i and (ηi )i are independent sequences of i.i.d centered (m) Gaussian vectors with unit covariance matrices. We denote by R1 (x, y, dx , dy ) the law (m) (m) (m) of (X(x)m , Y (x, y)m ). Then R1 (x, y, dx , dy ) has a density (x , y ) → r(m) (x, y, x , y ) explicitly given by r
(m)
(x, y, x , y ) =
m−1
φ(xi , xi+1 )ψ(xi , yi , yi+1 )dx1 . . . dxm−1 dy1 . . . dym−1 ,
i=0
with (x0 , y0 ) = (x, y), (xm , ym ) = (x , y ) and ( 2 " d −1 2 1 m m b(x) , 1 exp − Σ 2 (x) φ(x, x ) = x − x − d 2 m 2 2 (2π) det Σ (x) ( " q −1 m 1 m2 β(x, y) 2 1 exp − Λ 2 (x) y −y− ψ(x, y, y ) = q . 2 m (2π) 2 det Λ 2 (x, y) The density r(m) (x, y, x , y ) is an approximation of the density r(x, y, x , y ). More precisely, we have from [Bally and Talay 1996] the existence of constants C and C depending only
21
on the coefficients b, β, σ, γ such that: r(x, y, x , y ) + r(m) (x, y, x , y ) |x − x | + |y − y | >
2 m
C exp −C (|x − x |2 + |y − y |2 ) ,
≤
(7.3)
=⇒ |r(x, y, x , y ) − r(m) (x, y, x , y )|
C exp −C(|x − x |2 + |y − y |2 ) . (7.4) m (m) The law P (m) (x, dx ) of X(x)m has a density x → p(m) (x, x ) = r(m) (x, y, x , y )dy . We have then an approximation of g(x, y, x , y ) given by ≤
g (m) (x, y, x , y ) =
r(m) (x, y, x , y ) . p(m) (x, x )
(7.5)
y,n defined by the marginal quantization algorithm We then approximate Πy,n by Π in Section 3 where we replace the unknown function g by g (m) . The estimation error is measured through (m) f ≤ Πy,n f − Π(m) f + Π(m) f − Π (m) f , (7.6) Πy,n f − Π y,n y,n y,n y,n (m)
(m)
where Πy,n is the filter given by the formulas (2.1)- (2.2), with the transition probability distribution P (x, dx ) = p(x, x )dx replaced by P (m) (x, dx ) = p(m) (x, x )dx and the conditional density g replaced by g (m) . Actually, from the preliminaries in Section 2, we recall that the true filter Πy,n is given in an inductive form by: Πy,k−1 Hy,k f Πy,k = , k = 1, . . . , n, Πy,k−1 Hy,k 1 f (x )g(x, yk−1 , x , yk )P (x, dx ) = f (x )r(x, yk−1 , x , yk )dx , Hy,k f (x) =
Πy,0 = µ, with
(m)
while the approximate probability measure Πy,n is given by: (m) Πy,0 (m)
= µ,
with Hy,k f (x) =
(m) Πy,k
=
(m)
(m)
(m)
(m)
Πy,k−1 Hy,k f
, k = 1, . . . , n,
Πy,k−1 Hy,k 1
f (x )g (m) (x, yk−1 , x , yk )P (m) (x, dx ) =
f (x )r(m) (x, yk−1 , x , yk )dx .
By using estimations (7.3)-(7.4), it is easily checked that for any bounded function f , (m)
Hy,k f − Hy,k f ∞ ≤
C f ∞ , m
for some positive constant C independent of m and y. Therefore, by Proposition 2.1 in [Del Moral et al. 2001], the first term in (7.6) is estimated by: (m) f − Π f Π y,n y,n ≤
ρn (y)n+1 − ρn (y) C f ∞ , m Kg (ρn (y) − 1)
with ρn (y) = 2Kgn /φn (y). The second term in (7.6) is given by estimation result as in Theorem 3.1 provided one can check some Lipschitz condition for g (m) . Actually, it is 22
proved in the Appendix that when the functions b, σ and γ are constant, there exists a positive constant C (independent of m) such that, for all x, x , x , x ∈ Rd and y, y ∈ Rq , q+1 (m) x, y, x , y ) ≤ Cm 2 1 + |x| + |x | + | x| + | x | |x − x | (7.7) g (x, y, x , y ) − g (m) ( q+3 +Cm 2 1 + |x| + |x | + | x| + | x | |x − x |.
8
Numerical illustrations
Two numerical illustrations are proposed: one with the Kalman-Bucy model derived from the noisy observation of the discretization of an Ornstein-Uhlenbeck model, the second one with a stochastic volatility model arising in financial time series.
8.1
The Kalman-Bucy model Xk = AXk−1 + τ εk Yk = BXk + θηk
∈ Rd ∈R
(8.8)
q
(8.9)
for k ∈ N, and X0 is normally distributed with mean m0 = 0 and covariance matrix Σ20 . Here A, B τ and θ are matrices of appropriate dimensions, and (εk )k≥1 , (ηk )k≥1 are independent centered Gaussian processes, εk N (0, Id ), ηk N (0, Id ). In this case, we have 2 1 1 ) exp − θ−1 (y − Bx) . gk (x, y) = g(x, y) = 2 (2π)d/2 det(θθ ) n n If |||A||| < 1, then, (Xk )k≥0 is stationary iff Σ20 = n≥0 (A τ )(A τ ) (unique solution of Σ20 = AΣ20 A + τ τ where M denotes the transpose of M ). Of course, the filter Πy,n is explicitly known, see e.g. [Elliott et al. 1995]: it is a Gaussian distribution of mean mn and covariance matrix Cn given by the inductive equations: Ck+1 = (Id − Kk+1 B)(τ τ + A Ck A ), mk+1 = A mk + Kk+1 (yk+1 − BA mk ), where
C0 := 0, m0 := 0,
Kk+1 = (τ τ + A Ck A )B (B(τ τ + A Ck A )B + θθ )−1 .
Note that mn depends on the observation vector y whereas Cn does not. The above system can be seen as the Euler scheme with step ∆t of the following (linear) Gaussian diffusion system dX(t) = −αX(t)dt + σX dW X (t) with < W Y , W Y >≡ 0 dY (t) = B dX(t) + σY dW Y (t) √ √ if one sets A = Id − ∆t α, τ = ∆t σX , θ = ∆t σY . The numerical experiments has been carried out as follows: the (stationary) process X is quantized by marginal quantization. However, we decided to use grids which are not optimal for the stationary distribution N (0; Σ20 ). Instead, we selected some grids of the form Γ0 = Σ0 Γ∗ := {Σ0 ξ, ξ ∈ Γ∗ } 23
where Γ∗ is L2 -optimal for N (0; Id ). This induces slightly less accurate results but illustrates the robustness of the method and the interest of keeping off-line some tabulations. Two kinds of tests have been carried out with the Kalman-Bucy filter in order to track ¯ ) and n. The choice of a the behaviour of the quantized filter as a function of N (or N stationary setting is motivated by the possibility to detect more simply the dependency in these parameters. However, some simulations carried out using the same model, but starting at some deterministic value X0 = 0 yield quite similar results for the appropriate architecture of the “quantization tree” see [Sellami 2004]. This tree was made up with (non optimal) grids suitably scaled from optimal grids for the normal distribution. ¯ = Test 1: Convergence of the filter at a fixed instant n as a function of the size N N/(n + 1) of the grid Γ0 . We set ∆t = 1/250, n = 15 and considered three functions f1 (x) = xd ,
f2 (x) = |x|2 ,
f3 (x) = e−|x
d|
n fi , i = 1, 2, 3). that we implemented in dimensions d = 1 and d = 3 (closed forms exist for Π In both dimensions the results are summed up in a figure showing for every function fi , i = ¯ → Π n (fi ) (or N ¯ → |Πn (fi ) − Π n (fi )|) and logN ¯ → log|Πn (fi ) − Π n (fi )| 1, 2, 3, the graphs N (i.e. a log-scale) with its least square regression line denoted by “y = −ax + b” (a and b appearing as numerical values). This means that b n (fi )| ≈ e . |Πn (fi ) − Π ¯a N
• d = 1: α = B = 1, σ X = 0.5 and σ Y = 1 so that A = 0.996, τ = 0.0316 (so that τ ¯ runs in the interval [50, 400]. Σ0 = √1−A ≈ 0.354) and θ = 0.0663. The grid size N 2 Figure 1 contains the results. Furthermore, we added in Figure 2, for every function ¯ → |Πn (fi ) − Π n (fi )| for three different observation vectors. fi , i = 1, 2, 3, a graph N 1.4445 0.5556 0.7778 0.9942 −0.00222 −0.0031 • d = 3 : let α := 0.5556 0.9445 0.2222 so that A = −0.0022 0.9962 −0.0009 ; 0.7778 0.2222 1.6110 −0.00311 −0.0009 0.9936 1.0118 0.0900 0.1349 0.1079 0.0317 0.0444 set τ = 0.0317 0.0793 0.0127 so that Σ0 = 0.0900 0.9219 0.0449 . Then, set 0.1349 0.0449 1.0343 0.0444 0.0127 0.1173 ¯ B = I3 , θ = 0.5 I3 . The grid size N runs in the interval [50, 600]. Figure 3 contains the results. ¯ ¯ Test 2: Stability of the filter for a fixed grid size We con* N = Nmax as n grows. + 1.1625 −0.8488 sider now a d = 2-dimensional model with α = so that A := −0.8488 1.5875 * + * + 0.99535 0.00340 0.08830 −0.02419 , τ := , B = I2 , θ := 0.5I2 (so that Σ0 = 0.003340 0.99365 −0.02419 0.10041 * + 1.01976 0.10041 ¯ = 600 with n running from 1 up to 100. In Figure 4(left) ). We set N 0.10041 0.969493 are depicted respectively n (xi ), i = 1, 2 n → Πn (xi ), n → Π
and 24
n →
n (|x|) − Πn (|x|) Π , Πn (|x|)
n ∈ [1, 100],
and the linear regression line of these relative errors. These results are much more satisfactory than those induced by the a posteriori errors bounds obtained in Theorem 3.1 or Theorem 4.1, although the process (Xk ) is not “rapidly mixing” since |||A||| is close to 1 (this explains why the regression line is not completely flat). When |||A||| is less than 0.8 like in Figure 4(right), the true value and the quantized one become indistinguishable. This means that, like for interacting particle methods, the mixing property of the state variable X induces the stability of the filter as n increases. However, we have not yet theoretical results to support this fact.
8.2 A stochastic volatility model

We consider a state model with multiplicative Gaussian observation noise:

Y_k = \sigma(X_k)\,\eta_k \in \mathbb{R}, \qquad\text{with}\qquad X_k = \rho\, X_{k-1} + \varepsilon_k \in \mathbb{R}, \qquad (8.10)

where \rho is a real constant, \sigma(\cdot) is a positive Borel function on \mathbb{R} and (\varepsilon_k)_{k\ge1}, (\eta_k)_{k\ge1} are independent Gaussian white noise sequences. In terms of financial modelling, (Y_k)_{k\ge0} represents a (martingale) asset price model with stochastic volatility \sigma(X_k). We still consider (8.10) as an Euler scheme, with step size \Delta t, of a continuous-time Ornstein-Uhlenbeck stochastic volatility model

dX(t) = -\alpha X(t)\,dt + \tau\, dW(t), \qquad 0 \le t \le 1,

with positive parameters \alpha and \tau. We then suppose

\rho = 1 - \alpha\,\Delta t, \qquad \varepsilon_k \sim N(0, \tau^2 \Delta t) \qquad\text{and}\qquad \eta_k \sim N(0, \Delta t).

The filtering problem consists in estimating the volatility \sigma(X_n) at step n given the observations of the prices (Y_0, \ldots, Y_n). Here,

g_k(x, y) = g(x, y) = \frac{1}{\sqrt{2\pi\Delta t}\,\sigma(x)} \exp\Big(-\frac{y^2}{2\,\sigma^2(x)\,\Delta t}\Big).

The values of the parameters in our simulation are (\alpha, \tau, \Delta t) = (1, 0.5, 1/250). The Gaussian distribution of X_0 is chosen so that the sequence (X_k)_{k\ge1} is stationary, i.e. X_0 \sim N(0, \Sigma_0^2) with \Sigma_0 = \tau\sqrt{\Delta t/(1-\rho^2)} = \tau/\sqrt{\alpha(2-\alpha\Delta t)} \approx 0.354. The selected volatility function is

(ABS) \equiv \sigma(X_k) = \gamma + |X_k|, \qquad\text{with } \gamma = 0.05.
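As an illustration (not from the paper), here is a minimal simulation of model (8.10) under the (ABS) specification, together with the observation density g; function and parameter names are ours, and Y_0 = 0 follows the convention used throughout the paper.

```python
import numpy as np

def simulate_sv(n, alpha=1.0, tau=0.5, dt=1.0 / 250, gamma=0.05, seed=0):
    """Simulate the stochastic volatility model (8.10) with sigma(x) = gamma + |x| (ABS)."""
    rng = np.random.default_rng(seed)
    rho = 1.0 - alpha * dt
    Sigma0 = tau / np.sqrt(alpha * (2.0 - alpha * dt))    # stationary std, ~ 0.354 here
    X = np.empty(n + 1)
    Y = np.empty(n + 1)
    X[0] = Sigma0 * rng.standard_normal()                 # stationary start: X_0 ~ N(0, Sigma0^2)
    Y[0] = 0.0                                            # convention Y_0 = 0
    for k in range(1, n + 1):
        X[k] = rho * X[k - 1] + tau * np.sqrt(dt) * rng.standard_normal()   # eps_k ~ N(0, tau^2 dt)
        Y[k] = (gamma + abs(X[k])) * np.sqrt(dt) * rng.standard_normal()    # eta_k ~ N(0, dt)
    return X, Y

def g(x, y, dt=1.0 / 250, gamma=0.05):
    """Conditional density g(x, y) of Y_k given X_k = x (cf. the formula above)."""
    s = gamma + np.abs(x)
    return np.exp(-y ** 2 / (2.0 * s ** 2 * dt)) / (np.sqrt(2.0 * np.pi * dt) * s)
```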
In Figure 5 are depicted the graphs \bar N \to \hat\Pi_n(f_i), whose stabilization strongly suggests convergence for the three functions (although we have no reference value at hand).
Appendix A: Lipschitz condition on the conditional density of the Euler scheme

We consider the particular case where the coefficients b, \sigma and \gamma of the diffusion (X, Y) in (7.1)-(7.2) are constants. We then assume w.l.o.g. that \sigma = I_d and \gamma = I_q. We also assume that the function \beta is bounded and differentiable with bounded derivatives. We show in this appendix that
the conditional density g^{(m)} of the Euler scheme is then bounded and satisfies the local Lipschitz condition (7.7). First, we recall that

g^{(m)}(x_0, y_0, x_m, y_m) = \frac{r^{(m)}(x_0, y_0, x_m, y_m)}{p^{(m)}(x_0, x_m)} \qquad (A.1)

with

r^{(m)}(x_0, y_0, x_m, y_m) = \int \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, \psi(x_i, y_i, y_{i+1}) \, dx_1 \ldots dx_{m-1}\, dy_1 \ldots dy_{m-1},

p^{(m)}(x_0, x_m) = \int \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1}) \, dx_1 \ldots dx_{m-1},

and

\varphi(x, x') = \frac{m^{d/2}}{(2\pi)^{d/2}} \exp\Big(-\frac{m}{2}\Big|x' - x - \frac{b}{m}\Big|^2\Big),
\qquad
\psi(x, y, y') = \frac{m^{q/2}}{(2\pi)^{q/2}} \exp\Big(-\frac{m}{2}\Big|y' - y - \frac{\beta(x, y)}{m}\Big|^2\Big).

First, by writing that \psi(x_{m-1}, y_{m-1}, y_m) is bounded by (\frac{m}{2\pi})^{q/2} and using the fact that, for every x_0, \ldots, x_{m-1} \in \mathbb{R}^d, y_0 \in \mathbb{R}^q, the map (y_1, \ldots, y_{m-1}) \mapsto \prod_{i=0}^{m-2} \psi(x_i, y_i, y_{i+1}) is a (Gaussian) probability density, we see by Fubini's theorem that

\|g^{(m)}\|_\infty \le \Big(\frac{m}{2\pi}\Big)^{q/2}. \qquad (A.2)
Both functions p^{(m)}(x_0, x_m) and r^{(m)}(x_0, y_0, x_m, y_m) are clearly differentiable with respect to x_0 and x_m, with derivatives given by:

\frac{\partial r^{(m)}}{\partial x_0} = \int \Big[\, m I_d \Big(x_1 - x_0 - \frac{b}{m}\Big) + \frac{\partial \beta}{\partial x_0}(x_0, y_0)\Big(y_1 - y_0 - \frac{\beta(x_0, y_0)}{m}\Big) \Big] \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, \psi(x_i, y_i, y_{i+1})\, dx_1 \ldots dx_{m-1}\, dy_1 \ldots dy_{m-1},

\frac{\partial r^{(m)}}{\partial x_m} = -\int m I_d \Big(x_m - x_{m-1} - \frac{b}{m}\Big) \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, \psi(x_i, y_i, y_{i+1})\, dx_1 \ldots dx_{m-1}\, dy_1 \ldots dy_{m-1},

\frac{\partial p^{(m)}}{\partial x_0} = \int m I_d \Big(x_1 - x_0 - \frac{b}{m}\Big) \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, dx_1 \ldots dx_{m-1},

\frac{\partial p^{(m)}}{\partial x_m} = -\int m I_d \Big(x_m - x_{m-1} - \frac{b}{m}\Big) \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, dx_1 \ldots dx_{m-1}.
Now, using the same arguments as for (A.2), one gets

\Big|\frac{\partial r^{(m)}}{\partial x_0}\Big| \le C m^{q/2} \int \Big( m\Big|x_1 - x_0 - \frac{b}{m}\Big| + 1 \Big) \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, dx_1 \ldots dx_{m-1},

for some positive constant C. Hence,

\frac{|\partial r^{(m)}/\partial x_0|}{p^{(m)}} \le C m^{q/2}\big(1 + m\, B(x_0, x_m)\big), \qquad (A.3)

where

B(x_0, x_m) = \frac{\int \big|x_1 - x_0 - \frac{b}{m}\big| \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, dx_1 \ldots dx_{m-1}}{\int \prod_{i=0}^{m-1} \varphi(x_i, x_{i+1})\, dx_1 \ldots dx_{m-1}}.
By making the change of variables x_i \to x_i - x_{i-1} - b/m, i = 1, \ldots, m-1, we have B(x_0, x_m) = \bar B(x_m - x_0 - b) with

\bar B(x) = \frac{\int |x_1| \exp\Big(-\frac{m}{2}\Big(\sum_{i=1}^{m-1}|x_i|^2 + \big|\sum_{i=1}^{m-1} x_i - x\big|^2\Big)\Big)\, dx_1 \ldots dx_{m-1}}{\int \exp\Big(-\frac{m}{2}\Big(\sum_{i=1}^{m-1}|x_i|^2 + \big|\sum_{i=1}^{m-1} x_i - x\big|^2\Big)\Big)\, dx_1 \ldots dx_{m-1}}. \qquad (A.4)

By writing the sum in parentheses in the previous relation as a canonical square in x_{m-1}, i.e.

\sum_{i=1}^{m-1}|x_i|^2 + \Big|\sum_{i=1}^{m-1} x_i - x\Big|^2 = 2\,\Big|x_{m-1} + \frac{\sum_{i=1}^{m-2} x_i - x}{2}\Big|^2 + \sum_{i=1}^{m-2}|x_i|^2 + \frac{1}{2}\Big|\sum_{i=1}^{m-2} x_i - x\Big|^2,

we obtain, by integrating in (A.4) first with respect to x_{m-1} (Fubini's theorem):

\bar B(x) = \frac{\int |x_1| \exp\Big(-\frac{m}{2}\Big(\sum_{i=1}^{m-2}|x_i|^2 + \frac{1}{2}\big|\sum_{i=1}^{m-2} x_i - x\big|^2\Big)\Big)\, dx_1 \ldots dx_{m-2}}{\int \exp\Big(-\frac{m}{2}\Big(\sum_{i=1}^{m-2}|x_i|^2 + \frac{1}{2}\big|\sum_{i=1}^{m-2} x_i - x\big|^2\Big)\Big)\, dx_1 \ldots dx_{m-2}}.

By induction, this yields

\bar B(x) = \frac{\int |x_1| \exp\Big(-\frac{m}{2}\Big(|x_1|^2 + \frac{1}{m-1}|x_1 - x|^2\Big)\Big)\, dx_1}{\int \exp\Big(-\frac{m}{2}\Big(|x_1|^2 + \frac{1}{m-1}|x_1 - x|^2\Big)\Big)\, dx_1}.

Using again a canonical square in x_1, we then get

\bar B(x) = \frac{\int |x_1| \exp\Big(-\frac{m^2}{2(m-1)}\big|x_1 - \frac{x}{m}\big|^2\Big)\, dx_1}{\int \exp\Big(-\frac{m^2}{2(m-1)}\big|x_1 - \frac{x}{m}\big|^2\Big)\, dx_1}.

With the change of variable x_1 \to m(x_1 - x/m)/\sqrt{m-1}, it is then clear that

\bar B(x) \le \frac{C}{m}\big(\sqrt{m} + |x|\big), \qquad \forall x \in \mathbb{R}^d,
for some positive constant C. From (A.3), we deduce that

\frac{|\partial r^{(m)}/\partial x_0|}{p^{(m)}} \le C m^{q/2}\big(\sqrt{m} + |x_0| + |x_m|\big).

By the same arguments, we have

\frac{|\partial p^{(m)}/\partial x_0|}{p^{(m)}} \le C\big(\sqrt{m} + |x_0| + |x_m|\big).

Therefore,

\Big|\frac{\partial g^{(m)}}{\partial x_0}\Big| \le C m^{q/2}\big(\sqrt{m} + |x_0| + |x_m|\big). \qquad (A.5)

By the same arguments, we also show that

\Big|\frac{\partial g^{(m)}}{\partial x_m}\Big| \le C m^{q/2+1}\big(\sqrt{m} + |x_0| + |x_m|\big). \qquad (A.6)

The local Lipschitz assumption (7.7) straightforwardly follows from (A.5)-(A.6).
Appendix B: Optimal quantization (numerical aspects)

As mentioned in the introduction, quantization consists in replacing an \mathbb{R}^d-valued random vector X by its projection \hat X^\Gamma := \mathrm{Proj}_\Gamma(X) onto a grid \Gamma \subset \mathbb{R}^d, following a closest neighbour rule. For a grid \Gamma := \{x^1, \ldots, x^N\}, such a projection is defined by a Borel partition C_1(\Gamma), \ldots, C_N(\Gamma) of \mathbb{R}^d (called a Voronoi tessellation of \Gamma) satisfying

C_i(\Gamma) \subset \big\{ \xi \in \mathbb{R}^d : |\xi - x^i| = \min_{x^j \in \Gamma} |\xi - x^j| \big\}, \qquad i = 1, \ldots, N,

where |\,\cdot\,| denotes the canonical Euclidean norm. We then set

\hat X^\Gamma = \sum_{i=1}^N x^i\, 1_{C_i(\Gamma)}(X). \qquad (B.1)

If X \in L^p, the L^p-error induced by this projection – called the L^p-quantization error – is \|X - \hat X^\Gamma\|_p. It is obvious that this quantization error depends on the grid \Gamma. In fact, one easily derives from the closest neighbour rule that, if \Gamma = \{x^1, \ldots, x^N\},

\|X - \hat X^\Gamma\|_p^p = \mathbb{E}\Big[\min_{1 \le i \le N} |X - x^i|^p\Big]. \qquad (B.2)
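As a small illustration (not part of the original text), the projection (B.1) and the Monte Carlo estimation of the quantization error (B.2) can be sketched as follows; the toy grid at the end is purely illustrative.

```python
import numpy as np

def quantize(samples, grid):
    """Closest-neighbour projection (B.1): map each sample to its nearest grid point.

    samples: (M, d) array of draws from P_X; grid: (N, d) array (the grid Gamma).
    Returns the Voronoi cell indices and the projected points hat{X}^Gamma.
    """
    # squared distances between every sample and every grid point, shape (M, N)
    d2 = ((samples[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, grid[idx]

def lp_quantization_error(samples, grid, p=2):
    """Monte Carlo estimate of ||X - hat{X}^Gamma||_p, cf. (B.2)."""
    _, proj = quantize(samples, grid)
    dist = np.linalg.norm(samples - proj, axis=1)
    return (dist ** p).mean() ** (1.0 / p)

# toy check: N(0, I_2) quantized on a crude uniform grid (for illustration only)
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 2))
grid = np.stack(np.meshgrid(np.linspace(-2, 2, 8), np.linspace(-2, 2, 8)), -1).reshape(-1, 2)
print(lp_quantization_error(X, grid, p=2))
```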
If one identifies a grid \Gamma of size N with the N-tuple (x^1, \ldots, x^N), or any of its permutations, the p-th power of the L^p-quantization error – called the L^p-distortion – appears as a symmetric function

Q_N^p(x^1, \ldots, x^N) := \int \min_i |\xi - x^i|^p \, P_X(d\xi),

which can obviously be defined on the whole of (\mathbb{R}^d)^N. The function (Q_N^p)^{1/p} is Lipschitz continuous and does reach a minimum. If |X(\Omega)| is infinite, then any N-tuple that achieves the minimum has pairwise distinct components, and this minimum decreases toward 0 as N goes to infinity. Its rate of convergence is ruled by the so-called Zador theorem.

Theorem 8.1 (see [Graf and Luschgy 2000]) Assume that \mathbb{E}|X|^{p+\varepsilon} < +\infty for some \varepsilon > 0. Then

\lim_N N^{1/d} \min_{|\Gamma| \le N} \|X - \hat X^\Gamma\|_p = \tilde J_{p,d} \Big( \int_{\mathbb{R}^d} \varphi(\xi)^{\frac{d}{d+p}}\, d\xi \Big)^{\frac1p + \frac1d} \qquad (B.3)

where P_X(d\xi) = \varphi(\xi)\, \lambda_d(d\xi) + \nu(d\xi), \nu \perp \lambda_d (\lambda_d Lebesgue measure on \mathbb{R}^d). The constant \tilde J_{p,d} corresponds to the case of the uniform distribution on [0,1]^d.

Except in dimension 1 (\tilde J_{p,1} = \frac{1}{2(p+1)^{1/p}}, \tilde J_{2,2} = \frac{5}{18\sqrt{3}}, \ldots), the true value of \tilde J_{p,d} is unknown. However, \tilde J_{p,d} \sim \big(\frac{d}{2\pi e}\big)^{1/2} as d goes to infinity (see [Graf and Luschgy 2000]). This theorem says that \min_{|\Gamma| \le N} \|X - \hat X^\Gamma\|_p \sim \theta_{X,p,d}\, N^{-1/d}. This is in accordance with the O(N^{-1/d}) rates obtained in numerical integration by uniform grid methods, although optimal quantizers are never uniform grids (except for the uniform distribution in dimension 1): optimal quantization provides the "best fitting" grid for a given distribution \mu = P_X. Such grids correspond to the real constant \theta_{X,p,d}.
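For illustration (this short verification is ours, not in the original text), the one-dimensional constant can be recovered directly: for X uniform on [0,1], the optimal N-quantizer is the midpoint grid x^i = (2i-1)/(2N), so that

\|X - \hat X^\Gamma\|_p^p = N \int_{-1/(2N)}^{1/(2N)} |v|^p\, dv = \frac{1}{2^p (p+1)\, N^p},
\qquad\text{hence}\qquad
N\, \|X - \hat X^\Gamma\|_p = \frac{1}{2(p+1)^{1/p}} = \tilde J_{p,1}.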
Stochastic gradient descent to get optimal quantization: When the dimension d is greater than 1 and independent copies of X can easily be simulated on a computer, an efficient approach consists in differentiating the integral representation of the quantization error function so as to implement a stochastic gradient algorithm. Namely, set for every x = (x^1, \ldots, x^N) \in (\mathbb{R}^d)^N and every \xi \in \mathbb{R}^d,

q_N^p(x, \xi) := \min_{1 \le i \le N} |x^i - \xi|^p.

For notational convenience, we will temporarily denote by C_i(x) the Voronoi cell of x^i in the grid \Gamma := \{x^1, \ldots, x^N\} (instead of C_i(\Gamma)). One shows (see [Pagès 1997]) that, if p > 1, Q_N^p is continuously differentiable at every N-tuple x \in (\mathbb{R}^d)^N having pairwise distinct components and a P_X-negligible Voronoi boundary \cup_{i=1}^N \partial C_i(x). Its gradient \nabla Q_N^p is obtained by formal differentiation:

\nabla Q_N^p(x) = \mathbb{E}\big[\nabla_x q_N^p(x, X)\big],
\qquad\text{where}\qquad
\nabla_x q_N^p(x, \xi) = \Big(\frac{\partial q_N^p}{\partial x^i}(x, \xi)\Big)_{1 \le i \le N} := \Big( p\, |x^i - \xi|^{p-1}\, \frac{x^i - \xi}{|x^i - \xi|}\, 1_{C_i(x)}(\xi) \Big)_{1 \le i \le N} \qquad (B.4)

with the convention \frac{0}{|0|} = 0. Note that \nabla_x q_N^p(x, \xi) then has exactly one non-zero component, of index i(x, \xi) defined by \xi \in C_{i(x,\xi)}(x). The above differentiability result still holds for p = 1 provided P_X is continuous (i.e. weights no point in \mathbb{R}^d).
Then, one may process a stochastic gradient descent algorithm (starting from an initial grid \Gamma^0 with N pairwise distinct components) defined by

\Gamma^{s+1} = \Gamma^s - \frac{\delta_{s+1}}{p}\, \nabla_x q_N^p(\Gamma^s, \xi^{s+1}), \qquad (B.5)

where (\xi^s)_{s \ge 1} is an i.i.d. sequence of X-distributed random vectors and (\delta_s)_{s \ge 1} a (0,1)-valued sequence of step parameters satisfying the usual conditions

\sum_s \delta_s = +\infty \qquad\text{and}\qquad \sum_s \delta_s^2 < +\infty. \qquad (B.6)
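Here is a minimal sketch (ours, not the authors' implementation) of the recursion (B.5) in the quadratic case p = 2, i.e. the CLVQ algorithm discussed below; the step sequence \delta_s = a/(b+s), the sampler interface and the iteration budget are illustrative choices.

```python
import numpy as np

def clvq(sampler, N, n_iter=200_000, a=1.0, b=100.0, seed=0):
    """Stochastic gradient descent (B.5) for the quadratic distortion (p = 2), i.e. CLVQ.

    sampler(rng, M) must return an (M, d) array of independent copies of X.
    The steps delta_s = a / (b + s) lie in (0, 1) and satisfy the conditions (B.6).
    Returns the grid together with estimated cell weights and L^2-distortion
    (the 'companion parameters' mentioned in the text).
    """
    rng = np.random.default_rng(seed)
    grid = sampler(rng, N)                 # initial grid: N i.i.d. draws (pairwise distinct a.s.)
    weights = np.zeros(N)
    distortion = 0.0
    for s in range(1, n_iter + 1):
        xi = sampler(rng, 1)[0]
        i = np.argmin(((grid - xi) ** 2).sum(axis=1))   # index of the Voronoi cell containing xi
        dist2 = ((xi - grid[i]) ** 2).sum()             # local contribution to Q_N^2
        delta = a / (b + s)
        grid[i] += delta * (xi - grid[i])  # = grid[i] - (delta/2) * grad_x q_N^2, cf. (B.4)-(B.5)
        weights[i] += 1.0                  # empirical P_X-weight of cell i
        distortion += dist2
    return grid, weights / n_iter, distortion / n_iter

# example: quadratic quantization of N(0, I_2) with N = 50 points
grid, w, D = clvq(lambda rng, M: rng.standard_normal((M, 2)), N=50)
```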
Note that (B.5) a.s. grants by induction that \Gamma^s has pairwise distinct components. Under appropriate assumptions, such a stochastic descent procedure a.s. converges toward a local minimum of its potential function, here Q_N^p. Although these assumptions are not fulfilled by the function Q_N^p, some of these theoretical problems may be overcome (see [Pagès 1997]). Moreover, the procedure provides satisfactory results in practice (a common situation when implementing gradient descents). The companion parameters (P_X-weights of the cells and L^p-quantization errors) can be obtained as by-products of the procedure. For more details, we refer to [Pagès 1997, Bally and Pagès 2003, Pagès et al. 2004a] where these questions are extensively investigated and discussed. The quadratic case p = 2 is the most commonly implemented for applications and is known as the Competitive Learning Vector Quantization (CLVQ) algorithm.

Stationary quantizers: The differentiability of Q_N^2 also has a noticeable theoretical consequence. Since Q_N^2 is differentiable at any N-tuple x lying in \mathrm{argmin}\, Q_N^2 – even in \mathrm{argmin}_{loc}\, Q_N^2 if P_X weights no hyperplane – any such N-tuple is a stationary quantizer, i.e. \nabla Q_N^2(x) = 0. Standard computations then show that this equation reads

\hat X = \mathbb{E}[X \mid \hat X]. \qquad (B.7)

In particular, this means that \|\hat X\|_p \le \|X\|_p for every p \in [1, +\infty].

Acknowledgement: We are grateful to Afef Sellami and Jacques Printems who took an important part in the numerical experiments. We would like to thank François LeGland for useful remarks.
References

[Bally et al. 2001] Bally V., Pagès G., Printems J. (2001) A stochastic quantization method for nonlinear problems, Monte Carlo Methods and Applications, 7, no. 1-2, 21-34.

[Bally and Pagès 2003] Bally V., Pagès G. (2003) An optimal quantization algorithm for solving discrete time multidimensional optimal stopping problems, Bernoulli, 9, no. 6, 1003-1049.

[Bally and Talay 1996] Bally V., Talay D. (1996) The law of the Euler scheme for stochastic differential equations: II. Approximation of the density, Monte Carlo Methods and Applications, 2, 93-128.

[Bucklew and Wise 1982] Bucklew J., Wise G. (1982) Multidimensional asymptotic quantization theory with rth power distortion measures, IEEE Trans. on Inform. Theory, 28, no. 2, 239-247.

[Crisan and Lyons 1997] Crisan D., Lyons T. (1997) Nonlinear filtering and measure-valued processes, Probab. Theory Related Fields, 109, no. 2, 217-244.

[Del Moral 1998] Del Moral P. (1998) Measure-valued processes and interacting particle systems, application to nonlinear filtering problem, Annals of Applied Probability, 8, 438-495.

[Del Moral et al. 2001] Del Moral P., Jacod J., Protter P. (2001) The Monte-Carlo method for filtering with discrete-time observations, Probab. Theory Related Fields, 120, 346-368.

[Devroye 1986] Devroye L. (1986) Non-uniform random variate generation, Springer.

[Di Masi et al. 1985] Di Masi G., Pratelli M., Runggaldier W. (1985) An approximation for the nonlinear filtering problem with error bounds, Stochastics, 14, 247-271.

[Di Masi and Runggaldier 1982] Di Masi G., Runggaldier W. (1982) Approximation and bounds for discrete-time nonlinear filtering, Lect. Notes in Cont. Info. Sci., 44, A. Bensoussan and J.L. Lions eds.

[Elliott et al. 1995] Elliott R., Aggoun L., Moore J.B. (1995) Hidden Markov Models, Estimation and Control, Springer Verlag.

[Florchinger and LeGland 1992] Florchinger P., LeGland F. (1992) Particle approximation for first-order SPDE's, Lect. Notes in Cont. Info. Sci., 177, 121-133, Berlin: Springer Verlag.

[Graf and Luschgy 2000] Graf S., Luschgy H. (2000) Foundations of Quantization for Probability Distributions, Lecture Notes in Mathematics no. 1730, Berlin: Springer.

[Kushner 1977] Kushner H.J. (1977) Probability Methods for Approximations in Stochastic Control and for Elliptic Equations, Academic Press, New York.

[Newton 2000a] Newton N. (2000) Observations preprocessing and quantisation for nonlinear filters, SIAM J. Control Optim., 38, no. 2, 482-502.

[Newton 2000b] Newton N. (2000) Observation sampling and quantisation for continuous-time estimators, Stoch. Proc. and their Appl., 87, 311-337.

[Pagès 1997] Pagès G. (1997) A space vector quantization method for numerical integration, Journal of Computational and Applied Mathematics, 89, 1-38.

[Pagès and Printems 2003] Pagès G., Printems J. (2003) Optimal quadratic quantization for numerics: the Gaussian case, Monte Carlo Methods and Applications, 9, no. 2, 135-166.

[Pagès et al. 2004a] Pagès G., Pham H., Printems J. (2004) A quantization algorithm for multidimensional stochastic control problems, Stochastics and Dynamics, 4, no. 4, 1-45.

[Pagès et al. 2004b] Pagès G., Pham H., Printems J. (2004) Optimal quantization methods and applications to numerical problems in finance, Handbook of Computational and Numerical Methods in Finance, S. Rachev ed., Boston: Birkhäuser.

[Sellami 2004] Sellami A. (2004) Non linear filtering with quantization of the observations, preprint LPMA, Univ. Paris 6 & 7.
[Figure 1: d = 1, n = 15. Convergence and convergence rate (in a log-scale) as \bar N grows. For f_1, f_2, f_3, the quantized values are plotted against the reference value, together with the log-errors and their least squares regression lines y = -1.4x - 0.58, y = -1.3x - 3.2 and y = -1.3x - 0.87 respectively.]

[Figure 2: d = 1, n = 15. Errors |\Pi_n(f_i) - \hat\Pi_n(f_i)| and convergence rate as \bar N grows, for three different observation vectors (first moment, second moment and e^{-|\cdot|} panels).]

[Figure 3: d = 3, n = 15. Convergence and convergence rate (in a log-scale) as a function of \bar N. Log-error least squares regression lines: y = -0.39x - 0.13, y = -x + 3.5 and y = -0.38x - 0.73 for f_1, f_2, f_3 respectively.]

[Figure 4: First moment error and relative error, \bar N = 600, 1 \le n \le 100, with |||A||| close to 1 (left) and less than 0.8 (right). Exact vs. quantized values of the first and second components, and relative first moment error with its regression line.]

[Figure 5: Stochastic volatility model: filter values for the functions f_1, f_2 and f_3, (\rho, \theta, \gamma) = (0.996, 0.0316, 0.05), as \bar N grows.]