Approximation and Limit Results for Nonlinear Filters Over an Infinite Time Interval

Harold J. Kushner†
Division of Applied Mathematics, Brown University, Providence, RI 02912

Amarjit Budhiraja∗
Department of Mathematics, University of Notre Dame, Notre Dame, IN 46556

December 11, 2000
Abstract. The paper is concerned with approximations to nonlinear filtering problems that are of interest over a very long time interval. Since the optimal filter can rarely be constructed, one needs to compute with numerically feasible approximations. The signal model can be a jump-diffusion or just a process that is approximated by a jump-diffusion. The observation noise can be either white or of wide bandwidth. The observations can be taken either in discrete or continuous time. The cost of interest is the pathwise error per unit time over a long time interval. It is shown, under quite reasonable conditions on the approximating filter and the signal and noise processes, that as time, bandwidth, process and filter approximation, etc., go to their limits in any way at all, the limit of the pathwise average costs per unit time is just what one would get if the approximating processes were replaced by their ideal values and the optimal filter were used. Analogous results are obtained (with appropriate scaling) if the observations are taken in discrete time and the sampling interval also goes to zero. For these cases, the approximating filter is a numerical approximation to the optimal filter for the presumed limit (signal, observation noise) problem.
Key words: nonlinear filters, approximations to nonlinear filters, infinite time filtering, occupation measures, pathwise average error criteria.

AMS subject classifications: 93E11, 60G35.

∗ Supported in part by contracts N00014-96-1-0276 and N00014-96-1-0279 from the Office of Naval Research and NSF grant DMI 9812857.
† Supported in part by contract DAAH04-96-1-0075 from the Army Research Office and NSF grant ECS 9703895.
1 Introduction
The paper is concerned with approximations to nonlinear filtering problems that are of interest over a very long time interval. Consider the simplest such problem, where the observation is a function of the state of a diffusion process and the function, the process, or both are nonlinear. Then, except in a few cases which on the whole are not of much practical value, it is not feasible to compute the truly optimal filter, and one must approximate in some way. Suppose that the approximation is parameterized by a parameter h such that, as h → 0, the approximation converges to the true filter in the sense that the computed expectation of any bounded and continuous function converges to the true conditional expectation; equivalently, the computed approximating conditional distribution converges weakly to the true conditional distribution. Now suppose that the filter is of interest over an arbitrarily large interval T and the errors of concern are the pathwise average (not the mean value) "prediction" errors per unit time, for some rather arbitrary definition of prediction. Pathwise errors are a main item of interest in many such applications since we work with only one long path, and the law of large numbers argument which justifies the use of mean values is inappropriate. Now there are two parameters: h and T. The convergence of the filter over any fixed finite interval says nothing about the behavior of the pathwise average errors as h → 0 and T → ∞ arbitrarily. We will show that, under quite reasonable conditions, the pathwise errors converge in probability to an optimal deterministic limit in a very natural sense. This limit is what one would get if one used the true optimal filter, and not an approximation. In fact, the limit is also what one would get for the true optimal filter if the expectation of the pathwise average replaced the pathwise average. The convergence is independent of how h → 0 or T → ∞.
The error function is quite general and can even be an (appropriate) path functional. This is quite a strong result. Note that it is important that we allow h → 0 and T → ∞ in an arbitrary way. If, for example, the time required for the pathwise average to be well approximated by the "ergodic limit" grew as h became small, then the approximating filter would not be good for large time: the better the filter (i.e., the smaller h), the more time would be required for the pathwise average to settle. Actually, our interest is in more general problems as well. We let the signal process be a "pre-jump-diffusion," which is only approximated by a jump-diffusion. The discrete time case is also treated. Also, we allow wide bandwidth observation noise. Much (perhaps most) of the theoretical work in nonlinear filtering is in continuous time, and white observation noise is the usual assumption. But most applications are in discrete time. If the interval between observations becomes small, so that one is tempted to use a continuous time model, then one is confronted with the very real presence of wide bandwidth, not white, noise. One usually constructs a numerical approximation to a filter which assumes white observation noise and a diffusion model. Then the wide bandwidth observation noise, combined with all of the other approximations, can conceivably lead to serious errors when T is large.

The last part of the paper deals with this general problem. The observations are in discrete time with a small time interval between the observations, the observation noise is wide bandwidth, the signal is only approximated by a sampled diffusion, and the pathwise average errors over a long time interval T are of interest. Under reasonable conditions we show that the desired limit result holds here as well, as all the parameters go to their limits simultaneously (observation interval, observation noise bandwidth, h, the signal converging to a jump-diffusion, T, etc.). This is a true justification of filtering work in continuous time. Some work on this problem was done in [21], where the pathwise average error was replaced by an expectation of the pathwise average error, and which contains much additional material on the wide bandwidth observation noise problem. In [12], the asymptotics of the filter alone were dealt with, and it was presumed that the filter was the true optimal filter, not an approximation. The approach taken here uses an adaptation of the occupation measure approach which was used successfully on a variety of control and limit problems in [19, 22].
2 Problem Formulation and Occupation Measures
Let h parameterize the approximating filter. Until Section 5, we work with observations in continuous time. Let X(·) denote the basic signal process, which is the solution to an Itô equation. The process can be a jump-diffusion with time-independent coefficient functions, with only notational changes required in the development. But for notational simplicity, we work with a simple diffusion, which takes values in a compact set in ℝ^r, Euclidean r-space. Where continuity is used in the current proof, for the jump-diffusion case it is replaced by the fact that we have continuity at each time with probability one, and uniform (in t) right continuity in probability. Similarly, under appropriate conditions on the boundary and reflection directions, the results hold for a reflected jump-diffusion. Our basic condition on the signal process is the following.

A2.1 X(·) takes values in a compact subset G of ℝ^r for all X(0) ∈ G and satisfies the Itô equation
$$dX = p(X)\,dt + \sigma(X)\,dW, \tag{2.1}$$
where W(·) is a standard vector-valued Wiener process and p(·) and σ(·) are continuous. The solution of the above Itô equation is unique in the weak sense for each initial condition. The uniqueness assumption imposes rather mild requirements on p(·) and σ(·), while the compactness of the state space will hold if, say, σ(·) vanishes outside a compact set and 0 is a global attractor for the deterministic equation obtained by substituting σ = 0 in (2.1). Alternatively, compactness of the state space can be achieved by constraining a diffusion in a bounded region by appropriately reflecting it at the boundary. Then the model is represented as the solution to the Skorohod problem, and the techniques of the current paper can be used to cover this case as well. In many applications, the state spaces are inherently compact; for example, where the state is the phase of a signal. Additionally, unbounded signal models are often used for mathematical convenience, while the true process does take values in a compact set.

The observation process is Y(·), defined by
$$Y(t) = \int_0^t g(X(s))\,ds + B(t), \tag{2.2}$$
where g(·) is a continuous vector-valued function and B(·) is a standard vector-valued Wiener process, independent of W(·) and X(0). Unless otherwise stated, all statements of the form "with probability one (w.p.1)," "a.s.," "convergence in probability," etc., will correspond to the basic probability space which supports X(·), Y(·), W(·), B(·).

The optimal filter process. We will use the representation of the optimal filter as it was originally developed in [15]. This is most convenient for our purposes, and it is completely equivalent to the forms used later, as in [8, 11, 23]. Let X̃(·) be a process satisfying (2.1), and which (loosely speaking) is conditionally independent of (X(·), W(·), B(·)) given its initial condition. We formalize this as follows. X̃(·) is a process satisfying (2.1) such that there exists a (possibly random) probability measure Π∗ on ℝ^r with the properties that, conditioned on Π∗, X̃(·) is independent of (X(·), W(·), B(·)) and the conditional distribution of X̃(0) given Π∗ is Π∗. We will call Π∗ the "random initial distribution" of X̃(·) (i.e., the distribution of X̃(0)). It will vary depending on the need, and will be specified when needed.

Let Y_{a,b}, a < b, denote the values of Y(·) − Y(a) in the time interval [a, b]. Until further notice, let Π(0) denote the distribution of X(0) and Π(t) the distribution of X(t) given the data Y_{0,t} and Π(0). Let X̃_{0,a} denote the values of X̃(·) in the time interval [0, a]. Define the function
$$R(\tilde X_{0,t}, Y_{0,t}) = \exp\left\{ \int_0^t g'(\tilde X(s))\,dY(s) - \frac{1}{2}\int_0^t \left| g(\tilde X(s)) \right|^2 ds \right\}. \tag{2.3}$$
Let E_Z f denote the expectation of a function f given the data Z. Then the optimal filter Π(·) satisfies the following relation (which defines the inner product notation). For each bounded and measurable real-valued function φ(·),
$$\int \phi(x)\,\Pi(t, dx) \equiv \langle \Pi(t), \phi \rangle = \frac{E_{\{\Pi(0), Y_{0,t}\}}\left[ \phi(\tilde X(t))\, R(\tilde X_{0,t}, Y_{0,t}) \right]}{E_{\{\Pi(0), Y_{0,t}\}}\, R(\tilde X_{0,t}, Y_{0,t})}, \tag{2.4}$$
where Π(0) is the distribution of the initial condition X̃(0).
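As a purely illustrative aside (not part of the paper's development), the representation (2.4) suggests a direct Monte Carlo approximation: simulate independent copies of the auxiliary process X̃(·), weight each by a discretization of the likelihood functional (2.3), and form the weighted average of φ. The sketch below is ours; all function and variable names are hypothetical, the state is scalar, and the stochastic integral is discretized at the left endpoints.

```python
import numpy as np

def mc_filter_estimate(phi, g, p, sigma, x0_samples, dY, dt, rng):
    """Monte Carlo sketch of the representation (2.4): average
    phi(X~(t)) weighted by the likelihood functional R(X~_{0,t}, Y_{0,t}),
    with the stochastic integral discretized on a grid of step dt.
    `x0_samples` are draws from the (random) initial distribution Pi(0);
    `dY` holds the observation increments Y((k+1)dt) - Y(k dt)."""
    n_paths = len(x0_samples)
    x = np.array(x0_samples, dtype=float)
    log_R = np.zeros(n_paths)
    for dy in dY:
        gx = g(x)  # g(X~) at the left endpoint of the interval
        # Euler step for the auxiliary signal X~ (weak approximation of (2.1))
        x = x + p(x) * dt + sigma(x) * rng.standard_normal(n_paths) * np.sqrt(dt)
        # discretized exponent of (2.3): g'(X~) dY - (1/2)|g(X~)|^2 dt
        log_R += gx * dy - 0.5 * gx**2 * dt
    w = np.exp(log_R - log_R.max())   # stabilize before exponentiating
    w /= w.sum()
    return float(np.sum(w * phi(x)))
```

Since the weights form a convex combination, the estimate always lies in the range of φ over the simulated states; the vector-valued case replaces the products above by inner products.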
Owing to the Markov property of X(·), the optimal filter defined by (2.4) satisfies the semigroup relation: for 0 < s < t,
$$\langle \Pi(t), \phi \rangle = \frac{E_{\{\Pi(t-s), Y_{t-s,t}\}}\left[ \phi(\tilde X(s))\, R(\tilde X_{0,s}, Y_{t-s,t}) \right]}{E_{\{\Pi(t-s), Y_{t-s,t}\}}\, R(\tilde X_{0,s}, Y_{t-s,t})}. \tag{2.5}$$
In (2.5), Π(t − s) is the (possibly random) distribution of the initial condition X̃(0). Throughout the paper, we use the notation E_{Π(a),Y_{a,b}} F(X̃_{0,s}, Y_{a,b}) for the conditional expectation, given the data Y_{a,b} and where the random initial distribution for X̃(·) is Π(a). The analogous notation will be used when approximations to X̃(·) are used.

Since Π(t) is nearly always hard (if not impossible) to compute for nonlinear problems, in applications one uses various approximations. Let Π^h(·) denote the actual measure-valued process, to be specified more precisely below (see (2.7)), which is the "numerical" filter used in the application. Thus, Π^h(t) (with values Π^h(t, dx)) is the approximation at time t. As discussed earlier, Π(0) and Π^h(0), which represent the initial condition of the auxiliary process used in the filter formula, will be allowed to be random (i.e., they can be measure-valued random variables), although always nonanticipative with respect to the Wiener processes. In this case, we can still write P_{Π(0)}{X(0) ∈ A} = Π(0, A), random or not.

Defining Π(0) arbitrarily. Equation (2.4) defines a measure-valued process Π(·), with initial condition Π(0). This process Π(·) is well defined even if Π(0) is not the distribution of X(0), although it would not then be the optimal filter. Keep in mind the following important fact: in what follows, we allow Π(0) to be arbitrary but with support in G, and we will note it explicitly if it is the distribution of X(0). Then the semigroup property in equation (2.5) shows that the pair (X(·), Π(·)) is Markov.

An approximating auxiliary process and approximate filter.
A common way to approximate the optimal filter is to find a process which suitably approximates X(·) but for which the optimal filter is feasible. We then construct the optimal filter for the approximating process, but use the actual observations (2.2). This was the motivation for the Markov chain approximation filter in [16, 20]. The approximating filter Π^h(·) which will be defined later in this section will be constructed by replacing the auxiliary process X̃(·) in equation (2.4) by X̃^h(·), where X̃^h(·) approximates the auxiliary process X̃(·) but is "simpler" than it. Thus, this filter is built under the assumption that the true signal process is X̃^h(·), not X(·). The process X̃^h(·) might be Markov, for example a continuous time Markov chain on a finite state space. More commonly, it is an interpolation of a discrete parameter process: i.e., there is δ_h > 0, which goes to zero as h → 0, such that X̃^h(·) is constant on the intervals [nδ_h, nδ_h + δ_h) and X̃^h(nδ_h), n = 0, . . . , is Markov. When the signal process is defined in continuous time, we always assume that X̃^h(·) is of one of these two forms.
Furthermore, we always suppose (without loss of generality) that X̃^h(t) takes values in G.

To quantify the sense in which X̃^h(·) approximates X̃(·), we will assume the following basic consistency condition.

A2.2 For any sequence {Π^h} of probability measures converging weakly to some probability measure Π, X̃^h(·) with the initial distribution Π^h converges weakly to X̃(·) with the initial distribution Π.

This is a very weak assumption on the approximating filter. It holds, for example, when X̃^h(·) is a finite state Markov chain which is "asymptotically consistent" with X(·), as defined in [16, 20]. Indeed, the convergence (as h → 0 and for an arbitrary finite time) of filters based on this approximating model has been proved in [16, 20], and the approximation performs well if the state dimension is not higher than four. Another possibility is to use an appropriate discrete time approximation to X̃(·). One might discretize time, but not space, provided that the computation is feasible. In either case, (A2.2) will be satisfied if the approximating process satisfies the minimal consistency conditions in [16, 20].

Similar to the construction of X̃(·), we will take X̃^h(·) to be independent of (X(·), W(·), B(·)), given the initial condition. The distribution of the initial condition X̃^h(0) will be a measure-valued random variable, analogously to what was done in (2.3) and (2.4), when X̃(·) was used. Define
$$R(\tilde X^h_{0,t}, Y_{0,t}) = \exp\left\{ \int_0^t g'(\tilde X^h(s))\,dY(s) - \frac{1}{2}\int_0^t \left| g(\tilde X^h(s)) \right|^2 ds \right\}. \tag{2.6}$$
For Markov X̃^h(·), the approximating filter Π^h(·) is defined by
$$\langle \Pi^h(t), \phi \rangle = \frac{E_{\{\Pi^h(0), Y_{0,t}\}}\left[ \phi(\tilde X^h(t))\, R(\tilde X^h_{0,t}, Y_{0,t}) \right]}{E_{\{\Pi^h(0), Y_{0,t}\}}\, R(\tilde X^h_{0,t}, Y_{0,t})}, \tag{2.7}$$
and Π^h(·) satisfies the semigroup equation
$$\langle \Pi^h(t+s), \phi \rangle = \frac{E_{\{\Pi^h(t), Y_{t,t+s}\}}\left[ \phi(\tilde X^h(s))\, R(\tilde X^h_{0,s}, Y_{t,t+s}) \right]}{E_{\{\Pi^h(t), Y_{t,t+s}\}}\, R(\tilde X^h_{0,s}, Y_{t,t+s})}, \quad s > 0,\ t \ge 0. \tag{2.8}$$
According to our standard notation, the initial distribution of X̃^h(·) in (2.7) is Π^h(0), and it is Π^h(t) in (2.8).

When X̃^h(·) is piecewise constant with {X̃^h(nδ_h); n ≥ 1} being Markov, the approximating filter is defined by (2.7) and (2.8), but where t and s are integral multiples of δ_h, and Π^h(·) is constant on the intervals [nδ_h, nδ_h + δ_h). Thus, the evolution of Π^h(·) can be written in recursive form in general.
From the Feller–Markov property of X(·), it follows easily that the pair (X(·), Π(·)) is Feller–Markov. Since X(t) takes values in the compact set G, and the samples of Π(t) are measures with support in G, we have at least one invariant measure for the Markov family determined by (X(·), Π(·)). When initialized at this measure, the process (X(·), Π(·)) is stationary; i.e., the distribution of (X(t + ·), Π(t + ·)) does not depend on t. Let Q̄(·) denote the measure of the joint process
$$\Psi(\cdot) = \left( X(\cdot), \Pi(\cdot), Y(\cdot), B(\cdot), W(\cdot) \right),$$
where (X(·), Π(·)) is stationary. Let Q̄_f(·) denote the measure of the stationary joint process (X(·), Π(·)). We will make the following fundamental assumption throughout this paper.

A2.3 There is a unique invariant measure for (X(·), Π(·)).

In Section 7 we provide conditions for the uniqueness of the invariant measure to hold, and provide some examples for which these conditions can be verified. The critical importance of the uniqueness of the stationary joint process was first raised in [21]. The stationary process and associated measure will be given a filtering interpretation later in this section.

In the sequel, lower case letters x(·), π(·), etc., are used for the canonical sample paths. Letters such as x, y, . . . , are used to denote vectors such as x(t), y(t), etc. Define ψ_f(·) = (x(·), π(·)), Ψ_f(·) = (X(·), Π(·)) and ψ(·) = (x(·), π(·), y(·), b(·), w(·)), although we will not always need the component w(·). For each t ≥ 0, define the process
$$\Psi^h(t, \cdot) = \left( X(t+\cdot),\ \Pi^h(t+\cdot),\ Y(t+\cdot) - Y(t),\ B(t+\cdot) - B(t),\ W(t+\cdot) - W(t) \right).$$
Define Ψ^h_f(t, ·) similarly. In Sections 4 and 6, in which we consider more general approximate filtering problems, we will consider modifications to the Ψ^h(t, ·) process obtained by replacing X(·), Y(·), B(·) by h-parametrized processes X^h(·), Y^h(·), B^h(·).

The path spaces. The path space D[ℝ^k; 0, ∞) of ℝ^k-valued functions which are right continuous and have left-hand limits (CADLAG), endowed with the Skorohod topology [3, 9], will be used for the vector-valued processes such as X(·), Y(·), B(·), X̃(·), etc., for the appropriate value of k.
Let M(G) denote the space of measures on G, with the weak topology. Let m_n(·) and m(·) be in M(G). Recall that m_n(·) converges weakly to m(·) if for each bounded and continuous function φ(·) on G, ⟨m_n, φ⟩ → ⟨m, φ⟩. Equivalently, for a set of continuous functions {φ_i(·)} which is dense (in the topology of uniform convergence) in the set of bounded and continuous functions on G, we have convergence in the metric
$$d(m_n, m) = \sum_i 2^{-i} \left| \langle m_n - m, \phi_i \rangle \right| \to 0.$$
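As a small illustration (ours, not the paper's), a finite truncation of this metric is straightforward to evaluate for discrete measures; the family of test functions below is a stand-in for the first few members of a dense family.

```python
import numpy as np

def weak_metric(support_a, weights_a, support_b, weights_b, test_fns):
    """Truncated version of d(m, m') = sum_i 2^{-i} |<m - m', phi_i>|
    for discrete measures given as (support points, weights)."""
    d = 0.0
    for i, phi in enumerate(test_fns, start=1):
        pair_a = np.dot(weights_a, phi(np.asarray(support_a)))  # <m, phi_i>
        pair_b = np.dot(weights_b, phi(np.asarray(support_b)))  # <m', phi_i>
        d += 2.0**(-i) * abs(pair_a - pair_b)
    return d
```

Identical measures give distance zero, and the 2^{-i} weights make the truncation error controllable when the test functions are uniformly bounded.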
Owing to the fact that the values of X(t) and X̃^h(t) are in G, the samples of Π(t) and its approximations Π^h(·) take values in M(G). The path Π(·) and
its approximations will take values in the CADLAG space D[M(G); 0, ∞), also with the Skorohod topology used. The measure-valued random variables Q^{h,T}(·) defined by (2.11) below take values in the space of measures on the product path space
$$\mathcal{M}\left( D[\mathbb{R}^k; 0, \infty) \times D[\mathcal{M}(G); 0, \infty) \right)$$
for the appropriate value of k (which is the sum of the dimensions of x, y, b, w).

A "prediction error" function. We start with a special case of the error or performance function in order to fix ideas. Let φ(·) be a bounded, continuous and real-valued function of x, and consider the pathwise average error per unit time
$$G^{h,T}(\phi) \equiv \frac{1}{T}\int_0^T \left| \langle \Pi^h(t), \phi \rangle - \phi(X(t)) \right|^2 dt. \tag{2.9}$$
We will show that, under quite broad conditions on the approximate filter,
$$G^{h,T}(\phi) \to \int \left| \langle \pi(0), \phi \rangle - \phi(x(0)) \right|^2 \bar Q_f(d\psi_f(\cdot)) \tag{2.10}$$
in the sense of probability as h → 0 and T → ∞ in any way at all. Later in this section, it will be seen that the right side of (2.10) is what one would also get as the limit if the true optimal filter were used and not the approximation. In this sense there is pathwise asymptotic optimality. The result will actually be much more general. But first, we define the occupation measures, in terms of which the errors will be written.

The occupation measures. For a random variable Z and set A, let I_A(Z) denote the indicator function of the event that Z ∈ A. Let C and C′ be measurable sets in the product path spaces of Ψ^h(t, ·) and Ψ^h_f(t, ·), respectively. Define the occupation measures Q^{h,T}(·) and Q^{h,T}_f(·) as
$$Q^{h,T}(C) = \frac{1}{T}\int_0^T I_C(\Psi^h(t, \cdot))\,dt, \tag{2.11}$$
and
$$Q^{h,T}_f(C') = \frac{1}{T}\int_0^T I_{C'}(\Psi^h_f(t, \cdot))\,dt. \tag{2.12}$$
We can write (2.9) in terms of Q^{h,T}_f(·) as
$$G^{h,T}(\phi) = \int \left| \langle \pi(0), \phi \rangle - \phi(x(0)) \right|^2 Q^{h,T}_f(d\psi_f(\cdot)). \tag{2.13}$$
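For intuition only, (2.9) and (2.13) amount to a time average along a single path; on a sampling grid it is just a Riemann sum. The helper below is our own sketch with hypothetical names.

```python
import numpy as np

def pathwise_error(phi_filter, phi_signal, dt):
    """Riemann-sum version of (2.9): the pathwise average squared
    prediction error per unit time over the horizon T = len * dt.
    `phi_filter[k]` approximates <Pi^h(t_k), phi> and
    `phi_signal[k]` is phi(X(t_k)) on the same grid."""
    phi_filter = np.asarray(phi_filter, dtype=float)
    phi_signal = np.asarray(phi_signal, dtype=float)
    T = len(phi_filter) * dt
    return float(np.sum((phi_filter - phi_signal) ** 2) * dt / T)
```

Note that only a single realization enters the computation; no expectation over paths is taken, which is precisely the point of the pathwise criterion.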
A more general error or performance function. It will be shown that for any real-valued function F(·) which is bounded, measurable and continuous w.p.1 with respect to the measure Q̄_f(·),
$$\frac{1}{T}\int_0^T F(\Psi^h_f(t, \cdot))\,dt = \int F(\psi_f(\cdot))\,Q^{h,T}_f(d\psi_f(\cdot)) \to \int F(\psi_f(\cdot))\,\bar Q_f(d\psi_f(\cdot)) \tag{2.14}$$
in probability. This says that sample mean errors of many types (see the example in the next paragraph) will converge to the stationary value, which is the same value that one would get if the true optimal filter were used and the pathwise occupation measure replaced by its expected value. The convergence is in the sense of probability, and holds as T → ∞ and h → 0 in any way at all. The arbitrariness of the way that T → ∞ and h → 0 is crucial in applications: it is important that the approximation is good for all small h, not depending on T, if T is large enough. Results such as (2.14) are the main contributions of the paper.

The function F(·) can be quite general. In (2.10), F(ψ_f(·)) depends only on ψ_f(0), since it equals |⟨π(0), φ⟩ − φ(x(0))|². But F(·) might have a more complex dependence on ψ_f(·). For example, consider the case where we are interested in
$$\frac{1}{T}\int_0^T \max_{s \in [t-1, t]} \left| \phi(X(s)) - \langle \Pi^h(s), \phi \rangle \right| dt.$$
Here
$$F(\psi_f(\cdot)) = \max_{s \le 1} \left| \phi(x(s)) - \langle \pi(s), \phi \rangle \right|.$$
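The windowed-maximum example can likewise be computed along a single sampled path; the sketch below is our own illustration (names are ours), on a grid of step dt with the window length measured in time units.

```python
import numpy as np

def windowed_max_error(phi_signal, phi_filter, dt, window=1.0):
    """Sketch of the path-dependent functional above: the time average of
    max_{s in [t - window, t]} |phi(X(s)) - <Pi^h(s), phi>| on a grid."""
    err = np.abs(np.asarray(phi_signal, float) - np.asarray(phi_filter, float))
    w = max(1, int(round(window / dt)))          # window length in grid steps
    vals = [err[max(0, k - w + 1): k + 1].max()  # running max over the window
            for k in range(len(err))]
    return float(np.mean(vals))
```

Because F here depends on a whole path segment rather than a single time point, the occupation-measure formulation over path space is exactly what is needed to cover it.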
Comment on the stationary process. Let us deviate a little and discuss the meaning of a stationary process (X(·), Π(·)). Recall that we assume that the measure Q̄_f(·) of the stationary pair (X(·), Π(·)) is unique. Now consider the correctly initialized filter, i.e., the true optimal filter, which we denote by Π^0(·); i.e., Π^0(B) = P{X(0) ∈ B} for each Borel set B ⊂ ℝ^r. Define the mean occupation measure Q^T_f(·) by
$$Q^T_f(C) = \frac{1}{T}\int_0^T P\left( (X(t+\cdot), \Pi^0(t+\cdot)) \in C \right) dt \tag{2.15}$$
for measurable sets C in the product path space. Note that Q^T_f(·) is a measure, but it is not random. It follows from the proofs of Theorems 3.1 and 3.2 that {Q^T_f(·); T < ∞} is tight and converges to Q̄_f(·), the measure of the unique stationary process. Thus, the limit in (2.13) is the same as if the optimal filter were used (with probability one) and the pathwise occupation measure replaced by the mean occupation measure.

Note that the integrand in (2.15) is the measure of the (system, filter) pair on the time interval [t, ∞), and the integral is equivalent to choosing t at random in [0, T]. This gives a loose interpretation of Q̄_f(·) in terms of a random initial time.
3 Tightness and Weak Convergence
The following lemma will be used in the tightness arguments to follow.

Lemma 3.1 [14, Theorem 2.7b]. Let Z_{n,s}(·), n = 1, 2, . . . , s > 0, be a family of processes with paths in the Skorohod space D[S_0; 0, ∞), where S_0 is a complete and separable metric space with metric γ(·). For each δ > 0 and each t in a dense set, let there be a compact set S_{δ,t} ⊂ S_0 such that
$$\sup_{n,s} P\{Z_{n,s}(t) \notin S_{\delta,t}\} \le \delta. \tag{3.1}$$
Let F^{n,s}_t denote the minimal σ-algebra which measures {Z_{n,s}(u), u ≤ t}, and T_{n,s}(T) the set of F^{n,s}_t-stopping times which are less than T > 0. Suppose that for each T,
$$\lim_{\delta \to 0} \limsup_n \sup_s \sup_{\tau \in T_{n,s}(T)} E\left[ \gamma\left( Z_{n,s}(\tau + \delta), Z_{n,s}(\tau) \right) \wedge 1 \right] = 0. \tag{3.2}$$
Then to every ε > 0 there exists a compact set K_ε such that
$$\limsup_n \sup_s P\left( Z_{n,s}(\cdot) \in K_\varepsilon \right) > 1 - \varepsilon.$$
We will apply the above lemma to the doubly parametrized process {Π^h(t + ·); h > 0, t > 0}. Condition (3.1) will hold trivially due to the compactness of the space, and we will concentrate on verifying (3.2).

Theorem 3.1. Assume (A2.1), (A2.2), (A2.3). Then
$$\{X(t + \cdot);\ t \ge 0\} \text{ is tight;} \tag{3.3a}$$
$$\{\tilde X^h(\cdot);\ h, \text{ all possible initial conditions}\} \text{ is tight.} \tag{3.3b}$$
To every ε > 0 there exists a compact set K_ε such that
$$\limsup_{h \to 0} \sup_t P\left( \Pi^h(t + \cdot) \in K_\varepsilon \right) > 1 - \varepsilon. \tag{3.3c}$$
Also, for every sequence h_k → 0, T_k → ∞,
$$\left\{ Q^{h_k, T_k}(\cdot);\ k \ge 1 \right\} \text{ is tight.} \tag{3.4}$$
Let Q(·) denote a weak sense limit of the set in (3.4), as k → ∞. Let ω be the canonical variable on the probability space on which Q(·) is defined, and denote the samples by Q^ω(·). Then, for each ω, Q^ω(·) is a measure on the product path space D[ℝ^k; 0, ∞) × D[M(G); 0, ∞) for appropriate k. It induces a process
$$\Psi^\omega(\cdot) = \left( X^\omega(\cdot), \Pi^\omega(\cdot), Y^\omega(\cdot), B^\omega(\cdot), W^\omega(\cdot) \right). \tag{3.5}$$
For almost all ω, (X^ω(·), Π^ω(·)) is stationary.

Proof. The set {X(t + ·), t ≥ 0} is tight, by the fact that the set of "initial conditions" {X(t), t > 0} is confined to a compact set, and the properties of the diffusion (2.1). The set in (3.3b) is tight by the assumptions on X̃^h(·) concerning weak convergence, which (using the fact that the possible values of X̃^h(0) are confined to some compact set) implies that every subsequence of the set in (3.3b) has a further subsequence which converges weakly.

To prove (3.3c) we will apply Lemma 3.1. For a real number ρ > 0 and each fixed h and t, let T^{h,t}(ρ) denote the set of stopping times, with respect to the filtration σ{Π^h(t + s) : s ≤ u}_{u ≥ 0}, bounded by ρ. Then, by Lemma 3.1, it is sufficient to show that
$$\lim_{\delta \to 0} \limsup_{h \to 0} \sup_t \sup_{\tau \in T^{h,t}(\rho)} E\left| \langle \Pi^h(t+\tau+\delta), \phi \rangle - \langle \Pi^h(t+\tau), \phi \rangle \right| = 0 \tag{3.6}$$
for each continuous, bounded and real-valued φ(·). Define the function
$$K(h, t, \tau, \delta) = \frac{E_{\{\Pi^h(t+\tau), Y_{t+\tau, t+\tau+\delta}\}}\left[ \phi(\tilde X^h(\delta))\, R(\tilde X^h_{0,\delta}, Y_{t+\tau, t+\tau+\delta}) \right]}{E_{\{\Pi^h(t+\tau), Y_{t+\tau, t+\tau+\delta}\}}\, R(\tilde X^h_{0,\delta}, Y_{t+\tau, t+\tau+\delta})}.$$
Owing to the use of the weak topology, the boundedness of φ(·) and the semigroup property (2.8), proving (3.6) is equivalent to proving that
$$\lim_{\delta \to 0} \limsup_{h \to 0} \sup_t \sup_{\tau \in T^{h,t}(\rho)} E\left| K(h, t, \tau, \delta) - \langle \Pi^h(t+\tau), \phi \rangle \right| = 0.$$
But showing this is straightforward, owing to the convergence to unity of the exponential function R(X̃^h_{0,δ}, Y_{t+τ,t+τ+δ}) as δ → 0. In particular, it is sufficient to show that (where d_s denotes the differential with respect to the s variable)
$$E_{\{\Pi^h(t+\tau), Y_{t+\tau, t+\tau+\delta}\}}\, \phi(\tilde X^h(\delta)) \left[ \exp\left\{ \int_0^\delta g'(\tilde X^h(s))\,d_s Y(t+\tau+s) - \frac{1}{2}\int_0^\delta \left| g(\tilde X^h(s)) \right|^2 ds \right\} - 1 \right]$$
goes to zero in mean as δ → 0, uniformly in (h, t, τ), for each bounded and continuous function φ(·). But this follows from the boundedness of g(·). This proves (3.3c).

Since {B(t + ·) − B(t), W(t + ·) − W(t), Y(t + ·) − Y(t), t > 0} is always tight, (3.3c) implies that to every ε > 0 there exists a compact set K_ε such that
$$\limsup_h \sup_t P\left( \Psi^h(t + \cdot) \in K_\varepsilon \right) > 1 - \varepsilon.$$
Now standard arguments show that the family in (3.4) is tight; for more detail see [19, Chapter 1, Section 6] or [22, Section 5], where there are general results on getting the tightness of occupation measures from the tightness of the set of time shifted processes.

The proofs in [22, Theorem 6.3, part 2] or in [19, Chapter 5, Theorem 2.2] can be adapted to show the stationarity. We work with the marginals Q^{h,T}_f(·), and a subsequence (h, T) such that {Q^{h,T}_f(·)} converges weakly. Let H be a Borel set in the product (x(·), π(·)) path space. For c > 0, define the left shift H_c by H_c = {ψ_f(·) : ψ_f(c, ·) ∈ H}. Then
$$Q^{h,T}_f(H_c) = \frac{1}{T}\int_0^T I_{H_c}(\Psi^h_f(t, \cdot))\,dt = \frac{1}{T}\int_0^T I_H(\Psi^h_f(t + c, \cdot))\,dt,$$
$$Q^{h,T}_f(H_c) - Q^{h,T}_f(H) = \frac{1}{T}\int_T^{T+c} I_H(\Psi^h_f(t, \cdot))\,dt - \frac{1}{T}\int_0^c I_H(\Psi^h_f(t, \cdot))\,dt. \tag{3.7}$$
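The identity (3.7) only bookkeeps the two ends of the integration interval, so the shifted and unshifted occupation averages differ by at most c/T. A discrete analogue (our own illustration, with I_H replaced by a 0-1 sequence on a unit grid) makes this visible:

```python
import numpy as np

def occ_avg(indicators, shift=0):
    """Discrete occupation average: the mean of I_H(Psi(t + .)) over a
    horizon of T grid points, with the path shifted by `shift` steps."""
    T = len(indicators) - shift
    return indicators[shift:shift + T].mean()

rng = np.random.default_rng(1)
ind = rng.integers(0, 2, size=10_000).astype(float)
c, T = 5, 9_000
# as in (3.7), the shifted and unshifted averages differ only through
# c boundary terms at each end, so the gap is at most c/T
gap = abs(occ_avg(ind[:T + c], shift=c) - occ_avg(ind[:T]))
```

As T grows with c, h fixed, the gap vanishes, which is the discrete counterpart of the shift-invariance argument for the limit measure.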
The difference goes to zero as T → ∞ for all ω, c, h and H. This and the weak convergence yield the stationarity of the limit processes (X^ω(·), Π^ω(·)) for almost all ω.

We now need to identify the processes associated with the limits of the occupation measures and prove the main result (2.10).

Theorem 3.2. Assume (A2.1), (A2.2), (A2.3). Then, for almost all ω, the following hold.

(i) (B^ω(·), W^ω(·)) are standard Wiener processes, with respect to which (X^ω(·), Π^ω(·), Y^ω(·)) are nonanticipative.

(ii)
$$dY^\omega = g(X^\omega)\,dt + dB^\omega. \tag{3.8}$$

(iii)
$$dX^\omega = p(X^\omega)\,dt + \sigma(X^\omega)\,dW^\omega. \tag{3.9}$$

(iv) For each bounded and measurable real-valued function φ(·),
$$\langle \Pi^\omega(t), \phi \rangle = \frac{E_{\{\Pi^\omega(0), Y^\omega_{0,t}\}}\left[ \phi(\tilde X(t))\, R(\tilde X_{0,t}, Y^\omega_{0,t}) \right]}{E_{\{\Pi^\omega(0), Y^\omega_{0,t}\}}\, R(\tilde X_{0,t}, Y^\omega_{0,t})}. \tag{3.10}$$

(v) (X^ω(·), Π^ω(·)) is the unique stationary process, and hence its distribution does not depend on ω.

Finally, for any bounded and measurable real-valued function F(·) which is continuous almost everywhere with respect to the unique measure Q̄_f(·) of the stationary process,
$$\frac{1}{T}\int_0^T F(\Psi^h_f(t + \cdot))\,dt \to \int F(\psi_f(\cdot))\,\bar Q_f(d\psi_f(\cdot)) \tag{3.11}$$
in probability as h → 0 and T → ∞ in any way at all.

Proof. Let h_k → 0, T_k → ∞. Fix a weakly convergent subsequence of the {Q^{h_k,T_k}(·)}_{k≥1}, and index it by h_k, T_k also, abusing the notation. It will turn out that all convergent subsequences have the same limit.

The processes B^ω(·), W^ω(·). Let us demonstrate the Wiener and mutual independence properties (for almost all ω). Let f_i(·) and q(·) be real-valued and continuous functions of their arguments and with compact support, with the f_i(·) being twice continuously differentiable. Let φ_j(·) be an arbitrary finite collection of bounded and continuous real-valued functions. Let ρ and τ be arbitrary nonnegative numbers, and let u_i ≤ ρ be (a finite collection of) arbitrary nonnegative real numbers. Define the function
$$F(\psi(\cdot)) = q\left( x(u_i), \langle \pi(u_i), \phi_j \rangle, b(u_i), w(u_i);\ i, j \right) \times \left[ f_1(b(\rho+\tau)) f_2(w(\rho+\tau)) - f_1(b(\rho)) f_2(w(\rho)) - \int_\rho^{\rho+\tau} A_0 f_1(b(s)) f_2(w(s))\,ds \right],$$
where $A_0 = \frac{1}{2}\left[ \sum_i \partial^2/\partial b_i^2 + \sum_i \partial^2/\partial w_i^2 \right]$. The b_i (resp., w_i) are the scalar components of b (resp., w). We will show that
$$\int Q^\omega(d\psi(\cdot))\, F(\psi(\cdot)) = 0 \tag{3.12}$$
for almost all ω. The arbitrariness of the functions and of ρ, τ, u_i ≤ ρ, together with (3.12), implies that W^ω(·) and B^ω(·) are standard vector-valued and independent Wiener processes, and martingales with respect to the filtration generated by {Ψ^ω(s), s ≤ t}, hence the nonanticipativity. To get (3.12), we will show that the mean square value
$$E\left| \int Q^{h_k, T_k}(d\psi(\cdot))\, F(\psi(\cdot)) \right|^2 \tag{3.13}$$
goes to zero as h, T go to their limits. This and the weak convergence imply (3.12), since they imply that $E\left| \int Q(d\psi(\cdot)) F(\psi(\cdot)) \right|^2 = 0$. [An analogous calculation for a queueing problem is in [22].]

By the definition of the occupation measure, the expression
$$\int Q^{h_k, T_k}(d\psi(\cdot))\, F(\psi(\cdot)) \tag{3.14a}$$
equals
$$\frac{1}{T_k}\int_0^{T_k} F_1(t)\,dt, \tag{3.14b}$$
where F_1(t) equals F_2(t) F_3(t), with
$$F_2(t) = q\left( X(t+u_i), \langle \Pi^{h_k}(t+u_i), \phi_j \rangle, B(t+u_i) - B(t), W(t+u_i) - W(t);\ i, j \right),$$
$$\begin{aligned} F_3(t) = {}& f_1(B(t+\rho+\tau) - B(t)) f_2(W(t+\rho+\tau) - W(t)) \\ & - f_1(B(t+\rho) - B(t)) f_2(W(t+\rho) - W(t)) \\ & - \int_\rho^{\rho+\tau} A_0 f_1(B(t+s) - B(t)) f_2(W(t+s) - W(t))\,ds. \end{aligned}$$
The mean square value of (3.14b) equals
$$\frac{1}{T_k^2}\int_0^{T_k}\!\!\int_0^{T_k} E\, F_2(t) F_2(v) F_3(t) F_3(v)\,dt\,dv. \tag{3.15}$$
Now, using the fact that (the martingale property)
$$E\left[ F_3(v) \mid W(u), B(u),\ u \le t + \rho \right] = 0, \quad \text{for } v \ge t, \tag{3.16}$$
we see that (3.15) is of the order of O(1/T_k), which yields the desired result.

Identify the limits X^ω(·), Y^ω(·). For a vector z, define |z|²₁ = |z|² ∧ 1. To identify the Y^ω(·) process, we will prove that
$$E \int Q^\omega(d\psi(\cdot)) \left| y(s) - \int_0^s g(x(u))\,du - b(s) \right|^2_1 = 0 \tag{3.17}$$
for each s. This is equivalent to
$$E\,E^\omega \left| Y^\omega(s) - \int_0^s g(X^\omega(u))\,du - B^\omega(s) \right|^2_1 = 0,$$
which implies (3.8) for almost all ω. Equation (3.17) is shown by using the definition of the occupation measure and the weak convergence. I.e., use the fact that (2.2) implies that
$$E \int Q^{h_k, T_k}(d\psi(\cdot)) \left| y(s) - \int_0^s g(x(u))\,du - b(s) \right|^2_1 = E\, \frac{1}{T_k}\int_0^{T_k} \left| Y(t+s) - Y(t) - \int_t^{t+s} g(X(u))\,du - (B(t+s) - B(t)) \right|^2_1 dt = 0.$$
Now use the weak convergence and the fact that the integrand in the first integral is a bounded and continuous function of ψ(·).

We deal with X(·) by a process approximation method. Recall the definitions of p(·) and σ(·) from (2.1). For each ε > 0 and t ≥ 0, there is ∆_0 > 0 such that for ∆ ≤ ∆_0 we have (uniformly in u ≥ 0)
$$E\left| X(u+t) - X(u) - \int_u^{u+t} p(X(s))\,ds - \sum_{i:\, i\Delta < t} \sigma(X(u+i\Delta))\left[ W(u+i\Delta+\Delta) - W(u+i\Delta) \right] \right|^2 = O(\varepsilon). \tag{3.18}$$
By the weak convergence, for each ε > 0 and t ≥ 0 there is ∆_ε > 0 such that for ∆ ≤ ∆_ε,
$$E \int \left| x(t) - x(0) - \int_0^t p(x(s))\,ds - \sum_{i:\, i\Delta < t} \sigma(x(i\Delta))\left[ w(i\Delta+\Delta) - w(i\Delta) \right] \right|^2_1 Q^\omega(d\psi(\cdot)) = O(\varepsilon). \tag{3.19}$$
This implies (3.9) for almost all ω. For each s > 0 and ε > 0, there is ∆_0 such that for ∆ < ∆_0 and small h > 0, we have
$$\sup_{\Pi^h(t),\, t} P\left\{ \left| \hat Z^{h,\Delta}(t, s) \right| \ge \varepsilon \right\} \le \varepsilon. \tag{3.20}$$
This is due to the just cited equicontinuity in probability, keeping in mind that the state space is compact; the sup in (3.20) is over all possible values.

Next, we verify that (3.10) holds. We would like to show that, for each $s$,
$$E\int Q^\omega(d\psi(\cdot))\left| \langle \pi(s), \phi\rangle - \frac{E_{\{\pi(0),y_{0,s}\}}\bigl[\phi(\tilde X(s))\,R(\tilde X_{0,s}, y_{0,s})\bigr]}{E_{\{\pi(0),y_{0,s}\}}\,R(\tilde X_{0,s}, y_{0,s})}\right|_1^2 = 0. \tag{3.21}$$
But the integrand function is not well defined, since $y(\cdot)$ is an arbitrary path function, and the stochastic integral in $R(\cdot)$ in (3.21) is not defined. The problem is resolved by working with an approximation. It will actually be shown that
$$E\int Q^\omega(d\psi(\cdot))\left| \langle \pi(s), \phi\rangle - \frac{E_{\{\pi(0),y_{0,s}\}}\bigl[\phi(\tilde X(s))\,R^\Delta(\tilde X_{0,s}, y_{0,s})\bigr]}{E_{\{\pi(0),y_{0,s}\}}\,R^\Delta(\tilde X_{0,s}, y_{0,s})}\right|_1^2 \tag{3.22}$$
goes to zero as $\Delta \to 0$, which implies (3.10). Here $\tilde X(\cdot)$ is a process with initial distribution $\pi(0)$ and the law of evolution of $X(\cdot)$. Note that the integrand function in (3.22) does not depend on this process, only on its initial condition. Also, it is bounded and continuous with probability one (with respect to $Q^\omega(\cdot)$) for almost all $\omega$. By the weak convergence, (3.22) is the limit of
$$E\int Q^{h_k,T_k}(d\psi(\cdot))\left| \langle \pi(s), \phi\rangle - \frac{E_{\{\pi(0),y_{0,s}\}}\bigl[\phi(\tilde X^{h_k}(s))\,R^\Delta(\tilde X^{h_k}_{0,s}, y_{0,s})\bigr]}{E_{\{\pi(0),y_{0,s}\}}\,R^\Delta(\tilde X^{h_k}_{0,s}, y_{0,s})}\right|_1^2.$$
Thus, we need only show that, for each $s > 0$ and each bounded and continuous function $\phi(\cdot)$, this expression is small for small $\Delta > 0$, $h_k > 0$ and large $T_k < \infty$. By the definition of the measure $Q^{h,T}(\cdot)$, this expression equals
$$E\,\frac{1}{T_k}\int_0^{T_k} \bigl| \hat Z^{h_k,\Delta}(t,s)\bigr|_1^2\,dt. \tag{3.23}$$
Now, the fact that (3.23) is small for small $h_k, \Delta$ follows from (3.20).
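The approximation step (3.18) is an Euler-Maruyama-type decomposition of the diffusion increment into a drift integral plus a piecewise-constant stochastic sum on a mesh $\Delta$. A small numerical sketch of that decomposition (the scalar coefficients `p` and `sigma` below are illustrative stand-ins, not the paper's): the mean-square remainder vanishes when the stochastic sum uses the simulation mesh itself, and is small but nonzero for a coarser mesh.

```python
import numpy as np

rng = np.random.default_rng(0)

def p(x):          # illustrative drift (not from the paper)
    return -x

def sigma(x):      # illustrative diffusion coefficient, bounded
    return 0.5 + 0.25 * np.cos(x)

def residual_ms(dt_coarse, n_paths=200, T=1.0, dt_fine=1e-3):
    """Mean square of the (3.18)-type remainder
       X(T) - X(0) - int_0^T p(X)ds - sum_i sigma(X(i*D))[W(i*D+D) - W(i*D)],
    with D = dt_coarse and the 'true' path generated by Euler-Maruyama
    on the fine mesh dt_fine."""
    n_fine = int(round(T / dt_fine))
    stride = int(round(dt_coarse / dt_fine))
    ms = 0.0
    for _ in range(n_paths):
        dW = rng.normal(0.0, np.sqrt(dt_fine), n_fine)
        x = np.empty(n_fine + 1)
        x[0] = 0.3
        for i in range(n_fine):        # fine-mesh Euler-Maruyama path
            x[i + 1] = x[i] + p(x[i]) * dt_fine + sigma(x[i]) * dW[i]
        drift_int = np.sum(p(x[:-1])) * dt_fine
        stoch_sum = sum(sigma(x[i]) * dW[i:i + stride].sum()
                        for i in range(0, n_fine, stride))
        ms += (x[-1] - x[0] - drift_int - stoch_sum) ** 2
    return ms / n_paths

coarse = residual_ms(0.25)    # coarse mesh Delta: small but nonzero remainder
fine = residual_ms(1e-3)      # mesh equal to the simulation step: remainder is 0
print(coarse, fine)
```

Refining `dt_coarse` toward the simulation step drives the remainder to zero, mirroring the role of $\Delta \to 0$ in (3.18)-(3.19).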
Theorem 3.1 yields the stationarity of $(X^\omega(\cdot), \Pi^\omega(\cdot))$ for almost all $\omega$. The measure $\bar Q_f(\cdot)$ of the stationary $(X^\omega(\cdot), \Pi^\omega(\cdot))$ process is unique by assumption. Thus $Q_f^{h_k,T_k}(\cdot)$ converges weakly to $\bar Q_f(\cdot)$ along the chosen subsequence. The uniqueness also implies that the subsequence is irrelevant and that $Q_f^{h,T}(\cdot)$ converges weakly to $\bar Q_f(\cdot)$ along any $(h,T)$ sequence. This implies (3.11).
4 Non–Markov Signal Processes
In actual physical applications, the signal process would generally be not a diffusion or jump-diffusion, but rather some non-Markov process which is only approximated by a diffusion or jump-diffusion. For example, the physical process might be a dynamical system driven by a wide bandwidth noise process rather than a Wiener process. To quantify the connection, we parameterize the physical process and suppose that it converges weakly to some "ideal" signal process, a diffusion or jump-diffusion, as the approximation parameter $h$ goes to its limit. We would construct an approximation to the optimal filter for the ideal signal process, but use the physical observations. Thus, we are concerned with the convergence of the pathwise average errors as the time interval goes to its infinite limit, the physical process is better and better approximated by its ideal limit, and the filter approximation converges to the optimal filter for that ideal limit.

Thus, we have a physical process $X^h(\cdot)$, with values in the same compact range space $G$, which is approximated by the diffusion $X(\cdot)$ defined by (2.1), and a filter $\Pi^h(\cdot)$ which is a numerical approximation to the optimal filter for (2.1) with observations (2.2), but where the actual physical observations, defined by
$$dY^h = g(X^h(t))\,dt + dB, \tag{4.1}$$
are used. We could use a parameter other than $h$ to index the "pre-diffusion" process and then let it go to its limit together with $h, T$, but there is no loss of generality in using $h$ to index this process as well.

The filter to be built is as in Section 2. It is constructed under the assumption that the system is (2.1) with observations (2.2), but with (4.1) actually used. The approximating filter uses the auxiliary process $\tilde X^h(\cdot)$, which satisfies (A2.2). Thus the approximating filter takes the form (2.7), but with $Y^h(\cdot)$ replacing $Y(\cdot)$. The process $X^h(\cdot)$ is not to be confused with $\tilde X^h(\cdot)$. The former is the actual physical signal process. The latter continues to play the role that it had in the previous sections; i.e., it is an approximation to the ideal limit, used only to get a numerical approximation to the optimal filter for (2.1) and (2.2). The following is the basic assumption on the physical process $X^h(\cdot)$.

A4.1 $X^h(t) \in G$ for all $t, h > 0$. The family $\{X^h(t+\cdot);\ h > 0, t \ge 0\}$ is tight, and whenever for some subsequence $(h_k, t_k)_{k \ge 1}$, where $h_k \to 0$ as $k \to \infty$, the distribution of $X^{h_k}(t_k)$ converges weakly to some measure $\Pi$ and the sequence
$\{X^{h_k}(t_k+\cdot)\}_{k \ge 1}$ is weakly convergent, then the limit of the latter is $X(\cdot)$ with initial distribution $\Pi$.

If the filter is to operate over some fixed finite time interval, then convergence results are relatively easy to obtain, since the criteria of interest are mean values. Many such results are in [21].

In the definition of $\Psi^h(\cdot)$, drop the $W$-component, and replace $X(\cdot), Y(\cdot)$ by $X^h(\cdot), Y^h(\cdot)$, resp. Redefine $\Psi(\cdot), \psi(\cdot), Q^{h,T}(\cdot)$ analogously. In Theorem 4.1 we obtain the analog of Theorems 3.1 and 3.2 for this approximate filtering problem. Theorem 3.1 continues to hold, with the same proof. In the proof of the analog of Theorem 3.2, there is no new problem with the identification of the limit processes $Y^\omega(\cdot)$ and $\Pi^\omega(\cdot)$. But there is a problem in the identification of the limit $X^\omega(\cdot)$. The assumption of weak convergence of $X^h(t+\cdot)$ made above is not enough. The problem is that the $Q^{h,T}(\cdot)$ are occupation measures, whose values depend on the samples of the paths. Owing to this, we need to make a (quite unrestrictive) assumption so that a martingale type method can be used to identify $X^\omega(\cdot)$. Theorem 3.2 can be viewed as an ergodic theorem or weak law of large numbers, and it works partly due to the ergodic properties of $X(\cdot)$, as reflected in the uniqueness of the stationary process. But the assumed weak convergence of $X^h(t+\cdot)$ is not enough in itself to get such a weak law of large numbers: it says little about "long range dependence." This is the main issue that we will need to deal with.

A note on the martingale problem method. In order to motivate the assumptions to be used, we first recall the classical martingale problem formulation of the existence of a solution to (2.1). Let $\phi_j(\cdot)$, $q(\cdot)$ and $\rho, \tau, u_i \le \rho$ satisfy the conditions above (3.12), and let $f(\cdot)$ be real-valued with compact support and with continuous partial derivatives up to third order. Let $A$ denote the differential generator of $X(\cdot)$.
To identify the process $X^\omega(\cdot)$ as one satisfying (3.9) for an appropriate Wiener process $W^\omega(\cdot)$, it is sufficient to show that $X(\cdot)$ solves the martingale problem of Stroock and Varadhan; namely, for all such $f(\cdot)$, $q(\cdot)$, etc.,
$$\int Q^\omega(d\psi(\cdot))\,q\bigl(x(u_i), \langle \pi(u_i), \phi_j\rangle, b(u_i);\ i,j\bigr)\left[ f(x(\rho+\tau)) - f(x(\rho)) - \int_\rho^{\rho+\tau} Af(x(s))\,ds\right] = 0. \tag{4.2}$$
To show (4.2), it is natural to try to evaluate
$$E\left|\int Q^{h,T}(d\psi(\cdot))\,q\bigl(x(u_i), \langle \pi(u_i), \phi_j\rangle, b(u_i);\ i,j\bigr)\left[ f(x(\rho+\tau)) - f(x(\rho)) - \int_\rho^{\rho+\tau} Af(x(s))\,ds\right]\right|_1^2 \tag{4.3}$$
and show that it goes to zero as $h, T$ go to their limits, analogously to the procedure used in connection with (3.12) and (3.13). This and the weak convergence imply (4.2); equivalently, there exist $X^\omega(\cdot)$ and $W^\omega(\cdot)$, with $X^\omega(\cdot)$ nonanticipative, satisfying (3.9) for almost all $\omega$.
It is hard to show (4.3) directly. We take an approach which has been one of the most powerful tools to date for functional limit theorems for convergence to diffusion type processes, namely the perturbed test function method [5, 13, 17, 18, 19, 26]. This method uses a perturbation $f^h(\cdot)$ of the test function $f(\cdot)$ in (4.2). To simplify the development, and preserve the generality of the results, we will simply assume that a suitable perturbation exists; methods for constructing it in many important cases are in the references.

The perturbed test function method. Given a test function $f(\cdot)$, three times continuously differentiable and with compact support, we seek a process $f^h(\cdot)$ which is close to $f(X^h(\cdot))$, and an extension $\hat A^h$ of the operator $A$ such that, loosely speaking, $f^h(\cdot) - f(X^h(\cdot)) \to 0$ and $\hat A^h f^h(\cdot) - Af(X^h(\cdot)) \to 0$. With such perturbed test functions $f^h(\cdot)$ available, simple adaptations of the martingale method can be used to identify $X^\omega(\cdot)$.

The operator $\hat A^h$. Let $\mathcal F_t^h$ be a filtration on the probability space, where $\mathcal F_t^h$ measures at least $\{\Psi^h(s), s \le t\}$, where $\Psi^h(s) = (X^h(s), \Pi^h(s), Y^h(s), B(s))$. Let $u(\cdot)$ and $v(\cdot)$ denote measurable processes which are $\mathcal F_t^h$-adapted and progressively measurable and satisfy
$$\sup_t E|u(t)| < \infty, \qquad \sup_t E|v(t)| < \infty, \tag{4.4}$$
$$\lim_{\delta \to 0}\,\sup_t E\left| \frac{E[u(t+\delta)\,|\,\mathcal F_t^h] - u(t)}{\delta} - v(t)\right| = 0. \tag{4.5}$$
Then we say that $u$ is in the domain of $\hat A^h$ and write $\hat A^h u = v$.

A4.2 For each test function $f(\cdot)$ of the above type, there are processes $f^h(\cdot)$ in the domain of $\hat A^h$ such that, for each $T > 0$:
$$\lim_h \sup_t E\bigl| f^h(t) - f(X^h(t))\bigr| = 0, \tag{4.6}$$
$$\lim_h \sup_t \sup_{\tau \le T} E\left| \int_t^{t+\tau} \bigl[\hat A^h f^h(s) - Af(X^h(s))\bigr]\,ds\right| = 0. \tag{4.7}$$
General methods for the construction of $f^h(\cdot)$ are given in [18] under quite broad conditions on the processes involved, and the reader is referred to that reference for more information. See also [19]. The key result for applications is the following.

Lemma 4.2. [18] Let the sequence $\{X^h(\cdot), h > 0\}$ converge weakly in $D[I\!R^r; 0, \infty)$ to a process $X(\cdot)$. Assume (A4.2). Then $X(\cdot)$ solves the martingale problem for the operator $A$.

The following theorem essentially says that, under the above assumptions, Theorems 3.1 and 3.2 continue to hold for this approximate filtering problem.

Theorem 4.1. Assume A2.1, A2.2, A2.3, A4.1, A4.2. Then (3.3a) holds with $X(\cdot)$ replaced by $X^h(\cdot)$. Also, (3.3b), (3.3c) and (3.4) continue to hold with the modified definitions of $\Pi^h(\cdot)$ and $Q^{h,T}(\cdot)$ stated at the beginning of this section. Let $Q$ be a weak limit as in Theorem 3.1; then $Q^\omega$ induces a process $\Psi^\omega(\cdot) = (X^\omega(\cdot), \Pi^\omega(\cdot), Y^\omega(\cdot), B^\omega(\cdot))$. For almost all $\omega$ the following hold. (i) The pair $(X^\omega(\cdot), \Pi^\omega(\cdot))$ is stationary. (ii) There exists a Wiener process $W^\omega$ for which statements (i) through (v) in Theorem 3.2 hold. Finally, the last conclusion of Theorem 3.2 also holds; i.e., (3.11) is true.
Comment on the proof. The only difference from the proofs of Theorems 3.1 and 3.2 concerns the proof of (4.2). As noted, this is to be done by showing that (4.3) goes to zero as $h, T$ go to their limits. Thus, we confine our remarks to an evaluation of (4.3). Let $h \to 0$, $T \to \infty$ index a weakly convergent subsequence. The expression
$$\int Q^{h,T}(d\psi(\cdot))\,q\bigl(x(u_i), \langle \pi(u_i), \phi_j\rangle, b(u_i);\ i,j\bigr)\left[ f(x(\rho+\tau)) - f(x(\rho)) - \int_\rho^{\rho+\tau} Af(x(s))\,ds\right] \tag{4.8}$$
equals
$$\frac{1}{T}\int_0^T q\bigl(X^h(t+u_i), \langle \Pi^h(t+u_i), \phi_j\rangle, B(t+u_i)-B(t);\ i,j\bigr)\left[ f(X^h(t+\rho+\tau)) - f(X^h(t+\rho)) - \int_{t+\rho}^{t+\rho+\tau} Af(X^h(s))\,ds\right] dt. \tag{4.9}$$
Define
$$M_f^h(t) = f^h(t) - \int_0^t \hat A^h f^h(s)\,ds.$$
Rewrite the bracketed term in (4.9) as
$$\bigl[M_f^h(t+\rho+\tau) - M_f^h(t+\rho)\bigr] + \bigl[V^h(t+\rho+\tau) - V^h(t+\rho)\bigr],$$
where
$$V^h(t) = \bigl[f(X^h(t)) - f^h(t)\bigr] - \int_0^t \bigl[Af(X^h(s)) - \hat A^h f^h(s)\bigr]\,ds.$$
By (4.6) and (4.7), $\lim_h \sup_t E|V^h(t+\rho+\tau) - V^h(t+\rho)| = 0$. The weak sense limits of (4.9) are then the same as those of
$$\frac{1}{T}\int_0^T q\bigl(X^h(t+u_i), \langle \Pi^h(t+u_i), \phi_j\rangle, B(t+u_i)-B(t);\ i,j\bigr)\bigl[M_f^h(t+\rho+\tau) - M_f^h(t+\rho)\bigr]\,dt. \tag{4.10}$$
Now square (4.10), take expectations, and use the martingale properties of $M_f^h(\cdot)$ to get that the mean square value is $O(1/T)$. This and the weak convergence imply (4.2).
5 Discrete Time Observations

5.1 A Markov Signal Process

Now let all processes be defined in discrete time. The idea is fully analogous to what was done in the continuous time case. The following is the basic assumption on the signal process.
A5.1 The signal process $X(\cdot) = \{X(n), n < \infty\}$ is Feller–Markov and takes values in a compact subset $G$ of $I\!R^r$. The observations are defined by $Y(0) = 0$ and
$$Y(n) - Y(n-1) = g(X(n)) + \xi(n), \qquad n = 1, 2, \ldots, \tag{5.1}$$
where $\{\xi(n)\}$ are mutually independent $N(0, I)$ Gaussian random variables which are independent of $X(\cdot)$, and $g(\cdot)$ is continuous.

The Bayes' rule formula for the true conditional distribution of $X(n)$ given $Y_{0,n}$ can be represented in terms of an auxiliary process $\tilde X(\cdot)$ as in Section 2, where $\tilde X(\cdot)$ has the probability law of $X(\cdot)$ but, conditioned on its (possibly random) initial condition, is independent of all the other processes. Analogously to the notation used for the continuous time problem, let $Y_{0,n}$ denote the set $\{Y(i); i \le n\}$. Define
$$R(\tilde X_{0,n}, Y_{0,n}) = \exp\left[ \sum_{i=1}^n g'(\tilde X(i))[Y(i) - Y(i-1)] - \frac{1}{2}\sum_{i=1}^n \bigl| g(\tilde X(i))\bigr|^2 \right].$$
Then, as in Section 2, the optimal filter $\Pi(\cdot)$ can be defined by its moments:
$$\langle \Pi(n), \phi\rangle = \frac{E_{\{\Pi(0),Y_{0,n}\}}\bigl[\phi(\tilde X(n))\,R(\tilde X_{0,n}, Y_{0,n})\bigr]}{E_{\{\Pi(0),Y_{0,n}\}}\,R(\tilde X_{0,n}, Y_{0,n})}, \tag{5.2}$$
where $\Pi(0)$ is the distribution of $X(0)$ and $\tilde X(0)$. As before, $(X(\cdot), \Pi(\cdot))$ is Feller–Markov. Henceforth, as in the previous sections, when discussing $\Pi(\cdot)$ we allow $\Pi(0)$ to be arbitrary, and not necessarily the distribution of $X(0)$.

Analogously to the case in Section 2, it is generally necessary to approximate the optimal filter in some way. For example, the sequence $X(\cdot)$ might consist of samples of a diffusion process taken at discrete instants, and the transition function must then be computed approximately. In order to approximate the transition function, we might try to solve the corresponding Fokker–Planck equation on $[0,1]$ by the Markov chain approximation, and also appropriately approximate the initial condition. Or we might try to solve it by some other numerical method (e.g., finite elements, spectral method, etc.) which yields approximations to the one step transition density (assuming that there is a density) which converge in $L_2$ to the true transition density as the approximation parameter goes to its limit. If this approximation is not non-negative, we can take its positive part and renormalize to get a transition probability.
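The Bayes recursion underlying (5.2) is easy to sketch when the signal is a finite-state chain. The transition matrix `P` and observation function `g` below are made-up illustrations, with $\xi(n) \sim N(0, I)$ as in (5.1):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative 3-state signal model (not from the paper).
P = np.array([[0.90, 0.05, 0.05],
              [0.05, 0.90, 0.05],
              [0.05, 0.05, 0.90]])   # one-step transition matrix of X
g = np.array([-2.0, 0.0, 2.0])       # observation function, one value per state

def filter_step(pi, dy):
    """One Bayes update: Markov prediction, then reweighting by the
    Gaussian likelihood of the observation increment dy = Y(n) - Y(n-1)."""
    pred = P.T @ pi                              # law of X(n) given Y_{0,n-1}
    post = pred * np.exp(-0.5 * (dy - g) ** 2)   # constants cancel on normalizing
    return post / post.sum()

N = 200
x = 0
pi = np.full(3, 1.0 / 3.0)           # Pi(0) need not be the true initial law
mass_on_truth = 0.0
for n in range(N):
    x = rng.choice(3, p=P[x])        # signal step
    dy = g[x] + rng.normal()         # observation increment, per (5.1)
    pi = filter_step(pi, dy)
    mass_on_truth += pi[x]
print(mass_on_truth / N)   # well above 1/3: the filter tracks the state
```

The recursive form and the ratio form (5.2) agree here; the recursion is simply Bayes' rule applied step by step.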
Finally, even if the original processes were given in discrete time, i.e., no approximations of a p.d.e. were needed, it might be too hard to perform the integrations with the true transition function, and we might have to use an appropriate approximation instead. These considerations lead to a formulation analogous to what led to the approximating form (2.14), which we now develop.

As for the continuous time case, the approximating filter can often be represented in terms of a discrete parameter auxiliary Markov process $\tilde X^h(\cdot)$, with values in a compact subset of $I\!R^r$, which is independent of the other processes given its initial condition. As in the continuous time case, we will assume the following condition on the auxiliary process.

A5.2 If $\tilde X^h(0)$ converges weakly (for any subsequence of values of $h \to 0$) with limit distribution $\Pi(0)$, then the sequence $\tilde X^h(\cdot)$ converges weakly to $X(\cdot)$ with initial distribution $\Pi(0)$.

Define, for $n = 1, 2, \ldots$,
$$R(\tilde X^h_{0,n}, Y_{0,n}) = \exp\left[ \sum_{i=1}^n g'(\tilde X^h(i))[Y(i) - Y(i-1)] - \frac{1}{2}\sum_{i=1}^n \bigl| g(\tilde X^h(i))\bigr|^2 \right].$$
Again, define the approximating filter $\Pi^h(\cdot)$ by its moments, as follows:
$$\langle \Pi^h(n), \phi\rangle = \frac{E_{\{\Pi^h(0),Y_{0,n}\}}\bigl[\phi(\tilde X^h(n))\,R(\tilde X^h_{0,n}, Y_{0,n})\bigr]}{E_{\{\Pi^h(0),Y_{0,n}\}}\,R(\tilde X^h_{0,n}, Y_{0,n})}. \tag{5.3}$$
Equation (5.3) is just the Bayes' rule formula for the filter for the signal process $\tilde X^h(\cdot)$, but with the actual observations $Y(n) - Y(n-1)$ used. The analog of the semigroup relation (2.15) holds, but with time discrete. The semigroup property stems from the fact that the filter is that for the discrete parameter Markov process $\tilde X^h(\cdot)$, but with the actual physical observations used.

Define the sequences $B(n) = \sum_{i=1}^n \xi(i)$ and
$$\Psi^h(n,\cdot) = \bigl(X(n+\cdot),\ \Pi^h(n+\cdot),\ B(n+\cdot)-B(n),\ Y(n+\cdot)-Y(n)\bigr), \qquad \Psi^h_f(n,\cdot) = \bigl(X(n+\cdot),\ \Pi^h(n+\cdot)\bigr).$$
Define $\psi(\cdot)$ and $\psi_f(\cdot)$ analogously. The Skorohod topology is replaced by a "sequence" topology, as follows. The $\Pi^h(n)$ still take values in $\mathcal M(G)$, and the weak topology is still used on this space. Let $d_\pi(\cdot)$ and $d_k(\cdot)$ denote the metrics on the space of measures and on Euclidean $k$-space, resp., where $k$ is the sum of the dimensions of $Y(n)$, $B(n)$ and $X(n)$. Let $d_0(\cdot)$ denote the product metric. Then the metric on the product path (sequence) space is
$$d(a(\cdot), b(\cdot)) = \sum_{n=0}^\infty 2^{-n}\bigl[ d_0(a(n), b(n)) \wedge 1\bigr].$$
Define the occupation measure $Q^{h,N}(\cdot)$ by: for a Borel set $C$ in the product sequence space,
$$Q^{h,N}(C) = \frac{1}{N}\sum_{n=1}^N I_C(\Psi^h(n,\cdot)). \tag{5.4}$$
Finally, we impose the following basic condition.
A5.3 There is a unique stationary measure $\bar Q_f(\cdot)$ of the $\Psi_f(\cdot) = (X(\cdot), \Pi(\cdot))$ process.

Let $F(\cdot)$ be a real-valued bounded and continuous (with probability one with respect to $\bar Q_f(\cdot)$) function of $\psi_f(\cdot)$. Then, analogously to (2.7) and (2.10), we are concerned with the convergence (in probability)
$$\frac{1}{N}\sum_{n=1}^N F(\Psi^h_f(n+\cdot)) \to \int F(\psi_f(\cdot))\,\bar Q_f(d\psi_f(\cdot)), \tag{5.5}$$
where $h \to 0$ and $N \to \infty$ in any way at all.

The analogs of Theorems 3.1 and 3.2 hold, and the proof is close to those proofs. But owing to the discrete time, some of the details are simpler, and others slightly different.

Theorem 5.1. Assume (A5.1), (A5.2), (A5.3). Then $\{Q^{h,N}(\cdot); h > 0, N > 0\}$ is tight. Let $Q(\cdot)$ denote a weak sense limit, always as $h \to 0$ and $N \to \infty$. Let $\omega$ be the canonical variable on the probability space on which $Q(\cdot)$ is defined, and denote the sample values by $Q^\omega(\cdot)$. Then, for each $\omega$, $Q^\omega(\cdot)$ is a measure on the product path (sequence) space. It induces a process $\Psi^\omega(\cdot) = (X^\omega(\cdot), \Pi^\omega(\cdot), Y^\omega(\cdot), B^\omega(\cdot))$. For almost all $\omega$, the following hold. (i) $(X^\omega(\cdot), \Pi^\omega(\cdot))$ is stationary. (ii) $B^\omega(\cdot)$ is the sum of mutually independent $N(0, I)$ random variables $\{\xi^\omega(n)\}$ which are independent of $X^\omega(\cdot)$. (iii) $Y^\omega(n) - Y^\omega(n-1) = g(X^\omega(n)) + \xi^\omega(n)$. (iv) $X^\omega(\cdot)$ has the transition function of $X(\cdot)$. Finally, for each integer $n$ and each bounded and measurable real-valued function $\phi(\cdot)$,
$$\langle \Pi^\omega(n), \phi\rangle = \frac{E_{\{\Pi^\omega(0),Y^\omega_{0,n}\}}\bigl[\phi(\tilde X(n))\,R(\tilde X_{0,n}, Y^\omega_{0,n})\bigr]}{E_{\{\Pi^\omega(0),Y^\omega_{0,n}\}}\,R(\tilde X_{0,n}, Y^\omega_{0,n})}.$$

Remark on the proof. The details are essentially the same as those of the proofs of Theorems 3.1 and 3.2, and are omitted. We note only two differences. First, owing to the use of discrete time, the martingale method cannot be used to characterize $B^\omega(\cdot)$, as was done in the first part of the proof of Theorem 3.2. But an analogous method, using a direct computation of the (conditional) characteristic function, can be used instead, as follows. Replace the expression $F(\cdot)$ defined above (3.12) by the following. For arbitrary integers $m, n, k$, let $u_i \le m$ and $v_p,\ p \le n$, be arbitrary integers. Let $\nu_i$ be vectors (of the dimension of $\xi(n)$), and
replace the expression by
$$F(\psi(\cdot)) = q\bigl(x(v_p), \langle \pi(u_i), \phi_j\rangle, b(u_i);\ i \le m, p \le n, j\bigr)\times\left[ \exp\left( \sum_{l=m}^{m+k} \nu_l'[b(l+1) - b(l)]\right) - \exp\left( \sum_{l=m}^{m+k} |\nu_l|^2/2\right)\right].$$
We need to show that, for almost all $\omega$,
$$\int F(\psi(\cdot))\,Q^\omega(d\psi(\cdot)) = 0, \tag{5.6a}$$
since it will prove that
$$\int Q^\omega(d\psi(\cdot))\,q\bigl(x(v_p), \langle \pi(u_i), \phi_j\rangle, b(u_i);\ i \le m, p \le n, j\bigr)\,e^{\sum_{l=m}^{m+k} \nu_l'[b(l+1)-b(l)]} = \int Q^\omega(d\psi(\cdot))\,q\bigl(x(v_p), \langle \pi(u_i), \phi_j\rangle, b(u_i);\ i \le m, p \le n, j\bigr)\,e^{\sum_{l=m}^{m+k} |\nu_l|^2/2}. \tag{5.6b}$$
This, in turn, implies that (for almost all $\omega$) the $\xi^\omega(l)$ are mutually independent, normally distributed with mean zero and with the identity matrix as covariance, and are independent of $X^\omega(\cdot)$ and of the "past" of the $\Pi^\omega(\cdot)$ process. The proof of (5.6a) is analogous to the arguments below (3.12), and the details are omitted.

We need to characterize $X^\omega(\cdot)$. In particular, we need to show that it is Markov with the transition function of $X(\cdot)$. A characteristic function and "weak law of large numbers" type argument can be used. The goal is to show that, for any integer $m$, any bounded and continuous real-valued function $q(\cdot)$, and any vector $\lambda$,
$$\int Q^\omega(d\psi(\cdot))\,q(x(u_j); u_j \le m)\left[ e^{i\lambda' x(m+1)} - \int e^{i\lambda' v} P(dv, 1\,|\,x(m))\right] = 0, \tag{5.7}$$
where $P(dv, 1|x)$ is the one step transition function of $X(\cdot)$. Equation (5.7) says that $X^\omega(\cdot)$ is Markov with the transition function of $X(\cdot)$, since it says that the conditional expectation of $e^{i\lambda' X^\omega(m+1)}$ given the "past" and the current state can be computed by use of the transition function. Rewrite (5.7) as the weak sense limit of
$$\frac{1}{N}\sum_{n=1}^N q(X(n+u_j); u_j \le m)\left[ e^{i\lambda' X(n+m+1)} - \int e^{i\lambda' v} P(dv, 1\,|\,X(n+m))\right]. \tag{5.8}$$
Note that, by the Markov property, the expectation of the $n$-th summand in (5.8), conditioned on the data to time $n+m$, is zero. Now use this last fact to show that the mean square value is $O(1/N)$. The rest of the details are as in the proofs of Theorems 3.1 and 3.2.
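The weak law of large numbers flavor of (5.4)-(5.5) can be illustrated numerically: for an ergodic chain, the empirical occupation-measure average of a window functional settles at its stationary expectation. A toy sketch (the 2-state chain is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)

# The chain below has stationary law (2/3, 1/3).  For the length-2 window
# functional F(psi) = psi(0)*psi(1), the stationary expectation is
# P(X_n = 1, X_{n+1} = 1) = (1/3) * 0.8 = 4/15.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
N = 20000
x = np.empty(N + 1, dtype=int)
x[0] = 0
for n in range(N):
    x[n + 1] = rng.choice(2, p=P[x[n]])

# (1/N) sum_n F(Psi(n + .)): the occupation-measure average of (5.4)-(5.5).
F_avg = np.mean(x[:N] * x[1:N + 1])
print(F_avg)   # settles near 4/15 ~ 0.2667
```

The same average computed from the occupation measure $Q^{h,N}$ is, by definition (5.4), identical to this time average; (5.5) is the statement that it converges to the stationary integral.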
5.2 A Non-Markov Signal Process

Recall the model of Section 4, where the underlying signal process, called $X^h(\cdot)$ there, was not Markov. We supposed that it converged weakly to the process of Section 2 as $h \to 0$. We can do a similar analysis in the discrete time case, and it is worthwhile since the actual signal process will not usually be Markov, and it is not a priori obvious that even small errors per step will not lead to large errors as the time interval goes to infinity. As noted in Section 4, we could use a symbol other than $h$ to index this process, but there is no loss of generality in using $h$. In lieu of the perturbed test function method introduced in Section 4, we will follow the procedure of the previous subsection as closely as possible. We keep the assumptions on $\tilde X^h(\cdot)$ and the filter form (5.3), which uses, instead of $Y(\cdot)$, the following modified observation process:
$$Y^h(n) - Y^h(n-1) = g(X^h(n)) + \xi(n).$$
The main new problem is the identification of $X^\omega(\cdot)$. We concentrate on that and make the assumptions of the last subsection. Some assumption on the convergence of $X^h(\cdot)$ to $X(\cdot)$ is needed. The procedure of the last subsection requires that we show that
$$\frac{1}{N}\sum_{n=1}^N q(X^h(n+u_j); u_j \le m)\left[ e^{i\lambda' X^h(n+m+1)} - \int e^{i\lambda' v} P(dv, 1\,|\,X^h(n+m))\right] \tag{5.9}$$
converges to zero in probability as $N \to \infty$. The convergence in (5.9) is not guaranteed by the weak convergence of $X^h(\cdot)$ to $X(\cdot)$, since the property (5.9) involves the "long range dependencies" of the $X^h(\cdot)$ processes. We will take an approach that is quite flexible, and whose conditions are not stringent. We make the following assumptions. Define the process $X(\cdot|x)$ to have the law of evolution of $X(\cdot)$, but with initial condition $x$.

A5.4 For an arbitrary integer $m$ and any bounded and continuous real-valued function $f(\cdot)$ of $m$ arguments,
$$\lim_h \sup_n \Bigl| Ef(X^h(n+u_j); u_j \le m) - E\bigl[f(X(u_j|X^h(n)); u_j \le m)\bigr]\Bigr| = 0. \tag{5.10}$$

A5.5 For an arbitrary integer $m$, any bounded and continuous real-valued function $f(\cdot)$ of $m$ arguments, and any $\mu > 0$, there is $m_\mu < \infty$ such that
$$\limsup_h\ \sup_{|n-\nu| \ge m_\mu} \Bigl| E\bigl[ \bigl(f(X^h(n+u_j); u_j \le m) - Ef(X^h(n+u_j); u_j \le m)\bigr)\bigl(f(X^h(\nu+u_j); u_j \le m) - Ef(X^h(\nu+u_j); u_j \le m)\bigr)\bigr]\Bigr| \le \mu. \tag{5.11}$$
Condition (5.10) can be interpreted as follows. It is equivalent to saying that if, for some sequence $n_h$, $X^h(n_h)$ converges weakly to a random variable $\hat X(0)$ as $h \to 0$, then $\{X^h(n_h+\cdot)\}$ converges weakly to $X(\cdot)$ with initial condition $\hat X(0)$. It is unrestrictive, since the usual choices for $X^h(\cdot)$ are Markov with time independent transition functions. Condition (5.11) deals with long range dependence. It basically says that if we take increments of the $X^h(\cdot)$ process which are separated by a large time interval (which does not depend on $h$), then their correlation is small for small $h$. This does not appear to be restrictive. It is guaranteed by appropriate mixing conditions [9]. Actually, (5.11) is needed only for the types of functions which appear in (5.9).

In the definition of $\Psi^h(\cdot)$, replace $X(\cdot), Y(\cdot)$ by $X^h(\cdot), Y^h(\cdot)$, respectively. Redefine $\Pi^h(\cdot)$ by replacing $Y(\cdot)$ by $Y^h(\cdot)$ in Equation (5.3). Modify the definitions of $\Psi^h_f(\cdot)$ and $Q^{h,N}(\cdot)$ in a similar manner.

Theorem 5.2. Assume (A5.1), (A5.2), (A5.3), (A5.4), (A5.5). Then the conclusions of Theorem 5.1 hold with the above modified definitions of $\Pi^h(\cdot)$ and $Q^{h,N}(\cdot)$.

Comments on the proof. All of the details are as in Theorem 5.1, except for the characterization of $X^\omega(\cdot)$, and we concentrate on this. Return to the proof of (5.7), via an evaluation of the limit of (5.9). For a bounded and continuous function $q(\cdot)$, define
$$q_1(x(u_j), x(m+1); u_j \le m) = q(x(u_j); u_j \le m)\,e^{i\lambda' x(m+1)},$$
$$q_2(x(u_j); u_j \le m) = q(x(u_j); u_j \le m)\int e^{i\lambda' v} P(dv, 1\,|\,x(m)).$$
We will show that
$$\int Q^{N,h}(d\psi(\cdot))\bigl[ q_1(x(u_j), x(m+1); u_j \le m) - q_2(x(u_j); u_j \le m)\bigr] \to 0 \tag{5.12}$$
in mean as $N \to \infty$ and $h \to 0$. This, together with the weak convergence, yields (5.7) for almost all $\omega$. Evaluate the integral in (5.12) by first rewriting it as
$$\frac{1}{N}\sum_{n=1}^N \bigl[ q_1(X^h(n+u_j), X^h(n+m+1); u_j \le m) - q_2(X^h(n+u_j); u_j \le m)\bigr]. \tag{5.13}$$
Now split (5.13) into the difference of the two sums
$$\frac{1}{N}\sum_{n=1}^N \bigl[ q_1(X^h(n+u_j), X^h(n+m+1); u_j \le m) - Eq_1(X^h(n+u_j), X^h(n+m+1); u_j \le m)\bigr] \tag{5.14}$$
and
$$\frac{1}{N}\sum_{n=1}^N \bigl[ q_2(X^h(n+u_j); u_j \le m) - Eq_1(X^h(n+u_j), X^h(n+m+1); u_j \le m)\bigr]. \tag{5.15}$$
By (A5.5), (5.14) goes to zero in mean square as $N \to \infty$ and $h \to 0$. By (A5.4), the term $Eq_1(X^h(n+u_j), X^h(n+m+1); u_j \le m)$ in (5.15) can be replaced by $Eq_1(X(u_j|X^h(n)), X(m+1|X^h(n)); u_j \le m)$, in the sense that the mean square limits as $N \to \infty$ and $h \to 0$ are the same. By the Markov property of $X(\cdot)$, for any initial condition,
$$Eq_1(X(u_j), X(m+1); u_j \le m) = Eq_2(X(u_j); u_j \le m).$$
By the last two sentences, we can replace the expected value in (5.15) by $Eq_2(X(u_j|X^h(n)); u_j \le m)$ without changing the limits. Using (A5.4) again, we can replace $Eq_2(X(u_j|X^h(n)); u_j \le m)$ by $Eq_2(X^h(n+u_j); u_j \le m)$ in (5.15), in that the mean square limits are the same. Now use (A5.5) again to get that the mean square limit of (5.15), with this last replacement, is zero. We have shown that the mean square limit of (5.13) is zero as $N \to \infty$ and $h \to 0$, which implies (5.7).
6 Discrete to Continuous Time: White or Wideband Observation Noise
The most appropriate justification of the classical continuous time filter is as an approximation to the discrete time model for small sampling intervals. Quite often, one has wide bandwidth rather than white noise, and the signal is not a diffusion but only an approximation to a diffusion. One still might wish to use some sort of ideal filter, built on the assumption that the signal is a particular diffusion and that the observation noise is white. If the filter is to be used for a very long time interval, and one is interested in pathwise prediction errors or other performance measures (see Section 2 for an example), questions arise concerning the asymptotic quality as the time, the bandwidth of the observation noise, the parameter in the diffusion approximation to the actual signal process, etc., all go to their respective limits. We will see that the previous robustness results continue to hold under reasonable conditions. Asymptotic, large time, results were obtained in [21] for mean value criteria.

Depending on how the filter with wide bandwidth observation noise is constructed, there might be so-called correction terms [21]. In the present context, we suppose either that the basic Bayes' rule formulas such as (2.14) or its later analogs are used, or else that the constructed filter is equivalent to these formulas, at least asymptotically. There is no need to introduce correction terms into the Bayes' rule formulas as written here.

The problem formulation. The signal process $X^h(\cdot)$ is assumed to satisfy the conditions used in Theorem 4.1, and is sampled at intervals of width $\Delta_h \to 0$. The observation noise is wide bandwidth in the sense to be made precise below. There is a process $B^h(\cdot)$, satisfying assumptions (A6.1), (A6.2), (A6.3) below, such that the observation process $\{Y^h(n\Delta_h); n \ge 0\}$ satisfies
$$Y^h(n\Delta_h) - Y^h(n\Delta_h - \Delta_h) = \Delta_h\,g(X^h(n\Delta_h)) + B^h(n\Delta_h) - B^h(n\Delta_h - \Delta_h). \tag{6.1}$$
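The scaling in (6.1) can be checked numerically: the signal enters each sampled increment at order $\Delta_h$, while a noise increment of a standard $B$ has standard deviation $\sqrt{\Delta_h}$, so the per-unit-time quadratic variation of $Y^h$ is dominated by the noise. A sketch with a Brownian $B$ (the sinusoidal stand-in for $g(X^h(n\Delta_h))$ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

T, D = 1.0, 1e-3                          # horizon and sampling width Delta_h
n = int(T / D)
t = np.arange(1, n + 1) * D
signal = np.sin(2 * np.pi * t)            # stands in for g(X^h(n*D))
dB = rng.normal(0.0, np.sqrt(D), n)       # increments of a standard Wiener B
dY = D * signal + dB                      # observation increments, per (6.1)

drift_part = np.sum((D * signal) ** 2)    # total O(D): negligible
qv = np.sum(dY ** 2)                      # dominated by sum dB^2, roughly T
print(qv / T, drift_part)
```

This is why, after the $\Delta_h \to 0$ limit, the sampled model behaves like the continuous observation equation with white noise.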
The assumptions on the process $B^h(\cdot)$ are as follows.

A6.1 The process $B^h(\cdot)$ is independent of $X^h(\cdot)$. The set
$$\{B^h(t+\cdot) - B^h(t);\ h > 0, t \ge 0\} \tag{6.2}$$
is tight, and any weakly convergent subsequence of the form $\{B^{h_k}(t_k+\cdot) - B^{h_k}(t_k)\}_{k \ge 1}$, with $h_k \to 0$ as $k \to \infty$, converges to the standard Wiener process $B(\cdot)$ as $k \to \infty$.

A6.2 For every family of random variables of the form $\{\mu_n^{\Delta,h}, n \ge 1, h > 0, \Delta > 0\}$ which satisfies (i)-(iii) below, we have, for each $T > 0$ and $\epsilon > 0$,
$$\lim_{\Delta \to 0}\,\limsup_h\,\sup_n\,P\left\{ \sup_{s \le T} \left| \sum_{i=n}^{n+s/\Delta_h} \mu_i^{\Delta,h}\bigl[ B^h(i\Delta_h+\Delta_h) - B^h(i\Delta_h)\bigr]\right| \ge \epsilon\right\} = 0. \tag{6.3}$$
(i) $\mu_n^{\Delta,h}$ is bounded uniformly in $n$, $\Delta$ and $h$.
(ii) For each fixed $h$ and $\Delta$, the family $\{\mu_n^{\Delta,h}, n < \infty\}$ is independent of $B^h(\cdot)$.
(iii) For each $\delta > 0$,
$$\lim_{\Delta \to 0}\,\limsup_h\,\sup_n\,P\left\{ \sup_{n \le i \le n+T/\Delta_h} \bigl|\mu_i^{\Delta,h}\bigr| \ge \delta\right\} = 0. \tag{6.4}$$
The requirement (6.3) is mainly a condition on how fast the conditional dependence of the "future," given the "past," goes to zero. Owing to the independence of $\{\mu_n^{\Delta,h}, n < \infty\}$ and $B^h(\cdot)$, the condition is not stringent. Various sufficient conditions for (A6.2) to hold will be given at the end of the section. Now for the final assumption.

A6.3 For every positive integer $m$, real numbers $\rho > 0$, $u_j \le \rho$, $j \le m$, and real-valued bounded and continuous function $f(\cdot)$,
$$E\bigl[f(B^h(t+u_j) - B^h(t); j)\,f(B^h(s+u_j) - B^h(s); j)\bigr] - Ef(B^h(t+u_j) - B^h(t); j)\,Ef(B^h(s+u_j) - B^h(s); j) \to 0 \tag{6.5}$$
as $h \to 0$, uniformly in $(s,t)$ such that $s > t + \rho + 1$.

This condition essentially says that the correlation between intervals that are separated in time by at least some constant goes to zero as $h \to 0$, uniformly in time. The process $B^h(\cdot)$ could be white, but not necessarily Gaussian; in this case, the verification of the conditions (A6.2) and (A6.3) is usually trivial. We define $Y^h(s) = B^h(s) = 0$ for $s < 0$. The filter is denoted by $\{\Pi^h(n\Delta_h)\}$, and we define the continuous time interpolations $Y^h(\cdot)$, $B^h(\cdot)$, $\Pi^h(\cdot)$ and $\tilde X^h(\cdot)$ so that they are constant on the intervals $[n\Delta_h, n\Delta_h + \Delta_h)$.
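A standard example of the kind of wide-bandwidth noise that (A6.1) has in mind (illustrative, not taken from the paper) is integrated scaled Ornstein-Uhlenbeck noise: with $y(\cdot)$ a stationary OU process and $\zeta^h(t) = y(t/h^2)/h$, the integral $B^h(t) = \int_0^t \zeta^h(s)\,ds$ is nearly Brownian for small $h$. The sketch below checks that $\mathrm{Var}\,B^h(1) \approx 1$ for the unit-rate OU choice $dy = -y\,dt + dW$:

```python
import numpy as np

rng = np.random.default_rng(1)

def integrated_wideband(h, T=1.0, n_paths=2000, ds=0.01):
    """Monte Carlo sample of B^h(T) = h * int_0^{T/h^2} y(s) ds, with y a
    stationary OU process (dy = -y dt + dW, stationary variance 1/2)."""
    horizon = T / h**2                         # OU time needed to cover [0, T]
    n = int(round(horizon / ds))
    y = rng.normal(0.0, np.sqrt(0.5), n_paths) # start in the stationary law
    a = np.exp(-ds)                            # exact one-step OU transition
    s = np.sqrt(0.5 * (1 - a**2))
    integral = np.zeros(n_paths)
    for _ in range(n):
        integral += y * ds                     # left Riemann sum of int y ds
        y = a * y + s * rng.normal(size=n_paths)
    return h * integral

Bh = integrated_wideband(h=0.1)
var_est = Bh.var()
print(var_est)   # near 1 = 2 * int_0^inf Cov(y(0), y(tau)) dtau, the Wiener limit
```

As $h \to 0$ the correlation time of $\zeta^h$ shrinks like $h^2$, which is exactly the decorrelation behavior that (6.5) quantifies.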
In what follows, we use the convention that if a limit of summation is not an integer, we take the integer part. We represent the filter as in the previous sections, in terms of an auxiliary process $\tilde X^h(\cdot)$ which satisfies the assumptions imposed on it in Section 2. At the sampling times, the filter takes the form
$$\langle \Pi^h(n\Delta_h), \phi\rangle = \frac{E_{\{\Pi^h(0),Y^h_{0,n\Delta_h}\}}\bigl[\phi(\tilde X^h(n\Delta_h))\,R(\tilde X^h_{0,n\Delta_h}, Y^h_{0,n\Delta_h})\bigr]}{E_{\{\Pi^h(0),Y^h_{0,n\Delta_h}\}}\,R(\tilde X^h_{0,n\Delta_h}, Y^h_{0,n\Delta_h})}, \tag{6.6}$$
where we use the definition
$$R(\tilde X^h_{0,n\Delta_h}, Y^h_{k\Delta_h,(n+k)\Delta_h}) = e^Z, \tag{6.7}$$
where $Z$ is defined by
$$Z = \sum_{i=1}^n g'(\tilde X^h(i\Delta_h))\bigl[Y^h(k\Delta_h+i\Delta_h) - Y^h(k\Delta_h+i\Delta_h-\Delta_h)\bigr] - \frac{\Delta_h}{2}\sum_{i=1}^n \bigl| g(\tilde X^h(i\Delta_h))\bigr|^2. \tag{6.8}$$
The definition (6.6), with (6.8) used, satisfies the semigroup property over the discrete time mesh $\{n\Delta_h; n \ge 1\}$. Instead of the constant interpolation of $\Pi^h(\cdot)$ over the interval $[n\Delta_h, n\Delta_h+\Delta_h)$, one may use the following different definition for the approximate filter:
$$\langle \Pi^h(t), \phi\rangle = \frac{E_{\{\Pi^h(0),Y^h_{0,t}\}}\bigl[\phi(\tilde X^h(t))\,R(\tilde X^h_{0,t}, Y^h_{0,t})\bigr]}{E_{\{\Pi^h(0),Y^h_{0,t}\}}\,R(\tilde X^h_{0,t}, Y^h_{0,t})}. \tag{6.9}$$
The forms (6.6) and (6.9) are equal at the sampling times. Theorem 6.1 continues to hold for the form of filter in (6.9); however, for the sake of brevity, we will only prove the constant interpolation case. Define
$$\Psi^h(\cdot) = \bigl(X^h(\cdot), \Pi^h(\cdot), Y^h(\cdot), B^h(\cdot)\bigr), \qquad \Psi^h_f(\cdot) = \bigl(X^h(\cdot), \Pi^h(\cdot)\bigr),$$
and define $Q^{h,T}$, $\psi(\cdot)$ and $\psi_f(\cdot)$ analogously.

Theorem 6.1. Assume (A2.1), (A2.2), (A2.3), (A4.1), (A4.2), (A6.1), (A6.2), (A6.3). Then all the conclusions of Theorem 4.1 hold with the above modified definitions of $\Pi^h(\cdot)$ (see (6.6)) and $Q^{h,T}(\cdot)$.

Comments on the proof. The proof is similar to those of Theorems 3.1, 3.2 and 4.1. The main differences concern the characterization of the limit $X^\omega(\cdot)$, the proof of (3.3c), the proof that $B^\omega(\cdot)$ is a standard Wiener process which is independent of $X^\omega(\cdot)$, and the proof of the representation of $\Pi^\omega(\cdot)$. The characterization of $X^\omega(\cdot)$ as $X(\cdot)$, with the appropriate initial condition, is done analogously to what was done in Section 4, with only minor notational differences due to the replacement of $B(\cdot)$ by $B^h(\cdot)$, and we omit the details.

Proof of (3.3c). It is difficult to prove (3.3c) directly. We will actually prove the tightness of an approximation to the family $\{\Pi^h(t+\cdot); h > 0, t > 0\}$, and
use the following useful fact. Let the sequence $\{Z_n(\cdot)\}$ have paths in some Skorohod space with metric denoted by $d(\cdot,\cdot)$. Suppose that for each positive real numbers $\epsilon_1, \epsilon_2$ and $T_0$ there is a process $Z_n^{\epsilon_1,\epsilon_2,T_0}(\cdot)$ such that

$$\limsup_n P\Big\{\sup_{s\le T_0} d\big(Z_n(s), Z_n^{\epsilon_1,\epsilon_2,T_0}(s)\big) \ge \epsilon_1\Big\} \le \epsilon_2, \tag{6.10}$$

and $\{Z_n^{\epsilon_1,\epsilon_2,T_0}(\cdot),\, n < \infty\}$ is tight for each $\epsilon_1, \epsilon_2, T_0$. Then $\{Z_n(\cdot)\}$ is tight. The approximations generally simplify the characterization of the weak sense limits as well. Note that in our case the process index is the pair $(h, t)$ instead of $n$.

Pursuing this idea, let $\epsilon_1, \epsilon_2, T_0$ be as above. Then, for each $h$ and $t$, we seek a process $\hat\Pi^h(t, \cdot)$ (dropping the affixes $\epsilon_1, \epsilon_2, T_0$) such that

$$\limsup_h \sup_t P\Big\{\sup_{s\le T_0}\big|\langle \Pi^h(t+s), \phi\rangle - \langle \hat\Pi^h(t,s), \phi\rangle\big| \ge \epsilon_1\Big\} \le \epsilon_2, \tag{6.11}$$
and $\{\hat\Pi^h(t,\cdot);\, h > 0, t \ge 0\}$ is tight. If we can find such processes, then (3.3c) holds.

We now show (6.11). Owing to the definition of the weak topology, we can work with each $\phi(\cdot)$ separately. Keep in mind that we can use a different modification for each $(h, t)$. We first consider the case where $t = n\Delta_h$ for some $n$. Fix $\phi(\cdot)$. Let $\Delta > 0$ be small but fixed, and assume that $h$ is small enough that $\Delta \gg \Delta_h$. The idea is to break the sum

$$\sum_{j=1}^{s/\Delta_h} g'(\tilde X^h(j\Delta_h))\big[Y^h(t+j\Delta_h) - Y^h(t+j\Delta_h-\Delta_h)\big] \tag{6.12}$$
in (6.8) into the sum

$$\sum_{i=0}^{s/\Delta-1} g'(\tilde X^h(i\Delta))\big[Y^h(t+i\Delta+\Delta) - Y^h(t+i\Delta)\big] \tag{6.13}$$
and the “error”

$$V^{h,\Delta}(t,s) = \sum_{i=0}^{s/\Delta-1}\;\sum_{j=i\Delta/\Delta_h+1}^{(i+1)\Delta/\Delta_h \wedge s/\Delta_h} \big[g'(\tilde X^h(j\Delta_h)) - g'(\tilde X^h(i\Delta))\big]\big[Y^h(t+j\Delta_h) - Y^h(t+j\Delta_h-\Delta_h)\big]. \tag{6.14}$$

Note that if we concatenate the sums in (6.14), we get

$$\sum_{i=n}^{n+s/\Delta_h} \mu^{h,\Delta}_i \big[Y^h(i\Delta_h+\Delta_h) - Y^h(i\Delta_h)\big],$$

which, for an obvious definition of $\mu^{h,\Delta}_i$, clearly satisfies (i), (ii) and (iii) in (A6.2), in view of the weak convergence properties of $\tilde X^h(\cdot)$. The case where $t/\Delta_h$ is not
an integer is dealt with in a completely analogous way, and we omit the details.

Now we use (A6.2) for the family $\{\mu^{h,\Delta}_i;\, h, \Delta, i\}$. For an arbitrarily small $\epsilon_3 > 0$, define $\tau^{h,\Delta,t} = \min\{s : |V^{h,\Delta}(t,s)| \ge \epsilon_3\}$. Then (A6.2) yields that, for small $\epsilon > 0$ and small enough $\epsilon_3 > 0$,

$$\lim_{\Delta\to 0}\limsup_h \sup_t P\Big\{P_{\{\Pi^h(0),\, Y^h_{t,t+s}\}}\big\{\tau^{h,\Delta,t} < T_0\big\} \ge \epsilon\Big\} = 0. \tag{6.15}$$

Define $\hat R^{h,\Delta}(\tilde X^h_{0,s}, Y^h_{t,t+s})$ to be $R(\tilde X^h_{0,s}, Y^h_{t,t+s})$ with (6.13) replacing (6.12). Now define $\hat\Pi^h(t,\cdot)$ by

$$\langle \hat\Pi^h(t,s), \phi\rangle = \frac{E_{\{\Pi^h(0),\, Y^h_{0,t}\}}\big[\phi(\tilde X^h(t))\, \hat R^{h,\Delta}(\tilde X^h_{0,s}, Y^h_{0,s})\big]}{E_{\{\Pi^h(0),\, Y^h_{0,t}\}}\big[\hat R^{h,\Delta}(\tilde X^h_{0,s}, Y^h_{0,s})\big]}. \tag{6.16}$$
We have

$$\hat R^{h,\Delta}(\tilde X^h_{0,s}, Y^h_{t,t+s})\, e^{V^{h,\Delta}(t,s)} = R(\tilde X^h_{0,s}, Y^h_{t,t+s}),$$

and, except on an $(h,t)$-dependent set of arbitrarily small measure, $\tau^{h,\Delta,t} \ge T_0$ for small $h$ and $\Delta$. Hence we can suppose that, asymptotically, $|e^{V^{h,\Delta}(t,s)} - 1| \le \epsilon_4$, with $\epsilon_4$ arbitrary. Thus (6.11) holds.

The tightness of the set $\{\hat\Pi^h(t,\cdot);\, h, t\}$ for each $\Delta > 0$ is clear. This follows by noting that the exponential contains only the sum (6.13); hence $\hat\Pi^h(t,\cdot)$ has only simple discontinuities, at the deterministic times $i\Delta$. This proves (3.3c). Thus, in view of the tightness assumptions on the set $\{X^h(t+\cdot), B^h(t+\cdot) - B^h(t);\, h > 0, t\}$, (3.4) holds. We can extract a weakly convergent subsequence of the set in (3.4), indexed again by $(h, t)$, with limit denoted by $Q(\cdot)$ and sample values $Q^\omega(\cdot)$.

The independence of $B^\omega(\cdot)$ and $X^\omega(\cdot)$. We will do only a simple calculation to illustrate the general procedure. Let $f_i(\cdot)$ be bounded and continuous real-valued functions of their arguments, with compact support. We wish to show that, for almost all $\omega$ and any nonnegative $u, v$,

$$\int Q^\omega(d\psi(\cdot))\, f_1(b(v)) f_2(x(u)) = \int Q^\omega(d\psi(\cdot))\, f_1(b(v)) \int Q^\omega(d\psi(\cdot))\, f_2(x(u)).$$

Write the prelimit form of the difference between the two sides:

$$\frac{1}{T}\int_0^T f_1\big(B^h(t+v)-B^h(t)\big) f_2\big(X^h(t+u)\big)\,dt - \frac{1}{T^2}\int_0^T f_1\big(B^h(t+v)-B^h(t)\big)\,dt \int_0^T f_2\big(X^h(s+u)\big)\,ds.$$

The rest of the procedure is to square, take expectations, and use the fact that $B^h(\cdot)$ is independent of $X^h(\cdot)$, together with (A6.3), to show that the difference goes to zero in mean square as $h \to 0$, $T \to \infty$. Define $C = Ef_1(B(v))$. Let us first evaluate the expectation of the square of the first term; namely, evaluate

$$\frac{1}{T^2}\int_0^T\!\!\int_0^T E f_1\big(B^h(t+v)-B^h(t)\big) f_2\big(X^h(t+u)\big) f_1\big(B^h(s+v)-B^h(s)\big) f_2\big(X^h(s+u)\big)\,dt\,ds.$$
Now use (A6.3), the weak convergence of $B^h(t+\cdot)-B^h(t)$, and the independence of $B^h(\cdot)$ and $X^h(\cdot)$ to show that the limits are the same as those of

$$C^2\, \frac{1}{T^2}\int_0^T\!\!\int_0^T E f_2\big(X^h(t+u)\big) f_2\big(X^h(s+u)\big)\,dt\,ds.$$
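The factorization behind this argument can be illustrated numerically. The following sketch is purely illustrative (the AR(1) stand-in for the signal, the functions $f_1, f_2$, and all parameters are hypothetical choices, not from the paper): it simulates independent Wiener increments and a signal process, and checks that the pathwise time average of $f_1(\Delta B)f_2(X)$ is close to the product of the separate time averages over a long interval.

```python
import math
import random

# Illustrative check of the factorization: for independent B and X, the time
# average of f1(B-increment) * f2(X) and the product of the separate time
# averages converge to the same limit.  All model choices here are assumptions
# made only for this demo.

def time_averages(T=200_000, dt=1.0, seed=7):
    rng = random.Random(seed)
    f1 = lambda db: math.cos(db)        # bounded continuous function of the increment
    f2 = lambda x: 1.0 / (1.0 + x * x)  # bounded continuous function of the signal
    x, a = 0.0, 0.9                     # AR(1) signal, independent of the increments
    s_joint = s1 = s2 = 0.0
    for _ in range(T):
        db = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment B(t+dt) - B(t)
        x = a * x + rng.gauss(0.0, 1.0)     # signal step, driven by separate noise
        s_joint += f1(db) * f2(x)
        s1 += f1(db)
        s2 += f2(x)
    return s_joint / T, (s1 / T) * (s2 / T)

joint, product = time_averages()
```

For large `T` the two quantities agree closely; squaring and taking expectations, as in the text, is what turns this pathwise observation into a mean-square convergence statement.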
A similar procedure shows that the limits of the expectation of the square of the second term, and of the product of the first and second terms, are the same. Thus the limit is zero, as desired. The same procedure works if the functions $f_i(\cdot)$ depend on the processes at an arbitrary finite number of times. Hence the independence of $X^\omega(\cdot)$ and $B^\omega(\cdot)$.

The Wiener property of $B^\omega(\cdot)$ for almost all $\omega$. The proof is similar to that of the independence of $X^\omega(\cdot)$ and $B^\omega(\cdot)$. We need to show characteristic function results such as the following. Let $\lambda_q$ be arbitrary vectors (of the dimension of $b$). Show that, for almost all $\omega$,

$$\int Q^\omega(d\psi(\cdot))\, e^{i\lambda_1'(b(u+s)-b(u))}\, e^{i\lambda_2' b(u)} = \int Q^\omega(d\psi(\cdot))\, e^{i\lambda_1'(b(u+s)-b(u))} \int Q^\omega(d\psi(\cdot))\, e^{i\lambda_2' b(u)} = e^{-s|\lambda_1|^2/2}\, e^{-u|\lambda_2|^2/2}.$$
To show this, work with the difference of the prelimit forms of the two sides:

$$\frac{1}{T}\int_0^T e^{i\lambda_1'(B^h(t+u+s)-B^h(t+u))}\, e^{i\lambda_2'(B^h(t+u)-B^h(t))}\,dt - e^{-s|\lambda_1|^2/2}\, e^{-u|\lambda_2|^2/2}.$$

Square, take expectations, and use the weak convergence and (A6.3), analogously to what was done to show the independence property.

Characterization of the process $\Pi^\omega(\cdot)$. Given the tightness of $\{\Pi^h(\cdot)\}$ and the discussion in that section, the fact that (3.10) holds is essentially a property of the asymptotic smoothness of $\tilde X^h(\cdot)$. It can be proved by working with $\hat\Pi^h(t,\cdot)$ in lieu of $\Pi^h(t+\cdot)$ and then letting $\Delta \to 0$; the details are omitted.

Sufficient conditions for (A6.2) to hold. Define $\xi^h_i = B^h(i\Delta_h+\Delta_h) - B^h(i\Delta_h)$. The simplest case satisfying (A6.2) is where the $\xi^h_i$ are martingale differences with variances bounded by some constant times $\Delta_h$; i.e., where the observation noise is white, but not necessarily Gaussian. Correlated noise can be treated by the perturbed test function method [18] of the type introduced in Section 4. Here we construct a particular perturbation which serves our present purposes. Let $E_n$ denote the expectation conditioned on $\{\mu^{h,\Delta}_i, \xi^h_i,\, i \le n\}$. Fix $T > 0$. Define

$$S^{h,\Delta,n}_i = \sum_{j=n}^{n+i} \mu^{h,\Delta}_j \xi^h_j, \quad i \le T/\Delta_h.$$
Assume that

$$\sup_{i,h} E\big|\xi^h_i\big|^2 < \infty. \tag{6.17}$$

Define the “perturbation”

$$\delta S^{h,\Delta,n,T}_i = \sum_{j=n+i}^{n+T/\Delta_h-1} E_{n+i}\,\mu^{h,\Delta}_{j+1} \xi^h_{j+1}, \quad i \le T/\Delta_h. \tag{6.18}$$

Suppose that

$$\sum_{j=n+i}^{n+T/\Delta_h} E\big|E_{n+i}\,\xi^h_{j+1}\big| = O(\Delta_h), \tag{6.19}$$

where $O(\Delta_h)$ is uniform in $i$, and that

$$\limsup_n \sup_h E\Bigg[\sum_{i=n}^{n+T/\Delta_h} \big|\xi^h_i\big|^2 + \sum_{i=n}^{n+T/\Delta_h}\;\sum_{j=i+1}^{n+T/\Delta_h} \big|E\, \xi^h_i \xi^h_j\big|\Bigg] < \infty. \tag{6.20}$$
These conditions are rather mild. They would be satisfied, for example, if $B^h(\cdot)$ were a sufficiently “nonanticipatively smoothed” Wiener process; e.g., $B^h(t) = \int_{-\infty}^t q_h(t-s) B(s)\,ds$ for an appropriate kernel $q_h(\cdot)$ which converges to a delta function as $h \to 0$.

Noting that the index $i$ appears in the upper limit of the sum defining $S^{h,\Delta,n}_i$ and in the lower limit of the sum in (6.18), we have

$$E_{n+i}\big[S^{h,\Delta,n}_{i+1} - S^{h,\Delta,n}_i\big] = E_{n+i}\,\mu^{h,\Delta}_{n+i+1} \xi^h_{n+i+1},$$
$$E_{n+i}\big[\delta S^{h,\Delta,n,T}_{i+1} - \delta S^{h,\Delta,n,T}_i\big] = -E_{n+i}\,\mu^{h,\Delta}_{n+i+1} \xi^h_{n+i+1}.$$

Thus the perturbed function $\tilde S^{h,\Delta,n,T}_i = S^{h,\Delta,n}_i + \delta S^{h,\Delta,n,T}_i$ is a martingale, and martingale inequalities can be used to get (6.3). The desire to obtain this martingale property is the key motivation behind the way the perturbation $\delta S^{h,\Delta,n,T}_i$ was constructed: it was built specifically to effect the desired cancellation. Recall that if $M_i$ is a square integrable martingale, then

$$P\Big\{\sup_{i\le N} |M_i| \ge \epsilon\Big\} \le \frac{E M_N^2}{\epsilon^2}. \tag{6.21}$$

Condition (6.19) and the fact that the $\mu^{h,\Delta}_i$ can be taken to be arbitrarily small imply that, for any $\epsilon > 0$,

$$\lim_{\Delta\to 0}\limsup_h \sup_n P\Big\{\sup_{i\le T/\Delta_h} \big|\delta S^{h,\Delta,n,T}_i\big| \ge \epsilon\Big\} = 0. \tag{6.22}$$
This is a consequence of the facts that the $\mu$-terms are bounded, small in probability, and independent of the $\xi$-terms, and that (6.19) implies

$$\sum_{i\le T/\Delta_h} E\big|\delta S^{h,\Delta,n,T}_i\big| = O(1).$$

Since $\delta S^{h,\Delta,n,T}_{T/\Delta_h} = 0$,

$$E\big|\tilde S^{h,\Delta,n,T}_{T/\Delta_h}\big|^2 = E\big|S^{h,\Delta,n}_{T/\Delta_h}\big|^2.$$
Also, a direct computation using (6.20), (6.21), and the boundedness and smallness in probability of the $\mu$-terms (and their independence of the $\xi$-terms) yields

$$\lim_{\Delta\to 0}\limsup_h \sup_n E\big|S^{h,\Delta,n}_{T/\Delta_h}\big|^2 = 0.$$
The last two equations and (6.21) yield

$$\lim_{\Delta\to 0}\limsup_h \sup_n P\Big\{\sup_{i\le T/\Delta_h} \big|\tilde S^{h,\Delta,n,T}_i\big| \ge \epsilon\Big\} = 0. \tag{6.23}$$
This and (6.22) imply (6.3).
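For discrete observations, the exponential weight form in (6.8) can be realized directly as a Monte Carlo (particle-style) approximation: sample paths $\tilde X$ are drawn from the signal law and weighted by $\exp\{\sum_k g(\tilde X_k)\,\Delta Y_k - (\Delta/2)\sum_k g(\tilde X_k)^2\}$. The sketch below is a minimal scalar illustration under assumed dynamics (an AR(1) signal with $g(x)=x$); none of these modeling choices come from the paper, and the construction here is a stand-in for the approximate filter, not the paper's scheme.

```python
import math
import random

# Minimal sketch of the likelihood-weight form in (6.8), scalar case with
# g(x) = x: each sampled path X~ accumulates the log-weight
#     sum_k [ X~_k * dY_k - (D/2) * X~_k**2 ],
# and the normalized weighted empirical measure is a stand-in for Pi^h.
# The AR(1) signal model and all parameters are hypothetical choices.

def weighted_filter(n_paths=4000, n_steps=20, D=0.1, seed=3):
    rng = random.Random(seed)
    a, sq = 0.9, 0.3
    # Generate a "true" signal path and observation increments dY = g(X) D + dB.
    x_true, dY = 0.0, []
    for _ in range(n_steps):
        x_true = a * x_true + sq * rng.gauss(0.0, 1.0)
        dY.append(x_true * D + rng.gauss(0.0, math.sqrt(D)))
    # Independent sample paths with log-weights accumulated as in (6.8).
    finals, logw = [], []
    for _ in range(n_paths):
        x, lw = 0.0, 0.0
        for k in range(n_steps):
            x = a * x + sq * rng.gauss(0.0, 1.0)
            lw += x * dY[k] - 0.5 * D * x * x
        finals.append(x)
        logw.append(lw)
    m = max(logw)                          # subtract the max before exponentiating
    w = [math.exp(l - m) for l in logw]    # (numerical stability only)
    tot = sum(w)
    mean = sum(wi * xi for wi, xi in zip(w, finals)) / tot
    return mean, x_true

mean, x_true = weighted_filter()
```

The normalization by the total weight is exactly the division by the expectation of $R$ in (6.6) and (6.9); subtracting the maximal log-weight before exponentiating changes nothing after normalization.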
7 Uniqueness of the Invariant Measure of (X(·), Π(·))
In this section we address the question of uniqueness of the invariant measure for the pair $(X(\cdot), \Pi(\cdot))$. It is known from Kunita [12] that if $X(\cdot)$ is a time-homogeneous Feller Markov process with a compact state space which admits a unique invariant measure, and the corresponding stationary flow is purely nondeterministic, then the true optimal filtering process $\Pi(\cdot)$, taken by itself, has a unique invariant measure. The results in [12], however, do not provide any information concerning the uniqueness of the stationary joint distribution of $(X(\cdot), \Pi(\cdot))$. Uniqueness for the joint problem is considerably harder to prove and, to date, the only known result, excepting the linear filtering case, is for a discrete time filtering problem where the signal is a finite state Markov chain (cf. Stettner [28]). We provide in Theorem 7.1 a sufficient condition under which uniqueness for the joint problem holds. The theorem roughly says that if the signal process has a unique invariant measure and the filter “forgets its initial condition,” in the sense made precise below, then the pair $(X(\cdot), \Pi(\cdot))$ has at most one invariant measure. We refer the reader to [1, 2, 6, 7, 10, 24, 25] for some recent results on forgetting of the initial condition. Finally, we discuss some examples in which the sufficient condition can be seen to hold.

Let $(S, \mathcal{S})$ be a Polish space, $\mathcal{S}$ being the Borel sigma-field on $S$. Let $X(\cdot)$ be an $S$-valued Markov process with CADLAG paths, and let $Y(\cdot)$ be given via Equation (2.2), where, as before, $B(\cdot)$ is independent of $X(\cdot)$. As in Section 2, we will consider the process $(X(\cdot), Y(\cdot))$ on the path space $D(S\times \mathbb{R}^m; [0,\infty))$, where
$m$ denotes the dimension of the observation vector. Let $P_\mu$ be the probability measure on the path space of $(X(\cdot), Y(\cdot))$ under which $X(0)$ has the distribution $\mu$. With an abuse of notation, we write $P_x$ when $\mu$ is a point mass concentrated at $x \in S$. If the filter is initialized at $\Pi(0) = \nu$ (whether or not $\nu$ is the distribution of $X(0)$), we denote the filter process by $\Pi^\nu(\cdot)$. Following the notation of Section 2, $\Pi^\nu(\cdot)$ is defined by

$$\langle \Pi^\nu(t), \phi\rangle = \frac{E_{\{\nu,\, Y_{0,t}\}}\big[\phi(\tilde X(t))\, R(\tilde X_{0,t}, Y_{0,t})\big]}{E_{\{\nu,\, Y_{0,t}\}}\big[R(\tilde X_{0,t}, Y_{0,t})\big]},$$

where $\phi(\cdot)$ is a bounded and measurable real-valued function, and $\tilde X(\cdot)$ is as in Section 2. Recall that $\Pi^\nu(0) = \nu$ is the distribution of $\tilde X(0)$.

Definition 7.1. We say that the “filter forgets its initial condition” if, for all $\nu_1, \nu_2 \in \mathcal{M}(S)$, all $x \in S$, and all continuous bounded real-valued functions $\phi(\cdot)$ on $S$, $|\langle \Pi^{\nu_1}(t), \phi\rangle - \langle \Pi^{\nu_2}(t), \phi\rangle|$ converges to zero in $P_x$-probability as $t \to \infty$.

Theorem 7.1. Suppose that $X(\cdot)$ has a unique invariant measure and the filter forgets its initial condition. Then the pair $(X(\cdot), \Pi(\cdot))$ has at most one invariant measure.

Proof. Suppose $\rho_1$ and $\rho_2$ are two invariant measures for the pair $(X(\cdot), \Pi(\cdot))$. We will show that, for all continuous and bounded functions $f$ on $S\times \mathcal{M}(S)$,

$$\int f(x, \alpha)\,\rho_1(dx, d\alpha) = \int f(x, \alpha)\,\rho_2(dx, d\alpha), \tag{7.1}$$

where $(x, \alpha)$ denotes a generic element of $S\times \mathcal{M}(S)$. Let $\mu$ denote the unique invariant measure of $X(\cdot)$, and let $\mu_1, \mu_2$ be regular conditional probability functions such that $\rho_i(dx, d\alpha) = \mu_i(x, d\alpha)\,\mu(dx)$, $i = 1, 2$. The left side of (7.1) equals, for all $t > 0$,

$$\int\!\!\int E_x\big[f(X(t), \Pi^\alpha(t))\big]\,\mu_1(x, d\alpha)\,\mu(dx),$$

while the right side of (7.1) equals, for all $t > 0$,

$$\int\!\!\int E_x\big[f(X(t), \Pi^\beta(t))\big]\,\mu_2(x, d\beta)\,\mu(dx).$$

Thus

$$\Big|\int f(x,\alpha)\,\rho_1(dx,d\alpha) - \int f(x,\alpha)\,\rho_2(dx,d\alpha)\Big| \le \int\!\!\int\!\!\int \Big|E_x\big[f(X(t),\Pi^\alpha(t)) - f(X(t),\Pi^\beta(t))\big]\Big|\,\mu_1(x,d\alpha)\,\mu_2(x,d\beta)\,\mu(dx). \tag{7.2}$$
The assumption on forgetting of the initial condition implies that, for all $x \in S$ and $\alpha, \beta \in \mathcal{M}(S)$, as $t \to \infty$,

$$E_x\big[f(X(t), \Pi^\alpha(t)) - f(X(t), \Pi^\beta(t))\big] \to 0.$$

Now an application of the dominated convergence theorem shows that the right side of (7.2) converges to 0 as $t \to \infty$. This proves (7.1).

Remark. Although we have stated and proved the above theorem for a signal evolving continuously in time with observations taken continuously, the result holds for discrete time models such as in Section 5, where the signal can evolve either continuously or discretely in time but the observations are taken at discrete time instants. [The paper [6] also allowed a point process observation model.] The proof, excepting some minor notational changes, remains unchanged.

We now give some examples, in the following two theorems, where the sufficient condition in the above theorem is seen to hold. The theorems below are fairly straightforward consequences of the results proved in Atar and Zeitouni [2] and Le Gland and Mevel [10], in view of which the details of the proofs are omitted.

Theorem 7.2. Let the signal $X(\cdot)$ (indexed either by a continuous or a discrete time parameter) be a Feller time-homogeneous Markov process taking values in a compact Polish space $S$. Suppose that the observations are in discrete time and given by Equation (5.1). Suppose that there exist a finite measure $\mu$ on $(S, \mathcal{S})$ and an integer $n \ge 1$ such that, for all $x \in S$,

$$c_1\, \mu(d\chi) \le P(d\chi, n \mid x) \le c_2\, \mu(d\chi), \tag{7.3}$$

where $P(d\chi, n \mid x)$ is the $n$-step transition probability function for $X(\cdot)$ and $c_1, c_2$ are finite positive constants. Then the process $\{X(k), \Pi(k),\, k < \infty\}$ has at most one stationary measure.

Remarks on the proof. The proof is based on the properties of Hilbert's projective metric (cf. [4]) on the space of finite measures. It can be shown that, for $x, \nu_1, \nu_2$ as in the beginning of the section, the Hilbert distance between $\Pi^{\nu_1}(m)$ and $\Pi^{\nu_2}(m)$ converges to zero a.s. $P_x$ as $m \to \infty$. This fact is proved for $n = 1$ in [2] and, for $n \ge 1$ in the special case of finite state Markov chains, in [10]. The case of a general compact state space follows by similar arguments. Convergence in the Hilbert metric implies convergence in the total variation norm, which in turn implies that the filter forgets its initial condition in the sense of Definition 7.1. Since $X(\cdot)$ is Feller with a compact state space, it has at least one invariant measure. Finally, if $\mu_1, \mu_2$ are two such invariant measures, we have, using the properties of the Hilbert metric and Equation (7.3), that the Hilbert distance between $\int P(\cdot, n \mid x)\,\mu_1(dx)$ and $\int P(\cdot, n \mid x)\,\mu_2(dx)$ converges to zero as $n \to \infty$. This implies that $\mu_1 = \mu_2$, and therefore Theorem 7.1 can be applied.

Theorem 7.3. Let the signal be given via Equation (2.1) and suppose that it takes values in a compact Riemannian manifold. Suppose that $p(\cdot)$ and $\sigma(\cdot)$ are Lipschitz continuous and that the generator of the Markov process is strictly elliptic. Assume that the observation process is given via Equation (2.2) with $g(\cdot)$ a twice continuously differentiable real-valued function. Then $(X(\cdot), \Pi(\cdot))$ has a unique stationary measure.

Remarks on the proof. Arguments as in Theorems 3.1 and 3.2 show that $(X(\cdot), \Pi(\cdot))$ has at least one invariant measure. Furthermore, upper and lower bounds on the transition probability kernel of a Markov diffusion process on a compact manifold (cf. [2]) yield that if $\mu_1$ and $\mu_2$ are two invariant measures for $X(\cdot)$, then the Hilbert distance between $\int P(\cdot, t \mid x)\,\mu_1(dx)$ and $\int P(\cdot, t \mid x)\,\mu_2(dx)$ converges to 0 as $t \to \infty$, where $P(\cdot, t \mid x)$ is the transition probability kernel of $X(\cdot)$. This proves the uniqueness of the invariant measure for $X(\cdot)$. Let $h(\cdot,\cdot)$ denote the Hilbert distance between two measures. Next, the results in [2] show that, for the additive white-noise-corrupted observation case, where $x, \nu_1, \nu_2$ are as in the beginning of this section and satisfy $h(\nu_1, \nu_2) < \infty$, we have $h(\Pi^{\nu_1}(t), \Pi^{\nu_2}(t)) \to 0$ a.s. $P_x$ as $t \to \infty$. The constraint $h(\nu_1, \nu_2) < \infty$ can be removed using techniques similar to those of Theorem 3.2 of [7], on first proving that the $P_x$-probability of the event $\{h(\Pi^{\nu_1}(t), \Pi^{\nu_2}(t)) < \infty$ for some $t\}$ is one.
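The contraction mechanism behind these theorems can be seen in a small finite-state computation. The sketch below is an illustration only (the 3×3 transition matrix and the initial laws are arbitrary choices): it computes the Hilbert projective distance $h(p,q) = \log \max_{i,j} (p_i q_j)/(p_j q_i)$ between two distributions and checks that pushing both through a strictly positive kernel, which satisfies a bound of the form $c_1\mu \le P(\cdot\mid x) \le c_2\mu$, contracts that distance at every step, as Birkhoff's theorem guarantees.

```python
import math

# Finite-state stand-in for the mechanism of Theorems 7.2/7.3: a strictly
# positive transition matrix contracts the Hilbert projective metric.
# Matrix and initial laws are arbitrary demo choices.

def hilbert(p, q):
    # Hilbert projective distance between strictly positive vectors p, q.
    n = len(p)
    return math.log(max(p[i] * q[j] / (p[j] * q[i])
                        for i in range(n) for j in range(n)))

def step(p, P):
    # One step of the kernel: row vector p -> p P, renormalized to a law.
    out = [sum(p[i] * P[i][j] for i in range(len(p))) for j in range(len(P[0]))]
    s = sum(out)
    return [x / s for x in out]

P = [[0.7, 0.2, 0.1],
     [0.3, 0.4, 0.3],
     [0.1, 0.3, 0.6]]          # strictly positive, so Birkhoff contraction applies
p = [0.98, 0.01, 0.01]         # two very different initial laws
q = [0.01, 0.01, 0.98]
d = [hilbert(p, q)]
for _ in range(5):
    p, q = step(p, P), step(q, P)
    d.append(hilbert(p, q))
# d is strictly decreasing: the two laws merge in the projective metric.
```

Since convergence in the Hilbert metric dominates convergence in total variation, this is the finite-state shadow of the filter forgetting its initial condition in the sense of Definition 7.1.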
References

[1] R. Atar. Exponential stability for nonlinear filtering of diffusion processes in non-compact domain. To appear in Annals of Probability.

[2] R. Atar and O. Zeitouni. Exponential stability for nonlinear filtering. Ann. Inst. Henri Poincaré, 33:697–725, 1997.

[3] P. Billingsley. Convergence of Probability Measures. John Wiley, New York, 1968.

[4] G. Birkhoff. Lattice Theory. Am. Math. Soc. Colloq. Publ. 25, 3rd ed., Providence, 1967.

[5] G. Blankenship and G.C. Papanicolaou. Stability and control of systems with wide band noise disturbances. SIAM J. Appl. Math., 34:437–476, 1978.

[6] A. Budhiraja and H.J. Kushner. Robustness of nonlinear filters over the infinite time interval. Lefschetz Center for Dynamical Systems report. To appear in SIAM J. on Control and Optimization.

[7] A. Budhiraja and D. Ocone. Exponential stability of discrete time filters without signal ergodicity. Systems and Control Letters, 30:185–193, 1997.

[8] R.J. Elliott. Stochastic Calculus and Applications. Springer-Verlag, Berlin and New York, 1982.
[9] S.N. Ethier and T.G. Kurtz. Markov Processes: Characterization and Convergence. Wiley, New York, 1986.

[10] F. Le Gland and L. Mevel. Geometric ergodicity in hidden Markov models. To be published, 1996.

[11] M. Fujisaki, G. Kallianpur, and H. Kunita. Stochastic differential equations for the nonlinear filtering problem. Osaka Math. J., 9:19–40, 1972.

[12] H. Kunita. Asymptotic behavior of the nonlinear filtering errors of a Markov process. J. Multivariate Anal., 1:365–393, 1971.

[13] T.G. Kurtz. Semigroups of conditional shifts and approximation of Markov processes. Ann. Probab., 4:618–642, 1975.

[14] T.G. Kurtz. Approximation of Population Processes, volume 36 of CBMS-NSF Regional Conf. Series in Appl. Math. SIAM, Philadelphia, 1981.

[15] H.J. Kushner. Dynamical equations for nonlinear filtering. J. Differential Equations, 3:179–190, 1967.

[16] H.J. Kushner. Probability Methods for Approximations in Stochastic Control and for Elliptic Equations. Academic Press, New York, 1977.

[17] H.J. Kushner. Jump-diffusion approximations for ordinary differential equations with wideband random right hand sides. SIAM J. Control Optim., 17:729–744, 1979.

[18] H.J. Kushner. Approximation and Weak Convergence Methods for Random Processes with Applications to Stochastic Systems Theory. MIT Press, Cambridge, MA, 1984.

[19] H.J. Kushner. Weak Convergence Methods and Singularly Perturbed Stochastic Control and Filtering Problems, volume 3 of Systems and Control. Birkhäuser, Boston, 1990.

[20] H.J. Kushner and P. Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time. Springer-Verlag, Berlin and New York, 1992.

[21] H.J. Kushner and H. Huang. Approximation and limit results for nonlinear filters with wide bandwidth observation noise. Stochastics Stochastics Rep., 16:65–96, 1986.

[22] H.J. Kushner and L.F. Martins. Limit theorems for pathwise average cost per unit time problems for queues in heavy traffic. Stochastics Stochastics Rep., 42:25–51, 1993.

[23] R.S. Liptser and A.N. Shiryaev. Statistics of Random Processes. Springer-Verlag, Berlin and New York, 1977.
[24] D. Ocone. Asymptotic stability of Beneš filters. Preprint.

[25] D. Ocone and E. Pardoux. Asymptotic stability of the optimal filter with respect to its initial condition. SIAM J. Control Optim., 34:226–243, 1996.

[26] G.C. Papanicolaou, D. Stroock, and S.R.S. Varadhan. Martingale approach to some limit theorems. In Proceedings of the 1976 Duke University Conference on Turbulence, 1976.

[27] R. Rishel. Necessary and sufficient dynamic programming conditions for continuous time stochastic optimal control. SIAM J. Control, 8:559–571, 1970.

[28] L. Stettner. On invariant measures of filtering processes. In K. Helmes, N. Christopeit, and M. Kohlmann, editors, Stochastic Differential Systems, Proc. 4th Bad Honnef Conf., Lecture Notes in Control and Inform. Sci., pages 279–292. Springer, Berlin and New York, 1989.