Stochastic Sampling Algorithms for State Estimation of Jump Markov Linear Systems

Arnaud Doucet
Signal Processing Group, Department of Engineering, University of Cambridge
Trumpington Street, CB2 1PZ Cambridge, UK
[email protected]

Andrew Logothetis
Automatic Control, Department of Signals, Sensors and Systems
Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden
[email protected]

Vikram Krishnamurthy
CRC for Sensor Signal and Information Processing
Department of Electrical and Electronic Engineering, University of Melbourne
Parkville, Victoria 3052, Australia
[email protected]

IEEE Transactions on Automatic Control, January 16, 2000

Abstract

Jump Markov linear systems are linear systems whose parameters evolve with time according to a finite state Markov chain. Given a set of observations, our aim is to estimate the states of the finite state Markov chain and the continuous (in space) states of the linear system. The computational cost of computing conditional mean or maximum a posteriori (MAP) state estimates of the Markov chain or of the state of the jump Markov linear system grows exponentially in the number of observations. In this paper, we present three globally convergent algorithms based on stochastic sampling methods for state estimation of jump Markov linear systems; the cost per iteration is linear in the data length. The first proposed algorithm is a data augmentation (DA) scheme that yields conditional mean state estimates. The second is a stochastic annealing (SA) version of DA that computes the joint MAP sequence estimate of the finite and continuous states. Finally, a Metropolis-Hastings DA scheme based on SA, designed to yield the MAP estimate of the finite state Markov chain, is proposed. Convergence results for the three algorithms are obtained. Computer simulations are carried out to evaluate their performance on two problems: estimating a sparse signal arising from a neutron sensor based on a set of noisy data, and narrowband interference suppression in Spread Spectrum CDMA (Code Division Multiple Access) systems.

1 Introduction

A discrete time jump Markov linear system can be modeled as:

$$x_{t+1} = A(r_{t+1})\, x_t + B(r_{t+1})\, v_{t+1} + F(r_{t+1})\, u_{t+1} \qquad (1)$$

$$y_t = C(r_t)\, x_t + D(r_t)\, w_t + G(r_t)\, u_t \qquad (2)$$

where $t = 1, 2, \ldots$ denotes discrete time and $r_t$ denotes the unknown realization of a finite state Markov chain with states in $\{1, 2, \ldots, s\}$. $x_t$ denotes the unknown state of the jump linear system, $v_t$ and $w_t$ are uncorrelated Gaussian white noise sequences, $y_t$ is the observation at time $t$, and $u_t$ is a known exogenous input sequence. $A(r_t)$, $B(r_t)$, $C(r_t)$, $D(r_t)$, $F(r_t)$ and $G(r_t)$ are time-varying matrices that evolve according to the realization of the finite state Markov chain $r_t$. For notational convenience, let $y \triangleq (y_1, \ldots, y_T)$, $x \triangleq (x_0, \ldots, x_T)$ and $r \triangleq (r_1, \ldots, r_T)$ denote the sequence of measurements, the states of the jump Markov linear system and the states of the finite state Markov chain, respectively.

Jump Markov linear systems appear in several fields of electrical engineering (see [11] and references therein), including control (e.g. hybrid systems, target tracking), signal processing (e.g. blind channel equalization), communications (e.g. interference suppression in mobile telephony) and other areas such as econometrics and biometrics. In these fields, given the $T$ observations $y_1, \ldots, y_T$ generated by the signal model (1), (2), and assuming that the parameters $(A(1), \ldots, A(s), B(1), \ldots, B(s), C(1), \ldots, C(s), D(1), \ldots, D(s), F(1), \ldots, F(s), G(1), \ldots, G(s))$ are known, it is of interest to compute the following state estimates:

• Conditional mean state estimates of $r_t$ and $x_t$, namely $E\{r_t \mid y\}$ and $E\{x_t \mid y\}$, for $t = 1, 2, \ldots, T$.

• Maximum a posteriori (MAP) sequence estimates of $x$ and $r$, defined as $(\hat{x}, \hat{r}) = \arg\max_{x, r} f(x, r \mid y)$ and $\hat{r} = \arg\max_{r} f(r \mid y)$, where $f(\cdot \mid \cdot)$ denotes the conditional probability density (or mass) function.

Unfortunately, it is well known that exact computation of these estimates involves a prohibitive computational cost of order $s^T$, where $T$ denotes the number of measurements and $s^T$ is the number of possible realizations of the finite Markov chain [24]. Thus, it is necessary in practice to consider

sub-optimal estimation algorithms. A variety of such suboptimal algorithms have been proposed; see for example [7, 23, 24]. In particular, [24] presents a truncated (approximate) maximum likelihood procedure for parameter estimation and a truncated approximation of the conditional mean state estimates, computed using a bank of Kalman filters.

This paper presents three stochastic iterative algorithms for computing conditional mean state estimates and maximum a posteriori (MAP) estimates of the Markov state $r_t$ and the state $x_t$ of the jump Markov linear system in (1) and (2). These algorithms are based on the Data Augmentation (DA) algorithm (proposed by Tanner and Wong [20, 21]) and two newly proposed hybrid DA/Stochastic Annealing (SA) algorithms. The algorithms have a computational cost of $O(T)$ per iteration.

While the DA algorithm proposed in [20, 21] is well known in the statistical literature, it is rarely mentioned or applied in the engineering literature. The DA algorithm is one of the simplest Markov chain Monte Carlo (MCMC) algorithms [20, 21]. MCMC methods are powerful stochastic algorithms used to sample from complex multivariate probability distributions. They are well known in image processing, where they were introduced by Geman and Geman in 1984 [6] to simulate from the Gibbs distribution of a Markov random field; their introduction to applied statistics in the early 1990's has revolutionized that field. The key idea behind the MCMC methodology is to sample from a complex "target" probability distribution by simulating a Markov chain that admits the target distribution as its invariant (or stationary) distribution. In this paper we show how the DA algorithm can be used to compute the conditional mean estimates $E[x_t \mid y_1, \ldots, y_T]$ and $E[r_t \mid y_1, \ldots, y_T]$ for jump Markov linear systems. Then we propose two new

hybrid SA/DA algorithms that yield MAP sequence estimates. In recent work [11], we used the Expectation Maximization (EM) algorithm to iteratively compute MAP sequence estimates for jump Markov linear systems. While these EM algorithms in some cases perform remarkably well, convergence to a local stationary point (maximum, minimum or saddle point) is a major drawback. A significant advantage of the SA methods proposed in this paper is that they are globally convergent.

Main Results: We now list the main results and organization of this paper:

• Section 2 formally presents the signal model and estimation objectives.

• In Section 3, we use the Data Augmentation algorithm proposed in [20], together with the Law of Large Numbers [19], to compute conditional mean state estimates of the finite state Markov chain and of the state of the jump Markov linear system. In Theorem 3.2 we prove the convergence of DA applied to the state space model (1) and (2).

• In Section 4, a new Simulated Annealing algorithm is derived based on the Data Augmentation algorithm. The algorithm yields joint MAP state sequence estimates of the finite state Markov chain and of the state of the jump Markov linear system. Sufficient conditions for the convergence of the proposed algorithm to the set of global maxima are presented in Theorem 4.2.

• In Section 5, a Metropolis-Hastings Simulated Annealing algorithm, based on Data Augmentation, which yields the MAP state sequence estimate of the Markov chain, is derived. Sufficient conditions for the convergence of the proposed algorithm to the set of global maxima are presented in Theorem 5.2.

• In Section 6, the proposed algorithms are applied to two practical examples: (i) estimation of sparse signals and (ii) narrowband interference suppression in CDMA spread spectrum mobile communication systems. As detailed in Sec. 6, several recent papers in the communications and adaptive signal processing literature have studied these problems and developed recursive (on-line) state and parameter estimation algorithms. Our objective is to illustrate the use of stochastic sampling algorithms in these applications.

2 Problem Formulation

In this section we present the signal model and outline the estimation objectives.

2.1 Signal Model

Let $t \in \{1, 2, \ldots\}$ denote discrete time. Let $r_t$ denote a discrete-time, time-homogeneous, $s$-state, first-order Markov chain with transition probabilities

$$p_{ij} \triangleq \Pr\{r_{t+1} = j \mid r_t = i\}, \quad i, j \in S;\quad S = \{1, 2, \ldots, s\} \qquad (3)$$

Denote the initial probability distribution as $p_i = \Pr\{r_1 = i\}$ for $i \in S$, with $p_i \ge 0$ for all $i \in S$ and $\sum_{i=1}^{s} p_i = 1$. The transition probability matrix $[p_{ij}]$ is an $s \times s$ matrix with elements satisfying $p_{ij} \ge 0$ and $\sum_{j=1}^{s} p_{ij} = 1$ for each $i \in S$.

There are $s^T$ possible realizations of the Markov chain $r$. The set of all realizations of $r$ is denoted $\{r(1), r(2), \ldots, r(s^T)\}$. The realizations $r(l)$, for $l = 1, \ldots, s^T$, are ordered arbitrarily to simplify notation.

Let $\mathcal{R}$ denote the set of realizations of the finite Markov chain $r_t$ of non-null prior probability, that is

$$\mathcal{R} = \left\{ r(l) = (r_1(l), \ldots, r_T(l));\ l \in \{1, \ldots, s^T\} \ \text{such that}\ p_{r_1(l)} \prod_{t=1}^{T-1} p_{r_t(l)\, r_{t+1}(l)} > 0 \right\} \qquad (4)$$

Notation: We use $n_x$ to denote the dimension of an arbitrary vector $x$. We write $\int \varphi(r)\, dr$ instead of $\sum_{r(l') \in S^T} \varphi(r(l'))$ for $\varphi : S^T \to \mathbb{R}$, where $S^T = S \times S \times \cdots \times S$. We do not distinguish between random variables and their realizations. $f(\cdot \mid \cdot)$ is used to represent distributions of both discrete and continuous random variables. Plain superscripts denote exponentiation, e.g. $f^{1/T(k)}$; a subscript, e.g. $f_{1/T(k)}$, denotes the corresponding renormalized (tempered) density, as defined in Eqs. (23) and (24); superscripts in parentheses, e.g. $x^{(k)}$, denote the value at iteration $k$ of the various iterative algorithms.



Consider the jump Markov linear system of Eqs. (1) and (2), where $x_t \in \mathbb{R}^{n_x}$ is the system state, $y_t \in \mathbb{R}^{n_y}$ is the observation at time $t$, $u_t \in \mathbb{R}^{n_u}$ is a known deterministic input, $v_t \in \mathbb{R}^{n_v}$ is a zero-mean white Gaussian noise sequence with identity covariance $I_{n_v}$ and $B(i)B'(i) > 0$ for all $i \in S$ (this assumption is not restrictive: in numerous applications where $B(i)B'(i)$ is singular, the jump Markov linear system can be transformed to a new system whose noise covariance matrix is positive definite; see Sec. 3.9 of [9] or [5] for details), and $w_t \in \mathbb{R}^{n_w}$ is a zero-mean white Gaussian noise sequence with identity covariance $I_{n_w}$ and $D(i)D'(i) > 0$ for all $i \in S$. The matrices $A, B, C, D, F, G$ are functions of the Markov chain state $r_t$, i.e. they take values in $\{A(i), B(i), C(i), D(i), F(i), G(i);\ i \in S\}$ and evolve according to the realization of the finite state Markov chain $r_t$. We assume that $x_0 \sim \mathcal{N}(\hat{x}_0, P_0)$, and let $x_0$, $v_t$ and $w_t$ be mutually independent for all $t$.

Assumption 2.1 The model parameters $\lambda$ are assumed known, where

$$\lambda \triangleq \{p_i, p_{ij}, A(i), B(i), C(i), D(i), F(i), G(i), \hat{x}_0, P_0;\ i, j \in S\} \qquad (5)$$

Remark 2.1 The assumption that $x_0$ is normally distributed is easily relaxed to finite Gaussian mixture distributions. Since any regular distribution can be approximated arbitrarily closely by a finite mixture of Gaussians, this assumption is not really restrictive. The procedure for dealing with a finite-mixture Gaussian $x_0$ is as follows. Suppose $x_0 \sim \sum_{i=1}^{n} \pi_i \mathcal{N}(m_i, \Sigma_i)$, where $\sum_{i=1}^{n} \pi_i = 1$ and $\pi_i \ge 0$. Introduce the indicator variable $I_0 \in \{1, \ldots, n\}$ such that $x_0 \mid (I_0 = i) \sim \mathcal{N}(m_i, \Sigma_i)$. Then, conditioned upon $(I_0, r_1, \ldots, r_T)$, the system (1) and (2) is linear Gaussian, so that we can use the algorithms presented below.

Remark 2.2 In many applications the model parameters are known. For example, in spread spectrum communications systems (see Sec. 6.2), the $p_{ij}$ are known a priori since they are functions of the spreading code. In cases where the model parameters are not known, several algorithms for estimating them are available; see for example [24]. Such models are also considered under a Dynamic Linear Model framework in [25]. It is also possible, in a Bayesian framework, to use the algorithms presented in this paper to jointly compute state and parameter estimates. An important issue beyond the scope of this paper is the identifiability of the model parameters of a jump Markov linear system.
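To make the signal model concrete, the following sketch simulates one trajectory from (1) and (2). It is an illustration only: the two-regime scalar parameters at the bottom are assumed toy values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_jmls(T, pi0, P, A, B, C, D, F, G, u, x0_mean, P0):
    """Draw (r_1..r_T, x_0..x_T, y_1..y_T) from Eqs. (1)-(2).
    A, B, C, D, F, G are lists indexed by the regime i = 0, ..., s-1."""
    s, nx = len(pi0), len(x0_mean)
    r = np.zeros(T, dtype=int)            # r[t-1] holds r_t
    x = np.zeros((T + 1, nx))             # x[t] holds x_t
    y = np.zeros(T)                       # y[t-1] holds y_t (scalar observations)
    x[0] = rng.multivariate_normal(x0_mean, P0)
    for t in range(1, T + 1):
        r[t-1] = rng.choice(s, p=pi0 if t == 1 else P[r[t-2]])
        i = r[t-1]
        v = rng.standard_normal(B[i].shape[1])
        x[t] = A[i] @ x[t-1] + B[i] @ v + F[i] * u[t-1]    # Eq. (1), regime r_t
        w = rng.standard_normal()
        y[t-1] = C[i] @ x[t] + D[i] * w + G[i] * u[t-1]    # Eq. (2), regime r_t
    return r, x, y

# Assumed toy two-regime scalar system, for illustration only
A = [np.array([[0.9]]), np.array([[0.5]])]
B = [np.array([[1.0]]), np.array([[0.3]])]
C = [np.array([1.0])] * 2
D = [0.5] * 2
F = [np.zeros(1)] * 2
G = [0.0] * 2
P = np.array([[0.95, 0.05], [0.10, 0.90]])
pi0 = np.array([0.5, 0.5])
r, x, y = simulate_jmls(200, pi0, P, A, B, C, D, F, G,
                        u=np.zeros(200), x0_mean=np.zeros(1), P0=np.eye(1))
```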

2.2 Estimation objectives

Given the observations $y$, and assuming that the model parameters $\lambda$ are exactly known, all Bayesian inference for jump Markov linear systems relies on the joint posterior distribution $f(r(l), x \mid y)$. In this paper, the following three Bayesian estimation problems are considered:

1. Conditional mean estimates of $x$ and $r$: compute optimal (in a mean square sense) estimates of $x$ and $r$, given by $E\{r \mid y\}$ and $E\{x \mid y\}$.

2. Maximum a posteriori estimate of $x$ and $r$: compute optimal (in a MAP sense) state estimates of $x$ and $r$ by maximizing the joint posterior distribution, i.e.

$$(\hat{x}, \hat{r}) = \arg\max_{x,\, r(l)} f(x, r(l) \mid y) \qquad (6)$$

3. Maximum a posteriori estimate of $r$: compute the optimal (in a MAP sense) state estimate of $r$ by maximizing the marginal posterior distribution, i.e.

$$\hat{r} = \arg\max_{r(l)} f(r(l) \mid y) \qquad (7)$$

3 Conditional Mean Estimation

The aim of this section is to compute the conditional mean estimates $E\{r \mid y\}$ and $E\{x \mid y\}$ for the signal model (1), (2). These estimates are theoretically obtained by integration with respect to the joint posterior distribution $f(r(l), x \mid y)$. If we were able to obtain $N$ (for large $N$) iid (independent and identically distributed) samples $\{(r^{(k)}, x^{(k)});\ k = 1, \ldots, N\}$ distributed according to $f(r(l), x \mid y)$, then, using the Law of Large Numbers [19], conditional mean estimates could be computed by averaging, thus solving the state estimation problem. Unfortunately, obtaining such iid samples from the posterior distribution $f(r, x \mid y)$ is not straightforward, and alternative sampling schemes must be investigated.

3.1 Data Augmentation

In this paper, we compute samples from the posterior distribution $f(r, x \mid y)$ using Markov chain Monte Carlo (MCMC) methods; in particular, we use the Data Augmentation (DA) algorithm introduced by Tanner and Wong [20, 21]. The samples are then used to compute conditional mean estimates of the states $r$ and $x$. The proposed conditional mean state estimator via the data augmentation algorithm is summarized in Fig. 1.

Data Augmentation Algorithm

1. Initialization: Select $(r^{(0)}, x^{(0)})$ randomly.

2. Iteration: Given $(r^{(k-1)}, x^{(k-1)})$, compute $(r^{(k)}, x^{(k)})$ for the $k$th iteration (for $k = 1, 2, \ldots$) as follows:

• Simulate $x^{(k)} \sim f(x \mid y, r^{(k-1)})$   (8)

• Simulate $r^{(k)} \sim f(r \mid y, x^{(k)})$   (9)

3. State Estimation: Compute conditional mean estimates using the empirical estimators (Eqs. (14) and (15)) or the mixture estimators (Eqs. (16) and (17)).

Figure 1: Algorithm I: Conditional mean state estimator via the data augmentation algorithm.

Remark 3.1 Theoretically speaking, the DA algorithm does not have a stopping criterion. However, a reasonable choice (see for example [4]) is to terminate the algorithm when $\|r(N) - r(N-1)\|^2$ is less than some specified tolerance.

Sampling Schemes: The Data Augmentation algorithm presented in Fig. 1 requires samples from $f(x \mid y, r^{(k-1)})$ and $f(r \mid y, x^{(k)})$. One possible scheme is the efficient forward filtering-backward sampling recursions introduced by Carter and Kohn [2] and independently by Frühwirth-Schnatter [5]. These recursions are given in the Appendix. An alternative scheme, not investigated here, for sampling from posterior densities of Gaussian state space systems is the simulation smoother of De Jong and Shephard [3].
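As a minimal sketch of how Algorithm I is typically organized in code, the loop below alternates the two conditional draws (8)-(9) and accumulates ergodic averages. The names `sample_x_given_r` and `sample_r_given_x` are hypothetical helpers standing for the forward filtering-backward sampling recursions of Appendix A.

```python
import numpy as np

def data_augmentation(y, model, n_iter, n_burn, rng):
    """Sketch of Algorithm I (Fig. 1). The two samplers are assumed provided
    (hypothetical names); they implement the recursions of Appendix A."""
    T = len(y)
    r = rng.integers(0, model.s, size=T)          # step 1: random initialization
    x_sum, r_freq = 0.0, np.zeros((T, model.s))
    for k in range(n_iter):
        x = sample_x_given_r(y, r, model, rng)    # Eq. (8): x^(k) ~ f(x | y, r^(k-1))
        r = sample_r_given_x(y, x, model, rng)    # Eq. (9): r^(k) ~ f(r | y, x^(k))
        if k >= n_burn:                           # discard the burn-in iterations
            x_sum = x_sum + x
            r_freq[np.arange(T), r] += 1.0
    n_kept = n_iter - n_burn
    # Empirical estimators (14)-(15): plain ergodic averages of the kept draws
    return x_sum / n_kept, r_freq / n_kept
```

The per-state frequencies returned for $r$ approximate the posterior marginals, from which $E\{r_t \mid y\}$ follows directly.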

3.2 Convergence of data augmentation

The Data Augmentation algorithm described in Sec. 3.1 has been used in [2] for identification of linear state space models with errors that are a mixture of normals and coefficients that can switch with time. We generalize the model in [2] to our jump Markov linear system given by (1) and (2). Our main contribution here is to prove that sampling via the DA algorithm converges uniformly geometrically fast to yield samples that mimic samples from the desired (or target) distribution. As a consequence, a law of large numbers holds, which states that sample averaging will converge almost surely (a.s.) to the expected value.

By construction (see Eqs. (8) and (9)), the process $\{(r^{(k)}, x^{(k)});\ k \in \mathbb{N}\}$ is a homogeneous Markov chain with transition kernel given by

$$K\big((r^{(k-1)}, x^{(k-1)}) = (r(l), x);\ (r^{(k)}, x^{(k)}) = (r(m), x')\big) = f(x' \mid y, r(l))\, f(r(m) \mid y, x') \qquad (10)$$

The following theorem (proved in Appendix B.1) and corollary are the main results of this section:

Theorem 3.2 (Uniform Geometric Ergodicity) There exists a constant $0 \le \rho < 1$, $\rho \triangleq 1 - \sum_{r(m) \in \mathcal{R}} \min_{r(l) \in \mathcal{R}} K(r(l); r(m))$, such that for any initial distribution $p_0(r, x)$ of $(r^{(0)}, x^{(0)})$, the state distribution of the Markov chain $(r^{(k)}, x^{(k)})$ at iteration $k$, denoted $p_k(r, x)$, satisfies

$$\int\!\!\int |p_k(r, x) - f(r, x \mid y)|\, dr\, dx \le 2\rho^{k-2} \qquad (11)$$

Thus, the stochastic process $\{(r^{(k)}, x^{(k)});\ k \in \mathbb{N}\}$ obtained from the Data Augmentation sampling scheme is a Markov chain with invariant distribution $f(r, x \mid y)$ and is ergodic. Ergodicity implies convergence of ergodic (sample) averages ([22], Th. 3, p. 1717). Uniform ergodicity implies that a law of large numbers and a central limit theorem also hold ([22], Th. 5, p. 1717; [17, 4]).

Corollary 3.2 (Convergence of Ergodic Averages) For every real-valued function $\varphi : S^T \times \mathbb{R}^{(T+1)n_x} \to \mathbb{R}$, consider the time average of the first $N$ outputs of the Markov chain, $\bar{\varphi}_N \triangleq \frac{1}{N}\sum_{k=0}^{N-1} \varphi(r^{(k)}, x^{(k)})$. If $E_{f(r,x \mid y)}[|\varphi(r, x)|] < \infty$, then for any initial distribution,

$$\bar{\varphi}_N \to E_{f(r,x \mid y)}[\varphi(r, x)] \quad \text{a.s.} \qquad (12)$$

If $E_{f(r,x \mid y)}\big[|\varphi(r, x)|^2\big] < +\infty$, then there exists a constant $\sigma(\varphi)$ such that

$$\sqrt{N}\left(\bar{\varphi}_N - E_{f(r,x \mid y)}[\varphi(r, x)]\right) \qquad (13)$$

converges in distribution to a zero-mean normal distribution of variance $\sigma^2(\varphi)$.

Remark 3.2 Theorem 3.2 states that the samples $(r^{(k)}, x^{(k)};\ k \in \mathbb{N})$ generated for large $k$ via the DA algorithm will mimic samples from the posterior distribution $f(r, x \mid y)$.

Remark 3.3 $2\rho^{k-2}$ in (11) of Theorem 3.2 is an upper bound on the rate of convergence of the Data Augmentation algorithm. Since $\rho$ is not known, one cannot a priori choose the number of iterations $k$ required for the algorithm to converge to a desired level of accuracy.

Computing Conditional Mean Estimates: Corollary 3.2 can straightforwardly be applied to compute the conditional mean estimates of $r$ and $x$ by the ergodic averages $\bar{r}(N)$ and $\bar{x}(N)$:

$$\bar{r}(N) \triangleq \frac{1}{N} \sum_{k=0}^{N-1} r^{(k)} \qquad (14)$$

$$\bar{x}(N) \triangleq \frac{1}{N} \sum_{k=0}^{N-1} x^{(k)} \qquad (15)$$

where $\bar{r}$ and $\bar{x}$ are known as the empirical estimators (see [10]). (The averages $\bar{r}(N)$ and $\hat{r}(N)$, defined in (14) and (16), denote averages over $N$ iterations; they are not to be confused with the realizations $r(\cdot)$ defined in (4).) Conditional mean estimates may also be computed via the mixture estimators (see [10]), i.e.

$$\hat{r}(N) \triangleq \frac{1}{N} \sum_{k=0}^{N-1} E\big[r \mid y, x^{(k)}\big] \qquad (16)$$

$$\hat{x}(N) \triangleq \frac{1}{N} \sum_{k=0}^{N-1} E\big[x \mid y, r^{(k)}\big] \qquad (17)$$

Both the empirical estimator and the mixture estimator converge almost surely to the true conditional mean estimates. The next important question is which of the two estimators yields the smaller asymptotic variance. The following proposition shows that it is the mixture estimator.

Proposition 3.1 When the Markov chain $(r^{(k)}, x^{(k)};\ k \in \mathbb{N})$ is in its stationary regime, the estimates $\frac{1}{N}\sum_{k=0}^{N-1} E[r \mid y, x^{(k)}]$ and $\frac{1}{N}\sum_{k=0}^{N-1} E[x \mid y, r^{(k)}]$ converge a.s. to $E[r \mid y]$ and $E[x \mid y]$, respectively, and satisfy:

$$\mathrm{var}\left[\frac{1}{N}\sum_{k=0}^{N-1} E\big[r \mid y, x^{(k)}\big]\right] \le \mathrm{var}\left[\frac{1}{N}\sum_{k=0}^{N-1} r^{(k)}\right] \qquad (18)$$

$$\mathrm{var}\left[\frac{1}{N}\sum_{k=0}^{N-1} E\big[x \mid y, r^{(k)}\big]\right] \le \mathrm{var}\left[\frac{1}{N}\sum_{k=0}^{N-1} x^{(k)}\right] \qquad (19)$$

Proof: The proof can be found in [10], Th. 4.1.

In our case, $E[x \mid y, r^{(k)}]$ and $E[r \mid y, x^{(k)}]$ can be easily computed using, respectively, a Kalman smoother [1] and the forward-backward recursions of a hidden Markov model smoother [16].
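In code, the mixture estimator (17) simply replaces each raw draw $x^{(k)}$ by its Rao-Blackwellized counterpart $E[x \mid y, r^{(k)}]$. The sketch below assumes a hypothetical `kalman_smoother_mean` helper returning the smoothed mean for a fixed regime sequence.

```python
import numpy as np

def mixture_estimate_x(y, r_draws, model):
    """Mixture estimator (17): average E[x | y, r^(k)] over the sampled regime
    sequences instead of averaging the raw draws x^(k) as in (15)."""
    means = [kalman_smoother_mean(y, r, model) for r in r_draws]  # assumed helper
    return np.mean(means, axis=0)   # lower asymptotic variance, cf. Eq. (19)
```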

4 Maximum a Posteriori State Sequence Estimation of x and r

4.1 Simulated Annealing Data Augmentation Scheme

Simulated annealing (SA) is a stochastic numerical technique for solving combinatorial optimization problems [13]; it converges to globally optimal solutions by randomly generating a sequence of candidate solutions. In this section, we introduce an algorithm for obtaining optimal, in the MAP sense, joint sequence estimates of the states $x$ and $r$ defined in (6). We propose a SA version of the DA algorithm, i.e. we build a non-homogeneous version of the DA algorithm that depends on a deterministic so-called cooling schedule $\{T(k);\ k \in \mathbb{N}\}$ satisfying

$$T(k+1) \le T(k) \quad \text{and} \quad \lim_{k \to +\infty} T(k) = 0 \qquad (20)$$

The proposed algorithm is summarized in Fig. 2.

Simulated Annealing Data Augmentation Algorithm

1. Initialization: Select $(r^{(0)}, x^{(0)})$ randomly.

2. Iteration: Given $(r^{(k-1)}, x^{(k-1)})$, compute $(r^{(k)}, x^{(k)})$ for the $k$th iteration (for $k = 1, 2, \ldots$) as follows:

• Simulate $x^{(k)} \sim f_{1/T(k)}(x \mid y, r^{(k-1)})$   (21)

• Simulate $r^{(k)} \sim f_{1/T(k)}(r \mid y, x^{(k)})$   (22)

where $f_{1/T(k)}(x \mid y, r^{(k-1)})$ and $f_{1/T(k)}(r \mid y, x^{(k)})$ are defined in Eqs. (23) and (24), respectively.

Figure 2: Algorithm II: Simulated annealing data augmentation algorithm for computing joint Bayesian MAP sequence estimates of the finite state Markov chain $r$ and the state $x$ of the jump linear system.

To implement this algorithm, we sample from $f_{1/T(k)}(x \mid y, r^{(k-1)})$ and $f_{1/T(k)}(r \mid y, x^{(k)})$ using the efficient forward filtering-backward sampling recursions. The desired densities in (21) and (22) are defined as follows:

$$f_{1/T(k)}(x \mid y, r^{(k-1)}) \propto f^{1/T(k)}(x \mid y, r^{(k-1)}) \qquad (23)$$

$$f_{1/T(k)}(r \mid y, x^{(k)}) \propto f^{1/T(k)}(r \mid y, x^{(k)}) \qquad (24)$$

where $\propto$ denotes proportionality, i.e. $f_{1/T(k)}$ is the density $f$ raised to the power $1/T(k)$ and renormalized.
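Sampling step (21) requires no new machinery: raising a Gaussian density to the power $1/T(k)$ leaves its mean unchanged and rescales its covariance by $T(k)$ (a fact that reappears in Appendix B.5), so the backward-sampling draws of Appendix A.1 need only a covariance rescaling. A minimal sketch:

```python
import numpy as np

def tempered_gaussian_draw(m_star, P_star, T_k, rng):
    """Draw from the tempered density (23) when f(x_t | ...) = N(m*_t, P*_t):
    the tempered density is N(m*_t, T(k) * P*_t), so only the covariance of
    the backward-sampling step (58)-(59) changes."""
    return rng.multivariate_normal(m_star, T_k * P_star)
```

As $T(k) \to 0$ the draws concentrate on the conditional means, which is exactly the annealing effect.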

4.2 Convergence of the Algorithm

Here, we obtain sufficient conditions on $T(k)$ to ensure convergence of the Markov chain $\{(r^{(k)}, x^{(k)});\ k \in \mathbb{N}\}$ to the set of global maxima. First, we note that obtaining MAP estimates of $r$ and $x$ is equivalent to solving an NP-hard (non-deterministic polynomial) combinatorial optimization problem. Indeed, (6) is equivalent to

$$(\hat{r}, \hat{x}) = \arg\max_{r(l),\, x} f(r(l) \mid y)\, f(x \mid y, r(l)) \qquad (25)$$

Note that $f(x \mid y, r(l))$ is an $n_x(T+1)$-dimensional Gaussian distribution with mean $m_l$ and covariance $\Sigma_l$, with strictly positive determinant $|\Sigma_l|$. Thus $\hat{r}$ is obtained as

$$\hat{r} = \arg\max_{r(l) \in S^T} f(r(l) \mid y)\, |\Sigma_l|^{-1/2} \qquad (26)$$

We denote by $\mathcal{M} \subset \mathcal{R}$ the set of these global maxima. Once $\hat{r}$, say $\hat{r} = r(l)$, has been obtained, then $\hat{x} = m_l$, where $m_l$ is obtained via a Kalman smoother [1].

Our proof of convergence relies on the fact that the stochastic process $\{r^{(k)};\ k \in \mathbb{N}\}$ is an inhomogeneous finite state space Markov chain which converges asymptotically towards the set of global maxima $\mathcal{M}$. From the definition of our algorithm, $\{(r^{(k)}, x^{(k)});\ k \in \mathbb{N}\}$ is an inhomogeneous Markov chain with transition kernel at time $k$ given by

$$K_k\big((r^{(k-1)}, x^{(k-1)});\ (r^{(k)}, x^{(k)})\big) = f_{1/T(k)}(x^{(k)} \mid y, r^{(k-1)})\, f_{1/T(k)}(r^{(k)} \mid y, x^{(k)}) \qquad (27)$$

Thus $\{r^{(k)};\ k \in \mathbb{N}\}$ is an inhomogeneous finite state space Markov chain with transition kernel

$$K_k\big(r^{(k-1)} = r(l);\ r^{(k)} = r(m)\big) = \int f_{1/T(k)}(r(m) \mid y, x)\, f_{1/T(k)}(x \mid y, r(l))\, dx \qquad (28)$$

The main result in this section is given by the following theorem, which is proved in Appendix B.6.

Theorem 4.2 If $\{T(k);\ k \in \mathbb{N}\}$ satisfies

$$T(k) = \frac{\gamma}{\ln(k + u)} \qquad (29)$$

with $\gamma \ge L$, where $L$ is defined in Eq. (81), and $u > 0$, then the Markov chain $\{r^{(k)};\ k \in \mathbb{N}\}$ is strongly ergodic. For any initial distribution, the distribution $p_k(r)$ of $r^{(k)}$ satisfies

$$\lim_{k \to +\infty} \int \big|p_k(r) - f^{\infty}(r \mid y)\big|\, dr = 0 \qquad (30)$$

where $f^{\infty}(r \mid y)$ is defined in Eq. (32). Furthermore, the distribution $p_k(x)$ of $x^{(k)}$ satisfies

$$\lim_{k \to +\infty} \int |p_k(x) - f_k(x)|\, dx = 0 \qquad (31)$$

where $f_k(x) \triangleq \int f_{1/T(k)}(x \mid y, r)\, f^{\infty}(r \mid y)\, dr$.

The proof of Theorem 4.2 involves the following four lemmas.

Lemma 4.1 $K_k\big((r^{(k-1)}, x^{(k-1)});\ (r^{(k)}, x^{(k)})\big)$ admits $f_{1/T(k)}(r, x \mid y)$ as its invariant distribution, where $f_{1/T(k)}(r, x \mid y) \propto f^{1/T(k)}(r, x \mid y)$. Thus, the marginal transition kernel $K_k(r^{(k-1)}; r^{(k)})$ admits $f_{1/T(k)}(r \mid y) \triangleq \int f_{1/T(k)}(r, x \mid y)\, dx$ as its invariant distribution.

Proof: See Appendix B.2.

Lemma 4.2 The sequence of invariant distributions $f_{1/T(k)}(r(l) \mid y)$ converges, as $k$ goes to infinity, to a distribution $f^{\infty}(r(l) \mid y)$ localized on the set $\mathcal{M}$:

$$f^{\infty}(r(l) \mid y) = \frac{|\Sigma_l|^{1/2}}{\sum_{r(m) \in \mathcal{M}} |\Sigma_m|^{1/2}}\, \delta_{\mathcal{M}}(r(l)) \qquad (32)$$

where $\delta_{\mathcal{M}}(r(l)) = 1$ if $r(l) \in \mathcal{M}$ and $\delta_{\mathcal{M}}(r(l)) = 0$ otherwise.

Proof: See Appendix B.3.

Lemma 4.3 The sequence of invariant discrete distributions $f_{1/T(k)}(r \mid y)$ satisfies

$$\sum_{k=0}^{+\infty} \int \big|f_{1/T(k+1)}(r \mid y) - f_{1/T(k)}(r \mid y)\big|\, dr < +\infty \qquad (33)$$

Proof: See Appendix B.4.

Lemma 4.4 There exists $k_0 \in \mathbb{N}$ such that the sequence of transition kernels $K_k$ of the inhomogeneous finite state-space Markov chain $\{r^{(k)};\ k \in \mathbb{N}\}$ satisfies, for any $(r(l), r(m)) \in \mathcal{R} \times \mathcal{R}$ and $k \ge k_0$,

$$K_k(r(l); r(m)) \ge C \exp\left(-\frac{L}{T(k)}\right) \qquad (34)$$

where $0 < C < \infty$ and $0 \le L < \infty$ are constants independent of $k$.

Proof: See Appendix B.5.

5 Maximum a Posteriori State Sequence Estimate of r

5.1 Metropolis-Hastings/Data Augmentation Simulated Annealing Scheme

In this section we present a SA algorithm based on DA for obtaining the maximum a posteriori state sequence estimate of the finite Markov chain $r$ defined in (7). We build a non-homogeneous Markov chain whose transition kernel at iteration $k$ depends on a cooling schedule $\{T(k);\ k \in \mathbb{N}\}$ satisfying

$$T(k+1) \le T(k) \quad \text{and} \quad \lim_{k \to +\infty} T(k) = 0 \qquad (35)$$

The proposed algorithm is summarized in Fig. 3.

Simulated Annealing Algorithm using Data Augmentation

1. Initialization: Select $(r^{(0)}, x^{(0)})$ randomly and set $k = 1$.

2. Iteration: Given $(r^{(k-1)}, x^{(k-1)})$, compute $(r^{(k)}, x^{(k)})$ for the $k$th iteration as follows:

• Simulate $x^{(k)} \sim f(x \mid y, r^{(k-1)})$   (36)

• Simulate a 'candidate' $r^c \sim f(r \mid y, x^{(k)})$   (37)

• Evaluate the acceptance ratio

$$\alpha_k(r^{(k-1)}, r^c) = \min\left\{\left[\frac{f(r^c \mid y)}{f(r^{(k-1)} \mid y)}\right]^{1/T(k) - 1},\ 1\right\} \qquad (38)$$

• Simulate a uniform random variable $u$ on $[0, 1]$, $u \sim \mathcal{U}(0, 1)$.

• If $u < \alpha_k(r^{(k-1)}, r^c)$, set $r^{(k)} = r^c$; otherwise set $r^{(k)} = r^{(k-1)}$.

Figure 3: Algorithm III: Annealing data augmentation algorithm for computing the Bayesian MAP sequence estimate of the finite state Markov chain $r$.

To implement the proposed algorithm, we sample from $f(x \mid y, r^{(k-1)})$ and $f(r \mid y, x^{(k)})$ using the forward filtering-backward sampling recursions. To evaluate the acceptance ratio (38), we need to be able to evaluate $f(r \mid y)$ up to a normalizing constant. We have $f(r \mid y) \propto f(r)\, f(y \mid r)$: the first term is the prior distribution of the realization of the finite state Markov chain; the second term is the likelihood, which is evaluated using the weighted sequence of innovations given by the Kalman filter [1]. Since the Kalman filter is already used to compute samples from $f(x \mid y, r)$, the additional computational cost of evaluating $f(r \mid y)$ is minimal.
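A sketch of one iteration of Algorithm III follows. Working with log-posteriors avoids numerical underflow in the ratio (38); `sample_r_given_x` and `log_posterior_r` are hypothetical helpers, the first implementing Appendix A.2 and the second returning $\log f(r \mid y)$ up to a constant from the Kalman-filter innovations.

```python
import numpy as np

def mh_annealed_step(y, r_curr, log_f_curr, x_new, T_k, model, rng):
    """One Metropolis-Hastings step of Algorithm III (Fig. 3), in log domain."""
    r_cand = sample_r_given_x(y, x_new, model, rng)    # candidate draw, Eq. (37)
    log_f_cand = log_posterior_r(y, r_cand, model)     # log f(r_cand | y) + const
    # Acceptance ratio (38): [f(r_cand | y) / f(r_curr | y)]^(1/T(k) - 1)
    log_alpha = (1.0 / T_k - 1.0) * (log_f_cand - log_f_curr)
    if np.log(rng.uniform()) < min(log_alpha, 0.0):
        return r_cand, log_f_cand                      # accept the candidate
    return r_curr, log_f_curr                          # keep the current state
```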

5.2 Convergence of the Algorithm

We obtain sufficient conditions on $T(k)$ to ensure convergence of the algorithm towards the set $\mathcal{M}^\star \subset \mathcal{R}$ of global maxima of $f(r \mid y)$, following the approach of Mitra et al. [13]. From the definition of the algorithm, $\{r^{(k)};\ k \in \mathbb{N}\}$ is an inhomogeneous finite-state Markov chain with transition kernels $\{K_k;\ k \in \mathbb{N}\}$,

$$K_k(r(l); r(m)) = \alpha_k(r(l), r(m))\, K_h(r(l); r(m)) + \delta_{lm} \int (1 - \alpha_k(r(l), r))\, K_h(r(l); r)\, dr \qquad (39)$$

where $\delta_{lm}$ denotes the Kronecker delta and $K_h$ is the associated (homogeneous) data augmentation Markov chain kernel, defined as

$$K_h(r(l); r(m)) \triangleq \int f(r(m) \mid y, x)\, f(x \mid y, r(l))\, dx \qquad (40)$$

The following theorem is the main result of this section and is proved in Appendix B.9.

Theorem 5.2 If $\{T(k);\ k \in \mathbb{N}\}$ satisfies

$$T(k) = \frac{\gamma}{\ln(k + u)} \qquad (41)$$

with $\gamma \ge L$, where $L$ is defined in Eq. (85), and $u > 0$, then the Markov chain $\{r^{(k)};\ k \in \mathbb{N}\}$ is strongly ergodic. For any initial distribution, the distribution $p_k(r)$ of $r^{(k)}$ satisfies

$$\lim_{k \to +\infty} \int \big|p_k(r) - f^{\infty}(r)\big|\, dr = 0 \qquad (42)$$

where $f^{\infty}(r)$ is defined in Eq. (45). Furthermore, the distribution $p_k(x)$ of $x^{(k)}$ satisfies

$$\lim_{k \to +\infty} \int \big|p_k(x) - f^{\infty}(x)\big|\, dx = 0 \qquad (43)$$

where $f^{\infty}(x) \triangleq \int f(x \mid y, r)\, f^{\infty}(r)\, dr$.

The proof of Theorem 5.2 involves the following four lemmas.

Lemma 5.1 For any $k \in \mathbb{N}^*$, $K_k(r(l); r(m))$ admits $f_{1/T(k)}(r)$ as its invariant distribution, where

$$f_{1/T(k)}(r) \propto f^{1/T(k)}(r \mid y) \qquad (44)$$

Proof: See Appendix B.7.

We obtain the following two lemmas straightforwardly; see for example [13, 26].

Lemma 5.2 The sequence of invariant distributions $f_{1/T(k)}(r)$ (defined in Eq. (44)) converges, as $k$ goes to infinity, towards the set of global maxima $\mathcal{M}^\star$ of $f(r(l) \mid y)$, that is

$$f^{\infty}(r(l) \mid y) = \frac{\delta_{\mathcal{M}^\star}(r(l))}{|\mathcal{M}^\star|} \qquad (45)$$

where $\delta_{\mathcal{M}^\star}(r(l)) = 1$ if $r(l) \in \mathcal{M}^\star$ and $\delta_{\mathcal{M}^\star}(r(l)) = 0$ otherwise, and $|\mathcal{M}^\star|$ denotes the cardinality of $\mathcal{M}^\star$.

Lemma 5.3 The sequence of invariant discrete distributions $f_{1/T(k)}(r)$ satisfies

$$\sum_{k=0}^{+\infty} \int \big|f_{1/T(k+1)}(r \mid y) - f_{1/T(k)}(r \mid y)\big|\, dr < +\infty$$

Lemma 5.4 There exists $k_0$ such that for any $k \ge k_0$ and for any $(r(l), r(m)) \in \mathcal{R} \times \mathcal{R}$, the sequence of transition kernels $K_k$ of the inhomogeneous finite state-space Markov chain $\{r^{(k)};\ k \in \mathbb{N}\}$ satisfies

$$K_k(r(l); r(m)) \ge C \exp\left(-\frac{L}{T(k)}\right)$$

where $0 < C < \infty$ and $0 \le L < \infty$ are constants independent of $k$.

Proof: See Appendix B.8.

6 Applications and Numerical Examples

Theoretically, the DA and DA-based SA algorithms require an infinite number of iterations to give the exact values of the conditional mean estimates and of the global maxima, respectively. In all our computer simulations below, the first $N_0$ iterations of the data augmentation algorithm are discarded; they are assumed to correspond to the so-called 'burn-in period' (the transient period prior to convergence) of the Markov chain. (Methods for determining the burn-in period $N_0$ are beyond the scope of this paper.) As in [4], the DA algorithm is iterated until the computed values of the ergodic averages are no longer modified.

Computer simulations were carried out to evaluate the performance of our three algorithms. Sec. 6.1 considers the problem of estimating a sparse signal arising from a neutron sensor based on a set of noisy data. Sec. 6.2 considers the problem of narrowband interference suppression in Spread Spectrum CDMA communication systems.

As shown in Secs. 4 and 5, the SA algorithms require logarithmic cooling schedules. Such schedules are too slow to be implemented in practice. As is usually done in practice [6, 13, 26], we have implemented exponential and polynomial cooling schedules, i.e. $T(k) = C\alpha^k$ with $0 < \alpha < 1$, and $T(k) = Ck^{-n}$. Our simulations (not presented here) show that there appears to be little difference between polynomial and exponential cooling schedules; therefore, in the simulations presented below only exponential cooling schedules are used.
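For concreteness, here are the two kinds of schedules as one-liners; the logarithmic schedule is the one required by Theorems 4.2 and 5.2, while the exponential one is what the experiments below actually use.

```python
import math

def exponential_cooling(k, C=1.0, alpha=0.98):
    """T(k) = C * alpha**k, the practical schedule used in Secs. 6.1-6.2."""
    return C * alpha ** k

def logarithmic_cooling(k, gamma, u=2.0):
    """T(k) = gamma / ln(k + u), cf. Eqs. (29) and (41); u > 1 keeps T(0) finite."""
    return gamma / math.log(k + u)
```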

6.1 Estimation of a sparse signal

In several problems related to geophysics, nuclear science or speech processing, the signal of interest can be modeled as an autoregressive process excited by a noise whose marginal distribution is a mixture of Gaussians [12]. We consider the following model:

$$s_t = a_1 s_{t-1} + a_2 s_{t-2} + \tilde{v}_t \qquad (46)$$

$$y_t = s_t + \sigma_w w_t \qquad (47)$$

where $\tilde{v}_t$ is the dynamic (mixture) noise process

$$\tilde{v}_t \sim \lambda \mathcal{N}(0, \sigma_1^2) + (1 - \lambda)\, \mathcal{N}(0, \sigma_2^2), \quad w_t \sim \mathcal{N}(0, 1)$$

$\tilde{v}_t$ is often assumed to be a white noise sequence, but it could also be modeled as a first-order Markov sequence to take into account the dead time of the sensor. The model (46) and (47) can be re-expressed as the jump Markov linear system (1) and (2), where the state vector is $x_t = (s_t, s_{t-1})'$ and, for all $t$, $u_t = 0$,

$$A = \begin{pmatrix} a_1 & a_2 \\ 1 & 0 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad D = \sigma_w, \quad F = 0, \quad G = 0$$

and $B(r_t) = (\sigma_i, 0)'$ with $i = r_t$. In the following simulations we set: $T = 250$; Markov noise: $p_1 = 0.1$, $p_2 = 0.9$, $p_{11} = 0.05$, $p_{22} = 0.894$ (so that $r_t$ is in its stationary regime); $a_1 = 1.51$, $a_2 = -0.55$, $\lambda = 0.05$, $\sigma_1 = 0.30$, $\sigma_2 = 0.01$. This models a neutron sensor in a noisy environment.
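The sparse-signal model translates into the matrices of Sec. 2.1 as in the following sketch, which uses the parameter values just quoted, except for the observation noise level $\sigma_w$, which is not stated explicitly above and is set to an assumed value here.

```python
import numpy as np

a1, a2 = 1.51, -0.55
sigma1, sigma2 = 0.30, 0.01
sigma_w = 0.1                      # assumed value; not specified in the text
A = [np.array([[a1, a2], [1.0, 0.0]])] * 2           # same A in both regimes
B = [np.array([[sigma1], [0.0]]),                    # B(r_t) = (sigma_i, 0)'
     np.array([[sigma2], [0.0]])]
C = [np.array([1.0, 0.0])] * 2
D = [sigma_w] * 2
F = [np.zeros(2)] * 2                                # F = 0
G = [0.0] * 2                                        # G = 0
P_trans = np.array([[0.050, 0.950],                  # p11 = 0.05
                    [0.106, 0.894]])                 # p22 = 0.894
pi0 = np.array([0.1, 0.9])                           # stationary law (p1, p2)
```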

In Fig. 4, the signal $s_t$ and its noisy observations $y_t$ are displayed. We have chosen to illustrate the performance of our three algorithms by comparing estimates of the dynamic noise $\tilde{v}_t$, which are directly obtained from the estimates of the states: the closer the estimate of the dynamic noise to the true dynamic noise, the better the performance of the algorithm [12]. In Fig. 5, we display the conditional mean estimate of the dynamic noise $E\{\tilde{v}_t \mid y\}$, computed using Algorithm 1. In Fig. 6, we present $E\{\tilde{v}_t \mid y, \hat{r}\}$, where $\hat{r}$ is computed using Algorithm 2. Finally, in Fig. 7, we present $E\{\tilde{v}_t \mid y, \hat{r}\}$, where $\hat{r}$ is the MAP estimate given by Algorithm 3. In all cases, the algorithms were initialized randomly.

Figure 4: A typical realization of the data. (−) true signal, (· · ·) noisy observed signal.

Algorithms 1 and 3 give satisfactory results, since the signal to noise ratio (SNR) is quite low. Algorithm 2 underestimates the number of impulses in the signal. One must not be surprised by this result: other works using this criterion in the specific field of sparse signal deconvolution report similar results [12] and adopt the criterion that we maximize using Algorithm 3. In this example, we discard the first $N_0 = 20$ samples simulated by the data augmentation algorithm and then take into account the following 150 iterations. For the simulated annealing algorithms, we implement an exponential cooling schedule $T(k) = C\alpha^k$ with $C = 1$ and $\alpha = 0.98$, and we use 150 iterations.

Figure 5: True (×) and estimated (o) dynamic noise sequence. The estimates are the conditional mean estimates computed via Algorithm 1.

Figure 6: True (×) and estimated (o) dynamic noise sequence. The estimates are given by $E\{\tilde{v}_t \mid y, \hat{r}\}$, where $\hat{r}$ is computed using Algorithm 2.

6.2 Narrowband interference suppression in Spread Spectrum CDMA systems

Code-division multiple-access (CDMA) spread-spectrum signaling is among the most promising multiplexing technologies for cellular telecommunications services, such as personal telecommunications, mobile telephony, and indoor wireless networks. The explosive growth in cellular telephony, in conjunction with emerging new applications, has inspired significant interest in interference suppression techniques for enhancing the performance of CDMA systems. CDMA provides a means of separating the signals of multiple users transmitting simultaneously and occupying the same radio frequency (RF) bandwidth. It is well known that system performance is greatly enhanced if the receiver employs some means of suppressing narrowband interference prior to signal "despreading" [27]. Numerous recent papers study the problem of narrowband interference suppression in CDMA systems; see [14, 15, 28] and the references therein. Our aim here is to illustrate the use of the iterative stochastic sampling algorithms proposed in the previous sections for narrowband interference suppression. Note, however, that realistic algorithms would be recursive (on-line).

In the papers [14, 15, 28], the following signal model is used. The sampled received signal $y_t$ consists of the spread spectrum signal $r_t$ from $N$ users, the narrowband interference $i_t$ and observation noise $w_t$, that is

$$y_t = r_t + i_t + \sigma_w w_t \qquad (48)$$

where $w_t$ is a zero-mean white Gaussian process of variance 1. As in [14, 15, 28], the narrowband interference $i_t$ is modeled as a second-order auto-regressive process with both poles at $z = 0.99$, i.e.

$$i_t = 1.98\, i_{t-1} - 0.98\, i_{t-2} + \sigma_e e_t \qquad (49)$$

where $e_t$ is a zero-mean white Gaussian process of variance 1. The power of the received spread spectrum signal for each user was held constant, with amplitudes $\pm 1$ randomly selected, and $r_t$ binomially distributed. Note that the CDMA spread spectrum model (48) and (49) can be re-expressed as the jump Markov model of (1) and (2), where the state vector $x_t = (i_t, i_{t-1})'$ denotes the state of the narrowband interference at times $t$ and $t-1$, $u_t = 1$ for all $t$, and

$$A = \begin{pmatrix} 1.98 & -0.98 \\ 1 & 0 \end{pmatrix}, \quad B = \begin{pmatrix} \sigma_e \\ 0 \end{pmatrix}, \quad C = \begin{pmatrix} 1 & 0 \end{pmatrix}, \quad D = \sigma_w, \quad F = 0 \qquad (50)$$

and $G(1) = 1$, $G(2) = -1$.

In the numerical examples below we considered a single user with $\sigma_e = 0.02$, and we compared the performance of the three proposed algorithms for increasing observation noise. We compute the conditional mean estimate and the MAP estimates of the discrete sequence $r_t$. In this case, the joint MAP estimate given by Algorithm 2 and the MAP estimate given by Algorithm 3 are theoretically equal. We present the bit error rate for these algorithms. To evaluate the bit error rate from the conditional mean estimate $E[r_t \mid y]$, we set $\hat{r}_t = 1$ if $\bar{r}_t(N) > 0$ and $\hat{r}_t = -1$ otherwise.

$\sigma_w$     Algorithm 1   Algorithm 2   Algorithm 3
0.5      3.13          3.51          3.25
0.6      5.88          6.82          6.48
0.7      8.84          10.23         10.61
0.8      11.89         13.02         14.90
0.9      14.54         16.12         17.88
1.0      17.29         18.04         21.12

Table 1: Bit error rate (in percent) of the 3 algorithms for interference suppression in CDMA systems.
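The bit decision and the bit error rate of Table 1 are then computed as in the following sketch, where `r_bar` stands for the conditional-mean estimate $\bar{r}_t(N)$ (or a MAP estimate taking values $\pm 1$).

```python
import numpy as np

def bits_from_estimate(r_bar):
    """Decision rule quoted above: r_hat_t = +1 if the estimate is positive,
    -1 otherwise."""
    return np.where(r_bar > 0, 1.0, -1.0)

def bit_error_rate(r_hat, r_true):
    """Fraction of decisions disagreeing with the transmitted bits (percent)."""
    return 100.0 * np.mean(r_hat != r_true)
```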

The algorithms were run on 400 points and averaged over 100 independent runs. In all cases, the algorithms were initialized randomly. This problem is statistically easier than the problem described in Sec. 6.1. In this example, we discard the first $N_0 = 20$ samples simulated by the data augmentation algorithm; taking into account the following 50 iterations proved sufficient. For the simulated annealing algorithms, we implement an exponential cooling schedule $T(k) = C\alpha^k$ with $C = 1$ and $\alpha = 0.80$, and we use 50 iterations.

The numerical examples presented here, and other simulations not presented here, suggest that Algorithm 1 performs best for the CDMA narrowband interference suppression problem. The results obtained are better than those obtained using the EM algorithm [11].

7 Conclusion

In this paper, we have presented three iterative stochastic sampling algorithms to compute conditional mean estimates and MAP state estimates of jump Markov linear models. The computational cost of one iteration of each algorithm is linear in the data length. Convergence results for these algorithms towards the required estimates have been obtained. A key property of the two algorithms for MAP state estimation is that they are globally convergent; this is in contrast to gradient-type algorithms such as the EM algorithm [11], which suffer from convergence to stationary points. Two applications (sparse signal detection/estimation, and narrowband interference suppression in CDMA communication systems) were presented to illustrate their performance. Future work will focus on adaptive recursive versions of the proposed algorithms for narrowband interference suppression and multiuser detection in CDMA systems.

A Forward Filtering-Backward Sampling Recursions

A.1 Sampling from $p(x \mid y, r)$

We have the following decomposition:

$$p(x \mid y, r) = p(x_T \mid y_T, r) \prod_{t=0}^{T-1} p(x_t \mid y_t, r, x_{t+1}) \qquad (51)$$

where $y_t \triangleq (y_1, \ldots, y_t)$, $r_t \triangleq (r_1, \ldots, r_t)$ and $r_0 = y_0 = \emptyset$. Given $x_{t+1}$, $p(x_t \mid y_t, r, x_{t+1})$ is a Gaussian distribution. This decomposition suggests the following algorithm [2].

1. Kalman filter (forward filtering): Set $m_{0|0} = \hat{x}_0$ and $P_{0|0} = P_0$; then for $t = 1, \ldots, T$ compute, using the Kalman filter equations,

$$m_{t|t-1} = A(r_t)\, m_{t-1|t-1} + F(r_t)\, u_t \qquad (52)$$

$$P_{t|t-1} = A(r_t)\, P_{t-1|t-1}\, A'(r_t) + B(r_t)\, B'(r_t) \qquad (53)$$

$$z_{t|t-1} = C(r_t)\, m_{t|t-1} + G(r_t)\, u_t \qquad (54)$$

$$S_t = C(r_t)\, P_{t|t-1}\, C'(r_t) + D(r_t)\, D'(r_t) \qquad (55)$$

$$m_{t|t} = m_{t|t-1} + P_{t|t-1}\, C'(r_t)\, S_t^{-1}\, (y_t - z_{t|t-1}) \qquad (56)$$

$$P_{t|t} = P_{t|t-1} - P_{t|t-1}\, C'(r_t)\, S_t^{-1}\, C(r_t)\, P_{t|t-1} \qquad (57)$$

and store, for $t = 1, \ldots, T$: $m_{t|t-1} = E\{x_t \mid y_{t-1}, r\}$, $m_{t|t} = E\{x_t \mid y_t, r\}$, $P_{t|t-1} = E\{(x_t - m_{t|t-1})(x_t - m_{t|t-1})' \mid y_{t-1}, r\}$ and $P_{t|t} = E\{(x_t - m_{t|t})(x_t - m_{t|t})' \mid y_t, r\}$.

2. Backward sampling: For $t = T, \ldots, 0$, sample $x_t$ from $\mathcal{N}(m_t^*, P_t^*)$, where $m_T^* = m_{T|T}$, $P_T^* = P_{T|T}$, and for $0 \le t \le T - 1$,

$$m_t^* = m_{t|t} + P_{t|t}\, A'(r_{t+1})\, P_{t+1|t}^{-1}\, (x_{t+1} - m_{t+1|t}) \qquad (58)$$

$$P_t^* = P_{t|t} - P_{t|t}\, A'(r_{t+1})\, P_{t+1|t}^{-1}\, A(r_{t+1})\, P_{t|t} \qquad (59)$$

(58) (59)

Sampling from p (r| y, x)

$p(r \mid y, x)$ can be decomposed as follows:

$$p(r \mid y, x) = p(r_T \mid y_T, x_T) \prod_{t=1}^{T-1} p(r_t \mid y_t, x_t, r_{t+1}) \qquad (60)$$

where $y_t \triangleq (y_1, \ldots, y_t)$ and $x_t \triangleq (x_0, \ldots, x_t)$. Given $r_{t+1}$, $p(r_t \mid y_t, x_t, r_{t+1})$ is a discrete distribution. This suggests the following algorithm to sample from $p(r \mid y, x)$.

1. Optimal filter (forward filtering): For $t = 2, \ldots, T$, compute the optimal filter [16]. For any $i$, evaluate:

$$p(r_t = i \mid y_{t-1}, x_{t-1}) = \sum_{j=1}^{s} p_{ji}\, p(r_{t-1} = j \mid y_{t-1}, x_{t-1}) \qquad (61)$$

$$p(r_t = i \mid y_t, x_t) = \frac{p(y_t, x_t \mid y_{t-1}, x_{t-1}, r_t = i)\, p(r_t = i \mid y_{t-1}, x_{t-1})}{\sum_{j=1}^{s} p(y_t, x_t \mid y_{t-1}, x_{t-1}, r_t = j)\, p(r_t = j \mid y_{t-1}, x_{t-1})} \qquad (62)$$

and store, for $t = 1, \ldots, T$ and $i = 1, \ldots, s$, $p(r_t = i \mid y_{t-1}, x_{t-1})$ and $p(r_t = i \mid y_t, x_t)$.

2. Backward sampling: Sample $r_T$ from $p(r_T \mid y_T, x)$; then for $t = T-1, \ldots, 1$, sample $r_t$ from $p(r_t \mid y_t, x_t, r_{t+1})$, where for $i = 1, \ldots, s$,

$$p(r_t = i \mid y_t, x_t, r_{t+1}) = \frac{p(r_{t+1} \mid r_t = i)\, p(r_t = i \mid y_t, x_t)}{\sum_{j=1}^{s} p(r_{t+1} \mid r_t = j)\, p(r_t = j \mid y_t, x_t)} \qquad (63)$$

Proof of Theorems and Lemmas

B.1

Proof of Theorem 3.2

 The marginal sequence r(k) ; k ∈ N is a Markov chain with transition kernel given by   Z (k−1) (k) K r = r(l); r = r(m) = f ( r(m)| y, x) f ( x| y, r(l)) dx

(64)

  By construction [20, 21], r(k) , x(k) ; k ∈ N admits f ( r, x| y) as its invariant distribution. Thus  (k) r ; k ∈ N admits f ( r| y) as its invariant distribution. From the model assumptions given

in Sec. 2.1 and Eq. (4) in particular, the posterior distribution f ( r| y) is strictly positive on R

and null on S T \R. Moreover, we have K (r(l); r(m)) > 0 for any (r(l), r(m)) ∈ S T × R and K (r(l); r(m)) = 0 for any (r(l), r(m)) ∈ S T × S T \R, i.e. at k = 1 the finite state Markov chain  enters in R and never leaves this set. Thus r(k) ; k ∈ N is irreducible and aperiodic on R. Hence  it is uniformly geometrically ergodic. From ([18], pp. 401–402), r(k) ; k ∈ N satisfies: Z |pk (r) − f ( r| y)| dr ≤ 2ρk−1 (and

R

|pk (r) − f ( r| y)| dr ≤ 2ρk if r(0) ∈ R) where 0 ≤ ρ < 1 satisfies: ρ=1−

X

min K (r(l); r(m)) r(l)∈R

r(m)∈R

Applying the duality principle of Robert and Diebolt [17, 4, 22], we now show that: Z |pk (x) − f ( x| y)| dx ≤ 2ρk−2

(65)

 Thus, the property of uniform geometric convergence of the Markov chain r(k) ; k ∈ N is “trans ferred” to the continuous state-space Markov chain x(k) ; k ∈ N . To prove (65), first note that Z ′ ′ pk (r(l), x ) = f (r(l)|x , y) pk−1 (r)f (x′ |r, y)dr (66) = f (r(l)|x′ , y)pk (x′ )

Thus, Z

|pk (x) − f ( x| y)| dx Z Z Z dx = f (x|r, y)p (r)dr − f ( x| r, y) f ( r| y) dr k−1 Z Z ≤ f ( x| r, y) |pk−1 (r) − f ( r| y)| drdx Z = |pk−1 (r) − f ( r| y)| dr ≤ 2ρk−2 24

(67)

Using (66), we now prove the uniform ergodicity of the Markov chain Z Z



 r(k) , xk ; k ∈ N

|pk (r, x) − f ( r, x| y)| drdx Z Z = |f ( r| y, x) pk (x) − f ( r| y, x) f ( x| y)| drdx Z Z ≤ f ( r| y, x) |pk (x) − f ( x| y)| dxdr Z = |pk (x) − f ( x| y)| dx ≤ 2ρk−2

B.2

(68)

Proof of Lemma 4.2 Z Z

 1/T (k) (r, x) drdx Kk (r, x) ; r′ , x′ f Z Z  1/T (k) ′  1/T (k) 1/T (k) 1/T (k) = f x′ y, r f r y, x′ f ( x| y, r) f ( r| y) drdx Z  1/T (k) ′  1/T (k) = f x′ , r y f r y, x′ dr = f

B.3

1/T (k)

r′ , x′



Proof of Lemma 4.2

f

1/T (k)

R 1/T (k) f ( x| y, r(l)) dxf 1/T (k) ( r(l)| y) ( r(l)| y) = R R 1/T (k) f ( x| y, r) f 1/T (k) ( r| y) dxdr 1−1/T (k)  (2π)n/2 |Σl |1/2 (T (k))n/2 f 1/T (k) ( r(l)| y) = 1−1/T (k)  P n/2 1/2 (2π) |Σ | (T (k))n/2 f 1/T (k) ( r(m)| y) m r(m)∈R  !1/T (k) −1 X |Σm |1/2 f ( r(m)| y) |Σm |−1/2  =  1/2 −1/2 |Σ | f ( r(l)| y) |Σ | l l r(m)∈R

(69)

where n , nx (T + 1). If r(l) ∈ / M, there exists r(m) ∈ M such that lim

k→+∞

f ( r(m)| y) |Σm |−1/2 f ( r(l)| y) |Σl |−1/2

!1/T (k)

= +∞

(70)

and thus lim f

k→+∞

1/T (k)

( r(l)| y) = 0

If r(l) ∈ M, then lim f

k→+∞

1/T (k)



( r(l)| y) =  25

X

r(m)∈M

−1 |Σm |1/2  | Σl |1/2

(71)

(72)

B.4

Proof of Lemma 4.2

The proof of this result follows from arguments similar to those of Mitra et al.([13], proposition 3.3. pp. 755-756). For any r(l) ∈ R, by differentiating with respect to T the function f

1/T

( r(l)| y), we

obtain df

1/T

( r(l)| y) = dT 1/T  |Σl |1/2 f ( r(l)| y) |Σl |−1/2 C 2 (T ) T 2

(73) X

r(m)∈R

|Σm |

1/2

1/T  f ( r(m)| y) |Σm |−1/2 ln

f ( r(m)| y) |Σm |−1/2 f ( r(l)| y) |Σl |−1/2

!

where C (T ) =

X

r(m)∈R

1/T  |Σm |1/2 f ( r(m)| y) |Σm |−1/2

(74)

From this result, it follows that for any r(l) ∈ M and for k ≥ 0 f

1/T (k+1)

( r(l)| y) − f

1/T (k)

( r(l)| y) > 0

since each term on the right hand side of (73) is either zero or negative, thus df

(75) 1/T

( r(l)| y) /dT < 0.

From (20), Eq. (75) immediately follows. Let H(l), for r(l) ∈ R\M, denote the set of r(m) such that f ( r(m)| y) |Σm |−1/2 > f ( r(l)| y) |Σl |−1/2 , that is H(l) = {r(m) : f ( r(m)| y) |Σm |−1/2 > f ( r(l)| y) |Σl |−1/2 , r(m) ∈ R, r(l) ∈ R\M}

(76)

For r(l) ∈ R\M, 

C 2 (T ) T 2

1/T 2  −1/2 1/2 f ( r(l)| y) |Σl | |Σl | X

r(m)∈H(i)



X

|Σm |1/2 |Σl |1/2

r(m)∈H(i) /

|Σm |1/2 |Σl |1/2

df

1/T

( r(l)| y) = dT

f ( r(m)| y) |Σm |−1/2 f ( r(l)| y) |Σl |−1/2

!1/T

f ( r(m)| y) |Σm |−1/2 f ( r(l)| y) |Σl |−1/2

f ( r(m)| y) |Σm |−1/2

ln

!1/T

(77)

f ( r(l)| y) |Σl |−1/2

ln

!

f ( r(l)| y) |Σl |−1/2

f ( r(m)| y) |Σm |−1/2

!

Following the same arguments as in [13], we make the following observations: the first term on the right hand of (77) is monotonically increasing with decreasing T , while the second term on the right hand side of (77) is monotonically decreasing with decreasing T . Furthermore, in the limit 26

T → 0, the first term tends towards +∞, while the second term tends towards 0. Thus, for T → 0, df

1/T

( r(l)| y) /dT > 0. The set R\M being finite, there exists a k′ ∈ N such that for any k > k′ f

1/T (k+1)

( r(l)| y) − f

1/T (k)

( r(l)| y) < 0, if r(l) ∈ R\M

(78)

Then following Mitra et al. ([13], proposition 5.1., pp. 762-763), from (75) and (78) and for k > k′ we have X 1/T (k+1) 1/T (k) ( r(l)| y) − f ( r(l)| y) = f

 X  1/T (k+1) 1/T (k) f ( r(l)| y) − f ( r(l)| y)

r(l)∈M

r(l)∈R



= 2

X

r(l)∈R\M

X 

f

  1/T (k+1) 1/T (k) f ( r(l)| y) − f ( r(l)| y)

1/T (k+1)

r(l)∈M

( r(l)| y) − f

1/T (k)

( r(l)| y)



Hence, +∞ X X 1/T (k) 1/T (k+1) ( r| y) − f ( r| y) ≤ 2 f

(79)

k=k ′ r(l)∈R

The result (33) immediately follows.

B.5

Proof of Lemma 4.2

Using Bayes’ rule f

1/T (k)



r(m)| y, x



=

=

f

f

1/T (k)

1/T (k)

R

f

1/T (k)

( x′ | y, r(m)) f ( r(m)| y) R 1/T (k) f ( x′ , r| y) dr ( x′ | y, r(m)) f

1/T (k)

( x′ | y, r) f

There exists k0 ∈ N such that for any (r(m′ ), r(l′ )) ∈ M × R, f

1/T (k)

1/T (k)

1/T (k)

( r(m)| y)

( r| y) dr

( r(m′ )| y) ≥ f

thus the denominator of the expression above can be bounded Z  1/T (k) 1/T (k) f x′ y, r f ( r| y) dr Z   1/T (k) 1/T (k) ≤ f r(m′ ) y f x′ y, r dr ≤ f

1/T (k)

where R is the cardinality of R and bk =

 r(m′ ) y Rbk

h i  1/T (k) max f x′ y, r = N (ml′ , T (k) Σl′ )

r(l′ )∈R

= (2πT (k))−n/2 max |Σl′ |−1/2 r(l′ )∈R

27

1/T (k)

( r(l′ )| y),

(80)

Thus Kk (r(l); r(m)) = ≥

Z

f f

f

1/T (k) 1/T (k)

1/T (k)

 1/T (k) ′  r(m)| y, x′ f x y, r(l) dx′

( r(m)| y)

( r(m′ )| y) Rbk

Z

f

1/T (k)

For any (r(l), r(m)) ∈ R × R Z

f

1/T (k)

( x| y, r(l)) f

1/T (k)

( x| y, r(m)) dx =

Z

 1/T (k) ′  x′ y, r(m) f x y, r(l) dx′

N (ml , T (k) Σl ) N (mm , T (k) Σm ) dx

We obtain Z

f

1/T (k)

= where Σ−1 l,m

( x| y, r(l)) f

1/T (k)

( x| y, r(m)) dx

|Σl,m|1/2

exp

(2πT (k))n/2 |Σl |1/2 |Σm |1/2



 −1 ′ −1 (ml − mm ) (Σl + Σm ) (ml − mm ) 2T (k)

−1 , Σ−1 l + Σm . Define the following constants

C′ = R

−1

min

(r(l),r(m))∈R×R

  |Σl,m|1/2 / |Σl Σm |1/2 / max |Σl′ |−1/2 > 0 r(l′ )∈R

and L1 =

1 max (ml − mm )′ (Σl + Σm )−1 (ml − mm ) 2 (r(l),r(m))∈R×R

Then



L1 Kk (r(l); r(m)) ≥ C exp − T (k) ′



f f

1/T (k)

1/T (k)

( r(m)| y) ( r(m′ )| y)

Using (69), we have f f

1/T (k)

1/T (k)

( r(m)| y) ( r(m′ )| y)

=



|Σm | |Σm′ |

1/2 "

f ( r(m)| y) |Σm |−1/2

f ( r(m′ )| y) |Σm′ |−1/2

Thus with C=C L2 =



min

(r(l),r(m))∈R ×R

max

(r(l),r(m))∈R×R

ln



|Σm | | Σl |

1/2

f ( r(m)| y) |Σm |−1/2 f ( r(l)| y) |Σl |−1/2

#1/T (k)

!

and L = L1 + L2 we obtain the result. 28

(81)

B.6

Proof of Theorem 4.2

 We first prove that r(k) ; k ∈ N is weakly ergodic ([8], definition V.1.1. pp. 137; [13], definition 4.4 P pp. 758). The inhomogeneous Markov chain {Kk ; k ∈ N ∗ } is weakly ergodic if ∞ k=1 [1 − ζ (Kk )] = +∞ where the ergodic coefficient ζ (P ) of any Markov transition kernel (stochastic matrix) P on R is defined as: ζ (P ) = 1 −

min

(r(l),r(m))∈R×R

X

min (P (r(l); r(k)) , P (r(m); r(k)))

(82)

r(k)∈R

Using Lemma 4.2, the ergodic coefficient of Kk is given by ζ (Kk ) = 1 −

(r(l),r(m))∈R×R

≤ 1−

(r(l),r(m))∈R×R

min

X

r(l′ )∈R

  min Kk r(l); r(l′ ) , Kk r(m); r(l′ )

min Kk (r(l); r(m))   L ≤ 1 − C exp − T (k)

(83)

thus if T (k) = γ/ ln (k + u) +∞ X

k=k0

[1 − ζ (Kk )] ≥ C

+∞  X

k=k0

1 k+u

L/γ

The ergodic coefficient ζ (Kk ) diverges if γ ≥ L implying the weak ergodicity of {r(k); k ∈ N }. Now using Lemma 4.2 and ([8], theorem V.4.3. pp. 160; see also [13], theorem 4.2. pp. 759), we obtain that {r(k); k ∈ N } is strongly ergodic and (30) is thus satisfied. To prove the result in (31), we make use of the following Z Z Z   1/T (k+1) ∞ |pk+1 (x) − fk+1 (x)| dx = ( x| y, r) pk (r) − f ( r| y) dr dx f Z Z ∞ 1/T (k+1) f ( r| y) ( x| y, r) drdx (r) − ≤ f pk Z ∞ (r) − ≤ f ( r| y) pk dr

(84)

From (30) and (84), Eq. (31) immediately follows.

B.7

Proof of Lemma 5.2

Notice that our algorithm is nothing but a Metropolis-Hastings algorithm of invariant distribution f

1/T (k)

(r) and proposal distribution Kh (r(l); r(m)) [22]. The simple expression of the acceptance

ratio (38) follows from the fact that Kh (r(l); r(m)) is in detailed balance with f ( r(m)| y), i.e. f (r(l)) Kh (r(l); r(m)) = f (r(m)) Kh (r(m); r(l)). 29

B.8

Proof of Lemma 5.2

We denote N as the set of global minima of f ( r(m)| y) on R. • For all r(l) ∈ R\N there exists r(m) ∈ R such that f ( r(m)| y) < f ( r(l)| y). There exists a finite k1 such that for any k ≥ k1 , 1/T (k) ≥ 1 and for any r(m) ∈ R\ {r(l)}, we have Kk (r(l); r(m)) = αk (r(l), r(m)) Kh (r(l); r(m))   f ( r(m)| y) 1/T (k)−1 ≥ min min Kh (r(l); r(m)) r(m)∈R f ( r(l)| y) r(m)∈R\{r(l)}   L ≥ C exp − T (k) where L=

max

(r(l),r(m))∈R×R

ln

f ( r(l)| y) f ( r(m)| y)

(85)

and C = exp (L) ×

min

min

r(l)∈R\M∗ r(m)∈R\{r(l)}

Kh (r(l); r(m))

• If r(m) = r(l), Kk (r(l); r(l)) = Kh (r(l); r(l)) +

X

r(m)∈R\{r(l)}

= Kh (r(l); r(l)) + X

(1 − αk (r(l), r(m))) Kh (r(l); r(m))

r(m)∈(R\{r(l)})/f ( r(m)| y) 0   Hence there exists a finite k3 such that for any k ≥ k3 , Kk (r(l); r(m)) ≥ C exp − T L(k) . Now choosing k0 = max (k2 , k3 ), the result follows.

30

B.9

Proof of Theorem 5.2

The first part is similar to the proof of Theorem 4.2 and is thus omitted. Eq. (43) is obtained from the following result by taking the limit as k goes to infinity Z Z Z ∞ ∞ pk+1 (x) − f (x) dx = f ( x| y, r) pk (r) − f ( x| y, r) f (r) dr dx Z Z ∞ = f ( x| y, r) pk (r) − f (r) dr dx Z Z ∞ ≤ pk (r) − f (r) dr f ( x| y, r) dx Z ∞ ≤ pk (r) − f (r) dr

References

[1] B.D.O. Anderson and J.B. Moore, Optimal Filtering, Prentice-Hall, Englewood Cliffs, 1979.

[2] C.K. Carter and R. Kohn, "On Gibbs Sampling for State Space Models", Biometrika, Vol. 81, No. 3, pp. 541-553, 1994.

[3] P. De Jong and N. Shephard, "The Simulation Smoother for Time Series Models", Biometrika, Vol. 82, No. 2, pp. 339-350, 1995.

[4] J. Diebolt and C.P. Robert, "Estimation of Finite Mixture Distributions through Bayesian Sampling", J. R. Statist. Soc. B, Vol. 56, No. 2, pp. 363-375, 1994.

[5] S. Frühwirth-Schnatter, "Data Augmentation and Dynamic Linear Models", J. Time Ser. Anal., Vol. 15, pp. 183-202, 1994.

[6] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images", IEEE Trans. Patt. Anal. Mach. Int., Vol. 6, pp. 721-741, 1984.

[7] R.E. Helmick, W.D. Blair and S.A. Hoffman, "Fixed-Interval Smoothing for Markovian Switching Systems", IEEE Trans. Info. Theory, Vol. 41, No. 6, pp. 1845-1855, 1995.

[8] D.L. Isaacson and R.W. Madsen, Markov Chains: Theory and Applications, Wiley, 1976.

[9] A.H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, 1970.

[10] J.S. Liu, W.H. Wong and A. Kong, "Covariance Structure of the Gibbs Sampler with Applications to the Comparisons of Estimators and Augmentation Schemes", Biometrika, Vol. 81, pp. 27-40, 1994.

[11] A. Logothetis and V. Krishnamurthy, "A Bayesian Expectation-Maximization Algorithm for Estimating Jump Markov Linear Systems", accepted for IEEE Trans. Signal Processing, 1998.

[12] J.M. Mendel, Maximum Likelihood Deconvolution, Springer, New York, 1990.

[13] D. Mitra, F. Romeo and A. Sangiovanni-Vincentelli, "Convergence and Finite-time Behavior of Simulated Annealing", Adv. Appl. Prob., Vol. 18, pp. 747-771, 1986.

[14] H.V. Poor and X. Wang, "Code-aided Interference Suppression for DS-CDMA Communications - Part I: Interference Suppression Capability", IEEE Trans. Communications, Vol. 45, No. 9, pp. 1101-1111, Sept. 1997.

[15] H.V. Poor and X. Wang, "Code-aided Interference Suppression for DS-CDMA Communications - Part II: Parallel Blind Adaptive Implementations", IEEE Trans. Communications, Vol. 45, No. 9, pp. 1112-1122, Sept. 1997.

[16] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", Proc. IEEE, Vol. 77, pp. 257-286, 1989.

[17] C.P. Robert, G. Celeux and J. Diebolt, "Bayesian Estimation of Hidden Markov Models: a Stochastic Implementation", Stat. Prob. Lett., Vol. 16, pp. 77-83, 1993.

[18] J.S. Rosenthal, "Convergence Rates of Markov Chains", SIAM Review, Vol. 37, pp. 387-415, 1995.

[19] A.N. Shiryaev, Probability, Springer-Verlag, New York, 1996.

[20] M.A. Tanner and W.H. Wong, "The Calculation of Posterior Distributions by Data Augmentation", J. Am. Statist. Assoc., Vol. 82, pp. 528-540, 1987.

[21] M.A. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions, Springer-Verlag, New York, 1993.

[22] L. Tierney, "Markov Chains for Exploring Posterior Distributions" (with discussion), Ann. Stat., Vol. 22, pp. 1701-1762, 1994.

[23] D.M. Titterington, A.F.M. Smith and U.E. Makov, Statistical Analysis of Finite Mixture Distributions, Wiley, New York, 1985.

[24] J.K. Tugnait, "Adaptive Estimation and Identification for Discrete Systems with Markov Jump Parameters", IEEE Trans. Automatic Control, Vol. 25, pp. 1054-1065, 1982.

[25] M. West and J.F. Harrison, Bayesian Forecasting and Dynamic Models, Springer Series in Statistics, 1996.

[26] P.J. Van Laarhoven and E.H.L. Aarts, Simulated Annealing: Theory and Applications, Reidel, Amsterdam, 1987.

[27] L.B. Milstein, "Interference Rejection Techniques in Spread Spectrum Communications", Proc. IEEE, Vol. 76, No. 6, pp. 657-671, 1988.

[28] L.A. Rusch and H.V. Poor, "Narrowband Interference Suppression in CDMA Spread Spectrum Communications", IEEE Trans. Communications, Vol. 42, No. 2/3/4, pp. 1969-1979, 1994.

Figure 7: True (×) and estimated (o) dynamic noise sequence. The estimates are given by $E\{\tilde{v}_t \mid y, \hat{r}\}$, where $\hat{r}$ is the MAP estimate computed using Algorithm 3.
