On the convergence of two sequential Monte Carlo methods for maximum a posteriori sequence estimation and stochastic global optimization

Joaquín Míguez · Dan Crisan · Petar M. Djurić
September 15, 2011
Abstract This paper addresses the problem of maximum a posteriori (MAP) sequence estimation in general state-space models. We consider two algorithms based on the sequential Monte Carlo (SMC) methodology (also known as particle filtering). We prove that they produce approximations of the MAP estimator and that they converge almost surely. We also derive a lower bound for the number of particles that are needed to achieve a given approximation accuracy. In the last part of the paper, we investigate the application of particle filtering and MAP estimation to the global optimization of a class of (possibly nonconvex and possibly non-differentiable) cost functions. In particular, we show how to convert the cost-minimization problem into one of MAP sequence estimation for a state-space model that is "matched" to the cost of interest. We provide examples that illustrate the application of the methodology as well as numerical results.

Keywords Sequential Monte Carlo · MAP sequence estimation · Convergence of particle filters · State space models · Global optimization

PACS 65C35 · 65C05 · 60K35 · 65K10 · 90C56 · 90C47

J. Míguez
Department of Signal Theory & Communications, Universidad Carlos III (Spain). E-mail: [email protected]

D. Crisan
Department of Mathematics, Imperial College London (UK). E-mail: [email protected]

P. M. Djurić
Department of Electrical & Computer Engineering, Stony Brook University (USA). E-mail: [email protected]
1 Introduction

State-space stochastic models are useful in representing a multitude of problems appearing in different scientific and engineering fields. They involve a random sequence $\{X_t\}_{t\ge 0}$, representing the unobserved "state" of the system, and a sequence of related observations $\{Y_t\}_{t\ge 1}$. The goal is to perform inference on the states using the observed data. When the model can be described by linear equations with Gaussian perturbations, the Kalman filter (Kalman 1960) provides an exact solution for the (Gaussian) probability distribution of $X_t$ given the observations $Y_1, \ldots, Y_t$. However, the analytical intractability of general state-space models (nonlinear and/or non-Gaussian) has motivated a great amount of work on approximation techniques, including variations on the Kalman filter (Anderson and Moore 1979; Julier and Uhlmann 2004) and the family of sequential Monte Carlo (SMC) methods, also known as particle filters (Gordon et al 1993; Liu and Chen 1998; Doucet et al 2000, 2001b; Ristic et al 2004). Particle filters generate discrete random measures that can be naturally used to approximate integrals with respect to the posterior distribution of $X_t$ given the data (including, e.g., its mean). A number of theoretical results ensure the convergence of such approximations -- see (Del Moral 2004; Gland and Oudjane 2004; Bain and Crisan 2008; Heine and Crisan 2008; Hu et al 2008) and the references therein.

In this paper, we investigate the use of SMC algorithms for the approximation of maximum a posteriori (MAP) sequence estimates, i.e., we study the problem of finding the sequence of state values $X_k = x_k$, $k = 0, 1, \ldots, t$, that presents the highest probability density conditional on the available data $Y_k = y_k$,
$k = 1, 2, \ldots, t$. We do not claim that MAP estimators are inherently superior to others. Possibly, the most widely-used Bayesian estimator is the mean of the posterior distribution of the state $X_k$, which minimizes the expected value of a quadratic cost function and, hence, is often referred to as the minimum mean square error (MMSE) estimator. In comparison, the MAP estimator results from the minimization of the posterior expectation of a 0-1 cost function (Robert 2007). The adoption of the latter criterion turns out to be natural in some applications. Consider, for example, a system that yields a multimodal posterior probability distribution. In such a case, the MMSE estimate of $X_k$ may lie in a low-density region and, therefore, may even be misleading about the actual state of the system, while the MAP criterion produces an estimate located in a high probability region of the state space. Such scenarios often appear in (multi-)target tracking problems (Bar-Shalom and Blair 2000).

A limited number of methods for MAP sequence estimation using particle filtering can be found in the literature. The straightforward approach consists in a linear search for the particle with the highest posterior density. In (Godsill et al 2001), it is suggested to use the collection of particles at times $t = 1, 2, \ldots$ to build a trellis representation of the state space. Then, it is possible to run a Viterbi algorithm (Forney 1973) to find the path in the trellis with the highest posterior density. This method has become standard, enjoying some applications in engineering; see, e.g., (Nyblom et al 2008). However, its computational complexity grows with $N^2$, where $N$ is the number of particles generated by the SMC algorithm, and it can be prohibitive in some applications. In (Klaas et al 2005), it is suggested to use tree search procedures to achieve a computationally more efficient implementation. More recently, in (Saha et al 2009), it has been proposed to perform marginal MAP estimation of the state $X_t$ by obtaining an approximation of the filtering density and then using standard optimization techniques to compute its maximum.

It is important to point out that MAP estimates cannot be computed as integrals with respect to the posterior distribution of $X_t$. As a consequence, the classical convergence results for particle filters in (Del Moral and Miclo 2000; Del Moral 2004; Bain and Crisan 2008) or (Gland and Oudjane 2004) do not guarantee the convergence of the approximate MAP estimates produced by the algorithm in (Godsill et al 2001). In this paper, we present a formal analysis of the convergence of the MAP estimates computed using SMC algorithms. In particular, we consider two methods based on the standard sequential importance
resampling (SIR) algorithm (Gordon et al 1993) (see also (Doucet et al 2001a)). The first one involves a direct search over the sample paths of $\{X_0, \ldots, X_t\}$ generated by the SIR algorithm and has a complexity that grows linearly with the number of sample paths. The second one is the algorithm of (Godsill et al 2001), which performs a trellis search over an extended grid of paths using the Viterbi algorithm and has a quadratic complexity. Both search procedures can be implemented sequentially and together with the SIR method. Our analysis includes:

a) The derivation of explicit convergence rates for the $L_p$ errors (with arbitrary integer $p$) in the approximation of integrals with respect to the joint posterior distribution of $X_0, \ldots, X_t$ given a fixed record of observations. A similar result was originally introduced in (Del Moral et al 2001) for a general class of interacting particle systems. However, the conditions assumed here are minimal and easily verified for the class of state-space models of interest in this paper.
b) A proof, based on the rates for the $L_{p \ge 4}$ errors, of the almost sure convergence of the MAP sequence estimates produced by the SIR algorithm, both with the simple direct search over the sample paths and using the Viterbi algorithm on a trellis grid.
c) Lower bounds on the number of particles (sample paths) that are needed to achieve a prescribed accuracy with the MAP estimator.

Another contribution of this work is to show how the MAP sequence estimation algorithms can be used as tools for the global optimization of a broad class of objective functions. We identify a family of cost functions that admit a certain recursive decomposition and describe how it is possible to design state-space models that are "matched" to the cost, meaning that (a) the unknowns of the cost function are assimilated to the random sequence of states in the model and (b) the maxima of the posterior probability density function (pdf) of the state-space model coincide with the minima of the cost function. With this reformulation of the problem, SMC algorithms can be used to produce, in a natural way, a random grid in the space of the unknowns that is dense in the regions where the cost is low and sparse elsewhere. We illustrate this approach by way of two examples, including a typical global optimization problem, the Neumaier 3 problem (Ali et al 2005), and the design of cross-talk cancellation acoustic filters by a minimax optimization criterion (Rao et al 2007). We present computer simulation results for the Neumaier 3 problem that illustrate the advantage of using random, instead of deterministic, grids for the search of solutions.
We note that our approach to global optimization by way of SMC methods bears similarities to the work in (Najim et al 2006). However, we do not restrict ourselves to optimal control applications and consider a broader class of minimization problems instead. Indeed, the objective functions studied in (Najim et al 2006) are instances of the family of additive costs discussed in Section 7.2 of this paper. A comparison of our analysis of the asymptotic convergence of the SMC algorithms with that presented in (Najim et al 2006) is given in Remark 3.

The rest of this paper is organized as follows. A brief survey of the basic notation is presented in Section 2. The problem of MAP sequence estimation for state-space models is formally stated in Section 3. In Section 4, we explicitly describe the algorithms to be analyzed. Our main results on the convergence of the SIR algorithm and the MAP estimation algorithms designed around it are introduced in Section 5. An application example in the context of target tracking is given in Section 6. The application of the MAP sequence estimation tools for the global minimization of cost functions is introduced in Section 7. The paper concludes with a brief summary in Section 8.
2 Notation

Random variables (possibly vector valued) and their realizations are represented by the same upper- and lower-case letter, e.g., the random variable $X$ and its realization $X = x$. Random sequences are denoted as $\{X_t\}_{t \in \mathbb{N}}$. We use $\mathbb{R}^d$, with integer $d \ge 1$, to denote the set of $d$-dimensional vectors with real entries. The Borel $\sigma$-algebra in $\mathbb{R}^d$ is indicated as $\mathcal{B}(\mathbb{R}^d)$. The set of bounded real functions over $\mathbb{R}^d$ is denoted as $B(\mathbb{R}^d)$.

Pdf's are indicated by the letter $\pi$. This is an argument-wise notation, hence for the random variables $X$ and $Y$, $\pi(x)$ signifies the density of $X$, possibly different from $\pi(y)$, which represents the pdf of $Y$. The integral of a function $f(x)$ with respect to a measure with density $\pi(x)$ is denoted by the shorthand $(f, \pi) \triangleq \int f(x) \pi(x) dx$. For a discrete-time sequence $x_0, x_1, \ldots, x_t, \ldots$, the shorthand $x_{t_1:t_2} = \{x_{t_1}, x_{t_1+1}, \ldots, x_{t_2}\}$ denotes the subsequence from time $t_1$ up to time $t_2$.
3 Problem statement

Let $\{X_t\}_{t \ge 0}$ and $\{Y_t\}_{t \ge 1}$ be discrete-time stochastic processes that take values in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$, respectively. The common probability measure for the pair $(X_t, Y_t)$ is denoted as $P$ and assumed to be absolutely continuous with respect to the Lebesgue measure. We refer to $\{X_t\}_{t \ge 0}$ as the "state" or "signal" process, while $\{Y_t\}_{t \ge 1}$ is termed the "observation" or "measurement" process.

For $t = 0$, the random variable $X_0$ has a pdf with respect to the Lebesgue measure in $\mathbb{R}^{d_x}$, that we denote as $\pi(x_0)$, and, for $t > 0$, the process evolves according to the conditional probability law

$$P\{X_t \in A \mid x_{0:t-1}\} = \int_A \pi(x_t|x_{0:t-1}) \, dx_t,$$

where $\pi(x_t|x_{0:t-1})$ denotes the pdf, with respect to the Lebesgue measure, of $X_t$ given $X_{0:t-1} = x_{0:t-1}$ and $A$ is any Borel subset of $\mathbb{R}^{d_x}$, i.e., $A \in \mathcal{B}(\mathbb{R}^{d_x})$. The observation process, $\{Y_t\}_{t \ge 1}$, follows the conditional probability law

$$P\{Y_t \in A' \mid x_{0:t}, y_{1:t-1}\} = \int_{A'} \pi(y_t|x_{0:t}, y_{1:t-1}) \, dy_t,$$

where $\pi(y_t|x_{0:t}, y_{1:t-1})$ denotes the conditional pdf of $Y_t$ given $X_{0:t} = x_{0:t}$ and $Y_{1:t-1} = y_{1:t-1}$, again with respect to the Lebesgue measure, and $A' \in \mathcal{B}(\mathbb{R}^{d_y})$.

We refer to the densities $\pi(x_0)$ and $\pi(x_t|x_{0:t-1})$ as the prior pdf and the transition pdf of the state process, respectively, while for fixed observations $Y_{1:t} = y_{1:t}$, the function $g_t(x_{0:t}) \triangleq \pi(y_t|x_{0:t}, y_{1:t-1})$ is referred to as the likelihood of the signal path $X_{0:t} = x_{0:t}$ at time $t$. Together, the densities

$$\pi(x_0), \quad \pi(x_t|x_{0:t-1}) \quad \text{and} \quad \pi(y_t|x_{0:t}, y_{1:t-1}) \qquad (1)$$

determine a random state-space model.

The a posteriori pdf of a signal path $X_{0:t} = x_{0:t}$ given a sequence of observations $Y_{1:t} = y_{1:t}$ is denoted as $\pi(x_{0:t}|y_{1:t})$. It can be easily derived from the functions in Eq. (1) using Bayes' theorem, namely

$$\pi(x_{0:t}|y_{1:t}) \propto \pi(y_t|x_{0:t}, y_{1:t-1}) \, \pi(x_t|x_{0:t-1}) \, \pi(x_{0:t-1}|y_{1:t-1}) = \pi(x_0) \, \pi(y_1|x_{0:1}) \, \pi(x_1|x_0) \prod_{k=2}^{t} \pi(y_k|x_{0:k}, y_{1:k-1}) \, \pi(x_k|x_{0:k-1}). \qquad (2)$$

In this paper, we address the problem of finding the maxima of the a posteriori pdf. In particular, let $T$ be an arbitrarily large but finite time horizon and let the observations $Y_{1:T} = y_{1:T}$ be fixed. We seek the sequences of length $T+1$ in the state space that maximize the function $\pi(x_{0:T}|y_{1:T})$, i.e., we aim at finding the solution set $\mathcal{X}_T^\pi$ defined as

$$\mathcal{X}_T^\pi = \arg\max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi(x_{0:T}|y_{1:T}). \qquad (3)$$
Note that every element $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$ is a MAP estimate of the sequence $X_{0:T}$ given the observations $Y_{1:T} = y_{1:T}$. Since the observation sequence is kept fixed, in the sequel we use the shorthand $\pi_{0:t}(x_{0:t}) = \pi(x_{0:t}|y_{1:t})$ for the posterior density at any time $t = 0, \ldots, T$.

4 Algorithms

The optimization problem (3) is analytically intractable for most models of practical interest (linear Gaussian systems being one exception) and we need to resort to numerical techniques in order to find approximate solutions. In this section, we first describe how to employ the standard sequential importance resampling (SIR) algorithm (Gordon et al 1993; Doucet et al 2000) in order to obtain two random-grid approximations, with different coarseness, of the signal-path space $(\mathbb{R}^{d_x})^{T+1}$. Then, it is possible to either directly choose the node of a "coarse" grid with the highest posterior density or run the Viterbi algorithm (Forney 1973) on a "fine" (trellis shaped) grid, as suggested in (Godsill et al 2001).

4.1 Discretization of the state-space: the sequential importance resampling algorithm

We aim at numerically computing the elements $\hat{x}_{0:T}$ of the solution set $\mathcal{X}_T^\pi$ in problem (3). Even if the posterior pdf $\pi_{0:T}(x_{0:T})$ can be evaluated up to a proportionality constant using the factorization of Eq. (2), this is, in general, a difficult optimization problem in a high dimensional space, possibly with multiple global and/or local extrema. In this paper, we propose to tackle these difficulties by using an SMC method in order to obtain a suitable discretization of the path space $(\mathbb{R}^{d_x})^{T+1}$. Different search methods can subsequently be applied to find the point of the discretized space with the highest density.

SMC algorithms aim at recursively computing approximations of the sequence of posterior probability laws

$$P\{X_{0:t} \in A \mid y_{1:t}\} = \int_A \pi_{0:t}(x_{0:t}) \, dx_{0:t}, \qquad (4)$$

$t = 1, \ldots, T$, where $A \in \mathcal{B}\big((\mathbb{R}^{d_x})^{t+1}\big)$ is a Borel set. Specifically, at each time $t$, an SMC algorithm generates random paths $\Omega_{0:t}^N = \{x_{0:t}^{(n)}\}_{n=1,\ldots,N}$ such that integrals with respect to the pdf $\pi_{0:t}(x_{0:t})$ can be approximated by summations (Crisan and Doucet 2000), i.e., $\int f(x_{0:t}) \pi_{0:t}(x_{0:t}) dx_{0:t} \approx \frac{1}{N} \sum_{n=1}^{N} f(x_{0:t}^{(n)})$, where $f : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}$ is a real function defined in the path space and integrable with respect to the posterior probability law.

Although various possibilities exist (Doucet et al 2001b), in this paper we consider the standard sequential importance sampling algorithm with resampling at every time step (Doucet et al 2000), also known as the bootstrap filter (Gordon et al 1993). We refer to this algorithm as SIR throughout the paper. The algorithm is based on the recursive decomposition of $\pi_{0:t}(x_{0:t})$ given by Eq. (2) and the computational procedure is simple.

– Initialization. At time $t = 0$, we draw $N$ independent and identically distributed (i.i.d.) samples from the prior probability distribution with density $\pi(x_0)$. Let us denote this initial sample as
$\Omega_0^N = \{x_0^{(n)}\}_{n=1,\ldots,N}$.
– Recursive step. Assume that a random sample $\Omega_{0:t-1}^N = \{x_{0:t-1}^{(n)}\}_{n=1,\ldots,N}$ has been generated up to time $t-1$. Then, at time $t$, we take the following steps.
  i. Draw $N$ new samples in the state space $\mathbb{R}^{d_x}$ from the probability distributions with densities $\pi(x_t|x_{0:t-1}^{(n)})$, $n = 1, \ldots, N$, and denote them as $\{\bar{x}_t^{(n)}\}_{n=1,\ldots,N}$. Set $\bar{x}_{0:t}^{(n)} = \{x_{0:t-1}^{(n)}, \bar{x}_t^{(n)}\}$.
  ii. Weight each sample according to its likelihood, i.e., compute importance weights
  $$\tilde{w}_t^{(n)} = \pi(y_t|\bar{x}_{0:t}^{(n)}, y_{1:t-1})$$
  and normalize them to obtain
  $$w_t^{(n)} = \frac{\tilde{w}_t^{(n)}}{\sum_{k=1}^{N} \tilde{w}_t^{(k)}}.$$
  iii. Resampling: for $n = 1, \ldots, N$, set $x_{0:t}^{(n)} = \bar{x}_{0:t}^{(k)}$ with probability $w_t^{(k)}$, $k \in \{1, \ldots, N\}$. Reset the weights to $w_t^{(n)} = 1/N$ for $n = 1, \ldots, N$.
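For illustration purposes only, the recursion above can be sketched in a few lines of Python. The code below is a minimal implementation for a model with a Markov transition and a pointwise likelihood; the names sample_prior, sample_transition and likelihood are placeholders of ours (not part of any library) standing for the densities in (1) of whatever model is at hand.

import numpy as np

def sir_filter(y, N, sample_prior, sample_transition, likelihood, seed=0):
    # Bootstrap filter (SIR) with multinomial resampling at every step.
    # Returns the random grid Omega_{0:T}^N as an array of shape (N, T+1, d_x).
    rng = np.random.default_rng(seed)
    paths = [sample_prior(N)]                         # Omega_0^N, shape (N, d_x)
    for t in range(len(y)):
        x_bar = sample_transition(paths[-1], rng)     # step i: propagate particles
        w = likelihood(y[t], x_bar)                   # step ii: unnormalized weights
        w = w / w.sum()
        idx = rng.choice(N, size=N, p=w)              # step iii: multinomial resampling
        paths = [p[idx] for p in paths]               # resample the whole ancestries
        paths.append(x_bar[idx])
    return np.stack(paths, axis=1)

For instance, in the tracking example of Section 6, sample_transition would draw from the Gaussian transition (21) and likelihood would evaluate the sensor model (22).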
The multinomial resampling procedure in step iii. can be substituted by other techniques. A number of alternative resampling methods and their associated computational complexity are discussed in (Carpenter et al 1999), while (Douc et al 2005) provides analytical results regarding the effect of multinomial, residual, stratified and systematic resampling techniques on the variance of Monte Carlo estimates. In (Crisan 2001), a tree-based branching algorithm is described that minimizes the variance of the (random) number of offspring in the resampling step. In practice, the use of a low-variance resampling procedure should result in a (not necessarily large) improvement in the accuracy of the estimators computed using the random samples generated by the SIR algorithm.

We shall use the random grid $\Omega_{0:T}^N = \{x_{0:T}^{(n)}\}_{n=1,\ldots,N}$ as a discrete approximation of the path space $(\mathbb{R}^{d_x})^{T+1}$ where the random sequence $X_{0:T}$ takes its values. Note that the SIR algorithm also yields "marginal grids" for each time $t$, denoted $\Omega_t^N = \{x_t^{(n)}\}_{n=1,\ldots,N}$, $t = 0, 1, \ldots, T$. The points of the grid $\Omega_{0:T}^N$ (often also the points of every $\Omega_t^N$) are called particles and the SMC methods that generate them are referred to as particle filters (Doucet et al 2000) or particle smoothers (Godsill et al 2004) depending on whether one is interested in the filtering pdf's $\pi(x_t|y_{1:t})$, $t = 1, 2, \ldots$, or the smoothing pdf's $\pi_{0:t}(x_{0:t})$, $t = 1, 2, \ldots$, respectively.

Using the particles in $\Omega_{0:T}^N$, it is straightforward to build a random measure $\pi_{0:t}^N(dx_{0:t}) = \frac{1}{N} \sum_{n=1}^{N} \delta_n(dx_{0:t})$, where $\delta_n$ is the unit delta measure centered at $x_{0:t}^{(n)}$, and use it to approximate integrals of the form

$$(f, \pi_{0:t}) = \int f(x_{0:t}) \pi_{0:t}(x_{0:t}) \, dx_{0:t}, \qquad (5)$$

where $f : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}$ is a real function over the space of the paths up to time $t$. Indeed, we write

$$(f, \pi_{0:t}^N) = \int f(x_{0:t}) \pi_{0:t}^N(dx_{0:t}) = \frac{1}{N} \sum_{n=1}^{N} f(x_{0:t}^{(n)})$$

for the particle approximation of $(f, \pi_{0:t})$. If the function $f$ is, for example, bounded, then $(f, \pi_{0:t}^N)$ is a good approximation of $(f, \pi_{0:t})$ for $N$ sufficiently large (Crisan and Doucet 2000). We will take advantage of this result for the analysis in Section 5.

The standard particle filter described in this section is an instance of the general class of sequential importance sampling (SIS) methods in which (a) the particles are drawn from the transition pdf $\pi(x_t|x_{0:t-1})$ and (b) resampling is carried out at every time step. There are various versions of this class of algorithms that can be used to obtain the random grid $\Omega_{0:T}^N$, however. For example, it is possible to use an importance function different from $\pi(x_t|x_{0:t-1})$ in order to generate particles (Liu and Chen 1998) or to perform resampling every $m \ge 1$ time steps, where $m$ can be deterministic or random (Doucet et al 2000). Indeed, not only the standard SIR technique but also most of the SIS-like methods described in the literature can be easily plugged into the MAP estimation algorithms that we present below.

4.2 Sequence estimation algorithms

We propose to use the random grids generated by the particle filtering algorithm to search for
approximate maximizers of the pdf $\pi_{0:T}(x_{0:T})$. In particular, we investigate two algorithms. The first one is a straightforward extension of the SIR procedure, while the second one combines it with the Viterbi algorithm as suggested in (Godsill et al 2001). We will subsequently refer to them as Algorithm 1 and Algorithm 2, respectively.

4.2.1 Algorithm 1

We simply search for the element of $\Omega_{0:T}^N$ with the highest posterior density. For this purpose, note that we can easily extend the SIR algorithm described in Section 4.1 to recursively compute the posterior density of each particle up to a proportionality constant. To be specific, we need to perform the following additional computations.

– At the initialization step, let $a_0^{(n)} = \log \pi(x_0^{(n)})$ for $n = 1, \ldots, N$.
– At the recursive step, modify steps ii. and iii. as follows.
  ii. Weight each sample according to its likelihood, i.e., compute the importance weights $\tilde{w}_t^{(n)} = \pi(y_t|\bar{x}_{0:t}^{(n)}, y_{1:t-1})$ and normalize them to obtain $w_t^{(n)} = \tilde{w}_t^{(n)} / \sum_{k=1}^{N} \tilde{w}_t^{(k)}$. Compute
  $$\bar{a}_t^{(n)} = a_{t-1}^{(n)} + \log \pi(y_t|\bar{x}_{0:t}^{(n)}, y_{1:t-1}) + \log \pi(\bar{x}_t^{(n)}|x_{0:t-1}^{(n)}).$$
  iii. Resampling: for $n = 1, \ldots, N$, set $x_{0:t}^{(n)} = \bar{x}_{0:t}^{(k)}$ and $a_t^{(n)} = \bar{a}_t^{(k)}$ with probability $w_t^{(k)}$, $k \in \{1, \ldots, N\}$. Reset the weights to $w_t^{(n)} = 1/N$ for $n = 1, \ldots, N$.

Finally, we select

$$\hat{x}_{0:T}^N = x_{0:T}^{(n_o)}, \quad \text{where} \quad n_o = \arg\max_{n \in \{1,\ldots,N\}} a_T^{(n)}, \qquad (6)$$

as the approximate maximizer of $\pi_{0:T}(x_{0:T})$.
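A hedged sketch of these additional computations, extending the SIR code of Section 4.1 (the names log_prior, log_transition and log_likelihood are, again, placeholders of ours for the log-densities of the model at hand):

import numpy as np

def sir_map(y, N, sample_prior, log_prior, sample_transition,
            log_transition, log_likelihood, seed=0):
    # Algorithm 1: SIR with a running unnormalized log-posterior a_t^(n) per path.
    rng = np.random.default_rng(seed)
    paths = [sample_prior(N)]
    a = log_prior(paths[0])                           # a_0^(n) = log pi(x_0^(n))
    for t in range(len(y)):
        x_bar = sample_transition(paths[-1], rng)
        logw = log_likelihood(y[t], x_bar)
        a_bar = a + logw + log_transition(x_bar, paths[-1])   # modified step ii.
        w = np.exp(logw - logw.max())
        w = w / w.sum()
        idx = rng.choice(N, size=N, p=w)              # modified step iii: resample
        paths = [p[idx] for p in paths]               # paths and log-densities with
        paths.append(x_bar[idx])                      # the same ancestor indices
        a = a_bar[idx]
    n_o = int(np.argmax(a))                           # selection rule, Eq. (6)
    return np.stack([p[n_o] for p in paths]), float(a[n_o])

Note that the log-densities are resampled with the same ancestor indices as the paths, so that $a_T^{(n)}$ remains the (unnormalized) log-posterior of the surviving path $x_{0:T}^{(n)}$, and the final search is a single pass of complexity $O(N)$.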
4.2.2 Algorithm 2

Let us briefly describe the MAP estimation algorithm of (Godsill et al 2001). Instead of $\Omega_{0:T}^N$, we consider now a finer discretization of $(\mathbb{R}^{d_x})^{T+1}$, namely the product space

$$\bar{\Omega}_{0:T}^N = \Omega_0^N \times \bar{\Omega}_1^N \times \cdots \times \bar{\Omega}_T^N,$$

where $\Omega_0^N = \{x_0^{(n)}\}_{n=1,\ldots,N}$ and $\bar{\Omega}_t^N = \{\bar{x}_t^{(n)}\}_{n=1,\ldots,N}$ for $t = 1, 2, \ldots, T$. Specifically, note that $\bar{\Omega}_t^N$ is constructed from the particles available at step ii. of the SIR algorithm, i.e., before resampling, to avoid duplicate samples.

We assume for clarity¹ that $\pi(x_t|x_{0:t-1}) = \pi(x_t|x_{t-1})$ and $\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_t)$. Given the random grids $\Omega_0^N$ and $\bar{\Omega}_t^N$, $t = 1, \ldots, T$, the Viterbi algorithm outputs a sequence $(x_0^{(n_0)}, \bar{x}_1^{(n_1)}, \ldots, \bar{x}_T^{(n_T)}) \in \bar{\Omega}_{0:T}^N$, $n_i \in \{1, \ldots, N\}$, with the highest posterior density, i.e., it solves the discrete optimization problem

$$\bar{x}_{0:T}^N \in \arg\max_{\bar{x}_{0:T} \in \bar{\Omega}_{0:T}^N} \pi_{0:T}(\bar{x}_{0:T}) \qquad (8)$$

exactly. The procedure is described below.

– Initialization. For $n = 1, \ldots, N$, let $a_0^{(n)} = \log(\pi(x_0^{(n)}))$.
– Recursive step. At time $t > 0$, the random grids $\bar{\Omega}_{t-1}^N$ and $\bar{\Omega}_t^N$, as well as $\{a_{t-1}^{(n)}\}_{n=1,\ldots,N}$, are available. Then, for $n = 1, \ldots, N$, compute
$$a_t^{(n)} = \log(\pi(y_t|\bar{x}_t^{(n)})) + \max_{k \in \{1,\ldots,N\}} \left[ a_{t-1}^{(k)} + \log(\pi(\bar{x}_t^{(n)}|\bar{x}_{t-1}^{(k)})) \right],$$
$$\ell_t^{(n)} = \arg\max_{k \in \{1,\ldots,N\}} \left[ a_{t-1}^{(k)} + \log(\pi(\bar{x}_t^{(n)}|\bar{x}_{t-1}^{(k)})) \right]. \qquad (9)$$
– Backtracking. Computation of an optimal sequence.
  i. At time $T$, let $j_T = \arg\max_{k \in \{1,\ldots,N\}} a_T^{(k)}$ and assign $\bar{x}_T^N = \bar{x}_T^{(j_T)}$.
  ii. For $t = T-1, T-2, \ldots, 0$, let $j_t = \ell_{t+1}^{(j_{t+1})}$ and assign $\bar{x}_t^N = \bar{x}_t^{(j_t)}$.

The Viterbi recursion can be run sequentially, together with the SIR algorithm described in Section 4.1. Specifically, we can take a complete recursive step of the Viterbi algorithm right after step ii. of the SIR method (i.e., once the random marginal grid $\bar{\Omega}_t^N$ is obtained). The combination of the SIR and Viterbi methods to compute $\bar{x}_{0:T}^N$ will be termed Algorithm 2 in the sequel.

Compared to Algorithm 1, the application of the Viterbi method in Algorithm 2 adds a considerable (extra) computational burden. Specifically, it is needed to calculate $N^2$ branch metrics (associated to the indices $\ell_t^{(n)}$, $n = 1, \ldots, N$, $t = 1, \ldots, T$, in Eq. (9)) per time step. As a consequence, the computational complexity of the method is $O(N^2 T)$.
¹ The algorithm can also be applied when

$$\pi(x_t|x_{0:t-1}) = \pi(x_t|x_{t-k:t-1}) \qquad (7)$$

and $\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_{t-k:t}, y_{1:t-1})$ for some fixed $k > 0$, but the computational complexity of the Viterbi algorithm grows exponentially with $k$ and the notation also becomes more involved. Note that if (7) holds, then the state-space model can be rewritten in a fully equivalent first-order Markov form with the extended state process $Z_t = (X_{t-k+1}, \ldots, X_t)$, in such a way that $\pi(z_t|z_{0:t-1}) = \pi(z_t|z_{t-1})$ and $\pi(y_t|z_{0:t}, y_{1:t-1}) = \pi(y_t|z_t, y_{1:t-1})$.
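The trellis search of Eqs. (8)–(9) can also be sketched compactly (our own code, under the same Markov assumptions): grids[t] is the $N \times d_x$ array of particles $\bar{\Omega}_t^N$ (with grids[0] $= \Omega_0^N$), and log_transition is a placeholder returning the $N \times N$ matrix of branch metrics.

import numpy as np

def viterbi_map(grids, y, log_prior, log_transition, log_likelihood):
    # Algorithm 2 search: exact maximization of the posterior over the product
    # grid Omega_0^N x bar-Omega_1^N x ... x bar-Omega_T^N; complexity O(N^2 T).
    a = log_prior(grids[0])                           # a_0^(n)
    back = []                                         # back-pointers l_t^(n), Eq. (9)
    for t in range(1, len(grids)):
        # trans[k, n] = log pi( grids[t][n] | grids[t-1][k] ), N^2 branch metrics
        trans = log_transition(grids[t], grids[t - 1])
        scores = a[:, None] + trans
        back.append(np.argmax(scores, axis=0))
        a = log_likelihood(y[t - 1], grids[t]) + np.max(scores, axis=0)  # y[t-1] is y_t
    j = int(np.argmax(a))                             # backtracking, step i.
    path = [grids[-1][j]]
    for t in range(len(grids) - 2, -1, -1):           # backtracking, step ii.
        j = int(back[t][j])
        path.append(grids[t][j])
    return np.stack(path[::-1])                       # (x_0, bar-x_1, ..., bar-x_T)

The $N \times N$ matrix of branch metrics built at each step is precisely the source of the quadratic cost mentioned above.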
5 Analysis

5.1 Outline

We now establish the almost sure convergence of the two MAP sequence estimation algorithms described in Section 4.2. In the results that follow, we assume that:

– The sequence $Y_{1:T} = y_{1:T}$ is fixed (not random).
– The likelihoods $g_t(x_{0:t}) = \pi(y_t|x_{0:t}, y_{1:t-1})$ are bounded functions of $x_{0:t}$ for every $t = 1, 2, \ldots, T$.
– Let $p_{0:t}(x_{0:t}) = \pi(x_{0:t}|y_{1:t-1})$ be the posterior pdf of the sequence $X_{0:t}$ given the observations $Y_{1:t-1} = y_{1:t-1}$. The integral of the likelihood $g_t(x_{0:t})$ with respect to the measure $p_{0:t}(x_{0:t}) dx_{0:t}$ is positive, i.e., $(g_t, p_{0:t}) > 0$ for $1 \le t \le T$.
– The set $\mathcal{X}_T^\pi$ is not empty and the posterior pdf $\pi_{0:T}(x_{0:T})$ is continuous at every point $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$.

The first three assumptions are applied to show that the SIR algorithm converges in an adequate way, while the fourth one is used to show that $\hat{x}_{0:T}^N \to \hat{x}_{0:T}$. Specifically, note that continuity is only assumed at the global maxima of $\pi_{0:T}(x_{0:T})$ and not necessarily over its whole support.

Obviously, the convergence of the MAP estimation Algorithms 1 and 2 relies on the convergence of the SIR algorithm. To be precise, given a real bounded function $f \in B\big((\mathbb{R}^{d_x})^{T+1}\big)$, our analysis requires the convergence of $(f, \pi_{0:T}^N)$ toward the actual integral $(f, \pi_{0:T})$ in the $L_p$ norm for some $p \ge 4$. Similar, but not directly applicable, results exist in the literature, e.g., convergence rates for $(f, \pi_{0:T}^N) \to (f, \pi_{0:T})$ in the $L_2$ norm can be found in (Crisan and Doucet 2000), while the convergence of $(f, \pi_t^N) \to (f, \pi_t)$ (where $\pi_t(x_t) = \pi(x_t|y_{1:t})$ is the filtering density and $\pi_t^N(dx_t) = \frac{1}{N} \sum_{n=1}^{N} \delta_n(dx_t)$ its approximation) in terms of generic $L_p$ errors was established in (Del Moral and Miclo 2000) under additional constraints.

In Lemma 1 below, we establish the rate of convergence of $(f, \pi_{0:T}^N) \to (f, \pi_{0:T})$ in a generic $L_p$ norm, $p \ge 1$. This is required for the subsequent analysis of the approximate MAP estimates $\hat{x}_{0:T}^N$ and $\bar{x}_{0:T}^N$. Specifically, in Theorem 1, we use Lemma 1 to show that Algorithm 1 converges almost surely (a.s.). More precisely, we prove that $\pi_{0:T}(\hat{x}_{0:T}^N) \to \pi_{0:T}(\hat{x}_{0:T})$, with $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$. The convergence of Algorithm 2 follows immediately (see Corollary 1). Finally, in Theorem 2, we establish a lower bound on the number of particles $N$ needed to achieve a certain accuracy in the approximation of the MAP estimates.
5.2 Asymptotic convergence results

In the sequel, $\|\xi\|_p$ denotes the $L_p$ norm of the random variable $\xi$, defined as $\|\xi\|_p = E[|\xi|^p]^{1/p}$, where $E[\cdot]$ denotes mathematical expectation, and

$$\|f\|_\infty = \sup_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} |f(x_{0:T})| < \infty$$

denotes the supremum norm of the real bounded function $f \in B\big((\mathbb{R}^{d_x})^{T+1}\big)$.

Lemma 1 For every $f \in B\big((\mathbb{R}^{d_x})^{T+1}\big)$ there exists a constant $c = c(p, T, y_{1:T})$, independent of $N$, such that

$$\big\| (f, \pi_{0:T}^N) - (f, \pi_{0:T}) \big\|_p \le \frac{c \|f\|_\infty}{\sqrt{N}}, \quad \text{for all } N \ge 1.$$

See the Appendix for a proof.

Remark 1 Lemma 1 is similar to Theorem 3.1 in (Del Moral et al 2001). The latter result is derived for a general class of interacting particle systems that assume a certain regularity condition (condition (K) on page 6). This condition is satisfied for processes defined on a compact (or finite) state space but, in general, it is quite hard to check for processes defined on $\mathbb{R}^{d_x}$. By contrast, the proof of Lemma 1 does not require this additional condition.

Remark 2 In such generality, Lemma 1 does not hold true for unbounded functions $f$. However, under additional constraints and variations of the main algorithm, one can obtain similar rates of convergence for unbounded functions (see, for example, (Heine and Crisan 2008) and (Hu et al 2008)). Convergence results for unbounded functions are desirable as, in particular, they imply the convergence of state estimators such as the conditional mean (see, for example, Corollary 4.1 in (Heine and Crisan 2008)). The paper (Douc and Moulines 2008) contains an alternative attempt to resolve this issue. Note, however, that the convergence results in the latter references (Hu et al 2008; Heine and Crisan 2008; Douc and Moulines 2008) refer to the approximation of integrals with respect to the filtering measure $\pi(x_t|y_{1:t}) dx_t$, while Lemma 1 refers to the approximation of integrals with respect to the measure $\pi_{0:t}(x_{0:t}) dx_{0:t}$.

Theorem 1 Let $\hat{x}_{0:T}^N$ be the output sequence of Algorithm 1. Then, almost surely,

$$\lim_{N \to \infty} \pi_{0:T}(\hat{x}_{0:T}^N) = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}).$$

Moreover, any convergent subsequence of $\hat{x}_{0:T}^N$ has a limit $\hat{x}_{0:T}$ that belongs to the solution set $\mathcal{X}_T^\pi$.

Proof: Let $f : (\mathbb{R}^{d_x})^{T+1} \to \mathbb{R}$ be a bounded real function of the path $x_{0:T}$. From Lemma 1, we obtain

$$\big\| (f, \pi_{0:T}^N) - (f, \pi_{0:T}) \big\|_p \le \frac{c \|f\|_\infty}{\sqrt{N}}, \qquad (10)$$

where $c$ is a constant independent of $N$. Choose $p \ge 4$ and an arbitrary constant $0 < \varepsilon < 1$, and construct the positive random variable

$$\Theta_T^{p,\varepsilon} = \sum_{N=1}^{\infty} N^{\frac{p}{2} - 1 - \varepsilon} \big| (f, \pi_{0:T}^N) - (f, \pi_{0:T}) \big|^p.$$

From Fatou's lemma and Lemma 1,

$$E[\Theta_T^{p,\varepsilon}] \le \sum_{N=1}^{\infty} N^{\frac{p}{2} - 1 - \varepsilon} \frac{c^p \|f\|_\infty^p}{N^{\frac{p}{2}}} = c^p \|f\|_\infty^p \sum_{N=1}^{\infty} N^{-1-\varepsilon} < \infty,$$

hence $\Theta_T^{p,\varepsilon}$ is a.s. finite. Obviously, $N^{\frac{p}{2} - 1 - \varepsilon} |(f, \pi_{0:T}^N) - (f, \pi_{0:T})|^p \le \Theta_T^{p,\varepsilon}$, and solving for $|(f, \pi_{0:T}^N) - (f, \pi_{0:T})|$, with $p \ge 4$, yields

$$\big| (f, \pi_{0:T}^N) - (f, \pi_{0:T}) \big| \le \frac{\Theta_T^\delta}{N^{\frac{1}{2} - \delta}}, \qquad (11)$$

where

$$\delta = \frac{1 + \varepsilon}{p} \qquad (12)$$

and $\Theta_T^\delta = (\Theta_T^{p,\varepsilon})^{\frac{1}{p}}$. Note that, since $p \ge 4$ and $0 < \varepsilon < 1$, it turns out that $0 < \delta < \frac{1}{2}$. As a consequence of Eq. (11), the integral $(f, \pi_{0:T}^N)$ converges with probability 1, i.e.,

$$\lim_{N \to \infty} \big| (f, \pi_{0:T}^N) - (f, \pi_{0:T}) \big| = 0 \quad \text{a.s.} \qquad (13)$$

Now, choose any MAP estimate $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$ and consider the open ball

$$B_k(\hat{x}_{0:T}) = \left\{ z \in (\mathbb{R}^{d_x})^{T+1} : \|z - \hat{x}_{0:T}\| < \frac{1}{k} \right\},$$

where $k$ is a positive integer and $\|\cdot\|$ denotes the norm of the Euclidean space $(\mathbb{R}^{d_x})^{T+1}$. The indicator function

$$I_{B_k(\hat{x}_{0:T})}(x_{0:T}) = \begin{cases} 1 & \text{if } x_{0:T} \in B_k(\hat{x}_{0:T}) \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

is real and bounded, hence, from Eq. (13),

$$\lim_{N \to \infty} \big| \big( I_{B_k(\hat{x}_{0:T})}, \pi_{0:T}^N \big) - \big( I_{B_k(\hat{x}_{0:T})}, \pi_{0:T} \big) \big| = 0 \quad \text{a.s.} \qquad (15)$$

Since the posterior pdf $\pi_{0:T}(x_{0:T})$ is continuous at $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$ and $\pi_{0:T}(\hat{x}_{0:T}) > 0$, it follows that $\pi_{0:T}(x_{0:T})$ is positive on an open ball around $\hat{x}_{0:T}$. In particular, the value $A_k = (I_{B_k(\hat{x}_{0:T})}, \pi_{0:T})$ is strictly positive. Also note that the particle approximation of $A_k$ has the form

$$A_k^N = \big( I_{B_k(\hat{x}_{0:T})}, \pi_{0:T}^N \big) = \frac{m(N,k)}{N},$$

where $m(N,k)$ denotes the number of elements of the discretized path space $\Omega_{0:T}^N$ that belong to the ball $B_k(\hat{x}_{0:T})$ (equivalently, $m(N,k) = |\Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})|$ is the number of points in the discrete intersection set $\Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})$). Since $\lim_{N \to \infty} |A_k^N - A_k| = 0$ a.s., it follows that, for any $k \ge 1$,

$$\lim_{N \to \infty} m(N,k) > 0 \quad \text{a.s.} \qquad (16)$$

The limit (16) implies that, for any $k \ge 1$, the intersection $\Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})$ is a.s. nonempty when $N$ is sufficiently large. Therefore, let us choose a point $x_{0:T}^{N,k} \in \Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})$. Obviously, $\pi_{0:T}(x_{0:T}^{N,k}) \le \pi_{0:T}(\hat{x}_{0:T})$ but, given the selection rule (6), we also have that $\pi_{0:T}(x_{0:T}^{N,k}) \le \pi_{0:T}(\hat{x}_{0:T}^N)$. Therefore,

$$\pi_{0:T}(x_{0:T}^{N,k}) \le \pi_{0:T}(\hat{x}_{0:T}^N) \le \pi_{0:T}(\hat{x}_{0:T}).$$

Since $\pi_{0:T}$ is continuous at $\hat{x}_{0:T}$ and $\|\hat{x}_{0:T} - x_{0:T}^{N,k}\| < 1/k$, we deduce that $\lim_{k \to \infty} \pi_{0:T}(x_{0:T}^{N,k}) = \pi_{0:T}(\hat{x}_{0:T})$ and, as a consequence,

$$\lim_{N \to \infty} \pi_{0:T}(\hat{x}_{0:T}^N) = \pi_{0:T}(\hat{x}_{0:T}) \quad \text{a.s.}$$

Moreover, if $\{\hat{x}_{0:T}^{N_i}\}_{i \in \mathbb{N}}$ is a convergent subsequence of $\{\hat{x}_{0:T}^N\}_{N \in \mathbb{N}}$ with limit, say, $\check{x}_{0:T}$, it follows that $\pi_{0:T}(\check{x}_{0:T}) = \lim_{i \to \infty} \pi_{0:T}(\hat{x}_{0:T}^{N_i}) = \pi_{0:T}(\hat{x}_{0:T})$. Therefore $\check{x}_{0:T} \in \mathcal{X}_T^\pi$, which concludes the proof. □

Remark 3 In (Najim et al 2006) a different approach is used to prove a result similar to Theorem 1, based on the propagation-of-chaos property of genealogical tree simulation models (see (Del Moral 2004) for details). The basic idea is that a sub-sample from $\{x_{0:T}^{(i)}\}_{i=1,\ldots,N}$ behaves asymptotically as a perfect sample from $\pi_{0:T}$. More precisely, using Theorem 8.3.3 in (Del Moral 2004) one can show that if $\pi_{0:T}^{\otimes q}$ is the tensor product of $q$ copies of the measure $\pi_{0:T}$, then

$$\big\| \mathrm{Law}\big( x_{0:T}^{(1)}, x_{0:T}^{(2)}, \ldots, x_{0:T}^{(q)} \big) - \pi_{0:T}^{\otimes q} \big\|_{tv} \le c(T) \frac{q^2}{N}, \qquad (17)$$

where $\|\cdot\|_{tv}$ denotes the total variation norm between two probability measures and $c(T)$ is a constant with respect to $N$. By choosing $q = q(N)$ to be of order $o(N)$ and denoting

$$M_\delta = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}) - \delta,$$

one can show that, for any $\delta > 0$,

$$P\left\{ \max_{i=1,\ldots,q(N)} \pi_{0:T}(x_{0:T}^{(i)}) < M_\delta \right\} \le c(T) \frac{q(N)^2}{N} + \pi_{0:T}(A(\delta))^{q(N)},$$

where $A(\delta)$ is defined to be the set

$$A(\delta) = \left\{ x_{0:T} \in (\mathbb{R}^{d_x})^{T+1} : \pi_{0:T}(x_{0:T}) < M_\delta \right\}$$

and $\pi_{0:T}(A(\delta))$ is a shorthand for the integral

$$\pi_{0:T}(A(\delta)) = \int I_{A(\delta)}(x_{0:T}) \pi_{0:T}(x_{0:T}) dx_{0:T}.$$

This, in turn, leads to the convergence in probability (but not a.s.) of the estimator toward $\max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T})$.

Remark 4 Theorem 1 is valid for general topological spaces provided the posterior distribution charges any open neighborhood of points in $\mathcal{X}_T^\pi$ and its density is lower semicontinuous (and hence continuous) at the points in $\mathcal{X}_T^\pi$. This includes discrete spaces (finite or infinite) with the corresponding discrete topology.

Corollary 1 Assume that $\pi(x_t|x_{0:t-1}) = \pi(x_t|x_{t-1})$ and $\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_t)$, and let $\bar{x}_{0:T}^N$ be the output sequence of Algorithm 2. Then,

$$\lim_{N \to \infty} \pi_{0:T}(\bar{x}_{0:T}^N) = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}) \quad \text{a.s.}$$

Moreover, any convergent subsequence of $\bar{x}_{0:T}^N$ has a limit $\hat{x}_{0:T}$ that belongs to the solution set $\mathcal{X}_T^\pi$.

Proof: Simply note that $\Omega_{0:T}^N \subset \bar{\Omega}_{0:T}^N$ and, as a consequence, $\pi_{0:T}(\hat{x}_{0:T}^N) \le \pi_{0:T}(\bar{x}_{0:T}^N) \le \pi_{0:T}(\hat{x}_{0:T})$. □

Remark 5 We emphasize that the sequences $\{\hat{x}_{0:T}^N\}_{N \in \mathbb{N}}$ and $\{\bar{x}_{0:T}^N\}_{N \in \mathbb{N}}$ may not necessarily be convergent themselves, as they may contain subsequences that converge to different elements of the solution set $\mathcal{X}_T^\pi$ (we have not assumed uniqueness of the global maximizer). Moreover, if

$$\limsup_{\|x_{0:T}\| \to \infty} \pi_{0:T}(x_{0:T}) = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}),$$

then the sequence may contain subsequences that diverge to infinity, or the entire sequence can diverge to infinity. If that is the case, we need to restrict the search for a global maximizer to a (sufficiently large) compact set. However, in general, $\lim_{\|x_{0:T}\| \to \infty} \pi_{0:T}(x_{0:T}) = 0$ and, therefore, ending up with a sequence divergent to infinity does not occur.
Equation (11) states that, for a real bounded function of $x_{0:T}$, the approximation error decays as $1/N^{\frac{1}{2}-\delta}$, which determines the accuracy of the discretization $\Omega_{0:T}^N$ of the state-space. This enables us to find how large the number of particles $N$ should be such that the (random) grids $\Omega_{0:T}^N$ (respectively, $\bar{\Omega}_{0:T}^N$) contain points at a distance from a true MAP estimate smaller than $\frac{1}{k}$, for $k$ arbitrary but sufficiently large.

Theorem 2 For sufficiently large $k$, the (random) grids $\Omega_{0:T}^N$ and $\bar{\Omega}_{0:T}^N$ contain points at a distance from a true MAP estimate smaller than $\frac{1}{k}$ provided that $N > \Theta k^{\frac{d_x(T+1)}{\frac{1}{2}-\delta}}$, where $\Theta$ is a positive random variable independent of $N$ and $k$ and $0 < \delta < \frac{1}{2}$ is a constant.

Proof: From Eq. (11), there exists a positive random variable $\Theta_T^\delta$ such that, for all $N > 0$, we have

$$\left| \frac{m(N,k)}{N} - A_k \right| \le \frac{\Theta_T^\delta}{N^{\frac{1}{2}-\delta}} \qquad (18)$$

a.s. for $0 < \delta < \frac{1}{2}$ (the constant $\delta$ can be chosen as small as desired by taking a large value of $p$ in (12)). Recall that $A_k = \int_{B_k(\hat{x}_{0:T})} \pi_{0:T}(x_{0:T}) dx_{0:T}$, for some $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$. When $k$ is sufficiently large, $\pi_{0:T}(x_{0:T})$ is very close to $\pi_{0:T}(\hat{x}_{0:T})$ for any $x_{0:T} \in B_k(\hat{x}_{0:T})$. In particular, we can assume that $\frac{1}{2}\pi_{0:T}(\hat{x}_{0:T}) \le \pi_{0:T}(x_{0:T}) \le \pi_{0:T}(\hat{x}_{0:T})$ for any $x_{0:T} \in B_k(\hat{x}_{0:T})$. Therefore we can deduce that $A_k \ge \frac{q_{T+1}\pi_{0:T}(\hat{x}_{0:T})}{2}\left(\frac{1}{k}\right)^{d_x(T+1)}$, where $q_{T+1}$ is the volume of the unit ball in $(\mathbb{R}^{d_x})^{T+1}$, and from (18) we arrive at

$$\frac{m(N,k)}{N} \ge \frac{q_{T+1}\pi_{0:T}(\hat{x}_{0:T})}{2}\left(\frac{1}{k}\right)^{d_x(T+1)} - \frac{\Theta_T^\delta}{N^{\frac{1}{2}-\delta}}. \qquad (19)$$

By inspection of (19), we realize that $m(N,k)$ can be guaranteed to be strictly positive if we take $N$ large enough for the inequality

$$\frac{q_{T+1}\pi_{0:T}(\hat{x}_{0:T})}{2}\left(\frac{1}{k}\right)^{d_x(T+1)} - \frac{\Theta_T^\delta}{N^{\frac{1}{2}-\delta}} > 0$$

to hold true. Solving for $N$, we obtain $N > \Theta k^{\frac{d_x(T+1)}{\frac{1}{2}-\delta}}$ for $\Theta = \left( \frac{2\Theta_T^\delta}{q_{T+1}\pi_{0:T}(\hat{x}_{0:T})} \right)^{\frac{1}{\frac{1}{2}-\delta}}$. □

Remark 6 Under additional assumptions (for example, if the state space is compact), one can deduce² a smaller lower bound for the size $N$ of the sample required to obtain a point at a distance less than, say, $\frac{1}{k}$. The basis of this is the following exponential bound (see (Del Moral and Miclo 2000) for details and the required assumptions). One can show that there exist constants $c_1 = c_1(T, f, \delta)$ and $c_2 = c_2(T, f, \delta)$ such that

$$P\left\{ \big| (f, \pi_{0:T}^N) - (f, \pi_{0:T}) \big| \ge \delta \right\} \le c_1 e^{-c_2 N \delta^2} \qquad (20)$$

for an arbitrarily small $\delta > 0$. Using a standard argument, one can deduce from Eq. (20) that there exist two positive random variables $\Theta_T^1$ and $\Theta_T^2$ such that, for all $N > 0$, we have

$$\left| \frac{m(N,k)}{N} - A_k \right| \le \Theta_T^1 \exp\{-\Theta_T^2 N\},$$

which implies that if $N > \Theta \log k$ for a suitably chosen positive random variable $\Theta$, then $m(N,k)$ is strictly positive and, hence, the (random) grids $\Omega_{0:T}^N$ and $\bar{\Omega}_{0:T}^N$ contain points at a distance smaller than $\frac{1}{k}$.

² This approach was suggested to us by Pierre Del Moral.

6 Application example: target tracking

6.1 Problem statement

MAP sequence estimation methods find a natural application in problems where the posterior densities $\pi_{0:t}(x_{0:t})$, $t = 1, \ldots, T$, are multimodal. In such cases, the mean of the a posteriori probability distribution may yield a path lying in a low probability region and it is often preferred to use a mode of the distribution as an estimate. The problem of tracking a target that moves over a two-dimensional region using only two sensors that provide distance-dependent observations falls within this category.

Let the system state at discrete time $t$ be the four-dimensional random (column) vector $X_t = [X_{1,t}, \ldots, X_{4,t}]^\top \in \mathbb{R}^4$, where $R_t = [X_{1,t}, X_{2,t}]^\top \in \mathbb{R}^2$ determines the position of the target and $V_t = [X_{3,t}, X_{4,t}]^\top$ denotes its velocity. The state vector is assumed to evolve with time according to the constant-velocity model³

$$X_t = A_{T_o} X_{t-1} + \sigma_x U_t, \quad t = 1, 2, \ldots, T,$$

where $A_{T_o}$ is the $4 \times 4$ constant matrix

$$A_{T_o} = \begin{pmatrix} 1 & 0 & T_o & 0 \\ 0 & 1 & 0 & T_o \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix},$$

$T_o$ is the observation period in seconds (s), i.e., the duration of the discrete-time unit in the model, $\sigma_x^2$ is the variance of the Gaussian perturbation of the state and $U_t$ is a standard (zero mean, identity covariance matrix) normal vector, i.e., $\pi(u_t) = N(u_t; 0, I_4)$, where $I_4$ is the $4 \times 4$ identity matrix and $0 \in \mathbb{R}^4$. Therefore, the transition pdf at time $t$ is

$$\pi(x_t|x_{0:t-1}) = \pi(x_t|x_{t-1}) = N(x_t; A_{T_o} x_{t-1}, \sigma_x^2 I_4). \qquad (21)$$

We assume a Gaussian prior $\pi(x_0) = N(x_0; 0, \Sigma_0)$, where $0 \in \mathbb{R}^4$ and $\Sigma_0$ is a diagonal, $4 \times 4$, positive definite matrix.

Two sensors measure the power of a radio signal transmitted by the target. The observation collected at discrete time $t$ by the $i$-th sensor is modeled as

$$Y_{i,t} = 10 \log_{10}\left(\frac{P_o}{\|R_t - s_i\|^2}\right) + \sigma_y Z_{i,t} \quad \text{(dB)},$$

where $i = 1, 2$, $P_o$ is the power of the signal transmitted by the target, $s_i \in \mathbb{R}^2$ is the position of the $i$-th sensor, $\sigma_y^2$ is the variance of the observational noise and $Z_{i,t}$ is a standard Gaussian variable, $\pi(z_{i,t}) = N(z_{i,t}; 0, 1)$. We assume that the sequences $\{Z_{1,t}\}_{t \ge 1}$ and $\{Z_{2,t}\}_{t \ge 1}$ are white and mutually independent. With this model, the conditional pdf of the observations $Y_t = [Y_{1,t}, Y_{2,t}]^\top \in \mathbb{R}^2$ given the state $X_t$ of the system is $\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_t)$, where

$$\pi(y_t|x_t) = \prod_{i=1}^{2} N\left( y_{i,t}; \, 10 \log_{10}\left(\frac{P_o}{\|r_t - s_i\|^2}\right), \, \sigma_y^2 \right). \qquad (22)$$

³ The model assumes that the target velocity remains constant in intervals of length $T_o$, the observation period. See, e.g., (Gustafsson et al 2002) for a discussion of kinetic models for target tracking.
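The model (21)–(22) is straightforward to simulate. The following minimal Python sketch (our own code, not that of the original experiments) generates a state trajectory and the corresponding observations, using the parameter values listed in Section 6.2 below:

import numpy as np

rng = np.random.default_rng(0)
To, sx2, sy2, Po, T = 0.25, 1.0 / 32, 0.5, 1.0, 80          # values from Section 6.2
A = np.array([[1., 0., To, 0.],
              [0., 1., 0., To],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.]])                            # constant-velocity matrix A_To
s = np.array([[0., 0.], [20., 0.]])                         # sensor positions s_1, s_2
x = rng.multivariate_normal(np.zeros(4), np.diag([100., 100., .05, .05]))  # X_0
xs, ys = [x], []
for t in range(T):
    x = A @ x + np.sqrt(sx2) * rng.standard_normal(4)       # state equation (21)
    d2 = np.sum((x[:2] - s) ** 2, axis=1)                   # ||r_t - s_i||^2, i = 1, 2
    ys.append(10. * np.log10(Po / d2) + np.sqrt(sy2) * rng.standard_normal(2))  # (22)
    xs.append(x)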
The goal is to compute an estimate of the sequence of states $X_{0:T}$ (and, especially, of the positions $R_{0:T}$) given a fixed sequence of observations $Y_{1:T} = y_{1:T}$.

6.2 Numerical results

We have carried out computer simulations for the model described by (21) and (22) with the following set of parameters.

– The prior pdf $\pi(x_0) = N(x_0; 0, \Sigma_0)$ has a covariance matrix
$$\Sigma_0 = \begin{pmatrix} 100 & 0 & 0 & 0 \\ 0 & 100 & 0 & 0 \\ 0 & 0 & 0.05 & 0 \\ 0 & 0 & 0 & 0.05 \end{pmatrix},$$
i.e., we assume little knowledge about the initial position and a low speed.
– The observation period is $T_o = \frac{1}{4}$ s and the variance of the signal noise is selected proportional to $T_o^2$, namely, $\sigma_x^2 = \frac{1}{2} T_o^2 = \frac{1}{32}$.
– The target transmits a signal of unit power, $P_o = 1$, and the sensor positions are $s_1 = [0, 0]^\top$ and $s_2 = [20, 0]^\top$. The observational noise variance is $\sigma_y^2 = \frac{1}{2}$.
– The target is observed during $T_d = 20$ s, which yields $T = \frac{T_d}{T_o} = 80$ discrete-time steps.

For the simulations, we have generated a single trajectory $x_{0:T}$ with its associated observations $y_{1:T}$ using the described state-space model. Therefore, the observations are fixed in all the computer experiments.

Recall that $R_t = [X_{1,t}, X_{2,t}]^\top$ denotes the target position at time $t$. Figure 1 (top) displays a histogram of $\pi(r_T|y_{1:T})$ obtained with $N = 4 \times 10^5$ particles generated using the standard SMC algorithm of Section 4.1. It is clearly seen that the filtering distribution for this system is bimodal. Note that, in order to obtain a unimodal posterior distribution using distance-dependent observations in dimension $d_x$ one needs either to collect at least $d_x + 1$ observations or to choose a prior for the position that prevents ambiguities.

A consequence of the shape of the distribution in Fig. 1 (top) is that the mean of the posterior distribution is not a useful estimate of the trajectory $R_0, \ldots, R_T$ given the observations $y_{1:T}$. Indeed, Fig. 1 (bottom) shows:

– The true target trajectory, as a dark-colored solid line.
– The sensor positions, as circles.
– The mean of the posterior distribution obtained from the bootstrap filter with $N = 4 \times 10^5$ particles, as a light-colored thick line.
– The 100 sample paths, $x_{0:T}^{(i_1)}, \ldots, x_{0:T}^{(i_{100})}$, with the highest a posteriori density generated by the particle filter, displayed as thin light-colored lines. These paths have been computed using Algorithm 1.

It is apparent that the posterior mean of $\pi(x_{0:T}|y_{1:T})$ yields a path that lies far away from the two regions of high probability density. The collection of high-density sample paths, however, clearly reveals the two modes of the distribution. Any of these paths is a useful practical estimator of the target trajectory, but it should be noted that the system is ambiguous. Both modes are equally likely and only a modification of the model (e.g., the addition of new observations from different sensors, the modification of the prior distribution or a restriction of the region where the target can move) would make it possible to discriminate one from the other.

Figure 2 shows a comparison of Algorithms 1 and 2 for MAP sequence estimation. For the same record of observations as in Fig. 1, we have run the two algorithms 100 times, with $N = 1{,}000$ particles⁴, and recorded the outputs $\hat{x}_{0:T}^N$ and $\bar{x}_{0:T}^N$ for each simulation.

⁴ We have used a very large number of particles ($N = 4 \times 10^5$) to generate Figure 1 in order to accurately show the two (symmetric) modes in the posterior pdf $\pi_{0:T}(x_{0:T})$. However, the practical application of the estimation algorithms does not require such a large number of particles (which would make the computational cost of Algorithm 2 prohibitive). Hence, we have run both procedures with only $N = 1{,}000$ particles in this computer experiment.
[Fig. 1 Top: Histogram generated from $N = 4 \times 10^5$ particles of the bootstrap filter at time $t = T$. It shows that the posterior distribution for this system is bimodal. Bottom: True trajectory (dark-colored line), posterior mean estimate (thick light-colored line) and the 100 sample paths with highest a posteriori density, computed using Algorithm 1 (thin and light-colored).]

[Fig. 2 Box-and-whiskers plots for the logarithm of the posterior densities of the approximate MAP estimates produced by Algorithm 1 and Algorithm 2 in 100 independent simulation runs (axes: time (s) versus $\log \pi(x_{0:t}|y_{1:t})$). The dark-colored plot shows the outcomes for Algorithm 1, i.e., $\log \pi(\hat{x}_{0:t}^N|y_{1:t})$ for $t = 0, 1, \ldots, T$. The light-colored plot shows the outcomes for Algorithm 2, i.e., $\log \pi(\bar{x}_{0:t}^N|y_{1:t})$ for $t = 0, 1, \ldots, T$. The boxes show the interquartile range (IQR) and the whiskers show the smallest (largest) datum still within $1.5 \times$ IQR of the lower (upper) quartile. Data between $1.5 \times$ IQR and $3 \times$ IQR away from the lower or upper quartiles are displayed with the '+' symbol. Data further than $3 \times$ IQR away from the lower or upper quartile are displayed with the '◦' symbol.]
In particular, the figure displays box-and-whiskers plots for the logarithms of the sequence of posterior densities $\log \pi(\hat{x}_{0:t}^N|y_{1:t})$, $t = 0, 1, 2, \ldots, T$, for Algorithm 1 (dark-colored), and $\log \pi(\bar{x}_{0:t}^N|y_{1:t})$, $t = 0, 1, 2, \ldots, T$, for Algorithm 2 (light-colored). It can be seen that the posterior density of the estimates produced by Algorithm 2 ($\bar{x}_{0:T}^N$) is always higher than that produced by Algorithm 1 ($\hat{x}_{0:T}^N$). This is because in Algorithm 2 we carry out a search over a random grid approximation of the path space, $\bar{\Omega}_{0:T}^N$, that is a refinement of the random grid used by Algorithm 1, denoted $\Omega_{0:T}^N$. As a consequence, $\Omega_{0:T}^N \subset \bar{\Omega}_{0:T}^N$ and $\pi(\hat{x}_{0:T}^N|y_{1:T}) \le \pi(\bar{x}_{0:T}^N|y_{1:T})$ (most often with strict inequality).

7 Application example: global optimization

7.1 Problem statement
As an application of the MAP sequence estimation techniques investigated in this paper, we address the problem of finding the global minima of a certain class of cost functions with recursive structure. For this purpose, let $\{x_t\}_{t \ge 0}$ and $\{y_t\}_{t \ge 1}$ be discrete-time vector-valued sequences in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$, respectively. For some arbitrarily large but finite horizon $T$, we aim at computing

$$\mathcal{X}_T^c = \arg\min_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} C_T(x_{0:T}; y_{1:T}), \qquad (23)$$

where $C_T(\cdot; y_{1:T}) : (\mathbb{R}^{d_x})^{T+1} \to \mathbb{R}_+$ is the real nonnegative cost function of interest, the subsequence $x_{0:T}$ denotes the unknowns to be optimized and the subsequence $y_{1:T}$ is known and provides the fixed parameters that determine the specific form of $C_T$.
The MAP sequence estimation methods of Section 4 can be applied to solve problem (23) when the cost function can be constructed recursively, i.e., when there exists a sequence of functions $C_t(\cdot; y_{1:t}) : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}_+$, $t = 0, 1, \ldots, T$, such that $C_t(x_{0:t}; y_{1:t})$ can be computed from $C_{t-1}(x_{0:t-1}; y_{1:t-1})$ by some known update rule. In particular, we assume that $C_t$ can be decomposed as

$$C_t(x_{0:t}; y_{1:t}) = H\big( C_{t-1}(x_{0:t-1}; y_{1:t-1}), \, c_t(x_{0:t}; y_{1:t}) \big),$$

$t = 1, \ldots, T$, where $H : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ is the update function and $c_t(\cdot; y_{1:t}) : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}_+$ is termed the partial cost at time $t$. The recursion is initialized with some function $C_0 : \mathbb{R}^{d_x} \to \mathbb{R}_+$ which does not formally depend on any element of the sequence $y_{1:T}$.

Despite the simplicity of the recursive structure, we may realistically expect problems of the form of (23) to be hard to solve in practical scenarios. Indeed, $C_T(x_{0:T}; y_{1:T})$ may be analytically intractable and present multiple minima. Also, due to the potentially high dimension, $d_x(T+1)$, of the unknown, $x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}$, it may be hard to devise effective numerical optimization algorithms with acceptable computational complexity.

In order to compute approximate solutions, we propose to recast the optimization problem (23) as one of MAP sequence estimation in a state-space model and then apply the SMC algorithms of Section 4. The first step in our approach, therefore, is to select a suitable state-space model. We say that valid models are matched to the cost function.

Definition 1 Let $y_{1:T}$ be a fixed sequence in $\mathbb{R}^{d_y}$. A state-space model of the form of (1) is matched to the cost function $C_T(x_{0:T}; y_{1:T})$ if, and only if, $\mathcal{X}_T^c = \mathcal{X}_T^\pi$.

Therefore, a state-space model is matched to the cost $C_T$ when the maxima of the posterior pdf $\pi(x_{0:T}|y_{1:T})$ exactly coincide with the minima of $C_T(x_{0:T}; y_{1:T})$. There is not a unique model matched to a given cost, but rather a complete class of systems that yield the same solution set $\mathcal{X}_T^\pi = \mathcal{X}_T^c$, as exemplified in Section 7.2 below.

7.2 Examples

In this Section we illustrate the construction of state-space models matched to cost functions by way of two examples, each of them dealing with a different class of update functions $H(\cdot, \cdot)$. For notational conciseness, in the rest of this section we use the shorthands $C_{0:t}(x_{0:t}) = C_t(x_{0:t}; y_{1:t})$ and $c_t(x_{0:t}) = c_t(x_{0:t}; y_{1:t})$ for $t = 0, \ldots, T$, where the fixed parameters $y_{1:t}$ are omitted.

The first example involves a purely additive rule, $H(a, b) = a + b$. Additive costs appear frequently in scientific and engineering problems, e.g., in positioning and navigation (Sayed et al 2005), finance (Ziemba and Vickson 2006) or operational research (Baker 2000). Let us consider the generic additive form

$$C_{0:t}(x_{0:t}) = C_{0:t-1}(x_{0:t-1}) + c_t(x_{0:t}). \qquad (24)$$

This cost can be related to the posterior pdf easily by means of the exponential transformation

$$\pi_{0:t}(x_{0:t}) = \kappa_t \exp\{-C_{0:t}(x_{0:t})\}, \qquad (25)$$

where the proportionality constant $\kappa_t$ is independent of $x_{0:t}$. Substituting (24) into (25), we readily obtain that

$$\pi_{0:t}(x_{0:t}) = \kappa_t \exp\{-C_{0:t-1}(x_{0:t-1})\} \exp\{-c_t(x_{0:t})\} \qquad (26)$$

and, comparing Eqs. (26) and (2), it becomes apparent that any state-space model such that $\pi(x_0) \propto \exp\{-C_0(x_0)\}$ and

$$\pi(y_t|x_{0:t}, y_{1:t-1}) \, \pi(x_t|x_{0:t-1}) \propto \exp\{-c_t(x_{0:t})\},$$

for $t = 1, \ldots, T$, is matched to $C_{0:T}(x_{0:T})$.

Now, we discuss a specific example taken from the global optimization literature.

Example 1 (Neumaier 3 problem) The Neumaier 3 problem is included in the collection of (Ali et al 2005) and consists in the minimization of the cost function

$$J(x_{1:T}) = \sum_{t=1}^{T} (x_t - 1)^2 - \sum_{t=2}^{T} x_t x_{t-1}, \qquad (27)$$

subject to $-T^2 \le x_t \le T^2$ for all $t \in \{1, \ldots, T\}$. The number of local minima of $J(x_{1:T})$ is not known, but the global minimum can be expressed as

$$J(x_{1:T}^o) = -\frac{T(T+4)(T-1)}{6},$$

where $x_t^o = t(T+1-t)$, $t = 1, \ldots, T$. We can easily adapt the cost of Eq. (27) to the notation in this paper by defining

$$C_{0:T}(x_{0:T}) = \frac{1}{\sigma^2} \left[ \sum_{t=1}^{T} (x_t - y_t)^2 - \sum_{t=2}^{T} x_t x_{t-1} \right], \qquad (28)$$

where $y_t = 1$ for all $t \ge 1$ and $\sigma^2 > 0$ is an arbitrary scale parameter. Note that, subject to $-T^2 \le x_t \le T^2$,

$$\arg\min_{x_{1:T}} J(x_{1:T}) = \arg\min_{x_{1:T}} C_{0:T}(x_{0:T}), \qquad (29)$$

and $x_0$ is a dummy unknown included only for notational compatibility.
The functions $C_{0:t}$, $t = 2, \ldots, T$, admit the recursive decomposition

$$C_{0:t}(x_{0:t}) = C_{0:t-1}(x_{0:t-1}) + \frac{1}{\sigma^2} \left[ (x_t - y_t)^2 - x_t x_{t-1} \right]. \qquad (30)$$
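As a quick numerical sanity check of Eqs. (27)–(30), the following sketch (our own code) evaluates the cost both directly and through the recursive decomposition (with $y_t = 1$ and $\sigma^2 = 1$), and verifies the closed-form global minimum:

import numpy as np

def neumaier3(x):                                  # J(x_{1:T}), Eq. (27)
    return np.sum((x - 1.0) ** 2) - np.sum(x[1:] * x[:-1])

T = 10
x_opt = np.array([t * (T + 1.0 - t) for t in range(1, T + 1)])   # x_t^o = t(T+1-t)
assert np.isclose(neumaier3(x_opt), -T * (T + 4) * (T - 1) / 6.0)

C = (x_opt[0] - 1.0) ** 2                          # partial cost at t = 1
for t in range(1, T):                              # update rule of Eq. (30)
    C = C + (x_opt[t] - 1.0) ** 2 - x_opt[t] * x_opt[t - 1]
assert np.isclose(C, neumaier3(x_opt))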
Let us construct the matched state-space model with signal process $X_{0:T}$ and fixed observations $Y_{1:T} = y_{1:T}$. Since $X_0$ is a dummy variable in this example, it is trivial to choose the prior $\pi(x_0) = U(x_0; -T^2, +T^2)$, which does not modify the location of the maxima of the posterior pdf. We also note that the partial cost at time $t = 1$ is independent of $X_0$, $c_1(x_{0:1}) = \frac{1}{\sigma^2}(x_1 - y_1)^2$, hence we can also select a uniform density for the random variable $X_1$, $\pi(x_1|x_0) = \pi(x_1) = U(x_1; -T^2, +T^2)$, and let the likelihood function be Gaussian,

$$\pi(y_1|x_1) \propto \exp\left\{ -\frac{1}{\sigma^2}(x_1 - y_1)^2 \right\}.$$

Thus, the posterior pdf at time $t = 1$ is a truncated Gaussian function, namely $\pi(x_1|y_1) \propto \pi(y_1|x_1)$, where $x_1 \in [-T^2, T^2]$.
In order to determine the form of the matched state-space model for $t \ge 2$ we have to select the transition densities and the likelihood functions to comply with the relationship

$$\pi(y_t|x_{0:t}, y_{1:t-1}) \, \pi(x_t|x_{0:t-1}) \propto \exp\left\{ -\frac{1}{\sigma^2} \left[ (x_t - y_t)^2 - x_t x_{t-1} \right] \right\}, \qquad (31)$$

where the proportionality constant must be independent of $x_{0:t}$. There are several choices compatible with (31). A simple one is to choose the transition to be uniform, $\pi(x_t|x_{0:t-1}) = \pi(x_t) = U(x_t; -T^2, +T^2)$, and let the likelihood account for the partial cost,

$$\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_{t-1:t}) \propto \exp\left\{ -\frac{(x_t - y_t)^2 - x_t x_{t-1}}{\sigma^2} \right\}.$$

Alternatively, we can "split" the partial cost between the transition density and the likelihood function. Let $N(z; m, v)$ denote the normal density of the variable $Z$ with mean $m$ and variance $v$, while $TN(z; m, v, a, b)$ denotes the normal pdf of $Z$ with mean $m$ and variance $v$ truncated within the support $a < z < b$. We can select the transition density as

$$\pi(x_t|x_{0:t-1}) = \pi(x_t|x_{t-1}) = TN\left( x_t; s_t, \frac{\sigma^2}{2}, -T^2, T^2 \right),$$

where $s_t = 1 + \frac{1}{2} x_{t-1}$. The proportionality constant for this pdf is

$$\kappa_t = \left[ \int_{-\infty}^{T^2} N\left( x_t; s_t, \frac{\sigma^2}{2} \right) dx_t - \int_{-\infty}^{-T^2} N\left( x_t; s_t, \frac{\sigma^2}{2} \right) dx_t \right]^{-1},$$

hence

$$\pi(x_t|x_{t-1}) = \kappa_t \exp\left\{ -\frac{1}{\sigma^2}(x_t - s_t)^2 \right\},$$

for $-T^2 \le x_t \le +T^2$. Let us note that

$$(x_t - s_t)^2 = (1 - x_t)^2 - x_t x_{t-1} + x_{t-1}\left( \frac{x_{t-1}}{4} + 1 \right),$$

i.e., $\pi(x_t|x_{t-1}) \propto \exp\{-c_t(x_{t-1:t})\}$ but the proportionality constant depends on the variable $x_{t-1}$. The likelihood $\pi(y_t|x_{0:t}, y_{1:t-1})$ has to be selected to account for the choice of $\pi(x_t|x_{t-1})$ and comply with (31) when $y_k = 1$ for all $1 \le k \le t$. In particular, we define

$$\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_{t-1}) = N\left( y_t; z_t, \frac{\sigma^2}{2} \right),$$

where

$$z_t \in 1 \pm \sqrt{ \sigma^2 (b_T + \log \kappa_t) - x_{t-1}\left( \frac{x_{t-1}}{4} + 1 \right) } \qquad (32)$$

and $b_T \ge \frac{T^2}{\sigma^2}\left( \frac{T^2}{4} + 1 \right)$ is a constant chosen to guarantee that $z_t \in \mathbb{R}$. Eq. (32) ensures that

$$\pi(y_t = 1|x_{t-1}) = \frac{\kappa_t^{-1} e^{-b_T}}{\sqrt{\pi \sigma^2}} \exp\left\{ \frac{x_{t-1}}{\sigma^2}\left( \frac{x_{t-1}}{4} + 1 \right) \right\}$$

and, hence, Eq. (31) holds true.
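One practical appeal of the "split" construction is that the transition can be simulated directly within the SIR algorithm. A minimal rejection sampler for the truncated Gaussian transition above (our own sketch, using only the untruncated normal as proposal):

import numpy as np

def sample_matched_transition(x_prev, sigma, T, rng):
    # Draw x_t from TN(x_t; s_t, sigma^2/2, -T^2, T^2) with s_t = 1 + x_prev/2,
    # by rejection from the untruncated Gaussian proposal (efficient whenever
    # s_t lies inside, or close to, the support [-T^2, T^2]).
    s = 1.0 + 0.5 * x_prev
    while True:
        x = rng.normal(s, sigma / np.sqrt(2.0))
        if -T ** 2 <= x <= T ** 2:
            return x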
Another class of problems that abound in engineering, finance and other disciplines consist in the minimization of the maximum value of a certain function (see, e.g., (Du and Pardalos 1995; Rao et al 2007; Pankov et al 2003)). Let a ∨ b and a ∧ b denote the maximum and the minimum, respectively, of a and b. In a second example, we study cost functions of the form Ct (x0:t ) = Ct−1 (x0:t−1 ) ∨ ct (x0:t ). We can also apply the exponential transformation in this case, to obtain π(x0:t |y1:t−1 ) ∝ exp {−C0:t (x0:t )} = exp {− (Ct−1 (x0:t−1 ) ∨ ct (x0:t ))} , (33) for t = 1, ..., T . We can put Eq. (33) in a form comparable with (2), π(x0:t |y1:t−1 ) ∝ exp {−C0:t−1 (x0:t−1 )} × exp {−ct (x0:t )} , exp {− (C0:t−1 (x0:t−1 ) ∧ ct (x0:t ))}
14
Joaqu´ın M´ıguez et al.
with the proportionality constant independent of $x_{0:t}$, and reduce the problem of building a matched state-space model to selecting a transition density and a likelihood function such that
$$\pi(y_t | x_{0:t}, y_{1:t-1})\,\pi(x_t | x_{0:t-1}) \propto \frac{\exp\{-c_t(x_{0:t})\}}{\exp\{-(C_{0:t-1}(x_{0:t-1}) \wedge c_t(x_{0:t}))\}}. \qquad (34)$$

We work out a brief example from the signal processing literature.

Example 2 (Cross-talk cancellation) In (Rao et al 2007), the design of an acoustic filter for cross-talk cancellation in a 3D audio system is stated as a minimax problem. Let $h_a(n)$, $n \in \mathbb{Z}$, be a sequence that represents the combined effect of the acoustic impulse responses between the sound sources (loudspeakers) and (say) the listener's left ear, and let $h_f(n)$, $n \in \mathbb{Z}$, be the cross-talk cancellation filter that should let the desired source signal pass while mitigating all other signals coming from different sources (see (Rao et al 2007) for details). The impulse response $h_a(n)$ is causal with length $2M-1$, i.e., $h_a(n) = 0$ for all $n < 0$ and $n \ge 2M-1$, while the filter $h_f(n)$ is assumed causal with length $K$, i.e., $h_f(n) = 0$ for all $n < 0$ and $n \ge K$. The goal is to find the response $h_f(n)$ such that the convolution $c(n) = h_a(n) * h_f(n) = \sum_{k=0}^{2M-2} h_a(k) h_f(n-k)$ is the closest to the desired response
$$d(n) = \begin{cases} 1, & \text{if } n = 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (35)$$
i.e., the filter $h_f(n)$ is selected to invert the combined acoustic response $h_a(n)$. Perfect inversion is not possible, since $h_f(n)$ has a finite length, hence we seek to solve the equations $d(n) - c(n) = 0$, $n = 0, \ldots, K+2M-3$, approximately instead. Let us collect the complete set of filter coefficients into the vector $h_f = [h_f(0), \ldots, h_f(K-1)]^\top \in \mathbb{R}^K$. In (Rao et al 2007) it is proposed to select $h_f$ as the solution of the minimax problem
$$\hat{h}_f = \arg\min_{h_f \in \mathbb{R}^K} \{J(h_f)\}, \qquad (36)$$
where
$$J(h_f) = \max_{n \in \{0, \ldots, K+2M-3\}} \left| d(n) - \sum_{k=0}^{2M-2} h_a(k) h_f(n-k) \right|.$$

We can easily rewrite problem (36) using our notation. For the unknowns, we let $x_t = h_f(t-1) \in \mathbb{R}$ for $t = 0, 1, \ldots, K$ (hence, $x_0 = 0$ and $x_{t>K} = 0$). The desired sequence $d(n)$ plays the role of the observations, hence $y_t = d(t-1)$, $t = 1, 2, \ldots, K+2M-2$. We define the partial cost at time $t \ge 1$ as
$$c_t(x_{0:t}) = \left| y_t - \sum_{k=0}^{2M-2} h_a(k)\, x_{t-k} \right|,$$
while, trivially, $C_0(x_0) = 0$. The overall cost at time $t$ then becomes $C_{0:t}(x_{0:t}) = C_{0:t-1}(x_{0:t-1}) \vee c_t(x_{0:t})$. The time horizon is $T = K+2M-2$ and $C_{0:T}(x_{0:T}) = J(h_f)$. Note that $x_0 = 0$ with probability 1. Also, assume the filter coefficients $x_t$ (equivalently, $h_f(t-1)$) are restricted to the interval $(-a, +a)$ (an actual constraint in a practical fixed-point hardware implementation of the filter). The simplest way to choose the transition density and likelihood function compatible with Eq. (34) is to let
$$\pi(x_t | x_{0:t-1}) = \pi(x_t) = U(x_t; -a, +a), \qquad t = 1, \ldots, T,$$
and make the likelihood account for the cost update,
$$\pi(y_t | x_{0:t}, y_{1:t-1}) \propto \frac{\exp\{-c_t(x_{0:t})\}}{\exp\{-(C_{0:t-1}(x_{0:t-1}) \wedge c_t(x_{0:t}))\}}.$$
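Since the likelihood factor above depends on the past only through the running cost $C_{0:t-1}$, it can be evaluated by carrying a single scalar per particle path. The following Python fragment is a sketch under our own naming conventions (the paper prescribes the densities, not an implementation); it uses the identity $\exp\{-c_t\}/\exp\{-(C_{0:t-1} \wedge c_t)\} = \exp\{C_{0:t-1} - (C_{0:t-1} \vee c_t)\}$.

```python
def partial_cost(x, t, y_t, h_a):
    """c_t = | y_t - sum_k h_a[k] * x_{t-k} |, with out-of-range terms zero
    (x[j] stores x_j for j = 0, ..., t)."""
    acc = sum(h_a[k] * x[t - k] for k in range(len(h_a)) if t - k >= 0)
    return abs(y_t - acc)

def log_weight(c_t, C_prev):
    """log of the likelihood factor exp{-c_t} / exp{-min(C_prev, c_t)};
    equals C_prev - max(C_prev, c_t) because a v b = a + b - (a ^ b)."""
    return C_prev - max(C_prev, c_t)

# After weighting, the per-particle running cost is updated as
# C_new = max(C_prev, c_t).
```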
Similarly to Example 1, other choices of $\pi(x_t | x_{0:t-1})$ and $\pi(y_t | x_{0:t}, y_{1:t-1})$ are possible.

7.3 Numerical results

In this section we apply Algorithm 2 to the Neumaier 3 problem described in Example 1. The goal of this numerical example is to illustrate the advantage of carrying out the optimization over a random grid of points on the space of the unknowns, generated by the particle filter and hence coherent with the posterior probability distribution. For comparison, we have also implemented a deterministic optimization procedure that consists in
1. building a deterministic grid of $N$ equally spaced points in the interval $[-T^2, +T^2]$, denoted $G_T^N$, and
2. running the Viterbi algorithm to compute the sequence of points in the grid $G_T^N$ with the least cost, i.e.,
$$\tilde{x}_{1:T}^N = \arg\min_{x_{1:T} \in (G_T^N)^T} C_{0:T}(x_{0:T}).$$
Note that this is exactly the same scheme as Algorithm 2, the only difference being that the grid is deterministic rather than stochastic; a sketch of this deterministic baseline is given below.
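The following compact Python sketch of the deterministic baseline is our own illustration, not code from the paper. It exploits the pairwise structure of the Neumaier 3 cost, $C_{0:T}(x_{0:T}) = \sum_{t=1}^{T} (x_t - 1)^2 - \sum_{t=2}^{T} x_t x_{t-1}$, so the Viterbi recursion costs $O(TN^2)$ operations.

```python
import numpy as np

def viterbi_grid(T, N):
    """Minimize sum_t (x_t - 1)^2 - sum_{t>=2} x_t x_{t-1} over paths through
    a deterministic grid of N equally spaced points in [-T^2, +T^2]."""
    grid = np.linspace(-T**2, T**2, N)
    node = (grid - 1.0) ** 2               # (x_t - 1)^2 term, identical for every t
    cost = node.copy()                     # best cost of paths ending at time t = 1
    back = np.zeros((T, N), dtype=int)     # backpointers (row 0 unused)
    for t in range(1, T):
        # total[i, j]: best cost ending at grid[j] at time t+1 via grid[i] at time t
        total = cost[:, None] + node[None, :] - np.outer(grid, grid)
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)]
    path = np.empty(T, dtype=int)          # backtrack the minimizing path
    path[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return grid[path], float(np.min(cost))
```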
In the first experiment, we check the influence of the scale factor $\sigma^2$ on the solutions generated by the proposed optimization algorithm. Note that, even if the choice of $\sigma^2 > 0$ is irrelevant from the perspective of the solution set $X_\pi^T = X_c^T$, the convergence rate of the numerical algorithms used to approximate the solutions in $X_\pi^T$ may indeed be affected by this parameter. Therefore, we have applied Algorithm 2 to the Neumaier 3 problem with dimension $T = 200$, using $N = 400$ particles and values of $\sigma^2$ ranging from $\sigma^2 = 100 \times T^2$ to $\sigma^2 = 10^4 \times T^2$. Figure 3 (top) shows the average cost (normalized by $T^2$) of the solutions generated by Algorithm 2 for the various values of $\sigma^2$. Each point in the plot has been obtained by averaging the normalized cost of the solution, $C_{0:T}(\hat{x}_{0:T}^N)/T^2$, over 50 independent simulation runs. The figure also depicts the true minimum cost for reference (labeled 'Optimum'). It is observed that the smaller scale factors yield solutions which are poorer than the output of the deterministic algorithm (labeled 'Deterministic VA'), while for $\sigma^2 \ge 400 \times T^2$ the solutions generated by the (stochastic) Algorithm 2 attain a clearly lower cost.

Figure 3 (bottom) shows the convergence of Algorithm 2 as the number of particles, $N$, is increased. For a fixed scale factor $\sigma^2 = 4000 \times T^2$ and $T = 200$ variables, we have carried out 50 independent simulation trials and averaged the normalized cost of the approximate solution, $C_{0:T}(\hat{x}_{0:T}^N)/T^2$, for several values of the number of particles $N$. The error reduction as $N$ grows is apparent. We observe that for the lowest number of particles considered, $N = 100$, the Viterbi method with a deterministic grid outperforms Algorithm 2. For $N \ge 200$, however, the random grid generated by the particle filter always yields a lower cost.

Fig. 3 Performance of Algorithm 2 for the Neumaier 3 problem with dimension $T = 200$. Top: average cost of the solution $\hat{x}_{0:T}^N$, with $N = 400$ particles, for several values of the scale parameter $\sigma^2$; both the cost (vertical axis, $C_{0:T}/T^2$) and $\sigma^2$ (horizontal axis, $\sigma^2/T^2$) are normalized by $T^2$. Bottom: for fixed $\sigma^2 = 4000 \times T^2$, average cost of $\hat{x}_{0:T}^N$ (normalized by $T^2$) for $N = 100, 200, 400, 800, 1600, 3200$.
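For reference, the stochastic search used in these experiments differs from the deterministic baseline only in the grid: at each time $t$, the Viterbi recursion runs over the $N$ particle values produced by the SIR recursion instead of over equally spaced points. A sketch, again with our own illustrative interface, where `grids[t-1]` holds the particle cloud at time $t$ (e.g., the `x_new` arrays of the SIR sketch shown earlier):

```python
import numpy as np

def viterbi_random_grid(grids):
    """Viterbi search over a time-varying random grid; grids[t-1] is the array
    of N particle values at time t. Same Neumaier 3 pairwise costs as the
    deterministic baseline."""
    T, N = len(grids), len(grids[0])
    cost = (grids[0] - 1.0) ** 2
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        node = (grids[t] - 1.0) ** 2
        total = cost[:, None] + node[None, :] - np.outer(grids[t - 1], grids[t])
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)]
    path = np.empty(T, dtype=int)
    path[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return np.array([grids[t][path[t]] for t in range(T)])
```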
8 Summary

We have analyzed the asymptotic convergence of two SMC algorithms for MAP sequence estimation. Both methods rely on the standard SIR technique to generate random-grid approximations of the state space. They differ, however, in the way the "best node" of the grid is sought. In Algorithm 1, a simple linear search among the paths $\{x_{0:T}^{(n)}\}_{n=1,\ldots,N}$ produced by the SIR method is carried out, while in Algorithm 2 these paths are combined to create a finer grid, which is then explored using the Viterbi algorithm, as proposed in (Godsill et al 2001). The output of either algorithm is the node of the corresponding grid with the highest posterior density.

Our analysis starts with an extension of well-known results on the convergence of particle filters by (Del Moral and Miclo 2000; Crisan and Doucet 2000). We provide explicit convergence rates for the $L_p$ error in the approximation of integrals of bounded functions with respect to the joint posterior probability measure of the sequence of states $X_{0:T}$. Using this new result, we prove that the posterior densities of the output paths of Algorithms 1 and 2 converge almost surely (as $N \to \infty$) to the actual maximum of the posterior pdf. We have also found explicit lower bounds on the number of particles that are needed to ensure a given accuracy in the approximation of the maximum of the pdf.

The last part of the paper is devoted to the application of Algorithms 1 and 2 to the global minimization of a class of objective functions (possibly nonconvex and possibly non-differentiable) that admit a certain recursive factorization. By way of two examples, we have described how to select state-space models "matched" to a given cost function. For these models, the global minima of the cost function coincide with the global maxima of the a posteriori pdf and, hence, we can apply Algorithms 1 and 2 to locate them. In this context, the SIR method can be interpreted as a tool to generate a random grid, in the space of the unknowns of the cost function, that is dense in the region where the cost is low and sparse elsewhere. We have presented numerical simulations that show how this approach can be more efficient than the use of deterministic grids.
Acknowledgements J. M. acknowledges the support of the Ministry of Science and Technology of Spain (program Consolider-Ingenio 2010 CSD2008-00010 COMONSENS and project DEIPRO TEC2009-14504-C02-01). Part of this work was done during D. C.'s visit to the Department of Signal Theory & Communications, Universidad Carlos III (Spain), in April 2008. The hospitality of the Department is gratefully acknowledged. The work of P. M. D. has been supported by the National Science Foundation under Award CCF-1018323 and by the Office of Naval Research under Award N00014-09-1-1154. Part of this work was carried out while P. M. D. held a Chair of Excellence of Universidad Carlos III de Madrid-Banco de Santander.

A Proof of Lemma 1

We proceed by induction in $T$. For $T = 0$, the random measure $\pi_{0:0}^N(dx)$ is constructed from an i.i.d. sample of size $N$ from the distribution with pdf $\pi_{0:0}$. Hence, it is straightforward to check that
$$\|(f, \pi_{0:0}^N) - (f, \pi_{0:0})\|_p \le \frac{c_0^p \|f\|_\infty}{\sqrt{N}},$$
where $c_0^p$ is a constant independent of $N$. Now we assume that
$$\|(f, \pi_{0:T}^N) - (f, \pi_{0:T})\|_p \le \frac{c_T^p \|f\|_\infty}{\sqrt{N}}, \qquad (37)$$
for an integer $T > 0$ and aim at proving the corresponding inequality for $T+1$.

The recursive step of the SIR algorithm, as presented in Section 4.1, consists of three sub-steps. Let $p_{0:T+1}^N$ be the empirical measure obtained after the first sub-step, i.e., $p_{0:T+1}^N(dx) = \frac{1}{N}\sum_{n=1}^{N} \delta_{\bar{x}_{0:T+1}^{(n)}}(dx)$, where $\delta_{\bar{x}_{0:T+1}^{(n)}}$ denotes the unit delta measure centered at $\bar{x}_{0:T+1}^{(n)}$. Also let $\mathcal{G}^{T,N}$ denote the $\sigma$-algebra generated by the random variables $X_{0:T}^{(n)}$, $n = 1, \ldots, N$. Then, for $f : (\mathbb{R}^{d_x})^{T+2} \to \mathbb{R}$, we have
$$E\left[(f, p_{0:T+1}^N) \,\middle|\, \mathcal{G}^{T,N}\right] = (\bar{f}, \pi_{0:T}^N), \qquad (38)$$
where $\bar{f}$ is obtained from $f$ by integrating with respect to the measure $\pi(x_{T+1}|x_{0:T})\,dx_{T+1}$, i.e.,
$$\bar{f}(x_0, x_1, \ldots, x_T) \triangleq \int_{\mathbb{R}^{d_x}} f(x_0, x_1, \ldots, x_{T+1})\,\pi(x_{T+1}|x_{0:T})\,dx_{T+1}. \qquad (39)$$
Obviously, $\bar{f}$ is bounded (since $\|\bar{f}\|_\infty \le \|f\|_\infty$) and, from the induction hypothesis (37), we deduce that
$$\|(\bar{f}, \pi_{0:T}^N) - (\bar{f}, \pi_{0:T})\|_p \le \frac{c_T^p \|f\|_\infty}{\sqrt{N}}. \qquad (40)$$
Moreover, since
$$E\left[\left|(f, p_{0:T+1}^N) - E[(f, p_{0:T+1}^N) | \mathcal{G}^{T,N}]\right|^p \,\middle|\, \mathcal{G}^{T,N}\right]^{1/p} \le \frac{\tilde{c}_{T+1}^p \|f\|_\infty}{\sqrt{N}}, \qquad (41)$$
where $\tilde{c}_{T+1}^p$ is a positive random variable independent of $N$, it is straightforward to combine (38), (40) and (41) using the triangle inequality to arrive at
$$\|(f, p_{0:T+1}^N) - (\bar{f}, \pi_{0:T})\|_p \le \frac{\bar{\tilde{c}}_{T+1}^p \|f\|_\infty}{\sqrt{N}}, \qquad (42)$$
where $\bar{\tilde{c}}_{T+1}^p = E[\tilde{c}_{T+1}^p] + c_T^p$.

Consider next the measure $\bar{\pi}_{0:T+1}^N$ that is obtained after sub-step ii. of the algorithm. This measure can be defined by
$$(f, \bar{\pi}_{0:T+1}^N) = \frac{(f g_{T+1}, p_{0:T+1}^N)}{(g_{T+1}, p_{0:T+1}^N)} \qquad (43)$$
(recall that $g_{T+1}(x_{0:T+1}) = \pi(y_{T+1} | x_{0:T+1}, y_{1:T})$ is the bounded likelihood function). Also let $p_{0:T+1}(x_{0:T+1}) = \pi(x_{T+1}|x_{0:T})\,\pi_{0:T}(x_{0:T})$ be the predictive pdf at time $T+1$, which satisfies $(f, p_{0:T+1}) = (\bar{f}, \pi_{0:T})$, and rewrite (42) as
$$\|(f, p_{0:T+1}^N) - (f, p_{0:T+1})\|_p \le \frac{\bar{\tilde{c}}_{T+1}^p \|f\|_\infty}{\sqrt{N}}. \qquad (44)$$
Since, from Bayes' rule,
$$(f, \pi_{0:T+1}) = \frac{(f g_{T+1}, p_{0:T+1})}{(g_{T+1}, p_{0:T+1})}, \qquad (45)$$
we can take (43) and (45) together in order to obtain
$$(f, \bar{\pi}_{0:T+1}^N) - (f, \pi_{0:T+1}) = \frac{(f g_{T+1}, p_{0:T+1}^N)}{(g_{T+1}, p_{0:T+1}^N)} - \frac{(f g_{T+1}, p_{0:T+1})}{(g_{T+1}, p_{0:T+1})}.$$
By adding and subtracting the term $(f g_{T+1}, p_{0:T+1}^N)/(g_{T+1}, p_{0:T+1})$ in the equation above, we easily arrive at
$$(f, \bar{\pi}_{0:T+1}^N) - (f, \pi_{0:T+1}) = \frac{(f g_{T+1}, p_{0:T+1}^N)\left[(g_{T+1}, p_{0:T+1}) - (g_{T+1}, p_{0:T+1}^N)\right]}{(g_{T+1}, p_{0:T+1}^N)\,(g_{T+1}, p_{0:T+1})} + \frac{(f g_{T+1}, p_{0:T+1}^N) - (f g_{T+1}, p_{0:T+1})}{(g_{T+1}, p_{0:T+1})}$$
and, since $(f g_{T+1}, p_{0:T+1}^N) \le \|f\|_\infty (g_{T+1}, p_{0:T+1}^N)$, it readily follows that
$$\left|(f, \bar{\pi}_{0:T+1}^N) - (f, \pi_{0:T+1})\right| \le \frac{\|f\|_\infty \left|(g_{T+1}, p_{0:T+1}) - (g_{T+1}, p_{0:T+1}^N)\right|}{(g_{T+1}, p_{0:T+1})} + \frac{\left|(f g_{T+1}, p_{0:T+1}^N) - (f g_{T+1}, p_{0:T+1})\right|}{(g_{T+1}, p_{0:T+1})}.$$
The latter inequality, together with (44) and the assumed boundedness of the likelihood $g_{T+1}$, yields
$$\|(f, \bar{\pi}_{0:T+1}^N) - (f, \pi_{0:T+1})\|_p \le \frac{\breve{c}_{T+1}^p \|f\|_\infty}{\sqrt{N}}, \qquad (46)$$
where $\breve{c}_{T+1}^p = 2\|g_{T+1}\|_\infty \bar{\tilde{c}}_{T+1}^p / (g_{T+1}, p_{0:T+1})$ is a constant independent of $N$.

In order to analyze the last sub-step (the resampling), we introduce the $\sigma$-algebra generated by the random variables $\bar{X}_{0:T+1}^{(n)}$, $n = 1, \ldots, N$, and denote it as $\bar{\mathcal{G}}^{T+1,N}$. It is straightforward to obtain that $E[(f, \pi_{0:T+1}^N) | \bar{\mathcal{G}}^{T+1,N}] = (f, \bar{\pi}_{0:T+1}^N)$, hence the conditional expectation of the error becomes
$$E\left[\left|(f, \pi_{0:T+1}^N) - (f, \bar{\pi}_{0:T+1}^N)\right|^p \,\middle|\, \bar{\mathcal{G}}^{T+1,N}\right]^{1/p} \le \frac{\acute{c}_{T+1}^p \|f\|_\infty}{\sqrt{N}},$$
where $\acute{c}_{T+1}^p$ is a positive random variable independent of $N$. As a consequence, taking the expectation over $\bar{X}_{0:T+1}^{(n)}$, $n = 1, \ldots, N$, yields
$$\|(f, \pi_{0:T+1}^N) - (f, \bar{\pi}_{0:T+1}^N)\|_p \le \frac{\bar{\acute{c}}_{T+1}^p \|f\|_\infty}{\sqrt{N}}, \qquad (47)$$
where $\bar{\acute{c}}_{T+1}^p$ is the expected value of $\acute{c}_{T+1}^p$. Finally, combining (46) and (47) by way of the triangle inequality yields
$$\|(f, \pi_{0:T+1}^N) - (f, \pi_{0:T+1})\|_p \le \|(f, \pi_{0:T+1}^N) - (f, \bar{\pi}_{0:T+1}^N)\|_p + \|(f, \bar{\pi}_{0:T+1}^N) - (f, \pi_{0:T+1})\|_p \le \frac{c_{T+1}^p \|f\|_\infty}{\sqrt{N}},$$
where $c_{T+1}^p = \bar{\acute{c}}_{T+1}^p + \breve{c}_{T+1}^p$ is a constant independent of $N$. $\Box$
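Although not part of the proof, the base-case bound is easy to visualize numerically. The following sketch is our own illustration, with a standard Gaussian standing in for $\pi_{0:0}$ and $f = \tanh$ as a bounded test function: it estimates the $L_p$ error of the i.i.d. approximation and exhibits the $1/\sqrt{N}$ decay (quadrupling $N$ roughly halves the error).

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.tanh                                   # bounded test function, ||f||_inf <= 1
p, runs = 2, 2000
ref = np.mean(f(rng.standard_normal(10**7)))  # high-accuracy proxy for (f, pi_{0:0})
for N in (100, 400, 1600, 6400):
    errs = [abs(np.mean(f(rng.standard_normal(N))) - ref) ** p for _ in range(runs)]
    print(N, np.mean(errs) ** (1.0 / p))      # L_p error, decays roughly as 1/sqrt(N)
```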
References

Ali MM, Khompatraporn C, Zabinsky ZB (2005) A numerical evaluation of several stochastic algorithms on selected continuous global optimization problems. Journal of Global Optimization 31:635–672
Anderson BDO, Moore JB (1979) Optimal Filtering. Prentice-Hall, Englewood Cliffs
Bain A, Crisan D (2008) Fundamentals of Stochastic Filtering. Springer
Baker RD (2000) How to correctly calculate discounted healthcare costs and benefits. The Journal of the Operational Research Society 51(7):863–868
Bar-Shalom Y, Blair WD (eds) (2000) Multitarget-Multisensor Tracking: Applications and Advances. Volume III. Artech House, Norwood (MA, USA)
Carpenter J, Clifford P, Fearnhead P (1999) Improved particle filter for nonlinear problems. IEE Proceedings Radar, Sonar and Navigation 146(1):2–7
Crisan D (2001) Particle filters - a theoretical perspective. In: Doucet A, de Freitas N, Gordon N (eds) Sequential Monte Carlo Methods in Practice, Springer, chap 2, pp 17–42
Crisan D, Doucet A (2000) Convergence of sequential Monte Carlo methods. Technical Report, Cambridge University (CUED/F-INFENG/TR381). URL citeseer.ist.psu.edu/crisan00convergence.html
Del Moral P (2004) Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. Springer
Del Moral P, Miclo L (2000) Branching and interacting particle systems. Approximations of Feynman-Kac formulae with applications to non-linear filtering. Lecture Notes in Mathematics, pp 1–145
Del Moral P, Kouritzin MA, Miclo L (2001) On a class of discrete generation interacting particle systems. Electronic Journal of Probability 6(16):1–26
Douc R, Moulines E (2008) Limit theorems for weighted samples with applications to sequential Monte Carlo methods. Annals of Statistics 36(5):2344–2376
Douc R, Cappé O, Moulines E (2005) Comparison of resampling schemes for particle filtering. In: Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, pp 64–69
Doucet A, Godsill S, Andrieu C (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing 10(3):197–208
Doucet A, de Freitas N, Gordon N (2001a) An introduction to sequential Monte Carlo methods. In: Doucet A, de Freitas N, Gordon N (eds) Sequential Monte Carlo Methods in Practice, Springer, chap 1, pp 4–14
Doucet A, de Freitas N, Gordon N (eds) (2001b) Sequential Monte Carlo Methods in Practice. Springer, New York (USA)
Du D, Pardalos PM (eds) (1995) Minimax and Applications. Kluwer Academic Publishers
Forney GD (1973) The Viterbi algorithm. Proceedings of the IEEE 61(3):268–278
Gland FL, Oudjane N (2004) Stability and uniform approximation of nonlinear filters using the Hilbert metric and application to particle filters. Annals of Applied Probability, pp 144–187
Godsill S, Doucet A, West M (2001) Maximum a posteriori sequence estimation using Monte Carlo particle filters. Annals of the Institute of Statistical Mathematics 53(1):82–96
Godsill S, Doucet A, West M (2004) Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association 99(465):156–168
Gordon N, Salmond D, Smith AFM (1993) Novel approach to nonlinear and non-Gaussian Bayesian state estimation. IEE Proceedings-F 140(2):107–113
Gustafsson F, Gunnarsson F, Bergman N, Forssell U, Jansson J, Karlsson R, Nordlund PJ (2002) Particle filters for positioning, navigation and tracking. IEEE Transactions on Signal Processing 50(2):425–437
Heine K, Crisan D (2008) Uniform approximations of discrete-time filters. Advances in Applied Probability 40(4):979–1001
Hu X, Schon T, Ljung L (2008) A basic convergence result for particle filtering. IEEE Transactions on Signal Processing 56(4):1337–1348
Julier SJ, Uhlmann J (2004) Unscented filtering and nonlinear estimation. Proceedings of the IEEE 92(2):401–422
Kalman RE (1960) A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82:35–45
Klaas M, Lang D, de Freitas N (2005) Fast maximum a posteriori inference in Monte Carlo state spaces. In: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics
Liu JS, Chen R (1998) Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association 93(443):1032–1044
Najim K, Ikonen E, Del Moral P (2006) Open-loop regulation and tracking control based on a genealogical decision tree. Neural Computing and Applications 15:339–349
Nyblom P, Olsson PM, Rudol P, Doherty P (2008) Particle filters and MAP sequence estimation for vehicle tracking. URL http://www.scientificcommons.org/42403304
Pankov AR, Platonov EN, Semenikhin KV (2003) Minimax optimization of investment portfolio by quantile criterion. Automation and Remote Control 64(7):1122–1137. DOI http://dx.doi.org/10.1023/A:1024738302885
Rao HIK, Mathews VJ, Park YC (2007) A minimax approach for the joint design of acoustic cross-talk cancellation filters. IEEE Transactions on Audio, Speech and Language Processing 15(8):2287–2298
Ristic B, Arulampalam S, Gordon N (2004) Beyond the Kalman Filter: Particle Filters for Tracking Applications. Artech House, Boston
Robert CP (2007) The Bayesian Choice. Springer
Saha S, Boers Y, Driessen H, Mandal P, Bagchi A (2009) Particle based MAP state estimation: A comparison. In: Proceedings of the 12th International Conference on Information Fusion, IEEE, pp 278–283
Sayed AH, Tarighat A, Khajehnouri N (2005) Network based wireless location. IEEE Signal Processing Magazine 22(4):24–40
Ziemba WT, Vickson RG (eds) (2006) Stochastic Optimization Models in Finance. World Scientific, Singapore