IEICE TRANS. INF. & SYST., VOL.E90–D, NO.8 AUGUST 2007


PAPER

Special Section on Image Recognition and Understanding

Pruned Resampling: Probabilistic Model Selection Schemes for Sequential Face Recognition

Atsushi MATSUI†,††a), Simon CLIPPINGDALE†b), and Takashi MATSUMOTO††c), Members

SUMMARY  This paper proposes probabilistic pruning techniques for a Bayesian video face recognition system. The system selects the most probable face model using model posterior distributions, which can be calculated using a Sequential Monte Carlo (SMC) method. A combination of two new pruning schemes at the resampling stage significantly boosts computational efficiency by comparison with the original online learning algorithm. Experimental results demonstrate that this approach achieves better performance in terms of both processing time and ID error rate than a contrasting approach with a temporal decay scheme.
Key words: Sequential Monte Carlo, model comparison, face recognition, pruning, resampling

Manuscript received October 5, 2006.
Manuscript revised February 8, 2007.
† The authors are with Science and Technical Research Laboratories, NHK (Japan Broadcasting Corporation), Tokyo, 157–8510 Japan.
†† The authors are with the Faculty of Science and Engineering, Waseda University, Tokyo, 169–8555 Japan.
a) E-mail: [email protected]
b) E-mail: [email protected]
c) E-mail: [email protected]
DOI: 10.1093/ietisy/e90–d.8.1151

1. Introduction

Human faces in broadcast video exhibit substantial variation in position, size, head pose, facial expression and so on, forcing face recognition systems for video indexing to incorporate flexibility in the database and/or matching algorithms used. The authors have introduced a Bayesian face recognition system [1], [2] which uses deformable template matching [3], [4] and a Sequential Monte Carlo (SMC) online learning algorithm [5], [6]. Although this system can absorb a certain amount of facial deformation, the Monte Carlo approximations require substantial computation.

In this work, we introduce probabilistic approaches to model pruning for the Bayesian face recognition system. We show that two pruning techniques successfully identify unlikely face models and exclude them from the set of candidate face models at an early stage. By contrast, the original algorithm expended much computation on performing Monte Carlo integrations to evaluate the probability of all candidate face models. By discarding some unlikely models at an early stage, the modified system achieves a drastic improvement in processing cost.

In Sect. 2 we briefly review the deformable template matching procedure and similarity function used in our original system. In Sect. 3 we introduce an online learning approach and show how the most probable model can be estimated together with distributions of system parameters.

In Sect. 4 we describe the details of the SMC algorithm. In Sect. 5 we introduce probabilistic pruning schemes that boost computational efficiency, and show experimental results. The paper concludes with a discussion of the results and possible directions for further work.

2. Deformable Template Matching

Our original system [1], [2] evaluates Bayesian posterior probability distributions of face models based on deformable face templates. Each template consists of the normalized coordinates of M = 9 facial feature points, x^A = {x_1^A, ..., x_M^A}, together with features c^A computed by convolutions with Gabor wavelets at each of the feature points. The Gabor wavelet at resolution r and orientation n is a complex exponential grating patch with a 2-D Gaussian envelope:

  g_{rn}(x) = \frac{k_r^2}{\sigma^2} \exp\!\left( -\frac{k_r^2 \|x\|^2}{2\sigma^2} \right) \times \left( e^{i k_{rn}^T x} - e^{-\sigma^2/2} \right), \quad k_{rn} = k_r \begin{pmatrix} \cos(n\pi/N_{orn}) \\ \sin(n\pi/N_{orn}) \end{pmatrix},   (1)

for N_orn = 8 orientations and R = 5 resolutions. This data representation is similar to that used in the Elastic Graph Matching system [3] for face recognition in static images, but the chosen feature points differ, as do the parameters of the Gabor wavelets. Our previous approach [4] applies templates to input video frames and deforms them by shifting the feature points so as to maximize the similarity to the Gabor features in the template. It then computes an overall match score for each deformed template, incorporating a feature similarity term and a penalty related to the deformation, as follows:

  S = 1 - \alpha_f \left( 1 - \frac{\langle c_r^A, c_r^B \rangle}{|c_r^A||c_r^B|} \right) - \alpha_s \frac{\sqrt{E}}{\lambda_r},   (2)

where A denotes the undeformed template and B the deformed feature points on the image; c^A and c^B are feature vectors of Gabor wavelet coefficients, respectively from the template and measured at the deformed feature point positions x^B on the image; E is the deformation energy between the feature points x^A in the template and the deformed feature points x^B on the image, up to a dilation, rotation and shift; \alpha_f and \alpha_s are weights for the feature similarity and spatial deformation terms respectively; and \lambda_r = 2\pi/k_r is the modulation wavelength of the Gabor wavelet at resolution r. In the sequel we will often omit the A and B superscripts where the meaning is clear.
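To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of the Gabor kernel and of the match score at a single resolution. The kernel constants (k0, f, sigma, window size) and the weights alpha_f, alpha_s are illustrative placeholders, not the values used by the system.

```python
import numpy as np

def gabor_kernel(r, n, size=33, k0=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi, N_orn=8):
    """Gabor wavelet of Eq. (1) at resolution r, orientation n.
    k0, f, sigma and size are illustrative choices, not the paper's settings."""
    kr = k0 / f**r                       # wavelength lambda_r = 2*pi/kr grows with r
    theta = n * np.pi / N_orn
    krn = kr * np.array([np.cos(theta), np.sin(theta)])
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xy = np.stack([x, y], axis=-1).astype(float)
    envelope = (kr**2 / sigma**2) * np.exp(-kr**2 * (x**2 + y**2) / (2 * sigma**2))
    carrier = np.exp(1j * xy @ krn) - np.exp(-sigma**2 / 2)   # DC-free complex grating
    return envelope * carrier

def match_score(cA, cB, E, lambda_r, alpha_f=0.5, alpha_s=0.5):
    """Match score of Eq. (2) for one resolution; alpha_f, alpha_s are placeholders.
    cA, cB: complex Gabor coefficient vectors; E: deformation energy."""
    cos_sim = np.vdot(cA, cB).real / (np.linalg.norm(cA) * np.linalg.norm(cB))
    return 1.0 - alpha_f * (1.0 - cos_sim) - alpha_s * np.sqrt(E) / lambda_r
```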



Fig. 1  An overview of video face recognition based on online Bayesian learning.

3. Online Bayesian Learning

Optimizing a target function with penalty terms can be considered as maximizing the Bayesian posterior probabilities of parameters [9]. From the same viewpoint, the most probable model describing faces in input video is defined by the mode of the posterior distributions of face models. Though a parameter set that is optimal for one case may also perform satisfactorily in others, problems can arise if the target probability distribution takes on a more complex form. In general, finding the global maximum of a target function is difficult, and search procedures are prone to falling into local maxima. Moreover, the optimal parameter set obviously depends on unknown input data. Instead of searching for the mode, we estimate both the peak and tails of the probability distributions of parameters using an online learning algorithm. Figure 1 shows an overview of a video face recognition system based on online Bayesian learning.

3.1 Likelihood Function

Consider the situation where data is given as a video sequence. Let y_n be the image data at the nth frame and let Y_n = {y_1, y_2, ..., y_n} be the image data set up to the current frame. Recall that the feature similarity term in Eq. (2) depends on an inner product between two normalized feature vectors. Therefore it is natural to consider as a likelihood function for this directional data a mixture of von Mises-Fisher distributions [8]:

  P(y_n | x_n, B_n, H_j) = \frac{1}{R} \sum_{r=1}^{R} \frac{\exp\!\left( \beta_{n,r} \frac{\langle c_r^A, c_{n,r}^B \rangle}{|c_r^A||c_{n,r}^B|} \right)}{Z_b(\beta_{n,r})},   (3)

  Z_b(\beta) = \frac{(2\pi)^{k/2} I_{k/2-1}(\beta)}{\beta^{k/2-1}},   (4)

where x_n represents a set of feature points at the nth input frame, \beta_{n,r} is a hyper-parameter for resolution r, B_n = (\beta_{n,1}, ..., \beta_{n,R}) is the set of hyper-parameters, and H_j is a face model (hypothesis or template) with identity number j (j = 1, ..., N_p). I_\bullet(\beta) is the modified Bessel function, and k = 2M × N_orn.
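As a sketch under the definitions above, the log of the mixture likelihood in Eqs. (3)-(4) can be evaluated stably with SciPy's exponentially scaled Bessel function; the per-resolution feature vectors and concentrations beta are assumed inputs.

```python
import numpy as np
from scipy.special import ive   # ive(v, b) = iv(v, b) * exp(-b), numerically stable

def log_vmf_mixture(cA, cB, beta, k):
    """Log of Eq. (3): a 1/R mixture of von Mises-Fisher terms over R resolutions.
    cA, cB: per-resolution feature vectors; beta: (R,) concentrations; k = 2*M*N_orn."""
    R = len(cA)
    log_terms = np.empty(R)
    for r in range(R):
        cos_sim = np.dot(cA[r], cB[r]) / (np.linalg.norm(cA[r]) * np.linalg.norm(cB[r]))
        # log Z_b(beta) from Eq. (4); log I_v(b) = log ive(v, b) + b
        log_Zb = (0.5 * k * np.log(2 * np.pi)
                  + np.log(ive(k / 2 - 1, beta[r])) + beta[r]
                  - (k / 2 - 1) * np.log(beta[r]))
        log_terms[r] = beta[r] * cos_sim - log_Zb
    m = log_terms.max()                               # log-sum-exp for the mixture
    return m + np.log(np.exp(log_terms - m).sum()) - np.log(R)
```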

3.2 Parameter Dynamics

Suppose that we are provided with a set of feature point locations for the jth template, x_j^A. We assume a Gaussian predictive distribution for x_n:

  P(x_n | \alpha_n, T_n, H_j) = \frac{\exp\!\left( -\frac{\alpha_n}{2} \delta x_n^T \Lambda_j^{-1} \delta x_n \right)}{Z_a(\alpha_n)}, \quad \delta x_n = T_n^{-1}(x_n) - x_j^A,   (5)

where \alpha_n is a hyper-parameter, \Lambda_j is the covariance matrix of feature point positions for face model H_j, and Z_a(\alpha_n) = \sqrt{(2\pi)^M \det \Lambda_j / \alpha_n} is a normalizing factor. T_n is a rigid linear transformation of the feature point set, consisting of a dilation by a factor r_n, a rotation R_{\theta_n} through an angle \theta_n, and a translation (u_n, v_n)^T. It expresses the rigid component of the mapping from the template feature point set x^A onto the deformed feature point set x^B on the input image plane, leaving the nonrigid deformation:

  T_n(x^A) = r_n R_{\theta_n} x^A + v_n,   (6)
  x^A = (x_1^A, y_1^A, ..., x_M^A, y_M^A)^T,   (7)
  v_n = (u_n, v_n, ..., u_n, v_n)^T.   (8)

For simplicity we will use the symbolic notation T_n to denote the set of mapping parameters (r_n, \theta_n, u_n, v_n), and \Theta_n to denote the set of model parameters and mapping parameters (x_n, \alpha_n, B_n, T_n). The parameters of T_n determine the size, position, and in-plane rotation angle of a face region. In this paper, we describe a sequential learning algorithm to estimate probability distributions of these parameters given a sequence of input images.
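A short sketch of the rigid mapping T_n of Eqs. (6)-(8) and the (unnormalized) Gaussian prediction score of Eq. (5); the feature points are assumed to be stored as an (M, 2) array.

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix R_theta."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

def apply_Tn(xA, r_n, theta_n, u_n, v_n):
    """Eq. (6): T_n(x^A) = r_n * R_theta * x^A + v_n, applied to all M points."""
    return r_n * xA @ rotation(theta_n).T + np.array([u_n, v_n])

def log_prediction(xn, xA, alpha_n, Lambda_inv, r_n, theta_n, u_n, v_n):
    """Log of Eq. (5) up to the normalizer Z_a: Gaussian score of the nonrigid
    residual delta_x_n = T_n^{-1}(x_n) - x_j^A (Lambda_inv is Lambda_j^{-1})."""
    inv = ((xn - np.array([u_n, v_n])) / r_n) @ rotation(theta_n)  # T_n^{-1}
    dx = (inv - xA).ravel()
    return -0.5 * alpha_n * dx @ Lambda_inv @ dx
```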


One way of performing online learning is to consider stochastic updates of the parameters in question. Assuming smooth motion of the target face region, we consider a recursive update P(T_n | T_{n-1}) described by

  r_n = r_{n-1} + \nu_r,        \nu_r \sim N(0, \sigma_r^2),
  \theta_n = \theta_{n-1} + \nu_\theta,   \nu_\theta \sim N(0, \sigma_\theta^2),
  u_n = u_{n-1} + \nu_x,        \nu_x \sim N(0, \sigma_x^2),
  v_n = v_{n-1} + \nu_y,        \nu_y \sim N(0, \sigma_y^2).   (9)

The hyper-parameters \alpha_n and \beta_{n,r} govern the general behavior of the SMC processes, and inappropriate settings may degrade the prediction accuracy. In this paper, we adopt a transition model to perform sequential updates of the probability distributions of \alpha_n and \beta_{n,r}:

  \log \alpha_n = \log \alpha_{n-1} + \nu_\alpha,   \nu_\alpha \sim N(0, \sigma_\alpha^2),   (10)
  \log \beta_{n,r} = \log \beta_{n-1,r} + \nu_\beta,   \nu_\beta \sim N(0, \sigma_\beta^2),   (11)

where the log-normal distribution guarantees positivity of the hyper-parameters.

3.3 Posterior Distribution

Using Bayes' theorem we obtain straightforwardly a recursive formula for P(H_j | Y_n), the posterior distribution of the model H_j given the data:

  P(H_j | Y_n) = \frac{P(y_n | Y_{n-1}, H_j)\, P(H_j | Y_{n-1})}{P(y_n | Y_{n-1})},   (12)

where the one-step model marginal likelihood P(y_n | Y_{n-1}, H_j) is given by the following integrals:

  P(y_n | Y_{n-1}, H_j) = \int P(y_n | \Theta_n, H_j)\, P(\Theta_n | Y_{n-1}, H_j)\, d(\Theta_n),   (13)

  P(\Theta_n | Y_{n-1}, H_j) = \int P(\Theta_n | \Theta_{n-1}, H_j)\, P(\Theta_{n-1} | Y_{n-1}, H_j)\, d(\Theta_{n-1}).   (14)

At each frame, the system outputs the most probable face model, H_MP^{(n)}, which attains the maximum value of the posterior distribution of the model given the sequence of input images: H_MP^{(n)} = arg max_j P(H_j | Y_n).

4. Sequential Monte Carlo Algorithm

4.1 Sequential Importance Sampling (SIS)

We apply a Monte Carlo method to estimate the complex high-dimensional integrals in (13) and (14). Sequential Monte Carlo requires proposal distributions from which one can draw samples \{\Theta_n^{(i,j)}\}_{(i=1,j=1)}^{(N_j, N_p)} for each model H_j by standard methods. The proposal distribution used in this work is given by

  \pi(\Theta_n | H_j) = P(x_n | \alpha_n, T_n, H_j)\, P(\alpha_n | \alpha_{n-1}, \sigma_\alpha) \times P(B_n | B_{n-1}, \sigma_\beta)\, P(T_n | T_{n-1}),   (15)

from which one obtains the following approximations (see Appendix A for details):

  P(\Theta_n | Y_{n-1}, H_j) \approx \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, \delta\!\left( \|\Theta_n - \Theta_n^{(i,j)}\| \right),   (16)

  P(y_n | Y_{n-1}, H_j) \approx \sum_{i=1}^{N_j} P(y_n | \Theta_n^{(i,j)}, H_j)\, \tilde{w}_{n-1}^{(i,j)},   (17)

where the normalized importance weights (see Appendix B for details) are equal to

  \tilde{w}_n^{(i,j)} = \frac{w_n^{(i,j)}}{\sum_{k=1}^{N_p} \sum_{m=1}^{N_n^k} w_n^{(m,k)}},   (18)

  w_n^{(i,j)} = P(y_n | \Theta_n^{(i,j)}, Y_{n-1}, H_j) \times \frac{P(\Theta_n^{(i,j)} | \Theta_{n-1}^{(i,j)})}{\pi(\Theta_n^{(i,j)})} \times w_{n-1}^{(i,j)} = P(y_n | \Theta_n^{(i,j)})\, w_{n-1}^{(i,j)}.   (19)

In the same way, one can evaluate the sequential model posterior distribution by

  P(H_j | Y_n) = \frac{P(y_n | Y_{n-1}, H_j)\, P(H_j | Y_{n-1})}{\sum_{k=1}^{N_p} P(y_n | Y_{n-1}, H_k)\, P(H_k | Y_{n-1})} \approx \frac{P(H_j | Y_{n-1}) \sum_{i=1}^{N_n^j} \tilde{w}_n^{(i,j)}}{\sum_{k=1}^{N_p} P(H_k | Y_{n-1}) \sum_{i=1}^{N_n^k} \tilde{w}_n^{(i,k)}}.   (20)
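The following sketch condenses one SIS iteration (Eqs. (16)-(20)). For readability it keeps each model's weights normalized per model and carries the model posterior as a separate vector, which is equivalent bookkeeping to the global normalization of Eq. (18) under these assumptions; `propose` and `log_lik` are assumed callbacks implementing the dynamics (9)-(11) and the likelihood (3)-(5).

```python
import numpy as np

def sis_step(particles, weights, models, frame, log_lik, propose, model_post):
    """One SIS iteration. particles[j]: list of parameter sets Theta for model j;
    weights[j]: per-model normalized weights from step n-1; model_post: P(H_j | Y_{n-1})."""
    marginal = np.zeros(len(models))
    for j, H in enumerate(models):
        for i in range(len(particles[j])):
            particles[j][i] = propose(particles[j][i])     # dynamics, Eqs. (9)-(11)
            # w_n = P(y_n | Theta_n) * w_{n-1}, Eq. (19)
            weights[j][i] *= np.exp(log_lik(frame, particles[j][i], H))
        marginal[j] = weights[j].sum()                     # Eq. (17): P(y_n | Y_{n-1}, H_j)
        if marginal[j] > 0:
            weights[j] /= marginal[j]                      # renormalize per model
    model_post = model_post * marginal                     # Eq. (12)
    return particles, weights, model_post / model_post.sum()   # Eq. (20)
```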

In the absence of reasons to do otherwise, we set the initial model probabilities uniformly: P(H_j | y_0) = 1/N_p.

4.2 Resampling

As the SIS process evolves, the variance of the importance weights \tilde{w}_n^{(i,j)} increases. After a number of iterations, many of the weights become very small and a few very large. To ameliorate this degeneracy of the distribution of particles, the following procedure is often performed in the SIS scheme.


Resampling (residual resampling):

1. Retain k_j = \lfloor N_n^{total} \times \tilde{w}_n^{(i,j)} \rfloor copies of \Theta_n^{(i,j)}, i = 1, ..., N_n^j, j = 1, ..., N_p. Let m_r = N_n^{total} - k_1 - ... - k_{N_p}.
2. Obtain m_r i.i.d. draws from \{\Theta_n^{(i,j)}\}_{(i=1,j=1)}^{(N_n^j, N_p)} with probabilities proportional to N_n^{total} \times \tilde{w}_n^{(i,j)} - k_j, i = 1, ..., N_n^j, j = 1, ..., N_p.
3. Reset all the weights to \sum_{j=1}^{N_p} \sum_{i=1}^{N_n^j} \tilde{w}_n^{(i,j)} / N_n^{total}.

In many cases, the total number of particles, N_n^total, is set to be constant. This normalizing condition guarantees steady precision of the Monte Carlo approximations if we run the SMC procedure for only one face model. But if we perform these probabilistic selection and extinction processes for multiple candidate models, the result is computational redundancy after the system has built up a given level of confidence from the input sequence about which models are probable and which are wildly improbable. Figure 2 shows the evolution of the posterior distributions of models estimated by our original SMC algorithm [2] given a test video sequence of a person with correct identity (model number) j_true = 10. The posterior probability for j = 10 grows steadily with time (frame number) while the others tend after a certain period to gradually decrease. The tendency for a particular model to accumulate ever larger numbers of particles at the expense of the other models reflects the increasing confidence of the system, with increasing volumes of data, that the given model is correct and the others incorrect. However, much of the associated computation is unnecessary; we do not require so many particles to verify a well-supported hypothesis. In this paper we evaluate some pruning schemes for this model comparison problem in an effort to avoid such redundant computation.
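A minimal sketch of the residual resampling box above, assuming the particles of all models are held in one flat list with globally normalized weights:

```python
import numpy as np

def residual_resample(particles, weights, N_total, rng=np.random.default_rng()):
    """Residual resampling (Sect. 4.2). Step 1: keep floor(N_total * w) copies.
    Step 2: draw the remaining m_r i.i.d. from the residual weights.
    Step 3: reset all weights to equal values (sum of normalized weights / N_total)."""
    scaled = N_total * np.asarray(weights)
    counts = np.floor(scaled).astype(int)
    residual = scaled - counts
    m_r = N_total - counts.sum()
    if m_r > 0:
        extra = rng.choice(len(particles), size=m_r, p=residual / residual.sum())
        counts += np.bincount(extra, minlength=len(particles))
    new_particles = [p for p, c in zip(particles, counts) for _ in range(c)]
    return new_particles, np.full(N_total, 1.0 / N_total)
```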

5. Model Pruning at the Resampling Stage

5.1 Temporal Decay Scheme

Assuming that the SMC particles gradually concentrate on particular face models, an exponential temporal decay might be a reasonable profile to apply to the total number of particles, at least over some time interval:

  N_n^{total} = N_0^{total} \times e^{-t/\tau}, \quad (t = n/30 \ [\mathrm{sec}]).   (21)

This simple method reduces the total number of particles at a rate determined by a decay constant τ > 0. Alternatively, since the system evaluates the degree of confidence of each face model by sequentially estimating the model likelihood and model posterior, it is natural to utilize these criteria for model pruning, as outlined in the following two sections.
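A one-line sketch of the decay schedule in Eq. (21), with the 30 fps frame rate made an explicit parameter:

```python
import math

def decayed_budget(N0_total, n, tau, fps=30.0):
    """Eq. (21): N_n^total = N_0^total * exp(-t / tau), with t = n / fps seconds."""
    return int(N0_total * math.exp(-(n / fps) / tau))
```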

Fig. 2  Evolution of model posterior distributions P(H_j | Y_n) with baseline model (j_true = 10, N_0^total = 30,000).

5.2 Pruning with Likelihood-Based Normalization

It is possible to introduce a normalization at the resampling stage such that the number of particles for the model with the greatest likelihood,

  \hat{j} = \arg\max_k P(y_n | Y_{n-1}, H_k) = \arg\max_k N_n^k,   (22)

remains approximately constant (rather than the total number N_n^{total} = \sum_{k=1}^{N_p} N_n^k in the original scheme), where N_n^k denotes the number of particles belonging to the kth face model after the nth iteration of the SMC procedure. Since the resampling process shares out the total mass of normalized importance weights at the previous iteration, it is natural to set the new total number of particles N_n^total as follows:

  N_n^{total} = N_n^{\hat{j}} \times \frac{\sum_{k=1}^{N_p} \sum_{i=1}^{N_n^k} \tilde{w}_n^{(i,k)}}{\sum_{i=1}^{N_n^{\hat{j}}} \tilde{w}_n^{(i,\hat{j})}} = N_n^{\hat{j}} \times \frac{\sum_{k=1}^{N_p} P(y_n | Y_{n-1}, H_k)}{P(y_n | Y_{n-1}, H_{\hat{j}})}.   (23)

This likelihood-based normalization keeps nearly constant the number of particles (and hence the volume of computation) associated with the most likely model at each step. In so doing, it reduces the amount of attention paid to increasingly unlikely models faster than does the original unmodified resampling scheme. Concentration of hypotheses in holistic search has been shown to give rise to redundant computation in various settings including Sequential Monte Carlo, simulated annealing and genetic algorithms. Himeno introduced the DDC method [10], a similar normalization technique based on a heuristically determined threshold, in the context of genetic algorithms.
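A sketch of Eqs. (22)-(23), where `marginal_lik[k]` stands for the estimated one-step marginal likelihood P(y_n | Y_{n-1}, H_k) and `N_model[k]` for the particle count N_n^k; both are assumed inputs.

```python
import numpy as np

def likelihood_normalized_budget(N_model, marginal_lik):
    """Eq. (23): scale the total particle budget so that the most likely model
    (Eq. (22)) keeps an approximately constant number of particles."""
    j_hat = int(np.argmax(marginal_lik))                  # Eq. (22)
    return int(N_model[j_hat] * marginal_lik.sum() / marginal_lik[j_hat])
```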


5.3 Posterior Threshold Pruning

Following the likelihood, the procedure next estimates posterior distributions for each model. If the tails of the posterior distribution correspond to improbable models, a further possibility is to prune them using something like the following scheme:

  N_n^j = \begin{cases} 0 & \text{if } P(H_j | Y_n) < \Delta\, P(H_{MP}^{(n)} | Y_n), \\ N_{n-1}^j & \text{otherwise}, \end{cases}   (24)

where H_MP^{(n)} is the model with maximal posterior probability conditioned on the input video sequence up to the nth frame, and 0 < ∆ < 1 is a threshold.
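And a sketch of the threshold rule in Eq. (24); `model_post` is the current model posterior vector and `N_model` the per-model particle counts, both assumed inputs:

```python
import numpy as np

def posterior_threshold_prune(N_model, model_post, delta=0.6):
    """Eq. (24): zero the particle count of any model whose posterior falls below
    delta times the MAP model's posterior; the others keep their counts."""
    return np.where(model_post < delta * model_post.max(), 0, np.asarray(N_model))
```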

5.4 Experimental Results

Five resampling schemes were investigated:

(i) baseline scheme (N_n^total constant);
(ii) temporal decay scheme;
(iii) likelihood-based normalization;
(iv) posterior threshold pruning; and
(v) a combination of (iii) and (iv).

(iii), (iv), and (v) are the probabilistic schemes proposed in this paper; (i) and (ii) are provided for comparison. For the baseline and likelihood-based normalization schemes, we tried four different initial settings of the total number of particles: N_0^total = 5000, 10000, 20000, and 30000. For the temporal decay scheme, we fixed N_0^total = 30,000 and tried different settings of the decay constant: τ = 0.1, 0.2, ..., 1.0. For the posterior threshold pruning and combined schemes, we fixed N_0^total = 30,000 and tried different settings of the threshold: ∆ = 0.1, 0.2, ..., 0.9.

Template images and test sequences showed 10 Japanese actors in frontal pose against a blue background. For the template images, each individual showed a neutral facial expression, while for the test sequences, subjects were encouraged to show expressions and to talk freely. We assumed that initial face regions were pre-detected, so that the center position and radius of the face region were already available. Since some variability in the results arises from the random nature of the Monte Carlo process, we performed 20 runs on a single Intel Xeon 3.8 GHz CPU for each setting of each scheme, and took the average of the numerical results.

Fig. 3  ID error rate vs. processing time. Baseline scheme and likelihood-based normalization: total number of particles varies from 5,000 at leftmost point to 30,000 at rightmost. Temporal decay scheme: decay constant varies from 0.1 s at leftmost point to 1.0 s at rightmost. Posterior threshold pruning and combined scheme: threshold varies from 0.1 at rightmost point to 0.9 at leftmost.

Figure 3 shows the averaged experimental results for the baseline scheme, the temporal decay scheme, and the three schemes using the proposed statistical pruning methods. The ID error rate of the baseline scheme decreases monotonically as the total number of particles increases, but at the expense of a commensurate increase in computational cost. The performance curve for the temporal decay scheme with N_0^total = 30000 shows that it successfully reduces processing time, but with a dramatically increased ID error rate for the fastest setting of the decay constant (τ = 0.1 s). The likelihood-based normalization scheme traces a similar performance curve for all initial conditions (N_0^total = 5000, 10000, 20000, 30000), showing a trade-off between processing time and accuracy, but with a somewhat more stable ID error rate than the temporal decay scheme. The posterior threshold pruning, and the combined approach of likelihood-based normalization plus posterior threshold pruning, achieved better performance in terms of both processing time and ID error rate than the temporal decay scheme, but the trade-off relationship is again evident as ∆ approaches 0.9.

Thus all the curves show a trade-off between ID error rate and processing time; however, the two schemes involving posterior threshold pruning also exhibit an inverse relationship (at the right end of Fig. 3) where the ID error rate rises despite increased processing time. In this paper, we evaluate the ID error rate as an average over all frames of all ten input video sequences. Therefore, if some incorrect face models are judged the most probable in the early stages of some of the input sequences, before the system converges to the correct model, the durations of such incorrect states are reflected directly (linearly) in the ID error rate. Since we sequentially update the estimated posterior probabilities of the models using limited computational resources (i.e. particles), these durations depend on the number of particles used and the number of models among which they are distributed. The posterior threshold pruning can be regarded as a dynamical scheme to control the number of models and hence the speed of convergence. Lower thresholds (and less pruning) give rise to both increased processing time and longer persistence of incorrect models prior to convergence. The concavity of the performance curves for (iv) and (v) reflects two independent mechanisms: the boost in convergence speed produced by threshold-based pruning, and the general trade-off relationship. Table 1 shows the best performance on ID error rate achieved by each of the five schemes.

Table 1  Face recognition results (N_0^total = 30,000).

Model              | Baseline      | T.decay (τ=0.9) | P.threshold (∆=0.6)
L.normalization    | No    | Yes   | n/a             | No    | Yes
Error rate [%]     | 4.3   | 6.2   | 4.1             | 2.1   | 2.2
Proc. time [sec]   | 965.0 | 554.4 | 593.0           | 223.2 | 185.8

Fig. 4  Evolution of model posterior distributions P(H_j | Y_n) with temporal decay scheme with τ = 0.9 (N_0^total = 30,000).

Fig. 5  Evolution of model posterior distributions P(H_j | Y_n) with likelihood-based normalization (N_0^total = 30,000).

The combination of the two statistical schemes reduced the ID error rate from 4.3% to 2.2%, and reduced the computational cost by over 80% relative to the baseline scheme. By contrast, the temporal decay scheme achieved only about a 40% reduction relative to the baseline scheme, with a slight improvement in ID error rate. There are various possible reasons for the considerable improvement obtained by the last two schemes. First, they can control the degree of pruning according to the confidence in the most probable model H_MP. When an alternative model achieves significant posterior mass, the proposed algorithm refrains from pruning it inappropriately. On the other hand, when the posterior distribution has a very sharp peak at H_MP, these schemes rapidly discard the unlikely face models associated with the tails of the distribution, allowing much more of the computation (i.e. many more SMC particles) to be focused on the competition between the more probable face models. Second, the criteria applied for model pruning and for the final model selection are consistent: the two proposed schemes, likelihood-based normalization and posterior threshold pruning, depend naturally on the same Bayesian criteria of model posterior probabilities, while the temporal decay scheme has no relationship with the evolving statistics. Figures 4, 5 and 6 show the evolution of the model posterior distributions for the temporal decay scheme, likelihood-based normalization, and posterior threshold pruning respectively, given the same test sequence as used in Fig. 2 (j_true = 10). Figure 7 shows example trajectories of the kurtosis of the model posterior distributions against the total number of particles at the nth frame of the test sequence.

Fig. 6  Evolution of model posterior distributions P(H_j | Y_n) with posterior threshold pruning with ∆ = 0.6 (N_0^total = 30,000).

Fig. 7  Kurtosis of posterior distributions for each scheme against N_n^total: frame number varies from 0 at rightmost point to 29 at leftmost point on each curve.

Figures 6 and 7 show that the proposed statistical approaches successfully identify and prune improbable models at an early stage.


6. Conclusions

We introduced two probabilistic model pruning schemes for Bayesian face recognition based on a Sequential Monte Carlo online learning algorithm. Modifications at the resampling stage of the SMC procedure based on these schemes significantly boosted both computational efficiency and recognition accuracy, the best results being achieved by a combination of the two schemes. Topics remaining for further work include analyzing the behavior of pruned systems on larger databases; automation of the face detection stage and its combination with the SMC algorithm; and extensions to deal with larger image motions and changes in face pose and lighting conditions (the SMC-based change detection algorithm described in [11] may be useful in this regard).

Acknowledgments

We should like to thank Mr. Yohei Nakada (Waseda Univ.), Prof. Yukito Iba (ISM), and Prof. Arnaud Doucet (UBC) for valuable discussions.

References

[1] A. Matsui, S. Clippingdale, F. Uzawa, and T. Matsumoto, "Bayesian face recognition using a Markov chain Monte Carlo method," Proc. 17th International Conference on Pattern Recognition (ICPR2004), vol.3, pp.918–921, Cambridge, U.K., Aug. 2004.
[2] A. Matsui, S. Clippingdale, and T. Matsumoto, "A sequential Monte Carlo method for Bayesian face recognition," Proc. 4th International Workshop on Statistical Pattern Recognition (SPR2006), LNCS 4109, pp.578–586, Hong Kong, China, Aug. 2006.
[3] L. Wiskott, J.M. Fellous, N. Krüger, and C. von der Malsburg, "Face recognition by elastic bunch graph matching," TR96-08, Institut für Neuroinformatik, Ruhr-Universität Bochum, 1996.
[4] S. Clippingdale and T. Ito, "A unified approach to video face detection, tracking and recognition," Proc. 1999 IEEE International Conference on Image Processing (ICIP'99), pp.662–666, Kobe, Japan, Oct. 1999.
[5] A. Doucet, "On sequential simulation-based methods for Bayesian filtering," Technical Report CUED/F-INFENG/TR-310, Cambridge University, 1998.
[6] J.S. Liu, Monte Carlo Strategies in Scientific Computing, Springer-Verlag, New York, 2001.
[7] C. Andrieu, N. de Freitas, A. Doucet, and M.I. Jordan, "An introduction to MCMC for machine learning," Mach. Learn., vol.50, no.1, pp.5–43, Jan. 2003.
[8] K.V. Mardia and P. Jupp, Directional Statistics, 2nd ed., John Wiley & Sons, 2000.
[9] D.J.C. MacKay, Information Theory, Inference, and Learning Algorithms, Cambridge University Press, U.K., 2003.
[10] M. Himeno and R. Himeno, "The niching method for obtaining global optima and local optima in multimodal functions," IEICE Trans. Inf. & Syst. (Japanese Edition), vol.J85-D-I, no.11, pp.1015–1027, Nov. 2002.
[11] T. Matsumoto, "Marginal likelihood change detection: Particle filter approach," Proc. 25th International Workshop on Bayesian Inference and Maximum Entropy for Science and Engineering (MaxEnt2005), AIP Conf. Proc., vol.803, no.1, pp.129–136, San José, USA, Nov. 2005.

Appendix A: Monte Carlo Approximations

Derivation of Eq. (16)

Drawing N_j i.i.d. samples from the proposal distribution,

  \Theta_n^{(i,j)} \sim \pi(\Theta_n | \Theta_{n-1}^{(i,j)}, H_j),   (A·1)

and letting

  \pi(\Theta_n | \Theta_{n-1}, H_j) = P(\Theta_n | \Theta_{n-1}),   (A·2)

we have

  P(\Theta_{n-1} | Y_{n-1}, H_j) \approx \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, \delta\!\left( \|\Theta_{n-1} - \Theta_{n-1}^{(i,j)}\| \right).   (A·3)

Then we get

  P(\Theta_n | Y_{n-1}, H_j) = \int P(\Theta_n | \Theta_{n-1}, H_j)\, P(\Theta_{n-1} | Y_{n-1}, H_j)\, d(\Theta_{n-1})
   \approx \int \pi(\Theta_n | \Theta_{n-1}, H_j) \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, \delta\!\left( \|\Theta_{n-1} - \Theta_{n-1}^{(i,j)}\| \right) d(\Theta_{n-1})
   = \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)} \int \pi(\Theta_n | \Theta_{n-1}, H_j)\, \delta\!\left( \|\Theta_{n-1} - \Theta_{n-1}^{(i,j)}\| \right) d(\Theta_{n-1})
   = \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, \pi(\Theta_n | \Theta_{n-1}^{(i,j)}, H_j)
   \approx \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, \delta\!\left( \|\Theta_n - \Theta_n^{(i,j)}\| \right).   (A·4)

Derivation of Eq. (17)

Putting together Eq. (13) and Eq. (A·4) we get

  P(y_n | Y_{n-1}, H_j) = \int P(y_n | \Theta_n, H_j)\, P(\Theta_n | Y_{n-1}, H_j)\, d(\Theta_n)
   \approx \int P(y_n | \Theta_n, H_j) \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, \delta\!\left( \|\Theta_n - \Theta_n^{(i,j)}\| \right) d(\Theta_n)
   = \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)} \int P(y_n | \Theta_n, H_j)\, \delta\!\left( \|\Theta_n - \Theta_n^{(i,j)}\| \right) d(\Theta_n)
   = \sum_{i=1}^{N_j} \tilde{w}_{n-1}^{(i,j)}\, P(y_n | \Theta_n^{(i,j)}, H_j).   (A·5)


Appendix B: Derivation of Importance Weights

We begin by using the Bayes rule for the joint posterior distributions of a series of parameters:

  P(\Theta_{0:n} | y_{1:n}, H_j) = \frac{P(y_n | \Theta_{0:n}, y_{1:n-1}, H_j)\, P(\Theta_{0:n} | y_{1:n-1}, H_j)}{P(y_n | y_{1:n-1}, H_j)},   (A·6)

where \Theta_{0:n} denotes \{\Theta_0, ..., \Theta_n\}. Using (3) and applying the identity

  P(\Theta_{0:n} | y_{1:n-1}, H_j) = P(\Theta_n | \Theta_{n-1})\, P(\Theta_{0:n-1} | y_{1:n-1}, H_j)

to the numerator of (A·6), we have

  P(\Theta_{0:n} | y_{1:n}, H_j) = \frac{P(y_n | \Theta_n, H_j)\, P(\Theta_n | \Theta_{n-1})}{P(y_n | y_{1:n-1}, H_j)} \times P(\Theta_{0:n-1} | y_{1:n-1}, H_j),   (A·7)

which represents the joint posterior distribution dynamics. The initial condition is specified by

  P(\Theta_{0:1} | y_{1:1}, H_j) := \frac{P(y_1 | \Theta_1, H_j)\, P(\Theta_1 | \Theta_0)}{P(y_1 | H_j)}\, P(\Theta_0 | H_j),   (A·8)

where P(\Theta_0 | H_j) is the initial distribution at t = 0. There are at least two desirable properties of the proposal distribution. First, the i.i.d. draws

  \Theta_{0:n} \sim \pi(\Theta_{0:n} | y_{1:n}, H_j)   (A·9)

can be obtained by standard methods. Second, the draws can be sequentially obtained. One possible sequential proposal distribution can assume the following form:

  \pi(\Theta_{0:n} | y_{1:n}, H_j) = \pi(\Theta_n | \Theta_{0:n-1}, y_n, H_j) \times \pi(\Theta_{0:n-1} | y_{1:n-1}, H_j).   (A·10)

In this manner, entire trajectory samples need not be redrawn each time. Using this proposal distribution, the ratio w_{0:n} between the joint posterior (A·7) and the proposal (A·10) also becomes sequential:

  w_{0:n} := \frac{P(\Theta_{0:n} | y_{1:n}, H_j)}{\pi(\Theta_{0:n} | y_{1:n}, H_j)} = \frac{P(y_n | \Theta_n, H_j)\, P(\Theta_n | \Theta_{n-1})}{P(y_n | y_{1:n-1}, H_j)\, \pi(\Theta_n | \Theta_{0:n-1}, y_n, H_j)} \times w_{0:n-1}.   (A·11)

Assume that the draws

  \Theta_{0:n-1}^{(i,j)} \sim \pi(\Theta_{0:n-1} | y_{1:n-1}, H_j)   (A·12)

are available at the previous time step. Then the sequential proposal (A·10) is used to draw the samples

  \Theta_n^{(i,j)} \sim \pi(\Theta_n | \Theta_{0:n-1}^{(i,j)}, y_n, H_j).   (A·13)

In the experiments reported in Sect. 4 and 5, our proposal distribution is selected from among several possible ones as

  \pi(\Theta_n | \Theta_{0:n-1}, H_j) = P(\Theta_n | \Theta_{n-1}).   (A·14)

In other words, it is equal to the parameter/hyper-parameter dynamics; therefore, the importance weight dynamics is significantly simplified to

  w_{0:n}^{(i,j)} = \frac{P(y_n | \Theta_n^{(i,j)}, H_j)}{P(y_n | y_{1:n-1}, H_j)}\, w_{0:n-1}^{(i,j)}.   (A·15)

Through a normalization procedure

  \tilde{w}_{0:n}^{\#(i,j)} := \frac{w_{0:n}^{(i,j)}}{\sum_{m=1}^{N_n^j} w_{0:n}^{(m,j)}},   (A·16)

the denominator of (A·15) disappears. This is an important property of the present approach. For the sequential Monte Carlo approximation of the filtering distribution P(\Theta_n | y_{1:n}, H_j), the preceding trajectory samples \Theta_{0:n-1}^{(i,j)} included in \Theta_{0:n}^{(i,j)} can be discarded (see (A·3)). Then we get

  w_n^{(i,j)} = P(y_n | \Theta_n^{(i,j)})\, w_{n-1}^{(i,j)}.   (A·17)

By defining

  \tilde{w}_n^{\#(i,j)} := \frac{w_n^{(i,j)}}{\sum_{m=1}^{N_n^j} w_n^{(m,j)}}   (A·18)

and

  \tilde{w}_n^{(i,j)} := \frac{\tilde{w}_n^{\#(i,j)}}{\sum_{k=1}^{N_p} \tilde{w}_n^{\#(i,k)}},   (A·19)

we will approximate the model posterior by

  P(H_j | Y_n) = \frac{P(y_n | Y_{n-1}, H_j)\, P(H_j | Y_{n-1})}{\sum_{k=1}^{N_p} P(y_n | Y_{n-1}, H_k)\, P(H_k | Y_{n-1})} \approx \frac{P(H_j | Y_{n-1}) \sum_{i=1}^{N_n^j} \tilde{w}_n^{(i,j)}}{\sum_{k=1}^{N_p} P(H_k | Y_{n-1}) \sum_{i=1}^{N_n^k} \tilde{w}_n^{(i,k)}}.   (A·20)


Atsushi Matsui received the B.E. and M.E. degrees in Electrical Engineering from Waseda University, Tokyo, Japan in 1994 and 1996 respectively. He joined Japan Broadcasting Corporation (NHK) in 1996. Since 1998, he has been with NHK Science and Technical Research Laboratories, engaged in research on speech recognition and face recognition. He is a member of the Institute of Image Information and Television Engineers of Japan (ITE). He currently serves on the IEICE Technical Committee on Pattern Recognition and Media Understanding.

Simon Clippingdale received the B.Sc. Honours degree in Electronic and Electrical Engineering in 1982 from the University of Birmingham, U.K. and the Ph.D. degree in Computer Science in 1988 from the University of Warwick, U.K. He was a Japanese Government Science & Technology Agency (STA) Research Fellow at NHK Science and Technical Research Laboratories in 1990–1991, and after lecturing at the University of Warwick, joined NHK in 1996. Since then he has been with NHK Science and Technical Research Laboratories, pursuing research on image recognition and vision. He is currently a Senior Research Engineer, and is a member of the Institute of Image Information and Television Engineers of Japan (ITE). He has served the IEICE Technical Committee on Pattern Recognition and Media Understanding.

Takashi Matsumoto received his BSEE from Waseda University, Tokyo, Japan, MS in Applied Mathematics from Harvard University, Cambridge, MA, and PhD in Electrical Engineering from Waseda University, Tokyo, Japan. Currently he is a professor at the Graduate School of Science and Engineering, Waseda University, Tokyo, Japan. His current interests include sequential Monte Carlo implementations of online Bayesian learning with applications to practical problems such as change detection, event detection, face detection/tracking/recognition, biometric person authentication, online pattern classification, and reinforcement learning. He also researches sequential Monte Carlo-based boosting with application to object detection, Monte Carlo-based dynamical system predictions with application to biological data, as well as Bayesian nonparametrics with application to clustering problems. He has coauthored several books, including Bifurcations, Springer-Verlag, 1993, and Frontiers of Statistical Science Vol. 4, Iwanami Shoten, 2005. Dr. Matsumoto held visiting positions at UC Berkeley (1977–1979) and Cambridge University, UK (2003–2004). He has served as a member of the Editorial Board of the Proceedings of the IEEE, and a Guest Editor for the Proceedings of the IEEE.