Statistics 34 (2000), pp. 27-51. Reprints available directly from the publisher. Photocopying permitted by license only.
© 2000 OPA (Overseas Publishers Association) N.V. Published by license under the Gordon and Breach Science Publishers imprint. Printed in Malaysia.
ON RECURSIVE ESTIMATION IN INCOMPLETE DATA MODELS

JIAN-FENG YAO

SAMOS, Université Paris I

(Received 24 September 1997; In final form 30 September 1999)

We consider a new recursive algorithm for parameter estimation from an independent incomplete data sequence. The algorithm can be viewed as a recursive version of the well-known EM algorithm, augmented with a Monte-Carlo step which restores the missing data. Based on recent results on stochastic algorithms, we give conditions for the a.s. convergence of the algorithm. Moreover, the asymptotic variance of the estimator is reduced by a simple averaging. An application to finite mixtures is given, together with a simulation experiment.

Keywords: Incomplete data; EM algorithm; recursive estimation; mixtures; stochastic algorithm

AMS Classification Codes: 62F12, 62L20
1. INTRODUCTION

In many statistical models the data are observed only through some deterministic distortion. Examples include censored data, mixture models and the deconvolution problem. The most popular parameter estimator for these models is probably the EM estimator from the seminal paper (Dempster et al., 1977). An up-to-date report on its various extensions, with an extensive discussion, can be found in (Meng and van Dyk, 1997). Nevertheless, on-line parameter estimation for these models has not been fully addressed. Standard recursive estimation based on the observed likelihood (Fabian, 1978) does not apply directly, mainly because this likelihood is ill-defined for most incomplete data sequences.
Recently, Ryden (Ryden, 1998) developed a projection-based recursive likelihood estimator for the mixture problem. However, it is not clear how to extend this approach to other incompletely observed models. We propose in this paper a new recursive algorithm, called RSEM, which can be viewed as a recursive version of the EM algorithm, augmented with a Monte-Carlo data imputation step which restores the missing data.

To state the problem, let us call, following (Dempster et al., 1977), complete data some unobservable random variable X with domain 𝒳. We actually observe a transformation of it, say Y = φ(X), where φ is a known non-random map from 𝒳 into some observation space 𝒴. Typically φ is a "many-to-one" map; we thus call the observations Y incomplete data, since much information is lost through φ. Assume that X (resp. Y) has a density π(x; θ) (resp. g(y; θ)) with respect to some σ-finite measure dx (resp. dy). These densities are linked by

g(y; θ) = ∫_{𝒳_y} π(x; θ) dx,

where 𝒳_y denotes the set {x : φ(x) = y}. It is worth noting that the conditional distribution of X given Y = y, denoted L(X | Y = y; θ), has the density

k(x | y; θ) = 1_{𝒳_y}(x) π(x; θ) / g(y; θ).

Let an i.i.d. sequence Y_1, ..., Y_n, ... of Y be successively observed. For on-line parameter estimation we consider the following recursive algorithm, named RSEM (for recursive SEM-like algorithm). This algorithm was first introduced in (Duflo, 1996). The sequence (γ_n) below (the gains sequence) is any sequence of positive numbers decreasing to 0. The parameter space Θ is some open subset of R^p.

ALGORITHM RSEM:

(i) Pick an arbitrary initial value θ_0 ∈ Θ.
(ii) At each time n ≥ 1, perform the following steps with the new observation Y_{n+1}:
R-step (Restoration): restore the corresponding unobserved complete data by drawing an independent sample X_{n+1} from the conditional distribution L(X | Y = Y_{n+1}; θ_n).
E-step (Estimation): update the estimate by

θ_{n+1} = θ_n + γ_{n+1} ∇_θ log π(X_{n+1}; θ_n).   (1)
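To make the recursion concrete, here is a minimal Python sketch of the RSEM loop for a generic incomplete-data model; sample_conditional and grad_log_complete are hypothetical user-supplied functions (a draw from k(· | y; θ) and the complete-data score ∇_θ log π(x; θ)), and the default gain rule is only one admissible choice.

```python
import numpy as np

def rsem(observations, theta0, sample_conditional, grad_log_complete,
         gain=lambda n: n ** -0.75):
    """Plain (untruncated) RSEM recursion, as a sketch.

    observations        : iterable of incomplete data Y_1, Y_2, ...
    theta0              : initial parameter value (numpy array)
    sample_conditional  : (y, theta) -> a draw X from k(. | y; theta)   [user-supplied]
    grad_log_complete   : (x, theta) -> gradient of log pi(x; theta)    [user-supplied]
    gain                : n -> gamma_n, positive and decreasing to 0
    """
    theta = np.asarray(theta0, dtype=float)
    for n, y_new in enumerate(observations, start=1):
        x_new = sample_conditional(y_new, theta)                       # R-step: restore missing data
        theta = theta + gain(n) * grad_log_complete(x_new, theta)      # E-step: gradient update (1)
        yield theta.copy()
```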
In a non-recursive framework, the above restoration R-step is the key novelty of the so-called SEM, MCEM and SRE algorithms introduced in (Celeux and Diebolt, 1985), (Wei and Tanner, 1990) and (Qian and Titterington, 1991) respectively. For a recent account of SEM-type algorithms, we refer to (Celeux et al., 1996) or (Lavielle and Moulines, 1995). This R-step plays a role similar to the E-step of the EM algorithm: it enables the use of the complete-data likelihood for parameter updating.

This paper is devoted to a detailed study of the RSEM algorithm, including its convergence and asymptotic efficiency. We first show in Section 2 that the RSEM algorithm is a stochastic gradient algorithm which minimizes the Kullback-Leibler divergence from the true model. Unfortunately, for most incomplete data models of interest, it is not clear from the current state of the theory of stochastic algorithms (to our knowledge) whether the RSEM algorithm converges without transformation (see Section 2). Therefore, in Section 3 we truncate the algorithm at randomly varying bounds, in the spirit of (Chen et al., 1988; Chen, 1993). This technique is surprisingly effective and yields the a.s. convergence of the truncated algorithm. Finally, in Section 4, we implement the RSEM algorithm for finite mixtures and report a simulation experiment.
2. DOES THE RSEM ALGORITHM CONVERGE?

From now on, the true parameter value is denoted θ*. Let us define the Kullback-Leibler divergence

K(θ) := ∫_𝒴 g(y; θ*) log [ g(y; θ*) / g(y; θ) ] dy,   (2)

and state some assumptions on the smoothness of the likelihood function.
ASSUMPTION (S)  Θ is an open convex set of R^p. The likelihood θ ↦ log π(x; θ) is twice differentiable on Θ, and such that:
(a) K is twice continuously differentiable on Θ and the identity (2) can be twice differentiated under the integral sign.
(b) For all θ and all y, we have ∇_θ g(y; θ) = ∫_{𝒳_y} ∇_θ π(x; θ) dx.

These conditions are merely what is necessary for a statistical model to be regular. In particular, assumption (S)-(b) implies the following identity, which says that the incomplete-data score function is the average of the unobserved complete-data score function:

∇_θ log g(y; θ) = ∫_{𝒳_y} ∇_θ log π(x; θ) k(x | y; θ) dx.   (3)
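For completeness, here is a short derivation of (3) under (S)-(b), using only the definitions of g and k given above.

```latex
\begin{align*}
\nabla_\theta \log g(y;\theta)
  &= \frac{\nabla_\theta g(y;\theta)}{g(y;\theta)}
   = \frac{1}{g(y;\theta)} \int_{\mathcal{X}_y} \nabla_\theta \pi(x;\theta)\,dx
     && \text{by (S)-(b)}\\
  &= \int_{\mathcal{X}_y} \nabla_\theta \log \pi(x;\theta)\,
        \frac{\pi(x;\theta)}{g(y;\theta)}\,dx
   = \int_{\mathcal{X}_y} \nabla_\theta \log \pi(x;\theta)\, k(x\mid y;\theta)\,dx .
\end{align*}
```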
We shall also use the filtration F = (F_n), with F_0 = {∅, Ω} and F_n = σ(Y_1, ..., Y_n, X_1, ..., X_n) for n ≥ 1, which is the natural filtration associated with the RSEM algorithm.

LEMMA 1  Under assumptions (S)-(a)-(b), the RSEM algorithm is a stochastic gradient algorithm of the form

θ_{n+1} = θ_n − γ_{n+1} [∇K(θ_n) + ε_{n+1}],

where

ε_{n+1} = −{ ∇_θ log π(X_{n+1}; θ_n) − E[∇_θ log π(X_{n+1}; θ_n) | F_n] }

is an F-adapted noise sequence, that is, E(ε_{n+1} | F_n) = 0 for all n.

Proof  By (S)-(a)-(b) we have, writing h(θ) for the conditional expectation E[∇_θ log π(X_{n+1}; θ_n) | F_n] evaluated at θ_n = θ,

h(θ) = ∫_𝒴 ∫_{𝒳_y} ∇_θ log π(x; θ) k(x | y; θ) g(y; θ*) dx dy
     = ∫_𝒴 ∇_θ log g(y; θ) g(y; θ*) dy = −∇K(θ),

where the second equality follows from (3). Therefore, the RSEM algorithm is a stochastic gradient algorithm obtained by perturbation of the gradient system

θ̇ = −∇K(θ).   (4)
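The decomposition stated in Lemma 1 is just an algebraic rearrangement of the update (1); spelled out:

```latex
\begin{align*}
\theta_{n+1}
  &= \theta_n + \gamma_{n+1}\,\nabla_\theta \log \pi(X_{n+1};\theta_n) \\
  &= \theta_n + \gamma_{n+1}\Big( h(\theta_n)
       + \big[\nabla_\theta \log \pi(X_{n+1};\theta_n) - h(\theta_n)\big]\Big) \\
  &= \theta_n - \gamma_{n+1}\big( \nabla K(\theta_n) + \varepsilon_{n+1} \big),
\qquad \text{since } h(\theta_n) = -\nabla K(\theta_n).
\end{align*}
```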
The convergence analysis of such an algorithm is usually carried out in two steps (see e.g. (Duflo, 1996)). First we attempt to stabilize the algorithm, that is, to check whether the sequence (θ_n) a.s. remains in a compact subset of Θ. Once the algorithm is stabilized, there are many methods to ensure the a.s. convergence of (θ_n) to a point (equilibrium) of {∇K = 0}. Recent results based on the so-called ODE method (Kushner and Clark, 1978) are reported in (Fort and Pagès, 1996).

The gradient system (4) has a natural Lyapunov function V = K: for any solution (θ_t)_{t≥0} of the system, the function t ↦ K(θ_t) is decreasing. If in addition K is inf-compact, i.e.,

for all a ∈ R, {K ≤ a} is a compact subset of Θ, i.e., lim_{θ→∂Θ} K(θ) = +∞,   (5)

the whole trajectory (θ_t)_{t≥0} stays in the compact set {K ≤ K(θ_0)}. Therefore this inf-compactness of K is a basic tool for stabilizing such an algorithm. Another widely used requirement for stabilization of this system is the following Lipschitz condition:

∇K is Lipschitz on Θ.   (6)
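The Lyapunov property mentioned above is immediate from the chain rule: for any solution (θ_t) of (4),

```latex
\frac{d}{dt} K(\theta_t)
  = \nabla K(\theta_t)^{\top} \dot{\theta}_t
  = -\,\bigl\| \nabla K(\theta_t) \bigr\|^{2} \;\le\; 0 .
```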
Unfortunately, the Kullback-Leibler divergence K of an incomplete data model often fails to satisfy both conditions (5) and (6). In the example below with censored data, (6) is not fulfilled, while for a finite mixture (cf. Section 4), (5) is no longer satisfied. Consequently, we do not know whether the RSEM algorithm can be stabilized without transformation (our simulation experiment in Section 4 in the mixture case actually suggests that it cannot). Therefore we develop in the next section a truncated version of the RSEM algorithm which does converge.

Example with censored data  Let X be a real-valued variable and Y = min(X, c) with a known constant c ∈ R. Let also a(θ) = P(X < c; θ), and let the reference measures be dx = 1_{x ≤ c} dm(x) + dδ_c(x) and dy = 1_{y < c} dm(y) + dδ_c(y), with m the Lebesgue measure and δ_c the Dirac mass at c. Then

g(y; θ) = π(y; θ) 1_{y < c} + (1 − a(θ)) 1_{y = c}.

For α > 0, define

D_α := {θ ∈ Θ : d(θ, ∂Θ) ≥ 1/α and ‖θ‖ ≤ α}.   (14)
When α → ∞, D_α ↑ Θ. Assume that on D_α the largest eigenvalue λ_max[Γ(θ)] can be bounded by an increasing map φ(α):

sup_{θ ∈ D_α} λ_max Γ(θ) ≤ φ(α).   (15)

Necessarily, φ(α) → ∞ as α → ∞. Let b > 1 and c > 0 be constants. By inversion of φ, we may take a strictly increasing sequence (α_s) such that

φ(α_s) ≤ c s^{2β−1} (log s)^{−b},  for s ≥ 2.   (16)

For example, if φ is in addition continuous, we can take α_s = φ^{−1}(c s^{2β−1}(log s)^{−b}). Finally, let us take, for large enough s, say s ≥ s_0,

C_s := D_{α_s}.   (17)

Thus for n ≥ s_0, since θ_n ∈ C_{σ(n)} ⊂ C_n = D_{α_n}, we have

λ_max Γ(θ_n) ≤ φ(α_n) ≤ c n^{2β−1}(log n)^{−b}.

Hence (13), as well as (T.c), holds. Summarizing, we have

PROPOSITION 2  In the case where λ_max[Γ(θ)] is unbounded on Θ, assume there exists an upper bound φ(α) satisfying (15) for large α. With the compact sets C_s defined by (16)-(17), the series Σ_n γ_{n+1} ε_{n+1} converges a.s. (condition (T.c)).

Such a construction is made explicit in Section 4 for finite mixtures.
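The truncated algorithm itself is defined in Section 3; its mechanism can be sketched, in the spirit of (Chen et al., 1988), as follows: whenever the proposed update leaves the current compact C_{σ(n)}, the iterate is reset to a fixed restart value w and the truncation counter σ(n) is increased, so that ever larger compacts C_s are used. The helper in_compact and the restart value w below are placeholders; this is only an illustrative sketch, not the paper's exact formulation.

```python
import numpy as np

def truncated_rsem(observations, theta0, w, sample_conditional, grad_log_complete,
                   in_compact, gain=lambda n: n ** -0.75):
    """Sketch of RSEM truncated at randomly varying bounds (Chen et al., 1988 style).

    in_compact(theta, s) -> bool : whether theta belongs to the compact C_s  [user-supplied]
    w                            : restart value used after each truncation
    """
    theta = np.asarray(theta0, dtype=float)
    sigma = 0                                      # truncation counter sigma(n)
    for n, y_new in enumerate(observations, start=1):
        x_new = sample_conditional(y_new, theta)                          # R-step
        proposal = theta + gain(n) * grad_log_complete(x_new, theta)      # tentative E-step
        if in_compact(proposal, sigma):
            theta = proposal                       # accept the update
        else:
            theta = np.array(w, dtype=float)       # truncation: restart from w ...
            sigma += 1                             # ... and enlarge the compact
        yield theta.copy(), sigma
```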
4. APPLICATION TO FINITE MIXTURES

This section is devoted to showing the effectiveness of the truncated RSEM algorithm for mixture models. Ever since the moment estimators
introduced in (Pearson, 1894), finite mixture models have been and continue to be widely analysed by statisticians. We refer to (Redner and Walker, 1984) and (Titterington et al., 1985) for the classical background. Recent advances, including Bayesian approaches, stochastic estimators and dimension testing, can be found in (Celeux and Diebolt, 1985; Dacunha-Castelle and Gassiat, 1997; Robert, 1996; Richardson and Green, 1997; Celeux et al., 1996). For recursive estimation, (Ryden, 1998) proposed a recursive likelihood estimator using projections onto some suitable compact subset of Θ. Despite some similarity, several differences separate Ryden's algorithm from the RSEM algorithm. First, the RSEM algorithm is based on the complete-data likelihood, up to a restoration step; second, instead of projections onto a compact subset of Θ as in Ryden's procedure, the RSEM algorithm achieves stabilisation by truncations at randomly varying bounds.

Let m be a positive integer, Ψ an open convex set of R^q and D = {f(·; ψ), ψ ∈ Ψ} a parametric family of univariate densities. A finite mixture with m components is a variable Y with density

g(y; θ) = Σ_{k=1}^{m} a_k f(y; ψ_k),   (18)

where a := (a_k) is a probability distribution (the mixing distribution) on the set {1, ..., m}, and ψ := (ψ_1, ..., ψ_m) collects the component parameters of the mixture. The whole parameter vector is therefore θ = (a, ψ), which belongs to Θ = Δ × Ψ^m, where Δ denotes the open simplex {a_1 > 0, ..., a_{m−1} > 0, a_1 + ... + a_{m−1} < 1}.

A classical way to view such a mixture model as an incomplete data model is to introduce an unobservable location variable Z on the integers {1, ..., m}, the observation Y being drawn conditionally on this location (see (Dempster et al., 1977)). That is, one considers the vector X = (Y, Z) where

Z ∈ {1, ..., m},  with P(Z = k) = a_k,  and  L(Y | Z = k) = f(y; ψ_k) dy.

It is easy to see that the marginal distribution of Y is then exactly the mixture (18). The missing data here is the location variable Z.
4.1. Kullback-Leibler Divergence Between Finite Mixtures

Let us show that, in general, the inf-compactness (5) does not hold for mixtures. First consider the case where one of the mixing probabilities a_k, say a_1, tends to 0. The density g(·; θ) then degenerates to the density of a smaller mixture with m − 1 components. Therefore we may have

lim_{a_1 → 0} K(θ) < ∞,

when, for example, the probability distributions in D are all equivalent. This finite limiting behaviour may still occur when the component parameters ψ go to the boundary of Ψ^m. Consider indeed the family of exponential distributions f(y; ψ) = ψ e^{−ψy}, ψ ∈ (0, ∞), and take m = 2. The parameter vector is θ = (a, ψ_1, ψ_2) with Θ = (0, 1) × (0, ∞) × (0, ∞). It is easy to see that, with (a, ψ_1) fixed,

lim_{ψ_2 → ∞} K(θ) < ∞.
4.2. Set-up of the RSEM Algorithm for Mixtures

Let θ_n = (a_{1,n}, ..., a_{m−1,n}, ψ_{1,n}, ..., ψ_{m,n}) be the current estimate at step n. The restoration R-step draws a sample Z_{n+1} from the conditional distribution L(Z | Y = Y_{n+1}; θ_n):

P(Z = k | Y = Y_{n+1}; θ_n) = a_{k,n} f(Y_{n+1}; ψ_{k,n}) / Σ_{ℓ=1}^{m} a_{ℓ,n} f(Y_{n+1}; ψ_{ℓ,n}),  1 ≤ k ≤ m.   (19)

The p.d.f. of the complete data X = (Y, Z) is π(x; θ) = π(y, z; θ) = a_z f(y; ψ_z). However, the mixing probabilities a := (a_k) have to be kept within the simplex Δ, and this constraint is particularly hard to satisfy in a recursive computation procedure. For instance, the updating rule (1) applied to a_1 gives, at step n, a_{1,n+1} = a_{1,n} + γ_{n+1} (∂/∂a_1) log π(X_{n+1}; θ_n), and we see that even if a_{1,n} ∈ (0, 1), a_{1,n+1} may escape from (0, 1) with positive probability. To overcome this numerical instability, we propose a new parametrisation based on the logit transformation. That is, instead of a = (a_k), we shall use v := (ω_k) defined as

ω_k = log(a_k / a_m),  for 1 ≤ k < m,  and  ω_m = 0.   (20)
Note that v := (ω_k) belongs to R^{m−1} and that this transformation has an easy inversion formula. The new parameters are thus θ = (ω_1, ..., ω_{m−1}, ψ_1, ..., ψ_m). The estimation E-step becomes

ω_{k,n+1} = ω_{k,n} + γ_{n+1} (1_{Z_{n+1}=k} − a_{k,n}),  1 ≤ k < m,
ψ_{k,n+1} = ψ_{k,n} + γ_{n+1} 1_{Z_{n+1}=k} ∇_ψ log f(Y_{n+1}; ψ_{k,n}),  1 ≤ k ≤ m.   (21)
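Putting (19)-(21) together for the two-component Gaussian mixture used in the simulations below, one RSEM iteration can be sketched as follows; this is an illustrative implementation (without the truncation step), with hypothetical function and variable names.

```python
import numpy as np

def rsem_step_gauss2(theta, y_new, gain, rng):
    """One (untruncated) RSEM update for a 2-component Gaussian mixture.

    theta = (omega1, m1, v1, m2, v2) with logit weight omega1 = log(a1/a2),
    component means m_k and variances v_k = sigma^2_k.
    """
    omega1, m1, v1, m2, v2 = theta
    a1 = 1.0 / (1.0 + np.exp(-omega1))           # invert the logit: a1, and a2 = 1 - a1
    a = np.array([a1, 1.0 - a1])
    m = np.array([m1, m2], dtype=float)
    v = np.array([v1, v2], dtype=float)

    # R-step (19): posterior probabilities of Z given Y = y_new, then draw Z
    dens = a * np.exp(-0.5 * (y_new - m) ** 2 / v) / np.sqrt(2 * np.pi * v)
    post = dens / dens.sum()
    z = rng.choice(2, p=post)

    # E-step (21): update omega_1, then the parameters of the drawn component
    omega1 += gain * ((z == 0) - a1)
    resid = y_new - m[z]                                           # residual w.r.t. current mean
    m[z] += gain * resid / v[z]                                    # d/dm_k log f, see (26) below
    v[z] += gain * (-0.5 / v[z] + 0.5 * resid ** 2 / v[z] ** 2)    # d/dsigma^2_k log f, see (26) below
    return np.array([omega1, m[0], v[0], m[1], v[1]])
```

A run would simply iterate this step over the observed sequence (Y_n) with a decreasing gain, as specified next.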
Recall that the final estimates are the averaged values (v̄_n, ψ̄_n) defined in (12).

4.3. A Simulation Experiment

We consider two univariate Gaussian mixtures with two components each, taken from (Redner and Walker, 1984). Thus θ = (ω_1, m_1, σ²_1, m_2, σ²_2) and g(·; θ) = a_1 N(m_1, σ²_1) + (1 − a_1) N(m_2, σ²_2). The two "true" mixtures are

M1 = 0.3 N(3, 1) + 0.7 N(−3, 1)   and   M2 = 0.3 N(1, 1) + 0.7 N(−1, 1).

The first is a well-separated bimodal distribution, while the second is unimodal. Note that the smoothness conditions (S) are met in this Gaussian mixture case. We now describe in detail the set-up of the algorithm and the constants used throughout the simulations.

Initialisation
The algorithm starts with

θ_0 = (0, m_{1,*} + 3/2, σ²_{1,*}/2, m_{2,*} + 3/2, σ²_{2,*}/2),

the same initial value as used in (Redner and Walker, 1984) for their experiments on the EM algorithm. Note that ω_1 = 0 iff a_1 = 1/2.

Gains sequence  The gains sequence is a piecewise constant version of the general rule (8) with exponent β = 3/4: we set, for each n ∈ (100(k − 1), 100k] with integer k ≥ 1,

γ_n = 10 / (100k + 10³)^{3/4}.
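For concreteness, this piecewise constant rule can be written as a small helper (the constants are those just quoted):

```python
import math

def gain(n):
    """Piecewise constant gains: gamma_n = 10 / (100*k + 1000)**0.75 for n in (100(k-1), 100k]."""
    k = math.ceil(n / 100)            # block index k >= 1
    return 10.0 / (100 * k + 1000) ** 0.75
```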
Compacts C_s for truncation  For the sake of simplicity, the description is given in terms of the proportion parameter a_1; the bounds actually used for ω_1 are easily derived from it. Thus C_s takes the product form A_s × M_s × V_s × M_s × V_s, with a_1 ∈ A_s, m_k ∈ M_s and σ²_k ∈ V_s. We use a two-stage set-up for (C_s).

(i) For the first 100 compacts, with 0 ≤ s < 100, a constant step size is used:

A_s = [10^{−1} − s·10^{−4}, 1 − (10^{−1} − s·10^{−4})],
M_s = [−100 − s, 100 + s],
V_s = [10^{−3} − s·10^{−6}, 100 + s].

Note that the starting compact is

C_0 = {0.1 ≤ a_1 ≤ 0.9; |m_k| ≤ 100; 10^{−3} ≤ σ²_k ≤ 100}.   (22)

In particular, the true parameter θ* is believed to belong to C_0. In a real data case, C_0 should be set using a rough preliminary estimate of θ*.

(ii) For the next compacts, with s ≥ 100, we use the construction defined in Proposition 2. Taking into account the believed fact θ* ∈ C_0, this procedure yields

C_s = {1/α_s ≤ a_1 ≤ 1 − 1/α_s; |m_k| ≤ α_s; 1/α_s ≤ σ²_k ≤ α_s},   (23)

with

α_s = 10³ [ (s/100)^{1/2} (log 100 / log s)² ]^{1/8}.   (24)
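The two-stage construction translates directly into code; the sketch below returns the bounds for (a_1, m_k, σ²_k) of the compact C_s, following (22)-(24) (the dictionary representation is just a convenience).

```python
import math

def compact_bounds(s):
    """Bounds of the truncation compact C_s for (a1, m_k, sigma^2_k), cf. (22)-(24)."""
    if s < 100:                                    # stage (i): constant step size
        lo_a = 1e-1 - s * 1e-4
        return {"a1": (lo_a, 1.0 - lo_a),
                "m":  (-100.0 - s, 100.0 + s),
                "v":  (1e-3 - s * 1e-6, 100.0 + s)}
    # stage (ii): alpha_s from (24), then C_s = D_{alpha_s} as in (23)
    alpha = 1e3 * ((s / 100.0) ** 0.5 * (math.log(100.0) / math.log(s)) ** 2) ** 0.125
    return {"a1": (1.0 / alpha, 1.0 - 1.0 / alpha),
            "m":  (-alpha, alpha),
            "v":  (1.0 / alpha, alpha)}
```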
Detailed derivations of these formulae are postponed to Lemma 2 at the end of the section. It is worth noting that α_100 = 10³, so that C_99 and C_100 nearly coincide, which links the two stages up well. Finally, the restart value w used after each truncation is set equal to the starting value: w = θ_0.

For each of the two mixtures M1 and M2, 100 independent runs of RSEM over N = 1000 iterations are generated and then averaged. Tables I and II show, respectively, statistics about the finally minimised Kullback distance K(θ̄_N), the number of truncations σ(N), and the last truncation time T, defined as

T = max{n ≤ N : θ_n + γ_{n+1} ∇_θ log π(X_{n+1}; θ_n) ∉ C_{σ(n)}}.
TABLE I  Mean of the minimised Kullback distances from 100 independent runs of the truncated RSEM algorithm, with their ratio to the starting value

              Mean of K(θ̄_N)   Starting value K(θ_0)   Ratio K(θ̄_N)/K(θ_0)
Mixture M1    0.0538             2.4819                  2.2%
Mixture M2    0.0152             0.2386                  6.3%

TABLE II  Truncation statistics from 100 independent runs of the truncated RSEM algorithm: number of runs with at least one truncation, total number of truncations σ(N) over the 100 runs, and mean of the last truncation time T over the runs with truncations

              Runs actually truncated   Sum of σ(N)   Mean of T
Mixture M1    37                         47            119.1
Mixture M2    91                         265           243.5
From Table I we see that the mixture M2 is more difficult to handle, since its minimisation ratio (6.3%) is about three times larger than that of M1 (2.2%), even though the absolute Kullback distances are smaller. Histograms of the minimised Kullback distances are given in the top row of Figure 2; these distances are highly concentrated near 0. As for the truncation behaviour shown in Table II, the mixture M2 needs more truncations (91 truncated runs against 37, and 265 truncations in total against 47). Also, the last truncation can occur much later (243.5 on average against 119.1). The middle row of Figure 2 shows the corresponding histograms. Note that the maximum number of truncations observed over the 100 runs is 9 for M2 and 4 for M1. Again these histograms are concentrated near 0. Thus it seems that a few truncations are enough to stabilize the algorithm.

Figure 3 shows the evolution of the density estimate g(·; θ̄_n) for times n = 0, 250, 500 and 1000, taken from one of the 100 samples generated for the mixture M1. This sample run has the following characteristics: σ(N) = 1, K(θ̄_N) = 0.076 and T = 69, which are close to the mean values given in Table I. This sample can thus be viewed as showing the mean behaviour of the truncated RSEM algorithm for the mixture M1.
FIGURE 2  Histograms from 100 independent runs for the mixtures M1 (a, c, e) and M2 (b, d, f).
FIGURE 3  Evolution of density estimates for the mixture M1 (dashed lines), for times n = 0, 250, 500 and 1000, from top to bottom and left to right. The corresponding Kullback distances are 2.482, 0.155, 0.104 and 0.076. The true density is shown with solid lines.
An analogous sample run for the mixture M2 is given in Figure 4, with σ(N) = 3, K(θ̄_N) = 0.0188 and T = 69. It is worth noting here that the algorithm starts with a bimodal density far from the true unimodal density, and yields a close approximation after only 250 iterations.

Truncation compacts C_s for a finite mixture of normal variables  We now prove the effectiveness of the barrier compacts C_s defined by (23)-(24). According to the construction of Proposition 2, we take as the set D_α of (14)

D_α = {θ : 1/α ≤ a_1 ≤ 1 − 1/α; |m_k| ≤ α; 1/α ≤ σ²_k ≤ α}.

Recall that we have to bound λ_max(Γ(θ)) on D_α by some increasing map φ(α).

LEMMA 2  For a mixture of two univariate normal variables, with θ = (ω_1, m_1, σ²_1, m_2, σ²_2), and taking into account a rough preliminary estimate θ* ∈ C_0, we can take as upper bound φ, for α ≥ 10³,

φ(α) = 3α⁸.   (25)
Moreover, for s ≥ 100, the barrier compacts C_s defined by (23)-(24) are valid.

Proof  Let us denote by E_{n,θ} and V_{n,θ}, respectively, the conditional expectation and variance given (F_n, θ_n = θ). Recall that Γ(θ) = V_{n,θ}[∇_θ log π(X_{n+1}; θ_n)]. We have

λ_max Γ(θ) ≤ trace Γ(θ) = Σ_j V_{n,θ}[ (∂/∂θ_j) log π(X_{n+1}; θ_n) ],

where θ_j runs over ω_1, the m_k and the σ²_k. For a mixture of two univariate normal variables, we have by (21)

∂/∂ω_1 log π(X_{n+1}; θ_n) = 1_{Z_{n+1}=1} − a_{1,n},
∂/∂m_k log π(X_{n+1}; θ_n) = 1_{Z_{n+1}=k} (Y_{n+1} − m_{k,n}) / σ²_{k,n},
∂/∂σ²_k log π(X_{n+1}; θ_n) = 1_{Z_{n+1}=k} [ −1/(2σ²_{k,n}) + (Y_{n+1} − m_{k,n})² / (2σ⁴_{k,n}) ].   (26)
FIGURE 4  Evolution of density estimates for the mixture M2 (dashed lines), for times n = 0, 250, 500 and 1000, from top to bottom and left to right. The corresponding Kullback distances are 0.2386, 0.0254, 0.0255 and 0.0188. The true density is shown with solid lines.
Straightforward computation shows that, for θ ∈ D_α,

V_{n,θ}[1_{Z_{n+1}=1} − a_{1,n}] ≤ 1,
E_{n,θ}[1_{Z_{n+1}=k} (Y_{n+1} − m_{k,n})² / σ⁴_{k,n}] ≤ 2α²(α² + μ_{2,*}),
E_{n,θ}[1_{Z_{n+1}=k} (−1/(2σ²_{k,n}) + (Y_{n+1} − m_{k,n})²/(2σ⁴_{k,n}))²] ≤ α²/4 + α⁴(α⁴ + 2α²μ_{2,*} + μ_{4,*}),

where μ_{j,*} := E_{θ*}|Y_1|^j. Hence, for α ≥ 1,

trace Γ(θ) ≤ 2α² + 4α²(α² + μ_{2,*}) + 2α⁴(α⁴ + 2α²μ_{2,*} + μ_{4,*}).

It remains to bound the moments μ_{j,*}. If W is a normal variable N(m, σ²), then E|W|^j ≤ 2^{j−1}(|m|^j + 4σ^j) for j ≤ 4. Now we have assumed θ* ∈ C_0, so that by (22), |m_{k,*}| ≤ 100 and 10^{−3} ≤ σ²_{k,*} ≤ 100. It follows that

μ_{2,*} ≤ 10⁵,   μ_{4,*} ≤ 4·10⁹.

Hence, for α ≥ 10³,

2α² + 4α²(α² + μ_{2,*}) + 2α⁴(α⁴ + 2α²μ_{2,*} + μ_{4,*}) ≤ 3α⁸,

and (25) follows.

Finally, to define α_s and C_s = D_{α_s} for s ≥ 100, let us take b = 2 in the rule (16). Since φ is continuous, and since we have chosen the exponent β = 3/4 for the gains (γ_n), we look for α_s satisfying

φ(α_s) = 3α_s⁸ = c s^{1/2} (log s)^{−2},  for large s.

We fix the constant c by the initial condition: for s = 100, α_s = 10³. Hence (24) follows.
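As a side check, the derivatives in (26) can be verified numerically; the sketch below compares them with a central finite difference of the complete-data log-likelihood log π(y, z; θ) = log a_z + log f(y; m_z, σ²_z) under the logit parametrization (the test point and tolerance are arbitrary).

```python
import numpy as np

def log_pi(theta, y, z):
    """Complete-data log-likelihood log a_z + log f(y; m_z, v_z), logit weights."""
    omega1, m1, v1, m2, v2 = theta
    a = np.array([1.0, np.exp(-omega1)]); a = a / a.sum()   # (a1, a2) from omega1
    m, v = (m1, m2)[z], (v1, v2)[z]
    return np.log(a[z]) - 0.5 * np.log(2 * np.pi * v) - 0.5 * (y - m) ** 2 / v

def analytic_score(theta, y, z):
    """The score vector of (26): d/d(omega1, m1, v1, m2, v2) of log_pi."""
    omega1 = theta[0]
    a1 = 1.0 / (1.0 + np.exp(-omega1))
    g = np.zeros(5)
    g[0] = (z == 0) - a1
    k = 1 + 2 * z                                            # index of m_z in theta
    m, v = theta[k], theta[k + 1]
    g[k] = (y - m) / v
    g[k + 1] = -0.5 / v + 0.5 * (y - m) ** 2 / v ** 2
    return g

theta = np.array([0.3, 1.0, 2.0, -1.0, 0.5]); y, z, h = 0.7, 1, 1e-6
num = np.array([(log_pi(theta + h * e, y, z) - log_pi(theta - h * e, y, z)) / (2 * h)
                for e in np.eye(5)])
assert np.allclose(num, analytic_score(theta, y, z), atol=1e-5)
```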
Acknowledgement

I thank Marie Duflo for several helpful discussions on this work.

References

Brandière, O. and Duflo, M. (1996) Les algorithmes stochastiques contournent-ils les pièges? Ann. Inst. Henri Poincaré, 32, 395-427.
Celeux, G., Chauveau, D. and Diebolt, J. (1996) Stochastic versions of the EM algorithm: an experimental study in the mixture case. J. Statist. Computation and Simulation, 55, 287-314.
Celeux, G. and Diebolt, J. (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Computational Statistics Quarterly, 2, 73-82.
Chen, H. (1993) Asymptotically efficient stochastic approximation. Stochastics and Stochastics Reports, 45, 1-16.
Chen, H., Guo, L. and Gao, A. (1988) Convergence and robustness of the Robbins-Monro algorithm truncated at randomly varying bounds. Stochastic Processes and their Applications, 27, 217-231.
Dacunha-Castelle, D. and Gassiat, E. (1997) Testing in locally conic models and applications to mixture models. ESAIM Probab. Statist., 1, 285-317.
Dempster, A., Laird, N. and Rubin, D. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Statist. Soc. B, 39, 1-38.
Duflo, M. (1996) Algorithmes Stochastiques. Springer-Verlag, Berlin.
Fabian, V. (1978) On asymptotically efficient recursive estimation. Ann. of Statist., 6, 854-856.
Fort, J. and Pagès, G. (1996) Convergence of stochastic algorithms: from the Kushner-Clark theorem to the Lyapounov functional method. Adv. Appl. Probab., 28, 1072-1094.
Kushner, H. and Clark, D. (1978) Stochastic Approximation for Constrained and Unconstrained Systems. Springer-Verlag, Berlin.
Lavielle, M. and Moulines, E. (1995) On a stochastic approximation version of the EM algorithm. Technical Report 95.08, Université Paris-Sud.
Meng, X. and van Dyk, D. (1997) The EM algorithm - an old folk-song sung to a fast new tune. J. Roy. Statist. Soc. B, 59, 511-567.
Pearson, K. (1894) Contributions to the mathematical theory of evolution. Phil. Trans. Roy. Soc. A, 185, 71-110.
Polyak, B. T. (1990) New stochastic approximation type procedures. Avtomat. i Telemekh., 7, 98-107. English translation: Automation and Remote Control, 51, 1991.
Polyak, B. T. and Juditsky, A. B. (1992) Acceleration of stochastic approximation by averaging. SIAM J. Control and Optimization, 30, 838-855.
Qian, W. and Titterington, D. (1991) Estimation of parameters in hidden Markov models. Phil. Trans. Roy. Soc. Lond. A, 337, 407-428.
Redner, R. and Walker, H. (1984) Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239.
Richardson, S. and Green, P. J. (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J. Roy. Statist. Soc. B, 59, 731-792.
Robert, C. (1996) Mixtures of distributions: inference and estimation. In: Gilks, W. R., Richardson, S. and Spiegelhalter, D. J. (Eds.), Markov Chain Monte Carlo in Practice, chapter 24, pp. 441-464. Chapman and Hall, London.
Ryden, T. (1998) Asymptotically efficient recursive estimation for incomplete data models using the observed information. Metrika, 44, 119-145.
Titterington, D., Smith, A. and Makov, U. (1985) Statistical Analysis of Finite Mixture Distributions. Wiley, New York.
Wei, G. and Tanner, M. (1990) A Monte-Carlo implementation of the EM algorithm and the poor man's data augmentation algorithm. J. American Statist. Association, 85, 699-704.

Jian-Feng Yao, SAMOS, Université Paris I, 90 rue de Tolbiac, 75634 Paris Cedex, France