Computation of Standard Errors for Maximum-likelihood Estimates in Hidden Markov Models

Tero Aittokallio

Department of Mathematics, University of Turku

Esa Uusipaikka

Department of Statistics, University of Turku

Turku Centre for Computer Science TUCS Technical Report No 379 November 2000 ISBN 952-12-0760-4 ISSN 1239-1891

Abstract

Explicit computation of the score vector and the observed information matrix in hidden Markov models is described. With the help of the information matrix, Wald's confidence intervals can be formed for the model parameters. Finite sample properties of the maximum-likelihood estimator and its standard error are investigated by means of simulation studies. We compare the confidence levels of intervals based on two model estimation methods. The problems in model estimation due to the multimodal nature of the likelihood surface are demonstrated and discussed as well. Moreover, using the same approach one can investigate the uncertainty of the classification procedure based on the likelihood values in recognition problems.

Keywords: hidden Markov model, parameter estimation, confidence level, finite sample properties

TUCS Research Group

Algorithmics and Biomathematical Research Groups

1 Introduction

A hidden Markov model (HMM), also known as a probabilistic function of a Markov chain, a Markov source, or a Markov regime model, is a discrete-time stochastic process consisting of an underlying finite-state Markov chain (the state sequence) and a sequence of random variables whose distributions depend on the state sequence only through the current state (the observation sequence). The state sequence is not observable, and hence all conclusions about the process must be made using only the observation sequence. The basic theory of HMMs was introduced already in the late 1960s by Baum and Petrie (1966), and the model has since been used extensively in many areas such as speech processing (Baker, 1975; Juang and Rabiner, 1991), recognition of handwritten words (Kundu et al., 1989), and modeling and analysis of DNA and protein sequences (Churchill, 1989; Eddy, 1998).

In applications of HMMs, likelihood functions and estimates of the model parameters have been routinely computed. However, more advanced statistical inference, such as confidence intervals, has been largely ignored. As an exception, Klein et al. (1997) presented the general form of Wald's confidence interval in an ion channel application, but they did not show formulas for its computation or any simulation results. The objective of this paper is to scrutinize in detail the computation of Wald's confidence intervals for the parameters of an HMM. First, we explicitly give recursive formulas for the computation of the score vector and the observed information matrix, which are employed in the computation of the intervals. Second, we compare intervals based on two point estimation methods, namely the pure Baum-Welch algorithm and a modified algorithm that makes use of a simple global optimization procedure. The comparison is performed with simulation studies for the two intervals. Although we formulate the HMM with continuous observation densities in general form, the simulations are presented only for left-to-right models.

2 Computation of likelihood quantities

In this section, we briefly review the basic theory of hidden Markov models, i.e., the notation and assumptions of the model, the evaluation of the likelihood function, and the estimation of the model parameters. For a more detailed discussion of the subject, see e.g. the tutorial by Rabiner (1989). In addition to the basic inference procedures, we also supplement the theory by giving recursive formulas for the computation of the score vector and the observed information matrix of the likelihood function.

2.1 The hidden Markov model

While in ordinary Markov models each state corresponds to an observable event, in HMMs the observation is a probabilistic function of the current state. Thus, an HMM can be considered a hierarchical statistical model with an underlying state process, which is not observable but can only be observed through another stochastic process that produces the sequence of observations. The HMM is completely characterized by the distributions for the initial state, the state transition probabilities and the observation probabilities, and by two assumptions concerning the state process and the observation process.

Let us denote the N individual states in which the state process may be as s = \{s_1, s_2, \ldots, s_N\}, and let q_t be the state of the process at time t. At time t = 1 the state q_1 is specified by the initial state distribution

    \pi_i = P(q_1 = s_i), \qquad 1 \le i \le N.

The state sequence is assumed to be a Markov chain where the state transition probabilities from state i to state j,

    a_{ij} = P(q_t = s_j \mid q_{t-1} = s_i), \qquad 1 \le i, j \le N, \; t \ge 2,

are assumed to be time-homogeneous. Now A = (a_{ij})_{i,j=1}^{N} is the state transition probability matrix whose entries satisfy the conditions

    a_{ij} \ge 0, \qquad 1 \le i, j \le N,

and

    \sum_{j=1}^{N} a_{ij} = 1, \qquad 1 \le i \le N,                    (1)

i.e., A is a stochastic matrix.

We consider a sample of K observation sequences y^k = \{y_1^k, y_2^k, \ldots, y_{T_k}^k\}, where T_k is the number of observations in the kth sequence. The observations of each observation sequence are assumed to be conditionally independent given the state sequence. Although the original theory of HMMs and a large number of applications are presented using discrete observation distributions, we focus in this paper only on continuous observation distributions. More precisely, the observation in state s_i at time t is assumed to be generated from the normal distribution

    f_i(y_t^k) = \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[-\frac{(y_t^k - \mu_i)^2}{2\sigma_i^2}\right], \qquad 1 \le i \le N, \; 1 \le k \le K, \; t \ge 1.    (2)

The HMM is specified by the values of the parameters of the probability measures defined above. In the following, the vector of these values is denoted by \theta, which comprises M = N(N-1) + 2N + N - 1 = N(N+2) - 1 elements. It follows from the Markovian property and from the conditional independence that the density function of the kth observation sequence is given by

    p(y^k; \theta) = \sum_{i_1, \ldots, i_{T_k}} \pi_{i_1} f_{i_1}(y_1^k) \prod_{t=2}^{T_k} a_{i_{t-1} i_t} f_{i_t}(y_t^k), \qquad 1 \le k \le K,    (3)

where all possible state sequences i_1, \ldots, i_{T_k} are taken into account. We further assume that the observation sequences y = \{y^1, y^2, \ldots, y^K\} are mutually independent, and thus the density (or likelihood) function of y (or \theta) is the product

    L(\theta; y) = p(y; \theta) = \prod_{k=1}^{K} p(y^k; \theta).

2.2 Recursive formulas for likelihood, score vector and observed information matrix

The probability density function of y could be straightforwardly evaluated using equation (3). This computation, however, involves on the order of 2 T_k N^{T_k} operations. Fortunately, there exists an alternative technique that avoids the exponential growth in computation, namely the forward-backward procedure due to Baum and his colleagues (1970). First, let us define the forward and backward probabilities as

    \alpha_t^k(i) = P(y_1^k, y_2^k, \ldots, y_t^k, q_t = s_i \mid \theta)

and

    \beta_t^k(i) = P(y_{t+1}^k, y_{t+2}^k, \ldots, y_{T_k}^k \mid q_t = s_i, \theta).

The forward probability is the probability of the partial observation sequence up to time t and of the state process being in state s_i at time t. Similarly, the backward probability is the probability of the partial observation sequence from time t + 1 to the end of the sequence, conditioned on the state process being in state s_i at time t. The procedure uses the following recursive formulas to compute the forward probabilities,

    \alpha_1^k(i) = \pi_i f_i(y_1^k), \qquad 1 \le i \le N,

    \alpha_{t+1}^k(i) = \left[ \sum_{j=1}^{N} \alpha_t^k(j)\, a_{ji} \right] f_i(y_{t+1}^k), \qquad 1 \le t \le T_k - 1, \; 1 \le i \le N,

and likewise the backward probabilities,

    \beta_{T_k}^k(i) = 1, \qquad 1 \le i \le N,

    \beta_t^k(i) = \sum_{j=1}^{N} a_{ij}\, f_j(y_{t+1}^k)\, \beta_{t+1}^k(j), \qquad t = T_k - 1, \ldots, 1, \; 1 \le i \le N.

Using the forward and backward probabilities, the density (3) can be evaluated by

    p(y^k; \theta) = \sum_{i=1}^{N} \alpha_t^k(i)\, \beta_t^k(i), \qquad 1 \le k \le K,

for each 1 \le t \le T_k. In particular, setting t = T_k gives

    p_k := p(y^k; \theta) = \sum_{i=1}^{N} \alpha_{T_k}^k(i), \qquad 1 \le k \le K.    (4)

From the recursive formulas for the forward probabilities one can also derive recursive formulas for the computation of the first and second derivatives of \alpha_t^k(i) with respect to the model parameters. The derivatives are needed in the computation of the score vector and the observed information matrix of the log-likelihood function

    l(\theta; y) = \ln L(\theta; y) = \sum_{k=1}^{K} \ln p_k.    (5)

If we denote the model parameters by \theta_i (1 \le i \le M), we get by differentiating (5) and (4)

    \frac{\partial l(\theta; y)}{\partial \theta_i} = \sum_{k=1}^{K} \frac{1}{p_k} \frac{\partial p_k}{\partial \theta_i}, \qquad 1 \le i \le M,    (6)

where

    \frac{\partial p_k}{\partial \theta_i} = \sum_{s=1}^{N} \frac{\partial \alpha_{T_k}^k(s)}{\partial \theta_i}, \qquad 1 \le i \le M, \; 1 \le k \le K.    (7)

The Appendix contains all details for the computation of \partial \alpha_{T_k}^k(s)/\partial \theta_i in the manifold defined by the stochastic constraints (1). The score vector of l(\theta; y) is

    \nabla l(\theta; y) = \left( \frac{\partial l(\theta; y)}{\partial \theta_i} \right)_{i=1}^{M}.    (8)

From (6) and (7) we get further

    \frac{\partial^2 l(\theta; y)}{\partial \theta_i \partial \theta_j} = \sum_{k=1}^{K} \left[ \frac{1}{p_k} \frac{\partial^2 p_k}{\partial \theta_i \partial \theta_j} - \frac{1}{p_k^2} \frac{\partial p_k}{\partial \theta_i} \frac{\partial p_k}{\partial \theta_j} \right], \qquad 1 \le i, j \le M,

where

    \frac{\partial^2 p_k}{\partial \theta_i \partial \theta_j} = \sum_{s=1}^{N} \frac{\partial^2 \alpha_{T_k}^k(s)}{\partial \theta_i \partial \theta_j}, \qquad 1 \le i, j \le M, \; 1 \le k \le K.

See again the Appendix for the computation of \partial^2 \alpha_{T_k}^k(s)/\partial \theta_i \partial \theta_j. The Hessian of l(\theta; y) is

    H(\theta; y) = \left( \frac{\partial^2 l(\theta; y)}{\partial \theta_i \partial \theta_j} \right)_{i,j=1}^{M}.

Finally, the Hessian gives the observed information matrix I and the approximate covariance matrix C of the log-likelihood function by

    I(\theta; y) = -H(\theta; y)  and  C(\theta; y) = I(\theta; y)^{-1}.    (9)
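Concretely, the forward recursion and the termination step (4) translate into a few lines of code. The following is a minimal sketch (plain Python, with our own naming; no scaling is applied, so it is suitable only for short sequences such as those used in Section 3):

```python
import math

def normal_pdf(y, mu, var):
    # Observation density (2): univariate normal with mean mu and variance var.
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def forward_likelihood(y, pi, A, mu, var):
    """Evaluate p(y^k; theta) by the forward recursion.
    y: one observation sequence, pi: initial distribution,
    A: N x N transition matrix, mu/var: per-state normal parameters."""
    N = len(pi)
    # Initialization: alpha_1(i) = pi_i * f_i(y_1)
    alpha = [pi[i] * normal_pdf(y[0], mu[i], var[i]) for i in range(N)]
    # Induction: alpha_{t+1}(i) = [sum_j alpha_t(j) a_ji] * f_i(y_{t+1})
    for t in range(1, len(y)):
        alpha = [sum(alpha[j] * A[j][i] for j in range(N))
                 * normal_pdf(y[t], mu[i], var[i]) for i in range(N)]
    # Termination (4): p_k = sum_i alpha_{T_k}(i)
    return sum(alpha)
```

For longer sequences the forward probabilities underflow, and the scaling discussed by Rabiner (1989) becomes necessary.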

2.3 Maximum-likelihood estimates

Given a sample of observation sequences y, the estimation of the model parameters \theta is by far the most difficult problem of HMMs. Naturally there exist several possible optimization criteria (see e.g. Ephraim et al., 1989; Juang and Rabiner, 1990; Kwong et al., 1998), but we focus in this section on the most popular one: we choose as the objective function the log-likelihood function l(\theta; y) in (5), subject to the stochastic constraints for A and \pi. The maximum-likelihood estimates (MLEs) are computed using the iterative Baum-Welch algorithm, originally introduced by Baum et al. (1970), and later extended to multiple observation sequences by Levinson et al. (1983) and to continuous observation distributions by Juang et al. (1986). The algorithm is based on the same ideas as the EM algorithm of Dempster et al. (1977). However, it should be kept in mind that the estimates (fixed points) produced by the Baum-Welch reestimation procedure originate from the critical points (local maxima or saddle points) of the log-likelihood surface, i.e., only the vanishing of the score vector in (8), computed in the interior of the manifold (1), can be guaranteed.

Given the observation sequences y = \{y^1, y^2, \ldots, y^K\} and denoting the new (reestimated) model by \bar\theta, the Baum-Welch reestimation formulas can be stated in terms of the forward and backward probabilities in the following way:

    \bar a_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{p_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, a_{ij}\, f_j(y_{t+1}^k)\, \beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{p_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\, \beta_t^k(i)}, \qquad 1 \le i, j \le N,

    \bar\mu_i = \frac{\sum_{k=1}^{K} \frac{1}{p_k} \sum_{t=1}^{T_k} \alpha_t^k(i)\, \beta_t^k(i)\, y_t^k}{\sum_{k=1}^{K} \frac{1}{p_k} \sum_{t=1}^{T_k} \alpha_t^k(i)\, \beta_t^k(i)}, \qquad 1 \le i \le N,

    \bar\sigma_i^2 = \frac{\sum_{k=1}^{K} \frac{1}{p_k} \sum_{t=1}^{T_k} \alpha_t^k(i)\, \beta_t^k(i)\, (y_t^k - \bar\mu_i)^2}{\sum_{k=1}^{K} \frac{1}{p_k} \sum_{t=1}^{T_k} \alpha_t^k(i)\, \beta_t^k(i)}, \qquad 1 \le i \le N,

    \bar\pi_i = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{p_k}\, \alpha_1^k(i)\, \beta_1^k(i), \qquad 1 \le i \le N.

Note that the estimates satisfy the stochastic constraints for A and for \pi at each iteration. It is possible to show that always l(\bar\theta; y) > l(\theta; y), except if \theta is at a critical point of the log-likelihood function, in which case \bar\theta = \theta. If we iteratively replace \theta by \bar\theta, we increase the value of the log-likelihood at each iteration until some critical point \hat\theta is reached. Even though we can always discard the saddle points by examining whether the observed information matrix I(\hat\theta; y) is positive definite, there usually exist multiple local maxima. The local maximum to which the algorithm converges depends on the initial values of the model parameters.
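As an illustration, one sweep of the reestimation formulas over the K sequences might be sketched as follows. This is a plain unscaled Python sketch under the short-sequence assumption of Section 3; all names are ours, and the gamma quantities are simply the normalized products alpha_t(i) beta_t(i) / p_k:

```python
import math

def normal_pdf(y, mu, var):
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def baum_welch_step(seqs, pi, A, mu, var):
    """One Baum-Welch reestimation sweep over K observation sequences."""
    N = len(pi)
    num_a = [[0.0] * N for _ in range(N)]
    den_a = [0.0] * N           # denominator of a_ij update (t = 1..T_k-1)
    num_mu = [0.0] * N
    den = [0.0] * N             # denominator of mu/sigma^2 updates (t = 1..T_k)
    pi_new = [0.0] * N
    gammas = []                 # per-sequence posteriors, reused for sigma^2
    for y in seqs:
        T = len(y)
        f = [[normal_pdf(y[t], mu[i], var[i]) for i in range(N)] for t in range(T)]
        # Forward and backward recursions (Section 2.2).
        alpha = [[pi[i] * f[0][i] for i in range(N)]]
        for t in range(1, T):
            alpha.append([sum(alpha[t - 1][j] * A[j][i] for j in range(N)) * f[t][i]
                          for i in range(N)])
        beta = [[1.0] * N for _ in range(T)]
        for t in range(T - 2, -1, -1):
            beta[t] = [sum(A[i][j] * f[t + 1][j] * beta[t + 1][j] for j in range(N))
                       for i in range(N)]
        p = sum(alpha[T - 1])   # p_k from (4)
        g = [[alpha[t][i] * beta[t][i] / p for i in range(N)] for t in range(T)]
        gammas.append((y, g))
        for i in range(N):
            pi_new[i] += g[0][i] / len(seqs)
            den_a[i] += sum(g[t][i] for t in range(T - 1))
            den[i] += sum(g[t][i] for t in range(T))
            num_mu[i] += sum(g[t][i] * y[t] for t in range(T))
            for j in range(N):
                num_a[i][j] += sum(alpha[t][i] * A[i][j] * f[t + 1][j] * beta[t + 1][j]
                                   for t in range(T - 1)) / p
    A_new = [[num_a[i][j] / den_a[i] if den_a[i] > 0.0 else A[i][j] for j in range(N)]
             for i in range(N)]
    mu_new = [num_mu[i] / den[i] for i in range(N)]
    # sigma^2 update uses the *new* means, as in the reestimation formulas.
    var_new = [sum(g[t][i] * (y[t] - mu_new[i]) ** 2
                   for (y, g) in gammas for t in range(len(y))) / den[i]
               for i in range(N)]
    return pi_new, A_new, mu_new, var_new
```

Iterating this step until the stopping criterion of Section 3.1 is met yields a fixed point of the algorithm.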

2.4 Wald's confidence intervals

With the help of the covariance matrix in (9) one can compute, after parameter estimation, the standard error for each estimated model parameter (Cox and Hinkley, 1974). If we denote the estimated parameter value by \hat\theta_i and the corresponding diagonal element of the covariance matrix C(\hat\theta; y), computed at the estimated point, by \hat c_{ii}, Wald's 95 % confidence interval for the parameter \theta_i is

    (\hat\theta_i - 1.96\sqrt{\hat c_{ii}}, \; \hat\theta_i + 1.96\sqrt{\hat c_{ii}}), \qquad 1 \le i \le M.    (10)

In Section 3, we show some simulation studies for the estimates and their errors, which markedly demonstrate the problems caused by multiple local maxima.
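Given a routine for evaluating l(\theta; y), interval (10) can be sketched numerically. The sketch below approximates the Hessian by central finite differences of the log-likelihood rather than by the exact recursions of the Appendix; this is a simplification for illustration, and the names are ours:

```python
import numpy as np

def wald_intervals(loglik, theta_hat, h=1e-5, z=1.96):
    """95% Wald intervals (10) from a finite-difference Hessian.
    loglik: callable theta -> l(theta; y); theta_hat: MLE as a 1-D array."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    M = theta_hat.size
    H = np.empty((M, M))
    # Central finite differences for the Hessian of the log-likelihood.
    for i in range(M):
        for j in range(M):
            e_i = np.eye(M)[i] * h
            e_j = np.eye(M)[j] * h
            H[i, j] = (loglik(theta_hat + e_i + e_j) - loglik(theta_hat + e_i - e_j)
                       - loglik(theta_hat - e_i + e_j) + loglik(theta_hat - e_i - e_j)) / (4 * h * h)
    C = np.linalg.inv(-H)      # observed information I = -H, covariance C = I^{-1}
    se = np.sqrt(np.diag(C))   # standard errors of the parameter estimates
    return np.column_stack((theta_hat - z * se, theta_hat + z * se))
```

In the paper itself the Hessian is obtained exactly from the derivative recursions, which is both faster and numerically more reliable than the finite-difference shortcut shown here.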

3 Study of inference procedures by simulation

In this section, we give the results of Monte Carlo simulations that study the behavior of the inference procedures discussed earlier. At this point we concentrate only on a specific type of HMM, namely left-to-right models (see Rabiner, 1989). We have two reasons for this. First, left-to-right models are widely used because they are able to model signals whose properties change over time, e.g. the speech signal. Second, when estimating the parameters of a fully connected (or ergodic) model, one frequently obtains models with equal likelihood values whose parameters are the same but in a different order. This is due to the fact that the states of the model can be renamed in N! different ways. In order to avoid the bookkeeping of all permutations during the simulation studies, we fixed some probabilities a priori to zero. More accurately, we determine that the state sequence begins in state s_1:

    \pi_i = 1 if i = 1, and \pi_i = 0 otherwise,    (11)

and restrict the state transitions to the current state or to the possible next state:

    a_{ij} = 0 if j < i or j > i + 1.    (12)

The constraints (11) and (12) can be guaranteed in the Baum-Welch algorithm by placing the constraints on the initial values. This is due to the fact that any parameter in \pi or A set to zero initially will remain at zero throughout the reestimation procedure. In fact, we do not need to estimate the initial state distribution \pi in left-to-right models because of the constraint (11), and therefore it is ignored in the rest of this paper. For the parameters of the observation distributions we do not place any constraints.

3.1 Implementation issues for generation of observation sequences and model estimation

In our experiments the parameters of the true model are assumed to be known a priori. However, there exists no unambiguous way of determining the length of the observation sequence to be generated. Since the duration of staying in state s_i, conditioned on starting in that state, follows the geometric distribution with parameter 1 - a_{ii}, we can calculate the expected number of observations in the states before the final state as

    D = \sum_{i=1}^{N-1} \frac{1}{1 - a_{ii}}.    (13)

The problem here is to decide the number of observations produced by the final state. In this work, we determined that the length of the observation sequence is T_k = \lceil D + 5 \rceil for each k (1 \le k \le K). The following algorithm is used to generate the observation sequences.

Algorithm Generate.
Input: model parameters a_{ii}, \mu_i, \sigma_i^2 (1 \le i \le N).
Output: observation sequence y^k = \{y_1^k, y_2^k, \ldots, y_{T_k}^k\}.
1. Set t \leftarrow 1 and q_1 \leftarrow s_1.
2. Denoting s_i = q_t, choose y_t^k according to f_i.
3. Denoting s_i = q_t, choose q_{t+1} \in \{s_i, s_{i+1}\} according to a_{ii}.
4. Set t \leftarrow t + 1. If t < T_k, then return to step 2; otherwise terminate the algorithm.

It should be noted that we do not force the state sequence to end in the final state s_N, as is sometimes customary (see e.g. Levinson, 1983).

To compute the MLEs of the model given the sample of observation sequences y = \{y^1, y^2, \ldots, y^K\}, we used the Baum-Welch estimation algorithm. In addition to the sequences, the architecture of the model was also assumed to be known a priori, i.e., the number of states (N), the form of the observation densities (Gaussian), and the type of the model (left-to-right). The important issue of learning the architecture from the observation sequences will not be considered in this context. The transition probabilities a_{ii} (1 \le i \le N - 1) were initialized at random subject to the constraints (1) and (12), and the mean \mu_i and variance \sigma_i^2 of each observation distribution (1 \le i \le N) were initialized to the sample mean and variance of the observation sequences. Note that in this case the number of model parameters is M = N - 1 + 2N = 3N - 1. Since the sequences were quite short, we do not have to worry about scaling (Rabiner, 1989). The Baum-Welch algorithm (Section 2.3) was terminated as soon as both the relative change of the log-likelihood value and the absolute change of the model parameters from one iteration to the next fell below a given threshold. More precisely, the stopping criterion used was

    \max\left\{ \left| \frac{l(\bar\theta; y) - l(\theta; y)}{l(\theta; y)} \right|, \; \max_{1 \le i \le M} |\bar\theta_i - \theta_i| \right\} < 10^{-5}.
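Algorithm Generate for a left-to-right model is short enough to sketch directly. The sketch below uses the standard library only; names are ours, and it produces a fixed count of T observations rather than reproducing the report's step-4 counter test literally:

```python
import math
import random

def generate(a_self, mu, var, seed=None):
    """Algorithm Generate: simulate one observation sequence from a
    left-to-right HMM with self-transition probabilities a_self[i] and
    per-state normal parameters mu[i], var[i]."""
    rng = random.Random(seed)
    N = len(a_self)
    # Expected duration (13) over the states before the final one.
    D = sum(1.0 / (1.0 - a_self[i]) for i in range(N - 1))
    T = math.ceil(D + 5)
    y, state = [], 0                  # step 1: t <- 1, q_1 <- s_1
    for _ in range(T):
        # step 2: draw y_t from the normal density of the current state
        y.append(rng.gauss(mu[state], math.sqrt(var[state])))
        # step 3: stay with probability a_ii, otherwise move to the next state
        if state < N - 1 and rng.random() >= a_self[state]:
            state += 1
    return y
```

As in the text, the final state absorbs the remaining observations since its self-transition probability is 1.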

              LR3                      LR6
    s_i   a_ii   mu_i   s_i^2     a_ii   mu_i   s_i^2
    1     0.7    10     4         0.7    30     4
    2     0.4    0      4         0.4    10     4
    3     1.0    -10    4         0.5    0      4
    4                             0.3    20     4
    5                             0.6    -10    4
    6                             1.0    -30    4

Table 1: Parameter values of the models LR3 and LR6.

The two HMMs used in the Monte Carlo simulations are shown in Table 1. The three-state model is referred to later as LR3 and the six-state model as LR6. Note that model LR6 includes state s_4, whose self-transition probability is only 0.3, and thus the expected number of observations in that state is 1.4 in each observation sequence. We will later see the effect of a small self-transition probability on the confidence levels. Note also that all the observation distributions are different. The lengths of the observation sequences are determined using (13): T_k = \lceil 5 + 5 \rceil = 10 for LR3, and T_k = \lceil 153/14 + 5 \rceil = 16 for LR6.
3.2 Fixed points of the Baum-Welch algorithm

It is evident that the number of local maxima of the log-likelihood function is strongly dependent on the number of states N and the number of observation sequences K . In order to study the inuence of K with models LR3 and LR6, a sample of K sequences were generated for both models using the algorithm Generate for several values of K . After the generation of the sample sequences, the Baum-Welch algorithm was employed in computing the MLEs as described earlier. The algorithm was started from 10000 random points, and two xed points of the algorithm, say ^1 and ^2 , were considered to be the same if k^1 ? ^2k2 < 10?3. Table 2 shows an example concerning one sample (K = 10) of observation sequences generated from the model LR3. For this sample, the number of dierent xed points were 4. The xed point obtained from the rst run (No 1) combines the states s1 and s2 of the true model LR3 as one state having mean ^1 = 6:72 and variance ^12 = 24:5, and the both remaining states model the state s3. The second xed point (No 2) is closest to the true model, and it also has the largest log-likelihood value. The third xed point (No 3) splits the state s1 of the true model into two states, and the fourth xed point (No 9

No a^11 a^22 ^1 ^2 ^3 ^12 ^22 ^32 l(^; y) F 1 0.84 0.55 6.72 -11.0 -9.99 24.5 6.79 2.73 -281.0 108 2 0.71 0.56 10.2 0.53 -10.4 3.59 3.28 4.04 -240.5 9867 3 0.00 0.60 9.85 10.0 -7.27 3.31 5.69 27.8 -293.8 24 4 0.85 0.92 6.69 -10.4 -10.3 24.7 5.21 0.69 -282.1 1 Table 2: Fixed points of the Baum-Welch algorithm given the observation sequences generated from the model LR3. The length of the sequences were T = 10 (1  k  10), and the number of starting points was 10000. Symbol F denotes the frequency of the xed point. For each xed point the corresponding observed information matrix I (^; y) is positive denite.  The actual computed value is 5:90  10?4, i.e. also the model No 3 lies in the interior of the parameter manifold. k

4) is very similar with the rst one, but the self-transition probability of state s2 is larger. The observed information matrix of each xed point is positive denite, and thus all points are possible candidates for the global maximum. Note, that the log-likelihood value of the best point is 14 percents larger than the value of the rst point. Fortunately, there is a very high probability to obatain the best xed point in this example. By using the frequencies of the xed points, we can estimate that this probability is about 98.7 percents. However, we can not quarantee based on the 10000 starting points, that the xed point is the global maximum. We observed, that the sucient choise of K with model LR3 was K = 30 yielding only one maximum from all starting points and for all samples of observation sequences used. For the model LR6, however, even though the increasing of K resulted in decreasing of number of the local maxima, the uniqueness was not yet obatained with this value of K . The average number of local maxima for some samples we tested was 10. In practise, when one applies HMMs to some series of observations, the number of observation sequences is of course limited. Therefore, the value K = 30 was chosen, and the rest of this section is devoted on the results of the simulation studies for the condence intervals in the situations of sucient (LR3) and insucient (LR6) data. We also describe, how to improve the intervals based on the pure Baum-Welch algorithm with simple global optimization technique.

10

3.3 Condence intervals Besides multiple local maxima, the insuciency of data results also in zero values of self-transition probabilities a obtained from the Baum-Welch algorithm. In this paper, we do not concentrate on the very important issue of applying condence procedures on the estimates which lie on a boundary of the parameter manifold. In fact, we assume a priori konowledge on the fact that the true parameters of the models are not on a boundary. As we will see later, the global optimization technique removes also this problem, when the true parameters are indeed in the interior of the manifold. However, we want to emphasize that this is not the case in general problem in practise, and this issue is a part of the architecture design problem discussed shortly before. We recall, that the sucient conditions for the case that a xed point ^ of the Baum-Welch algorithm is a local maximum are: 1) the point ^ is in the interior of the parameter manifold, and 2) the observed information matrix computed at this point I (^; y) is positive denite. We describe next two methods for computing the estimates of a HMM, and then compare the condence levels of the Wald's intervals based on the dierent methods. First, we nd n xed points by starting the Baum-Welch algorithm from n random points. The rst xed point that satises the conditions mentioned above is referred as BW estimate, and the xed point with the largest log-likelihood value among the points that satisfy the conditions is referred as GBW estimate. The straightforward method used to obtain the GBW estimate is known as Multistart in global optimization literature (see Horst and Pardalos, 1995). The value n = 10 for the number of starting points was chosen on the basis of the simulations with LR6 as described in previous section. Results of the Monte Carlo simulations based on 10000 samples of observation sequences are shown in Tables 3 and 4. 
Tables contain the percentage of the models, estimated using both BW and GBW methods, which produce an interval (10) including the true parameter. Percentages are counted for each  (1  i  3N ? 1) in models LR3 and LR6. Note, that the self-transition probability a is not a free parameter due to the constraint (12). For the LR3, the two methods yield quite similar condence levels, see Table 3. The percentages for the transition probabilities and for the means range from 94.0 to 94.8. For the variances, however, the percentages are markedly poorer. This is possible a result from the fact, that the variance has constraint 2 > 0. One could improve the approximation with Gaussian by computing the interval for the logarithm of the variance. During the simulation we also counted the number of xed points methods discarded, ii

i

NN

11

s

i

a

ii

BW



i

2 i

a

ii

GBW



i

2 i

1 94.5 94.7 93.2 94.5 94.8 93.2 2 94.3 94.0 90.9 94.3 94.1 90.9 3 94.3 93.4 94.3 93.5 Table 3: Condence levels of intervals based on the BW and GBW estimates in model LR3. The length of the sequences of a sample were T = 10 (1  k  30), and the number of samples was 10000. k

since they did not fulll the conditions for the local maximum. The BW method discarded 3 xed points, while the GBW method none, i.e., the xed point with largest log-likelihood value was always in the interior of the parameter manifold and also gave a positive denite observed information matrix. From the table 3 we can conclcude that there is no signicant benet of using the GBW method instead of normal BW in the computation of intervals, when N = 3 and K = 30. Table 4 shows the corresponding comparison for the model LR6. Here the eect of the global optimization is already more evident. Especially, in the percentages for the transition probabilities and means, the dierence is hugh. Note, that both methods obtain the best condence levels for parameters of the states s1 and s6 . This is due to the fact, that these states have the largest self-transition probabilities, and thus also the longest staying durations as discussed earlier. The poorest percentages are for the parameters in the state s4 , which also has the smallest self-transition probability. The number of xed points the BW method discarded during simulation was 645, but again, all xed points with the largest log-likelihood value satised the two conditions for the local maximum.

              BW                       GBW
    s_i   a_ii   mu_i   s_i^2     a_ii   mu_i   s_i^2
    1     94.4   94.7   93.2      94.4   94.7   93.3
    2     79.7   81.3   78.0      93.8   93.9   91.4
    3     83.0   79.1   89.9      94.3   94.5   91.7
    4     79.6   78.9   89.7      92.4   94.1   89.7
    5     91.7   79.2   89.8      94.5   94.4   92.5
    6            94.3   93.2             94.9   93.9

Table 4: Confidence levels of the intervals in model LR6.
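The Multistart procedure behind the GBW estimate is simple enough to sketch. In the outline below, all callables are placeholders for the routines described in the text (random initialization, Baum-Welch to convergence, log-likelihood evaluation, and the interior-point plus positive-definiteness check), so the names and signatures are ours:

```python
def multistart_estimate(seqs, n_starts, init_fn, fit_fn, loglik_fn, is_valid_fn):
    """GBW-style Multistart: run the local optimizer from n_starts random
    starting points and keep the valid fixed point with the largest
    log-likelihood value."""
    best, best_ll = None, float("-inf")
    for _ in range(n_starts):
        theta = fit_fn(seqs, init_fn())     # e.g. Baum-Welch to convergence
        if not is_valid_fn(theta):          # interior point + positive definite I
            continue
        ll = loglik_fn(seqs, theta)
        if ll > best_ll:
            best, best_ll = theta, ll
    return best, best_ll
```

Taking the first valid fixed point instead of the best one corresponds to the BW estimate.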

4 Discussion

Statistical inference beyond parameter estimation has rarely been done in applications of HMMs. We have derived recursive formulas for the computation of the score vector and the observed information matrix, with the help of which confidence intervals can be formed for the model parameters. The observed information matrix can also be used to distinguish whether a fixed point is a local maximum or a saddle point.

In addition to the model parameters, one can also compute confidence intervals for functions of the parameters, e.g. for the value of the log-likelihood function \psi = \ln p(y_0; \theta) of some observation sequence y_0. Denoting the MLE of \psi by \hat\psi = \ln p(y_0; \hat\theta), Wald's 95 % confidence interval for \psi is

    (\hat\psi - 1.96\sqrt{\nabla\hat\psi^T \hat C\, \nabla\hat\psi}, \; \hat\psi + 1.96\sqrt{\nabla\hat\psi^T \hat C\, \nabla\hat\psi}),

where \hat C = C(\hat\theta; y) is the estimated covariance matrix, and the score vector \nabla\psi = \nabla p_0 / p_0 can be computed using formula (7) at the estimated point. This is an interesting idea from the point of view of classification, since in this way one can compute the confidence interval for the log-likelihood value of every new case to be classified, that is, examine the uncertainty of the classification procedure.

The other topic of the present paper was to demonstrate the effect of multiple local maxima on the computation of the MLE. For ergodic models, the consistency of the MLE was proved by Leroux (1992) and the asymptotic normality of the MLE by Bickel et al. (1998). We studied only left-to-right models, but it is intuitively clear that the same results can be obtained if the number of sequences K tends to infinity (instead of T) in left-to-right models. In practical problems, however, this is of course impossible. Although we used the Baum-Welch algorithm in the computation of the MLE, standard gradient techniques can also be used in the estimation problem (Levinson et al., 1983; Baldi and Chauvin, 1994). Regardless of the optimization algorithm, one always faces the problem that the likelihood surface is in general multimodal, and therefore global optimization procedures are required. In this work, we used a method called Multistart, which in spite of its simplicity produced superior confidence intervals when compared with the commonly used estimation procedure. By increasing the number of starting points one could naturally improve the confidence levels. However, Multistart is a rather inefficient method, since it often finds the same local maximum several times. Therefore, it makes sense to try to locate good starting points, as in clustering methods. Also some other global optimization techniques, such as simulated annealing and genetic algorithms, could be of great benefit (Horst and Pardalos, 1995).

There exist better ways to form confidence intervals for real-valued parameter functions of interest. One approach is the profile likelihood method. The formulas for the score vector and observed information matrix presented in this paper are useful for the computation of profile likelihood confidence intervals as well. Application of the ideas of the present paper in the case of profile likelihood intervals will appear elsewhere.

References

Baker, J. K. (1975). The Dragon system - an overview, IEEE Transactions on Acoustics, Speech, and Signal Processing, 23, 24-29.

Baldi, P. and Chauvin, Y. (1994). Smooth on-line learning algorithms for hidden Markov models, Neural Computation, 6, 307-318.

Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains, The Annals of Mathematical Statistics, 37, 1554-1563.

Baum, L. E., Petrie, T., Soules, G. and Weiss, N. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains, The Annals of Mathematical Statistics, 41, 164-171.

Bickel, P. J., Ritov, Y. and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models, The Annals of Statistics, 26, 1614-1635.

Churchill, G. A. (1989). Stochastic models for heterogeneous DNA sequences, Bulletin of Mathematical Biology, 51, 79-94.

Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics, London: Chapman & Hall.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Series B, 39, 1-38.

Eddy, S. R. (1998). Profile hidden Markov models, Bioinformatics, 14, 755-763.

Ephraim, Y., Dembo, A. and Rabiner, L. R. (1989). A minimum discrimination information approach for hidden Markov modeling, IEEE Transactions on Information Theory, 35, 1000-1013.

Horst, R. and Pardalos, P. M. (1995). Handbook of Global Optimization, Dordrecht: Kluwer Academic Publishers.

Juang, B. H., Levinson, S. E. and Sondhi, M. M. (1986). Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, 32, 307-309.

Juang, B. H. and Rabiner, L. R. (1990). The segmental K-means algorithm for estimating parameters of hidden Markov models, IEEE Transactions on Acoustics, Speech, and Signal Processing, 38, 1639-1641.

Juang, B. H. and Rabiner, L. R. (1991). Hidden Markov models for speech recognition, Technometrics, 33, 251-272.

Klein, S., Timmer, J. and Honerkamp, J. (1997). Analysis of multichannel patch clamp recordings by hidden Markov models, Biometrics, 53, 870-884.

Kundu, A., He, Y. and Bahl, P. (1989). Recognition of handwritten word: first and second order hidden Markov model based approach, Pattern Recognition, 22, 283-297.

Kwong, S., He, Q. H., Man, K. F. and Tang, K. S. (1998). A maximum model distance approach for HMM-based speech recognition, Pattern Recognition, 31, 219-229.

Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models, Stochastic Processes and Their Applications, 40, 127-143.

Levinson, S. E., Rabiner, L. R. and Sondhi, M. M. (1983). An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition, The Bell System Technical Journal, 62, 1035-1074.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 77, 257-286.

Appendix: computation of the derivatives

Let \theta_x and \theta_y be parameters of the observation densities f_x and f_y. The first derivatives of the forward probabilities (1 \le x, y \le N) are

    \frac{\partial \alpha_1^k(i)}{\partial a_{xy}} = 0,

    \frac{\partial \alpha_{t+1}^k(i)}{\partial a_{xy}} = \left[ \sum_{j=1}^{N} \frac{\partial \alpha_t^k(j)}{\partial a_{xy}} a_{ji} + \delta_{iy}\, \alpha_t^k(x) \right] f_i(y_{t+1}^k),

    \frac{\partial \alpha_1^k(i)}{\partial \theta_x} = \delta_{ix}\, \pi_i \frac{\partial f_i(y_1^k)}{\partial \theta_x},

    \frac{\partial \alpha_{t+1}^k(i)}{\partial \theta_x} = \left[ \sum_{j=1}^{N} \frac{\partial \alpha_t^k(j)}{\partial \theta_x} a_{ji} \right] f_i(y_{t+1}^k) + \delta_{ix} \left[ \sum_{j=1}^{N} \alpha_t^k(j)\, a_{ji} \right] \frac{\partial f_i(y_{t+1}^k)}{\partial \theta_x}.

The second derivatives of the forward probabilities (1 \le x, y, z, w \le N) are

    \frac{\partial^2 \alpha_1^k(i)}{\partial a_{xy} \partial a_{zw}} = 0,

    \frac{\partial^2 \alpha_{t+1}^k(i)}{\partial a_{xy} \partial a_{zw}} = \left[ \sum_{j=1}^{N} \frac{\partial^2 \alpha_t^k(j)}{\partial a_{xy} \partial a_{zw}} a_{ji} + \delta_{iy} \frac{\partial \alpha_t^k(x)}{\partial a_{zw}} + \delta_{iw} \frac{\partial \alpha_t^k(z)}{\partial a_{xy}} \right] f_i(y_{t+1}^k),

    \frac{\partial^2 \alpha_1^k(i)}{\partial \theta_x \partial \theta_y} = \delta_{ix} \delta_{iy}\, \pi_i \frac{\partial^2 f_i(y_1^k)}{\partial \theta_x \partial \theta_y},

    \frac{\partial^2 \alpha_{t+1}^k(i)}{\partial \theta_x \partial \theta_y} = \left[ \sum_{j=1}^{N} \frac{\partial^2 \alpha_t^k(j)}{\partial \theta_x \partial \theta_y} a_{ji} \right] f_i(y_{t+1}^k) + \delta_{iy} \left[ \sum_{j=1}^{N} \frac{\partial \alpha_t^k(j)}{\partial \theta_x} a_{ji} \right] \frac{\partial f_i(y_{t+1}^k)}{\partial \theta_y}
        + \delta_{ix} \left[ \sum_{j=1}^{N} \frac{\partial \alpha_t^k(j)}{\partial \theta_y} a_{ji} \right] \frac{\partial f_i(y_{t+1}^k)}{\partial \theta_x} + \delta_{ix} \delta_{iy} \left[ \sum_{j=1}^{N} \alpha_t^k(j)\, a_{ji} \right] \frac{\partial^2 f_i(y_{t+1}^k)}{\partial \theta_x \partial \theta_y},

    \frac{\partial^2 \alpha_1^k(i)}{\partial a_{xy} \partial \theta_z} = 0,

    \frac{\partial^2 \alpha_{t+1}^k(i)}{\partial a_{xy} \partial \theta_z} = \left[ \sum_{j=1}^{N} \frac{\partial^2 \alpha_t^k(j)}{\partial a_{xy} \partial \theta_z} a_{ji} + \delta_{iy} \frac{\partial \alpha_t^k(x)}{\partial \theta_z} \right] f_i(y_{t+1}^k) + \delta_{iz} \left[ \sum_{j=1}^{N} \frac{\partial \alpha_t^k(j)}{\partial a_{xy}} a_{ji} + \delta_{iy}\, \alpha_t^k(x) \right] \frac{\partial f_i(y_{t+1}^k)}{\partial \theta_z}.

Here \delta_{ij} is the Kronecker delta: \delta_{ij} = 1 if i = j, and \delta_{ij} = 0 otherwise. When using the one-dimensional normal observation distribution (2), the derivatives of the density function with respect to the mean \mu_x and the variance \sigma_x^2 (1 \le x, y \le N) are

    \frac{\partial f_i(y_t)}{\partial \mu_x} = \delta_{ix} \frac{y_t - \mu_i}{\sigma_i^2} f_i(y_t),

    \frac{\partial f_i(y_t)}{\partial \sigma_x^2} = \delta_{ix} \frac{(y_t - \mu_i)^2 - \sigma_i^2}{2\sigma_i^4} f_i(y_t),

    \frac{\partial^2 f_i(y_t)}{\partial \mu_x \partial \mu_y} = \delta_{ix} \delta_{iy} \frac{(y_t - \mu_i)^2 - \sigma_i^2}{\sigma_i^4} f_i(y_t),

    \frac{\partial^2 f_i(y_t)}{\partial (\sigma_x^2)^2} = \delta_{ix} \frac{(y_t - \mu_i)^4 - 6(y_t - \mu_i)^2 \sigma_i^2 + 3\sigma_i^4}{4\sigma_i^8} f_i(y_t),

    \frac{\partial^2 f_i(y_t)}{\partial \mu_x \partial \sigma_y^2} = \delta_{ix} \delta_{iy} \frac{[(y_t - \mu_i)^2 - 3\sigma_i^2](y_t - \mu_i)}{2\sigma_i^6} f_i(y_t).

In HMMs, one has to compute the derivatives in the manifold defined by the stochastic constraints in (1). For that purpose, we introduce N(N-1) new variables \tilde a_{ij}, corresponding to the free transition probabilities, via the function

    a_{ij} = \tilde a_{ij} if j \le N - 1, and a_{iN} = 1 - \sum_{k=1}^{N-1} \tilde a_{ik}.

Note that we let the last entry of each row of A depend on the other entries of that row. Using the chain rule, the derivatives of the forward probabilities with respect to the independent transition probabilities (1 \le x, z \le N, 1 \le y, w \le N - 1) become

    \frac{\partial \alpha_t(i)}{\partial \tilde a_{xy}} = \frac{\partial \alpha_t(i)}{\partial a_{xy}} - \frac{\partial \alpha_t(i)}{\partial a_{xN}},

    \frac{\partial^2 \alpha_t(i)}{\partial \tilde a_{xy} \partial \tilde a_{zw}} = \frac{\partial^2 \alpha_t(i)}{\partial a_{xy} \partial a_{zw}} - \frac{\partial^2 \alpha_t(i)}{\partial a_{xy} \partial a_{zN}} - \frac{\partial^2 \alpha_t(i)}{\partial a_{xN} \partial a_{zw}} + \frac{\partial^2 \alpha_t(i)}{\partial a_{xN} \partial a_{zN}},

    \frac{\partial^2 \alpha_t(i)}{\partial \tilde a_{xy} \partial \theta_z} = \frac{\partial^2 \alpha_t(i)}{\partial a_{xy} \partial \theta_z} - \frac{\partial^2 \alpha_t(i)}{\partial a_{xN} \partial \theta_z}.

All other derivatives remain unchanged.
Turku Centre for Computer Science Lemminkäisenkatu 14 FIN-20520 Turku Finland http://www.tucs.abo.

University of Turku  Department of Mathematical Sciences

Åbo Akademi University  Department of Computer Science  Institute for Advanced Management Systems Research

Turku School of Economics and Business Administration  Institute of Information Systems Science

Suggest Documents