
based on KLT and fast DCT are listed in Table II. We can see from this table that the KLT-based and FDCT-based ECSS methods significantly improved the recognition accuracy compared to the baseline, especially under the driving conditions of 30 mi/h and 55 mi/h. The recognition performance using the fast DCT is very close to that using the KLT. The WRAs under the idle and 30 mi/h driving conditions using the fast DCT are even slightly, though not significantly, higher than the WRAs using the KLT. In our experiments, the first 2048 noise-only signal samples were used to estimate the noise AR parameters with an order of four. In general, the noise is required to be relatively stationary, similar to other speech enhancement or noise compensation methods such as spectral subtraction or parallel model combination (PMC). On the other hand, the noise AR parameters can be updated during speech-inactive periods, and in this sense the method can track slowly time-varying noise. Experimental results on speech enhancement (Fig. 4) show that the FECSS method also helped to improve the SNR for babble noise, which is nonstationary.
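The noise AR modeling step described above is straightforward to sketch in code. The snippet below is an illustrative sketch, not the authors' implementation: it estimates fourth-order AR coefficients from an assumed 2048-sample noise-only buffer using the autocorrelation (Levinson-Durbin) method of linear prediction [6]; the synthetic noise and the function names are invented for the example.

```python
import numpy as np

def levinson_durbin(r, order):
    """Levinson-Durbin recursion: returns AR coefficients a (with a[0] = 1)
    and the prediction-error variance, given autocorrelations r[0..order]."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err                      # reflection coefficient
        a[1:m] = a[1:m] + k * a[m - 1:0:-1]
        a[m] = k
        err *= (1.0 - k * k)
    return a, err

def estimate_noise_ar(noise_samples, order=4):
    """Estimate noise AR(order) parameters from a noise-only buffer,
    as in the experiments described above (2048 samples, order 4)."""
    x = np.asarray(noise_samples, dtype=float)
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    return levinson_durbin(r, order)

# Example with synthetic "noise-only" data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
noise = rng.standard_normal(2048)
a, sigma2 = estimate_noise_ar(noise, order=4)
print("AR coefficients:", a, "  residual variance:", sigma2)
```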

IV. CONCLUSION

In this study, a fast algorithm has been proposed to approximate the KLT for signal-subspace based speech enhancement. It is proven that the DCT is a good approximation to the KLT for the covariance matrix of an AR($p$) process. For computing the approximate eigenvalues of a symmetric Toeplitz matrix, the fast algorithm reduces the computation cost from $O(N^3)$ for the KLT to $O(N^2)$. Experimental results show that the eigenvalues computed from the fast algorithm are very close to those from the KLT for AR(10) processes with the AR coefficients extracted from vowel, stop, and nasal type phonemes in the TIMIT database. This fast eigendecomposition algorithm is incorporated into the ECSS speech enhancement algorithm [2] for robust speech recognition in the car, and it achieved recognition performance very close to that of the KLT-based ECSS algorithm while significantly reducing the processing time.
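The claim that the DCT nearly diagonalizes the covariance matrix of an AR process, and hence that its basis variances approximate the KLT eigenvalues, can be checked numerically. The sketch below is an illustration under assumed AR(2) coefficients, not the paper's experiment: it builds the Toeplitz covariance of the process, computes its exact eigenvalues, and compares them with the diagonal of $CRC^{\mathsf T}$, where $C$ is the orthonormal DCT-II matrix.

```python
import numpy as np
from scipy.linalg import toeplitz, eigvalsh
from scipy.fft import dct

def ar_autocovariance(a, sigma2, nlags, nfft=8192):
    """Autocovariance r[0..nlags-1] of the AR process
    x[n] = -a[1] x[n-1] - ... - a[p] x[n-p] + e[n], var(e) = sigma2,
    obtained by inverse FFT of the AR power spectrum on a dense grid."""
    A = np.fft.rfft(np.asarray(a, dtype=float), nfft)   # A(e^{jw}), a[0] = 1
    spectrum = sigma2 / np.abs(A) ** 2
    r = np.fft.irfft(spectrum, nfft)
    return r[:nlags]

# Hypothetical AR(2) coefficients, used purely for illustration
a = np.array([1.0, -0.9, 0.5])
sigma2 = 1.0
N = 32

r = ar_autocovariance(a, sigma2, N)
R = toeplitz(r)                                   # N x N Toeplitz covariance matrix

# Exact KLT eigenvalues
exact = np.sort(eigvalsh(R))[::-1]

# DCT-based approximation: diagonal of C R C^T, where C @ x is the orthonormal DCT-II of x
C = dct(np.eye(N), type=2, norm="ortho", axis=0)
approx = np.sort(np.diag(C @ R @ C.T))[::-1]

print(np.c_[exact, approx])                       # the two columns should agree closely
```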

ACKNOWLEDGMENT

The authors would also like to thank Dr. M. Randolph and Dr. Y. Cheng for providing Motorola’s car noisy speech data.

REFERENCES

[1] Y. Ephraim and H. L. Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech Audio Processing, vol. 3, pp. 251–266, July 1995.
[2] J. Huang and Y. Zhao, “An energy-constrained signal subspace method for speech enhancement and recognition in colored noise,” in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, May 1998, pp. 377–380.
[3] A. K. Jain, Fundamentals of Digital Image Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989.
[4] A. K. Jain, “A fast Karhunen-Loève transform for a class of random processes,” IEEE Trans. Commun., vol. COM-24, pp. 1023–1029, Sept. 1976.
[5] N. Ahmed, T. Natarajan, and K. R. Rao, “Discrete cosine transform,” IEEE Trans. Commun., vol. COM-22, pp. 90–93, Jan. 1974.
[6] J. Makhoul, “Linear prediction: A tutorial review,” Proc. IEEE, vol. 63, pp. 561–580, Apr. 1975.
[7] K. C. Yen, J. Huang, and Y. Zhao, “Co-channel speech separation in the presence of correlated and uncorrelated noises,” in Proc. Eurospeech’99, Budapest, Hungary, Sept. 1999, pp. 2587–2590.


Comments on “Efficient Training Algorithms for HMM’s Using Incremental Estimation”

William Byrne and Asela Gunawardana

Abstract—“Efficient training algorithms for HMM’s using incremental estimation” [1] investigates EM procedures that increase training speed. The authors’ claim that these are GEM [2] procedures is incorrect. We discuss why this is so, provide an example of nonmonotonic convergence to a local maximum in likelihood, and outline conditions that guarantee such convergence.

Index Terms—EM algorithm, hidden Markov models, incremental estimation, maximum likelihood estimation.

Manuscript received February 19, 1999; revised February 28, 2000. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Wu Chou. The authors are with the Center for Language and Speech Processing and Department of Electrical and Computer Engineering, The Johns Hopkins University, Baltimore, MD 21218 USA (e-mail: [email protected]). Publisher Item Identifier S 1063-6676(00)09266-X.

I. INTRODUCTION

“Efficient training algorithms for hidden Markov models (HMMs) using incremental estimation” by Gotoh et al. [1] discusses alternatives to the usual EM algorithm [2] for estimation of hidden Markov model parameters. The incremental EM procedure proposed by Neal and Hinton [3], which at each iteration updates the model parameters based on a portion of the training data instead of the entirety of the training data, is applied to HMM training. Gotoh et al. [1] find experimentally that convergence is faster than under the usual EM algorithm. This is a valuable observation which adds to the growing body of evidence that such EM variants can be useful.

Gotoh et al. [1, Sec. II-B, fn. 2] state that “the incremental variant is a generalized EM (GEM) algorithm [and] the likelihood will increase monotonically to a [local] maximum” by the results of Dempster et al. [2]. However, the incremental EM procedure described by Gotoh et al. [1] is not guaranteed to increase the EM auxiliary function, and monotonic increase in the likelihood is not guaranteed, i.e., it is not a GEM procedure [2], and this convergence argument cannot be used. We discuss why this is so, and provide conditions under which convergence results can be obtained. We also provide an example of an incremental EM procedure that finds a local maximum in likelihood but is not monotonic in likelihood.

Gotoh et al. [1] draw the following conclusion in Section V-A:

The incremental ML estimation approach has a solid theoretical foundation that extends the standard EM. An important consequence is that the monotone likelihood improvement still holds for this variant and stable convergence is guaranteed.

In light of our analysis, this conclusion is not correct. We suggest an alternative conclusion:

Incremental EM has a sound theoretical foundation and stable convergence to stationary points can be guaranteed even when the likelihood does not increase monotonically. In practice, these points are local maxima in likelihood (cf. [4, Ch. 3]).

However, this algorithm is not a GEM procedure and claims about its convergence cannot be based on GEM arguments. More general convergence arguments, which we provide, are required [5]. Similarly, incremental EM schemes used in tomography, such as the ordered subsets EM (OS-EM) of [8] and the block iterative EM methods of Byrne [9], are not GEM algorithms, and other arguments are used to show their convergence. In contrast, algorithms such as

space alternating generalized EM (SAGE) [6] and alternating expectation conditional maximization (AECM) [7] are GEM algorithms, and GEM arguments are used in obtaining their convergence properties.

II. INCREMENTAL EM PROCEDURES NEED NOT SATISFY THE GEM CONDITION

For clarity, we adopt the notation of Dempster et al. [2]. Here, $x$ is a random variable with density $f(x\mid\theta)$, and $y$ is an incomplete observation, which is completely determined by $x$. Since the mapping from $x$ to $y$ is many-to-one, $y$ does not determine $x$, and the conditional density of $x$ given $y$ is $k(x\mid y;\theta)$. The EM auxiliary function [2, Eq. (2.17)] is defined as $Q(\theta'\mid\theta) = E[\log f(x\mid\theta')\mid y;\theta]$, and the cross-entropy [2, Eq. (3.1)] as $H(\theta'\mid\theta) = E[\log k(x\mid y;\theta')\mid y;\theta]$. As specified by Dempster et al. [2, Sec. III], the usual GEM procedure is to first fix a parameter value $\theta$ and then find a new parameter $\theta'$ so that $Q(\theta'\mid\theta) \ge Q(\theta\mid\theta)$. The relationship between the likelihood $L(\theta')$ of the observed data and the auxiliary function is given by $Q(\theta'\mid\theta) = L(\theta') + H(\theta'\mid\theta)$ for all $\theta'$ [2, Eq. (3.2)]. The cross-entropy satisfies $H(\theta\mid\theta) \ge H(\theta'\mid\theta)$ [2, Lemma 1], and since $\theta'$ is found to satisfy $Q(\theta'\mid\theta) \ge Q(\theta\mid\theta)$, we get that $L(\theta') \ge L(\theta) + H(\theta\mid\theta) - H(\theta'\mid\theta)$, so that $L(\theta') \ge L(\theta)$.

Now suppose that $x$ consists of independent components $x_1, \ldots, x_T$, where $x_t$ has density $f_t(x_t\mid\theta)$, and that each of the observations $y_t$ in $y = (y_1, \ldots, y_T)$ is determined by the corresponding $x_t$. These independence assumptions result in the conditional density $k(x\mid y;\theta)$ decomposing into a product of component conditional densities $\prod_{t=1}^{T} k_t(x_t\mid y_t;\theta)$. The EM auxiliary function [2, Eq. (2.17)] also decomposes into a sum of component auxiliary functions to give

$$Q(\theta'\mid\theta) = \sum_{t=1}^{T} Q_t(\theta'\mid\theta)$$

where $Q_t(\theta'\mid\theta) = E[\log f_t(x_t\mid\theta')\mid y_t;\theta]$. The cross-entropy can be decomposed in a similar manner into a sum of components $H_t(\theta'\mid\theta) = E[\log k_t(x_t\mid y_t;\theta')\mid y_t;\theta]$. The incremental EM algorithm [3] makes use of this decomposition to avoid updating all the component auxiliary functions at every iteration.

Using the notation of Gotoh et al. [1], where $\tilde{x}$ is the observed variable and $(\tilde{x}, y)$ the complete variable, and supposing we process the observations sequentially, the incremental EM algorithm proceeds as follows.

E-Step: Compute

$$S_t^{(p+1)} = \begin{cases} E\bigl[S(\tilde{x}_t, y)\mid \tilde{x}_t;\theta^{(p)}\bigr], & \text{if } t = (p \bmod T) + 1\\ S_t^{(p)}, & \text{otherwise} \end{cases}$$

and set $S^{(p+1)} = \sum_{i=1}^{T} S_i^{(p+1)}$.

M-Step: Set $\theta^{(p+1)}$ to the solution of

$$E\bigl[S(\tilde{x}, y)\mid\theta\bigr] = S^{(p+1)}.$$

In the notation of Dempster et al. [2], this is a specialization to exponential families of the reestimation procedure

$$\theta^{(p+1)} = \arg\max_{\theta}\Bigl[Q_1\bigl(\theta\mid\theta^{(p-\bar{p})}\bigr) + Q_2\bigl(\theta\mid\theta^{(p-(\bar{p}-1))}\bigr) + \cdots + Q_T\bigl(\theta\mid\theta^{(p-(\bar{p}-(T-1)))}\bigr)\Bigr] \qquad (1)$$

where $\bar{p}$ is the number of iterations since component 1 was last visited, so that each $Q_t$ is held at the parameter value in force when component $t$ was last updated.

Is this incremental procedure a GEM procedure? If so, (1) should guarantee that the GEM condition $Q(\theta^{(p+1)}\mid\theta^{(p)}) \ge Q(\theta^{(p)}\mid\theta^{(p)})$ is satisfied. This would then give $L(\theta^{(p+1)}) \ge L(\theta^{(p)})$ [2, Th. 1]. However, equation (1) does not guarantee the GEM condition. The following shows why this is so. Suppose $T = 2$ and $p$ is odd. Then $\theta^{(p+1)}$ satisfies

$$Q_1\bigl(\theta^{(p)}\mid\theta^{(p-1)}\bigr) + Q_2\bigl(\theta^{(p)}\mid\theta^{(p)}\bigr) \le Q_1\bigl(\theta^{(p+1)}\mid\theta^{(p-1)}\bigr) + Q_2\bigl(\theta^{(p+1)}\mid\theta^{(p)}\bigr).$$

It follows [2, Eq. (3.3)] that

$$L_1\bigl(\theta^{(p)}\bigr) + H_1\bigl(\theta^{(p)}\mid\theta^{(p-1)}\bigr) + L_2\bigl(\theta^{(p)}\bigr) + H_2\bigl(\theta^{(p)}\mid\theta^{(p)}\bigr) \le L_1\bigl(\theta^{(p+1)}\bigr) + H_1\bigl(\theta^{(p+1)}\mid\theta^{(p-1)}\bigr) + L_2\bigl(\theta^{(p+1)}\bigr) + H_2\bigl(\theta^{(p+1)}\mid\theta^{(p)}\bigr).$$

Rearranging and using the fact that $L(\theta) = L_1(\theta) + L_2(\theta)$ shows [2, Lemma 1] that

$$L\bigl(\theta^{(p)}\bigr) + H_1\bigl(\theta^{(p)}\mid\theta^{(p-1)}\bigr) - H_1\bigl(\theta^{(p+1)}\mid\theta^{(p-1)}\bigr) \le L\bigl(\theta^{(p+1)}\bigr). \qquad (2)$$

This is not sufficient to conclude that $L(\theta^{(p+1)}) \ge L(\theta^{(p)})$. The likelihood can decrease as long as the cross-entropy terms vary so as to satisfy equation (2). Therefore, incremental EM procedures are generally not GEM procedures. Of course, this problem does not arise in (G)EM procedures. In these cases, the analog of (2) is $L(\theta^{(p)}) + H(\theta^{(p)}\mid\theta^{(p)}) - H(\theta^{(p+1)}\mid\theta^{(p)}) \le L(\theta^{(p+1)})$, and the increase in likelihood follows [2, Lemma 1].

Fig. 1. Density from which the training data is drawn, and the training samples themselves (indicated by “o”).

Example of Nonmonotonic Convergence of Incremental EM: This example [10, Sec. 6.4.2] consists of estimating the means $(\mu_1, \mu_2)$ in a mixture of two univariate Gaussians whose variances and mixture weights are known. The observed data is a sample of 25 points drawn from this distribution, as shown in Fig. 1. The incremental EM algorithm is used to estimate these means by processing one training sample at a time. The resulting iterates $\theta^{(p)} = (\mu_1, \mu_2)^{(p)}$ are shown in Fig. 2, and the corresponding log likelihoods $L(\theta^{(p)})$ in Fig. 3. Observe that the likelihood is not monotonic and that the algorithm converges to a local maximum in likelihood. Also observe from Fig. 2 that the standard EM algorithm starting from the same initial point would have converged to a different local maximum. Thus the regions of attraction of the local maxima under the incremental EM algorithm can be different from those obtained under the standard EM algorithm.

Fig. 2. Trajectory taken by the incremental EM algorithm. Individual iterates are indicated by “+” and cycles through the data are indicated by “o.” The initial point is $\mu_1 = 0.5601$, $\mu_2 = 0.2939$. The contours indicate likelihood, and the broken line denotes the boundary of the regions of attraction of the two maxima under the standard EM algorithm. The lower figure is a detail showing the behavior of the early stages of the algorithm.
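The behavior described in this example is easy to reproduce with a small simulation that follows the E- and M-steps given above. In the sketch below, the means of a two-component Gaussian mixture with known weights and variances are estimated by incremental EM, one training sample per iteration, while the training-set log likelihood is recorded so that occasional decreases can be observed. The mixture parameters, sample size, and initial point are illustrative assumptions, not the exact values behind Figs. 1-3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Known mixture weights and variances; only the means are estimated.
# These values are illustrative assumptions, not those of the paper's figures.
w = np.array([0.5, 0.5])
sigma2 = np.array([1.0, 1.0])
true_mu = np.array([-2.0, 2.0])

# Draw 25 training points from the mixture (cf. Fig. 1).
T = 25
z = rng.choice(2, size=T, p=w)
x = rng.normal(true_mu[z], np.sqrt(sigma2[z]))

def responsibilities(x_t, mu):
    """Posterior component probabilities for a single observation."""
    lik = w * np.exp(-0.5 * (x_t - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return lik / lik.sum()

def log_likelihood(mu):
    """Training-set log likelihood under the current means."""
    lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
    return np.log(lik.sum(axis=1)).sum()

# Per-sample sufficient statistics: gamma_tk and gamma_tk * x_t, accumulated over t.
mu = np.array([0.5, -0.5])                      # assumed initial point
gam = np.array([responsibilities(x[t], mu) for t in range(T)])
S_count = gam.sum(axis=0)                       # sum_t gamma_tk
S_sum = (gam * x[:, None]).sum(axis=0)          # sum_t gamma_tk * x_t

history = [log_likelihood(mu)]
for p in range(40 * T):                         # 40 passes through the data
    t = p % T                                   # index of the sample refreshed this iteration
    # E-step: replace the stored statistics of sample t with freshly computed ones.
    new_gam = responsibilities(x[t], mu)
    S_count += new_gam - gam[t]
    S_sum += (new_gam - gam[t]) * x[t]
    gam[t] = new_gam
    # M-step: closed-form update of the means from the accumulated statistics.
    mu = S_sum / S_count
    history.append(log_likelihood(mu))

# The recorded log likelihoods need not be monotonic (cf. Fig. 3).
drops = sum(b < a for a, b in zip(history, history[1:]))
print("final means:", mu, " log-likelihood decreases observed:", drops)
```

Depending on the random seed and initial point, the recorded likelihood sequence may show small decreases early in training, consistent with the behavior illustrated in Fig. 3.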

III. THEORETICAL CONVERGENCE RESULTS

The behavior illustrated in the previous section is predicted by our theoretical results [5]. It can be shown [5] that the incremental EM algorithm will converge to a stationary point in likelihood under slightly stronger conditions than those that are assumed for obtaining the convergence results for the standard EM algorithm [2], [11]. In addition to the conditions given there, we assume that 1) $\arg\max_{\theta}\sum_{t=1}^{T} Q_t(\theta\mid\theta_t)$ is unique for all $(\theta_1,\ldots,\theta_T)\in\Theta^T$; 2) the family $f(x\mid\theta)$ has full support.

The first condition ensures that convergence of the value of the extended auxiliary function guarantees convergence of the parameter.


Fig. 3. Behavior of the training set likelihood during incremental training. The top figure shows the overall behavior of the likelihood, while the lower ones are details of regions where the likelihood drops. It can be seen that this is not limited to the first few iterates.

Note that this condition is slightly different from identifiability—it restricts not only the parameterization of the family of densities $f(x\mid\theta)$ but also the densities themselves. The second condition is necessary to prevent convergence to singularities of the extended auxiliary function. This assumption is not necessary for the standard EM algorithm because of the regularity assumptions on the model family and assumptions on the initial point (see [11, Eq. (6)] and the assumption on the initial point in the paragraph following it).

While we do not know of a general theoretical explanation of the improved convergence rates of incremental EM found in practice, we note that this procedure is closely related to OS-EM [8], which is used in tomography. Hero [12] raises the concern that the nonmonotonicity of OS-EM leads “to the additional burden of monitoring the iterations for instability.” Our analysis allays these concerns by providing easily verifiable conditions that guarantee the same convergence properties as the standard EM algorithm.

Gotoh et al. [1] suggest that a “decay factor” can be included in the procedure. They refer to a simple linear factor applied to the history. Our convergence analysis suggests that if ML convergence is desired, linear interpolation of statistics between iterations is more correct. We note that such modifications have recently proven useful for rapid speaker adaptation [13].
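As a rough illustration of the distinction drawn here (our paraphrase; the exact updates used in [1] and [13] are not reproduced): a decay factor shrinks the accumulated statistics by a constant before new statistics are added, whereas linear interpolation forms a convex combination of old and new statistics, so the overall normalization of the accumulated statistics is preserved.

```python
# Sketch of two ways of updating accumulated sufficient statistics S when a
# new block statistic S_new arrives; both forms are illustrative assumptions,
# not the exact updates of Gotoh et al. [1] or of [13].

def decay_update(S, S_new, decay=0.95):
    """'Decay factor': shrink the accumulated history, then add the new statistics."""
    return decay * S + S_new

def interpolation_update(S, S_new, alpha=0.2):
    """Linear interpolation: convex combination of old and new statistics,
    which keeps the overall scale (normalization) of S unchanged."""
    return (1.0 - alpha) * S + alpha * S_new
```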

Control of Feedback in Hearing Aids—A Robust Filter Design Approach

Boaz Rafaely and Mariano Roccasalva-Firenze

REFERENCES

[1] Y. Gotoh, M. M. Hochberg, and H. F. Silverman, “Efficient training algorithms for HMM’s using incremental estimation,” IEEE Trans. Speech Audio Processing, vol. 6, pp. 539–548, Nov. 1998.
[2] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” J. R. Statist. Soc. B, vol. 39, no. 1, pp. 1–38, 1977.
[3] R. M. Neal and G. E. Hinton, “A view of the EM algorithm that justifies incremental, sparse, and other variants,” in Learning in Graphical Models, M. I. Jordan, Ed. Norwell, MA: Kluwer, 1998.
[4] G. J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. New York: Wiley, 1997.
[5] A. Gunawardana and W. Byrne, “Convergence of EM variants,” CLSP, Johns Hopkins Univ., Baltimore, MD. [Online]. Available: http://www.clsp.jhu.edu
[6] J. A. Fessler and A. O. Hero, “Space-alternating generalized expectation-maximization algorithm,” IEEE Trans. Signal Processing, vol. 42, pp. 2664–2677, Oct. 1994.
[7] X.-L. Meng and D. van Dyk, “The EM algorithm—An old folk-song sung to a fast new tune,” J. R. Statist. Soc. B, vol. 59, no. 3, pp. 511–567, 1997.
[8] H. M. Hudson and R. S. Larkin, “Accelerated image reconstruction using ordered subsets of projection data,” IEEE Trans. Med. Imag., vol. 13, pp. 601–609, Dec. 1994.
[9] C. L. Byrne, “Accelerating the EMML algorithm and related iterative algorithms by rescaled block iterative methods,” IEEE Trans. Image Processing, vol. 7, pp. 100–109, Jan. 1998.
[10] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[11] C. F. J. Wu, “On the convergence properties of the EM algorithm,” Ann. Statist., vol. 11, no. 1, pp. 95–103, 1983.
[12] A. Hero, “Advances in detection and estimation algorithms for signal processing,” IEEE Signal Processing Mag., pp. 24–26, Sept. 1998.
[13] W. Byrne and A. Gunawardana, “Discounted likelihood linear regression for rapid adaptation,” in Proc. Eur. Conf. Speech Communication and Technology (Eurospeech), 1999.

Abstract—A bound on the variability of the feedback path is employed in the design of fixed FIR hearing aid filters that are robust to the specified variability, thus avoiding instability and howling in everyday use. A design example is presented for a linear gain hearing aid filter with a given maximal mismatch of the feedback cancellation filter.

Index Terms—Acoustic applications, acoustic signal processing, feedback systems, FIR digital filters, hearing aids.

I. INTRODUCTION

Digital hearing aids are becoming increasingly popular, with new digital signal processing (DSP) technology facilitating smaller and better performing devices. The feedback problem, which causes the “howling” or “whistling” sound when the receiver signal is fed back to the microphone, is also easier to handle in digital hearing aids. Adaptive feedback cancellation algorithms have been developed and tested which cancel the feedback signal using an internal model of the feedback path, thus allowing for higher gain and improved performance before the hearing aid approaches instability [1]–[6]. The feedback path model is never perfect, owing to rapid variations in the feedback path, model order limitations, and precision effects, and so a residual feedback signal still exists which hinders any additional gain. In this case, the hearing aid system must be made robust to differences between the feedback path and the model; otherwise the system will become unstable.

This study presents a new approach to the design of hearing aid filters. These are designed such that the hearing aid system is robust to any given uncertainty in the feedback path, represented by a multiplicative bound around the feedback path model, while the gain requirements for the hearing loss compensation are satisfied. The robust hearing aid filter will ensure howling-free operation in everyday use. This paper presents the theory and an example of such a robust hearing aid filter design.

II. ROBUST STABILITY IN HEARING AIDS

A typical hearing aid includes a microphone to detect the external sound, a filter to provide the required compensation, a receiver to drive the sound into the user’s ear, and a vent to reduce the unnatural feeling induced when the ear is occluded [7]. A block diagram of a hearing aid system is presented in Fig. 1, where the hearing aid filter $H$ is composed of a forward compensation filter $Q$ and a model of the feedback path $G_0$, also referred to as a feedback cancellation filter. $A$ and $B$ are the forward and feedback paths through the vent, and $M$ and $R$ are the electroacoustic responses of the microphone and receiver, respectively. The feedback path

$$G = RBM \qquad (1)$$

is the response from the receiver, through the vent, to the microphone, and could also include feedback due to mechanical or electrical paths.

Manuscript received April 14, 1999; revised March 17, 2000. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Dennis R. Morgan. The authors are with the Institute of Sound and Vibration Research, University of Southampton, Southampton SO17 1BJ, U.K. (e-mail: [email protected]). Publisher Item Identifier S 1063-6676(00)09267-1.
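To make the robust stability requirement concrete, the following sketch applies a small-gain style check to a purely hypothetical design: given candidate frequency responses for the compensation filter $Q$, the feedback path model $G_0$, and a multiplicative uncertainty bound $B(\omega)$ with $|G - G_0| \le B\,|G_0|$, it verifies that $|Q G_0|\,B$ stays below one at all frequencies, which is sufficient for the cancellation loop (whose loop gain is $Q(G - G_0)$) to remain stable. The filter coefficients and the bound are invented for illustration; this is not the design procedure of the paper.

```python
import numpy as np
from scipy.signal import freqz

# Hypothetical FIR filters, used only to illustrate the robustness check:
q = np.array([0.05, 0.1, 0.2, 0.1, 0.05])     # forward compensation filter Q (assumed)
g0 = np.array([0.0, 0.02, -0.015, 0.01])      # feedback path model G0 (assumed)

nfreq = 512
w, Q = freqz(q, worN=nfreq)                   # Q(e^{jw})
_, G0 = freqz(g0, worN=nfreq)                 # G0(e^{jw})

# Multiplicative uncertainty bound B(w) on the feedback path, i.e.
# |G - G0| <= B * |G0|; a flat 50% bound is assumed here for illustration.
B = 0.5 * np.ones(nfreq)

# Small-gain style sufficient condition for stability of the cancellation
# loop, whose loop gain is Q * (G - G0): require |Q * G0| * B < 1 at all frequencies.
loop_bound = np.abs(Q * G0) * B
margin = 1.0 - loop_bound.max()
print(f"worst-case loop-gain bound: {loop_bound.max():.3f} (robustly stable: {margin > 0})")
```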
