ONLINE UNSUPERVISED LEARNING OF HMM PARAMETERS FOR SPEAKER ADAPTATION

Jen-Tzung CHIEN
Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan
[email protected]

ABSTRACT
This paper presents an online unsupervised learning algorithm to flexibly adapt the speaker-independent (SI) hidden Markov models (HMM's) to a new speaker. We apply the quasi-Bayes (QB) estimate to incrementally obtain the word sequence and adaptation parameters for adjusting the HMM's once a block of unlabeled data is enrolled. Accordingly, the nonstationary statistics of varying speakers can be successively traced according to the newest enrollment data. To improve the QB estimate, we employ adaptive initial hyperparameters in the beginning session of online learning. These hyperparameters are estimated from a cluster of training speakers closest to the test speaker. Additionally, we develop a selection process to select reliable parameters from a list of candidates for unsupervised learning. A set of reliability assessment criteria is explored. From the experiments, we confirm the effectiveness of the proposed method and find that using the adaptive initial hyperparameters in online learning and the multiple assessments in parameter selection can improve the speaker adaptation performance.

1. INTRODUCTION
In real-world speech recognition systems, speaker variabilities including accent, age, gender, emotion, etc., make it necessary to adapt the original SI HMM's to the enrolled speaker. Because real environments are nonstationary and labeled enrollment data are difficult to collect, it becomes crucial to exploit online/incremental and unsupervised learning schemes to avoid the inconvenience of collecting batch and labeled data for speaker adaptation. In the literature, an increasing number of research works have focused on online and unsupervised adaptation. The incremental expectation-maximization (EM) algorithm was employed to estimate the affine parameters for online speaker adaptation [3]. An alternative approach based on the QB estimate was developed for online adaptation of continuous density HMM's [5] and model transformation parameters [2]. On the other hand, unsupervised adaptation based on N-best word sequence hypotheses of the adaptation data was shown to be effective [8]. The confidence measure of the word transcription was helpful to judge the correctness of transcribed data [1].

In this study, we develop a joint learning approach to online and unsupervised adaptation strategies for HMM-based speech recognition. This approach adopts the QB estimate to derive a recursive approximate Bayes algorithm for estimating the data transcriptions and the transformation parameters for online unsupervised adaptation. To improve the QB estimate in online adaptation, we investigate the sensitivity of the initial prior parameters and propose adaptive initial hyperparameters for the beginning phase of online learning. The most likely speaker cluster related to the test speaker is detected to estimate the initial hyperparameters. To deal with the ill-conditioned transcriptions in unsupervised learning, we evaluate the fitness between the adaptation frames and the transcribed word, state and mixture component sequences. A selective unsupervised adaptation scheme is applied to select reliable parameters for online unsupervised adaptation. In our experiments, the online unsupervised adaptation with adaptive initial hyperparameters and selection processes is examined for various adaptation data lengths.

2. ONLINE UNSUPERVISED LEARNING
2.1 Background of Hierarchical Transformation
In transformation-based adaptation [7], clusters of HMM pdfs can be adapted to a new speaker via a set of shared transformation parameters even though some HMM units are unseen in the adaptation data. To dynamically control the degree of transformation sharing, we construct a tree structure of HMM pdfs, which records the sharing relations of the HMM pdfs in various tree layers, to perform the hierarchical transformation [2]. As shown in Fig. 1, each HMM pdf in the leaf layer can search the tree nodes along its associated tree path in a bottom-up fashion so as to obtain the most individual parameters for HMM adaptation.

Fig. 1 Tree structure of HMM pdfs for hierarchical transformation (tree paths ΓA and ΓB of leaf nodes A and B, and an example tree cut).
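The following is a minimal sketch, not taken from the paper, of the bottom-up search along a tree path; the TreeNode structure and field names are illustrative assumptions, in which each leaf corresponds to one HMM pdf cluster and a node carries transformation parameters only when they have been estimated.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class TreeNode:
    """One node of the hierarchical tree of HMM pdf clusters (hypothetical structure)."""
    name: str
    parent: Optional["TreeNode"] = None
    transform: Optional[dict] = None   # e.g. {"mu_shift": ..., "theta_scale": ...} if estimated

def tree_path(leaf: TreeNode) -> List[TreeNode]:
    """Tree path Gamma_ik from a leaf node up to the root."""
    path, node = [], leaf
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def most_individual_transform(leaf: TreeNode) -> Optional[dict]:
    """Bottom-up search: return the transformation of the lowest (most
    speaker-specific) node on the path that actually has parameters."""
    for node in tree_path(leaf):
        if node.transform is not None:
            return node.transform
    return None   # no transformation available anywhere on the path
```

The bottom-up search thus favors the most local parameters when they exist and falls back to more global, shared ones otherwise.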

2.2 General Formulation
Let $\chi^n = \{X_1, X_2, \ldots, X_n\}$ be $n$ i.i.d. and sequentially enrolled block data. The goal of online unsupervised adaptation is to incrementally adapt the HMM pdfs $\lambda$ by simultaneously estimating the word sequence $W$ of the block data and the transformation parameters $\eta^{(n)}$ [1]. The QB estimates $\hat{\Theta}^{(n)} = (\hat{\eta}^{(n)}, \hat{W}^{(n)})$ after observing the current block data $X_n$ can be derived by [2][5]

$$\hat{\Theta}^{(n)} = \arg\max_{\Theta} p(\Theta \mid \chi^n) = \arg\max_{\Theta} p(X_n \mid \Theta)\, p(\Theta \mid \chi^{n-1}) \cong \arg\max_{\Theta} p(X_n \mid \Theta)\, g(\Theta \mid \varphi^{(n-1)}), \qquad (1)$$

where the true pdf of the previous data $p(\Theta \mid \chi^{n-1})$ is approximated by the closest tractable prior pdf $g(\Theta \mid \varphi^{(n-1)})$. Given the initial hyperparameters $\varphi^{(0)}$, we can estimate $\hat{\Theta}^{(1)}$ by applying $X_1$. Then, the hyperparameters $\varphi^{(1)}$ are updated and stored for estimation of $\hat{\Theta}^{(2)}$. A recursive QB formulation for $\hat{\Theta}^{(1)}, \hat{\Theta}^{(2)}, \ldots, \hat{\Theta}^{(n)}$ is thus established. The HMM pdfs $\lambda = \{\lambda_{ik}\} = \{\mu_{ik}, r_{ik}\}$ are then adapted using the parameters $\hat{\eta}^{(n)} = \{\hat{\eta}_c^{(n)}\}$ and referring to the tree paths $\Gamma = \{\Gamma_{ik}\}$. Herein, the transformation function is defined by [2]

$$\hat{\lambda}_{ik}^{(n)} = G_{\hat{\eta}^{(n)} \in \Gamma_{ik}}(\lambda_{ik}) = \{\mu_{ik} + \hat{\mu}_c^{(n)},\ \hat{\theta}_c^{(n)} r_{ik}\}_{\hat{\mu}_c^{(n)}, \hat{\theta}_c^{(n)} \in \Gamma_{ik}}. \qquad (2)$$
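The recursion in (1) and the transformation in (2) can be summarized by the sketch below. It is only an illustrative skeleton under stated assumptions: estimate_qb, update_hyperparameters and tree_paths are hypothetical interfaces, not the paper's implementation, and the Gaussian parameters are assumed to be numpy arrays; only the mean-shift/precision-scale update follows (2) directly.

```python
def adapt_pdf(mu_ik, r_ik, mu_c, theta_c):
    """Transformation (2): shift the mean and rescale the precision of one Gaussian.
    mu_ik, mu_c are assumed to be numpy arrays; theta_c is a scalar or matching array."""
    return mu_ik + mu_c, theta_c * r_ik

def online_qb_adaptation(blocks, hyper0, hmm, estimate_qb, update_hyperparameters, tree_paths):
    """Recursive QB adaptation over sequentially enrolled blocks X_1, ..., X_n."""
    hyper = hyper0                      # phi^(0): initial hyperparameters
    for X_n in blocks:                  # one incremental interval at a time
        # (1): MAP-like estimate of (word sequence, transformation parameters)
        W_hat, eta_hat = estimate_qb(X_n, hmm, hyper)
        # (2): adapt every HMM pdf with the parameters found on its tree path
        for (i, k), (mu_ik, r_ik) in hmm.items():
            c = tree_paths[(i, k)]      # cluster chosen from the tree path Gamma_ik
            mu_c, theta_c = eta_hat[c]
            hmm[(i, k)] = adapt_pdf(mu_ik, r_ik, mu_c, theta_c)
        # store refreshed hyperparameters phi^(n) for the next block
        hyper = update_hyperparameters(hyper, X_n, W_hat, eta_hat)
    return hmm, hyper
```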

2.3 Parameter Estimation
According to the EM algorithm, the QB estimation in (1) can be solved by

$$(\hat{W}^{(n)}, \hat{\eta}^{(n)}) = \arg\max_{(\hat{W}^{(n)}, \hat{\eta}^{(n)})} E\{\log p(X_n, s_n, l_n \mid \hat{W}^{(n)}, \hat{\eta}^{(n)}) \mid X_n, W^{(n)}\} + \log g(\hat{W}^{(n)} \mid \varphi_W^{(n-1)}) + \log g(\hat{\eta}^{(n)} \mid \varphi_\eta^{(n-1)}), \qquad (3)$$

where $(s_n, l_n)$ are the state and mixture component sequences. Because the parameters $(\hat{\eta}^{(n)}, \hat{W}^{(n)})$ are inherently independent, this QB estimation turns out to be two alternate estimation stages for the word sequence $\hat{W}^{(n)}$ and the transformation parameters $\hat{\eta}^{(n)}$ [1].

In (3), the most likely word sequence $\hat{W}^{(n)}$ can be obtained by plugging the acoustic model with current HMM parameters $\lambda^{(n)} = G^{(n)}(\lambda)$ and the language model $g(\hat{W}^{(n)} \mid \varphi_W^{(n-1)})$ into the conventional MAP decoder. In general, the language model parameters $\varphi_W^{(n-1)}$ are expressed in terms of relative occurrence frequencies of word sequences. After applying $X_n$ for online adaptation, the language model parameters should be updated to $\varphi_W^{(n)}$ by incrementing the occurrence frequencies associated with the estimated word sequence $\hat{W}^{(n)}$. The updated language and acoustic models $(\varphi_W^{(n)}, \hat{\lambda}^{(n)})$ will be used for recognizing the next data $X_{n+1}$ and estimating the new HMM parameters $\hat{\lambda}^{(n+1)}$. With the benefits of online updated acoustic and language models, we are able to improve the performance of speech recognition and understanding in nonstationary environments. In this study, the online unsupervised learning of the language model is skipped. Regarding the transformation parameters, we may specify the joint prior density of the parameters $\eta_c^{(n)} = (\mu_c^{(n)}, \theta_c^{(n)})$ of cluster $\Omega_c$ by a normal-Wishart density [2]

$$g(\mu_c^{(n)}, \theta_c^{(n)} \mid \varphi_{\eta,c}^{(n-1)}) \propto \left|\theta_c^{(n)}\right|^{(\alpha_c^{(n-1)} - D)/2} \exp\!\left\{-\tfrac{1}{2}(\mu_c^{(n)} - m_c^{(n-1)})^T \theta_c^{(n)} \tau_c^{(n-1)} (\mu_c^{(n)} - m_c^{(n-1)})\right\} \exp\!\left\{-\tfrac{1}{2}\,\mathrm{tr}(u_c^{(n-1)} \theta_c^{(n)})\right\}, \qquad (4)$$

with hyperparameters $\varphi_{\eta,c}^{(n-1)} = (\tau_c^{(n-1)}, m_c^{(n-1)}, \alpha_c^{(n-1)}, u_c^{(n-1)})$. Correspondingly, the optimal parameters $\hat{\eta}_c^{(n)} = (\hat{\mu}_c^{(n)}, \hat{\theta}_c^{(n)})$ and the refreshed hyperparameters $\varphi_\eta^{(n)}$ can be derived as shown in [2].
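As a numerical illustration of (4), the unnormalized log of the normal-Wishart prior can be evaluated directly from the hyperparameters; this is a small sketch for clarity, not code from the paper.

```python
import numpy as np

def log_normal_wishart(mu, theta, tau, m, alpha, u):
    """Unnormalized log g(mu, theta | tau, m, alpha, u) following eq. (4).
    mu, m: D-dim vectors; theta, u: D x D matrices; tau, alpha: scalars."""
    D = mu.shape[0]
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "theta must be positive definite"
    diff = mu - m
    return (0.5 * (alpha - D) * logdet
            - 0.5 * tau * diff @ theta @ diff
            - 0.5 * np.trace(u @ theta))
```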

3. ADAPTIVE INITIAL HYPERPARAMETERS FOR ONLINE LEARNING
The initial hyperparameters $\varphi_\eta^{(0)}$ should suitably characterize the statistics of the transformation parameters to give a good start for online adaptation. Conventionally, we apply the training material from a large population of speakers to estimate the informative prior statistics [2][5]. Such prior knowledge provides good generalization for different speakers. However, in order to improve online adaptation, the global initial hyperparameters could be adjusted to get closer to the target speaker.

3.1 Non-Informative and Static Initial Hyperparameters
To see the effect of non-informative initial hyperparameters, we heuristically assign the initial hyperparameters such that the first parameter set $\hat{\eta}^{(1)} = \{\hat{\mu}_c^{(1)}, \hat{\theta}_c^{(1)}\}$ has the form

$$\left(\Bigl(\sum_t \sum_{i,k \in \Omega_c} \xi_t(i,k)\, r_{ik}\Bigr)^{-1} \sum_t \sum_{i,k \in \Omega_c} \xi_t(i,k)\, r_{ik}\, (x_t - \mu_{ik}),\ \ I\right), \qquad (5)$$

where $I$ is the identity matrix and $\xi_t(i,k) = \Pr(s_t^{(1)} = i, l_t^{(1)} = k \mid X_1, \eta_c^{(1)}, \hat{W}^{(1)})$ is the posterior probability of being in state $i$ and mixture $k$ given $(X_1, \eta_c^{(1)}, \hat{W}^{(1)})$. Moreover, we may extract the informative initial hyperparameters from the SI training data using the technique mentioned in [2]. The static initial hyperparameters are accordingly produced with good generalization and are fixed for online unsupervised adaptation.
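A sketch of the initialization in (5) is given below under the assumption of diagonal precision matrices stored as vectors; gamma[t][(i, k)] stands for the occupancy ξ_t(i,k) and is assumed to be supplied by a separate alignment pass, so the names and data layout are illustrative rather than the paper's.

```python
import numpy as np

def initial_bias(frames, gamma, means, precisions, cluster):
    """Eq. (5): precision-weighted average bias for one cluster Omega_c.
    frames: T x D array; gamma: list of dicts {(i,k): occupancy} per frame;
    means/precisions: dicts {(i,k): D-dim vector}; cluster: set of (i,k) in Omega_c."""
    D = frames.shape[1]
    num = np.zeros(D)                 # sum_t sum_{i,k} xi_t(i,k) r_ik (x_t - mu_ik)
    den = np.zeros(D)                 # sum_t sum_{i,k} xi_t(i,k) r_ik
    for t, x_t in enumerate(frames):
        for (i, k), occ in gamma[t].items():
            if (i, k) in cluster:
                r = precisions[(i, k)]
                num += occ * r * (x_t - means[(i, k)])
                den += occ * r
    mu_c = num / den                  # first component of eq. (5)
    theta_c = np.ones(D)              # identity scaling, i.e. theta_c^(1) = I
    return mu_c, theta_c
```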

3.2 Adaptive Initial Hyperparameters
Actually, at the beginning of online unsupervised adaptation, the data available for determining the initial hyperparameters are not only the SI training data but also the first unlabeled block data $X_1$. Assuming that the training data contain a cluster of speakers who are acoustically near the target environment, we may use the data corresponding to the closest speaker cluster rather than the complete set of training speakers to estimate the initial hyperparameters. In [9], the HMM's of individual training speakers were estimated and the adaptation data were used to pick the top $N$ training speakers. The transformed training data of the top $N$ speakers were employed for estimating the speaker-adaptive HMM's. In this study, we attempt to alleviate the memory and computation loads of [9]. Because a tree structure of HMM pdfs has been constructed, we can reduce the load by estimating only a limited set of transformation factors associated with the tree nodes at higher layers for each training speaker. To be consistent with the estimation of the static initial hyperparameters in [2], the transformation factor of cluster $c$ and speaker $q$ is determined by using the averaged bias vectors $\tilde{b}_{cq}$ between the training frames $\chi_q = \{x_{q,t}\}$ of speaker $q$ and the aligned HMM mean vectors $\mu_{ik}$ for tree node $\Omega_c$. For each training speaker, we first use the transformation factors $\{\tilde{b}_{cq}\}$ to estimate his/her HMM parameters by $\hat{\lambda}_q = \{\mu_{ik} + \tilde{b}_{cq},\ r_{ik}\}_{\lambda_{ik} \in \Omega_c}$. Subsequently, the top $N$ training speakers are determined according to the calculated likelihood $p(X_1 \mid \hat{\lambda}_q)$ of the data $X_1$ matched with the various speaker models $\hat{\lambda}_q$. The transformation factors $\tilde{b}_{cq}$ corresponding to the top $N$ speaker likelihoods are forwarded to estimate the initial hyperparameters. Finally, the adaptive initial hyperparameters $m_c^{(0)}$ and $u_c^{(0)}$ are respectively estimated by computing the ensemble mean and variance of $\{\tilde{b}_{cq}\}$ with respect to the top $N$ acoustically close speakers. The hyperparameters $\tau_c^{(0)}$ and $\alpha_c^{(0)}$ are kept the same as those in the static initial hyperparameters.
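A rough sketch of how the adaptive initial hyperparameters might be obtained is given below; speaker_likelihood is an assumed scoring routine, the bias vectors b_cq are taken as precomputed, and u_c^(0) is treated as diagonal for simplicity, so this is an illustration of the idea rather than the paper's procedure.

```python
import numpy as np

def adaptive_initial_hyperparameters(X1, speaker_models, bias_vectors, speaker_likelihood, N=5):
    """Pick the N training speakers whose bias-transformed models best match the
    first block X_1, then use the ensemble mean/variance of their bias vectors
    as m_c^(0) and u_c^(0) for every tree cluster c."""
    # rank speakers q by p(X_1 | lambda_hat_q)
    scores = {q: speaker_likelihood(X1, model) for q, model in speaker_models.items()}
    top_n = sorted(scores, key=scores.get, reverse=True)[:N]

    m0, u0 = {}, {}
    clusters = next(iter(bias_vectors.values())).keys()   # bias_vectors[q][c] -> D-dim vector
    for c in clusters:
        stacked = np.stack([bias_vectors[q][c] for q in top_n])
        m0[c] = stacked.mean(axis=0)          # ensemble mean  -> m_c^(0)
        u0[c] = stacked.var(axis=0)           # ensemble variance -> u_c^(0) (diagonal)
    return m0, u0
```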

4. SELECTION PROCESS FOR UNSUPERVISED LEARNING
The hierarchical transformation adapts all HMM pdfs using the parameters selected from the tree structure. However, the error-prone word transcriptions in online unsupervised learning usually damage the goodness of the transformation parameters. It therefore becomes crucial to select reliable parameters to obtain stabilized performance. Different from the bottom-up search algorithm, which always selects the parameters nearest to the leaf layer, we examine a set of reliability assessment criteria in the selection process. Generally, the selection process attempts to balance the tradeoff between the reliability of the data transcriptions and the complexity of the transformation parameters. This attempt is analogous to adopting regularization theory, which has been widely applied for network pruning in large-scale neural networks [6]. Herein, the selection process is similar to pruning the tree structure by eliminating the unreliable parameters at lower layers under a penalty on the transformation complexity. The reliability assessment criteria are used to control the complexity penalty of the transformation parameters.

4.1 Confidence Measure and Description Length
The confidence measure (CM) is a good choice to evaluate how confidently the adaptation token $x_t^{(n)}$ is transcribed by the HMM pdf $\lambda_t$. Our approach is to select the tree node with the highest averaged log-likelihood ratio per frame from the candidates in the tree path $\Gamma_{ik}$ [1]. An alternative selection paradigm is to estimate a tree cut as shown in Fig. 1, which optimally chooses the set of reliable parameters as well as the number of parameters. The HMM pdf $\lambda_{ik}$ is adapted using the parameters located at the intersection node of the tree path $\Gamma_{ik}$ and the tree cut. The tree cut can be identified using the minimum description length (DL) criterion [10], a realization of regularization theory. On the other hand, if the selection process simultaneously considers the CM and DL criteria, the online unsupervised adaptation can be further improved. Therefore, we propose a two-pass adaptation by performing a first-pass adaptation using the CM criterion followed by a second-pass adaptation using the DL criterion [1]. This two-pass selection process is performed in each EM iteration and incremental interval.
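A minimal sketch of CM-based node selection along a tree path is shown below. The likelihood-ratio form assumed here (adapted pdf versus the unadapted SI pdf) is an illustration, not necessarily the exact CM formulation used in [1], and loglik/baseline_loglik are hypothetical per-frame scoring functions.

```python
import numpy as np

def select_node_by_cm(frames, path_nodes, loglik, baseline_loglik):
    """Pick the tree node with the highest averaged log-likelihood ratio per frame.
    frames: T x D array; path_nodes: candidate nodes on the tree path Gamma_ik;
    loglik(node, x): frame log-likelihood under the node's adapted pdf;
    baseline_loglik(x): frame log-likelihood under the unadapted (SI) pdf."""
    best_node, best_cm = None, -np.inf
    for node in path_nodes:
        ratios = [loglik(node, x) - baseline_loglik(x) for x in frames]
        cm = float(np.mean(ratios))            # averaged log-likelihood ratio per frame
        if cm > best_cm:
            best_node, best_cm = node, cm
    return best_node, best_cm
```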

4.2 Multi-Pass Hierarchical Adaptation
In [4], an unsupervised speaker adaptation scheme was designed to hierarchically generate new clusters of speech models while maintaining the continuity between adjacent clusters. Herein, we aim to recursively perform unsupervised hierarchical adaptation from using global parameters to local parameters according to the multi-pass hierarchical adaptation algorithm below.

Multi-Pass Hierarchical Adaptation Algorithm
1  for each incremental interval data $X_n$
2    for each EM iteration
3      for tree layer from root layer to leaf layer
4        Calculate transcriptions of adaptation data $\hat{W}^{(n)}$
5        Calculate transformation parameters $\hat{\eta}^{(n)}$
6        for each HMM pdf $\lambda_{ik}$
7          Extract cluster label $\Omega_c$ of $\lambda_{ik}$ in that layer
8          if transformation parameters of that label $\hat{\eta}_c^{(n)}$ exist
9            Perform model transformation $G_{\hat{\eta}_c^{(n)}}(\lambda_{ik})$
10         else if hyperparameters of that label $\varphi_{\eta,c}^{(n-1)}$ exist
11           Perform model transformation $G_{\varphi_{\eta,c}^{(n-1)}}(\lambda_{ik})$
12         end
13       end
14     end
15   end
16 end

Using this algorithm, we can hierarchically adapt the HMM pdfs to the target environment and gradually enhance the individuality of the HMM pdfs. However, performing multi-pass model adaptation results in a much higher computational overhead than other unsupervised adaptation schemes.

5. EXPERIMENTS
5.1 Baseline System and Speech Databases
In this paper, the online unsupervised adaptation is examined by a series of speaker adaptation experiments aimed at the recognition of Mandarin speech. The acoustic modeling of Mandarin speech has been described in [1][2]. In our experiments, the training database consisted of 5045 phonetically balanced Mandarin words spoken by 51 males and 50 females. Each word contained two to four Mandarin syllables. This database covered the acoustics of all 408 Mandarin syllables. We also collected another database containing four repetitions of 408 isolated Mandarin syllables spoken by a single female speaker. We used three repetitions for testing and the remaining one for adaptation. This is a recognition task over 408 Mandarin syllables, which is known to be a highly confusable vocabulary. Without adaptation, the baseline result using the SI speech models had a top-five recognition rate of 73.8%. The tree structure of HMM pdfs was built using the technique mentioned in [1][2]. For comparison, we carry out the unsupervised variant of Huo's online adaptation (denoted by OLUA), in which the QB estimates of the word sequence and the HMM pdfs are performed. This is different from the proposed online unsupervised transformation of HMM pdfs (OLUT).

5.2 Recognition Results
First of all, we evaluate the sensitivity of the initial hyperparameters in online unsupervised adaptation. Three conditions based on non-informative, static and adaptive initial hyperparameters are included. The cases of the bottom-up algorithm, an update interval of 25 utterances and a maximum data amount of 200 utterances are considered. As shown in Fig. 2, the use of informative hyperparameters performs better than that of non-informative hyperparameters. By additionally determining the closest speaker cluster, the adaptive initial hyperparameters significantly outperform the static initial hyperparameters for various amounts of adaptation data. Also, we find that OLUT consistently improves the performance with increasing amounts of data, whereas OLUA does not obtain such improvement. On the other hand, various approaches to blind selection of reliable parameters are examined for online unsupervised adaptation. In Fig. 3, the recognition rates of OLUT using the bottom-up algorithm are compared with those using CM, DL, combined CM-DL and multi-pass hierarchical adaptation. The case of adaptive initial hyperparameters is applied. We find that the selection processes using CM and DL achieve comparable performance and outperform the bottom-up algorithm. The combined CM-DL and the multi-pass hierarchical adaptation do improve the recognition. However, the computational cost of the combined CM-DL is much less than that of the multi-pass hierarchical adaptation.

Fig. 2 Evaluation of OLUT with various initial hyperparameters (top-5 recognition rate (%) versus amount of adaptation data; curves: OLUT with adaptive, static and non-informative initial hyperparameters, and OLUA with static initial hyperparameters).

Fig. 3 Evaluation of OLUT with various selection criteria (top-5 recognition rate (%) versus amount of adaptation data; curves: OLUT with multi-pass, combined CM-DL, CM, DL and bottom-up selection).

6. CONCLUSION
We have presented a novel framework of online unsupervised learning to achieve robustness and flexibility in automatic speech recognition. The robustness was achieved by building in the ability to learn from unknown environments. The flexibility was assured because the adaptation was done in an incremental and unsupervised manner. In this paper, we jointly and recursively maximized the QB density of the word sequence and the transformation parameters to fulfill online unsupervised learning. To improve the incremental learning, the speaker-adaptive initial hyperparameters were estimated by averaging the transformation factors associated with the cluster of training speakers that was acoustically nearest to the test speaker. To stabilize the unsupervised learning, several selection criteria were proposed to identify reliable parameters. A multi-pass hierarchical adaptation was also designed to gradually perform unsupervised adaptation from using global parameters to local parameters. From the experiments on online unsupervised speaker adaptation, the proposed adaptive initial hyperparameters did improve the incremental learning. The selection processes using various criteria and the multi-pass hierarchical adaptation did obtain desirable unsupervised learning performance. Therefore, we conclude that this general framework is feasible for the online unsupervised learning task in an adaptive speech recognition system.

7. REFERENCES
[1] CHIEN, J.-T. and JUNQUA, J.-C.: 'Unsupervised hierarchical adaptation using reliable selection of cluster-dependent parameters', Speech Communication, vol. 30, no. 4, 2000, pp. 235-253.
[2] CHIEN, J.-T.: 'Online hierarchical transformation of hidden Markov models for speech recognition', IEEE Trans. SAP, vol. 7, no. 6, 1999, pp. 656-667.
[3] DIGALAKIS, V. V.: 'Online adaptation of hidden Markov models using incremental estimation algorithms', IEEE Trans. SAP, vol. 7, no. 3, 1999, pp. 253-261.
[4] FURUI, S.: 'Unsupervised speaker adaptation based on hierarchical spectral clustering', IEEE Trans. ASSP, vol. 37, no. 12, 1989, pp. 1923-1930.
[5] HUO, Q. and LEE, C.-H.: 'On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate', IEEE Trans. SAP, vol. 5, no. 2, 1997, pp. 161-172.
[6] HAYKIN, S.: Neural Networks – A Comprehensive Foundation, 2nd edition, Prentice Hall, 1999.
[7] LEGGETTER, C. J. and WOODLAND, P. C.: 'Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models', Computer Speech and Language, vol. 9, 1995, pp. 171-185.
[8] MATSUI, T. and FURUI, S.: 'N-best-based unsupervised speaker adaptation for speech recognition', Computer Speech and Language, vol. 12, 1998, pp. 41-50.
[9] PADMANABHAN, M., BAHL, L. R., NAHAMOO, D. and PICHENY, M. A.: 'Speaker clustering and transformation for speaker adaptation in speech recognition systems', IEEE Trans. SAP, vol. 6, no. 1, 1998, pp. 71-77.
[10] SHINODA, K. and WATANABE, T.: 'Speaker adaptation with autonomous model complexity control by MDL principle', Proc. of ICASSP, 1996, pp. 717-720.