NAGOYA INSTITUTE OF TECHNOLOGY
DEPT. OF COMPUTER SCIENCE & ENGINEERING
TECHNICAL REPORT

IMPLEMENTING AN HSMM-BASED SPEECH SYNTHESIS SYSTEM
USING AN EFFICIENT FORWARD-BACKWARD ALGORITHM

Heiga Zen
December 2007

Department of Computer Science and Engineering, Nagoya Institute of Technology
Gokiso-cho, Showa-ku, Nagoya, 466-8555 JAPAN
E-mail: [email protected]
http://www.sp.nitech.ac.jp/~zen/
Number: TR-SP-0001
Copyright 2007, Nagoya Institute of Technology
Abstract
A statistical parametric speech synthesis system based on hidden semi-Markov models (HSMMs) has been developed. In both the training and synthesis parts of the system, the expectation-maximization (EM) algorithm is used. To perform the expectation step of the EM algorithm, this system has so far used the forward-backward algorithm proposed by Ferguson and refined by Levinson. Recently, Yu and Kobayashi proposed a more efficient forward-backward algorithm. In this report, we re-derive the parameter reestimation formulae and the speech parameter generation algorithm based on this new, efficient forward-backward algorithm.
key words: speech synthesis, hidden semi-Markov model (HSMM), forward-backward algorithm
1 Introduction
A statistical parametric speech synthesis system based on hidden semi-Markov models (HSMMs) has been developed [1]. In this system, the spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HSMMs, and speech parameter vector sequences are generated from the HSMMs themselves. This system defines speech synthesis as a problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. In both the training and synthesis parts of the system, the expectation-maximization (EM) algorithm is used. To perform the expectation step of the EM algorithm for HSMMs, two types of forward-backward algorithm have been proposed. The first was proposed by Ferguson [2], who defined the forward-backward variables as the joint probabilities of a state ending at a certain time and the series of observations up to that time. The joint probability distribution of a sequence of observations, which is required in the Ferguson algorithm and given in product form, can be calculated more efficiently by a recursive method, as suggested by Levinson [3] and further refined by Mitchel et al. [4]. The second is to transform a given HSMM into a "super HMM" [5]. The HSMM-based speech synthesis system has been developed based on the first algorithm [1]. Recently, Yu and Kobayashi proposed a more efficient forward-backward algorithm for HSMMs [6, 7]. In this algorithm, the forward-backward variables are defined using the notion of a state together with its residual life time, and the resulting algorithm is more efficient than the two algorithms above. In this report, we re-derive the parameter reestimation formulae and the speech parameter generation algorithm based on this new, efficient forward-backward algorithm.
2 Derivation of HSMM-based speech synthesis system

2.1 Notation
For reference purposes, this section lists the various formulae employed within the parameter estimation tools. All are standard; however, the use of non-emitting states and multiple data streams leads to various special cases which are usually not covered fully in the literature. The following notation is used in this section:

N            number of states
S            number of streams
T            number of observations
Q            number of sub-word models in an embedded training sequence
N_q          number of states in the q'th model in a training sequence
O            a sequence of observations
o_t          the observation at time t, 1 <= t <= T
o_{st}       the observation vector for stream s at time t
a_{ij}       the probability of a transition from state i to j
b_j(.)       output probability distribution for state j
b_{js}(.)    output probability distribution for state j, stream s
p_j(.)       duration probability distribution for state j
D_j          maximum duration for state j
w_{jsg}      weight of mixture component g in state j, stream s
\mu_{jsg}    vector of means for mixture component g of state j, stream s
\Sigma_{jsg} covariance matrix for mixture component g of state j, stream s
m_j          mean of the duration distribution of state j
\sigma_j^2   variance of the duration distribution of state j
\lambda      the set of all parameters defining an HSMM
2.2 Forward/Backward probabilities of HSMMs
In the case of embedded training, where the HSMM spanning the observations is a composite constructed by concatenating Q sub-word models, it is assumed that at time t the \alpha and \beta values corresponding to the entry and exit states of an HSMM represent the forward and backward probabilities at times t - \Delta t and t + \Delta t, respectively, where \Delta t is small. The equations for calculating \alpha and \beta are then as follows.

For the forward probability, the initial conditions are established at time t = 1 as follows:

\alpha_1^{(q)}(1,1) = \begin{cases} 1 & \text{if } q = 1 \\ \alpha_1^{(q-1)}(1,1)\, a_{1 N_{q-1}}^{(q-1)} & \text{otherwise} \end{cases}    (1)

\alpha_j^{(q)}(1,1) = \alpha_1^{(q)}(1,1)\, a_{1j}^{(q)}\, b_j^{(q)}(o_1),    (2)

\alpha_{N_q}^{(q)}(1,1) = \sum_{i=2}^{N_q-1} \alpha_i^{(q)}(1,1)\, p_i^{(q)}(1)\, a_{i N_q}^{(q)},    (3)

where the superscript in parentheses refers to the index of the model in the sequence of concatenated models. All unspecified values of \alpha are zero. For time t > 1,

\alpha_1^{(q)}(t,1) = \begin{cases} 0 & \text{if } q = 1 \\ \alpha_{N_{q-1}}^{(q-1)}(t-1,1) + \alpha_1^{(q-1)}(t,1)\, a_{1 N_{q-1}}^{(q-1)} & \text{otherwise} \end{cases}    (4)

\alpha_j^{(q)}(t,1) = \Bigl[ \alpha_1^{(q)}(t,1)\, a_{1j}^{(q)} + \sum_{i=2}^{N_q-1} \sum_{d=1}^{D_i^{(q)}} \alpha_i^{(q)}(t-1,d)\, p_i^{(q)}(d)\, a_{ij}^{(q)} \Bigr] b_j^{(q)}(o_t),    (5)

\alpha_j^{(q)}(t,d) = \alpha_j^{(q)}(t-1,d-1)\, b_j^{(q)}(o_t),    (6)

\alpha_{N_q}^{(q)}(t,1) = \sum_{i=2}^{N_q-1} \sum_{d=1}^{D_i^{(q)}} \alpha_i^{(q)}(t,d)\, p_i^{(q)}(d)\, a_{i N_q}^{(q)}.    (7)
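To make the recursion concrete, the following is a minimal NumPy sketch of the forward pass for a simplified single-model variant: no non-emitting entry/exit states, no model concatenation, and the final state is assumed to complete its duration exactly at time T. All names are illustrative and not part of the report.

```python
import numpy as np

def hsmm_forward(pi, A, dur, B):
    """Forward recursion in the style of Eqs. (5)-(7), single HSMM.

    pi  : (N,)  initial state probabilities
    A   : (N,N) transition probabilities, zero diagonal (durations are
                modelled explicitly, so there are no self-transitions)
    dur : (N,D) duration pmf, dur[i, d-1] = p_i(d)
    B   : (N,T) emission likelihoods, B[i, t-1] = b_i(o_t)

    alpha[t-1, j, d-1] mirrors alpha_j(t, d): the probability of
    o_1..o_t with state j occupied for the last d frames.
    """
    N, D = dur.shape
    T = B.shape[1]
    alpha = np.zeros((T, N, D))
    alpha[0, :, 0] = pi * B[:, 0]                    # Eq. (2) analogue
    for t in range(1, T):
        # Eq. (5): leave state i after d frames, transition to j, emit o_t
        enter = (alpha[t - 1] * dur).sum(axis=1) @ A  # shape (N,)
        alpha[t, :, 0] = enter * B[:, t]
        # Eq. (6): stay in the same state for one more frame
        alpha[t, :, 1:] = alpha[t - 1, :, :-1] * B[:, t, None]
    # Eq. (7) analogue: the final state completes its duration at T
    P = (alpha[-1] * dur).sum()
    return alpha, P
```

For a two-state model with uniform emissions, the total P sums the probabilities of the duration sequences that span T frames, which is easy to verify by hand.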
For the backward probability, the initial conditions are set at time t = T as follows:

\beta_{N_q}^{(q)}(T,1) = \begin{cases} 1 & \text{if } q = Q \\ \beta_{N_{q+1}}^{(q+1)}(T,1)\, a_{1 N_{q+1}}^{(q+1)} & \text{otherwise} \end{cases}    (8)

\beta_i^{(q)}(T,d) = p_i^{(q)}(d)\, a_{i N_q}^{(q)}\, \beta_{N_q}^{(q)}(T,1),    (9)

\beta_1^{(q)}(T,1) = \sum_{j=2}^{N_q-1} a_{1j}^{(q)}\, b_j^{(q)}(o_T)\, \beta_j^{(q)}(T,1),    (10)

where once again all unspecified \beta values are zero. For time t < T,

\beta_{N_q}^{(q)}(t,1) = \begin{cases} 0 & \text{if } q = Q \\ \beta_1^{(q+1)}(t+1,1) + \beta_{N_{q+1}}^{(q+1)}(t,1)\, a_{1 N_{q+1}}^{(q+1)} & \text{otherwise} \end{cases}    (11)

\beta_i^{(q)}(t,d) = p_i^{(q)}(d)\, a_{i N_q}^{(q)}\, \beta_{N_q}^{(q)}(t,1) + p_i^{(q)}(d) \sum_{j=2}^{N_q-1} a_{ij}^{(q)}\, b_j^{(q)}(o_{t+1})\, \beta_j^{(q)}(t+1,1) + b_i^{(q)}(o_{t+1})\, \beta_i^{(q)}(t+1,d+1),    (12)

\beta_1^{(q)}(t,1) = \sum_{j=2}^{N_q-1} a_{1j}^{(q)}\, b_j^{(q)}(o_t)\, \beta_j^{(q)}(t,1).    (13)
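The backward pass admits an analogous NumPy sketch under the same simplifying assumptions (single model, no non-emitting entry/exit states, zero-diagonal transition matrix); names are again illustrative only.

```python
import numpy as np

def hsmm_backward(pi, A, dur, B):
    """Backward recursion in the style of Eqs. (9)-(13), single HSMM.

    beta[t-1, i, d-1] mirrors beta_i(t, d): the probability of the
    future observations o_{t+1}..o_T given that state i has been
    occupied for the last d frames through time t.
    """
    N, D = dur.shape
    T = B.shape[1]
    beta = np.zeros((T, N, D))
    beta[-1] = dur                                    # Eq. (9) analogue
    for t in range(T - 2, -1, -1):
        # Eq. (12), first branch: leave state i after d frames,
        # transition to j, and emit o_{t+1} there
        jump = A @ (B[:, t + 1] * beta[t + 1, :, 0])  # shape (N,)
        beta[t] = dur * jump[:, None]
        # Eq. (12), second branch: extend the current occupancy by one frame
        beta[t, :, :-1] += B[:, t + 1, None] * beta[t + 1, :, 1:]
    # total probability from the backward pass
    P = (pi * B[:, 0] * beta[0, :, 0]).sum()
    return beta, P
```

On a small hand-checkable example this backward total matches the forward total, as the identity between the two passes requires.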
The total probability can be computed from either the forward or the backward probabilities:

P(O \mid \lambda) = \alpha_{N_Q}^{(Q)}(T,1) = \beta_1^{(1)}(1,1).    (14)
The computational complexity of this algorithm is O((ND + N^2) T), where D = \max_{q,j} D_j^{(q)}. This is more efficient than Ferguson's algorithm and its variants. If D_j^{(q)} = 1 for all q and j, this forward-backward algorithm reduces to the Baum-Welch algorithm, i.e., the standard forward-backward algorithm for HMMs.
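This reduction to the Baum-Welch algorithm can be checked numerically: when every duration pmf is concentrated at d = 1, the duration sums collapse and the HSMM forward total coincides with the standard HMM forward total. A minimal sketch under the same simplified single-model assumptions (illustrative names; with D = 1 duration modelling is vacuous, so A may carry self-transitions):

```python
import numpy as np

def hsmm_forward_prob(pi, A, dur, B):
    """General-duration HSMM forward total probability, single model."""
    N, D = dur.shape
    T = B.shape[1]
    alpha = np.zeros((T, N, D))
    alpha[0, :, 0] = pi * B[:, 0]
    for t in range(1, T):
        # complete a duration in some state, then transition and emit
        alpha[t, :, 0] = ((alpha[t - 1] * dur).sum(axis=1) @ A) * B[:, t]
        # or remain in the current state one more frame
        alpha[t, :, 1:] = alpha[t - 1, :, :-1] * B[:, t, None]
    return (alpha[-1] * dur).sum()

def hmm_forward_prob(pi, A, B):
    """Standard (Baum-Welch) HMM forward total probability."""
    alpha = pi * B[:, 0]
    for t in range(1, B.shape[1]):
        alpha = (alpha @ A) * B[:, t]
    return alpha.sum()
```

With dur = np.ones((N, 1)), i.e. p_i(1) = 1 for every state, the two functions return identical values for any stochastic A and positive emission likelihoods B.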
2.3 Parameter reestimation
Using \alpha(\cdot) and \beta(\cdot), we can obtain the following variables for re-estimation of the HSMM parameters: