DOMAIN ADAPTATION FOR TTS SYSTEMS Min Chu, Chun Li*, Hu Peng and Eric Chang
Microsoft Research Asia, Beijing; * Tsinghua University, Beijing {minchu; i-hupeng; echang}@microsoft.com;
[email protected] ABSTRACT This paper puts forward a domain adaptation problem that has not been studied well. For corpus-driven TTS systems, domain adaptation is realized by adding a small amount of domainspecific speech that will provide the maximum increase in average length of units that are used for synthesizing speech in that domain. An approach for generating optimized script for adaptation is proposed, the core of which is a dynamic programming based algorithm that segments domain-specific corpus into minimum number of segments that appear in the unit inventory. Increase in MOS after adaptation can be estimated from the generated script without recording speech from it. The results show that the amount of MOS increase depends not only on the size of the training set and the size of the script for adaptation, but also on the broadness of the domain. Narrower domains have larger increase in MOS.
1.
INTRODUCTION
There is a long tradition of research on speech synthesis technology. With the newly burgeoning applications such as spoken dialog systems, call center services and voice-enabled web and email services, an increasing emphasis is put on generating natural sounding speech. The simplest way for generating speech prompt in domain-specific applications is to play back a collection of pre-stored waveforms for words, phrases and sentences. When the domain is narrow and closed, very natural speech prompt can be generated with this method at relatively low cost. However, when the domain is not closed or is broader, or when the number of domains increases, cost for constructing and maintaining such prompt systems will increase greatly. A general TTS system is preferred instead. However, general-purpose TTS systems sometimes cannot generate high quality speech for some domains, especially when the domain mismatches the speech corpus that is used as the unit inventory. It would be desirable to have a general-purpose TTS system that can produce rather natural speech without domain restrictions and that can generate more natural speech for a specific domain after domain adaptation. Domain adaptation is a concept that has been used in many research areas. However, not many studies have been done in TTS systems. This paper proposes a domain adaptation approach for corpus-driven TTS systems. It increases the naturalness on specific domains by adding domain-specific speech into the unit inventory. The script for speech collection is generated automatically by maximizing the average length of the concatenated units over a testing corpus on that domain. *
The paper is organized as follows: Section 2 investigates the problem of domain adaptation in details. The proposed method for generating the script for adaptation is introduced in Section 3. Experiments and results are given in Section 4. Section 5 provides the conclusion.
2.
In the last decade, concatenation based speech synthesis has been widely adopted and rapidly developed because of its capability of producing high quality speech. To some extent, speech synthesis becomes a problem of collecting, annotating, indexing and retrieval of large speech databases [1]. The naturalness of synthesized speech depends to some extent on the size and coverage of the unit inventory. Though expanding the unit inventory is not an efficient way to increase naturalness for a generalpurpose TTS system, it is a reasonable method for increasing naturalness on a specific domain. Thus, the problem of domain adaptation is converted into a problem of generating an optimized script for collecting domain-specific speech. The sticking point of the problem is to find an efficient objective measure for naturalness. A formal subjective evaluation has been done to investigate the relationship between the naturalness of synthetic speech and some objective measures. ACC (Average Concatenative Cost) is shown to be highly correlated with MOS (Mean Opinion Score), which reveals that the ACC predicts, to a great extent, the perceptual behavior of human beings. 400 ACC vs. MOS pairs are plotted in Figure 1. The linear regression equation at the top right corner in Figure 1 can be used to estimate MOS from ACC. More details on the definition of ACC and the subjective evaluation can be found in [2]. Several factors are considered when calculating ACC. Among them, smoothness cost is a very important one. With this constraint, the longest speech segment that matches other prosodic and phonetic constraint will be selected for concatenation. Utterances concatenated by a few long segments often sound very natural. Thus, the Average Segment Length (ASL), defined as the average number of characters in selected segments, also reflects naturalness. The ASL vs. MOS pairs for 400 synthetic utterances are plotted in Figure 2. Though the MOS of dots with ASL smaller than 1.5 scatters into a broad range, it concentrates around 3.5 when ASL increases to 2 and it goes above 4 when ASL exceeds 3. To simplify the problem, our adaptation method maximizes the ASL when generating the script for adaptation. With this, the problem of domain adaptation can be described as follows:
This work was conducted during the second author’s visit to MSR Asia
0-7803-7402-9/02/$17.00 ©2002 IEEE
PROBLEM INVESTIGATION
I - 453
1. Definition of symbols Cs: a domain-specific text corpus. Naturalness of speech synthesized from this corpus is to be improved. Ug: the scripts for the general unit inventory. Us: a generated script for domain adaptation. Ss: the size of Us in number of sentences. Lg: ASL for corpus Cs when unit inventory Ug is used. Ls: ASL for corpus Cs when Ug+Us is used.
unit inventory U. Among many possible schemes for sub-string selection, the one with the maximum ASL is assumed to be the most natural one. A Dynamic Programming (DP) [3] based algorithm is proposed for finding the segmentation scheme with the minimum number of segments, i.e. maximum ASL. Details of the algorithm will be presented in Section 3.3. After finding the best string sequence for all sentences in Cs, the ASL for Cs, when U is used, is given by equation (1):
2. The problem For a given Ss, generate Us that maximizes L = Ls Lg. An automatic approach for generating Us will be proposed in Section 3.
ASL(Cs, U) = Size(Cs) / Count(Segment, (Cs, U))
Subjective measure (MOS)
ü
5
y = -0.8149x + 4.7385 4 3 2 1 0
1
2
3
4
(1)
where, Size(Cs) is the number of characters in corpus Cs, and Count(Segment, (Cs, U)) is the number of segments used for synthesizing Cs with U. Obviously, when a DSS is added into U, ASL(Cs, U) will increase. An iteratively algorithm is used to search for a DSS that will provide maximum increase in ASL(Cs, U) one by one until a threshold for ASL or the number of DSS is met. Since DSS are expected to carry sentence level prosody, DDS that cover all DSS are selected from Cs. The block diagram of the approach for generating domainspecific script is given in Figure 3. Details for each component will be presented in the following sections.
5
Objective measure (ACC)
Cs: Domain Specific Corpus
Figure 1. Correlation between ACC and MOS.
Ug: General Unit Inventory
PAT Tree Construction
5 4.5
PAT Tree for Cs
MOS
4
PAT Tree for Ug
3.5
Candidate DSS Generation
3 2.5 2
List of Candidate DSS
1.5 1 1
2
3
DSS Extraction
4
ASL of selected units List of DSS
Figure 2. Correlation between ASL and MOS.
DDS Generation
3.
APPROACH FOR GENERATING OPTIMIZED SCRIPT FOR DOMAIN ADAPTATION
As mentioned in Section 2, the domain adaptation problem for TTS systems is converted into a problem of generating an optimized script Us that will provide maximum increase in ASL within a size limitation Ss. Theoretically, the larger Ss is, the larger the ASL will be. However, we normally don’t want to spend too much on speech collection for specific domains. An automatic approach is proposed in this section that extracts Domain Dependent Sentences (DDS) one by one according to their contribution to the increase in ASL. A stop threshold for Ss and ASL should be chosen according to the user’s expectation of recording effort and naturalness. 3.1. Overview of the proposed approach The approach is decomposed into two steps. First, DomainSpecific Strings (DSS) are extracted from Cs. DSS is defined as a string of characters which appears frequently in Cs, yet never appears in Ug. Then, DDS are generated to cover all DSS. In the DSS extraction step, all sentences in Cs are assumed to be synthesized by concatenating sub-strings that appear in the
Sentence Segmentation Engine Domain Specific Script
Figure 3. The block diagram of the approach for generating domain-specific script. 3.2. PAT indexing and candidate DSS generation For calculating ASL over Cs, the operation of searching for a sub-string and its frequency of occurrence in Cs and U is frequently used. So an efficient indexing technique is needed. In this paper, PAT tree is used to index both Cs and Ug. The PAT tree [4][5] is an efficient data structure that has been successfully used in the field of information retrieval and content indexing [6]. A PAT tree is in fact a binary digital tree, in which each internal node has two branches and each external node represents a semi-infinite string (denoted as Sistring)[4]. For constructing a PAT tree, each Sistring in the corpus should be encoded into a bit stream. In our realization, GB2312 code for Chinese is used. Once the PAT tree is constructed, all Sistrings which appear in the corpus can be retrieved efficiently. PAT trees are constructed for Cs and Ug respectively. A list of candidate DSS is generated from the tree for Cs by the criteria
I - 454
that candidate DSS should appear in Cs for at least N times and that they should never appear in Ug. 3.3. Maximum ASL segmentation algorithm To find the best DSS from all candidates, Cs should be segmented into sub-strings appearing in Ug with the maximum ASL constraint. The problem is detailed as follows. A sentence with N Chinese characters is denoted as C1C2…CN. It is to be segmented into M (M ≤ N) sub-strings, all of which should appear at least once in Ug. Though many segmentation schemes exist, only the one with the smallest M is what we are searching for. In fact, it turns out to be a searching problem for the optimal path, which is illustrated under the DP frame work in Figure 4. Node 0 represents the start point of a sentence and nodes 1 through N represent character C1C2…CN respectively. Each node is allowed to jump to all the nodes behind it. The arc from node i to node j represents the sub-string Ci+1…Cj. A distance d(i ,j) is assigned for it by equation (2). Each path from node 0 to i corresponds to one segmentation scheme for string Ci…Cj. The distance for each path is the sum of the distances of all arcs on the path. Let f (i ) denotes the shortest distances from node 0 to i and g (i ) keeps the nodes on the path with f (i ) . Then, g (N ) with f (N ) is the path we are looking for. if Ci+1 KC j appears at least once in 1 d (i, j ) = the unit inventory i < j ∞ otherwise
Among the extracted DSS, some are sub-strings of the others. It is not necessary to keep them all. The shorter ones are filtered out. 3.5. DDS generation Since sentences are preferred for speech data collection to carry sentence level prosody, DDS that cover all extracted DSS should be generated. Though it can be written manually, it is more efficient to select DDS from Cs automatically. All sentences in Cs are considered as candidates. The criterion for selecting DDS is ASLIPC for a sentence, which is the sum of ASL increase for all DSS appearing in the sentence divided by the sentence length. The sentence with the highest ASLIPC is selected first and removed from the candidate list Cs. The DSS appearing in this sentence should be removed from the DSS list too. The procedure is done iteratively until the DSS list is empty or the number of selected sentences reaches a preset limit. start calculate ASLIP C for each candidate DSS select the one with the largest ASLIP C
the largest ASLIP C < threshold
The segmentation algorithm is described as follows: Step 1: Initialization
Yes
No output the DSS; add it into unit inventory; remove it from candidate list.
f (0) = 0, g (0) = −1
Step 2: Recursion f ( j ) = min [ f (i) + d (i, j )] 0≤ i < j
g ( j ) = arg min[ f (i ) + d (i, j )] 0≤ i < j
for
j = 1,2,..., N − 1, N
No
Step3: Termination the path with shortest distance g (N ) the distance for path g (N ) (equivalent to the f (N ) number of sub-strings on the path)
Yes end
Figure 5. The flowchart of DSS extraction.
4. 0
1
2
...
N-1
the list of candidate DSS empty?
EXPERIMENTS AND RESULTS
Four experiments are done to investigate the validity of the proposed method from different aspects.
N
4.1. Experiment 1 – size of generated script vs. ASL increase Figure 4. Networks for sentence segmentation. 3.4. DSS extraction To measure the efficiency of a candidate DSS for increasing ASL, ASL Increase Per Character (ASLIPC) is defined by equation (3): ASLIPC = ( ASLa − ASLo ) / L
(3)
where, L is the length of a candidate DSS in characters, ASLo is the ASL for Cs when it is segmented by the unit inventory without current candidate DSS, and ASLa is the ASL after adding current candidate DSS into the unit inventory. The flowchart of DSS extraction is shown in Figure 5.
The first domain we choose is stock review. A 250 Kbyte corpus is used as training set, from which DDS are extracted. Another 150 Kbyte corpus is used as testing set to verify the extensibility of the selected DDS. ASL for the training and testing set before adaptation is 1.59 and 1.58 respectively. Increases in ASL after adaptation with 100 to 800 DDS for both sets are shown in Figure 6. The ASL increase for testing set is close to that for training set when the number of DDS is small. Differences between the two sets go up rapidly after the number of DDS exceeds 500. There is no absolutely best number of DDS to be extracted. Users should decide their own balance point between size and naturalness.
I - 455
Retrieval: Data Structures and Algorithms, Prentice Hall Press, New Jersey, pp. 66-82, 1992.
4.2. Experiment 2 – size of training set vs. ASL increase The stock domain is used again in this experiment. The size of training set changes from 100K to 600K with a 100K step size. The testing set is the same as above. 500 DDS are extracted from each training set. Increases in ASL after adaptation are shown in Figure 7. We find that when the size of training set exceeds 300K, the increase in ASL for training set drops and the increase in ASL for testing set goes flat. The result shows that a 200K300K training corpus is sufficient.
[5] Morrison, D., “PATRICIA: Practical Algorithm to Retrieve Information Coded in Alphanumeric”, JACM, pp. 514-534, 1968. [6] Chien, L. F., “PAT-tree-based Keyword Extraction for Chinese Information Retrieval”, Proceedings of the 20th annual international ACM SIGIR conference on research and development in information retrieval, pp. 50-58, 1997.
4.3. Experiment 3 – increase in ASL for different domains TrainingSet Increase in ASL
To compare the performance on different domains, 500 DDS are extracted from 4 domains (stock review, finance news, football news and sports news) separately. The size for each training and testing set are 250K and 150K respectively. ASL for testing set before and after adaptation are given in Figure 8. The original ASL for the four domains are different. Among them, the one for stock review is the smallest. This is due to the fact that few stock related corpora have been used when generating the general unit inventory. After adapting with 500 DDS, increase in ASL for the two narrower domains (stock and football) are larger than those for the two broader domains (finance and sports).
TestingSet
1 0.8 0.6 0.4 0.2 0 100 200 300 400 500 600 700 800 Number of DDS
Figure 6. Number of DDS selected vs. the increase in ASL.
4.4. Experiment 4 – increase in MOS for different domains TrainingSet
5.
Increase in ASL
In the DDS generation approach, ASL is the only constraint for unit selection. However, in a real TTS system, other constraint on prosody context and phonetic context are used. To clarify how much improvement is achieved on naturalness, the real unit selection procedure is used to calculate ACC before and after adaptation for the four domains. The corresponding MOS are estimated by the equation in Figure 1 and they are shown in Figure 9. MOS for the four domains increase from 0.1 to 0.22 respectively. Since all features used for calculating ACC can be derived from texts, MOS after adaptation can be estimated without recording speech.
0.6 0.4 0.2 0 100
200 300 400 500 Size of training set (Kbyte)
600
Figure 7. Size of training set vs. increase in ASL. Original
2.5
CONCLUSIONS
ASL
This paper has put forward a domain adaptation problem that has not been studied well. A framework for generating domainspecific script has been proposed. With it, application developers will have a good estimate on how much improvement can be achieved before starting to record speech for a specific domain. Experiments on four domains show that the extent of increase in naturalness depends not only on the size of the training set and the size of the script for adaptation, but also on the broadness of the domain. Greater increases in naturalness are observed for narrower domains.
Adapted
2
1.5 1 Stock
Finance Football Sports Domains
Figure 8. Increase in ASL for different domains.
Estimated MOS
6.
TestingSet
0.8
REFERENCES
[1] Chu, M, Peng, H., Yang, H. and Chang, E., “Selecting nonuniform units from a very large corpus for concatenative speech synthesizer”, ICASSP2001.
3.3
Original
3.2
Adapted
3.1 3 2.9 Stock Finance Football Sports
[2] Chu, M. and Peng, H., “An objective measure for estimating MOS of synthesized speech”, Eurospeech2001.
Figure 9. Increase in estimated MOS for different domains.
[3] Bellman, R. E., Dynamic Programming, Princeton University Press, New Jersey, 1957. [4] Gonnet, G. H., Baeza-Yates, R. A. and Sinder, T., “New Indices for Text: PAT Trees and PAT Arrays,” Information
I - 456