Compression of Speech Database by Feature Separation and Pattern Clustering Using STRAIGHT

Zhen-Hua Ling, Yu Hu, Zhi-Wei Shuang, Ren-Hua Wang

iFlytek Speech Laboratory, University of Science and Technology of China, Hefei, Anhui, P.R. China

{zhling,yuhu}@iflytek.com, [email protected], [email protected]

Abstract

This paper presents an alternative solution for speech database compression, aimed at embedded applications of concatenative synthesis systems. The waveform of a speech segment is first decomposed into a prosodic pattern and a spectral pattern by STRAIGHT, a powerful speech analysis-synthesis algorithm. All prosodic and spectral patterns are then clustered separately to remove redundant acoustic information within the database. The clustering process is controllable and yields a flexible compression ratio to meet the actual footprint requirements of various embedded devices. In addition, labeling and contextual information are utilized to improve the performance of pattern clustering. A subjective listening test shows that our Mandarin synthesis system with its corpus compressed by the proposed method at about 2.7 kbps performs comparably to the same system compressed by G.723.1 at 5.3 kbps, and that quality does not degrade severely as the compression ratio increases.

1. Introduction

With the development of speech technology, TTS (Text-to-Speech) systems have begun to emerge on small embedded devices in recent years. However, most of these systems, especially those on small-footprint, low-resource devices, are based on parametric synthesizers, which greatly limit the intelligibility and naturalness of synthesized speech compared with concatenative systems. The primary obstacle to applying the latter is the disagreement between the large size of the speech database and the small memory footprint available on embedded devices, so an effective speech database compression method is required. General speech coding algorithms, however, may not be appropriate for this task, for the following reasons, some of which have been discussed in [1]:

1) Instead of pursuing similarity between the original and reconstructed waveforms, as most coding algorithms do, the guideline for speech database compression is to preserve the characteristics within the database and to maintain the intelligibility and naturalness of synthesized speech as much as possible. This means some distortion of acoustic features during compression is allowable, as long as the reconstructed speech segment remains representative of, and usable in, its intended context in synthesized speech.

2) Database compression can exploit accurate labeling information and contextual descriptions of the database to improve performance, which is difficult for general speech coding algorithms.

3) The complexity of the encoding side is of no concern for speech database compression because compression is always conducted offline, so complicated calculations can be carried out during the compression process.

4) In contrast to the fixed bit rate of general coding algorithms, a flexible compression ratio is expected, to fit the varying requirements of actual situations.

Therefore, an alternative method for the compression of speech databases with a flexible compression ratio is proposed here. The main idea is based on the separation of acoustic features and the clustering of similar acoustic patterns. STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) [2], a vocoder-type speech analysis-synthesis algorithm, is introduced to realize the separation of acoustic features and the reconstruction of speech waveforms. Using the proposed method, we have compressed an 8 kHz sampled Mandarin speech database with an original size of 28 MB into 875 KB, 600 KB, 500 KB, 400 KB and 250 KB respectively; the corresponding bit rate of the compressed output varies from about 4 kbps down to 1.2 kbps. The evaluation results show that the quality of synthesized speech does not degrade seriously as the compression ratio increases, and that the performance of the proposed method at about 2.7 kbps is slightly better than that of G.723.1 at 5.3 kbps. The rest of this paper is organized as follows: Section 2 presents the main ideas of the proposed method, Section 3 introduces the details of implementation, Section 4 gives the evaluation results, and Section 5 concludes.
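As a quick consistency check on these figures (not part of the original paper), the quoted output sizes and bit rates agree if one assumes the 28 MB corpus is 8 kHz, 16-bit mono PCM; the sample width is our assumption, since the text only states the sampling rate:

```python
# Sanity check: do the quoted output sizes match the quoted bit rates?
DB_BYTES = 28 * 1024 * 1024            # original database size
BYTES_PER_SEC = 8000 * 2               # 8 kHz sampling, 2 bytes/sample (assumed)
duration_s = DB_BYTES / BYTES_PER_SEC  # roughly 1835 s of recorded speech

for size_kb in (875, 600, 500, 400, 250):
    bitrate_kbps = size_kb * 1024 * 8 / duration_s / 1000
    print(f"{size_kb} KB -> {bitrate_kbps:.1f} kbps")
# Prints roughly 3.9, 2.7, 2.2, 1.8 and 1.1 kbps, in line with the
# "about 4 kbps down to 1.2 kbps" range reported in the text.
```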

2. Main ideas

2.1. Analysis of redundancy within speech database

To compress a speech database effectively, one should not only adopt signal processing methods, as general speech coding algorithms do, but also study the intrinsic characteristics of the given database, especially the redundant information it contains. Here, redundant information means the presence of similar acoustic segments, or close acoustic features, that are unneeded in the recorded database. By eliminating such redundant information, the speech database can be compressed into a smaller output with better performance than general speech coding algorithms achieve.

In a concatenative speech synthesis system, the recording materials for database construction are designed to cover the varieties of each synthesis unit as thoroughly as possible, by analyzing the contextual and syntactic information of each unit in the candidate texts. During synthesis, segments with appropriate acoustic features or contextual environments are selected and concatenated into a sentence [3][4]. For a recorded database, redundancy can be examined from the following two aspects:

1) From the aspect of unit size. Because of the limitations of linguistic models and the constraints of natural pronunciation, speech databases have to be designed around phonetic units that carry definite linguistic meaning and can be read naturally. For example, the syllable is commonly treated as the basic unit for corpus design and unit selection in Mandarin synthesis systems. However, even in an ideal database without redundant syllable segments, if we split all syllables into initials and finals, it is still possible to find similar acoustic segments that are interchangeable, especially among initials, because of their lack of diversity in common speech.

2) From the aspect of integration and separation of acoustic features. The varieties designed for each synthesis unit are ultimately reflected in the diversity of acoustic segments in the recorded database. Each segment integrates a group of acoustic features, including pitch, duration, spectrum and so on. It is reasonable to believe that each separated acoustic feature is less diverse than the integrated whole, and that different features carry different amounts of redundancy.

2.2. Our method

STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) [2] has been presented in recent years as a high-quality analysis-synthesis method. It is based on fundamental frequency extraction using TEMPO (Time-Domain Excitation extractor using Minimum Perturbation Operator) [5] and pitch-adaptive spectral smoothing. Because of its high performance in feature separation, parameter modification and speech reconstruction, it makes it possible to study the recorded speech database at smaller phonetic units and with separated acoustic features, addressing the problems discussed in Section 2.1.

In Mandarin, each syllable has a stable initial (voiced or unvoiced) plus final (always voiced) structure. Considering the difficulty of concatenating voiced segments within a syllable, we split each syllable in our Mandarin database into an unvoiced segment (the unvoiced initial) and a voiced segment (the voiced initial with the final, or the final alone), exposing the redundancy at smaller phonetic units. Then a prosodic pattern and a spectral pattern are extracted by STRAIGHT for each segment, exposing the redundancy in each separated acoustic feature. Afterwards, the prosodic patterns and spectral patterns are compressed separately.

Experimental results tell us that people are insensitive to small distortions of prosodic features in synthesized speech, so all the prosodic patterns of voiced segments can be clustered into a small set of pitch contours that represent the prosodic character of the speech database. A more complex clustering process is required to eliminate redundant spectral patterns, because the distance between spectral patterns is harder to measure than that between prosodic patterns. A strategy of combining LSF distance and contextual difference into a segmental distance is adopted here to judge the similarity among spectral patterns sharing the same phoneme label. Within a cluster of spectral patterns at close segmental distance, only one pattern is retained; the others are eliminated. As the patterns removed during clustering are always the less representative ones, better compression performance can be achieved than with general speech coding algorithms, which treat all patterns equally.

During decompression, the clustered prosodic pattern and spectral pattern of each segment are decoded and then fed into STRAIGHT to reconstruct the waveform, after spectral smoothing at the concatenation boundaries. By controlling the variance threshold during pattern clustering, a variable compression ratio can easily be obtained.
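To make the prosodic clustering concrete, the following is a minimal Python sketch of the idea: pitch contours are mean-removed and resampled to a fixed length, then grouped with an LBG-style codebook (binary splitting plus k-means refinement). The contour length (15 points) and class count (256) follow Section 3.3; the function names and the split perturbation are our own illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def normalize_contour(f0, n_points=15):
    """Mean-removed pitch contour resampled to a fixed length (Sec. 3.3)."""
    f0 = np.asarray(f0, dtype=float)
    grid = np.linspace(0.0, 1.0, n_points)
    contour = np.interp(grid, np.linspace(0.0, 1.0, len(f0)), f0)
    return contour - contour.mean()

def lbg_codebook(data, n_classes=256, n_iter=10, eps=1e-3):
    """LBG: split every centroid in two, then refine with k-means steps.
    data: (n_patterns, n_points) array of normalized contours."""
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < n_classes:
        codebook = np.vstack([codebook + eps, codebook - eps])  # binary split
        for _ in range(n_iter):
            labels = cdist(data, codebook).argmin(axis=1)       # assign
            for k in range(len(codebook)):
                if np.any(labels == k):
                    codebook[k] = data[labels == k].mean(axis=0)  # update
    return codebook
```

A pattern whose clustering error falls under the threshold would then be stored as just a class index, a pitch mean and a duration, as Section 3.3 describes.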

3. Implementation

3.1. Synthesis system and speech database

The purpose of our work is to help realize a Mandarin synthesis system with a total footprint under 1 MB for embedded applications. The synthesis system is a concatenative one based on a decision tree, constructed by tailoring a large corpus-based system using the improved method discussed in [6]. The tailored speech database is composed of 7737 syllables with accurate labeling and segmentation. The total size of the 8 kHz sampled waveforms is about 28 MB, and the output coded by G.723.1 at 5.3 kbps is about 1.16 MB, which still exceeds our footprint restriction.

3.2. Baseline scheme

The baseline compression scheme is a regular parametric coding process, as shown in Fig. 1, which consists of three steps:

[Figure 1: Illustration of the baseline compression scheme. Block diagram: each voiced/unvoiced segment passes through STRAIGHT analysis, producing a pitch contour (voiced only) and a 2-D spectral envelope; temporal sampling yields the temporally sampled spectral envelope, which the all-pole model converts into gains, pole coefficients and finally LSFs.]

1) Parameter extraction. Each unvoiced/voiced segment is decomposed into a pitch contour (for voiced segments only) and a two-dimensional spectral envelope by STRAIGHT analysis.

2) Temporal sampling. For each unvoiced segment, the spectral envelope is sampled at a fixed 10 ms interval to reduce storage. For each voiced segment, we first sample the spectral envelope at each pitch-synchronous position along the temporal axis; then an adaptive search algorithm chooses key positions among these pitch-synchronous positions (a plausible sketch of one such selection strategy is given below, after Fig. 2). Fig. 2 shows the spectrogram and the temporal sampling result for the voiced segment 'mang' in Chinese: the segment lasts 302 ms, and 21 key positions are selected from 61 pitch-synchronous positions. Fig. 2 demonstrates that the distribution of these key positions tracks the degree of movement of the spectral structure reasonably well.

[Figure 2: An example of temporal sampling for the voiced segment 'mang' in Chinese (the vertical lines represent the extracted key positions).]
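The paper does not spell out its adaptive search, so the following is only a plausible greedy sketch of the stated idea, under the assumption that a pitch-synchronous frame is kept whenever the envelope has drifted sufficiently since the last kept frame; the distance measure and threshold are our assumptions.

```python
import numpy as np

def select_key_positions(envelopes, threshold):
    """envelopes: (n_frames, n_bins) log-spectral envelopes at the
    pitch-synchronous positions; returns indices of the key positions."""
    keys = [0]                                   # always keep the first frame
    for t in range(1, len(envelopes)):
        drift = np.mean((envelopes[t] - envelopes[keys[-1]]) ** 2)
        if drift > threshold:                    # envelope has moved enough
            keys.append(t)
    if keys[-1] != len(envelopes) - 1:
        keys.append(len(envelopes) - 1)          # always keep the last frame
    return np.asarray(keys)
```

With a suitable threshold, such a scheme samples transitions densely and steady vowels sparsely, which qualitatively matches the behavior illustrated for 'mang' (21 key positions retained out of 61).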

3) All-pole modeling. The spectral envelope at each key position is converted into a gain coefficient and a group of poles by an all-pole model [7]. The model order is set to 10 for our 8 kHz sampled speech, and the poles are then converted into LSFs for better performance in quantization and distance measurement.
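The pole-to-LSF conversion in step 3 can be illustrated compactly. A minimal sketch, assuming order-10 LPC coefficients a = [1, a1, ..., a10] have already been fitted to the sampled envelope (the fitting itself is omitted): the LSFs are the unit-circle root angles of the standard symmetric and antisymmetric polynomials built from A(z).

```python
import numpy as np

def lpc_to_lsf(a):
    """LPC coefficients [1, a1, ..., am] -> m line spectral frequencies
    in radians, sorted ascending (m even, as with the order-10 model)."""
    ext = np.concatenate([np.asarray(a, dtype=float), [0.0]])
    p = ext + ext[::-1]   # symmetric part   P(z) = A(z) + z^-(m+1) A(1/z)
    q = ext - ext[::-1]   # antisymmetric    Q(z) = A(z) - z^-(m+1) A(1/z)
    lsf = []
    for poly in (p, q):
        angles = np.angle(np.roots(poly))
        # keep one root per conjugate pair; drop the trivial roots at 0 and pi
        lsf.extend(w for w in angles if 1e-6 < w < np.pi - 1e-6)
    return np.sort(lsf)
```

The resulting ten LSFs per key position are what the baseline scheme then vector-quantizes with 24 bits [8].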

After these steps, each segment is converted into a prosodic pattern and a spectral pattern (for a voiced segment) or a spectral pattern only (for an unvoiced segment). The prosodic pattern corresponds to a pitch contour; the spectral pattern includes the gains and LSFs at every extracted key position. In total, 7737 prosodic patterns, 5016 unvoiced spectral patterns and 7737 voiced spectral patterns are extracted from our database. The baseline compression scheme is completed by, for each segment, scalar quantization of the pitch values at 10 ms intervals with 9 bits, logarithmic quantization of the gains with 8 bits, and vector quantization of the LSFs at each key position with 24 bits [8].

3.3. Clustering of prosodic patterns

As mentioned in Section 2.2, people are not sensitive to small distortions of prosodic features in synthesized speech, which means it is not necessary to keep all prosodic patterns in the database accurately; the redundancy is reduced by clustering. First, the pitch contour of each prosodic pattern is normalized by removing its mean and interpolating it to 15 points. Then the 7737 normalized contours are clustered into 256 classes by a general LBG algorithm. A threshold THRE_P is used here: for a pattern whose clustering error is under THRE_P, only a class number, the pitch mean and the duration are recorded to represent the prosodic pattern; otherwise, the accurate pitch contour sampled at 10 ms intervals is saved.

3.4. Clustering of spectral patterns

In order to cluster the spectral patterns sharing the same phoneme label, we define the segmental distance between two patterns as Eq. (1), with weights omitted:

\[
\mathrm{SegDist}(A,B)=\begin{cases}
\mathrm{LSFDist}(A,B)+\mathrm{ConDist}(A,B)+\mathrm{PitchDist}(A,B) & \text{for voiced segments}\\[2pt]
\mathrm{LSFDist}(A,B)+\mathrm{ConDist}(A,B) & \text{for unvoiced segments}
\end{cases}\tag{1}
\]

where LSFDist(A,B) represents the distance between the LSFs of the two spectral patterns at their pitch-synchronous positions. A DTW algorithm is executed here, with the search path restricted by certain constraints to avoid mapping one frame onto several frames. For two frames of LSFs, the distance is calculated as Eq. (2):

\[
\mathrm{LSFDist}(a,b)=\frac{1}{N}\sum_{i=1}^{N} w_i\,\bigl(lsf_a^{\,i}-lsf_b^{\,i}\bigr)^2\tag{2}
\]

Here N is the order of the LSFs, and w_i is a group of weights defined as:

\[
w_i=\frac{1}{\min\bigl(lsf^{\,i}-lsf^{\,i-1},\; lsf^{\,i+1}-lsf^{\,i}\bigr)}\tag{3}
\]

ConDist(A,B) represents the contextual difference between the segments from which the spectral patterns are extracted. It is calculated from the following features, which are believed to influence the segmental representation of a speech unit significantly:

- The type of the previous and next phoneme.
- The category of the previous and next boundary of the corresponding syllable (syllable, word, phrase or sentence).

Furthermore, PitchDist, the distance between the pitch contours of the voiced segments, is introduced to avoid improper combinations of clustered prosodic patterns and spectral patterns for voiced segments during decompression.

Because of the actual diversity of spectral structure and the difficulty of distance measurement, the spectral patterns can hardly be compressed at as large a ratio as the prosodic patterns; the maximum clustering ratio in our experiments is about 3:1 for voiced segments and 6:1 for unvoiced segments. For the spectral patterns with the same phoneme label, an LBG algorithm with hierarchical initialization [9] is adopted for its better performance on small-ratio clustering. The target class number is increased step by step to ensure that the variance caused by clustering stays under a given threshold THRE_S, which differs between unvoiced and voiced segments. By modifying these two thresholds, the compression ratio of the spectral patterns can be controlled.
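Eqs. (2) and (3) translate directly into code. In the sketch below, the frame distance follows the reconstructed equations, with the boundary values lsf^0 = 0 and lsf^(N+1) = pi assumed for the weights; the paper's DTW path constraint is not specified, so a plain symmetric-step DTW stands in for it.

```python
import numpy as np

def frame_dist(lsf_a, lsf_b):
    """Weighted LSF distance between two frames, Eqs. (2)-(3)."""
    n = len(lsf_a)
    pad = np.concatenate([[0.0], lsf_a, [np.pi]])   # assumed boundary values
    w = 1.0 / np.minimum(pad[1:-1] - pad[:-2],      # lsf^i     - lsf^(i-1)
                         pad[2:] - pad[1:-1])       # lsf^(i+1) - lsf^i
    return float(np.sum(w * (lsf_a - lsf_b) ** 2) / n)

def lsf_dist(A, B):
    """DTW-accumulated distance between two LSF sequences (frames x N)."""
    la, lb = len(A), len(B)
    D = np.full((la + 1, lb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, la + 1):
        for j in range(1, lb + 1):
            D[i, j] = frame_dist(A[i - 1], B[j - 1]) + min(
                D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[la, lb] / max(la, lb)
```

The ConDist and PitchDist terms of Eq. (1) would be added on top of this frame-level measure, with the weights the paper omits.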

4. Evaluation

4.1. The size of the compressed output

By modifying the thresholds for the clustering of prosodic patterns and spectral patterns, five compressed databases are obtained, with sizes varying from 875 KB down to 250 KB. The settings are listed in Table 1; the sizes and bit rates are approximate.

Table 1: Size comparison under different clustering settings

OP   THRE_P (Hz)   SP_US   SP_VS   Size (KB)   Bit rate (kbps)
A    0             5016    7737    875         4.0
B    10            2942    5425    600         2.7
C    10            2152    4374    500         2.3
D    10            1355    3387    400         1.8
E    INF           810     2517    250         1.2

OP_A to OP_E represent the compressed outputs under the five clustering settings. Because the actual value of THRE_S is not very meaningful by itself, we replace it in the table with the number of clustered spectral patterns for unvoiced and voiced segments, labeled SP_US and SP_VS. OP_A is our baseline compression scheme, and OP_B is the database we use to construct our embedded system with a total footprint under 1 MB. The symbol INF for the THRE_P of OP_E means that all prosodic patterns of OP_E are stored in clustered form.

4.2. Performance evaluation

In order to evaluate the quality of the five compressed databases, a subjective listening test is conducted. Twenty randomly selected sentences are synthesized with each of the five compressed databases through the same concatenative synthesis system described in Section 3.1. In addition, speech coded by ITU-T recommendation G.723.1 at 5.3 kbps [10] is added for comparison. Ten listeners are asked to give each sentence two scores, on naturalness and on speech quality, on a scale of 1 (very bad), 2 (bad), 3 (acceptable), 4 (good) and 5 (excellent). The average scores are shown in Fig. 3.

[Figure 3: Evaluation of the naturalness and speech quality of speech synthesized with the compressed databases.]

4.3. Discussion

From the above evaluations we can see that our Mandarin concatenative synthesis system, with its speech database compressed by the proposed method at about 2.7 kbps (OP_B), still performs slightly better than with G.723.1 at 5.3 kbps on both the naturalness (3.55 vs. 3.48) and the speech quality (3.36 vs. 3.26) of synthesized sentences. With OP_B, we construct our embedded system under a 1 MB total footprint. Moreover, from OP_A to OP_E the footprint shrinks from 875 KB to 250 KB, while the scores for naturalness and speech quality drop by only about 0.3. Even for OP_E, the average scores on naturalness (3.33) and speech quality (3.10) remain above 3, meaning the output is still acceptable to common listeners. This demonstrates that the proposed method can control the compression ratio of the speech database effectively and gives our synthesis system the ability to meet varying, practical footprint requirements.

5. Conclusion

In this paper, a method for the compression of speech databases is proposed in order to extend the application of TTS systems to embedded devices. We first split the syllables in our Mandarin speech database into unvoiced segments and voiced segments. Then, by introducing STRAIGHT into the parameter separation procedure, the speech segments are decomposed into prosodic patterns and spectral patterns, which are compressed separately to eliminate the acoustic redundancy within the original database. Furthermore, contextual information is integrated with the LSF distance to give a reasonable measurement of the distance between spectral patterns and to pick out the most representative ones during clustering. By modifying the thresholds for pattern clustering, a flexible compression ratio can be obtained. Through the above processing, a Mandarin speech database of originally 28 MB is compressed at several compression ratios. The performance does not degrade seriously as the output size varies from 875 KB to 250 KB. The output of about 600 KB is adopted to realize our embedded Mandarin synthesis system under a total 1 MB footprint, which performs comparably to the database compressed by G.723.1 at 5.3 kbps with a 1.16 MB output.

6. References

[1] A. B. Kain and J. P. H. van Santen, "A Speech Model of Acoustic Inventories Based on Asynchronous Interpolation", Proc. Eurospeech, pp. 329-332, 2003.
[2] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds", Speech Communication, vol. 27, pp. 187-207, 1999.
[3] Renhua Wang, Yu Hu, Wei Li, and Zhenhua Ling, "Corpus-based Chinese Synthesis System Using Decision Tree", Proc. NCMMSC6, pp. 183-187, 2001.
[4] Zhenhua Ling, Yu Hu, Zhiwei Shuang, and Renhua Wang, "Decision Tree Based Unit Pre-Selection in Mandarin Chinese Synthesis", Proc. ISCSLP, pp. 277-280, 2002.
[5] H. Kawahara, H. Katayose, A. de Cheveigné, and R. D. Patterson, "Fixed Point Analysis of Frequency to Instantaneous Frequency Mapping for Accurate Estimation of F0 and Periodicity", Proc. Eurospeech, pp. 2781-2784, 1999.
[6] Zhiwei Shuang, Yu Hu, Zhenhua Ling, and Renhua Wang, "A Miniature Chinese TTS System Based on Tailored Corpus", Proc. ICSLP, pp. 2389-2392, 2002.
[7] R. J. McAulay and T. F. Quatieri, "Sinusoidal coding", in Speech Coding and Synthesis, Elsevier, Amsterdam, 1995.
[8] K. Paliwal and B. Atal, "Efficient vector quantization of LPC parameters at 24 bits/frame", IEEE Trans. Speech and Audio Processing, pp. 3-14, 1993.
[9] D. Keim and A. Hinneburg, "Clustering techniques for large data sets: From the past to the future", Tutorial Notes, ACM SIGKDD, pp. 141-181, 1999.
[10] ITU-T Recommendation G.723.1, "Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s", March 1996.
