Contextual Maximum Entropy Model for Edit Disfluency Detection of Spontaneous Speech

Jui-Feng Yeh+, Chung-Hsien Wu and Wei-Yen Wu

Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, Ta-Hsueh Road, Tainan
+ Department of Computer Science and Information Engineering, Far East University, No. 49, Chung-Hua Rd., His-Shih, Tainan County 744
[email protected] and {chwu, wywu}@csie.ncku.edu.tw

Abstract. This study describes an approach to edit disfluency detection based on maximum entropy (ME) using contextual features for rich transcription of spontaneous speech. The contextual features comprise word-level, chunk-level and sentence-level features for edit disfluency modeling. To alleviate data sparsity, word-level features are determined according to the taxonomy of the primary features of the words defined in HowNet. Chunk-level features are extracted based on mutual information of the words. Sentence-level features are identified according to verbs and their corresponding features. The Improved Iterative Scaling (IIS) algorithm is employed to estimate the optimal weights in the maximum entropy models. Evaluations are conducted on edit disfluency detection and interruption point detection. Experimental results show that the proposed method outperforms the DF-gram approach.

Keywords: Disfluency, maximum entropy, contextual feature, spontaneous speech

1 Introduction

In the last decade, speech recognition research has improved significantly in practice. Current speech recognition systems simply output a stream of words for read speech. However, the variability of spontaneous speech acutely degrades the performance of speech recognition and spoken language understanding, and interactive spoken dialog systems therefore face new challenges for speech recognition. One of the most critical problems in spoken dialog systems is the prevalence of speech disfluencies, such as hesitations, false starts, self-repairs, etc. Edit disfluency uttered by the speakers severely hinders spoken language understanding and should be detected and corrected for better understanding of the utterance's meaning [1]. Edit disfluencies involve syntactically relevant content that is repeated, revised, or abandoned, with a structural pattern composed of a deletable region, an interruption point,

editing term (optional) and correction part. The deletable region is defined as the portion of the utterance that is corrected or abandoned. The interruption point is the position at which the speaker breaks off the original utterance and fluent speech becomes disfluent. The editing term is composed of filled pauses, discourse markers, and explicit editing terms. Edit disfluencies are categorized as simple and complex edit disfluencies. Simple edit disfluencies are further divided into three categories, namely repetitions, revisions or repairs, and restarts. A complex edit disfluency is one in which the corrected portion of one edit disfluency contains another disfluency, and is composed of several simple edit disfluencies in the same sentence or utterance. Take the fluent sentence "I want to go to Taipei tomorrow." as an example; the definition and an example corresponding to each category of edit disfluency are illustrated as follows:
(1) Repetition: the abandoned words are repeated in the corrected portion of the utterance: (I want to go to Taipei tomorrow * tomorrow.)
(2) Revision or repair: although similar to a repetition, the corrected portion that replaces the abandoned constituent modifies its meaning rather than repeats it: (I want to go to Taipei today * tomorrow.)
(3) Restart or false start: the speaker abandons an utterance and neither corrects it nor repeats it, partially or wholly, but instead restructures the utterance and starts over: (I * I want to go to Taipei tomorrow.)
Here the dashed line "---" represents the corrected portion and the interruption point (IP) is marked with "*".
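The structural pattern described above (deletable region, interruption point, optional editing term, correction) can be sketched as a small data structure. This is an illustrative sketch, not code from the paper; the class and function names are invented:

```python
from dataclasses import dataclass

@dataclass
class EditDisfluency:
    """Structural pattern of an edit disfluency: the deletable region
    (with the interruption point implicitly at its end), an optional
    editing term, and the correction part."""
    deletable: list       # portion to be removed
    editing_term: list    # optional fillers / discourse markers
    correction: list      # fluent replacement

def correct(prefix, d, suffix):
    """Recover the fluent utterance by dropping the deletable region
    and editing term, keeping only the correction."""
    return prefix + d.correction + suffix

# Revision example: "I want to go to Taipei today * tomorrow."
d = EditDisfluency(deletable=["today"], editing_term=[], correction=["tomorrow"])
fluent = correct(["I", "want", "to", "go", "to", "Taipei"], d, ["."])
```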

1.1 Related Works

Much previous research on detecting edit disfluency has been conducted with the aim of improving spoken language understanding. Coquoz found that processing edit disfluency is very useful for enriching speech recognition, especially for spontaneous speech [2]. Several approaches can automatically generate sentence information useful to parsing algorithms [3][4]. To identify and verify a speech act precisely, interference caused by speech repairs should be considered. Accordingly, a reliable model is desired to detect and correct conversational utterances with disfluency [5]. For edit disfluency detection, several cues suggest when an edit disfluency may occur; they can be observed from linguistic features [6], acoustic features [7] or integrated knowledge sources [8]. Shriberg et al. [9] outlined the phonetic

consequences of disfluency to improve disfluency processing methods in speech applications. Savova and Bachenko presented four disfluency detection rules using intonation, segment duration and pause duration [10]. IBM adopted a discriminatively trained full-covariance Gaussian system [11] for rich transcription. Kim et al. utilized a decision tree to detect structural metadata [12]. Furui et al. [13] presented corpus collection, analysis and annotation for conversational speech processing. Charniak and Johnson [14] proposed an architecture for parsing transcribed speech that uses an edit word detector to remove edit words or fillers from the sentence string and then applies a standard statistical parser to the remaining words; a statistical parser with parameters estimated by boosting is employed to detect and correct the disfluency. Lease et al. later presented a novel parser to model the structural information of spoken language with Tree Adjoining Grammars (TAGs) [15-16]. Honal and Schultz used a TAG channel model incorporating more syntactic information to achieve good performance in detecting edit disfluency [17]. The TAG transducer in the channel model is responsible for generating the words in the reparandum and the interregnum of a speech repair from the corresponding repair. Heeman et al. presented a statistical language model that can identify POS tags, discourse markers, speech repairs and intonational phrases [18-19], addressing these problems simultaneously rather than separately. The noisy channel model was also proposed to model edit disfluency [16][20][21]. Snover et al. [22] integrated lexical information and rules for disfluency detection using transformation-based learning. Hain et al. [23] presented techniques in front-end processing, acoustic modeling, and language and pronunciation modeling for automatically transcribing conversational telephone speech. Soltau et al. transcribed telephone speech using LDA [24]. Harper et al. applied parsing approaches to rich transcription [25]. Liu et al. not only detected the boundaries of sentence-like units using conditional random fields [26], but also compared the performance of HMM, maximum entropy [27] and conditional random fields on disfluency detection [28].

1.2 Methods Proposed in this Paper

Berger et al. first applied the maximum entropy (ME) approach to natural language processing and achieved a significant improvement over conventional approaches in machine translation [29]. Huang and Zweig proposed a maximum entropy model for the sentence boundary detection problem [27]. Using the maximum entropy model to estimate conditional distributions provides a principled way to combine a large number of overlapping features from several knowledge sources. In this paper, we propose the use of the maximum entropy approach with contextual features, comprising word, chunk and sentence features, for edit disfluency detection and correction. Given training data labeled with edit disfluency information, maximum entropy is a statistical technique that predicts the probability of a label given the test data, with weights estimated by the Improved Iterative Scaling (IIS) algorithm. Language-related structural factors are taken as the features in the maximum entropy model.

1.3 Organization

The rest of this paper is organized as follows. Section 2 introduces the framework of edit disfluency detection and correction using the maximum entropy model and the language features adopted in this model; weight estimation using the improved iterative scaling algorithm is also described in this section. Section 3 summarizes the experimental results, along with a discussion of those results. In Section 4, we conclude with our findings and directions for future work.

2 Contextual Maximum Entropy Models for Edit Disfluency Correction

In this paper, we regard edit disfluency detection and correction as a word labeling problem. That is to say, we identify the edit disfluency type and the portion category of every word in the utterance and accordingly correct the edit disfluency. Herein, the edit disfluency types are normal, repetition, revision and restart. The portion categories are sentence-like unit, deletable region, editing term and correction part. Since the interruption point (IP) appears at an inter-word position where the speaker breaks off the original utterance, all the inter-word positions in an utterance are regarded as potential IP positions. Therefore, as long as the labels of the words in the utterance are determined, the detected edit disfluency can be corrected. Fig. 1 shows an example of an utterance with a "revision" disfluency. For edit disfluency detection, the maximum entropy model, also called the log-linear Gibbs model, is adopted; it uses contextual features and takes the parametric exponential form:

$$P(PT_{w_i} \mid W, F) = \frac{1}{Z_{\lambda}(W,F)} \exp\left(\sum_{k} \lambda_k f_k(PT_{w_i}, W, F)\right) \tag{1}$$

where $PT_{w_i}$ contains the edit disfluency type and the portion category of word $w_i$; $w_i$ denotes the i-th word in the speech utterance $W$; $F$ is the feature set used in the contextual maximum entropy model; $f_k(\cdot)$ is an indicator function corresponding to the contextual features described in the next section; and $\lambda_k$ denotes the weight of feature $f_k(\cdot)$. $Z_{\lambda}(W,F)$ is a normalizing factor calculated as follows:

$$Z_{\lambda}(W,F) = \sum_{PT_{w_i}} \exp\left(\sum_{k} \lambda_k f_k(PT_{w_i}, W, F)\right) \tag{2}$$

Fig. 1. An example utterance with a "revision" disfluency. The corrected sentence is obtained from the highlighted words with their disfluency types and portion categories, in which the word with the "deletable" portion category is deleted.
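The labeling model of Eqs. (1)-(2) amounts to a softmax over the candidate labels of each word, where a label's score is the weighted sum of its firing indicator features. A minimal sketch; the weights and feature keys below are invented for illustration, not taken from the paper:

```python
import math

def score(weights, active_features):
    """Weighted sum sum_k lambda_k * f_k over the indicator features that fire."""
    return sum(weights.get(f, 0.0) for f in active_features)

def posterior(label_scores):
    """P(PT_wi | W, F) as in Eqs. (1)-(2): a softmax over per-label scores."""
    exps = {y: math.exp(s) for y, s in label_scores.items()}
    z = sum(exps.values())  # normalizer Z_lambda(W, F) of Eq. (2)
    return {y: e / z for y, e in exps.items()}

# Hypothetical weights for one word position (feature keys are invented)
weights = {("repetition", "uni=human"): 1.2, ("normal", "uni=human"): 0.4}
labels = ["normal", "repetition", "revision", "restart"]
scores = {y: score(weights, [(y, "uni=human")]) for y in labels}
p = posterior(scores)
```

The normalization in `posterior` guarantees the label probabilities for each word sum to one, so the most probable (type, portion) label can be read off directly.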

2.1 Contextual Features

Since edit disfluency is a structural pattern, each word cannot be treated independently when detecting its edit disfluency type and portion category. Instead, the features extracted from the contexts around the word $w_i$, called contextual features, should be considered for edit disfluency detection. Concepts derived from the primary features of the words defined in HowNet, together with the co-occurrence of words, are employed to form the contextual features for pattern modeling of edit disfluency. To capture the contextual information of a word, an observation window is adopted. The contextual features defined in this study are categorized into bi-directional n-grams and variable-length structural patterns; these two feature types are described in the following.
(1) Bi-directional n-gram features are extracted from a sequence of words. Considering the words before and after the observed word $w_i$, the right-hand side n-gram and the left-hand side n-gram are obtained. The uni-gram feature is shown in equation (3):

$$f_k(PT_{w_i}, W, F_O) = \begin{cases} 1 & \text{if } Class(w_i) = category_j \\ 0 & \text{otherwise} \end{cases} \tag{3}$$

where $category_j$ denotes the j-th taxonomy defined in HowNet. The right-hand side n-gram and left-hand side n-gram are shown in equations (4) and (5), respectively:

$$f_k(PT_{w_i}, W, F_R) = \begin{cases} 1 & \text{if } Class(w_i) = category_{j_i} \wedge \cdots \wedge Class(w_{i+n-1}) = category_{j_{i+n-1}} \\ 0 & \text{otherwise} \end{cases} \tag{4}$$

$$f_k(PT_{w_i}, W, F_L) = \begin{cases} 1 & \text{if } Class(w_{i-n+1}) = category_{j_{i-n+1}} \wedge \cdots \wedge Class(w_i) = category_{j_i} \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

The contextual feature set employed in the proposed model consists of the uni-gram, right-hand side bi-gram, left-hand side bi-gram, right-hand side tri-gram, left-hand side tri-gram, and higher-order right-hand side and left-hand side n-grams. Editing terms containing fillers, discourse markers and negative words play important roles in edit disfluency detection using contextual features.
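These bi-directional n-gram indicator features can be enumerated mechanically. The sketch below assumes each word has already been mapped to a single HowNet-style category; the feature-name format is invented for illustration:

```python
def contextual_ngram_features(categories, i, max_n=3):
    """Collect bi-directional n-gram indicator feature names around
    position i. `categories` holds one class per word, a simplification
    of the HowNet primary-feature mapping used in the paper."""
    feats = [f"uni:{categories[i]}"]
    for n in range(2, max_n + 1):
        if i + n <= len(categories):            # right-hand side n-gram
            feats.append("R%d:%s" % (n, "+".join(categories[i:i + n])))
        if i - n + 1 >= 0:                      # left-hand side n-gram
            feats.append("L%d:%s" % (n, "+".join(categories[i - n + 1:i + 1])))
    return feats

# Hypothetical category sequence for a four-word utterance
cats = ["pronoun", "time", "act", "place"]
feats = contextual_ngram_features(cats, 1)
```

At position 1 this yields the uni-gram, both bi-grams, and only the right-hand side tri-gram, since the left context is too short for a left tri-gram.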

Fig. 2. Illustration of left-hand side n-gram and right-hand side n-gram contextual features.

(2) Variable-length structural patterns are derived according to the characteristics of edit disfluencies. Since individual words model only local information, structural units such as sentences and phrases can be employed for syntactic pattern modeling. Units of variable length are used to form the syntactic patterns, with sentences and chunks as the building blocks instead of words only. That is to say, we can extend the contextual scope using sentence and chunk n-grams to obtain better resolution of edit disfluency, as shown in Fig. 3.

Fig. 3. Illustration of the syntactic patterns with three kinds of units: word, chunk and sentence.

The sentence-level feature is identified according to the verbs and their corresponding necessary arguments defined in [30]. The chunk-level feature is extracted using the mutual information of the word sequence $c_i c_j$, computed from the co-occurrence and term frequencies of $c_i$ and $c_j$:

$$Chunk(c_i c_j) \equiv I\left( \log_2 \frac{P(c_i c_j)}{P(c_i)\,P(c_j)} \geq \xi \right) \tag{6}$$

where $Chunk(\cdot)$ denotes the function that determines whether the word sequence is a chunk; $c_i$ and $c_j$ can each be a word or a chunk; and $I(\cdot)$ and $\xi$ are the indicator function and the mutual information threshold, respectively.
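Eq. (6) is the standard pointwise mutual information test for collocations. A sketch from corpus counts follows; the counts and the threshold value are illustrative, since the paper does not report a value for ξ:

```python
import math

def is_chunk(count_ij, count_i, count_j, total, threshold=2.0):
    """Decide whether the sequence c_i c_j forms a chunk via pointwise
    mutual information (Eq. 6): log2(P(ci cj) / (P(ci) P(cj))) >= xi.
    Counts come from a corpus; the threshold xi is tuned on held-out data."""
    p_ij = count_ij / total
    p_i = count_i / total
    p_j = count_j / total
    pmi = math.log2(p_ij / (p_i * p_j))
    return pmi >= threshold

# Units that co-occur far more often than chance get merged into a chunk
merged = is_chunk(count_ij=50, count_i=60, count_j=70, total=10000)
```

Because $c_i$ and $c_j$ may themselves be chunks, applying this test repeatedly builds variable-length chunks bottom-up.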

2.2 Parameter Estimation

In maximum entropy modeling, the improved iterative scaling (IIS) algorithm is employed to estimate the parameters $\lambda_k$. The weight vector $\Lambda \equiv \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ is updated iteratively by IIS under the constraint that the expected values of the feature functions match their empirical averages over the training data; equivalently, the conditional log-likelihood of the training data is maximized. The IIS algorithm is illustrated in Fig. 4.

Algorithm: Improved iterative scaling (IIS)
  Initialize $\Lambda^{(0)} = (0, 0, \ldots, 0)^T$
  Do
    Solve $\delta_i^{(t)}$ from $E_{\tilde{p}}(f_i) = \sum_x p(x) \exp\left(\delta_i^{(t)} \sum_i f_i(x)\right) f_i(x)$
    $\Lambda^{(t+1)} = \Lambda^{(t)} + \delta^{(t)}$
  Until convergence

Fig. 4. The improved iterative scaling (IIS) algorithm for estimating the weight vector, where $E_{\tilde{p}}$ is the expectation operator with respect to the empirical distribution and $\delta^{(t)} \equiv \{\Delta\lambda_1^{(t)}, \Delta\lambda_2^{(t)}, \ldots, \Delta\lambda_n^{(t)}\}$ represents the increment of the weight vector $\Lambda^{(t)}$ at the t-th iteration.
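When every (x, y) pair activates the same number M of indicator features, the inner solve of IIS has the closed form δ_i = (1/M) ln(Ẽ[f_i] / E_p[f_i]) (the generalized-iterative-scaling special case; in general δ_i must be found numerically, e.g. by Newton's method). A toy sketch under that assumption, on made-up data rather than the paper's features:

```python
import math

def train_maxent_iis(data, features, n_iter=200):
    """Toy maximum entropy trainer for the special case in which every
    (x, y) pair fires the same number M of features, so the IIS update
    reduces to delta_i = (1/M) * ln(E~[f_i] / E_p[f_i]).
    data: list of (x, y); features: function (x, y) -> list of feature ids."""
    labels = sorted({y for _, y in data})
    M = len(features(*data[0]))
    lam = {}
    # empirical feature expectations E~[f_i]
    emp = {}
    for x, y in data:
        for f in features(x, y):
            emp[f] = emp.get(f, 0.0) + 1.0 / len(data)
    for _ in range(n_iter):
        # model feature expectations E_p[f_i] under the current weights
        model = {}
        for x, _ in data:
            score = {y: math.exp(sum(lam.get(f, 0.0) for f in features(x, y)))
                     for y in labels}
            z = sum(score.values())
            for y in labels:
                for f in features(x, y):
                    model[f] = model.get(f, 0.0) + score[y] / z / len(data)
        for f in emp:  # features never seen in training keep weight 0
            lam[f] = lam.get(f, 0.0) + math.log(emp[f] / model[f]) / M
    return lam

def predict(lam, features, x, labels):
    return max(labels, key=lambda y: sum(lam.get(f, 0.0) for f in features(x, y)))

# Hypothetical two-word "utterances" labeled P/N; each pair fires M = 2 features
feats = lambda x, y: [(w, y) for w in x]
data = [(("a", "b"), "P"), (("a", "c"), "P"), (("d", "e"), "N")]
lam = train_maxent_iis(data, feats)
```

Each sweep moves the model expectation of every feature toward its empirical average, which is exactly the constraint-matching view of maximum entropy training described above.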

3 Experiments

3.1 Data Preparation

The Mandarin Conversational Dialogue Corpus (MCDC) [31], collected from 2000 to 2001 at the Institute of Linguistics of Academia Sinica, Taiwan, and comprising 30 digitized conversational dialogues (numbered 01 to 30) with a total length of 27 hours, is used for edit disfluency detection and correction in this paper. The annotations described in [31] give concise explanations and detailed operational definitions of each tag. As in SimpleMDE, direct repetitions, partial repetitions, overt repairs and abandoned utterances are considered edit disfluencies, and the related information is labeled in MCDC. Besides the subsyllable acoustic models, filler models [32] and discourse markers were defined and trained using the Hidden Markov Model Toolkit (HTK) [33], and the recognized results were considered in language modeling. A speech recognition engine using HTK was built for syllable recognition with eight states per syllable (three states for the Initial part and five states for the Final part of a Mandarin syllable). The input speech was pre-emphasized with a coefficient of 0.97. A frame size of 32 ms (512 samples) with a frame shift of 10.625 ms (170 samples) was used. The MAT Speech Database TCC-300 [34] and MCDC were used to train the parameters of the speech recognizer.

3.2 Experiments on Edit Disfluency Detection

Edit word detection (EWD) detects the words of the input speech that lie in deletable regions. One of the primary metrics for the evaluation of edit disfluency correction is the edit word detection error rate defined in RT'04F, the average number of missed and falsely detected edit words per deletable reference edit word:

$$Error_{EWD} = \frac{n_{M\text{-}EWD} + n_{FA\text{-}EWD}}{n_{EWD}} \tag{7}$$

where $n_{M\text{-}EWD}$ is the number of deletable edit words in the reference transcription that are not covered by the deletable regions of the system-detected edits; $n_{FA\text{-}EWD}$ denotes the number of reference words that are not deletable yet are covered by deletable regions of the system-detected edits; and $n_{EWD}$ represents the number of deletable reference edit words. To assess the performance of the proposed model, the statistical language model for speech disfluencies proposed by Stolcke and Shriberg, called DF-gram [35], was implemented for comparison. That model is based on a generalization of the standard N-gram language model; dynamic programming is used to compute the probability of a word sequence considering possible hidden disfluency events. Table 1 presents the results of the proposed method and DF-gram.

Table 1. Results of the proposed maximum entropy model and DF-gram

                   Missed   False Alarm   Error (ErrorEWD)
Maximum Entropy    0.05     0.20          0.25
DF-gram            0.13     0.16          0.29
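As a worked instance of Eq. (7), the counts below are illustrative values chosen to reproduce the maximum entropy row of Table 1 (5 missed and 20 falsely detected edit words against 100 deletable reference edit words); they are not the actual corpus counts:

```python
def edit_word_error(n_missed, n_false_alarm, n_ref_edit_words):
    """Error_EWD of Eq. (7): missed plus falsely detected edit words,
    normalized by the number of deletable reference edit words."""
    return (n_missed + n_false_alarm) / n_ref_edit_words

err = edit_word_error(n_missed=5, n_false_alarm=20, n_ref_edit_words=100)
```

Note the metric can exceed 1.0 when false alarms outnumber the reference edit words, which is why it is reported as an error rather than an accuracy.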

The missed and false alarm error rates of the proposed maximum entropy approach are 0.05 and 0.20, respectively. The proposed contextual maximum entropy approach outperforms the DF-gram, especially on missed edit words. Two factors account for the comparatively disappointing false alarm results: insertion errors in speech recognition and misclassification of restarts. These results indicate that the proposed model handles repetition and revision disfluencies very well. However, it did not perform as well as expected on "restart" detection, where the improvement was less pronounced than for the other edit disfluency categories.

3.3 Experiments on Contextual Features

Since the feature set plays an important role in the maximum entropy model, this experiment is designed to obtain the optimal size of the observation window. The right-hand side and left-hand side contextual features are selected symmetrically to form the feature set used in the proposed model. According to the number of units within the observation window, we obtain the n-gram features based on words, chunks and sentences. For example, if the number of words within the observation window is one, the feature set contains only the uni-gram feature, and the determination of an edit word depends only on the word itself. If the observation window size is five, the uni-gram, right-hand side bi-gram, right-hand side tri-gram, left-hand side bi-gram and left-hand side tri-gram are included in the feature set. Table 2 shows the results for edit word detection with various observation window sizes.

Table 2. Edit word detection results for various observation window sizes used to form the feature set. ErrorEWD(I) and ErrorEWD(O) represent the error rates for inside and outside tests, respectively.

Observation Window Size    1      3      5      7      9      11
ErrorEWD(I)                0.158  0.143  0.117  0.111  0.114  0.122
ErrorEWD(O)                0.201  0.222  0.209  0.190  0.196  0.197

The feature set with an observation window size of one provides performance comparable to a window size of three. The best result appears at an observation window size of seven, and performance on the edit word detection task gradually declines as the window size increases further; the reason is that the abandoned deletable region usually contains few words. Another finding of this experiment is that using the sentence as a unit significantly improves the resolution between "restart" disfluencies and fluent sentences. For example, the fluent sentence "(I heard that you want to go to Taipei)" is confused with the disfluent sentence "(I go * you go to Taipei)" when the model uses only word-based bi-directional n-gram features. By introducing the sentence-level feature, these two sentences can be represented as "(I heard [S])" and "(I go * [S])". The verb "(heard)" can be followed by a sentence [S], while the verb "(go)" should be followed by a noun. By considering the characteristics of verbs and sentence structural information, we achieve a significant improvement in detecting the "restart" disfluency.

3.4 Results Analysis on Corresponding Edit Disfluency Types

As observed above, the effects differ across edit disfluency types. For comparison, Table 3 shows the detection results for repetition, repair and restart by the proposed maximum entropy model and DF-gram.

Table 3. Edit word error for the different edit disfluency types with the proposed maximum entropy model and DF-gram.

           Repetition   Repair   Restart   ErrorEWD
ME         0.12         0.24     0.29      0.25
DF-gram    0.15         0.28     0.34      0.29

The maximum entropy model outperforms DF-gram on all three kinds of edit disfluency, especially on "restart". In fact, a "restart" is easily confused with two cascaded normal sentences. The performance of "restart" detection is improved significantly by introducing chunk-level and sentence-level features. In addition, the number of verbs within the contextual scope is also helpful for detecting "restart".

4 Conclusion and Future Work

This paper has presented a contextual maximum entropy model to detect and correct the edit disfluencies that appear in spontaneous speech. Contextual features of variable length are introduced to model the contextual patterns consisting of deletable region, interruption point, editing term and correction part. The improved iterative scaling algorithm is used to estimate the weights of the proposed model. According to the experimental results, the proposed method achieves an error rate of 25% in edit word detection. Besides the word-level features, chunk-level and sentence-level features are adopted as basic units to extend the contextual scope, capturing not only local information but also structural information. The results show that the proposed method outperforms the DF-gram. In future work, prosodic features may further benefit interruption point and edit disfluency detection. In addition, since tagging training data is labor-intensive and subject to annotator bias, automatic or semi-automatic annotation tools should be developed to help transcribe dialogs or meeting records.

References

1. Nakatani, C. and Hirschberg, J.: A speech-first model for repair detection and correction. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (1993) 46-53
2. Coquoz, S.: Broadcast News Segmentation Using MDE and STT Information to Improve Speech Recognition. International Computer Science Institute, Tech. Report (2004)
3. Gregory, M., Johnson, M. and Charniak, E.: Sentence-Internal Prosody Does not Help Parsing the Way Punctuation Does. In: Proc. NAACL (2004) 81-88
4. Kahn, J.G., Ostendorf, M. and Chelba, C.: Parsing Conversational Speech Using Enhanced Segmentation. In: Proc. HLT-NAACL (2004) 125-128
5. Wu, C.-H. and Yan, G.-L.: Speech Act Modeling and Verification of Spontaneous Speech With Disfluency in a Spoken Dialogue System. IEEE Transactions on Speech and Audio Processing, Vol. 13, No. 3 (2005) 330-344
6. Yeh, J.-F. and Wu, C.-H.: Edit Disfluency Detection and Correction Using a Cleanup Language Model and an Alignment Model. IEEE Transactions on Audio, Speech, and Language Processing, Vol. 18, Issue 5 (2006) 1574-1583
7. Shriberg, E., Stolcke, A., Hakkani-Tur, D. and Tur, G.: Prosody-based automatic segmentation of speech into sentences and topics. Speech Communication 32(1-2) (2000) 127-154
8. Bear, J., Dowding, J. and Shriberg, E.: Integrating multiple knowledge sources for detection and correction of repairs in human-computer dialog. In: Proc. of ACL (1992) 56-63
9. Shriberg, E., Ferrer, L., Kajarekar, S., Venkataraman, A. and Stolcke, A.: Modeling prosodic feature sequences for speaker recognition. Speech Communication 46 (2005) 455-472
10. Savova, G. and Bachenko, J.: Prosodic features of four types of disfluencies. In: Proc. of DiSS 2003 (2003) 91-94
11. Soltau, H., Kingsbury, B., Mangu, L., Povey, D., Saon, G. and Zweig, G.: The IBM 2004 Conversational Telephony System for Rich Transcription. In: Proc. of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05) (2005) 205-208
12. Kim, J., Schwarm, S.E. and Ostendorf, M.: Detecting structural metadata with decision trees and transformation-based learning. In: Proceedings of HLT/NAACL 2004 (2004) 137-144
13. Furui, S., Nakamura, M., Ichiba, T. and Iwano, K.: Analysis and recognition of spontaneous speech using Corpus of Spontaneous Japanese. Speech Communication 47 (2005) 208-219
14. Charniak, E. and Johnson, M.: Edit detection and parsing for transcribed speech. In: Proc. of NAACL 2001 (2001) 118-126
15. Johnson, M. and Charniak, E.: A TAG-based noisy channel model of speech repairs. In: Proc. of ACL 2004 (2004) 33-39
16. Lease, M., Charniak, E. and Johnson, M.: Parsing and its applications for conversational speech. In: Proc. of ICASSP 2005 (2005)
17. Honal, M. and Schultz, T.: Automatic Disfluency Removal on Recognized Spontaneous Speech - Rapid Adaptation to Speaker Dependent Disfluencies. In: Proceedings of ICASSP '05 (2005) 969-972
18. Heeman, P.A. and Allen, J.: Speech repairs, intonational phrases and discourse markers: Modeling speakers' utterances in spoken dialogue. Computational Linguistics, Vol. 25 (1999) 527-571
19. Heeman, P.A., Loken-Kim, K. and Allen, J.: Combining the detection and correction of speech repairs. In: Proceedings of the 4th International Conference on Spoken Language Processing (ICSLP-96) (1996) 358-361
20. Honal, M. and Schultz, T.: Automatic disfluency removal on recognized spontaneous speech - rapid adaptation to speaker dependent disfluencies. In: Proc. of ICASSP 2005 (2005) 969-972
21. Honal, M. and Schultz, T.: Corrections of disfluencies in spontaneous speech using a noisy-channel approach. In: Proc. of Eurospeech 2003 (2003) 2781-2784
22. Snover, M., Dorr, B. and Schwartz, R.: A lexically-driven algorithm for disfluency detection. In: Proc. of HLT/NAACL 2004 (2004) 157-160
23. Hain, T., Woodland, P.C., Evermann, G., Gales, M.J.F., Liu, X., Moore, G.L., Povey, D. and Wang, L.: Automatic Transcription of Conversational Telephone Speech. IEEE Transactions on Speech and Audio Processing (accepted for publication)
24. Soltau, H., Yu, H., Metze, F., Fugen, C., Qin, J. and Jou, S.-C.: The 2003 ISL rich transcription system for conversational telephony speech. In: Proceedings of Acoustics, Speech, and Signal Processing 2004 (ICASSP) (2004) 17-21
25. Harper, M., Dorr, B.J., Hale, J., Roark, B., Shafran, I., Lease, M., Liu, Y., Snover, M., Yung, L., Krasnyanskaya, A. and Stewart, R.: Final Report on Parsing and Spoken Structural Event Detection. Johns Hopkins Summer Workshop (2005)
26. Liu, Y., Stolcke, A., Shriberg, E. and Harper, M.: Using Conditional Random Fields for Sentence Boundary Detection in Speech. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005) (2005)
27. Huang, J. and Zweig, G.: Maximum Entropy Model for Punctuation Annotation from Speech. In: Proceedings of ICSLP 2002 (2002) 917-920
28. Liu, Y., Shriberg, E., Stolcke, A. and Harper, M.: Comparing HMM, Maximum Entropy, and Conditional Random Fields for Disfluency Detection. In: Proc. of Eurospeech 2005 (2005) 3313-3316
29. Berger, A.L., Della Pietra, S.A. and Della Pietra, V.J.: A maximum entropy approach to natural language processing. Computational Linguistics, Vol. 22 (1996) 39-72
30. Chinese Knowledge Information Processing Group (CKIP): Technical Report 93-05: Chinese Part-of-speech Analysis. Academia Sinica, Taipei (1993)
31. Tseng, S.-C. and Liu, Y.-F.: Annotation of Mandarin Conversational Dialogue Corpus. CKIP Technical Report No. 02-01. Academia Sinica (2002)
32. Wu, C.-H. and Yan, G.-L.: Acoustic Feature Analysis and Discriminative Modeling of Filled Pauses for Spontaneous Speech Recognition. Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology 36 (2004) 87-99
33. Young, S.J., Evermann, G., Hain, T., Kershaw, D., Moore, G.L., Odell, J.J., Ollason, D., Povey, D., Valtchev, V. and Woodland, P.C.: The HTK Book. Cambridge, U.K.: Cambridge Univ. Press (2003)
34. MAT Speech Database - TCC-300 (http://rocling.iis.sinica.edu.tw/ROCLING/MAT/Tcc_300brief.htm)
35. Stolcke, A. and Shriberg, E.: Statistical Language Modeling for Speech Disfluencies. In: Proceedings of ICASSP-96 (1996) 405-408
