Integrating Linguistic and Signal Knowledge in a Morpheme Based Speech Corpus Annotation Tool

Byeongchang Kim, Jin-seok Lee, Jeongwon Cha, Geunbae Lee and Jong-Hyeok Lee
Department of Computer Science & Engineering, Pohang University of Science & Technology, Pohang, 790-784, South Korea

{bckim, likeoxy, himen, gblee, [email protected]}

August 30, 1999

Abstract

As more and more speech systems require high-level linguistic knowledge to accommodate various levels of applications, corpora tagged with high-level linguistic annotations as well as signal-level annotations are highly desirable for the development of today's speech systems. Among the high-level linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech corpora because they provide the basic-level syntactic classes for each morpheme, which are essential for most modern spoken language applications of morphologically complex agglutinative languages such as Korean. Considering these demands, we have developed a single unified speech corpus tool that enables corpus builders to link linguistic annotations with signal-level annotations. It uses a morphological analyzer and a POS tagger as basic morpheme-based linguistic engines, and integrates a syntactic analyzer, a phrase break detector, a grapheme-to-phoneme converter, an automatic phonetic aligner and a statistical language model generator. Each engine automatically annotates its own linguistic and signal knowledge and interacts with the corpus developers to revise and correct the annotations on demand. All the linguistic/phonetic engines were developed and merged with an interactive visualization tool in a client-server network communication model. The corpora constructed using our annotation tool are multi-purpose and applicable to both speech recognition and text-to-speech (TTS) systems. Finally, since the linguistic and signal processing engines and the user-interactive visualization tool are implemented within a client-server model, the system load can be reasonably distributed over several machines.

1 Introduction

As statistical methods have become dominant in speech research communities, large annotated speech corpora have become essential for good performance in various speech systems. In the case of speech recognition systems, large speech corpora tend to yield good performance regardless of whether the corpus is phonetically aligned or not: the parameters of Hidden Markov Models (HMMs) and of bigram, trigram or general n-gram language models can be well estimated from a large corpus. In the case of TTS systems, a speech corpus with high-level linguistic annotations is necessary to predict prosodic elements such as intonation, pause and duration. Furthermore, a thoroughly phonetically aligned speech database, extracted from naturally or carefully spoken speech, makes TTS systems more intelligible.

To build a large speech corpus with its linguistic and signal annotations, corpus builders use annotation tools, which eliminate cumbersome and time-consuming tasks. The annotation tools must help builders construct large, linguistically annotated corpora rapidly and accurately using a set of functions, such as signal processing, linguistic processing, grapheme-to-phoneme conversion, automatic phonetic alignment, and even language model generation, which are unique to each tool.

However, most previous speech annotation tools deal only with signal- and phonetic-level tagging, and have been developed for a single type of application domain. As speech systems increasingly require high-level linguistic knowledge to accommodate various levels of applications, corpora tagged with high-level linguistic annotations, as well as signal-level annotations, are highly desirable for the development of today's speech systems. Accordingly, we propose a speech corpus annotation tool that enables corpus builders to link linguistic annotations with signal-level annotations in the corpora using several linguistic and signal processing engines. Each engine automatically annotates its own linguistic and signal knowledge and interacts with the corpus builder to accommodate revisions and corrections of the annotations on demand. A corpus constructed using this annotation tool will be multi-purpose and applicable to both speech recognition/understanding and text-to-speech (TTS) systems.

Among the high-level linguistic annotations, POS (part-of-speech) tag annotations are indispensable in speech corpora because they provide the basic level of syntactic classes for each morpheme, which are essential for most modern spoken language applications of morphologically complex agglutinative languages such as Korean. In the case of grapheme-to-phoneme conversion, because phonological changes are conditioned not only by phonological environments but also by morphological environments, phonological and morphological knowledge should be integrated in the grapheme-to-phoneme converter. The statistical language model, which can be used in an automatic speech recognition system, should be constructed both at the morpheme level and at the word level with POS tags, graphemes and phonemes to accommodate the characteristics of agglutinative languages.
In the next section, existing speech corpus tools are reviewed and compared with our system.

Section 3 describes the design philosophy of our speech annotation tool (POSCAT: POSTECH Corpus Annotation Tool) and Section 4 describes several signal and high-level linguistic processing engines. The client visualization tool is described in Section 5, and some conclusions are drawn in Section 6.

2 Previous Research

Although many speech corpus annotation tools exist, most previous tools have been developed for a single type of application domain. Table 1 shows existing speech corpus annotation tools and their characteristics. All of them visualize the signal waveform and also show its textual transcription and other information as necessary. Some of them perform signal processing to help corpus builders annotate signal and linguistic knowledge on the speech corpus, but only two systems employ natural language processing such as grapheme-to-phoneme conversion or morphosyntactic analysis (Entropic, 1997)(CMU, 1998). Although most of them support phonetic/word-level/sentence-level segmentation, only the Annotator, the SLAM and the Speech Analyzer automatically segment the speech waveform into phonetic units (Entropic, 1997)(Institute of Phonetics and Dialectology, 1997)(SIL, 1999). The CHILDES supports word-level automatic segmentation, the Archivage and the Segmenter support sentence-level and word-level manual segmentation, and the others support only manual phonetic segmentation (CMU, 1998)(Michailovsky et al., 1998)(ISIP, 1999).

In conclusion, though there have been many useful speech annotation tools within the speech research community, they do not have the functionality needed to annotate both signal-level and high-level linguistic annotations. Furthermore, no tool can either manually or automatically annotate morphemes and their corresponding POS tags, which are indispensable in speech corpora for agglutinative languages such as Korean, Japanese, Finnish and Turkish. The last row of the table compares our POSCAT with all the other systems.

System/Project | Developer | Visualization | Signal processing | Natural language processing | Phonetic segmentation | Segmentation units | File format | Remarks
Annotator | Entropic | Signal wave, text and others | Yes | Dictionary-based grapheme-to-phoneme conversion | Automatic with the Aligner | Phone | ESPS and Sphere | Multilingual
Archivage | LACITO/CNRS | Signal wave and text | No | No | Manual (sentence level) | Sentence | XML | Tree-structured text
CHILDES | CMU | Signal wave and text | No | Morphosyntactic analyzer | Automatic (word level) | Word, sentence and discourse | CHAT | English + 19 other languages
SoundWalker/CSAE | University of California Santa Barbara | Signal name and text | No | No | Manual | Sentence | Their own |
CSLU Toolkit | OGI | Signal wave, text and others | Yes | No | Manual | Phone | Their own |
Segmenter | ISIP | Signal wave | No | No | Manual (word level) | Word | Their own |
SFS | University College London | Signal wave and others | Yes | No | Manual | Phone | Their own |
SLAM | Institute of Phonetics and Dialectology | Signal wave and others | Yes | No | Automatic | Phone | Their own |
Snack | — | Signal wave, text and others | Yes | No | Manual | Phone | Their own | Not integrated
Speech Analyzer | SIL | Signal wave, text and others | Yes | No | Automatic | Phone | Their own | IPA phonetic
Transcriber | DGA | Signal wave and text | No | No | Manual | Phone, word and sentence | XML based |
Praat | Paul Boersma | Signal wave and others | Yes | No | Manual | Phone, word and sentence | Their own |
POSCAT (our system) | POSTECH | Signal wave, text and others | Yes | POS tagging, grapheme-to-phoneme conversion, syntactic analysis, LM generation | Automatic | Phone, word and sentence | SGML based |

Table 1: Existing speech annotation tools and their characteristics

3 POSCAT Design Philosophy

There are many kinds of speech corpora in the speech research community. The aims of the corpora are to support the development of speech recognition systems, to provide prosodic elements for TTS systems, to give phonetic segments to the speech signal synthesizer in TTS systems, and to provide a variety of speakers or languages for speaker or language recognition systems. According to its aims, each corpus has its own characteristics, such as recording environment, number of speakers, narrative/fluent style, overlapping/non-overlapping speech fragments, and so on.

The purpose of the corpora that can be constructed by our speech corpus annotation tool is to support the development of both automatic speech recognition/understanding and TTS systems. For the development of conventional automatic speech recognition systems, a large, phonetically aligned speech corpus is necessary for training HMMs, and a large POS-tagged and error-free text corpus is required to generate a statistical language model. The development of conventional TTS systems requires speech corpora composed of small, phonetically well-aligned speech segments, a large prosodically annotated speech corpus and its textual transcription.

We can construct the corpus for an automatic speech recognition system using the following sequence. First, textual transcriptions and their speech signals are prepared. POS tagging and grapheme-to-phoneme conversion are then performed on the textual transcriptions, sentence by sentence, because grapheme-to-phoneme conversion requires the results of POS tagging. We can now complete the corpus by aligning the phonetic labels with their corresponding speech segments.

The conventional sequence for making speech corpora for TTS systems is as follows. The small, phonetically well-aligned speech segments can be constructed by preparing transcriptions, recording speech and aligning phonetically, without any linguistic processing. The large prosodically annotated speech corpus can be constructed by syntactic analysis and phrase break detection on the corpus that was constructed for an automatic speech recognition system.

As described in the previous paragraphs, most of the tasks required to build annotated speech corpora are tedious and time-consuming. Our annotation tool accelerates this process by helping corpus builders annotate signal and linguistic knowledge on the speech corpus easily, precisely and rapidly. The following are the design parameters of our speech corpus annotation tool.

- The speech corpus annotation tool has to browse the corpus and visualize portions of the corpus in the various ways demanded by tool users.

- The speech corpus annotation tool has to annotate signal and linguistic knowledge on the corpus automatically, even if the automatic annotations are not perfectly precise.

- The automatically annotated linguistic annotations must include POS tags that provide a basic level of syntactic classes for each morpheme in morphologically complex agglutinative languages.

- The speech corpus annotation tool has to provide corpus builders with a facility to revise and correct the automatically annotated corpus.

- The speech corpus annotation tool must not overload a single machine in providing the corpus builders with various signal and linguistic knowledge.

- The file format and the internal structures for annotations must be simple, compact and extensible.

We designed a speech corpus annotation tool to accommodate the above design parameters. The single unified speech corpus tool enables corpus builders to link linguistic annotations with signal-level annotations using a morphological analyzer and a POS tagger as basic morpheme-based linguistic engines, and integrates a syntactic analyzer, phrase break detector, grapheme-to-phoneme converter, automatic phonetic aligner and statistical language model generator. Each engine automatically annotates its own linguistic and signal knowledge, and interacts with the corpus developers to revise and correct the annotations on demand. All the linguistic/phonetic engines were developed and merged with an interactive visualization tool using a client-server communication model to distribute the system load over several machines. The annotation files are formatted using Standard Generalized Markup Language (SGML) markup, which has been adopted by the Linguistic Data Consortium (LDC) as the Universal Transcription Format (UTF) (LDC, 1998). The internal data structure for annotations is a hybrid of a tree and a linked list, which can represent the hierarchical structure of annotations, including parse trees, easily and extensibly.

4 Linguistic and Signal Annotation Servers

There are six server engines for linguistic and signal processing in POSCAT: the morphological analyzer and POS tagger, the syntactic analyzer, the phrase break detector, the grapheme-to-phoneme converter, the automatic phonetic aligner and the statistical language model generator. As shown in the previous section, each engine plays an important role in signal and linguistic annotation. The engines are distributed over several machines and merged with an interactive visualization tool using a client-server communication model, as shown in Figure 1. The protocol by which the servers and the client communicate is as simple as possible: the client sends a request and the servers respond through the TCP/IP layer, without any error control. The time sequence diagram of the protocol is also shown in Figure 1. The following subsections give a brief explanation of each server engine.

Figure 1: Linguistic and signal processing engines and visualization tool using a client-server communication model
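Since the wire format itself is not specified here, the following sketch illustrates one plausible request/response framing over a stream socket: an engine-name line, the payload, and a terminating blank line. The framing and the engine name used below are hypothetical, not the actual POSCAT protocol.

```python
import socket

def send_request(sock, engine, payload):
    """Frame a request as 'ENGINE\\n<payload>\\n\\n' and send it (no error control)."""
    sock.sendall((engine + "\n" + payload + "\n\n").encode("utf-8"))

def recv_response(sock):
    """Read until the terminating blank line, mirroring the simple protocol."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
        if b"".join(chunks).endswith(b"\n\n"):
            break
    return b"".join(chunks).decode("utf-8").rstrip("\n")
```

In this scheme the client would issue one send_request/recv_response pair per sentence per engine; a `socket.socketpair()` makes a convenient loopback harness for trying it out.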


4.1 Morphological Analyzer and Part-of-Speech Tagger

POS tagging is a basic step in grapheme-to-phoneme conversion, since phonological changes depend on morphotactic and phonotactic environments in complex agglutinative languages. Furthermore, it is well known that POS tagging is also a basic step in syntactic analysis and statistical language model generation. The POS tagging system has to handle out-of-vocabulary (OOV) words for accurate grapheme-to-phoneme conversion of an unlimited vocabulary (Bechet and El-Beze, 1997).

Figure 2 shows the hybrid architecture for Korean POS tagging with generalized unknown-morpheme guessing (Cha et al., 1998). There are three major components: the morphological analyzer with an unknown-morpheme handler, the statistical POS tagger, and the rule-based error corrector. The morphological analyzer segments the morphemes from the words in a sentence and reconstructs the original morphemes from the spelling changes of irregular conjugations. It also assigns all possible POS tags to each morpheme by consulting a morpheme dictionary. The unknown-morpheme handler within the morphological analyzer assigns POS tags to morphemes that are not registered in the morpheme dictionary by matching against a morpheme pattern dictionary. The statistical tagger runs the Viterbi algorithm (Forney, 1973) on the morpheme graph to search for the optimal tag sequence for POS disambiguation. To remedy the defects of a statistical tagger, we introduce a post error-correction mechanism. The error corrector is a rule-based transformer (Brill, 1992)(Brill, 1994), and it corrects mis-tagged morphemes by considering lexical patterns and the necessary contextual information.
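As an illustration of the statistical disambiguation step, the sketch below runs a Viterbi pass over a morpheme sequence whose candidate tags come from a toy analyzer. The romanized morphemes, the tagset (N = noun, J = josa, V = verb, E = ending) and all probability tables are hypothetical stand-ins for the system's lexical/transition probability tables.

```python
import math

def viterbi(morphs, candidates, trans, emit):
    """Find the most probable tag sequence for a morpheme sequence.

    candidates[m] lists the tags the analyzer assigned to morpheme m;
    trans and emit map (prev_tag, tag) / (morph, tag) to log probabilities.
    """
    NEG = float("-inf")
    delta = {"BOS": (0.0, [])}            # tag -> (best log prob, best path)
    for m in morphs:
        new_delta = {}
        for tag in candidates[m]:
            score, path = max(
                (lp + trans.get((prev, tag), NEG), p)
                for prev, (lp, p) in delta.items()
            )
            new_delta[tag] = (score + emit.get((m, tag), NEG), path + [tag])
        delta = new_delta
    return max((lp + trans.get((tag, "EOS"), NEG), p)
               for tag, (lp, p) in delta.items())[1]

# Toy tables (converted to log probabilities); all values are made up.
t = {k: math.log(v) for k, v in {
    ("BOS", "N"): 0.9, ("N", "J"): 0.7, ("N", "E"): 0.1,
    ("J", "V"): 0.8, ("J", "N"): 0.2, ("E", "V"): 0.3, ("E", "N"): 0.1,
    ("V", "E"): 0.9, ("E", "EOS"): 0.9}.items()}
e = {k: math.log(v) for k, v in {
    ("haggyo", "N"): 0.9, ("e", "J"): 0.6, ("e", "E"): 0.3,
    ("ga", "V"): 0.7, ("ga", "N"): 0.2, ("da", "E"): 0.9}.items()}
cands = {"haggyo": ["N"], "e": ["J", "E"], "ga": ["V", "N"], "da": ["E"]}

print(viterbi(["haggyo", "e", "ga", "da"], cands, t, e))  # ['N', 'J', 'V', 'E']
```

The rule-based corrector would then rewrite any tags this pass gets wrong, using learned lexical/contextual triggers.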


Figure 2: Hybrid architecture for Korean POS tagging

4.2 Grapheme-to-Phoneme Converter

Grapheme-to-phoneme conversion can be described as a function mapping from the spelling of words to their phonetic symbols. Because the pronunciation of a word cannot always be determined from its spelling alone, the function needs linguistic knowledge, especially morphological and phonological, but often also semantic knowledge. The phoneme sequence of a sentence is the fundamental representation of the sentence itself, together with its textual transcription. From the phoneme sequence, we can obtain the phonetic time alignment between the phoneme sequence and its speech waveform.

Figure 3 shows a grapheme-to-phoneme conversion method using a hybrid dictionary-based and rule-based method with a phonetic pattern dictionary and CCV (consonant-consonant-vowel) LTS (letter-to-sound) rules (Kim et al., 1998). In order to handle numbers, abbreviations and acronyms, each morpheme that contains non-Korean symbols is normalized by replacing them with Korean graphemes. In the morpheme phoneticizer, specially pronounced morphemes are first converted into phoneme sequences by consulting the phonetic exception dictionary. Other, regular morphemes are processed in two phases.

Figure 3: Architecture of the grapheme-to-phoneme converter

First, the graphemes at morpheme boundaries are converted into phonemes by consulting the phonetic pattern dictionary. Second, the graphemes within morphemes are converted into phonemes according to the CCV LTS rules. To model the connectivity of phonemes across morpheme boundaries, a morphophonemic connectivity table encodes the phonological changes between morphemes together with their POS tags. The output of the grapheme-to-phoneme converter is a phonetic transcription corresponding to the input sentence.
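The hybrid control flow can be sketched as below: exception dictionary lookup first, letter-to-sound rules within each morpheme, and phonological changes at morpheme boundaries. Every dictionary entry, rule and the romanization are hypothetical toy stand-ins, not the converter's actual data.

```python
import re

# Toy stand-ins for the real knowledge sources (all entries hypothetical):
EXCEPTION_DICT = {"hanja": ["h", "a", "n", "j", "a"]}   # specially pronounced morphemes
BOUNDARY_PATTERNS = {("k", "n"): ("ng", "n")}           # phonetic pattern dictionary
LTS_RULES = [(re.compile("ch"), "C")]                   # CCV-style letter-to-sound rules

def morpheme_to_phonemes(m):
    """Exception dictionary first, then letter-to-sound rules."""
    if m in EXCEPTION_DICT:
        return list(EXCEPTION_DICT[m])
    for pattern, phoneme in LTS_RULES:
        m = pattern.sub(phoneme, m)
    return list(m)

def sentence_to_phonemes(morphemes):
    """Apply boundary phonological changes between adjacent morphemes."""
    phones = [morpheme_to_phonemes(m) for m in morphemes]
    for i in range(len(phones) - 1):
        pair = (phones[i][-1], phones[i + 1][0])
        if pair in BOUNDARY_PATTERNS:
            phones[i][-1], phones[i + 1][0] = BOUNDARY_PATTERNS[pair]
    return [p for seq in phones for p in seq]

# Nasalization across a boundary, hak + nyeon -> a "hangnyeon"-like sequence:
print(sentence_to_phonemes(["hak", "nyeon"]))  # ['h', 'a', 'ng', 'n', 'y', 'e', 'o', 'n']
```

In the real converter the boundary step is conditioned on POS tags via the morphophonemic connectivity table; here a bare phoneme-pair key stands in for that context.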

4.3 Automatic Phonetic Aligner

To time-align a phonetic description to its speech waveform, the aligner uses phone-based Hidden Markov Models (HMMs) and the Viterbi search algorithm, without any additional complex machinery (Figure 4). The aligner dynamically strings together the phonetic HMMs in the sequence determined by the phonetic transcription, and finds the optimal time alignment between the phonetic transcription and the waveform using the Viterbi search algorithm.

Figure 4: Architecture of the automatic phonetic aligner

Though there may be some errors in this time alignment, they are consistent, so a speech corpus builder can revise and correct erroneous alignments easily and consistently. The HMM used in the aligner is a continuous HMM that defines its output distributions as probability densities over continuous observation spaces. The phone-based HMMs were bootstrapped using a small, phonetically aligned speech corpus, and trained using a large, phonetically unaligned speech corpus.
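Forced alignment reduces to a left-to-right dynamic program: each frame either stays in the current phone or advances to the next one. The sketch below replaces the HMM observation likelihoods with a toy per-frame phone score table (a hypothetical stand-in), but the Viterbi recursion has the same shape.

```python
def align(frame_scores, phones):
    """Assign each frame to a phone with a left-to-right Viterbi pass.

    frame_scores[t][p] is a (toy) log-likelihood of phone p at frame t.
    Returns the phone index chosen for each frame.
    """
    T, N = len(frame_scores), len(phones)
    NEG = float("-inf")
    best = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    best[0][0] = frame_scores[0][phones[0]]     # alignment starts in the first phone
    for t in range(1, T):
        for j in range(N):
            stay = best[t - 1][j]               # remain in phone j
            move = best[t - 1][j - 1] if j > 0 else NEG   # advance from phone j-1
            best[t][j] = max(stay, move) + frame_scores[t][phones[j]]
            back[t][j] = j if stay >= move else j - 1
    path = [N - 1]                               # alignment must end in the last phone
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

scores = [{"s": 0, "a": -5}, {"s": 0, "a": -5}, {"s": -5, "a": 0}, {"s": -5, "a": 0}]
print(align(scores, ["s", "a"]))  # [0, 0, 1, 1]: frames 0-1 -> "s", frames 2-3 -> "a"
```

Phone boundaries (and hence the time alignment) fall wherever the returned index changes.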

4.4 Phrase Break Detector

Some researchers have observed that syntactic structure and prosodic structure are related, and have determined intonation and duration patterns using syntactic information (Allen and Hunnicut, 1987). Others have attempted to predict prosodic boundaries from robust features of the input text rather than from syntactic clause boundaries (Taylor and Black, 1998). The major difference between the two approaches is the amount of syntactic information used to predict prosodic boundaries. Our phrase break detector uses the POS tag sequence as robust features and the phrase structure as syntactic features, to obtain both the robustness and the expressiveness of the two kinds of features.

Figure 5: Architecture of the phrase break detector

The phrase break detector consists of a probabilistic phrase break detector and a transformational rule-based post error corrector (see Figure 5) (Kim and Lee, 1999). The probabilistic method alone usually suffers from performance degradation due to inherent data sparseness problems, so we adopted transformational rule-based error correction to overcome these training data limitations. The probabilistic phrase break detector segments the POS sequences into phrases according to word trigram probabilities. However, the probabilistic detection method covers only a limited range of contextual information; moreover, it does not look at POS tags selectively or consider the relative distance to other phrase breaks. Therefore, the initially phrase-break-tagged morpheme sequence is corrected with error-correcting rules. The rules are learned by comparing a correctly phrase-break-tagged corpus with the output of the probabilistic phrase break detector (Brill, 1994). The rule-based post error correction provides more accurate results even when the initial phrase break detector performs poorly.


4.5 Syntactic Analyzer

Syntactic information is used not only in the prediction of prosodic boundaries but also in language modeling for automatic speech recognition systems. A current trend is the use of structure-based language models to define conditioning events and to capture long-distance bigrams for automatic speech recognition (Cole et al., 1996).

Korean is a non-configurational, postpositional and agglutinative language. Postpositions, such as noun-endings, verb-endings and prefinal verb-endings, are morphemes that determine the functional role of NPs (noun phrases) and VPs (verb phrases) in a sentence, and can also transform a VP into an NP or AP (adjective phrase). Since a sequence of prefinal verb-endings, auxiliary verbs and verb-endings can generate hundreds of different usages of the same verb, morpheme-based grammar modeling is necessary for Korean language processing.

Korean Combinatory Categorial Grammar (K-CCG) is an extended combinatory categorial formalism that can capture the syntax and interpretation of the relatively free word order in Korean. The approach we have developed combines the advantages of CCG's ability to type-raise and compose with the ability to handle variable categories and to model unordered arguments for the treatment of relatively free word order (Hoffman, 1995). In K-CCG, type-raising with case markers is adopted for converting nouns into functors over a verb, and a composition rule is used for coordination modeling. Figure 6 shows the syntactic analyzer, which was implemented with the K-CCG formalism (Cha et al., 1999).


Figure 6: Architecture of the syntactic analyzer

4.6 Korean Statistical Language Model Generator

For large-vocabulary speech recognition, the incorporation of knowledge about the language is essential. Current speech recognition systems use statistical language models to reduce the search space and resolve acoustic ambiguity (Cole et al., 1996). The simplest way to construct a language model is just to gather word-level n-grams. However, as mentioned above, Korean is a postpositional and agglutinative language. Since a sequence of prefinal verb-endings, auxiliary verbs and verb-endings can generate hundreds of different words from the same verb, and noun-endings can generate several different words from the same noun, statistical modeling of word sequences makes the language model huge. Furthermore, since Korean is non-configurational, the sequence of words is less informative. Considering these linguistic facts about Korean, we borrowed some language modeling machinery from the Statistical Language Modeling toolkit (SLM toolkit) (Rosenfeld, 1995)(Clarkson and Rosenfeld, 1995), and re-designed and implemented a morpheme-based statistical language model generator, which can model not only sequences of morphemes but also sequences of phonemes and POS tags, as shown in Figure 7. The statistical language model can support the development of an automatic speech recognition system, a POS tagging system, or other linguistic processing systems that require statistical linguistic knowledge.

Figure 7: Architecture of the statistical language model generator for Korean
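A minimal sketch of the morpheme-level n-gram estimation this generator performs (plain maximum-likelihood counts here; the real generator, like the SLM toolkit it borrows from, adds smoothing and backoff, and can count phoneme and POS-tag sequences the same way):

```python
from collections import Counter

def train_trigram(corpus):
    """Return an MLE trigram model P(w3 | w1, w2) over morpheme sequences."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]   # sentence boundary padding
        for i in range(2, len(toks)):
            tri[tuple(toks[i - 2:i + 1])] += 1
            bi[tuple(toks[i - 2:i])] += 1
    def prob(w1, w2, w3):
        return tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return prob

# Hypothetical morpheme-segmented sentences:
p = train_trigram([["na", "neun", "ga", "da"], ["na", "neun", "o", "da"]])
print(p("na", "neun", "ga"))  # 0.5: "ga" follows "na neun" in one of two sentences
```

Training at the morpheme level keeps the vocabulary tractable despite the hundreds of inflected surface words each Korean stem can produce.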

5 Client Visualization Tool

The client visualization tool reads data from files, constructs internal data structures from them and displays them in a useful way. It also consults all the linguistic servers, located on different machines, for the linguistic annotations of the given speech and text, and helps the corpus builder annotate, revise and correct signal and linguistic annotations easily and consistently. We developed the client visualization tool with the scripting language Tcl/Tk and C extensions (Figure 8). It utilizes the Snack sound extension, which has primitives for sound visualization (Sjolander, 1999). We now describe the required functions of the client visualization tool, the file format in which the annotations are stored physically, and the data structures in which the annotations are stored logically.


Figure 8: The client visualization tool

5.1 Functions

The visualization tool has to present all the data in visual form on demand, and help users annotate markers. The types of data to be displayed are as follows.

Wave: the signal waveform itself (basic data)
Textual transcription: the text corresponding to the wave (basic data)
Spectrogram: created from the Fast Fourier Transform (FFT) of the wave
Zero crossing rate: the number of times the wave changes sign
Power: the average squared value per sample
POS tag sequence: the sequence of POS tags corresponding to the textual transcription
Phonetic transcription: the phoneme sequence of the wave and textual transcription
Phonetic alignment: the phonetic transcription time-aligned to the wave
Parse tree: the result of syntactic analysis on the textual transcription
Phrase break sequence: the degree of pause between two words in the textual transcription

A wave and its textual transcription are the basic data from external sources. Simple signal analyses such as the FFT, power computation and zero crossing rate (ZCR) are performed by the client visualization tool itself, because these analyses require the entire waveform and produce results as large as, or larger than, the waveform; such volumes of data are too heavy to communicate over the network. The other, linguistic data are produced by the linguistic server engines and revised by the corpus builders, so the client tool only contains functions with which the corpus builders trigger the server engines to produce the linguistic data, and revise them.
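The two client-side measures are simple enough to state directly. A plain-Python sketch over a list of samples (a real implementation would operate on audio buffers via the Snack primitives or an array library):

```python
def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / max(len(samples) - 1, 1)

def power(samples):
    """Average squared sample value."""
    return sum(s * s for s in samples) / len(samples)

print(zero_crossing_rate([1, -1, 1, -1]))  # 1.0: every pair changes sign
print(power([1, -1, 1, -1]))               # 1.0
```

In practice both are computed over short sliding windows so they can be drawn as curves under the waveform.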

5.2 File Formats for Annotations

There have been as many file formats for annotated speech corpora as there have been speech tools. Though each has its own strong points, the overhead cost of supporting all these formats is not small. We decided to use SGML markup as the file format for our speech corpus annotation tool, which makes it possible to reuse existing knowledge and software and thus maximizes portability. There are also many file formats using SGML markup, of which UTF is a representative one (LDC, 1998). However, because UTF is not appropriate for accommodating linguistic data such as POS tags, phonetic time alignments and syntactic categories, some tags for linguistic annotations and their structure were newly defined. The following is the structure of the tags in our annotation file format.

A corpus element expands to sections carrying a section name; a section expands to sentences carrying the textual sentence; a sentence carries break indexes and syntactic categories and expands to words carrying the textual word; a word expands to morphemes carrying the POS tag and root form; and a morpheme expands to phonemes carrying a start time and an end time.

Separate from the SGML markup file format, there is a binary file format for the statistical language model, which enables fast access to large language model files.

5.3 Internal Data Structures for Annotations

The fundamental structure of a corpus is a tree. The corpus consists of several sections, each section consists of several sentences, and a sentence consists of several phrases, which consist of several words, each consisting of one or more morphemes. Though it is possible to represent them in a graph structure as in (Bird and Liberman, 1999), graphs pose some problems in representing hierarchical information such as parse trees and relations between sentences in the corpus. Consequently, we adopted tree structures as the fundamental internal annotation structures, and added list structures to link the entities in the same layer.

Figure 9 shows the overall data structures used in our client tool. The corpus node is the root node of the entire structure; all the entities are structured hierarchically, and all the entities in the same layer are linked sequentially.

Figure 9: Internal data structure for corpus annotations

Because the parse tree is independent of the phrases located between the sentence node and the word nodes, the tree is stored separately from the other annotation structure. The sentence node has a link to the root node of the parse tree corresponding to the sentence, and the leaf nodes of the parse tree have links to the corresponding morphemes. Each node, except the nodes in the parse tree, has its own time indexes.

There is a well-known problem with using tree structures to store annotations: insertion or deletion of a layer requires reconstructing the tree structures to maintain consistency. In our annotation structure this is not a problem, because all the layers constituting the tree structures are prepared automatically by the server engines and no layer deletion occurs.

6 Conclusion

We have proposed a unified speech corpus annotation tool that integrates a morphological analyzer, a POS tagger, a syntactic analyzer, a phrase break detector, a grapheme-to-phoneme converter, an automatic phonetic aligner and a statistical language model generator. The tool can therefore produce not only signal-level annotations but also high-level linguistic annotations automatically, and corpus builders can link the high-level linguistic information with the signal-level information and revise or correct the annotations. Moreover, the tool supports POS (part-of-speech) and syntactic tag annotations, which are indispensable in speech corpora because they provide the basic syntactic classes for each morpheme, essential for most modern spoken language applications of morphologically complex agglutinative languages. The corpora constructed with our annotation tool are multi-purpose and applicable to both speech recognition and TTS systems: a phonetically aligned speech corpus and a statistical language model are essential for any speech recognition system, while phrase breaks and morphologically/syntactically aligned speech corpora are very useful for prosody and pronunciation generation in any TTS system. Finally, since the linguistic and signal processing engines and the interactive visualization tool are implemented in a client-server model, the system load can be reasonably distributed over several machines.

References

J. Allen and S. Hunnicut. 1987. From Text to Speech: The MITalk System. Cambridge University Press.

F. Bechet and M. El-Beze. 1997. Automatic assignment of part-of-speech to out-of-vocabulary words for text-to-speech processing. In Proceedings of EUROSPEECH '97, pages 983-986.

Steven Bird and Mark Liberman. 1999. A formal framework for linguistic annotations. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania.

E. Brill. 1992. A simple rule-based part-of-speech tagger. In Proceedings of the Conference on Applied Natural Language Processing.

E. Brill. 1994. Some advances in transformation-based part-of-speech tagging. In Proceedings of AAAI-94.

Jeongwon Cha, Geunbae Lee, and Jong-Hyeok Lee. 1998. Generalized unknown morpheme guessing for hybrid POS tagging of Korean. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 85-93.

Jeongwon Cha, WonIl Lee, Geunbae Lee, and Jong-Hyeok Lee. 1999. Morpho-syntactic modeling of Korean with K-CCG. In Proceedings of the International Conference on Computer Processing of Oriental Languages, pages 67-73.

Philip Clarkson and Ronald Rosenfeld. 1995. Statistical language modeling using the CMU-Cambridge toolkit. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH '97), pages 2707-2710.

CMU. 1998. CHILDES. http://childes.psy.cmu.edu.

Ronald A. Cole, Joseph Mariani, Hans Uszkoreit, Annie Zaenen, and Victor Zue. 1996. Survey of the State of the Art in Human Language Technology. http://cslu.cse.ogi.edu/HLTsurvey/.

Entropic. 1997. Annotator. http://www.entropic.com/products&services/annotator/annotator.html.

G. Forney. 1973. The Viterbi algorithm. Proceedings of the IEEE, 61:268-278.

Beryl Hoffman. 1995. Computational Analysis of the Syntax and Interpretation of Free Word-order in Turkish. Ph.D. thesis, University of Pennsylvania. IRCS Report 95-17.

Institute of Phonetics and Dialectology. 1997. SLAM. http://nts.csrf.pd.cnr.it/IFeD/Pages/slam.htm.

ISIP. 1999. Segmenter tool. http://www.isip.msstate.edu/projects/speech/software/swb_segmenter/index.html.

Byeongchang Kim and Geunbae Lee. 1999. Statistical/rule-based hybrid phrase break detection. In Proceedings of the International Conference on Speech Processing, pages 595-599.

Byeongchang Kim, WonIl Lee, Geunbae Lee, and Jong-Hyeok Lee. 1998. Unlimited vocabulary grapheme to phoneme conversion for Korean TTS. In Proceedings of COLING-ACL '98, pages 675-679.

LDC. 1998. A universal transcription format (UTF) annotation specification for evaluation of spoken language technology corpora. Technical Report www.nist.gov/speech/hub4_98/utf-1.0-v2.ps, LDC.

Boyd Michailovsky, John B. Lowe, and Michel Jacobson. 1998. Archivage. http://lacito.vjf.cnrs.fr/ARCHIVAG/ENGLISH.htm.

Ronald Rosenfeld. 1995. The CMU statistical language modeling toolkit and its use in the 1994 ARPA CSR evaluation. In Proceedings of the ARPA Spoken Language Technology Workshop.

SIL. 1999. Speech Analyzer. http://www.sil.org/computing/sil_computing.html.

Kare Sjolander. 1999. The Snack Sound Extension for Tcl/Tk. http://www.speech.kth.se/snack.

Paul Taylor and Alan W. Black. 1998. Assigning phrase breaks from part-of-speech sequences. Computer Speech and Language, 12(2):99-117.
