The Language Resources Development* and Language Processing Service for Thai

Asanee Kawtrakul1, Yuen Poovorawan1, Frédéric Andres2, Mukda Suktarajarn1, Patcharee Varasrai1, Nithiwat Kampanya1, Supavat Vongwatthaporn1, Nattakan Pengphon1, Chaiwat Ketsuvarn1

1 Computer Engineering Dept., Engineering Faculty, Kasetsart University, Thailand. 2 National Institute of Informatics, Japan. [email protected], [email protected]

Abstract

This paper presents research on the development of networked language resources for a Thai language processing research service. The goal of the project is the study, design and implementation of research resource management, including web-based maintenance of the resources. The resources consist of language knowledge, including lexica, a corpus, and statistical linguistic information, together with software tools, called the Tools Kit, for the mutual benefit of researchers studying language phenomena. The Tools Kit consists of an Annotator, an Aligner, CREQ (Corpus Resources Enquiry and Query) and a GUI browser. The Thai language processing service aims to provide document processing over the internet, such as word segmentation, automatic indexing, automatic clustering and an intelligent search engine for Thai text. The website for the language resources and services is http://naist.cpe.ku.ac.th.

1 Introduction-Motivation

Interest in NLP research and application development in Thailand is growing, in areas such as Thai morphological analysis [(Charoenpornsawat, P., et al., 1998), (Kawtrakul, A., et al., 1995), (A. Kawtrakul, C. Thumkanon, T. Jamjanya, Muagyunnan, K. Poolwan, and Y. Inagaki, 1996), (Kawtrakul, A., et al., 1996), (Kawtrakul, A., et al., 1997-3)], Thai Soundex [(Karoonboonyanan, T., et al., 1997), (Ongroongruang, S., et al., 1995)], speech processing [(Ahkuptra V., et al., 1997), (Jittapunkul S. and Areepongsa, 1995), (Kiat-arpakul, R., et al., 1995), (Luksaneeyanawin, S., 1992), (Maneenoi, E., et al., 1997), (Nuntiyagul, A., 1989)], language translation, automatic text abstraction, information retrieval in the user's own language, and writing verification. NAiST** has been developing a prototype writing production assistant system (Kawtrakul, A., et al., 1998) since 1995. The results of this research are linguistic knowledge, i.e., the Lexibase [(A. Kawtrakul, et al., 1995-2), (Kawtrakul, A., et al., 1998)] and a Thai corpus [(F. Andres and K. Ono, 1997), (Kawtrakul, A., et al., 1997-2), (Kawtrakul, A., et al., 1998)]. Moreover, many computational models for Thai morphological processing have been provided [(Kawtrakul, A., et al., 1995), (A. Kawtrakul, C. Thumkanon, T. Jamjanya, Muagyunnan, K. Poolwan, and Y. Inagaki, 1996), (Kawtrakul, A., et al., 1996-2)].

In order to save time and manpower, the research resources mentioned above have been made available on the network to provide services for Thai Language Processing (TLP). The service consists of knowledge sources for the Thai language and computational linguistic information, together with software tools, for mutual benefit. The service also aims to provide language processing over the internet, such as automatic indexing, automatic clustering and an intelligent search engine for Thai text.

Since information systems are evolving toward very large data sets, complex data types (including multimedia types) and heterogeneous systems, research resource management for the Thai language processing service is necessary. To enhance the performance of research resource management, the Thai language processing service and the language resources have also been implemented on the VLSHDS platform, which is based on PHASME application-oriented service functions and AHYDS (Advanced Hypermedia Delivery System) [(F. Andres, A. Kawtrakul, K. Ono, et al., 1998), (Kawtrakul, A., Andres, F., et al., 1999), (AHYDS System), (Phasme)].

The remainder of the paper is organized as follows. Section 2 gives an overview of the VLSHDS architecture for research resource management. Section 3 describes the language resources and their development. Section 4 describes the language processing service, and Section 5 concludes and gives directions for future work.

* This project is supported by grants from NECTEC, KURDI, and NII.
** NAiST: Natural Language Processing and Intelligent Information System Technology Research Laboratory.

2 An Architecture of the VLSHDS Platform for Research Resource Management

Since information systems are evolving toward very large data sets, complex data types (including multimedia types) and heterogeneous systems, research resource management is implemented on top of PHASME application-oriented service functions and AHYDS (Advanced Hypermedia Delivery System), in order to meet the needs of the next generation of Thai language engineering applications and information systems. The system has a three-tier client/server architecture. At the client side, queries are sent to the server by using the AHYDS communication support. At the server side, there are two main components: Language Resource and Language Processing.

In order to extend collaboration on resource sharing, the project is divided into two phases. The first phase uses a three-tier topology (see figure 1). The second phase will use a multi-tier topology (see figure 2) in order to provide fault tolerance, decentralization and easier maintenance. Since our architecture of research resource management for Thai Language Processing is based on the AHYDS platform, the multi-tier topology can be implemented in a straightforward way.

Figure 1. The topology of Research Resource Management for TLP on the VLSHDS platform (in the first phase). [Diagram components: User Client, Network, AHYDS Server, NLP Server.]

Figure 2. The multi-tier topology of Research Resource Management for TLP on the VLSHDS platform (in the second phase). [Diagram components: several User Clients, AHYDS Servers, several NLP Servers.]

3 Thai Language Resource Development

To a large extent, the development of the language resources is motivated by the desire to share and make maximal use of existing resources, and of the tools for mining and acquiring linguistic information from text corpora. At present, the following language resources and tools are provided: Lexicon and Thesaurus [(A. Kawtrakul, et al., 1995-2), (Kawtrakul, A., et al., 1998)], Corpus, and Tools Kit (see figure 3).

Figure 3. The Language Resources for Sharing and the Language Processing for Services.

3.1 Lexicon

Lexibase was designed for spelling, grammar and style checking. Its purpose is to serve as an information-rich dictionary for checking word grammar and style. It consists of three parts:
● Syntactic Feature, which is the part of speech. The parts of speech were revised by observing the context, function and word order in real text. As a result, there are 6 categories and 41 subcategories (for more detail see [(Kawtrakul, A., et al., 1997-2), (Kawtrakul, A., et al., 1998)]).
● Semantic Feature, which is the semantic concept of a word. Every noun, verb, pre-verb, post-verb, modifier and preposition has a semantic concept.
● Syntactic-Semantic Relation, which consists of the relations between words.
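The three parts of a Lexibase entry can be pictured as a small record type. The following is only a minimal sketch: the class name, the field names, the relation label and the sample values are hypothetical and are not taken from the actual Lexibase schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LexibaseEntry:
    """Hypothetical record holding the three parts of a Lexibase entry."""
    word: str
    category: str      # syntactic feature: top-level part-of-speech category
    subcategory: str   # syntactic feature: part-of-speech subcategory
    concept: str       # semantic feature: semantic concept of the word
    # syntactic-semantic relations: relation name -> related words or concepts
    relations: Dict[str, List[str]] = field(default_factory=dict)


def related(entry: LexibaseEntry, relation: str) -> List[str]:
    """Return the items linked to the entry by the given relation, or an empty list."""
    return entry.relations.get(relation, [])


# Illustrative entry for the verb "กิน" (to eat); all labels are made up.
entry = LexibaseEntry(
    word="กิน",
    category="verb",
    subcategory="transitive",
    concept="consume",
    relations={"takes_object_concept": ["food"]},
)
```

A grammar or style checker could then look up `related(entry, "takes_object_concept")` to test whether the object of a verb has a plausible semantic concept.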

3.2 Corpus

Our text corpus is a collection of edited Thai prose, covering both spoken and written texts from the web, newspapers and magazines. Following the Brown Corpus, it is divided into two types: non-fiction and fiction. Non-fiction has five genres, while fiction has six. Each genre has many samples, and each sample contains a large set of texts. To be usable for corpus-based natural language processing, the corpus also needs annotation, which consists of meta-data information and linguistic information.

3.2.1 Meta-data Information

Meta-data information is the first level of annotation and describes the documents in the corpus. The Dublin Core Metadata Element Set has been applied, consisting of Title, Creator, Publisher, Keyword, Genre, Source, Language, Relation, Rights, Contributor and Date.
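As an illustration, a meta-data record with the elements listed above could be serialized to XML as follows. This is a sketch only: the tag names (lower-cased element names), the `<metadata>` wrapper and the sample values are assumptions, not the corpus's actual XML schema.

```python
import xml.etree.ElementTree as ET

# Element names as listed in the text, lower-cased here for use as XML tags (an assumption).
METADATA_FIELDS = ["title", "creator", "publisher", "keyword", "genre", "source",
                   "language", "relation", "rights", "contributor", "date"]


def build_metadata(values):
    """Build a <metadata> element; unknown field names raise ValueError, missing ones are omitted."""
    unknown = set(values) - set(METADATA_FIELDS)
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    root = ET.Element("metadata")
    for name in METADATA_FIELDS:  # emit fields in a fixed order
        if name in values:
            ET.SubElement(root, name).text = values[name]
    return root


# Hypothetical sample record.
record = build_metadata({"title": "Sample document", "language": "th", "genre": "non-fiction"})
xml_text = ET.tostring(record, encoding="unicode")
```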

3.2.2 Linguistic Information

The next level of annotation is linguistic information. It applies grammatical tagging, where each tag indicates a grammatical category. Each document in the text corpus is tagged at the following levels:
● document level, i.e., abstract, body and paragraphs,
● sentence level, i.e., sentence and phrase,
● word level, i.e., part-of-speech and semantic concept.
This information is used for inducing linguistic properties and thus for generating computational knowledge. Additionally, linguistic information at the sentence and word levels is annotated on three sub-levels: syntactic, semantic and discourse.

Syntactic Level: Syntactic annotation is the practice of adding syntactic information to a corpus. It consists of:
● Sentence sub-categorization
● Syntactic clause sub-categorization
● Phrase sub-categorization
  o Noun phrase
  o Verb phrase
  o Adverbial phrase
  o Preposition phrase
● Word sub-categorization (POS tagging)

Semantic Level: This is the semantic sub-categorization of words. The information is included at the level of POS tagging. At present, there are 817 semantic concepts and 46 POS categories.

Discourse Level: This includes a variety of anaphora phenomena, such as co-referentiality, substitution, and ellipsis (see (Preeti, et al., 2000)).

3.2.3 The Process of Annotation

Figure 4 shows the process of annotation. The annotated corpus is in XML format. The tools for annotating will be described in the next section.

Figure 4. The process of Annotation. [Diagram components: Raw Text, Metadata Information, Sentence Breaker, Word Segmentation, Dictionary, Syn/Sem Tagger, Chunking Parser, Grammar Rules, Anaphora Resolution, Annotated Text in XML format.]
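The first stages of the annotation process can be sketched in a few lines. The example below is a simplified stand-in, not the actual tools: it uses a naive whitespace sentence breaker and dictionary-based longest-matching word segmentation in place of the probabilistic models, omits the chunking parser and anaphora resolution, and its toy dictionary, tag labels and XML element names are all assumptions.

```python
import xml.etree.ElementTree as ET

# Toy dictionary: word -> (part of speech, semantic concept). All entries are illustrative.
DICTIONARY = {
    "ฉัน": ("pronoun", "person"),
    "กิน": ("verb", "consume"),
    "ข้าว": ("noun", "food"),
}


def break_sentences(raw_text):
    """Naive sentence breaker: treats whitespace as the sentence boundary."""
    return raw_text.split()


def segment(sentence, dictionary):
    """Longest-matching segmentation; unmatched characters become one-character tokens."""
    max_len = max(len(word) for word in dictionary)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if candidate in dictionary or size == 1:
                words.append(candidate)
                i += size
                break
    return words


def annotate(raw_text, metadata, dictionary=DICTIONARY):
    """Produce an XML tree with meta-data, sentence and word level annotation."""
    doc = ET.Element("document")
    meta = ET.SubElement(doc, "metadata")
    for key, value in metadata.items():
        ET.SubElement(meta, key).text = value
    body = ET.SubElement(doc, "body")
    for sentence in break_sentences(raw_text):
        s = ET.SubElement(body, "sentence")
        for word in segment(sentence, dictionary):
            pos, concept = dictionary.get(word, ("unknown", "unknown"))
            w = ET.SubElement(s, "word", pos=pos, concept=concept)
            w.text = word
    return doc


doc = annotate("ฉันกินข้าว", {"title": "example"})
words = [(w.text, w.get("pos")) for w in doc.iter("word")]
```

Calling `ET.tostring(doc, encoding="unicode")` on the result gives the annotated text as an XML string.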

3.3 Tools Kit

Since a corpus needs the support of a sophisticated computational environment, software tools are provided both to retrieve data from the corpus and to process the corpus itself linguistically, as follows:
(a) General-purpose corpus data retrieval tools: concordance facilities, handling of corpora in complex formats (including treebanks), and sorting and searching in various ways.
(b) Tools to facilitate corpus annotation at various levels: automatic processing (such as tagging systems) and semi-automatic interactive use (e.g. parser, shallow parser). The annotation tools are implemented using both probabilistic models and Information Extraction techniques.
(c) A tool for interchanging information between corpora and lexical and grammatical databases, called CREQ (Corpus Resources Enquiry and Analysis System). With CREQ, users can extract all linguistic expression patterns or phenomena relevant to a query.
(d) A maintenance tool for updating words, both in the lexicon and in the information derived from the corpus.
The tools mentioned above can be downloaded from our website (see figure 5).
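A concordance facility of the kind provided by corpus retrieval tools can be illustrated with a keyword-in-context lookup. This is a minimal sketch under the assumption that the text has already been segmented into words; the function name, the `width` parameter and the sample tokens are hypothetical and do not describe the actual Tools Kit.

```python
def concordance(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for every occurrence of keyword."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]      # up to `width` words before the hit
            right = tokens[i + 1:i + 1 + width]     # up to `width` words after the hit
            lines.append((left, token, right))
    return lines


# Pre-segmented sample text (illustrative).
tokens = ["ฉัน", "กิน", "ข้าว", "แล้ว", "เขา", "กิน", "ปลา"]
hits = concordance(tokens, "กิน", width=1)
```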

4 Thai Language Processing Service

This project also aims to provide Thai language processing services such as automatic document indexing and clustering.

4.1 Automatic Document Indexing

We have developed a multilevel indexing model for document processing. The model consists of three modules: lexical token identification, phrase identification and relation extraction, and multilevel index generation. Each module accesses different linguistic knowledge bases stored inside the EBG data structure (for more details see (NACSIS, 1999)). Since document indexing needs rich knowledge in the lexicon and thesaurus, we plan to provide a document or book indexing service based on our resources.
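To make the idea of indices at more than one level concrete, the sketch below builds a word-level and a phrase-level inverted index from documents that are assumed to be already tokenized. It is a heavily simplified illustration: relation extraction and the EBG data structure are not modelled, phrases are found by matching a given phrase list, and all names and sample data are hypothetical.

```python
from collections import defaultdict


def identify_phrases(tokens, known_phrases):
    """Return the known phrases (tuples of tokens) that occur as contiguous runs in tokens."""
    found = []
    for phrase in known_phrases:
        n = len(phrase)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                found.append(phrase)
                break
    return found


def build_index(documents, known_phrases):
    """Build two index levels, 'word' and 'phrase', mapping each term to sorted document ids."""
    index = {"word": defaultdict(set), "phrase": defaultdict(set)}
    for doc_id, tokens in documents.items():
        for token in tokens:
            index["word"][token].add(doc_id)
        for phrase in identify_phrases(tokens, known_phrases):
            index["phrase"][" ".join(phrase)].add(doc_id)
    return {level: {term: sorted(ids) for term, ids in terms.items()}
            for level, terms in index.items()}


# Illustrative documents; Thai text would first pass through word segmentation.
documents = {
    "d1": ["information", "retrieval", "system"],
    "d2": ["speech", "recognition", "system"],
}
index = build_index(documents, known_phrases=[("information", "retrieval")])
```

A query can then be matched against the phrase level first and fall back to the word level when no phrase is found.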

4.2 Automatic Document Clustering

Text categorization, or document clustering, consists of two parts: a prototype learning process, which provides a prototype for each cluster of documents, and a clustering process, which computes the similarity between an input document and the prototypes. For more details see (NACSIS, 1999). As with document indexing, we also plan to provide a document clustering service. End users can upload their files to our server and then apply our tools to categorize them (see figure 6).
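The two parts can be illustrated with a small sketch. The source does not state how prototypes are represented or which similarity measure is used, so the example assumes averaged term-frequency vectors as prototypes and cosine similarity; the function names and training data are hypothetical.

```python
import math
from collections import Counter


def learn_prototypes(labelled_docs):
    """Prototype learning: average the term-frequency vectors of each cluster's documents."""
    sums, counts = {}, Counter()
    for label, tokens in labelled_docs:
        sums.setdefault(label, Counter()).update(tokens)
        counts[label] += 1
    return {label: {term: freq / counts[label] for term, freq in vec.items()}
            for label, vec in sums.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(tokens, prototypes):
    """Clustering step: pick the cluster whose prototype is most similar to the document."""
    vec = Counter(tokens)
    return max(prototypes, key=lambda label: cosine(vec, prototypes[label]))


# Illustrative training documents, already tokenized.
training = [
    ("database", ["query", "index", "storage"]),
    ("database", ["query", "transaction"]),
    ("speech", ["phoneme", "tone", "recognition"]),
]
prototypes = learn_prototypes(training)
label = assign(["tone", "recognition", "model"], prototypes)
```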

5 Conclusions and Future Works

At present, only members of SIGNLP can access our website http://naist.cpe.ku.ac.th to download language resources, and for research purposes only. The language processing service is currently limited to the domain of computer technical reports. Future work will enlarge the lexical knowledge by using automatic acquisition tools and enhance the performance of the language processing services.

References

(Ahkuptra V., et al., 1997) Ahkuptra V. et al., “A Speaker-Independent Thai Polysyllabic Word Recognition System Using Hidden Markov Model”, Proc. of NLPRS, 1997, pp. 281-286.
(F. Andres, et al., 1996) F. Andres, et al., “Providing Information Retrieval Mechanism inside a WWW Database Server for Structured Document Management”, in Proceedings of the ADB96 Symposium, Tokyo, December 1996.
(F. Andres and K. Ono, 1997) F. Andres and K. Ono, “Phasme: A High Performance Parallel Application-oriented DBMS”, Informatica Journal, Special Issue on Parallel and Distributed Database Systems, 1997.
(F. Andres and K. Ono, 1998) F. Andres and K. Ono, “The Active HYpermedia Delivery System”, in Proceedings of ICDE98, Orlando, USA, February 1998.
(F. Andres, A. Kawtrakul, K. Ono, et al., 1998) F. Andres, A. Kawtrakul, K. Ono, et al., “Development of Thai Document Processing System based on AHYDS by Network Collaboration”, in Proc. 5th International Workshop of Academic Information Networks on Systems (WAINS), Bangkok, Thailand, December 1998.
(D.C. Blair, 1990) D.C. Blair, “Language and Representation in Information Retrieval”, Elsevier Science Publishers, 1990.

(Charoenporn, T., et al., 1997) Charoenporn, T. et al., “Building a large Thai Text Corpus-part-of-speech Tagged Corpus: Orchid”, Proc. of NLPRS, 1997, pp. 509-512.
(Charoenpornsawat, P., et al., 1998) Charoenpornsawat, P. et al., “Feature-based Thai Unknown Word Boundary Identification Using Winnow”, Proc. of IEEE, 1998, pp. 547-550.
(E. Charniak, 1993) E. Charniak, “Statistical Language Learning”, MIT Press, 1993.
(W.W. Cohen and Y. Singer, 1996) W. W. Cohen and Y. Singer, “Context-sensitive learning methods for text categorization”, in Proceedings of the 19th Int. ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.
(D.J. Ittner, David D. Lewis, and David D. Ahn, 1995) D. J. Ittner, David D. Lewis, and David D. Ahn, “Text Categorization of Low Quality Images”, Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, pp. 301-305, 1995.
(Jittapunkul S. and Areepongsa, 1995) Jittapunkul S. and Areepongsa, “Speaker-Independent Thai Numeral Speech Recognition Using Hidden Markov Model and Vector Quantization”, Proc. of SNLP, 1995, pp. 370-378.
(Karoonboonyanan, T., et al., 1997) Karoonboonyanan, T. et al., “A Thai Soundex Systems for Spelling Correction”, Proc. of NLPRS, 1997, pp. 633-644.
(Kanlayayanawat, W., et al., 1997) Kanlayayanawat, W. et al., “Automatic Indexing for Thai Text with unknown Words Using Trie Structure”, Proc. of NLPRS, 1997, pp. 115-120.
(Kawtrakul, A., et al., 1995) Kawtrakul, A. et al., “A Statistical Approach to Thai Word Filtering”, The 2nd Symposium on Natural Language Processing, Bangkok, pp. 398-406, 1995.
(A. Kawtrakul, et al., 1995-2) A. Kawtrakul, et al., “A Lexibase Model for Writing Production Assistant System”, in Proceedings of the 2nd Symposium on Natural Language Processing, Bangkok, pp. 226-236, 1995.
(A. Kawtrakul, C. Thumkanon, T. Jamjanya, Muagyunnan, K. Poolwan, and Y. Inagaki, 1996) A. Kawtrakul, C. Thumkanon, T. Jamjanya, Muagyunnan, K. Poolwan, Y. Inagaki, “A Gradual Refinement Model for A Robust Thai Morphological Analyzer”, COLING 96, pp. 1086-1089, 1996.

(Kawtrakul, A., et al., 1996) Kawtrakul, A. et al., “Thai morphological Analysis”, Final Report to the Kasetsart University Research and Development Institute, 1996.
(Kawtrakul, A., et al., 1997) Kawtrakul, A. et al., “Grammar and Style Checking for Thai sentences”, A progress Report to the National Research Council of Thailand, 1997.
(Kawtrakul, A., et al., 1997-2) Kawtrakul, A. et al., “The Development of Resources on Network for NLP Researches”, A progress Report to the National Electronics and Computer Technology Center, 1997.
(Kawtrakul, A., et al., 1997-3) Kawtrakul, A. et al., “Automatic Thai Unknown Word Recognition”, Proc. of NLPRS, 1997, pp. 341-346.
(Kawtrakul, A., et al., 1998) Kawtrakul, A. et al., “Grammar and Style Checking for Thai sentences”, Final Report to the National Research Council of Thailand, 1998.
(Kawtrakul, A., et al., 1998-2) Kawtrakul, A. et al., “Towards Automatic Multilevel Indexing for Thai Text Information Retrieval”, Proc. of IEEE, 1998, pp. 551-554.
(Kawtrakul, A., et al., 1998-3) Kawtrakul A. et al., “Backward Transliteration for Thai Document Retrieval”, Proc. of IEEE, 1998, pp. 563-566.
(Kawtrakul, A., Andres, F., et al., 1999) Kawtrakul A., Andres, F. et al., “A Prototype of Globalize Digital libraries: The VLSHDS Architecture for Thai Document processing”, 1999 (in the process of submission).
(Kiat-arpakul, R., et al., 1995) Kiat-arpakul R. et al., “A Combined Phoneme-based and Demisyllable-based Approach for Thai Speech Synthesis”, Proc. of SNLP, 1995, pp. 361-369.
(G. Kowalski, 1998) G. Kowalski, “Information Retrieval Systems: Theory and Implementation”, Kluwer Academic Publishers, second edition, ISBN 0-7923-9926-9, 1998.
(Luksaneeyanawin, S., 1992) Luksaneeyanawin S., “A Thai Text to Speech System”, in Proceedings of the Conference on Electronics and Computer Research and Development, NECTEC, 1992.
(Maneenoi, E., et al., 1997) Maneenoi E. et al., “Modification of BP Algorithm for Thai Speech Recognition”, Proc. of NLPRS, 1997, pp. 287-291.
(Nuntiyagul, A., 1989) Nuntiyagul A., “Thai Text to Speech Synthesis”, M.Sc. Thesis, Chulalongkorn University, 1989.

(Ongroongruang, S., et al., 1995) Ongroongruang, S. et al., “English to Thai Word Retrieval Using Sound Index”, Proc. of SNLP, 1995, pp. 407-419.
(Peter Schauble and Alan F. Smeaton, 1998) Peter Schauble and Alan F. Smeaton, “Summary Report of the Series of Joint NSF-EU Working Groups on Future Directions for Digital Libraries Research”, DELOS Working Group Report 98/W004, http://www.iei.pi.cnr.it/DELOS/NSF/nsf.htm
(Preeti, et al., 2000) Preeti Pitialongkorn, Asanee Kawtrakul, “A Survey of Computational Anaphora Solving for Thai Information Retrieval”, Proc. of the 7th International Workshop on Academic Information Networks and Systems, December 2000.
(G. Salton, 1989) G. Salton, “Automatic Text Processing. The Transformation, Analysis, and Retrieval of Information by Computer”, Singapore: Addison-Wesley Publishing Company, 1989.
(Thubtong, N., 1995) Thubtong N., “A Thai Speech Recognition System Based on Phonemic Distinctive Features”, Master of Science Thesis, Department of Computer Engineering, Chulalongkorn University, 1995.
(TungThangthum, A., 1998) Tungthangthum A., “Tone Recognition for Thai”, Proc. of IEEE, 1998, pp. 157-160.
(AHYDS System) AHYDS System, NACSIS project, http://www.rd.nacsis.ac.jp/~andres/db/ahyds.html
(Phasme) Phasme Application-oriented DBMS System, NACSIS R&D department, http://www.rd.nacsis.ac.jp/~andres/db/phasme.html
(ACM, V. 38 N. 4, 1995) Special Issue on Digital Libraries, Communications of the ACM, April 1995, Volume 38, number 4.
(ObjComp) Washington University, Center for Distributed Object Computing, http://www.cs.wustl.edu/~schmidt/doc-center.html
(NACSIS, 1999) Report of “NACSIS-NAiST” project, 1999.

Figure 5. Toolkits provided for the NLP members.

Figure 6. The Result of Web-based Document Clustering.