Research Resource Management for Thai Language Processing Service*

Asanee Kawtrakul1, Yeun Poovorawan1, Frederic Andres2, Kamala Nakasiri3, Witsamai Manomaiphibul3, Wipakorn Wongthai3, Siriporn ThangTieng3, Thatsanalai Burapachep3, Chalatip Tumganon1, Chavalit Chirarattanachan1, Thanussak Thanyasiri1, Nathavit Buranapraphanont1, Navapat Khantonthong1, Patcharee Varasrai1, Mukda Suktarajarn1, Kunawut Subsean1, Songpol Chinthanet1

1 Computer Engineering Dept., Engineering Faculty, Kasetsart University.
2 National Center for Science Information System.
3 Linguistic Dept., Humanity Faculty, Kasetsart University.
ABSTRACT
This paper presents research on the Development of Resources on Network for Natural Language Processing Research**. The goal of the project is the study, design and implementation of Research Resource Management for a Thai Language Processing Service. The service offers knowledge sources for the Thai language together with computational-linguistic software tools, shared for mutual benefit. It also aims to provide language processing over the Internet, such as automatic indexing, automatic clustering and an intelligent search engine for Thai text. Because information systems are growing into very large data sets with complex data types, including multimedia, on heterogeneous systems, research resource management for the Thai language processing service is necessary. To meet the needs of the next generation of Thai language processing applications and information systems, research resource management is implemented on the PHASME application-oriented service functions and AHYDS (Active Hypermedia Delivery System). At present, the Linguistic Knowledge & Tools and the Thai language processing components have been developed. The output of some applications is kept in the Extended Binary Graph (EBG) data structure of the PHASME engine.

KEY WORDS
Thai language processing, Automatic Indexing, Automatic Clustering, Very Large Scale Hypermedia Delivery Systems, Research Resource Management.
* This project is funded by NECTEC, KURDI and NACSIS.
** This project is funded by NECTEC.
1. Introduction – Motivation
1.1 Thai Language Processing
Natural Language Processing (NLP) research is essential to enhancing the ability of computers to process human language. This research has produced a variety of techniques and computational linguistic theories, which in turn have led to commercial software such as language translators, automatic text abstracting, information retrieval in the user's own language, and writing verification. Interest in developing and researching NLP systems and applications in Thailand is growing, for example in Thai morphological analysis [8,15,17,18,21], Thai Soundex [13,31] and speech processing [1,12,26,28,29,30], because Thai differs from other languages at the phonological, morphological and syntactic levels. Thai language research needs time, manpower, hardware and software tools. Its results are an invaluable resource for the analysis that leads to computational linguistic theories and, finally, to language processing products such as automatic indexing and clustering of Thai text or documents and semi-automatic text translation.

1.2 Research Resource Management for Thai Language Processing Service
NAiST* has been developing a prototype writing production assistant system [22] since 1995. The results of this research are linguistic knowledge, i.e., Lexibase [16,22] and a Thai corpus [3,20,22]. In addition, several computational models for Thai morphological processing have been provided [15,17,18]. In order to save time and manpower, these research resources have been made available on the network as services for Thai Language Processing (TLP). The service offers knowledge sources for the Thai language together with computational-linguistic software tools, shared for mutual benefit. It also aims to provide language processing over the Internet, such as automatic indexing, automatic clustering and an intelligent search engine for Thai text. Because information systems are growing into very large data sets with complex data types, including multimedia, on heterogeneous systems, research resource management for the Thai language processing service is necessary.
1.3 The VLSHDS platform
VLSHDS (Very Large Scale Hypermedia Delivery System) is a joint project between NACSIS and NAiST* [5,25] that combines data processing services and full-text retrieval services in a very large-scale environment, in order to meet user needs for Thai document retrieval both in the digital library field and in the growing Internet world. To enhance the performance of research resource management, the Thai language processing service and the knowledge resources have also been implemented on the VLSHDS platform, which is based on the PHASME application-oriented service functions and AHYDS (Active Hypermedia Delivery System).
1.4 Paper Organization
The remainder of the paper is organized as follows. Section 2 gives an overview of the VLSHDS architecture for research resource management. Section 3 describes the knowledge resources and tools. Section 4 describes the TLP service, and Section 5 concludes and gives directions for future work.
2. An Architecture of the VLSHDS for Research Resource Management
This section presents an overview of research resource management on the VLSHDS (Very Large Scale Hypermedia Delivery System) platform. The VLSHDS platform is based on the Active Hypermedia Delivery System, AHYDS [36,37]. The system has a three-tier client/server architecture. On the client side, queries are sent to the server using the AHYDS communication support described in Section 2.1. On the server side, there are two main components: Knowledge Resources & Tools, and Thai Language Processing.
* NAiST, the Natural Language Processing and Intelligent Information System Technology Research Laboratory, has been promoted and supported by KURDI as a Special Research Unit since 1999.
2.1 The AHYDS (Active Hypermedia Delivery System) platform
The AHYDS architecture (see Figure 1) provides a framework for an open data delivery service. While a large class of applications needs Quality of Service (QoS), not only for continuous data support but also in terms of multi-resolution, multi-language and multi-dimension, some applications need to replace or tailor some of the database services, such as optimization, multi-dimensional indexing or the execution model. The plugin service is a vertical service that extends from the traditional data type definition (ADT) to the execution service, which includes operating system functions such as dynamic supervision, load balancing and fault tolerance. It enables the database engine to be tailored to the requirements of the applications. The core system, named Phasme [37], is one example of an “Application-oriented DBMS” [3].
[Figure 1 (diagram not reproduced). Labels recoverable from the diagram: Text Interface; Hypermedia; Video Catalog; Image; Documents; EBGs; Plugins; Disk storage; Recovery; Data manager; Memory Management; Request Manager; ADT; Scheduler; Dynamic Opt.; Load Balancing; Fault Tolerance; Execution service; Concurrency Control; Algebras; SQL/OQL Interfaces; Communication; Web; Service; DBMS Services.]
Figure 1: The AHYDS architecture
2.1.1 The Communication Service
The communication service manages communication within the distributed and heterogeneous environment. The architecture supports a name service that enables agents to join or leave the data delivery system (service information and access). Agents subscribe to the Phasme framework.

2.1.2 The Query Execution Service and Supervision Service
Phasme manages the allocation and scheduling of all resources of the server, including processor(s), memory, disk bandwidth and network bandwidth. The task of the query supervisor is to determine a schedule for the use of these resources that satisfies a query plan according to the runtime workloads over the distributed and heterogeneous information system. The query supervision service manages and supervises all the resources in the distributed system using a load-balancing scheduler. The primary feature of the Phasme query execution service is that it can be customized to integrate the data retrieval and processing requirements of a wide variety of applications. It achieves this in three ways. First, it dynamically uses query plugins, sets of functions that integrate the application-specific aspects of the processing. The query plugins include the manipulation functions for data items. The query execution service manages operation processing inside the storage manager and performs the operations directly on memory-mapped data. Second, the query execution service coordinates the QoS of parallel executions (intra- and inter-operation parallelism). Third, it maximizes the use of each resource.

The distributed management is based on the Phasme interface, and the AHYDS client layers are used to access the AHYDS platform and the Phasme engine from the client tools. Training on the AHYDS platform, Phasme and the Application-Oriented Service is done by distance learning over the Thai-Japan high-speed network.

2.2 The VLSHDS platform for Research Resource Management
This project is divided into two phases. The first phase uses a three-tier topology (see Figure 2). The second phase will use a multi-tier topology (see Figure 3) in order to provide fault tolerance. Since our architecture of research resource management for Thai Language Processing is based on the AHYDS platform, the multi-tier topology can be implemented simply.
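The fault tolerance expected from the multi-tier topology can be illustrated with a small client-side failover loop. This is only a sketch under stated assumptions: the function `send_with_failover`, the exception `AllServersFailed` and the two stand-in servers are hypothetical names invented for illustration, and the paper does not describe how AHYDS actually performs failover.

```python
from typing import Callable, List, Sequence


class AllServersFailed(Exception):
    """Raised when no server in the tier could answer the request."""


def send_with_failover(request: str, servers: Sequence[Callable[[str], str]]) -> str:
    """Try each server in order and return the first successful answer.

    Assumption: a server signals unavailability by raising ConnectionError.
    """
    errors: List[Exception] = []
    for server in servers:
        try:
            return server(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next server
    raise AllServersFailed(f"{len(errors)} server(s) failed")


# Hypothetical stand-ins for two NLP servers in the second-phase topology.
def broken_server(request: str) -> str:
    raise ConnectionError("NLP server 1 is down")


def healthy_server(request: str) -> str:
    return f"indexed:{request}"


result = send_with_failover("doc-42", [broken_server, healthy_server])
```

With a single NLP server, as in the first phase, the same call would fail as soon as that server became unreachable; adding servers to the list is what gives the client an alternative.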
[Figure 2 (diagram not reproduced). Labels: AHYDS Server; NLP Server; Network; User Client (two).]
Figure 2. The topology of Research Resource Management for TLP on the VLSHDS platform (in the first phase).

[Figure 3 (diagram not reproduced). Labels: AHYDS Servers; NLP Server (three); User Client (four).]
Figure 3. The multi-tier topology of Research Resource Management for TLP on the VLSHDS platform (in the second phase).

3. Knowledge Resources and Tools
At present, the following knowledge sources and tools are provided: Lexibase [16,22], KU Parser, Corpus Tool, Rule Editor, Word Segmentation Tool and Word Co-occurrence Extraction Tool [22].

3.1 Lexibase
Lexibase was designed for spelling, grammar and style checking. It is an information-rich dictionary for checking the grammar and style of words. It consists of three parts:
• Syntactic Feature: the parts of speech, which were revised by observing context, function and word order in real text. As a result, there are 6 categories and 41 subcategories (for more detail see [20,22]).
• Semantic Feature: the semantic concept of a word. Every noun, verb, pre-verb, post-verb, modifier and preposition has a semantic concept.
• Syntactic-Semantic Relation: the relations between words (see the examples in Table 1 and Table 2).

Table 1. Syntactic-semantic relations

pos: noun (n)
  Relation (Noun):
    rel (of)     : n: concept1, concept2, etc.
    (made of)    : concept3, concept4, etc.
    (have)       : concept5, concept6, etc.
    (location)   : concept8, concept7, etc.
    (colour)     : concept9, concept10, etc.
    (purpose)    : concept11, concept12, etc.
    mod          : concept1, concept2, etc.
    cl           : w1, w2, etc.
    compound     : concept1, concept2, etc.

pos: verb (v)
  Relation:
    AGT          : n: concept1, …conceptn
    OBJ          : n: concept1, …conceptm
    mod          : mod: concept1, …conceptr
    postv        : postv: conceptr, etc.
    prep         : prep: concepts
    V            : verb: concept1, …conceptl

pos: modifier (mod), preposition (prep), post verb (postv)
  Relation (noun) : n: concept1, …conceptn
  Relation (Verb) : postv: concept1, …conceptn
  Relation (Verb) : mod: concept1, …conceptn

Table 2. An example in the Lexibase: the entry for “กิน” (to eat)

Syntactic feature : tav
Semantic concept  : 13 (ingestion)
Relation:
  AGT   : 01010101 (human), 01010102 (animal)
  OBJ   : 0101020114 (food), 0101020115 (drink), 0101020112 (medicine)
  mod   : 01 02 03 04 12 13 14
  postv : 01 14 15 16 18
  prep  : 01 02 03 04 05
  V     : 06
Style:
  Level : 0
  Style : รับประทาน(1) ทาน(4) ฉัน(5) เสวย(6)

(Note: the numbers shown in the Lexibase structure are the representations of semantic concepts.)

Word levels are classified into 7 classes:
Level 0 = informal
Level 1 = formal
Level 2 = acronym
Level 3 = foreign word
Level 4 = colloquial word
Level 5 = monk
Level 6 = official palace language
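To make the Lexibase structure concrete, the “กิน” entry of Table 2 can be written down as a small record. This is a minimal sketch, assuming a flat in-memory representation; the class `LexibaseEntry` and the helpers `accepts` and `variant_for_level` are hypothetical names chosen for illustration and are not the actual Lexibase storage format or API. Only the AGT and OBJ relations are included, and concept codes are matched exactly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LexibaseEntry:
    """Hypothetical in-memory form of one Lexibase entry (cf. Table 2)."""
    word: str
    syntactic_feature: str
    semantic_concept: str
    relations: Dict[str, List[str]]  # relation name -> allowed concept codes
    level: int                       # word level of the entry itself
    style: Dict[str, int]            # variant word -> word level


KIN = LexibaseEntry(
    word="กิน",
    syntactic_feature="tav",
    semantic_concept="13",  # ingestion
    relations={
        "AGT": ["01010101", "01010102"],                 # human, animal
        "OBJ": ["0101020114", "0101020115", "0101020112"],  # food, drink, medicine
    },
    level=0,
    style={"รับประทาน": 1, "ทาน": 4, "ฉัน": 5, "เสวย": 6},
)


def accepts(entry: LexibaseEntry, relation: str, concept: str) -> bool:
    """Check whether a concept code is listed under the given relation."""
    return concept in entry.relations.get(relation, [])


def variant_for_level(entry: LexibaseEntry, level: int) -> Optional[str]:
    """Return the word form recorded for a word level, or None if there is none."""
    if level == entry.level:
        return entry.word
    for variant, variant_level in entry.style.items():
        if variant_level == level:
            return variant
    return None
```

A grammar or style checker could use `accepts` to test whether a candidate object fits the verb, and `variant_for_level` to propose a form that suits the register of the text, for example the level 5 form when the subject is a monk.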
3.2 KU Parser
KU Parser is an assistant tool for checking grammar rules under design. End users can supply their own grammar rules and dictionary. The parser outputs every possible parse tree (see Figure 4).

[Figure 4 (tree diagrams not reproduced). Two parse trees, numbered 1/2 and 2/2, are shown for the same sentence. Both use the nodes ประโยค (sentence), ภาคประธาน (subject), ภาคแสดง (predicate), คำนาม (noun), สกรรมกริยา (transitive verb), นามวลี (noun phrase), บุพบทวลี (prepositional phrase) and บุพบท (preposition) over the words คน, กิน, ข้าว, ใน, บ้าน.]
Figure 4. The output of KU Parser for the sentence “คนกินข้าวในบ้าน”.

3.3 Corpus Tool and Rule Editor
The Corpus Tool was designed to assist linguistic researchers in tagging the parts of speech of the sentences in the corpus. The Rule Editor is an assistant tool for creating the tree structure of a sentence. From the tree structure, the tool automatically generates the grammar rules.

3.4 Word Co-occurrence Extraction Tool
This program is an assistant tool for studying word co-occurrence. For all word pairs in the input file, it reports the frequency of co-occurrence and the positions at which the words co-occur (for an example see Figure 5). For more detail, see [20].

[Figure 5 (screen output not reproduced). Column titles: Heading word; List of succeeding words with frequency according to its position. Words shown: พลังงาน, คือ, บาง, สิ่ง, ที่, สามารถ, ทำงาน, ฝาย. Frequency rows shown: 1,0,0,0,0,0,1; 0,2,0,0,0,1,0; 0,1,0,0,0,0,0; 0,0,0,1,0,2,1; 3,2,0,0,1,1,0; 0,0,1,0,0,1,0; ...; 0,0,0,0,0,0.]
Figure 5. Word co-occurrence with frequency according to position

4. Thai Language Processing Service
This project also aims to provide Thai language processing services such as automatic document indexing and clustering.
4.1 Problems in Thai Document Processing
Figure 6 summarizes the language problems frequently found in Thai documents: loan words, acronyms, synonyms and anaphora. These problems affect document representation. Solving them requires a knowledge base and backward transliteration. Thai morphological analysis [17] and shallow parsing resolve implicit unknown words and partial phrases in context. The knowledge base, consisting of Lexibase and Thai WordNet, is used for abbreviation and synonym resolution, and backward transliteration [24] is used for loan-word resolution.
[Figure 6 (annotated sample document; annotation arrows not reproduced). Problem labels attached to the sample: Implicit unknown word; Abbreviation; Disappear in context; Loan word; Partial phrase in context; Synonym. Sample document:]

AUTHOR: เลิศรัก ศรีกิจการ (Lertrak Srikitjakarn)
การใหยาถายพยาธิในสัตวเพื่อตัดวงจรของพยาธิใบไมในตับ (F. gigantica) ในภาค อีสาน ของไทย
Strategic Treatment of Fasciola gigantica in the Northeast of Thailand
Ruminant, Swamp Buffaloes, Fasciola Gigantica, Anthelmintics Niclofolan, Northeast of Thailand
กระบือปลัก, พยาธิใบไมในตับ, ภาคตะวันออกเฉียงเหนือ, ยาถายพยาธินิโคลโฟลัน
การใหยาถายพยาธิ Niclofolan หนึ่งครั้งแกฝูงควายเมื่อตนเดือนตุลาคม สามารถลดอัตราการติดโรคพยาธิในตับในฝูงจาก 50 เปอรเซ็นต เหลือ 0 และ 5 เปอรเซ็นต ในชวง 2.5 และ 6 เดือน แลวเพิ่มขึ้นเปน 10 เปอรเซ็นต เมื่อ 10 เดือน หลังใหยา การเปลี่ยนแปลงของอัตราการติด พยาธิในโฮสทกึ่งกลาง (หอย Lymnaea rubiginosa) พบสูงกวา 30 เปอรเซ็นต ในชวง 3 เดือนแรกหลังใหยาแลวลดลงอยาง รวดเร็ว จนเมื่อ 1 ป หลังจากที่ใหยาแกฝูงควายไมพบ การติด cercaria ของพยาธิ ในหอย ผลการทดลองนี้สรุปไดวา เวลาที่เหมาะสมสําหรับการใหยาถายพยาธิแกสัตว เพื่อตัดวงจรของพยาธิใบไมในภาคอีสานของไทย คือในชวงตนเดือนกันยายน

Figure 6: The problems that affect document representation
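The role of the knowledge base in abbreviation and synonym resolution can be sketched as a lookup that maps each surface term to one canonical index term. The sketch below is illustrative only: the dictionary `CANONICAL` and the function `normalize_terms` are hypothetical, and the two entries are hand-picked from the sample document of Figure 6 rather than taken from Lexibase or Thai WordNet.

```python
from typing import Dict, List

# Illustrative entries only: surface form -> canonical index term.
CANONICAL: Dict[str, str] = {
    "F. gigantica": "Fasciola gigantica",       # abbreviation
    "ภาคอีสาน": "ภาคตะวันออกเฉียงเหนือ",            # synonym (the Northeast region)
}


def normalize_terms(terms: List[str], knowledge_base: Dict[str, str]) -> List[str]:
    """Map each term to its canonical form and drop repeated index terms.

    Terms that are not in the knowledge base are kept unchanged, and the
    order of first appearance is preserved.
    """
    seen = set()
    normalized: List[str] = []
    for term in terms:
        canonical = knowledge_base.get(term, term)
        if canonical not in seen:
            seen.add(canonical)
            normalized.append(canonical)
    return normalized


index_terms = normalize_terms(
    ["F. gigantica", "Fasciola gigantica", "ภาคอีสาน"], CANONICAL
)
```

Without this step, the abbreviated and the full form of the same name would be indexed as unrelated terms, which is the representation problem that Section 4.1 describes.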
4.2 Automatic Document Indexing
As shown in Figure 7, we propose a multilevel indexing model to provide document representations, or indices. The model consists of three modules: lexical token identification, phrase identification and relation extraction, and multilevel index generation. Each module accesses different linguistic knowledge bases stored in the EBG data structure (for more details see [40]).

[Figure 7 (diagram not reproduced). Labels: Documents; Automatic Indexing; Lexical Token Identification (Lexical Token Recognition, Backward Transliteration); Lexibase; CUVOALD; Compute Weight; Phrasal Identification & Extraction; Multilevel Index Generation; NP Rules; Thai Word Net; Di = (document representation, content not recoverable); Clustering.]
Figure 7: An overview of Thai document processing

4.3 Automatic Document Clustering
Text categorization, or document clustering, consists of two parts: a prototype learning process, which provides a prototype for each cluster of documents, and a clustering process, which computes the similarity between an input document and the prototypes (see Figure 8). For more details see [40].
[Figure 8 (flowchart not reproduced). Boxes, in reading order: New document; Initial learning data; Represent document into weight vector; Learning process; Prototype class (Pc) with weight vector; Compute similarity of document: IF it is not similar THEN add this document to the unknown category, ELSE adjust the weight vector in Pc.]
Figure 8: Text categorization process
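The decision step of the text categorization process can be sketched as follows. The paper does not state which similarity measure, threshold or update rule is used, so cosine similarity, the threshold of 0.5 and the moving-average update with rate 0.1 are all assumptions made for illustration, as are the names `cosine` and `categorize`.

```python
import math
from typing import Dict, Optional

Vector = Dict[str, float]  # term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def categorize(
    doc: Vector,
    prototypes: Dict[str, Vector],
    threshold: float = 0.5,  # assumed value
    rate: float = 0.1,       # assumed learning rate
) -> Optional[str]:
    """Assign doc to the most similar prototype class, or None for 'unknown'.

    When the document is similar enough, the weight vector of the chosen
    prototype class is adjusted towards the document.
    """
    best_label, best_sim = None, 0.0
    for label, proto in prototypes.items():
        sim = cosine(doc, proto)
        if sim > best_sim:
            best_label, best_sim = label, sim
    if best_label is None or best_sim < threshold:
        return None  # not similar: the caller adds it to the unknown category
    proto = prototypes[best_label]
    for term in set(proto) | set(doc):
        proto[term] = (1 - rate) * proto.get(term, 0.0) + rate * doc.get(term, 0.0)
    return best_label


prototypes = {
    "sport": {"football": 1.0, "goal": 1.0},
    "economy": {"bank": 1.0, "baht": 1.0},
}
label = categorize({"football": 1.0, "goal": 0.5}, prototypes)
unknown = categorize({"temple": 1.0}, prototypes)
```

Here `label` is the class whose prototype is adjusted, and `unknown` is `None` because the second document shares no term with any prototype.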
5. Conclusions and Future Work
The Research Resource Management for TLP on VLSHDS project combines data processing services and the Thai Language Processing service in a very large-scale environment (Kasetsart University, distributed over Thailand and Japan) in order to meet user needs for linguistic knowledge and tools and for Thai document processing, both in the digital library field and in the growing Internet world. The project provides innovative design and implementation solutions to these issues. To enhance QoS, research resource management is implemented on the VLSHDS platform. The project is also an example of cooperative work between Thailand, Japan and the USA, within which distance learning has been set up to spread technical knowledge.

The current state of this work is:
• Collecting a Thai-language corpus. 45 MB of Thai text have been collected, consisting of agriculture research abstracts, computer science articles, sports news, crime news, economic news, political news and technical reports.
• Developing automatic multilevel indexing and document categorization.
• Implementing the client environment, plug-ins and tools using AHYDS and Phasme technology, supported by distance learning (a 40-hour module) between NACSIS and NAiST.
• Developing an upgraded version of the Thai morphological analyzer.
• Providing linguistic tools: rule editor, KU Parser, corpus tool, word segmentation tool and word co-occurrence extraction tool.

Future work is:
• To implement automatic indexing using NLP techniques.
• To enhance retrieval performance by integrating document clustering with the retrieval system, taking advantage of the EBG data structure.
• To plug the retrieval system, with NLP techniques and document clustering, into the Phasme engine.
• To benchmark the results within the TREC framework, as the AHYDS project has been participating in the TREC series (6/7/8).
• To implement web-based distance learning for natural language processing.
6. References
[1] Ahkuptra V. et al., “A Speaker-Independent Thai Polysyllabic Word Recognition System Using Hidden Markov Model”, Proc. of NLPRS, 1997, pp. 281-286.
[2] F. Andres et al., “Providing Information Retrieval Mechanism inside a WWW Database Server for Structured Document Management”, in Proceedings of the ADB96 Symposium, Tokyo, December 1996.
[3] F. Andres and K. Ono, “Phasme: A High Performance Parallel Application-oriented DBMS”, Informatica Journal, Special Issue on Parallel and Distributed Database Systems, 1997.
[4] F. Andres and K. Ono, “The Active HYpermedia Delivery System”, in Proceedings of ICDE98, Orlando, USA, February 1998.
[5] F. Andres, A. Kawtrakul, K. Ono et al., “Development of Thai Document Processing System based on AHYDS by Network Collaboration”, in Proc. 5th International Workshop of Academic Information Networks on Systems (WAINS), Bangkok, Thailand, December 1998.
[6] D.C. Blair, “Language and Representation in Information Retrieval”, Elsevier Science Publishers, 1990.
[7] Charoenporn, T. et al., “Building a large Thai Text Corpus-part-of-speech Tagged Corpus: Orchid”, Proc. of NLPRS, 1997, pp. 509-512.
[8] Charoenpornsawat, P. et al., “Feature-based Thai Unknown Word Boundary Identification Using Winnow”, Proc. of IEEE, 1998, pp. 547-550.
[9] E. Charniak, “Statistical Language Learning”, MIT Press, 1993.
[10] W. W. Cohen and Y. Singer, “Context-sensitive learning methods for text categorization”, in Proceedings of the 19th Int. ACM SIGIR Conference on Research and Development in Information Retrieval, 1996.
[11] D. J. Ittner, David D. Lewis, and David D. Ahn, “Text Categorization of Low Quality Images”, Fourth Annual Symposium on Document Analysis and Information Retrieval, Las Vegas, pp. 301-305, 1995.
[12] Jittapunkul S. and Areepongsa, “Speaker-Independent Thai Numeral Speech Recognition Using Hidden Markov Model and Vector Quantization”, Proc. of SNLP, 1995, pp. 370-378.
[13] Karoonboonyanan, T. et al., “A Thai Soundex System for Spelling Correction”, Proc. of NLPRS, 1997, pp. 633-644.
[14] Kanlayayanawat, W. et al., “Automatic Indexing for Thai Text with Unknown Words Using Trie Structure”, Proc. of NLPRS, 1997, pp. 115-120.
[15] Kawtrakul, A. et al., “A Statistical Approach to Thai Word Filtering”, The 2nd Symposium on Natural Language Processing, Bangkok, pp. 398-406.
[16] A. Kawtrakul et al., “A Lexibase Model for Writing Production Assistant System”, in Proceedings of the 2nd Symposium on Natural Language Processing, Bangkok, pp. 226-236, 1995.
[17] A. Kawtrakul, C. Thumkanon, T. Jamjanya, Muagyunnan, K. Poolwan, Y. Inagaki, “A Gradual Refinement Model for A Robust Thai Morphological Analyzer”, COLING 96, pp. 1086-1089, 1996.
[18] Kawtrakul, A. et al., “Thai Morphological Analysis”, Final Report to the Kasetsart University Research and Development Institute, 1996.
[19] Kawtrakul, A. et al., “Grammar and Style Checking for Thai Sentences”, A Progress Report to the National Research Council of Thailand, 1997.
[20] Kawtrakul, A. et al., “The Development of Resources on Network for NLP Researches”, A Progress Report to the National Electronics and Computer Technology Center, 1997.
[21] Kawtrakul, A. et al., “Automatic Thai Unknown Word Recognition”, Proc. of NLPRS, 1997, pp. 341-346.
[22] Kawtrakul, A. et al., “Grammar and Style Checking for Thai Sentences”, Final Report to the National Research Council of Thailand, 1998.
[23] Kawtrakul, A. et al., “Towards Automatic Multilevel Indexing for Thai Text Information Retrieval”, Proc. of IEEE, 1998, pp. 551-554.
[24] Kawtrakul A. et al., “Backward Transliteration for Thai Document Retrieval”, Proc. of IEEE, 1998, pp. 563-566.
[25] Kawtrakul A., Andres, F. et al., “A Prototype of Globalize Digital Libraries: The VLSHDS Architecture for Thai Document Processing”, 1999 (in the process of submission).
[26] Kiat-arpakul R. et al., “A Combined Phoneme-based and Demisyllable-based Approach for Thai Speech Synthesis”, Proc. of SNLP, 1995, pp. 361-369.
[16] Kongkachandra R. et al., “Thai Intonation in Harmonic-Frequency Domain”, Proc. of IEEE, 1998, pp. 165-168.
[27] G. Kowalski, “Information Retrieval Systems: Theory and Implementation”, Kluwer Academic Publishers, second edition, ISBN 0-7923-9926-9, 1998.
[28] Luksaneeyanawin S., “A Thai Text to Speech System”, in Proceedings of the Conference on Electronics and Computer Research and Development, NECTEC, 1992.
[29] Maneenoi E. et al., “Modification of BP Algorithm for Thai Speech Recognition”, Proc. of NLPRS, 1997, pp. 287-291.
[30] Nuntiyagul A., “Thai Text to Speech Synthesis”, M.Sc. Thesis, Chulalongkorn University, 1989.
[31] Ongroongruang, S. et al., “English to Thai Word Retrieval Using Sound Index”, Proc. of SNLP, 1995, pp. 407-419.
[32] Peter Schauble and Alan F. Smeaton, “Summary Report of the Series of Joint NSF-EU Working Groups on Future Directions for Digital Libraries Research”, DELOS Working Group Report 98/W004, http://www.iei.pi.cnr.it/DELOS/NSF/nsf.htm
[33] G. Salton, “Automatic Text Processing. The Transformation, Analysis, and Retrieval of Information by Computer”, Singapore: Addison-Wesley Publishing Company, 1989.
[34] Thubtong N., “A Thai Speech Recognition System Based on Phonemic Distinctive Features”, Master of Science Thesis, Department of Computer Engineering, Chulalongkorn University, 1995.
[35] Tungthangthum A., “Tone Recognition for Thai”, Proc. of IEEE, 1998, pp. 157-160.
[36] AHYDS System, NACSIS project, http://www.rd.nacsis.ac.jp/~andres/db/ahyds.html
[37] Phasme Application-oriented DBMS System, NACSIS R&D department, http://www.rd.nacsis.ac.jp/~andres/db/phasme.html
[38] Special Issue on Digital Libraries, Communications of the ACM, April 1995, Volume 38, number 4.
[39] Washington University, Center for Distributed Object Computing, http://www.cs.wustl.edu/~schmidt/doccenter.html
[40] Report of “NACSIS-NAiST” project, 1999.