Information Extraction for Thai Documents

Rattasit Sukhahuta and Dan Smith
University of East Anglia, School of Information Systems, Norwich, NR4 7TJ, UK
Email: {R.Sukhahuta, Dan.Smith}@uea.ac.uk

Abstract

An increasing amount of electronically available information is stored in Asian language documents, which makes Information Retrieval (IR) and Information Extraction (IE) for these languages important for a large number of users. Analysis and extraction of information in these languages present several interesting problems not seen in Western European languages; these are interesting in their own right and for the insights they can give into more general IR and IE techniques. We describe these problems and our system for Thai language IE. One of the main concerns when working with Thai natural language is that the structure of the language itself is highly ambiguous. The analyser therefore requires more sophisticated techniques and large amounts of domain knowledge to cope with these ambiguities. We describe our approach to a natural language analysis system that performs preprocessing for the Thai language, and an extraction module that retrieves specific information according to predefined concept definitions.

Keywords: Information Extraction, Word Segmentation, Part-of-Speech Tagger, Phrase Structure, Grammar Parser.
1. Introduction
Information retrieval (IR) and extraction (IE) for Asian language documents are becoming more important as the amount of electronically stored information in these languages increases. Information extraction in these languages presents several problems not seen in Western European languages, which are interesting in their own right and for the insights they can give into more general techniques. For pragmatic reasons our work focuses on information extraction for Thai language documents; similar considerations apply to other South Asian and South East Asian languages. For IE, the information of interest is specified by profiles using key terms and rule patterns, so-called concept definitions. Each concept definition contains trigger terms and internal data representations. The concept definitions should then be able to capture the content of information within a similar context and cope with differences in the expression of the content (synonyms, different measurement systems, phraseology, …). Working with Thai documents introduces many problems during language analysis which prevent the correct information from being extracted. Thus the problems of structural ambiguity first need to be resolved to organise the information of interest into a more structured form.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages. Copyright ACM 1-58113-300-6/00/009 … $5.00

The primary motivation for this work is to determine the depth and complexity of the analysis needed to retrieve specific information efficiently. Our approach to Thai information extraction is, first, to provide preprocessing analysis modules that assess the surface structure of the text. These processes require natural language processing with domain knowledge acquisition, and include word tokenisation to segment the stream of characters into individual syllables. The syntactic category is then determined for each token, which gives the system the ability to find the exact role of a particular word in the sentence. The analyser then uses the syntactic information to build the syntactic tree according to context-free grammar rules. Chomsky's Government and Binding theory [6] is introduced to analyse the phrase structure trees and to extract specific information.

The remainder of the paper is organised as follows. In section 2 we first present related work in the area of Asian language processing. Section 3 briefly describes the structure of the Thai language. Sections 4 and 5 describe the text analysis and information extraction approaches we have adopted. Section 6 presents the experimental designs and the results. Section 7 outlines the major open issues and planned future work in this project.

2. Related Work

The exploitation of natural language processing techniques for IE and IR is not new; several projects [3,
23, 18] have successfully integrated linguistic techniques into their systems. However, when working with Asian languages that have a series of phonetic symbols and an orthographic structure such as Thai, Japanese, Chinese and Korean, new problems and challenges are introduced. A technique of augmented maximal matching to segment Chinese phonetic symbols has been proposed [19]. Several projects have been devoted to Thai word segmentation: longest matching (greedy match) [22], maximal matching [26], sistrings (semi-infinite strings) using a dictionary as the primary knowledge source [8], statistical approaches using probabilistic determination [11] and a feature-based approach [20].

[Thai example text omitted]
Figure 1: Example of the Thai text

Many Asian languages also share the problem of recognising unknown words. Normally these words are compound nouns, foreign words or proper nouns. In [13], unknown words in Thai are classified into categories such as proper nouns, loan words, acronyms, foreign words, mistypes and official place names. This is more important than in many European languages because some Asian languages do not use capitalisation. In [1], a corpus-based learning method was applied to detect proper nouns or unknown words in Chinese. For Thai, [13] used a combination of a statistical model and a set of context-sensitive rules to detect unknown words, and also incorporated a spelling checker. Much of this work applies linguistic techniques to language analysis. In [13, 7, 27], an n-gram part-of-speech model was used to determine the syntactic category. Kawtrakul [13] used probabilistic semantic tagging in the semantic attribute classification.
In order to detect Thai loan words, [10] used a backward transliteration model to search for the English terms from which Thai transcriptions derive. To resolve syntactic ambiguities, [15] represented phrase structure trees using probabilistic context-free grammar rules to determine the correct syntactic tree. N-gram probabilistic models have also been used for word-based indexing [29, 14], phrase-based indexing [7] and multi-level indexing [12]. In [8], a trie structure was used for indexing as well as for detecting unknown words in Thai. The approach to Thai IE presented in this paper operates
mainly on the basis of syntactic analysis using a Phrase Structure Grammar (PSG). Our work provides a full-text analysis that parses unstructured Thai sentences into phrasal tree structures. The information of interest can then be found within these structures. We believe this method is an improvement over the classical pattern-matching techniques in wide use.

3. Aspects of written Thai

In Thai, sentences are written as a long series of characters without word or sentence markers. The Thai alphabet consists of 44 consonants, 32 vowels and 4 tone marks; there are no capital letters. Thai sentences exhibit SVO (Subject-Verb-Object) word order [21]. Figure 1 shows an example of Thai sentences selected from the Thai import and export domain1. Since there are no changes in word form or word inflection as an expression of tense, case or gender, word ordering plays an important part in determining the syntactic role of a word [28]. The same word form in different positions has different syntactic properties and therefore conveys different meanings. To express tense and case, additional words are often inserted to clarify the meaning. Thai grammar does not follow the extended projection principle [6], found in English, where a sentence must have an overt subject. The subject can be omitted even if it is pronominal; this characteristic is referred to as the null subject parameter. Thai contains relatively few headwords [4]. Many Thai words are formed from a combination of different nouns, verbs and auxiliaries to form compound nouns. In sentence formation, words are combined without separation, such as a space in English. In addition, any Thai sentence can be embedded as a subordinate clause in a complex sentence, sometimes referred to as a sub-sentence. This makes tokenisation of words and sentences much more difficult.

4. Preprocessing Process

Thai documents in the form of natural language text often have a highly ambiguous structure.
The initial steps of language understanding are to analyse and resolve these ambiguities. This section describes the automatic processing of natural language using various linguistic methods. First, the morphological level of linguistic processing is concerned with recognising paragraphs and word forms. Next, the lexical level deals with the analysis of words and syntactic features (e.g. nouns, verbs, adjectives), using an n-gram probabilistic model to resolve part-of-speech (POS) ambiguity. Finally, a phrase structure grammar is used to produce a tree structure for the sentences. Figure 2 illustrates the overall architecture of the system.
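The three-stage architecture just described can be sketched as a simple pipeline. The function names and placeholder stage bodies below are our illustration, not the authors' implementation; a real system would replace each stage with the techniques of sections 4.1-4.3.

```python
# Minimal sketch of the three preprocessing stages, composed in order.
# All names and the placeholder behaviour are illustrative assumptions.

def tokenise(paragraph):
    """Morphological stage: split a character stream into words
    (longest matching against a dictionary, section 4.1)."""
    return paragraph.split()  # placeholder: real Thai input has no spaces

def tag(words):
    """Lexical stage: assign a part-of-speech tag to each word
    (n-gram probabilistic model, section 4.2)."""
    return [(w, "NCMN") for w in words]  # placeholder: tag everything NCMN

def parse(tagged):
    """Syntactic stage: build a phrase-structure tree from the tagged
    words (context-free grammar rules, section 4.3)."""
    return ("NP", tagged)  # placeholder: one flat phrase node

def preprocess(paragraph):
    return parse(tag(tokenise(paragraph)))

print(preprocess("product imported"))
```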
1 http://www.dbe.moc.go.th/
Figure 2: System Architecture

4.1 Morphological Analysis

Since there is no clear indication of sentence boundaries, it is not possible for a computer program to process the document using a sentence-based approach. In the text sets we have used, it is usually possible to identify paragraph breaks, taking a paragraph as one or more sentences separated by space. However, space may also be used for other purposes, making the identification of paragraphs unreliable and complicating the syntactic analysis. In Thai, a space can also be used as a separator between numbers, abbreviations, double quotes, and terms used as examples [17]. Since the notion of a sentence is not clear, we refer to a chunk of text separated by spaces as a 'context'. This process simply inserts a paragraph marker at each empty line, which is presumed to be the end of the paragraph. For each paragraph, morphological analysis or word tokenisation is applied to segment the series of characters into words. The words are tokenised using a longest word matching algorithm. In this technique, patterns of words are formed and validated against the list of words in the dictionary. The algorithm is greedy, matching the longest words found in the dictionary. The system therefore provides a backtracking approach, using a bigram probabilistic model based on statistical information collected from a training corpus, to find the word sequence with the largest probability. As a result, each token or word is separated by a word break marker.
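The greedy longest-matching step with backtracking can be sketched as follows. A toy Latin-script dictionary stands in for a real Thai head-word list, and the backtracking here is simple dead-end recovery; the paper's system additionally scores alternatives with a bigram probabilistic model.

```python
# Sketch of dictionary-based longest-matching word segmentation.
# DICTIONARY is an illustrative stand-in for a Thai head-word list.
DICTIONARY = {"mae", "maew", "window", "sill"}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def longest_match_segment(text):
    """Greedily segment `text` into dictionary words, backtracking to a
    shorter match whenever the greedy choice leads to a dead end."""
    def helper(pos):
        if pos == len(text):
            return []
        # Try candidate words from longest to shortest (greedy first).
        for end in range(min(len(text), pos + MAX_WORD_LEN), pos, -1):
            if text[pos:end] in DICTIONARY:
                rest = helper(end)
                if rest is not None:
                    return [text[pos:end]] + rest
        return None  # no segmentation possible from this position

    return helper(0)

print(longest_match_segment("windowsill"))   # ['window', 'sill']
print(longest_match_segment("maewindow"))    # greedy 'maew' dead-ends, backtracks to 'mae'
```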
The approach used for word tokenisation in our system is based on a dictionary-based algorithm combined with a bigram probabilistic model. The dictionary contains a list of head words and some compound nouns. We should note that a head-word dictionary containing only single syllables will be much smaller than a dictionary that also includes compound nouns. In the former case, however, segmented words will be fragmented into several units and the original meaning is sometimes lost. For example, the compound word ¹Ó ä»ãËé consists of three different verbs, ¹Ó /bring, ä»/go and ãËé/give, but when these words are combined into a single word it means "bring to".

To reduce the number of ambiguous structures, a finite state calculus using regular expression operators is used to define the patterns of words whose lexical structure is recognisable [5, 9]. The syntax to capture Thai characters is specified using the standard character code ISO646. Examples of words with recognisable structures are numbers (e.g. 3,568,254.00, including Thai numerals such as ñ ò ó ô õ ö ÷ ø ù ð), times (e.g. 24.05 ¹.), dates (e.g. 25 Á¡ ÃÒ ¤Á 25432), punctuation marks, foreign-language terms and abbreviations (e.g. ¾.È .). Figure 3 shows a list of the regular expressions defined to capture Thai-specific lexical elements.

Thai abbreviations
1. /([\xA1-\xCE]+\.(?:[\xA1-\xCE]+\.)+)/
2. /([\xA1-\xCE]+\.)/

Thai numbers
1. /(-?[\xF0-\xF9]+(?:\,[\xF0-\xF9]+)+(?:\.[\xF0-\xF9]+)?)/
2. /(-?(?:[\xF0-\xF9]+)(?:\.[\xF0-\xF9]+)+)/
3. /([\xF0-\xF9]+)/

Date
/(\d\d?\/\d\d?\/25\d\d)/

Time
/(\d\d:\d\d\s+¹.)/

Figure 3: Example of the regular expressions defined to capture Thai words with a specific lexical structure

2 Thailand follows the Buddhist calendar; the year 2543BE is equivalent to 2000AD.

Morphological analysis is one of the most important processes required by languages, such as Thai and Chinese, that do not mark word or sentence boundaries. The quality of the subsequent processes relies on how well the tokenisation is performed. Nevertheless, there are still problems lying within the orthography of the Thai language itself. For instance, the sequence ÊÔ¹ ¤é Ò + ¶Ù¡ + « ×éÍ (product + cheap or being + buy) can be segmented as either ÊÔ¹ ¤é Ò ¶Ù ¡ + « ×éÍ (cheap product + buy) or ÊÔ¹ ¤é Ò + ¶Ù ¡ « ×éÍ (product + is bought).
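A regular-expression pre-pass of this kind can be sketched as a scanner that tries each pattern in priority order at every position and skips over any recognised lexical element. The byte ranges follow figure 3 (TIS-620 encoding: \xA1-\xCE covers Thai consonants); the exact pattern set and ordering are our illustration, not the authors' full rule set.

```python
import re

# Pattern priority matters: DATE and TIME must be tried before NUMBER,
# or a date such as 25/1/2543 would be consumed piecemeal as numbers.
LEXICAL_PATTERNS = [
    ("ABBREV", rb"(?:[\xA1-\xCE]+\.)+"),             # Thai abbreviations
    ("DATE",   rb"\d\d?/\d\d?/25\d\d"),              # Buddhist-era dates
    ("TIME",   rb"\d\d?[.:]\d\d"),                   # e.g. 24.05, 13:30
    ("NUMBER", rb"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?"),  # e.g. 3,568,254.00
]

def scan(data: bytes):
    """Scan `data` left to right; where a lexical pattern matches, emit a
    (label, lexeme) token and skip past it, so the word segmenter never
    has to deal with numbers, dates or abbreviations."""
    tokens, pos = [], 0
    while pos < len(data):
        for label, pattern in LEXICAL_PATTERNS:
            m = re.match(pattern, data[pos:])
            if m:
                tokens.append((label, m.group()))
                pos += m.end()
                break
        else:
            pos += 1  # ordinary character: left to the word segmenter
    return tokens

print(scan(b"imported 3,568,254.00 tons on 25/1/2543"))
# [('NUMBER', b'3,568,254.00'), ('DATE', b'25/1/2543')]
```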
In the word tokenisation process, the input text is treated as a stream of characters, with special treatment of spaces, while skipping words that have already been recognised by the regular expressions.

4.2 Syntactic Analysis

Thai words usually belong to multiple syntactic groups. In Thai grammar, there is no change in word form for different syntactic categories; the interpretation of a word depends on the syntactic category to which it belongs. Word ordering and position in the sentence are used to determine which syntactic group a word belongs to. In this analysis, a trigram probabilistic model is used to tag each token with the appropriate part of speech (POS). The POS tag set used in our experiments is the Orchid set [27], which consists of 47 tags. The training text was obtained from the Orchid corpus, consisting of 6Mb of Thai text manually segmented into sentences and words, where each word is tagged with its POS. The corpus contains 23,125 sentences of Thai words with some embedded English terms, numbers, and proper nouns used in the electronics industry. The punctuation mark tags were removed from the training corpus to prevent false frequency counts for word sequences following and preceding punctuation marks. To make p(w_i | w_{i-1}) meaningful for i = 1 [2], we pad the beginning and end of each training sentence with w_bos and w_eos tokens, e.g. "w_bos FIXN VACT NCMN NCMN CFQC DONM w_eos". In the trigram part-of-speech model (n = 3), the probability estimates are the transitional probabilities
p(w_i | w_{i-1}, w_{i-2}) = c(w_{i-2}, w_{i-1}, w_i) / Σ_{w_i} c(w_{i-2}, w_{i-1}, w_i)    (1)

where c(w_{i-2}, w_{i-1}, w_i) is the number of times the trigram w_{i-2}, w_{i-1}, w_i occurs in the training corpus; the denominator is equivalent to counting how often the bigram w_{i-2}, w_{i-1} occurs.

Once the syntactic form of the words in a document has been identified, words with proper-noun (NPRP) syntax are directly assigned semantic classes. This is done by direct mapping between words and the word entries in the semantic dictionary. Note that, at this stage, we do not attempt to resolve semantic ambiguity; words may therefore be assigned to multiple semantic classes. This semantic information is used later as one of the constraints specified in the concept definitions.

4.3 Phrase Structure Grammar

The next step of the analysis involves finding the syntactic surface structure for a given sequence of tagged text. The phrase structure rules used in the system are based on a context-free phrase structure grammar for Thai. In these rules, the highest projection of the labelled node is a phrasal unit (e.g. noun phrase (NP), verb phrase (VP), prepositional phrase (PP) and classifier (CL)) rather than a sentence unit (S). The rule for the noun phrase can be written as "NP → N + (VP) + (PP) + (CL)", where variables are joined with the "+" symbol and variables inside parentheses are optional. The rules are written as rewrite rules because a non-terminal symbol (e.g. NP) can be replaced by the right-hand side of its rule regardless of the context in which the symbol appears [24]. Since it is not possible to define sentence boundaries, the contexts separated by spaces are assumed to be phrasal units. However, since spaces can also be used to separate numbers, labels, examples and abbreviations, we first need to remove these spaces so that content belonging together is joined into a single phrase. The grammar parser operates in a top-down manner using the syntactic information obtained from the POS analysis. The results returned by the grammar parser are labelled bracketings in which each lexical head represents a phrasal unit. For example, the sentence "à´¹ÁÒ Ãì¡ ¹Ó à¢éÒ ÊÔ¹ ¤é Ò à¹×éÍ ä¡ è¨Ò ¡ »ÃÐà· È ¤Ùèá¢è§¢Í §ä· Â" (Denmark imports chicken products from the rival country of Thailand) is parsed as

NP(N'([à´¹ÁÒ Ãì¡ ,Country,nprp]) VP(V'([¹Ó à¢éÒ ,,vact]) NP(N'([ÊÔ¹¤éÒ ,,ncmn] [à¹×éÍ ä¡ è,,ncmn]) PP(Prep'([¨Ò ¡ ,,rpre]) NP(N'([»ÃÐà· È ,,ncmn] [¤Ùèá¢è§,,ncmn]) PP(Prep'([¢Í §,,rpre]) NP(N'([ä· Â ,Country,nprp]))))))))
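The trigram estimate in equation (1) can be computed directly from corpus counts. The tags and the tiny training corpus below are illustrative; a real model would be trained on the Orchid corpus and smoothed for unseen trigrams.

```python
from collections import defaultdict

BOS, EOS = "<bos>", "<eos>"

def train(tagged_sentences):
    """Count tag trigrams and the corresponding preceding bigrams."""
    trigrams, bigrams = defaultdict(int), defaultdict(int)
    for tags in tagged_sentences:
        padded = [BOS, BOS] + tags + [EOS]
        for i in range(2, len(padded)):
            trigrams[tuple(padded[i-2:i+1])] += 1
            bigrams[tuple(padded[i-2:i])] += 1
    return trigrams, bigrams

def p(tag, prev2, prev1, trigrams, bigrams):
    """p(w_i | w_{i-1}, w_{i-2}) as a relative frequency, equation (1):
    trigram count over the count of the preceding bigram."""
    denom = bigrams[(prev2, prev1)]
    return trigrams[(prev2, prev1, tag)] / denom if denom else 0.0

# Two toy tag sequences standing in for Orchid training sentences.
corpus = [["FIXN", "VACT", "NCMN", "NCMN"],
          ["FIXN", "VACT", "NCMN", "DONM"]]
tri, bi = train(corpus)
print(p("NCMN", "FIXN", "VACT", tri, bi))  # 1.0: FIXN VACT is always followed by NCMN
```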
Figure 4: Example of the extraction system

The results generated by the parser may contain several phrase structures, which are represented as phrase structure trees. Figure 4 illustrates the tree diagram of the phrasal units. The leaves at the bottom of the trees show the Thai words, their associated syntax, and a list of possible direct English translations for each word.

5. Concept Definitions

In recent years, many information extraction approaches have focused on describing the information of interest through a concept-based strategy [16, 23, 25]. These
concepts describe the internal structure of how the actual information is represented. The "concept definitions" or "extraction patterns" usually contain keywords (more generally, key terms), sometimes referred to as trigger words, which activate the concept matching process when they are found within a context. Our concept matching process is based on regular expressions and a frame-based representation of concept instances. Each slot in a concept frame carries a semantic class constraint, enabling the extracted information to be filtered. Once the relevant contexts are identified, the phrase structure trees are constructed and the specific information is extracted from these trees.
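A frame-based concept definition with a trigger term and semantic-class slot constraints can be sketched as follows. The class names, the tiny semantic dictionary and the frame layout are our illustration, not the authors' data structures.

```python
# Toy semantic dictionary mapping terms to semantic classes (assumed).
SEMANTIC_CLASSES = {"Denmark": "Country", "Russia": "Country",
                    "chicken products": "Product"}

# A concept frame: a trigger term plus (slot name, class constraint) pairs.
CONCEPT = {
    "trigger": "import",
    "slots": [("subject", "Country"),
              ("object", "Product")],
}

def match_concept(concept, context_terms):
    """If the trigger occurs in the context, fill each slot with the
    first term satisfying its semantic-class constraint; otherwise the
    concept is not activated and None is returned."""
    if concept["trigger"] not in context_terms:
        return None
    frame = {}
    for slot, required_class in concept["slots"]:
        for term in context_terms:
            if SEMANTIC_CLASSES.get(term) == required_class:
                frame[slot] = term
                break
    return frame

context = ["Denmark", "import", "chicken products"]
print(match_concept(CONCEPT, context))
# {'subject': 'Denmark', 'object': 'chicken products'}
```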
In this approach, the specific information is retrieved from the parse tree using the verb argument structure. The semantic relationships between predicates and their arguments are referred to in terms of thematic roles or theta-roles (θ-roles) [6]. According to the syntactic categories defined in [27], three classes of verbs are distinguished: active, stative and attributive verbs. An active verb like ¹Ó à¢éÒ (import) expresses an activity that involves two participants: first, the person or country that imports, and second, the direct object of the import. We can express this in terms of the general notion of argument structure as

Predicate(Argument1, Argument2, ..., Argumentn).

In this example, the predicate "import" takes two arguments that are realised by two noun phrases; we refer to this kind of predicate as a two-place predicate. If there is only one argument, we refer to it as a one-place predicate [6]. The argument structure cannot always be determined from the syntactic information alone, since the thematic roles need to be clarified from the semantic meaning. Following formal logic, we can say that every verb predicate requires at least one argument, although some arguments may be left implicit. The number of arguments required also depends on how much information is available within the context. For example, the sentence "Russia exported chicken products" contains two arguments, exports[NP, NP], which are Russia and chicken products, while the sentence "Russia exported chicken products to the European countries" has three arguments, exports[NP, NP, PP], where the third argument is to the European countries. According to X'-theory [6], the predicate is assumed to be the head verb of the verb phrase node, the V' (verb-bar). The complement can be a single noun phrase that follows the verb predicate, together with additional adjunct information (usually found in prepositional phrases) such as place, time and quantifiers.

In order to capture specific information from different sentences with similar structure, we need to assume that these structures have the same underlying structure. For the Thai language, the number of arguments required for each predicate cannot be determined directly from its syntactic properties; semantic information is required to determine the actual θ-role of a word in the sentence. Therefore, users must understand the meaning of the trigger terms, what activity they express, and how many participants are involved in the argument structure. These techniques are intended to improve the recall and precision of the extraction results. Figure 5 illustrates an example of a concept definition defined in the Thai import/export domain to identify the name of a country and the imported products.

Concept: Name of the country and products imported.
Rule: [¹Ó à¢éÒ ] {0,1,1,1}
First slot constraint:
Second slot constraint:
Trigger word: ¹Ó à¢éÒ
Parameters: {Retrieve whole tree, Include adjuncts, Include trigger terms, Use semantic constraint}
Figure 5: Example of the concept definitions

Since the analysed documents are in the form of phrase tree structures, we can optimise the way concept definitions are defined using the notions of "noun phrase" and "verb phrase". During extraction, the system first identifies the trees that contain the trigger terms specified in the concept definitions. For each tree, the node that matches one of the trigger terms is treated as the starting node. First we find the c-commanding node by moving upward until we reach the first branching node. We can then move downward, either leftward or rightward, following the branches of the tree that are c-commanded by the starting node. For instance, according to the concept definition given in figure 5, the system will search for the node that contains the trigger word "import". In this example, the trigger word is found in the phrase tree of figure 6, and the node found is V'. The system then moves upward until it finds the first branching node, in this case the VP node. To the left of the VP is the N' node, which is assumed to be the subject (the node c-commanded by the VP), and on the right is the c-command domain of the V' node, denoting the object. In this experiment, we assume that the subject is the last possible antecedent of a tree. All elements of the c-commanded nodes are referred to as the "c-command domain". In this example, if all the branches of N' (subject, specifier) and NP (object, adjuncts) are visited, the retrieved information for the subject is "Denmark" and for the object "chicken products from France and England". For the object, the c-command domain also includes the prepositional phrase (PP) "from France and England", which is a location adjunct. However, if we consider only the main predicates of the subject and object, visiting only the head noun of each c-commanded node, the results are refined to "Denmark" and "chicken products".
If the starting node is the head node of the tree in which the trigger word is found, we cannot move any further to the left. For instance, the trigger term "import" may be found in a verb phrase node while the concept definition requires a subject for this trigger word. In this case, the extraction system should be able to expand its search for the noun phrase node to the previous tree. From the extraction results, we can then use the semantic class specified in the concept definition as a filtering constraint: in this example, the extracted data for the subject must be classified as class Country.
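The c-command walk described above can be sketched as a tree traversal: locate the node whose children contain the trigger, climb to the first branching ancestor, take the right-hand c-command domain as the object, and the material to the branching node's left as the subject. The (label, children) tree encoding is our simplification of the parser output.

```python
def leaves(node):
    """Collect the words under a (label, children) tree node."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for child in children for w in leaves(child)]

def find_path(node, trigger, path):
    """Return the root-to-node path ending at the node whose children
    contain the trigger word, or None."""
    if isinstance(node, str):
        return None
    _, children = node
    if trigger in children:  # trigger is a leaf child of this node
        return path + [node]
    for child in children:
        found = find_path(child, trigger, path + [node])
        if found:
            return found
    return None

def extract(tree, trigger):
    """Climb from the trigger's node to the first branching ancestor;
    the object is its right-hand c-command domain, the subject the
    material to the left of the branching node in the enclosing phrase."""
    path = find_path(tree, trigger, [])
    if path is None:
        return None
    for i in range(len(path) - 1, 0, -1):
        siblings = path[i - 1][1]
        if len(siblings) > 1:          # first branching node found
            j = siblings.index(path[i])
            obj = [w for s in siblings[j + 1:] for w in leaves(s)]
            subj = []
            if i >= 2:                 # material left of the branching node
                outer = path[i - 2][1]
                k = outer.index(path[i - 1])
                subj = [w for s in outer[:k] for w in leaves(s)]
            return subj, obj
    return None

# S -> NP(Denmark) VP(V'(import) NP(chicken products PP(from France)))
tree = ("S", [("NP", [("N'", ["Denmark"])]),
              ("VP", [("V'", ["import"]),
                      ("NP", [("N'", ["chicken", "products"]),
                              ("PP", ["from", "France"])])])])
print(extract(tree, "import"))
# (['Denmark'], ['chicken', 'products', 'from', 'France'])
```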
Figure 6: The phrasal tree structure

6. Experiment Results and Discussion

We performed experiments to measure the effectiveness of the preprocessing stages, to ensure that their results do not adversely affect the subsequent analysis. These experiments cover word segmentation and part-of-speech (POS) tagging (for both the general and the specific POS tag sets) and were conducted on the Orchid corpus [27]. This corpus contains 160 documents of 5.75Mb, or 311,426 words, divided into contexts, with each context segmented into words tagged with a specific part-of-speech. Two sets of documents were created from the original format provided by the corpus: text documents without word indications, and segmented documents. These two sets were used to evaluate the segmentation and part-of-speech modules respectively. The performance of each module is measured by the number of words correctly segmented and tagged with the appropriate POS. The segmentation technique described in section 4, using a list of 12,657 head words, has an average accuracy of 79%; when an additional 6,192 domain-specific terms are added to the dictionary, the accuracy improves to 83% (shown in figure 7). This implies that the number of entries in the dictionary has only a slight effect on segmentation accuracy.

Figure 7: Word segmentation results

The POS tagger experiments were performed on the same corpus using two different tag sets: first, 14 general tags, and second, the 47-tag set in which the syntax is divided into subcategories; for instance, nouns are divided into proper nouns, cardinal numbers, label nouns, common nouns and title nouns [27]. The average accuracy of the tagger with specific POS tags is 60% and with general POS tags 71.5% (shown in figure 8). This distinction was made because the context-free grammar rules for the parser are mostly specified using the general POS tags; the lower accuracy on the specific POS tags should therefore not have much impact on the parser.

Figure 8: Part-of-speech tagging results

For IE, the experiment was conducted on the Thai import and export database. The design of the concept definitions is divided into two phases. First, concept patterns are defined using regular expressions to capture the specific contexts that contain relevant information; second, patterns defined using phrase structure grammar rules are used to extract specific information or phrasal units from the parse tree. Capturing the information of interest may require one or both parts, depending on how the information is represented. Initially, the documents are tagged using an SGML DTD to structure the relevant information, e.g. … … Once all the documents are tagged, the Concept Generator module automatically generates a list of concept definitions, as shown in figure 9.

Country-Import-Products:
1. \[[\xA1-\xF9_\w\d\.]+,Country,\w+\] \[{¹Ó à¢éÒ },,vact\].*?
2. [¹Ó à¢éÒ ] {0,1,1,1}

Figure 9: Example of the concept definitions

The extraction module is evaluated by measuring the number of correct concepts extracted. Table 1 shows a list of topics and the number of concepts associated with each topic. The precision and recall figures are shown in table 2. As the results in table 2 show, the average precision is around 42%. Most of the concepts performed well in identifying the relevant context, judging from the concepts with recall over 70%.
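The recall and precision figures of table 2 follow the standard definitions: precision is the fraction of extracted concept instances that are correct, and recall the fraction of the correct instances that were extracted. A minimal sketch, with illustrative instance sets:

```python
def precision_recall(extracted, relevant):
    """Compute (precision, recall) for a set of extracted concept
    instances against the set of relevant (gold) instances."""
    extracted, relevant = set(extracted), set(relevant)
    correct = extracted & relevant
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Illustrative (subject, object) instances, not the paper's data.
extracted = {("Denmark", "chicken products"), ("Denmark", "pork")}
relevant = {("Denmark", "chicken products"), ("Russia", "chicken products")}
print(precision_recall(extracted, relevant))  # (0.5, 0.5)
```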
However, precision, which indicates how well the correct information is extracted from the relevant context, is still low. One cause of these errors is the complexity and ambiguity of the sentence structures.

Topic Category                    No. Concepts
Products Imported                 47
Year and Products imported        6
Year and Amount imported          6
Supplier Countries                3
Country and Products imported     13
Year and Countries imported       2
Year and Products exported        2
Country and Products exported     6
Country and Amount exported       2
Others                            8
Table 1: Classification of concept definitions

Although most Thai sentences exhibit SVO order, in practice this order can change. One of our main concerns during extraction is that the extracted results should be meaningful, in the sense that they contain enough information to describe an event of interest. We have noticed that when all elements of the c-command domain are included, they provide meaningful results. In this experiment there is still a large amount of general knowledge to be considered. As mentioned earlier, Thai does not use capitalisation, so proper nouns such as names of countries, government institutions and currency units need to be included in the dictionary. Since the information in the Thai import and export domain can be presented as a long description of when, where and how an event related to imported and exported products occurred, the question is how much information should be extracted.

This approach requires many analysis steps, so the performance of the extraction system depends on the preprocessing stages. The ambiguous sentence structures of the input documents are also a major cause of degraded parser efficiency. In general, the grammar rules can be applied to most documents, except those containing many embedded foreign-language words and phrases. Documents containing many punctuation marks such as parentheses, hyphens and brackets can cause problems during parsing in the current version of our system, since these do not fall under any grammar rule.

Concept  R     P        Concept  R     P
#30      0.33  0.22     #41      0.75  0.60
#73      0.38  0.70     #3       0.75  0.21
#42      0.40  0.13     #39      0.75  0.13
#43      0.43  0.13     #66      0.76  0.42
#70      0.50  0.14     #64      0.77  0.50
#40      0.50  0.07     #74      0.79  0.66
#48      0.54  0.35     #1       0.79  0.80
#31      0.57  0.53     #67      0.82  0.77
#68      0.58  0.58     #15      0.83  0.29
#44      0.67  0.31     #24      0.83  0.56
#16      0.67  0.40     #71      0.83  0.40
#23      0.67  0.40     #51      0.83  0.33
#5       0.67  0.13     #20      0.89  0.42
#69      0.68  0.54     #65      0.93  0.65

Table 2: The extraction results

7. Conclusion
In this approach, Thai language documents can be analysed and transformed into syntactic tree structure. The phrase structure organisation provides the ability for the information extraction system to capture specific data using c-commands rules. This approach has an advantage over pure or simple pattern matching in that it organises the context structure into groups of phrases where the extraction engine can detect the content boundaries. Therefore, the enhancement of information searching content structures that differ only slightly can be improved. The process requires many sophisticated preprocessing steps that analyse and organise natural language documents into the phrase structures. Each step of analysis provides the accessibility toward structure understanding. Morphological analysis plays an important role as to segment the ‘chunks’ of contents into individual words. Syntactic information provides additional information of roles for words in the sentences. Finally, a syntactic tree or phrase structure grammar tree is used to organise the context for the ease of extraction and information tracing. This experimental, however, is performed mainly based on the syntactic surface analysis without realising the semantic features and word sense disambiguities. Thus it is still far from text understanding but it gives an idea of how surface structure analysis can be achieved given unstructured input documents. For future development, we believe that word sense disambiguation can be introduced into the system to improve the number precision. References [1]
K.-J. Chen and M.-H. Bai. Unknown word detection for Chinese by a corpus-based learning method. Computational Linguistics and Chinese Language Processing, 3(1), 27-44, 1998.
[2]
S. F. Chen. Building Probabilistic Models for Natural Language. PhD thesis, Center for Research in Computing Technology, Harvard University, 1996.
[3]
F. Ciravegna and N. Cancedda. Integrating shallow and linguistic techniques for information extraction from text. Proc. Conf. Italian Assoc. for A.I., 127-138, 1995.
[4]
D. Cooper. How to read and know more: Approximate OCR for Thai. SIGIR'97, 216-225, 1997.
[5]
G. Grefenstette and P. Tapanainen. What is a word, what is a sentence? Problems of tokenization. Proc. COMPLEX'94. http://www.xrce.xerox.com/publis/mltt/mltt004.ps.
[6]
L. Haegeman. Introduction to Government & Binding Theory. Blackwell, Oxford, 1991.
[7]
N. Kando, K. Kageura, M. Yoshioka and K. Oyama. Phrase processing methods for Japanese text retrieval. SIGIR'98, 1998.
[8]
W. Kanlayanawat and S. Prasitjutrakul. Automatic indexing for Thai text with unknown words using trie structure. Proc. NLP Pacific Rim Symposium, 115-120, 1997.
[9]
L. Karttunen, J.-P. Chanod, G. Grefenstette and A. Schiller. Regular expressions for language engineering. Nat. Lang. Eng., 4, 305-328, 1996.
[10]
A. Kawtrakul, A. Deemagarn, C. Thumkanon, N. Khantonthong and P. McFetridge. Backward transliteration for Thai document retrieval. IEEE Asia Pacific Conf. on Circuits and Systems, 563-566, 1998.
[11]
A. Kawtrakul, S. Kumtanode, T. Jamjanya and C. Jewriyavech. A lexibase model for writing production assistant system. Symposium on Nat. Lang. Proc., 1995.
[12]
A. Kawtrakul, C. Thumkanon and P. McFetridge. Automatic multilevel indexing for Thai text information retrieval. IEEE Asia Pacific Conf. on Circuits and Systems, 1998.
[13]
A. Kawtrakul, C. Thumkanon, Y. Poovorawan, P. Varasrai and M. Suktarachan. Automatic Thai unknown word recognition. Proc. NLPRS'97, 341-348, 1997.
[14]
J. H. Lee and J. S. Ahn. Using n-grams for Korean text retrieval. SIGIR'96, 216-224, 1996.
[15]
K. J. Lee, J.-H. Kim and G. C. Kim. Probabilistic parsing of Korean sentences using collocational information. Proc. NLPRS'97, 1997.
[16]
W. Lehnert, C. Cardie, D. Fisher, J. McCarthy, E. Riloff and S. Soderland. Evaluating an information extraction system. J. Integrated Computer-Aided Engineering, 1, 1994.
[17]
D. D. Lewis. Feature selection and feature extraction for text categorization. Proc. Speech and Natural Language Workshop, 212-217, Morgan Kaufmann, 1992.
[18]
D. D. Lewis and K. Sparck Jones. Natural language processing for information retrieval. CACM, 39(1), 92-101, 1996.
[19]
A. F. Lochovsky and K. H. Chung. Word segmentation for Chinese phonetic symbols. Proc. Int. Computer Symposium, 991-916, 1994.
[20]
S. Meknavin, P. Charoenpornsawat and B. Kijsirikul. Feature-based Thai word segmentation. Proc. NLPRS'97, 1997.
[21]
R. Pankhuenkhat. Thai Linguistics. Chulalongkorn, 1998.
[22]
Y. Poowarawan. Dictionary-based Thai syllable separation. Proc. Elec. Eng. Conf., 1986.
[23]
E. Riloff and W. Lehnert. Information extraction as a basis for high-precision text classification. ACM TOIS, 269-333, 1994.
[24]
G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
[25]
S. Soderland, D. Fisher, J. Aseltine and W. Lehnert. Crystal: Inducing a conceptual dictionary. Proc. IJCAI'95, 1995.
[26]
V. Sornlertlamvanich. Word segmentation for Thai in a machine translation system (in Thai). Computerized Language Translation, 50-55, 1993.
[27]
V. Sornlertlamvanich, T. Charoenporn and H. Isahara. ORCHID: Thai part-of-speech tagged corpus. Technical Report, Linguistics and Knowledge Science Laboratory (NECTEC), 1997.
[28]
V. Sornlertlamvanich and W. Pantachat. Information-based language analysis for Thai. ASEAN J. Science and Technology for Development, 10(2), 181-196, 1993.
[29]
Y. Ogawa and T. Matsuda. Overlapping statistical word indexing: A new indexing method for Japanese text. SIGIR'97, 226-234, 1997.