Word Segmentation Based on Estimation of Words from Examples

Juntae Yoon, Woonjae Lee and Key-Sun Choi
{jtyoon, wjlee, [email protected]}
KorTerm, Dept. of Computer Science, Korea Advanced Institute of Science and Technology, Taejon, Korea
January 21, 1999
Abstract

From a cognitive point of view, words can be recognized on the basis of learned knowledge obtained from linguistic material; that is, people learn words from the many examples they encounter. We propose a word segmentation algorithm based on knowledge about words estimated both from the local text being processed and from a POS tagged corpus. To show the feasibility of our model, we apply it to the guessing of unknown words arising from morphological analysis failure.
1 Introduction

We continuously learn words by seeing and hearing examples, and we acquire new ones based on learned knowledge and new examples. Recognition and segmentation of words can be viewed as such a cognitive process. Consider the following example:
Figure 1: Words can be generalized from many samples
학교에 (hag-gyo-e, to school)
(a) 학 (hag) + 교에 (gyo-e)
(b) 학교 (hag-gyo) + 에 (e)
(c) 학교에 (hag-gyo-e)
It is possible to divide the eojeol² 학교에 (hag-gyo-e¹) in three ways. Humans know that case (b) is the correct one. They can do this because they latently recognize 학교 (hag-gyo) and 에 (e) as words. In fact, people can recognize and generate many words by generalizing over many examples: they repeatedly encounter examples such as 학교가 (hag-gyo-ga, school+SUBJ marker), 학교를 (hag-gyo-reul, school+OBJ marker), 학교는 (hag-gyo-neun, school+TOP marker), and so on. They likewise come to recognize 에 (e, TO) as a word by seeing and hearing examples like 집에 (jib-e, to (the) house), 서울에 (seo-ul-e, to Seoul), and 산에 (san-e, to (the) mountain) (Figure 1).

¹ We use the Roman alphabet for transliteration, and syllables are delimited by dashes.
² An eojeol is a spacing unit in Korean which consists of a content word and functional words. However, many compound nouns consist of several nouns not delimited by whitespace.

As described above, words can be recognized if many examples are given. In this paper, we propose a method of word segmentation based on this idea and apply it to unknown word guessing. A morphological analyzer inevitably encounters unknown words when analyzing texts because of the restricted size of its dictionary, and accurate guessing of unknown words is crucial to the performance of natural language application systems. For instance, since proper nouns, transliterated foreign words, and compound nouns are content words representative of documents, their correct estimation has a great effect on the performance of an IR system.

In fact, most unknown words are proper nouns, transliterated foreign words, or compound nouns. Among these, proper nouns and transliterated foreign words are words not registered in the dictionary, whereas an eojeol containing a compound noun is assumed to be composed of known words; morphological analysis thus fails for different reasons. Many researchers have treated these as distinct problems and proposed separate models. However, when morphological analysis fails for an eojeol, it is difficult to judge whether the failure was caused by a compound noun or by a word unregistered in the dictionary, so a generalized segmentation model is necessary. Our model presents a consistent method for guessing unknown words regardless of the kind of analysis failure.

Proper nouns and transliterated foreign words tend to occur more than once in local contexts, i.e. in the text being processed. This means that local texts are important sources from which to learn new words and are crucial for guessing unknown words; we can therefore establish guessing rules for them if we can generalize words from the examples in local texts. Moreover, words can be obtained accurately from a POS tagged corpus. We build prior knowledge of words from a POS tagged corpus, which is used to segment compound nouns and improves the accuracy of analysis; that is, the data learned from the POS tagged corpus serve as words known beforehand in our system. The system uses these two kinds of data together with a modified CYK parsing algorithm to perform word segmentation.
2 Learning

Consider again the example 학교에 (hag-gyo-e, to school) from the previous section in order to describe the basic idea of word segmentation. As noted, it is possible to divide 학교에 in three ways. First, consider case (a) of the previous example. The head part 학 is found in the head parts of many examples in Korean. However, the tail part 교에 rarely occurs in the tails of other eojeols, which means that it does not behave as a suffix (postposition) of an eojeol. On the other hand, 학교 and 에 are found in the head and tail parts of various examples respectively; indeed, they are used as words in Korean. That is, a sequence of syllables is likely to be a word if it is found in various usages many times.

A corpus is linguistic data containing usages of a language, so a corpus can provide a large number of usages of words. In addition, various kinds of information such as POS tags, phrase structure, etc. may be attached to a corpus according to the application. Considering our application, we use a POS tagged corpus in order to enhance performance; hence two kinds of text material enter into our work: raw texts and a POS tagged corpus. Words extracted from the POS tagged corpus are ones we know beforehand, while examples from raw texts can be generalized into words. To focus on the unregistered words that appear while analyzing raw text, we supply the erroneous results of morphological analysis, which constitute the local linguistic data used to generalize and recognize unregistered new words.
2.1 Training from POS tagged corpus

We used about 400 thousand eojeols from the KAIST POS tagged corpus (Choi et al., 1994), in which each eojeol is segmented into morphemes with POS tags. Training from the POS tagged corpus is very simple:

1. Extract words from the POS tagged corpus.
2. Sort them and count the frequency of each word.
3. Select words w such that freq(w) >= k, since words are often not significant when the count is too small.

In this paper, we considered only nouns and postpositions because most errors of morphological analysis, apart from misspelled words, occur in nominal eojeols.
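The following is a minimal sketch of this training step in Python. The corpus format is an assumption for illustration only (one whitespace-separated morpheme/TAG pair per token, with hypothetical tags "N" for nouns and "J" for postpositions); the actual KAIST corpus format and tag set differ.

```python
from collections import Counter

def train_from_tagged_corpus(lines, k=2, keep_tags=("N", "J")):
    """Count nouns and postpositions in a POS tagged corpus and keep
    the frequent ones.  Assumes each line holds tokens like "hag-gyo/N e/J";
    this format and tag set are illustrative, not the real KAIST ones."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            word, _, tag = token.rpartition("/")
            if tag in keep_tags:            # nouns and postpositions only
                counts[word] += 1
    # step 3: discard words whose frequency is below the threshold k
    return {w: f for w, f in counts.items() if f >= k}
```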
2.2 Training from local texts

The data given here is the local text to be processed. When a local text is analyzed, morphological analysis failures are ubiquitous, and many of the failed eojeols share a common stem (head part), since the local text serves as a context in which words appear repeatedly. Therefore, new words can be learned from the failed results, which we refer to as local learning data. The procedure to estimate words from local learning data is as follows. Given an eojeol consisting of n syllables s1 s2 ... sn:

for i = 1 to n do begin
    store(s1 ... si);
    if i < n then store(s(i+1) ... sn);
end

Here the procedure store() not only stores the given syllable sequence; it sets the frequency of the sequence to 1 if it does not yet exist, and increases the frequency otherwise. To illustrate, let us suppose the local learning data consist of four eojeols as follows:
학교가 (hag-gyo-ga, 학교 (school) + 가 (SUBJ marker))
학교에 (hag-gyo-e, 학교 (school) + 에 (TO))
집에 (jib-e, 집 (house) + 에 (TO))
서울에 (seo-ul-e, 서울 (Seoul) + 에 (TO))
(a)       (b)        (a)       (b)
학         2         학교        2
학교가      1         학교에      1
교가        1         교에        1
가          1         에          3
집          1         집에        1
서          1         서울        1
울에        1         서울에      1

Table 1: Examples of data trained from local texts, (a) series of syllables (b) frequency
If we execute the proposed algorithm on these data, the syllable sequences are stored with their frequencies as shown in Table 1.
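For concreteness, here is a minimal Python sketch of the local-learning procedure, with eojeols written as lists of romanized syllables; the counts it produces match Table 1 (18 stored items in total).

```python
from collections import Counter

def learn_local(eojeols):
    """Store every head (prefix) and remaining tail of each eojeol.
    Each eojeol is a list of syllables, e.g. ["hag", "gyo", "e"]."""
    freq = Counter()
    for syls in eojeols:
        n = len(syls)
        for i in range(1, n + 1):
            freq["".join(syls[:i])] += 1        # store(s1 ... si)
            if i < n:
                freq["".join(syls[i:])] += 1    # store(s(i+1) ... sn)
    return freq

freq = learn_local([["hag", "gyo", "ga"], ["hag", "gyo", "e"],
                    ["jib", "e"], ["seo", "ul", "e"]])
print(freq["haggyo"], freq["e"], sum(freq.values()))   # -> 2 3 18
```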
2.3 Estimation of word segmentation

To describe the basic idea of our word segmentation simply, suppose that an eojeol is segmented into only two words. Given an eojeol, it is segmented according to the possibility that a sequence of syllables inside it forms a word, which is measured by the following formula:

Word(s_i, ..., s_j) = (f_P + a * f_L) / (N_P + a * N_L)    (1)

In the formula, f_P and f_L are the frequencies of the syllable sequence s_i ... s_j in the POS tagged corpus and in the local learning text respectively, and N_P and N_L are the total numbers of syllables (words) in the two kinds of learning data. In addition, a is a weight that normalizes for the sizes of the learning data. If a POS tagged corpus is not given, f_P and N_P are both zero. In that case, we can estimate each possibility of segmentation for the eojeol 학교에 (to school) as follows:

1. Word(학) = 2/18, Word(교에) = 1/18
2. Word(학교) = 2/18, Word(에) = 3/18
3. Word(학교에) = 1/18
As mentioned earlier, it is desirable to segment the eojeol at a position where every resulting syllable sequence occurs frequently enough in the learned data. Therefore, we first take the minimum of the Word() values for each possible segmentation; in the example, 1/18, 2/18, and 1/18 are taken for the three possible segmentations. Next, we choose the maximum of the selected minimums; that is, we use min-max composition, and the segmentation with the maximum value is selected. By this algorithm, the eojeol 학교에 is segmented into 학교 + 에.
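A minimal sketch of this two-part min-max estimation, reusing the freq counter from the previous sketch; since only local data are available here, f_P = N_P = 0 and formula (1) reduces to f_L / N_L, so the weight a drops out:

```python
def word_score(seq, freq, total):
    # Word(s_i ... s_j) with only local learning data (f_L / N_L)
    return freq[seq] / total

def best_two_part_split(syls, freq):
    total = sum(freq.values())                  # N_L = 18 in the example
    whole = "".join(syls)
    # candidates: the unsegmented eojeol plus every two-part split
    candidates = [((whole,), word_score(whole, freq, total))]
    for i in range(1, len(syls)):
        head, tail = "".join(syls[:i]), "".join(syls[i:])
        score = min(word_score(head, freq, total),
                    word_score(tail, freq, total))   # min step
        candidates.append(((head, tail), score))
    return max(candidates, key=lambda c: c[1])       # max step

print(best_two_part_split(["hag", "gyo", "e"], freq))
# -> (('haggyo', 'e'), 0.111...)  i.e. hag-gyo + e wins with 2/18
```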
3 Word Segmentation

In this section, we generalize the word segmentation algorithm based on the data obtained by the training method. The basic idea is to apply the min-max operation to each syllable of an eojeol in a bottom-up fashion. If the minimum of the Word values of two syllable sequences is greater than the Word value of their combination, the syllables should be segmented. For instance, suppose an eojeol consists of two syllables s_1 and s_2. If min(Word(s_1), Word(s_2)) > Word(s_1 s_2), then the eojeol is segmented into s_1 and s_2; otherwise it is not segmented. Figure 2 shows an example: in Korean, 학 (hag) is a word that occurs frequently, but 교 (gyo) is not.
Figure 2: Example of applying min-max to 학교
Figure 3: Composition table

Therefore, we can hardly regard the syllable sequence 학교 (hag-gyo) as the combination of the two words 학 and 교. From this fact, syllable sequences with rare frequency should be rejected, which is what leads us to introduce the minimum operation. Rather, 학교 is a word that many eojeols share in their stem parts. The algorithm can be applied recursively, from individual syllables up to the whole syllable sequence of the eojeol.

The segmentation algorithm is effectively implemented by borrowing the CYK parsing method. Since we use a bottom-up strategy, the execution looks like composition rather than segmentation. After all possible segmentations of syllables have been checked, the final result is put at the top of the table. When an eojeol is composed of n syllables s_1 s_2 ... s_n, the composition starts from each s_i (i = 1 ... n). Thus, the possibility that each individual syllable forms a word is recorded in the cells of the first row, and in general the cell C_{i,j} stores the most probable segmentation result for the series of syllables s_j, ..., s_{j+i-1}, together with its possibility value of forming a word (Figure 3).
Now we describe the segmentation algorithm. When it is about to segment the syllables s_i ... s_j, the segmentation results for s_i ... s_{j-1} have already been stored in the table; therefore s_j must be contained in the syllable sequences processed at that stage. To make the algorithm easy to explain, we take the example eojeol 학교생활이 (hag-gyo-saeng-hwal-i), which is segmented into 학교 (school), 생활 (life) and 이 (SUBJ marker) (Figure 5).

When we come to cell C_{4,1}, we have to make the most probable segmentation for 학교생활, i.e. s_1 to s_4. Since each cell holds the most probable result and its value, it is simple to find the best segmentation for the syllable sequences including s_4. First, we select s_4, which is put in C_{1,4}; then s_1 s_2 s_3 is the syllable sequence that can be connected to s_4. Let us denote the possibility value of C_{i,j} by value(C_{i,j}). The segmentation result for s_1 s_2 s_3 and value(C_{3,1}) are recorded in C_{3,1}, and we take the minimum of value(C_{3,1}) and Word(s_4). Next, we select 생활, i.e. s_3 s_4, whose segmentation is stored in C_{2,3}. The syllable s_2 is the rightmost syllable of the sequence that can be connected to it, and the segmentation result for s_1 to s_2 is recorded in C_{2,1}; in the same manner as above, we take the minimum of value(C_{2,1}) and value(C_{2,3}). In this way, four cases are compared to make the segmentation of s_1 s_2 s_3 s_4:

1. min(value(C_{3,1}), value(C_{1,4}))
2. min(value(C_{2,1}), value(C_{2,3}))
3. min(value(C_{1,1}), value(C_{3,2}))
4. Word(s_1 s_2 s_3 s_4) = Word(학교생활)

From these four cases, the maximum value and the corresponding segmentation result are selected and recorded in C_{4,1}. To generalize, the algorithm is described in Figure 4. The overall complexity
/* initialization step */
for j = 1 to n do
    value(C_{1,j}) = Word(s_j);
for i = 2 to n do
    for j = 1 to n - i + 1 do
        value(C_{i,j}) = max(min(value(C_{i-1,j}), value(C_{1,j+i-1})),
                             min(value(C_{i-2,j}), value(C_{2,j+i-2})),
                             ...,
                             min(value(C_{1,j}), value(C_{i-1,j+1})),
                             Word(s_j, ..., s_{j+i-1}))

Figure 4: The segmentation algorithm
Figure 5: State of the table when analyzing 학교생활이. Here, w(s_i ... s_j) = value(C_{i,j})
of the algorithm follows that of CYK parsing, O(n^3).
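The following is a minimal Python sketch of the algorithm of Figure 4, under the same assumptions as the earlier sketches (only local counts, so Word reduces to f_L / N_L). Cells are indexed best[i][j] with i the span length and j the start position, and each cell also keeps the best segmentation so the result can be read off the top of the table.

```python
def segment(syls, freq):
    """CYK-style min-max word segmentation (Figure 4).
    best[i][j] = (value, segmentation) for the i syllables starting at j."""
    total = sum(freq.values())
    word = lambda seq: freq["".join(seq)] / total   # Word(), local data only
    n = len(syls)
    best = [[None] * n for _ in range(n + 1)]
    for j in range(n):                         # initialization: single syllables
        best[1][j] = (word(syls[j:j + 1]), [syls[j]])
    for i in range(2, n + 1):                  # span length, bottom-up
        for j in range(n - i + 1):             # start position
            # the unsegmented span itself is one candidate
            cand = (word(syls[j:j + i]), ["".join(syls[j:j + i])])
            for k in range(1, i):              # every binary split of the span
                lv, lseg = best[k][j]
                rv, rseg = best[i - k][j + k]
                v = min(lv, rv)                # min over the two parts
                if v > cand[0]:                # max over all candidates
                    cand = (v, lseg + rseg)
            best[i][j] = cand
    return best[n][0][1]

print(segment(["hag", "gyo", "e"], freq))      # -> ['haggyo', 'e']
```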
4 Experiments and Discussions

For testing purposes, we prepared several test sets: a computer manual, a newspaper article about economics, a scientific paper, and a religious novel. The size of each text was chosen to be about 10K bytes in order to give appropriate contexts. Although all eojeols should be delimited by whitespace, many eojeols in real texts are not. For instance, 디플레이션및 (di-peul-re-i-syeon mit, deflation and) should be delimited as 디플레이션 및, but it often occurs in real texts without any space. Therefore, it is necessary to estimate how well the system can segment eojeols with such general spacing errors, and we decided to keep all spacing errors in order to test our system in a real domain. The number of eojeols in each article and the number of morphological analysis failures are given in Table 2.
                 computer manual   newspaper   scientific paper   novel
(a) eojeols           1394            766            990           1200
(b) failures            38             91            166             73

Table 2: Constitution of each test set: (a) number of eojeols, (b) number of morphological analysis failures

Number of eojeols   Number of correct results   Precision
       368                    303                 82.3%

Table 3: Overall precision of the system

Table 3 shows the overall precision of our system. As shown in Table 2, the newspaper text contains fewer eojeols and more errors despite the similar sizes of the text files; we therefore also tested the system on each article separately. The accuracy for each article is given in Table 4. The experiments show that our system produces quite good results despite the inclusion of general spacing errors. Our system does not take connectivity information between morphemes into consideration at all, which is the main source of errors: Korean is an agglutinative language, so connectivity information imposes constraints on the combination of morphemes. Because this paper focuses on the estimation of words by learning from examples in raw material, we do not use it; we therefore expect the performance of the system to improve considerably once connectivity information is introduced. In addition, it would be useful for the system to produce n-best results, considering automatic POS tagging systems; this is easy to implement with the CYK parsing table. If these factors are reflected in the system, it will be very useful for various applications.

                 computer manual   newspaper   scientific paper   novel
Precision             78.9%           83.5%          83.7%         79.5%

Table 4: Precision for each test set
5 Conclusions

We have considered a model of word segmentation from a cognitive point of view, in which words can be generalized from many usages. We presented the usages with a POS tagged corpus and the local texts being processed: the POS tagged corpus gives more accurate information about words, and the local texts provide information about words unregistered in the dictionary. To test the adequacy of our model, we applied it to the failed results of morphological analysis using the modified CYK parsing method. The model provides a generalized treatment of unknown word guessing and produced good results.
References

Cha, J., Lee, G. and Lee, J. 1998. Generalized Unknown Morpheme Guessing for Hybrid POS Tagging of Korean. In Proceedings of the 6th Workshop on Very Large Corpora.
Choi, K. S., Han, Y. S., Han, Y. G., and Kwon, O. W. 1994. KAIST Tree Bank Project for Korean: Present and Future Development. In Proceedings of the International Workshop on Sharable Natural Language Resources.
Elmi, M. A. and Evens, M. 1998. Spelling Correction Using Context. In Proceedings of COLING/ACL 98.
Hopcroft, J. E. and Ullman, J. D. 1979. Introduction to Automata Theory, Languages, and Computation.
Jin, W. and Chen, L. 1995. Identifying Unknown Words in Chinese Corpora. In Proceedings of NLPRS 95.
Li, J. and Wang, K. 1995. Study and Implementation of Nondictionary Chinese Segmentation. In Proceedings of NLPRS 95.
Nagao, M. and Mori, S. 1994. A New Method of N-gram Statistics for Large Number of N and Automatic Extraction of Words and Phrases from Large Text Data of Japanese. In Proceedings of COLING 94.
Park, B. R., Hwang, Y. S. and Rim, H. C. 1997. Recognizing Korean Unknown Words by Comparatively Analyzing Example Words. In Proceedings of ICCPOL 97.
Sproat, R. W., Shih, C., Gale, W. and Chang, N. 1994. A Stochastic Finite-State Word-Segmentation Algorithm for Chinese. In Proceedings of the 32nd Annual Meeting of the ACL.
Yun, B. H., Cho, M. C. and Rim, H. C. 1997. Korean Compound Noun Indexing Based on Lexical Association and Conceptual Association. In Proceedings of PACLING.