AN AUTOMATIC CHINESE COLLOCATION EXTRACTION ALGORITHM BASED ON LEXICAL STATISTICS

Ruifeng Xu, Qin Lu, and Yin Li
Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
{csrfxu, csluqin, csyinli}@comp.polyu.edu.hk


ABSTRACT

This paper presents an automatic Chinese collocation extraction system using lexical statistics and syntactical knowledge. The system extracts collocations from a manually segmented and tagged Chinese news corpus in three stages. First, BI-directional BI-Gram statistical measures, including BI-directional strength and spread and the χ2 test value, are employed to extract candidate two-word pairs. These candidate word pairs are then used to extract high-frequency multi-word collocations from their context. In the third stage, precision is further improved by using syntactical knowledge of collocation patterns between content words to eliminate pseudo collocations. In a preliminary experiment on 30 selected headwords, this three-stage system achieves a 73% precision rate, a substantial improvement on the 61% achieved by an algorithm we developed earlier as an improved version of Smadja's Xtract system, which itself achieves 53% on our corpus.

Keywords: Chinese collocation, information extraction, statistical models

1. INTRODUCTION

Collocation is a lexical phenomenon in which two or more words are used together to convey a specific semantic meaning. For example, we say "warm greeting" but not "hot greeting", and "broad daylight" but not "bright daylight". Similarly, in Chinese, two words may be synonyms, yet only one of them is acceptable in a given habitual combination. In short, collocation refers to the co-occurrence of words and phrases that have a fixed usage and meaning, yet which follow no general syntactic or semantic rules, resting instead on natural, habitual language [Manning, 1999].

Definitions of collocation have been given by linguists. In this work, we follow Benson's widely adopted definition [Benson, 1990]: a collocation is an arbitrary and recurrent word combination. Choueka first attempted to extract collocations of two or more adjacent words that frequently appear together, from an 11 million-word corpus taken from the New York Times [Choueka, 1988]. Church and Hanks [Church, 1990] redefined collocation as a pair of correlated words. They employed mutual information to extract pairs of words that tend to co-occur within a fixed-size window of five words, allowing both adjacent and distant pairs to be extracted. However, multi-word collocations were not extracted until Smadja [Smadja, 1993] developed Xtract to extract adjacent and distant, two-word and multi-word collocations. Xtract is a three-stage system. In the first stage, the strength and spread statistics (to be introduced in Section 3.2) are calculated for words co-occurring with a given headword in its −5 to +5 word context window. The two-word pairs with significant strength and spread values are extracted as collocation candidates. In the second stage, high-frequency n-grams are extracted from the candidate list and treated as n-gram collocation candidates. In the last stage, parsing information is employed to weed out pseudo collocation candidates. Xtract is the most influential work and the best-performing approach to English collocation extraction. Works on Chinese collocation extraction were reported in [Sun, 1997; Sun, 1998; Wei, 2002], including work done by our team [Lu, 2003]. The main structure and the statistical measures used in these systems are quite similar to those of Xtract. However,

their performance is not ideal. When we directly applied the first two stages of Xtract to a Chinese corpus, the precision rate for two-word collocations was only around 50%. After we optimized the parameters of this algorithm, precision increased to 53%. Taking relative proximities into account improved the precision rate by another 8%. As we sought, in the first two stages, to improve the performance of the statistics-based Chinese collocation algorithm, we noticed that the frequency of Chinese word co-occurrence does not follow the normal distribution assumed by Smadja and most of the reported work. Collocation extraction, especially in Chinese, operates in the Keyword in Context (KWIC) mode. In this mode, once a headword is selected, the system has to search the whole corpus to find all its co-words in a fixed context window. This over-emphasizes the frequency of the headword, with the result that the system cannot find pairs whose headword and co-word frequencies fall into a "high-low" or "low-high" pattern. For example, when a headword is a high-frequency word and its co-word is a very low-frequency word, the collocation cannot be identified, because the extreme frequency difference leads to over-dependence on the high-frequency word. Based on these observations, we propose to extract two-word collocations using BI-directional BI-Gram statistics. The improved system still has three stages. In the first stage, bi-gram word pairs are extracted as collocation candidates using bi-directional bi-gram statistics. The average frequency and standard deviation of the co-words are adjusted in order to better describe the actual distribution. Then, strength and spread are both extended to analyze the co-occurrence of a headword and its co-words BI-directionally. We then evaluate the correlation of word pairs using the χ2 test, a more effective measure for data that do not fit well into a normal distribution. In the second stage, we follow our previous method, reported in another paper, to retrieve high-frequency strings containing the extracted two-word collocation candidates as multi-word collocation candidates. In the third stage, syntactic collocation patterns between content words, introduced by [Lin, 1993], are applied to weed out candidates that do not fit Chinese syntactical patterns. Since function words are less important in Chinese than in English, only candidates with at least two content words (nouns, verbs, adjectives and

adverbs) are kept in the final result. In our preliminary experiment, 30 words (nouns, verbs and adjectives across a wide frequency range) were selected as headwords for testing. Comparing the automatically extracted collocations with the collocations manually extracted by professional linguists, our improved system achieves a 73% precision rate. The rest of the paper is organized as follows. Section 2 introduces the construction of the word BI-Gram co-occurrence database and shows the distribution of word pairs in Chinese. Section 3 describes our algorithm for using word BI-Gram statistics to extract two-word collocations. Section 4 briefly introduces multi-word collocation extraction. Section 5 explains the use of syntactical collocation patterns to remove pseudo candidate collocations. Section 6 briefly evaluates the algorithms using a simple example. Section 7 concludes the paper.

2. WORD BI-GRAM DATABASE

We used the annotated Peking University Corpus of People's Daily from January 1998 to June 1998 as our testing corpus. Following the lead of most reported works, we limited the observation context window to a range of [−5, +5] words around a headword. Each word pair in the corpus is recorded as a 3-tuple (wi, wj, d), where wi and wj are two co-occurring words and d denotes the distance between them. Since a word pair can appear at different distances, the system collects all of them within the observation window. Thus the record of a word pair is actually an 11-tuple, as shown below:

wi, wj: t−5, t−4, t−3, t−2, t−1, t, t+1, t+2, t+3, t+4, t+5

where t−5 to t+5 are the frequency counts of wj appearing in the −5 to +5 positions relative to the headword wi, and t is the total frequency count of the headword wi. To speed up the search process, we built a database of all the co-occurrences of word pairs and indexed them by headword and co-word. This way, co-occurrence information can be found directly in the database without the need to search the corpus. Working with the co-occurrence database has an obvious advantage over the traditional KWIC mode: the word co-occurrence database supports global statistical analysis and optimization, which the KWIC mode does not.
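To make the record structure concrete, here is a minimal Python sketch of how such positional counts could be collected, assuming a corpus already segmented into word lists; the function and variable names are ours for illustration, not the paper's implementation:

```python
from collections import defaultdict

WINDOW = 5  # observation window of [-5, +5] words around a headword

def build_cooccurrence_db(sentences):
    """Collect, for every pair (wi, wj), the frequency of wj at each
    distance d in [-5, +5] from wi, plus the total frequency t of wi.

    `sentences` is an iterable of word lists from a segmented corpus.
    Returns ({wi: {wj: [t-5, ..., t-1, t+1, ..., t+5]}}, {wi: t}).
    """
    pair_counts = defaultdict(lambda: defaultdict(lambda: [0] * (2 * WINDOW)))
    head_counts = defaultdict(int)

    for words in sentences:
        for i, wi in enumerate(words):
            head_counts[wi] += 1
            for d in range(-WINDOW, WINDOW + 1):
                if d == 0:
                    continue
                j = i + d
                if 0 <= j < len(words):
                    # map distance d to a slot: -5..-1 -> 0..4, +1..+5 -> 5..9
                    slot = d + WINDOW if d < 0 else d + WINDOW - 1
                    pair_counts[wi][words[j]][slot] += 1
    return pair_counts, head_counts
```

Indexing the counts by headword and co-word in memory (or in a keyed database table) is what allows the global statistics of Section 3 to be computed without re-scanning the corpus.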

The distribution of word co-occurrence frequency with respect to testing corpus size is represented by the solid line in Figure 1. The test corpus consists of 5,802,864 words, containing 32,643,684 word-pair co-occurrences and 8,330,460 different word pairs. Note that the number of word co-occurrences increases rapidly over the first 30% of the corpus, then more slowly in the 30-70% range, and the growth slows down even further beyond that. As the abundant numerals, personal names and place names in the corpus lead to a large number of sparse and useless word pairs, we substituted them with so-called class words. Consequently, the co-occurrence frequency shows a much flatter growth and the total number of different word pairs is reduced to only 71,020,431; this is represented by the dotted line in Figure 1. Our system retains both the original word co-occurrence data and the substituted word co-occurrence data.

[Figure 1. Co-occurrence frequency vs. corpus size]

It is obvious that the distribution of word co-occurrence does not fit the normal distribution. This is an important reason to modify the existing collocation extraction algorithms, since their measures are suited to an approximately normal distribution.

3. COLLOCATION EXTRACTION USING WORD BI-GRAM STATISTICS

3.1 Average frequency

In Xtract, the average frequency and the variance of the co-words of a given headword are used to calculate the strength and spread that identify word pairs with significant values. Obviously, the value of the average frequency strongly influences the threshold for collocation extraction. On closer examination of the co-word frequencies, we want to identify two important points of the distribution: the drop-down spot S_A, a point after which the curve drops significantly, changing the shape of the curve; and the flat-out spot S_B, a point after which the co-occurrence frequency is considered sparse and statistically insignificant. Suppose there exists an i-th co-word such that

$$f(i_{th}) \geq 4 \cdot \bar{f} \quad \text{and} \quad \frac{f(i_{th-1}) - f(i_{th})}{f(i_{th}) - f(i_{th+1})} \geq R \tag{3.1.1}$$

where R is a threshold experimentally set to 5. Then i is considered the drop-down spot S_A. If S_A can be found, the distribution of co-words has i co-words with large wild values. If such a spot is not found, the first co-word is treated as S_A. In order to reduce the influence of the long range of sparse co-words on the average frequency, we sum the frequencies until we find a j such that the summation reaches 95% of the total frequency; this j is considered the flat-out spot S_B.

[Figure 2. Frequency distribution of co-words]

Figure 2 shows the frequency distribution of co-words with respect to headwords. The X-axis represents the different co-words; the Y-axis shows the corresponding frequency, given as the relative percentage of co-occurrence rather than the absolute frequency. From Figure 2, we find that, on average, only the first 5-10% of co-words have high frequency. After that, the frequency of subsequent co-words quickly decreases, leaving a large number of sparse co-occurrence cases.

Then the revised average frequency, $\bar{f}_{new}$, is calculated as

$$\bar{f}_{new} = \frac{\sum_{i=S_A}^{S_B} f(i)}{S_B - S_A} \tag{3.1.2}$$

Using the new average frequency, we re-compute the standard deviation:

$$\sigma_{new} = \sqrt{\frac{1}{S_B - S_A} \sum_{i=S_A}^{S_B} \left( f_i - \bar{f}_{new} \right)^2} \tag{3.1.3}$$

$\bar{f}_{new}$ and $\sigma_{new}$ are expected to better reflect the characteristics of the distribution of co-words.
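The Python sketch below gives one reading of this procedure, computed over a co-word frequency list sorted in descending order; the helper name, the fallback when no drop-down spot exists, the inclusive summation bounds, and the divisor guard are our assumptions:

```python
import math

def adjusted_mean_std(freqs, r_threshold=5.0, mass=0.95):
    """Compute f_new and sigma_new per (3.1.1)-(3.1.3) over `freqs`,
    the co-word frequencies of one headword sorted in descending order."""
    mean_all = sum(freqs) / len(freqs)
    # Drop-down spot S_A: first co-word whose frequency is still >= 4 * mean
    # and where the curve falls at least r_threshold times faster before it
    # than after it. Default to the first co-word when no such spot exists.
    s_a = 0
    for i in range(1, len(freqs) - 1):
        step_after = freqs[i] - freqs[i + 1]
        if freqs[i] >= 4 * mean_all and step_after > 0:
            if (freqs[i - 1] - freqs[i]) / step_after >= r_threshold:
                s_a = i
                break
    # Flat-out spot S_B: smallest j whose cumulative frequency reaches 95%
    # of the total; everything beyond it is treated as sparse noise.
    total, acc, s_b = sum(freqs), 0.0, len(freqs) - 1
    for j, f in enumerate(freqs):
        acc += f
        if acc >= mass * total:
            s_b = j
            break
    window = freqs[s_a:s_b + 1]
    span = max(s_b - s_a, 1)          # denominator as written in (3.1.2)
    f_new = sum(window) / span
    sigma_new = math.sqrt(sum((f - f_new) ** 2 for f in window) / span)
    return s_a, s_b, f_new, sigma_new
```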

3.2 BI-directional BI-Gram Statistics

This step is a key component of the whole collocation extraction process, because the rest of the collocation extraction methods are based on its results.

In Xtract, for a given headword w, the strength of its co-word wi is defined as follows:

$$k_i = \frac{f_i - \bar{f}}{\sigma} \tag{3.2.1}$$

The strength indicates how strongly the two words are co-related; greater strength means a stronger correlation between w and wi. Further, assume that $f_{i,j}$ is the frequency with which wi co-occurs with w at distance j, where $-5 \leq j \leq 5$ and $j \neq 0$. The spread of wi, $U_i$, measures how the co-occurrences of wi are distributed over the ten positions:

$$U_i = \frac{\sum_j \left( f_{i,j} - \bar{f}_i \right)^2}{10}$$

To overcome the over-dependence on high-frequency headwords discussed in Section 1, we make each measurement bi-directional. We extended the strength to

$$k_i^{new} = 0.5 \times \left( k_{w \rightarrow w_i} + k_{w_i \rightarrow w} \right)$$

and spread to

$$U_i^{new} = 0.5 \times \left( U_{w \rightarrow w_i} + U_{w_i \rightarrow w} \right)$$

where each directional value is computed by taking w and wi in turn as the headword, using $\bar{f}_{new}$ and $\sigma_{new}$ from Section 3.1. The word pairs whose $k_i^{new}$ and $U_i^{new}$ exceed the thresholds $k_0$ and $U_0$, and which pass the χ2 test below, are extracted as two-word collocation candidates.
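A minimal Python sketch of these measures, assuming the bi-directional scores are plain averages of the two directional values as reconstructed above (all function names are illustrative):

```python
def strength(f_i, f_mean, sigma):
    """Xtract strength (3.2.1): deviation of wi's frequency from the mean."""
    return (f_i - f_mean) / sigma

def spread(pos_freqs):
    """Xtract spread: variance of a co-word's counts over the ten positions
    j in [-5..-1, +1..+5]; a peaked positional profile gives a large value."""
    avg = sum(pos_freqs) / len(pos_freqs)
    return sum((f - avg) ** 2 for f in pos_freqs) / len(pos_freqs)

def bidirectional_strength(f_wi_given_w, stats_w, f_w_given_wi, stats_wi):
    """k_i_new = 0.5 * (k(w -> wi) + k(wi -> w)): the strength is computed
    twice, taking each word of the pair in turn as the headword, with the
    (f_new, sigma_new) statistics of that headword from Section 3.1."""
    return 0.5 * (strength(f_wi_given_w, *stats_w) +
                  strength(f_w_given_wi, *stats_wi))
```

Averaging the two directions prevents a very frequent headword from drowning out a rare but strongly associated co-word, which is exactly the "high-low" failure case described in Section 1.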

3.3 χ2 test

Most reported works on collocation extraction, including our previous system, assume that the probabilities of word co-occurrence are approximately normally distributed. However, this assumption has been shown to be untrue for English [Church, 1993], and according to our word co-occurrence statistics above, it is also untrue for Chinese. To overcome this problem, we adjust the f-means to reduce the influence of wild values and sparse data. Furthermore, the χ2 (chi-square) test, which does not assume normally distributed probabilities, is applied here to evaluate a collocation candidate by comparing its observed frequencies with the frequencies expected under independence. If the difference is clearly larger than a threshold, the words in the candidate are shown to be far from independent and, thus, highly associated. The χ2 statistic summarizes the differences between observed and expected frequencies as follows:

$$\chi^2 = \sum_{i,j} \frac{\left( T_{ij} - E_{ij} \right)^2}{E_{ij}} \tag{3.3.1}$$

where $T_{ij}$ is the observed frequency of the word pair and $E_{ij}$ is the expected frequency under independence. In our system, we applied the χ2 test only to the evaluation of BI-Gram collocation candidates. For words wa and wb, suppose wa appears $t_a$ times and wb appears $t_b$ times in an N-word corpus, and the co-occurrence count of wa and wb is $t_{ab}$. The expected frequency under independence, $E_{ab}$, is calculated as:

$$E_{ab} = \frac{t_a}{N} \cdot \frac{t_b}{N} \cdot N \tag{3.3.2}$$

Then the χ2 test for the BI-Gram wa and wb is

$$\chi^2 = \frac{N \left( t_{ab} \, t_{\bar{a}\bar{b}} - t_{a\bar{b}} \, t_{\bar{a}b} \right)^2}{t_a \cdot t_b \cdot \left( t_{a\bar{b}} + t_{\bar{a}\bar{b}} \right) \cdot \left( t_{\bar{a}b} + t_{\bar{a}\bar{b}} \right)} \tag{3.3.3}$$

where $t_{a\bar{b}}$ counts occurrences of wa without wb, $t_{\bar{a}b}$ counts occurrences of wb without wa, and $t_{\bar{a}\bar{b}}$ counts positions containing neither. Considering that a χ2 value of 3.841 corresponds to the case where wa and wb are dependent at the 95% confidence level (one degree of freedom), this value is selected as the threshold. If the χ2 value for a word pair is larger than 3.841, the two words tend to be dependent, with a larger value indicating a stronger association.
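For illustration, a small Python sketch of this 2×2 test, with the remaining contingency-table cells derived from the three counts the paper uses (the function name is ours):

```python
def chi_square_bigram(t_a, t_b, t_ab, n):
    """2x2 chi-square statistic for a word pair, per (3.3.3).

    t_a, t_b: corpus frequencies of wa and wb; t_ab: their co-occurrence
    count; n: corpus size in words.
    """
    # Fill in the remaining cells of the 2x2 contingency table.
    t_a_nb = t_a - t_ab              # wa without wb
    t_na_b = t_b - t_ab              # wb without wa
    t_na_nb = n - t_a - t_b + t_ab   # neither word
    num = n * (t_ab * t_na_nb - t_a_nb * t_na_b) ** 2
    den = t_a * t_b * (t_a_nb + t_na_nb) * (t_na_b + t_na_nb)
    return num / den

# Pairs scoring above the 3.841 critical value (0.05 level, 1 d.f.)
# are treated as dependent and kept as collocation candidates.
```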


At this stage, the word pairs with frequency and position distributions that satisfy the extended strength, extended spread, and χ2 tests are extracted as candidate two-word collocations.

4. MULTI-WORD COLLOCATION EXTRACTION

Based on the extracted two-word BI-Grams, multi-word collocation extraction is simple; here we essentially follow Xtract's algorithm directly. The process is as follows. Given a pair of words w and wi, and an integer d specifying the distance between the two words, (w, wi, d), all the sentences containing them in the given relative position are produced. Since we are only interested in relative frequencies, we compute only the first-order moment of the frequency distributions. For each possible relative distance from w, we keep only those words occupying the position with a probability greater than a given threshold T; that is, we compute $f(w_i)/N_x$, and once the result exceeds the threshold of 0.75, the word-position pair is kept. Finally, all the words satisfying the above requirement are combined into a multi-word collocation. Both Smadja's experiments and our own show that this simple statistical extraction of multi-word collocations achieves approximately 95% accuracy, so no further improvements are necessary.
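As a rough Python sketch of this expansion step (the sentence format, the helper name, and the handling of positions with no dominant word are our assumptions, not Xtract's exact implementation):

```python
def expand_collocation(sentences, w, wi, d, threshold=0.75, window=5):
    """Expand a two-word candidate (w, wi, d) into a multi-word collocation.

    Collect every sentence where wi occurs d words from w, then, for each
    relative position around w, keep the word filling that position in
    more than `threshold` of those sentences (Xtract's stage-2 idea).
    """
    # All sentence contexts in which the pair occurs at the given distance.
    contexts = []
    for words in sentences:
        for i, word in enumerate(words):
            if word == w and 0 <= i + d < len(words) and words[i + d] == wi:
                contexts.append((words, i))
    if not contexts:
        return None

    kept = {0: w, d: wi}
    for offset in range(-window, window + 1):
        if offset in kept:
            continue
        counts = {}
        for words, i in contexts:
            if 0 <= i + offset < len(words):
                cand = words[i + offset]
                counts[cand] = counts.get(cand, 0) + 1
        for cand, c in counts.items():
            if c / len(contexts) > threshold:   # f(w_k) / N_x > T
                kept[offset] = cand
    # Combine the retained position/word pairs into the final n-gram.
    return [kept[o] for o in sorted(kept)]
```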

5. WEEDING OUT THE PSEUDO COLLOCATIONS

The precision of two-word collocation extraction alone is a mere 50%-60%. This is because some word pairs with significant statistical measures are strongly correlated only syntactically and are not considered true collocations. We call these pseudo collocations. They can be removed by using predefined syntactic templates. According to [10], The Collocation Dictionary of Content Words in Modern Chinese, collocations usually fit syntactic template patterns. These template patterns are based on noun, verb, and adjective headwords at three levels. The following table explains this but, due to limitations of space, provides an example for a noun headword only.

[Table 1. Template patterns for a noun headword]

The first-level pattern indicates of which phrase type the headword is the key word: Subject-Phrase, Verb-Phrase, Object-Phrase, Attribute-Phrase, Complement-Phrase, Adverbial-Phrase, or Noun-Phrase. The second level indicates the POS of the co-word, such as /n, /v, /a, /m, /q. We use PB (before) and PA (after) to denote the position of the co-word with respect to the headword. This information helps us to remove word pairs that have high statistical values but should nonetheless be eliminated. For example, the table tells us that when a noun headword appears in a Verb-Phrase, the noun must follow the verb; a noun-verb pair extracted from a Verb-Phrase in the wrong order is therefore eliminated, no matter how high its co-occurrence frequency may be. Furthermore, since the co-occurrence of function words and content words in Chinese tends to be syntactical only, at this stage only collocations consisting of at least two content words are retained.
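To illustrate how such templates could drive the filter, a hypothetical Python sketch follows; the template table holds a single made-up entry for the noun/Verb-Phrase case described above and is not the paper's actual Table 1:

```python
CONTENT_TAGS = {"n", "v", "a", "d"}  # nouns, verbs, adjectives, adverbs

# Hypothetical template table: for a noun headword inside a Verb-Phrase,
# a verb co-word must come before the noun (PB), i.e. the noun follows it.
TEMPLATES = {("n", "Verb-P"): {"v": "PB"}}

def passes_filter(head_pos, coword_pos, phrase_type, coword_position):
    """Keep a candidate only if (1) both words are content words and
    (2) the co-word's position (PB/PA) matches the template, if any,
    for the headword's POS and phrase type."""
    if head_pos not in CONTENT_TAGS or coword_pos not in CONTENT_TAGS:
        return False  # function-word pairs are syntactical only
    expected = TEMPLATES.get((head_pos, phrase_type), {}).get(coword_pos)
    return expected is None or expected == coword_position
```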

6. EVALUATION

Since no benchmark data are available to evaluate collocation extraction precision, it was necessary, as with Xtract, to have professional linguists examine the results. So far, we have selected 30 headwords, comprising 10 nouns, 10 verbs and 10 adjectives across a wide frequency range. The extracted results were compared with manually established correct answers. Our improved collocation extraction system achieves a 73% precision rate on average. The following shows how each part of the algorithm contributes to the result for a given headword. The frequency of the example headword in the corpus is 2,271, with a total of 462 co-words. The frequency distribution of its co-words is shown in Figure 3. Using Smadja's approach, the average frequency is 4.9; in our algorithm, it is 7.8. For the two-word pair

extraction, our previous system, modified from Xtract, suggested 23 collocations.


[Figure 3. Co-word frequency distribution for the example headword]

Under our algorithm, however, several of these candidates are eliminated, as they do not satisfy the requirements of the BI-directional BI-Gram and χ2 tests. Meanwhile, two more candidates, with frequencies of 10 and 8, are appended. Using the syntactical filter described in Section 5, the candidates with frequencies of 20 and 28 are filtered out. Finally, our system outputs 14 suggested collocations.

Compared with the correct answers established by a professional linguist with reference to [10], 11 of these are considered collocations. In this example, our system achieves a precision rate of 11/14 = 78%, which is higher than the 13/23 = 56% achieved by our earlier system.

7. CONCLUSION

In this paper we presented a collocation extraction system which first extracts candidates and then eliminates pseudo collocations using syntactic templates. The algorithm uses BI-directional BI-Gram measures as selection criteria, with selection functions improved over existing systems. Results show that our system achieves an average precision of 73% for the 30 words tested, a 10% to 20% absolute improvement over other systems. It should be pointed out that we are not able to report a recall rate, due to the lack of the information needed to compute one. In the future, we will further investigate methods to identify collocations through synonym substitution, in order to eliminate word pairs which are highly frequent but tend to be free combinations. Furthermore, a shallow parser will be built to process running text so as to extract collocations in real environments.

Acknowledgement

This project is partly supported by a CERG grant entitled "Automatic Acquisition of Chinese Collocations with High Precision" (PolyU reference: B-Q535).

REFERENCES

[1] C. D. Manning and H. Schutze, 1999, Foundations of Statistical Natural Language Processing, The MIT Press.
[2] M. Benson, 1990, "Collocations and General Purpose Dictionaries," International Journal of Lexicography, vol. 3(1).
[3] Y. Choueka, 1988, "Looking for Needles in a Haystack or Locating Interesting Collocation Expressions in Large Textual Databases," in Proc. of the RIAO Conf. on User-oriented Content-based Text and Image Handling, 21-24, Cambridge.
[4] K. Church and P. Hanks, 1990, "Word Association Norms, Mutual Information, and Lexicography," Computational Linguistics, vol. 16(1).
[5] F. Smadja, 1993, "Retrieving Collocations from Text: Xtract," Computational Linguistics, vol. 19(1).
[6] M. S. Sun, J. Fang, and C. Huang, 1997, "A Preliminary Study on the Quantitative Analysis of Chinese Collocations," Chinese Linguistics, vol. 1.
[7] H. L. Sun, 1998, "Distributional Properties of Chinese Collocations in Texts," in Proc. 1998 Int. Conf. on Chinese Information Processing, Tsinghua University Press.
[8] N. X. Wei, 2002, "The Research of Corpus-based and Corpus-driven Collocation Extraction," Modern Linguistics, vol. 4(2).
[9] Q. Lu, Y. Li, and R. F. Xu, "Improving Xtract for Chinese Collocation Extraction," submitted for publication.
[10] S. K. Zhang and X. G. Lin, 1992, The Collocation Dictionary of Content Words in Modern Chinese, Commercial Press.
[11] K. W. Church, et al., 1993, "Introduction to the Special Issue on Computational Linguistics Using Large Corpora," Computational Linguistics, vol. 19.