Vietnamese Word Clustering and Antonym Frame Identification

Kim-Anh Nguyen
Faculty of Mathematics and Technology
Hung Vuong University
Phutho, Vietnam
[email protected]

Van-Lam Pham
Institute of Linguistics
Vietnam Academy of Social Sciences
Hanoi, Vietnam
[email protected]

Phuong-Thai Nguyen
University of Engineering and Technology
Vietnam National University
Hanoi, Vietnam
[email protected]
Abstract—Word clustering is a method for grouping similar words into clusters. Two words are similar when they appear in the same contexts. Word clustering has been widely studied for languages such as English, Chinese, and Japanese. In this paper, we conduct word clustering for Vietnamese using two methods, proposed by Brown and by Dekang Lin. Moreover, we propose five criteria for evaluating the quality of a cluster. We also use a statistical method to extract 20 antonym frames for identifying the antonym class within clusters.

Fig. 1. An example of Brown's cluster algorithm
Keywords—natural language processing, word clustering, word similarity, antonym class, antonym pairs.
I. INTRODUCTION

In recent years, statistical learning methods have been widely applied to natural language processing tasks. Labeled data is normally prepared manually or in some other way, which can be time-consuming and expensive. In contrast, unlabeled data is essentially free and abundant on the Internet; this kind of data is raw text. Previous work has shown that exploiting unlabeled data alongside conventional labeled data can improve performance (Miller et al., 2004; Abney, 2004; Collins and Singer, 1999) [8][1][3]. In this paper, we focus on how two methods, word clustering by Brown [10] and word similarity by Dekang Lin [4], apply to Vietnamese unlabeled data. Since the two methods cluster words using different approaches, we propose an evaluation method to compare the advantages and disadvantages of each. To do so, we conducted experiments on the same corpus with five criteria for evaluating cluster quality. Moreover, the two methods only cluster the words of the corpus; they do not categorize the semantic relations within each cluster, such as synonymy, antonymy, or contextual similarity. Thus, to clarify the semantic relations within a cluster, we investigated the antonym class more deeply with a statistical method in which we extracted 20 antonym frames to identify the antonym class in the experimental clusters. In addition, the results of word clustering were used as a feature in a Vietnamese functional labeling task to improve its performance [9].

II. RELATED WORK

A. Brown's Algorithm

Brown's algorithm can be viewed as a method for estimating the probabilities of low-frequency events. One aim of word clustering is to predict a word from the preceding words in a text. The authors used a bottom-up word clustering algorithm to derive a hierarchical clustering of words. The input to the algorithm is a corpus of unlabeled data containing the vocabulary of words to be clustered. The output is a binary tree, as in Figure 1, in which the leaves are the words of the vocabulary and each internal node is interpreted as the cluster containing the words in its sub-tree.

B. Word Similarity

The meaning of a word in a corpus can, to a large extent, be inferred from the other words in its context. Two words are similar if they appear in similar contexts or can replace each other to some extent. The similarity between two objects is the amount of information contained in the commonality between the objects divided by the amount of information in the descriptions of the objects (Lin, 1997) [5]. To compute the similarity of two words in context, Dekang Lin used dependency triples (w, r, w′), where w is the word under consideration, r is the grammatical relation between w and w′, w′ is a context word of w, and ||w, r, w′|| denotes the frequency count of the triple (w, r, w′) in the corpus. When w, r, or w′ is the wild card (*), the count sums over all dependency triples that match the rest of the pattern. The description of a word w consists of the frequency counts of all dependency triples matching the pattern (w, ∗, ∗). For each word, Dekang Lin created a thesaurus entry containing the top-10 words most similar to it. For example, the top-10 nouns for the word brief are: brief (noun): affidavit 0.13, petition 0.05, memorandum 0.05, motion 0.05, lawsuit 0.05, deposition 0.05, slight 0.05, ...

III. OUR APPROACH
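Before detailing our approach, Lin's measure from Section II-B can be made concrete with a short sketch. The dependency-triple counts below are invented toy data, and raw counts stand in for the mutual-information weights of Lin's original formula, so the scores are only illustrative:

```python
# A minimal sketch of Lin-style word similarity over dependency triples.
# The counts below are invented toy data; raw counts are used in place
# of the mutual-information weights of Lin's original formula.
triples = {
    ("luật", "obj-of", "ban_hành"): 15,     # luật (law)
    ("luật", "mod", "mới"): 9,
    ("bộ_luật", "obj-of", "ban_hành"): 11,  # bộ_luật (code)
    ("bộ_luật", "mod", "mới"): 5,
    ("bộ_luật", "sub-of", "quy_định"): 4,
}

def features(w):
    """All (r, w') contexts of w with their counts ||w, r, w'||."""
    return {(r, w2): c for (w1, r, w2), c in triples.items() if w1 == w}

def similarity(u, v):
    """Weight of the shared contexts divided by the total weight
    of both words' descriptions (a simplified Lin ratio)."""
    fu, fv = features(u), features(v)
    shared = fu.keys() & fv.keys()
    common = sum(fu[f] + fv[f] for f in shared)
    total = sum(fu.values()) + sum(fv.values())
    return common / total if total else 0.0

print(round(similarity("luật", "bộ_luật"), 3))  # → 0.909
```

With real corpus counts the same ratio, weighted by mutual information, yields ranked similarity lists like the brief example above.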
A. Vietnamese Word Clustering

1) Brown's algorithm: To extract Vietnamese word clusters, we employed Brown's algorithm, one of the best-known and most effective clustering algorithms in language modeling, on a Vietnamese corpus. The algorithm merges cluster pairs bottom-up so as to maximize the average mutual information between adjacent clusters. Its output is a hierarchical clustering of words in which each word is assigned a bit string, as in Figure 2.

Cluster 628: xoay (turn)
|—> 0110101010110111  gác (lean)         261
|—> 0110101010110111  lấp (cover)        355
|—> 0110101010110111  nghiêng (incline)  376
|—> 0110101010110111  chụp (snap)        854
|—> 0110101010110111  quay (revolve)     2061

Fig. 2. An example of Vietnamese word clustering

A word cluster contains a main word and subordinate words; all subordinate words share the same bit string and are listed with their corpus frequencies. The number of subordinate words differs from cluster to cluster.

2) Word similarity: To identify the similarity between words in the Vietnamese corpus, we extracted dependency triples (w, r, w′) as follows. We visit every node of the parse tree and examine its child nodes: w corresponds to the word in the central node, w′ to a word not in the central node, and r is determined from the labels of the parent node and the child node containing w′. Hence, from a parent node with k child nodes we can extract at most k − 1 triples. We use 8 grammatical relations (r) in the dependency triples: 4 initial relations and their reversed forms, namely Subject (sub and sub-of), Object (obj and obj-of), Complement (mod and mod-of), and Prepositional object (pobj and pobj-of). Figure 3 shows the most similar words, with their frequencies and similarity scores, for one main word:

bộ_luật (code) (N):
|—> luật (law)               (24)  0.134735
|—> pháp_lệnh (state law)    (13)  0.126747
|—> nghị_định (decree)       (12)  0.113054
|—> nghị_quyết (resolution)  (11)  0.107454
|—> thông_tư (circular)      (9)   0.103456

Fig. 3. An example of Vietnamese word similarity

B. Evaluating Methodology

The method we use to evaluate in this paper is semi-automatic. A word cluster is true if at least one of its subordinate words is identified as true with respect to the main word; otherwise, the cluster is false.

To evaluate the truth value of clusters, we propose five criteria, true(1) through true(5), to decide whether a word is similar to another word. true(1) is evaluated automatically when the main words are included in the thesaurus dictionary; the remaining criteria, true(2) through true(5), must be applied manually when the main words do not appear in the thesaurus dictionary. The five criteria are:

• Synonym: two or more words that can replace one another in some contexts; complete synonymy is true(1). For example: đẹp - xinh (beautiful - pretty).

• Antonym: words of opposite meaning; this is true(2). For example: yêu - ghét (love - hate).

• Specific-abstract relation: the relation between a specific object and its general entity; this is true(3). For example: nhạc_trữ_tình - nhạc (ballad - music).

• Abstract-specific relation: the reverse of true(3); this is true(4). For example: hoa - hoa_hồng (flower - rose).

• Context similarity: words that fall under none of the four criteria above but appear together in similar contexts; this is true(5). For example: gấu - voi (bear - elephant).

We use main words in the thesaurus dictionary as representatives of the corresponding clusters in the word clustering output. As illustrated in Figure 4, xu_hướng (tendency) is a main word in the thesaurus dictionary; we find xu_hướng (tendency) in the clustering output and make it the main word of the cluster containing it. Finally, we compare, in turn, the subordinate words of xu_hướng (tendency) in the thesaurus dictionary with the subordinate words of the cluster whose main word is xu_hướng (tendency). If identical words are found, we mark the cluster as true.

Cluster 629: xu_hướng (tendency)   true(1)
|—> 0111111  khuynh_hướng (trend)   238
|—> 0111111  thói_quen (habit)      315
|—> 0111111  nguyện_vọng (hope)     443
|—> 0111111  nhu_cầu (demand)       3344

Fig. 4. Select word clusters by thesaurus dictionary
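The true(1) check described above is mechanical, so it can be automated. A minimal sketch, with a hypothetical thesaurus entry and cluster standing in for the real dictionary and clustering output:

```python
# Semi-automatic evaluation sketch: a cluster is marked true(1) when any
# thesaurus subordinate of the main word also appears among the cluster's
# subordinate words. The entry and cluster below are hypothetical examples.
thesaurus = {"xu_hướng": {"khuynh_hướng", "chiều_hướng"}}
clusters = {"xu_hướng": {"khuynh_hướng", "thói_quen", "nguyện_vọng", "nhu_cầu"}}

def evaluate(main_word):
    """Return 'true(1)' on an automatic match; 'manual' means the cluster
    must still be judged by hand against criteria true(2) through true(5)."""
    entry = thesaurus.get(main_word, set())
    members = clusters.get(main_word, set())
    return "true(1)" if entry & members else "manual"

print(evaluate("xu_hướng"))  # khuynh_hướng appears in both → true(1)
```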
C. Antonym Classes

In this part, we use a statistical method to propose frames of antonym pairs for identifying the antonym relation within each cluster. The relations between antonym pairs can be classified into the following semantic classes (Jones, 2002) [6]:

1) Ancillary antonym: In this semantic class, antonym pairs in a sentence work in one of two ways: supporting each other to increase their mutual contrast, or increasing the contrast of the words or phrases they modify. Below is an example:

a) Plustek_Optic Pro 4800P có ưu_điểm là độ_phân_giải cao và chân_đế nhỏ, tuy nhiên có nhược_điểm là kết_quả quét tối và OCR kém. (The Plustek_Optic Pro 4800P has the advantages of high resolution and a small flange; however, its disadvantages are dark scans and weak OCR.)

In example a), the antonym pair ưu_điểm/nhược_điểm (advantage/disadvantage) serves to increase the contrast in the sentence. Table I lists some frames of the ancillary antonym class that we identified in the corpus, where X and Y are an antonym pair and w1 and w2 are words modifying X and Y:

TABLE I. ANCILLARY ANTONYM FRAMES

Vietnamese               English
w1 X ... nhưng ... w2 Y  w1 X ... but ... w2 Y
w1 X ... w2 Y            w1 X ... w2 Y
X w1 ... Y w2            X w1 ... Y w2
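Frames like those in Table I can be instantiated as simple regular expressions over word-segmented text. The sketch below checks the X ... nhưng (but) ... Y frame for one seed antonym pair; the sentence is an invented example, not taken from the corpus:

```python
import re

def matches_nhung_frame(sentence, x, y):
    """True if the antonym pair (x, y) occurs in the
    'X ... nhưng (but) ... Y' ancillary frame."""
    pattern = rf"{re.escape(x)}\b.*\bnhưng\b.*\b{re.escape(y)}"
    return re.search(pattern, sentence) is not None

# Invented word-segmented sentence for illustration:
s = "Máy này có ưu_điểm là nhanh nhưng nhược_điểm là đắt."
print(matches_nhung_frame(s, "ưu_điểm", "nhược_điểm"))  # → True
```

In practice each of the 20 frames becomes one such pattern, applied to the 11,166 sentences that contain a seed antonym pair.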
2) Coordinated antonym: When an antonym pair in a sentence expresses inclusiveness or exhaustiveness of a scale of meaning, it is classified as a coordinated antonym. Below is an example of this class:

b) Dùng ASP framework có_thể dễ_dàng tạo các biểu đơn_giản và phức_tạp, nhưng thiếu các công_cụ sửa lỗi. (With the ASP framework it is easy to create simple and complex forms, but error-correcting tools are missing from the framework.)

In example b), the pair đơn_giản/phức_tạp (simple/complex) modifies the noun ASP framework and expresses the inclusiveness of the scale. Table II lists some frames of the coordinated antonym class that we identified in the corpus:

TABLE II. COORDINATED ANTONYM FRAMES

Vietnamese            English
X và Y                X and Y
X hoặc Y              X or Y
X hay Y               X or Y
không X cũng không Y  neither X nor Y
cả X lẫn Y            both X and Y

3) Minor classes: Antonym pairs in the minor classes are mostly distributed over two semantic classes: comparative antonyms and transitional antonyms. Example c) illustrates this:

c) Phải tốn thời_gian và phải_biết cách thực_hiện các chương_trình DOS trong WINDOWS, nó sẽ đem lại nhiều thành_công hơn thất_bại. (It takes time and know-how to run DOS programs under WINDOWS, but doing so will bring more success than failure.)

In the comparative semantic class, antonym pairs often appear together in the frame X hơn Y (more X than Y). In the transitional antonym class, antonym pairs often appear in the frames of Table III:

TABLE III. TRANSITIONAL ANTONYM FRAMES

Vietnamese          English
từ X ... tới ... Y  from X to Y
từ X ... đến ... Y  from X to Y

IV. EXPERIMENT

A. Results and Comparison

In our experiments, we used an unlabeled corpus of about 13 million words, equivalent to approximately 700 thousand Vietnamese sentences, collected from the online newspapers Lao Dong, PC World, and Tuoi Tre. The corpus was pre-processed with sentence splitting and word segmentation1 for Brown's algorithm, and with POS tagging [7] and sentence parsing [2] for Lin's method. To cluster words, we used an open-source tool2 (Liang, 2005) [11], an efficient implementation of Brown's algorithm. We ran the algorithm with different numbers of clusters: 300, 500, 700, 900, and 1000. We then used the semi-automatic method to evaluate the truth value of the word clusters according to the five proposed criteria. For the thesaurus-based evaluation, we filtered 9771 main words from the output clusters, corresponding to 9771 main words in the thesaurus dictionary. The results are shown in Table IV:

TABLE IV. THE RESULTS OF 5 INITIAL CLUSTERS

k-initial  Good clusters
300        1881
500        1893
700        1757
900        1761
1000       1641

The differences in the number of true clusters across initial cluster counts in Table IV arise because the clustering algorithm merges semantic classes: small clusters are merged into larger ones, and some semantic classes are lost. For example, when k = 300 the cluster containing the main word account is false, but when k = 500 it is true. To compare the two methods, we experimented on the same corpus with the same 700 main words extracted from the thesaurus dictionary for both methods. The results are reported in Table V:

TABLE V. COMPARISON BETWEEN WORD CLUSTERING AND WORD SIMILARITY

Methods          k-initial Clusters  Automatic Method  Manual Method  Precision
Word clustering  700                 292               181            67.5%
Word similarity  700                 206               182            55.4%

The comparison shows a large difference between the two methods under automatic evaluation. There were also 104 true clusters shared by both methods. The difference arises because in Brown's algorithm words are clustered by their distributional probability with the words to their left and right, so when one word is a synonym of another, the two can replace each other in such contexts. In Dekang Lin's method, by contrast, clustering depends on grammatical relations and context words, which reduces the chance of identifying synonyms among the clustered words.

1 http://www.mim.hus.vnu.edu.vn/phuonglh/softwares/vnTokenizer
2 http://www.cs.berkeley.edu/~pliang
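The analysis of modifier pairs in the next subsection relies on a distance between two words in the Brown hierarchy, computed from their bit strings (Figure 2): the number of tree edges from each leaf up to the lowest common ancestor. A minimal sketch, with hypothetical bit-string codes:

```python
def tree_distance(code1, code2):
    """Edge distance between two leaves of the Brown binary tree,
    where each leaf is addressed by its 0/1 bit-string path and
    every edge counts 1, i.e. d(node, leaf) = 1."""
    # Depth of the lowest common ancestor = length of the common prefix.
    lca = 0
    for a, b in zip(code1, code2):
        if a != b:
            break
        lca += 1
    # Edges from the LCA down to each of the two leaves.
    return (len(code1) - lca) + (len(code2) - lca)

# Two words with the same bit string sit in the same cluster: distance 0.
print(tree_distance("0110101010110111", "0110101010110111"))  # → 0
print(tree_distance("0110", "0111"))  # siblings under prefix "011" → 2
```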
B. Antonym Frames

To extract antonym frames, we applied a statistical method to the corpus. We extracted 11,166 sentences containing antonym pairs, based on a seed set of 130 antonym pairs. Any frame that occurred more than 10 times was used as a frame in our research. Analyzing the cases where w1 and w2 appeared together with the pair X/Y, we found two main trends: w1 and w2 are the same word, or w1 and w2 are different words.

Table VI shows the identification frequency of the antonym frames when w1 and w2 are the same. In this case, we identified 675 identical w1/w2 pairs, corresponding to 13 frames containing w1/w2 pairs. To analyze the relations among the 675 w1/w2 pairs, we used the information of the clusters containing those words: 443 clusters contain the 675 w1/w2 words, of which 323 clusters contain one w1/w2 pair and 119 clusters contain at least 2 w1/w2 pairs. This means the similarity among w1/w2 pairs is high, accounting for 52%. Moreover, the w1/w2 pairs identified in this case have the highest frequencies in the clusters containing them.

TABLE VI. THE IDENTIFICATION OF ANTONYM FRAMES

Vietnamese          English                  Quantity  Precision
w1 X ... w2 Y       w1 X ... w2 Y            3568      31.95%
X w1 ... Y w2       X w1 ... Y w2            1249      11.19%
X và Y              X and Y                  893       8.00%
w1 X và ... w2 Y    w1 X and ... w2 Y        526       4.71%
X w1 và Y w2        X w1 and Y w2            292       2.62%
X hay Y             X or Y                   214       1.92%
X hoặc Y            X or Y                   155       1.39%
w1 X w2 Y           w1 X w2 Y                144       1.29%
từ X đến Y          from X to Y              97        0.87%
X w1 hay Y w2       X w1 or Y w2             96        0.86%
w1 X hoặc ... w2 Y  w1 X or ... w2 Y         59        0.53%
cả X lẫn Y          both X and Y             53        0.47%
w1 X hay ... w2 Y   w1 X or ... w2 Y         43        0.39%
X w1 hoặc Y w2      X w1 or Y w2             28        0.25%
X cũng_như Y        X as well as Y           21        0.19%
w1 X với w2 Y       w1 X together with w2 Y  18        0.16%
từ X tới Y          from X to Y              13        0.12%
Total                                        7469      66.89%

In the case where w1 differs from w2, we identified 5093 w1/w2 pairs. To analyze the relations between w1 and w2, we propose a function that computes the distance between w1 and w2 from the bit-string codes of the clusters containing the two words, where the distance between a node and a leaf in the binary tree is d(node, leaf) = 1. The function is illustrated by the binary tree in Figure 5; for example: d(A, B) = 1; d(D, E) = d(D, C) + d(C, E) = 1 + 1; d(B, E) = d(B, A) + d(A, C) + d(C, E) = 1 + 1 + 1, where d(x, y) denotes the distance between x and y.

Fig. 5. The function of computing distance

The distance between w1 and w2 satisfies 0 ≤ d(w1, w2) ≤ 21; the distance is 0 when w1 and w2 belong to the same cluster. The average distance between w1 and w2 is avg = 4.54. We therefore used a threshold (t ≤ 6) on the distance between w1 and w2 to identify good w1/w2 pairs. The distance values of w1 and w2 from 0 to 10 are shown in Table VII.

TABLE VII. THE DISTANCE VALUES OF w1 AND w2

Distance  Number of w1/w2  Percentage
0         119              2.34%
1         0                0.00%
2         71               1.39%
3         76               1.49%
4         130              2.55%
5         160              3.14%
6         210              4.12%
7         305              5.99%
8         374              7.34%
9         456              8.95%
10        541              10.62%

V. CONCLUSION AND FUTURE WORK

In this paper, we studied two algorithms for Vietnamese word clustering. We proposed an evaluation method and applied both algorithms to a Vietnamese corpus. In addition, we extracted 20 antonym frames to identify antonym pairs in clusters. Some problems remain for future work, such as improving the semantic classes within clusters to increase cluster quality. We also plan to use clusters as features in other natural language processing tasks to improve their performance.

REFERENCES

[1] Abney, S. (2004), Understanding the Yarowsky Algorithm. Computational Linguistics, 30(3).
[2] Anh-Cuong Le, Phuong-Thai Nguyen, Hoai-Thu Vuong, Minh-Thu Pham, Tu-Bao Ho (2009), An Experimental Study on Lexicalized Statistical Parsing for Vietnamese. Proceedings of KSE 2009, pp. 162-167.
[3] Collins, M. and Singer, Y. (1999), Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora.
[4] D. Lin (1998), Automatic Retrieval and Clustering of Similar Words. COLING-ACL 98, Montreal, Canada, August 1998.
[5] D. Lin (1997), Using Syntactic Dependency as Local Context to Resolve Word Sense Ambiguity. In Proceedings of ACL-97, Madrid, Spain, July 1997.
[6] Jones, S. (2002), Antonymy: A Corpus-Based Perspective. Routledge.
[7] Le Minh Nguyen, Bach Ngo Xuan, Cuong Nguyen Viet, Minh Pham Quang Nhat, Akira Shimazu (2010), A Semi-supervised Learning Method for Vietnamese Part-of-Speech Tagging. KSE 2010, pp. 141-146.
[8] Miller, S., Guinness, J., and Zamanian, A. (2004), Name Tagging with Word Clusters and Discriminative Training. In Proceedings of HLT-NAACL 2004, pp. 337-342.
[9] Nguyen Thanh Huy, Nguyen Kim Anh, Nguyen Phuong Thai (2011), Building an Efficient Functional-Tag Labeling System for Vietnamese. KSE 2011, pp. 92-97.
[10] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai, and R. L. Mercer (1992), Class-based n-gram Models of Natural Language. Computational Linguistics, 18(4), pp. 467-479.
[11] Percy Liang (2005), Semi-supervised Learning for Natural Language. Master's thesis, Massachusetts Institute of Technology.