Bootstrapping and Rule-Based Model for Recognizing Vietnamese Named Entity

Hieu Le Trung1, Vu Le Anh2, and Kien Le Trung3

1 Duy Tan University, Da Nang, Vietnam
2 Nguyen Tat Thanh University, Ho Chi Minh, Vietnam
3 Hue University of Sciences, Hue, Vietnam
[email protected]
Abstract. This paper addresses the problem of Vietnamese Named Entity recognition and classification (VNER) using a bootstrapping algorithm and a rule-based model. The rule-based model relies on contextual rules to provide contextual evidence that a VNE belongs to a category. These rules, which exploit the linguistic constraints of a category, are constructed by the bootstrapping algorithm. The bootstrapping algorithm starts with a handful of seed VNEs of a given category and accumulates all contextual rules found around these seeds in a large corpus. These rules are ranked and used to find new VNEs. Our experimental corpus is generated from 250,034 online news articles and over 9,000 literary works. Our VNER system covers 27 categories and more than 300,000 recognized and categorized VNEs. The accuracy of the recognition and classification algorithm is about 95%.
1 Introduction
Named Entity Recognition (NER) has become a major task in Natural Language Processing (NLP). Named Entities (NEs) represent important parts of the meaning of human-written sentences, such as persons, places and objects [1]. In the sixth and seventh editions of the Conference on Computational Natural Language Learning (CoNLL 2002 and CoNLL 2003), the NER task was defined as determining the proper names in an open-domain text and classifying each of them into one of four classes: Person, Location, Organization, and Miscellaneous. Correctly identifying all named entities is very difficult for any language, since the level of difficulty depends on the diversity of language settings. Vietnamese poses several difficulties of its own. Firstly, a Vietnamese Named Entity (VNE) is not marked by any special syllables reserved for names. Secondly, VNEs form an open class whose members are so numerous that it is very hard to enumerate them all. Our work addresses the problem of Vietnamese named entity recognition and classification (VNER).
Corresponding author.
N.T. Nguyen et al. (Eds.): ACIIDS 2014, Part II, LNAI 8398, pp. 167–176, 2014. © Springer International Publishing Switzerland 2014
Our approach to both problems, recognition and classification, is based on the bootstrapping algorithm [4, 5, 6] and a rule-based model. The rule-based model relies on contextual rules to provide contextual evidence that a VNE belongs to a category. These rules, which exploit linguistic constraints and contextual information to identify VNEs, are constructed based on models of word recognition [2] and the theory of word order patterns [3] in Vietnamese, in which each word and phrase surrounding a VNE in a sentence is recognized with high accuracy and frequent sequences of words are ranked. The collected VNEs and contextual rules are ranked using confidence functions and similarity functions: a confidence function expresses a numerical confidence (reliability) that a contextual rule will extract members of the category, and a similarity function expresses the similarity between a VNE and a category.

The bootstrapping algorithm starts the learning process from a small set of seed VNEs for each category. The set of seed VNEs is generated by sorting the VNEs in the corpus and identifying the seeds manually. First, the system searches for sentences or phrases that contain these VNEs and tries to identify some contextual rules. Then, the system tries to find other instances of VNEs appearing in similar contexts. The learning process is then reapplied to the newly found VNEs, so as to discover new relevant contexts. Finally, the VNE recognition and classification algorithm matches the context surrounding a VNE in a sentence against the linguistic constraints of each category.

Our contribution is a new approach to solving the VNER problem. We describe and discuss the model of the VNER problem, the bootstrapping algorithm for recognizing a category, and the automatic algorithm that identifies VNEs in a sentence. The system of categories of VNEs is constructed with high accuracy, and its linguistic constraints are important data for solving other tasks in Vietnamese natural language processing.

The rest of this paper is structured as follows. In Section 2, we introduce and describe the problem of recognizing and classifying Vietnamese named entities (VNER). In Section 3, we define a contextual rule and its construction. Based on a set of contextual rules, the C-VNE recognition algorithm is proposed to recognize a C-VNE syllable, i.e., a Vietnamese Named Entity of category C, and the VNE recognition algorithm is proposed to recognize a VNE syllable, i.e., a Vietnamese Named Entity. Section 4 presents the results of the experiments that support our approach. Section 5 concludes the paper.
2 VNE Recognition Problem
The purpose of the Vietnamese Named Entity Recognition (VNER) problem [7, 8] is to recognize Vietnamese Named Entities (VNEs) in documents and to determine whether each VNE is a member of one of the predefined categories of interest, such as person names, organization names, location names, etc. In simple terms, the VNER problem is formulated as follows:
1. Syllables, sentences and the corpus are described as follows:
   – A syllable s is either an original syllable (such as “của”, “đã”) or a linking syllable (such as “công_việc”, “Hồ_Chí_Minh”, “Hà_Nội”). We call “của” the first component of the syllable “của”, and “công” and “việc” the first and second components of the syllable “công_việc”.
   – A sentence S is an ordered sequence of syllables, S = s1 s2 . . . sn; n = |S| is called the length of the sentence S.
   – The corpus 𝒞 = {(S, s, k)} is a finite set of triples, where S is a sentence, s is a syllable in S, and k is the position of s in S, 1 ≤ k ≤ |S|. S is called the context of the triple. For example, if the corpus contains the sentence S = “Chúng_tôi yêu Việt_Nam”, then 𝒞 has the three elements (S, Chúng_tôi, 1), (S, yêu, 2), and (S, Việt_Nam, 3). The structure of the corpus 𝒞 is described in detail in [2].
2. Let V ⊂ 𝒞 be the set of all Vietnamese Named Entities in the corpus 𝒞. The set V is determined by a map frec : 𝒞 → {0, 1} given by

\[ f_{rec}(S, s, k) = \begin{cases} 1 & \text{if } (S, s, k) \in V, \\ 0 & \text{otherwise.} \end{cases} \tag{1} \]

   The map frec is called the recognition function.
3. Let E = {C1, C2, . . . , Cm} be the set of categories of VNEs, and let fcla : V → E be a map called the classification function. For each category C ∈ E, a VNE whose category is C is called a C-VNE for short.

The VNER problem consists of two problems:

Problem 1. Determine the recognition function frec so that, for every (S, s, k) ∈ 𝒞, we can decide whether the syllable s is a VNE or not.

Problem 2. Determine the classification function fcla so that, for every VNE (S, s, k) ∈ V, we can identify the category of the syllable s, fcla(S, s, k).

For each category C ∈ E, let V(C) = {(S, s, k) ∈ V | fcla(S, s, k) = C} denote the set of all C-VNEs. The key to our work is how to obtain the grammar-rules of the syllables in V(C) in order to decide whether a syllable is in V(C) or not. Precisely, we consider the following problem:

Problem 3. Let C ∈ E be a category, and let S(C) ⊂ V(C) be a given set of some C-VNEs. Based on S(C), construct grammar-rules R(C) that determine a map fC : 𝒞 → {0, 1} given by

\[ f_{C}(S, s, k) = \begin{cases} 1 & \text{if } (S, s, k) \in V(C), \\ 0 & \text{otherwise.} \end{cases} \tag{2} \]

fC is called the C-VNE recognition function.
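As an illustration of the corpus definition in item 1, the following minimal Python sketch builds the triples (S, s, k) from sentences whose syllables are separated by whitespace. The function name build_corpus and the representation of a sentence as a tuple of syllables are assumptions made for this example only; the actual corpus structure is the one described in [2].

```python
def build_corpus(sentences):
    """Return a list of (S, s, k) triples; k is the 1-based position of s in S."""
    corpus = []
    for text in sentences:
        # Assumption: linking syllables are already joined with "_",
        # so splitting on whitespace yields the syllables of the sentence.
        S = tuple(text.split())
        for k, s in enumerate(S, start=1):
            corpus.append((S, s, k))
    return corpus


corpus = build_corpus(["Chúng_tôi yêu Việt_Nam"])
for S, s, k in corpus:
    print(s, k)
```

For the example sentence of item 1, the sketch produces one triple per syllable, with positions 1, 2 and 3.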
First of all, we introduce a simple grammar-rule with which we can easily recognize many non-VNE and non-C-VNE syllables. This grammar-rule is based on the following concept.

Definition 1. A syllable s is called a candidate if the first letter of each of its components is capitalized.

An example of a candidate is the syllable “Hồ_Chí_Minh”, and an example of a non-candidate is the syllable “Chủ_tịch”. In this work, we assume that every VNE is a candidate. Thus,

Grammar-Rule 1. Let (S, s, k) be a triple in the corpus. If s is not a candidate, then (S, s, k) is not a VNE, i.e., frec(S, s, k) = 0, and consequently, for every category C ∈ E, (S, s, k) is not a C-VNE, i.e., fC(S, s, k) = 0.

In the next sections, we study how to use a given set of C-VNEs to construct other grammar-rules that determine the C-VNE recognition function fC.
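Definition 1 and Grammar-Rule 1 can be sketched in a few lines of Python. The names is_candidate and grammar_rule_1 are hypothetical, the example sentence is invented for illustration, and capitalization is tested with Python's str.isupper, which is an assumption about how "capitalized" is decided for Vietnamese letters.

```python
def is_candidate(syllable: str) -> bool:
    """Definition 1: every component of the syllable starts with a capital letter."""
    components = syllable.split("_")
    # part[:1] is "" for an empty component, and "".isupper() is False.
    return all(part[:1].isupper() for part in components)


def grammar_rule_1(triple):
    """Return 0 when Grammar-Rule 1 excludes the triple, else None (undecided)."""
    _sentence, syllable, _position = triple
    return None if is_candidate(syllable) else 0


# Invented example sentence, already segmented into syllables.
sentence = ("Chủ_tịch", "Hồ_Chí_Minh", "đã", "đến", "Hà_Nội")
undecided = [s for k, s in enumerate(sentence, start=1)
             if grammar_rule_1((sentence, s, k)) is None]
print(undecided)
```

Only the syllables left undecided by this filter need to be examined by the contextual rules of Section 3.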
3 Contextual Rules and Recognition
Let C ∈ E be a category, and let S(C) be a given set of some C-VNEs. The C-VNE recognition function fC is determined based on the set of contextual rules, which represent linguistic constraints for deciding whether another syllable (S̄, s̄, k̄) ∉ S(C) is a C-VNE or not. Thus, the aims of this section are to study how to construct the set of contextual rules of the category C from the given set S(C), and how to determine the function fC from these contextual rules. The first aim is achieved using a confidence function, a similarity function and the idea of the bootstrapping algorithm. The second aim is based on a probability model that measures the structural similarity between a target syllable and some given C-VNE syllables. These measures depend on the roles of the contextual rules.

3.1 Contextual Rule
Let us consider an example. We are given two sentences:

S1 = “Ông Nam và ông Nhân đã thực_hiện thành_công dự_án này.”
S2 = “Tôi cùng ông Nghĩa đã thực_hiện một nhiệm_vụ quan trọng.”

Assume that the two candidates (S1, “Nhân”, 5) and (S2, “Nghĩa”, 4) belong to V(C), where C is the category “names of people”. This suggests the following idea: for any candidate (S, s, k) in the corpus 𝒞, if “ông s đã thực_hiện” is a subsequence of S, can we decide that (S, s, k) ∈ V(C)? If we can, the subsequence “ông s đã thực_hiện”, where s is a candidate, is called a contextual rule of the ‘names of people’-VNE. Formally, let C ∈ E be a category; a contextual rule of C is defined by:
Definition 2. A contextual rule p of C is a pair of subsequences (sl, sr), where sl is called the left context and sr is called the right context.

In the above example, p = (“ông”, “đã thực_hiện”) is a contextual rule of the category names of people, where sl = “ông” is the left context and sr = “đã thực_hiện” is the right context of p. For any candidate (S, s, k) and any contextual rule p = (sl, sr) ∈ R, we define:

Definition 3. (S, s, k) satisfies the contextual rule p if sl s sr ⊂ S. We write p(S, s, k) = 1 if (S, s, k) satisfies p, and p(S, s, k) = 0 otherwise.

3.2 Constructing Contextual Rules
Assume that R is a set of some contextual rules of the category C. Let σ(V_C) denote the set of all subsets of V_C, and let

V_C(p) = {(S, s, k) ∈ V_C | p(S, s, k) = 1},
V_C(p, W) = {(S, s, k) ∈ W | p(S, s, k) = 1},

where p ∈ R and W ⊆ V_C, and

R(S, s, k) = {p ∈ R | p(S, s, k) = 1}.

Definition 4. Fcon is called a confidence function corresponding to R if Fcon : R × σ(V_C) → ℝ is such that, for all p ∈ R and W ∈ σ(V_C),

\[ F_{con}(p, W) = \frac{|V_C(p, W)|}{|V_C(p)|} \, \log |V_C(p, W)|, \tag{3} \]

where | · | denotes the size of a set.

The confidence function takes a high value if a high percentage of the contextual rule’s extractions are members of W, or if a moderate percentage of the rule’s extractions are members of W and it extracts a lot of them.

Definition 5. Fsim is called a similarity function corresponding to C if Fsim : 𝒞 × σ(V_C) → ℝ, where 𝒞 is the corpus, is such that, for all (S, s, k) ∈ 𝒞 and W ∈ σ(V_C),

\[ F_{sim}(S, s, k, W) = \frac{\sum_{p \in R(S, s, k)} |V_C(p, W)|}{|R(S, s, k)|}, \tag{4} \]

where | · | denotes the size of a set.

The similarity function measures the similarity between a candidate (S, s, k) and the other candidates in W with respect to R. It takes a high value if (S, s, k) satisfies contextual rules in R that also tend to extract the members of W.

The significance of the confidence function and the similarity function is as follows. Given a set W of some C-VNEs, the confidence function identifies contextual rules for recognizing C-VNEs from W. Then, once the confidence function has given us a set R of contextual rules for recognizing C-VNEs, the similarity function recognizes other C-VNEs outside W. This is the idea behind constructing the set of contextual rules and recognizing the set V(C) of all C-VNEs.
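To make Definitions 3 to 5 concrete, the following Python sketch evaluates rule satisfaction, the confidence function (3) and the similarity function (4) on the two example sentences of Section 3.1 plus one invented sentence S3. Several points are assumptions of this sketch rather than statements of the paper: V_C is taken to be the set of candidates under consideration, "sl s sr ⊂ S" is read as a contiguous match around position k, the logarithm is natural, and both functions are set to 0 when their formula would be undefined (no member of W extracted, or no rule satisfied). All function names are hypothetical.

```python
import math


def satisfies(rule, triple):
    """p(S, s, k): 1 if the left context, s, and the right context are contiguous in S."""
    left, right = rule
    S, _s, k = triple
    i = k - 1                      # 0-based index of the syllable s
    start = i - len(left)
    if start < 0:                  # not enough room for the left context
        return 0
    ok = S[start:i] == left and S[i + 1:i + 1 + len(right)] == right
    return int(ok)


def confidence(rule, W, candidates):
    """Equation (3); defined as 0.0 here when the rule extracts no member of W."""
    extracted = {t for t in candidates if satisfies(rule, t)}   # V_C(p)
    hits = extracted & W                                        # V_C(p, W)
    if not hits:
        return 0.0
    return len(hits) / len(extracted) * math.log(len(hits))


def similarity(triple, W, rules):
    """Equation (4); defined as 0.0 here when the triple satisfies no rule."""
    matched = [p for p in rules if satisfies(p, triple)]        # R(S, s, k)
    if not matched:
        return 0.0
    total = sum(sum(satisfies(p, t) for t in W) for p in matched)
    return total / len(matched)


S1 = tuple("Ông Nam và ông Nhân đã thực_hiện thành_công dự_án này".split())
S2 = tuple("Tôi cùng ông Nghĩa đã thực_hiện một nhiệm_vụ quan trọng".split())
S3 = tuple("Hôm_qua ông Hùng đã thực_hiện chuyến đi".split())  # invented sentence

rule = (("ông",), ("đã", "thực_hiện"))
W = {(S1, "Nhân", 5), (S2, "Nghĩa", 4)}
candidates = W | {(S1, "Nam", 2), (S3, "Hùng", 3)}

conf = confidence(rule, W, candidates)
sim = similarity((S3, "Hùng", 3), W, [rule])
print(conf, sim)
```

In this toy setting the rule extracts three candidates, two of which are in W, so its confidence is (2/3) log 2, and the new candidate (S3, “Hùng”, 3) obtains similarity 2 because the only rule it satisfies extracts both members of W.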
Concretely, let S(C) ⊂ V(C) be a small set of given C-VNEs. The algorithm for constructing contextual rules and recognizing C-VNEs consists of two steps. First, the algorithm searches for sentences that contain C-VNEs in S(C) and tries to identify the contextual rules that best fit these examples. Next, it tries to find new C-VNEs appearing in similar contexts. The learning process is then reapplied to the newly found examples. By repeating this process, a large number of C-VNEs and a large number of contextual rules are eventually gathered. This algorithm is called the C-VNEs recognition algorithm and is given as follows.

The C-VNEs Recognition Algorithm
Input: A = S(C)
Output: V(C) - the set of all C-VNEs

    R(C) = ∅
    for i = 1 to n do
        Set the parameters to (θ_i^c, θ_i^s)
        repeat
            for each contextual rule p in R
                Calculate Fcon(p, A)
                if Fcon(p, A) ≥ θ_i^c then
                    Add p to R(C)
            for each (S, s, k) in 𝒞 \ A
                Evaluate Fsim(S, s, k, A)
                if Fsim(S, s, k, A) ≥ θ_i^s then
                    Add (S, s, k) to A
        until no contextual rules are added to R(C)
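The listing above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: candidates are represented by their surface strings instead of (S, s, k) triples, the dictionary extractions (rule to the set of candidates satisfying it) is assumed to be precomputed, the similarity is evaluated over the accepted rules R(C) (the listing does not say which rule set is used), and the confidence is set to 0 when a rule extracts no member of A. The rule strings, names and threshold values are invented.

```python
import math


def confidence(extracted, A):
    """Equation (3) on sets: (hits / extractions) * log(hits), 0.0 if no hit."""
    hits = len(extracted & A)
    return 0.0 if hits == 0 else hits / len(extracted) * math.log(hits)


def similarity(x, accepted, extractions, A):
    """Equation (4) restricted to the accepted rules that x satisfies."""
    matched = [p for p in accepted if x in extractions[p]]
    if not matched:
        return 0.0
    return sum(len(extractions[p] & A) for p in matched) / len(matched)


def bootstrap(seeds, extractions, thresholds):
    """thresholds holds one (theta_c, theta_s) pair per outer iteration."""
    A = set(seeds)
    accepted = set()                                   # R(C)
    candidates = set().union(*extractions.values())
    for theta_c, theta_s in thresholds:
        while True:
            # Accept every not-yet-accepted rule that is confident enough.
            new_rules = {p for p, ext in extractions.items()
                         if p not in accepted and confidence(ext, A) >= theta_c}
            accepted |= new_rules
            # Add every outside candidate that is similar enough to A.
            A |= {x for x in candidates - A
                  if similarity(x, accepted, extractions, A) >= theta_s}
            if not new_rules:                          # "until no rules are added"
                break
    return A, accepted


# Invented toy data: which candidates each rule extracts in a corpus.
extractions = {
    "ông [NE]": {"Nhân", "Nghĩa", "Hùng"},
    "mời [NE]": {"Hùng", "Lan", "Nhân"},
    "của [NE]": {"Nhân", "Hương", "Hà_Nội", "Mỹ", "Huế", "Hùng"},
}
A, rules = bootstrap({"Nhân", "Nghĩa"}, extractions, [(0.4, 1.5)])
print(sorted(A), sorted(rules))
```

In this toy run the rule “ông [NE]” is accepted first and brings in “Hùng”, which in turn makes “mời [NE]” confident enough to be accepted and to bring in “Lan”. The ambiguous rule “của [NE]” never reaches the confidence threshold, because most of its extractions lie outside A.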
The bootstrapping process loops n times with n parameter settings (θ_i^c, θ_i^s), which converge to the desired one. The way we choose these parameters guarantees the quality of the learning process. At each iteration, the process adds to R(C) the new contextual rules whose confidence value is at least θ_i^c. All of their extractions whose similarity value is at least θ_i^s are inferred to be category members and added to A. Then the next best contextual rules are identified, based on both the original seed VNEs and the new VNEs that were just added to the category, and the process repeats. Adding the best contextual rules helps to evaluate more accurately the similarity between new VNEs and the set A of C-VNEs. The results of the learning process are the set A of all C-VNEs and the set R(C) of all contextual rules for the category C.

3.3 Recognition of VNEs
Let E be the set of categories. For each category C ∈ E, let R(C) be the set of all contextual rules of C. We define the C-VNEs recognition score, a function on the space 𝒞 × E of corpus triples and categories, as follows:

\[ Score(S, s, k, C) = \sum_{p \in R(C)} F_{con}(p, V(C)) \cdot p(S, s, k) \cdot F_{occ}(S, s, k, V(C)), \tag{5} \]

where Fcon(p, V(C)) is the confidence of the contextual rule p with respect to V(C), and Focc(S, s, k, V(C)) is the occurrence function of the candidate (S, s, k) in V(C), determined by the following formula:

\[ F_{occ}(S, s, k, V(C)) = \begin{cases} 1 - \varepsilon & \text{if } (S, s, k) \in V(C), \\ \varepsilon & \text{otherwise,} \end{cases} \tag{6} \]

where 0 ≤ ε ≤ 1. By definition, the candidate (S, s, k) has a high C-VNEs recognition score if it satisfies some contextual rules in R(C) with high confidence values and is already a C-VNE. However, if (S, s, k) is not in V(C), we still calculate its C-VNEs recognition score, with the coefficient ε. In this case, if the candidate (S, s, k) satisfies many contextual rules whose confidence values are very high, we recognize it as a new C-VNE.

Given a parameter θ ≥ 0, the recognition function and the classification function are determined based on the C-VNE recognition score. For all (S, s, k) ∈ 𝒞,

\[ f_{rec}(S, s, k) = \begin{cases} 1 & \text{if } \exists C \in E,\ Score(S, s, k, C) \ge \theta, \\ 0 & \text{otherwise,} \end{cases} \tag{7} \]

and

\[ f_{cla}(S, s, k) = \begin{cases} \emptyset & \text{if } f_{rec}(S, s, k) = 0, \\ \operatorname{argmax}_{C \in E} Score(S, s, k, C) & \text{otherwise,} \end{cases} \tag{8} \]

where ∅ means that no category is assigned.

The VNE recognition algorithm is given as follows:

The VNE Recognition Algorithm
Input: Candidate (S, s, k), θ > 0
Output: frec(S, s, k) and fcla(S, s, k)

    max = 0
    for each category C in E do
        Calculate Score(S, s, k, C)
        if Score(S, s, k, C) > max then
            max = Score(S, s, k, C)
            Cmax = C
    if max < θ then
        return frec(S, s, k) = 0 and fcla(S, s, k) = ∅
    return frec(S, s, k) = 1 and fcla(S, s, k) = Cmax
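The scoring and decision steps of Equations (5) to (8) can be sketched as below. The sketch assumes that, for each category, the confidence values Fcon(p, V(C)) of the rules in R(C) and the member set V(C) have already been computed, and that the set of rules satisfied by the candidate in its sentence is known. The function name recognize, the data layout, the confidence values, and the choice ε = 0.1 are all hypothetical; None plays the role of "no category".

```python
def recognize(name, satisfied, categories, theta, eps=0.1):
    """Return (f_rec, f_cla) for one candidate, following the listing above.

    satisfied  -- set of contextual rules p with p(S, s, k) = 1
    categories -- dict: category -> {"rules": {rule: F_con}, "members": V(C)}
    theta      -- recognition threshold, assumed > 0 as in the algorithm input
    """
    best_score, best_category = 0.0, None
    for category, model in categories.items():
        # Equation (6): occurrence coefficient of the candidate in V(C).
        f_occ = 1 - eps if name in model["members"] else eps
        # Equation (5): only the satisfied rules contribute to the sum.
        score = f_occ * sum(conf for rule, conf in model["rules"].items()
                            if rule in satisfied)
        if score > best_score:
            best_score, best_category = score, category
    if best_score < theta:
        return 0, None
    return 1, best_category


# Invented confidence values and member sets, for illustration only.
categories = {
    "person": {"rules": {"ông [NE]": 0.9, "là ông [NE]": 0.7},
               "members": {"Nhân", "Nghĩa"}},
    "river": {"rules": {"dọc sông [NE]": 0.8, "vượt sông [NE]": 0.6},
              "members": {"Hương", "Hàn"}},
}

print(recognize("Nhân", {"ông [NE]"}, categories, theta=0.5))
print(recognize("Hùng", {"ông [NE]"}, categories, theta=0.5))
print(recognize("Hương", {"dọc sông [NE]"}, categories, theta=0.5))
```

With these invented numbers, a known member that satisfies one confident rule passes the threshold, whereas an unknown candidate satisfying the same single rule is damped by ε and is rejected unless θ is lowered or more rules are satisfied.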
4 Evaluation
Corpus and Contextual Rules. Our sentence data is collected from 250,034 articles in the TuoiTre online newspaper (http://www.tuoitre.com.vn) and more than 9,000 novels on the VNThuquan website (http://www.vnthuquan.net). We first preprocessed the data by normalization: fixing font encodings, repairing spelling mistakes in syllables, and segmenting sentences [2]. The initial corpus has 31,565,364 sentences with a total length of 411,623,127 syllables.

Categories and the Seed VNEs. Each category of VNEs is specified by a small number of seed VNEs. The seed VNEs of a given category must satisfy two conditions: (i) they must be frequent in the corpus for the bootstrapping algorithm to work well; and (ii) they must satisfy many contextual rules in the linguistic constraints of the given category and must not satisfy ambiguous rules, i.e., rules that have a high frequency and extract many VNEs of many distinct categories. The seed VNEs were generated by sorting the candidates in the corpus and identifying them manually. The seed VNE lists for some of the categories that we used are shown in Table 1.

Table 1. Seed Named Entities Lists

| Category | Seed NEs |
|---|---|
| NEs for Nations | Mỹ, Pháp, Anh, Đức, Nga, Nhật, Úc, Trung_Quốc, Ý, Việt_Nam. |
| NEs for Cities | Sài_Gòn, Huế, Đà_Nẵng, Cần_Thơ, Đà_Lạt, Hồ_Chí_Minh, Bình_Dương, Quảng_Nam, Nha_Trang, Đồng_Nai. |
| NEs for Rivers | Hương, Hàn, Gianh, Cửu_Long, Nhà_Bè, Cầu, Đà, Kim_Ngưu. |
| NEs for Streets | Lê_Duẩn, Chi_Lăng, Đống_Đa, Cách_Mạng_Tháng_Tám, Phố_Huế, Hai_Bà_Trưng, Trần_Hưng_Đạo, Bến_Nghé, Bạch_Đằng. |

Table 2. Top 15 extraction patterns for semantic categories

| NEs for Nations | NEs for Cities | NEs for Rivers | NEs for Humans |
|---|---|---|---|
| trận gặp [NE] | và tỉnh [NE] | dọc sông [NE] | là anh [NE] |
| hàng phòng_ngự [NE] | địa_bàn tỉnh [NE] | vượt sông [NE] | [B] ông [NE] |
| vào thị_trường [NE] | địa_phận tỉnh [NE] | tả_ngạn sông [NE] | của ông [NE] |
| với đội_tuyển [NE] | lãnh_đạo tỉnh [NE] | của sông [NE] | là ông [NE] |
| [B] quốc_tịch [NE] | cho tỉnh [NE] | thượng_nguồn sông [NE] | hỏi thằng [NE] |
| của tuyển [NE] | từ tỉnh [NE] | hạ_lưu sông [NE] | gọi chị [NE] |
| sang thị_trường [NE] | chính_quyền tỉnh [NE] | trên dòng_sông [NE] | tìm anh [NE] |
| nền kinh_tế [NE] | của tỉnh [NE] | dọc_theo sông [NE] | lúc anh [NE] |
| tại thị_trường [NE] | địa_bàn thành_phố [NE] | dọc bờ_sông [NE] | gặp_lại chị [NE] |
| nhập quốc_tịch [NE] | tại tỉnh [NE] | bên dòng_sông [NE] | và cậu [NE] |
| nhập_khẩu từ [NE] | toàn tỉnh [NE] | bên bờ_sông [NE] | mời chú [NE] |
| [B] du_khách [NE] | trên địa_bàn [NE] | và sông [NE] | nhìn ông [NE] |
| tiền_đạo người [NE] | ở tỉnh [NE] | ven sông [NE] | để bác [NE] |
| [B] người [NE] | chi_nhánh tại [NE] | nguồn sông [NE] | nhìn bà [NE] |
| xuất_khẩu sang [NE] | thuộc tỉnh [NE] | đoạn sông [NE] | của mợ [NE] |

Contextual Rules for Categories. We ran the bootstrapping algorithm for 50 iterations. The contextual rules produced by the last iteration were the output of the system. From checking the lists of the best contextual rules of 20 categories, we draw the following observations:
– The collected contextual rules correctly reflect the linguistic constraints of each category. Ambiguous rules such as “của [NE]” and “là [NE]” do not appear in the lists. The experimental results correctly reflect the accuracy and efficiency of our measures and algorithms.
– The recognition and classification of categories depend only on the seed VNEs, which increases the flexibility and efficiency of the system. The approach is therefore appropriate for classifying all VNEs, which are divided into different types.
Table 2 shows the top 15 contextual rules for some categories produced by bootstrapping after 50 iterations, where [B] labels the beginning of a phrase and [NE] labels a VNE. Most of these contextual rules are clearly useful linguistic constraints for extracting the VNEs of each category.

Table 3. Accuracy of the Semantic Lexicons

| Category | Iter 1 | Iter 10 | Iter 20 | Iter 30 | Iter 40 | Iter 50 |
|---|---|---|---|---|---|---|
| NEs for Nations | 5/5 (1) | 43/50 (.86) | 82/100 (.82) | 134/150 (.89) | 178/200 (.89) | 219/250 (.87) |
| NEs for Cities | 5/5 (1) | 44/50 (.88) | 91/100 (.91) | 139/150 (.93) | 183/200 (.92) | 221/250 (.88) |
| NEs for Rivers | 5/5 (1) | 49/50 (.98) | 98/100 (.98) | 146/150 (.97) | 185/200 (.97) | 239/250 (.96) |
| NEs for Streets | 5/5 (1) | 50/50 (1) | 99/100 (.99) | 149/150 (.99) | 197/200 (.98) | 245/250 (.98) |
| NEs for Persons | 5/5 (1) | 49/50 (.98) | 99/100 (.99) | 149/150 (.99) | 199/200 (.99) | 248/250 (.99) |
The Bootstrapping Algorithm. Table 3 shows the accuracy for some categories after the first iteration of bootstrapping and after every tenth iteration. Each cell shows the number of true category members among the entries generated thus far. For example, 50 VNEs had been recognized as named entities of cities after the tenth iteration, and 44 of those (88%) were true VNEs of cities. After 50 iterations, out of 250 extracted entries per category, bootstrapping had correctly identified 219 VNEs of nations, 221 VNEs of cities, 239 VNEs of rivers, 245 VNEs of streets and 248 VNEs of persons. The accuracy of the recognition process is thus high, between 87% and 99% after 50 iterations.

Vietnamese Named Person Recognition. In the experiment on Vietnamese person-name recognition, we ran the bootstrapping algorithm for 200 iterations. It identified 31,427 person VNEs with an accuracy of about 98%, including both short names such as “Vũ”, “Kiên”, “Hiếu” and full names such as “Lê_Trung_Hiếu”, “Nguyễn_Thu_Phương”. Moreover, selecting appropriate seed VNEs helps to identify person names of specific groups, such as names of men or women, or names of singers, actors, or politicians. These results are important data for problems such as information extraction, document summarization, and document classification.
5 Conclusion
We have presented a rule-based model for recognizing and classifying Vietnamese named entities. The model is constructed using the bootstrapping algorithm. The rule-based model relies on contextual rules to provide contextual evidence that a VNE belongs to a category. The model works well on a large corpus and can extract many categories, such as names of persons, nations, cities, streets, rivers, places, and organizations, with high accuracy. The Vietnamese language is not explained and described well by grammar rules. However, our results demonstrate that models of Vietnamese word recognition and the theory of word order patterns can be applied to solve some tasks in Vietnamese language processing. One of our research directions is “Finding the most common formulas of Vietnamese sentences”. We believe that, with a huge corpus, many problems of Vietnamese language processing can be solved based on statistics.
References

1. Chen, C., Lee, H.J.: A Three-Phase System for Chinese Named Entity Recognition. In: Proceedings of ROCLING XVI, pp. 39–48 (2004)
2. Le Trung, H., Le Anh, V., Le Trung, K.: An Unsupervised Learning and Statistical Approach for Vietnamese Word Recognition and Segmentation. In: Nguyen, N.T., Le, M.T., Świątek, J. (eds.) ACIIDS 2010, Part II. LNCS (LNAI), vol. 5991, pp. 195–204. Springer, Heidelberg (2010)
3. Le Trung, H., Le Anh, V., Dang, V.-H., Hoang, H.V.: Recognizing and Tagging Vietnamese Words Based on Statistics and Word Order Patterns. In: Nguyen, N.T., Trawiński, B., Katarzyniak, R., Jo, G.-S. (eds.) Adv. Methods for Comput. Collective Intelligence. SCI, vol. 457, pp. 3–12. Springer, Heidelberg (2013)
4. Lin, W., Yangarber, R., Grishman, R.: Bootstrapped Learning of Semantic Classes from Positive and Negative Examples. In: Proceedings of the ICML 2003 Workshop on the Continuum from Labeled to Unlabeled Data (2003)
5. Thelen, M., Riloff, E.: A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts. In: Proceedings of the ACL 2002 Conference on Empirical Methods in Natural Language Processing, pp. 214–221 (2002)
6. Riloff, E., Jones, R.: Learning Dictionaries for Information Extraction by Multi-level Bootstrapping. In: Proceedings of the Sixteenth National Conference on Artificial Intelligence and the Eleventh Innovative Applications of Artificial Intelligence Conference, pp. 474–479 (1999)
7. Tran, Q.T., Pham, T.X.T., Ngo, Q.H., Dinh, D., Collier, N.: Named Entity Recognition in Vietnamese Documents. Progress in Informatics Journal, 5–13 (2007)
8. Pham, T.X.T., Kawazoe, A., Dinh, D., Collier, N., Tran, Q.T.: Construction of a Vietnamese Corpora for Named Entity Recognition. In: RIAO 2007, 8th International Conference, pp. 719–724. Carnegie Mellon University, Pittsburgh (2007)