A Knowledge-Based Approach to Named Entity Disambiguation in News Articles
Hien T. Nguyen (1) and Tru H. Cao (2)
(1) Ho Chi Minh City University of Industry, Vietnam
(2) Ho Chi Minh City University of Technology, Vietnam
{hiennt, tru}@cse.hcmut.edu.vn
Abstract. Named entity disambiguation has been one of the main challenges for research in Information Extraction and the development of the Semantic Web. Therefore, it has attracted much research effort, with various methods introduced for different domains, scopes, and purposes. In this paper, we propose a new approach that is not limited to some entity classes and does not require well-structured texts. The novelty is that it exploits relations between co-occurring entities in a text, as defined in a knowledge base, for disambiguation. Combined with class weighting and coreference resolution, our knowledge-based method outperforms the KIM system on this problem. The implemented algorithms and the experiments conducted for the method are presented and discussed.
Keywords: Name disambiguation, coreference resolution, ontology, knowledge base, semantic web.
M.A. Orgun and J. Thornton (Eds.): AI 2007, LNAI 4830, pp. 619–624, 2007. © Springer-Verlag Berlin Heidelberg 2007
1 Introduction
In the Information Extraction and Natural Language Processing areas, named entities (NE) are people, organizations, locations, and other things that are referred to by names ([3]). Having originated from research in those areas, named entities have also become a key issue in the development of the Semantic Web ([11]). That is because, in many domains, in particular news articles, the information and semantics of the articles’ texts center around the named entities and their relations mentioned therein.
One great challenge in dealing with named entities is that one entity may have different names and one name may refer to different entities. The former raises the NE coreference problem, while the latter raises the NE ambiguity problem, which this paper addresses. For example, the name “Jim Clark” in one news article may refer to Jim Clark who is a Formula One world champion, whereas in another news article it may refer to Jim Clark who is the founder of Netscape.
Such identity uncertainty ([9]) has attracted much research effort and has been tackled in various domains, with different scopes and purposes, e.g. name disambiguation ([1], [4], [6], [8]), citation matching ([9]), and entity identification ([5], [7], [10]).
Name disambiguation is to determine whether two identical names refer to the same entity. Several methods, using the vector space model or a graph model, for instance, have been proposed for name disambiguation of people in social networks ([1]), person names in web search ([4]), authors in publications ([6]), and people in newspapers
([8]). Citation matching is to determine whether two references are to the same publication. Meanwhile, entity identification aims at locating, in a knowledge base (KB) of discourse, the entity that a name represents.
Closely related to our work are those of [5], [7] and [10]. The method proposed in [5] relies on affiliation, text proximity, areas of interest, and co-author relationship as clues for disambiguating person names in calls for papers only. Meanwhile, the domain of [10] is that of geographical names in texts. The authors use some patterns to narrow down the candidates of ambiguous geographical names. For instance, “Paris, France” more likely refers to the capital of France than to a small town in Texas. They then rank the remaining candidate entities based on the weights that are attached to classes of the constructed Geoname ontology. The shortcoming of those methods is that they omit relationships between named entities of different classes, such as between person and organization, or organization and location. In [7], some pattern matching rules written in JAPE’s grammar ([12]) are applied to resolve simple ambiguous cases in a text, based on the prefix or suffix of a name. For example, “John Smith, Inc.” is recognized as an organization instead of a person because the suffix “Inc.” implies an organization. Then, disambiguation uses an entity-ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics.
Our work contrasts with the above-mentioned ones in both of the following two aspects at once. First, the problem that we address is not limited to named entities of a particular class or domain, but covers all that may occur in texts of the news domain, where texts are not as well-structured as those of publication references or calls for papers, and where co-occurrence and relationships of named entities are essential to identify them.
Second, we not only disambiguate a name, but also identify the entity of that name in a KB of discourse. Here a KB is used both as the goal and as the means of an NE disambiguation process.
The intuition and assumption behind our approach is that the identity of the entity represented by an ambiguous name in a news article depends on co-occurring unambiguous named entities in the text. For example, suppose that in our KB there are two entities named “Jim Clark”, one of which has a relation with the Formula One car racing championship and the other with Netscape. If a text in which the name appears also contains occurrences of Netscape or web-related named entities, then it is more likely that the name refers to the one related to Netscape in the KB.
After running an NE recognition engine with respect to an ontology and a KB, those entities in the KB that are already mapped to the unambiguous named entities are called disambiguated entities. Also, each entity in the KB whose name is identical to an ambiguous name appearing in a text is called a candidate entity for that name.
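
The two notions just defined can be illustrated with a minimal sketch. It assumes, for illustration only, that the KB exposes a name-to-entities index and that a name is unambiguous exactly when one KB entity bears it; `ALIAS_INDEX`, `partition_mentions`, and all entity identifiers are hypothetical and are not part of KIM or of the method's actual implementation.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical name-to-entities index standing in for a KB lookup.
ALIAS_INDEX: Dict[str, Set[str]] = {
    "Netscape": {"Netscape"},
    "Jim Clark": {"JimClark_F1", "JimClark_Netscape"},
}


def partition_mentions(names: List[str]) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Split recognized names into disambiguated entities and candidate sets.

    Assumption: a name with exactly one KB entity is unambiguous; a name with
    several KB entities is ambiguous and all of them are its candidates.
    Names unknown to the index are ignored.
    """
    disambiguated: Set[str] = set()
    candidates: Dict[str, Set[str]] = {}
    for name in names:
        entities = ALIAS_INDEX.get(name, set())
        if len(entities) == 1:
            disambiguated |= entities
        elif len(entities) > 1:
            candidates[name] = set(entities)
    return disambiguated, candidates


disambiguated, candidates = partition_mentions(["Jim Clark", "Netscape"])
print(disambiguated)  # the disambiguated entities found in the text
print(candidates)     # ambiguous name -> its candidate entities
```

Here “Netscape” yields a disambiguated entity, while “Jim Clark” yields two candidate entities that still have to be ranked.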
2 Proposed Approach
In a news article, co-occurring entities are usually related to each other. Furthermore, the identity of a named entity is inferable from nearby and previously identified named entities in the text. For example, when the name “Georgia” occurs with “Atlanta” in a text, if Atlanta is already recognized as a city in the United States, then it is more likely that “Georgia” refers to a state of the U.S. rather than the country Georgia, and vice versa when it occurs with Tbilisi as a country capital in another text. From this observation, our method takes into account not only the most probable
classes of an ontology, but also other co-occurring named entities in a text of discourse. Concretely, the whole disambiguation process involves several iterations of the following steps (sub-sections 2.1, 2.2, and 2.3), where named entities identified in one round are used as a basis to disambiguate the remaining ones in the next round.
2.1 Knowledge-Based Ranking of Candidate Entities
At this step, disambiguated entities in the text are exploited to filter the most promising candidate entities for a name occurrence. The idea is that, considering two candidate entities, the entity whose related entities in the KB occur more often in the text is the more likely one. For example, suppose there is one street in Ha Noi and one street in Saigon that have the same name. If, in a text, that name occurs with the city Saigon, then the name probably refers to the street in Saigon.
Let C be the set of candidate entities, E be the set of disambiguated entities, and f be the function expressing whether there exists a relation between a candidate entity and a disambiguated entity. That is, f: C × E → {0, 1}, where f(c, e) = 1 (c∈C, e∈E) if and only if there exists a relation between c and e, and f(c, e) = 0 otherwise.
Candidate entities are then ranked by the number of disambiguated entities that have relations with each of them as given in the KB; the more relations, the higher the rank. In particular, we calculate a confidence score for each candidate entity. The confidence score of a candidate is increased by one every time a relation is encountered between that entity and any entity disambiguated so far. The set of candidate entities with the highest score is chosen. If two or more candidate entities share the maximal confidence score, then the one in closest proximity to the disambiguated entities within the same window context is selected.
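
As an illustrative sketch of the ranking just described (not the implementation used in the experiments), the confidence score of a candidate c can be computed as the sum of f(c, e) over all disambiguated entities e. The relation set `RELATIONS`, the entity identifiers, and the helper names are hypothetical, and the proximity-based tie-breaking is left out.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy KB: each unordered pair of entities linked by some relation.
RELATIONS: Set[FrozenSet[str]] = {
    frozenset({"JimClark_Netscape", "Netscape"}),
    frozenset({"JimClark_F1", "FormulaOne"}),
}


def f(c: str, e: str) -> int:
    """Return 1 if the KB holds a relation between c and e, and 0 otherwise."""
    return 1 if frozenset({c, e}) in RELATIONS else 0


def confidence_scores(candidates: Set[str], disambiguated: Set[str]) -> Dict[str, int]:
    """Count, for each candidate, the disambiguated entities related to it."""
    return {c: sum(f(c, e) for e in disambiguated) for c in candidates}


def top_candidates(candidates: Set[str], disambiguated: Set[str]) -> Set[str]:
    """Keep the candidates sharing the highest score (candidates must be non-empty)."""
    scores = confidence_scores(candidates, disambiguated)
    best = max(scores.values())
    return {c for c, score in scores.items() if score == best}


C = {"JimClark_Netscape", "JimClark_F1"}  # candidate entities for "Jim Clark"
E = {"Netscape"}                          # disambiguated entities in the text
print(top_candidates(C, E))
```

With this toy data only the Netscape-related candidate survives. When several candidates remain, a real system would still need the tie-breaking and the later steps.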
A window context could be a sentence, a paragraph, or a sequence of words containing the name under consideration.
2.2 Class-Based Ranking of Remaining Candidate Entities
This ranking is based on the assumption that entities of one class may be more common than those of another, so the former class is given a greater weight ([10]). For example, when “Athens” appears in a text, it is more likely that it refers to the capital of Greece than to a small town in the U.S. state of Georgia, because country capitals are encountered more often than small towns, as in news for instance.
In particular, the candidate entities remaining after the knowledge-based ranking step are filtered further. Each candidate entity is assigned a score that reflects the preference of its class over others. Then, the candidate entities with the highest score are chosen. We note that the classes of the considered candidate entities must be subclasses of the same class Person, Location, or Organization. After this step, if the final set contains only one named entity, the ambiguous name is successfully resolved. The chosen candidate entity is then added to the set of disambiguated entities to be used for further disambiguation.
2.3 Named Entity Coreference Resolution
This step performs coreference resolution within a text to create coreference chains, which help to identify an entity after recognizing some others in the same chain. It
can also be observed that a news article often uses a full name (i.e., the main alias) in its headline or first paragraph, before using other variants such as abbreviations or acronyms. For example, “George W. Bush” may be written first and then “George Bush”, “Mr. Bush” or “Bush” used as anaphora. If “George W. Bush”, “Mr. Bush” and “Bush” are found to be coreferent, then it is likely that “Mr. Bush” and “Bush” refer to the President of the United States instead of Barbara Bush, Laura Bush, or Samuel D. Bush, for instance. We employ the rules in [2] for coreference resolution.
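
Steps 2.1 and 2.2 and the round-based iteration described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the system evaluated in Section 3: the relation set, class assignments, class weights, and function names are all hypothetical, proximity-based tie-breaking is ignored, and coreference resolution (2.3) is left out because the rules of [2] are not reproduced here.

```python
from typing import Callable, Dict, FrozenSet, Optional, Set

# Hypothetical toy KB: relations, entity classes, and class weights.
RELATIONS: Set[FrozenSet[str]] = {
    frozenset({"Georgia_State", "Atlanta"}),
    frozenset({"Georgia_Country", "Tbilisi"}),
    frozenset({"Columbus_Georgia", "Georgia_State"}),
    frozenset({"Columbus_Ohio", "Ohio_State"}),
}
ENTITY_CLASS: Dict[str, str] = {
    "Georgia_State": "Province",
    "Georgia_Country": "Country",
    "Columbus_Georgia": "City",
    "Columbus_Ohio": "City",
}
CLASS_WEIGHT: Dict[str, float] = {"Country": 2.0, "Province": 1.0, "City": 1.0}


def keep_best(candidates: Set[str], score: Callable[[str], float]) -> Set[str]:
    """Return the candidates that share the maximal score."""
    scores = {c: score(c) for c in candidates}
    best = max(scores.values())
    return {c for c, s in scores.items() if s == best}


def resolve(candidates: Set[str], known: Set[str]) -> Optional[str]:
    """Apply step 2.1, then step 2.2; succeed only if one candidate remains."""
    by_relations = keep_best(
        candidates, lambda c: sum(frozenset({c, e}) in RELATIONS for e in known)
    )
    by_class = keep_best(by_relations, lambda c: CLASS_WEIGHT[ENTITY_CLASS[c]])
    return next(iter(by_class)) if len(by_class) == 1 else None


def disambiguate(ambiguous: Dict[str, Set[str]], disambiguated: Set[str]) -> Dict[str, str]:
    """Repeat rounds until a round resolves no further name."""
    known = set(disambiguated)
    resolved: Dict[str, str] = {}
    progress = True
    while progress:
        progress = False
        for name, candidates in ambiguous.items():
            if name in resolved:
                continue
            choice = resolve(candidates, known)
            if choice is not None:
                resolved[name] = choice
                known.add(choice)  # reused as evidence in later rounds
                progress = True
    return resolved


result = disambiguate(
    {
        "Columbus": {"Columbus_Georgia", "Columbus_Ohio"},
        "Georgia": {"Georgia_State", "Georgia_Country"},
    },
    {"Atlanta"},
)
print(result)
```

In this toy run “Columbus” cannot be resolved in the first round, because neither candidate is related to Atlanta and both have the same class weight. Once “Georgia” is resolved to the state, that entity becomes evidence and “Columbus” is resolved in the second round.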
3 Experiments and Evaluation
3.1 Corpora and Experiments
For the experiments, we employ the KIM ontology, KB, and NE recognition module to produce annotated web pages. Currently, the KIM ontology contains 250 named entity classes and 100 relations and attributes, while the KIM KB is populated with over 200,000 entities. However, since KIM also performs NE disambiguation, its annotated texts do not contain ambiguous cases. Therefore, we have to re-embed all possible annotations for each name in a text.
Since our goal is to evaluate ambiguity resolution, we are interested in ambiguous names in news articles. The corpora are built from English news of CNN, BBC, New York Times, Washington Post, and Business Week. More specifically, two entities are selected for the disambiguation experiments, namely (“Georgia”, Location) and (“Smith”, Person). In total, there are 3267 named entities in the Georgia corpus and 1467 named entities in the Smith corpus. Table 1 shows the number of occurrences of each type of (“Georgia”, Location) in the Georgia corpus. Table 2 shows the number of occurrences of each person in the Smith corpus. For later testing, all the entities referred to in those corpora are manually disambiguated with respect to the KIM KB by two people, to ensure the quality of the corpora. For the Georgia corpus in the experiments, we assume that the class Country has a greater weight than Province.
Table 1. Occurrences of (“Georgia”, Location) in the Georgia corpus
Class for Georgia   Number of articles   Occurrences
Province            30                   116
Country             17                   213
Total               47                   329
Table 2. Occurrences of (“Smith”, Person) in the Smith corpus
Person for Smith    Position       Number of articles   Occurrences
Jason Smith         COO            1                    2
Richard A. Smith    Chairman       5                    27
Rick Smith          CEO            2                    3
Jason Smith         Finance Dir.   2                    15
Richard J. Smith    CFO            2                    2
Total                              12                   49
3.2 Performance Evaluation
For evaluation, we run KIM, KIM enhanced with class weighting, and our method on the same corpora. The results are matched against the manually disambiguated corpora. Table 3 and Table 4 summarize the results when the methods are tested on the Georgia corpus and the Smith corpus, respectively. The obtained precision and recall measures show that our method, which exploits NE relations in a KB, outperforms both KIM and its enhancement with class weighting.
Table 3. Testing on the Georgia corpus
Method                  Correct disambiguation   Disambiguated names   Ambiguous names   Precision   Recall
KIM                     194                      312                   329               62.17%      58.96%
KIM + Class weighting   207                      315                   329               65.71%      62.91%
Our method              306                      315                   329               97.14%      93.00%
Table 4. Testing on the Smith corpus
Method                  Correct disambiguation   Disambiguated names   Ambiguous names   Precision   Recall
KIM                     39                       47                    49                82.97%      79.59%
KIM + Class weighting   39                       47                    49                82.97%      79.59%
Our method              46                       47                    49                97.87%      93.87%
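
The formulas behind the precision and recall columns are not stated explicitly. The figures in Tables 3 and 4 are consistent with precision = correct disambiguations / disambiguated names and recall = correct disambiguations / ambiguous names, with percentages truncated (not rounded) to two decimals. The following sketch recomputes them under that assumption; the names `Row` and `truncated_percent` are introduced here for illustration only.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    method: str
    correct: int        # correct disambiguations
    disambiguated: int  # names the method attempted to disambiguate
    ambiguous: int      # ambiguous names in the corpus


def truncated_percent(numerator: int, denominator: int) -> str:
    """Format a ratio as a percentage truncated to two decimals (integer arithmetic)."""
    hundredths = numerator * 10000 // denominator
    return f"{hundredths // 100}.{hundredths % 100:02d}%"


GEORGIA: List[Row] = [
    Row("KIM", 194, 312, 329),
    Row("KIM + Class weighting", 207, 315, 329),
    Row("Our method", 306, 315, 329),
]
SMITH: List[Row] = [
    Row("KIM", 39, 47, 49),
    Row("KIM + Class weighting", 39, 47, 49),
    Row("Our method", 46, 47, 49),
]

for corpus, rows in (("Georgia", GEORGIA), ("Smith", SMITH)):
    for row in rows:
        precision = truncated_percent(row.correct, row.disambiguated)
        recall = truncated_percent(row.correct, row.ambiguous)
        print(f"{corpus:8s} {row.method:22s} P={precision} R={recall}")
```

For example, 306/315 gives a precision of 97.14% and 306/329 gives a recall of 93.00% for our method on the Georgia corpus, matching Table 3.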
4 Conclusion
We have proposed a new approach to name disambiguation that exploits knowledge-based relations between named entities in a text under consideration. First, it is quite natural and similar to the way humans proceed, relying on well-identified entities to resolve other ambiguous names in the context. Second, it is robust to free texts without well-defined structures or templates. The experiments have shown that our method achieves higher precision and recall than KIM, even when KIM is enhanced with class weighting.
Regarding class weighting, since in the experimental corpus there are only two classes, weight assignment is simple. For a complete ontology with many classes, automatic class weight assignment is desirable. Also, there is still contextual and knowledge-based information other than entity relationships that can be used for name disambiguation on a larger scale. We are currently investigating those lines of research.
References
[1] Bekkerman, R., McCallum, A.: Disambiguating Web Appearances of People in a Social Network. In: Proc. of the 14th International World Wide Web Conference, Chiba, Japan, pp. 463–470 (2005)
[2] Bontcheva, K., Dimitrov, M., Maynard, D., Tablan, V., Cunningham, H.: Shallow Methods for Named Entity Coreference Resolution. In: Proc. of TALN 2002 Workshop, Nancy, France (2002)
[3] Chinchor, N., Robinson, P.: MUC-7 Named Entity Task Definition. In: Proc. of MUC-7 (1998)
[4] Guha, R., Garg, A.: Disambiguating People in Search. In: Proc. of the 13th International World Wide Web Conference, New York, USA (2004)
[5] Hassell, J., Aleman-Meza, B., Arpinar, I.B.: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 44–57. Springer, Heidelberg (2006)
[6] Han, H., Giles, L., Zha, H., Li, C., Tsioutsiouliklis, K.: Two Supervised Learning Approaches for Name Disambiguation in Author Citations. In: Proc. of ACM/IEEE Joint Conference on Digital Libraries, Tucson, Arizona (2004)
[7] Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.: Semantic Annotation, Indexing, and Retrieval. Journal of Web Semantics 2(1) (2005)
[8] Mann, G., Yarowsky, D.: Unsupervised Personal Name Disambiguation. In: Proc. of the 17th Conference on Natural Language Learning, Edmonton, Canada, pp. 33–40 (2003)
[9] Pasula, H., Marthi, B., Milch, B., Russell, S.J., Shpitser, I.: Identity Uncertainty and Citation Matching. Advances in Neural Information Processing Systems 15 (2002)
[10] Raphael, V., Joachim, K., Wolfgang, M.: Towards Ontology-based Disambiguation of Geographical Identifiers. In: Proc. of the 16th International WWW Conference, Banff, Canada (2007)
[11] Shadbolt, N., Hall, W., Berners-Lee, T.: The Semantic Web Revisited. IEEE Intelligent Systems 21(3), 96–101 (2006)
[12] Cunningham, H., Maynard, D., Tablan, V.: JAPE: A Java Annotation Patterns Engine. Technical report CS–00–10, University of Sheffield, Department of Computer Science (2000)