Named Entity Disambiguation: A Hybrid Statistical and Rule-Based Incremental Approach

Hien T. Nguyen1 and Tru H. Cao2

1 Ton Duc Thang University, Vietnam [email protected]
2 Ho Chi Minh City University of Technology, Vietnam [email protected]

Abstract. The rapidly increasing use of large-scale data on the Web makes named entity disambiguation one of the main challenges to research in Information Extraction and in the development of the Semantic Web. This paper presents a novel method for detecting proper names in a text and linking them to the right entities in Wikipedia. The method is hybrid, containing two phases: the first utilizes heuristics and patterns to narrow down the candidates, and the second employs the vector space model to rank the ambiguous cases and choose the right candidate. The novelty is that the disambiguation process is incremental and includes several rounds that filter the candidates, by exploiting previously identified entities and extending the text with those entities' attributes every time they are successfully resolved in a round. We test the performance of the proposed method on disambiguation of names of people, locations, and organizations in texts of the news domain. The experimental results show that our approach achieves high accuracy and can be used to construct a robust named entity disambiguation system.

1 Introduction

In the Information Extraction and Natural Language Processing areas, named entities (NE) are people, organizations, locations, and others that are referred to by names. A wider interpretation of the term includes any token referring to something specific in reality, such as numbers, addresses, amounts of money, dates, etc. ([7]). Having arisen from research in those areas, named entities have also become a key issue in the development of the Semantic Web ([20]). According to the vision of the Semantic Web, metadata about named entities would be widely available with high quality for easy sharing, integration, and processing by software agents. In that spirit, extracting named entities in texts and linking them to some ontology or knowledge base (KB), such as KIM1, OpenCyc2, Wikipedia3, etc., has been increasingly attracting researchers' attention.

1 http://www.ontotext.com/kim/
2 http://www.opencyc.org/
3 http://www.wikipedia.org/

J. Domingue and C. Anutariya (Eds.): ASWC 2008, LNCS 5367, pp. 420–433, 2008. © Springer-Verlag Berlin Heidelberg 2008


One great challenge in dealing with named entities is that one name may refer to different entities in different occurrences, and one entity may have different names, which may be written in different ways and with spelling errors. For example, the name “John McCarthy” in different occurrences may refer to different named entities, such as a computer scientist from Stanford University, a linguist from the University of Massachusetts Amherst, a British journalist who was kidnapped by Iranian terrorists in Lebanon in April 1986, an Australian ambassador, etc. Such ambiguity makes identification of NEs more difficult and raises the NE disambiguation problem as one of the main challenges to research, not only in the Semantic Web but also in areas of natural language processing in general.

Our work aims at detecting named entities in a text, disambiguating them, and linking them to the right ones in Wikipedia. The proposed method utilizes NEs and related terms co-occurring with the target entity in a text and in Wikipedia for disambiguation, the intuition being that these respectively convey its relationships and attributes. For example, suppose that in a KB there are two entities named “Jim Clark”, one of which has a relation with the Formula One car racing championship and the other with Netscape. Then, if in a text where the name appears there are occurrences of Netscape or web-related referents and terms, it is more likely that the name refers to the one related to Netscape in the KB. We exploit Wikipedia as a source of NE annotations, due to its size, variation, accuracy, and quantity of hyperlinks ([23]), to construct an annotated corpus for disambiguation.

The contribution of this paper is three-fold:
− First, we propose a hybrid method that combines heuristics and a learning model for disambiguation and identification of NEs in a text with respect to Wikipedia.
− Second, the proposed disambiguation process is incremental and includes several rounds that filter the candidates, by exploiting previously identified entities and extending the text with those entities' attributes every time they are successfully resolved in a round. Importantly, we explore context at several levels, from local to the whole text, where diverse clues are extracted for disambiguation at high accuracy.
− Finally, our method utilizes the disambiguation texts in titles of Wikipedia articles as an important feature, not only to choose the right entity for an ambiguous referent by searching for their occurrences in the local context, but also to disambiguate other ambiguous referents in the text.

The rest of the paper is organized as follows. Section 2 states the problem and describes its scope. Section 3 describes the resources of information in Wikipedia that are essential for our method. Section 4 describes the extraction of named entities in Wikipedia to create a disambiguation dictionary. Section 5 presents the disambiguation method in detail. Section 6 describes the evaluation of our method. Section 7 presents related work. Finally, we draw a conclusion in Section 8.

2 Background

The problem of disambiguation is to determine whether two named entities refer to the same entity. For instance, do “J. Smith” and “John Smith” refer to the same person?


Do different occurrences of “John Smith” refer to the same person? This paper addresses the problem of mapping referents that are not yet resolved in a text to the right referents in a predefined list of resolved referents. For instance, for the text “the computer scientist John McCarthy coined the term artificial intelligence in the late 1950's,” our method detects whether John McCarthy and the resolved referent John McCarthy (computer scientist) in Wikipedia refer to the same entity, and then links the referent John McCarthy to John McCarthy (computer scientist) in Wikipedia.

In [15], the authors identify several levels of named entity ambiguity. The first level is structural ambiguity, where the structure of a name is ambiguous. For instance, in the name “Victoria and Albert Museum” the word and is a part of the name, whereas in “IBM and Bell Laboratories”, and is a conjunction joining two computer company names. The second level is semantic ambiguity, where the entity type is ambiguous. For instance, “John Smith” may refer to a company or a person. Referent ambiguity is the next level, where one name may be used to refer to different entities, e.g., “Paris” may refer to Paris, France, or to a small town in Texas, the United States. Our work performs both semantic ambiguity and referent ambiguity resolution, with the assumption that structural ambiguity is resolved in pre-processing steps.

The task of NE disambiguation bears a resemblance to Word Sense Disambiguation (WSD) in that it comprises two key steps: a look-up step retrieves candidate referents from a KB of discourse (the sense inventory in WSD), and a second step chooses the most likely candidate. However, this task differs from WSD in that NEs, roughly speaking, represent specific individuals in the world of discourse, while words denote generic concepts such as types, properties, and relations.
Reasoning with words thus requires only lexical semantics and common sense, while reasoning with NEs requires specific knowledge about the world of discourse.

This problem has attracted much research effort, with various methods introduced for different domains, scopes, and purposes. Most of those methods fit into the categories described below:
− Rule-based methods, which use heuristics to disambiguate NEs of the Location ([17], [18]) or Person ([13]) type, or of arbitrary types from a given ontology ([16]).
− Machine learning methods, which extract information from Wikipedia to form language models ([2], [4]) or a co-occurrence model ([21]), and then use those models to disambiguate named entities.
− Data-driven methods, which apply semi-supervised techniques, combined with an additional un-annotated corpus, to learn contextual information for disambiguation ([22]).

The method of disambiguation presented here combines heuristics and a learning model. Maximizing the accuracy of mapping an NE referred to in a text to the right one in Wikipedia raises the significant question of how the contexts in which referents occur are exploited and how the corresponding NEs are represented. In our case, we represent NEs by their attributes and relations. The attributes are birthday, career, occupation, alias, first name, last name, and so on. The relations of an entity represent its relations to others, such as part-of, located-in, etc. The way we exploit the contexts is based on


Harris’ Distributional Hypothesis [14], which states that words occurring in similar contexts tend to have similar senses. We adapt the hypothesis to NE disambiguation instead of word sense disambiguation.

3 Wikipedia

Wikipedia is a free encyclopedia written by the collaborative effort of a large number of volunteer contributors. It is a multilingual resource growing exponentially, so that it has become the largest organized knowledge repository on the Web ([19]). We utilize Wikipedia data because of its size, quality, and growth speed, as well as because it is a source of information about synonyms, spelling variations, and abbreviations of NEs. It is also a fertile source of related terms and co-occurring NEs; those actually provide explicit information about the important features of the corresponding entity, such as the location and industry of a company, etc. In addition, the many-to-many correspondence between names and entities can be captured from Wikipedia by utilizing redirect pages and disambiguation pages.

Pages

A basic entry in Wikipedia is a page (or article) that defines and describes a single entity or concept. It is uniquely identified by its title. Typically, the title is the canonical name of the entity described in the page. When the name is ambiguous, the title contains further information, which we call disambiguation text, to distinguish the entity described from others. The disambiguation text is separated from the name by parentheses or by a comma, e.g., John McCarthy (computer scientist), Columbia, South Carolina. Each title is an identifier (ID) of a specific named entity in Wikipedia, because it identifies a unique entity in the Wikipedia entity space.

Links

Each page contains many links whose role is not only to point from the page to others but also to guide readers to pages that provide additional information about the entities mentioned. Each link is associated with an anchor text that represents the surface form of the corresponding entity. Note that if the anchor text denotes an ambiguous name or is an alternative name instead of the canonical name, a piped link is used in the wiki source code.
For instance, a typical link might look like [[Midland, Texas | Midland]], where Midland, Texas is the link target and Midland is its surface form.

Categories

The Wikipedia category system is also a source of meaningful information. The Wikipedia category tree is an example of a folksonomy, a kind of collaborative tagging system that enables users to categorize the content of the encyclopedic entries. Thus, the taxonomy of Wikipedia can express not only hyponymic relations but also meronymic relations. As an example, the Wikipedia page for George Bush belongs not only to the categories Presidents of the United States and Texas Republicans (is-a) but also to 1946 births (has-property). In Wikipedia, every entity page is associated with one or more categories, each of which can have subcategories


expressing meronymic or hyponymic relations. Note that we extract not only the direct category information of an entity but also all its parent and ancestor categories.

Redirect pages

A redirect page typically contains only a reference to an entity or concept page. The title of a redirect page is an alternative name of that entity or concept. For example, from the redirect pages of United States, we extract alternative names of the United States entity such as “US”, “USA”, “United States of America”, etc.

Disambiguation pages

A disambiguation page is created for an ambiguous name that denotes two or more entities in Wikipedia. It consists of links to the pages that define the different entities with the same name. For instance, the disambiguation page for “John McCarthy” lists pages discussing John McCarthy (referee), John McCarthy (journalist), John McCarthy (linguist), John McCarthy (computer scientist), etc. From the disambiguation pages we detect all entities that have the same name in Wikipedia, for creating the disambiguation dictionary.
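The two title conventions described above (parenthesized disambiguation text, or a comma-separated qualifier) can be parsed mechanically. The following is a minimal Python sketch; `parse_title` is a hypothetical helper written for this illustration, not part of any Wikipedia API.

```python
import re

def parse_title(title: str):
    """Split a Wikipedia page title into (canonical name, disambiguation text).

    Handles the two conventions described in the text: "Name (qualifier)"
    and "Name, Qualifier". Returns (title, None) when the title carries no
    disambiguation text.
    """
    m = re.match(r"^(.+?)\s*\((.+)\)$", title)
    if m:  # parenthesized form, e.g. "John McCarthy (computer scientist)"
        return m.group(1), m.group(2)
    if ", " in title:  # comma form, e.g. "Columbia, South Carolina"
        name, _, disamb = title.partition(", ")
        return name, disamb
    return title, None
```

A title such as `John McCarthy (computer scientist)` yields the canonical name `John McCarthy` plus the disambiguation text `computer scientist`, which is exactly the pair the later narrowing-down step matches against the local context.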

4 Creating the Disambiguation Dictionary

Based on the resources of information aforementioned, we follow the method presented in [2] to create a disambiguation dictionary. Since our work focuses on NEs, we first consider which pages in Wikipedia define NEs. In [2], the authors consider a page to describe an NE if it satisfies one of the following heuristics:
1. If its title is a multiword title, check the capitalization of all content words in the title, i.e., words other than prepositions, determiners, conjunctions, relative pronouns, or negations. Consider the page to describe an NE if and only if all the content words are capitalized.
2. If its title is a one-word title that contains at least two capital letters, then the page describes an NE. Otherwise, go to step 3.
3. Count how many times the title occurs in the text of the page in positions other than at the beginning of sentences. If at least 75% of these occurrences are capitalized, then the page describes an NE.

In this way, a dictionary is constructed such that the set of entries consists of all strings that may denote a named entity. In particular, if e is a named entity, its title name, its redirect names, and its disambiguation names are all added as entries in the dictionary. Each entry string in the dictionary is then mapped to the set of entities that the string may denote in Wikipedia. As a result, a named entity e is included in the set if and only if the string is one of the title name, redirect names, or disambiguation names of e.

Note that although we utilize information from Wikipedia to create the disambiguation dictionary, our method can be adapted to an ontology or knowledge base in general. In particular, one can generate a profile for each KB entity by making use of the ontology's concepts and the properties of the entities. In other words, one can take advantage of the class hierarchy by extracting the direct class and parent classes of the entities.
Also, the values of properties of entities can be exploited. For attributes, their values are directly extracted. For relation properties, one can utilize the names and IDs of the corresponding entities. All the extracted feature values of an entity are concatenated into a text snippet, which can be considered a profile of the entity, for further processing.
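The dictionary construction described above can be sketched as follows. The record fields (`id`, `title_name`, `redirect_names`, `disambiguation_names`) are illustrative assumptions about how the extracted Wikipedia data might be stored, not the authors' actual data model.

```python
from collections import defaultdict

def build_dictionary(entities):
    """Build a disambiguation dictionary mapping each surface string to the
    set of Wikipedia entity IDs (page titles) it may denote.

    `entities` is an iterable of dicts carrying the canonical page title,
    redirect names, and names gathered from disambiguation pages.
    """
    dictionary = defaultdict(set)
    for entity in entities:
        names = {entity["title_name"]}
        names.update(entity.get("redirect_names", []))
        names.update(entity.get("disambiguation_names", []))
        for name in names:  # every name becomes an entry pointing to this entity
            dictionary[name].add(entity["id"])
    return dictionary
```

Looking up an ambiguous string such as "John McCarthy" then returns the full candidate set, while a string with a single candidate (e.g. a redirect like "USA") is resolved immediately.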

5 Proposed Method

In a news article, co-occurring entities are usually related to each other. Furthermore, the identity of a named entity is inferable from nearby and previously identified NEs in the text. For example, when the name “Georgia” occurs with “Atlanta” in a text, and if Atlanta is already recognized as a city in the United States, then it is more likely that “Georgia” refers to a state of the U.S. rather than to the country Georgia, and vice versa when it occurs with Tbilisi, as a country capital, in another text. Furthermore, the words surrounding ambiguous names may denote attributes of the NEs referred to. If those words are automatically extracted, the ambiguous names may be disambiguated. For example, in the text “Michael Collins, assistant professor”, the word “professor” helps to discriminate the “Michael Collins” who works at MIT from the “Michael Collins” who flew on Apollo 11 and others with the same name. From those observations, we propose a method with the following essential points:
− It is a hybrid method containing two phases: the first is a rule-based phase that filters candidates and disambiguates named entities where possible, and the second employs a statistical learning model to rank the candidates.
− It is an iterative and incremental process in which a referent resolved at one iteration step is immediately utilized to disambiguate others in the next step.
− It exploits both entity IDs and keywords as means of named entity disambiguation in the two phases. In particular, in the first phase, based on the NE identifiers of previously identified NEs in the local context, it searches for occurrences of candidates' disambiguation texts, not only to filter the candidates but also to disambiguate ambiguous referents; then, in the second phase, it utilizes the words surrounding the ambiguous referents in consideration and the entity IDs in the whole text for ranking the candidates.

The disambiguation process comprises the following steps in each iteration:
1. Looking up candidates for referents in the text, using the disambiguation dictionary as a gazetteer.
2. Narrowing down the candidates for ambiguous referents in the text, using textual information, IDs in the local context, and the disambiguation texts of candidates. After this step, the text is extended by the disambiguation text of each chosen candidate.
3. Ranking candidates, using features extracted from the extended text and Wikipedia, to disambiguate the referents that have not been resolved yet.

5.1 Looking Up Candidates

Prior to looking up candidates in the disambiguation dictionary, we perform pre-processing steps. In particular, we perform NE recognition and NE coreference resolution


using the natural language processing resources of an Information Extraction engine based on GATE ([6]), a general architecture for developing natural language processing applications. The NE recognition applies pattern-matching rules written in the JAPE grammar of GATE, in order to identify the class of an entity in the text. After detecting all mentions of entities occurring in the text, we run the NE coreference resolution module ([3]) in the GATE system to resolve the different mentions of an NE into one group that uniquely represents the NE.

After the pre-processing steps, each entity name in the text is sent as a query to the dictionary to retrieve candidates. If there is only one candidate in the result, the corresponding referent is resolved. It can also be observed in practice, particularly in news articles, that the use of short names in place of full names is very common. For example, the names “Bush” and “George Bush” may be used in a news article to refer to the current president of the United States in place of “George W. Bush”, while the name “Bush”, if taken out of a particular context, can refer to Laura Bush or Samuel D. Bush, for instance. If “George W. Bush” and “Bush” are found to be coreferent in a text, then it is likely that they both refer to the president of the United States. Our work is based on the assumption that all the various representations of a name in a text mention the same entity. Therefore, we propagate resolved referents to the others in their coreference chains. For example, suppose that in a text there are occurrences of “Denny Hillis” and “Hillis” (where “Hillis” may refer to Ali Hillis, an American actress, Horace Hillis, an American politician, W. Daniel Hillis, an American inventor, entrepreneur, and author, etc.); if “Denny Hillis” is recognized as referring to W. Daniel Hillis, then “Hillis” is also resolved to W. Daniel Hillis.
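The propagation of resolved referents along coreference chains can be sketched as below. The chain and mapping structures are hypothetical simplifications; in the actual system the chains would come from GATE's coreference module.

```python
def propagate(coref_chains, resolved):
    """Propagate resolved referents through their coreference chains.

    `coref_chains` is a list of chains (lists of mention strings grouped by
    coreference resolution); `resolved` maps already-disambiguated mentions
    to Wikipedia entity IDs. Every mention in a chain containing a resolved
    mention inherits that entity ID.
    """
    result = dict(resolved)
    for chain in coref_chains:
        ids = {result[m] for m in chain if m in result}
        if len(ids) == 1:  # the chain points unambiguously at one entity
            entity_id = ids.pop()
            for mention in chain:
                result[mention] = entity_id
    return result
```

With the chain {"Denny Hillis", "Hillis"} and "Denny Hillis" already resolved, "Hillis" inherits the same entity ID, as in the example above.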
5.2 Narrowing Down Candidates

In this step we exploit the local context to narrow down the candidates and to disambiguate ambiguous referents where possible. The local context of a location referent consists of its neighboring referents (i.e., the previous and successive ones) in the text. For example, if “Paris” is a location mention followed by “France”, then “France” is in the local context of “Paris”. The local context of a person or organization referent consists of the words and referents occurring in a limited-length window, whose size is set to 10 tokens, centered on the entity mention.

In particular, we utilize the disambiguation texts of candidates to choose the correct one for each referent. For a location referent, the right candidate is the one whose disambiguation text is identical to the successive entity name, or whose name is identical to the disambiguation text of the previously resolved referent. For example, in the text “Columbia, South Carolina”, the candidate Columbia, South Carolina (the largest city of South Carolina) in Wikipedia is chosen because its disambiguation text is “South Carolina”; likewise, in the text “Atlanta, Georgia”, the candidate with the name “Atlanta” and disambiguation text “Georgia” (a major city of the state Georgia of the United States) is chosen, and Georgia is also resolved as a U.S. state because the previously resolved referent with identifier Atlanta, Georgia has disambiguation text identical to “Georgia”. For a person or organization referent, the chosen candidates are those whose disambiguation text occurs in the local context. After this step, if there is only one candidate in the result, the referent is considered resolved. For example, in the


text “Veteran referee (Big) John McCarthy, one of the most recognizable faces of mixed martial arts”, the word “referee” helps choose the candidate John McCarthy (referee) as the right one, instead of John McCarthy (computer scientist), John McCarthy (linguist), etc. in Wikipedia.

After that, we extend the text by the disambiguation texts of the resolved referents. Those will be exploited to resolve the remaining ambiguous referents in the next step. For example, for the text “Atlanta, Georgia”, after Atlanta has been recognized as a city of the state Georgia of the United States and Georgia has been recognized as a state of the United States, the extended text is “Atlanta, Georgia, Georgia (U.S. state)”, in which Atlanta, Georgia is the identifier of the city Atlanta and Georgia (U.S. state) is the identifier of the state Georgia in the Wikipedia named entity space.

5.3 Ranking Candidates

After extracting all the information about NEs in Wikipedia based on the features that are titles of entity pages, titles of redirect pages, categories, and hyperlinks, we represent those NEs by their information. For each of the remaining ambiguous referents in the extended text, we extract its feature values as follows:
− All words occurring in a window context centered on the entity name in the extended text. The window size is set to 55, the value observed to give optimum performance in the related task of cross-document coreference ([12]).
− All NE identifiers of identified named entities in the extended text.

After that, we concatenate the entire feature values of the referents in the extended text and of the NEs in Wikipedia into text snippets and represent them in the form of token-based feature vectors. We then need a similarity metric to calculate the similarity between the vectors. Cohen et al.
([5]) present various string similarity schemes for the task of matching English entity names, and they report TFIDF (or cosine similarity), which is widely used in the information retrieval community, to be the best among the token-based distance metrics. Given a pair of feature vectors S = (s1, s2, …, sn) and T = (t1, t2, …, tm), in which the si (i = 1, 2, …, n) and tj (j = 1, 2, …, m) are words (or tokens), the TFIDF score is defined as

    TFIDF(S, T) = Σ_{w ∈ S ∩ T} V(w, S) × V(w, T)

where

    V'(w, S) = log(TF_{w,S} + 1) · log(IDF_w)  and  V(w, S) = V'(w, S) / √( Σ_{w' ∈ S} V'(w', S)² )

and TF_{w,S} is the frequency of word w in S, and IDF_w is the inverse of the fraction of snippets in the snippet collection that contain w.

Let CE be the set of entities in Wikipedia that have the same name as the target entity e under consideration in the extended text. We cast the named entity disambiguation problem as a ranking problem, with the assumption that there is an appropriate scoring function to calculate the semantic similarity between the feature vectors of an entity ce ∈ CE and the entity e. We build a ranking function that takes as input the set of feature


vectors of the entities in CE and the feature vector of the entity e, and uses the scoring function to return the entity ce ∈ CE with the highest score. Fig. 1 presents an algorithm using TFIDF as the scoring function. At line 3 in Fig. 1, the score function takes as input a token-based feature vector of a candidate and a token-based feature vector of an ambiguous referent; the algorithm ranks the candidates and returns the one with the highest score.

Algorithm RankingCandidates(R: a set of ambiguous referents)
1: for each referent r ∈ R do
2:   let C be the set of candidates of r
3:   c* = argmax_{ci ∈ C} score(Vector(ci), Vector(r))
4:   assign c* to r
5:   extend the text by the disambiguation text of c*
6: end for

Fig. 1. An algorithm ranking candidates using TFIDF
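The scoring function of Fig. 1 can be sketched directly from the TFIDF formula above. This is a minimal illustration: the `idf` table is assumed to be precomputed over the snippet collection, and the default IDF of 1 for unseen words (giving them zero weight) is an illustrative choice, not the authors' setting.

```python
import math
from collections import Counter

def tfidf_score(s_tokens, t_tokens, idf):
    """TFIDF (cosine) similarity between two token snippets, following
    V'(w, S) = log(TF_{w,S} + 1) * log(IDF_w) with L2 normalization."""
    def weights(tokens):
        tf = Counter(tokens)
        v = {w: math.log(tf[w] + 1) * math.log(idf.get(w, 1.0)) for w in tf}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        return {w: x / norm for w, x in v.items()}
    vs, vt = weights(s_tokens), weights(t_tokens)
    # sum over the words common to both snippets (w in S ∩ T)
    return sum(vs[w] * vt[w] for w in vs.keys() & vt.keys())

def rank_candidates(candidate_snippets, referent_snippet, idf):
    """Return the candidate snippet scoring highest against the referent's
    snippet, as in line 3 of Fig. 1."""
    return max(candidate_snippets,
               key=lambda c: tfidf_score(c, referent_snippet, idf))
```

On the "Jim Clark" example from the introduction, a referent snippet mentioning Netscape scores higher against the Netscape-related candidate profile than against the racing-related one, so the ranking step picks the intended entity.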

5.4 Algorithm

Fig. 2 presents our disambiguation process. First, we resolve trivial ambiguous cases and retrieve candidates for the ambiguous referents (line 1). Line 2 performs coreference resolution. Lines 3 to 11 use heuristics to disambiguate. Line 6 searches for the disambiguation text of a candidate in a window containing the ambiguous referent under consideration and its previous and successive referents, in the case of location referents, or in a window of 10 tokens centered at the ambiguous referent, in the cases of person and organization referents. If the disambiguation text is found for only one candidate, the corresponding referent is resolved. Line 12 extends the text by the disambiguation texts of the resolved referents. Finally, we use TFIDF to rank the candidates for each of the remaining ambiguous referents and choose the candidate with the highest score. Note that after each iteration step, the algorithm extends the text by the disambiguation text of the chosen candidate; therefore, in the next iteration, the algorithm re-forms the feature vector of the ambiguous referent under consideration.

Algorithm: Disambiguation
1: resolve trivial (unambiguous) referents
2: resolve coreference of referents
3: for each ambiguous referent r do
4:   let C be the set of candidate referents of r
5:   for all candidates c ∈ C do
6:     search for c's disambiguation text in a window context
7:     if found for only one candidate c* then
8:       propagate c* to all referents in the coref-chain of r
9:     end if
10:   end for all
11: end for
12: extend the text

Fig. 2. Disambiguating process
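Putting the two phases together, the incremental loop of Fig. 2 might be sketched as follows. The data structures and the substring test for disambiguation texts are deliberate simplifications, and `rank_fn` stands in for the TFIDF ranking of Section 5.3; none of these names come from the authors' implementation.

```python
def disambiguate(text_tokens, referents, rank_fn):
    """Skeleton of the incremental loop: repeatedly resolve referents whose
    disambiguation text occurs in the (extended) text, extend the text with
    the winners' disambiguation texts, and finally fall back to ranking.

    `referents` maps each ambiguous name to a list of candidates, each an
    (entity_id, disambiguation_text) pair; `rank_fn(name, cands, text)`
    breaks the remaining ties.
    """
    resolved = {}
    text = list(text_tokens)
    changed = True
    while changed:  # iterate until no new referent resolves
        changed = False
        for name, candidates in referents.items():
            if name in resolved:
                continue
            # phase 1: keep candidates whose disambiguation text appears in context
            context = " ".join(text).lower()
            hits = [c for c in candidates if c[1] and c[1].lower() in context]
            if len(hits) == 1:
                resolved[name] = hits[0][0]
                text.extend(hits[0][1].split())  # extend text: more clues next round
                changed = True
    for name, candidates in referents.items():
        if name not in resolved:  # phase 2: statistical ranking
            resolved[name] = rank_fn(name, candidates, text)
    return resolved
```

For the "Veteran referee John McCarthy" example, the word "referee" in the context resolves the referent in phase 1 without ever invoking the ranking fallback.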

6 Evaluation

We downloaded the top two stories in five CNN news categories (Travel, Entertainment, World, World Business, and Americas) on July 22, 2008 to build a dataset for evaluation. For later testing, all the NEs referred to in this dataset were manually disambiguated with respect to Wikipedia, by two persons to ensure the quality of the dataset. The Wikipedia version we used is that of Zesch4 ([24]). Note that, due to the incompleteness of Wikipedia, an ambiguous name may be used to refer to some NEs not in Wikipedia, which are out of our work's target. Also note that we evaluate our method on named entities of three types: Person, Location, and Organization.

Table 1. Statistics about named entities in the manually disambiguated dataset

Category      # of referents  # of found entities in Wikipedia  # of ambiguous referents
Person        261             213                               123
Location      168             159                               94
Organization  89              84                                45
Total         518             454                               262

There are 518 proper names occurring in the dataset, 454 of which refer to NEs in Wikipedia; 262 names refer to two or more different NEs in Wikipedia. We evaluate our method in two scenarios. In the first scenario, we use GATE to detect and tag the boundaries of names occurring in the dataset and then categorize the corresponding referents as Person, Location, or Organization; this yields the dataset D1. We found that GATE fails to detect the boundaries of some names (12 names). For example, "Omar al-Bashir" is recognized as the separate names "Omar" and "al-Bashir", "Sony Ericsson" as the separate names "Sony" and "Ericsson", "African National Congress" as "African National", etc. There are also many names (77 names) that GATE does not recognize as entity names; for example, "Darfur", "Qunu", "Soweto", "Interfax", "Rosoboronexport", and so on are not recognized as entity names. We then manually fixed all errors in dataset D1 by adjusting wrong boundaries, adding missing tags, and re-categorizing all the wrong cases, yielding the dataset D2 with no errors. Table 1 presents the number of referents, the

4 http://www.ukp.tu-darmstadt.de/software/jwpl/


number of ambiguous referents, and the number of entities in Wikipedia referred to in the manually disambiguated dataset. We then ran our method on D1 and D2, respectively; the results were matched against the manually disambiguated dataset. We apply the measure that Fernandez et al. ([10]) use for their method to ours. In particular, we measure accuracy as the total number of correct NE (in text)/Wiki NE assignments divided by the total number of assignments. Table 2 presents statistics about the named entities in dataset D1. The data in the table show that the number of detected names is lower than it should be, because GATE fails to detect many proper names. Table 2 shows 482 referents in dataset D1, fewer than in Table 1, because GATE does not recognize some referents (as in the case of Darfur) and fails to detect the boundaries of some proper names (as in the case of "Omar al-Bashir"). Table 3 presents the accuracy results when we run our method on this dataset.

Table 2. Statistics about named entities in the dataset D1

Category      # of referents  # of found entities in Wikipedia  # of ambiguous referents
Person        245             195                               118
Location      148             140                               88
Organization  89              76                                41
Total         482             411                               247

Table 3. Accuracy results on the dataset D1

Category                Person   Location  Organization  All
Correct disambiguation  174      120       62            356
Accuracy                89.23%   85.70%    81.60%        86.61%

Table 4 presents the accuracy results when we run our method on the dataset D2. The results show that our method achieves high accuracy. The accuracy results in Table 3, obtained on dataset D1, are lower than those achieved on dataset D2, because of noise accumulated in the pre-processing steps. There are three reasons:
− GATE fails to detect the boundaries of names, e.g., "Christopher Nolan" is detected as "Nolan", "Luis Moreno-Ocampo" as "Luis Moreno-", etc.
− GATE fails to categorize named entities, e.g., Robben Island Prison is recognized as a person.
− GATE does not detect some proper names that could be clues providing meaningful information to disambiguate other entities.


Table 4. Accuracy results on dataset D2

Category        Correct disambiguations   Accuracy
Person                   207               97.18%
Location                 149               93.70%
Organization              73               86.90%
All                      428               94.27%

7 Related Works

Some works resolve semantic ambiguity by performing named entity tagging, which is usually seen as the task of identifying a text span and classifying it into a broad category such as Person, Organization, or Location ([6], [9]), or into a more fine-grained category specified by a given ontology ([8], [11]). However, those works do not disambiguate and identify NEs in texts.

Some other works use heuristic rules and focus on only one type of NE, such as location ([18]) or person ([13]). The method proposed in [13] relies on affiliation, text proximity, areas of interest, and co-author relationships as clues for disambiguating person names, in calls for papers only. Meanwhile, the domain of Raphael et al. ([18]) is that of geographical names in texts. In [18], the authors use some patterns to narrow down the candidates for ambiguous geographical names. For instance, "Paris, France" more likely refers to the capital of France than to a small town in Texas. They then rank the remaining candidate entities based on weights attached to classes of their constructed Geoname ontology. The shortcoming of those methods is that they omit relationships between named entities of different classes, such as between person and organization, or organization and location. The statistical method in [10], although it leverages the co-occurrence relation between NEs of different classes, is semi-automatic: it uses user feedback on disambiguated results, treated as a training dataset, to update its heuristics and rules.

The works most closely related to ours are presented in [2], [4], and [16]. Bunescu et al. [2] and Cucerzan [4] exploited several disambiguation resources, such as Wikipedia entity pages, redirection pages, categories, and hyperlinks, whereas in [16], the authors proposed a rule-based method that utilizes KB-based relationships and named entities co-occurring with ambiguous ones to disambiguate. Bunescu et al. and Cucerzan extracted that information from Wikipedia to form language models and then used those models to disambiguate named entities. Those language models serve as a means of capturing the different contexts in which different names referring to the same entity occur.

In this paper, we propose a hybrid statistical and rule-based incremental method that combines heuristics and a learning model. We first use heuristics and pattern matching for entity disambiguation, and then extract information from Wikipedia to form a language model for disambiguating named entities. The proposed method is an incremental process comprising several rounds that filter the candidates by exploiting previously identified entities and extending the text with those entities' attributes every time they are successfully resolved in a round. Importantly, our method utilizes entity IDs, instead of entity names as in the literature, to identify the right entity for an ambiguous referent. Furthermore, we explore context at several levels, from local context to the whole text, from which diverse clues are extracted for disambiguation at high accuracy.
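The vector-space ranking step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: candidate entity IDs, the tokenization, and the toy page texts are hypothetical, with TF-IDF weighting and cosine similarity standing in for the paper's vector space model.

```python
# Minimal sketch of VSM-based candidate ranking: the context around an
# ambiguous name and each candidate entity's Wikipedia page text are
# turned into TF-IDF vectors; the candidate whose page is most similar
# to the context is ranked first. Entity IDs (hypothetical strings
# here) are returned instead of surface names.
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build a TF-IDF vector (dict term -> weight) for each tokenized doc."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # document frequency per term
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_candidates(context_tokens, candidates):
    """candidates: dict entity_id -> tokenized candidate page text.
    Returns (entity_id, score) pairs, best candidate first."""
    ids = list(candidates)
    vecs = tf_idf_vectors([context_tokens] + [candidates[i] for i in ids])
    ctx, cand_vecs = vecs[0], vecs[1:]
    return sorted(zip(ids, (cosine(ctx, v) for v in cand_vecs)),
                  key=lambda p: p[1], reverse=True)

# Toy usage, echoing the paper's "George Bush" example:
candidates = {
    "George_Walker_Bush": "george bush president united states texas republican".split(),
    "George_Bush_(biblical_scholar)": "george bush professor hebrew bible scholar".split(),
}
scored = rank_candidates("bush president united states".split(), candidates)
print(scored[0][0])  # the presidential candidate entity ranks first
```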


8 Conclusion

We have proposed an original approach to named entity disambiguation. It is a hybrid and incremental process that utilizes previously identified NEs and related terms co-occurring with ambiguous names in a text for entity disambiguation. Firstly, it is quite natural and similar to the way humans resolve ambiguity, relying on co-occurring entities and terms to resolve ambiguous referents in a given context. Secondly, it is robust to free texts without well-defined structures or templates. Next, Wikipedia editions are currently available for approximately 253 languages, which means that our method can be used to build named entity disambiguation systems for a large number of languages. Finally, despite its exploitation of Wikipedia as a means of named entity disambiguation, our method can be adapted to any ontology and KB in general. The experiment results have shown that our method achieves high accuracy.

References

[1] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
[2] Bunescu, R., Paşca, M.: Using encyclopedic knowledge for named entity disambiguation. In: Proc. of the 11th Conference of EACL, pp. 9–16 (2006)
[3] Bontcheva, K., Dimitrov, M., Maynard, D., Tablan, V., Cunningham, H.: Shallow Methods for Named Entity Coreference Resolution. In: Proc. of TALN 2002 Workshop, Nancy, France (2002)
[4] Cucerzan, S.: Large-Scale Named Entity Disambiguation Based on Wikipedia Data. In: Proc. of the EMNLP-CoNLL Joint Conference (2007)
[5] Cohen, W., Ravikumar, P., Fienberg, S.: A Comparison of String Metrics for Name-Matching Tasks. In: IJCAI-03 II-Web Workshop (2003)
[6] Cunningham, H., et al.: GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In: Proc. of the 40th ACL (2002)
[7] Chinchor, N., Robinson, P.: MUC-7 Named Entity Task Definition. In: Proc. of MUC-7 (1998)
[8] Cimiano, P., Völker, J.: Towards large-scale, open-domain and ontology-based named entity classification. In: Proc. of RANLP 2005, pp. 166–172 (2005)
[9] Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In: Proc. of CoNLL 2003, pp. 142–147 (2003)
[10] Fernandez, N., et al.: IdentityRank: Named entity disambiguation in the context of the NEWS project. In: Franconi, E., Kifer, M., May, W. (eds.) ESWC 2007. LNCS, vol. 4519. Springer, Heidelberg (2007)
[11] Fleischman, M., Hovy, E.: Fine grained classification of named entities. In: Proc. of the Conference on Computational Linguistics (2002)
[12] Gooi, C.H., Allan, J.: Cross-document coreference on a large-scale corpus. In: Proc. of HLT-NAACL, Boston, MA (2004)
[13] Hassell, J., Aleman-Meza, B., Arpinar, I.B.: Ontology-driven automatic entity disambiguation in unstructured text. In: Cruz, I., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., Aroyo, L.M. (eds.) ISWC 2006. LNCS, vol. 4273, pp. 44–57. Springer, Heidelberg (2006)
[14] Harris, Z.: Distributional structure. Word 10(23), 146–162 (1954)
[15] Wacholder, N., Ravin, Y., Choi, M.: Disambiguation of proper names in text. In: Proc. of ANLP, pp. 202–208 (1997)


[16] Nguyen, H.T., Cao, T.H.: A knowledge-based approach to named entity disambiguation in news articles. In: Orgun, M.A., Thornton, J. (eds.) AI 2007. LNCS (LNAI), vol. 4830, pp. 619–624. Springer, Heidelberg (2007)
[17] Peng, Y., He, D., Mao, M.: Geographic Named Entity Disambiguation with Automatic Profile Generation. In: Proc. of WI 2006 (2006)
[18] Raphael, V., Joachim, K., Wolfgang, M.: Towards Ontology-based Disambiguation of Geographical Identifiers. In: Proc. of the 16th WWW Workshop on I3: Identity, Identifiers, Identifications (2007)
[19] Remy, M.: Wikipedia: The free encyclopedia. Information Review 26(6), 434 (2002)
[20] Shadbolt, N., Hall, W., Berners-Lee, T.: The Semantic Web Revisited. IEEE Intelligent Systems 21(3), 96–101 (2006)
[21] Overell, S., Rüger, S.: Geographic Co-occurrence as a Tool for GIR. In: Proc. of the CIKM Workshop on Geographic Information Retrieval, Lisbon, Portugal, pp. 71–76 (2007)
[22] Smith, D., Mann, G.: Bootstrapping toponym classifiers. In: HLT-NAACL Workshop on Analysis of Geographic References, pp. 45–49 (2003)
[23] Weaver, G., Strickland, B., Crane, G.: Quantifying the accuracy of relational statements in Wikipedia: a methodology. In: Proc. of JCDL, pp. 358–358 (2006)
[24] Zesch, T., Gurevych, I., Mühlhäuser, M.: Analyzing and Accessing Wikipedia as a Lexical Semantic Resource. In: Rehm, G., Witt, A., Lemnitzer, L. (eds.) Data Structures for Linguistic Resources and Applications, pp. 197–205 (2007)

in news articles. In: Orgun, M.A., Thornton, J. (eds.) AI 2007. LNCS (LNAI), vol. 4830, pp. 619–624. Springer, Heidelberg (2007) Peng, Y., He, D., Mao, M.: Geographic Named Entity Disambiguation with Automatic Profile Generation. In: Proc. of WI 2006 (2006) Raphael, V., Joachim, K., Wolfgang, M.: Towards Ontology-based Disambiguation of Geographical Identifiers. In: Proc. of the 16th WWW Workshop on I3: Identity, Identifiers, Identifications (2007) Remy, M.: Wikipedia: The free encyclopedia. Information Review 26(6), 434 (2002) Shadbolt, N., Hall, W., Berners-Lee, T.: The Semantic Web Revisited. IEEE Intelligent Systems 21(3), 96–101 (2006) Overell, S., Rüger, S.: Geographic Co-occurrence as a Tool for GIR. In: Proc. of CIKM Workshop on Geographic Information Retrieval, Lisbon, Portugal, pp. 71–76 (2007) Smith, D., Mann, G.: Bootstrapping toponym classifiers. In: HLT-NAACL Workshop on Analysis of Geographic References, pp. 45–49 (2003) Weaver, G., Strickland, B., Crane, G.: Quantifying the accuracy of relational statements in Wikipedia: a methodology. In: Proc. of JCDL, pp. 358–358 (2006) Zesch, T., Gurevych, I., Mühlhäuser, M.: Analyzing and Accessing Wikipedia as a Lexical Semantic Resource. In: Rehm, G., Witt, A., Lemnitzer, L. (eds.) Data Structures for Linguistic Resources and Applications, pp. 197–205 (2007)