Entity Resolution by Kernel Methods

Anja Pilz, Lukas Molzberger, and Gerhard Paaß

Fraunhofer Institute Intelligent Analysis and Information Systems, St. Augustin, Germany

Abstract. An important problem in text mining and semantic retrieval is entity resolution, which aims at detecting the identity of a named entity. Note that the name of a unique entity may be written in variant ways, and different unique entities may have the same name. The term "bush", for instance, may refer to a woody plant, a mechanical fixing, 52 persons and 8 places covered in Wikipedia, and thousands of other persons. To our knowledge, we are the first to apply a kernel-based entity resolution approach to the German Wikipedia as a reference for named entities. We describe the context of named entities in Wikipedia and the context of a detected name phrase in a new document by a context vector of relevant features. These contain not only the name itself and its variant writings, but also relevant key terms, other identified named entities, and topic indicators generated by an LDA topic model. We formulate different kernels for comparing these context vectors and use kernel classifiers, e.g. rank classifiers, to determine the right match. Compared to a baseline approach using only text similarity, the addition of topic features gives a much higher F-value, which is comparable to the results published for English. It turns out that the procedure is also able to detect with high reliability if a person is not covered by Wikipedia.

1 Introduction

The problem of name ambiguity exists in many forms. It is common for different people to share the same name. For example, there is a Gerhard Schröder who was chancellor of the Federal Republic of Germany, another who was Federal Minister in Germany, and several more who are broadcasting officials or journalists. Locations may have the same name. For example, there are 27 municipalities in Germany called Neustadt. The acronyms associated with organizations may also be ambiguous: UMD may stand for the University of Michigan – Dearborn, the University of Minnesota, Duluth, or the University of Maryland. On the other hand, there may be different names for the same entity. For many organizations there exist a number of different acronyms and designations. For the political party Sozialdemokratische Partei Deutschlands we have the synonyms Sozialdemokraten, SPD, Sozis, etc.

The effects of name ambiguity can be seen when carrying out web searches or retrieving articles from an archive of newspaper text. For example, the top 10 hits of a Google search for "Peter Müller" mention seven different people. While it may be clear to a human that the Prime Minister from Saarland, the boxer from Cologne, and the professor of mathematics at the University of Würzburg are not the same person, it is difficult for a computer program to make the same distinction. In fact, a human may have a hard time retrieving all the material relevant to the particular person they are interested in without being swamped by information on namesakes.

Approaches to entity resolution generally rely on the strong contextual hypothesis of Miller and Charles [MC91], who hypothesize that words with similar meanings are often used in similar contexts. This is equally true for proper names, where a particular entity will likely be mentioned in certain contexts. For example, Peter Müller the prime minister may not be mentioned with Würzburg University very often, while Peter Müller the professor will be. Thus, our approach to entity resolution consists in finding classes of similar contexts such that each class represents a distinct entity.

As a first step we identify name phrases, which may refer to named entities like persons, organizations, locations, etc. There are a number of quite mature techniques for recognizing name phrases denoting, for example, persons or locations [SM03]. Customarily they are termed named entity recognizers, although they only spot possible name phrases. A second step is the entity resolution proper, the identification of the identity of each name phrase. More formally, we want to assign a name phrase n and the associated context information c(n), e.g. surrounding words, to one of the candidate entities in a set E_n = {e_1, . . . , e_m}, where each e_i corresponds to a "true" underlying named entity, e.g. a specific person. Each entity e_i in turn is characterized by features d(e_i), e.g. the words in a description of e_i.

In this paper we are mainly interested in assigning name phrases to the corresponding Wikipedia article, which can be considered as an unambiguous reference for a specific entity. Note that we have to find out whether a person is covered in Wikipedia at all. This is a difficult task, as, for example, there are 583 persons with the name Gerhard Schröder listed in the German telephone directory, whereas only 5 are mentioned in Wikipedia. If it is not possible to assign n to one of the known entities in E_n we may formally assign it to e_out.

In the next section we review related work on entity resolution. Then we outline the properties of Wikipedia as a reference for unique named entities and describe the kernel approach used in this paper. Subsequently we present the experimental setup for training and testing, report the results, and summarize the paper.
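To make this formalization concrete, the following minimal sketch (in Python; the names Mention, resolve, score, and the threshold-based fallback are our illustrative assumptions, not part of the system described later) shows the interface it implies: a mention n with context c(n) is assigned to one of the candidate entities, or to e_out when no candidate fits.

```python
# Minimal sketch of the resolution task stated above (illustrative names):
# assign a mention n with context c(n) to a candidate entity e_i, or to
# e_out when no candidate fits. The threshold fallback is an assumption.
from dataclasses import dataclass, field

E_OUT = "e_out"  # stands in for entities not covered by Wikipedia

@dataclass
class Mention:
    name: str  # the name phrase n, e.g. "Gerhard Schröder"
    context: list = field(default_factory=list)  # surrounding words c(n)

def resolve(mention, candidates, score, threshold):
    """candidates: dict mapping entity ids to features d(e_i);
    score(mention, features) compares c(n) with d(e_i)."""
    if not candidates:
        return E_OUT
    best = max(candidates, key=lambda e: score(mention, candidates[e]))
    return best if score(mention, candidates[best]) >= threshold else E_OUT
```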

2 Related Work

Named entity resolution is closely related to the task of word sense disambiguation (WSD), which aims to resolve the ambiguity of common words in a text. Both tasks have in common that the meaning of a mention strongly depends on the context it appears in. Entity resolution, however, usually concerns only phrases of a restricted type (e.g. persons), but there are potentially very many underlying target entities. Nevertheless, many concepts of WSD may be applied to entity resolution.

2.1 Unsupervised Approaches

One of the first works in the field of entity resolution is that of [BB98]. In their approach they create context vectors for each occurrence of a name. Each vector contains exactly the words that occur within a fixed-size window around the ambiguous name n. The similarity among names is measured using the cosine measure. To evaluate their approach they created the "John Smith" corpus, which consists of 197 news articles mentioning 35 different John Smiths.

Agglomerative clustering is used by [GA04] to form groups of context vectors. Among other measures they employed the Kullback-Leibler distance. They conclude that agglomerative clustering works particularly well, and observe that a context window size of 55 words gives optimum performance.

[MY03] enhance the representation of context by biographic facts (date/place of birth, occupation, etc.) extracted from the web using pattern matching. They show that this automatically extracted biographic information improves the clustering results, which are the basis for entity resolution.

[BM05] present two unsupervised approaches to disambiguate persons mentioned in Web pages. One method exploits the fact that Web pages of people related to each other are likely to be interconnected. The other approach simultaneously clusters documents and words, employing the fact that similar documents have similar distributions over words, while similar words are similarly distributed over documents. The derived word similarity as well as the link features are successfully used for named entity resolution.

[BG05] extensively use relational evidence for entity resolution. In the context of citations we may conclude that "R. Srikant" and "Ramakrishnan Srikant" are the same author, since both appear as coauthors of the same third author. They consider the mutual relations between authors, paper titles, paper categories, and conference venues, and argue that jointly resolving the identities of authors, papers, etc. leads to better results than considering each type alone. They construct probabilistic networks capitalizing on two main ideas: first, they use tied parameters for repeating features over pairs of references and their resolution decisions; second, they exploit the overlap between decisions, as different decisions may be dependent. They use similarity measures to resolve entities by clustering, taking into account the relations between objects, and obtain encouraging results.

2.2 Supervised Approaches

A classifier is used by [HGZ+04] to resolve entities in citations. Their naive Bayes approach takes authors as classes, computes the prior probability of each author, and uses coauthors, title words, and publishing details extracted from the citation as features. Their SVM method again considers each author as a class using the same features.

[HAMA06] disambiguate researcher names in citations by exploiting relational information contained in an ontology derived from the DBLP database. Attributes such as affiliations, topics of interest, or collaborators are extracted from the ontology and matched against the text surrounding a name occurrence. The results of the match are then combined in a linear scoring function that ranks all possible senses of that name. This scoring function is trained on a set of disambiguated name queries that are automatically extracted from Wikipedia articles. The method is also able to detect when a name denotes an entity that is not covered in Wikipedia.

[HEG06] combine supervised distance measure learning and unsupervised clustering and apply them to author disambiguation. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors, and faster prediction time than a standard SVM. For evaluation, they manually annotated 3,355 papers with 490 authors and achieved 90.6% pairwise F1.

[BP06] resolve entities using a specific version of the SVM which generates a ranked list of plausible entities. This ranking SVM was introduced by [Joa02]. As features they use all words in a window around the name phrase as well as the Wikipedia categories. For training, Wikipedia articles are used as unambiguous references for specific entities; they are described by the words and categories of these articles. Within Wikipedia most articles are linked to specific spots in other articles, and the text around the name phrase in a link is used as an instance of the occurrence. In this paper we adapt this approach to entity resolution for German name phrases. [WKSH08] extend this method to other languages.

[Cuc07] presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from Wikipedia and Web search results. The system uses coreference analysis to associate different surface forms of a name in a text, e.g. "George W. Bush" and "Bush". In addition to context words, Wikipedia categories are used to describe an entity. Within Wikipedia, both the article about an entity and the contexts of links to that entity are used to characterize it. The links in Wikipedia thus yield multiple contexts for an entity, and coreference resolution yields multiple contexts for a name phrase in a new document. The assignment is done by maximizing the non-normalized scalar products between the contexts of entities and name phrases.

3 Wikipedia as Reference Knowledge Resource

Wikipedia [Wik] is a web-based, free-content encyclopedia project, written collaboratively by volunteers using wiki software that allows almost anyone to add and change articles. Since its creation in 2001 it has become the largest organized knowledge repository on the Web. In this paper we use the German version. In Nov. 2008 it had 845k articles with an average length of 3500 bytes and 20.8 million internal links [Wik09].

Articles hold information focused on one specific entity or concept. An article is uniquely identified by the common name for the subject or person described in the article. Ambiguous names are further qualified with additional information placed in parentheses. For instance, the six entities sharing the name Michael Müller are distinguished by their affiliations, occupations, or locations: Michael Müller (Berlin), Michael Müller (Comedian), Michael Müller (FDP), Michael Müller (Handballspieler), Michael Müller (Liedermacher), and Michael Müller (SPD). Every article in Wikipedia has one or more categories, representing the topics it belongs to. Categories can be very broad but also very specific, e.g. applying only to two persons, such as the category "Träger des Bundesverdienstkreuzes (Großkreuz in besonderer Ausführung)". Relations among entities and concepts are expressed by links. When mentioning an entity or concept with an existing article page, contributing authors are required to link at least the first mention to the corresponding article. This link structure may be used to estimate semantic relatedness [Mil07].
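The parenthetical qualifier convention makes it easy to recover the base name shared by a set of candidate articles. The following is a minimal sketch of such a title split (our illustrative code, not part of Wikipedia's tooling):

```python
# Sketch of splitting a qualified Wikipedia title into base name and
# parenthetical qualifier, following the convention described above.
import re

TITLE_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<qualifier>[^)]+)\)$")

def split_title(title):
    """'Michael Müller (SPD)' -> ('Michael Müller', 'SPD');
    unqualified titles map to (title, None)."""
    m = TITLE_RE.match(title)
    return (m.group("name"), m.group("qualifier")) if m else (title, None)
```

Grouping article titles by the returned base name then yields the candidate set for an ambiguous name.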

4 Learning Ranking Functions

As in [Joa02] we start with a collection $D = \{d_1, \ldots, d_m\}$ of articles specifying unique entities. For a context $c = c(n)$ containing features describing a name phrase $n$, we want to determine a list of relevant articles in $D$, where the most relevant articles appear first. This corresponds to a ranking relation $r^*(c) \subseteq D \times D$ that fulfills the properties of a weak ordering, i.e. it is asymmetric and transitive. If a document $d_i$ is ranked higher than $d_j$ for an ordering $r$, i.e. $(d_i, d_j) \in r$, the pair is concordant with $r^*(c)$ whenever it is also contained in $r^*(c)$; the number of concordant pairs is the basis of Kendall's $\tau$ [Ken55] as a measure of ranking quality. We consider linear ranking functions $r_w(c)$ defined by
$$(d_i, d_j) \in r_w(c) \iff w\,\Phi(c(n), d_i) > w\,\Phi(c(n), d_j),$$
where $\Phi(c(n), d_i)$ is a given mapping of the context features $c(n)$ and the features of document $d_i$ into a high-dimensional feature space and $w$ is a weight vector of matching dimension. For the linear ranking functions defined above, maximizing the number of concordant pairs is equivalent to finding the weight

vector $w$ so that the maximum number of the following inequalities holds:

$$\forall (d_i, d_j) \in r_1^*: \quad w\,\Phi(c(n_1), d_i) > w\,\Phi(c(n_1), d_j)$$
$$\cdots$$
$$\forall (d_i, d_j) \in r_n^*: \quad w\,\Phi(c(n_n), d_i) > w\,\Phi(c(n_n), d_j) \qquad (1)$$

As the exact solution of this problem is NP-hard, an approximate solution is proposed by [Joa02]: introduce non-negative slack variables $\xi_{i,j,k}$ and minimize their sum. Regularizing the length of $w$ to maximize margins leads to the following optimization problem:

$$\text{minimize} \quad V(w, \xi) = \frac{1}{2}\, w \cdot w + C \sum_{k=1}^{n} \sum_{i=1}^{m} \sum_{j=1}^{m} \xi_{i,j,k} \qquad (2)$$

$$\text{subject to} \quad \forall k = 1, \ldots, n \;\; \forall (d_i, d_j) \in r_k^*: \quad w\,\Phi(c(n_k), d_i) \ge w\,\Phi(c(n_k), d_j) + 1 - \xi_{i,j,k} \qquad (3)$$
$$\forall k = 1, \ldots, n \;\; \forall i, j = 1, \ldots, m: \quad \xi_{i,j,k} \ge 0$$

$C$ is a parameter trading off the training error, i.e. the number of violated ranking constraints, against the margin size. The optimization problem is convex and has no local optima. As the inequalities are equivalent to $w\,(\Phi(c(n_k), d_i) - \Phi(c(n_k), d_j)) \ge 1 - \xi_{i,j,k}$, we get the same optimization problem as for the usual classification SVM applied to the difference vectors $\Phi(c(n_k), d_i) - \Phi(c(n_k), d_j)$. The algorithm is implemented in the SVMlight package of Thorsten Joachims. As usual, non-linear feature mappings for arbitrary kernels may be used. Note that according to the properties of the SVM the algorithm should be able to generalize: if the set of training contexts $c_1, \ldots, c_n$ is i.i.d. and representative, then the optimal parameter vector should also give near-optimal rankings for new contexts drawn from the underlying context distribution.
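The reduction to difference vectors means the ranker can be trained with any standard linear SVM. The following sketch illustrates this with scikit-learn's LinearSVC as a stand-in for SVMlight; the function names and the data layout of `queries` are illustrative assumptions.

```python
# Sketch of the pairwise-difference reformulation: each training query
# contributes vectors Phi(c,d_i) - Phi(c,d_j) for pairs where d_i must
# outrank d_j, and a standard linear SVM is fit on them. LinearSVC is
# a stand-in for SVMlight, which the paper actually uses.
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_differences(queries):
    """queries: iterable of (features, gold) where features is an (m, k)
    array of Phi(c(n), d_i) rows and gold indexes the correct entity."""
    X, y = [], []
    for features, gold in queries:
        for j in range(len(features)):
            if j == gold:
                continue
            diff = features[gold] - features[j]
            X.append(diff)   # correct entity ranked higher: label +1
            y.append(+1)
            X.append(-diff)  # mirrored pair for class balance: label -1
            y.append(-1)
    return np.array(X), np.array(y)

def train_ranker(queries, C=1.0):
    X, y = pairwise_differences(queries)
    svm = LinearSVC(C=C, fit_intercept=False)  # ranking needs no bias term
    svm.fit(X, y)
    return svm.coef_.ravel()  # the weight vector w

def rank(w, features):
    """Order candidate entities by the score w . Phi(c(n), d_i)."""
    return np.argsort(features @ w)[::-1]  # highest score first
```

With a non-linear kernel, the same construction carries over by training a kernel SVM on the difference vectors, at a higher computational cost.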

5 Entity Resolution Approach

In this paper we adapt the entity resolution approach of [BP06] and enhance it with additional features. To our knowledge, we are the first to apply it to the resolution of German name phrases using German Wikipedia entries. We represent a name phrase n by its context c(n) and want to assign it to one of a set of Wikipedia articles D = {d_1, . . . , d_m}. We represent an article d_i of Wikipedia describing an entity, e.g. a person, by the stemmed words of the article weighted with their tfidf score with respect to Wikipedia. The context c(n) contains the tfidf-weighted stemmed words in a window of size 100 around n. For a name phrase n we determine the set of candidate articles by using the disambiguation pages of Wikipedia. If only the surname is available, this set of candidates is large, while the set is much smaller if n contains both a first name and a surname. Currently we ignore errors like misspellings; in principle they may also be taken into account, e.g. by a specific Levenshtein distance measure.
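A minimal sketch of this representation and of scoring candidates with a linear kernel follows; scikit-learn's TfidfVectorizer stands in for our stemming and weighting pipeline, and fitting the weights only on the candidate articles (rather than on all of Wikipedia, as in the text) is a simplification made here.

```python
# Sketch of the feature representation in this section: tfidf-weighted
# (stemmed) words for candidate articles and for a 100-token window
# around the mention; the linear kernel score is then a dot product.
from sklearn.feature_extraction.text import TfidfVectorizer

WINDOW = 100  # tokens around the name phrase, as in the text

def mention_context(tokens, position, window=WINDOW):
    """Join the window of tokens around the mention at `position`."""
    lo = max(0, position - window // 2)
    return " ".join(tokens[lo:position + window // 2])

def score_candidates(articles, context):
    """articles: dict of candidate title -> (stemmed) article text;
    context: the window text around the mention. The paper weights
    terms with respect to all of Wikipedia; fitting on the candidates
    alone is a simplification made for this sketch."""
    titles = list(articles)
    vectorizer = TfidfVectorizer()
    doc_vecs = vectorizer.fit_transform(articles[t] for t in titles)
    ctx_vec = vectorizer.transform([context])
    scores = (doc_vecs @ ctx_vec.T).toarray().ravel()  # linear kernel
    return sorted(zip(titles, scores), key=lambda p: p[1], reverse=True)
```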

Table 1. Results of Entity Resolution Experiments

Data                       Features                   Micro F1 %
≥ 10 entities/name         tfidf of stemmed words     74.8
≥ 2 entities/name          tfidf of stemmed words     83.2
names not in Wikipedia     tfidf of stemmed words     84.6

6 Training and Testing

For our ongoing work we restricted the experiments to persons and considered the 3207 name phrases (first and last names) which correspond to at least 2 entities in German Wikipedia. In this set there are on average 2.5 entities per name phrase. In further experiments we will include all persons as well as other named entities like locations and organizations. For training we use links in Wikipedia as name phrases n and extract the corresponding context c(n) from the neighborhood of n. On average each person article has 12.1 links to other articles in Wikipedia. We used the ranking SVM of (2) with the associated constraints (3). On the feature mapping Φ(c_k, d_i) we used a linear kernel.
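The construction of training queries from Wikipedia links can be sketched as follows; `iter_links`, `disambiguation_candidates`, and `phi` are hypothetical helpers standing in for the Wikipedia preprocessing, and the resulting queries feed directly into a pairwise ranker like the one sketched in Section 4.

```python
# Sketch of assembling ranking training queries from Wikipedia links:
# a link provides the mention context c(n) and the gold article, while
# the disambiguation page for the name supplies the competing candidates.
# The three helper functions are assumptions, not actual APIs.
import numpy as np

def build_training_queries(iter_links, disambiguation_candidates, phi):
    """Yield (features, gold_index) pairs for the ranking SVM.

    iter_links() yields (context_text, gold_title, surface_name);
    disambiguation_candidates(name) returns a list of candidate titles;
    phi(context, title) returns the feature vector Phi(c(n), d_i).
    """
    for context, gold, name in iter_links():
        candidates = disambiguation_candidates(name)
        if gold not in candidates or len(candidates) < 2:
            continue  # need at least one wrong candidate to form a pair
        features = np.stack([phi(context, t) for t in candidates])
        yield features, candidates.index(gold)
```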

7 Results

The results of the experiments are shown in Table 1. In a first experiment we used highly ambiguous name phrases with at least ten possible corresponding entities. Here we achieved an F1-value of 74.8%. A second experiment targets all name phrases with at least 2 possible entities in Wikipedia. This yields a markedly increased F1-value of 83.2%. This value is very similar to the accuracy value of 84.8% achieved by [BP06], who in addition used Wikipedia categories as semantic features. Note that the accuracy will get much better if the large number of "unique" name phrases is included, for which only one article exists in Wikipedia. On a subset of entities we tested the ability of the approach to identify entities not covered in Wikipedia. For this randomly selected subset we reached an F1-value of 84.6%. This means that including entities not covered in Wikipedia does not hamper the performance of the approach.

8 Summary and Conclusion

The paper describes work in progress on the disambiguation of named entities. We applied a kernel-based classification approach to named entity resolution that exploits the potential of an online encyclopedia to provide unambiguous references for names as well as to supply context information. The results show that the accuracy achieved for German named entities is similar to the performance figures given in the literature for other languages. The application of the new named entity disambiguation method holds the promise of better results for search engine queries.

There are a number of extensions which we will explore in the near future. Most importantly, we will investigate the effectiveness of Wikipedia categories. This should enhance the comparison of semantic features and may lead to improved accuracy. In the same way we will exploit topic modelling techniques, which allow an annotation of semantics in an unsupervised way.

9 Acknowledgement

The work presented here was funded by the German Federal Ministry of Economy and Technology (BMWi) under the THESEUS project.

References

[BB98] Amit Bagga and Breck Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proc. of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 79–85, San Francisco, California, 1998.

[BG05] Indrajit Bhattacharya and Lise Getoor. Relational clustering for multi-type entity resolution. In Proc. Fourth International Workshop on Multi-Relational Data Mining (MRDM2005), 2005.

[BM05] Ron Bekkerman and Andrew McCallum. Disambiguating web appearances of people in a social network. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 463–470, New York, NY, USA, 2005. ACM.

[BP06] Razvan C. Bunescu and Marius Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proc. of EACL, pages 9–16, 2006.

[Cuc07] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proc. 2007 Joint Conference on EMNLP and CoNLL, pages 708–716, 2007.

[GA04] Chung H. Gooi and James Allan. Cross-document coreference on a large scale corpus. In HLT-NAACL, pages 9–16, 2004.

[HAMA06] J. Hassel, B. Aleman-Meza, and I. B. Arpinar. Ontology-driven automatic entity disambiguation in unstructured text. pages 44–57, 2006.

[HEG06] Jian Huang, Seyda Ertekin, and C. Lee Giles. Efficient name disambiguation for large-scale databases. In Proceedings of PKDD, pages 536–544, 2006.

[HGZ+04] Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 296–305, New York, NY, USA, 2004. ACM.

[Joa02] Thorsten Joachims. Optimizing search engines using clickthrough data. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), 2002.

[Ken55] Maurice Kendall. Rank Correlation Methods. Hafner, 1955.

[MC91] G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.

[Mil07] David Milne. Computing semantic relatedness using Wikipedia link structure. In Proc. of the New Zealand Computer Science Research Student Conference (NZCSRSC'2007), 2007.

[MY03] Gideon S. Mann and David Yarowsky. Unsupervised personal name disambiguation. In Proc. of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, volume 4, pages 33–40, 2003.

[SM03] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[Wik] Wikipedia. http://www.wikipedia.org.

[Wik09] Wikimedia. http://stats.wikimedia.org/de/tablesrecenttrends.htm. Retrieved on Feb. 22, 2009.

[WKSH08] Wolodja Wentland, Johannes Knopp, Carina Silberer, and Matthias Hartung. Building a multilingual lexical resource for named entity disambiguation, translation and transliteration. In Proc. of the Sixth International Language Resources and Evaluation (LREC'08), 2008.