Named Entity Resolution Using Automatically Extracted Semantic Information

Anja Pilz and Gerhard Paaß
Fraunhofer Institute Intelligent Analysis and Information Systems
St. Augustin, Germany
{anja.pilz, gerhard.paass}@iais.fraunhofer.de

Abstract

One major problem in text mining and semantic retrieval is that detected entity mentions have to be assigned to the true underlying entity. The ambiguity of a name results from both the polysemy and the synonymy problem, as the name of a unique entity may be written in variant ways and different unique entities may have the same name. The term "bush", for instance, may refer to a woody plant, a mechanical fixing, a nocturnal primate, 52 persons and 8 places covered in Wikipedia, and thousands of other persons. For the first time, to our knowledge, we apply a kernel entity resolution approach to the German Wikipedia as a reference for named entities. We describe the context of named entities in Wikipedia and the context of a detected name phrase in a new document by a context vector of relevant features. These are designed from automatically extracted topic indicators generated by an LDA topic model. We use kernel classifiers, e.g. ranking classifiers, to determine the right matching entity but also to detect uncovered entities. In comparison to a baseline approach using only text similarity, the addition of topic features gives a much higher F-value, which is comparable to the results published for English. It turns out that the procedure is also able to detect with high reliability whether a person is not covered by Wikipedia.

1 Introduction

The problem of name ambiguity exists in many forms. It is common for different people to share the same name. For example, there is a Gerhard Schröder who was chancellor of the Federal Republic of Germany, another who was Federal Minister in Germany, and several more who are broadcaster officials or journalists. Locations may have the same name. For example, there are 27 municipalities in Germany called Neustadt. The acronyms associated with organizations may also be ambiguous. UMD may stand for the University of Michigan – Dearborn, the University of Minnesota, Duluth or the University of Maryland. On the other hand there may be different names for the same entity. For many organizations there exist a number of different acronyms and designations. For the political party Sozialdemokratische Partei Deutschlands we have the synonyms Sozialdemokraten, SPD, Sozis, etc. The effects of name ambiguity can be seen when carrying out web searches or retrieving articles from an archive

of newspaper text. For example, the top 10 hits of a Google search for "Peter Müller" mention seven different people. While it may be clear to a human that the Prime Minister from Saarland, the boxer from Cologne, and the professor of mathematics at the University of Würzburg are not the same person, it is difficult for a computer program to make the same distinction. In fact, a human may have a hard time retrieving all the material relevant to the particular person they are interested in without being swamped by information on namesakes. Approaches to entity resolution generally rely on the strong contextual hypothesis of Miller and Charles [Miller and Charles, 1991], who hypothesize that words with similar meanings are often used in similar contexts. This is equally true for proper names, where a particular entity will likely be mentioned in certain contexts. For example, Peter Müller the prime minister may not be mentioned with Würzburg University very often, while Peter Müller the professor will be. Thus, our approach to entity resolution consists in finding classes of similar contexts such that each class represents a distinct entity.

As a first step we identify name phrases, which may refer to named entities like persons, organizations, locations etc. There are a number of quite mature techniques for recognizing name phrases such as persons or locations [Sang and Meulder, 2003]. Customarily these are termed named entity recognizers, although they only spot possible name phrases. A second step is entity resolution, the identification of the identity of each name phrase. More formally, we want to assign a name phrase n and the associated context information c(n), e.g. the surrounding words, to one of the candidate entities in a set En = {e1, ..., em}, where each ei corresponds to a "true" underlying named entity, e.g. a specific person. Each entity ei in turn is characterized by features d(ei), e.g. the words in a description of ei. In this paper we are mainly interested in assigning name phrases to the corresponding Wikipedia article, which can be considered as an unambiguous reference for a specific entity. Note that we have to find out whether a person is covered in Wikipedia at all. This is a difficult task, as, for example, there are 583 persons with the name Gerhard Schröder listed in the German telephone directory, whereas only 5 are mentioned in Wikipedia. If it is not possible to assign n to one of the known entities in En we may formally assign it to eout.

In the next section we describe related work on entity resolution. Then we outline the properties of Wikipedia as a reference for unique named entities. Subsequently we describe the kernel approach used in this paper, the experimental setup for training and testing, and finally the results, before summarizing the paper.
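To make this formalization concrete before turning to related work, the following minimal Python sketch mirrors the setup; the names (Mention, Candidate, resolve) and the threshold-based fallback to eout are purely illustrative and not the actual interface of the system described in this paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Sentinel standing for e_out: "not covered by the reference resource".
E_OUT = None

@dataclass
class Mention:
    name: str               # the detected name phrase n, e.g. "Peter Müller"
    context: List[str]      # the associated context information c(n)

@dataclass
class Candidate:
    article: str            # unique Wikipedia article title identifying entity e_i
    description: List[str]  # descriptive features d(e_i), e.g. words of the article

def resolve(mention: Mention,
            candidates: List[Candidate],
            score: Callable[[Mention, Candidate], float],
            threshold: float = 0.0) -> Optional[Candidate]:
    """Pick the best-scoring candidate for n, or E_OUT if none scores above threshold."""
    if not candidates:
        return E_OUT
    best = max(candidates, key=lambda cand: score(mention, cand))
    return best if score(mention, best) > threshold else E_OUT
```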

2 Related Work

Named entity resolution is closely related to the task of word sense disambiguation (WSD), which aims to resolve the ambiguity of common words in a text. Both tasks have in common that the meaning of a mention strongly depends on the context it appears in. Entity resolution, however, usually concerns only phrases of a restricted type (e.g. persons), but there are potentially very many underlying target entities. Nevertheless, many concepts of WSD may be applied to entity resolution.

2.1 Unsupervised Approaches

One of the first works in the field of entity resolution is that of [Bagga and Baldwin, 1998]. In their approach they create context vectors for each occurrence of a name. Each vector contains exactly the words that occur within a fixed-sized window around the ambiguous name n. The similarity among names is measured using the cosine measure. To evaluate their approach they created the "John Smith" corpus, which consists of 197 news articles mentioning 35 different John Smiths. Agglomerative clustering is used by [Gooi and Allan, 2004] to form groups of context vectors. Among others they employed the Kullback-Leibler distance measure. They conclude that agglomerative clustering works particularly well, and observe that a context window size of 55 words gives optimum performance. [Mann and Yarowsky, 2003] enhance the representation of context by biographic facts (date/place of birth, occupation, etc.) extracted from the web using pattern matching. They show that this automatically extracted biographic information improves clustering results, which are the basis for entity resolution. [Bekkerman and McCallum, 2005] present two unsupervised approaches to disambiguate persons mentioned in Web pages. One method exploits the fact that Web pages of people related to each other are likely to be interconnected. The other approach simultaneously clusters documents and words, employing the fact that similar documents have similar distributions over words, while similar words are similarly distributed over documents. The derived word similarity as well as the link features are successfully used for named entity resolution. [Bhattacharya and Getoor, 2005] extensively use relational evidence for entity resolution. In the context of citations we may conclude that "R. Srikant" and "Ramakrishnan Srikant" are the same author, since both are coauthors of the same third author. They consider the mutual relations between authors, paper titles, paper categories, and conference venues. They argue that jointly resolving the identity of authors, papers, etc. leads to a better result than considering each type alone. They construct probabilistic networks capitalizing on two main ideas: first, they use tied parameters for repeating features over pairs of references and their resolution decisions; second, they exploit the overlap between decisions, as two different decisions are dependent. They use similarity measures to resolve entities by clustering them, taking into account the relations between objects, and obtain encouraging results.

2.2 Supervised Approaches

A classifier is used by [Han et al., 2004] to resolve entities in citations. Their naive Bayes approach takes authors

as classes, computes the prior probability of each author, and uses coauthors, title words, and publishing details extracted from the citation as features. Their SVM method again considers each author as a class using the same features. [Hassel et al., 2006] disambiguate researcher names in citations by exploiting relational information contained in an ontology derived from the DBLP database. Attributes such as affiliations, topics of interest, or collaborators are extracted from the ontology and matched against the text surrounding a name occurrence. The results of the match are then combined in a linear scoring function that ranks all possible senses of that name. This scoring function is trained on a set of disambiguated name queries that are automatically extracted from Wikipedia articles. The method is also able to detect when a name denotes an entity that is not covered in Wikipedia. [Huang et al., 2006] combine supervised distance measure learning and unsupervised clustering and apply it to author disambiguation. The distance metric between papers used in DBSCAN is calculated by an online active selection support vector machine algorithm (LASVM), yielding a simpler model, lower test errors and faster prediction times than a standard SVM. [Bunescu and Pasca, 2006] resolve entities using a specific version of the SVM which generates a ranked list of plausible entities. This ranking SVM was introduced by [Joachims, 2002]. As features they use all words in a window around the name phrase as well as the Wikipedia categories. For training, Wikipedia articles are used as unambiguous references for specific entities. They are described by the words and categories of these articles. Within Wikipedia most articles are linked to specific spots in other articles; the text around the name phrase in a link is used as an instance of the occurrence. In this paper we adapt this approach to entity resolution for German name phrases. [Wentland et al., 2008] expand this method to other languages. [Cucerzan, 2007] presents a large-scale system for the recognition and semantic disambiguation of named entities based on information extracted from Wikipedia and Web search results. The system uses coreference analysis to associate different surface forms of a name in a text, e.g. "George W. Bush" and "Bush". In addition to context words they use Wikipedia categories to describe an entity. Within Wikipedia they use the article about an entity as well as the context of links to an entity to characterize the entity. By the links in Wikipedia they get multiple contexts of an entity in Wikipedia, and by coreference resolution they get multiple contexts for a name phrase in a new document. The assignment is done by maximizing the non-normalized scalar products of the contexts of entities and name phrases. [Waltinger and Mehler, 2008] exploit the context surrounding a token by analyzing the sentences bordering an entity. The information of the incorporated context is used for building an expanded context graph around the unknown entity. This is done by querying a co-occurrence network, keeping the most significant edges to the context instance. The approach shows very promising results.

3 Wikipedia as Reference Knowledge Resource

Wikipedia is a web-based, free content encyclopedia project, written collaboratively by volunteers using a wiki

software that allows almost anyone to add and change articles. Since its creation in 2001 it has become the largest organized knowledge repository on the Web. In this paper we use the German version. In Nov. 2008 it had 845k articles with an average length of 3500 bytes and 20.8 million internal links [Wikimedia, 2009]. Articles hold information focused on one specific entity or concept. An article is uniquely identified by the common name for the subject or person described in the article. Ambiguous names are further qualified with additional information placed in parentheses. For instance, the six entities sharing the name Michael Müller are distinguished by their affiliations, occupations, or locations: Michael Müller (Berlin), Michael Müller (Comedian), Michael Müller (FDP), Michael Müller (Handballspieler), Michael Müller (Liedermacher), and Michael Müller (SPD). Every article in Wikipedia has one or more categories, representing the topics it belongs to. Categories can be very broad but also very specific, i.e. applying only to two persons, such as the category "Träger des Bundesverdienstkreuzes (Großkreuz in besonderer Ausführung)". Relations among entities and concepts are expressed by links. When mentioning an entity or concept with an existing article page, contributing authors are required to link at least the first mention to the corresponding article. This link structure may be used to estimate semantic relatedness [Milne, 2007].

4 Semantic Information from Topic Models

Wikipedia contains several million different words, and usual similarity metrics such as the cosine similarity are not capable of grasping the synonymy of different terms. The same content describing entities may be expressed by completely different words, e.g. "Bundeskanzler" and "Regierungschef", which shows that the direct comparison of words, even in a stemmed form, may be misleading.

4.1 Topic Modeling by Latent Dirichlet Allocation

Topic modeling by Latent Dirichlet Allocation [Blei et al., 2003] aims to represent the meaning of sentences and documents by a low-dimensional vector of "topics". It assumes the following simplified probabilistic model for the generation of a document di:

• Randomly generate the number of words in the document, N ∼ Poisson(ξ), where ξ is a fixed prior parameter.
• Randomly choose an h-dimensional probability vector describing the distribution of topics in the document, θ ∼ Dirichlet(α), where α is an h-dimensional parameter vector describing the prior distribution of probability values.
• For each of the N words w in the document:
  – Randomly choose a topic zk ∼ Multinomial(θ).
  – Randomly select a word wn ∼ p(wn | zk, β), where the multinomial distribution p(wn | zk, β) describes the probability of words for the topic zk.

Given a training set of unlabeled documents, all free parameters of the model, i.e. the conditional distribution p(wn | zk, β) of words given topics as well as the hyperparameters ξ, α and

β may be estimated. Using a Bayesian approach to regularize the parameters, [Blei et al., 2003] propose an efficient approximate inference technique based on variational methods. The resulting word distributions p(wn | zk, β) for each topic have high probabilities for words that often co-occur in documents. Topics alleviate two main problems arising in natural languages: synonymy and polysemy. Synonymy refers to a case where two different words (say car and automobile) have the same meaning. These synonyms usually will occur in the same topics. Polysemy on the other hand refers to the case where a term such as plant has multiple meanings (industrial plant, biological plant). Depending on the context (industry or biology), different topics will be assigned to the word plant. A document is generated by picking a distribution over topics (i.e. mostly about DOG, mostly about CAT, or a bit of both) and, given this distribution, picking the topic of each specific word. Then words are generated given their topics. (Notice that words are considered to be independent given the topics. This is a standard bag-of-words model assumption and makes the individual words exchangeable.) The application of a topic model to a Wikipedia article has a similar effect as the manual assignment of categories by Wikipedia users: the article's content is given labels indicating which topics or categories are the most probable. Generally, one cannot assume that each category assignment is relevant, since many categories exist that apply to only two persons and are hence too specific. The assumption that category assignments are appropriate (correct) need not hold, and in the end this is a problem from which topic models suffer as well. Still, topic models can help to alleviate it: they rely on the article text and not on the users' intuition and are therefore more conservative in the assignment of meanings. The usage of topic models, e.g. the assignment of the highest-probability topics as features to each query, has two motivations: alternative meanings of words are grasped by the model, so the similarity measure is less restrictive, and additionally we gain a summary of the context's subject.
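For illustration, the following toy sketch samples a document from the generative process described above; the vocabulary size, topic-word probabilities and prior values are made-up toy inputs, not parameters of the model used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, xi=50):
    """Sample one toy document: alpha is the Dirichlet prior over h topics,
    beta an (h x V) matrix of topic-word probabilities, xi the Poisson prior."""
    n_words = rng.poisson(xi)                     # N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)                  # theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # topic z_k ~ Multinomial(theta)
        w = rng.choice(beta.shape[1], p=beta[z])  # word w_n ~ p(w | z_k, beta)
        words.append(w)
    return words

# Toy example: 2 topics over a 4-word vocabulary.
beta = np.array([[0.70, 0.20, 0.05, 0.05],
                 [0.05, 0.05, 0.20, 0.70]])
print(generate_document(alpha=[0.5, 0.5], beta=beta, xi=10))
```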

4.2 Training

The topic model is trained on 100,000 Wikipedia articles that have persons as their subject. The number of topics is set to 200, which was considered appropriate given the number of training articles. A manual analysis of models with a higher or lower granularity of topics revealed more volatile or less expressive topic clusters.
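A rough sketch of how such a model could be trained and queried with the gensim library is given below; the corpus variable person_articles (a list of token lists of Snowball-stemmed article words) and the helper names are hypothetical, since the authors' actual implementation is not specified in the paper.

```python
from gensim import corpora, models

def train_topic_model(person_articles, num_topics=200):
    """Fit an LDA model on tokenized Wikipedia person articles."""
    dictionary = corpora.Dictionary(person_articles)
    bow = [dictionary.doc2bow(tokens) for tokens in person_articles]
    lda = models.LdaModel(bow, id2word=dictionary, num_topics=num_topics)
    return dictionary, lda

def top_topics(tokens, dictionary, lda, k=3):
    """Return the k most probable (topic id, probability) pairs for a token list,
    as used for the features in the next subsection."""
    dist = lda.get_document_topics(dictionary.doc2bow(tokens), minimum_probability=0.0)
    return sorted(dist, key=lambda t: t[1], reverse=True)[:k]
```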

4.3 Inference

For a trained topic model, one can compute the probability of each of the 200 topics to be present in a new document (see for example [Blei et al., 2003]). We can thus define a new feature for both the query context and the article text. Let

• e.T be the probability distribution of topics in the article text, and
• c.T be the probability distribution of topics in the query text,

both assigned by the pre-trained model. To demonstrate the distinctive character of these attributes, we give the example of a context mentioning an

entity called Willi Weyer, with two candidates in Wikipedia, i.e. the politician Willi Weyer and the soccer player Willi Weyer (Fußballspieler). The context is extracted from an article on delegates in a German federal state and consists of the following words (stopwords removed): {Weyer, Willi, Landeslist, SPD, CDU, Wahlkreis, Heinrich, FDP, Wenk, Detmold, Wendt, Hermann, Geld, Wehr, Wilhelm, Minden-Nord, Wehking, Juni, Landeslist, Wiesmann, Recklinghausen-Land-Sudw, Wint, Friedrich, Lemgo-W, Witthaus, Bernhard, Mulheim-Ruhr-Sud, Wolf}. All terms are stemmed using the Snowball algorithm for German [Porter, 2001]. The application of the topic model to this text yields a probability distribution giving, for each topic known to the model, the probability of being present in the context, and hence new semantic features for the context description, i.e. c.T = {t67, t106, t9}. Table 1 shows the most important words of these topic clusters together with each topic's probability given the context at hand. In this case, the three most probable topics do not fit the query equally well, as can already be seen from the probability values: whereas the topic with the highest probability indicates a political context, i.e. the entity's occupation, the less probable topics are due to the relatively high frequency of names in the context, which are associated with different name clusters (i.e. names in historical contexts such as royals).

Table 1: Most probable topics for the query of Willi Weyer, with the 15 most important words of each topic and its probability P(ti | c.T).
  t67  (0.255):  Vorsitz Abgeordnet stellvertret SPD Landtag CDU Bundestag Wahlkreis Ausschuss Oberburgermeist FDP Politikerin Stadtrat Fraktion eingezog
  t106 (0.0611): Karl Heinrich preussisch Ferdinand Wurzburg Landwirtschaft Freiherr Gut Geheim Dom Greifswald landwirtschaft Kuhn Pomm konig
  t9   (0.0416): August Friedrich Wilhelm Gross Christian Philipp Elisabeth Adolf geb Katharina Luis Moritz Sophi Rhein Conrad

Candidate entity Willi Weyer: We now compare the entity mentioned in the query to the possible candidates extracted from Wikipedia. The application of the topic model to the entity's article text again yields a probability distribution denoting the probability of each topic to be present in the article, which is here summarized into e1.T = {t67, t186, t105}, as shown in Table 2. Obviously, the first two topics represent the entity's occupation very well, and the third holds a relation to the fact that Weyer established traffic reports and highway police in his federal state. Comparing this information to the Wikipedia categories assigned to the entity (see Table 3), it becomes obvious that they relate very well to each other and the automatically extracted feature holds an equal amount of information.

Table 2: Most probable topics for the article of the politician Willi Weyer, with the 15 most important words of each topic and its probability P(ti | e.T).
  t67  (0.124):  Vorsitz Abgeordnet stellvertret SPD Landtag CDU Bundestag Wahlkreis Ausschuss Oberburgermeist FDP Politikerin Stadtrat Fraktion eingezog
  t186 (0.0956): Prasident Regier Bitt Amtszeit Minist Ministerprasident loesch Erklaerung Kabinett Rucktritt Premierminist Reform Aussenminist Liberal Finanzminist
  t105 (0.0713): fuhr Renn gefahr Fahr todlich unfall Motor Wag Auto Racing Kreuzzug byzantin Roberto fahr ergriff

Table 3: Wikipedia categories for the politician Willi Weyer: Finanzminister (Nordrhein-Westfalen), Landtagsabgeordneter (Nordrhein-Westfalen), Bundestagsabgeordneter, FDP-Mitglied, Sportfunktionär.

Candidate entity Willi Weyer (Fußballspieler): The other candidate entity is a former German soccer player with the qualified Wikipedia article name Willi Weyer (Fußballspieler). For the associated article text, the three most probable topics yield e2.T = {t168, t122, t148}, as shown in Table 4. All three of the assigned topics relate to the entity's occupation and specify it further by incorporating the teams the entity was engaged with. Considering that this candidate is assigned only one category, namely Fußballspieler (Deutschland), we can deduce much more information from the assigned topics. Note that the relatively high probabilities indicate a very distinctive association with topics and hence differing contexts, which allow us to distinguish among entities merely by the context they appear in.

Table 4: Most probable topics for the article of the soccer player Willi Weyer (Fußballspieler), with the 15 most important words of each topic and its probability P(ti | e.T).
  t168 (0.1705): Saison Tor Fussballspiel Einsatz Mannschaft Bundesliga Sturm schoss Mittelfeldspiel Borussia Aufstieg nationalmannschaft Regionalliga nationaljahr Eintracht
  t122 (0.0932): Koln Dusseldorf Kurt Rot Aach Nationalsozialist Bernd Willi freien KPD Wuppertal Hubert Rheinland Machtergreif Nordrhein-Westfal
  t148 (0.0685): Spiel Train erzielt Nationalmannschaft bestritt Landerspiel Fussball Fussballspiel Fussballnationalmannschaft Klub Pokalsieg Treff Fussball-Weltmeisterschaft Nationalspiel Europapokal

5 Learning Ranking Functions

As in [Joachims, 2002] we start with a collection D = {d1, ..., dm} of articles specifying unique entities. For a context c = c(n) containing features describing a name phrase n, we want to determine a list of relevant articles in D, where the most relevant articles appear first. This corresponds to a ranking relation r*(c) ⊆ D × D that fulfills the properties of a weak ordering, i.e. it is asymmetric and transitive. A document di is ranked higher than dj for an ordering r if (di, dj) ∈ r. We consider linear ranking functions, for which di is ranked higher than dj whenever

    w Φ(c(n), di) > w Φ(c(n), dj)    (1)

where Φ(c(n), di) is a given mapping of the context features c(n) and the features of document di into a high-dimensional feature space and w is a weight vector of matching dimension. For the linear ranking functions defined above, maximizing the number of concordant pairs is equivalent to finding the weight vector w such that the maximum number of the following inequalities holds:

    ∀(di, dj) ∈ r1*:  w Φ(c(n1), di) > w Φ(c(n1), dj)
    ...
    ∀(di, dj) ∈ rn*:  w Φ(c(nn), di) > w Φ(c(nn), dj)    (2)

As the exact solution of this problem is NP-hard, an approximate solution is proposed by [Joachims, 2002] by introducing non-negative slack variables ξi,j,k and minimizing the sum of slack variables. Regularizing the length of w to maximize margins leads to the following optimization problem:

    minimize  V(w, ξ) = 1/2 w·w + C Σ_{i=1..m} Σ_{j=1..m} Σ_{k=1..n} ξi,j,k    (3)

    subject to  ∀k ∀(di, dj) ∈ rk*:  w Φ(c(nk), di) ≥ w Φ(c(nk), dj) + 1 − ξi,j,k    (4)
                ∀k ∀i ∀j:  ξi,j,k ≥ 0

C is a parameter trading off the training error against the margin size. The optimization problem is convex and has no local optima. As the inequalities are equivalent to w (Φ(c(nk), di) − Φ(c(nk), dj)) ≥ 1 − ξi,j,k, we get the same optimization problem as the usual SVM for the difference vectors Φ(c(nk), di) − Φ(c(nk), dj). The algorithm is implemented in the SVMlight package of Thorsten Joachims and is similar to that of structured SVMs. As usual, non-linear feature mappings for arbitrary kernels may be used. Note that according to the properties of the SVM the algorithm should be able to generalize: if the set of training contexts c1, ..., cn is i.i.d. and representative, then the optimal parameter should also give near-optimal rankings for new contexts that follow the underlying context distribution.
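The pairwise reduction described above can be sketched as follows; this uses scikit-learn's LinearSVC on difference vectors as a stand-in for the SVMlight ranking implementation actually used, and the data layout (one correct article and several other candidates per query) is hypothetical.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_differences(queries):
    """queries: list of (phi_correct, [phi_other, ...]) feature vectors per query context."""
    X, y = [], []
    for phi_pos, others in queries:
        for phi_neg in others:
            X.append(phi_pos - phi_neg)   # correct article must rank higher ...
            y.append(+1)
            X.append(phi_neg - phi_pos)   # ... mirrored pair keeps the classes balanced
            y.append(-1)
    return np.array(X), np.array(y)

def train_ranker(queries, C=1.0):
    """Learn the weight vector w of Eq. (1) from the pairwise constraints."""
    X, y = pairwise_differences(queries)
    return LinearSVC(C=C).fit(X, y)

def rank(ranker, candidate_features):
    """Return candidate indices ordered by decreasing score w * Phi(c, d_i)."""
    scores = ranker.decision_function(np.array(candidate_features))
    return list(np.argsort(-scores))
```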

6 Entity Resolution Approaches

In this paper we adapt the entity resolution approach of [Bunescu and Pasca, 2006] and enhance it with additional features. For the first time, to our knowledge, we apply it to the resolution of German name phrases using German Wikipedia entries. We represent a name phrase n by its context c(n) and want to assign it to one of a set of Wikipedia articles D = {d1, ..., dm}. For a name phrase n we determine the set of candidate articles by using the disambiguation pages of Wikipedia. If only the surname is available, this set of candidates is large, while the set is much smaller if n contains both a first name and a surname. Currently we ignore errors like misspellings; in principle they could also be taken into account, e.g. by a specific Levenshtein distance measure.

6.1 Ranking with Context-Article Similarity

The first approach to model the similarity between a query context and a Wikipedia article context is a simple summation over common words, based on the idea that the larger this number, the more similar the contexts and hence the more similar the entities denoted. [Bunescu and Pasca, 2006] and [Cucerzan, 2007] both evaluated experimentally a ranking function based on the cosine similarity between the context of the query and the text of the entity's article:

    φcos = cos(c.T, e.T) = (c.T · e.T) / (||c.T|| ||e.T||)

The factors c.T and e.T are represented in the standard vector space model, where each component corresponds to a term in the vocabulary. This results in a weighted sum of the number of common words in the query context and the article context. Measuring the similarity between contexts in this way has one major drawback: if alternative terms for one meaning are used, the similarity will be low even if the contexts denote the same entity.
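As a small illustration (a hypothetical helper, not the original code), φcos can be computed from two bag-of-words vectors over a shared vocabulary as follows.

```python
import numpy as np

def cosine_feature(query_vec, article_vec):
    """phi_cos for two bag-of-words vectors over a shared vocabulary."""
    denom = np.linalg.norm(query_vec) * np.linalg.norm(article_vec)
    return float(query_vec @ article_vec / denom) if denom > 0 else 0.0
```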

6.2 Ranking with Aggregated Semantic Information

In this approach, the baseline cosine similarity is combined with additional information derived from a topic model over Wikipedia articles. This information holds the document's probability distribution over all topics, from which the three topics with highest probability are used as additional features for both the query and the article text. To account for the divergence of the query and article topic distributions, their symmetric Kullback-Leibler divergence is added as a dedicated feature. It is given by

    Dsym(q, p) = 1/2 (D(q, p) + D(p, q)),   D(q, p) = Σ_{i=1..N} p(ti) log( p(ti) / q(ti) ),

where N is the number of topics in the topic model and p(ti) is the probability of topic i in the document. It should be noted that the query text is in general much shorter than the article text, and hence the computed topic distribution is less representative for the query text than for the article text. The overall feature vector thus consists of

    Φ(c, e) = [ φcos | φc.T | φe.T | φDsym ]
    φcos  = cos(c.T, e.T)
    φc.T  = P(ti | c.T), for the 3 most probable topics in c.T
    φe.T  = P(ti | e.T), for the 3 most probable topics in e.T
    φDsym = Dsym(c.T, e.T).
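A minimal sketch of these aggregated features, assuming dense NumPy arrays over the 200 topics; the epsilon guard against zero probabilities is an implementation choice, not taken from the paper.

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric Kullback-Leibler divergence D_sym of two topic distributions."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def aggregated_features(phi_cos, c_topics, e_topics, k=3):
    """Phi(c, e) = [phi_cos | top-k of c.T | top-k of e.T | D_sym(c.T, e.T)]."""
    top_c = np.sort(c_topics)[::-1][:k]
    top_e = np.sort(e_topics)[::-1][:k]
    return np.concatenate(([phi_cos], top_c, top_e, [sym_kl(c_topics, e_topics)]))
```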

6.3 Ranking with Aggregated Semantic Information and Weighted Context Information

When context information is condensed into only a few singular measures, the ranking model has no chance to judge the actual influence of a specific word. Therefore we augment the information presented to the model by weighted context information, i.e. each common word is given as a feature whose value is its tf × idf score in the article text. Instead of a binary representation, additional importance is given to words that are important in the candidate entity's article and hence potentially indicative. We additionally add to the overall feature vector

    φtfidf,i = w̃i,e  if wi ∈ e.T ∩ c.T,  and 0 otherwise,    (5)

where w̃i,e denotes the tf × idf score of the i-th common word in the article text of e.

Detecting Non-Listed Entities. Many entities appear in text that are not present in Wikipedia and should hence not be related to a Wikipedia entity. To evaluate the model's ability to detect that an entity is not present in Wikipedia, a scenario was created that simulates this. Here, not all possible candidates were presented to the model; instead a given fraction was deliberately left out to represent entities not mentioned in Wikipedia (non-listed entities eout). The feature vectors in this scenario are built from the same feature set as described above plus the additional feature φout = 1(e, eout).
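A sketch of the weighted common-word features of Eq. (5); the vocabulary index and the per-article tf × idf scores are assumed to be precomputed elsewhere, and the additional binary feature φout would simply be appended to the resulting vector.

```python
import numpy as np

def common_word_features(query_words, article_words, article_tfidf, vocabulary):
    """Shared words get their tf-idf weight in the article; all other positions stay zero.
    vocabulary: dict word -> feature index, article_tfidf: dict word -> tf-idf score."""
    vec = np.zeros(len(vocabulary))
    for w in set(query_words) & set(article_words):
        if w in vocabulary:
            vec[vocabulary[w]] = article_tfidf.get(w, 0.0)
    return vec
```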

6.4 Ranking with Word-Topic Correlation

An alternative representation is to model the word-topic correlation. This is done by correlating each common word with the complete topic distribution of the article. This representation is similar to the word-category correlation employed by [Bunescu and Pasca, 2006], with the difference that categories are substituted with topics, and additionally the representation is not binary but involves the probability of each specific topic to be present in the article. The intuition is that in this way the most descriptive feature vectors can be built, using

    φtopic,w,ti = P(ti | e.T)  if w ∈ c.T ∩ e.T,  for all i = 1, ..., 200.

For each of the common words w ∈ c.T ∩ e.T, these vectors contain a group of features (i.e. the distribution of topics). This relates each word distinctly to the topic distribution extracted from the article text, which is a much more expressive summary than the other approaches discussed above.
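The following sketch illustrates one possible layout of these word-topic correlation features; the layout and helper names are hypothetical, and a real implementation would use a sparse representation instead of the dense vector shown here.

```python
import numpy as np

def word_topic_features(query_words, article_words, article_topics,
                        vocabulary, num_topics=200):
    """For every common word, copy the article's full topic distribution P(t_i | e.T)
    into that word's block of features; all other blocks stay zero."""
    vec = np.zeros(len(vocabulary) * num_topics)
    for w in set(query_words) & set(article_words):
        if w in vocabulary:
            start = vocabulary[w] * num_topics
            vec[start:start + num_topics] = article_topics
    return vec
```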

7 Training and Testing

For our ongoing work we restricted the experiments to persons and considered the 3207 name phrases (first and last names) which correspond to at least 2 entities in German Wikipedia. In this set there are on average 2.5 entities per name phrase. In further experiments we will include all persons as well as other named entities like locations and organizations. For training we use links in Wikipedia as name phrases n and extract the corresponding context c(n) from the neighborhood of n. On average each person article has 12.1 links to other articles in Wikipedia. We randomly split each of these context sets into training and test queries, such that 90% are used for training and 10% for testing. We used the ranking SVM described in (3) with the associated inequalities (4). As feature function Φ(ck, di) we used a linear kernel.
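A minimal sketch of this per-name 90/10 split; the data layout (a mapping from each ambiguous name to its collected link contexts) and the random seed are illustrative.

```python
import random

def split_contexts(contexts_by_name, train_fraction=0.9, seed=0):
    """Split the link contexts collected for each ambiguous name into
    training and test queries."""
    rng = random.Random(seed)
    train, test = [], []
    for name, contexts in contexts_by_name.items():
        shuffled = list(contexts)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        train += [(name, c) for c in shuffled[:cut]]
        test += [(name, c) for c in shuffled[cut:]]
    return train, test
```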

8 Results

We first evaluated the approach using the cosine similarity and the aggregated information inferred from the application of the above mentioned topic model. Since this information could potentially change with the width of the extracted context, we created two scenarios. In the first scenario the context window spans the 25 left and 25 right neighboring words of the name in the query, in the second scenario the 50 left and 50 right neighboring words. Both context representations naturally include the name itself. The article of the true entity is given as a positive example and represented as a context of the first 50 resp. 100 words of the article, assuming that the most descriptive and distinctive information is given in this snapshot. It could be shown that enlarging the context does not increase the model's performance, as was also observed by [Gooi and Allan, 2004] in their approach to cross-document coreference resolution; hence this and the other presented results are acquired on a 50 word context. The best performance achieved is shown in Table 5.

              Fmicro    Fmacro
  Training    85.88     87.34
  Test        83.03     78.96

Table 5: Performance (in %): Ranking SVM using cosine similarity, most probable topics and symmetric Kullback-Leibler divergence.

While the best result for the simple cosine similarity approach was a macro F-measure of 75.54%, the usage of semantic information improves the performance by 3.42 points to an F-measure of 78.96%. We then evaluated the approach using the additional weighted context information. The results presented in Table 6 show that the performance can again be increased, from 78.96% to now 84.85%, which motivates the approach using a more complex combination of context and topic association.

              Fmicro    Fmacro
  Training    91.11     91.83
  Test        87.99     84.85

Table 6: Performance (in %): Ranking SVM using cosine similarity, most probable topics, symmetric Kullback-Leibler divergence and weighted context information.

As described above, context information is important for the correct disambiguation of entities. In the following experiment, this is represented as a correlation between common words and the topics associated with the candidate entity's article text, as described in Section 6.4. Due to constraints in the implementation, it was not possible to produce results on the complete set of ambiguous names in time but only on a reduced set.

A reduced dataset was created over 500 ambiguous names, yielding 1072 entities to be disambiguated and a sufficiently high number of training (5441) and test instances (1072). Context and training parameters were taken from the previous experiments. As Table 7 shows, there is a dramatic increase in performance compared to the previous approaches, to an F-measure of 97.01%, i.e. only 32 of the presented entities were not disambiguated correctly.

              Fmicro    Fmacro    Pmacro    Rmacro
  Training    99.17     98.91     99.09     99.04
  Test        97.01     96.05     95.57     97.01

Table 7: Performance (in %): Ranking SVM using word-topic correlation.

To assess the model's ability to deal with non-listed entities, 10% of the entities were simulated to be non-listed. This results in 970 listed and 102 non-listed entities, of which all were correctly marked as such. The number of false associations among Wikipedia entities is reduced to 14, which is due to the fact that the remaining entities were simulated as non-listed. As Table 8 shows, the performance was not reduced but shows equally good results. The even slightly increased F-measure is due to the fact that some of the previously incorrectly disambiguated entities were now by chance in the set of non-listed entities.

              Fmicro    Fmacro    Pmacro    Rmacro
  Training    99.23     99.00     99.19     99.13
  Test        97.76     96.70     96.29     97.53

Table 8: Performance (in %): Ranking SVM using word-topic correlation with non-listed entities.

Since all non-listed entities are marked correctly, the micro performance has equal precision and recall values, which could not be observed in the other approaches, where the model was not able to rank all non-listed entities correctly. Although this dataset is considerably smaller than the ones in previous experiments, the results look very promising and should also apply to larger corpora. We additionally evaluated a variant of the approach in [Bunescu and Pasca, 2006]: instead of using only the top-level Wikipedia categories, the correlation between common words and all Wikipedia categories assigned to an article is used for the ranking approach. In order to keep the experiment comparable to the word-topic correlation approach, the data set is the same. Table 9 shows that the performance is increased by 1.59 points to an F-measure of 98.6%.

              Fmicro    Fmacro    Pmacro    Rmacro
  Training    99.65     99.49     99.55     99.57
  Test        98.6      98.1      97.9      98.6

Table 9: Performance (in %): Ranking SVM using word-category correlation.

[Bunescu and Pasca, 2006] reduced the number of treated categories, with the effect that more persons share categories and hence the categories are less distinctive, which can be a reason for their lower performance of only 84.8% accuracy as compared to the 98.6% achieved here. [Bunescu and Pasca, 2006] also did not restrict the regarded context to the common words in query and article. But in fact, the usage of only common words to model context similarity is already rather restrictive; if they are additionally presented in pairs with categories, a nearly perfect model is achievable. One reason for this is the ratio of entities to categories: for the complete corpus of 198903 Wikipedia person articles there exist 16201 categories. Neglecting the 3996 categories that hold year of birth (1758) and year of death (2238) information, 12205 categories remain, of which 2377 affect only one person. Hence, the number of categories is rather small in relation to the number of persons. Note that the result achieved with the word-topic correlation feature is only 1.59 points lower, although the topic model was only allowed to have 200 topics whereas the word-category correlation feature uses more than 4000 categories. Although the topic correlation feature yielded slightly worse results than the category correlation feature, improved results are likely to be achieved using a better trained topic model or hierarchical topic models [Blei et al., 2004] that can better incorporate category information. Wikipedia offers a good dataset for the evaluation of such models, but generally their training can be performed on nearly any dataset. Moreover, the uniqueness of entity pages was not always guaranteed in the version of Wikipedia used in this work. The ambiguous surface name Jens Jessen was mapped onto the three entities Jens Jessen, Jens Jessen (Ökonom) and Jens Jessen (Journalist), where actually the entities Jens Jessen and Jens Jessen (Ökonom) were one and the same entity, although presented in two articles. A slightly different formulation of the disambiguation model could perform a consistency check on the uniqueness of entity pages, from which the model itself can only gain. The most crucial assumption, i.e. that of correct links, was also proven not to hold generally. A false link results in an error in the disambiguation model that at the current stage was identifiable only through manual analysis. This analysis showed, for example, that the human annotators mixed up the two entities denoted by the name John Barber, i.e. the inventor of the gas turbine and an English race driver, whereas the disambiguation model identified them correctly.

9 Summary and Conclusion

This paper describes a model that correctly assigns textual mentions of entities in German text to their representation in an external knowledge resource, e.g. Wikipedia. A ranking SVM was used with a diverse set of feature representations that use simple cosine similarity, a weighted context representation and automatically extracted topics as well as the Wikipedia category set. The presented results are comparable to those achieved on English datasets and can compete with those achieved on similar datasets with comparable ambiguity in names.

It is shown that the approach may be used for the disambiguation of German named entities and is extendable to the more general task of concept disambiguation. We have demonstrated that it is not necessary to rely on categories manually assigned to the Wikipedia articles; instead, the application of topic models as a replacement for these categories achieves equally good results. This has the positive effect that the system can be transferred to any collection describing named entities that is not endowed with a categorization. In this way time-consuming annotations of this type can be omitted. A challenging question is how the disambiguation of one entity affects the disambiguation of other entities or related concepts mentioned, for example, in the same document. This could be investigated using, for example, a cascade of communicating disambiguation models.

10 Acknowledgement

The work presented here was funded by the German Federal Ministry of Economy and Technology (BMWi) under the THESEUS project.

References

[Bagga and Baldwin, 1998] Amit Bagga and Breck Baldwin. Entity-based cross-document coreferencing using the vector space model. In Proc. of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 79–85, San Francisco, California, 1998.
[Bekkerman and McCallum, 2005] Ron Bekkerman and Andrew McCallum. Disambiguating web appearances of people in a social network. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 463–470, New York, NY, USA, 2005. ACM.
[Bhattacharya and Getoor, 2005] Indrajit Bhattacharya and Lise Getoor. Relational clustering for multi-type entity resolution. In Proc. Fourth International Workshop on Multi-Relational Data Mining (MRDM 2005), 2005.
[Blei et al., 2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[Blei et al., 2004] David M. Blei, Michael I. Jordan, Thomas L. Griffiths, and Joshua B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Proc. NIPS Advances in Neural Information Processing Systems 16, 2004.
[Bunescu and Pasca, 2006] Razvan C. Bunescu and Marius Pasca. Using encyclopedic knowledge for named entity disambiguation. In Proc. of EACL, pages 9–16, 2006.
[Cucerzan, 2007] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proc. 2007 Joint Conference on EMNLP and CoNLL, pages 708–716, 2007.
[Gooi and Allan, 2004] Chung H. Gooi and James Allan. Cross-document coreference on a large scale corpus. In HLT-NAACL, pages 9–16, 2004.
[Han et al., 2004] Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In JCDL '04: Proceedings of the 4th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 296–305, New York, NY, USA, 2004. ACM.
[Hassel et al., 2006] J. Hassel, B. Aleman-Meza, and I. B. Arpinar. Ontology-driven automatic entity disambiguation in unstructured text. pages 44–57, 2006.
[Huang et al., 2006] Jian Huang, Seyda Ertekin, and C. Lee Giles. Efficient name disambiguation for large-scale databases. In Proceedings of PKDD, pages 536–544, 2006.
[Joachims, 2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2002), 2002.
[Kendall, 1955] Maurice Kendall. Rank Correlation Methods. Hafner, 1955.
[Mann and Yarowsky, 2003] Gideon S. Mann and David Yarowsky. Unsupervised personal name disambiguation. In Proc. of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, volume 4, pages 33–40, 2003.
[Miller and Charles, 1991] G. A. Miller and W. G. Charles. Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1):1–28, 1991.
[Milne, 2007] David Milne. Computing semantic relatedness using Wikipedia link structure. In Proc. of the New Zealand Computer Science Research Student Conference (NZCSRSC 2007), 2007.
[Porter, 2001] Martin F. Porter. Snowball: A language for stemming algorithms. http://snowball.tartarus.org, 2001.
[Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, Morristown, NJ, USA, 2003. Association for Computational Linguistics.
[Waltinger and Mehler, 2008] Ulli Waltinger and Alexander Mehler. Who is it? Context sensitive named entity and instance recognition by means of Wikipedia. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-2008), 2008.
[Wentland et al., 2008] Wolodja Wentland, Johannes Knopp, Carina Silberer, and Matthias Hartung. Building a multilingual lexical resource for named entity disambiguation, translation and transliteration. In Proc. of the Sixth International Language Resources and Evaluation (LREC'08), 2008.
[Wikimedia, 2009] Wikimedia. http://stats.wikimedia.org/de/tablesrecenttrends.htm. Retrieved on Feb. 22, 2009.