A Probabilistic Model for Linking Named Entities in Web Text with Heterogeneous Information Networks

Wei Shen†, Jiawei Han‡, Jianyong Wang†

† Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Computer Science and Technology, Tsinghua University, Beijing, China
‡ Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL

† [email protected], [email protected]
‡ [email protected]
ABSTRACT

Heterogeneous information networks, which consist of multi-type, interconnected objects, are becoming ubiquitous and increasingly popular; examples include social media networks and bibliographic networks. The task of linking named entity mentions detected in unstructured Web text with their corresponding entities in a heterogeneous information network is of practical importance for the problem of information network population and enrichment. This task is challenging due to name ambiguity and the limited knowledge existing in the information network. Most existing entity linking methods focus on linking entities with Wikipedia or Wikipedia-derived knowledge bases (e.g., YAGO), and are largely dependent on the special features associated with Wikipedia (e.g., Wikipedia articles or Wikipedia-based relatedness measures). Since heterogeneous information networks do not have such features, these previous methods cannot be applied to our task. In this paper, we propose SHINE, to the best of our knowledge the first probabilistic model to link the named entitieS in Web text with a Heterogeneous Information NEtwork. Our model consists of two components: the entity popularity model, which captures the popularity of an entity, and the entity object model, which captures the distribution of multi-type objects appearing in the textual context of an entity and is generated using meta-path constrained random walks over the network. As different meta-paths express diverse semantic meanings and lead to various distributions over objects, different paths have different weights in entity linking. We propose an effective iterative approach to automatically learning the weights for each meta-path based on the expectation-maximization (EM) algorithm, without requiring any training data. Experimental results on a real-world data set demonstrate the effectiveness and efficiency of our proposed model in comparison with the baselines.

Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Storage and Retrieval—Information Search and Retrieval
SIGMOD'14, June 22–27, 2014, Snowbird, UT, USA.
Copyright 2014 ACM 978-1-4503-2376-5/14/06. http://dx.doi.org/10.1145/2588555.2593676.
Keywords

Entity linking; Heterogeneous information networks; Domain-specific entity linking
1. INTRODUCTION
Heterogeneous information networks (HIN) that involve a large number of multi-type objects are becoming ubiquitous and prevalent, since real-world physical and abstract data objects are all connected via different relations, forming diverse heterogeneous information networks [28]. For example, in a bibliographic dataset, objects of multiple types, such as papers (P), authors (A), publication venues (V), and title terms (T), and relations of multiple types, such as write, publish, and contain, are interconnected, providing rich information and forming a heterogeneous information network.

However, object names in a heterogeneous network are potentially ambiguous: the same textual name may refer to several different entities. As the example in Figure 1 shows, in the DBLP network the object name "Wei Wang" may refer to 45 different authors, including "Wei Wang" at University at Albany, SUNY, "Wei Wang" at Fudan Univ., China, "Wei Wang" at UCLA, and "Wei Wang" at UNSW, Australia.

Although many large-scale heterogeneous information networks exist, the information contained in them is limited. For example, the DBLP network does not contain the advisor relation between authors. Furthermore, as the world evolves, new facts come into existence and are digitally expressed on the Web. Therefore, populating and enriching existing heterogeneous information networks with newly extracted facts (such as relations and descriptive paragraphs for entities) becomes increasingly important. However, integrating the newly extracted facts derived from information extraction systems into an existing heterogeneous information network requires a system that maps the entity mentions associated with the extracted facts to their corresponding entities in the heterogeneous information network. For instance, we could extract the graduateFrom relation between the author name "Wei Wang" and the organization name "UCLA" from the Web document in Figure 1.
Before populating this relation into the DBLP network, we need to map the author name “Wei Wang” in
Web document (Figure 1, left): Wei Wang received a Ph.D. degree in Computer Science from UCLA in 1999 under the supervision of Prof. Richard R. Muntz. Her research interests include data mining, bioinformatics and computational biology, and databases. She is also a pioneer in applying data mining methods to biomedical domains. She serves on the organization and program committees of international conferences including ACM SIGMOD, ACM SIGKDD, ACM BCB, VLDB, ICDE, EDBT, ACM CIKM, IEEE ICDM, SIAM DM, SSDBM, BIBM.

Candidate entities from DBLP (Figure 1, right): "Wei Wang" at University at Albany, SUNY; "Wei Wang" at Fudan Univ., China; "Wei Wang" at UCLA; "Wei Wang" at UNSW, Australia; "Wei Wang" at South Dakota State University; "Wei Wang" at Murdoch University; ... (39 other different "Wei Wang"s)
Figure 1: An illustration of the task of linking an entity mention in a Web document with the DBLP bibliographic network. The named entity mention detected from the Web document is in bold face; candidate entities from the DBLP network for this entity mention are shown on the right; the true mapping entity is underlined.

this relation to its true mapping author (i.e., "Wei Wang" at UCLA), as there are 45 different authors with the same name "Wei Wang" in the DBLP network.

On the other hand, some heterogeneous information networks can, to some extent, be regarded as domain-specific knowledge bases [6]. For example, the DBLP network (or the IMDb network) contains more interesting and diverse knowledge than Wikipedia with respect to the domain of computer science (or entertainment). In this light, we can regard this task as one type of domain-specific entity linking. In our task, we focus on linking entity mentions appearing in domain-specific unstructured Web text, which pertains to the same domain as the heterogeneous information network. This task is therefore beneficial for bridging unstructured documents and semi-structured heterogeneous information networks, which can facilitate tasks such as information integration and information retrieval. Meanwhile, this task can be used to automatically annotate domain-specific Web text with related knowledge mined from a heterogeneous information network, which could significantly improve user experience. For instance, when a user is reading the Web document in Figure 1, we could show related knowledge about the author "Wei Wang" mined from the DBLP network (e.g., publication records, similar authors, research areas, etc.) after linking it with "Wei Wang" at UCLA in the DBLP network.

Traditional entity linking methods mainly focus on linking entities with Wikipedia or Wikipedia-derived knowledge bases (e.g., YAGO [26]) [3, 4, 14, 11, 23, 10, 22, 21, 24].
They mainly rely on the context knowledge embedded in the Wikipedia article [3, 4, 14, 10, 21, 22, 24], Wikipedia-based semantic relatedness measures [14, 21, 11, 23, 24] (e.g., the Wikipedia Link-based Measure [19]), and some special structures in Wikipedia [3, 4, 14, 10, 21, 23, 22, 24] (e.g., disambiguation pages and hyperlinks in Wikipedia articles). In this paper, by contrast, we study the problem of linking entities in Web text with a heterogeneous information network. Heterogeneous information networks do not have these specific features associated with Wikipedia, so these traditional entity linking methods cannot be applied to our task. For example, an essential step in these previous approaches [3, 4, 14, 10, 11, 21, 24] is to define a context similarity measure between the Wikipedia article associated with the candidate entity and the text around the entity mention,
while for each author entity in the DBLP network, we do not have a descriptive article for her and thus cannot calculate such a context similarity measure. In addition, the Wikipedia Link-based Measure [19] has been utilized to calculate the topical coherence between mapping entities in many existing entity linking methods [14, 21, 11, 23, 24]. However, this measure is based on the hyperlink structure among Wikipedia articles and cannot be used to calculate the topical coherence between entities in the DBLP network.

To deal with this problem, we propose a probabilistic model that combines the entity popularity model and the entity object model. The entity popularity model is context-independent and captures the popularity of an entity. For example, a famous professor named "Wei Wang" who has published many papers is usually considered more popular than a student who is also named "Wei Wang" and has published very few papers.

The entity object model captures the probability of multi-type objects appearing in the textual context of an entity. In a heterogeneous information network, multi-type objects are connected via different types of relations or sequences of relations, forming a set of meta-paths [27]. A meta-path is a path consisting of a sequence of relations between different object types (i.e., a structural path at the meta level). Different meta-paths imply distinct semantic meanings, which may lead to diverse distributions over objects. For example, in a bibliographic network, A-P-A is a meta-path denoting a relation between an author and her coauthor, whereas A-P-V denotes a relation between an author and a venue where her paper is published. Random walks starting from an author along the meta-path A-P-A may generate the distribution of coauthors for that author, while the meta-path A-P-V may lead to the distribution of venues for that author. A question then arises: which meta-paths are more important for the entity linking task?
The remaining problem of our model is to determine which meta-paths (or which weighted combination of them) should be used for the specific entity linking task. It is difficult to ask a user to explicitly specify the weights for such sophisticated meta-paths. To address this problem, an effective iterative approach is proposed to automatically learn the most appropriate weights of meta-paths based on the expectation-maximization (EM) algorithm, without requiring any training data. With regard to different meta-path sets for arbitrary heterogeneous information networks, our model can automatically learn the proper meta-path weights, which makes our model general and flexible enough to accommodate various types of heterogeneous networks.

Figure 2: DBLP and IMDb network schema. (a) DBLP: Author, Paper, Venue, Term, Year; (b) IMDb: Actor, Movie, Genre, Keyword, Director.

Contributions. The main contributions of this paper are summarized as follows.

• We are among the first to explore the problem of linking entities with a heterogeneous information network, a new and increasingly important problem given the proliferation of network data and its broad applications.

• We propose an effective probabilistic model called SHINE to deal with this task, which unifies the entity popularity model with the entity object model.

• An effective iterative approach is proposed to automatically learn the weights of meta-paths based on the EM algorithm without requiring any training data, which makes our model unsupervised.

• To verify the effectiveness and efficiency of SHINE, we conducted experiments over a real heterogeneous information network and a manually annotated Web document collection. The experimental results show that SHINE significantly outperforms the baselines in terms of accuracy, and scales very well.

The remainder of this paper is organized as follows. Section 2 introduces some preliminaries and notations. We present the probabilistic model in Section 3, and introduce the model learning algorithm in Section 4. Section 5 presents the experimental results and Section 6 discusses the related work. Finally, we conclude this paper in Section 7.
2. PRELIMINARIES AND NOTATIONS

In this section, we begin by introducing some concepts related to heterogeneous information networks. Next, we define the task of linking entities in Web text with a heterogeneous information network (entity linking with a HIN for short).

2.1 Heterogeneous information network

A heterogeneous information network G is an information network with multiple types of objects and multiple types of links [28, 27].

Definition 1 (Heterogeneous information network). A heterogeneous information network is defined as a directed graph G = (V, Z), where V is the object set and Z is the link set. Each object v ∈ V belongs to a particular object type T, and each link z ∈ Z belongs to a particular relation type R. Moreover, the number of object types |{T}| > 1 and the number of relation types |{R}| > 1.

The DBLP bibliographic network (http://www.informatik.uni-trier.de/vley/db/) is a typical heterogeneous information network, containing five types of objects: papers (P), authors (A), publication venues (V), title terms (T), and publication years (Y). Links exist between authors and papers via the relations write and write^-1, between publication venues and papers via publish and publish^-1, between papers and title terms via contain and contain^-1, and between papers and publication years via publishedIn and publishedIn^-1. The network schema (i.e., the meta-level description of the network) for the DBLP network is shown in Figure 2(a). The IMDb network (http://www.imdb.com/; see the schema in Figure 2(b)) is also a heterogeneous information network, containing five types of objects: movies (M), actors (Ac), genres (G), description keywords (K), and directors (D). Links exist between actors and movies via the relations perform and perform^-1, between movies and genres via belongTo and belongTo^-1, between movies and description keywords via contain and contain^-1, and between directors and movies via direct and direct^-1.

In a heterogeneous information network, two objects can be connected via different types of relations or sequences of relations, forming a set of meta-paths, which are defined as follows.

Definition 2 (Meta-path). A meta-path p is a path defined over the network schema of a given network G, denoted in the form T1 --R1--> T2 --R2--> ... --Rl--> T_{l+1} (l ≥ 1), which defines a composite relation R1 ∘ R2 ∘ ... ∘ Rl between object types T1 and T_{l+1}, where ∘ denotes the composition operator on relations.

For simplicity, a meta-path p can also be described as a sequence of relations (denoted by R1-R2-...-Rl) or, if no two object types are linked by multiple relations, as a sequence of object types (denoted by T1-T2-...-T_{l+1}). The length of meta-path p is the number of relations in p. For example, in the DBLP network, A-P-V is a length-2 meta-path denoting a relation between an author and a venue where her paper is published, and A-P-A-P-V is a length-4 meta-path denoting a relation between an author and a venue where a paper of her coauthor is published. In the IMDb network, Ac-M-D is a length-2 meta-path denoting a relation between an actor and a director who directs a movie she performs in, Ac-M-K is a length-2 meta-path denoting a relation between an actor and a description keyword her movie contains, Ac-M-G-M-Ac is a length-4 meta-path denoting a relation between actors who perform in the same genre of movies, and Ac-M-Ac-M-G is a length-4 meta-path denoting a relation between an actor and a genre to which a movie of her co-actor belongs.

2.2 Entity linking with a HIN

According to the task setting, we take as input (1) a collection of unstructured Web documents (denoted by D), (2) named entity mentions recognized in the given documents D (denoted by M), and (3) a heterogeneous information network G. Each Web document d ∈ D should pertain to the same domain as the heterogeneous information network G; otherwise, the document d shares no common knowledge with the information network, which makes entity linking meaningless. For example, if we link with the DBLP network, the Web document d ∈ D should pertain
Table 1: Notations

G: A heterogeneous information network
V: The object set in G
v ∈ V: An object in the object set V
Z: The link set in G
z ∈ Z: A link in the link set Z
T: An object type
R: A relation type
p: A meta-path
D: A Web document collection
d ∈ D: A document in the document collection D
M: Named entity mentions recognized in D
m ∈ M: A named entity mention recognized in a document d
E: The set of entities in G having the same object type as the mentions in M
e ∈ E: An entity in the entity set E
to the domain of computer science (CS), such as a CS researcher's homepage, a news article on a CS department website, or a CS talk/seminar introduction page. Each entity mention m ∈ M detected from document d is a token sequence (or surface form) of a named entity that is potentially linked with an entity in the heterogeneous information network G. E is the set of entities in the heterogeneous information network G that have the same object type as the entity mentions in M, and each entity in E is denoted by e. Generally, the entity set E is a subset of the object set V in the network G (i.e., E ⊂ V). For example, if we want to link author name mentions in Web text with the DBLP network, the entity set E is the object set of author type in the DBLP network. We now formally state the task of entity linking with a HIN.

Definition 3 (Entity linking with a HIN). Given a named entity mention set M detected from a Web document collection D and a heterogeneous information network G, the goal is to identify the mapping entity e ∈ E in the heterogeneous information network G for each entity mention m ∈ M in a document d ∈ D.

For illustration, we show a running example of the task of entity linking with a HIN.

Example 1 (Entity linking with a HIN). In this example, we consider the task of entity linking with a HIN shown in Figure 1. The named entity mention "Wei Wang" in the Web document of Figure 1 needs to be linked with its referring author in the DBLP bibliographic network. There are 45 different candidate author entities in the DBLP network in total, according to Figure 1. For the named entity mention "Wei Wang" in this example, we should output its true mapping author entity (i.e., "Wei Wang" at UCLA), which is underlined in Figure 1.

In this paper, due to its limited scope, we assume that the heterogeneous information network G contains the mapping entities for all the named entity mentions in M.
The method for predicting entity mentions that do not have their corresponding entity records in the heterogeneous information network is left for future research. Some notations used in this paper are summarized in Table 1.
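For concreteness, the candidate entities for a mention under Definition 3 can be generated by an exact name lookup over the entity set E. The following is a minimal sketch, not the paper's implementation; the entity identifiers (e.g., ww_ucla) are hypothetical:

```python
from collections import defaultdict

def build_name_index(entities):
    """Map each surface name to all entities in E bearing that name."""
    index = defaultdict(list)
    for entity_id, name in entities:
        index[name].append(entity_id)
    return index

# Hypothetical (id, name) records drawn from the DBLP author set E.
entities = [
    ("ww_ucla", "Wei Wang"),
    ("ww_fudan", "Wei Wang"),
    ("ww_unsw", "Wei Wang"),
    ("rm_ucla", "Richard R. Muntz"),
]
index = build_name_index(entities)
# Every author named "Wei Wang" becomes a candidate for the mention "Wei Wang".
candidates = index["Wei Wang"]
```

In the setting of Example 1, such a lookup would return all 45 DBLP authors named "Wei Wang" as candidates.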
3. THE PROBABILISTIC MODEL

In this section, we propose a probabilistic model to deal with the task of entity linking with a HIN. Given a named entity mention m ∈ M detected from a document d ∈ D, we want to find its most likely mapping entity e ∈ E in the heterogeneous information network G. This leads to the following inference problem.
Problem 1 (Inference). Given a named entity mention m appearing in a document d, compute

  arg max_{e ∈ E} P(e|m, d)    (1)

i.e., the most likely mapping entity e given an entity mention m in a document d.

According to Formula 1, given a named entity mention m appearing in a document d, we could find its mapping entity e as follows:

  arg max_{e ∈ E} P(e|m, d) = arg max_{e ∈ E} P(m, d, e) / P(m, d) = arg max_{e ∈ E} P(m, d, e)    (2)

We assume that there is an underlying distribution P over the set M × D × E. Therefore, our goal is to model P(m, d, e). The probability of an entity mention m whose context is the document d referring to a specific entity e can be expressed as follows (here we assume that m and d are independent given e):

  P(m, d, e) = P(e) P(m|e) P(d|e)    (3)
The mapping entity e should have the name with surface form m, and we denote the entities that can be referred to by the name with surface form m as the candidate entities for entity mention m. For simplicity, we assume that the probability P(m|e) of observing the name with surface form m given each candidate entity e for mention m is the same, and define it as a constant η with 0 < η ≤ 1. For example, given each of the 45 author entities named "Wei Wang" shown in Figure 1, we assume that the likelihood of observing "Wei Wang" as her name is the same. Under this assumption, the complete model can be expressed as

  P(m, d, e) = η · P(e) P(d|e)    (4)
where e is a candidate entity for entity mention m. This probabilistic model as shown in Formula 4 mainly consists of two components. (1) The entity popularity model P(e) captures the popularity of an entity e, which is the likelihood of observing an entity e appearing in a document without knowing any context information. (2) The entity object model P(d|e) denotes the probability of observing document d as the textual context for entity e. In the following subsections, we discuss the entity popularity model (Section 3.1) and the entity object model (Section 3.2) in more detail.
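Before detailing the two components, the decision rule implied by Formula 4 can be sketched in a few lines. This is an illustrative sketch only (not the paper's code): both distributions are assumed to be given as plain dictionaries, scores are accumulated in log space to avoid floating-point underflow, η cancels because it is constant across candidates, and the small probability floor for unseen objects merely stands in for the smoothing introduced in Section 3.2:

```python
import math

def link_mention(candidates, popularity, obj_dist, doc_objects, floor=1e-12):
    """Return arg max_e P(e) * prod_{v in d} P(v|e), computed in log space."""
    best_entity, best_score = None, float("-inf")
    for e in candidates:
        score = math.log(popularity[e])          # entity popularity model P(e)
        for v in doc_objects:
            # entity object model P(d|e); floor keeps unseen objects from zeroing it
            score += math.log(obj_dist[e].get(v, floor))
        if score > best_score:
            best_entity, best_score = e, score
    return best_entity

# Toy numbers, hypothetical and not taken from the paper's data set.
popularity = {"ww_ucla": 0.6, "ww_fudan": 0.4}
obj_dist = {
    "ww_ucla": {"SIGMOD": 0.013, "Muntz": 0.009},
    "ww_fudan": {"SIGMOD": 0.006},
}
winner = link_mention(["ww_ucla", "ww_fudan"], popularity, obj_dist,
                      ["SIGMOD", "Muntz"])
```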
3.1 Entity popularity model
Given a heterogeneous information network G, to model the entity popularity P(e), the simplest approach is to assume that each entity e in the entity set E has uniform popularity. We call this the uniform popularity model and estimate it as:

  P(e) = 1 / |E|    (5)

where |E| is the number of entities in the entity set E. However, this uniform popularity model does not reflect the real situation well: each entity in the heterogeneous information network has a different popularity, and some entities are clearly more prevalent than others. For example, a professor named "Rakesh Kumar" who has published many papers is usually regarded as more popular than
a Ph.D. student who has published very few papers and has the same name "Rakesh Kumar". Most previous entity linking systems estimate the popularity of an entity using the entity frequency in the Wikipedia article corpus [14, 10, 11, 21, 23, 22, 24]. However, this approach cannot be applied to our task of entity linking with a HIN.

Entities in the heterogeneous information network are connected via different relations, and the popularity of an entity in the network relies on the visibility of the other entities connected to it. For example, in the DBLP bibliographic network, the popularity of an author depends on features such as her coauthors' popularity, her publication quantity, and the authoritativeness of the venues where her papers are published. As we know, PageRank [2] is a general-purpose measure of network node importance that has proven fairly successful for many tasks. The basic idea of PageRank is that if object v1 has a link to object v2, then object v1 implicitly confers some importance to object v2.

For simplicity, we ignore the object types in the information network G, and let |V| be the number of objects in G. We use 1 ≤ k ≤ |V| to index the objects in V, and denote the object with index k by v_k. Suppose that we want to compute the PageRank score vector pr = (pr_1, ..., pr_k, ..., pr_{|V|}), where pr_k is the PageRank score of object v_k, indicating its popularity. The PageRank score pr_k of each object v_k is determined by the combination of its initial PageRank score and the scores propagated from other connected objects, and the final PageRank score vector pr can be computed efficiently by matrix multiplication. Let ip = (ip_1, ..., ip_k, ..., ip_{|V|}) be the initial PageRank score vector, where ip_k is the initial PageRank score of object v_k. In this paper, we assume each object v_k has the same initial PageRank score (i.e., 1/|V|) and assign a uniform vector to ip.
Let B be the |V| × |V| matrix corresponding to the link structure of network G, assuming each object in G has at least one outgoing link. If there is a link from object v_k to object v_k′, we let the matrix entry b_{k′,k} have the value 1/N_k, where N_k is the outdegree of object v_k. Thus, the matrix B is column-normalized. Formally, we define the computation of the PageRank score vector pr as:

  pr = λ · ip + (1 − λ) · B · pr    (6)

where λ ∈ [0, 1] is a parameter that balances the relative importance of the two parts (i.e., the initial PageRank score ip and the scores propagated from other connected objects, B · pr). As Formula 6 is recursive, to calculate the final PageRank score vector pr we first initialize pr by setting pr = ip and then apply Formula 6 iteratively until pr stabilizes within some threshold. Since the PageRank score is context-independent, we can compute the PageRank score pr_k of each object v_k offline. For each entity e ∈ E ⊂ V, let pr(e) be its PageRank score calculated via Formula 6. Our entity popularity model then uses the PageRank score to estimate the popularity P(e) of entity e as follows:

  P(e) = pr(e) / Σ_{e′ ∈ E} pr(e′)    (7)
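A minimal power-iteration sketch of Formulas 6 and 7 follows. This is an illustrative implementation under the same assumptions as the text (uniform ip, every object has at least one outgoing link); the network is represented as an adjacency list, and the object names are hypothetical:

```python
def pagerank(out_links, lam=0.15, tol=1e-10):
    """Iterate pr = lam * ip + (1 - lam) * B * pr (Formula 6) to a fixed point."""
    nodes = list(out_links)
    n = len(nodes)
    pr = {v: 1.0 / n for v in nodes}        # pr initialized to the uniform vector ip
    while True:
        nxt = {v: lam / n for v in nodes}   # lam * ip, with ip uniform
        for v, targets in out_links.items():
            share = (1 - lam) * pr[v] / len(targets)  # column-normalized B
            for t in targets:
                nxt[t] += share
        if max(abs(nxt[v] - pr[v]) for v in nodes) < tol:
            return nxt
        pr = nxt

def popularity(pr, entity_set):
    """P(e) = pr(e) / sum_{e'} pr(e') over the entity set E (Formula 7)."""
    total = sum(pr[e] for e in entity_set)
    return {e: pr[e] / total for e in entity_set}

# Tiny three-object network; names are illustrative only.
scores = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
pop = popularity(scores, ["a", "b"])
```

Since the PageRank scores are context-independent, this computation can be run offline once per network, as the text notes.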
Table 2: Entity popularity in Example 1

Candidate entity: Entity popularity
"Wei Wang" at University at Albany, SUNY: 4.441×10^-6
"Wei Wang" at Fudan Univ., China: 7.373×10^-6
"Wei Wang" at UCLA: 1.08×10^-5
"Wei Wang" at UNSW, Australia: 6.202×10^-6
"Wei Wang" at South Dakota State University: 3.868×10^-6
"Wei Wang" at Murdoch University: 4.180×10^-7
For illustration, we show in Table 2 the entity popularity P(e) for each candidate entity e in Example 1. From the results in Table 2, we can see that the popularity of the author entity "Wei Wang" at UCLA (i.e., 1.08×10^-5) is the highest among all the candidates, indicating that "Wei Wang" at UCLA is the most popular entity in the candidate entity set with respect to the entity mention "Wei Wang", which is consistent with our intuition. In contrast, the author "Wei Wang" at Murdoch University, who has published just one paper in DBLP, has the lowest entity popularity (i.e., 4.180×10^-7). Thus, the entity popularity model suitably expresses the popularity of the candidate entities. Moreover, the experimental results in Section 5.2 also demonstrate that our entity popularity model (Formula 7) is more effective than the simple uniform popularity model (Formula 5) for the entity linking task.
3.2 Entity object model

The entity object model P(d|e) captures the probability of observing document d as the textual context for entity e. That is, it assigns a high probability P(d|e) if the entity e frequently appears in contexts like document d, and a low probability if it rarely does. Since we are dealing with heterogeneous information networks, which involve a large number of multi-type objects, we assume that the document d consists of various multi-type objects v from the heterogeneous information network, and that the observations of these different objects given the entity are independent. In Example 1, the Web document of Figure 1, where the entity mention "Wei Wang" appears, consists of an object of author type (i.e., Richard R. Muntz), some objects of venue type (such as SIGMOD, SIGKDD, BCB, VLDB, ICDE, EDBT, CIKM, etc.), some objects of term type (such as computer, data, mining, bioinformatics, computational, biology, database, etc.), and an object of year type (i.e., 1999). Under this independence assumption, which is similar to unigram language modeling [18], the entity object model P(d|e) can be expressed as the product of the probabilities P(v|e):

  P(d|e) = Π_{v ∈ d} P(v|e)    (8)
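Computed naively, the product in Formula 8 underflows for long documents, and a single zero probability annihilates it. A minimal sketch in log space makes both issues visible (illustrative only; the probabilities loosely echo Figure 3, and the zero case is exactly the sparsity problem that smoothing addresses later in this subsection):

```python
import math

def doc_likelihood(doc_objects, p_v_given_e):
    """P(d|e) = prod_{v in d} P(v|e) (Formula 8), accumulated in log space."""
    log_p = 0.0
    for v in doc_objects:
        p = p_v_given_e.get(v, 0.0)
        if p == 0.0:
            return 0.0  # one unseen object zeroes the whole product
        log_p += math.log(p)
    return math.exp(log_p)

# Illustrative P(v|e) tables for two candidates (values are not the paper's).
p_ucla = {"Muntz": 0.00908, "SIGMOD": 0.0134}
p_fudan = {"SIGMOD": 0.00649}  # this candidate never co-occurs with Muntz
```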
From Formula 8, we can see that the entity object model captures the probability of multi-type objects v’s appearing in the textual context of entity e. The distribution P(v|e) encodes the probability of observing object v given entity e, which can be estimated from the associated network about entity e in the heterogeneous information network. For example, with respect to the entity “Wei Wang” at UCLA,
“Wei Wang” at UCLA
“Wei Wang” at UNSW, Australia
“Wei Wang” at Fudan Univ., China
model P(d|e) as: P(d|e) =
∏
(θ · Pe (v) + (1 − θ) · Pg (v))
(9)
v∈d
Entity object model
Entity object model
Entity object model
Richard R. Muntz (A) =0.00908 SIGMOD (V) =0.0134 SIGKDD (V) = 0.0179 BCB (V) = 0.00446 Data (T) = 0.00968 Mining (T) = 0.0107 Bioinformatics (T) = 0.000553 Biology (T) =0.000277 1999 (Y) = 0.00221
Richard R. Muntz (A) =0 SIGMOD (V) =0.0308 SIGKDD (V) = 0 BCB (V) = 0 Data (T) = 0.00742 Mining (T) = 0.00035 Bioinformatics (T) =0 Biology (T) = 0 1999 (Y) = 0
Richard R. Muntz (A) =0 SIGMOD (V)=0.00649 SIGKDD (V) =0.00974 BCB (V) = 0 Data (T) = 0.00795 Mining (T) = 0.00912 Bioinformatics (T) =0 Biology (T) = 0.000361 1999 (Y) = 0
Figure 3: Entity object model for three candidate entities in Example 1. The letter in the parentheses after each object represents its object type.
the probability of observing venue object SIGMOD should be higher than the probability of observing venue object VLDB, because the author “Wei Wang” at UCLA has published much more papers in the SIGMOD conference (i.e., six papers) than the VLDB conference (i.e., one paper) in DBLP. Figure 3 shows parts of the entity object model for three candidate entities with the highest entity popularity in Example 1 (i.e., “Wei Wang” at UCLA, “Wei Wang” at UNSW, Australia, and “Wei Wang” at Fudan Univ., China), which is generated using meta-path constrained random walks (its definition is given in Formula 11). From Figure 3, we can see that the probability P(d|“Wei Wang” at UCLA) of observing the Web document d in Figure 1 given the entity “Wei Wang” at UCLA is likely to be significantly higher than the probability P(d|“Wei Wang” at Fudan Univ., China) and the probability P(d|“Wei Wang” at UNSW, Australia), because the probabilities of observing most of the representative objects appearing in the document d (e.g., author object Richard R. Muntz, venue objects SIGKDD and BCB, term objects Data, Mining, Bioinformatics, and year object 1999 ) given the entity “Wei Wang” at UCLA are higher than the probabilities of observing these objects given each of the other two candidate entities. From Figure 3, we can also see that the probability of observing some object given an entity is equal to 0 (e.g., the probability of observing author object Richard R. Muntz given the entity “Wei Wang” at Fudan Univ., China, P(Richard R. Muntz|“Wei Wang” at Fudan Univ., China)) due to the sparse data problem. This leads to that the product of probabilities in Formula 8 equals zero. To avoid this problem, we further smooth P(v|e) using a generic object model for the domain. 
Formally, given a document d that is the textual context of entity e, each object v ∈ d is drawn randomly from a mixture of two object models: an entity-specific object model Pe(v), which is a distribution over objects with respect to entity e and is generated using meta-path constrained random walks (defined in Formula 11), and a generic object model for the domain Pg(v), which is independent of entity e and can be estimated from the whole collection. Thus, we further define the entity object model as

    P(v|e) = θ·Pe(v) + (1 − θ)·Pg(v)    (9)

where θ ∈ (0, 1) is a parameter that balances the two parts (i.e., the entity-specific object model Pe(v) and the generic object model for the domain Pg(v)). The generic object model Pg(v) can be learned by counting the frequencies of the multi-type objects appearing in the document collection D. The remaining problem is to estimate the entity-specific object model Pe(v) for each entity e ∈ E.

To estimate Pe(v), an intuitive way is to apply random walks starting from entity e in the heterogeneous information network G, which generates a distribution over objects for entity e; such a random walker follows any type of relation with uniform probability at each step. However, this intuitive approach does not distinguish the different semantic meanings embedded in the various relation types and treats all relation types as the same. In a heterogeneous information network, an object can link to many different types of objects via multiple meta-paths. Different meta-paths imply different semantic meanings, which may lead to rather diverse distributions over objects. For example, in a bibliographic network, A-P-A is a meta-path denoting the relation between an author and her coauthors, whereas A-P-T denotes the relation between an author and the title terms her papers contain. Thus, we explore meta-paths to guide the random walks over the heterogeneous information network G.

In this paper, we propose to use meta-path constrained random walks [15] to estimate the entity-specific object model Pe(v). Formally, let meta-path p = R1−R2−…−Rl, where each relation Rk is a binary relation. We define Rk(v′, v) = 1 if object v′ and object v are linked by relation Rk, and Rk(v′, v) = 0 otherwise. We also define Rk(v′) = {v | Rk(v′, v)}, the set of objects that are linked with object v′ via the relation Rk. Given a meta-path p = R1−R2−…−Rl which starts with the same object type as entity e, we define Pe(v|p), i.e., the probability of observing object v given entity e and meta-path p, as follows. Firstly, if meta-path p is an empty path, we define

    Pe(v|p) = 1 if object v is entity e, and 0 otherwise.    (10)

If p = R1−R2−…−Rl is a nonempty path, then let p′ = R1−R2−…−Rl−1, and define

    Pe(v|p) = ∑_{v′∈V} Pe(v′|p′) · Rl(v′, v) / |Rl(v′)|    (11)
where |Rl(v′)| is the number of objects that are linked with the object v′ via the relation Rl. This definition (Formula 11) is recursive and is called meta-path constrained random walks, i.e., random walks starting from entity e along meta-path p. Given a meta-path, we can calculate the distribution over observed objects for each entity using Formula 11. For example, given the meta-path A-P-V, with respect to the entity "Wei Wang" at UCLA, the probability of observing venue object SIGMOD is 0.0536, while the probabilities of observing venue object VLDB and venue object SIGMETRICS are the same (i.e., 0.00893) in the DBLP network, because the author "Wei Wang" at UCLA has published six papers in
Table 3: Meta-paths in the DBLP network used in our experiments

Meta-path    Semantic meaning of the relation
A-P-A        Authors who coauthor with author e
A-P-A-P-A    Authors who coauthor with the coauthors of author e
A-P-V-P-A    Authors who publish papers in the same venues as author e's papers
A-P-V        Venues where author e publishes papers
A-P-A-P-V    Venues where the coauthors of author e publish papers
A-P-T-P-V    Venues that publish papers containing the same title terms as author e's papers
A-P-T        Terms that author e's papers contain
A-P-A-P-T    Terms that the papers of author e's coauthors contain
A-P-V-P-T    Terms that are contained in the papers that are published in the same venues as author e's papers
A-P-Y        Years when author e's papers are published
the SIGMOD conference, and just one paper each in the VLDB conference and the SIGMETRICS conference, in the DBLP network. Additionally, given the meta-path A-P-A-P-V, with respect to the entity "Wei Wang" at UCLA, the probability of observing venue object VLDB (i.e., 0.00863) is much higher than the probability of observing venue object SIGMETRICS (i.e., 0.00471) in the DBLP network, since the coauthors of the author "Wei Wang" at UCLA have published many more papers in the VLDB conference than in the SIGMETRICS conference. It can be seen that different meta-paths may imply different semantic meanings, which may lead to rather diverse distributions over objects and, further, to different entity linking results. Therefore, it is desirable to learn the relative importance of each meta-path for the specific entity linking task.

In order to quantify the importance of each meta-path p, we assign a meta-path weight wp to each meta-path p. Given a set of meta-paths, the entity-specific object model Pe(v) can then be the weighted sum of the probabilities of observing object v given entity e along each meta-path p. We define the entity-specific object model Pe(v) as follows:

    Pe(v) = ∑_{p} wp·Pe(v|p)    (12)

where ∑_{p} wp = 1. A larger wp indicates a higher importance of the meta-path p with respect to the entity linking task. We define the meta-path weight vector as W, in which each item wp is the weight of meta-path p. Note that we do not consider negative wp in this model, which means that relationships with a negative impact on the entity linking process are not considered; the extreme case wp = 0 means the relationships along this meta-path are totally irrelevant to the entity linking process.

A set of meta-paths starting from the same object type as entity e, which might be useful for the entity linking task, should be provided as the input of this model. The set of meta-paths used in our experiments is shown in Table 3, where the semantic meaning of the relation denoted by each meta-path is given in the second column. These meta-paths can be determined either according to users' expert knowledge, or by traversing the network schema starting from the same object type as entity e with a length constraint, using standard traversal methods such as the BFS (breadth-first search) algorithm. A very long meta-path that propagates relationships to remote neighborhoods may not carry much meaningful semantics [27] and is not very useful in entity linking. Note that the meta-path weights wp for each meta-path p are the only parameters which need to be learned in our model. The estimation problem of our model can be stated as follows.

Problem 2 (Estimation). Given a heterogeneous information network G and a set of named entity mentions M recognized in the given document collection D, determine the parameters (i.e., meta-path weights wp for each meta-path p) that maximize the likelihood of observing the named entity mentions M in the document collection D.

Once we learn the model, we can use it to link entities with a heterogeneous information network G. In the following, we introduce how to solve the estimation problem of our model, which is inspired by the model estimation method in [5].
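As an illustration, both the meta-path constrained random walk of Formulas 10-11 and the BFS-style enumeration of candidate meta-paths over the schema can be sketched in a few lines. This is a simplified sketch: the adjacency and schema encodings and the relation names are our assumptions, not the paper's data structures.

```python
from collections import defaultdict, deque

def metapath_walk(network, entity, path):
    """Pe(v | p) from Formulas 10-11: start with all probability mass on
    the entity (empty path), then for each relation in the meta-path
    spread each node's mass uniformly over its neighbors under that
    relation. `network[relation][node]` lists the nodes reachable from
    `node` via `relation`; dead ends simply drop their mass here."""
    dist = {entity: 1.0}
    for relation in path:
        nxt = defaultdict(float)
        for node, mass in dist.items():
            neighbors = network[relation].get(node, [])
            if neighbors:
                share = mass / len(neighbors)   # 1 / |R_l(v')|
                for v in neighbors:
                    nxt[v] += share
        dist = dict(nxt)
    return dist

def enumerate_metapaths(schema, start_type, max_len):
    """BFS over the network schema: all meta-paths from `start_type`
    with at most `max_len` relations, as the text suggests."""
    paths, queue = [], deque([(start_type, [])])
    while queue:
        obj_type, path = queue.popleft()
        if path:
            paths.append(path)
        if len(path) < max_len:
            for relation, target in schema.get(obj_type, []):
                queue.append((target, path + [relation]))
    return paths

# Tiny toy network: one author, two papers, one venue (an A-P-V walk).
net = {"writes":      {"Wei Wang": ["p1", "p2"]},
       "publishedIn": {"p1": ["SIGMOD"], "p2": ["SIGMOD"]}}
dist = metapath_walk(net, "Wei Wang", ["writes", "publishedIn"])
# dist == {"SIGMOD": 1.0}

schema = {"A": [("A-P", "P")],
          "P": [("P-A", "A"), ("P-V", "V"), ("P-T", "T"), ("P-Y", "Y")]}
paths = enumerate_metapaths(schema, "A", 2)   # A-P, A-P-A, A-P-V, A-P-T, A-P-Y
```

Each step of the walk divides a node's mass by its relation-specific out-degree, which matches the 1/|Rl(v′)| factor in Formula 11.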
4. THE LEARNING ALGORITHM
Given a set of named entity mentions M recognized in the given document collection D, we want to estimate the parameters (i.e., meta-path weights wp for each meta-path p) that maximize the likelihood of observing these named entity mentions M in the document collection D. Thus, we want

    arg max_{wp} ∏_{(m,d)} P(m, d)    (13)

We have

    P(m, d) = ∑_{e∈E} P(m, d, e)    (14)

where P(m, d, e) is given in Formula 4. Then, we have

    arg max_{wp} ∏_{(m,d)} ∑_{e} η·P(e)·P(d|e) = arg max_{wp} ∏_{(m,d)} ∑_{e} P(e)·P(d|e)    (15)

Since this objective function is in a product-of-sums form that is difficult to optimize directly, we define a hidden random variable π(m, d, e) for each triple (m, d, e) to simplify its form:

    π(m, d, e) = 1 if mention m in d refers to e, and 0 otherwise.    (16)

Then our optimization function can be written as

    arg max_{wp} ∏_{(m,d)} ∏_{e} (P(e)·P(d|e))^{π(m,d,e)}    (17)
Now we can apply the expectation-maximization (EM) method iteratively to optimize this objective function. In the initialization step, we assume some initial values of the parameters (i.e., we give initial values to the meta-path weight vector W).

E-step. In the expectation step, using the current values of the parameters, we find the expected values of the hidden variables via

    E(π(m, d, e)) = P(e|m, d) = P(m, d, e) / P(m, d) = P(m, d, e) / ∑_{e′} P(m, d, e′)    (18)

As in Formula 4, entity e is defined as a candidate entity for entity mention m. Therefore, for each entity mention m in a document d, we maintain a candidate entity set and assume the probability of linking to any other entity to be 0. Thus, this expression can be calculated by iterating over each candidate entity in the candidate entity set for each given mention-document tuple.

M-step. In the maximization step, we use the values f(m, d, e) = E(π(m, d, e)) calculated in the E-step, and find the parameters wp that maximize the following function:

    ∏_{(m,d),e} (P(e)·P(d|e))^{f(m,d,e)} = ∏_{(m,d),e} (P(e))^{f(m,d,e)} · ∏_{(m,d),e} (P(d|e))^{f(m,d,e)}    (19)
We can see that the first product ∏_{(m,d),e} (P(e))^{f(m,d,e)} does not depend on the parameters wp. Therefore, we just need to find the optimal parameters wp that maximize

    ∏_{(m,d),e} (P(d|e))^{f(m,d,e)}    (20)

Taking the logarithm of the above objective function, we get

    J = ∑_{(m,d),e} f(m, d, e) · ln P(d|e)    (21)
Algorithm 1 The Learning Algorithm
Input: Heterogeneous information network G, named entity mentions M, document collection D, and entity set E.
Output: The meta-path weight vector W.
1:  for each meta-path p do
2:      Initialize the weight wp = 0
3:  end for
4:  repeat
5:      E-step: update E(π(m, d, e)) by Formula 18
6:      M-step:
7:      W0 = W
8:      t = 1
9:      repeat
10:         for each meta-path p do
11:             Update wp(t) by Formula 23
12:         end for
13:         Normalize wp(t) to satisfy ∑_{p} wp(t) = 1
14:         t = t + 1
15:     until objective function J (Formula 22) converges
16:     W = W(t−1)
17: until meta-path weight vector W stabilizes within some threshold
By substituting Formula 9 and Formula 12, the objective function of Formula 21 can be derived as

    J = ∑_{(m,d),e} f(m, d, e) ∑_{v} ln( θ·∑_{p} wp·Pe(v|p) + (1 − θ)·Pg(v) )    (22)

We use a gradient descent approach to solve this optimization problem. The basic idea is to find the direction (gradient) in which the objective function climbs, and to make a small step in that direction by iteratively updating the meta-path weights wp in the vector W. Specifically, it is an iterative algorithm with the updating formula

    wp^(t) = wp^(t−1) + α · (∂J/∂wp)|_{wp = wp^(t−1)}    (23)
where α is the learning rate, which determines the step size in the ascent direction and is usually set to a sufficiently small number to guarantee the increase of the objective function J. The partial derivative with respect to wp can be derived as

    ∂J/∂wp = ∑_{(m,d),e} f(m, d, e) ∑_{v} θ·Pe(v|p) / P(v|e)    (24)
After each iteration of updating the meta-path weights using Formula 23, we normalize them to satisfy the constraint ∑_{p} wp = 1. The learning algorithm proposed in our model SHINE is summarized in Algorithm 1. Overall, it is an iterative algorithm based on the expectation-maximization (EM) method. The optimization of the meta-path weights wp contains an inner loop of gradient descent (lines 9-15). This learning algorithm automatically learns the weights of the meta-paths by maximizing the likelihood of observing the named entity mentions M in the given document collection D without requiring any training data, which makes SHINE unsupervised.
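Putting the pieces together, the EM procedure can be sketched as follows. This is a simplified single-file sketch, not the authors' implementation: the precomputed Pe(v|p) tables, candidate sets, popularity values, and toy hyperparameters are all assumptions, and the weights are initialized uniformly as one reasonable choice.

```python
def em_learn(mentions, paths, pe_given_path, p_pop, p_generic,
             theta=0.2, alpha=3e-6, em_iters=5, gd_iters=20):
    """Sketch of the EM learning loop. `mentions` is a list of
    (doc_objects, candidate_entities); `pe_given_path[(e, p)]` maps an
    object v to Pe(v | p), assumed precomputed by meta-path constrained
    random walks; `p_pop` gives P(e); `p_generic` gives Pg(v)."""
    w = {p: 1.0 / len(paths) for p in paths}      # uniform start

    def p_v_given_e(v, e):                        # Formulas 9 and 12
        pe = sum(w[p] * pe_given_path.get((e, p), {}).get(v, 0.0)
                 for p in paths)
        return theta * pe + (1.0 - theta) * p_generic.get(v, 0.0)

    def p_d_given_e(doc, e):                      # product over objects
        prob = 1.0
        for v in doc:
            prob *= p_v_given_e(v, e)
        return prob

    for _ in range(em_iters):
        # E-step (Formula 18): posterior over each mention's candidates.
        posts = []
        for doc, cands in mentions:
            joint = {e: p_pop[e] * p_d_given_e(doc, e) for e in cands}
            z = sum(joint.values()) or 1.0
            posts.append({e: j / z for e, j in joint.items()})
        # M-step: gradient steps on J (Formulas 22-24), then normalize.
        for _ in range(gd_iters):
            grad = {p: 0.0 for p in paths}
            for (doc, cands), post in zip(mentions, posts):
                for e in cands:
                    for v in doc:
                        denom = p_v_given_e(v, e)
                        if denom <= 0.0:
                            continue
                        for p in paths:
                            grad[p] += (post[e] * theta *
                                        pe_given_path.get((e, p), {}).get(v, 0.0)
                                        / denom)
            for p in paths:                       # Formula 23 update
                w[p] = max(w[p] + alpha * grad[p], 0.0)
            z = sum(w.values()) or 1.0
            w = {p: w[p] / z for p in paths}
    return w
```

The clamp to zero and the renormalization mirror the model's constraints that weights are non-negative and sum to one.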
We now analyze the time complexity of this learning algorithm. Formally, for the inner gradient descent algorithm, the time complexity is O(t1·|M|·|Em|·|Vd|·|W|), where t1 is the number of iterations, |M| is the number of entity mentions in M, |Em| is the number of candidate entities for entity mention m, |Vd| is the number of objects involved in the document d where mention m appears, and |W| is the number of meta-paths. The time complexity of the whole learning algorithm is O(t(t1·|M|·|Em|·|Vd|·|W| + |M|·|Em|·|Vd|·|W|)) = O(t·t1·|M|·|Em|·|Vd|·|W|), where t is the number of iterations of the EM algorithm. Therefore, the inner gradient descent algorithm consumes most of the running time of the whole learning algorithm. Though we do not know an upper bound on the number of iterations this EM learning algorithm may run until it terminates, in our experiments we observe that it converges quickly and typically takes only a few iterations. Moreover, as |Em|, |Vd|, and |W| are usually small constants, the running time of this learning algorithm and the inner gradient descent algorithm is linear in the number of entity mentions in M, which is confirmed by our experiments in Section 5.3. When the number of entity mentions in M is enormous, our learning algorithm becomes somewhat expensive. In that case, we could use the stochastic gradient descent method, which is very effective for large-scale learning problems: it samples a subset of entity mentions at each iteration and updates the parameters wp on the basis of these sampled entity mentions only [1]. Then the running time of our learning algorithm is linear in the number of sampled entity mentions. Our probabilistic model is flexible and can be applied directly to arbitrary heterogeneous information networks.
For example, given another bibliographic network, we may have a new object type (e.g., organizations (ORG)) and links between authors and organizations via the relations isAffiliatedWith and isAffiliatedWith−1. To leverage this affiliation information to link entities with this new bibliographic network, we could simply add some new meta-paths (such as A-ORG and A-P-A-ORG) into the meta-path set used in our model. Then, our model can automatically learn the relative importance of these new meta-paths. For linking entities with other types of heterogeneous information networks, we just need to change the meta-path set used in our model. For example, if we want to link actor name mentions in Web text with the IMDb network, a set of meta-paths starting from the actor type in the IMDb network could be provided as the input of this model, such as Ac-M-Ac, Ac-M-Ac-M-Ac, Ac-M-G-M-Ac, Ac-M-D-M-Ac, Ac-M-G, Ac-M-Ac-M-G, Ac-M-D-M-G, Ac-M-K, Ac-M-Ac-M-K, Ac-M-G-M-K, Ac-M-D-M-K, Ac-M-D, Ac-M-Ac-M-D, Ac-M-G-M-D. Subsequently, we could leverage our learning algorithm to learn the weights of these meta-paths for this new entity linking task.

5. EXPERIMENTAL STUDY

To evaluate the effectiveness and efficiency of our model SHINE, we present a thorough experimental study in this section. We firstly describe the experimental setting in Section 5.1 and then study the effectiveness of SHINE in Section 5.2. In Section 5.3, we evaluate the efficiency and scalability of SHINE. In Section 5.4, we study the impact of parameters on the performance of SHINE. Lastly, we investigate the meta-path weights learned by our learning algorithm. All the programs were implemented in Java and all the experiments were conducted on a server (four 2.67GHz CPU cores, 48GB memory) with 64-bit Windows. In this paper, we do not utilize parallel computing techniques and consider only a single-machine implementation.

5.1 Experimental setting

To the best of our knowledge, there is no publicly available benchmark data set for the task of entity linking with a HIN. Thus, we create a gold standard data set for this task. Since the annotation task for entity linking with a HIN consists of disambiguating the entities that would be linked with in the heterogeneous information network, generating test Web documents that pertain to the same domain as the information network, detecting named entity mentions in them, and identifying their corresponding mapping entities existing in the network, the annotation task is difficult and very time consuming. Therefore, we create just one data set for evaluation.

In our experiments, we adopt the DBLP bibliographic network, which has been introduced in Section 2.1, as the underlying heterogeneous information network. We downloaded the March 2013 version of the DBLP data set and built the DBLP network according to the network schema introduced in Figure 2(a). This DBLP network contains over 1.2M authors, 2.1M papers, and 7K venues (conferences/journals). The terms in the paper titles are filtered by a stop word list of size 667 and stemmed by the Porter Stemmer3, which converts the morphological variation of each word to its root form. We finally obtained around 408K terms. According to our task setting, the entities in the network which would be linked with should be disambiguated. The DBLP network has some highly ambiguous author names (such as "Wei Wang", "Eric Martin", etc.) that have been disambiguated (i.e., it has been determined which author names in publication records refer to the same author entity), and these ambiguous names are followed by a space character and a four-digit number (e.g., "Wei Wang 0010" and "Eric Martin 0001") to uniquely represent each distinct author [16]. In addition, we combined the DBLP network with the author disambiguation results from the publicly available data set used in [29], which contains 110 author names and their gold standard disambiguation results, to create a partially disambiguated DBLP network.

To illustrate the ability of our model, we focus on Web documents that pertain to the domain of computer science. Given any Web document, one could develop a highly accurate classifier to predict whether it pertains to the domain of computer science. Since the main focus of this paper is to investigate the effectiveness of our model for entity linking, we consider developing such classifiers an orthogonal effort to our task, and opted for querying a Web search engine (i.e., Google) to generate the test document collection D. We formed the Web search queries by including the ambiguous author names, as well as some domain representative phrases (such as "computer science", "database", "data mining", etc.). Each returned Web document, along with all candidate author entities with respect to the ambiguous author name in this document, is presented to annotators, and the documents which contain ambiguous author names referring to the entities existing in the DBLP network are collected. This yields a collection of 709 Web documents, each of which has one author name mention that needs to be linked.

To generate the candidate author entities for each author name mention, we use a method based on string comparison between the names of the author mention and the author entity. All author entities in the DBLP network that satisfy the predefined rules are extracted as the candidate entities. The rules used in this paper include: (1) the two author names exactly match; (2) the two author names have the same first name and last name, and one of them has no middle name (e.g., Richard Muntz and Richard R. Muntz) or one middle name is the other middle name's initial (e.g., Michael J. Jordan and Michael Jeffrey Jordan).

For each test HTML Web document, we firstly extracted the full text of the article and removed the author name mention itself. As we assume that each Web document d consists of various multi-type objects v from the heterogeneous information network, we recognized objects of author type and objects of venue type from DBLP in it using a dictionary-based exact matching method. We identified objects of year type using regular expressions. All remaining terms in the document (after removing all punctuation symbols) are filtered by a stop word list and stemmed by the Porter Stemmer. Lastly, we regarded these stemmed terms as the object set of term type.

In Section 5.4, we evaluate how accuracy changes by varying θ from 0.1 to 0.9 on our data set. The parameter θ is set to 0.2 in the other experiments. The learning rate α in Formula 23 decides the step size in the ascent direction. When α gets too big, the gradient descent algorithm fails to converge. α is set to 0.000003 in all experiments. The parameter λ in Formula 6 is set to 0.2 in all experiments. To evaluate the performance of SHINE, in this paper we adopt the evaluation measure accuracy, which is used in most work on entity linking [3, 4, 7, 21, 10, 23, 22, 24]. The accuracy is calculated as the number of correctly linked entity mentions divided by the total number of all mentions. All the operations introduced in this subsection are regarded as preprocessing.

3 http://tartarus.org/martin/PorterStemmer/
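The two candidate-generation rules above can be sketched as a small name-matching routine. This is an illustrative reading of the rules, not the authors' exact matching code; it assumes simple "First [Middles] Last" name strings:

```python
def parse_name(name):
    """Split a name into (first, middle parts, last); assumes at
    least a first and last token."""
    parts = name.split()
    return parts[0], parts[1:-1], parts[-1]

def is_candidate(mention, entity):
    """Rule (1): exact match. Rule (2): same first and last name,
    and either one side has no middle name, or one middle name is
    the initial of the other (e.g., "J." vs. "Jeffrey")."""
    if mention == entity:
        return True
    f1, m1, l1 = parse_name(mention)
    f2, m2, l2 = parse_name(entity)
    if f1 != f2 or l1 != l2:
        return False
    if not m1 or not m2:                  # one name has no middle part
        return True
    if len(m1) == 1 and len(m2) == 1:
        a, b = m1[0].rstrip("."), m2[0].rstrip(".")
        return a == b[0] or b == a[0]     # one is the other's initial
    return False

# is_candidate("Richard Muntz", "Richard R. Muntz")          -> True
# is_candidate("Michael J. Jordan", "Michael Jeffrey Jordan") -> True
# is_candidate("Wei Wang", "Wei Chen")                        -> False
```

A production matcher would also need to handle suffixes, hyphenated surnames, and multi-token first names, which this sketch deliberately ignores.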
Table 4: Experimental results of baseline VSim with different object type sets

Object type set              # correctly linked    Accuracy
Coauthor                     547                   0.772
Venue                        430                   0.606
Term                         576                   0.812
Year                         256                   0.361
Coauthor+Venue               574                   0.810
Coauthor+Term                598                   0.843
Venue+Term                   578                   0.815
Coauthor+Venue+Term          602                   0.849
Coauthor+Venue+Term+Year     604                   0.852
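The VSim baseline behind Table 4 ranks candidates by cosine similarity between frequency vectors; a minimal sketch follows. The profile contents and entity labels are hypothetical toy data, not drawn from DBLP:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse frequency vectors (dicts)."""
    dot = sum(c * v.get(k, 0) for k, c in u.items())
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def vsim_link(context_objects, profiles):
    """Return the candidate whose profile vector is most similar to the
    mention's context vector (the object extraction that produces these
    lists is assumed to happen elsewhere)."""
    context = Counter(context_objects)
    return max(profiles, key=lambda e: cosine(context, Counter(profiles[e])))

profiles = {
    "Wei Wang (UCLA)":  ["SIGMOD", "SIGMOD", "Richard R. Muntz", "mining"],
    "Wei Wang (Fudan)": ["VLDB", "query", "processing"],
}
best = vsim_link(["SIGMOD", "mining", "Richard R. Muntz"], profiles)
# best == "Wei Wang (UCLA)"
```

Because both vectors are raw frequency counts, the baseline treats all object types uniformly, which is what the per-type ablation in Table 4 probes.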
Table 5: Experimental results of all approaches

Approach        # correctly linked    Accuracy
POP             345                   0.487
VSim            604                   0.852
SHINE4-eom      651                   0.918
SHINE4          655                   0.924
SHINEall-eom    660                   0.931
SHINEall        668                   0.942
5.2 Effectiveness study

In this subsection, we study the effectiveness of our model SHINE under different configurations, and compare it with some baselines.

5.2.1 Baselines

Since no previous work deals with the task of entity linking with a HIN, we created two baselines in this paper. The first one is the entity popularity-based method (POP). The feature of entity popularity has been found to be very useful in previous entity linking systems [12, 11, 21, 22, 23, 24]. In this POP baseline method, we used our entity popularity model (Formula 7) introduced in Section 3.1 to estimate the popularity of each candidate entity. The entity with the highest popularity among all the candidate entities for each entity mention is considered the mapping entity for this entity mention.

The second baseline is the vector similarity-based method (VSim). In this VSim method, we constructed a context vector for each author mention and a profile vector for each candidate author entity. Specifically, for each author mention, we firstly obtained the object sets of different types which compose the document where this author mention appears. To compute the bag-of-words context vector, we added each object in these object sets of different types, together with its frequency, into the vector. For each candidate author entity, we obtained all her publication records from our partially disambiguated DBLP network. To compute the bag-of-words profile vector, we added the objects of different types (i.e., her coauthors, venues, title terms, and publication years) in her publications, together with their frequencies, into the vector. Then we measure the cosine similarity of these two vectors for each pair of author mention and candidate entity. Finally, the entity with the highest similarity is considered the mapping entity for the author mention. To analyze the effectiveness of different object types for the baseline VSim, we show the accuracy and the number of correctly linked author mentions obtained by VSim with different object type sets in Table 4. It can be seen from Table 4 that every object type has a positive impact on the performance of VSim, and with the combination of all object types VSim obtains the best result.

5.2.2 Our model SHINE

The set of meta-paths used in our experiments has been shown in Table 3. Among them, there are four length-2 meta-paths and six length-4 meta-paths. To analyze the effectiveness of different meta-path sets, we refer to our model that utilizes just the four length-2 meta-paths as SHINE4, and refer to our model that utilizes all ten meta-paths as SHINEall. To analyze the contribution of our entity object model (Formula 9) alone to SHINE's performance, we substitute the uniform popularity model (Formula 5) for the entity popularity model (Formula 7) in the methods SHINE4 and SHINEall, and refer to these models as SHINE4-eom and SHINEall-eom, respectively.

The experimental results of all approaches over our data set are shown in Table 5. From the results, we can see that all configurations of our proposed probabilistic model significantly outperform the two baseline methods (i.e., POP and VSim), which demonstrates the effectiveness of our model. Also, we can see that the methods SHINEall-eom and SHINE4-eom that just leverage the entity object model achieve high accuracy (i.e., 0.931 and 0.918, respectively) over our data set, which shows that our entity object model alone is very effective for this linking task. Additionally, we can see that SHINEall and SHINE4 outperform the models SHINEall-eom and SHINE4-eom, respectively. This means that our entity popularity model is more effective than the uniform popularity model when combined with our entity object model in dealing with the task of entity linking with a HIN. Moreover, it can also be seen from Table 5 that the methods SHINEall and SHINEall-eom that leverage all ten meta-paths outperform SHINE4 and SHINE4-eom, respectively, which utilize just the four length-2 meta-paths. The more useful meta-paths, the better the entity linking accuracy, which is consistent with our intuition, since our entity object model can obtain more related knowledge about the candidate entity from the information network by leveraging more useful meta-paths.

5.3 Efficiency and scalability study

In this subsection, we study the scalability of SHINE using different subsets of our entity mention set. Figure 4(a) plots the average running time for one iteration of the EM algorithm and one iteration of the inner gradient descent algorithm with varied sizes of the entity mention set. From the results, we can see that the average running time for one iteration of both the EM algorithm and the inner gradient descent algorithm is roughly linear in the number of entity mentions in the data set, which is consistent with the time complexity analysis of Algorithm 1 described in Section 4. This evaluation demonstrates the scalability of SHINE. Figure 4(b) depicts the accuracy of SHINEall with varied sizes of the entity mention set. We can see that our proposed model SHINEall achieves relatively stable and high accuracy with different sizes of the data set, which demonstrates the robustness of our model. SHINE can automatically learn the meta-path weights by maximizing the likelihood of observing named entity mentions in the given document collection. Over various entity mention sets, although the learned meta-path weights are different, the final en-