An Approach to Web-scale Named-Entity Disambiguation
L. Sarmento (LIACC/FEUP, Portugal), A. Kehlenbeck (Google Inc., New York, NY, USA), E. Oliveira (LIACC/FEUP, Portugal), L. Ungar (Univ. of Pennsylvania/CS, Philadelphia, PA, USA)
June 28, 2009
Outline
- Introduction
- A Clustering Approach to NED
- Vector Comparison and Clustering
- Evaluation and Experimental Set-up
- Results and Analysis
- Take Home Message
The Problem
- Named-Entity Disambiguation (NED): deciding whether occurrences of the same name (e.g. "Paris") in different documents refer to the same entity or to different entities:
  - Paris (France) vs. Paris (Hilton) vs. Paris (Troy) vs. ...
- Web-scale NED: disambiguating the whole Web
  - very important for the quality of Web IR / IE
  - different from "traditional" NED (small collections):
    - many more entities at stake (unknown number for each name!)
    - skewed entity distribution (Paris, France: 75% of mentions)
    - terabyte collection (scalability is fundamental)
- These challenges have not been considered simultaneously
Just to illustrate...
  query              hit count (x10^6)   %
  paris              583                 100
  paris france       457                 78.4
  paris hilton       58.2                9.99
  paris greek troy   4.130               0.71
  paris mo           1.430               0.25
  paris tx           0.995               0.17
  paris sempron      0.299               0.04
- Using a 1 KB vector to represent each instance: 583 x 10^6 mentions x 1 KB ≈ 583 GB just for "Paris"!
Research Questions
- We will follow a clustering approach to NED:
  - Can we scale NED to the whole Web?
  - Is NED easier or harder on the Web?
  - Do we benefit from using much larger amounts of data?
  - Does the "more data, better results" paradigm apply?
NED as Clustering
- Let C = {d_1, d_2, ..., d_k} be a document collection
- Let m_ij represent a mention:
  - the occurrence of name n_i in document d_j
- Let M_all = {m_11, m_21, ..., m_ik} be the set of all mentions in C
  - on the Web, |M_all| ≫ 10^9
- Let e_j represent an entity
- Achieving NED by clustering:
  - partition M_all into disjoint clusters of mentions, M_1, M_2, ..., M_n
  - 1-to-1 mapping between clusters, M_i, and entities, e_j
Feature Vector Generation
- Our assumption: mentions of name n_i can be disambiguated using information about names co-occurring at document level
  - "Amsterdam": {"Netherlands", "Utrecht", "Rijksmuseum", ...}
  - "Amsterdam": {"Ian McEwan", "Amazon", ...}
- Let N(d_k) be the set of names found in document d_k
- The mention of name n_j in document d_k can be described by a feature vector m_jk (see the sketch below):

  m_{jk} = [(n_1, v_1), (n_2, v_2), (n_3, v_3), \ldots, (n_i, v_i)]   (1)

  with:
  - n_i ∈ N(d_k) \ {n_j}
  - v_i a weight derived from frequency (e.g. TF-IDF)
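A minimal Python sketch of this feature-vector construction, assuming the name annotations are already available as one list of names per document. The TF-IDF-style weighting and the helper name build_mention_vectors are illustrative choices, not necessarily the exact weighting used in the experiments.

```python
import math
from collections import Counter

def build_mention_vectors(docs, target_name):
    """Build one co-occurrence feature vector m_jk per mention of target_name.

    docs: a list of documents, each given as the list of names annotated in
    that document (a stand-in for N(d_k)). The weighting below is a simple
    TF-IDF variant; any frequency-based weighting could be substituted.
    """
    # Document frequencies over the documents that mention the target name,
    # used for the IDF term.
    relevant = [names for names in docs if target_name in names]
    df = Counter(name for names in relevant for name in set(names))

    vectors = []
    for names in relevant:
        # Term frequencies of the co-occurring names: n_i in N(d_k) \ {n_j}.
        tf = Counter(n for n in names if n != target_name)
        vec = {n: (1 + math.log(c)) * math.log(1 + len(relevant) / df[n])
               for n, c in tf.items()}
        vectors.append(vec)            # sparse m_jk stored as {name: weight}
    return vectors
```

For example, build_mention_vectors([["Amsterdam", "Netherlands", "Utrecht"], ["Amsterdam", "Ian McEwan", "Amazon"]], "Amsterdam") yields two disjoint sparse vectors, one per facet of the name.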
Clustering Overview
- Perform name annotation in C to obtain C_annot
- Extract all names from each document d_k in C_annot
- Generate the feature vectors m_jk
- Group feature vectors by name:

  M(n_j) = \{m_{j1}, m_{j2}, \ldots, m_{jx}\}   (2)

- Compare the vectors in each set M(n_j) according to a given comparison strategy and similarity metric sim(m_nj, m_nk), e.g. cosine or Jaccard (see the sketch below)
- Apply your "favourite" clustering algorithm to each M(n_j), using the vector-similarity information
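The slides mention cosine or Jaccard as candidate metrics; below is a minimal cosine similarity over the sparse {name: weight} vectors produced above, as one possible choice for sim(m_nj, m_nk).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {name: weight}."""
    if not u or not v:
        return 0.0
    # Iterate over the smaller vector for the dot product.
    small, large = (u, v) if len(u) <= len(v) else (v, u)
    dot = sum(w * large.get(name, 0.0) for name, w in small.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)
```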
Our Approach
- Designed to be run on a Map-Reduce platform
- Deals with specific characteristics of web-derived datasets:
  1. the mention distribution is highly skewed, and is dominated by the one or two most popular entities
  2. the number of entities onto which the set of mentions M(n_j) should be mapped is not known
- We propose a graph-based clustering approach for each name n_n (see the sketch below):
  - compute pairwise similarities between the vectors in M(n_n)
  - build the link graph G(n_n):
    - m_nj and m_nk are linked in G(n_n) if sim(m_nj, m_nk) ≥ s_min
  - find the connected components (CCs) in G(n_n)
  - the CCs (should) represent the clusters we seek
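A single-machine sketch of this graph-based step under the naive all-against-all comparison (the actual system runs on Map-Reduce): link vectors whose similarity reaches s_min and read the clusters off the connected components via a small union-find. Function and parameter names are mine.

```python
def cluster_by_connected_components(vectors, sim, s_min):
    """Link mentions whose similarity is >= s_min and return the connected
    components of the resulting graph as clusters (lists of vector indices).
    The all-against-all comparison shown here is the naive O(n^2) variant
    discussed on the next slide."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if sim(vectors[i], vectors[j]) >= s_min:
                union(i, j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

With the vectors and cosine function from the previous sketches, cluster_by_connected_components(M_n, cosine, s_min=0.4) reproduces the single-partition case (the 0.4 threshold is only an example value).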
Practical Issues
- The only parameter of this approach is s_min
  - no need to set the target number of clusters to be produced
- Naive (all-against-all) strategies for building G(n_n) are O(n^2)
- But a sufficiently connected link graph G_suf(n_n) can be computed in

  \tilde{O}\left( \frac{n \cdot |C| \cdot k_{pos}}{1 - p_{fn}} \right)   (3)

- How? (see the sketch below)
  - each mention is compared to other mentions only until k_pos above-threshold similar mentions are found
  - G_suf(n_n) will have approximately the same CCs as G(n_n)
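A hedged sketch of this early-stopping idea behind G_suf(n_n): each mention stops comparing as soon as k_pos above-threshold neighbours have been found. The adjacency-list representation and the default k_pos value are illustrative assumptions.

```python
def build_sufficient_link_graph(vectors, sim, s_min, k_pos=3):
    """Approximate link graph G_suf: for each mention, stop comparing as soon
    as k_pos above-threshold neighbours have been found. Returns an adjacency
    list of edges; connected components can then be computed as before."""
    edges = [[] for _ in vectors]
    for i in range(len(vectors)):
        found = 0
        for j in range(len(vectors)):
            if i == j:
                continue
            if sim(vectors[i], vectors[j]) >= s_min:
                edges[i].append(j)
                found += 1
                if found >= k_pos:
                    break          # enough links to place mention i in a component
    return edges
```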
Additional Practical Issues
- For "popular" names, M(n_n) will not fit in a single machine:

    name        Google Hits (x10^6)   # Wiki Entities
    Paris       583                   90
    Amsterdam   185                   35
    Jaguar      73.4                  34
    Pluto       13.8                  25

- Even if we had enough RAM, processing these very frequent names would require much more time than processing less frequent names
  - extremely long tails in the overall processing time
- Solution: break M(n_n) into smaller partitions and distribute them over all machines (see the sketch below)
- This generates several independent clustering problems
  - a second clustering step is required to merge the clusters obtained from each subset of M(n_n)
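A minimal sketch of the partitioning, assuming (as in the Set-Up slide) partitions of at most 3000 items; each partition becomes an independent Step 1 clustering job, e.g. keyed by (name, partition index) in the Map-Reduce run. The helper names are mine.

```python
def partition_mentions(vectors, max_items=3000):
    """Split M(n) into partitions of at most max_items vectors so that each
    partition fits comfortably on a single worker; each partition is clustered
    independently in Step 1, and Step 2 later merges the resulting clusters."""
    return [vectors[i:i + max_items] for i in range(0, len(vectors), max_items)]

def shard_key(name, mention_index, max_items=3000):
    """Illustrative Map-Reduce shard key: mentions of the same name are spread
    over several reduce shards, e.g. ('Paris', 0), ('Paris', 1), ..."""
    return (name, mention_index // max_items)
```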
Clustering Step 2: Merging Clusters from Partitions
- Typically, after the first stage of clustering, we have:
  - several larger clusters (few dominant entities)
  - many smaller clusters (non-dominant entities + fragments of dominant entities)
- For each name, divide the Step 1 clusters into two groups:
  - Big Clusters, C_big (the 10% biggest clusters from Step 1)
  - Small Clusters, C_small (all others)
- Re-clustering strategy based on Step 1 cluster centroids (sketched below):
  1. assign Small Clusters to Big Clusters, using nearest-neighbour assignment
  2. merge Small Clusters to create Medium Clusters, comparing all-against-all
  3. merge Big and Medium Clusters based on only a few top features
  4. repeat 2 and 3 to reduce fragmentation
- For each name, this step can be run on a single machine
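A hedged sketch of substep 1 of this re-clustering: centroids are plain averages of the Step 1 cluster vectors, and each Small Cluster is attached to its nearest Big Cluster only if the centroid similarity reaches a threshold. The 10% split follows the slide; reusing s_min as the threshold and the helper names are my assumptions.

```python
def centroid(cluster_vectors):
    """Average the sparse {name: weight} vectors of one Step 1 cluster."""
    c = {}
    for vec in cluster_vectors:
        for name, w in vec.items():
            c[name] = c.get(name, 0.0) + w
    return {name: w / len(cluster_vectors) for name, w in c.items()}

def assign_small_to_big(clusters, sim, s_min):
    """Step 2, substep 1: nearest-neighbour assignment of the Small Clusters
    to the 10% biggest Step 1 clusters; unassigned Small Clusters are returned
    for the later small-vs-small and big-vs-medium merging substeps."""
    clusters = sorted(clusters, key=len, reverse=True)
    n_big = max(1, len(clusters) // 10)
    big, small = clusters[:n_big], clusters[n_big:]
    big_centroids = [centroid(c) for c in big]

    leftovers = []
    for c in small:
        cen = centroid(c)
        scores = [sim(cen, bc) for bc in big_centroids]
        best = max(range(len(big)), key=lambda i: scores[i])
        if scores[best] >= s_min:
            big[best].extend(c)        # merge the mentions into the big cluster
        else:
            leftovers.append(c)
    return big, leftovers
```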
Gold Standard for Evaluating NED
- We used Wikipedia articles as a gold standard for NED
  - each article can be related to one entity / concept
- Let W_seed(n_j) be the set of articles found for name n_j
  - n_j can usually be easily identified from the article title
  - if |W_seed(n_j)| > 1, then n_j is ambiguous
  - each element of W_seed(n_j) refers unambiguously to one entity
- W_seed(n_j) can be expanded with documents that "unambiguously" refer to entities mentioned using name n_j:
  - for each page in W_seed(n_j) we find all its immediate neighbours in the web link graph
  - linked pages containing n_j should refer to the same entity e_k described by the article they are linked to
- We created Gold Clusters G(n_j) for 52,000 ambiguous names (see the sketch below)
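A heavily hedged sketch of this gold-standard construction for one name, assuming the Wikipedia articles, the web link graph and the per-page name annotations are available as in-memory dictionaries; wiki_articles, inlinks and page_names are hypothetical stand-ins for the real data sources, and the substring test on titles is a crude simplification.

```python
def build_gold_clusters(name, wiki_articles, inlinks, page_names):
    """Sketch of the gold-standard construction for one ambiguous name.

    wiki_articles: {article_id: title} for Wikipedia articles (one per entity);
    inlinks: {article_id: set of web page ids linking to that article};
    page_names: {page_id: set of names annotated in that page}.
    """
    # W_seed(n_j): articles whose title contains the ambiguous name.
    seeds = [a for a, title in wiki_articles.items() if name in title]
    if len(seeds) <= 1:
        return {}                      # name is not ambiguous, nothing to do

    gold = {}
    for article in seeds:
        # Expand with immediate neighbours in the web link graph that also
        # mention the name: they are assumed to refer to the same entity.
        members = {p for p in inlinks.get(article, set())
                   if name in page_names.get(p, set())}
        gold[article] = members
    return gold
```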
Measuring Clustering Performance (I)
- "Purity": the entropy of each test cluster t_x over all |G| true classes,

  e_t(t_x) = -\sum_{y=0}^{|G|} \frac{i_{xy}}{I_t(x)} \ln\left(\frac{i_{xy}}{I_t(x)}\right)   (4)

  E_t(n_j) = \frac{\sum_{x=0}^{|T(n_j)|} |t_x| \cdot e_t(t_x)}{\sum_{x=0}^{|T(n_j)|} |t_x|}   (5)

- "Dispersion": the entropy of each gold class g_y over all |T| test clusters,

  e_g(g_y) = -\sum_{x=0}^{|T|} \frac{i_{xy}}{|g_y|} \ln\left(\frac{i_{xy}}{|g_y|}\right)   (6)

  E_g(n_j) = \frac{\sum_{y=0}^{|G(n_j)|} |g_y| \cdot e_g(g_y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}   (7)

- Ideally, both E_t and E_g should be close to zero (both are computed in the sketch below)
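A sketch of equations (4)-(7) on the contingency counts i_xy (the number of mentions of gold class g_y that fall into test cluster t_x), with test clusters and gold classes given as sets of mention ids. The normalisers I_t(x) and |g_y| are taken over the gold-standard mentions actually present, which is one plausible reading of the formulas.

```python
import math

def entropy(counts):
    """Entropy -sum p*ln(p) of a list of counts (used for eqs. 4 and 6)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def purity_and_dispersion(test_clusters, gold_classes):
    """E_t (purity, eq. 5) and E_g (dispersion, eq. 7) for one name.
    test_clusters and gold_classes are lists of sets of mention ids."""
    # Contingency counts i_xy: gold mentions of class y inside test cluster x.
    i = [[len(t & g) for g in gold_classes] for t in test_clusters]

    # E_t: |t_x|-weighted average of the entropy of each test cluster (row).
    e_t = sum(len(t) * entropy(i[x]) for x, t in enumerate(test_clusters))
    e_t /= sum(len(t) for t in test_clusters)

    # E_g: |g_y|-weighted average of the entropy of each gold class (column).
    e_g = sum(len(g) * entropy([row[y] for row in i])
              for y, g in enumerate(gold_classes))
    e_g /= sum(len(g) for g in gold_classes)
    return e_t, e_g
```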
Measuring Clustering Performance (II)
- Mention recall for name n_j: for each gold cluster g_y,

  r_m(g_y) = \frac{I_g(y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}   (8)

  and the overall mention recall for n_j,

  R_m(n_j) = \frac{\sum_{y=0}^{|G(n_j)|} |g_y| \cdot r_m(g_y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}   (9)

- Entity recall, R_e(n_j):
  - measures the fraction of the entities included in the gold-standard clusters for n_j that are found in the test clusters
- Cluster ratio: C_rat = |T(n_j)| / |G(n_j)|
- For all names we compute the arithmetic averages of E_t, E_g, R_m and R_e (see the sketch below)
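A companion sketch for R_m, R_e and C_rat under a natural reading of equations (8)-(9), in which the size-weighted combination of per-class recalls reduces to the fraction of gold-standard mentions that end up in some test cluster; computing R_e as the fraction of gold entities with at least one recovered mention is my interpretation of the slide.

```python
def recall_and_crat(test_clusters, gold_classes):
    """Mention recall R_m (eqs. 8-9), entity recall R_e and the cluster ratio
    C_rat = |T(n_j)| / |G(n_j)| for one name; test_clusters and gold_classes
    are lists of sets of mention ids."""
    clustered = set().union(*test_clusters) if test_clusters else set()
    total_gold = sum(len(g) for g in gold_classes)

    # R_m: fraction of gold-standard mentions that end up in some test cluster.
    found = sum(len(g & clustered) for g in gold_classes)
    r_m = found / total_gold if total_gold else 0.0

    # R_e: fraction of gold entities with at least one recovered mention.
    r_e = sum(1 for g in gold_classes if g & clustered) / len(gold_classes)

    c_rat = len(test_clusters) / len(gold_classes)
    return r_m, r_e, c_rat
```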
Set-Up
- Base collection, C_all: 1 billion name-annotated documents
- "Reference run" with a 1% sample of C_all:
  - repeat the complete NED procedure, slowly varying the parameter s_min
  - compute the corresponding values of E_t, E_g, R_m and R_e
- We performed NED over samples of sizes 0.5%, 2% and 5%
  - compared against the 1% sample at the closest value of E_t, i.e. at similar "purity" values
- Implementation using the Map-Reduce paradigm:
  - features weighted by TF-IDF
  - items were limited to 5000 features
  - partitions of 3000 items (RAM & load balancing)
  - after the 1st pass, clusters with fewer than 5 elements were filtered out (affects recall)
Results

  %@s_min    E_t      E_g      R_m (%)   R_e (%)   C_rat
  [email protected]    0.0003   0.0056   0.024     1.16      1.23
  [email protected]    0.0001   0.0085   0.055     1.74      1.82
  [email protected]    0.0042   0.0226   0.135     3.70      2.06
  [email protected]    0.0042   0.0312   0.294     5.43      3.27
  [email protected]    0.0103   0.0212   0.186     5.00      2.18
  [email protected]    0.0140   0.0797   0.912     12.4      6.91

- For similar E_t, s_min needs to be increased with sample size to reduce the impact of noise, but this increases fragmentation (E_g)
- C_rat increases with s_min and sample size
- Recall values seem very low, but we are also "sampling" items from the gold standard
But, are we improving with sample size?

  % vs %          r_m ratio   r_e ratio
  0.5% vs. 1.0%   2.28        1.5
  1.0% vs. 2.0%   2.17        1.48
  1.0% vs. 5.0%   4.9         2.48

- No!! The ratios are receding with size!
  - and become sublinear when going from 1.0% to 5.0%
- Some possible reasons:
  - the number of partitions for each name increases with the data
    - fragmentation, and hence the number of small clusters, increases
    - some small clusters are lost after the 1st pass and during the 2nd pass
  - the number of different entities increases with the data:
    - they might not be well represented in G(n_j)
    - G(n_j) might contain many entities that are not truly representative
More Fundamental Issues (after manual inspection)
- For very frequent names (e.g. "Amsterdam") there was a surprisingly high number of medium and large clusters for the dominant entities (e.g. the Dutch capital)
- Such clusters refer to autonomous scopes or facets of the entity:
  - Amsterdam:
    - as a world capital (co-occurring with "Paris", "New York", ...)
    - as a Dutch city (co-occurring with "Utrecht", "Rotterdam", ...)
    - as a publishing centre ("Elsevier" or "Elsevier Science")
    - ...
  - Paris:
    - the French Revolution vs. nowadays
    - ...
- The number of "autonomous" facets (temporal, geographical, ...) increases with data and entity popularity
Ideally...
- We would like to merge these facets into a single cluster
  - but still keep the information about the facets
- Information about co-occurring names is not sufficient for this
- We need additional features (e.g. zip codes, phone numbers, other relevant words, document links, ...)
- Maybe a third clustering step needs to be considered
Conclusions - Take Home Message
- We proposed a Web-scale algorithm for NED
- But NED does NOT get easier with more data:
  - name co-occurrence information is not enough
  - multiple facets of the same entity emerge, which makes the problem significantly harder
Thank you!
- Questions & comments?
- We would like to thank the Google team in NYC for all the help and support, especially Casey Whitelaw and Nemanja Petrovic!