An Approach to Web-scale Named-Entity Disambiguation
L. Sarmento (LIACC/FEUP, Portugal), A. Kehlenbeck (Google Inc., New York, NY, USA), E. Oliveira (LIACC/FEUP, Portugal), L. Ungar (Univ. of Pennsylvania/CS, Philadelphia, PA, USA)
June 28, 2009
Outline
- Introduction
- A Clustering Approach to NED
- Vector Comparison and Clustering
- Evaluation and Experimental Set-up
- Results and Analysis
- Take Home Message
The Problem
- Named-Entity Disambiguation (NED): deciding whether occurrences of the same name (e.g. "Paris") in different documents refer to the same entity or to different entities:
  - Paris (France) vs. Paris (Hilton) vs. Paris (Troy) vs. ...
- Web-scale NED: disambiguating the whole Web
  - very important for the quality of Web IR / IE
  - different from "traditional" NED (small collections):
    - many more entities at stake (unknown number for each name!)
    - skewed entity distribution (Paris, France: 75% of mentions)
    - terabyte collection (scalability is fundamental)
- These challenges have not been considered simultaneously
Just to illustrate...
  query              hit count (x10^6)   %
  paris              583                 100
  paris france       457                 78.4
  paris hilton       58.2                9.99
  paris greek troy   4.130               0.71
  paris mo           1.430               0.25
  paris tx           0.995               0.17
  paris sempron      0.299               0.04
- Using a 1 KB vector to represent each instance: 583 x 10^6 mentions x 1 KB ≈ 583 GB just for "Paris"!
Research Questions
- We will follow a clustering approach to NED:
  - Can we scale NED to the whole Web?
  - Is NED easier or harder on the Web?
  - Do we benefit from using much larger amounts of data?
  - Does the "more data, better results" paradigm apply?
NED as Clustering
- Let C = {d_1, d_2, ..., d_k} be a document collection
- Let m_ij represent a mention:
  - the occurrence of name n_i in document d_j
- Let M_all = {m_11, m_21, ..., m_ik} be the set of all mentions in C
  - on the Web, |M_all| ≫ 10^9
- Let e_j represent an entity
- Achieving NED by clustering:
  - partition M_all into disjoint clusters of mentions, M_1, M_2, ..., M_n
  - 1-to-1 mapping between clusters, M_i, and entities, e_j
Feature Vector Generation
- Our assumption: mentions of name n_i can be disambiguated using information about names co-occurring at document level
  - "Amsterdam": {"Netherlands", "Utrecht", "Rijksmuseum", ...}
  - "Amsterdam": {"Ian McEwan", "Amazon", ...}
- Let N(d_k) be the set of names found in document d_k
- The mention of name n_j in document d_k can be described by a feature vector m_jk (see the sketch below):

  m_{jk} = [(n_1, v_1), (n_2, v_2), (n_3, v_3), \ldots, (n_i, v_i)]   (1)

  with:
  - n_i ∈ N(d_k) \ {n_j}
  - v_i a weight derived from frequency (e.g. TF-IDF)
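A minimal Python sketch of this feature-vector construction, assuming the name annotations are already available as one list of names per document. The TF-IDF-style weighting and the helper name build_mention_vectors are illustrative choices, not necessarily the exact weighting used in the experiments.

```python
import math
from collections import Counter

def build_mention_vectors(docs, target_name):
    """Build one co-occurrence feature vector m_jk per mention of target_name.

    docs: a list of documents, each given as the list of names annotated in
    that document (a stand-in for N(d_k)). The weighting below is a simple
    TF-IDF variant; any frequency-based weighting could be substituted.
    """
    # Document frequencies over the documents that mention the target name,
    # used for the IDF term.
    relevant = [names for names in docs if target_name in names]
    df = Counter(name for names in relevant for name in set(names))

    vectors = []
    for names in relevant:
        # Term frequencies of the co-occurring names: n_i in N(d_k) \ {n_j}.
        tf = Counter(n for n in names if n != target_name)
        vec = {n: (1 + math.log(c)) * math.log(1 + len(relevant) / df[n])
               for n, c in tf.items()}
        vectors.append(vec)            # sparse m_jk stored as {name: weight}
    return vectors
```

For example, build_mention_vectors([["Amsterdam", "Netherlands", "Utrecht"], ["Amsterdam", "Ian McEwan", "Amazon"]], "Amsterdam") yields two disjoint sparse vectors, one per facet of the name.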
Clustering Overview
- Perform name annotation in C to obtain C_annot
- Extract all names from each document d_k in C_annot
- Generate the feature vectors m_jk
- Group feature vectors by name:

  M(n_j) = \{m_{j1}, m_{j2}, \ldots, m_{jx}\}   (2)

- Compare the vectors in each set M(n_j) according to a given comparison strategy and similarity metric sim(m_nj, m_nk), e.g. cosine or Jaccard (see the sketch below)
- Apply your "favourite" clustering algorithm to each M(n_j), using the vector-similarity information
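The slides mention cosine or Jaccard as candidate metrics; below is a minimal cosine similarity over the sparse {name: weight} vectors produced above, as one possible choice for sim(m_nj, m_nk).

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {name: weight}."""
    if not u or not v:
        return 0.0
    # Iterate over the smaller vector for the dot product.
    small, large = (u, v) if len(u) <= len(v) else (v, u)
    dot = sum(w * large.get(name, 0.0) for name, w in small.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)
```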
Our Approach
- Designed to be run on a Map-Reduce platform
- Deals with specific characteristics of web-derived datasets:
  1. the mention distribution is highly skewed, and is dominated by the one or two most popular entities
  2. the number of entities onto which the set of mentions M(n_j) should be mapped is not known
- We propose a graph-based clustering approach for each name n_n (see the sketch below):
  - compute pairwise similarities between the vectors in M(n_n)
  - build the link graph G(n_n):
    - m_nj and m_nk are linked in G(n_n) if sim(m_nj, m_nk) ≥ s_min
  - find the connected components (CCs) in G(n_n)
  - the CCs (should) represent the clusters we seek
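A single-machine sketch of this graph-based step under the naive all-against-all comparison (the actual system runs on Map-Reduce): link vectors whose similarity reaches s_min and read the clusters off the connected components via a small union-find. Function and parameter names are mine.

```python
def cluster_by_connected_components(vectors, sim, s_min):
    """Link mentions whose similarity is >= s_min and return the connected
    components of the resulting graph as clusters (lists of vector indices).
    The all-against-all comparison shown here is the naive O(n^2) variant
    discussed on the next slide."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if sim(vectors[i], vectors[j]) >= s_min:
                union(i, j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

With the vectors and cosine function from the previous sketches, cluster_by_connected_components(M_n, cosine, s_min=0.4) reproduces the single-partition case (the 0.4 threshold is only an example value).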
Practical Issues
- The only parameter of this approach is s_min
  - no need to set the target number of clusters to be produced
- Naive (all-against-all) strategies for building G(n_n) are O(n^2)
- But a sufficiently connected link graph G_suf(n_n) can be computed in

  \tilde{O}\left( \frac{n \cdot |C| \cdot k_{pos}}{1 - p_{fn}} \right)   (3)

- How? (see the sketch below)
  - each mention is compared to other mentions only until k_pos above-threshold similar mentions are found
  - G_suf(n_n) will have approximately the same CCs as G(n_n)
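A hedged sketch of this early-stopping idea behind G_suf(n_n): each mention stops comparing as soon as k_pos above-threshold neighbours have been found. The adjacency-list representation and the default k_pos value are illustrative assumptions.

```python
def build_sufficient_link_graph(vectors, sim, s_min, k_pos=3):
    """Approximate link graph G_suf: for each mention, stop comparing as soon
    as k_pos above-threshold neighbours have been found. Returns an adjacency
    list of edges; connected components can then be computed as before."""
    edges = [[] for _ in vectors]
    for i in range(len(vectors)):
        found = 0
        for j in range(len(vectors)):
            if i == j:
                continue
            if sim(vectors[i], vectors[j]) >= s_min:
                edges[i].append(j)
                found += 1
                if found >= k_pos:
                    break          # enough links to place mention i in a component
    return edges
```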
Additional Practical Issues
- For "popular" names, M(n_n) will not fit in a single machine:

    name        Google Hits (x10^6)   # Wiki Entities
    Paris       583                   90
    Amsterdam   185                   35
    Jaguar      73.4                  34
    Pluto       13.8                  25

- Even if we had enough RAM, processing these very frequent names would require much more time than processing less frequent names
  - extremely long tails in the overall processing time
- Solution: break M(n_n) into smaller partitions and distribute them over all machines (see the sketch below)
- This generates several independent clustering problems
  - a second clustering step is required to merge the clusters obtained from each subset of M(n_n)
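A minimal sketch of the partitioning, assuming (as in the Set-Up slide) partitions of at most 3000 items; each partition becomes an independent Step 1 clustering job, e.g. keyed by (name, partition index) in the Map-Reduce run. The helper names are mine.

```python
def partition_mentions(vectors, max_items=3000):
    """Split M(n) into partitions of at most max_items vectors so that each
    partition fits comfortably on a single worker; each partition is clustered
    independently in Step 1, and Step 2 later merges the resulting clusters."""
    return [vectors[i:i + max_items] for i in range(0, len(vectors), max_items)]

def shard_key(name, mention_index, max_items=3000):
    """Illustrative Map-Reduce shard key: mentions of the same name are spread
    over several reduce shards, e.g. ('Paris', 0), ('Paris', 1), ..."""
    return (name, mention_index // max_items)
```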
Clustering Step 2: Merging Clusters from Partitions
- Typically, after the first stage of clustering, we have:
  - several larger clusters (few dominant entities)
  - many smaller clusters (non-dominant entities + fragments of dominant entities)
- For each name, divide the Step 1 clusters into two groups:
  - Big Clusters, C_big (the 10% biggest clusters from Step 1)
  - Small Clusters, C_small (all others)
- Re-clustering strategy based on Step 1 cluster centroids (sketched below):
  1. assign Small Clusters to Big Clusters, using nearest-neighbour assignment
  2. merge Small Clusters to create Medium Clusters, comparing all-against-all
  3. merge Big and Medium Clusters based on only a few top features
  4. repeat 2 and 3 to reduce fragmentation
- For each name, this step can be run on a single machine
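A hedged sketch of substep 1 of this re-clustering: centroids are plain averages of the Step 1 cluster vectors, and each Small Cluster is attached to its nearest Big Cluster only if the centroid similarity reaches a threshold. The 10% split follows the slide; reusing s_min as the threshold and the helper names are my assumptions.

```python
def centroid(cluster_vectors):
    """Average the sparse {name: weight} vectors of one Step 1 cluster."""
    c = {}
    for vec in cluster_vectors:
        for name, w in vec.items():
            c[name] = c.get(name, 0.0) + w
    return {name: w / len(cluster_vectors) for name, w in c.items()}

def assign_small_to_big(clusters, sim, s_min):
    """Step 2, substep 1: nearest-neighbour assignment of the Small Clusters
    to the 10% biggest Step 1 clusters; unassigned Small Clusters are returned
    for the later small-vs-small and big-vs-medium merging substeps."""
    clusters = sorted(clusters, key=len, reverse=True)
    n_big = max(1, len(clusters) // 10)
    big, small = clusters[:n_big], clusters[n_big:]
    big_centroids = [centroid(c) for c in big]

    leftovers = []
    for c in small:
        cen = centroid(c)
        scores = [sim(cen, bc) for bc in big_centroids]
        best = max(range(len(big)), key=lambda i: scores[i])
        if scores[best] >= s_min:
            big[best].extend(c)        # merge the mentions into the big cluster
        else:
            leftovers.append(c)
    return big, leftovers
```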
Gold Standard for Evaluating NED
- We used Wikipedia articles as a gold standard for NED
  - each article can be related to one entity / concept
- Let W_seed(n_j) be the set of articles found for name n_j
  - n_j can usually be easily identified from the article title
  - if |W_seed(n_j)| > 1, then n_j is ambiguous
  - each element of W_seed(n_j) refers unambiguously to one entity
- W_seed(n_j) can be expanded with documents that "unambiguously" refer to entities mentioned using name n_j:
  - for each page in W_seed(n_j) we find all its immediate neighbours in the web link graph
  - linked pages containing n_j should refer to the same entity e_k described by the article they are linked to
- We created Gold Clusters G(n_j) for 52,000 ambiguous names (see the sketch below)
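A heavily hedged sketch of this gold-standard construction for one name, assuming the Wikipedia articles, the web link graph and the per-page name annotations are available as in-memory dictionaries; wiki_articles, inlinks and page_names are hypothetical stand-ins for the real data sources, and the substring test on titles is a crude simplification.

```python
def build_gold_clusters(name, wiki_articles, inlinks, page_names):
    """Sketch of the gold-standard construction for one ambiguous name.

    wiki_articles: {article_id: title} for Wikipedia articles (one per entity);
    inlinks: {article_id: set of web page ids linking to that article};
    page_names: {page_id: set of names annotated in that page}.
    """
    # W_seed(n_j): articles whose title contains the ambiguous name.
    seeds = [a for a, title in wiki_articles.items() if name in title]
    if len(seeds) <= 1:
        return {}                      # name is not ambiguous, nothing to do

    gold = {}
    for article in seeds:
        # Expand with immediate neighbours in the web link graph that also
        # mention the name: they are assumed to refer to the same entity.
        members = {p for p in inlinks.get(article, set())
                   if name in page_names.get(p, set())}
        gold[article] = members
    return gold
```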
Measuring Clustering Performance (I)
- "Purity": the entropy of each test cluster t_x over all |G| true classes,

  e_t(t_x) = -\sum_{y=0}^{|G|} \frac{i_{xy}}{I_t(x)} \ln\left(\frac{i_{xy}}{I_t(x)}\right)   (4)

  E_t(n_j) = \frac{\sum_{x=0}^{|T(n_j)|} |t_x| \cdot e_t(t_x)}{\sum_{x=0}^{|T(n_j)|} |t_x|}   (5)

- "Dispersion": the entropy of each gold class g_y over all |T| test clusters,

  e_g(g_y) = -\sum_{x=0}^{|T|} \frac{i_{xy}}{|g_y|} \ln\left(\frac{i_{xy}}{|g_y|}\right)   (6)

  E_g(n_j) = \frac{\sum_{y=0}^{|G(n_j)|} |g_y| \cdot e_g(g_y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}   (7)

- Ideally, both E_t and E_g should be close to zero (both are computed in the sketch below)
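A sketch of equations (4)-(7) on the contingency counts i_xy (the number of mentions of gold class g_y that fall into test cluster t_x), with test clusters and gold classes given as sets of mention ids. The normalisers I_t(x) and |g_y| are taken over the gold-standard mentions actually present, which is one plausible reading of the formulas.

```python
import math

def entropy(counts):
    """Entropy -sum p*ln(p) of a list of counts (used for eqs. 4 and 6)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def purity_and_dispersion(test_clusters, gold_classes):
    """E_t (purity, eq. 5) and E_g (dispersion, eq. 7) for one name.
    test_clusters and gold_classes are lists of sets of mention ids."""
    # Contingency counts i_xy: gold mentions of class y inside test cluster x.
    i = [[len(t & g) for g in gold_classes] for t in test_clusters]

    # E_t: |t_x|-weighted average of the entropy of each test cluster (row).
    e_t = sum(len(t) * entropy(i[x]) for x, t in enumerate(test_clusters))
    e_t /= sum(len(t) for t in test_clusters)

    # E_g: |g_y|-weighted average of the entropy of each gold class (column).
    e_g = sum(len(g) * entropy([row[y] for row in i])
              for y, g in enumerate(gold_classes))
    e_g /= sum(len(g) for g in gold_classes)
    return e_t, e_g
```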
Measuring Clustering Performance (II)
- Mention recall for name n_j: for each gold cluster g_y,

  r_m(g_y) = \frac{I_g(y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}   (8)

  and the overall mention recall for n_j,

  R_m(n_j) = \frac{\sum_{y=0}^{|G(n_j)|} |g_y| \cdot r_m(g_y)}{\sum_{y=0}^{|G(n_j)|} |g_y|}   (9)

- Entity recall, R_e(n_j):
  - measures the fraction of the entities included in the gold-standard clusters for n_j that are found in the test clusters
- Cluster ratio: C_rat = |T(n_j)| / |G(n_j)|
- For all names we compute the arithmetic averages of E_t, E_g, R_m and R_e (see the sketch below)
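A companion sketch for R_m, R_e and C_rat under a natural reading of equations (8)-(9), in which the size-weighted combination of per-class recalls reduces to the fraction of gold-standard mentions that end up in some test cluster; computing R_e as the fraction of gold entities with at least one recovered mention is my interpretation of the slide.

```python
def recall_and_crat(test_clusters, gold_classes):
    """Mention recall R_m (eqs. 8-9), entity recall R_e and the cluster ratio
    C_rat = |T(n_j)| / |G(n_j)| for one name; test_clusters and gold_classes
    are lists of sets of mention ids."""
    clustered = set().union(*test_clusters) if test_clusters else set()
    total_gold = sum(len(g) for g in gold_classes)

    # R_m: fraction of gold-standard mentions that end up in some test cluster.
    found = sum(len(g & clustered) for g in gold_classes)
    r_m = found / total_gold if total_gold else 0.0

    # R_e: fraction of gold entities with at least one recovered mention.
    r_e = sum(1 for g in gold_classes if g & clustered) / len(gold_classes)

    c_rat = len(test_clusters) / len(gold_classes)
    return r_m, r_e, c_rat
```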
Set-Up
- Base collection, C_all: 1 billion name-annotated documents
- "Reference run" with a 1% sample of C_all:
  - repeat the complete NED procedure, slowly varying the parameter s_min
  - compute the corresponding values of E_t, E_g, R_m and R_e
- We performed NED over samples of sizes 0.5%, 2% and 5%
  - compared against the 1% sample at the closest value of E_t, i.e. at similar "purity" values
- Implementation using the Map-Reduce paradigm:
  - features weighted by TF-IDF
  - items were limited to 5000 features
  - partitions of 3000 items (RAM & load balancing)
  - after the 1st pass, clusters with fewer than 5 elements were filtered out (affects recall)
Results

  %@s_min    E_t      E_g      R_m (%)   R_e (%)   C_rat
  [email protected]    0.0003   0.0056   0.024     1.16      1.23
  [email protected]    0.0001   0.0085   0.055     1.74      1.82
  [email protected]    0.0042   0.0226   0.135     3.70      2.06
  [email protected]    0.0042   0.0312   0.294     5.43      3.27
  [email protected]    0.0103   0.0212   0.186     5.00      2.18
  [email protected]    0.0140   0.0797   0.912     12.4      6.91

- For similar E_t, s_min needs to be increased with sample size to reduce the impact of noise, but this increases fragmentation (E_g)
- C_rat increases with s_min and sample size
- Recall values seem very low, but we are also "sampling" items from the gold standard
But, are we improving with sample size?

  % vs %          r_m ratio   r_e ratio
  0.5% vs. 1.0%   2.28        1.5
  1.0% vs. 2.0%   2.17        1.48
  1.0% vs. 5.0%   4.9         2.48

- No!! The ratios are receding with size!
  - and become sublinear when going from 1.0% to 5.0%
- Some possible reasons:
  - the number of partitions for each name increases with the data
    - fragmentation, and hence the number of small clusters, increases
    - some small clusters are lost after the 1st pass and during the 2nd pass
  - the number of different entities increases with the data:
    - they might not be well represented in G(n_j)
    - G(n_j) might contain many entities that are not truly representative
More Fundamental Issues (after manual inspection)
- For very frequent names (e.g. "Amsterdam") there was a surprisingly high number of medium and large clusters for the dominant entities (e.g. the Dutch capital)
- Such clusters refer to autonomous scopes or facets of the entity:
  - Amsterdam:
    - as a world capital (co-occurring with "Paris", "New York", ...)
    - as a Dutch city (co-occurring with "Utrecht", "Rotterdam", ...)
    - as a publishing centre ("Elsevier" or "Elsevier Science")
    - ...
  - Paris:
    - the French Revolution vs. nowadays
    - ...
- The number of "autonomous" facets (temporal, geographical, ...) increases with data and entity popularity
Ideally...
- We would like to merge these facets into a single cluster
  - but still keep the information about the facets
- Information about co-occurring names is not sufficient for this
- We need additional features (e.g. zip codes, phone numbers, other relevant words, document links, ...)
- Maybe a third clustering step needs to be considered
Conclusions - Take Home Message
- We proposed a Web-scale algorithm for NED
- But NED does NOT get easier with more data:
  - name co-occurrence information is not enough
  - multiple facets of the same entity emerge, which makes the problem significantly harder
Thank you!
- Questions & comments?
- We would like to thank the Google team in NYC for all the help and support, especially Casey Whitelaw and Nemanja Petrovic!