Sep 25, 2012 - Introduction to Author Name Disambiguation Categorical Sampling Likelihood Ratio Disambiguation Framework
Author Name Disambiguation using a New Categorical Distribution Similarity Shaohua Li, Gao Cong and Chunyan Miao Nanyang Technological University
25 September, 2012
Outline

1. Introduction to Author Name Disambiguation
2. Categorical Sampling Likelihood Ratio
   Comparison to Traditional Set Similarity Measures
3. Disambiguation Framework
   Two-Stage Clustering
4. Experimental Results
5. Various Applications
6. Source Code and Data
Motivation for Author Name Disambiguation

Many names are ambiguous. On DBLP, the most prolific "author": Wei Wang, 846 papers.

Name          Namesakes   Papers
Wei Wang      217         846
Lei Wang      146         405
Jim Smith     5           39
David Brown   22          54
...

Author paper retrieval and output evaluation become difficult.

Author Name Disambiguation ("disambiguation" for short)
Objective: merge paper records written by the same author, while avoiding merging papers by different authors.
How to Do Author Name Disambiguation? The Intuition

Example:
Jie Xu, Getian Ye, Yang Wang, Wei Wang, Jun Yang: Online Learning for PLSA-Based Visual Recognition. ACCV (2) 2010.
Xue-juan Rong, Ping-juan Niu, Wei Wang: A Novel Design of Solid-State Lighting Control System. APWCS 2010.

Merge or not? The evidence we have: coauthors, paper title, publication venue, year.
Coauthors: the strongest type of evidence (though a coauthor with many collaborators can also be weak)
Title: keywords; weaker
Venue: same or related venues; weaker
Year: little use
Affiliation? Often unavailable
Core Challenge: Measure the Similarity of Two Categorical Sets
A decision based on individual elements is unreliable; a decision based on two sets is more reliable.
Aggregated weak evidence becomes stronger.
How similar are two categorical sets?
Traditional measures (Jaccard coefficient, cosine similarity, KL divergence) are heuristic, perform badly on small sets, and have various bad cases.
When sets are small, they are only partial observations of the underlying distribution
New Similarity of Two Categorical Sets: CSLR

CSLR stands for "Categorical Sampling Likelihood Ratio".
It measures how likely two sets are drawn from the same categorical distribution: the more likely, the more similar they are.
It rests on a solid mathematical foundation and performs well on small sets.
CSLR – “Balls and Jar” Analogy

[Figure: jar C with an unknown categorical distribution; set B of author a1 and set S of author a2, where S contains one color unseen in B]

Set B of author a1 is drawn from categorical distribution C, which is unknown.
MLE estimate Ĉ of C from B:
P(color) = 0.2 for each of the four colors seen twice in B
P(color) = 0.1 for each of the two colors seen once in B
How likely is S also drawn from Ĉ? Denoted P(S|Ĉ).
Under the plain MLE, the unseen color has P = 0, so P(S|Ĉ) = 0.
CSLR – “Balls and Jar” Analogy (cont.)

Add a "wild-card" color ?, matching any unseen color.
Smoothing by Jeffreys prior, θs = 0.5, gives the smoothed estimate Ĉ:
P(color) = 0.18 for each of the four colors seen twice in B
P(color) = 0.11 for each of the two colors seen once in B
P(?) = 0.04
P(S|Ĉ) = (3 choose 1,1,1) · 0.18 · 0.18 · 0.04 ≈ 0.0078
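A minimal Python sketch of this smoothing and sampling step; the color names and the wild-card bookkeeping are illustrative assumptions reconstructed from the numbers above:

```python
from collections import Counter
from math import factorial, prod

def jeffreys_smooth(B, theta=0.5):
    """Estimate the categorical distribution behind multiset B, adding
    theta pseudo-counts per category (Jeffreys prior) plus one
    'wild-card' category '?' that absorbs any unseen outcome."""
    n = sum(B.values())
    denom = n + theta * (len(B) + 1)          # +1 for the wild-card
    p = {c: (cnt + theta) / denom for c, cnt in B.items()}
    p["?"] = theta / denom
    return p

def sampling_likelihood(S, p):
    """P(S | C-hat): multinomial probability of drawing multiset S from
    the smoothed distribution p; outcomes not in p map to the wild-card."""
    n = sum(S.values())
    coef = factorial(n) // prod(factorial(c) for c in S.values())
    return coef * prod(p.get(c, p["?"]) ** cnt for c, cnt in S.items())

# The jar example: 4 colors seen twice, 2 colors seen once (10 balls).
B = Counter(red=2, blue=2, green=2, yellow=2, black=1, white=1)
p = jeffreys_smooth(B)   # seen-twice ~ 0.18, seen-once ~ 0.11, '?' ~ 0.04
S = Counter(red=1, blue=1, purple=1)   # one ball of a color unseen in B
print(sampling_likelihood(S, p))       # ~ 0.0076
```

The slide's 0.0078 comes from multiplying the rounded probabilities; the exact fractions (2.5/13.5, 0.5/13.5) give about 0.0076.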
The Likelihood Ratio

P(S|Ĉ) = 0.0078 by itself is not very informative.
Compare it to P(S): if P(S|Ĉ) > P(S), S is likely to be drawn from Ĉ.
This gives an intrinsic similarity between S and B.
CSLR: Λ = P(S|Ĉ) / P(S)

Calculation of P(S):
S is drawn from an unknown categorical distribution with parameters p.
Assume a uniform Dirichlet prior distribution for p.
Integrating out p: P(S) = 1 / C(m+n−1, n), where m is the number of categories and n = |S|.

Ratio: Λ = C(m+n−1, n) · P(S|Ĉ) = C(10, 3) · 0.0078 ≈ 0.94
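Putting the pieces together, a self-contained sketch of the full ratio. The choice m = distinct outcomes in both sets plus the wild-card is an assumption, made because it reproduces the slide's C(10, 3):

```python
from collections import Counter
from math import comb, factorial, prod

def cslr(B, S, theta=0.5):
    """Lambda = P(S|C-hat) / P(S).  C-hat: Jeffreys-smoothed estimate
    from B with a wild-card category.  P(S) = 1 / C(m+n-1, n) under a
    uniform Dirichlet prior (n = |S|, m = number of categories)."""
    # Smoothed estimate C-hat from B; '?' absorbs unseen outcomes.
    denom = sum(B.values()) + theta * (len(B) + 1)
    p = {c: (cnt + theta) / denom for c, cnt in B.items()}
    p["?"] = theta / denom
    # Multinomial likelihood P(S | C-hat).
    n = sum(S.values())
    coef = factorial(n) // prod(factorial(c) for c in S.values())
    like = coef * prod(p.get(c, p["?"]) ** cnt for c, cnt in S.items())
    # P(S) = 1 / C(m+n-1, n); m = distinct outcomes in B and S, plus '?'.
    m = len(set(B) | set(S)) + 1
    return comb(m + n - 1, n) * like

B = Counter(red=2, blue=2, green=2, yellow=2, black=1, white=1)
S = Counter(red=1, blue=1, purple=1)   # m = 8, n = 3 -> C(10, 3) = 120
print(cslr(B, S))                      # ~ 0.91
```

With exact fractions the ratio is about 0.91; the slide's 0.94 is 120 times the rounded 0.0078.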
Traditional Set Similarity Measures
Jaccard: |A ∩ B| / |A ∪ B|
Cosine: Σᵢ Aᵢ·Bᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
Kullback–Leibler divergence: D_KL(P‖Q) = Σᵢ P(i) ln (P(i) / Q(i))
First estimate the two categorical distributions from the two samples using MLE.
Very sensitive to unshared outcomes.
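For concreteness, minimal implementations of the three measures over count dictionaries; the unsmoothed KL version is a sketch and diverges when the second set misses an outcome of the first:

```python
import math

def jaccard(A, B):
    """|A n B| / |A u B| on the sets of distinct outcomes."""
    a, b = set(A), set(B)
    return len(a & b) / len(a | b)

def cosine(A, B):
    """Cosine similarity on the raw count vectors."""
    dot = sum(A.get(k, 0) * B.get(k, 0) for k in set(A) | set(B))
    na = math.sqrt(sum(v * v for v in A.values()))
    nb = math.sqrt(sum(v * v for v in B.values()))
    return dot / (na * nb)

def kl(A, B):
    """D_KL(P_A || P_B) with MLE distributions from the counts."""
    na, nb = sum(A.values()), sum(B.values())
    return sum((c / na) * math.log((c / na) / (B[k] / nb))
               for k, c in A.items())

A, B = {"ICDM": 1}, {"ICDM": 1, "TKDE": 1, "CIKM": 1, "SDM": 1}
print(jaccard(A, B), cosine(A, B), kl(A, B))   # 0.25, 0.5, 1.39 (= ln 4)
```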
Bad Cases for Traditional Measures

Two imbalanced sets:
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39
cf. {ICDM: 1} ⇔ {ICDM: 1, TKDE: 1}: Jaccard 0.5, Cosine 0.71, KL 0.69

Impact of unshared outcomes is ignored, or overstated (by KL):
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39
cf. {ICDM: 1, ICC: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.2, Cosine 0.35, KL n/a or large (smoothed)

Impact of frequencies (confidence of the decision) is ignored:
{ICDM: 1} ⇔ {ICDM: 1}: Jaccard 1, Cosine 1, KL 0
cf. {ICDM: 6} ⇔ {ICDM: 10}: Jaccard 1, Cosine 1, KL 0
CSLR on "Bad Cases"

Two imbalanced sets:
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39, CSLR 1.14
cf. {ICDM: 1} ⇔ {ICDM: 1, TKDE: 1}: Jaccard 0.5, Cosine 0.71, KL 0.69, CSLR 1.26

Impact of unshared outcomes:
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39, CSLR 1.14
cf. {ICDM: 1, ICC: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.2, Cosine 0.35, KL n/a or large, CSLR 0.59

Impact of frequencies (confidence of the decision):
{ICDM: 1} ⇔ {ICDM: 1}: Jaccard 1, Cosine 1, KL 0, CSLR 1.45
cf. {ICDM: 6} ⇔ {ICDM: 10}: Jaccard 1, Cosine 1, KL 0, CSLR 5.03
Illustration of My System: Initial Clusters

[Figure: the initial clusters, one per paper]
Illustration of My System: Coauthor-only Clustering

[Figure: papers clustered by coauthors; two clusters share a weak-evidential coauthor]
Illustration of My System: Venue/Title Sets Clustering

[Figure: clusters further merged by venue & title sets]
Experimental Setting

Data sets: 10 names (5 Chinese, 3 Western, and 2 Indian)
Set 1: the original data set used in (Yin 2007); 588 papers.
Set 2: same name set as (Yin 2007); 2,050 papers extracted from the January 2011 dump of DBLP (1,508,101 papers). All manually labeled by us.
Evaluation Metrics
Prec = #PairsCorrectlyPredicted / #TotalPairsPredicted
Rec = #PairsCorrectlyPredicted / #TotalCorrectPairs
F1 = 2 · Prec · Rec / (Prec + Rec)
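These pairwise metrics can be sketched directly over predicted vs. gold clusterings (the paper ids below are illustrative):

```python
from itertools import combinations

def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, F1 for a clustering.
    predicted, gold: lists of clusters, each a collection of paper ids."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters
                for p in combinations(sorted(c), 2)}
    pred, true = pairs(predicted), pairs(gold)
    correct = len(pred & true)
    prec = correct / len(pred) if pred else 0.0
    rec = correct / len(true) if true else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Toy example: 1 correct pair out of 2 predicted and 3 true pairs.
print(pairwise_prf([{1, 2}, {3, 4}], [{1, 2, 3}, {4}]))  # (0.5, 0.333..., 0.4)
```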
Competitors

DISTINCT:
A randomly selected training set leads to varying weights and varying performance.
The clustering threshold is not chosen automatically but needs manual specification.
We ran it 10 times; each time we tried 12 thresholds (0.00001 ∼ 1.1), averaged the 10 scores at each threshold, and chose the threshold giving the best average F1 (a big advantage granted to DISTINCT).

Arnetminer website's disambiguation pages:
The most up-to-date work of (Jie Tang et al., TKDE 2011).
We crawled the disambiguation pages on March 12, 2012 and calculated the scores.

One baseline, Jac:
Jaccard coefficient for coauthor/venue set similarities.
Unigram bag-of-words (TF·IDF weighted) for title sets.
Results on Set 1
Name                 Jac (P/R/F1)         Arnetminer (P/R/F1)   DISTINCT (P/R/F1)     Our CSLR (P/R/F1)
Hui Fang             100.0/100.0/100.0    55.6/100.0/71.4       85.6/100.0/88.7       100.0/100.0/100.0
Ajay Gupta           100.0/93.1/96.4      100.0/100.0/100.0     67.7/94.5/78.8        100.0/93.1/96.4
Joseph Hellerstein   50.7/83.9/63.2       97.4/97.4/97.4        92.4/80.6/84.6        100.0/69.7/82.1
Rakesh Kumar         100.0/100.0/100.0    100.0/100.0/100.0     100.0/100.0/100.0     100.0/100.0/100.0
Michael Wagner       100.0/64.0/78.1      100.0/33.7/50.5       90.1/96.2/92.9        100.0/64.0/78.1
Bing Liu             99.8/84.5/91.5       86.2/79.8/82.9        86.5/82.0/83.6        91.8/87.0/89.4
Jim Smith            100.0/83.1/90.8      100.0/84.5/91.6       95.6/91.7/93.3        100.0/87.3/93.2
Lei Wang             100.0/71.2/83.2      59.4/94.2/72.9        42.5/75.0/51.8        100.0/63.3/77.6
Wei Wang             60.5/83.7/70.2       28.1/98.5/43.8        31.0/98.8/47.1        59.3/72.4/65.2
Bin Yu               70.7/64.7/67.6       87.8/95.3/91.4        77.1/89.2/81.3        98.8/68.5/80.9
Avg. (macro-F1)      88.2/82.8/84.1       81.5/88.4/80.2        76.9/90.8/80.2        95.0/80.5/86.3
Results on Set 2
Name                 Jac (P/R/F1)         Arnetminer (P/R/F1)   DISTINCT (P/R/F1)     Our CSLR (P/R/F1)
Hui Fang             100.0/68.8/81.5      59.1/63.7/61.3        81.3/97.9/88.0        100.0/78.9/88.2
Ajay Gupta           96.0/47.0/63.1       60.0/65.4/62.6        65.3/87.9/74.2        96.0/39.6/56.1
Joseph Hellerstein   52.8/80.5/63.7       94.5/95.9/95.2        92.3/89.5/90.0        100.0/79.6/88.6
Rakesh Kumar         100.0/89.0/94.2      98.4/89.3/93.7        89.9/96.0/92.5        99.9/97.8/98.8
Michael Wagner       92.8/59.4/72.4       55.6/36.7/44.2        67.4/98.2/79.1        88.1/64.6/74.6
Bing Liu             97.8/67.0/79.5       75.7/67.2/71.2        83.0/84.7/83.3        98.1/74.7/84.8
Jim Smith            100.0/44.1/61.2      88.6/45.1/59.7        94.8/87.8/90.0        100.0/48.8/65.6
Lei Wang             30.0/79.8/43.6       18.1/83.1/29.8        29.3/85.9/42.4        78.1/87.6/82.6
Wei Wang             40.2/77.0/52.8       9.7/88.2/17.5         25.8/84.2/38.9        81.0/71.8/76.1
Bin Yu               70.6/42.8/53.3       72.4/62.2/66.9        54.0/62.0/57.0        88.0/49.1/63.0
Avg. (macro-F1)      78.0/65.5/66.5       63.2/69.7/60.2        68.3/87.4/73.5        92.9/69.2/77.8
Conclusions

CSLR is superior when the sample size is small and the observation is partial.
When the sample size becomes large, CSLR is close to other measures.
CSLR has slightly lower recall but much higher precision (overall F1 7%-15% better than Jaccard): clusters sharing some common elements (but not enough of them) are not merged.
Low recall with high precision is preferable to high recall with low precision.
Recall can be improved with active learning (Jie Tang et al., ICDM 2011).
Various Applications on Categorical Data

Applicable as long as: sets are similar ⇔ they follow similar distributions.

E.g. author identification:
Extract the frequencies of some feature words.
An author's wording stably follows a characteristic categorical distribution.

Example: Jane Austen's novels and her imitator's novel

Word      Sense & Sensibility   Emma   Sanditon I   Sanditon II
a         147                   186    101          83
an        25                    26     11           29
this      32                    39     15           15
that      94                    105    37           22
with      59                    74     28           43
without   18                    10     10           4
Various Applications on Categorical Data (cont.)

Example: Jane Austen's novels and her imitator's novel (word-frequency table on the previous slide)
CSLR scores between Sense & Sensibility, Emma, and Sanditon I: > 3000
CSLR scores between Sanditon II and Jane Austen's books: < 0.006
The same approach applies to other categorical data problems.
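Under the same assumptions as the earlier jar sketch (Jeffreys smoothing with a wild-card, uniform-Dirichlet P(S)), the ratio can be run directly on the word-count table. This simplified version reproduces the qualitative result, a ratio far above 1 for Austen's own Sanditon I and far below 1 for the imitation; the slide's exact scores come from the authors' implementation:

```python
from math import comb, factorial, prod

def cslr(B, S, theta=0.5):
    """Simplified CSLR: Jeffreys-smoothed estimate from B (wild-card '?'),
    multinomial likelihood of S, divided by P(S) = 1 / C(m+n-1, n)."""
    denom = sum(B.values()) + theta * (len(B) + 1)
    p = {c: (cnt + theta) / denom for c, cnt in B.items()}
    p["?"] = theta / denom
    n = sum(S.values())
    like = (factorial(n) // prod(factorial(c) for c in S.values())) \
        * prod(p.get(c, p["?"]) ** cnt for c, cnt in S.items())
    m = len(set(B) | set(S)) + 1
    return comb(m + n - 1, n) * like

words = ["a", "an", "this", "that", "with", "without"]
sense     = dict(zip(words, [147, 25, 32, 94, 59, 18]))
sanditon1 = dict(zip(words, [101, 11, 15, 37, 28, 10]))  # Austen's own
sanditon2 = dict(zip(words, [83, 29, 15, 22, 43, 4]))    # the imitator's
print(cslr(sense, sanditon1))   # much greater than 1
print(cslr(sense, sanditon2))   # much less than 1
```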
Source Code and Data

All the source code (in Perl) and labeled data: http://github.com/askerlee/namedis
Standalone code of CSLR (implemented in C++, Java, Python...) will be published.
Thank You!
Supplementary Slides

7. Related Work
8. Details of Disambiguation Framework
9. Details of Coauthor-only Clustering
10. Name Ambiguity Estimation
Related Work

Hui Han et al. (PSU), JCDL '04: naive Bayes and SVM to learn evidence strength; needs a training set for each name.
Jie Tang et al. (Tsinghua), TKDE 2011: hidden Markov random field with cosine similarity; tries different author label assignments and finds the one with maximum posterior.
DISTINCT, Xiaoxin Yin et al. (UIUC), ICDE '07: Jaccard & random walk as the set similarity; SVM to train the weight of each feature type (evidence); the training set is generated automatically (randomly), so trained weights vary from run to run.
I. Bhattacharya et al. (U. Maryland), TKDE 2007: models papers' relations as a hypergraph; Jaccard coefficient.
Why Two Stages?
Snowball effect in clustering: merging mistakes are often amplified as clustering proceeds.
A single venue/title is very weak evidence; a big venue/title set is closer to the real distribution and more reliable, but a noisy set is misleading.
Coauthor-only clustering produces very pure clusters: Prec > 99.5% (but low recall: Rec ≈ 60%).
Coauthor-only clustering is therefore well suited to bootstrapping.
Venue Set Expansion and Similarity

Basically, use CSLR to compare two venue sets. But CSLR is crisp: correlations between different venues are lost (KDD ⇔ TKDE ⇔ CIKM...).
Use linear regression to find correlations between venues.
Expand venue sets with correlated venues: V1 ⇒ V1', V2 ⇒ V2'. Calculate CSLR on V1' and V2'.

Example:
V1 = {PAKDD: 2, ICML: 1, ICDE: 1}, V2 = {TKDE: 1, CIKM: 1}
Λ(V1, V2) = 0.09 < 1
V1 ⇒ V1' = {PAKDD: 2, ICML: 1, ICDE: 1, CIKM: 1, TKDE: 0.5}; V2' = V2 = {TKDE: 1, CIKM: 1}
Λ(V1', V2') = 1.05 > 1
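The expansion step can be sketched as adding regression-weighted counts of correlated venues. The correlation weights below are hypothetical, chosen only to reproduce the slide's V1' (in the paper they come from linear regression over venue data):

```python
# Hypothetical correlation weights: venue -> {correlated venue: weight}.
corr = {"PAKDD": {"CIKM": 0.5}, "ICDE": {"TKDE": 0.5}}

def expand(V, corr):
    """Add weight * count of each correlated venue to the venue set V."""
    out = dict(V)
    for v, cnt in V.items():
        for u, w in corr.get(v, {}).items():
            out[u] = out.get(u, 0) + w * cnt
    return out

V1 = {"PAKDD": 2, "ICML": 1, "ICDE": 1}
print(expand(V1, corr))
# {'PAKDD': 2, 'ICML': 1, 'ICDE': 1, 'CIKM': 1.0, 'TKDE': 0.5}
```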
Title Similarity based on BoW and Cosine Similarity
Represent each title as a TF·IDF-weighted bag-of-words.
Cosine similarity then measures the similarity of two BoWs.
Combine Venue/Title-Set Similarity
C1 and C2 have venue sets V1, V2 and title sets T1, T2.
Venue set similarity: sim_venue(V1, V2). Title set similarity: sim_title(T1, T2).
The two types of evidence complement each other:
same topic, but different ranks of conferences/journals;
same conference/journal, but different topics.
Combine them by multiplication (if either is too weak, reject):
sim(C1, C2) = sim_venue(V1, V2) · sim_title(T1, T2)
Coauthor-only Clustering

Sometimes coauthors are not so reliable: Jiawei Han collaborates with ≥ 3 authors named "Wei Wang", so he is a weak-evidential coauthor.
Evidential strength of b: P(a1 = a2 | b), b ∈ co(a1) ∩ co(a2).
Coauthors have different evidential strengths. Two types of shared coauthors:
Weak-evidential: P(a1 = a2 | b) < θq
Strong-evidential: P(a1 = a2 | b) ≥ θq

Clustering procedure:
Merge all pairs which share a strong-evidential coauthor.
Use CSLR to decide when only weak-evidential coauthors are shared.
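The first rule, merging all pairs that share a strong-evidential coauthor, amounts to a union-find pass. The strength table standing in for P(a1 = a2 | b) and the threshold value are illustrative assumptions:

```python
def coauthor_clustering(papers, strength, theta_q=0.9):
    """papers: list of coauthor-name sets, one per paper.  Merge papers
    sharing a coauthor b with strength[b] >= theta_q via union-find."""
    parent = list(range(len(papers)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    seen = {}   # coauthor name -> first paper seen with it
    for i, coauthors in enumerate(papers):
        for b in coauthors:
            if strength.get(b, 0.0) >= theta_q:   # strong-evidential only
                if b in seen:
                    parent[find(i)] = find(seen[b])
                else:
                    seen[b] = i
    clusters = {}
    for i in range(len(papers)):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

papers = [{"Jiawei Han"}, {"Jiawei Han", "Alice"}, {"Alice"}, {"Bob"}]
strength = {"Jiawei Han": 0.3, "Alice": 0.95, "Bob": 0.95}  # Han: weak
print(coauthor_clustering(papers, strength))  # papers 1 and 2 merge via Alice
```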
How to Partition Coauthors into Two Sets
The intuition: the more people with whom b has ever coauthored, the more likely two of them share the same name.
Jiawei Han has 162+ Chinese coauthors, including ≥ 3 named "Wei Wang".
Yanghua Xiao has 29+ coauthors, probably only 1 "Wei Wang".
Name Ambiguity Estimation

Name ambiguity κ(e): the number of authors with name e; estimated as κ̂(e).
One of the stop criteria of clustering: if κ̂(e) ≪ 1, name e is rare and there is no need to disambiguate it.
Example: 448 papers by Jiawei Han; κ̂(Jiawei Han) = 0.29, so all are by the same person.

Basic Idea
Assume different name parts are chosen independently.
Estimate P(given name = ABC) and P(surname = XYZ) iteratively.
P(name = Wei Wang) = P(given name = Wei) · P(surname = Wang) = 0.02 · 0.08 = 0.0016
With 125,468 estimated Chinese authors, the expected number of "Wei Wang"s is about 208 (real number: 217).
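The final estimate is simple arithmetic. With the rounded probabilities shown here the product is about 201; the slide's 208 presumably comes from less-rounded estimates:

```python
p_given = 0.02       # rounded estimate of P(given name = Wei)
p_sur = 0.08         # rounded estimate of P(surname = Wang)
n_chinese = 125468   # estimated number of Chinese authors on DBLP
expected = p_given * p_sur * n_chinese
print(round(expected))   # ~201 with these rounded inputs
```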