Sep 25, 2012 - Introduction to Author Name Disambiguation Categorical Sampling Likelihood Ratio Disambiguation Framework
Author Name Disambiguation using a New Categorical Distribution Similarity Shaohua Li, Gao Cong and Chunyan Miao Nanyang Technological University
25 September, 2012
Outline

1. Introduction to Author Name Disambiguation
2. Categorical Sampling Likelihood Ratio
   Comparison to Traditional Set Similarity Measures
3. Disambiguation Framework
   Two-Stage Clustering
4. Experimental Results
5. Various Applications
6. Source Code and Data
Motivation for Author Name Disambiguation

Many names are ambiguous. On DBLP, the most prolific "author": Wei Wang, 846 papers.

Name          Namesakes   Papers
Wei Wang      217         846
Lei Wang      146         405
Jim Smith     5           39
David Brown   22          54
...

Author paper retrieval and output evaluation become difficult.

Author Name Disambiguation ("disambiguation" for short)
Objective: merge paper records written by the same author, while avoiding merging papers by different authors.
How to Do Author Name Disambiguation? The Intuition

Example:
Jie Xu, Getian Ye, Yang Wang, Wei Wang, Jun Yang: Online Learning for PLSA-Based Visual Recognition. ACCV (2) 2010.
Xue-juan Rong, Ping-juan Niu, Wei Wang: A Novel Design of Solid-State Lighting Control System. APWCS 2010.

Merge or not? The evidence we have: coauthors, paper title, publication venue, year.
Coauthors: the strongest type of evidence (though a coauthor with many collaborators can also be weak)
Title: keywords; weaker
Venue: same or related venues; weaker
Year: little use
Affiliation? Often unavailable
Core Challenge: Measure the Similarity of Two Categorical Sets
A decision based on individual elements is unreliable; a decision based on two sets is more reliable.
Aggregated weak evidence becomes stronger.
How similar are two categorical sets?
Traditional measures (Jaccard coefficient, cosine similarity, KL divergence) are heuristic, perform badly on small sets, and have various bad cases.
When sets are small, they are only partial observations of the underlying distribution
New Similarity of Two Categorical Sets: CSLR

CSLR stands for "Categorical Sampling Likelihood Ratio".
It measures how likely two sets are drawn from the same categorical distribution: the more likely, the more similar they are.
It rests on a solid mathematical foundation and performs well on small sets.
CSLR – “Balls and Jar” Analogy

[Figure: jar C with an unknown categorical distribution; set B of author a1 and set S of author a2, where S contains one color unseen in B]

Set B of author a1 is drawn from categorical distribution C, which is unknown.
MLE estimate Ĉ of C from B:
P(color) = 0.2 for each of the four colors seen twice in B
P(color) = 0.1 for each of the two colors seen once in B
How likely is S also drawn from Ĉ? Denoted P(S|Ĉ).
Under the plain MLE, the unseen color has P = 0, so P(S|Ĉ) = 0.
CSLR – “Balls and Jar” Analogy (cont.)

Add a "wild-card" color ?, matching any unseen color.
Smoothing by Jeffreys prior, θs = 0.5, gives the smoothed estimate Ĉ:
P(color) = 0.18 for each of the four colors seen twice in B
P(color) = 0.11 for each of the two colors seen once in B
P(?) = 0.04
P(S|Ĉ) = (3 choose 1,1,1) · 0.18 · 0.18 · 0.04 ≈ 0.0078
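A minimal Python sketch of this smoothing and sampling step; the color names and the wild-card bookkeeping are illustrative assumptions reconstructed from the numbers above:

```python
from collections import Counter
from math import factorial, prod

def jeffreys_smooth(B, theta=0.5):
    """Estimate the categorical distribution behind multiset B, adding
    theta pseudo-counts per category (Jeffreys prior) plus one
    'wild-card' category '?' that absorbs any unseen outcome."""
    n = sum(B.values())
    denom = n + theta * (len(B) + 1)          # +1 for the wild-card
    p = {c: (cnt + theta) / denom for c, cnt in B.items()}
    p["?"] = theta / denom
    return p

def sampling_likelihood(S, p):
    """P(S | C-hat): multinomial probability of drawing multiset S from
    the smoothed distribution p; outcomes not in p map to the wild-card."""
    n = sum(S.values())
    coef = factorial(n) // prod(factorial(c) for c in S.values())
    return coef * prod(p.get(c, p["?"]) ** cnt for c, cnt in S.items())

# The jar example: 4 colors seen twice, 2 colors seen once (10 balls).
B = Counter(red=2, blue=2, green=2, yellow=2, black=1, white=1)
p = jeffreys_smooth(B)   # seen-twice ~ 0.18, seen-once ~ 0.11, '?' ~ 0.04
S = Counter(red=1, blue=1, purple=1)   # one ball of a color unseen in B
print(sampling_likelihood(S, p))       # ~ 0.0076
```

The slide's 0.0078 comes from multiplying the rounded probabilities; the exact fractions (2.5/13.5, 0.5/13.5) give about 0.0076.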
The Likelihood Ratio

P(S|Ĉ) = 0.0078 by itself is not very informative.
Compare it to P(S): if P(S|Ĉ) > P(S), S is likely to be drawn from Ĉ.
This gives an intrinsic similarity between S and B.
CSLR: Λ = P(S|Ĉ) / P(S)

Calculation of P(S):
S is drawn from an unknown categorical distribution with parameters p.
Assume a uniform Dirichlet prior distribution for p.
Integrating out p: P(S) = 1 / C(m+n−1, n), where m is the number of categories and n = |S|.

Ratio: Λ = C(m+n−1, n) · P(S|Ĉ) = C(10, 3) · 0.0078 ≈ 0.94
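Putting the pieces together, a self-contained sketch of the full ratio. The choice m = distinct outcomes in both sets plus the wild-card is an assumption, made because it reproduces the slide's C(10, 3):

```python
from collections import Counter
from math import comb, factorial, prod

def cslr(B, S, theta=0.5):
    """Lambda = P(S|C-hat) / P(S).  C-hat: Jeffreys-smoothed estimate
    from B with a wild-card category.  P(S) = 1 / C(m+n-1, n) under a
    uniform Dirichlet prior (n = |S|, m = number of categories)."""
    # Smoothed estimate C-hat from B; '?' absorbs unseen outcomes.
    denom = sum(B.values()) + theta * (len(B) + 1)
    p = {c: (cnt + theta) / denom for c, cnt in B.items()}
    p["?"] = theta / denom
    # Multinomial likelihood P(S | C-hat).
    n = sum(S.values())
    coef = factorial(n) // prod(factorial(c) for c in S.values())
    like = coef * prod(p.get(c, p["?"]) ** cnt for c, cnt in S.items())
    # P(S) = 1 / C(m+n-1, n); m = distinct outcomes in B and S, plus '?'.
    m = len(set(B) | set(S)) + 1
    return comb(m + n - 1, n) * like

B = Counter(red=2, blue=2, green=2, yellow=2, black=1, white=1)
S = Counter(red=1, blue=1, purple=1)   # m = 8, n = 3 -> C(10, 3) = 120
print(cslr(B, S))                      # ~ 0.91
```

With exact fractions the ratio is about 0.91; the slide's 0.94 is 120 times the rounded 0.0078.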
Traditional Set Similarity Measures
Jaccard: |A ∩ B| / |A ∪ B|
Cosine: Σᵢ Aᵢ·Bᵢ / (√(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²))
Kullback–Leibler divergence: D_KL(P‖Q) = Σᵢ P(i) ln (P(i) / Q(i))
First estimate the two categorical distributions from the two samples using MLE.
Very sensitive to unshared outcomes.
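For concreteness, minimal implementations of the three measures over count dictionaries; the unsmoothed KL version is a sketch and diverges when the second set misses an outcome of the first:

```python
import math

def jaccard(A, B):
    """|A n B| / |A u B| on the sets of distinct outcomes."""
    a, b = set(A), set(B)
    return len(a & b) / len(a | b)

def cosine(A, B):
    """Cosine similarity on the raw count vectors."""
    dot = sum(A.get(k, 0) * B.get(k, 0) for k in set(A) | set(B))
    na = math.sqrt(sum(v * v for v in A.values()))
    nb = math.sqrt(sum(v * v for v in B.values()))
    return dot / (na * nb)

def kl(A, B):
    """D_KL(P_A || P_B) with MLE distributions from the counts."""
    na, nb = sum(A.values()), sum(B.values())
    return sum((c / na) * math.log((c / na) / (B[k] / nb))
               for k, c in A.items())

A, B = {"ICDM": 1}, {"ICDM": 1, "TKDE": 1, "CIKM": 1, "SDM": 1}
print(jaccard(A, B), cosine(A, B), kl(A, B))   # 0.25, 0.5, 1.39 (= ln 4)
```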
Bad Cases for Traditional Measures

Two imbalanced sets:
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39
cf. {ICDM: 1} ⇔ {ICDM: 1, TKDE: 1}: Jaccard 0.5, Cosine 0.71, KL 0.69

Impact of unshared outcomes is ignored, or overstated (by KL):
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39
cf. {ICDM: 1, ICC: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.2, Cosine 0.35, KL n/a or large (smoothed)

Impact of frequencies (confidence of the decision) is ignored:
{ICDM: 1} ⇔ {ICDM: 1}: Jaccard 1, Cosine 1, KL 0
cf. {ICDM: 6} ⇔ {ICDM: 10}: Jaccard 1, Cosine 1, KL 0
CSLR on "Bad Cases"

Two imbalanced sets:
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39, CSLR 1.14
cf. {ICDM: 1} ⇔ {ICDM: 1, TKDE: 1}: Jaccard 0.5, Cosine 0.71, KL 0.69, CSLR 1.26

Impact of unshared outcomes:
{ICDM: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.25, Cosine 0.5, KL 1.39, CSLR 1.14
cf. {ICDM: 1, ICC: 1} ⇔ {ICDM: 1, TKDE: 1, CIKM: 1, SDM: 1}: Jaccard 0.2, Cosine 0.35, KL n/a or large, CSLR 0.59

Impact of frequencies (confidence of the decision):
{ICDM: 1} ⇔ {ICDM: 1}: Jaccard 1, Cosine 1, KL 0, CSLR 1.45
cf. {ICDM: 6} ⇔ {ICDM: 10}: Jaccard 1, Cosine 1, KL 0, CSLR 5.03
Illustration of My System: Initial Clusters

[Figure: the initial clusters, one per paper]
Illustration of My System: Coauthor-only Clustering

[Figure: papers clustered by coauthors; two clusters share a weak-evidential coauthor]
Illustration of My System: Venue/Title Sets Clustering

[Figure: clusters further merged by venue & title sets]
Experimental Setting

Data sets: 10 names (5 Chinese, 3 Western, and 2 Indian)
Set 1: the original data set used in (Yin 2007); 588 papers.
Set 2: same name set as (Yin 2007); 2,050 papers extracted from the January 2011 dump of DBLP (1,508,101 papers). All manually labeled by us.
Evaluation Metrics
Prec = #PairsCorrectlyPredicted / #TotalPairsPredicted
Rec = #PairsCorrectlyPredicted / #TotalCorrectPairs
F1 = 2 · Prec · Rec / (Prec + Rec)
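These pairwise metrics can be sketched directly over predicted vs. gold clusterings (the paper ids below are illustrative):

```python
from itertools import combinations

def pairwise_prf(predicted, gold):
    """Pairwise precision, recall, F1 for a clustering.
    predicted, gold: lists of clusters, each a collection of paper ids."""
    def pairs(clusters):
        return {frozenset(p) for c in clusters
                for p in combinations(sorted(c), 2)}
    pred, true = pairs(predicted), pairs(gold)
    correct = len(pred & true)
    prec = correct / len(pred) if pred else 0.0
    rec = correct / len(true) if true else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Toy example: 1 correct pair out of 2 predicted and 3 true pairs.
print(pairwise_prf([{1, 2}, {3, 4}], [{1, 2, 3}, {4}]))  # (0.5, 0.333..., 0.4)
```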
Competitors

DISTINCT:
A randomly selected training set leads to varying weights and varying performance.
The clustering threshold is not chosen automatically but needs manual specification.
We ran it 10 times; each time we tried 12 thresholds (0.00001 ∼ 1.1), averaged the 10 scores at each threshold, and chose the threshold giving the best average F1 (a big advantage granted to DISTINCT).

Arnetminer website's disambiguation pages:
The most up-to-date work of (Jie Tang et al., TKDE 2011).
We crawled the disambiguation pages on March 12, 2012 and calculated the scores.

One baseline, Jac:
Jaccard coefficient for coauthor/venue set similarities.
Unigram bag-of-words (TF·IDF weighted) for title sets.
Results on Set 1
Name                 Jac (P/R/F1)         Arnetminer (P/R/F1)   DISTINCT (P/R/F1)     Our CSLR (P/R/F1)
Hui Fang             100.0/100.0/100.0    55.6/100.0/71.4       85.6/100.0/88.7       100.0/100.0/100.0
Ajay Gupta           100.0/93.1/96.4      100.0/100.0/100.0     67.7/94.5/78.8        100.0/93.1/96.4
Joseph Hellerstein   50.7/83.9/63.2       97.4/97.4/97.4        92.4/80.6/84.6        100.0/69.7/82.1
Rakesh Kumar         100.0/100.0/100.0    100.0/100.0/100.0     100.0/100.0/100.0     100.0/100.0/100.0
Michael Wagner       100.0/64.0/78.1      100.0/33.7/50.5       90.1/96.2/92.9        100.0/64.0/78.1
Bing Liu             99.8/84.5/91.5       86.2/79.8/82.9        86.5/82.0/83.6        91.8/87.0/89.4
Jim Smith            100.0/83.1/90.8      100.0/84.5/91.6       95.6/91.7/93.3        100.0/87.3/93.2
Lei Wang             100.0/71.2/83.2      59.4/94.2/72.9        42.5/75.0/51.8        100.0/63.3/77.6
Wei Wang             60.5/83.7/70.2       28.1/98.5/43.8        31.0/98.8/47.1        59.3/72.4/65.2
Bin Yu               70.7/64.7/67.6       87.8/95.3/91.4        77.1/89.2/81.3        98.8/68.5/80.9
Avg. (macro-F1)      88.2/82.8/84.1       81.5/88.4/80.2        76.9/90.8/80.2        95.0/80.5/86.3
Results on Set 2
Name                 Jac (P/R/F1)         Arnetminer (P/R/F1)   DISTINCT (P/R/F1)     Our CSLR (P/R/F1)
Hui Fang             100.0/68.8/81.5      59.1/63.7/61.3        81.3/97.9/88.0        100.0/78.9/88.2
Ajay Gupta           96.0/47.0/63.1       60.0/65.4/62.6        65.3/87.9/74.2        96.0/39.6/56.1
Joseph Hellerstein   52.8/80.5/63.7       94.5/95.9/95.2        92.3/89.5/90.0        100.0/79.6/88.6
Rakesh Kumar         100.0/89.0/94.2      98.4/89.3/93.7        89.9/96.0/92.5        99.9/97.8/98.8
Michael Wagner       92.8/59.4/72.4       55.6/36.7/44.2        67.4/98.2/79.1        88.1/64.6/74.6
Bing Liu             97.8/67.0/79.5       75.7/67.2/71.2        83.0/84.7/83.3        98.1/74.7/84.8
Jim Smith            100.0/44.1/61.2      88.6/45.1/59.7        94.8/87.8/90.0        100.0/48.8/65.6
Lei Wang             30.0/79.8/43.6       18.1/83.1/29.8        29.3/85.9/42.4        78.1/87.6/82.6
Wei Wang             40.2/77.0/52.8       9.7/88.2/17.5         25.8/84.2/38.9        81.0/71.8/76.1
Bin Yu               70.6/42.8/53.3       72.4/62.2/66.9        54.0/62.0/57.0        88.0/49.1/63.0
Avg. (macro-F1)      78.0/65.5/66.5       63.2/69.7/60.2        68.3/87.4/73.5        92.9/69.2/77.8
Conclusions

CSLR is superior when the sample size is small and the observation is partial.
When the sample size becomes large, CSLR is close to other measures.
CSLR has slightly lower recall but much higher precision (overall F1 7%-15% better than Jaccard): clusters sharing some common elements (but not enough of them) are not merged.
Low recall with high precision is preferable to high recall with low precision.
Recall can be improved with active learning (Jie Tang et al., ICDM 2011).
Various Applications on Categorical Data

Applicable as long as: sets are similar ⇔ they follow similar distributions.

E.g. author identification:
Extract the frequencies of some feature words.
An author's wording stably follows a characteristic categorical distribution.

Example: Jane Austen's novels and her imitator's novel

Word      Sense & Sensibility   Emma   Sanditon I   Sanditon II
a         147                   186    101          83
an        25                    26     11           29
this      32                    39     15           15
that      94                    105    37           22
with      59                    74     28           43
without   18                    10     10           4
Various Applications on Categorical Data (cont.)

Example: Jane Austen's novels and her imitator's novel (word-frequency table on the previous slide)
CSLR scores between Sense & Sensibility, Emma, and Sanditon I: > 3000
CSLR scores between Sanditon II and Jane Austen's books: < 0.006
The same approach applies to other categorical data problems.
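Under the same assumptions as the earlier jar sketch (Jeffreys smoothing with a wild-card, uniform-Dirichlet P(S)), the ratio can be run directly on the word-count table. This simplified version reproduces the qualitative result, a ratio far above 1 for Austen's own Sanditon I and far below 1 for the imitation; the slide's exact scores come from the authors' implementation:

```python
from math import comb, factorial, prod

def cslr(B, S, theta=0.5):
    """Simplified CSLR: Jeffreys-smoothed estimate from B (wild-card '?'),
    multinomial likelihood of S, divided by P(S) = 1 / C(m+n-1, n)."""
    denom = sum(B.values()) + theta * (len(B) + 1)
    p = {c: (cnt + theta) / denom for c, cnt in B.items()}
    p["?"] = theta / denom
    n = sum(S.values())
    like = (factorial(n) // prod(factorial(c) for c in S.values())) \
        * prod(p.get(c, p["?"]) ** cnt for c, cnt in S.items())
    m = len(set(B) | set(S)) + 1
    return comb(m + n - 1, n) * like

words = ["a", "an", "this", "that", "with", "without"]
sense     = dict(zip(words, [147, 25, 32, 94, 59, 18]))
sanditon1 = dict(zip(words, [101, 11, 15, 37, 28, 10]))  # Austen's own
sanditon2 = dict(zip(words, [83, 29, 15, 22, 43, 4]))    # the imitator's
print(cslr(sense, sanditon1))   # much greater than 1
print(cslr(sense, sanditon2))   # much less than 1
```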
Source Code and Data

All the source code (in Perl) and labeled data: http://github.com/askerlee/namedis
Standalone code of CSLR (implemented in C++, Java, Python...) will be published.
Thank You!
Supplementary Slides

7. Related Work
8. Details of Disambiguation Framework
9. Details of Coauthor-only Clustering
10. Name Ambiguity Estimation
Related Work

Hui Han et al. (PSU), JCDL '04: naive Bayes and SVM to learn evidence strength; needs a training set for each name.
Jie Tang et al. (Tsinghua), TKDE 2011: hidden Markov random field with cosine similarity; tries different author label assignments and finds the one with maximum posterior.
DISTINCT, Xiaoxin Yin et al. (UIUC), ICDE '07: Jaccard & random walk as the set similarity; SVM to train the weight of each feature type (evidence); the training set is generated automatically (randomly), so trained weights vary from run to run.
I. Bhattacharya et al. (U. Maryland), TKDE 2007: models papers' relations as a hypergraph; Jaccard coefficient.
Why Two Stages?
Snowball effect in clustering: merging mistakes are often amplified as clustering proceeds.
A single venue/title is very weak evidence; a big venue/title set is closer to the real distribution and more reliable, but a noisy set is misleading.
Coauthor-only clustering produces very pure clusters: Prec > 99.5% (but low recall: Rec ≈ 60%).
Coauthor-only clustering is therefore well suited to bootstrapping.
Venue Set Expansion and Similarity

Basically, use CSLR to compare two venue sets. But CSLR is crisp: correlations between different venues are lost (KDD ⇔ TKDE ⇔ CIKM...).
Use linear regression to find correlations between venues.
Expand venue sets with correlated venues: V1 ⇒ V1', V2 ⇒ V2'. Calculate CSLR on V1' and V2'.

Example:
V1 = {PAKDD: 2, ICML: 1, ICDE: 1}, V2 = {TKDE: 1, CIKM: 1}
Λ(V1, V2) = 0.09 < 1
V1 ⇒ V1' = {PAKDD: 2, ICML: 1, ICDE: 1, CIKM: 1, TKDE: 0.5}; V2' = V2 = {TKDE: 1, CIKM: 1}
Λ(V1', V2') = 1.05 > 1
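The expansion step can be sketched as adding regression-weighted counts of correlated venues. The correlation weights below are hypothetical, chosen only to reproduce the slide's V1' (in the paper they come from linear regression over venue data):

```python
# Hypothetical correlation weights: venue -> {correlated venue: weight}.
corr = {"PAKDD": {"CIKM": 0.5}, "ICDE": {"TKDE": 0.5}}

def expand(V, corr):
    """Add weight * count of each correlated venue to the venue set V."""
    out = dict(V)
    for v, cnt in V.items():
        for u, w in corr.get(v, {}).items():
            out[u] = out.get(u, 0) + w * cnt
    return out

V1 = {"PAKDD": 2, "ICML": 1, "ICDE": 1}
print(expand(V1, corr))
# {'PAKDD': 2, 'ICML': 1, 'ICDE': 1, 'CIKM': 1.0, 'TKDE': 0.5}
```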
Title Similarity based on BoW and Cosine Similarity
Represent each title as a TF·IDF-weighted bag-of-words.
Cosine similarity then measures the similarity of two BoWs.
Combine Venue/Title-Set Similarity
C1 and C2 have venue sets V1, V2 and title sets T1, T2.
Venue set similarity: sim_venue(V1, V2). Title set similarity: sim_title(T1, T2).
The two types of evidence complement each other:
same topic, but different ranks of conferences/journals;
same conference/journal, but different topics.
Combine them by multiplication (if either is too weak, reject):
sim(C1, C2) = sim_venue(V1, V2) · sim_title(T1, T2)
Coauthor-only Clustering

Sometimes coauthors are not so reliable: Jiawei Han collaborates with ≥ 3 authors named "Wei Wang", so he is a weak-evidential coauthor.
Evidential strength of b: P(a1 = a2 | b), b ∈ co(a1) ∩ co(a2).
Coauthors have different evidential strengths. Two types of shared coauthors:
Weak-evidential: P(a1 = a2 | b) < θq
Strong-evidential: P(a1 = a2 | b) ≥ θq

Clustering procedure:
Merge all pairs which share a strong-evidential coauthor.
Use CSLR to decide when only weak-evidential coauthors are shared.
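The first rule, merging all pairs that share a strong-evidential coauthor, amounts to a union-find pass. The strength table standing in for P(a1 = a2 | b) and the threshold value are illustrative assumptions:

```python
def coauthor_clustering(papers, strength, theta_q=0.9):
    """papers: list of coauthor-name sets, one per paper.  Merge papers
    sharing a coauthor b with strength[b] >= theta_q via union-find."""
    parent = list(range(len(papers)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    seen = {}   # coauthor name -> first paper seen with it
    for i, coauthors in enumerate(papers):
        for b in coauthors:
            if strength.get(b, 0.0) >= theta_q:   # strong-evidential only
                if b in seen:
                    parent[find(i)] = find(seen[b])
                else:
                    seen[b] = i
    clusters = {}
    for i in range(len(papers)):
        clusters.setdefault(find(i), set()).add(i)
    return list(clusters.values())

papers = [{"Jiawei Han"}, {"Jiawei Han", "Alice"}, {"Alice"}, {"Bob"}]
strength = {"Jiawei Han": 0.3, "Alice": 0.95, "Bob": 0.95}  # Han: weak
print(coauthor_clustering(papers, strength))  # papers 1 and 2 merge via Alice
```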
How to Partition Coauthors into Two Sets
The intuition: the more people with whom b has ever coauthored, the more likely two of them share the same name.
Jiawei Han has 162+ Chinese coauthors, including ≥ 3 named "Wei Wang".
Yanghua Xiao has 29+ coauthors, probably only 1 "Wei Wang".
Name Ambiguity Estimation

Name ambiguity κ(e): the number of authors with name e; estimated as κ̂(e).
One of the stop criteria of clustering: if κ̂(e) ≪ 1, name e is rare and there is no need to disambiguate it.
Example: 448 papers by Jiawei Han; κ̂(Jiawei Han) = 0.29, so all are by the same person.

Basic Idea
Assume different name parts are chosen independently.
Estimate P(given name = ABC) and P(surname = XYZ) iteratively.
P(name = Wei Wang) = P(given name = Wei) · P(surname = Wang) = 0.02 · 0.08 = 0.0016
With 125,468 estimated Chinese authors, the expected number of "Wei Wang"s is about 208 (real number: 217).
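The final estimate is simple arithmetic. With the rounded probabilities shown here the product is about 201; the slide's 208 presumably comes from less-rounded estimates:

```python
p_given = 0.02       # rounded estimate of P(given name = Wei)
p_sur = 0.08         # rounded estimate of P(surname = Wang)
n_chinese = 125468   # estimated number of Chinese authors on DBLP
expected = p_given * p_sur * n_chinese
print(round(expected))   # ~201 with these rounded inputs
```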