The Importance of Being Earnest

The Importance of Being Earnest: Non-ambiguous Identification of Information Objects

Andreas Strotmann* ** & Dangzhi Zhao*
* University of Alberta – School of Library and Information Studies, Edmonton, AB, Canada
** GESIS – Leibniz Institute for the Social Sciences, Bonn, Germany

Overview

The Ambiguity Problem
  … a classic cataloguing issue with classic solutions

The Urgency of Disambiguation
  Why it is becoming important to address disambiguation again

The Importance of Being Earnest
  How earnest effort may provide satisfactory solutions to some disambiguation problems in the short run

Back to the Future
  What long-term solutions might look like

The Ambiguity Problem

A classic problem in cataloguing:

Names and titles are not unique
  Names: Journals, Authors, Institutions
  Titles: Books, articles, series, songs, …

Individuals may be referred to in many ways
  People, institutions, places, journals… change their names
  They may be known under different names
    Translation gives multiple titles to a work
    Transliteration results in multiple names for an individual
    Pen names provide additional names for an individual
    Errors creep into references to individuals

Many may be known under the same name
  Common names of institutions or people

The Ambiguity Problem (ctd.)

Authority Files
  The classic solution to the ambiguity problem

Typical authority files are maintained…
  Nationally and internationally (cross-concordances)
  For person names; book titles; journal names; (book) series; place names; institution names; languages; subject headings; classifications; thesauri; …
  By trained information professionals: cataloguers, indexers, …
  Collaboratively across institutions
  Including cross-links between authority files and records

The authority file record for an individual represents one particular person (institution, work…) and that one only
  Serves as a reference point to refer to from all possible variations of the person’s (work’s, …) name/title
  Serves as a reference point for all information available about that person (institution, work, …)
  Remains constant as properties of the individual change (including name, affiliation, hierarchy…) or are added (translations, transliterations, news, …)

The New Urgency of Disambiguation

In one word:
  Semantics

Application: CRIS

Current Research Information Systems – CRIS
  Hot trend around the world, especially Europe
  Maintain current information on research performance
    Individuals
    Institutions
    Regions
    …

Integrate many sources of information
  Funding, projects, publications, citations, costs, personnel, …

=> need to know when different types of information pertain to the same individual person/institution/region/…

Application: Research Evaluation

E.g., University rankings; faculty hiring
  Who worked when for “my” university?
  What did they publish?
    Papers, patents, books, songs, paintings, speeches, expert reports…
  What impact did they have?
    Collaborations, citations, students, media attention, professional manuals, …
    … compared to others “like” them?
  How much did they spend / earn?
  …

The problem of a Zipf distribution for impact measures
  A single(!) disambiguation error was found to cause a major(!) ranking error in (at least one) major published study!

Application: Linked Open Data (LOD)

A simple part of the “Semantic Web”

Goal: a universal collection of all kinds of facts about all kinds of entities, e.g.,
  People and their friends and acquaintances
  Works and their impact
  Places and their coordinates

Subgoal I: identification of identical entities across the many separate collections of facts
  E.g., PubMed ID, Scopus ID, … of the same article

Subgoal II: cross-linkages among individual entities
  E.g., Dangzhi-Zhao teaches-at UofA-SLIS
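
The two subgoals can be illustrated with a small sketch, assuming facts are stored as plain (subject, predicate, object) triples. The identifiers (`pmid:…`, `scopus:…`, `doi:…`), the predicate names other than `teaches-at`, and the helper functions are hypothetical placeholders for illustration, not real records or a real LOD vocabulary.

```python
# Hypothetical identifiers and predicates, for illustration only.
same_as = [
    ("pmid:0000001", "scopus:A-0001"),   # Subgoal I: one article, two database IDs
    ("scopus:A-0001", "doi:10.0000/x"),
]
facts = [
    ("pmid:0000001", "cited-by-count", 12),
    ("doi:10.0000/x", "has-title", "An example article"),
    ("Dangzhi-Zhao", "teaches-at", "UofA-SLIS"),   # Subgoal II: a cross-link
]

def canonical_ids(pairs):
    """Map every identifier to one representative per equivalence class."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Use the alphabetically smallest identifier as representative.
            parent[max(ra, rb)] = min(ra, rb)
    return {x: find(x) for x in parent}

def merge_facts(facts, pairs):
    """Collect all facts about the same entity under one identifier."""
    canon = canonical_ids(pairs)
    merged = {}
    for subject, predicate, obj in facts:
        merged.setdefault(canon.get(subject, subject), []).append((predicate, obj))
    return merged
```

Once the identity links are resolved, facts recorded under different database IDs end up attached to the same entity, which is what makes the cross-links of Subgoal II usable.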

Application: Science Mapping

Science as a complex network
  Authors, publications, institutions…
  Linked by authorship, cited references, collaboration, …

The Importance of Being Earnest about Author Name Disambiguation
  Names alone don’t identify individual authors
  Multiple nodes for one author: join into one node
  Multiple authors for one node: separate into several

Results of an Earnest Effort at Author Name Disambiguation
  Visualization shows similar structure to “full name” subgraph
  [Figure panels: disambiguated; “raw”, full names]

A Well-Known Problem: Author Name Ambiguities

Authors may publish under many names
  Transliteration ambiguities (ü/u/ue; Ze/Tse); Misspellings
  Name changes (marriage or cross-cultural adaptation)
  Given name adaptations (Robert James/Robert J./Bob/R.J/R.)

Author names may refer to many different individuals
  Common surnames (Smith, Müller, Lee, Wang…)
  Common given names (Jane, Robert)
  Given-name initials

Affects bibliographic data collection
  Information retrieval (author search is extremely popular)
  Bibliometric analysis – e.g., author rankings, co-citation studies
    Co-occurrence analysis reduces author ambiguity problem: author name pairs much less ambiguous

The Traditional Approach to Author Name Disambiguation (AND)

Name ambiguities
  a well-known problem in bibliographic data management
  … with a well-known standard solution

Traditional normalization of author names
  Transliteration to standard character set (a.k.a. romanization)
  Representation as last name plus initials
    In retrieval, usually represented as last name plus first initial

Features of this standard approach to AND
  Very good recall, but often low precision
  Works well for most European names in small databases
  Terrible for East Asian names
    20 Chinese / 3 Korean last names cover ~50% of populations
    Romanization exacerbates problem, ignoring distinctions in original script
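
The traditional normalization can be sketched roughly as follows. This is a minimal illustration, not the procedure of any particular database: the function names are made up, input is assumed to be in "Surname, Given Names" form, and "romanization" is reduced here to stripping diacritics from Latin-script names (real romanization of other scripts needs a transliteration scheme, and variants such as ü/ue are not handled).

```python
import unicodedata

def romanize(text):
    """Strip diacritics: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def traditional_key(name, initials=1):
    """Reduce 'Surname, Given Names' to surname plus given-name initials."""
    surname, _, given = romanize(name).partition(",")
    letters = [part[0].upper() for part in given.replace("-", " ").split()]
    # capitalize() lowercases everything after the first letter of the surname.
    key = f"{surname.strip().capitalize()}, {''.join(letters[:initials])}"
    return key.rstrip(", ")

# Distinct (hypothetical) authors collapse onto the same key:
# high recall, low precision.
assert traditional_key("Müller, Jürgen") == "Muller, J"
assert traditional_key("Wang, Jian") == traditional_key("Wang, Jun") == "Wang, J"
```

With `initials=1` the key corresponds to the "last name plus first initial" form used in retrieval; a larger value keeps more initials and merges slightly fewer names.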

Our Test Case: International Stem Cell Research 2004-’09

Stem Cell research
  Interdisciplinary biomedical research field with social science aspects
    Quite young, growing rapidly – currently ~10k journal papers/year
    Strong international research programmes, esp. in China and Korea

Data set characteristics
  PubMed subject search “stem cell”, published 2004-2009, + metadata for cited references
  References obtained from Scopus and completed via PubMed
    Note that this involves reference disambiguation
  Almost all author names available (both citing and cited)
  Most names in full, but large percentage just surname+initial
    No author IDs available – only names
  PubMed is renowned for the excellent quality of its metadata

Testing the Traditional Approach to AND: Algorithmic AND

Contrast the simple traditional approach against a complex algorithmic approach to AND
  Automatic disambiguation of hundreds of thousands of author names required – impossible by hand

Based largely on co-authorship patterns
  Most available as full names (as opposed to last name and initials)
  The same name on two papers with a lot of the same names probably belongs to the same person

Success rate estimated at 80-90%
  Typical rate for automatic AND algorithms
  Better than the traditional approach, but not at all perfect
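
The co-authorship heuristic can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the algorithm used in the study: the function name, the `min_shared` threshold, and the sample author names are all hypothetical, and papers are grouped simply by transitively shared co-author names.

```python
def disambiguate(papers, name, min_shared=1):
    """Group papers carrying `name` into clusters that share co-authors.

    papers: dict mapping paper id -> list of author name strings.
    Returns a list of sets of paper ids, one set per presumed individual.
    """
    clusters = []  # list of (paper ids, pooled co-author names)
    for pid, authors in papers.items():
        if name not in authors:
            continue
        others = set(authors) - {name}
        merged_ids, merged_pool = {pid}, set(others)
        remaining = []
        for ids, pool in clusters:
            if len(pool & others) >= min_shared:
                # Enough co-authors in common: presumably the same person.
                merged_ids |= ids
                merged_pool |= pool
            else:
                remaining.append((ids, pool))
        remaining.append((merged_ids, merged_pool))
        clusters = remaining
    return [ids for ids, _ in clusters]

# Hypothetical sample data: two different people publishing as "Lee, Jae".
papers = {
    "p1": ["Lee, Jae", "Kim, Soo", "Park, Min"],
    "p2": ["Lee, Jae", "Kim, Soo"],
    "p3": ["Lee, Jae", "Smith, Ann"],
    "p4": ["Lee, Jae", "Smith, Ann", "Jones, Bo"],
    "p5": ["Wang, Yu", "Kim, Soo"],
}
```

Calling `disambiguate(papers, "Lee, Jae")` on this sample splits the four papers into two groups, `p1`/`p2` and `p3`/`p4`. The sketch also shows why such methods are imperfect: an author who changes collaborators completely is split in two, and two namesakes who share a co-author are merged.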

Testing the Traditional Approach to AND: Author Citation Ranking Results

name                 rank   rank 1   %cite1
Weissman, I             1        1      100
Gage, F                 2        2      100
Smith, A *              3        5       80
Prockop, D              4        3      100
Caplan, A               5        7      100
Alvarez-buylla, A       6        4      100
Lee, J **               7      918        8
Mckay, R *              8       10       90
Morrison, S             9        6      100
Chen, J **             10      634       12
Wang, J **             11     2021        6
Kim, S **              12      628       13
Wang, Y **             13     1476        8
Lee, S **              14      646       13
Jaenisch, R            15        8      100

In 15-most-highly-cited list:
  7 are correct (~unique)
  * 2 are still OK
    Several authors (dozens)
    1 dominant (80-90% of cites)
  ** 6 are massively wrong
    100s of authors (300-500)
    Highest of these
      Received mere ~5-15% of cites
      Gained hundreds in rank
    Mostly Chinese or Korean
      Lee is Chinese, Korean, and English common name

Similar for top-100



=> Do not rank authors this way, ever !!!
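
The ranking distortion can be reproduced with a toy example. The sketch below uses invented names and invented citation counts (they are not taken from the study); it only shows how pooling many modestly cited namesakes under one surname+initial key can lift that key above genuinely highly cited individuals.

```python
from collections import Counter

def rank_by_citations(cites, key=lambda author: author):
    """Rank authors by citation total after grouping them with `key`."""
    totals = Counter()
    for author, n in cites.items():
        totals[key(author)] += n
    # Highest total first; ties broken alphabetically for determinism.
    ordered = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return {name: rank for rank, (name, _) in enumerate(ordered, start=1)}

def surname_initial(full_name):
    """Traditional key: surname plus first given-name initial."""
    surname, _, given = full_name.partition(", ")
    return f"{surname}, {given[:1]}"

# Invented data: two highly cited individuals and 100 namesakes
# with 10 citations each.
cites = {"Alpha, Ann": 900, "Beta, Bob": 800}
cites.update({f"Lee, J{i}": 10 for i in range(100)})

by_person = rank_by_citations(cites)
by_key = rank_by_citations(cites, key=surname_initial)
```

Here `by_key["Lee, J"]` is rank 1 with a pooled total of 1000 citations, although no individual behind that key ranks higher than third in `by_person`.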

Testing the Traditional Approach to AND: Author Cocitation Analysis (ACA) Results

ACA result visualizations
  with automatic (top) and traditional (bottom) AND

Confirm that ACA is a very robust statistical method
  Structural correspondences readily apparent

Mass authors factor out
  Chinese/Korean names
  (yellow central nodes)
  Significantly disturb field structure visualization

=> significantly more useful results with better AND

The Importance of Being Earnest 

When it comes to researcher names,
  Lazy name disambiguation gives unacceptable results
  Imperfect but “earnest” semi-automatic name disambiguation can give reasonable and useful results for some purposes
    Especially for robust statistical analyses based on network structure
    Even where there are important Asian research programs

Perfection in name disambiguation remains elusive, but is required for meaningful ranking or

The Traditional Approach to AND: A Quick Fix for Analyses of Biomedical Fields 

Last-author analysis 

Last author=lab head 





#labs complete phrases in many languages Basic design has been tested and is being tested on small cases

GESIS – Leibniz Institute for the Social Sciences
  Similar to NLH/NCBI, but for Germany instead of US, and for (Empirical) Social Sciences not Medicine/Biology
  ~250 people, mostly scientists and PhD/Grad students
  Support full research life cycle
    Design; methodology; surveying; analysis; archival; curation of studies
    Focus on survey-based empirical social science studies

Special Information Services for the Social Sciences
  German (language) social science literature (books, articles, grey lit)
  German/Swiss/Austrian social science research project database
  Citation index Cambridge Scientific Abstracts (German National Lic.)
  Social Sciences Open Access Repository (SSOAR)
  Data Registration Agency (DOIs for citable data sets)
  German Center of Excellence – Women in Science (database/registry)
  Bibliometric / scientometric studies of/for social sciences

Towards a Semantic Digital Library for Social Science Research

Goals
  Integration of wide range of information services, archives
  Providing a central reference point for social science information
    Social science thesaurus (TheSoz – multilingual)
    Literature, people, projects
    Social science data (survey designs, methods, results…)
    Reliable information

Semantic enrichment serves all these purposes!
  And more: e.g., international dissemination of (German) social science research results via multilingual access to semantics

Questions? 

Thank you!



Acknowledgments

Funding
  Social Sciences and Humanities Research Council, Canada
  GESIS – Leibniz Institute for the Social Sciences, Germany

Software
  Guo, Gencheng – programming (Java) of data collection
  A. Strotmann – programming (Python) & algorithm development
  Pajek – visualization
  SPSS – factor analysis

Implications for Bibliographic Databases

As bibliographic databases are (ab)used more and more for evaluation purposes, reliable identification of individual authors, institutions, publications or other metadata becomes of paramount importance
In the medium to long term, reliable authority control mechanisms may provide a solution to the ambiguity problem
In the short term (and retroactively), all bibliographic databases need to provide full names of authors or institutions
Original names in the original scripts should also be recorded to reduce ambiguities

Author Cocitation Analysis (ACA)

Count n times Author A cocited with Author B
  Same paper cites both A and B

Matrix of cocitation counts between 100-200 most highly cited members of a research community
Factor analysis of cocitation matrix (oblique rotation)
Resulting factors ~ research specialties
  Hand labeled from publications of authors in factor

Visualization of factor analysis result
  Nodes: authors and factors (specialties)
  Connections (links): ~loading of author on factor
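
The counting step and the cocitation matrix can be sketched as below. This is a minimal illustration with hypothetical function names and placeholder author labels. It covers only the counting and matrix construction, and assumes each citing paper is given as a list of cited author names; the factor analysis and visualization steps are not shown, and leaving the diagonal at zero is just one possible choice.

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(reference_lists):
    """Count, for each author pair, the papers that cite both authors.

    reference_lists: iterable of lists of cited author names,
    one list per citing paper.
    """
    counts = Counter()
    for cited in reference_lists:
        # Each author counts once per citing paper, however often cited there.
        for a, b in combinations(sorted(set(cited)), 2):
            counts[(a, b)] += 1
    return counts

def cocitation_matrix(counts, authors):
    """Symmetric matrix of cocitation counts for the selected authors."""
    index = {a: i for i, a in enumerate(authors)}
    matrix = [[0] * len(authors) for _ in authors]
    for (a, b), n in counts.items():
        if a in index and b in index:
            matrix[index[a]][index[b]] = n
            matrix[index[b]][index[a]] = n
    return matrix

# Three citing papers with placeholder author labels.
refs = [["A", "B", "C"], ["A", "B"], ["B", "C", "C"]]
counts = cocitation_counts(refs)
matrix = cocitation_matrix(counts, ["A", "B", "C"])
```

Because the counts are keyed by author name, any name ambiguity left in the reference lists flows directly into the matrix, which is why the disambiguation step precedes this one.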

ACA for Stem Cell Res. 04-09, automatic name disambiguation

ACA for Stem Cell Res. 04-09, traditional AND

Last-author ACA, traditional AND