The Importance of Being Earnest: Non-ambiguous Identification of Information Objects
Andreas Strotmann* ** & Dangzhi Zhao*
*University of Alberta – School of Library and Information Studies, Edmonton, AB, Canada
**GESIS – Leibniz Institute for the Social Sciences, Bonn, Germany
Overview
The Ambiguity Problem
… a classic cataloguing issue with classic solutions
The Urgency of Disambiguation
Why it is becoming important to address disambiguation again
The Importance of Being Earnest
How earnest effort may provide satisfactory solutions to some disambiguation problems in the short run
Back to the Future
What long-term solutions might look like
The Ambiguity Problem
A classic problem in cataloguing:
Names and titles are not unique
Names: journals, authors, institutions
Titles: books, articles, series, songs, …
Individuals may be referred to in many ways
People, institutions, places, journals… change their names
They may be known under different names
Translation gives multiple titles to a work
Transliteration results in multiple names for an individual
Pen names provide additional names for an individual
Errors creep into references to individuals
Many may be known under the same name
Common names of institutions or people
The Ambiguity Problem (ctd.)
Authority Files
The classic solution to the ambiguity problem
Typical authority files are maintained…
Nationally and internationally (cross-concordances)
For person names; book titles; journal names; (book) series; place names; institution names; languages; subject headings; classifications; thesauri; …
By trained information professionals: cataloguers, indexers, …
Collaboratively across institutions
Including cross-links between authority files and records
The authority file record for an individual represents one particular person (institution, work…) and that one only
Serves as a reference point to refer to from all possible variations of the person’s (work’s,…) name/title
Serves as a reference point for all information available about that person (institution, work,…)
Remains constant as properties of the individual change (including name, affiliation, hierarchy…) or are added (translations, transliterations, news,…)
The New Urgency of Disambiguation
In one word:
Semantics
Application: CRIS
Current Research Information Systems – CRIS
Hot trend around the world, especially in Europe
Maintain current information on research performance
Individuals
Institutions
Regions
…
Integrate many sources of information
Funding, projects, publications, citations, costs, personnel, …
=> need to know when different types of information pertain to the same individual person/institution/region/ …
Application: Research Evaluation
E.g., University rankings; faculty hiring
Who worked when for “my” university?
What did they publish?
Papers, patents, books, songs, paintings, speeches, expert reports…
What impact did they have?
Collaborations, citations, students, media attention, professional manuals,…
… compared to others “like” them?
How much did they spend / earn? …
The problem of a Zipf distribution for impact measures
A single(!) disambiguation error was found to cause a major(!) ranking error in (at least one) major published study!
Application: Linked Open Data (LOD)
A simple part of the “Semantic Web”
Goal: a universal collection of all kinds of facts about all kinds of entities, e.g.,
People and their friends and acquaintances
Works and their impact
Places and their coordinates
Subgoal I: identification of identical entities across the many separate collections of facts
E.g., PubMed ID, Scopus ID, … of the same article
Subgoal II: cross-linkages among individual entities
E.g., Dangzhi-Zhao teaches-at UofA-SLIS
Application: Science Mapping
Science as a complex network
Authors, publications, institutions… Linked by authorship, cited references, collaboration,…
The Importance of Being Earnest about Author Name Disambiguation
Multiple nodes for one author: join into one node
Multiple authors for one node: separate into several
Names alone don’t identify individual authors
Results of an Earnest Effort at Author Name Disambiguation
Visualization shows similar structure to “full name” subgraph
[Figure panels: disambiguated; “raw”, full names]
A Well-Known Problem: Author Name Ambiguities
Authors may publish under many names
Transliteration ambiguities (ü/u/ue; Ze/Tse); misspellings
Name changes (marriage or cross-cultural adaptation)
Given name adaptations (Robert James/Robert J./Bob/R.J./R.)
Author names may refer to many different individuals
Common surnames (Smith, Müller, Lee, Wang…)
Common given names (Jane, Robert)
Given-name initials
Affects bibliographic data collection
Information retrieval (author search is extremely popular)
Bibliometric analysis – e.g., author rankings, co-citation studies
Co-occurrence analysis reduces author ambiguity problem: author name pairs much less ambiguous
The Traditional Approach to Author Name Disambiguation (AND)
Name ambiguities
a well-known problem in bibliographic data management … with a well-known standard solution
Traditional normalization of author names
Transliteration to standard character set (a.k.a. romanization)
Representation as last name plus initials
In retrieval, usually represented as last name plus first initial
Features of this standard approach to AND
Very good recall, but often low precision
Works well for most European names in small databases
Terrible for East Asian names
20 Chinese / 3 Korean last names cover ~50% of populations
Romanization exacerbates problem, ignoring distinctions in original script
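The traditional normalization described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and stripping diacritics is only a crude stand-in for real romanization, which needs script-specific transliteration rules.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (e.g. 'ü' -> 'u'); a crude stand-in for romanization."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_author(surname: str, given: str) -> str:
    """Reduce a name to 'Surname, I' (last name plus first initial).

    Assumption of this sketch: capitalize() is used for the surname,
    so internal capitals are flattened ('McKay' becomes 'Mckay').
    """
    surname_norm = strip_diacritics(surname).strip().capitalize()
    initial = strip_diacritics(given).strip()[:1].upper()
    return f"{surname_norm}, {initial}" if initial else surname_norm


# Two different given names collapse to the same key: high recall, low precision.
print(normalize_author("Müller", "Jürgen"))   # Muller, J
print(normalize_author("Lee", "Jae-won"))     # Lee, J
print(normalize_author("Lee", "Jing"))        # Lee, J
```

Every author who shares a surname and first initial ends up under one key, which is why the approach degrades when a few surnames cover a large share of a population.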
Our Test Case: International Stem Cell Research 2004-‘09
Stem Cell research
Interdisciplinary biomedical research field with social science aspects
Quite young, growing rapidly – currently ~10k journal papers/year
Strong international research programmes, esp. in China and Korea
Data set characteristics
PubMed subject search “stem cell”, published 2004-2009, + metadata for cited references
References obtained from Scopus and completed via PubMed
Note that this involves reference disambiguation
Almost all author names available (both citing and cited)
Most names in full, but a large percentage just surname + initial
No author IDs available – only names
PubMed is renowned for the excellent quality of its metadata
Testing the Traditional Approach to AND: Algorithmic AND
Contrast the simple traditional approach against a complex algorithmic approach to AND
Automatic disambiguation of hundreds of thousands of author names required – impossible by hand
Based largely on co-authorship patterns
Most names available in full (as opposed to last name and initials)
The same name on two papers that share many co-author names probably belongs to the same person
Success rate estimated at 80-90%
Typical rate for automatic AND algorithms
Better than the traditional approach, but not at all perfect
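The co-authorship heuristic above can be sketched in a few lines of Python. This is an illustrative toy, not the algorithm used in the study: the function name, the `min_shared` threshold, and the sample data are all assumptions. Papers carrying the same author name are merged into one presumed individual when they share enough other author names.

```python
from collections import defaultdict
from itertools import combinations


def disambiguate(papers, name, min_shared=1):
    """Cluster the papers carrying `name` by overlapping co-author names.

    papers: dict mapping a paper id to the set of author names on it.
    Returns a list of sets of paper ids, one set per presumed individual.
    """
    ids = sorted(pid for pid, authors in papers.items() if name in authors)
    parent = {pid: pid for pid in ids}

    def find(pid):
        # Union-find lookup with path halving.
        while parent[pid] != pid:
            parent[pid] = parent[parent[pid]]
            pid = parent[pid]
        return pid

    for a, b in combinations(ids, 2):
        shared = (papers[a] & papers[b]) - {name}
        if len(shared) >= min_shared:
            parent[find(a)] = find(b)  # same presumed person

    clusters = defaultdict(set)
    for pid in ids:
        clusters[find(pid)].add(pid)
    return sorted(clusters.values(), key=min)


# Toy data with invented names: two different people publish as "Lee, J".
toy_papers = {
    "p1": {"Lee, J", "Alpha, A", "Beta, B"},
    "p2": {"Lee, J", "Alpha, A"},
    "p3": {"Lee, J", "Gamma, C"},
    "p4": {"Lee, J", "Gamma, C", "Delta, D"},
    "p5": {"Delta, D", "Epsilon, E"},
}
print(disambiguate(toy_papers, "Lee, J"))
```

On the toy data, papers p1 and p2 form one cluster and p3 and p4 another. The co-author names used as evidence are themselves ambiguous, which is one reason such heuristics stay well short of perfect.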
Testing the Traditional Approach to AND: Author Citation Ranking Results

name                 rank   rank 1   %cite1
Weissman, I             1        1      100
Gage, F                 2        2      100
Smith, A *              3        5       80
Prockop, D              4        3      100
Caplan, A               5        7      100
Alvarez-buylla, A       6        4      100
Lee, J **               7      918        8
Mckay, R *              8       10       90
Morrison, S             9        6      100
Chen, J **             10      634       12
Wang, J **             11     2021        6
Kim, S **              12      628       13
Wang, Y **             13     1476        8
Lee, S **              14      646       13
Jaenisch, R            15        8      100

In 15-most-highly-cited list:
7 are correct (~unique)
* 2 are still OK
Several authors (dozens)
1 dominant (80-90% of cites)
** 6 are massively wrong
100s of authors (300-500)
Highest of these
Received mere ~5-15% of cites
Gained hundreds in rank
Mostly Chinese or Korean
Lee is Chinese, Korean, and English common name
Similar for top-100
=> Do not rank authors this way, ever !!!
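A small Python sketch with invented numbers shows how this distortion arises. It assumes that "rank" is the rank of a merged name, "rank 1" the rank of the most-cited individual behind that name, and "%cite1" that individual's share of the name's citations. That reading of the column labels, the function name, and all data below are assumptions made for illustration.

```python
from collections import Counter


def ranking_report(cites):
    """cites: dict mapping (merged_name, person_id) -> citation count.

    Returns {merged_name: (rank of merged name,
                           rank of its most-cited individual,
                           that individual's share of the name's cites in %)}.
    """
    by_name = Counter()
    for (name, _person), n in cites.items():
        by_name[name] += n

    # Rank individuals by their own citation counts (1 = most cited).
    people = sorted(cites, key=cites.get, reverse=True)
    person_rank = {key: i for i, key in enumerate(people, start=1)}

    report = {}
    for rank, (name, total) in enumerate(by_name.most_common(), start=1):
        top = max((k for k in cites if k[0] == name), key=cites.get)
        report[name] = (rank, person_rank[top], round(100 * cites[top] / total))
    return report


# Invented toy data: ten moderately cited people share one name.
toy_cites = {("Common, J", i): 100 + i for i in range(1, 11)}
toy_cites.update({("Solo, A", 1): 900, ("Other, B", 1): 500, ("Third, C", 1): 300})

for name, row in ranking_report(toy_cites).items():
    print(name, row)
```

In the toy data the merged name "Common, J" ranks first overall, although its most-cited individual ranks only fourth and holds about 10% of the name's citations.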
Testing the Traditional Approach to AND: Author Cocitation Analysis (ACA) Results
ACA result visualizations with automatic (top) and traditional (bottom) AND
Confirm that ACA is a very robust statistical method
Structural correspondences readily apparent
Mass authors factor out
Chinese/Korean names (yellow central nodes)
Significantly disturb field structure visualization
Significantly more useful results with better AND
The Importance of Being Earnest
When it comes to researcher names,
Lazy name disambiguation gives unacceptable results
Imperfect but “earnest” semi-automatic name disambiguation can give reasonable and useful results for some purposes
Especially for robust statistical analyses based on network structure
Even where there are important Asian research programs
Perfection in name disambiguation remains elusive, but is required for meaningful ranking or
The Traditional Approach to AND: A Quick Fix for Analyses of Biomedical Fields
Last-author analysis
Last author=lab head
#labs
complete phrases in many languages
Basic design has been tested and is being tested on small cases
GESIS – Leibniz Institute for the Social Sciences
Similar to NLM/NCBI, but for Germany instead of the US, and for (empirical) social sciences, not medicine/biology
~250 people, mostly scientists and PhD/grad students
Support full research life cycle
Design; methodology; surveying; analysis; archival; curation of studies
Focus on survey-based empirical social science studies
Special Information Services for the Social Sciences
German (language) social science literature (books, articles, grey lit)
German/Swiss/Austrian social science research project database
Citation index
Cambridge Scientific Abstracts (German National Lic.)
Social Sciences Open Access Repository (SSOAR)
Data Registration Agency (DOIs for citable data sets)
German Center of Excellence – Women in Science (database/registry)
Bibliometric / scientometric studies of/for social sciences
Towards a Semantic Digital Library for Social Science Research
Goals
Integration of wide range of information services, archives
Providing a central reference point for social science information
Social science thesaurus (TheSoz – multilingual)
Literature, people, projects
Social science data (survey designs, methods, results…)
Reliable information
Semantic enrichment serves all these purposes!
And more: e.g., international dissemination of (German) social science research results via multilingual access to semantics
Questions?
Thank you!
Acknowledgments
Funding
Social Sciences and Humanities Research Council, Canada
GESIS – Leibniz Institute for the Social Sciences, Germany
Software
Guo, Gencheng – programming (Java) of data collection
A. Strotmann – programming (Python) & algorithm development
Pajek – visualization
SPSS – factor analysis
Implications for Bibliographic Databases
As bibliographic databases are (ab)used more and more for evaluation purposes, reliable identification of individual authors, institutions, publications or other metadata becomes of paramount importance
In the medium to long term, reliable authority control mechanisms may provide a solution to the ambiguity problem
In the short term (and retroactively), all bibliographic databases need to provide full names of authors or institutions
Original names in the original scripts should also be recorded to reduce ambiguities
Author Cocitation Analysis (ACA)
Count the number of times Author A is cocited with Author B
Same paper cites both A and B
Matrix of cocitation counts between 100-200 most highly cited members of a research community
Factor analysis of cocitation matrix (oblique rotation)
Resulting factors ~ research specialties
Hand labeled from publications of authors in factor
Visualization of factor analysis result
Nodes: authors and factors (specialties) Connections (links): ~loading of author on factor
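The cocitation counting step of ACA can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the toy reference lists are hypothetical, each citing paper is represented only by the list of author names it cites, and the later steps (selecting the most highly cited authors, factor analysis with oblique rotation) are not shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each author pair, how many citing papers cite both authors.

    reference_lists: one iterable of cited author names per citing paper.
    Returns a Counter keyed by alphabetically ordered (author_a, author_b).
    """
    counts = Counter()
    for cited_authors in reference_lists:
        # A paper counts once per pair, however often it cites each author.
        for a, b in combinations(sorted(set(cited_authors)), 2):
            counts[(a, b)] += 1
    return counts


# Toy data: three citing papers and the (invented) authors they cite.
toy_refs = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C", "B"],
]
print(cocitation_counts(toy_refs))
```

Because the counts are keyed by author name, any name ambiguity left in the cited references flows directly into the cocitation matrix.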
ACA for Stem Cell Research 2004-09, automatic name disambiguation
ACA for Stem Cell Research 2004-09, traditional AND
Last-author ACA, traditional AND