Author Name Disambiguation in MEDLINE Based on Domain of Research
Dina Vishnyakova, PhD
Data Science, Roche Innovation Center Basel
SwissText 2017
Motivation
• Track scientific research over time and across publications
• Associate topics with authors
• Identify the best researchers/research
Author Name Disambiguation (AND) in Publications Verified Author?
Source of Publications
We are disambiguating publications in MEDLINE. MEDLINE is a database that contains journal citations and abstracts for biomedical literature from around the world.
Provides free access to Medline
Researchers, scientists, etc
Do you think it is easy?
The most common author names (surname plus first initial) in publications, as of 2016:
• Wang Y: over 8000 publications
• Zhang Y: over 7000 publications
• Li Y: over 6000 publications
• Liu Y, Wang J, Wang X, ...
Which author is it? Author 1? Author 2? Author 3? Author 4? Author 5?
*According to a comprehensive survey of residential permits released by the Chinese Ministry of Public Security on April 24, 2007, the ten most common surnames in mainland China are Wang, Li, Zhang, Liu, Chen, Yang, Huang, Zhao, Wu and Zhou.
What is done to solve the problem...
Existing identifiers (ORCID, Scopus...) could potentially help, BUT an ID is not mandatory for publishing ->
- Not every author has an ID
- ID is not linked to all previous papers
- Missing information
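Information Availability in Publications
[Chart: share of records in which each field is Available vs. Not Available. Fields: Affiliation, MeSH (Medical Subject Heading), Title, Abstract. Percentage pairs shown on the chart: 9% / 91%, 46% / 54%, 49% / 51%.]
+Affiliation is provided mainly for the first author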
What we extract from available information
1) Email (regular expression)
2) Organisation (NER)
3) City (NER & dictionary-based)
4) Country (NER & dictionary-based)
5) Author’s job title?!
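A minimal sketch of step 1, assuming the email address appears as plain text inside the affiliation string. The pattern, the function name `extract_emails`, and the sample affiliation are illustrative assumptions, not the expressions used in this work.

```python
import re

# Deliberately simple pattern (an assumption for illustration); real
# affiliation strings may need a stricter or a more tolerant expression.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(affiliation: str) -> list[str]:
    """Return all e-mail-like substrings of an affiliation string, lowercased."""
    return [match.lower() for match in EMAIL_RE.findall(affiliation)]


# Hypothetical affiliation string, not taken from MEDLINE.
example = ("Department of Biology, Example University, Basel, Switzerland. "
           "Jane.Doe@example.org.")
print(extract_emails(example))
```

Lowercasing makes addresses comparable across publications, so two records with the same normalized email can be treated as strong evidence for the same author.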
Text Mining and Machine Learning if information is limited
Human Expert
• Analyses titles and abstracts
• Searches Google Scholar, ResearchGate, LinkedIn, etc.
BUT
• Information is limited, even on the web
• It takes time (some author names require processing more than 1000 articles)
Machine Learning and Text Mining
• Works with limited information
• Finds relations that are not visible to humans
• Time efficient
BUT
• Need to choose an algorithm (evaluation problem)
• Supervised algorithms require training and test sets (gold standard)
• Unsupervised ones are more difficult to fine-tune
Our AND methodology is based on supervised algorithms, for which feature selection is crucial
Example of publication Hematology
Vascular Disease
Molecular Biology
Author
Co-Authors
+One publication can be a product of a collaboration and can cover several domains
Machine Learning Features: Identifying research domains
Input: Title/Abstract/MeSH
Journal Descriptors: Vascular Disease | Endocrinology | Molecular Biology
Semantic Types: Amino Acid; Peptide; Protein | Biologically Active Substance | Molecular Function
Identifying research domains and disciplines improves the AND process, especially when information about Affiliation is missing.
Example: Domains of Research for Disambiguation
PMID 12612331, Last N: Zhang, First N: DL
  Journal Descriptors: Gastroenterology | Pediatrics | Nutritional Sciences
  Semantic Types: Finding | Pathologic Function | Sign or Symptom
PMID 7646436, Last N: Zhang, First N: D
  Journal Descriptors: Vascular Disease | Endocrinology | Molecular Biology
  Semantic Types: Amino Acid; Peptide; Protein | Biologically Active Substance | Molecular Function
PMID 10783774, Last N: Zhang, First N: D
  Journal Descriptors: Medical Informatics | Statistics as Topic | Diagnostic Imaging
  Semantic Types: Conceptual Entity | Temporal Concept | Entity
PMID 11550930, Last N: Zhang, First N: D
  Journal Descriptors: Acquired Immunodeficiency Syndrome | Dentistry | Statistics as Topic
  Semantic Types: Conceptual Entity | Activity | Quantitative Concept
PMID 11891134, Last N: Zhang, First N: D
  Journal Descriptors: Tropical Medicine | Hematology | Molecular Biology
  Semantic Types: Amino Acid; Peptide; Protein | Biologically Active Substance | Molecular Function
Intersections in domains
Red color = Author 1, Blue color = Author 2
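One way to turn these intersections into numeric features is sketched here, using the five example records. The Jaccard overlap, the `records` layout, and the function names are assumptions made for illustration; the slides do not state how the domain features are encoded for the classifier.

```python
from itertools import combinations

# PMID -> (journal descriptors, semantic types), copied from the example rows.
records = {
    "12612331": ({"Gastroenterology", "Pediatrics", "Nutritional Sciences"},
                 {"Finding", "Pathologic Function", "Sign or Symptom"}),
    "7646436": ({"Vascular Disease", "Endocrinology", "Molecular Biology"},
                {"Amino Acid; Peptide; Protein", "Biologically Active Substance",
                 "Molecular Function"}),
    "10783774": ({"Medical Informatics", "Statistics as Topic", "Diagnostic Imaging"},
                 {"Conceptual Entity", "Temporal Concept", "Entity"}),
    "11550930": ({"Acquired Immunodeficiency Syndrome", "Dentistry",
                  "Statistics as Topic"},
                 {"Conceptual Entity", "Activity", "Quantitative Concept"}),
    "11891134": ({"Tropical Medicine", "Hematology", "Molecular Biology"},
                 {"Amino Acid; Peptide; Protein", "Biologically Active Substance",
                  "Molecular Function"}),
}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pair_features(pmid_a: str, pmid_b: str) -> dict:
    """Domain-overlap features for one pair of publications."""
    jd_a, st_a = records[pmid_a]
    jd_b, st_b = records[pmid_b]
    return {"jd_overlap": jaccard(jd_a, jd_b), "st_overlap": jaccard(st_a, st_b)}


# Print only the pairs that share at least one descriptor or semantic type.
for a, b in combinations(records, 2):
    features = pair_features(a, b)
    if features["jd_overlap"] or features["st_overlap"]:
        print(a, b, features)
```

With these rows, only two pairs share any domain: 7646436 with 11891134, and 10783774 with 11550930. A supervised classifier could take such overlap scores as input alongside other pairwise features.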
Results achieved so far...
Most research studies in Author Name Disambiguation claim to disambiguate authors with an average accuracy of 95%, BUT the evaluations were not done on the same data set, or so-called gold standard...
Gold Standard for AND problem
The only existing gold standard
• Only 1st authors, no co-authors
• Few non-Western names
• 99% of Authors have information on their Affiliation
Actual variety of author profiles in MEDLINE
• Missing Affiliation
• Missing Abstracts
• Old publications
• Prevalence of Asian-origin names
Results of Disambiguation
Algorithm: J4.8

Metrics     MF+JD    MF+ST    MF+JD+ST   Song et al. (2015)
Precision   0.986    0.975    0.987      0.9776
Recall      0.992    0.961    0.994      0.9545
F-Measure   0.989    0.9675   0.990      0.9657
Using additional features such as domains of research, we have achieved better results than the state of the art.
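For readers who want to relate the three metrics, the sketch below computes the F-measure as the harmonic mean of precision and recall. This is the standard definition; the function name `f_measure` is illustrative, and the slides do not say whether the reported scores were averaged over author names, so recomputing from the table need not reproduce every column exactly.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for two of the feature sets.
reported = {"MF+JD": (0.986, 0.992), "MF+JD+ST": (0.987, 0.994)}
for name, (p, r) in reported.items():
    print(name, round(f_measure(p, r), 3))
```

For these two columns the recomputed values agree with the reported F-measures to three decimals.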
Further work
1. The results in this area can be improved -> Shared task?
2. Follow us on ResearchGate - https://www.researchgate.net/project/Author-namedisambiguation
Acknowledgements
• Dr. Barbara Endler-Jobst, Head of Data Science, Roche Innovation Center, Basel
• Dr. Raul Rodriguez-Esteban, Data Scientist, Roche Innovation Center, Basel
• Data Science group in Roche Innovation Center, Basel
• Dr. Fabio Rinaldi, scientific advisor, University of Zürich
• Dr. Ignacio Fernandez Garcia, talent scout at Roche
• Dr. Khan Ozol, former talent scout at Roche