Author Name Disambiguation

Author Name Disambiguation in MEDLINE Based on Domain of Research

Dina Vishnyakova, PhD
Data Science, Roche Innovation Center Basel
SwissText 2017

Motivation

• Track scientific research over time and across publications
• Associate topics with authors
• Identify the best researchers/research

Author Name Disambiguation (AND) in Publications

Verified Author?

Source of Publications

We are disambiguating publications in MEDLINE, a database which contains journal citations and abstracts for biomedical literature from around the world.

Free access to Medline is provided to researchers, scientists, etc.

Do you think it is easy?

The most common author names (surname plus initial) in publications, as of 2016:

• Wang Y – over 8000 publications
• Zhang Y – over 7000 publications
• Li Y – over 6000 publications
• Liu Y – …
• Wang J …
• Wang X …

Author 1? Author 2? Author 3? Author 4? Author 5?

*According to a comprehensive survey of residential permits released by the Chinese Ministry of Public Security on April 24, 2007, the ten most common surnames in mainland China are Wang, Li, Zhang, Liu, Chen, Yang, Huang, Zhao, Wu and Zhou.
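
To make the ambiguity concrete, the sketch below groups author mentions by a surname-plus-first-initial key and counts how many fall under each key. The `name_key` helper and the toy records are hypothetical illustrations; they are not MEDLINE data and not necessarily how candidates were grouped in this work.

```python
from collections import Counter


def name_key(last_name: str, initials: str) -> str:
    """Hypothetical grouping key: surname plus first initial.

    Example: ("Zhang", "DL") and ("Zhang", "D") both map to "Zhang D".
    """
    return f"{last_name.strip()} {initials.strip()[:1].upper()}"


# Toy author mentions (illustrative only, not real publication counts).
records = [
    ("Wang", "Y"), ("Wang", "YH"), ("Zhang", "Y"),
    ("Wang", "Y"), ("Li", "Y"), ("Zhang", "YL"),
]

# Every mention under the same key is a candidate for the same person,
# so large groups are exactly where disambiguation is hardest.
counts = Counter(name_key(last, initials) for last, initials in records)

for key, n in counts.most_common():
    print(key, n)
```

Each group produced this way still has to be split into individual authors, which is the actual disambiguation step.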

What is done to solve the problem...

Existing identifiers (ORCID, Scopus...) could potentially help, BUT an ID is not mandatory for publishing ->
- Not every author has an ID
- An ID is not linked to all previous papers
- Missing information

Information Availability in Publications

[Pie charts: share of publications where each field is Available vs. Not Available, for Affiliation, MeSH (Medical Subject Heading), Title, and Abstract. Percentages shown: 9% / 91%, 46% / 54%, 49% / 51%.]

+Affiliation is provided mainly for the first author

What we extract from available information

1) Email (regular expression)
2) Organisation (NER)
3) City (NER & dictionary-based)
4) Country (NER & dictionary-based)
5) Author’s job title?!
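
Two of the extraction steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the email pattern is a simplified one, the `COUNTRIES` dictionary is a tiny hypothetical stand-in for a real gazetteer, and the affiliation string is invented. The NER-based steps (organisation, city) are not shown.

```python
import re
from typing import List, Optional

# Simplified email pattern (assumption); real addresses are more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Tiny illustrative dictionary; a real one would cover all countries
# and their spelling variants.
COUNTRIES = {"switzerland": "Switzerland", "china": "China", "usa": "USA"}


def extract_emails(affiliation: str) -> List[str]:
    """Return all email-like substrings found in an affiliation string."""
    return EMAIL_RE.findall(affiliation)


def extract_country(affiliation: str) -> Optional[str]:
    """Dictionary lookup, scanning from the end of the string,
    where the country usually appears in an affiliation."""
    tokens = re.findall(r"[A-Za-z]+", affiliation.lower())
    for token in reversed(tokens):
        if token in COUNTRIES:
            return COUNTRIES[token]
    return None


# Invented example affiliation.
text = ("Department of Biology, University of Example, Basel, "
        "Switzerland. jane.doe@example.org")
emails = extract_emails(text)
country = extract_country(text)
print(emails, country)
```

A single-token lookup like this misses multi-word country names, which is one reason the slide pairs the dictionary with NER.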

Text Mining and Machine Learning if information is limited

Human Expert
• Analyses titles and abstracts
• Searches in Google Scholar, ResearchGate, LinkedIn, etc.
BUT
• Information is limited, even on the WWW
• It takes time (some author names require processing > 1000 articles)

Machine Learning and Text Mining
• Works with limited information
• Finds relations which are not visible to humans
• Time efficient
BUT
• Need to choose an algorithm (evaluation problem)
• Supervised algorithms require training and test sets (gold standard)
• Unsupervised ones are more difficult to fine-tune

Our AND methodology is based on supervised algorithms, for which feature selection is crucial.

Example of publication

[Figure: one publication connecting an Author and Co-Authors to the domains Hematology, Vascular Disease, and Molecular Biology]

+One publication can be a product of a collaboration and can cover several domains

Machine Learning Features – Identifying research domains

Input: Title/Abstract/MeSH

Journal Descriptors: Vascular Disease| Endocrinology| Molecular Biology
Semantic Types: Amino Acid; Peptide; Protein| Biologically Active Substance| Molecular Function

Identifying research domains and disciplines improves the AND process, especially if information about Affiliation is missing.
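
One plausible way to hand such domain labels to a supervised learner is to encode them as a fixed-length 0/1 vector over a label vocabulary. The sketch below assumes that encoding; the slides do not say how the features were actually represented, and `parse_labels`, `multi_hot` and the four-term vocabulary are hypothetical.

```python
from typing import Iterable, List, Set


def parse_labels(field: str) -> Set[str]:
    """Split a pipe-delimited label string into a set of clean labels."""
    return {part.strip() for part in field.split("|") if part.strip()}


def multi_hot(labels: Set[str], vocabulary: Iterable[str]) -> List[int]:
    """1 where the publication carries the vocabulary term, else 0."""
    return [1 if term in labels else 0 for term in vocabulary]


# Journal Descriptors of the example publication on this slide.
jd = parse_labels("Vascular Disease| Endocrinology| Molecular Biology")

# Hypothetical four-term vocabulary; a real one would hold every
# Journal Descriptor seen in the training data.
jd_vocab = sorted(
    ["Endocrinology", "Hematology", "Molecular Biology", "Vascular Disease"]
)

vector = multi_hot(jd, jd_vocab)
print(dict(zip(jd_vocab, vector)))
```

Semantic Types can be encoded the same way with their own vocabulary and concatenated to the Journal Descriptor vector.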

Example: Domains of Research for Disambiguation

PMID 12612331, Last N: Zhang, First N: DL
Journal Descriptors: Gastroenterology| Pediatrics| Nutritional Sciences
Semantic Types: Finding| Pathologic Function| Sign or Symptom

PMID 7646436, Last N: Zhang, First N: D
Journal Descriptors: Vascular Disease| Endocrinology| Molecular Biology
Semantic Types: Amino Acid; Peptide; Protein| Biologically Active Substance| Molecular Function

PMID 10783774, Last N: Zhang, First N: D
Journal Descriptors: Medical Informatics| Statistics as Topic| Diagnostic Imaging
Semantic Types: Conceptual Entity| Temporal Concept| Entity

PMID 11550930, Last N: Zhang, First N: D
Journal Descriptors: Acquired Immunodeficiency Syndrome| Dentistry| Statistics as Topic
Semantic Types: Conceptual Entity| Activity| Quantitative Concept

PMID 11891134, Last N: Zhang, First N: D
Journal Descriptors: Tropical Medicine| Hematology| Molecular Biology
Semantic Types: Amino Acid; Peptide; Protein| Biologically Active Substance| Molecular Function

Intersections in domains: Red color = Author 1, Blue color = Author 2
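
The intersections can be found mechanically by comparing the Journal Descriptor sets of each pair of publications. The sketch below does this for the five records in the example; the Jaccard score is an assumed similarity measure chosen for illustration, not necessarily the one used in this work, and the code makes no claim about which records belong to which author.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# Journal Descriptors per PMID, copied from the example above.
rows = {
    "12612331": "Gastroenterology| Pediatrics| Nutritional Sciences",
    "7646436": "Vascular Disease| Endocrinology| Molecular Biology",
    "10783774": "Medical Informatics| Statistics as Topic | Diagnostic Imaging",
    "11550930": "Acquired Immunodeficiency Syndrome| Dentistry| Statistics as Topic",
    "11891134": "Tropical Medicine| Hematology| Molecular Biology",
}


def to_set(field: str) -> Set[str]:
    """Split a pipe-delimited descriptor string into a set of labels."""
    return {part.strip() for part in field.split("|") if part.strip()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared labels divided by all labels across both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


domains = {pmid: to_set(field) for pmid, field in rows.items()}

# Keep only the pairs of publications that share at least one domain.
overlaps: Dict[Tuple[str, str], Set[str]] = {}
for (p1, d1), (p2, d2) in combinations(domains.items(), 2):
    shared = d1 & d2
    if shared:
        overlaps[(p1, p2)] = shared
        print(p1, p2, sorted(shared), round(jaccard(d1, d2), 2))
```

On these five records, only two pairs share a Journal Descriptor: 7646436 with 11891134 (Molecular Biology) and 10783774 with 11550930 (Statistics as Topic).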

Results achieved so far...

Most research studies in Author Name Disambiguation claim to disambiguate authors with an average accuracy of 95%, BUT the evaluations were not done on the same data set, or so-called gold standard...

Gold Standard for AND problem

The only existing gold standard
• Only 1st authors, no co-authors
• Few non-Western names
• 99% of authors have information on their Affiliation

Actual variety of author profiles in Medline
• Missing Affiliation
• Missing Abstracts
• Old publications
• Prevalence of Asian-origin names

Results of Disambiguation

J 4.8 Algorithm

Metrics      MF+JD    MF+ST    MF+JD+ST   Song et al. (2015)
Precision    0.986    0.975    0.987      0.9776
Recall       0.992    0.961    0.994      0.9545
F-Measure    0.989    0.9675   0.990      0.9657

Using additional features such as domains of research, we have achieved better results than the state of the art.
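
For readers who want to relate the three metrics, the sketch below assumes the F-Measure is the standard balanced F1 score, the harmonic mean of precision and recall. The slides do not state which definition was used. The precision and recall values are the reported ones for two of the feature sets.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported for two of the feature sets.
reported = {
    "MF+JD": (0.986, 0.992),
    "MF+JD+ST": (0.987, 0.994),
}

for name, (p, r) in reported.items():
    print(name, round(f_measure(p, r), 3))
```

Under that assumption, the computed values round to 0.989 and 0.990, matching the reported F-Measure for these two columns.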

Further work

1. The results in this area can be improved -> Shared task?
2. Follow us on ResearchGate - https://www.researchgate.net/project/Author-namedisambiguation

Acknowledgements

• Dr. Barbara Endler-Jobst, Head of Data Science, Roche Innovation Center, Basel
• Dr. Raul Rodriguez-Esteban, Data Scientist, Roche Innovation Center, Basel
• Data Science group in Roche Innovation Center, Basel
• Dr. Fabio Rinaldi, scientific advisor, University of Zürich
• Dr. Ignacio Fernandez Garcia, talent scout at Roche
• Dr. Khan Ozol, former talent scout at Roche
