Improving Phenotype Name Recognition
Maryam Khordad¹, Robert E. Mercer¹, and Peter Rogan¹,²
¹ Department of Computer Science
² Department of Biochemistry
The University of Western Ontario, London, ON, Canada
{mkhordad,progan}@uwo.ca, [email protected]
Abstract. Due to the rapidly increasing amount of biomedical literature, automatic processing of biomedical papers is extremely important. Named Entity Recognition (NER) in this type of writing poses several difficulties. In this paper we present a system to find phenotype names in biomedical literature. The system is based on MetaMap and makes use of the UMLS Metathesaurus and the Human Phenotype Ontology. Starting from an initial basic system that uses only these preexisting tools, we propose five rules that capture stylistic and linguistic properties of this type of literature to enhance the performance of our NER tool. The tool is tested on a small corpus and the results (precision 97.6% and recall 88.3%) demonstrate its performance.
1 Introduction
During the last decade biomedicine has developed tremendously. Every day a large number of biomedical papers are published and a great amount of information is produced. Because biomedical data has so many applications, the need for Natural Language Processing (NLP) systems to process this new information is increasing. Current NLP systems try to extract different kinds of knowledge from the biomedical literature, such as protein–protein interactions [1] [2] [3] [4] [5], new hypotheses [6] [7] [8], relations between drugs, genes and cells [9] [10] [11], protein structure [12] [13] and protein function [14] [15]. In all of these applications, recognizing the biomedical objects, or Named Entity Recognition (NER), is a fundamental step and directly affects the final result.

Over the past years it has turned out that finding the names of biomedical objects in literature is a difficult task. Some problematic factors are: the existence of millions of entity names, a constantly growing number of entity names, the lack of naming agreement prior to a standard name being accepted, an extreme use of abbreviations, the use of numerous synonyms and homonyms, and the fact that some biological names are complex names that consist of many words, like “increased erythrocyte adenosine deaminase activity”. Even biologists do not agree on the boundary of the names [16].

Named Entity Recognition in the biomedical domain has been extensively studied and, as a consequence, many methods have been proposed. Some methods, like MetaMap [17] and mgrep [18], are generic and find all kinds of entities in the text. Other methods are specialized to recognize particular types of entities, such as gene or protein names [13] [19], diseases and drugs [9] [20] [21], mutations [22] or properties of protein structures [13].

NER techniques are usually classified into three categories [16]. Dictionary-based techniques like [19] match phrases from the text against existing dictionaries. Rule-based techniques like [23] make use of rules to find entity names in the text. Machine learning techniques like [24] transform the NER task into a classification problem.

In this paper we focus on phenotype name recognition in biomedical literature. A phenotype is defined as the genetically-determined observable characteristics of a cell or organism, including the result of any test that is not a direct test of the genotype [25]. The phenotype of an organism is determined by the interaction of its genetic constitution and the environment. Skin color, height and behavior are some examples of phenotypes.

We are developing a system that uses existing databases (the UMLS Metathesaurus [26] and the Human Phenotype Ontology (HPO) [27]) to find phenotype names¹. Our tool relies on MetaMap [17] to find name phrases and their semantic types. The tool uses these semantic types and some stylistic and linguistic rules to find human phenotype names in the text.

C. Butz and P. Lingras (Eds.): Canadian AI 2011, LNAI 6657, pp. 246–257, 2011. © Springer-Verlag Berlin Heidelberg 2011
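As a concrete illustration of the dictionary-based category described above, the following is a minimal sketch of longest-match lookup against a term list. The function name, the tokenizer, the `max_len` window and the toy dictionary are all assumptions made for this example; this is not the method of [19] nor the tool presented in this paper.

```python
import re

def find_dictionary_matches(text, dictionary, max_len=5):
    """Return (start_token, end_token, phrase) for the longest dictionary
    phrases found in text. Hypothetical helper for illustration only."""
    # Lower-case and keep alphanumeric tokens (hyphenated words stay whole).
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text.lower())
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        # Try the longest window first so "skin color" wins over "color".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                found = (i, i + n, candidate)
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched phrase
        else:
            i += 1
    return matches

# Toy dictionary (assumed for the example, not taken from any real resource).
toy_dictionary = {"skin color", "color", "height", "behavior"}
sentence = "Skin color, height and behavior are some examples of phenotypes."
print(find_dictionary_matches(sentence, toy_dictionary))
```

Such a matcher is only as good as its dictionary: a name that is missing from the dictionary, or that appears in a variant form, is not found.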
2 Phenotype Name Recognition
The last few years have seen a remarkable growth of NER techniques in the biomedical domain. However, these techniques tend to emphasize finding the names of genes, proteins, diseases and drugs. Although many specialized dictionaries are available, we are not aware of a dictionary which is both comprehensive and ideally suited for phenotype name recognition.

For example, the Unified Medical Language System (UMLS) Metathesaurus [26] is a very large, multi-purpose, and multi-lingual vocabulary database that contains more than 1.8 million concepts. These concepts come from more than 100 source vocabularies. The Metathesaurus is linked to the other UMLS Knowledge Sources: the Semantic Network and the SPECIALIST Lexicon. All concepts in the Metathesaurus are assigned to at least one semantic type from the Semantic Network. However, the Semantic Network does not contain Phenotype as a semantic type, so it alone is not adequate to distinguish between phenotypes and other objects in text. In addition, some phenotype names do not exist in the UMLS Metathesaurus at all.

The Online Mendelian Inheritance in Man (OMIM) [28] is the most important information source about human genes and genetic phenotypes [27]. Over five decades MIM and then OMIM have achieved great success, and OMIM is now used in the daily work of geneticists around the world. Nonetheless, OMIM does not use a controlled vocabulary to describe the phenotypic features in its clinical synopsis section, which makes it unsuitable for data mining [27].

The Human Phenotype Ontology (HPO) [27] is an ontology that was developed using information from OMIM and is specifically concerned with human phenotypes. The HPO contains approximately 10,000 terms. Nevertheless, this ontology is not complete and we had several problems finding phenotype names in it. First, some acronyms and abbreviations are not available in the HPO. Second, although the HPO contains synonyms of phenotypes, some synonyms are still missing. For example, the HPO contains ENDOCRINE ABNORMALITY, but not ENDOCRINE DISORDER. Third, in some cases adjectives and other modifiers are added to phenotype names, making it difficult to find these phenotype names in the ontology. For example, ACUTE LEUKEMIA is in the HPO, but an automatic system would not suggest that ACUTE MYELOID LEUKEMIA is a phenotype simply by searching in the HPO. Fourth, new phenotypes are continuously being introduced to the biomedical world. The HPO is constantly being refined, corrected, and expanded manually, but this process is not fast enough, nor can the inclusion of new phenotypes be guaranteed.

¹ This paper describes linguistic techniques to determine the sequence of words that is a descriptive phrase for a phenotype. A phenocopy is an environmental condition that mimics a phenotype and hence would have the same descriptive phrase as the phenotype name. We are not distinguishing between phenotype and phenocopy.
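The synonym and modifier problems listed above can be reproduced with a toy lookup. The sketch below assumes a two-term stand-in for the HPO and hypothetical helper names; the relaxation that drops one non-final word is only an illustration of why exact search fails, not one of the rules proposed in this paper, and it would over-generate on real text.

```python
def normalize(name):
    """Upper-case and collapse whitespace so lookups are case-insensitive."""
    return " ".join(name.upper().split())

def exact_lookup(name, terms):
    """True only if the normalized name is literally one of the terms."""
    return normalize(name) in terms

def lookup_dropping_one_modifier(name, terms):
    """Return the term matched after deleting one non-final word, else None.
    Hypothetical relaxation, used here only to illustrate the modifier problem."""
    words = normalize(name).split()
    for i in range(len(words) - 1):  # never drop the final word
        shorter = " ".join(words[:i] + words[i + 1:])
        if shorter in terms:
            return shorter
    return None

# Toy stand-in for the ontology: just the two terms named in the text.
toy_terms = {"ENDOCRINE ABNORMALITY", "ACUTE LEUKEMIA"}

print(exact_lookup("acute leukemia", toy_terms))            # present in the toy set
print(exact_lookup("endocrine disorder", toy_terms))        # synonym is missing
print(exact_lookup("acute myeloid leukemia", toy_terms))    # modifier blocks the match
print(lookup_dropping_one_modifier("acute myeloid leukemia", toy_terms))
```

Dropping a modifier recovers ACUTE LEUKEMIA for the third phrase, but no amount of word deletion maps ENDOCRINE DISORDER to ENDOCRINE ABNORMALITY, which is why string search in the ontology alone is insufficient.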
3 Background
3.1 Named Entity Recognition
Named entities are phrases that contain the names of people, companies, cities, etc., and, specifically in biomedical text, entities such as genes, proteins, diseases, drugs, or organisms. Consider the following sentence as an example:
– The RPS19 gene is involved in Diamond-Blackfan anemia.
There are two named entities in this sentence: RPS19 gene and Diamond-Blackfan anemia.

Named Entity Recognition (NER) is the task of finding references to known entities in natural language text. An NER technique may include natural language processing methods such as part-of-speech (POS) tagging and parsing. Part-of-speech tagging is the process of assigning a part-of-speech or other syntactic class marker to each word in the text [29]. A part-of-speech is a linguistic category of words such as noun, verb, adjective, preposition, etc., which is generally defined by the syntactic or morphological behavior of the word.

Parsing is the process of syntactic analysis that recognizes the structure of sentences with respect to a given grammar. Using parsing we can find which groups of words are, for example, noun phrases and which are verb phrases. Complete and efficient parsing is beyond the capability of current parsers. Shallow parsing is an alternative. Shallow parsers partially decompose each sentence into phrases and then find the local dependencies between phrases. They do not analyze the internal structure of phrases. Each phrase is tagged with one of a set of predefined grammatical tags such as Noun Phrase, Verb Phrase, Prepositional Phrase, Adverb Phrase, Subordinated Clause, Adjective Phrase, Conjunction Phrase, and List Marker [30].

An important syntactic concept that is applied in our tool is the head of a phrase. The head is the central word in a phrase that determines the syntactic role of the whole phrase. For example, in both phrases “low set ears” and “the ears”, ears is the head.

3.2 MetaMap
MetaMap [17] is a widely used program developed by the National Library of Medicine (NLM). MetaMap provides a link between biomedical text and the structured knowledge in the Unified Medical Language System (UMLS) Metathesaurus by mapping phrases in the text to concepts in the UMLS Metathesaurus. To achieve this goal it analyzes the input text in several lexical and semantic steps. First, MetaMap tokenizes the input text, breaking it into meaningful elements such as words. Next, part-of-speech tagging and shallow parsing using the SPECIALIST Lexicon break the text into phrases. The phrases undergo further analysis to allow mapping to UMLS concepts. Each phrase is mapped to a set of candidate concepts, and scores are calculated that represent how well the phrase matches each candidate. An optional last step is word sense disambiguation (WSD), which chooses the best candidate with respect to the surrounding text [17].

MetaMap is configurable, with options for the vocabularies and data models in use, the output format and the algorithmic computations. Human-readable output is one of the output formats. MetaMap’s human-readable output generated from the input text “at diagnosis.” in the sentence “The platelet and the white cell counts are usually normal but neutropenia, thrombopenia or thrombocytosis have been noted at diagnosis.” is shown in Fig. 1. MetaMap found six candidates for this phrase and, after WSD, mapped the phrase to the “diagnosis aspect” concept.

In UMLS each Metathesaurus concept is assigned to at least one semantic type. In Fig. 1 the semantic type of each concept is given in the square brackets that follow it. Semantic types are categorized into groups that are subdomains of biomedicine, such as Anatomy, Living Beings and Disorders [31]. These groups are called Semantic Groups (SG). Each semantic type belongs to one and only one SG.

3.3 Human Phenotype Ontology (HPO)
An ontology, defined in Artificial Intelligence and related areas, is a structured representation of knowledge in a domain. In fact an ontology is a structure of concepts and the relationships among them. The Human Phenotype Ontology (HPO) [27] is an ontology that tries to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease. The HPO was constructed using information initially obtained from the Online Mendelian Inheritance in
Phrase: "at diagnosis."
>>>>> Phrase
diagnosis
>>>>> Candidates
Meta Candidates (6):
  1000 Diagnosis [Finding]
  1000 Diagnosis (Diagnosis:Impression/interpretation of study: Point in time:^Patient:Narrative) [Clinical Attribute]
  1000 Diagnosis (Diagnosis:Impression/interpretation of study: Point in time:^Patient:Nominal) [Clinical Attribute]
  1000 diagnosis (diagnosis aspect) [Qualitative Concept]
  1000 DIAGNOSIS (Diagnosis Study) [Research Activity]
   928 Diagnostic [Functional Concept]
>>>>> Mappings
Meta Mapping (1000):
  1000 diagnosis (diagnosis aspect) [Qualitative Concept]

>>>>> Phrase
presented learning disabilities
>>>>> Candidates
Meta Candidates (9):
   901 Learning Disabilities [Mental or Behavioral Dysfunction]
   882 Learning disability (Learning disability - specialty) [Biomedical Occupation or Discipline]
   827 Learning [Mental Process]
   827 Disabilities (Disability) [Finding]
   743 Disabled (Disabled Persons) [Patient or Disabled Group]
   743 Disabled [Qualitative Concept]
   660 Presented (Presentation) [Idea or Concept]
   627 Present [Quantitative Concept]
   627 Present (Present (Time point or interval)) [Temporal Concept]
>>>>> Mappings
Meta Mapping (901):
   660 Presented (Presentation) [Idea or Concept]
   901 Learning Disabilities [Mental or Behavioral Dysfunction]
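Candidate lines in output of this kind consist of a score, a concept name (optionally followed by a preferred name in parentheses) and a semantic type in square brackets. The sketch below shows one way such lines could be split into those three fields, assuming each candidate sits on its own line and carries a single bracketed semantic type. The regular expression and function names are assumptions made for this example and are not part of MetaMap.

```python
import re

# score, then concept text, then one bracketed semantic type at end of line.
CANDIDATE = re.compile(r"^\s*(\d+)\s+(.*?)\s*\[([^\]]+)\]\s*$")

def parse_candidate(line):
    """Return (score, concept, semantic_type), or None for non-candidate lines."""
    m = CANDIDATE.match(line)
    if m is None:
        return None
    score, concept, semantic_type = m.groups()
    return int(score), concept, semantic_type

def parse_candidates(lines):
    """Parse every line that looks like a candidate and skip the rest."""
    parsed = (parse_candidate(line) for line in lines)
    return [p for p in parsed if p is not None]

lines = [
    "Meta Candidates (6):",
    "  1000 Diagnosis [Finding]",
    "  1000 diagnosis (diagnosis aspect) [Qualitative Concept]",
    "   928 Diagnostic [Functional Concept]",
]
for score, concept, semantic_type in parse_candidates(lines):
    print(score, "|", concept, "|", semantic_type)
```

Having the semantic type as a separate field is what lets a downstream step filter candidates by semantic type or Semantic Group.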