2009 Ninth IEEE International Conference on Bioinformatics and Bioengineering

Normalizing Biomedical Name Entities by Similarity-based Inference Network and De-ambiguity Mining

Chih-Hsuan Wei, I-Chin Huang, Yi-Yu Hsu and Hung-Yu Kao* Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan, R.O.C. Email addresses: [email protected]*

Abstract—To construct an intelligent biomedical knowledge management system, researchers have proposed many relation extraction methods. Before applying these methods, a system must recognize the name entities in the literature and map them to their corresponding EntrezIDs. The purpose of this study is to automatically and accurately identify the EntrezIDs mentioned in the literature. We employ a similarity-based inference network to calculate similarity scores for the entities, which addresses the term variation problem, and we propose a de-ambiguity strategy that increases the confidence of the EntrezID assigned in a literature. Together, these strategies give researchers reliable mappings from entities to EntrezIDs. As a result, the system achieves an f-measure of about 75.1%, making the identified entities even more meaningful. The system using the proposed strategies outperforms previous methods in biomedical entity normalization.

In this study, we focus on improving the accuracy of normalization, that is, mapping terms to the identifiers of a lexicon.

Keywords—text mining; inference network; name entity normalization

Figure 1. The process of relation extraction (NER → NEN → RE) on biomedical data sources

I. INTRODUCTION

Text mining of biomedical data sources has drawn attention over the last several decades, as scientists seek to extract information automatically and precisely. Evaluations of text mining on biomedical data sources have reported that text mining techniques for biomedical information extraction are not yet completely reliable [1], and increasing their usefulness in this domain remains challenging. Relation Extraction (RE) is one of the biomedical information retrieval tasks: it identifies the relationships among biomedical entities in the literature. In RE, biomedical entities refer to EntrezIDs covering genes, proteins, interactions, and so on. Before extracting information, two crucial steps, Name Entity Recognition (NER) and Name Entity Normalization (NEN), find the biomedical entities efficiently and precisely. Fig. 1 illustrates the process of relation extraction on biomedical data sources. NER concentrates on detecting regions in the literature that refer to biomedical entities, each of which may be a single word or multiple adjacent words. After recognition, the entities are typically expressed in a domain terminology, and NEN maps each entity to a specific entry in a domain resource such as a lexicon. In the RE step, the relations among entities are extracted to construct the relation network.

Rule-based and similarity-based strategies are the two main approaches to the normalization task. The rule-based strategy relies on heuristic rules proposed manually by domain experts with species-specific knowledge. Strong heuristic rules are strict and effective, whereas weak heuristic rules often lose important information in the entities [2]. Although this approach requires more human effort, the rule-based strategy is distinguished from the similarity-based strategy by not needing training corpora [3]. The similarity-based strategy costs more computation and relies on training corpora, but it can calculate the similarity between two given strings automatically. Soft string matching [4] is an alternative name for the similarity-based strategy; it matches strings flexibly, and the similarity measure can be tuned automatically on training corpora to obtain better results. Recent studies present hybrid approaches to normalization, composing the similarity-based strategy with rule-based steps to eliminate common noise. Besides the matching strategy, the quality of the lexicon also deserves attention: several studies [5, 6] have demonstrated how a good lexicon can make normalization more efficient.


The entities tagged by the tagging system are refined into candidate entities by the proposed heuristic rules.

The proposed methods in this study concentrate on solving variation and ambiguity in term normalization. We employ an integrated approach that utilizes both rule-based and similarity-based strategies. The two major parts of our system are a similarity-based inference network and a de-ambiguity strategy. The first calculates the similarity between an entity and the EntrezIDs from the frequency and influence of their tokens. The second is a de-ambiguity algorithm for the ambiguity problem: we take the entities that belong to the same literature and refine them into a more confident EntrezID for each entity.

II. METHODS

In this section, we present the proposed similarity matching strategy for name entity normalization in biomedicine. It utilizes the Term Frequency-Inverse Document Frequency (TF-IDF) strategy in an inference network to score the similarities between entities and the EntrezIDs in a lexicon, and it merges entities through the de-ambiguity strategy to obtain better results. The overall diagram of our normalization system is shown in Fig. 2, and the process diagram of normalization is displayed in Fig. 3. The system consists of three components: the gene/gene product name dictionary system, the tagging system, and the normalization system. In the dictionary component, we collect and store the gene/gene product name lexicon from the BioCreative2 [7] gene name normalization task. To reduce the computational cost, the dictionary is organized as an inverted-index table.
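The inverted-index table can be sketched as follows. This is a minimal illustration, not the authors' actual data structure: the lexicon format (a dict from EntrezID to its dictionary entities) and all names are our assumptions.

```python
from collections import defaultdict

def build_inverted_index(lexicon):
    """Map each lowercased token to the set of EntrezIDs whose
    dictionary entities contain that token (illustrative structure)."""
    index = defaultdict(set)
    for entrez_id, entities in lexicon.items():
        for entity in entities:
            for token in entity.lower().split():
                index[token].add(entrez_id)
    return index

# Toy lexicon: EntrezID -> list of dictionary entities (DEs)
lexicon = {
    "51554": ["CCRL1", "CC chemokine receptor like 1"],
    "1236": ["CCR7"],
}
index = build_inverted_index(lexicon)
# Any entity token now retrieves its candidate EntrezIDs in one lookup
print(sorted(index["ccrl1"]))
```

With this table, candidate lookup during normalization avoids scanning the whole lexicon for every tagged entity.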

Figure 3. Name Entity Normalization System

In the normalization system component, consider the work flow in Fig. 3, where an entity is tagged from the literature and mapped to an EntrezID in the lexicon. We denote this entity as a Candidate Entity (CE). The selected EntrezIDs that may refer to the CE(s) are denoted as Candidate EntrezIDs (CIDs). The CE is matched against the Dictionary Entities (DEs) of each CID. The mapping process is separated into different types, namely Directly-Match and Indirectly-Match. Suppose one of the CEs matches a DE indirectly; then all tokens of that CE are used to select partially matching DEs. As shown in Fig. 3, the similarity score between a CE and a DE is calculated by the Similarity-based Inference Network Model (S-INM). The de-ambiguity module eliminates term ambiguity, and the threshold-cut module selects the most confident lexicon identifier for the CEs.

A. Pre-processing

1) Lexicon building: The biomedical lexicon contains some entities that denote general concepts (such as "protein", "gene", "receptor", etc.), so we filter those general entities before using the lexicon. We split each entity into tokens (words or numbers) at punctuation, symbols, and spaces. For example, "Hypoxia-inducible factor-1 alpha" is split into "Hypoxia", "inducible", "factor", "1", "alpha". The tokens are inserted into the inverted-index table, except for the stopwords described next.

2) Stopwords/Domain Common Words Filtering: Stopwords are well known in the information retrieval domain. These words are very frequent and carry no meaning in natural language.
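The tokenization rule in the lexicon-building step can be sketched as below; the split-on-anything-non-alphanumeric rule is inferred from the paper's "Hypoxia-inducible factor-1 alpha" example, and the function name is ours.

```python
import re

def tokenize_entity(entity):
    """Split an entity into word/number tokens at punctuation,
    symbols, and whitespace, as in the lexicon-building example."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", entity) if t]

print(tokenize_entity("Hypoxia-inducible factor-1 alpha"))
```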


Figure 2. Integrated System: Inverted System, Tagging System, and Normalization System

Next, the tagging system component applies the AIIA recognition system [8] to recognize candidate name entities in the literature and stores them in a database.


For example, "CC chemokine receptor like 1" in Fig. 4 is mapped to three or more concepts, but only one of them, 51554, is correct. Selecting the correct concept in this situation is not trivial, so we employ the similarity-based strategy and tokenization to help the system select DEs precisely.

1) Directly-Match: Here we assume the CE is matched exactly by a DE; such a CE is not a variant, but it may be ambiguous. This assumption takes advantage of the rule-based strategy, and two rules are defined for Directly-Match to obtain precise yet broad CIDs. First, we ignore case, a rule confirmed to be effective for normalizing gene/protein names [9]. Second, we replace all non-alphabet and non-number symbols in the CE with spaces (e.g., "hif-1" to "hif 1") and tokenize composite CEs (e.g., "hif1") into parts separated by a space (e.g., "hif 1"). Because symbols and punctuation are a major source of term variation, the replacement and tokenization eliminate much of it. For example, "Hif-1" directly matches the entity "HIF 1" of EntrezID 29072. These straightforward rules increase ambiguity, which is resolved later by the de-ambiguity strategy.

2) Indirectly-Match: A CE may still fail to match any entity of a concept directly because of term variation, so we propose another approach for this case. If, in the Directly-Match process, the CE cannot match the DEs of the CIDs, or no DE is selected, we use each token of the CE to list candidate DEs. When enumerating the tokens of the CE, including composite words, the two rules described for Directly-Match are applied. The DEs selected this way are not precise, so we rank the CIDs by scoring their DEs and select the most similar one as the closest EntrezID for the CE.

3) Inference Network Model: The inference network model [12, 13] takes an epistemological view of the information retrieval problem.
More specifically, this network allows us to consistently combine evidence from distinct evidential sources. The INM is basically a Bayesian network used to model documents; for an extensive overview of the inference network model, see the textbook by Baeza-Yates and Ribeiro-Neto [14]. The INM consists of two sub-networks: the Document Network (DN) and the Query Network (QN). The DN represents the document collection and contains a node for each document; each term of a document is connected to the document by a directed edge from the document node to the term node. The QN represents the user's information need and consists of nodes that represent the required query terms. The DN is constructed while indexing the documents and then used for statistics during mapping; the QN is produced from the query during mapping.

Such words (e.g., "the", "a") can be ignored; we filter them out of the inverted-index table, along with numbers and single characters. Besides stopwords, domain common words that rarely refer to concepts should also be ignored. Domain common words in biomedicine include "gene", "protein", "sequence", "cell", and so forth; these words are not denoted as entities of an EntrezID, such as a gene name or protein name. We filter out domain common words that occur with high frequency in the literature.

B. Similarity-based Inference Network Model

Name entity normalization is confronted with two main problems: term variation and term ambiguity. This section focuses on the solution of term variation. We develop a similarity-based inference network model that calculates the similarity scores between a CE and DEs by TF-IDF. Fig. 4 demonstrates the CE and DEs of a real case in the normalization process. Mapping CEs to the DEs in the lexicon is separated into two strategies, Directly-Match and Indirectly-Match.
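The Directly-Match rules described above (case folding, symbol replacement, and splitting of letter-digit composites) can be sketched as a single normalization key; the regular expressions are our interpretation of those two rules, not the authors' code.

```python
import re

def direct_match_key(name):
    """Normalize a CE/DE for Directly-Match comparison: lowercase,
    replace symbols with spaces, and split letter-digit composites
    (our reading of the paper's two rules)."""
    s = name.lower()
    s = re.sub(r"[^a-z0-9]+", " ", s)           # symbols/punctuation -> space
    s = re.sub(r"(?<=[a-z])(?=[0-9])", " ", s)  # "hif1" -> "hif 1"
    s = re.sub(r"(?<=[0-9])(?=[a-z])", " ", s)  # "1alpha" -> "1 alpha"
    return " ".join(s.split())

print(direct_match_key("Hif-1"), "|", direct_match_key("HIF 1"), "|", direct_match_key("hif1"))
```

Two names match directly when their keys are equal, so "Hif-1", "HIF 1", and "hif1" all collapse to the same key.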


Figure 4. Candidate Entity (CE), Dictionary Entities (DEs), and Candidate EntrezIDs (CIDs)

Consider the example in Fig. 4, where three CEs are extracted from the abstract of PMID 10767544. We use these strategies to find the Candidate EntrezIDs (CIDs). In the Directly-Match situation, the CE "CCRL1" is matched exactly by the DE "CCRL1" under both EntrezIDs 51554 and 1524. The situation of "CC chemokine receptor like 1" is denoted Indirectly-Match: the CE partially matches the concepts, which addresses term variation.
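The CID selection just described can be sketched as follows; this is a simplified stand-in (exact match falls back to shared-token match), with the lexicon layout and names being our assumptions.

```python
def select_cids(ce, lexicon):
    """Return (match_type, candidate EntrezIDs) for a CE: an exact
    (case-insensitive) DE match is Directly-Match; otherwise any DE
    sharing a token with the CE yields an Indirectly-Match CID."""
    ce_l = ce.lower()
    direct = {eid for eid, des in lexicon.items()
              if any(de.lower() == ce_l for de in des)}
    if direct:
        return "direct", direct
    tokens = set(ce_l.split())
    indirect = {eid for eid, des in lexicon.items()
                if any(tokens & set(de.lower().split()) for de in des)}
    return "indirect", indirect

lexicon = {"51554": ["CCRL1", "CC chemokine receptor like 1"],
           "1524": ["CCRL1"],
           "1236": ["CCR7"]}
print(select_cids("CCRL1", lexicon))                    # ambiguous direct match
print(select_cids("chemokine receptor genes", lexicon)) # partial, indirect match
```

Note how "CCRL1" returns two CIDs (51554 and 1524), reproducing the ambiguity that the later de-ambiguity step must resolve.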


As the example in Fig. 5 shows, document $d_j$ has the nodes $k_2$, $k_i$, and $k_t$ as its terms in the DN, and the node $I$ is the user's Information Need, which consists of the query nodes $q_1$ and $q_2$ in the QN. The query $q$ is composed of terms, e.g., $k_1$, $k_2$, and $k_i$, and other queries are connected through variant terms in the DN. Each link carries a conditional weight indicating the strength of the relationship between the two nodes, and a node is evaluated from the values of its parent nodes and the conditional weights.

Figure 5. Basic Inference Network Model

4) TF-IDF Weighting Strategy: We use these two effects, term frequency and inverse document frequency, to weight a term in a document based on the inference network model. Definition: let $freq_{i,j}$ be the frequency of term $k_i$ in document $d_j$. The normalized frequency $tf_{i,j}$ of term $k_i$ in document $d_j$ is given by:

$$tf_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

that is, the frequency of term $k_i$ in $d_j$ divided by the maximum term frequency in $d_j$. Let $N$ be the total number of documents and $n_i$ the number of documents in which term $k_i$ appears. The inverse document frequency $idf_i$ of term $k_i$ is then given by:

$$idf_i = \log \frac{N}{n_i}$$

The best-known term-weighting schemes use the TF-IDF strategy, given by:

$$w_{i,j} = tf_{i,j} \times idf_i$$
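The three formulas above transcribe directly into code; the toy collection and function name below are illustrative, not from the paper.

```python
import math

def tfidf_weights(doc_tokens, collection):
    """Compute w_{i,j} = tf_{i,j} * idf_i for one document against a
    collection of token lists, following the formulas above."""
    N = len(collection)
    counts = {t: doc_tokens.count(t) for t in set(doc_tokens)}
    max_freq = max(counts.values())
    weights = {}
    for term, freq in counts.items():
        tf = freq / max_freq                      # tf_{i,j} = freq / max_l freq
        n_i = sum(term in d for d in collection)  # documents containing the term
        idf = math.log(N / n_i)                   # idf_i = log(N / n_i)
        weights[term] = tf * idf                  # w_{i,j} = tf * idf
    return weights

docs = [["hif", "1", "alpha"], ["hif", "1"], ["ccr7"]]
w = tfidf_weights(docs[0], docs)
print(w)
```

Terms that appear in every document get weight log(N/N) = 0, so common tokens contribute nothing to the similarity score.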

To perform retrieval, we adopt two further processes [15]: the attachment process and the evaluation process. The attachment process attaches the QN to the DN to form the complete inference network, connecting each query term to the identical document terms. The evaluation process evaluates each document in the complete network, where each document node yields the probability of relevance to the user's Information Need; after evaluation, the relevance probabilities are used to produce the ranking. We apply the TF-IDF weighting strategy to calculate the probability in the evaluation process.

To represent relevance in the inference network model between document $d_j$ and the user query, the combined evidence for the terms in document $d_j$ is defined as:

$$C_j = \prod_{\forall i} (1 - tf_{i,j})$$

Combining the evidential source $C_j$ with the TF-IDF weighting strategy, the probability of relevance $P(q \wedge d_j)$ between document $d_j$ and the user query is given by:

$$P(q \wedge d_j) = C_j \times P(d_j) \times \sum_{i:\; w_{i,j} \neq 0 \;\wedge\; w_{i,q} \neq 0} tf_{i,j} \times idf_i \times \frac{1}{1 - tf_{i,j}}$$
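A sketch of this relevance score follows. Two choices here are our assumptions: $P(d_j)$ is taken as the inverse norm of the document's weight vector (as the surrounding text describes, but with our concrete formula), and terms with $tf = 1$ are skipped to avoid the degenerate $(1 - tf) = 0$ factor, a practical guard that is not part of the paper's formula.

```python
import math

def relevance(query_terms, doc_tokens, collection):
    """Sketch of P(q ^ d_j) = C_j * P(d_j) * sum tf*idf/(1-tf), summed
    over terms shared by query and document with nonzero weight."""
    N = len(collection)
    counts = {t: doc_tokens.count(t) for t in set(doc_tokens)}
    max_f = max(counts.values())
    tf = {t: f / max_f for t, f in counts.items()}
    idf = {t: math.log(N / sum(t in d for d in collection)) for t in tf}
    w = {t: tf[t] * idf[t] for t in tf}
    # Combined evidence C_j; tf = 1 terms skipped (assumption, see above)
    c_j = math.prod(1 - v for v in tf.values() if v < 1)
    # Prior P(d_j): inverse norm of the document weight vector (assumption)
    p_dj = 1 / math.sqrt(sum(v * v for v in w.values()) or 1.0)
    shared = [t for t in query_terms if t in tf and tf[t] < 1 and w[t] != 0]
    return c_j * p_dj * sum(tf[t] * idf[t] / (1 - tf[t]) for t in shared)

collection = [["hif", "hif", "1", "alpha"], ["hif", "1"], ["ccr7", "gene"]]
score = relevance(["alpha", "1"], collection[0], collection)
print(score)
```

Because the sum runs only over terms present in both the query and the document, a query with no overlap scores exactly zero.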

Here $P(d_j)$ stands for a prior probability distribution, which we take to be the inverse norm of the document vector; thus the larger the document, the smaller the relative contribution of each of its terms.

C. De-ambiguity

Besides the weighting strategy for term variation, we propose the de-ambiguity strategy for term ambiguity. This strategy uses all the CEs in the same literature to generate De-Ambiguous EntrezIDs (DAEs), and these DAEs are the final lexicon identifiers we suggest for each CE.
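The core merge rule of the de-ambiguity strategy, discarding any CID whose CE group is contained in another CID's group, can be sketched as below; the data layout is our assumption.

```python
def de_ambiguity(cid_to_ces):
    """Discard any CID whose CE group is a proper subset of another
    CID's CE group; the surviving CIDs are the DAEs."""
    daes = {}
    for cid, ces in cid_to_ces.items():
        contained = any(set(ces) < set(other)
                        for o_cid, other in cid_to_ces.items() if o_cid != cid)
        if not contained:
            daes[cid] = ces
    return daes

# CID2's CE list (CE1, CE2) is contained in CID1's (CE1, CE2, CE3),
# so CID2 is merged into CID1, as in the paper's example.
groups = {"CID1": {"CE1", "CE2", "CE3"}, "CID2": {"CE1", "CE2"}}
print(de_ambiguity(groups))
```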


If the highest similarity score between the CE and the FE is lower than the similarity-score threshold, the identifier is discarded and the CE list is not mapped to any identifier, because the evidence for this CE list is insufficient.

Before the de-ambiguity strategy, we merge entities that share CIDs. If an entity is vague and selects many DEs, it is merged into several distinct DE lists. For example, "mammalian chemokine receptors" in Table I selects many CIDs by Indirectly-Match. When other entities share a CID with the entity in question, they are merged with it; in this case the entities "CCRL1", "CCRL1 cDNA", and "CCR7" are merged into EntrezIDs 51554 or 1236. In the de-ambiguity strategy itself, we discard CIDs by the proposed de-ambiguity algorithm: the merged DAEs are generated by discarding any CID whose CE group is contained in another CID's group. For example, if CID1 has the CE list (CE1, CE2, CE3) and CID2's list is (CE1, CE2), then CID2 is merged into CID1. The remaining DAEs are regarded as the suggested IDs of the CEs after the de-ambiguity strategy.

TABLE I. EXAMPLES OF DAE: FINAL ENTITIES GENERATED BY THE DE-AMBIGUITY STRATEGY IN A REAL CASE

CE | CIDs selected by CE | DAE
CCRL1 | 1524, 51554 | 51554
CCRL1 cDNA | 1524, 9034, 25901, 51554, 285737 | 51554
CCR7 | 1236 | 1236
chemokine receptor genes | 643, 1230, 1231, ..., 1236, ..., 51554, ... | 1236 / 51554
mammalian chemokine receptors | 102, 330, 430, ..., 1236, ..., 51554, ... | 1236 / 51554
CCRL1 - CC chemokine receptor like 1 | 1074, 1367, 1657, 2007, 2337, 4479, 4529, ..., 51554, ... | 1236 / 51554

This strategy utilizes the information from the other entities, so the identifier of "CCRL1" becomes 51554 and the selection 1524 is ignored.

D. Threshold-cut

In truth, the de-ambiguity strategy cannot identify the DAE of a CE when no relevant or alias CE occurs in the same literature, and a CE may also be assigned more than one DAE. Thus, we make use of the similarity scores between the CE and the DAEs, together with the information from de-ambiguity. In the de-ambiguity strategy a CE list is assigned one or more final identifiers (DAEs), and we sum the similarity scores between the whole CE list and each DAE. The DAEs are ranked by these similarity scores, and the Final EntrezID (FE) with the highest similarity score is selected as the identifier of the CE list. A threshold on the similarity score is then adopted to confirm that the identifier is the real one.

III. RESULTS

To evaluate our system, we apply the corpora released for the BioCreAtivE2 Task 2 Gene Normalization task, BC2GN [7]. The BioCreAtivE2 workshop reported the results of the systems participating in BC2GN, and we evaluate our system on the same data set. The evaluation measures for the normalization task in this study are literature-based precision, recall, and f-measure. Literature-based means that an identifier assigned to a candidate entity is correct if that identifier is also contained in the gold-standard gene list of the same literature; the words of the candidate entity themselves are not compared with the gene/protein name of the identifier in the gold standard.

A. Final Similarity Score

The Final Similarity Score (FSS) is the score between a CE list and the DEs; it serves as a confidence score for the relationship between the CE and the DEs. We set a threshold on the FSS that deletes DEs with lower similarity scores. Fig. 7 demonstrates the evaluation of the FSS threshold on the BC2GN test corpus. According to this evaluation, the corpus contains mostly Directly-Match CEs, and we obtain better results at higher FSS thresholds.
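The literature-based evaluation can be sketched as follows: an ID counts as a true positive if it appears in the gold-standard gene list of the same literature. The data layout (PMID to ID sets) is our assumption.

```python
def literature_based_prf(predicted, gold):
    """Literature-based precision/recall/f-measure: an identifier is
    correct if it is in the gold-standard list of the same literature."""
    tp = fp = fn = 0
    for pmid, pred_ids in predicted.items():
        gold_ids = gold.get(pmid, set())
        tp += len(pred_ids & gold_ids)   # correct identifiers
        fp += len(pred_ids - gold_ids)   # spurious identifiers
        fn += len(gold_ids - pred_ids)   # missed identifiers
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

pred = {"10767544": {"51554", "1236"}}
gold = {"10767544": {"51554", "1236", "1524"}}
print(literature_based_prf(pred, gold))
```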

Figure 6. CIDs, DEs, and FEs

Figure 7. Evaluation of the Final Similarity Score threshold

B. Comparison with other works

In this section, the results of our system are compared with the other normalization systems [16] presented in the BioCreAtivE II workshop, including the Dictionary-Based system. We compare against a subset of the systems participating in the BC2GN task to see whether the proposed method achieves a high evaluation score.



Figure 8. Comparison between other methods

Fig. 8 demonstrates the evaluation of the different methods. We select the best result each method submitted for the BC2GN competition on the test corpus; the lexicon used is the EntrezGene lexicon. Evidently, every method achieves a better f-measure than the Dictionary-Based method, and the precision values are particularly outstanding. As shown in Fig. 8, the proposed system obtains results better than the average evaluation; in particular, a high recall rate is obtained with only a slight decrease in precision.

IV. CONCLUSIONS

In this study, we have proposed a system for mapping a biomedical entity to the correct concept using inference network weighting and de-ambiguity matching. Experimental results show that our inference network weighting and de-ambiguity strategy outperform the previous methods, achieving 74.5% precision, 75.8% recall, and a 75.1% f-measure. Normalizing entities is not only a crucial task in biomedicine but also important in information retrieval topics such as QA systems, detection of informative blogs, and term clustering. We will concentrate on these topics to extract more precise information for users.


ACKNOWLEDGMENT

We thank Drs. Shaw-Jenq Tsai, Shin-Mu Tseng, and Hei-Chia Wang for their helpful comments. The authors would also like to thank the members of IKMlab for all their assistance during the development of the system and the preparation of this manuscript.


REFERENCES



[1] B. Alex, C. Grover, B. Haddow, M. Kabadjov, E. Klein, M. Matthews, S. Roebuck, R. Tobin, and X. Wang, "Assisted curation: does text mining really help?," Pac Symp Biocomput, pp. 556-567, 2008.
[2] Y. Tsuruoka, J. McNaught, and S. Ananiadou, "Normalizing biomedical terms by minimizing ambiguity and variability," BMC Bioinformatics, vol. 9 Suppl 3, p. S2, 2008.
[3] W. W. Lau, C. A. Johnson, and K. G. Becker, "Rule-based human gene normalization in biomedical text with confidence estimation," Comput Syst Bioinformatics Conf, vol. 6, pp. 371-379, 2007.
[4] Y. Tsuruoka, J. McNaught, J. Tsujii, and S. Ananiadou, "Learning string similarity measures for gene/protein name dictionary look-up using logistic regression," Bioinformatics, vol. 23, pp. 2768-2774, 2007.
[5] K. Fundel, D. Guttler, R. Zimmer, and J. Apostolakis, "A simple approach for protein name identification: prospects and limits," BMC Bioinformatics, vol. 6 Suppl 1, p. S15, 2005.
[6] A. Cohen, "Unsupervised gene/protein named entity normalization using automatically extracted dictionaries," Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, pp. 17-24, June 2005.
[7] A. A. Morgan, Z. Lu, X. Wang, A. M. Cohen, J. Fluck, P. Ruch, A. Divoli, K. Fundel, R. Leaman, J. Hakenberg, C. Sun, H.-h. Liu, R. Torres, M. Krauthammer, W. W. Lau, H. Liu, C.-N. Hsu, M. Schuemie, K. B. Cohen, and L. Hirschman, "Overview of BioCreative II gene normalization," Genome Biology, vol. 9, p. S3, 2008.
[8] C.-N. Hsu, Y.-M. Chang, C.-J. Kuo, Y.-S. Lin, H.-S. Huang, and I.-F. Chung, "Integrating high dimensional bi-directional parsing models for gene mention tagging," Bioinformatics, vol. 24, pp. i286-i294, 2008.
[9] K. B. Cohen, G. K. Acquaah-Mensah, A. E. Dolbey, and L. Hunter, "Contrast and variability in gene names," in Proceedings of the ACL-02 Workshop on Natural Language Processing in the Biomedical Domain, Philadelphia, Pennsylvania: Association for Computational Linguistics, 2002.
[10] J.-H. Lim, H. Jang, J. Lim, and S.-J. Park, "Normalization of gene/protein names in biological literatures using vector-space model," in 29th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS 2007), 2007, pp. 390-393.
[11] C.-J. Kuo, Y.-M. Chang, H.-S. Huang, K.-T. Lin, B.-H. Yang, Y.-S. Lin, C.-N. Hsu, and I-F. Chung, "Exploring match scores to boost precision of gene normalization," in Proceedings of the BioCreAtIvE II Workshop, Madrid, Spain, 2007.
[12] H. Turtle and W. B. Croft, "Evaluation of an inference network-based retrieval model," ACM Trans. Inf. Syst., vol. 9, pp. 187-222, 1991.
[13] H. Turtle and W. B. Croft, "Inference networks for document retrieval," University of Massachusetts, 1990.
[14] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Harlow: Addison-Wesley, 1999.
[15] A. Graves and M. Lalmas, "Video retrieval using an MPEG-7 based inference network," in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland: ACM, 2002.
[16] M. Schuemie, R. Jelier, and J. Kors, "Peregrine: lightweight gene name normalization by dictionary lookup," in Proceedings of the BioCreAtIvE II Workshop, Madrid, Spain, 2007.
