Terminology Indexing and Reweighting methods for Biomedical Text Retrieval
Young-In Song, Sang-Bum Kim and Hae-Chang Rim
Natural Language Processing Lab., Department of Computer Science and Engineering, Korea University, Anam-dong 5 ka, SungPuk-gu, SEOUL, 136-701, KOREA
{song,sbkim,rim}@nlp.korea.ac.kr
ABSTRACT
In this paper, we propose terminology indexing and reweighting methods that consider the characteristics of biomedical documents. For terminology indexing, we first recognize biomedical terms and generate multiple keywords from each term so that the system can perform flexible matching between a query and documents. For term reweighting, we devise a method of normalizing the weights of query terms in a long multiword biomedical term, and a method of utilizing inverse query frequency, a novel statistic estimated in a query domain. Experimental results on the MEDLINE corpus show that our term indexing and reweighting methods can improve retrieval performance.

1. INTRODUCTION
With the rapid growth of biomedical literature, it is becoming more difficult for biologists to find information relevant to their needs. However, existing information retrieval systems often fail to provide satisfactory search results because of many domain-specific characteristics, which mostly originate from complex technical terms and their inconsistent usage in this domain[8].

Compared with those in other domains, biomedical terms have many distinctive features. For example, there are a large number of long multiword expressions such as “Melanogaster 5’-phosphoribosylaminoimidazole carboxylase-5’-phosphoribosyl-4-(N-succinocarboxamide)-5-aminoimidazole synthetase (Ade5) mRNA, complete cds”. Moreover, word order variations are frequent in biomedical terms. A biological term “egfr-1”, for instance, has a number of variations such as “egfr1”, “egfr.1”, “(egfr-1)” or “egfr 1”. For this reason, biomedical terms must be recognized and carefully processed by a biomedical text retrieval system.

Besides, the user may want to retrieve documents that contain acronyms, synonyms, term variations or aliases of the query, such as “egfr”, “egf-r”, “Drosophila epidermal growth factor receptor homologue”, “EGF-receptor”, and “faint little ball”. Automatic query expansion techniques based on a gene name thesaurus or gene name rewrite rules[7, 1, 2] can be quite useful for this kind of user need, but the traditional query expansion method can cause a critical problem in its query term weighting. Suppose that the original query term is a single-word term, but its added synonym is a multiword expression consisting of five or six words. In this case, documents containing the added terms acquire much higher scores than documents containing the original single query term, and this results in a serious performance problem. Thus, in the query expansion process, it is necessary to balance the query weights of terms that have the same meaning but quite different lengths.

In this paper, we propose terminology indexing and query term reweighting methods which can be easily adapted to a typical IR model. In terminology indexing, we first recognize biomedical terms using a biomedical named entity tagger based on a support vector machine. Then, some simple heuristics are applied to generate useful keywords from each recognized term, which enable the system to perform partial matching between a user query and terms in biomedical documents. In query term reweighting, we propose a query length normalization for balancing the weights of terms of different lengths, including aliases, synonyms, etc. We also propose an inverse query frequency (iqf), similar to inverse document frequency (idf), in order to obtain more domain-specific statistics in the case where the potential queries entered into the system are known in advance.

2. BIOMEDICAL TERM INDEXING
2.1 Biomedical Term Recognition
Biomedical term recognition is formulated as the classification of each word into one of two classes, T or O, which represent region information. T means that the current word is a part of a named entity, and O means that the word is not in a named entity. With this representation, we need only one binary classifier over the two classes T and O.

To achieve high performance on this task, we use SVM as the machine learning approach, which has shown the best performance in various NLP tasks. The features of a given word are composed of the orthographical features shown in Table 1 as well as prefix and suffix information of the word.

The SVM classifier has been trained with the GENIA corpus (v3.0p)¹, which consists of 2000 MEDLINE abstracts
annotated with Named Entity and Penn Treebank (PTB) Part-Of-Speech tags. One can find more information about the term recognizer we have used in [5].

¹ Available at http://www-tsujii.is.s.u-tokyo.ac.jp/∼genia/topics/Corpus/3.0/GENIA3.0p.intro.html

Table 1: Orthographical features used in term recognition
Feature          Example
DIGITS           1, 39
SINGLE CAP       A, M
COMMA            ,
PERIOD           .
HYPHON           -
SLASH            /
QUESTION MARK    ?
OPEN SQUARE      [
CLOSE SQUARE     ]
OPEN PAREN       (
CLOSE PAREN      )
COLON            :
SEMICOLON        ;
PERCENT          %
APOSTROPHE       ’
ETC SYMBOL       +, *, etc.
TWO CAPS         alphaCD28
ALL UPPER        AIDS
INCLUDE CAPS     c-Jun
GREEK LETTER     NF-kappa
ALPHA NUMERIC    p65
ALL LOWER        motif
CAPS DIGIT       CD40
INIT CAP         Rel

2.2 Generating Keywords in Terminology
Once biomedical terms are recognized, each recognized term may itself be regarded as an indexing keyword, even though it may consist of several words including symbols or single letters. One possible strategy is simply to treat the whole word sequence as a keyword. However, this yields matching failures between a query and a document, since word order inversions and partial abbreviations occur very frequently in biomedical text. Moreover, many biomedical terms contain functional words consisting of only two numeric characters or even a single letter, such as “1A” or “B”. These functional words can help increase precision by providing important evidence, but can also degrade system performance by acting as noise words. Thus, we believe that such functional words should not be regarded as independent indexing units, but only as components of other keywords in the indexing phase. In addition, we should enable the retrieval system to perform partial matching between a query and a term in biomedical text by indexing multiple short word sequences generated from the term rather than indexing the whole term alone. Based on this idea, we first define the functional word component:

Definition: A functional word component is a word consisting of two characters or fewer, except for words consisting of two alphabetic letters.

For example, “1A”, “25” or “B” is a functional word component, but “CD” is not. We then generate the keywords to be indexed from each recognized multiword term by the following steps:

1. Extract each word that is not a functional word component.
2. For each functional word component in the given multiword term, generate two (or one) canonical forms by combining the component with its adjacent word(s) in a canonical order.
3. Extract all the word pairs in which neither word is a functional word component, preserving their original order.

Figure 1: Example of generating multiple keywords from a multiword term

Figure 1 shows an example of keywords generated from the biomedical term “cyclin-dependent kinase inhibitor 1A (p21, Cip1)” by the above three steps. In this example, only “1A” is a functional word component, so two canonical forms, “1A:inhibit” and “1A:p21”, are extracted as keywords, and all the word pairs are also extracted except for “inhibit:p21” and “1A:p21”, which are already extracted in the second step.
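The three steps above can be sketched in a few lines of Python. This is only an illustrative reading of the procedure, not the authors' implementation: the tokenizer, the absence of stemming (so the output contains “inhibitor” rather than “inhibit”), the choice of writing the functional word component first as the “canonical order”, and the names `is_functional_word_component`, `tokenize` and `generate_keywords` are all assumptions introduced here.

```python
import re
from itertools import combinations


def is_functional_word_component(word: str) -> bool:
    """Definition from Section 2.2: two characters or fewer,
    except for words made of exactly two alphabetic letters."""
    if len(word) > 2:
        return False
    return not (len(word) == 2 and word.isalpha())


def tokenize(term: str) -> list[str]:
    # Assumption: words are alphanumeric runs, optionally joined by hyphens;
    # brackets, commas and whitespace act as separators.
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", term)


def generate_keywords(term: str) -> list[str]:
    words = tokenize(term)
    content = [w for w in words if not is_functional_word_component(w)]

    # Step 1: every word that is not a functional word component.
    keywords = list(content)

    # Step 2: combine each functional word component with its adjacent
    # word(s). Assumption: "canonical order" = component first.
    for i, word in enumerate(words):
        if not is_functional_word_component(word):
            continue
        for j in (i - 1, i + 1):
            if 0 <= j < len(words) and not is_functional_word_component(words[j]):
                keywords.append(f"{word}:{words[j]}")

    # Step 3: all ordered pairs of non-functional words.
    keywords.extend(f"{a}:{b}" for a, b in combinations(content, 2))

    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(keywords))


if __name__ == "__main__":
    for kw in generate_keywords("cyclin-dependent kinase inhibitor 1A (p21, Cip1)"):
        print(kw)
```

Because “1A” never appears as a keyword on its own, it can only contribute to a match together with one of its neighbours, which is the behaviour the definition is meant to enforce.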
3. QUERY TERM REWEIGHTING
3.1 Query Length Normalization
When a biologist wants to find documents about “cyclin-dependent kinase inhibitor 1A (p21, Cip1)”, the biologist may enter either “cyclin-dependent kinase inhibitor 1A (p21, Cip1)” or “CDKN1A”, since both terms have the same meaning even though they differ greatly in length. Thus, a document is equally relevant to the query whether it contains the longer name or the short symbol “CDKN1A”, if we assume that each term occurs with the same frequency in each document. However, a traditional IR system will probably give a much higher rank to the document containing the longer name, since that document matches more keywords than the document containing one short keyword, and it is more likely to acquire a higher matching score by summing the weights of every keyword in the long term.

This unfair scoring can cause serious problems. In the above example, it is possible that a document containing only “cyclin-dependent kinase inhibitor” or “inhibitor 1A” will have a higher score than a document containing “CDKN1A”. The problem becomes more serious when query expansion is used. In query expansion, several terms of various lengths are added to the original query using a pre-constructed lexical ontology such as UMLS (Unified Medical Language System)². Suppose that the user’s original query is a short symbol, but the newly added queries are long synonymous multiword terms. In this case, the system favors documents containing some words of the newly added long queries, and the original short query becomes a trivial word, if we do not balance the query weights between the original user query and the newly added multiword terms.

One possible way of alleviating this problem is to normalize query weights according to the length of each query term. To perform the normalization, we calculate the query weight for keyword i in terminology j as follows:

$$QW_i^1 = \frac{(qk + 1) \cdot qtf_{ij}}{qk \cdot ((1 - qb) + qb \cdot ql_j) + qtf_{ij}} \tag{1}$$

where $qtf_{ij}$ is the frequency of keyword i in terminology j, $ql_j$ is the length of terminology j, and both qk and qb are parameters controlling the normalization effect. In this experiment, we set qk and qb to 1.2 and 0.95 respectively. If there are a number of synonymous multiword terms, the weight of each keyword i is calculated by summing the normalized weights from each term as follows:

$$QW_i^1 = \sum_{j=1}^{|Q|} \frac{(qk + 1) \cdot qtf_{ij}}{qk \cdot ((1 - qb) + qb \cdot ql_j) + qtf_{ij}} \tag{2}$$

where |Q| indicates the number of terminologies having the same meaning. With this normalization, the weights of keywords from long queries are discounted according to the length of the queries they originate from, while symbols and acronyms keep their weights, since the length of such queries is, in general, 1.
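As a rough numerical illustration of Equations (1) and (2), the following Python sketch computes the normalized weight with the parameter values reported above (qk = 1.2, qb = 0.95). It assumes that the length $ql_j$ is counted as the number of keywords in terminology j, which the text does not state explicitly; the function names and the toy keyword lists are hypothetical.

```python
def normalized_query_weight(qtf: float, ql: float,
                            qk: float = 1.2, qb: float = 0.95) -> float:
    """Equation (1): weight of one keyword inside one terminology."""
    return (qk + 1) * qtf / (qk * ((1 - qb) + qb * ql) + qtf)


def keyword_weight(keyword: str, synonymous_terms: list[list[str]],
                   qk: float = 1.2, qb: float = 0.95) -> float:
    """Equation (2): sum of Equation (1) over all synonymous terminologies.

    Each terminology is given as a list of keywords.
    Assumption: ql_j is the number of keywords in terminology j.
    """
    total = 0.0
    for keywords in synonymous_terms:
        qtf = keywords.count(keyword)  # a term without the keyword adds 0
        total += normalized_query_weight(qtf, len(keywords), qk, qb)
    return total


if __name__ == "__main__":
    # Toy query: a symbol plus one long synonymous name (keywords are illustrative).
    terms = [
        ["CDKN1A"],
        ["cyclin-dependent", "kinase", "inhibitor", "p21", "Cip1"],
    ]
    for kw in ("CDKN1A", "kinase"):
        print(kw, round(keyword_weight(kw, terms), 4))
```

With these settings a keyword occurring once in a terminology of length 1 receives a weight of about 1, while a keyword occurring once in a five-keyword terminology receives roughly a third of that, so the summed weight of a long name stays comparable to the weight of its short symbol.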
3.2 Inverse Query Frequency
In general, idf is used to represent the discriminative power of a term in information retrieval. However, one can devise another statistic if the query domain is fairly restricted and some resources on the domain are available. In our experiment, we used query sets about the gene domain. Each word forming a gene name can have a different discriminative power. For example, while some words such as “inhibitor”, “receptor”, and “kinase” occur within various gene names, words such as “p21” and “Cip1” occur only in some specific gene names. In other words, if “Cip1” and “receptor” occur in the same query, “Cip1” is a more useful query term than the common word “receptor”. Based on this observation, we define a new weight factor, inverse query frequency (iqf): the number of all possible queries divided by the number of queries containing the specific term. We take a list of 15,000 gene names obtained from various web sites as the set of all possible queries, because we assume that only gene names are entered into our system. Thus, the new query weight for keyword i adopting inverse query frequency, $QW^2$, is stated as follows:

$$QW_i^2 = QW_i^1 \cdot \log \frac{QN - qn_i + 0.5}{qn_i + 0.5} \tag{3}$$

where QN is the total number of gene names in the gene name list, and $qn_i$ is the number of gene names in the list which contain the keyword i. In our experiment on the gene domain, for instance, “tumor” has an idf value of 2.64 and an iqf value of 5.99, while “kinase” has 3.56 and 4.11. This means that “kinase” is a more discriminative term in the MEDLINE corpus, but less informative than “tumor” in the gene domain.

4. EXPERIMENTAL RESULTS
4.1 Data and Evaluation Measure
Several experiments were conducted in the same environment as the first TREC Genomics Track³, held last year. The document collection consisted of 525,938 MEDLINE records for which indexing was completed between 4/1/2002 and 4/1/2003. The MEDLINE records were provided in the standard NLM MEDLINE format, and the fields are indicated by their 2-3 letter abbreviations, including PubMed Unique Identifier (PMID), title (TI), abstract (AB), and MeSH headings (MH). The TREC Genomics track organizers distributed training and test topic sets of 50 genes each. The training data were distributed first, allowing participating groups to get an idea of what the data in the track were like and to tune their systems. The test data were the topics for the official runs in the track. In this paper, the test data and the training data are referred to as QuerySet1 and QuerySet2. [3] describes the TREC Genomics track in detail.

We have implemented our own system based on the Okapi retrieval model[6]. By modifying the BM25 term weighting formula of the Okapi system, the term weight in our system with query weight normalization is calculated as follows:

$$\log \frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1) \cdot tf_i}{K + tf_i} \cdot QW_i^1 \tag{4}$$

If iqf is used together, the term weight is calculated as follows:
$$\log \frac{N - n_i + 0.5}{n_i + 0.5} \cdot \frac{(k_1 + 1) \cdot tf_i}{K + tf_i} \cdot QW_i^2 \tag{5}$$

² http://www.nlm.nih.gov/research/umls/
³ http://medir.ohsu.edu/∼genomics/

4.2 Effect of Terminology Indexing Method
Table 2 shows the effect of our term indexing method. In this table, the baseline is the performance of using a simple unigram indexing method. FWC is the performance of using the indexing method in which keywords are generated in canonical forms by combining functional word components with their neighboring words, and FWC+WP is the performance when all word pairs in multiword terms as well as keywords by FWC are extracted for indexing terminologies. Both FWC and FWC+WP basically use the unigram keywords in baseline method together with some additionally generated keywords.
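As a concrete reading of the weighting formulas of Sections 3.2 and 4.1 (Equations (3) to (5)), the sketch below combines the iqf factor with a BM25-style document term weight. Several details are assumptions rather than statements from the paper: K is taken to be the usual Okapi length normalization k1 · ((1 − b) + b · dl / avdl), the values k1 = 1.2 and b = 0.75 are common defaults and not reported here, the logarithm is assumed to be natural, and all function names and the example counts other than the collection size are hypothetical.

```python
import math


def rsj_weight(total: int, containing: int) -> float:
    """log((total - containing + 0.5) / (containing + 0.5)).

    Used both as idf (over documents) and as iqf (over the gene name list).
    Assumption: natural logarithm.
    """
    return math.log((total - containing + 0.5) / (containing + 0.5))


def iqf_query_weight(qw1: float, total_names: int, names_with_keyword: int) -> float:
    """Equation (3): QW2 = QW1 * iqf."""
    return qw1 * rsj_weight(total_names, names_with_keyword)


def term_score(tf: float, n_docs: int, doc_freq: int, query_weight: float,
               doc_len: float, avg_doc_len: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Equations (4)/(5): idf * BM25 tf component * query weight.

    Pass QW1 as query_weight for Equation (4) and QW2 for Equation (5).
    Assumption: K is the standard Okapi document length normalization.
    """
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    return rsj_weight(n_docs, doc_freq) * (k1 + 1) * tf / (K + tf) * query_weight


if __name__ == "__main__":
    # Hypothetical counts: a keyword found in 40 of 15,000 gene names
    # and in 2,000 of the 525,938 MEDLINE records.
    qw2 = iqf_query_weight(1.0, 15000, 40)
    print(term_score(tf=2, n_docs=525938, doc_freq=2000, query_weight=qw2,
                     doc_len=180, avg_doc_len=200))
```

Since the query weight enters the score as a plain multiplicative factor, switching between Equations (4) and (5) only changes which weight is passed in, which is why the method can be added to an existing BM25 implementation with little effort.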
Table 2: Average precision of using proposed term indexing methods
            QuerySet 1                    QuerySet 2
            All       Short     Long      All       Short     Long
baseline    0.1587    0.1549    0.0956    0.2953    0.2891    0.1462
FWC         0.1619    0.1516    0.1066    0.3342    0.3125    0.1729
FWC+WP      0.1649    0.1516    0.1141    0.3197    0.3101    0.1706
Table 3: Performances of using proposed query term reweighting methods
            QuerySet 1                                QuerySet 2
            FWC                 FWC+WP                FWC                 FWC+WP
            A-Prec.   R-Prec.   A-Prec.   R-Prec.     A-Prec.   R-Prec.   A-Prec.   R-Prec.
baseline    0.1619    0.1399    0.1649    0.1416      0.3342    0.3112    0.3197    0.2734
QW^1        0.1822    0.1446    0.2011    0.1647      0.3496    0.3191    0.3628    0.3299
QW^2        0.1996    0.1791    0.2100    0.1639      0.3586    0.3303    0.3797    0.3414
In addition, we have built three groups for each query set to investigate the relationship between the characteristics of a query and the keyword extraction methods. First, the Short group consists of the official and alias symbols of each specific gene, whose length does not exceed one or two words. In contrast, the Long group consists of official and alias gene names, which are relatively long multiword terms. The All group consists of all gene symbols and gene names from the Short group and the Long group.

In Table 2, one can notice that FWC leads to substantial improvements with QuerySet 2, but not with QuerySet 1. Moreover, the performance of FWC+WP is similar to that of FWC, which means that WP contributes little to performance improvement in this experiment. We found two important reasons for these unsatisfactory results. First, FWC is of little use in our experiment on QuerySet 1 because most possible variations of the gene symbols are already given in the query set. In contrast, FWC works very well in the experiment on QuerySet 2, where a relatively small number of aliases are available in the query set. Since biomedical resources do not always provide sufficient information, we believe that our FWC method is quite useful for processing biomedical documents. Second, the word pairs generated by WP have relatively high weights because their df values are generally very low. For this reason, documents containing word pairs tend to have a high score, resulting in biased rankings. This problem should be solved by devising an appropriate weighting method for the word pairs, which is also a difficult problem in information retrieval when phrases are used as keywords[4].
4.3 Effect of Query Term Reweighting Method
Table 3 shows the performance of our query reweighting methods, including query length normalization and inverse query frequency. In this table, the baseline means the performance when the proposed term indexing method described in Section 2 is used and the pure Okapi model is used to calculate term weights. Query length normalization ($QW^1$) improves performance on both query sets, especially when we extract word pairs as keywords within recognized multiword terms. It is clear that the number of keywords is greatly increased if we extract all word pairs in each multiword term. In the baseline method, these word pair keywords slightly improve the performance in the experiment on QuerySet1, but degrade the performance on QuerySet2. We think that their relatively high idf weights often dominate the effect of the original keywords even though some adjacencies in a multiword term are captured by the word pairs. In this situation, our proposed query normalization substantially improves the retrieval performance on QuerySet2 by performing effective normalization for the word pairs.

Inverse query frequency ($QW^2$) also helps to improve performance. The query sets used in our experiment are about the gene domain, while the MEDLINE document collection covers various topics. Because $QW^2$ is a discriminative value in a gene name list, we conclude that another idf-like statistic works well if the statistic can be calculated from available resources in a more specific query domain.

Table 4: Performances of using query length normalization on QuerySet1
              w/o QLM             with QLM
              A-Prec    R-Prec    A-Prec    R-Prec
Short         0.1362    0.1268    0.1362    0.1268
Short+Long    0.1658    0.1377    0.1854    0.1571
All           0.1649    0.1416    0.2011    0.1647

Table 4 shows another advantage of query length normalization. In this table, performances with and without query length normalization are presented when we first retrieve documents using only a symbol query and then expand the original query by adding its aliases consisting of several words. The more synonymous multiword terms we add to the original query, the larger the performance gap between the baseline and the query length normalization becomes. In particular, it is noticeable that the performance of the baseline system deteriorates even when all available synonymous gene names and aliases from the thesaurus are added, although these are undoubtedly useful information for retrieval. This means that the traditional IR model is not appropriate to be applied directly to a biomedical text retrieval system. From this result, we conclude that balancing the query weights of the original query and its synonymous multiword terms by the proposed query length normalization is quite effective in expanding the original query.

Table 5: Summary of performances for proposed system
            QuerySet 1                              QuerySet 2
            A-Prec    R-Prec    RetRel    P@10      A-Prec    R-Prec    RetRel    P@10
baseline    0.1587    0.1343    484       0.1206    0.2953    0.2697    270       0.1160
proposed    0.2100    0.1639    509       0.1700    0.3901    0.3788    281       0.1660

Table 5 shows the summary of performances. When we use all the proposed methods together, we obtain about a 30% increase in average precision and R-precision compared to the baseline system, where only unigrams are regarded as indexing units and pure BM25 weighting is used to calculate term weights. Our methods also show better performance than the baseline on the other evaluation measures.⁴

⁴ The results of our system, when additional methods such as the organism filtering are used, are described in our TREC report[9].

Table 6: Performance improvements according to proposed additional methods
            QuerySet 1            QuerySet 2
            A-Prec    ∆ (%)       A-Prec    ∆ (%)
baseline    0.1587                0.2953
+FWC        0.1619    +2.01       0.3342    +13.17
+WP         0.1649    +1.85       0.3197    -4.34
QW^1        0.2011    +24.21      0.3628    +8.56
QW^2        0.2100    +29.71      0.3797    +13.61

Table 6 shows the degree of each performance improvement when each proposed technique is incorporated into the baseline one by one. All the proposed methods yield better results except for one negative case, where we use the word pair keywords with the pure BM25 weighting method. However, we can also obtain substantial improvement with the query length normalization, $QW^1$, compared to the baseline performance even if the word pair keywords are used. This means that the decrease in performance from the word pair keywords with the traditional term weighting method is mainly due to the excessive weights assigned to the word pair keywords.

In the first TREC Genomics track, many systems used a query formulating or rewriting method for accurate terminology matching between documents and queries, while our methods mainly concentrate on keyword extraction and weighting. One notable approach is the Okapi query formulation method of [10], which generates multiple query term sets from the original query using three query formulating rules and combines the retrieval results produced by all query term sets. Their query formulating rules range from a very strict terminology matching method, which retrieves only documents containing the exact terminology in the original query, to a loose matching method, which retrieves all documents containing the same bigram as found in the query. That method achieved an average precision of 0.2323 with QuerySet 1 and 0.3321 with QuerySet 2[10]. This result cannot be directly compared with our results presented in Table 5, because they add the name of the species in the organism field of the query[3] to the query term sets. However, the query formulating method seems to be useful for improving retrieval performance, so we plan to adopt their query formulating rules as keyword extraction rules in our future system.

Additionally, we expect that our methods can be easily adapted to another domain having similar features. With query formulating or rewriting methods, all rules must be carefully reconstructed when the domain changes, whereas our methods only require a new or retrained term recognizer for the new domain.

5. CONCLUSIONS AND FUTURE WORKS
In this paper, we have proposed biomedical term indexing and reweighting methods to improve biomedical text retrieval systems. Our indexing methods, FWC and WP, and reweighting methods, $QW^1$ and $QW^2$, were developed by carefully considering the unique characteristics of biomedical terminology. From the experiments on the MEDLINE corpus, we can draw the following conclusions:
• Functional word components such as “1A” or “G” in terminology are more useful when they are combined with the neighboring words.
• Word pair keywords are not effective unless we properly discount the excessive weights of the keywords due to their low document frequencies.
• With query length normalization, performance can be improved by controlling the weights of the words in short and long terminologies.
• Inverse query frequency (iqf) turns out to be a novel and effective statistic that enhances the discriminative power of the terms used in terminologies for queries.
For future work, we will try to develop a more appropriate weighting scheme for the biomedical domain, and a more elaborate term indexing method using biomedical word formation patterns.
6. REFERENCES
[1] G. Bhalotia and P. Nakov. Biotext team report for TREC 2003 genomics track. In The Twelfth Text REtrieval Conference: TREC 2003, Gaithersburg, US, 2003.
[2] B. deBruin and J. Martin. Finding gene function using litminer. In The Twelfth Text REtrieval Conference: TREC 2003, Gaithersburg, US, 2003.
[3] W. Hersh and R. T. Bhupatiraju. TREC genomics track overview. In The Twelfth Text REtrieval Conference: TREC 2003, Gaithersburg, US, 2003.
[4] K. S. Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: Development and comparative experiments; part 1. Information Processing and Management, 36(6):779–808, November 1998.
[5] K. J. Lee, Y. S. Hwang, and H. C. Rim. Two-phase biomedical NE recognition based on SVMs. In S. Ananiadou and J. Tsujii, editors, Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 33–40, 2003.
[6] S. E. Robertson and S. Walker. Okapi/Keenbow at TREC-8. In Proceedings of TREC-8, 8th Text Retrieval Conference, pages 151–161, Gaithersburg, US, 2000.
[7] G. Rocio and F. Tasnim. Regen: Retrieval and extraction of genomics data. In The Twelfth Text REtrieval Conference: TREC 2003, pages 107–117, Gaithersburg, US, 2003.
[8] S. Schultz, M. Honeck, and U. Hahn. Biomedical text retrieval in languages with a complex morphology. In Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, pages 61–68, Philadelphia, July 2002. Association for Computational Linguistics.
[9] Y. I. Song, K. S. Han, H. C. Seo, S. B. Kim, and H. C. Rim. Biomedical text retrieval system at Korea University. In The Twelfth Text REtrieval Conference: TREC 2003, pages 368–375, Gaithersburg, US, 2003.
[10] D. L. Yeung, C. L. A. Clarke, G. V. Cormack, T. R. Lynam, and E. L. Terra. Task-specific query expansion (MultiText experiments for TREC 2003). In The Twelfth Text REtrieval Conference: TREC 2003, pages 810–820, Gaithersburg, US, 2003.