Query Expansion with Biomedical Ontology Graph for Effective MEDLINE Document Retrieval

James Z. Wang1,*, Liang Dong2, Yuanyuan Zhang1, Lin Li1, Pradip K. Srimani1, and Philip S. Yu3
1 School of Computing, Clemson University, Clemson, SC 29634, USA
2 Barnes and Noble, LLC, New York City, New York, USA
3 Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
ABSTRACT
Motivation: With the proliferation of new discoveries in biomedical research areas as well as the dramatic increase in the volume of publications, it is imperative to design efficient search strategies and develop effective search engines to look up relevant documents.
Results: This paper proposes a novel ontology graph based scheme for query expansion. In this scheme, a Personalized PageRank algorithm is first run on an ontology graph derived from multiple diverse biomedical ontologies to perform user query expansion. Then, a weighted edge semantic similarity measure is used to filter out the less relevant terms in the expanded query term set, further improving the relevance of information retrieval. Extensive experimental results show that this new search scheme outperforms the popular Lucene approach by 22%, while other existing query expansion approaches are unable to beat the free-text based Lucene strategy. Furthermore, a graph-based biomedical search engine, G-Bean (Graph-based biomedical search engine), is implemented based on this new scheme. Not only can G-Bean rank the initial search results by their relevance to the user query, but it also discovers the user's true search intention and conducts a new query, based on the articles the user has already shown interest in, to retrieve additional relevant documents. G-Bean provides a more accurate and easier-to-use Web interface for searching the MEDLINE database, compared with the most popular biomedical search engine, PubMed.
Availability: The search engine, G-Bean, is available at: http://bioinformatics.clemson.edu/G-Bean/index.php/
Supplementary information: http://bioir.cs.clemson.edu:8080/BioIRWeb/supplement.jsp
Contact:
[email protected]
1 INTRODUCTION
The dramatic increase in both volume and diversity of biomedical publications in recent years makes finding relevant biomedical articles a huge burden. General-purpose search engines, such as Google and Bing, often fail to return relevant search results due to the eclectic nature of biomedical terms and the frequent use of numerous acronyms and abbreviations in biomedical articles. Nowadays, most biomedical researchers use PubMed to search the MEDLINE database, which contains more than 23 million biomedical articles. However, finding publications in PubMed pertaining to users' individual interests is still challenging, especially for a non-expert user. It is widely reported that less-experienced users, including those who regularly use the PubMed system, do not utilize it as effectively as experienced users (McKibbon and Walker-Dilks 1995) (Bernstam 2001). PubMed employs a Boolean search strategy to perform document retrieval. Less-experienced users either fail to employ effective context-sensitive keywords or fail to formulate effective query expressions using Boolean logic. (Wildemuth and Moore 1995) reported that a novice user (a third-year medical student) requires on average fourteen separate queries to get the desired information using a Boolean system. In addition, PubMed does not always return the most relevant articles for user queries.
PubMed's underperformance in biomedical information retrieval is partly due to the fact that it uses only a very small subset of the Medical Subject Headings (MeSH) (Lipscomb 2000) to index biomedical articles. PubMed compares and maps keywords from an input query to a list of pre-indexed MeSH terms. MeSH (version 2013) is composed of 26,853 descriptors, 83 qualifiers, over 213K assisting entry terms, and over 214K supplementary concept records. However, only descriptors and qualifiers are used in PubMed indexing, which means that only 5.8% of the concepts in MeSH are used for indexing. Compared with the 2.9 million biomedical concepts available in the UMLS Metathesaurus 2013AA, PubMed indexing includes less than 1.0% of the available vocabulary. What's more, PubMed implements query expansion by adding mapped MeSH terms to the original query. Since MeSH indexing covers less than 1.0% of the available terms, query expansion in PubMed does not provide satisfactory performance in finding related or more accurate terms to retrieve relevant articles.
To address these limitations, we introduce G-Bean for searching biomedical articles in the MEDLINE database more efficiently. We first develop a procedure to automatically construct an ontology graph from multiple diverse biomedical ontologies.

* To whom correspondence should be addressed.
Then a Personalized PageRank algorithm (Haveliwala, Topic-Sensitive PageRank 2002) (Haveliwala, Kamvar and Jeh, An Analytical Comparison of Approaches to Personalizing PageRank 2003) is applied to this ontology graph to obtain Personalized PageRank Vectors (PPVs) for its concepts. These PPVs are used to expand the query by adding additional relevant concepts to the user query so as to improve search performance. Furthermore, a weighted edge similarity measure (Dong, Srimani and Wang, Weighted-Edge: A New Method to Measure the Semantic Similarity of Words based on WordNet 2010) is used to filter out the less relevant concepts in the expanded query. Based on this novel scheme, G-Bean is implemented for searching biomedical articles in the MEDLINE database. The main contributions of this ontology graph based query expansion and indexing scheme are five-fold: 1) automatically constructing the ontology graph from multiple diverse UMLS biomedical ontologies to address the low concept coverage of MeSH indexing in PubMed; 2) applying the Personalized PageRank algorithm to the ontology graph to obtain the PPV, which shows the
relevance of concepts in the ontology graph to the query; 3) adapting a weighted edge similarity measure to calculate the semantic similarity between the expanded terms and the original query terms, so that less relevant terms are excluded from the expanded query; 4) ranking the query results by relevance to the user query by default; 5) retrieving additional relevant articles based on the articles the user has already shown interest in and ranking them by their relevance to those articles. Extensive experimental results show that this new scheme outperforms the free-text based Lucene search strategy by 22% in terms of 11-point average precision and produces better retrieval performance than previous methods.
2 METHODS
2.1 Ontology Graph Based Query Expansion Model
Query Expansion (QE), an integral part of any search engine, is the process of reformulating the original user query (expanding the search query with additional relevant terms) to improve search performance in terms of both recall and precision, because users do not always formulate search queries using the "best" search terms. Most query expansion schemes either use a single ontology or knowledge base to discover extended concepts, or make certain heuristic assumptions about user queries and search results to improve query performance (Hersh, Buckley, et al. 1994) (Hersh, Report on the TREC 2004 genomics track 2005) (Hersh, Information retrieval: a health and biomedical perspective 2009) (Hersh and Hickam, Information retrieval in medicine: the SAPHIRE experience 1995) (Hersh and Hickam, A comparison of retrieval effectiveness for three methods of indexing medical literature 1992) (Srinivasan, Exploring query expansion strategies for MEDLINE 1995) (Srinivasan, Optimal document-indexing vocabulary for MEDLINE 1996) (Yoo and Choi 2007) (Abdou, Ruck and Savoy 2005). (Hersh, Price and Donohoe, Assessing thesaurus-based query expansion using the UMLS Metathesaurus 2000) performed query expansion using synonym, hierarchical, and related-term information, together with term definitions, from the UMLS Metathesaurus. (Yang and Chute, An application of least squares fit mapping to text 1993) (Yang, Expert network: effective and efficient learning from human decisions in text categorization and retrieval 1994) introduced a technique to map query terms to MeSH terms in the UMLS Metathesaurus by a linear least squares technique. The average precision improvement was 32.2% for a small test collection, but the improvement may have been exaggerated by their use of a large training set.
(Aronson, Rindflesch and Browne, Exploiting a Large Thesaurus for Information Retrieval 1994) mapped the text of queries as well as documents to terms in the UMLS Metathesaurus and achieved a 4% improvement in average precision over unexpanded queries for 3,000 MEDLINE documents. When tested with the OHSUMED test collection, all of the query expansion schemes degraded the aggregate retrieval performance. (Mao and Chu 2002) then introduced a phrase-based vector space model for retrieving documents in the biomedical area and obtained a 16% improvement in 11-point average precision over a stem-based model on the OHSUMED test collection. This study differs from former works in that our proposed query expansion approach explores semantic links in a large ontology graph, derived from multiple diverse ontologies, so that more terms related to a given query can be found and used to improve query performance. We evaluate the query expansion algorithm on the whole OHSUMED test collection and the retrieval
performance is greatly improved, by 22% compared to the free-text based Lucene query in terms of 11-point average precision.
In general, knowledge base assisted query expansion can be modeled as a word similarity problem. Consider an arbitrary semantic similarity function sim_B(x, y) which computes the semantic similarity between two terms x and y in a knowledge base B. We define the Accumulated Similarity (AS) between a term set X = {x_1, x_2, ..., x_n} and a term y as AS(X, y) = Σ_{i=1}^{n} sim_B(x_i, y). Then the knowledge base assisted query expansion problem can be modeled as: given a set of query terms X = {x_1, x_2, ..., x_n} and a knowledge base B, find the set of top k terms Y = {y_1, y_2, ..., y_k} having the largest Accumulated Similarity with the query set X in B. Similarly, our ontology graph based query expansion problem is modeled as follows: given an ontology graph G = (V, E) and a similarity function sim_G(x, y) for computing the similarity between any two nodes x and y, as shown in Figure 1(a), we define the Accumulated Similarity (AS) between a node set X = {x_1, x_2, ..., x_n} and any node y as AS(X, y) = Σ_{i=1}^{n} sim_G(x_i, y), as shown in Figure 1(b). Thus, our ontology graph based query expansion can be modeled as taking the first M nodes in the set V \ X sorted in descending order of AS(X, y) values, where X is the set of query terms (nodes) and V is the set of all nodes in the ontology graph [see Figure 1(c), where X = {X0, X1, X2} and M = 2; the two shaded nodes are chosen].
Fig. 1. Relationship between word similarity and query expansion
Note: We use accumulated similarity rather than the maximum similarity max_i{sim(x_i, y)} to prevent query drifting (Gauch, Wang and Rachakonda 1999) (Mitra, Singhal and Buckley 1998); query drifting degrades search performance and is probably the worst effect of query expansion, offsetting its advantages.
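The top-M selection described above can be sketched as follows; the toy similarity table and concept names are illustrative, not drawn from the paper:

```python
# Sketch of the accumulated-similarity expansion model described above.
# `sim` stands in for any pairwise similarity function sim_G(x, y).

def accumulated_similarity(sim, query_terms, candidate):
    """AS(X, y) = sum of sim(x_i, y) over all query terms x_i."""
    return sum(sim(x, candidate) for x in query_terms)

def expand_query(sim, query_terms, vocabulary, m):
    """Return the top-M non-query terms ranked by accumulated similarity."""
    candidates = [t for t in vocabulary if t not in query_terms]
    candidates.sort(key=lambda y: accumulated_similarity(sim, query_terms, y),
                    reverse=True)
    return candidates[:m]

# Toy example: accumulated similarity prefers terms related to *all*
# query terms, which is what guards against query drifting.
toy_sim = {("vitamin", "vitamin a"): 0.9, ("nyctalopia", "vitamin a"): 0.8,
           ("vitamin", "mineral"): 0.7, ("nyctalopia", "mineral"): 0.0}
sim = lambda x, y: toy_sim.get((x, y), 0.0)
print(expand_query(sim, ["vitamin", "nyctalopia"],
                   ["vitamin a", "mineral"], m=1))  # ['vitamin a']
```

Here "mineral" is close to "vitamin" alone, so under a maximum-similarity criterion it could displace "vitamin a"; accumulating over both query terms keeps the expansion anchored to the whole query.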
2.2 Ontology Graph-Based Query Expansion Using the Personalized PageRank Algorithm
To illustrate how query expansion works in an ontology graph, consider a search for two concepts, "Vitamin" and "Nyctalopia (Night Blindness)", as user input. A sub-ontology graph containing both concepts is shown in Figure 2; we use one simple English word to represent each concept in the figure. A simple query expansion scheme would compute two complete sets of concepts in the ontology graph – one linked to "Vitamin" and the other linked to "Nyctalopia" – and use their intersection as the set of expanded query terms. Such a scheme has two major disadvantages: (1) it is computationally expensive, since the set of concepts that are directly or transitively linked to a given concept in an ontology graph is usually quite large due to the close interrelationships among many concepts; in an ontology graph derived from multiple diverse biomedical ontologies, there might be hundreds of concepts related to either "Vitamin" or "Nyctalopia", and
tens of concepts related to both; (2) the terms in the expanded query set may not reflect true semantic similarity with the original query terms, since the edges in the ontology graph are not weighted (there is no measure of closeness in a relationship). Our proposed ontology graph based query expansion scheme introduces two key ideas to ameliorate these disadvantages. First, we adapt a Personalized PageRank algorithm to select the expanded query terms, improving computational efficiency (Agirre and Soroa 2009). Second, we assign different weights to the edges of the ontology graph to match human perception of concepts at different specialization levels, so that less relevant terms in the expanded query term set are filtered out. To conceptualize the Personalized PageRank algorithm, imagine that a random surfer is teleported back to either "Vitamin" or "Nyctalopia" each time a new concept is reached. Thus, "Vitamin" and "Nyctalopia" will have the highest probability in the Personalized PageRank Vector (PPV), followed by those concepts linked to (or close to) both "Vitamin" and "Nyctalopia", such as "Vitamin A" and "Cod-liver oil". Concepts close to only one of the two receive less probability mass, and concepts far from both "Vitamin" and "Nyctalopia" are assigned the lowest probability values.
Fig. 2. A sample fraction of a biomedical ontology graph
Fig. 3. Flow chart of our proposed ontology graph based query expansion
The novelty of the Personalized PageRank algorithm is that it naturally assigns higher probability to the concepts that are closer to the original query terms. By computing the PPV, we acquire the probability distribution over the concepts of the entire graph, and the PPV provides a numeric measure of the semantic similarity of each concept to the original query concepts. The architectural flow chart in Figure 3 shows the major steps of our ontology graph based information retrieval scheme. There are six steps in the entire architecture; we elaborate on each of them in the following subsections.
Step A – Ontology graph construction
We construct our ontology graph with the assistance of the UMLS Metathesaurus (Humphreys, et al. 1998), which is a multi-purpose, multi-lingual vocabulary database containing information about biomedicine-related concepts, their various names, and their interrelationships. In the UMLS Metathesaurus, each biomedical concept is identified by a distinct eight-character alphanumeric string, called a Concept Unique Identifier (CUI). Each CUI is associated with a set of lexically variant strings, called concept names. A concept name may refer to medical conditions, appendages, diseases, drugs, and so on; it may be a single term, a phrase, or a string of terms. The MRCONSO table stores the complete set of CUIs
and concept names. We use the CUI to represent a biomedical concept in this paper. In addition, the Metathesaurus includes many inter-concept relationships; most of these relationships come from the individual source vocabularies. Other relationships are either added by NLM during Metathesaurus construction or contributed by users to support certain types of applications. The inter-concept relationships, such as parent/child and immediate sibling, are stored in the MRREL table. Our ontology graph is automatically constructed by a program using the information from the MRCONSO and MRREL tables (Step A in Figure 3), where a vertex represents a concept and an edge represents an inter-concept relationship. We do not label the edges of the graph with the types of inter-concept relationships, and no weight is attached to the edges at this stage. (Chvatal 1979) found that a set of four vocabularies in UMLS – MeSH, SNOMEDCT, CSP and AOD – was able to cover all senses of the target words in the National Library of Medicine database. Therefore, in this paper, these four major biomedical ontologies, with a total of 702,455 concepts of Metathesaurus 2013AA, are employed to build the ontology graph for our experiments; note that Metathesaurus 2013AA includes a total of 2.9 million concepts. Table 1 lists the full name and the number of CUIs of each selected vocabulary.
Table 1. Four vocabularies and their numbers of CUIs

Abbr. Name    Full Name                                # of CUIs
MeSH          Medical Subject Headings                   339,922
SNOMEDCT      SNOMED Clinical Terms                      329,977
CSP           CRISP Thesaurus, 2006                       16,654
AOD           Alcohol and Other Drug Thesaurus, 2000      15,902
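The construction in Step A can be sketched as follows. The .RRF column positions used here (MRCONSO: CUI at field 0, SAB at field 11; MRREL: CUI1 at field 0, CUI2 at field 4, SAB at field 10) follow the UMLS documentation and should be verified against the release in use:

```python
# Sketch of Step A: building an undirected concept graph from the
# pipe-delimited Metathesaurus MRCONSO/MRREL tables.
from collections import defaultdict

SOURCES = {"MSH", "SNOMEDCT", "CSP", "AOD"}  # the four vocabularies in Table 1

def load_concepts(mrconso_path):
    """Collect the CUIs contributed by the selected source vocabularies."""
    cuis = set()
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if fields[11] in SOURCES:
                cuis.add(fields[0])
    return cuis

def build_graph(mrrel_path, cuis):
    """One vertex per CUI, one unlabeled, unweighted edge per
    inter-concept relationship, as in the paper's Step A."""
    graph = defaultdict(set)
    with open(mrrel_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            c1, c2 = fields[0], fields[4]
            if fields[10] in SOURCES and c1 != c2 and c1 in cuis and c2 in cuis:
                graph[c1].add(c2)
                graph[c2].add(c1)
    return graph
```

Restricting both endpoints to CUIs from the four selected vocabularies keeps the graph at the 702,455-concept scale reported above rather than the full 2.9 million concepts of the Metathesaurus.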
Step B – Mapping text to CUIs
MetaMap (Aronson and Lang, An overview of MetaMap: historical perspective and recent advances 2010) is a supporting software tool provided by NLM. It combines symbolic, natural language processing (NLP), and computational linguistic techniques in a knowledge-intensive approach. We use MetaMap to automatically map biomedical text to UMLS Metathesaurus CUIs (Step B1 in Figure 3). To map text to the Metathesaurus, MetaMap applies the SPECIALIST parser to the text, producing a set of noun phrases, and generates the variants of each noun phrase, where a variant essentially consists of one or more noun phrase words together with all of their spelling variants, abbreviations, acronyms, and synonyms. Each noun phrase is tagged using the MedPost/SKR tagger to identify nouns, verbs, prepositions, adjectives, punctuation, etc., and to locate the head, or main idea, of the noun phrase. MetaMap then maps each noun phrase to a set of candidate CUIs containing one of the variants and returns a score for each candidate CUI. Finally, it combines candidates covering disjoint parts and recomputes scores based on the combined candidates. The CUIs with the highest scores are selected as the best match to the input text. Since we use only the four vocabularies shown in Table 1 to construct the ontology graph, we keep only those mapped CUIs that exist in the four selected ontologies (Step B2 in Figure 3).
Step C – Personalized PageRank on CUIs
Personalized PageRank was introduced and discussed by Haveliwala in (Haveliwala, Topic-Sensitive PageRank 2002) (Haveliwala, Kamvar and Jeh, An Analytical Comparison of Approaches to Personalizing PageRank 2003). We adapt Personalized PageRank as follows. Given an input biomedical text, the CUIs produced by MetaMap are used as the initial teleportation probability vector to compute the Personalized PageRank Vector (PPV) via power iteration.
We denote the CUIs with top scores in the computed PPV as PPV CUIs, shown as Step C1 in Figure 3. The scores of the PPV CUIs are L1-normalized. Thus, our modified Personalized PageRank ensures that the Original CUIs are present and highly scored in the computed PPV CUIs. The PPV CUIs of the query terms are used as query expansion candidates. Most query texts are short and map to about 2-4 Original CUIs. We select a fixed top 500 computed PPV CUIs as candidates for query expansion in Step C2 of Figure 3.
Step D – Re-ranking the PPV CUIs
A key idea of our ontology graph based method is to build the L1-normalized PPV CUIs effectively and efficiently from the Original CUIs for query expansion. However, the direct use of these PPV CUIs for query expansion may not guarantee better performance in biomedical information retrieval. First, note that our Personalized PageRank algorithm ensures that the Original CUIs yield high-score PPV CUIs. If we sort the PPV CUIs in descending order, the PPV CUIs derived from the Original CUIs usually have high scores, and the score gaps between two consecutive PPV CUIs in this group are large in most cases. The
rest of the PPV CUIs have much lower scores as well as tiny score gaps between consecutive CUIs. Thus, the direct use of PPV CUIs in query expansion does not differ significantly from simply using the Original CUIs, since the Original CUIs are still ranked at the top of the list. Second, general concepts, as opposed to more specialized ones, usually have more links in the ontology graph and thus obtain higher computed PPV scores. This phenomenon causes general medical concepts, such as 'disease' and 'therapy', to be highly ranked in the PPV CUIs list – undesirable for the accuracy of the subsequent information retrieval. To address these problems, we assign a new weight to each PPV CUI in order to re-rank the PPV CUIs. Analogous to the classic TF-IDF (term frequency-inverse document frequency) method (Salton, Wong and Yang 1975) in information retrieval, we use the following PS-IPF weight formula to assign a weight w_i to PPV CUI i:

    w_i = ps_i^λ × ipf_i                                  (1)
    ps_i = L1-normalized PPV score of CUI i               (2)
    ipf_i = log( N / (n_i + 0.5) )                        (3)

The weight w_i is a product of two factors: the first factor, ps_i (an acronym for PPV Score), is the L1-normalized PPV score of CUI i, where λ ∈ [0, 1] is a tuning parameter; the contribution of the PPV score increases when λ decreases. The second factor, ipf_i, is the inverse PPV frequency (IPF), analogous to the inverse document frequency (Jones, Walker and Robertson 2000), where N is the total number of computed PPVs in the collection and n_i is the number of PPVs related to CUI i. Note that 0.5 is added in Equation (3) to prevent a computation exception when n_i = 0. To statistically estimate the IPF in Equation (3), we computed and indexed a large number of PPVs from a biomedical corpus to build the IPF repository (Step D1 in Figure 3); the PPV CUIs for documents are computed using a sliding window method, different from the fixed top 500 PPV CUIs for query text. Since title and abstract texts may have arbitrary length with different numbers of Original CUIs, a sliding window of size 100 is applied to the sorted PPV CUIs list to truncate the sequence when the difference in scores between the first and last CUI in the window drops below 5% of the highest-score PPV CUI in the window. The 348K OHSUMED documents are used to build the IPF repository, shown as Step D2 in Figure 3. We apply the Personalized PageRank algorithm to the 348K OHSUMED documents and find the 500 CUIs most closely related to each document. We then estimate the IPF by counting the frequency of every CUI generated in Step C2 using Equation (3). After computing the weights of all PPV CUIs using Equation (1), we sort the query PPV CUIs again by weight and select the top k ranked candidates as the computed PPV CUIs, shown as Step D3 in Figure 3. The computed weights of the k CUIs are divided by the highest weight so that the final weights are normalized to the range [0, 1].
After re-ranking the PPV CUIs with PS-IPF weights, we obtain the new expanded query, composed of both the original query text and the computed PPV CUIs. Since the number of computed PPV CUIs is usually larger than the number of original query terms, we need to lower the relative weight of the expanded CUIs to prevent query drifting, since query drifting often causes the presence and dominance of non-query-related aspects or topics in the top-retrieved documents. We use a boosting weight (less than 1) to weaken the influence of the Final CUIs during the final query construction and thereby avoid query drifting.
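The re-ranking in Step D can be sketched as below, assuming the TF-IDF-style form w_i = ps_i^λ · log(N / (n_i + 0.5)); the exponent value, the IPF counts, and the concept names are illustrative, and the paper's exact formula may differ in detail:

```python
# Sketch of Step D: PS-IPF re-ranking of PPV CUIs.
import math

def ps_ipf_rerank(ppv_scores, ipf_counts, n_docs, lam=0.5, k=5):
    """ppv_scores: {cui: L1-normalized PPV score};
    ipf_counts: {cui: number of document PPVs containing cui}.
    w_i = (ps_i ** lam) * log(N / (n_i + 0.5)); a smaller lam boosts
    the PPV-score factor, and +0.5 avoids division by zero."""
    weights = {}
    for cui, ps in ppv_scores.items():
        ipf = math.log(n_docs / (ipf_counts.get(cui, 0) + 0.5))
        weights[cui] = (ps ** lam) * ipf
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    w_max = weights[top[0]]
    return {cui: weights[cui] / w_max for cui in top}  # normalized to [0, 1]

# Illustrative data: a generic concept ("disease") appears in many
# document PPVs, so its IPF demotes it below a specialized concept.
reranked = ps_ipf_rerank(
    {"C_disease": 0.4, "C_gallium_nitrate": 0.3, "C_therapy": 0.3},
    {"C_disease": 300_000, "C_gallium_nitrate": 40, "C_therapy": 250_000},
    n_docs=348_000, k=3)
```

This mirrors the TF-IDF intuition cited above: a concept's PPV score plays the role of term frequency within the query, while its frequency across the 348K document PPVs plays the role of document frequency.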
Step E – Term-level filtering
In this step, we further refine the PPV CUIs by using the semantic similarity between the computed PPV CUIs and the Original CUIs to filter out expanded CUIs that are semantically less similar to the Original CUIs. Weighted edge similarity is an effective method for computing the semantic similarity of node pairs in an ontology graph (Dong, Srimani and Wang, WEST: Weighted-Edge Based Similarity Measurement Tools for Word Semantics 2010) (Dong, Srimani and Wang, Weighted-Edge: A New Method to Measure the Semantic Similarity of Words based on WordNet 2010). Weighted edge similarity considers the specification levels of both CUI nodes and of their Least Common Ancestor (LCA) node in the ontology graph. We compute the weighted edge similarity between the Original CUIs and the computed PPV CUIs using Equation (4) and filter out CUIs with low similarity scores:

    dist(i, j) = Σ_{e ∈ path(i → LCA)} α^slev(e) + Σ_{e ∈ path(j → LCA)} α^slev(e)    (4)

Given two nodes i and j in the ontology graph, the weighted edge distance between i and j is the sum of all the edge weights along the shortest paths from i and j to their LCA; slev is the specification level of a node in the ontology graph, and α is a constant, α ∈ (0, 1]. As shown in (Dong, Srimani and Wang, WEST: Weighted-Edge Based Similarity Measurement Tools for Word Semantics 2010) (Dong, Srimani and Wang, Weighted-Edge: A New Method to Measure the Semantic Similarity of Words based on WordNet 2010), the weighted edge distance can be translated into a similarity value using hyperbolic transfer functions. The determination of the optimal weighting factor α is also discussed in those papers.
To compute the semantic similarity of two nodes in the ontology graph, we need to explore the hierarchy of the ontology structure to find the Least Common Ancestor (LCA) node. We use the "Computable" Hierarchies (MRHIER) table of the Metathesaurus to find the hierarchy among the concepts from multiple ontologies. The MRHIER table of the ontology graph with four ontologies contains 650K distinct CUIs out of 15,246,3946 total records. The MRHIER table has two important attributes, AUI and PTR: (1) the Metathesaurus assigns an Atom Unique Identifier (AUI) to each occurrence of a term or phrase in a source vocabulary. If the same term or phrase appears multiple times in the same ontology, i.e., the same term representing different concepts, a unique AUI is assigned to each occurrence. An AUI starts with the letter "A" followed by seven or more digits; the source abbreviation that contributed each AUI is noted after the string. (2) PTR denotes the "Path to Top or Root" of the hierarchical context. PTR is a string composed of AUIs separated by periods; each AUI represents a node in the hierarchy. Other attributes of MRHIER, including SAB and PAUI, can be derived from PTR and AUI.
In Figure 4, we use OHSUMED query #10, "Effectiveness of gallium therapy for hypercalcemia", to demonstrate how we determine the hierarchy of the related concepts in an ontology graph derived from two ontologies, MeSH and SNOMED CT. The term "Gallium" corresponds to CUI C0016980, whose MRHIER records are shown in Table 2 and whose MRCONSO records are shown in Table 3. Figure 4 shows the hierarchies in the ontology graph around "gallium nitrate", "fermium", "berkelium" and "gallium". The AUI-specific level (SpecLev) of the hierarchy, shown in the figure, is used to compute the weighted path length as well as the similarity between a pair of concepts.

Fig. 4. Hierarchies of "gallium nitrate", "fermium", "berkelium" and "gallium"

Table 2. MRHIER of CUI C0016980 "Gallium"

AUI        SAB        PAUI       PTR
A0014094   MSH        A0743535   A0434168.A2367943.A18456972.A0135374.A0135450.A0053536.A0743535
A0014094   MSH        A0743535   A0434168.A2367943.A18456972.A0135374.A0135450.A0085365.A0743535
A0479658   AOD        A1388564   A1386158.A1389303.A1389283.A1392037.A1388564
A0479659   CSP        A1195034   A0398472.A0318590.A0318854.A0483678.A1195034
A2877777   SNOMEDCT   A3471460   A3684559.A3206010.A16967690.A3347798.A3559706.A3471460
A2877777   SNOMEDCT   A3471460   A3684559.A3206010.A3738095.A3347798.A3559706.A3471460

While our Personalized PageRank algorithm uses the query's context in the ontology graph to expand the user query, our weighted edge similarity algorithm uses a node pair's distance, the specification level of its LCA, and the specification level difference in the ontology graph to determine the semantic similarity of a pair of concepts. We applied the weighted edge similarity algorithm to the PPV CUIs list obtained by the Personalized PageRank algorithm, computing the semantic similarity between the Original CUIs and the computed PPV CUIs to refine the expanded PPV CUIs. Specifically, we evaluate the semantic similarity of each expanded CUI with every Original CUI. A heuristic threshold (see Section 3.2.2) is set to remove terms with low similarity to the Original CUIs. Through this process (Step E in Figure 3), we
can filter out the less relevant CUIs in the final expanded set of CUIs for the given query. Therefore, the final expanded set of CUIs contains only the most relevant query terms, which helps avoid query drifting.

Table 3. MRCONSO of CUI C0016980 "Gallium"

AUI        SAB        STR
A0014095   MSH        Gallium
A2877777   SNOMEDCT   Gallium
A0014094   MSH        Gallium
A0479659   CSP        gallium
A0479658   AOD        gallium
A4781508   SNOMEDCT   Gallium, NOS
A1961887   CSP        Ga element
A3471456   SNOMEDCT   Gallium
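The weighted-edge idea of Step E can be sketched as follows, assuming edge weights of the form α^slev summed along each node's path to the LCA; the tanh-based transfer to a similarity value, the default α, and the toy depths are illustrative stand-ins for the exact formulation in the cited WEST papers:

```python
# Sketch of the weighted-edge distance and similarity used in Step E.
# Nodes are abstracted to their specification levels (depths): deeper,
# more specialized edges contribute smaller weights alpha**level.
import math

def weighted_edge_distance(depth_i, depth_j, depth_lca, alpha=0.8):
    """Sum alpha**level over the edges on each node's path to the LCA."""
    dist = 0.0
    for depth in (depth_i, depth_j):
        for level in range(depth_lca, depth):
            dist += alpha ** level
    return dist

def weighted_edge_similarity(depth_i, depth_j, depth_lca, alpha=0.8):
    """Hyperbolic transfer of distance to a similarity in (0, 1];
    identical nodes (distance 0) give similarity 1."""
    return 1.0 - math.tanh(weighted_edge_distance(depth_i, depth_j,
                                                  depth_lca, alpha))

# "gallium" vs. "gallium nitrate" (adjacent, deep LCA) comes out far more
# similar than "gallium" vs. "fermium" (sharing only a shallow ancestor).
close_pair = weighted_edge_similarity(6, 7, depth_lca=6)
distant_pair = weighted_edge_similarity(6, 6, depth_lca=3)
```

Because shallow (general) levels carry larger edge weights, two concepts whose LCA is near the root are pushed apart, matching the intuition that disagreement at a general level matters more than disagreement at a specialized one.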
Step F – Document indexing and retrieval
We adapt the Lucene Java search library, version 4.5.1, to create a local index for MEDLINE documents. A modified Lucene standard analyzer with an enhanced MIT stop-list and the Porter stemmer is used to analyze, tokenize, and index each MEDLINE document's title and abstract. Moreover, MetaMap 2013 is employed to map the title and abstract into a set of associated CUIs, which are indexed as well. In the retrieval stage, shown as Steps F1 and F2 in Figure 3, the query text is analyzed by the same Lucene analyzer to extract query terms, and MetaMap 2013 is used to map the query text into Metathesaurus CUIs. Finally, the query terms and the Final CUIs (Step F3 in Figure 3) are combined into a new expanded query for querying the Lucene index. The retrieved results are ranked by relevance to the query, with the most relevant results at the top of the result list. The user can select articles of interest from the returned results, forming a list of interested articles. G-Bean then conducts an additional query based on the interested articles to retrieve articles related to all of them, satisfying the user's search intention (Zhang et al. 2013).
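The final query construction in Step F can be sketched as below. The field names "text" and "cui", the boost value, and the second CUI are illustrative assumptions; Lucene's classic query syntax does support per-term boosts of the form `field:term^boost`:

```python
# Sketch of Step F3: combining the original query terms and the Final
# CUIs into one Lucene-style query string, with a boost < 1 on the
# expanded CUIs to keep them from dominating (preventing query drift).

def build_expanded_query(query_terms, final_cuis, boost=0.3):
    """final_cuis: {cui: normalized weight in [0, 1]} from Steps D/E.
    Each CUI's boost is its weight scaled by the global boosting weight."""
    text_part = " ".join(f"text:{t}" for t in query_terms)
    cui_part = " ".join(f"cui:{c}^{w * boost:.2f}"
                        for c, w in sorted(final_cuis.items()))
    return f"{text_part} {cui_part}".strip()

q = build_expanded_query(["gallium", "hypercalcemia"],
                         {"C0016980": 1.0, "C0052283": 0.62})
# -> 'text:gallium text:hypercalcemia cui:C0016980^0.30 cui:C0052283^0.19'
```

Scaling each CUI's boost by its Step D weight lets highly relevant expansion concepts contribute more than marginal ones while all of them stay subordinate to the literal query text.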
3 RESULTS
3.1 G-Bean Search Engine
We have implemented and published G-Bean, which accepts any biomedicine-related user query and processes it with an expanded query to search the entire MEDLINE database. The core component of this biomedical search engine is our proposed ontology graph based query expansion scheme. We downloaded the 2014 MEDLINE/PubMed Baseline database, which contains 22,376,811 records, from (NLM 2013) and indexed each record after processing it with MetaMap. However, building an index for data at this scale was challenging with the Lucene library, so we customized our use of it by indexing the citations in batches and setting the Java virtual machine's memory space with -Xms4096m -Xmx4096m to prevent exhaustion of the Java virtual memory space used by the Lucene library. The entire index occupies 30.4GB of disk space. Since the Java version of the Lucene library was used to index the biomedical documents, we implemented the search engine using a client-server architecture powered by JavaServer Pages. The front-end is written in JavaServer Pages (JSP) and the back-end is supported by our ontology graph based query expansion and indexing scheme. The detailed architecture is shown in Figure 5.
Fig. 5. G-Bean Search Engine Architecture
As shown in Figure 5, the front-end is the user interface, composed of HTML and JSP code. When a user's query is passed to the back-end system, the query text is processed in the same way as during indexing: it is parsed with the Porter stemmer, filtered by the enhanced MIT stop-list, and run through the MetaMap program, which generates the corresponding CUIs defined in the Metathesaurus. The CUIs are first expanded by running the Personalized PageRank algorithm on the ontology graph. Then the expanded CUIs are filtered by computing the semantic similarity between the expanded CUIs and the Original CUIs. The Final CUIs and the user query text are combined to form a new query against our local MEDLINE index on the G-Bean server. Currently, our G-Bean search engine is deployed on a Tomcat 6.0 web server and is publicly accessible at http://bioinformatics.clemson.edu/G-Bean/index.php.
3.2 Comparisons between PubMed and G-Bean
The comparisons between G-Bean and the PubMed search engine are shown in Table 4. (1) Concept coverage: PubMed indexes text with MeSH terms only and covers the concepts in MeSH, while G-Bean constructs its ontology graph from the set of four UMLS vocabularies, which covers all medical concepts in the medicine database. (2) Query expansion: PubMed expands a query with related MeSH terms, while G-Bean reformulates the user query with additional related concepts from the ontology graph and can find more related concepts than MeSH. (3) Default ranking scheme: PubMed ranks query results in chronological order by default; G-Bean ranks them by relevance to the query, which helps users find articles pertaining to their interests efficiently. (4) Search intention: G-Bean conducts an additional query based on the articles the user has already shown interest in, retrieving articles related to all of them, which together represent the user's search intention; PubMed provides no such function to take the user's interest into account.

Table 4. Comparisons between PubMed and G-Bean

                   PubMed                                G-Bean
Concept coverage   Only covers concepts in MeSH          Covers all concepts in medicine database
Query expansion    Expands query with MeSH terms         Expands query with UMLS terms
Default ranking    Ranks results in chronological order  Ranks results by relevance to query
Search intention   N/A                                   Conducts additional query based on user's search intention

3.3
Performance Study

We use a clinically-oriented MEDLINE subset, the OHSUMED collection (Hersh, Buckley, et al. 1994), which consists of 348,566 documents covering all references from 270 medical journals over a five-year period (1987-1991), as the experimental testbed. The OHSUMED dataset contains 106 human-generated queries, and each query's results were assessed for relevance by a diverse group of physicians. We calculate the precision/recall of the information retrieval results obtained by our G-Bean search engine using these 106 benchmark queries.

3.3.1. Experimental Design

Eight different strategies are evaluated and compared in our experimental studies; they are listed in Table 5. Among the eight strategies, Strategies 1-3 repeat the work of previous studies in query expansion and serve as our baselines. Strategies 4 and 5 apply PPV based query expansion in different ways; Strategies 6-8 apply weighted edge similarity to further refine the expanded query terms. The methods used in Table 5 are explained below (a strategy may be a combination of two different methods):

Lucene Free-text: both the title and abstract of a document and the query text are analyzed and tokenized by Lucene's standard analyzer with a stop-list and stemmer.

Original CUIs: Metathesaurus CUIs mapped by the MetaMap tool and present in the four selected vocabularies.

Original CUIs + Expanded CUIs: the new expanded query includes the Original CUIs first, then appends the top-ranked PPV CUI candidates that are not among the Original CUIs. All CUIs in the final query are factored by the coefficient b (see Section 3.2.2).

Expanded CUIs: the new query is formed directly from the top-ranked Expanded CUIs. It is worth noting that the Original CUIs are not guaranteed to be included in the new query.

WE(g(x)) Filtered CUIs: using g(x) as the transfer function, compute the Weighted Edge (WE) similarity between the Original CUIs and the Expanded CUIs to filter out irrelevant CUIs.

Table 5. Eight index and retrieval strategies (*N/A: not applicable)

Strategy  Document Representation              Query Representation
          Vector 1          Vector 2           Vector 1          Vector 2
S1        Lucene Free-text  N/A                Lucene Free-text  N/A
S2        N/A               Original CUIs      N/A               Original CUIs
S3        Lucene Free-text  Original CUIs      Lucene Free-text  Original CUIs
S4        Lucene Free-text  Original CUIs      Lucene Free-text  Original CUIs + Expanded CUIs
S5        Lucene Free-text  Original CUIs      Lucene Free-text  Expanded CUIs
S6        Lucene Free-text  Original CUIs      Lucene Free-text  WE (sech) Filtered CUIs
S7        Lucene Free-text  Original CUIs      Lucene Free-text  WE (tanhc) Filtered CUIs
S8        Lucene Free-text  Original CUIs      Lucene Free-text  WE (sech*tanhc) Filtered CUIs
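The weighted-edge filtering in strategies S6-S8 is driven by the transfer function g(x). The sketch below shows the three functions themselves, assuming x is a weighted edge distance between two CUIs and that the heuristic parameter 0.85 scales x; how 0.85 actually enters g(x) is our assumption for illustration, and the precise weighted-edge formulation is defined in the cited WEST papers.

```java
// The three transfer functions used in strategies S6-S8.
// x is assumed to be a weighted edge distance between two CUIs in the
// ontology graph; ALPHA = 0.85 is the heuristic parameter from the text,
// applied here as a simple scaling (an illustrative assumption).
public class TransferFunctions {
    static final double ALPHA = 0.85;

    static double sech(double x)  { return 2.0 / (Math.exp(x) + Math.exp(-x)); }

    static double tanhc(double x) { return x == 0.0 ? 1.0 : Math.tanh(x) / x; }

    // all three map distance 0 to similarity 1.0 and decay with distance
    static double simSech(double x)      { return sech(ALPHA * x); }
    static double simTanhc(double x)     { return tanhc(ALPHA * x); }
    static double simSechTanhc(double x) { return sech(ALPHA * x) * tanhc(ALPHA * x); }

    public static void main(String[] args) {
        for (double x : new double[]{0, 1, 2, 4}) {
            System.out.printf("x=%.0f  sech=%.3f  tanhc=%.3f  sech*tanhc=%.3f%n",
                              x, simSech(x), simTanhc(x), simSechTanhc(x));
        }
    }
}
```

An expanded CUI whose similarity to the original CUIs falls below a threshold would be filtered out; sech*tanhc decays fastest, which matches its use in the strictest filter, S8.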
Table 6. Performance of different strategies

Metric                 S1      S2      S3      S4      S5      S6      S7      S8
transfer function      N/A     N/A     N/A     N/A     N/A     sech    tanhc   sech*tanhc
iprec_at_recall_0.00   0.7032  0.5594  0.7037  0.7226  0.7601  0.7475  0.7344  0.7775
iprec_at_recall_0.10   0.5157  0.3637  0.5210  0.5509  0.5883  0.5940  0.5787  0.6034
iprec_at_recall_0.20   0.4130  0.2728  0.4060  0.4283  0.4781  0.4889  0.4761  0.4912
iprec_at_recall_0.30   0.3203  0.1960  0.3244  0.3358  0.3896  0.4068  0.3891  0.4092
iprec_at_recall_0.40   0.2477  0.1389  0.2516  0.2614  0.3033  0.3283  0.3144  0.3291
iprec_at_recall_0.50   0.2062  0.0883  0.1994  0.2121  0.2479  0.2596  0.2621  0.2741
iprec_at_recall_0.60   0.1588  0.0566  0.1490  0.1601  0.1924  0.2025  0.2005  0.2170
iprec_at_recall_0.70   0.1132  0.0357  0.0994  0.1120  0.1416  0.1494  0.1443  0.1657
iprec_at_recall_0.80   0.0717  0.0219  0.0597  0.0675  0.0906  0.0933  0.0892  0.0941
iprec_at_recall_0.90   0.0365  0.0119  0.0310  0.0330  0.0399  0.0386  0.0435  0.0467
iprec_at_recall_1.00   0.0059  0.0008  0.0048  0.0047  0.0047  0.0053  0.0061  0.0073
11pt. avg. precision   0.2538  0.1587  0.2500  0.2626  0.2942  0.3013  0.2944  0.3105
MAP                    0.2333  0.1366  0.2289  0.2415  0.2704  0.2841  0.2716  0.2857
R-precision            0.2712  0.1810  0.2742  0.2840  0.3060  0.3176  0.3086  0.3252
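The metrics reported in Table 6 follow standard TREC-style definitions. The following self-contained sketch (not G-Bean's actual evaluation code) shows how average precision and the 11-point interpolated precision are computed from one query's ranked relevance judgments.

```java
import java.util.*;

// Standard TREC-style retrieval metrics, computed from a ranked list of
// relevance judgments (true = relevant) and the query's total number of
// relevant documents.
public class RetrievalMetrics {
    // average precision for one query (averaged over queries, this is MAP)
    static double averagePrecision(boolean[] ranked, int totalRelevant) {
        double sum = 0; int hits = 0;
        for (int i = 0; i < ranked.length; i++) {
            if (ranked[i]) { hits++; sum += (double) hits / (i + 1); }
        }
        return totalRelevant == 0 ? 0 : sum / totalRelevant;
    }

    // interpolated precision at the 11 recall points 0.0, 0.1, ..., 1.0:
    // max precision at any rank whose recall is >= the recall point
    static double[] elevenPointPrecision(boolean[] ranked, int totalRelevant) {
        int n = ranked.length;
        double[] prec = new double[n], rec = new double[n];
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (ranked[i]) hits++;
            prec[i] = (double) hits / (i + 1);
            rec[i] = (double) hits / totalRelevant;
        }
        double[] out = new double[11];
        for (int p = 0; p <= 10; p++) {
            double r = p / 10.0, best = 0;
            for (int i = 0; i < n; i++)
                if (rec[i] >= r) best = Math.max(best, prec[i]);
            out[p] = best;
        }
        return out;
    }

    public static void main(String[] args) {
        // toy run: relevant docs at ranks 1 and 3, two relevant docs total
        boolean[] run = {true, false, true, false, false};
        System.out.println("AP = " + averagePrecision(run, 2)); // (1/1 + 2/3)/2
        System.out.println(Arrays.toString(elevenPointPrecision(run, 2)));
    }
}
```

The "11pt. avg. precision" row of Table 6 is the mean of these 11 interpolated values, averaged over the 106 queries.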
3.3.2. Experimental Results

We use G-Bean to retrieve a maximum of 100,000 documents per query and calculate the 11-point average precision, mean average precision (MAP), and R-precision of each query. The number of expanded CUIs is set to 15 and the influence factor b for the expanded CUIs is 0.8. The parameter in Equation (2) is set to 0.1 heuristically (Dong, Srimani and Wang 2011). Three different transfer functions g(x) are used in our semantic similarity measure of CUIs because the semantic similarity values obtained by these functions closely match human perception according to our previous studies (Dong, Srimani and Wang, WEST 2010; Dong, Srimani and Wang, Weighted-Edge 2010). The sech and tanhc functions are used in S6 and S7, respectively; S8 applies sech*tanhc, which showed the best performance. We use 0.85, a heuristic parameter value obtained in our previous studies, in these transfer functions.

The experimental results of all eight strategies are listed in Table 6. They demonstrate that both the PPV based indexing and query expansion method and the weighted edge similarity based CUI filtering method contribute to the performance improvement of biomedical information retrieval with the G-Bean search engine. The best strategy, S8, which combines both methods, is 22.34% better than the Lucene free-text strategy S1. S8 also shows a 5.54% performance improvement over strategy S5, which uses only the PPV based indexing and query expansion.

3.3.3. Subjective Evaluation

We also conducted a subjective evaluation (by a number of graduate students) of the performance of our search engine G-Bean. The 106 queries from the OHSUMED dataset were used to search the entire collection of more than 22 million MEDLINE citations.
The search results were compared with those returned by the PubMed interface, ranked by relevance. The graduate students carefully examined the results returned by both search engines and discussed which produced better search results for each query. They found that G-Bean returned better search results for 79 of these benchmark queries while PubMed returned better results for only 8; both search engines returned good results for 19 queries. Therefore, G-Bean shows better query performance than PubMed for 74.5% (79/106) of these benchmark queries and returns good query results for 92.4% ((79+19)/106) of them. This subjective evaluation further confirms the superiority of the G-Bean search engine for biomedical information retrieval.

It is worth noting that the PubMed system did not return any results for several queries, such as queries #17, #52, and #95. For some other queries, such as queries #23, #49, #71 and #89, the PubMed system returned only one result in each case. For instance, Table 7 shows the top 5 search results returned by PubMed and G-Bean for OHSUMED Query #71, "cystic fibrosis and renal failure, effect of long term repeated use of aminoglycosides". For this query, the only article returned by PubMed also ranks in the top 5 of G-Bean's result list, but G-Bean retrieves other useful articles besides this one. The expanded query for this query is the combination of the following two queries:

Term query: ((title:cystic abstract:cystic)(title:fibrosi abstract:fibrosi)(title:renal abstract:renal)(title:failur abstract:failur)(title:effect abstract:effect)(title:long abstract:long)(title:term abstract:term)(title:repeat abstract:repeat)(title:aminoglycosid abstract:aminoglycosid))
© Oxford University Press 2009
where the raw terms in the input query form this term query.

CUI query: (cui:C0002556^0.8, cui:C1720953^0.77764446, cui:C0035078^0.6918046, cui:C0010674^0.59097385, cui:C0035091^0.5840204, cui:C0282563^0.55353206, cui:C0205341^0.42767373, cui:C0022658^0.39479542, cui:C0443252^0.38731256, cui:C0444619^0.3291697, cui:C0022646^0.2782315, cui:C0439827^0.22319728, cui:C0449903^0.21943615, cui:C0449866^0.19725645, cui:C0449872^0.19718058)

where the floating-point number following each CUI is its final weight. The term query and the CUI query are merged to form a final query, which is run against the MEDLINE index to retrieve related articles. The story is similar for query #8, "work-up of hypertension in patient with horseshoe kidney": PubMed returned only one result for this query, which is also returned by G-Bean, while G-Bean returns other related articles besides it. Upon further analysis of the queries and search results, we also found that when a query cannot be mapped to MeSH terms, the PubMed system often returns poor search results while G-Bean still produces good ones. The entire evaluation set can be accessed via http://bioir.cs.clemson.edu:8080/BioIRWeb/supplement.jsp. We note that the search results listed on this website and in Table 7 were obtained on December 25, 2013; since PubMed updates its index weekly, new articles may appear in PubMed search results at a later date.

Table 7. Top 5 search results returned by PubMed and G-Bean for OHSUMED Query #71 "cystic fibrosis and renal failure, effect of long term repeated use of aminoglycosides"
PubMed:
1. Renal impairment in cystic fibrosis patients due to repeated intravenous aminoglycoside use.

G-Bean:
1. Renal impairment in cystic fibrosis patients due to repeated intravenous aminoglycoside use.
2. Aminoglycoside-associated hypomagnesaemia in children with cystic fibrosis.
3. Cystic fibrosis, aminoglycoside treatment and acute renal failure: the not so gentle micin.
4. Case-control study of acute renal failure in patients with cystic fibrosis in the UK.
5. Renal dysfunction in cystic fibrosis: is there cause for concern?
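The weighted CUI sub-query shown for Query #71 can be assembled from PPV scores. The sketch below is an illustrative assumption about the weighting, not G-Bean's exact formula: it gives the top-ranked CUI the influence factor b = 0.8 and scales the rest proportionally to their PPV scores. The CUIs and scores in the example are toy values.

```java
import java.util.*;

// Illustrative assembly of a weighted CUI sub-query in Lucene boost
// syntax ("cui:CXXXXXXX^weight"). The weight formula (b times the
// PPV score normalized by the top score) is an assumption for
// illustration; the paper's exact weighting may differ.
public class CuiQueryBuilder {
    static String buildCuiQuery(LinkedHashMap<String, Double> ppvScores, double b) {
        double top = Collections.max(ppvScores.values());
        StringJoiner sj = new StringJoiner(",", "(", ")");
        for (Map.Entry<String, Double> e : ppvScores.entrySet()) {
            double weight = b * e.getValue() / top;  // top-ranked CUI gets weight b
            sj.add(String.format("cui:%s^%.4f", e.getKey(), weight));
        }
        return sj.toString();
    }

    public static void main(String[] args) {
        // toy PPV scores for three CUIs, in descending rank order
        LinkedHashMap<String, Double> scores = new LinkedHashMap<>();
        scores.put("C0010674", 0.031);
        scores.put("C0035078", 0.024);
        scores.put("C0002556", 0.018);
        System.out.println(buildCuiQuery(scores, 0.8));
    }
}
```

The resulting string is merged with the stemmed term query to form the final Lucene query sent to the MEDLINE index.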
CONCLUSION
In this paper, we have designed a novel ontology graph based scheme to index biomedical articles and expand user queries. This scheme uses two novel methods to modify the original user query into a more relevant one. First, it uses a modified Personalized PageRank algorithm on an automatically constructed ontology graph, derived from multiple diverse biomedical ontologies, to select the concepts that best match the query context for query expansion. It then uses a weighted edge semantic similarity measure to filter out the expanded concepts that are not closely related to the original query. These query expansion and refinement methods ensure that the search results returned by G-Bean are more relevant to user queries. Extensive experimental results show that this new search engine outperforms the popular Lucene approach by 22%,
which is a noticeable gain. A subjective evaluation on our deployed biomedical search engine, G-Bean, shows that it is more accurate and easier to use in comparison with the most popular biomedical search engine, PubMed.
ACKNOWLEDGEMENTS Funding: This work is partially supported by the National Science Foundation [grant numbers DBI-0960586, DBI-0960443] and the National Institutes of Health [grant numbers 1 R15 CA131808-01, 1 R01 HD069374-01A1]. Conflict of interest: None declared.
REFERENCES Mitra, Mandar, Amit Singhal, and Chris Buckley. "Improving automatic query expansion." Proceedings of the 21st annual international ACM SIGIR conference on research and development in information retrieval. 1998. 206-214. Abdou, Samir, Patrick Ruch, and Jacques Savoy. "Evaluation of stemming, query expansion and manual indexing approaches for the genomic task." The 14th Text REtrieval Conference Proceedings. 2005. Agirre, Eneko, and Aitor Soroa. "Personalizing PageRank for word sense disambiguation." Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics. Athens, Greece, 2009. Aronson, Alan R, and François-Michel Lang. "An overview of MetaMap: historical perspective and recent advances." Journal of the American Medical Informatics Association 17, no. 3 (2010): 229-236. Aronson, Alan R, Thomas C Rindflesch, and Allen C Browne. "Exploiting a large thesaurus for information retrieval." Proceedings of RIAO. 1994. 197-216. Bernstam, Elmer. "MedlineQBE (Query-by-Example)." Proceedings of AMIA Symposium. 2001. 47-51. Chu, Wesley W, Zhenyu Liu, Wenlei Mao, and Qinghua Zou. "A knowledge-based approach for retrieving scenario-specific medical text documents." Control Eng Prac 13, no. 9 (2005): 1105-21. Chvatal, V. "A greedy heuristic for the set-covering problem." Mathematics of Operations Research. 1979. 233-235. Doğan, R.I. et al. "Understanding PubMed user search behavior through log analysis." Database (Oxford) 2009, no. bap018 (2009). Dong, Liang, Pradip Srimani, and James Wang. "Ontology graph based query expansion for biomedical information retrieval." Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine. 2011. 488-493. —. "Weighted-Edge: a new method to measure the semantic similarity of words based on WordNet." Proceedings of the 5th International Conference of the Global WordNet Association. Mumbai, India, 2010. —.
"WEST: Weighted-Edge Based Similarity Measurement Tools for Word Semantics." Proceedings of the 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology. Toronto, ON, 2010. 216-223.
Gauch, Susan, Jianying Wang, and Satya M Rachakonda. "A corpus analysis approach for automatic query expansion and its extension to multiple databases." ACM Transactions on Information Systems 17, no. 3 (1999): 250-269. Haveliwala, Taher. "Topic-Sensitive PageRank." Proceedings of the 11th International Conference on World Wide Web. 2002. 517-526. Haveliwala, Taher, Sepandar Kamvar, and Glen Jeh. "An analytical comparison of approaches to personalizing PageRank." Stanford University Technical Report, 2003. Hersh, William R. Information retrieval: a health and biomedical perspective. Springer Verlag, 2009. —. "Report on the TREC 2004 genomics track." ACM SIGIR Forum. 2005. 21-24. Hersh, William R, and D.H Hickam. "A comparison of retrieval effectiveness for three methods of indexing medical literature." The American Journal of the Medical Sciences 303, no. 5 (1992): 292. Hersh, William R, and D.H Hickam. "Information retrieval in medicine: the SAPHIRE experience." Journal of the American Society for Information Science 46, no. 10 (1995): 743-747. Hersh, William R, Chris Buckley, TJ Leone, and David Hickam. "OHSUMED: an interactive retrieval evaluation and new large test collection for research." Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. 1994. Hersh, William R, Susan Price, and Larry Donohoe. "Assessing thesaurus-based query expansion using the UMLS Metathesaurus." Proceedings of AMIA Symposium. 2000. 344-348. Humphreys, Betsy L, D. A Lindberg, H. M Schoolman, and G. O Barnett. "The Unified Medical Language System: an informatics research collaboration." Journal of the American Medical Informatics Association 5, no. 1 (1998): 1-11. Jones, Sparck K, S. Walker, and S. E Robertson.
"A probabilistic model of information retrieval: development and comparative experiments." Information Processing and Management 36, no. 6 (2000): 779-808. Lipscomb, Carolyn E. "Medical Subject Headings (MeSH)." Bulletin of the Medical Library Association 88, no. 3 (2000): 265-266. Lu, Zhiyong, and Alexey Iskhakov. "Finding query suggestions for PubMed." AMIA Annual Symposium Proceedings, 2009: 396-400. Mao, Wenlei, and Wesley W Chu. "Free-text medical document retrieval via phrase-based vector space model." Proceedings of AMIA Annual Symposium. 2002. 489-493. McKibbon, A. K., and C. J. Walker-Dilks. "The quality and impact of MEDLINE searches performed by end users." Health Libraries Review 12, no. 3 (1995): 191-200. NLM. 2014 MEDLINE®/PubMed® Data Files. 2013. http://www.nlm.nih.gov/bsd/licensee/access/medline_pubmed.html. PubMed Clinical Queries. n.d. http://www.ncbi.nlm.nih.gov/pubmed/clinical/.
Salton, G, A Wong, and C. S Yang. "A vector space model for automatic indexing." Vol. 18, in Readings in Information Retrieval, 613-620. Morgan Kaufmann Publishers Inc., 1975. Srinivasan, P. "Exploring query expansion strategies for MEDLINE." Journal of the American Medical Informatics Association 3 (1995): 157-167. Srinivasan, P. "Optimal document-indexing vocabulary for MEDLINE." Information Processing & Management 32, no. 5 (1996): 503-514. Wildemuth, B. M, and M. E Moore. "End-user search behaviors and their relationship to search effectiveness." 1995. 294-304. Yang, Yiming. "Expert network: effective and efficient learning from human decisions in text categorization and retrieval." Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. 1994. 13-22. Yang, Yiming, and Christopher Chute. "An application of least squares fit mapping to text." Proceedings of the 16th annual international ACM SIGIR conference on research and development in information retrieval. 1993. 281-290. Yoo, S, and J Choi. "Improving MEDLINE document retrieval using automatic query expansion." Proceedings of the 10th international conference on Asian digital libraries. 2007. Zhang, Yuanyuan, et al. "G-Bean: an ontology-graph based web tool for biomedical literature retrieval." IEEE International Conference on Bioinformatics and Biomedicine. Shanghai, China, 2013.