RefMed: Relevance Feedback Retrieval System for PubMed ∗

Hwanjo Yu, Taehoon Kim, Jinoh Oh, Ilhwan Ko, Sungchul Kim Data Mining Laboratory Pohang University of Science and Technology (POSTECH) Pohang, South Korea

{hwanjoyu,zyint,kurin,koglep,subright}@postech.ac.kr

ABSTRACT
Finding related articles in PubMed (a large biomedical literature repository) is challenging because it is hard to express the user’s specific notion of relevance in the given query interface, and a keyword query typically retrieves many results. Biomedical researchers often spend a substantial amount of time (e.g., more than several days) on the literature search process. This paper proposes RefMed, a novel search system for PubMed, which supports relevance ranking by enabling relevance feedback on PubMed. RefMed first returns initial result documents for a user’s keyword query, as in PubMed. The user then makes relevance judgments on some of the resulting documents while browsing them. Once the user “pushes the feedback”, the system induces a relevance function using RankSVM and ranks the results according to that function. To realize ad-hoc relevance retrieval on PubMed, RefMed “tightly” integrates RankSVM within an RDBMS and runs rank learning and processing on the fly with a response time of a few minutes. Our qualitative experiments with biomedical researchers show that RefMed substantially reduces the effort required to find related PubMed articles. RefMed is accessible at “http://dm.postech.ac.kr/refmed”.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Experimentation

Keywords
PubMed, Relevance Feedback

Need: PubMed-Medline is a biomedical literature repository that contains 18 million articles and keeps growing. It is one of the most important information sources for medical doctors and bioresearchers. Each article in PubMed is structured (e.g., title, author, abstract, journal name, and MeSH terms), and PubMed supports efficient processing of exact-match queries on each attribute. However, finding related articles in PubMed is challenging because it is hard to express the user’s specific notion of relevance in the given query interface, and a keyword query typically retrieves many results. For example, the keyword query “breast cancer” returns around 160,000 articles. The user can only sort the results by publication date, first or last author name, or journal name. Biomedical researchers often spend a substantial amount of time (e.g., more than several days) on the literature search process.

∗ This work was supported by the Brain Korea 21 Project in 2009 and the Korea Research Foundation Grant funded by the Korean Government (KRF-2009-0080667). Copyright is held by the author/owner(s). CIKM’09, November 2–6, 2009, Hong Kong, China. ACM 978-1-60558-512-3/09/11.

Figure 1: Search process in RefMed. (1) The user submits a keyword query; (2) RefMed returns initial results; (3) the user gives relevance feedback; (4) RefMed returns ranked results.

Related Work: To improve search quality on PubMed, researchers have studied querying methodologies for PubMed, such as how to use controlled vocabulary, MeSH terms, or background knowledge to formulate proper PubMed queries [5, 8]. Re-organizing the search results using ontologies or clustering techniques has been explored to present the results to users more effectively [1, 3]. Text mining researchers have tried to compute the global importance of articles from citation information and use it to rank the results, as Google does [3, 4, 7]. However, users’ implicit notions of relevance typically vary widely even for the same keyword query. For example, with the query “breast cancer”, one user may be interested in papers on genetic studies while another may want to find the latest cancer treatments. Thus ranking by global importance often does not meet users’ specific or personalized information needs. Finally, machine learning techniques have been used to find relevant articles by ranking them according to a learned relevance function [9, 6]. However, the learning and ranking process is not integrated with PubMed’s keyword queries, and users have to provide a large number of training articles to reach reasonable learning accuracy. Such systems also involve time-consuming offline preprocessing and learning, which makes them too slow to support the search process in real time.
Method: This paper proposes RefMed, a novel search system for PubMed, which supports relevance ranking by enabling relevance feedback on PubMed, and thus supports personalized ad-hoc retrieval on PubMed. Figure 1 shows the search process in RefMed. RefMed first accepts a keyword query (Step 1) and returns initial results (Step 2), as in PubMed. The user then makes relevance judgments on some of the resulting documents while browsing them (Step 3). (The number of relevance levels can be adjusted depending on the user’s preference.) Once the user “pushes the feedback”, the system induces a relevance function from the user’s feedback using RankSVM [2] and returns the top-k results ranked according to that function (Step 4). This relevance search process, including learning and ranking, runs in real time with a response time of a few minutes.

Figure 2: Feedback relevance (left) and ranked results (right) in RefMed

Implementation: To realize the real-time relevance feedback search system, we first “tightly” integrated RankSVM within MySQL. Specifically, we revised the MySQL (v5.1) source code to add two new SQL commands, one for learning and one for processing RankSVM, that operate directly on the MySQL data tables. We designed a PubMed schema and imported the PubMed data into the MySQL database. Although we store the entire PubMed data in our database, when the user submits a keyword query, RefMed first sends it to the PubMed site, receives a list of result PMIDs (PubMed article IDs), and retrieves the corresponding articles from our database. We do this so that RefMed provides the same set of initial results as PubMed. Note that the goal of RefMed is to enable relevance feedback on top of PubMed. Finally, we used the new SQL commands, JavaScript, and PHP to implement the Web interface for RefMed.

Contributions: To the best of our knowledge, RefMed is the first real-time search system that supports relevance feedback on PubMed. The new technical contributions of RefMed are as follows.
• Traditional relevance feedback systems use classification methods for learning (e.g., SVM, Bayesian learning) and thus support only two levels of relevance: relevant or not relevant. RefMed supports multi-level relevance by using RankSVM and achieves higher accuracy with less feedback.
• RefMed tightly integrates RankSVM and MySQL to support keyword queries and relevance feedback in the same framework and to achieve a real-time search process with minimal response time.
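The multi-level feedback idea can be illustrated with a minimal sketch. It is not RefMed's code, which runs SVM-light's RankSVM inside MySQL; it is a plain-Python stand-in that trains a linear function with subgradient descent on the pairwise hinge loss, the same style of objective RankSVM optimizes. The function names, the two-dimensional toy features, and the hyperparameters are all assumptions made for illustration.

```python
from itertools import combinations


def preference_pairs(judged):
    """Turn graded judgments into (preferred, other) feature-vector pairs.

    `judged` is a list of (features, level) tuples; a higher level means
    more relevant. Documents judged at the same level yield no pair.
    """
    pairs = []
    for (xa, la), (xb, lb) in combinations(judged, 2):
        if la > lb:
            pairs.append((xa, xb))
        elif lb > la:
            pairs.append((xb, xa))
    return pairs


def train_linear_ranker(pairs, dim, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on the L2-regularized pairwise hinge loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        for preferred, other in pairs:
            diff = [a - b for a, b in zip(preferred, other)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for i in range(dim):
                # The hinge term is active only while the pair's margin is below 1.
                hinge_grad = -diff[i] if margin < 1.0 else 0.0
                w[i] -= lr * (reg * w[i] + hinge_grad)
    return w


def score(w, x):
    """Linear relevance score of one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x))


# Hypothetical feedback with three relevance levels (2 = most relevant).
judged = [([1.0, 0.0], 2), ([0.5, 0.5], 1), ([0.0, 1.0], 0)]
pairs = preference_pairs(judged)
w = train_linear_ranker(pairs, dim=2)
```

Because every pair of documents with different levels contributes a constraint, three judgments at three levels already produce three training pairs, whereas a binary classifier would see only three labeled points. This is one way to read the claim that graded feedback extracts more signal from fewer judgments.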
In our experiments on the OHSUMED dataset (a subset of Medline), the learning accuracy is higher than 80% overall with the user’s feedback on 20 documents. We also qualitatively evaluated the system with POSTECH I-Bio students, and the overall search time is substantially reduced with RefMed. RefMed is accessible at “http://dm.postech.ac.kr/refmed”.

Demonstration Plan: We plan to demonstrate the system by searching with popular queries such as “brain tumor”, “breast cancer”, and “stem cells”, and we will qualitatively show the effectiveness of the search. For example, Figure 2 (left) shows a screenshot of the user’s feedback on the results of the query “breast cancer” in RefMed. The system shows 20 of the 200,049 results for breast cancer on the first page. The user marked the first article as Not relevant and the third and fourth articles as Relevant. Once the user presses the “Push Relevance” button, the system (1) learns a relevance function, (2) sorts the 200,049 articles according to the function, and (3) returns the top 20 articles. Figure 2 (right) shows a screenshot of the top 20 articles. Note that the first two articles are those that the user previously marked as Relevant. The user can keep judging the relevance of other articles until she receives satisfactory results.
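The ranking step in this walkthrough (score every result, then return the top 20) can be sketched as follows. This is an illustrative assumption, not RefMed's SQL command: the weight vector, the feature dictionary, and the PMIDs are hypothetical. It also uses `heapq.nlargest` to select the best k items without fully sorting the result set, which is a substitution for the full sort the paper describes.

```python
import heapq


def top_k(pmids, features, w, k=20):
    """Return the k PMIDs with the highest linear relevance score, best first."""
    def relevance(pmid):
        return sum(wi * xi for wi, xi in zip(w, features[pmid]))
    return heapq.nlargest(k, pmids, key=relevance)


# Hypothetical learned weights and per-article feature vectors.
w = [1.0, -1.0]
features = {
    101: [0.9, 0.1],
    102: [0.2, 0.8],
    103: [0.6, 0.4],
    104: [0.1, 0.9],
}
pmids = list(features)

best_two = top_k(pmids, features, w, k=2)  # -> [101, 103]
```

For a result set of roughly 200,000 articles and k = 20, selecting the top k this way keeps only a small heap in memory, which may matter when the ranking has to be recomputed after every round of feedback.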

1. REFERENCES

[1] GoPubMed. http://www.gopubmed.com/.
[2] SVM-light. http://svmlight.joachims.org/.
[3] Y. Lin, W. Li, K. Chen, and Y. Liu. A document clustering and ranking system for exploring MEDLINE citations. Journal of American Medical Informatics Association, 2007.
[4] Z. Lu, W. Kim, and W. Wilbur. Evaluating relevance ranking strategies for MEDLINE retrieval. Journal of American Medical Informatics Association, 2009.
[5] L. Murphy, S. Reinsch, W. Najm, V. Dickerson, M. Seffinger, A. Adams, and S. Mishra. Searching biomedical databases on complementary medicine: the use of controlled vocabulary among authors, indexers and investigators. BMC Complementary and Alternative Medicine, 2003.
[6] G. Poulter, D. Rubin, R. Altman, and C. Seoighe. MScanner: a classifier for retrieving MEDLINE citations. BMC Bioinformatics, 2008.
[7] M. Siadaty, J. Shu, and W. Knaus. ReleMed: sentence-level search engine with relevance score for the MEDLINE database of biomedical articles. BMC Bioinformatics, 2007.
[8] C. Sneiderman, D. Demner-Fushman, M. Fiszman, N. Ide, and T. Rindflesch. Knowledge-based methods to help clinicians find answers in MEDLINE. Journal of American Medical Informatics Association, 2003.
[9] B. Suomela and M. Andrade. Ranking the whole MEDLINE database according to a large training set using text indexing. BMC Bioinformatics, 2005.