BioinQA: Addressing bottlenecks of Biomedical Domain through Biomedical Question Answering System

Sparsh Mittal
Department of Electronics and Computer Engineering, IIT Roorkee, India, 247667
[email protected]

Saket Gupta
Department of Electronics and Computer Engineering, IIT Roorkee, India, 247667
[email protected]

Dr. Ankush Mittal
Department of Electronics and Computer Engineering, IIT Roorkee, India, 247667
[email protected]
ABSTRACT
Recent advances in the realm of biomedicine and genetics in the post-genomics era have resulted in an explosion in the amount of biomedical literature available. Large text banks comprising thousands of full-text biology papers are rapidly becoming available. Due to the gigantic volume of information available and the lack of efficient, domain-specific information retrieval tools, it has become extremely difficult to extract relevant information from these large repositories of data. While specialized information retrieval tools are not suitable for beginners, general-purpose search engines are not intelligent enough to respond to domain-specific questions, predominantly from the medical, bioinformatics, and related fields. In this paper, we present an intelligent question answering system that efficiently and quickly responds to questions posed by novices in natural language and is smart enough to answer the questions of advanced users (researchers). It employs various natural language processing techniques to answer user questions, even from multiple documents. The system also makes use of metadata knowledge to address specific biomedical domain concerns such as heterogeneity and acronyms.
Sumit Bhatia
Department of Electrical Engineering, IIT Roorkee, India, 247667
[email protected]

1. INTRODUCTION
Recent technological advancements in biomedicine and genetics have resulted in a vast amount of data that keeps growing at an ever-increasing rate. An idea of the growth of biomedical literature can be had from the fact that the PUBMED database of the National Library of Medicine (NLM) contains more than 14 million articles, and hundreds of thousands more are added every year [1]. However, such a huge repository of data is useful only if it can be easily accessed and its contents retrieved as per user requirements, providing information in a suitable form [2]. Modern search engines such as Google have a huge storehouse of information, but users are left to manually search the documents returned for a query, which may be vast in number. The meaning of a query is very relevant: "How is snake poison employed in counteracting neurotoxic venom?", "When is snake poison employed in counteracting neurotoxic venom?" and "Why is snake poison employed in counteracting neurotoxic venom?" all have different meanings.
Very few research groups are working on medical, domain-specific question answering [3]. Generic open-domain question answering systems are not suitable for biomedical text, where complex technical terms are used in domain-specific contexts [4]. Term-order variations and abbreviations are also quite common in this field [5], and definitional questions [6] are rare, since mostly an analytical explanation is needed. Specialized information retrieval systems like PUBMED are generally used by researchers and experts, whereas general information retrieval systems like Google, preferred by neophytes to the field, suffer from the inherent drawbacks of open-domain information retrieval. [5] and [7] provide a good study of the feasibility of medical question answering systems and the limitations of open-domain question answering when applied to the biomedical domain. Also, the difference in jargon used by novice and advanced users causes heterogeneity (see Section 2). Our QA system assigns the task of resolving differences between the terms and formats used by the user and those in the corpus entirely to the system, requiring no familiarity with the complex jargon of the subject on the part of the user.
In our efforts to develop a system that makes information retrieval in large amounts of biomedical data efficient and user friendly, we built a system with the following contributions:
1. In the field of multi-document search, our QA system is a step toward next-generation systems capable of extracting answers from diverse documents.
2.
This paper proposes tools to resolve and mitigate the semantic (essential) heterogeneity problem for bioinformatics metadata; information integration of bioinformatics data is becoming a vital problem.
3. To make the system useful for a wide range of users, we use different weighting and ranking schemes to present to each user the information most important to him or her (see Sections 3 and 4).
4. Our system is capable of answering a wide variety of questions in addition to definitional questions (e.g. "How does Tat enhance the ability of RNA polymerase to elongate?").
5. The system integrates diverse resources such as scanned books, doc files, PowerPoint slides, etc., which have different information and presentation methods. Books are illustrative and give detailed analyses of concepts; slides are condensed, highlighting the key points. The system integrates information from these different document types.
The paper is organized as follows: Section 2 describes previous work in the biomedical field and related areas, and its bottlenecks. Section 3 describes the operational facets of the question answering system. Section 4 presents the multi-document extraction problem and our solution. Section 5 describes our implementation for resolving the heterogeneity problem and the use of metadata. Section 6 discusses experiments and results. Section 7 gives conclusions and briefly discusses future research.
2. LITERATURE REVIEW
An overview of the role and importance of question answering systems in biomedicine is provided by [3]. TREC, one of the major forums for discussions related to question answering systems, has included a track on Genomics. EQueR, the French evaluation campaign of question-answering systems, was the first to provide a medical track [8]. MedQA, developed by Lee et al. [6], is a biomedical question answering system that caters to the needs of practicing physicians; however, it is limited by its ability to answer only definitional questions. Another question answering system, restricted to the genomics domain, has been developed by Rinaldi et al. [9]. They adapted an open-domain question answering system to answer genomics questions, with emphasis on identifying term relations based on a linguistically rich full parser. In a study conducted with a test set of 100 medical questions collected from medical students in a specialized domain, a thorough search in Google was unable to obtain relevant documents within the top five hits for 40% of the questions [10]. Moreover, due to busy practice schedules, physicians spend less than 2 minutes on average seeking the answer to a question; thus, most clinical questions remain unanswered [11]. Our system, in contrast, answers user questions quickly and succinctly.
The Heterogeneity Problem - Current systems are not flexible enough to adapt themselves to the knowledge level and requirements of a user. Agreeing on a common standard for uniform representation of similar concepts or data relationships has drawbacks: as with mediators/wrappers, it is difficult for standards to keep up with a dynamically changing domain (e.g. novices and researchers generally conform to different vocabularies). [12] aims to semantically integrate metadata in bioinformatics data sources. Heterogeneity of metadata is of either an "accidental" or an "essential" nature. Accidental heterogeneity arises from the use of different formats and representation systems (e.g. XML, flat file, or another format) and can be solved through translation systems that perform format conversion. Essential heterogeneity, also called semantic heterogeneity, arises from using varied vocabulary to describe similar concepts or data relationships, or from using the same metadata to describe different concepts or data relationships. The mediator/wrapper-based strategy [13], [14] has not been widely successful because it solves the problem reactively, after it occurs (which is more difficult).
3. SYSTEM ARCHITECTURE
System Overview: Figure 1 shows the system block diagram. The question answering system is based on searching the entities of the corpus in context for effective extraction of answers. The system recognizes entities by searching the course material, using the
[Figure 1. System block diagram: User's Question → Question Classification (general vs. multi-document question) → Question Parsing (question focus, noun phrases, POS info; Link Parser) → Answer Mining from the corpus (corpus entity file generated by the Link Parser; for multi-document questions, segmentation and mapping of components to their respective domains) → Answer Extraction (passage retrieval and scoring; acronym expansion; heterogeneity resolution through metadata; CRG for implicit assumptions; passage sieving using entity clusters) → Answer Selection (NP matching and ranking) → Final Answer.]
Link parser. This is especially useful in the biomedical domain, where extended lexicon terms (e.g. immunodeficiency, hematopoietic, nonmyeloablative, etc.) are classified as entities. The question is parsed using the Link parser during question parsing. Query formulation translates the question into a set of queries given as keyword input to the retrieval engine. We used Seft for context-based retrieval and answer re-ranking.
Answer Mining: In this QA system, the Link Grammar Parser determines the question's syntactic structure to extract part-of-speech information. The question classifier then uses pattern matching based on wh-words (e.g. when refers to an event, why to reasoning, etc.) and simple part-of-speech information to determine question types [15]. Questions seeking a comparison may need the answer to be extracted from more than one passage or document; these are dealt with separately (Section 4). In the next step, the question focus is identified by finding the object of the verb; the focus is given more weight during answer retrieval. Quite logically, answers are most appropriate when there is local similarity between the text and the query. For example, for the question "Is nonmyeloablative allogeneic transplantation feasible in patients having HIV infection?", the query terms 'nonmyeloablative', 'allogeneic', 'transplantation', etc. have local similarity, which is identified in the text by a locality-based similarity algorithm. The contribution of each occurrence of each query term is summed to arrive at a similarity score for any particular location in any document in the collection. The software tool Seft [16] matches conventional information retrieval systems in accuracy and is fast enough to be useful on hundreds of megabytes of text. The Query Formulation module extracts query words from the question as input to the retrieval engine.
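The wh-word based classification step above can be sketched as follows. This is an illustrative assumption of how such a pattern matcher might look; the category names and patterns are not BioinQA's exact rules.

```python
# Hypothetical wh-word question classifier; patterns and type labels are
# illustrative. Order matters: more specific patterns come first, so a
# comparison question is not swallowed by the generic "what" pattern.
QUESTION_TYPES = [
    ("what is the difference", "comparison"),   # multi-document (Section 4)
    ("contrast",               "comparison"),
    ("why",                    "reasoning"),     # why  -> reasoning type
    ("when",                   "event/time"),    # when -> refers to an event
    ("where",                  "location"),
    ("who",                    "person/agent"),
    ("how",                    "method/process"),
    ("what",                   "definition/description"),
]

def classify_question(question: str) -> str:
    """Return a coarse question type via simple substring pattern matching."""
    q = question.lower()
    for pattern, qtype in QUESTION_TYPES:
        if pattern in q:
            return qtype
    return "unknown"

print(classify_question("Why is snake poison employed in counteracting neurotoxic venom?"))
# reasoning
```

Comparison questions detected this way would then be routed to the Segregate Algorithm of Section 4 rather than the single-passage pipeline.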
The system constructs a hash table of the entities identified in the question based on the entity file, which is built from the table of contents, index, or glossary of the biomedical corpus. These keywords (entities) are considered most important and are given the maximum weight, so as to avoid ranking passages higher merely due to the frequent occurrence of common noun words (as search engines do). Most importantly, the key issues of resolving heterogeneity, acronym expansion, and understanding the user's implicit assumptions are also addressed in the answer extraction module (detailed in Section 5). BioinQA then performs phrase-matching based re-ranking by searching for occurrences of the noun phrases identified by the question parser. After phrase matching, the system processes the passages according to the classification made during question classification.
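The entity table and weighting scheme described above can be sketched as below. The weights and helper names are assumptions for illustration, not BioinQA's actual values: entities from the corpus entity file receive the maximum weight, while ordinary query nouns receive a smaller one.

```python
# Illustrative entity-weighted passage scoring (weights are assumed, not
# the system's real constants). Entities from the corpus index/glossary
# dominate the score, so a passage cannot rank highly merely because
# common nouns occur often.
ENTITY_WEIGHT, NOUN_WEIGHT = 3.0, 1.0

def build_entity_table(entity_file_terms):
    """Hash table (set) of corpus entities for O(1) membership lookup."""
    return {t.lower() for t in entity_file_terms}

def score_passage(passage, query_nouns, entities):
    """Sum term-frequency contributions, weighting entities above nouns."""
    words = passage.lower().split()
    score = 0.0
    for term in query_nouns:
        hits = words.count(term.lower())
        score += hits * (ENTITY_WEIGHT if term.lower() in entities else NOUN_WEIGHT)
    return score
```

For example, with `build_entity_table(["Tat"])`, an occurrence of "Tat" in a passage contributes three times as much as an ordinary query noun such as "polymerase".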
4. MULTI DOCUMENT RETRIEVAL
To explain the differences between components, a new algorithm was developed and implemented.
Segregate Algorithm: Answers to domain questions involving a "comparison", "differentiation", or "contrast" between two different entities of the corpus generally lie in different documents. For example:
o What is the difference between Introns and Exons?
o Contrast between Lymphadenopathy and Leukoreduction.
We developed the "Segregate Algorithm", which maps the two separate ingredients (components) of the question (for example, 'Lymphadenopathy' and 'Leukoreduction') to their respective information documents. The actual answer may be found in different documents owing to the different nature of the entities involved in the question. Documents are then scanned for these components, and the top n documents thus obtained are re-ranked based on passage sieving.
Entity Cluster Matching Based Passage Sieving: Obtained passages depict a contrast most accurately when their parameters, or entity clusters (linked lists of the entities of a passage along with their frequency of occurrence in that passage), are very similar (e.g. possible parameters for comparing medicines would be duration of action, dosage, ingredient levels, side-effects, etc.). Thus, re-ranking is performed by generating such entity clusters for each document and matching them. The link parser in the system recognizes the entities of the passages of one component and matches them against those of the second by the following procedure. Let $G_{i,n}$ be the entity cluster set of the $n$-th answer in the $i$-th component, where $1 \le i \le 2$, $1 \le n \le 10$. The score obtained from the Entity Cluster Matching Based Re-ranking algorithm, ECRScore, is given by
$$ECRScore_{i,n} = \sum_{k=1}^{10} C_{n,k}, \qquad 1 \le i \le 2,\ 1 \le n \le 10,$$
where $C_{n,k}$ is the similarity function, defined as
$$C_{n,k} = G_{1,n} \otimes G_{2,k} + G_{2,k} \otimes G_{1,n}.$$
The operator $\otimes$ matches the entities present in both its operands to measure the similarity between them. Now, the $FinalScore_{i,n}$ of all the passages is calculated as
$$FinalScore_{i,n} = w_1 \cdot CurrentScore_{i,n} + w_2 \cdot ECRScore_{i,n}, \qquad w_1 + w_2 = 1.$$
Here $CurrentScore_{i,n}$ is the score of the passage obtained from the answer selection phase, and $w_1$, $w_2$ are weights incorporating the contributions of both modules, chosen empirically in our system as 0.7 and 0.3 respectively. Finally, answer passages are ranked according to their FinalScore, and the top 5 passages are presented to the user.
Figure 2. Sample output for the question "What is the difference between glycoprotein and lipoprotein?"
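The ECRScore and FinalScore computation can be sketched as follows, under the assumption that the matching operator counts entities shared by the two clusters. Modeling an entity cluster as a dict mapping entity to frequency is an illustrative choice; the function names are not BioinQA's.

```python
# Sketch of Entity Cluster Matching Based Re-ranking (assumed details).
# G1, G2 are lists of entity clusters, one per candidate answer passage
# of each question component; a cluster is {entity: frequency}.
W1, W2 = 0.7, 0.3  # empirical weights from the paper, w1 + w2 = 1

def match(g_a, g_b):
    """The paper's matching operator: count entities present in both operands."""
    return sum(1 for e in g_a if e in g_b)

def ecr_score(G1, G2, n):
    """ECRScore for the n-th answer of one component against every
    candidate answer of the other component."""
    return sum(match(G1[n], G2[k]) + match(G2[k], G1[n]) for k in range(len(G2)))

def final_score(current_score, ecr):
    """Combine the answer-selection score with the cluster-matching score."""
    return W1 * current_score + W2 * ecr
```

For instance, two passages sharing the parameter "dosage" contribute symmetrically to the score, so "difference between A and B" and "difference between B and A" sieve the same passages.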
5. SEMANTIC HETEROGENEITY RESOLUTION THROUGH METADATA
Bioinformatics is a multidisciplinary field, with users of all levels, from lay people and students to researchers and medical practitioners. To bridge the gap between the level of understanding of an experienced researcher and that of a novice, our system employs metadata information during answer extraction:
I. Utilization of scientific and general terminology: A non-biology student is not likely to access information via 'Homo sapiens' but via 'humans'. The user decides whether to use the system in novice-search or advanced-user-search mode (see clip). We have developed the Advanced-and-Learner-Knowledge Adaptive (ALKA) Algorithm, which performs selective ranking (of the initial 10 passages) on these principles: researchers use scientific terms and the terminology of the jargon more frequently, which may include equations, numeric data (numbers, percentage signs) and long words such as Lymphadenopathy. Thus, documents relevant to researchers will include more such terms with higher frequency, because that actually fulfills the need of the user, whereas those meant for the novice will contain simple (short) words with fewer numbers or equations. The entity file of the corpus constructed in the initial phase is configured to classify terms as 'biological' (e.g. Efavirenz), 'scientific' (e.g. Homo sapiens) or 'general' (e.g. human), using metadata information. If a passage contains many frequently occurring scientific terms, it is given a lower rank for the novice and a higher one for the advanced user.
II. Use of acronyms: Acronyms are of great importance in a field like biomedicine, where precise scientific terms are used and any error introduced by the need to type long names can be critical. Solving the acronym problem not only saves the user's time but also relieves them of the burden of remembering long scientific names to single-character accuracy.
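The ALKA ranking principle in part I can be sketched as below. The scoring heuristic (term list, length threshold, digit test) is an assumption made for illustration, not the algorithm's actual formula.

```python
# Hedged sketch of ALKA-style selective ranking: passages dense in
# scientific/biological terms, numbers, and long words are promoted for
# the advanced user and demoted for the novice. Thresholds are assumed.
import re

SCIENTIFIC = {"homosapiens", "efavirenz", "lymphadenopathy"}  # from the entity file

def technicality(passage):
    """Fraction of words that look technical: known scientific terms,
    very long words, or tokens containing digits or percent signs."""
    words = passage.lower().split()
    if not words:
        return 0.0
    technical = sum(
        1 for w in words
        if w.strip(".,") in SCIENTIFIC or len(w) > 12 or re.search(r"\d|%", w)
    )
    return technical / len(words)

def rerank(passages, user="novice"):
    """Order passages by technicality: ascending for a novice,
    descending for an advanced user."""
    return sorted(passages, key=technicality, reverse=(user == "advanced"))
```

The same candidate set is thus presented in opposite orders to the two user classes, without any change to retrieval itself.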
Manually built acronym lists are employed to resolve the differences in meaning caused by the use of an acronym in one place and its full form in another. Many acronym lists have been compiled and published, and many are available on the Web (e.g., Acronym Finder and the Canonical Abbreviation/Acronym List). As the purpose of this study was to demonstrate the use of information about acronym expansions in enhancing the answers obtained from a question answering system, the use of a manually built acronym list is justified.
III. Comprehending the implicit assumptions of the user: A typical user question rarely contains the full information required to answer it. Rather, it contains many unstated assumptions, and it may require extending or narrowing the meaning of the question to broaden or restrict the search. This mirrors real life: human conversations hardly include full detail, leaving much for the listener to assume. For example, a user may ask "How does Tat enhance the ability of RNA polymerase to elongate?" It is up to the system to decide between the 3 RNA polymerases (I, II, and III). To perform in such circumstances, the system is built with a Concepts Relation Graph (CRG), which enhances the search with the knowledge represented in the graph (CRG
is a form of metadata information). CRG is a one-to-many relation graph representation of concepts and data of the biomedical domain (the entities corresponding to the nodes of the graph are obtained through this relation). For example, in the question above, the concept of RNA polymerase is related to the three possible variants, namely RNA polymerases I, II and III. The CRG is meant to fill in the missing information required to answer the question, or to remove ambiguity from the question. Given a question the CRG determines to be ambiguous, the user can either be prompted to supply more information, or the system can still answer the question with the aid of the CRG. Since a general user is unlikely to know everything about the searched concept at the outset, the latter approach is better. Hence, the system recognizes the keywords present in the question as well as those in the CRG, augmenting the search with CRG entities. The user is then presented with the answer along with the knowledge in the CRG, and can use this help to select the suitable answer without having to search again with more information. This approach is general and can solve a variety of problems. If more precise information about the user's background is available, the system can be configured to provide a unique and unambiguous answer by selecting just one entity from the CRG. This approach paves the way for a "friendly" QA system that saves the user from having to enter elaborate information in the question (although at the expense of accuracy). Figure 3 shows the different levels of answers obtained for novice and advanced users, with acronym expansion (Tuberculosis searched from the word TB).
Figure 3. Different outputs for advanced and novice users, respectively, with a display of acronym expansion.
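The two metadata aids described in Section 5, acronym expansion and CRG-based query augmentation, can be sketched together as below. The dictionary contents and function name are illustrative assumptions.

```python
# Illustrative sketch of metadata-driven query augmentation: a manually
# built acronym list, and a one-to-many Concepts Relation Graph (CRG)
# that fills in implicit assumptions by expanding an ambiguous concept
# to its related entities. Entries here are examples, not the real data.
ACRONYMS = {"TB": "Tuberculosis", "HIV": "Human Immunodeficiency Virus"}

CRG = {  # concept -> related entities (one-to-many relation)
    "RNA polymerase": ["RNA polymerase I", "RNA polymerase II", "RNA polymerase III"],
}

def augment_query(terms):
    """Expand acronyms, then add CRG neighbours of ambiguous concepts."""
    expanded = [ACRONYMS.get(t, t) for t in terms]
    for t in list(expanded):
        expanded.extend(CRG.get(t, []))
    return expanded
```

The augmented term list is what would be handed to the retrieval engine, so a query mentioning only "RNA polymerase" still reaches passages about any of the three variants.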
6. EXPERIMENTATION
As a sample resource, abstracts were taken from PUBMED to experiment on the system. Difference-seeking questions are not generally available as test questions, unlike in open-domain evaluations where test questions can be mined from question logs (Encarta, Excite, AskJeeves); we therefore had them constructed by one of the biomedical students. To build the question set, we collected 40 normal questions and 20 difference-seeking questions from general students through a survey. The group comprised beginners as well as sophomores, to simulate use of the system by both novice and expert users. The questions received were of widely varying difficulty, covering various topics of the subject. For each question, the system presents the top 5 answers to the user (and 3 for difference-seeking questions). A question is considered answered only if the answer is present in the text actually shown to the user (and not merely somewhere in the document from which the text was retrieved).
Comparison of BioinQA with the Google Search Engine: We compared our system with the most sophisticated search engine, Google. Questions were posed to Google and the top 5 documents were checked for the presence of the answer.
Evaluation metrics: For general questions we used the popular metric Mean Reciprocal Answer Rank (MRAR), suggested in TREC [17] for the assessment of question answering systems, which is defined as follows.
$$RR = \frac{1}{rank[i]}, \qquad MRAR = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{rank_i},$$
where $n$ is the number of questions and $RR$ is the reciprocal rank. For the evaluation of comparison-based questions, no metric has been suggested in the literature. To evaluate BioinQA's performance on such questions, a novel metric called "Mean Correlational Reciprocal Rank (MCRR)" was adopted. Let $rank1$ and $rank2$ be the ranks of the correct answers given by the system for the two components respectively. Then
$$MCRR = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{rank1_i \cdot rank2_i},$$
where $n$ is the number of questions. If the answer to a question is not found in the passages presented to the user, the rank of that question is assumed to be a fixed value large compared to the number of passages. For calculating MRAR this value is taken as infinity; for MCRR it is taken as a much smaller value, as this avoids over-punishing the case where the system provided an answer to only one of the components. In our experiments we took this value as 10. The use of MCRR, being very similar to MRAR, can be justified: it is symmetric with respect to the objects being compared, so it treats "difference between A and B" and "difference between B and A" the same; and the answer to a comparison question is complete only when both components (e.g. lipoprotein and glycoprotein) are described, not just one. Thus, it punishes answers where only one component has been answered.
Results: We calculated MRAR and MCRR for our system and the Google search engine. The following table and graphs summarize the results of our experiments:

Table 1: Experimental results of BioinQA and Google on the data set
            MRAR     MCRR
BioinQA     0.7333   0.3096
Google      0.6328   0.2195

Figure 4. Plot of (a) MRAR vs. % of questions asked; (b) MCRR vs. % of questions asked.

Evaluation of the results: As opposed to BioinQA, where the answer passages provided to the user were taken as correct, for Google the authors had to manually search the whole returned document to check whether it contained the answer anywhere. This makes the user effort exorbitantly large for Google. Moreover, this strategy completely fails for comparison-based questions if a direct answer does not happen to appear in the same words as the question. Figure 5 illustrates the ineffectiveness of Google.
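The two metrics can be transcribed directly into code. The handling of unanswered questions follows the paper's convention: an effectively infinite rank for MRAR, and a placeholder rank of 10 for MCRR; the function names are our own.

```python
# Direct transcription of the MRAR and MCRR metrics defined above.
import math

def mrar(ranks):
    """Mean Reciprocal Answer Rank. ranks[i] is the rank of the correct
    answer for question i; use math.inf for an unanswered question."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def mcrr(rank_pairs, unanswered_rank=10):
    """Mean Correlational Reciprocal Rank. rank_pairs[i] = (rank1, rank2)
    for the two question components; None for an unanswered component is
    replaced by the placeholder unanswered_rank (10 in the paper)."""
    total = 0.0
    for r1, r2 in rank_pairs:
        r1 = unanswered_rank if r1 is None else r1
        r2 = unanswered_rank if r2 is None else r2
        total += 1.0 / (r1 * r2)
    return total / len(rank_pairs)
```

Note the symmetry of MCRR in `(r1, r2)`: swapping the two components of a comparison question leaves the score unchanged, as the paper requires.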
Figure 5. No answer to the Google search for the question "What is the difference between lipoprotein and glycoprotein?"
7. CONCLUSIONS AND FUTURE WORK
Our biomedical QA system uses the techniques of entity recognition and matching. The system is based on searching in context and utilizes syntactic information. BioinQA also answers comparison-type questions from multiple documents, a feature which contrasts sharply with existing search engines, which merely return answers from a single document or passage. The use of metadata to understand the implicit assumptions of the user, accommodate acronyms, and answer questions according to the user's expertise (rather than giving fixed answers to every user irrespective of background) adapts the system to the needs of the user. Our future work will focus on developing a systematic framework for image (jpeg, bmp, etc.) extraction and a method for its contextual presentation alongside the textual answer, which will greatly enhance the user's understanding. Along with images, the focus will be on incorporating audio lectures available in e-learning facilities, and other sources such as PUBMED.
8. REFERENCES
[1] National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/. Last accessed 27 September, 2007.
[2] Stergos Afantenos, Vangelis Karkaletsis, Panagiotis Stamatopoulos. 'Summarization from medical documents: a survey'. 13 April, 2005.
[3] Zweigenbaum P. 'Question answering in biomedicine'. Workshop on Natural Language Processing for Question Answering, EACL 2003.
[4] Schulz S., Honeck M., and Hahn U. 'Biomedical text retrieval in languages with complex morphology'. Proceedings of the Workshop on Natural Language Processing in the Biomedical Domain, July 2002, pp. 61-68.
[5] Song Y., Kim S., and Rim H. 'Terminology indexing and reweighting methods for biomedical text retrieval'. Proceedings of the SIGIR'04 Workshop on Search and Discovery in Bioinformatics, ACM, Sheffield, UK, 2004.
[6] Minsuk Lee, James Cimino, Hai Ran Zhu, Carl Sable, Vijay Shanker, John Ely, Hong Yu. 'Beyond Information Retrieval—Medical Question Answering'. AMIA, 2006.
[7] Jacquemart P. and Zweigenbaum P. 'Towards a medical question-answering system: a feasibility study'. In R. Baud, M. Fieschi, P. Le Beux and P. Ruch, Eds., Proceedings of Medical Informatics Europe, volume 95 of Studies in Health Technology and Informatics, pp. 463-468, Amsterdam: IOS Press, 2003.
[8] Ayache, C. 'Rapport final de la campagne EQueR-EVALDA, Evaluation en Question-Réponse', 2005. Site web Technolangue: http://www.technolangue.net/article61.html. Last accessed 15 June 2007.
[9] Rinaldi F., Dowdall J., Schneider G. and Persidis A. 'Answering questions in the genomics domain'. ACL 2004 QA Workshop, 2004.
[10] P. Jacquemart and P. Zweigenbaum. 'Towards a medical question-answering system: a feasibility study'. In Proceedings of Medical Informatics Europe, P. Le Beux and R. Baud, Eds., Amsterdam: IOS Press, 2003.
[11] J. Ely, J. A. Osheroff, M. H. Ebell, et al. 'Analysis of questions asked by family doctors regarding patient care'. BMJ, vol. 319, 1999, pp. 358-361.
[12] Lei Li, Roop G. Singh, Guangzhi Zheng, Art Vandenberg, Vijay Vaishnavi, Sham Navathe. 'A Methodology for Semantic Integration of Metadata in Bioinformatics Data Sources'. 43rd ACM Southeast Conference, March 18-20, 2005, Kennesaw, GA, USA.
[13] Chen, L., Jamil, H. M., and Wang, N. 'Automatic Composite Wrapper Generation for Semi-Structured Biological Data Based on Table Structure Identification'. SIGMOD Record 33(2):58-64, 2004.
[14] Stoimenov, L., Djordjevic, K., Stojanovic, D. 'Integration of GIS Data Sources over the Internet Using Mediator and Wrapper Technology'. Proceedings of the 10th Mediterranean Electrotechnical Conference (MeleCon 2000), pp. 334-336.
[15] Kumar P., Kashyap S., Mittal A., Gupta S. 'A Fully Automatic Question Answering System for Intelligent Search in E-Learning Documents'. International Journal on E-Learning, 4(1):149-166, 2005.
[16] Owen de Kretser, Alistair Moffat. 'Needles and Haystacks: A Search Engine for Personal Information Collections'. Australasian Computer Science Conference, p. 58, 2000.
[17] Giovanni Aloisio, Massimo Cafaro, Sandro Fiore, Maria Mirto. 'ProGenGrid: a Workflow Service Infrastructure for Composing and Executing Bioinformatics Grid Services'. Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems (CBMS'05).
ABOUT THE AUTHORS
Dr. Ankush Mittal: Dr. Ankush Mittal is a faculty member at the Indian Institute of Technology Roorkee, India. He has published many papers in international and national journals and conferences. He has been an editorial board member of the International Journal on Recent Patents on Biomedical Engineering, and a reviewer for IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Image Processing, IEEE Transactions on Fuzzy Systems, IEEE Transactions on Knowledge and Data Engineering, etc. He was awarded the Young Scientist Award by The National Academy of Sciences, India, in 2006 for contributions to e-learning in the country; a best paper award (with Rs. 10,000) at the IEEE ICISIP conference, 2005; and Star Performer, 2004-05, IIT Roorkee, based on overall performance (teaching, research, thesis supervision, etc.). His research interests include image processing and object tracking, bioinformatics, e-learning, content-based retrieval, AI and Bayesian networks.
Sparsh Mittal: Sparsh Mittal is a senior undergraduate student of Electronics & Communications Engineering Department at Indian Institute of Technology Roorkee, India. His research interests include natural language processing, data mining, FPGA implementation using VHDL and Verilog and image processing.
Saket Gupta: Saket Gupta is a senior undergraduate in the Electronics and Communication Engineering Department at the Indian Institute of Technology Roorkee, India. He has worked on content-based retrieval, QA systems and other NLP applications for e-learning. His current fields of research include MIMO communication systems, image processing, and FPGA synthesis and design using VHDL. He has been awarded many scholarships from IIT Roorkee and other institutions.
Sumit Bhatia: Sumit Bhatia is a senior undergraduate student in Electrical Engineering Department at Indian Institute of Technology Roorkee, India. His current research interests include Content Based Information retrieval and Data Mining. In the past, he has worked in the areas of Digital Image processing and Remote Sensing.