BioinQA Multidocument Question Answering System: Providing Access to E-learning for Masses

Sparsh Mittal, Saket Gupta, and Ankush Mittal
Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee, 247667, India
email: {freshuce, sparsuch, ankumfec}@iitr.ernet.in, web: www.iitr.ernet.in
ABSTRACT: In this paper we report our ongoing work in developing a Natural Language Processing (NLP) based, fully automatic multi-document Question Answering system (BioinQA) which intelligently composes answers from single and multiple documents. Real-life QA systems of the next generation must go beyond limited search in one document and be able to 'piece together' the answer from multiple documents. As a tool for e-learning, our system provides students quick and effective reach over the vast information present on the web. A comparison of BioinQA with search engines like Google and QA systems on the Web like Answers.com and Yahoo! Answers highlights the salient features of our QA system over and above a search engine and other QA systems, which make it especially useful in the e-learning realm. We also survey the state-of-the-art architectures and methodologies of systems proposed in the literature which have great potential in the e-learning field. This survey aims to facilitate innovation in and development of future e-learning architectures.

Keywords: question answering system, multi-document QA, e-learning, entity cluster matching, semantic heterogeneity resolution, metadata
1. Introduction
Through scientific research, national and international R&D institutions are creating a wealth of information and knowledge. These are potentially valuable resources not only for professionals working in the mainstream but also for rural people, on whose lives they can make a tremendous impact in today's information age. Many universities, government-run organizations and NGOs are offering computer-mediated distance-education (e-learning) programs for students located in even the remotest areas. The government has envisioned providing primary and secondary education through e-learning. E-learning is defined as formal and informal education and information sharing that uses digital technology. It is a nation's ability to generate, disseminate and use digital
information among its citizens for the betterment of the country's economic activity [1]. Apart from the vast amount of information generated on the web in the form of PowerPoint slides, digital text and FAQs, e-learning resources may include local information systems, multimedia products on CD-ROM, etc.

The importance of e-learning in a country like India cannot be overemphasized. An e-learning system can help solve the problems of teacher shortage, differing teacher quality, and differing learning places and materials in rural areas and in cities. It can bridge distances and conserve classrooms and costs. It is available anytime, according to the learner's convenience, and fosters life-long learning for career development. Online educational materials are more easily updated and more motivating for learners; thus e-learning resources can always be kept up to date and useful. This approach has distinct advantages over videoconferencing in that it is less expensive and allows students and tutors to interact at any time, anywhere there is an Internet connection. Thus an effective and well-structured e-learning system can help solve many problems which have long plagued the educational system in the country, by extending the opportunity even to those who may lack the resources (time, money, opportunity) to participate in courses offered at fixed times and locations.

There are many notable efforts devoted to addressing the issues of both e-learning and QA systems, and a considerable number of different architectures have been proposed. In this paper we present a survey of question answering architectures. Although some general features of QA system architectures will be presented, our main focus is on the features which prove highly beneficial in the area of e-learning. The contributions of this paper are as follows:
1. Highlighting the significance of a QA system as an integral tool in e-learning.
2. Presenting our multi-document Question Answering System BioinQA, with its salient features, experiments and results, and its special utility in the field of e-learning.
3. Drawing attention to utility features (e.g. user interaction, multilingual support) of state-of-the-art systems from the perspective of general users of an e-learning system.
4. Identifying issues and design factors to be addressed in upcoming systems.

NEED OF A QA SYSTEM
Making new ICTs available to people in general and rural communities in particular, however, provides no guarantee that people will be able to 'access' them to create and share knowledge. A huge repository of data is useful only if it can be easily accessed and the contents retrieved as per the user's requirements, providing information in a suitable form [2]; otherwise, accessing the web or any corpus is like "licking the honey from outside". Even the most widely used modern search engines such as Google, which have a huge storehouse of information, are inherently limited as true and complete tools for solving the information needs of the user. They merely return a large number of documents which, as the general experience of all users would indicate, are hardly 'answers to their needs', but rather a collection of all documents on the web having some similarity with the query. Mere keyword matching leads to incomplete and error-prone answers, which is very inconvenient for the user.

The meaning of the query is very relevant. For example, "How is snake poison employed in counteracting neurotoxic venom?", "When is snake poison employed in counteracting neurotoxic venom?" and "Why is snake poison employed in counteracting neurotoxic venom?" all have different meanings. In a study conducted with a test set of 100 medical questions collected from medical students in a specialized domain, a thorough search in Google was unable to obtain relevant documents within the top five hits for 40% of the questions [3]. Moreover, due to busy practice schedules, physicians spend less than 2 minutes on average seeking the answer to a question; thus, most clinical questions remain unanswered [4]. The need for a QA system inherently arises in such circumstances.
Question Answering has been examined by TREC, CLEF, and NTCIR for many years, and is arguably the ultimate goal of semantic web research for interrogative information needs [5].
2. System architecture
Figure 1 presents the architecture of the system (see [6] for details). In short, the system is based on searching the entities of the corpus in context for effective extraction of answers. The question is parsed using the Link parser during question parsing. Query formulation translates the question into a set of queries that is given as keyword input to the Retrieval engine. The question focus is identified by finding the object of the verb, and importance is given to the question focus by assigning it more weight during the retrieval of answers. Finally, candidate answers are retrieved and the Answer Selection module selects the top answers to be presented to the user.
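As an illustrative sketch only (BioinQA itself uses the Link parser for dependency analysis), the query-formulation and focus-weighting steps might look like the following; the stop-word list is an invented placeholder, and the question focus is crudely approximated as the last content word:

```python
# Illustrative sketch only: BioinQA uses the Link parser; here the
# "question focus" is crudely approximated as the last content word,
# and the stop-word list is an invented placeholder.
STOP = {"what", "is", "the", "of", "a", "an", "in", "how", "why",
        "between", "and", "are"}

def formulate_query(question):
    """Turn a question into weighted keywords for a retrieval engine."""
    words = [w.strip("?.,").lower() for w in question.split()]
    keywords = [w for w in words if w and w not in STOP]
    weights = {w: 1.0 for w in keywords}
    if keywords:
        # give the (approximated) question focus extra weight
        weights[keywords[-1]] = 2.0
    return weights

print(formulate_query("What is the function of ribosomes?"))
```

The weighted keywords would then be handed to the retrieval engine, so that passages mentioning the focus term rank above passages that merely share other keywords.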
Figure 1. System Architecture

MULTI-DOCUMENT RETRIEVAL
In what follows, we present the salient features of our system which make it especially suitable for the e-learning field. Answers to non-trivial questions may not be present in a single document but may require multiple documents. To be used in any real scenario, QA systems must go beyond local search in a document and deal with the possibility of assembling the answer from different locations in one or multiple documents. In our system we implement multi-document question answering using a novel Segregate Algorithm.

Answers to domain questions involving a "comparison", "differentiation" or "contrast" between two different entities of the corpus generally lie in different documents. For example:
o What is the difference between Introns and Exons?
o Contrast between Lymphadenopathy and Leukoreduction.
To answer such questions, we developed the Segregate Algorithm, which works as follows. The two separate components of the question (for example 'Lymphadenopathy' and 'Leukoreduction') are mapped to their respective information documents. Documents relevant to the query are scanned for these components and the top n documents are obtained. These are further re-ranked based on "passage sieving", described next.

Entity Cluster Matching Based Passage Sieving
The passages obtained will depict a contrast most accurately when their parameters, or entity clusters (linked lists of the entities of a passage along with their frequency of occurrence in that passage), are very similar (e.g. the possible parameters for comparing medicines would be duration of action, dosage, ingredient levels, side-effects, etc.). Thus, re-ranking is performed by generating such entity clusters for each document and matching them. Let G_{i,n} be the entity cluster set of the n-th answer in the i-th component, where 1 ≤ i ≤ 2 and 1 ≤ n ≤ 10. The score obtained from the Entity Cluster Matching Based Re-ranking algorithm, ECRScore, is given by

    ECRScore_{i,n} = Σ_{k=1}^{10} C_{n,k},   1 ≤ i ≤ 2, 1 ≤ n ≤ 10.

Here C_{n,k} is the similarity function, defined as

    C_{n,k} = G_{1,n} ∩ G_{2,k} = G_{2,k} ∩ G_{1,n}

The operator ∩ is used to match the entities present in both its operands for measuring the similarity between them. Now, the FinalScore_{i,n} of all the passages is calculated as

    FinalScore_{i,n} = w1 · CurrentScore_{i,n} + w2 · ECRScore_{i,n},   where w1 + w2 = 1.

Here CurrentScore_{i,n} is the score of the passage obtained from the answer selection phase; w1 and w2 are weights given to the two scores to incorporate the contribution of both modules, chosen (empirically) in our system to be 0.7 and 0.3 respectively. Finally, answer passages are ranked according to their FinalScore and the top passages are presented to the user.

To the best of our knowledge, until now such difference-seeking questions could be answered only when the answer was directly present as one passage in one document. Our QA system is thus a step in the direction of a next-generation system capable of extracting answers from diverse documents.
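A minimal sketch of this re-ranking step, not the authors' implementation: entity clusters are modelled as frequency maps, the similarity C_{n,k} is taken as the number of shared entities, and all scores and cluster contents below are invented for illustration.

```python
# Sketch of the Segregate Algorithm's re-ranking step, not the authors'
# implementation. Entity clusters are frequency maps (Counter); the
# similarity C_{n,k} is taken as the number of shared entities, and all
# scores and cluster contents below are invented for illustration.
from collections import Counter

def ecr_score(G_self, G_other, n):
    """ECRScore for the n-th answer of one component: shared entities
    summed over every cluster of the other component."""
    return sum(len(set(G_self[n]) & set(G_other[k]))
               for k in range(len(G_other)))

def final_scores(current, G_self, G_other, w1=0.7, w2=0.3):
    """FinalScore_n = w1*CurrentScore_n + w2*ECRScore_n (w1 + w2 = 1)."""
    return [w1 * current[n] + w2 * ecr_score(G_self, G_other, n)
            for n in range(len(G_self))]

# toy entity clusters for the two question components
G1 = [Counter({"dosage": 2, "side-effects": 1}), Counter({"duration": 1})]
G2 = [Counter({"dosage": 1, "duration": 3}), Counter({"side-effects": 2})]
print(final_scores([0.9, 0.4], G1, G2))
```

With these toy clusters, the first passage of component 1 shares entities with both of component 2's clusters, so it gains the larger ECRScore contribution on top of its retrieval score.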
Figure 2. Sample output for the question "What is the difference between glycoprotein and lipoprotein?"
SEMANTIC HETEROGENEITY RESOLUTION THROUGH METADATA
A common problem in scientific literature, especially in the many domains which are multidisciplinary in nature, is that of heterogeneity. It arises from the use of different formats and representation systems (e.g. XML, flat file, or other formats), from using varied vocabulary to describe similar concepts or data relationships ("humans" or "Homo sapiens"), or from using the same metadata to describe different concepts or data relationships. The difference in jargon used by a novice and an advanced user also causes heterogeneity. Our QA system assigns the task of resolving differences between the terms and formats used by the user and those in the corpus entirely to the system, thus requiring no qualification on the part of the user to know the complex jargon of the subject. By employing important techniques such as "utilization of scientific and general terminology" and "use of acronyms", BioinQA bridges the semantic gap.

We have developed the Advanced-and-Learner-Knowledge Adaptive (ALKA) algorithm, which performs selective ranking (of the initial 10 passages) on these principles: researchers use scientific terms and the terminology of the jargon more frequently. These may also include equations, numeric data (numbers, percentage signs) and words of large length such as "Lymphadenopathy". Thus, documents relevant for researchers will include more such terms with a higher frequency, because this actually fulfills the need of the user, whereas those meant for the novice will include simple (short) words with fewer numbers or equations. Moreover, the questions posed by a novice or layman may be ambiguous, as they may be non-English speakers or new to the field. Even otherwise, in real life, questions posed by humans essentially contain many unstated assumptions, and answering them requires extending or narrowing the meaning of the question to either broaden or shorten the search.
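A rough reading of the ALKA ranking principle can be sketched as below; this is our own simplification, not the actual algorithm, and the word-length threshold of 10 characters is an invented illustration.

```python
# A rough reading of the ALKA ranking principle, not the actual
# algorithm: passages with more long technical words and numeric data
# rank higher for advanced users, lower for novices. The word-length
# threshold of 10 is an invented illustration.
import re

def technicality(passage, long_word_len=10):
    words = re.findall(r"[A-Za-z]+", passage)
    if not words:
        return 0.0
    long_words = sum(len(w) >= long_word_len for w in words)
    numerics = len(re.findall(r"\d+%?", passage))  # numbers, percentages
    return (long_words + numerics) / len(words)

def rank_for_user(passages, level="advanced"):
    # advanced users see the most technical passages first
    return sorted(passages, key=technicality, reverse=(level == "advanced"))

docs = ["Lymphadenopathy occurs in 40% of cases with 2.5 mg dosage.",
        "Swollen glands can make you feel unwell."]
print(rank_for_user(docs, "novice"))
```

For the novice, the plain-language passage ranks first; for the advanced user, the jargon-heavy passage does.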
By using a technique namely 'comprehending the implicit assumptions of the user', the system enhances the search with knowledge represented internally as metadata, thus 'filling in the gaps' of the question. This knowledge helps to remove ambiguity with the help of the user. The approach is general enough to solve a variety of problems, and its use paves the way for the development of a "friendly" QA system which saves the user from having to enter elaborate information in the question (although at the expense of accuracy). It is especially helpful for the general user who is semi-literate or new to the internet or to the field. Figure 3 shows the difference in the levels of answers obtained for a novice and an advanced user, with acronym expansion (Tuberculosis retrieved from the word TB).
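The acronym-expansion side of this metadata enrichment can be sketched as follows; the mapping table is invented for illustration, not taken from BioinQA.

```python
# Hypothetical sketch of acronym expansion through metadata
# (e.g. Tuberculosis retrieved from "TB"); the mapping table is
# invented for illustration, not taken from BioinQA.
ACRONYMS = {"TB": "Tuberculosis", "DNA": "deoxyribonucleic acid"}

def enrich_query(question):
    expanded = []
    for token in question.split():
        expanded.append(token)
        key = token.strip("?.,")
        if key in ACRONYMS:
            expanded.append(ACRONYMS[key])  # fill the gap with metadata
    return " ".join(expanded)

print(enrich_query("What causes TB?"))  # -> "What causes TB? Tuberculosis"
```

The expanded query then retrieves passages that mention only the full term, which a literal keyword match on "TB" would miss.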
Figure 3. Different outputs for novice and advanced users respectively, and a display of acronym expansion

Since an e-learning system is likely to be used by people of widely varying backgrounds, users of all levels, from general people and students to researchers and professionals such as medical practitioners, can easily use the system. This feature especially empowers the general user and students from rural backgrounds to access scientific corpora; such access was very limited until now.

3. Experiments
The experiments performed to evaluate the performance of any system must closely relate to the practical use to which the system is expected to be put. Unlike open-domain evaluations, where test questions can be mined from question logs (Encarta, Excite, AskJeeves), no question sets are at the disposal of restricted-domain evaluators [7]. To build a question set we collected 40 normal questions and 20 difference-seeking questions from general students by conducting a survey. The group comprised beginners as well as sophomores. This was done to simulate the use of an e-learning system in a real-life scenario by a general user, as the students' (the potential users') questions better reflect the questions likely to be posed to a QA system used for a practical purpose. The questions thus received were of widely varying difficulty levels, covering various topics of the subject. A question is considered answered only if the answer is available in the text presented to the user (and not merely somewhere in the document from which the text is retrieved).

Comparison of BioinQA with the Google Search Engine and QA Systems on the Web
We compared our system with the most sophisticated search engine, Google. Questions were posed to Google and the 5 documents returned were checked for the presence of an answer.

Evaluation metrics
For general questions we used the popular metric Mean Reciprocal Answer Rank (MRAR) suggested in TREC [8] for the assessment of question answering systems, defined as follows:

    RR = 1 / rank[i],   MRAR = (1/n) Σ_{i=1}^{n} 1 / rank[i]

where n is the number of questions and RR is the Reciprocal Rank. For the evaluation of comparison-based questions no metric has been suggested in the literature. To evaluate the performance of BioinQA for such questions, a novel metric called "Mean Correlational Reciprocal Rank (MCRR)" was adopted. Let rank1 and rank2 be the ranks of the correct answers given by the system for the two components respectively. Then

    MCRR = (1/n) Σ_{i=1}^{n} 1 / (rank1[i] × rank2[i])

where n is the number of questions. If the answer to a question is not found in the passages presented to the user, then the rank of that question is assumed to be Z, whose value is large compared to the number of passages. For the calculation of MRAR, Z is taken as ∞. To calculate MCRR, Z is taken as a much smaller value, as this avoids punishing the case where the system provided an answer to only one of the components. In our experiments we took Z as 10.

Results
We calculated MRAR and MCRR for our system and the Google search engine. The following table summarizes the results of our experiments:
Table 1: Experimental Results of BioinQA and Google on the data set

          BioinQA   Google
    MRAR  0.7333    0.6328
    MCRR  0.3096    0.2195
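The two metrics can be sketched directly from their definitions; in this illustration a rank of None marks a question whose answer was not found, with Z handled as described in the text (Z tending to infinity for MRAR, i.e. a zero reciprocal rank, and Z = 10 for MCRR).

```python
# Sketch of the two metrics as defined above; None marks a question
# whose answer was not found, with Z handled as in the text (Z -> inf
# for MRAR, i.e. zero reciprocal rank, and Z = 10 for MCRR).
def mrar(ranks):
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)

def mcrr(rank_pairs, Z=10):
    total = 0.0
    for r1, r2 in rank_pairs:
        r1 = Z if r1 is None else r1
        r2 = Z if r2 is None else r2
        total += 1.0 / (r1 * r2)
    return total / len(rank_pairs)

print(mrar([1, 2, None, 1]))      # (1 + 1/2 + 0 + 1) / 4 = 0.625
print(mcrr([(1, 1), (2, None)]))  # (1/1 + 1/(2*10)) / 2 = 0.525
```

Note how the finite Z still credits the pair (2, None) for its one answered component, whereas treating Z as infinite would zero it out.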
Evaluation of the results: There is more to the effectiveness and usefulness of BioinQA than what the table above suggests. As opposed to BioinQA, where the answers must be present in the passages provided to the user to be counted as correct, for Google the authors had to manually search the whole document it returned to check whether it contained the answer anywhere. On other criteria our system also performs much better. Returning thousands of hyperlinks to potential answers without specifying the detailed location of the answer greatly reduces the precision (the percentage of retrieved documents that are relevant) and makes the user's effort exorbitantly large for Google. Moreover, this strategy completely fails for comparison-based questions unless it happens to find a direct answer in the same words as presented in the question.
Figure 4. No answer to the Google search for the question 'What is the difference between lipoprotein and glycoprotein?'

Figure 5. The output of a few prominent QA systems on the Web to the question 'What is the difference between lipoprotein and glycoprotein?' shows their failure to answer.
4. Other important technologies
Among the many recent QA technologies, we briefly review here multilingual, cross-lingual and Indian-language question answering, which have special application in the realm of e-learning.
The authors in [9] propose a strategy to widen the scope of the answer and also include blocks which may not contain the query keywords, as long as they contain the answer and bear some similarity to other answer blocks. They use a template-mapping technique for question classification; for each question type, a set of templates is defined. The benefit of this scheme lies in the construction of a query that is more appropriate for a search engine once the proper set is selected for a question. Using web pages to find answers to a question poses many challenges which are hardly present in a locally stored corpus. Many Web pages are multi-topic, and the information relevant to the answer may be only a small portion of a relevant page, "polluted" by irrelevant/noise information; such pages do not look alike. To solve this problem, the authors use content blocks instead of page-based similarity checking to identify the desired information. The content blocks are found by detecting the layout structure of a Web page, based on visual clues as a user would understand them. One of the advantages the Web offers is that redundant answers can generally be found in multiple relevant Web pages [10]. This property is used by the authors in [9] to produce effective answers from low-quality Web pages.

Due to the similarity expected between the questions asked by typical users of e-learning, exploring the cumulative experience from previous students' answers for the benefit of new ones can help an e-learning system tremendously. PERSO [11] is an adaptive hypermedia e-learning system [12, 13] in which learners with different learning goals are treated differently, by building a model of the knowledge and preferences of each of them. It stores questions/answers in a database. The tool searches for this information in the database; if any similarity occurs between a question and one answered earlier, it answers the student automatically by giving him the stored data. The semantic closeness between the current student's question and the previous questions saved in the database is calculated to find the degree of similarity. Apart from dealing with "similar" and "totally different" questions, it also deals with the case of questions which have very close semantic-closeness values by giving them fuzzy treatment. Fuzzy treatment consists of using feedback from the user to improve the question, failing which the question is sent to the tutor for answering.

Research work has been done in the Surprise Language Exercise (SLE) within the TIDES program, where the viability of Hindi-English Cross-Lingual Question Answering (CLQA) has been shown by [14]. Their system accepts questions in English, finds candidate answers in Hindi newspapers, and translates the answer candidates into English along with the context surrounding each answer. However, it is aimed at English-speaking users.
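The PERSO-style reuse of stored questions can be sketched as below; Jaccard word overlap stands in for whatever semantic-closeness measure the actual system uses, and the thresholds and stored data are invented for illustration.

```python
# Rough sketch of PERSO-style reuse of stored questions, with Jaccard
# word overlap standing in for the semantic-closeness measure; the
# thresholds (0.8, 0.4) and the stored data are invented.
def closeness(q1, q2):
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

def route(question, qa_store, hi=0.8, lo=0.4):
    best_q = max(qa_store, key=lambda q: closeness(question, q), default=None)
    if best_q is None:
        return ("tutor", None)
    score = closeness(question, best_q)
    if score >= hi:
        return ("stored", qa_store[best_q])  # answer automatically
    if score >= lo:
        return ("fuzzy", best_q)  # ask the user to refine, else go to tutor
    return ("tutor", None)

store = {"what is an intron": "An intron is a non-coding region of a gene."}
print(route("what is an intron", store))
```

The middle band is the "fuzzy treatment" described above: the user is asked to improve the question before it is escalated to the tutor.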
A majority of school-going children pursue their education in regional languages, among which the Hindi language stands out as the most prominent. Currently, schools are being provided with computer education facilities and internet connectivity so that the vast educational resources already available, and those to be developed by the schools themselves, can be shared amongst them. The Question Answering System developed for Hindi e-learning documents by the authors in [15] is the first of its kind (Figure 6). The user can submit a question in his/her mother tongue in Devanagari script, such as "आधुनिक कृषि यंत्रों के उपयोग एवं उनसे होने वाली दुर्घटनाओं के विषय में बताएं" ("Tell us about the use of modern agricultural machinery and the accidents caused by it"). The authors employ many novel techniques to build the Hindi QA system, such as automatic entity generation, question classification, question parsing, query formulation, stop-word removal, answer extraction and selection, and closed-loop dialogue, despite the unavailability of many state-of-the-art tools for the Hindi language.
The user-friendliness of the system is enhanced by using closed-loop QA, which provides interactivity between the user and the system and prevents the system from failing in the case of ambiguous questions. However, the system is in its initial stages, and additional work is required to enhance its speed and prediction accuracy and to enable it to withstand very high workloads.
Figure 6. Output of the system (from [15]): the answer provided in the first passage with full confidence (100%) and a link to the specific location in the relevant document

The authors in [16] discuss an implementation of a multilingual Question-to-Query conversion system. They have integrated two independent systems, aAQUA and AgroExplorer; the combined system converts English, Marathi and Hindi questions into syntactically correct and meaningful queries. AgroExplorer [17] is a meaning-based, multilingual search engine that considers the semantics of a query using UNL. aAQUA is an online multilingual, multimedia question-and-answer based community forum for disseminating information from and to the grassroots of the Indian community [18]. The multimedia answers prove to be of great value, as some agricultural and veterinary problems are better addressed by photographs or audio and video files, which provide details to the experts. Multilingual keyword-based search is especially useful for users fluent in two or more languages (which is quite common in India). The original search query is expanded with its counterparts in each language, allowing users to search in their own language and retrieve content in other languages.
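The multilingual keyword expansion described above can be sketched as follows: each query term is joined by its counterparts from a bilingual lexicon so that retrieval can match content in any of the languages. The lexicon entries are illustrative, not taken from aAQUA or AgroExplorer.

```python
# Toy sketch of multilingual keyword expansion: each query term is
# joined by its counterparts from a bilingual lexicon so retrieval can
# match content in any language. The lexicon entries are illustrative,
# not taken from aAQUA or AgroExplorer.
LEXICON = {
    "water": {"hi": "पानी", "mr": "पाणी"},
    "crop":  {"hi": "फसल", "mr": "पीक"},
}

def expand(query):
    terms = []
    for word in query.lower().split():
        terms.append(word)
        terms.extend(LEXICON.get(word, {}).values())
    return terms

print(expand("water crop"))
```

A keyword index built over the mixed-language corpus can then match any of the expanded terms, so an English query retrieves Hindi or Marathi content as well.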
5. Conclusion and Future Work
We presented the BioinQA system as an effective tool to empower the general masses to gain access to e-learning facilities. Along with its important features, the experimentation and the results were shown and compared with Google and other QA systems. The work on comparison-type questions and the use of metadata to make the system more user-friendly showed promising results. The distinguishing features of online educational materials (ease of updating and the capability of presenting themselves in an attractive manner to benefit even an uninterested learner) will be fully utilized with a tool such as BioinQA. This is a major advantage of our system, which can help address the widespread problems of indifference and lack of awareness.

The analysis of the other systems leads to the following conclusions and suggestions. Firstly, considering today's general and academic scenario, only multilingual and multimedia systems can effectively fulfill the needs of both novice and professional users. However, there has been comparatively little work done on Indian languages. The resultant unavailability of basic utility software and limited computer support greatly restricts high-quality systems in these languages. Secondly, work should be done on systems built on Unicode characters that present an easy-to-use interface. A system made in one language can then be made multilingual by keeping the language-independent modules of the architecture Unicode compliant and implementing the language-specific modules, such as the GUI, for the different target languages (like Marathi, Punjabi or any other language). Finally, along with text and images, focus should also be on incorporating the audio lectures available in e-learning facilities and on adapting the system to the needs of a layman (by considering factors such as real-time response and low bandwidth). This will greatly enhance the efficacy of the system.
6. References
[1] S. Suktrisul, "E-Learning in Thailand," The Office of the Basic Education Commission (OBEC), Ministry of Education, Thailand, 2005.
[2] S. Afantenos, V. Karkaletsis, and P. Stamatopoulos, "Summarization from medical documents: a survey," 13 April 2005.
[3] P. Jacquemart and P. Zweigenbaum, "Towards a medical question-answering system: a feasibility study," in Proceedings of Medical Informatics Europe, P. L. Beux and R. Baud, Eds., Amsterdam: IOS Press, 2003.
[4] J. Ely, J. A. Osheroff, M. H. Ebell, et al., "Analysis of questions asked by family doctors regarding patient care," BMJ, vol. 319, 1999, pp. 358-361.
[5] A. Trotman, S. Geva, and J. Kamps, "Report on the SIGIR 2007 workshop on focused retrieval," SIGIR Forum, vol. 41, no. 2, Dec. 2007, pp. 97-103.
[6] S. Mittal, S. Gupta, A. Mittal, and S. Bhatia, "BioinQA: Addressing bottlenecks of Biomedical Domain through Biomedical Question Answering System," in International Conference on Systemics, Cybernetics and Informatics (ICSCI-2008), Hyderabad, India, vol. 1, pp. 98-103.
[7] A. R. Diekema, O. Yilmazel, and E. D. Liddy, "Evaluation of Restricted Domain Question-Answering Systems," ACL 2004 Workshop on Question Answering in Restricted Domains, Barcelona, 2004.
[8] E. M. Voorhees and D. Harman, "Overview of the sixth text retrieval conference (TREC)," Information Processing and Management, vol. 36, pp. 3-36.
[9] S. Parthasarathy and J. Chen, "A Web-based Question Answering System for Effective e-Learning," Seventh IEEE International Conference on Advanced Learning Technologies (ICALT 2007), pp. 142-146.
[10] C. L. A. Clarke, G. V. Cormack, and T. R. Lynam, "Exploiting redundancy in question answering," in Proc. of the 24th ACM SIGIR Conference, 2001, pp. 358-365.
[11] M. Jemni and I. Ben Ali, "Automatic answering tool for e-learning environment," 3rd International Conference on Multimedia and Information & Communication Technologies in Education, June 8-10, 2005, Caceres, Spain.
[12] H. Chorfi and M. Jemni, "PERSO: Towards an adaptive e-learning system," Journal of Interactive Learning Research, vol. 15, no. 4, 2004, pp. 433-447.
[13] H. Chorfi and M. Jemni, "PERSO: A system to customize e-training," 5th International Conference on New Educational Environments, May 26-28, 2003, Lucerne, Switzerland.
[14] S. Sekine and R. Grishman, "Hindi-English Cross-Lingual Question-Answering System," ACM Transactions on Asian Language Information Processing, vol. 2, no. 3, September 2003, pp. 181-192.
[15] P. Kumar, S. Kashyap, A. Mittal, and S. Gupta, "A Hindi Question Answering system for E-learning documents," IEEE Third International Conference on Intelligent Sensing and Information Processing, December 14-17, 2005, Bangalore, India, pp. 80-86.
[16] S.R.S.K. Venkata, S. Badodekar, and P. Bhattacharyya, "Question-to-Query Conversion in the Context of a Meaning-based, Multilingual Search Engine," Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur, February 2005.
[17] S. Singh, "Meaning Based, Multilingual Search Engine," B.Tech. Thesis, IIT Bombay, 2003.
[18] K. Ramamritham, A. Bahuman, R. Kumar, A. Chand, S. Duttagupta, G.V.R. Kumar, and C. Rao, "aAQUA - A Multilingual, Multimedia Forum for the Community," IEEE International Conference on Multimedia and Expo, 2004.