Conceptual Extraction of Questions from Wikipedia∗

Kshitij Gautam, Itika Gupta, Krishna Chandramouli
Division of Enterprise and Cloud Computing, School of Information Technology and Engineering,
VIT University, Vellore, India 632 014
{kshitij.gautam2010, itika.gupta2010, krishna.c}@vit.ac.in

∗ The authors express their sincere gratitude to the professors of the School of Information Technology and Engineering, VIT University, for participating in the evaluation of the system and providing valuable feedback. The authors also thank fellow researcher Dr. Tomáš Kliegr for sharing the Wikipedia API.

Keywords: Question Answering System, Natural Language Processing, Grammar Structure, Bloom’s Taxonomy, Wikipedia

Abstract

Online education systems enable learners to engage in a lifelong learning process. With the exponential increase in such eLearning systems, knowledge matters more than ever; in particular, validation of this knowledge plays a vital role in assuring a learner's ability. However, these online systems still depend on tutors to generate questions, which are subsequently evaluated before awarding learners appropriate grades. Addressing the challenge of creating and evaluating a large number of questions for mass learners, we present an integrated framework for conceptual extraction of questions from online sources. The proposed system generates objective-type question patterns based on the grammar structure of the language and integrates a part-of-speech tagger and a named-entity extractor to enable semantic mapping. As opposed to using information from closed sources such as course material, the proposed system enables online mining of information to generate questions. The system has been thoroughly evaluated by experienced tutors and students, providing both subjective and objective metrics.

1 Introduction

Evolution and change in information technology and educational technologies are important forces driving developments in education [15]. One such recent development gaining popularity is the notion of "university 2.0". As opposed to conventional classroom environments, where the number of learners is in the order of hundreds, a typical course provided via an eLearning system could consist of learners in the order of millions. Therefore, special emphasis should be placed on the nature of the course content being taught along with the nature of learner assessment. In this ever-changing era, knowledge matters more than ever, and the validation of this knowledge plays a vital role in assuring a learner's ability.

By far, multiple-choice questions (MCQs) play a key role in measuring students' ability across universities worldwide, and a learner's critical thinking ability is evaluated based on the course objectives and significant information. In addition, these questions also trigger higher-order thinking among learners while answering them. However, it is not possible for tutors to manually design such question patterns for mass learners, where the integrity of the examination has to be preserved. Recently, the notion of the question-answering system has gained popularity among many researchers [6]. These systems are developed with the intention of answering a specific question with the help of large-scale datasets. On the other hand, researchers have proposed innovative solutions for generating questions based on factual knowledge using Wikipedia [10] and a learning framework based on semantic web technologies [4]. In [9], the authors propose a question-answering system that uses sentences within a document as a source of questions and answers. However, to the best of our knowledge, the generation of MCQs has not been addressed, and in this paper we present an integrated framework for conceptual extraction of questions from online sources. The proposed system generates objective-type question patterns based on the grammar structure of the language and integrates a part-of-speech tagger and a named-entity extractor to enable semantic mapping. The questions generated by the framework are systematically categorized according to Bloom's taxonomy. As opposed to using information from closed sources such as course material, the proposed system enables online mining of information to generate questions. The remainder of the paper is structured as follows. In Section 2 an overview of the problem definition is presented; in particular, the impact of Bloom's taxonomy on the generation of MCQs is discussed in detail. Section 3 presents a detailed overview of the literature and critically analyses the state-of-the-art methodologies related to the challenges addressed in this paper. The proposed integrated framework is discussed in Section 4 along with details on the functionality of each module. In Section 5, a description of the system evaluation is presented, in which the subjective and objective metrics used to evaluate the system are discussed, followed by the conclusion and future work in Section 6.

2 Problem Definition

Assessment forms a key constituent of the instructional process. It provides a measure for evaluating learning and the effectiveness of instructional pedagogy. However, constructing assessment items for objective tests is a challenging task, which is time consuming and requires experience. The questions generated should evaluate a learner on the different levels of competency that are expected to be developed while learning a subject. In [1], the authors have extended Benjamin Bloom's taxonomy, which categorises student learning assessment into six categories, namely knowledge, comprehension, application, analysis, synthesis and evaluation. Knowledge and comprehension enable a tutor to evaluate the learner on the fundamentals of the subject, whereas application, analysis, synthesis and evaluation enable the tutor to assess a learner's ability to reason with critical thinking. Among the questions generated, 40% are formulated to test a learner's ability in higher-order thinking (comprising application through evaluation). The remaining 60% are allocated to assess the learner's fundamental knowledge of the subject.

The challenge of translating open-world information into knowledge that can be used in learner assessment is addressed through the exploitation of language structure, which exhibits different patterns of question grammar as listed below.

• is a - definition
• is the - fact finding
• often use, examples of, includes, use of - application
• have, vs - characteristics
• such as, like - concept
• to - action
• more than - comparative
• make a - integration

For example, the following text is obtained from Wikipedia on the topic of "Machine Learning":

"Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data. After learning, it can then be used to classify new email messages into spam and non-spam folders. The core of machine learning deals with representation and generalization. Representation of data instances and functions evaluated on these instances are part of all machine learning systems."

From the above text, a tutor can formulate the following questions in order to assess the learner on definition, action and application respectively:

• Machine learning, a branch of ....., is about the construction and study of systems that can learn from data. (artificial intelligence)
• The core of machine learning deals with representation and ..... . (generalization)

• ..... of data instances and functions evaluated on these instances are part of all machine learning systems. (Representation)

Addressing the challenge of creating a large number of quiz questions for evaluating mass learners on open-world information, the proposed framework enables a tutor to provide a topic, filter the questions generated according to Bloom's taxonomy and evaluate the performance of learners. A more detailed description of the framework is given in Section 4.
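The cue-phrase list above maps directly onto a lookup table. The following minimal Python sketch is our own illustration (the function name and the regular expressions are assumptions, not part of the proposed system); it labels a sentence with the first question-grammar category whose cue phrase it contains:

```python
import re

# Cue phrases from the list above mapped to the question-grammar category they signal.
CUE_PATTERNS = [
    (r"\bis a\b", "definition"),
    (r"\bis the\b", "fact finding"),
    (r"\boften use\b|\bexamples of\b|\bincludes\b|\buse of\b", "application"),
    (r"\bhave\b|\bvs\b", "characteristics"),
    (r"\bsuch as\b|\blike\b", "concept"),
    (r"\bto\b", "action"),
    (r"\bmore than\b", "comparative"),
    (r"\bmake a\b", "integration"),
]

def classify_sentence(sentence: str) -> str:
    """Return the first question-grammar category whose cue phrase occurs in the sentence."""
    lowered = sentence.lower()
    for pattern, category in CUE_PATTERNS:
        if re.search(pattern, lowered):
            return category
    return "uncategorised"

print(classify_sentence("Machine learning is a branch of artificial intelligence."))  # -> 'definition'
```

In the full framework this classification is performed over annotated grammar patterns rather than raw cue strings, as described in Section 4.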

3 Literature Review

In recent years, researchers developing Web-based question-answering systems (QASs) have shifted their focus from simple factoid questions to complex queries that require processing and drawing inferences from numerous sources [6]. Definitional QASs typically extract sentences that contain the most descriptive information about the search term from multiple documents and then summarize these sentences into definitions. They usually exploit additional external definition resources such as encyclopedias and dictionaries, as these can provide precise and relevant nuggets that are further projected onto the corpus. However, integrating each particular resource involves designing a specialized wrapper. Exploiting the Web as a source of descriptive information is therefore a key issue in searching for definitional answers on the Web. In [4], the authors propose a novel approach that employs query rewriting techniques to increase the probability of extracting the nuggets from Web snippets by matching surface patterns. In addition, they suggest that their method takes advantage of corpus-based semantic analysis and sense disambiguation strategies for extracting words that describe different aspects of target concepts on the Web. In [10], the authors present a semi-automatic approach for question generation targeting academic writing. Their system first extracts key phrases from the article, which are then matched with Wikipedia articles and classified into one of five abstract concept categories: Research Field, Technology, System, Term and Other. Using the content of the matched Wikipedia article, the system then constructs a conceptual graph structure for each key phrase, and the questions are generated based on this structure. Multiple-choice questions are one of the most popular ways of conducting tests [4]. To achieve this, the authors in [8] extract factual sentences from a corpus or from the Web, which are then segmented so that they contain factual knowledge. As every multiple-choice question requires a target answer, they train a classifier to decide which noun phrase in the sentence can be used as an answer. In addition, the authors also use E-HowNet and Wikipedia sources to generate distractors as candidate options for the MCQ. Finally, they apply simple rules to transform the selected sentences into questions.

4 Proposed Framework

In Figure 1, the architecture of the proposed system is presented. In this framework, the tutor interacts with the system by providing input topics on which he/she wishes to evaluate the student. The topics thus obtained are translated into queries which are searched in Wikipedia through the Wiki API (http://www.mediawiki.org/wiki/API:Main_page). The articles retrieved as relevant from Wikipedia are further processed to extract sentences and to enable part-of-speech tagging. Based on the language grammar, the text is annotated with the symbols listed in Table 1.

Figure 1. Architecture of the proposed System

Table 1. Question Grammar

Grammar Annotation                   Notation
NN, NNP, NNPS, NNS                   α
VB, VBD, VBG, VBN, VBP, VBZ          β
DT                                   γ
JJ, JJR, JJS                         δ
IN                                   λ
PRP, PRP$                            ε
RB, RBR, RBS                         ψ
MD                                   µ
CC                                   ν
TO                                   ω
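To illustrate how the mapping in Table 1 can be applied, the sketch below fetches the introductory extract of a Wikipedia article and rewrites its POS tags as the Greek notation. It is a minimal sketch only: it uses the public MediaWiki TextExtracts endpoint and the NLTK tagger for convenience, whereas the actual system relies on the Wiki API and Stanford CoreNLP, so the endpoint parameters, tagger choice and function names are our assumptions.

```python
import requests
import nltk  # assumes the 'punkt' and 'averaged_perceptron_tagger' data packages are installed

# Table 1: Penn Treebank tag groups mapped to the notation symbols.
TAG_TO_SYMBOL = {
    **dict.fromkeys(["NN", "NNP", "NNPS", "NNS"], "α"),
    **dict.fromkeys(["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"], "β"),
    "DT": "γ",
    **dict.fromkeys(["JJ", "JJR", "JJS"], "δ"),
    "IN": "λ",
    **dict.fromkeys(["PRP", "PRP$"], "ε"),
    **dict.fromkeys(["RB", "RBR", "RBS"], "ψ"),
    "MD": "µ",
    "CC": "ν",
    "TO": "ω",
}

def fetch_intro(title: str) -> str:
    """Fetch the plain-text introduction of a Wikipedia article (illustrative endpoint)."""
    params = {"action": "query", "prop": "extracts", "exintro": 1,
              "explaintext": 1, "titles": title, "format": "json"}
    pages = requests.get("https://en.wikipedia.org/w/api.php", params=params).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")

def annotate(sentence: str) -> list[tuple[str, str]]:
    """POS-tag a sentence and map each tag to its Table 1 symbol (raw tags kept for punctuation)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [(word, TAG_TO_SYMBOL.get(tag, tag)) for word, tag in tagged]

intro = fetch_intro("Machine learning")
first_sentence = nltk.sent_tokenize(intro)[0]
print(annotate(first_sentence))
```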

Annotating the previous examples with the notation above, the following patterns are obtained.

• Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.
Machine/NN learning/NN ,/, a/DT branch/NN of/IN artificial/JJ intelligence/NN ,/, is/VBZ about/IN the/DT construction/NN and/CC study/NN of/IN systems/NNS that/MD can/MD learn/VB from/IN data/NNS ./.
Machine/α learning/α ,/, a/γ branch/α of/λ artificial/δ intelligence/α ,/, is/β about/λ the/γ construction/α and/ν study/α of/λ systems/α that/µ can/µ learn/β from/λ data/α ./.
Machine learning, a branch of ....., is about the construction and study of systems that can learn from data.

• The core of machine learning deals with representation and generalization.
The/DT core/NN of/IN machine/NN learning/VBG deals/NNS with/IN representation/NN and/CC generalization/NN ./.
The/λ core/α of/λ machine/α learning/β deals/α with/λ representation/α and/ν generalization/α ./.
The core of machine learning deals with representation and ..... .

Once the sentences are tagged, they are compared with the standard sentence structure, since in normal English the order of the words (the syntax) must follow a specific pattern. This helps in determining which section of a sentence can be removed in order to frame a question. Thus, in the sentences listed above, the noun phrases that are succeeded by a verb can be removed to generate questions. Furthermore, the answers of all questions framed with the same structure can be provided as wrong answers, also called distractors, to generate multiple-choice questions. For example, given a sentence (i) of type αβγαλαα and another sentence (ii) with a matching structure, we can take the answer of (ii) and include it as an option for (i). Since the number of questions generated from a single article runs into hundreds or thousands, the probability that two different questions share the same set of options is negligible.
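A minimal sketch of the blanking and distractor step just described, under the simplifying assumption (taken from the prose above) that the answer is a noun (α) immediately followed by a verb (β) and that distractors are pooled from sentences sharing the same symbol signature; the function names and option count are illustrative, not the system's:

```python
from collections import defaultdict

def make_blank(tokens, symbols):
    """Blank the first noun (α) that is immediately followed by a verb (β)."""
    for i in range(len(symbols) - 1):
        if symbols[i] == "α" and symbols[i + 1] == "β":
            answer = tokens[i]
            question = " ".join(tokens[:i] + ["....."] + tokens[i + 1:])
            return question, answer
    return None

def build_mcqs(tagged_sentences, n_options=4):
    """Group questions by signature so answers of like-patterned sentences become distractors."""
    by_signature = defaultdict(list)
    for tokens, symbols in tagged_sentences:
        item = make_blank(tokens, symbols)
        if item:
            by_signature["".join(symbols)].append(item)
    mcqs = []
    for signature, items in by_signature.items():
        for question, answer in items:
            distractors = [a for _, a in items if a != answer][: n_options - 1]
            mcqs.append({"question": question, "answer": answer, "options": [answer] + distractors})
    return mcqs
```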

4.1 Targeted Hypernym Discovery

The hypernym discovery approach proposed here is based on the application of hand-crafted lexico-syntactic patterns (Hearst patterns). Although lexico-syntactic patterns have been extensively studied since the seminal work [7] was published in 1992, most research has focused on the extraction of all word-hypernym pairs from a given generic free-text corpus. In contrast, the goal of Targeted Hypernym Discovery (THD) is not to find all hypernyms in the corpus, but rather to find only the hypernyms for the current entity. Additional experiments presented here show that THD achieves significantly higher accuracy than previous approaches to hypernym discovery. THD also has the advantage that it requires no training and can use up-to-date online resources to find hypernyms in real time. The THD approach proposed here grew out of our earlier work on utilizing hypernym discovery in the context of image information retrieval and relevance feedback [3]. Here we present an updated and expanded version of the algorithm:

1. Fetch articles from the corpus relevant to the query.
2. For each article:
   (a) Determine if it is suitable for further processing.
   (b) Extract hypernyms matching the patterns.
   (c) Return the most likely hypernym found.
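To make step 2(b) concrete, the sketch below applies a single Hearst-style "is a" pattern to the first sentence of an article and returns the candidate hypernym. The regular expression and the lack of ranking are simplifications of our own; the actual THD implementation matches a richer set of lexico-syntactic patterns over POS-tagged text and scores the candidates.

```python
import re

# A single illustrative Hearst-style pattern: "<entity> is a|an <optional modifiers> <hypernym>"
IS_A = re.compile(r"\bis\s+(?:a|an)\s+(?:\w+\s+){0,2}?(\w+)", re.IGNORECASE)

def discover_hypernym(first_sentence: str) -> str | None:
    """Return the candidate hypernym for the entity described in the sentence, if any."""
    match = IS_A.search(first_sentence)
    return match.group(1) if match else None

print(discover_hypernym("Machine learning is a branch of artificial intelligence."))  # -> 'branch'
```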

Figure 2. Questions Generated for the topic of Machine Learning

Performing all these steps requires carrying out multiple information retrieval and text processing tasks. For this purpose, our THD implementation uses the Stanford CoreNLP framework [5]. Screenshots of the system generating questions are shown in Fig. 2.

4.2 Text Preprocessing

According to our experimental evaluation, the first section of each article provides a sufficient basis for THD, since it contains a brief introduction to the article's topic, often including the desired definition in the form of a Hearst pattern. Processing the remaining sections, in our experience, only increases computation time and introduces noisy hypernyms. The system uses existing and newly created Stanford CoreNLP (http://nlp.stanford.edu) modules to perform text preprocessing. We use the modules available within Stanford CoreNLP to perform text tokenization, sentence splitting and Brill-style part-of-speech (POS) tagging. Noun chunks are identified using a chunker based on Ramshaw and Marcus [11]. The system also performs a custom-implemented replacement of non-English characters by their ASCII fallback alternatives.
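The ASCII fallback replacement mentioned above can be approximated with standard Unicode normalization; the snippet below is our own illustration rather than the system's custom implementation:

```python
import unicodedata

def ascii_fallback(text: str) -> str:
    """Replace accented and other non-English characters by their closest ASCII equivalents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_fallback("Kurt Gödel"))  # -> 'Kurt Godel'
```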

4.3 JAPE Grammar

The different patterns of question grammar illustrated in Section 2 are also implemented with the help of the Java Annotation Patterns Engine (JAPE), which performs annotation based on regular expressions (http://gate.ac.uk/sale/tao/splitch8.html#chap:jape). JAPE is a component of the open-source General Architecture for Text Engineering (GATE) platform (http://gate.ac.uk). JAPE is a finite state transducer that operates over annotations based on regular expressions, which makes it useful for pattern matching, semantic extraction and many other operations. Its input is a JAPE grammar (essentially a set of rules) and a text to annotate. JAPE grammar rules consist of a left-hand side and a right-hand side: the left-hand side is a regular expression over existing annotations, while the right-hand side contains annotation manipulation statements. The JAPE grammar used in our research comprises several rules to find possible action, application, characteristic, comparative, concept, definition, fact-finding and integration question patterns, and takes Wikipedia articles as input. The output consists of annotated text that identifies the different patterns of question grammar and provides the full list of words that can be removed in order to generate meaningful questions. In addition, the answers of questions generated with the same question-grammar pattern can be used as distractors to generate multiple-choice questions.
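JAPE rules themselves are written in GATE's own grammar language. As a rough Python analogue (not actual JAPE syntax), each rule can be viewed as a left-hand-side regular expression over the symbol signature from Table 1 together with a right-hand-side action that labels the match; the two rules below are illustrative and do not reproduce the exact patterns of Table 2:

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    name: str        # question-grammar category assigned on the right-hand side
    lhs: re.Pattern  # regular expression over the symbol signature (left-hand side)

# Two illustrative rules; the real grammar contains one rule per pattern in Section 2.
RULES = [
    Rule("definition", re.compile(r"αβγα")),   # noun verb det noun, e.g. "X is a Y"
    Rule("application", re.compile(r"αβλα")),  # noun verb prep noun, e.g. "X used in Y"
]

def apply_rules(signature: str) -> list[str]:
    """Return the categories of all rules whose left-hand side matches the signature."""
    return [rule.name for rule in RULES if rule.lhs.search(signature)]

print(apply_rules("γααβλαα"))  # -> ['application'] for this toy signature
```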

5 Framework Evaluation

Figure 3. Teacher's Evaluation Criteria towards Learner Assessment

In order to facilitate a fair evaluation of the system, a survey among the faculty of VIT University (http://www.vit.ac.in) was conducted, and among the participants in the survey, eight tenured faculty members were asked to interact with the tool and provide a subjective analysis of it. The faculty members who participated in the evaluation have an average teaching experience of 7.25 years, and for the last four years they have adopted the quiz pattern as one of the assessment methods towards grade determination. The subjects handled by these faculty members include Information and Network Security, Distributed Systems, Microprocessor, Object Oriented Programming Concepts, Software Project Management and Computer Graphics. In the survey conducted prior to system development it was noted that the average time taken by a faculty member to prepare a quiz paper is nearly 13.5 hours (to preserve the integrity of the examination, students are given an average of five different question papers; the time mentioned is the average for preparing all of them), and thereafter evaluating a class with more than 120 students requires nearly 3.5 hours. The idea of using open-world information was also welcomed by the faculty. The average computation time of the system is about 6.25 minutes. As the system is intended to benefit tutors handling large numbers of students, the tenured faculty members who participated in the pre-development survey also carried out the subjective evaluation of the system. The first evaluation was concerned with inter-annotator agreement on whether the questions generated comply with the six categories of Bloom's taxonomy for learner assessment. The tutor requirements gathered from the pre-development questionnaire are shown in Fig. 3. Of the eight tutors who participated, 87.5% preferred assessing learners based on knowledge, while 25% of tutors used the synthesis category from Bloom's taxonomy. This result is significant, as most multiple-choice question papers are intended to test the learner's basic understanding of the subject. Similarly, application- and analysis-type questions were generated to test the learners' ability to think critically and apply the knowledge gained from the subject. Such a distribution of question patterns was widely welcomed by the other tutors, who expressed their support because the distribution provides a good balance for assessing the learner's ability towards critical thinking.

In order to evaluate whether the system performance complies with the tutors' expectations, the questions generated for various subjects were analyzed and categorized in accordance with Bloom's taxonomy; the statistics are presented in Fig. 4. Among all six categories of Bloom's taxonomy, 35% of the questions generated were based on knowledge, whereas 5% of the questions were based on synthesis. Thus we can conclude that the tutors' expectation has been met. The questions generated by the system were individually annotated by the tutors and the performance of the system is presented in Fig. 5. Among the total number of questions generated, 30.6% were retained towards finalizing the assessment pattern for the subject Artificial Intelligence. Similarly, for Natural Language Processing, Object Oriented Analysis and Design and Software Testing, the efficiency of the system is approximately 37%. However, for the subject Human Computer Interaction, the system efficiency is approximately 29%. This difference in system performance could be attributed to the lack of appropriate resources on Wikipedia.

Figure 5. System Performance on Questions Generated for Different Subjects

Translation of information into knowledge is achieved through the use of grammatical patterns that can be mapped to question patterns, as listed in Table 2. As shown in the figure, the number of questions generated with grammar patterns 1, 2 and 3 is significantly higher than the number generated with patterns 4 to 7.

Figure 4. Distribution of Questions Generated according to Bloom’s Taxonomy

Table 2. Pattern Description

Question Classification              Pattern
Characteristics                      αλααδα
Definition/Fact/Integration          λδαβγω
Application/Characteristics          δαβαβδ
Concept/Comparative                  αδλναβ
Action                               αλαωδδ
Concept                              αλδνδα
Application                          ααλββλ
Application                          αψαωδδ


The second part of the evaluation concerned the usability of the software. Usability was evaluated under the categories "functionality", "graphical design", "navigation", "intuitiveness" and "overall". The evaluation required each tutor to grade the system on a scale of one to five, one being the lowest and five the highest in terms of tutor satisfaction. As shown in Fig. 6, the overall system design and usability received an average score of 4.125, while the intuitiveness of the system was graded at 4. This is due to the fact that no such system had previously been used by the tutors, and each tutor was given an orientation to the tool in order to facilitate the evaluation. Similarly, the agreement among the different tutors varies around the average grade, which is reflected in the standard deviation shown in the figure.

Figure 6. Statistical Representation of the Tutor Feedback on Subjective Metrics

6 Conclusion and Future Work

In this paper a system for MCQ generation was presented. The system has been thoroughly evaluated by experienced tutors, who found it to be useful as classroom sizes gradually increase. The system has been used to generate questions on five different topics, and the quality of the questions generated supports the domain independence of the system. At present the system uses Wikipedia as its resource website; in future, it could be extended to other resource-rich websites.

References

[1] L. Anderson and D. A. Krathwohl. A Taxonomy for Learning, Teaching and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman, New York, 2001.
[2] Sebastian Blohm and Philipp Cimiano. Using the web to reduce data sparseness in pattern-based information extraction. In PKDD, volume 4702 of Lecture Notes in Computer Science, pages 18–29. Springer, 2007.
[3] Krishna Chandramouli, Tomáš Kliegr, Jan Nemrava, Vojtěch Svátek, and Ebroul Izquierdo. Query refinement and user relevance feedback for contextualized image retrieval. In VIE 08: Proceedings of the 5th International Conference on Visual Information Engineering, 2008.
[4] Min-Huang Chu, Wen-Yu Chen, and Shou-De Lin. A learning-based framework to utilize E-HowNet ontology and Wikipedia sources to generate multiple-choice factual questions. In IEEE Conference on Technologies and Applications of Artificial Intelligence, pages 125–130, 2012.
[5] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In ACL 2002, 2002.
[6] A. Figueroa, G. Neumann, and J. Atkinson. Searching for definitional answers on the web using surface patterns. IEEE Computer, pages 68–76, 2009.
[7] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Fourteenth International Conference on Computational Linguistics, pages 539–545, 1992.
[8] M. Heilman and N. A. Smith. Extracting simplified statements for factual question generation. In Proc. of the 3rd Workshop on Question Generation, 2010.

[9] Min-kyoung Kim and Han-joon Kim. Design of a question answering system with automated question generation. In 4th International Conference on Networked Computing and Advanced Information Management, pages 365–368, 2008.
[10] Ming Liu, Rafael A. Calvo, Anindito Aditomo, and Luiz Augusto Pizzato. Using Wikipedia and conceptual graph structures to generate questions for academic writing support. IEEE Transactions on Learning Technologies, 5(3):795–825, July–September 2012.
[11] L. A. Ramshaw and M. P. Marcus. Text chunking using transformation-based learning. In ACL Third Workshop on Very Large Corpora, pages 82–94, 1995.
[12] R. Snow, D. Jurafsky, and A. Ng. Learning syntactic patterns for automatic hypernym discovery. In NIPS, 2005.
[13] R. Snow, D. Jurafsky, and A. Ng. Semantic taxonomy induction from heterogenous evidence. In ACL, 2006.
[14] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. YAGO: A core of semantic knowledge. In 16th International World Wide Web Conference (WWW 2007), New York, NY, USA, 2007. ACM Press.
[15] M. Vargas-Vera and M. D. Lytras. AQUA: Hybrid architecture for question answering services. IET Software, 4(6):418–433, 2010.
[16] William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter model of record linkage to the 1990 U.S. Decennial Census. Technical report, U.S. Bureau of the Census, Washington, D.C., 1991. Statistical Research Report Series RR91/09.
