7 Improving Content Comprehension and Knowledge Retention with Wikipedia Proscovia Olango and Gosse Bouma Department of Information Science The University of Groningen Oude Kijk in ’t Jatstraat 26 Groningen, The Netherlands. {p.olango; g.bouma}@rug.nl
Abstract Wikipedia can function as an inter-knowledge domain glossary for identifying technical terms, jargons, acronyms, and abbreviations in documents, with anchor texts serving as the mentioned entities. There are more than ten million anchor texts that cover nearly all known knowledge domains. This wide coverage offers an opportunity for extracting terms from a document in the absence of a manually constructed glossary. This paper discusses how a state-of-the-art method of term extraction (sub-string match) can combine in-coming links overlaps to predict, disambiguate, define and link technical terms using Wikipedia anchor texts for easy knowledge acquisition. Where applicable, Wikipedia images shall be displayed along with the technical term definitions in an effort to improve knowledge retention. It is hoped that term definitions shall improve document content comprehension and facilitate the processes of knowledge acquisition. Keywords: Document enrichment, term extraction, term definition, term disambiguation, text illustration, hyperlink generation, document comprehension, knowledge retention.
Introduction Cost effective document content comprehension can be defined as the ability to read and understand the concepts discussed in a document within minimal time possible. The field of document comprehension is widely studied with reference to reading 76
Part 3: Improving Content Comprehension and Knowledge Retention with Wikipedia
77
levels where persons who do not read at college level or higher are said to have a low reading level and therefore cannot clearly understand the concepts of the documents they read [Young et al. 1990]. These studies indicate that education plays an important role in vocabulary level building, which in turn eases understanding of semantically related documents. As an implementation of research findings based on reading levels, vocabularies (or technical terms) that occur in documents are substituted with simpler words or word phrases that have similar semantics [Graves and Graves 2003]. If the substitution is done without considering word context, this venture may not only distort document content meaning but also limit the possibility for students to build their vocabulary level. By technical terms we refer to anything that is a word, a group of words, an acronym, or an abbreviation which designates a special meaning in text context. Although research in reading shows substantial evidence that the meaning of technical terms can be derived from context [Kate 2007], it is also true that in most cases, technical terms are used in text without definition or explanation. This tends to hinder content comprehension for readers who are at low reading levels in certain knowledge domains. Moreover, human languages more especially English constantly change by borrowing, coining, and combining words to represent new ideas, (technology,) and development [Engineer, 2005]. Therefore vocabulary level building becomes a lifetime obligation. In 2009 the verb twitter was borrowed as trade mark of a social network that provides microblogging services, enabling its users to send and read messages called tweets.
The example above presents three technical terms that were borrowed or coined in the recent years to express ideas related to Short Message Services (SMS) and Internet technology. For a person who is not familiar with these technologies the semantics of the terms twitter, and tweet may be conspicuously revealed by considering them in context. On the contrary the meaning of the term microblogging may not be that obvious from this sentence context. Technical terms occur in almost all reading material especially those intended for an audience at higher institutions of learning like universities and they may hinder content comprehension if they are introduced without definitions and/or explanations. We assume that easy access to context related definitions and explanations of technical terms found in a document will simplify document comprehension for readers of all levels without depriving them the opportunity to build their vocabulary levels. Therefore, TermPedia was designed with an objective to provide easy access to contextually relevant definition of technical terms that are embedded in documents. In case a technical term definition is obscure, TermPedia contains an option for the reader to obtain additional explanation on the defined technical term by linking to a Wikipedia article that discusses it explicitly. In cognitive science, comprehension is generally characterised as the construction of a mental model that represents the objects and semantic relations described in text
78
Strengthening the Role of ICT in Development
[Thuring et al. 1995]. This suggests that if text strings can be represented by visual aids, content comprehension could be largely improved. Pictures, charts, maps, and other illustrations form powerful object representations of text strings and make important points in document content vivid. When visual aids are coupled with relevant definitions of technical terms, information is received by the reader through two senses, the fruition of which is a clearer and lasting impression on the mind [BEN., 2001], enabling the reader to easily retain knowledge acquired. Document content comprehension is an essential part of knowledge acquisition and retention without which the process of education is rendered futile. Education is a basic element for sustainable development, an educated population is able to provide sufficient labour for its government and in turn earn adequate income for their support. The subtle objective of this research is to promote sustainable development by facilitating the process of knowledge and retention during education. The ability to retain knowledge is important in a developing country like Uganda which has limited access to education materials like books. Recently the expansion of ICT facilities at Makerere University increased the library users’ access to electronic information and improved student-book ratio to 1:21 [Namisango, 2010]. Although acceptable, this ratio is still below the ideal of 1:40 as recommended by The National Council for Higher Education in Uganda [NAT., 2006]. Worse scenarios exist in less fortunate universities like Islamic University in Uganda where the student-book ratio is below an unacceptable ratio of 1:13 [Idrisa, 2009]. TermPedia Technologies TermPedia is a document enrichment tool that uses Human Language Technologies (HLTs) to provide contextually relevant information for technical terms that are embedded in documents. TermPedia was designed to define technical terms by extracting their meaning from Wikipedia, an on-line encyclopedia. Although Wikipedia offers multilingual documents, we only utilize the English4 content. Defined technical terms are linked to contextually relevant Wikipedia articles so that in cases where a term definition does not provide sufficient information for content comprehension, a reader may navigate to the Wikipedia article for additional explanation. The HLTs used in TermPedia include semi-automatic technical term extraction, automatic term definition, automatic term disambiguation and automatic hyperlink generation. Much as document enrichment (that is to say, incorporation of contextually relevant information into existing text) is the fundamental technology of TermPedia, the tool uses document extraction techniques for semi-automatic term extraction. In addition, a simple string look-up algorithm is used for automatic term definition and a frequency based algorithm is used to generate automatic disambiguated hyperlinks. For details on TermPedia technologies please see [Olango et al. 2009]. In this paper, we present a term sense disambiguation algorithm based on overlap counts of incoming links to 4
See http://en.wikipedia.org for the English Wikipedia home page.
Part 3: Improving Content Comprehension and Knowledge Retention with Wikipedia
79
Wikipedia articles as discussed in section 3. Term sense disambiguation is necessary because most terms are ambiguous with reference to the context in which they occur. For example twitter could refer to a social network as seen in the sentence example above or to the chirpingsound of a bird.
Related Works This section discusses previous research in relation to using Wikipedia for document annotation and enrichment in an effort to improve content comprehension and knowledge retention in addition to reducing time for knowledge acquisition. We also look at work that has been done in relation to improving content comprehension and knowledge retention using illustrations or still images. From the angle of topic indexing, Medelyan et al. [2008] show that Wikipedia can be used as a controlled vocabulary for identifying the main topics in a document, where article titles serve as index terms and redirect titles as their synonyms. We borrow the idea of controlled vocabulary from this research and use Wikipedia anchor texts for identifying technical terms in documents and the first paragraphs of Wikipedia articles to which the anchor texts point as the definitions of these terms. Medelyan et al.’s research was extended for identifying significant terms within unstructured text and enriching them with links to appropriate Wikipedia articles using machine learning by Milne and Witten [2008]. They use Wikipedia not only as a source of information to point to, but also as training data for how best to create links. Milne and Witten’s resulting link detector and disambiguator performs with a constant recall and precision of close to 75% on both Wikipedia data and “real world data.” The term detector uses Wikipedia article titles for identifying terms and the disambiguation algorithm considers the links made to articles as a means of comparing possible senses of each term with its surrounding context. The good performance of this approach motivated us to use Wikipedia for training a term detector and disambiguator that uses anchor texts in place of article titles while maintaining the term disambiguation paradigm experimented by Milne and Witten. Education theories suggest that the use of realistic graphics in teaching material increases the probability of improving comprehension. Willis [2004] acknowledges that graphics have been used to overcome the deficits in patient education material because the comprehension of such material is often impeded by text written at reading levels too high for the patient population. The focus of Willis’ work is on choosing graphics that are not too advanced for patients to understand while providing illustrations for patient educational material. Our perspective is that graphics should be used alongside technical term definitions and explanations. We believe that if text at a high reading level is well explained, comprehension shall be improved which in turn will be enhanced by graphical representations of the difficult text. Physiologists believe that proficient readers spontaneously and purposefully create mental images while and after they read. They also say that the images emerge from all the
80
Strengthening the Role of ICT in Development
five senses and emotions and are anchored in a reader’s prior knowledge [Keene, 2002]. Therefore a reader without prior knowledge would benefit from image representation of text portions. Graphical representation of text is expected to facilitate the process of creating mental images and therefore improve content comprehension. Although the question as to whether images facilitate text comprehension is beyond the scope of this paper, various researches have shown the advantage of images in this regard. [Pan Y. and Pan Y. 2009] for example presents results that show that the presence of pictures in text improved the translation scores of low-proficiency Taiwanese English foreign language college students. Their findings further show that the pictures facilitated the comprehension of both simple and difficult text. This suggests that images can be used alongside text to improve comprehension of low-level readers in a particular field without necessarily simplifying difficult text. For this reason, we are confident that Wikipedia images will improve document content comprehension for readers at all levels when presented along with term definitions and explanations.
Research Framework and Methodology As mentioned in section 1 above, we identified three major problems that exist in document content with specific emphasis on higher education literature. The first problem is that such documents contain difficult terms or vocabulary that may hinder content comprehension if they are used without simple definitions or explanations. The second problem is that terms are ambiguous in relation to the context in which they are used. For example the term tweet may be deemed to mean the following5: • A type of bird vocalisation • Tweet (singer) (born 1971), American R&B and soul singer-songwriter • Cessna T-37 Tweet, a trainer aircraft • Tweet, a post on Twitter A third problem is that knowledge acquired from educational literature may be easily lost if it does not make a clear and simple impression in the readers’ minds. To solve these problems, we propose TermPedia, a document enrichment tool that provides technical term definitions, explanations, and illustrations with the help of HLTs and Wikipedia’s special features. The hypothesis guiding this research can therefore be stated as follows: 1. Wikipedia anchor texts can be used to predict difficult terms in documents. 2. Incoming links overlaps can be used to disambiguate terms senses. 3. Term definitions can be provided by the first paragraphs of a Wikipedia article. 4. Wikipedia images can provide relevant illustrations for difficult terms to improve knowledge retention. TermPedia uses anchor Wikipedia texts to represent terms because this information occurs within the text body and therefore it is more likely to provide correct terms 5
Tweet. Retrieved February 28, 2011 from http://en.wikipedia.org: http://en.wikipedia.org/ wiki/Tweet_(disambiguation). Wikimedia Foundation, Inc
Part 3: Improving Content Comprehension and Knowledge Retention with Wikipedia
81
in unstructured text. Anchor texts also escape the problem of spelling errors, word stemming, and terms made up of multiple noun phrases and/or multiple word strings. By anchor text we mean the alternative set of characters that are displayed in place of a web address after the creation of a hyperlink. For example blogging is an anchor text in the hyperlink below:
blogging Data Collection We used a 2010 English dump of Wikipedia which was downloaded on 14 February, 2011. The dump contained important summary information of Wikipedia articles among which were 8,605,586 article titles including their page ids. A database was thus created to hold information about the article titles and their ids. A summary of 10,672,018 anchor text strings also existed in the dump of which only 578,576 anchors indicated the possibility of ambiguity, meaning that they were linked to more than one Wikipedia article. Therefore technically speaking, TermPedia needed to disambiguate only 5.4% of the Wikipedia anchors. A database was created to contain the 10,093,442 unambiguous anchor texts including the number of incoming links to their Wikipedia targets. Yet another database was created to contain the ambiguous anchors including the id of Wikipedia articles to which they are linked. Summarized information for each article and their incoming links from other Wikipedia articles was also stored into a database. The dump also included unstructured text from Wikipedia to the capacity of 4.0 Kilobytes of zipped files. Although these databases require a lot of storage space, the pay-off was a first algorithm that did not overload the working memory with large files. Term Sense Disambiguation by Maximum Incoming Links Overlap Counts There are three major steps in the incoming links overlap count algorithm that include, categorizing terms into unambiguous and ambiguous sub-sets, finding Wikipedia targets for the terms and incoming links to these targets, and, counting the incoming link overlaps between the unambiguous targets and ambiguous targets. The ambiguous target with the most incoming links overlap count is then selected as a target for the ambiguous term. Given an unambiguous term x and an ambiguous term y, the score () for the best target of y was calculated using the formula below: β(x, y) = Max(Count)|X∩Y| Where X is the collection of incoming links to the target of x and Y are the collection of incoming links to targets of y. A 39 Kilobytes text file was generated from the Wikipedia article on Anarchism and used to train the disambiguation algorithm.
82
Strengthening the Role of ICT in Development
Evaluation Evaluation of the maximum incoming links overlap counts algorithm was carried out on 151 random Wikipedia articles that belonged to the medical category. Each file was transformed into a plain text file with the Wikipedia links removed and then the algorithm was used to populate the free text documents with Wikipedia links. We were interested in evaluating the performance of the algorithm on accurate term prediction and relevant target allocation.
Discussions of Findings In this section, we shall discuss results from the evaluation of the disambiguation algorithm and the use of Wikipedia images to represent terms. Term Sense Disambiguation By the time this paper was submitted, the evaluation results for the incoming links term sense disambiguation algorithm were not yet ready. We shall include these results in the camera ready copy of the paper. The evaluation results shall be presented using precision, recall and f-score to show how well the algorithm performs at predicting relevant terms for specific Wikipedia articles. Original links in Wikipedia articles shall be used as the gold standard of terms in each article of interest. Improving Knowledge Retention Using Wikipedia Images As already mentioned, the scope of this paper does not evaluate the relevance of Wikipedia images for illustrating terms, but we assume that the disambiguation algorithm provides relevant targets for the predicted terms. Figure 1: Screen Shot of TermPedia User Interface
We are happy to find the first image from relevant target article of a term to illustrate it as shown in Figure 1 which is a screen shot from TermPedia user interface6. The user interface was designed to especially extend TermPedia to university students. The interface allows students to read HTML version of annotated text with automatically generated hyperlinks using predicted terms as anchor texts. Each hyperlink has an added 6
Online are: http://www.let.rug.nl/~olango/TermPedia/termpedia.php
Part 3: Improving Content Comprehension and Knowledge Retention with Wikipedia
83
Java script functionality which allows definitions of the terms (i.e. the first paragraphs of their target pages) to show in a pop-up window together with an image if available, as soon the mouse is moved over the term. Definitions are retrieved in real-time from the current version of Wikipedia articles with an excellent speed. In addition, all exist
Conclusion and Future Work The challenge of using Wikipedia anchor texts to predict terms is that they are very noisy and it is difficult to filter out anchor texts which may not necessarily depict proper terms. Since we are focusing on improving vocabulary levels of readers at all levels, this is not a problem of our concern. Our concern is to allow the reader to select a level of reading by choosing how much of the text should be annotated. Fever annotations provide relatively good annotations and this take care of over populating the text with text. In turn the reader can still easily use the interface without much destruction. The images from Wikipedia have been successfully incorporated into text but the quality and relevance of these images is beyond the scope of this research. Terms that have been linked to relevant Wikipedia pages should automatically provide meaningful illustrations from the Wikipedia images if available. We hope that in cases were the images are clear and relevant, they shall improve knowledge retention by the readers since it will allow information to go through the readers mind in two ways, by explanation and illustration. These improved methods of reading are expected to reduce the time a reader needs for knowledge acquisition and retention, thus leading to improved ways of learning in both higher education institutions and elsewhere. In order to investigate the effect of TermPedia on improving document content comprehension, a user survey shall be carried out at Gulu University. TermPedia term prediction algorithms shall be improved in future by using category information from Wikipedia. This is expected to reduce term ambiguity and provide more accurate definitions of predicted terms including relevant images. For ambiguous anchors texts, their targets could be used to predict their synonyms or related concepts. An example can be given for the anchor text as well considering the subject domain, it they are very closely related.
References 2001. Benefit From Theocratic Ministry School Education. Watch Tower Bible and Tract Society of New York Inc. 2006. The National Council for Higher Education: Quality Assurance Framework for Universities. National Council for Higher Education. Kampala, Uganda ENGINEER, S. 2005. 21st Century English Vocabulary. Pearson Education. IDRISA, S. 2009. The Automation/Digitization of library services of Islamic University in Uganda (IUIU) to enhance access to knowledge for development effectiveness. In Proceedings of the 1st International Conference on African Digital Libraries and Archives, Addis Ababa, Ethiopia. GRAVES, M. F. and GRAVES, B. B. 2003. Scaffolding Reading Experiences: Designs for Students Success. Christopher-Gordon Publishers, Inc., 2nd edition.
84
Strengthening the Role of ICT in Development
KATE, C. 2007. Deriving word meanings from context: Journal of Research in Reading, Vol.30:347– 359. KEENE, E. O. 2002. Comprehension Strategies. Retrieved February 16, 2011 from readinglady. com: http://www.readinglady.com/mosaic/tools/COMPREHENSION STRATEGIES by Elli 2002.pdf MEDELYAN, O; WITTEN, I.H., and MILNE, D. 2008. Topic Indexing with Wikipedia. In Proceedings of the first AAAI Workshop on Wikipedia and Artificial Intelligence (WIKIAI’08), Chicago, I.L. MILNE, D. and WITTEN, I.H. 2008. Learning to link with Wikipedia. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM’2008), Napa Valley, California. MIHALCEA, R. and CSOMAI, A. 2007. Wikify!: linking documents to encyclopedic knowledge. In CIKM ’07: Proceedings of the sixteenth ACM conference on information and knowledge management, pages 233–242, New York, USA. ACM. NAMISANGO, R. 2010. The Third Vice-Chancellor’s Monthly Press Briefing, Held on Monday 1st March 2010. Retrieved February 18, from http://intranet.mak.ac.ug: http://intranet.mak.ac.ug/ uploads/Highlights%20from%20the%203rd%20Media%20briefing%20March%2001%20 Final.pdf OLANGO, P., KRAMER, G., and BOUMA, G. 2009. Termpedia for interactive document enrichment: using technical terms to provide relevant contextual information. In Proceedings of IMCSIT ’09: International Multiconference on Computer Science and Information Technology, pages 265–272. IEEE. PAN, Y. 2009. The Effects of Pictures on the Reading Comprehension of Low-Proficiency Taiwanese English Foreign Language College Students: An Action Research Study. In VNU Journal of Science, Foreign Language, pages 186 – 198. VNU Printing House THURING, M., HANNEMANN, J., and HAAKE, J. M. 1995. Hypermedia and cognition:. In Proceedings of Design for Comprehension. Community of the ACM. WILLS, S. T. 2004. Graphics Barriers: Enhanced Comprehension of Patient Education Material. Theory, Research, Education, and Training, 140-144. YOUNG, D. R., HOOKER, D. T., and FREEBERG, F. E. 1990. Information consent documents: Increasing comprehension by reducing reading level. The Hastings Center, Vol. 12. No.3.