Automated Extraction of Semantic Concepts from Semi-Structured Data: Supporting Computer-Based Education through the Analysis of Lecture Notes

Thushari Atapattu, Katrina Falkner, Nickolas Falkner
School of Computer Science, University of Adelaide, Adelaide, Australia
{thushari, katrina, jnick}@cs.adelaide.edu.au

Abstract. Computer-based educational approaches provide valuable supplementary support to traditional classrooms. Among these approaches, intelligent learning systems provide automated questions, answers, feedback, and the recommendation of further resources. The most difficult task in intelligent system formation is the modelling of domain knowledge, which is traditionally undertaken manually or semi-automatically by knowledge engineers and domain experts. However, this error-prone process is time-consuming and the benefits are confined to an individual discipline. In this paper, we propose an automated solution using lecture notes as our knowledge source to utilise across disciplines. We combine ontology learning and natural language processing techniques to extract concepts and relationships to produce the knowledge representation. We evaluate this approach by comparing the machine-generated vocabularies to terms rated by domain experts, and show a measurable improvement over existing techniques.

Keywords: ontology, POS tagging, lecture notes, concept extraction

1 Introduction

Computer-based intelligent education systems have been an area of research for the past decade. Early systems, such as SCHOLAR [1], provided intelligent assistance without human intervention by presenting a single set of digital materials to all students. Subsequently, research efforts have focused on creating student-centered learning environments that support learners' diversity and individual needs. Intelligent tutoring systems, question answering systems and student-centered authoring systems are examples of 'one-to-one teaching', which provides individual attention to each student [2-5]. These systems are capable of generating customised questions for students, responding to unanticipated questions, identifying incorrect answers, providing immediate feedback and guiding students towards further knowledge acquisition. The foremost effort in intelligent system development is allocated to knowledge base modelling. Traditionally, knowledge engineers and domain experts formulate the

domain knowledge manually [13], and the success of any intelligent system is heavily dependent on the quality of the underlying knowledge representation. Manual efforts are constrained in their usefulness due to their error-prone nature and time delays in incorporating new domain knowledge. Accordingly, research has focused on overcoming the knowledge acquisition bottleneck, including the use of authoring shells [4-6]. This semi-automatic approach allows teachers to define pedagogical annotations of teaching materials, including pedagogical purpose, difficulty level, evaluation criteria and performance level [6]. These authoring environments provide natural language interfaces for teachers, and knowledge engineering tools are required to transform their input into a machine-comprehensible form. A significant problem with both manual and semi-automated processes is that any extracted knowledge is not reusable, due to its domain-specific nature.
The goal of our research is to further automate the knowledge acquisition process from domain-independent data, using digital lecture notes as our knowledge source. In this paper, we discuss terminology extraction from PowerPoint slides written in natural language. The purpose is to migrate from a full-text representation of the document to a higher-level representation, which can then be used to support activity generation, automated feedback and related tasks. According to Issa and Arciszewski [7], an ontology is "a knowledge representation in which the terminologies have been structured to capture the concepts being represented precisely enough to be processed and interpreted by people and machines without any ambiguity". Therefore, our system extracts domain-specific vocabularies from lecture notes and stores them in an ontology which complies with an underlying knowledge model comprising concepts, instances, relations and attributes.
Generally, lecturers (domain experts) dedicate significant effort to producing semantically rich, semi-structured lecture slides based on extensive knowledge and experience. Our research is based upon reusing this considerable effort and expertise, enabling computers to automatically generate activities for students. We combine natural language processing techniques, such as part-of-speech (POS) tagging [8] and lemmatisation [9], with pre- and post-processing techniques to extract domain-specific terms. Information retrieval-based weighting models are utilised to arrange the precedence of concepts, placing a higher weight upon those that are more important [10]. In order to evaluate our approach, a comparison has been carried out between machine-generated vocabularies and concepts identified by independent human evaluators.

The domain knowledge extracted from the lecture notes can be used to model the knowledge base of intelligent education systems (e.g. intelligent tutoring systems, question answering systems). Moreover, the ontologies generated from this research can be used as a knowledge source for semantic web search [12] and to support knowledge transfer for tutors and teaching assistants [13]. According to Rezgui [15], ontologies provide a perspective to migrate from a document-oriented to a content-oriented view, in which knowledge items are interlinked. This structure provides the primitives needed to formulate queries and the necessary resource descriptions [15]. Therefore, we use ontology reasoning techniques to generate questions for students and to answer student questions in the forums of a Learning Management System. Students can also ask discipline-oriented questions of the system, with algorithms defined to traverse the particular ontology to find suitable answers. If we can make use of lecture slides as a source, this will be a valuable technique, as lecture slides are a commonly used and widely available teaching tool.
This paper includes a background study in Section 2. A detailed description of the methodology, including concept and hierarchy extraction, can be found in Sections 3, 4 and 5. In Section 6, we evaluate the concept extraction algorithm using human judgment as a reference model. We discuss the overall work and provide a conclusion in Section 7.

2 Background

At present, people share an enormous amount of digital academic material (e.g. lecture notes, books and scientific journals) on the web. Educators and students utilise these knowledge resources for teaching and learning; however, this digital data is not machine-processable unless manipulated by humans. For instance, a computer cannot construct activities from a book available on the web without the involvement of a teacher; there is no simple transition pathway from semi-structured content (a book) to teaching activities. Artificial intelligence researchers have attempted to automate knowledge acquisition, with the purpose of improving computer-based education, semantic web search and other intelligent applications.
Hsieh et al. [11] suggest the creation of a base ontology from engineering handbooks. They utilise semi-structured data, such as the table of contents, definitions and the index, from earthquake engineering handbooks for ontology generation. Although the glossary development is automatic, the authors cite the importance of the participation of domain experts and ontology engineers in defining upper-level concepts and complex relationships. The evaluation and refinement in this research is a manual process, which they believe to be essential to improve the quality of the base ontology.
Similar to our approach, Gantayat and Iyer [12] describe the automated generation of dependency graphs from lecture notes in courseware repositories. While our concern is to extract the content of particular lecture notes, their research focuses on extracting only the lecture topics to build dependency graphs. During online courseware search, their system recommends 'prerequisite' and 'follow-up' modules. They use the 'tf-idf' measure for terminology extraction, and their relationship building corresponds only to determining whether another concept is a prerequisite or a follow-up (hierarchical order).
Similarly, Rezgui [15] uses 'tf-idf' and metric cluster techniques to extract concepts and relations from documents in the construction engineering domain. Although this automated approach reduces the knowledge acquisition bottleneck, the resulting knowledge is not reusable over a diverse range of domains.
The authors of [13] propose the construction of a semantic network model from PowerPoint lecture notes, from which teaching assistants learn the sequence of teaching content. The model creates a graph structure from each slide and combines all the graphs to form a semantic network, reporting an agreement rate of 28%: the degree of agreement between human experts and the generated semantic network, measured by the correlation of concepts. Ono et al. [13] argue that some teachers disagree with this model, as the semantic network does not contain the original ideas of lecturers; instead, it contains a machine-generated overview of the lecture. Furthermore, it does not capture the knowledge in the graphs and figures of the lecture, resulting in a further negative attitude towards the new model.
The concept map work in [16] describes the automatic construction of concept maps for the e-learning domain. The e-learning domain is an emergent and expanding field of research, and manual construction of the domain over months or years can produce obsolete knowledge. Therefore, automating domain construction from e-learning journal and conference articles is significant not only for novice users but also for experts [16]. The key word list extracted from scientific articles is used to build the concept map and to indicate relations that guide learners to other areas related to the current topic.
Kerner et al. [14] suggest baseline extraction methods and machine learning for key phrase acquisition from scientific articles. Their results found Maximal Section Headline Importance (MSHI) to be the best baseline extraction method, outperforming term frequency, term length, first N terms, resemblance to title, accumulative section headline importance and others. In the machine learning approach, optimal results were achieved by the C4.5 algorithm over the multilayer perceptron and naïve Bayes [14].
The majority of related works utilise the popular 'tf-idf' measure for concept extraction, filtering the most frequently occurring concepts in the domain over linguistically important nouns. Despite contemporary efforts to combine natural language processing with these earlier techniques, the field is still open to considerable development.

3 Our model

The proposed methodology, illustrated in Figure 1, consists of concept extraction techniques to construct the domain-specific vocabulary and a concept hierarchy extraction algorithm to arrange the extracted vocabulary. We have implemented a PowerPoint reader to process the text and multimedia contents of Microsoft PowerPoint documents using the Apache POI API [17]. This reader is capable of acquiring rich text features such as title, bullet offset, font color, font size and underlined text. We make use of these features to identify emphasised key terms; later sections of this paper discuss the importance of emphasised key terms in selecting concepts in the domain. In this paper, we address text-based extraction; multimedia processing will be addressed in a later publication. We assume all the PowerPoint slides presented to the system are well structured (e.g. include titles and text/multimedia content) and contain no grammatical errors in natural text. Our system automatically corrects spelling errors using the built-in Microsoft spell checker.

[Fig. 1. An overview of our methodology. The flowchart shows the PowerPoint Reader processing PowerPoint slides into PPT text and PPT layout. The text passes through semantic extraction: preprocessing (normalised text), NLP tagging (tagged text), postprocessing (filtered data) and the weighting model, producing semantic concepts. The layout feeds the hierarchy extractor via a link-distance algorithm, producing the concept hierarchy. Together these yield the concepts and hierarchical relations.]

4 Concept extraction

This section describes the four stages of concept extraction: pre-processing, natural language processing (NLP) tagging, post-processing and the weighting model.

4.1 Pre-processing

In order to improve the acquisition, we normalise the PowerPoint text contents before preparing them for linguistic annotation. This normalisation includes:
1. splitting statements at the occurrence of periods, commas or semicolons,
2. replacing non-alphanumeric symbols,
3. processing punctuation marks (e.g. hyphen, &),
4. expanding abbreviations, and
5. removing white spaces.
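A minimal sketch of these normalisation steps follows. The abbreviation map and the exact symbol handling are illustrative assumptions; the paper does not specify its abbreviation list or replacement rules.

```python
import re

# Hypothetical abbreviation map; the paper does not list its entries.
ABBREVIATIONS = {"e.g.": "for example", "etc.": "and so on"}

def normalise(text: str) -> list[str]:
    """Normalise one item of PowerPoint text into clean statements."""
    # Step 4: expand abbreviations first, so their periods do not split text.
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Step 1: split statements at periods, commas or semicolons.
    statements = []
    for part in re.split(r"[.,;]", text):
        # Steps 2-3: replace non-alphanumeric symbols, keeping hyphens,
        # and treat '&' as the conjunction it abbreviates.
        part = part.replace("&", " and ")
        part = re.sub(r"[^A-Za-z0-9\- ]", " ", part)
        # Step 5: collapse redundant white space.
        part = " ".join(part.split())
        if part:
            statements.append(part)
    return statements
```

A bullet such as `"Pre-processing, e.g. tokenisation & tagging"` would yield the two statements `"Pre-processing"` and `"for example tokenisation and tagging"`.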
4.2 Natural language processing tagging

NLP tagging is categorised into two parts: lemmatisation and part-of-speech (POS) tagging. In this stage, we first annotate the normalised statements using lemmatisation [9], where words are mapped to their base form (e.g. activities => activity, normalisation => normalise). We have selected lemma annotation over the more popular stemming or morphological analysis techniques because the latter two entirely remove term suffixes, which often alters the meaning of the term [9] (e.g. computation, computing, computer => compute).
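The conflation problem can be seen in a toy comparison. The suffix rules and lemma table below are illustrative assumptions, not the actual tools used in this work; they exist only to show why blind suffix stripping merges distinct terms while lemma lookup does not.

```python
# Toy suffix stripper in the spirit of stemming: blindly removes suffixes.
SUFFIXES = ["ation", "ing", "er", "ies", "s"]

def toy_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Toy lemma table: maps inflected forms to a base form, leaving
# distinct lexical items (computation vs. computer) distinct.
LEMMAS = {"activities": "activity", "normalisation": "normalise"}

def toy_lemma(word: str) -> str:
    return LEMMAS.get(word, word)
```

Here `toy_stem` collapses "computation", "computing" and "computer" to the single stem "comput", losing the distinction between three different concepts, whereas `toy_lemma` maps "activities" to "activity" while keeping the computation/computer terms apart.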

The POS tagger found in the Stanford CoreNLP project [8] identifies nouns, verbs, adjectives, adverbs and other part-of-speech definitions in phrases or sentences [18]. The following parse tree illustrates the POS tagging of a sample sentence, producing a structured interpretation of the sentence.

Sentence: The authors are extracting semantic concepts from lecture slides.

Parse tree:
(ROOT
  (S
    (NP (DT The) (NNS authors))
    (VP (VBP are)
      (VP (VBG extracting)
        (NP (JJ semantic) (NNS concepts))
        (PP (IN from)
          (NP (NN lecture) (NNS slides)))))
    (. .)))

The parse tree shows each term together with its corresponding POS tag. The same sentence is shown in Table 1 with each term, its corresponding lemma and its POS definition.

Table 1. Annotated example sentence

Word:   The  authors  are  extracting  semantic  concepts  from  lecture  slides
Lemma:  The  author   be   extract     semantic  concept   from  lecture  slide
POS:    DT   NNS      VBP  VBG         JJ        NNS       IN    NN       NNS

The definitions of POS tags can be found in the Brown Corpus [18], which lists the POS tags and their definitions (e.g. VB => verb, IN => preposition, CC => coordinating conjunction such as and, or). Our algorithm extracts adjectives (JJ), comparative adjectives (JJR), singular or mass nouns (NN), possessive singular nouns (NN$), plural nouns (NNS), proper nouns (NP), possessive proper nouns (NP$) and plural proper nouns (NPS) for annotation. Some verbs (VB) occasionally act as concepts in particular domains; we plan to integrate the extraction of such verbs in future work.
Our retrieval results indicate that domain-specific concepts are combinations of numerous POS tag patterns (e.g. adjective-noun pairs, compound nouns). A careful analysis confirms the need for a priority-based processing mechanism for these retrieved terms or phrases. We define a list of regular expressions to arrange the n-grams (i.e. contiguous sequences of n items from the given text) on the basis of matching patterns. The list is processed in order: when the first matching pattern is applied, the algorithm eliminates that phrase from the sentence and applies the regular expressions recursively until the sentence has no more singular nouns (singular nouns being the least prioritised pattern). For example, assume a particular sentence includes an adjective-noun pair, four consecutive nouns and a singular noun. According to the defined regular expression list, we first extract the four consecutive nouns (four-gram) and eliminate that pattern from the sentence. The algorithm then returns the adjective-noun pair (bi-gram) and the singular noun (uni-gram) respectively.
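This priority-based matching can be sketched over tagged tokens as follows. The tokens are rendered as `word/TAG` strings, and the pattern list below is a small illustrative subset; the paper's full regular-expression list and its exact ordering are not given.

```python
import re

# Patterns in priority order: longer noun compounds first, adjective-noun
# pairs next, single nouns last (the least prioritised pattern).
PATTERNS = [
    r"(?:\S+/NNS?\s+){3}\S+/NNS?",   # four consecutive nouns
    r"\S+/JJ\s+\S+/NNS?\b",          # adjective followed by a noun
    r"\S+/NNS?\b",                   # single noun
]

def extract_phrases(tagged: str) -> list[str]:
    """Repeatedly apply the highest-priority matching pattern,
    removing each matched phrase before rescanning."""
    phrases = []
    matched = True
    while matched:
        matched = False
        for pattern in PATTERNS:
            m = re.search(pattern, tagged)
            if m:
                # Strip the /TAG suffixes to recover the plain phrase.
                words = [t.split("/")[0] for t in m.group().split()]
                phrases.append(" ".join(words))
                tagged = tagged[:m.start()] + tagged[m.end():]
                matched = True
                break
    return phrases
```

For the tagged sentence fragment `"semantic/JJ concepts/NNS from/IN lecture/NN slides/NNS"`, the adjective-noun pattern fires first, yielding "semantic concepts", and the remaining nouns "lecture" and "slides" are then extracted as uni-grams.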

4.3 Post-processing

We notice that the NLP tagger returns terms such as example, everything and somebody as nouns. However, in the computing education context, these terms are not of high importance. In order to improve our results, we have implemented our own stop-word filter of 468 stop-words (e.g. a, also, the, been, either, unlike, would), enabling the elimination of common words. Further, stop-words are not permitted in bi-grams and can only appear as conjunction words inside tri-grams (e.g. point-to-point communication) or other n-grams (n > 2). Thus, stop-words are not allowed at the beginning or end of a phrase.
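These filtering rules can be sketched as a predicate over candidate phrases. The stop-word set below is a tiny illustrative subset of the 468-word list, which the paper does not reproduce in full.

```python
# Tiny illustrative subset of the 468-word stop-word list.
STOP_WORDS = {"a", "also", "the", "been", "either", "unlike", "would", "to"}

def keep_phrase(phrase: str) -> bool:
    """Apply the stop-word rules to a candidate n-gram."""
    words = phrase.lower().split()
    if not words:
        return False
    # Stop-words may never begin or end a phrase.
    if words[0] in STOP_WORDS or words[-1] in STOP_WORDS:
        return False
    # Bi-grams may contain no stop-words at all; in longer n-grams
    # an interior stop-word is allowed as a conjunction word.
    if len(words) == 2 and any(w in STOP_WORDS for w in words):
        return False
    return True
```

Under these rules "point to point communication" survives (the interior "to" acts as a conjunction in a 4-gram), while "the lecture" is rejected because it begins with a stop-word.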

4.4 Weighting model

Although our algorithm returns nouns and compound nouns, it may include phrases which do not belong to domain-specific categories (e.g. 'authors' in Table 1). In order to refine the results, a weighting model is employed which treats the highest-weighted terms as the most important concepts. We have developed a new model by building on term frequency-based approaches and incorporating n-gram count and typography analysis. We then present four weighting models based on these weighting factors, as discussed below.

Term frequency
The system counts the occurrences of each concept in the lecture notes (tf). We assign the 'log frequency weighting' [10] to each term to normalise the occurrences within a controlled range. For instance, if the term frequency is 1, the weight will be 0.3, and for 100, the weight will be 2.0. This prevents a bias towards high-frequency terms in determining the threshold value and important concepts, making term frequency an important and influential factor in choosing concepts rather than the only factor.

Termweight = log10(1 + tf)    (1)

N-gram count and location
In our analysis, we identify that the majority of noun phrases confirmed as concepts are brief PowerPoint statements rather than fragments of long sentences. Therefore, we count the number of tokens (n-grams) in each item of bullet text and assign a weight to the bullet text based on its n-gram count (n = 1 to 5); a count of more than five implies that the bullet text is a moderate to long sentence. According to the evaluation results, shorter statements are more likely to be chosen as concepts than long statements, and our experiments verified that text locations (e.g. title, topic, bullet or sub-bullet) are not effective in the judgment of concept selection.

Typography analysis
In PowerPoint slides, lecturers often emphasise terms or phrases to illustrate their importance in the given domain, frequently employing larger fonts, different colors or different font faces. Underlined, bold or italic terms are also considered emphasised. In our algorithm, we introduce a probability model (illustrated in Table 2) to assign unique weights to these terms.

Table 2. Weight assignment of emphasised text based on probability.

Probability of emphasised text (%) 50 < P