Creating MAGIC: System for Generating Learning Object Metadata for Instructional Content

Ying Li, Chitra Dorai, Robert Farrell
IBM T.J. Watson Research Center, NY
{yingli, dorai, robfarr}@us.ibm.com

ABSTRACT

This paper presents our latest work on building MAGIC (Metadata Automated Generation for Instructional Content), a system that automatically identifies segments in instructional content and generates critical metadata conforming to the SCORM (Sharable Content Object Reference Model) standard. Several content analytics engines are used to generate key metadata automatically: audiovisual analysis modules that recognize semantic sound categories and identify narrator and informative-text segments; text analysis modules that extract a title, keywords and a summary from text documents; and a text categorizer that classifies a document against a pre-generated taxonomy. With MAGIC, instructional content developers can generate and edit SCORM metadata to richly describe their content assets for use in distributed learning applications. Experimental results obtained from collections of real data from targeted user communities are presented.

Categories and Subject Descriptors: H.3.1 [Content Analysis and Indexing]: Indexing methods

General Terms: Algorithms, Design, Experimentation

Keywords: Instructional video content analysis, learning object metadata, text analysis, text categorization, e-learning

1. INTRODUCTION

Today’s knowledge workers are becoming constant learners. Although information is often available to learners in electronic form, it is frequently not accessible: it is buried in large documents or videos, isolated in different organizations, or archived in ways that make retrieval difficult. To address this problem, many organizations are creating distributed learning content repositories that allow learning resources to be searched and browsed according to a common set of metadata.


In 2000, the Department of Defense released the first version of SCORM (Sharable Content Object Reference Model), incorporating emerging standards and specifications to describe learning resources across both the federal and private sectors [1]. In particular, SCORM defines a standard set of content attributes, also referred to as learning object metadata (LOM), which is itself an IEEE standard. Since then, many government and industrial organizations have started applying this reference model to their content development and management. However, preparing learning resources in SCORM format is labor-intensive.

Our work targets the automatic generation of SCORM-conformant metadata by applying a suite of content analysis tools. Specifically, we developed a system called MAGIC (Metadata Automated Generation for Instructional Content) to assist content authors and course developers with SCORM adoption and to enable wider reuse of high-value information assets. The current system processes instructional videos, training documents and various other resources, segments them, and generates the most critical SCORM metadata fully automatically. This automation of metadata creation enables easier content reuse and repurposing, improves interoperability, and expedites content registration for use by a distributed community of users. Moreover, content authors and course developers can access MAGIC to make their text and video sources available for training and education applications as well as to other course developers.

For system evaluation, we used training and education documents and videos that are freely downloadable from various DHS (Department of Homeland Security) agencies' web sites. DHS has an increasing need for distributed web-based learning programs to educate its employees, partner organizations and suppliers. Timely availability of learning resources is particularly important to DHS for mission areas such as response preparedness for biological, radiological and nuclear threats. Charged with coordinating and sharing information and assets across more than twenty diverse agencies, DHS is looking to SCORM for a common architecture and data model for learning content development and delivery.

To the best of our knowledge, little work in the literature resembles MAGIC in automatically extracting SCORM-conformant metadata for instructional content based on analysis of multiple media modalities. The closest is possibly that of Myers et al. [2], which describes a video editor that takes advantage of pre-generated video information cues such as shot boundaries and time-synchronized transcripts. That system, however, is mainly intended to help novice users edit digital videos, not to help course developers build new course material or content authors register their content, as targeted by our MAGIC system.

2. SYSTEM OVERVIEW

Figure 1 shows the essence of the MAGIC system, which consists of a Metadata Generation Environment (MGE) and a Metadata Editor.


The MAGIC user (an author or a course developer) interacts with the system via the Metadata Editor. The user accesses a training document or an instructional video by entering a URL on the Web, the name of a local file, or an address in a Content Management System. The Metadata Editor loads the content into the Content Cache, where it is optionally annotated by the Segmenter. The user can then view the resulting annotated resource in the Metadata Editor to inspect the suggested segmentation points. The resource is then processed by the Metadata Generator, which creates a SCORM metadata record and stores it in the Metadata Cache. Subsequently, the user can view and correct the SCORM record. Finally, he or she can request the Packager to create a SCORM-compliant package (with an IMS Manifest [1]) that can be exported to SCORM-compliant authoring systems, learning management systems, or learning content repositories.
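
To make the workflow concrete, here is a minimal sketch, in Python, of how such a pipeline could be orchestrated; the class and method names are hypothetical stand-ins for illustration, not MAGIC's actual API.

```python
# Illustrative sketch of the MGE workflow described above. All class and
# method names here are hypothetical stand-ins, not MAGIC's actual API.
from dataclasses import dataclass, field

@dataclass
class Resource:
    uri: str                                   # URL, local file name, or CMS address
    content: bytes = b""
    segments: list = field(default_factory=list)

class MetadataGenerationEnvironment:
    def __init__(self):
        self.content_cache = {}                # uri -> Resource
        self.metadata_cache = {}               # uri -> SCORM metadata record

    def load(self, uri, data):
        resource = Resource(uri=uri, content=data)
        self.content_cache[uri] = resource
        return resource

    def segment(self, resource):
        # Placeholder segmenter: a single segment spanning the whole resource.
        resource.segments = [(0, len(resource.content))]

    def generate_metadata(self, resource):
        # Placeholder for the audio, video and text analysis tools.
        record = {"general.title": resource.uri, "segments": resource.segments}
        self.metadata_cache[resource.uri] = record
        return record

    def package(self, resource):
        # Stand-in for building a SCORM-compliant package with an IMS manifest.
        return {"imsmanifest": {"resource": resource.uri,
                                "metadata": self.metadata_cache[resource.uri]}}

mge = MetadataGenerationEnvironment()
doc = mge.load("http://example.org/course.html", b"<html>...</html>")
mge.segment(doc)
mge.generate_metadata(doc)
print(mge.package(doc))
```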

Figure 1: Architecture of the MAGIC system.

The Metadata Generator consists of the following set of text and video processing tools, which we have developed and integrated through a common set of application program interfaces.

1. Audio and Video Analysis Tools. These tools extract instructional videos' narrative structures using joint audio and video analysis. Specifically, they segment videos into small units, identify their content semantics by detecting human presence, recognizing sound types and extracting video text, and finally label them with annotations describing the narrative elements in the segments.

2. Text Analysis Tools. These tools segment large text documents into smaller meaningful units, generate a title, extract keywords, locate named entities such as people, places and organizations, and generate a summary description. They leverage high-speed natural language processing techniques to parse and extract relevant information from text documents.

3. Text Categorization Tool. This tool includes a high-coverage, domain-independent taxonomy (with categories such as "Patrol Boats" and "Weapons of Mass Destruction") and a text classifier component that automatically and precisely assigns text documents to specific categories in this taxonomy.

Due to space constraints, this paper focuses on the analysis tools and omits the details of the other parts of the MAGIC system.

3. CONTENT ANALYSIS TOOLS

3.1 Audio and Video Analysis Tools

The audio and video analysis tools are designed to track the narrative flow of professionally produced education and training videos. A detailed process flow is depicted in Figure 2. As shown, given an instructional video in a digital format such as MPEG-1, we first apply an audio classification module to partition its audio signal into a set of homogeneous audio segments and tag each segment with a unique sound label. Specifically, the audio classifier first applies a support vector machine (SVM) based learning technique to classify every 1-second audio clip into seven sound types, including speech, silence, music, environmental sound, and their combinations. Temporally adjoining clips in the same sound class are then combined, and isolated audio clips are smoothed. More details on this module are presented in [3].

Figure 2: Audio and video analysis components and their interactions.
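
The merging and smoothing step can be illustrated with a short sketch. The per-clip labels are assumed to come from the SVM classifier, and the minimum run length used for smoothing is an illustrative value, not the one used in MAGIC.

```python
# Sketch of the post-processing described above: per-second sound labels
# (assumed to come from the SVM classifier) are smoothed, and adjoining
# clips of the same class are merged into homogeneous audio segments.
# The minimum run length used for smoothing is an illustrative value.
def smooth_labels(labels, min_run=2):
    """Relabel isolated runs shorter than min_run with the preceding label."""
    smoothed = list(labels)
    i = 0
    while i < len(smoothed):
        j = i
        while j < len(smoothed) and smoothed[j] == smoothed[i]:
            j += 1
        if (j - i) < min_run and i > 0:
            for k in range(i, j):
                smoothed[k] = smoothed[i - 1]
        i = j
    return smoothed

def merge_clips(labels):
    """Merge temporally adjoining 1-second clips into (start, end, label) segments."""
    segments, start = [], 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t, labels[start]))     # interval [start, t) in seconds
            start = t
    return segments

clips = ["speech", "speech", "silence", "speech", "speech", "music", "music"]
print(merge_clips(smooth_labels(clips)))
# [(0, 5, 'speech'), (5, 7, 'music')]
```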


Meanwhile, a homogeneous video segmentation process partitions the video into a series of segments, where each segment contains content captured under the same physical setting [4]. To understand the semantics of each segment, we then extract two types of information from its visual content: faces and text. Specifically, using a face detector, we compute two facial features, the face-contained frame ratio and the duration of the longest face sub-sequence, to determine whether the segment has a narrator presenting. Applying these features helps reduce false alarms in the narrator detection process.

Another important and widely used visual cue in instructional videos is text overlay, which we refer to as informative text since its primary purpose is to inform and emphasize. Informative text is detected by first identifying individual text blobs in video frames based on color and edge information, then tracing their contours and forming sentences to increase detection accuracy [5]. Video segments that contain a sufficient number of text blobs and sentences are subsequently flagged as containing informative text. Using this approach, we can reliably identify segments that contain presentation slides, computer screen displays or text overlays without depending critically on the text detection accuracy.

Other segments that contain neither narrators nor informative text are labelled as linkage scenes, which could be an outdoor landscape, a field demonstration or an indoor classroom overview. While considerable work has been done at our lab to identify other types of visual information, we have found these three types to be the most important and useful for describing instructional videos. Furthermore, detecting additional semantic categories requires a considerable amount of labelled training data, which was not available in our case. Figure 3 shows the three major visual content types considered in this work.

Figure 3: Three visual content types: (a) narrator, (b) informative text, and (c) linkage.
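
The following sketch illustrates how a segment's visual label could be decided from the face and text features described above; all threshold values are hypothetical and only indicate the shape of the decision logic, not MAGIC's tuned parameters.

```python
# Sketch of the per-segment visual labelling logic described above. The two
# facial features and the text-blob statistics are assumed to come from the
# face detector and the video text extractor; all thresholds are hypothetical.
def visual_label(num_frames, face_frames, longest_face_run, text_blobs, text_sentences):
    face_ratio = face_frames / max(num_frames, 1)           # face-contained frame ratio
    if face_ratio > 0.5 and longest_face_run > 30:           # long, persistent face presence
        return "narrator"
    if text_blobs >= 10 and text_sentences >= 2:             # enough blobs forming sentences
        return "informative text"
    return "linkage"                                         # landscape, demo, classroom view, ...

print(visual_label(num_frames=300, face_frames=240, longest_face_run=180,
                   text_blobs=3, text_sentences=0))          # -> narrator
```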



Audiovisual Integration

The audio and visual analysis results are then integrated, which essentially assigns a semantic audiovisual label to each video segment. Specifically, given a segment, we first identify its major audio type as the one that lasts the longest. For instance, if a 10-second segment contains speech in the first 5 seconds and the last 2 seconds, with 2 seconds of silence and 1 second of noise in the middle, we tag the segment with the speech label. The audio and visual labels are then combined in a straightforward way to reveal the segment's semantics. For instance, if the segment contains a narrator while its major audio type is music, it is tagged as narrator with music playing. A total of fifteen constructs are thus generated from the combination of three visual labels (narrator, informative text and linkage scene) and five sound labels (speech, silence, music, environmental sound and speech with music).
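
A small sketch of this integration step, assuming per-segment audio durations and a visual label are already available; the exact wording of the construct string is illustrative.

```python
# Sketch of the integration step: the dominant audio label of a segment (the one
# with the longest total duration) is combined with its visual label to form one
# of the fifteen constructs. The wording of the construct string is illustrative.
from collections import defaultdict

def dominant_audio(audio_runs):
    """audio_runs: list of (duration_in_seconds, sound_label) within one segment."""
    totals = defaultdict(float)
    for duration, label in audio_runs:
        totals[label] += duration
    return max(totals, key=totals.get)

def audiovisual_label(visual, audio_runs):
    # 3 visual labels x 5 sound labels = 15 possible audiovisual constructs.
    return f"{visual} with {dominant_audio(audio_runs)}"

# The 10-second example from the text: 5 s speech, 2 s silence, 1 s noise, 2 s speech.
runs = [(5, "speech"), (2, "silence"), (1, "environmental sound"), (2, "speech")]
print(audiovisual_label("narrator", runs))                   # -> "narrator with speech"
```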

3.2 Text Analysis Tools

The text analysis tools are designed to scan large documents and identify or extract useful information. Specifically, after reading in an HTML or PDF document, the tools first tokenize it using Frost, a component of IBM's LanguageWare product [6]. Each word is then tagged with its part of speech (e.g., noun, verb or adjective). Next, a set of analysis modules is applied to extract the following information: 1) the document title; 2) technical keywords, ranked from most specific to most generic; 3) entity keywords, including person, place and organization names, all ranked by frequency; 4) topic shift boundaries that cue topic changes; and 5) a summary description comprising a few important sentences. Finally, the SCORM Generator component generates a LOM metadata file by populating the proper metadata elements in XML format using the information extracted above. For more technical details on these tools, please refer to [7].
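
Frost and LanguageWare are proprietary components, so the sketch below approximates the keyword-ranking and summary steps with simple frequency heuristics purely for illustration; it is not the actual MAGIC text analysis pipeline.

```python
# Simplified stand-in for the keyword and summary extraction described above;
# frequency heuristics replace the proprietary NLP components for illustration.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "for", "on", "with", "by"}

def tokenize(text):
    return re.findall(r"[a-z][a-z\-]+", text.lower())

def keywords(text, k=5):
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(k)]     # ranked by frequency

def summary(text, n=2):
    # Keep the n sentences containing the most top-ranked keywords.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    top = set(keywords(text, 10))
    ranked = sorted(sentences, key=lambda s: -len(top & set(tokenize(s))))
    return " ".join(ranked[:n])

doc = ("Anthrax is an infectious disease caused by bacteria. "
       "First responders should recognize the symptoms of anthrax exposure. "
       "This guide describes response procedures for suspected anthrax incidents.")
print(keywords(doc))
print(summary(doc, n=1))
```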

3.3 Text Categorization Tool

The text categorization tool is designed to take a document and assign it to one or more categories, such as anthrax. This is done by comparing the document to models that have been previously constructed for a large number of subjects. For instance, a document is identified as being about anthrax if it best fits the anthrax model (also called a centroid). Centroids are pre-computed model files in the form of numerical vectors describing the word stems of a typical document in the category. For more details on this component, please refer to [8]. Categories are represented in a taxonomy with a single root and many branches. Once the category of a document is determined, it is assigned one or more paths to the root. For instance, for an anthrax document, one possible output could be "Business & Commerce → Health & Medicine → Diseases & Conditions → Infectious Diseases → Bacterial Diseases → Anthrax".
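
A minimal sketch of centroid-based categorization, assuming pre-computed centroid vectors and stemmed document terms; the centroid weights and the Patrol Boats path below are invented, and only the Anthrax path follows the example given above.

```python
# Minimal sketch of centroid-based categorization: each category is represented
# by a pre-computed vector of word-stem weights, and a document is assigned to
# the category whose centroid it fits best (highest cosine similarity). The
# centroid weights and the Patrol Boats path are invented for illustration.
import math
from collections import Counter

CENTROIDS = {
    "Anthrax": {"anthrax": 0.9, "spore": 0.5, "bacteri": 0.4, "exposur": 0.3},
    "Patrol Boats": {"boat": 0.8, "patrol": 0.7, "hull": 0.4, "coast": 0.3},
}
TAXONOMY_PATH = {
    "Anthrax": "Business & Commerce → Health & Medicine → Diseases & Conditions "
               "→ Infectious Diseases → Bacterial Diseases → Anthrax",
    "Patrol Boats": "Business & Commerce → Transportation → Watercraft → Patrol Boats",
}

def cosine(doc_vec, centroid):
    dot = sum(doc_vec[t] * centroid.get(t, 0.0) for t in doc_vec)
    norm_d = math.sqrt(sum(v * v for v in doc_vec.values()))
    norm_c = math.sqrt(sum(v * v for v in centroid.values()))
    return dot / (norm_d * norm_c) if norm_d and norm_c else 0.0

def categorize(stemmed_terms):
    doc_vec = Counter(stemmed_terms)
    best = max(CENTROIDS, key=lambda c: cosine(doc_vec, CENTROIDS[c]))
    return best, TAXONOMY_PATH[best]

print(categorize(["anthrax", "anthrax", "spore", "exposur", "guidanc"]))
```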

3.4 Integration of Analysis Tools

Figure 4 depicts the integration of the audiovisual and text analysis tools and highlights the SCORM metadata elements that are automatically populated (general.title, general.description, general.keyword, annotation.description for text and video segments, classification.taxonpath and classification.description).

Figure 4: Integration of all analysis tools and targeted SCORM metadata elements.
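
To illustrate the kind of output the SCORM Generator produces, the sketch below populates a few LOM-style elements in XML; the element layout is abbreviated for illustration and is not the full SCORM 2004 metadata binding.

```python
# Abbreviated illustration of how the SCORM Generator might populate LOM-style
# elements such as general.title, general.keyword and classification.taxonpath.
# The XML layout is simplified and is not the full SCORM 2004 metadata binding.
import xml.etree.ElementTree as ET

def build_lom(title, description, keywords, taxon_path):
    lom = ET.Element("lom")
    general = ET.SubElement(lom, "general")
    ET.SubElement(general, "title").text = title
    ET.SubElement(general, "description").text = description
    for kw in keywords:
        ET.SubElement(general, "keyword").text = kw
    classification = ET.SubElement(lom, "classification")
    ET.SubElement(classification, "taxonpath").text = taxon_path
    return ET.tostring(lom, encoding="unicode")

print(build_lom(
    title="Anthrax Response Guide",
    description="Procedures for responding to suspected anthrax incidents.",
    keywords=["anthrax", "first responders"],
    taxon_path="Health & Medicine / Infectious Diseases / Anthrax"))
```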

4. SYSTEM PERFORMANCE EVALUATION

We have conducted an informal evaluation of system performance using sample data from multiple organizations, including DHS agencies. While the major goal is to evaluate the analysis tools, we also want to measure the level of human editing that would be involved in correcting automatically generated metadata in the Metadata Editor. The test data comprises two sets. The first set contains IBM content, including IBM reference books, video captures of live seminars, and recorded class lectures. The second set contains DHS content, including 172 text documents (149 HTML documents and 23 PDF and Microsoft Word files, which we converted into plain ASCII text) and nine videos from the web sites of FEMA (Federal Emergency Management Agency), CDC (Centers for Disease Control and Prevention) and other related government agencies. We report only the results on the DHS content set in this paper.

4.1 Audio and Video Analysis Components

We evaluated the performance of the audiovisual analysis components using four videos: two from a government technical support group, one from CDC and one from FEMA. Each video is approximately an hour long. These videos contain various types of sounds, complex video content with frequent gradual content transitions, and various types of visual scenes such as classroom instruction, onsite and indoor interviews, panel discussions, presentations, and outdoor activities. They cover most of the audio and video content types that we observed in our larger DHS video collection. The evaluation was designed to determine how accurately our algorithm detects segment boundaries and how well it identifies segment content types in terms of both audio and visual labels.

Table 1 shows the detailed audiovisual analysis results. Specifically, we achieved an average classification accuracy of 91.3% on the four test videos for the audio classification module. If more data is used for SVM training, we expect the performance to improve further. For video segmentation, we achieved on average around 96% precision and 97% recall. False alarms are observed when the content contains fast object movements or abrupt camera operations such as panning, tilting and zooming. Regarding visual content identification, we applied CMU's face detector to locate the narrators and obtained an average of 98% precision and 92% recall. A point to note is that the recall rate varies across the videos, and performance drops when human faces are relatively small. Finally, we applied the text extraction tool [5] to identify informative text, which achieved average precision and recall of 98% and 97%, respectively. Overall, the system performance is fairly stable across the various test content; minor degradation is observed when the video content presents a cluttered background.


Table 1: Audiovisual analysis results; all numbers are percentages. "Pre." and "rec." stand for precision and recall, respectively.

Video   Audio Cls.   Video Seg.       Narrator         Info. Text
Data    Accuracy     pre.    rec.     pre.    rec.     pre.    rec.
V1      91           96.4    100      100     100      100     92.3
V2      93           97.8    100      100     100      100     100
V3      92           98      96.7     94      100      100     100
V4      89           94      95       100     71       93      96.4
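
For reference, the averages quoted in Section 4.1 can be reproduced, to within rounding, directly from the per-video numbers in Table 1:

```python
# The averages quoted in Section 4.1, recomputed from the per-video numbers
# in Table 1; small differences from the quoted figures come from rounding.
audio_accuracy = [91, 93, 92, 89]
seg_precision  = [96.4, 97.8, 98, 94]
seg_recall     = [100, 100, 96.7, 95]
narr_precision = [100, 100, 94, 100]
narr_recall    = [100, 100, 100, 71]
text_precision = [100, 100, 100, 93]
text_recall    = [92.3, 100, 100, 96.4]

mean = lambda xs: sum(xs) / len(xs)
print("audio classification accuracy:", mean(audio_accuracy))                     # ~91.3
print("video segmentation (pre/rec):", mean(seg_precision), mean(seg_recall))     # ~96 / ~97
print("narrator detection (pre/rec):", mean(narr_precision), mean(narr_recall))   # ~98 / ~92
print("informative text (pre/rec):", mean(text_precision), mean(text_recall))     # ~98 / ~97
```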

4.2 Text Analysis Components

Due to time and resource constraints, we evaluated only nine text documents randomly selected from the test collection. Also, because the analysis results tend to be subjective, we relied on human judgement. Specifically, we first assigned the documents to three colleagues such that every document was rated by two independent evaluators. Each evaluator then created a short title, found named entities and 3-10 technical keywords, and identified 3-5 important sentences as a summary for every assigned document. Finally, the evaluators examined the system-generated results and rated them on a five-point scale (1 = very poor, 2 = poor, 3 = fair, 4 = good, 5 = very good) against their own "ground truth".

The subjective evaluation results are briefly summarized below. Overall, we achieved above-average scores of 3.53, 3.72 and 3.69 for the title, keyword and summary results, respectively, which is very encouraging, especially for the summary, since summarization is a challenging and subjective task. We also noticed that titles extracted from HTML documents were generally rated higher than those from plain text documents. The technical keyword performance could be further improved if all non-ASCII characters were filtered out beforehand. In contrast, relatively low scores were obtained for the entity keyword generation component, especially for people names (2.58) and place names (2.83). This is mainly due to the mismatch between our test data and the training corpus, which consists mainly of news articles, whereas the test documents span other text genres. We did not evaluate the topic shift segmenter, as a preliminary analysis showed that its results were not satisfactory: it performed well on well-written prose, but applying it to the wide range of genres in the DHS text samples proved challenging.

4.3 Text Categorization Component

We repeated a similar manual evaluation for the text categorization component on the same nine documents. The categorizer matched the human classification exactly for two documents and matched part of the path to the root for three documents. In some cases, the evaluators thought that the categorizer did a better job than our human colleagues. To our knowledge, there is currently no single comprehensive cross-DHS taxonomy available. Therefore, as part of our work on the MAGIC project, we are now building such a taxonomy by extending our existing high-performance classifier, which already contains thousands of well-defined and distinct categories. Once the DHS-specific categories are added to the generic categorizer's taxonomy, we expect its accuracy to be above 80%.

5. DISCUSSION AND FUTURE WORK

Text is another important type of information embedded in a video document, yet so far the powerful text analysis tools we have developed have been applied only to text documents. Our current plan is therefore to apply them to the closed captions extracted from videos, as shown in Figure 2. We have developed an application to extract time-stamped closed captions from videos; this is a better option than ASR (automatic speech recognition), whose results are not accurate enough for our tasks. Moreover, most DHS videos contain closed captions in compliance with federal accessibility requirements such as Section 508.

We are currently exploring two levels of joint audiovisual and text analysis. The first level is a straightforward integration in which text keywords and a short summary are generated for each video segment. At the second level, keyword cues will be exploited to assist in identifying macro-segments, where each macro-segment covers a complete thematic topic. We believe that the hierarchical narrative structure obtained from this two-level integration can help users quickly locate desired content and reuse it for various purposes, such as authoring a new video or developing new course modules.
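
A minimal sketch of the planned first-level integration described above, assuming video segments and time-stamped captions are already available; the segment boundaries and caption lines below are invented examples.

```python
# Minimal sketch of the first-level integration: time-stamped closed-caption
# lines are grouped by the video segment they fall into, so the text analysis
# tools can produce keywords and a short summary per segment. The segment
# boundaries and caption lines below are invented examples.
def captions_per_segment(segments, captions):
    """segments: list of (start_s, end_s); captions: list of (time_s, text)."""
    grouped = {i: [] for i in range(len(segments))}
    for time_s, text in captions:
        for i, (start, end) in enumerate(segments):
            if start <= time_s < end:
                grouped[i].append(text)
                break
    return {i: " ".join(lines) for i, lines in grouped.items()}

segments = [(0, 60), (60, 180)]
captions = [(5, "Welcome to this course on anthrax response."),
            (70, "First, identify the suspected exposure area."),
            (130, "Then notify the incident commander.")]
print(captions_per_segment(segments, captions))
```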

To conclude, this paper has presented a system that automatically identifies segments and generates critical SCORM metadata for instructional content. Various tools performing text, audio and visual analysis have been developed to achieve this goal. Benefits of using MAGIC include easier content reuse and repurposing, improved interoperability, and more timely registration of content by course developers and content authors. While this paper mainly addresses the practical needs of DHS, the system provides a framework and a solution that can be used by a much wider community of users.

6. REFERENCES

[1] ADL, "SCORM 2004 documentation," http://www.adlnet.org/scorm/history/2004/index.cfm, 2004.
[2] B. Myers et al., "A multi-view intelligent editor for digital video libraries," Proc. of the First ACM/IEEE-CS Joint Conf. on Digital Libraries, 2001.
[3] Y. Li and C. Dorai, "SVM-based audio classification for instructional video analysis," ICASSP, 2004.
[4] Y. Li and C. Dorai, "Video frame identification for learning media content understanding," ICME, 2005.
[5] J. Shim, C. Dorai, and R. Bolle, "Automatic text extraction from video for content-based annotation and retrieval," ICPR, pp. 618-620, 1998.
[6] IBM, "LanguageWare," http://www-306.ibm.com/software/globalization/topics/languageware/design.jsp.
[7] B. Boguraev and M. Neff, "Lexical cohesion, discourse segmentation and document summarization," RIAO, 2000.
[8] C. Aggarwal, S. Gates, and P. Yu, "On using partial supervision for categorization," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 2, pp. 245-255, 2004.

