JASIS, 1996.

Speech Recognition for a Digital Video Library

Michael J. Witbrock and Alexander G. Hauptmann

ABSTRACT

The standard method for making the full content of audio and video material searchable is to annotate it with human-generated meta-data that describes the content in a way that the search can understand, as is done in the creation of multimedia CD-ROMs. However, for the huge amounts of data that could usefully be included in digital video and audio libraries, the cost of producing this meta-data is prohibitive. In the Informedia Digital Video Library, the production of the meta-data supporting the library interface is automated using techniques derived from artificial intelligence (AI) research. By applying speech recognition together with natural language processing, information retrieval and image analysis, an interface has been produced that helps users locate the information they want and navigate or browse the digital video library more effectively. Specific interface components include automatic titles, filmstrips, video skims, word location marking and representative frames for shots. Both the user interface and the information retrieval engine within Informedia are designed for use with automatically derived meta-data, much of which depends on speech recognition for its production. Some experimental information retrieval results will be given supporting a basic premise of the Informedia project: that speech recognition generated transcripts can make multimedia material searchable. The Informedia project emphasizes the integration of speech recognition, image processing, natural language processing and information retrieval to compensate for deficiencies in these individual technologies.

Keywords: video browsing, information retrieval interfaces, speech recognition, News-On-Demand, multimedia indexing and search, Informedia, artificial intelligence, automatic text summarization, video summarization, digital library

INTRODUCTION TO INFORMEDIA
Vast digital libraries of video and audio information are becoming available on the World Wide Web as a result of emerging multimedia computing technologies. However, it is not enough simply to store and play back information as many commercial video-on-demand services intend to do. New technology is needed to organize and search these vast data collections, retrieve the most relevant selections, and permit them to be reused effectively.

Figure 1: Overview of the Informedia Digital Video Library System

Through the integration of technologies from the fields of natural language understanding, image processing, speech recognition and video compression, the Informedia digital video library system [Christel-94a,b][Wactlar96][Informedia95] allows a user to explore multimedia data in depth as well as in breadth. An overview of the system is shown in Figure 1. Hours of video programming are segmented into small coherent pieces and indexed according to their multimedia content. Users can actively explore the information by finding sections of content relevant to their search, rather than by following someone else's path through the material or by serially viewing a single large chunk of pre-produced video. This active exploration is far more flexible than that provided by video-on-demand, where only one way of viewing the content is permitted. It is also more flexible than the interfaces provided by the current generation of educational CD-ROMs, where users follow a designed path through the material in a more or less passive manner. The goal in Informedia is to have the computer serve as more than just a sophisticated video delivery platform. The Informedia Digital Video Library provides the user with a tool with which to assemble, from a
large corpus, an instructive set of video segments relevant to a particular information need. Using this tool, a large library of video material can be searched with very little effort. The Informedia project is developing these new technologies and embedding them in a video library system primarily for use in education and training. To establish the effectiveness of these technologies, the project is establishing an on-line digital video library consisting of over a thousand hours of video material. In order to be able to process and search this volume of data, practical, effective and efficient tools are essential. News-on-Demand [Hauptmann95a][Hauptmann96] is a particular collection in the Informedia Digital Library that has served as a proving ground for automatic library creation techniques. In News-on-Demand, complete automation is the principal goal. Motivated by the timeliness required of news data, and the volume of material to be indexed every day, the project has applied speech recognition, natural language processing and image understanding to the creation of a fully content-indexed library and to interactive querying. While this work is centered around processing news stories from TV broadcasts, the Informedia library creation process exemplified in News-on-Demand represents an approach that can make any video, audio or text data more accessible. This article will concentrate on the speech recognition, and information retrieval aspects of automated library creation; natural language and image processing will only be covered in passing. Content for News-on-Demand can be automatically captured off the air or via a DSS satellite receiver on a daily basis. Over the course of nearly two years, the system has captured four hundred and thirty three news broadcasts, of which three hundred and eighty two are television broadcasts stored as MPEG-1 files, and fifty one are radio broadcasts, stored as 16kHz sixteen-bit digital audio files. These news broadcasts have been segmented into more than fourteen thousand individual news stories. In most cases, the system captures closed caption data along with the video broadcasts, and this caption data can be used along with the soundtrack to create a higher quality text transcript, thus improving searchability. For the radio broadcasts, the raw audio signal is all that the system can use to create a searchable transcript. In addition to storing and segmenting and indexing the data and its associated transcript, the system has also generated one line “headline” summaries for more than twenty three thousand individual news stories and produced more than two hundred thousand video “skims” (described below) at varying levels of detail. All processing for the News-on-Demand corpus has been fully automated, no human intervention is required.

RELATED PROJECTS AND RESEARCH

Most other attempts at solving the news retrieval problem by providing news databases have restricted the data to text material only. Video-on-demand systems allow a user to select, and pay for, a complete movie, but do not allow for ad-hoc search and retrieval within programs. An approximation to News-on-Demand can be found in the "CNN-AT-WORK" system offered to businesses by a CNN/Intel cooperation. At the heart of the CNN-AT-WORK solution is a digitizer that encodes the video into the INDEO compression format and transmits it to workstations over a local area network. Users can store headlines together with video clips and retrieve them later. However, this retrieval depends entirely on the separately transmitted, manually created, text "headlines", and the service does not include news sources other than CNN. In addition, CNN-AT-WORK does not feature an integrated multi-modal query interface [CNN-AT-WORK95].

Preliminary investigation into the use of speech recognition for analysis of a news story was carried out by Schäuble and Wechsler [Schäuble95]. Since they lacked a powerful speech recognizer, their approach used a phonetic engine that transformed the spoken content of the news stories into possibly erroneous phoneme strings. The query was also transformed into a phoneme string and the database searched for the best approximate match. Despite errors in recognition and word prefix and suffix mismatches, the system performed reasonably well, since these errors scatter evenly over all documents, allowing the consistently high search scores of well-matching correct segments to dominate the retrieval.

Another news processing system that included video materials was the MEDUSA system [Brown95]. The MEDUSA news broadcast application could digitize and record news video and teletext transcriptions, which are equivalent to closed captions. Instead of segmenting the news into stories, the system used overlapping windows of adjacent text lines for indexing and retrieval. During retrieval, the system responded to typed requests by returning an ordered list of the most relevant news broadcasts. Within a news broadcast, it was up to the user to select and play a region, using information provided by the system about the position of the matched keywords. The focus of MEDUSA was on the system architecture and the information retrieval component. No image processing and no speech recognition were performed. The system did, however, serve as the substrate for a later series of speech recognition and information retrieval experiments [Jones96]. Using a speech recognition system to extract words from spoken messages, these experiments evaluated information
retrieval using a combination of word spotting based on rapid scanning of word lattices and whole word retrieval, in a video mail retrieval task. These latter experiments have not yet been extended to news or other broadcast data. Other projects that seek to index and retrieve from video news sources include the Conceptually Indexed Video project at Sun [Woods96], which is attempting to build conceptual taxonomies of query terms to improve the quality of returned stories, and the VISION system at the University of Kansas which, while similar in aim to Informedia, is concentrating on the problems of compressing video data and delivering it over the Internet. It is also distinguished by its stated concentration on the use of pre-existing, mature, domain independent indexing technologies [Li96]. The Broadcast News Navigator (BNN) system [Maybury96][Mani96] has concentrated on the automatic segmentation of stories from news broadcasts using discourse structure. While a great deal of success has been achieved so far using heuristics based on stereotypical features of particular shows (e.g. “still to come on the NewsHour tonight… ”), the longer term objective is to use multi-stream analysis of such features as speaker change detection, scene changes, appearance of music and so forth to achieve reliable and robust story segmentation. The system also aims to provide a deeper level of understanding of story content than is provided by simple full text search, by extracting and identifying, for example, all the named entities in a story. The Informedia project, in as much as it involves the indexing of non-textual data, also bears similarities to projects such as QBIC [Flickner95], which applies both automatic image characterization and hand-annotation to images, and supports retrieval using image similarity. One of the more interesting features of the QBIC system is that it allows query by demonstration, with the user sketching the features desired in the retrieved image. A similar effort, which also encompasses some video material, is the Photobook system [Pentland94]. Photobook employs relatively sophisticated statistical characterizations of selected image features, such as faces, shapes and textures, to support accurate retrieval by image similarity. A final example of an image retrieval system is Chabot [Ogle95], a part of the Berkeley digital library project. This system includes an element of cross-modal operation, allowing users to search simultaneously in pre-existing annotations and color content characterizations of a large set of landscape images. This allows searches for objects such as “yellow flowers”, that might not have been easily identified from the annotations or image qualities alone.

CREATING AN INFORMEDIA LIBRARY

The Informedia digital video library uses a combination of techniques from image processing, speech recognition, natural language processing and information retrieval. The integration of these techniques has permitted the construction of an effective interface to a digital video library, even though none of the techniques is completely reliable or error-free. Speech recognition is used for transcription and alignment, image processing is used for shot analysis and to identify representative frames, and natural language processing is used for summarization. Information retrieval allows the user to easily retrieve indexed material. Despite the imperfections in all the techniques used, and the problems inherent in working with raw broadcast data, a suite of navigation aids enabled by their use allows the user to quickly select and play back appropriate stories from the Informedia Digital Video Library.

The Informedia Digital Video Library system is composed of two parts: the Library Creation System and the Library Exploration Client. The Library Creation System for News-on-Demand can automatically capture one or more current news shows every night. Processing a news show for the library takes about 14 times real time on a DEC-AlphaStation 600 5/266 workstation with 256 Mbytes of memory. During library creation, the following major steps are performed:

1.

The news shows are digitized. Every evening, when the selected news shows are broadcast, they are automatically encoded into the standard MPEG-1 digital video compression format. If closed captions are being broadcast at the same time, they are also captured and stored as ASCII text. This capture is performed by off-the-shelf hardware attached to PCs. After capture, the MPEG-1 file and closed caption data are stored to a large software RAID system. Every hour of broadcast news generates about half a gigabyte of MPEG-1 data. The data for the current shows are also copied to a high-end DEC-Alpha based UNIX workstation on which the automated meta-data creation process takes place. Radio news broadcasts are automatically captured in a similar manner, using a computer controlled radio tuner and an audio digitizer.

2.

The audio and video streams of the MPEG-1 file are separated for further processing. The audio stream is converted into a 16kHz, sixteen-bit raw digitized audio stream, and segmented into sections of about thirty seconds duration. The video is converted into a video-only MPEG-1 data stream. The audio segmentation at this stage is done by finding the relatively long dips in the audio power of the signal, which usually correspond to silences between phrases; a sketch of this kind of silence-based segmentation appears after this list. Story boundaries identified later in processing can override this initial segmentation.

3.

A fundamental step in meta-data creation is the speech recognition. The Sphinx-II large-vocabulary connected speech recognition system is used to transcribe the entire audio track of the video program. For the video material, a twenty thousand word vocabulary is used, since this provides a good balance between speed and accuracy, and since more information about word identity can often be obtained from closed captions. For the radio shows, where the only available information is contained in the speech signal, a sixty thousand word vocabulary is used. The language model is based on trigram (word triple) probabilities estimated from The Wall Street Journal (WSJ) and other North American business news sources, collected between 1987 and 1994 [Rudnicky95]. When trigram counts are too small, the model backs off to bigrams and then to simple corpus frequencies. This language model does not exactly match the characteristics of contemporary broadcast news, and efforts are underway to improve it. The acoustic models used by the recognizer were trained using speech from speakers of North American English reading isolated sentences from the WSJ. For video material without closed-captioning, the current recognition system with a twenty thousand word vocabulary produces transcriptions with approximately a 50.1% word error rate [Hwang94][Hauptmann96].

4.

Images from the video are searched for shot boundaries and representative frames within a shot. Two techniques are used here. Firstly, color histograms are computed from the video frames, and shot changes are hypothesized where the differences between successive histograms peak. Secondly, camera or scene motion is inferred from the motion information stored in the MPEG data stream and extracted using Lucas-Kanade optical flow analysis [Hauptmann95b]. Since actual shot changes involve the complete replacement of one scene by another, they are characterized by random apparent motion vectors. Consistent motion vectors across a hypothesized scene break can therefore allow that hypothesis to be rejected.

5.

If closed-captioning is available, the captions are aligned to the words identified by the speech recognition step (3, above). Since the speech recognizer outputs an exact time, within 10ms, for each word that it hypothesizes, dynamic time warping alignment enables the system to identify exactly when the words contained in the closed captioning were actually spoken. This timing information is essential to the operation of the Informedia library navigation tools described later in this article. Since recognition errors frequently involve word suffixes, the distance metric between words used in the alignment process is based on the degree of initial sub-string match.

6.

The news show is segmented into individual news stories or paragraphs. When closed captions are available, syntactic markers provided by the captioning services are used to aid in this process. In the absence of this information, the acoustic segments identified in step 2 above are used. More general segmentation techniques that work with speech recognition output are currently under investigation.

7.

Short, one-line, “headline” titles are created for each story. Since an oft-repeated rule of journalistic writing dictates that the essence of a story should be contained in the first lines, “interesting” words are preferentially selected from the beginning of the story. The “interest” level of words is determined through a linear combination of the TFIDF (Term Frequency by Inverse Document Frequency) score for each word, and a chi-squared measure comparing the frequency of words in stories with their expected frequency based on whole corpus parameters. Words that score highly on this “interest” measure are included in the title. While titles generated in this manner are often useful, improvements in the algorithm used to produce them are the subject of active research.

8.

Video summaries in the form of "skims" are created for each story at several different levels of detail. Like title generation, skim generation, as proposed in [Wactlar96], also uses extractive text summarization, but in this case the extracted text representing the important concepts should be more evenly spread throughout the story. Skims are produced at a variety of lengths, covering 10.8 to 26 percent of the original story. Skim fragments are selected from the story by repeatedly choosing an as yet unselected word with the highest TFIDF weight, and then selecting additional adjacent words, in the direction that maximizes TFIDF score, until the duration of the selected fragment exceeds a threshold (a sketch of this greedy selection procedure appears after this list). When sufficiently many fragments have been selected to make up the total desired skim duration, the process ends. Further research on selecting syntactically and semantically coherent fragments, and on ensuring smooth transitions between these fragments, is planned.

9.

The newly processed news shows and all their automatically generated meta-data are combined with previously processed data and everything is indexed into a new collection catalog. The library catalog supports two views: the first is a browsing view in which the library is broken down hierarchically by collection, source of material or series name, show name and segment within the show. The second is an inverted index of all the text from closed captions, if available, or speech recognition performed on the show. The inverted index supports standard information retrieval techniques such as stemming, stop-word omission, TFIDF weighting, and document length normalization. It can also support, for experimental purposes, features such as document vector magnitude normalization, phonetic sub-sequence matching and chi-squared term weighting. This inverted index supports ad-hoc queries against the whole library.

10.

Access to the library catalog is given to users. When the automatic process concludes with the generation of the new index for a particular library collection, it is made available for searching using the standard Informedia client. Multiple collections, perhaps compiled and stored at different points on the network, can be simultaneously addressed by a single client. At this point, a user with the Informedia Digital Library Client Software can access the library in a number of different ways and use different abstractions to navigate through the data it contains.
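The silence-based segmentation described in step 2 can be illustrated with a short sketch. This is not the Informedia implementation; it is a minimal example, assuming 16 kHz sixteen-bit mono samples already loaded into a NumPy array, and hypothetical frame sizes and thresholds, of cutting an audio stream at low-energy dips into sections of roughly thirty seconds.

```python
import numpy as np

def segment_by_silence(samples, rate=16000, frame_ms=20,
                       max_segment_s=30.0, silence_percentile=10):
    """Split a mono PCM signal at low-energy dips (assumed silences).

    samples: 1-D NumPy array of 16-bit PCM values.
    Returns a list of (start_sample, end_sample) pairs, each at most
    roughly max_segment_s seconds long.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Short-time energy per frame; quiet frames fall below a data-driven threshold.
    energy = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = np.percentile(energy, silence_percentile)

    max_frames = int(max_segment_s * 1000 / frame_ms)
    boundaries = [0]
    start = 0
    while start + max_frames < n_frames:
        # Within the second half of the next window, cut at the quietest frame.
        window = energy[start + max_frames // 2: start + max_frames]
        cut = start + max_frames // 2 + int(np.argmin(window))
        # If no genuine dip is found, cut at the hard thirty-second limit.
        if energy[cut] > threshold:
            cut = start + max_frames
        boundaries.append(cut)
        start = cut
    boundaries.append(n_frames)

    return [(b * frame_len, e * frame_len)
            for b, e in zip(boundaries[:-1], boundaries[1:])]
```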
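The greedy fragment selection described in step 8 can be sketched as follows. This is an illustrative reconstruction from the description above rather than the project's code; it assumes each transcript word carries a TFIDF weight and a spoken duration in seconds, both of which the earlier processing steps make available. Fragments are returned in story order so that a skim assembled from them plays chronologically.

```python
def select_skim_fragments(words, tfidf, durations,
                          fragment_s=5.0, total_skim_s=30.0):
    """Greedy skim selection sketched from the step-8 description.

    words:     transcript words in story order
    tfidf:     per-word TFIDF weights (same length as words)
    durations: per-word durations in seconds (same length as words)
    Returns a list of (start_index, end_index) fragments, inclusive.
    """
    selected = [False] * len(words)
    fragments = []
    total = 0.0

    while total < total_skim_s:
        # Seed: the highest-TFIDF word not yet covered by any fragment.
        candidates = [i for i in range(len(words)) if not selected[i]]
        if not candidates:
            break
        seed = max(candidates, key=lambda i: tfidf[i])

        lo = hi = seed
        length = durations[seed]
        # Grow the fragment one word at a time, in whichever direction
        # adds the larger TFIDF score, until it is long enough.
        while length < fragment_s:
            left = tfidf[lo - 1] if lo > 0 and not selected[lo - 1] else None
            right = tfidf[hi + 1] if hi + 1 < len(words) and not selected[hi + 1] else None
            if left is None and right is None:
                break
            if right is None or (left is not None and left >= right):
                lo -= 1
                length += durations[lo]
            else:
                hi += 1
                length += durations[hi]

        for i in range(lo, hi + 1):
            selected[i] = True
        fragments.append((lo, hi))
        total += length

    return sorted(fragments)
```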

EXPLORING THE INFORMEDIA LIBRARY

A user can type queries to the system or speak the queries in natural English. Speech recognition for the IDVL client queries is done with the Sphinx-II Speech Recognition System using a 20,000 word vocabulary based on North American Business News and modified to account for the typical phrasing of queries. Since an ad-hoc retrieval engine is used, requests for information can be posed in unconstrained English. Means for issuing simple Boolean queries are also provided, although they are seldom used. Users may refine a query by adding more words to their initial query. A variety of abstractions are available to aid users in browsing the video stories or paragraphs returned from a search of the library. The library exploration process will be described in more detail in the following sections. The roles and effects of speech, natural language and image processing techniques are illustrated in conjunction with the library exploration process.

None of the technologies underlying Informedia library creation works perfectly. The version of the Sphinx-II speech recognition system used in the system, for example, only correctly transcribes about half of the words in a typical TV news broadcast. Because of this basic imperfection, the Informedia client system has been designed to provide as much information as possible to aid users in navigating through the presented information. The goal is to allow users to find data that satisfies their information needs. By combining information derived using different processing techniques, and from different modalities, it is often possible to minimize the effects of shortcomings of the automatically derived meta-data on retrieval effectiveness. Similarly, it is possible to compensate for problems with the data itself. The techniques from speech recognition, information retrieval, image processing and natural language processing enable the interface to support rapid and accurate search of imperfect news data. Before the more detailed discussion of speech recognition and information retrieval in Informedia that follows, a brief walk-through of a typical interaction with the system is illustrated in Figure 2.

Imagine a cautious user who is planning a trip to Europe, and who says to the system "Tell me about mad cow disease." The system searches and retrieves the best six of ninety-four matches that contain one of the words 'mad', 'cow' or 'disease' (Figure 2a). The user could have set options to retrieve more hits from among the news stories that match the query. Moving the mouse over the representative poster frames extracted from the stories causes a text summary headline, "Britain's secretary it's nation's entire herd might slaughter", to appear. Another story poster has the headline "Britain's ending dashed European officials, Belgium, voted". Although imperfect, these summaries allow the user to select the story of greater interest, in this case the first one, which is clearly about the likelihood of drastic measures being taken to allay fears of infection; in the second, the focus is more on the crisis' treatment in European politics. Clicking on the first poster frame starts the video of the story playing (Figure 2b). Underneath the video window is a bar with colored lines showing the exact time at which every query term was spoken. Clicking on the word 'mad' in the query would have highlighted the bars representing that word and the poster frames whose stories contained that word. The user could then have clicked the "next hit" button to skip past introductory material to the exact place where the word 'mad' was mentioned. Since the video clip is nearly a minute long, and since it is not obvious from the automatically selected poster frame that the story begins on the topic of mad cow disease, the user switches to a "filmstrip" view of the story, where every shot is represented by one frame (Figure 2c). Again, occurrences of the query words are marked on the filmstrip exactly where they occur, and the user can navigate directly to the parts of the story that are clearly relevant, such as the pictures of butchers' shops. Alternatively, the user might have elected to enable and play a video "skim" of the story, viewing only the most important sections of the story in a fraction of the original time. Finally, the user changes topic entirely, and in Figure 2d is shown engaged in a query about life on Mars. Having become interested in the general topic, the user has opened the catalog browser to see what other space-related material is available from, in this case, NASA sources included in an accompanying educational collection.

Figure 2: The Informedia Digital Video Library client in operation. Screen (a) shows the system just after a query has been submitted. Three video and three audio clips have been retrieved, and the mouse pointer has been moved over the most highly ranked one to display the automatically generated title, which suggests that the story is highly relevant to the query. The relevance "thermometers" on the left of the frames also suggest that all hits are highly relevant. In screen (b), the user has started the video playing. Beneath the video window, the hits are displayed as red lines, and the user can use the next-hit button to jump to the point at which the query terms are spoken. In screen (c), the user has clicked on the filmstrip gadget on a retrieved frame to display the filmstrip view. Each shot in the story is represented by a single frame, and the positions of query terms are marked on the sprocket holes. The current position in the playing video is highlighted. Finally, screen (d) shows the library catalog browser displaying the location of a film clip about life on Mars.

SPEECH RECOGNITION IN INFORMEDIA

Table 1 shows the results of testing recognition accuracy for the Sphinx-II recognizer applied to samples of about two hours of speech from a variety of video data. These results show that the type of speech and the environment in which it was created dramatically alter the speech recognition accuracy. Substantially lower error rates can be obtained using recently developed systems such as Sphinx-III, but at greatly increased computational cost. The following paragraphs describe the conditions for each line in the table in more detail.

Table 1: Speech recognition using the CMU Sphinx-II recognition system recognizes broadcast material with word error rates between twenty and eighty-five percent. Careful speakers in the lab produce error rates between eight and seventeen percent. Conditions are described in detail in the text.

Type of Speech Data                    Word Error Rate (Insertions + Deletions + Substitutions)
1) Speech benchmark evaluation         ~ 8% - 12%
2) News text spoken in lab             ~ 10% - 17%
3) Narrator recorded in TV studio      ~ 20%
4) C-Span                              ~ 40%
5) Dialog in documentary video         ~ 50% - 65%
6) Evening News (30 min)               ~ 33% - 50%
7) Complete 1-hour documentary         ~ 65% - 75%
8) Commercials                         ~ 85%

1) The basic reference point is the standard speech evaluation data which is used to benchmark speech recognition systems with large vocabularies between five and sixty thousand words. The recognition systems are carefully tuned to this evaluation and the results can be considered close to optimal for the current state of speech recognition research. In these evaluations, typical word error rates range from eight to twelve percent depending on the test set. Recall that word error rate is defined as the sum of insertions, substitutions and deletions. This value can be larger than one hundred percent and is regarded as a better measure of recognizer accuracy than the number of words correct (i.e. words correct = 100% - deletions - substitutions). A small sketch of this computation appears below.

2) Taking a transcript of TV broadcast data with an average reader and re-recording it in a speech lab under good acoustic conditions, with a close-talking microphone, gives an estimated word error rate between ten and seventeen percent for a speech recognition system that was not tuned for the specific language and domain in question.

3) Speech that has been recorded by a professional narrator in a TV studio and that does not include any music or other noise gives an error rate of around twenty percent. Part of the increased error rate is due to poor segmentation of utterances, leaving the speech recognizer unable to tell where an utterance started or ended. This problem was not present in the lab-recorded data. Different microphones and environmental acoustics also contribute to the higher error rate.

4) Speech recognition on C-Span broadcast data shows a doubling of the word error rate to forty percent. While speakers in this data are mostly constant and always close to the microphone, other noises and verbal interruptions degrade the accuracy of the recognition.

5) The dialog portions of broadcast documentary videos yielded recognition word error rates of fifty to sixty-five percent, depending on the video data. The signal for these sections contains many more environmental noises and speech recorded outdoors.

6) The evening news was recognized with approximately a fifty percent overall word error rate. This rate includes recognition accuracy for commercials and introductions as well as the actual news program. Using the Sphinx-III recognition system on similar material yields an error rate of around thirty-three percent; however, the recognition process is computationally expensive, taking between fifty and one hundred times real time.

7) A full one-hour documentary video including commercials and music raised the word error rate to between sixty-five and seventy-five percent.

8) Worst of all were commercials, which were recognized with an eighty-five percent error rate due to the large amounts of music in the audio channel as well as the unusual speech characteristics, and singing, contained in the spoken portion.

While these recognition results seem dismaying at first glance, they merely represent a first attempt at quantifying the usefulness of speech recognition for broadcast video and audio material. Fortunately, as the experiments on information retrieval described below demonstrate, speech recognition does not have to be perfect to be useful in the Informedia digital video library. The transcript generated by Sphinx-II recognition need not be viewed by users, but can be hidden. However, the words in the transcript are time-aligned with the video for subsequent retrieval. Because, generally, only the timing information from the speech recognition output is used directly, errors in recognition are not directly visible to users and the system can tolerate higher error rates than those that would be required to produce a human-readable transcript.
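Word error rate as defined in item 1 above, the sum of insertions, deletions and substitutions divided by the number of words in the reference transcript, is conventionally computed with a word-level edit distance. A minimal sketch (the example sentence is invented for illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (insertions + deletions + substitutions) / len(reference words),
    computed by dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j]: edit distance between first i reference and j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution or match
    return dist[len(ref)][len(hyp)] / len(ref)

# One substitution and one deletion over five reference words -> WER 0.4.
print(word_error_rate("the herd might be slaughtered",
                      "the herd night slaughtered"))
```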

INFORMATION RETRIEVAL IN INFORMEDIA

In this section, some experimental information retrieval results will be given to support the basic premise of the Informedia project: that speech recognition generated transcripts can make multimedia material searchable. More than any other "imperfect" technology, the Informedia Digital Library System depends on text transcripts that allow effective indexing and retrieval of segments relevant to a query. If a perfect, manually created transcript were available, the success of information retrieval in the Informedia Digital Library System would be assured; there are many examples of successful document retrieval systems. However, large amounts of video and audio data in the real world do not have associated perfect transcripts. Closed-caption transcripts, for example, are quite errorful. Most video and audio material has no available transcript at all. It is therefore necessary to develop and evaluate techniques for information retrieval that can be applied to imperfect, and possibly automatically generated, transcripts. This is a cornerstone upon which Informedia is built. In support of this evaluation, a series of experiments and measurements have been conducted. The information retrieval experiments were performed using data sets consisting of perfect text transcripts, closed-captioned text transcripts that were broadcast together with news shows, and transcripts created by the Sphinx-II speech recognition system. Different corpus sizes were also compared.

The most substantial body of previous work has been done at Cambridge University in the United Kingdom. Jones et al. [Jones96] used a specially constructed test set of 50 queries and 300 voice mail messages, from 15 speakers, constructed to have on average 10.8 highly relevant documents per query. They measured precision at rank 5, 10, 15 and 20 and also reported the average precision. For their data, the best performance on a hand transcribed version of the data was an average precision of 36.8%. Average precision for the best speech data was 85.6% of the text retrieval precision. This was achieved by combining a speech recognizer transcript (based on a similar 20,000 word North American business news language model) with a phone-lattice scanning word-spotter based on speaker independent biphone models. It should be
noted that these results are comparable to the results below in terms of the techniques used, but not in terms of the data sets. Because each corpus has significantly different characteristics, one should resist the temptation to compare precision values.

METHOD OF EVALUATION

The standard metrics for retrieval effectiveness in the information retrieval literature are precision and recall [Salton71]. Precision is defined as the number of correct (relevant) hits returned by the system divided by the number of total hits returned to the user. Recall is defined as the number of correct hits returned to the user divided by the number of hits a perfect retrieval should have returned. Recall and precision are thus computed based on the retrieved set, the relevant set, and their intersection. [Jones96] reported results for precision at 5, 10, 15 and 20 items retrieved and, in order to allow some comparison, we will use average precision and recall over those four ranks. Retrieval effectiveness for automatically transcribed spoken documents has been reported as a percentage of the figure for a comparable text retrieval system applied to perfect transcripts [Jones96][James96], which we will also report. It should be noted that we also computed all our results using 11 point interpolated precision, and found identical trends in all experiments.

The precision/recall metric has the drawback that a person (or, preferably, a number of people) must manually score the test data. This is extremely tedious and time-consuming and is therefore usually only done for small sets. Any results are simply assumed to scale to larger sets. Within the Informedia project, an effort is being undertaken to measure precision and recall for a data set of 602 news stories given a list of 105 queries; this involves having each human judge make 63,210 relevance judgments. To date, a full set of these evaluations has only been completed by a single judge, and partial sets of evaluations have been made by several judges. The experimental results scored based on this set of judgments will be described below. Making these evaluations is not an easy task, even for human judges. A subset of 100 stories, for which 3 judges completed relevance judgments for the 105 queries, demonstrated that the judges agreed on 10,443 judgments and disagreed 57 times. However, of the 85 cases where at least one judge thought a story was relevant to a query, the two other judges agreed only 28 times.

In order to be able to report meaningful numbers over larger data sets, a second metric, the "average rank of the correct story", was substituted. This measure is intended to give information about retrieval effectiveness comparable to that supplied by precision and recall numbers. This metric uses a query prompt that is created for one specific document in the database set. Then the rank of this target document in the returned set is computed. Over large numbers of documents the average rank of the target document is computed. Note that this number can be expected to increase with the size of the database. This relatively simple metric permits the repetition of retrieval experiments for relatively large amounts of data without laborious manual scoring. In particular, one can empirically observe which techniques scale better than others, something that is virtually impossible to do for precision and recall metrics that are manually derived. In the future a more thorough effort will be undertaken to measure the correlation between the average and median rank and the precision and recall metrics.
Although the measure is expected to give similar information, it is not directly comparable to measures based on actual relevance judgments, since it assumes that there is exactly one document that should be retrieved for a given prompt, when, in fact, several documents may be relevant.
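Both families of measures used below, precision and recall at a rank cutoff and the average rank of the correct story, are straightforward to compute once a ranked result list is available. The following is a minimal sketch, not the Informedia evaluation code, assuming ranked lists of document ids and, for the first measure, a set of human relevance judgments:

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall at rank k for one query.
    ranked_ids:   document ids in ranked retrieval order
    relevant_ids: set of ids judged relevant to the query
    """
    retrieved = ranked_ids[:k]
    hits = sum(1 for d in retrieved if d in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def averaged_pr(ranked_ids, relevant_ids, cutoffs=(5, 10, 15, 20)):
    """Precision and recall averaged over the rank cutoffs 5, 10, 15 and 20,
    as reported in the experiments below."""
    pairs = [precision_recall_at_k(ranked_ids, relevant_ids, k) for k in cutoffs]
    return (sum(p for p, _ in pairs) / len(pairs),
            sum(r for _, r in pairs) / len(pairs))

def average_rank_of_correct_story(results):
    """results: list of (ranked_ids, target_id) pairs, one per query prompt.
    The 'correct story' is the single document the prompt was written for,
    which is assumed to appear somewhere in the ranked list."""
    ranks = [ranked.index(target) + 1 for ranked, target in results]
    return sum(ranks) / len(ranks)
```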

The Data for the Information Retrieval Experiments

1) Data set 1 consists of manually created transcripts obtained through the Journal Graphics transcription service, for a set of 105 news stories from 18 news shows broadcast by ABC and CNN between August 1995 and March 1996. The shows included were ABC World News Tonight, ABC World News Saturday, and CNN's The World Today. The average news story length in this set was 418.5 words. For each of these shows with transcripts, closed-captions were also collected as they were broadcast and a speech recognition transcript was generated from the audio using the Sphinx-II speech recognition system running with a 20,000 word dictionary and language model based on the Wall Street Journal from 1987-1994. Speech recognition for this data has a 50.7% Word Error Rate (WER) when compared to the JGI transcripts. WER measures the number of words inserted, deleted or substituted divided by the number of words in the correct transcript. Thus, WER can exceed 100% at times. Closed captions have a 15.6% WER compared to the Journal Graphics transcribed text.

The Journal Graphics transcription service also provided human-generated headlines for each of the 105 news stories. Each headline was matched to exactly one news story. The headlines were used as the query prompts in the information retrieval experiments. Thus, the rank of the correct story is defined as the rank of the news story returned by the search engine, for which the headline used as the query was created. Recall that this does not ensure that no other story is relevant to the title. In fact, in the 63,210 relevance judgments, a human judge assigned an average of 1.857 relevant documents to each headline. The average length of a headline query was 5.83 words. In all the experiments described here, the stories being indexed were segmented by hand. Automatic segmentation methods can be expected to generate errors that may decrease retrieval effectiveness.

2) Set 2 consisted of set 1 augmented with 497 Journal Graphics transcripts of news stories from ABC and CNN in the same time frame (August 1995 - March 1996). This set is thus comprised of 602 stories in total. Corresponding speech transcripts or closed captions were not obtained for this set. These news transcript texts had an average length of 672 words per news story.

3) Set 3 contained 2600 stories, including all the stories in set 1 and set 2. Except for the stories in set 1, none of these stories had closed-captions, nor were speech transcripts available for these stories.

4) Set 4 consisted of a total of 11928 stories, including the stories from sets 1, 2 and 3. All of these stories came from Journal Graphics transcripts from the same time period. The transcripts were similar in type to those in the previous set. Speech recognition transcripts were only available for the stories in the original set 1.

In these experiments, the measure of retrieval effectiveness adopted is the average rank of the query's correct story in the retrieved set. Where possible, it is compared with actual precision and recall figures calculated with respect to human-generated relevance judgments. The base-line system uses a search engine based on TF and stop words (also referred to as "coordinate matching" by [Witten94]). The initial comparison used the 602-story corpus (set 2) to evaluate retrieval effectiveness for the closed-caption transcripts, the manual transcripts and the speech recognition transcripts in the set. Note that only for the 105 stories corresponding to the headline queries was there a choice of manual transcripts, speech recognizer output or closed-captioning available. The remaining 497 stories in the set were all derived from "perfect" manual transcripts. Thus, the data set was biased against the speech recognition data in that it mixed perfect text transcripts, some of which may have been relevant to the query headline, with the targeted speech recognized transcripts. Since the speech recognized transcripts can be expected to lose some query terms to recognition errors, their relevance ranking is likely to be somewhat lower than a comparable, relevant text story. Because of the limited number of stories for which speech, closed captions and manual transcriptions were available, accepting this bias was necessary to permit experimentation on a sizable retrieval corpus.

Experiment 1: Precision and recall for various transcription methods

The first experiment shows precision and recall for three different types of transcripts. Precision and recall were calculated using the relevance judgments of a human judge as a reference, as described earlier. The corpus used consisted of the 602 stories (described above in set 2), including 105 stories corresponding to the query headlines and 497 manually transcribed "distractor" stories. The three experimental conditions involved using the same 105 stories corresponding to the queries, but the transcripts for these 105 stories were generated in three different ways: by manual transcription, by closed-captioning, and by large vocabulary speech recognition. Precision and recall figures were computed at 5, 10, 15 and 20 stories retrieved. Two versions of the retrieval system were contrasted: a simple search engine using only TFIDF weighting and stop words, and the best version of the Informedia search engine using TFIDF, document length normalization, stop words, suffix stripping and document weight vector normalization. The latter is effectively a type of cosine distance metric. The average precision and recall figures over these four ranks were computed and are displayed in Table 1.

On manually prepared transcripts, which are assumed to have perfect text content, recall averaged over ranks 5, 10, 15 and 20 was 0.714 (precision = 0.097) for the standard search engine and 0.906 (precision = 0.128) for the search engine using suffix stripping, document length normalization and document weight normalization in addition to TFIDF and stop words. This shows the helpful effects of these additional search engine features. The closed captions for these same transcripts had a 15.7% word error rate compared to the perfect manual transcripts. This translated into a decreased average recall at ranks 5, 10, 15 and 20 of 0.667 (precision = 0.091) for the standard search engine and an average recall of 0.849 (precision = 0.116) for the best search engine. Note that a 15.7% word error rate resulted in a 6.6% decrease in recall (6.2% decrease in precision) for the standard search engine and a 6.3% recall decrease (9.3% precision decrease) for the best search engine compared to text transcript retrieval.

For speech generated transcripts, the average recall was 0.505 (precision 0.068) for the standard search engine and 0.803 (precision 0.110) for the best search engine. The 50.7% word error rate thus resulted in a 29.3% decrease in recall performance (29.9% precision decrease) compared to text retrieval from perfect transcripts for the standard search engine. For the engine with the most sophisticated weighting scheme, this decrease was 11.4% in recall and 14.1% in precision. By increasing the quality of the information retrieval engine, it was possible to palliate the effects of imperfect transcription by speech recognition. When the best search engine is used, the decrease in precision (14.1%) resulting from use of speech recognition generated transcripts is actually less than that resulting from using an information retrieval system instead of a human being to make relevance judgments (23.8%). The errors in speech recognition accuracy are not a critical impediment to achieving good information retrieval performance.

Table 1: Comparison of precision and recall for different transcript types when 105 "headline" queries were made against a corpus of 602 stories. The transcripts for the 105 stories corresponding to the queries were derived, in three conditions, from manual transcription, closed-captioning, and speech recognition. 497 manually transcribed text story transcripts were added to the corpus in each condition. Precision and recall figures were averaged over ranks 5, 10, 15 and 20. Hypothetical "perfect" retrieval scores, according to human relevance judgments, are also shown. The baseline search engine uses TFIDF and stop words; the full search engine uses TFIDF, stop words, stemming, document vector normalization and document length normalization.

Type of Corpus                           Word error rate   Baseline Recall   Baseline Precision   Full Recall   Full Precision
Manually Prepared Transcript             0% (base line)    0.714             0.097                0.906         0.128
Broadcast Closed Captions                15.6%             0.667             0.091                0.849         0.116
Speech Generated Transcript              50.7%             0.505             0.068                0.803         0.110
"Perfect" Retrieval (human judgments)    0%                0.992             0.168                0.992         0.168

An analysis of the large difference between the simple search engine and the full search engine (recall/precision = 0.505/0.068 vs. 0.803/0.110) showed that about half of the improvement in speech document retrieval is due to the effect of stemming. The speech recognizer will misrecognize words and substitute phonetically similar words. These matches are often words with similar stems, but different suffixes. The second biggest improvement came from vector normalization (computing the cosine distance between the query vector and the document vector instead of the Euclidean distance).
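The weighting features discussed in this experiment can be made concrete with a toy retrieval sketch. This is not the Informedia engine; it assumes documents have already been tokenized, stopped and stemmed, and it shows only TFIDF weighting with document vector (cosine) normalization, the two features identified above as contributing most on speech recognized transcripts. Document length normalization is omitted for brevity.

```python
import math
from collections import Counter

def build_index(docs):
    """docs: list of token lists (already stopped and stemmed).
    Returns per-document TFIDF vectors, L2-normalized so that scoring a
    query against them amounts to cosine-style matching, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})  # document vector normalization
    return vectors, idf

def rank(query_terms, vectors, idf):
    """Rank documents by the dot product of the normalized document vector
    with a TFIDF-weighted query vector."""
    q = Counter(query_terms)
    qvec = {t: q[t] * idf.get(t, 0.0) for t in q}
    scores = [sum(vec.get(t, 0.0) * w for t, w in qvec.items()) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
```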

Experiment 2: Average retrieval rank for correct story using various transcription methods

The second experiment involved computing the second, "average rank of correct story" measure for the same conditions. Table 2 shows that the rank of speech recognition based transcripts is more than three times higher than that of manually generated data. Closed captions lie somewhere in between. The differences in the recall and precision figures in Table 1 are in the same direction, but are much smaller. Compared with precision and recall, the "average rank of correct story" metric seems to be a correlated, but more sensitive, measure of retrieval effectiveness.

Table 2: Comparison of the average retrieval rank for the correct story corresponding to a query headline, for manual story transcripts, closed-captioned transcripts and transcripts based on speech recognition. The same corpus was used as for the experiment described in Table 1. The baseline engine uses TFIDF and stop words; the full engine uses TFIDF, stop words, suffix stripping, document weight normalization and document length normalization.

Transcript Source            Word error rate    Baseline engine    Full engine
Manually Prepared            0% (base line)     11.14              2.32
Broadcast Closed Captions    15.6%              13.93              4.85
Speech Recognition           50.7%              44.11              7.89

Experiment 3: Scaling behavior of the average retrieval rank

Table 3: A comparison of retrieval effectiveness for manual transcripts and speech recognition transcripts for larger data sets. Each data set used the same 105 prompts for which corresponding stories were either created manually or through a speech recognizer. In this case, though, three corpus sizes were generated by adding manually generated transcripts. Average rank figures were computed using the best retrieval system, as described in the text.

Average rank of correct story based on:    602 stories    2,600 stories    12,000 stories
Manually Prepared Transcript               2.32           5.65             9.34
Speech Generated Transcript                7.89           31.16            60.19

Table 3 shows the scaling behavior of the average rank measure for the best retrieval system as the number of documents in the corpus is increased by adding more manually generated “distractor” story transcripts. The table indicates that the average rank rises more quickly for speech recognized transcripts than for manually created transcripts. However, both conditions seem to degrade approximately with the log of the size of the corpus. Recall that since only “perfect” manual transcripts were added to the corpus, this data is slightly biased against the speech recognized transcripts. One would expect better measured performance in the speech recognition condition if the additional stories in the corpus were always of the same type (i.e. speech transcript) as the original 105 stories. It is also worth noting that the ratio of the average rank between speech recognition and manual transcripts increases with the size of the corpus. This indicates that speech recognition generated transcripts are less focused on the correct topic and are more likely to be displaced by other apparently relevant stories from the “distractor” set.

Experiment 4: Phonetic transcription compared with large vocabulary recognition

In previous work, Schäuble and Wechsler [Schäuble95] performed experiments in which they used automatic phonetic transcriptions, as opposed to the whole word speech recognition transcripts described above, for information retrieval in a small radio news corpus. They reported reasonable success in retrieving relevant documents. Similarly, Jones et al. [Jones96] used a combination of whole word transcription and phoneme lattices to improve on the retrieval effectiveness of a system using either alone. Although a strictly phonetic transcription of the data used in these experiments has not yet been generated, an approximation is achieved by converting story transcripts into a phonetic representation by looking up the words in the transcript in the large phonetic dictionary used by Sphinx-II [CMU-Speech95]. All substrings of between three and six phonemes in length are generated from these transcriptions and used as the lexical tokens for building the inverted index and retrieval. The prompts are also converted into phoneme based tokens in the same way.
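The substring tokenization just described can be illustrated with a short sketch. This is a simplified stand-in for the actual processing: it assumes a pronunciation dictionary mapping words to phoneme lists, in the style of the publicly available CMU pronouncing dictionary rather than the actual Sphinx-II lexicon files, and it emits all phoneme substrings of length three to six as indexing tokens.

```python
def phoneme_ngrams(words, pronunciations, min_n=3, max_n=6):
    """Convert a word transcript into phoneme-substring tokens.

    words:          transcript tokens, e.g. ["mad", "cow", "disease"]
    pronunciations: dict mapping a word to its phoneme list,
                    e.g. {"mad": ["M", "AE", "D"], ...}
    Returns all phoneme substrings of length min_n..max_n, which serve
    as the lexical tokens for building the inverted index.
    """
    # Concatenate the phoneme strings of in-dictionary words;
    # out-of-dictionary words are simply skipped in this sketch.
    phones = []
    for w in words:
        phones.extend(pronunciations.get(w.lower(), []))

    tokens = []
    for n in range(min_n, max_n + 1):
        for i in range(len(phones) - n + 1):
            tokens.append("_".join(phones[i:i + n]))
    return tokens

# Example with a tiny hypothetical dictionary:
prons = {"mad": ["M", "AE", "D"], "cow": ["K", "AW"]}
print(phoneme_ngrams(["mad", "cow"], prons))
```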

Table 4 gives precision and recall for word based and phonetic representations of both human generated and machine generated transcriptions. Combined word and phoneme based retrieval is also evaluated, as originally proposed by James [James96]. Figures are given for two information retrieval engines.

The first observation one can make from these data is that the speech recognition vocabulary is crucial to IR performance. Reducing the text vocabulary to that of the speech recognition engine accounts for half the loss of precision and recall resulting from the use of speech recognition based transcripts. Further analysis shows that the out-of-vocabulary (OOV) rate for the data used in these experiments is rather high; OOV terms account for 11.4% of the terms in the 105 "headline" prompts (71/620 words) and for 6% of the words in the stories. Searching on a phonetic representation of the manual transcripts slightly decreases retrieval effectiveness, but this is not reliably true for the SR based transcripts, perhaps because the phonetic transcription allows the system to bypass some errors in the recognition of word suffixes.

Finally, interpolating the phonetic and whole word representations gives better retrieval performance than either alone, both for manual and automatic transcriptions. For manual transcriptions (perfect text), it is likely that the phonetic transcriptions help by providing a general suffix and prefix matching mechanism, which is not a feature in the base search engine using TFIDF and stop words. However, performance when using the "best" search engine, which now includes suffix stemming, is not changed when phonetic transcriptions are added to the already perfect text transcripts. The results for perfect text transcripts and phonemes with the full search engine are only given to illustrate the point that phoneme retrieval is not needed for perfect text systems, whereas it can be useful for speech recognized documents. For speech recognition based transcriptions, the improvement in retrieval effectiveness derives from phonetic transcriptions, which allow matching on words outside the fixed 20,000-word speech recognition vocabulary.

Table 4: Recall and precision for retrieval performed using a variety of whole word and phonetic transcript representations, for transcripts generated manually or using large vocabulary speech recognition. The baseline engine uses TFIDF and stop words; the full system includes all IR features.

Type of transcription                               Baseline Recall   Baseline Precision   Full Recall   Full Precision
Words from Text                                     0.714             0.097                0.906         0.128
Words from SR                                       0.505             0.068                0.803         0.110
Words from Text without words not in SR dictionary  0.623             0.081                0.850         0.119
Phonemes from Text                                  0.705             0.087                0.839         0.115
Phonemes from SR                                    0.544             0.064                0.762         0.102
Text words + Text Phonemes interpolated             0.765             0.107                0.905         0.129
SR words + SR Phonemes interpolated                 0.616             0.083                0.831         0.114
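The interpolated rows in Table 4 combine a word-based retrieval score with a phoneme-substring score for each document. The exact combination scheme used in these experiments is not spelled out here, so the sketch below shows one simple possibility, a linear interpolation of normalized scores with a hypothetical mixing weight, rather than the method actually used.

```python
def interpolate_scores(word_scores, phoneme_scores, weight=0.5):
    """Combine word-based and phoneme-based retrieval scores per document.

    word_scores, phoneme_scores: dicts mapping document id -> retrieval score.
    weight: hypothetical interpolation weight on the word-based score.
    Returns document ids ranked by the interpolated score.
    """
    def normalize(scores):
        # Scale each score list so its best score is 1.0 before mixing.
        top = max(scores.values(), default=1.0) or 1.0
        return {d: s / top for d, s in scores.items()}

    w, p = normalize(word_scores), normalize(phoneme_scores)
    docs = set(w) | set(p)
    combined = {d: weight * w.get(d, 0.0) + (1 - weight) * p.get(d, 0.0)
                for d in docs}
    return sorted(docs, key=lambda d: combined[d], reverse=True)
```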

Table 5 gives the results of a similar experiment expressed in terms of the average rank measure. As in previous experiments, the magnitude of the effect is much more noticeable when this "average rank of correct story" measure is used. The relative retrieval effectiveness for the different conditions shows exactly the same trends as before: when precision and recall are high, the average rank is relatively low, and vice versa.

Table 5: Improvements in the average rank of the correct story for speech-recognized stories using phoneme recognition. This experiment is based on the set of 602 stories. For each condition, a baseline (TFIDF + stop words) is shown, as well as the best information retrieval system, which uses TFIDF, stop words, document length normalization, document weight normalization, proximity weighting and suffix stripping. The conditions contrast words from manually transcribed text, words from a 20,000-word speech recognizer, words from the manual transcripts after the out-of-vocabulary words were removed, retrieval given only a phonetic representation of the text transcript, and retrieval given only a phonetic representation of the speech-recognized transcript. The last two rows show the improvements in average rank of the correct story obtained when both words and phonemes are used for retrieval.

Transcript type                             TFIDF + stop words    Best IR System
Words from Text                             11.14                 2.32
Words from SR                               44.11                 7.89
Words from Text less OOVs for SR            38.36                 17.20
Phonemes from Text                          17.74                 9.75
Phonemes from SR                            25.05                 12.02
Text words + Text Phonemes interpolated     9.15                  2.18
SR words + SR Phonemes interpolated         20.06                 6.87
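The "average rank of the correct story" measure itself is straightforward to compute. The following minimal sketch assumes each query has exactly one known correct story and that the retrieval engine ranks every story in the collection, so the correct story always appears somewhere in the ranking.

def average_rank_of_correct_story(rankings, correct):
    """rankings: dict mapping query -> list of story ids, best first.
    correct: dict mapping query -> the single story id known to answer it.
    Returns the rank of the correct story (1 = retrieved first),
    averaged over all queries; lower is better.
    """
    ranks = []
    for query, ranked_stories in rankings.items():
        # index() is 0-based, so add 1 to obtain the conventional rank.
        ranks.append(ranked_stories.index(correct[query]) + 1)
    return sum(ranks) / len(ranks)

# Example with two queries whose correct stories appear at rank 2 each.
avg = average_rank_of_correct_story(
    {"q1": ["s3", "s1", "s2"], "q2": ["s2", "s4", "s1"]},
    {"q1": "s1", "q2": "s4"})   # avg == 2.0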

Experimental Summary

The experiments confirm the findings of Wechsler and Schäuble [Schäuble95] that phoneme-based recognition using phoneme strings of different lengths can be used for effective information retrieval. While these experiments do not directly duplicate the procedure of Wechsler and Schäuble, they are sufficiently similar to confirm their results and to show the robustness of phoneme-string-based information retrieval across different implementations. The current results also show that it is better to have a large-vocabulary speech recognizer (with a lexicon of at least 20,000 words) working in conjunction with a phonetic engine, so that the retrieval results of the two can be combined. This is consistent with the findings of Jones et al. [Jones96] on a voice mail retrieval task. In contrast to Jones et al., the current experiments do not use word spotting or phoneme lattices to augment the vocabulary of the recognition system. Instead of word spotting, a large set of phoneme strings of various lengths is indexed and searched (a sketch of this kind of indexing is given at the end of this section). In the future, these experiments will be extended to investigate the use of word and phoneme lattices that can be generated by the Sphinx-II speech recognizer. There are also subtle differences in the search engine features used by the different groups. Unlike the Jones et al. experiments, the Informedia search engine uses document weight vector normalization to gain a small improvement in retrieval effectiveness for larger corpora. However, this difference is unlikely to influence the direction and trends of the results, which support the Jones et al. findings in all essential respects.

The experiments reported here also hint at the behavior of the retrieval system for larger collections of documents. For larger corpora it was demonstrated that the performance for speech-recognition-generated transcripts fell off more rapidly than the performance for equivalent manually created transcripts. The introduction of the "average rank of correct story" metric enabled measurements for collection sizes far beyond what humans could reasonably judge for relevance. The "average rank of correct story" metric confirmed all the basic trends seen in the small corpus with recall and precision figures based on human judgments. It does, however, appear to be a more sensitive measure. It is suspected that the median rank may be less sensitive and more directly comparable to the traditional measures of precision and recall.
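The phoneme-string indexing mentioned above can be sketched as follows. The sketch assumes the phonetic recognizer output is available as a flat sequence of phoneme symbols; the string lengths of 3 to 5 are illustrative choices, not necessarily the lengths used in the experiments.

def phoneme_ngrams(phonemes, min_len=3, max_len=5):
    """Extract overlapping phoneme strings of several lengths from a
    phonetic transcription, for use as index terms.

    phonemes: list of phoneme symbols, e.g. ["K", "AA", "N", "G", "R", "AH", "S"]
    min_len, max_len: assumed string lengths for illustration only.
    """
    terms = []
    for n in range(min_len, max_len + 1):
        for i in range(len(phonemes) - n + 1):
            terms.append("_".join(phonemes[i:i + n]))
    return terms

# Both documents and queries are expanded into these phoneme-string terms,
# and ordinary word-style indexing and retrieval is applied to them, so a
# query word outside the 20,000-word recognition vocabulary can still match
# the phoneme strings of a spoken document.
index_terms = phoneme_ngrams(["K", "AA", "N", "G", "R", "AH", "S"])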

FUTURE DIRECTIONS

There are six main research areas that contribute to the effectiveness of the Informedia Digital Video Library: data delivery, user interface design, image understanding, natural language processing, information retrieval and speech recognition. AI technologies in image understanding, natural language processing, speech recognition and information retrieval primarily affect the off-line library creation process. The online library exploration phase with the user is affected more by data delivery issues and user interface design, and to a lesser extent by natural language understanding and information retrieval for query processing and by speech recognition for spoken queries.

There are two main data delivery issues: storage and transmission. How can one address the huge storage requirements of the MPEG-1 encoded video data accumulated through daily news broadcasts? An hour of video takes up about 600MB of disk space. Although the Informedia project will eventually have a terabyte of disk on which to store video, this is still barely enough for 1500 hours of video. Even with the constantly dropping prices of storage, the many thousands of hours of both broadcast and privately produced video force the consideration of new approaches to dealing with this data. It is worthwhile to investigate when data could be degraded or "forgotten". It may be useful for data to degrade to lower-quality video, at fewer frames per second and lower resolution. It is also possible to eliminate the video entirely and save only the audio portion. Finally, one can retain only the text transcript, without audio or video (see the sizing sketch below). Even if enough storage is available, the need to speed access by keeping cached copies of material encourages research into which reduced-quality representations of video material can serve as useful substitutes while the original material is fetched.

The second data delivery issue concerns the transmission of video news stories to a remote user. Essentially, one needs to provide networks fast enough to carry MPEG-1 bit rates continuously, and servers that can keep up with this demand for many users. The need to play back skims, which in future versions of the system will be dynamically created according to user queries, requires that the MPEG-1 data be served with very low latency, preventing many of the optimizations currently used in multimedia servers. Local caching strategies would allow a frequently used subset of the news to be stored near the client, where it can be served rapidly and without contention, while most of the less frequently used, older news is stored in a central archive. This approach would reduce network bandwidth requirements, although MPEG-1 transmission rates would occasionally still be required. Another approach is to use lower-bandwidth streaming video representations, trading off reduced quality against expanded accessibility. To enable experimentation with a wide variety of networks and delivery platforms, the Informedia project is currently implementing a Web-based client in Java.

The user interface issues concern the way users explore the library once it is available. Can the user intuitively navigate the space of features and options provided in the Informedia: News-on-Demand interface? What other features should the system provide to allow users to obtain the information they are looking for?
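As referenced above, a rough sizing sketch of the degradation options for stored material follows. Only the 600MB-per-hour MPEG-1 figure comes from the text; the sizes assumed for the degraded representations are illustrative guesses, not measurements from the project.

# Rough storage budget per hour of content under each retention policy.
MB = 1
GB = 1024 * MB
TB = 1024 * GB

HOUR_MPEG1 = 600 * MB         # full-quality MPEG-1 audio/video (from the text)
HOUR_LOW_RES = 150 * MB       # assumed: reduced frame rate and resolution
HOUR_AUDIO_ONLY = 30 * MB     # assumed: audio retained, video discarded
HOUR_TRANSCRIPT = 0.1 * MB    # assumed: text transcript only

def hours_per_terabyte(mb_per_hour):
    """How many hours of content fit on one terabyte of disk."""
    return TB / mb_per_hour

# At full MPEG-1 quality this works out to roughly 1,700 hours per terabyte,
# in the same ballpark as the 1,500 hours cited in the text.
for label, size in [("MPEG-1", HOUR_MPEG1), ("low-res", HOUR_LOW_RES),
                    ("audio only", HOUR_AUDIO_ONLY), ("transcript", HOUR_TRANSCRIPT)]:
    print(f"{label:>10}: about {hours_per_terabyte(size):,.0f} hours per TB")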
The plan for the future is to move the system into further test-bed deployments to gain insight from users and to enable evaluation of interface design alternatives. User studies have already been useful in directing the system towards better schemes for poster frame generation. In the future, such studies will be used to answer questions such as whether the speech interface enables more effective querying than the typing interface, and whether skims are an effective reduced representation of the content of a video segment.

Natural language processing research for News-on-Demand needs to find ways of providing acceptable segmentation of the news broadcasts into stories, even when those broadcasts are only available as speech. It is desirable for the system to generate more meaningful short summaries of the news stories in natural-sounding English, and to choose more coherent and informative sections of a story for use in skims. In both cases, it would also be useful if the system could tailor the headline and the skim to a model of the user's information need, as expressed through the query. Natural language understanding also has a role to play in deciding the weightings and operators, such as adjacency and Boolean operators, that should be used with natural language queries to ensure optimal retrieval or matching of concepts from the story texts. Parsing and broad semantic analysis of news stories may also be worth pursuing; the system might improve greatly if it could parse and separate out dates, major concepts and types of news sources. Finally, machine translation, both statistical and symbolic, will be fundamentally important in dealing with the fact that video and audio programs are produced in a variety of languages, and that this variety is also represented among the potential users of Digital Libraries.


Image processing research [Hauptmann95b][Zhang95] is continuing to refine scene segmentation (the identification of cuts in the video). Within a scene and within a story, image processing provides the key frame used to represent that scene or story. The choice of a single key frame to best represent a whole scene is a subject of active research. In the longer term, the project plans to add text detection and optical character recognition (OCR) capabilities for reading captions and text off the video images. Future work will also aim to include similarity-based image matching among the retrieval features available to a user. A simple version of similarity matching based on color histograms has already been used in one version of the system, but the aim is to move far beyond this point, and to match identified objects found in the images.

The error rate of the speech recognition system still leaves much room for improvement. The experiments presented here indicate that substantial gains in retrieval effectiveness can be achieved with more accurate speech-recognition-generated transcripts. The figures in this paper show that the greatest benefit for speech recognition would be obtained by reducing the number of out-of-vocabulary words: words that occur in the stories and the queries but are not represented in the speech recognizer's active vocabulary. In addition to addressing out-of-vocabulary words, improvements could come from better acoustic models, better language modeling and better pronunciation modeling. To improve acoustic modeling, experiments are under way to automatically adapt the existing models to broadcast news by combining the closed-captioned text with a speech recognition transcript of the audio stream. In principle, this could enable a computer to learn to recognize speech by watching television 24 hours a day. To improve lexical coverage and the language model, a two-pass system is being developed. The first pass will be a standard recognition using the general English language model. The recognized words are then used in a query to an information retrieval system, from which related text on the specific topic can be obtained. This text is then interpolated with the generic language model to focus on the topic domain covered in the current speech. Preliminary experiments have shown that this adaptation of the language model can improve speech recognition accuracy as well as coverage of out-of-vocabulary lexical items. The task for speech recognition in Informedia is not merely transcription. It matters little whether the system correctly transcribes any of the stop words; however, important concept words must be recognized correctly or retrieval effectiveness will suffer. This suggests a new approach to training and evaluating speech recognizers for information retrieval tasks, in which the score to be maximized is based on retrieval effectiveness rather than on the number of words correctly transcribed. A further improvement could come through the use of confidence measures, which reflect the recognizer's estimate of the likelihood that a word was correctly transcribed. Confidence measures could be used as an additional weighting factor in the document vector.
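As one possible realization of this idea (not the Informedia implementation), confidence scores could scale each word's contribution to the document term vector. The sketch assumes the recognizer emits a (word, confidence) pair for every hypothesized word.

from collections import defaultdict

def confidence_weighted_vector(hypotheses, stop_words=frozenset()):
    """Build a document term-weight vector from speech recognition output.

    hypotheses: list of (word, confidence) pairs, where confidence is the
    recognizer's estimate (0..1) that the word was correctly transcribed.
    Instead of counting each hypothesized word once, its contribution is
    scaled by the confidence, so probable misrecognitions carry less weight.
    """
    vector = defaultdict(float)
    for word, confidence in hypotheses:
        word = word.lower()
        if word in stop_words:
            continue
        vector[word] += confidence
    return dict(vector)

# Example: a confidently recognized concept word contributes almost a full
# count, while a shaky hypothesis contributes only a fraction.
doc = confidence_weighted_vector([("Congress", 0.92), ("budget", 0.35), ("the", 0.99)],
                                 stop_words={"the"})

The resulting weights could then be combined with the usual TFIDF and normalization factors in the retrieval engine.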
Finally, one would expect prosodic information, and in particular lexical stress, to reflect the importance of spoken query terms or spoken document concept words. The challenge here is to reliably identify the prosodically marked words and to favor them as relevant terms. While the results presented here, as well as those from Schäuble and Wechsler [Schäuble95] and Jones et al. [Jones96], show some success in information retrieval from spoken documents, much remains to be done. In the near term, the combination of whole-word recognition and phoneme recognition shows clear promise. In addition, improvements can be achieved through the use of word or phoneme lattices, where the recognizer indicates multiple, ranked choices for each word or phoneme. Since recognition errors are not semantically correlated, the penalty of adding multiple word candidates to the retrieval document is likely to be outweighed by the benefit of alternate, lower-ranked, but sometimes correct word or phoneme candidates.

Judicious combination of a variety of AI techniques has permitted the construction of an effective interface to a digital video library. Speech recognition is used for transcription and alignment, image processing is used for shot analysis and to identify representative frames, and natural language processing is used for summarization. Despite the imperfections in each of these techniques, and the problems inherent in processing unmodified broadcast news data, strong navigation tools supported by the use of AI allow the user to quickly retrieve appropriate stories from the Informedia Digital Video Library.

ACKNOWLEDGMENTS

The authors are grateful for the help of Mosur Ravishankar, Paul Placeway and our other colleagues in the CMU speech group; Michael Smith for the image processing code; Michael Christel and Mark Hoy for more than substantial coding on the user interface; Ricky Houghton, Craig Marcus, Bryan Maher and King-Sun Wai for programming and technical support; and Howard Wactlar for fearless leadership. Especial thanks go to Marci Maher, without whose sterling efforts this paper would not have been possible. We are also indebted to Hsinchun Chen for his forbearance.

REFERENCES

[Brown95] Brown, M. G., Foote, J. T., Jones, G. J. F., Spärck Jones, K., and Young, S. J. "Automatic Content-based Retrieval of Broadcast News," Proceedings of ACM Multimedia. San Francisco: ACM, November 1995, pp. 35-43.
[Christel94a] Christel, M., Stevens, S., and Wactlar, H. "Informedia Digital Video Library," Proceedings of the Second ACM International Conference on Multimedia, Video Program. New York: ACM, October 1994, pp. 480-481.
[Christel94b] Christel, M., Kanade, T., Mauldin, M., Reddy, R., Sirbu, M., Stevens, S., and Wactlar, H. "Informedia Digital Video Library," Communications of the ACM, 38(4), April 1995, pp. 57-58.
[CNN-AT-WORK95] Cable News Network/Intel. CNN at Work - Live News on your Networked PC, Product Information. http://www.intel.com/comm-net/cnn_work/index.html
[Flickner95] Flickner, M., Sawhney, H., Niblack, W., Ashley, J., Huang, Q., Dom, B., Gorkani, M., Hafner, J., Lee, D., Petkovic, D., Steele, D., and Yanker, P. "Query by Image and Video Content: The QBIC System," IEEE Computer, September 1995, pp. 23-31.
[Hauptmann95] Hauptmann, A. G., Witbrock, M. J., Rudnicky, A. I., and Reed, S. "Speech for Multimedia Information Retrieval," UIST-95, Proceedings of User Interface Software Technology, 1995, in press.
[Hauptmann95b] Hauptmann, A. G. and Smith, M. A. "Text, Speech and Vision for Video Segmentation: the Informedia Project," AAAI Fall Symposium on Computational Models for Integrating Language and Vision, Boston, MA, November 10-12, 1995, pp. 90-95.
[Hauptmann96] Hauptmann, A. G. and Witbrock, M. J. "Informedia News on Demand: Multimedia Information Acquisition and Retrieval," in Maybury, M. T., Ed., Intelligent Multimedia Information Retrieval, AAAI Press/MIT Press, Menlo Park, 1996 (in press).
[Hwang94] Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L., Weide, R., Huang, X., and Alleva, F. "Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II," ICASSP-94, vol. I, pp. 549-552.
[Informedia95] http://www.informedia.cs.cmu.edu/
[James96] James, D. A. "A System for Unrestricted Topic Retrieval from Radio News Broadcasts," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Atlanta, GA, USA, May 1996, pp. 279-282.
[Jones96] Jones, G. J. F., Foote, J. T., Spärck Jones, K., and Young, S. J. "Retrieving Spoken Documents by Combining Multiple Index Sources," SIGIR-96, Proceedings of the 1996 ACM SIGIR Conference, Zürich.
[Li96] Li, W., Gauch, S., Gauch, J., and Pua, K. M. "VISION: A Digital Video Library," Digital Libraries '96: 1st ACM International Conference on Research and Development in Digital Libraries, Bethesda, MD, March 1996.
[Mani96] Mani, I., House, D., Maybury, M., and Green, M. "Towards Content-Based Browsing of Broadcast News Video," in Maybury, M. T., Ed., Intelligent Multimedia Information Retrieval, 1996.
[Maybury96] Maybury, M., Merlino, A., and Rayson, J. "Segmentation, Content Extraction and Visualization of Broadcast News Video using Multistream Analysis," submitted to Proceedings of the ACM International Conference on Multimedia, Boston, MA, 1996.
[Ogle95] Ogle, V. and Stonebraker, M. "Chabot: Retrieval from a Relational Database of Images," IEEE Computer, Vol. 28, No. 9, September 1995.
[Pentland94] Pentland, A., Picard, R., and Sclaroff, S. "Photobook: Tools for Content-Based Manipulation of Image Databases," SPIE Conference on Storage and Retrieval of Image and Video Databases II (SPIE paper 2185-05), February 6-10, 1994, San Jose, CA, pp. 34-47.
[Rudnicky95] Rudnicky, A. "Language Modeling with Limited Domain Data," Proceedings of the 1995 ARPA Workshop on Spoken Language Technology, in press.
[CMU-Speech95] http://www.speech.cs.cmu.edu/speech/
[CMU-Speech96] http://www.speech.cs.cmu.edu/cgi-bin/cmudict
[Salton71] Salton, G., Ed. The SMART Retrieval System, Prentice-Hall, Englewood Cliffs, 1971.
[Schäuble95] Schäuble, P. and Wechsler, M. "First Experiences with a System for Content Based Retrieval of Information from Speech Recordings," IJCAI-95 Workshop on Intelligent Multimedia Information Retrieval, Maybury, M. T. (chair), working notes, pp. 59-69, August 1995.
[Wactlar96] Wactlar, H. D., Kanade, T., Smith, M. A., and Stevens, S. M. "Intelligent Access to Digital Video: Informedia Project," IEEE Computer, 29(5), May 1996, pp. 46-52.
[Witten94] Witten, I. H., Moffat, A., and Bell, T. C. Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, 1994.
[Woods96] Woods, Bill. "Conceptually Indexed Video: Enhanced Storage and Retrieval," http://www.sun.com/960201/cover/video.html
[Zhang95] Zhang, H., Low, C., and Smoliar, S. "Video parsing and indexing of compressed data," Multimedia Tools and Applications 1 (March 1995), pp. 89-111.

