Proceedings of the 33rd Hawaii International Conference on System Sciences - 2000

Spoken Documents: Creating Searchable Archives from Continuous Audio

Sean Colbath, Francis Kubala, Daben Liu, Amit Srivastava
{scolbath, fkubala, dliu, asrivast}@bbn.com
BBN Technologies, GTE Corporation
70 Fawcett Street, Cambridge MA 02138

Introduction

The quantity of audio data produced in the world is truly vast. Audio, as a medium, is regarded as a second-class citizen to text. While most of the current emphasis on searching, summarization, and indexing is focused primarily on text, more original language content is being generated through the audio portion of video and through dedicated audio media such as radio and telephone than in text alone [Fig. 1], [1, 2].

[Fig. 1] Bytes per year of original content (text, video, telephone; transcript of spoken words vs. raw data, logarithmic scale from 1.0E+09 to 1.0E+18).

Current search technologies for audio rely on the cataloger of the data to provide additional keywords or metadata to enable retrieval. This can lead to haphazard cataloging and misleading searches, and provides the end user with no summarization, editing, or information extraction capabilities. For instance, using the Lycos search engine to locate Martin Luther King, Jr.'s "I have a dream" speech using its sound search capability returns pointers to audio files containing the Everly Brothers' song "Dream Dream Dream" rather than the famous speech. Searching for "Martin Luther King" finds a recording of the speech, but the only way to access it is to play it back – further searches can't be done on it, and the words in the speech can't be processed in any meaningful way unless the user transcribes them.

One obvious way to tackle this problem is to transcribe the speech-based audio using automatic speech recognition technology. Recent advances in
speech recognition and natural language technologies make it possible to turn a previously opaque form of data – audio, in the form of speech – into a searchable, indexable, manipulable form. But just turning speech into text simply moves the problem from one medium to another, and valuable structural data in the audio is lost. The Virage VideoLogger system was used to transcribe and index the video of Bill Clinton's impeachment trial, but the user was only given the ability to search for particular keywords in the text, with "stories" delimited by visual scene changes rather than by changes in the structure of the conversation.

Systems that attempt to automatically index video usually rely on the closed-captioning tracks provided by the content producer, but closed captioning is generally a paraphrase of what is actually said, and is only available on produced, finished video. It would not normally be found on alternate video sources, such as videotapes of seminars or a videoconferencing stream. Other systems, such as MITRE's Broadcast News Navigator [9] or Carnegie Mellon's Informedia project [10, 11], focus primarily on video streams, locating scene changes, performing face recognition, and attempting to locate similar regions of video through color and shape similarities. Obviously, many of these elements are missing in audio-only sources like radio and teleconferences, or are strongly subdued in corporate or Internet video conferences.

A better approach to indexing audio and video streams is to locate the semantic boundaries in the medium to be indexed. For instance, broadcast news shows on television and radio are divided into introductions, stories, and commercials. Face-to-face meetings and video conferences have minutes and agendas. Seminars and lectures have syllabi. Rather than just accumulating masses of words through speech recognition, it is possible to recover some of the underlying structure through trainable probabilistic methods.

An automatic transcript of broadcast or meeting audio has no boundaries of any kind. Even punctuation and capitalization are missing, making the resulting unbroken stream of words difficult to read, understand, and search. But by integrating several acoustic and linguistic technologies together, we are able to construct a structural
summary of continuous audio that is searchable by content. The implicit assumption is that, while none of these technologies is 100% accurate, together they give an end-user the ability to manipulate and search an archive with reasonable accuracy and to extract data that they would otherwise not have been able to find.

In this paper, we describe the Rough'n'Ready audio indexing system. The system produces a ROUGH transcription, which is READY to be browsed, and integrates seven advanced speech and language technologies: speaker segmentation, clustering, and identification, speech recognition, name spotting, topic classification, and story segmentation. The system uses the linguistic information recovered by these technologies to give structure to the audio stream, and uses this structure to provide the user with a searchable index. In addition, a full-text information retrieval component is used at runtime to retrieve the segmented stories based on topics or keywords.

We will give a brief overview of each component and a more in-depth view of the overall Rough'n'Ready system, as well as discuss future areas of research that this capability opens up. The current focus of the system is on transcription and indexing of broadcast news, with a corpus size of 150 hours of news gathered during 1997 and 1998 (specifically, the DARPA Broadcast News corpus). We will discuss the work necessary to move to a new problem domain later in the paper.

Component Technologies

The analysis component of the Rough'n'Ready system comprises seven base components: a speech recognition system, a speaker segmentation system, a speaker clustering and identification system, a name spotting system, a topic classification system, and a story segmenter. Each of these components forms part of a data pipeline, transforming the audio from wave file to indexed database.
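
For illustration, such a pipeline can be sketched as a chain of stages that each add annotations to a shared document record. The stage names and the Document structure below are hypothetical placeholders, not the actual Rough'n'Ready code:

```python
# Minimal sketch of a staged indexing pipeline; the stages are stubs.
from dataclasses import dataclass, field

@dataclass
class Document:
    audio_path: str
    words: list = field(default_factory=list)          # (word, start_time) pairs
    speaker_turns: list = field(default_factory=list)
    names: list = field(default_factory=list)
    topics: list = field(default_factory=list)
    stories: list = field(default_factory=list)

def recognize_speech(doc):    # placeholder for the speech recognizer
    return doc

def segment_speakers(doc):    # placeholder for speaker segmentation/clustering/ID
    return doc

def spot_names(doc):          # placeholder for named-entity extraction
    return doc

def classify_topics(doc):     # placeholder for topic classification
    return doc

def segment_stories(doc):     # placeholder for story segmentation
    return doc

PIPELINE = [recognize_speech, segment_speakers, spot_names,
            classify_topics, segment_stories]

def index_audio(audio_path):
    """Run every stage in order; each one adds annotations to the document."""
    doc = Document(audio_path)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc
```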

Speech Recognition

Rough'n'Ready uses the BBN Byblos speech recognition system. Byblos is a speaker-independent continuous speech recognition system with a vocabulary of 60k words. The current
system is trained on the DARPA Broadcast News corpus, with 100 hours of acoustic data and 400 million words of language model data from newspapers. No effort has yet been made to adapt the models to a specific environment; the recognizer is used "off the shelf" from the BBN speech recognition research group. Byblos was developed under the DARPA Broadcast News program and is described in [2].

The recognition performance on high-quality broadcast news is quite acceptable, on the order of 87% word accuracy, but achieving it requires running the system at greater than 100x realtime. In order to run the system at a faster speed (4x realtime), we accept a drop in performance to 76% overall word accuracy.
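
The word-accuracy figures above follow the usual convention of one minus the word error rate, computed by aligning the hypothesis against a reference transcript. A minimal sketch using a standard edit-distance alignment (not BBN's scoring code) is:

```python
def word_accuracy(reference, hypothesis):
    """1 - WER, where WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return 1.0 - d[len(ref)][len(hyp)] / len(ref)

print(word_accuracy("the president spoke today", "the president spoke to day"))  # 0.5
```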

Speaker Segmentation, Clustering, and Identification

Speaker identification and segmentation allow us to create paragraph-like units between speakers, both known and unknown (classified by gender). Speaker change detection is important for correct playback of audio sections of the archive, and speaker identification allows the user of the archive to perform queries about particular speakers known to the system, and to skim over the archive for areas where particular speakers were present. In general, however, most speakers will be unknown to the system, so the speaker identification system will cluster them and give each cluster a unique name.

For the speaker change detection problem, we are able to find 90% of all speaker changes in general Broadcast News data within 100 milliseconds of the true (human-labeled) boundary.

Accurate speaker identification is a much harder problem. The current system has only 20 known speakers (primarily well-known anchors and common news personalities) out of a speaker population of 170, but there are plans to increase this to about 100. With 20 known speakers, the system has a false rejection rate (that is, where audio segments of the target speaker are labeled anonymously) of 5%, and a false acceptance rate (where audio segments of the target speaker are labeled as another known speaker) of 2%.
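
Given hand-labeled speaker turns, the two error rates quoted above can be measured by comparing the system's label for each known-speaker segment against the truth. The segment format below is a hypothetical illustration, not the evaluation code used in the paper:

```python
def known_speaker_error_rates(segments):
    """segments: list of (true_speaker, hypothesized_label) pairs for segments whose
    true speaker is in the known-speaker set.  Returns (false_rejection, false_acceptance):
    false rejection  = target speaker labeled with an anonymous cluster tag (e.g. "Male 8"),
    false acceptance = target speaker labeled as a different known speaker."""
    rejections = sum(1 for truth, hyp in segments if hyp.startswith(("Male", "Female")))
    confusions = sum(1 for truth, hyp in segments
                     if not hyp.startswith(("Male", "Female")) and hyp != truth)
    n = len(segments)
    return rejections / n, confusions / n

segs = [("Peter Jennings", "Peter Jennings"),
        ("Peter Jennings", "Male 3"),              # false rejection
        ("Elizabeth Vargas", "Peter Jennings")]    # false acceptance
print(known_speaker_error_rates(segs))  # -> (0.333..., 0.333...)
```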

Name Spotting

Name spotting (person names, location names, and organization names) allows us to find the important "players" in a body of text: WHO was involved with an event, WHERE the event took place, and WHAT other groups were associated with the event. In addition, extracting this information allows us to generate summaries of documents or groups of documents organized by the frequency of extracted names.

We use BBN's IdentiFinder named-entity extraction system. IdentiFinder is statistically trained to locate proper names in text, along with other entities of interest, such as dates, times, percentages, and monetary amounts, using a unigram HMM model of a sentence [3], [4]. Because IdentiFinder uses a statistical model and not a dictionary, it is able to build a generalized model of a name and predict new names based on prior training data. The system is trained by having a human annotator mark up text with true name boundaries, which are used to estimate probabilities for future name occurrences. An adjudication process is used to arbitrate between multiple annotations and to find inaccuracies in the annotated data.

The system has an accuracy in the low 90% range as measured on the DARPA Message Understanding Conference training corpus (a corpus of newspaper articles), and we found that this accuracy degrades linearly as the word error rate from speech recognition increases.
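
The name-frequency summaries described above follow directly from the extractor's output. Assuming the name spotter emits (entity type, text) pairs per story (a hypothetical format, not IdentiFinder's actual interface), a document summary reduces to a tally:

```python
from collections import Counter

def name_summary(tagged_names):
    """tagged_names: iterable of (entity_type, text) pairs, e.g. ("PERSON", "Timothy McVeigh").
    Returns the most frequent names per entity type, as used for document summaries."""
    per_type = {}
    for etype, text in tagged_names:
        per_type.setdefault(etype, Counter())[text] += 1
    return {etype: counts.most_common(5) for etype, counts in per_type.items()}

names = [("PERSON", "Timothy McVeigh"), ("LOCATION", "Oklahoma City"),
         ("PERSON", "Timothy McVeigh"), ("ORGANIZATION", "FDA")]
print(name_summary(names))
```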

Topic Classification and Story Segmentation

The OnTopic topic indexing system is a new, fully statistical system for topic classification. Previous work in the area of topic classification typically considered only one topic per segment and modeled only a small number of topics. OnTopic uses an HMM-based classification system, estimated from labeled training data. The topic HMM includes a model for every topic encountered in training and a model for general language that acts as an absorber of words that are not strongly associated with any topic. This model allows the system to generate a list of the N most likely topics along with their rankings.

Using our model, we find that topics are indicated by less than 10% of the words in a segment – that is, more than 90% of the words in any given document are classified as general language words [5], [6].

Topic samples are taken from a sliding 200-word window across the transcribed text. Runs of similar high-ranking topics are combined to create story boundaries that give the user a high-level view of the data being shown, as well as providing a document model for information retrieval. The current set of approximately 5,500 topics comes from an outside vendor and applies specifically to broadcast news.
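
The window-and-merge step can be sketched as follows, assuming a classifier function that returns topics ranked best-first (a stand-in for OnTopic, not the actual system). Spans from overlapping windows may overlap; the real story boundaries are smoothed:

```python
def segment_stories(words, classify_window, window=200, step=100):
    """Slide a fixed-size window across the transcript, take the top-ranked topic
    for each window, and merge runs of windows that agree on their top topic
    into approximate story boundaries."""
    stories = []
    for start in range(0, max(len(words) - window, 0) + 1, step):
        top_topic = classify_window(words[start:start + window])[0]
        if stories and stories[-1][2] == top_topic:
            stories[-1] = (stories[-1][0], start + window, top_topic)  # extend current story
        else:
            stories.append((start, start + window, top_topic))          # start a new story
    return stories  # list of (first_word, last_word, topic) spans

def toy_classifier(window):
    # Pretend the most frequent word in the window is its top-ranked topic.
    return [max(set(window), key=window.count)]

print(segment_stories(["smoking"] * 250 + ["bombing"] * 250, toy_classifier))
```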

Information Retrieval

By recovering boundaries between regions in a body of text that denote similar topics, we can start to apply traditional methods of information retrieval, such as full-text search. Without having a document model for the audio archive, a single query for a particular topic or set of keywords might return an unmanageable amount of audio, such as a one-hour newscast or a three-hour seminar. By generating a full-text index of the
archive using the boundaries found in the previous story detection component, we are able to provide the user with a concise body of text that answers their query.

Unlike the preceding seven components, which form a data pipeline through which the audio is processed prior to being deposited in the database, the information retrieval system is used interactively, via the database browser. The Rough'n'Ready IR system uses a full-text search system developed at BBN that uses an HMM-based model of document retrieval. This system, described in [7], is used in relevance-feedback mode to allow the user of the system to find documents that are similar to an exemplar.
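
The benefit of indexing story-sized units can be illustrated with a toy inverted index keyed on story identifiers, so a query returns individual stories rather than whole broadcasts. This boolean sketch is only a stand-in for the HMM-based retrieval engine of [7]:

```python
from collections import defaultdict

def build_story_index(stories):
    """stories: dict of story_id -> transcript text.  Returns term -> set of story ids."""
    index = defaultdict(set)
    for story_id, text in stories.items():
        for term in set(text.lower().split()):
            index[term].add(story_id)
    return index

def search(index, query):
    """Return the stories containing every query term (toy boolean retrieval)."""
    hits = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*hits) if hits else set()

stories = {"cnn-0612-story1": "the fda proposed new rules on smoking",
           "cnn-0612-story2": "the denver trial of timothy mcveigh continued"}
idx = build_story_index(stories)
print(search(idx, "smoking fda"))  # -> {'cnn-0612-story1'}
```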

System Architecture

The Rough'n'Ready system is composed of three primary components: the indexer, which is a staged pipeline of the previously described components; a server, which consists of a relational database system as well as the previously described full-text search system; and a client browser, which
allows the user to navigate through the database [Fig. 2].

[Fig. 2] Rough'n'Ready system architecture. The indexer pipeline (speaker segmentation, speaker clustering, speaker identification, speech recognition, name spotting, topic classification, story segmentation, and story indexing) turns the audio signal into an XML index. The server side holds the XML corpus behind a database uploader, an MS SQL Server metadata server, an IR index server, an audio compressor, and an audio server; the browser reaches it through MS Internet Explorer over a WAN, LAN, or local bus.
The indexer currently runs in an off-line mode, with processing taking approximately five times realtime on a single high-end PC to produce an index. An input waveform can be indexed by running a single script. The product of the indexing process is an XML document representing the transcription of the audio, along with the additional markup generated by the other components of the indexer. Each word in the XML representation is tagged with the time offset in the waveform at which the word occurred, to allow extremely precise playback of the audio.

The XML markup format is translated into a set of flat files, and loaded into a commercial relational database system (in this case, Microsoft SQL Server). The full-text indices and the audio waveforms are also copied to the server machine to be used by the browser.

An advantage of using XML is that it gives us a common language for data interchange. A user is able to view the Rough'n'Ready data in the native browser, or optionally, the data can be consumed by some other visualization system that can read the file format. Alternatively, a third-party system can read the data from the relational database, but with significantly less portability.

One of the primary goals of the Rough'n'Ready project was the rapid development of the browser interface. With this in mind, the current browser is built in Microsoft Visual Basic as a set of ActiveX controls. The browser currently operates as either a standalone application, or a downloadable control through Microsoft Internet Explorer.
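
The flattening of word-level timings into database rows can be sketched as below. The XML element and attribute names are hypothetical (the paper does not show the actual schema), and sqlite3 is used here purely as a stand-in for Microsoft SQL Server:

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical word-level markup illustrating per-word time offsets.
SAMPLE = """<episode source="CNN Morning News" date="1998-06-12">
  <turn speaker="Male 8">
    <word start="12.31">oklahoma</word>
    <word start="12.70">city</word>
  </turn>
</episode>"""

def load_words(xml_text, db_path=":memory:"):
    """Flatten the per-word time offsets into rows and load them into a relational
    database, mimicking the XML-to-flat-file upload step described above."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS words (speaker TEXT, word TEXT, start_sec REAL)")
    root = ET.fromstring(xml_text)
    for turn in root.iter("turn"):
        for w in turn.iter("word"):
            conn.execute("INSERT INTO words VALUES (?, ?, ?)",
                         (turn.get("speaker"), w.text, float(w.get("start"))))
    conn.commit()
    return conn

conn = load_words(SAMPLE)
print(conn.execute("SELECT * FROM words").fetchall())
```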

System Usage

Since we envision a potential user of the Rough'n'Ready system storing hundreds if not thousands of hours of indexed audio, we wanted to provide a flexible system that allows the user to engage in data discovery on the archive. By data discovery, we mean the incremental process of sifting through a large quantity of documents and collecting it into various piles of desired and undesired data. These piles can be sifted further, read or listened to in greater detail, or processed by information extraction or other visualization systems.

The data is initially organized by the source of the data and the date and time it was collected. The archive can be browsed as though it were a hierarchically organized file system of documents [Fig. 3].

[Fig. 3] Drilling down through the data, the user can browse the various stories that have been extracted, which are represented as leaves on the tree. Each of these stories is given a title made up of its constituent topics: in the example shown, the first story in the CNN Morning News episode is "Denver (Colo.) trial: Oklahoma City bombing, April 19, 1995: Oklahoma: Bombings: Political crimes and offences: Criminal justice, Administration of: Terrorism: McVeigh, Timothy James: Conspiracy".

[Fig. 4] A close-up view of the story is also shown. On the left-hand side of the screen, the speakers are identified, both known and unknown. Each unknown speaker is identified with a unique tag, such as "Male 8" or "Female 6", and this tag is maintained consistently throughout the episode. On the right-hand side, the extent of the story is shown. If the user wishes to hear the underlying audio, they can select a speaker turn, a story scope, or a region of text, and play back the audio associated with that region. In the center of the screen is the transcript of the story, with the extracted names highlighted and
color-coded. The line breaks in the transcript correspond to the speaker turns – no attempt is made to hypothesize sentence boundaries, and punctuation is arbitrarily added at the end of every turn.

At any level in the archive hierarchy, the user can right-click on a node and get a summary of the information extracted from stories further down in the hierarchy (for instance, the topics, speakers, or locations). In Fig. 5, we can see a roll-up, summarized by count, of all the topics in the current archive.

[Fig. 5] Roll-up of all topics in the current archive, summarized by count.

If the user wishes to use a particular piece of extracted information in a query, they can double-click on the name or topic and it will automatically be moved to the query system. The query mechanism allows two forms of queries: quasi-boolean queries of the format allowed by the AltaVista search engine [8], and relevance feedback
queries, where the user gives an entire story as an exemplar and the system locates up to five other similar stories for the user.

For AltaVista-format queries, the user is allowed to use "+" to mandate the appearance of a keyword, "-" to prevent the appearance of a keyword, and double quotes to indicate that proper names should be searched for as one continuous keyword. Returned documents are scored similarly to AltaVista's scoring system, with one point given for a keyword that appears once, and a maximum of two points for a keyword that appears two or more times, thus preventing frequent tokens such as "CNN" from overwhelming a query.
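
A minimal sketch of this scoring rule follows; the parsing and the substring-based counting are simplifications for illustration, not the system's actual query engine:

```python
import re

def parse_query(query):
    """Split a quasi-boolean query into (required, forbidden, plain) keyword lists.
    Quoted phrases are kept as a single keyword."""
    tokens = re.findall(r'([+-]?)"([^"]+)"|([+-]?)(\S+)', query)
    required, forbidden, plain = [], [], []
    for sign1, phrase, sign2, word in tokens:
        sign, term = (sign1, phrase) if phrase else (sign2, word)
        {"+": required, "-": forbidden}.get(sign, plain).append(term.lower())
    return required, forbidden, plain

def score_story(text, query):
    """One point for a keyword that appears once, two points for two or more
    occurrences, so frequent tokens cannot dominate the score."""
    required, forbidden, plain = parse_query(query)
    text = text.lower()
    if any(text.count(t) == 0 for t in required) or any(text.count(t) > 0 for t in forbidden):
        return 0
    return sum(min(text.count(t), 2) for t in required + plain)

print(score_story("the fda held hearings on teenage smoking and smoking ads",
                  '+smoking -baseball "FDA"'))  # -> 3
```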

[Fig. 6] A query for stories containing the topic "Smoking" and the organization "FDA".

In the example query above (Fig. 6), the user has searched for stories containing the topic "Smoking" and the organization "FDA", and the system has returned two stories in a new folder in the tree. Subsequent searches are limited in scope to the most recently returned folder of stories, unless the user deletes the folder or manually changes the scope of their query.

One obvious problem with keyword-based queries is that the user can very rapidly become dead-ended due to overspecification of their search. If the user chooses the wrong search terms, they will be unable to find the documents they desire, even if they are present in the archive. The search system can attempt to be smart and add synonyms or related words to the user's query, but this can result in a lack of precision, with too many false positives being returned.

One solution to this is to use traditional relevance feedback. The user has the option of specifying an entire story to the query system. When this is done, all the words in the story are fed into the full-text search engine, which returns the five documents sharing the most terms with the seed document.

[Fig. 7] Results of a relevance-feedback query seeded with the first story from Fig. 6.

In the example in Fig. 7, we've given the system the first story from the "Smoking and FDA" query for a relevance feedback operation. The full-text search system has returned five stories. The first one is the seed story, since it has the most terms in common with itself. The second one happens to be the second-ranked story from the boolean query. The remaining three stories, however, are stories on highly similar topics that weren't found with the boolean query mechanism.

It should be emphasized that this mode of search becomes particularly important when the document source is text with errors introduced by a speech recognition system. Because of speech recognition errors, highly relevant documents may fall through the cracks of a boolean search, but they are more likely to be found via relevance feedback, since they will contain other words in common that are recognized correctly.
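
A simple term-overlap ranking illustrates the relevance-feedback behavior described above; it stands in for the HMM retrieval model of [7] and uses invented story identifiers:

```python
def relevance_feedback(seed_id, stories, n=5):
    """Feed all the words of the seed story to the search engine and return the n
    stories sharing the most terms with it (the seed naturally ranks first)."""
    seed_terms = set(stories[seed_id].lower().split())
    overlap = {sid: len(seed_terms & set(text.lower().split()))
               for sid, text in stories.items()}
    return sorted(overlap, key=overlap.get, reverse=True)[:n]

stories = {"story1": "fda rules on tobacco and teenage smoking",
           "story2": "tobacco industry responds to fda smoking rules",
           "story3": "weather forecast for the weekend",
           "story4": "congress debates tobacco settlement"}
print(relevance_feedback("story1", stories))  # -> ['story1', 'story2', 'story4', 'story3']
```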

Annotation

There is no particular reason, however, that the database has to be browsed in a read-only fashion. The training data for the Rough'n'Ready indexer is currently fairly static. Annotating more speech data, additional names for the name spotter, or
additional topics for the topic classifier is a separate, offline process using dedicated annotators. However, since the current annotation process is relatively simple and does not require any in-depth linguistic knowledge, it seems logical that the end-user of the archive should be enlisted to help provide the training data. This makes sense because the consumer of the data is likely to be the most familiar with it, and will be able to provide topics, identify speakers, and so on.

The current Rough'n'Ready system includes some basic speaker annotation capabilities. If the user encounters a speaker currently marked as unknown, they can step through a relatively simple wizard that plays segments of data tagged with the same identifier (such as "Male 5") and asks them to confirm that each is the same as the first speaker. Once enough data has been accumulated (three to five minutes), the system trains a new speaker model and reprocesses the rest of the archive off-line to include the new speaker. It is also possible to add extra training data for speakers that have particularly weak performance, improving their models.

There is no particular reason that this technique could not be extended to include proper name tagging, marking of new vocabulary words for the recognizer, or identification of new topics for the topic classifier. The latter is particularly important, since it is unlikely that an end-user could find a ready-made set of topics for their own meetings or teleconferences. For some problem domains, it may be sufficient to have a small set of topics (3-4 instead of the current 5,500).
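
The confirm-and-accumulate loop of the speaker wizard can be sketched as follows. The archive format, the confirmation callback, and the training hand-off are hypothetical placeholders, not the actual wizard:

```python
def confirm_speaker(archive, unknown_tag, real_name, confirm, min_seconds=180):
    """Walk every segment tagged with an anonymous label (e.g. "Male 5"), ask the user
    to confirm it is the same person, and once roughly three to five minutes of
    confirmed audio has accumulated, hand it off for speaker-model training."""
    confirmed, total = [], 0.0
    for segment in archive:                 # segment: dict with tag, start, end (seconds)
        if segment["tag"] != unknown_tag:
            continue
        if confirm(segment):                # callback: play the audio, ask yes/no
            confirmed.append(segment)
            total += segment["end"] - segment["start"]
        if total >= min_seconds:
            return {"speaker": real_name, "segments": confirmed}  # ready to train a model
    return None                             # not enough confirmed data yet

archive = [{"tag": "Male 5", "start": 0.0, "end": 95.0},
           {"tag": "Female 2", "start": 95.0, "end": 140.0},
           {"tag": "Male 5", "start": 140.0, "end": 260.0}]
print(confirm_speaker(archive, "Male 5", "Sam Donaldson", confirm=lambda seg: True))
```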

Future Problem Domains

Rough'n'Ready currently provides a quick and easy way to mine data from traditional audio sources of broadcast news, such as television and radio. However, the same technologies can be applied to virtually any audio source that can be recorded. A natural source of important audio is meetings, both traditional person-to-person meetings in a room and ones that use telecommunications, such as videoconferences. In fact, with the increase in multimodal communication, having an audio-centric database will be a valuable asset. Voice-over-IP methods of communication, such as IP-based teleconferences, will become increasingly common and easy to record.

Moving away from broadcast news to meetings will make some aspects of recognition, summarization, and indexing easier. For a particular meeting or teleconference, the speaker identification problem should be simpler, because single speakers will dominate the audio, there is likely to be channel separation between speakers, and a record of all the participants in the meeting will generally be available. On the other hand, each new problem domain will certainly require additional training data. Meeting speech, whether face-to-face or through a telecommunications medium, is significantly more fluid than broadcast news, and is still an open area of research.

However, collecting new training data need not be a significant burden in time or cost. Language model training data (needed to build the vocabulary and the speech grammar) can be found in a wide variety of sources, as long as those sources are reasonably similar in style to the speech that will be processed. Email and other similar corporate documents could likely be used to augment a base training set, assuming privacy concerns can be addressed. Current training methods for other language model data, such as name tagging, do not require in-depth linguistic knowledge, only familiarity with a simple tool and the ability to follow a set of annotation guidelines. And it is likely that much of the annotation will be done by the users of the system on the fly, to "bootstrap" the system from a base training model.

Conclusion

Recent advances in speech recognition and natural language technologies are making it increasingly possible for corporations and other organizations to build huge archives of audio, and to browse, search, summarize, and extract information from that audio quickly and accurately. The ability to save audio into a self-organizing archive and to search it will improve the performance of many collaborative efforts where the speed of information dissemination and assimilation is of the utmost importance. Rough'n'Ready is an example of a
first-order system that creates this capability. In the future, tools like Rough’n’Ready will be combined with other visualization and search systems, both new and traditional, to make our audio as searchable as text, if not more so.

Acknowledgements

This work was supported by the Defense Advanced Research Projects Agency under contract F30602-97-C-0253.

References

[1] Michael Lesk, "How Much Information Is There In the World", Getty Information Institute, http://www.ahip.getty.edu/timeandbits/ksg.html

[2] Philip Morrison and Phylis Morrison, "The Sum of Human Knowledge?", Scientific American, July 1998.

[3] Francis Kubala, et al., "The 1997 Byblos System Applied to Broadcast News Transcription", Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, February 1998.

[3] Dan Bikel, Scott Miller, Rich Schwartz, Ralph Weischedel, "Nymble: a High-Performance Learning Name-Finder", Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31 – April 3, 1997, pp. 194-201.

[4] David Miller, Richard Schwartz, Ralph Weischedel, Rebecca Stone, "Named Entity Extraction from Broadcast News", Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Herndon, VA, February 1999.

[5] Toru Imai, Rich Schwartz, Francis Kubala, Long Nguyen, "Improved Topic Discrimination of Broadcast News Using a Model of Multiple Simultaneous Topics", Proceedings of ICASSP '97, Munich, Germany, April 1997, pp. 727-730.

[6] Rich Schwartz, Toru Imai, Francis Kubala, Long Nguyen, John Makhoul, "A Maximum Likelihood Model for Topic Classification of Broadcast News", Proceedings of Eurospeech '97, Rhodes, Greece, September 1997, pp. 1455-1458.

[7] David R. H. Miller, Tim Leek, Richard Schwartz, "BBN at TREC7: Using Hidden Markov Models for Information Retrieval", 7th Text Retrieval Conference, National Institute of Standards and Technology, Gaithersburg, MD, November 9-11, 1998.

[8] AltaVista Help, http://www.altavista.com/av/content/help.htm

[9] Mark Maybury, Andy Merlino, "Multimedia Summaries of Broadcast News", International Conference on Intelligent Information Systems, December 8-10, 1997.

[10] A. Hauptmann, R. Jones, K. Seymore, S. Slattery, M. Witbrock, M. Siegler, "Experiments in Information Retrieval from Spoken Documents", DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, February 8-11, 1998.

[11] A. Hauptmann, M. Witbrock, "Informedia: News-on-Demand Multimedia Information Acquisition and Retrieval", in Intelligent Multimedia Information Retrieval, Mark T. Maybury, Ed., AAAI Press, 1997, pp. 213-239.