Multimedia Information Access

Arjen P. de Vries¹
Database Systems Group, Faculty of Computer Science
University of Twente

September 29, 1995
¹ Summer Intern at the Digital Cambridge Research Lab
Abstract
The problem of information overload can be addressed by applying information retrieval systems to the huge amount of available data. With this application in mind, the design of a multimedia information access system is discussed. Information on radio and television can be retrieved using the audio track. A design problem for a multimedia filtering system using speech recognition is introduced and a prototype system is developed on top of the INQUERY information access system. The problem of automatic segmentation is explained and an implementation of audio segmentation into silence, speech, music and other is evaluated.
Contents

1 INTRODUCTION
2 PROBLEM STATEMENT AND MOTIVATION
  2.1 The Problem
  2.2 Outline of a Solution
  2.3 Multimedia Information Access
  2.4 The Indexing a Sea of Audio Project
  2.5 The Thesis Problem Analyzed
PREFACE

. . . I had been away six months. When the cart I was riding in dropped me a mile or so from home I felt like turning back. I was afraid. Afraid that things would be different, that I wouldn't be welcome. The traveller always wants home to be just as it was. The traveller expects to change, to return with a bushy beard or a new baby or tales of a miraculous life where the streams are full of gold and the weather is gentle. I was full of such stories, but I wanted to know in advance that my audience was seated. . . .
Jeanette Winterson

October 6th, 10.34 pm, Boston airport. No place to live, little money and nobody I knew. A big challenge. The start of my thesis project was representative of the rest of these six months. Trouble with a real estate agent. One day in December, my supervisor left Digital. I was the only one working on the project from then on. No speech recognizer was available. Besides, I was not used to working in a UNIX environment. Fortunately, everything worked out fine. I will never regret going to CRL and I learnt a great deal, socially and technically. Although I am not planning on leaving for Boston to stay there, I will definitely go back to visit as soon as I can save the time and money. I miss the city, dining out, visiting rock clubs up to three times a week and ordering barbecue wings late at night. Since I am back home, I have not been shot at all!

I guess I should thank people in my preface. Because I do not want to risk the possibility of forgetting too many people, I will start by saying that it really does not matter whether you are mentioned here or not. People who mean something to me should know very well themselves and I am certain they will not blame me if I forgot to mention their names.
Let me try and start with some people in Boston. First I should thank Larry Stewart for hiring me, although he did not work at CRL any more when I arrived and I never really met him. I believe he started his own company, Open Market, where almost everybody who worked at CRL and did not leave for Microsoft works now. Including my supervisor of the first months, who got me started on the project, Tom Levergood. Special thanks go to T.V. Raman. He was not only a prize-winning replacement supervisor, but a very good friend as well. Of course I should mention Aster, who never leaves without Raman. If this were a multimedia document, you could hear her bark and watch her wag her tail after grabbing a cookie. The last I will mention from CRL is Maureen Gobiel. She helped me find an affordable room among people, drove me to work every day and fought some of my fights against the bureaucrats at payroll. Almost forgot to mention: Chris, Jean-Manuel and Bill, I confess I really miss upsetting the guys at the coffee corner downstairs . . .

Of the American friends I met outside the laboratory, I will only name Jane, who made me feel more comfortable in that weird culture. Thanks for explaining to me that I was `about as sharp as a marble'.

Back to the Netherlands. Firstly, I want to thank my supervisor Henk Blanken for all the support and positive feedback. Next, I want to mention my parents because they always stimulated my studies. Besides, my father being a real technical person and my mother bringing home a computer for her work when I was ten or eleven are probably the main reasons why I ended up in computer science. At parties, they still talk about my `reading' big manuals to find out what I wanted to know while I did not even know English. Finally, I want to thank all my friends who sent letters, postcards and electronic mail. These messages kept me going, dudes! Special thanks go to Milou, Lian, Edwin, Gerie, Henk, Dominique and Marjolein. You did not just keep in contact, but also shared your feelings. Last but not least, I want to thank Kristel for cheering me up and making me smile over and over again¹.

Arjen P. de Vries
August, 1995
¹ We still have to eat ice cream because your lock was oiled before this report had been finished . . .
Chapter 1

INTRODUCTION

This Master's thesis deals with the development of a multimedia information access system. The research for this project has been done at the Cambridge Research Laboratory (CRL) in Boston. CRL is one of the laboratories of Digital Equipment Corporation (DEC). One of the strong points of CRL is the audio group. Interesting applications have been developed to ease the incorporation of audio in today's working environment. It is believed that audio will play an important role in computing in the coming decades. Promising results in the field of speech recognition made us believe that a real-time speech recognition system could be used to identify interesting information on radio and television. This thesis project investigates the state of the art in the research community and develops a foundation for multimedia information filtering projects at CRL.

The thesis addresses the problem of information overload in multimedia environments. The next chapter will introduce this problem. A multimedia information filtering system is proposed that could provide a solution to this problem. The position of this system in a typology of information filtering systems is given.

Chapter three gives the necessary research background. The chapter contains an overview of research in different fields relevant to this project. The research topics of the thesis problem are mentioned. The development of speech recognizers using Hidden Markov Models is explained.

Chapter four discusses the problem of probabilistic information retrieval and the approach based on inference networks. I will introduce the INQUERY system developed by the University of Massachusetts at Amherst. The suitability of this system for multimedia information access to speech data is evaluated.

Chapter five deals with the problem of automatic segmentation of the continuous stream of incoming multimedia data into documents. The importance of the
problem is explained. An approach to segmentation using neural networks has been implemented and is evaluated in this chapter.

Chapter six gives an overview of the prototype system I developed for Digital. The implementation is illustrated with some examples. User interaction with the system is described, and extensions to the system are suggested.

The final chapter presents a summary of the thesis. Recommendations for further research are given.
Chapter 2

PROBLEM STATEMENT AND MOTIVATION

2.1 The Problem

In the Netherlands, some years ago, we heard the very popular Dutch song Een eigen huis by Rene Froger & Het Goede Doel over and over again on the radio. The song is about how the singer possesses everything he wants but forgot how to simply be happy. One line in the lyrics mentions that he even has three VCRs so there will not be a television program he will miss. The song was written at a time when only three local channels were available, so he would be able to tape every program being broadcast. When I first heard this line, my reaction was that he still would not have more than 24 hours in a day. So how would he know which programs to look at?

Nowadays hundreds of information sources are available. New information is added every day. People are talking about information highways and the internet is getting more and more popular. The achievements in communication systems result in a society where everybody can publish information. However, we do not have the time and capabilities to process all these data. A lot of information exists that people would be interested in, if they only knew that it was available.

Let me illustrate this problem with a real life example. Living in the States, I got bad news from the home front. My sister had to give up her studies because of chronic fatigue syndrome. This disease is not commonly known and I was surprised to watch a late night news item - in Boston - about American research into this disease. I called my sister to explain the new insights that were reported and she discussed these with her doctor. Of course she would never have known about this news report if I had not been living in Boston that half year and tuned in
to that particular channel that day.
2.2 Outline of a Solution

The example of the previous section does not only illustrate the problem of information overload, but also suggests a possible solution to the problem: an automated system checking all data flows for interesting information would help the user decide what news items to watch and what articles to read. It was only a coincidence that my sister had a brother watching Boston cable and calling her about the interesting news report. In principle, a computer system could find all the news reports about chronic fatigue syndrome in Germany, France or even China. A computer system that can tell whether a piece of information is interesting or not could prevent her from missing all these news reports.

The approach of using a computer system to deal with information overload is not new. Since the early sixties, people have investigated the storage and retrieval of automatically indexed documents [vR79], [Sal89]. Recent advances in communication and computer systems have made the problem of information overload more important. Anybody who has ever `surfed the internet' knows how much useless data is around. People easily get lost in large information spaces [Les89].
2.3 Multimedia Information Access

Research in information access is usually restricted to text processing. Little research has been done to investigate the information system solution for information overload applied to multimedia data. However, a lot of television and radio channels are broadcasting thousands of programs a day. How to deal with all this multimedia information is an open problem.

A lot of expectations have been raised by speech recognition research. Laboratories reported 95% recognition rates on 5,000 word vocabularies [RHL94]. One of the latest IBM commercials shows a secretary dancing with her boss, who tells her not to worry about the reports she still has to type: does she not know IBM has made typing obsolete, as she can just dictate the article to the computer? The Dutch magazine Automatiseringsgids recently stated that speech recognition has finally resulted in useful products [vS95].

These optimistic sounds put the development of a multimedia information access system for radio and television within reach. If you can transform audio into descriptive text, it is possible to index the text representing the audio. Most
information that you can see on television is also captured in the audio track accompanying the video frames. This means that if one could search in the audio of a video tape, one would be able to answer most queries about that video. Of course, somebody looking for a red car crossing the railroad in a particular movie would not be helped. However, somebody looking for information on cars made by Renault would probably find an appropriate documentary, because the term Renault will be mentioned by the announcer explaining about the cars. The problem of searching in multimedia would be reduced to the problem of searching in large text databases if we accept that we cannot retrieve information that is represented only visually.

This system would notify us of interesting information sources on television and radio. It could be used in many other situations as well. Of course, secret services of all countries would like to pay a lot of money for a system that can keep track of many audio channels. Another interesting application of this system would be to archive all meetings of the board of a large company. The formal decision process could be tracked afterwards, to study how mistakes or very good decisions were made. An application in health services was suggested to me by a children's psychiatrist (who sounded very enthusiastic about this idea). The system could record all sessions of the psychiatrist with the child and help research to determine optimal treatment and evaluate applied therapy. Nowadays, the only record is a logbook kept by the psychiatrist.
2.4 The Indexing a Sea of Audio Project

Digital Equipment Corporation (DEC) is one of the largest computer companies in the world. The Cambridge Research Laboratory is the smallest and youngest research laboratory of Digital Corporate Research. Because most traditional research topics like networking and operating systems were already covered by the other labs, research at CRL naturally focused on relatively new topics. The three different groups at CRL work on parallel systems, visualization and audio.

I was hired by the audio group to work on the Indexing a Sea of Audio project. The goal of the project is to develop a system that does the things I described in the previous section: recognize the audio and index the recognition results to provide access to a database of audio and video data. Unfortunately, most people of this group left Digital just before or after I arrived. The result of this brain drain was that during my stay the speech recognition research had to be restarted. After three months my supervisor at CRL left the company too and I was the only one working on the multimedia indexing project. A speech recognition system to work with would not be available in time.
However, CRL provided a closed caption decoder [Fed]. Closed captions are like subtitles, but sometimes they contain slightly more information, e.g. `music' or `knock knock'. Closed captions are primarily aimed at hearing-impaired people. They are added to the television signal manually. These closed captions can be thought of as the output of an almost perfect speech recognizer. Some of the television companies broadcast closed captions together with their audio and video signal. The standard itself only describes the electrical signal, not the content that the broadcasters should provide. Example broadcasters are the Public Broadcasting System and CNN Headline News. With the latter, captions are paid for by the US Department of Education.

Other building blocks could justify continuation of my project without a speech recognizer. Digital had developed a fast indexing tool ni2 that could be used for indexing the descriptive text. The company has a license for one of the best information retrieval systems in the world, INQUERY [CCH92]. The application store-24 is built on top of the AudioFile (AF) toolbox and was implemented using Tcl and Tk [Ous94]. Store-24 is a ring buffer that stores the last twenty-four hours of a radio channel. Because this application is implemented on top of AudioFile [LPG+93], everybody in the laboratory (and in principle in the whole world) can listen to this radio channel, and you never miss the news because you can simply reposition the audio in time.

CRL was not the only place in Digital where people worked on the information overload problem. INTELLECT, an advanced technology group (ATG), investigated business opportunities in information filtering. They used the INQUERY system to build a filtering environment for textual data like internet news. Although the group was closed down one month before I left CRL, their product will probably continue to be developed at one of the business groups in the company.

The problem for me to solve at CRL was to construct a framework integrating all these building blocks that could eventually result in filtering multimedia based on the audio track and the closed captions if available. In October 1995, a speech recognizer will probably be available within CRL. Hopefully, the foundations laid in my thesis project will be used to implement a working prototype multimedia filtering system. Considering the immaturity of the necessary technology, it will certainly not be possible to identify all the information the user might want to know about. However, it will be possible to identify some information sources the user would have missed otherwise.
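The ring-buffer idea behind store-24 is easy to sketch. The following Python fragment is only a minimal illustration of the technique, not the actual AF/Tcl implementation; the sample rate and integer-sample representation are assumptions:

    # Minimal ring-buffer sketch of the store-24 idea: keep the last N
    # seconds of an audio stream, overwriting the oldest samples.
    class AudioRingBuffer:
        def __init__(self, seconds, rate=8000):
            self.rate = rate
            self.size = seconds * rate           # capacity in samples
            self.buf = [0] * self.size
            self.head = 0                        # next write position

        def write(self, samples):
            for s in samples:
                self.buf[self.head] = s
                self.head = (self.head + 1) % self.size

        def read_at(self, seconds_ago, length):
            # Reposition in time: fetch `length' samples starting
            # `seconds_ago' before the current write position.
            start = (self.head - seconds_ago * self.rate) % self.size
            return [self.buf[(start + i) % self.size] for i in range(length)]

    # A twenty-four hour buffer; replay the ten seconds you just missed:
    radio = AudioRingBuffer(seconds=24 * 3600)
    missed = radio.read_at(seconds_ago=10, length=10 * radio.rate)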
2.5 The Thesis Problem Analyzed

According to [Loe92], the landscape of filtering applications and usage scenarios is defined by eleven dimensions divided into four groups (explained further in the rest of this section):
- User disposition: user type and privacy protection.
- Time scale: information lifetime, source availability patterns, filter delivery patterns, user usage patterns and user feedback mode.
- Information delivery: information media characteristics, information transport architecture and user equipment.
- Information content: information content attributes.
The set of dimensions in this typology can be used to classify a particular filtering system. In this section, I will use a subset of these eleven dimensions to classify the system designed in this thesis. The dimension of privacy protection remains out of the scope of this thesis because I did not make design decisions with respect to privacy that would limit the conclusions of the project. For example, adding a privacy protection scheme would not require the design decisions with respect to the information media characteristics to be reconsidered. This does not imply that I consider privacy aspects to be less important than the aspects I studied. To the contrary, I would never use an information filtering application that allows other people to study my queries. Governments would definitely be interested in the possibility to study any person's interests. Users should be protected from Big Brother scenarios.

[Loe92] divides the users of information filtering systems into two types: casual users and proactive users. Examples of proactive users are the users of libraries and information banks. Their information need can be expressed rather precisely and they do not really object to spending time on an iterative process to formulate the information need. The user type we deal with in our project is referred to as casual. Casual users do not have immediate and specific information needs like proactive users. Their information need is not clearly defined and they are probably not willing to engage in lengthy interactions with the system to articulate current information needs and provide explicit feedback. If a system demands too much effort from the casual user, it will not be used at all.

The information lifetime varies from days up to centuries. Weather reports will lose their information value to the user after a day. This implies that the weather report has to be delivered to the user within a day of the broadcast. A documentary about chronic fatigue syndrome will keep its value for years though. A movie
starring Marilyn Monroe will always be nice to watch. We do not aim to deal with time critical information like stock information that has to be delivered to the users within minutes to be of any use. For our goal, the delivery of information to the user is not a real-time problem. The delivery process should not be confused with the analyzing process though. The interpretation of the incoming data should keep up with the incoming data. This definitely is a real-time problem.

The system design is focused on asynchronous filter delivery following a user defined query. Such a long running query is usually referred to as a user profile. I ignore synchronous filtering, where the user expects a more or less direct response to the submitted query. Although [Loe92] refers to synchronous and asynchronous delivery as different processes, [BC92] argues that they are basically the same. This argument implies that the filter delivery dimension is not a different dimension but just a different view.

With respect to the usage pattern dimension, the system is classified as irregular. The filtering has to work twenty-four hours a day. The system continuously produces a list of interesting documents for each user. User interaction takes place on an irregular basis though. The user may have returned from work and study the list the system produced for that day. The frequency and session duration will vary depending on the amount of time available. Maybe the user does not use the system at all during a very busy week, maybe he will keep track of the list continuously on a long Sunday afternoon. Mechanisms for user feedback must work off-line, as user interaction does not take place on a real-time basis.

The project addresses information filtering of multimedia data. The filters that are studied merely use audio as input to the system. This decision has the consequence that the demands on processing power and storage capability of the system are high. The available storage limits the amount of history that the system can keep. The storage problem will not be further addressed in this thesis; refer to [ZPD90], [GC92] and [FT93] for more information about this topic. I will assume that a database (or maybe just a file system) is available that has unlimited storage capacity. I also assume the availability of a network between users, sources and the filtering system. Clearly, user equipment should be advanced enough to play audio and video data.

In section 3.5 I discuss the currently available automatic analysis techniques. The collection of techniques implemented in the system limits the dimension of information content attributes. In the case of a text-based environment, full-text indexing can be applied. If a perfect speech recognizer were available, the same would hold for speech data. The level up to which analysis has been achieved determines the system characteristics to a large extent.
Chapter 3

RESEARCH BACKGROUND

3.1 Introduction

This chapter describes the research background of multimedia information filtering. A very general overview of the related research fields is given. The first section discusses information access and dissemination. Next, a short introduction to speech recognition is given. The issues regarding retrieval of multimedia data are introduced in the following section. The next section discusses the representation of multimedia and gives an overview of the state of the art in automatic analysis of data. Some research projects that aim to build automatic multimedia information filtering or archiving systems are mentioned last.
3.2 Information Access and Dissemination

Research groups have worked on information retrieval since the early sixties. An information access system has the function of leading the user to those documents that will best enable him to satisfy his need for information. Over the years, information retrieval systems have moved from storing relatively short abstracts towards storing large collections of documents of varying size. Nowadays, people prefer to talk about information access because the field has developed to embrace more than just retrieval of plain text. For example, most hypertext systems use navigational methods to help the user find the relevant data [Con87], [HS94], [HBvR94].

Information access research is focused on systems dealing with text data. The research can roughly be divided into three categories: representation, retrieval techniques and acquisition of information needs. The major models developed
in information access research are the exact-match Boolean model and the best-match vector space and probabilistic retrieval models [Sal89], [vR79], [TC91b]. I will describe the differences between these three models very briefly. The rest of this section is devoted to the probabilistic information retrieval model.

The Boolean retrieval model splits the database into two sets of documents, one of retrieved documents and one of non-retrieved documents. The lack of an evaluation of relevance by the system results in lower effectiveness than best-match models. The vector space model assigns vectors to documents in a multidimensional space, the dimensions of which are the terms representing the document. The vectors of query and document are compared using similarity measures. The terms of the vectors are weighted based on statistical distributions of the terms in the database. A ranked list of documents is the result of a query; a minimal sketch of this ranking process is given at the end of this discussion. The probabilistic models are based on the Probability Ranking Principle (PRP). This principle states that the retrieved documents should be ranked in the order of their probability of relevance to the query, given all the evidence available. The retrieval problem is thus reduced to the problem of estimating this probability of usefulness. In both of the latter models, Boolean queries can be expressed. The difference between the form of the query and the underlying model should be understood.

Some problems have been identified with probability theory in information access [Coo94]. The PRP turned out to be a design heuristic instead of a principle, because counterexamples could be created. The PRP assumes that document usefulness is a binary property, which is an oversimplification of the true problem situation, for the property of interest is a matter of degree. However, some real advantages of the probabilistic information retrieval model can be mentioned. Usually, less experimentation has to be done to optimize a probabilistic retrieval rule than a nonprobabilistic one (e.g. to find the best value of some particular constant). Some rules based on theory do not even need experimental optimization, because application of the theory underlying the model led to just one possible rule. Other rules contain parameters that have to be determined experimentally, but the experimentation is efficient. Last but not least, the meaning of probability values can be understood by an end-user. It is much harder to understand the value of cosine measures and fuzzy logic quantities. The experiments of the recent TREC conferences have included both probabilistic and nonprobabilistic systems. While the performance of the probabilistic systems has been very respectable, it has by no means proved itself consistently superior to other methods. The future will have to prove whether the theoretical foundations of probabilistic information retrieval are an advantage over nonprobabilistic models.
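As a small illustration of the vector space comparison mentioned above, the following Python sketch ranks documents against a query by cosine similarity over raw term-frequency vectors; real systems add the statistical term weighting described in the text:

    import math
    from collections import Counter

    def cosine(query_vec, doc_vec):
        # Cosine of the angle between two sparse term-frequency vectors.
        dot = sum(query_vec[t] * doc_vec[t] for t in query_vec if t in doc_vec)
        norm = (math.sqrt(sum(v * v for v in query_vec.values()))
                * math.sqrt(sum(v * v for v in doc_vec.values())))
        return dot / norm if norm else 0.0

    def rank(query, documents):
        # Return the documents as a ranked list, best match first.
        q = Counter(query.lower().split())
        scored = [(cosine(q, Counter(d.lower().split())), d) for d in documents]
        return sorted(scored, reverse=True)

    docs = ["cars made by renault", "speech recognition of the audio track"]
    print(rank("renault cars", docs))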
To test the performance of information retrieval systems, recall and precision are measured [Gor90]. Recall is defined as the number of relevant documents retrieved divided by the number of relevant documents in the collection; it measures how much of the relevant data in the collection can be found. Precision is the number of relevant documents retrieved divided by the total number of documents retrieved:

    recall = N_{relevant,retrieved} / N_{relevant}
    precision = N_{relevant,retrieved} / N_{retrieved}

A trade-off between recall and precision is unavoidable [BG94]. If you retrieve one relevant document, the precision is 100% but recall is very low. If you retrieve all documents, recall is 100% but precision has dropped. High recall is not always needed, since people commonly do not need all relevant items. However, a retrieval system should be able to achieve high recall efficiently. Test collections are used to measure recall and precision of information retrieval systems [JvR76]. Examples of such test collections are the CACM and the TIPSTER collection. Sets of test queries are defined by experts, who also manually determine which documents are relevant and which documents are not relevant for each query. The previously mentioned TREC conferences are very large empirical studies of retrieval performance over numerous retrieval systems.
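Given a set of relevance judgements, both measures are straightforward to compute; a minimal sketch:

    def recall_precision(retrieved, relevant):
        # retrieved: set of document ids returned by the system
        # relevant:  set of document ids judged relevant by the experts
        hits = len(retrieved & relevant)
        recall = hits / len(relevant) if relevant else 0.0
        precision = hits / len(retrieved) if retrieved else 0.0
        return recall, precision

    # Retrieving a single relevant document: perfect precision, low recall.
    print(recall_precision({"d7"}, {"d7", "d9", "d12", "d40"}))  # (0.25, 1.0)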
The terms information dissemination and information filtering refer to the delivery of information to users who have submitted a profile describing their interests. The dissemination model has become increasingly important due to the rapid advances in wide-area information systems. The simplest form of such a system is the mailing list. Examples of more advanced information filtering systems are SIFT [YGM] and NRT [SvR91]. Both systems provide dissemination of articles in USENET news. SIFT uses the vector space model for the filtering process. The key idea is to treat the user profiles as documents and the incoming news documents as queries. The same approach is taken by INROUTE, the filtering system of the INQUERY information retrieval engine. The delivery of articles is done via electronic mail. NRT uses a form of probabilistic information retrieval. The system looks more like a conventional retrieval system than SIFT. Most attention is given to the usage of relevance feedback to dynamically search the information space. The mouse can be used to drag interesting articles into a sort of basket, the tick window, that will be used to create a follow-up query. Relevance feedback will be explained after I have introduced the notion of intelligent software agents.

Information filtering research developed the notion of intelligent software agents [Mae94]. An interface agent is a piece of software that can take independent actions on behalf of the user's goal without explicit intervention by the user. Learning algorithms are used to build agents that can learn the behaviour of the user by looking over his or her shoulder. The agents built so far use a nearest neighbour matching algorithm on large vectors representing the incoming message. In a filtering system for electronic mail it was shown how agents can help a user to deal with large amounts of mail messages. The system learns the actions of the user. If I always read messages by `knieuwen' immediately but ignore messages from `kantine', the system will warn me when mail by `knieuwen' is received, but I will not be disturbed for messages from `kantine'. The importance of a learning algorithm lies in the fact that it is very hard, if not impossible, for an end-user to define his information need or behaviour.

It is believed that the use of nearest neighbour algorithms will lead to problems when the number of users increases. The music recommending system Ringo is used to investigate the scalability of the techniques used. By matching people's music rating vectors, recommendations are made using the vectors of `near' people; a minimal sketch of this idea closes this section. Ringo can also tell you how mainstream you are. In December 1994, more than three thousand people used this system. Thanks to Ringo, I bought a very good compact disc by Buffalo Tom. Matching people's musical tastes seems to work in practice.

The idea of providing mechanisms that help the user define his information need can also be found in the area of relevance feedback [Fro94]. Relevance is the judgement by the end-user of the output of an information access system. Automatic analysis of the retrieved documents that the user judged relevant with respect to his information need reveals new query terms and judges the importance of the terms of the old query. A new query is composed and fed back into the retrieval system. Stepwise refinement helps the user to find more information. An example of the usefulness of relevance feedback (after a paper by Thinking Machines) is mentioned in [SvR91]. During a search process for articles about the Chernobyl disaster, the relevance feedback system added terms like nuclear, radiation and Russia. The next retrieval action found an article dated the day before the story broke: a Finnish radiation detection centre had measured unusually high radiation levels coming from Russia. Relevance feedback added the terms needed to find an article that would have been missed otherwise.

In [BC92] the authors recognized that information access and information dissemination are basically identical processes. The major difference is that information filtering usually works on a dynamic stream of input data, whereas information access deals with relatively static databases. For the retrieval and filtering processes, the same ideas can be applied in both environments.
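The following sketch illustrates the nearest-neighbour matching idea behind a recommender like Ringo; the rating vectors, the distance measure and the rating threshold are assumptions for illustration, not the actual system:

    def nearest_neighbours(target, others, k=2):
        # Rank the other users by mean squared difference over shared items.
        def distance(profile):
            shared = set(target) & set(profile)
            if not shared:
                return float("inf")
            return sum((target[i] - profile[i]) ** 2 for i in shared) / len(shared)
        return sorted(others, key=lambda user: distance(user[1]))[:k]

    def recommend(target, others):
        # Suggest items that near neighbours rated highly but that the
        # target user has not rated yet.
        suggestions = {}
        for _, profile in nearest_neighbours(target, others):
            for item, rating in profile.items():
                if item not in target and rating >= 4:
                    suggestions[item] = max(suggestions.get(item, 0), rating)
        return sorted(suggestions, key=suggestions.get, reverse=True)

    me = {"Buffalo Tom": 5, "Nirvana": 4, "Abba": 1}
    crowd = [("u1", {"Buffalo Tom": 5, "Pixies": 5, "Abba": 1}),
             ("u2", {"Nirvana": 4, "Sonic Youth": 4}),
             ("u3", {"Abba": 5, "Roxette": 5})]
    print(recommend(me, crowd))  # ['Pixies', 'Sonic Youth']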
3.3 Speech Recognition

Speech recognition has finally evolved into working products. At least, that is what the papers and commercials want us to believe. In this section, I will explain
very briefly how speech recognizers work and what the limitations for our project are. For this short introduction to the field of speech recognition, I used [RS78] and [Cox90]. The purpose of this section is not to be a complete guide to speech recognition.

I will start by explaining some terms that are commonly used in the speech literature. The study of the sounds of speech is called phonetics. Speech sounds can be classified into three distinct classes according to their mode of excitation. The description I will give of the physical process is not exact, but will serve the goal of giving an idea. Voiced sounds are produced by forcing air over the vocal cords. Fricative or unvoiced sounds are generated by producing turbulence in the mouth. Plosive sounds are produced when you build up air pressure in your mouth and then suddenly release it. Formants are the resonance frequencies of the vocal tract.

Most languages, including English, can be described in terms of a set of distinctive sounds, called phonemes. For American English, about 42 phonemes exist that can be divided into four major classes: vowels, diphthongs, semivowels and consonants. Vowels are sounds like in bat, but and boot. Diphthongs can be viewed as combined vowel sounds that evolved into new sounds. Examples are bay, buy and boy. Semivowels are the w, l, r and y sounds. They are called semivowels because, like vowels, these sounds are influenced a lot by the context in which they occur. The consonants are all other sounds. Most common consonants fall in the subcategories of nasals (m, n), stops (b, p) and fricatives (f, v).

One of the best speech recognition systems is the SPHINX-II system, developed at Carnegie Mellon University. This large vocabulary continuous speech recognition system has achieved a 95% success rate on generalized tests for a 5000 word general dictation task. Speech recognition is viewed as a pattern recognition process. I assume the reader is familiar with the general idea of pattern recognition¹. Phonetic units are modeled during training and used during recognition. Hidden Markov Models (HMM) are the most common approach to model phonetic units.

A hidden Markov model is defined by a set of states, an output alphabet, a set of transition probabilities and a set of output probabilities. For speech recognition, the output alphabet that should be chosen is the collection of input vectors for the recognition process. These input vectors, usually called input features, are often spectral representations of the speech signal (like the first eight Fourier coefficients over a short time frame). An output probability is associated with each state transition. This probability determines whether the transition outputs a symbol or not. Transition probabilities are defined for each pair of states. A hidden Markov model can be considered as a stochastic system that generates random sequences according to a distribution determined by the set of output and transition probabilities.
¹ A short introduction to pattern recognition for segmenting speech is given in section 5.2.2.
The remaining question is how we can use these random sequences for speech recognition. Each word in the vocabulary of the speech recognizer gets its own hidden Markov model. The probability that a particular word model outputs a certain sequence of symbols can be calculated. During recognition, the model with the highest probability of outputting the sequence that has to be recognized is determined. This sequence contains the input features of the speech data that has to be recognized. The word whose model was found to have the highest probability of outputting exactly this sequence is chosen as the recognized word. In continuous speech recognition, it is not possible to find the word boundaries by looking at the signal. The Viterbi search algorithm is used to search the word models in parallel. Language models are used to limit the search space. Training algorithms have been developed to adapt the transition and output probabilities of a model in a way that increases the probability of producing the same sequence as the training example. To develop a speech recognizer, word models are built using phoneme models. First, the phoneme models have to be trained. Next, the phoneme models are concatenated to form a word model. This model of a complete word is then trained with examples.

Hidden Markov models are just a tool to perform a pattern recognition process. For speech recognition, this tool seems to work pretty well. However, a speech recognizer built with this technology clearly has a restricted vocabulary that must be known in advance. If the word model is not known by the recognizer, the model of another word will be chosen as the best fit. Speech recognition based on phoneme models alone does not work either: the problem that makes phoneme level recognition impossible is the influence of the context on some of the phonemes. More problems occur if the speech signal has music in the background or two people speak at the same time. Therefore, speech recognition based on hidden Markov modeling is not robust.
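The probability that a word model outputs an observed symbol sequence can be computed with the forward algorithm. The sketch below uses the common formulation with per-state output probabilities and discrete symbols; the toy model parameters are assumptions:

    def forward(obs, states, start_p, trans_p, emit_p):
        # P(observation sequence | model), summed over all state paths.
        alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
        for symbol in obs[1:]:
            alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states)
                        * emit_p[s][symbol]
                     for s in states}
        return sum(alpha.values())

    # Toy two-state word model over the output symbols 'a' and 'b'.
    states = ("s1", "s2")
    start = {"s1": 1.0, "s2": 0.0}
    trans = {"s1": {"s1": 0.6, "s2": 0.4}, "s2": {"s1": 0.0, "s2": 1.0}}
    emit = {"s1": {"a": 0.8, "b": 0.2}, "s2": {"a": 0.1, "b": 0.9}}

    # Recognition picks the word model with the highest probability of
    # producing the observed feature sequence.
    print(forward(("a", "a", "b"), states, start, trans, emit))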
3.4 Multimedia Information Retrieval

Most research in information access and dissemination has focused on textual data. Only recently have some research groups made attempts to retrieve more than just text documents. The QBIC project realized querying of images by content [NBE+93]. Special features have been extracted to indicate color, texture and shape. Measures are defined to make judgement possible on the extracted features. The QBISM prototype implements a three dimensional medical image database system [ACF93]. Other research into image retrieval includes [BFM+94], [CLP94] and [SC94]. The approach in all these projects is to identify features that can be measured to represent the images. These features
are then indexed to enable retrieval by content.

[TBCE94], [TBC94] and [CHTB92] report on information retrieval on texts that have been generated using Optical Character Recognition (OCR). The effects of the noisy data on the retrieval process have been studied using a collection of articles that were scanned. The retrieval performance was measured on this collection and compared with the results on a manually edited version of the database. The experiments in [CHTB92] found that high quality OCR devices cause almost no degradation of the accuracy of retrieval, but low quality devices applied to collections of short documents can result in significant degradation of performance. A minimal accuracy rate with which an information retrieval system can cope seems to exist. The use of a speech recognizer to produce descriptive text is analogous to the use of OCR devices in these experiments. The results seem to predict that speech recognition should be of high quality to make good information retrieval possible.

Techniques for retrieval from voice messages were first described in [RCL91]. The problem dealt with is the categorization of speech input utterances according to a predefined notion of topic or message class. It was recognized that conventional methods applied to texts could be used for the classification process. The vocabulary was very small: only 110 different keywords could be recognized and classified. The same problem holds for current speech recognition: the recognizer can only deal with a predefined vocabulary. In a dynamic environment like the daily news, many names and even new words will have to be recognized.

In [GS92] an information retrieval model for speech documents is introduced that is in principle vocabulary independent. The key idea is that a small number of indexing features was identified such that a reasonable amount of training data would be sufficient to train the hidden Markov models used by the speech recognition process. Indexing features are the symbols with which the index is built. These can be words, but other features can be used as well. The indexing features identified by [GS92] are based on sequences of phoneme classes and can be identified in text files as well. In section 3.3 I mentioned four major classes of phonemes. [GS92] only uses two different classes, vowels (V) and consonants (C). The indexing features they proposed are V+, V+C+, C+V+ and C+V+C+. The letters A, E, I, O, U and Y are treated as vowels, the other letters as consonants. They give the example that for the phrase `die neue franzosische' the features `die', `neue', `fra', `anzoe', `oesi' and `ische' would be identified. The trick is that they do not aim to recognize the words correctly. The speech recognizer should just identify which features occur in the text. These features are only parts of the words.

Simulations using a standard test set showed that retrieval experiments using these indexing features achieved better performance on recall and precision than conventional information retrieval features. This is
consistent with the results of experiments with digrams and trigrams in research into retrieval techniques [Wil79]. The simulations seem to prove that information retrieval from speech documents is feasible by using these features to index the speech data. However, there is a flaw in these experiments. According to the example given in the article, knowledge of the word boundaries is assumed in the simulations. Current speech recognition systems for continuous speech cannot identify the word boundaries on a phoneme level. Therefore, a search process is used over a predefined vocabulary of words. It is not certain whether the features could be identified without problems. Moreover, the lack of word boundaries will change the results of the simulations. Many more features will be identified in a particular text. In their example phrase they forgot that `ieneue' and `euefra' would be selected as indexing features as well. Even without word boundaries, the indexing features could be selective enough to make retrieval possible. Further research is definitely needed. Simulations should be performed that do not use the word delimiters found in text files, such as spaces. If these results are promising, hidden Markov recognizers for the suggested indexing features should be tested.
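The phoneme-class features are easy to experiment with on plain text. The sketch below enumerates substrings whose vowel/consonant pattern matches one of the four feature classes listed above; it over-generates compared to the selection rule of [GS92] and is only meant to show the effect of dropping word boundaries:

    import re

    VOWELS = set("aeiouy")
    CLASSES = re.compile(r"^(V+|V+C+|C+V+|C+V+C+)$")

    def candidate_features(text, max_len=6, word_boundaries=True):
        # Collect substrings (length 2..max_len) whose V/C class pattern
        # matches one of the four feature classes.
        chunks = text.lower().split()
        if not word_boundaries:
            chunks = ["".join(chunks)]
        feats = set()
        for chunk in chunks:
            letters = re.sub(r"[^a-z]", "", chunk)
            for i in range(len(letters)):
                for j in range(i + 2, min(i + max_len, len(letters)) + 1):
                    sub = letters[i:j]
                    pattern = "".join("V" if c in VOWELS else "C" for c in sub)
                    if CLASSES.match(pattern):
                        feats.add(sub)
        return feats

    # Dropping the word boundaries yields extra, cross-boundary features:
    with_b = candidate_features("die neue franzoesische")
    without_b = candidate_features("die neue franzoesische", word_boundaries=False)
    print(sorted(without_b - with_b))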
Even less research has been done into the retrieval of video. The massive amounts of data make retrieval by content almost impossible. Most proposals use manually added descriptions to search the video data [DWG], [SSJ92]. Closed captions can be used to generate descriptive text at the receiver side, but these are manually added by the broadcaster. A motion picture can be modeled as a composition of many scenes, where each scene is composed of multiple shots. In [DLM+94], a real-time algorithm is described that can be applied to produce storyboards of a video. A storyboard is a sequence of frames that captures the sequence of shots and scenes. For example, in a news report with two items, the storyboard could contain five frames. The first, third and final frame would show the reporter talking to her public. The other two frames would represent the two news items. One frame could show the courtroom of the Simpson case, the other could show a single roof in a large sheet of water during the floods in the Netherlands. A system using this algorithm for retrieval purposes is described in [LAF+93]. [DG94] investigates the retrieval of video based on spatial as well as motion characteristics.

3.5 Multimedia Representation and Automatic Analysis

Computers are basically deaf, blind and ignorant. Representations for media content are needed. We want to go beyond keyword indexes. Manual
indexing will not be possible for all information sources that we want to follow. We definitely need automatic analysis of multimedia data. This section discusses available techniques that can be applied in real-time. All these algorithms should be implemented in the multimedia filtering application to achieve the best results possible. Evidence from different representations of the data could be combined if these algorithms were available. I distinguish between the media text, audio, images and video. For each medium, I will describe algorithms for segmentation and algorithms for content analysis.

Segmentation is necessary to split the incoming continuous stream of multimedia data into documents. A document is a collection of data that belongs together at a higher semantic level. A news report on CNN and a commercial selling coke are two examples of such documents. A more extreme example of something that I would call a document is a coke can, as it contains information about both the manufacturer and the chemicals used inside. A document can contain different types of media. Content analysis focuses on the analysis of the information content of these documents. It should be noticed that segmentation information can be viewed as a special type of content information. If segmentation of an audio track identifies a three minute fragment as a song, this fact reveals more than just the borders.

Textual information in ASCII has been studied a lot. The syntax is very easy to recognize. Therefore, research into the understanding of information has focused on textual information. Artificial intelligence research has tried to model the world so that computers would be able to interpret the information and maybe even deduce facts automatically. Automatic classification of texts has been studied. The Parlevink project at the University of Twente is focused on text understanding.

An interesting article about the automatic segmentation of text is [Hea94]. It describes the segmentation algorithm TextTiling, a rather simple algorithm that uses term repetition as a feature to find topic changes. The text is subdivided into pseudo sentences of twenty words each (to avoid normalization problems). Six of these pseudo sentences are viewed as a block. The similarity between each two subsequent blocks is calculated with a cosine measure. Boundaries are determined by changes in the similarity measures; a minimal sketch is given below. The algorithm is found to produce segmentation that corresponds well to human judgment of the major subtopic boundaries in thirteen lengthy texts. A panel of readers was used to get the human judgement. The algorithm proved to be better at determining topic changes than author markup.
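A minimal sketch of the TextTiling procedure just described, with twenty-word pseudo sentences and six-sentence blocks; the boundary rule (a dip below the mean similarity) is a simplification of the algorithm in [Hea94]:

    import math
    from collections import Counter

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def texttile(text, sentence_len=20, sentences_per_block=6):
        # Blocks of six twenty-word pseudo sentences, compared pairwise.
        words = text.lower().split()
        size = sentence_len * sentences_per_block
        blocks = [Counter(words[i:i + size]) for i in range(0, len(words), size)]
        if len(blocks) < 2:
            return []
        sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
        mean = sum(sims) / len(sims)
        # Place a topic boundary after every block whose similarity to
        # its successor dips below the mean of the similarity curve.
        return [(i + 1) * size for i, s in enumerate(sims) if s < mean]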
The motivation for this segmentation of texts is the fact that nowadays longer documents are indexed by information retrieval systems. However, most systems use the same similarity measures to compare the query and the document as with short abstracts. The segmentation into several parts, each dealing with a single topic, is then used to improve retrieval. Another application of this segmentation is during presentation of the retrieved documents. The document is shown as a bar divided in pieces according to the identified parts. The parts are coloured using the similarity of the query to that part. For example, if the bar is very intensively red at one spot, it is clear that only one section is of interest to the user. If the bar is coloured over the full length, this indicates that the article or book probably deals with the same topics as the query.

For the segmentation of audio, several algorithms can be used. People have developed speaker change detection and speech emphasis detection [Aro94], [CW92]. In section 5.2.1, an algorithm for segmentation into silence, speech and music is described. Algorithms for content analysis of audio are speaker identification [RS78], word spotting [WB91], [WSB92] and speech recognition.

Most image analysis algorithms do not run in real-time. For example, it is almost impossible to recognize a car in arbitrary images. For video, this means that the only content information we will have available is closed captions. Speech recognition of the audio track seems the solution for getting content information when closed captions are not available. Segmentation of video is possible using the sizes of the compressed frames. This algorithm [DLM+94] was already mentioned in the previous section.
3.6 Current Research Projects

Finally, I would like to refer to three major research initiatives regarding the indexing and retrieval of multimedia data. These projects have in common that they all attempt to integrate access to different media. One project is the digital libraries project at Xerox PARC. Experiences with speech technology and document analysis are combined in the context of information access. In Europe, the umbrella organization Idomeneus focuses on multimedia information systems. Idomeneus covers most research groups that work on multimedia research topics. Very ambitious is the Informedia project at Carnegie Mellon University [CSW94]. This project also emphasizes the necessity of searching and retrieval technology in large digital libraries. The project integrates speech, image and language understanding for the creation and exploration of digital libraries. The initial database will be built using television fragments. The video will be annotated automatically with text transcripts using the speech recognition system Sphinx-II. Image understanding technology will be used to segment the video clips via visual content. This segmentation will be improved through the use of information supplied by the transcript and language understanding. Much attention is given to the interaction of the user with the
system.
Chapter 4

PROBABILISTIC INFORMATION RETRIEVAL

4.1 Introduction

In section 3.2, the foundations of probabilistic information retrieval were treated. Information retrieval is viewed as an inference process in which we estimate the probability that a user's information need, expressed as one or more queries, is met given a document as evidence. Of course, uncertainty has to be taken into account. A retrieval model should support multiple representation schemes, allow combination of the results of different queries and query types, and facilitate flexible matching between the terms or concepts used in the queries and those assigned to documents [TC91b].

The notion of multiple representation schemes refers to the fact that documents can be represented in different ways. Information from the abstract of a paper should be treated differently from information that is only given in the complete paper. A given query will retrieve different documents when applied to different representations. A description of an information need will usually generate several queries, using different strategies and capturing different aspects. These queries are known to retrieve different documents for the same underlying information need. In practice, a poor match between the vocabulary used to express queries and the vocabulary used to represent documents is a major cause of poor recall.
4.2 The Inference Network Model

A retrieval model that has the properties mentioned in the previous section is the inference network model. I will describe this model following [TC91b] and [RC]. Both references deal with the information retrieval system INQUERY, which will be discussed in the next section.

The inference network model is a probabilistic retrieval model in that it follows the Probability Ranking Principle. A probabilistic model calculates P(relevant | document, query), which is the probability that a document is relevant given a particular document and query. The inference network model takes a slightly different approach in that it computes P(I | document), which is the probability that a user's information need is satisfied given a particular document. This small difference opens the possibility to combine different queries and document representations in the retrieval process.

The inference network model is based on Bayesian inference networks [Pea89]. The heart of Bayesian techniques lies in the inversion formula

    P(H | e) = P(e | H) P(H) / P(e)

which states that the belief we accord a hypothesis H upon obtaining evidence e can be computed by multiplying our previous belief P(H) by the likelihood P(e | H) that e will materialize if H is true¹. P(H | e) is referred to as the posterior probability or posterior, and P(H) as the prior. The importance of the formula is that it expresses the quantity P(H | e), which is usually hard to assess, in terms of quantities that can often be drawn directly from experimental knowledge.

[Pea89] gives the example of a very crowded casino where roulette is played as well as some game of two dice. Imagine you are standing at a roulette wheel and hear the employee at the neighbouring table say `twelve'. You want to know whether that twelve came from a pair of dice or from a roulette wheel. P(Dice) is the probability that I am standing at a dice table. P(Roulette) is the probability that I am standing at a roulette wheel. It is very hard to assess directly the probability P(Dice | Twelve), the probability that I was standing next to a dice table given the evidence that I heard the employee say twelve. However, determining the probability that someone throws twelve with two dice is straightforward (P(Twelve | Dice) = 1/36). The prior probabilities P(Dice) and P(Roulette) can be judged by reading the number of roulette wheels and dice tables in the brochure of the casino.

¹ The denominator P(e) is merely a normalizing constant, which can be computed by requiring that P(H | e) and P(¬H | e) sum to unity: P(e) = P(e | H) P(H) + P(e | ¬H) P(¬H).
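The casino example is easy to check numerically. In the sketch below, the priors and the single-zero roulette wheel (37 pockets) are assumptions for illustration:

    # Did the `twelve' come from a dice table or from a roulette wheel?
    p_twelve_dice = 1 / 36        # two dice: only 6+6 yields twelve
    p_twelve_roulette = 1 / 37    # assumed single-zero wheel: 1 pocket of 37
    p_dice, p_roulette = 0.5, 0.5 # assumed priors, e.g. from the brochure

    evidence = p_twelve_dice * p_dice + p_twelve_roulette * p_roulette
    posterior_dice = p_twelve_dice * p_dice / evidence
    print(posterior_dice)         # about 0.51: this evidence barely helps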
Bayesian networks are directed, acyclic graphs (DAGs) in which nodes represent propositional variables or constants and edges represent dependence relations between propositions. If a proposition represented by node p causes or implies the proposition represented by node q, we draw a directed edge from p to q. The node q contains a link matrix that specifies P(q | p) for all four possible values of the two variables. When a node has multiple parents, the link matrix specifies the dependence on the set of parents and characterizes the dependence relationship between that node and all nodes representing its potential causes. Given a set of prior probabilities for the roots of the DAG, these networks can be used to compute the probability or degree of belief associated with all remaining nodes.

The basic document inference network is shown in figure 4.1.

[Figure 4.1: The basic inference network]

This network consists of a document network and a query network. The document network represents the document collection using a variety of document representation schemes. It is built once for a given collection and its structure does not change during query processing. It consists of document nodes (the dj's) and concept representation nodes (the tm's). These concept representation nodes can be divided into several subsets, each corresponding to a single representation technique that has been applied to the documents. For example, if a document title contains the phrase `information filtering' and this phrase also occurs in the document text, two nodes with a distinct meaning will be created. The assignment of a representation term to a document is modelled with a directed arc from the document node to the representation node.

Each representation node contains a specification of the conditional probability associated with the node given its set of parent nodes. In principle, computation
In principle, computation of this probability would require O(2^n) space for a node with n parents. Since only one document is considered at a time, a simple estimate can be used that is very similar to the tf.idf weights used in many previous information retrieval experiments [Sal89], [vR79]. The formula used is:

    0.4 + 0.6 * (log(tf + 0.5) / log(max_tf + 1.0)) * (log(N/f) / log N)
where tf is the term frequency of the representation term in the document, the normalization factor max_tf is the maximum term frequency in the document, f is the number of documents in which the term occurs and N is the number of documents in the collection. The final factor of the formula is known as the inverse document frequency (idf). Intuitively, the formula can be understood by realizing that a term occurring often in a single document is more important than a term occurring two or three times in every document of the collection.

A query network consists of a node representing the user's information need (I) and one or more query representations (the q_k's) expressing this information need. A query network is built for each information need and can be changed during query processing, for example by relevance feedback [HC93] or the application of a thesaurus.

A query is processed by constructing the query network and attaching it to the document network. The probability that the information need is met given a particular document d_j is computed by setting the value of the d_j node to true. The probabilities associated with each node in the query network are calculated given this document as evidence. These probabilities are propagated through the network to derive a probability associated with the node representing the user's information need. This is repeated for each document in the collection and the resulting probabilities are used to rank the documents. Some simplifying assumptions made for efficient implementation of query networks are given in [TC91a].
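The weight is easy to transcribe directly. A minimal sketch in Python; natural logarithms are assumed (the bases cancel in the ratios) and the example numbers are hypothetical:

    import math

    def inquery_belief(tf, max_tf, f, N):
        """Belief in a representation term for one document, following
        the tf.idf-style estimate quoted above: tf is the term frequency
        in the document, max_tf the maximum term frequency in the
        document, f the number of documents containing the term, and N
        the collection size."""
        tf_part = math.log(tf + 0.5) / math.log(max_tf + 1.0)
        idf_part = math.log(N / f) / math.log(N)
        return 0.4 + 0.6 * tf_part * idf_part

    # A term occurring 5 times in a document whose most frequent term
    # occurs 20 times, appearing in 100 of 10,000 documents:
    print(inquery_belief(5, 20, 100, 10_000))  # ~0.57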
4.3 INQUERY

The INQUERY system implements the inference network model [CCH92]. It has been shown to be one of the best information retrieval systems in the world [CC93]. The possibility of combining evidence from different sources during query processing makes the system very powerful. Relevance feedback using inference networks is effective and the overall implementation of the system is efficient. The underlying theoretical framework can be consulted when performance does not match expectations. The main tasks performed by the system are creation of the document network,
creation of the query network and use of the networks to retrieve documents.

During document network creation, documents are parsed and representation terms are identified. The default parser identifies stemmed words that are not in the stoplist as indexing terms. Stemming conflates words with different endings to their common root; Europe and European should result in the same representation concept. A stoplist is a list of very common words that carry little information. The parser relies on a subset of SGML to identify the parts of the document to index, like title and text. This parser can be extended or even completely replaced with another parser. The system provides a flexible architecture to add concept recognizers to the parsing process. Such a recognizer can e.g. add a special representation concept #ut; documents that contain University of Twente, Twente University or UT would all be assigned this special concept. Creating the document network involves building the compressed inverted files that are necessary for efficient performance. The network has to be organized in a way that allows evidence to be propagated through the network rapidly and efficiently.

Queries use natural language or a structured query language. Natural language queries are converted to the structured query language by applying the sum operator to the terms in the query. Other operators are and, or, not, weighted sum, max and a proximity operator. Queries are stemmed and stopped like the document collection. Query transformers, which can be added by the user, can call concept recognizers to identify extra concepts like #ut.
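For illustration, the conversion of a natural language query could look as follows. This is only a sketch: the #sum(...) syntax shown for the structured query language is an assumption, as is the toy stoplist, and stemming is omitted:

    def to_structured_query(natural_language_query,
                            stoplist=("on", "the", "a", "of")):
        """Convert a natural language query to a structured one by
        applying the sum operator to its terms, as described above."""
        terms = [t for t in natural_language_query.lower().split()
                 if t not in stoplist]
        return "#sum( " + " ".join(terms) + " )"

    print(to_structured_query("information filtering on television"))
    # -> #sum( information filtering television )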
4.4 INQUERY and Indexing a Sea of Audio

One of the major questions in this thesis is whether a full-text information retrieval system like INQUERY can be used as a basis for multimedia information filtering. This section summarizes the benefits and the shortcomings of using a system based on inference networks for this task.

Although the INQUERY system has been built for text representations, nothing prevents the underlying mechanisms from being applied to other representation concepts. For terms that can be expressed in an ASCII representation, the original system can be used without a problem. The indexing features identified in [GS92], further described in section 3.4, could be used by slightly changing the parser. Stemming and stopping should of course be turned off, but these were options already. The parser can be changed easily because it has been implemented with the standard Unix utilities lex and yacc. The inference network model allows several representations of the same document to coexist; an extra set of representation terms can simply be added to the document network.

A real drawback of using INQUERY for our project is the fact that the current
system cannot deal with incremental indexing. In the Indexing a Sea of Audio project, the collection will be extended continuously. Traditionally, information retrieval research focused on large static databases. For each term, the inverted list contains weights and locations for each document in which the term occurs. These inverted lists are typically laid out contiguously in a flat inverted file with no gaps between the lists. Adding to such inverted files requires expensive relocation of growing lists and careful management of free space in the file. Most information retrieval systems, including INQUERY, simply rebuild the inverted file after adding the new documents. Therefore, the costs of an update are proportional to the size of the collection, not to the size of the added documents. Fortunately, the new version of INQUERY due in June 1995 (teasingly called INQUERY '95 by the people in the Digital Intellect group) has solved this problem. Instead of using flat inverted files in a custom-built system, they rebuilt the system on top of the persistent object store Mneme [BCCM94]. Using the features of this object store, the inverted lists are stored in buckets that are sized dynamically. The new technique permits incremental indexing in a fast and efficient way [BCC94].

The performance of INQUERY on imprecise data is not known. As discussed in section 3.4, experiments with input from optical character recognition showed that retrieval performance measured with precision and recall drops when the recognition quality is too low [CHTB92]. The formula used to calculate the probability of usefulness of a document is based on term frequencies. These frequencies will have wrong values if the recognition process is not error-free, which can result in a degradation of recall and precision. Research has to be done to determine whether speech recognition performs well enough to produce output that can be used reliably in a retrieval system. Knowledge of common mistakes made by the recognition process could help improve retrieval. Experience with preprocessing scanned documents as input for INQUERY is reported in [TBC94]; the results were promising. I think that the inference network model is a good candidate to deal with such error models. Another level of representation nodes could be added, with the probability of correct recognition on the arcs. The mathematical implications of this idea for the correctness and the computational complexity of the model still have to be analyzed thoroughly.

I believe that INQUERY will turn out to be a good tool for CRL's information access system. It provides a very good retrieval system that can be applied to the output of a speech recognizer. It is capable of dealing with gigabyte collections and has proven to perform well on recall and precision tests. A more practical advantage is the fact that the Intellect group at Digital planned to use INQUERY for the filtering of text information. At this moment, it is not certain whether the information filtering project started by the Intellect group will continue to exist within
the company. If it does, the Indexing a Sea of Audio project can exchange a lot of experience with this group. A final advantage of using INQUERY is the excellent support given by the University of Massachusetts at Amherst, where the system originated.
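Returning to the error-model idea mentioned above: a very rough sketch of how recognition confidence might attenuate a term belief is given below. The multiplicative combination rule and the floor at the default belief of 0.4 are my own assumptions, not part of the published inference network model:

    def noisy_term_belief(term_belief, p_correct):
        """Attenuate the belief in a representation term by the
        probability that the recognizer transcribed it correctly.
        Multiplying the two, with a floor at the default belief 0.4,
        is an assumed combination rule used only for illustration."""
        return max(0.4, term_belief * p_correct)

    # A strong term weakened by an unreliable recognition:
    print(noisy_term_belief(0.8, 0.6))   # 0.48
    print(noisy_term_belief(0.8, 0.95))  # 0.76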
Chapter 5 AUTOMATIC SEGMENTATION

5.1 Introduction

A major step in the automatic analysis of a continuous stream of multimedia data is the segmentation of this stream into documents [SSJ92]. The problem of segmentation can be divided into two smaller problems that I will address separately. The first problem concerns the recognition of segments based on the data type. The other problem is the recognition of structure in the data using content information. I introduce the term syntactical segmentation to refer to the first type of segmentation, because the decision to start a new segment is not based on the information content of the data but purely on the type of the information presented. The second type of segmentation will be referred to as semantical segmentation. In this case, the purpose is to separate the continuous flow of input data into documents the way a human would.

This distinction between syntactic and semantic segmentation can often be found in the approaches followed in document analysis [CdV93]. In the MULTOS information system [BCF95], each document is described by a logical, a layout and a conceptual structure. The logical structure determines the arrangement of logical document components like title, introduction and chapter. The layout structure defines the layout of the document content in output; it contains components like pages and picture frames. The conceptual structure allows a semantics-oriented description of the document: it refers to a document as a business letter, a manual or a brochure. The papers about MULTOS do not explain how these three different structures are found given a paper document.
One step in the process of finding these structures is the segmentation of the data. I will explain the relation of the MULTOS structures to my definitions of syntactical and semantical segmentation.

The recognition of the beginning and end of a `chapter' in an audio document is semantical segmentation. An audio chapter is a segment of the audio stream that is comparable to a chapter of a book. Without knowledge of the content, you can only guess at the borders. The analysis of a paper document to find chapters is easier than the analysis of an audio document. Each new chapter can be identified relatively easily by looking for a large title at the top of a new page. If such a header is noticed, a new chapter has begun. In the case of a paper document, the recognition of a chapter is simply syntactical segmentation: the explicit layout reveals the necessary information to find the borders.

The words syntactical and semantical refer to the type of information used in the segmentation process. In the MULTOS model, the terms logical, layout and conceptual refer to the type of document structure. Automatically determining the logical structure of an audio document may require knowledge of the content of the document; therefore, I use the term semantical segmentation for this process. Semantical segmentation does not have to recognize information that would be on the level of the conceptual structure of the document. The decision between news report and commercial can be postponed to the content analysis. Semantical segmentation should find the border between the news report and the commercial though. The syntactical segmentation of audio into silence, music and speech can be viewed as a layout structure of the audio. However, when one tune flows fluently into another tune, we notice immediately. If tunes in a news report are viewed as layout components, they can be compared to picture frames in a text: the two tunes would be two different frames. Syntactical segmentation would not notice the difference between the two tunes though. To recognize the layout structure of audio, you need content information as well.
5.2 Syntactical Segmentation

Speech recognition systems are not very robust. If you feed them music, they will not recognize that it is not speech you put into the system, but produce garbage results instead. To solve this problem, it is necessary to design an algorithm that can distinguish the data containing speech from the data containing other information.

This section describes the development of an algorithm to filter out the speech fragments from the incoming data. Although speech recognizers have reached
a level of performance that makes it possible to use them for our system, these systems assume that their input is speech. Little research has been done on dealing with real-life situations. Television and radio data in particular imply a completely new environment for these systems to work in. People have worked on problems with noisy telephone lines [RSA77] and background noise [MM80], [JMR94]. However, I am not aware of any research project that addresses the identification of the speech data in arbitrary documents containing audio, ranging from CNN news reports to talk shows with Ricki Lake.

In the rest of this section I will explain the syntactical audio segmenter I developed for the Indexing a Sea of Audio project. First, I will define precisely what the segmenter has to do. Following a short introduction to signal processing, I will report on experiments with the two approaches I investigated. Finally, I will evaluate the segmenter I chose to implement.
5.2.1 A syntactical audio segmentation problem

As explained previously, syntactical segmentation should keep speech recognizers from working on non-speech data and provide a basis for higher level segmentation. A very simple segmentation that achieves these two tasks is the classification of audio fragments into silence, speech, music and other.

Silence does not imply the absence of signal. To the contrary, as we proceed through this section, we will discover that most audio that we classify as silence is not silent at all. I define silence to be all signal containing no more information than punctuation. Examples of this class are breath noise like a deep sigh and light background noise like somebody laughing softly. All fragments where someone is talking will be labelled as speech. As soon as more than one person is talking, however, the audio will be classified as other, because even the most modern speech recognizers only deal with one speaker at a time. In case of background noise like a tune identifying the next program, we also choose other. Music applies to all cases that we would call music. However, I chose to label music with someone talking or singing over it as other. This turned out to improve classification, because the distinction between music alone, speech alone and music with a person speaking or singing is harder to make. Other contains all fragments with loud background noise, people speaking at the same time and simultaneous music and speech. It functions as a reservoir for all fragments that cannot be labelled silence, speech or music.

To develop a classification algorithm you need a set of test data and a set of
training data. The test data is used to check whether the classification algorithm also works on data that it has not seen before. Because nobody had tried to solve the problem of processing television audio tracks before, no standard test suite was available. I decided to manually segment ten minutes of local radio of the WBUR station and two sets of ten minutes of news reports and commercials of CNN Headline News to create my own test suite. These three ten-minute fragments were each split into two consecutive five-minute fragments, of which one was used for training and the other for testing. The amount of training and test data may look small, but it already took three days of ten hours each to segment this half hour of audio. Manually segmenting audio is very tedious work, and therefore I would like to mention the speech tools of the Oregon Graduate Institute: they really sped up and facilitated the work. The graphical display tools allow you to listen to the audio and watch the waveform at the same time. You can also generate segmentation files lined up with the audio signal. I wrote some scripts in Tcl to transform the output of my analysis algorithms so as to trick these speech tools into believing that, for example, the energy of a signal was an audio signal itself. This way I could vertically line up different representations of the audio fragment with the manual and automatic classification to evaluate the performance of the algorithm.

Two demands hold for the syntactical segmenter. First, the actual segmentation process should be faster than real-time: the algorithm has to keep up with the incoming continuous stream of audio data and should leave the other processes (recognition and indexing) enough time so that the complete system can operate in real-time. The second demand helps realize these time constraints: if the segmentation algorithm needs computationally expensive calculations like Fourier transforms, the segmenter should preferably use the same processing of the input vectors as the speech recognizer will.
5.2.2 Digital speech processing

I will give a short introduction to digital speech processing; more information can be found in the `bible' of speech processing [RS78]. Section 3.3 is an introduction to speech recognition systems.

The problem we are interested in falls in the category of hetero-association classification, which means that the output classes and the input classes are disjoint sets. In our application, the inputs are the samples of audio and the outputs are the decisions between silence, speech, music and other. Pattern recognition takes place in two phases. The first phase is feature extraction. The second phase is the classification process itself: using the feature vectors extracted in the first phase as inputs, an output class is assigned to the input vectors.
Feature extraction is necessary for two reasons. Firstly, it permits focusing on information within the signal that is important for discriminating between patterns of different classes. Secondly, it enables data reduction so that manipulation of patterns becomes computationally feasible. Commonly used features are energy, zero-crossings and all sorts of coefficients of Fourier transforms.

Speech processing is usually divided into time-based and frequency-based processing. In the first case, all algorithms operate on the raw speech signal itself. The features energy and zero-crossings fall in this category. Energy is ten times the logarithm of the mean-squared samples over a certain frame size; it is related to the power of the signal and measured in decibels. Zero-crossings is the number of times the signal changes from positive to negative; the number of zero-crossings is a rough estimate of the frequency of the signal. Another time-based feature used in speech processing is the autocorrelation coefficient, which is the correlation between adjacent samples. Frequency-based speech processing uses spectral representations of the speech signal. Techniques used to produce spectra are Fourier analysis, Linear Predictive Coding (LPC), Perceptual Linear Predictive (PLP) analysis and cepstrum analysis. Looking at a spectrum of a sound fragment, trained people can read the words that are spoken in the fragment. Speech recognizers typically use the first seven coefficients of PLPs or melceps as input vectors for the classification module.

The classification process can be implemented using different techniques. Statistical analysis applies a minimum distance rule under the assumption that the features are distributed according to the multidimensional Gaussian probability function [AR76]. Hidden Markov Modeling (HMM) is a strategy that is used a lot in the field of speech recognition [Cox90]. Neural net architectures have been applied to pattern recognition as well, and neural nets seem to be a good tool for speech recognition [Lip89].

The development of a classification process takes place in two steps: training and recognition modes [Cox90]. In training mode, models are built that encapsulate the decision process. In the case of statistical analysis, these models are rules stating that variables between estimated boundaries indicate a specific output class. With a neural network classifier, these models are stored in the coefficients belonging to the neurons. In an HMM, the models contain the state transition coefficients and the probabilities of emitting observations. During recognition mode, the trained models are used to make decisions based on the input vectors. If neural nets or HMMs are used for the pattern recognition system, the development process is iterative. After training mode, results in recognition mode on the training set are compared with the results on the test set. The more training data the models have seen, the lower the performance on the training set and the higher the
performance on the test set will be. When these two performances come close together, no more training is necessary. This effect can be explained because the network generalizes better after it has seen more training data; in the first iterations, it tends to learn special characteristics of the training data that do not hold in the full data set. Of course, the data in the training and the test set should be representative of the data that has to be segmented.

A speech signal usually consists of 8 kHz (low-quality) or 16 kHz (high-quality) samples. Numerous standards for the formats exist. The most used formats, µ-law and linear, are linear series of either eight or sixteen bit raw sample data. Some formats use a header containing descriptive information about the data. To store stereo audio, the usual approach is to store the samples for the left and the right channel alternately. For my experiments, I chose to use 16 kHz, 16 bit mono audio. The speech recognition group at CRL was working on a speech recognizer with high-quality audio input, so this high-quality signal will be available in the final system anyway. For storage purposes, one may always decide to sample down to 8 bit, 8 kHz data, saving a factor of four in disk space.
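The time-based features defined earlier are simple to compute. A minimal sketch in Python, assuming 6 ms frames of 16 kHz audio (96 samples per frame; the frame size is an assumption based on the 6 ms feature interval used later):

    import math

    def frame_energy(samples):
        """Energy in decibels: ten times the logarithm of the
        mean-squared samples over the frame, as defined above."""
        mean_square = sum(s * s for s in samples) / len(samples)
        return 10.0 * math.log10(mean_square + 1e-12)  # guard log(0)

    def zero_crossings(samples):
        """Times the signal changes from positive to negative, as
        defined above; a rough estimate of the frequency."""
        return sum(1 for a, b in zip(samples, samples[1:]) if a >= 0 > b)

    # A 440 Hz sine at 16 kHz over one 96-sample frame:
    frame = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(96)]
    print(frame_energy(frame), zero_crossings(frame))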
5.2.3 A first approach

Originally, I expected the problem of segmentation to be rather trivial. The literature reports algorithms for word boundary detection [JMR94], silence detection [GD88], speech/non-speech detection [MM93] and voiced-unvoiced classification [RSA77]. Word boundary detection is a necessary first step in word-based speech recognition. Silence detection has been studied for storage reasons. Speech detection is often applied in telephone systems to make better use of the available bandwidth. Voiced-unvoiced classification refers to the different types of excitation during speech; this classification process deals with very short intervals of speech and is used to find phoneme boundaries. To detect music, I expected band filtering that measures high-frequency activity to be quite effective, the underlying assumption being that music covers a much wider range of frequencies than plain speech. It all turned out to be harder than I expected.

I started to work on silence detection. The approach followed in the literature is to measure the energy of the audio signal and to claim silence as soon as it drops under a certain threshold. This thresholding is the common approach for systems that deal with short bursts of speech data. An example of such a system is the computer system answering the phone at mail order companies. You tell the computer via the phone the number of the article that you want more information about. People using the system give the information needed after the system asks for it; the rest of the time, the signal will just contain background noise.
Testing this approach on my hand-segmented CNN data proved that it was not that simple. Most pauses are not silent at all, so either the threshold would be so low that only one silence would be found in the total of twenty minutes, or the threshold would be so high that most fragments labelled as silence were not pauses at all. Unfortunately, the amount of noise on television and radio is so high that a simple thresholding algorithm does not work.

In [MM80] the modified Roberts noise detection algorithm is given to adaptively estimate the amount of background noise. The original algorithm was applied to radio communication for the US Air Force, where the noise estimate was used to suppress noise in the speech signal. I hoped I could use this estimate to adapt the silence threshold to the amount of noise in the signal. The algorithm is based on Roberts' observation that a 4-second histogram of the frame energies is bimodal. He found that putting the threshold between the two modes of the histogram classifies correctly between speech and noise most of the time. I implemented the algorithm, but it did not solve the problems with the television signal. The source of this problem is the fact that the energy of the audio at CNN is continuously high, so the adaptive noise estimation can almost never be invoked.
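A crude sketch of the bimodal-histogram idea is given below; the published modified Roberts algorithm is considerably more involved, and the mode and valley search here are simplifications of my own:

    def bimodal_threshold(frame_energies, nbins=32):
        """Histogram the frame energies over a window (4 seconds in
        Roberts' formulation), assume two modes, and put the
        speech/noise threshold at the valley between them."""
        lo, hi = min(frame_energies), max(frame_energies)
        width = (hi - lo) / nbins or 1.0
        hist = [0] * nbins
        for e in frame_energies:
            hist[min(int((e - lo) / width), nbins - 1)] += 1
        # Take the two highest-count, well-separated bins as the modes...
        first = max(range(nbins), key=lambda i: hist[i])
        second = max((i for i in range(nbins) if abs(i - first) > 2),
                     key=lambda i: hist[i])
        a, b = sorted((first, second))
        # ...and the emptiest bin between them as the threshold.
        valley = min(range(a, b + 1), key=lambda i: hist[i])
        return lo + (valley + 0.5) * width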
5.2.4 Development of a neural network

The previous section makes clear that syntactical segmentation of television audio is not an easy problem. Traditional approaches to detect silence fail because of the different environment. The algorithms in the literature are applied to telephone or radio communication between a sender and a receiver that both want to communicate with each other. Television and radio broadcasters have to fight to keep the attention of the potential listener; therefore, they keep talking or putting jingles on the air. The sender keeps sending information all day long. It is not clear when someone starts to watch a show. If the receiver is bored, he will switch channels. The receiver can tune in and out any time he likes, but the sender will earn more money if he has more people tuned in. The environment of operation for the algorithms is thus completely different from the traditional environment, where both sender and receiver are anxious to communicate.

However, if we can distinguish between different phonemes that are very close, it should not be too hard to distinguish between a yawn or cough and a fragment of speech or music. Speech recognition with neural networks has reached very good results on phoneme recognition. Because a trained network is really fast in deciding which class an input vector belongs to, a neural network segmenter will definitely fulfill the demand that processing has to be faster than real-time.

For a good introduction to neural networks, I refer to chapters 5 and 6 of [HKP91]. I will explain very briefly how a neural network calculates results.
Figure 5.1: A single neuron

The neural network model originated as a model of the human neural system. In figure 5.1 a single neuron is shown. The neuron calculates its output value O as a function of the weighted sum of its inputs x_i:

    O = f( sum_{i=1}^{n} w_i * x_i )
In the original model, the function f was one if the input sum was larger than zero and zero otherwise. Nowadays, for most applications the function is a sigmoid. A neural network consists of a collection of interconnected neurons. All weights w_i are initialized with random values; a learning rule updates the weights to come closer to the correct answer during training. You simply show the neural network examples of the function that it should learn and it will try to learn that function by adapting its weights. Feedforward networks are neural networks consisting of different layers; outputs from one layer are connected to the inputs of the next layer. The backpropagation training scheme is used to train these networks. After training has finished, a neural network simply has to multiply the inputs with the weights and sum the results within the neurons. This can be done very fast.

In [BCFV] a neural network for phoneme recognition is described in detail. I used some of the ideas of this neural network application as a basis for my syntactical segmenter. The previously mentioned OGI speech tools contain a simple but effective neural network development toolkit. After fixing a small bug in their software, I used their toolkit to train and test my neural network. For each experiment, large ASCII files containing the input vectors for the neural network were necessary. Although CRL has a lot of disk space, running several experiments simultaneously is an effective way to consume every disk up to the last byte. Because I had to run a lot of different tests, I wrote a suite of scripts that automated the complete development process.
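A minimal sketch of the computation described above, using the 56-16-4 shape of the segmenter network described below and random (untrained) weights; bias terms and the backpropagation training itself are omitted:

    import math, random

    def neuron(inputs, weights):
        """O = f(sum of w_i * x_i), with a sigmoid for f."""
        total = sum(w * x for w, x in zip(weights, inputs))
        return 1.0 / (1.0 + math.exp(-total))

    def feedforward(inputs, layers):
        """Propagate an input vector through a list of layers, each
        layer being a list of weight vectors (one per neuron)."""
        for layer in layers:
            inputs = [neuron(inputs, weights) for weights in layer]
        return inputs

    # Toy network: 56 inputs, 16 hidden neurons, 4 output neurons.
    random.seed(0)
    hidden = [[random.uniform(-1, 1) for _ in range(56)] for _ in range(16)]
    output = [[random.uniform(-1, 1) for _ in range(16)] for _ in range(4)]
    print(feedforward([0.0] * 56, [hidden, output]))  # four activations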
Figure 5.2: The architecture of the neural network segmenter
The architecture of the network I used for my experiments is shown in figure 5.2. I used 56 input neurons, 16 neurons in the hidden layer and 4 output neurons. One output neuron was assigned to each of the segmentation classes (silence, speech, music and other), a common method to deploy neural networks for hetero-association classification problems [HKP91]. The input vectors for the network are 7th order melcep coefficients (MFCC) [RS78]. MFCCs have proven to achieve good results in speech recognition; they form a spectral representation that models some aspects of the human ear. The people at CRL working on speech recognition were planning to use these coefficients. If I could decide which frames contain speech using the same feature vectors, this would save a lot of processor time in the final system. The seven MFCCs are calculated every 6 ms. As shown in figure 5.2, some context is taken into account during segmentation. The drawback of this decision is that the segmenter will always lag behind by at least 84 ms.

After training, the performance on the test set is given by the confusion matrix in table 5.1. The rows show the real classification of the vectors, the columns the output
of the neural network. Silence is classified as music in 2.4% of the cases. In the perfect case, the values on the diagonal should be 100% and all the other values 0%. The distinction between music and other is definitely a tough one to make. However, if the neural network misclassifies music as other, this is not a real problem: the segmentation would not be correct, but in both cases the speech recognizer would not be fed with the data. The mistakes that we should worry about are speech classified as music or other, and music or other classified as speech. In the first case, information that could have been indexed by the system is thrown away. In the other case, the speech recognizer would produce garbage and slow down a lot because it tries to match one of the word models on the fragment containing music or two speakers at the same time.

Observing the system, a lot of mistakes seem to be easily resolved if you put knowledge into the segmenter such that it knows that between two blocks of five seconds of speech the audio signal will probably not contain six milliseconds of music. Also, it is impossible to speak for less than some hundreds of milliseconds. The output decisions of the network have to be smoothed.

Firstly, I used different sizes of Hamming windows and rectangular windows to smooth the outputs. A fixed number of network decisions was summed over the window size and then a new decision was taken for the frames in the window. In a rectangular window all the decisions are weighted the same. A Hamming window uses the function 0.54 + 0.46 cos(πt) to weight the decisions in a window, for t from -1 to 1. The results of using these windows to smooth the outputs were not promising: either the windows were so large that fragments of a different classification were swallowed by surrounding fragments, or the windows were so small that they did not improve the results.

Because this windowing approach did not work the way I hoped it would, I decided to try a grammar-based approach. The major problem with windows seemed to be that segments containing silence are typically shorter than the other types of segments. A windowing approach treats all decisions the same way and cannot distinguish between a short silence segment between two speech segments and a short music segment between two speech segments. In these two examples, the decision of silence is probably right but the decision of music would often be wrong. Using a grammar-based approach, it is easy to make different decisions for each class. Using trial and error I found the following grammar to achieve good smoothing.
    S X  ->  S,  if duration(X) < 36 ms
    X S  ->  X,  if duration(S) < 18 ms
    X Y  ->  X,  if duration(Y) < 360 ms

              Silence  Speech  Music  Other
    Silence    58.4%   30.1%    7.9%   3.6%
    Speech      3.1%   95.7%    0.1%   1.1%
    Music       1.4%    4.5%   79.3%  14.7%
    Other       0.9%    5.1%    1.7%  92.3%

    Table 5.2: Confusion matrix after smoothing
This set of heuristic rules does the following. If a silence is followed by a fragment shorter than 36 milliseconds, I decide that this fragment was silence as well. Silences shorter than eighteen milliseconds are ignored, and a transition from, for example, music to speech can only be made if the decision to change to speech is consistent for at least 360 milliseconds.

The confusion matrix in table 5.2 shows the significantly improved results on the smoothed data. Only the classification of silence achieved better performance without smoothing. This is not really surprising, because during hand segmentation of the data I did not identify all short silences between two sentences, and the boundaries where silence changes to speech or music are not exactly defined. The tests described in the next section give a better idea of the performance on the class of silence. The other classifications perform very well. As argued before, the slightly worse performance on music does not worry us, because most of the mistakes end up in the class other.
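The rules are easy to apply to a run-length encoded label sequence. A minimal sketch; the single left-to-right pass is an assumption about how the rules were applied:

    def smooth(runs):
        """Apply the three heuristic rewrite rules to a run-length
        encoded label sequence [(label, duration_ms), ...]. Relabelled
        runs are merged into their neighbour."""
        out = []
        for label, dur in runs:
            if out:
                prev = out[-1][0]
                if prev == "silence" and label != "silence" and dur < 36:
                    label = "silence"                    # S X -> S
                elif label == "silence" and dur < 18:
                    label = prev                         # X S -> X
                elif (prev != "silence"
                      and label not in ("silence", prev) and dur < 360):
                    label = prev                         # X Y -> X
            if out and out[-1][0] == label:
                out[-1] = (label, out[-1][1] + dur)
            else:
                out.append((label, dur))
        return out

    # Six milliseconds of music between two blocks of speech is absorbed:
    print(smooth([("speech", 5000), ("music", 6), ("speech", 5000)]))
    # -> [('speech', 10006)]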
5.2.5 Evaluation of the neural network segmenter

The final version of the neural network with smoothing seemed to work very well. To test syntactic segmentation in a real application, I added a lot of code to store-24, described in section 2.4. I changed the software to produce the melceps in real time, twenty-four hours a day. The syntactical audio segmenter is run continuously on these melceps and the segmentation is stored in a separate ring buffer. The new application was called segment-24. I wrote a Tcl extension that allows the user of that interpreter to query the classification of a certain channel at a certain time; more about Tcl will be said in chapter 6. The user interface was extended to display the smoothed output of the neural network for the currently playing audio. Because segment-24 always lags five seconds behind in live mode, to allow the audio to be stored by the file system and retrieved by
the clients, segmentation could usually keep up with the audio feed even in live mode.

Syntactical segmentation extended the functionality of the original radio application in two ways. The possibility of using syntactical segmentation to predict topic changes is described in the next section. The other improvement was to extend the jump facility of the radio. Before syntactical segmentation became available, the radio could jump forward or backward in time by a fixed number of seconds. However, this would often cause the audio to start playing in the middle of a word, which is really annoying to listen to. Taking the next detected silence after (or before) a jump as the point in time to restart the audio improved the user-friendliness of the radio a lot.

Listening to CNN with the segment-24 radio application while simultaneously watching the neural net output, I noticed that the output could not be over 90% correct, although the test results of the previous section indicated such high performance. I needed more test data but did not really want to do more hand segmentation, so I decided to follow a slightly different approach. I used the neural network segmenter to segment a full hour of CNN audio and wrote each segment to a file, the segments varying in duration from hundreds of milliseconds up to thirty seconds or more. Next, I listened to all these audio fragments and kept a tally of the number of fragments that were classified correctly. Because I was only interested in the performance with respect to the goal of deciding whether a fragment should be further processed by a speech recognizer, and whether a silence really indicates a punctuation in the audio track, I softened the demands on classification a little. A silence mistakenly classified as speech was still judged to be correct. Mistakes between music and other were ignored as well. However, a short moment of speech in a music segment would mean the segment was wrongly classified; the same holds for a silence that contains a drumbeat. Another reason to soften the demands is the fact that the lengths of the fragments are ignored by keeping a tally. How wrong is a music fragment of thirty seconds with one short moment where somebody yells a word? The short yell should have been a fragment labelled other, but the other twenty-nine seconds are definitely correct. Still, with the original specification of the classification process this would count as a mistake.

The test procedure described above produced table 5.3. The results were far worse than expected from the previous section. This probably had to do with the way the training and test set were defined. I took random ten-minute fragments of CNN and WBUR, each split into five minutes for training and five minutes for testing. I assume that the number of speakers, tunes and strange sounds in commercials varies more over the full day than it does in ten minutes. This assumption implies that more training data could improve the results.
              Right   Wrong
    Silence   77.7%   22.3%
    Speech    85.1%   14.9%
    Music     76.5%   23.5%
    Other     78.1%   21.9%

    Table 5.3: Performance on a full hour of CNN

              Right   Wrong
    Silence   99.0%    1.0%
    Speech    92.1%    7.9%
    Music    100.0%    0.0%
    Other     71.3%   28.7%

    Table 5.4: Performance on a full hour of CNN after extra training

The new test procedure does not need the correct boundaries of the segments. This means that I could use the original test set, which was hand-segmented for the tests in the previous section, as training data for the neural network. I retrained the network with the full hand-segmented data set and repeated the test of the previous paragraph. Table 5.4 shows the significant improvement of the results. Extra training data would probably improve the performance even more.

Two other obvious improvements to the segmentation algorithm are the application of zero-mean normalization to the input vectors and the development of better output smoothing. In the smoothing process, different sources of evidence are combined to formulate a final judgement; application of a more formal method like Dempster-Shafer seems appropriate. Integration with the inference network model should also be considered for this evidential reasoning. I believe that splitting the speech class into two subclasses, female speech and male speech, and adding an extra output neuron to the architecture will prove to be a major improvement. When I added the class other to the original silence-music-speech classification I had in mind, the results improved a lot. Male and female speech signals are quite different, and classifying these two groups into the same class may confuse the training of the network. Other network architectures should be investigated for the syntactical segmentation of audio as well. A major improvement to syntactical segmentation, with the goal of making semantical segmentation feasible, would be the addition of speaker change detection to the system.
5.3 Semantical Segmentation

There are two main reasons why semantical segmentation is important for the development of an information access system. Firstly, people have a certain perception of the data segmented in blocks: we deal with concepts like news flashes, commercials and songs. If an information access system consistently returns blocks of information with different boundaries than a human would use, the user will definitely get annoyed and will not want to use the system any more. For example, it is very annoying to get a document that consists of half a news report as well as a commercial. You want the system to be able to discover the end of the news report. It should only retrieve the commercial if you are looking for the best buy, and then you do not want the news report.

The second reason for semantical segmentation may be even more crucial. The determination of the relation between a document and the underlying concepts by any information access system assumes a document structure. Without the concept of a document, you cannot know what to retrieve. If the continuous stream of incoming data has not been divided into documents the way we would do it, the retrieval engine will not work as expected. Similarity measures based on values like the frequency of a term in a document (tf) will not be assigned the correct values. Therefore, the relevance judgement of the system for a wrongly segmented document will not make sense to the user. Important information could be rejected during the retrieval process that would have been retrieved if the semantic segmentation had chosen the correct boundaries.

Without a speech recognizer it will be very hard to recognize the higher level structure. Once good speech recognition is available, the TextTiling algorithm described in section 3.5 can be used to detect topic changes. Assuming that the quality of the speech recognizer output is high enough, application of this algorithm should work fine.

In the area of document image processing, a common method to produce structure is to use a very low level format like bit streams indicating black or white dots on the paper [CdV93], [NSV92]. Another demonstration of the same idea can be found in [DLM+94]: the size of video frames encoded with motion-JPEG was used to statistically construct storyboards. With amazingly high precision, fade-ins and scene changes were detected with nothing more than the sequence of bytes per frame. (Scene changes are the transitions from one shot to another in the video data stream.) All these algorithms predict higher level segmentation using the low level format information.

At the MIT Media Laboratory, people have worked on the semantical segmentation of audio using syntactical segmentation as a basis. A college radio station
with many interviews and few commercials was segmented using an algorithm to detect music and silence [Haw93], [Hor93]. The segmentation based on this little information was enough to estimate the beginning and end of items. Other work has been done to realize an environment for speech skimming [Aro94]: silence detection and emphasis detection are used to guess the semantical structure of a speech by Negroponte. The system allows the user to skim a speech document the way one skims some pages of a paper to get an idea of what the hot topics are.

The many references to the application of low level information to semantical segmentation were the motivation to do some tests using the syntactical segmenter from the previous section. I added fast forward and backward buttons to the segment-24 application. Pressing the fast forward button finds the next `long' silence and skips a succeeding music fragment, where long was arbitrarily defined as at least three seconds. This turned out to be a convenient tool for listening to CNN news. Every time the reporter would start on the Simpson murder case again, I would simply press the button and, most of the time, listen to the next news item.

Of course, the created semantical segmentation will not be perfect. For text segmentation, an algorithm using semantical information like TextTiling proved to perform more like a human than an algorithm that splits the text based on the syntactical information of paragraph boundaries [Hea94]. Apparently, authors of text documents are not always good at segmenting these texts. The same will hold for algorithms detecting silences and emphasis: if a speaker pauses at the wrong moment or puts emphasis on an unimportant phrase, the segmentation algorithm wrongly decides that a new item has started. A thorough analysis of the limitations of using syntactical information from sound to obtain semantical information is currently being worked on at the MIT Media Lab [Sti95].
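The behaviour of the fast forward button can be sketched as follows; this illustrates the described behaviour and is not the segment-24 code itself:

    LONG_SILENCE = 3.0  # seconds; `long' was defined as at least 3 s

    def fast_forward(segments, now):
        """Given [(label, start, end), ...] in seconds and the current
        playback position, return the position to jump to: the end of
        the next long silence, skipping a succeeding music fragment."""
        for i, (label, start, end) in enumerate(segments):
            if (label == "silence" and start > now
                    and end - start >= LONG_SILENCE):
                if i + 1 < len(segments) and segments[i + 1][0] == "music":
                    return segments[i + 1][2]  # skip the music fragment
                return end
        return now  # no long silence ahead; stay put

    segs = [("speech", 0, 40), ("silence", 40, 44),
            ("music", 44, 70), ("speech", 70, 120)]
    print(fast_forward(segs, 10))  # -> 70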
5.4 Summary

I introduced the problem of automatic segmentation and divided it into the subproblems of syntactical segmentation and semantical segmentation. Because syntactical segmentation was crucial to the development of the prototype system, I focused on this problem in my thesis work. That does not mean that semantical segmentation is less important. To the contrary, I believe it is very important that a real system does a good job of recognizing the higher level structure.

The quality of the automatic segmentation of the data into documents is very important for the performance of the complete system. The performance of the
syntactical segmentation will directly influence the quality of the output of the speech recognizer. As the speech recognizer will produce the indexing terms for the information access system, it is important to give it the best available input. Semantical segmentation has to do a good job to build an information access system that people can use to find the information they are looking for. The segmentation influences the performance of both the algorithm that actually searches for interesting documents and the presentation of the search results to the user.

Although the segmenter I developed still needs a lot of work, I showed that a neural network can be trained to distinguish between silence, music and speech. Such a network can be used to assist with the harder problem of semantical segmentation and to keep a speech recognizer from processing music.
Chapter 6 A PROTOTYPE SYSTEM

6.1 Architecture

This thesis described the design process of an information filtering system for multimedia data. In this chapter, a framework is given that can be extended to a complete multimedia filtering system. By examining the prototype, new algorithms and concepts can be tested in a real environment.

The abstract design of the prototype is shown in figure 6.1. It is based on a dual architecture: the incoming data is split into an information content stream and a raw data stream by the automatic segmentation and recognition unit. In the prototype, the information content stream contains the closed captions of the television data. The storage unit dealing with the raw data is implemented using the segment-24 application. The prototype system only deals with audio; the video frames of the CNN data stream are simply ignored. The output of the automatic segmentation unit is stored separately from the raw audio data. The neural network producing the segmentation information is described in section 5.2.1. The output of the content analysis unit is stored in an INQUERY data collection. The information content stream only contains the closed captions of the television signal because a speech recognizer was not available yet. The different versions of the data are linked by {date, time, channel}-tuples.

The user of the prototype system can search the data collection by content. INQUERY accepts natural language queries and returns a list of documents. I do not use the user interface of the INQUERY system because I have to further process the retrieved documents. The user interface of the prototype system is implemented on the World Wide Web. The advantage of this decision is that the browser will display the data and my software only has to produce a description of the output in HTML.
Figure 6.1: The prototype framework

The HTTP protocol suite does not allow server-initiated updates yet, so this interface could not be used for a filtering application that updates a user's homepage automatically. Fortunately, the Netscape browsers will be extended with server push and client pull capabilities, which will make this interface suitable for a filtering application as well.

As the result of a query, the system presents both the information content version and the original audio version of the document. It redirects the playback point of the user's active segment-24 application to the correct channel at the time the document started. A better approach than directly positioning segment-24 would be to return a list of links that position the audio when pressed. If the ring buffer contained a longer period of audio, e.g. a full month, these links could be included in other documents as well.
6.2 Implementation

I will not go through the implementation in depth, but I will give a general impression of how the system has been built and what problems I had to face.

The prototype has been realized on a UNIX platform. The system was implemented using C and Tcl/Tk [Ous94], in combination with a lot of pipes to link all building blocks. The Tool Command Language Tcl (pronounced `tickle') is an interpreted script language with features for looping, conditionals, sets, file handling, subprocesses, associative arrays and regular expression matching. It
can be extended using C, and all the language features can be called from a C program. Using Tcl simplifies the handling of user input and the integration of different building blocks in one environment, and the source code is nicely readable. The Tk toolkit is a Tcl extension which provides an interface to the X Window System. Tcl and Tk are freely distributable and used in a lot of applications. The interested reader should also check the Tcl-DP extensions for distributed processing: with this interpreter, a multi-user conference tool can be written in no more than a hundred lines of easily understood source code.

Because an audio document can be arbitrarily long and the recognition process will make mistakes, fast browsing through the retrieved audio documents to concepts found in the document has to be implemented. The user does not want to listen to ten minutes of audio only to find out that the speech recognizer mixed up two words. To enable jumping to retrieved representation concepts, an index of {timestamp, representation concept}-tuples has to be stored within the document. For implementation purposes I chose to store this information in a table of {line, word, seconds} format, where seconds stores the time relative to the beginning of the document. I chose the word-oriented approach of referring to locations within the document because this is the representation used within INQUERY. If we want to store real multimedia documents with the audio at the same location as the text, this representation will not work any more.

The software accessing the closed caption decoder board does not output documents but just a long, unreadable list of {timestamp, caption}-tuples. These tuples are read by my software and segmented into documents whenever it takes longer than a threshold time for new captions to come in. On CNN Headline News, the only documents that are captioned are news reports, approximately between 6 pm and 10 am, and most of the commercials. An algorithm segmenting the incoming captions based on a time threshold of twelve seconds turned out to make few mistakes. Of course, this segmentation strategy would not have worked if all information were captioned. The algorithm is far too simple for a real application, but it works well enough to store these captions and thus suffices for the purpose of prototyping the final system.

The standard parser that comes with the INQUERY package deals with documents marked up in a subset of SGML. I decided to use this parser for my system, as I believe that SGML and HyTime will be commonly used to describe (multimedia) documents in the future. An example of a document that I composed automatically from the incoming closed captions is:
    CNN-04/04/95-02:00:03
    04/04/95
    CNN
    {1 1 2}{2 1 3}{3 1 13}{4 1 14}
    {5 1 17}{6 1 19}{7 1 20}{8 1 22}
    ...
    captions paid for by the us department of education
    live from atlanta headline news david goodnow reporting
    texas authorities are trying to figure out what caused
    ...
To enable the use of INQUERY, the parser of the indexing subsystem had to be extended and the interfacing between the retrieval subsystem and the user interface had to be implemented. With the help of the INQUERY people from the University of Massachusetts at Amherst (I want to thank Michelle LaMar for answering the many mail messages with questions and for giving me the code of the Tcl interpreter extended with the INQUERY API calls), I managed to make the INQUERY system deal with the document layout explained in the previous paragraph. I added a TIME field to the parser that contains the start time of a document. Using field indexing options from INQUERY, this enables the user to formulate queries searching for documents of, say, this morning before ten o'clock. I also added a TIMES field to the parser to store the index table of {line, word, seconds}-tuples. However, this field was not indexed by INQUERY and was only used because I had to store this information somewhere; the best place to keep information that belongs to a document is within the document itself. These extra fields cannot be accessed from the retrieval engine in the old version of INQUERY I had to use. Therefore, I had to use the undocumented get_raw_doc function call that retrieves the original unparsed document. Minor problems occurred with this call (it was undocumented, what did I expect?), because accessing the standard TEXT field with the normal function calls crashed the system after get_raw_doc had been used. The solution was to always reparse the retrieved document in
my own code to process this extra information. I did not implement searching the table of timestamps in the prototype application; playback always starts at the beginning of the document. It is a trivial task to add this feature to the application.

I will finish this section with an example of how the system processes a query. If the user queries for `david goodnow', the closed caption documents stored in the INQUERY database collection are searched and INQUERY returns a list of documents ordered by decreasing probability of usefulness to the user. One of the documents that would be retrieved is the example document given before. The system then reads the index table in the TIMES field, finds that line 5 of the text starts 17 seconds into the document, and adds the 17 seconds to the document time stamp. Segment-24 is tuned to CNN and starts playing at 2:00:20, 17 seconds after the start of the document.
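A minimal sketch of this lookup, using the TIMES tuples of the example document; the parsing details are assumptions:

    import re
    from datetime import datetime, timedelta

    def playback_time(doc_start, times_field, line):
        """Map a line of caption text to an absolute playback time
        using the {line word seconds} tuples of the TIMES field."""
        table = {int(l): int(s) for l, w, s in
                 re.findall(r"\{(\d+) (\d+) (\d+)\}", times_field)}
        return doc_start + timedelta(seconds=table[line])

    start = datetime(1995, 4, 4, 2, 0, 3)
    times = "{1 1 2}{2 1 3}{3 1 13}{4 1 14}{5 1 17}{6 1 19}{7 1 20}{8 1 22}"
    print(playback_time(start, times, 5))  # -> 1995-04-04 02:00:20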
6.3 Conclusions and Further Work

The prototype application turned out to be a nice tool to work with. The implementation based on small building blocks makes the system extensible: new approaches to automatic segmentation and analysis of the input data can easily be added and the resulting improvements in retrieval can be studied. Once a standard test set has been defined, precision and recall measures can be used to find good representations of multimedia data from the viewpoint of information access.

A better document model is needed to store the documents. I think the segmentation information should be stored within the document; the same holds for the different representations of the data. The SGML/HyTime standard seems to be a good candidate for this purpose [VB95], [Erf93]. Documents specified with HyTime have a notion of time, and the standard also provides means to deal with different media items in one document.

In theory, hooking up a speech recognizer to the prototype should be a fairly easy task: just add the speech recognizer building block and everything will work fine. However, the performance of INQUERY (or another retrieval engine) on the imprecise output of a speech recognizer cannot be guaranteed. As explained before, the approach to indexing speech data followed by [GS92] has not been tested in practice, and their simulation using text files does not represent a realistic situation because they assumed knowledge of the word boundaries. It will be clear that a lot of research is still necessary to make the step from prototype to an implementation of the system described in the problem definition.
Chapter 7 SUMMARY AND CONCLUSIONS

This thesis discussed the problems that come around the corner when you want to do information filtering on the data from television and radio. The motivation for the research is the problem of information overload: too much information is broadcast, and therefore we want a computer system to select the interesting shows and documentaries for us. A prototype system has been developed that can be used to direct research and finally evolve into a product. This chapter summarizes the most important aspects of the report and gives directions for further research.

Too many factors are unknown to predict whether the approach outlined in this thesis will lead to a working product. The only research I am aware of that reports on information retrieval from speech documents for the same type of application has not been tested in practice. Their simulation results do not make clear whether the approach would work, because of the assumption of knowledge of all word boundaries.

I think the inference network model is very well suited for application to multimedia data. The INQUERY information retrieval system using this model has proven to be among the best in the world. The network model makes it easy to combine evidence from different sources, so the different media representations of multimedia data can be integrated in the same manner as different representations of textual data.

Speech recognizers make mistakes, and erroneous data can confuse the information retrieval process. Intuitively, I think it is possible to extend the inference network model with knowledge about the probability that a word was recognized correctly and with the alternatives suggested by the recognizer. However, this still has to be proven; the implications of this idea for the correctness of the model and for its computational complexity have to be studied.
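As a small illustration of this intuition, the sketch below folds recognition confidence into the combination of term evidence. It is only my reading of the idea, not a worked-out extension of the inference network model; the weighted-sum combination is just one of the operators such a model could use, and all numbers are invented.

# Hedged sketch: weight each query term's contribution to a document's
# belief by the probability that the recognizer got the word right.
# Low-confidence terms are discounted toward a prior `default` belief.

def weighted_belief(term_beliefs, recognition_confidence, default=0.4):
    """term_beliefs: {term: P(term represents doc)} from the index.
    recognition_confidence: {term: P(term was recognized correctly)}."""
    total = 0.0
    for term, belief in term_beliefs.items():
        w = recognition_confidence.get(term, 1.0)
        total += w * belief + (1.0 - w) * default
    return total / len(term_beliefs) if term_beliefs else default

# Example: 'goodnow' was recognized with low confidence, so its strong
# index belief counts for less in the combined score.
print(weighted_belief({"david": 0.8, "goodnow": 0.9},
                      {"david": 0.95, "goodnow": 0.40}))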
I could not perform tests with speech recognizers myself, but the experiments I did with conventional silence detection algorithms indicate that television data may be hard to deal with for current speech recognition applications.

I built a neural network segmenter that can identify the speech segments on the CNN audio track. The motivation for this syntactic segmentation algorithm, which segments the audio into fragments of the four classes silence, speech, music and other, is two-fold. The first goal is to avoid feeding the speech recognizer with music; the recognizer would not know what to do with it and would start to output nonsense. The other motive is that it is possible to guess the semantic structure of an audio document from the syntactic segmentation. An analogous approach is commonplace in automatic document structure analysis of scanned documents. (A small sketch of this frame-based segmentation is given at the end of this chapter.)

The recognition of audio document boundaries in the continuously incoming stream of data is an important but hard problem: both the retrieval process and the user of the system will be disturbed when the document boundaries are wrong. To find these boundaries, content information is probably needed. Without this content information, we can use low level information about the syntactic structure of the audio. To show how this would work, I implemented segmentation on a twenty-four hour ring buffer of radio audio; it runs real-time on the CNN channel in the laboratory. People could use this extension on the radio to jump to the next and previous long silences and to skip musical fragments.

Guessing semantics using syntax is a hard problem though. Even with text files, an algorithm that uses content to find topic changes was found to perform better than the visual paragraph layout the authors created themselves (judged by a panel of readers). For a particular channel, it may be possible to construct models that can find the boundaries between news reports, talk shows and commercials based on low level information, but a general algorithm for any channel seems impossible to me.

For future research, I would advise narrowing the scope of the Indexing a Sea of Audio project. I sincerely believe that information retrieval from speech documents will prove to be feasible. However, the chosen sources of television and radio signals impose too many problems at once. I would suggest focusing the project on a retrieval system for rather formal meetings. Not knowing the vocabulary in advance is still a tough problem to tackle, but jingles and tunes will not confuse the system and most data will be of one person speaking at a time. I think there will be a market for a system that implements a search engine on recordings of meetings: companies and governments will want to keep a history of all decision processes if such a system were available. The same type of system could be used to evaluate therapies in psychiatry.

The challenge of integrating speech recognition and information retrieval into a working system is a big one. The most important open problems are the selection of a good document representation model, the recognition and selection of indexing features for speech retrieval, and dealing with the erroneous output of recognition processes. Hopefully, a project as proposed in the previous paragraph will succeed in solving these problems.
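As announced above, here is a minimal sketch of the frame-based segmentation idea. The features, frame energy and zero-crossing rate, are classic choices from the silence detection literature; the thresholds, the frame rate, and the tiny decision rule are illustrative assumptions standing in for the trained neural network of the thesis.

# Illustrative sketch of syntactic audio segmentation: per-frame low
# level features, a crude frame classifier standing in for the trained
# neural network, and merging of frame labels into labelled segments.
# All thresholds and the 100 frames-per-second rate are assumptions.

def frame_features(samples):
    """Energy and zero-crossing rate of one frame of audio samples."""
    energy = sum(s * s for s in samples) / len(samples)
    zcr = sum((a < 0.0) != (b < 0.0)
              for a, b in zip(samples, samples[1:])) / max(len(samples) - 1, 1)
    return energy, zcr

def classify_frame(samples):
    # Stand-in decision rule; the thesis segmenter learned this mapping.
    energy, zcr = frame_features(samples)
    if energy < 1e-4:
        return "silence"
    if zcr > 0.35:
        return "other"        # noisy, unvoiced-like frames
    if zcr < 0.15:
        return "speech"       # voiced-speech-like frames
    return "music"

def label_segments(frames, frames_per_second=100):
    """Merge runs of identically labelled frames into (start, end, label)."""
    segments = []
    for i, frame in enumerate(frames):
        label = classify_frame(frame)
        start, end = i / frames_per_second, (i + 1) / frames_per_second
        if segments and segments[-1][2] == label:
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((start, end, label))
    return segments

On such a segment list, the ring buffer's "jump to next long silence" operation reduces to searching forward for the next silence segment of sufficient length.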
Bibliography

[ACF93] M. Arya, W. Cody, and C. Faloutsos. QBISM: A prototype 3-D medical image database system. IEEE Data Engineering Bulletin, 16(1):38-42, March 1993.

[AR76] B.S. Atal and L.R. Rabiner. A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24(3):201-212, 1976.

[Aro94] B.M. Arons. Interactively skimming recorded speech. PhD thesis, Massachusetts Institute of Technology, February 1994.

[BC92] N.J. Belkin and W.B. Croft. Information filtering and information retrieval: Two sides of the same coin? Communications of the ACM, 35(12):29-38, 1992.

[BCC94] E.W. Brown, J.P. Callan, and W.B. Croft. Fast incremental indexing for full-text information retrieval. In Proceedings of the 20th International Conference on Very Large Databases (VLDB), Santiago, Chile, 1994.

[BCCM94] E.W. Brown, J.P. Callan, W.B. Croft, and J.E.B. Moss. Supporting full-text information retrieval with a persistent object store. In EDBT '94, 1994.

[BCF95] E. Bertino, B. Catania, and E. Ferrari. Research issues in multimedia query processing. In Advanced Course: Multimedia Databases in Perspective, pages 279-314. Center for Telematics and Information Technology of the University of Twente, 1995.

[BCFV] E. Barnard, R. Cole, M. Fanty, and P. Vermeulen. Real-world speech recognition with neural networks.

[BFM+94] J. Barrios, J. French, W. Martin, P. Kelly, and J.M. White. Indexing multispectral images for content-based retrieval. In Proceedings of the 23rd AIPR Workshop on Image and Information Systems, Washington DC, 1994.

[BG94] M. Buckland and F. Gey. The relationship between recall and precision. Journal of the American Society for Information Science, 45(1):12-19, 1994.

[CC93] J.P. Callan and W.B. Croft. An evaluation of query processing strategies using the TIPSTER collection. In Proceedings of the sixteenth annual international ACM SIGIR conference on research and development in information retrieval, pages 347-356, 1993.

[CCH92] J.P. Callan, W.B. Croft, and S.M. Harding. The INQUERY retrieval system. In Proceedings of the 3rd international conference on database and expert systems applications, pages 78-83, 1992.

[CdV93] D.R. Corman and A.P. de Vries. Document analysis: an integrated approach. In America Multimedia Studytour '93: Preliminary Report, pages 13-23. Inter-Actief, University of Twente, 1993.

[CHTB92] W.B. Croft, S.M. Harding, K. Taghva, and J. Borsack. An evaluation of information retrieval accuracy with simulated OCR output. In Symposium on Document Analysis and Information Retrieval, 1992.

[CLP94] T.-S. Chua, S.-K. Lim, and H.-K. Pung. Content-based retrieval of segmented images. In ACM Multimedia 94, pages 211-218, San Francisco, 1994.

[Con87] J. Conklin. Hypertext: an introduction and survey. Computer, pages 17-41, September 1987.

[Coo94] Wm.S. Cooper. The formalism of probability theory in IR: a foundation for an encumbrance? In Proceedings of the seventeenth annual international ACM SIGIR conference on research and development in information retrieval, Dublin, Ireland, 1994.

[Cox90] S.J. Cox. Speech and Language Processing, chapter Hidden Markov Models for automatic speech recognition: theory and application, pages 209-230. Chapman and Hall, 1990.

[CSW94] M. Christel, S. Stevens, and H. Wactlar. Informedia digital video library. In ACM Multimedia 94, pages 480-481, San Francisco, 1994.

[CW92] F.R. Chen and M.M. Withgott. The use of emphasis to automatically summarize a spoken discourse. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, San Francisco, CA, March 1992.
[DG94] N. Dimitrova and F. Golshani. RX for semantic video database retrieval. In ACM Multimedia 94, pages 279-286, San Francisco, 1994.

[DLM+94] E. Deardorff, T.D.C. Little, J.D. Marshall, D. Venkatesh, and R. Walzer. Video scene decomposition with the motion picture parser. In IS&T/SPIE Symposium on Electronic Imaging Science and Technology, San Jose, 1994.

[DWG] A. Duda, R. Weiss, and D.K. Gifford. Content-based access to algebraic video.

[Erf93] R. Erfle. Specification of temporal constraints in multimedia documents using HyTime. Electronic Publishing, 6(4):397-411, 1993.

[Fed] Federal Communications Commission. 15.119 Closed caption decoder requirements for television receivers.

[Fro94] Th.J. Froehlich. Relevance reconsidered - towards an agenda for the 21st century. Journal of the American Society for Information Science, 45(3):124-134, 1994.

[FT93] P. Furtado and J.C. Teixeira. Storage support for multidimensional discrete data in multimedia databases. Eurographics '93, 12(3), 1993.

[GC92] J. Gemmell and S. Christodoulakis. Principles of delay-sensitive multimedia data storage and retrieval. ACM Transactions on Information Systems, 10(1), 1992.

[GD88] C.K. Gan and R.W. Donaldson. Adaptive silence detection for speech storage and voice mail applications. IEEE Transactions on Acoustics, Speech, and Signal Processing, 36(6):924-927, 1988.

[Gor90] M.D. Gordon. Evaluating the effectiveness of information retrieval systems using simulated queries. Journal of the American Society for Information Science, 41(5):313-323, 1990.

[GS92] U. Glavitsch and P. Schauble. A system for retrieving speech documents. In Proceedings of the 15th annual international SIGIR, pages 168-176, Denmark, June 1992.

[Haw93] M.J. Hawley. Structure out of sound. PhD thesis, Massachusetts Institute of Technology, September 1993.

[HBvR94] L. Hardman, D.C.A. Bulterman, and G. van Rossum. The Amsterdam hypermedia model. Communications of the ACM, 37(2):50-62, February 1994.
[HC93] D. Haines and W.B. Croft. Relevance feedback and inference networks. In Proceedings of the sixteenth annual international ACM SIGIR conference on research and development in information retrieval, pages 2-11, 1993.

[Hea94] M.A. Hearst. Multi-paragraph segmentation of expository text. In ACL '94, Las Cruces, 1994.

[HKP91] J.A. Hertz, A.S. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, California, 1991.

[Hor93] C.D. Horner. NewsTime: A graphical user interface to audio news. Master's thesis, Massachusetts Institute of Technology, June 1993.

[HS94] F. Halasz and M. Schwartz. The Dexter hypertext reference model. Communications of the ACM, 37(2):30-39, February 1994.

[JMR94] J. Junqua, B. Mak, and B. Reaves. A robust algorithm for word boundary detection in the presence of noise. IEEE Transactions on Speech and Audio Processing, 2(3):406-412, 1994.

[SJvR76] K. Sparck Jones and C.J. van Rijsbergen. Information retrieval test collections. Journal of Documentation, 32(1):59-75, 1976.

[LAF+93] T.D.C. Little, G. Ahanger, R.J. Folz, J.F. Gibbon, F.W. Reeve, D.H. Schelleng, and D. Venkatesh. A digital on-demand video service supporting content-based queries. In Proceedings of the first ACM international conference on multimedia, pages 427-436, Anaheim, California, 1993.

[Les89] M. Lesk. What to do when there's too much information. In Hypertext '89 Proceedings, pages 305-318, New York, 1989. ACM.

[Lip89] R.P. Lippmann. Review of neural networks for speech recognition. Neural Computation, 1(1):1-38, 1989.

[Loe92] S. Loeb. Architecting personalized delivery of multimedia information. Communications of the ACM, 35(12):39-50, 1992.

[LPG+93] Levergood, Payne, Gettys, Treese, and Stewart. AudioFile: a network-transparent system for distributed audio applications. In USENIX Summer Conference, June 1993.

[Mae94] P. Maes. Agents that reduce work and information overload. Communications of the ACM, 37(7):31-42, July 1994.
[MM80] R.J. McAulay and M.L. Malpass. Speech enhancement using a soft-decision noise suppression filter. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(2):137-144, 1980.

[MM93] L. Mauuary and J. Monne. Speech/non-speech detection for voice response systems. In EUROSPEECH '93, 1993.

[NBE+93] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, and C. Faloutsos. The QBIC project: querying images by content using color, texture and shape. Technical Report RJ 9203, IBM Research Division, 1993.

[NSV92] G. Nagy, S. Seth, and M. Viswanathan. A prototype document image analysis system for technical journals. Computer, pages 10-21, July 1992.

[Ous94] J.K. Ousterhout. Tcl and the Tk Toolkit. Addison-Wesley Publishing, 1994.

[Pea89] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, California, 1989.

[RC] T.B. Rajashekar and W.B. Croft. Combining automatic and manual index representations in probabilistic information retrieval. Technical Report IR-39, Center for Intelligent Information Retrieval.

[RCL91] R.C. Rose, E.I. Chang, and R.P. Lippmann. Techniques for information retrieval from voice messages. In International Conference on Acoustics, Speech and Signal Processing, pages 317-320, 1991.

[RHL94] Rudnicky, Hauptmann, and Lee. Survey of current speech technology. Communications of the ACM, 37(3):52-57, 1994.

[RS78] L.R. Rabiner and R.W. Schafer. Digital Processing of Speech. Prentice-Hall, New Jersey, 1978.

[RSA77] L.R. Rabiner, C.E. Schmidt, and B.S. Atal. Evaluation of a statistical approach to voiced-unvoiced-silence analysis for telephone-quality speech. The Bell System Technical Journal, 56(3):455-482, 1977.

[Sal89] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley Publishing, 1989.

[SC94] J.R. Smith and S.-F. Chang. Quad-tree segmentation for texture-based image query. In ACM Multimedia 94, pages 279-286, San Francisco, 1994.
[SSJ92] D. Swanberg, C.F. Shu, and R. Jain. Architecture of a multimedia information system for content-based retrieval. In Proceedings of the third international workshop on network and operating system support for digital audio and video, pages 387-392, 1992.

[Sti95] L.J. Stifelman. A discourse analysis approach to structured speech. In AAAI 1995 Spring Symposium Series: Empirical Methods in Discourse Interpretation and Generation. Stanford University, 1995.

[SvR91] M. Sanderson and C.J. van Rijsbergen. NRT: news retrieval tool. Electronic Publishing, 4(4):205-217, 1991.

[TBC94] K. Taghva, J. Borsack, and A. Condit. Results of applying probabilistic IR to OCR text. In Proceedings of the seventeenth annual international ACM SIGIR conference on research and development in information retrieval, Dublin, Ireland, 1994.

[TBCE94] K. Taghva, J. Borsack, A. Condit, and S. Erva. The effects of noisy data on text retrieval. Journal of the American Society for Information Science, 45(1):50-58, 1994.

[TC91a] H. Turtle and W.B. Croft. Efficient probabilistic inference for text retrieval. In RIAO 91 Conference Proceedings, pages 644-661, Barcelona, 1991.

[TC91b] H. Turtle and W.B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3), 1991.

[VB95] P.A.C. Verkoulen and H.M. Blanken. SGML/HyTime for supporting cooperative authoring of multimedia applications. In Advanced Course: Multimedia Databases in Perspective, pages 179-212. Center for Telematics and Information Technology of the University of Twente, 1995.

[vR79] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.

[vS95] Hein van Steenis. Spraakherkenning levert eindelijk produkten op [Speech recognition finally yields products]. Automatiseringsgids, May 26, 1995.

[WB91] L.D. Wilcox and M.A. Bush. HMM-based wordspotting for voice editing and indexing. In Proceedings of the Second European Conference on Speech Communication and Technology, Genova, Italy, September 1991.
[Wil79] P. Willett. Document retrieval experiments using indexing vocabularies of varying size. II. Hashing, truncation, digram and trigram encoding of indexing terms. Journal of Documentation, 35(4):296-305, 1979.

[WSB92] L.D. Wilcox, I. Smith, and M.A. Bush. Wordspotting for voice editing and audio indexing. In Proceedings of CHI, Monterey, CA, May 1992.

[YGM] T.W. Yan and H. Garcia-Molina. SIFT - a tool for wide-area information dissemination. http://sift.stanford.edu/.

[ZPD90] P. Zabback, H.-B. Paul, and U. Deppisch. Office documents on a database kernel: filing, retrieval and archiving. ACM Office Information Systems, 11(2 and 3), 1990.