Multimedia Tools and Applications, 14, 153–173, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

The ToCAI Description Scheme for Indexing and Retrieval of Multimedia Documents∗

N. ADAMI, A. BUGATTI, R. LEONARDI, P. MIGLIORATI†, L.A. ROSSI
DEA, University of Brescia, Via Branze, 38, Brescia, Italy

Abstract. A framework, called Table of Content-Analytical Index (ToCAI), for the content description of multimedia material is presented. The idea for such a description scheme (DS) comes from the structures used for indexing technical books, which contain a Table of Contents, typically placed at the beginning of the book, where the list of topics is organized hierarchically into chapters and sections, and an Analytical Index, typically placed at the end of the book, where keywords are listed alphabetically. The ToCAI description scheme similarly provides a hierarchical description of the time-sequential structure of a multimedia document (ToC), suitable for browsing, and an "Analytical Index" (AI) of audio-visual key items of the document, suitable for effective retrieval. In addition, two other sub-description schemes are proposed to specify the program category and to describe other metadata associated with the multimedia document in the general DS. The detailed structure of the DS is presented by means of a UML diagram. Moreover, suitable automatic extraction methods for the identification of the values associated with the descriptors that compose the ToCAI are presented and discussed. Finally, a browsing application example is proposed.

Keywords: description schemes, analytical index, multimedia documents, indexing, effective retrieval

1. Introduction

Nowadays a huge amount of audio-visual (AV) material is produced by a variety of digital sources. There is therefore a need to define suitable frameworks for efficient browsing through this large amount of available material and for retrieving relevant information according to user-specific requirements. For these purposes, in recent years there have been several contributions in the field of multimedia indexing [4, 13, 19]. Furthermore, the International Organization for Standardization (ISO) started in October 1996 a standardization process for the description of the content of multimedia documents, namely MPEG-7 [10, 11]. This standardization effort should lead, by September 2001, to the definition of a set of standard descriptors (D) and description schemes (DS) expressed according to a description definition language (DDL). A DS can be used to generate a description of a multimedia document with various levels of abstraction, by combining descriptors characterizing features such as shape, color, texture, motion (for objects), or audio type (for the audio component) [1].

∗ This work has been partially funded by the European ESPRIT project AVIR (Audio-Visual Indexing and Retrieval for non-IT expert users).
† To whom all correspondence should be addressed.


The DDL should make it possible to build a variety of new description schemes for dealing with specific application contexts. The DS proposed here relies on a joint approach that takes into account both audio and video processing for constructing a hierarchical organization of audio-visual information. The proposed DS aims at providing the following functionalities.

– Characterize the temporal structure of a multimedia document from a semantic point of view at multiple levels of abstraction, so as to obtain a series of consecutive segments which are coherent with the semantics of the information at that level. With this kind of indexing, fast navigation throughout a multimedia document can be carried out.
– Allow relevant information to be retrieved effectively, such as objects appearing in the video (e.g., Bill Clinton), or specific situations of interest (e.g., a murder in a thriller movie or a goal in a football game). To achieve good retrieval capability, it is important that these objects or events be arranged in the underlying Analytical Index (AI) according to various criteria (named "ordering keys"), so as to ease the retrieval task.
– Offer generic and/or specific information about the content of the multimedia document, such as authors, program title, date of production, etc.
– Provide useful information about the document description itself, such as the size of the description and the type of extraction methods involved, with an indication of the reliability of the results they provide.

The original idea for such a DS originates from the organization adopted for technical books, where one may, on the one hand, easily understand the sequence of addressed topics by exploring the Table of Contents and, on the other hand, quickly access elements of interest by means of the Analytical Index. In the first case, the chronological order of presentation is preserved, while in the latter an alphabetical order is used to facilitate the retrieval task. The ToCAI allows for a similar mechanism to access information in a multimedia document, with an extension: it allows information to be retrieved at any given level of abstraction, which is not normally the case in a book (each keyword in the index normally points only to page numbers, not to the sections or paragraphs where the topic of interest can be found).

This paper is organized as follows. Section 2 briefly presents the context and objectives of the MPEG-7 standardization effort. In Section 3, the structure of the ToCAI DS is explained using Unified Modeling Language (UML) diagrams [5]. Sections 4, 5, 6 and 7 address the issue of automatic description creation. Finally, some details about the adopted visual interface are given and an example implementation of the DS is shown (Section 8).

2. MPEG-7 context and objectives

In October 1996, MPEG started a new effort to address the issue of multimedia content description: the "Multimedia Content Description Interface" (in short, "MPEG-7"). The purpose of MPEG-7 is the specification of a standard set of descriptors that can be used to describe several kinds of multimedia information.

Figure 1. MPEG-7 description generation, transmission, and usage.

MPEG-7 will also standardize the structures (DSs) for the descriptors and their relationships, as well as a language for specifying description schemes, i.e., a Description Definition Language (DDL). The standard will be applicable to AV material such as still pictures, graphics, 3D models, audio and video. An MPEG-7 description may be physically located with the associated AV material, or it may reside elsewhere and rely on bi-directional linking mechanisms. In figure 1, a block diagram of a possible MPEG-7 processing chain is shown. This chain includes feature extraction (automatic or semi-automatic), the description itself, and the search engine (application). Despite their usefulness, automatic and semi-automatic feature extraction methods are outside the scope of the standard, in order to allow industrial and scientific competition and to benefit from the expected improvements in these technical areas. For similar reasons, search engines will not be specified within the scope of MPEG-7 either. More detailed explanations about MPEG-7 can be found in [10].

3. Structure of the ToCAI DS

We now describe the ToCAI structure by presenting the hierarchical organization of its sub-description schemes and the descriptors involved. The ToCAI is organized in four main DSs: the Table of Contents (ToC), the Analytical Index (AI), the Context and the Meta-descriptors description schemes (see figure 2).

3.1. ToC DS

The ToC describes the temporal structure of the AV document at multiple levels of abstraction. It is organized in different hierarchical levels, where the lower levels provide a detailed characterization of the sequential structure of the AV document, while the higher ones offer a more compact description with associated semantics. A key aspect is that the items at each level are kept in chronological order.

Figure 2. High level structure of the ToCAI DS.

The ToC DS is very useful for browsing and navigation, since it provides summaries of the document at several levels of detail. Besides, the meaningful characterization of the temporal structure of the document provided by the ToC DS may also be used for retrieval tasks, as it can restrict the search field for a particular query. The ToC is formed by two DSs, explained below, namely the Audio-visual Structure and the Audio Structure. We propose an Audio-visual Structure DS rather than a simple visual DS because, from the semantic point of view, it is often necessary to consider the information carried by the video together with that provided by the associated audio in order to recover reliable intermediate semantic levels for the description.

3.1.1. Audio-visual structure DS. This DS is represented in figure 3. The two Time-code Ds specify the start and the end position of the AV document. The core of this DS is the Scene DS. A scene is a temporal segment having a coherent semantics at a certain hierarchical level. It is formed by a variable number of sub-scenes, a time reference (two Time-code Ds) and a type-of-scene D (a string and, if useful, a characteristic icon). The elementary component of a scene is the shot.1 The Shot DS indicates the type (cut, dissolve, fade-in, etc.) of the editing effects and their temporal locations (Editing effects D). It includes a set of DSs for the K-frames, mosaic and outlier images associated with the shot.2

3.1.2. Audio structure DS. This DS basically mirrors the structure of the Audio-visual Structure DS. Thus we can have various layers of audio scenes, and here the Shot DS is replaced by the Homogeneous audio DS. The Ds belonging to the Homogeneous audio DS represent the leaves of the tree, i.e., audio segments having a homogeneous audio source (for example a particular speaker, a particular noise, a given piece of music, etc.). Each of these Ds consists of an appropriate label and a time reference.
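To make the containment relationships of the ToC more concrete, the following sketch shows one possible in-memory representation of its main elements (scenes, shots and homogeneous audio segments). It is only an illustration: the class and field names are our own assumptions and do not correspond to MPEG-7 DDL syntax.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TimeCode:
    """Position in the AV document, e.g. in seconds or frame indices."""
    start: float
    end: float

@dataclass
class Shot:
    """Leaf of the visual hierarchy: a single camera record."""
    time: TimeCode
    editing_effect: str = "cut"                            # cut, dissolve, fade-in, ...
    key_frames: List[str] = field(default_factory=list)    # ids/paths of K-frames
    mosaic: Optional[str] = None                           # background mosaic, if any
    outliers: List[str] = field(default_factory=list)      # foreground objects

@dataclass
class Scene:
    """Temporal segment with coherent semantics; may nest sub-scenes."""
    time: TimeCode
    scene_type: str                                        # label (+ optional icon)
    sub_scenes: List["Scene"] = field(default_factory=list)
    shots: List[Shot] = field(default_factory=list)

@dataclass
class HomogeneousAudio:
    """Leaf of the audio hierarchy: one homogeneous audio source."""
    time: TimeCode
    label: str

@dataclass
class AudioScene:
    time: TimeCode
    sub_scenes: List["AudioScene"] = field(default_factory=list)
    segments: List[HomogeneousAudio] = field(default_factory=list)

@dataclass
class ToC:
    """Table of Contents: audio-visual and audio structures of one document."""
    document_time: TimeCode
    av_structure: List[Scene] = field(default_factory=list)
    audio_structure: List[AudioScene] = field(default_factory=list)
```

Browsing then amounts to a depth-first visit of the scene tree, displaying the K-frames of the shots reached at the desired level of detail.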

Figure 3. The audio-visual structure DS.

3.2. Analytical Index DS

The AI provides an ordered set of audio-visual objects of the multimedia document. An item in the AI can point at different locations and at different levels of abstraction (according to the hierarchy provided in the ToC). Hence, the main role of this DS is to support the retrieval of selected objects within the AV document. It is formed by two DSs: the Audio-visual object DS and the Audio object DS. AI objects can be semantic entities (like an AV scene belonging to a particular category, e.g., a dialog) or particular kinds of images (backgrounds, foreground objects, etc.), but audio objects as well (like a musical motif and/or some keywords from a speech-to-text transcription). These objects can be ordered according to various criteria, which are listed with the DS. It must be pointed out that, thanks to the AI, more than one shot or more than one scene can be referenced by the same AI item.

3.2.1. Audio-visual objects DS. The structure of this DS is shown in figure 4. The Ordering keys D is a set of possible keys for ordering the AI items, e.g., a certain color or size descriptor for objects contained in images. Two classes of AV objects are foreseen: scenes and images. Consequently, there are two main DSs belonging to the Audio-visual object DS. For each object included in the index there are several types of pointers and, for every type of pointer, several reference pointers, allowing access to the multimedia content through the hierarchy provided with the ToC.

3.2.2. Audio objects DS. The structure of this DS is analogous to that of the AV object DS. In this case the considered objects may be keywords (provided by a speech-to-text transcription),

Figure 4. The audio-visual object DS.

the identities of the speakers involved in the multimedia document, relevant musical motifs, etc.

3.3. Meta-descriptors DS

The role of this DS is to incorporate in the ToCAI DS a set of descriptors carrying information about how accurate the description is and by which means it has been obtained (see Section 4 for more details). The goal is not to describe the content, but to give an indication of the reliability with which descriptor values have been assigned throughout the ToCAI DS. First, it is important to let the user know who the content provider or the description provider may be. Other relevant information which may be considered includes, for example, the identification of the methods used to generate the description and the size of such a description. A series of descriptors indicating the reliability level of the extraction methods involved may be used to give users an idea of how much they can trust the description for answering their query. Thus these descriptors can usefully complement the description itself, since, e.g., an unreliable description may not be very helpful for understanding the content of a multimedia document.

3.4. Context DS

The ToCAI, which refers to the structure of an AV document, should be considered together with a DS describing the category of the audio-visual material. This Context DS includes descriptors such as the title of the program, actors, director, language, country of origin, etc. Indeed, this information is necessary for retrieval purposes, as it restricts the search domain. This DS may in addition serve to uniquely identify a program, as is commonly done in program guides, by means of, e.g., the title of the program, the country of origin, the year of production, etc.

4. Automatic extraction tools

We have adopted several tools for the automatic generation of descriptor values when describing an AV document according to the ToCAI structure. With respect to the extraction of Key Items, it is necessary to select a certain application context in order to develop ad-hoc automatic extraction tools. For instance, in order to detect salient events in a football game, automatic goal detection procedures can be derived by using a conjunction of camera motion information, the average duration of shots within a certain time interval, and the presence of a static scene.
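As an illustration of how such an ad-hoc tool could be built, the sketch below flags candidate goal events from per-shot features. The feature names, thresholds and voting rule are hypothetical and chosen only to show the idea of conjoining camera motion, average shot duration and the presence of a following static scene; this is not the detector actually used here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ShotFeatures:
    start_s: float            # shot start time (seconds)
    duration_s: float         # shot duration
    avg_camera_motion: float  # mean global-motion magnitude (arbitrary units)
    followed_by_static: bool  # True if the next shot is (almost) static

def candidate_goals(shots: List[ShotFeatures],
                    motion_thr: float = 5.0,
                    short_shot_s: float = 3.0,
                    window: int = 4) -> List[float]:
    """Return start times of shots likely to contain a goal.

    Assumed heuristic: a goal is signalled by strong camera motion, a burst of
    short shots around the event (replays, crowd shots), and a static scene
    right after (celebration / close-up).
    """
    times = []
    for i, s in enumerate(shots):
        neighbours = shots[max(0, i - window): i + window + 1]
        avg_dur = sum(n.duration_s for n in neighbours) / len(neighbours)
        if (s.avg_camera_motion > motion_thr
                and avg_dur < short_shot_s
                and s.followed_by_static):
            times.append(s.start_s)
    return times
```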

5. Audio extraction tools

We have used a simple but very useful classification of the audio stream in order to obtain some semantic information. This information has been used to support the automatic generation of the ToC part of the description, jointly with the video segmentation methods. In line with several studies in the audio segmentation area [6, 9, 14, 20], we have considered an audio model composed of four non-overlapping classes: silence, music, speech and noise. Despite its simplicity, this model is sufficiently descriptive for our purposes. The approach is based on a multilevel analysis: at each level a different algorithm is applied to carry out a finer segmentation of the audio stream (see figure 5). In the first step the audio file is separated into silence and non-silence frames. Our silence detection algorithm uses a statistical pattern classification scheme based on the short-time energy function, which provides an adequate measure of the amplitude variations of the signal and has been used in several silence detection frameworks [3, 8]. The algorithm does not require any a priori information on the signal distribution. It performs, when possible, an initial training in order to estimate the statistics of the local energy level of the background noise, which defines the initial condition for silence. The training takes place on the first few samples of the audio signal (typically 0.4 sec). Obviously, these samples have to be representative of the background noise of the entire audio signal; if this condition is not satisfied, we instead use an estimate based on a random selection of frames with a low energy level. Starting from these initial estimates, the statistics of the silence energy level are dynamically updated to avoid misclassifications due to changes in the statistics. The only necessary assumption is that the background noise does not exhibit abrupt changes in its statistics.
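A minimal sketch of such an energy-based silence detector is given below, assuming a mono signal sampled at 22050 Hz. The 0.4 s training window and the running update of the background-noise statistics follow the description above, while the specific frame length, the k-sigma decision rule and the update factor alpha are our own assumptions.

```python
import numpy as np

def detect_silence(x, fs=22050, frame_len=300, train_s=0.4, k=3.0, alpha=0.05):
    """Label each frame of signal x as silence (True) or non-silence (False).

    - Short-time energy is computed on non-overlapping frames.
    - The silence statistics (mean, std of the energy) are initialised on the
      first `train_s` seconds, assumed to contain only background noise.
    - The statistics are updated recursively on frames classified as silence,
      so slow changes of the background noise are tracked.
    """
    n_frames = len(x) // frame_len
    energy = np.array([np.sum(x[i*frame_len:(i+1)*frame_len].astype(float)**2)
                       for i in range(n_frames)])

    n_train = max(1, int(train_s * fs) // frame_len)
    mu, sigma = energy[:n_train].mean(), energy[:n_train].std() + 1e-12

    silent = np.zeros(n_frames, dtype=bool)
    for i, e in enumerate(energy):
        if e < mu + k * sigma:
            silent[i] = True
            # dynamic update of the background-noise statistics
            mu = (1 - alpha) * mu + alpha * e
            sigma = (1 - alpha) * sigma + alpha * abs(e - mu)
    return silent
```

If the first frames are not representative of the background noise, the same initial statistics can instead be estimated from a random selection of low-energy frames, as noted above.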

Figure 5. The segmentation algorithm.


If the background noise can be considered wide-sense stationary, at least over short time intervals, the above hypothesis makes it possible to dynamically follow the changes of its statistics. As indicated previously, once silence and non-silence segments have been separated, the non-silence segments are further classified into speech, music and noise. For this purpose we use a characteristic of speech to discriminate it from music and noise: speech shows a very regular structure, which music and noise do not exhibit. Indeed, speech is composed of a succession of vowels and consonants: while vowels are high-energy events with most of the spectral energy contained at low frequencies, consonants are noise-like, with the spectral energy distributed more towards the higher frequencies. Saunders [17] used the zero-crossing rate (ZCR), which is a good indicator of this behavior. The audio file (without the silence segments) is partitioned into segments of 2.04 sec, each of which is composed of 150 consecutive non-overlapping frames. These values make the number of frames statistically significant and, using a 22050 Hz sampling frequency, each frame contains 300 samples, which is an adequate trade-off between the quasi-stationarity of the signal and a length sufficient to evaluate the ZCR. For every frame, the value of the zero-crossing rate is calculated using the definition in [12]. These 150 values of the ZCR are used to estimate the following statistical measures: the variance, the skewness and the difference between the number of ZCR samples which are above and below the mean. Each segment of 2.04 sec is thus associated with a 3-dimensional vector. To achieve the separation between speech and the other classes with a computationally efficient implementation, a multivariate Gaussian classifier has been used. At the end of this step we have a set of consecutive segments labeled as speech or non-speech. The next step is justified by an empirical observation: the probability of observing a single segment of speech surrounded by music segments is very low, and vice versa. Therefore a simple procedure is applied to change the labels of these spurious segments. Owing to the nature of the ZCR algorithm, the boundaries between segments of different classes are placed at fixed positions. Obviously these boundaries are not placed sharply, so a fine-level analysis of the segments across the boundaries is needed to determine their exact placement. The ZCR values of the neighboring segments are processed to identify the exact moment of the transition between speech and non-speech signal. The last step is the music-noise separation. In general, music can be considered harmonic, where a harmonic sound consists of a series of major frequency components including the fundamental frequency and those which are integer multiples of it. According to this observation, we can measure the degree of harmonicity to discriminate between music and noise (most environmental sounds are non-harmonic). A simple pitch detector based on the FFT applied to a 1024-sample window is used to achieve this objective.
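The ZCR-based features described above can be sketched as follows, using the figures given in the text (22050 Hz, 300-sample frames, 150 frames per 2.04 s segment). The Gaussian classification step is only hinted at: the class means and covariances are assumed to come from labelled training segments, which are not shown here.

```python
import numpy as np
from scipy.stats import skew, multivariate_normal

FS, FRAME, FRAMES_PER_SEG = 22050, 300, 150   # 150 * 300 / 22050 ≈ 2.04 s

def zcr(frame):
    """Zero-crossing rate of one frame (number of sign changes)."""
    return np.count_nonzero(np.diff(np.signbit(frame).astype(np.int8)))

def segment_features(x):
    """Return one 3-D feature vector per 2.04 s segment of signal x."""
    seg_len = FRAME * FRAMES_PER_SEG
    feats = []
    for s in range(0, len(x) - seg_len + 1, seg_len):
        rates = np.array([zcr(x[s + i*FRAME: s + (i+1)*FRAME])
                          for i in range(FRAMES_PER_SEG)])
        above_below = np.sum(rates > rates.mean()) - np.sum(rates < rates.mean())
        feats.append([rates.var(), skew(rates), above_below])
    return np.array(feats)

def classify_speech(feats, speech_mu, speech_cov, other_mu, other_cov):
    """Gaussian likelihood decision: True where a segment looks like speech.
    Means and covariances are assumed to be estimated on labelled data."""
    p_speech = multivariate_normal(speech_mu, speech_cov).pdf(feats)
    p_other = multivariate_normal(other_mu, other_cov).pdf(feats)
    return p_speech > p_other
```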

6. Shot cut detection and classification of editing effects

An image sequence is formed of a certain number of shots, each corresponding to a series of frames typically extracted from a single camera record. During the editing stages, each camera record is linked to the others by means of editing effects such as cuts, which simply define a sharp transition; wipes, which correspond to a sliding window effect; mattes, which let a dark blob invade the frames; and dissolves, which represent a gradual change from one camera record to another by a simple linear combination of the frames involved in the dissolve process. The weights of the linear combination are set on the basis of the distance of the frame which is part of the dissolve with respect to the beginning and the end of the surrounding shots, respectively. Other effects such as fade-in and fade-out are also used: the first simply corresponds to a dissolve from a continuous black shot to a real camera record, while in the second case a camera record is progressively darkened.

A first attempt to temporally segment a video sequence is performed by identifying the single shots, i.e., by recovering the various editing effects. This is very relevant, as the temporal organization of shots often has a semantic significance, which may be used to recover high-level descriptors of the audio-visual data. Once the shots have been separated, it is also possible to summarize the information content by the use of indices such as K-frames or video objects (mosaics/sprites). As some editing effects, such as dissolves, have more semantic significance than others, it is also important to recover the type of editing being used. Depending on the type of audio-visual material, some editing effects may be used more than others. For example, a football game is likely to contain only cuts, whereas the situation is quite different for movies.

In our work the shot segmentation is carried out by means of an algorithm based on histogram analysis. More specifically, first of all the probability density function (pdf) of each image of a shot is estimated (see figure 6). Then abrupt transitions (cuts) and gradual transitions (dissolves, wipes, etc.) are detected and classified separately. In the last phase the two previous results are merged together.

Figure 6. The shot cut algorithm.

6.1. Abrupt transition detection

The cut detection algorithm is based on the χ² distance between the histograms of contiguous images. These error values are then normalized with respect to the maximum error in a window of 20 frames, in order to avoid false detections due to fast variations of brightness (flashes, etc.). If the normalized value of the considered frame (located in the center of the window) is higher than a fixed threshold, then a cut is detected.
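A compact sketch of this cut detector is given below, using grayscale histograms. The 20-frame window follows the description above; the number of bins, the threshold value and the reading of the normalization as "the central difference must stand well above every other difference in the window" are our own assumptions.

```python
import numpy as np

def chi2(h1, h2, eps=1e-9):
    """Chi-square distance between two normalised histograms."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def detect_cuts(frames, bins=64, window=20, thr=3.0):
    """frames: iterable of grayscale images.
    Returns the indices i such that a cut is detected between frames i-1 and i."""
    hists = [np.histogram(f, bins=bins, range=(0, 256), density=True)[0]
             for f in frames]
    d = np.array([chi2(hists[i - 1], hists[i]) for i in range(1, len(hists))])

    cuts, half = [], window // 2
    for i in range(len(d)):
        lo, hi = max(0, i - half), min(len(d), i + half + 1)
        neighbours = np.delete(d[lo:hi], i - lo)     # window without the centre
        if neighbours.size == 0:
            continue
        # a cut must stand well above every other difference in the window,
        # which filters out flashes that raise several consecutive differences
        if d[i] / (neighbours.max() + 1e-9) > thr:
            cuts.append(i + 1)
    return cuts
```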

6.2. Gradual transition detection

This algorithm is based on a statistical modeling of editing effects described in [2]. The following discussion refers to the dissolve effect, but it is also applicable to wipes, picture-in-picture and other gradual transition effects. It is assumed that a dissolve is created by overlapping two camera records for the duration of the dissolve and by weighting the contribution of each frame, at any given instant, on the basis of its location with respect to the boundaries of the editing effect. If the two overlapping shots are identified as shot_prev and shot_next, the following relationship holds when they overlap for n = n_in, ..., n_out:

$$F[n] = \frac{n_{out} - n}{n_{out} - n_{in}} \, F_{in}[n] + \frac{n - n_{in}}{n_{out} - n_{in}} \, F_{out}[n] \qquad (1)$$

At times, it is also possible to have a non-linear combination of weights in the formation of the dissolve frame F[n]. The model proposed hereafter could be adapted accordingly. Assuming that the series of frames forming each shot are outcomes of the same stationary random process (at least for first-order statistics), an estimate of the marginal pdf of each process is represented by the last frame of shot_prev/shot_next prior to the dissolve, F_in/F_out, respectively. Normally, frames F_in and F_out are quite different, so it can reasonably be assumed that their associated histograms are those of two independent random variables. Accordingly, for any frame F[n] that is part of the dissolve, its associated first-order statistics can be estimated by the convolution of the histograms H_in and H_out, properly scaled to take into account the weighting factors contained in Eq. (1). This implies that the difference between H[n] and H_in ∗ H_out should ideally be zero. On the contrary, if F_in and F_out belong to the same shot, this histogram difference would be different from zero.
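The check described above can be sketched as follows: the histograms of the two boundary frames are scaled according to the weights of Eq. (1), convolved, and compared with the histogram actually observed at frame n. The bin remapping used to scale the histograms and the L1 distance used for the comparison are simplifications of ours.

```python
import numpy as np

def hist(img, bins=256):
    h, _ = np.histogram(img, bins=bins, range=(0, 256))
    return h / h.sum()

def scaled_hist(img, weight, bins=256):
    """Histogram of the pixel values multiplied by `weight` (0 < weight <= 1)."""
    h, _ = np.histogram(weight * img.astype(float), bins=bins, range=(0, 256))
    return h / h.sum()

def dissolve_mismatch(frame_n, f_in, f_out, alpha):
    """Distance between the observed histogram of frame_n and the histogram
    predicted by the dissolve model of Eq. (1).

    alpha = (n_out - n) / (n_out - n_in) is the weight of f_in at instant n.
    """
    # pdf of alpha*X + (1-alpha)*Y, with X and Y independent, is the
    # convolution of the pdfs of the two scaled variables
    h_pred = np.convolve(scaled_hist(f_in, alpha),
                         scaled_hist(f_out, 1.0 - alpha))[:256]
    h_pred /= h_pred.sum()
    return np.abs(hist(frame_n) - h_pred).sum()
```

A small mismatch supports the dissolve hypothesis; a large one suggests that the two boundary frames belong to the same shot.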

7. Visual scene segmentation

In order to generate indices that can be used to access a video database, a description of each video sequence is necessary. The key to understanding the content of a video sequence is a temporal scene segmentation and classification. Shot cut detection has been used as a technique to segment video sequences [18]. Unfortunately, visual information, even when organized into consecutive shots, does not always convey semantically meaningful information. A more adequate representation can be obtained when groups of consecutive shots are merged into semantically more coherent entities called scenes. With this objective in mind, recent works have demonstrated the potential of analyzing the associated audio signal for video scene identification and classification [15, 16]. A cross-modal analysis of low-level visual and audio features has often provided intermediate but satisfactory semantic descriptions of an audio-visual sequence. In this work the emphasis is placed on the identification of four different types of scenes (dialogs, stories, actions and events), by means of a joint audio-visual analysis.

Figure 7. General scheme for cross-modal analysis. AC: Audio Classification; SCD: Shot Cut Detector; VFE: Video Feature Extractor; SDC: Scene Detection and Characterization.

7.1. Scenes by cross-modal analysis

The identification and characterization of scenes is an important step for speeding up the retrieval process. Normally, the process of understanding audio-visual material performed by a human being requires a joint analysis of both the visual and the audio signal. At a first level of abstraction, there is an unconscious grouping of segments of audio-visual material into semantically consistent scenes. To reflect this type of processing, a possible general scheme [16] for jointly using audio and video information for scene change detection and characterization is shown in figure 7. On one hand, the audio signal is processed by the AC module, which separates consecutive groups of audio frames into:

– Silence segments, which define those audio frames that contain only a quasi-stationary background noise, with a low energy level with respect to signals belonging to other classes.
– Speech segments, which contain voiced, unvoiced and plosive sounds.
– Music segments, which contain a combination of sounds with peculiar characteristics of periodicity.
– Miscellaneous sound segments, which correspond to all other categories, i.e., everything which does not belong to the previous classes.

For a detailed description of how the AC module may operate, so as to decompose a generic audio signal into a sequence of segments of a consistent class type according to the previous list, refer to [15, 16]. On the other hand, the visual information is first analyzed by the SCD module, in order to detect shot cuts. For each shot, the VFE module then extracts a VQ codebook, so as to reconstruct the associated shot with a certain distortion with respect to the original visual information. The outputs of both the AC and VFE modules are then passed to the SDC module, which tries to merge consecutive shots together. The SDC module describes each shot S_i with a set of metadata: the time interval in which the shot resides, the dominant associated audio class, the percentage of samples belonging to that class (evaluated over the entire shot), a reference to the codebook associated with the shot, and a label identifying the visual content of the shot.


Once such meta information (which serves the role of abstract keywords) has been extracted, it can be processed so as to identify four different types of scenes:

– Dialogs: the audio signal is mostly speech and the associated visual information changes in an alternating fashion, that is, the visual labels (which should ideally reflect a change of speaker) follow a pattern of the type ABABAB...
– Stories: the audio signal is mostly speech while the associated visual information exhibits the repetition of a given visual content, creating a shot pattern of the type ABCADEFGAH...
– Actions: the audio signal belongs mostly to one class (which is not speech) and the visual information exhibits a progressive pattern of shots with contrasting visual contents, of the type ABCDEF...
– Generic scenes: consecutive shots which do not belong to any of the aforementioned scene types, but whose associated audio is of a consistent type.

In order to verify the repetition of similar visual cues among non-consecutive shots, a similarity measure is needed (a sketch checking label patterns against the scene types above is given below). In the following section, the techniques used to segment and characterize the visual information are presented.
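The sketch below checks a list of shot labels (as produced by the labeling procedure of Section 7.3) against the dialog, story and action patterns, together with the corresponding audio requirement. The exact-pattern tests and the 80% speech fraction are illustrative assumptions; the tolerance to noisy labels discussed in Section 7.3 is not included.

```python
from typing import List, Optional

def classify_pattern(visual: List[str], audio: List[str]) -> Optional[str]:
    """visual: one label per shot (e.g. ['A', 'B', 'A', 'B']).
    audio:  dominant audio class per shot ('speech', 'music', 'silence', 'misc').
    Returns 'dialog', 'story', 'action' or None (no noisy-label tolerance)."""
    speech_ratio = audio.count("speech") / len(audio)

    # Dialog: two labels strictly alternating (ABABAB..) and mostly speech.
    alternating = (len(set(visual)) == 2
                   and all(a != b for a, b in zip(visual, visual[1:])))
    if alternating and speech_ratio >= 0.8:
        return "dialog"

    # Story: some label repeats non-consecutively (ABCADEFGAH..), mostly speech.
    repeats = any(visual.count(lab) > 1 for lab in set(visual))
    if repeats and not alternating and speech_ratio >= 0.8:
        return "story"

    # Action: all labels distinct (ABCDEF..) and one dominant non-speech class.
    non_speech = [a for a in audio if a != "speech"]
    if len(set(visual)) == len(visual) and non_speech:
        dominant = max(set(non_speech), key=non_speech.count)
        if audio.count(dominant) / len(audio) >= 0.8:
            return "action"
    return None
```

For example, classify_pattern(['A', 'B', 'A', 'B', 'A'], ['speech'] * 5) returns 'dialog'.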

7.2. VQ codebook design

For each shot, a VQ codebook is designed so as to reconstruct the shot with a certain distortion with respect to the original visual information. Such a codebook is constructed using Learning Vector Quantization (LVQ). More specifically, every frame of a shot is divided into non-overlapping blocks of size M × N, scanning the image from left to right and top to bottom. All blocks of all frames belonging to the shot are then passed to the LVQ algorithm. The number of reference vectors is chosen equal to the number of blocks in a frame. The reference vectors returned by the LVQ algorithm define the codebook associated with the shot.

7.2.1. Similarity measure. Once a codebook has been associated with a shot, a similarity measure between two shots, based on the code vectors representing the shots, can be defined. Let S_i be a shot and let C_j be a codebook; when a vector x ∈ S_i is quantized to a vector v ∈ C_j, a quantization error occurs. This quantization error may be measured by the average distortion

$$D_j(S_i) = \frac{1}{M} \sum_{l=0}^{M-1} \| x_{il} - v_{jk} \| \qquad (2)$$

where M is the number of vectors $x_{il}$ of shot $S_i$ (i.e., the number of blocks obtained from its frames), and $v_{jk}$ is the code vector of $C_j$ with the smallest distance from $x_{il}$, i.e., $k = \arg\min_u \| x_{il} - v_{ju} \|$


with $v_{ju} \in C_j$. Furthermore, $\| D_j(S_i) - D_k(S_i) \|$ can be interpreted as the distance between two codebooks ($C_j$ and $C_k$) when applied to shot $S_i$. A similarity measure between two shots can thus be defined as follows:

$$S_{vq}(S_i, S_j) = \| D_j(S_i) - D_i(S_i) \| + \| D_i(S_j) - D_j(S_j) \| \qquad (3)$$

where $D_i(S_i)$ is the distortion obtained when shot $S_i$ is quantized using its associated codebook. The smaller $S_{vq}$, the more similar the shots are. Note that the similarity is based on the cross-effects of the codebooks on each shot. In fact, if only one of the two terms of the sum in Eq. (3) is used, the resulting value may be low even though the two shots are different. This happens, for example, when the majority of the blocks of one shot, e.g., $S_j$, are similar to a subset of the blocks of the other shot, $S_i$. In that case, the codebook associated with $S_i$ can represent $S_j$ with a small average distortion, whereas the codebook associated with $S_j$ cannot attain the same distortion level when applied to $S_i$. Once a similarity measure between two shots has been defined, the next step is to identify and characterize scenes using joint audio and visual information.
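A direct translation of Eqs. (2) and (3) is sketched below: each shot is represented by the matrix of its block vectors and each codebook by the matrix of its code vectors. The Euclidean norm and the brute-force nearest-code-vector search are assumptions made for clarity.

```python
import numpy as np

def distortion(shot_blocks: np.ndarray, codebook: np.ndarray) -> float:
    """Average distortion D_j(S_i) of Eq. (2).

    shot_blocks: (M, d) array, one row per block vector of the shot.
    codebook:    (K, d) array, one row per code vector.
    """
    # distances between every block and every code vector, shape (M, K)
    dists = np.linalg.norm(shot_blocks[:, None, :] - codebook[None, :, :], axis=2)
    # each block is quantised to its nearest code vector
    return dists.min(axis=1).mean()

def s_vq(shot_i: np.ndarray, cb_i: np.ndarray,
         shot_j: np.ndarray, cb_j: np.ndarray) -> float:
    """Similarity measure of Eq. (3); the smaller, the more similar the shots."""
    return (abs(distortion(shot_i, cb_j) - distortion(shot_i, cb_i))
            + abs(distortion(shot_j, cb_i) - distortion(shot_j, cb_j)))
```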

7.3. Scene identification

The identification of the four types of scenes defined in the previous section (dialogs, stories, actions and generic scenes) is performed by considering the audio classification and exploiting visual similarities between shots. To establish visual correlations, it is practically impossible to consider all combinations of non-consecutive shots of a video sequence. As it is not very significant to compare shots that are very distant in time, similarities are measured only within a certain temporal window. The algorithm starts by labeling groups of L consecutive shots (typically L = 5) using the following approach. The first two shots are labeled independently of all the others. The third shot is then compared with the first and, if $S_{vq}(S_1, S_3) < \varepsilon$ (see Eq. (3)), it is assigned the same label as the first shot; otherwise, it receives a new label. More generally, shot $S_i$ (with 3 ≤ i ≤ L) is compared to shot $S_{i-2}$. If $S_{vq}(S_i, S_{i-2}) < \varepsilon$, it is labeled with the same label assigned to shot $S_{i-2}$, and the algorithm proceeds by analyzing the next shot $S_{i+1}$. Otherwise, shot $S_i$ is compared to shot $S_{i-3}$, and the procedure is iterated, if necessary, until the first shot in the temporal window is reached. Each time the current shot does not compare well with the previously labeled shots, a new label is assigned to it. Once the first L shots have been labeled, the algorithm checks whether the pattern of labels identifies any of the three scene types with specific visual patterns (dialogue, story or action). To reach such a conclusion, the associated audio signal is also considered. If the audio signal classification corresponds to the specification of the detected scene, the scene type is considered consistent and the algorithm can proceed. The temporal window is then enlarged so as to include the next M shots (typically M = 3 in the case of a story and M = 1 in the case of a dialogue or an action). The new M shots are labeled according to the shots they are similar to in the window, with the same procedure as before.


The strategy is to increase the scene duration by incorporating shots which keep the scene type unchanged. The procedure of enlarging the window is continued as long as the scene type remains the same. When a set of M new shots does not obey the specification of the scene type detected so far, a new scene of a different type is searched for, by starting the whole scene identification process from scratch. The last set of M consecutive shots is again increased to L. If there is no correspondence with the proposed scene classes, the analysis window shifts one shot to the left and the algorithm starts again. At the end, all consecutive shots which do not belong to a dialogue, an action or a story are grouped together into a so-called generic scene, provided their associated audio belongs to the same class. All shots which are not labeled are not grouped to form a scene, but are left as isolated shots.

In order to make the method more robust to possible errors in the labeling procedure, a "noisy" approach must be considered. For example, it is reasonable to detect a dialogue if the associated pattern of visual labels corresponds to ABACBABAB and the associated audio has been mostly speech for the entire duration; in this case C represents the noisy component (i.e., shot). It has been noticed that a reasonable way to operate is to make the number of acceptable noisy labels dependent on the number of shots forming the scene. Accordingly, if a scene already contains l shots, then the number of noisy [audio, visual] components must be lower than or equal to $l / \{V,A\}D_i$, with i = 1, 2, 3 for a dialog, story, and action, respectively. Typically $VD_1 = VD_3 = AD_1 = AD_3 = 5$ and $VD_2 = AD_2 = 7$. $l/VD_i$ represents the maximum acceptable number of visual components which interrupt (or create, in the case of an action) a repetition of the visual content; $l/AD_i$ is the maximum acceptable number of shots forming the scene whose associated audio does not correspond to the type of audio associated with that scene.

To determine the audio pattern associated with a shot, the algorithm first identifies a possible type of scene by analyzing the visual information only. In case the visual labels follow the pattern of a dialog or a story, an audio label S is associated with each shot of the pattern whenever at least 80% of the associated audio is speech, and the label S′ otherwise. Thus, the number of noisy audio components corresponds to the number of shots in the processed temporal window having the associated audio label equal to S′. In case the visual labels follow the pattern of an action, each shot of the pattern will have an audio label determined as follows: (1) if 100% of the associated audio of the shot belongs to the same class, the shot will have the audio label corresponding to that class, i.e., s in the case of silence, S in the case of speech, M in the case of music and N otherwise; (2) if the associated audio of the shot has been classified as belonging to a specific class at least 90% of the time, the audio label associated with that shot will be s′, S′, M′ or N′, in case the dominant class is silence, speech, music or miscellaneous other sounds, respectively; (3) in all other cases, i.e., whenever there is no dominant audio class present at least 90% of the time, the audio label is left unclassified (0). Hence, in the case of an action, the noisy audio labels are s′, M′ and N′.
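The audio-label assignment rules (1)-(3) can be summarised in a few lines. The input is assumed to be the fraction of each audio class within the shot, as produced by the classifier of Section 5; the class-to-letter mapping follows the text.

```python
def audio_label(fractions: dict) -> str:
    """fractions: e.g. {'silence': 0.0, 'speech': 0.95, 'music': 0.05, 'misc': 0.0}
    Returns 's', 'S', 'M', 'N' (pure class), the primed variants "s'", "S'",
    "M'", "N'" (dominant at >= 90%), or '0' when no class dominates."""
    letter = {"silence": "s", "speech": "S", "music": "M", "misc": "N"}
    dominant, share = max(fractions.items(), key=lambda kv: kv[1])
    if share == 1.0:
        return letter[dominant]          # rule (1): a single class covers the shot
    if share >= 0.9:
        return letter[dominant] + "'"    # rule (2): dominant but not exclusive
    return "0"                           # rule (3): no dominant class
```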
Consecutive shots identify an action if the visual labels follow the pattern of an action (e.g., ABCDEF), the number of noisy audio labels is less than or equal to $l/AD_i$, and none of the audio labels corresponds to 0, S or S′. Whenever one of these labels is encountered, it implies the end of the action. When dealing with generic scenes, no noise is considered, as they do not exhibit any specific visual pattern.
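Finally, the label-assignment step the scene identification starts from (comparing each new shot with the previously labeled ones inside the window) can be sketched as follows. The backward search order and the threshold ε follow the description above; the helper names are hypothetical and the similarity function is meant to be the S_vq sketched in Section 7.2.1.

```python
from typing import Callable, List, Sequence

def label_window(shots: Sequence, codebooks: Sequence,
                 similarity: Callable, eps: float) -> List[str]:
    """Assign a visual label to each shot in the temporal window.

    shots[i]/codebooks[i]: block vectors and VQ codebook of shot i (Sec. 7.2).
    Shot i is compared with shots i-2, i-3, ... and inherits the first label
    whose similarity is below eps; otherwise it receives a new label."""
    labels: List[str] = []
    new_label = iter("ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    for i in range(len(shots)):
        assigned = None
        if i >= 2:
            for j in range(i - 2, -1, -1):     # i-2, i-3, ..., 0
                if similarity(shots[i], codebooks[i],
                              shots[j], codebooks[j]) < eps:
                    assigned = labels[j]
                    break
        labels.append(assigned if assigned else next(new_label))
    return labels
```

A usage example would be labels = label_window(shots, codebooks, s_vq, eps), after which the resulting label pattern can be checked with classify_pattern from Section 7.1.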

8. The ToCAI DS: an implementation

The ToCAI DS appears well suited to describing the content of a large archive of AV programs such as movies, news, sports and so on. The ToC allows navigation at different levels of detail (scene, shot or micro-segment), while the AI gives the possibility of retrieving shots of interest (such as goals in a football match or the speaker in a news program) according to an appropriate Ordering Key. We now show an implementation example of the ToCAI DS applied to a set of programmes extracted from the MPEG-7 Content Set. The programs were segmented into shots, which were then clustered into scenes and further segmented into micro-segments. Figure 8 shows a typical screen-shot of browsing the ToC part of the scheme. In the left box there is a tree-view object, in which the first level is the program level, the second is the scene level, the third is the shot level and the fourth (not present in the figure) is the micro-segment level.

Figure 8. A screen shot of the ToC part.

The navigation through the scheme is similar to normal navigation in a directory structure: at each level one can expand the folder and view the level below. In figure 8 the tree-view shows the scene level (Spain 1-0 Morientes) contained in the programme Spain-Sweden football match. The five pictures at the bottom of the form contain the K-frames associated with every shot belonging to the scene. It is possible to scroll forward and backward through the shots as well as to play a shot of interest (in the example the second shot is playing). For easier browsing of the programme, some standard control buttons are placed under the playing area. For the AI part we provide a description of the demo material based on assumptions about the user's preferences; this description is arbitrary and is only meant to show an example of AI. In figure 9, navigating through the items, we arrive at a level which contains all the shots belonging to the Spain-Sweden football match, ordered by temporal position. This is not a useful order for a user who is interested, for example, in retrieving the goals. For this task it is possible to use one of the proposed Ordering Keys, e.g., the audio loudness. According to this OK, the first picture contains the K-frame of Spain's first goal, the second contains the K-frame of the beginning of the match, and the third contains the K-frame of Spain's second goal. The audio loudness, despite its simplicity, appears to be a useful Ordering Key for the task of retrieving goals in a football match.

Figure 9. Audio loudness ordering key.

Obviously it is not suited to other tasks, for which other Ordering Keys can be useful. In figure 10 we show a situation in which the aim is to retrieve the shots containing the speaker within a news program. For this task an appropriate OK seems to be the dominant hue. Using the interface control box it is possible to choose the dominant hue and to order the shot K-frames according to their similarity with respect to the selected hue. As can be seen, the K-frames containing the speaker appear in nearby picture boxes (occasional errors occur due to the simplicity of the OK).

Figure 10. Dominant hue ordering key.

9. Conclusion

This paper has presented the ToCAI DS as a framework for multimedia content description. The proposed audio-visual DS is based on four main structures: 1) a Table of Contents DS for semantically characterizing the temporal structure of the multimedia document; 2) an Analytical Index DS providing an ordered set of relevant objects of the document with their links to the document itself; 3) a Context DS describing the category of the document; 4) a Meta-descriptors DS giving useful information about the description itself and its reliability. The detailed structure of the DS has also been presented, and an application example was shown.

Moreover, some suitable automatic extraction methods for the identification of the values associated with the descriptors that compose the ToCAI have been presented and discussed.

Notes

1. A shot is defined as a sequence of frames captured from a single, continuous camera record.
2. A mosaic represents the background in a shot. An outlier represents a foreground object in motion with respect to the background. These are typically extracted by means of mosaicing techniques, which allow the registration of regions at different layers moving differently throughout the image sequence.

References

1. N. Adami, A. Bugatti, R. Leonardi, P. Migliorati, and L. Rossi, "ISO/IEC JTC1/SC29/WG11/M4586: The ToCAI DS for audio-visual documents. Structure and concepts," MPEG-7, Seoul, Korea, March 1999.
2. N. Adami and R. Leonardi, "Identification of editing effects in image sequences by statistical modelling," in Proc. of the 1999 Picture Coding Symposium, Portland, OR, USA, April 1999.
3. P. De Souza, "A statistical approach to the design of an adaptive self-normalizing silence detector," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 31, No. 3, pp. 678–684, 1983.
4. A. Ferman, A. Tekalp, and R. Mehrotra, "Effective content representation for video," in Proc. IEEE International Conference on Image Processing, Chicago, IL, Oct. 1998.
5. M. Fowler, UML Distilled, Addison-Wesley, 1997.
6. J. Foote, "A similarity measure for automatic audio classification," in Proc. AAAI'97 Spring Symposium on Intelligent Integration and Use of Text, Image, Video and Audio Corpora, 1997.
7. O.N. Gerek and Y. Altunbasak, "Key frame selection from MPEG video," in Proc. SPIE Visual Communications and Image Processing, 1997, Vol. 3024, pp. 920–925.
8. L.F. Lamel, L.R. Rabiner, A.E. Rosenberg, and J.G. Wilpon, "An improved endpoint detector for isolated word recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 29, No. 4, pp. 777–785, 1981.
9. C. Montaci and M. Caraty, "A silence/noise/music/speech algorithm," in Proc. International Conference on Spoken Language Processing, Sydney, 1998.
10. MPEG Requirements Group, "MPEG-7: Context and Objectives," ISO/IEC JTC1/SC29/WG11 N2460, MPEG98, Atlantic City, USA, Oct. 1998.
11. MPEG Requirements Group, "MPEG-7: Requirements," ISO/IEC JTC1/SC29/WG11 N2461, MPEG98, Atlantic City, USA, Oct. 1998.
12. L. Rabiner and R. Schafer, Digital Processing of Speech Signals, Prentice-Hall (Alan Oppenheim, editor).
13. Y. Rui, T. Huang, and S. Mehrotra, "Browsing and retrieving video content in a unified framework," in Proc. IEEE Workshop on Multimedia Signal Processing, Dec. 1998.
14. C. Saraceno, "Content-based representation and analysis of video sequences by joint audio and visual characterization," Ph.D. thesis, Brescia, 1998.
15. C. Saraceno and R. Leonardi, "Indexing audio-visual databases through a joint audio and video processing," International Journal of Imaging Systems and Technology, Vol. 9, No. 5, pp. 320–331, 1998.
16. C. Saraceno and R. Leonardi, "Identification of story units in audio-visual sequences by joint audio and video processing," in Proc. International Conference on Image Processing, Chicago, IL, USA, Oct. 1998.
17. J. Saunders, "Real time discrimination of broadcast music/speech," in Proc. ICASSP-1996, 1996, pp. 993–996.
18. I.K. Sethi and N. Patel, "A statistical approach to scene change detection," in Proc. of the SPIE Conf. on Storage and Retrieval for Image and Video Databases III, SPIE-2420, Feb. 1995, pp. 329–338.
19. S. Smoliar and L. Wilcox, "Indexing the content of multimedia documents," in Proc. Second International Conference on Visual Information Systems, San Diego, CA, 1997.
20. T. Zhang and C.-C. Jay Kuo, "Audio-guided audiovisual data segmentation and indexing," in IS&T/SPIE's Symposium on Electronic Imaging Science & Technology—Conference on Storage and Retrieval for Image and Video Databases VII, San Jose, Jan. 1999, Vol. 3656, pp. 316–327.

Nicola Adami was born in Rovereto, Italy, on March 19, 1971. He received the Dr. Eng. degree in electronic engineering in 1998 from the University of Brescia, Brescia. He has been a Ph.D. student at the University of Brescia since 1998. He was involved in the European project AVIR from 1998 to 2000. His research interests are in image processing, motion estimation, filtering, and vector quantization. His efforts are currently devoted to automatic content analysis of video sequences.

Alessandro Bugatti was born in Brescia, Italy, on June 19, 1971. He received the Dr. Eng. degree in electronic engineering in 1998 from the University of Brescia, Brescia. He has worked at the University of Brescia as an external consultant. He was involved in the European project AVIR from 1998 to 2000 and in the MPEG-7 standardization process. His research interests are in audio segmentation and classification, user interfaces and audio-video indexing. His background also includes studies in the fields of AI and expert systems. His efforts are currently devoted to both automatic content analysis of audio sequences and cross-modal analysis.

Riccardo Leonardi obtained his Diploma (1984) and Ph.D. (1987) degrees in Electrical Engineering from the Swiss Federal Institute of Technology in Lausanne. He spent one year (1987–88) as a post-doctoral fellow with


the Information Research Laboratory of the Department of Electrical and Computer Engineering at the University of California, Santa Barbara (USA). From 1988 to 1991, he was a Member of Technical Staff at AT&T Bell Laboratories, performing research on image communication systems in the Visual Communications Research Department. In 1991, he returned briefly to the Swiss Federal Institute of Technology in Lausanne to coordinate the research activities of the Signal Processing Laboratory. In March 1992, he was appointed at the University of Brescia to lead research and teaching in the field of Telecommunications, where he holds the Signal Processing Chair. His main research interests cover the field of digital signal processing applications, with specific expertise in visual communications and content-based analysis of audio-visual information. He has published more than 50 papers on these topics, and acts as a national scientific coordinator of research programmes in visual communications.

Pierangelo Migliorati was born in Brescia, Italy. He received the degree in Electronic Engineering (cum laude) from the Politecnico di Milano in 1988 and the Master in Information Technology from the CEFRIEL Research Centre, Milan, in 1989. He joined the CEFRIEL Research Centre in 1990. From 1990 to 1995 he was involved in activities concerning image processing and coding, as well as research activities in the area of multimedia communication systems. Since 1995 he has been Assistant Professor at the University of Brescia, where he is involved in activities related to the indexing of multimedia documents and the segmentation of image sequences.

Lorenzo Rossi received an Electrical Engineering degree from the University of Brescia (Italy) and an M.Sc. from the University of Southern California (USA). He is currently a Ph.D. student at the Signal and Communication Laboratory of the University of Brescia. His research interests include audio-visual content analysis and indexing, blind equalization and higher-order statistics. He has been involved in the ESPRIT project AVIR (Audio-Visual Indexing and Retrieval) and in the MPEG-7 standardization activities. He is a member of the Moving Picture Experts Group (MPEG) and of the Italian National Telecommunication Consortium (CNIT).
