ADAPTIVE VIDEO SUMMARIZATION
Philippe Mulhem (1), Jérôme Gensel (2), Hervé Martin (2) — (1) IPAL-CNRS, Singapore; (2) LSR-IMAG, Grenoble, France
[email protected], [email protected], [email protected]
1. INTRODUCTION

One of the specific characteristics of the video medium is that it is a temporal medium: it has an inherent duration, and the time spent finding information in a video depends on that duration. Without any knowledge about the video, one has to rely on Video Cassette Recorder-like facilities to reduce the search time. Fortunately, the digitization of videos overcomes the former physical constraints that forced a sequential reading of the video. Nowadays, technological progress makes it possible to process the images and segments which compose a video. For instance, it is now very easy to edit certain images or sequences of images of a video, or to create non-destructive excerpts of videos. It is also possible to modify the order of the images in order to make semantic groupings of images according to a particular center of interest. These treatments are extensively used by film-makers working with digital editing software such as Final Cut Pro [1] and Adobe Premiere [2]. With such software, users have to intervene in most of the stages of the process in order to point out the sequences of images to be treated and to specify the treatment to be made by choosing appropriate operators. Such treatments mainly correspond to manipulation operations such as, for instance, cutting a segment and pasting it in another place.
Such software offers simple interfaces that fit the requirements of applications like personal video computing and, more generally, this technology is fully adapted to the treatment of a single video. However, in the case of large video databases used and manipulated at the same time by various kinds of users, this technology is less well-suited. In this case, a video software has to offer a set of tools able to (i) handle and manage multiple video documents at the same time, (ii) find segments in a collection of videos corresponding to given search criteria, and (iii) create dynamically and automatically (i.e. with no intervention of the user) a video comprising the montage of the result of this search. In order to fulfill these three requirements, video software must model the semantics conveyed by the video, as well as its structure, which can be determined from the capturing or editing processes.
According to the nature of the semantics, the indexing process in charge of extracting the video semantics can be manual, semi-automated or automated. For instance, some attempts have been made to automatically extract physical features such as color, shape and texture in a given frame and to extend this description to a video segment. Obviously, it is quite impossible to automatically extract high-level information such as the name of an actor or the location of a sequence. Thus, automatic indexing is generally associated with manual indexing. The expected result of this indexing process is a description of the content of the video in a formalism that allows accessing, querying, filtering, classifying, and reusing the whole or some parts of the video. Offering metadata structures for describing and annotating audiovisual (AV) content is the goal of the MPEG-7 standard [3], which supports a range of descriptions ranging from low-level signal features (shape, size, texture, color,
movement, position, …) to the highest semantic level (author, date of creation, format, objects and characters involved, their relationships, spatial and temporal constraints, …). MPEG-7 descriptions are defined using the XML Schema Language, and Description Schemes can then be instantiated as XML documents.
Using such descriptions, it is possible to dynamically create summaries of videos. Briefly, a video summary is an excerpt of a video that is supposed to keep the relevant parts while dropping the less interesting ones. This notion of relevance is subjective, interest-driven and related to the context; it is therefore hard to specify and necessitates some preference criteria to be given. It is our opinion that, in order to create such a summary, the semantics previously captured by the semantic data model can be exploited while taking the preference criteria into account.
In this paper, we present the VISU model, which is the result of our ongoing research in this domain. This model capitalizes on work done in the field of information retrieval, notably the use of Conceptual Graphs [4]. We show how the VISU model adapts and extends these results in order to satisfy the constraints inherent to the video medium. Our objective is to annotate videos using Conceptual Graphs in order to represent complex descriptions associated with frames or segments of frames of the video. We then take advantage of the material implication of Conceptual Graphs, on which an efficient graph matching algorithm [5] is based, to formulate queries. We propose a query formalism that provides a way to specify retrieval criteria. These criteria are useful to users for adapting the summary to their specific requirements. The principles of the query processing are also presented. Finally, we discuss the time constraints that must be solved to create a summary with a given duration.
The paper is organized as follows. In the next section, the different approaches used to capture video semantics are presented. Section 3 proposes an overview of the main works in the field of video summarization. Section 4 presents the video data model and the query model of the VISU system we are currently developing. Section 5 gives some concluding remarks and perspectives on this work.
2. VIDEO SEMANTICS

2.1. ANNOTATION-BASED SEMANTICS

An annotation is any symbolic description of a video or of an excerpt of a video. It should be noted, however, that such symbolic descriptions cannot always be extracted automatically. In contexts where enough knowledge is available, such as sports videos [6] or news videos [7][8], automatic processes are able to extract abstract semantics, but in the general case systems can only provide help for users to describe the content of videos [9], or low-level representations of the video content. Many approaches have been proposed for modeling video semantics. Because of the linear and continuous nature of a video, many models are based on a specification of strata. Strata can be supported by different knowledge or data representations: V-STORM [10] and Carlos et al. [11] adopt an object or prototype approach and, recently, the MPEG committee has chosen the family of XML languages for the definition and the extension of the MPEG-7 standard audio-visual (AV) descriptions. We give below the principles of stratum-based models and the different representations of strata.

2.1.1. Strata-based fundamentals

A stratum is a list of still images or frames that share a common semantics. A stratum can thus be associated with some description of the video excerpts it encompasses. In some models, strata are allowed to overlap.
In [12][13], the authors propose a stratification approach inspired from [14] to represent high-level descriptions of videos. A stratum is a list of non-intersecting temporal intervals, each interval being expressed using frame numbers. A video is thus described by a set of strata. Two types of strata are proposed: dialog strata and entity strata. An entity stratum reflects the occurrence of an object, a concept, a text, etc., and has a Boolean representation. For instance, an entity stratum can be used to store the fact that a person occurs, to express the mood of a sequence in the video (e.g. 'sadness'), or to represent the structure of the video in terms of shots or scenes. A dialog stratum describes the dialog content for each interval considered; speech recognition may be used to extract such strata. Retrieval is here based on Boolean expressions for entity strata and on vector space retrieval [15] for dialog strata. The automatic extraction of semantic features, like Action/CloseUp/Crowd/Setting [16], also allows the definition of strata characterized by these features.

2.1.2. Strata structures and representations

In the stratification work of [13], the strata are not structured and no explicit link between strata exists. In [17], the authors define a video algebra with nested strata. Here, the structure is defined in a top-down way to refine the content description of video parts. Such nesting ensures consistency in the description of strata, because nested strata necessarily correspond to nested time intervals. From a user perspective, browsing a tree structure is certainly easier than viewing a flat structure. In that case, however, retrieving the video parts that answer a query is far more complex than with flat strata. Such tree-based content representation is also proposed in [18]. The approach there consists in finding ways to evaluate queries using database techniques. A tree structure describes when objects occur, and an SQL-like query language (using specific predicates or functions dedicated to the management of the object representation) allows the retrieval of the video excerpts that correspond to the query. Other database approaches have been proposed, like [19], in order to provide database modeling and retrieval on strata-based video representations. An interesting proposal, AI-Strata [20], is dedicated to formalizing strata and the relations among them. The AI-Strata formalism is a graph-based AV documentation model. Root elements are AV units or strata to which annotation elements are attached. These annotation elements derive from a knowledge base in which abstract annotation elements (classes) are described and organized in a specialization hierarchy. The exploitation of this graph structure is based on a sub-graph matching algorithm, a query being formulated in the shape of a so-called potential graph. In this approach, one annotation graph describes how the video document decomposes into scenes, shots and frames, the objects and events involved in the AV units, and their relationships. In V-STORM [10], a video database management system written using O2, we proposed to annotate a video at each level of its decomposition into sequences, scenes, shots and frames. An O2 class represents an annotation and is linked to a sub-network of other O2 classes describing objects or events organized in specialization and composition hierarchies.
Using an object-based model to describe the content of a stratum has some advantages: complex objects can be represented, objects are identified, and attributes and methods can be inherited. Following a similar approach in the representation formalism, the authors of [11] have proposed a video description model based on a prototype-instance model. Prototypes can be seen as objects which play the roles of both instances and classes. Here, the user describes video stories by creating or adapting prototypes. Queries are formulated in the shape of new prototypes, which are classified in the hierarchies of prototypes describing the video in order to determine the existing prototype(s) which best match the query and are delivered as results.
Among the family of MPEG standards dedicated to videos, the MPEG-7 standard specifically addresses the annotation problem [3][21]. The general objective of MPEG-7 is to provide standard descriptions for the indexing, searching, and retrieval of AV content. MPEG-7 Descriptors can either be in an XML form (and then human-readable, searchable, and filterable) or in a binary form when storage consumption, transmission or streaming constraints require it. MPEG-7 Descriptors (which are representations of features) can describe low-level features (such as color, texture, sound, motion, or location, duration, format), which can be automatically extracted or determined, but also higher-level features (such as regions, segments and their spatial and temporal structure, objects, events and their interactions, author, copyright, date of creation, user preferences, or summaries). MPEG-7 predefines Description Schemes, which are structures made of descriptors and of the relationships between descriptors or other description schemes. Descriptors and Description Schemes are specified and can be defined using the MPEG-7 Description Definition Language, which is based on the XML Schema Language with some extensions concerning vectors, matrices and references. The Semantic Description Scheme is dedicated to providing data about objects, concepts, places, time in the narrative world, and abstraction. The descriptions can be very complex: it is possible to express, using trees or graphs, actions or relations between objects, states of objects, abstraction relationships, and abstract concepts (like 'happiness') in the MPEG-7 Semantic Description Scheme. Such Description Schemes can be related to temporal intervals, as in the strata-based approaches.
To conclude, annotation-based semantics generally relies on a data structure that allows a relationship to be expressed between a continuous list of frames and its abstract representation. The differences among approaches lie in the capabilities offered by the underlying model to formalize strata, the links between strata, and the structure of the annotations. Our opinion is that this description is a key point for generating video summaries. Thus, proposing a consistent and sound formalism will help avoid inconsistencies and fuzzy interpretations of annotations.

2.2. LOW LEVEL CONTENT-BASED SEMANTICS

We define the low-level content-based semantics of videos as the elements that can be extracted automatically from the video flow without considering knowledge related to a specific context. It means that the video is processed by an algorithm that captures various signal information about frames and proposes an interpretation of these features in terms of color, shape, structure and object motion. We can roughly separate the different content-based semantics extracted from videos into two categories: single frame-based and multiple frame-based. Single frame-based extractions consider only one frame at a time, while multiple frame-based extractions use several frames, mostly sequences of frames. Generally, single frame extraction is performed using segmentation and region feature extraction based on colors, textures, and shapes, as used in still image retrieval systems such as QBIC [22] and Netra [23].
MPEG-7 proposes, in its visual part, description schemes that support color descriptions of images, groups of images and/or image regions (different color spaces, using dominant colors or histograms), texture descriptions of image regions (low-level, based on Gabor filters, or high-level, based on the three labels regularity, main direction and coarseness), and shape descriptions of image regions based on curvature scale spaces and histograms of shapes. Retrieval is then based on similarity measures between the query and the features extracted from the images. However, applying usual still image retrieval systems to each frame of video documents is not adequate in terms of processing time and accuracy: consecutive frames in videos are usually quite similar, and this property has to be taken into account.
Features related to sequences of frames can be extracted by averaging over the sequence, in order to define, as in [24], the ratio of saturated colors in commercials. Other approaches propose to define and use the motion of visible objects and the motion of the camera. MPEG-7 defines motion trajectories based on key points and interpolation techniques, and parametric motion based on parametric estimations using optical flow techniques or the usual MPEG-1/MPEG-2 motion vectors. In VideoQ [25], the authors describe ways to extract object motion from videos and to process queries involving the motion of objects. Works in the field of databases also consider the modeling of object motion [26]. In this case, the concern is not how to extract the objects and their motion, but how to represent them in a database allowing fast retrieval (on an object-oriented database system) of video parts according to SQL-like queries based on object motion. The indexing formalisms used at this level are too low-level to be used directly in a query process by consumers, but on the other hand such approaches are not tightly linked to specific contexts (for instance, motion vectors can be extracted from any video). Thus, a current trend in this domain is to merge content-based information with other semantic information in order to provide usable information.

2.3. STRUCTURE-BASED SEMANTICS

It is widely accepted that video documents are hierarchically structured into clips, scenes, shots and frames. Such structure usually reflects the creation process of the videos. Shots are usually defined as continuous sequences of frames taken without stopping the camera. Scenes are usually defined as sequences of contiguous shots that are semantically related, although in [27][28] the shots of a scene may not be contiguous. A clip is a list of scenes. Since the seminal work of Nagasaka and Tanaka [29], much work has been done on detecting shot boundaries in a video flow. Many researchers [30][31][32][33] have focused on detecting the different kinds of shot transitions that occur in videos. The TREC video track 2001 [34] compared different temporal segmentation approaches: while the detection of cuts between shots is usually successful, the detection of fades does not achieve very high success rates. Other approaches focus on scene detection, like [27], using shot information and multiple cues such as audio consistency between shots and the closed captions of the speeches of the video. Bolle et al. [35] use types of shots and predefined rules to define scenes. The authors of [36] extend the work of Bolle et al. by including vocal emotion changes. Once the shots and scenes are extracted, it is then possible to describe the content of each structural element [37] so as to retrieve video excerpts using queries or by navigating in a synthetic video graph. The video structure provides a sound view of a video document. From our point of view, a video summary must keep this view of the video. Like the table of contents of a book, the video structure provides direct access to video segments.
3. SUMMARIZATION

Video summarization aims at providing an abstract of a video in order to shorten navigation and browsing of the original video. The challenge of video summarization is to present the content of a video in a synthetic way, while preserving 'the essential message of the original' [38]. According to [39], there are two different types of video abstracts: still-image and moving-image abstracts. Still-image abstracts, like Video Manga [40][41], are presentations of salient images or key-frames, while moving-image abstracts consist of a sequence of image sequences. The former are referred to as video summaries, the latter as video skimming. Video skimming is a more difficult process since it imposes an audio-visual synchronization of the selected sequences of images in order to render a coherent
abstract of the entire video content. Video skimming can be achieved by using audio time scale modification [42], which consists in compressing the video and speeding up the audio and the speech while preserving the timbre, the voice quality and the pitch. Another approach consists in highlighting the important scenes (sequences of frames) in order to build a video trailer. In [38][43], scenes with a lot of contrast, scenes close to the average coloration of the video, and scenes with many different frames are automatically detected and integrated in the trailer, as they are supposed to be important scenes. Action scenes (containing explosions, gun shots, rapid camera movements) are also detected. Closed captioning can also be used for selecting audio segments that contain selected keywords [44]; together with the corresponding image segments put in chronological order, these constitute an abstract. Clustering is also often used to gather video frames that share similar color or motion features [45]. Once the set of frame clusters is obtained, the most representative keyframe is extracted from each cluster, and the video skim is built by assembling the video shots that contain these keyframes. In the work of [46], sub-parts of shots are processed using a hierarchical algorithm to generate the video summaries.
Video summary generation can be seen as a simpler task, since it consists in extracting sequences of frames or segments of a video as its best abstract, without considering audio segment selection and synchronization or closed captions. Video summaries can be composed of still images or of moving images, and may use the cinematographic structure of the video (clips, scenes, shots, frames) as well. The problem remains to define the relevant parts of the video (still or moving images) to be kept in the summary. Many existing approaches consider only signal-based summary generation. For instance, Sun and Kankanhalli [47] proposed Content Based Adaptive Clustering, which defines a hierarchical removal of clusters of images according to color differences. This work is conceptually similar to [48] (based on genetic algorithms) and [49] (based on singular value decomposition). Other approaches, like [50] and [51], try to use objects and/or background for summary generation, hoping for more meaningful results at the expense of an increased complexity. Another signal-level feature present in videos is motion: the authors of [52] proposed to use motion and gesture recognition of people in the context of filmed talks with slides. In [53], MPEG-7 content representations are used to generate semantic summaries based on relevant shots that can be subsampled according to the motion/colors in each shot. In some specific repetitive contexts, like echocardiograms [54], the process uses a priori knowledge to extract summaries. Other contexts, like broadcast news [55], help the system to find the important parts of the original videos. Some approaches also use multiple features to summarize videos: [55] uses closed caption extraction and speaker change detection, while [44] extracts human faces and significant audio parts. The process of defining video summaries is complex and prone to errors. We consider that the use of strata-like information in a video database environment makes it possible to produce meaningful summaries by using high-level semantic annotations.
In order to achieve this goal, we propose to use the powerful formalism of Conceptual Graphs to represent complex video content. We are currently developing this approach in a generator of adaptive video summaries, called VISU (which stands for VIdeo SUmmarization). This system is based on two models that we present in the following section.
4. THE VISU MODELS

We present in this section the two models of the VISU system, which allow both annotating videos with high-level semantic descriptions and querying these descriptions to generate video summaries. The VISU system thus relies on two models: an annotation model, which uses strata and conceptual graphs for representing
annotations, and a query model, based on SQL, for describing the expected summary. The query processing exploits an efficient graph matching algorithm to extract the frames or sequences of frames which answer the query, before generating the summary according to some relevance criteria. We describe here the VISU models and the process of summary generation.

4.1. THE ANNOTATION-BASED STRUCTURE MODEL

In this model, objects, events and actions occurring in a video are representation units linked to strata similar to those presented in section 2. Moreover, we organize these representation units in Conceptual Graphs, wherein nodes correspond to concepts, events and/or actions which are linked together in complex structures. A video or a segment of video can be annotated using several strata. Each stratum is associated with a set of chronologically ordered segments of video. As shown in Figure 1, two strata involved in a video may share some common segments, meaning that the objects, events or actions they respectively represent appear simultaneously in those segments. Provided that a stratum is associated with the starting time and the ending time of each segment of video where it occurs, the overlapping segments can be computed as the intersection of the two sets of segments. Annotations associated with strata can be seen as made of elementary descriptions (representation units) or more complex descriptions organized in Conceptual Graphs [4]. The Conceptual Graph formalism is based on a graphical representation of knowledge. Conceptual Graphs can express complex representations and can be used efficiently for retrieval purposes [5].
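To make this computation concrete, the following minimal sketch (in Python, not part of the paper's implementation) shows how a generated stratum could be obtained as the intersection of the interval sets of two strata. The function name and the interval representation are illustrative assumptions, and the interval values are chosen to be consistent with Figure 1 below.

```python
# Minimal sketch: computing the overlapping video segments of two strata.
# A stratum is represented here as a sorted list of non-overlapping
# (start_frame, end_frame) pairs, as in the stratification approach of 2.1.1.

def intersect_strata(strata_a, strata_b):
    """Return the frame intervals where both strata apply (a generated stratum)."""
    result = []
    i = j = 0
    while i < len(strata_a) and j < len(strata_b):
        a_start, a_end = strata_a[i]
        b_start, b_end = strata_b[j]
        start, end = max(a_start, b_start), min(a_end, b_end)
        if start <= end:                  # the two intervals overlap
            result.append((start, end))
        if a_end < b_end:                 # advance the interval that ends first
            i += 1
        else:
            j += 1
    return result

# Interval values consistent with Figure 1 (a 300-frame video):
strata_1 = [(16, 86), (166, 246)]
strata_2 = [(87, 177), (193, 229)]
print(intersect_strata(strata_1, strata_2))   # [(166, 177), (193, 229)] -> Strata3
```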
[Figure 1 timeline: initial strata level (Strata1, Strata2) and annotation structure level (video 0-22, Strata1 16-86, Strata2 87-165, Strata3 166-177 and 193-229, Strata1 178-192 and 230-246, video 246-300), over frames 0 to 300.]
Figure 1. An example of annotation-based structure for a video. A video of 300 frames is annotated using two initial strata, Strata1 and Strata2. The video segment between frames 87 and 165 is related only to the description of stratum 2; the frame intervals [166, 177] and [193, 229] are described by a generated stratum, Strata3, corresponding to the conjunction of the annotations of strata 1 and 2; the frame intervals [178, 192] and [230, 246] are described only by stratum 1.

More formally, conceptual graphs are bipartite oriented graphs composed of two kinds of nodes: concepts and relations.
- A concept, noted "[T: r]" in alphanumeric form and T: r in graphical form, is composed of a concept type T and a referent r. Concept types are organized in a lattice that represents a generic/specific partial order. For the concept to be syntactically correct, the referent has to conform to the concept type, according to the predefined conformance relationship. Figure 2 shows a simple
lattice that describes the concept types Person, Woman and Man; Tc and ⊥c represent respectively the most generic and the most specific concept of the lattice. A referent r may be individual (i.e. representing one uniquely identified instance of the concept type) or generic (i.e. representing any instance of the concept type, and noted by a star "*").
- A relation R is noted "(R)" in alphanumeric form and R in graphical form. Relations are also organized in a lattice expressing a generic/specific partial order.
[Figure 2. An example of concept type lattice: Tc (most generic), Person, Woman, Man, ⊥c (most specific).]
[Figure 3. An example of relation lattice: TR (most generic), Action, Hold, Throw, Give, ⊥R (most specific).]
We call an arch a triplet (concept, relation, concept) that links two concepts through one relation in a conceptual graph; in the following, we denote such a triplet "concept->(relation)->concept". Conceptual graphs can be used to represent complex descriptions. For instance, the conceptual graph G1, noted [Man: #John]->(talk_to)->[Woman: #Mary], can describe the semantic content of a frame or of a segment of frames. In this description, [Man: #John] represents the occurrence of a man, John, [Woman: #Mary] represents the occurrence of a woman, Mary, and the graph (which here reduces to a single arch) expresses the fact that John is talking to Mary. Figure 4 presents the graphical representation of this simple graph. It can be noticed that the expressive power of Conceptual Graphs can represent hierarchies of concepts and relations between objects as proposed by the MPEG-7 committee.
Figure 4. An example of conceptual graph.

Syntactically correct Conceptual Graphs are called Canonical Graphs. They are built from a set of basic graphs that constitutes the Canonical Base, using the following construction operators:
- the joint operator joins graphs that contain an identical concept (same concept type and referent),
- the restriction operator constrains a concept by replacing a generic referent by a specific one,
- the simplification operator removes duplicate relations that may occur, for instance, after a joint operation,
- the copy operator copies a graph.
As shown by Sowa, the advantage of using conceptual graphs is that a translation, noted φ, exists between these graphs and first order logic, providing conceptual graphs with a sound semantics. We exploit this property to ensure the validity of the
summary generation process. This process is based on the material implication of first order logic (see [4]). According to the conceptual graph formalism described above, the joint operator can be used to achieve the fusion of graphs when possible. The joint graph G of two graphs Gi and Gj, performed on one concept Cik of Gi and one concept Cjl of Gj, where Cik is identical to Cjl, is such that:
- every concept of Gi is in G,
- every concept of Gj except Cjl is in G,
- every arch of Gj containing Cjl is transformed in G into an arch containing Cik,
- every arch of Gi is in G,
- every arch of Gj that does not contain Cjl is in G.
As described above, the annotation of a stratum may correspond to a single graph description or to a conjunction of graphs, in the form of one graph or of a non-singleton set of graphs. More precisely, consider two annotations s1 and s2 described respectively by the graphs g1, "[Man: John]->(talk_to)->[Woman: Mary]", and g2, "[Man: John]->(playing)->[Piano: piano1]". If the video segments corresponding to the related strata intersect, then the graph describing the stratum generated from the strata related to g1 and g2 is the graph g3 that joins g1 and g2 on their common concept [Man: John], as presented in Figure 5. The graph g3 is the best representation of the conjunction of g1 and g2, since g3 explicitly expresses the fact that the same person (John) is playing and talking at the same time. If we now consider two descriptions g4, "[Man: John]->(talk_to)->[Woman: Mary]", and g5, "[Man: Harry]->(playing)->[Piano: piano1]", the description that corresponds to the generated stratum cannot be the result of a joint operation, because the two graphs g4 and g5 do not have a common concept; the annotation of the intersection is then the set {g4, g5}. Generated strata are necessary to avoid mismatches between query expressions and initial strata graph descriptions.
Figure 5. Joint of two graphs (g1 and g2) on the concept [Man: #John].

Figure 5 describes the joint of two graphs, but it should be noticed that in our annotation-based structure we try to join all the graphs that correspond to a generated stratum. Such a task is simply achieved through an iterative process that first finds the identical concepts of the strata, and then generates joint graphs when possible. If no joint is possible for some graphs, these graphs are kept as they are in the set of graphs describing the generated stratum.
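The joint operation described above can be sketched as follows, assuming a simplified representation where a graph is a list of arches and a concept is a plain (type, referent) pair. With this value-based representation, identical concepts are literally equal, so the joint reduces to a union of arches with duplicates removed (the simplification step); the lattice and conformance checks of the canon are omitted, and the names are illustrative.

```python
# Minimal sketch of the joint of two conceptual graphs, each represented as a
# list of arches (concept, relation, concept), where a concept is (type, referent).

def identical_concepts(g1, g2):
    """Concepts appearing in both graphs (same type and same referent)."""
    c1 = {c for (a, _, b) in g1 for c in (a, b)}
    c2 = {c for (a, _, b) in g2 for c in (a, b)}
    return c1 & c2

def joint(g1, g2):
    """Join g1 and g2 on their shared concepts; return None if no joint is possible."""
    if not identical_concepts(g1, g2):
        return None
    merged = list(g1)
    for arch in g2:
        if arch not in merged:   # simplification: drop duplicate arches
            merged.append(arch)
    return merged

g1 = [(("Man", "John"), "talk_to", ("Woman", "Mary"))]
g2 = [(("Man", "John"), "playing", ("Piano", "piano1"))]
g3 = joint(g1, g2)   # both arches now share the single concept [Man: John]
print(g3)

g4 = [(("Man", "John"), "talk_to", ("Woman", "Mary"))]
g5 = [(("Man", "Harry"), "playing", ("Piano", "piano1"))]
print(joint(g4, g5))   # None: no common concept, the annotation is the set {g4, g5}
```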
Our objective is to exploit the complex descriptions of Conceptual Graphs involving strata in order to generate meaningful summaries. This goal requires associating a measure of relative importance with the graph elements: the concepts and the arches. Concepts are important because they express the basic elements present in the description, but arches are important too because they represent relationships between elements. The solution adopted here is based on the weighting scheme of the well-known tf*idf (term frequency * inverse document frequency) product used for 30 years in the text-based information retrieval domain [56]. The term frequency is related to the importance of a term in a document, while the inverse document frequency is related to the term's power to distinguish documents in a corpus. Historically, the tf and idf values were defined for words, but here we use them to evaluate the relevance of concepts and arches in graphs. We describe below how these values are computed, respectively for a concept and an arch, in a Conceptual Graph:
- The term frequency tf of a concept C in a conceptual graph description Gsj is defined as the number of concepts in Gsj that are specifics of C. A concept X, [typeX: referentX], is a specific of a concept Y, [typeY: referentY], if typeX is a specific of typeY according to the concept type lattice (i.e. typeX is a sub-type of typeY), and either referentX = referentY (they represent the same individual), or referentX is an individual referent and referentY is the generic referent '*'. In the graph "[Man: #John]->(talk_to)->[Man: *]", representing the fact that John is talking to an unidentified man, tf([Man: #John]) is equal to 1, and tf([Man: *]) is equal to 2, because both [Man: #John] and [Man: *] are specifics of [Man: *].
- The inverse document frequency idf of a concept C is based on the relative duration of the video parts that are described by a specific of C, using a formula inspired from [57]: idf(C) = log(1 + D/d(C)), with D the duration of the video and d(C) the duration of the occurrences of C or of a specific of C. For instance, if John appears in 10% of the video, idf([Man: #John]) = 1.04, and if a man occurs in 60% of the video, idf([Man: *]) = 0.43.
- For an arch A, the principle is similar to that of concepts: the term frequency tf(A) is the number of different arches in the considered graph that are specifics of A. An arch (C1x, Rx, C2x) is a specific of an arch (C1y, Ry, C2y) if and only if C1x is a specific of C1y, C2x is a specific of C2y, and the relation Rx is a specific of the relation Ry according to the relation lattice (i.e. Rx is a sub-type of Ry).
- The idf of an arch A is defined similarly to that of concepts: it is based on the relative duration of the video parts that are described by specific arches of A.
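As an illustration, the following sketch reproduces the two numerical examples above (tf in the graph [Man: #John]->(talk_to)->[Man: *], and the idf values for occurrences covering 10% and 60% of the video), using a base-10 logarithm, which matches the figures given; the function names and the super-type table are assumptions, not the actual VISU code.

```python
import math

SUPERTYPES = {"Man": {"Person"}, "Woman": {"Person"}}   # super-types of each type (Man, Woman < Person)

def is_specific(x, y):
    """True if concept x = (type, referent) is a specific of concept y."""
    (tx, rx), (ty, ry) = x, y
    type_ok = tx == ty or ty in SUPERTYPES.get(tx, set())
    referent_ok = rx == ry or ry == "*"
    return type_ok and referent_ok

def tf(concept, graph_concepts):
    """Number of concepts of the graph that are specifics of the given concept."""
    return sum(1 for c in graph_concepts if is_specific(c, concept))

def idf(occurrence_duration, video_duration):
    """idf(C) = log(1 + D / d(C)), with both durations in the same unit."""
    return math.log10(1 + video_duration / occurrence_duration)

# Graph "[Man: #John]->(talk_to)->[Man: *]" of the example above:
concepts = [("Man", "#John"), ("Man", "*")]
print(tf(("Man", "#John"), concepts))     # -> 1
print(tf(("Man", "*"), concepts))         # -> 2
print(round(idf(10, 100), 2))             # John on 10% of the video -> 1.04
print(round(idf(60, 100), 2))             # a man on 60% of the video -> 0.43
```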
4.2. CINEMATOGRAPHIC STRUCTURE MODEL

The cinematographic structure model is dedicated to representing the organization of the video, according to what we presented in section 2.3. This structure is a tree that reflects the compositional aspect of the video; each node corresponds to a structural level and a frame number interval. The inclusion of parts is based on interval inclusion. As explained previously, a video is structured into scenes, shots and frames. We choose to limit the structure to shots and frames because, according to the state of the art, shot boundaries can be extracted with an accuracy greater than 95% (at least for cuts, as described in [58]), while automatic scene detection is still not effective enough for our needs.
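A minimal sketch of such a structure tree, restricted to shots and frames as stated above, could look as follows (class and field names are illustrative assumptions):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class StructureNode:
    level: str                       # "video" or "shot" (scenes are not used here)
    interval: Tuple[int, int]        # (first_frame, last_frame)
    children: List["StructureNode"] = field(default_factory=list)

    def contains(self, other: "StructureNode") -> bool:
        """Interval inclusion, the basis of the part-of relation."""
        return self.interval[0] <= other.interval[0] and other.interval[1] <= self.interval[1]

video = StructureNode("video", (0, 300))
video.children = [StructureNode("shot", (0, 86)), StructureNode("shot", (87, 300))]
assert all(video.contains(shot) for shot in video.children)
```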
4.3. QUERY MODEL

We use a query model in order to describe both the expected content of the adaptive summary and its expected duration through the formulation of a query. For instance, a query Q1 that expresses the fact that we look for video parts where John talks to somebody can be written: "[Man: #John]->(talk_to)->[Human: *]". Obviously, the graph G1 presented in section 4.1 and noted [Man: #John]->(talk_to)->[Woman: #Mary] is an answer to the query Q1. In the context of VISU, this means that a frame or segment of frames (assuming that duration constraints are satisfied) described by the graph G1 should be present in a summary described by means of the query Q1. In the query model of VISU, we also allow users to assign weights to the different parts of the query. These weights reflect the relative importance of the different contents of the videos. The syntax of a query is given in Figure 6, using the usual Backus-Naur semantics for the symbols "[", "]", "{", "}" and "*".

SUMMARY FROM video
WHERE graph [WITH PRIORITY {HIGH|MEDIUM|LOW}]
      [ {AND|OR|NOT} graph [WITH PRIORITY {HIGH|MEDIUM|LOW}] ]*
DURATION integer {s|m}

Figure 6. Query syntax.

In Figure 6, video denotes the initial video that is to be summarized. graph is represented as a set of arches in an alphanumeric linear form. An arch is represented by "[Type1: referent1|id1]->(relation)->[Type2: referent2|id2]", where the concept identifiers id1 and id2 uniquely define each concept, so as to represent concepts that occur in more than one arch. The concept identifiers are needed because a linear form without such identifiers cannot fully represent graphs, particularly when generic referents are involved. For instance, a graph representing that a man, John, is talking to an unidentified woman while smiling at another unidentified woman is represented as: "{[Man: John|0]->(talk_to)->[Woman: *|1], [Man: John|0]->(smile)->[Woman: *|2]}". Without such identifiers, it would not be possible to distinguish whether John smiles at the same woman or at another one. The integer after the keyword DURATION corresponds to the expected duration of the summary, with the unit s for seconds and m for minutes. To illustrate, let us consider that we want to obtain a summary:
- taken from the video named "Vid001",
- showing a man, John, talking to a woman, Mary, while John is at the right of Mary, with a high importance,
- or showing snow falling on houses,
- and having a duration of 20 seconds.
The query which corresponds to the expected generation can be formulated as follows:

SUMMARY FROM Vid001
WHERE {[Man: John|0]->(talk_to)->[Woman: Mary|1],
       [Man: John|0]->(right_of)->[Woman: Mary|1]} WITH PRIORITY HIGH
OR    {[SNOW: *|0]->(falling)->[BUILDING: *|1]} WITH PRIORITY MEDIUM
DURATION 20s
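For illustration only, the example query above could be held in memory in a structure like the following (an assumed representation, not the actual VISU parser); the priority keywords are mapped to the numerical weights given later in section 4.4.3.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# An arch is (concept1, relation, concept2), where a concept is (type, referent, id).
Arch = Tuple[Tuple[str, str, int], str, Tuple[str, str, int]]

@dataclass
class SubQuery:
    graph: List[Arch]
    priority: float                    # HIGH = 1.0, MEDIUM = 0.6, LOW = 0.3
    connective: Optional[str] = None   # "AND", "OR" or "NOT" w.r.t. the previous sub-query

@dataclass
class SummaryQuery:
    video: str
    subqueries: List[SubQuery]
    duration_seconds: int

# The example query of section 4.3:
q = SummaryQuery(
    video="Vid001",
    subqueries=[
        SubQuery(graph=[(("Man", "John", 0), "talk_to", ("Woman", "Mary", 1)),
                        (("Man", "John", 0), "right_of", ("Woman", "Mary", 1))],
                 priority=1.0),
        SubQuery(graph=[(("SNOW", "*", 0), "falling", ("BUILDING", "*", 1))],
                 priority=0.6, connective="OR"),
    ],
    duration_seconds=20,
)
```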
4.4. SUMMARY GENERATION WITH VISU

The query processing relies on two phases: one is dedicated to the video content, the other to the duration of the resulting summary. We describe each of these phases in the following.

4.4.1. Information Retrieval theoretical background

To evaluate the relevance value between a query Qi and a document stratum Dj, we use the logical model of information retrieval [59]. This meta-model stipulates, in the Logical Uncertainty Principle, that the evaluation of the relevance of a document (represented by a sentence y) to a query (represented by a sentence x) is based on "the minimal extent" to be added to the data set in order to assess the certainty of the implication x → y. In our case, the implication x → y is assessed with respect to the existing knowledge in the Conceptual Graph canon (composed of the canonical base, the concept type hierarchy, the relation hierarchy and the conformance relationship), and the "→" symbol is the material implication ⊃ of first order logic. In our process, once the annotation of a stratum is found to imply the query, the relevance value of the stratum Sj according to the query Qi can be computed as we explain in the following.

4.4.2. Matching of graphs

The conceptual graph formalism allows searching in graphs using the projection operator [4]. The projection operator is equivalent to the material implication according to the semantics given to conceptual graphs. The projection operator, a sub-graph search, takes into account the hierarchies of concept types and of relations. The projection of a query graph Gq into a conceptual graph Gs, noted πGs(Gq), only concludes on the existence of a sub-graph of the description that is a specific of the query graph. In the work of [5], it has been shown that the projection on graphs can be implemented very efficiently in terms of search algorithm complexity. We quantify the matching between a content query graph Gq and an annotation graph Gs by combining concept matching and arch matching, inspired by [60] and [61]:

F(Gq, Gs) = Σ { tf(c)·idf(c) | c in concepts of πGs(Gq) } + Σ { tf(a)·idf(a) | a in arches of πGs(Gq) }     (1)
In formula (1), the tf and idf values of concepts and arches are defined as in section 4.1. Since we defined that an annotation may be one conceptual graph or a set of conceptual graphs, we now have to express how to match a query graph against a set of graphs corresponding to an annotation. Consider an annotation composed of a set S of graphs Gsk, 1 ≤ k ≤ N. The match M of the query graph Gq and S is:

M(Gq, S) = max { F(Gq, Gsk) | Gsk ∈ S }

4.4.3. Query expression evaluation

We describe here how the priorities in the sub-query expression "{graph} WITH PRIORITY P", with P being "HIGH", "MEDIUM" or "LOW", are taken into account in order to reflect the importance of the sub-expression for the summary generation. The notion of priority reflects the importance of the different sub-parts of the query content, and was defined in the MULTOS system [62] and in the OFFICER system [63]. A query sub-expression is evaluated for each graph of each annotation of the annotation-based structure, using a simple multiplication rule: if the matching value obtained for an annotation is v, then the matching value for the query sub-expression is v×p, with p equal to 0.3 for "LOW", 0.6 for "MEDIUM" and 1.0 for "HIGH".
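The following sketch puts formula (1) and the priority weighting together, assuming the projection πGs(Gq) has already been computed and is summarized by the (tf, idf) pairs of its matched concepts and arches; all names are illustrative, not the actual VISU code.

```python
PRIORITY = {"HIGH": 1.0, "MEDIUM": 0.6, "LOW": 0.3}

def graph_score(matched_concepts, matched_arches):
    """F(Gq, Gs): sum of tf*idf over the concepts and arches of the projection."""
    return (sum(tf * idf for (tf, idf) in matched_concepts)
            + sum(tf * idf for (tf, idf) in matched_arches))

def annotation_score(projections):
    """M(Gq, S): best score over the graphs of a (possibly non-singleton) annotation."""
    return max((graph_score(c, a) for (c, a) in projections), default=0.0)

def subquery_score(projections, priority):
    """Matching value of a sub-expression '{graph} WITH PRIORITY P'."""
    return annotation_score(projections) * PRIORITY[priority]

# One annotation graph matched with two concepts and one arch (tf, idf pairs):
projections = [([(1, 1.04), (2, 0.43)], [(1, 1.04)])]
print(round(subquery_score(projections, "HIGH"), 2))   # -> 2.94
```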
Complex query expressions, composed of Boolean combinations of elementary sub-query expressions, are evaluated for strata annotations using Lukasiewicz-like definitions for n-valued logics:
- an expression "A AND B" is evaluated as the minimum of the matching values of the sub-queries A and B,
- an expression "A OR B" is evaluated as the maximum of the matching values of the sub-queries A and B,
- an expression "NOT B" is evaluated as the negation of the matching value of B.
We are then able to give a global matching value for each annotation of the annotation-based structure.
The DURATION part of the query is used to constrain the result to a given duration. Three cases may arise:
- The duration of all the relevant video parts is larger than the expected duration. We then avoid presenting the least relevant video parts, i.e. the video parts that have the lowest matching values. Note that only video parts with a positive matching value are considered for the summary generation.
- The duration of all the relevant video parts is equal to the expected duration. The result is then generated with all the relevant video parts.
- The duration of the relevant video parts is smaller than the expected duration. The result is then generated using all the relevant parts obtained, completed using the cinematographic structure of the video: consider that the query asks for x seconds and that the duration of the relevant semantic parts is y seconds (we then have y < x by hypothesis). The remaining video to add to the relevant parts must be x−y seconds long. We include in the result continuous excerpts of (x−y)/n seconds for each of the n shots that do not intersect any relevant video part. The excerpts correspond to the parts having the most motion activity in each shot, consistently with the work of [47] on signal-based summarization.
In any case, we force the generated result to be monotonic with respect to the frame sequence: for any frames fri and frj in the generated result, corresponding respectively to the frames fok and fol in the original video, if fok is before (resp. after) fol in the original video, then fri is before (resp. after) frj in the generated result.
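As a rough sketch of these two steps (an assumption-laden simplification, not the exact VISU procedure): the Boolean connectives reduce to min, max and negation, and the duration constraint either trims the least relevant parts or spreads the missing time over the shots that carry no relevant part.

```python
AND, OR = min, max
def NOT(v):
    return -v            # negation of the matching value

def fit_to_duration(relevant_parts, free_shots, requested):
    """relevant_parts: list of (duration_in_seconds, matching_value), matching_value > 0.
    free_shots: number n of shots that do not intersect any relevant part.
    Returns (kept relevant parts, seconds of excerpt to take from each free shot)."""
    total = sum(d for d, _ in relevant_parts)
    if total > requested:
        # keep the most relevant parts first until the requested duration is reached
        kept, used = [], 0.0
        for d, score in sorted(relevant_parts, key=lambda p: p[1], reverse=True):
            if used + d <= requested:
                kept.append((d, score))
                used += d
        return kept, 0.0
    # total <= requested: keep everything and pad with (x - y) / n seconds per free shot
    padding = (requested - total) / free_shots if free_shots else 0.0
    return list(relevant_parts), padding

print(AND(0.8, 0.3), OR(0.8, 0.3), NOT(0.3))                 # 0.3 0.8 -0.3
parts = [(6.0, 2.9), (5.0, 1.2)]                             # 11 s of relevant material
print(fit_to_duration(parts, free_shots=3, requested=20))    # 3 s taken from each free shot
```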
5. CONCLUSION AND FUTURE WORK

In this chapter, we presented various requirements for generating video summaries. Existing works in this field rely more or less on each of the following aspects of videos:
- The video-structure level, which expresses how the video is organized in terms of sequences, scenes and shots. Such structure also helps to capture the temporal properties of a video.
- The semantic level, which expresses the semantics of parts of the video. Various forms of semantics may be extracted either automatically or manually.
- The signal level, which represents features related to the video content in terms of colors/textures and motion.
Much work has been done in these different fields. Nevertheless, we have shown that a more powerful formalism has to be proposed in order to facilitate the dynamic creation of adaptive video summaries. This is the reason why we proposed the VISU model. This model is based both on a stratum data model and on a conceptual graph formalism in order to represent video content. We also make use of the cinematographic structure of the video and of some low-level features, in a way that is transparent for the user, during the generation of summaries. An originality of the VISU model is to allow the expression of rich queries for guiding video summary creation. Such rich queries can only be fulfilled by the system if the
representation model is able to effectively support complex annotations; that is why we chose the conceptual graph formalism as the basis of the annotation structure in VISU. Our approach also makes use of well-known values in Information Retrieval, namely the term frequency and the inverse document frequency. Such values are known to be effective for retrieval, and we apply them to annotations. In the future, the proposed model will be extended to support temporal relations between annotations. We will also use this work to propose a graphical user interface in order to generate query expressions automatically.
6. REFERENCES

[1] Final Cut Pro: http://www.apple.com/fr/finalcutpro/software
[2] Adobe Premiere: http://www.adobe.fr/products/premiere/main.html
[3] MPEG-7 Committee, Overview of the MPEG-7 Standard (version 6.0), Report ISO/IEC JTC1/SC29/WG11 N4509, J. Martinez Editor, 2001.
[4] J. F. Sowa, Conceptual Structures: Information Processing in Mind and Machines, Addison-Wesley, Reading (MA), USA, 1984.
[5] I. Ounis and M. Pasça, RELIEF: Combining expressiveness and rapidity into a single system, ACM SIGIR 1998, Melbourne, Australia, pp. 266-274, 1998.
[6] N. Babaguchi, Y. Kawai and T. Kitahashi, Event Based Video Indexing by Intermodal Collaboration, Proceedings of the First International Workshop on Multimedia Intelligent Storage and Retrieval Management (MISRM'99), Orlando, FL, USA, pp. 1-9, 1999.
[7] B. Merialdo, K. T. Lee, D. Luparello and J. Roudaire, Automatic construction of personalized TV news programs, Proceedings of the seventh ACM international conference on Multimedia, Orlando, FL, USA, pp. 323-331, 1999.
[8] H.-J. Zhang, S. Y. Tan, S. W. Smoliar and G. Y. Hone, Automatic Parsing and Indexing of News Video, Multimedia Systems, Vol. 2, No. 66, pp. 256-266, 1995.
[9] M. Kankanhalli and P. Mulhem, Digital Albums Handle Information Overload, Innovation Magazine, 2(3), National University of Singapore and World Scientific Publishing, pp. 64-68, 2001.
[10] R. Lozano and H. Martin, Querying virtual videos using path and temporal expressions, Proceedings of the 1998 ACM Symposium on Applied Computing, February 27 - March 1, 1998, Atlanta, GA, USA, ACM, 1998.
[11] R. P. Carlos, M. Kaji, N. Horiuchi and K. Uehara, Video Description Model Based on Prototype-Instance Model, Proceedings of the Sixth International Conference on Database Systems for Advanced Applications (DASFAA), April 19-21, Hsinchu, Taiwan, pp. 109-116.
[12] M. S. Kankanhalli and T.-S. Chua, Video Modeling Using Strata-Based Annotation, IEEE Multimedia, 7(1), pp. 68-74, March 2000.
[13] T.-S. Chua, L. Chen and J. Wang, Stratification Approach to Modeling Video, Multimedia Tools and Applications, 16, pp. 79-97, 2002.
[14] T. G. Aguierre Smith and G. Davenport, The stratification system: A design environment for random access video, Proc. 3rd International Workshop on Network and Operating System Support for Digital Audio and Video, La Jolla, CA, USA, pp. 250-261, 1992.
[15] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[16] N. Vasconcelos and A. Lippman, Bayesian Modeling of Video Editing and Structure: Semantic Features for Video Summarization and Browsing, ICIP'98, pp. 153-157, 1998.
[17] R. Weiss, A. Duda and D. Gifford, Composition and Search with Video Algebra, IEEE Multimedia, 2(1), pp. 12-25, 1995.
[18] V. S. Subrahmanian, Principles of Multimedia Database Systems, Morgan Kaufmann, San Francisco, 1997.
[19] R. Hjesvold and R. Midtstraum, Modelling and Querying Video Data, VLDB Conference, Chile, pp. 686-694, 1994.
[20] E. Egyed-Zsigmond, Y. Prié, A. Mille and J.-M. Pinon, A graph based audio-visual annotation and browsing system, Proceedings of RIAO'2000, Volume 2, Paris, France, pp. 1381-1389, 2000.
[21] P. Salembier and J. Smith, MPEG-7 Multimedia Description Schemes, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 6, pp. 748-759, June 2001.
[22] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petrovic, D. Steele and P. Yanker, Query by Image and Video Content: the QBIC System, IEEE Computer, 28(9), pp. 23-30, 1995.
[23] W. Y. Ma and B. S. Manjunath, NETRA: A toolbox for navigating large image databases, Proc. IEEE ICIP'97, Santa Barbara, pp. 568-571, 1997.
[24] J. Assfalg, C. Colombo, A. Del Bimbo and P. Pala, Embodying Visual Cues in Video Retrieval, IAPR International Workshop on Multimedia Information Analysis and Retrieval, LNCS 1464, Hong Kong, PRC, pp. 47-59, 1998.
[25] S.-F. Chang, W. Chen, H. J. Horace, H. Sundaram and D. Zhong, A Fully Automated Content Based Video Search Engine Supporting Spatio-Temporal Queries, IEEE Trans. CSVT, 8(5), pp. 602-615, 1998.
[26] J. Li, T. Özsu and D. Szafron, Modeling of Moving Objects in a Video Database, IEEE International Conference on Multimedia Computing and Systems (ICMCS), Ottawa, Canada, pp. 336-343, 1997.
[27] Y. Li, W. Ming and C.-C. Jay Kuo, Semantic video content abstraction based on multiple cues, IEEE International Conference on Multimedia and Expo (ICME) 2001, Tokyo, Japan, 2001.
[28] Y. Li and C.-C. Jay Kuo, Extracting movie scenes based on multimodal information, SPIE Proc. on Storage and Retrieval for Media Databases 2002 (EI2002), Vol. 4676, San Jose, USA, pp. 383-394, 2002.
[29] A. Nagasaka and Y. Tanaka, Automatic Scene-Change Detection Method for Video Works, 2nd Working Conference on Visual Database Systems, pp. 119-133, 1991.
[30] R. Zabih, J. Miller and K. Mai, Feature-based algorithms for detecting and classifying scene breaks, Proceedings of the Third ACM Conference on Multimedia, San Francisco, CA, pp. 189-200, November 1995.
[31] G. Quénot and P. Mulhem, Two Systems for Temporal Video Segmentation, CBMI'99, Toulouse, France, pp. 187-193, October 1999.
[32] H. Zhang, A. Kankanhalli and S. W. Smoliar, Automatic Partitioning of Full-Motion Video, Multimedia Systems, Vol. 1, No. 1, pp. 10-28, 1993.
[33] P. Aigrain and P. Joly, The Automatic Real-Time Analysis of Film Editing and Transition Effects and Its Applications, Computers and Graphics, Vol. 18, No. 1, pp. 93-103, 1994.
[34] A. F. Smeaton, P. Over and R. Taban, The TREC-2001 Video Track Report, The Tenth Text Retrieval Conference (TREC 2001), NIST Special Publication 500-250, 2001. http://trec.nist.gov/pubs/trec10/papers/TREC10Video_Proc_Report.pdf
[35] R. M. Bolle, B.-L. Yeo and M. M. Leung, Video Query: Beyond the Keywords, IBM Research Report RC 20586 (91224), 1996.
[36] J. Nam, Event-Driven Video Abstraction and Visualization, Multimedia Tools and Applications, 16, pp. 55-77, 2002.
[37] B.-L. Yeo and M. M. Yeung, Retrieving and Visualizing Video, Communications of the ACM, 40(12), pp. 43-52, 1997.
[38] S. Pfeiffer, R. Lienhart, S. Fischer and W. Effelsberg, Abstracting Digital Movies Automatically, Journal of Visual Communication and Image Representation, Vol. 7, No. 4, pp. 345-353, 1996.
[39] Y. Li, T. Zhang and D. Tretter, An Overview of Video Abstraction Techniques, HP Laboratory Technical Report HPL-2001-191, 2001.
[40] S. Uchihashi, J. Foote, A. Girgensohn and J. Boreczky, Video Manga: Generating Semantically Meaningful Video Summaries, ACM Multimedia'99, Orlando (FL), USA, pp. 383-392, 1999.
[41] S. Uchihashi and J. Foote, Summarising Video using a Shot Importance Measure and a Frame-Packing Algorithm, ICASSP'99, Phoenix (AZ), Vol. 6, pp. 3041-3044, 1999.
[42] A. Amir, D. B. Ponceleon, B. Blanchard, D. Petkovic, S. Srinivasan and G. Cohen, Using audio time scale modification for video browsing, Hawaii International Conference on System Sciences, Maui, USA, 2000.
[43] R. Lienhart, S. Pfeiffer and W. Effelsberg, Video Abstracting, Communications of the ACM, 40(12), pp. 55-62, 1997.
[44] M. Smith and T. Kanade, Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques, IEEE Computer Vision and Pattern Recognition (CVPR), Puerto Rico, pp. 775-781, 1997.
[45] A. Hanjalic, R. L. Lagendijk and J. Biemond, Semi-Automatic News Analysis, Classification and Indexing System based on Topics Preselection, SPIE/IS&T Electronic Imaging'99, Storage and Retrieval for Image and Video Databases VII, Vol. 3656, San Jose, CA, USA, pp. 86-97, 1999.
[46] R. Lienhart, Dynamic video summarization of home video, SPIE Vol. 3972: Storage and Retrieval for Media Databases 2000, pp. 378-389, 2000.
[47] X. D. Sun and M. S. Kankanhalli, Video Summarization using R-Sequences, Journal of Real-Time Imaging, Vol. 6, No. 6, pp. 449-459, 2000.
[48] P. Chiu, A. Girgensohn, W. Polak, E. Rieffel and L. Wilcox, A Genetic Algorithm for Video Segmentation and Summarization, IEEE International Conference on Multimedia and Expo (ICME) 2000, pp. 1329-1332, 2000.
[49] Y. Gong and X. Liu, Generating Optimal Video Summaries, IEEE International Conference on Multimedia and Expo (III) 2000, pp. 1559-1562, 2000.
[50] J. Oh and K. A. Hua, An Efficient Technique for Summarizing Videos using Visual Contents, Proc. IEEE International Conference on Multimedia and Expo, July 30 - August 2, 2000, pp. 1167-1170, 2000.
[51] D. DeMenthon, V. Kobla and D. Doermann, Video Summarization by Curve Simplification, ACM Multimedia'98, Bristol, Great Britain, pp. 211-218, 1998.
[52] S. X. Ju, M. J. Black, S. Minneman and D. Kimber, Summarization of videotaped presentations: Automatic analysis of motion and gesture, IEEE Trans. on Circuits and Systems for Video Technology, Vol. 8, No. 5, pp. 686-696, 1998.
[53] B. L. Tseng, C.-Y. Lin and J. R. Smith, Video Summarization and Personalization for Pervasive Mobile Devices, SPIE Electronic Imaging 2002 - Storage and Retrieval for Media Databases, Vol. 4676, San Jose (CA), pp. 359-370, 2002.
[54] S. Ebadollahi, S. F. Chang, H. Wu and S. Takoma, Echocardiogram Video Summarization, SPIE Medical Imaging 2001, pp. 492-501, 2001.
[55] M. Maybury and A. Merlino, Multimedia Summaries of Broadcast News, IEEE Intelligent Information Systems 1997, pp. 422-429, 1997.
[56] G. Salton, A. Wong and C. S. Yang, A vector space model for automatic indexing, Communications of the ACM, 18, pp. 613-620, 1975.
[57] G. Salton and C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing and Management, Vol. 24, pp. 513-523, 1988.
[58] Y.-F. Ma, J. Shen, Y. Chen and H.-J. Zhang, MSR-Asia at TREC-10 Video Track: Shot Boundary Detection Task, The Tenth Text Retrieval Conference (TREC 2001), NIST Special Publication 500-250, 2001. http://trec.nist.gov/pubs/trec10/papers/MSR_SBD.pdf
[59] C. J. van Rijsbergen, A non-classical logic for information retrieval, The Computer Journal, 29, pp. 481-485, 1986.
[60] S. Berretti, A. Del Bimbo and E. Vicario, Efficient Matching and Indexing of Graph Models in Content-Based Retrieval, IEEE Trans. on PAMI, 23(10), pp. 1089-1105, 2001.
[61] P. Mulhem, W.-K. Leow and Y.-K. Lee, Fuzzy Conceptual Graphs for Matching of Natural Images, International Joint Conference on Artificial Intelligence (IJCAI'01), Seattle, USA, pp. 1397-1402, 2001.
[62] F. Rabitti, Retrieval of Multimedia Documents by Imprecise Query Specification, LNCS 416, Advances in Database Technology, EDBT'90, Venice, Italy, 1990.
[63] B. Croft, R. Krovetz and H. Turtle, Interactive Retrieval of Complex Documents, Information Processing and Management, Vol. 26, No. 5, 1990.