Metadata Handling: A Video Perspective

CHITRA L. MADHWACHARYULA
School of Information Management and Systems, University of California at Berkeley, U.S.A.

and

MARC DAVIS***, PHILIPPE MULHEM** and MOHAN S. KANKANHALLI*
*School of Computing, National University of Singapore
**CLIPS-IMAG, Grenoble, France
***School of Information Management and Systems, University of California, Berkeley, U.S.A.
{[email protected], [email protected]}
________________________________________________________________________
This paper addresses the problem of processing the annotations of preexisting video productions to enable reuse and repurposing of metadata. We introduce the concept of 'automatic content-based editing of preexisting semantic home video metadata'. We propose a formal representation and implementation techniques for reusing and repurposing semantic video metadata in concordance with the actual video editing operations. A novel representation for metadata editing is proposed, and an implementation framework for editing the metadata in accordance with the video editing operations is demonstrated. Conflict resolution and regularization operations are defined and implemented in the context of the video metadata editing operations.

Categories and Subject Descriptors: H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems - Video; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information Filtering; Query Formulation

General Terms: Preexisting Video, Metadata, Editing, Reuse, Semantic, Content, Representation

______________________________________________________________________

1. INTRODUCTION

Digital video is an important component of multimedia, consisting of raw video data and the semantic information contained within it. With the rising popularity of personal camcorders, users are shifting from being passive spectators to active creators of video. New technologies and inexpensive gear are creating an explosion in the number of amateurs producing digital video content. With the production of such large volumes of video data, research on home video production techniques, as exemplified by Garage Cinema Research [4], is becoming inevitable. Many exciting research projects are investigating how to search, visualize, and summarize digital home videos shot by amateurs [16,17,18]. A number of inexpensive and easy-to-use video editing tools like Ulead [15] and Adobe Premiere [1] are available which help in editing the video data by performing operations like building a media library of clips, trimming the clips, and adding transitions, text titles, sound, music and voice-overs between clips.

Author's address: School of Information Management and Systems, University of California at Berkeley, U.S.A. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee.
© 2006 ACM 0000-0000/20YY/0000-0001 $5.00


Though a lot of attention has been devoted to developing video editing tools [5,12] that support the annotation of video at the level of new productions, none of the existing works, to the best of our knowledge, address the problem of processing the annotations of preexisting productions to enable reuse and repurposing of metadata. This is especially significant for home video metadata, in which the probability of reappearance of the same elements (people, places or other objects) is very high, since people tend to focus mainly on their family or loved ones when shooting home videos. This paper addresses this significant problem by introducing the concept of 'automatic content-based editing of preexisting semantic home video metadata'. We propose a formal representation and implementation techniques for reusing and repurposing preexisting semantic metadata in concordance with the actual video editing operations. Our system is designed for home videos typically shot by amateurs with little experience in annotating videos. Hence our system is designed to be highly user friendly: the actual annotation and editing procedures are not exposed to the users, who only view a simple interface through which they enter or view information about the video. The remainder of the paper is organized as follows. Section 2 gives an overview of the DVA System, which provides an integrated environment for semi-automatically generating metadata related to videos and video excerpts, and for using that metadata to create new videos and their associated metadata. Section 3 gives a typical scenario in which our metadata reuse framework can be applied and also gives an overview of how our system handles the common video editing operations. Section 4 gives an overview of the related work. Section 5 outlines the metadata editing representation proposed by us. Section 6 presents the implementation framework and results, and Section 7 provides an evaluation of our system. Finally, in Section 8 we give some concluding remarks and directions for future work.

2. THE DIGITAL VIDEO ALBUM (DVA) SYSTEM

The main aim of the DVA (Digital Video Album) Project [9] is to develop techniques for content-based indexing, retrieval, summarization and access of digital video from sources such as home video and digital TV. While conceptualizing the project we made two important assumptions: first, that people want to share their home videos, and second, that video presentations need to be customized for maximum impact. Based on these assumptions we formulated the basic design philosophy of DVA, which rests on the following two rules. Rule 1 emphasizes minimal user intervention through the use of semi-automatic methods and learning. Rule 2 states that information integration and propagation should be implemented to maximize reuse. The DVA System consists of the modules and tools shown in Fig 1, with the main DVA GUI shown in the center.

Fig 1: Modules of DVA system [9]


The workflow of the system is as follows. When a new video is entered into the system, the first step is to annotate the video details in the form of an XML [20] file. The annotation process stores the details not only of the video but also of the different objects or people occurring in the video, the relevant frames in which they occur, the sequences in the video, etc. This detailed annotation helps in efficiently querying and retrieving videos or sequences based on a particular query. The system also supports operations like face detection, object tracking, shot detection and audio-video mixing. A presentation and summarization module allows the user to create video presentations and summaries from existing videos based on the desired criteria; the user can select the summarization criteria as shown in Figure 2. During such video editing, the relevant preexisting metadata is automatically extracted, edited and attached to the newly created summary or presentation, while ensuring that it still conforms to the original DVA schema. We will go into more detail about the DVA Annotation Module, since it is the underlying module for performing automatic preexisting metadata transformations.
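The workflow above requires that an edited metadata file still validate against the original DVA schema. The following is a minimal sketch of such a conformance check, assuming the lxml library is available; the file names are hypothetical placeholders rather than names used by the DVA system.

```python
# Minimal sketch of a schema-conformance check for an edited metadata document.
# The file names below are hypothetical; the DVA schemas themselves are published
# at the project URL given in Section 2.2.
from lxml import etree

def conforms_to_dva_schema(metadata_path: str, schema_path: str) -> bool:
    """Return True if the edited metadata document validates against the schema."""
    schema = etree.XMLSchema(etree.parse(schema_path))
    document = etree.parse(metadata_path)
    return schema.validate(document)

if __name__ == "__main__":
    print(conforms_to_dva_schema("video2_description.xml",
                                 "dva_video_description.xsd"))
```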

Fig 2: Interface for Criteria Selection for Video Editing/Summarization

2.1 DVA Indexing and Annotation Module

The DVA indexing/annotation module, as shown in Figure 3, is a stratification model that captures objects and events. It draws on two types of metadata sources: manual user input and automatically extracted features. The former is text entered according to our predefined schema to describe the video; the latter includes object-tracking features, human faces and shot boundaries that are generated by the computer once the user launches the corresponding action. XML is chosen as the annotation medium because its use will facilitate interoperability with other metadata standards in the future. Video annotation standards like MPEG-7 [11] set XML Schema as the language of choice for the textual representation of content description.

Fig 3: Architecture of DVA Index/Annotation model

Our approach provides a simpler and more customized method of video annotation for home videos than MPEG-7, because MPEG-7 contains additional complicated features not necessary for the DVA application and tools for comprehensive utilization of MPEG-7 are not yet available. Since we use XML-based metadata storage, it is relatively easy to write a translator to convert our format to MPEG-7 if such a need arises. We have not built an ontology for annotating the video data. Instead we provide the user with an intuitive interface that allows them to enter free text in a structured manner, which is stored in the form of XML documents. Hence we have also not used RDF [25] for explicitly defining object relationships at this point, though porting our metadata onto an RDF platform in the future is a natural extension to our present work.
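As an illustration of such a translator, the sketch below maps a DVA-style object annotation onto an MPEG-7-flavoured segment description with Python's ElementTree. Both the DVA element names (Object, Name, StartFrame, EndFrame) and the MPEG-7 output structure are simplifying assumptions made for this example and do not reproduce either schema exactly.

```python
# Illustrative DVA-to-MPEG-7-style translator; element names on both sides are
# assumptions for this sketch, not the actual DVA or MPEG-7 schemas.
import xml.etree.ElementTree as ET

def dva_to_mpeg7(dva_root: ET.Element) -> ET.Element:
    mpeg7 = ET.Element("Mpeg7")
    description = ET.SubElement(mpeg7, "Description")
    for obj in dva_root.iter("Object"):                       # assumed DVA element
        segment = ET.SubElement(description, "VideoSegment")
        ET.SubElement(segment, "TextAnnotation").text = obj.findtext("Name", "")
        time = ET.SubElement(segment, "MediaTime")
        ET.SubElement(time, "MediaTimePoint").text = obj.findtext("StartFrame", "")
        ET.SubElement(time, "MediaDuration").text = obj.findtext("EndFrame", "")
    return mpeg7

if __name__ == "__main__":
    dva = ET.fromstring(
        "<VideoDescription><Object><Name>Person2</Name>"
        "<StartFrame>1000</StartFrame><EndFrame>2000</EndFrame></Object></VideoDescription>"
    )
    print(ET.tostring(dva_to_mpeg7(dva), encoding="unicode"))
```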

2.2 DVA Metadata Document Structure

The aim of the DVA metadata document structure is to represent and manage information related to whole videos and metadata related to objects that occur in video parts. We do not claim that the description provided here covers all possible video and object descriptions, but it gives a good overview of the potential use of the document structure. The DVA metadata structure consists of three main files, as shown in Figure 4.

Fig 4: Metadata structure for video and objects in DVA

- Index File: Global file for the entire system. It acts as the root, recording basic information such as the Video ID and the physical path of every video file loaded into the system.
- Object File: Stores metadata about the objects and people appearing in the videos. Every object is assigned a unique Object ID and its details are stored in the Object File. A reference is made in the Video Description File to the Object IDs of those objects occurring in a particular video. These references are made at the time of creation of the Video Description File. Only one Object File exists for a single implementation of the DVA System Tool.
- Video Description File: Every video is associated with a Video Description File that stores the information about that video pertaining to the sequences present in it, the objects present in it and other relevant details. The metadata editing operations are performed on this file so that the newly generated edited file is associated with the edited video clip generated by the video editor. The schema of the Video Description File is very important during any metadata editing or reuse process, as the resulting edited metadata also has to conform to it. The complete schemas for the DVA annotation structure are available at http://www.sims.berkeley.edu/~chitra/DVA/DVASchema.htm. An excerpt from the schema of the Video Description File is shown in Fig 5.

Fig 5: The Video Description File Schema
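To make the document structure concrete, the sketch below builds a small Video Description File programmatically. VID, OIDList and SID follow the terminology used in the text; the remaining element names (VideoDescription, Length, NextSID, OID, Sequence, StartFrame, EndFrame) are assumptions made for illustration, and the authoritative names are those in the schema at the URL above.

```python
# Hypothetical illustration of a Video Description File; element names other than
# VID, OIDList and SID are assumed for this sketch.
import xml.etree.ElementTree as ET

video = ET.Element("VideoDescription")
ET.SubElement(video, "VID").text = "1"          # unique video ID (one per video)
ET.SubElement(video, "Length").text = "167"     # video length in seconds
ET.SubElement(video, "NextSID").text = "3"      # next free sequence ID
oids = ET.SubElement(video, "OIDList")          # objects appearing in the video
ET.SubElement(oids, "OID").text = "12"          # reference into the Object File
seq = ET.SubElement(video, "Sequence")
ET.SubElement(seq, "SID").text = "1"            # sequence ID, local to this document
ET.SubElement(seq, "StartFrame").text = "1"
ET.SubElement(seq, "EndFrame").text = "1500"

print(ET.tostring(video, encoding="unicode"))
```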

As can be seen from Fig 5, the following points have to be kept in mind during any metadata editing operation:
- There has to be only one video ID (VID) per video and it has to be unique (Line 2).
- Elements such as the next free sequence ID, the video length and other top-level elements can exist only once in a document, so appropriate manipulations have to be made to the values of these elements in a merged document to ensure that there is no, or minimal, information loss. The video length is represented in terms of time (in seconds).
- Sequences have unique SIDs that are local to a video description document.
- The OIDList contains information about the objects (people or other objects) that are present in the video and the sequences in which they occur (Line 12). It makes a reference to the Object IDs, in the Object File, of the objects present in the video.
- Sequence information is stored separately under the sequence tags.
- During any merge operation it has to be ensured that the sequence IDs of the merged sequences do not clash, and the value has to be changed if needed (a small sketch of this renumbering step follows this list).
- The corresponding changes have to be made to the objects in the OIDList with respect to the sequence SIDs and the frame numbers.
- During any metadata operation the VID of the new video has to be extracted from the index file and its value has to be updated accordingly.
- Specifically related to our reuse of video metadata is the fact that certain explicit data content is to be stored. This data will be used to ensure consistency of the generated metadata, as presented in Section 6. This procedure does not impact the generality of our metadata reuse framework.
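The sketch below illustrates the SID-renumbering step referenced in the list above. It uses a simplified dictionary representation of sequences instead of the actual XML documents, so it is only an outline of the idea.

```python
# Illustrative SID renumbering during a merge; the dictionary-based sequence
# representation is an assumption, the real system edits XML documents.

def renumber_sequences(base_sequences, incoming_sequences, next_free_sid):
    """Append incoming sequences to base, giving clashing SIDs fresh values.

    Returns the merged sequence list, a mapping old_sid -> new_sid for the
    incoming document (needed to update the OIDList references), and the new
    next-free-SID value.
    """
    used = {seq["SID"] for seq in base_sequences}
    merged, sid_map = list(base_sequences), {}
    for seq in incoming_sequences:
        new_sid = seq["SID"]
        if new_sid in used:                     # clash: assign the next free SID
            while next_free_sid in used:
                next_free_sid += 1
            new_sid = next_free_sid
            next_free_sid += 1
        used.add(new_sid)
        sid_map[seq["SID"]] = new_sid
        merged.append({**seq, "SID": new_sid})
    return merged, sid_map, next_free_sid
```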


The videos annotated using the metadata structure described in this section are automatically tagged with appropriately reused metadata information whenever they undergo editing operations; we walk through a possible editing scenario in the next section.

3. SCENARIO

We present an example use scenario of our framework. Consider the video shown in Figure 6, which consists of Frames 1 to 5000. There are four people occurring in the video, designated as Person1, Person2, Person3 and Person4. The video also consists of two sequences, Sequence1 and Sequence2. The relevant frame ranges in which the people and the sequences occur are shown in Figure 6.

Fig 6: Original Video Clip (a timeline of Frames 1 to 5000 showing the frame ranges in which Person1-Person4 and Sequence1-Sequence2 occur)

Fig 7: Video Annotation document (the XML metadata of Video1, VID 1, recording Person1: frames 1-800, India; Person2: frames 1000-2000, Singapore; Person3: frames 800-1500; Person4: frames 3000-5000; Sequence1: frames 1-1500; Sequence2: frames 1500-5000)

Figure 7 represents a possible annotation of the video. Each video has a Video ID and a Video name. Various details of the video, like the different objects or people occurring in it, the frame numbers in which they occur and other details like their name, country, etc., are stored in the form of an XML metadata document. The Sequence tags describe interesting portions of the video that the user annotates. The DVA system internally provides the frame information of the sequences, which is included in the annotations for future editing, viewing and reuse. Suppose we want to create a new video with all the sequences having Person2 in it. The video editor, based on the selected criterion, extracts the video only from Frame 1000 to Frame 2000 and creates a new video. In this scenario the XML metadata describing the video also has to be appropriately edited, such that the new metadata file describing the edited video only has information about the portions of the video containing Person2, and the metadata should automatically include the frame numbers of the sequences with respect to the new video. The structure of the new video is shown in Figure 8.

Fig 8: Representation of the edited video (a timeline of Frames 1 to 1000 showing the frame ranges of Person2, Person3, Sequence1 and Sequence2 in the new clip)

The metadata file created to annotate the new video shown in Figure 8 is shown in Figure 9.

Fig 9: Annotation of the new video clip (the XML metadata of Video2, VID 2, recording Person2: frames 1-1000, Singapore; Person3: frames 1-800; Sequence1: frames 1-300; Sequence2: frames 300-1000)

A new video name and ID are assigned to the newly formed video. The metadata describing the edited video should also now contain the new frame numbers for the people and sequences occurring in it. For example, the frame numbers of occurrence of Person2 have to be changed to Frames 1 to 1000, whereas in the original video Person2 occurred between Frames 1000 and 2000. Our system performs other related operations along with the ones described above, like calculating the length of the new video, resolving possible conflicts that may arise during operations like merging two videos, and performing regularization to ensure conformance to the original schema. Possible conflicts that may arise are described in more detail in Definition 6 of Section 5. The DVA system is developed to run on stand-alone machines, and we have opted to create new annotations for edited videos that are extracted from existing annotations, as opposed to creating references to existing annotations, to ensure the portability and reusability of the annotations across multiple machines. We have aimed to keep user interaction at the annotation level to a minimum so that users can reuse the existing annotations while editing already annotated videos. In our system, all the annotation creation and editing operations are internal processes. The user only interacts with the basic annotation form that is used to enter or display the information about the video.
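The frame renumbering described above amounts to clipping each retained frame range to the extracted window and rebasing it so the new clip starts at frame 1. The function below is only a sketch of this idea, with a simple tuple representation assumed for frame ranges.

```python
# Sketch of frame-range remapping when a window of the original video is extracted.

def remap_range(start, end, cut_start, cut_end):
    """Clip (start, end) to the extracted window and rebase it to frame 1."""
    new_start = max(start, cut_start) - cut_start + 1
    new_end = min(end, cut_end) - cut_start + 1
    return (new_start, new_end) if new_start <= new_end else None

# Person2 (frames 1000-2000) is rebased to start at frame 1 of the new clip
# (the scenario quotes this as frames 1 to 1000).
print(remap_range(1000, 2000, 1000, 2000))
# Person1 (frames 1-800) lies entirely outside the extracted window and is dropped.
print(remap_range(1, 800, 1000, 2000))
```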


3.1 Handling Basic Video Editing Operations

In this section we give an overview of the common video editing operations and how the DVA Metadata Editor handles them to edit the corresponding metadata. Table I describes the general editing processes that underlie video editing and how they are handled by our Metadata Editor. Table II describes how the Metadata Editor handles the specific video editing operations performed by video editors.

Table I: General Video Editing Processes (general types of video editing [19] and the corresponding operations performed by the DVA Metadata Editor)

- Combine editing: the process of putting different shots together in order to achieve a particular purpose or convey meaning. DVA Metadata Editor: merges the metadata of the shots being combined to create a new metadata file with all the information about the combined shots.
- Condense editing: the process of shortening the length of a video by extracting just the essence of the material. (This operation is similar to the operations performed by the DVA Summarization and Presentation Tools.) DVA Metadata Editor: the metadata associated with the shortened video is projected from the original metadata document. The shortened video length and new sequence frame numbers are updated in the new metadata associated with the condensed video.
- Expand editing: the process of making a short period of time take longer than it actually did or would. Ways of expanding time include using slow motion to lengthen an action or using multiple shots of the same event to show the same moment in time from different points of view. DVA Metadata Editor: only certain parameters, like the length of a sequence, the length of the video and the durations of occurrences of objects, may change during expand editing. These changed parameters are passed to the Metadata Editor, which updates the information in the newly created metadata file.
- Corrective editing: the process of correcting mistakes made during production with creative editing, by using a shot of something outside the current shot to cover up visual problems and by editing multiple takes together to fix dialog problems. DVA Metadata Editor: any metadata associated with the external shot used for correction is inserted into the newly created metadata file. This helps in maintaining the editing history of a video.

Table II: Specific Video Editing Operations (specific types of video editing and the corresponding operations performed by the DVA Metadata Editor)

- Insert: inserts a clip and moves time forward in a sequence. DVA Metadata Editor: denotes merging of a clip into a video; handled by the Merge Operator described in Definition 3 in Section 5.2. The Metadata Editor makes changes to the video and sequence lengths in the edited metadata document, as demonstrated in Section 6.
- Overwrite: overwrites other clips; no change in the total time of the sequence. DVA Metadata Editor: uses the Conditional Merge defined in Definition 5 in Section 5.2. The video editor sends insertion coordinates to the Metadata Editor. The metadata of the new clip is inserted in the sequence while keeping the sequence length and video length parameters intact.
- Superimpose: used to overlay titles and other text onto video. DVA Metadata Editor: handled dynamically by the Presentation module of DVA. The DVA Metadata Editor does not support insertion of superimposed text.
- Fit to Fill: inserts a clip and speeds up or slows down the material to fill the available space. DVA Metadata Editor: similar to clip insertion; the video editor sends the coordinates of the inserted (merged) clip and the relevant metadata is inserted accordingly.
- Trimming of Clips. DVA Metadata Editor: equivalent to merging of the trimmed clip metadata using the Merge Operator. The coordinates of the trimmed clip are sent to the Metadata Editor, according to which the metadata is merged into the video.
- Splitting: involves making a clip into two or more pieces. DVA Metadata Editor: metadata of split clips is handled by the Projection Operator described in Definition 4 in Section 5.2.
- Transitions. DVA Metadata Editor: the DVA Metadata Editor does not handle transition-related metadata.

As can be seen from Tables I and II, most of the common video metadata can be edited using the 'Merge' and 'Project' operations, either individually or in combination. Hence these operators are the most basic operators and can be recombined in a variety of ways to support the editing of metadata in concordance with most of the common video editing operations. More complex video editing operations would require a more granular understanding of video semantics and aesthetics, for which other operations would possibly need to be defined. But for "non-meaning-changing" editing operations, all complex operations can be decomposed into a sequence of applications of the 'Merge' and 'Project' operators.
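To illustrate the decomposition, the sketch below models a metadata document as a list of annotation entries and expresses a corrective edit as a projection followed by a merge. The data layout is an assumption made for illustration; the actual operators act on the XML documents defined in Section 5.

```python
# Illustrative composition of 'Project' and 'Merge' on a simplified metadata model.

def project(entries, keep):
    """Projection: keep only the entries satisfying a predicate."""
    return [entry for entry in entries if keep(entry)]

def merge(entries_a, entries_b):
    """Merge: combine the entries of two metadata documents."""
    return entries_a + entries_b

# Corrective editing: project the usable part of the original metadata, then
# merge in the metadata of the external correction shot. (In the real system
# clashing sequence IDs would subsequently be renumbered.)
original = [{"SID": 1, "start": 1, "end": 1500}, {"SID": 2, "start": 1500, "end": 5000}]
correction = [{"SID": 3, "start": 1, "end": 400}]
edited = merge(project(original, lambda entry: entry["SID"] != 2), correction)
print(edited)
```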

4. RELATED WORK

Focus on digital content representation and manipulation has been increasing with the extensive amounts of digital data being generated at present. A number of inexpensive and easy-to-use video editing tools [1, 14, 24] are commercially available which focus on editing video data. Hitchcock [8] is a home video editing system which presents a user interface that supports semi-automatic video editing. This system describes the problems that non-professionals have in using existing video editing tools and provides users with an interactive system for composing video that does not require manual selection of the start and end points of video clips. Zhang and Ma [10] detail a system for indexing and browsing home videos that is capable of extracting both structure information and semantic objects. FlyAbout [7] is a system that uses spatially indexed panoramic video for virtual reality applications. [14] presents a framework for automatically generating audio-visual skims. The VideoAnnEx annotation tool [23] developed by IBM assists authors in the task of annotating video sequences with MPEG-7 metadata. [2] presents a model for an audiovisual-based hypermedia authoring environment. [5,12] present frameworks for semantic annotations during the video production process. However, all the above research either focuses on video editing methodologies without the use of semantic annotations or focuses only on the creation and manipulation of semantic metadata at the initial production phase of the video. There is no related work that focuses on already annotated preexisting videos and addresses the problem of reusing and transforming these annotations based on already existing semantic elements. We try to address this problem by defining a formal representation and implementation mechanism for reuse and repurposing of preexisting video metadata.

5. METADATA TRANSFORMATION REPRESENTATION

This section defines the metadata transformation representation proposed by us. This proposal provides basic groundwork for further enhancements of video metadata editing descriptions. XML is chosen as the storage medium for video metadata in this representation. The two main editing operations handled are:
1. Merge of two metadata documents.
2. Projection of a portion of the metadata.

5.1 Universe

The scope of our proposed metadata editing representation is restricted to one underlying schema for all editing operations. Every metadata document Dj always conforms to the same underlying schema Si.

5.2 Definitions

The definitions for the metadata representation and editing mechanisms are as follows.

Definition 1: Model of an XML conformant document. Every conformant document has the structure (Si, Dj), where Si denotes the XML Schema and Dj denotes the metadata document which conforms to the XML schema Si. Therefore, if we have two documents with the same XML schema, they are represented as (S1, D1) and (S1, D2).

Definition 2: Structural details of a metadata document. Every metadata document Dj consists of a root element Rj followed by a set of other elements that are used to store the metadata and the attributes that describe these elements. The elements follow the hierarchical structure defined by the schema. The exact order in which these elements occur in the actual document instance is not considered in the metadata-editing scenario, although it is defined in the document structure. Thus a metadata document Dj can be represented as Dj = (Rj, Ej,

2> Associative law
Consider two metadata documents Dm1 and Dm2 such that
Dm1 = D3 ⊕Rmc (D1 ⊕Rmc D2) and Dm2 = (D3 ⊕Rmc D1) ⊕Rmc D2.
If the merge condition mc is the same for the above operations, then I(Dm1) =d I(Dm2), where I(Dm1) and I(Dm2) are the information content of Dm1 and Dm2 respectively.

3> Additive Identity
If a non-empty metadata document is merged with an empty metadata document, the resulting document is the original non-empty document. It is assumed that the merge condition is concatenation or union. If D1 is a non-empty metadata document and it is merged with D2, which is an empty document, resulting in D3, then D3 =d D1, i.e.
D1 ⊕Rmc D2 =d D1 when D1 ≠ ø and D2 =d ø.

4> Idempotence
The metadata resulting from the union (merge condition mc = ∪) of a metadata document with itself should be the same original metadata, i.e.
D1 ⊕Rmc D1 =d D1, where D1 is a metadata document.

5> Subset Property of projected metadata
If D1 and D2 are two metadata documents with the following structure:

D1 = (R1,E1,
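The algebraic properties above can be checked on a simplified model in a few lines. The sketch below represents a metadata document as a set of annotation tuples and verifies identity, idempotence and associativity for a union-based merge; it illustrates the properties only, not the XML-level operators themselves.

```python
# Property checks for a union-based merge on a simplified set-of-annotations model.

def merge_union(d1: frozenset, d2: frozenset) -> frozenset:
    """Merge with merge condition 'union'."""
    return d1 | d2

d1 = frozenset({("Person2", 1000, 2000), ("Sequence1", 1, 1500)})
d2 = frozenset({("Person3", 800, 1500)})
d3 = frozenset({("Person4", 3000, 5000)})
empty = frozenset()

assert merge_union(d1, empty) == d1            # additive identity
assert merge_union(d1, d1) == d1               # idempotence
assert merge_union(d3, merge_union(d1, d2)) == merge_union(merge_union(d3, d1), d2)  # associativity
print("identity, idempotence and associativity hold on this model")
```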
