
Content Repurposing

Using MPEG-7 and MPEG-21 for Personalizing Video

Belle L. Tseng, Ching-Yung Lin, and John R. Smith
IBM T.J. Watson Research Center

As multimedia content has proliferated over the past several years, users have begun to expect that content be easily accessed according to their own preferences. One of the most effective ways to do this is by using the MPEG-7 and MPEG-21 standards, which can help address the issues associated with designing a video personalization and summarization system in heterogeneous usage environments.


As multimedia content has proliferated over the past several years, users have begun to expect that content be easily accessed according to their own preferences. Similarly, as PDAs and mobile phones have grown in popularity and capability, people have become enthusiastic about watching multimedia content through their mobile devices. The capabilities of these devices vary widely and are limited in terms of network connectivity, processor speed, display constraints, and decoding capabilities. As a result, when people use these mobile devices to view multimedia content, they generally restrict their viewing time because of the smaller displays.

Because of the existence of such heterogeneous user clients in addition to the wide variety of data sources, it's a real challenge to implement a universally compliant system to fit multiple usage environments. One common requirement for almost all high-bandwidth video—regardless of the platform or the application—is the need to personalize the content for the user and summarize the video for easy retrieval. This article addresses the issues associated with designing a video personalization and summarization system in heterogeneous usage environments and provides a tutorial that introduces MPEG-7 and MPEG-21 in these contexts.

Our design framework for the summarization system is a three-tier architecture of server, middleware, and client.


The server maintains the content sources, the MPEG-7 metadata descriptions, the MPEG-21 rights expressions, and content adaptability declarations. The client communicates the MPEG-7 user preferences, MPEG-21 usage environment, and user query to retrieve and display the personalized content. The middleware is powered by the personalization and adaptation engines, which select, adapt, and deliver the summarized rich media to the user. We designed our system to include MPEG-7 annotation tools, semantic summarization engines, real-time video transcoding and composition tools, and application interfaces for PDA devices as well as browser portals.

Video summarization systems

With so much high-bandwidth video available, there is a need to personalize content for the user by searching for relevant clips and summarizing the videos. See the sidebar "Searching for Videos and Displaying the Results" for an overview of retrieval and display options.

Some systems display summarized videos using selected frames from the video sequences. You can extract these frames, called keyframes, from a uniform sampling taken every 5 seconds or so—or you can extract the keyframes from each detected video shot or scene. To summarize the video, you can display the images all at once in a storyboard fashion. Some summarization systems display several keyframes for each detected scene to generate a storyboard.1,2 Research has even demonstrated the use of storyboards for personalized TV news programs based on user preference.3

For video summarization, we expect to view a moving video and listen to the audio accompaniment. The extracted video segments can be based on video shots or on semantic video structures. Video shots could be played back with their corresponding audio tracks. Some researchers have generated video summaries to compress and represent the original video as short, highlighted segments.4-6 A summary of a home video collection could include video excerpts along with an independent commentator speech and some background music. One research team demonstrated such an audio-annotated summarization system for home videos.7

You can effectively represent a video shot in a composite representation called a mosaic. The mosaic is essentially created from images of a common scene. Mosaics can be displayed as a static composite or animated with foreground objects traversing the background.


One popular use of the mosaic is in surveillance videos, where the scene is essentially static and uneventful except for the occasional activity of intruders and unauthorized personnel. You can also use this technique for video indexing8,9 or video browsing.10

Using the audio channel to derive semantic meaning can improve video summaries that rely only on visual data. For example, you could process the speech signal with a speech-to-text engine to generate relevant keywords. Alternatively, the audio signal could convey interest through a volume energy-based classifier or an applause detector. One team used the speech signal to select shot segments of interest for summarizing a digital video library.11 Another team generated video summaries by using user attention models based on multiple sensory perceptions.12

As we examine the semantic structure of videos, we can divide and categorize content by its inherent semantic properties. For news videos, for example, this can mean partitioning a one-hour news program into 10 categories such as local news, international news, and so forth. Furthermore, you could detect individual stories within each category. The semantic structure evidently creates semantic relationships between the structured segments of a video. One team's research incorporated film structure rules to detect coherent summarization segments,5 while another team summarized video scenes from TV programs with a coherent scene mosaic composition.13 In our system, we explore the hierarchical semantic structure of videos and use context clustering to generate coherent video summaries dynamically.14,15

Digital item adaptation

Digital item adaptation (DIA) is the phrase the ISO/IEC MPEG-21 group coined for personalizing the selection and delivery of multimedia content for individual users.16,17 MPEG-21 DIA aims to provide the tools to support resource adaptation, descriptor adaptation, and quality-of-service management. Figure 1 illustrates a fundamental structure for DIA.

The content sources can include text, audio, images, and video. The MPEG-21 digital item declaration (DID) provides standard description schemes that identify the content and include components for resource adaptation, rights expression, and adaptation rights. Under MPEG-21 DID, there is a standard scheme for content adaptability called digital item resource adaptation.

Searching for Videos and Displaying the Results

There are three conventional ways to search for video content. The first and fastest way is to know the video's file name, but users rarely know file names. Another approach is to enter a text string that matches against stored video descriptions, which is where MPEG-7 can assist us by defining a set of standardized description schemes. Not only does MPEG-7 include text descriptions, the ISO standard also describes a broad spectrum of features and attributes of multimedia, which leads to the third method of searching for a file: content-based information retrieval. In this third technique, the user shows an example image or video clip to the system, and the system extracts critical features to find the closest match in the database.

Having performed the video search, the system displays the results to the user in any of four different views. The first view is a linear text listing of the video name, description, and associated information. This is similar to the search results from text search engines like Google, Lycos, or WebCrawler. However, a text description is usually not the best way to display video. Thus there is the second view, called the storyboard. In the storyboard, the system displays a set of representative images in a spatial layout, offering a visual snapshot of the search results. Unfortunately, storyboards lack audio. In another view type, the slide show, the interface adds audio and displays each slide in a sequence along with its corresponding audio track. The slide show is the most desirable view for mobile devices that do not have high video bandwidth or display capabilities. One final view is the video summary, also known as the video abstract or video trailer. As all its names suggest, the video summary shows a continuous sequence of video clips played seamlessly to resemble a custom movie.

Figure 1. Block diagram of digital item adaptation using MPEG-7 and MPEG-21. (The blocks include the content sources with their MPEG-21 digital item declaration—MPEG-21 resource adaptation, MPEG-7 controlled-term list/classification scheme, metadata adaptability, and MPEG-21 rights expression—and the usage environment—MPEG-21 usage environment, MPEG-7 user preference, and user query—feeding the personalization and adaptation engines.)


While you can use MPEG-7 to describe content using features, semantics, and models,18 it also includes adaptation hints covered under the MPEG-7 media resource requirements, which overlap with MPEG-21. Rights expression is another subcomponent of MPEG-21 DID that includes rights for an individual or designated group to view, modify, copy, or distribute the content. Among the expressions relevant to personalization are adaptation rights.

From the perspective of the user, the usage environment defines the user profiles, terminal properties, network characteristics, and other user environments. You can use the MPEG-21 usage environment to describe such conditions. It also includes user preferences that can be partially or fully described by the MPEG-7 user preferences. Furthermore, the user can request specific content in another way that is not defined by the standard, such as a user query.

To match the user's usage environment with the content's DID, the adaptation engine can select the desired content and adapt it in accordance with the descriptions. Specifically, users specify their requests for content through the user query and usage environment; the DID, in turn, encompasses both the rights expression and the content adaptability of the available content.
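
To make that matching step concrete, here is a minimal Python sketch of the idea, using hypothetical data structures rather than actual MPEG-21 schema types: each digital item carries a rights expression (a set of granted rights) and keywords, and the engine keeps only the items whose rights permit viewing and adaptation and that match the user query.

```python
# Illustrative sketch only; the names are invented, not MPEG-21 schema types.
from dataclasses import dataclass

@dataclass
class DigitalItem:
    title: str
    keywords: set        # descriptive terms from the media descriptions
    granted_rights: set  # e.g. {"view", "adapt"} from the rights expression

def personalize(items, query_terms, needed_rights=frozenset({"view", "adapt"})):
    """Keep items that match the query and whose rights expression permits
    both viewing and adaptation (summarization counts as adaptation here)."""
    return [item for item in items
            if needed_rights <= item.granted_rights   # rights check
            and query_terms & item.keywords]          # query match

catalog = [DigitalItem("News program", {"news", "politics"}, {"view", "adapt"}),
           DigitalItem("Home video", {"family"}, {"view"})]
print([item.title for item in personalize(catalog, {"news"})])
```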


Media descriptions

Media descriptions identify and describe the multimedia content at different abstraction levels, from low-level features to semantic concepts. MPEG-7 provides description schemes that describe content in XML to facilitate the search, indexing, and filtering of audio-visual data. The description schemes can describe both audio-visual data management and specific concepts and features. The data-management descriptions include metadata about creation, production, usage, and management. The concepts and features metadata can include what the scene in a video clip is about, what objects are present, who is talking, and what the color distribution of an image is. See the sidebar "MPEG-7 Multimedia Description Scheme."

While MPEG-7 standardizes the description structure, technical challenges remain. Generating these descriptions is not part of the MPEG-7 standard, and the technologies for generating them can be variable and competitive. We have therefore developed tools to capture the underlying semantic meaning of video clips.


Furthermore, we have facilitated the annotation task through the use of a finite vocabulary set. This set can be readily represented by the MPEG-7 controlled-term list, which can be customized for different applications.

Usage environment

The usage environment holds the profiles of the user, device, network, delivery, and other environments. The system uses this information to determine the optimal content selection and the most appropriate form for the user. Beyond MPEG, the industry has proposed several standards that could help accomplish this goal. HTTP 1.1 uses the Composite Capability/Preference Profiles (CC/PP) to communicate client profiles. The Wireless Application Protocol (WAP) proposes a user agent profile that will include the device profiles to cover the hardware platform, software platform, network characteristics, and browser.

The MPEG-7 user preferences DS lets users specify their preferences for certain types of content and for ways of browsing. To describe the types of desired content, the filtering and search preferences DS is used, which covers the creation of the content (creation preferences DS), the classification of the content (classification preferences DS), and the source of the content (source preferences DS). To describe the ways of browsing the selected content requested by the user, the browsing preferences DS is used along with the summary preferences DS. For instance, a user preference could encapsulate the preference ranking among several genre categories produced in the United States in the last decade in wide screen with Dolby AC3 audio format. The user could also access summary content by specifying preference and total duration.

The MPEG-7 user preferences descriptions specifically declare the user's preferences for filtering, search, and browsing. But other descriptions are required to account for the terminal, network, delivery, and other environment parameters. The MPEG-21 usage environment descriptions cover exactly these extended requirements. The descriptions of terminal capabilities include the device types, display characteristics, output properties, hardware, software, and system configurations. For example, you could use this specification to deliver videos to wireless devices in smaller image sizes. Physical network descriptions let you adapt content dynamically to the limitations of the network. The delivery layer includes the transport protocols and connections. These descriptions let users access location-specific applications and services. The MPEG-21 usage environment descriptions also include user characteristics, which describe service, interactions, conditional usage environment, and dynamic updating.
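
As a rough illustration of the preference structure just described, the following Python sketch assembles a simplified MPEG-7-style user preferences tree with ElementTree. The element and attribute names (Location, Genre, preferenceValue, MediaFormat, SummaryDuration) are approximations of the MPEG-7 vocabulary and omit namespaces and wrapper elements, so the output is illustrative rather than schema-valid.

```python
# Simplified, non-schema-valid sketch of an MPEG-7-style UserPreferences tree.
import xml.etree.ElementTree as ET

prefs = ET.Element("UserPreferences")

fs = ET.SubElement(prefs, "FilteringAndSearchPreferences")
creation = ET.SubElement(fs, "CreationPreferences")
ET.SubElement(creation, "Location").text = "United States"        # where content was produced
classification = ET.SubElement(fs, "ClassificationPreferences")
genre = ET.SubElement(classification, "Genre", preferenceValue="90")
genre.text = "News"                                                # ranked genre preference
source = ET.SubElement(fs, "SourcePreferences")
ET.SubElement(source, "MediaFormat").text = "Dolby AC3"            # preferred audio format

browsing = ET.SubElement(prefs, "BrowsingPreferences")
summary = ET.SubElement(browsing, "SummaryPreferences")
ET.SubElement(summary, "SummaryDuration").text = "PT5M"            # a five-minute summary

print(ET.tostring(prefs, encoding="unicode"))
```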

MPEG-7 Multimedia Description Scheme

Video contents can be annotated using the ISO-standardized MPEG-7 multimedia content description interface. MPEG-7 defines a compatible scheme and language, called description schemes, to describe the different abstraction levels, variations, and semantic meaning of multimedia content. Figure A shows an example of an MPEG-7 video segment description scheme. In this video, we annotate the first shot, which includes 136 frames, as a slide presentation and describe a rectangular region in the keyframe as graphics and text.

In MPEG-7, each video shot is defined as a video segment, where the shot start time and duration are given and the annotations are described. The embedded keyframe element lets us specify the region location and the corresponding text annotation in a keyframe; here, the keyframe is the 82nd frame of the video. The annotated region is specified by a still-region element inside the keyframe element and identified by a polygon whose vertex coordinates are recorded in the order x1, y1, x2, y2, and so on. For multiple regions in a keyframe, the system must repeat the still-region element inside the keyframe element. If the annotator must label multiple frames in the shot, then the system needs to repeat the keyframe element.

Figure A. Example of an MPEG-7 video segment description (an XML listing giving the shot start time T00:00:00, the duration PT0M15S, the shot-level annotation "Slide Presentation," a polygon region with coordinates 41 135 290 135 290 230 41 230, and the region annotation "Graphics & Text").
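
Because Figure A cannot be reproduced here, the following Python sketch rebuilds the kind of description it shows using ElementTree. The tag names (VideoSegment, MediaTime, KeyFrame, StillRegion, Polygon, Coords, and so on) approximate the MPEG-7 vocabulary but are not guaranteed to match the schema exactly; the times, frame number, polygon coordinates, and labels come from the sidebar and figure text.

```python
# Approximate reconstruction of the Figure A description; not schema-valid MPEG-7.
import xml.etree.ElementTree as ET

seg = ET.Element("VideoSegment", id="shot1")
media_time = ET.SubElement(seg, "MediaTime")
ET.SubElement(media_time, "MediaTimePoint").text = "T00:00:00"    # shot start
ET.SubElement(media_time, "MediaDuration").text = "PT0M15S"       # shot duration
shot_ann = ET.SubElement(seg, "TextAnnotation")
ET.SubElement(shot_ann, "FreeTextAnnotation").text = "Slide Presentation"

# Keyframe (frame 82) with one polygonal region labeled "Graphics & Text".
keyframe = ET.SubElement(seg, "KeyFrame", frame="82")
region = ET.SubElement(keyframe, "StillRegion")
poly = ET.SubElement(region, "Polygon")
ET.SubElement(poly, "Coords").text = "41 135 290 135 290 230 41 230"
region_ann = ET.SubElement(region, "TextAnnotation")
ET.SubElement(region_ann, "FreeTextAnnotation").text = "Graphics & Text"

print(ET.tostring(seg, encoding="unicode"))
```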

Content adaptability

Content adaptability refers to the multiple variations into which you can transform a media file through changes in format, scale, rate, or quality.

Format transcoding might be required to accommodate the user's terminal devices. Scale conversion can represent image resizing, video frame-rate extrapolation, or audio-channel enhancement. Rate control corresponds to the data rate for transferring the media content and might allow variable or constant rates. Quality of service could be guaranteed to the user based on any criteria, including distortion quality measures. These adaptation operations transform the original content to fit the usage environment. The MPEG-7 media resource requirement and the MPEG-21 digital item resource adaptation both provide descriptions for allowing certain types of adaptations.
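
As a sketch of how these dimensions might drive an adaptation decision, the following Python function compares one source variation against a terminal profile and lists the operations needed. The dictionary keys and the example numbers are invented for illustration and are not part of any MPEG-7 or MPEG-21 description.

```python
# Illustrative adaptation planning; field names and values are hypothetical.
def plan_adaptation(source, terminal):
    """source/terminal: dicts with 'format', 'width', 'kbps' (plus 'formats'
    listing the terminal's decoders). Returns the required operations."""
    ops = []
    if source["format"] not in terminal["formats"]:
        ops.append(f"transcode {source['format']} -> {terminal['formats'][0]}")
    if source["width"] > terminal["width"]:
        ops.append(f"resize {source['width']} -> {terminal['width']} pixels wide")
    if source["kbps"] > terminal["kbps"]:
        ops.append(f"rate-control {source['kbps']} -> {terminal['kbps']} kbps")
    return ops or ["deliver as-is"]

print(plan_adaptation(
    {"format": "MPEG-2", "width": 720, "kbps": 4000},
    {"formats": ["MPEG-4"], "width": 320, "kbps": 384}))
```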




Figure 2. Block diagram of our video personalization and summarization system. (Server: content sources in MPEG-1, MPEG-2, and MPEG-4; media descriptions from VideoAnnEx in MPEG-7; rights expressions in MPEG-21 RE; content adaptability in MPEG-21 MRA and MPEG-7 MRR. Middleware: the VideoSue personalization engine, driven by the MPEG-7 user preferences, MPEG-21 usage environment, and user query; the adaptation engine; and the VideoEd and UnivTuner components. Client: the display client.)

System overview

Figure 2 illustrates our personalization and summarization system, which consists of three major components: user client, database server, and media middleware. The user client component lets the user specify preference queries along with the usage environment, and receives the personalized content on the display client. The database server component stores all the content sources as well as their corresponding MPEG-7 media descriptions, MPEG-21 rights expressions, and content adaptability declarations. The media middleware processes the user query and the usage environment with the media descriptions and rights expressions to generate personalized content.

In the client end of our system, a user can make a request by specifying a user query and communicating the usage environment to the media middleware. The user query takes the form of preference topics, certain keywords, and the user's time constraint for watching the content. The usage environment includes descriptions about the client terminal capabilities, physical network properties, and delivery layer characteristics. The display client ranges from a pervasive mobile device to a networked workstation. The customized content is then optimally selected, summarized, and delivered to the user.

The database server stores multimedia content and provides their descriptions. Each piece of content is associated with a set of media descriptions, rights expressions, and adaptability declarations. Media descriptions include feature descriptions as well as semantic concepts, which can be annotated by the VideoAnnEx MPEG-7 annotation tool or the VideoAL automatic labeling tool. For each piece of content, there can be an associated set of rights expressions that define the right for others to view, copy, print, modify, or distribute the original content. Similarly, there is an associated set of content adaptability declarations on the possible variations of media transformations.

The media middleware consists of the personalization engine and the adaptation engine. In the personalization engine, the user query and usage environment are matched with the media descriptions and rights expressions to generate the personalized content. We use our VideoSue summarization module to find the optimal set of desired content. The adaptation engine retrieves the corresponding content sources according to the results of the personalization engine and determines the optimal variation—including its format, size, rate, and quality—for the user in accordance with the adaptability declarations and the inherent usage environment.

Database server

In our system, we store, describe, analyze, and annotate the content sources with MPEG-7 and MPEG-21 descriptions. The database stores media descriptions in MPEG-7 using two tools to associate labels with content. We implemented the VideoAnnEx MPEG-7 annotation tool to assist users to annotate semantic descriptions semiautomatically. The second tool, VideoAL, is an automatic labeling tool that generates MPEG-7 metadata based on trained anchor models.

VideoAnnEx


The VideoAnnEx tool,19 one of the first MPEG-7 annotation tools made publicly available at http://www.alphaworks.ibm.com/tech/videoannex, is divided into four graphical areas, as Figure 3 shows. As you play back the video, the tool also provides the current shot information. The shot annotation module displays the defined semantic lexicons and the keyframe window. The views panel displays two different previews of representative images of the video: the frames view shows all the frames as representative images of the current shot, while the shots view shows the keyframes over the entire video. As the annotator labels each shot, the descriptions are displayed below the corresponding keyframes.

You can associate labels with the entire video shot or with regions on the keyframes of video shots. We define a video shot as a continuous camera-captured segment of a scene. The tool performs shot boundary detection to divide the video into multiple nonoverlapping shots. VideoAnnEx performs this shot boundary segmentation in the compressed domain, based on color histogram distributions. In general, a video shot can fundamentally be described by three attributes: static scene, key object, and event. According to the characteristics of the video corpus, you can import a predefined lexicon into VideoAnnEx. The lexicon, whose format is compatible with MPEG-7, depends on the summarization application and can be modified, imported, and saved using VideoAnnEx.
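
The shot segmentation idea can be illustrated in a few lines of Python: compare color histograms of consecutive frames and declare a cut when the distance jumps. This is only the general principle, assuming decoded RGB frames and an arbitrary threshold; the actual VideoAnnEx implementation works on compressed-domain data and is more sophisticated.

```python
# Minimal histogram-based cut detection; illustrative, not the VideoAnnEx algorithm.
import numpy as np

def detect_shot_boundaries(frames, bins=16, threshold=0.4):
    """frames: iterable of HxWx3 uint8 RGB arrays. Returns boundary frame indices."""
    boundaries = []
    prev_hist = None
    for idx, frame in enumerate(frames):
        hist, _ = np.histogramdd(frame.reshape(-1, 3),
                                 bins=(bins, bins, bins),
                                 range=((0, 256),) * 3)
        hist = hist.ravel() / hist.sum()          # normalized color histogram
        if prev_hist is not None:
            # Half the L1 distance lies in [0, 1]; a large jump suggests a cut.
            if np.abs(hist - prev_hist).sum() / 2.0 > threshold:
                boundaries.append(idx)
        prev_hist = hist
    return boundaries
```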

Figure 3. The VideoAnnEx MPEG-7 annotation tool consists of four regions: (1) video playback, (2) shot annotation, (3) views panel, and (4) region annotation (not shown).

Media middleware

The media middleware consists of the personalization engine and the adaptation engine. In the personalization engine, we match the user query and usage environment with the media descriptions and rights expressions to generate the personalized content. The adaptation engine determines the optimal variation—the format, size, and quality—of the content for the user in accordance with the adaptability declarations and the inherent usage environment.


VideoAL

VideoAL consists of seven modules: shot segmentation, region segmentation, annotation, feature extraction, model learning, classification, and MPEG-7 XML rendering.20 To learn concept models, the tool first performs shot boundary detection on the video set. The tool then associates semantic labels with each shot or region using the VideoAnnEx module. A feature-extraction module extracts visual features from shots at different spatial-temporal granularities. Finally, a concept-learning module builds models for anchor concepts such as outdoors, indoors, sky, snow, car, flag, and so forth.

The first three modules of the automatic semantic labeling process are the same as those of the training process. After the tool extracts features, a classification module tests the relevance of shots against the anchor concept models, which results in a confidence value for each concept. The tool then describes these output concept values using the MPEG-7 XML format, keeping only high-confidence values to describe the content. This process is executed automatically from an MPEG-1 video stream to an MPEG-7 XML output. The tool automatically assigns a relevance score to the video according to the confidence value of the classification.
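
A minimal sketch of that last rendering step, assuming per-shot concept confidences are already available: keep only the high-confidence concepts and emit them as simple XML. The element names and the 0.8 cutoff are illustrative choices, not VideoAL's actual format or threshold.

```python
# Illustrative confidence filtering and XML rendering; not VideoAL's real output format.
import xml.etree.ElementTree as ET

def render_labels(shot_scores, cutoff=0.8):
    """shot_scores: {shot_id: {concept: confidence}} -> simple MPEG-7-style XML."""
    root = ET.Element("VideoLabels")
    for shot_id, scores in shot_scores.items():
        seg = ET.SubElement(root, "VideoSegment", id=str(shot_id))
        for concept, conf in sorted(scores.items(), key=lambda kv: -kv[1]):
            if conf >= cutoff:                    # keep high-confidence concepts only
                ET.SubElement(seg, "Concept", name=concept, confidence=f"{conf:.2f}")
    return ET.tostring(root, encoding="unicode")

print(render_labels({3: {"outdoors": 0.93, "sky": 0.88, "car": 0.41}}))
```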

Personalization engine

Our system aims to show a shortened video summary that maintains the semantic content within the desired time constraint.




Our VideoSue engine—which stands for Video Summarization on Usage Environment—takes the MPEG-7 metadata descriptions of our content, along with the MPEG-7 and MPEG-21 user preference declarations and the user's time constraint, and outputs an optimized set of selected video segments that generates the desired personalized video summary.15

Using shot segments as the basic video unit, we use multiple methods of video summarization based on spatial and temporal compression of the original video sequence. In our work, we focus on the insertion or deletion of each video shot depending on user preference: each video shot is either included in or excluded from the final video summary. For each shot, MPEG-7 metadata describes the semantic contents with corresponding scores. These semantic scores for all the shots form the complete attribute matrix. Similarly, the user preference vector denotes the preference scores for each semantic concept. The weighted attributes for the shots, also referred to as the importance weightings, are then calculated as the dot product of the attribute matrix and the user preference vector. Consequently, we include a shot in the summary if its importance weighting is greater than a threshold that we determine according to the sum of the most important shot durations. We rank each shot according to its importance weighting and either include it in or exclude it from the final personalized video summary according to the time constraint. Furthermore, the VideoSue engine generates the optimal set of selected video shots for one video as well as across multiple video sources.
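
The selection rule described above can be written compactly in Python. This sketch mirrors the description rather than VideoSue's actual code: importance weightings come from the dot product of the shot-by-concept attribute matrix with the user preference vector, and the highest-ranked shots are kept until the time budget is filled.

```python
# Sketch of importance-weighted shot selection under a time constraint.
import numpy as np

def select_shots(attributes, preferences, durations, time_budget):
    """attributes: (num_shots, num_concepts) semantic scores from MPEG-7 metadata.
    preferences: (num_concepts,) user preference scores.
    durations:   (num_shots,) shot lengths in seconds.
    Returns indices of the selected shots in playback order."""
    importance = attributes @ preferences      # importance weighting per shot
    ranked = np.argsort(-importance)           # most important shots first
    selected, total = [], 0.0
    for shot in ranked:
        if total + durations[shot] <= time_budget:
            selected.append(shot)
            total += durations[shot]
    return sorted(selected)                    # restore temporal order

# Example: four shots, three concepts, a 30-second budget.
A = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.6, 0.3], [0.0, 0.1, 0.9]])
print(select_shots(A, np.array([1.0, 0.5, 0.0]), np.array([15, 20, 10, 12]), 30))
```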


Our video summary objective is to retrieve video shots forming optimal coherent semantic segments within a certain time constraint. Visual shot boundaries do not necessarily correspond to boundaries imposed by semantic continuity; for example, news videos have story and categorical structured segments. Thus we use a hierarchical semantic-clustering algorithm to determine multiple levels of semantic segmentation through the use of an accumulated correlation function.

Each node i represents a class of video segment units. The semantic similarity s(i, j) between nodes i and j then represents the semantic commonality between the two video segments. We can determine this with any similarity function applied to the corresponding video shot annotations and speech transcripts; we use a simplified voting scheme to calculate our semantic similarity s(i, j). Next, we obtain the semantic correlation c(i, j) between nodes i and j within a neighborhood window W of nodes as

c(i, j) = w(i − j) × s(i, j) for |i − j| < W
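
The following Python sketch shows one way to realize the voting similarity and the windowed correlation. The voting scheme is reduced to counting shared annotation terms, and the weighting function w(i − j) is assumed to decay linearly with distance, which the article does not specify; both are illustrative assumptions.

```python
# Illustrative voting similarity and windowed semantic correlation.
import numpy as np

def voting_similarity(annotations_i, annotations_j):
    """Fraction of annotation terms two segments share (simplified voting)."""
    shared = len(set(annotations_i) & set(annotations_j))
    return shared / max(len(set(annotations_i) | set(annotations_j)), 1)

def semantic_correlation(annotations, window=3):
    """c(i, j) = w(i - j) * s(i, j) for |i - j| < window, else 0."""
    n = len(annotations)
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            if d < window:
                w = 1.0 - d / window             # assumed linear distance weighting
                c[i, j] = w * voting_similarity(annotations[i], annotations[j])
    return c

shots = [["anchor", "studio"], ["anchor", "map"], ["reporter", "street"], ["street", "crowd"]]
print(np.round(semantic_correlation(shots), 2))
```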
