Architecture for the Retrieval of Annotated Broadcast Content on a Range of Mobile and Stationary User Devices

Gabriella Kazai¹,², Mounia Lalmas¹, Alan Pearmain² and Marie-Luce Bourguet¹
¹ Department of Computer Science, ² Department of Electronic Engineering
Queen Mary University of London

1 Introduction
The European IST project SAVANT (Synchronised and Scalable Audio Visual content Across NeTworks, www.ist-savant.org) is developing integrated broadcast and Internet technologies that allow end users to access and retrieve interrelated and synchronised multimedia content in an intelligent and transparent manner, on stationary and mobile terminals with different capabilities and under varying network conditions. These technologies are implemented within a studio production system, a delivery platform and a content access system consisting of different user terminals. In this paper we focus on the content access system and, within it, the search and retrieval engine. We define an architecture that supports the use of different user terminals and builds on established and emerging standards such as MPEG-7 [1], TV-Anytime [2] and MPEG-21 [3]. The paper is structured as follows. Section 2 outlines the terminal system. Section 3 reviews the metadata standards used. Section 4 describes the search engine, detailing how the metadata standards are used by the different functions within the retrieval process.
2 The Terminal System

The overall architecture of the SAVANT system, consisting of three main components, is shown in Figure 1. The broadcasting chain starts with the content creation and annotation system, whose output is the MPEG-2 main broadcast and additional content (e.g. MPEG-4, HTML) annotated with MPEG-7, which describes the structure and semantics of the multimedia content; TV-Anytime, which provides programme-level metadata; and MPEG-21, which describes the annotated content as a collection of Digital Items, enabling its transparent delivery across different networks and to different user devices. This output is fed to the content delivery system, which is responsible for its smart routing via multiple transmission channels (e.g. DVB, IP) to the content access system. Based on this architecture, users gain access to integrated services and multimedia content from different and combined information sources (e.g. broadcast, Internet) using any of the following terminals.

1. A Set Top Box (STB) with TV and remote control forms the core of the overall system. It provides large storage and is used as a gateway server.
2. A Tablet PC, operated via touch screen and keyboard, is used as a portable device at home.
3. A PC, equipped with a media player and fast Internet access, is used as a stationary terminal.
4. A PDA is used as the mobile device for "on the move" usage. It is equipped with storage media and is able to connect automatically to the STB.

Each of these terminals can be used by itself or in combination with others. For example, the STB and TV used in parallel with PDAs allow for group viewing of the main broadcast content while supporting personalised viewing of additional interrelated content.

Figure 1. The SAVANT system
The schematic architecture of the content access system is shown in Figure 2. To support the different terminal capabilities, the architecture is adaptable in that some components can be downscaled or implemented on a proxy. For example, due to its limited capabilities, the PDA implements only user interface functions and limited storage. Content and service adaptation is performed mainly by the STB. A DVB card and the MHP (Multimedia Home Platform) software are installed only on the STB and Tablet PC.
Figure 2. The SAVANT Terminal Architecture
The architecture includes the following subsystems. The user interface component is directly linked with the content presentation engine, which controls and synchronises the different players for the various types of multimedia content (e.g. MPEG-2, MPEG-4, HTML). The service description interpreter interprets user requests concerning services, controls service and content adaptation, handles requests for additional content, and is location and time aware. The access control component is in charge of managing the different remote (DVB and Telecom) and local channels, and provides and maintains the resource information necessary for content and service scaling; in addition, it interfaces with authentication and security tools. The content adaptation engine supports content adaptation (transcoding, summarising and abstracting) as well as metadata adaptation. The user manager includes a dynamic user profile engine, which automatically learns and adapts to user interests, and a user preferences editor, which maintains static user settings (e.g. preferred screen resolution). The search engine realises metadata-based remote and local search for content and implements data fusion concepts. Finally, the storage manager is the basic functional unit for all the different types of content stored and managed by the terminal. The stored content comprises multimedia content as well as metadata, service descriptions, user profiles and preferences, and the description of the terminal characteristics.
3 Metadata
SAVANT uses the MPEG-7, TV-Anytime and MPEG-21 metadata standards. MPEG-7 [1], referred to as the "Multimedia Content Description Interface", is an ISO/IEC metadata standard that provides a rich set of tools to describe the structure and semantics of multimedia content. The metadata is associated with media streams (e.g. MPEG-2 and MPEG-4) and is inserted as additional information into the transport stream. At the user terminal, MPEG-7 is used to locate structural or semantic components of the currently viewed, locally or remotely stored content. The MPEG-7 file shown in Figure 3 describes a multimedia content item decomposed using temporal segmentation. We use high-level description schemes (DSs), e.g. the MediaInformation, CreationInformation and TextAnnotation DSs, and low-level features such as the ScalableColor descriptor. For text annotation, the FreeTextAnnotation, StructuredAnnotation and KeywordAnnotation DSs are used to provide semantic descriptions both in natural language and in structured form (Who, WhatAction, Where, etc.).
Figure 3. Example of an MPEG-7 description
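The XML of Figure 3 survives only as stripped text, so the listing below is a tentative reconstruction using standard MPEG-7 MDS element names. The element nesting, attributes and namespaces are assumptions; only the textual values, the 0.75 confidence value and the ScalableColor coefficients appear in the original listing (their exact placement is assumed).

    <Mpeg7 xmlns="urn:mpeg:mpeg7:schema:2001"> <!-- further namespaces omitted -->
      <Description xsi:type="ContentEntityType">
        <MultimediaContent xsi:type="AudioVisualType">
          <AudioVisual>
            <CreationInformation>
              <Creation>
                <Title>Eating out in Budapest</Title>
                <Abstract>
                  <FreeTextAnnotation>A documentary visiting famous restaurants in Hungary's capital</FreeTextAnnotation>
                </Abstract>
              </Creation>
              <Classification>
                <Genre><Name>Documentary / tourist info</Name></Genre>
              </Classification>
            </CreationInformation>
            <TemporalDecomposition>
              <AudioVisualSegment id="seg1">
                <!-- MediaInformation and MediaTime omitted -->
                <TextAnnotation confidence="0.75">
                  <FreeTextAnnotation>Opening scene with celebration dinner at the famous Matthias Cellar restaurant</FreeTextAnnotation>
                  <KeywordAnnotation>
                    <Keyword>Matthias Cellar restaurant</Keyword>
                  </KeywordAnnotation>
                  <StructuredAnnotation>
                    <Where><Name>Matthias Cellar</Name></Where>
                  </StructuredAnnotation>
                </TextAnnotation>
                <VisualDescriptor xsi:type="ScalableColorType" numOfCoeff="16" numOfBitplanesDiscarded="0">
                  <Coeff>1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6</Coeff>
                </VisualDescriptor>
              </AudioVisualSegment>
            </TemporalDecomposition>
          </AudioVisual>
        </MultimediaContent>
      </Description>
    </Mpeg7>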
TV-Anytime [2] is a metadata standard developed to define specifications for programme-level content descriptions, allowing viewers to find, navigate and manage content from a variety of sources, including enhanced broadcast, the Internet and local storage. The metadata includes attractors (e.g. title, synopsis, genre, cast and awards) that aid the acquisition of available content organised, for example, in Electronic Programme Guides (EPGs).
MPEG-21 [3] is a metadata framework that aims to enable transparent access to multimedia resources across a wide range of networks and devices. MPEG-21 targets the adaptation of Digital Items, defined as structured digital objects with standard representations, identifications and descriptions. It describes a variety of dimensions, including users, terminals, networks and delivery. User preferences include display, accessibility and mobility characteristics. Terminal capabilities include hardware, software and display properties, and device profiles indicating the supported media formats (e.g. MPEG-2). Network capabilities specify delay, error and bandwidth characteristics. Delivery descriptors specify the types of supported transport protocols (e.g. TCP/IP) and the types of connections (e.g. multicast).
Figure 4. Example of an MPEG-21 description
Figure 4 describes a Digital Item composed of two sub-items: the MPEG-2 content described by the MPEG-7 metadata shown in Figure 3, and an MPEG-4 video also associated with an MPEG-7 description.
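Of the Figure 4 listing only the fragments "Content type", "MPEG-2 content" and "MPEG-4 content" survive, so the following sketch is hypothetical: it renders the Digital Item just described using standard MPEG-21 DIDL elements, with illustrative identifiers and resource URIs.

    <DIDL xmlns="urn:mpeg:mpeg21:2002:02-DIDL-NS">
      <Item id="eating-out-in-budapest">
        <!-- Sub-item 1: MPEG-2 main content, described by the MPEG-7 metadata of Figure 3 -->
        <Item id="item-mpeg2">
          <Descriptor>
            <Statement mimeType="text/xml">
              <!-- MPEG-7 description of Figure 3 -->
            </Statement>
          </Descriptor>
          <Component>
            <Resource mimeType="video/mpeg" ref="dvb://broadcast/eating-out.mpg"/>
          </Component>
        </Item>
        <!-- Sub-item 2: additional MPEG-4 video with its own MPEG-7 description -->
        <Item id="item-mpeg4">
          <Descriptor>
            <Statement mimeType="text/xml">
              <!-- MPEG-7 description of the MPEG-4 video -->
            </Statement>
          </Descriptor>
          <Component>
            <Resource mimeType="video/mp4" ref="http://example.org/eating-out-extra.mp4"/>
          </Component>
        </Item>
      </Item>
    </DIDL>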
4 The Search Engine
The search engine in SAVANT is responsible for the content-based querying, filtering, search and retrieval of scalable and synchronised multimedia content arriving (push or pull) via different delivery channels (DVB, IP, local storage). The search engine supports a number of functions that together define the retrieval process: indexing the metadata describing the content (MPEG-7 and TV-Anytime), query formulation and processing, relevance-based retrieval (using MPEG-21), data fusion, and presentation and navigation of the retrieval results.
4.1 Indexing the Metadata
The SAVANT search engine indexes the MPEG-7 metadata, which describes the multimedia content at the media stream level, and the TV-Anytime metadata, which provides a programme-level description. During the indexing process, functions such as stop-word removal (e.g. "the", "an") and stemming are applied, content analysis techniques are used to extract and weight keywords and classifications from the metadata, and fast-access indexes are created using inverted files and database relations.

The MPEG-7 annotation is used to support the retrieval of additional content related to the currently viewed programme or segment, filtering according to user profiles, the generation of highlights or semantically linked video chains (e.g. to compose a virtual feed following a golf player), and the construction of summaries of multimedia content. For example, using the StructuredAnnotation DS, the problem of following a golf player, given several camera feeds, is solved by setting up a filter such as //AudioVisual//StructuredAnnotation/Who[.="Tiger Woods"].

TV-Anytime metadata is used to enable the filtering, recommendation, search and retrieval of content from the programme listings of EPGs. This is based on information extracted from the following elements of the ProgramInformation DS: the BasicContentDescription DS for programme description and classification, and the AVAttributes DS for technical description.

The metadata is indexed off-line for content held in local storage and on the fly for live broadcast, using the HySpirit retrieval platform [4]. The indexed metadata is represented using probabilistic relational algebra, where the textual values of the descriptors, extracted during content analysis, are stored as content and the nested retrieval units (e.g. Segment, StillRegion) are represented as contexts. The storage of the structural relationships between contexts enables the propagation of content values along the MPEG-7 and TV-Anytime metadata tree. This allows the retrieval process to return multimedia content elements of varying granularity, e.g. a StillRegion, a Segment or a whole video [5]. A sketch of the basic indexing step is given below.
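As an illustration of this step, the following minimal Python sketch (not SAVANT code; all names are hypothetical) builds an inverted file over the text annotations of MPEG-7 segments, applying stop-word removal and a crude suffix-stripping stemmer, and recording each segment as the context in which a term occurs:

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "an", "a", "in", "at", "of", "with"}

    def stem(term):
        # Crude suffix stripping; a real system would use e.g. Porter stemming.
        for suffix in ("ing", "ed", "s"):
            if term.endswith(suffix) and len(term) > len(suffix) + 2:
                return term[: -len(suffix)]
        return term

    def index_segments(segments):
        """segments: (context_id, annotation_text) pairs, e.g. the
        FreeTextAnnotation of each MPEG-7 Segment. Returns an inverted
        file mapping each stemmed term to the contexts (with term
        frequencies) in which it occurs."""
        inverted = defaultdict(lambda: defaultdict(int))
        for context_id, text in segments:
            for token in re.findall(r"[a-z]+", text.lower()):
                if token not in STOP_WORDS:
                    inverted[stem(token)][context_id] += 1
        return inverted

    # Example: indexing the segment annotation of Figure 3.
    index = index_segments([
        ("video1/seg1", "Opening scene with celebration dinner at the famous Matthias Cellar restaurant"),
    ])
    print(dict(index["dinner"]))  # {'video1/seg1': 1}

In the probabilistic relational algebra representation, such postings would correspond to facts of roughly the form term("dinner", "video1/seg1"), with the context hierarchy stored separately so that term weights can be propagated up the metadata tree.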
4.2 Query processing
The retrieval process is initiated with a user query. This is either derived automatically from the metadata description of the currently viewed content or produced manually by the user with the help of an input device. With the exception of the STB, all SAVANT terminals are equipped with a keyboard or pen. For the STB, users are traditionally limited to the use of a remote control, which, although suitable for browsing, is cumbersome for query input; within SAVANT, users can therefore use a PDA as an additional input device to the STB. Other queries are derived directly from the user preferences, which describe the user's topical interests. Automatically derived queries are represented in MPEG-7 and TV-Anytime syntax, while manually entered queries may be in natural language or in an XPath dialect [6]. The latter allows users to issue searches requesting, for example, films in which a certain actor is cast. Once formulated, the query is translated, using a process similar to that used for metadata indexing, into a format suitable for retrieval and passed to the retrieval engine along with MPEG-21 metadata describing the capabilities of the terminal (and of any proxy servers with transcoding capability).
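For instance, such an XPath-style query over TV-Anytime programme descriptions might look roughly as follows; the element names follow the TV-Anytime credits description, but the exact path and the literal value are illustrative, not taken from the paper:

    //ProgramInformation/BasicDescription/CreditsList/CreditsItem[PersonName/FamilyName = "Connery"]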
4.3 Retrieval
The relevance-based retrieval of multimedia content is based on a matching function that ranks the retrieved objects according to their estimated relevance to the user's query, taking into account the capabilities of the requesting user terminal. The retrieval engine is based on a probabilistic framework, where the degree of relevance is estimated via statistical models derived from the indexed metadata and the processed user query [7]. The retrieval function, implemented using HySpirit, is based on the combination of feature-based content retrieval, context matching and path algebra.

During retrieval, only those resources are considered whose media type is appropriate for the capabilities of the user device, described using the MPEG-21 TerminalCapabilities DS. For example, if the user query is "Restaurants in Budapest" and the terminal is capable of playing MPEG-4 but not MPEG-2, then, given the Digital Item in Figure 4, only the MPEG-4 component would be retrieved, with a probability of relevance calculated from its MPEG-7 description. A minimal sketch of this capability filtering is given below.

Retrieval can also be performed remotely, by passing the query on to external search services; in the case of a PDA, for example, the actual search and retrieval takes place in the STB. Other examples are retrieval from the broadcasters' databases and the use of metasearch engines for searching the Internet.
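The sketch below uses hypothetical data structures; the real system reads the supported formats from the MPEG-21 TerminalCapabilities description and the component types from the DIDL metadata, and the relevance values are illustrative:

    def filter_by_capability(components, supported_formats):
        """components: (component_id, media_format, relevance) triples
        derived from a Digital Item and its MPEG-7 descriptions;
        supported_formats: media formats the terminal can play. Returns
        the playable components ranked by estimated relevance."""
        playable = [c for c in components if c[1] in supported_formats]
        return sorted(playable, key=lambda c: c[2], reverse=True)

    # The Digital Item of Figure 4 on an MPEG-4-only terminal:
    digital_item = [("item-mpeg2", "MPEG-2", 0.8), ("item-mpeg4", "MPEG-4", 0.75)]
    print(filter_by_capability(digital_item, {"MPEG-4"}))
    # [('item-mpeg4', 'MPEG-4', 0.75)]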
4.4 Data fusion
Data fusion techniques are used to combine the various sets of multimedia content retrieved from different sources, e.g. from different search engines on the Internet, local storage or a broadcaster's database. This is done by first selecting the most appropriate sources using a measure of quality associated with each source. This measure is computed based on heuristics such as authority, popularity, user history and terminal capability. Once the source quality has been derived, a number of retrieval results are selected from each result set. These sets are then merged to produce a single ranked list, which is displayed to the user. To accomplish this, we use the approach developed in [8], which considers any available information associated with each retrieved item (e.g. title, abstract) in addition to its rank score; a simplified sketch of the merging step is given below.
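For illustration only, the following is a simplified score-based merge that weights each item's score by its source's quality; the actual SAVANT merging follows [8] and also exploits item metadata such as titles and abstracts:

    def merge_results(result_sets, take=10):
        """result_sets: (source_quality, results) pairs, where results is
        a ranked list of (item_id, score) tuples from one source. Selects
        the top items of each set, weights their scores by the source
        quality, and merges them into a single ranked list."""
        merged = {}
        for quality, results in result_sets:
            for item_id, score in results[:take]:
                weighted = quality * score
                # Keep the best weighted score for items returned by several sources.
                merged[item_id] = max(merged.get(item_id, 0.0), weighted)
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    # Two sources: local storage (high quality) and a web metasearch (lower).
    local = (0.9, [("video1/seg1", 0.8), ("video2", 0.6)])
    web = (0.5, [("http://example.org/page", 0.9), ("video2", 0.7)])
    print(merge_results([local, web]))
    # approx. [('video1/seg1', 0.72), ('video2', 0.54), ('http://example.org/page', 0.45)]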
4.5 Retrieval results
The result of the retrieval process is an XML representation of the ranked list of retrieved multimedia content, which is then passed to the user interface module for presentation on the user's terminal. The XML tree is treated as a labelled ordered graph, whose nodes contain summary information about the multimedia objects and references (e.g. URI and MediaTime) to the actual content that is to be displayed when selected by the user. The summary information may be composed of different media elements (MPEG-2, MPEG-4, text, etc.), which are scaled by the content adaptation engine (for example, with the use of XSLT processors). Navigation of the retrieval results and the selection of multimedia objects for play-out or display are supported via the user interface, using a remote control, touch screen, keyboard or pen. A hypothetical fragment of such a result list is shown below.
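The paper does not reproduce the result format, so the following fragment is purely hypothetical; it shows only the kind of information described above, namely ranked entries carrying a summary and a URI/MediaTime reference:

    <ResultList query="Restaurants in Budapest">
      <Result rank="1" relevance="0.75">
        <Summary>Opening scene with celebration dinner at the Matthias Cellar restaurant</Summary>
        <Reference uri="dvb://broadcast/eating-out.mpg" mediaTimePoint="T00:00:00"/>
      </Result>
      <Result rank="2" relevance="0.54">
        <Summary>Eating out in Budapest (full programme)</Summary>
        <Reference uri="http://example.org/eating-out-extra.mp4"/>
      </Result>
    </ResultList>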
References
[1] ISO MPEG-7, Part 5: Multimedia Description Schemes, ISO/IEC JTC1/SC29/WG11/N4242, October 2001.
[2] TV-Anytime Provisional Specification 1.3, September 2002.
[3] ISO MPEG-21, Part 7: Digital Item Adaptation, ISO/IEC JTC1/SC29/WG11/N5231, October 2002.
[4] T. Rölleke, R. Lübeck and G. Kazai. The HySpirit Retrieval Platform. ACM SIGIR Demonstration, September 2001.
[5] A. Pearmain, M. Lalmas, E. Moutogianni, D. Papworth, P. Healey and T. Rölleke. Using MPEG-7 at the Consumer Terminal in Broadcasting. EURASIP Journal on Applied Signal Processing, 2002.
[6] P. Liu and L. H. Hsu. Queries of Digital Content Descriptions in MPEG-7 and MPEG-21 XML Documents. XML Europe, Barcelona, May 2002.
[7] N. Fuhr, N. Gövert and T. Rölleke. DOLORES: A System for Logic-based Retrieval of Multimedia Objects. 21st ACM SIGIR Conference, 1998.
[8] T. Tsikrika and M. Lalmas. Merging Techniques for Performing Data Fusion on the Web. 10th ACM CIKM, pp. 181-189, Atlanta, November 2001.