ISSC 2003, Limerick, July 1-2
An Adaptive MPEG-4 Streaming System Based on Object Prioritisation

Stefan A. Goor and Liam Murphy
Performance Engineering Laboratory, Department of Computer Science, University College Dublin, Dublin, IRELAND
E-mail: {stefan.goor, liam.murphy}@ucd.ie

Abstract – Streaming of video and multimedia content has generated a lot of interest, especially with the recent development of mobile devices with multimedia capabilities. However, ubiquitous multimedia systems face many challenges, such as varied and limited network performance and viewing device capabilities. Adaptive streaming systems are frequently used to reconcile these limitations, while also providing an acceptable level of perceptible quality to the end user. In this paper, we propose an adaptive streaming system that exploits the Video Object (VO) coding capabilities of MPEG-4 by applying priorities to the individual objects. By allowing the content provider to define prioritisation of objects, video adaptation can be customised based on the content.

Keywords – Video Streaming, Scalability, MPEG-4, Video Objects, Prioritisation.
I INTRODUCTION
The demand for rich content, such as video and multimedia, on mobile devices is rapidly increasing. Advances in terminal hardware have resulted in mobile phones and palm devices that are capable of supporting complex media types. In addition, mobile network capacity has been improving with the introduction of GPRS, 802.11b and UMTS wireless networks [1]. However, even with these advances, many limitations in both network and viewing device capabilities still exist. Scalability of video is commonly used to address such problems, ensuring that the stream(s) and content received at the client terminal match the available network and device resources. Many systems using MPEG-2, H.263 or similar technologies have been proposed or implemented [2]. Frame rate is generally used for temporal scalability, while resolution provides spatial scalability. Adaptation can be achieved in a number of ways, such as dynamic encoding or selection of pre-encoded clips of different temporal and spatial complexities at the server, based on client feedback.
Other options include the use of layered encoding, multiple description encoding [3] or fine granular scalability (FGS). Ordinarily, all of these techniques are based on rectangular frames of video, so the whole of the rectangular area is adjusted by the same factors. This paper discusses a shift from this traditional frame-based scalability, by exploiting the features of MPEG-4 such as Video Object (VO) coding, Binary Format for Scenes (BIFS), MPEG-Java (MPEG-J) and the use of MPEG-7 [4]. Using this technology, video clips can be encoded as a number of independent objects that form the video content when composed according to the BIFS. Individual objects can thus be encoded and/or adapted independently and arranged into a single scene. Independence of object scalability means we can set a preference or priority on the objects in the scene. For example, a video clip of a news broadcast may have the presenter prioritised higher than the background. If scaling back is required, the background will have its spatial and/or temporal complexity reduced more than the object representing the presenter. Human comprehension of the content should be improved by object-based
adaptive techniques because the object(s) of focus in the clip are maintained at a relatively high quality level. This assumes that maintaining prioritised objects at a relatively high level of quality and degrading the lower priority objects to a greater extent should result in a more comprehensible clip than uniformly encoding the complete clip at a compromised quality. If all objects are indiscriminately degraded, then important perceptible features such as text or lip-synch may be lost due to the requirements of other, less perceptually important objects such as backgrounds. The system proposed in this paper has two main features: firstly, a tool for segmentation, prioritisation and encoding of VOs from a media clip (section III), and secondly, a client-server system for streaming this object-based media (section IV).
II MPEG-4 FEATURES
The MPEG-4 standard [5] was produced with the aim of providing video and multimedia content for varied environments and enhanced functionality. MPEG-4 was originally aimed at audio-visual information with very high compression in low bit rate environments. In fact, the standard was explicitly optimised for three bit rate ranges: below 64 kbps, 64 to 384 kbps, and 384 kbps to 4 Mbps. MPEG-4 also specifies a number of new functionalities not available in previous standards like MPEG-1, MPEG-2 or H.263, such as content-based multimedia data access tools, content-based manipulation and bit stream editing, improved coding efficiency, and robustness in error-prone environments. Therefore MPEG-4 is appropriate for flexible audio-visual content communication over wireless networks. In terms of the system proposed in this paper, the most important features are: Video Object (VO) coding, Binary Format for Scenes (BIFS), MPEG-J and MPEG-7.
a) Video Object (VO) Coding
MPEG-4 supports coded representation of media objects [6]. Primitive media objects can be still images (e.g. a fixed background for some clip), video objects (e.g. a newsreader) or audio objects (e.g. a voice or music). Primitive media objects such as these can be either natural or synthetic. A single VO may be represented by several primitive media objects. Consider the case of the news program: if we consider the presenter as a separate object from other content in the clip, then the VO must represent both the primitive video object (i.e. the presenter) as a sequence of textures or images, and the primitive audio object (i.e. the presenter's voice). As a result, more than one Elementary Stream (ES) is usually required to represent the object. An Object Descriptor (OD) is used to identify all streams associated with a particular object.
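The relationship between an object and its elementary streams can be illustrated with a small data structure. The following is only a minimal sketch: the class and field names (`ElementaryStream`, `ObjectDescriptor`, `es_id`, `kind`) are hypothetical and are not the descriptor syntax defined by the MPEG-4 standard. It shows a presenter object whose descriptor groups one visual and one audio stream.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ElementaryStream:
    es_id: int   # hypothetical numeric stream identifier
    kind: str    # e.g. "visual" or "audio" (illustrative labels)


@dataclass
class ObjectDescriptor:
    od_id: int
    streams: List[ElementaryStream] = field(default_factory=list)

    def stream_ids(self, kind: Optional[str] = None) -> List[int]:
        """Return the IDs of all streams of this object, optionally filtered by kind."""
        return [s.es_id for s in self.streams if kind is None or s.kind == kind]


# The news presenter: one object, two elementary streams.
presenter = ObjectDescriptor(
    od_id=1,
    streams=[ElementaryStream(10, "visual"), ElementaryStream(11, "audio")],
)
```

With this arrangement, a receiver that knows only the object identifier can look up every stream it needs to decode for that object.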
Figure 1: Demonstrating the encoding of an arbitrarily shaped video object: (a) the original sequence with all the content; (b) the extracted YUV VOP and shape matrix.
Arbitrarily shaped video objects are possible due to the use of a shape matrix or mask, as shown in Figure 1. A VO is represented by a sequence of Video Object Planes (VOPs) composed of YUV texture matrices and shape matrices that indicate which parts of the VOP belong to the VO. YUV is a colour-encoding scheme in which luminance and chrominance are separate. The human eye is less sensitive to chrominance than to luminance and, as a result, chrominance is commonly subsampled.
b) Binary Format for Scenes (BIFS)
BIFS provides a method to compose a set of objects into a scene. BIFS can be used to describe a scene in terms of how objects are grouped together (as shown in Figure 2), and how objects are positioned in space and time.
[Figure 2 node labels: Scene, Person, Voice, 2D Background, Sprite]
Figure 2: Illustration of how a scene is described hierarchically in terms of objects.
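The hierarchical grouping of objects in a scene can be sketched as a simple tree. This is an illustration of the grouping idea only, not the BIFS binary format. The `SceneNode` class is hypothetical, and the particular nesting used here (Voice and Sprite grouped under Person, with 2D Background as a sibling) is an assumption about how the nodes named in Figure 2 are arranged.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class SceneNode:
    name: str
    children: List["SceneNode"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, str]]:
        """Yield (depth, name) for this node and all descendants, depth-first."""
        yield depth, self.name
        for child in self.children:
            yield from child.walk(depth + 1)

    def leaves(self) -> List[str]:
        """Names of the nodes without children, i.e. the primitive media objects."""
        if not self.children:
            return [self.name]
        return [leaf for child in self.children for leaf in child.leaves()]


# Assumed grouping of the nodes named in Figure 2.
scene = SceneNode("Scene", [
    SceneNode("Person", [SceneNode("Voice"), SceneNode("Sprite")]),
    SceneNode("2D Background"),
])

for depth, name in scene.walk():
    print("  " * depth + name)
```

Because each object is a separate node, removing or replacing a subtree (for example the background) leaves the rest of the scene description untouched.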
c) MPEG-J
MPEG-J promotes intelligent management of the operation of audiovisual sessions. Used together, MPEG-4 players and Java code allow authors to embed complex control and data processing mechanisms with their media data. MPEG-J includes Terminal, Scene, Resource, Decoder and Network APIs. MPEG-J applications that use these APIs are referred to as MPEGlets, which are similar to Java applets. These APIs can be used to assess the capabilities and available resources for streaming video to a terminal, and to apply appropriate changes to the stream by sending responses via a back channel to the server. MPEG-J can also alter scene descriptions, which means objects may be changed or omitted completely from the scene.
d) MPEG-7
Effective and efficient cataloguing of the content of media clips cannot be achieved by solely examining the clip's constituent data. Auxiliary data or metadata is necessary to explicitly describe the content in an informative manner. With such metadata, the versatility of clips is increased as they can be indexed in a database style approach based on their content. Utilising metadata means we could search for specific content in a clip, for example, finding a clip that contains Bertie Ahern and Tony Blair in London. MPEG-7 [7] provides support for stipulating such metadata for a given clip. Although the most apparent use of MPEG-7 is for searching and retrieval, any kind of metadata can be recorded that may be beneficial for reasons other than searchable indexing.
e) Scalability
Ordinarily, rectangular frame-based video is scalable with only two degrees of freedom, namely spatial and temporal parameters. Spatial scalability involves an adjustment in the resolution of a video sample, whereas temporal scalability involves changing the frequency of distinct frames during play out. Other scalability techniques exist; however, they tend to be fundamentally based on these two adaptation dimensions. For example, enhancement layers or multiple description encoding can be applied to provide improvements in quality of the sample, but the net effect of this process is generally increased spatial and temporal complexity. Exploiting MPEG-4's VO capabilities means we can introduce an extra degree of freedom for scalability, because we can choose to include or omit individual objects from a scene as an extra adaptation parameter. Scalability of MPEG-4 is also much more versatile than simple frame-based scaling because each object, of which there may be many in a single scene, can be modified independently in each of the adaptation dimensions.
Figure 3: An illustration of the differences between rectangular frame-based scalability (top row) and object-based scalability (bottom row).
Figure 3 illustrates a coarse example of spatial scalability in both frame-based and object-based paradigms. In the diagram, we can see that frame-based scalability affects the whole of the scene, whereas in the case of object-based scalability, the objects are degraded independently: the tennis player is not degraded but the background is reduced in complexity and then omitted completely.
III MEDIA PREPARATION
In order to allow manipulation of video objects based on prioritisation, a technique for creation of objects and assignment of priority values is required. We propose a tool, as shown in Figure 4, similar to the VOGUE tool [8], that will allow the content provider to allocate priority values to the objects they identify. This tool will also encode each object to a number of specified encoding parameters.
Figure 4: A screenshot of the Segmentation Tool being developed.
a) Video Object Segmentation
Streaming video in terms of objects means that video content must be converted to a collection of specified objects comprising the original scene. Manual identification of objects in a video can be achieved by the use of such a video segmentation tool. Identifying objects enables the content provider to define objects of interest to them for a particular purpose or domain. For example, during a football match a content provider may wish to have their logo displayed as an object so it can be manipulated independently of the other objects in the clip, such as the players. Generally, video segmentation tools are graphical applications with simple user interfaces that enable a user to identify an object's shape by drawing a polygon or editing an alpha mask to match the object boundaries. Some basic systems require identification of the object in every frame of the source clip, although others employ automated object tracking algorithms to determine the object's position and shape in subsequent frames. In addition, some automated techniques for identifying key objects without user interaction have been suggested; however, for this project we retain manual object segmentation to give content providers greater control of the media content.
b) Metadata (MPEG-7) for VOs
Once the VOs have been identified and defined, a method to distinguish between them and provide information about the objects is needed. For the purposes of this project, the priority level of each object must be specified and recorded. Conveniently, MPEG-7 provides an appropriate solution for this task. The metadata required for the VOs, including priority information, will therefore be stored in MPEG-7 format.
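As a rough illustration of recording per-object priorities alongside a clip, the sketch below writes and reads a small XML document. The element and attribute names (`ClipObjects`, `VideoObject`, `name`, `priority`) are placeholders invented for this example and are not taken from the MPEG-7 description schemes; an actual implementation would use the descriptors defined by MPEG-7. The integer priority scale is also an assumption.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def priorities_to_xml(clip_id: str, priorities: Dict[str, int]) -> str:
    """Serialise object priorities to XML (placeholder schema, not MPEG-7)."""
    root = ET.Element("ClipObjects", {"clip": clip_id})
    for name, priority in priorities.items():
        ET.SubElement(root, "VideoObject", {"name": name, "priority": str(priority)})
    return ET.tostring(root, encoding="unicode")


def priorities_from_xml(xml_text: str) -> Dict[str, int]:
    """Recover the name -> priority mapping written by priorities_to_xml."""
    root = ET.fromstring(xml_text)
    return {vo.get("name"): int(vo.get("priority")) for vo in root.findall("VideoObject")}


# Hypothetical priorities for the news broadcast example.
news = {"presenter": 3, "background": 1}
xml_text = priorities_to_xml("news-clip", news)
```

Keeping the priorities in a separate metadata document means they can be changed by the content provider without re-encoding any of the object streams.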
IV SYSTEM ARCHITECTURE
We propose a client-server system architecture, as shown in Figure 5. Communication channels are available in both directions: one for transmission of content streams to the client, and a back channel for receipt of feedback from the client.
Figure 5: Overview of the system architecture.
a) Server
The server is responsible for the transmission of the audio-visual data and so must have a library or database of media it can stream. This library of media must be formed into objects, each encoded to a number of specified configurations. The tool discussed in the previous section is responsible for encoding of the objects. Due to the VO nature of the streamed video, the server must also have a catalogue of metadata, so it stores MPEG-7 information referring to the media clips in the server's library. Scalability of the system is achieved by selection of particular VO encodings for each VO in the scene. A control unit is responsible for selecting appropriate coded object streams based on the feedback it receives from the client. Multiplexing is also required because each object may generate a number of ESs to represent its content. The multiplexer combines these streams into a single stream for transmission to the client.
b) Client
Playing the stream sent by the server requires decoding of the ESs to gather the data needed for each object, the BIFS and the MPEGlet. However, because the server sends these streams to the client in a single multiplexed stream, a demultiplexer must first be applied to extract the individual ESs. Decoded streams are then supplied to the compositor, which arranges the media content according to the BIFS information. An MPEGlet is used to provide the server with greater control over playback because the MPEG-J application can be used to monitor resources and send feedback when required. The Delivery Multimedia Integration Framework (DMIF) handles the communication of the responses from the client to the server.
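The multiplexing at the server and demultiplexing at the client can be sketched as follows. This is a toy illustration under simplifying assumptions, not the MPEG-4 multiplex format: each packet is just a tagged pair of a hypothetical stream identifier and a payload, and the streams are interleaved round-robin.

```python
from collections import defaultdict
from itertools import zip_longest
from typing import Dict, List, Tuple

Packet = Tuple[int, bytes]  # (stream id, payload); an assumed packet shape


def multiplex(streams: Dict[int, List[bytes]]) -> List[Packet]:
    """Interleave the payloads of several elementary streams round-robin."""
    tagged = [[(es_id, p) for p in payloads] for es_id, payloads in streams.items()]
    packets: List[Packet] = []
    for round_ in zip_longest(*tagged):
        # zip_longest pads exhausted streams with None; skip the padding.
        packets.extend(pkt for pkt in round_ if pkt is not None)
    return packets


def demultiplex(packets: List[Packet]) -> Dict[int, List[bytes]]:
    """Split a multiplexed packet sequence back into per-stream payload lists."""
    out: Dict[int, List[bytes]] = defaultdict(list)
    for es_id, payload in packets:
        out[es_id].append(payload)
    return dict(out)


# Stream 1 carries three video payloads, stream 2 one audio payload.
streams = {1: [b"v0", b"v1", b"v2"], 2: [b"a0"]}
muxed = multiplex(streams)
```

Tagging every packet with its stream identifier is what allows the client to recover the individual ESs in order, regardless of how the server chose to interleave them.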
V SYSTEM OPERATION
As mentioned earlier, before a clip can be streamed it will require some preparation in the form of segmentation and prioritisation. Preparation of video clips before streaming is not a new concept, as streaming systems often require videos to be hinted for streaming. Upon notification of a client wishing to view streamed content, an MPEGlet will be transmitted to the client to present the media clip and provide intelligent control of the transmission, i.e. feedback for adaptation. Feedback is given in relation to the network and terminal device resources, which can be assessed using the MPEG-J APIs. This feedback will be passed to the control unit at the server, and each object will then be selected at a complexity based on a calculation involving its priority and the resources available at the client terminal. Unlike most adaptive systems, which use periodic polling for assessment of performance, this system only sends feedback when the MPEGlet detects an event or situation that calls for degradation or improvement of the stream, e.g. a significant change of available memory in the terminal device or a large shift in network capacity.
a) Adaptation Processing
Video streams are adapted to produce a stream that the available resources can support. Both terminal device resources (i.e. CPU power and memory) and network resources are utilised in our system to indicate how to configure an appropriate stream. Whenever either or both of these resources are overloaded, the MPEGlet at the client will detect the problem and send feedback to the server. Information received by the server will indicate the available resource limit. This is calculated as the maximum bit rate that all resources at the client terminal can sufficiently handle. Each object contained in the scene will then be apportioned a weighted fraction of the available resources according to its associated priority.
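One simple way to read "a weighted fraction of the available resources" is a share proportional to priority. The formula below is an assumption made for illustration; the weighting scheme actually used by the system is to be determined from perceptual experiments. Here $B$ is the available resource limit expressed as a bit rate, $p_i$ is the priority of object $i$, $N$ is the number of objects in the scene, and $b_i$ is the bit rate apportioned to object $i$.

```latex
b_i = B \cdot \frac{p_i}{\sum_{j=1}^{N} p_j},
\qquad
\sum_{i=1}^{N} b_i = B
```

For example, with a hypothetical limit of $B = 300$ kbps, a presenter of priority 3 and a background of priority 1 would be apportioned $300 \cdot 3/4 = 225$ kbps and $300 \cdot 1/4 = 75$ kbps respectively.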
Spatial and temporal complexities of the object are the primary factors involved in determining the required bit rate for a video object. Therefore, an encoded stream that requires approximately the allocated resources for that object will be selected. If the bit rate assigned to an object is insufficient to present it at an acceptable quality in terms of resolution and frame rate, the object will be omitted from the scene. Although an available resource limit may be determined, there are numerous possibilities for adapting the stream. Each object may be encoded with many temporal and spatial complexities, and the apportioning of resources to objects can vary widely. In an effort to optimise the adaptation process, the system will use a scheme based on the results of proposed perceptual experiments to allocate resources among objects and determine appropriate encoding parameters for objects.
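The adaptation step described in this section can be sketched in a few lines. The sketch makes several assumptions that the paper leaves open: resources are shared in direct proportion to priority, the encoding chosen for an object is the highest pre-encoded bit rate that does not exceed its share, and an object is omitted when even its lowest encoding exceeds its share. The function names, object names and bit rates are all hypothetical.

```python
from typing import Dict, List, Optional


def allocate(limit_kbps: float, priorities: Dict[str, int]) -> Dict[str, float]:
    """Share the resource limit among objects in proportion to priority (assumed rule)."""
    total = sum(priorities.values())
    return {name: limit_kbps * p / total for name, p in priorities.items()}


def select_encodings(
    limit_kbps: float,
    priorities: Dict[str, int],
    encodings: Dict[str, List[int]],
) -> Dict[str, Optional[int]]:
    """Pick one pre-encoded bit rate per object; None means the object is omitted."""
    shares = allocate(limit_kbps, priorities)
    chosen: Dict[str, Optional[int]] = {}
    for name, share in shares.items():
        # Encodings that fit within this object's share of the limit.
        fitting = [rate for rate in encodings[name] if rate <= share]
        chosen[name] = max(fitting) if fitting else None
    return chosen


# Hypothetical clip: available static encodings in kbps for each object.
priorities = {"presenter": 3, "background": 1}
encodings = {"presenter": [64, 128, 256], "background": [32, 96]}

print(select_encodings(300, priorities, encodings))
print(select_encodings(100, priorities, encodings))
```

Note that this sketch does not redistribute the bit rate left unused by an omitted object or by rounding down to an available encoding; doing so is one of the many adaptation possibilities mentioned above.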
VI CONCLUSIONS
This paper proposes a system that streams MPEG-4 content from a server to a client and adapts the content in terms of VOs, using static encodings, based on client feedback. The adaptation is achieved by calculation of available resources and VO priorities to determine each VO's appropriate encoded stream. While similar work has been carried out [9], the proposed system includes the addition of tools to allow content providers to specifically define, prioritise and encode objects to suit their requirements. In addition, the adaptation process will be based on the results of perceptual quality experiments, and the system will use static encodings, eliminating the need for a dynamic encoder. MPEGlets are also used for the purpose of gathering information at the client side, such as device and network performance, and supplying feedback to the server when required. Using MPEGlets means the adaptations can be event driven, removing the need for user interaction at the client side. Future work for this project involves implementing the proposed system, consisting of a media preparation tool, a client and a server. Perceptual testing of object-based adaptation compared to traditional frame-based adaptation will be investigated. This perceptual testing will additionally be used to define the adaptation process and encoding parameters.
ACKNOWLEDGEMENT
The support of the Irish Research Council for Science, Engineering and Technology (IRCSET) is gratefully acknowledged.
REFERENCES
[1] Andrew S. Tanenbaum. “Computer Networks: 4th Edition”. Prentice Hall ISBN: 0-13-066102-3, pages 166-169 and 292-310, 2002.
[2] Gabriel-Miro Muntean and Liam Murphy. “Adaptive Pre-recorded Multimedia Streaming”. IEEE GLOBECOM 2002, Taipei, Taiwan, November 17-21, 2002.
[3] J.G. Apostolopoulos. “Reliable Video Communication over Lossy Packet Networks using Multiple State Encoding and Packet Path Diversity”. Visual Communications and Image Processing (VCIP), pages 392-409, January 2001.
[4] ISO/IEC JTC1/SC29/WG11 N4668. “MPEG-4 Overview”. http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm, 2002.
[5] Fernando Pereira and Touradj Ebrahimi. “The MPEG-4 Book”. Prentice Hall ISBN: 0-13-061621-4, pages 1-34, 2002.
[6] “Coding of Audio-Visual Objects – Part 2: Visual, 2d Edition”. ISO/IEC 14496-2:2001, 2001.
[7] ISO/IEC JTC1/SC29/WG11. “MPEG-7 Overview”. http://mpeg.telecomitalialab.com/standards/mpeg-7/mpeg-7.htm, July 2002.
[8] B. Marcotegui, P. Correia, F. Marques, R. Mech, R. Rosa, M. Wollborn and F. Zanoguera. “A Video Object Generator Tool Allowing Friendly User Interaction”. ICIP-99, Kobe, Japan, October 1999.
[9] Anthony Vetro, Huifang Sun and Yao Wang. “Object-Based Transcoding for Adaptable Video Content Delivery”. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 11, No. 3, March 2001.