BITSTREAM SYNTAX DESCRIPTION LANGUAGE FOR 3D MPEG-4 VIEW-DEPENDENT TEXTURE STREAMING

R. Osorio(1), S. Devillers(2), E. Delfosse(1), M. Amielh(2), G. Lafruit(1)

(1) IMEC, DESICS, Kapeldreef 75, B-3001 Leuven, Belgium
{Roberto.Osorio, Eric.Delfosse, Gauthier.Lafruit}@imec.be

(2) Philips Research France, 51, rue Carnot, 92156 Suresnes Cedex, France
{Sylvain.Devillers, Myriam.Amielh}@philips.com

ABSTRACT

In modern multimedia applications, scalability is a key functionality that allows the transmission and representation of content on a wide variety of networks and terminals. To take full advantage of scalability features, techniques for the detailed description and transformation of multimedia content are needed. In this paper, the Bitstream Syntax Description Language is used to describe the structure of an MPEG-4 wavelet-coded texture. An XML-based transformation of the bitstream description is then applied to select texture regions at an appropriate quality, effectively scaling down the processing and bandwidth requirements for view-dependent texture transmission. Appropriately applying this technique to 3D streaming guarantees Quality-of-Service, i.e. it delivers the best quality under limited network/processing resources.
1. INTRODUCTION

Transmitting, decoding and rendering 3D content requires a significant amount of resources, both in the network and on the terminal. When resources are scarce or insufficient, the content richness is typically scaled down uniformly to reach real-time operating conditions. A more appropriate solution, however, consists of a non-uniform content adaptation that affects only selected parts of the content, thereby minimizing the overall perceived quality loss. In particular, since typically only parts of a textured 3D object are visible at any given time, incrementally transmitting and decoding these texture regions (Regions of Interest, ROIs) over time reduces the instantaneous bandwidth and processor workload. Moreover, with scalable coding, the transmission of these ROIs may be performed at different quality levels according to their degree of visibility and associated importance: see Figure 1 [1]. The whole process of region-based, quality-selective texture coding is known as view-dependent texture streaming.
Figure 1: Principle of view-dependent texture decoding. Non-contiguous portions of the bitstream and associated texture can be independently extracted and decoded according to the viewing angle.

Special XML-based techniques for supporting view-dependent texture decoding in MPEG-4 3D content transmission are presented throughout this paper. Section 2 briefly introduces the MPEG-4 texture-coding tool. Section 3 describes the use of the Bitstream Syntax Description Language and XML for bitstream editing. Finally, Section 4 applies the principles of Section 3 to the MPEG-4 texture-coding tool for 3D transmission.

2. MPEG-4 VISUAL TEXTURE CODING

Within the MPEG-4 multimedia compression standard, tools are available for coding 3D content. In particular,
MPEG-4 provides a scalable, wavelet-based compression tool for textures, called Visual Texture Coding (VTC) [2]. It can operate in six different modes, defined by three quantization modes in combination with two image-scanning modes [3]. The Multiple Quantization (MQ), Tree-Depth (TD) mode codes each parent-children tree of the hierarchically organized wavelet data separately, with successively refined quantization steps, creating the embedded bitstream of Figure 1. This mode is the most appropriate for quality selection within a ROI, and is therefore used throughout the discussion. The separation of the coded texture bitstream into its constituent ROI portions is supported by exploiting VTC's error-resilience packetization: each packet of the bitstream corresponds to a specific region of the texture at a certain quality and resolution level. The start of each packet is unambiguously identified by a unique bit sequence, called the error-resilience marker. Detecting these markers therefore allows the selection of texture regions. The finest possible granularity of the texture is a block of 2^L x 2^L pixels, with L the number of wavelet decomposition levels in VTC.

3. BITSTREAM EDITING WITH XML

MPEG-4 VTC supports view-dependent texture coding, and therefore allows the reduction of bandwidth and/or computing resources for 3D scene transmission and rendering. Relevant segments of the bitstream are identified and extracted by parsing it for error-resilience markers. Even though the corresponding software does not itself decode the bitstream, it is tuned to this specific coding format, since it needs to understand its data structure. However, the proliferation of different types of multimedia content and coding formats makes it likely that several formats will coexist on a server. It is therefore more appropriate to develop a single software module that performs the same kind of editing operations on the server for all formats, while still taking the specificities of each content media format into account. This paper presents such a generic, XML-based method for manipulating any bitstream for view-dependent texture decoding; the results are tailored to the MPEG-4 VTC media format.

We propose to use XML to describe the high-level structure of the bitstream, and call the resulting XML document a Bitstream Description. This description is not meant to replace the original binary format, but acts as an additional description layer, similar to metadata. The Bitstream Description does not describe the bitstream on a bit-per-bit basis, but addresses its high-level structure, i.e. how it is organized in layers or packets of data. Furthermore, it does not deal with the semantics of the bitstream (i.e. the original object, image, audio or video it represents), but merely considers it as a sequence of binary symbols.
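For illustration only, such a high-level description might take roughly the following form. The element names are hypothetical (they are not those of the actual VTC schema), the byte ranges are merely indicative, and only the bitstream#start-end notation follows the convention used in Figure 2.

    <!-- Hypothetical sketch of a high-level bitstream description.
         Element names are illustrative; each payload merely references a
         byte range of the original bitstream instead of embedding the data. -->
    <Bitstream>
      <MainHeader>bitstream#0-19</MainHeader>
      <Packet>
        <Header>...</Header>  <!-- error-resilience marker and packet parameters -->
        <Payload>bitstream#20-148</Payload>
      </Packet>
      <Packet>
        <Header>...</Header>
        <Payload>bitstream#149-320</Payload>
      </Packet>
      <!-- one Packet per texture region, quality layer and resolution level -->
    </Bitstream>

Note that only the packet-level structure appears in the description; the coded texture data itself remains in the binary bitstream and is only referenced by byte position.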
In order to constrain the syntax and semantics of this description, we defined in [4] a new schema language based on the World Wide Web Consortium's (W3C) XML-Schema [5], called the Bitstream Syntax Description Language (BSDL). This language is used to express a schema defining the model for the XML documents that describe the bitstream structure of a particular media format.

Once an XML representation of the bitstream is available, a set of editing operations can be defined on the XML document, transcoding the original bitstream into a view-dependent version. These editing actions remove the elements corresponding to the packets of data to be cut off, and modify some of the remaining elements or attributes. This purpose is served by style sheets written in the eXtensible Stylesheet Language Transformations (XSLT) format, a W3C language specifying XML-to-XML transformations [6, 7]. The application of BSDL to view-dependent decoding is further detailed in the next section.

4. VTC BITSTREAM EDITING

4.1. VTC bitstream description

The application of the BSDL approach to view-dependent VTC decoding may be summarized as follows. A Bitstream Description is generated from the original MPEG-4 VTC bitstream. This XML document conforms to the schema written in BSDL, which specifies the model of the VTC bitstream. It is transformed, via a dedicated XSLT style sheet, into a modified description from which a new, adapted bitstream is produced, containing only the required texture packets.

The BSDL description of a VTC bitstream must expose the image parameters that are relevant for adapting the content: the number of quality layers, the number of packets per layer, the portion of the image covered by each packet, etc. To avoid excessive complexity during the bitstream transformations, the description should not be verbose: information irrelevant to the style sheets should be hidden as much as possible. At the same time, vital information should be easily accessible and not scattered along the bitstream. Figure 2 shows an example of the structure of a VTC bitstream description, with headers containing, amongst others, the error-resilience markers, and payloads containing the texture data for the corresponding region.

In the particular case of VTC, vital information is deeply embedded in the bitstream. Simply producing the bitstream description in its primary form is therefore suboptimal: such a description is fairly verbose, with the useful information scattered throughout. Moreover, the position of some elements is data-dependent.
Since the current definition of the XML-Schema language [5] does not support conditional constructions, such data-dependent information would have to be moved to the payload and used only by the decoder. To circumvent this problem, we propose an alternative solution based on an extended BSDL description. Instead of mirroring the structure of the original bitstream, the new description encapsulates as much data as possible into the payload. The information relevant for the transformations is included as additional elements that are not originally part of the bitstream. This way, fast and simple content adaptation is possible. An example of this method is shown in Figure 3, where the number of quality layers and the number of packets per layer are explicitly added.

The extended description can be obtained in two ways. In a first approach, the encoder produces the description at the same time as it encodes the binary bitstream. A second approach consists in using the VTC schema to parse the bitstream and produce the original description, and then transforming it into a summarized description with a dedicated XSLT style sheet. It should be noted that this XSLT transformation is completely unrelated to the one used for view-dependent content scaling.
Figure 2: VTC bitstream description. Payload limits are expressed in bytes (e.g. bitstream#20-148, bitstream#149-320).
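For a rough impression of what a schema for such descriptions involves (reusing the hypothetical element names of the sketch in Section 3), a constraining schema could be expressed along the following lines in plain W3C XML-Schema; the actual BSDL schema for VTC is richer than this fragment and is not reproduced here.

    <!-- Hypothetical XML-Schema fragment defining the model of the
         illustrative description above; not the actual VTC schema. -->
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:element name="Bitstream">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="MainHeader" type="xsd:string"/>
            <xsd:element name="Packet" maxOccurs="unbounded">
              <xsd:complexType>
                <xsd:sequence>
                  <xsd:element name="Header" type="xsd:string"/>
                  <xsd:element name="Payload" type="xsd:string"/>
                </xsd:sequence>
              </xsd:complexType>
            </xsd:element>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:schema>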
Figure 3: VTC Extended Information node. Payload limits have changed as the payload embraces more information (e.g. bitstream#16-148).
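To make the idea concrete, an extended description along the lines of Figure 3 might be imagined as follows; the element and attribute names are again hypothetical, the byte ranges indicative, and the explicitly added summary elements hold information that is not present as such in the binary bitstream.

    <!-- Hypothetical sketch of an extended description: summary elements are
         added for the transformation, and most of the coded data is folded
         into larger payload references. -->
    <Bitstream>
      <ExtendedInfo>
        <NumLayers>...</NumLayers>    <!-- number of quality layers -->
        <NumPackets>...</NumPackets>  <!-- number of packets per layer -->
      </ExtendedInfo>
      <MainHeader>bitstream#0-15</MainHeader>
      <Packet layer="0" region="0">
        <Payload>bitstream#16-148</Payload>
      </Packet>
      <!-- further packets ... -->
    </Bitstream>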
4.2. XSLT transformations of VTC for 3D view-dependent decoding

Transformation style sheets have been designed for content adaptation under various conditions. In the case of MQ-TD VTC, an unlimited number of rectangular regions can be selected, specifying, for each region, the quality layer that must be included. The XSLT code reads the extended information and copies to the output the main header together with those packets that are included in any of the selected regions; the extended-information elements themselves are not copied. A simplified sketch of such a style sheet is given at the end of this section.

Figure 4 shows a 3D, rotating, VTC-textured ball. The animation consists of 50 frames. The texture is decoded view-dependently, i.e., at each frame, only the newly visible regions of the texture are decoded. The grid at the left side of the ball in Figure 4 indicates which texture blocks are already decoded (crosses inside the white box) and which are not (zeros outside the white box). At the end of the animation, almost all regions have been decoded; only a small portion that never becomes visible remains undecoded.

Figure 5 compares the bandwidth requirements of view-dependent decoding and of instantaneous full-texture decoding. Although the view-dependently decodable bitstream is larger than the non-scalable bitstream, due to the introduction of the error-resilience markers, only a portion of it needs to be transmitted during the main part of the animation (Figure 5a). The texture transmission can thus be spread over time, drastically reducing the instantaneous bandwidth peak at the first frame by distributing the load over the consecutive frames (Figure 5b), compared to the situation in which the full texture is downloaded at once before the animation starts.
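The announced sketch of such a ROI-selecting style sheet, in XSLT 1.0, could look roughly as follows. It is purely illustrative: it reuses the hypothetical element and attribute names of the earlier sketches and, for brevity, hard-codes the selected regions and the quality threshold instead of reading them from the extended information or from style-sheet parameters.

    <!-- Hypothetical XSLT 1.0 sketch: copy the main header and the packets of
         two hard-coded regions up to quality layer 2; the extended information
         and all other packets are simply not copied to the output. -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="/Bitstream">
        <Bitstream>
          <xsl:copy-of select="MainHeader"/>
          <xsl:copy-of select="Packet[(@region = '3' or @region = '7')
                                      and @layer &lt;= 2]"/>
        </Bitstream>
      </xsl:template>
    </xsl:stylesheet>

From the transformed description, the new, adapted bitstream containing only the required texture packets is then produced from the byte ranges still referenced by the remaining payload elements.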
5. CONCLUSIONS

In this work, we have studied the application of BSDL to MPEG-4 VTC in order to obtain non-uniform scalability of 2D textures in 3D view-dependent transmission. BSDL has been found suitable for describing VTC's bitstream format. Furthermore, we have proposed some usability enhancements leading to an extended BSDL description. BSDL applied to MPEG-4 VTC allows the selection of texture regions at any required quality level. Using this technique in the context of 3D view-dependent texture coding, 3D content can be scaled on demand, providing the best quality under limited network/processing resources.

6. ACKNOWLEDGEMENTS

This work was partially funded by IMEC's MPEG Industrial Affiliation Program.
Figure 4: View-dependent decoding of an animated (rotating during 50 frames), VTC-textured 3D ball. The crosses inside the white boxes indicate the decoded regions of the texture. (a) Frame 1, (b) Frame 16, (c) Frame 32, (d) Frame 50.

[Figure 5, two panels plotting the percentage of decoded bytes against the frame number (0-50): (a) "Percentage of total bytes decoded compared to the non-scalable bitstream", (b) "Percentage of instantaneously decoded bytes compared to the non-scalable bitstream".]
Figure 5: View-dependent decoding (dotted blue line) compared to instantaneous full texture decoding (solid red line). Percentage of bytes view-dependently decoded, (a) up to each frame and (b) for each frame individually, relative to the size of the non-error-resilient bitstream.

7. REFERENCES

[1] E. Delfosse, G. Lafruit, J. Bormans, "Streaming MPEG-4 textures: a 3D view-dependent approach," accepted for ICASSP 2002, International Conference on Acoustics, Speech and Signal Processing, Orlando, May 13-17, 2002.

[2] "The MPEG-4 Audio-Visual Compression Standard, Text of ISO/IEC 14496-5/FPDAM1," ISO/IEC JTC1/SC29/WG11 MPEG99/N3309, Noordwijkerhout, March 2000.

[3] G. Lafruit, B. Vanhoof, "MPEG-4 Visual Texture Coding: Variform, yet Temperately Complex," IWSSIP 2001, 8th International Workshop on Systems, Signals and Image Processing, pp. 63-66, Romania, June 7-9, 2001.

[4] M. Amielh, S. Devillers, "Bitstream Syntax Description Language: Application of XML-Schema to Multimedia Content Adaptation," submitted to the 11th International World Wide Web Conference, Honolulu, May 6-11, 2002.

[5] "XML-Schema," W3C Recommendation, May 2, 2001, http://www.w3.org/TR/xmlschema-0/, -1/, -2/.

[6] "XSL Transformations (XSLT) Version 1.0," W3C Recommendation, November 16, 1999, http://www.w3.org/TR/xslt.

[7] M. Amielh, S. Devillers, "Multimedia Content Adaptation with XML," 8th International Conference on Multimedia Modeling (MMM 2001), pp. 127-145, Amsterdam, The Netherlands, November 5-7, 2001, http://www.cwi.nl/conferences/MMM01/pdf/amielh.pdf.