Thesis for the degree of Doctor of Philosophy
Supporting video-mediated communication over the Internet
Mathias Johanson
Department of Computer Engineering
Chalmers University of Technology
Gothenburg, Sweden, 2003
Supporting video-mediated communication over the Internet
Mathias Johanson
ISBN 91-7291-282-0
Copyright Mathias Johanson, 2003

Doktorsavhandlingar vid Chalmers tekniska högskola
Ny serie nr 1964
ISSN 0346-718X

Department of Computer Engineering
Chalmers University of Technology
Gothenburg, Sweden, 2003
S-412 96 Gothenburg, Sweden
Phone: +46 (0) 31-772 10 00

Contact information:
Mathias Johanson
Alkit Communications
Sallarängsbacken 2
S-431 37 Mölndal, Sweden
Phone: +46 (0) 31-67 55 43
Fax: +46 (0) 31-67 55 49
Email: [email protected]
WWW: http://www.alkit.se/~mathias
Printed in Sweden
Chalmers Reproservice
Gothenburg, Sweden, 2003
Supporting video-mediated communication over the Internet
Mathias Johanson
Department of Computer Engineering, Chalmers University of Technology
Abstract

The tremendous success of the Internet in providing a global communication infrastructure for a wide variety of applications has inspired the invention of packet video systems for synchronous interpersonal communication. The potential benefits of video-mediated communication are numerous, ranging from improved social interactions between individuals to more efficient distributed collaborative work. However, since the Internet was originally designed as a data communication network, primarily supporting asynchronous applications like file transfer and electronic mail, the realization of Internet-based packet video systems presents considerable technological challenges. Specifically, the best-effort service model of the Internet, which does not guarantee timely delivery of packets, implies that video applications must be resilient to packet loss and adaptive to variations in bandwidth and delay. Two fundamental issues are how to make the systems scalable to large numbers of widely distributed users, and how to support video-mediated communication in highly heterogeneous environments. Since the Internet is built upon network connections of widely different capacities and since the computers connected to the network have vastly different characteristics, video applications must be adaptive to diverse and dynamic conditions. Furthermore, video-mediated communication systems must take various application-specific requirements and usability concerns into consideration.

This thesis contributes to the realization of a flexible framework for video-mediated communication over the Internet by presenting scalable and adaptive algorithms for multicast flow control, layered video coding, and robust transport of video. Enrichments of video-mediated communication, in the form of stereoscopic video transmission mechanisms and mobility support, are proposed along with design and implementation guidelines. Furthermore, the scope of Internet video is broadened through the introduction of a novel video gateway technology interconnecting multicast videoconferences with the World Wide Web. In addition to the contributions on core technology, the thesis also deals with applications of video-mediated communication. Specifically, the use of video for distributed collaborative teamwork is explored through experiments with prototype implementations.

Keywords: Video-mediated communication, teleconferencing, Internet video, layered multicast, congestion control, layered video coding, robust video transmission, video gateways, stereoscopic video, distributed collaborative work
List of included papers

This thesis consists of an introduction and the following papers:

Paper A
Mathias Johanson, Delay-based flow control for layered multicast applications, Proceedings of the 12th International Packet Video Workshop, Pittsburgh, PA, April 2002.
Paper B
Mathias Johanson, A scalable video compression algorithm for real-time Internet applications, Proceedings of the Fourth EURASIP Conference on Video / Image Processing and Multimedia Communications, Zagreb, Croatia, July 2003.
Paper C
Mathias Johanson, Arne Lie, Layered encoding and transmission of video in heterogeneous environments, Pending publication.
Paper D
Mathias Johanson, Adaptive forward error correction for real-time Internet video, Proceedings of the 13th International Packet Video Workshop, Nantes, France, April 2003.
Paper E
Mathias Johanson, An RTP to HTTP video gateway, Proceedings of the Tenth International World Wide Web Conference, Hong Kong, China, May 2001.
Paper F
Mathias Johanson, Stereoscopic video transmission over the Internet, Proceedings of the Second IEEE Workshop on Internet Applications, San José, CA, July 2001.
Paper G
Peter Törlind, Mårten Stenius, Mathias Johanson, Peter Jeppsson, Collaboration environments for distributed engineering: Development of a prototype system, Proceedings of CSCWD'99 – Computer Supported Cooperative Work in Design, Compiègne, France, September 1999.
Paper H
Mathias Johanson, Peter Törlind, Mobility support for video-mediated collaborative teamwork, Proceedings of the Tenth ISPE International Conference on Concurrent Engineering: Research and Applications, Madeira Island, Portugal, July 2003.
Table of contents

I Thesis introduction
1 Background
2 Research methodology and scope
  2.1 Problem formulation and scope of the work
  2.2 Research methodology
  2.3 Thesis overview
3 Video on the Internet – an overview
  3.1 Applications and requirements
    3.1.1 Teleconferencing
    3.1.2 Video telephony
    3.1.3 Web cameras
    3.1.4 Media spaces
    3.1.5 Computer-supported cooperative work
    3.1.6 Other applications
  3.2 Protocols and standards
    3.2.1 Transport protocols
    3.2.2 Session management and control
  3.3 Reflectors, transcoders and mixers
  3.4 Multicast communication
    3.4.1 Group management
    3.4.2 Multicast routing
    3.4.3 Multicast scope
    3.4.4 Layered multicast
    3.4.5 The Mbone
  3.5 Quality of service
    3.5.1 Integrated services
    3.5.2 Differentiated services
    3.5.3 Traffic engineering and constraint-based routing
  3.6 Media encoding
    3.6.1 Colorspace conversion and subsampling
    3.6.2 Inter-frame coding
    3.6.3 Transform coding
    3.6.4 Quantization
    3.6.5 Entropy coding
    3.6.6 Video compression standards
    3.6.7 Performance metrics
4 Scalability, adaptivity and robustness
  4.1 Flow control and congestion avoidance
    4.1.1 Unicast congestion control
    4.1.2 Multicast congestion control
  4.2 Scalable video coding
    4.2.1 EZW video coding with temporal prediction
  4.3 Heterogeneous environments
    4.3.1 Granularity of layered encoding and transmission schemes
  4.4 Error-resilient video transmission
    4.4.1 Layered coding with transport prioritization
    4.4.2 Forward error correction
    4.4.3 Postprocessing techniques
    4.4.4 Co-operative sender/receiver techniques
5 Applications
  5.1 Distributed collaborative teamwork in product development
    5.1.1 The benefit of video in distributed collaborative work
    5.1.2 Process coordination and content coordination
    5.1.3 The role of vision in communication
    5.1.4 Video supporting visible behaviors
    5.1.5 Video promoting spontaneous interactions
    5.1.6 Video-as-data
    5.1.7 The effect of video quality on distributed collaborative work
    5.1.8 Conclusions
  5.2 Video and the World Wide Web
    5.2.1 A transport level gateway bringing video to the WWW
  5.3 Next generation video-mediated communication systems
    5.3.1 Stereoscopic video communication over the Internet
    5.3.2 Mobile teleconferencing
6 Summary of included papers and their contributions
  6.1 Paper A
  6.2 Paper B
  6.3 Paper C
  6.4 Paper D
  6.5 Paper E
  6.6 Paper F
  6.7 Paper G
  6.8 Paper H
7 Future directions

II Appended papers
Paper A: Delay-based flow control for layered multicast applications
Paper B: A scalable video compression algorithm for real-time Internet...
Paper C: Layered encoding and transmission of video in heterogeneous...
Paper D: Adaptive forward error correction for real-time Internet video
Paper E: An RTP to HTTP video gateway
Paper F: Stereoscopic video transmission over the Internet
Paper G: Collaboration environments for distributed engineering
Paper H: Mobility support for video-mediated collaborative teamwork
Acknowledgements

Without the unfaltering support and encouragement from a number of people, this work would probably not have been possible, and it would definitely not have been as enjoyable.

First and foremost, I would like to thank my supervisor Professor Sven Tafvelin for giving me the opportunity to embark on this scientific journey, for having faith in me, and for always being supportive and open minded. Thank you Sven! It has been a true pleasure!

I am forever grateful to Lars-Åke Johansson, who has acted as my mentor, supporting me professionally and personally ever since I first joined his research group at SISU back in 1994. Thank you, Lars-Åke! I appreciate your devotion, enthusiasm and helpfulness very much!

My sincerest gratitude also goes to Professor Lennart Karlsson and to Leif Henriksson, for supporting my work for a long time, and for taking the initiative to start our company Alkit Communications, where I hope the ideas of my research will eventually materialize and prosper.

Many thanks go to my colleagues at Alkit Communications – Lars-Åke, Leif, Magnus Fant, Mats Gustafsson, Roland Dahl, Peter Törlind and Magnus Löfstrand – who greatly contributed to making my research endeavors a very pleasant and stimulating undertaking.

I am very grateful to Peter "Peppar" Parnes, for supporting me within the SITI Internet3 research program, and for acting as discussion leader at my licentiate seminar.

I would like to send very special thanks to my co-authors, Peter Törlind, Mårten Stenius, Peter Jeppsson and Arne Lie, without whom this thesis would not only have been shorter and poorer, but also less stimulating to write.

The work presented in this thesis has been funded by a variety of sources, including Alkit Communications, SISU, SITI, VINNOVA, NUTEK, NORDUnet 2, Framkom Research Corporation and Luleå University of Technology.

Finally, I would like to thank my mother and my father, to whom I dedicate this thesis, and my two sisters, for their endless support and love.
Mathias Johanson, Göteborg, April 2003
Thesis introduction
1 Background

Interpersonal communication systems are becoming increasingly pervasive in everyday life. As technology evolves, the demand for richer means of interaction spurs the development of ever more sophisticated multimodal communication systems. Meanwhile, providers of telecommunication services compete to deliver enhanced communication services to their customers at affordable cost levels. Without a doubt, this development is not only an important business opportunity for the service providers, but also has a tremendous potential for enriching social interactions between individuals. Moreover, high quality communication using rich media can enable new ways of collaborative work between teams of co-workers, irrespective of geographical location. This not only reduces the need to travel but also facilitates new ways of cooperative work, wherein the flow of information is more direct between the people concerned. More efficient information exchange makes it possible to cut lead times and increase productivity in distributed teamwork, while improving the working conditions for the people involved.

An important aspect of human interaction is visual communication. This has impelled the development of digital video communication systems, initially based on dedicated circuit-switched telecommunication channels and subsequently using the Internet, through the invention of packet video systems. The Internet protocols were originally designed for asynchronous data transfer applications, like electronic mail, file transfer, and remote access to time-sharing systems. Already at an early stage, however, the potential of using the Internet for synchronous interpersonal communication was explored, initially through text messaging systems and then using packet audio and video tools. Gradually, as link capacities and end system performance improved, the Internet evolved into a multiservice network infrastructure supporting many types of applications of voice, video and data communication. This convergence of telecommunication services onto a unified IP-based network infrastructure presents a huge savings potential for network operators, since it eliminates the need to maintain several communication networks in parallel. Thus, the incentives for video-mediated communication over the Internet can be seen to be related both to the desire for richer interpersonal communication services and to the cost-effective realization of those services.

However, the connectionless, best-effort nature of the current Internet architecture poses severe technological challenges for designing time-critical synchronous communication systems. Since there are no guarantees on resource availability or timely delivery of datagrams, packet video applications must be resilient to packet
loss and adaptive to variations in bandwidth and latency. Furthermore, since bandwidth in many parts of the Internet is a scarce resource and since uncompressed digital video signals require prohibitively high bitrates, sophisticated video compression algorithms are needed to efficiently utilize the network.

The requirements of real-time multimedia communication applications have inspired researchers to propose enhancements to the prevalent best-effort model of packet delivery. These efforts, collectively labeled Internet Quality of Service (QoS), are aimed at providing different service classes for different types of Internet traffic. Although more sophisticated QoS support from the network will substantially facilitate the realization of large-scale real-time communication services, it is currently unclear in what shape this functionality will be provided. In any case, it will for a long time yet be necessary for applications to rely on the current best-effort model.

The design of Internet-based video-mediated communication systems must not only be influenced by technological concerns, but should also reflect usability requirements related to the subjective assessment of video quality and the nature of the particular application. This requires an understanding of the basic mechanisms of the human visual system and the role of vision in interpersonal communication processes. Different applications have different needs, and it is of great importance that a technological framework for video-mediated communication is flexible enough to meet a wide variety of requirements.

Due to the challenges and the promises of the technology, video-mediated communication over the Internet has emerged as a very active research area, both within the technical disciplines and the behavioral sciences. The requirements on the network imposed by real-time transport of video highlight many of the most relevant issues in contemporary computer network research. The computational demands of digital video processing push the limits of computer hardware design and promote progress in signal processing research. Video-mediated communication can thus be seen as a driver for advances in computer and network technology and as an enabler of innovations in the vanguard of Internet application research.

The research presented in this thesis contributes to the realization of a flexible framework for video-mediated communication over the Internet. The main focus is on the core technological components of video communication, such as network transport protocols and video coding issues, but to a lesser extent the thesis also deals with applications and usage of video-mediated communication.
2 Research methodology and scope

This chapter discusses the scientific methodology that has been used when conducting the research that forms the basis for the thesis. The scope of the work is defined and the organization of the thesis is outlined.
2.1 Problem formulation and scope of the work

The research presented in this thesis can be coarsely divided into two parts: Papers A to D are concerned with the technological foundations of video-mediated communication over the Internet, specifically focusing on scalability, adaptivity and robustness issues. Papers E to H are to a higher degree dedicated to applications of video-mediated communication, although the perspective is still very technical. This distinction is to some extent reflected in the way the technologies, algorithms and methods proposed in the papers can be associated with the layered model of data communication found in the OSI reference model [1]. Papers A, C and D are mainly concerned with transport layer functionality, although in practice the mechanisms proposed are often implemented at the application layer. Papers F, G and H are mainly concerned with application level issues. Paper E describes a transport level video gateway, but is also closely related to Internet application research through the proposed interoperation of multicast video tools with the WWW. The video compression algorithm presented in Paper B is conceptually part of the application layer (or the presentation layer), but like Papers A, C and D the focus is on a fundamental technological component of a video communication system, rather than on a specific application. In accordance with this dual nature of the work, the problem formulation underlying the research can essentially be stated as the following two questions:

• How can we design the core technological components of a system supporting video-mediated communication over the Internet, in a way that is scalable, robust and efficient in terms of resource utilization?

• How can we realize new types of applications of video-mediated communication by building systems out of the components, and what effects can be expected when these systems are put to use?
The aim of the work presented in the thesis has been to study technology supporting video-mediated communication over the Internet from a broad technological perspective, rather than focusing exclusively on some limited aspect thereof. The motivation for doing so is that the components of a video communication system are closely interrelated, each affecting the design of the system as a whole. Thus, in order to gain a system-level understanding of Internet video
communication it is important to be knowledgeable in many different areas. Consequently, the research presented in this thesis is inherently multidisciplinary, relating to the fields of computer networking, signal processing, algorithmics, and application software architecture. Part of the work, especially the more application-oriented research, is influenced by behavioral sciences like cognitive usability and human-computer interaction. Although the thesis is broad in its scope, the individual research papers each target highly specialized research areas, such as layered multicast congestion control, scalable video coding, stereoscopic video transmission, and collaboration support for distributed engineering teamwork.
2.2 Research methodology

The research presented in this thesis is based on both experimental and theoretical approaches, but the emphasis is on experimental methods. The analysis of network protocols has primarily been carried out through simulations. Design and analysis of video compression algorithms have relied on both theoretical methods (e.g. complexity analysis) and experiments with prototype implementations. The application-oriented research relies heavily on the development of prototype systems for the purpose of experimentation. All in all, the research approach can be characterized as problem-oriented, in that there is a specific or potential problem underlying the research.

Experimental research typically starts with the formulation of a hypothesis. The next step is to design one or more experiments that will either validate or invalidate the hypothesis. The outcome of the experiments often results in a need to reformulate the hypothesis and redesign the experiments to explore the revised hypothesis. Thus, experimental research often proceeds iteratively. As an example, one of the hypotheses underlying the experimental research in Paper A can be stated as follows: "The overall loss rate for multicast flow control can be reduced by using a delay-based flow control algorithm." This hypothesis is validated in Paper A through simulations. The process of finding the algorithm that is presented in Paper A was an iterative procedure alternating between simulations and algorithm redesign.

The research focusing on applications of video-mediated communication has largely been based on the development of prototype systems for the purpose of experimentation and testing in realistic or pseudo-realistic use situations. Experiences from the test phase resulted in redesign of the prototype systems, implementation of new features, and further testing in a cyclic process. Methodologically, this way of conducting research can be categorized as action research. The motivation for choosing this approach is that it promotes a tight interaction between researchers focusing on core technology and researchers studying applications and usability. Traditionally, these research groups have tended to work too much in isolation from each other, resulting in the development of technology and systems that are technically sound but unusable, or in usability studies of obsolete technology. This approach has also been a natural choice
considering the circumstances under which the research has been conducted: by an industrial PhD student working in close cooperation with user groups in industry as well as with researchers at academic institutions. Since the focus of the research presented in the thesis is mainly on technological issues the majority of the work is associated with the prototype implementation and redesign phases and less with the testing and evaluation. Throughout the work – both the application-oriented research and the work on core technology – a holistic view of video-mediated communication has been embraced, adhering to the scientific tradition of system theory [2]. This is underscored by the effort invested in implementing the technological components in prototype systems for the purpose of experimentation and analysis. Some of the software components developed for this purpose have also found use outside the scientific context, and are being developed into commercial products.
2.3 Thesis overview

The thesis is organized in two parts: an introduction and a collection of research papers. The purpose of the introductory part is to provide an overview of the core technologies and key applications of Internet video communication, to introduce important concepts and to establish a terminology. Moreover, the research papers included in the second part of the thesis are put in context and related work in the field is discussed.

The outline of the introductory part of the thesis is as follows: Chapter one gives an introduction to Internet-based video communication systems; chapter two discusses the organization of the thesis, the research methodology and the scope of the work; chapter three provides an overview of the technological foundations of Internet video communication systems and identifies the most important applications; chapter four introduces the research field concerned with scalability, adaptivity and robustness issues and puts the research presented in Papers A to D in context with related work in the field; chapter five introduces the application-oriented research on video-mediated communication presented in Papers E to H; chapter six provides a summary of the included research papers; and chapter seven concludes the introductory part of the thesis with an outline of future research directions.
3 Video on the Internet – an overview

The earliest digital video transmission systems were designed for circuit-switched transmission networks with fixed capacities. In circuit-switched networks calls are aggregated through time division multiplexing (TDM), allotting a constant bitrate share of the communication channel to each call. This requires that the video signal to be transmitted is encoded at a constant bitrate (CBR), conforming to the capacity of the TDM slot of the communication channel. In contrast, packet-switched networks, like the Internet, aggregate traffic with variable bitrate (VBR) onto a single communication link using statistical multiplexing. By not requiring the video coding to be CBR, the aggregate utilization of the network can potentially be higher. On the other hand, circuit-switched networks guarantee the bandwidth of a connection for the duration of a communication session, whereas connectionless packet-switched networks are typically best-effort, requiring the applications to adapt to the amount of bandwidth available.
3.1 Applications and requirements

Internet video applications can be broadly categorized into two classes: live video applications and applications of stored video material. Live video applications are concerned with synchronous, real-time transmission of live video signals in a person-to-person communication scenario, whereas stored-video applications are concerned with asynchronous transmission and playback of pre-recorded material stored on disk or otherwise. The simplest case of a stored-video Internet application is transfer of a video file from a server for playback after the download finishes. To reduce start-up latencies, streaming video applications have been developed, which maintain a playback buffer so that playback can be initiated before the transfer is complete. Streaming applications are typically based on a client/server model. In contrast, the synchronous applications are based on a peer-to-peer model, and the communication is in many situations (but not always) symmetric in that video flows both ways, e.g. in a videophone application. An example of an asymmetrical live video application is a lecture that is broadcast to a group of students in a teleteaching scenario. The students in this type of setting are typically able to use an audio or text chat backchannel for asking questions.

Internet video applications can also be classified as point-to-point or multipoint. In a point-to-point application, video is transmitted, bidirectionally or unidirectionally, between two endpoints. In a multipoint application video is broadcast from one or more senders to many receivers. A multipoint application with only one sender is a point-to-multipoint application.

The work presented in this thesis is mostly oriented towards live Internet video applications, although much of the underlying technology can also be applied to streaming video applications. Synchronous interpersonal communication using
live video is commonly referred to as video-mediated communication – a term used heavily in this thesis. Some of the most important applications of video-mediated communication on the Internet are discussed below.
3.1.1 Teleconferencing

Teleconferencing is a broad concept involving applications that enable groups of people to communicate in real time using a combination of audio, video and other media. The term videoconferencing is often assumed to include both audio and video; videoconferencing is most often symmetrical, and can be multipoint or point-to-point. Moreover, videoconferencing systems are often categorized as either desktop systems or room-based systems. Desktop videoconferencing systems are software applications running on general-purpose workstations, whereas room-based systems are self-contained units typically installed in conference rooms. Teleconferencing applications, being highly interactive, impose strict requirements on end-to-end delays. Furthermore, the systems need to be scalable to large sessions with many participants. Audio and video quality requirements typically depend on the circumstances and are highly subjective.
3.1.2 Video telephony

Video telephony can be seen as a special case of videoconferencing, limited to a symmetrical point-to-point configuration with two participants. As with videoconferencing, audio is often assumed to be an integral part of a video telephony system. The terminals (the videophones) are typically rather simple and lightweight devices. Video quality requirements can be assumed to be somewhat lower than in group conference settings, but again, this is highly subjective.
3.1.3 Web cameras

A web camera is a video camera attached to a computer that transmits a live video feed to a client web browser. A software component installed on a web server makes it possible to incorporate live video in WWW pages. This is commonly used for various types of remote awareness applications. Video quality is typically rather moderate and the delay requirements are relaxed.
3.1.4 Media spaces

A media space is a loose concept that tries to capture the idea of creating a shared environment between two or more interconnected sites. The media space is typically intended to support distributed collaborative teamwork and to create a sense of togetherness and social proximity between the distributed team members. Key functions are to provide presence and awareness cues and to promote opportunistic interactions between co-workers at the remote sites. Continuously open audio and video links in combination with large wall-projection systems in
public workplace areas enable the co-workers to bump into each other and to engage in informal conversations. What distinguishes a media space from a teleconferencing system is the focus on communicating not only the video of the participants' faces and bodies, but also the surrounding room. Moreover, whereas a teleconference generally has a rather limited duration, a media space system is "always on," providing telepresence and promoting spontaneous interactions. In order to be useful, media spaces require rather high quality video with low latency.
3.1.5 Computer-supported cooperative work

Videoconferencing is an important component of distributed computer-supported cooperative work (CSCW). In synchronous collaborative work sessions, live audio and video communication is often complemented by data sharing tools like shared whiteboards and application sharing tools. The requirements on video quality depend on the type of collaborative work being supported: highly interactive brainstorming meetings typically require higher audiovisual quality and lower latency, whereas more structured meetings can tolerate lower quality.
3.1.6 Other applications

In addition to the above-mentioned applications, a large number of highly specialized applications of Internet video have been proposed and implemented. For instance, telemedicine applications using video for remote consultations and diagnoses have been successfully demonstrated. Teleteaching is another application where videoconferencing in combination with other tools can be used for distance education in an asymmetrical, multipoint setting. Moreover, virtual reality systems augmented with video (sometimes referred to as augmented or mixed reality) have attracted a lot of attention from the research community.
3.2 Protocols and standards

The applications and technology covered by this thesis build heavily on the Internet Protocol (IP) architecture and service model. Indeed, for the remainder of the thesis an IP-based communication infrastructure will be assumed.

Standards for video communication over IP networks have primarily emerged from two sources: the International Telecommunication Union (ITU-T) and the Internet Engineering Task Force (IETF). Although the standards developed by these respective authorities are partly overlapping, they represent two fundamentally different standardization approaches. The ITU-T Recommendation H.323 [3] defines protocols and procedures for multimedia communication over packet networks and is a conglomerate of a number of standards, including H.245 for control, H.225.0 for packetization and connection establishment, H.261 and H.263 for video coding, and a number of others for supplementary services. The H.323 series standards are based on
adaptations of protocols developed for the traditional circuit-switched service model of telecommunication networks (e.g. H.320 for ISDN videoconferencing and Q.931 for signaling). Significant emphasis is placed on interoperability and compliance. The IETF standards framework for Internet video is a more loosely coupled set of documents, each defining a specific protocol or procedure. Furthermore, the IETF standards are more lightweight, with a pronounced aim of scalability. In contrast to the ITU-T standards, they do not define any algorithms for content coding, but include procedures and guidelines for packetization of media.
3.2.1 Transport protocols

Most Internet applications use the Transmission Control Protocol (TCP), which implements reliable, connection-oriented data delivery over the connectionless datagram service provided by IP. The TCP transport protocol achieves reliability by retransmission of lost packets using an acknowledgement scheme. TCP also provides a congestion avoidance algorithm that adapts the packet transmission pace based on the experienced loss rate. However, delay-sensitive applications, like packet audio and video tools, cannot use the TCP protocol due to its poor real-time properties. When dealing with real-time data, a packet arriving too late is just as bad as a lost packet. The retransmission scheme of TCP is hence not appropriate for real-time applications.

The Real-time Transport Protocol (RTP) is an IETF proposed standard providing end-to-end delivery over packet networks for data with real-time characteristics [4]. For Internet applications it is typically used on top of the User Datagram Protocol (UDP), taking advantage of its multiplexing and checksum functionality [5]. RTP does not provide QoS mechanisms, but rather relies on lower level protocols to do so. A slightly abbreviated version of the RTP protocol is included in the ITU-T standard document H.225.0, specifying packetization rules for H.323 videoconferences.

In contrast to the traditional programming model of data communication, where the transport protocol is implemented in a protocol stack in the operating system kernel, the RTP protocol functionality is integrated in the application. This concept is known as application level framing (ALF) [6] and is motivated by the fact that multimedia application design can be significantly simplified, and overall performance enhanced, if application level semantics are reflected in the transport protocol.

RTP defines a packet header containing information that is of generic interest for many real-time applications, like timestamps, sequence numbers and payload type identifiers. In accordance with the ALF concept, the semantics of several of the RTP header fields are deferred from the RTP specification to application-specific RTP profile documents. Typically, each media encoding to be carried over RTP has an associated RTP profile document specifying packetization rules and defining the semantics of the application-specific fields of the header.

The RTP specification includes a control protocol for RTP, called the RTP Control Protocol (RTCP). The primary function of RTCP is to provide quality feedback from the receivers of media streams. This is performed by periodic
transmission of receiver report RTCP packets, containing status information like the current interarrival jitter and cumulative number of packets lost. The receiver reports can be used by the originators of media streams to adapt the transmission rate or other parameters based on the observed performance. RTCP also provides a persistent transport-level source description mechanism used to identify the sender of an RTP stream through a canonical name and other optional metadata. Moreover, RTCP can be used to convey minimal session management information to keep the members of an RTP session updated about the current participants of the session. Since the RTCP protocol is based on periodic transmissions of session control information, the transmission interval must be scaled in proportion to the size of the session. Otherwise, an implosion of RTCP packets might overload the network for large sessions. The RTP specification provides guidelines for how this scaling should be implemented. For a complete description of the RTP and RTCP protocols, see the RTP specification [4].
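To illustrate the ALF idea in concrete terms, the following sketch (in Python; not part of any of the tools described in this thesis) unpacks the fixed twelve-byte RTP header. The field layout follows the RTP specification [4], while the function name and the dictionary keys are merely illustrative.

    import struct

    def parse_rtp_header(packet: bytes):
        """Parse the 12-byte fixed RTP header from a received UDP payload."""
        if len(packet) < 12:
            raise ValueError("packet too short to contain an RTP header")
        b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
        return {
            "version": b0 >> 6,            # always 2 for current RTP
            "padding": (b0 >> 5) & 0x1,
            "extension": (b0 >> 4) & 0x1,
            "csrc_count": b0 & 0x0F,
            "marker": b1 >> 7,
            "payload_type": b1 & 0x7F,     # identifies the media encoding
            "sequence_number": seq,        # used to detect loss and reordering
            "timestamp": ts,               # media sampling clock (e.g. 90 kHz for video)
            "ssrc": ssrc,                  # identifies the synchronization source
        }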
3.2.2 Session management and control

As mentioned above, the RTCP protocol provides elementary session management and control functions. However, this is limited to rudimentary support for identification of session participants and is not concerned with synchronous signaling of session initiation and control.

The H.323 standards suite provides session initiation and control through H.245 and H.225.0. The H.323 approach to session control conforms to the traditional circuit-switched model, based on the Q.931 protocol for ISDN call setup signaling. In contrast, the IETF has designed an application-level signaling protocol called the Session Initiation Protocol (SIP) [7], reusing many of the header fields, encoding rules, error codes and authentication mechanisms of HTTP. The SIP protocol can be used to initiate, modify and terminate synchronous communication sessions with two or more participants. Furthermore, SIP invitation messages, used for session setup, contain session descriptions, based on the Session Description Protocol (SDP) [8], specifying the media encodings to be used for the session. The SIP/SDP protocols for session initiation and control provide a clearer separation of session signaling and multimedia data exchange, compared to the H.323 protocols. This makes it possible to implement dedicated session management tools that can be used to launch any synchronous communication tool based on the SDP descriptions. A comprehensive comparison of SIP and H.323 is given by Schulzrinne and Rosenberg [9].

For large multicast conferences where synchronous invitation of all prospective participants is not viable, a protocol called the Session Announcement Protocol (SAP) has been proposed [10]. With SAP, session announcement packets containing SDP descriptions are periodically transmitted to a well-known multicast
address and port. Specialized session directory tools listen to session announcements, informing the user about active and upcoming sessions.
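As an illustration, a session description of the kind carried in SIP invitations or SAP announcements might look as follows; the syntax follows SDP, while the user name, session name, addresses and port numbers are purely hypothetical:

    v=0
    o=mathias 2890844526 2890842807 IN IP4 192.0.2.10
    s=Distributed design review
    c=IN IP4 224.2.17.12/127
    t=0 0
    m=audio 49170 RTP/AVP 0
    m=video 51372 RTP/AVP 31
    a=rtpmap:31 H261/90000

The two m= lines announce an audio stream (RTP payload type 0, i.e. PCM audio) and a video stream (payload type 31, H.261), each carried in its own RTP session on the indicated ports.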
3.3 Reflectors, transcoders and mixers

A reflector (also known as a Multipoint Conferencing Unit, MCU) is an application-level agent that serves as a relay point for multimedia traffic, facilitating multipoint communication. In a multipoint communication scenario employing reflectors, media packets are addressed to the reflector, which forwards the packets to all participants of the session. This results in a more scalable packet delivery mechanism compared to the situation where each sender transmits a unique copy of every packet to all participants. Several reflectors can be combined into a hierarchy for large sessions. The reflectors can be configured either statically, or using a dynamic signaling protocol, e.g. SIP. In Figure 1 a multipoint communication scenario using two reflectors is illustrated, wherein host 1 transmits a packet to all other participating hosts.

Figure 1 Multipoint communication using reflectors

A transcoder, or transcoding gateway, is a device that performs conversion between different media encodings in real time. Transcoders are typically implemented in reflectors to enable a set of participants in a multipoint session to receive media in different encodings, based on link capacities and other considerations. By using transcoders, multipoint multimedia conference sessions can be realized in heterogeneous network and computing environments. However, transcoding introduces a high computational complexity and increases latency.

A mixer is a device that aggregates multiple incoming media streams into one outgoing media stream by performing some synthesis of the media. The typical example is a multipoint audio conference, where multiple incoming audio streams are mixed together to one outgoing stream. Mixers make more efficient use of network bandwidth and relieve the end systems of the media mixing operation that might be required for presentation (for
example when playing out audio). Albeit not as straightforward as audio mixing, video sources can also be mixed. For instance, four video sources can be combined into a single quadruple-view signal, or two signals could be combined in a picture-in-picture arrangement.
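The forwarding logic of a basic reflector is straightforward. The following Python sketch, a simplified illustration rather than a description of any particular MCU, forwards every UDP packet received on a conference port to all other participants it has heard from; the port number is arbitrary and no media processing is performed:

    import socket

    REFLECTOR_PORT = 5004        # hypothetical conference port

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", REFLECTOR_PORT))

    participants = set()         # (address, port) pairs learned from incoming traffic

    while True:
        packet, source = sock.recvfrom(2048)
        participants.add(source)
        for destination in participants:
            if destination != source:    # do not echo the packet back to its sender
                sock.sendto(packet, destination)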
3.4 Multicast communication

In multipoint communication situations an efficient mechanism is needed for delivery of data to many receivers. As discussed in section 3.3, reflectors can be used to achieve a more scalable multipoint transmission architecture. However, when using reflectors, packets are still duplicated on shared network segments, resulting in sub-optimal resource utilization. Moreover, reflectors are hard to configure and maintain.

In IP multicast a range of class D IP addresses (224.0.0.0 to 239.255.255.255) is reserved for group communication [11, 12]. A packet sent to a multicast group address is delivered to all members of the group. Group membership is maintained dynamically through explicit signaling. A dedicated multicast routing protocol is needed to forward traffic to multicast group members without transmitting redundant packets. Multicast traffic is propagated to the receivers of a group along a multicast delivery tree rooted at the sender.
3.4.1 Group management

Group dynamics in IP multicast is provided by the Internet Group Management Protocol (IGMP) [13]. Similar to the Internet Control Message Protocol (ICMP) for error control, IGMP is an integral part of IP. IGMP defines four types of operations for maintaining multicast groups, namely

• general membership query,
• group-specific membership query,
• membership report, and
• leave group.

An IGMP message format is defined for carrying queries and reports. General membership queries are used to obtain information about active groups in a subnet, whereas group-specific membership queries request information about whether a particular group contains members in the subnet. Membership queries are issued periodically by multicast routers to determine group membership status. Membership reports are sent by hosts when joining a new group and in response to membership queries. A leave group message is sent when a host's group membership is terminated.
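Applications do not normally construct IGMP messages themselves; joining a group through the socket interface causes the host's IP stack to emit the corresponding membership report. A minimal Python sketch, assuming a hypothetical administratively scoped group address:

    import socket
    import struct

    GROUP = "239.1.2.3"          # hypothetical multicast group
    PORT = 5004

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))

    # Joining the group: the IP stack sends an IGMP membership report.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

    data, sender = sock.recvfrom(2048)   # multicast datagrams are now delivered

    # Leaving the group triggers an IGMP leave group message.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)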
3.4.2 Multicast routing

Multicast routers compute routing tables specifying forwarding paths for multicast traffic. For each multicast group having members on a subnet, the designated router maintains a routing table entry consisting of the multicast
address, the network interface of the source and the interfaces where packets should be forwarded. The routers rely on soft state, so the routing table entries must be periodically refreshed by sending membership queries. If no local members remain for a group, the routing table entry is deleted.

A number of routing algorithms and protocols have been designed to compute the multicast routing tables. Basically, multicast routing algorithms can be classified as either data-driven or demand-driven. Data-driven algorithms, also known as broadcast-and-prune schemes, initially flood datagrams of a multicast group to all potential receivers using reverse-path forwarding (RPF). In the RPF scheme a multicast router forwards an incoming packet on all interfaces except the ingress interface, if it arrived on the interface constituting the shortest path to the source. Forwarding paths are then pruned (i.e. removed) bottom-up by downstream routers that have no group members on their connected subnets. If a host joins a multicast group in a subnet whose designated router has previously pruned the delivery tree for the group, the router grafts (i.e. reestablishes) the forwarding path. To be able to graft previously pruned paths, multicast routers must maintain state information for every pruned group. Periodically, routers flood datagrams of active groups anew, to update the soft state in the downstream routers.

The reliance on flooding in data-driven algorithms limits the scalability of multicast routing. In response, several algorithms have been proposed that avoid flooding. These algorithms are known as demand-driven, since they refrain from forwarding datagrams into networks until specifically demanded. When a host joins a multicast group in a demand-driven routing configuration, the designated router on the host's subnet must signal this join event to other routers before multicast datagrams will be forwarded to the host. The question is how to know which multicast router to inform, i.e. which router is on the next higher level of the delivery tree from the source. In demand-driven multicast routing, the Internet is divided into administrative regions (domains), each with a dedicated core router (also known as a rendezvous point). Other routers in the region are either statically configured to know about the core router, or use a dynamic discovery protocol at boot time to find it. Once a host joins a multicast group, the designated router unicasts a join request to the core router. To be able to support multicasting between domains, an interdomain multicast routing protocol is needed.

Data-driven multicast routing algorithms are appropriate in dense network configurations where many hosts are clustered closely and bandwidth is abundant. Demand-driven algorithms are more suitable in sparse network configurations where bandwidth is scarce. Thus, data-driven multicast routing can be envisioned to be used in enterprise and campus networks, whereas demand-driven routing is more appropriate in wide-area networks. Moreover, data-driven routing is sender-oriented, suitable for applications where it is of importance that the first datagrams of a session are delivered to all members of a group. Demand-driven algorithms, in contrast, are receiver-oriented, suitable for dynamic situations where the receivers join the group at different times. In multipoint videoconferencing sessions, participants typically join the conference at slightly
different points in time and it is not crucial that the very first datagrams transmitted to the session are delivered to all hosts. Hence, the receiver-oriented paradigm might be preferable. Moreover, since video communication is inherently broadband, data-driven routing can waste a lot of bandwidth due to flooding.
3.4.3 Multicast scope

Two techniques exist for limiting the scope of IP multicast transmission. The first technique uses the datagram's time-to-live (TTL) field to limit the number of hops the datagram will be forwarded. Each router decreases the TTL field when forwarding a datagram. When the TTL value reaches zero the datagram is dropped. The second technique, called administrative scoping, is based on reserving certain ranges of multicast addresses for limited propagation. The extent of administratively scoped multicast groups is explicitly configured by the organization operating the network.
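TTL-based scoping is controlled by the sending application. A minimal sketch, with an arbitrary group address and a TTL of 16 hops:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # A TTL of 1 keeps the traffic on the local subnet; larger values allow the
    # datagrams to cross correspondingly more multicast routers before being dropped.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 16)
    sock.sendto(b"scoped announcement", ("239.1.2.3", 9875))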
3.4.4 Layered multicast

Layered multicast is an approach to synchronous multipoint communication in heterogeneous network and computing environments, proposed as a more scalable alternative to using transcoding gateways. The idea behind layered multicast is to utilize a layered source coding that transforms the media to be disseminated into a hierarchy of cumulative layers, each of which is transmitted to a unique IP multicast address. Each participant of the synchronous communication session can then independently decide on a suitable number of multicast groups to join, depending on the amount of bandwidth and CPU resources available. A flow control algorithm is needed at each receiver, determining the optimal number of groups to subscribe to, based on feedback from the network. Figure 2 shows an example of a layered multicast communication session.

Figure 2 Layered multicast scenario
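The receiver-side decision logic of such a flow control algorithm can be sketched as follows, in the spirit of receiver-driven layered multicast schemes; the thresholds and timer values are purely illustrative, and this is not the algorithm proposed in Paper A:

    LOSS_DROP_THRESHOLD = 0.05   # drop a layer above 5% packet loss (illustrative)
    JOIN_TIMER = 10.0            # seconds between join experiments (illustrative)
    MAX_LAYERS = 4

    def adapt_subscription(current_layers, loss_rate, time_since_last_join):
        """Return the number of cumulative layers the receiver should subscribe to."""
        if loss_rate > LOSS_DROP_THRESHOLD and current_layers > 1:
            return current_layers - 1    # congestion: leave the topmost group
        if loss_rate == 0.0 and time_since_last_join > JOIN_TIMER and current_layers < MAX_LAYERS:
            return current_layers + 1    # spare capacity: join the next enhancement layer
        return current_layers            # otherwise keep the current subscription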
3.4.5 The Mbone

The multicast backbone (Mbone) is an experimental virtual network implemented on top of the Internet, providing global multicast connectivity [14]. The Mbone consists of islands of native multicast routing domains interconnected over non-multicast routing domains through tunneling. IP-in-IP tunneling enables multicast datagrams to be encapsulated in unicast datagrams for unicast transmission to a destination network, where they are decapsulated and re-multicast.
3.5 Quality of service

The current Internet architecture provides only a single class of best-effort service. All packets are treated in the same way, with no guarantees on delivery or bounds on delay and jitter. For a wide deployment of performance-critical applications, like real-time multimedia conferencing, it has been argued that a more predictable service needs to be delivered from the network. This has resulted in the proposal of new service models for the Internet, most notably the integrated services model and the differentiated services model [15].
3.5.1 Integrated services

The integrated services model (Intserv) is based on end-to-end resource reservations. In this model, a signaling protocol is used to set up a path between the communicating endpoints prior to data exchange. Along the path, resources are reserved at intermediate systems to be able to guarantee the quality of service requested by the application. The signaling protocol defined by the IETF for this purpose is called the Resource Reservation Protocol (RSVP) [16]. If ample resources are available for the QoS requested, the RSVP reservation will succeed and the application can proceed to communicate using the guaranteed service. If insufficient resources are available, the reservation will fail and no service will be given to the application.

Intserv requires admission control to be performed, to decide whether a reservation request should be granted. Furthermore, when a packet arrives at a router it must be classified and put in a packet scheduling queue corresponding to the QoS requested. To be able to perform these functions, each router along the reserved path must maintain state information about every communication session (also known as a microflow). Since the amount of state information increases proportionally to the number of microflows, the Intserv model places a huge storage and processing burden on core routers. This has raised doubts about whether the Intserv model is scalable enough for the global Internet. Moreover, incremental deployment is troublesome, making the migration to an Intserv Internet architecture hard to realize. These concerns have led to the emergence of another Internet QoS model, known as differentiated services.
3.5.2 Differentiated services

The differentiated services model (Diffserv) is designed to avoid per-flow state in core routers. Instead, packet classification and admission control are performed at the edge of the network, where the traffic volumes are typically lower. Upon ingress to a Diffserv network, packets are classified and marked with an identifier using a dedicated field of the IP header termed the DS field. Internal routers of the Diffserv domain then treat the packets based on the content of the DS field, according to a well-defined per-hop behavior (PHB). By defining a number of service classes and their PHBs corresponding to different application requirements, different types of traffic can receive differentiated QoS. Diffserv can thus be seen as essentially a relative priority scheme.

When traffic enters a Diffserv network it is classified, policed and shaped according to a contract between the customer and the service provider called a service level agreement (SLA). Basically, the SLA specifies the service classes that are provided and the amount of traffic the customer is allowed in each class. SLAs can be either static or dynamic. Dynamic service contracts need a signaling protocol to request services on demand. For instance, RSVP can be used for dynamic SLA signaling.
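From the application's point of view, requesting a Diffserv service class can amount to marking outgoing packets with a DSCP value. A minimal sketch, assuming an operating system that exposes the legacy IP_TOS socket option and a network whose SLA honors the marking (DSCP 46, Expedited Forwarding, occupies the upper six bits of the former TOS byte):

    import socket

    EF_DSCP = 46                 # Expedited Forwarding code point

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, EF_DSCP << 2)

    # Datagrams sent on this socket now carry DS field = EF and are scheduled
    # according to the per-hop behavior configured in the Diffserv domain.
    sock.sendto(b"media payload", ("192.0.2.1", 5004))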
3.5.3 Traffic engineering and constraint-based routing

Intserv and Diffserv provide different ways of sharing the bandwidth of a congested network between different applications. Congestion can be caused either by network resource shortage or by uneven load distribution. In the latter case congestion might be avoided by optimizing the routing of traffic. Since the current Internet routing protocols make routing decisions based only on the shortest path to the destination, traffic will typically be aggregated towards the core of a network, even if alternative routes of higher capacity exist. Traffic engineering is the process of distributing the load on a network in order to achieve optimal utilization of the available bandwidth.

An important mechanism for automating traffic engineering is constraint-based routing (CBR). CBR extends shortest path routing algorithms to take resource availability and flow requirements into consideration when computing routes. Thus, a CBR algorithm might select an alternative path to the destination if it provides more bandwidth than the shortest path. This leads to a more effective utilization of network resources. However, constraint-based routing increases the computational complexity of routers, increases routing table size and can potentially result in routing instability.
3.6 Media encoding

Digital media signals, in particular video, need to be compressed when transported over a network, to make efficient use of the bandwidth. For this purpose a
plethora of compression algorithms have been designed, targeted at different applications and requirements. Compression algorithms can be characterized as lossless or lossy. A lossless compression algorithm allows perfect reconstruction of the original digital signal, whereas a lossy algorithm introduces controlled loss of information so that a sufficiently accurate approximation of the original signal can be reconstructed. Lossless compression algorithms are typically used for data compression where perfect reconstruction is critical. For video compression, lossless algorithms typically result in moderate compression efficiency, but are nevertheless used for certain applications where loss of information is unacceptable (e.g. medical imaging). Most video compression algorithms are lossy, exploiting the properties of the human visual system to discard information that is of insignificant perceptual importance. As with lossless algorithms, redundancy in the original signal is also exploited to represent the information more efficiently. In essence, video compression algorithms are based on the following techniques:

• colorspace conversion and component subsampling,
• inter-frame coding,
• transform coding,
• quantization,
• entropy coding.
3.6.1 Colorspace conversion and subsampling

The first step of essentially all video compression algorithms is to convert the images from the RGB colorspace into a luminance/chrominance representation (YCrCb). By exploiting the fact that the human visual system is less sensitive to variations in chrominance, the chrominance components are subsampled (i.e. represented with fewer samples) to reduce the data rate. Typically, the chrominance components are represented with one sample for every four luminance samples (resulting in so-called 4:1:1 component subsampling).
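The conversion and subsampling step can be summarized by the following sketch, using the ITU-R BT.601 luminance/chrominance weights; keeping one chrominance sample per four luminance samples is implemented here by averaging 2 x 2 blocks of the chrominance planes:

    import numpy as np

    def rgb_to_ycbcr(rgb):
        """Convert an 8-bit RGB image (H x W x 3) to Y, Cb, Cr planes (ITU-R BT.601)."""
        r = rgb[..., 0].astype(float)
        g = rgb[..., 1].astype(float)
        b = rgb[..., 2].astype(float)
        y  = 0.299 * r + 0.587 * g + 0.114 * b
        cb = 0.564 * (b - y) + 128.0
        cr = 0.713 * (r - y) + 128.0
        return y, cb, cr

    def subsample_chroma(plane):
        """Keep one chrominance sample per 2 x 2 block of luminance samples."""
        h = plane.shape[0] // 2 * 2
        w = plane.shape[1] // 2 * 2
        p = plane[:h, :w]
        return (p[0::2, 0::2] + p[0::2, 1::2] + p[1::2, 0::2] + p[1::2, 1::2]) / 4.0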
3.6.2 Inter-frame coding

Inter-frame coding exploits temporal correlations in a video signal to reduce redundancy. Coding a frame differentially from a previous frame as an error signal improves subsequent run-length and entropy coding techniques. This predictive coding (P-coding) is usually performed on smaller subblocks of the image, typically 16 by 16 pixels.

To improve the efficiency of predictive coding a technique called motion compensation is often utilized. In a motion compensation scheme a block is coded predictively from a spatially translated block in a previous image. The differentially coded block together with a displacement vector, called a motion vector, are used by the decoder to recreate the block.

Optionally, a scheme called conditional replenishment can be utilized together with block-based predictive coding. The idea is that only blocks whose error signal, when coded differentially from a previous frame, is larger than some threshold
value will be transmitted. This implies that only the spatial regions of a video scene that change temporally will be transmitted, resulting in efficient bandwidth utilization for video sequences with fairly static content. Temporal prediction can be performed either from previous frames or from subsequent frames, provided that the temporally posterior frames have been sampled in advance. Temporal prediction in both directions is known as bidirectional prediction (B-coding). B-coding improves compression efficiency, but is of limited applicability for interactive applications with hard delay requirements. Predictive coding introduces inter-frame dependencies that make the video coding sensitive to packet loss. This is of great concern for Internet video applications, since they are typically based on unreliable transport protocols. To reduce the adverse implications of packet loss for video decoding, intra-coded frames are interleaved at regular intervals, providing re-synchronization points for the decoder.
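As an illustration of the conditional replenishment idea described above, the sketch below compares each 16-by-16 block of the current frame with the co-located block of the previous frame and selects only the blocks whose prediction error exceeds a threshold. The threshold value and the use of the mean absolute difference as the error measure are arbitrary choices made for this example.

import numpy as np

BLOCK = 16          # block size in pixels
THRESHOLD = 4.0     # arbitrary error threshold for this illustration

def blocks_to_send(current, previous, block=BLOCK, threshold=THRESHOLD):
    """Return the (row, col) indices of blocks whose difference from the
    previous frame is large enough to warrant transmission."""
    selected = []
    h, w = current.shape
    for by in range(0, h - block + 1, block):
        for bx in range(0, w - block + 1, block):
            cur = current[by:by + block, bx:bx + block].astype(np.float64)
            prev = previous[by:by + block, bx:bx + block].astype(np.float64)
            error = np.abs(cur - prev).mean()   # mean absolute prediction error
            if error > threshold:
                selected.append((by, bx))
    return selected

# Example with two luminance frames where only one region changes
prev_frame = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
cur_frame = prev_frame.copy()
cur_frame[32:48, 64:80] = 255
print(blocks_to_send(cur_frame, prev_frame))   # -> [(32, 64)]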
3.6.3 Transform coding In transform coding an image is transformed from the spatial domain to the frequency domain and represented as a linear combination of some set of basis functions. Some of the most commonly used basis functions are the trigonometric functions, used by the Fourier transform and the cosine transform. The reason for transforming an image to the frequency domain is to obtain a more compact representation of the data. Since the human visual system is more sensitive to low-frequency content in an image, high-frequency information can be excluded or represented with less precision. The discrete cosine transform (DCT) is the most widely used transform for image and video compression. For instance, the JPEG, MPEG and H.261 compression algorithms are based on the DCT. Since the cosine function has infinite support and since the spatial correlation of image pixels is localized, the transform is applied to small blocks of the image (typically 8-by-8 pixels). Another, more recently discovered transform popular in image coding is the discrete wavelet transform (DWT). The wavelet transform is based on basis functions obtained by translation and dilation of a single wavelet mother function. The wavelet basis functions are localized in space and can consequently be applied to the whole image, contrary to the block-based approach of the DCT. This is beneficial at high compression ratios where block-based algorithms typically result in quantization defects known as blocking artifacts. Moreover, the DWT provides a native multiresolution representation that can be progressively decoded. This is highly beneficial when designing scalable encodings. Transform coding is primarily used for intra-coding of video images. However, three-dimensional transform coding algorithms for video have been proposed that extend the two-dimensional spatial transform to the temporal dimension. Indeed, video compression algorithms based on the 3D DWT have been shown to obtain very high compression ratios, but the computational complexity is prohibitively high.
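To illustrate block-based transform coding, the following sketch implements the two-dimensional DCT-II on an 8-by-8 block directly from its definition. Production codecs use fast factorizations of the transform, but the result is the same; the level shift by 128 is borrowed from JPEG practice and is not essential to the example.

import numpy as np

N = 8  # block size used by JPEG, MPEG and H.261

def dct_matrix(n=N):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix()

def dct2(block):
    """Forward 2-D DCT of an 8x8 block (separable: rows, then columns)."""
    return C @ block @ C.T

def idct2(coeffs):
    """Inverse 2-D DCT."""
    return C.T @ coeffs @ C

block = np.random.rand(N, N) * 255.0
coeffs = dct2(block - 128.0)                       # level shift as in JPEG
print(np.allclose(idct2(coeffs) + 128.0, block))   # True: the transform itself is lossless;
                                                   # information is only discarded by quantization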
3.6.4 Quantization Quantization is a lossy procedure wherein the precision of data samples is limited to a set of discrete values. The quantization function maps several of its input values to a single output value in an irreversible process. The quantization can be either uniform or non-uniform. Uniform quantization limits the precision of samples uniformly over the input range. This can easily be implemented by dividing each input sample value by a quantization factor and then rounding off the result. In non-uniform quantization the input samples are represented with different precision. Non-uniform quantization is typically implemented with a look-up table known as a quantization table. By reducing the precision of sample values, quantization limits the number of different symbols that need to be encoded in the entropy coding step following the quantization.
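A minimal sketch of uniform quantization as described above: each coefficient is divided by a quantization factor and rounded, and dequantization multiplies the index back. The step size of 16 is an arbitrary choice for the example.

import numpy as np

def quantize(coefficients, step=16):
    """Uniform quantization: divide by the quantization factor and round."""
    return np.round(coefficients / step).astype(np.int32)

def dequantize(indices, step=16):
    """Approximate reconstruction; the rounding error is irreversible."""
    return indices * step

coeffs = np.array([3.0, -21.0, 130.0, 47.0, -5.0])
q = quantize(coeffs)
print(q)                # [ 0 -1  8  3  0]
print(dequantize(q))    # [  0 -16 128  48   0]  -- close to, but not equal to, the input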
3.6.5 Entropy coding Entropy coding is the process of assigning the shortest codewords to the most frequent symbols based on the probability distribution of the input data. Examples of entropy coding schemes are Huffman coding and arithmetic coding. Entropy coding is most often preceded by a run-length coding that encodes a consecutive series of the same symbol value as a run-length count and a symbol codeword.
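The following sketch shows the run-length step that typically precedes entropy coding: consecutive repetitions of a symbol, for instance the long runs of zeros produced by quantization, are replaced by (run length, symbol) pairs. A Huffman or arithmetic coder would then assign short codewords to the most frequent pairs.

def run_length_encode(symbols):
    """Encode a sequence as (run length, symbol) pairs."""
    runs = []
    for s in symbols:
        if runs and runs[-1][1] == s:
            runs[-1] = (runs[-1][0] + 1, s)
        else:
            runs.append((1, s))
    return runs

def run_length_decode(runs):
    return [s for count, s in runs for _ in range(count)]

data = [5, 0, 0, 0, 0, 0, 3, 0, 0, 7]
runs = run_length_encode(data)
print(runs)                               # [(1, 5), (5, 0), (1, 3), (2, 0), (1, 7)]
print(run_length_decode(runs) == data)    # True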
3.6.6 Video compression standards Standardization of video compression algorithms has been performed primarily by the Moving Picture Experts Group (MPEG) of the International Organization for Standardization (ISO) and by the Telecommunication Standardization Sector of the International Telecommunication Union (ITU-T). MPEG has developed a number of video compression standards targeted at different multimedia applications, while the ITU-T has mainly developed standards for teleconferencing applications. MPEG-1 (ISO standard 11172) defines a video compression algorithm based on the DCT and motion compensation, targeted at multimedia applications with data rates up to about 1.5 Mbps. MPEG-2 (ISO standard 13818) extends MPEG-1 with support for greater input format flexibility, higher data rates and better error resilience. The basic principles of MPEG-2 are the same as MPEG-1 (DCT and motion compensation) and MPEG-2 is backwards compatible with MPEG-1. MPEG-2 is also part of the ITU nomenclature as ITU-T Recommendation H.262. MPEG-4 (ISO standard 14496) takes an object-oriented approach to video coding. Visual scenes can be represented as a collection of objects, each with a specific encoding and compression format. Visual objects can be either synthetic or natural. Natural video objects are compressed using the DCT and motion compensation in basically the same manner as in MPEG-2.
ITU-T recommendations H.261 and H.263 are video compression standards targeted at teleconferencing applications at data rates up to 2 Mbps. Both are based on the DCT and motion compensation. An excellent introduction to image and video compression standards is provided by Bhaskaran and Konstantinides [17].
3.6.7 Performance metrics To compare the performance of different compression algorithms a standard metric is needed that measures the quality of a reconstructed image at a given compression rate. The most commonly used quality metric in image coding is the peak signal-to-noise ratio (PSNR), defined by

\[ \mathrm{PSNR} = 20 \log_{10} \frac{\max_{i=1}^{N} x_i}{\sigma}, \]

where $x_i$, $i = 1, \ldots, N$, are the reconstructed pixel values of an original image with pixel values $y_i$, $i = 1, \ldots, N$, and $\sigma$ is the root mean square error, defined by

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}. \]
The PSNR, which is dimensionless and expressed in decibels, is usually calculated for the luminance component only, since the luminance component is perceptually the most important. When used as a quality measure for video coding, the mean PSNR is calculated over a number of video frames, and this distortion measure is complemented by the frame rate. It is important to emphasize that PSNR is a statistical measure of the distortion introduced by a compression scheme, which does not take characteristics of the human visual system into account. Hence, it is not really a good estimator of perceptual visual quality, but since it is an objective measure that is easy to compute and understand it is nevertheless the most commonly used quality metric in image and video coding.
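The PSNR definition above translates directly into code. The sketch below uses the maximum reconstructed pixel value, as in the formula, and would typically be applied to the luminance plane only; the noise level in the example is arbitrary.

import numpy as np

def psnr(reconstructed, original):
    """Peak signal-to-noise ratio in dB, following the definition in the text."""
    x = reconstructed.astype(np.float64)
    y = original.astype(np.float64)
    rmse = np.sqrt(np.mean((x - y) ** 2))       # root mean square error (sigma)
    if rmse == 0.0:
        return float("inf")                     # identical images
    return 20.0 * np.log10(x.max() / rmse)

original = np.random.randint(0, 256, (64, 64)).astype(np.float64)
noisy = np.clip(original + np.random.normal(0, 5, original.shape), 0, 255)
print(round(psnr(noisy, original), 1))          # roughly 34 dB for noise with sigma = 5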
4 Scalability, adaptivity and robustness A salient feature of virtually all Internet protocols and standards is a strong focus on scalability. This is not surprising since the success of the Internet is dependent on its ability to support a large number of simultaneous users. Therefore, when designing a video communication system based on Internet technology, a fundamental concern must be the effects of scaling the system to many simultaneous users and large network topologies. This focus is further stressed by the fact that video-mediated communication is a very demanding application in terms of bandwidth and processing requirements. The best-effort model of the current Internet, where all state information pertaining to an end-to-end communication session is kept at the endpoints, imposes a requirement on the applications, or the transport protocols used by the applications, to be adaptive to changing conditions [18]. Moreover, heterogeneity in terms of connection capacity and end-system capabilities calls for adaptive applications and protocols. Specifically, for Internet video applications, dynamics in bandwidth availability must be responded to by adapting the target bandwidth of the video codec, based on feedback from the network [19, 20]. The process is illustrated in Figure 3. When increasing the number of simultaneous users on a network of limited capacity, the total amount of available resources must be shared between the users somehow. In the current best-effort Internet this resource sharing is achieved through adaptive transport protocols. Thus, scalability and adaptivity issues are strongly interrelated.
[Figure 3: Network-adaptive video transmission. The video codec adjusts the rate of the encoded video signal based on feedback from the network.]

In the context of video coding, the term scalability refers to the ability of a coding framework to generate a bitstream that is flexible in terms of bandwidth requirements and decoding complexity. Specifically, coding schemes that permit decoding at different quality levels from the same encoded bitstream are known as scalable codings. Scalable video coding is a key component of adaptive video communication systems, since it provides the necessary tools to trade off video
quality for bandwidth (or processing power) in order to achieve a graceful degradation of performance in response to dynamics in resource availability. Robustness refers to the property of being resilient to perturbations in the service delivered by the network to an application. In a best-effort packet network this is mainly a question of dealing with packet loss, without disrupting the operation of the application. Unlike traditional data transfer applications, video communication is not dependent on a perfectly reliable transport service, i.e. one in which all packets transmitted are eventually received at the other end. Some limited amount of packet loss can generally be tolerated as long as this doesn't affect the perceptual visual quality too much. However, due to the reliance on variable length coding and temporal prediction techniques in state-of-the-art video codecs, even moderate packet loss rates can desynchronize the decoder and introduce propagation errors, with severe effects on the quality of the reconstructed video signal. To remedy this, various techniques for error correction and error concealment have been proposed. The major research issues in the design of future video communication systems and services over the Internet are related to scalability, adaptivity and robustness. In this chapter an overview of research and key technologies within the field will be given, and the contributions of the research on the technical foundations of video communication presented in this thesis (papers A to D) will be put in context. The organization of the chapter is as follows: Section 4.1 discusses congestion control mechanisms, section 4.2 treats scalable video coding, section 4.3 discusses problems related to heterogeneity, and section 4.4 is devoted to error-resilient video transmission.
4.1 Flow control and congestion avoidance Flow control is the process of deciding the optimal transmission rate for a communication session. Congestion control is a form of flow control in which the objective of the transmission rate adjustments is to avoid or minimize congestion. Although flow control is a slightly more general term, the two are often used synonymously, as is the case in this thesis. Flow control is typically provided by transport protocols (e.g. TCP's congestion avoidance). For real-time multimedia applications, however, the transport protocol functionality, including flow control, is usually implemented at the application level. The end-to-end perspective on congestion control, which is a salient feature of the Internet service model, implies that the communicating endpoints must regulate the transmission rate based on feedback about the state of the network. The feedback can be either explicit or implicit. Explicit congestion notification (ECN) mechanisms rely on active queue management in Internet routers to detect congestion and signal this congestion indication to the end nodes. Since a router can detect incipient congestion before its queue overflows, for instance through random early detection (RED) [21], ECN can enable applications to respond to congestion before packet loss occurs. This is beneficial for delay- and loss-sensitive applications such as interactive video communication. The explicit congestion
signal can be directed either to the sender of the packet stream (backward ECN), by sending a congestion indication message, or to the receiver (forward ECN) by setting a congestion experienced bit in the packet header. For sender-based congestion control algorithms, backward ECN makes it possible for the sender to respond more quickly to impending congestion, since the congestion signal doesn't have to be echoed by the receiver. On the other hand, forward ECN mechanisms typically propagate the congestion signals by setting a designated bit in packet headers and thus cause no overhead traffic that consumes bandwidth. An example of backward ECN is the ICMP Source Quench mechanism. In the current Internet, the Source Quench mechanism is rarely used, in part because it consumes bandwidth when congestion occurs. Recently, a framework for forward ECN in the Internet has been suggested that uses bits 6 and 7 in the IPv4 TOS octet or the IPv6 DS field [22]. One bit is used for congestion experienced signals set by routers during congestion and the other bit is used to indicate whether the transport protocol in use by the end-node application supports ECN. This ECN mechanism is closely related to the Diffserv architecture for QoS discussed in section 3.5.2. Implicit congestion detection requires nothing from the network. Instead, congestion is detected by the end nodes from the packet loss caused by buffer overflows. Another implicit congestion indication is an increased end-to-end transmission delay, due to the increased buffering in the congested routers' packet queues before packets are dropped. Similar to ECN, variations in end-to-end transmission delay can be used to detect congestion before packet loss occurs. Congestion control algorithms can be classified as sender-oriented or receiver-oriented. In a sender-oriented algorithm the sender adjusts the transmission rate based on feedback about the state of the network. In a receiver-oriented algorithm the receiver adjusts the rate that is being propagated from the source, through some receiver-driven bandwidth allocation mechanism, e.g. layered multicast. Receiver-oriented algorithms are typically used for multicast congestion control, whereas sender-oriented algorithms are most common for unicast applications. Cooperative schemes involving both sender and receiver are also possible. Congestion control algorithms can also be categorized as reactive or proactive. A reactive algorithm detects congestion from the packet loss it causes and responds by adjusting the transmission rate. Proactive algorithms detect impending congestion and try to respond before packet loss is experienced. Thus, proactive mechanisms are beneficial in that the overall packet loss on a network can be reduced, but on the other hand, reactive algorithms typically exhibit higher bandwidth utilization. Moreover, when reactive and proactive congestion control algorithms coexist on the same network, reactive algorithms are usually favored since they are more aggressive in terms of bandwidth allocation. For these reasons, congestion control on the Internet has hitherto been dominated by reactive algorithms. Yet another way to classify congestion control algorithms is as either feedback-based or feedback-free. Feedback-based algorithms rely on feedback of status information from the receiver (or the group of receivers in the multicast situation). The rate is adjusted by the sender based on the status reports. Feedback-free schemes
are typically used in multicast congestion control, where the receivers are subject to heterogeneous bandwidth constraints. In a feedback-free multicast congestion control algorithm, each receiver individually controls the amount of data being received, without involving the sender, typically using layered multicast. Feedback-free schemes are more scalable to large multicast groups, since they circumvent the potential implosion of feedback information that feedback-based schemes must deal with. A comparison of feedback-based and feedback-free multicast congestion control algorithms is provided by Gorinsky and Vin [23]. When designing a flow control algorithm for video-mediated communication over the Internet, many different aspects need to be considered. Some of the most important desiderata of Internet flow control algorithms are: • responsiveness to dynamics in bandwidth availability, • high utilization of network bandwidth, • intra-protocol fairness, assuring fair bandwidth allocation among sessions using the same flow control algorithm, • inter-protocol fairness, assuring fair bandwidth allocation among sessions using different flow control algorithms, • fast convergence to an optimal operating point, • lightweight implementation characteristics, and • feasibility of incremental deployment.
4.1.1 Unicast congestion control Congestion control for unicast applications in the Internet is heavily dominated by TCP. The congestion avoidance algorithm of the TCP protocol [24] is a window-based congestion control algorithm, which means that the transmission rate of a communication session is controlled by a congestion window, the size of which is halved for every window of data subjected to packet loss, and increased by roughly one packet per window otherwise. This approach to congestion control is sometimes called an additive increase, multiplicative decrease (AIMD) mechanism, since in the steady state, the transmission rate is additively increased, whereas the rate is halved at the occurrence of packet loss. In TCP an acknowledgement scheme is employed to detect packet loss and to allow retransmission of lost data. Since interactive video applications generally cannot tolerate the delay imposed by retransmissions, TCP is usually considered a poor choice of transport protocol for these applications. Nevertheless, some video applications with relaxed delay requirements, typically streaming applications that can buffer a substantial amount of data at the receiver, have been implemented using TCP. Another approach is to use an algorithm that mimics TCP's AIMD behavior, but avoids retransmissions [25, 26, 27]. However, it is not only the delay implications of TCP's congestion control that are problematic for real-time multimedia applications. The abrupt halving of the transmission rate in response to congestion is difficult to reconcile with the need for reasonably smooth variations in bandwidth required for an acceptable user-perceived video quality.
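The additive increase, multiplicative decrease behavior described above can be summarized in a few lines. The sketch below operates on a window counted in packets, and the increase of one packet per window and the halving on loss follow the description in the text rather than any particular TCP implementation; the minimum window of one packet is an assumption made for the example.

def aimd_update(cwnd, loss_in_last_window, min_window=1.0):
    """One AIMD step: halve the congestion window on loss, otherwise
    increase it by roughly one packet per window."""
    if loss_in_last_window:
        return max(cwnd / 2.0, min_window)   # multiplicative decrease
    return cwnd + 1.0                        # additive increase

# Example trajectory: growth, a loss event, then renewed growth
cwnd = 10.0
for loss in [False, False, False, True, False, False]:
    cwnd = aimd_update(cwnd, loss)
    print(cwnd)    # 11.0, 12.0, 13.0, 6.5, 7.5, 8.5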
Congestion control algorithms that adjust the sending rate directly, instead of using a congestion window, are known as rate-based algorithms. In order for rate-based applications to compete fairly for bandwidth with the large installed base of TCP applications and thus be "well-behaved," the rate-based congestion control algorithms need to exhibit approximately the same long-term performance as TCP. This behavior is known as TCP fairness or TCP friendliness. A TCP-compatible flow is defined as a flow that, in steady state, uses no more bandwidth than a conformant TCP implementation running under comparable conditions [28]. In the current best-effort Internet, TCP fairness is important for stability, since applications using congestion control algorithms significantly more aggressive than TCP can cause starvation of competing TCP sessions, if FIFO queuing is being used [18]. In equation-based congestion control algorithms, the sending rate is directly adjusted as a function of the measured loss rate. The performance of the congestion control is hence dependent on the choice of control equation. In the TCP-friendly rate control protocol (TFRC), proposed by Floyd et al. [29], the TCP response function, describing the steady-state transmission rate of TCP, is used as an upper bound on the sending rate of the algorithm. The TCP response function exists in many incarnations derived independently by many researchers. The formulation used in TFRC, derived by Padhye et al. [30], is

\[ T = \frac{s}{R\sqrt{\dfrac{2p}{3}} + t_{RTO}\left(3\sqrt{\dfrac{3p}{8}}\right)p\left(1 + 32p^2\right)}, \qquad (1) \]
where T is the sending rate in bytes per second, s is the packet size in bytes, R is the round-trip time in seconds, p is the steady-state loss event rate and t_RTO is the TCP retransmit timeout value. Note that the loss event rate is not the same as the loss rate. A loss event is defined as one or more packets lost in a time interval determined by the round-trip time. The dependence on the loss event rate, rather than the loss rate, models TCP's response to lost packets more closely. The TFRC protocol was designed to achieve approximately the same long-term performance as TCP (and hence be TCP-friendly) while maintaining a more slowly changing sending rate. Simulations show that the algorithm can successfully achieve this in many situations, but further studies are required in order to confirm that TFRC is a feasible solution for TCP-friendly unicast rate control for continuous media streaming applications.
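Equation (1) translates directly into code. The sketch below evaluates the TCP response function for a given packet size, round-trip time and loss event rate; the approximation of the retransmission timeout as four times the round-trip time is a common simplification introduced for this example, not something mandated by the equation.

from math import sqrt

def tcp_friendly_rate(s, rtt, p, t_rto=None):
    """Steady-state TCP-compatible sending rate in bytes per second, equation (1).

    s     -- packet size in bytes
    rtt   -- round-trip time in seconds
    p     -- steady-state loss event rate (0 < p <= 1)
    t_rto -- retransmit timeout; approximated as 4 * rtt if not given
    """
    if t_rto is None:
        t_rto = 4.0 * rtt                      # common simplification
    denominator = (rtt * sqrt(2.0 * p / 3.0)
                   + t_rto * 3.0 * sqrt(3.0 * p / 8.0) * p * (1.0 + 32.0 * p ** 2))
    return s / denominator

# 1000-byte packets, 100 ms round-trip time, 1 % loss event rate
print(int(tcp_friendly_rate(1000, 0.1, 0.01)))   # on the order of a hundred kilobytes per second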
4.1.2 Multicast congestion control In multicast applications the sender is faced with the dilemma of having to adapt the sending rate based on the heterogeneous requirements of a set of receivers with different levels of bandwidth available along their multicast path. This predicament can be addressed in a number of ways. One approach is to set the sending rate to the highest rate that all participants can receive. This leads to an
underutilization of resources, since the least capable receiver dictates the rate for all the others. Another strategy is to adapt the rate to the average bandwidth computed over the set of all connected receivers. This will lead to constant congestion and high loss rates for the receivers less capable than the average, possibly preventing them from taking part in the session. The TCP-friendly multicast congestion control protocol (TFMCC) is a multicast version of the unicast TFRC algorithm discussed above [31]. TFMCC is a feedback-based single rate congestion control algorithm that adapts the sending rate to the capacity of the most constrained receiver. In order to avoid feedback implosion problems, only receivers with a calculated rate less than the current sending rate are allowed to send feedback. Each receiver individually calculates its preferred rate from RTT estimates and the packet loss event rate using equation (1). One of the major difficulties with TFMCC is how to achieve a reliable measurement of the RTT between the sender and each receiver without feedback implosion. As discussed in section 3.4.4, a layered multicast transmission architecture can satisfy heterogeneous bandwidth requirements by enabling the receivers to individually decide on a suitable number of multicast groups to join, depending on the available bandwidth and local considerations, such as terminal equipment capabilities. This makes it possible to design receiver-based, feedback-free congestion control algorithms that require nothing from the sender, except for the layered encoding and transmission. Joint sender/receiver schemes, wherein feedback from the receivers influences the organization of data into layers, have also been suggested [32]. In their seminal work on layered multicast, McCanne, Vetterli and Jacobson proposed a multicast congestion control algorithm dubbed receiver-driven layered multicast (RLM) [33]. In RLM the receivers in a multicast session periodically join an additional multicast group to investigate whether the increased bandwidth causes packet loss. If so, the group is left and the join experiment is considered to have failed. If no packet loss is experienced, the join experiment was successful, and the receiver remains a member of the new group. The real innovation of the RLM algorithm lies in its ability to co-ordinate the join experiments among the receivers of a session through a technique called shared learning. The idea is that before a receiver performs a join experiment, a message is multicast to all other receivers announcing the intention to perform the experiment. All receivers can then jointly conduct the experiment, instead of keeping the network constantly congested from uncorrelated join experiments. The shared learning algorithm makes layered multicast scale to large sessions, but inevitably also increases the convergence time of the bandwidth allocation process. The RLM algorithm has been found to exhibit questionable intra-protocol fairness as well as poor inter-protocol fairness. Specifically, it is not TCP-friendly [34]. Vicisano et al. have suggested an alternative algorithm called the TCP-like multicast congestion control algorithm [35]. Instead of the shared learning mechanism of RLM, join experiments are coordinated by the sender through synchronization points, i.e. marked packets in the media stream indicating when join experiments are allowed. TCP friendliness is achieved by requiring an
exponential distribution of the layer rates. Thus, dropping a layer mimics TCP's multiplicative decrease of the rate in response to congestion, but unlike TCP it also causes a multiplicative increase of the rate when adding a layer. However, by spacing the synchronization points of each layer in proportion to the cumulative bandwidth requirement for the layer, an approximately additive increase behavior can be achieved over a larger time scale. Furthermore, to mitigate the serious implications of failed join experiments, Vicisano et al. suggest that the sender should inject a brief burst of packets prior to the synchronization points, on the layer the synchronization point belongs to. The burst is intended to simulate the rate increase of joining an additional group, so that the receivers that experience packet loss during the burst can refrain from a join experiment at the synchronization point. Unfortunately, successful reception of the burst packets does not guarantee that joining an additional group will not cause packet loss, since the burst may be of insufficient length. Both RLM and the TCP-like congestion control algorithm suffer from poor convergence times, primarily due to the long IGMP latency, i.e. the time from when a group is left until the packet propagation in multicast routers is pruned back to the source. In response, Byers et al. propose a clever scheme called dynamic layering [36], wherein the sender decreases the rate of the layers over time. To sustain a constant reception rate the receivers must periodically add layers. Conversely, the receivers can reduce their reception rate quickly, simply by not adding any additional layers. With dynamic layering the asymmetry between IGMP join and leave latencies can be avoided, making the leave operation as efficient as the join operation. The main drawback is that the total number of multicast groups required for the layering is increased. Since the scalability of multicast routing protocols is limited by the need to maintain state information about group memberships, this is potentially a serious impediment. A congestion control algorithm based on dynamic layering and a simplified version of the TCP-like multicast congestion control algorithm was suggested by Byers et al. [36]. The algorithm, called fair layered increase/decrease with dynamic layering (FLID-DL), relies on synchronization points for co-ordination of join experiments and benefits from the reduced leave latencies due to dynamic layering. TCP-friendliness is achieved by scheduling the synchronization points in a way that makes the throughput for each receiver approximate the TCP throughput equation. The multicast congestion control algorithms discussed so far all rely on packet loss to detect congestion. In contrast, the algorithm presented in Paper A in this thesis uses variations in the end-to-end transmission delay as an indication of the state of the network. A growing delay, due to increased buffering in router queues, indicates incipient congestion and is responded to by leaving multicast groups. Similarly, a sufficiently small queuing delay, suggesting that enough bandwidth might be available for adding an additional layer, triggers a join experiment. By carefully assigning the threshold delay levels for when to join or leave groups, based on the current subscription level, the algorithm is demonstrated to achieve intra-protocol fairness, both within a single session and among competing sessions using the same layering and congestion control.
Since the bandwidth allocation is
independent of the magnitude of the total transmission delay (only the queuing delay is measured, not the propagation delay), the algorithm is not TCP-friendly, since TCP's bandwidth allocation depends on the round-trip time (cf. equation (1)). Thus, the delay-based multicast congestion control algorithm is increasingly unfair to TCP for hosts that are connected by links with large propagation delays (i.e. hosts at large geographical distances from the sender). The primary motivation for delay-based congestion control is that it is a proactive technique capable of detecting and responding to congestion before packet loss occurs. Hence, the overall loss rate should be reduced with a delay-based algorithm. As is demonstrated through simulations in Paper A, the delay-based multicast congestion control algorithm exhibits lower loss rates than both RLM and the TCP-like congestion control. Another advantage of a delay-based algorithm is that it is not sensitive to packet loss caused by noisy communication channels. Since it is generally impossible to distinguish packet loss caused by congestion from packet loss caused by bit errors, loss-based algorithms will perform poorly in lossy environments, such as wireless networks. The main disadvantage of the delay-based multicast congestion control algorithm is that it is sensitive to noise in the delay measurements. The algorithm also suffers from the same convergence time problems as RLM and the TCP-like algorithm. This can be improved by adopting the dynamic layering approach. Another proactive multicast congestion control algorithm, called ThinStreams, was proposed by Wu, Sharma and Smith [37]. In ThinStreams each layer of the media stream is required to have a constant bitrate (CBR) that is known to the receivers. Furthermore, the bitrate is required to be sufficiently small so that the congested network can buffer the excessive data transmitted during a failed join experiment. In the spirit of TCP Vegas, ThinStreams uses the difference between the expected throughput and the actual throughput as a means to detect congestion. The actual throughput is calculated as the number of bytes received in an arbitrary measurement interval, whereas the expected throughput is calculated as the number of groups joined multiplied by the well-known constant bitrate. The problem with this approach is that the layers of hierarchical video codecs rarely produce CBR-traffic. Rather, they tend to produce bursty VBR-type traffic. In response to this, Wu et al. suggest subdividing the layers from the codec into "thin" layers, thereby decoupling the output of the layered codec from the unit of flow control. But this leads to transmission of partial video layers that the receiver cannot use in the decoding process and hence implies poor bandwidth utilization. Also, the requirement that the layers be CBR poses severe implementation problems. Compared to the delay-based congestion control algorithm of Paper A, the ThinStreams approach imposes more restrictions on the media coding, exhibits less tractable implementation characteristics, and also suffers from the same convergence time problems as the previously discussed algorithms. Legout and Biersack propose a layered multicast congestion control algorithm called PLM, or packet pair receiver-driven cumulative layered multicast [38]. The cornerstone of the algorithm is a bandwidth inference mechanism called a packet pair, first suggested by Keshav [39]. The idea is to transmit all packets in pairs,
back-to-back, and to let the receivers infer the available bandwidth from the spacing of the packet pair upon reception, and the packet size. For the packet pair mechanism to succeed, every router in the network must be a packet-generalized processor sharing scheduler (PGPS) with longest queue drop buffer management [40]. Such a network is called a Fair Scheduler network. The problem is that the Internet is not a Fair Scheduler network, and, although Legout and Biersack claim that a "per ISP" deployment is feasible, such a transition seems highly unlikely, if not unattainable. All layered multicast frameworks discussed so far assume a cumulative organization of layers. Indeed, one of the main tenets of layered multicast is to achieve a more efficient distribution of data by avoiding redundant data flows. Nevertheless, Byers, Luby and Mitzenmacher explore the costs and benefits of using non-cumulative layering [41]. The layering scheme they propose uses redundant layers with bandwidths derived from a Fibonacci sequence, so that the bandwidth of layer i is the sum of the bandwidths of layers i - 1 and i - 2. With this organization of layers, a more fine-grained AIMD-like bandwidth allocation can be achieved, compared to congestion control algorithms based on cumulative layering. Non-cumulative layered multicast can be successfully combined with multiple description video coding. The obvious disadvantage, compared to cumulative layering, is the overhead imposed by multicasting redundant data. In summary, all algorithms for multicast congestion control proposed so far have drawbacks. Indeed, one of several reasons why IP multicast is still not widely deployed is the lack of a good congestion control mechanism backed up by consensus. The bulk of the research on multicast congestion control has been devoted to implicit (loss-based) congestion detection mechanisms and to TCP fairness issues. With the introduction of mechanisms for differentiated services in the Internet, the issue of TCP friendliness becomes less important, making intra-protocol fairness issues the chief concern. From this perspective, future efforts in multicast congestion control should possibly focus more on ECN mechanisms, given the close relationship between ECN and Diffserv.
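To make the receiver-driven approach discussed in this section more tangible, the sketch below captures the join/leave decision logic shared by RLM-like algorithms: periodically attempt to join one more layer, and drop the topmost layer when the experiment causes packet loss. The class is purely illustrative; the actual multicast group management, the timers, the shared-learning coordination of RLM and the delay thresholds of the algorithm in Paper A are deliberately not modelled.

class LayeredMulticastReceiver:
    """Illustrative receiver-driven layered multicast rate control (RLM-style)."""

    def __init__(self, max_layers):
        self.max_layers = max_layers
        self.subscribed = 1              # always receive the base layer
        self.experiment_pending = False

    def start_join_experiment(self):
        """Tentatively subscribe to one additional layer (multicast group)."""
        if self.subscribed < self.max_layers:
            self.subscribed += 1
            self.experiment_pending = True

    def report_interval(self, packet_loss_seen):
        """Called once per measurement interval with the observed loss status."""
        if packet_loss_seen and self.subscribed > 1:
            self.subscribed -= 1              # drop the topmost layer on congestion
            self.experiment_pending = False
        elif self.experiment_pending:
            self.experiment_pending = False   # experiment succeeded; keep the layer

receiver = LayeredMulticastReceiver(max_layers=5)
receiver.start_join_experiment()
receiver.report_interval(packet_loss_seen=False)   # join succeeded: now 2 layers
receiver.start_join_experiment()
receiver.report_interval(packet_loss_seen=True)    # failed experiment: back to 2 layers
print(receiver.subscribed)                          # 2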
4.2 Scalable video coding In order to realize scalable and adaptive multimedia communication systems, the media codings need to be scalable in terms of bandwidth requirements and computational complexity. In the context of video coding the term scalability refers to the ability of a video codec to adjust its output bitstream to different bandwidths. Most video compression algorithms in use today provide some means to trade off between quality and bandwidth, for instance by varying the frame rate, spatial resolution or the amount of quantization applied. For multipoint conferencing with heterogeneous terminal capabilities, layered video encoding can be used in combination with a layered transmission architecture (e.g. layered multicast). Layered media encoding is also beneficial in point-to-point streaming of pre-encoded stored media, since the transmission rate can be adapted to the available bandwidth without re-encoding the media.
The goal of a layered video codec is to produce a layered output bitstream where the layers are non-redundant and cumulatively additive, so that a successively higher quality signal can be reconstructed as an increasing number of layers are used in the decoding process. The layering can be performed in different ways, the most common techniques being temporal, spatial and signal-to-noise ratio (SNR) layering. In temporal layering the individual frames of the video are assigned to different layers, so that the frame rate of the reconstructed video depends on the number of layers received. Spatial layering provides multiple image resolutions to the receivers, whereas SNR-layering successively refines the pixel sample precision depending on the number of layers received. Ideally, it should be possible to apply all three techniques orthogonally, so that the receivers can individually trade off between frame rate, image resolution and compression distortion. A plethora of techniques for layered video coding have been suggested, some of which have been adopted by the standardization frameworks of ISO and ITU. The H.263+ standard defines a scalability option supporting a combination of all three layering techniques mentioned above and the MPEG-2 standard has a very similar layering option [42, 43]. However, the layering imposes restrictions on the media encoding that conflict somewhat with the compression techniques used in H.263+ and MPEG-2, i.e. block-based motion-compensation and the discrete cosine transform (DCT). Temporal layering is limited by inter-frame dependencies and spatial layering requires computationally expensive iterative processing. Prohibitively high codec complexity is a major obstacle to implementing a video codec on a wide variety of hardware platforms. In addition to the scalability options of MPEG-2, MPEG-4 also defines a scalability option called fine granular scalability (FGS) [44]. FGS is an SNR scalability technique, where the DCT coefficients are divided into bit planes so that the most significant bits are transmitted before less significant bits. The encoding has a two-layer structure: one base layer and one enhancement layer. The enhancement layer is designed as an embedded coding, so that the output bitrate can match any bandwidth limit. This is beneficial for point-to-point streaming of pre-encoded video material, since a variable channel capacity can be met exactly without re-encoding the media. Due to the embedded encoding property, the enhancement layer can be further subdivided into any number of layers, for instance for layered multicast applications. Another class of scalable video codecs is based on the discrete wavelet transform (DWT), which is applied to the whole video frame instead of to 8-by-8 pixel blocks [45]. DWT-based image coding provides a native multi-resolution encoding, reducing the complexity of the spatial layering. Inter-frame compression can be achieved in different ways, either by transform-domain processing or by pixel-domain processing, with the same implications for temporal layering due to inter-frame dependencies as for DCT-based encoding. Since motion compensation is typically performed on 16-by-16 pixel macroblocks, this highly effective inter-frame compression technique is easier to combine with block-based DCT-coding than with DWT-coding that operates on the whole video frame. For DWT-based coding, full-frame motion compensation can be utilized, but it is much less efficient than block-based motion compensation.
Another approach is to perform motion
estimation on overlapping blocks to reduce the high-frequency discontinuities that otherwise appear along the edges of the blocks [46]. This improves the performance of the wavelet coding of the residual image, albeit at the cost of higher computational complexity. A completely different approach to temporal compression is to extend the 2-D wavelet transform to three dimensions by performing yet another 1-D transform along the temporal axis [47, 48]. It turns out that this approach can indeed be very efficient in terms of compression performance, but the complexity is too high to be practical for real-time applications. Specifically, the decoding complexity is higher than for motion compensation techniques, which are highly asymmetrical with encoding being much more computationally expensive than decoding. 3-D wavelet coding also introduces a substantial buffering requirement, since the input signal to the transform must extend at least as many frames in the temporal dimension as the number of decomposition levels of the wavelet transform dictates. For both DWT and DCT codecs, SNR layering can be achieved through a progressive quantization process, wherein the precision of the image pixels is successively refined. The primary advantages of wavelet-based codecs over DCT-based codecs are a native multiresolution representation, potentially more fine-grained layering in relation to the required complexity, and absence of blocking artifacts.
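As a small illustration of the temporal layering technique described at the beginning of this section, the sketch below assigns frames to layers in a dyadic fashion, so that decoding the first k layers yields a frame rate that doubles with every additional layer. The dyadic assignment and the three-layer configuration are illustrative choices, not taken from any of the codecs cited above.

def temporal_layer(frame_number, num_layers=3):
    """Dyadic temporal layering: layer 0 holds every 2^(L-1)-th frame,
    and each further layer doubles the frame rate."""
    for layer in range(num_layers):
        if frame_number % (2 ** (num_layers - 1 - layer)) == 0:
            return layer
    return num_layers - 1

frames = list(range(8))
print([temporal_layer(f) for f in frames])
# [0, 2, 1, 2, 0, 2, 1, 2]
# layer 0 alone: every fourth frame; layers 0+1: every other frame; all layers: full frame rate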
4.2.1 Embedded zerotree wavelet video coding with temporal prediction Paper B in this thesis presents a highly scalable wavelet-based video compression algorithm, called the embedded zerotree wavelet video coding with temporal prediction (EZWTP). The algorithm is an extension of Shapiro's EZW still image compression algorithm to video [49]. A temporal prediction technique is introduced that exploits the strong correlation between temporally adjacent video frames. The target application of the codec is multipoint video communication over the Internet. Consequently, the main considerations when designing the algorithm have been to keep the processing delays low in order not to inhibit interactivity, to achieve good scalability properties in terms of bandwidth, computational complexity and video quality, and to ensure orthogonality of the scalability mechanisms, giving the decoder flexibility in trading off between different aspects of video quality. These design goals stand in clear contrast to the scalability options of the standardization frameworks, that have largely tried to retrofit scalability mechanisms into algorithms originally designed for single rate video coding with the sole purpose of being efficient from a compression standpoint. Consequently, the scalability features are much more pronounced in the EZWTP algorithm, and the flexibility in coupling the layered output bitstream to a layered transmission architecture is considerably greater. Moreover, the key components of the EZWTP compression algorithm are inherently scalable: The intrinsic multiresolution property of the wavelet transform gives a spatially scalable bitstream, the zerotree quantization procedure naturally provides successive refinement of the fidelity of the video pixels, and the inter-frame compression allows temporal scalability by limiting inter-frame dependencies. This reliance on inherently scalable techniques
makes it possible to implement the encoding in a single pass, without requiring the coder to decode the output video stream for each quality level in order to compute an error signal constituting the refinement signal of the next higher quality level. Thus, a very fine-grained layering can be achieved without causing a prohibitively high computational cost. In contrast, the MPEG-2 and H.263+ algorithms are typically limited to a much smaller number of layers in order for the complexity to be manageable. More recently, however, a seemingly attractive approach to scalability has been proposed in the MPEG-4 FGS framework. The idea is to encode the video in a two-layered structure, where the base layer is a compliant non-scalable MPEG-4 stream and the enhancement layer is an embedded bitplane coding of DCT coefficients. This structure allows a flexible subdivision of the enhancement layer into an arbitrary number of transmission channels. In native FGS, no temporal prediction is employed in the enhancement layer. Thus, if a multi-layered stream with a wide bandwidth span is desired, say from only a few kbps up to several Mbps, temporal redundancy is reduced only in the very narrowband base layer, implying that the substantial temporal redundancy remaining in the enhancement layer will not be exploited, resulting in poor bandwidth utilization. To address this shortcoming, amendment 4 of MPEG-4 [50] introduces temporal scalability in FGS, wherein the enhancement layer inserts P- or B-frames between the base layer frames. In comparison with the EZWTP algorithm, the FGS framework is clearly aimed primarily at streaming applications of pre-encoded video material, whereas EZWTP is targeted at real-time videoconferencing. Moreover, FGS includes no spatial scalability option. In terms of compression efficiency, MPEG-4 FGS can be expected to outperform EZWTP, given the huge amount of resources invested in MPEG development and its strong focus on coding efficiency, although no comparative experiments have been performed. The main drawback of the EZWTP algorithm is the lack of motion compensation, causing the inter-frame compression to perform poorly with high-motion video content. Motion compensation is excluded mainly to reduce codec complexity and to avoid the technical implications of combining a block-based motion compensation algorithm with a full frame wavelet transform. Since the typical video content of the target application is fairly low motion "head and shoulders" scenes, the performance penalty can be expected to be moderate in most situations. Another potentially troublesome property of the algorithm is that the zerotree quantization process and its extension to temporal prediction can be assumed to benefit less from the specialized signal processing instruction sets implemented in modern general-purpose processors, compared to traditional quantization techniques. These instruction sets, collectively labeled single instruction, multiple data (SIMD) instructions [51, 52], can speed up operations like the DCT and the DWT considerably. Moreover, traditional quantization and dequantization, basically implemented by a large number of successive integer divisions and multiplications respectively, can benefit significantly from SIMD instructions. Zerotree quantization, on the other hand, requires more demanding iterative processing, involving many conditional operations, and this is possibly more
troublesome to implement efficiently. Examples of SIMD instruction sets are Sun Microsystems' Visual Instruction Set (VIS) [53] and Intel's Multimedia Extensions (MMX) [54].
4.3 Heterogeneous environments A striking feature of the Internet is its considerable heterogeneity in terms of equipment connected, link capacities and loss probabilities. Since basically any kind of device with a microprocessor and a network interface can be connected to the Internet, the capabilities of the terminal equipment are very different. Moreover, different access network technologies can support widely different connection speeds, and different load patterns in different parts of the network result in different congestion levels, limiting the amount of bandwidth available for an end-to-end connection. This property of the Internet stands in clear contrast to the public switched telephone network (PSTN), which is a much more homogeneous technological framework. Basically, all telephones support the same services and the access technology (the subscriber loop) is essentially the same everywhere. This fundamental distinction between the Internet and the PSTN is reflected in the dissimilar philosophies of the two standardization bodies involved, i.e. the IETF and the ITU-T. Whereas the ITU-T attempts to standardize terminal equipment and communication interfaces in considerable detail, with the principal aim of conformance, the IETF standards generally allow much more latitude. For video communication systems, the difference in perspective is evident when comparing the ITU-T standardization efforts (mainly the H.32x recommendations) with the corresponding IETF framework. When designing systems for video-mediated communication over the Internet, combating the problems associated with heterogeneity must be a fundamental concern. This applies to both terminal equipment heterogeneity, such as processing power and visualization capabilities, and network heterogeneity in terms of bandwidth and loss probability. The importance of this insight is emphasized by the fact that although network capacities and CPU performance will continue to improve, heterogeneity will persist. As previously discussed, layered encoding and transmission is a promising technique for overcoming heterogeneity. This motivates the development of layered video coding techniques and layered transmission architectures. Paper A and Paper B are two efforts in this direction.
4.3.1 Granularity of layered encoding and transmission schemes An important factor when designing a layered encoding and transmission architecture is the granularity of the layering, i.e. how many layers to choose and how bandwidth is assigned to the layers. A layering scheme with many relatively narrowband layers allows a more fine-grained overall bandwidth allocation to be performed by the flow control algorithm, and allows a greater span of terminal
system capabilities. On the other hand, an excessive number of layers will increase processing overhead and typically also reduce the efficiency of the encoding. The optimal layering strategy can therefore be seen as a trade-off between the conflicting goals of high bandwidth allocation granularity and low encoding, processing and transmission overhead. An attempt to analyze the effect of layer granularity on network utilization is presented in Paper C, where two simulation models are established to experimentally find the optimal layering scheme for layered multicast and unicast video streaming applications, respectively. The approach is to compute the proportion of the available bandwidth in a network that can be successfully utilized for video data depending on the layering scheme and the total number of layers. As a simplification, the simulation models only consider the effects of bandwidth allocation granularity and transmission overhead. Although other factors, like router processing overhead and compression performance, are likely to influence the design of a layering scheme, the results obtained from the simulations nevertheless give an indication of the effect of layer granularity on overall system performance.
4.4 Error resilient video transmission In the absence of mechanisms for service guarantees on the Internet, applications must be able to tolerate a limited amount of data loss due to dropped packets. Traditional data communication applications combat this by automatic repeat request (ARQ) techniques, wherein the receiver detects lost packets and requests retransmissions from the sender. For interactive video applications, ARQ-based techniques are generally not satisfactory, since retransmission of lost packets usually takes too long. By the time a retransmitted packet arrives, the playback deadline for the video frame it belongs to is likely to have passed, so the decoder will drop the packet anyway. For this reason a multitude of alternative techniques have been proposed in order to make video communication systems robust to packet loss [55, 56, 57]. Transmission errors can be broadly classified into two categories: random bit errors due to noisy communication channels and erasure errors due to dropped data packets. It is important to note that for Internet-based video communication systems only erasures need to be considered. Bit errors resulting from imperfections of physical communication links will be detected by link-level or transport-level error detection mechanisms, such as cyclic redundancy codes, and if uncorrectable, the packets containing the corrupted bits will be dropped. In today's wired network infrastructure, largely based on fiberoptic links, bit errors are extremely rare. The overwhelmingly dominant cause for packet loss in wired networks is congestion. In wireless networks, however, the bit error rate can be high, but this is still manifested as packet loss to the applications. The techniques proposed to make video communication resilient to packet loss can be classified according to the roles played by the sender and the receiver. In forward error protection techniques, the sender adds redundant information to the
video data that is transmitted to facilitate reconstruction or concealment of lost packets at the receiver. Postprocessing error concealment techniques are implemented at the receiver, exploiting statistical properties of video signals to conceal the effects of lost packets. In co-operative sender/receiver techniques the sender and receiver interact to achieve robustness to packet loss. Error resilience mechanisms can also be classified according to their underlying functional characteristics as error isolation techniques, data recovery techniques or error concealment techniques. Error isolation is aimed at limiting propagation of errors in the video bitstream. This is often achieved through insertion of certain resynchronization markers in the encoded bitstream that enable the decoder to resume decoding after a lost data symbol. Data partitioning mechanisms, wherein elements of the coded bitstream are grouped together and packetized depending on their syntactical functions (i.e. motion vectors are separated from waveform data, etc.), also belong to this class. Data recovery mechanisms try to recreate lost data through error correction coding or retransmission schemes. Finally, error concealment techniques rely on a priori information about video signals to mitigate the effects of lost data. Error resilience mechanisms can be implemented at different levels: in the video coder, at the transport protocol level, or both in combination. Robustness techniques that involve both the source coder and the channel coder are known as joint source/channel coding mechanisms. It is important to remember that in an Internet-based communication architecture there is a strict separation between the traditional channel coding (i.e. the modulation/demodulation, framing, etc.) located in the physical and data link layers of the OSI reference model and the source coding that is performed at the application level. Thus, a traditional joint source/channel coding framework cannot be realized over an Internet connection since the network layer (the Internet Protocol) hides the details of the underlying channel coding. Consequently, when we are talking about joint source/channel coding in the context of IP-based communication systems, the channel coding refers to transport protocol mechanisms. The traditional goal of a video coding algorithm is to reduce the redundancy in the video signal to allow bandwidth-efficient transmission. However, redundancy is needed in order for the receiver of the video to be able to perform loss recovery and error concealment. Thus, robustness and efficiency are contradictory goals of a video communication system, and any such system must be designed to trade off between these properties in some way, depending on the loss probability and bandwidth of the communication channel. According to Shannon's celebrated separation principle, it can be proved that this trade-off can be resolved in an optimal way by designing the source and channel coders independently, optimizing the source coder for coding efficiency and the channel coder for reliability. At first sight this theoretical result might suggest that joint source/channel coding is not suitable for providing error resilience. However, the theory relies on the assumption that the source coder has infinite computational capacity and that source coding delay can be arbitrarily long. In real-world situations, joint source/channel approaches to robust video communication can be very successful.
Transport level robustness techniques utilize either error correcting codes or ARQ to deliver a high quality transport service to applications over a lossy channel. A major advantage of transport level robustness mechanisms is that they are generally media-independent and can therefore be applied with a high degree of generality.
4.4.1 Layered coding with transport prioritization

One of the most ardently advocated and promising schemes for providing error resilience in video communication systems is based on layered source coding in combination with transport-level prioritization of the most important layers. A layered video codec, as described in section 4.2, transforms the video signal into a number of cumulatively additive layers. Each layer is assigned a priority depending on its relative importance for perceived visual quality when recreating the video signal at the receiver. The layers are then transmitted using a transport level mechanism that gives a higher degree of error protection to the layers with the highest priority. In an IP network this can be accomplished using a differentiated services transport mechanism, such as Diffserv, where the priority of a video layer is mapped to a DS codepoint carried in the packet header. When packet loss occurs due to congestion, the least important layers are dropped first, allowing a more graceful degradation of reconstructed video quality compared to the situation where all packets are dropped with the same probability. Another way of achieving unequal error protection of the layers is to apply forward error correcting codes to the layers, using a stronger code for the more important layers [58]. This approach has the advantage of not requiring any QoS mechanism to be implemented in the network and is also more robust to loss not caused by congestion.

As discussed in section 4.3, layered video coding in combination with layered multicast can be utilized to overcome heterogeneity in multipoint video communication situations. This framework can be extended with transport prioritization to increase robustness to packet loss. However, McCanne et al. argue against the use of prioritized transport of layered multicast video, since this reduces the participants' incentives to lower the multicast group subscription level in response to congestion [33]. When packet loss caused by congestion is distributed uniformly over all layers, as will be the case when no prioritization is used, performance is decreased for all participants sharing a bandwidth bottleneck. This motivates each participant to try to keep the layer subscription level precisely at the number of layers that the bandwidth bottleneck can support. With transport prioritization, an over-subscribed receiver will not be penalized with a reduced video quality, since packets will be dropped from the less important layers first. The effect is illustrated in Figure 4, where the reconstructed video quality for receivers sharing a B bps bandwidth bottleneck is plotted against the requested rate for random drop and priority drop policies respectively. Since the performance curve for random drop, unlike the corresponding curve for priority drop, has a unique maximum, there is a single stable convergence point for the receiver-driven flow control algorithm. With the priority drop policy, the performance is not
degraded when the requested rate is higher than the supported rate. Bajaj, Breslau and Shenker further analyzed the relative merits of random-drop and priority-drop policies for layered multicast applications [59].
Figure 4 Priority drop vs. random drop for layered multicast
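To make the incentive argument concrete, the following Python sketch (not from the thesis; the layer rates, the bottleneck capacity and the simple quality measure are illustrative assumptions) estimates reconstructed quality under the two drop policies, with quality approximated as the number of consecutively decodable cumulative layers. Running it reproduces the qualitative shape of Figure 4: with random drop, quality peaks when the requested rate matches the bottleneck and falls off under over-subscription, whereas with priority drop quality never decreases.

```python
import random

def decodable_layers(received):
    # Layers are cumulative: layer i is only useful if layers 0..i-1 also arrived.
    count = 0
    for got in received:
        if not got:
            break
        count += 1
    return count

def average_quality(requested_layers, layer_rate, bottleneck, policy, trials=10000):
    capacity = int(bottleneck // layer_rate)        # layers the bottleneck can carry
    drops = max(0, requested_layers - capacity)     # excess layers must be dropped
    total = 0
    for _ in range(trials):
        received = [True] * requested_layers
        if policy == "priority":
            for i in range(requested_layers - drops, requested_layers):
                received[i] = False                 # drop least important layers first
        else:
            for i in random.sample(range(requested_layers), drops):
                received[i] = False                 # congestion hits all layers alike
        total += decodable_layers(received)
    return total / trials

for requested in range(1, 9):
    rnd = average_quality(requested, 1.0, 4.0, "random")
    pri = average_quality(requested, 1.0, 4.0, "priority")
    print(f"requested={requested}  random={rnd:.2f}  priority={pri:.2f}")
```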
4.4.2 Forward error correction

Forward error correction (FEC) techniques rely on applying error correcting codes to the source data at the sender, so that the receiver can recreate the data even if some data symbols are lost. Specifically, for an (n, k) code, k source data symbols are encoded at the sender to produce a codeword of n symbols of encoded data, such that any subset of k encoded symbols suffices to reconstruct the source data. Thus, the code can withstand a loss rate of 1 - k/n and still be able to recreate the original data. The parameters n and k should be selected to match the loss probability of the communication channel. In an adaptive framework this can be performed recurrently, based on periodic feedback from the receiver.

As previously noted, only erasures in the form of packet loss need to be considered in packet networks, not bit errors. Although FEC was originally designed to protect data from bursts of bit errors, it can also be used for protection against lost packets by buffering a number of packets at the sender and applying the FEC coding orthogonally across the packets, as illustrated in Figure 5. Since interactive applications, like videoconferencing, are sensitive to delay, the requirement to buffer k packets over which to apply the FEC coding can be problematic. Simply choosing a small k to limit the delay adversely affects the granularity of the FEC coding. For instance, setting k to 1 and n to 2 implies one FEC packet being transmitted for every data packet (i.e. the packets are duplicated), resulting in 50% redundancy. When packet loss rates are moderate this is clearly suboptimal, suggesting that a substantially larger value of k is needed.

Paper D in this thesis presents an adaptive FEC framework that tackles this problem by adjusting the packet lengths when packetizing each frame of video, so that an optimal value of k is achieved depending on the current loss rate obtained through feedback from the receiver. This novel packetization and error protection
scheme enables the amount of redundant data transmitted to closely match the experienced loss rate, without imposing any additional sender-side buffering. As is shown in the paper, the performance gain in terms of improved network bandwidth utilization can be substantial, especially when the average size of the encoded video frames is relatively small.
Figure 5 Organization of packets for FEC coding

The FEC coding utilized in the framework presented in Paper D is based on adaptive Reed-Solomon coding. The coding is systematic, which means that the k data packets are transmitted unchanged. This is a desirable feature of an FEC scheme, since it relieves the receiver of the FEC decoding in case all k data packets are successfully received. It also means that a receiver incapable of FEC decoding can still decode the video frames that are transmitted without loss.

The real novelty of the adaptive FEC scheme in Paper D is the coupling of the optimal packetization algorithm with the error correction coding. Based on receiver reports of the currently measured loss rate, the sender infers the amount of redundancy needed in the FEC coding. For each video frame to be transmitted, the packetization algorithm then calculates the optimal packet size by maximizing a bandwidth utilization function. The optimal packet size can be seen as a trade-off between high error correction granularity and low header overhead.

Adaptive FEC schemes for packet video have been proposed by many researchers. Bolot and Turletti suggest an H.261-based scheme wherein the macroblocks contained in packet n - k are compressed using a coarser quantization and piggybacked on packet n [60]. This is an example of a media-dependent FEC scheme, unlike the scheme in Paper D, which is media independent. Several FEC mechanisms have been proposed based on performing exclusive-OR operations over k packets to obtain a parity packet from which one of the k packets can be recreated if lost [61]. Such a scheme can be seen as a special case of the general (n, k) error correction scheme where n = k + 1.

Rhee and Joshi propose an FEC-based loss recovery scheme for video applications wherein the interval over which the FEC codes are computed is longer than a single frame interval [62]. This can be done without introducing delay by not requiring the receiver to be able to recover a lost packet in time for the playout deadline of the video frame the packet belongs to. Thus, the FEC coding will not
prevent loss of quality in the video frame the lost packet belongs to, but will limit error propagation due to inter-frame dependencies. When implementing an adaptive FEC scheme, the bandwidth allocated for video and FEC redundancy must be judiciously controlled. A naïve implementation that simply increases the amount of FEC data in response to an increasing loss rate will in effect exacerbate the congestion that most likely caused the packet loss in the first place. The application must therefore keep the total bitrate, including video and FEC data, at a constant level, or at a level determined by a congestion control algorithm. Thus, in response to increasing packet loss the bandwidth assigned for the video will typically be reduced in favor of stronger FEC codes, resulting in lower quality video with improved robustness.
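As a concrete illustration of the parity-based special case mentioned above (n = k + 1), the following sketch computes one exclusive-OR parity packet over k equal-length data packets and reconstructs a single lost packet from the survivors. This is illustrative only; it is not the Reed-Solomon scheme of Paper D, which can tolerate multiple losses per block.

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets):
    # Packets are assumed padded to equal length; the parity is their XOR.
    parity = bytes(len(packets[0]))
    for p in packets:
        parity = xor_bytes(parity, p)
    return parity

def recover_single_loss(received, parity):
    # 'received' holds the k data packets with exactly one entry set to None.
    lost = received.index(None)
    rebuilt = parity
    for i, p in enumerate(received):
        if i != lost:
            rebuilt = xor_bytes(rebuilt, p)
    return lost, rebuilt

# k = 4 data packets protected by one parity packet (n = 5); packet 2 is lost.
data = [b"pkt0....", b"pkt1....", b"pkt2....", b"pkt3...."]
parity = make_parity(data)
lost, rebuilt = recover_single_loss([data[0], data[1], None, data[3]], parity)
assert lost == 2 and rebuilt == data[2]
```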
4.4.3 Postprocessing techniques

Techniques for error concealment by postprocessing at the decoder try to exploit smoothness properties of video signals to conceal transmission errors. Since the pixel values of a video image vary slowly, both temporally and spatially, a lost packet of video data can be replaced through spatio-temporal prediction. For block-based video coding, a lost packet will affect a spatially confined portion of a video frame. The lost blocks can be approximated by the spatially adjacent blocks, or by the spatially corresponding blocks of a previous frame. Motion compensation techniques can be used for temporal prediction, provided that the motion vectors are available. For instance, using data partitioning techniques, motion vectors are separated from waveform data and can be transmitted with a higher degree of protection, in order to enable error concealment through motion compensated prediction at the receiver.

For wavelet-based coding, a lost packet will affect the whole video frame rather than an isolated area of the frame. A lost subband can be approximated by the corresponding subband of a previously received frame. Since the distortion resulting from a lost packet is distributed over the entire video frame, the subjective quality degradation is generally less objectionable, compared to when the error is concentrated in a spatially confined region. However, if one or more of the lowest frequency subbands are lost and temporal prediction introduces a lot of distortion, e.g. due to a scene change, then the error will be propagated through the inverse wavelet transform, resulting in a highly distorted reconstructed video frame. For this reason, a lost subband is often simply replaced by a zero-valued subband instead of a temporally predicted subband.
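For block-based coding, the temporal concealment strategy described above amounts to a very simple operation at the decoder. The sketch below is illustrative only (a fixed 16x16 block size and plain NumPy luminance arrays are assumptions, and bounds checking is omitted): a lost block is replaced by the co-located block of the previous decoded frame or, if the motion vectors survived thanks to data partitioning, by the motion-compensated block.

```python
import numpy as np

BLOCK = 16  # assumed block size in pixels

def conceal_temporal(current, previous, row, col, motion=(0, 0)):
    """Replace a lost BLOCK x BLOCK block of 'current' by the (optionally
    motion-compensated) co-located block of the previous decoded frame.
    Indices and motion vectors are assumed to stay inside the frame."""
    r0, c0 = row * BLOCK, col * BLOCK
    dy, dx = motion
    src = previous[r0 + dy:r0 + dy + BLOCK, c0 + dx:c0 + dx + BLOCK]
    current[r0:r0 + BLOCK, c0:c0 + BLOCK] = src
    return current
```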
4.4.4 Co-operative sender/receiver techniques

The efficiency of an error correction or concealment scheme can be improved if a backchannel is available, where feedback of status information can be propagated from the receiver to the sender. This makes it possible for the sender to modify source coding or transport level parameters, based on communication channel characteristics. Adaptive FEC schemes, as described above, utilize this type of
technique to adjust the amount of redundancy applied by the FEC coding based on feedback from the receiver about the currently observed packet loss rate. Alternatively, source coding parameters can be changed as a result of the feedback information. For interframe coding schemes the I-frame/P-frame ratio can be modified depending on the loss rate, in order to limit propagation errors. Moreover, given information from the receiver about which frames are received correctly, the encoder can modify its prediction frame buffer, so that no frames are predicted from frames that are known to be damaged. One such technique, known as the reference picture selection mode, is adopted by the H.263 standard.

Various error recovery schemes based on selective retransmission of lost packets have been suggested for video applications that can tolerate the extra delay, primarily one-way streaming applications. Marasli, Amer and Conrad suggest a partially reliable transport service that limits the number of retransmission attempts depending on the acceptable delay [63]. Unlike TCP, which retransmits a lost packet until it is successfully delivered, this scheme aborts retransmission after a time limit defined by the acceptable delay.

When interframe video coding is used, retransmissions of lost data can be useful for the decoder even if the retransmitted packet is received after the playout deadline of the frame. For instance, Ghanbari [64] describes a technique where the receiver sends a retransmission request when a lost packet is detected, but instead of waiting for the retransmitted data, the decoding error in the video frame is concealed using a postprocessing concealment technique. When the retransmitted packet arrives, the affected pixels of the damaged frame are corrected, and the frame can be used successfully for motion compensated prediction. Thus, the retransmitted packet is of use to the decoder even if the playout deadline of the frame is missed.
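A minimal sketch of the receiver-side decision implied by these schemes is given below. It is not the mechanism of [63] or [64]; the deadline arithmetic and the special treatment of reference frames are assumptions made only for illustration: a retransmission request is worthwhile if the repair can arrive before the frame's playout deadline, or, for frames used as prediction references, even afterwards, since a late repair still stops error propagation.

```python
import time

def should_request_retransmission(playout_deadline, rtt_estimate, is_reference_frame):
    """Return True if requesting a retransmission of a lost packet is worthwhile."""
    arrives_in_time = time.time() + rtt_estimate < playout_deadline
    # A late retransmission is still useful for reference frames: once the damaged
    # frame has been repaired, subsequent motion-compensated prediction is no
    # longer based on corrupted data.
    return arrives_in_time or is_reference_frame
```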
5 Applications of video-mediated communication

Although video content is very common today on the Internet, primarily in the shape of video clips for entertainment purposes, interpersonal communication using live video is still very modestly exploited. This is in part explained by technological shortcomings and cost issues, such as insufficient software tools, expensive hardware and lack of network bandwidth. However, even in environments where the required network and hardware resources are available, for instance in academic networks or corporate intranets, the use of video-mediated communication is still very modest. This can only be explained by a lack of appropriately designed systems, resulting from a poor understanding of the underlying mechanisms of interpersonal visual communication.

Since much of the information exchange in face-to-face conversations is known to be non-verbal, interpersonal communication systems could presumably benefit substantially from adding a video channel to the audio. For professional usage situations, such as various kinds of distributed collaborative work, it is hypothesized that the added value of the visual channel will make the collaboration more efficient. Moreover, the opportunity for people to meet and interact without being physically co-located can be assumed to promote more frequent interactions, enable more direct information exchange and reduce the need for traveling. One area in which distributed collaborative teamwork, supported by video-mediated communication and other CSCW tools, is envisioned to have a significant impact is engineering and design work in product development projects. Section 5.1, as well as Papers G and H, is devoted to the study of collaboration environments for distributed engineering teamwork.

In section 5.2 the prospects of integrating video-mediated communication systems with the World Wide Web are discussed. Since the WWW has been very important for the deployment of many asynchronous Internet services, an interesting question is whether web technology can be of similar value for emerging synchronous applications, like video-mediated communication. In Paper E this theme is elucidated through the presentation of a transport level video gateway that makes it possible to access multicast video streams through a web browser.

Real-time transmission of high-quality video streams has the potential of enabling completely new, highly specialized applications. When increases in bandwidth and processing power permit high quality, low latency video communication virtually anywhere on the Internet, new ways to enrich interpersonal communication will be demanded. One such enrichment of video-mediated communication is explored in Paper F, which describes an Internet-based stereoscopic video communication tool. Section 5.3 speculates on what the next generation interpersonal communication systems will look like and what new types of applications will emerge as the core technology evolves. Specifically, the potentials and limitations of stereoscopic video in future communication systems are discussed and the most credible
applications are identified. Video-mediated communication using mobile devices and wireless networks is also discussed.
5.1 Distributed collaborative teamwork in product development

Product development is a collaborative task involving many people with different competences and responsibilities. In large companies, development projects often involve design and engineering teams located in many different geographical regions. The product development process requires a great deal of information exchange between and within engineering and design teams for coordination, data sharing, work integration, testing and so on. These incentives, together with the fact that much of the development work is highly computerized, make distributed product development a particularly interesting application domain for video-mediated communication systems in combination with other CSCW tools. Moreover, the trend towards cross-functional teamwork, promoting concurrent engineering to reduce development lead times, highlights the need for sophisticated support systems for information interchange, collaboration and communication.

In Paper G, a framework for distributed collaborative teamwork is presented, targeted at engineering and design work in product development projects. The development of a prototype system is described and early usage experiences are related. The system is built around four key components: a commercial CAD/CAE system, an experimental virtual reality system, a research prototype for video-mediated communication, and a semi-commercial document sharing system. The main focus of the paper is on software architecture, chiefly concerning the integration of the software components into one cohesive system. Initial usage experiences are also discussed in the paper. More comprehensive case studies and evaluations have been conducted by Törlind [65].

Paper H presents an extension of the framework introduced in Paper G with functionality for user and device mobility. The relevance of mobility support in distributed collaborative teamwork is investigated and the technological implications are explored. The question of how to support video-mediated communication for mobile users in a practically useful and technically sound way is a key issue in the paper. The importance of video for awareness purposes and as a catalyst for spontaneous interactions is emphasized. On the technical side, lightweight application-level mobility support is advocated in preference to network-level mechanisms such as Mobile IP.
5.1.1 The benefit of video in distributed collaborative work

Despite the multimodal nature of interpersonal communication, it is still debated whether the addition of video to audio-only communication significantly improves the productivity of distributed collaborative work. Indeed, most usability studies to date give weak or no evidence for objective improvements, unless the collaborative task is inherently visual [66, 67, 68, 69]. Moreover, these negative
results are not related to video quality: even face-to-face interaction is no better than audio-only communication in terms of task outcome or time to completion of a task [69, 70]. On the other hand, subjective measures of the benefit of video generally give positive results. That is, although the benefit of a visual communication channel cannot be objectively substantiated, users respond positively to the use of video in collaboration [71, 72, 73]. The inconsistency between subjective and objective measures might suggest that the failure to objectively measure the impact of the video channel on collaborative problem solving reflects difficulties in measuring task performance, rather than signifying that video is unimportant for collaborative work.

There are a number of different techniques to objectively measure the efficacy of collaboration sessions. One way is to study the pattern of interaction between participants by analyzing videotaped sessions (or transcriptions thereof) to identify certain conversational key features, such as the number of interruptions, explicit handovers and backchannels. (Backchanneling is the process wherein a listener provides concurrent feedback to the speaker by short utterances and visual cues in the form of head nods and eye gaze, indicating that the conversation is on track.) Another way is to measure the time to completion of a well-defined collaborative task. Subjective measures of the impact of video on collaborative work are based on asking the users, through interviews or questionnaires, how beneficial they found the use of video in a collaboration session.
5.1.2 Process coordination and content coordination

There are a number of fundamental conversational mechanisms that must be supported regardless of what modalities of communication are available. Communication is a joint activity between speaker and listener that requires coordination of both process and content [74]. Process coordination is concerned with the mechanisms and management of conversation, whereas content coordination is concerned with how participants arrive at and maintain a common understanding of the subject matter of the conversation.

Key mechanisms of process coordination are turn-taking and availability. Turn-taking is the process by which participants of a conversation determine who will speak, who will listen and how to switch between these roles. Availability cues determine how entire conversations are initiated and ended.

Content coordination is required in a conversation in order to create a common knowledge that is shared between the participants. Since the individual utterances of a speaker typically underspecify the intended meaning, the common knowledge constitutes a context necessary for the listener to understand what is said. References to a shared context enable the participants of a conversation to jointly identify the objects and events they want to talk about. Another aspect of content coordination is the ability of the speaker to discern whether the listener comprehends what is being said. Feedback from listener to speaker, expressing
acceptance, disagreement or confusion is critical in order for the conversation to successfully convey information between the interlocutors. An important aspect of content coordination concerns the affective state of the participants in relation to what is being discussed. This social or emotional information is generally not explicitly stated (or even intentional) but must be inferred.
5.1.3 The role of vision in communication

In face-to-face conversations much of the information exchange comprising the process and content coordination is non-verbal. It is therefore natural to assume that video-mediated communication facilitates a mode of conversation more reminiscent of face-to-face conversations than audio-only communication does, and hence enables the participants to communicate more efficiently. However, as previously mentioned, this conjecture has been difficult to substantiate experimentally. To investigate in what way the use of video can benefit remote collaboration, Whittaker and O'Connail [69] established three hypotheses:

1. Video supports visible behaviors and hence supplies important non-verbal information.
2. Video provides visible information about the environment, specifically the availability of other people, which in turn facilitates initiation of spontaneous interactions.
3. Video provides dynamic visual information about objects and events in a shared environment. This use of video in collaborative work is referred to as video-as-data.
5.1.4 Video supporting visible behaviors

The first hypothesis, sometimes referred to as the nonverbal communication hypothesis, states that the participants of a conversation exchange visual information significant for the conversation through their body language (gestures, posture), facial expressions and direction of gaze. These visual cues are used for both process coordination and content coordination.

To evaluate the nonverbal communication hypothesis, Whittaker and O'Connail suggest conducting laboratory studies comparing video-mediated communication with audio-only and face-to-face conversations respectively. The comparison with audio-only communication reveals how video complements speech, and the comparison with face-to-face interaction shows how well the use of video can mimic face-to-face conversations. Numerous such studies have been conducted with different technological set-ups and different experimental collaborative tasks. Although inconclusive, the results have generally not been supportive of the nonverbal communication hypothesis. An exception to this is communicative tasks that rely heavily on the exchange of affective information and social cues, such as negotiation, bargaining, conflict resolution and persuasion [69, 75, 76]. A reason for this might be that many visual affective cues, such as blushing, sweating,
exhibiting nervous tics, etc., are not intentional and hence cannot be substituted by nonvisual cues. Intentional visual cues, on the other hand, are generally substitutable across different communication media (e.g. saying "uh-huh" instead of a head nod).
5.1.5 Video promoting spontaneous interactions

The second hypothesis, known as the connection hypothesis, states that visual information about the conversational environment, in particular the availability of people for unplanned conversations, enhances the chances of opportunistic interactions. Simply stated, one can infer the availability of a person for conversation if the person is visible. Furthermore, visual information about a person's current activity, movement and proximity to other people has been shown to influence vital aspects of communication, such as availability and interruptibility [77, 78]. The connection hypothesis focuses on interpersonal interactions that are not planned in advance, but occur spontaneously and are highly informal. This type of informal communication is omnipresent in everyday life and has been shown to be an important element of collaborative work [79, 80].

Two main types of video applications have been used to investigate the connection hypothesis. "Glancing" applications, which let a user briefly peek into a co-worker's office to assess the opportunity of initiating a spontaneous communication session, were explored by Fish et al. [71] and by Tang et al. [78]. Continuously open video links between the offices of remote collaborators or between shared workplace areas, such as coffee rooms, can also be used to promote opportunistic encounters [71, 77, 81]. This is intended to mimic a shared physical environment, enabling remote co-workers to "bump into" each other by chance. Studies of the use of open video links for collaborative work show that the systems are mainly used for brief social interactions, rather than for substantial collaborative work. Törlind and Larsson demonstrate the usefulness of continuously open video links for initiation and coordination of collaboration sessions in distributed engineering design teams [79]. Altogether, however, these trials do not give conclusive evidence supporting the connection hypothesis, suggesting that further research is needed.
5.1.6 Video-as-data

The third hypothesis of Whittaker and O'Connail is the video-as-data hypothesis, stating that the major benefit of video lies in its ability to dynamically convey information about shared visual objects and the physical environments of those objects. Thus, rather than using video to depict the collaborators themselves, the video-as-data hypothesis suggests using video depicting shared work artifacts, thereby creating a shared context for the conversation. Not surprisingly, the usefulness of video-as-data is strongly correlated to the nature of the collaborative task. Unlike the two previously discussed hypotheses, many studies have been able to demonstrate the validity of the video-as-data hypothesis for various collaborative
tasks, such as remote surgery and distributed design and engineering teamwork [65, 82, 83].
5.1.7 The effect of video quality on distributed collaborative work

As discussed above, attempts to objectively corroborate the nonverbal communication hypothesis have failed, irrespective of the quality of the video channel. On the other hand, subjective evaluations show that the satisfaction of users of video-mediated communication systems is closely related to the quality of the audio and video signals [84].

The importance of minimizing audio latency has been established in many experimental studies [84, 85]. Audio delays longer than 400 milliseconds have been found to seriously disrupt the interactivity of conversations. Moreover, audio latency is generally found to be more disruptive than poor lip synchronization, indicating that if a trade-off between audio delay and media synchronization is necessary, low audio delay should be favored [86]. However, good lip synchronization improves intelligibility, since even people who are not hearing-impaired utilize lip-reading to some extent. Bruce suggests a frame rate of at least 17 frames per second to convey facial cues, especially lip movement [87].
5.1.8 Conclusions

To conclude, experimental studies have not been able to decisively show that the addition of video to audio-only communication improves the quality of distributed collaborative work. However, users of computer-supported collaboration systems subjectively prefer audio/video systems to audio-only systems, due to psychosocial factors such as the perceived "social presence" and "feeling of closeness." Weak evidence suggests that collaborative tasks requiring information about the participants' affective state, such as negotiation, can be more efficiently supported by audiovisual systems than by audio-only systems. The usefulness of video for informal communication and spontaneous interactions is a promising research field that will require further work to give conclusive results. Finally, the use of video as a conveyor of visual information about the work objects of the collaborative task has been found to increase the effectiveness of distributed collaborative work.
5.2 Video and the World Wide Web

The extraordinary success of the Internet is to a great extent dependent on one application: the World Wide Web. The fundamental strength of the WWW is that it provides an easy-to-use interface to a wide variety of Internet services. The generality of the technology, making it possible to realize a multitude of different services, and its platform independence are also key factors behind its success. In this perspective, an interesting question is how video and WWW technologies can be combined with the aim of making video-mediated communication more
accessible on the Internet. One effort in this direction is presented in Paper E, where the design and implementation of a transport level video gateway is discussed. The gateway enables easy access to multicast videoconferencing sessions through a web browser interface. Another example of how web and video technologies can be integrated is the Contact Portal system described in Paper H. In the Contact Portal, live video streams are incorporated in dynamic web pages together with other information associated with a collaboration session. The primary use of the video is to provide availability and awareness cues to the members of a collaboration team.

The coupling of video and other synchronous media with the WWW has been studied by many researchers, and a multitude of middleware services and software frameworks have been developed [88, 89]. The problem of extending multicast services to non-multicast-capable network environments using application-level mechanisms was investigated by Parnes et al. [90] and by Pendarakis et al. [91].
5.2.1 A transport level gateway bringing video to the WWW

In Paper E, a transport level gateway is presented that makes it possible to access multicast RTP video sessions using a standard web browser. The gateway is installed on a web server that is connected to a multicast-capable network carrying the RTP streams. Through a simple HTML interface the users can select which video sessions to view, whereupon the gateway relays the corresponding RTP video streams via HTTP to the client browsers. The major drawback of relaying video over an HTTP connection (using TCP) is that it introduces delay, but since the system described in Paper E only supports one-way video, this is not a serious problem. Carrying the video over HTTP has the positive effect of simplifying firewall traversal and also solves the problem of congestion controlling the video traffic between the gateway and the client web browser, since this is handled by TCP's flow control. Although TCP's AIMD-based congestion control is certainly not optimal for video, it nevertheless prevents a congestion collapse of the network.

In addition to discussing the software architecture of the video gateway, Paper E also proposes a novel congestion control mechanism for the multicast RTP flows. The idea is to let the throughput of the HTTP connections determine the rate of the multicast RTP streams. This is performed by measuring the transmission rate of the HTTP connections (controlled by TCP's congestion avoidance) and then adapting the rate at which the corresponding RTP video stream is received by the gateway through layered multicast. A temporally scalable layered multicast approach is utilized, so that the frame rate of a multicast video stream that is received by the gateway is determined by the rate of the outbound HTTP connections.

The main contribution of Paper E is that it presents a middleware software component that provides a tight integration of RTP-based multicast video and the WWW. The majority of the research on video associated with the WWW focuses on more loosely coupled frameworks, where the web technology serves as a "glue" that keeps a number of distinct software components together. The TCP-driven
multicast rate control algorithm proposed in the paper is also novel and could possibly be applicable in other situations not related to video transmission.
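A sketch of the kind of mapping such a TCP-driven controller might perform is given below. It is not the Paper E implementation; the layer rates, the choice of the slowest client as the reference, and the cumulative-subscription rule are illustrative assumptions.

```python
def subscription_level(slowest_http_throughput_bps, layer_rates_bps):
    """Join as many cumulative multicast layers as the slowest web client can absorb."""
    cumulative, level = 0, 0
    for rate in layer_rates_bps:
        cumulative += rate
        if cumulative > slowest_http_throughput_bps:
            break
        level += 1
    return level

# Example: temporally scalable layers of 64, 64, 128 and 256 kbps; the slowest
# outbound HTTP connection currently sustains about 200 kbps, so join two layers.
print(subscription_level(200_000, [64_000, 64_000, 128_000, 256_000]))  # -> 2
```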
5.3 Next generation video-mediated communication systems

Although video-mediated communication is still far from being widely deployed on the Internet, the core technology is maturing quickly. Since the next generation of video-mediated communication systems will be less constrained by technological shortcomings, enrichments to the basic audiovisual communication service will most likely be desired. Furthermore, advances in computer hardware and signal processing technology, in combination with improved bandwidth availability, will make entirely new applications of video-mediated communication possible. An example of an enrichment of video-mediated communication is the ability to convey stereoscopic depth cues, resulting in true three-dimensional depth perception. A framework for realizing this type of enhanced visual communication over the Internet is presented in Paper F.
5.3.1 Stereoscopic video communication over the Internet

Through stereopsis, resulting from binocular disparity, the human visual system allows the perception of three-dimensional depth. This visual cue is generally lost in video-mediated communication systems, but as is demonstrated in Paper F, it is possible to implement a system that preserves this fascinating property so that a true 3D view can be conveyed to a remote user over a network. The system uses two identical cameras that are positioned so that the axes through the lenses of the cameras are parallel, with an inter-ocular separation of about 65 mm. The video signals from the cameras are digitized, compressed, packetized, multiplexed and transmitted to the remote end. At the receiver, demultiplexing and decoding of the video signals is followed by a horizontal image translation, needed in order to converge the images. The magnitude of the translation determines where the convergence plane of the reconstructed view will appear to be. The properly translated video images from the left and right camera are then projected separately to the user's left and right eye respectively, so that the left eye only sees the video from the left camera and vice versa. The viewpoint separation is performed in a time-multiplexed fashion using active liquid crystal shutter glasses synchronized with the display. The stereoscopic communication system is illustrated schematically in Figure 6.

In the prototype implementation of the system presented in Paper F, the video signals from the two cameras are compressed independently, using a monoscopic video compression algorithm (M-JPEG). However, the proposed framework is independent of the video coding, so that a stereoscopic video compression algorithm could be utilized to reduce the bandwidth requirement. Research on stereoscopic video has hitherto mainly been concerned with the development of efficient compression techniques, by exploiting redundancies
between the two channels [92, 93]. The MPEG-2 standard includes a multiview profile, dedicated to the coding of stereoscopic video [94].
Figure 6 Stereoscopic video communication system

When it comes to stereoscopic video communication systems, the most well-known technique is a particular kind of analog TV transmission popularized in the 1950s through films like "The Creature from the Black Lagoon." This simple example of stereoscopic video communication is based on the anaglyphic method, whereby the two video views are superimposed as a red and a green component image which are separated before the viewer's eyes using inexpensive color-filtering eyewear. One of the main problems with this technique is that it is limited to monochrome video.

Ohm et al. describe a hardware system for stereoscopic videoconferencing over ATM networks [95]. The main focus of the paper is on disparity estimation and it is unclear whether the proposed system was ever implemented. Unlike the prototype implementation of the system described in Paper F, which is a software tool running on workstations with stereoscopic rendering capabilities, the system proposed by Ohm et al. is based on dedicated hardware and is dependent on a specific video coding (MPEG-2 multiview) and a specific layer 2 network protocol (ATM).

Many applications can potentially benefit from stereoscopic video communication, including telerobotics, various telemedicine applications, telepresence systems, augmented virtual reality systems, and distributed engineering and design tools.
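The horizontal image translation performed at the receiver amounts to a simple pixel shift, sketched below. This is illustrative only and not the Paper F code; the frames are assumed to be NumPy luminance arrays, and the shift amount, which sets the apparent convergence plane, is applied with opposite signs to the left and right views.

```python
import numpy as np

def shift_horizontally(image, shift):
    """Shift a 2-D image by 'shift' pixels (positive = right), zero-filling the edge."""
    out = np.zeros_like(image)
    if shift > 0:
        out[:, shift:] = image[:, :-shift]
    elif shift < 0:
        out[:, :shift] = image[:, -shift:]
    else:
        out[:] = image
    return out

def converge(left_view, right_view, shift):
    """Apply equal and opposite horizontal translations to the two decoded views."""
    return shift_horizontally(left_view, shift), shift_horizontally(right_view, -shift)
```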
5.3.2 Mobile teleconferencing

A clear trend in computer and network architecture is towards lightweight mobile systems that are constantly connected to the Internet through wireless access networks. Following this trend, mobile video communication devices based on laptop computers or PDAs with wireless network interfaces are emerging. Mobile IP is an extension to IP that provides transparent host mobility without requiring modifications to applications [96]. Using Mobile IP, a mobile host retains its statically configured IP address while roaming between different IP subnetworks. Packets bound for the mobile host are relayed by an agent in the host's home network to a "care-of address" while the mobile host visits a foreign network. The transparent mobility support provided by Mobile IP is attractive for many reasons, but the reliance on forwarding agents is problematic for delay-sensitive applications like video-mediated communication. Instead, application-level mobility support is advocated in Paper H for mobile teleconferencing applications.

The mobility support implemented in the prototype collaboration environment presented in Paper H is based on mobility mechanisms provided by SIP. With this approach a mobile host's IP address is reconfigured using DHCP whenever a foreign network is visited. The host's new IP address is reported back to the mobile host's home SIP server using the SIP REGISTER request. The SIP server then forwards invitation requests to the mobile host's remote location. After the call set-up phase, video and audio packets are exchanged directly between the communicating hosts, instead of using a forwarding agent as in the Mobile IP situation.

The SIP-based mobility approach requires a new session initiation process if the mobile host roams to a new subnetwork. Hence, network level mechanisms like Mobile IP typically provide a higher degree of mobility compared to application level schemes. In most practical situations, however, a video-mediated communication session will be conducted while the mobile host is connected to the same subnetwork. The real benefit of the mobility support lies in the ability to initiate communication sessions transparently, without knowledge about the present location of the mobile user. This motivates the use of application-level primitives in preference to network-level mechanisms.
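The following sketch composes the kind of REGISTER request involved in this approach. It is not taken from the Paper H implementation, and all user names, domains and addresses are invented examples.

```python
def sip_register(user, home_domain, new_ip, port=5060, expires=3600):
    # After a DHCP reconfiguration on a visited network, the mobile host reports
    # its new contact address to its home SIP registrar.
    return "\r\n".join([
        f"REGISTER sip:{home_domain} SIP/2.0",
        f"Via: SIP/2.0/UDP {new_ip}:{port}",
        f"From: <sip:{user}@{home_domain}>;tag=1928301774",
        f"To: <sip:{user}@{home_domain}>",
        f"Call-ID: 843817637684230@{new_ip}",
        "CSeq: 1 REGISTER",
        f"Contact: <sip:{user}@{new_ip}:{port}>",
        f"Expires: {expires}",
        "Content-Length: 0",
        "",
        "",
    ])

print(sip_register("mobile-user", "home.example.com", "192.0.2.15"))
```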
6 Summary of included papers and their contributions

In this chapter a brief summary of each of the appended research papers is given.
6.1 Paper A: Delay-based flow control for layered multicast applications

The first of the included papers presents a novel approach to flow control for layered multicast applications. Traditionally, packet loss has been used as a congestion signal for participants of a layered multicast session, indicating that the bandwidth must be lowered by leaving multicast groups. In contrast, the algorithm presented in this paper detects impending congestion from packet delay measurements performed by the receivers. An increasing delay, corresponding to increased queuing in router buffers, is responded to by leaving multicast groups. By predicting impending congestion before packet loss is experienced, the overall packet loss rate is reduced compared to the traditional loss-based algorithms. This is of vital importance for loss-sensitive applications like real-time multimedia communication. Moreover, since the algorithm is feedback-free, it avoids the scalability problems of feedback-based schemes. The performance of the algorithm in terms of resource utilization, intra- and inter-protocol fairness, overall loss rate and convergence time is explored through simulations.
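A minimal sketch of the receiver-side reaction that such a delay-based scheme implies is shown below. It is not the Paper A algorithm itself; the smoothing factor, the trend threshold and the one-layer-at-a-time reaction are assumptions chosen only to illustrate the idea of acting on a delay trend before loss occurs.

```python
class DelayDrivenReceiver:
    """Drops the least important layer when the smoothed one-way delay keeps growing."""

    def __init__(self, layers_subscribed, alpha=0.1, trend_threshold=0.005):
        self.layers = layers_subscribed
        self.alpha = alpha                      # EWMA smoothing factor (assumed)
        self.trend_threshold = trend_threshold  # delay growth per packet, seconds (assumed)
        self.smoothed = None

    def on_packet(self, relative_delay):
        # 'relative_delay' is the packet's one-way delay relative to a reference,
        # so clock synchronization between sender and receiver is not required.
        if self.smoothed is None:
            self.smoothed = relative_delay
            return
        previous = self.smoothed
        self.smoothed += self.alpha * (relative_delay - self.smoothed)
        if self.smoothed - previous > self.trend_threshold and self.layers > 1:
            self.layers -= 1                    # leave the least important multicast group
```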
6.2 Paper B: A scalable video compression algorithm for real-time Internet applications

In this paper, a video compression algorithm targeted at real-time Internet applications is presented. The design of the algorithm is focused on achieving scalability in terms of computational complexity, bandwidth and quality, while keeping the coding latency at a minimum. Wavelet transform coding in combination with a zerotree quantization scheme with temporal prediction and arithmetic coding are the building blocks of the algorithm. The performance of the algorithm in terms of compression efficiency is analyzed using a prototype implementation. The computational cost is estimated through complexity analysis. The compression performance of the algorithm is shown to be competitive with a popular non-layered video compression algorithm (MPEG-1). The scalability in terms of bandwidth is shown to be excellent, ranging from about 10 kbps to several Mbps. Trade-offs between quality and resource consumption are demonstrated to be possible in three different ways, depending on receiver capabilities and preferences.
6.3 Paper C: Layered encoding and transmission of video in heterogeneous environments

The layered media transmission architecture presented in Paper A and the layered video coding presented in Paper B provide core building blocks of a multipoint video-mediated communication system that can be utilized in a highly heterogeneous environment. However, Papers A and B say little about how the layered media stream ideally should be coupled with the layered transmission architecture. This flexibility is a positive feature of the algorithms, but when designing a system incorporating these components, some information about how to assign layered video data to transmission channels is needed. Paper C explores how a layering scheme should ideally be designed from a network utilization standpoint. The effect on bandwidth utilization depending on the layering scheme, and in particular the total number of layers, is investigated through simulations. The paper emphasizes two applications: multicast videoconferencing and adaptive unicast video streaming. A survey of different layered video coding and transmission techniques is also included in the paper.
6.4 Paper D: Adaptive forward error correction for real-time Internet video

In the prevalent best-effort service model of the Internet there are no guarantees on timely delivery of packets. Consequently, Internet applications must be resilient to packet loss, either by using reliable transport protocols like TCP, or through other techniques. Since real-time applications, like video-mediated communication, generally cannot accept the delay implications of retransmission schemes, forward error correction mechanisms can be useful to improve robustness to packet loss. Ideally, the amount of redundancy applied by the FEC coding should be dynamically adjustable to match the packet loss rate experienced on the network. One such approach, based on adaptive Reed-Solomon coding and receiver feedback, is presented in Paper D. In addition to the FEC coding, a packetization scheme is devised that calculates the optimal packet size to use for the error-protected video in order to maximize utilization. An efficient algorithm implementing this optimal packetization is described, and the utilization gain is shown to be as high as around 10% in some relevant situations.
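A back-of-the-envelope sketch of the kind of trade-off involved is shown below; the utilization function, the redundancy rule and all constants are assumptions made for illustration and do not reproduce Paper D's actual algorithm.

```python
import math

def best_packetization(frame_bytes, header_bytes, loss_rate, max_payload=1400):
    """Search for the payload size that maximizes a simple utilization measure:
    the frame data divided by the total bytes sent (k data plus n-k FEC packets,
    each carrying a header), with n chosen so that (n - k) / n >= loss_rate."""
    best = None
    for payload in range(64, max_payload + 1):
        k = math.ceil(frame_bytes / payload)
        n = math.ceil(k / (1.0 - loss_rate))
        utilization = frame_bytes / (n * (payload + header_bytes))
        if best is None or utilization > best[0]:
            best = (utilization, payload, k, n)
    return best  # (utilization, payload_size, k, n)

print(best_packetization(frame_bytes=6000, header_bytes=40, loss_rate=0.05))
```

Small payloads increase header overhead, while large payloads make the FEC granularity coarse (each additional redundant packet is a large relative cost); the search captures that tension.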
6.5 Paper E: An RTP to HTTP video gateway

The importance of the World Wide Web for the proliferation and penetration of the Internet is unquestionable. To take advantage of the prevalence of the WWW for video-mediated communication, an interconnection of the transport protocols for WWW and video traffic is proposed in this paper. HTTP, being the application level protocol used on top of TCP for WWW traffic, is poorly suited for video, but can nevertheless be used if the real-time requirements are relaxed. The motivation for doing so is that it facilitates the inclusion of live video in HTML pages for
user-friendly display in a WWW browser, in much the same way as web cameras work. Moreover, it enables users located behind firewalls to easily participate in video communication sessions without requiring any reconfiguration. In Paper E, the design and implementation of a transport level gateway interconnecting the WWW with RTP-based multicast video applications is described. The paper also proposes a multicast flow control mechanism implemented by the transport protocol gateway. The gateway monitors the bandwidth of the TCP connections of its connected web browser clients, and adjusts the multicast bandwidth accordingly. The transport protocol overhead is estimated for RTP and HTTP/TCP respectively, and is found to be approximately the same.
6.6 Paper F: Stereoscopic video transmission over the Internet

Stereopsis, the ability of the human visual system to perceive three-dimensional depth by means of binocular disparity, is a powerful sensory capability. Still, practically no visual communication systems support stereopsis. This paper explores the possibilities of stereoscopic video communication over the Internet by presenting the development of a novel stereoscopic video communication system. The paper contributes implementation and usage experiences to the Internet applications research community and analyzes the requirements for stereoscopic video communication systems. Furthermore, a transport protocol extension for identification and association of stereo video streams is presented, along with guidelines for implementation. Finally, application domains expected to benefit from stereoscopic video communication are identified and discussed.
6.7 Paper G: Collaboration environments for distributed engineering: Development of a prototype system

In this paper the development of a prototype system for distributed engineering teamwork is presented. The focus is on synchronous collaborative work, using a combination of teleconferencing and virtual reality tools. A commercial CAE system is interfaced so that geometric data and product metadata can be accessed and visualized in the virtual environment. Moreover, the paper discusses a coupling between a videoconferencing system and a VR system, which associates the voice and video of the participants of a conference session with the corresponding avatars in the virtual world. The loose integration of several specialized software tools is aimed at providing the users with a cohesive view of the system, while benefiting from the advantages of each of the subsystems. A test case, developed for the purpose of evaluating the collaboration system in a realistic situation, is also described in the paper.
6.8 Paper H: Mobility support for video-mediated collaborative teamwork

This paper is based on the hypothesis that distributed collaborative teamwork can benefit from the opportunity of supporting synchronous communication sessions involving mobile participants. Emerging technologies for mobile computing and universal network access are surveyed and the development of a prototype collaboration system including mobility support is described. Whereas previous work on mobile CSCW systems has focused mainly on text and voice communication, this paper emphasizes video as an important medium for mobile collaboration. The prototype system demonstrates the feasibility of supporting video-mediated communication in mobile settings, using handheld computers and wireless access networks. Experiences from a distributed product development project, where the prototype system has been tested, are also discussed in the paper.
7 Future directions

The evolution of the Internet service model from a best-effort network for data exchange into a true multiservice network supporting voice, video and data will have significant implications for the design of high-quality multimedia communication systems. However, as numerous trials within the research community have demonstrated, this development is not necessarily a sine qua non for the successful realization of large-scale Internet video communication systems. More likely, the prevalent situation with adaptive applications that dynamically adjust to variations in network conditions will remain valid. Gradual introduction of QoS support in certain regions of the Internet will make it possible for network operators to improve the service for customers of real-time communication applications, while maintaining the traditional end-to-end perspective on flow control and connection state maintenance.

A clear trend in computer and network architecture is towards mobile, wireless computing. Handheld computers are becoming increasingly powerful and wireless networks more widespread, enabling new types of communicative applications. The tremendous impact of cellular phones on ubiquitous interpersonal communication suggests a huge potential for more sophisticated mobile communication services. A convergence of technology between cellular phones and handheld computers, and between telephony and data networks, is clearly foreseeable. Wireless networking and mobile computing present significant challenges for video-mediated communication in terms of bandwidth limitations, limited processing power, constrained visualization and man-machine interface issues. To overcome these obstacles, a continued effort in the development of scalable media coding, adaptive transport protocols, and error-resilient transmission architectures will be necessary, and many promising research areas will be uncovered.

For quite some time now, a major commercial breakthrough for video-mediated communication services has been predicted, primarily for professional applications such as distributed collaborative teamwork. However, the anticipated widespread deployment of video-mediated communication is yet to be realized. This can in part be explained by technological shortcomings, some of which are addressed in this thesis, but the problem is probably also related to disappointing experiences from premature deployment of the first generation of video-mediated communication systems, which were hampered by many technical deficiencies. Moreover, the inability of usability studies to conclusively demonstrate the benefit of video for distributed collaborative work might discourage potential early adopters of the technology. These problems motivate continued research and development efforts to improve the underlying core technology for, and the overall design of, the next generation of systems for video-mediated communication.

However, with the rapidly growing number of households with broadband Internet access, and the ever-increasing performance of affordable personal computers, it might very well be the case that it is the private use of video-mediated communication for social interactions, rather than professional appli-
cations, that will lead the way to a more widespread deployment and greater commercial potential of the technology. This hypothesis is supported by the fact that usability studies to some extent have been able to demonstrate the benefit of video as a mediator of affective information and social cues. Furthermore, requirements on video quality might be lower for social interactions than for professional collaborative work, facilitating initial deployment. In any case, for a broad deployment of video-mediated communication to be successful, it is of vital importance that the core technology be scalable, robust, and efficient.
References

[1] A. Tanenbaum, "Computer networks," second edition, Prentice-Hall International Publishers, pp. 14-21, 1989.
[2] P. Checkland, "Systems thinking, system practice," second edition, Chichester, Wiley, 1990.
[3] ITU-T Recommendation H.323, "Packet based multimedia communication systems," International Telecommunication Union, Telecommunication Standardization Sector, Geneva, Switzerland, February 1998.
[4] H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A transport protocol for real-time applications," IETF RFC 1889, January 1996.
[5] J. Postel, "User datagram protocol," IETF RFC 768, August 1980.
[6] D. Clark and D. Tennenhouse, "Architectural considerations for a new generation of protocols," Proceedings of ACM SIGCOMM'90, pp. 200-208, September 1990.
[7] M. Handley, H. Schulzrinne, E. Schooler and J. Rosenberg, "SIP: Session initiation protocol," IETF RFC 2543, March 1999.
[8] M. Handley and V. Jacobson, "SDP: Session description protocol," IETF RFC 2327, April 1998.
[9] H. Schulzrinne and J. Rosenberg, "A comparison of SIP and H.323 for Internet telephony," Proceedings of NOSSDAV'98, July 1998.
[10] M. Handley, C. Perkins and E. Whelan, "Session announcement protocol," IETF RFC 2974, October 2000.
[11] R. Wittmann and M. Zitterbart, "Multicast communication protocols and applications," Morgan Kaufmann Publishers, Academic Press, 2001.
[12] S. Deering, "Multicast routing in a datagram internetwork," PhD thesis, Stanford University, December 1991.
[13] W. Fenner, "Internet group management protocol, version 2," IETF RFC 2236, November 1997.
[14] H. Eriksson, "Mbone: The multicast backbone," Communications of the ACM 37(8), pp. 54-60, August 1994.
[15] X. Xiao and L. Ni, "Internet QoS: the big picture," IEEE Network Magazine, March 1999.
[16] R. Braden, L. Zhang, S. Berson, S. Herzog and S. Jamin, "Resource reservation protocol (RSVP)," IETF RFC 2205, September 1997.
62
Supporting video-mediated communication over the Internet
[17] V. Bhaskaran and K. Konstantinides, "Image and video compression standards, algorithms and architectures," second edition, Kluwer Academic Publishers, 1997. [18] S. Floyd and K. Fall, "Promoting the use of end-to-end congestion control in the Internet," IEEE/ACM Transactions on Networking, August 1999. [19] X. Wang and H. Schulzrinne, "Comparison of adaptive Internet multimedia applications," IEICE Transactions on Communication, Special issue on distributed processing for controlling telecommunications systems, vol. E82B, no. 6, June 1999. [20] J. C. Bolot and T. Turletti, "A rate control mechanism for packet video in the Internet," Proceedings of IEEE INFOCOM'94, June 1994. [21] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397413, August 1993. [22] K. Ramakrishnan, S. Floyd and D. Black, "The addition of explicit congestion notification (ECN) to IP," RFC 3168, September 2001. [23] S. Gorinsky and H. Vin, "The utility of feedback in layered multicast congestion control," Proceedings of NOSSDAV'01, June 2001. [24] V. Jacobson, "Congestion avoidance and control," Proceedings of ACM SIGCOMM'88, pp. 314-329, August 1988. [25] S. Jacobs and A. Eleftheriadis, "Real-time dynamic rate shaping and control for Internet video applications," Workshop on multimedia signal processing, pp. 23-25, June 1997. [26] R. Rejaie, M. Handley and D. Estrin, "RAP: An end-to-end rate-based congestion control mechanism for real-time streams in the Internet," Proceedings of IEEE INFOCOM'99, March 1999. [27] I. Rhee, V. Ozdemir and Y. Yi, "TEAR: TCP emulation at receivers - flow control for multimedia streaming," NCSU Technical Report, April 2000. [28] B. Braden, et al., "Recommendations on queue management and congestion avoidance in the Internet," RFC 2309, April 1998. [29] S. Floyd, J. Padhye and J. Widmer, "Equation-based congestion control for unicast applications," Proceedings of ACM SIGCOMM'00, May 2000. [30] J. Padhye, V. Firoiu, D. Towsley and J. Kurose, "Modeling TCP throughput: A simple model and its empirical validation," SIGCOMM Symposium on Communications Architectures and Protocols, August 1998. [31] J. Widmer and M. Handley, "Extending equation-based congestion control to multicast applications," Proceedings of ACM SIGCOMM'01, August 2001.
Thesis introduction
63
[32] N. Shacham, "Multipoint communication by hierarchically encoded data," Proceedings of IEEE INFOCOM'92, pp. 2107-2114, May 1992. [33] S. McCanne, V. Jacobson and M. Vetterli, "Receiver-driven layered multicast," Proceedings of ACM SIGCOMM'96, August 1996. [34] A. Legout and E. W. Biersack, "Pathological behaviors for RLM and RLC," Proceedings of NOSSDAV'00, pp. 164-172, June 2000. [35] L. Vicisano, L. Rizzo and J. Crowcroft, "TCP-like congestion control for layered multicast data transfer," Proceedings of INFOCOM'98, March 1998. [36] B. Byers, M. Frumin, G. Horn, M. Luby, M. Mitzenmacher, A. Roetter and W. Shaver, "FLID-DL: Congestion control for layered multicast," Proceedings of NGC 2000, pp. 71-81, November 2000. [37] L. Wu, R. Sharma and B. Smith, "ThinStreams: An architecture for multicasting layered video", Proceedings of NOSSDAV'97, May 1997. [38] A. Legout and E. W. Biersack, "PLM: Fast convergence for cumulative layered multicast transmission schemes," Proceedings of ACM SIGMETRICS 2000, June 2000. [39] S. Keshav, "Congestion control in computer networks," PhD thesis, EECS, University of Berkeley, CA, September 1991. [40] A. K. Parekh and R. H. Gallager, "A generalized processor sharing approach to flow control in integrated services networks," Proceedings of IEEE INFOCOM'93, pp. 521-530, 1993. [41] J. Byers, M. Luby and M. Mitzenmacher, "Fine-grained layered multicast," Proceedings of IEEE INFOCOM'01, pp. 275-83, April 2001. [42] G. Cote, B. Erol and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 849-866, November 1998. [43] ISO/IEC 13818, "Generic coding of moving pictures and associated audio information," November 1994. [44] M. Hayder, M. Radha, M. van der Schaar and Y. Chen, "The MPEG-4 finegrained scalable video coding method for multimedia streaming over IP," IEEE Transactions on Multimedia, vol. 3, no. 1, March 2001. [45] M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, "Image coding using wavelet transform", IEEE Transactions on Image Processing, vol. 1, no. 2, April 1992. [46] K. Shen and E. J. Delp, "Wavelet based rate scalable video compression," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 1, pp. 109-122, February 1999.
64
Supporting video-mediated communication over the Internet
[47] D. Taubman and A. Zakhor, "Multirate 3-D subband coding of video," IEEE Transactions on Image Processing, vol. 3, no. 4, pp. 572-590, September 1994. [48] Y. Chen and W. A. Pearlman, "Three-dimensional subband coding of video using the zerotree method," Visual Communications and Image Processing '96, Proceedings of SPIE 2727, March 1996. [49] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Transactions on Image Processing, vol. 41, no. 12, pp. 3445-3462, December 1993. [50] ISO/IEC, "Information Technology - Coding of audio-visual objects, Part 2: Visual amendment 4: Streaming video profile," ISO/IEC JTC1/SC29/WG11 N3904, ISO/IEC 14496-2:1999/FDAM 4, January 2000. [51] M. S. Tomassi and R. D. Jackson, "An evolving SIMD architecture approach for a changing image processing environment," DSP & Multimedia Technology, pp. 1-7, October 1994. [52] R. Cypher and J.L.C. Sanz, "SIMD architecture and algorithms for image processing and computer vision," IEEE Transactions on Acoustics Speech and Signal Processing, vol. 37, no. 12, pp. 2158-2174, December 1989. [53] M. Tremblay et al., "VIS speeds new media processing," IEEE Micro, vol. 16(4), pp. 51-59, August 1996. [54] A. Peleg et al. "Intel MMX for multimedia PCs," Communications of the ACM, 40(1), pp. 25-38, January 1997. [55] Y. Wang and Q. F. Zhu, "Error control and concealment for video communication: A review," Proceedings of the IEEE, vol. 86, no. 5, pp. 974997, May 1998. [56] J. D. Villasenor, Y. Zhang and J. Wen, "Robust video coding algorithms and systems," Proceedings of the IEEE, Special issue on Wireless Video, vol. 87, pp. 1724-1733, October 1999. [57] B. Wah, X. Su and D. Lin, "A survey of error-concealment schemes for realtime audio and video transmissions over the Internet," Proceedings of IEEE International Symposium on Multimedia Software Engineering, December 2000. [58] U. Horn, K. Stuhlmüller, M. Link and B. Girod, "Robust Internet video transmission based on scalable coding and unequal error protection," Image Communication, vol. 15, no. 1-2, pp. 77-94, September 1999. [59] S. Bajaj, L. Breslau and S. Shenker, "Uniform versus priority dropping for layered video," Proceedings of ACM SIGCOMM'98, pp. 131-143, September 1998.
Thesis introduction
65
[60] J-C. Bolot and T. Turletti, "Adaptive error control for packet video in the Internet," Proceedings of the International Conference on Internet Protocols, September 1996. [61] N. Schacham and P. McKenney, "Packet recovery in high-speed networks using coding and buffer management," Proceedings of IEEE INFOCOM '90, pp. 124-131, May 1990. [62] I. Rhee and S. R. Joshi, "FEC-based loss recovery for interactive video transmission – experimental study," ICMCS, vol. 1, pp. 250-256, 1999. [63] R. Marasli, P. D. Amber and P. T. Conrad, "Retransmission-based partially reliable transport service: an analytic model," Proceedings of IEEE INFOCOM '96, pp. 621-628, 1996. [64] M. Ghanbari, "Postprocessing of late cells for packet video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 669678, December 1996. [65] P. Törlind, "Distributed engineering: Tools and methods for collaborative product development," PhD Thesis, Luleå University of Technology, ISSN 1402-1544, October 2002. [66] C. Egido, "Teleconferencing as a technology to support co-operative work: A review of its failures," Intellectual Teamwork, pp. 351-372, Lawrence Erlbaum Associates, 1990. [67] A. Reid, "Comparing the telephone with face-to-face interaction," The social impact of the telephone, pp. 386-414, MIT Press, 1977. [68] A. J. Sellen, "Remote conversations: The effects of mediating talk with technology," Human-Computer Interaction, 10 (4), pp. 401-444, 1995. [69] S. Whittaker and B. O'Connail, "The role of vision in face-to-face and mediated communication," Video-mediated Communication, pp. 23-49, Lawrence Erlbaum Associates, 1997. [70] E. Williams, "Experimental comparisons of face-to-face and mediated communication," Psychological Bulletin, no. 84, pp. 963-976. [71] R. Fish, R. Kraut, R. Root and R. Rice, "Video as a technology for informal communication," Communications of the ACM, no. 36, pp. 48-61. [72] E. Isaacs and J. Tang, "What video can and can't do for collaboration: A case study," Proceedings of the ACM Multimedia 93 Conference, pp. 199-206, 1993. [73] J. Tang and E. Isaacs, "Why do users like video: Studies of multimediasupported collaboration," Computer Supported Co-operative Work, no. 1, pp. 163-196, 1993.
66
Supporting video-mediated communication over the Internet
[74] S. Whittaker, S. Brennan and H. Clark, "Co-ordinating activity: An analysis of computer supported co-operative work," Proceedings of CHI '91 Human Factors in Computing Systems, pp. 361-367, 1991. [75] J. A. Short, "Effects of medium of conversation on experimental negotiation," Human Relations, no. 27, pp. 225-234, 1974. [76] A. Mané, "Group space: The role of video in multipoint videoconferencing and its implications for design," Video-mediated Communication, pp. 401-414, Lawrence Erlbaum Associates, 1997. [77] C. Heath and P. Luff, "Disembodied conduct: Communication through video in a multimedia environment," Proceedings of CHI'91 Human Factors in Computing Systems, pp. 99-103, 1991. [78] J. Tang, E. Isaacs and M. Rua, "Supporting distributed groups with a montage of lightweight interactions," Proceedings of Computer Supported Cooperative Work, pp. 23-34, 1994. [79] P. Törlind and A. Larsson, "Support for informal communication in distributed engineering design teams," Annals of 2002 International CIRP Design Seminar, 2002. [80] R. Fish, R. Kraut, R. Root and R. Rice, "Evaluating video as a technology for informal communication," Proceedings of CHI'92, pp. 37-48, 1992. [81] M. Mantei, R. Baecker, A. Sellen, W. Buxton, T. Milligan and B. Wellman, "Experiences in the use of a media space," Proceedings of CHI'91 Human Factors in Computing Systems, pp. 203-209, 1991. [82] B. Nardi, H. Schwarz, A. Kuchinsky, R. Leichner, S. Whittaker and R. Sclabassi, "Turning away from talking heads: An analysis of 'video-as-data'," Proceedings of CHI'93 Human Factors in Computing Systems, pp. 327-334, 1993. [83] B. Nardi, A. Kuchinsky, S. Whittaker, R. Leichner and H. Schwarz, "'Videoas-data': Technical and social aspects of a collaborative multimedia application," Computer Supported Co-operative Work, no. 4, pp. 73-100, 1996. [84] B. O'Connail, S. Whittaker and S. Wilbur, "Conversations over video conferences: An evaluation of the spoken aspects of video-mediated communication," Human Computer Interaction, no. 8, pp. 389-428, 1993. [85] T. Kurita, S. Iai and N. Kitawaki, "Assessing the effects of transmission delay: Interaction of speech and video," Proceedings of the 14th International Symposium on Human Factors in Telecommunications, pp. 111-121, 1993. [86] E. Isaacs and J. Tang, "Studying video-based collaboration in context: From small workgroups to large organizations," Video-mediated Communication, pp. 173-197, Lawrence Erlbaum Associates, 1997.
Thesis introduction
67
[87] V. Bruce, "The role of the face in communication: Implications for videophone design," Interacting with Computers, no. 8, pp. 166-176, 1996. [88] P. Parnes, "An IP-multicast based framework for designing synchronous distributed multi-user applications on the Internet," PhD Thesis, Luleå University of Technology, ISSN 1402-1544, November 1999. [89] S. McCanne et al., "Toward a common infrastucture for multimedia networking middleware," Proceedings of NOSSDAV'97, May 1997. [90] P. Parnes, K. Synnes and D. Schefström, "Lightweight application level multicast tunneling using mTunnel," Journal of Computer Communication, 1998. [91] D. Pendarakis, S. Shi, D. Verma and M. Waldvogel, "ALMI: An application level multicast infrastructure," Proceedings of the 3rd USNIX Symposium on Internet Technologies and Systems, March 2001. [92] M. W. Siegel, P. Gunatilake, S. Sethuraman and A. G. Jordan, "Compression of stereo image pairs and streams," Stereoscopic Displays and Applications V, pp. 258-268, February 1994. [93] A. Puri, R. V. Kollarits and B. G. Haskell, "Stereoscopic video compression using temporal scalability," Proceedings of SPIE Visual Communications and Image Processing, May 1995. [94] A. Luthra and X. Chen, "MPEG-2 multiview profile for MPEG-2," Proceedings of SPIE/IS&T Multimedia Hardware Architectures, February 1997. [95] J. Ohm, "A realtime hardware system for stereoscopic videoconferencing with viewpoint adaptation," Signal Processing: Image Communication, vol. 14, pp. 147-171, 1998. [96] D. Ioannidis, D. Duchamp and G. Q. Maguire, "IP-based protocols for mobile internetworking," Proceedings of ACM-SIGCOMM'91, 1991.
Paper A

Delay-based flow control for layered multicast applications
Proceedings of the 12th International Packet Video Workshop, Pittsburgh, PA, April 2002
Delay-based flow control for layered multicast applications

Mathias Johanson
Framkom Research Corporation for Media and Communication Technology Sallarängsbacken 2, S-431 37 Mölndal, Sweden
[email protected]
Abstract
This paper presents an approach to flow control for real-time, loss-sensitive, layered multicast applications. The fundamentals of flow control for multicast applications are reviewed and a novel delay-based flow control algorithm is introduced. The basic idea of the algorithm is to react to incipient congestion before packet loss occurs by monitoring variations in the one-way packet transmission delay between sender and receivers. By using a hierarchical representation of the real-time data in combination with a layered multicast transmission model, the flow control algorithm can be implemented entirely in the receivers. Furthermore, by constraining the bandwidth of the layers to a well-defined rate, the congestion control can be accomplished almost entirely without packet loss. This is particularly suitable for real-time multimedia conferencing applications that are inherently multipoint and loss-sensitive. The performance of the flow control algorithm in terms of link utilization, inter- and intra-protocol fairness, session scalability and loss probability is evaluated through extensive simulation.
1 Introduction One of the reasons why the Internet has been so successful in supporting large numbers of simultaneous users is the ability of the network protocols to adapt to changing conditions. Specifically, the transport protocol used for most Internet traffic, TCP, includes a flow control algorithm that adapts the packet transmission pace of the sender so as not to congest the network [1]. The algorithm tries to experimentally find the optimal transmission rate by gradually increasing the rate until packet loss is experienced. However, delay sensitive applications like audio and video conferencing tools do not use TCP because of its poor real-time properties. Rather, these applications use the UDP and RTP protocols, leaving the flow control entirely to the application. In point-to-point configurations, flow control can be implemented by utilizing a rate adaptive coding algorithm, wherein feedback from the receiver is used to periodically adjust the media encoding parameters to match the available bandwidth [2]. For multipoint configurations where the receivers typically are subject to heterogeneous bandwidth limitations a more sophisticated arrangement is needed. One approach is to use audio/video gateways that transcode the media to match the available bandwidth of each receiver. This has the drawback of requiring specialized network configurations and is inherently not very scalable.
Another approach is to use a layered multicast transmission scheme wherein a hierarchical representation of the data is transmitted to a set of multicast group addresses that can be subscribed to individually by the receivers. The number of groups subscribed to determines the bandwidth utilization for each receiver and consequently the quality of the decoded media. In order for multipoint real-time multimedia applications to be realized on a large scale, a flow control algorithm is needed that can adapt the bandwidth of the multicast flows to the network and host resources available for each independent receiver. Since real-time multimedia streams are sensitive to packet loss, it is desirable to have a flow control algorithm that can detect congestion before packet loss occurs. For ease of deployment in existing network environments the flow control should ideally not be dependent on changes to network routers or switches.
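To make the layered multicast transmission model concrete, the fragment below is a minimal sketch of how a receiver can map layer subscriptions onto IPv4 multicast group membership using the standard sockets API. It is not taken from the systems described in this thesis; the group addresses and function names are illustrative assumptions, and a real session would learn the addresses from a session description.

    /* Minimal sketch (not from the thesis software): one multicast group per
     * video layer, joined and left with the standard sockets API. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Illustrative layer-to-group mapping. */
    static const char *layer_group[] = { "239.192.0.1", "239.192.0.2", "239.192.0.3" };

    static int set_membership(int sock, const char *group, int join)
    {
        struct ip_mreq mreq;
        memset(&mreq, 0, sizeof(mreq));
        mreq.imr_multiaddr.s_addr = inet_addr(group);
        mreq.imr_interface.s_addr = htonl(INADDR_ANY);
        return setsockopt(sock, IPPROTO_IP,
                          join ? IP_ADD_MEMBERSHIP : IP_DROP_MEMBERSHIP,
                          &mreq, sizeof(mreq));
    }

    /* Subscribing to n layers means being a member of groups 0..n-1. */
    int join_layer(int sock, int layer)  { return set_membership(sock, layer_group[layer], 1); }
    int leave_layer(int sock, int layer) { return set_membership(sock, layer_group[layer], 0); }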
2 Flow control algorithms for layered multicast Flow control for layered multicast applications is implemented solely in the receivers. By joining and leaving multicast groups as the network load changes the receivers can dynamically adapt to the available bandwidth. The decision of when to join groups, leave groups or remain at the same level is the task of the flow control algorithm. Several approaches have been suggested: A technique generally referred to as receiver-driven layered multicast (RLM) was proposed by McCanne, Jacobson and Vetterli [3]. In this scheme the receivers periodically perform what is known as a join experiment, wherein a receiver tentatively joins an additional multicast group and monitors packet loss to determine whether the additional bandwidth causes congestion. If so, the layer is dropped and the application concludes that the optimal subscription level is reached. If no packet loss is experienced, the application proceeds to subscribe to additional layers until the optimal number of layers is reached. To avoid the implosion of join experiments that would result if all in a potentially large group of receivers performed their join attempts independently, the experiments are coordinated. This is done by having the member that is about to perform a join experiment multicast a message to all the other receivers declaring its intention to perform an experiment for a certain layer. In this way all receivers can determine for themselves whether the experiment caused congestion or not and may not need to perform an experiment of their own. This procedure is known as shared learning. Vicisano, Rizzo and Crowcroft elaborated on this scheme by introducing the concept of synchronization points [4]. In this model receivers are only allowed to perform join experiments immediately after receiving a synchronization packet from the sender. Synchronization packets are sent periodically as flagged packets in the encoded media stream. This proves to be more scalable than the shared learning algorithm of RLM. The problem with these algorithms is that they use packet loss as a congestion detection signal. Since there is no corresponding signal when the network gets unloaded the applications must repeatedly perform join experiments to probe for
available bandwidth. Packet loss caused by the failed join experiments will negatively impact the quality of the received data, not only for the member performing the experiment, but for each member located behind the same bandwidth bottleneck. The problem is further aggravated by the fact that the pruning of the reverse data path to the sender after a multicast leave operation can take a substantial amount of time (up to a few seconds), which means that the congestion caused can be relatively long-lasting. To avoid the negative effects of failed join attempts, the experiments must not be performed too frequently. But, on the other hand, too infrequent experiments have serious implications for the rate of convergence to the optimal operating point and make the application less responsive to bandwidth fluctuations. What is needed is a way of telling that the network is becoming congested before packet loss is experienced. At the onset of congestion, queues start to build up at network routers, leading to an increased end-to-end delay. Several congestion avoidance algorithms for TCP (most notably TCP Vegas [5]) have been proposed based on reacting to changes in the round-trip time (RTT), measured from the time a segment of data is sent until it is acknowledged by the receiver [5, 6, 7]. Wu et al. proposed a layered multicast transmission architecture called ThinStreams that, in the spirit of TCP Vegas, uses the difference between the expected throughput and the actual throughput as a means to detect congestion [8]. To calculate the expected throughput the ThinStreams algorithm requires a constant bitrate for each multicast layer. This paper suggests an approach to layered multicast congestion avoidance based on direct measurements of packet delay variations. Unlike the ThinStreams approach it does not require a constant bitrate for the layers and hence imposes fewer restrictions on the layered media encoding.
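For reference, the congestion test introduced by TCP Vegas, and adapted by ThinStreams, can be sketched as follows. The variable names and the explicit alpha/beta thresholds follow the usual description of Vegas [5]; the function signature and any concrete threshold values are illustrative assumptions.

    /* Sketch of the TCP Vegas congestion test: the gap between the throughput
     * expected with empty queues and the throughput actually achieved estimates
     * how much data the connection has queued in the network.
     * Returns +1 = congestion suspected, -1 = spare capacity, 0 = hold. */
    int vegas_test(double window_bytes, double base_rtt, double current_rtt,
                   double alpha, double beta)
    {
        double expected = window_bytes / base_rtt;    /* throughput if nothing queued */
        double actual   = window_bytes / current_rtt; /* throughput actually observed */
        double diff     = expected - actual;

        if (diff > beta)  return +1;
        if (diff < alpha) return -1;
        return 0;
    }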
3 Delay-based layered multicast flow control In order for the flow control algorithm to be able to respond to congestion before packet loss occurs, the variations in packet transmission delay can be used to detect congestion. An increasing delay indicates that router buffers are filling up and must be responded to by lowering the effective bandwidth. Similarly, a delay that has decreased below some threshold indicates that it might be possible to increase the bandwidth. The rate control is performed by the receivers by joining and leaving multicast groups as appropriate. To avoid packet loss the increase in bandwidth resulting from joining an additional group must be small enough for the network to buffer the excessive packets for the time it takes the receivers to detect the congestion and respond to it by leaving the group. This time is prolonged by the fact that the packet forwarding will proceed at multicast routers until the prune message of the leave operation is propagated back through the reverse multicast path. By carefully assigning an upper limit to the bandwidth of each layer (corresponding to a multicast group), packet loss as the result of joining an additional group can be avoided. To compute this bandwidth limit assume that Q
is the minimum queue size in use on the network and that L is the leave latency. Then the bandwidth limit B is
B ≤ Q / L.

If we conservatively assume Q to be 5 Kbytes and L to be 2 seconds we get a bandwidth limit of 20 kilobits per second (kbps). The organization of data into layers at the transmitter should thus be made with a granularity of approximately 20 kbps. For real-time multimedia data this granularity is probably sufficiently small, since the improvement in perceived quality from a refinement signal on the order of 20 kbps is likely to be rather moderate for both audio and video.
Given the above data organization and the layered multicast transmission architecture, what we now need is a way to monitor variations in packet delay. Recall that in TCP Vegas the round-trip time is used to measure the variations in throughput. For multicast transmission, however, a round-trip delay cannot be computed, since the network path from a sender to a receiver is not generally the same as the path from the receiver to the sender and thus cannot give a reliable measure of the buffering in the multicast data path. Nevertheless, the variations in transmission delay can be measured by a scheme involving timestamping the packets at the transmitter and clocking the arrival times of packets at the receivers.
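Before turning to the delay estimation, the layer bandwidth limit derived above can be written as a small worked example; the function name and the choice of bytes and seconds as units are illustrative assumptions.

    /* B <= Q / L: the largest per-layer bandwidth for which a failed join can
     * be absorbed by router buffers during the multicast leave latency. */
    double layer_bandwidth_limit_bps(double queue_bytes, double leave_latency_s)
    {
        return queue_bytes * 8.0 / leave_latency_s;
    }
    /* Example: 5 Kbytes of buffering and a 2 s leave latency give
     * layer_bandwidth_limit_bps(5000.0, 2.0) = 20000 bps, i.e. 20 kbps. */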
3.1 Variable transmission delay estimation The one-way transmission delay from a source to a receiver can be seen as consisting of two parts: the fixed propagation delay and the variable delay due to buffering. The variable delay that interests us can be determined in the following way. Let the source put a timestamp in every packet that reflects the sending time of that packet. Then the variable delay for packet i, δi, is δi = (ri − r0 ) − (ti − t0 ) ,
where ri and ti are the arrival and sending times of packet i respectively. Note that the delay calculations are performed only by the receiver and that the values of ti are determined from the timestamp in packet i. For the algorithm to give a reliable estimation of the variable delay the first (reference) packet must be transmitted when the network is uncongested, that is δ0 = 0. A reference packet with a non-zero variable delay will result in negative variable delays once the network gets uncongested. This is an indication that the values of t0 and r0 must be reassigned (i.e. a new reference packet is chosen). The RTP protocol, that is used for fragmentation of audio and video data into UDP packets, defines a packet header that includes a timestamp field, primarily intended for things like playout scheduling and cross-media synchronization. The
recommended clock frequency of the RTP timestamps is 90 kHz for video content and 8 kHz for audio [9]. The variations in transmission delay are typically in the order of 10 to 100 ms, so both clock frequencies are of sufficiently high resolution for the delay estimation. (For example, a 10 Kbytes router buffer and a wire speed of 1 Mbps gives a maximum delay of 80 ms.) To prevent measurement noise from impacting the join/leave decision algorithm, the packet delay estimation should be calculated as a running average over a number of measurements. That is, the delay estimation for the i:th packet, δ̂_i, is given by

δ̂_i = (1/N) Σ_{k=0}^{N−1} δ_{i−k},
where N is the number of delay measurements used to compute the average. In the simulations and the implementation presented in this paper, a value of N=20 was used. Note that the algorithm relies heavily on the fact that the sender's and receiver's system clocks are isochronous (that is, that they tick at the same speed). This could potentially be a serious deployment problem, since workstation clocks are frequently badly tuned. Note also that the algorithm does not require the clocks to have the same conception of absolute time. The issue of clock synchronization is beyond the scope of this paper, but techniques exist to synchronize clocks (both in terms of absolute time and clock frequency) down to microsecond precision [10].
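A minimal sketch of this estimator is given below, assuming a 90 kHz RTP media clock and a local clock reporting arrival times in seconds; the structure and function names are illustrative and not taken from the thesis implementations (RTP timestamp wraparound is ignored for brevity).

    /* Sketch: per-packet variable-delay estimate from RTP timestamps, with a
     * running average over the last N samples (N = 20 as in the paper). */
    #define N_SAMPLES 20

    struct delay_estimator {
        double ref_send;            /* t0: media time of reference packet (s)   */
        double ref_recv;            /* r0: arrival time of reference packet (s) */
        int    have_ref;
        double window[N_SAMPLES];   /* last N variable-delay samples (s)        */
        int    count;
    };

    /* rtp_ts is the RTP timestamp (90 kHz video clock assumed); arrival is the
     * local receive time in seconds.  Returns the smoothed variable delay and
     * re-anchors the reference packet if the delay goes negative. */
    double update_delay(struct delay_estimator *e, unsigned rtp_ts, double arrival)
    {
        double send = rtp_ts / 90000.0;
        if (!e->have_ref) {
            e->ref_send = send; e->ref_recv = arrival; e->have_ref = 1;
        }
        double d = (arrival - e->ref_recv) - (send - e->ref_send);
        if (d < 0.0) {              /* reference was taken on a loaded network */
            e->ref_send = send; e->ref_recv = arrival; d = 0.0;
        }
        e->window[e->count++ % N_SAMPLES] = d;
        int n = e->count < N_SAMPLES ? e->count : N_SAMPLES;
        double sum = 0.0;
        for (int i = 0; i < n; i++) sum += e->window[i];
        return sum / n;
    }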
3.2 Fairness In order for layered multicast applications to be successfully realized in existing network environments it is important that the flow control algorithm adjusts the rate of the traffic so that the application competes in a fair way for bandwidth with other applications. To this end one can distinguish three different fairness issues that can be considered crucial: 1. fairness among members of the same layered multicast session, 2. fairness among different sessions of the same layered multicast application, 3. fairness to TCP. Fairness among members of the same session and among members of different layered multicast sessions can be realized by adjusting the threshold delay values used in the algorithm to decide whether to join multicast groups, leave multicast groups, or remain at the same subscription level. By decreasing the leave threshold and the join threshold with increasing layer numbers, receivers at lower subscription levels will be more inclined to join new layers and less inclined to drop layers compared to receivers at higher subscription levels. This means that
on a heavily loaded network connection with many competing sessions, the receivers subscribed to more layers will be more responsive to increased packet delays and hence will make a leave decision sooner than receivers at lower subscription levels. Similarly, at decreasing network load, receivers at lower subscription levels will join groups before receivers at higher levels. This will lead to a fair sharing of the available bandwidth between the members of a session and between the sessions, provided that all sessions use the same flow control algorithm.
Fairness to TCP's flow control is important since the bulk of network applications in use on the Internet are based on TCP. The throughput of a TCP session can be shown to be inversely proportional to the product of the round trip time (RTT) and the square root of the packet loss rate [11]. Since the throughput of the layered multicast flow control presented in this paper is independent of the packet loss rate and the round trip time, the concept of fairness to TCP is not well-defined. Furthermore, since the types of applications targeted by the multicast flow control are very different from the "typical" TCP application, the relative performance exhibited by competing TCP sessions is not immediately an appropriate benchmark. For instance, two TCP sessions with different RTTs will allocate the bandwidth of a shared bottleneck unevenly. While this "unfairness" can be justified in the TCP case, it does not make much sense for two participants of a multicast videoconference, sharing a bandwidth bottleneck, to receive the video at different rates depending on the distance to the sender. The important point to be stressed is that real-time multimedia data need to be rate-controlled in some way in order to coexist with TCP on congested links. This behavior is sometimes referred to as TCP-friendliness.
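The inverse dependence on RTT and loss rate that underlies TCP-friendliness can be written out as a small helper. The constant below follows the commonly cited simplified throughput model of Mathis et al. [11]; the function name and the example figures are illustrative assumptions.

    #include <math.h>

    /* Simplified TCP throughput model (Mathis et al. [11]):
     *   rate ≈ (MSS / RTT) * C / sqrt(p),  with C = sqrt(3/2) ≈ 1.22.
     * mss in bytes, rtt in seconds, p the packet loss probability;
     * the result is in bits per second. */
    double tcp_friendly_rate_bps(double mss, double rtt, double p)
    {
        const double C = sqrt(3.0 / 2.0);
        return (mss * 8.0 / rtt) * C / sqrt(p);
    }
    /* Example: mss = 1460 bytes, rtt = 0.1 s, p = 0.01 gives roughly 1.4 Mbps. */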
3.3 The join/leave decision algorithm

The flow control algorithm implemented by each receiver of a layered multicast session uses the measured queuing delay, δ̂_i, as an indication of whether the layer subscription level should be increased or decreased. By considering not only the magnitude of the delay but also the rate of change, the algorithm can respond earlier to impending congestion. Since the algorithm responds to congestion by leaving a multicast group, the effect of lowered bandwidth is not manifested until the multicast delivery tree is pruned back to the sender. Thus, in order to be able to respond in time, the algorithm needs to predict the congestion level at some time ahead determined by the leave latency. If y(t) is the queuing delay at time t and L is the leave latency then the queuing delay at time t + L can be predicted by

y(t + L) = y(t) + Ly'(t).

Now, in order to prevent loss,

y + Ly' < M,
where M is the maximum queuing delay in the network. The value of M can be learned experimentally by initializing it to a conservatively small value and adjusting it whenever a larger delay is experienced. The leave latency can also be found experimentally, by using the algorithm described by Wu et al. [8]. Alternatively, a preconfigured upper limit can be used. The algorithm recurrently computes y(t+L) and whenever the value is above a certain limit (the leave threshold) a layer is dropped. To decide when to join an additional layer the algorithm uses the value of y(t) directly, instead of the predicted y(t+L). This asymmetry is due to the fact that the join decisions should not be made in a way that keeps the network in a constantly congested state. Whenever the value of y(t) is below the join threshold an additional multicast group is subscribed to.
In order to ensure inter- and intra-session fairness, as discussed in section 3.2, the join and leave thresholds should depend on the layer subscription level. The threshold functions are designed in a way that ensures that all members of the same session sharing a bandwidth bottleneck eventually converge to roughly the same number of layers. Following the discussion in section 3.2 it is clear that both the join and leave thresholds should decrease with increasing layer subscription level. In the current implementation the join and leave threshold values are calculated using functions that decrease quadratically with the number of layers joined. The range of the join threshold function is from zero to 75 percent of the maximum delay whereas the range of the leave threshold is from 65 to 100 percent. The appropriateness of using these functions and parameter values was determined experimentally from simulation results.
3.4 Scheduling the join/leave operations Since the flow control algorithm is designed to detect and respond to congestion before packet loss occurs, there is no need to synchronize the join operations from different receivers of the same session. The situation is different for algorithms that detect congestion from packet loss, since uncoordinated join attempts in this case will lead to constant congestion and packet loss. However, members of different sessions sharing the same bottleneck link can cause packet loss if they join new groups simultaneously. This is because the aggregate bandwidth change can be larger than what the network can buffer if many receivers join layers at the same time. To prevent this from happening the join operations performed by receivers in different sessions need to be decorrelated. Since a strict decorrelation is hard to realize without negatively affecting scalability and convergence time, a reasonable approximation can be achieved by scheduling the join operations using a pseudo-randomized timer. By having the timer interval increase as more layers are joined, the applications are allowed to converge relatively fast to a reasonable quality level. Since the leave operation does not have the desired effect (of lowered congestion) until all members subscribed to the same layer leave, the leave operations for members of the same session at the same subscription level should
ideally be synchronized. However, all members sharing a bandwidth bottleneck will experience the same variations in packet delay and therefore the leave operations will be reasonably synchronized automatically if only the leave decisions are scheduled frequently enough. Given that the effect of the leave is not manifested in a lowered packet delay until the multicast tree is pruned back to the sender, the receivers must defer their next leave decision for a time equal to the leave latency to avoid dropping more than one layer in response to the same congestion signal. This is easily implemented with a hold-down timer after a leave.
3.5 The delay-based layered multicast flow control algorithm

The algorithm outlined above can be described by the following pseudo-code segment:

    y  = current queuing delay
    y' = rate of change of y
    M  = maximum delay
    n  = number of layers joined
    N  = maximum number of layers
    t  = current time
    L  = leave latency

    join_threshold  := 0.75*M*(1 - sqrt(n/N))
    leave_threshold := M*(0.65 + 0.35*(1 - sqrt(n/N)))

    if ( y + L*y' > leave_threshold and t > leave_timeout )
        drop_layer(n)
        n := n-1
        leave_timeout := t + L

    if ( y < join_threshold and t > join_timeout )
        add_layer(n+1)
        n := n+1
        join_timeout := t + (n/N + random(0, 0.5))*L
The procedures add_layer(n) and drop_layer(n) are assumed to implement the joining and leaving of multicast groups corresponding to layer n. The random(x, y) function is assumed to return a random value between x and y.
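For concreteness, the decision step above translates into roughly the following C routine. This is a sketch under the assumption that delays and times are kept in seconds and that M is learned on the fly as described in the text; add_layer and drop_layer stand for the application's group join/leave operations, and their signatures are assumptions rather than part of the thesis software.

    #include <math.h>
    #include <stdlib.h>

    extern void add_layer(int layer);   /* hypothetical group-join helper  */
    extern void drop_layer(int layer);  /* hypothetical group-leave helper */

    struct fc_state {
        int    n, N;           /* layers joined, maximum number of layers  */
        double M;              /* largest queuing delay seen so far (s)    */
        double L;              /* leave latency (s)                        */
        double join_timeout;   /* earliest time for the next join (s)      */
        double leave_timeout;  /* earliest time for the next leave (s)     */
    };

    /* One periodic invocation of the join/leave decision; y is the smoothed
     * queuing delay, yprime its rate of change and t the current time. */
    void flow_control_step(struct fc_state *s, double y, double yprime, double t)
    {
        if (y > s->M) s->M = y;                       /* learn M experimentally */

        double join_thr  = 0.75 * s->M * (1.0 - sqrt((double)s->n / s->N));
        double leave_thr = s->M * (0.65 + 0.35 * (1.0 - sqrt((double)s->n / s->N)));

        if (y + s->L * yprime > leave_thr && t > s->leave_timeout && s->n > 0) {
            drop_layer(s->n);
            s->n -= 1;
            s->leave_timeout = t + s->L;              /* hold-down after a leave */
        } else if (y < join_thr && t > s->join_timeout && s->n < s->N) {
            add_layer(s->n + 1);
            s->n += 1;
            s->join_timeout = t +
                ((double)s->n / s->N + 0.5 * rand() / (double)RAND_MAX) * s->L;
        }
    }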
4 Simulation results The behavior of the flow control algorithm described in section 3 has been simulated using the network simulator ns [12]. The topologies used for the simulations are depicted in Figure 1. Each simulation used a packet queue length
of 20 packets and a dense multicast routing protocol. The transmission delays on the links were 10 ms unless otherwise noted.

Figure 1 Topologies used in simulations. Topology 1: one sender and one receiver connected by a 128 Kbit/s link. Topology 2: one sender and n receivers connected by links with different bandwidths. Topology 3: one sender and 3n receivers connected at three different bandwidth levels. Topology 4: n senders with one receiver each, sharing a B Kbit/s bottleneck link.
4.1 Link utilization and intra-session fairness The first simulation was performed using the simplest possible topology: one layered multicast sender and one receiver connected by a point-to-point link (topology 1 in Figure 1). The aim of the simulation was to test the link utilization on a network connection with no intervening traffic. The sender transmits ten layers of approximately 20 kbps each, resulting in a total bandwidth requirement of 200 kbps. The link bandwidth is 128 kbps, so theoretically the receiver should be able to receive six layers (6*20 = 120 kbps) without congesting the network. Figure 7 shows how the algorithm quickly joins seven layers before the network becomes congested. Then two layers are dropped in response to increased packet delay and throughout the simulation the receiver oscillates between the sixth and the seventh layer. This shows that the algorithm indeed utilizes the available bandwidth as expected. The simulation was conducted without packet loss.
Figure 7 Number of multicast groups joined by the application A slightly more complex situation is given by topology 2. Here n receivers are connected at different link speeds to a sender with the same characteristics as in the previous simulation. This topology was used to test the algorithm's ability to converge to different bandwidths in a heterogeneous network environment. Figure 8 shows the bandwidth allocation resulting from a simulation with three receivers (n=3) and a 256 kbps capacity of the shared link (B=256). The network path to receivers R1, R2 and R3 were 256, 128 and 64 kbps respectively. The expected result is that R1 should be able to receive all ten layers of the transmission, whereas R2 and R3 should converge to six and three layers respectively. The results of this simulation indicate that different receivers of the same session can converge to different bandwidths. No packet loss was experienced on any of the links.
Figure 8 Bandwidth consumed by three members of the same session A configuration with one sender and three sets of n receivers located behind bottleneck links is given by topology 3. The resultant bandwidth utilization when the sender transmits 20 layers of 20 kbps each and a value of n=5 is depicted in Figure 9. The receivers can be seen to converge to three distinct bandwidth levels; the five receivers of the uppermost cluster in topology 3 receive the full 400 kbps (all 20 layers), whereas the rightmost five receivers are limited by the shared 256 kbps bottleneck, and the lowermost five receivers are confined to 128 kbps. Again, the simulation was concluded without any packet loss.
Figure 9 Bandwidth consumed by the 15 receivers of topology 3
4.2 Inter-session fairness To investigate the performance of multiple independent sessions sharing the same bandwidth bottleneck, a large number of simulations were conducted using topology 4, with different values for the number of sessions, n, and the bottleneck bandwidth, B. Figure 10 shows the result of a configuration with two senders, S1 and S2, and one receiver for each session, R1 and R2, with a bottleneck bandwidth of 256 kbps. Both senders transmit ten 20 kbps layers resulting in an aggregate bandwidth requirement of 400 kbps for the shared link. Receiver R1 is started first and initially joins all ten layers resulting in an allocation of 200 kbps out of the available 256 kbps. Then, after approximately ten seconds, receiver R2 is started and the two receivers can be seen in Figure 10 to converge to approximately 128 kbps each. Thus, a fair sharing of the bottleneck bandwidth is achieved. Similar results were obtained when simulating using topology 3 for many different values of n and B. The algorithm approximately allocated the bandwidth B/n to each session. Hence the algorithm can be seen to share network resources in a fair way among independent sessions.
Figure 10 Bandwidth allocated by two members of different sessions
4.3 Scalability
The primary motivation for delay-based flow control is that the packet loss rate can be reduced compared to loss-based algorithms. The simulations involving only one sender and many receivers can be performed entirely without packet loss resulting from congestion. This is not surprising, since the flow control algorithm was designed to predict and react in time to pending congestion, provided that the layers of the encoded media are sufficiently narrowband. If more than one layered multicast session is active simultaneously, however, the increase in bandwidth resulting from two or more receivers of different sessions joining simultaneously can be higher than what the router queues can withstand. In order to investigate the scalability of the algorithm when the number of sessions grows, a number of simulations were conducted using topology 4 with increasing values of n. The bottleneck bandwidth, B, was scaled in proportion to the number of sessions for each simulation. Figure 11 illustrates the average and worst-case loss rate performance. The loss rates were computed in non-overlapping windows one second wide. As can be seen, the average loss rate is about 1%, whereas the worst-case loss rate is about 2% of the total bandwidth. In comparison, McCanne et al. report a short-term worst-case loss rate of about 10% for RLM and a long-term loss rate of about 1% [3]. Vicisano et al. report loss rates of about 7-8% for a simulation with 32 senders using their TCP-like congestion control scheme [4].

(Plot: loss rate in percent versus number of sessions, with worst-case and average curves.)
Figure 11 Loss rate when superpositioning independent sessions
4.4 TCP friendliness Figure 12 shows the bandwidth of a layered multicast session simulated using topology 1 with the addition of competing TCP traffic. The TCP traffic consisted of one FTP session and ten Telnet sessions. When the simulation is started the layered multicast session can be seen to allocate all the available bandwidth of the link. Then, after 10 seconds, the FTP file transfer and the Telnet sessions are started that lasts for approximately two minutes. It is clear that the multicast
session yields bandwidth in favor of the TCP sessions. After the file transfer has ended the multicast session regains the full bandwidth.
Figure 12 Bandwidth consumed by a layered multicast receiver in the presence of TCP traffic

When simulating this scenario with a 10 ms delay on the shared connection, as in Figure 12, the bandwidth is shared evenly between the multicast traffic and the TCP traffic. However, since the performance of TCP is dependent on the RTT, a less fair sharing will be obtained if the delay is increased. To compare the bandwidth allocation of TCP with that of our multicast flow control we perform the same simulation as above but with different values of the link delay. Then we calculate the fairness index, defined as the ratio of the bandwidth allocated by the multicast application to the bandwidth allocated by TCP. The result is plotted in Figure 13 for both droptail and RED routers.

(Plot: fairness index versus link delay, 0 to 200 ms, with curves for DropTail and RED routers.)

Figure 13 TCP friendliness index

The multicast flow control obviously gets more aggressive in terms of bandwidth allocation compared to TCP when the link delay increases. For a 200 ms link delay the multicast session allocates almost three times as much bandwidth as the TCP sessions when using droptail routers. For RED routers the multicast session is favored even more in terms of bandwidth allocation. This is an expected
finding since RED routers will drop packets before router buffers are filled leading to an earlier response from TCP's congestion avoidance, whereas the multicast flow control is unaffected.
5 Summary and conclusions Large-scale deployment of multipoint real-time conferencing applications in heterogeneous network environments requires sophisticated flow control. The flow control algorithm must be scalable to a large number of users, efficient in terms of resource utilization, fair to other data streams, adaptive to changing network conditions, and relatively light-weight for ease of implementation. In this paper, an approach to flow control for layered multicast applications was presented that relies on packet delay measurements to detect and avoid congestion. The algorithm was shown by simulation to interoperate in a fair way, in terms of resource allocation, among members of the same session as well as between instances of different sessions. Furthermore, the overall packet loss rate was seen to be very moderate when superpositioning independent sessions. The behavior of the algorithm in presence of competing TCP traffic was seen to be TCP-friendly for low delay links and increasingly favorable for the multicast traffic at higher link delays. Further work will be needed to study the behavior of the algorithm in more complex network topologies and with larger sessions.
References
[1] V. Jacobson, "Congestion avoidance and control," Proceedings of ACM SIGCOMM'88, August 1988.
[2] J. C. Bolot and T. Turletti, "A rate control mechanism for packet video in the Internet," Proceedings of IEEE INFOCOM'94, June 1994.
[3] S. McCanne, V. Jacobson and M. Vetterli, "Receiver-driven layered multicast," Proceedings of ACM SIGCOMM'96, August 1996.
[4] L. Vicisano, L. Rizzo and J. Crowcroft, "TCP-like congestion control for layered multicast data transfer," Proceedings of IEEE INFOCOM'98, March 1998.
[5] L. Brakmo, S. O'Malley and L. Peterson, "TCP Vegas: New techniques for congestion detection and avoidance," Proceedings of ACM SIGCOMM'94, pp. 24-35, May 1994.
[6] R. Jain, "A delay-based approach for congestion avoidance in interconnected heterogeneous computer networks," ACM Computer Communication Review, October 1989.
[7] Z. Wang and J. Crowcroft, "Eliminating periodic packet losses in 4.3 Tahoe BSD TCP Congestion Control Algorithm," ACM Computer Communication Review, April 1992.
[8] L. Wu, R. Sharma and B. Smith, "ThinStreams: An architecture for multicasting layered video," Proceedings of NOSSDAV'97, May 1997.
[9] H. Schulzrinne, "RTP profile for audio and video conferences with minimal control," IEEE RFC1890, January 1996. [10] D. L. Mills, "Network time protocol (version 3) specification, implementation and analysis," IEEE RFC1305, March 1992. [11] M. Mathis, J. Semke, J. Mahdavi and T. Ott, "The macroscopic behaviour of the TCP congestion avoidance algorithm," Computer Communications Review, vol. 27, no. 3, July 1997. [12] S. McCanne and S. Floyd, "The LBNL Network Simulator," Software on-line, http://www-nrg.ee.lbl.gov/ns. [13] T. Turletti, S.F. Parisis and J. Bolot, "Experiments with a layered transmission scheme over the internet," Proceedings of IEEE INFOCOM'98, February 1998. [14] T. Turletti and J. C. Bolot, "Issues with multicast distribution in heterogenous packet networks," Proceedings of the 6th international Workshop on Packet Video, September 1994. [15] J. C. Bolot, T. Turletti and I. Wakeman, "Scalable feedback control for multicast video distribution in the internet," Proceedings of ACM SIGCOMM'94, August 1994.
Paper B A scalable video compression algorithm for real-time Internet applications
Proceedings of the Fourth EURASIP Conference on Video / Image Processing and Multimedia Communications, Zagreb, Croatia, July 2003.
A scalable video compression algorithm for real-time Internet applications

Mathias Johanson
Framkom Research Corporation for Media and Communication Technology Sallarängsbacken 2, S-431 37 Mölndal, Sweden
[email protected]
Abstract
Ubiquitous use of real-time video communication on the Internet requires adaptive applications that can provide different levels of quality depending on the amount of resources available. For video coding this means that the algorithms must be designed to be scalable in terms of bandwidth, processing requirements and quality of the reconstructed signal. This paper presents a novel video compression and coding algorithm targeted at delay-sensitive applications in heterogeneous network and computing environments. The algorithm, based on the embedded zerotree wavelet algorithm for still image compression, generates a highly scalable layered bitstream that can be decoded at different qualities in terms of spatial resolution, frame rate and compression distortion. Furthermore, the algorithm is designed to require only a minimal coding delay, making it suitable for highly interactive communication applications like videoconferencing. The performance of the proposed algorithm is evaluated by comparison with a non-scalable codec and the penalty in compression efficiency that the scalability requirement imposes is analyzed. The codec is shown to produce a scalable bitstream ranging from about 10 kbps to 10 Mbps, while the computational complexity is kept at a level that makes software implementation on CPU-constrained equipment feasible.
1 Introduction

The evolution of the Internet has enabled a new class of synchronous multimedia communication applications with high demands on delay and bandwidth. Not only does this affect network and transport protocols, but it also has a profound impact on the design of media encoding and compression algorithms. For teleconferencing applications the coding delay must be kept at a minimum while a high compression performance is maintained to efficiently utilize the available bandwidth. For videoconferencing this is of particular importance due to the high bandwidth and complexity imposed by video transmission and processing. Furthermore, since the Internet is a highly heterogeneous environment, both in terms of link capacity and terminal equipment, video codecs need to be able to generate bitstreams that are highly scalable in terms of bandwidth and processing requirements. As the current Internet provides only a single class of service, without guarantees on bandwidth or loss rate, the applications need to be adaptive to variations in throughput and loss probability. The dissimilar requirements imposed by different applications and the heterogeneity problems have given birth to a multitude of video compression algorithms with different target bitrates, complexities and qualities.
Multipoint videoconferences, where the participants in the general case are subject to different bandwidth constraints, can be realized using specialized gateways that transcode the video to different bandwidths. This is problematic since it introduces delay and complexity and limits scalability. Furthermore, in a network environment without strict quality of service guarantees, where the instantaneous load is not predictable, it is hard to identify where the transcoding gateways should be placed. Another approach is to encode the media using a hierarchical representation that can be progressively decoded and assign the layers of the encoded signal to a set of distinct multicast addresses [1, 2, 3, 4]. In this layered multicast transmission architecture each receiver individually chooses a quality suitable for the network and computing resources available, by joining an appropriate number of IP multicast groups. This is the target application for the video compression algorithm presented in this paper. While scalable encoding schemes based on the standard video compression algorithms have been designed (MPEG-2 scalable profile [5], H.263+ [6]), the scalability requirement has clearly been added as an afterthought, resulting in high complexity and suboptimal performance. The goal of the work presented here is to design a compression algorithm with the scalability property as one of the fundamental requirements.
2 Layered video compression algorithms A layered video coding is a representation that splits a digital video signal into a number of cumulative layers such that a progressively higher quality signal can be reconstructed the more layers are used in the decoding process. The layering can be performed in three ways, viz. spatial layering, temporal layering and layered quantization (also known as signal-to-noise-ratio scalability). In spatial layering the video can be reconstructed at successively higher spatial resolutions, while temporal layering implies that the frame rate of the video sequence can be progressively increased. In signal-to-noise-ratio (SNR) layering the quantization of the video images is refined with each added layer. While all three techniques result in a layered bitstream, the characteristics of the layering techniques are very different and address different aspects of the heterogeneity problem. With spatial layering the resolution of the decoded video images can be chosen depending on the resolution of the display. Temporal layering provides different levels of frame update dynamics in the video, whereas SNR scalability varies the compression distortion of each individual frame. The type of layering that is most suitable depends on the application, on user preference and on the level of overall scalability desired. A good scalable video codec should ideally provide all three types of layering simultaneously so that each decoder can individually trade-off between spatial resolution, temporal resolution and fidelity, given a certain resource limit. Thus the three layering techniques should be orthogonal so that they can be applied independently. The key challenge when designing a layered video compression algorithm is to keep the compression efficiency high while providing a high level of scalability. Intuitively, a non-scalable codec should perform more efficiently compared to a
scalable at a given bandwidth or distortion level. This assumption was verified by Equitz and Cover, who proved that a progressive encoding can only be optimal if the source possesses certain Markovian properties [7]. Nevertheless, a number of layered video codecs have been proposed. The scalable mode of H.263+ defines a layered codec that provides all three modes of layering discussed above. In H.263, as well as in the MPEG video coding standards, spatial redundancies within individual images are reduced by the discrete cosine transform applied to eight-by-eight pixel blocks. Predictive coding with motion compensation is performed in the pixel domain with reference to a past frame, or bi-directionally with reference to both past and future frames. Spatial scalability is achieved by subsampling each frame until the desired resolution of the base layer is reached. The low-resolution image thus achieved is compressed using predictive coding and DCT, whereupon the frame is decompressed and upsampled so that an error signal constituting the enhancement layer can be computed. The process is repeated for each enhancement layer. SNR scalability is achieved in basically the same manner, except that instead of resampling the frames the binsize of the quantizer is refined at each level. Temporal scalability is achieved by assigning bi-directionally predicted pictures to the refinement layers to increase the frame rate of the decoded video. The scalable modes of MPEG and H.263+ work in basically the same way. The basic problem is that the prohibitively high complexity introduced limits the total number of layers that are feasible. Also, the efficiency of the coding can be expected to be far worse compared to the baseline algorithm, although very few experimental results have been published.
Another class of scalable video codecs is based on the discrete wavelet transform (DWT). Since the DWT, when used to reduce spatial redundancies in image compression, is applied to the whole image as opposed to the block-based DCT, the algorithm provides a multiresolution image representation without the need for an explicit subsampling operation. The wavelet coefficients can be progressively quantized in the same way as is performed in the block-based algorithm. Alternatively, Shapiro's Embedded Zerotree Wavelet (EZW) coding [8] or Said and Pearlman's related algorithm based on set partitioning in hierarchical trees (SPIHT) [9] could be used to successively refine the coefficients. The key issue in wavelet-based video codec design is how to exploit the temporal correlation between adjacent images to increase compression performance. One approach, pioneered by Taubman and Zakhor, is to extend the 2D DWT to three dimensions and apply the transform in the temporal dimension as well [10, 11, 12, 13, 14]. Apart from a dramatic increase in computational complexity, this approach also has the drawback of requiring images to be buffered prior to transmission, for at least as many frames as there are iterations of the wavelet transform. This generates an unacceptable coding delay for delay-sensitive applications. Another approach is to perform predictive coding and motion compensation in the pixel domain and then to compress the residual images using the DWT. This scheme is inherently incompatible with the scalable layering mode, however, since a full resolution frame needs to be decoded before motion compensation can be performed. Another problem is that block-based
motion compensation often results in blocking artifacts at high quantization ratios. This can be avoided by using overlapping block motion compensation [15, 16]. Yet another approach is to perform predictive coding and motion compensation in the transformed domain. The main obstacle with this type of coding lies in the fact that the wavelet transform is translationally variant, causing motion compensation to perform poorly [17, 18]. A remedy for this is to apply an antialiasing filter to the wavelet coefficients prior to motion estimation [19]. Needless to say, this increases the already high complexity associated with motion compensation.

The algorithm presented in this paper is targeted at multipoint videoconferencing in heterogeneous environments. With this application in mind the following assumptions have guided the design:

1. Coding delay is of paramount importance. The coder is therefore not allowed to buffer frames in order to process two or more frames as a unit prior to transmission. Consequently, only temporal prediction with respect to previous frames is permissible, not bi-directional prediction. Three-dimensional subband transform coding is not viable either.

2. The video content is assumed to be reasonably static, without camera pans and with limited scene motion. With this assumption the high complexity of motion compensation cannot be justified, and it is therefore omitted.

3. The encoding should support a hybrid of spatial, temporal and SNR scalability, enabling each receiver to trade off between resolution, frame rate and distortion.

4. The algorithm should be reasonably lightweight so that software-only implementation on general-purpose processors is feasible.
3 A new wavelet-based video coding with low delay In an attempt to leverage off the excellent scalability and compression performance for still images provided by Shapiro's Embedded Zerotree Wavelet coding (EZW) [8], an extension to the EZW algorithm to also exploit temporal correlations between wavelet coefficients of a video sequence has been developed. We call this novel algorithm EZWTP, for Embedded Zerotree Wavelet coding with Temporal Prediction. In order to explain the algorithm let us first recapitulate Shapiro's classical EZW algorithm for still image compression.
3.1 Embedded zerotree wavelet coding for still images The first step of EZW coding is to decompose the input image into an octave-band pyramid, using a 2D DWT. A two-level decomposition of an image is shown in Figure 1. The EZW algorithm produces an embedded bitstream (in the sense that it can be truncated at any point to achieve any desired bitrate) by ordering the coefficients of the subbands so that the most significant coefficients are encoded first. Then the correlation between corresponding coefficients in subbands of the
same orientation is exploited by introducing the concept of the zerotree data structure. The zerotree data structure, illustrated in Figure 2, is an association between coefficients in subbands of the same spatial orientation in a tree-like structure. Each coefficient, with the exception of the coefficients of the lowest frequency subband and the three highest-frequency subbands, is considered to be the parent of four coefficients (the children) at the next finer scale with the same spatial orientation. For a given parent, the set of child coefficients, the children's children and so on are called zerotree descendants. The highest frequency coefficients have no children and are thus never parents. The coefficients of the lowest frequency subband each have three child coefficients at the corresponding spatial positions in the horizontal, vertical and diagonal refinement subband of the same level, as indicated in Figure 2.

The algorithm progresses iteratively, alternating between two passes: the dominant pass and the subordinate pass. In the dominant pass the coefficient values are identified as significant or insignificant with respect to a threshold value $T_i$ that is decreased (typically halved) for each iteration. The coefficients $c_{x,y}$ that are found to be significant at quantization level $i$, that is $|c_{x,y}| > T_i$, are encoded with one of two symbols (POS or NEG) depending on the sign of the coefficient. A coefficient that has been found to be significant is set to zero to prevent it from being encoded as POS or NEG in subsequent dominant passes. Then the magnitude of the coefficient is inserted into a list called the subordinate list, used in the subordinate pass. The insignificant coefficients are considered to be zero at the current quantization level and are coded as either zerotree roots (ZTR) or isolated zeros (IZ). A coefficient is coded as a ZTR if all its descendants in the zerotree rooted at the coefficient are insignificant with respect to the current threshold. Otherwise, if any of its descendants are significant, the symbol is coded as an IZ. A coefficient that is the descendant of a previously coded zerotree root is set to zero and not encoded in this pass. The dominant pass processes all coefficients in a well-defined scanning order, subband-by-subband, from low-frequency to high-frequency subbands, encoding the coefficients that are not zerotree descendants with a codeword from a four-symbol alphabet. Since the highest frequency subbands do not have any zerotree roots, a ternary alphabet is used to encode those coefficients.

After the dominant pass is completed the subordinate pass processes each entry of the subordinate list, refining the coefficient value by an additional bit of precision. This is done by using a binary alphabet to indicate whether the magnitude of a coefficient is higher or lower than half the current threshold value. In effect, this corresponds to the quantizer binsize being halved for each subordinate pass. The algorithm alternates between the dominant and the subordinate pass, halving the threshold value for each iteration, until the desired precision is achieved or a bandwidth limit is met. The symbols that are output from the dominant pass (POS, NEG, ZTR and IZ) are entropy-coded using arithmetic coding. An adaptive arithmetic coder can be used to dynamically update the probabilistic model throughout the encoding process. In practice this is done by
94
Supporting video-mediated communication over the Internet
maintaining a histogram of symbol frequencies as described in [20]. To further improve the entropy coding, a number of histograms can be used and the selection of probabilistic model for each encoded symbol is conditioned on whether the coefficient's parent and left neighbor coefficient are significant or not. This results in four histograms for the dominant pass. In the subordinate pass a single histogram is used.
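As a concrete illustration of this conditioning, the following sketch selects one of the four dominant-pass histograms from the significance state of the parent and left-neighbour coefficients. The function name and the encoding of the context index are illustrative, not taken from the thesis implementation.

```python
def dominant_pass_context(parent_significant, left_significant):
    # Four contexts for the dominant pass: neither neighbour significant (0),
    # only the left neighbour (1), only the parent (2), or both (3).
    # The subordinate pass uses a single histogram and needs no context.
    return (2 if parent_significant else 0) + (1 if left_significant else 0)
```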
Figure 1 Two-level dyadic wavelet decomposition of an image into the subbands LL2, HL2, LH2, HH2, HL1, LH1 and HH1
Figure 2 Parent-child relationship of subbands. A zerotree rooted at the LL2 subband is also shown.
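To make the interplay of the two passes concrete, the following minimal sketch applies EZW-style successive-approximation coding to a flat array of coefficients. The 2-D subband geometry is abstracted away: `tree` maps a coefficient index to the indices of its children, mirroring the parent-child relation of Figure 2, and entropy coding and the separate ternary alphabet for the highest-frequency subbands are omitted. Names and structure are illustrative, not the thesis implementation.

```python
import numpy as np

def ezw_passes(coeffs, tree, n_passes):
    remaining = np.asarray(coeffs, dtype=float).copy()
    threshold = 2.0 ** np.floor(np.log2(np.max(np.abs(remaining))))
    subordinate = []          # magnitudes of coefficients already found significant
    symbols = []

    def insignificant_subtree(i, t):
        # True if coefficient i and all of its zerotree descendants are below t.
        if abs(remaining[i]) >= t:
            return False
        return all(insignificant_subtree(c, t) for c in tree.get(i, ()))

    for _ in range(n_passes):
        covered = set()       # descendants of zerotree roots coded in this pass
        for i in range(len(remaining)):           # --- dominant pass ---
            if i in covered:
                continue
            c = remaining[i]
            if abs(c) >= threshold:
                symbols.append("POS" if c > 0 else "NEG")
                subordinate.append(abs(c))
                remaining[i] = 0.0                # never re-coded as POS/NEG
            elif insignificant_subtree(i, threshold):
                # Childless coefficients are treated as degenerate zerotrees here;
                # the real algorithm codes them with a ternary alphabet instead.
                symbols.append("ZTR")
                stack = list(tree.get(i, ()))
                while stack:                      # skip the whole zerotree
                    j = stack.pop()
                    covered.add(j)
                    stack.extend(tree.get(j, ()))
            else:
                symbols.append("IZ")
        for m in subordinate:                     # --- subordinate pass ---
            symbols.append("1" if (m % threshold) >= threshold / 2 else "0")
        threshold /= 2.0
    return symbols

# Example: a tiny 7-coefficient pyramid, root 0 with children 1 and 2, and so on.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
print(ezw_passes([63, -34, 10, 7, 3, 2, 1], tree, n_passes=3))
```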
3.2 The EZWTP algorithm To extend the EZW algorithm to video coding without introducing substantial coding delays and prohibitively high complexity, a temporal prediction scheme without motion compensation is devised. The temporal prediction uses only the previously encoded frame as reference. For robustness to packet loss, intra-coding is employed at regular intervals so that the decoder can be resynchronized. Thus, two types of encoded images are present in the output video stream: intra-coded frames and predictive frames (I-frames and P-frames).
The I-frames are coded using the traditional EZW algorithm. For the P-frames, two new symbols are introduced in the dominant pass: zerotree root with temporal prediction (ZTRTP) and isolated zero with temporal prediction (IZTP). A coefficient is coded as ZTRTP in the dominant pass if it cannot be coded as a ZTR, but the difference between the coefficient and the coefficient at the same spatial location in the previous frame is insignificant with respect to the current threshold and so is the difference between each descendant of the coefficient and the corresponding descendant in the previous frame. Thus, a temporally predicted zerotree is an extension of the zerotree data structure to include coefficients of the same subbands in the previous frame. This relationship is illustrated in Figure 3. A coefficient that is insignificant, but is not a ZTR or ZTRTP (or a descendant), is coded as an isolated zero. A significant coefficient that is not a ZTRTP is coded as an IZTP if the difference between the coefficient and the corresponding coefficient in the previous frame at the current quantization level is insignificant. Note that when computing the difference between a coefficient's value and the value of the coefficient at the same spatial location in the previous frame, we must use the approximation of the coefficient value corresponding to the precision of the current pass of the algorithm. This is because in order to decode a coefficient's value at a precision corresponding to the i:th refinement level, the decoder should only be required to decode the previous frame's coefficients at refinement levels 1,2,..i. Otherwise the SNR scalability criterion would be violated. Consequently, the coder and decoder must keep the coefficient values of each refinement level of a frame for reference when coding or decoding the next predictive frame. Although this results in a substantial memory requirement, it does not introduce any buffering delay, since a frame is still transmitted once it is coded. Coefficients that are not coded as ZTRTP, ZTR, IZTP or IZ are coded as POS or NEG depending on the sign as in the original EZW algorithm. Note however that when a coefficient is found to be significant and pushed onto the subordinate list, after previously having been coded with temporal prediction (ZTRTP, temporally predicted zerotree descendant or IZTP), it is the magnitude of the difference between the coefficient and the coefficient used for the temporal prediction that should be recorded. It must also be remembered that this magnitude value represents a differentially coded coefficient. In this way the same threshold value can be used for refinement of both coefficient magnitudes and coefficient difference magnitudes. Since the state of the algorithm implicitly encodes this information no extra signaling is needed between the coder and decoder. Note also that the successive approximations of coefficient values that the decoder will reconstruct can be generated as intermediate results of the encoding process without extra cost. The subordinate pass works in the same way as in the original EZW algorithm apart from the fact that some of the magnitude values on the subordinate list now represent the prediction error term of a coefficient relative to the corresponding coefficient in the previously decoded frame. 
The state of the decoder when the coefficient is added to the subordinate list determines whether it is a prediction error term or a coefficient magnitude value and this information is kept in the subordinate list.
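The precedence between the symbols introduced for P-frames can be summarised in a small decision sketch. The boolean predicates are assumed to be computed exactly as described above (over the coefficient and its zerotree descendants, at the precision of the current pass); the helper names are hypothetical, not taken from the thesis code.

```python
def classify_p_coefficient(c, threshold, zerotree, diff_zerotree, diff_small):
    """Dominant-pass symbol for one P-frame coefficient.

    zerotree      -- |c| and all its descendants insignificant at `threshold`
    diff_zerotree -- the differences to the previous frame, for c and all its
                     descendants, are insignificant at `threshold`
    diff_small    -- |c - c_prev| is insignificant at `threshold`
    """
    if zerotree:
        return "ZTR"
    if diff_zerotree:
        return "ZTRTP"                 # temporally predicted zerotree
    if abs(c) < threshold:
        return "IZ"                    # insignificant, but a descendant is not
    if diff_small:
        return "IZTP"                  # significant, but well predicted temporally
    return "POS" if c > 0 else "NEG"
```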
Figure 3 Spatial and temporal relationships of the coefficients belonging to a temporally predicted zerotree rooted at subband LL2 of frame i

The arithmetic coding of the symbols is performed using codewords from five different alphabets. For I-frames the three alphabets of the original EZW algorithm are used, viz. a four-symbol alphabet for all subbands except the highest frequency subbands of the dominant pass, a ternary alphabet for the highest frequency subbands and a binary alphabet for the subordinate pass. For P-frames a six-symbol alphabet is used for all subbands except the highest frequency ones (ZTR, IZ, ZTRTP, IZTP, POS, NEG), a four-symbol alphabet is used for the highest frequency subbands, where ZTRTP and ZTR cannot occur, and a binary alphabet is used for the subordinate pass. The conditioning of the statistical model used by the arithmetic coder in the dominant pass is performed with respect to whether the parent and left neighbor of a coefficient are significant, as in the original EZW algorithm, but also with respect to whether the corresponding coefficient in the previous frame is significant. This increases the performance of the arithmetic coder. Another difference compared to the original EZW algorithm is that with the addition of temporal information, there is now a way to condition the statistical model to be used for arithmetic coding of the symbols resulting from the subordinate pass. Since the coefficients at the same spatial location in adjacent frames exhibit a strong correlation, the probability is higher that a coefficient will be refined in the same direction as the corresponding coefficient in the previous frame. Thus the arithmetic coding of the symbols from the subordinate pass can be enhanced by temporal conditioning. The arithmetic coding can be based on either static, predefined probability models or adaptive models based on histograms of symbol frequencies. However, since the decoder should be able to partially decode the encoded bitstream, a fully adaptive arithmetic coding, where symbol probabilities are updated for every coded symbol, cannot be used. Thus, in order not to violate the scalability criteria and to remain resilient to packet loss, a multipoint video communication system must either use a static model or introduce synchronization points in the media (e.g. at every I-frame) where the probability models are propagated from the coder to the decoder.
The advantage of using an adaptive arithmetic coder is not so significant that it motivates the added complexity of maintaining symbol frequency histograms, so a static model will generally be preferred.
3.3 EZWTP codec design The EZWTP encoder consists of the following four components:

1. colorspace conversion and component subsampling,
2. transform coding,
3. zerotree coding with built-in temporal prediction,
4. arithmetic coding.

The colorspace conversion transforms the input color video signal into a luminance signal (Y) and two color-difference chrominance signals (Cr and Cb). Since the human visual system is more sensitive to variations in luminosity than in hue, the chrominance signals are subsampled by two horizontally and vertically. The colorspace conversion and subsampling operations decorrelate the components and reduce the bandwidth to half of the original signal. The encoding is then performed separately on each of the three components. The encoding process is illustrated schematically in Figure 4.
Figure 4 Schematic diagram of the EZWTP encoding process
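The first stage of the pipeline in Figure 4 can be sketched as follows: conversion to Y/Cb/Cr followed by 2x2 chrominance subsampling. The ITU-R BT.601 conversion coefficients are used here purely for illustration; the thesis does not state which conversion matrix the implementation uses.

```python
import numpy as np

def to_ycbcr_420(rgb):
    """Convert an HxWx3 RGB array to (Y, Cb, Cr) with 4:2:0 chroma subsampling."""
    rgb = rgb.astype(np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0
    cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0

    def down2(c):
        # Subsample by two horizontally and vertically (mean of each 2x2 block).
        h, w = c.shape[0] // 2 * 2, c.shape[1] // 2 * 2
        c = c[:h, :w]
        return (c[0::2, 0::2] + c[0::2, 1::2] + c[1::2, 0::2] + c[1::2, 1::2]) / 4.0

    return y, down2(cb), down2(cr)
```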
3.3.1 Transform coding and spatial scalability The wavelet transform decomposes the input images into subbands representing the frequency content of the image at different scales and orientations. The transform is implemented by applying a pair of band-splitting filters to the image. The filtering process is repeated on the lowest-frequency subband a finite number of steps, resulting in a pyramid of wavelet coefficients like the one depicted in Figure 1. For the implementation of the EZWTP codec presented in this paper the filters designed by Antonini et al. [21] were chosen, since they have been found to give good performance for image coding [22]. The transform is iterated on the low-pass
subband until the size is considered small enough, e.g. for CIF-size images (352x288), five iterations are performed for the luminance component and four for the chrominance. Thus, for CIF video, five spatial resolution levels are obtained each of which (except the LL-band) contains three refinement signals for horizontal, vertical and diagonal detail respectively. The spatial layering can be performed on subband level resulting in 3x5+1 = 16 spatial layers, for CIF images. Such a fine granularity for spatial scalability is probably unnecessary for most applications, suggesting that the subbands should be coalesced into fewer layers.
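One level of the separable 2-D decomposition can be sketched as below. For brevity the Haar filter pair is used; the codec described in this paper uses the longer biorthogonal filters of Antonini et al. [21], but the resulting subband structure is the same, and repeating the function on the LL band yields the pyramid of Figure 1.

```python
import numpy as np

def dwt2_haar(img):
    """One level of a 2-D wavelet decomposition into (LL, HL, LH, HH) subbands."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    # 1-D analysis along rows: low-pass = scaled sum, high-pass = scaled difference.
    lo = (img[:, 0::2] + img[:, 1::2]) / np.sqrt(2.0)
    hi = (img[:, 0::2] - img[:, 1::2]) / np.sqrt(2.0)
    # 1-D analysis along the columns of each half.
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2.0)
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2.0)
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2.0)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2.0)
    return ll, hl, lh, hh
```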
Figure 5 Inter-frame dependencies for intra-coded frames (I-frames), predicted frames (P-frames) and intra-predicted frames (IP-frames).
3.3.2 Temporal scalability The temporal scalability requirement restricts the inter-frame dependencies that the predictive coding is allowed to establish. Since a P-coded frame cannot be decoded unless the I- or P-frame it is predicted from has been decoded, such interframe dependencies must be confined to the same layer or to temporally antecedent layers. P-frames are generally predicted from the immediately preceding frame, since the temporal correlation usually diminishes rather quickly. One approach is to employ a two-layer model wherein all I-frames are assigned to the base layer and all P-frames to a single refinement layer. To increase the number of temporal layers possible some (or all) P-frames can be predicted from the previous I-frame instead of from the immediately preceding frame. Figure 5 illustrates a temporal layering arrangement with three temporal layers, where the P-frames temporally equidistant from two I-frames are coded with reference to the previous I-frame and the intermediate P-frames are coded relative to the immediately preceding P- or I-frame.
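A three-layer arrangement of the kind shown in Figure 5 can be expressed as a simple assignment rule, assuming one I-frame every `period` frames (with an even period): the I-frames form the base layer, the frame halfway between two I-frames is predicted from the previous I-frame, and the remaining P-frames are predicted from the immediately preceding frame. The function and its defaults are illustrative only.

```python
def temporal_layer_and_reference(frame_no, period=8):
    """Return (frame type, temporal layer, reference frame number or None)."""
    pos = frame_no % period
    if pos == 0:
        return "I", 0, None                    # base layer, no reference
    if pos == period // 2:
        return "IP", 1, frame_no - pos         # predicted from the previous I-frame
    return "P", 2, frame_no - 1                # predicted from the preceding frame

for f in range(9):
    print(f, temporal_layer_and_reference(f))
```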
4 Performance In this section a number of performance measurements are presented that evaluate the efficiency of the codec in terms of scalability, compression rate and reconstructed image quality. The compression efficiency for a given bandwidth limit is compared to that of a non-scalable codec in order to quantify the sacrifice in compression efficiency that the layering requirement imposes.
4.1 Inter-frame compression performance In order to investigate how much compression efficiency is gained by the predictive coding introduced in the EZWTP algorithm, a number of measurements were performed comparing the compression rate obtained for different ratios between I-frames and P-frames. In Figure 6 the compressed image size in bits-per-pixel is plotted for each of the first 100 images of the CIF akiyo video sequence for eight P-frames per I-frame. Figure 7 shows the same plot for I-frames only. The former I-frame/P-frame layout thus supports two temporal layers, whereas the latter supports any number of temporal layers (since there are no inter-frame dependencies). Each line in Figure 6 and Figure 7 represents a quantization level, resulting from the SNR scalable EZWTP coding. The compression performance for P-frames can be seen to be about twice the performance for I-frames, for each quantization level. Since the akiyo test sequence contains a typical "head and shoulders" scene, it can be assumed to be fairly representative for the kind of video content the algorithm is targeted at. The low-motion nature of the video makes inter-frame coding without motion compensation perform reasonably well.
Figure 6 Size of each compressed image (in bits-per-pixel) for the CIF akiyo test sequence at 8 P-frames per I-frame
Figure 7 Size of each compressed image (in bits-per-pixel) for the CIF akiyo test sequence, with I-frames only

In Figure 8 the mean size of a compressed image of the akiyo sequence is plotted against the number of P-frames per I-frame. Again, each curve represents a quantization level. It can be seen that the inclusion of inter-frame coding is highly beneficial to the overall compression efficiency, and that a coding strategy with one I-frame every fourth to sixth frame can be adopted while maintaining a good compression performance.
Figure 8 Mean compressed image size (in bits per pixel) depending on the number of P-frames for each I-frame
4.2 Overall compression efficiency To analyze the penalty on compression efficiency for a given bandwidth that the scalability requirements impose and to compare the performance of the EZWTP
algorithm to a popular, widely used codec, EZWTP was compared with a non-scalable MPEG-1 codec [23]. In order for the comparison to be as fair as possible, and to reflect the target application for the EZWTP codec, the MPEG codec was configured to use I- and P-frames only (no B-frames) and to have the same I-frame/P-frame ratio. The overall compression performance was quantified by computing the peak-signal-to-noise ratio (PSNR) for a number of bandwidth levels (measured in bits-per-pixel). The PSNR was computed on the luminance component only, whereas the bandwidth refers to all three components. Due to the SNR scalability of the EZWTP codec all distortion levels could be decoded from a single encoded source, whereas for the non-scalable MPEG codec the encoding was done multiple times with different target bandwidths. In Figure 9 the compression performance for the first 100 frames of the CIF akiyo sequence is shown. It is clear that the MPEG codec outperforms the EZWTP codec by as much as 3 dB.
Figure 9 Compression efficiency of EZWTP compared to MPEG-1 for the CIF akiyo test sequence

Figure 10 shows the compression efficiency of EZWTP and MPEG-1 for the first 100 frames of the 4CIF susie test sequence. For this video source the EZWTP algorithm performs almost as well as the MPEG codec at low bitrates and even outperforms the MPEG codec at high bitrates. The reason why the EZWTP algorithm performs better relative to MPEG for susie compared to akiyo is probably that the higher spatial resolution in the former case makes the superiority of the wavelet transform over the DCT for spatial decorrelation more significant.
Figure 10 Compression efficiency of EZWTP compared to MPEG-1 for the 4CIF susie test sequence

A contributing factor might be that although the inter-frame compression of EZWTP performs very well on the low-motion akiyo sequence (cf. Figure 6), the motion-compensated inter-frame coding of MPEG performs even better. The susie sequence contains slightly more motion, and since the performance of MPEG depends to a higher degree on the efficacy of the inter-frame compression than EZWTP does, the reduced P-frame coding efficiency has a larger impact.
4.3 Scalability The scalability of the encoding in terms of bandwidth and quality of the reconstructed signal is illustrated in Figure 11 and Figure 12, calculated over the first 100 frames of the CIF akiyo sequence. Only the effects of SNR- and spatial layering are considered. A five-level wavelet transform was used for the encoding, resulting in a total of 16 subbands (three refinement subbands per level plus the base layer). Each subband was assigned a unique spatial layer for the purpose of these measurements. The quantization of the wavelet coefficients was divided into 12 refinement layers. Thus, a hierarchical structure of 12*16=192 layers was created. For most applications such a fine granularity is probably not needed, indicating that some layers should be merged. Figure 11 shows the cumulative bandwidth in kilobits per second (kbps) as a function of the number of spatial layers and quantization layers. The bandwidth was computed at a frame rate of 25 frames per second. As can be seen, increasing the sample precision (adding quantization layers) has a bigger effect on bandwidth consumption compared to an increase in spatial resolution.
Figure 11 Bandwidth scalability of the encoding of the CIF akiyo sequence
Figure 12 Scalability of the encoding of the CIF akiyo sequence in terms of PSNR of decoded images.

The corresponding reconstructed image quality for each quantization and resolution level is shown in Figure 12. Here, image quality is quantified using the peak-signal-to-noise ratio (PSNR) between the original and reconstructed images. When computing the PSNR for frames decoded at a lower spatial resolution than the original, the reconstructed image was upsampled to the original dimensions prior to computing the mean square error to the original image. It is important to note that this is a statistical measure of correlation between signals that does not take psychovisual effects into consideration and hence is a poor estimator for perceptual quality. For instance there seems to be some anomaly resulting in a lowered quality when 14 subbands are used in the decoder, compared to using only 13. Upon visual inspection of the images, however, the higher resolution versions are subjectively preferable, although mathematically more distorted. The PSNR can be seen to depend approximately linearly on both quantization and resolution. The conclusions that can be drawn from these measurements are that finer quantization has a more profound effect on bandwidth consumption compared to increased spatial resolution, and that the reconstructed image quality
(determined by the PSNR metric) depends, in some sense, equally on both parameters. Thus, when trading off between resolution and quantization distortion, the former should possibly be prioritized. However, in real applications, other factors like the video content, the type of application and user preference are likely to be of significant importance for this decision.
Figure 13 A frame of the akiyo test sequence decoded at three different resolutions and five different distortion levels
Figure 13 displays one frame from the akiyo test sequence decoded at three different resolutions and five different distortion levels. For image a (CIF resolution) 16 spatial layers and 10 SNR layers were used in the decoding process. For b and c (QCIF resolution) 12 spatial layers were used with 7 and 6 SNR layers respectively. For d and e (1/16 CIF resolution) 8 spatial layers were used, while the number of SNR layers was 7 and 5 respectively. In Table 1 the PSNR of each image in Figure 13 is listed together with the frame rate that can be supported, given the bandwidth limit of a particular network access technology. The spatial resolutions are also included. These measurements indicate what performance can be expected from the video coding in some relevant situations.
image   resolution   PSNR   fps   target access technology
a       CIF          42.1   25    T1 (1.5 Mbps)
b       QCIF         32.2   24    4xISDN (256 kbps)
c       QCIF         28.6   22    2xISDN (128 kbps)
d       1/16 CIF     28.5   16    ISDN (64 kbps)
e       1/16 CIF     24.5   10    modem (33 kbps)
Table 1: Examples of image quality, spatial resolution and frame rate at bandwidths corresponding to different network access technologies
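The frame rates in Table 1 follow from the relation bandwidth = bits-per-pixel x pixels per frame x frames per second. A small helper illustrating the arithmetic (the bpp value below is an assumption chosen to roughly reproduce row c, not a figure stated in the paper):

```python
def achievable_fps(bandwidth_bps, bits_per_pixel, width, height):
    """Frame rate sustainable at a given bandwidth and compressed bits-per-pixel."""
    return bandwidth_bps / (bits_per_pixel * width * height)

# QCIF (176x144) over 2xISDN: if frames average about 0.23 bpp, roughly 22 fps fit.
print(achievable_fps(128_000, 0.23, 176, 144))
```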
5 Processing requirements One of the primary design goals of the EZWTP codec is that the computational complexity should be low enough for the algorithm to be implemented in software on general-purpose processors. Furthermore, the processing requirements should be scalable, so that the encoding and/or decoding complexity can be adjusted to the amount of CPU resources available for different types of terminal equipment. To analyze the complexity of the EZWTP algorithm we first note that the two major contributions to the overall complexity are the transform and quantization for the encoder and the inverse transform and the dequantization for the decoder. The encoding and decoding requirements are symmetric, since the inverse transform and the dequantization are simply the reverse processes of the forward transform and the quantization. We therefore present the complexity analysis for the decoder only, since the scalability property of the algorithm is most clearly exposed in the situation where one encoded stream is decoded at many different quality and complexity levels by a collection of heterogeneous decoders. The computational complexity is estimated as a function of the number of levels of the inverse transform and the number of iterations of the zerotree decoding that are performed. In this way we can analyze the scalability of the processing requirement in relation to the spatial resolution and compression distortion of the reconstructed video.
One iteration of the wavelet transform is implemented by applying a low-pass and a high-pass filter to the pixel values of each image. For the next iteration the transform is applied to the low-frequency subband, which has a resolution of a quarter of the original. Thus, for L levels of the transform the processing requirement is proportional to the number of multiplications performed, which is

2 f n \sum_{i=0}^{L-1} \frac{1}{4^{M-i-1}} ,        (1)
Pn 4M−L
(2)
coefficients to be processed, where P is the number of iterations of the EZWTP decoding algorithm (i.e. the number of quantization levels decoded). In the subordinate pass each coefficient in the subordinate list is processed and refined to an extra bit of precision as determined by the symbols read from the input stream. The processing power for each iteration of the subordinate pass is hence proportional to the number of significant coefficients for that level. If no temporal prediction is performed (i.e. I-coding) the number of significant coefficients can be found empirically to be approximately doubled for each pass. With temporal prediction the number of significant coefficients is reduced, but we nevertheless assume a doubling quantity also for P-frames, appreciating that the complexity estimation will be somewhat pessimistic. This gives us a complexity for the subordinate pass that is proportional to P −1
∑4 j= 0
M−L
n , ⋅ 2Q − j−1
(3)
where Q is the total number of quantization levels computed by the encoder, and thus, P≤Q.
Paper B: A scalable video compression algorithm for real-time Internet applications
107
A linear combination of (1), (2) and (3) gives the total complexity, C_{EZWTP}(L, P). That is, for some positive proportionality constants C_1, C_2, C_3, where the filter length in (1) has been included in the C_1 constant,

C_{EZWTP}(L, P) = C_1 \sum_{i=0}^{L-1} \frac{n}{4^{M-i-1}} + C_2 \frac{P n}{4^{M-L}} + C_3 \sum_{j=0}^{P-1} \frac{n}{4^{M-L} \cdot 2^{Q-j-1}}
               = \frac{n}{4^{M}} \left( C_1 (4^{L+1} - 1) + 4^{L} P C_2 + 2^{2L-Q} C_3 (2^{P+1} - 1) \right) .        (4)
As can be seen in (4), the complexity of the EZWTP decoding grows exponentially with respect to the number of transform levels (L). This is not surprising since the number of pixels to process increases by a factor four when the width and height of the images are doubled. With respect to the number of quantization levels (P), the EZWTP complexity increases by a linear term and an exponential term, accounting for the dominant and subordinate passes respectively. To verify the theoretically deduced complexity estimation in (4) the execution time of the EZWTP implementation when decoding the akiyo test video sequence was measured for different values of L and P. The proportionality constants were empirically determined from decoding time measurements. In Figure 14 decoding time is plotted against the number of quantization levels while keeping the number of transform levels constant. As can be seen, the decoding time corresponds very well with the theoretically estimated curve. In Figure 15 the decoding time is plotted against the number of transform levels, while keeping the number of quantization levels constant. Again, a very good correspondence is found indicating that the complexity estimation in (4) is sound.
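The model in (4) is easy to evaluate numerically, which is convenient for reproducing the trends in Figures 14 and 15. In the sketch below the proportionality constants C1, C2 and C3 are arbitrary placeholders; in practice they would be fitted to measured decoding times as described in the text.

```python
def c_ezwtp(L, P, n, M, Q, C1=1.0, C2=1.0, C3=1.0):
    """Decoding complexity model of equation (4)."""
    return (n / 4 ** M) * (C1 * (4 ** (L + 1) - 1)
                           + 4 ** L * P * C2
                           + 2 ** (2 * L - Q) * C3 * (2 ** (P + 1) - 1))

# CIF luminance: n = 352*288 pixels, M = 5 transform levels, Q = 12 SNR levels.
n, M, Q = 352 * 288, 5, 12
print([round(c_ezwtp(L, 8, n, M, Q)) for L in range(1, 6)])   # growth with resolution
print([round(c_ezwtp(4, P, n, M, Q)) for P in range(1, 13)])  # growth with precision
```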
Figure 14 Decoding time as a function of the number of quantization levels
Figure 15 Decoding time as a function of the number of inverse transform levels

From looking at the graphs in Figure 14 and Figure 15 it appears as if the number of IDWT levels chosen has a larger impact on decoding time than the number of quantization levels. Thus, when trading off between resolution and quantization distortion in the decoder, from a complexity standpoint a refinement of the quantization precision might be preferable compared to an increased resolution. The shape of the graph in Figure 14 suggests that the linear term of P in (4) is dominant over the exponential, for SNR levels below 10, resulting in an approximately constant increase in complexity, compared to the apparent exponential increase imposed by a higher resolution level. To verify this observation we differentiate the complexity function C_{EZWTP}(L, P) with respect to L and P and form the quotient of the derivatives. The ratio thus obtained represents the relative impact on computational complexity of refining the spatial resolution versus the quantization precision.
\frac{\partial C / \partial L}{\partial C / \partial P} = \frac{\ln 4 \left( 4 C_1 + P C_2 + C_3 \, \frac{2^{P+1} - 1}{2^{Q}} \right)}{C_2 + \ln 2 \, C_3 \, 2^{P-Q+1}} \geq \frac{8 C_1 + 2 P C_2 + C_3 \, 2^{P-Q+1}}{\frac{C_2}{\ln 2} + C_3 \, 2^{P-Q+1}} \geq 1        (5)

iff

8 C_1 + 2 P C_2 \geq \frac{C_2}{\ln 2}        (6)
Since P \geq 1 > 1/(2 \ln 2) - 4 C_1/C_2, the relation in (6) is trivially true and thus the ratio in (5) is always greater than one, implying that the increase in computational complexity is always affected more by an increase in resolution than by an increase in quantization precision, irrespective of L, P and the proportionality constants. This suggests that for computationally constrained devices, refined quantization might be preferred over increased resolution. Note that in this analysis we have calculated the change in computational cost associated with a change in resolution corresponding to three additional spatial
subbands being used in the decoding process. That is, we do not consider the effect on complexity of adding the spatial subbands of a transform level independently. Since the improvement in reconstructed image quality is most profound when adding a spatial subband of the next resolution level (cf. Figure 12), the conclusions are still consistent.
6 Summary and conclusions Real-time multipoint Internet videoconferencing applications require highly scalable video encoding and compression algorithms with minimal coding delays. This paper has presented a video compression algorithm that produces a layered bitstream that can be decoded at different quality levels depending on the amount of resources available to the decoder in terms of network bandwidth, computational capacity and visualization capabilities. The algorithm, called EZWTP, has been designed with the scalability and real-time properties as primary requirements, while trying to maintain high compression efficiency and low computational complexity. Computational complexity is kept low by excluding motion compensation. The motivation for doing so is that the target application (Internet videoconferencing) implies that reasonably low-motion video content can be assumed. The inter-frame compression of EZWTP was shown to give a substantial compression gain for low-motion video scenes. In comparison to a popular non-layered codec (MPEG-1), the EZWTP codec was shown to exhibit competitive compression performance for high-resolution video, due to the superior spatial decorrelation properties of the wavelet transform compared to the discrete cosine transform. For lower resolution video, non-scalable codecs with motion compensation typically outperform the EZWTP algorithm. The decoder can trade off between frame rate, spatial resolution and compression distortion based on local constraints and user preference. Complexity and performance analyses showed that for computationally constrained devices, enhanced quantization might be favored over an increased spatial resolution, while the opposite trade-off could be advocated for bandwidth-constrained instances of the decoder. The temporal layering has a linear impact on both decoding time and bandwidth consumption. Although the computational power of processors and the capacity of network infrastructure will continue to increase, heterogeneity will persist. It can thus be argued that scalability in terms of performance and resource consumption should be considered a more important feature of a video coding algorithm than sheer compression efficiency, when targeting applications like Internet videoconferencing. This sentiment has inspired the work presented in this paper.
References [1] N. Shacham, "Multicast routing of hierachical data," Proceedings of the International Conference on Computer Communications, pp. 1217-1221, June 1992. [2] T. Turletti and J. C. Bolot, "Issues with multicast video distribution in heterogeneous packet networks," Proceedings of the Sixth International Workshop on Packet Video, Portland, September 1994. [3] S. McCanne, V. Jacobson and M. Vetterli, "Receiver-driven layered multicast," Proceedings of ACM SIGCOMM'96, August 1996. [4] S. McCanne, "Scalable video coding and transmission for Internet multicast video," Ph.D. thesis, University of California, Berkeley, December 1996. [5] MPEG-2, ISO/IEC 13818, "Generic coding of moving pictures and associated audio information," November 1994. [6] G. Cote, B. Erol and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 849-866, November 1998. [7] W. Equitz and T. Cover, "Successive refinement of information," IEEE Transactions on Information Theory, vol. 37, pp. 269-275, March 1991. [8] J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Transactions on Image Processing, vol. 41, no. 12, pp. 3445-3462, December 1993. [9] A. Said and W. Pearlman, "A new, fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 243-250, June 1996. [10] D. Taubman and A. Zakhor, "Multirate 3-D subband coding of video," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 572-588, September 1994. [11] C. I. Podilchuck, N. S. Jayant and N. Farvardin, "Three-dimensional subband coding of video," IEEE Transactions on Image Processing, vol. 2, no. 2, pp. 125-139, February 1995. [12] J. R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 559-571, September 1991. [13] K. Uz, M. Vetterli and D. J. LeGall, "Interpolative multiresolution coding of advanced television with compatible subchannels," IEEE Transactions on Circuits and Systems for Video Technology, vol. 1, no. 1, pp. 86–99, March 1991. [14] Y. Chen and W. Pearlman, "Three-dimensional subband coding of video using the zero-tree method," Proceedings of SPIE - Visual Communications and Image Processing, pp. 1302 - 1312, March 1996. [15] M. T. Orchard and G. J. Sullivan, "Overlapped block motion compensation: an estimation-theoretic approach," IEEE Transactions on Image Processing, vol. 3, no. 5, pp. 693-699, September 1994.
[16] M. Ohta and S. Nogaki, "Hybrid picture coding with wavelet transform and overlapped motion-compensated interframe prediction coding," IEEE transactions on Signal Processing, vol. 41, no. 12, pp. 3416-3424, December 1993. [17] K. Tsunashima, J. B. Stampleman and V. M. Bove, "A scalable motioncompensated subband image coder," IEEE Transactions on Communication, vol. 42, pp.1894–1901, April 1994. [18] A. Nosratinia and M. Orchard, "Multi-resolution backward video coding," IEEE International Conference on Image Processing, vol. 2, , pp. 563–566, October 1995. [19] X. Yang and K. Ramchandran, "Scalable wavelet video coding using aliasreduced hierarchical motion compensation," IEEE Transactions on Image Processing, vol. 9, no. 5, May 2000. [20] I. H. Witten, R. Neal and J. G. Cleary, "Arithmetic coding for data compression," Communications of the ACM, vol. 30, pp. 520-540, June 1987. [21] M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, "Image coding using wavelet transform," IEEE Transactions on Image Processing, vol. 1, no. 2, April 1992. [22] D. Villasenor et al., "Wavelet filter evaluation for image compression," IEEE Transactions on Image Processing, August 1995. [23] A. C. Hung, "PVRG-MPEG CODEC 1.1," Portable Video Research Group (PRVG), Stanford University, June 14, 1993. [24] S. McCanne, M. Vetterli and V. Jacobson, "Low-complexity video coding for receiver-driven layered multicast," IEEE Journal on Selected Areas in Communications, vol. 16, no. 6, pp. 983-1001, August 1997. [25] M. Vetterli and J. Kovacevic, "Wavelets and subband coding," Englewood Cliffs, NJ, Prentice-Hall, 1995.
Paper C Layered encoding and transmission of video in heterogeneous environments
Pending publication.
Layered encoding and transmission of video in heterogeneous environments Mathias Johanson
Alkit Communications Sallarängsbacken 2 S-431 37 Mölndal, Sweden +4631675543
[email protected]
Arne Lie
SINTEF Telecom and Informatics N-7465 Trondheim, Norway +4773592703
[email protected]
Abstract
Applications of real-time video transmission in heterogeneous network and computing environments, like the Internet, need to be highly scalable and adaptive in terms of bandwidth and processing requirements. Layered video coding in combination with layered transmission schemes have been proposed as a solution to network and terminal equipment heterogeneity. In this paper we identify the fundamental requirements of layered video coding and transmission architectures, and review the different types of layering mechanisms available. Two applications are emphasized: multipoint videoconferencing and point-to-point video streaming. The implications of layer granularity for network bandwidth utilization are experimentally investigated through simulations.
1 Introduction The Internet can be characterized as a highly heterogeneous environment, since the network connections it is built upon have vastly different characteristics in terms of link speed, latency and loss probability. Furthermore, the equipment connected to the Internet will at any time be a motley collection of machinery, ranging from cheap, low-performance personal computers to expensive, high-performance supercomputers. This heterogeneity is likely to persist, and with the advent of wireless networks and mobile internetworking an even more manifold situation can be envisioned. In order to make real-time video communication possible in such an environment a highly scalable and adaptive media coding and transmission framework is needed. Three fundamental aspects of heterogeneity can be identified as crucial to consider in the design of scalable video communication systems, namely

• network bandwidth heterogeneity,
• CPU performance heterogeneity,
• visualization capability heterogeneity.

In layered video coding the video signal is encoded in a way that makes progressive reconstruction at increasingly higher quality possible. The output from the encoder is arranged in a hierarchical structure of cumulative layers, so that the more layers that are used in the decoding process, the higher is the
quality of the reconstructed video. For Internet video applications, layered video coding has been proposed as a solution to the heterogeneity problems. Two types of Internet applications where layered video coding is of particular interest are multipoint videoconferencing and point-to-point streaming of stored video material. These two applications are discussed below.
1.1 Multipoint videoconferencing To support multipoint videoconferencing in heterogeneous network environments a technique called layered multicast has been suggested [1]. In this transmission scheme the video is encoded in a number of cumulative layers, each of which is transmitted to a unique multicast group address. The leaf-initiated-join mechanism of IP multicast makes it possible for each participant in a conference session to individually decide how many layers of the video signal to receive depending on the bandwidth and CPU resources available. In this model, an increasing number of received layers will correspond to a higher quality of the reconstructed signal. Since the bandwidth available to each receiver is likely to change throughout a conference session, the allocation of bandwidth performed by the receivers by joining multicast groups must be dynamically adjustable. Thus, a flow control algorithm must be implemented in the receivers that adapts to bandwidth fluctuations, to avoid congesting the network at high network load and to prevent poor link-utilization when the load decreases. Although layered video coding can be used also for point-to-point videoconferencing, it is not as appropriate since an adaptive non-layered codec can be used in this situation and is likely to give better compression efficiency [2].
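The receiver side of this layered multicast scheme amounts to little more than joining and leaving IP multicast groups, one per layer. A minimal sketch follows; the group addresses and port are illustrative, and the flow control logic that decides how many layers to subscribe to is omitted here.

```python
import socket
import struct

LAYER_GROUPS = ["239.1.1.1", "239.1.1.2", "239.1.1.3"]   # base + two enhancements
PORT = 5004

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))

def set_layers(sock, current, wanted):
    """Join or leave multicast groups so that `wanted` layers are subscribed."""
    for i, group in enumerate(LAYER_GROUPS):
        mreq = struct.pack("4s4s", socket.inet_aton(group),
                           socket.inet_aton("0.0.0.0"))
        if current <= i < wanted:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        elif wanted <= i < current:
            sock.setsockopt(socket.IPPROTO_IP, socket.IP_DROP_MEMBERSHIP, mreq)
    return wanted

layers = set_layers(sock, 0, 2)   # start with the base layer plus one enhancement
```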
1.2 Unicast streaming of stored video clips Layered video coding is beneficial for applications like video on demand, where previously encoded video material stored on disk is streamed to a receiver for playback. Since the amount of bandwidth available between the video server and the client typically fluctuates, the transmission rate of the video being streamed must be adjusted accordingly. This can be accomplished either by having the video material encoded in many versions with different target bandwidths, or by using a layered video coding and adapting the number of layers being transmitted. The layered coding saves storage space and saves time at content creation time. The number of layers to transmit at each instant must be determined by a flow control algorithm based on feedback from the network.
2 Layered video encoding A layered video encoding suitable for Internet applications needs to meet a number of requirements. The compression performance must be high in order to utilize bandwidth efficiently. Furthermore, the computational complexity of the codec must be low enough for real time operation. In the videoconferencing
situation, both encoding and decoding must be done in real time. For streaming applications, asymmetrical codecs with non-real-time encoding are acceptable. Moreover, for videoconferencing the latency of the encoding/decoding must be very low, whereas the streaming applications require only a reasonable bound on decoding latency. In addition to the requirements on compression performance, complexity and latency, it is also desirable that the layered coding makes it possible to trade off between different aspects of video quality, such as frame rate and spatial resolution. For example, in order to meet a certain target bandwidth a receiver of a video stream can choose to favor high frame rate over high resolution or vice versa.
2.1 Layering techniques Many different techniques exist to generate a layered video bitstream. The most common types are known as temporal layering, spatial layering and signal-to-noise-ratio (SNR) layering. Temporal layering is achieved by distributing the individual images of a video sequence over a set of layers. Thus, the more temporal layers that are used in the decoding process, the higher is the frame rate of the video. In spatial layering a multiresolution representation is used to split each image into a set of layers. In this case an increased number of reconstruction layers corresponds to a higher spatial resolution of the individual images of the video. In SNR layering, the amount of lossy compression applied through quantization is progressively adjusted. All three types of layering are highly appropriate both for videoconferencing applications and streaming applications. Some general observations can be made when designing a suitable layering model:

Temporal layering has the lowest complexity and is the easiest to implement, since it requires no manipulation of the individual images. However, temporal layering affects the design of the inter-frame compression scheme of the video coder, since inter-frame dependencies imposed by the temporal prediction must be resolvable by a decoder that only receives a subset of the temporal layers.

Spatial layering is highly desirable since it makes it possible to decode the video at different spatial resolutions. One important aspect of heterogeneity is that different terminal equipment has different visualization capabilities. For instance, a handheld device might have a maximum screen resolution that is considerably less than the full resolution of the encoded video. In this case the limited resolution dictates the maximum number of spatial refinement layers to receive.

Since quantization is the primary means of achieving high compression ratios, SNR layering is very important in order to get a scalable bitstream in terms of bandwidth. Also, since quantization is a lossy process, a layered approach is highly desirable in order for the video coding to scale in terms of reconstructed quality. In a layered quantization scheme the coarseness of quantization is progressively adjusted by the refinement layers.
Evidently, each of the layering techniques exhibits characteristics that provide different aspects of the overall scalability. Hence, in order to be able to trade off between frame rate, resolution and precision in different ways depending on the circumstances, an orthogonal combination of all three is desirable. However, such an arrangement is not trivially accomplished.
2.2 Block-based DCT encoding A number of layered video codings based on the discrete cosine transform (DCT) have been proposed, including the MPEG-2 scalable profile [3], the MPEG-4 scalable profile [4] and H.263+ [5]. The compression is based on the block-based DCT in combination with motion compensation, quantization and entropy coding. MPEG-2 defines three scalability schemes that can be used independently or in combination: spatial layering, temporal layering and SNR layering.

MPEG-4 places particular focus on scalability and interaction. This is an inherent part of MPEG-4 because of the object-oriented approach. MPEG-4 supports conventional rectangular, frame-based visual encoding, and in addition it supports arbitrarily shaped frame objects. Since a natural scene cannot be divided into an indisputable number of objects, the object segmentation must be performed in such a way that the most important objects are identified. Moreover, the objects the user wants to interact with must be identified. When it comes to scalability, each object is streamed using its own dedicated elementary stream. In fact, each object can be divided into multiple streams, e.g. a base layer stream and several enhancement streams (both for video and audio). MPEG-4 supports three types of scalability for each object: temporal scalability, spatial scalability, and fine granular scalability (FGS).

In contrast to traditional scalability implementations, where one or multiple enhancement layers must be fully received in order to get the requested SNR, spatial or temporal enhancement, the FGS approach is much more flexible and robust. In FGS, only one enhancement layer exists. Native FGS is an SNR scalability approach, where the DCT coefficients are divided into bit planes. The most significant bits are transmitted before less significant bits. In this way, the reception is progressive. The flexibility lies in the separation of the encoding and transmission processes. In traditional scalability implementations, a fixed set of supported bit rates must be selected at encoding time. In FGS, the encoding process only decides on a minimum and a maximum supported bandwidth. The base layer bit rate must be equal to or lower than the minimum bandwidth, while a full transmission of the enhancement layer plus the base layer must be equal to or lower than the maximum bandwidth. During a particular streaming session, the channel capacity will vary between the minimum and maximum bandwidth. The FGS streaming server varies its output bit rate based on channel capacity monitoring, without the need for performing complicated real-time rate control algorithms. Thus, a large number of simultaneous unicast sessions can be supported [6]. The same compressed streams can also be used for multicast sessions, where both network and terminal resources vary. The multi-
cast server partitions the FGS enhancement layer into any preferred number of multicast streams (with unique multicast IP addresses). As for ordinary scalability, a receiver-driven layered multicast framework is used: the receivers join the base layer multicast group plus any number of enhancement-layer multicast groups, based on terminal and network capacity. In MPEG-4 Part 2 Amendment 4 (Streaming Video Profile) [7], FGS and FGST (FGS temporal scalability) are defined. Hence, the current MPEG-4 FGS profile does not support spatial scalability. In FGST, the enhancement layer also inserts new frames (P or B) between the base layer frames. The architecture is robust against lost packets. MPEG-4 FGS also provides additional tools, such as adaptive quantization (AQ). In MPEG-4 FGS, AQ is implemented by bit-plane shifting of entire macroblocks and/or selected DCT coefficients within 8x8 blocks. One application for this is in combination with face detection and tracking algorithms, to enhance the PSNR of faces at the expense of less important objects (this is still within ordinary rectangular video, not true object-based coding). The MPEG-4 FGS profile has been shown to incur little or no loss in compression efficiency, both compared to optimized single-rate streams and, in particular, compared to ordinary scalability structures [6].

The spatial layering schemes of MPEG and H.263+ require that the subsampled frames are first compressed and then decompressed and upsampled again in order to compute the differential frame of the next higher layer. This leads to a very high complexity of the compression engine. There is obviously a conflict between the block-based DCT transform of the compression step and the subsampling step. A more attractive approach would be to combine the transform of the compression step with the transform required for the subsampling into one operation. This is one of the salient features of wavelet transform coding.
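The bit-plane idea behind FGS can be illustrated with a few lines of code: the quantized enhancement-layer coefficients are transmitted most significant bit-plane first, so the stream can be cut at any point. Sign handling, run-length coding and the actual bitstream syntax are omitted; this is a simplified sketch, not the MPEG-4 encoder.

```python
import numpy as np

def bitplanes(coeffs):
    """Split coefficient magnitudes into bit-planes, most significant first."""
    mags = np.abs(coeffs).astype(np.int64)
    n_planes = int(mags.max()).bit_length()
    return [((mags >> b) & 1).astype(np.uint8)
            for b in range(n_planes - 1, -1, -1)]

residual = np.array([13, -6, 3, 0, 1, -2])      # example enhancement coefficients
for i, plane in enumerate(bitplanes(residual)):
    print("plane", i, plane)
```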
2.3 Wavelet-based encoding In wavelet image coding the discrete wavelet transform (DWT) is applied to the whole image instead of on small blocks of the image, as is the case with DCTbased encodings. The transform is then recursively repeated on the low-frequency subband, resulting in a hierarchy of subbands such that an image of progressively higher quality can be obtained by successively performing the inverse transform on the subbands and adding them. In this process the frequency information at different scales and orientations are separated. Compression is achieved by quantizing and entropy coding the subbands. Since the wavelet transform inherently provides a multiscale representation of an image it is an ideal choice for a spatially scalable video coder. Moreover, since wavelet image compression has been found to provide a more graceful degradation of image quality at high compression ratios, compared to block-based DCT algorithms, it also works well together with a scalable layered quantization scheme. A key issue in designing wavelet-based video codings is how to combine the spatial compression with an inter-frame coding scheme. Motion compensation can
A key issue in designing wavelet-based video codecs is how to combine the spatial compression with an inter-frame coding scheme. Motion compensation can be used to compensate for moving objects within a scene. It has been noted that motion compensation performs best if it is performed on smaller units, like 16-by-16 pixel macroblocks. Unfortunately, the dependency on such a subdivision conflicts somewhat with the wavelet transform, which is applied to the whole image. For wavelet-based video codecs, full-image motion compensation that only compensates for camera pans can be utilized [8]. Another approach is to perform motion estimation on overlapping blocks to reduce the high-frequency discontinuities that otherwise appear along the edges of the blocks [9]. This improves the performance of the wavelet coding of the macroblocks at the cost of higher complexity.

A different approach to temporal compression is to extend the 2-D wavelet transform to three dimensions by performing yet another 1-D transform along the temporal axis [10, 11]. This approach can indeed be very efficient in terms of compression performance, but the complexity is too high to be practical for real-time applications. There is also a substantial buffering requirement, since the input signal to the transform must extend at least as many frames in the temporal dimension as the number of decomposition levels of the wavelet transform dictates.

Traditionally, inter-frame compression methods are applied to an image sequence prior to intra-frame coding techniques. However, it is certainly possible to perform the inter-frame coding in the transformed domain, i.e. after the wavelet transform has been applied to the images [12]. Performing the predictive coding after the wavelet transform and the subsequent quantization has the advantage of not requiring the encoder to also decode the images in order to compute the differential frame. This also makes it possible to perform the predictive coding at the macroblock level instead of on the whole image. Unfortunately, motion compensation has been found to perform less efficiently in the transformed domain than in the pixel domain [13]. This is due to aliasing noise in high-frequency subbands resulting from the subsampling operation.
3 Layered transmission

The transmission architecture requirements for layered video on the Internet are different depending on the type of application. The requirements for the two main candidate applications for layered video, multicast conferencing and unicast streaming of stored media clips, are identified below. For both applications the UDP and RTP protocols are assumed to be used for the transport.
3.1 Multipoint conferencing

In order to enable a set of heterogeneous receivers in a conference session to receive video at different rates, we need to transmit the video layers over a set of separate channels that can be received independently. Moreover, the allocation of the channels must be receiver-driven, because only the receivers can determine the
local quality requirements and the availability of decoding resources. Furthermore, we need the network to support a scalable and efficient multicast implementation, in order to prevent duplicate video packets from flooding the network connections. Ideally the transmission system should not depend on any specialized network set-up, so that conferences can be realized over public internetworks.

In layered multicast, each layer of the encoded media is transmitted to a unique IP multicast group address. Each receiver determines the optimal number of multicast groups to join based on the available resources in terms of network bandwidth, CPU capacity for decoding, and rendering capabilities. A flow control algorithm is needed that monitors the amount of bandwidth available and joins or leaves groups in response. Unlike the unicast situation, flow control for layered multicast applications is performed only by the receivers. A number of algorithms have been proposed that use packet loss as an indication that the network is congested and lower the bandwidth by leaving multicast groups [1, 14]. Packet loss is detected by inspecting the RTP sequence number field present in every packet. Periodically, an additional multicast group is experimentally joined to investigate whether the resulting bandwidth increase causes congestion. For scalability reasons these join experiments must be coordinated among the receivers, in order not to drive the network into constant congestion when the number of receivers is large. Flow control algorithms reacting to variations in packet delay have also been suggested [15, 16]. The idea is that impending congestion can be detected and responded to before packet loss occurs, by measuring the one-way packet delay from sender to receiver. An increasing delay indicates that packets are being buffered in router queues due to congestion.
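The loss-driven receiver behaviour outlined above can be sketched as follows in Python. The thresholds, the timer value and the callback interface are illustrative assumptions rather than values taken from any of the cited algorithms [1, 14]; in particular, a real implementation must also coordinate the join experiments among receivers as discussed above.

import time

class LayeredMulticastReceiver:
    # Receiver-driven layer subscription based on observed packet loss.
    def __init__(self, join_group, leave_group, max_layers,
                 loss_threshold=0.05, join_interval=30.0):
        self.join_group = join_group        # callback: subscribe to multicast group i
        self.leave_group = leave_group      # callback: unsubscribe from group i
        self.max_layers = max_layers
        self.loss_threshold = loss_threshold
        self.join_interval = join_interval  # seconds between join experiments
        self.layers = 1                     # the base layer is always received
        self.last_join = time.time()
        join_group(0)

    def on_loss_report(self, loss_rate):
        # Called periodically with the loss rate measured from RTP sequence numbers.
        if loss_rate > self.loss_threshold and self.layers > 1:
            self.layers -= 1
            self.leave_group(self.layers)   # drop the highest enhancement layer
        elif (loss_rate == 0.0 and self.layers < self.max_layers and
              time.time() - self.last_join > self.join_interval):
            self.join_group(self.layers)    # join experiment: try one more layer
            self.layers += 1
            self.last_join = time.time()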
3.2 Unicast streaming

For layered streaming of stored video clips from a server to a client, the client must continuously provide feedback to the server about the quality of the transmission. Based on this feedback the server must adjust the transmission rate, so that it responds to dynamics in bandwidth availability caused by congested network connections. Contrary to the multicast situation, the flow control algorithm is implemented in the sender. The feedback information from the receiver usually includes the packet loss rate and the mean inter-arrival packet jitter. Typically, the receiver report mechanism of the RTCP protocol is used to convey this status information back to the sender.

Unlike the multicast situation, the individual layers of a layered encoding need not be packetized independently. In the multicast case this is necessary in order for the layers to be transmitted to different multicast addresses. For unicast streaming all layers are aggregated into one stream that is packetized according to an RTP profile. The semantics for multiplexing the media layers is defined either by the coding format or by the RTP profile.
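A corresponding sender-side sketch for the unicast case, where receiver reports drive the number of cumulative layers included in the aggregated stream; the thresholds below are illustrative assumptions, not values prescribed by RTCP or by any particular codec.

def choose_layer_count(current_layers, max_layers, loss_fraction, jitter_ms,
                       loss_high=0.02, jitter_high=50.0):
    # Adjust the number of cumulative layers sent to a unicast client, based on
    # the loss fraction and inter-arrival jitter reported by the receiver.
    if loss_fraction > loss_high or jitter_ms > jitter_high:
        return max(1, current_layers - 1)            # congestion: back off one layer
    if loss_fraction == 0.0:
        return min(max_layers, current_layers + 1)   # headroom: probe with one more
    return current_layers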
4 Experiments

In sections 2 and 3 we have motivated the use of layered encoding and transmission of video and identified the basic requirements for multicast conferencing and video streaming. An important parameter for the design of a layered video application that we have not yet addressed is how many layers we should use in total and how much bandwidth we should assign to each layer. If we have many layers of relatively low bandwidth, the network bandwidth allocation performed by the flow control algorithm can be done with a fine granularity. On the other hand, many layers will increase the processing and transmission overhead. For the layered multicast situation a large number of layers will require an equally large number of multicast groups, resulting in an undesirable processing overhead for multicast routers. Also, since we have a strict packetization interval determined by the frame rate of the video (typically 40 milliseconds for 25 fps), a large number of layers will affect the packet length. With a small packet size the IP/UDP/RTP overhead will decrease the utilization of the network. In order to study the effects of the number of layers on resource utilization, we have established a simulation model for network bandwidth utilization in layered multicast and unicast video streaming applications.
4.1 Simulation model

To simulate a layered multicast scenario with one video sender and a set of N heterogeneous receivers, we define ci, i∈[1, B], to be a set of B ≤ N bandwidth bottlenecks. Each of the N receivers is limited by one of the bandwidth constraints ci. By clustering the receivers around a number of bandwidth bottlenecks in this way, rather than assigning a limiting bandwidth threshold for each receiver independently, we simulate the typical situation that subsets of the receivers share common bottleneck links. Let ni, i∈[1, B], denote the number of receivers constrained by ci. In our simulator the distribution of receivers over the set of bottleneck capacities ci can be done using either a uniform distribution (i.e. ni = N/B) or a normal distribution. We consider the situation where one sender transmits a W bits per second (bps) video stream arranged in L cumulative layers, each with an effective bandwidth of bi bps, i∈[1, L], so that

$$\sum_{i=1}^{L} b_i = W.$$
The total bandwidth W is distributed over the layers using one of two schemes: additive layering or multiplicative layering. In additive layering each layer is assigned the same bandwidth W/L, so that each enhancement layer increases the cumulative bandwidth additively. In the multiplicative layering scheme each
enhancement layer increases the cumulative bandwidth multiplicatively by a factor m^{k/L}, k∈[1, L], for some constant m.

Let t be the packetization interval imposed by the encoding. For instance, a video frame rate of 25 frames per second implies a 40 ms packetization interval. Let p be the maximum packet size, including headers, and let h be the packet header size. Let gi denote the maximum number of groups that can be received by the receivers behind bottleneck ci. This means that gi is the maximum integer such that

$$\sum_{j=1}^{g_i} \left( b_j + \left\lceil \frac{b_j t}{p - h} \right\rceil \frac{h}{t} \right) \le c_i. \qquad (1)$$
The average utility of the multicast network bandwidth, Umc, is given by

$$U_{mc} = \frac{1}{N} \sum_{i=1}^{B} n_i \, u\!\left( \frac{\sum_{j=1}^{g_i} b_j}{\min(c_i, W)} \right), \qquad (2)$$
where u(x) denotes a utility function. We define two different utility functions:

• linear utility, ul(x) = x, which gives the utilization of network bandwidth, and

• concave utility, uc(x) = x(2 − x), representing perceptual utility.

The reason for having two different utility functions in the model is to be able to simulate both the sheer network bandwidth utilization and the utility of the bandwidth in terms of perceived visual quality. Studies of adaptive video codecs show that the marginal utility of bandwidth for these types of applications is greatest at low bandwidth levels [17, 18]. As perceived quality improves, the marginal utility of bandwidth decreases. The two utility functions are plotted in Figure 1.

A similar model can be used for simulation of unicast video streaming. To do this we set N≡B and redefine some of the variables. N and B (by definition the same) now represent the number of different bandwidth levels to be simulated between sender and receiver. Furthermore, the bandwidth bottlenecks ci, i∈[1, B], are interpreted as representing the bandwidth levels corresponding to different load conditions of the network between sender and receiver. N=B implies that ni = 1, i∈[1, N]. Contrary to the multicast situation, unicast streaming of layered video does not require each layer to be packetized independently. (For multicast transmission this is necessary since the layers are transmitted to different multicast addresses.)
Let gi denote the number of layers that is used for a particular unicast streaming situation. Then gi is the maximum integer such that

$$\sum_{j=1}^{g_i} b_j + \left\lceil \frac{\left( \sum_{k=1}^{g_i} b_k \right) t}{p - h} \right\rceil \frac{h}{t} \le c_i. \qquad (3)$$
The formula for the average utility of bandwidth for unicast streaming, Uuc, is gi
U uc
1 = B
B
u(
∑b ) j
∑ min(c , W) i =1
j=1
i
,
(4)
where gi is determined from equation (3). Note that in the multicast case we compute the average utility of the network bandwidth for a set of N heterogeneous receivers during one transmission session, whereas in the unicast case we compute the average utility of the network between a sender and receiver for N different bandwidth levels (corresponding to N different instances in time for one session, or indeed to N different sessions).
Figure 1 Utility functions (linear and concave utility plotted against normalized bandwidth)
4.2 Results

The first simulation results, presented in Figure 2, show the network utilization, as measured by the linear and concave utility functions, as a function of the number of layers used for the encoding. The full bandwidth of the simulated video transmission (W) was 4 Mbps. 1000 receivers, normally distributed over 50 distinct bandwidth bottlenecks, were simulated. A packet header size of 40 bytes, representing the IP/UDP/RTP headers, and a maximum packet size of 1 kbyte were used. The packetization interval was 40 ms, corresponding to a 25 fps video frame rate.
As can be seen in Figure 2a, the network bandwidth utilization is highest when around 20–30 layers are used for the layered transmission. Beyond that, the utilization decreases due to IP/UDP/RTP overhead. An optimal number of layers from a network utilization standpoint would thus be around 20. It is also clear that the additive layering scheme outperforms the multiplicative scheme. This finding is in agreement with related work in the field [19]. In Figure 2b the perceptual utility of the bandwidth is plotted against the number of layers. The perceptual utility reaches its maximum at about 40 layers for additive layering and at about 20 for multiplicative layering. However, it should also be noted that a high perceptual utility is reached relatively quickly (after about 10 layers) compared to the linear utility. The negative effect of increasing IP/UDP/RTP overhead is less pronounced for concave utility than for linear utility. This can be explained by the fact that although bandwidth is wasted on packet header overhead, the marginal importance of the wasted bandwidth for the application decreases.
Figure 2 Multicast network bandwidth utility for N=1000, B=50, W=4 Mbps using a) linear utility and b) concave utility (curves for additive and multiplicative layering plotted against the number of layers)
Figure 3 Multicast network bandwidth utility for N=1000, B=50, W=1 Mbps using a) linear utility and b) concave utility

In Figure 3 the same simulation is performed, but with a total bandwidth, W, of 1 Mbps. As expected, the effect of the packet header overhead is more pronounced as
the number of layers increases. The optimal number of layers from a network utilization standpoint can be seen to be around 15.

Figure 4 Unicast streaming bandwidth utility for B=100, W=1 Mbps, using a) linear utility and b) concave utility

In Figure 4 the results of two simulations of unicast streaming of a 1 Mbps video are presented. As for the multicast simulations, a packet header size of 40 bytes, a packetization interval of 40 ms and a packet size of 1 kbyte were used. The utilization can be seen to increase monotonically with the number of layers. This is because the packetization is independent of the layering, so the overhead from IP/UDP/RTP headers does not depend on the number of layers. Contrary to the multicast case, somewhat surprisingly, the perceptual utility is higher for multiplicative layering than for additive layering.
Figure 5 Network utilization for unicast streaming using 10, 15, 20 and 25 layers respectively for different target bitrates a) additive layering b) multiplicative layering

To investigate how the optimal layering strategy depends on the target bitrate of the encoder (W), the network utilizations when using 10, 15, 20 and 25 layers respectively were simulated over a bandwidth range from 100 kbps to 6 Mbps. The simulations were performed both for additive and multiplicative layering. As can be seen in Figure 5, the utilization decreases with higher bandwidth, when
keeping the number of layers constant. This is due to the fact that at higher bandwidths the layers become increasingly broadband, resulting in a more coarse-grained bandwidth allocation. This effect is more pronounced for multiplicative layering than for additive layering.
Figure 6 Multicast network utilization using 10, 15, 20 and 25 layers respectively for different target bitrates a) additive layering b) multiplicative layering

In Figure 6 the network utilization for multicast video transmitted in 10, 15, 20 and 25 layers respectively is shown when the bandwidth bottlenecks are in the range 100 kbps to 6 Mbps. In contrast to the unicast situation, network utilization can be seen to increase with bandwidth. This is explained by the higher packetization overhead that results from having many thin layers. It is also clear from Figure 6 that when a broader bandwidth range is supported, additional layers become increasingly beneficial. Specifically, at bandwidths above 1 Mbps (for the additive layering scheme, Figure 6a) the 25-layer configuration outperforms the 10-layer configuration, whereas the opposite is true for bandwidths below 1 Mbps. The conclusion to be drawn from this is that when the span of receiver capabilities is large, an increased number of layers of the video coding is beneficial. The linear utility function was used to measure bandwidth utilization for the simulations presented in Figure 5 and Figure 6. The outcome is similar for the concave utility function.
5 Summary and conclusions

In this paper we promote the use of scalable media coding and adaptive transmission architectures for two important applications: multipoint videoconferencing and unicast video streaming. State-of-the-art video codecs make it possible to generate multi-layered video bitstreams that can be adapted to the bandwidth bottlenecks of a multicast session or to dynamics in bandwidth availability in a unicast streaming situation. Our simulation results for multicast sessions indicate that, in terms of bandwidth utilization, approximately 20 layers
give good performance when the bandwidth range of the receivers is on the order of a few megabits per second. However, other concerns, like multicast routing overhead and convergence time for congestion control algorithms, are likely to influence the layering strategy. For unicast video streaming a finer granularity of layers is possible without adversely affecting bandwidth utilization and scalability. Thus, as flexible embedded video codings like the fine granular scalability of MPEG-4 become widely available, highly bandwidth-efficient transmission of video streams over the Internet can be realized.
References

[1] S. McCanne, V. Jacobson and M. Vetterli, "Receiver-driven layered multicast," Proceedings of ACM SIGCOMM'96, August 1996.
[2] J. C. Bolot and T. Turletti, "A rate control mechanism for packet video in the Internet," Proceedings of IEEE INFOCOM'94, June 1994.
[3] ISO/IEC 13818-2, "Generic coding of moving pictures and associated audio information," (MPEG-2), 1995.
[4] ISO/IEC 14496-2, "Generic coding of audio-visual objects," (MPEG-4), 1999.
[5] M. Cote, B. Erol and F. Kossentini, "H.263+: Video coding at low bit rates," IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no. 7, pp. 849-866, November 1998.
[6] H. M. Radha, M. van der Schaar and Y. Chen, "The MPEG-4 fine-grained scalable video coding method for multimedia streaming over IP," IEEE Transactions on Multimedia, vol. 3, no. 1, March 2001.
[7] ISO/IEC, "Information Technology - Coding of audio-visual objects, Part 2: Visual amendment 4: Streaming video profile," ISO/IEC JTC1/SC29/WG11 N3904, ISO/IEC 14496-2:1999/FDAM 4, January 2000.
[8] R. DeVore, B. Jawerth and B. Lucier, "Image compression through wavelet transform coding," IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 719-746, March 1992.
[9] K. Shen and E. J. Delp, "Wavelet based rate scalable video compression," IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 1, pp. 109-122, February 1999.
[10] D. Taubman and A. Zakhor, "Multirate 3-D subband coding of video," IEEE Transactions on Image Processing, vol. 3, no. 4, pp. 572-590, September 1994.
[11] Y. Chen and W. A. Pearlman, "Three-dimensional subband coding of video using the zerotree method," Visual Communications and Image Processing '96, Proceedings of SPIE 2727, March 1996.
[12] X. Yang and K. Ramchandran, "Scalable wavelet video coding using alias-reduced hierarchical motion compensation," IEEE Transactions on Image Processing, vol. 9, no. 5, May 2000.
[13] A. Nosratinia and M. T. Orchard, "A multi-resolution framework for backward motion compensation," SPIE Symposium on Electronic Imaging, February 1995.
[14] L. Vicisano, L. Rizzo and J. Crowcroft, "TCP-like congestion control for layered multicast data transfer," Proceedings of IEEE INFOCOM'98, March 1998.
[15] M. Johanson, "Delay-based flow control for layered multicast applications," Proceedings of the 12th International Workshop on Packet Video, April 2002.
[16] L. Wu, R. Sharma and B. Smith, "ThinStreams: An architecture for multicasting layered video," Proceedings of NOSSDAV'97, May 1997.
[17] J. Kimura, F. A. Tobagi, J. Pulido and P. J. Emstad, "Perceived quality and bandwidth characterization of layered MPEG-2 video encoding," Proceedings of the SPIE International Symposium on Voice, Video and Data Communications, September 1999.
[18] ITU-R Recommendation BT.500-8, "Methodology for the subjective assessment of the quality of television pictures," 1998.
[19] S. Gorinsky and H. Vin, "The utility of feedback in layered multicast congestion control," Proceedings of NOSSDAV'01, June 2001.
Paper D Adaptive forward error correction for real-time Internet video
Proceedings of the 13th International Packet Video Workshop, Nantes, France, April 2003.
Adaptive forward error correction for real-time Internet video

Mathias Johanson
Alkit Communications Sallarängsbacken 2 S-431 37 Mölndal, Sweden
[email protected]
Abstract
Real-time video communication over packet networks is attracting a lot of attention for applications such as distributed collaborative work, teleteaching and telemedicine. Due to its sensitivity to delay, real-time communication is usually based on unreliable transport protocols, like UDP, since retransmission of lost packets is considered to take too long. Instead of retransmission, forward error correction schemes have been proposed to make the applications more resilient to packet loss. Since the packet loss rate in the Internet typically fluctuates depending on the load on the network, the amount of redundancy applied by the error correction coding should ideally be adaptive to the amount of loss experienced on an end-to-end connection. In this paper an adaptive forward error correction scheme, based on Reed-Solomon erasure coding and receiver feedback, is presented. Moreover, an optimal packetization strategy for error protected video communication is suggested and an efficient algorithm implementing this packetization scheme is derived. The performance gain of the scheme, in terms of increased network bandwidth utilization, is analyzed and found to be as high as around 10% for some relevant usage situations. The combination of adaptive forward error correction schemes with congestion control algorithms is also discussed in the paper.
1 Introduction

Traditional computer communication applications, like file transfers or email, typically require reliable transmission of data. In packet-switched networks this is achieved through transport protocols implementing some sort of automatic repeat request (ARQ) technique. In the Internet the Transmission Control Protocol (TCP) provides this functionality transparently to applications. However, for delay-sensitive applications like real-time audio and video tools, ARQ schemes are not suitable since the delay imposed by retransmissions cannot be tolerated. Moreover, a limited amount of packet loss is generally not catastrophic for these types of applications, but rather leads to a graceful degradation of performance. In the case of persistent loss, however, the user-perceived audio and video quality will suffer. In response, open loop error control schemes based on forward error correction (FEC) have been proposed for both audio and video applications [1, 2, 3, 4]. FEC techniques rely on the transmission of redundant information, usually in the form of error correcting codes, from which lost data packets can be recovered. In addition to reducing the time needed to recover lost data packets compared to ARQ schemes, FEC-based mechanisms also have better scalability properties for multicast applications, since no feedback is necessary [5].
Transmission errors in computer networks can be divided into two classes: random bit errors due to noisy communication channels and packet drops due to congested links. For packet networks, like the Internet, bit errors will be detected by cyclic redundancy checksums (CRCs) and the corrupted packets will be discarded. Thus, both random bit errors and packet drops due to congestion will show up as packet loss to the applications. Consequently, Internet applications are only required to deal with packet loss, not with bit errors. In wired networks (in particular optical networks) bit errors are extremely rare, making congestion the overwhelmingly dominant cause of packet loss. This means that care must be taken when designing an adaptive FEC scheme, so that the application does not simply increase the amount of redundancy transmitted when the packet loss rate increases, since this will effectively exacerbate the congestion. Instead, the data transmission rate must be lowered in favor of more redundant information, and the total transmission rate must be controlled by the transport protocol's congestion control algorithm. This suggests that a unified approach to congestion control and error control is needed for real-time Internet applications, much like the combination of the congestion avoidance algorithm with the retransmission scheme in the TCP protocol [6]. For wireless networks, bit errors can be relatively frequent. Thus, a lost packet cannot automatically be assumed to be the result of congestion, indicating that congestion control algorithms need more input than packet loss events to be efficient in wireless networks.
2 An adaptive FEC model for Internet video

Although audio and video applications are certainly more tolerant to packet loss than data transfer applications that require perfectly reliable transport, even moderate loss rates can have a significant impact on the perceived audio and video quality if the loss is persistent. This is most pronounced for video, since video codecs are often very sensitive to loss. Moreover, since a video frame typically spans a number of packets, losing only one of these packets will often result in the whole frame being discarded. To make things even worse, inter-frame compression schemes, wherein a video frame is temporally predicted from a past frame, are often used for Internet video applications, causing the loss of one packet to potentially result in several frames being dropped. Using a video coding with only intra-coded frames makes the video more robust to packet loss, but at the expense of significantly higher bandwidth. Even for intra-coded video persistent packet loss will be a problem, indicating that an FEC scheme is needed.

FEC schemes can be used both to correct bit errors and to reconstruct lost source data symbols (known as erasures). As previously mentioned, we only need to consider erasures for packet data networks, not bit errors. Since IP packets usually are very long (typically several hundred bytes) compared to the symbol length used for the FEC coding, a lost packet corresponds to a very long burst of erasures. Although robust FEC schemes have been developed that can handle relatively long bursts, applying the FEC coding directly to the IP packets would be
highly wasteful, since we know that whenever one symbol in a packet is lost all other symbols in that packet will also be lost. The solution is to buffer a number of packets before transmission and to apply the erasure codes orthogonally across the packets, as illustrated in Figure 1. For real-time, interactive applications like teleconferencing, sender side buffering is generally problematic since it introduces delay. However, for video applications, a video frame typically needs to be fragmented into a number of packets, because its size is generally larger than the network's maximum transmission unit (MTU). The higher the resolution of the video, the more packets per frame will be needed. In a typical application a video image will require roughly five to thirty packets, depending on the resolution of the video, the frame coding mode (intra- or inter-coding) and the efficiency of the compression scheme. Thus, without requiring any additional buffering, we can apply the FEC coding across the packets corresponding to one frame of the video. For real-time audio applications, this method will generally not work very well, since the size of the encoded audio data sampled over the packetization interval (i.e. the time between two consecutive audio packets) is much smaller and cannot efficiently be fragmented into as many packets as would be needed.
Figure 1 Organization of packets for FEC coding (k data packets of the video frame plus n-k FEC packets; the codewords are formed across the packets, one symbol position at a time)
2.1 Adaptive Reed-Solomon coding

An error correction scheme wherein k source data symbols are encoded at the sender to produce a codeword of n symbols of encoded data, such that any subset of k encoded symbols suffices to reconstruct the source data, is called an (n, k) code (cf. Figure 1). There are a number of different ways to construct the error correcting codes. One of the most popular schemes is the Reed-Solomon (RS) code, which is very well suited for error protection against packet loss, since it is a maximum distance separable code, which means that no other coding scheme can recover lost source data symbols from fewer received code symbols [7]. For packet video, the symbol length used when calculating the RS codes is usually eight bits. Generally, an encoded video frame does not fit perfectly into an integral number of MTU-sized packets. When calculating the FEC packets, the last packet of the frame is simply padded to the same size as the other packets. (The padded
bytes are only needed when computing the FEC codewords and are thus not transmitted.) If a verbatim copy of the source data is contained in the encoded data the code is called a systematic code. This is a very useful property, since it makes the decoding process extremely simple in case none of the k data packets are lost. The redundant packets are then just discarded. This also means that receivers of the video stream not capable of FEC decoding can still decode the video in the absence of packet loss. The adaptive FEC scheme proposed in this paper is based on systematic RS (n, k) codes, where n and k are reassigned for each frame to be transmitted based on the optimal number of packets to fragment the frame into and the current loss rate of the communication channel. In order for a receiver to be able to reconstruct the original packets from the received packets, the positions in the packet stream of the lost packets must be known. For applications using the Real-time Transport Protocol (RTP), this information is easily obtained by monitoring the sequence number field of the RTP header. In the absence of mechanisms for guaranteed Quality-of-Service (QoS) in the Internet, the packet loss rate as experienced by applications fluctuates with the load on the network. In order for the video transmitter to apply an adequate amount of redundancy, the receiver monitors the loss rate and continually reports this back to the sender. For multicast applications, feedback from a set of heterogeneous receivers must be harmonized by the sender. To this end, different strategies can be adopted, such as adapting the redundancy to meet the loss rate of the receiver experiencing the highest loss rate, or adapting the redundancy to the mean loss rate calculated over all receiver reports. Also, as for all feedback mechanisms in multicast communication, the feedback interval must be scaled to the size of the session to avoid feedback implosion.
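As a sketch of how a sender might choose the code parameters per frame, the function below maps the frame size and the loss rate reported by the receiver(s) to an RS(n, k) code. The rounding policy and the safety margin are illustrative assumptions, not part of the scheme as such.

import math

def choose_code(frame_size, payload_size, reported_loss, margin=1.2):
    # k: number of data packets the frame fragments into.
    # n - k: redundancy packets, chosen so that the expected number of lost
    # packets (scaled by a small safety margin) can be recovered.
    k = max(1, math.ceil(frame_size / payload_size))
    if reported_loss <= 0.0:
        return k, k                         # no redundancy needed
    expected_losses = margin * reported_loss * k / (1 - reported_loss)
    n = k + math.ceil(expected_losses)
    return n, k

n, k = choose_code(frame_size=10000, payload_size=960, reported_loss=0.05)
# gives k = 11 data packets and n - k = 1 redundancy packet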
2.2 Error correction and congestion control

As previously mentioned it is important that the application keeps a unified strategy for congestion control and error correction when responding to packet loss rate reports from the receiver(s). Since the cause of the packet loss is likely to be congestion, a naïve implementation responding to increased packet loss simply by making the FEC code stronger will in effect worsen the congestion. This will cause the application to increase redundancy even more and this loop will continue ad nauseam, until all available bandwidth is consumed. The simplest remedy for this problem is to keep the output rate of the sender constant, reducing the video codec's rate when increasing the FEC redundancy. Alternatively, a rate-based congestion control algorithm could be used to control the total output rate. For instance, the TCP-friendly rate control algorithm (TFRC) [8] proposed by Floyd et al. could be used in circumstances where fairness to TCP flows is desirable. TFRC tries to model TCP's long term performance while allowing the application to change its sending rate more smoothly, compared to TCP's halving of the throughput in response to a loss event. It is important to note
that while both the adaptive FEC scheme presented here and TFRC rely on receiver reports of packet loss, the FEC scheme should be based on the long-term, persistent loss rate, whereas congestion control algorithms like TFRC are based on the loss event rate. An interesting opportunity when integrating adaptive error correction and congestion control algorithms is the possibility of increasing the robustness of the FEC coding proactively in situations when the congestion control algorithm is probing for bandwidth, e.g. the slow-start phase of the TFRC algorithm or when conducting join experiments in receiver-driven layered multicast [9]. Since all loss-based congestion control algorithms in effect provoke packet loss in order to detect congestion, the negative effects of this can be reduced by applying a suitable amount of error correction redundancy when packet loss can be foreseen by the congestion control algorithm.
3 Optimal packetization for error protected video

In its simplest form, the adaptive FEC scheme described above assumes a static packet size equal to the MTU size of the network, i.e. around 1500 bytes for most practical situations. Generally, when packetizing video data the packet size should be as large as possible to keep the overhead imposed by IP/UDP/RTP headers at a minimum. However, since the FEC coding is performed at packet level, the granularity of the error protection is determined by the packet size and the size of the coded video frame. As an example, if a video frame to be transmitted is 10 kbytes in size, a packet payload size of one kilobyte will result in ten packets. This makes the granularity with which the error protection can be applied 10% (i.e. one packet of redundancy will make the transmission robust against packet loss rates up to 10%, two packets of redundancy can withstand up to 20% losses, and so on). Thus, if we measure a loss rate of for instance 5% we will have to use an overly strong FEC code, wasting bandwidth unnecessarily. If we instead use a packet size of 500 bytes we will get a protection granularity of 5%, matching the measured loss rate perfectly, but causing higher IP/UDP/RTP header overhead. The optimal packetization strategy, from a bandwidth utilization standpoint, is thus a trade-off between error protection granularity and header overhead.

To find this optimal packetization strategy, let I be the size of the video frame in bytes, let s be the packet size in bytes, let h be the header size in bytes and let p be the current packet loss rate. The utilization, u, i.e. the proportion of the bandwidth consumed by the application that is used for actual video data, will then be given by
$$u(I, s, h, p) = \frac{I}{\left( \dfrac{I}{s - h} + \left\lceil \dfrac{p \cdot I}{s - h} \right\rceil \right) \cdot s}. \qquad (1)$$
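As a rough illustration of (1): with I = 10000 bytes, h = 40 bytes, s = 1000 bytes and p = 0.05, the frame occupies 10000/960 ≈ 10.4 packet payloads, one redundancy packet is needed (⌈0.05·10000/960⌉ = 1), and the utilization comes out at roughly 10000/(11.4·1000) ≈ 0.88.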
To achieve an optimal packetization strategy we should select s so that u is maximized for the given values of I, h and p. In Figure 2, u is plotted against s for
I=10000 and h=40 at four different loss rates (p=0, p=0.01, p=0.05 and p=0.1, respectively).

Figure 2 Bandwidth utilization as a function of the packet size for error protected transmission of a 10 kbytes video frame at loss rates of 0%, 1%, 5% and 10%

As can be expected, the utilization increases monotonically with the packet size when there is no loss. For the more interesting cases where there is packet loss, the utilization can be seen to be given by piecewise continuous functions generated from the functions obtained by setting the ceilinged expression in (1) to 0, 1, 2, etc. Although there is no closed form expression for the maximum of the utilization function, we can derive a simple algorithm to calculate it. To do this we introduce the helper functions
$$u_h(s) = \frac{I}{\dfrac{I}{s - h} \cdot s + a \cdot s}, \qquad (2)$$

where a is a nonnegative integer.
Plotting uh(s) for I=10000, h=40 and a=0, 1, 2, 3 gives the functions shown in Figure 3.

Figure 3 Helper functions (uh(s) plotted against the packet size s for a = 0, 1, 2 and 3)
It is easy to see that the utilization functions at different packet loss rates (depicted in Figure 2) can be generated by a certain combination of the helper functions (depicted in Figure 3). The endpoints of the intervals where a utilization function coincides with the same helper function can be found by solving for s in
$$a = \frac{p \cdot I}{s - h}.$$

This gives

$$\frac{p \cdot I}{a + 1} + h < s \le \frac{p \cdot I}{a} + h. \qquad (3)$$
Now, it is readily realized that the maximum of the utilization function is achieved either at one of the transition points between the helper functions, defined by (3), or at the local maximum of one of the helper functions. The local optima of the helper functions can be easily found using standard analytical methods to be given by
$$s = h + \sqrt{\frac{I \cdot h}{a}}. \qquad (4)$$
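This follows from differentiating the denominator of the helper function in (2): setting

$$\frac{d}{ds}\left( \frac{I \cdot s}{s - h} + a \cdot s \right) = -\frac{I \cdot h}{(s - h)^2} + a = 0$$

gives (s − h)² = I·h/a, and hence the expression in (4).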
We can now find the maximum of the utilization function by iteratively examining the candidate optima given by the interval endpoints of (3) and the local optima given by (4). In Figure 4, a pseudocode segment is listed that finds the optimal packet size given the current loss rate, p, the current video frame size, I, the header size, h, and the maximum packet size, S. The result will be in the variable s. The algorithm works iteratively by testing candidate maximum values, starting with points located on the helper function given by setting a=1 in (2) and then increasing a by one for each iteration. The loop terminates when the best candidate maximum value tested is found to be greater than the maximum of the helper function given by the previous value of a. Since the helper functions never intersect, this condition guarantees that the maximum value has been found and the algorithm is done. The algorithm described above is executed by the sender for each video frame to be transmitted. For non-pathological choices of the variables I, p, h and S, the algorithm finds the optimal packet size within only a few iterations of the loop. Compared to the complexity of the video compression and the FEC encoding, the added complexity of the packetization algorithm is negligible. The only requirement on the receivers is that they must be able to handle video packets of variable size. The optimal packet size, as calculated by the algorithm above, is plotted in Figure 5 for I=10000, h=40 and S=1500 at different loss rates. It is clear from Figure 5 that the optimal packet size varies between about 750 bytes and the maximum packet size (1500 bytes).
k := ceil(p*I)
if (k = 0) then
    s := S                                   -- no packet loss: use the maximum packet size
else
    s := h + 1                               -- smallest meaningful packet size (1 byte payload)
    a := 1
    prev_Sa := s
    -- uh(x, a) denotes the helper function of (2) evaluated at packet size x,
    -- and u(I, h, p, x) is the utilization function of (1).
    while (uh(prev_Sa, a-1) >= u(I, h, p, s)) do
        Sa := ceil(h + sqrt(I*h/a))          -- local optimum of helper function a, from (4)
        if (p*I/(a+1) + h < Sa and Sa <= p*I/a + h and
            u(I, h, p, Sa) > u(I, h, p, s) and Sa <= S) then
            s := Sa
        end if
        Se := min(S, max(h + 1, floor(p*I/a + h)))   -- interval endpoint from (3)
        if (u(I, h, p, Se) > u(I, h, p, s)) then
            s := Se
        end if
        prev_Sa := Sa
        a := a + 1
    end while
end if