An Extensible Framework for RTP-based Multimedia Applications

John Du, David Putzolu, Linda Cline, Donald Newell
Intel Architecture Labs, 2111 N.E. 25th Avenue, Hillsboro, OR 97124
{Donald_Newell,David_Putzolu,John_Du}@ccm.jf.intel.com

Mike Clark, Don Ryan
Microsoft Corporation, One Redmond Way, Redmond, WA 98052
{mikecl,donryan}@microsoft.com

Abstract
An extensible framework for RTP-based multimedia applications is described. We first describe the design process that led to a streaming architecture and illustrate why this is a useful model when developing network-oriented multimedia applications. We then consider the extensibility and flexibility of the framework developed. Finally, we describe design considerations and usage models for several key components in this framework. This document describes joint work by Intel and Microsoft in the area of networked multimedia streaming.

1 Introduction
With RTP[1] and its associated profiles reaching stability, the demand for richer, more compelling networked multimedia applications is rapidly growing. Therefore, an extensible and easy-to-use framework is essential for the success of RTP in the environments with which we are most concerned. We have developed such a framework, which we refer to as ActiveMovie RTP (AMRTP), in the 32-bit Windows environment. Each function involved in media streaming, such as encode/decode and network I/O, has been developed as a Component Object Model (COM) object [2]. This allows us to easily control the behavior of each component through its respective interfaces, and makes it easier to add new functionality such as adaptive controls, animation, etc. However, a COM-based design alone is not enough to make such complex applications easy to build and manage. In our final framework, the underlying subsystem is responsible for data flow and buffer management, leaving application programmers to concentrate on specifying the policies and characteristics of these data flows and how such flows are presented. In Section 2, we describe the process that led to this framework and discuss some of its features. In Sections 3, 4 and 5 we address the design and usage models for some key components, and provide examples to illustrate the flexibility of our approach. Finally, in Section 6 we present our conclusions and describe future work.

2 Framework Design
2.1 RTP and ALF
The RTP specification describes the RTP protocol as one designed to follow the principles of Application Level Framing (ALF) and Integrated Layer Processing (ILP) proposed by Clark and Tennenhouse[4]. These design principles advocate a tight integration of the network and codec processing into the application. Thus, an application can take advantage of the feedback information provided by RTP and adapt itself as needed to the condition and behavior of the network. A corollary of this principle is that each application needs to have intimate knowledge of both network and codec behaviors.

2.2 First Design - A Functional Architecture [5]
Our functional requirements were to develop an infrastructure which supported the following:
- allow easy addition of new media types into the RTP framework
- make application use of the RTP subsystem simple
- allow for new functionality to be added into the RTP subsystem with no impact on applications using RTP.

Evaluating the RTP protocol and the state of the standard was the first step in determining architectural requirements. The base RTP protocol was stable and not subject to significant change. However, the list of payload types and payload specification documents was still fairly small, and some of the video codecs we were working on had no proposed payload specification (e.g. H.263). It was obvious that the payload specification area was a work in progress, and could grow quite rapidly as new payload types were defined and specified.

The next step was to examine some of the existing internal audio/video communication applications which were important to Intel. These applications included those with both large-scale presentation and point-to-point conferencing functionality. Although each type of application presented different requirements, some of the functional components, such as the video subsystem, were similar across applications. Also, the layering structures had similarities, most notably a distinct separation of the communications, media handling and user presentation functions. This complicated the task of creating a single RTP component and API that could encapsulate the layering which existing applications expected.

We considered several publicly available RTP-compliant applications in evaluating the work required for our project. These applications included vic, a video conferencing tool, and vat, an audio conferencing tool, both developed at Lawrence Berkeley Laboratories. For prototyping purposes, we wished to utilize existing technology wherever possible. We discovered that these applications used the principles of ALF and ILP to such an extent that RTP support was not easily extractable. There were no libraries, so the RTP software could not be reused for integration with other existing applications, nor could new applications be designed to cleanly utilize that existing functionality. Just as important, adding new codecs and payload handlers appeared to be difficult. Clearly, the use of this software as the basis for our own development was impractical. It appeared to us that strong adherence to the ALF/ILP philosophy would make subsystem reuse difficult. Therefore, we decided to start our implementation from scratch and aim at a better compromise between modular components and ALF/ILP.

It made sense to separate the transmission and control protocol functions from the payload-specific handling of the audio and video data. In this manner, a single core component could handle all of the common RTP/RTCP transmission and control functions across the application, while only handlers for the specific payloads used by the application needed to be loaded. This separation also enabled new payload handlers to be more easily implemented and integrated. One prevalent design philosophy at Intel is to completely encapsulate any handling for the network into a separate communications layer so that no other component is involved in network tasks. Our first design followed this principle, and had the RTP/RTCP module situated as a layer above Winsock/UDP, internally incorporating a number of payload handlers to implement fragmentation and reassembly and to add payload-specific information (Figure 1).

Figure 1: First design. The UI and service layer drive the audio (Codec-1) and video (Codec-2) codecs; the RTP/RTCP core module (.DLL) internally incorporates COM payload handlers (H.261, G.711, ...) and produces the RTP packets and RTCP control packets.
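
To make the layering concrete, the sketch below shows an RTP fixed header (per the specification in [1]) being prepended to a payload fragment and sent over Winsock/UDP. This is an illustrative fragment only, not AMRTP code; the function name and buffer size are assumptions, and socket creation and WSAStartup are assumed to have been done by the caller.

```cpp
// Minimal sketch: an RFC 1889 fixed RTP header sent over Winsock/UDP.
// Payload type, SSRC and addressing are supplied by the caller.
#include <winsock2.h>
#include <cstdint>
#include <cstring>

#pragma pack(push, 1)
struct RtpHeader {
    uint8_t  vpxcc;      // version(2) | padding(1) | extension(1) | CSRC count(4)
    uint8_t  mpt;        // marker(1) | payload type(7)
    uint16_t seq;        // sequence number, network byte order
    uint32_t timestamp;  // media timestamp, network byte order
    uint32_t ssrc;       // synchronization source, network byte order
};
#pragma pack(pop)

void SendRtpFragment(SOCKET s, const sockaddr_in& dest,
                     const uint8_t* payload, int len,
                     uint16_t seq, uint32_t ts, uint32_t ssrc, uint8_t pt)
{
    uint8_t packet[1500];
    if (sizeof(RtpHeader) + static_cast<size_t>(len) > sizeof(packet))
        return;                        // fragment too large for this sketch

    RtpHeader h;
    h.vpxcc     = 0x80;                // version 2, no padding/extension, CC = 0
    h.mpt       = pt & 0x7F;           // marker bit clear
    h.seq       = htons(seq);
    h.timestamp = htonl(ts);
    h.ssrc      = htonl(ssrc);

    memcpy(packet, &h, sizeof(h));
    memcpy(packet + sizeof(h), payload, len);
    sendto(s, reinterpret_cast<const char*>(packet),
           static_cast<int>(sizeof(h)) + len, 0,
           reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
}
```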

This initial design had a drawback, however: the application could not monitor or control the data at the packet level. Also, it appeared useful to have a more direct coupling between the codec and the payload handlers, since some of the packetization required information best provided by the codec. Finally, it was recognized that future network-aware codecs might produce RTP packets directly, making separate payload handlers unnecessary for such optimized codecs.

Figure 2: Completed first-phase architecture. The UI and service layer couple the H.261 and G.711 codecs to their COM payload handlers; the RTP/RTCP core module (.DLL) exchanges RTP packets and RTCP control packets with the Comm layer (Winsock/UDP).

These considerations led to our completed first-phase architecture (Figure 2), which includes a core module and a group of payload handlers. The core module, the RTP/RTCP Core Module (RRCM), is responsible for control and transmission functions, and resides as a layer above Winsock/UDP. The payload handlers, implemented via the Payload Preparation Module (PPM), are coupled with their corresponding codecs via the Service Layer. As Figure 2 reflects, the minimal set of required components in a networked multimedia presentation consists of the capture/render, encoder/decoder, framing/reassembly, and network I/O pieces. In this design, we implemented these pieces as a set of modular libraries, many of them as COM objects.

As in the ALF model, the service layer --- and to some extent the application --- had to understand the interface of each piece and coordinate the buffer management and data flow between components. Each component is responsible for a particular task or function in the process of transmitting and receiving multimedia streams. The service layer calls the Capture component to get a media sample and delivers it to the Encoder for compression. The compressed data is returned to the service layer, which passes it to the framing module for packetization. Finally, the service layer gets the data back only to deliver it to the network component for transmission. The service layer grows even more complex when dealing with multiple streams and asynchronous operations between interfaces, or when adding new functionality such as end-system adaptation across data streams. Despite these problems, this architecture did meet the goal of creating an extensible framework where new codecs and payload handlers could be easily added without significantly perturbing the rest of the infrastructure. Most significantly, applications written for this infrastructure were not required to have any knowledge of media framing or networking behavior.
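
The following sketch illustrates the per-sample coordination burden this placed on the service layer for a single outgoing stream. The interface names (ICaptureDevice, IEncoder, IPayloadFramer, INetSender) are hypothetical stand-ins, not the actual component interfaces of Figure 2.

```cpp
#include <cstdint>
#include <vector>

struct MediaSample { std::vector<uint8_t> data; uint32_t timestamp; };
struct Packet      { std::vector<uint8_t> bytes; };

// Hypothetical component interfaces standing in for the COM objects of Figure 2.
struct ICaptureDevice { virtual MediaSample GetSample() = 0; virtual ~ICaptureDevice() {} };
struct IEncoder       { virtual MediaSample Encode(const MediaSample&) = 0; virtual ~IEncoder() {} };
struct IPayloadFramer { virtual std::vector<Packet> Fragment(const MediaSample&) = 0; virtual ~IPayloadFramer() {} };
struct INetSender     { virtual void Send(const Packet&) = 0; virtual ~INetSender() {} };

// The service layer must shepherd every sample through all four components
// itself, and also owns buffer reuse, timing and error handling for each hop.
void ServiceLayerSendLoop(ICaptureDevice* cap, IEncoder* enc,
                          IPayloadFramer* framer, INetSender* net,
                          const bool& running)
{
    while (running) {
        MediaSample raw = cap->GetSample();                          // 1. capture
        MediaSample compressed = enc->Encode(raw);                   // 2. compress
        std::vector<Packet> packets = framer->Fragment(compressed);  // 3. packetize
        for (const Packet& p : packets)
            net->Send(p);                                            // 4. transmit
    }
}
```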

2.3 Second Design - A Streaming Architecture
From the beginning, our primary design goal was to produce an architecture which would easily support RTP development on 32-bit Windows platforms. The components in this architecture would be used by a diverse set of applications transporting various types of media data. We wanted to be able to integrate these components easily into existing applications, enable new application designs around the components, and allow the definition and handling of new payloads and codecs to be added at any time. Our first design (Figure 2) met some of these goals, but still did not offer as clear a separation of data stream management, buffering, and presentation as we wanted. For example, adding adaptive controls or new rendering engines meant modifying the service layer. Our conclusion was that a simple COM-based design along functional lines was not sufficient to enable easy construction of networked media streaming applications.

In the context of processing media data, a media stream can be seen as a set of timed samples which are delivered from one processing module to another. In this model, a source module delivers data into a series of communicating modules which transform and filter the media samples and finally deliver them to a sink module, which removes the data from the scope of the stream. For outgoing streams, the sink module delivers locally captured media data to the network; for incoming streams, it presents, renders or archives the data. Viewed this way, media streaming produces an architecture in which the common services layer component simply establishes the plumbing to allow the modules to communicate with one another (Figure 3).

Figure 3: Streaming model. The application talks to a streaming manager, which plumbs together the Capture, Encoder, Framing, and Network I/O modules.

Concurrent with the development of the RTP subsystem, a Windows-based framework to support development of media streaming applications was announced. This framework, named ActiveMovie[3], offered considerable support for developing an architecture such as the one Figure 3 shows. The ActiveMovie framework defines the following:
- how modules (called filters) connect to one another
- how data types passed between filters are negotiated
- allocation and policies for buffer management
- how data is passed between filters.

An ActiveMovie component, called the filter graph manager, helps connect filters in the proper order, and can also be used to reflect specialized controls and events from a filter to an application. Well-understood control and status messages can be passed between filters, or if necessary between filters and an application. The components that form an ActiveMovie stream are the stream head, called the Source filter, the stream sink, called the Render filter, and zero or more transform filters between the two. The typical setup for an application is greatly simplified by this framework, since the underlying data flow and buffer management are taken care of by the graph manager and by the filter components themselves. In the simple case, all the application has to do is tell the graph manager which filters need to be connected, and then direct the graph manager to run, pause or stop the graph. The ActiveMovie environment clearly mapped to our intent in developing an RTP-based streaming framework. Figure 4 shows an example of an RTP-based video receiving filter graph.

Figure 4: RTP-based video receiving filter graph. The application drives the filter graph manager, which connects the RTP Source, RPH, Decoder and Video Render filters.

In this scenario, an application wants to receive an RTP-based video stream in a multicast session. It sets up the graph via the graph manager and tells it to run the graph. Buffer allocation and media type negotiation are performed during graph setup by each filter on its own behalf. When video data arrives, the RTP source filter detects the new source endpoint, picks the data up from a socket and passes it up to a receive payload handler (RPH). The payload handler reassembles the data packets into video frames per the payload specification and feeds the data to the codec. The codec decodes the data and passes it to the video render filter for display. All of these actions run on a single thread owned by the RTP source filter, without requiring any interaction on the part of the application.
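
A minimal sketch of this usage model is shown below, using the standard ActiveMovie/DirectShow graph-building interfaces. Only the IGraphBuilder, IMediaControl and pin enumeration calls reflect the real framework API; the CLSID of the RTP source filter is a hypothetical stand-in for the AMRTP component, and COM is assumed to have been initialized by the caller.

```cpp
#include <windows.h>
#include <strmif.h>
#include <control.h>
#include <uuids.h>

// Find the filter's first output pin using the standard pin enumeration API.
static HRESULT GetFirstOutputPin(IBaseFilter* filter, IPin** ppPin)
{
    IEnumPins* pins = NULL;
    HRESULT hr = filter->EnumPins(&pins);
    if (FAILED(hr)) return hr;
    IPin* pin = NULL;
    hr = E_FAIL;
    while (pins->Next(1, &pin, NULL) == S_OK) {
        PIN_DIRECTION dir;
        if (SUCCEEDED(pin->QueryDirection(&dir)) && dir == PINDIR_OUTPUT) {
            *ppPin = pin;                 // caller releases
            hr = S_OK;
            break;
        }
        pin->Release();
    }
    pins->Release();
    return hr;
}

// clsidRtpSource is a hypothetical CLSID for the RTP source filter.
HRESULT RunReceiveGraph(REFCLSID clsidRtpSource)
{
    IGraphBuilder* graph = NULL;
    HRESULT hr = CoCreateInstance(CLSID_FilterGraph, NULL, CLSCTX_INPROC_SERVER,
                                  IID_IGraphBuilder, (void**)&graph);
    if (FAILED(hr)) return hr;

    // Instantiate the RTP source filter and add it to the graph.
    IBaseFilter* rtpSource = NULL;
    hr = CoCreateInstance(clsidRtpSource, NULL, CLSCTX_INPROC_SERVER,
                          IID_IBaseFilter, (void**)&rtpSource);
    if (SUCCEEDED(hr)) hr = graph->AddFilter(rtpSource, L"RTP Source");

    // Ask the graph manager to build the rest of the chain (RPH, decoder,
    // renderer) downstream of the source filter's output pin.
    IPin* out = NULL;
    if (SUCCEEDED(hr)) hr = GetFirstOutputPin(rtpSource, &out);
    if (SUCCEEDED(hr)) hr = graph->Render(out);

    // Drive the whole graph through IMediaControl: Run/Pause/Stop.
    IMediaControl* control = NULL;
    if (SUCCEEDED(hr))
        hr = graph->QueryInterface(IID_IMediaControl, (void**)&control);
    if (SUCCEEDED(hr)) hr = control->Run();

    if (control)   control->Release();
    if (out)       out->Release();
    if (rtpSource) rtpSource->Release();
    if (graph)     graph->Release();
    return hr;
}
```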

2.4 Control Interface
In the normal case, described in Figure 4, an application does not need to interact with the media streaming components other than to request that they be instantiated and run. This level of abstraction is not sufficient when an application writer wants more fine-grained control. The above stream architecture is not complete in the sense that it does not allow the application to monitor, notify, configure and control the behavior of the graph. To solve this problem, each filter may expose one or more custom interfaces. Upon instantiation of a filter, the application receives a handle to that filter which may be used to control the behavior of the component. For example, the RTP source filter allows applications to get RTCP reports and to be notified when a particular source is available or becomes silent. A video codec filter might allow the application to control its frame size and bit rate.
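
As a sketch of this pattern, a custom interface of the kind described above might look as follows. The interface name, IID and method signatures are illustrative assumptions only; the real AMRTP filters expose their own (different) custom interfaces.

```cpp
#include <windows.h>
#include <strmif.h>

// An IID would normally be declared with DEFINE_GUID; a made-up value is used here.
static const IID IID_IRTPSourceControl =
    { 0x12345678, 0x1234, 0x1234, { 0x12, 0x34, 0x12, 0x34, 0x12, 0x34, 0x12, 0x34 } };

interface IRTPSourceControl : public IUnknown
{
    // Retrieve the most recent RTCP report received for a given SSRC.
    virtual HRESULT STDMETHODCALLTYPE GetLatestRTCPReport(DWORD ssrc, void* buf,
                                                          DWORD bufLen) = 0;
    // Register an event to be signalled when a source appears or goes silent.
    virtual HRESULT STDMETHODCALLTYPE RegisterSourceEvent(HANDLE hEvent) = 0;
};

// The application reaches the custom interface through ordinary QueryInterface
// on the filter handle it received when the graph was built.
HRESULT GetRtpSourceControl(IBaseFilter* rtpSourceFilter, IRTPSourceControl** ppCtl)
{
    return rtpSourceFilter->QueryInterface(IID_IRTPSourceControl, (void**)ppCtl);
}
```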

2.5 Extensibility
This framework offers a well-defined model for adding functionality by inserting new filters into the graph without requiring the application --- or other filter modules --- to make any changes, as shown in Figure 5.

Figure 5: Inserting a new filter into the graph: RTP Source, New Filter, PH Filter, Decoder, Video Render.

For example, we are adding modules to do bitrate shaping, data monitoring, etc., without modifying any existing code. Most importantly, projects using this infrastructure have added new codecs without having any knowledge of the rest of the architecture.

3 RTP Components
3.1 Division of Functionality
Our implementation decomposes networked media stream processing into four logical components:
- the RTP/RTCP core module (RRCM)
- the send and receive payload handler modules (SPH/RPH)
- media-specific codecs
- media-specific rendering.
In the following discussion, we are concerned only with the RTP-specific functions provided by RRCM, the SPH/RPH and certain of the codecs. RRCM deals with issues such as session management, RTCP reports, SSRC allocation and network I/O. SPH and RPH are concerned with fragmentation and reassembly according to payload type specifications.

3.2 Payload Handler
The payload handler components were designed around a class hierarchy for extensibility. We defined a generic handler in a base class, with all of the basic fragmentation and reassembly functionality, which presents its external interface in the form of virtual functions. This allows us to derive payload classes from the generic base class and override virtual functions for any handling which needs to be different for a specific payload. New payload handlers are easily added by deriving a new payload class and defining only the functionality that differs.
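
The following sketch illustrates the class hierarchy just described. Class and method names are illustrative assumptions, not the actual AMRTP classes; the H.261 override bodies are omitted.

```cpp
#include <cstdint>
#include <vector>

struct RtpPacket { std::vector<uint8_t> bytes; bool marker = false; uint32_t timestamp = 0; };

// Generic base class: default fragmentation/reassembly plus virtual hooks.
class GenericPayloadHandler {
public:
    virtual ~GenericPayloadHandler() {}

    // Default behavior: split a frame at the MTU boundary, no payload-specific header.
    virtual std::vector<RtpPacket> Fragment(const uint8_t* frame, size_t len,
                                            uint32_t timestamp, size_t mtu)
    {
        std::vector<RtpPacket> out;
        for (size_t off = 0; off < len; off += mtu) {
            RtpPacket p;
            size_t n = (len - off < mtu) ? (len - off) : mtu;
            p.bytes.assign(frame + off, frame + off + n);
            p.timestamp = timestamp;
            p.marker = (off + n == len);   // marker bit on the last fragment
            out.push_back(p);
        }
        return out;
    }

    // Default reassembly: append payload bytes; the frame is complete when the
    // marker bit arrives.
    virtual bool Reassemble(const RtpPacket& pkt, std::vector<uint8_t>& frameOut)
    {
        frameOut.insert(frameOut.end(), pkt.bytes.begin(), pkt.bytes.end());
        return pkt.marker;
    }

protected:
    // Hook for payload-specific header generation; base class adds nothing.
    virtual size_t WritePayloadHeader(uint8_t*, size_t /*fragIndex*/) { return 0; }
};

// H.261-specific handler: overrides only what differs from the generic case,
// e.g. fragmenting at GOB/macroblock boundaries and writing the H.261 payload
// header fields (GOBN, MBAP, QUANT, ...). Definitions omitted in this sketch.
class H261PayloadHandler : public GenericPayloadHandler {
public:
    std::vector<RtpPacket> Fragment(const uint8_t* frame, size_t len,
                                    uint32_t timestamp, size_t mtu) override;
protected:
    size_t WritePayloadHeader(uint8_t* dst, size_t fragIndex) override;
};
```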

3.3 RTP/RTCP Core Module
This component sits on top of the Winsock layer. It handles RTP session management and generates RTCP reports, both locally and for the network. For scalability, it has options to prune feedback information for sources other than itself. Some of the interfaces it provides for application support include:
1. Enabling or disabling RTCP report generation.
2. Setting receiver buffer size and number.
3. Setting thread I/O priority.
4. Getting/setting RTP session addresses.
5. Getting/setting multicast scope.
6. Generating an SSRC.
7. Disabling BYE.
8. Returning RTCP reports locally to the application.
A sketch of one possible control interface covering these capabilities is shown below.
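
This is a hypothetical sketch only; the interface name, method names and signatures are assumptions, not the shipped RRCM definitions.

```cpp
#include <winsock2.h>   // sockaddr_in
#include <windows.h>

interface IRTPSession : public IUnknown
{
    virtual HRESULT STDMETHODCALLTYPE EnableRTCPReports(BOOL enable) = 0;               // (1)
    virtual HRESULT STDMETHODCALLTYPE SetReceiveBuffers(DWORD count, DWORD size) = 0;    // (2)
    virtual HRESULT STDMETHODCALLTYPE SetIOThreadPriority(int priority) = 0;             // (3)
    virtual HRESULT STDMETHODCALLTYPE SetSessionAddresses(const sockaddr_in* rtpAddr,
                                                          const sockaddr_in* rtcpAddr) = 0; // (4)
    virtual HRESULT STDMETHODCALLTYPE SetMulticastScope(DWORD ttl) = 0;                  // (5)
    virtual HRESULT STDMETHODCALLTYPE GenerateSSRC(DWORD* pSsrc) = 0;                    // (6)
    virtual HRESULT STDMETHODCALLTYPE DisableBye(BOOL disable) = 0;                      // (7)
    virtual HRESULT STDMETHODCALLTYPE GetLocalRTCPReport(void* buf, DWORD bufLen) = 0;   // (8)
};
```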

3.4 Framing Support in Codec Component
As mentioned earlier, an RTP payload specification may require that additional information be carried in a payload header. It may also specify that, if the frame data cannot fit within a single transport packet, fragmentation must occur at designated boundaries in the data. This information may be readily available at the time of encoding, but is difficult to obtain later without decoding the data stream. Unless a codec is capable of generating the audio/video bit stream in the RTP format directly, a separate payload handler must generate the RTP packets from the audio/video data. Conversely, unless a decoder is capable of receiving RTP packets directly, a payload handler is required to reassemble the received data to build the media bit stream for decoding.

Our current H.263 and H.261 codecs operate on a frame basis rather than a group-of-blocks (GOB) or macroblock (MB) basis. The information at the GOB or MB level is not directly available outside the codecs. However, the RTP payload format specifications for H.263 and H.261 state that the compressed video stream must be fragmented at either GOB or MB boundaries. Also, within each packet payload header, several fields must be provided, such as the quantization level, motion vector predictor, macroblock address predictor, etc. In order for the encoders to provide this information to the payload handlers, we decided to extend the video bit stream by appending additional structured information to each frame. The payload handler can use this additional information to find the GOB and MB boundaries and efficiently generate the RTP payload headers.

The extended bit stream is intended to facilitate the fragmentation process for compressed video buffers when transmitted over RTP. It also provides a mechanism by which incomplete bit streams can be decoded even in the presence of packet loss during transport over the Internet. The normal bit stream is extended with structured information which the payload handler uses when building a packet for transmission. Upon receipt of video packets from the network, the payload handler reassembles the bit stream and reconstructs the trailer part of this bit stream from the information in the packets. This extended bit stream is provided to the decoder for processing. The bit stream contains three parts:
1. The first part of the bit stream is defined by the ITU H.263 (or H.261) specification. Its size varies depending on the output of the codec for the current frame.
2. The second part is called the bit stream information stream. Each of its elements is a bit stream information structure.
3. The third part is called the bit stream information trailer. It is added for bit stream verification and packet identification.
This extended bit stream format is used as the output of the encoder and the input of the decoder. An intelligent decoder can recover from the loss of data with graceful degradation, based on the information provided in the bit stream information stream.

At the receiving end, the payload handler detects a packet loss by inspecting the sequence number in the RTP header. By using the information in the payload header, the payload handler can also determine the number of packets lost. It can reconstruct the information and pass it to the decoder in the bit stream information stream, replacing the lost segments in the stream with a unique bit pattern. This enables the decoder to recognize that data is missing and to handle the frame intelligently, reconstructing as much lost data as possible. To prevent a tight coupling between the codec and the payload handler on incoming media samples, the RPH negotiates with the codec to make sure it can interpret the patterns used for resiliency. The generation of the extended bit stream is optional in the codec, so that it can be used for non-RTP applications. The bit stream extension is non-intrusive in that the normal bit stream is left intact and the frames can be decoded properly by any H.263 or H.261 decoder. None of the extended bit stream information is reflected in the packets transmitted on the network.
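
As an illustration only, the extended bit stream might carry structures along the following lines; the field names and widths below are assumptions (chosen to mirror the H.261/H.263 RTP payload header fields such as GOBN, MBAP and QUANT), not the actual AMRTP layout.

```cpp
#include <cstdint>

#pragma pack(push, 1)
// Hypothetical layout of one "bit stream information structure", appended per
// GOB/macroblock boundary when the extended-bit-stream option is enabled.
struct BitStreamInfo {
    uint32_t bitOffset;      // offset of this GOB/MB boundary within the frame bit stream
    uint8_t  flags;          // e.g. GOB vs. MB boundary, intra/inter coding
    uint8_t  quant;          // quantizer in effect at the boundary (QUANT)
    uint8_t  gobNumber;      // GOB number (GOBN)
    uint8_t  mbAddressPred;  // macroblock address predictor (MBAP)
    int16_t  mvPredictorX;   // horizontal motion vector predictor
    int16_t  mvPredictorY;   // vertical motion vector predictor
};

// Hypothetical trailer used for bit stream verification and packet identification.
struct BitStreamTrailer {
    uint32_t magic;          // unique pattern for verification
    uint32_t numInfoEntries; // number of BitStreamInfo records preceding the trailer
    uint32_t compressedSize; // size in bytes of the standard H.261/H.263 bit stream
};
#pragma pack(pop)
```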

4 Adaptive Control Component (ACC)
In this section, we demonstrate the extensibility of our framework by adding adaptive control functionality. There are essentially two kinds of adaptive control approaches for RTP-based streaming. One is a sender-driven approach[6], which we call Adaptive Source Rate Control (ASRC), in which the sender adjusts its bitrate or intraframe encoding frequency based on packet loss rates obtained from RTCP reports. The other is Receiver-driven Layered Multicast (RLM)[7], in which receivers adapt to congestion by joining and leaving multicast groups.

4.1 API Design and Usage Model
Since the ACC should handle both adaptation approaches, we define an ACC session as a set of associated RTP sessions which need adaptive control. For example, an ACC session might consist of an RTP audio session and an RTP video session, or it might consist of multiple layered video RTP sessions. The application can also add additional RTP sessions to an existing ACC session. Through the ACC's control interfaces, the application can set policies to control the rate change range, media priorities, congestion criteria, etc.
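
A sketch of what such an ACC session interface could look like follows. The interface name, methods and parameters are illustrative assumptions, not the actual ACC API.

```cpp
#include <windows.h>

interface IACCSession : public IUnknown
{
    // Associate another RTP session (audio, video, or an extra video layer)
    // with this ACC session, together with its media priority.
    virtual HRESULT STDMETHODCALLTYPE AddRTPSession(IUnknown* pRtpSession,
                                                    DWORD mediaPriority) = 0;
    // Policy knobs: allowed bitrate range and what counts as congestion.
    virtual HRESULT STDMETHODCALLTYPE SetRateRange(DWORD minBitsPerSec,
                                                   DWORD maxBitsPerSec) = 0;
    virtual HRESULT STDMETHODCALLTYPE SetCongestionThreshold(float lossFraction) = 0;
};
```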

4.2 ASRC Architecture
The sender-side architecture for sender-driven rate adaptation is shown in Fig. 6. The adaptive control functionality has been split into two parts. The front-end component, called the ACC Proxy, provides the ACC with interface pointers to the other filters in the graph and notifies it of new RTCP events as they happen, such as RTCP reports, BYE messages, etc. The back-end component, the ACC itself, implements the adaptation engine, processes statistical feedback from all of the RTP sessions, and controls the codecs to increase or decrease the bitrate and intraframe encoding frequency according to policies specified by the application.

Figure 6: Sender-side ASRC architecture. The video and audio chains (Source, Codec, RTP SPH, ACC Proxy, RTP Render) each report to the Adaptive Control Component through a per-stream ACC Proxy.

Note that the addition of the adaptive control framework was accomplished by adding pieces transparently to the normal end-user application. In the typical case, the ACC operates without any input from a higher-level application. Only when the application wants fine-grained control over policies, such as floor and ceiling bitrate values, does it need to seed the ACC with information.
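
A minimal sketch of the kind of policy the ACC engine can apply, in the spirit of the sender-driven scheme in [6], is shown below: on each RTCP receiver report, the codec target bitrate is raised or lowered based on the reported loss fraction. The thresholds and step sizes are illustrative, and SetTargetBitrate() stands in for whatever custom codec-filter interface is actually exposed.

```cpp
// Illustrative loss-driven rate policy; all constants are assumptions.
struct RatePolicy {
    double lossUpper   = 0.05;   // above this, treat the path as congested
    double lossLower   = 0.01;   // below this, probe for more bandwidth
    double decrease    = 0.75;   // multiplicative decrease factor
    double increaseBps = 10000;  // additive increase per report interval
    double floorBps    = 32000;  // application-supplied floor
    double ceilingBps  = 512000; // application-supplied ceiling
};

double NextBitrate(double currentBps, double lossFraction, const RatePolicy& p)
{
    double next = currentBps;
    if (lossFraction > p.lossUpper)
        next = currentBps * p.decrease;        // back off under congestion
    else if (lossFraction < p.lossLower)
        next = currentBps + p.increaseBps;     // cautiously probe upward

    if (next < p.floorBps)   next = p.floorBps;
    if (next > p.ceilingBps) next = p.ceilingBps;
    return next;   // the ACC would then call something like pCodec->SetTargetBitrate(next)
}
```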

4.3 RLM Architecture
The receiver-side architecture for receiver-driven rate adaptation is shown in Fig. 7. In this case, each ACC proxy has the same functionality as in ASRC. The Layered Video Merger (LVM) is used to synchronize multiple layered video streams and deliver a single bit stream to the codec. The ACC joins and drops RTP sessions according to RLM protocols, based on packet loss. It also informs the LVM of the current number of active layers so that the LVM can synchronize the multiple streams correctly.

Figure 7: Receiver-side RLM architecture. Two layered RTP sources (L1, L2) feed RTP RPH filters through per-stream ACC Proxies; the Layered Video Merger combines the layers into a single bit stream for the video codec and YUV renderer, under control of the Adaptive Control Component.
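
A highly simplified sketch of the receiver-side decision the ACC makes under RLM [7] is shown below: drop the top layer when sustained loss indicates congestion, and periodically attempt a join experiment on the next layer when the network is clean. JoinLayer/LeaveLayer are hypothetical stand-ins for the ACC's multicast group management, and the thresholds are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stubs; in the real system these would issue the actual
// multicast joins/leaves through the RTP source filters' session interfaces.
static void JoinLayer(unsigned /*multicastGroup*/)  { /* issue the join  */ }
static void LeaveLayer(unsigned /*multicastGroup*/) { /* issue the leave */ }

struct LayerState {
    std::vector<unsigned> layerGroups;  // multicast groups, base layer first
    std::size_t activeLayers = 1;       // always keep at least the base layer
};

void AdaptLayers(LayerState& s, double lossFraction, bool joinTimerExpired)
{
    const double congested = 0.05;      // illustrative loss thresholds
    const double clean     = 0.01;

    if (lossFraction > congested && s.activeLayers > 1) {
        LeaveLayer(s.layerGroups[s.activeLayers - 1]);    // shed the top layer
        s.activeLayers--;
    } else if (lossFraction < clean && joinTimerExpired &&
               s.activeLayers < s.layerGroups.size()) {
        JoinLayer(s.layerGroups[s.activeLayers]);         // join experiment
        s.activeLayers++;
    }
    // The ACC then tells the Layered Video Merger how many layers are active
    // so it can resynchronize the merged bit stream.
}
```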

5. Rich Media Presentations
When compared with commercial cable and broadcast television productions, the typical MBONE session is crude. MBONE sessions are characterized by fixed, often uninteresting camera angles and static zoom factors. As part of our ongoing work to demonstrate richer, more interesting media productions, we have started to explore the possibility of combining animation, graphics and video to create compelling visual presentations. The basic model we are working towards is to provide remote effects (Spin, Smorphs, Scaling, Overlays, etc.) by transmitting them on a separate stream which is associated with the base video or graphics stream in a presentation. While we haven't confirmed it with research, our current belief is that applying many typical image effects (spin, flip, shearing, etc.) at the source effectively increases the motion component, thus increasing the bandwidth (BW) requirements for motion-vector based codecs. We suspect that a multicast effects stream would be significantly smaller than this "induced" BW. In the case of live presentations with several potential sources (as in a multicast panel discussion), effects (perhaps including transitions between source streams) might be multicast from a separate non-source host (i.e. a "video mixing console"). Live and packaged effects might also be applied directly at a receiving host for "local consumption" only. Also, by transmitting the effects on a separate stream we allow endpoints to make tradeoffs between resources such as bandwidth and MIPS.

Currently, we have a prototype running which allows local manipulation of an incoming video stream. This was done by simply adding a new render filter into the RTP streaming filter graph. The animation engine we are using to produce the effects is Intel's RDX (Realistic Display Mixer) package. In this context, RDX should be viewed as a technology which enables more compelling forms of video/graphical presentation. The initial AMRDX prototype consists of a filter which allows the user to apply a few basic RDX effects to an RTP video stream. The filter applet runs in the context of the ActiveMovie Graph Editor, and the UI consists solely of filter property pages and the video rendering window.

Figure 8: AMRDX prototype in the ActiveMovie Filter Graph Editor context. The AMRDX filter applet uses a custom interface to control a graph of RTP Source, RTP RPH, Decode and AMRDX Render filters; the AMRDX render filter drives the RDX COM API on top of the RDX core API.

We believe there is a set of effects which can be specified independently of the underlying animation engine. Our intent is to define some of these effects and use our animation engine as a proving ground for their utility.

6. Conclusion
In this paper, we describe an extensible framework for RTP-based multimedia applications. This framework allows the addition of new components and payload types without disturbing the rest of the framework. It makes construction of basic streaming systems very simple from an application's point of view.

Finally, it is flexible enough to allow applications detailed control of all aspects of streaming by exposing properties of each filter through custom interfaces. Future directions inside this framework include incorporating Resilient Multicast [8] and providing more powerful integration with animation and special effects.

7. Acknowledgments
Thanks to lots of folks who have contributed to our architecture and implementation: Chad Zhu, Christian Maciocco, Chris Pearson, Thomas Pfenning et al.

References
[1] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V., RTP: A Transport Protocol for Real-Time Applications, RFC 1889, IETF, 1996.
[2] Microsoft Component Object Model Specification, Microsoft, 1995.
[3] Microsoft ActiveMovie SDK 1.0, Microsoft, 1995.
[4] Clark, D., and Tennenhouse, D., Architectural Considerations for a New Generation of Protocols, Proceedings of ACM SIGCOMM '90, Sept. 1990, pp. 201-208.
[5] Cline, L., Maciocco, C., Mirashrafi, M., and Zhu, C., An Extensible Framework for Standard Internet Multimedia Communication, Intel Software Developers' Conference, 1996.
[6] Busse, I., Deffner, B., and Schulzrinne, H., Dynamic QoS Control of Multimedia Applications Based on RTP, Computer Communications, Jan. 1996.
[7] McCanne, S., Jacobson, V., and Vetterli, M., Receiver-driven Layered Multicast, Proceedings of ACM SIGCOMM '96, August 1996, pp. 117-130.
[8] Xu, X. R., Myers, A. C., Zhang, H., and Yavatkar, R., Resilient Multicast Support for Continuous-Media Applications, NOSSDAV '97.
