A Distributed Recording System for High Quality MBone Archives

Angela Schuett, Randy Katz, Steven McCanne
Department of Electrical Engineering and Computer Science
University of California, Berkeley
{schuett, randy, [email protected]}
Abstract. Popular multicast applications that allow group communication using real-time audio and video have enabled a wide variety of online meetings, conferences and panel discussions. The ability to record and later replay these sessions is one of the key functionalities required for a complete collaboration system. One of the unsolved problems in archiving these interactive sessions is the lack of any method for recording sessions at the highest possible quality. Since audio and video transmissions are typically sent unreliably, there may be a wide variance in recorded quality depending on where the recorder is placed relative to the various sources. This is especially problematic if multiple sources are active in a single session. In addition, because of congestion control schemes that send high-quality, high-rate data to local receivers, and low-rate data in the wide area, different sets of data may be available in different areas of the network for any given session. In response to these challenges, we have developed a system that uses multiple distributed recorders placed at or near the sources of the session. These recorders serve as data caches that transmit data to archives. The archive systems collate the data from various recorders and create a high-quality recorded session, which is then available for playback. In this paper, we present the tradeoffs involved in architecting a distributed recording system, and present our design for a fault-tolerant, scalable system that also supports a wide range of heterogeneity in end-system connectivity and processor speed. This is achieved in our system through the use of decentralized, shared control protocols that allow simple and fast fault recovery, and decentralized, multicast data collection protocols that allow multiple systems to share data collection bandwidth. We describe an implementation of the system using the MASH multimedia toolkit, the libsrm reliable multicast protocol framework, and the AS1 active service middleware platform implementation. We also discuss our experience with the system and identify several areas of future work.
1 Introduction

The deployment of the multicast backbone, or MBone, has made possible synchronous multi-party communication that is more efficient and more accessible than ever before. The standardization of RTP [SCFJ96] as a light-weight, best-effort, real-time data transmission protocol has allowed interactive sessions that scale to a very large number of participants. A number of publicly available tools transmit and receive RTP data streams [MJ95,JM,Sch92,CR], and these tools have been used to transmit sessions such as concerts, classes, group meetings, lectures, and conversations. Archive systems have also been developed that record and play back RTP sessions. These include local recorders and local players, which save RTP packets and replay them from local disk [Sch,Hol95,CSa], and archive systems [AA98,Kle94,CSa], which allow remote clients to request playback of sessions that are stored at the archive. Some archive systems also allow clients to request that the archive system record an advertised session [Hol97,CSb,LKH98]. As the MBone carries more content, we expect that more archive systems will be run at a variety of sites, independently recording and replaying sessions that are of interest to their local user populations.

One of the challenges in designing and deploying a recording system for the MBone is that RTP audio and video sessions may exhibit different types of intra-session heterogeneity. That is, participants in the same session may see different data. The first type of heterogeneity is in received reception quality. Since RTP is a lossy protocol, without retransmission requests, some receivers may receive more packets, and thus a better reception quality, than other receivers.* This is particularly apparent on the current MBone, where one study measured that 50% of receivers in a large MBone session had loss rates above 10% [Han97]. Another study found that loss bursts lasting longer than 8 seconds were not uncommon [YKT96]. If a recorder is too "far away" (in network terms) from the session sources, it may produce an almost unusable recording of the session.
The second type of heterogeneity is in the subscribed reception quality. In order to satisfy the network and processing requirements of a variety of receivers, a source may send its data stream in a hierarchically encoded, layered format [MJV96]. As illustrated in Figure 1, receivers subscribe to as many layers as can fit in the available bandwidth. In this way, the source may generate high-quality, high-bandwidth layers that are viewed by nearby local receivers, while receivers across congested links may only subscribe to the lower-quality, lower-rate layers. A recorder which is not local to some sources may be missing a large percentage of the available session data. Another approach to dealing with both congested links and the differing decoding abilities of receivers is to install media gateways [AMZ95], mixers, and transcoders in the network. These may transform sessions into different transmission formats, or may rate-limit sessions. This type of transformation is described in the third type of intra-session heterogeneity, reception and transmission format. A session that contains transcoders may contain sub-sessions where data is transmitted in a format incompatible with the wider session. The transcoders join these sub-sessions into a single virtual session.

* Note that this is not an issue for SRM and other reliable multicast protocols, since these protocols are designed so that all packets will eventually reach all receivers.
Fig. 1. Layered multicast session with 2 sources separated by a bottleneck. Each source sends 3 layers locally, but there is no single point in the session where a recorder could receive all 6 possible layers.

The location of recorders with respect to these gateways may have a large influence on the format and quality of the recorded session. Because of these various types of sender and receiver heterogeneity, any single session recorder will have the unique viewpoint of the place in the network where the recording is taking place. When seeking an archival copy of a session, the question is which viewpoint produces the best recording. In a session with a single source, the most complete session viewpoint would be that which is closest to the source. However, in a session with multiple sources, there may be no single place in the network that has a high quality viewpoint of all the sources. Therefore, instead of choosing a single viewpoint for an archived session, we would prefer to combine the data from multiple viewpoints into a recorded session with no missing pieces. These individual recorded viewpoints may be streamed on a separate session to an archive system that collates the data into a single high-quality representation.

High-quality archived sessions allow playback systems more flexibility in replaying stored sessions. During the interactive session, certain sacrifices for global congestion control may have been necessary, but during playback a different set of tradeoffs exists. Viewers may trade a longer playout buffer and playback latency for a higher-quality representation. Some playback viewers may only be interested in a subset of the original participant group, but at the highest data rate possible for that subset. High-quality representations are also very useful at the archive for post-processing algorithms such as image or voice recognition for automated indexing and annotation services.
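The layered-subscription behavior illustrated in Figure 1 can be sketched as follows. This is a minimal, hypothetical illustration (the function name and layer rates are invented for this example, not part of the system described): a receiver joins layers, base first, until the cumulative rate would exceed its available bandwidth.

```python
# Hypothetical sketch of receiver-driven layer subscription.
# layer_rates_kbps lists each layer's rate, base layer first.

def layers_to_join(layer_rates_kbps, available_kbps):
    """Return how many layers (base first) fit in the available bandwidth."""
    joined = 0
    cumulative = 0
    for rate in layer_rates_kbps:
        if cumulative + rate > available_kbps:
            break
        cumulative += rate
        joined += 1
    return joined

# A local receiver with ample bandwidth joins all three layers;
# a receiver behind a 200 kbps bottleneck joins only the base layer.
print(layers_to_join([128, 256, 512], 1000))  # all three layers
print(layers_to_join([128, 256, 512], 200))   # base layer only
```

Under this model, a recorder placed behind the bottleneck in Figure 1 would simply never see the enhancement layers, no matter how reliable its reception of the layers it did join.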
Distributed recording as a solution to the problem of getting perfect quality recording has also been described by Lambrinos et al. [LKH99]. Their paper motivates the need for distributed recording as a way both to achieve higher-quality archive copies of sessions, and as a means of gathering delay information from various receivers. This delay information can be used in replaying the session from a variety of "perception points". Other possibilities for improving the quality of recorded sessions include using forward error correction (FEC) or retransmission schemes such as resilient or reliable multicast [XMZY97,LPA98]. These can improve the received reception quality of recorders by limiting the number of transmission errors, but they cannot broaden the wide-area session view by transmitting local enhancement layers or data from behind a gateway. In addition, real-time participants may not wish to have as many retransmissions as archives may require, because interactivity in sessions requires limits on the latency between transmission and display. If a packet arrives at the site of a session participant after the audio or video frame has been played, it is useless, except for the case of archiving. A separate session for archive data gathering allows us to weaken the real-time latency constraints and stream data in a slower, TCP-friendly way through bottleneck links, without forcing all of the interactive session participants to receive the data enhancements.

In order to achieve high-quality recordings of sessions, we have designed and implemented a system that uses recording caches as data caches, supplying data to archive systems, which collect this enhancement data on a session separate from the original interactive session. In the next section of the paper, we describe in more detail the components and protocols of the system, and the design decisions that inform our final design. In Section 3 we describe the details of the protocol for collecting recorded data to participating archives. Section 4 describes our implementation and experience with the system.
Section 5 describes our ongoing and future work in improving and extending the system, and our conclusions are given in Section 6.
2 System Design

When implementing a system that will be large-scale and distributed, there are a number of important design considerations. In common with other large-scale, wide-area systems, the system must be scalable, fault-tolerant, TCP-friendly, and able to support heterogeneity in participants and network conditions. These goals have been achieved in a number of routing and application layer protocols through the design principles of light-weight sessions [McC98,Jac94]. Light-weight sessions use shared multicast control rather than centralized control, and soft state rather than hard state. Soft state [Cla88,RM99a] describes a protocol style which uses unreliable transmission of periodic state messages to achieve eventual consistency. State which is not refreshed by a periodic message eventually expires. Since messages are periodic, state is rebuilt automatically if a participant loses state due to a crash, or enters the session late. In contrast, hard state is sent reliably, and the default is that state is only established once.
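The refresh-and-expire behavior of soft state can be sketched as a small table keyed by participant. This is an illustrative sketch only (the class and method names are invented, not part of the system's actual interfaces): entries are installed or renewed by periodic announcements and silently expire when refreshes stop arriving.

```python
import time

class SoftStateTable:
    """Minimal soft-state table: periodic refreshes keep entries alive."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.entries = {}          # key -> (value, time of last refresh)

    def refresh(self, key, value, now=None):
        """Called on every periodic announcement; installs or renews state."""
        now = time.time() if now is None else now
        self.entries[key] = (value, now)

    def expire(self, now=None):
        """Drop state whose announcements have stopped arriving."""
        now = time.time() if now is None else now
        dead = [k for k, (_, t) in self.entries.items()
                if now - t > self.timeout]
        for k in dead:
            del self.entries[k]

    def get(self, key):
        entry = self.entries.get(key)
        return entry[0] if entry else None
```

Note that a crash needs no special recovery path: a restarted participant simply rebuilds its table from the next round of periodic announcements, which is exactly the property the light-weight sessions model relies on.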
Functionality                                      Archive Systems  Recording Caches
Provides user interface for scheduling recording   Yes              No
Records sessions                                   Yes              Yes
Stores sessions                                    Long term        Short term
Provides streaming session playback                Yes              No
Responds to requests for cached packets            Yes              Yes
Fig. 2. Functionality of Archive Systems and Recording Caches

Separate procedures must be used to rebuild state after a crash or for a new participant. We apply the light-weight sessions model of protocol design to many of the components of our recording system, ensuring that it will co-exist smoothly with other MBone protocols. Whenever possible, we use soft state, rather than hard server state. We use distributed algorithms rather than centralized server-side algorithms, so that there is no single point of failure. We use receiver-driven protocols, to allow for heterogeneity of receiver participation interest.

In addition to the general goals of scalability, efficiency and fault-tolerance, we have several more specific goals for a distributed recording system. Our first goal is that the live session quality should never be degraded by recording operations or retransmissions. To meet this goal we must include provisions for congestion control in our data transmissions. Our second goal is that the recorded session content should be accessible as soon as possible, even during the on-going live session. Accordingly, we must consider latency in our design. Our third goal is that the system should be able to work with a variety of recorder, archive and collaboration tools. To this end, the system should be componentized, rather than monolithic, with clear protocols for interaction between components. We must also consider that archive administrators and end-users may have heterogeneous interests, both in the types of sessions to be recorded and the desired quality levels of the resulting archival copies. Consequently, the protocols of the system must allow for application-defined quality levels. In the context of these goals and general system design principles, we have framed our system and designed the protocols and components that implement it. In the next section, we give a high-level overview of the design.
In the following sections, we go into more detail and describe the rationale for the various design decisions.
2.1 Design Summary
As stated earlier, the distributed recording system consists of two cooperating components, archive systems and recording caches. Figure 2 lists the differing responsibilities of these two components. Archive systems are fairly large, statically placed systems that provide playback and recording services to a number of users. Recording caches are smaller infrastructure services which do not provide a user interface or long term storage, but that perform recording upon
Fig. 3. An example interactive session showing multiple sources, an archive system, and several recording caches. Recording caches temporarily store data from sources that are not otherwise adequately recorded by archives, because of bottleneck links or local-only data transmissions. On a separate session, data is retrieved from the caches and stored permanently at the archive system.

request, store packets temporarily, and answer requests from archive systems for specific packets. Note that both recording caches and archive systems perform recordings, and both may answer requests for packets. Our design allows archive systems to be run independently, with a variety of implementations and specializations, similar to the current diversity of web servers on the Internet. Because archives act independently, there may be several archive systems near each other, and no archive systems in other areas of the session. In order to achieve a high-quality recording, at least one recording agent (either an archive system or a recording cache) must be "close" to each of the session sources. If no archive system is close to a source, then a recording cache may be placed near the source to provide coverage. A single recording cache may take responsibility for multiple sources if the sources are located in the same local area. The system uses these caches to provide enhancement data to archive systems. The archive systems individually record as much of the session as they can receive consistent with congestion control algorithms. On a separate archive collection session, archive systems request missing data packets. Responses may come from other archive systems, or from recording caches. Using these responses, archives build a complete, high-quality copy of the session. Figure 3 shows an example session containing multiple sources, a single archive, and several recording caches.
2.2 Recorders
In our system, recording caches and archive systems provide the data needed to construct high-quality recordings. An alternate design would be to require
sources to keep a log of packets so that archives could request missing packets or layers from sources. This has the advantage of simplicity, in that each source is responsible only for its own packets, and no packet need ever be lost. However, it does have several disadvantages. Due to heterogeneity, some sources may not have sufficient resources to maintain a complete packet log. They may be running on a disk-limited or processor-limited platform, and requiring the source to keep a packet log may degrade the quality of the original session. Another limitation is that it may be inefficient to require each source to keep a separate log. There may be several sources on the same local network participating in the same session. Any one source on this network could keep packets for all local sources, simplifying the recovery process. Finally, relying on sources to be loggers and responders is not fault tolerant, since a source could disappear from the session, causing all of its packets to be lost from the archives.

To solve these problems, our design allows sources to transmit normally and uses nearby recording agents (either caches or archives) to store data from sources and provide that data to archive systems. In essence, these recorders act as proxy responders on behalf of sources. In this way, we can also take advantage of the mechanisms for ensuring the reliability and scalability of proxy services, including cluster-based platforms and the automatic restart of services that have crashed. We believe that proxy platforms may be provided by ISPs to run a number of services on behalf of users, such as web content extractors [FGC+97], Internet shopping brokers [GAE98], and video stream transcoders [AMK98,AMZ95]. Another advantage of placing recorders on proxy platforms is that it allows recorded session data to be used by other services running at the platform. For example, an instant replay service might provide quick replay in the local area, while an indexing service might produce an on-the-fly summary of the session. For the implementation of our system, described in Section 4, we use recording caches implemented on top of clusters of computers controlled by middleware that provides fault tolerance and workload balancing. Although the rest of this paper uses the term recording cache, the recording and responding process could be located at the source instead. However, for maximum scalability, fault-tolerance and heterogeneity, caches implemented on cluster-based middleware platforms are recommended.
2.3 Control Protocol
In order to take advantage of the scalability and fault tolerance possible when placing recording caches on proxy platforms, the control and data transmission protocols for the distributed recording service must be designed appropriately. In particular, using the automatic fault-recovery provided by middleware proxy platforms can be very difficult if a hard state control protocol is used [FGC+97,SRC+98]. If only soft state is used, then failure recovery is almost automatic, because it does not require any special case of the protocol. The next soft state refresh message to arrive after the system fault can be used by the middleware platform to restart the necessary failed component. In a hard state system, some central agent must store the state necessary for restarting the failed agent. This makes the middleware system less extensible, since the addition of each new service requires changes to the central agent. It may be more intuitive to consider recording to be a hard state service, where a recording request is initiated for a certain period of time (the announced length of the session), and run without further control input for that length of time. However, the small extra bandwidth cost of the periodic announce/listen control messages used in a soft state protocol is dwarfed by the bandwidth required to transmit data for any sort of multimedia session. The remaining problem with using a soft state protocol is ensuring that the failure semantics are designed correctly. That is, if the agents controlling the recording cache fail, the recording cache should also gracefully close. In the next section, we describe how this is achieved in our system. Another consideration is that soft state control protocols can be more scalable than hard state protocols, since they allow multiple clients to share the control of a single agent without increasing the control bandwidth. Because of this enhanced scalability and the improved fault-tolerance, we use soft state for initiation and control of distributed recorders.
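The announce/listen control style and its failure semantics can be sketched as follows. This is a hypothetical illustration (class name, timeout value, and message fields are invented for the example): every keep-alive carries the complete recording request, so a single message suffices to start or restart the agent, and recording stops gracefully once refreshes cease.

```python
class RecordingAgent:
    """Sketch of a soft-state-controlled recording agent."""

    KEEPALIVE_TIMEOUT = 30.0   # assumed refresh budget, in seconds

    def __init__(self):
        self.request = None      # full recording request (soft state)
        self.last_refresh = None

    def on_keepalive(self, request, now):
        # Message ordering is unimportant: every message carries the
        # complete request, so this one handler covers first contact,
        # periodic refresh, and restart after a crash identically.
        self.request = dict(request)
        self.last_refresh = now

    def tick(self, now):
        """Periodic check; returns True while recording should continue."""
        if self.last_refresh is not None and \
                now - self.last_refresh > self.KEEPALIVE_TIMEOUT:
            self.request = None   # control state expired: close gracefully
        return self.request is not None
```

Because the request itself is the only state, a middleware platform can restart a crashed agent from nothing but the next keep-alive, which is the property the section above attributes to soft state control.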
2.4 Control Chain
As we have described, archive systems accept recording requests from users, but recording caches must be initiated and controlled by other agents of the system. In order to initiate the recording cache, an agent must have some knowledge of where in the session recording coverage is required. One possible design is for a controlling archive system to monitor the interactive session, initiating recording caches in areas of poor archive coverage, and removing recording caches when they are no longer necessary. However, there are several problems with this design. First, this centralized control hampers fault-tolerance by introducing a single point of failure into the system. Centralized control also limits the system's ability to respond to heterogeneity, because the central archive may not possess the information necessary to correctly place recording caches. For example, only local participants may be aware of transcoders or local-only enhancement layers. In addition, centralized control of the recording caches stretches control channels over long distances and lossy links. This makes control more difficult and more susceptible to failure in case of a temporary network partition.

Source control of recording caches has the correct failure semantics. If a network partition occurs between the source and the recording cache it is controlling, then the cache is not able to perform its recording duty and should be suspended. However, if there is a partition between the source and the recording cache, but not between the recording cache and the collecting archive, then the recording cache may still need to participate in the data recovery session, responding to requests for packets. We solve this problem by splitting the functionality of recording caches into recording and responding. The recording agent stores packets to the shared disk; the responding agent retrieves packets from disk and sends them to collecting archives. If the responding agent is under separate control, by the collecting archives, then the failure semantics for both agents will be correct. To summarize, our solution is to design the system so that sources select and control their own recording caches. While this may cause some temporary duplication of effort, it allows us to manage heterogeneity, system partitions and errors. Responders are selected and controlled by collecting archives.
2.5 Data Collection Responsibilities

Given the correct coverage and control of recording caches in a session, it is next necessary to collect and combine the recorded data in such a way as to produce the high-quality session copy. Again, we are confronted with the choice of centralized versus decentralized algorithms. As an example, in a centralized algorithm, the cache for each source would slowly stream its source's data to a centralized collection point, where the data would be combined and stored. However, if each cache has responsibility only for a specific source or sources, the loss of a recording cache due to hardware or software faults will cause severe damage to the final session recording. Another negative point is that the entire data stream for each source may not be necessary at the archive. If the archive joins the original session, it will receive the baseline data along with the other real-time, interactive participants. This has the advantage of allowing the archive to support playback during the session (although it may not be high-quality). Since the baseline data has already been received and stored, only the enhancement data will need to be streamed to the archive from the caches. Since the caches will not know what data the archive is missing, it will be necessary for the archive to request the data it requires.

As we described in the overview, we also want to support multiple archives recording the same session. One possible solution is for one archive to build the high-quality copy of the session and then to transfer this copy in bulk to all other interested archive sites. This has the advantage of simplicity but the disadvantage of centralization. That is, it is less fault-tolerant, has a longer latency before the other participating archives can make the data available to their users, and has less support for heterogeneous session interest on the part of archives. Some archives may not wish to collect a high-quality copy of all of the streams of a session. They may be interested in only portions of the session, or may want high-quality representations of some participants but not others. These archives could filter out unwanted data after it has been transmitted to them, but this is an inefficient use of the network. In this section, we have laid the framework of our distributed recording system design, which allows a completely decentralized system with multiple sources, multiple recorders, and multiple collecting archives. In the next section, we will describe the data collection process in more detail.
3 Data Collection Protocol

3.1 Single Archive, Single Cache

To begin the discussion of the data collection protocol, we consider a simple protocol in which one archive system collects data from one recording cache. The archive uses a reliable request-response protocol to retrieve data from the cache. To begin, the archive needs to know what data it is missing, and what data it wishes to collect. The archive will be able to calculate some of the data it is missing by looking for holes in the sequence number space of RTP packets. However, without input from the recording cache, it will not know if it is also missing data from the beginning of the stream, since RTP streams begin with a random sequence number. The archive will also not be able to detect tail losses. Also, since locally transmitted data may not be advertised outside of the local scope, the archive may not know that it is missing entire layers or streams that the recorder has cached.

Once the archive knows all of the data that is available for collection, it still needs some application-level information about the data in order to decide whether it wishes to collect that data. Archives may have different policies about how much data should be collected and stored. For example, an archive may want to gather the highest quality audio data possible, but just use best-effort for video data. Or, an archive may have a very specific mandate to only record certain sources in a session, or a certain time-slot of a long session. For this reason, when notifying the archive about data available at the recording cache, it is important to use application-level naming information, not just sequence number extents, so that the archive can decide which data is necessary.

To begin the data collection process, each recording cache produces a namespace of the data available at that cache. This namespace includes information that uniquely identifies each individual stream and layer of data in the session. To identify streams and layers of data, we use the original session transmission addresses, along with source identification information from the RTCP protocol. Beginning and ending sequence numbers and timestamps for each stream are also included. The archive builds this same namespace for the data it has already recorded, and compares namespaces to find missing streams or missing sequence number spaces.
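The comparison step for a single stream can be sketched as follows. This is an illustrative sketch (the function name is invented): given the sequence-number extent a recording cache advertises for a stream, and the set of sequence numbers the archive has already recorded, it computes the missing ranges, including the head and tail losses the archive could not detect on its own.

```python
def missing_ranges(advertised_start, advertised_end, received):
    """Return a list of (first, last) inclusive gaps in one stream,
    relative to the extent the cache advertised."""
    gaps = []
    gap_start = None
    for seq in range(advertised_start, advertised_end + 1):
        if seq not in received:
            if gap_start is None:
                gap_start = seq          # a gap begins
        elif gap_start is not None:
            gaps.append((gap_start, seq - 1))   # a gap ends
            gap_start = None
    if gap_start is None:
        return gaps
    gaps.append((gap_start, advertised_end))    # gap runs to the tail
    return gaps

# The archive never saw packets 100-101 (head loss) or 110 (tail loss);
# plain hole-finding would have detected neither.
print(missing_ranges(100, 110, {102, 103, 104, 105, 106, 107, 108, 109}))
```

An entire stream the archive never received at all simply appears as one gap spanning the whole advertised extent, which is how locally-scoped layers surface in the comparison.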
3.2 Multiple Archives, Multiple Caches

In the previous section we specified that we want to design the system so that multiple archives can collect data from multiple recording caches. It would be possible to have multiple archives individually contact the responders and request their missing data. However, there will be many cases where the archives' requests will overlap. For example, all archives might need a locally-transmitted layer that is only present at a single recording cache. For this reason, we would like to use multicast rather than unicast to perform data collection. We can also use multicast to transmit the namespaces of participants.

We need to use a reliable multicast protocol to transmit data namespaces to collection participants, since missing namespaces can impact protocol correctness. We could send data enhancement packets with either a reliable or unreliable protocol. Data enhancement does not necessarily need to be sent reliably to all participants, since RTP can tolerate certain levels of loss. However, some archives may want to devote resources to receiving every packet possible, so we feel that a reliable multicast protocol, with some latitude for receiver input into whether retransmissions are necessary, is the best choice. There are several reliable multicast protocols that we could use for our namespace and data transmission. Since we want receivers to choose whether to receive retransmissions, we feel that a NACK-based scheme like SRM [FJM+95] is a better fit than an ACK-based scheme like RMTP [LP96]. In addition, SRM uses the principle of Application Level Framing, which allows application control of protocol features wherever possible. In SRM, applications decide whether to NACK packets or ignore the packet loss. For these reasons, we have chosen SRM as the reliable multicast protocol for our data collection algorithm. However, we do not use SRM simply as a replacement for TCP, to transmit packet requests and responses reliably. Instead, we take advantage of the data recovery features already present in SRM and cast each enhancement packet request as an SRM retransmission request. In SRM, upon receiving a retransmission request, agents which have the requested data set a timer, based on how far they are from the requesting agent. In this way the closest responder should generally provide the missing data. Through this mechanism we have multiple archives and recording caches sharing the responsibility of providing data, and multiple archives benefiting from the retransmission requests of other archives.
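The distance-based response timer described above can be sketched as follows. This is a simplified illustration of the SRM-style suppression idea, not the protocol's exact timer rules (the constants and function names are invented): each agent holding the requested packet schedules its reply at a random time in an interval proportional to its estimated distance from the requester, so the closest responder's timer usually fires first and others can suppress their duplicate replies.

```python
import random

def response_delay(distance, c1=1.0, c2=1.0, rng=random.random):
    """Pick a reply time uniformly in [c1*d, (c1+c2)*d] for distance d."""
    return c1 * distance + c2 * distance * rng()

def first_responder(distances):
    """Simulate one recovery round: whose timer fires first?"""
    delays = {agent: response_delay(d) for agent, d in distances.items()}
    return min(delays, key=delays.get)

# A nearby recording cache (distance 1) always beats a far-away
# archive (distance 100), since its whole delay interval lies below
# the far agent's earliest possible reply time.
print(first_responder({"nearby-cache": 1, "far-archive": 100}))
```

The randomization within the interval is what lets several equally close responders avoid all answering at once, which is the scaling property the section above relies on.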
In order to use the SRM recovery protocol, a globally unique namespace must be established. A requested packet must have the same name at all responders so that it can be requested by a single global message. We could map this namespace onto a flat sequence number space for SRM retransmission requests and replies, but this would be very difficult. Participants would need to coordinate their assignment of sequence numbers to data streams so that a packet available from numerous recording caches or archive systems would have the same sequence number at each source. Instead of a flat sequence number space, we would like to use a hierarchical space, with separate sequence number spaces for individual streams, so that we can reuse the original RTP sequence numbers. Figure 4 shows our hierarchical naming scheme. We use source naming information from the RTP protocol, which is globally available to all collection session participants. Each source may have transmitted multiple streams of different media types and/or multiple layers. These are identified by the multicast address they were transmitted on. Finally, individual data containers are identified with the starting timestamp from the first packet in that data container. Individual packets are identified by the RTP sequence number of the interactive session. Multiple data containers may be used for a very long session where sequence numbers wrap. Using this namespace, participants can create a globally correct naming scheme that uniquely identifies each packet.
[Figure 4 diagram: Root → Source ID (RTP cname) → Stream ID (RTP transmission address) → Data container (starting time).]
Fig. 4. Global data naming using a hierarchy. Streams are uniquely identified by source ID and stream transmission address. Data packets are identified by RTP sequence number and a container starting time, to account for sequence number wrap.
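A sketch of how a participant might represent these globally unique names. The field layout mirrors the Figure 4 hierarchy, but the types and the naive wrap-detection rule are our illustrative assumptions, not libsrm's data structures.

```python
from typing import NamedTuple

class PacketName(NamedTuple):
    """Global packet name following the Figure 4 hierarchy (a sketch)."""
    cname: str            # RTP canonical name of the original source
    stream_addr: str      # multicast address the stream was sent on
    container_start: int  # RTP timestamp of the container's first packet
    seqno: int            # 16-bit RTP sequence number within the container

def assign_container(prev_seq, seq, containers, pkt_ts):
    """Open a new data container when the 16-bit RTP sequence number
    wraps, so (container_start, seqno) stays unique over long sessions.
    For brevity this ignores packet reordering around the wrap point."""
    if prev_seq is not None and seq < prev_seq:   # wrap detected
        containers.append(pkt_ts)                 # keyed by starting timestamp
    return containers[-1]
```

Because the name is a plain tuple of globally known RTP values, two independent recorders that captured the same packet construct byte-identical names without any coordination.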
The SNAP protocol (Scalable Naming and Announcement Protocol) [RM98], implemented in the libsrm framework [RC], allows transmitted data in an SRM session to be named hierarchically with application-generated names and sequence number spaces. SNAP provides all the functionality required to transmit this namespace reliably in a compressed format, including periodic refreshes of portions of the namespace with tail sequence numbers for all containers.
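The role of the tail-sequence-number refresh can be illustrated with a minimal sketch. This assumes a simple in-memory representation rather than SNAP's actual compressed wire format, and the function names are ours.

```python
def namespace_refresh(local_state):
    """Build a periodic refresh: for each container name, advertise the
    tail (highest) sequence number held.  local_state maps a container
    name, e.g. (cname, stream_addr, start_ts), to a set of held seqnos."""
    return {name: max(seqs) for name, seqs in local_state.items() if seqs}

def missing(refresh, local_state):
    """A receiver compares an incoming refresh against what it holds and
    returns the (container, seqno) pairs it should request via SRM."""
    gaps = []
    for name, tail in refresh.items():
        have = local_state.get(name, set())
        gaps += [(name, s) for s in range(tail + 1) if s not in have]
    return gaps
```

The refresh lets a receiver detect losses it never saw (including the tail of a stream), which a purely gap-based NACK scheme would miss.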
4 Implementation

The distributed recording system we have described is composed of heavyweight archive systems, lightweight recording caches, and the collaboration tools used by senders and receivers. Figure 5 shows these components and the protocols necessary for communication between them. The recording agent and responding agent are lightweight agents that should be run on a computing cluster, with middleware to provide load balancing and fault recovery. To take advantage of these features, the control protocols for these agents should be soft-state, announce/listen protocols. With such a protocol, clients must continue to send periodic keep-alive messages throughout the life of the service agent. If a service agent fails due to a hardware or software fault, the next keep-alive message causes the agent to be restarted. In this style of protocol, message ordering is unimportant, and each message must contain the complete set of state needed to restart the agent after a fault.

As described in Section 2.4, the recording agents in the system are controlled by the nearby source or sources that require additional recording coverage. In order to operate, the recorder needs information about the addresses and media types of the session to be recorded. In addition, the recorder needs naming information about the session, so that it can be properly labeled in storage and made available to the responder. In our protocol, the session's SDP announcement provides this information. The SDP announcement contains all of the session addresses that the recorder needs to monitor. Local-only layers are also advertised in the local SDP announcement, so no separate mechanism is required to achieve individualized recorder initiation. Each source automatically instantiates its recorder to record all the data that the source is aware of. Changes to the SDP announcement, such as the addition of a new layer or media type, are automatically forwarded to the recording agent, since the message is soft-state and periodic. The response message from the recording agent to the controlling sources is simpler: sources merely need to know that a recording is taking place, so a simple acknowledgment message is sent. Eventually, we may add quality-report information so that sources can decide whether to move their recording agent to a different platform.

Unlike the recording agents, the responding agents are controlled by the archive servers that are collecting data. We would like archive servers requesting data from the same session to be served by the same agent, so the control message needs a field that distinguishes among responders for different sessions. The obvious choice is again the session identification information from the SDP announcement. In fact, this identification, plus an address on which the responder will join the SRM data recovery session, is all that is needed in the responder initiation and control protocol; fine-grained packet requests take place on the separate SRM channel.

We did not create any new data protocols for the distributed recording system.

[Figure 5 diagram: a Source exchanges Control messages with a Recording Agent and sends Data (RTP); an Archive System exchanges Control messages with a Responding Agent and exchanges Data (SRM/RTP) with it.]
Fig. 5. Components and protocols of the distributed recording system.
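The soft-state keep-alive control described in this section can be sketched as follows. The period, timeout, and message fields are illustrative assumptions, not the AS1 wire protocol.

```python
import time

KEEPALIVE_PERIOD = 5.0                 # seconds between announcements (illustrative)
AGENT_TIMEOUT = 3 * KEEPALIVE_PERIOD   # state expires after missed keep-alives

def make_keepalive(sdp_text: str) -> dict:
    """Each control message carries the complete state (here, the session's
    SDP announcement) so a restarted agent can resume from any one message."""
    return {"sdp": sdp_text, "sent_at": time.time()}

class ServicePlatform:
    """Minimal announce/listen sketch: an agent whose soft state expired
    (e.g., after a crash) is simply (re)started by the next keep-alive."""
    def __init__(self):
        self.agents = {}  # session id -> time of last keep-alive

    def on_keepalive(self, session_id, msg, now: float) -> bool:
        restarted = session_id not in self.agents
        self.agents[session_id] = now   # refresh the soft state
        return restarted                # True if this message (re)launched the agent

    def reap(self, now: float):
        # Drop agents whose state has expired; a later keep-alive revives them.
        self.agents = {s: t for s, t in self.agents.items()
                       if now - t < AGENT_TIMEOUT}
```

Note that `on_keepalive` is idempotent and order-independent: any single message is enough to (re)create the agent, which is exactly the property that makes cluster-based restart cheap.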
As shown in Figure 5, we use unmodified RTP for the original data session, and SRM carrying RTP packets for the data recovery session. We chose the libsrm implementation of SRM, which also includes the SNAP protocol, for the data recovery protocol between archive servers and response agents. Overall, we have found libsrm very helpful in providing the correct level of abstraction to the system. However, the libsrm library is still under development and is not yet fully tuned to provide the best possible data retransmission.

We have implemented the recording and response agents using the AS1 Active Service framework as the middleware system that allows the agents to be run scalably and reliably. The AS1 Active Service framework implements a service platform that is a cluster of computers providing load balancing and automatic restart for agents. The agents themselves are implemented using the MASH multimedia toolkit [MBKea97], a set of composable multimedia networking and application objects.

The archive server we use in our system is the MASH Pathfinder [CSb]. Through a web interface, Pathfinder allows users to view current session announcements (using the SDP/SAP protocol [HJ97]), join a live MBone session, request that a session be recorded, and play back recorded sessions. Recorded sessions are made available for playback immediately and automatically. For this system we made minimal changes to Pathfinder, adding an agent to perform data collection on sessions being recorded. The playback agent has been described in a previous paper [SRC+98]. We have been using the recording and archive server objects for some time, but are still gathering experience with the responding and collection objects. Using the AS1 middleware has greatly simplified the object implementations, since the fault recovery code does not have to be rewritten for each object.

[Figure 6 diagram: (a) single channel — one source and recorder with two archives sharing one retransmission channel; (b) separate channels — sources and recorders behind different bottlenecks serving different archives.]
Fig. 6. Two session configurations. In (a), the archives benefit from sharing a single multicast channel for retransmitted data. In (b), the archives are across different bottlenecks and so are collecting disjoint sets of data.
5 Ongoing Work

Using a protocol like SRM for our data collection algorithm only works well if all of the collecting archives have similar data needs, because SRM uses global retransmission of data. If archives have divergent needs, then using a single global channel for retransmission may be very wasteful. Figure 6 shows examples of recovery sessions where shared recovery channels would and would not be beneficial. If collecting archives are across different bottleneck links from the source and recorder, then they will need different sets of data. If they have correlated losses, or need the same set of local-only data, then they benefit from sharing a multicast channel. One solution to this problem is to construct a hierarchy of participants that allows retransmissions to be sent only to the portion of the tree that requires the data. This solution, adding local recovery to SRM, has been proposed by many researchers, but is not yet practical: there is no agreed-upon way to build these retransmission trees without introducing new functionality into routers. Some schemes acknowledge that, currently, administrator help is required to build trees of responders [LP96]. Other
schemes use an expanding ring search based on TTL [YGS95], use short experiments to measure link delay [XMZY97], or use multicast IGMP trace packets to locate responders and receivers relative to each other [LPGLA98]. Although this is a difficult problem, and is currently the subject of much research, we believe it can be solved for this particular application domain, both because application-level knowledge is available and because the problem is more limited than in the fully general reliable multicast domain. The available application-level knowledge is twofold. First, we have data from the original interactive session indicating which participants were able to subscribe to various local-only layers or pre-transcoded data. Second, we have data from the interactive session indicating which participants lost packets due to congestion. Since archive applications have less onerous latency requirements than interactive applications, we may be able to use more history, based on the original session, than other reliable multicast applications typically have access to; in essence, we have a longer time available for bootstrapping. The Group Formation Protocol [RM99b,RM99c] uses receiver-generated lossprints that enumerate the packets lost at each receiver. These lossprints are used to group receivers who are behind the same bottleneck. We are working on using this protocol to organize collection session participants into sub-groups so that requests and responses go only to the necessary participants.
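The grouping step can be sketched as follows. The Jaccard similarity measure, the threshold, and the greedy merging are our illustrative choices for showing the idea, not the Group Formation Protocol itself.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two lossprints (sets of lost sequence numbers)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def group_by_lossprint(lossprints: dict, threshold: float = 0.5) -> list:
    """Greedy sketch: receivers whose lossprints from the original session
    overlap beyond `threshold` are assumed to share a bottleneck and are
    placed in one recovery sub-group, so requests and repairs need only
    reach that sub-group."""
    groups = []  # each: {"members": [receiver ids], "print": merged lossprint}
    for rid, lp in lossprints.items():
        for g in groups:
            if jaccard(lp, g["print"]) >= threshold:
                g["members"].append(rid)
                g["print"] |= lp        # widen the group's fingerprint
                break
        else:
            groups.append({"members": [rid], "print": set(lp)})
    return [g["members"] for g in groups]
```

Receivers with uncorrelated losses end up in disjoint groups, matching the Figure 6(b) case where archives behind different bottlenecks should not share a retransmission channel.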
6 Summary and Conclusions

We have described the inherent problems in recording a multi-source MBone session with only one recorder. Session-wide variations in received quality, subscribed quality, and reception and transmission formats mean that multiple cooperating recorders are necessary to achieve the best possible recorded session. To support this distributed recording system, we have introduced the recording cache, composed of a recording agent and a responding agent, which provides enhancement data to archive systems. Because the recording cache agents are initiated and controlled using soft-state protocols, the cache is amenable to implementation on cluster-based middleware, such as an active service platform, that provides a scalable, fault-tolerant implementation base. Because archives and caches record the entire session, users can begin viewing baseline-quality session playback with low latency, and individual components can be lost from the system without catastrophic results. Because the system uses a decentralized data collection protocol, it supports heterogeneity in archives' desired recording quality and has no single point of failure. We have presented a protocol in which archives collect data from recording caches and other participating archives using SRM retransmission requests. Archives are able to uniquely identify missing packets and streams through a globally consistent hierarchical namespace that uses RTP stream identifications and sequence numbers. This namespace is reliably and efficiently transmitted through
the SNAP protocol. We have described our initial implementation of the archives and recording caches, using the MASH multimedia toolkit, the libsrm reliable multicast protocol framework and the AS1 active service platform implementation. We are using these implementations to explore new techniques for using packet loss information to form data collection subtrees, so that the data collection algorithm scales to a larger number of archive session participants.
7 Acknowledgments

Many thanks to Yatin Chawathe, Suchitra Raman, Drew Roselli, Helen Wang, Tina Wong, and the anonymous reviewers for their feedback and suggestions. This work was supported by DARPA contract N66001-96-C-8505, by the State of California under the MICRO program, and by NSF Contract CDA 94-01156. Angela Schuett is supported by a National Physical Science Consortium Fellowship.
References

[AA98] K. Almeroth and M. Ammar. The Interactive Multimedia Jukebox (IMJ): A New Paradigm for the On-Demand Delivery of Audio/Video. In Proceedings of the Seventh International World Wide Web Conference, April 1998.
[AMK98] Elan Amir, Steve McCanne, and Randy Katz. An Active Service Framework and its Application to Real-time Multimedia Transcoding. In Proceedings of SIGCOMM '98, September 1998.
[AMZ95] Elan Amir, Steve McCanne, and Hui Zhang. An Application Level Video Gateway. In Proceedings of ACM Multimedia '95, November 1995.
[Cla88] D. D. Clark. The Design Philosophy of the DARPA Internet Protocols. In Proceedings of SIGCOMM '88, Stanford, CA, August 1988. ACM.
[CR] Yatin Chawathe and Cynthia Romer. MASH Collaborator Documentation. http://mash.cs.berkeley.edu/mash/software/usage/collaboratorusage.html.
[CSa] Yatin Chawathe and Angela Schuett. MASH Archive Tools Documentation. http://mash.cs.berkeley.edu/mash/software/archive-usage.html.
[CSb] Yatin Chawathe and Angela Schuett. MASH Pathfinder Documentation. http://mash.cs.berkeley.edu/mash/software/usage/pathfinder.html.
[FGC+97] Armando Fox, Steven Gribble, Yatin Chawathe, Eric Brewer, and Paul Gauthier. Cluster-based Scalable Network Services. In Proceedings of SOSP '97, pages 78-91, St. Malo, France, October 1997.
[FJM+95] Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu, and Lixia Zhang. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. In Proceedings of SIGCOMM '95, Boston, MA, September 1995. ACM.
[GAE98] Ramesh Govindan, Cengiz Alaettinoglu, and Deborah Estrin. A Framework for Active Distributed Services. Technical Report 98-669, Information Sciences Institute, University of Southern California, 1998.
[Han97] Mark Handley. An Examination of MBone Performance. Technical Report ISI/RR-97-450, USC/ISI, 1997.
[HJ97] Mark Handley and Van Jacobson. SDP: Session Description Protocol. Internet Draft, Internet Engineering Task Force, November 1997.
[Hol95] Wieland Holfelder. MBone VCR - Video Conference Recording on the MBone. In Proceedings of ACM Multimedia, 1995.
[Hol97] Wieland Holfelder. Interactive Remote Recording and Playback of Multicast Videoconferences. In Proceedings of the Fourth International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS), 1997.
[Jac94] Van Jacobson. SIGCOMM '94 Tutorial: Multimedia Conferencing on the Internet, August 1994.
[JM] Van Jacobson and Steven McCanne. Visual Audio Tool. Lawrence Berkeley Laboratory. Software available at ftp://ftp.ee.lbl.gov/conferencing/vat.
[Kle94] Anders Klemets. The Design and Implementation of a Media on Demand System for WWW. In Proceedings of the First International Conference on WWW, Geneva, May 1994.
[LKH98] Lambros Lambrinos, Peter Kirstein, and Vicky Hardman. The Multicast Multimedia Conference Recorder. In Proceedings of the 7th International Conference on Computer Communications and Networks, October 1998.
[LKH99] Lambros Lambrinos, Peter Kirstein, and Vicky Hardman. Improving the Quality of Recorded MBone Sessions Using a Distributed Model. In Proceedings of the 6th International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services (IDMS), October 1999.
[LP96] John C. Lin and Sanjoy Paul. RMTP: A Reliable Multicast Transport Protocol. In Proceedings of IEEE Infocom '96, pages 1414-1424, San Francisco, CA, March 1996.
[LPA98] Xue Li, Sanjoy Paul, and Mostafa Ammar. Layered Video Multicast with Retransmissions (LVMR): Evaluation of Hierarchical Rate Control. In Proceedings of INFOCOM '98, March 1998.
[LPGLA98] B. N. Levine, S. Paul, and J. J. Garcia-Luna-Aceves. Organizing Multicast Receivers Deterministically According to Packet-Loss Correlation. In Proceedings of ACM Multimedia '98, September 1998.
[MBKea97] Steve McCanne, Eric Brewer, Randy Katz, Lawrence Rowe, et al. Toward a Common Infrastructure for Multimedia-Networking Middleware. In Proceedings of the Fifth International Workshop on Network and OS Support for Digital Audio and Video (NOSSDAV), May 1997.
[McC98] Steven McCanne. Scalable Multimedia Communication with Internet Multicast, Light-weight Sessions, and the MBone. Proceedings of the IEEE, 1998.
[MJ95] Steven McCanne and Van Jacobson. vic: A Flexible Framework for Packet Video. In Proceedings of ACM Multimedia '95, pages 511-522, San Francisco, CA, November 1995.
[MJV96] Steven McCanne, Van Jacobson, and Martin Vetterli. Receiver-driven Layered Multicast. In Proceedings of ACM SIGCOMM, Stanford, CA, August 1996.
[RC] Suchitra Raman and Yatin Chawathe. libsrm: A Generic Framework for Reliable Multicast Transport. http://www-mash.cs.berkeley.edu/mash/software/srm2.0/.
[RM98] Suchitra Raman and Steven McCanne. Scalable Data Naming for Application Level Framing in Reliable Multicast. In Proceedings of ACM Multimedia '98, 1998.
[RM99a] Suchitra Raman and Steven McCanne. A Model, Analysis, and Protocol Framework for Soft State-based Communication. In Proceedings of SIGCOMM '99, Cambridge, MA, September 1999.
[RM99b] Sylvia Ratnasamy and Steven McCanne. Inference of Multicast Routing Trees and Bottleneck Bandwidths Using End-to-end Measurements. In Proceedings of IEEE Infocom '99, New York, March 1999.
[RM99c] Sylvia Ratnasamy and Steven McCanne. Scaling End-to-end Multicast Transports with a Topologically-Sensitive Group Formation Protocol. In Proceedings of the 7th International Conference on Network Protocols, November 1999.
[SCFJ96] Henning Schulzrinne, Steve Casner, R. Frederick, and Van Jacobson. RTP: A Transport Protocol for Real-Time Applications. RFC 1889, Internet Engineering Task Force, Audio-Video Transport Working Group, January 1996.
[Sch] Henning Schulzrinne. RTP Tools 1.6. http://www2.ncsu.edu/eos/service/ece/project/succeed_info/rtptools/rtptools-1.7/rtptools.html.
[Sch92] Henning Schulzrinne. Voice Communication Across the Internet: A Network Voice Terminal. Technical Report TR-92-50, University of Massachusetts, Amherst, 1992.
[SRC+98] Angela Schuett, Suchitra Raman, Yatin Chawathe, Steven McCanne, and Randy Katz. A Soft-state Protocol for Accessing Multimedia Archives. In Proceedings of the 8th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV '98), Cambridge, UK, July 1998.
[XMZY97] X. Rex Xu, Andrew C. Myers, Hui Zhang, and Raj Yavatkar. Resilient Multicast Support for Continuous-Media Applications. In Proceedings of NOSSDAV '97, 1997.
[YGS95] R. Yavatkar, J. Griffioen, and M. Sudan. A Reliable Dissemination Protocol for Interactive Collaborative Applications. In Proceedings of ACM Multimedia '95, San Francisco, CA, November 1995. ACM.
[YKT96] Maya Yajnik, Jim Kurose, and Don Towsley. Packet Loss Correlation in the MBone Multicast Network. In IEEE Global Internet Conference, 1996.