B-MART, Bulk-data (non-real-time) Multiparty Adaptive Reliable Transfer Protocol

Lorenzo Vicisano, Mark Handley, Jon Crowcroft
University College London

Abstract

This paper describes the design of a protocol for one-to-many reliable bulk non-realtime distribution of data. Key features of the design include efficiency through the use of IP multicast, and "TCP friendly" congestion control provided in a scalable manner.

1 Introduction

This paper describes some novel ideas for the design of a multicast protocol for bulk data transfer. It does not address shared applications or the multicast of data with realtime constraints. Key features of the design include efficiency through the use of IP multicast, and "TCP friendly" congestion control provided in a scalable manner. The aim is to provide a protocol that can be deployed for a number of applications, such as web cache preloading, software dissemination, bulk transfer of network news, or world databases for DIS and distributed VR/Games systems, without endangering the fragile ecosystem of the Internet traffic management system.

2 Assumptions and Requirements

We assume that a significant number (typically greater than 100) of receivers wish to receive a large data object (typically many megabytes). We assume there is no constraint on the order in which the data arrives, only that all receivers wish to receive all the data. We assume that delay constraints are minimal: the data may take a significant amount of time to arrive, but this delay is bounded. Typical usage might be web cache preloading, software dissemination, bulk transfer of network news, or world databases for DIS and distributed VR/Games systems. Similar distribution to large email lists probably does not entail long enough sessions to justify this approach precisely, although elements of the protocol design are common between this and Atanu Ghosh's Multicast Mail Delivery protocol[12].

We assume that the network these transfers must occur over is similar to the existing multicast infrastructure, and that no priority or reservation mechanism is used. We assume that any such protocol must co-exist with a large amount of background TCP traffic, and that we must perform congestion control that is appropriate for co-existing with this TCP traffic. This is a key design decision in this protocol work, and not to be compromised!


3 Design Space

There are a number of proposals, designs and implementations of "reliable multicast" protocols in existence. Many address the "many-to-many" problem, where there are multiple, possibly simultaneous sources, and have complex mechanisms to resolve ordering and reliability (and in the process, to maintain some degree of topology insensitivity), e.g. SRM[2] and LBRDM[3]. Others are transaction oriented in their group delivery semantics, for example Horus[5]. Still others address timely delivery, possibly traded off against k-resilient failure (loss of some members during the transfer).

As a starting point for the design of a one-to-many, non-realtime bulk data transfer protocol, we have other considerations. Not the least of these is congestion control, and how we produce a protocol that delivers reliability with scalable congestion control. At one extreme, we could imagine blasting packets at some rate to an IP multicast group, and selecting a single receiver by some voting algorithm (dynamically changing if there are faults) to send acknowledgements. Other receivers could wait until "the end", and then request all the missing packets by sending a list. The sender could then re-multicast these if significant numbers of receivers were missing the same packets, or unicast them to each receiver that had missed a packet that no one else had. One could easily imagine a set of more and more ingenious refinements of a scheme like this. At another extreme, one could have a set of unicast TCP connections from the source to the recipients. One could refine this by organising the recipients (or proxies for them) into a tree of TCP connections.

If instead the source sends multicast, and we devise some control scheme to adapt the rate to some worst case bottleneck (remember, we only want eventual delivery!), then what is the rate that we need to adapt to? Should it be the same as that which a single TCP connection would use through that same bottleneck? We believe not, since the use of multicast has reduced the link usage on all the hops upstream of the bottleneck by something like "log base mbone-fanout of the number of receivers". Thus we should be able to run at a rate somewhat higher than a single "fair" TCP share would give, assuming we know the number of receivers (even approximately). We might term this axiomatic fairness, after work on how to charge for multicast[1]. Related work by Ammar[11] describes a protocol that involves a heuristic for forming groups based on finding a set that maximises the power (bandwidth/delay) in each group. We believe that our mechanism for group formation and our protocol for feedback are more scalable, and also more compatible with TCP.

In the next section, we will look at problems with how we can find this (or any) rate accurately. For example, TCP uses loss to determine a stable operating point[6]. However, this relies on a set of things that are hard to achieve for multicast, including:

1. Accurate RTT estimation.

2. Explicit feedback for most packets transmitted.

3. Ignoring single loss ("fast retransmit") as being possibly spurious, after duplicate acknowledgements.

We are constraining our design to address the problem for very long lasting transfers, so we can run our adaptation algorithm over long timescales. The tradeoff is between the aggression we might allow ourselves (as being fair to multicast) and the accuracy of the feedback control system that we need to devise.
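To make the "axiomatic fairness" argument above concrete, the following Python sketch computes a target rate as a single-TCP fair share weighted by the approximate multicast link-usage saving, log base fanout of the number of receivers. The cap and the exact weighting function are our own assumptions; the paper only argues that some rate above a single TCP share is justified.

```python
import math

def multicast_weighted_rate(tcp_fair_rate_bps, num_receivers,
                            fanout=3.0, max_multiplier=4.0):
    # Weight a single-TCP fair share by the approximate link-usage saving of
    # multicast, taken here as log_fanout(num_receivers) as argued above.
    # The cap keeps the protocol from becoming arbitrarily aggressive.
    if num_receivers <= 1:
        return tcp_fair_rate_bps
    saving = math.log(num_receivers, fanout)    # ~ depth of the delivery tree
    return tcp_fair_rate_bps * min(1.0 + saving, max_multiplier)

# e.g. 100 receivers behind a 128 kbit/s fair share, assuming a fanout of 3
print(multicast_weighted_rate(128_000, 100))    # capped at 4x the TCP share
```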

4 Basic Idea

Figure 1: Loss rate against time for each receiver (3 minute running average), Shuttle Video, Tues 29th May 1996. Axes: time (minutes), receiver number, loss rate (percent).

Firstly, let us examine loss effects on Mbone traffic. To do this we gathered RTCP receiver reports for video traffic sent from NASA to a large multicast group. Figure 1 shows the loss rate against time (a three minute windowed average, to reduce edge effects caused by the RTCP report intervals not being synchronised) for each receiver during a 33 minute interval of video traffic for the shuttle video from 10:30am UTC on 28th May 1996. Receivers are shown in order of mean loss rate. This graph shows some correlation of loss between receivers, as would be expected, but it also shows a large amount of low amplitude noise, which would only be expected if many different points in the distribution tree are causing small amounts of loss independently. The receivers shown here would segment quite well into a small number of categories: those with little or no loss, those with roughly correlated loss (the middle band appear to be behind a single bottleneck, but have additional loss imposed after the bottleneck by subsequent links), and the broken ones. The assumption behind this protocol is that for most reasonably large multicast groups, we can group receivers into a set of correlated loss categories, with one category left for those broken receivers and receivers that do not correlate well with the others.
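The paper does not prescribe a particular grouping algorithm. The Python sketch below is one possible illustration, in which each receiver reports a vector of loss rates (one per range of sender sequence numbers, as suggested in Section 5) and receivers are greedily clustered by pairwise correlation; the thresholds and the greedy strategy are assumptions.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation of two equal-length loss-rate vectors.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sqrt(sum((a - mx) ** 2 for a in x))
    vy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (vx * vy) if vx and vy else 0.0

def group_by_loss_correlation(loss_reports, threshold=0.7, clean_cutoff=0.01):
    # loss_reports: {receiver_id: [loss rate per sequence-number range]}.
    # Receivers with almost no loss form one 'clean' category; receivers whose
    # loss vector correlates with an existing subgroup join it; subgroups that
    # never attract a second member are treated as outliers and stay on the
    # slow base group.
    clean, groups = [], []
    for rid, vec in loss_reports.items():
        if sum(vec) / len(vec) < clean_cutoff:
            clean.append(rid)
            continue
        for g in groups:
            if pearson(vec, loss_reports[g[0]]) >= threshold:
                g.append(rid)
                break
        else:
            groups.append([rid])
    outliers = [g[0] for g in groups if len(g) == 1]
    groups = [g for g in groups if len(g) > 1]
    return clean, groups, outliers
```

The clean receivers and each well-correlated subgroup can then be served by their own multicast groups, as described in Section 5, with the outliers remaining on the base group.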

4.1 Congestion Control

If we try to perform congestion control in a manner similar to TCP (although probably NACK based rather than ACK based), we are forced to send at a rate determined by the slowest receiver. In fact, it is worse than this: several receivers each reporting high uncorrelated loss rates can cause a congestion control scheme to run at a rate determined by the number of receivers suffering congestion.

Congestion feedback from receivers which are seeing (more or less) correlated loss can be used directly in a short time-constant control loop, because (to a first approximation) it does not matter which receiver reports the loss - all the receivers are affected by it. Congestion feedback from receivers which are seeing uncorrelated loss cannot be used in a short time-constant control loop, because even though the available bandwidth may be similar, the probability of loss being reported for any individual packet is much higher. Adapting on short time constants will often simply cause a sender to back off to its back-stop. Such a sender must be much more conservative, both in the bandwidth it sends and in how fast it adapts to changing conditions, because it is getting a less accurate view of the network conditions: it simply sees continuous loss, and must be prepared to ignore some level of loss reports if it is to achieve any throughput.

There is also a question of what a fair share of the bandwidth is. If there is only a single receiver, a fair share should approximate the throughput achieved by a TCP connection. If there are many receivers, the bandwidth saving from using multicast (as opposed to multiple unicast TCP connections) might be used to weight how aggressive a reliable multicast protocol can be in the face of congestion feedback. It is probably reasonable for a multicast transfer to achieve more than a "TCP fair share" of the bandwidth. One major problem is how to adapt to loss on timescales longer than an RTT.

Figure 2: Loss rate against data rate (mean loss rate across all receivers against sent data rate in bytes/minute, Shuttle Video, Tues 29th May 1996).

Figure 2 again graphs data gathered from the NASA shuttle video session in May 1996. It shows the mean loss rate reported by all the receivers (as reported by RTCP) in each one minute interval against the bandwidth sent in that minute (estimated from received data by assuming missing packets are of mean size). What this graph does not show is a good correlation between data rate and loss rate. If even a fairly small proportion of receivers were showing such a correlation, then such a trend should be visible in this graph. Thus although we can certainly adapt effectively to loss on time-scales of around 1 RTT by using a TCP congestion control style algorithm, adapting on longer timescales is a more difficult problem.


5 Towards a Solution

We assume all the receivers that wish to receive the data are already listening to a multicast group. In addition, these receivers send session messages in a similar manner to RTCP messages, enabling an estimate to be made of the size of the receiver set. The sender starts sending traffic to this multicast group. Receivers send congestion report messages in their session messages, and weight the sending of their session messages according to the amount of loss they have seen (a sketch of this weighting is given below). The sender regulates its transmit rate depending on these session message loss reports, and generally transmits very slowly in the presence of significant numbers of receivers suffering uncorrelated loss.

Receivers also put information into their session messages which allows the discovery of loss correlation. This may be as simple as reporting loss rates for each range of sender sequence numbers. The sender uses this to discover that there are reasonably well correlated subgroups of receivers present. Given that there is a small set of such subgroups, the sender then initiates several new multicast groups, one for each of these correlated subgroups. It may do this during the initial multicast of the data if there are time constraints and sufficient feedback, or it may do it during the second pass over the data performing retransmissions. It now sends the data (or the missing data as indicated in session messages, if it is the second pass) to each of these multicast groups; only now the loss is much more correlated, so the feedback loop can be much shorter, and the data rate can increase to an appropriate share of the available link bandwidth. As receivers receive the last of the missing data, they leave the multicast group, and so the congestion feedback becomes steadily more accurate and appropriate. Those receivers that did not fit into any useful correlation subgroup tend to be those behind broken links, and they continue to receive the data and retransmissions at a slow rate on the base multicast group. If the sender needs to move on to sending the next bulk transfer, transmission to these receivers is halted and deemed to have failed.
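As a small illustration of the loss-weighted session messages mentioned above, the following Python sketch chooses a reporting interval that shrinks as observed loss grows. The linear weighting, the interval values and the RTCP-style dithering are assumptions rather than part of the specification.

```python
import random

def session_report_interval(loss_fraction, base_interval=30.0, min_interval=5.0):
    # Lossy receivers report sooner, so the sender hears about congestion
    # quickly without every receiver reporting at the maximum rate.
    interval = max(min_interval, base_interval * (1.0 - loss_fraction))
    return random.uniform(0.5, 1.5) * interval   # dither to avoid synchronisation

# a receiver seeing 40% loss reports noticeably more often than a loss-free one
print(session_report_interval(0.4), session_report_interval(0.0))
```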

5.1 Feedback for Group Formation and Feedback for Bottleneck Statistics

The protocol has two packet types: one is used to send data packets within each group to receivers, and the other is used to report back to the sender. The data packet has a packet (not TCP-style byte) sequence number and a send timestamp. The second type of packet is an RTCP session[9] like message, used to distribute information about loss. The feedback is in the form of a Selective Multicast Acknowledgement (SMACK) packet, which lists, for each packet received, a receive timestamp together with the send timestamp, and a bitmap which describes the packet sequence numbers that are present and those that are missing. This information drives three processes:

1. Group formation is done by looking at loss correlation.

2. Triggered messages from within the group drive the sender to send more.

3. Loss statistics drive the congestion avoidance algorithm.

Loss statistics are passed through a filter, and three courses of action may be taken (a sketch of this filter is given below):

1. If the loss is zero, we increase the send window linearly.

2. Below some threshold (cf. the earlier argument about the retransmit threshold), we keep cwnd constant.

3. Above it, we switch between constant and exponential backoff, exactly as TCP does.
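As an illustration of the filter just described, the following Python sketch shows a possible SMACK representation and the three-way window update. The field names, the threshold value and the concrete window arithmetic are our own assumptions; only the overall behaviour (linear increase on no loss, hold below the threshold, back off above it) is taken from the text.

```python
from dataclasses import dataclass

@dataclass
class Smack:
    # Selective Multicast ACKnowledgement: receive timestamp, echoed send
    # timestamp, and a bitmap of which sequence numbers in a range arrived.
    recv_timestamp: float
    send_timestamp: float
    first_seq: int
    present: list          # present[i] is True if packet first_seq + i arrived

    def loss_fraction(self):
        return 1.0 - sum(self.present) / len(self.present)

LOSS_THRESHOLD = 0.02      # assumed value, cf. the retransmit-threshold argument

def update_window(cwnd, filtered_loss, increment=1.0, backoff=0.5, min_cwnd=1.0):
    # cwnd counts packets in flight: there is no cumulative acknowledgement,
    # so this is a packet count rather than a byte sequence range.
    if filtered_loss == 0.0:
        return cwnd + increment              # linear increase
    if filtered_loss < LOSS_THRESHOLD:
        return cwnd                          # hold constant below the threshold
    return max(min_cwnd, cwnd * backoff)     # repeated backoff is exponential
```

In practice the loss value passed to update_window would be a smoothed figure derived from the loss_fraction() of recent SMACKs, not a single report.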


Note that we do not have a cumulative acknowledgement system, so the send and congestion windows are actually counts of actual and required packets in flight, rather than specific sequence number ranges of packets.

5.1.1 Newgroup sub-protocol

The sender and a receiver group with correlated loss need to rendezvous somehow. The protocol achieves this as follows. To create a new group, on discovery of a correlated subgroup, the sender sends a newgroup message type, which contains the last SMACK message that caused it to decide on the correlation. This message is multicast on the base group used for finding correlation and for control messages such as this. If all receivers log all SMACK messages (which are, after all, multicast group wide), they can match this newgroup message via the matching SMACK contents. Even if a receiver lost the SMACK message that should match this, it may have actual data packet loss statistics, and can heuristically match the newgroup message's SMACK to its own logged loss statistics, giving a good chance of matching.
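The matching step can be illustrated with a short Python sketch. The decision rule below (exact bitmap match against a receiver's own SMACKs, falling back to a similarity threshold against its own loss record) is our interpretation of the text, and the threshold is an assumption.

```python
def should_join_new_group(newgroup_bitmap, own_bitmaps_sent, own_loss_bitmap,
                          match_threshold=0.8):
    # newgroup_bitmap: present/missing bitmap from the SMACK carried in the
    # newgroup message.  own_bitmaps_sent: bitmaps from SMACKs this receiver
    # sent earlier.  own_loss_bitmap: its own record of which packets in the
    # same sequence range arrived.

    # Direct match: the sender's decision was based on one of our own SMACKs.
    if any(newgroup_bitmap == sent for sent in own_bitmaps_sent):
        return True

    # Heuristic match: our own loss pattern largely agrees with the one that
    # defined the subgroup, so we are probably behind the same bottleneck.
    agree = sum(a == b for a, b in zip(newgroup_bitmap, own_loss_bitmap))
    return agree / len(newgroup_bitmap) >= match_threshold
```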

5.2 Rate or Window Based Control

The protocol does not have buffers either at the sender or at the receivers. Instead, packets are drawn directly from the source data (e.g. from disk for a file), and written directly at the receiver in whatever order they arrive (assuming they can be put in the right place in the file). For each group, the sender keeps two statistics from which it can derive the appropriate send rate for the next packet (or the inter-packet delay between the last and next transmission):

1. The RTT estimate to the group (RTT estimation is done as in TCP, but across the reports from all members of the group).

2. The bottleneck rate to the group; this is roughly analogous to the TCP congestion window, and is estimated using the control algorithm described above.

A sender also keeps a log of the packet rate that was last used for sending any given packet currently outstanding; this is roughly analogous to the TCP send window. Experimentation needs to be carried out to compare this with TCP window performance for standard and selective acknowledgement[10] type feedback mechanisms. A sketch of how these statistics might drive the inter-packet spacing follows.
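The following Python sketch shows one way the two per-group statistics could set the inter-packet delay; the exact pacing rule (keep roughly one bottleneck window of packets in flight per RTT, stretching the gap when more than that are still unaccounted for) is an assumption.

```python
def next_packet_delay(rtt_estimate_s, bottleneck_window_pkts, outstanding_pkts):
    # Nominal gap: spread one window's worth of packets over one RTT.
    gap = rtt_estimate_s / max(bottleneck_window_pkts, 1.0)
    if outstanding_pkts > bottleneck_window_pkts:
        # Behind the feedback: slow down proportionally rather than stalling.
        gap *= outstanding_pkts / bottleneck_window_pkts
    return gap

# e.g. 200 ms group RTT, window of 20 packets, 25 packets still unaccounted for
print(next_packet_delay(0.2, 20, 25))   # 0.0125 s between packets
```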

5.3 Starting and Stopping

A session directory[7] like protocol is used to coordinate the start of a transmission. Completion is marked by the sender receiving no more session messages from any of the receivers it has state for. SDP itself could be used, if we attributed the session differently from current Mbone sessions; one useful parameter of the session would be the total transfer size. MIME encoding of the contents is another area for thought.
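A minimal Python sketch of the completion test described above, assuming the sender records the last session-message time for each receiver it holds state for; the number of silent report intervals to wait is an assumption.

```python
import time

def transfer_complete(last_report_time, report_interval_s=30.0,
                      missed_reports=5, now=None):
    # last_report_time: {receiver_id: time of its last session message}.
    # The transfer is regarded as finished once every receiver the sender
    # still holds state for has been silent for several report intervals.
    now = time.time() if now is None else now
    deadline = missed_reports * report_interval_s
    return all(now - t > deadline for t in last_report_time.values())
```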

6 Refinement of the Congestion Control and Error Recovery

First, a receiver-based congestion control scheme. It is based on the following observation: a multicast transmission does not waste bandwidth on a link feeding a subtree in which no receivers have joined the session. It (hopefully) works this way:



1. The sender transmits the same data on several parallel sessions at different rates.

2. The receivers join the appropriate session and move from one session to another as needed.

Which is the appropriate session? The one that gives a fair share with respect to a TCP connection sharing the same bottleneck, i.e. the one having roughly the same average throughput as that TCP connection. This can be computed from the packet loss and the RTT in much the same way a TCP sender does.

Problem: receivers jumping between different sessions (which are obviously not synchronized) lose packets. How to cope with that? The sender could cyclically retransmit everything, but this has the drawback that receivers are delayed (which we may not care about) waiting for the missing packets while they receive packets they have already seen, and, worse, the delivery of duplicate packets is a waste of bandwidth. A first step towards a solution could be to transmit packets in different sequences on different sessions (or between two nearby sessions), so that a receiver lowering its rate by jumping to a lower-rate session does not necessarily receive packets it has already seen. Moreover, we can introduce a forward error recovery scheme in which recovery packets are transmitted every so often. Each recovery packet can be used to recompute a missing packet belonging to a given set. Packet sets are composed of non-contiguous data packets, so that we can cope both with burst packet loss and with session jumping (which looks somewhat like a burst packet loss). To do that we probably need some duplication of recovery packets over nearby sessions. It is not clear that this error recovery scheme can be integrated either into the B-MART architecture (subgrouping) or with the receiver-based congestion control, but in any case it is worth describing.

The error recovery scheme comes from Luigi Rizzo (a researcher in Pisa). It is simply based on applying the same technique used in forward error recovery to backward error correction in multicast. Say you send N data packets and the receivers lose up to M packets each. If you retransmit the lost packets directly, you have to retransmit more than M packets (different receivers see different losses); in B-MART, subgrouping is used to cope with this. But you can instead transmit M error correction packets with which each receiver is able to recompute up to M missing data packets, no matter which ones. A minimal sketch of the simplest such code is given below. TBD: the function that maps packet/data sample to group/channel.
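A general code that can rebuild up to M lost packets from M recovery packets requires an erasure code such as Reed-Solomon; the Python sketch below shows only the simplest M=1 case, a single XOR parity packet over a set of non-contiguous, equal-sized data packets (packet size and set size here are illustrative).

```python
def parity_packet(packet_set):
    # Build one recovery packet as the byte-wise XOR of a set of
    # equal-length, non-contiguous data packets.
    recovery = bytearray(len(packet_set[0]))
    for pkt in packet_set:
        for i, b in enumerate(pkt):
            recovery[i] ^= b
    return bytes(recovery)

def recover_missing(received, recovery):
    # Recompute the single missing packet of the set from the others plus
    # the recovery packet: XOR is its own inverse.
    return parity_packet(list(received) + [recovery])

# e.g. a set of four 1 kbyte packets; any one of them can be rebuilt
pkts = [bytes([i]) * 1024 for i in range(4)]
r = parity_packet(pkts)
assert recover_missing(pkts[:2] + pkts[3:], r) == pkts[2]
```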

7 Experiment Design

There are two types of thing we (must) set out to show through experiment:

1. Operating parameters for the protocol.

2. Performance of the protocol in the larger environment.

The first of these includes:

1. That parameters can be found for the group formation and congestion control phases of the protocol. Some details need to be established with respect to the formation and congestion control for the base group, the general groups, and the outlier group.

2. That RTT estimation for a group is adequate for rate control (inter-packet gap).

The hypotheses we are trying to prove for this protocol in the larger environment (the Internet and its set of adaptive protocols such as TCP) are:

1. B-MART achieves lower latency (total time to deliver a file to a set of recipients) than a serial TCP, or a parallel TCP, at little or no additional network loading cost.

2. The congestion control is stable.

3. The congestion control is fair between two or more instances of this protocol sharing the same bottleneck set (and even just some bottlenecks, if the group selection works correctly).

4. That this protocol is fair when sharing a bottleneck with a TCP.

5. That this protocol continues to be comparable with TCP in the presence of new router forwarding/drop algorithms (RED, priorities or WFQ).

Finally, we want some simple quantitative comparisons between a notional "perfect" protocol (some analysis based on what a set of TCPs with perfect group and bottleneck knowledge could do) and B-MART...

8 Experiment Results

... Need to call in favours at LBL, UTS, INRIA, and any others (Cambridge etc.) to find a set of sites we can run this at. Implement a prototype, then design a sequence of tests for the hypotheses above. Some more background traffic tests need doing to see whether the base group and outlier group formations are feasible; these don't need the protocol in place though.

9 Discussion and Conclusions

... Mention the MTU size problem, and the ALF/FEC arguments for just choosing a 1 kbyte packet size! Mention the traffic mix problem (TCP, RTP, B-MART etc). Discuss what happens if we change assumptions about routers (e.g. drop-tail FIFO, RED, or ...).

Acknowledgements to Steve McCanne for useful discussion.

References

[1] S. Herzog, D. Estrin and S. Shenker, "Sharing the Cost of Multicast Trees: An Axiomatic Analysis", ACM SIGCOMM 95, Vol 25, No. 4, pp 315-327, Cambridge, Mass., Sep 1995.

[2] V. Jacobson, S. McCanne, C.-G. Liu and L. Zhang, "A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing", ACM SIGCOMM 95, Vol 25, No. 4, pp 342-356, Cambridge, Mass., Sep 1995.

[3] H. Holbrook, S. Singhal and D. Cheriton, "Log-Based Receiver-Reliable Multicast for Distributed Interactive Simulation", ACM SIGCOMM 95, Vol 25, No. 4, pp 328-341, Cambridge, Mass., Sep 1995.

[4] S. McCanne, V. Jacobson and M. Vetterli, "Receiver-driven Layered Multicast", ACM SIGCOMM 96, Vol 26, No. 4, pp 117-131, Stanford, Cal., August 1996.

[5] K. P. Birman, "The Process Group Approach to Reliable Distributed Computing", Technical Report, Cornell University, Ithaca, USA, July 1991.

[6] V. Jacobson, "Congestion Avoidance and Control", ACM SIGCOMM 88, Vol 18, No. 4, pp 314-329, Stanford, USA, 1988.

[7] M. Handley and V. Jacobson, "SDP: Session Description Protocol", Internet Draft draft-ietf-mmusic-sdp-02.txt, Work in Progress, Feb 1996.

[8] M. Handley, "SDAP: Session Directory Announcement Protocol", Internet Draft draft-ietf-mmusic-sdap-00.txt, Work in Progress, Feb 1996.

[9] H. Schulzrinne et al., "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, IETF, 1996.

[10] K. Fall and S. Floyd, "Simulation-based Comparisons of Tahoe, Reno, and SACK TCP", Computer Communications Review, July 1996.

[11] S. Y. Cheung and M. H. Ammar, "Using Destination Set Grouping to Improve the Performance of Window-Controlled Connections", Tech Report GIT-CC-94-32.

[12] A. Ghosh, "Multicast Mail Daemon", Work in Progress, UCL and now UTS.
