Making SCTP More Robust to Changeover

Making SCTP More Robust to Changeover Janardhan R. Iyengar, Armando L. Caro, Jr., Paul D. Amer, Gerard J. Heinz Computer and Information Sciences University of Delaware fiyengar, acaro, amer, [email protected] Randall R. Stewart Cisco Systems Inc. [email protected]

Abstract We present a problem in the current SCTP (RFC2960) specification that results in unnecessary retransmissions and “TCP-unfriendly” growth of the sender’s congestion window during certain changeover conditions. We first illustrate the problem using an example scenario. To gain insight into the ambient conditions under which cwnd overgrowth can be observed, we present an analytical model of this problem. As solutions, we then propose two changeover aware congestion control (CACC) algorithms which incorporate changeover awareness in SCTP’s congestion control mechanism: Conservative CACC (CCACC), and Split Fast Retransmit CACC (SFR-CACC). Using ns-2 simulations, we validate the model and evaluate the recommended solution. Based on the analysis, we make recommendations for modifications to SCTP.

1 Introduction A node is multihomed if it can be addressed by multiple IP addresses [5], as would be the case when the host has multiple network interfaces. Network layer redundancy allows access to a host even if one of its IP addresses becomes unreachable; ideally packets can be rerouted to one of the host’s alternate IP addresses. However, since IP is connectionless, end-to-end session persistence under failure conditions becomes the responsibility of the transport layer and above. To provide for such fault tolerance, the Stream Control Transmission Protocol (SCTP) supports multihoming at the transport layer. SCTP sessions, or associations, can dynamically span over multiple local and peer IP addresses so that an association can remain alive even if one of the endpoints’ addresses becomes unreachable. SCTP [13] is a recent standards track transport layer protocol in the Internet Engineering Task Force (IETF). Of the salient features that distinguish SCTP from TCP, we concern ourselves with multihoming. SCTP

Prepared through collaborative participation in the Communications and Networks Consortium sponsored by the U. S. Army Research Laboratory under the Collaborative Technology Alliance Program, Cooperative Agreement DAAD19-01-2-0011. The U. S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon.

1

multihoming allows binding of one transport layer association to multiple IP addresses. This binding allows an SCTP sender to send data to a multihomed receiver through different destination addresses. For instance, in figure 1, A could send data to B using destination address B1 or B2 . SCTP’s multihoming feature was motivated by fault tolerance; if one destination address becomes unreachable, the destination can still send and receive via other interfaces bound to the association. In a multihomed SCTP association, the sender transmits data to its peer’s primary destination address. SCTP provides for application-initiated changeovers so that the sending application can change the sender’s primary destination address, thus moving the outgoing traffic to a potentially different path1 . We uncovered a problem in the current SCTP (RFC2960) specification [13] that results in unnecessary retransmissions and “TCP-unfriendly” growth of the sender’s congestion window under certain changeover conditions. We wish to point out that the problem of unnecessary fast retransmits observed is applicable to TCP as well, under reordering of traffic by the network. Such reordering has been known to occur, and there’s been work done in the area [4, 7, 14]. But, under a single association, SCTP has the unique feature of multihoming which allows multiple congestion windows to co-exist. Such a feature has not been known to TCP, and hence the problem of congestion window overgrowth is unique to SCTP. Nevertheless, any transport layer protocol equipped with multihoming awareness would probably observe the described problems. Though the solutions as described in Section 5 are specific to SCTP, they also indicate that such multihoming aware transport protocols should incorporate changeover awareness in their congestion control algorithms. In Section 2, we present a specific example which illustrates the problem of cwnd overgrowth with SCTP’s currently specified handling of changeover. We then generalize the problem, and develop an analytical model in Section 3. The model abstractly quantifies the cwnd overgrowth under various network and changeover conditions. This model provides insight into the ambient conditions under which cwnd overgrowth can be observed. Based on the model, we present some results estimating conditions for cwnd overgrowth in Section 4. Due to the fact that transport layer multihoming is not a current practice, it is extremely difficult to use any empirical data to reinforce the importance of the observed congestion window overgrowth. Hence, we use analytical results in Section 4 to suggest that the problem might not be a “corner case”. Section 5 presents two changeover aware congestion control algorithms as solutions: Conservative CACC (C-CACC) and Split Fast Retransmit CACC (SFR-CACC). By approaching the problem from different perspectives, the Rhein algorithm (described in a previous work [8]) and the CACC algorithms (Section 5) all solve the problem of TCP-unfriendly cwnd growth. After analyzing their advantages and disadvantages in Section 6, we recommend the addition of the SFR-CACC algorithm to SCTP. The problem description presented in Section 2 was introduced in [8]. This material is repeated here to establish context, and can be skipped by readers familiar with [8].

2 Congestion Window Overgrowth: An Example In this Section, we present an example illustrating the occurence of cwnd overgrowth and unnecessary retransmissions with SCTP’s currently specified handling of changeover. The example uses the architecture shown in figure 1. Endpoints A and B have an SCTP association between them. Both endpoints are 1

SCTP was designed as a transport protocol for telephony signaling in SS7 networks. In an SS7 network the upper layers can dictate to which destination address packets will be sent, motivating the application-initiated changeover feature in SCTP.

2

Path1

Host A

A1 A2

B1

Network

B2

Host B

Path2

Figure 1: Architecture used in example

multihomed, A with network interfaces A1 and A2 , and B with interfaces B1 and B2 2 . All four addresses are bound to the one SCTP association. For several possible reasons (e.g., path diversity, policy based routing, load balancing), we assume in this example that the data traffic from A to B1 is locally routed through A1 , and from A to B2 through A2 . Non overlapping paths are assumed with the bottleneck bandwidth of path 1 being 10Mbps, and that of path 2 being 100Mbps. The propagation delay of both paths is 200ms, and the path MTU for both paths is 1250 bytes. Figure 2 shows a timeline of events for our example. The vertical lines represent interfaces B1 , A1 , A2 and B2 . The numbers along the lines represent times in milliseconds. Each arrow depicts the departure of a packet from one interface and its arrival at the destination. The labels on the arrows are either SCTP Transmission Sequence Numbers (TSN) or labels of the form STC (TGS TGE ). SCTP transmits data and control information in transport layer entities called chunks. Each DATA chunk carries a unique TSN, as against the sequence numbering scheme in TCP, which assigns a sequence number per byte. Assuming one chunk per packet, every packet in the example corresponds to one TSN. A number represents the TSN of the chunk in the packet being transmitted. SCTP uses cumulative acks and selective acks in acknowledgments, where the selective acks indicate the TSNs received out of order. Such selective acks in SCTP, which are sent in SACK chunks, are called gap acks. A label STC (TGS TGE ) represents a packet carrying a SACK chunk with cumulative ack TC , and gap ack for TSNs TGS through TGE . C1 is the cwnd at A for destination B1 , and C2 is the cwnd at A for destination B2 . C1 and C2 are denoted in terms of MTUs, not bytes. The assignments such as initial TSN = 1 and initial time t = 0 are arbitrary assignments to signify the beginning of the snapshot. These assignments are not meant to imply the beginning of the association. Initially, C2 = 2 because we assume that either there has been no transmission to B2 before t = 0 during the lifetime of the association, or C2 has decayed3 to two M T U s by t = 0.

2.1 Example Description The sender (A) initially sends to the receiver (B ) using primary destination address B1 . This setting causes packets to leave through A1 . Assume these packets leave the transport/network layers, and get buffered at A’s link layer A1 , whereupon they get transmitted according to the channel’s availability. This initial condition is depicted in figure 2 at time t = 0, when in this example A has 50 packets buffered on interface A1 . At t = 1, as TSNs 1 - 50 are being transmitted through A1 , the sender’s application changes the primary destination to B2 , thus requiring any new data from A to be sent to B2 . In the example, we assume TSN More precisely, A1 , A2 , B1 and B2 are IP addresses associated with link layer interfaces. Here we assume only one address per interface, so address and interface are used interchangeably. 3 The cwnd for a destination address decays exponentially if no data is transmitted to that destination address [13]. 2

3

Receiver B

B1 1-50*

1 2 3

21 22 23

41 61 2 62 4 3 4 63 44 64 45 65 46 66

69 70

49 50

A1 0

Sender A

(C2=2) 1 (C2=2) 2

S49

(C2=2) 81.2 (C2=2) 81.3

81 82 83 84

(C2=3) 82 (C2=4) 83 (C2=5) 84

85

(C2=6) 85

86

(C2=7) 86

(51-

54) S54

89 90

(C2=10) 89 (C2=11) 90

* TSNs 1-50 have been buffered at the sender’s link layer corresponding to A1 and are being sent.

Receiver B

51 52

-51) S1(51 2) (C2=2) 41.1 S1(51-5 53 (C2=2) 41.2 54

S2(5 S1 1S3(5 52) 41 1-52 42 ) 43 S41 (5152) S42 (5154 S43 (51 ) S44 -54) (5154) S45 (5154) S46 (5154)

B2

A2

1-53) S41(51-54) 5 ( 1 S4 42 (rt55 x)

43, 4

4 (rtx ) 45, 4 6 (rtx ) 47, 4 8 (rtx ) 49, 5 0 (rtx ) 56, 5 7

21.1 21.2

61.2 61.3

101.3 101.4 102.1 103.1 104.1 105.1 106.1

62, 63 64, 65

Figure 2: Timeline for the example

51 is transmitted to the new primary at t = 1. We refer to this moment as the changeover time. This new primary destination causes new TSNs to leave the sender through A2 . Concurrently, the packets buffered earlier at A1 are still being transmitted. Previous packets sent through A1 , and the packets sent through A2 , can arrive at the receiver B in an interleaved fashion on interfaces B1 and B2 respectively. In figure 2, TSNs 1, 51, 52 and 2 arrive at times 21, 21.1, 21.2, 22, respectively. This reordering is introduced as a result of changeover; the specific times depend on the delays of paths 1 and 2. The receiver starts reporting gaps as soon as it notices reordering. If the receiver communicates four missing reports to the sender before all of the original transmissions (TSNs 1 - 50) have been acked, the sender will start retransmitting the unacked TSNs. SCTP’s Fast Retransmit algorithm is based on TCP’s Fast Retransmit algorithm [6], with the additional use of selective acks and a modification to handle some cases of reordering4 . Accordingly, the SACKs resulting from the receipt of TSNs 51-54 will be the only ones generating missing reports. The SACKs received by A on A2 at t = 41:1 and t = 41:2 will be considered as the first and second missing reports for TSNs 2 - 50. Since these SACKs do not carry new cumulative acks, they do not cause growth of C2 . Between t = 42 and t = 81, the cumulative ack in the SACKs received by A on A1 increases as a consequence of the original transmissions to destination B1 reaching B . In this period, A receives 40 SACKs which incrementally carry cumulative acks of 2 - 41. The SACKs received by A on A2 at t = 81:2 and t = 81:3 carry a cumulative ack of 41, and are the third and the fourth missing reports for TSNs 42 - 50. Upon the fourth missing report, A retransmits only TSN 42, since C2 permits only one more packet to be outstanding. At t = 82, the SACK for the original transmission of TSN 42 reaches A on A1 . Since the sender cannot distinguish between SACKs generated 4

[11] goes hand-in-hand with RFC2960. The Implementor’s guide maintains all changes and additions to be included in RFC2960’s next version. All implementations are expected to carry the specifications and modifications in this guide.

4

by transmissions from SACKs generated by retransmissions, this SACK incorrectly acks the retransmission of TSN 42, thereby increasing C2 by one, reducing the amount of data outstanding on destination B2 , and triggering the retransmission of TSNs 43 and 44. At t = 83, the SACK for the original transmission of TSN 43 arrives at A on A1 . As before, this SACK acks the retransmission (of TSN 43), further incorrectly increasing C2 , and triggering retransmission of TSNs 45 and 46. This behaviour of SACKs for original transmissions incorrectly acking retransmissions continues until the SACKs of all the original transmissions to B1 (up to TSN 50) are received by A. Thus, the SACKs from the original transmissions cause C2 to grow (possibly drastically) from wrong interpretation of the feedback.

2.2 Discussion The TCP-unfriendly cwnd growth and incorrect retransmissions during changeover occur due to current inadequacies of SCTP - (i) the sender is unable to distinguish SACKs for transmissions from SACKs for retransmissions, and (ii) the sender’s congestion control mechanism is unaware of the occurrence of a changeover, and hence is unable to identify reordering introduced due to changeover. Addressing either of these inadequacies will solve the more important problem of TCP-unfriendly cwnd growth, as described in Section 5. While the values chosen in our example illustrate but a single case of the congestion window overgrowth problem, our preliminary investigation shows that the problem occurs for a range of fpropagation delay, bandwidth, MTUg settings. For example, with both paths having RTTs of 200ms (bandwidth = 100Kbps, propagation delay = 40ms) and MTU = 1500 bytes, the incorrect retransmission starts much earlier (at TSN 3), and the cwnd overgrowth is even more dramatic. The congestion window overgrowth problem exists even if the buffering occurs not at the sender’s link layer, but in a router along the path (in figure 1, path 1). In essence, the transport layers at the endpoints can be thought of as the sending and receiving entities, and the buffering could potentially be distributed anywhere along the end-to-end path.

3 Congestion Window Overgrowth: A General Model In this section, we generalize the scenario described in Section 2. This model abstractly quantifies the cwnd overgrowth and the number of unnecessary retransmissions caused by the changeover under various network conditions. In Section 3.1, we present a generalized timeline of SCTP behaviour during changeover. We then derive analytic results from this general model in Section 3.2, followed by a discussion of these results in section 3.3.

3.1 Model Description We now present a generalized timeline of SCTP behaviour during changeover in figure 3. This timeline is an excerpt from an association and is based on the example scenario described in Section 2. Some parameters used in the model are described below. The rest of the notation is described in Section 3.2.

1 2 : Congestion windows at A for B1 and B2 , respectively

C ;C

5

t

c : Changeover time - Moment after a changeover when sender

starts sending packets to new

A

primary destination B2

2 : Time when fast retransmission (incorrectly) starts. G1 : Number of Transmission Sequence Numbers (TSNs) sent in initial group transmitted to destination B1 in the time interval f0; tc g.

t

K

+1

: First TSN to be fast retransmitted (incorrectly) by A. Sender

Receiver B

B1

e1F d1F

A A1 0 (G1 TSNs being

K t1

transmitted)

1 2

A2 (C2=2) tc

e2R d2F SK ([G 1 +1] -

K+S1

[G

1

K+1

([G

1

K+2

+1]

SK+

- [G

1

2 ([G

1

+1]

e1R

+X] )*

– [G

1

t2

(C2 = 2) t2

+4])

(C2 = 3)

+4])

(C2 = 4)

G1 - 1 SG

G1

1

-1 ([ G

1

SG

1

+1] -

+5

[G

1

+5])

* X = 3 if TSN K is received after TSN (G1+ 3), i.e., { e1F + (K-1)d1F } > { tc+ 2e2F + e2R } 2 otherwise Y = ceiling {(tc+ e2F - e1F) / d1F}

Receiver B

B2

G1 +1 G1 +2 +1] ([G 1 SY

1])* G 1+

[

e2F d2F

G1 +3 G1 +4 +1] ([G 1 SK G

[G 1

+4])

t1

+ K+1 ( 1 5 rtx ) K+2, K+3 ( rtx ) K+4, K+5 ( rtx ) G1 (rt

x)

(C2 = G1 -K+1) N ew tra nsmis li s (C2 = G1 -K+2) mited by M ions, ax.Bu rst

Figure 3: General timeline for the problem At t = 0, host A starts to transmit G1 TSNs (TSN 1 through G1 ) to destination address B1 . By time tc the transport layer at host A has G1 TSNs outstanding. This group of TSNs (1 through G1 ) is referred to as the initial group. Note that these TSNs are outstanding at the transport entity at host A and could be buffered anywhere along the end-to-end path, even at interface A1 . By time tc , A has changed its primary destination to B2 . At the instant t = tc , A starts transmitting new data to B2 through interface A2 . tc can also be thought of as the time elapsed from the transmission of the first outstanding TSN on destination B1 to the time of transmission of the first TSN on destination B2 after changeover. Note that the SCTP receiver normally responds with delayed SACKs, but immediately returns a SACK whenever reordering is observed. The critical instant in the scenario, denoted t2 , occurs when A receives the fourth missing report [11, 13]. At this instant, TSNs K + 1 through G1 get marked for retransmission. Due to the receipt of a SACK acking TSN G1 + 4, (at t2 ) C2 allows one M T U sized chunk to be transmitted, hence TSN K + 1 gets retransmitted to destination B2 . According to RFC2960, “... when its peer is multi-homed, an endpoint SHOULD try to retransmit a chunk to an active destination transport address that is different from the last destination address to which the DATA chunk was sent.” Since the original transmission of TSN K + 1 went to B1 , the retransmission of TSN K + 1 is sent to B2 . The value of K is estimated and its relevance to the cwnd overgrowth is explained in Section 3.2. 6

The retransmission of TSN K + 1 at t = t2 is a consequence of the fourth missing report (SACK received on interface A2 at t = t2 ) carrying cumulative ack K . Since TSNs G1 + 1 through G1 + 4 reached host B by time t1 , the SACK also carries a gap ack for TSNs G1 + 1 through G1 + 4, resulting in the marking of TSNs K + 1 through G1 for retransmission. The cumulative ack K is an indication that the receiver B has received K TSNs in-sequence by time t1 . This in-sequence data is clearly the data received by B on the interface B1 by time t1 . Following the retransmission of TSN K + 1, the SACK for the original transmission of TSN K + 1 arrives at A. Since host A now considers TSN K + 1 to be outstanding on destination B2 , the receipt of this SACK incorrectly increases C2 , and allows TSNs K + 2 and K + 3 to be retransmitted. The receipt of a SACK for TSN K + 1 immediately after TSN K + 1 is retransmitted is not a coincidence. At time t1 when host B sends a SACK with a cumulative ack of K acking the receipt of TSN G1 + 4, TSN K + 1 is concurrently being received on interface B1 . Immediately after the receipt of TSN K + 1 on interface B1 , host B sends a SACK with cumulative ack K + 1. Consequently, the sequence of events at host A is the receipt of a SACK with cumulative ack K (which is also the fourth missing report for TSNs K + 1 through G1 ) followed by a SACK with cumulative ack K + 1. As shown, this behaviour continues until the SACKs for all the original transmissions to B1 (up to TSN G1 ) have been received at host A.

3.2 Analytic Results We will now estimate the cwnd overgrowth of parameters used in the following analysis are:

2 , and the number of unnecessary retransmissions. The

C

1F ; L2F : Maximum Transmission Unit (MTU) sizes on forward paths A1 to B1 and A2 to B2 , respectively B1F ; B2F : End-to-End available bandwidths [9] on forward paths A1 to B1 and A2 to B2 , respectively L

e

X

: Delay experienced by a packet along a path, given by: e

=

for i = each hop

where prop = propagation delay, sion delay.

proc

(prop)i + (proc)i + (queue)i + (trans)i

= processing delay,

queue

(1)

= queueing delay, and trans = transmis-

F : Delay experienced by a data packet, along the forward path. Assumption: Each data packet is M T U sized, therefore, eF is estimated by: e

F

e

X

=

for i = each hop in forward path

(prop)i + (proc)i + (queue)i +

L B

i

(2)

where, L is the MTU of the path, and B i is available bandwidth at hop i. e1F ; e2F : Delays experienced by a data packet on forward paths A1 to B1 and A2 to B2 , respectively, eR : Delay experienced by a pure SACK packet, along the reverse path. Assumption: that transmission delays for pure SACK packets are negligible, therefore, eR is estimated by:

R

e

=

X

for i = each hop in reverse path 7

(prop)i + (proc)i + (queue)i

(3)

e

1R ; e2R : Delays experienced by a pure SACK packet on reverse paths B1 to A1 and B2 to A2 , respectively.

: Minimum delay observed between consecutive packets transmitted along a same path by the receiver of the packets. This delay is dictated by end-to-end available bandwidth of the path, which is determined by the hop with the minimum available bandwidth on the path (in other words, the path bottleneck). d is given by:

d

d

L

=

i = each hop fB i g

minfor

(4)

where, L is the MTU of the path, and B i is available bandwidth at hop i.

1F ; d2F : Minimum delays between consecutive data packets from A1 to B1 observed at B1 , and from A2 to B2 observed at B2 , respectively. d1R ; d2R : Minimum delays between consecutive SACK packets from B1 to A1 observed at A1 , and from B2 to A2 observed at A2 , respectively. d

Assumption: The reverse path does not change the delay between SACKs. In other words, the forward path’s bottleneck dictates the rate at which SACKs are transmitted and then received, not the reverse path’s bottleneck. Therefore, the delay observed between SACKs is the same as the delay observed between the data packets. In other words, d1R = d1F ; and d2R = d2F (5) Packet transmission on path 2 starts at time tc ; it takes some time for the fourth legitimate missing report to reach the sender A. This time instant is shown in figure 3 as t2 , which is given by: t

2

=

t

c

+ 2e2F + 2e2R +

d

2F

(6)

1 is the instant when this fourth legitimate missing report leaves the receiver by:

t

1

t

=

2

t

2R

e

=

c

t

+ 2e2F +

As shown in figure 3, we assume that the SACK received at received by A so far5 .

2R

e

+

= =

through

2 , and is given

B

2F

(7)

d

2 on A2 contains the highest cumulative ack

t

Let K be defined as the TSN that was most recently cumulatively acked at words, K is the last TSN that reached the receiver B on B1 at t1 , where, K

B

d t1d1eF1F e d tc +2e2F +ed21RF+d2F

e1F e

A

prior to time

2 . In other

t

(8)

The result is that TSNs (K + 1) through G1 will be retransmitted on Path 2 and the total number of unnecessary retransmissions = maxf0; G1 (K + 1) + 1g = maxf0; G1 K g. The cwnd overgrowth for C2 will be maxf0; G1 K g. 5

This assumption is made for simplicity of analysis. If this assumption does not hold, the cwnd overgrowth will be lesser by

d e2Rd e1R e. 1F

8

3.3 Discussion For an AIMD (Additive Increase Multiplicative Decrease) congestion control algorithm, a round refers to the period from the transmission of cwnd amount of data to the receipt of acks for that data. After receipt of these acks, the next round starts as the sender transmits cwnd+1 amount of data. In our general model, the period between t = 0 and t = tc represents the beginning of such a round when the sender transmits G1 amount of data. This transmission of data may or may not be in a burst, but the receiver receives the data with packet interarrival times of at least d1F on interface B1 since d1F is the delay due to the available bandwidth of the bottleneck link on path 1. Thus, in our model, we assume that TSNs are received on interfaces B1 and B2 uniformly with interarrival times d1F and d2F , respectively. If K G1 , then all of the original transmissions to B1 are received by host B by time t1 . Hence, the SACK received by host A at time t2 would carry a cumulative ack of G1 + 4 and no gap acks. In this case, no unnecessary retransmission and no TCP-unfriendly cwnd growth occurs. On the other hand, if K < G1 , then K is the last TSN cumulatively acked prior to time t2 . Consequently, + 1 is the first TSN to be retransmitted incorrectly, and C2 overgrows by G1 K . A higher value of K results in a higher cumulative ack at A at t2 , hence fewer retransmissions and consequently less error in C2 . Similarly, as K decreases, more unnecessary retransmissions occur, and the error in C2 also increases. K

From equation (8), K decreases with an increase in d1F , or a decrease in d2F . Further, K decreases with an increase in e1F , or a decrease in e2F . These relationships between K and the characteristics of the two paths imply that when a changeover is made to a higher quality path, there is a likelihood of TCP-unfriendly cwnd growth and unnecessary retransmissions, and the bigger the improvement in quality that the new path provides, the larger the TCP-unfriendly growth and number of incorrect retransmissions will be.

4 Analytic Results: Validation and Visualization It is clear from the analytic results derived in Section 3.2 that cwnd overgrowth occurs if the sender has more than K packets outstanding at the time of changeover. The value of K , given by equation (8), is thus pivotal in quantifying cwnd overgrowth. We first validate this analytical value of K using ns-2 simulations in Section 4.1. We then estimate the value of K using the model under various network and changeover conditions in Section 4.2.

4.1 Analytic Results: Validation We now validate the analytical value of K derived in Section 3.2 through simulations using the SCTP module for ns-2 which was developed in the Protocol Engineering Lab at the University of Delaware [1, 10]. The topology is the same as in figure 1. The simulations do not have any cross traffic, hence the endto-end available bandwidths on each of paths 1 and 2 is equal to the minimum of link capacities on the corresponding path. Each of paths 1 and 2 has three links - two edge links and one core link. The edge links have a capacity of 10Mbps and propagation delay of 1ms. The available bandwidths of the paths, i.e., the capacities of the core links are chosen randomly between 10Kbps and 1Mbps. The propagation delays of the core links are chosen randomly between 25ms and 50ms. The sender’s sending window is fixed at 20KB 9

by setting the receiver’s advertised window to 20KB. We fix the sending window to make it easier to extract parameters from the traces. Changeover occurs at time 5 seconds. Of 1000 simulation runs, 511 runs showed the occurrence of incorrect fast retransmissions due to changeover. Only the runs which showed these retransmissions could be used for validation because to infer the value of K from a simulation run (denoted Ksim ), at least one such retransmission had to occur. The first incorrect retransmission would correspond to TSN Ksim + 1. We extracted the values of the parameters e1F , e1R , e2F , e2R , d1F , d2F and tc from the traces for each of the 511 runs. Feeding these parameters into equation (8) gave us the analytic value of K (denoted Kanal ). Simulation results show that of the 511 comparisons of Ksim and Kanal , 431 results agreed exactly. In the remaining 80 results that did not agree, Kanal was equal to Ksim 1. This underestimation of K by the analytic model could be attributed to the assumption made in the derivation of analytic expression for K in Section 3.2, or to approximations made in extracting the parameters from the traces. The simulations thus agree with our analytic results.

4.2 Analytic Results: Visualization In graphing the analytically derived value of K , we reduce the number of independent variables by making the following assumptions so as to visualize the graphs better:

Forward paths 1 and 2 have the same M T U . Hence, L1F

F

=

e

R

=

L

X

+

L

for i = each hop in forward path B

(9)

i

The transmission delays at the other links along a path are assumed negligible in comparison to the transmission delay at the bottleneck link. Using equation (4),

X

L

for i = each hop

2F

L

The forward and reverse paths have the same propagation, processing and queueing delays. Using equations (2) and (3), e

=

i B

L

i = each hop fB i g

minfor

=

d

(10)

Combining the above two assumptions, we get

F

e

=

e

R

+

d

F

R

=

e

+

L

(11)

F

B

For the forward paths 1 and 2, the equation 11 can be rewritten as e

1F

=

e

1R

+

L

1F

B

2F

and e

=

e

2R

+

L

2F

B

(12)

Figures 4 and 5 (left) graph K as a function of B2F , for fixed values of B1F , e1R and e2R . In these 2D graphs, the changeover time, tc , is fixed at 10ms. Each 3-D graph in figures 4 and 5 (right) picks one 10

representative curve from the corresponding 2-D graph (left), and shows the influence of tc on 3-D graphs thus show K as a function of B2F and tc , for fixed values of B1F , e1R and e2R .

K

. These

The graphs are organized as follows:

The results in figure 4 use the range 10kbps - 100kbps for the available bottleneck bandwidths B1F and B2F . tc is set to 10ms in the 2-D graphs. The curve corresponding to B1F = 50kbps is used as a representative curve to show the influence of tc on K . tc varies over 10ms - 100ms in the 3-D graphs. Three combinations of (e1R , e2R ) are used: (50ms, 50ms), (50ms, 25ms), and (25ms, 50ms). The results in figure 5 use the range 100kbps - 1Mbps for the available bottleneck bandwidths B1F and B2F . tc is set to 10ms in the 2-D graphs. The curve corresponding to B1F = 500kbps is used as a representative curve to show the influence of tc on K . tc varies over 10ms - 100ms in the 3-D graphs. Three combinations of (e1R , e2R ) are used: (50ms, 50ms), (50ms, 25ms), and (25ms, 50ms).

We split the range (10kbps - 1Mbps) into two subranges (10kbps - 100kbps and 100kbps - 1Mbps), because the variation observed in K with both B1F and B2F ranging from 10kbps to 1Mbps is large. We are thus able to visualize the behaviour of K over a large range of available bandwidths, with the assumption that the available bandwidths of the two paths are comparable. In figure 4, K varies between 0 and 30, and mostly has a value below 10. Remember that the smaller K is, the more unnecessary retransmissions will occur, and the more cwnd grows when it should not. Changes in e1R ; e2R and tc seem to have little influence on K , as compared to the variation due to B1F ; B2F . That is because in this set, since the available bandwidths are low, the total delay is dominated by transmission delay. In figure 5, K varies between 0 and 40. The median value of K in this set has increased from the first set. This increase can be attributed to the greater range of the bottleneck bandwidths. Another important factor can be understood by considering equation (8). With an increase in the bottleneck bandwidth, the value of d1F decreases, consequently increasing K . We also observe the increased influence of e1R ; e2R and tc in this set of results, since the transmission delay is lesser dominant in this set. In both sets, we note that K decreases with a decrease in B1F or an increase in B2F , as is expected.

5 Proposed Solution: Changeover Aware Congestion Control As mentioned earlier, the TCP-unfriendly cwnd growth and incorrect retransmissions during changeover occur due to current inadequacies of SCTP - (i) the sender is unable to distinguish SACKs for transmissions from SACKs for retransmissions, and (ii) the sender’s congestion control mechanism is unaware of the occurrence of a changeover, and hence is unable to identify reordering introduced due to changeover. Addressing either of these inadequacies will solve the more important problem of TCP-unfriendly cwnd growth. The Rhein Algorithm [8] solves the problem by addressing (i). In this section, we propose solutions which solve the problem by addressing (ii). In other words, the following solutions introduce changeover awareness in the sender’s congestion control mechanism. The cwnd overgrowth occurs due to the sender misinterpreting SACK feedback, and incorrectly sending fast 11

B2F (in kbps) vs K (e1R = 50 ms, e2R = 50ms, tc = 10ms) 30

K as a function of B2F and tc (e1R=50ms, e2R=50ms, B1F=50kbps) B1F=10kbps B1F=30kbps B1F=50kbps B1F=70kbps B1F=100kbps

25

K K

15

20

K

10

15 5

10

0 10

5

20

30

40

50

B2F (kbps)

60

70

80

90 100

10

20

30

40

50

70 60 tc (ms)

80

90

100

0 10

20

30

40

50

60

70

80

90

100

B2F B2F (in kbps) vs K (e1R = 50 ms, e2R = 25ms, tc = 10ms) 30

K as a function of B2F and tc (e1R = 50ms, e2R = 25ms, B1F = 50kbps) B1F=10kbps B1F=30kbps B1F=50kbps B1F=70kbps B1F=100kbps

25

K K

15

20

K

10

15 5

10

0 10

5

20

30

40

50

B2F (kbps)

60

70

80

90 100

10

20

30

40

50

70 60 tc (ms)

0 10

20

30

40

50

60

70

80

90

100

B2F B2F (in kbps) vs K (e1R = 25 ms, e2R = 50ms, tc = 10ms) 35

K as a function of B2F and tc (e1R = 25ms, e2R = 50ms, B1F = 50kbps) B1F=10kbps B1F=30kbps B1F=50kbps B1F=70kbps B1F=100kbps

30

K K

15 25 10

K

20 5 15

0 10

10

5

20

30

40

B2F (kbps)

50

60

70

80

90 100

10

20

30

40

50

60

70

tc (ms)

0 10

20

30

40

50

60

70

80

90

100

B2F

Figure 4: Graphing K analytically: 10kbps B1F ; B2F

12

100kbps

80

90

100

80

90

100

B2F (in kbps) vs K (e1R = 50ms, e2R = 50ms, tc = 10ms) 40 B1F=100kbps B1F=300kbps B1F=500kbps B1F=700kbps B1F=1Mbps

35

30

25 20

25

K

15 20 10 15 5 10

0 100 200 300 400 500 600 700 800 900 1000

5

0 100

200

300

400 500 600 700 800 B2F B2F (in kbps) vs K (e1R = 50ms, e2R = 25ms, tc = 10ms)

900

10

20

30

40

60

50

70

80

100

90

1000

35

K as a function of B2F and tc (e1R = 50ms, e2R = 25ms, B1F = 500kbps) B1F=100kbps B1F=300kbps B1F=500kbps B1F=700kbps B1F=1Mbps

30

K K

25 25 20 15

20 K

10 15

5 0 100 200 300 400 500 B2F (kbps) 600700 800 900 1000

10

5

0 100

200

300

400

500

600

700

800

900

10

20

30

40

50

70 60 tc (ms)

1000

B2F B2F (in kbps) vs K (e1R = 25ms, e2R = 50ms, tc = 10ms) 45

K as a function of B2F and tc (e1R = 25ms, e2R = 50ms, B1F = 500kbps) B1F=100kbps B1F=300kbps B1F=500kbps B1F=700kbps B1F=1Mbps

40

K K

35

25 20

30

15 25 K

10 20 5 15

0 100 200 300 400 500 B2F (kbps) 600700 800 900 1000

10

5

0 100

200

300

400

500

600

700

800

900

10

20

30

40

50

60

70

tc (ms)

1000

B2F

Figure 5: Graphing K analytically: 100kbps B1F ; B2F

13

1Mbps

80

90

100

80

90

100

retransmissions. Changeover aware congestion control (CACC) algorithms curb the TCP-unfriendly cwnd growth by eliminating these improper fast retransmissions. The key in a CACC algorithm is maintaining state at the sender for each destination when changeover happens. On receipt of a SACK, the sender selectively increases the missing report count for TSNs in the retransmission list, thus preventing incorrect fast retransmissions. Section 5.1 describes the Conservative CACC (C-CACC) algorithm which has the disadvantage that in the face of loss, a significant number of TSNs could potentially wait for a retransmission timeout when they could have been fast retransmitted. In Section 5.2, we describe the Split Fast Retransmit CACC (SFR-CACC) algorithm which alleviates this disadvantage. We verify the effectiveness of the SFR-CACC algorithm through simulation in Section 5.3. In Section 6, we discuss the advantages of the CACC algorithms over the Rhein algorithm in solving the cwnd overgrowth problem.

5.1 Conservative CACC As mentioned previously, C-CACC maintains state at the sender when changeover happens, on a perdestination basis. This state is used to conservatively increment missing report counts for TSNs. This conservative approach prevents incorrect triggering of fast retransmissions, thus eliminating the cwnd overgrowth problem.

On changeover, the sender maintains the following state for the new primary destination: 1) Set CHANGEOVER ACTIVE to 1, indicating that a changeover has occured. 2) Store the next TSN to be sent in next tsn at change. On receipt of a SACK, 1) If the cumulative ack in the SACK is the next tsn at change, the CHANGEOVER ACTIVE flag is cleared. 2) The following algorithm dictates when the missing report count for a TSN t should be incremented in accordance with [13, 11], and when the count should not be incremented: if (CHANGEOVER ACTIVE == 1) and (the SACK reports at least one TSN next tsn at change) then if (t next tsn at change) then Increment missing report count for t according to [13, 11]; else Do not increment missing report count for t; else Increment missing report count for t according to [13, 11];

Figure 6: Conservative CACC Algorithm

14

As was discussed in Section 3.1, the receiver could observe reordering of TSNs due to changeover. According to C-CACC, the sender uses state maintained for the current primary destination to identify SACKs that are sent by the receiver after the receiver observes this reordering. The state is constituted by two variables per-destination: 1. CHANGEOVER ACTIVE - a flag which indicates the occurence of a changeover. 2. next tsn at change - the next TSN to be used by the sender, at the moment of changeover. The algorithm is described in figure 6. On changeover, the sender sets the state as described6 . The sender is considered to be in active changeover state until the CHANGEOVER ACTIVE flag is cleared. The flag is cleared when a SACK which cumulatively acks TSNs up to and including next tsn at change is received. At that time, all TSNs which were sent to the receiver before changeover occurred at the sender have been received, andreordering due to changeover no longer happens. This period during which the sender is in active changeover state is referred to as the active changeover period, and the outstanding TSNs which have not yet been acked at the sender at the moment of changeover constitute the changeover range. During the changeover period, receipt of a SACK that reports a TSN greater than or equal to next tsn at change indicates to the sender that reordering has been observed at the receiver. Since this reordering is likely due to changeover, the sender does not increment missing report counts for TSNs in the changeover range, thus preventing the incorrect fast retransmissions. C-CACC is conservative because when reordering due to changeover is observed at the receiver and consequently reported to the sender, the sender conservatively chooses to not increment missing reports for any TSN in the changeover range. In the face of loss, the sender will not perform fast retransmission on any TSN in the changeover range. The TSNs in the changeover range would thus have to wait for retransmission timeouts to be retransmitted. Furthermore, C-CACC does not take into account the possibility of multiple changeovers at the sender.

5.2 Split Fast Retransmit CACC (SFR-CACC) To alleviate the limitations of C-CACC, note that the reordering observed during changeover happens because TSNs which are supposed to reach the receiver in-sequence end up reaching the receiver in concurrent groups, in-sequence within each group. With this observation, we reason that the fast retransmit algorithm can be applied independently within each group. That is, on the receipt of a SACK, if the sender can estimate the TSN(s) that causes this SACK to be sent from the receiver, the sender can use the SACK to increment missing report counts within the causative TSN(s)’s group. In SFR-CACC, four variables for each destination are introduced: 1. CHANGEOVER ACTIVE - a flag which indicates the occurrence of a changeover. 2. CYCLING CHANGEOVER - a flag which indicates whether the change of primary is the first changeover to this destination address during an active changeover. This flag helps determine changeovers cycling through destination address space. 6

Unless explicitly stated, the variables used in the CACC algorithms refer to the state for the current primary destination, from the sender’s viewpoint.

15

On changeover, for the new primary destination: 1) If CHANGEOVER ACTIVE is 1, then there was a changeover to this destination address earlier. The sender sets CYCLING CHANGEOVER to 1, indicating that this changeover is a cycling switch to the same destination address during an active changeover. 2) The sender sets CHANGEOVER ACTIVE to 1, indicating that a changeover has occured. 3) The sender stores the next TSN to be sent in next tsn at change.

Figure 7: Split Fast Retransmit CACC Algorithm (Part 1) 3. next tsn at change - the next TSN to be used by the sender, at the moment of changeover. 4. cacc saw newack - a temporary flag, used during SACK processing to estimate the causative TSN(s)’s group.

On receipt of a SACK, 1) If the cumulative ack in the SACK is next tsn at change, the CHANGEOVER ACTIVE and CYCLING CHANGEOVER flags are cleared for all destinations. 2) If (CHANGEOVER ACTIVE == 1) and (the SACK contains Gap Acks) then for each destination d do initialize d.cacc saw newack = 0; done; for each TSN t being acked, that has not been acked in any SACK so far do let d be the destination to which t was sent; set d.cacc saw newack = 1; done

Figure 8: Split Fast Retransmit CACC Algorithm (Part 2) SFR-CACC is broken up into three logical parts. SFR-CACC(1) is very similar to the initial part of C-CACC algorithm, except for the CYCLING CHANGEOVER flag which we will discuss shortly. SFR-CACC(2) and SFR-CACC(3) specify sender actions on receipt of a SACK. On receipt of a SACK that cumulatively acks up to and including next tsn at change, the sender leaves the active changeover state. In SFR-CACC(2) the sender estimates the causative TSN(s)’s destination. The sender estimates the causative TSN(s) as TSN(s) getting acked for the first time in a SACK. TSNs sent to 16

On receipt of a SACK (contd.), 3) The following algorithm dictates when the missing report count for a TSN t should be incremented in accordance with [13, 11], and when the count should not be incremented: if (CHANGEOVER ACTIVE == 1) and (CYCLING CHANGEOVER == 0) then let count of newacks be number of destinations for which cacc saw newack is set; if (count of newacks == 1) then /* SACK acks only one dest */ let d be the destination to which t was sent; if (d.cacc saw newack == 1) then Increment missing report count for t according to [13, 11]; else Do not increment missing report count for t; else /* Mixed SACK - SACK acks more than one dest */ if (t was sent to the current primary) then Increment missing report count for t according to [13, 11]; else Do not increment missing report count for t; else if (CHANGEOVER ACTIVE == 1) and (CYCLING CHANGEOVER == 1) then /* Cycling observed, hence mark only in latest group */ if (t next tsn at change) then Increment missing report count for t according to [13, 11]; else Do not increment missing report count for t; else /* Sender is not in changeover active state */ Increment missing report count for t according to [13, 11];

Figure 9: Split Fast Retransmit CACC Algorithm (Part 3)

17

the same destination as the causative TSN(s) form the causative TSN(s)’s group. In SFR-CACC(3), the sender does not increment missing report counts for TSNs outside the causative TSN(s)’s group. In other words, the sender applies the SACK selectively to fast retransmit within the causative TSN(s)’s group. If more than one group are being acked, then fast retransmit is conservatively applied only to TSNs in the current primary destination’s group. SFR-CACC does the in-group marking of TSNs only as long as the sender does not changeover to a previously used destination address which was already used during the current active changeover period. If the sender starts to cycle through destination address space, then the sender switches to a more conservative behaviour of marking only TSNs in the latest outstanding group. The protection from such cycling changeovers is necessary because SFR-CACC assumes that the latest outstanding TSNs were transmitted to the current primary. One could envision a scenario where the sender has TSNs outstanding on two destination addresses, B1 and B2 , having performed changeover in that order. The sender then performs a changeover back to B1 , and a SACK acking both TSNs from both groups is received. The sender could now end up incorrectly fast retransmitting TSNs sent to destination B1 , causing cwnd overgrowth on destination B2 - precisely what we are trying to avoid. There may be other scenarios where the original problem of cwnd overgrowth may occur due to cycling changeovers. For the moment, we have not looked into cycling changeover in greater depth, and design SFR-CACC to be conservative when a cycling changeover occurs.

5.3 Simulations Verification of the effectiveness of SFR-CACC was done through ns-2 simulations. Using SFR-CACC under the same conditions as in section 4.1 for which cwnd overgrowth was observed, the simulations showed no unnecessary retransmissions, or cwnd overgrowth due to changeover.

6 Conclusion and Future Work Results from Section 4 suggest that the problem might not be a “corner case”, since for a large range of network settings, the value of K , which governs the minimum packets required to be outstanding at the time of changeover so as to observe cwnd overgrowth, is low. By approaching the problem from different perspectives, the Rhein algorithm [8] and the CACC algorithms all solve the problem of TCP-unfriendly cwnd growth. The Rhein algorithm recognizes that this growth occurs due to the sender’s inability to distinguish between SACKs for original transmissions from SACKs for retransmissions. This algorithm does not solve the problem of unnecessary fast retransmissions on a changeover. This algorithm also adds the overhead of an extra chunk for every SCTP packet. The CACC algorithms maintain state information during a changeover, and use this information to avoid incorrect fast retransmissions. Consequently, these algorithms prevent the TCP-unfriendly cwnd growth. These algorithms have the added advantage that no extra bits are added to any packets, and thus the load on the wire and the network is not increased. One disadvantage of the CACC algorithms is that some of the TSNs on the old primary are ineligible for fast retransmit. Furthermore, complexity is added at the sender to maintain and use the added state variables.

18

The fast retransmit algorithm is active on the changeover range for a longer time in SFR-CACC than with C-CACC. To quantify the number of TSNs which will be ineligible for fast retransmit in the face of loss, let us assume that only one changeover is performed and that SACKs are not lost. Under these assumptions, potentially only the last four packets sent to the old primary destination will be forced to be retransmitted with an RTO instead of a fast retransmit. In other words, under these assumptions, if a TSN is lost, and at least four packets are successfully transmitted to the same destination after the loss, then the TSN will be retransmitted via fast retransmit. With C-CACC however, any TSN in the changeover range will require an RTO to recover from loss. C-CACC is also incapable of handling multiple changeovers, whereas SFRCACC is equipped to do so. We have implemented SFR-CACC in the NetBSD/FreeBSD release for the KAME stack [2, 3]. The implementation uses three flags and one TSN marker for each destination, as described in Section 5.2. Approximately twenty lines of C code were needed to facilitate the SFR-CACC algorithm, most of which will be executed only when a changeover is performed in an association. We therefore recommend the addition of the SFR-CACC algorithm to RFC2960 to alleviate the problem of artificial cwnd growth and unnecessary fast retransmissions during a changeover. We plan to further investigate simulated results for the cwnd overgrowth with cross traffic. We also plan to look at multiple and cycling changeovers, and how well SFR-CACC performs in the face of multiple changeovers. It may also be interesting to look at how the receiver could help the sender make more informed decisions by participating more actively in the congestion control mechanisms.

7 Disclaimer The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government.

8 Acknowledgments Thanks to Ivan Arias Rodriguez, Vern Paxson, Mark Allman, Phillip Conrad and Johan Garcia for their comments and inputs.

References [1] The Network Simulator - ns-2. http://www.isi.edu/nsnam/ns/. [2] The SCTP Homepage. http://www.sctp.org. [3] Webpage of the KAME project. http://www.kame.net. [4] E. Blanton and M. Allman. On Making TCP More Robust to Packet Reordering. ACM Computer Communication Review, Vol. 32(1), January 2002. 19

[5] R. Braden. Requirements for internet hosts–communication layers. RFC1122, Internet Engineering Task Force (IETF), October 1989. [6] M. Allman et al. TCP Congestion Control. RFC2581, Internet Engineering Task Force (IETF), April 1999. [7] M. Gerla, S. S. Lee, and G. Pau. TCP Westwood Simulation Studies in Multiple-Path Cases. Proc. International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS) 2002, San Diego, CA, July 2002. [8] Janardhan R. Iyengar, Armando L. Caro Jr., Paul D. Amer, Gerard J. Heinz, and Randall Stewart. SCTP Congestion Window Overgrowth During Changeover. Proc. SCI2002, Orlando, July 2002. [9] M. Jain and C. Dovrolis. Pathload: A Measurement Tool for End-to-End Available Bandwidth. Proc. 3rd Passive and Active Measurements Workshop, Fort Collins, March 2002. [10] Protocol Engineering Lab, University of Delaware. SCTP Module for ns-2. http://pel.cis.udel.edu. [11] R. Stewart, L. Ong, I. Arias-Rodriguez, K. Poon, P. Conrad, A. Caro, and M. Tuexen. SCTP Implementers Guide. Internet Draft: draft-ietf-tsvwg-sctpimpguide-06.txt, Internet Engineering Task Force (IETF), May 2002. (work in progress). [12] R. Stewart and Q. Xie. Stream Control Transmission Protocol (SCTP): A Reference Guide. Addison Wesley, New York, NY, 2001. [13] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson. Stream Control Transmission Protocol. Proposed standard, RFC2960, Internet Engineering Task Force (IETF), October 2000. [14] M. Zhang, B. Karp, S. Floys, and L. Peterson. RR-TCP: A Reordering-Robust TCP with DSACK. Technical Report TR-02-006, International Computer Science Institute (ICSI), July 2002.

20

Making SCTP More Robust to Changeover - CiteSeerX

Making SCTP More Robust to Changeover - CiteSeerX

Suggest Documents

Making SCTP More Robust to Changeover - CiteSeerX

On Making SCTP Robust to Spurious Retransmissions - University of ...

SCTP - CiteSeerX

Making Markets More Robust* Randall S. Kroszner Booth ... - CiteSeerX

making chroma features more robust to timbre ... - Semantic Scholar

making airline schedules more robust - School of Computer Science

An Approach to Robust Decision Making under Severe ... - CiteSeerX

Making Intra-Domain Routing Robust to Changing and ... - CiteSeerX

Making Intra-Domain Routing Robust to Changing and ... - CiteSeerX

Making thin films more transparent, more conductive - CiteSeerX

making more informed hiring decisions - CiteSeerX

Making Implementation More Democratic through Action ... - CiteSeerX

Making climate model forecasts more useful - CiteSeerX

Series Making health systems more equitable - CiteSeerX

Making Context-Aware Computing More Social - CiteSeerX

Making Behavioral Activation More Behavioral - CiteSeerX

Making Implementation More Democratic through Action ... - CiteSeerX

MORE EFFICIENT TESTS ROBUST TO ... - Google Sites

Making Executable Interface Specifications More Expressive - CiteSeerX

Adopting robust decision-making to forest ...

Stream Control Transmission Protocol (SCTP): Robust and ... - arXiv

Robust Decision Making - Mediation

Stream Control Transmission Protocol (SCTP): Robust and ... - arXiv