
Scalable Timers for Soft State Protocols

Puneet Sharma

Sally Floyd

Deborah Estrin

Van Jacobson

Lawrence Berkeley National Laboratory

Information Sciences Institute University of Southern California 4676 Admiralty Way Marina del Rey, CA 90291 Ph: (310) 822-1511 x-742 Fax: (310) 823-6714 [email protected], [email protected]

1 Cyclotron Road Berkeley, CA 94720 Ph: (510) 486-7518 Fax: (510) 486-6363

[email protected], [email protected]

September 3, 1996

Abstract

Soft state protocols use periodic refresh messages to keep network state alive while adapting to changing network conditions; this has raised concerns regarding the scalability of protocols that use the soft-state approach. In existing soft state protocols, the values of the timers that control the sending of these messages, and the timers for aging out state, are chosen by matching empirical observations with desired recovery and response times. These fixed timer-values fail because they use time as a metric for bandwidth; they adapt neither to (1) the wide range of link speeds that exist in most wide-area internets, nor to (2) fluctuations in the amount of network state over time. We propose and evaluate a new approach in which timer-values adapt dynamically to the volume of control traffic and available bandwidth on the link. The essential mechanisms required to realize this scalable timers approach are: (1) dynamic adjustment of the senders' refresh rate so that the bandwidth allocated for control traffic is not exceeded, and (2) estimation of the senders' refresh rate at the receiver in order to determine when state can be timed out and deleted. The refresh messages are sent in a round-robin manner, not exceeding the bandwidth allocated to control traffic, and taking into account message priorities (e.g., triggered messages about new state are higher priority than refresh messages about pre-existing state). We evaluate two receiver estimation methods for dynamically adjusting network state timeout values: (1) counting of the rounds and (2) exponential weighted moving average.

Keywords: Protocols, Internetworking, Multicast Algorithms, Internet.


1 Introduction

A number of proposed enhancements to the Internet architecture require the addition of new state (i.e., stored information) in network nodes. In the context of various kinds of network element failures, a central design choice is the manner in which this information is established and maintained. Soft state protocols maintain state in intermediate nodes using refreshes that are periodically initiated by endpoints. When the endpoints stop initiating refreshes, the state automatically times out. Similarly, if the intermediate state disappears, it is re-established by the endpoint-initiated refreshes [1]. Many soft state designs, such as RSVP [2], RTP [3], and PIM [4, 5, 6], use best-effort messages to carry the refreshes. These traditional periodic refreshes, generated at fixed periods, scale poorly with the increase in the amount of state. Some designers have suggested using explicitly-acknowledged, reliable messages in place of these periodic refreshes [7]. The reliable nature more completely decouples timer management of soft state from the issue of packet losses of these refreshes. However, the use of reliable message transport entails an increase in the complexity of the mechanism.

In this paper we propose a new approach of scalable timers to improve the scaling properties of soft state mechanisms. Scalable timers replace the fixed timer settings used by existing soft state protocols with timers that adapt to the volume of control traffic and available bandwidth on the link. Scalable timers regulate the amount of control traffic independent of the amount of soft state.

In the next section we present an overview of state management in networks. Section 3 discusses the soft state paradigm of maintaining network state. In Section 4 we describe how fixed timers are currently used by soft state protocols to exchange refresh messages and discard stale state, and we motivate our approach of scalable timers, which makes soft state protocols more scalable. Mechanisms required for the proposed approach of regulating control traffic are discussed in Sections 5 and 6. We look at the application of the scalable timers approach to PIM in Section 7, followed by the simulation results in Section 8. Section 9 compares the traditional and the proposed approach. We conclude with a summary and a few comments on future directions in Section 10.

2 State Management in Networks

State in network nodes refers to information stored by networking protocols about the conditions of the network. The state of the network is stored in a distributed manner across various nodes and various protocols. For instance, a teleconference application might run on top of multiple protocols. The Internet Group Management Protocol (IGMP) [8] stores information about the hosts participating in the conference. Based on the membership information, multicast routing protocols such as PIM [4, 5, 6] or CBT [9] create multicast forwarding state in the routers. A reservation protocol such as RSVP [2] or ST-II [10, 7] may reserve network resources for the teleconferencing session along the multicast tree.

The state has to be modified to reflect changes in network conditions. The network nodes communicate with each other to exchange information regarding changes in the network conditions. Based on these control messages, the network nodes modify their stored state. For instance, in the event of a change in network topology, the routers exchange messages resulting in modification of the multicast forwarding state. In this section we discuss the tradeoffs associated with various ways of maintaining state in networks.

2.1 Paradigms for maintaining state in networks

The state maintained by nodes in a network can be categorized as hard state and soft state. Hard state is that which is installed in nodes upon receiving notification for setting up state, and is removed only on receiving an explicit tear-down (throw away) message. Soft state, on the other hand, uses refresh messages to keep it alive, and is discarded if the state is not refreshed for some time interval. Protocol designers have adopted both paradigms for storing state. The reservation protocol ST-II [10, 7, 11] is an example of a protocol that maintains hard state. In ST-II, connect messages are generated to set up a new reservation, which is later torn down by disconnect messages. Similarly, the multicast routing protocol Core Based Tree (CBT) [9] establishes hard state. On the other hand, the Resource Reservation Protocol (RSVP) [2] uses a datagram messaging protocol with periodic refreshes to maintain soft state in network nodes. Protocol Independent Multicast (PIM), another multicast routing protocol [4, 5, 6], also has control messages that are exchanged periodically to keep the multicast forwarding entries alive; a forwarding entry that is not refreshed is discarded.

In hard state architectures, control messages are exchanged among the nodes for installing as well as removing the state, and no periodic refreshes are needed. The install and remove messages are sent reliably (resent until an acknowledgement is received) to make the mechanism robust. Soft state architectures, on the other hand, do not need to manage reliable message delivery functions (e.g., timeouts, retransmits, etc.). In addition, soft state architectures do not rely upon explicit tear-down messages [1]. In steady state, hard state protocols require less control traffic, as there are no periodic control messages. Soft state protocols provide better (faster) adaptation and greater robustness to changes in the underlying network conditions, but at the expense of periodic refresh messages. However, if the network is highly dynamic, hard-state control messages must be generated to adapt to the changes. In such a scenario, the use of hard state does not provide much advantage over soft state, because the bandwidth used by control messages would be comparable for both paradigms.
Based on the roles played by the nodes with respect to the particular state being referenced, the control message exchange among network nodes can be modelled as an exchange of messages between two entities: the sender and the receiver. The sender is the network node that (re)generates control messages to install, keep alive, and remove state from the other node. The receiver is the node that creates, maintains, and removes state based on the control messages that it receives from the sender.

3 Soft State Paradigm

The soft state sender generates refresh messages to keep the soft state at the receiver alive. These messages are sent periodically, once per refresh period. We assume for simplicity that the soft state protocols do not use explicit tear-down messages [1]. In the absence of explicit tear-down messages, if the receiver does not get refresh messages for a particular state for one refresh period, it may treat the state as stale and discard it. Refresh packets can occasionally get dropped in the network. The protocol becomes more robust to dropped packets if the receiver waits longer than one refresh period before discarding the state. The time for which the receiver waits before discarding a state is a small multiple of the refresh period. The multiplying factor is determined by the degree of robustness required and the lossiness of the link or path [2].

As the size and usage of networks grow, the amount of state to be maintained also increases. Currently, most soft state protocols generate refresh messages using fixed timers, resulting in growth of control traffic with the increase in the amount of state being maintained in the network. In the center of the network, aggregate traffic levels will contribute to large quantities of state and traffic. At the edges, lower speed links make even a lower traffic level a concern. The primary goal of a data network is to carry data traffic. Unconstrained growth of control traffic can jeopardize this primary goal. This is exacerbated by the fact that traffic levels are higher when congestion and network events are reducing overall network resource availability. Soft state protocols can be made scalable only if control messages can be constrained independently of the amount of state in the network.

[1] A soft state protocol might optionally use explicit tear-down messages to achieve faster action.
[2] The degree of robustness is a tradeoff between tolerance to dropped packets and the overhead of maintaining state that may no longer be required.

3.1 Control Traffic in Soft State Protocols

As we have discussed in the previous section, soft state protocols require the periodic exchange of control messages between the network nodes. Based on the state for which the control message is sent, control traffic can be categorized into two types. We define the two types as:

- Refresh Traffic: control messages that refresh already existing state that needs to be kept alive. Such messages are generated locally at a sender node on the basis of its active local state. An example of refresh traffic is a PIM join message for an already active distribution tree.

- Trigger Traffic: messages that are generated by the sender when it wants to create new state. For example, if a PIM router receives a join message for a group that is not already represented by an entry in the node's multicast forwarding table, then a join message is triggered and sent towards the source.

In the next section we identify drawbacks of the traditional ways of setting the timer values in soft state protocols and present our approach to regulate control trac with scalable timers.

4 Timers for Control Traffic

Soft state protocols have two timers associated with the control traffic. The sender maintains a refresh timer that is used to clock out the refresh messages for the existing state. When the refresh timer for an existing state entry expires, a refresh message for that state is generated. The receiver maintains a state timeout timer to age the state that it maintains. The receiver discards a state entry if it does not receive a refresh message for that state before this timer expires.
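The interplay between these two timers can be sketched in a small discrete-time simulation. This is purely illustrative and not from the paper; the function name and constants are invented, with values mirroring the PIM example in the next subsection (60-second refreshes, 180-second timeout).

```python
# Illustrative sketch: a sender refresh timer and a receiver state-timeout
# timer, simulated in discrete one-second steps. All names are invented.

REFRESH_PERIOD = 60           # sender refreshes the entry every 60 s (PIM-like)
TIMEOUT = 3 * REFRESH_PERIOD  # receiver discards state after 180 s of silence

def simulate(refresh_times, horizon):
    """Return the time at which the receiver discards the state entry,
    or None if the entry stays alive through `horizon` seconds."""
    last_refresh = 0
    for t in range(1, horizon + 1):
        if t in refresh_times:
            last_refresh = t            # refresh message restarts the timeout timer
        elif t - last_refresh > TIMEOUT:
            return t                    # timeout timer expired: state discarded
    return None

# Sender refreshes at t=60 and t=120, then stops: the state times out
# one TIMEOUT interval after the last refresh.
print(simulate({60, 120}, 400))   # -> 301
```

As long as refreshes keep arriving within the timeout interval, the state stays alive indefinitely; once they stop, it expires without any explicit tear-down message.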

4.1 Traditional Approach

Traditional protocols have fixed settings for the timer values. The values of the refresh timers and timeout timers are chosen by empirical observation, with desired recovery and response times in mind. These values are then used as fixed timers for sending the periodic updates. The value of the timeout timer is set to a multiple of the refresh period. For instance, a PIM router sends join messages periodically, every 60 seconds, to its next-hop router towards the sender. If state in an upstream PIM router does not get refreshed within 180 seconds, it is treated as stale and is discarded.

Such fixed refresh timers for state updates fail to address the heterogeneity of networking environments (e.g., the range of link bandwidths) and the growth of network state. Fixed timers suffer from scaling problems. The control traffic uses bandwidth proportional to the amount of state being refreshed. State in the network might exist even when data sources are idle (e.g., state related to shared trees in PIM routers, or IGMP membership information in Designated Routers). To date, when the overhead becomes excessive, the fixed timer values have had to be changed globally. In summary, the fixed timer settings fail because they use time as a metric for bandwidth and do not address the range of link speeds present across the network.


[Figure 1: Fixed Timers vs. Scalable Timers. The figure plots the amount of state against bandwidth and refresh period, with one curve for fixed timers and one for scalable timers.]

4.2 Scalable Timers

We propose a new approach for regulating the control traffic in which the timer values are dynamically adjusted to the amount of state. This approach is called scalable timers. In our approach we fix the control traffic bandwidth instead of the refresh interval. The design objective is to make the bandwidth used by control traffic negligible compared to the link bandwidth and the data traffic. Consequently, a fixed portion of the link bandwidth is allocated to the control traffic. The refresh interval at the sender is adjusted according to the fixed available control traffic bandwidth and the amount of state to be refreshed. The sender varies the refresh interval with the change in the amount of state to be refreshed.

Figure 1 compares the changes in the refresh interval and control bandwidth for fixed and scalable timers. Figure 1 has a line for the scalable timers approach and a line for the fixed timers approach, showing how the amount of bandwidth used by control messages and the refresh period vary as the amount of state to be refreshed increases. In the fixed timers approach, the amount of bandwidth used by control messages increases as the amount of state to be refreshed increases. In our scalable approach, however, the refresh interval increases with the amount of state, keeping the volume of control traffic constant. Unlike fixed timers, our approach scales well because the control traffic does not grow with the amount of state. In the next section we present the mechanisms required for applying scalable timers to soft state protocols.
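The core idea can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation; the constants (packet size, control bandwidth, minimum interval) are assumed values.

```python
# Sketch of the scalable-timers idea: fix the control bandwidth and derive
# the refresh interval from the amount of state, instead of fixing the
# interval. All constants are illustrative.

PKT_SIZE = 64               # bytes per refresh message (assumed)
CONTROL_BW = 512            # bytes/s reserved for control traffic (assumed)
REFRESH_INTERVAL_MIN = 60   # never refresh faster than this (seconds)

def refresh_interval(num_entries):
    """Time to refresh all entries once within the fixed control bandwidth."""
    return max(REFRESH_INTERVAL_MIN, num_entries * PKT_SIZE / CONTROL_BW)

for n in (100, 1000, 10000):
    # The control bandwidth stays constant; the interval stretches instead.
    print(n, refresh_interval(n))
```

With fixed timers the loop above would print a constant interval and the control bandwidth would grow tenfold at each step; here the interval grows instead, which is exactly the tradeoff shown in Figure 1.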

5 Mechanisms for Scalable Timers

The essential mechanisms required by scalable timers are twofold: (1) the sender dynamically adjusts the refresh rate so that the bandwidth allocated for the control traffic is not exceeded, and (2) the receiver estimates the rate at which the sender is refreshing the state, in order to determine when state can be considered stale and deleted. In this section we discuss these mechanisms for servicing the control traffic and timing out state at network nodes for point-to-point links. These mechanisms can be extended to multiaccess LANs by treating them as multiple point-to-point links.

5.1 Servicing the Control Traffic at the Sender

The sender needs to generate updates for its state entries such that the bandwidth allocated for control traffic is not exceeded. A simple model for generating refreshes at the sender is to divide the allocated control bandwidth (control_bw_total) equally among all the state entries. Updates are sent for each state entry in a round-robin fashion. In this zeroth-level algorithm, all the control traffic is serviced as one class.

As discussed earlier in Section 3.1, trigger messages are generated when new state is created. Trigger traffic needs to be serviced faster than existing-state traffic for improved end-to-end performance and convergence (adaptability to changes in the network conditions). More generally, different control messages can have different priorities. Simple round-robin servicing of the control messages fails to capture the priorities associated with the various control messages of a protocol. More structure needs to be added to the simple round-robin model at the sender node to include the difference in the priority levels of the control messages.

A priority model can be used to add more structure to the zeroth-level algorithm at the sender. The control traffic is divided into various classes based on the priority associated with the messages. One instantiation of such a priority model is to divide control traffic into two classes, trigger and refresh, assigning trigger traffic to the higher priority class. Such a simple priority model is also not suitable, as it can lead to starvation of the lower priority class messages. In the above-mentioned instantiation, refresh messages might be starved if there is a large amount of trigger traffic. Instead, isolation of bandwidth for the various classes of traffic (messages) is required, such that each class receives a guaranteed share of the control bandwidth even during heavy load. Because each class of control traffic is guaranteed a share of the control bandwidth, there is no starvation. While bandwidth allocated to a class is not being used, it is available to the other classes sharing the control traffic bandwidth. The class structure of the control messages in different protocols can be set differently based on the requirements of each protocol. This use of isolation to protect the lower classes from starvation is similar to the link-sharing approach in Class Based Queueing (CBQ) [12].

One possible class structure is to divide the control messages into two classes, trigger messages and refresh messages. The trigger traffic class is guaranteed a very large fraction (close to 1) of the control bandwidth so that trigger messages can be serviced very quickly for better response time. Refresh messages are allocated the remaining non-zero fraction of the bandwidth. Guaranteeing a non-zero fraction of the control traffic bandwidth to refresh messages precludes starvation and premature state timeout at the receiver. A token bucket rate limiter can be used to serve bursts in trigger traffic. While the trigger traffic is not using the bandwidth allocated to it, that bandwidth is used by refresh traffic. The sender generates updates for the existing state in a round-robin fashion. To avoid sending refresh messages too frequently, all protocols have a minimum refresh period refresh_interval_min, the same as the refresh period used by the fixed timers approach. In this model the refresh period for existing state is at least refresh_interval_min and is given by

max(refresh_interval_min, amount_of_existing_state / bw_available).

In this class structure there is a tradeoff between the latency in establishing new state and the delay in removing stale state. If a larger fraction of the control bandwidth is allocated to trigger traffic, the latency in establishing new state is lower, but there might be more delay in discarding stale state.
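The two-class split can be sketched as follows. This is an illustrative sketch only: the share values and function names are assumptions, and a real sender would additionally apply the token bucket rate limiter mentioned above.

```python
# Illustrative two-class split of the control bandwidth: trigger traffic
# gets most of the bandwidth, but refresh traffic keeps a guaranteed
# non-zero share so it can never be starved. Bandwidth a class does not
# use is lent to the other class. All values are assumed.

CONTROL_BW = 1000.0         # bytes/s for all control traffic (assumed)
TRIGGER_SHARE = 0.9         # close to 1, for fast response to new state
REFRESH_SHARE = 1.0 - TRIGGER_SHARE

def class_rates(trigger_demand):
    """Return (trigger_rate, refresh_rate) in bytes/s for the current
    trigger demand. Unused trigger bandwidth flows to refresh traffic."""
    trigger_rate = min(trigger_demand, TRIGGER_SHARE * CONTROL_BW)
    # Refresh traffic always gets at least REFRESH_SHARE * CONTROL_BW.
    refresh_rate = CONTROL_BW - trigger_rate
    return trigger_rate, refresh_rate

print(class_rates(0.0))      # idle triggers: refresh traffic gets everything
print(class_rates(5000.0))   # heavy triggers: refresh keeps its guarantee
```

Even under an unbounded trigger burst, refresh traffic retains its guaranteed share, which is what precludes premature state timeout at the receiver.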

5.2 Timing out Network State at Receiver

In the fixed timer approach, the receiver has prior knowledge of the refresh rate and can compute the timeout interval a priori (e.g., a small multiple of the refresh period to allow for packet loss). Since in our approach the rate at which the sender refreshes the state varies based on the total amount of state at the sender, the receiver has to track the refresh frequency and update the timeout interval accordingly.

One possibility would be for the sender to explicitly notify the receiver about changes in the refresh period in the control messages. Another possibility is for the receiver to estimate the refresh frequency from the rate of arrival of the control messages. If the receiver relies on information conveyed by the sender, the receiver is implicitly trusting the sender to behave properly. This is not an issue of maliciousness but of a lack of adequate incentives to motivate strict adherence. Moreover, the refresh period at the sender varies as the amount of state being refreshed changes, so even the sender cannot foresee changes in the refresh period and hence might convey incorrect information to the receiver. For example, if a link comes up, there could be a big burst of trigger traffic, resulting in changes in refresh intervals not foreseen by the sender. Therefore, even if information is sent explicitly by the sender, the receiver still needs alternate mechanisms to avoid premature state deletion. And if the receiver does not trust the sender and has its own mechanism for estimating the timeouts, then the information sent by the sender is redundant. In summary, the explicit notification approach unnecessarily couples system entities. We adopt a general architectural principle of placing less trust in detailed system behavior in order to make the architecture more robust. In the absence of explicit notification of changes in the refresh interval, the receiver needs to estimate the current refresh interval and adjust the timeout accordingly. In the next section we discuss mechanisms for estimating the refresh interval at the receiver.

6 Estimating the Refresh Periods

In the previous section we discussed why receivers should not rely on explicit notification from the sender to set timeouts. Instead, the receiver has to estimate the refresh period from the time between two consecutive refresh messages that it receives for the same state. The challenge is for the receiver to adapt to changes in this refresh period over time. Changes in the refresh interval as observed at the receiver may be due to:

- trigger messages: servicing of trigger messages can result in transient changes in the refresh interval, because while they are being serviced the bandwidth available for refresh messages is reduced.

- new state: as new state is created at a node, the sender has more state to serve, making the refresh interval larger. Similarly, if some state is discarded, the refresh messages are generated at a faster rate.

- packet loss: the receiver observes a longer update interval if a refresh message is lost on the link.

Unlike the fixed refresh period approach, where the timeout interval has to be robust only to dropped packets, the estimator used in scalable timers has to be more carefully designed to be both adaptive and efficient. The estimator in our approach must detect and respond quickly to an increase in the refresh period caused by an increase in the amount of trigger state. We need timers to be robust not only in the presence of an occasional dropped update, but also in the presence of rapid non-linear increases in the inter-update intervals. The estimate of the refresh interval is used by receivers to discard old state. However, the estimate must also be conservative, so as not to time out state prematurely. The consequences of timing out state prematurely are generally more serious than the consequences of moderate delays in timing out state. In this section we discuss two approaches (and associated problems) that can be used at the receiver to age old state.

6.1 Counting of the Rounds

A round refers to one cycle of refreshes at the sender for all the states that it has to refresh. In this approach, instead of trying to estimate the refresh period and then using a multiple of this period to throw away the state, the state is thrown away if the receiver does not receive a refresh for a particular state for some a rounds (where a is set to provide robustness to lost packets, as with the fixed timers). Thus by counting the rounds the receiver can track the sender closely. Assuming that the sender is servicing the states in a round-robin fashion, all of the other state will have been serviced once between two consecutive updates for a particular state.

For each state the receiver maintains the last time that a refresh was received for that state. The receiver marks the beginning of a round with round markers. If two refresh messages have been received for a particular state since the beginning of the current round, the last round marker is moved to the current time. The receiver can count the rounds easily using this algorithm. The algorithm is further explained in the pseudocode below. For each state entry the receiver stores the time that it received the last refresh message for that entry. For state i, let this time be denoted by previous_refresh[i]. The receiver maintains round markers per link. Every time the last round marker is moved (i.e., a new round starts), checks are made to see if any state has to be removed.

On receiving a refresh message for state i:
    if (previous_refresh[i] >= last_round_marker) {
        last_round_marker = current_time;
        shift_round_markers();
        age_state();
    }
    previous_refresh[i] = current_time;

The age_state() routine checks if there is any state that has to be discarded. With the counting-of-the-rounds approach, the receiver is essentially assuming that the sender is using round robin for sending refresh messages. However, it is not necessary for the sender to send the refreshes in exact round-robin order; a good approximation to round robin will suffice. That is, as long as the receiver waits at least two round times between refreshes before deleting a particular piece of state, it is not necessary to preserve the exact order of the refreshes within every round. It would not be a problem for the sender to refresh some state slightly more frequently than the others, or some other state slightly less frequently.
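The basic algorithm can be made concrete in a short runnable sketch. This is an illustrative single-link implementation, not the paper's code: the class name, the list-based representation of round markers, and the inlined age_state() are all assumptions.

```python
# Runnable sketch of the basic counting-of-the-rounds receiver for one link.
# State is discarded once it has not been refreshed for `a` round markers.
# Names and data structures are illustrative.

class RoundCounter:
    def __init__(self, a=3):
        self.a = a                     # robustness factor, as with fixed timers
        self.round_markers = [0.0]     # most recent marker last; keeps `a` markers
        self.previous_refresh = {}     # state id -> time of last refresh

    def on_refresh(self, state_id, now):
        last_marker = self.round_markers[-1]
        if self.previous_refresh.get(state_id, -1.0) >= last_marker:
            # Second refresh for this state since the round began: a new
            # round starts, so move the round markers and age old state.
            self.round_markers.append(now)
            if len(self.round_markers) > self.a:
                self.round_markers.pop(0)
            self.age_state(now)
        self.previous_refresh[state_id] = now

    def age_state(self, now):
        # Discard any state not refreshed within the last `a` rounds.
        oldest_marker = self.round_markers[0]
        stale = [s for s, t in self.previous_refresh.items() if t < oldest_marker]
        for s in stale:
            del self.previous_refresh[s]
```

For example, with a = 2 and two states refreshed alternately, if one state stops being refreshed, it is deleted as soon as two round markers have passed without a refresh for it, while the other state survives.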

6.1.1 Refinements to the basic algorithm

We identified several scenarios in which a simple implementation of the counting of the rounds can have difficulties. In the first set, the counting of the rounds can be confused if various state entries have different refresh rates. In the second set, there is an end case in which some state never expires.

If the receiver gets two refreshes for a particular piece of state, the last round marker is moved ahead. So if the refresh rate is not the same for all states, the above algorithm fails. One simple example of this phenomenon is a faulty interface card that generates a duplicate of each packet that it puts on the link. In this case, the last round marker will be updated every time the sender generates a control message. Since the sender cannot refresh faster than a maximum rate (even if control bandwidth is available), the receiver can use this information to check whether the observed refresh period is smaller than the minimum configured

refresh period (refresh_interval_min). Accordingly, the algorithm would be modified to:

On receiving a refresh message for state i:
    if ((previous_refresh[i] >= last_round_marker) &&
        ((current_time - previous_refresh[i]) >= refresh_interval_min)) {
        last_round_marker = current_time;   /* move round markers */
        shift_round_markers();
        age_state();
    }
    previous_refresh[i] = current_time;

Another scenario in which counting of the rounds can get confused is when clouds are present in the network. Networks might have clouds that do not maintain state and simply forward the control traffic received from the downstream node to the upstream node. In such a scenario the receiver gets refreshes that are not generated locally at the neighboring router. In this case, if two or more senders are refreshing the same state, the receiver will observe a smaller refresh period for some state and update the round marker faster than the actual round, resulting in miscounting of the rounds. This scenario differs from a multiaccess link scenario, where the receiver can distinguish between refreshes for the same state sent by different senders by looking at the address of the sender.

Another failure mode for a simple implementation of the counting of the rounds can result in a state entry that never expires. If there is only one state entry being maintained and the sender stops sending refreshes for this state, the receiver in a simple implementation will never time out the state. This happens because the round marker does not shift when no updates are received. Similar behavior can occur when there are multiple entries of state and all of them die together, or when a link goes down and the receiver does not see any control messages.

Since a percentage of the control bandwidth is allocated to refresh traffic, there is an upper bound on the time between two consecutive refresh packets seen by the receiver on a link in the absence of a link failure. The receiver can use this upper bound to detect the end case. Let us assume the simple class structure given in Section 5.1. The upper bound on the time interval between two consecutive refresh packets seen on a link is given by

gap_max = pktsize_avg / control_bw_refresh,

where pktsize_avg is the average size of a refresh packet and control_bw_refresh is the control bandwidth allocated to refresh messages at the sender [3]. If the amount of state being refreshed is low, then the round completion time is refresh_interval_min. Otherwise the round completion time is bounded by N * gap_max, where N is the number of state entries that are being refreshed. So the instantaneous upper bound on round completion time is given by

max(N * gap_max, refresh_interval_min).

The above-mentioned worst case occurs when there is infinite trigger traffic to be serviced. So the round marker should also be moved if no refresh packets are seen for an interval of the above computed upper bound. If no trigger state is being serviced, the time interval between two consecutive refresh packets is bounded by

gap_min = pktsize_avg / control_bw_total,

[3] pktsize_avg, control_bw_refresh, refresh_interval_min and control_bw_total are the same constants that are used by the sender to generate refresh messages.

where control_bw_total is the bandwidth allocated to control messages at the sender [3]. So the instantaneous upper bound on round completion time is given by

max(N * gap_min, refresh_interval_min),

where N is the number of state entries that are being refreshed. This bound can be used to detect the end case faster when no trigger traffic is being received. Similar upper bounds can be computed for any other class structure being used by the sender. With the suggested refinements to the basic algorithm, the counting-of-the-rounds approach can successfully track the changes in the refresh period from a single sender. In the next section we discuss the Exponential Weighted Moving Average for estimating the refresh period at the receiver.
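The two bounds above can be computed directly. This sketch assumes the simple two-class structure of Section 5.1; the constant values are illustrative, not prescribed by the paper.

```python
# Sketch of the end-case bounds: the receiver moves the round marker itself
# if no refresh arrives within the computed upper bound on round completion
# time. Constant values are assumed for illustration.

PKT_SIZE_AVG = 64            # bytes, average refresh packet size (assumed)
CONTROL_BW_REFRESH = 100.0   # bytes/s guaranteed to refresh traffic (assumed)
CONTROL_BW_TOTAL = 1000.0    # bytes/s for all control traffic (assumed)
REFRESH_INTERVAL_MIN = 60.0  # seconds

def round_bound(n_entries, trigger_active):
    """Upper bound on round completion time for n_entries state entries.
    With trigger traffic active, refreshes may be squeezed down to their
    guaranteed share (gap_max); otherwise they can use the whole control
    bandwidth (gap_min), and the end case is detected sooner."""
    bw = CONTROL_BW_REFRESH if trigger_active else CONTROL_BW_TOTAL
    gap = PKT_SIZE_AVG / bw
    return max(n_entries * gap, REFRESH_INTERVAL_MIN)

print(round_bound(1000, trigger_active=True))    # 1000 * 0.64  = 640.0 s
print(round_bound(1000, trigger_active=False))   # 1000 * 0.064 = 64.0 s
```

If the receiver sees no refresh packet for longer than the applicable bound, it moves the round marker itself, so even a link that goes completely silent eventually ages out all of its state.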

6.2 Exponential Weighted Moving Average

The Exponential Weighted Moving Average (EWMA) estimator is a low pass lter belonging to a class of estimators called recursive prediction error algorithms. EWMA has been used widely for estimation before, e.g. TCP round trip time estimation [13], etc. We investigated use of EWMA estimators as an alternate to the counting of the rounds approach. In this scheme a network node runs one EWMA estimator for each coincident link. Each EWMA estimator tracks the refresh interval average (intervalavg ) and mean deviation (intervalmdev ) for the associated link. On receiving a new measurement for refresh interval I , the receiver updates the average of the refresh interval as:

interval_avg = (1 − w) × interval_avg + w × I,

where w is the smoothing constant of the estimator. As shown in [14], the EWMA can be computed faster if it is rearranged as:

interval_avg = interval_avg + w × err, where err = I − interval_avg.

Mean deviation is computed similarly as:

interval_mdev = interval_mdev + w × (|err| − interval_mdev).

Once the average and mean deviation for the refresh interval are updated, the timeout value for the state is set to

a × (interval_avg + b × interval_mdev),

where a and b are both small integers. Here, a is the degree of robustness used for setting the timeout period in the fixed timers approach, and b is the weight given to the refresh interval deviation in computing the timeout period. The mean deviation is used in computing the timeout period so that the estimator is able to track sharp increases in the refresh interval. This prevents the state from being deleted prematurely due to sudden increases in the refresh interval. When a state timeout occurs, a new timeout value is computed based on the current values of interval_avg and interval_mdev. If the new timeout value is not greater than the previous value, the state is discarded; otherwise the new state timeout value is set.

In the event of long bursts of trigger traffic, the estimator does not receive any refreshes and the estimate is not updated. In such a scenario, the estimator lags behind the actual refresh interval and premature timeout of state might occur. To avoid this problem, the receiver does not time out state during a burst of trigger traffic. The receiver has two modes of operation, as shown in the state diagram for the receiver in Figure 2. While the receiver is not receiving any trigger messages it stays in the Normal state. It enters the No_Timeout state when it receives a trigger message. No state is timed out while the receiver is in the No_Timeout state. The state of the receiver is changed back to Normal upon receiving a refresh message.

Figure 2: State diagram for an EWMA receiver. [Diagram omitted. In the Normal state, a refresh message for state S updates the estimate, computes a new timeout for S based on the new estimate, and sets it; a timeout for S computes a new timeout for S based on the current estimate and either sets it or times S out; a trigger message for S sets a timeout for S based on the current estimate and moves the receiver to the No_Timeout state. In the No_Timeout state, a timeout for S puts S in the list of timeout-pending states; a refresh message for S updates the estimate, computes new timeouts for all timeout-pending states and for S, sets the new values or times those states out, and returns the receiver to Normal.]

For the EWMA estimator the value of the smoothing constant needs to be determined. The requirement on the smoothing constant is that it should track the refresh period reasonably closely and at the same time be able to respond quickly to increases in the refresh period due to trigger messages. The estimator should not overreact to occasional long refresh intervals caused by occasional drops of refresh packets and by trigger traffic. If the amount of state in the node is not large, the refresh interval is dominated by refresh_interval_min and the smoothing constant does not play an important role. The smoothing constant used for the EWMA is a reasonable tradeoff between responding quickly to increases in trigger traffic and not responding too strongly to occasional long refresh intervals due to packet drops.

In this and the previous sections we have discussed our approach of scalable timers and the required mechanisms. In the coming sections we present our simulation studies of scalable timers for PIM control traffic.
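The update rules above can be collected into a small estimator. The parameter values (w = 0.05, a = 3, b = 4) are those used in the simulations later in the paper; the class shape and names are ours, for illustration only.

```python
# Sketch of the receiver-side EWMA refresh-interval estimator
# described above (Jacobson-style mean/mean-deviation filter).

class EwmaRefreshEstimator:
    def __init__(self, w=0.05, a=3, b=4, initial_interval=5.0):
        self.w = w                          # smoothing constant
        self.a = a                          # robustness factor
        self.b = b                          # weight on the mean deviation
        self.interval_avg = initial_interval
        self.interval_mdev = 0.0

    def update(self, measured_interval):
        """Fold one measured refresh interval I into the estimate."""
        err = measured_interval - self.interval_avg
        self.interval_avg += self.w * err
        self.interval_mdev += self.w * (abs(err) - self.interval_mdev)

    def timeout(self):
        """State timeout: a * (interval_avg + b * interval_mdev)."""
        return self.a * (self.interval_avg + self.b * self.interval_mdev)
```

With a steady 5-second refresh interval the estimate stays at 5 s and the timeout at a × 5 = 15 s; a sudden longer interval raises both the average and the deviation term, stretching the timeout rather than expiring the state.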


7 Using Scalable Timers for PIM Control Traffic

Protocol Independent Multicast (PIM) [4, 5, 6] is a multicast routing protocol. It uses soft state mechanisms to adapt to underlying network conditions and multicast group dynamics. Explicit hop-by-hop join messages from members are sent towards the data source or Rendezvous Point. In steady state each router generates periodic join messages for each piece of forwarding state it maintains. These messages are sent to upstream neighbors periodically to capture state, topology, and membership changes. A join message is also sent on an event-triggered basis each time a new forwarding entry is established for some new source or group. A multicast forwarding entry is discarded if no join messages to refresh the entry are received within the timeout interval. Currently the refresh periods for PIM update messages are fixed, and the PIM control traffic grows with the amount of multicast forwarding state maintained in the routers.

Members of a group graft onto a shared multicast distribution tree centered at the group's Rendezvous Point. Though members join source-specific trees for those sources whose data traffic warrants it, they are always joined to the shared tree listening for new senders. Thus groups that are active but have no active senders still need to refresh the state on the shared tree. In PIM there is some minimal control traffic for every active group even if there are no active senders to the group. So the bandwidth used by control traffic is not a strict function of the data traffic being carried by the network.

7.1 PIM Control Traffic

Besides the exchange of refresh messages for existing state, PIM has trigger traffic due to membership and network topology changes. PIM trigger traffic can result from:

- a change in group membership,
- a new data source for a group,
- a shift from a shared tree to a source-specific tree and vice versa,
- a topology change such as partitions and network healings, and
- a change in Rendezvous Point reachability.

The refresh rate in PIM determines the adaptability of the protocol to changes in network conditions, because old (stale) state is timed out sooner if the refresh rate is high: the higher the refresh rate, the greater the adaptability to changing conditions. Delay in servicing trigger traffic can increase the join and leave latency of a group; hence it is important to give trigger traffic higher priority than refresh traffic. Based on the data traffic being sent to a group, multicast groups can be of two types. A multicast group is said to be data active at some time if there are senders sending data to the group. It is more important to refresh state related to data active groups than to data inactive groups, as data inactive groups suffer little or no data loss during adaptation to changes.

7.2 Control Bandwidth

Links with less bandwidth cannot support a large number of simultaneous conversations. The link bandwidth limits the number of source-specific state entries for a group; hence scaling with respect to source-specific state is not of major concern. We analyzed the number of groups that can be supported on various links without inflating the refresh intervals by large amounts [15]. For instance, 1% of the bandwidth of an ISDN link can support 150 groups with a refresh period of 60 seconds. Similarly, a T1 link can support 15000 groups with 1% of the link bandwidth. We observed that using 1% of the link bandwidth for PIM control traffic is enough to support an appropriate number of groups, and we used this allocation in our experiments.
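The ISDN figure above can be checked with the arithmetic from the paper's appendix, which assumes an average control packet of 50 bytes and rounds 1% of the 128 kbps ISDN bandwidth down to about 1 kbps:

```python
# Time to refresh every state entry once, given the bandwidth set
# aside for control traffic. Packet size and rounding follow the
# paper's appendix; other values here are illustrative.

def refresh_period(num_states, control_bw_bps, pkt_bytes=50):
    """Seconds needed to send one refresh for each of num_states entries."""
    return num_states * pkt_bytes * 8 / control_bw_bps

# ISDN: ~1 kbps of control bandwidth, 150 groups.
print(refresh_period(150, 1000))   # -> 60.0 seconds
```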

7.3 Class Structure at the Sender

As we discussed in the previous section, different PIM messages have different priorities associated with them. We divided PIM control traffic into two classes, trigger traffic and refresh traffic. A large fraction of the bandwidth is allocated to trigger traffic so that it can be serviced as soon as it is generated. Besides this one-level class structure, a token bucket is used to service bursts of trigger traffic faster. Another possible refinement would be to service control traffic related to data-active and data-inactive groups in different classes. There are two parameters in this class structure: the fraction of the PIM control bandwidth for trigger traffic, and the size of the token bucket. Since trigger traffic needs to be serviced as fast as possible, the fraction of bandwidth for trigger traffic should be close to (but not equal to) 1; allocating all of the control bandwidth to trigger traffic could starve refresh traffic. If the size of the token bucket is large, the sender can service bigger bursts of trigger traffic without increasing the join/prune latency. There is a high level of correlation among multicast groups. For instance, groups created for the audio and video streams of the same session are coupled together. Similarly, there can be multiple groups carrying hierarchical video streams that are coupled with each other and exhibit related dynamics. The bucket size should be large enough to accommodate these correlations.
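The two-class scheduler can be sketched as follows. Trigger messages drain a token bucket and have priority; refreshes are served round-robin otherwise. Class and method names are ours, not from the PIM specification, and refresh pacing is omitted for brevity.

```python
from collections import deque

class ControlSender:
    def __init__(self, bucket_size=5):
        self.bucket_size = bucket_size
        self.tokens = bucket_size       # trigger token bucket starts full
        self.trigger_queue = deque()    # pending trigger messages
        self.refresh_ring = deque()     # state entries refreshed round-robin

    def add_token(self):
        """Called every pktsize_avg / control_bw_trigger seconds."""
        self.tokens = min(self.tokens + 1, self.bucket_size)
        return self.flush()

    def flush(self):
        sent = []
        # Trigger messages are served first, limited by available tokens.
        while self.trigger_queue and self.tokens > 0:
            sent.append(("trigger", self.trigger_queue.popleft()))
            self.tokens -= 1
        # With no trigger backlog, refresh the next state entry round-robin.
        if not self.trigger_queue and self.refresh_ring:
            state = self.refresh_ring.popleft()
            self.refresh_ring.append(state)
            sent.append(("refresh", state))
        return sent
```

A fuller implementation would also pace refreshes against the allocated refresh bandwidth and keep the separate old-state token described in the appendix; the sketch only shows the priority and bucket-limit logic.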

7.4 Aging out the multicast routing state

Multicast routers age out a forwarding entry if they do not receive refresh messages for that entry. Delaying too long in timing out a forwarding entry can result in the following overhead:

- bandwidth and memory spent maintaining soft state downstream for inactive trees, and
- multicasting application traffic downstream after it is no longer needed.

On the other hand, if the state is aged prematurely, multicast routing fails to deliver data to downstream locations until the state is re-established. It is more important to be careful not to time out state prematurely than to be overly precise in timing out state as soon as possible. As described in Section 5, two approaches can be adopted at the receiver node for discarding soft state. The next section discusses the simulation studies conducted with scalable timers for PIM.

8 Simulation Studies of Scalable Timers for PIM

An existing PIM simulator, pimsim [16], was modified to include the scalable timers approach for PIM control traffic. A sender module was included to schedule the PIM control messages based on the class structure described above. We implemented both the counting of the rounds approach and the exponential weighted moving average estimator for a PIM receiver. This section presents the details of the simulator and the studies conducted.


Figure 3: Simple chain topology. [Diagram omitted: hosts H-1 through H-n attach to the designated router DR, which connects through an intermediate router IN to the Rendezvous Point RP.]

8.1 Illustrating the Simulator

A wide range of scenarios was simulated to study scalable timers in the context of PIM. The topology of the network and the lifetimes of the members of various groups were provided to the simulator in configuration files. The simulation runs were validated by feeding the packet traces generated from the simulations to nam, an animation tool for networking protocols being developed at LBNL. We conducted simulations on various scenarios having topologies with different link bandwidths and membership changes. In this section, we present a few simple yet representative scenarios for understanding the behavior of scalable timers. Since PIM messages are router-to-router, we use the single topology in Figure 3 to investigate the behavior on a single link. Figure 3 shows a chain topology with three PIM routers. The designated router DR sends join messages towards the Rendezvous Point RP for the groups that are active in the attached hosts. These join messages are received by the intermediate router IN. The studies focus on the behavior of the sender at the DR. The receiver's behavior was studied at IN.

8.2 PIM Sender

Various scenarios that demonstrate the sender behavior are presented here. If the link does not drop any packets, then the actions of the sender can be captured by the refresh messages as seen by the receiver. For each refresh message received, we plot the time elapsed since the previous refresh was received for the same state.

Figure 4: Change in the refresh interval for step changes in the number of groups. [Plot omitted: refresh interval (seconds) versus time (seconds) over 0-1800 s; each refresh message is marked by a point.]

In the simulations shown here the parameters were set so that the allocated control bandwidth is sufficient to refresh two state entries without exceeding refresh_interval_min. We used a refresh_interval_min of 5 seconds. The size of the trigger token bucket used was 5. The bandwidth of the link between the DR and the node IN was 1875 bytes/sec. Figure 4 illustrates the behavior of a sender during step changes in the amount of state (corresponding to step changes in the number of multicast groups). For the simulation shown in Figure 4, a new group becomes active every 100 seconds until 600 seconds. After 1000 seconds, one group expires every 100 seconds. When the amount of state is increased from one to two entries, the refresh intervals do not change, as enough bandwidth is available to refresh all the state within the minimum refresh period. As more state is created and needs to be refreshed, the refresh interval increases with each step. This is because the control bandwidth is not enough to send the refreshes at refresh_interval_min, and the sender adapts to the increase in the amount of state by reducing the refresh frequency (i.e., increasing the refresh interval). As the amount of state decreases after 1000 seconds, we notice that refreshes are sent more often, consuming the available control bandwidth.

Figure 5: Change in the refresh interval for trigger traffic bursts. [Plot omitted: refresh interval (seconds) versus time (seconds) over 0-250 s. The figure shows only refresh messages; trigger messages are not shown.]

In steady state all the control bandwidth is used for sending refresh messages. When a new group becomes active, trigger state tokens are used for servicing the associated trigger traffic. Spikes in the refresh intervals are observed when there is trigger traffic to be served. The height of a spike is a function of the amount of trigger traffic to be serviced and the size of the bucket. Figure 5 shows the response of the sender when the burst of trigger traffic is bigger than the bucket size. In this scenario, at 100 seconds 7 new groups become active. The sender sends the trigger traffic for all but two groups back to back. The trigger traffic for the remaining two groups is served as more trigger tokens are generated. No refresh packet is seen during this time. One of the parameters that needs to be configured at the sender is the size of the trigger traffic token bucket. The size of a burst of trigger traffic is limited by the associated link speed. There is a tradeoff between the size of the token bucket and the latency in servicing the trigger traffic. The larger the bucket size, the smaller the associated join/prune latency. Trigger traffic suffers greater latency than in the fixed timers approach only if the trigger traffic burst is larger than the tokens available in the token bucket. The worst case delay suffered by a trigger packet is given by burst_size / control_bw_trigger. However, in most cases the token bucket is full, and the worst case delay suffered by a trigger packet in a burst larger than the bucket size token_bucket_size is given by

(burst_size − token_bucket_size) / control_bw_trigger.
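The full-bucket delay bound above can be illustrated numerically. Here burst and bucket sizes are counted in packets and converted through an average packet size; the specific values are illustrative, not taken from the simulations.

```python
# Worst-case delay for a trigger packet in a burst when the token
# bucket starts full: (burst_size - token_bucket_size) / control_bw_trigger.

def worst_case_delay(burst_pkts, bucket_pkts, control_bw_trigger, pkt_bytes=50):
    """Delay in seconds; bandwidth in bytes/sec, sizes in packets."""
    excess = max(0, burst_pkts - bucket_pkts)  # packets not covered by tokens
    return excess * pkt_bytes / control_bw_trigger

# A burst of 7 triggers against a 5-token bucket at 18.75 bytes/sec:
# only the 2 excess packets wait for new tokens.
delay = worst_case_delay(7, 5, 18.75)
```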

8.3 PIM Receiver

We studied the counting of the rounds and the exponential weighted moving average approaches for aging state at the receiver. In this section we present the behavior of the two approaches as observed in the simulations shown in the previous section. The state timeout timer was set to three times the current estimate of the refresh period.

8.3.1 Counting of the Rounds

Figure 6: Tracking behavior of the counting of the rounds approach to step changes in the number of groups. [Plot omitted: round markers and refresh intervals (seconds) versus time (seconds) over 0-1800 s.]

We studied the receiver's tracking behavior during step and burst changes in the amount of state. Figure 6 plots the round times as observed by the receiver over time, for the step change scenario shown in Figure 4. The receiver is able to track the round times very closely. The receiver is also able to detect the end case as described in the previous section and times out the state properly. State was timed out when no refresh for that state was received for the previous three rounds. In this scenario no premature timing out of state was observed with the counting of the rounds approach.

The counting of the rounds approach advances the round markers only when refreshes are received. Hence, the receiver is able to track the round times during bursts of trigger traffic. Figure 7 depicts the tracking of round times during the burst of trigger traffic shown in Figure 5.

Figure 7: Tracking behavior of counting of the rounds to trigger traffic bursts. [Plot omitted: round markers and refresh intervals (seconds) versus time (seconds) over 0-250 s.]

Figure 8 shows the behavior of the counting of the rounds approach under occasional dropping of refresh packets (approximately 2% loss rate). If the dropped refresh packet is for the state that is currently used for advancing the round marker, then the receiver moves the round ahead when it receives the next refresh (for any state). Otherwise dropped packets do not affect the tracking behavior.

Figure 8: Behavior of counting of the rounds in the presence of occasional packet drops. [Plot omitted: round markers and refresh intervals (seconds) versus time (seconds) over 0-2000 s.]

The sender does not exhibit strict round robin servicing of the state when new state is added. That is, the time between the trigger message and the first refresh for a state is not necessarily a full round. The receiver filters these small deviations from the round robin behavior by not advancing the round marker when it receives the first refresh message for a state entry. Figure 9 shows the behavior of the receiver when duplicates of refresh messages are received. The receiver filters out the duplicate refresh messages and does not misbehave in this case. However, if the sender generates refresh messages in an arbitrary order, refreshes for some state are received at the receiver more frequently. The counting of the rounds approach is not able to track round times in such a scenario.
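The core of the counting-of-the-rounds receiver can be sketched as below, assuming a round-robin sender. The three-round timeout and the filtering of a state's first refresh follow the text; the class shape is ours, and duplicate filtering and the idle-gap bound discussed earlier are omitted for brevity.

```python
class RoundCounter:
    TIMEOUT_ROUNDS = 3

    def __init__(self):
        self.marker = None      # state entry whose refresh closes a round
        self.round_no = 0
        self.last_round = {}    # state -> round number of its last refresh

    def on_refresh(self, state, is_first_refresh=False):
        # The first refresh after a trigger may arrive less than a full
        # round later, so it is filtered and does not move the marker.
        if is_first_refresh:
            self.last_round[state] = self.round_no
            return
        if self.marker is None:
            self.marker = state
        elif state == self.marker:
            self.round_no += 1          # the marker came around: round over
        self.last_round[state] = self.round_no

    def expired(self):
        """States not refreshed during the previous TIMEOUT_ROUNDS rounds."""
        return [s for s, r in self.last_round.items()
                if self.round_no - r >= self.TIMEOUT_ROUNDS]
```

After a few full rounds of refreshes for states a, b, and c, dropping c from the sender's rotation leaves c unrefreshed for three rounds, at which point `expired()` reports it.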

8.3.2 Exponential Weighted Moving Average Estimator

In this section we show the behavior of the EWMA estimator in the presence of step and burst changes in state. We used 0.05 as the value of the smoothing constant (w) and 4 as the weight for the interval deviation (b). For each refresh message received we plot the new refresh interval estimate (= a × (interval_avg + b × interval_mdev)). Figure 10 shows the refresh interval estimates for the step change scenario shown in Figure 4. The EWMA estimator is able to follow the changes in the refresh intervals at the sender. The rate of convergence for EWMA depends on the amount of state being refreshed: the EWMA estimator has more observations of the refresh interval in one round if the amount of state is large, and therefore converges faster to the refresh interval. The response of the EWMA estimator to the trigger traffic burst in the scenario shown in Figure 5 is shown in Figure 11. As shown in the figure, the EWMA estimator can underestimate the refresh intervals when there are big bursts of trigger traffic (because the refresh interval suddenly increases). However, no state is timed out in this period, as the receiver is in the No_Timeout state.

Figure 9: Tracking behavior of the counting of the rounds approach to duplicate refreshes. [Plot omitted: round markers and refresh intervals (seconds) versus time (seconds) over 0-350 s.]

Figure 10: Tracking behavior of the EWMA estimator to step changes in the number of groups. [Plot omitted: EWMA estimate and refresh interval (seconds) versus time (seconds) over 0-1800 s.]

Figure 12 shows the behavior of the EWMA estimator under occasional dropping of refresh packets. If the link drops refresh packets for a state, the receiver observes longer refresh intervals for that state. The EWMA estimator reacts conservatively to such increases in refresh intervals by increasing its estimate, and does not prematurely time out state.

Figure 11: Tracking behavior of EWMA to trigger traffic bursts. [Plot omitted: EWMA estimate and refresh interval (seconds) versus time (seconds) over 0-250 s.]

Figure 12: Behavior of EWMA in the presence of occasional packet drops. [Plot omitted: EWMA estimate and refresh interval (seconds) versus time (seconds) over 0-1800 s.]

8.3.3 Discussion

In the simulation studies we observed that the counting of the rounds receiver is able to respond to events at the sender and track the round times more precisely than a receiver running an EWMA estimator. However, the counting of the rounds receiver fails to follow a non-round-robin sender successfully. If the sender generates refreshes in an arbitrary order the EWMA estimator can also underestimate the refresh interval; however, in such scenarios the EWMA estimator is more conservative than the counting of the rounds receiver.

Both approaches work in that they allow the receiver to adapt to changes at the sender; consequently, there is no clear winner. Since the mechanism for aging state is local to a node, different nodes can use different approaches. There is no need to standardize the approach across the network as a whole.

9 Scalable Timers as Compared to the Fixed Timers Approach

We can evaluate the fixed timers and our scalable timers approaches along three dimensions: efficiency, responsiveness, and complexity.

- Control Bandwidth: In the traditional fixed timers approach, the bandwidth used by control traffic grows with the amount of state. Scalable timers adapt to the growth in the amount of state to be refreshed and limit the control traffic bandwidth. Figure 1 shows the change in control bandwidth and refresh intervals for the two approaches investigated.

- Responsiveness: The refresh interval determines the response time of the protocol to changes in network conditions; the smaller the refresh interval, the faster the protocol adapts. Responsiveness in soft state protocols is of two types, namely initiation of new state and timing out of old state. Trigger messages are sent when new state is initialized. Since trigger messages are served immediately, the responsiveness of a protocol using scalable timers for control traffic is comparable to the fixed timers approach. The refresh interval in scalable timers increases with the amount of state, so it might take longer to time out old state than in the fixed timers approach. However, in most cases the lifetime of a state unit is considerably longer than the period for which the state is kept after the last refresh, so the overhead due to late timing out of old state (longer response time) should be negligible compared to the savings in control traffic bandwidth over the lifetime of the state.

- Complexity and Overhead: The additional mechanisms required for scalable timers are simple and introduce very small memory and computation overhead. Moreover, the scalable timers module is generic and is not closely coupled with the protocol. The scalable timers approach does not require any changes to be made to the protocol.

10 Conclusion and Future Directions

In this paper we presented a scalable approach for regulating control traffic in soft state protocols. With scalable timers the control traffic consumes a fixed amount of bandwidth by dynamically adjusting the refresh interval. Scalable timers require mechanisms at both the sender and the receiver of the control messages. Through simulations we have illustrated the effectiveness of scalable timers in terms of overhead reduction with minimal impact on protocol responsiveness; moreover, there is little increase in complexity. We plan to further study the effect of changes in the parameters, such as the class structure and the smoothing constant, on the performance of the proposed mechanisms. We have studied scalable timers in the context of a multicast routing protocol (i.e., PIM), which is an example of a router-to-router protocol. Multiparty, end-to-end protocols were actually the first protocols to incorporate the notion of scalable timer techniques for the reduction of periodic traffic (e.g., wb, RTP [3]). Future work will investigate the applicability of the detailed mechanisms discussed in this paper to such end-to-end protocols and to soft-state protocols in general.

References

[1] David D. Clark. The design philosophy of the DARPA internet protocols. In SIGCOMM Symposium on Communications Architectures and Protocols, pages 106-114. ACM, August 1988.

[2] Lixia Zhang, Bob Braden, Deborah Estrin, Shai Herzog, and Sugih Jamin. Resource reservation protocol (RSVP) -- version 1 functional specification. Internet Draft, Xerox PARC, October 1993. Work in progress.

[3] Henning Schulzrinne, Stephen Casner, Ron Frederick, and Van Jacobson. RTP: A transport protocol for real-time applications. Internet Draft (work in progress) draft-ietf-avt-rtp-*.txt, IETF, November 1995.

[4] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, L. Wei, P. Sharma, and A. Helmy. Protocol Independent Multicast (PIM): Motivation and architecture. Internet Draft, March 1996.

[5] S. Deering, D. Estrin, D. Farinacci, V. Jacobson, C. Liu, L. Wei, P. Sharma, and A. Helmy. Protocol Independent Multicast (PIM), sparse mode protocol: Specification. Internet Draft, March 1996.

[6] S. Deering, D. Estrin, D. Farinacci, and V. Jacobson. Protocol Independent Multicast (PIM), dense mode protocol: Specification. Internet Draft, March 1994.

[7] Claudio Topolcic. ST II. In NOSSDAV, ICSI Technical Report TR-90-062, Berkeley, California, November 1990.

[8] William C. Fenner. Internet Group Management Protocol, version 2. Internet Draft, May 1996.

[9] Tony Ballardie, Paul Francis, and Jon Crowcroft. Core Based Trees (CBT). In Deepinder P. Sidhu, editor, SIGCOMM Symposium on Communications Architectures and Protocols, pages 85-95, San Francisco, California, September 1993. ACM. Also in Computer Communication Review 23(4), October 1992.

[10] C. Topolcic. Experimental Internet Stream Protocol: Version 2 (ST-II), October 1990. RFC 1190.

[11] Danny J. Mitzel, Deborah Estrin, Scott Shenker, and Lixia Zhang. An architectural comparison of ST-II and RSVP. In Proceedings of the Conference on Computer Communications (IEEE Infocom), Toronto, Canada, June 1994.

[12] Sally Floyd and Van Jacobson. Link-sharing and resource management models for packet networks. IEEE/ACM Transactions on Networking, 3(4):365-386, August 1995.

[13] J. Postel. Transmission Control Protocol. Request for Comments (Standard) STD 7, RFC 793, Internet Engineering Task Force, September 1981.

[14] Van Jacobson. Congestion avoidance and control. ACM Computer Communication Review, 18(4):314-329, August 1988. Proceedings of the SIGCOMM '88 Symposium, Stanford, CA, August 1988.

[15] P. Sharma, D. Estrin, S. Floyd, and V. Jacobson. Appendix: Scalable timers for PIM. ftp://catarina.usc.edu:/pub/puneetsh/paper/st_appendix.ps, University of Southern California, June 1996.

[16] Liming Wei. The design of the USC PIM simulator (pimsim). Technical Report TR95-604, University of Southern California, June 1995.


This appendix presents the details of applying the scalable timers approach to regulating PIM control traffic.

A Summary of PIM Control Messages

Message Type   Refresh Period  Timeout Interval  Packet Size             Dependent on / Comments
               (sec.)          (sec.)            (bytes)
Router Query   30              90                32                      Number of neighboring PIM routers
Join/Prune     60              180               40 + 6 for each source  (S,G) state with the same incoming interface
Register       -               -                 -                       Unicast to the RP; piggybacked on data packets
Assert         -               -                 -                       Not sent periodically

B Calculations for Refresh Periods

Average PIM control packet size = 50 bytes.
Let 1% of the link bandwidth be allocated for PIM control traffic.

For an ISDN link:
    link bandwidth        = 128 kbps
    PIM control bandwidth = 128/100 ~= 1 kbps
    number of refresh packets that can be sent per second = 2.5
    => we can send information for about 2.5 states per second
    => refresh period = N / 2.5 sec, where N is the number of states present
    if N = 150, refresh period = 60 sec

Note: Currently we send periodic join/prunes every 60 seconds.

The number of source-specific state entries for a group is limited by the link bandwidth, as links with less bandwidth cannot support a large number of conversations. Hence scaling with respect to source-specific state is not of major concern. The same analysis can be restated in terms of the number of groups that can be supported downstream of a link if 1% of its bandwidth is allocated for PIM control traffic. The following table shows the number of groups that can be supported on various link types.

Link Type    Link bw (Kbps)    Number of groups    Refresh Period (min)
ISDN         128               150                 1
T1           12800             15000               1
T3           51200             60000               1

From the above analysis, we can observe that 1% of the link bandwidth is enough to support an appropriate number of groups without inflating the refresh periods by a large amount.

C Pseudocode for a PIM Sender

For each link attached to the router:

Constants:
    B              : control bandwidth for PIM control traffic
    B_ts           : bandwidth for trigger state
    B_os           : bandwidth for old (existing) state
    S_s            : state in the sender
    P_av           : average packet size
    t_ts           : time for generating a new trigger state token
    t_os           : time for generating a new old state token
    TS_Bucket_size : size of the token bucket for trigger state

    B = B_ts + B_os, with the trigger fraction B_ts/B very close to 1
    so that new state can be flushed fast.

    t_ts = P_av / B_ts
    t_os = P_av / B_os

We keep one counter for trigger state tokens and one for old state tokens.
The count of trigger state tokens never exceeds (TS_Bucket_size + 1); the
count of old state tokens is at most 1. When sending old state refreshes we
round-robin over the entries. trigger_message_queue is the queue for the
trigger messages.

Every t_ts seconds:
    1. Generate (increment) trigger state tokens.
    2. flush_state()

Every t_os seconds:
    1. Generate (increment) old state tokens.
    2. flush_state()

When new state is created:
    1. Enqueue the related message in the trigger message queue.
    2. flush_state()

flush_state()
{
    while ((trigger_message_queue is not empty) and
           (trigger state tokens are available))
    {
        dequeue from trigger_message_queue and send the message;
        decrement trigger state tokens;
    }

    while (trigger state tokens > TS_Bucket_size)
    {
        if (trigger_message_queue is not empty)
            dequeue from trigger_message_queue and send the message;
        else if (not too early to send an old state refresh)
            send an old state refresh;
        decrement trigger state tokens;
    }

    while (old state tokens > 0)
    {
        if (not too early to send an old state refresh)
            send an old state refresh;
        decrement old state tokens;
    }
}
