A THEORY OF CONVERGENCE ORDER OF MAXMIN RATE ALLOCATION AND AN OPTIMAL PROTOCOL

Jordi Ros and Wei K. Tsai
Department of Electrical and Computer Engineering
University of California, Irvine, CA 92697
{jros, wtsai}@ece.uci.edu

Abstract - The problem of allocating maxmin rates with minimum rate constraints for connection-oriented networks is considered. This paper proves that the convergence of maxmin rate allocation satisfies a partial ordering in the bottleneck links. This partial ordering leads to a tighter lower bound for the convergence time for any maxmin protocol. An optimally fast maxmin rate allocation protocol called the Distributed CPG protocol is designed based on this ordering theory. The new protocol employs bi-directional minimization and does not induce transient oscillations. The Distributed CPG protocol is compared against ERICA, showing far superior performance.

1 Introduction

This paper considers the problem of rate-based flow control for connection-oriented networks. Among rate-based flow control protocols, maxmin fairness has been a popular optimization objective; therefore, this paper focuses on the convergence speed of maxmin protocols with minimum rate constraints. Existing works on maxmin protocols all state that the lower bound for any maxmin protocol is of the order of 2DS, where D is the round-trip delay and S is the number of iterations needed for convergence. The fastest convergence time is provided by our research group [Tsa00], where the concept of the constraint precedence graph (CPG) was introduced; there, S = L, where L is the number of levels in the CPG. This paper builds on that work and proves that the convergence of maxmin rate allocation satisfies a partial ordering in the bottleneck links. This partial ordering leads to a tighter lower bound for the convergence time, (L-1)T, where T is the time required for a link to converge once all its predecessor links have converged. This new lower bound can be significantly smaller than the previous best lower bound. The faster convergence time is made possible by employing bi-directional minimization and maintaining per-flow information at the switches.

An optimally fast maxmin rate allocation protocol called the Distributed CPG (D-CPG) protocol is designed based on this new ordering theory. The D-CPG protocol does not induce transient oscillations and achieves the optimal convergence time. The D-CPG protocol is compared against ERICA [Jai96], showing far superior performance. Due to space limitations, a number of the proofs have been omitted; the reader is referred to three technical reports available at our web site.


2 Background

A flow control protocol is a mechanism that allows the sources to adapt their rates according to feedback received from the network. Depending on the nature of this feedback, flow control protocols can be classified into two groups: explicit and implicit protocols. In implicit feedback schemes, the source infers a change in its service rate by measuring its current performance. In explicit feedback schemes, the information is explicitly conveyed to the source. Among the explicit flow control approaches, the so-called maxmin allocation has been widely adopted as a solution to the rate assignment problem. Several authors have approached the flow control problem from the maxmin perspective. [Cha95] and [Tsa96] first addressed the classic maxmin problem. Later, [Hou98], [Lon99], [Abr97] and [Kal97] studied the maxmin problem with additional nonlinear constraints, the so-called maxmin with minimal rate guarantee. The common denominator of all these approaches is that they are state-maintained algorithms, meaning that per-flow information is maintained at each switch. Other authors have approached the problem using stateless algorithms. Stateless schemes such as ERICA [Jai96] and EPRCA [Rob94] are attractive in the sense that they minimize the computational cost in the switch, giving a higher degree of scalability. The cons of this approach are a higher convergence time and a failure to guarantee fairness in some scenarios, degrading the level of QoS in the network.

2.1 Stateless or state maintained

Because of the exponential growth of the Internet, one of the most important properties that a protocol has to consider is scalability. The first explicit rate flow control algorithms were applied to ATM networks. Originally, the ATM model was based on the end-to-end virtual circuit (VC) model, meaning that a user trying to access a remote resource has to set up a VC before transmission. As a result of this model, the number of VCs in an ATM switch can potentially be very high, so high that any additional complexity on a per-VC basis can be unaffordable. In other words, under this model any state-maintained algorithm can be very expensive and stateless algorithms are probably the only affordable solution. The network model has shifted since. Scalability issues have reshaped the way networks are built. One example that clarifies this statement is the Multiprotocol Label Switching (MPLS) model proposed by the IETF [MPLS]. An MPLS network can be seen as a scalable version of an ATM network. To provide scalability, an MPLS network implements a higher level of flow granularity. A flow (equivalent to the concept of a VC in ATM notation) now aggregates many sub-flows with similar properties, such as the routing path. By making use of this larger granularity, the number of flows to be handled in a router is dramatically decreased, so much so that per-flow computation is now in most cases affordable. The protocol presented in this paper assumes a connection-oriented network with scalability properties such as those of an MPLS network. Under this assumption we will prove that the improvement achieved by using a state-maintained rather than a stateless protocol can be quite significant.

3 Optimal Rate Allocation Theory for Best-Effort Traffic with Minimal QoS Guarantee

In this section we present a brief theory of optimal rate allocation for best-effort traffic with QoS guarantees. For the complete theory, we refer the reader to [RT00b].

3.1 Maxmin rate allocation

The problem we address here is that of allocating the transmission rates of a set of flows in a network so that the throughput of each of them is maximized while ensuring fairness among them. The QoS parameter that our model includes is a minimal rate guarantee for each flow, called here the Minimal Flow Rate (MFR). The criterion we use for the optimal rate allocation is maxmin fairness. In the following definition, a feasible rate vector is a set of flow rates that satisfies the respective minimal flow rate requirements while not exceeding the link capacities in the network.

Definition 1. Maxmin fairness. A feasible rate vector r is said to be maxmin fair iff for every feasible rate vector r' such that r'_i > r_i for some flow i, there exists a flow k such that r_k ≤ r_i and r'_k < r_k.

The maxmin fairness optimality condition in Definition 1 is not automatable, as it requires an excessive (exponential) amount of checking, making it unsuitable for computer implementation. As proved in [RT00b], though, there exist two equivalent automatable conditions. In this section we introduce one of them. Before that, we need to define the concept of advertised rate.

Definition 2. Advertised rate. Let V_j denote the set of flows crossing link j and let Vm_j denote the subset of flows within V_j that satisfy r_i = MFR_i. The quantity R_j defined below is called the advertised rate for link j:

  R_j ≡ max{r_i : i ∈ V_j, r_i > MFR_i}   if Vm_j ≠ V_j and F_j = C_j
  R_j ≡ ∞                                 if F_j < C_j
  R_j ≡ 0                                 if Vm_j = V_j and F_j = C_j

where C_j is the capacity of link j and F_j ≡ Σ_{i ∈ V_j} r_i.

Theorem 1. Projection optimality condition. A rate vector r is a maxmin rate allocation with MFR guarantee iff for every flow i its rate satisfies the condition

  r_i = max{min{R_j : j ∈ P_i}, MFR_i},

where P_i is the set of links in the path of flow i. For an intuitive understanding of the advertised rate concept and a proof of Theorem 1, refer to [RT00b]. Theorem 1 provides an easy way to check whether a given rate allocation r is maxmin or not.
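To make Theorem 1 concrete, the following is a minimal Python sketch of the projection check; the dictionary-based data layout (rates, MFRs, paths and per-link flow sets keyed by flow and link identifiers) is an illustrative assumption of this example, not part of the paper.

```python
import math

def advertised_rate(link, capacity, flows_on_link, r, mfr):
    """Advertised rate R_j of Definition 2 (assumes r is a feasible vector)."""
    F = sum(r[i] for i in flows_on_link)   # F_j: total rate carried by the link
    if F < capacity[link]:
        return math.inf                    # link not saturated
    if all(r[i] == mfr[i] for i in flows_on_link):
        return 0.0                         # every flow is stuck at its MFR
    return max(r[i] for i in flows_on_link if r[i] > mfr[i])

def is_maxmin(r, mfr, path, capacity, flows_on_link):
    """Theorem 1: r_i must equal max(min of R_j over the path, MFR_i)."""
    R = {j: advertised_rate(j, capacity, flows_on_link[j], r, mfr)
         for j in capacity}
    return all(abs(r[i] - max(min(R[j] for j in path[i]), mfr[i])) < 1e-9
               for i in r)
```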

Let us now define a centralized algorithm to solve the presented maxmin problem.

3.2 Centralized protocol: the CPG algorithm

Figures 1 and 2 present a centralized solution to the maxmin problem with minimal flow rate guarantee. The first algorithm solves the particular case of a single link in the network. The second algorithm, the CPG algorithm, solves the multi-link case by calling the former. As we will see in the next section, the algorithm can be used to build a constraint precedence graph (CPG) to compute a lower bound on the convergence time for the maxmin problem.

Input: N - network with a single link j
Algorithm: SingleLink ( )
0. Set all flows as unconstrained;
1. For any unconstrained flow i set r_i = (C_j − Fc_j) / (N_j − Nc_j);
2. Calculate the advertised rate R_j. If r_i = max{R_j, MFR_i} for each flow i then stop; otherwise, set r_i = max{R_j, MFR_i} for any flow i such that r_i ≠ max{R_j, MFR_i} and go to 1;

Figure 1. Single-link Projection Algorithm

These algorithms introduce some new notation. In Figure 1, a flow is constrained if its rate is equal to its MFR. Also, Fc_j is defined as the overall rate assigned to the constrained flows, Nc_j is the total number of constrained flows, and N_j is the number of flows crossing link j.
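A hedged Python sketch of the single-link projection follows. It uses the standard fixed-point formulation that pins flows whose fair share falls below their MFR, which yields the same allocation Figure 1 computes for a single link; the function name and the list-based interface are choices made for this example, not the paper's code.

```python
def single_link(capacity, mfr):
    """Single-link projection in the spirit of Figure 1.

    capacity: link capacity C_j.
    mfr:      list of minimal flow rates, assumed feasible (sum(mfr) <= capacity).
    Returns one rate per flow.
    """
    n = len(mfr)
    constrained = [False] * n                # flows pinned at their MFR
    while True:
        fc = sum(mfr[i] for i in range(n) if constrained[i])
        nc = sum(constrained)
        share = (capacity - fc) / (n - nc) if n > nc else 0.0
        rates = [mfr[i] if constrained[i] else share for i in range(n)]
        # A flow whose fair share falls below its MFR becomes constrained.
        newly = [i for i in range(n)
                 if not constrained[i] and rates[i] < mfr[i]]
        if not newly:
            return rates
        for i in newly:
            constrained[i] = True
```

For instance, single_link(60, [40, 0, 0]) returns [40.0, 10.0, 10.0]: the MFR-constrained flow keeps 40 and the remaining capacity is split evenly.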


Input: N - network
Algorithm: CpgAlgorithm ( )
0. L = 1;
1. For each single link N_j in N, obtain R_j and r_i for i ∈ V_j by executing SingleLink(N_j);
2. For each link j in N, if R_j = min{R_k | link j and link k share a joint flow}, do the following:
   2.1. Update C_k for all links k sharing a joint flow with link j: C_k = C_k − Σ_{i ∈ V_j ∩ V_k} r_i;
   2.2. Remove link j and all the flows crossing link j from N;
3. If N is not empty, do L = L + 1 and go to step 1. Otherwise, stop.

Figure 2. CPG Algorithm

Proposition 1. CPG convergence. The rates computed by the CPG algorithm converge to the maxmin solution in a finite number of steps.

Proof. For a rigorous proof we refer the reader to [RT00b].
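The CPG algorithm of Figure 2 can be sketched as follows, reusing the single_link function above. The dictionary representation of the network and the way the advertised rate is recovered from the isolated-link solution are interpretations made for this sketch, not the authors' code.

```python
def cpg_algorithm(capacity, paths, mfr):
    """Level-by-level sketch of the CPG algorithm of Figure 2.

    capacity: dict link -> capacity C_j
    paths:    dict flow -> set of links crossed by the flow
    mfr:      dict flow -> minimal flow rate (assumed feasible on every link)
    Returns (rates, level), where level[j] is the iteration at which link j
    is removed, i.e. its level in the CPG.
    """
    capacity = dict(capacity)
    paths = {f: set(p) for f, p in paths.items()}
    rates, level, L = {}, {}, 1
    while capacity:
        per_link, adv = {}, {}
        for j in capacity:                       # step 1: solve each link alone
            flows = sorted(f for f in paths if j in paths[f])
            if not flows:
                per_link[j], adv[j] = {}, float("inf")
                continue
            r = dict(zip(flows, single_link(capacity[j],
                                            [mfr[f] for f in flows])))
            per_link[j] = r
            free = [v for f, v in r.items() if v > mfr[f]]
            adv[j] = max(free) if free else 0.0  # advertised rate R_j (Def. 2)
        for j in list(capacity):                 # step 2: pick the bottlenecks
            sharing = {k for p in paths.values() if j in p for k in p} or {j}
            if adv[j] > min(adv[k] for k in sharing):
                continue
            level[j] = L
            for f, r in per_link[j].items():
                if f in rates:
                    continue
                rates[f] = r
                for k in paths[f] - {j}:         # step 2.1: update shared links
                    if k in capacity:
                        capacity[k] -= r
            del capacity[j]                      # step 2.2: remove link j
        for f in list(paths):                    # ...and the flows it carried
            if f in rates:
                del paths[f]
        L += 1                                   # step 3: next CPG level
    return rates, level
```

The level map it returns is exactly the information that the constraint precedence graph introduced in the next section encodes.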

The following section introduces a new theory of maxmin bottleneck ordering.

3.3 A theory of maxmin bottleneck ordering

One of the most important parameters for designing a flow control protocol is its convergence time to the optimal rate. In this section we study the convergence properties of the maxmin problem. In particular, we answer two questions: given a link i, which other links have to converge first so that link i can converge? Given a certain network topology, what is the minimal time needed for any maxmin protocol to converge? Let us start with the following proposition.

Proposition 2. Precedent link relationship. Let N be a network configuration and let j ∈ N be a link that is removed at iteration l+1 of the centralized maxmin algorithm. Let also i be a link in N that at iteration l shares a flow with link j so that R^l_i < R^l_j. Note that such a link must exist, otherwise link j would be removed at iteration l. Then there are two cases: either link i becomes a bottleneck at iteration l, or there exists at least one link k that shares at least one flow with link i and that becomes a bottleneck at iteration l.

Proof. For a rigorous proof, please refer to [RT00c].

For an intuitive meaning of Proposition 2, let us consider the example in Figure 3.

Figure 3. Precedence link relationship (link capacities: C_1 = 10, C_2 = 50, C_3 = 200, C_4 = 60, C_5 = 25)

In the example, C_i is the capacity of link i and we assume that the figure is a snapshot of the remaining network at iteration l. Notice that link 4 is here our "link j". At iteration l we have that R^l_2, R^l_5 < R^l_4. Notice also that link 4 will be removed at iteration l+1, so we need to see whether links 2 and 5 satisfy the proposition. At iteration l, link 5 becomes a bottleneck and is removed from the network. Link 2, on the other hand, does not become a bottleneck at iteration l, but link 1 shares a flow with link 2 and link 1 becomes a bottleneck at iteration l. The key point of this last case is that when link 1 is removed, the advertised rate of link 2 at the following iteration increases beyond that of link 4, allowing the latter to become the new bottleneck.

The previous property provides the key to understanding the order of link convergence. In the previous example and using our ordering definition, we can say that "link 4 cannot converge until link 1 and link 5 have converged". Note also that link 4 can converge without the need for link 2 to converge.

Definition 3. Direct/indirect precedent link and medium link. Let N be a network configuration and let link j be a link in N removed at level l+1. Let also i be a link in N that at iteration l shares a flow with link j so that R^l_i < R^l_j. We say that link i is a direct precedent of link j if it is removed at level l. On the other hand, if link i is not removed at level l, then let link k be a link that shares at least one flow with link i and that is removed at level l. Note that by Proposition 2 such a link exists. We call link k an indirect precedent of link j, and link i is referred to as the medium link of links j and k. For instance, in the example of Figure 3, link 5 is a direct precedent of link 4, link 1 is an indirect precedent of link 4, and link 2 is the medium link of links 1 and 4. The direct and indirect precedence concept defines a partial ordering on the set of links of the network configuration. Intuitively, a link cannot converge to its maxmin rate until all of its direct and indirect precedent links have converged. It is possible to build a directed graph that represents the ordering of convergence, as the following definition shows.


Definition 4. Constrained Precedence Graph (CPG). Let N be a network configuration and let j and k be two links in N. We define the Constrained Precedence Graph (CPG) as the directed graph built as follows:

- Each node represents a link that is removed in the centralized maxmin algorithm.

- An arc runs from k to j if and only if k is a direct or indirect precedent of link j.

To see how this graph is built, let us consider the example in Figure 4. The corresponding CPG is shown in Figure 5.

Figure 4. Network configuration (link capacities include C_1 = 10, C_2 = 25, C_3 = 45, C_6 = 50, C_8 = 60, C_9 = 70)

At the first iteration, link 1 is the only link that is removed, with an available rate (AR) of 10. Note for example that link 6 cannot be removed because link 3 has a smaller AR, and at the same time link 3 cannot be removed because link 2 has a smaller AR. At the second iteration, link 2 is removed with an AR of 15 and link 1 becomes its direct precedent. Again, link 6 cannot be removed because link 3 has a smaller AR. However, at iteration 3 we notice that the AR of link 3 becomes bigger than that of link 6. At this iteration, link 6 is removed with no direct precedent link. Instead, it has link 2 as an indirect precedent link and link 3 as the medium link for that relation. Finally, at iteration 4, links 8 and 9 can be removed in parallel since they do not share a flow.

Figure 5. CPG graph (nodes L1, L2, L6, L8 and L9, one per removed link, arranged in four levels)

The CPG graph provides two kinds of information. First, it gives the ordering of convergence of each link. For instance, in the previous example we discovered that even though link 2 and link 6 do not share a common flow, they are related in the sense that link 6 can converge if and only if link 2 has converged. The CPG graph also gives a lower bound on the convergence time of any maxmin algorithm, as shown in the following corollary.

Corollary 1. Maxmin convergence time. The convergence time of any maxmin algorithm is at least (L-1)T, where L is the number of levels in the CPG graph (e.g., in Figure 5, L is 4) and T is the time required for a link to converge once all its predecessor links have converged.

The previous property applies to both centralized and distributed algorithms. In centralized algorithms, T is the computational time spent in a single iteration. In distributed algorithms, T is the time that it takes for a link to receive the status from its predecessor links.
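To tie Definition 4 to Corollary 1, the next fragment computes CPG levels by longest path over the precedence arcs and evaluates the (L-1)T bound. The edge-list input and the arcs used in the example (which assume links 8 and 9 hang off link 6, as the four-level structure of Figure 5 suggests) are illustrative assumptions of this sketch.

```python
def cpg_levels(arcs):
    """Levels of a CPG given as arcs (k, j), 'k precedes j' (assumes a DAG)."""
    nodes = {n for arc in arcs for n in arc}
    preds = {n: {k for k, j in arcs if j == n} for n in nodes}
    level = {}
    while len(level) < len(nodes):
        for n in nodes:
            if n not in level and all(p in level for p in preds[n]):
                level[n] = 1 + max((level[p] for p in preds[n]), default=0)
    return level

def convergence_bound(arcs, T):
    """Lower bound (L - 1) * T of Corollary 1."""
    L = max(cpg_levels(arcs).values(), default=1)
    return (L - 1) * T

# Figure 5: L1 -> L2 -> L6, and (assumed) L6 -> L8, L6 -> L9, so L = 4
# and any maxmin protocol needs at least 3T on that configuration.
print(convergence_bound([("L1", "L2"), ("L2", "L6"),
                         ("L6", "L8"), ("L6", "L9")], T=1.0))
```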

The ordering theory presented in this section can be used to evaluate the correctness and the performance of an actual implementation. We will use the results of this section in our distributed algorithm, presented in the next section.

4 Distributed Protocol

A distributed algorithm differs from a centralized one in the sense that the input parameters of the algorithm are not located in a single place. In order to converge to the same solution as the centralized approach, a distributed algorithm has to provide transport support for the distributed information. This information has to be brought to the right place so that a decision can be taken. In our distributed solution, decisions are taken in the switches. We will provide a signaling protocol that allows the switches to virtually establish a one-to-one communication link. Once the right information has been transported to a switch, the switch can deduce whether it is a bottleneck or not. If it is a bottleneck, it will inform the other switches so that they can proceed with their computations. Once a switch has converged to the optimal rate allocation, the source is also immediately informed. The following sections define the algorithm in two steps. First, we explain the signaling algorithm that transports the information from switch to switch and from switch to source. Then, we explain the available rate computation algorithm that allows a switch to know whether it is a bottleneck or not.

4.1 Signaling protocol: switch-to-switch direct communication

The signaling scheme proposed in this section is similar to that defined in the ATM Traffic Management Specification 4.0 [TM4.0]. A few modifications to this scheme, though, will bring us to a new family of signaling protocols twice as fast as those in TM4.0.


In our signaling protocol we assume that special resource management packets (RM packets) are periodically sent from source to destination and then back to the source. RM packets include a field called explicit rate (ER). These packets travel along the source-destination path, capturing in ER the value of the available bandwidth at the bottleneck switch. After convergence, the value of this field in an RM packet reaching the source on its way back should be equal to the optimal transmission rate for this source. Note that if RM packets are not available in the network, the ER field can be piggybacked in a data packet. The use of RM packets is preferable, though, since it allows the assignment of different priority levels to data packets and network management packets. In a situation of congestion, network management packets should have the highest priority, since they provide the means to remove the congestion. An analogy for this approach is a police car. Police officers drive special cars, different from regular cars, because this way they can use the siren and get the highest priority on the road. If they were not driving special cars (RM packets), they would not get the highest priority. In our case, the siren can be implemented by inserting an s-bit field in the RM packet. A switch that is congested can set this bit so that the packet is assigned the highest priority in the network.

Figure 6 shows the source algorithm. Every time a backward (from destination to source) RM packet arrives, the transmission rate of the source (TR) is set to the ER field of the RM packet. Also, a source periodically sends forward (from source to destination) RM packets. The initial values of the ER field and the s-bit in these packets are set to infinity and zero, respectively.

When a RM packet arrives
  TR ← RM.ER;
When it is time to send a RM packet
  RM.ER ← ∞; RM.S ← 0;
  Send RM packet downstream;

Figure 6. Source algorithm

Figure 7 shows the destination algorithm. Upon receiving a forward RM packet, the ER field is set to infinity and the packet is sent back to the source. This implementation differs from previous approaches, where the destination does not modify the ER value. As we will see, resetting ER to infinity at the destination is crucial to achieve switch-to-switch direct communication for fast convergence.

When a RM packet arrives
  RM.ER ← ∞;
  Send RM packet upstream;

Figure 7. Destination algorithm

Figure 8 shows the switch algorithm. The switch stores some state for each flow. The meaning of the fields is: UB for upstream available bandwidth, DB for downstream available bandwidth, N for the number of flows, MFR for minimal flow rate, and B for the minimum of UB and DB.

When a connection for flow i is set up
  Allocate new entries for UB_i, DB_i, MFR_i and B_i;
  Set MFR_i to the minimal rate allowed by the session;
  N ← N + 1;
When a connection is closed
  Free the memory space reserved for the connection;
  N ← N - 1;
When a new RM packet arrives from flow i
  If it is a forward RM packet
    UB_i ← RM.ER;
  Else
    DB_i ← RM.ER;
  B_i ← min{UB_i, DB_i};
  AR = ComputeAR ( );
  RM.ER ← max{min{AR, B_i}, MFR_i};
  If the switch is congested
    RM.S ← 1;
  If RM.S == 1
    Forward the RM packet with maximum priority;
  Else
    Forward the RM packet;

Figure 8. Switch algorithm

When a connection is set up - for example, in an MPLS network this could be included in the LDP protocol (Label Distribution Protocol [MPLS]) - we allocate memory space for the parameters, store the value of the minimal rate for the flow (MFR) and increase the number of flows crossing the switch. When a connection is closed, we free the memory space corresponding to the parameters and decrease the number of flows. The actual signaling algorithm is executed every time an RM packet arrives. If it is a forward RM packet we save the ER field in UB; if not, we save it in DB. This part is also crucial to achieve switch-to-switch communication. The idea is that, from a switch standpoint, the status of the whole network (as far as a flow is concerned) can be summarized with two pieces of information: the upstream and the downstream available bandwidth. This idea will be further explained at the end of this section.
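The per-flow bookkeeping of Figure 8 can be expressed as a small Python class; the class layout, the pluggable compute_ar callback and the returned (ER, s-bit) pair are assumptions of this sketch rather than the D-CPG implementation itself.

```python
import math

class SwitchFlowState:
    """Per-flow fields of Figure 8: UB, DB, MFR and B = min(UB, DB)."""
    def __init__(self, mfr):
        self.ub = math.inf       # last ER seen in a forward RM packet
        self.db = math.inf       # last ER seen in a backward RM packet
        self.mfr = mfr

    @property
    def b(self):
        return min(self.ub, self.db)

class Switch:
    def __init__(self, capacity, compute_ar):
        self.capacity = capacity
        self.flows = {}               # flow id -> SwitchFlowState
        self.compute_ar = compute_ar  # pluggable rate computation (Section 4.2)
        self.congested = False

    def open_flow(self, flow_id, mfr):          # connection setup
        self.flows[flow_id] = SwitchFlowState(mfr)

    def close_flow(self, flow_id):              # connection teardown
        del self.flows[flow_id]

    def on_rm_packet(self, flow_id, er, s_bit, forward):
        state = self.flows[flow_id]
        if forward:
            state.ub = er            # remember the two directions separately:
        else:                        # this is the bi-directional minimization
            state.db = er
        ar = self.compute_ar(self.capacity, self.flows)
        er_out = max(min(ar, state.b), state.mfr)
        s_out = 1 if self.congested else s_bit
        return er_out, s_out         # values written back into the RM packet
```

Keeping compute_ar as a parameter mirrors the separation drawn in the text between the signaling protocol and the rate computation algorithm.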


The switch algorithm proceeds by computing its own available rate by calling ComputeAR. This computation depends on B, the minimum of UB and DB. Another important property is that we can clearly separate the signaling algorithm from the rate computation algorithm. Depending on the optimization criteria, we could change the ComputeAR procedure but still use the same signaling protocol and keep the switch-to-switch communication properties. Finally, the switch updates the ER field of the RM packet to the maximum of the flow's MFR and the minimum of the switch available rate and the current ER value. After treating the s-bit properly, we forward the packet according to its priority.

Before defining the ComputeAR algorithm, let us first understand why this signaling protocol allows for a fast convergence implementation. The proposed protocol has two properties: bi-directional minimization and transient oscillation freedom.

Bi-directional minimization. Previous state-maintained algorithms [Cha95, Hou98, Lon99, Kal97] do not reset the ER field to infinity when returning RM packets at the destination. Also, the destination behavior presented in the ATM Traffic Management Specification [TM4.0] is defined so that the ER value is not reset to infinity. If this is the case, consider the network snapshot shown in Figure 9. In steady state, switch 2 is receiving backward RM packets with ER equal to 5 (since the destination does not reset this value to infinity). Now suppose that B1 increases to 15, so that the new bottleneck value is 10. Then, for switch 2 to receive this new bottleneck value we have to wait for an RM packet to go from switch 1 to the destination and then back to switch 2. In other words, the signaling protocol does not provide the necessary means for switch 3 and switch 2 to talk directly.

Figure 9. Network snapshot: Source - B1 = 5 - B2 = 90 - B3 = 10 - Destination

In our approach, for switch 2 to receive the new bottleneck value we only have to wait for the next RM packet coming from switch 3. Hence, we achieve a virtual switch-to-switch direct communication, allowing a convergence time two times smaller (instead of one round trip, it takes half a round trip).

Transient oscillation freedom. The other key implementation issue is the storage of both the upstream and downstream available bandwidth in different fields. To see the benefit of this enhancement, we use again the example shown in Figure 9. Suppose that switch 2 does not record the values of UB and DB. In that case, after switch 1 increases its available bandwidth to 15, it will inform switch 2. However, since switch 2 does not remember that the downstream path is bottlenecked at 10, it cannot make a good decision. Assuming that the new bottleneck rate is 15, the situation can be very dangerous, since this value would be propagated to the rest of the network, inducing oscillations and new congestion points. Instead, if switch 2 remembers the downstream available bandwidth, upon receiving the feedback from switch 1 it will immediately recognize that switch 3 is the new bottleneck.

These two properties are fundamental to achieving fast convergence. A comparison at the end of Section 4 will show the benefits with respect to previous approaches. Let us now define the ComputeAR procedure.

4.2 Rate computation algorithm

In our approach, each flow i has a minimal flow rate MFR_i that has to be guaranteed. In addition, from the switch standpoint, each flow i also has a peak flow rate equal to B_i, the minimum of UB_i and DB_i. Indeed, note that a switch must not give more bandwidth to a flow i than B_i, since there is another switch that cannot afford that amount of bandwidth. As a result, a switch must solve a single-link maxmin problem with both peak and minimum flow rate constraints (Figure 10a). In order to solve this problem easily, we transform it into an equivalent one: a multi-link problem with no peak flow rate constraints. As shown in Figure 10b, each peak flow rate i is substituted by a new link, connected to the switch, with capacity equal to B_i. It is easy to see that the available rate for our switch is the same in both cases. Note that now our problem is that of solving a multi-link network where we know all of its parameters (Figure 10b), meaning that we can use the centralized algorithm to solve it. Figure 11 presents an implementation of the ComputeAR function that solves this problem. The reader can check that this implementation is exactly the same as that presented in Figure 2, but for the particular case of our problem in Figure 10b. Because the algorithm requires two loops, its cost is O(N²). In the next section we will show that for the particular network topology of Figure 10b it is possible to find an algorithm with a cost of O(log(N)). The solution presented is based on an analogy with a fluid model.

Figure 10. Solving the switch rate allocation problem: (a) a single link of capacity C carrying flows with minimal rates MCR_1, MCR_2, ..., MCR_N and peak rates B_1, B_2, ..., B_N; (b) the equivalent multi-link network in which each peak rate B_i becomes an extra link of capacity B_i and the peak constraints are replaced by ∞.

Parameters:
  Ω: set of flows whose rates are bottlenecked at their MFRs;
  Ψ: set of flows whose rates are bottlenecked somewhere else (at their peak B_i);
  RC: remaining capacity;
  RN: remaining flows that are not in Ω ∪ Ψ;
  AR: advertised rate;

Algorithm: ComputeAR
1. Ψ ← ∅; RC ← C; RN ← N;
2. Ω ← ∅;
3. AR ← RC/RN;
4. If ∃ i ∉ (Ω ∪ Ψ) such that AR < MCR_i
     Put any flow i such that i ∉ (Ω ∪ Ψ) and AR < MCR_i into Ω;
     RC ← C − Σ_{i∈Ω} MCR_i − Σ_{i∈Ψ} B_i;  RN ← N − |Ω ∪ Ψ|;
     Return to 3;
5. If ∃ i ∉ (Ω ∪ Ψ) such that AR > B_i
     Put any flow i such that i ∉ (Ω ∪ Ψ) and AR > B_i into Ψ;
     RC ← C − Σ_{i∈Ψ} B_i;  RN ← N − |Ψ|;
     Return to 2;
6. Stop;

Figure 11. ComputeAR procedure to solve the single-link case
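The Ω/Ψ iteration of Figure 11 can be rendered in Python as the following hedged sketch; the list-based arguments and the tie-breaking (always pinning the first eligible flow) are choices made for this example rather than the paper's specification.

```python
def compute_ar(capacity, mcr, peak):
    """Single-link available rate with MCR floors and peak rates (Figure 11 style).

    capacity: link capacity C
    mcr:      list of minimal rates MCR_i
    peak:     list of peak rates B_i (min of upstream/downstream bandwidth)
    """
    n = len(mcr)
    psi = set()                        # flows bottlenecked at their peak B_i
    while True:
        omega = set()                  # flows bottlenecked at their MCR
        while True:
            rc = (capacity
                  - sum(mcr[i] for i in omega)
                  - sum(peak[i] for i in psi))
            rn = n - len(omega | psi)
            ar = rc / rn if rn > 0 else float("inf")
            low = [i for i in range(n)
                   if i not in omega | psi and ar < mcr[i]]
            if not low:
                break
            omega.add(low[0])          # step 4: pin a flow at its MCR
        high = [i for i in range(n)
                if i not in omega | psi and ar > peak[i]]
        if not high:
            return ar                  # step 6: fixed point reached
        psi.add(high[0])               # step 5: pin a flow at its peak, redo Ω
```

For instance, compute_ar(60, [40, 0, 0], [float("inf")] * 3) returns 10.0, matching the single-link example used earlier.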


4.3 Fluid model

In the following discussion we will assume, without loss of generality, that MFR_1 ≤ MFR_2 ≤ ... ≤ MFR_N. Let us consider the deposit in Figure 12a. It consists of a rectangular cavity with steps on both its lower and its upper side. The lower-side step i is located at a height equal to MFR_i − MFR_1, whereas the upper-side step i is located at a height equal to B_i − MFR_1, both in units of length.

Suppose now that we let C − Σ_{i=1}^{N} MFR_i units of volume of water flow into the deposit, as shown in Figure 12b. It can be seen that there is a direct relation between the height of the resulting water level and the maxmin solution for our network in Figure 10b. First, note that we have removed all the MFRs from the amount of water, so that we guarantee that each flow gets at least its minimal bandwidth requirement.

The lower steps in the deposit are built so that we first keep filling those flows with smaller MFRs. When the first flow of water comes, we first fill the first step. If there is enough water left, we then start filling the second step at the same rate as the first one, and so on. If the level of water grows up to an upper-side step, then the corresponding flow is saturated and no more water (bandwidth) is given to it.

The fluid model allows us to build an algorithm that reduces the cost of computing a maxmin solution. This algorithm is implemented in two steps, as the model suggests. The first procedure is called BuildDeposit and is used to build the deposit. This algorithm is shown in Figure 13 and its output is a 2 × 2N matrix representing the deposit. The second procedure is called FindWaterLevel and it computes the available rate out of the water level. As shown in Figure 14, FindWaterLevel takes a deposit (d) and the amount of water (C*) and returns the available rate (AR). As shown in Figure 15, these two functions are meant to be coded inside ComputeAR, replacing the previous algorithm shown in Figure 11.

Figure 12. Fluid model: (a) the deposit, with lower-side steps 1 to N at heights MFR_i − MFR_1 and upper-side steps at heights B_i − MFR_1; (b) the same deposit after pouring in C − Σ_{i=1}^{N} MFR_i units of water.


Proposition 3. Algorithm convergence. The ComputeAR algorithm shown in Figure 15 returns the maxmin rate solution.

Proof. Refer to [RT00a].

Corollary 2. Algorithm computational cost. The cost of building the deposit is O(N log(N)), whereas the cost of computing the water level is O(log(N)).

Proof. In Figure 13, step 1 is O(N log(N)) and steps 2 and 3 are O(2N). In Figure 14, step 1 is O(log(N)) and step 3 is O(1).

In theory, both the BuildDeposit and FindWaterLevel functions should be called upon the arrival of an RM packet (as shown in Figure 8). However, a nice property of this implementation is that, in practice, the deposit does not need to be rebuilt every time an RM packet arrives. Intuitively, if we receive a new RM packet with a new ER value, the new deposit should keep some similarity with the old one, since the new RM packet only brings feedback for one of the flows. Indeed, by properly handling the reconstruction of the new deposit, we can prove that the computational cost of ComputeAR can be reduced to O(log(N)). Because of space limitations, for a detailed description of this enhancement we refer the reader to [RT00a].

Input:
  MCR* = min{MCR_i | i = 1, ..., N};
  M = {MCR_i − MCR* | i = 1, ..., N};
  P = {B_i − MCR* | i = 1, ..., N};

Algorithm: BuildDeposit ( )
1. Order the elements in M ∪ P from the smallest to the biggest and put them in the first row of the matrix d of size 2 × 2N;
2. Build y using the following equation, for i = 1, ..., 2N:
     y[i] = 1              if i = 1
     y[i] = y[i−1] + 1     if d[1,i] ∈ M
     y[i] = y[i−1] − 1     if d[1,i] ∈ P
3. Build the second row of d using the following equation, for i = 1, ..., 2N:
     d[2,i] = d[1,1]                                 if i = 1
     d[2,i] = d[2,i−1] + y[i] (d[1,i] − d[1,i−1])    otherwise
4. Return(d);

Figure 13. BuildDeposit algorithm

Input:
  Deposit d;
  C* = C − Σ_{i=1}^{N} MCR_i;

Algorithm: FindWaterLevel ( )
1. Find i such that d[2,i] ≤ C* ≤ d[2,i+1];
2. AR = d[1,i] + (C* − d[2,i]) (d[1,i+1] − d[1,i]) / (d[2,i+1] − d[2,i]);
3. Return (AR);

Figure 14. FindWaterLevel algorithm

Algorithm: ComputeAR
1. BuildDeposit( );
2. FindWaterLevel( );

Figure 15. ComputeAR using the fluid model approach
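A hedged Python sketch of the fluid-model route of Figures 13-15 follows. The volume table and the interpolation below follow the water-filling interpretation described in the text; the exact indexing of Figures 13 and 14 may differ, so this should be read as a sketch rather than a transcription.

```python
import bisect

def build_deposit(mcr, peak):
    """Figure 13 analogue: return the deposit as (heights, cumulative volumes)."""
    base = min(mcr)
    lower = [m - base for m in mcr]      # M: heights where columns open
    upper = [b - base for b in peak]     # P: heights where columns saturate
    boundaries = sorted([(h, +1) for h in lower] + [(h, -1) for h in upper])
    heights = [h for h, _ in boundaries]
    volumes, y = [], 0                   # y = open, unsaturated columns so far
    for k, (h, delta) in enumerate(boundaries):
        if k == 0:
            volumes.append(heights[0])   # d[2,1] = d[1,1]
        else:
            volumes.append(volumes[-1] + y * (h - heights[k - 1]))
        y += delta
    return heights, volumes

def find_water_level(heights, volumes, c_star):
    """Figure 14 analogue: invert the volume profile at c_star by interpolation."""
    i = bisect.bisect_right(volumes, c_star) - 1
    i = max(0, min(i, len(volumes) - 2))
    dv = volumes[i + 1] - volumes[i]
    if dv == 0:
        return heights[i]
    return heights[i] + (c_star - volumes[i]) * (heights[i + 1] - heights[i]) / dv

def compute_ar_fluid(capacity, mcr, peak):
    """Figure 15 analogue: ComputeAR via the fluid model (assumes sum(mcr) <= capacity)."""
    heights, volumes = build_deposit(mcr, peak)
    c_star = capacity - sum(mcr)
    return min(mcr) + find_water_level(heights, volumes, c_star)
```

On the earlier example, compute_ar_fluid(60, [40, 0, 0], [float("inf")] * 3) also returns 10.0, agreeing with the Ω/Ψ version while replacing the per-packet scan with a binary search.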

4.4 Convergence time

In the previous section we studied the cost of the ComputeAR algorithm executed every time an RM packet arrives at the switch. In this section we will study the convergence time of the distributed algorithm proposed in the previous sections. Let us first define this concept.

Definition 5. Convergence time. Let t_0 be the time at which the link capacities stabilize in the network and let t_1 be a time at which the sources have converged to their maxmin allocation. We define the convergence time of the distributed algorithm as min_{t_1} (t_1 − t_0).

Using the previous definition, we have the following.

Proposition 4. Convergence time of the distributed algorithm. The convergence time of the proposed distributed algorithm is (L-1)RTT, where L is the number of levels in the CPG graph and RTT is an upper bound on the round trip time.

Proof. For a rigorous proof, refer to [RT00a]; we only provide a brief sketch here. Using the theory of maxmin bottleneck ordering presented in Section 3.3, we know that a lower bound for any distributed algorithm is (L-1)T, where L is the number of levels of the CPG graph and T is the time for a link to converge once all its predecessor links have converged. The proof provided in [RT00a] shows that for our algorithm T is equal to RTT.

To our knowledge, the convergence time given in Proposition 4 is the lowest for a state-maintained distributed maxmin algorithm with minimal rate guarantees. For comparison, Table 1 shows the convergence times of previous approaches as given by their authors. In this table, N denotes the number of different bottleneck links and S denotes the number of flows. Note that S ≥ N ≥ L.

Table 1. Convergence time of previous approaches and of our approach
  Algorithm               Convergence time
  Algorithm in [Cha95]    4(N-1)RTT
  Algorithm in [Hou98]    2.5(S-1)RTT
  Algorithm in [Lon99]    2(L-1)RTT
  Our approach            (L-1)RTT

The table shows the improvement achieved by our approach. As we have already stated, this improvement is achieved by using a bi-directional minimization scheme and by maintaining, at the switch, state for both the upstream and downstream bottleneck rates.

5 Simulations

In this section we evaluate the performance of the proposed protocol. In order to have a reference, we have chosen to compare our algorithm with ERICA. Some of the performance parameters that we evaluate are convergence time, degree of fairness, degree of oscillation and congestion in the queues.

5.1 Simulation setup

Figure 17 shows the network setup for our simulation. It consists of 5 flows. In order to simulate the case of moving bottlenecks with dynamic available bandwidth, flow 5 is defined to be at a higher priority switching level than the others. In other words, packets from flows 1-4 will only be forwarded in a switch if there are no packets from flow 5.

Figure 17. Network configuration (flows 1-4 are low priority traffic; flow 5 is high priority traffic)

5.2 Response time

In this simulation we measure the response time of our distributed algorithm and compare it to ERICA. The link capacities of L1 and L2 are set to 60 and 30 Mbps, respectively, and we add minimal rate guarantees of 40 and 20 Mbps to flows 3 and 4, respectively. In this simulation we disable flow 5 so that the available bandwidth of the links is fixed. The lengths of links 1 and 2 are set to 100 km, each one introducing a propagation delay of 1 millisecond. The initial transmission rates of all the flows are set to 7.5 Mbps. Note that the maxmin solution for this network configuration is r_1 = 10, r_2 = 10, r_3 = 40, r_4 = 20, where flow 3 and flow 4 are constrained at their minimal requirements.

Figure 16 shows the response time for both algorithms. It takes about 2.6 milliseconds for our protocol to converge to the maxmin solution and, once in steady state, the rates are 100% maxmin. ERICA is slower in terms of convergence time: it takes about 256.7 milliseconds to converge to within 99% of the maxmin solution. While our distributed algorithm converges directly, ERICA converges asymptotically. The reason for this is the stateless nature of ERICA: because it does not remember the bottleneck rates of previous iterations, ERICA requires more round trip delays to converge. The results of this simulation prove that by saving some state in the switch, the convergence time can be improved by a factor of 100.

Figure 16. Response time of (a) our distributed algorithm and (b) ERICA (rate in Mbps versus time in milliseconds; curves for flows 1-2, flow 3 and flow 4)

5.3 Dynamic convergence

In this simulation we consider the case of moving bottlenecks. For that purpose, on-off traffic is inserted in flow 5. The on rate is set to 100 Mbps while the off rate is set to 10 Mbps, both intervals having a duration of 80 milliseconds. We reset all the minimal flow rate constraints to zero and both link capacities are set to 150 Mbps. Note that under this configuration, during an off interval flows 1, 2 and 3 are constrained at link 1 with a rate of 50 Mbps each and flow 4 is constrained at link 2 with a rate of 90 Mbps. During an on interval, flows 1 and 4 are constrained at link 2 with a rate of 25 Mbps each, and flows 2 and 3 are constrained at link 1 with a rate of 62.5 Mbps each.

Figure 18 shows the rate allocation resulting from both our algorithm and ERICA. Our algorithm converges faster and without oscillation, whereas ERICA suffers from some oscillations.

Figure 19 shows the queue sizes at both switch 1 and switch 2. From this figure, ERICA suffers some significant congestion at switch 1. Note that for our approach the queue size is almost negligible. At switch 2 the differences are not as big. However, one important characteristic shown in the figure is that the queue sizes in our approach are much more predictable than those in ERICA. The reason comes from the fact that ERICA converges asymptotically. While eventually converging to the maxmin rate, ERICA takes a longer time to reach that point. During this time the rates can be considerably far from the maxmin solution, inducing unpredictable queue sizes.

Figure 18. Dynamic rate allocation under high priority background traffic (rate in Mbps versus time in milliseconds for flows 1-4; ERICA versus our algorithm)

Figure 19. Queue sizes at switches 1 and 2 (queue size in bytes versus time in milliseconds; ERICA versus our algorithm)

6 Conclusions

This paper provides an ordering theory for the convergence of maxmin rate allocation with minimum rate constraints. This theory leads to a tighter lower bound for the convergence time, (L-1)RTT. The faster convergence time is made possible by employing bi-directional minimization and maintaining per-flow information at the switches. Based on this ordering theory, an optimally fast maxmin rate allocation protocol called the Distributed CPG protocol is designed. The D-CPG protocol does not induce transient oscillations. The results of this paper can be generalized, in both theory and protocol design, to multicast with multiple rates on each multicast tree.

References

[Abr97] S. P. Abraham, A. Kumar, "Max-min Rate Control of ABR Connections with Non-zero MCRs", Proc. IEEE GLOBECOM'97, 1997, pp. 498-502.
[Cha95] A. Charny, D. Clark, R. Jain, "Congestion Control with Explicit Rate Indication", Proc. IEEE ICC'95, June 1995, pp. 1954-1963.
[Hou98] Y. T. Hou, H. Tzeng, S. S. Panwar, "A Generalized Max-min Rate Allocation Policy and its Distributed Implementation Using the ABR Flow Control Mechanism", Proc. IEEE INFOCOM'98, San Francisco, April 1998, pp. 1366-1375.
[Jai96] R. Jain, S. Kalyanaraman, R. Goyal, S. Fahmy, R. Viswanathan, "ERICA Switch Algorithm: A Complete Description", ATM Forum contribution 96-1172.
[Kal97] L. Kalampoukas, A. Varma, "Design of a Rate-allocation Algorithm in an ATM Switch for Support of Available-bit-rate (ABR) Service", Proc. Design SuperCon97, Digital Communications Design Conference, January 1997.
[Lon99] Y. H. Long, T. K. Ho, A. B. Rad, S. P. S. Lam, "A Study of the Generalized Max-min Fair Rate Allocation for ABR Control in ATM", Computer Communications 22, 1999, pp. 1247-1259.
[MPLS] R. Callon et al., "A Framework for Multiprotocol Label Switching", Internet Draft, March 2000.
[Rob94] L. Roberts, "Enhanced PRCA (Proportional Rate-Control Algorithm)", ATM Forum/94-0735R1, August 11, 1994.
[RT00a] J. Ros, W. K. Tsai, "An Optimal Distributed Protocol for Fast Convergence to Maxmin Rate Allocation", Technical Report, University of California, Irvine, June 2000. www.eng.uci.edu/~netrol/
[RT00b] J. Ros, W. K. Tsai, "A Theory of Maxmin Rate Allocation with Minimal Rate Guarantee", Technical Report, University of California, Irvine, June 2000. www.eng.uci.edu/~netrol/
[RT00c] J. Ros, W. K. Tsai, "A Theory of Maxmin Bottleneck Ordering", Technical Report, University of California, Irvine, June 2000. www.eng.uci.edu/~netrol/
[TM4.0] "Traffic Management Specification Version 4.0", ATM Forum af-tm-0056.000, April 1996.
[Tsa96] D. H. K. Tsang, W. K. F. Wong, "A New Rate-based Switch Algorithm for ABR Traffic to Achieve Max-min Fairness with Analytical Approximation and Delay", Proc. IEEE INFOCOM'96, 1996, pp. 1174-1181.
[Tsa00] W. K. Tsai, M. Iyer, Y. Kim, "Constraint Precedence in Max-Min Fair Rate Allocation", IEEE ICC 2000.
[Wan98] W. Qi, X. Xie, "The Structural Property of Network Bottlenecks in Maxmin Fair Flow Control", IEEE Communications Letters, Vol. 2, No. 3, March 1998.
