
SIAM J. COMPUT. Vol. 30, No. 5, pp. 1594–1623

© 2000 Society for Industrial and Applied Mathematics

GENERAL DYNAMIC ROUTING WITH PER-PACKET DELAY GUARANTEES OF O(DISTANCE + 1/SESSION RATE)∗

MATTHEW ANDREWS†, ANTONIO FERNÁNDEZ‡, MOR HARCHOL-BALTER§, TOM LEIGHTON¶, AND LISA ZHANG†

Abstract. A central issue in the design of modern communication networks is that of providing performance guarantees. This issue is particularly important if the networks support real-time traffic such as voice and video. The most critical performance parameter to bound is the delay experienced by a packet as it travels from its source to its destination. We study dynamic routing in a connection-oriented packet-switching network. We consider a network with arbitrary topology on which a set of sessions is defined. For each session $i$, packets are injected at a rate $r_i$ to follow a predetermined path of length $d_i$. Due to limited bandwidth, only one packet at a time may advance on an edge (link). Session paths may overlap subject to the constraint that the total rate of sessions using any particular edge is at most $1 - \varepsilon$ for any constant $\varepsilon \in (0, 1)$. We address the problem of scheduling the sessions at each switch, so as to minimize worst-case packet delay and queue buildup at the switches. We show the existence of a periodic schedule that achieves a delay bound of $O(1/r_i + d_i)$ with only constant-size queues at the switches. This bound is asymptotically optimal for periodic schedules. A consequence of this result is an asymptotically optimal schedule for the static routing problem, wherein all packets are present at the outset. We obtain a delay bound of $O(c_i + d_i)$ for packets on path $P_i$, where $d_i$ is the number of edges in $P_i$ and $c_i$ is the maximum congestion along edges in $P_i$. This improves upon the previously known bound of $O(c + d)$, where $d = \max_i d_i$ and $c = \max_i c_i$. We also present a simple distributed algorithm that, with high probability, delivers every session-$i$ packet to its destination within $O(1/r_i + d_i \log(m/r_{\min}))$ steps of its injection, where $r_{\min}$ is the minimum session rate and $m$ is the number of edges in the network. Our results can be generalized to (leaky-bucket constrained) bursty traffic, where session $i$ tolerates a burst size of $b_i$. In this case, our delay bounds become $O(b_i/r_i + d_i)$ and $O(b_i/r_i + d_i \log(m/r_{\min}))$, respectively.

Key words. communication networks, packet routing, scheduling, delay bounds

AMS subject classifications. 68M20, 68M10, 68W40

PII. S009753979935061X

∗ Received by the editors January 29, 1999; accepted for publication (in revised form) July 17, 2000; published electronically November 28, 2000. Supported by Army grant DAAH04-95-1-0607 and ARPA contract N00014-95-1-1246. http://www.siam.org/journals/sicomp/30-5/35061.html

† Bell Laboratories, 600 Mountain Avenue, Murray Hill, NJ 07974 ([email protected]. com, [email protected]). The work of the first author was supported by NSF contract 9302476-CCR. The work of these authors was performed while they were at MIT.

‡ Universidad Rey Juan Carlos, Móstoles, Spain ([email protected]). The work of this author was done while he was at MIT and was supported by the Spanish Ministry of Education.

§ Computer Science Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA ([email protected]). The work of this author was done while she was at MIT and was supported by an NSF Postdoctoral Fellowship in the Mathematical Sciences.

¶ Department of Mathematics and Laboratories for Computer Science, MIT, Cambridge, MA ([email protected]).

1. Introduction.

1.1. Motivation. Motivated by the need for quality-of-service guarantees, network designers today offer connection-oriented service in many networks, e.g., ATM (asynchronous transfer mode) networks. In this medium, a user requests a particular share of the bandwidth and injects a stream of packets along one particular session at the agreed-upon rate. An important consequence of the user's predictability is that the network can, in return, guarantee the user an end-to-end delay bound, i.e.,


an upper bound on the time that any packet takes to move from its source to its destination. In order to provide this delay guarantee, the network must determine how to schedule the packets that contend for the same edge simultaneously. Apart from delay bounds, it is also important to guarantee small queues at each switch due to limited buffer size. In this paper we show how to design schedules that guarantee asymptotically optimal per-session delay bounds as well as small queues.

1.2. Model and problem. Consider a network $N$ of arbitrary topology and a set of sessions defined on this network. A session $i$ is associated with a source node, a destination node, and a simple path from the source to the destination. (A path is simple if it uses each edge at most once.) Packets are injected into the network $N$ in sessions. A packet injected in session $i$ enters the system at the source node of $i$, traverses the path associated with $i$, and then is absorbed at its destination. The length $d_i$ is the number of edges on the path from the source to the destination of session $i$. Each session $i$ has an associated injection rate $r_i$. This rate constrains the injection of new packets from the session so that, during any interval of $t$ consecutive steps, at most $t r_i + 1$ packets can be injected in session $i$, for any $t$.

We assume that all packets have the same size and all edges have the same bandwidth. We also assume synchronized store-and-forward routing, where at each step at most one packet can traverse each edge. When two packets simultaneously contend for the same edge, one packet has to wait in a queue. During the routing, packets wait in two different kinds of queues. After a packet has been injected but before it leaves its source, the packet is stored in an initial queue. Once the packet has left its source, during any time it is waiting to traverse an edge, the packet is stored in an edge queue.

The end-to-end delay (delay for short) for a packet is the total time from the packet's injection until it reaches its destination. This includes the total time the packet spends waiting in both types of queues, plus the time it spends traversing edges. Our goal is to minimize both the end-to-end delay for each packet and the length of all edge queues.

In order to achieve delay guarantees and bounded queue sizes, it is necessary to require that, for all edges $e$, the sum of the rates of the sessions that use edge $e$ is at most 1. Throughout, we shall assume that the sum of the rates of the sessions using any edge $e$ is at most $1 - \varepsilon$ for a constant $\varepsilon \in (0, 1)$. This constant $\varepsilon$ will appear throughout our subsequent bounds.

Our paper focuses on the problem of timing the movements of the packets along their paths. A schedule specifies which packets move and which packets wait in queues at each time step. Most of the schedules obtained in this paper are template-based. The schedule defines a fixed template for each edge in advance. A template of size $M$ is a wheel with $M$ slots, each of which contains at most one token. Each token is affiliated with some session. The wheel spins at the speed of one slot per time step. A session-$i$ packet can traverse the edge if and only if a session-$i$ token appears. For each session-$i$ token, the session-$i$ packet that uses it will be the one that has been waiting to cross the edge for the longest amount of time, i.e., the session-$i$ packets use the session-$i$ tokens in a first-come-first-served manner. The template size and associated tokens do not change over time. We show that per-session delay bounds that are asymptotically optimal for template-based schedules can be achieved. Meanwhile, constant-size edge queues can be achieved as well.
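The template mechanism described above can be illustrated by a minimal Python sketch of a single edge. The function name `run_template`, the list encoding of the wheel, and the arrival dictionary are illustrative assumptions and not notation from the paper; the sketch also assumes that a packet may use a token in the same step in which it arrives at the edge.

```python
from collections import deque

def run_template(template, arrivals, steps):
    """Simulate one edge controlled by a template (hypothetical helper).

    template: list of length M; template[s] is a session id or None (empty slot).
    arrivals: dict mapping a time step to the list of session ids whose
              packets arrive at this edge in that step.
    Returns a list of (session, arrival_time, crossing_time) tuples.
    """
    M = len(template)
    queues = {}                      # one FIFO queue of arrival times per session
    crossed = []
    for t in range(steps):
        for session in arrivals.get(t, []):
            queues.setdefault(session, deque()).append(t)
        token = template[t % M]      # the wheel advances one slot per step
        if token is not None and queues.get(token):
            arrived = queues[token].popleft()   # longest-waiting packet goes first
            crossed.append((token, arrived, t))
    return crossed

# A template of size 4: session "a" owns slots 0 and 2, session "b" owns slot 1.
template = ["a", "b", "a", None]
arrivals = {0: ["a", "a", "b"], 1: ["b"]}
result = run_template(template, arrivals, steps=8)
```

In this toy run the second session-"b" packet arrives just after the only "b" token has passed and therefore waits almost a full revolution of the wheel, which is the effect the lower-bound discussion for template-based schedules relies on.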


1.3. Lower bounds. We observe that $d_i$ is always a lower bound on the delay for session $i$, since every session-$i$ packet has to cross $d_i$ edges. It is also easy to see that $\Omega(1/r_i)$ is an existential lower bound. For instance, consider $n$ sessions, all of which have the same rate $r = (1 - \varepsilon)/n$ and the same initial edge $e$. If a packet is injected in each session simultaneously, one of the packets requires $n = \Omega(1/r)$ steps to cross $e$.

Furthermore, for any given set of sessions, $\Omega(1/r_i)$ is a lower bound for some session $i$ in template-based schedules. Consider the template for an edge $e$ where $\sum_i r_i = 1 - \varepsilon$, the sum being taken over the sessions $i$ that use $e$. By the pigeon-hole principle, the tokens for some session $i$ can occupy at most an $r_i/(1 - \varepsilon)$ fraction of the slots on the template. Hence, there exist two session-$i$ tokens that are separated by at least $(1 - \varepsilon)/r_i$ slots. As a result, an adversary can make sure a session-$i$ packet arrives just after the first token has passed, thereby forcing the packet to wait $\Omega(1/r_i)$ steps.

If the schedule is not restricted to being template-based, the scheduler is more powerful. The scheduler does not have to decide on a fixed schedule in advance, but rather can make a new decision at each step, based on seeing the adversary's injections. In this case it is unknown whether $\Omega(1/r_i)$ is a lower bound for every given set of sessions.

1.4. Previous work. The problem of dynamic packet routing in the above setting is well studied. Until recently, the best delay bound known was $O(d_i/r_i)$ for packets of session $i$. It is tempting to believe that this is the best possible delay bound, since a session-$i$ packet may need to wait $\Omega(1/r_i)$ steps to cross each of the $d_i$ edges on its route. However, this upper bound of $O(d_i/r_i)$ can be much improved.

In 1990, Demers, Keshav, and Shenker [8] proposed a widely studied routing algorithm called Weighted Fair Queueing (WFQ). WFQ is a packetized approximation of the idealized fluid model algorithm Generalized Processor Sharing (GPS). WFQ is simple and distributed. This same algorithm was proposed independently by Parekh and Gallager [14, 15] in 1992 under the name of Packet-by-Packet Generalized Processor Sharing (PGPS). Parekh and Gallager prove that the algorithm has an end-to-end delay guarantee of $2 d_i/r_i$ [15, p. 148] in the case when all packets have the same size.

In their 1996 paper, Rabani and Tardos [16] produce an algorithm that routes every packet to its destination with probability $1 - p$ in time

\[ O(1/r_{\min}) + (\log^* p^{-1})^{O(\log^* p^{-1})}\, d_{\max} + \mathrm{poly}(\log p^{-1}), \]

where $r_{\min} = \min_i r_i$ and $d_{\max} = \max_i d_i$. Ostrovsky and Rabani improve the bound to $O(1/r_{\min} + d_{\max} + \log^{1+\varepsilon} p^{-1})$ [13]. These bounds are not session-based, meaning that if one session has a small rate or a long path, then the delay bounds for all sessions will suffer. The algorithms of [13, 16] are distributed, where knowledge of the entire network is not assumed, but each packet carries some information.

The main technique of [13, 16] is based on "delay-insertion." The intuition here is that if each packet receives a large random initial delay, then the packets are sufficiently spread out to ensure that they only need to wait $O(1)$ steps at each successive edge rather than $\Omega(1/r_i)$ steps. This delay-insertion technique is used extensively by Leighton et al. in [10, 11] in the context of static routing. (In the static routing problem, all packets are present in the network initially.) Since our main result employs many techniques from [10], we give a detailed summary of [10] in section 4.1.

A contrasting model, the connectionless adversarial queueing model, is also much studied, e.g., [4, 1]. Here the paths on which packets are injected can change over


time, giving the adversary more power. In the adversarial queueing model the best delay bound known is polynomial in the maximum path length [1].

1.5. Our results. We first provide a randomized, distributed scheduler that achieves a delay bound for session-$i$ packets of $O(1/r_i + d_i \log(m/r_{\min}))$ and a bound on the queue size of $O(\log(m/r_{\min}))$, where $m$ is the number of edges in the network and $r_{\min} = \min_i r_i$. While this bound is not optimal, it nevertheless conveys some intuition for our main result.

The main contribution of this paper is an asymptotically optimal template-based schedule. We prove that a schedule exists for the dynamic routing problem such that the end-to-end delay of each session-$i$ packet is bounded by $O(1/r_i + d_i)$.¹ Our result improves upon previous work in several aspects.

• We provide a session-based delay guarantee. That is, packets from sessions with short paths and high injection rates reach their destinations quickly. This is a big improvement over the previous bounds, which are stated in terms of $r_{\min} = \min_i r_i$ and $d_{\max} = \max_i d_i$. We also guarantee that every packet always reaches its destination within the delay bound, without dropping any packets.

• We guarantee constant-size edge queues. This is interesting because edge queues are much more expensive than initial queues in practice.

• A consequence of our result is a packet-based bound, which improves upon the $O(c + d)$ bound in [10] for the static problem. (See section 4.1 for the problem and parameter definitions.) We show that if packet $p_i$ follows a route $P_i$, then $p_i$ can be routed to its destination within $O(c_i + d_i)$ steps, where $c_i$ is the maximum congestion along $P_i$ and $d_i$ is the number of edges on $P_i$. This result follows directly from our main result by creating a different session $i$ for each packet $p_i$ and defining $r_i = (1 - \varepsilon)/c_i$, where $\varepsilon$ is a positive constant used to ensure that the load on any edge is under 1.

¹ In this paper, we concentrate on proving the existence of such a schedule. However, the proof can be made constructive using ideas of Leighton, Maggs, and Richa [11] that are based on Beck's algorithm [3]. For details, see [19].

For a template-based schedule, even if the computation of the schedule is time-consuming, it needs to be done only once. Packets can then be scheduled indefinitely as long as the sessions do not change.

Leaky-bucket injection model. Our results above can be generalized to bursty traffic streams that are leaky-bucket regulated. Here, each session $i$ has a maximum burst size (or bucket size) of $b_i \ge 1$ and an average arrival rate of $r_i$. During any $t$ consecutive time steps at most $r_i t + b_i$ session-$i$ packets are injected. Leaky-bucket regulated traffic is widely used in the literature, e.g., [6, 7, 9, 14, 15, 18]. Leaky-bucket regulated injections allow traffic shaping. When session-$i$ packets are injected, they first enter the session-$i$ bucket at the source. These packets then leave the bucket one at a time at the rate of $r_i$. In this way, the end-to-end delay is separated into two components: delay in the bucket and delay in the network. Since delay in the bucket is at most $b_i/r_i$, the end-to-end delay is increased by at most $b_i/r_i$ steps, and the size of the edge queues is unchanged.

The rest of the paper is divided into sections as follows. We first give some definitions and preliminary results in section 2. Then, in section 3, we describe a simple distributed scheduler that has a delay bound of $O(1/r_i + d_i \log(m/r_{\min}))$. In section 4, we overview the major techniques employed to achieve the main result: a bound of $O(1/r_i + d_i)$ and constant-size edge queues. In section 5 we define a set of


parameters used in the proof of the main result, and in section 6 we provide a detailed proof of the main result.

2. Preliminaries. In this section we present some preliminary results. Section 2.1 proves a generic fact about "token sequences" for template-based schedules. Section 2.2 presents two lemmas for probabilistic analysis that will be used extensively throughout the paper.

2.1. Token sequences. Throughout the paper we define template-based schedules in terms of token sequences. A token sequence for session $i$ consists of $d_i$ session-$i$ tokens, $K_1, \ldots, K_{d_i}$, one from each template along the session-$i$ path, where $K_{j+1}$ appears $x_j > 0$ steps after $K_j$. Then $x_j$ is the token lag for these two tokens and $\sum_{j=1}^{d_i - 1} x_j$ is the end-to-end delay for this token sequence. Two token sequences cannot have tokens in common. In the following, we show that in any template-based schedule, bounding the delay for token sequences is sufficient to bound the packet delays and that bounding the token lag is sufficient to bound the edge queues. Our proof relies on Lemma 2.1.

A vector $\vec{v} = [v_1, v_2, \ldots, v_n]$ is sorted if $v_1 \le v_2 \le \cdots \le v_n$. We define $\mathrm{perm}(\vec{v})$ to be a sorted vector whose components form a permutation of the components of $\vec{v}$. We also use the notation $\vec{u} < \vec{v}$ to indicate that the $j$th component of $\vec{u}$ is smaller than the $j$th component of $\vec{v}$ for each $j$.

Lemma 2.1. Let $\vec{u} = [u_1, u_2, \ldots, u_n]$ and $\vec{v} = [v_1, v_2, \ldots, v_n]$ be two vectors, each of which consists of $n$ distinct numbers. Suppose $\vec{u}$ is sorted, i.e., $\mathrm{perm}(\vec{u}) = \vec{u}$, and suppose $\vec{u} < \vec{v}$. Then, the following hold.
1. $\mathrm{perm}(\vec{u}) < \mathrm{perm}(\vec{v})$.
2. If $\vec{v} < \vec{u} + \vec{z}$, then $\mathrm{perm}(\vec{v}) < \mathrm{perm}(\vec{u}) + \vec{z}$, where $\vec{z} = [z, \ldots, z]$ is a vector of $n$ copies of a scalar $z$.
3. Let $|\vec{v}|$ represent the maximum component of $\vec{v}$; then $|\mathrm{perm}(\vec{v}) - \mathrm{perm}(\vec{u})| \le |\vec{v} - \vec{u}|$.

Proof. Let $\mathrm{perm}(\vec{v}) = [v_{\sigma(1)}, \ldots, v_{\sigma(n)}]$, where $\sigma$ represents the sorted permutation of $\vec{v}$.

1. Let us compare $u_j$ and $v_{\sigma(j)}$. There are two cases to consider. If $j \le \sigma(j)$, then $u_j \le u_{\sigma(j)} < v_{\sigma(j)}$. These inequalities hold since $\vec{u}$ is sorted by assumption and $\vec{u} < \vec{v}$. If $j > \sigma(j)$, then there exists $j' \ge j$ such that $v_{j'} \le v_{\sigma(j)}$. (Otherwise, for all $j' \ge j$, $v_{j'} > v_{\sigma(j)}$. However, only $n - j$ components of $\vec{v}$ can be greater than $v_{\sigma(j)}$.) Combining the fact that $\vec{u}$ is sorted and $\vec{u} < \vec{v}$, we have $u_j \le u_{j'} < v_{j'} \le v_{\sigma(j)}$. Therefore, $\mathrm{perm}(\vec{u}) < \mathrm{perm}(\vec{v})$ in both cases.

2. Since $\mathrm{perm}(\vec{u} + \vec{z}) = \mathrm{perm}(\vec{u}) + \vec{z}$ for $\vec{z} = [z, \ldots, z]$, property 1 implies $\mathrm{perm}(\vec{v}) < \mathrm{perm}(\vec{u} + \vec{z}) = \mathrm{perm}(\vec{u}) + \vec{z}$.

3. Suppose $|\mathrm{perm}(\vec{v}) - \mathrm{perm}(\vec{u})| = v_{\sigma(j)} - u_j$. There are two cases to consider. If $v_{\sigma(j)} \le v_j$, then $v_{\sigma(j)} - u_j \le v_j - u_j$, which implies $|\mathrm{perm}(\vec{v}) - \mathrm{perm}(\vec{u})| \le |\vec{v} - \vec{u}|$. If $v_{\sigma(j)} > v_j$, then there exists $j' < j$ such that $v_{\sigma(j)} \le v_{j'}$. (Otherwise, for all $j' \le j$, $v_{\sigma(j)} > v_{j'}$. However, only $j - 1$ components of $\vec{v}$ can be smaller than $v_{\sigma(j)}$.) Since $u_{j'} < u_j$ by the assumption that $\vec{u}$ is sorted, we have $v_{\sigma(j)} - u_j \le v_{j'} - u_{j'}$, which implies $|\mathrm{perm}(\vec{v}) - \mathrm{perm}(\vec{u})| \le |\vec{v} - \vec{u}|$. Property 3 follows.

We are ready to transform a token-sequence-based bound into a packet-based bound. Although it might seem straightforward, the difficulty is that a packet is unable to identify a token sequence. This means if a session-$i$ token appears, then the session-$i$ packet that has been waiting the longest has to move. The first token in a


token sequence is called an initial token.

Theorem 2.2. Consider any template-based schedule. If the end-to-end delay for each session-$i$ token sequence is bounded by $X$, then each session-$i$ packet reaches its destination within $X$ steps after it obtains an initial token. If the token lag is bounded by $x$ for all token sequences for all sessions, then the edge queue size is also bounded by $x$.

Proof. It suffices to show the following. For any $y \ge 1$, consider the first $y$ session-$i$ packets injected. After obtaining its initial token, each of these $y$ packets reaches the destination within $X$ steps, and it waits at most $x$ steps to advance each edge.

Let $T_{k,1}$ be the time that the $k$th packet catches an initial token $K_k$ and advances its first edge. Let $T_{k,j}$ be the time that the $k$th packet would cross the $j$th edge if it followed the token sequence initiated at $K_k$. Note that $T_{k,j}$ is not necessarily the time that the $k$th packet crosses the $j$th edge in a template-based schedule. However, $T_{k,j}$ does represent the time that a token would appear on the $j$th edge. We have $T_{1,1} < T_{2,1} < \cdots < T_{y,1}$, and $T_{k,1} < T_{k,2} < \cdots < T_{k,d_i}$ for $1 \le k \le y$.

We first apply property 1 of Lemma 2.1 to show that packets 1 through $y$ are able to cross the $j$th edge by times $\mathrm{perm}(T_{1,j}, T_{2,j}, \ldots, T_{y,j})$ for $1 \le j \le d_i$. Take the second edge as an example. Let $\mathrm{perm}(T_{1,2}, T_{2,2}, \ldots, T_{y,2}) = [T_{\sigma(1),2}, T_{\sigma(2),2}, \ldots, T_{\sigma(y),2}]$, where $\sigma$ represents the sorted permutation. Property 1 of Lemma 2.1 implies

\[ [T_{1,1}, T_{2,1}, \ldots, T_{y,1}] < [T_{\sigma(1),2}, T_{\sigma(2),2}, \ldots, T_{\sigma(y),2}]. \]

Since packet 1 has left its first edge by time $T_{1,1}$ and an unused token for the second edge appears by $T_{\sigma(1),2}$, packet 1 is able to advance its second edge by $T_{\sigma(1),2}$. Since packet 1 has left by $T_{\sigma(1),2}$, packet 2 is able to obtain an unused token by $T_{\sigma(2),2}$ and advance its second edge. Similar reasoning applies to packets 3 through $y$ for the second edge. Inductively, packets 1 through $y$ are able to advance their last edge by $\mathrm{perm}(T_{1,d_i}, T_{2,d_i}, \ldots, T_{y,d_i})$. This quantity is bounded by $[T_{1,1} + X, T_{2,1} + X, \ldots, T_{y,1} + X]$ by property 2 of Lemma 2.1. Hence, all the session-$i$ packets reach their destination within $X$ steps after they obtain the initial tokens.

Let us bound the queue size now. Consider the $j$th edge, where $1 \le j \le d_i$. Suppose packet $k$, for $1 \le k \le y$, uses token $K_{k,j}$ to cross its $j$th edge at time $t_{k,j}$. Let $K_{k,j+1}$ be the $(j+1)$st token on the same token sequence as $K_{k,j}$, and let $t_{k,j+1}$ be the time that $K_{k,j+1}$ appears. (Note that $K_{k,j}$ is not necessarily on the same token sequence as the initial token that packet $k$ used to cross its first edge, and that $K_{k,j+1}$ is not necessarily the token that packet $k$ would use to cross the $(j+1)$st edge.) Since $t_{k,j} < t_{k,j+1}$, property 1 of Lemma 2.1 and our argument for the delay bound above imply that packets 1 through $y$ are able to cross the $(j+1)$st edge by $\mathrm{perm}(t_{1,j+1}, t_{2,j+1}, \ldots, t_{y,j+1})$. Property 3 of Lemma 2.1 shows that $|\mathrm{perm}(t_{1,j+1}, t_{2,j+1}, \ldots, t_{y,j+1}) - [t_{1,j}, t_{2,j}, \ldots, t_{y,j}]|$ is bounded by $x$, the token lag. Hence, a packet waits at most $x$ steps to advance each edge once it obtains an initial token.

2.2. Lemmas for probabilistic analysis. Throughout the construction of our schedules, we use the Lovász local lemma [17, pp. 57–58] and a Chernoff bound [5] for probabilistic analysis. We include them here for easy reference.

Lovász local lemma. Let $E_1, \ldots, E_n$ be a set of "bad events," each occurring with probability at most $p$ and with dependence at most $d$ (i.e., every bad event is


mutually independent of some set of $n - d$ other bad events). If $4pd < 1$, then

\[ \Pr\left[\, \bigcap_{i=1}^{n} \bar{E}_i \right] > 0. \]

In other words, with nonzero probability no bad event occurs.

Chernoff bound. Let $X_1, \ldots, X_n$ be $n$ independent Bernoulli random variables, where $X_i$ has probability of success $p_i$. Let $X = \sum_{i=1}^{n} X_i$, and let the expectation $\mu = \sum_{i=1}^{n} p_i$. Then for $0 < \delta < 1$, we have

\[ \Pr[\, X > (1 + \delta)\mu \,] \le e^{-\delta^2 \mu/3}. \]

We also prove a variation of the Chernoff bound.

Lemma 2.3. Let $X_1, \ldots, X_n$ be $n$ independent Bernoulli random variables, where $X_i$ has probability of success $p_i$. Let $X = \sum_{i=1}^{n} X_i$ and the expectation $E[X] = \sum_{i=1}^{n} p_i$. Then for $u \ge E[X]$ and $0 < \delta < 1$, we have

\[ \Pr[\, X > (1 + \delta)u \,] \le e^{-\delta^2 u/3}. \]

Proof. We prove the lemma by amplifying the success probabilities. If $u \ge n$, then $\Pr[\, X \ge (1 + \delta)u \,] = 0$ and we are done. Otherwise, let $p'_i$ be a value such that $p_i \le p'_i \le 1$ and $\sum_{i=1}^{n} p'_i = u$. We have

\[ \Pr[\, X > (1 + \delta)u \mid \text{success probabilities } p_1, \ldots, p_n \,] \le \Pr[\, X > (1 + \delta)u \mid \text{success probabilities } p'_1, \ldots, p'_n \,]. \]

The Chernoff bound implies that the above probability is bounded by $e^{-\delta^2 u/3}$.
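As a numerical sanity check of Lemma 2.3, the short Python sketch below compares an exact binomial tail with the bound $e^{-\delta^2 u/3}$. The helper names and the parameter values ($n = 200$, $p = 0.25$, $u = 60$, $\delta = 0.5$) are illustrative assumptions, and the sketch covers only the special case in which all success probabilities are equal.

```python
import math

def binomial_tail(n, p, threshold):
    """Exact Pr[X > threshold] for X ~ Binomial(n, p)."""
    start = math.floor(threshold) + 1
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(start, n + 1))

def chernoff_bound(delta, u):
    """Right-hand side of Lemma 2.3: exp(-delta^2 * u / 3)."""
    return math.exp(-delta**2 * u / 3)

n, p = 200, 0.25          # illustrative values, so E[X] = 50
u, delta = 60.0, 0.5      # any u >= E[X] and 0 < delta < 1 may be used
tail = binomial_tail(n, p, (1 + delta) * u)   # Pr[X > 90]
bound = chernoff_bound(delta, u)              # exp(-5)
```

Choosing $u$ strictly larger than $E[X]$, as here, is the situation in which the lemma is needed, because the standard Chernoff bound is stated only for $\mu = E[X]$.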

3. Suboptimal schedules. We present in this section a simple randomized distributed scheduler that, with high probability, produces a delay bound of

\[ O\left( \frac{1}{r_i} + d_i \log \frac{m}{r_{\min}} \right) \]

and edge queues of size $O(\log(m/r_{\min}))$, where $m$ is the number of edges and $r_{\min} = \min_i r_i$. This preliminary result is substantially simpler to prove than the optimal result of $O(1/r_i + d_i)$ because of the relaxed bounds. Nevertheless, it illustrates the basic ideas necessary to prove the main result. We begin with a centralized scheme in section 3.1 that achieves these bounds, and then we convert it to a distributed scheme in section 3.2.

3.1. A simple centralized scheduler. As stated above, we now present a centralized scheduler that achieves the desired delay bound of $O(1/r_i + d_i \log(m/r_{\min}))$ with edge queues of size $O(\log(m/r_{\min}))$.

The structure of the proof is as follows. Each session-$i$ packet must traverse $d_i$ edges. We prove that the time the packet spends waiting for a token at each edge along the path (after the first edge) is $O(\log(m/r_{\min}))$. Hence the time to traverse all edges (but the first) is $O(d_i \log(m/r_{\min}))$. It turns out that the time spent waiting to receive the very first token (what we refer to as initial waiting time) is $O(1/r_i + d_i \log(m/r_{\min}))$. Hence the result follows.

The difficulty is to come up with a placement of tokens that achieves the above bounds. To do this, we will first come up with an illegal placement of tokens, where we place more than one token in some slots and zero tokens in other slots. We will prove delay bounds on the illegal placement. We will then apply a smoothing procedure which "smooths" out the bumps in the illegal placement, making it legal. We will prove that the smoothing process does not change the bounds very much.


Template size. We first decide the template size $T$. Roughly speaking, $T$ needs to be sufficiently large so that enough tokens can be placed to accommodate arrivals from all sessions every $T$ steps. We express the injection rate for session $i$ in terms of $\hat{r}_i = s_i/\ell_i$, a fraction slightly larger than $r_i$. If $T$ is the least common multiple of $\ell_i$ for all $i$, then we can place $s_i$ session-$i$ tokens every $\ell_i$ consecutive slots on each template along the path of session $i$. The quantities $\ell_i$ and $s_i$ are defined as follows:

\[ \ell_i = 2^{\lceil \log (2/(\varepsilon r_i)) \rceil}, \tag{3.1} \]
\[ s_i = \lfloor \ell_i r_i (1 + \varepsilon/2) \rfloor, \tag{3.2} \]
\[ \hat{r}_i = s_i/\ell_i, \tag{3.3} \]

where $\varepsilon$ is a constant to ensure that the sum of the rates of the sessions using any edge is less than 1. In other words, $\ell_i$ is the smallest power of 2 that is larger than or equal to $2/(\varepsilon r_i)$, and $s_i$ is the largest integer that is less than or equal to $\ell_i r_i (1 + \varepsilon/2)$. The template size $T$ is the least common multiple of the $\ell_i$. Since all the $\ell_i$'s are powers of 2, $T = O(1/r_{\min})$.

Lemma 3.1. We have the following properties for $\hat{r}_i$.
1. $r_i \le \hat{r}_i \le r_i (1 + \varepsilon/2)$ for each session $i$.
2. $\sum_{i \in S_e} \hat{r}_i \le 1 - \varepsilon/2$ for each edge $e$, where $S_e$ is the set of sessions that cross edge $e$.

Proof. Property 1 is equivalent to $\ell_i r_i \le s_i \le \ell_i r_i (1 + \varepsilon/2)$. The difference between the lower bound and the upper bound is $\ell_i r_i \varepsilon/2$, which is at least 1 by the definition of $\ell_i$. Therefore, there exists an integer in the range $[\ell_i r_i, \ell_i r_i (1 + \varepsilon/2)]$, and $s_i$ is such an integer by definition. Property 1 follows. Given $\sum_{i \in S_e} r_i \le 1 - \varepsilon$, we have $\sum_{i \in S_e} \hat{r}_i \le (1 - \varepsilon)(1 + \varepsilon/2) < 1 - \varepsilon/2$. Property 2 follows.

We now define the template size $T$ to be $\max_i \ell_i$, which is $\Theta(1/r_{\min})$. Since all the $\ell_i$'s are powers of 2, $T$ is also the least common multiple of the $\ell_i$'s.

Token placement. We describe a procedure to place the tokens for all sessions. We start with an illegal placement of tokens. For each session $i$, we first place $s_i$ initial tokens in one slot every $\ell_i$ slots on the template that corresponds to the first edge of session $i$. We then delay each initial token of session $i$ by an amount chosen uniformly and independently at random from $[L + 1, L + \ell_i]$, where

\[ L = 2^{\lceil \log ( \frac{\alpha}{2} \log(mT) ) \rceil} \tag{3.4} \]

for a constant $\alpha$. In other words, $L$ is a power of 2 that is greater than or equal to $\frac{\alpha}{2} \log(mT)$. As we shall see, this is enough randomness to spread out the tokens. For every session-$i$ token $a$ placed on the template corresponding to the $j$th edge, we place a session-$i$ token $b$ on the template corresponding to the $(j+1)$st edge such that $b$ appears exactly $2L$ steps after $a$. In this way, we have partitioned all the session-$i$ tokens into $T \hat{r}_i$ sequences, where each token sequence has $d_i$ tokens and two neighboring tokens in each sequence are $2L$ apart. In the following we show that the tokens cannot be too clustered.

Lemma 3.2. At most $L$ tokens appear in any consecutive $L$ slots on any template with probability at least $1 - 1/(mT)$, where $L$ is defined in (3.4) for a sufficiently large constant $\alpha$.
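The parameter definitions (3.1) through (3.4) and the randomized placement of initial tokens can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the function names, the example rates, and the choice $\alpha = 4$ are assumptions, as is the reading of the outer logarithm as base 2 and the inner $\log(mT)$ as a natural logarithm.

```python
import math
import random

def template_parameters(rates, eps, m, alpha):
    """Compute ell_i, s_i, r_hat_i per (3.1)-(3.3), the template size T, and L per (3.4).

    Assumptions of this sketch: the outer logarithm is base 2 (so the results
    are powers of 2) and the inner log(mT) is the natural logarithm.
    """
    ell = [2 ** math.ceil(math.log2(2 / (eps * r))) for r in rates]
    s = [math.floor(l * r * (1 + eps / 2)) for l, r in zip(ell, rates)]
    r_hat = [si / li for si, li in zip(s, ell)]
    T = max(ell)                      # equals the lcm, since all ell_i are powers of 2
    L = 2 ** math.ceil(math.log2((alpha / 2) * math.log(m * T)))
    return ell, s, r_hat, T, L

def initial_token_slots(ell_i, s_i, T, L, rng):
    """Slots (mod T) of one session's initial tokens after the random delay."""
    slots = []
    for start in range(0, T, ell_i):              # one batch every ell_i slots
        for _ in range(s_i):                      # s_i tokens per batch
            delay = rng.randint(L + 1, L + ell_i)  # uniform on [L+1, L+ell_i]
            slots.append((start + delay) % T)
    return slots

rates = [0.25, 0.125]     # illustrative session rates sharing one edge
eps = 0.5                 # the rates sum to 0.375 <= 1 - eps
ell, s, r_hat, T, L = template_parameters(rates, eps, m=10, alpha=4)
tokens = initial_token_slots(ell[1], s[1], T, L, random.Random(0))
```

The placement produced here is the "illegal" one: several tokens may share a slot, and the sketch does not perform the subsequent smoothing step.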


Proof. Since $s_i$ initial tokens for session $i$ are placed in one slot every $\ell_i$ slots and each is delayed by an amount chosen independently and uniformly at random from $[L + 1, L + \ell_i]$, the expected number of session-$i$ tokens in a single slot is $s_i/\ell_i$, which is $\hat{r}_i$. Hence by linearity of expectation and property 2 of Lemma 3.1, the expected number of tokens over all sessions in $L$ consecutive slots is $\sum_i \hat{r}_i L \le (1 - \varepsilon/2)L$. For a particular interval of $L$ consecutive slots on a particular template, let the random variable $X$ be the number of tokens in these slots. Whether or not a token lands in these $L$ slots is a Bernoulli event. Since the delays to the initial tokens are chosen independently and all session paths are simple, these Bernoulli events are independent. Since $E[X] \le (1 - \varepsilon/2)L$, applying Lemma 2.3 with $u = (1 - \varepsilon/2)L$ and $\delta = \varepsilon/2$ gives

\[ \Pr[\, X > L \,] \le \Pr[\, X > (1 + \varepsilon/2)(1 - \varepsilon/2)L \,] \le e^{-\varepsilon^2 (1 - \varepsilon/2) L/12}. \]

In $m$ templates there are at most $mT$ intervals of $L$ consecutive slots. Therefore, by a union bound the probability that more than $L$ tokens appear in any $L$ consecutive slots is bounded by

\[ mT \Pr[\, X > L \,] \le mT e^{-\varepsilon^2 (1 - \varepsilon/2) L/12} \le mT e^{-\varepsilon^2 (1 - \varepsilon/2) \alpha \log(mT)/24}, \]

where the last step uses $L \ge \frac{\alpha}{2} \log(mT)$. By choosing a sufficiently large constant $\alpha$, we can bound the above probability by $1/(mT)$.

If the first pass of the delay insertion does not produce a token assignment that satisfies the condition of at most $L$ tokens every $L$ slots, we simply try another pass until the condition is met.

Smoothing. In order to guarantee one token per slot, we carry out a smoothing process. Since there are at most $L$ tokens in any consecutive $L$ slots, we partition each template into intervals of $L$ consecutive slots and arbitrarily place at most one token in each slot within each interval. (Note the template size $T$ is a multiple of $L$ since $T$ and $L$ are both powers of 2.) Recall we have defined a token sequence for each session in the token placement process.

Lemma 3.3. Let $K_1, \ldots, K_{d_i}$ be any token sequence for session $i$; then, after the smoothing process, we have the following.
1. Token $K_j$ appears after $K_{j-1}$ for $1 < j \le d_i$.
2. The end-to-end delay of the token sequence is bounded by $2 d_i L + 2L$, and the token lag is bounded by $4L$.

Proof. Before the smoothing, $K_j$ appears exactly $2L$ steps after $K_{j-1}$ for $1 < j \le d_i$, i.e., the token lag is $2L$. Since the smoothing process shifts each token by at most $L - 1$ slots, $K_j$ still appears after $K_{j-1}$ after the smoothing. The token lag therefore increases to at most $4L$. The end-to-end delay for the token sequence increases from $2 d_i L$ to at most $2 d_i L + 2L$ due to the shift of the first and the last tokens.

Theorem 3.4. With high probability, the above randomized centralized scheme generates a template-based schedule that produces a delay bound of $O(1/r_i + d_i \log(m/r_{\min}))$ and edge queues of size $O(\log(m/r_{\min}))$.

Proof. We first show that each session-$i$ packet, $p$, is able to catch an initial token within $2L + 2\ell_i$ steps of its injection. Before the initial session-$i$ tokens are delayed, we have exactly $s_i$ tokens every $\ell_i$ slots. Since at most $s_i$ session-$i$ packets can be injected during $\ell_i$ steps, packet $p$ would be able to obtain an initial token, say $K$, in fewer than $\ell_i$ steps if the tokens were not delayed or shifted. Let $p$ be injected at time $t$, and let $K$ appear at $T$ before $K$ is delayed and shifted; then $t \le T < t + \ell_i$. Each initial token is delayed by an amount in the range of $[L + 1, L + \ell_i]$ during the token


placement process and is shifted by at most L − 1 slots during the smoothing process. Therefore, after the smoothing process, K appears after t but before t + 2L + 2%i . By Theorem 2.2 and Lemma 3.3, any session-i packet p is able to reach its destination within 2di L + 2L steps after it obtains its initial token. Therefore, the end-to-end m delay for session-i packets is (2L + 2%i ) + (2di L + 2L), which is O( r1i + di log rmin ). m The edge queue size is bounded by the token lag 4L, which is O(log rmin ). 3.2. A simple distributed scheduler. The above scheme is centralized since the session-i tokens on one template are dependent on the previous template. However, it suggests the following simple distributed scheme for scheduling packets so as to achieve small delay. As in section 3.1, we place initial tokens on the first edge of session i; however, this time we delay each initial token by an amount chosen independently and uniformly at random from [1, %i ], where %i is defined in (3.1). (Note that the delay is from [L + 1, L + %i ] in the centralized scheme.) Suppose that a session-i packet p now obtains its initial token at time T . Then for the jth edge on the session-i path, p is given a deadline of T + 2L(j − 1) + L, where L is defined in (3.4). Whenever two or more packets contend for the same edge simultaneously, the packet with the earliest deadline moves. Ties are broken arbitrarily. We call this scheme EarliestDeadline-First (EDF). Note that EDF is no longer template based. We show in Lemma 3.5 that the deadlines do not cluster together with high probability, and we show in Lemma 3.6 that every packet meets its deadlines. Lemma 3.5. For any edge, at most L deadlines appear in any consecutive L time steps with probability at least 1 − 1/(mT ), where L is defined in (3.4) for a sufficiently large constant α. Proof. The deadlines for a packet p are T +L, T +3L, T +5L, . . ., which correspond to the times that the tokens in a sequence appear. 
Hence, the proof is identical to that of Lemma 3.2.

Lemma 3.6. If for any edge at most L deadlines appear in any consecutive L time steps, then under EDF each packet crosses every edge by its deadline.

Proof. For the purpose of contradiction, let D be the first deadline that is missed. This implies all deadlines earlier than D are met. Let p be the packet that misses deadline D for edge e. Since packet p makes its previous deadlines, p must have crossed its previous edge by time $D - 2L$, or else e must be p's first edge and p must have obtained its initial token at time $D - L$. Hence, at every time step from time $D - L + 1$ to D, packet p is held up by another packet with a deadline no later than D. Furthermore, these deadlines must be later than $D - L$ since all deadlines earlier than D are met. Therefore, at least $L + 1$ packets (the L packets that hold up p, together with p itself) have deadlines for edge e from time $D - L + 1$ to D. Our lemma follows from the contradiction.

By an argument similar to that in Theorem 3.4, a session-i packet obtains its initial token within $2\ell_i$ steps of its injection. Combined with Lemmas 3.5 and 3.6, we have the following theorem.

Theorem 3.7. With high probability, the randomized distributed scheme EDF generates a schedule that produces an end-to-end delay bound of $O(1/r_i + d_i \log(m/r_{\min}))$.

In [2], simulations were carried out to compare the end-to-end delays produced by our EDF scheme against those produced by WFQ. The former outperformed the latter in a range of simulations.

4. Overview of the main result. Our main result for the dynamic routing problem parallels an earlier result on static routing. In section 4.1 we review the method used for solving the static case, and in section 4.2 we give an overview of the additional complexities that need to be addressed in the dynamic case.
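As a brief illustration of the EDF rule from section 3.2, the following minimal Python sketch computes the per-edge deadlines and selects the packet that moves when several contend for an edge. The function names and the (deadline, packet id) representation are assumptions made for this example only; breaking ties by packet id is just one of the arbitrary choices the scheme permits.

```python
def edge_deadline(token_time, j, L):
    """Deadline for the j-th edge (1-based) on the path of a packet
    whose initial token appears at time token_time: T + 2L(j - 1) + L."""
    return token_time + 2 * L * (j - 1) + L


def edf_pick(contenders):
    """Given (deadline, packet_id) pairs contending for one edge in the
    same step, return the id of the packet with the earliest deadline.
    Ties fall back to the smaller packet_id (an arbitrary choice)."""
    return min(contenders)[1]


if __name__ == "__main__":
    L = 4
    # Deadlines of one packet along its first three edges.
    print([edge_deadline(10, j, L) for j in (1, 2, 3)])
    print(edf_pick([(30, "a"), (14, "b"), (22, "c")]))
```

Each switch only needs the deadline carried by a packet, which is why the rule can be applied without any coordination between switches.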

Table 4.1
Frame-refinement for static routing in [10].

Schedule                         Frame size                        Relative congestion
$S^{(q)}$                        $I^{(q)}$                         $c^{(q)}$
$S^{(q+1)}$ (after refinement)   $\log^5 I^{(q)} = I^{(q+1)}$      $(1 + o(1))c^{(q)} = c^{(q+1)}$

4.1. A bound of O(c + d) for static routing. Leighton, Maggs, and Rao consider the static routing problem for arbitrary networks in [10]. For static routing, all packets are present in the network initially. Each packet is associated with a source, a destination, and a route. The congestion on each edge is the total number of routes that require that edge, and the dilation of a route is the number of edges on the route. Leighton, Maggs, and Rao show that for any set of routes with maximum congestion c (over all edges) and maximum dilation d (over all routes), there is a schedule of length O(c + d) and edge queue size O(1). In this schedule, at most one packet traverses each edge at each time step. A packet waits O(c + d) steps initially before leaving its source, and it waits O(1) steps to cross each edge thereafter.

We summarize here the techniques in [10]. The strategy for constructing an efficient schedule is to make a succession of refinements to an initial schedule $S^{(0)}$. In $S^{(0)}$, each packet moves at every step until it reaches its destination. This schedule has length d, but as many as c packets may traverse the same edge at the same step. Each refinement brings the schedule closer and closer to the requirement that at most one packet uses one edge per time step.

A T-frame is a time interval of length T. The frame congestion, C, in a T-frame is the largest number of packets that use any edge during the frame. The relative congestion in a T-frame is the ratio C/T. The frame congestion (resp., relative congestion) on an edge e during a T-frame is defined to be the frame congestion (resp., relative congestion) associated with edge e. It is obvious that the initial schedule $S^{(0)}$ has relative congestion at most 1 for any c-frame.

A refinement transforms a schedule $S^{(q)}$ with relative congestion at most $c^{(q)}$ in any frame of size $I^{(q)}$ or larger into a schedule $S^{(q+1)}$ with relative congestion at most $c^{(q+1)}$ in any frame of size $I^{(q+1)}$ or larger. The resulting frame size $I^{(q+1)}$ is much smaller than $I^{(q)}$, whereas the relative congestion $c^{(q+1)}$ is only slightly bigger than $c^{(q)}$. In particular, $I^{(q+1)} = \log^5 I^{(q)}$ and $c^{(q+1)} = (1 + o(1))c^{(q)}$. After a series of $O(\log^* c)$ refinements, a schedule $S^{(\zeta)}$ is obtained, where the relative congestion is O(1) for any O(1)-frame. A final schedule, in which at most one packet at a time crosses each edge, can be constructed by replacing each step of $S^{(\zeta)}$ by a constant number of steps. Each refinement is achieved by inserting delays to the packets. The central issue in [10] is to show that a set of delays always exists satisfying the criteria in Table 4.1.

4.2. A bound of $O(1/r_i + d_i)$ for dynamic routing. Our result for the dynamic routing problem is parallel to that in [10]. For an arbitrary network where paths (sessions) are defined, we show that there is a schedule such that every session-i packet reaches its destination within $O(1/r_i + d_i)$ steps of its injection, where $r_i$ and $d_i$ are the injection rate and path length for session i, respectively. A session-i packet waits $O(1/r_i + d_i)$ steps initially before leaving its source, and it waits O(1) steps to cross each edge afterwards.

Fig. 4.1. All the session-i packets that arrive during $[kT - T_i, (k + 1)T - T_i)$ are serviced during $[kT, (k + 1)T)$. In this figure, k = 1. [Figure: a time line marked at T and 2T, with rows labeled "Arrival time for session i", "Arrival time for session j", and "Service time for all sessions"; the arrival windows are offset by $T_i$ and $T_j$.]

To achieve a session-based, end-to-end delay bound of $O(1/r_i + d_i)$ for our dynamic routing problem, we adopt the general approach in [10]. However, there are three major problems in transforming the solution for the static problem into a solution for the dynamic problem. In the remainder of this section we present these three problems and their solutions. In the remainder of the paper we use the language of "scheduling packets" rather than "placing tokens." At the end of the presentation we show how to transform the packet schedule into a template-based schedule. Although the actual packet arrivals are not periodic, the times at which the packets cross the first edge are periodic. This is the key to the transformation.

Problem 1: Infinite time. In [10] all the packets to be scheduled are present initially. In the dynamic model, packets are injected over an infinite time line. We would like to partition the infinite time line into finite time intervals which can be scheduled independently of each other. We divide time into intervals of length T, where $T = \Theta(1/r_{\min} + d_{\max})$. We then independently schedule the time intervals [0, T), [T, 2T), [2T, 3T), etc. We associate each session i with a quantity $T_i = \Theta(1/r_i + d_i)$. For any integer $k \ge 0$ consider all the session-i packets that are injected during interval $[kT - T_i, (k + 1)T - T_i)$. We provide a schedule in which all these packets leave their sources no earlier than time kT and reach their destinations before time (k + 1)T. (See Figure 4.1.) From now on, we concentrate on scheduling the arrivals that would be serviced during interval [T, 2T). The quantity T will also serve as the size of all templates in the template-based schedule.

Problem 2: Session-based delay guarantees. Once we restrict ourselves to the interval [T, 2T), it seems that the dynamic routing problem is similar to the static problem.
However, we cannot simply proceed with the successive refinements as in section 4.1, since some sessions need tighter delay bounds than others. Session-i packets can only tolerate a delay proportional to $1/r_i + d_i$. We group sessions according to their associated $1/r_i + d_i$ value. We start by inserting delays to sessions having large values of $1/r_i + d_i$, reducing the frame size, and bounding the relative congestion. When the frame size becomes small enough, sessions with smaller $1/r_i + d_i$ join in.
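The interval partition of Problem 1 can be made concrete with a small Python sketch. It assumes integer times and treats T and $T_i$ as given numbers; the function name is hypothetical. A session-i packet injected at time t falls in the window $[kT - T_i, (k+1)T - T_i)$ exactly when $k = \lfloor (t + T_i)/T \rfloor$, and is then serviced during $[kT, (k+1)T)$.

```python
def service_interval(t, T, T_i):
    """Return (k, start, end) such that a session-i packet injected at
    time t is serviced during [start, end) = [kT, (k + 1)T).
    The packet belongs to interval k iff kT - T_i <= t < (k + 1)T - T_i."""
    k = (t + T_i) // T
    return k, k * T, (k + 1) * T


if __name__ == "__main__":
    T, T_i = 100, 30
    for t in (69, 70, 169, 170):
        print(t, service_interval(t, T, T_i))
```

With T = 100 and $T_i$ = 30, injections at times 70 through 169 all map to k = 1, which is the case drawn in Figure 4.1.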

Table 4.2
Refinement and conversion for dynamic routing.

Schedule                         Integral sessions                     Frame size                        Relative congestion
$S^{(q)}$                        $A^{(q)}$                             $I^{(q)}$                         $c^{(q)}$
(after refinement)               $A^{(q)}$                             $\log^5 I^{(q)}$                  $(1 + o(1))c^{(q)}$
$S^{(q+1)}$ (after conversion)   $A^{(q)} \cup B^{(q+1)} = A^{(q+1)}$  $\log^5 I^{(q)} = I^{(q+1)}$      $(1 + o(1))^2 c^{(q)} = c^{(q+1)}$

More precisely, we introduce the concept of integral and fractional sessions. When session i is integral, packets of size 1 are injected at rate $r_i$. When session i is fractional, a packet of size $\hat r_i$ is injected at every time step, where $\hat r_i$ is a value slightly larger than $r_i$. A packet from a fractional session always crosses one edge per time step, whether or not other packets are crossing the edge at the same time. Therefore, a fractional packet from session i always contributes exactly $\hat r_i$ to the congestion. Integral sessions are those to which we can afford to insert delays in order to bound the congestion. Fractional sessions are those to which we cannot insert delays. However, congestion due to a fractional session i is only $\hat r_i$, which is small.

As before, $S^{(q)}$ represents the schedule in the qth iteration. The set of integral sessions for $S^{(q)}$ is denoted by $A^{(q)}$. For the initial schedule $S^{(0)}$, all the sessions are fractional and we show that the relative congestion is less than 1. For schedule $S^{(q)}$ we inductively assume that the relative congestion due to the current integral and fractional sessions is at most $c^{(q)}$ for any frame of size $I^{(q)}$ or larger. To create a schedule $S^{(q+1)}$ from schedule $S^{(q)}$ we carry out a frame-refinement step and a conversion step.

The frame-refinement step reduces the frame size from $I^{(q)}$ to $I^{(q+1)} = \log^5 I^{(q)}$, while slightly increasing the relative congestion from $c^{(q)}$ to $(1 + o(1))c^{(q)}$. This step is achieved by delaying the integral packets by up to $\Theta((I^{(q)})^2)$ steps. We make sure that if session i is in $A^{(q)}$, then $1/r_i + d_i \ge (I^{(q)})^2$, and therefore the delays inserted can be tolerated.

The conversion step converts some sessions from fractional to integral, while maintaining the frame size of $I^{(q+1)}$ and slightly increasing the relative congestion to $c^{(q+1)} = (1 + o(1))^2 c^{(q)}$. These newly converted sessions form a set $B^{(q+1)}$ and have associated values $1/r_i + d_i \ge (I^{(q+1)})^2$.
This bound is chosen so that the sessions in A(q+1) , which is A(q) ∪ B (q+1) , will be able to tolerate the delays inserted during the next iteration of frame refinement. During the conversion step we delay the packets in B (q+1) by up to Θ(1/ri + di ) steps. We are able to show the existence of “good” delays for both frame refinement and conversion steps. Table 4.2 summarizes our approach. At the termination of our algorithm we have a schedule S (ζ) in which every session is integral and the relative congestion is at most 1 for all frames of size larger than a certain constant. In S (ζ) all session-i arrivals during [T − Ti , 2T − Ti ) are serviced during [T , 2T ). Furthermore, all session-i packets reach their destination within O(Ti ) steps of their injections. Problem 3: Constant-factor stretching in the final schedule. As discussed above, we repeat the process of refinement and conversion until we have a schedule, S (ζ) , in which all sessions are integral and in which the relative congestion is 1 for all frames of size larger than a certain constant w. In the static problem, a final schedule can easily be obtained by stretching S (ζ) by a constant factor. However, we cannot

afford to have a constant blowup in our final schedule for the dynamic problem. This is because we need to independently schedule all time intervals [0, T), [T, 2T), etc., and a constant blowup would make these time intervals overlap.

To overcome this problem, we first devise a schedule for a new network M that is constructed from the original network N as follows. Each edge e of N is replaced by 2w consecutive edges $e_1, \ldots, e_{2w}$, where w is the constant introduced above. The rates and routes of the sessions are unaffected. In M, session i has length $D_i = 2wd_i = O(d_i)$. All the techniques described earlier are applied to the network M. We carry out successive conversion and refinement steps for M and obtain a schedule $S^{(\zeta)}$, where the relative congestion is at most 1 for any frame whose size is larger than w. We then "smooth" $S^{(\zeta)}$ and convert it to a schedule for N where only one packet at a time traverses any edge.

The idea behind the smoothing process is as follows. In $S^{(\zeta)}$, more than one packet may require some edge of M during a given time step, but at most w packets can require any given edge f in M within w time steps. This means we can shuffle each packet that requires edge f by at most w time steps, so that exactly one packet traverses f at any step. Unfortunately, this shuffling in time can lead to an illegal schedule for M, in which a packet can be scheduled to traverse the edges on its path out of order (timewise). However, one can prove that if we consider the schedule with respect to the packets traversing edge $e_{2w}$ for all e, then this schedule is legal, i.e., the packets cross these edges in order. Hence, we schedule edge e in N in exactly the same way that the corresponding edge $e_{2w}$ is scheduled in M. Figure 4.2 is a schematic picture of our overall approach.

Fig. 4.2. An overview of our approach for the dynamic routing problem. [Flowchart: construct new network M; partition time into finite intervals; schedule intervals independently; repeat conversion and refinement; smooth schedule; convert back to network N.]
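The construction of M from N can be sketched in a few lines of Python. This is only an illustration under assumed representations: a path is a list of edge labels, and the copy $e_j$ of edge e is written as the pair (e, j). The function names are hypothetical.

```python
def expand_path(path, w):
    """Map a session path in N to its path in M: every edge e becomes
    the chain (e, 1), (e, 2), ..., (e, 2w)."""
    return [(e, j) for e in path for j in range(1, 2 * w + 1)]


def edge_in_N(edge_in_M, w):
    """Return e if edge_in_M is the last copy e_{2w}, whose schedule is
    the one inherited by edge e of N; return None for all other copies."""
    e, j = edge_in_M
    return e if j == 2 * w else None


if __name__ == "__main__":
    w = 2
    path_N = ["a", "b", "c"]              # d_i = 3
    path_M = expand_path(path_N, w)       # D_i = 2 * w * d_i = 12
    print(len(path_M))
    print([f for f in path_M if edge_in_N(f, w) is not None])
```

Only the copies $e_{2w}$ matter when converting back to N, which is why the out-of-order crossings that smoothing may introduce on the other copies are harmless.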

5. Parameter definitions.

Interval length T and $T_i$. As discussed in section 4.2, we independently schedule intervals [0, T), [T, 2T), etc. Our proof will concentrate on the interval [T, 2T). All the session-i packets that arrive during $[T - T_i, 2T - T_i)$ are serviced during [T, 2T). We define T and $T_i$ for session i as follows. Recall $D_i = 2wd_i$, where w is a constant defined at the end of this section.

$$T_i = 4D_i + 2 + (8/\varepsilon + 2)/r_i,$$
$$T = \left\lceil \frac{(1 + 4/\varepsilon)\max_i T_i}{w} \right\rceil w.$$

In other words, T is the smallest multiple of w that is greater than or equal to $(1 + 4/\varepsilon)\max_i T_i$. Clearly $T_i = O(1/r_i + D_i) = O(1/r_i + d_i)$. In the template-based schedule, all template sizes will be T.

Packet size for a fractional session. In this section we define $\hat r_i$, the packet size for a fractional session i. For reasons that will become clear in the conversion step of section 6.3, we need $\hat r_i$ to be slightly larger than $r_i$, and we shall need to express $\hat r_i$ as the ratio of two integers. Let

$$\ell_i = 8/(\varepsilon r_i), \qquad s_i = \lfloor \ell_i r_i (1 + \varepsilon/2) \rfloor, \qquad \hat r_i = s_i/\ell_i.$$

The following lemma is analogous to Lemma 3.1.

Lemma 5.1. We have the following properties for $\hat r_i$.
1. $r_i(1 + \varepsilon/4) \le \hat r_i \le r_i(1 + \varepsilon/2)$ for each session i.
2. $\sum_{i \in S_e} \hat r_i \le 1 - \varepsilon/2$ for each edge e, where $S_e$ is the set of sessions that cross edge e.

Note that the definition of $\ell_i$ and property 1 of Lemma 5.1 are different from the ones in section 3.1. We need this stronger lower bound on $\hat r_i$ to handle the extra complexity in the conversion step. In particular, $\hat r_i$ is also used to indicate the rate at which the initial tokens for session i appear. During the conversion step, the initial tokens for session i are placed in the interval $[T, 2T - T_i)$. Since these tokens are to accommodate all the session-i arrivals during [T, 2T), we need $\hat r_i(T - T_i) \ge r_i T$. This condition is guaranteed by the choices of T and $T_i$ and property 1 of Lemma 5.1: property 1 gives $\hat r_i(T - T_i) \ge r_i(1 + \varepsilon/4)(T - T_i)$, which is at least $r_i T$ whenever $T \ge (1 + 4/\varepsilon)T_i$. (See Lemma 6.8.)

Parameters for schedule $S^{(q)}$. We shall show later that, in schedule $S^{(q)}$, the relative congestion, due to all integral and fractional sessions, is at most $c^{(q)}$ for any frame of size $I^{(q)}$ or larger. For $S^{(q)}$, the set $A^{(q)}$ consists of all the integral sessions. As we construct schedule $S^{(q+1)}$ from $S^{(q)}$, sessions in $B^{(q+1)}$ become integral and join $A^{(q)}$. The schedule at the end of the refinement and the conversion is $S^{(\zeta)}$. The parameters $I^{(q)}$, $c^{(q)}$, $A^{(q)}$, and $B^{(q+1)}$ are defined by the following recurrences.
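Let $X_i = D_i + 1/r_i$ for session i, and let $X_{\max} = \max_i X_i$.

$$I^{(0)} = e^{\log^{2/5} X_{\max}}, \qquad I^{(q+1)} = \log^5 I^{(q)},$$
$$c^{(0)} = 1 - \varepsilon/2, \qquad c^{(q+1)} = (1 + \delta^{(q)})^2 c^{(q)}, \qquad \delta^{(q)} = \beta/\log I^{(q)},$$
$$A^{(0)} = \emptyset, \qquad A^{(q+1)} = A^{(q)} \cup B^{(q+1)},$$
$$B^{(q+1)} = \left\{ i \notin A^{(q)} : (I^{(q+1)})^2 \le X_i \le e^{\sqrt{I^{(q+1)}}} \right\} \quad \text{for } q \ne \zeta - 1,$$
$$B^{(q+1)} = \left\{ i \notin A^{(q)} : X_i \le e^{\sqrt{I^{(q+1)}}} \right\} \quad \text{for } q = \zeta - 1.$$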
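To get a feel for how quickly the frame size shrinks, the recurrence $I^{(0)} = e^{\log^{2/5} X_{\max}}$, $I^{(q+1)} = \log^5 I^{(q)}$ can be iterated numerically. The sketch below is illustrative only: it assumes natural logarithms, takes $\log X_{\max}$ as input (so that astronomically large $X_{\max}$ can be represented), and uses a hypothetical stopping threshold in place of the constant w. The function name is an assumption. Note that $\log^5 I < I$ only holds once I is fairly large, so the bounds are asymptotic in nature; the sketch raises an error if the recurrence stops shrinking.

```python
import math


def frame_sizes(log_x_max, threshold):
    """Iterate I^(0) = exp(log^{2/5} X_max), I^(q+1) = log^5 I^(q)
    (natural logs assumed) until the frame size is at most `threshold`."""
    sizes = [math.exp(log_x_max ** 0.4)]
    while sizes[-1] > threshold:
        nxt = math.log(sizes[-1]) ** 5
        if nxt >= sizes[-1]:
            raise ValueError("log^5 I >= I: the recurrence no longer shrinks")
        sizes.append(nxt)
    return sizes


if __name__ == "__main__":
    # log X_max = 10^6; the threshold 10^6 is an arbitrary stand-in for w.
    for q, size in enumerate(frame_sizes(1e6, 1e6)):
        print(q, f"{size:.3e}")
```

In this run the second entry equals $\log^2 X_{\max}$, matching the observation in section 6.1 that $I^{(1)} = \log^2 X_{\max}$.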

The parameter β is a sufficiently large positive constant. Note that $I^{(q)}$ decreases polylogarithmically and $c^{(q)}$ increases by a factor of $1 + o(1)$. One can verify that the sets $B^{(q)}$ form a partition of all the sessions and that sessions with large $X_i$ values become integral first. We make use of the bound $X_i \ge (I^{(q+1)})^2$ in the frame refinement step, and we use the bound $\log^2 X_i \le I^{(q+1)}$ in the conversion step.

Definition of w. We define a constant w that has two purposes. First, the process of refinement and conversion terminates when the frame size becomes smaller than or equal to w. Second, the intermediate network M is constructed from the original network N by replacing each edge in N with 2w edges. We define w to be a constant that satisfies the following two bounds:
1. $w \ge x$, where x satisfies $(1 - \alpha/\sqrt{\log x})^2 = 1 - \varepsilon/2$, i.e., $x = e^{\alpha^2 (1 - \sqrt{1 - \varepsilon/2})^{-2}}$,
2. $w \ge 2\log^{15} w + 2\log^{10} w - \log^5 w$.
The first bound ensures that the relative congestion $c^{(\zeta)}$ is at most 1. (See Lemma 6.11.) The second bound is to maintain an invariant throughout the frame refinement steps. (See section 6.2.)

6. An asymptotically optimal schedule. In this section we show the existence of an asymptotically optimal schedule. Sections 6.1 through 6.4 concentrate on problem 2 of section 4.2. We begin with an initial schedule $S^{(0)}$ and transform it to schedule $S^{(\zeta)}$ through a process of refinement and conversion. All these schedules are designed for the intermediate network M. Section 6.5 concentrates on problem 3 of section 4.2. We describe how to obtain an optimal schedule $S_N$ for the original network N from $S^{(\zeta)}$.

We first define or recall several basic concepts. Given some schedule S, a region R of the schedule is some interval of contiguous time steps in the schedule. A T-frame is a region of length T. The congestion C in a T-frame is the maximum number of packets that cross any edge in that interval, and the relative congestion is the ratio C/T. A fractional packet from session i always contributes exactly $\hat r_i$ to the relative congestion of any frame.

6.1. An initial schedule $S^{(0)}$. In $S^{(0)}$, all sessions are fractional, i.e., $A^{(0)} = \emptyset$. Each packet (of a fractional size) crosses one edge per time step whether or not other packets are using the same edge at the same time. Since the relative congestion is entirely due to fractional sessions, the relative congestion is at most $\sum_{i \in S_e} \hat r_i \le 1 - \varepsilon/2 = c^{(0)}$ on any edge e. (See Lemma 5.1.)

Note that the above relative congestion holds for any frame size. We choose the initial frame size $I^{(0)} = e^{\log^{2/5} X_{\max}}$, so that $I^{(1)} = \log^2 X_{\max}$, which implies $X_{\max} = e^{\sqrt{I^{(1)}}}$. This allows the sessions with the largest $X_i$ value to be converted in the first iteration of the algorithm (see definition of $B^{(1)}$).

6.2. Frame refinement for schedule $S^{(q)}$.
In this section we describe the frame-refinement process. For each schedule, a frame refinement delays the packets

from integral sessions in a way that dramatically reduces the frame size but does not increase the relative congestion and the length of the schedule by much. To be more precise, for schedule $S^{(q)}$, we inductively assume that the relative congestion is at most $c^{(q)}$ for frames of size $I^{(q)}$ or larger and that each integral packet waits at most once every $I^{(q-1)}$ steps after leaving its source. In this frame refinement step we show that there is a way to delay (by an amount related to the frame size) the packets from $A^{(q)}$ so that, in the resulting schedule $S^{(q+1/2)}$, the relative congestion is at most $(1 + \delta^{(q)})c^{(q)}$ for any frame of size $I^{(q+1)} = \log^5 I^{(q)}$ or larger, where $\delta^{(q)} = \beta/\log I^{(q)}$, and each integral packet waits at most once every $I^{(q)}$ steps.

The base case of the initial schedule $S^{(0)}$ is described in section 6.1. Since there are no integral sessions, no delays are inserted in this step. Trivially, the resulting relative congestion is at most $(1 + \delta^{(0)})c^{(0)}$ for any frame of size $I^{(1)}$ or larger at the end of this step, and no packet ever waits.

Let us now consider refining schedule $S^{(q)}$ for q > 0. The refinement is divided into two steps. In the first refinement step we divide the current schedule into blocks of length $2(I^{(q)})^3 + 2(I^{(q)})^2 - I^{(q)}$, and we insert delays into each block so that its length increases to $2(I^{(q)})^3 + 2(I^{(q)})^2$. We show that these delays can be introduced in such a way that in the central $2(I^{(q)})^3$ steps of each block the relative congestion of frames of at least length $I^{(q+1)}$ is only a little larger than $c^{(q)}$. (See Figure 6.1.) At the beginning and end of each block there are "fuzzy" regions of length $(I^{(q)})^2$ each. In the second step we move the block boundaries so that the fuzzy regions at the end and beginning of adjacent blocks are at the center of the new blocks of $2(I^{(q)})^3 + 2(I^{(q)})^2$ steps. Again, we insert delays into each block, increasing the size of the block by $(I^{(q)})^2$ steps.
We show that there is a way to insert these delays so that the final conditions for refining $S^{(q)}$ are indeed satisfied. (See Figure 6.2.) In the following we present Lemma 6.2, which will be used extensively in both steps of the refinement. We continue by presenting both steps in detail.

A useful lemma. The following lemma is used to prove Lemma 6.2.

Lemma 6.1. Let X and Y be independent random variables. Let Y be binomially distributed with mean $\mu_y$, and let $\sigma_1$, $\sigma_2$, and v be values such that $\sigma_2 = \sigma_1 - 1/v$. Then,

$$\Pr[X + \mu_y > (1 + \sigma_1)v] \le 2\Pr[X + Y > (1 + \sigma_2)v].$$

Proof. Let $z = (1 + \sigma_1)v - \mu_y$. We have

$$\Pr[X + \mu_y > (1 + \sigma_1)v] = \Pr[X > z], \tag{6.1}$$
$$\Pr[X + Y > (1 + \sigma_2)v] = \Pr[X + Y > \mu_y + z - 1]. \tag{6.2}$$

Note also that

$$\Pr[X + Y > \mu_y + z - 1] \ge \Pr[X > \mu_y + z - 1 - \lfloor \mu_y \rfloor \text{ and } Y \ge \lfloor \mu_y \rfloor] = \Pr[X > z - 1 + \mu_y - \lfloor \mu_y \rfloor]\,\Pr[Y \ge \lfloor \mu_y \rfloor].$$

This last equality follows from the independence of X and Y. Theorem B.1 in [12] shows that $\Pr[Y \ge \lfloor \mu_y \rfloor] \ge 1/2$. Since $\mu_y - \lfloor \mu_y \rfloor < 1$, we have

$$\Pr[X + Y > \mu_y + z - 1] \ge \tfrac{1}{2}\Pr[X > z].$$


Our lemma follows from equalities (6.1) and (6.2) and the above inequality. We say that a packet is active during some region of a schedule if the packet belongs to some integral session and it traverses at least one edge during the region. Since we maintain the invariant that a packet waits at most once every I (q−1) steps after leaving its source, an inactive packet is either at its source or its destination during the entire region. Lemma 6.2 below is a stepping stone that allows us to reduce the frame size from I (q) to poly log I (q) . We invoke this lemma for various values of s, t, r, and R. Lemma 6.2. Consider some region R of a schedule where the relative congestion is at most r = Θ(1) for frames of length s or more, where log3 I (q) ≤ s ≤ (I (q) )2 . Consider any edge e and any t-frame, where log2 I (q) ≤ t ≤ 2 log2 I (q) . Assume each active packet in the region is delayed between the beginning of R and the beginning of the t-frame by a number of steps randomly, independently, and uniformly chosen from [1, s]. Then, for any constant k there is some value γ = Θ(1)/ log I (q) such that the probability of having a relative congestion larger than r(1 + γ) on e during the t-frame is at most (I (q) )−k . Proof. Let the random variable X be the frame congestion on e during the tframe due to the active packets after they are delayed. If the relative congestion due to fractional sessions is rf , the frame congestion due to fractional sessions in the t-frame is exactly rf t. Since the active packets are the only integral-session packets that can cross e during the region, the frame congestion on e during the t-frame is X + rf t after the delay. Consider now a binomial random variable Y with parameters (rf s, t/s) and mean E[Y ] = rf t. 
From Lemma 6.1, the probability p that the congestion in the t-frame is larger than $(1 + \gamma)rt$ after the packets have been delayed is

$$p = \Pr[X + r_f t > (1 + \gamma)rt] \le 2\Pr[X + Y > (1 + \sigma)rt],$$

where $\sigma = \gamma - 1/(rt)$. Since $t \ge \log^2 I^{(q)}$ and $r = \Theta(1)$, $\gamma = \Theta(1)/\log I^{(q)}$ if and only if $\sigma = \Theta(1)/\log I^{(q)}$. Let $\sigma = v/\log I^{(q)}$, where v is a constant. We shall choose an appropriate value v so that the lemma is satisfied.

We first concentrate on X. Since the active packets are delayed up to s steps, an active packet that crosses e in the t-frame after the delay could cross e in an interval of t + s steps before the delay. The relative congestion due to active packets is at most $r - r_f$ in that interval before the delay. Hence, at most $(t + s)(r - r_f)$ active packets can cross e in the t-frame after the delay, and each of them has a probability of at most t/s of doing so. Recall that Y is a binomial random variable with parameters $(r_f s, t/s)$. We define Z to be a binomial random variable with parameters (n, t/s), where $n = r(t + s) > (r - r_f)(t + s) + r_f s$. It is easy to see that

$$p \le 2\Pr[X + Y > (1 + \sigma)rt] \le 2\Pr[Z > (1 + \sigma)rt].$$

Therefore, we bound the probability p as follows:

$$p \le 2\sum_{i=(1+\sigma)rt}^{r(t+s)} \binom{r(t+s)}{i} (t/s)^i (1 - t/s)^{r(t+s)-i}.$$

We bound the sum by observing that $(1 + \sigma)rt$ is larger than $E[Z] = (t + s)rt/s$, since $t/s \le 2/\log I^{(q)}$ and $\sigma = v/\log I^{(q)}$ with v chosen larger than 2. Thus, the first term of the sum is the largest. Hence, from the fact that there are at most r(t + s) terms in the sum, we have

$$p \le 2r(s + t)\binom{r(t+s)}{(1+\sigma)rt} (t/s)^{(1+\sigma)rt} (1 - t/s)^{r(t+s)-(1+\sigma)rt}.$$

By applying the inequality $\binom{a}{b} \le (ae/b)^b$ for 0 < b < a, we get

$$p \le 2r(s + t)\left(\frac{(t+s)e}{(1+\sigma)t}\right)^{(1+\sigma)rt} (t/s)^{(1+\sigma)rt} (1 - t/s)^{r(t+s)-(1+\sigma)rt}.$$

Now applying the inequality $\ln(1 + x) \ge x - x^2/2$ for $0 \le x \le 1$, for the case $x = \sigma$,

$$p \le 2r(s + t)\left((1 + t/s)\,e^{1-\sigma+\sigma^2/2}\right)^{(1+\sigma)rt} (1 - t/s)^{r(t+s)-(1+\sigma)rt}.$$

Finally, by applying the inequality $(1 + x) \le e^x$ for $1 + x = 1 + t/s$ in one case and for $1 + x = 1 - t/s$ in the other, we obtain

$$p \le 2r(t + s)\,e^{-rt\sigma^2\left(1/2 - \sigma/2 - t/(\sigma^2 s) - 2t/(\sigma s)\right)}.$$
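For readers who wish to check the final exponent in the bound on p, the algebra can be written out as follows. This is only a verification of the last step, using $1 + t/s \le e^{t/s}$ and $1 - t/s \le e^{-t/s}$ as stated; no new assumptions are introduced.

```latex
\begin{align*}
&(1+\sigma)rt\left(\frac{t}{s} + 1 - \sigma + \frac{\sigma^2}{2}\right)
   - \frac{t}{s}\bigl(r(t+s) - (1+\sigma)rt\bigr) \\
&\quad = rt\left[(1+\sigma)\Bigl(1 - \sigma + \frac{\sigma^2}{2}\Bigr) - 1
   + (1 + 2\sigma)\frac{t}{s}\right] \\
&\quad = rt\left[-\frac{\sigma^2}{2} + \frac{\sigma^3}{2}
   + \frac{t}{s} + \frac{2\sigma t}{s}\right] \\
&\quad = -rt\sigma^2\left(\frac{1}{2} - \frac{\sigma}{2}
   - \frac{t}{\sigma^2 s} - \frac{2t}{\sigma s}\right).
\end{align*}
```

The second line uses $(1+\sigma)(1 - \sigma + \sigma^2/2) = 1 - \sigma^2/2 + \sigma^3/2$.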

The bounds on s and t and the definitions of r and σ imply that we can choose a constant v large enough so that $p < (I^{(q)})^{-k}$ for any constant k > 0.

The first refinement step for schedule $S^{(q)}$. We first divide the interval $[T, T + |S^{(q)}|)$ into blocks of length $2(I^{(q)})^3 + 2(I^{(q)})^2 - I^{(q)}$. We shall reschedule each block B independently. During a block B we only delay active packets. For each block B, each active packet in B is assigned a delay randomly, uniformly, and independently chosen from $[1, I^{(q)}]$. An active packet p, whose assigned delay is x, is delayed in the first $xI^{(q)}$ steps of B once every $I^{(q)}$ steps. In order to independently reschedule the next block, packet p is also delayed in the last $(I^{(q)} - x)I^{(q)}$ steps of B once every $I^{(q)}$ steps. Therefore, a rescheduled block has length $2(I^{(q)})^3 + 2(I^{(q)})^2$.

Before the delays are inserted to reschedule block B, an active packet p is delayed at most once within the block, provided that $2(I^{(q)})^3 + 2(I^{(q)})^2 - I^{(q)} < I^{(q-1)}$, which holds as long as $I^{(q)}$ is larger than some constant. Prior to inserting any new delay to a block, we check if it is within $I^{(q)}$ steps of the single old delay. If the new delay would be too close to the old delay, then it is simply not inserted. The loss of one delay in a block has a negligible effect on the probability analysis that follows.

Lemma 6.4 shows that with the insertion of delays we can dramatically reduce the frame size in the center of the block and increase the relative congestion only slightly. In order to prove Lemma 6.4, we need the following fact.

Lemma 6.3. If the relative congestion in every frame of size T to 2T − 1 is at most r, then the relative congestion in any frame of size T or greater is at most r.

Proof. Consider a frame of size $T'$, where $T' > 2T - 1$. The first $\lfloor T'/T \rfloor T - T$ steps of the frame can be broken into T-frames, each with relative congestion at most r. The remainder of the $T'$-frame consists of a single frame of size between T and 2T − 1 steps in which the relative congestion is also at most r.

Lemma 6.4. There exists a way of choosing delays so that in between the first and last $(I^{(q)})^2$ steps of block B, the relative congestion of any frame of size $\log^2 I^{(q)}$ or larger is at most $(1 + \gamma_1)c^{(q)}$ for some $\gamma_1 = \Theta(1)/\log I^{(q)}$.

Proof. With each edge e, we associate a bad event. A bad event on e happens when the frame congestion on edge e is more than $(1 + \gamma_1)c^{(q)}I$ during any I-frame

of size $\log^2 I^{(q)}$ or larger. Due to Lemma 6.3, it is sufficient to prove the statement for all frames of size between $\log^2 I^{(q)}$ and $2\log^2 I^{(q)}$. We shall use the Lovász local lemma to show that the probability that no bad event occurs is nonzero.

We first bound the dependence, d, of bad events. Two bad events on two edges are dependent only if a packet from a session $i \in A^{(q)}$ can use both edges. At most $c^{(q)}(2(I^{(q)})^3 + 2(I^{(q)})^2 - I^{(q)})$ packets (from sessions in $A^{(q)}$) can cross the same edge in block B, and each packet crosses at most $2(I^{(q)})^3 + 2(I^{(q)})^2 - I^{(q)}$ edges in B. As we shall show later, $c^{(q)} \le 1$. Therefore, a bad event can be dependent on at most $O((I^{(q)})^6)$ other bad events.

We now bound the probability, p, that a bad event happens on e. Consider a particular I-frame, where $\log^2 I^{(q)} \le I \le 2\log^2 I^{(q)}$, that lies completely between the first and last $(I^{(q)})^2$ steps of B. By setting R = B, $r = c^{(q)}$, $s = I^{(q)}$, and t = I, we apply Lemma 6.2 to show that for any constant $k_1$ there is some value $\gamma_1 = \Theta(1)/\log I^{(q)}$ such that the probability $p_1$ of a bad event happening on e in the I-frame is smaller than $(I^{(q)})^{-k_1}$. Since there are $O((I^{(q)})^3 \log^2 I^{(q)})$ possible I-frames in B, the probability that a bad event happens on e during any I-frame is $p < p_1\,O((I^{(q)})^3 \log^2 I^{(q)})$. We can set the value $k_1$ appropriately so that this probability is upper bounded by $O((I^{(q)})^{-7})$. Therefore, we have 4pd < 1, and our lemma follows from the Lovász local lemma.

Fig. 6.1. Situation after the first refinement step. [Figure: a block spanning time steps 1 to $2(I^{(q)})^3 + 2(I^{(q)})^2$, with fuzzy regions in the first and last $(I^{(q)})^2$ steps; between them, frames of size $\log^2 I^{(q)}$ or larger have relative congestion at most $(1 + \gamma_1)c^{(q)}$.]
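The delay-insertion rule of the first refinement step can be sketched as follows. The sketch is illustrative and makes two assumptions that the text leaves open: time is measured as a 0-based offset within the original block, and the wait inside each window of $I^{(q)}$ steps is placed at the start of the window. It also omits the rule that skips a new delay falling too close to an existing one. The function names are hypothetical.

```python
import random


def first_step_waits(x, I):
    """Offsets (within the original block of length 2I^3 + 2I^2 - I) at
    which a packet with assigned delay x in [1, I] waits: once in each of
    the first x windows of I steps, and once in each of the last I - x
    windows of I steps."""
    block_len = 2 * I**3 + 2 * I**2 - I
    front = [k * I for k in range(x)]
    back = [block_len - (k + 1) * I for k in range(I - x)]
    return sorted(front + back)


def random_waits(I, rng=random):
    """Draw the delay x uniformly from [1, I] and return (x, wait offsets)."""
    x = rng.randint(1, I)
    return x, first_step_waits(x, I)


if __name__ == "__main__":
    I = 5
    x, waits = random_waits(I, random.Random(0))
    print(x, waits)
```

Whatever value of x is drawn, the packet waits exactly $I^{(q)}$ times in total, which is why every rescheduled block grows by the same amount and the next block can be rescheduled independently.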

At the end of the first refinement step, the center of each block has small relative congestion for small frame sizes. However there are regions of (I (q) )2 steps at the beginning and end of each block that may have very large relative congestion. We call these “fuzzy” regions, and we deal with them in the second refinement step. The second refinement step for schedule S (q) . We start the second step of the refinement by relocating the block boundaries so that blocks still have 2(I (q) )3 + 2(I (q) )2 steps, but now the fuzzy regions that were at the beginning and end of adjacent blocks are in the center of new blocks. Then, a new block has two “clean” regions of (I (q) )3 steps each at the beginning and the end, and a fuzzy region of length 2(I (q) )2 steps in the center. As in the first step of the refinement we now concentrate on individual blocks. We first show that the relative congestion is not very large for frames of size (I (q) )2 or larger (even in the fuzzy region). Lemma 6.5. For any choice of delays in the first step of the refinement, the relative congestion in any frame of size (I (q) )2 or larger is at most (1 + 2/I (q) )c(q) . Proof. Without loss of generality we shall assume that all the sessions are integral. Consider an I-frame with I1 steps before the center of the block and I2 steps after the center. (I = I1 +I2 , and either I1 or I2 could be zero.) A packet crosses some edge e in

the $I_1$-frame only if it did so in some frame of length $I_1 + I^{(q)}$ before the delays were inserted. Therefore, at most $(I_1 + I^{(q)})c^{(q)}$ packets can cross edge e in the $I_1$-frame. Similarly, at most $(I_2 + I^{(q)})c^{(q)}$ packets can cross edge e in the $I_2$-frame. Therefore, the congestion in the I-frame can be at most $(I_1 + I_2 + 2I^{(q)})c^{(q)} = (I + 2I^{(q)})c^{(q)}$, and for $I \ge (I^{(q)})^2$ the relative congestion is at most $(1 + 2/I^{(q)})c^{(q)}$.

Fig. 6.2. Situation after relocating block boundaries. [Figure: a block spanning time steps 1 to $2(I^{(q)})^3 + 2(I^{(q)})^2$, with the fuzzy region between steps $(I^{(q)})^3$ and $(I^{(q)})^3 + 2(I^{(q)})^2$ and clean regions on either side.]

Now, in order to reduce the frame size in the fuzzy region, we consider only the active packets in each block B, and we assign a delay randomly, independently, and uniformly chosen from $[1, (I^{(q)})^2]$ to each active packet. A packet p with delay x waits once every $(I^{(q)})^3/x$ steps at the beginning of the block and once every $(I^{(q)})^3/((I^{(q)})^2 - x)$ steps at the end. As in the first step, a delay is not inserted if it is going to be within $I^{(q)}$ steps of an existing delay for a moving packet. The block length after the delay insertion is $2(I^{(q)})^3 + 3(I^{(q)})^2$, and the fuzzy region can be $(I^{(q)})^2$ steps longer, spanning steps $(I^{(q)})^3$ to $(I^{(q)})^3 + 3(I^{(q)})^2$.

The next lemma shows that there is some way of inserting delays so that the frame size in the fuzzy region is reduced, and the frame size and relative congestion in the rest of the block are increased by only a small amount.

Lemma 6.6. In a block B, there exists a way of choosing delays so that in the fuzzy region (i.e., interval $[(I^{(q)})^3, (I^{(q)})^3 + 3(I^{(q)})^2]$) the relative congestion of any frame of size $\log^2 I^{(q)}$ or larger is at most $(1 + \gamma_2)c^{(q)}$ for some $\gamma_2 = \Theta(1)/\log I^{(q)}$, and so that in the intervals $[I^{(q)}\log^3 I^{(q)}, (I^{(q)})^3]$ and $[(I^{(q)})^3 + 3(I^{(q)})^2, 2(I^{(q)})^3 + 3(I^{(q)})^2 - I^{(q)}\log^3 I^{(q)}]$ the relative congestion of any frame of size $\log^2 I^{(q)}$ or larger is at most $(1 + \gamma_3)c^{(q)}$ for some $\gamma_3 = \Theta(1)/\log I^{(q)}$.

Proof. As in Lemma 6.4, we will use the Lovász local lemma to prove the claim.
We associate a bad event with every edge $e$, so that a bad event happens on $e$ if, for any $I \ge \log^2 I^{(q)}$,
• more than $(1 + \gamma_2)c^{(q)} I$ packets cross $e$ in any $I$-frame in $[(I^{(q)})^3, (I^{(q)})^3 + 3(I^{(q)})^2]$ (the fuzzy region), or
• more than $(1 + \gamma_3)c^{(q)} I$ packets cross $e$ in any $I$-frame in $[I^{(q)} \log^3 I^{(q)}, (I^{(q)})^3]$ or $[(I^{(q)})^3 + 3(I^{(q)})^2, 2(I^{(q)})^3 + 3(I^{(q)})^2 - I^{(q)} \log^3 I^{(q)}]$.

The dependency, $d$, of the bad events is bounded as in Lemma 6.4. Two bad events on two edges are dependent if packets from some session $i \in A^{(q)}$ can use both edges. At most $O((I^{(q)})^3)$ packets cross any edge in a block, and each of them can cross at most $O((I^{(q)})^3)$ other edges. Therefore, $d = O((I^{(q)})^6)$.

Now, to bound the probability $p$ of a bad event happening on some edge $e$, we consider the three intervals separately and sum their respective probabilities. By Lemma 6.3 we need only consider frames of length $I$ such that $\log^2 I^{(q)} \le I \le 2 \log^2 I^{(q)}$. Take first some $I$-frame in $[(I^{(q)})^3, (I^{(q)})^3 + 3(I^{(q)})^2]$ (the fuzzy region). From Lemma 6.5 we know that the relative congestion for frames of size $(I^{(q)})^2$ or longer


is at most $(1 + 2/I^{(q)})c^{(q)} = \Theta(1)$. Then, by choosing $R = B$, $r = (1 + 2/I^{(q)})c^{(q)}$, $s = (I^{(q)})^2$, and $t = I$, we can use Lemma 6.2 to show that, for any constant $k_2$, there is some $\sigma_2 = \Theta(1)/\log I^{(q)}$ such that the probability $p_1$ of having relative congestion on $e$ in the $I$-frame larger than $c^{(q)}(1 + 2/I^{(q)})(1 + \sigma_2) = c^{(q)}(1 + \gamma_2)$ is smaller than $(I^{(q)})^{-k_2}$. Note that $\gamma_2 = \Theta(1)/\log I^{(q)}$.

Take now some $I$-frame in $[I^{(q)} \log^3 I^{(q)}, (I^{(q)})^3]$ which starts at step $j$. Given the way delays are inserted, by the $j$th step an active packet with delay $x$ has been delayed $jx/(I^{(q)})^3$ steps. Thus, the delay of an active packet at the $j$th step is essentially a random value uniformly chosen from $[1, j/I^{(q)}]$. For $j \ge I^{(q)} \log^3 I^{(q)}$ the value $j/I^{(q)} \ge \log^3 I^{(q)}$. Note that before inserting delays, from Lemma 6.4 the relative congestion in any frame of length $\log^2 I^{(q)}$ or larger in the interval $[1, (I^{(q)})^3]$ was at most $(1 + \gamma_1)c^{(q)}$. Then, we can make $R = [1, (I^{(q)})^3]$, $r = (1 + \gamma_1)c^{(q)}$, $s = \log^3 I^{(q)}$, and $t = I$, and we use Lemma 6.2 to show, for any constant $k_3$, the existence of some $\sigma_3 = \Theta(1)/\log I^{(q)}$ such that the probability $p_2$ of having relative congestion larger than $(1 + \sigma_3)(1 + \gamma_1)c^{(q)} = (1 + \gamma_3)c^{(q)}$ on $e$ in the $I$-frame is smaller than $(I^{(q)})^{-k_3}$. Again, $\gamma_3 = \Theta(1)/\log I^{(q)}$.

By symmetry, the same value $\gamma_3$ makes the probability of a bad event happening on $e$ in some $I$-frame in $[(I^{(q)})^3 + 3(I^{(q)})^2, 2(I^{(q)})^3 + 3(I^{(q)})^2 - I^{(q)} \log^3 I^{(q)}]$ smaller than $(I^{(q)})^{-k_3}$.

There are $O((I^{(q)})^3 \log^2 I^{(q)})$ possible $I$-frames as described in total. Hence, we can choose values for $k_2$ and $k_3$ such that the probability of a bad event is bounded as $p \le (p_1 + 2p_2)\,O((I^{(q)})^3 \log^2 I^{(q)}) < O((I^{(q)})^{-7})$. Since $d = O((I^{(q)})^6)$, we can guarantee $4pd < 1$ and invoke the Lovász local lemma to prove the claim.

Finally, we bound the frame size and the relative congestion in the remaining intervals of the block in the following lemma.

Lemma 6.7.
The relative congestion in any frame of size $\log^4 I^{(q)}$ or larger in the intervals $[1, I^{(q)} \log^3 I^{(q)}]$ and $[2(I^{(q)})^3 + 3(I^{(q)})^2 - I^{(q)} \log^3 I^{(q)}, 2(I^{(q)})^3 + 3(I^{(q)})^2]$ is at most $(1 + \gamma_1)(1 + 1/\log I^{(q)})c^{(q)} = (1 + \gamma_4)c^{(q)}$.

Proof. Let us first consider some $I$-frame in $[1, I^{(q)} \log^3 I^{(q)}]$. Recall that, before inserting delays, the relative congestion for frames of size $\log^2 I^{(q)}$ or more was at most $(1 + \gamma_1)c^{(q)}$. In the interval no packet is delayed more than $\log^3 I^{(q)}$ steps. Therefore, the packets crossing some edge $e$ in the $I$-frame could have crossed $e$ in some interval of at most $I + \log^3 I^{(q)}$ steps, and the congestion in the $I$-frame can be at most $(I + \log^3 I^{(q)})(1 + \gamma_1)c^{(q)}$. For $I \ge \log^4 I^{(q)}$ the claim follows. The proof for interval $[2(I^{(q)})^3 + 3(I^{(q)})^2 - I^{(q)} \log^3 I^{(q)}, 2(I^{(q)})^3 + 3(I^{(q)})^2]$ is similar.

From the above two lemmas we have that any frame of length at least $\log^4 I^{(q)}$ in each of the different intervals has relative congestion at most $(1 + \gamma)c^{(q)}$, where $\gamma = \max(\gamma_2, \gamma_3, \gamma_4)$ and $\gamma = O(1)/\log I^{(q)}$. We need to be careful now with the relative congestion in frames that overlap several intervals or several blocks. We can safely say that for any frame of size $I^{(q+1)} = \log^5 I^{(q)}$ or larger in the schedule $S^{(q+\frac{1}{2})}$ obtained after the frame refinement, the relative congestion is at most $(1 + \delta^{(q)})c^{(q)}$ for some $\delta^{(q)} = \beta/\sqrt{\log I^{(q)}}$ large enough.

6.3. Conversion for schedule $S^{(q)}$. In the conversion process we transform the schedule $S^{(q+\frac{1}{2})}$, obtained from the frame refinement step, into a new schedule $S^{(q+1)}$. In this new schedule, all the sessions in $B^{(q+1)}$ which were fractional in $S^{(q)}$


Fig. 6.3. Session-$i$ packets that are injected in interval $U$ are assigned initial tokens in interval $V$. The interval $V$ is divided into consecutive intervals of length $\ell_i$, each of which has $s_i$ initial tokens. The initial tokens are shown as solid dots. (Axes: arrival time and service time, marked at $T - T_i$, $T$, $2T - T_i$, and $2T$.)

have been made integral, and the relative congestion of frames of size $I^{(q+1)}$ or larger is at most $c^{(q+1)} = (1 + \delta^{(q)})^2 c^{(q)}$.

At the beginning of this step, we inductively assume that the relative congestion is at most $(1 + \delta^{(q)})c^{(q)}$ for any frame of size $I^{(q+1)}$ or larger, where $\delta^{(q)} = \beta/\sqrt{\log I^{(q)}}$. If the set $B^{(q+1)}$ is empty, then we skip this conversion step; clearly, the relative congestion is at most $c^{(q+1)}$ for any frame of size $I^{(q+1)}$, and we are done. On the other hand, if the set $B^{(q+1)}$ is not empty, then for each session $i \in B^{(q+1)}$ we apply the following two processes.
(a) In the discretization process we convert the schedule for fractional session-$i$ packets into a schedule for integral packets in which no packet has to wait too long before it starts moving.
(b) In the delay-insertion process we delay the time at which packets start moving (i.e., we insert initial delays) in such a way that the relative congestion requirements are satisfied.

Discretization. We first show how to transform a fractional session in $B^{(q+1)}$ into an integral session. Consider a session $i$ in $B^{(q+1)}$. When session $i$ is fractional, a packet of size $\hat{r}_i = s_i/\ell_i$ is injected at every time step, where $\ell_i$ and $s_i$ are integer constants defined in section 5. A fractional packet marches to its destination one edge at a time with no delay. We want to replace these fractional packets by integral packets. An integral packet waits at its source until it finds an unused initial token. Then, it crosses one edge every time step until it reaches its destination. The number of initial tokens and their distribution have to be carefully chosen so that no packet waits at its source for too long.

To transform session $i$, we consider the two intervals shown in Figure 6.3, $U = [T - T_i, 2T - T_i)$ and $V = [T, 2T - T_i)$. When session $i$ is converted, we distribute enough initial tokens in the interval $V$ to accommodate all the session-$i$ arrivals during $U$.
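One way to picture this token distribution is the small sketch below. It assumes the layout of Figure 6.3, with $V$ cut into $\ell_i$-intervals counted from its right end and $s_i$ tokens in the last slot of each. The function name and the dictionary representation are illustrative choices, and giving tokens to the last slot of an incomplete first interval as well is an assumption.

```python
def place_initial_tokens(T, T_i, ell_i, s_i):
    """Return {time_step: token_count} for the interval V = [T, 2T - T_i).

    Assumption: V is split into ell_i-intervals starting from its right end,
    and every interval (including an incomplete first one) receives s_i
    tokens in its last slot.
    """
    v_start, v_end = T, 2 * T - T_i   # V is half-open: v_end is excluded
    tokens = {}
    slot = v_end - 1                  # last slot of the last ell_i-interval
    while slot >= v_start:
        tokens[slot] = s_i
        slot -= ell_i                 # last slot of the previous interval
    return tokens


# Hypothetical numbers: T = 20, T_i = 5, ell_i = 4, s_i = 2, so V = [20, 35).
example = place_initial_tokens(20, 5, 4, 2)
```

With these numbers the tokens appear once every $\ell_i = 4$ steps in batches of $s_i = 2$, that is, at rate $s_i/\ell_i = \hat{r}_i$.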
Integral packets arrive at a rate $r_i$ during $U$, and initial tokens will appear at a rate roughly equal to $\hat{r}_i$ during $V$. Recall from section 5 that $\hat{r}_i$ is slightly larger than $r_i$. By choosing the interval $U$ long enough (i.e., $T$ large enough), we guarantee that there are more initial tokens than arrivals. Let $|V| = T - T_i$ be the length of interval $V$. We divide $V$ into consecutive intervals of length $\ell_i$ (starting from the end), and we put $s_i$ initial tokens in the last slot of each $\ell_i$-interval. Note that if $|V|$ is not an integer multiple of $\ell_i$, then the first $\ell_i$-interval is "incomplete." (See Figure 6.3.) We show that there are enough initial


tokens and that no packet waits too long for an unused one.

Lemma 6.8. For a converted session $i \in B^{(q+1)}$, every session-$i$ packet that is injected during $U$ finds an unused initial token in $V$ within $T_i + \ell_i = O(1/r_i + D_i)$ steps of its injection.

Proof. Let $x = T/(T - T_i)$ be the ratio of the length of interval $U$ to the length of interval $V$. It suffices to show that $s_i$, the number of initial tokens in an $\ell_i$-interval (shown in Figure 6.3), is as large as the number of session-$i$ arrivals during an interval of length $x\ell_i$. At most $n = x\ell_i r_i + 1$ packets can arrive during $x\ell_i$ steps. Since $T \ge (1 + 4/\varepsilon)\max_i T_i$ by definition, we have $x \le 1 + \varepsilon/4$ and $n \le \ell_i r_i(1 + \varepsilon/4) + 1$. By the left-hand side of property 1 of Lemma 5.1, we have $n \le s_i$. Therefore, we have enough initial tokens. Since the initial tokens are at the end of an $\ell_i$-interval, each packet can use an initial token that appears after the packet arrival time. It is also easy to verify that an unused initial token appears within $T_i + \ell_i = O(T_i)$ steps of the packet injection.

Delay insertion. Before any delay is inserted for a packet from session $i \in B^{(q+1)}$, the packet leaves its source at the time of its initial token and marches to its destination with no more waiting. Now we insert an initial delay for each session-$i$ packet, which has the effect of deferring the start time of the packet. We choose the delays uniformly from $[1, \ell_i]$. After the initial delay each packet travels to its destination without further delay.

Lemma 6.9. Consider a particular edge $e$ and a particular $t$-frame during interval $[T, 2T)$. Suppose session $i$ requires edge $e$; then the expected number of session-$i$ packets that use $e$ in the $t$-frame is at most $t s_i/\ell_i = t\hat{r}_i$.

Proof. Let us assume first that delays have not been inserted yet. Due to the way initial tokens are distributed, session-$i$ packets cross edge $e$ in a very synchronous manner: a batch of at most $s_i$ packets crosses every $\ell_i$ steps.
Since we want an upper bound on the expectation, we assume that exactly $s_i$ packets cross $e$ every $\ell_i$ steps. Let us now partition time into $\ell_i$-intervals, so that each interval ends with a step in which packets cross $e$ (i.e., all packets cross $e$ in the last step of the intervals). Observe that, once delayed, all the packets that crossed $e$ in the last step of some $\ell_i$-interval will cross it in the following interval. Then, the total number of packets crossing $e$ in an $\ell_i$-interval after the delay insertion is exactly $s_i$. Also, after the insertion of delays, the expected number of packets crossing $e$ in some subinterval of length $\ell$ of an $\ell_i$-interval is exactly $\ell s_i/\ell_i$.

Take now the $t$-frame, and consider the incomplete $\ell_i$-intervals it contains. There can be at most one at the beginning and one at the end. Assume they have lengths $t_1$ and $t_2$, respectively. From the above observations, the expected number of packets crossing $e$ in the $t_1$ (resp., $t_2$) subinterval is $t_1 s_i/\ell_i$ (resp., $t_2 s_i/\ell_i$). In the remainder of the $t$-frame the number of packets crossing is exactly $(t - t_1 - t_2)s_i/\ell_i$. Hence, the expected number of packets crossing $e$ in the $t$-frame is $(t - t_1 - t_2)s_i/\ell_i + t_1 s_i/\ell_i + t_2 s_i/\ell_i = t s_i/\ell_i$.

We now use a Chernoff bound and the Lovász local lemma to show the following.

Lemma 6.10. There exists a way of choosing the initial delays for sessions in $B^{(q+1)}$ such that the relative congestion in any frame of size $I^{(q+1)}$ or bigger is at most $c^{(q+1)}$ after the delays are inserted.

Proof. Due to Lemma 6.3, it is sufficient to prove the result for all frames of size $I^{(q+1)}$ to $2I^{(q+1)}$. We associate a bad event with each edge $e$ and each $I$-frame, where $I^{(q+1)} \le I \le 2I^{(q+1)}$. A bad event $E_{\{e,I\}}$ happens when more than $I c^{(q+1)}$ packets use $e$ during frame $I$. We use the Lovász local lemma to show that with nonzero


probability no bad event occurs. Let $D_{\max} = \max_{i \in B^{(q+1)}} D_i$, $r_{\min} = \min_{i \in B^{(q+1)}} r_i$, $X = \max_{i \in B^{(q+1)}} (D_i + 1/r_i)$, and $\ell_{\max} = \max_{i \in B^{(q+1)}} \ell_i$.

We first bound the dependency $d$ of bad events. Note that the probability space is given by the delays assigned to packets from sessions in $B^{(q+1)}$. Hence, a bad event $E_{\{e,I\}}$ is dependent on another bad event $E_{\{e',I'\}}$ only if there is a packet $p$ from a session $i \in B^{(q+1)}$ such that there is a nonzero probability that $p$ uses $e$ during the $I$-frame and there is a nonzero probability that $p$ uses $e'$ during the $I'$-frame. There are at most $1/r_{\min}$ sessions in $B^{(q+1)}$, each of which is at most $D_{\max}$ long. Therefore, $E_{\{e,I\}}$ depends on $E_{\{e',I'\}}$ for at most $D_{\max}/r_{\min} = O(X^2)$ choices of $e'$. Furthermore, intervals $I$ and $I'$ cannot be more than $D_{\max} + \ell_{\max}$ steps apart. (Otherwise any session-$i$ packet either has probability 0 of crossing edge $e$ during $I$ or probability 0 of crossing $e'$ during $I'$.) Therefore, the starting point of $I'$ is limited to $2D_{\max} + 2\ell_{\max} + 4I^{(q+1)}$ locations, and the total number of possible choices for $I'$ is at most $(2D_{\max} + 2\ell_{\max} + 4I^{(q+1)})I^{(q+1)} = O(X(I^{(q+1)})^2)$. We conclude that the dependency $d$ is $O(X^3(I^{(q+1)})^2)$.

We now bound the probability $p$ that a bad event $E_{\{e,I\}}$ happens. By our inductive assumption, the frame congestion on edge $e$ during the $I$-frame is at most $(1 + \delta^{(q)})c^{(q)} I$ before the conversion. Let $S$ be the set of sessions in $B^{(q+1)}$ that use edge $e$. When sessions in $B^{(q+1)}$ are fractional, they contribute exactly $I\sum_{i \in S} \hat{r}_i$ to the frame congestion. Lemma 6.9 implies that the expected frame congestion due to the sessions in $B^{(q+1)}$ is at most $I\sum_{i \in S} \hat{r}_i$ after the initial delays are inserted. The congestion due to sessions not in $B^{(q+1)}$ does not change during the conversion. Hence, the expected frame congestion on edge $e$ during the $I$-frame is at most $(1 + \delta^{(q)})c^{(q)} I = \mu$. We bound the probability of $E_{\{e,I\}}$ as follows.
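\[
\begin{aligned}
p &= \Pr\bigl[\text{Frame congestion on } e \text{ in } I > c^{(q+1)} I\bigr] \\
  &= \Pr\bigl[\text{Frame congestion on } e \text{ in } I > (1 + \delta^{(q)})\mu\bigr] \\
  &\le e^{-(\delta^{(q)})^2 \mu/3} \\
  &\le e^{-(1-\varepsilon)\beta^2 I^{(q+1)}/(3 \log I^{(q)})} \\
  &\le e^{-(1-\varepsilon)\frac{\beta^2}{3}(I^{(q+1)})^{1/5}(I^{(q+1)})^{3/5}} \\
  &\le e^{-(1-\varepsilon)\frac{\beta^2}{3}(I^{(q+1)})^{1/5}\log^{6/5} X}.
\end{aligned}
\]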

The first inequality follows from Lemma 2.3. The second inequality holds since $\mu > (1 - \varepsilon)I \ge (1 - \varepsilon)I^{(q+1)}$ and from the definition of $\delta^{(q)}$. The third inequality follows from the recurrence for $I^{(q+1)}$. The last inequality follows from the fact that $\log^2 X \le I^{(q+1)}$. (This explains the need for $\log^2 X_i \le I^{(q+1)}$ in the definition of $B^{(q+1)}$.) When $\beta$ is a sufficiently large constant, we have $4dp < 1$. Hence, the Lovász local lemma implies that with nonzero probability no bad events occur. That is, there exists a way to choose the initial delays for sessions in $B^{(q+1)}$ such that for all frames of size $I^{(q+1)}$ or larger the relative congestion is at most $c^{(q+1)}$.

Note that in the proof of this lemma we associate a bad event with each edge $e$ and each interval $I$. Why couldn't we associate a bad event with each edge $e$ only and then use a union bound on the number of intervals, as in Lemma 6.4? This is because we are considering all the session-$i$ packets during an interval of length $T$, which can be much bigger than $1/r_i + D_i$ for some sessions $i$.

6.4. Termination at schedule $S^{(\zeta)}$. The succession of refinement and conversion terminates at schedule $S^{(\zeta)}$ when the frame size $I^{(\zeta)}$ becomes smaller than


or equal to $w$, a constant defined in section 5. The following lemma shows that the relative congestion of $S^{(\zeta)}$ is small.

Lemma 6.11. In the schedule $S^{(\zeta)}$ all sessions are integral and the relative congestion is at most $c^{(\zeta)} < 1$ for any frame of size $I^{(\zeta)}$ or larger.

Proof. One can verify that the sets $B^{(q+1)}$ form a partition of all the sessions. Therefore, all the sessions are integral in the schedule $S^{(\zeta)}$. By our induction, the relative congestion is at most $c^{(\zeta)}$ for all frames of size $I^{(\zeta)}$ or larger. Hence, we only need to show that $c^{(\zeta)} < 1$. Due to the termination conditions, $x \le I^{(\zeta-1)}$, where $x$ is defined in section 5. Let $\Delta = \beta/\sqrt{\log x}$, and observe that $\delta^{(\zeta-1)} \le \Delta < 1$. By the recursive definition of $c^{(\zeta)}$, we have

\[
\begin{aligned}
c^{(\zeta)} &= (1 + \delta^{(\zeta-1)})^2 (1 + \delta^{(\zeta-2)})^2 \cdots (1 + \delta^{(0)})^2 c^{(0)} \\
 &< (1 + \Delta)^2 (1 + \Delta^2)^2 (1 + \Delta^4)^2 (1 + \Delta^8)^2 \cdots c^{(0)} \\
 &\le (1 - \Delta)^{-2} \bigl\{ (1 - \Delta)^2 (1 + \Delta)^2 (1 + \Delta^2)^2 (1 + \Delta^4)^2 (1 + \Delta^8)^2 \cdots \bigr\} c^{(0)} \\
 &\le (1 - \Delta)^{-2} c^{(0)} \\
 &= \bigl(1 - \beta/\sqrt{\log x}\bigr)^{-2} c^{(0)} \\
 &= \frac{1 - \varepsilon/2}{1 - \varepsilon/2} = 1.
\end{aligned}
\]
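The product in braces in the display above telescopes, because $(1 - \Delta)(1 + \Delta) = 1 - \Delta^2$, $(1 - \Delta^2)(1 + \Delta^2) = 1 - \Delta^4$, and so on. The following sketch checks this identity with exact rational arithmetic; the function name and the sample value of $\Delta$ are illustrative assumptions, not values from the paper.

```python
from fractions import Fraction


def telescoped_product(delta, terms):
    """Return (1 - delta) * prod_{k < terms} (1 + delta**(2**k)) exactly."""
    product = 1 - delta
    for k in range(terms):
        product *= 1 + delta ** (2 ** k)
    return product


delta = Fraction(1, 3)  # any value with 0 < delta < 1 behaves the same way
for terms in range(1, 6):
    value = telescoped_product(delta, terms)
    # Closed form after `terms` factors: 1 - delta**(2**terms), which is < 1.
    assert value == 1 - delta ** (2 ** terms)
    assert value < 1
```

Squaring every factor, as in the display, keeps the product below 1, which is all the third inequality needs.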

The first inequality holds since $\delta^{(q)} < (\delta^{(q+1)})^2$ for all $q$ by the recurrence defined in section 5. The third inequality holds since $\Delta < 1$, and therefore the "telescope product" in the braces is less than 1. The last equality holds by the above choice of $x$ and the definition of $c^{(0)}$ in section 5.

Now, we have to make sure that in the resulting schedule $S^{(\zeta)}$ no packet waits too long. The conversion step guarantees that when a session $i$ becomes integral, no packet waits more than $O(D_i + 1/r_i)$ steps before it starts moving, and it does not wait anymore. The last frame refinement step also guarantees that a moving packet never waits more than once every $I^{(\zeta-1)}$ steps. However, all the frame refinement steps that an integral packet has to go through can, in fact, delay the time it starts moving. The following lemma shows that this delay does not add up to a large amount, and therefore that a session-$i$ packet reaches its destination in at most $O(D_i + 1/r_i)$ steps in the schedule $S^{(\zeta)}$.

Lemma 6.12. During frame refinement a session-$i$ packet is delayed by at most $2(D_i + 1/r_i)$ steps before it starts moving.

Proof. Suppose session $i$ first becomes integral in schedule $S^{(q')}$. Consider a session-$i$ packet $p$. For schedule $S^{(q)}$, where $q \le q' - 1$, $p$ is never delayed during frame refinement. For schedule $S^{(q)}$, where $q \ge q'$, $p$ is delayed by at most $I^{(q)} + (I^{(q)})^2$ steps before it starts moving. Therefore, the total delay inserted during all the frame refinement steps is at most $\sum_{q \ge q'} \bigl(I^{(q)} + (I^{(q)})^2\bigr)$. Since session $i$ becomes integral for schedule $S^{(q')}$, we must have $i \in B^{(q')}$. By the definition of $B^{(q')}$, $D_i + 1/r_i \ge (I^{(q')})^2$. Since $I^{(q)}$ decreases polylogarithmically, a session-$i$ packet is delayed during frame refinement by at most $2(D_i + 1/r_i)$ steps before it starts moving.

We proceed to prove that $S^{(\zeta)}$ has all the properties.

Theorem 6.13.
Given network $M$ and a set of sessions as defined in section 1.2, there is a schedule $S^{(\zeta)}$ such that the following hold.


1. The relative congestion is at most 1 for any frame of size larger than a certain constant.
2. After leaving its source, each packet waits at most once every $O(1)$ steps, which implies that all edge queues in $M$ have size $O(1)$.
3. For all sessions $i$, any session-$i$ packet reaches its destination within $O(1/r_i + D_i)$ steps of its injection.
4. All session-$i$ arrivals during $[T - T_i, 2T - T_i)$ are serviced during $[T, 2T)$, i.e., all packets leave their source no earlier than $T$ and reach their destination before $2T$.

Proof.
1. By Lemma 6.11, the relative congestion is at most 1 for any frame of size $I^{(\zeta)}$ or larger. Due to the termination conditions $I^{(\zeta)}$ is a constant.
2. By the invariant maintained throughout the frame refinement steps, a packet waits at most once every $I^{(\zeta-1)}$ steps once it leaves its source. In addition, by property 1 above, at most $I^{(\zeta)}$ packets cross an edge during any time step. Therefore, the edge queues have size at most $2I^{(\zeta)}$.
3. We first show that a session-$i$ packet reaches its destination within $T_i$ steps after it obtains an initial token. After the initial token, a session-$i$ packet is deferred by an initial delay during the conversion step and other delays during the frame refinement step before it can leave its source. The initial delay is at most $\ell_i < 1 + 8/(\varepsilon r_i)$, and the delay during the refinement is at most $2(D_i + 1/r_i)$ by Lemma 6.12. Once the packet starts moving, it reaches its destination in at most $2D_i$ steps by property 2. Therefore, a session-$i$ packet reaches its destination within $4D_i + 1 + (8/\varepsilon + 2)/r_i < T_i$ steps after obtaining its initial token. Since any session-$i$ packet obtains an initial token within $T_i + \ell_i$ steps of its injection by Lemma 6.8, the packet reaches its destination within $2T_i + \ell_i = O(1/r_i + D_i)$ steps of its injection.
4. For all session-$i$ arrivals during $[T - T_i, 2T - T_i)$, the initial tokens are in $[T, 2T - T_i)$.
From the discussion of property 3, a session-$i$ packet reaches its destination within $T_i$ steps after it obtains an initial token. Therefore, all packets leave their sources no earlier than $T$ and reach their destinations before $2T$.

6.5. The final schedule for the original network $N$. We now describe how to create a schedule $S_N$ for network $N$ from $S^{(\zeta)}$. In $S_N$ at most one packet at a time crosses each edge in $N$. Recall that in the construction of $M$ from $N$, each edge $e$ in $N$ is replaced by $2w$ consecutive edges $e_1, \ldots, e_{2w}$, where $w$ is a constant defined in section 5.

We first partition the time interval $[T, 2T)$ into consecutive intervals of length $w$ (recall that by definition $T$ is a multiple of $w$). For each $w$-interval and each edge $f$ in $M$, as many as $w$ packets, $p_1, p_2, \ldots, p_w$, can cross $f$ during the $w$-interval by schedule $S^{(\zeta)}$. We smooth out $S^{(\zeta)}$ so that $p_j$ is the $j$th packet to cross $f$ in the $w$-interval, where $p_1, \ldots, p_w$ represents an arbitrary ordering. After smoothing, a packet may not be scheduled to cross the edges on its route in order. For example, a packet may be scheduled to cross edge $f$ before $g$, whereas $f$ follows $g$ on the route in $M$. A packet may also be scheduled to leave its source before its injection time. However, $S^{(\zeta)}$ after smoothing does have the property that at most one packet at a time crosses each edge.

We define $S_N$ as follows. $S_N$ schedules a packet $p$ to cross $e$ in $N$ at time $t$ if and only if $S^{(\zeta)}$ after smoothing schedules $p$ to cross $e_{2w}$ in $M$ at time $t$.
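The smoothing step can be sketched as follows. This is a minimal illustration, assuming a schedule is represented as a mapping from each edge of $M$ to a list of (time, packet) crossings and that at most $w$ packets cross an edge in any $w$-interval. The function name `smooth` and this data layout are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def smooth(schedule, w, start):
    """Spread the crossings of each edge so that one packet crosses per step.

    schedule: dict mapping edge -> list of (time, packet) crossings, with all
              times at or after `start` (the left end of the first w-interval).
    Returns a dict of the same shape in which the j-th packet of each
    w-interval crosses at the j-th step of that interval.
    """
    smoothed = {}
    for edge, crossings in schedule.items():
        buckets = defaultdict(list)          # w-interval index -> packets
        for time, packet in crossings:
            buckets[(time - start) // w].append(packet)
        result = []
        for k in sorted(buckets):
            packets = buckets[k]             # arbitrary ordering p_1, ..., p_w
            if len(packets) > w:
                raise ValueError("more than w crossings in one w-interval")
            for j, packet in enumerate(packets):
                result.append((start + k * w + j, packet))
        smoothed[edge] = result
    return smoothed


# Hypothetical example: two packets cross edge "f" at the same step.
example = smooth({"f": [(10, "a"), (10, "b"), (12, "c"), (13, "d")]}, w=3, start=9)
```

Each crossing stays inside its own $w$-interval, so it moves by at most $w - 1$ steps, which is the shift used in the proof of Lemma 6.14. Reading off the smoothed crossings of $e_{2w}$ for each edge $e$ of $N$ then gives $S_N$.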


Lemma 6.14. In $S_N$, each packet is scheduled to leave its source after its injection and is scheduled to cross the edges on its route in order.

Proof. We first show that each packet crosses the edges on its route in order. Consider a packet $p$. Let $e$ and $\hat{e}$ be two edges on $p$'s route in $N$, where $\hat{e}$ follows $e$. Let $t$ and $\hat{t}$ be the times that $p$ crosses $e$ and $\hat{e}$ in schedule $S_N$. We shall show that $t < \hat{t}$. Let $e_{2w}$ and $\hat{e}_{2w}$ be the edges in $M$ that correspond to $e$ and $\hat{e}$. Let $\tau$ and $\hat{\tau}$ be the times that $p$ crosses $e_{2w}$ and $\hat{e}_{2w}$ in the schedule $S^{(\zeta)}$ before smoothing. Since $p$ crosses the edges in $M$ in order before smoothing, we have

\[
\tau + 2w \le \hat{\tau}. \tag{6.3}
\]

In schedule $S_N$, packet $p$ crosses $e$ at time $t$, which is shifted by at most $w - 1$ steps from $\tau$. Similarly, $\hat{t}$ is shifted by at most $w - 1$ steps from $\hat{\tau}$. Hence we have

\[
\tau - (w - 1) \le t \le \tau + (w - 1), \qquad
\hat{\tau} - (w - 1) \le \hat{t} \le \hat{\tau} + (w - 1).
\]

From (6.3) and the above inequalities, we have $t < \hat{t}$. Therefore, $p$ crosses the edges on its route in order.

The proof that packet $p$ leaves its source after its injection time is similar. Suppose that $p$ is injected into the network at time $s$. Let edge $e$ be the first edge on the route of $p$ in network $N$, and let $t$ be the time that $p$ crosses $e$ in $S_N$. Also, let $e_{2w}$ be the corresponding edge in $M$, and let $\tau$ be the time that $p$ crosses $e_{2w}$ in $S^{(\zeta)}$ before smoothing. Since in $S^{(\zeta)}$ before smoothing $p$ crosses the edges in order and leaves its source after its injection, we have $s + 2w \le \tau$. In schedule $S_N$, packet $p$ crosses $e$ at time $t$, which is shifted by at most $w - 1$ steps from $\tau$. Hence we have

\[
\tau - (w - 1) \le t \le \tau + (w - 1).
\]

Therefore, $s < t$ and packet $p$ leaves its source in $N$ after the injection time.

We summarize the properties of $S_N$.

Theorem 6.15. Schedule $S_N$ satisfies the following properties.
1. At most one packet at a time crosses each edge in $N$.
2. After leaving its source, each packet waits a constant number of steps to cross an edge, which implies all the edge queues in $N$ have a constant size.
3. For all sessions $i$, any session-$i$ packet reaches its destination within $O(1/r_i + d_i)$ steps of its injection.
4. All session-$i$ arrivals during $[T - T_i, 2T - T_i)$ are serviced during $[T, 2T)$, i.e., all packets leave their source no earlier than $T$ and reach their destination before $2T$.

Proof. The smoothing process guarantees property 1. Properties 2 and 3 come from properties 2 and 3 of $S^{(\zeta)}$ given in Theorem 6.13, the construction of $M$ from $N$, and the fact that each packet is scheduled to reach its destination in $S_N$ at most $w$ steps later than in $S^{(\zeta)}$. To see property 4, recall that the interval $[T, 2T)$ is partitioned into intervals of size $w$ (with one interval possibly longer than $w$), and that schedule $S^{(\zeta)}$ is smoothed


out within each $w$-interval. Therefore, if a packet is scheduled to cross an edge $e$ during $[T, 2T)$ according to $S^{(\zeta)}$, the packet must also be scheduled to cross $e$ during $[T, 2T)$ according to $S_N$. Property 4 follows.

Property 4 of the above theorem implies that all intervals of $[0, T)$, $[T, 2T)$, etc. can be scheduled independently.

6.6. Derivation of the templates. We now describe how to transform $S_N$ into a template-based schedule. Property 4 of Theorem 6.15 says that all packets considered in schedule $S_N$ (those injected in interval $[T - T_i, 2T - T_i)$ for each session $i$) move from their sources to their destinations during interval $[T, 2T)$. For this reason, we choose $T$ as the size of each template.

Recall that in the conversion step of section 6.3, the placement of the initial tokens is independent of the actual packet arrival times. The placement is simply a result of randomization added onto the fixed configuration shown in Figure 6.3. As we have shown, even if each session-$i$ initial token is owned by a session-$i$ packet, we can schedule these packets by schedule $S_N$. Then, if we place a session-$i$ token in the template of edge $e$ whenever a session-$i$ packet crosses $e$ in $S_N$, the movement of each packet determines a token sequence, and these token sequences define the locations of all the tokens. We emphasize that the placement of these tokens is fixed, as the placement of the initial tokens is. Obviously, the token lag is $O(1)$ for all sequences and the end-to-end delay is $O(1/r_i + d_i)$ for all session-$i$ token sequences. Since each session-$i$ packet is able to obtain an initial token within $O(1/r_i + d_i)$ steps of its injection, Theorem 2.2 implies that the template-based schedule defined by the token sequences achieves a delay bound of $O(1/r_i + d_i)$ and constant-size edge queues. In summary, we have the following theorem.

Theorem 6.16.
Consider an arbitrary network in which sessions are defined. Each session $i$ is associated with an injection rate $r_i$ and path length $d_i$. Packets are injected into the network along these sessions subject to the injection rates. If the total rate on each edge is at most $1 - \varepsilon$ for a constant $\varepsilon \in (0, 1)$, then there exists a template-based schedule such that each session-$i$ packet reaches its destination within $O(1/r_i + d_i)$ steps of its injection and at most one packet crosses an edge at each time step. This schedule also maintains constant-size edge queues.

Acknowledgments. The authors wish to thank Bruce Maggs, Greg Plaxton, and Salil Vadhan for many helpful comments.

REFERENCES

[1] M. Andrews, B. Awerbuch, A. Fernández, J. Kleinberg, T. Leighton, and Z. Liu, Universal stability results for greedy contention-resolution protocols, in Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, Burlington, VT, 1996, pp. 380–389.
[2] M. Andrews and L. Zhang, Minimizing end-to-end delay in high-speed networks with a simple coordinated schedule, in Proceedings of the IEEE Computer and Communication Societies, INFOCOM, IEEE Computer Society, Los Alamitos, CA, 1999, pp. 380–388.
[3] J. Beck, An algorithmic approach to the Lovász Local Lemma I, Random Structures Algorithms, 2 (1991), pp. 343–365.
[4] A. Borodin, J. Kleinberg, P. Raghavan, M. Sudan, and D. Williamson, Adversarial queueing theory, in Proceedings of the 28th Annual ACM Symposium on Theory of Computing, Philadelphia, PA, 1996, pp. 376–385.
[5] H. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Ann. Math. Stat., 23 (1952), pp. 493–509.
[6] R. L. Cruz, A calculus for network delay, Part I: Network elements in isolation, IEEE Trans. Inform. Theory, 37 (1991), pp. 114–131.


[7] R. L. Cruz, A calculus for network delay, Part II: Network analysis, IEEE Trans. Inform. Theory, 37 (1991), pp. 132–141.
[8] A. Demers, S. Keshav, and S. Shenker, Analysis and simulation of a fair queueing algorithm, J. Internetworking: Research and Experience, 1 (1990), pp. 3–26.
[9] A. Elwalid, D. Mitra, and R. H. Wentworth, A new approach for allocating buffers and bandwidth to heterogeneous, regulated traffic in an ATM node, IEEE J. Selected Areas in Communications, 13 (1995), pp. 1115–1127.
[10] F. T. Leighton, B. M. Maggs, and S. B. Rao, Packet routing and job-shop scheduling in O(congestion + dilation) steps, Combinatorica, 14 (1994), pp. 167–186.
[11] F. T. Leighton, B. M. Maggs, and A. W. Richa, Fast Algorithms for Finding O(Congestion + Dilation) Packet Routing Schedules, Technical report CMU-CS-96-152, Carnegie Mellon University, Pittsburgh, PA, 1996.
[12] F. T. Leighton and G. Plaxton, Hypercubic sorting networks, SIAM J. Comput., 27 (1998), pp. 1–47.
[13] R. Ostrovsky and Y. Rabani, Universal $O(\text{congestion} + \text{dilation} + \log^{1+\varepsilon} N)$ local control packet switching algorithm, in Proceedings of the 29th Annual ACM Symposium on Theory of Computing, El Paso, TX, 1997, pp. 644–653.
[14] A. K. Parekh and R. G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: The single-node case, IEEE/ACM Trans. Networking, 1 (1993), pp. 344–357.
[15] A. K. Parekh and R. G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: The multiple-node case, IEEE/ACM Trans. Networking, 2 (1994), pp. 137–150.
[16] Y. Rabani and E. Tardos, Distributed packet switching in arbitrary networks, in Proceedings of the 28th Annual ACM Symposium on Theory of Computing, Philadelphia, PA, 1996, pp. 366–375.
[17] J. Spencer, Ten Lectures on the Probabilistic Method, Capital City Press, Philadelphia, PA, 1994.
[18] J. S. Turner, New directions in communications, or which way to the information age, IEEE Communications Magazine, 24 (1986), pp. 8–15.
[19] L. Zhang, An Analysis of Network Routing and Communication Latency, Ph.D. thesis, MIT, Cambridge, MA, 1997.
