Bounds on Delays and Queue Lengths in Input-Queued Cell Switches
EMILIO LEONARDI, MARCO MELLIA, FABIO NERI, AND MARCO AJMONE MARSAN
Dipartimento di Elettronica, Politecnico di Torino, Torino, Italy
Abstract. In this article, we develop a general methodology, mainly based upon Lyapunov functions, to derive bounds on average delays, and on averages and variances of queue lengths in complex systems of queues. We apply this methodology to cell-based switches and routers, considering first output-queued (OQ) architectures, in order to provide a simple example of our methodology, and then both input-queued (IQ), and combined input/output queued (CIOQ) architectures. These latter switching architectures require a scheduling algorithm to select at each slot a subset of input-buffered cells that can be transferred toward output ports. Although the stability properties (i.e., the limit throughput) of IQ and CIOQ cell-based switches were already studied for several classes of scheduling algorithms, very few analytical results concerning cell delays or queue lengths are available in the technical literature. We concentrate on Maximum Weight Matching (MWM) and Maximal Size Matching (mSM) scheduling algorithms; while the former was proved to maximize throughput, the latter allows simpler implementation. The derived bounds are shown to be rather tight when compared to simulation results.
Categories and Subject Descriptors: C.2.1 [Computer-Communication Networks]: Network Architecture and Design; G.3 [Probability and Statistics]: Queueing Theory
General Terms: Performance, Theory
Additional Key Words and Phrases: Performance evaluation, delay bounds, input queued switches, scheduling
1. Introduction
A number of high-performance IP routers are built around fast cell-based switching fabrics: input IP datagrams are segmented into fixed-size data units, usually
This work was supported by the Italian Ministry for Education, University and Research through the MQOS and PLANET-IP Projects. A preliminary version of this article was published as LEONARDI, E., MELLIA, M., NERI, F., AND AJMONE MARSAN, M. 2001. Bounds on average delays and queue size averages and variances in input queued cell-based switches. In Proceedings of IEEE Infocom (Anchorage, Ak., Apr.). IEEE Communication Society Press, Piscataway, NJ, pp. 1095–1103. Authors’ address: Dipartimento di Elettronica, Politecnico di Torino, C.so Duca degli Abruzzi 24, Torino, Italy 10129, e-mail: {leonardi,mellia,neri,ajmone}@mail.tlc.polito.it. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax: +1 (212) 869-0481, or
[email protected]. © 2003 ACM 0004-5411/03/0700-0520 $5.00 Journal of the ACM, Vol. 50, No. 4, July 2003, pp. 520–550.
Bounds on Delays and Queue Lengths
called cells, and transferred to output ports, where IP datagrams are reassembled. The design of these high-performance routers generally does not adopt the classical output-queued (OQ) architecture (where cells are stored at the output of the switching fabric), preferring either input queued (IQ) or combined input/output queued (CIOQ) structures. The reason is that, in OQ, both the switching fabric and the output (and possibly input) queues in line cards must operate at a speed equal to the sum of the rates of all input links; since this speed grows linearly with the number of switch ports, the OQ approach is impractical for large, fast switches. Instead, in IQ schemes, all the components of the switch (input interfaces, switching fabric, output interfaces) can operate at a data rate that is compatible with the data rate of each input and output link, and thus does not grow with the switch size. A traditional performance penalty of IQ architectures is due to head-of-the-line blocking in the case of a single queue per input interface [Karol et al. 1987]; however, blocking can be removed by Virtual Output Queuing (VOQ) (also called Destination Queuing) schemes [McKeown et al. 1999], which organize input buffers in each line card into a set of queues where cells awaiting access to the switching fabric are stored in accordance with their destination output cards. A major issue in the design of IQ switches is that the access to the switching fabric must be controlled by some form of scheduling algorithm,1 which operates on a (possibly partial) knowledge of the state of input queues. This means that control information must be exchanged among line cards, either through an additional data path or through the switching fabric itself, and that intelligence and computational complexity must be devoted to the scheduling algorithm, either at a centralized scheduler, or at line cards, in a distributed manner. 
The problem faced by the scheduling algorithm can be formalized as a maximum size or maximum weight matching on the bipartite graph in which nodes represent input and output ports, and edges represent cells to be switched. The solution to this problem that maximizes throughput is known to be obtained with the Maximum Weight Matching (MWM) algorithm [Tarjan 1983], but several lower-complexity scheduling algorithms of the Maximal Size Matching (mSM) class were also proposed for IQ switches, and studied in the recent literature [McKeown et al. 1999; Tamir and Chi 1993; McKeown 1995, 1999; McKeown and Mekkittikul 1998; Ajmone Marsan et al. 1999a, 1999b; Duan et al. 1998; Chen et al. 1995; Lamaire and Serpanos 1994; Schoenen et al. 1999]. MWM was also proved to sustain the same throughput as OQ switches [McKeown et al. 1999]. To simplify the design and implementation of scheduling algorithms, switches often operate with a speed-up, that is, the internal switching fabric, as well as the input and output memories, operate at a higher speed with respect to
1 The term, scheduling algorithm, for switching architectures is used in the literature for two different types of schedulers: switching matrix schedulers and flow-level schedulers [Zhang 1995; Stiliadis and Varma 1995]. Switching matrix schedulers decide which input port is enabled to transmit in a non-purely-output-queued switch; they avoid blocking and solve contentions within the switching fabric, thus introducing correlation among services at different input queues. Flow-level schedulers decide which cell flows must be served in accordance with QoS requirements; they typically operate locally at switch outputs. In this article, the term, scheduling algorithm, is only used to refer to the first class of algorithms.
E. LEONARDI ET AL.
the data rate of input/output links. In this case, buffering is required at outputs as well as inputs, and the term combined input/output queuing (CIOQ) is used. Obviously, when the speed-up is such that the internal switch bandwidth equals the sum of the data rates on input links, input buffers become useless. Chuang et al. [1999] showed that a speed-up equal to 2 in CIOQ switches, independent of the number of switch ports, is sufficient to exactly emulate an OQ architecture, at the expense of quite complex scheduling algorithms, whose implementation seems problematic. Leonardi et al. [2001a] and Dai and Prabhakar [2000] proved that a wide class of simpler mSM scheduling algorithms (comprising well-known algorithms, such as WFA [Tamir and Chi 1993], i-SLIP [McKeown 1995], 2DRR [Lamaire and Serpanos 1994], and RC-BB [Chen et al. 1995]) provide the same throughput performance as OQ when a speed-up equal to 2 is available in the switch (although the behavior of an OQ architecture is not exactly emulated). These results concerning the throughput of IQ and CIOQ switches provide a solid theoretical background, which is very useful to designers of the high-speed switches and routers that will be a major ingredient of future telecommunication infrastructures. However, throughput is only one of the key performance parameters used in the design and planning of communication networks; the other major performance metric is delay. For a wide class of telecommunication services, in particular those with real-time requirements, delay is even more important than throughput. Hence, extending the theoretical analysis of IQ and CIOQ switches to delays is extremely important. Computing cell delays in cell-based IQ and CIOQ switches is not easy; indeed, to the best of our knowledge, very few analytical results are available for the delays faced by cells traversing an IQ switch [Leonardi et al. 2001b; Shah and Kopikare 2002].
In this article, we consider MWM and mSM scheduling algorithms for IQ and CIOQ architectures, focusing on uniform as well as hot-spot traffic, and we derive analytical bounds on average delays, and on averages and variances of queue lengths. In Section 2, in order to motivate the reader to examine the mathematical derivations in the following sections, we provide a very simple example of our methodology, by studying a queuing system that models a cell flow multiplexer, or an output interface of an OQ cell switch. This simple example allows the explanation of the basic steps of the methodology, and of the bounding technique. In Sections 3 and 4, after introducing some definitions and preliminary results, we provide the general form of the main results of our analysis methodology, which we then apply to the derivation of bounds on average delays and queue sizes of IQ switches with MWM in Section 5, and of CIOQ switches with mSM in Section 6. We then derive bounds on queue size variances in Section 7. In Section 8, we show that our analytical bounds match well with simulation results. Section 9 discusses the extension of our bounding methodology to a more general class of traffic patterns. Section 10 provides some concluding remarks.
The availability of tight bounds on delays and queue sizes is quite important for switch and router designers: they allow the quantification of the buffer storage required for the provision of QoS guarantees, in terms of loss probability, average delay, and delay jitter. The fact that our bounds are expressed in closed form makes them extremely easy to compute and apply to realistic design problems.
2. A Simple Example
Consider a simple discrete-time (or time-slotted) $\mathrm{Geo}^{(b)}/D/1$ queue with infinite buffer, where customers arrive in batches of size $b$ at geometrically spaced time intervals, and require a deterministic service time (equal to one time slot). Such a queue can provide a simplified model for either a multiplexer of cell flows, or an output interface of an OQ cell switch, and can serve as an illustrative example of the methodology that we later apply to more complex queuing models of IQ and CIOQ cell switches. Let $x_n$ be the number of customers in the queue at time slot $n$; let $a_n$ be the number of arrivals, and $d_n$ the number of departures, during time slot $n$. Assume that $a_n$ is a sequence of independent and identically distributed integer-valued random variables with arbitrary distribution. The equation describing the evolution of this system over time is:
$$x_{n+1} = x_n + a_n - d_n, \tag{1}$$
where
$$d_n = \begin{cases} 1 & \text{if } x_n > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$
This simple queuing model corresponds to an irreducible discrete-time Markov chain (DTMC). Under ergodicity conditions, that is, when $E[a_n] = E[a] < 1$, we have that
$$\lim_{n\to\infty} x_n = x \quad \text{in probability.}$$
It is also possible to state that
$$\lim_{n\to\infty} E[f(x_n)] = E[f(x)]$$
for every polynomial function $f(\cdot)$, under the assumption that $E[a_n^k] < \infty$, $\forall k > 0$. This also implies that $E[a_n] = E[a] = \lim_{n\to\infty} E[d_n] = E[d]$. In particular, by considering the case $f(x) = x^2$,
$$\lim_{n\to\infty} E\left[x_{n+1}^2 - x_n^2\right] = 0. \tag{3}$$
Then:
$$\begin{aligned} E\left[x_{n+1}^2 - x_n^2\right] &= E_x\left[E_a\left[x_{n+1}^2 - x_n^2 \mid x_n > 0\right]\right] P\{x_n > 0\} + E_a\left[x_{n+1}^2 - x_n^2 \mid x_n = 0\right] P\{x_n = 0\} \\ &= E_x[M(x_n)]\, P\{x_n > 0\} + N\, P\{x_n = 0\}, \end{aligned} \tag{4}$$
where $P\{Z\}$ is the probability of event $Z$, and we indicated the dependence on $x$ and $a$ of the averaging operator $E[\cdot]$, that is, we wrote $E[f(x, a)] = E_x[E_a[f(x, a)]]$. Note that in the above equation, the term $N = E_a[x_{n+1}^2 - x_n^2 \mid x_n = 0]$ is positive, because only arrivals are possible when the queue is empty. Thus, from (3) we see that, if the system is ergodic, the term $E_x[M(x_n)] = E_x[E_a[x_{n+1}^2 - x_n^2 \mid x_n > 0]]$ must be negative. Interpreting $x_n^2$ as the queuing system potential energy (which is increasing with $x_n$), (4) shows that, for large values of $x_n$, the system potential energy on average is
decreasing. The intuition behind this result is that the queuing system is such that a negative feedback exists, which is able to pull the system toward the empty state, thus making it ergodic. This is the main idea behind the Lyapunov function theory, which will be presented for more complex queuing systems in the next section. Considering now (1), (2), (3), we have
$$\begin{aligned} 0 &= \lim_{n\to\infty} E\left[x_{n+1}^2 - x_n^2\right] = \lim_{n\to\infty} E\left[(x_n + a_n - d_n)^2 - x_n^2\right] \\ &= \lim_{n\to\infty} \left\{ E\left[2(a_n - 1)x_n + (a_n - 1)^2 \mid x_n > 0\right] P\{x_n > 0\} + E\left[a_n^2 \mid x_n = 0\right] P\{x_n = 0\} \right\} \end{aligned} \tag{5}$$
and, rearranging and recombining terms, we get
$$0 = \lim_{n\to\infty} \left\{ E_x\left[2(E_a[a_n] - 1)x_n\right] + E_a\left[(a_n - d_n)^2\right] \right\} = \lim_{n\to\infty} \left\{ -E_x[\varepsilon x_n] + \Delta V \right\}, \tag{6}$$
where the right-hand side of the last equation comprises a linear term in $x_n$ (with coefficient $\varepsilon > 0$), and a constant $\Delta V$, independent of the system state. From (6), we obtain
$$E[x] = \lim_{n\to\infty} \frac{E\left[(a_n - d_n)^2\right]}{2(1 - E[a_n])} = \lim_{n\to\infty} \frac{E\left[a_n^2\right] - 2E[a_n d_n] + E\left[d_n^2\right]}{2(1 - E[a_n])}. \tag{7}$$
Considering that $a_n$ and $d_n$ are independent; that, since $d_n$ is a binary random variable, $E[d_n^2] = E[d_n]$; and that, because of ergodicity, $E[d] = E[a]$, we get:
$$E[x] = \frac{E[a^2] - 2E[a]^2 + E[a]}{2(1 - E[a])}. \tag{8}$$
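As a numerical illustration of (8), the following minimal Python sketch compares the closed form with a time average obtained by iterating (1) and (2). The function names, the fixed random seed, and the batch parameters (a batch of size 2 arriving with probability 0.2 in each slot) are assumptions made only for this example; they do not come from the article.

```python
import random


def mean_queue_length(mean_a, second_moment_a):
    """Closed form (8): E[x] from the first two moments of the arrivals."""
    return (second_moment_a - 2 * mean_a ** 2 + mean_a) / (2 * (1 - mean_a))


def simulate_mean_queue_length(batch_prob, batch_size, slots, seed=1):
    """Time average of x_n under x_{n+1} = x_n + a_n - d_n, eqs. (1)-(2)."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    x = 0
    total = 0
    for _ in range(slots):
        total += x
        a = batch_size if rng.random() < batch_prob else 0  # batch arrival
        d = 1 if x > 0 else 0                               # eq. (2)
        x = x + a - d                                       # eq. (1)
    return total / slots


if __name__ == "__main__":
    g, b = 0.2, 2  # hypothetical batch probability and batch size
    mean_a, second_moment_a = g * b, g * b * b
    print("formula (8):", mean_queue_length(mean_a, second_moment_a))
    print("simulation :", simulate_mean_queue_length(g, b, slots=400_000))
```

With these parameters $E[a] = 0.4 < 1$, so the ergodicity condition holds and the two printed values should be close; the simulated value is only an estimate and varies with the seed and the number of slots.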
The reader familiar with the Pollaczek–Khinchin (P-K) formulas [Kleinrock 1975] should have noted that the previous derivation is very similar to the one normally used to obtain $E[x]$ for an M/G/1 queue. Indeed, the result that we obtained is the equivalent of the Pollaczek–Khinchin formula for our discrete-time $\mathrm{Geo}^{(b)}/D/1$ queue. In the remainder of this article, we generalize the approach presented above to more complex queuing systems, where it will be necessary to find a function $V(x)$ such that (4) can be generalized as
$$E[V(x_{n+1}) - V(x_n)] = E[V(x_{n+1}) - V(x_n) \mid x_n > B]\, P\{x_n > B\} + E[V(x_{n+1}) - V(x_n) \mid x_n \le B]\, P\{x_n \le B\}, \tag{9}$$
where $E[V(x_{n+1}) - V(x_n) \mid x_n > B] < 0$ and $E[V(x_{n+1}) - V(x_n) \mid x_n \le B] > 0$, with $B \ge 0$, thus indicating that the negative feedback is able to pull the system toward the empty state only when the accumulated backlog is greater than $B$ (i.e., when the queuing system is able to exploit all its service capacity). Unfortunately, the expression $E[V(x_{n+1}) - V(x_n) \mid x_n > B]$ is often impossible to evaluate exactly, and thus must be bounded by a function $M(x_n) = -\varepsilon x_n \ge E[V(x_{n+1}) - V(x_n) \mid x_n > B]$. The same is also true for the second term in the right-hand side of (9), so that it will be necessary to identify a constant $\Delta V \ge E[V(x_{n+1}) - V(x_n) \mid x_n \le B]$.
Thus, following a process similar to the one used to derive (6), we will obtain
$$0 \le \lim_{n\to\infty} \left\{ -\varepsilon E[x_n] + \Delta V \right\} \tag{10}$$
from which we can obtain bounds on the average queue size $E[x]$. From such bounds, the derivation of bounds on the average cell delay is easy, thanks to Little’s result. The procedure for the derivation of bounds on the queue size variance is identical, but requires the use of different functions $V(x_n)$, as we shall see later on.

3. The Bounding Methodology
Consider a system of $N$ discrete-time queues of infinite capacities, where exogenous arrival processes at different queues can be correlated, and services at different queues are not independent of each other (having in mind an IQ switch, we assume that correlation among arrivals at different inputs in one time slot is possible; correlation among services is due to the scheduling algorithm). In this system of queues, customers arrive at one queue, wait, obtain service, and then leave (again, in an IQ switch, packets join input queues, wait, cross the switch, and depart). Let $X_n$ be the row vector of queue lengths at time slot $n$; that is, $X_n = (x_n^{(1)}, x_n^{(2)}, \ldots, x_n^{(N)})$, where $x_n^{(i)}$ is the number of customers in queue $i$ at time slot $n$. The queue length evolution is described by the expression $x_{n+1}^{(i)} = x_n^{(i)} + a_n^{(i)} - d_n^{(i)}$, where $a_n^{(i)}$ represents the number of customers arrived at queue $i$ in time slot $n$ from outside, and $d_n^{(i)}$ represents the number of customers that depart from queue $i$ in time slot $n$ and leave the system. We assume that $E[(a_n^{(i)})^k] < \infty$, $\forall k > 0$. Let $A_n = (a_n^{(1)}, a_n^{(2)}, \ldots, a_n^{(N)})$ be the vector of the numbers of arrivals at the different queues, and $D_n = (d_n^{(1)}, d_n^{(2)}, \ldots, d_n^{(N)})$ be the vector of the numbers of departures from queues. With this notation, the evolution of the system of queues can be described as
$$X_{n+1} = X_n + A_n - D_n. \tag{11}$$
We assume for the moment that vectors $A_n$ are independent, that is, $P\{A_{n+1} = \hat{A} \mid A_i\} = P\{A_{n+1} = \hat{A}\}$ $\forall i \le n$. However, as we said, correlation among the components of the arrival vector $A_n$ is admissible. Extensions to traffic scenarios with time correlation will be discussed in the final part of the article (Section 9). Correlation among the components of $D_n$ is also permitted. We indicate with $\|Y\|$ the norm of vector $Y$. Four different norms will be introduced later in Definitions 4.1, 4.2, 4.3, and 4.4.
Definition 3.1. A system of queues is said to be weakly stable if, for every $\varepsilon > 0$, there exists $B > 0$ such that
$$\lim_{n\to\infty} P\{\|X_n\| > B\} < \varepsilon.$$
Definition 3.2. A system of queues is said to be strongly stable if
$$\limsup_{n\to\infty} E[\|X_n\|] < \infty.$$
We assume that the process describing the evolution of the system of queues is an irreducible DTMC, whose state vector at time $n$ is² $Y_n = (X_n, K_n)$, $Y_n \in \mathbb{N}^M$, $X_n \in \mathbb{N}^N$, $K_n \in \mathbb{N}^{N'}$ and $M = N + N'$. $Y_n$ is the combination of vector $X_n$ and a vector $K_n$ of $N'$ integer parameters. Let $H$ be the state space of the DTMC, obtained as a subset of the Cartesian product of the state space $H_X$ of $X_n$ and the state space $H_K$ of $K_n$. If all states $Y_n$ are positive recurrent, the system of queues is weakly stable; however, the converse is generally not true, since queue lengths can remain finite even if the states of the DTMC are not positive recurrent, due to instability in the sequence of parameter vectors $\{K_n\}$. Note that most systems of discrete-time queues of practical interest can be described with models that fall in the DTMC class. The following general criterion for the (weak) stability of systems falling into this class is therefore useful.
2 In this article, $\mathbb{N}$ denotes the set of nonnegative integers, $\mathbb{R}$ denotes the set of real numbers, and $\mathbb{R}^+$ denotes the set of nonnegative real numbers.
THEOREM 3.3. Given a system of queues whose evolution is described by a DTMC with state vector $Y_n \in \mathbb{N}^M$, if a lower-bounded function $V(Y_n)$, called Lyapunov function, $V : \mathbb{N}^M \to \mathbb{R}$, can be found such that
$$E[V(Y_{n+1}) \mid Y_n] < \infty \quad \forall Y_n$$
and there exist $\varepsilon \in \mathbb{R}^+$ and $B \in \mathbb{R}^+$ such that
$$E[V(Y_{n+1}) - V(Y_n) \mid Y_n] \le -\varepsilon \quad \forall\, \|Y_n\| > B \tag{12}$$
then all states of the DTMC are positive recurrent, and the system of queues is weakly stable.
PROOF. This theorem is a straightforward extension of Foster’s Criterion; see Kushner [1967], Tassiulas and Ephremides [1992], and Fayolle [1989].
Note that, interpreting $V(X_n)$ as the system potential energy, (12), similar to (4), shows that, for large values of $X_n$, the system potential energy on average is decreasing. Thus a negative feedback exists which is able to keep the system stable. If the state space $H$ of the DTMC is a subset of the Cartesian product of the denumerable state space $H_X$ and a finite state space $H_K$, the stability criterion can be slightly modified, since the stability of the system can be inferred only from the queue lengths state vector $X_n$.
COROLLARY 3.4. Given a system of queues whose evolution is described by a DTMC with state vector $Y_n \in \mathbb{N}^M$, and whose state space $H$ is a subset of the Cartesian product of a denumerable state space $H_X$ and a finite state space $H_K$, if a lower bounded function $V(X_n)$, called Lyapunov function, $V : \mathbb{N}^N \to \mathbb{R}$, can be found such that
$$E[V(X_{n+1}) \mid Y_n] < \infty \quad \forall Y_n$$
and there exist $\varepsilon \in \mathbb{R}^+$ and $B \in \mathbb{R}^+$ such that
$$E[V(X_{n+1}) - V(X_n) \mid Y_n] \le -\varepsilon \quad \forall Y_n : \|X_n\| > B$$
then all states of the DTMC are positive recurrent, and the system is weakly stable.
In the remainder of this article, we restrict our analysis to the class of systems of queues for which Corollary 3.4 applies. We also assume an aperiodic DTMC since our arrival processes normally confirm this assumption; hence, having positive recurrent states is equivalent to having ergodicity of the DTMC. The generalization to periodic DTMCs is straightforward. As an extension to the results above, we recall from Leonardi et al. [2001a] the following criterion for strong stability.
THEOREM 3.5. Given a system of queues whose evolution is described by a DTMC with state vector $Y_n \in \mathbb{N}^M$, and whose state space $H$ is a subset of the Cartesian product of a denumerable state space $H_X$ and a finite state space $H_K$, if a lower bounded function $V(X_n)$, called Lyapunov function, $V : \mathbb{N}^N \to \mathbb{R}$, can be found such that
$$E[V(X_{n+1}) \mid Y_n] < \infty \quad \forall Y_n$$
and there exist $\varepsilon \in \mathbb{R}^+$ and $B \in \mathbb{R}^+$ such that
$$E[V(X_{n+1}) - V(X_n) \mid Y_n] \le -\varepsilon \|X_n\| \quad \forall Y_n : \|X_n\| > B, \tag{13}$$
then the system of queues is strongly stable. In addition, if there exists a symmetric copositive matrix³ $Z \in \mathbb{R}^{N \times N}$ defining the Lyapunov function $V(X_n) = X_n Z X_n^T$, then all the polynomial moments of the queue lengths distribution are finite.
For a complete proof of Theorem 3.5, the reader is invited to refer to Leonardi et al. [2001a]. That proof provides a first bound on the limit behavior of $E[\|X_n\|]$:
$$\lim_{n\to\infty} E[\|X_n\|] \le \frac{1}{\varepsilon}(M + M_0), \tag{14}$$
where $M$ is the maximum taken by $E[V(X_{n+1}) \mid Y_n]$ for $Y_n \in H_B$, where $H_B = \{Y_n, \|X_n\| \le B\}$, and $M_0$ is a constant such that⁴ $M_0 \le \{-E[V(X_n) \mid Y_n \in H_B] + \varepsilon E[\|X_n\| \mid Y_n \in H_B]\}\, P\{Y_n \in H_B\}$. Unfortunately, this bound is often very loose; thus, we will obtain a novel tighter bound by selecting a special class of Lyapunov functions. Considering a system of queues satisfying the assumptions of Theorem 3.5, since the DTMC describing the evolution of the queues is positive recurrent, if we assume aperiodicity as above, the DTMC is ergodic. Moreover, since the system is strongly stable: $\lim_{n\to\infty} E[X_{n+1}] = \lim_{n\to\infty} E[X_n] < \infty$. In addition, if the Lyapunov function $V(X_n)$ is a quadratic form, that is, $V(X_n) = X_n Z X_n^T$, since all the polynomial moments of $X_n$ are finite, it follows that:
$$\lim_{n\to\infty} E[V(X_{n+1}) - V(X_n)] = 0. \tag{15}$$
An extension to Theorem 3.5 can be easily obtained by replacing in (13) $-\varepsilon \|X_n\|$ with $-\varepsilon f(\|X_n\|)$, where $f(\cdot)$ is a continuous function defined on $\mathbb{R}^+$. In this case, with steps similar to those in the proof of Theorem 3.5, it is possible to prove that $\lim_{n\to\infty} E[f(\|X_n\|)] < \infty$. It is now possible to state the following novel theorem that provides a stronger and more general bound than (14).
3 An $N \times N$ matrix $Q$ is copositive if $X Q X^T \ge 0$ for all $X \in (\mathbb{R}^+)^N$.
4 In this article, we use the elementary notation for the conditional expectation, that is, $E[X \mid Y \in A] = E[X, Y]/P\{A\}$, where $A$ is an event set.
THEOREM 3.6. Given a system of queues whose evolution is described by a DTMC with state vector $Y_n \in \mathbb{N}^M$, whose state space $H$ is a subset of the Cartesian product of a denumerable state space $H_X$ and a finite state space $H_K$, and for which all the polynomial moments of the queue lengths distribution are finite, if a lower bounded polynomial function $V(X_n)$, $V : \mathbb{N}^N \to \mathbb{R}$, can be found, such that
$$E[V(X_{n+1}) \mid Y_n] < \infty \quad \forall Y_n$$
and there exist two positive real numbers $\varepsilon \in \mathbb{R}^+$ and $B \in \mathbb{R}^+$, such that
$$E[V(X_{n+1}) - V(X_n) \mid Y_n] \le -\varepsilon f(\|X_n\|) \quad \forall Y_n : \|X_n\| > B \tag{16}$$
with $f(x)$ a continuous function on $\mathbb{R}^+$, then
$$\lim_{n\to\infty} E[f(\|X_n\|)] \le \lim_{n\to\infty} E\left[ f(\|X_n\|) + \frac{V(X_{n+1}) - V(X_n)}{\varepsilon} \,\middle|\, Y_n \in H_B \right] \times P\{Y_n \in H_B\}. \tag{17}$$
The proof of this theorem is a generalization of the derivation leading to (8).
PROOF. Due to (i) the ergodicity of the DTMC describing the evolution of the system of queues, (ii) the fact that the polynomial moments of the queue lengths are finite, (iii) the fact that $V(X_n)$ is a polynomial function, we can state, similarly to (15), that:
$$\lim_{n\to\infty} E[V(X_{n+1}) - V(X_n)] = 0. \tag{18}$$
Following the same steps of the proof of Theorem 3.5, and using (18):
$$\begin{aligned} E[V(X_{n+1}) - V(X_n)] &= E[V(X_{n+1}) - V(X_n) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} + E[V(X_{n+1}) - V(X_n) \mid Y_n \notin H_B]\, P\{Y_n \notin H_B\} \\ &\le E[V(X_{n+1}) - V(X_n) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} - \varepsilon E[f(\|X_n\|) \mid Y_n \notin H_B]\, P\{Y_n \notin H_B\} \\ &= E[V(X_{n+1}) - V(X_n) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} - \varepsilon E[f(\|X_n\|) \mid Y_n \notin H_B]\, P\{Y_n \notin H_B\} \\ &\quad - \varepsilon E[f(\|X_n\|) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} + \varepsilon E[f(\|X_n\|) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} \\ &= E[V(X_{n+1}) - V(X_n) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} - \varepsilon E[f(\|X_n\|)] + \varepsilon E[f(\|X_n\|) \mid Y_n \in H_B] \times P\{Y_n \in H_B\}. \end{aligned} \tag{19}$$
Combining (18) and (19), the theorem is proved.
This result will be used in the next sections, in which polynomial forms for $V(X_n)$ and for $f(\|X_n\|)$ will be used to derive bounds on average delays as well as on queue size averages and variances for input-queued cell-based switches.
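The last step of the proof of Theorem 3.6, where (18) and (19) are combined, can be spelled out as follows. This is only a restatement of the argument in the notation of the theorem, with $\Delta V_n = V(X_{n+1}) - V(X_n)$ introduced here as shorthand (it is not a symbol used in the article).

```latex
% Take the limit n -> infinity in (19) and use (18) on the left-hand side:
\[
0 = \lim_{n\to\infty} E[\Delta V_n]
  \le \lim_{n\to\infty} \Big\{ E[\Delta V_n \mid Y_n \in H_B]\, P\{Y_n \in H_B\}
      - \varepsilon\, E[f(\|X_n\|)]
      + \varepsilon\, E[f(\|X_n\|) \mid Y_n \in H_B]\, P\{Y_n \in H_B\} \Big\}.
\]
% Move the term in E[f(||X_n||)] to the left and divide by epsilon > 0:
\[
\lim_{n\to\infty} E[f(\|X_n\|)]
  \le \lim_{n\to\infty} E\!\left[ f(\|X_n\|) + \frac{\Delta V_n}{\varepsilon}
      \,\middle|\, Y_n \in H_B \right] P\{Y_n \in H_B\},
\]
% which is the bound (17).
```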
It should be noted that the polynomial Lyapunov function $V(X_n)$ in this more complex context plays a role very similar to that of $f(x) = x^2$ in Section 2. Thus, in spite of the higher formal complexity, our technique is the direct extension of the approach described in Section 2, which is normally used to derive the P-K formulas for an M/G/1 queue. Since the methodology requires an asymptotic study of the system behavior under stationarity, that is, for $n \to \infty$, in the rest of the article, we always assume the system to be in stationary conditions. To simplify the notation, we assume stationarity to occur at finite time $n$; this may be obtained assuming the process to start at time $t = -\infty$.

4. IQ Switches
We consider IQ cell-based switches with $P$ input ports and $P$ output ports, all at the same cell rate (and we call them $P \times P$ IQS). The switching fabric is assumed to be nonblocking and memoryless, that is, cells are only stored at switch inputs. At each input, cells are stored according to a Virtual Output Queue (VOQ) policy: one separate queue is maintained for each output. Thus, the total number of queues in the switch is $N = P^2$. In order to simplify our formulas, in the rest of this article, we adopt a vector notation rather than a matrix notation to indicate input queues: let $q^{(k)}$, $k = Pi + j$, be the queue at input $i$ storing cells directed to output $j$, with $i, j = 0, 1, 2, \ldots, P-1$. Before proceeding, we need to define four norm functions.
Definition 4.1. Given a vector $Z \in \mathbb{R}^N$, $Z = \{z^{(k)}, k = Pi + j, i, j = 0, 1, \ldots, P-1\}$, the norm $\|Z\|_{IO}$ is defined as:
$$\|Z\|_{IO} = \max_{j=0,\ldots,P-1} \left\{ \sum_{k=0}^{P-1} \left|z^{(Pk+j)}\right|, \; \sum_{l=0}^{P-1} \left|z^{(Pj+l)}\right| \right\}. \tag{20}$$
Definition 4.2. Given a vector $Z \in \mathbb{R}^N$, $Z = \{z^{(k)}, k = Pi + j, i, j = 0, 1, \ldots, P-1\}$, the norm $\|Z\|_{I+O}$ is defined as:
$$\|Z\|_{I+O} = \max_{i,j=0,\ldots,P-1} \left\{ \sum_{k=0}^{P-1} \left|z^{(Pi+k)}\right| + \sum_{l=0}^{P-1} \left|z^{(Pl+j)}\right| - \left|z^{(Pi+j)}\right| \right\}. \tag{21}$$
Definition 4.3. Given a vector $Z \in \mathbb{R}^N$, the norm $\|Z\|_1$ is defined as:
$$\|Z\|_1 = \sum_{k=0}^{N-1} \left|z^{(k)}\right|. \tag{22}$$
Definition 4.4. Given a vector $Z \in \mathbb{R}^N$, the norm $\|Z\|_2$ is defined as:
$$\|Z\|_2 = \sqrt{\sum_{k=0}^{N-1} \left(z^{(k)}\right)^2}. \tag{23}$$
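The four norms of Definitions 4.1 to 4.4 can be computed directly from the vector layout $k = Pi + j$. The following is a minimal Python sketch; the function names are chosen for this illustration only and the code assumes that `z` is a flat list of length $P^2$.

```python
import math


def norm_io(z, P):
    """Eq. (20): largest sum of |z| over the queues of one input or one output."""
    input_sums = [sum(abs(z[P * i + j]) for j in range(P)) for i in range(P)]
    output_sums = [sum(abs(z[P * i + j]) for i in range(P)) for j in range(P)]
    return max(input_sums + output_sums)


def norm_i_plus_o(z, P):
    """Eq. (21): largest input sum plus output sum, counting queue (i, j) once."""
    input_sums = [sum(abs(z[P * i + j]) for j in range(P)) for i in range(P)]
    output_sums = [sum(abs(z[P * i + j]) for i in range(P)) for j in range(P)]
    return max(input_sums[i] + output_sums[j] - abs(z[P * i + j])
               for i in range(P) for j in range(P))


def norm_1(z):
    """Eq. (22): sum of absolute values."""
    return sum(abs(v) for v in z)


def norm_2(z):
    """Eq. (23): Euclidean norm."""
    return math.sqrt(sum(v * v for v in z))


if __name__ == "__main__":
    # Hypothetical 2 x 2 example: z[P*i + j] is the entry for input i, output j.
    z, P = [1, 2, 3, 4], 2
    print(norm_io(z, P), norm_i_plus_o(z, P), norm_1(z), norm_2(z))
```

For the example vector, input 1 carries $3 + 4 = 7$ and output 1 carries $2 + 4 = 6$, so $\|Z\|_{IO} = 7$ and $\|Z\|_{I+O} = 7 + 6 - 4 = 9$.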
The normalized vector parallel to vector $Z$ is denoted by $\hat{Z}$, and is defined as:
$$\hat{Z} = \frac{Z}{\|Z\|_1}$$
While the definitions of norms $\|Z\|_1$ and $\|Z\|_2$ are rather intuitive, $\|Z\|_{IO}$ and $\|Z\|_{I+O}$ are less common and strictly related to the VOQ system we are considering. Indeed $\|Z\|_{IO}$ takes the maximum amount of the sum of quantities $z^{(k)}$ related to all the queues referring either to the same input or to the same output, while $\|Z\|_{I+O}$ considers the maximum among all pairs of input and output ports. Let $\Lambda \in \mathbb{R}^N$, $\Lambda = \{\lambda^{(k)}, k = Pi + j, i, j = 0, 1, \ldots, P-1\} = E[A]$, be the vector of the average cell arrival rates at queues $q^{(k)}$ in cells/slot. In plain words, a traffic pattern is admissible if the total average arrival rates in cells/slot are less than 1 for all input and output ports. Thus:
Definition 4.5. The traffic pattern loading an IQS is admissible if and only if
$$\|\Lambda\|_{IO} < 1$$
At each time slot, the switch scheduler selects cells to be transferred from input queues to output queues. The set of cells to be transferred during one time slot must satisfy two constraints: (i) at most one cell can be extracted from the VOQ structure at each input, and (ii) at most one cell can be transferred toward each output.
Definition 4.6. At each time slot, the scheduler of an IQS selects for transfer from queues $q^{(k)}$ a set of cells denoted by vector $D \in \mathbb{N}^N$, $D = \{d^{(k)}, k = Pi + j, i, j = 0, 1, \ldots, P-1\}$ so that:
$$\|D\|_{IO} \le 1$$
Set $D$ is said to be a set of noncontending cells, or a switching vector.

5. Maximum Weight Matching Algorithm
The selection of a switching vector in IQ or CIOQ switches was proved equivalent to the matching problem in a weighted bipartite graph, where nodes represent input/output ports, arcs represent the presence of cells to be transferred, and weights indicate the metrics to be used by the algorithm. As we already recalled, the solution to this problem which maximizes throughput is obtained with the MWM algorithm [Tarjan 1983] using proper arc weights.
Thus, in this section, we apply Theorem 3.6 to derive bounds on the performance of IQS adopting the MWM algorithm to select a switching vector.
Definition 5.1. An IQS adopts a MWM scheduler if the selection of the switching vector at each time slot is performed according to the MWM algorithm. Let $W_n \in (\mathbb{R}^+)^N$ represent the weight vector at time $n$, and let $D_n$ be a switching vector at time $n$. The departure vector $D_n^{(MWM)}$ produced by a MWM scheduler is such that:
$$D_n^{(MWM)} W_n^T = \max_{D_n} \left( D_n W_n^T \right).$$
Note that $D_n^{(MWM)}$ is in general not unique.
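To make Definition 5.1 concrete, the sketch below selects a maximum-weight switching vector for a small switch by exhaustive search over all input/output permutations. This brute-force search is used here only for illustration and is not the MWM algorithm cited in the article; it assumes nonnegative weights, and its cost grows as $P!$, so it is only usable for very small $P$. The function name is hypothetical.

```python
from itertools import permutations


def mwm_switching_vector(weights, P):
    """Return (D, value): a switching vector D maximizing D . W^T, and that maximum.

    `weights` is a flat list of length P*P with weights[P*i + j] >= 0 the weight
    of the queue at input i for output j. Ties are broken by enumeration order.
    """
    best_value, best_perm = float("-inf"), None
    for perm in permutations(range(P)):  # perm[i] = output matched to input i
        value = sum(weights[P * i + perm[i]] for i in range(P))
        if value > best_value:
            best_value, best_perm = value, perm
    d = [0] * (P * P)
    for i, j in enumerate(best_perm):
        if weights[P * i + j] > 0:  # zero-weight (e.g., empty) queues are not served
            d[P * i + j] = 1
    return d, best_value


if __name__ == "__main__":
    # Hypothetical 2 x 2 example with queue lengths used as weights (W_n = X_n).
    x = [3, 1, 2, 5]
    d, value = mwm_switching_vector(x, 2)
    print(d, value)
```

Since at most one entry per input and per output is set to 1, the returned vector satisfies $\|D\|_{IO} \le 1$ as required by Definition 4.6. When several permutations reach the same maximum, only the first one found is returned, which reflects the non-uniqueness noted above.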
If we set Wn = X n , where X n is the vector of input queues at time n, that is, if we use queue lengths as the scheduler metrics, the switching vector produced by a MWM scheduler is such that ¡ ¢ Dn X nT = max Dn X nT . Dn
In McKeown et al. [1999], it was proved that for each vector Un ∈ IR+N such that kUn kIO ≤ 1, and for each vector X n ∈ IN N : Dn X nT − Un X nT ≥ 0.
(24)
From this property, using the Lyapunov function $V(X_n) = X_n X_n^T$, it was then proved that, for any IQS using an MWM scheduler with $W_n = X_n$:
$$E\left[X_{n+1} X_{n+1}^T - X_n X_n^T \mid X_n\right] \le -\epsilon \|X_n\| \qquad (25)$$
for $\|X_n\|$ sufficiently large. Note that this property holds whichever norm is considered, as long as $P < \infty$, since the asymptotic behavior of $\|X_n\|$ is the same for all norm functions. As a consequence, due to Theorem 3.5, with matrix Z equal to the identity matrix I, we can state the following theorem:

THEOREM 5.2. Under any admissible independent and identically distributed input traffic pattern, an IQS using an MWM scheduler with weights equal to queue lengths is strongly stable, and all polynomial moments of the queue length distributions are finite.

For a proof, the reader is referred to McKeown et al. [1999]. Given this result, we can apply (17) to derive a bound on the average queue length and cell delay in the switch. The derivation is reported in Appendix A. We have:
$$E[\|X_n\|_1] \le \frac{\|\Lambda\|_1 - \|\Lambda\|_2^2}{1 - \|\Lambda\|_{IO}}\, P. \qquad (26)$$
Considering $\|X_n\|_2$, and recalling that $\|X_n\|_1 / \|X_n\|_2 \le P$, a similar derivation (not reported in the article) shows that:
$$E[\|X_n\|_2] \le \frac{\|\Lambda\|_1 - \|\Lambda\|_2^2}{1 - \|\Lambda\|_{IO}}\, \sqrt{P}. \qquad (27)$$
In the case of uniform traffic, we can use (26) to obtain a bound on the average length of individual input queues: the load at every VOQ $q^{(i)}$ is $\rho_i = \rho/P \;\; \forall i$. Indeed, since $\|X_n\|_1 = \sum_{i=0}^{N-1} x^{(i)}$, $E[\|X_n\|_1] = \sum_{i=0}^{N-1} E[x^{(i)}] = N E[x^{(i)}]$, so that:
$$E[x^{(i)}] \le \frac{\rho - \rho^2/P}{1 - \rho} \qquad (28)$$
where $\rho = \|\Lambda\|_{IO}$ represents the port load. The bound on the average cell delay is then obtained by applying Little's result:
$$E[T] = \frac{E[x^{(i)}]}{E[a^{(i)}]} \le \frac{P - \rho}{1 - \rho} \qquad (29)$$
being $E[a^{(i)}] = \rho/P$.
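As a numerical illustration of (26), (28) and (29), the following Python sketch evaluates the general bound for an arbitrary rate vector and the closed forms for uniform traffic. The function names are hypothetical, and the rate vector is assumed to be a flat list indexed by k = Pi + j as in the text.

```python
def io_norm(lam, P):
    """||Lambda||_IO: largest total rate over any single input or any single output."""
    row_sums = [sum(lam[P * i + j] for j in range(P)) for i in range(P)]
    col_sums = [sum(lam[P * i + j] for i in range(P)) for j in range(P)]
    return max(row_sums + col_sums)

def mwm_total_queue_bound(lam, P):
    """Right-hand side of (26): bound on E[||X_n||_1] under MWM."""
    load = io_norm(lam, P)
    if not load < 1:
        raise ValueError("traffic pattern is not admissible (Definition 4.5)")
    norm1 = sum(lam)                      # ||Lambda||_1
    norm2_sq = sum(v * v for v in lam)    # ||Lambda||_2^2
    return (norm1 - norm2_sq) * P / (1 - load)

def mwm_uniform_bounds(P, rho):
    """Closed forms (28) and (29) for uniform traffic with port load rho."""
    queue = (rho - rho ** 2 / P) / (1 - rho)   # bound on E[x^(i)]
    delay = (P - rho) / (1 - rho)              # bound on E[T]
    return queue, delay

P, rho = 16, 0.9
uniform = [rho / P] * (P * P)
total = mwm_total_queue_bound(uniform, P)
queue, delay = mwm_uniform_bounds(P, rho)
```

For a 16 × 16 switch at ρ = 0.9 the per-queue bound is about 8.49 cells and the delay bound about 151 slots, and dividing `total` by N = P² recovers the per-queue value, as the derivation of (28) requires.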
6. Maximal Size-Matching Algorithms

The algorithmic complexity required for the computation of the MWM departure vector is quite large (algorithms are known with asymptotic complexity $O(P^3)$, see Tarjan [1983]), and it can be partly reduced [Chuang et al. 1999] when the switching fabric, as well as the input and output memories, operate with a moderate speed-up S with respect to the data rate of input/output lines. In this case, buffering is required at outputs as well as inputs, and the term combined input/output queuing (CIOQ) is used. We call external time slot the time needed to transmit a cell at the data rate of the input/output lines. The internal time slot is, instead, the time needed to transmit a cell at the data rate of the switching fabric. The external time slot is S times longer than the internal time slot.

This has encouraged researchers to look for simpler policies to approximate the MWM algorithm in input buffered switches using speed-up S greater than 1. As mentioned in the Introduction, previous research efforts proved that a number of well-known Maximal Size Matching (mSM) scheduling algorithms, such as WFA [Tamir and Chi 1993], i-SLIP [McKeown 1995], 2DRR [Lamaire and Serpanos 1994], and RC-BB [Chen et al. 1995], whose implementation is quite simple, are guaranteed to achieve 100% throughput with speed-up equal to 2.

Definition 6.1. A CIOQ switch (CIOQS) adopts a mSM scheduler if the selection of the set of noncontending cells to be transferred from inputs to outputs at each internal time slot is performed according to a mSM algorithm, that is, the scheduler selects a switching vector such that no cell can be added without violating the constraint $\|D\|_{IO} \le 1$.

In this section, we denote by $D_n$ the departure vector produced by a mSM scheduler. Consider any queue $q^{(k)}$, $k = Pi + j$, in the VOQ structure, that stores cells at input i directed to output j. Recall that cells stored in $q^{(k)}$ compete for inclusion in the set of noncontending cells with cells stored in each queue $q^{(k')}$, $k' = Pi + l$ with $l \ne j$, and $q^{(k'')}$, $k'' = Ph + j$ with $h \ne i$. If $q^{(k)}$ is nonempty, the mSM algorithm generates a set of noncontending cells that comprises at least one cell extracted from the interfering set $I^{(k)}$
$$I^{(k)} = \bigcup \left\{ q^{(k')} \cup q^{(k'')} \right\} \qquad (30)$$
(exactly one, if the cell is extracted from $q^{(k)}$; possibly two, if one cell is extracted from a $q^{(k')}$, $k' \ne k$, and one from a $q^{(k'')}$, $k'' \ne k$).

Because CIOQ switches have both input and output queues, in the sequel, we use $X_n$ to indicate the state of the input VOQs, while $O_n \in \mathbb{N}^P$ indicates the state vector of output queues. Using Lyapunov theory, in Leonardi et al. [2001a], we proved the following:

THEOREM 6.2. Under any admissible traffic pattern, a CIOQS adopting a mSM scheduler achieves 100% throughput for any speed-up $S \ge 2$.

This result was derived with different techniques also in Dai and Prabhakar [2000].
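To make Definition 6.1 and the interfering set concrete, here is a minimal Python sketch of one internal slot of a greedy maximal size matching. It is a generic greedy scan and not an implementation of WFA, i-SLIP, 2DRR or RC-BB; all function names are hypothetical, and the interfering set is taken to include queue k itself, consistent with the remark that the served cell may come from $q^{(k)}$.

```python
def greedy_msm(X, P):
    """One internal slot of a greedy maximal size matching over VOQ lengths X."""
    D = [0] * (P * P)
    input_busy = [False] * P
    output_busy = [False] * P
    for k in range(P * P):
        i, j = divmod(k, P)  # k = P*i + j
        # Add the cell whenever it does not violate ||D||_IO <= 1.
        if X[k] > 0 and not input_busy[i] and not output_busy[j]:
            D[k] = 1
            input_busy[i] = output_busy[j] = True
    return D

def interfering_set(k, P):
    """Indices of the queues sharing the input or the output of queue k (k included)."""
    i, j = divmod(k, P)
    same_input = {P * i + l for l in range(P)}
    same_output = {P * h + j for h in range(P)}
    return same_input | same_output

def is_maximal(X, D, P):
    """True if every nonempty queue sees at least one departure in its interfering set."""
    return all(
        any(D[m] for m in interfering_set(k, P))
        for k in range(P * P) if X[k] > 0
    )

X = [1, 1, 0,
     1, 0, 0,
     0, 0, 2]
D = greedy_msm(X, 3)
```

In this 3 × 3 example the greedy scan serves only two cells although a matching of size three exists, which shows the difference between a maximal and a maximum size matching; `is_maximal` still holds because no further cell can be added.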
If the mSM scheduling policy is “fair”,⁵ then the queuing delay was proved to be bounded, and thus the CIOQS is strongly stable:

THEOREM 6.3. A CIOQS with speed-up 2 adopting a fair mSM scheduler is strongly stable under any admissible traffic pattern.

Here we further extend those results, proving the following:

THEOREM 6.4. A CIOQS with speed-up $S \ge 2$ adopting a mSM scheduler is strongly stable under any admissible traffic pattern.

This allows us to apply inequality (17) in order to derive bounds on the average delay and queue sizes of a CIOQ system adopting a mSM scheduler. However, we need to discuss separately the average size of input queues and of output queues. The proof of Theorem 6.4 and the derivation of the bounds are reported in Appendix B.

6.1. CONSIDERING INPUT QUEUES. We first focus on the input queues. As shown in Appendix B, in the case of uncorrelated arrival patterns, we obtain:
$$E[\|X_n\|_1] \le \frac{6\|\Lambda\|_1 - \Lambda Q \Lambda^T - \|\Lambda\|_2^2}{2\,(2 - \|\Lambda\|_{I+O})}, \qquad (31)$$
where matrix Q is defined in (49) (see also Figure 10). In the case of uniform traffic, we obtain a bound on the average length of individual input queues. Indeed, being $E[a^{(i)}] = \rho/P$,
$$E[x^{(i)}] \le \frac{\rho(3 - \rho)}{2P(1 - \rho) + \rho}. \qquad (32)$$
Applying Little's result, we obtain the bound on the average cell delay:
$$E[T] \le \frac{P(3 - \rho)}{2P(1 - \rho) + \rho}. \qquad (33)$$
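The relation between the general bound (31) and the uniform-traffic closed forms (32) and (33) can be checked numerically with the Python sketch below. The function names are hypothetical. Matrix Q follows (49); the value used for $\|\Lambda\|_{I+O}$ is an assumption, taken here as the largest entry of $\Lambda Q$, which is the smallest value compatible with the entrywise inequality used in (50).

```python
def build_q(P):
    """Matrix Q of (49): q[i][j] = 1 when queues i and j share an input or an output."""
    N = P * P
    return [
        [1 if (i // P == j // P or i % P == j % P) else 0 for j in range(N)]
        for i in range(N)
    ]

def msm_total_queue_bound(lam, P):
    """Right-hand side of (31): bound on E[||X_n||_1], uncorrelated arrivals, S = 2."""
    N = P * P
    Q = build_q(P)
    lam_q = [sum(lam[i] * Q[i][j] for i in range(N)) for j in range(N)]  # Lambda Q
    norm_ipo = max(lam_q)               # assumed reading of ||Lambda||_{I+O}
    norm1 = sum(lam)                    # ||Lambda||_1
    norm2_sq = sum(v * v for v in lam)  # ||Lambda||_2^2
    quad = sum(lam_q[j] * lam[j] for j in range(N))  # Lambda Q Lambda^T
    return (6 * norm1 - quad - norm2_sq) / (2 * (2 - norm_ipo))

def msm_uniform_bounds(P, rho):
    """Closed forms (32) and (33) for uniform traffic."""
    queue = rho * (3 - rho) / (2 * P * (1 - rho) + rho)
    delay = P * (3 - rho) / (2 * P * (1 - rho) + rho)
    return queue, delay

P, rho = 4, 0.8
total = msm_total_queue_bound([rho / P] * (P * P), P)
queue, delay = msm_uniform_bounds(P, rho)
```

Under uniform traffic each row of Q contains 2P − 1 ones, so dividing `total` by N = P² gives the same value as (32); the sketch is useful mainly for non-uniform rate vectors, where no closed form is given.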
6.2. CONSIDERING OUTPUT QUEUES. To derive a bound on the performance of the output queues in a CIOQS, consider the system evolution equation of a single output queue j, which comprises terms due to the input VOQs that store packets directed to that output. Under any admissible traffic pattern, the complete system evolution equation is:
$$\sum_i x_{n+1}^{(Pi+j)} + O_{n+1}^{(j)} = \sum_i x_n^{(Pi+j)} + O_n^{(j)} + \sum_i a_n^{(Pi+j)} - \mathbf{1}_{O_n^{(j)}} \qquad (34)$$
where $\mathbf{1}_x$ is equal to 1 if $x \ne 0$: if there is at least one cell in the output queue, then it will be served during time interval $(n, n+1]$. Taking the expectation of both sides of (34), and considering that the system is stable because the arrival rate at a single output queue is smaller than one (due to the admissibility conditions), we derive:
$$E\left[\mathbf{1}_{O_n^{(j)}}\right] = E\left[\sum_i a_n^{(Pi+j)}\right]. \qquad (35)$$

⁵ A CIOQS adopts a fair scheduling algorithm if the first two moments of the distribution of the time interval between two consecutive services of any nonempty queue in the VOQ structure are finite under any admissible traffic pattern.
If instead we square both sides of (34), following a procedure similar to the one used by Pollaczek and Khinchin to analyze the M/G/1 queue, we obtain, under the previously stated assumption of independent arrival processes at different queues:
$$E\left[O^{(j)}\right] \le \frac{1}{2\left(1 - E\left[\sum_i a_n^{(Pi+j)}\right]\right)} \left\{ E\left[\Big(\sum_i a_n^{(Pi+j)}\Big)^2\right] + E\left[\sum_i a_n^{(Pi+j)}\right] \left(1 + 2E\left[\sum_k x_n^{(Pk+j)}\right]\right) - 2E^2\left[\sum_i a_n^{(Pi+j)}\right] \right\} \qquad (36)$$
having neglected the (unknown) negative term $-2E\left[\sum_i x_n^{(Pi+j)} \mathbf{1}_{O_n^{(j)}}\right]$. In the case of uniform traffic, the previous expression becomes, for every output queue j,
$$E[O] \le \frac{2\rho - \rho^2(1 + 1/P) + 2\rho P E[x^{(i)}]}{2(1 - \rho)} \qquad (37)$$
where $E[x^{(i)}]$ can be obtained using (32).

7. Bounds on Queue Size Variances

7.1. MAXIMUM WEIGHT MATCHING. In this section, we derive bounds on the second moment of the queue lengths in a P × P IQS implementing the MWM scheduling policy. The bounds are derived by repeating the same procedure used in the previous section, using $\|\cdot\|_2$ as the norm operator, rather than $\|\cdot\|_1$, and adopting the (polynomial) Lyapunov function:
$$V(X_n) = \left(X_n X_n^T\right)^{3/2}.$$
We use (17) to obtain a bound on $E[f(\|X_n\|_2)]$, where $f(\|X_n\|_2) = (\|X_n\|_2)^2$ (which is also a polynomial function). From the bounds on the second moment of the queue lengths and the bounds on the average queue lengths derived in the previous section, it is then possible to compute bounds on the queue length variances. Considering (27), and following the procedure reported in Appendix C, for uniform traffic we obtain:
$$E\left[\|X_n\|_2^2\right] \le \frac{2(\rho - \rho^2/P)^2}{(1 - \rho)^2}\, P^3 + \frac{\rho P^2}{3(1 - \rho)}\, P\{X_n = 0\} + \frac{2\rho P^2 (1 + \rho) \min(1, E[1/\|A_n\|_2])}{8(1 - \rho)}\, \sqrt{P}\, (1 - P\{X_n = 0\}). \qquad (38)$$
Then, since $\|X_n\|_2^2 = \sum_{i=0}^{N-1} (x^{(i)})^2$, we have $E[\|X_n\|_2^2] = \sum_{i=0}^{N-1} E[(x^{(i)})^2] = N E[(x^{(i)})^2]$, so that:
$$E\left[(x^{(i)})^2\right] \le \frac{2(\rho - \rho^2/P)^2}{(1 - \rho)^2}\, P + \frac{2\rho(1 + \rho) \min(1, E[1/\|A_n\|_2])}{8(1 - \rho)} \times \sqrt{P}\, (1 - P\{X_n = 0\}) + \frac{\rho}{3(1 - \rho)}\, P\{X_n = 0\}. \qquad (39)$$
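Bound (39) depends on two quantities, $P\{X_n = 0\}$ and $E[1/\|A_n\|_2]$, for which the text gives no closed form. The Python sketch below therefore treats both as inputs; the function and parameter names are hypothetical, and the default value 1.0 for the second quantity simply takes the worst case of the `min`.

```python
import math

def mwm_second_moment_bound(P, rho, p_empty, inv_norm_a=1.0):
    """Right-hand side of (39) for uniform traffic.

    p_empty stands for P{X_n = 0} and inv_norm_a for E[1/||A_n||_2];
    both are assumptions supplied by the caller.
    """
    first = 2 * (rho - rho ** 2 / P) ** 2 * P / (1 - rho) ** 2
    second = (2 * rho * (1 + rho) * min(1.0, inv_norm_a) / (8 * (1 - rho))
              * math.sqrt(P) * (1 - p_empty))
    third = rho / (3 * (1 - rho)) * p_empty
    return first + second + third

def mwm_second_moment_worst_case(P, rho):
    """(39) is linear in P{X_n = 0}, so its largest value is reached at 0 or 1."""
    return max(mwm_second_moment_bound(P, rho, 0.0),
               mwm_second_moment_bound(P, rho, 1.0))

bound = mwm_second_moment_worst_case(16, 0.9)
```

Multiplying the returned value by N = P² gives the right-hand side of (38). The first term grows linearly with P, which is the dependence on the switch size discussed with the simulation results.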
7.2. MAXIMAL SIZE MATCHING. With a similar reasoning and derivation, using
$$V(X_n) = \left(X_n Q X_n^T\right)^{3/2}$$
as a Lyapunov function, where Q is defined in Appendix B, we obtain a bound on the input queue length variances for a CIOQS adopting a mSM scheduler. The derivation is very similar to the one shown in Appendix C and is not presented in this paper. We obtain:
$$E\left[\|X_n\|_2^2\right] \le \frac{1}{2 - \|\Lambda\|_{I+O}} \left\{ \frac{1}{2} E\left[(\|Q\| + \|Q\|^2)\, \|X_n\|_2\, \|A_n - D_n\|_2^2\right] + \frac{1}{8} E\left[\frac{\|Q\|^2 \|A_n - D_n\|_2^4}{\|X_n\|_2}\right] P\{X_n \ne 0\} + \frac{E[(A Q A^T)^{3/2}]}{3}\, P\{X_n = 0\} \right\} \qquad (40)$$
where
$$\|Q\| = \sup_X \frac{\|XQ\|_2}{\|X\|_2}.$$
A bound on the variance of the output queue lengths can be obtained with a procedure similar to the one adopted in Section 6.2, taking the third power of both sides of (34). The derivation is tedious, but straightforward, and therefore we do not report it.
8. Simulation Results

In this section, we compare simulation results with our analytical bounds. We implemented a synchronous simulator of the dynamics of IQ and CIOQ switches, and ran a set of simulations with different arrival processes. All simulation experiments were stopped when a confidence level equal to 0.95 was reached with a 5% confidence interval relative width.

8.1. MAXIMUM WEIGHT MATCHING: UNIFORM TRAFFIC. The arrival process feeding each switch input port is assumed to be Bernoulli, with average offered load ρ. The destinations of cells are uniformly chosen among all output ports. Recalling that the bounds that were previously derived apply to a wide class of arrival processes, and in particular that $a_n^{(i)}$ and $a_{n+m}^{(i)}$ must be independent and identically distributed, but a correlation between $a_n^{(i)}$ and $a_n^{(j)}$ is allowed, we ran different simulations with different arrival patterns. Define the parameter ν as follows:
$$\nu = P\left\{ a_n^{(Pi+k)} = 1 \mid a_n^{(Pj+k)} = 1 \right\}. \qquad (41)$$
ν represents the degree of correlation between the arrival processes at two different VOQs referring to the same output k. Note that, if the arrival processes at different VOQs are independent, then ν = ρ/P. In simulation experiments, we set ν ∈ {1, 0.5, 0.25, 0.125, ρ/P}; note that when ν = 1, a complete correlation among arrival processes exists, that is, if a cell arrives at input i, directed to output k, then a cell also arrives at every other input queue storing cells directed to the same output k.

Figure 1 plots the estimates obtained by simulation for the average delay E[T] at one VOQ, versus the input port load ρ, in a 16 × 16 IQS, as well as the bound in (29). Numerical results show that the bound captures the behavior of E[T], and that the ratio between the bound value and the simulation estimates depends on the value of ν. The bound becomes tighter when the degree of correlation grows; for example, for ν = 1 the ratio becomes close to 2. The impact of correlation among arrivals at different inputs can be observed to be quite substantial.

FIG. 1. Average delay versus offered load in a 16×16 MWM-IQS; uniform traffic configuration.

In Figure 2, instead, we report simulation results and the bound for the average delay versus the number of IQS ports P for a fixed offered load ρ = 0.9. In this case, under an uncorrelated arrival process, the average delay is almost independent of the switch size, but when correlation grows, E[T] exhibits a linear dependence on P; this behavior is again well captured by the bound.

FIG. 2. Average delay versus number of switch input and output ports for ρ = 0.9 in a 16×16 MWM-IQS; uniform traffic configuration.

Figures 3 and 4 plot simulation results and bounds for the average VOQ length $E[x^{(i)}]$, versus the offered load ρ and versus the number of switch ports P, respectively. Considerations similar to those presented for average delays apply also to average queue sizes, due to the deterministic relation between E[T] and $E[x^{(i)}]$ given by Little's theorem.

FIG. 3. Average VOQ length versus offered load in a 16×16 MWM-IQS; uniform traffic configuration.

FIG. 4. Average VOQ length versus number of switch input and output ports for ρ = 0.9 in a 16×16 MWM-IQS; uniform traffic configuration.

Figure 5 plots the simulation estimates for the second moment of individual input queue lengths $E[(x^{(i)})^2]$, versus the offered load ρ, and compares them with the bound in (39). Again, the curve of the bound follows that of the simulation estimates, but now it is not as close as for averages. The curves in Figure 6 report simulation estimates and bounds on $E[(x^{(i)})^2]$, versus the switch size; in this case the bound exhibits a linear dependence on the switch size that is not observed in simulation results.
FIG. 5. Second moment of individual input queue lengths versus offered load in a 16×16 MWM-IQS, uniform traffic configuration.
FIG. 6. Second moment of individual input queue lengths versus number of switch input and output ports for ρ = 0.9 in a 16×16 MWM-IQS; uniform traffic configuration.
8.2. MAXIMUM WEIGHT MATCHING: HOT SPOT TRAFFIC. In this section, we again consider an IQS that adopts the MWM scheduling algorithm. The traffic pattern is now unbalanced; in particular, output 0 is more loaded than other outputs by an overloading factor 6, that is, it is a “hot spot”:
$$\rho^{(Pi+j)} = \begin{cases} \rho/(6P) & \forall i, \; \forall j \ne 0 \\ \rho/P & \forall i, \; j = 0. \end{cases}$$
It is possible to derive bounds on the performance of port 0. In particular,
$$E[\|X_n\|_1] = \sum_{i=0}^{P-1} \sum_{j=0}^{P-1} E\left[x^{(Pi+j)}\right] = \sum_{i=0}^{P-1} \sum_{j=1}^{P-1} E\left[x^{(Pi+j)}\right] + \sum_{i=0}^{P-1} E\left[x^{(Pi)}\right] = P(P-1) E\left[x^{(1)}\right] + P E\left[x^{(0)}\right] \ge P E\left[x^{(0)}\right]. \qquad (42)$$
FIG. 7. Average delay versus offered load in a 16×16 MWM-IQS; hot spot configuration.
FIG. 8. Average delay versus offered load in a 16×16 mSM-CIOQS; hot spot configuration.
This result can be used in combination with (26) to derive a bound on the average delay in the hot spot queues. Figure 7 plots the bound, and compares it with the simulation results obtained when considering different degrees of correlation, as defined in the previous section. It can be noted that the bound is less tight than the one derived under uniform traffic assumptions; this is due to the approximation introduced by (42), which neglects P(P − 1) terms to derive a bound on the hot spot queue.

8.3. MAXIMAL SIZE MATCHING: HOT SPOT TRAFFIC. In this section, we briefly consider a CIOQS that adopts a mSM scheduling algorithm, and present simulation results to assess the tightness of the analytical bound we derived. In this case, given the complexity of the Q operator, we are forced to use uncorrelated arrival processes in our comparison. Instead of presenting a complete set of results, as for IQSs that adopt a MWM scheduling algorithm, we present in Figure 8 simulation results obtained considering the same hot spot scenario defined in 8.2, for both the average hot spot input delay (derived from (31)), and the average hot spot output delay (derived from (36)). Since the bound applies to the entire very large class of mSM algorithms, in order to assess the performance bound, we implemented some of the most common heuristics [Tamir and Chi 1993; McKeown 1999; Ajmone Marsan et al. 1999b; Duan et al. 1998; Chen et al. 1995], as well as very simple algorithms that find a mSM in a greedy way. The performance results of these different mSM algorithms were all very close to one another, and for the sake of clarity, we only report results for the iSLIP algorithm (with a number of iterations equal to the number of ports), which is probably the best known mSM algorithm.

Figure 8 clearly shows that at medium-high load in CIOQSs the input delay is negligible, compared to the output delay. This is also well captured by the bound, which shows that the average input delay is always smaller than 10 time slots, and that the delay does not diverge when the load approaches 1. Indeed, considering the form of (31), it is possible to see that the denominator goes to zero for $\|\Lambda\|_{I+O} \to 2$. But this is not possible under either uniform or hot spot traffic. Thus, the input delay is always upper bounded. This also explains why the different mSM algorithms all perform very similarly: the departure rate from the input queue is very high, thanks to the speed-up factor of 2; thus input queues have very little probability to build up. On the contrary, the output delay derived from (36) shows an asymptotic behavior for ρ → 1, as confirmed by the simulation results.

9. Extension to Time-Correlated Traffic

The requirement that the systems of discrete-time queues that we study must correspond to a DTMC implies either that the exogenous customer arrival processes must be sequences of independent and identically distributed random variables, or that we use supplementary variables to describe the time correlation of exogenous arrivals.
The use of supplementary variables is permitted by the Theorems of Section 3; however, it may result in fairly complex models. A technique that allows time-correlation to be considered without explicitly dealing with supplementary variables consists in using two classes of queues; in addition to the queues that are used to model the IQ or CIOQ switch (switch queues), the model comprises some other queues (correlation queues) that receive batches of customers at their input, and by feeding the switch queues with their output, shape the input customer process, transforming the arriving stream of customer batches into a correlation in time, corresponding to the queue busy periods. Of course, the state of correlation queues corresponds to special supplementary variables, but being able to treat the whole system of queues in a uniform manner helps simplify the model development.

As an example, consider an IQ switch adopting the MWM algorithm as in Section 8, in which batch arrivals at the correlation queues are considered. In particular, let $E[b]$ and $E[b^2]$ be the first and second moments of the batch size. Then, applying the methodology described in this paper, and considering a uniform traffic pattern, the average queue length can be seen to be bounded by
$$E\left[x^{(i)}\right] = \frac{(\gamma + 1)\rho - 2\rho^2/P}{2(1 - \rho)} - \frac{(\gamma + 1)\rho/P - 2\rho^2/P^2}{2(1 - \rho/P)} + \frac{\rho}{P}, \qquad (43)$$
where $\gamma = E[b^2]/E[b]$. In Figure 9, we report simulation results compared to the bound in (43), in which both time and spatial correlation are considered. Also in this case, for high values of ν the bound is quite tight.

FIG. 9. Average queue length versus offered load in a 16×16 MWM-IQS; uniform traffic configuration; time-correlated traffic.

10. Conclusions

In this article, we considered IQ and CIOQ cell-based switch architectures, which are commonly used today in high-performance IP routers. We assumed that the scheduling algorithm used to select at each time slot a subset of the cells buffered at input ports for transfer to output ports is based on either maximal size matching, or maximum weight matching with weights equal to virtual input queue lengths. Using analytical techniques based upon Lyapunov functions, we derived original bounds for the average and variance of the lengths of input queues, and for average cell delays. These bounds complement the previously available theoretical results for IQ and CIOQ cell-based switch architectures, which mostly refer to throughput. Simulation results under both uniform and hot-spot traffic showed that our analytical bounds are rather tight and mimic simulation results well, especially if we consider that they are valid for quite a wide class of traffic patterns.

Appendix

A. Maximum Weight Matching Algorithm

Here we apply (17) to derive a bound on the average cell delay in an IQ switch that adopts the MWM scheduler. Let us first prove the following:

PROPOSITION A.1. For any nonnull normalized vector $\hat{X}_n \in \mathbb{R}_+^N$:
$$D_n \hat{X}_n^T \ge \frac{1}{P}.$$
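PROOF. From (24), we know that $\forall U_n \in \mathbb{R}_+^N$ such that $\|U_n\|_{IO} \le 1$, $D_n \hat{X}_n^T \ge U_n \hat{X}_n^T$. Considering $U_n = \frac{1}{P}\mathbf{1}$ (note that with this choice $\|U_n\|_{IO} = 1$), where $\mathbf{1}$ is the all-ones vector, that is, $\mathbf{1} = \{1, 1, \ldots, 1\}$, being $\hat{X}_n$ normalized, we obtain:
$$\frac{1}{P} \mathbf{1} \hat{X}_n^T = \frac{1}{P} \sum_{i=0}^{N-1} \hat{x}^{(i)} = \frac{1}{P}.$$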
Let $a_n^{(k)}$ be the binary random variable with mean $\lambda^{(k)}$ that represents the number of cells arrived in slot n at VOQ $q^{(k)}$. Considering that at each slot a cell arrives at each input i with probability $\rho_i = \sum_{k=0}^{P-1} \lambda^{(Pi+k)}$, and no cells arrive with probability $1 - \rho_i$, and that each input port comprises P VOQs, the vector of the average arrival rates at slot n is such that:
$$E[A_n] = \Lambda, \qquad E\left[A_n A_n^T\right] = \|\Lambda\|_1. \qquad (44)$$
Since $E[D_n] = E[A_n] = \Lambda$ due to system stability, similarly to (44), it is possible to state that:
$$E\left[D_n D_n^T\right] = \|\Lambda\|_1. \qquad (45)$$
Moreover, being $A_n$ independent of $D_n$,
$$E\left[A_n D_n^T\right] = E[A_n] E\left[D_n^T\right] = \|\Lambda\|_2^2. \qquad (46)$$
In order to derive bounds on average delays and queue sizes, we use (17), with $f(\|X_n\|) = \|X_n\|_1$ and $V(X_n) = X_n X_n^T$. Under these conditions, we must compute the terms in (17). For ε, which appears at the denominator, we compute $\epsilon_{max}$, the supremum value of ε for which equation (16) holds. In order to compute $\epsilon_{max}$, from (25), we can write:
$$-\frac{E\left[X_{n+1} X_{n+1}^T - X_n X_n^T \mid X_n\right]}{\|X_n\|_1} \ge \epsilon \qquad \forall X_n : \|X_n\|_1 > B \qquad (47)$$
for some $B > 0$. The function at the left-hand side of inequality (47) admits a limit for $\|X_n\|_1 \to \infty$ which depends on the direction of vector $X_n$. $\epsilon_{max}$ is equal to the smallest value for this limit, since any smaller value of ε verifies (47) for large values of $\|X_n\|_1$. Thus, recalling that $X_{n+1} = X_n + A_n - D_n$, $\epsilon_{max}$ can be obtained as:
$$\epsilon_{max} = \liminf_{\|X_n\|_1 \to \infty} -\frac{E\left[X_{n+1} X_{n+1}^T - X_n X_n^T \mid X_n\right]}{\|X_n\|_1} = \liminf_{\|X_n\|_1 \to \infty} \frac{E\left[2(D_n - A_n) X_n^T - (A_n - D_n)(A_n - D_n)^T \mid X_n\right]}{\|X_n\|_1} = \liminf_{\|X_n\|_1 \to \infty} \frac{2\left(D_n X_n^T - E[A_n] X_n^T\right)}{\|X_n\|_1},$$
where the quadratic term vanishes in the limit because it is bounded. Consider vector $U_n = E[A_n] + (1 - \|\Lambda\|_{IO}) D_n$. It is straightforward to prove that $\|U_n\|_{IO} \le 1$. Thus, if $X_n \ne 0$, from (24):
$$\frac{D_n X_n^T - U_n X_n^T}{\|X_n\|_1} = \frac{D_n X_n^T - E[A_n] X_n^T - (1 - \|\Lambda\|_{IO}) D_n X_n^T}{\|X_n\|_1} \ge 0$$
and
$$\frac{D_n X_n^T - E[A_n] X_n^T}{\|X_n\|_1} \ge \frac{(1 - \|\Lambda\|_{IO}) D_n X_n^T}{\|X_n\|_1}. \qquad (48)$$
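As a consequence:
$$\epsilon_{max} = \liminf_{\|X_n\|_1 \to \infty} \frac{2\left(D_n X_n^T - E[A_n] X_n^T\right)}{\|X_n\|_1} \ge \liminf_{\|X_n\|_1 \to \infty} \frac{2(1 - \|\Lambda\|_{IO}) D_n X_n^T}{\|X_n\|_1}.$$
Finally, since Proposition A.1 holds, we obtain:
$$\epsilon_{max} \ge \frac{2(1 - \|\Lambda\|_{IO})}{P}.$$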
Let us now evaluate the term $E[V(X_{n+1}) - V(X_n) \mid Y_n = X_n \in H_B]$ appearing in (17):
$$E\left[X_{n+1} X_{n+1}^T - X_n X_n^T \mid X_n\right] = E\left[(X_n + A_n - D_n)(X_n + A_n - D_n)^T - X_n X_n^T \mid X_n\right] = E\left[2(A_n - D_n) X_n^T \mid X_n\right] + E\left[(A_n - D_n)(A_n - D_n)^T \mid X_n\right].$$
Applying (48) and Proposition A.1, we obtain:
$$E\left[X_{n+1} X_{n+1}^T - X_n X_n^T \mid X_n\right] \le -2(1 - \|\Lambda\|_{IO}) D_n X_n^T + E\left[(A_n - D_n)(A_n - D_n)^T \mid X_n\right] \le -\frac{2(1 - \|\Lambda\|_{IO})}{P} \|X_n\|_1 + E\left[(A_n - D_n)(A_n - D_n)^T \mid X_n\right].$$
As a consequence, (17) with $f(\|X_n\|) = \|X_n\|_1$ becomes:
$$E[\|X_n\|_1] \le E\left[\|X_n\|_1 \left(1 - \frac{2(1 - \|\Lambda\|_{IO})}{P\epsilon}\right) + \frac{(A_n - D_n)(A_n - D_n)^T}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\}$$
$$= E\left[\|X_n\|_1 \left(1 - \frac{\epsilon_{max}}{\epsilon}\right) + \frac{(A_n - D_n)(A_n - D_n)^T}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\}$$
$$\le E\left[\frac{(A_n - D_n)(A_n - D_n)^T}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\} \le E\left[\frac{(A_n - D_n)(A_n - D_n)^T}{\epsilon}\right] = \frac{E\left[A_n A_n^T\right] + E\left[D_n D_n^T\right] - 2E\left[A_n D_n^T\right]}{\epsilon}.$$
Thus, considering (44), (45), (46), and choosing $\epsilon \to \epsilon_{max}$⁶ to minimize the bound value, we have (26).

FIG. 10. The Q operator used to define the Lyapunov function; case P = 3.
B. Maximal Size Matching Algorithms

PROOF OF THEOREM 6.4. Consider the Lyapunov function $V(X_n) = X_n Q X_n^T$, where $Q \in \mathbb{R}^{N \times N}$ is such that
$$q_{ij} = \begin{cases} 1 & \text{if } j = (i + lP) \bmod N, \; l = 0, \ldots, P-1 \\ 1 & \text{if } P\lfloor i/P \rfloor \le j < P\lfloor i/P \rfloor + P \\ 0 & \text{otherwise} \end{cases} \qquad (49)$$
that is, $X_n Q$ is the vector whose kth entry is the number of cells stored in interfering set $I^{(k)}$ defined in (30); $[X_n Q]^{(k)} = \sum_{l \in I^{(k)}} x_n^{(l)}$. Figure 10 reports a sketch of the Q matrix structure, in the particular case P = 3. Note that in the general case matrix Q comprises P rows of P × P blocks. The blocks on the main diagonal are all-1 matrices, while all the other blocks are unitary diagonal matrices. We need to prove that Inequality (13) holds in order to prove the system stability. Thus, recalling that $X_{n+1} = X_n + A_n - D_n$, we have
$$E[V(X_{n+1}) - V(X_n) \mid X_n] = E\left[2 X_n Q (A_n - D_n)^T + (A_n - D_n) Q (A_n - D_n)^T \mid X_n\right]$$
being Q a symmetric matrix. For $\|X_n\|_1 \to \infty$, it holds
$$E\left[2 X_n Q (A_n - D_n)^T + (A_n - D_n) Q (A_n - D_n)^T \mid X_n\right] = 2E\left[X_n Q A_n^T \mid X_n\right] - 2E\left[X_n Q D_n^T \mid X_n\right] + o(\|X_n\|_1)$$
where
$$\lim_{\|X_n\|_1 \to \infty} \frac{o(\|X_n\|_1)}{\|X_n\|_1} = 0.$$

⁶ Since $\epsilon_{max}$ is a limit point for the set of admissible ε values, it is possible to build a sequence of bounds whose limit corresponds to the bound obtained for $\epsilon_{max}$, thanks to the continuity of the expression.
We now separately bound $E[X_n Q A_n^T]$ and $E[X_n Q D_n^T]$. For the first term we have $E[A_n] Q \le \|\Lambda\|_{I+O} \mathbf{1}$, where the vector inequality holds between corresponding entries, and
$$E\left[X_n Q A_n^T \mid X_n\right] = E[A_n] Q X_n^T \le \|\Lambda\|_{I+O} \|X_n\|_1. \qquad (50)$$
For the second term, from the definition of the mSM algorithm (which selects at least one cell from interfering set $I^{(k)}$ at each internal slot provided that $q^{(k)} > 0$), since S = 2, the kth entry of vector $Q D_n^T$ is equal at least to 2 whenever $x_n^{(k)}$ is not null, except when $x_n^{(k)} = 1$ and only one cell is extracted from $I^{(k)}$ in the first internal slot. Thus:
$$E\left[\left(Q D_n^T\right)^{(k)} x_n^{(k)} \mid X_n\right] \ge 2 x_n^{(k)} - P\left\{x_n^{(k)} = 1 \wedge d_n^{(k)} = 1 \mid X_n\right\} \ge 2 x_n^{(k)} - P\left\{d_n^{(k)} = 1 \mid X_n\right\} \ge 2 x_n^{(k)} - E\left[d_n^{(k)} \mid X_n\right].$$
Hence
$$E\left[X_n Q D_n^T \mid X_n\right] \ge 2\|X_n\|_1 - \|E[D_n \mid X_n]\|_1 = 2\|X_n\|_1 + o(\|X_n\|_1). \qquad (51)$$
Thus, there exist $B > 0$ and $\epsilon > 0$ such that:
$$E[V(X_{n+1}) - V(X_n) \mid X_n] = 2E\left[X_n Q A_n^T \mid X_n\right] - 2E\left[X_n Q D_n^T \mid X_n\right] + o(\|X_n\|_1) \le 2\left(\|\Lambda\|_{I+O} \|X_n\|_1 - 2\|X_n\|_1\right) + o(\|X_n\|_1) \le -\epsilon \|X_n\|_1 \qquad \forall X_n : \|X_n\|_1 > B.$$
In the remainder of this appendix, we derive Bound (31) regarding input queues. The term $\epsilon_{max}$ can be derived as the largest value of ε that satisfies Inequality (16) with $f(\|X\|) = \|X\|_1$. Using Inequalities (50) and (51), we can write
$$\epsilon_{max} = \liminf_{\|X_n\|_1 \to \infty} -\frac{E[V(X_{n+1}) - V(X_n) \mid X_n]}{\|X_n\|_1} = \liminf_{\|X_n\|_1 \to \infty} \frac{2E\left[X_n Q (D_n - A_n)^T \mid X_n\right]}{\|X_n\|_1} = \liminf_{\|X_n\|_1 \to \infty} 2\, \frac{E\left[X_n Q D_n^T \mid X_n\right] - E\left[X_n Q A_n^T \mid X_n\right]}{\|X_n\|_1} = 2\,(2 - \|\Lambda\|_{I+O}). \qquad (52)$$
Let us now evaluate the term $E[V(X_{n+1}) - V(X_n) \mid Y_n = X_n \in H_B]$ that appears in (17) using (50) and (51):
$$E[V(X_{n+1}) - V(X_n) \mid X_n] \le 2(\|\Lambda\|_{I+O} - 2)\|X_n\|_1 + 2\|E[D_n \mid X_n]\|_1 + E\left[(A_n - D_n) Q (A_n - D_n)^T \mid X_n\right].$$
As a consequence, (17) with $f(\|X_n\|) = \|X_n\|_1$ becomes:
$$E[\|X_n\|_1] \le E\left[\|X_n\|_1 + \frac{2(\|\Lambda\|_{I+O} - 2)\|X_n\|_1 + 2\|E[D_n \mid X_n]\|_1}{\epsilon} + \frac{(A_n - D_n) Q (A_n - D_n)^T}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\}$$
$$\le E\left[\frac{2\|E[D_n \mid X_n]\|_1}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\} + E\left[\frac{(A_n - D_n) Q (A_n - D_n)^T}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\}$$
$$\le E\left[\frac{2\|E[D_n \mid X_n]\|_1}{\epsilon}\right] + E\left[\frac{(A_n - D_n) Q (A_n - D_n)^T}{\epsilon} \;\Big|\; X_n \in H_B\right] \times P\{X_n \in H_B\},$$
where the second inequality holds since $\|X_n\|_1 + 2(\|\Lambda\|_{I+O} - 2)\|X_n\|_1 / \epsilon \le 0$ (see also the proof of Theorem 6.4). Choosing $\epsilon \to \epsilon_{max}$, we get:
$$E[\|X_n\|_1] \le \frac{2\|\Lambda\|_1}{\epsilon_{max}} + E\left[\frac{(A_n - D_n) Q (A_n - D_n)^T}{\epsilon_{max}}\right],$$
since: (i) $E[\|E[D_n \mid X_n]\|_1] = \|E[E[D_n \mid X_n]]\|_1 = \|\Lambda\|_1$, where the first equality holds for the linearity of $\|\cdot\|_1$ on $\mathbb{R}_+^N$; (ii) we must account for a speedup equal to 2; and (iii) for $\epsilon \to \epsilon_{max}$, $B \to \infty$. We need thus to bound
$$E\left[(A_n - D_n) Q (A_n - D_n)^T\right] = E\left[A_n Q A_n^T\right] - 2E\left[A_n Q D_n^T\right] + E\left[D_n Q D_n^T\right]. \qquad (53)$$
Let us start by evaluating the term $E[A_n Q D_n^T]$. As the arrival process is independent of the departure process, it is possible to write
$$E\left[A_n Q D_n^T\right] = E[A_n]\, Q\, E\left[D_n^T\right] = \Lambda Q \Lambda^T. \qquad (54)$$
Evaluating $E[A_n Q A_n^T]$ is more complex, as operator Q applied to vector $A_n$ introduces mixed products between the arrival processes at different queues. If we assume that the arrival processes are independent, so that $E[a^{(k)} a^{(k')}] = E[a^{(k)}] E[a^{(k')}]$, $k \ne k'$, we can write:
$$E\left[A_n Q A_n^T\right] = E\left[A_n I A_n^T\right] + E\left[A_n (Q - I) A_n^T\right] = \sum_k E\left[(a_n^{(k)})^2\right] + \sum_j \sum_{k \in (I^{(j)} \setminus \{j\})} E\left[a_n^{(k)}\right] E\left[a_n^{(j)}\right]. \qquad (55)$$
Recalling Eq. (44), the first term in (55) is equal to $\|\Lambda\|_1$. The second term in (55) can be written as $\Lambda (Q - I) \Lambda^T$. We therefore obtain:
$$E\left[A_n Q A_n^T\right] = \|\Lambda\|_1 + \Lambda (Q - I) \Lambda^T. \qquad (56)$$
The last term in (53) can be easily bounded by
$$E\left[D_n Q D_n^T\right] \le E\left[4\, \mathbf{1} D_n^T\right] = 4\|\Lambda\|_1$$
because no entry of $Q D_n^T$ can be greater than 4, since at most 2 departures can occur (we are considering S = 2) for each internal time slot in each interfering set of queues. However, a stricter bound can be derived, considering the distribution of the possible departure events:
$$\left(Q D_n^T\right)^{(k)} = \begin{cases} 0 & \text{if no cell departs from } I^{(k)}, \text{ thus } d^{(k)} = 0 \\ 1 & \text{if 1 cell departs from } I^{(k)}, \text{ thus } d^{(k)} = 0, 1 \\ 2 & \text{if 2 cells depart from } I^{(k)}, \text{ thus } d^{(k)} = 0, 1, 2 \\ 3 & \text{if 3 cells depart from } I^{(k)}, \text{ thus } d^{(k)} = 0, 1 \\ 4 & \text{if 4 cells depart from } I^{(k)}, \text{ thus } d^{(k)} = 0. \end{cases}$$
By taking the largest value (equal to 3) for $(Q D_n^T)^{(k)}$ when $d^{(k)} = 1$, we can write:
$$E\left[d^{(k)} \left(Q D_n^T\right)^{(k)}\right] \le 4 P\left\{d^{(k)} = 2\right\} + 3 P\left\{d^{(k)} = 1\right\}. \qquad (57)$$
The probability distribution of the ternary random variable $d^{(k)}$ is unknown. However, among all the possible distributions of $d^{(k)}$, it is possible to find the one that maximizes (57) subject to $E[d^{(k)}] = \lambda^{(k)}$. Solving this maximization, we obtain:
$$P\left\{d^{(k)} = 0\right\} = 1 - \lambda^{(k)}, \qquad P\left\{d^{(k)} = 1\right\} = \lambda^{(k)}, \qquad P\left\{d^{(k)} = 2\right\} = 0.$$
This means that (57) can be written as:
$$E\left[D_n Q D_n^T\right] \le E\left[3\, \mathbf{1} D_n^T\right] = 3\|\Lambda\|_1. \qquad (58)$$
From (52), (54), (56), and (58), we obtain (31).

C. Bounds on Queue Size Variances

We consider in this appendix only Maximum Weight Matching. We first prove that the assumptions of Theorem 3.6 are verified when $f(\|X_n\|_2) = (\|X_n\|_2)^2$.

PROPOSITION C.1. Under any admissible independent and identically distributed input traffic pattern, for an IQS using an MWM scheduler with weights equal to queue lengths, there exist $\epsilon \in \mathbb{R}_+$ and $B \in \mathbb{R}_+$ such that
$$E[V(X_{n+1}) - V(X_n) \mid X_n] \le -\epsilon \|X_n\|_2^2 \qquad \|X_n\|_2 > B. \qquad (59)$$
PROOF. From the definition of the Lyapunov function $V(X_n) = (X_n X_n^T)^{3/2}$, for $\|X_n\|_2 \ne 0$, we can write:
$$E[V(X_{n+1}) - V(X_n) \mid X_n] = E\left[\left((X_n + A_n - D_n)(X_n + A_n - D_n)^T\right)^{3/2} - \left(X_n X_n^T\right)^{3/2} \mid X_n\right] = E\left[\left(1 + \frac{2(A_n - D_n) X_n^T + \|A_n - D_n\|_2^2}{\|X_n\|_2^2}\right)^{3/2} \|X_n\|_2^3 - \|X_n\|_2^3 \;\Big|\; X_n\right]. \qquad (60)$$
Taking the limit for $\|X_n\|_2 \to \infty$, we can use a Maclaurin expansion:
$$(1 + t)^{3/2} = 1 + \frac{3}{2} t + o(t), \qquad (61)$$
thus:
$$\lim_{\|X_n\|_2 \to \infty} \frac{E[V(X_{n+1}) - V(X_n) \mid X_n]}{\|X_n\|_2^2} = \lim_{\|X_n\|_2 \to \infty} E\left[3(A_n - D_n) \frac{X_n^T}{\|X_n\|_2} + \frac{3\|A_n - D_n\|_2^2}{2\|X_n\|_2} \;\Big|\; X_n\right]. \qquad (62)$$
The second term in the limit above can be neglected. Using (24), letting as in (48) $U_n = E[A_n] + (1 - \|\Lambda\|_{IO}) D_n$, we obtain
$$\lim_{\|X_n\|_2 \to \infty} E\left[3(A_n - D_n) \frac{X_n^T}{\|X_n\|_2} \;\Big|\; X_n\right] \le \lim_{\|X_n\|_2 \to \infty} -\frac{3(1 - \|\Lambda\|_{IO}) D_n X_n^T}{\|X_n\|_2}.$$
Since it is possible to show, using an approach similar to the one used in the proof of Proposition A.1, that
$$\frac{D_n X_n^T}{\|X_n\|_2} \ge \frac{1}{\sqrt{P}},$$
we can obtain the following:
$$\lim_{\|X_n\|_2 \to \infty} \frac{E[V(X_{n+1}) - V(X_n) \mid X_n]}{\|X_n\|_2^2} \le -3\, \frac{1 - \|\Lambda\|_{IO}}{\sqrt{P}} = -\epsilon_{max}. \qquad (63)$$
We can now use (17) with $f(\|X_n\|) = (\|X_n\|_2)^2$ to obtain the bound on the queue length variance. Consider again an IQS under admissible traffic pattern, so that (44) holds. Recalling (60), and exploiting the following inequality:
$$(1 + t)^{3/2} \le 1 + \frac{3}{2} t + \frac{3}{8} t^2,$$
we can write
$$E\left[\|X_n\|_2^2\right] \le E\left[\|X_n\|_2^2 + \frac{V(X_{n+1}) - V(X_n)}{\epsilon} \;\Big|\; X_n \in H_B\right] P\{X_n \in H_B\}$$
$$\le E\left[\|X_n\|_2^2 + \left(3(A_n - D_n) \frac{X_n^T}{\|X_n\|_2} + \frac{3\|A_n - D_n\|_2^2}{2\|X_n\|_2}\right) \frac{\|X_n\|_2^2}{\epsilon} + \frac{3}{8} \left(\frac{2(A_n - D_n) X_n^T + \|A_n - D_n\|_2^2}{\|X_n\|_2^2}\right)^2 \frac{\|X_n\|_2^3}{\epsilon} \;\Big|\; X_n \in H_B, X_n \ne 0\right] \times P\{X_n \in H_B, X_n \ne 0\} + \frac{E\|A_n\|_2^3}{\epsilon} P\{X_n = 0\}$$
$$\le E\left[\|X_n\|_2^2 + \left(-\epsilon_{max} + \frac{3\|A_n - D_n\|_2^2}{2\|X_n\|_2}\right) \frac{\|X_n\|_2^2}{\epsilon} + \frac{3\left((A_n - D_n) X_n^T\right)^2}{2\epsilon\|X_n\|_2} + \frac{3(A_n - D_n) X_n^T \|A_n - D_n\|_2^2}{2\epsilon\|X_n\|_2} + \frac{3\|A_n - D_n\|_2^4}{8\epsilon\|X_n\|_2} \;\Big|\; X_n \in H_B, X_n \ne 0\right] \times P\{X_n \in H_B, X_n \ne 0\} + \frac{E\|A_n\|_2^3}{\epsilon} P\{X_n = 0\}$$
$$\le \frac{3 E\left[\|X_n\|_2 \|A_n - D_n\|_2^2\right]}{\epsilon_{max}} + \frac{3}{8\epsilon_{max}} E\left[\frac{\|A_n - D_n\|_2^4}{\|X_n\|_2} \;\Big|\; X_n \ne 0\right] P\{X_n \ne 0\} + \frac{E\|A_n\|_2^3}{\epsilon_{max}} P\{X_n = 0\}, \qquad (64)$$
where the last inequality holds since: (i) for $\epsilon \to \epsilon_{\max}$, $B$ goes to $\infty$, and as a consequence $H_B$ tends to the whole state space; (ii) $(A_n - D_n)X_n^T \le \|A_n - D_n\|_2 \|X_n\|_2$ by the Cauchy-Schwarz inequality; (iii) $E[(A_n - D_n)X_n^T \|A_n - D_n\|_2^2] \le 0$, as can be easily verified by direct inspection. Note that in the expectation of $\|X_n\|_2 \|A_n - D_n\|_2^2$, $D_n$ is independent of $\|X_n\|_2$ (although not of $X_n$).

Under the uniform traffic assumption, we derive a bound on the second moment of individual input queue lengths. Using (44), (45) and (46), we get:

$$
E\big[\|A_n - D_n\|_2^2\big] = 2\rho(P - \rho)
\tag{65}
$$

and similarly we obtain:

$$
E\big[\|A_n - D_n\|_2^4\big] \le 2\rho P^2 (1 + \rho)
\tag{66}
$$

and

$$
E\big[\|A_n\|_2^3\big] \le \rho P \sqrt{P}.
\tag{67}
$$
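One way to recover (65), (66) and (67) is sketched below. Since (44), (45) and (46) are not restated in this appendix, the derivation rests on assumptions listed in the comments: 0/1 entries in $A_n$ and $D_n$, at most $P$ arrivals and $P$ departures per slot, $A_n$ independent of $D_n$, and a stationary uniform load with $E[a_{ij}] = E[d_{ij}] = \rho/P$ on each of the $P^2$ queues. The symbols $K_A$, $K_D$, $a_{ij}$ and $d_{ij}$ are introduced here for illustration only.

```latex
% Assumptions (hypothetical reading of (44)-(46)):
%   a_{ij}, d_{ij} \in \{0, 1\};  K_A = \|A_n\|_2^2 \le P;  K_D = \|D_n\|_2^2 \le P;
%   A_n independent of D_n;  E[a_{ij}] = E[d_{ij}] = \rho / P.
\begin{align*}
E\|A_n - D_n\|_2^2
  &= E[K_A] + E[K_D] - 2 \sum_{i,j} E[a_{ij}]\, E[d_{ij}]
   = \rho P + \rho P - 2 P^2 \frac{\rho^2}{P^2}
   = 2\rho(P - \rho), \\
E\|A_n - D_n\|_2^4
  &\le E\big[(K_A + K_D)^2\big]
   = E[K_A^2] + E[K_D^2] + 2\, E[K_A]\, E[K_D] \\
  &\le P \cdot \rho P + P \cdot \rho P + 2 \rho^2 P^2
   = 2 \rho P^2 (1 + \rho), \\
E\|A_n\|_2^3
  &= E\big[K_A^{3/2}\big] \le \sqrt{P}\; E[K_A] = \rho P \sqrt{P}.
\end{align*}
```

The second line uses $\|A_n - D_n\|_2^2 = K_A + K_D - 2 A_n D_n^T \le K_A + K_D$ and $K^2 \le P K$ for $0 \le K \le P$; the third uses $K_A^{3/2} \le \sqrt{P}\, K_A$.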
Finally, we notice that $\|X_n\|_2 \ge \max(1, \|A_{n-1}\|_2)$, so that $E[1/\|X_n\|_2] \le \min(1, E[1/\|A_{n-1}\|_2])$. Considering (27), and (63), (64), (65), (66), (67), we obtain (38) and (39).

RECEIVED OCTOBER 2001; REVISED FEBRUARY 2003; ACCEPTED FEBRUARY 2003
Journal of the ACM, Vol. 50, No. 4, July 2003.