This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the ICC 2008 proceedings.
Load-Balanced Multipath Self-routing Switching Structure by Concentrators

Wei He, Hui Li♠ (Member, IEEE), Bing-rui Wang
Key Lab of Integrated Microsystems, Shenzhen Graduate School, Peking Univ., Shenzhen, China
♠ Corresponding author ([email protected])

Qin-shu Chen
HiSilicon Technologies Co. Ltd, Shenzhen, China

Peng Yi, Bin-Qiang Wang
National Digital Switching Centre, Zhengzhou, China

Abstract—A novel two-stage Load-Balanced Multipath Self-routing Switch Structure is introduced in this paper. Both stages use a multipath self-routing fabric. With simple algorithms and small buffers, the first stage transforms the incoming traffic into uniform traffic, and the second stage forwards the data to their final destinations in a self-routing manner. Compared with other similar structures, this structure stands out with no queuing delay and zero jitter, and its component complexity and propagation delay are significantly reduced. Mathematical analysis and simulations show that this structure achieves 100% throughput under admissible traffic patterns, a common assumption for incoming traffic. For statistically admissible traffic, stacking a few copies of this structure makes it suitable for supporting QoS applications and for building super-large-scale switching fabrics in the Next Generation Network (NGN).

Keywords—Concentrator; Load-Balanced; Self-routing; Switching Fabric
I. INTRODUCTION

The Load-Balanced Birkhoff-von Neumann switch proposed in [1] has intrigued researchers by achieving 100% throughput while using a simple, fixed, cyclically changing connection matrix to resolve conflicts. However, packet order is not maintained, and its component complexity (the number of basic routing cells, e.g. cross-points) grows as O(N^2) because a crossbar fabric is used. To keep packets in order, an algorithm called Full Frame First (FFF) was proposed in [2]. Unfortunately, this structure adopts a three-dimensional queue, which is harder to implement than VOQs. Moreover, all the inputs have to notify the schedulers about their queue status, which may bring about a huge communication overhead. Another approach preserving packet sequence was suggested recently in [3]. With a simple algorithm, plus chamber queues and VOQs, this fabric can obtain 100% throughput under both uniform and nonuniform traffic patterns. Nonetheless, in each time slot, Empty Bank Lists and Occupied VOQ Lists must be collected to make the scheduling decision, which may increase the computation overhead. In addition, as a crossbar is still used, the component complexity remains high. To reduce the component complexity, a Banyan-based Quasi-Circuit Switch [4] was proposed. Quasi-circuit switching [10] lies between packet switching and circuit switching. The Banyan-based quasi-circuit structure can provide QoS at the time scale of frames while reaping the benefits of statistical multiplexing within frames. At the cost of extra load-balancing stages and a speedup of 2, 100% throughput can be provided.

On the other hand, different services have distinct QoS requirements [5]. E-mail and FTP, for example, are insensitive to delay but strict on data loss rate, while real-time applications such as telephony may be strict on delay but loose on loss rate; uncompressed audio and video can accept loss rates of 10^-4 and 10^-5 respectively. In fact, a physical transmission line inherently has its own loss rate: 10^-5~10^-6 for twisted pairs, 10^-6~10^-7 for cable and 10^-9~10^-12 for fiber. So, if the loss rate is properly controlled, nonblocking operation is not absolutely necessary for a large-scale switching fabric to support QoS applications. Based on the above observations, we propose a load-balanced multipath self-routing switching structure (LB-MPSR-C) by cascading two multipath self-routing fabrics [6].

The remainder of the paper is organized as follows. Section II briefly describes concentrators and the multipath self-routing switching fabric. Section III introduces the LB-MPSR-C structure in detail. Simulations and performance evaluations are discussed in Section IV. Section V summarizes the paper.
II. CONCENTRATORS AND MULTIPATH SELF-ROUTING SWITCHING STRUCTURE
A. Concentrators

A comparator [7][8] takes two input numbers and places the smaller number on the upper output and the larger number on the lower one. A 2G-to-G concentrator separates the larger G signals from the others and forms two output groups, where the order within each output group is arbitrary. Intuitively, the 2G-to-G concentrator can be built by interconnecting comparators recursively. Address arbitrators are attached to the outputs of concentrators to clear misrouted packets. Fig. 1 shows the case where G is 4. For a detailed description of concentrators please refer to [6][12].

Fig. 1. Infrastructure of an 8-to-4 concentrator

Supported by the National Natural Science Foundation of China (NSFC No. 60572042), the Hi-Tech Research and Development Program of China (No. 2007AA01Z218) and the Natural Science Foundation of GuangDong (NSFGD2007 No. 295).
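As a small illustrative aid (not the authors' hardware design), the following Python fragment models the comparator cell and the input/output behavior of a 2G-to-G concentrator; the sorted() call is only a stand-in for the recursive comparator network described in [6][12].

```python
# A minimal behavioral sketch of a comparator cell and of what a 2G-to-G
# concentrator computes; it models the function only, not the comparator
# network that a hardware concentrator is actually wired from.

def comparator(a, b):
    """Two-input comparator: smaller value on the upper output, larger below."""
    return (a, b) if a <= b else (b, a)

def concentrate(signals):
    """Partition 2G signals into (larger G, smaller G); the order within each
    output group is arbitrary. sorted() stands in for the comparator network."""
    assert len(signals) % 2 == 0
    g = len(signals) // 2
    ranked = sorted(signals)
    return ranked[g:], ranked[:g]

if __name__ == "__main__":
    print(comparator(9, 4))                         # (4, 9)
    print(concentrate([7, 2, 9, 4, 1, 8, 3, 6]))    # 8-to-4 example as in Fig. 1
```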
Fig. 2. N=128, M=16, G=8 multipath self-routing fabric
B. Multipath Self-routing Structure with Concentrators (MSC)

A Multistage Interconnection Network (MIN) [9][12] is constructed from a set of interconnected nodes, which are divided into several internal stages. Each node can be a simple routing cell or another, more complex routing device. A routing network is a MIN built from self-routing cells; there exists a unique route from any input to any output.

Usually, let N=M×G (N=2^n, M=2^m, G=2^g, where n, m, g are positive integers). The construction of the MSC then takes just two steps: a) construct an M×M routing network (divide-and-conquer networks [12], a subset of routing networks, are often chosen for their modularity, scalability and optimal layout complexity among all M×M routing networks); b) substitute each basic routing cell with a 2G-to-G concentrator. Fig. 2 illustrates this multipath structure with N=128 and G=8, based on a 16×16 divide-and-conquer network. A rough layout sketch is given below.
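As a rough orientation only, the following Python sketch tallies the stages and concentrators of an MSC under the assumption that the M×M divide-and-conquer routing network, like a banyan-class network, consists of log2(M) internal stages of M/2 two-input routing cells; the function name and the returned fields are our own.

```python
# A rough sketch of the two-step MSC construction. Assumption (not stated in
# the paper): the M x M divide-and-conquer routing network has log2(M) internal
# stages of M/2 two-input routing cells, as banyan-class networks do; step (b)
# then replaces each cell with a 2G-to-G concentrator.

from math import log2

def msc_layout(N, M, G):
    assert N == M * G
    stages = int(log2(M))            # internal stages of the M x M network
    cells_per_stage = M // 2         # 2x2 routing cells per stage
    return {
        "internal_stages": stages,
        "concentrators_per_stage": cells_per_stage,
        "concentrator_size": f"{2 * G}-to-{G}",
        "total_concentrators": stages * cells_per_stage,
    }

print(msc_layout(N=128, M=16, G=8))  # the example of Fig. 2
```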
III. LOAD-BALANCED MULTIPATH SELF-ROUTING STRUCTURE

In this paper, we assume that the size of arriving packets is constant, say τ bytes. Besides, time is slotted and synchronized so that a packet can be transmitted within one time slot on each input line. N specifies the number of switch ports. As shown in Fig. 3, two multipath self-routing switching fabrics are concatenated to compose the whole structure, with VOGQs (Virtual Output Group Queues) ahead of the first stage fabric.
Fig. 3. Load-balanced multipath self-routing fabric. Data slice formats in the switching process: α = MG|OG|IG|payload, β = OG|IG|MG|payload, γ = IG|MG|payload (MG: middle group address; OG: output group address; IG: input group address).
Fig. 4. Explanation of Algorithm 1, M=8. For each VOGQ, packets are combined into a single data block, cut into M slices and labeled with MG|OG|IG tags, giving 8 slices for each VOGQ(5,j) at IG5.
Actually, the first stage fabric serves as a load-balancer, which is responsible for spreading any pattern of incoming traffic uniformly over all the ingress ports of the second stage fabric. Consequently, the second stage fabric simply self-routes the reassembled data coming from the first stage to their final destinations.

Every G inputs (outputs) are bundled into an input (output) group, so M groups are formed on the input (output) side (N=M×G). To ease presentation, let IGi (OGi) denote a specific input (output) group, and let MGi represent a line group between the two stages (i, j = 0 to M-1). VOGQs are logically the same as VOQs; however, each VOGQ is responsible for storing packets from G input ports. Let VOGQ(i,j) denote the VOGQ storing the packets destined to OGj from IGi, and let Li,j denote the current queue length of VOGQ(i,j), i.e., the number of packets waiting in the buffer for transmission.

Generally, in our proposed scheme, the processing of arriving packets in each time slot consists of several sequential phases, which should be executed in a pipeline to keep the transfer as fast as possible (see Fig. 3):

1) Arrival phase: new packets arrive at the IGs during this phase. A packet arriving at IGi and destined for OGj is stored in VOGQ(i,j).

2) Packaging phase: packets in the VOGQs are united into data blocks. These blocks are then segmented° and labeled into data slices according to Algorithm 1 and prepared for transmission (see data slice format α in Fig. 3).

3) Balancing phase: aided by the MG tags, the IGs simultaneously send the packaged data slices to the middle groups. When the slices reach the middle groups, the MG addresses are reinserted as MG tags between the IG tag and the payload (see β in Fig. 3).

4) Transferring phase: data slices are further forwarded by the second stage fabric, a self-routing forwarder, to their
final destinations using the OG address tags. When these slices reach the OGs, the OG tags are discarded (see γ in Fig. 3).

5) Departure phase: data are reorganized according to Algorithm 2 and then depart from the OGs of the switch.

° Here we suppose that G≥M. Actually, a large G can provide better performance due to the statistical multiplexing effect within each group. For implementation, data slices are at least several dozen bytes. For instance, let G=256, M=64 and set the packet size to 2 kbytes, i.e. 2^11 bytes. Then the size of a data slice is at least 2^11/64 = 32 bytes. Moreover, according to Algorithm 1, 18 bits of information must be added to each slice for self-routing purposes. Therefore the extra overhead is 18/(32×8), about 7.0%, which is acceptable in practical applications.
Algorithm 1: For each input group, during the packaging phase, the data stored in VOGQ(i,j) are evenly cut into M data slices, each marked as payload in Fig. 3. Then three tags, MG, OG and IG, are prepended to each slice for self-routing through the two-stage fabric.
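The following Python fragment is a minimal behavioral sketch of Algorithm 1 (not the authors' implementation); the Slice class, the zero-padding of short blocks, and the use of the slice index as the MG tag are illustrative assumptions consistent with the labels in Fig. 4.

```python
# A minimal behavioral sketch of Algorithm 1 (packaging phase).
# Assumptions (not from the paper): the block is zero-padded to a multiple of M,
# and the MG tag of a slice equals its index k, as the labels in Fig. 4 suggest.

from dataclasses import dataclass
from typing import List

@dataclass
class Slice:
    mg: int          # middle group tag (which middle group the slice goes to)
    og: int          # destination output group tag
    ig: int          # source input group tag
    payload: bytes   # one of the M equal pieces of the data block

def package_vogq(block: bytes, ig: int, og: int, M: int) -> List[Slice]:
    """Cut one VOGQ(ig, og) data block into M tagged slices (format alpha in Fig. 3)."""
    if len(block) % M:
        block += b"\x00" * (M - len(block) % M)   # pad so the block splits evenly
    size = len(block) // M
    return [Slice(mg=k, og=og, ig=ig, payload=block[k * size:(k + 1) * size])
            for k in range(M)]

# Example: VOGQ(5,0) with M=8 yields slices tagged (MG,OG,IG) = (0,0,5) ... (7,0,5),
# matching the labels "005" ... "705" shown in Fig. 4.
slices = package_vogq(b"example data block for VOGQ(5,0)", ig=5, og=0, M=8)
```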
Fig. 4 gives a detailed example for M=8, where the VOGQs at IG5 satisfy L5,0=4, L5,1=2, L5,6=L5,7=1.

Algorithm 2: Data slices carrying the same IG address are reunited. The MG addresses carried by the slices help keep the data in the original sequence. Afterwards, the restored packets can leave the OGs. Fig. 5 provides a simple example of Algorithm 2.
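Correspondingly, a minimal behavioral sketch of Algorithm 2 is shown below; representing a received slice as a tuple (ig, mg, payload) is an assumption made only to keep the fragment self-contained.

```python
# A minimal behavioral sketch of Algorithm 2 (departure phase) at one output group.
# Each received slice is modeled as a tuple (ig, mg, payload); this representation
# is an illustrative assumption, not the paper's data format.

from collections import defaultdict
from typing import Dict, Iterable, Tuple

def reunite(received: Iterable[Tuple[int, int, bytes]]) -> Dict[int, bytes]:
    """Reunite slices with the same IG tag, ordered by their MG tag."""
    by_ig = defaultdict(list)
    for ig, mg, payload in received:
        by_ig[ig].append((mg, payload))
    return {ig: b"".join(p for _, p in sorted(parts))   # MG order restores the block
            for ig, parts in by_ig.items()}

# Example: slices from IG5 arriving out of order are restored to one data block.
print(reunite([(5, 1, b"data"), (5, 0, b"block")]))      # {5: b'blockdata'}
```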
IV. SIMULATION AND PERFORMANCE ANALYSIS
Several popular switching structures are compared in this section. For brevity, we use LB-BvN, LB-OP and LB-BY to denote the structures proposed in [1], [3] and [4], respectively. N specifies the size of the switching structure.
Fig. 5. Explanation of Algorithm 2: at each OG, data slices from the same IG are reunited into data blocks.
Fig. 6. CBR traffic throughput under admissible conditions: (a) G=64, M=16; (b) G=128, M=8.
Component complexity (Cnet) refers to the number of routing cells used in a structure, which represents its hardware complexity or cost. If we define the time to pass one cell as 1, then the total time to transfer a packet from input to output is recorded as the propagation delay (Dnet). Without speedup, the average proportion of data packets that can be transported from inputs to outputs is called the normalized throughput, denoted Tnet. For instance, the LB-BY structure needs a speedup factor of 2 to achieve 100% throughput; therefore, its Tnet is just 100%/2 = 50%. Qmax refers to the maximum queuing delay before a packet is finally transported to its destination. To illustrate, a packet may wait up to 2N-1 time slots to be transported through LB-BvN; hence, its Qmax is 2N-1.

A. Throughput Analysis

To evaluate the actual throughput under different traffic patterns, we first define two conditions to regulate the incoming traffic.

Definition 1: In time slot n, let Ai,j(n) denote the number of packets arriving at IGi and destined to OGj, with Ai,j(0)=0. Suppose the arrival process is {Ai,j(·), i,j=1…M} and the number of packets arriving in an arbitrary slot is λi,j. If λi,j satisfies (1), then the arrival process is called Admissible:
$$\sum_{i} \lambda_{i,j} \le G \quad \text{and} \quad \sum_{j} \lambda_{i,j} \le G, \qquad i, j = 1, \dots, M \qquad (1)$$
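As a small illustration of condition (1), the following Python fragment checks whether a rate matrix is admissible; the function name and the uniform full-load example are ours, not from the paper.

```python
# A minimal check of the admissibility condition in Eq. (1): every input group
# offers at most G packets per slot and every output group is requested at most
# G packets per slot.

def is_admissible(rates, G):
    """rates[i][j] = lambda_{i,j}, the packets per slot from IG_i to OG_j."""
    M = len(rates)
    input_ok = all(sum(rates[i][j] for j in range(M)) <= G for i in range(M))
    output_ok = all(sum(rates[i][j] for i in range(M)) <= G for j in range(M))
    return input_ok and output_ok

# Uniform traffic at full load, lambda_{i,j} = G/M for all i, j, is admissible.
M, G = 8, 128
print(is_admissible([[G / M] * M for _ in range(M)], G))   # True
```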
Definition 2: If the arrival process satisfies the Strong Law of Large Numbers with probability 1, i.e.

$$\lim_{n \to \infty} \frac{A_{i,j}(n)}{n} = \lambda_{i,j}, \qquad i, j = 1, \dots, M \qquad (2)$$
and λi,j also satisfies (1), then the arrival process is called Statistically Admissible.

Admissible Traffic Pattern:

Theorem 1: For an arbitrary admissible traffic pattern, the LB-MPSR-C structure has 100% throughput.

Proof: Referring to Fig. 3 again, consider the arrival process A(t) (with M×M traffic matrix Λ) to the structure. After being processed by the load balancer of the first stage fabric, π1(t), the arrival process to the second stage is B(t)=π1(t)A(t), where B(t) is uniform. Suppose the number of packets entering the second stage fabric from MGi to OGj is λ'i,j, and λi,j satisfies (1). According to the admissibility conditions, we have λ'i,j ≤ G/M. Consider the worst case, i.e., λ'i,j = G/M for all the middle groups, in a general multipath self-routing switching structure (see Fig. 2 for an example). As the traffic load is uniformly spread out, for a specific data line group k in internal stage i, 2^i distinct input groups feed their data to it, and these packets are finally transported to M/2^i different output groups. Since λ'i,j ≤ G/M, for an arbitrary line group k in the general multipath self-routing model:
$$\mathrm{Lineload}(k) \le (G/M) \cdot (M/2^{i}) \cdot 2^{i} = G \qquad (3)$$
Clearly, under this condition, the line load will not be oversubscribed, and 100% throughput can be achieved. ■

To verify Theorem 1, we use NS2 [11] to simulate the throughput performance. Constant-bit-rate traffic over UDP is generated on each input group, destined uniformly to all output groups. In Fig. 6(a), a 16×16 routing network such as that in Fig. 2 is adopted as the core structure and G is set to 64, while in Fig. 6(b) an 8×8 routing network is used with G=128. Moreover, we set the packet size to 128 bytes, the simulation time to 5 seconds, and the total switching capacity to 160 Gbps in both cases. From Fig. 6, we observe that, under admissible traffic conditions, the simulation results match our mathematical analysis nicely regardless of the core routing network of the multipath fabric.

Statistically Admissible Traffic Pattern:
Fig. 7. Blocking rate analysis for statistically admissible traffic: the random variables Xj, Yj and Dj at internal stage j of the second stage fabric.

Fig. 8. Blocking rate vs. load: (a) N=65536; (b) G=32.
For statistically admissible traffic patterns, the incoming traffic entering the first stage fabric will also be evenly spread over all the middle groups, i.e., the input groups of the second stage fabric. However, data lines in the structure may sometimes be oversubscribed and data loss may occur. Obviously, the traffic entering the second stage fabric is statistically uniform. Each input of the second stage fabric independently carries an active packet with probability p, and the destination address of each packet is independently and uniformly distributed. Let the random variable Xj (1≤j≤m) represent the number of packets that arrive at an output group in internal stage j in a time slot. Clearly, Xj is bounded by G.

Theorem 2: Referring to Fig. 7, the distribution of Xm can be computed inductively as follows. Let X0 be a random variable with the binomial distribution B(G, p). For j≥1, let Yj-1 be an i.i.d. copy of Xj-1 and let the random variable Dj be B(Xj-1+Yj-1, 0.5) distributed. Here X0 denotes the initial traffic load offered to the second stage fabric, and Dj reflects the number of packets that depart from one output group of the internal stage-j concentrators of the second stage fabric.
$$\Pr(A_j = t) = A_j(t) = \sum_{x+y=t} X_{j-1}(x)\, Y_{j-1}(y), \qquad 0 \le x, y \le G,\ 1 \le j \le m \qquad (4)$$

$$\Pr(D_j = z) = \begin{cases} \displaystyle\sum_{t=z}^{2G} A_j(t)\, 2^{-t} \binom{t}{z} & \text{if } z < G \\[6pt] \displaystyle\sum_{t=G}^{2G} \left( A_j(t)\, 2^{-t} \sum_{k=G}^{t} \binom{t}{k} \right) & \text{if } z = G \end{cases} \qquad (5)$$

$$\Pr(\text{blocking}) = 1 - \sum_{z=0}^{G} z \cdot X_m(z) \big/ (G \cdot p) \qquad (6)$$ ■
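To make the recursion in (4)-(6) concrete, the following Python sketch evaluates it numerically; the function name, the parameter values, and modeling the cap in (5) with min(z, G) are our own illustrative choices (the authors used Matlab).

```python
# A minimal numerical sketch of the blocking-rate recursion in Eqs. (4)-(6).
# X_0 ~ B(G, p); at each internal stage the loads of two groups are added
# (Eq. 4), split binomially with probability 1/2 and capped at G (Eq. 5);
# Eq. (6) gives the fraction of the offered load that is eventually lost.

from math import comb

def blocking_rate(G, m, p):
    # distribution of X_0: binomial B(G, p), stored as X[z] = Pr(X = z)
    X = [comb(G, z) * p**z * (1 - p)**(G - z) for z in range(G + 1)]
    for _ in range(m):
        # Eq. (4): A_j = X_{j-1} + Y_{j-1}, convolution of two i.i.d. copies
        A = [0.0] * (2 * G + 1)
        for x, px in enumerate(X):
            for y, py in enumerate(X):
                A[x + y] += px * py
        # Eq. (5): binomial split with probability 1/2, departures capped at G
        Xn = [0.0] * (G + 1)
        for t, pt in enumerate(A):
            for z in range(t + 1):
                Xn[min(z, G)] += pt * comb(t, z) * 0.5**t
        X = Xn
    carried = sum(z * pz for z, pz in enumerate(X))   # E[X_m]
    return 1 - carried / (G * p)                      # Eq. (6)

print(blocking_rate(G=32, m=4, p=0.9))
```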
According to [6], (4), (5) and (6) may be used to compute the blocking rate of the second stage fabric, and we use Matlab to plot the blocking rate in Fig. 8. From Fig. 8(a), we see that
TABLE I. PERFORMANCE COMPARISON♦

            LB-BvN     LB-OP      LB-BY
Fabric      Crossbar   Crossbar   BANYAN
Jitter      Y          N
Cnet        O(N^2)     O(N^2)
Dnet        O(N)       O(N)
Qmax        2N-1
Tnet        100%