Journal of Circuits, Systems, and Computers
© World Scientific Publishing Company
THREE-STAGE CLOS-NETWORK SWITCH ARCHITECTURE WITH BUFFERED CENTER STAGE FOR MULTI-CLASS TRAFFIC
MOO-KYUNG KANG and CHONG-MIN KYUNG
Department of Electrical Engineering & Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon 305-701, Republic of Korea
Email: [email protected] and [email protected]
Received August 24, 2005
Revised October 31, 2005
Accepted December 1, 2005

The memory-space-memory (MSM) arrangement is a popular architecture for implementing three-stage Clos-network switches with distributed arbitration. The scalability of this architecture, however, is limited by the round-trip communication delay between the first and second stages. Moreover, virtual output queues (VOQs) do not completely remove blocking in the buffered modules under multi-class traffic. In this paper, we propose the competition-free memory-memory-memory (CFM3) switch, a three-stage Clos-network switch with a buffered center stage. CFM3 deploys buffered modules in all stages to simplify communication between stages. To reduce blocking, each module is equipped with a set of buffers fully separated according to the destinations, classes of packets, and the input ports of the module. Despite the buffered center stage, CFM3 is free from the reordering problem owing to a simple control mechanism. Simulation results show that the delay of the proposed CFM3 switch closely approaches that of the ideal output-queued switch under multi-class traffic when the strict priority policy popularly used for class-based switches is deployed. CFM3 achieves 100% throughput under uniformly distributed four-class traffic with strict priority policy, while the traditional MSM switch records about 77% throughput.

Keywords: scalable switch, distributed arbitration, multi-class traffic, competition-free memory-memory-memory switch.
1. Introduction

The steady growth of the Internet population demands switch systems that quickly process and arbitrate incoming packets to support broader bandwidth and a larger number of ports. To achieve the required performance, various kinds of single-stage switches have been invented. Research based on the crossbar switch deals with arbitration policies as well as the structure and placement of the queues that store packets.1,2,3 Buffered crossbar switches (a.k.a. cross-point buffered switches) were proposed to reduce the complexity of crossbar switches.4,5 A small amount of buffering at the crosspoints relaxes the complexity of arbitration while still enhancing throughput.
On the other hand, there are obviously some limiting factors in integrating a large number of ports in a single chip, such as manufacturing yield and sufficiency of market size. This has motivated research on implementing large-scale, scalable multi-stage switches by interconnecting multiple chip modules. The well-known three-stage Clos-network has been studied as the interconnection network of such scalable switches.6,7 Originally this network was created for circuit switches. Theoretical studies have shown that the Clos-network has a nonblocking property with the minimum number of crosspoints in the switches.8 This property allows the Clos-network to be used for estimating the statistical possibility of setting up a virtual connection at the call level in packet switches.9 In this paper, we are interested in packet switches and assume that switching and all the operations of input/output ports are performed on a fixed-length time slot. Packets with variable length can be segmented to fit into each time slot at the entrance and reassembled at the exit.
Fig. 1. An N × N three-stage Clos-network switch, C(n, m, r) consisting of r n×m input modules, m r×r center modules, and r m×n output modules.
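For concreteness, the C(n, m, r) parameterization can be captured in a few lines of code. The sketch below is purely illustrative (the ClosConfig name and fields are ours, not part of the paper); it simply derives N and the per-stage module dimensions from n, m, and r.

```python
from dataclasses import dataclass

@dataclass
class ClosConfig:
    n: int   # SIPs per input module (and SOPs per output module)
    m: int   # number of center modules = MOPs per input module
    r: int   # number of input modules = number of output modules

    @property
    def N(self) -> int:
        return self.n * self.r            # total switch ports, N = n * r

    def module_dims(self) -> dict:
        # (inputs, outputs) of a module in each stage, as in Fig. 1
        return {"IM": (self.n, self.m),
                "CM": (self.r, self.r),
                "OM": (self.m, self.n)}

cfg = ClosConfig(n=16, m=16, r=16)        # the 256-port case used in Section 3.2
assert cfg.N == 256
print(cfg.module_dims())
```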
A three-stage Clos-network switch (simply a switch), as shown in Fig. 1, consists of a number of smaller blocks called modules or switch modules. Each module has module input ports (MIPs) and module output ports (MOPs). A switch has switch input ports (SIPs) and switch output ports (SOPs). Input port controllers (IPCs) and output port controllers (OPCs) are attached externally to the SIPs and SOPs, respectively. We assume that data moves through the switch in a fixed-size unit called a cell. In the N × N three-stage Clos-network switch C(n, m, r) in Fig. 1, the first stage (or input stage) has r input modules (IMs), the middle stage (or center stage) has m center modules (CMs), and the last stage (or output stage) has r output modules (OMs). Their dimensions, representing the number of input and output ports, are n×m, r×r, and m×n, respectively. Each module has connections
with all the modules in the neighboring stage(s).

Three-stage Clos-network switches can be categorized into three types according to the presence of buffers in each stage: memory-memory-memory (M3) switches, space-space-space (S3) switches, and memory-space-memory (MSM) switches. M3 switches,10 which have buffered modules in all three stages, have the advantages of ease of distributed arbitration, simplicity, and scalability; however, they suffer from the cell reordering problem, as the sequence of cells can be disrupted by nonuniform delays in the buffers of CMs. Solving this problem requires complex reordering hardware in each OPC, which has so far deterred the wide acceptance of M3 switches. On the other hand, S3 switches, which have bufferless modules in all three stages, require very complex centralized arbitration hardware for determining all the paths through the three stages for all cells departing from IPCs. This not only causes a very significant hardware overhead but also limits scalability. Last, MSM switches, which have buffered IMs/OMs and bufferless CMs, are the most popular, since they are scalable using distributed arbitration while not causing any ordering problem.

The ATLANTA architecture7 is a well-known example of the MSM switch. In spite of the multiple paths from each SIP to each SOP, the cell sequence is maintained because the CMs are bufferless. Each IM manages multiple queues, each corresponding to a destination SOP. These queues, called virtual output queues (VOQs), are used to prevent Head-Of-Line (HOL) blocking. OMs have a queue for each SOP. Selective backpressure from the output stage is exerted on the input stage based on the buffer usage. This architecture uses a simple distributed arbitration algorithm to dispatch cells from the input stage to the output stage, and several algorithms have been proposed to increase the throughput.11,12 Basically, the dispatching algorithms for MSM switches use communication between the input stage and the center stage. First, each IM selects at most m candidate VOQs for transmission and requests the corresponding CMs for the assignment of paths. Second, each CM chooses one among the multiple requests from IMs for each MOP of the CM, and then grants the requests to the chosen IMs. These algorithms consume a number of valuable clock cycles for the round-trip communication between IMs and CMs.

Moreover, the shared memory management in IMs and OMs can still suffer from a potential blocking problem under multi-class traffic. Generally, IMs and OMs have a limited amount of on-chip memory, whereas a huge amount of memory is located in the IPC in front of each SIP. To efficiently manage the memory, OMs apply selective backpressure to the IMs to prevent overflow during congestion. IMs also apply selective backpressure to the IPCs based on the VOQs. Although this scheme suppresses HOL blocking, modules with limited memory can still suffer from blocking caused by low-priority cells.4 Under congested conditions, low-priority cells pre-existing in a module can prevent those with higher priority from entering the module due to the memory limitation. These limitations of existing switch schemes have led us to propose a new scalable switch scheme in this paper.
The rest of this paper is organized as follows. In Section 2, we describe the weakness of MSM switches in dealing with multi-class traffic, which is very critical in various security-critical applications. Then we propose a switch architecture called competition-free memory-memory-memory (CFM3) to solve the problem. Experimental results are given supporting the claimed performance improvement of the proposed scheme under multi-class traffic conditions. Some discussions regarding implementation are given before the concluding remarks.

2. Memory-Space-Memory Switch

2.1. Dispatching Algorithms

In MSM switches, cells departing from IMs share the paths in CMs to reach the proper OMs for the destined SOPs. To organize the path allocation, MSM switches use various dispatching schemes. Since these schemes affect the performance of the switching system, dispatching algorithms have been an active research topic. Though the details differ from each other, all dispatching schemes generally consist of two steps: 1) matching within an IM, and 2) matching between IM and CM. Fig. 2 shows the whole dispatching process in sequence from 1 to 5. In the first step, each IM selects, out of the N VOQs in the IM, at most m candidate VOQs for transmission (since the number of MOPs of the IM, m, is smaller than N). Each IM sends a request to the CM connected to each MOP of the IM for the transmission of a cell in the selected VOQ. In the second step, each CM receives at most r requests from IMs. The requests may claim to occupy the same or different MOPs of the CM. If a MOP of a CM is requested by more than one IM, the CM selects a request according to rules that vary with the deployed dispatching algorithm. The result of this selection is notified to the corresponding IMs. Now, each IM sends the head cells of the VOQs granted by the CMs.

For each time slot, the decision process including the round-trip communication has to be completed before cells depart from IMs for CMs. This means the decision process should be finished within a single time slot. As routers and switches become faster, this round-trip communication is more likely to become a bottleneck in achieving high performance. In this paper we will describe how the proposed scheme removes this round-trip communication.

2.2. Blocking under Multi-Class Traffic

IMs and OMs in MSM switches use buffered modules with shared memory management. While IPCs have a huge amount of cheap external memory to store cells, IMs and OMs can have only a limited amount of on-chip memory. Shared memory technology has, therefore, been used to maximize the utilization of the memory resource, where selective backpressure is applied to prevent buffer overflow according to the memory used by each MIP and MOP. If the amount of used memory exceeds a given threshold, the corresponding MIPs signal the preceding modules to stop the cell transmission. This backpressure is removed when the amount of used memory is decreased below another given threshold.
Fig. 2. Decision process, in sequence from 1 to 5, to dispatch cells in IMs to CMs. This process includes round-trip communication between IMs and CMs: IMs request paths to CMs. CMs grant requests after resolving contention.
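To make the two-step handshake concrete, the following sketch models a single time slot of the request/grant exchange of Fig. 2. It is a simplification written for illustration only: the selection rules are random placeholders rather than a published dispatching algorithm (e.g., CRRD or SRRD), and, for brevity, backlog is tracked per destination OM rather than per destination SOP.

```python
import random

def msm_dispatch_slot(voq_backlog, m, r, rng=random):
    """One time slot of the MSM request/grant handshake sketched in Fig. 2.

    voq_backlog[i][j] is the number of cells IM i holds for OM j (aggregated
    per OM for brevity). Returns the granted (im, cm, om) transfers.
    """
    # Steps 1-2: each IM selects at most m candidate queues and sends one
    # request per IM output port, i.e., one request to each chosen CM.
    requests = {}                                    # (cm, om) -> [im, ...]
    for im in range(r):
        backlogged = [om for om in range(r) if voq_backlog[im][om] > 0]
        candidates = rng.sample(backlogged, min(m, len(backlogged)))
        for cm, om in enumerate(candidates):         # candidate k is sent to CM k
            requests.setdefault((cm, om), []).append(im)

    # Steps 3-4: each CM resolves contention, granting one request per MOP.
    grants = [(rng.choice(contenders), cm, om)
              for (cm, om), contenders in requests.items()]

    # Step 5: granted IMs send the head cells of the granted queues.
    for im, cm, om in grants:
        voq_backlog[im][om] -= 1
    return grants

backlog = [[1] * 4 for _ in range(4)]                # tiny 4-IM, 4-OM example
print(msm_dispatch_slot(backlog, m=4, r=4))
```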
Although this implementation is popular in MSM switches, a significant problem can arise under multi-class traffic, which is quite prevalent nowadays.

Fig. 3 illustrates a blocking case. There is a 4 × 4 switch module using shared memory, and there are cells with three kinds of priorities, i.e., high, medium, and low. Each small box represents a cell with the index of its MIP, that of its destination SOP, and its priority. The ‘count’ value for each MIP of a module is incremented each time a cell enters the module through the MIP, and decremented by one when the cell leaves the module. This ‘count’ is used to control the backpressure of the MIP with given thresholds. In this illustration, each MIP turns on the backpressure if ‘count’ is larger than or equal to 5, and turns off the backpressure if it is smaller than or equal to 4. In Fig. 3(a) the first MIP is in the backpressured state and cannot receive cells since the ‘count’ value is 5. In the next time slot, shown in Fig. 3(b), the first MIP is still in the backpressured state, since the cells from that MIP, indexed as (1,1,L), (1,2,L), and (1,3,L), are not serviced due to their lower priority than the cells that have entered through other MIPs. The cell (1,4,H) waiting at the first MIP cannot be switched through the fourth MOP, which is idle and available in this case. This will eventually starve the preceding module trying to send cells to the first MIP of this module. In this paper we will show how to remove such starvation by removing competition among multi-class cells from different MIPs.
Fig. 3. A case of blocking under multi-class traffic: (a) A cell (1,4,H) from the first MIP cannot enter the module as count ≥ 5. (b) In the next time slot, the cell from the first MIP still cannot enter the module, although the fourth MOP is available.
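The per-MIP counter and hysteresis thresholds used in this example are easy to express in code. The sketch below (class and method names are ours; the thresholds of 5 and 4 follow the illustration) shows why the high-priority cell (1,4,H) remains blocked: as long as the low-priority cells keep losing arbitration, the count never falls back below the off-threshold.

```python
class MipBackpressure:
    """Per-MIP occupancy counter with hysteresis, as in the Fig. 3 example.

    Threshold values (on at >= 5, off at <= 4) follow the illustration;
    real modules would use configuration-dependent thresholds.
    """
    def __init__(self, on_threshold=5, off_threshold=4):
        self.count = 0
        self.on_threshold = on_threshold
        self.off_threshold = off_threshold
        self.backpressured = False

    def cell_entered(self):
        self.count += 1
        if self.count >= self.on_threshold:
            self.backpressured = True        # tell the upstream module to stop

    def cell_left(self):
        self.count -= 1
        if self.count <= self.off_threshold:
            self.backpressured = False       # resume transmission upstream

# The blocking case of Fig. 3: MIP 1 holds five low-priority cells that lose
# arbitration to medium-priority cells from other MIPs, so its count never
# drops below the off-threshold and the waiting high-priority cell (1,4,H)
# stays blocked even though MOP 4 is idle.
mip1 = MipBackpressure()
for _ in range(5):
    mip1.cell_entered()
assert mip1.backpressured                    # (1,4,H) cannot enter while this holds
```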
3. Proposed Architecture: Competition-Free M3 Switch

In order to reduce the communication overhead, the roles of IMs and CMs in the arbitration process need to be explicitly partitioned as follows: CMs continuously provide each IM with the status of their buffers, which the IMs use to decide which cells to transmit to CMs. The arbitration in CMs is then performed based on the received cells. The blocking problem mentioned in Section 2.2 occurs because of the competition among cells with different priorities from different modules. This problem can be solved by partitioning the buffers that store cells according to the MIPs of entry and the priorities of the cells. In this section, we introduce an element module satisfying the above requirements and then propose a new three-stage Clos-network switch called CFM3 to overcome the weakness of MSM switches.

3.1. Competition-Free Virtual FIFO-Group Switch

A FIFO-group is defined, for a certain scheduling discipline of the output-queued switch, as a subsequence of cells whose order is maintained throughout the scheduling.4 For example, a data flow is a FIFO-group in any per-flow queueing discipline.13 A group of cells with the same priority class in a VOQ is also a FIFO-group in a class-based scheduler.14 We use the term virtual FIFO-group queue (VFQ) to denote a group of queues where there is a separate queue for each FIFO-group. The VOQ is a special case of the VFQ where the number of classes to the same destination SOP is one.

Now we define the competition-free virtual FIFO-group queue (CFVFQ) for a module that has multiple MIPs. A CFVFQ has one VFQ per MIP to resolve competition among MIPs. For an n1×n2 module with c classes in an N×N switch, the CFVFQ has n1·N·c queues, since there are n1 MIPs, N destination SOPs, and c classes. We define a switch module as a competition-free virtual FIFO-group switch (CFVFS) if it contains a CFVFQ where each queue has at least one dedicated cell-storage space. A CFVFS has no competition for storage resources among incoming cells, since the space is explicitly split according to the MIPs, destination SOPs, and classes of cells. In the following we assume that only one cell-storage space is allotted for each queue.

The control mechanism for dispatching cells between CFVFSs can be implemented in a distributed fashion, as shown in Fig. 4. Module A does not consider any VFQ in module B except the one dedicated to module A, i.e., the shaded box in Fig. 4. Therefore, no competition arises among cells from different MIPs, and as a result there is no blocking. Transmission of information on the availability of cell storage is unidirectional, i.e., from B to A, and happens only when a cell in module B departs for the next module. For example, if the dimension of module B is n1×n2, the amount of information per time slot is at most n2 bits, since the maximum number of cells departing from module B is n2. Because the information flow for the decision process is unidirectional and the information is only a flag indicating the availability of cell storage, the associated overall communication bandwidth for the decision process is negligible.
Fig. 4. Decision process to dispatch cells between two CFVFSs, A and B. There’s no round-trip delay due to communication between A and B, as the communication is unidirectional.
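The one-way handshake of Fig. 4 can be sketched as follows. This is an illustrative simplification (the class and method names are ours, and the selection rule is a simple placeholder): the downstream module only reports free slots in the VFQ dedicated to the upstream module, and the upstream module decides locally which cell to send.

```python
class DownstreamCFVFS:
    """Models only the VFQ that module B dedicates to one upstream module A
    (the shaded box in Fig. 4); each slot corresponds to a (SOP, class) pair."""
    def __init__(self, num_slots):
        self.free = [True] * num_slots

    def availability(self):
        # Step 1: report which dedicated slots are free. In hardware this
        # flag is sent only when a slot is vacated, so traffic is one-way.
        return list(self.free)

    def receive(self, slot):
        self.free[slot] = False            # Step 3: the arriving cell is stored

class UpstreamCFVFS:
    def __init__(self, ready_slots):
        self.ready = set(ready_slots)      # queues with a head cell ready to go

    def dispatch_one(self, downstream):
        # Step 2: pick a cell whose dedicated downstream slot is free.
        for slot in sorted(self.ready):
            if downstream.availability()[slot]:
                self.ready.remove(slot)
                downstream.receive(slot)   # Step 3: send the cell
                return slot
        return None                        # nothing sendable this time slot

b = DownstreamCFVFS(num_slots=8)
a = UpstreamCFVFS(ready_slots={2, 5})
assert a.dispatch_one(b) == 2              # first free dedicated slot wins
```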
3.2. Competition-Free M3 Switch

The distributed nature of the CFVFS can be utilized in implementing a three-stage Clos-network switch to resolve the problems of MSM pointed out in Section 2. A CFVFS is used for each module in the three-stage Clos-network switch, which is named here the Competition-Free M3 (CFM3) architecture. Since modules differ in the number of MIPs and reachable destination SOPs, the required number of cell-storage locations per module is different for each stage. Fig. 5 shows the storage structure of a CFVFS in each stage, assuming c classes are used. In the input stage, a CFVFS has n·N·c cell-storage locations, since there are n MIPs, N reachable destination SOPs, and c classes (see Fig. 5(a)). Based on a similar calculation, each CM has r·N·c cell-storage locations and each OM has m·n·c cell-storage locations, as shown in Fig. 5(b) and Fig. 5(c). In Fig. 5, each row comprises a VFQ and is dedicated to an MIP.

An OM sends the storage status of each VFQ to the corresponding CM. Based on the information from its r MOPs, each CM selects cells to be transferred among those stored, without considering the status of other CMs. No competition among cells arises, as there are separate storage locations for each cell. A similar explanation applies between IMs and CMs.
Fig. 5. Storage structure of a CFVFS in each stage of CFM3: (a) Storage in each IM. (b) Storage in each CM. (c) Storage in each OM (i is a multiple of n). The three numbers in each small box represent MIP, destination SOP, and class, respectively.
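The storage counts in Fig. 5 follow directly from the module dimensions. The helper below is a hypothetical sketch (function name and interface are ours) that reproduces the per-stage counts and, for the C(16, 16, 16) example discussed later in this section, the 2^14 queues and 16 Mbit of storage per IM.

```python
def cfvfq_storage(n, m, r, c, cell_bytes=128):
    """Cell-storage locations per module in each CFM3 stage (Fig. 5),
    assuming one dedicated cell slot per CFVFQ queue."""
    N = n * r
    slots = {"IM": n * N * c, "CM": r * N * c, "OM": m * n * c}
    bits = {stage: count * cell_bytes * 8 for stage, count in slots.items()}
    return slots, bits

# The 256-port example: C(16, 16, 16), four classes, 128-byte cells
# -> 2^14 queues and 16 Mbit of storage per IM (or per CM).
slots, bits = cfvfq_storage(n=16, m=16, r=16, c=4)
assert slots["IM"] == 2 ** 14
assert bits["IM"] == 16 * 2 ** 20          # 16 Mbit per IM
```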
Fig. 6 illustrates the path through which a cell with priority p from the s-th IPC destined to the d-th OPC is switched in a CFM3 switch. IMS(a, b, c, d) represents the storage location of a cell with priority d in the a-th IM, from the b-th MIP, destined to the c-th SOP. “*” is a wildcard character used in place of a, b, c, or d; for example, IMS(a, b, c, ∗) represents all storage locations of cells with any priority in the a-th IM, from the b-th MIP, destined to the c-th SOP. CMS and OMS represent storage locations of a cell in a CM and an OM, respectively. Since there is only one storage location for each combination of MIP, destination SOP, and class in a CFVFS, the location of a cell is decided as soon as the cell enters a CFVFS. A cell from the s-th IPC enters the β-th IM if its location, IMS(β, α, d, p), is available, where α − 1 = (s − 1) mod n and β − 1 = ⌊(s − 1)/n⌋. It may use any of the m CMs to arrive at the d-th OPC; in other words, the candidate storage locations in the center stage are CMS(∗, β, d, p). The IM knows the status of these candidate locations, since it supplies their cells and the consumption in each CM is reported when a cell leaves the CM. In this process, the communication before cell transmission is unidirectional, i.e., from CMs to IMs. Assume that the γ-th CM is available and selected; the cell is then stored in CMS(γ, β, d, p). When OMS(δ, γ, d, p) is available, the cell is transmitted to the δ-th OM, where δ − 1 = ⌊(d − 1)/n⌋. Then it arrives at the d-th OPC.
Fig. 6. A path through which a cell from the s-th IPC destined to the d-th OPC is switched in a CFM3 switch. The cell is sequentially stored in the shaded boxes during switching, i.e., IMS(β, α, d, p) → CMS(γ, β, d, p) → OMS(δ, γ, d, p), where α, β and δ are defined by α − 1 = (s − 1) mod n, β − 1 = ⌊(s − 1)/n⌋, δ − 1 = ⌊(d − 1)/n⌋. The γ-th CM is assumed to be selected as the path.
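The index arithmetic in the caption can be written out directly. The sketch below (a hypothetical helper, using 1-based port and module numbering as in the figure) computes α, β, and δ from s and d; γ is excluded because the CM is chosen at run time among the m candidates.

```python
def cfm3_path_indices(s, d, n):
    """Fixed module indices on the path of Fig. 6 (1-based numbering).

    alpha: MIP of the IM at which SIP s arrives
    beta : index of that IM
    delta: index of the OM serving SOP d
    """
    alpha = (s - 1) % n + 1
    beta = (s - 1) // n + 1
    delta = (d - 1) // n + 1
    return alpha, beta, delta

# Example for C(16, 16, 16): SIP 37 enters IM 3 through its MIP 5,
# and a cell destined to SOP 250 leaves through OM 16.
assert cfm3_path_indices(37, 250, n=16) == (5, 3, 16)
```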
In the MSM switch architecture, each IM enforces a lower bound on c_cell, the number of clock cycles required for transferring one cell: c_cell > c_h + c_I + c_IC + c_C + c_CI, where each term represents the number of clock cycles for a different activity. c_h is the number of clock cycles required for an IM to retrieve header information, including the destination SOP, from a cell; c_I is for an IM to select candidate queues; c_IC is for sending a request from an IM to CMs; c_C is for a CM to resolve the contention among requests from IMs; and c_CI is for sending a grant from a CM to IMs. As the number of ports of a switch increases, the arbitration times, i.e., c_I and c_C, increase,11,12 raising the lower bound on c_cell, which becomes a major
bottleneck in scaling a switch system.15 On the other hand, the proposed CFM3 architecture has no round-trip communication. An IM is ready to arbitrate cells as soon as it gets the header information of a cell. The lower bound on c_cell in the CFM3 architecture simply becomes c_cell > c_h + c_I, which is significantly lower than that of the MSM. This gives CFM3 a significant advantage over MSM in terms of scalability. Hardware for the arbitration is also saved owing to the simplified decision process. Moreover, the reduced lower bound on c_cell directly reduces the required storage bandwidth: as c_cell, or the cell size, becomes small, the dummy data appended at the end of a packet to compensate for its variable length becomes small, and the memory bandwidth wasted on dummy data is reduced.15

In MSM the sequence of cells is maintained throughout switching owing to the bufferless center stage. The proposed CFM3 architecture, despite its buffered center stage, still maintains the cell sequence through a simple control in the IMs, as the sequence needs to be kept only among cells belonging to the same class and the same SIP-SOP pair. These cells always pass through a specific queue of the CFVFQ in an IM. Disordering can only happen when multiple CMs hold cells from the same queue of the same IM. As shown in Fig. 6, there are m candidate locations in the center stage for a cell departing from a queue in an IM. Departure of the cell from the IM can be controlled so that there is no other cell from the same queue in any of the m candidate locations in the CMs. Multiple locations among the m candidates in the center stage can, however, be filled if the cells are from different queues of an IM, which is acceptable as no sequence is required among them. For example, in Fig. 6, cells departing from IMS(β, ∗, d, p) can share CMS(∗, β, d, p) if they are not from the same IMS location. The ordering among cells is thus maintained.

Assuming a 256-port CFM3 of C(16, 16, 16) with four classes and a 128-byte cell, an IM (or CM) has (n·N·c = 16 × 256 × 4 =) 2^14 queues. The memory size then becomes (2^14 cells × 128 bytes/cell = 2 Mbytes =) 16 Mbits, which can be implemented in 9 mm² of silicon area using 1T-SRAM in a 0.09-um ASIC process.16 This shows that the proposed CFM3 architecture can be implemented using current technology. The memory requirement varies according to the switching system; for example, the total amount of memory is halved if the traffic has only two classes.

4. Experimental Results

To evaluate the performance of the proposed CFM3 architecture, we modeled three different kinds of switches: the MSM switch, the CFM3 switch, and the output-queued (OQ) switch, which is known as the ideal switch. IPCs are assumed to have infinite buffers. No speedup is assumed; in other words, the transmission speeds of all MIPs and MOPs of switch modules are the same as those of the IPCs and OPCs. Each OPC experiences no congestion on its output port, so it immediately sends out a cell whenever one is present. All switches are assumed to employ the strict priority policy for
different classes. Four classes are used in the experiments. The dimensions of all switches are set identically to 256 × 256. The MSM and CFM3 switches each comprise 48 16 × 16 modules (three stages, each with 16 modules). The buffer size in a shared memory module of the MSM switch is set equal to that of the CFM3 switch. Assuming 128-byte cells, the total amount of memory in CFM3 is (r × n·N·c + m × r·N·c + r × m·n·c cells) × (128 bytes/cell) = 528 Mbits, and the total amount of memory in MSM is (r × n·N·c + 0 + r × n·N·c cells) × (128 bytes/cell) = 512 Mbits; the ratio of the two values is about 1.03. The delay of the OQ switch under very low load is set equal to those of the MSM/CFM3 switches for ease of reference.

The model of the MSM switch is based on the ATLANTA switch7 as described in Section 1 and Section 2. Although there are many choices, the Static Round-Robin Dispatching (SRRD) algorithm12 is used to dispatch cells from IMs to CMs. In this algorithm, the pointers in an IM, which indicate where the IM starts searching cells for its MOPs in a round-robin style, are set differently so that requests for paths in CMs are more uniformly distributed. After each time slot, the pointers in all IMs are incremented circularly. This approach, called desynchronization, helps keep the pointers distributed and improves the path utilization. The CFM3 switch is modeled as in Section 3, including the control mechanism that keeps the cell ordering. As assumed in Section 3, each storage location for a queue contains one cell. The dispatching algorithms of the MSM and CFM3 switches cannot be exactly the same, because their decision processes differ: MSM uses round-trip communication while CFM3 uses one-way communication. To make the comparison fair, CFM3 is assumed to use almost the same algorithm as SRRD. The part common to the two switches is the cell selection process in IMs; the CFM3 switch model adopts the same cell selection rule as used in SRRD.

Fig. 7 shows a comparison of the delay performance of the MSM switch, the CFM3 switch, and the ideal output-queued (OQ) switch as the traffic load varies, all assuming uniformly distributed Bernoulli traffic with four uniformly distributed classes. Delay values are in units of a time slot, which is the time interval between two back-to-back cells. As the load becomes heavier, the delay of the MSM switch increases abruptly, whereas the delay of the CFM3 switch stays low thanks to the competition-free scheme, closely approaching that of the OQ switch.

Fig. 8 shows a comparison of throughput under unbalanced traffic with four classes, each class having the same probability. Unbalanced traffic is generated as defined in the following:5

ρ_{s,d} = ω + (1 − ω)/N,   if s = d,
ρ_{s,d} = (1 − ω)/N,        otherwise,

where ρ_{s,d} is the load applied from the s-th IPC to the d-th OPC, ω is the imbalance factor, s is the IPC where the traffic is applied, and d is the destination OPC. An imbalance factor of zero means that the traffic has a uniform distribution. The CFM3 switch achieves 100% throughput while the MSM switch shows about 77% throughput.
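A short generator for this traffic matrix makes the definition concrete. The sketch below is hypothetical (function name and interface are ours) and simply evaluates the formula; the row-sum check confirms that every SIP is fully loaded for any imbalance factor.

```python
def unbalanced_load_matrix(N, omega):
    """Offered load rho[s][d] from SIP s to SOP d under the unbalanced
    traffic model of Section 4 (omega = 0 gives uniform traffic;
    omega = 1 sends all of a SIP's traffic to the SOP of the same index)."""
    return [[(omega + (1.0 - omega) / N) if s == d else (1.0 - omega) / N
             for d in range(N)] for s in range(N)]

# Sanity check: each SIP's row sums to 1 (full load) for any omega.
matrix = unbalanced_load_matrix(N=256, omega=0.5)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in matrix)
```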
Fig. 7. Delay comparison among MSM switch, CFM3 switch, and the ideal output-queued switch under uniform traffic with four classes.
If the imbalance factor is one, i.e., all cells from a SIP are destined to one SOP, the throughputs of all three switches converge to unity. The CFM3 switch always has a higher throughput than the MSM switch under all unbalanced traffic conditions.

In the ideal OQ switch, all received packets are stored in the corresponding OPC, so all of them can be transferred without additional delay, as there is no competition among them. This allows all packets to become candidates when the OPC determines the next packet to transfer, and no bandwidth is wasted on the outgoing line of the OPC. In multi-stage switches, there is delay due to competition for limited resources such as storage and transfer paths. The CFM3 switch resolves the competition for storage. Further work on dispatching algorithms between CFVFSs could relax the competition for paths, as the existing dispatching algorithms are optimized for MSM switches.
5. Conclusion

The CFM3 architecture was proposed to solve the communication overhead and blocking problems of the MSM switch. The CFM3 architecture uses a CFVFS for the modules in each stage. A CFVFS manages a separate location for each queue in its CFVFQ, which makes it suitable for a scalable switch with distributed arbitration. In spite of the multiple paths in the center stage, the CFM3 architecture maintains the original
Fig. 8. Throughput comparison among MSM switch, CFM3 switch, and output-queued switch under various unbalanced traffic conditions with four classes.
ordering of cells by assuring exclusive occupation, among the relevant cells, of the candidate storage locations in CMs. Simulation results show that the CFM3 switch has better delay performance and higher throughput than the MSM switch under multi-class traffic when the strict priority policy is employed. The CFM3 switch achieves 100% throughput under uniformly distributed four-class traffic with strict priority policy, while the MSM switch records about 77% throughput.

References
1. N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,” IEEE/ACM Trans. Networking, vol. 7, no. 2, pp. 188–201, Apr. 1999.
2. T. Anderson, S. Owicki, J. Saxe, and C. Thacker, “High speed scheduling for local area networks,” in Proc. 5th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, Oct. 1992, pp. 98–110.
3. H. J. Chao, “Saturn: a terabit packet switch using dual round-robin,” IEEE Commun. Mag., vol. 38, no. 12, pp. 78–79, Dec. 2000.
4. R. B. Magill, C. E. Rohrs, and R. L. Stevenson, “Output-queued switch emulation by fabrics with limited memory,” IEEE J. Select. Areas Commun., pp. 606–615, May 2003.
5. R. Rojas-Cessa, E. Oki, Z. Jing, and H. J. Chao, “CIXB-1: Combined input-one-cell-crosspoint buffered switch,” in Proc. IEEE HPSR 2001, May 2001.
6. C. Clos, “A study of non-blocking switching networks,” Bell Sys. Tech. J., vol. 32, no. 2, pp. 406–424, Mar. 1953.
7. F. M. Chiussi, J. G. Kneuer, and V. P. Kumar, “Low-cost scalable switching solutions for broadband networking: the ATLANTA architecture and chipset,” IEEE Commun. Mag., pp. 44–53, Dec. 1997.
8. F. K. Hwang, The Mathematical Theory of Nonblocking Switching Networks, ser. Series on Applied Mathematics, vol. 11. World Scientific, 1998.
9. A. Jajszczyk, “Nonblocking, repackable, and rearrangeable Clos networks: Fifty years of the theory evolution,” IEEE Commun. Mag., vol. 41, no. 10, pp. 28–33, Oct. 2003.
10. T. Chaney, J. A. Fingerhut, M. Flucke, and J. S. Turner, “Design of a gigabit ATM switch,” in Proc. IEEE INFOCOM ’97, vol. 1, Apr. 1997, pp. 2–11.
11. E. Oki, Z. Jing, R. Rojas-Cessa, and H. J. Chao, “Concurrent round-robin dispatching scheme in a Clos-network switch,” in Proc. IEEE International Conference on Communications, June 2001, pp. 107–111.
12. K. Pun and M. Hamdi, “Static round-robin dispatching schemes for Clos-network switches,” in Proc. IEEE Workshop on High Performance Switching and Routing, May 2002, pp. 329–333.
13. A. Parekh, “A generalized processor sharing approach to flow control in integrated services networks,” Ph.D. dissertation, Massachusetts Inst. Technol., Dept. Elect. Eng. Comput. Sci., Cambridge, MA, Feb. 1992.
14. H. Zhang, “Service disciplines for guaranteed performance service in packet-switching networks,” Proc. IEEE, pp. 1373–1396, Oct. 1995.
15. M. V. Lau, S. Shieh, P.-F. Wang, B. Smith, D. Lee, J. Chao, B. Shung, and C.-C. Shih, “Gigabit Ethernet switches using a shared buffer architecture,” IEEE Commun. Mag., pp. 76–84, Dec. 2003.
16. MoSys 1T-SRAM memory compiler, MoSys, Inc. [Online]. Available: http://www.mosyscorp.com/compilers/