Performance Modeling of the Cluster Interconnect Clint

Hans Eberle
Sun Microsystems Laboratories
[email protected]
Abstract

The cluster interconnect Clint (Clint is the code name of a Sun Microsystems Laboratories internal project) has a segregated architecture that provides two separate transmission channels: a bulk channel optimized for high-bandwidth traffic and a quick channel optimized for low-latency traffic. The channels use buffer-free switches with fixed forwarding delays. These properties simplify scheduling as well as transmission protocols. The scheduler for the bulk channel allocates transmission paths before packets are sent off. This way, collisions that lead to blockages and, with them, performance loss are avoided. In contrast, a best-effort approach is taken when transmitting packets over the quick channel. Packets are sent whenever they are available. If they collide in the switch, one packet wins and the other packets lose. A simulation model has been developed to analyze the Clint architecture. The simulation results clearly show the performance advantages of the proposed architecture. The carefully scheduled bulk channel can be loaded nearly to its full capacity without exhibiting the head-of-line blocking that limits many networks, while the quick channel shows a dramatic reduction in latency.

1 Introduction

A cluster is a group of servers or workstations that work collectively as one logical system. The purpose of clustering is high availability and high performance. Since clusters capitalize on the economies of scale, they are inexpensive alternatives to other fault-tolerant hardware-based approaches as well as to other parallel systems. The latter group includes systems such as SMPs (symmetric multiprocessors), MPPs (massively parallel processors), and NUMA (non-uniform memory access) machines.

Clusters have been used for decades. Early commercial systems were IBM's Job Entry System (JES) introduced in 1970 [1] and DEC's VAXcluster introduced in 1984 [2]. Lately, there has been increasing demand for high-availability clusters in the commercial market due to the ever-increasing number of businesses with an Internet presence that requires continuous operation. Products such as the Sun™ Cluster technology [3] and Microsoft Cluster Service [4] are targeted at these applications. These products are primarily used for failover clusters that consist of a paired server with an active node and a stand-by node. The task of the cluster management software is to detect the failure of the active node and to restart the affected applications on the stand-by node. Failover clusters typically use standard LAN technology such as 10BaseT or 100BaseT Ethernet. As systems scale in size and customers raise their performance expectations, more ambitious communication solutions can be anticipated.

Clusters also provide a cost-effective platform for high-performance computing. Prime examples of commercial applications are distributed databases such as Oracle Parallel Server. Such systems are specialized in that they often bypass operating systems and do not rely on a general-purpose software infrastructure for cluster applications. In order for this type of cluster to gain more widespread acceptance, system management software has to become easier to use and more applications have to be parallelized. Thanks to their cost advantage over full-custom parallel systems, we can expect that the necessary investment will be made.

In this paper I describe an interconnect for high-performance clusters. We call this interconnect Clint. Clint differs from a LAN in the following ways. It provides lower latency in that it uses separate channels for high-bandwidth traffic and low-latency traffic. This way, the two traffic classes do not interfere with each other, and low latency can be guaranteed on one channel even if the other channel is highly loaded. Low latency is important since the nodes of a cluster operate on shared data and, therefore, require tight coupling. Further, given the current emphasis on clusters with a relatively small number of nodes, Clint does not have to scale as well as a LAN. In fact, Clint is limited to a single switch per channel. As I will show, limiting the scale of an interconnect gives ample opportunities to simplify its design and, ultimately, achieve higher performance.

I have structured the paper as follows. Section 2 gives an overview of the Clint architecture. Section 3 describes the components of Clint in detail. Section 4 contains simulation results. Section 5 contrasts Clint with related work. Finally, Section 6 gives the conclusions.
Figure 1: The Segregated Architecture of the Clint Network. (The figure shows nodes Node0 through Node(n-1), each connected to the bulk switch, which is paired with the scheduler, and to the quick switch.)
2 Overview

Network traffic typically falls into two categories: there is high-bandwidth traffic mainly consisting of large packets, and there is low-latency traffic mainly consisting of small packets. While a network, and a cluster network in particular, is expected to support both types of traffic equally well, this goal is difficult to achieve. Unfortunately, the two types of traffic affect each other adversely: the presence of large packets increases the latency of the small packets, and the presence of small packets makes it difficult to schedule the transmission of large packets and to fully utilize the available bandwidth. Our solution is to segregate the interconnect into two physically separate channels: a bulk channel optimized for high-bandwidth transmission of large packets and a quick channel optimized for low-latency transmission of small packets. Conversely, the bulk channel is not optimized for low latency, and the quick channel is not optimized for high bandwidth. By having two separate channels we can characterize the channels independently. This is beneficial since channel characteristics such as packet size, scheduling strategy, congestion management, buffering strategy, and error rate are often influenced by tradeoffs between high bandwidth and low latency.

Given the different optimization criteria, the methods for scheduling the channels look quite different. The bulk channel applies an elaborate and relatively time-consuming scheme that avoids conflicts in the switch and that optimizes the aggregate throughput of the switch. This task is handled by a scheduler that calculates a conflict-free schedule before the packets leave the nodes. In contrast, the quick channel uses a best-effort approach in that nodes send off packets without any coordination. As a result, packets can collide in the switch. When this happens, one packet wins and the other packets lose. The winning packet is forwarded and the losing packets are dropped and retransmitted at a later time.

The channels and their switches forward packets with fixed delays. This simplifies error detection since a simple request-acknowledge protocol can be used: if an acknowledgment packet has not been received a fixed amount of time after the request packet was sent, a transmission problem occurred. Further, fixed delays make it possible to operate a channel as a pipelined data path.

We are currently working on a prototype implementation of Clint. It has the following characteristics. Each switch has 16 ports. Links are serial copper cables with a length of up to 5 m. The bulk channel has a full-duplex bandwidth of 2.5 Gbit/s and the quick channel has a full-duplex bandwidth of 0.66 Gbit/s. These numbers refer to 8B/10B-encoded data streams, that is, the bandwidths include the encoding overhead. Accordingly, the bulk switch has an aggregate bandwidth of 40 Gbit/s and the quick switch has an aggregate bandwidth of 8.45 Gbit/s. (The bandwidth calculated for the bulk switch includes the overhead of the 8B/10B encoding since data is forwarded in its encoded form, while the bandwidth for the quick switch is obtained by using the data rates of the decoded data since data is forwarded in its unencoded form.)
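The aggregate figures follow directly from the per-port rates. The short sketch below (mine, not part of the paper; all Python listings in this version are illustrative sketches) redoes the arithmetic; the 8/10 factor models the 8B/10B encoding overhead.

    PORTS = 16
    ENCODED_FRACTION = 8 / 10          # 8B/10B: 8 payload bits per 10 line bits

    bulk_link_gbps = 2.5               # encoded line rate per port
    quick_link_gbps = 0.66             # encoded line rate per port

    # The bulk switch forwards encoded data, so its aggregate bandwidth
    # counts the encoded line rate; the quick switch forwards decoded
    # data, so the 8B/10B overhead is subtracted first.
    bulk_aggregate = PORTS * bulk_link_gbps                       # 40.0 Gbit/s
    quick_aggregate = PORTS * quick_link_gbps * ENCODED_FRACTION  # ~8.45 Gbit/s

    print(f"bulk switch aggregate:  {bulk_aggregate:.2f} Gbit/s")
    print(f"quick switch aggregate: {quick_aggregate:.2f} Gbit/s")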
Definitions

The node that initiates a transfer is called the initiator, while the node addressed by the initiator is called the target. The packet sent from the initiator to the target is called a request packet; the response sent back from the target to the initiator is called an acknowledgment packet. Since data is always pushed from the initiator to the target, the request packet contains the operation and its operands, and the acknowledgment packet contains a transmission report. The time taken to transmit a bulk request packet is referred to as a slot.
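The fixed forwarding delays described above make error detection under this request-acknowledge protocol particularly simple. Here is a minimal sketch of the idea; the constant FIXED_RTT_SLOTS and the function names are assumptions of mine, not part of the Clint specification.

    # Sketch: loss detection under fixed forwarding delays. Because the
    # round trip from request to acknowledgment takes a known, constant
    # number of slots, a lost packet is detected by simple bookkeeping
    # instead of a worst-case timeout.
    FIXED_RTT_SLOTS = 4

    outstanding = {}   # request id -> slot in which the request was sent

    def send_request(req_id, current_slot):
        outstanding[req_id] = current_slot

    def receive_ack(req_id):
        outstanding.pop(req_id, None)

    def lost_requests(current_slot):
        # Any request still unacknowledged FIXED_RTT_SLOTS after it was
        # sent is known to be lost and must be retransmitted.
        return [r for r, sent in outstanding.items()
                if current_slot - sent >= FIXED_RTT_SLOTS]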
3 Architecture

This section discusses the architecture of Clint in detail.

3.1 Network Interface Card

Our network interface card (NIC) is based on Active Messages 2.0 [8] and the Virtual Network abstraction [9,10]. This abstraction virtualizes the access points of the network in the form of endpoints. A collection of endpoints forms a virtual network with a unique protection domain. Messages are exchanged between endpoints, and traffic in one virtual network is not visible to other virtual networks. Endpoints are mapped into the address space of a process and can be directly accessed by the corresponding user-level program or kernel program. Thus, user-level communication does not involve the operating system.

Fig. 2 shows the structure of the NIC. It holds a small number of active endpoints EP – inactive endpoints are stored in main memory. Endpoints adhere to the segregated network architecture in that they contain separate sets of message descriptors for the bulk channel and the quick channel. Resources are further separated into message descriptors for requests and responses. This way, the Active Messages protocol described in [10] can be implemented free of fetch-deadlocks.

There are two types of message descriptors. A quick message descriptor contains the handler function to be invoked upon receipt and seven 32-bit word arguments. The bulk message descriptor looks similar, except that the arguments are replaced with a reference to a 2 kByte data block located in main memory. The NIC provides several buffers to prefetch bulk messages. A prefetching strategy is employed that tries to fill the buffers with messages destined for different targets so that bulk channel performance is not reduced by head-of-line (HOL) blocking.

Figure 2: Organization of the NIC. (The figure shows the NIC holding endpoints EP1 through EP4 with associated buffers, a send DMA engine, a receive DMA engine, and control logic; separate request and response queues connect the NIC to the quick channel and the bulk channel, and message descriptors reside in host memory.)
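The two descriptor formats might be modeled as follows; this is an illustrative sketch, and the field names are mine rather than those of the actual NIC.

    # Sketch: the two message descriptor formats. A quick descriptor
    # carries its payload inline; a bulk descriptor instead references
    # a 2 kByte data block in main memory.
    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class QuickDescriptor:
        handler: int              # handler function invoked upon receipt
        args: Tuple[int, ...]     # seven 32-bit word arguments

    @dataclass
    class BulkDescriptor:
        handler: int              # handler function invoked upon receipt
        block_addr: int           # address of a 2 kByte block in main memory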
3.2 Bulk Channel

The bulk channel is optimized for high throughput rather than for low latency. A global schedule determines the settings of the bulk switch and the times when nodes may send packets. The schedule is conflict-free, and thus bulk packets are never dropped due to resource conflicts. This is important since the loss of a bulk packet can result in the retransmission of a considerable amount of data, in particular if we consider that a packet may be part of a larger transfer.

Scheduling the bulk channel is simplified in that forwarding delays are fixed. This is possible since the bulk switch contains no buffer memory that causes variable forwarding delays. Conflicting requests for the output ports of the switch are simply resolved in that packets remain in the buffers of the NICs until the necessary connections in the switch have been allocated. That is, at the time a NIC sends off a packet, a path through the switch has been reserved. The only buffer memory found in the bulk channel is located in the NICs. Logically, the bulk switch together with the buffer memories of the NICs resembles an input-buffered switch. To avoid HOL blocking caused by a single input buffer queue, the NIC contains several send buffers that can be randomly accessed and that can hold packets destined for different nodes.

A simple request-acknowledge protocol is used to detect errors on the bulk channel. For every request packet received, the target returns an acknowledgment packet. The initiator detects a transmission error if it receives a negative acknowledgment or the acknowledgment is missing. A negative acknowledgment is returned if a request with a bad CRC is received or no receive buffer is available. The loss of a packet is detected if the initiator does not receive an acknowledgment a fixed amount of time after the request was sent. Fixed forwarding delays on the bulk channel simplify error detection in that the number of outstanding requests is fixed and there is a fixed timing relationship between the request and acknowledgment that does not require any timeout mechanism relying on worst-case assumptions about delivery times.
Thanks to fixed timing and global scheduling, the bulk channel can be operated efficiently as a pipelined network. This is shown in Fig. 3. Packet transmission is split into four stages:
• arb: During the arbitration stage a slot is allocated in the bulk switch for forwarding the request packet from an input port to an output port.
• trf: During the transfer stage a request packet is transferred from a buffer on the initiator to a buffer on the target.
• ack: During the acknowledgment stage the target checks the integrity of the request packet and returns an acknowledgment packet to the initiator.
• chk: During the check stage the initiator examines the acknowledgment packet and, if an error has occurred, reinserts the corresponding packet into the pipeline.

Since each pair of initiator and target forms a pipeline, and the pipelines share the bulk switch in the transfer stage, it is necessary that the pipelines are synchronized. Synchronization is achieved with the help of the signaling packets described in section 3.3. These packets are sent simultaneously and at a fixed time relative to the bulk slot boundaries. Only the transfer stage uses the bulk channel. Both the arbitration stage and the acknowledgment stage use the quick channel. Possible conflicts during the transfer stage are resolved in the arbitration stage.

Fig. 3 illustrates the operation of the bulk channel pipeline. In this example a 2x2 switch connects initiators I0 and I1 with targets T0 and T1. The transmission of five packets Pi,t is depicted, where i stands for the initiator and t for the target. In addition to the transmitted packets, the figure shows the targets requested by the initiators and the settings of the switch. Note that the initiator generates a request for every full buffer and that, as a result, the initiator can request more than one target in a given arbitration cycle. In slot 0, both I0 and I1 request connections with T0. The request of I0 is granted and the request packet P0,0 is transferred in slot 1. In slot 1, both I0 and I1 request connections with T0 and T1. I0 is granted the connection with T0, and I1 is granted the connection with T1. The corresponding transfers of packets P0,0 and P1,1 take place in slot 2. The remaining two requests are granted in slot 2, and the corresponding packets P0,1 and P1,0 are transferred in slot 3.

Figure 3: Pipelined Transmission of Bulk Packets. (The figure shows, slot by slot, the arb, trf, ack, and chk stages of the five packets, together with the targets requested by initiators I0 and I1 and the resulting switch settings.)

A simple flow-through switch realizes the bulk switch. Its input and output ports as well as its internal data paths are all bit-serial. There is no sequential logic in the form of buffer memory or registers. The only logic elements used are multiplexers connecting output ports with input ports. In particular, the bulk switch contains no logic or data paths to process packets. Packets do not even have to be examined to determine their route since the scheduler exchanges the necessary routing information with the NICs over the quick channel.

We use a central scheduler to schedule the bulk channel. Referring to Fig. 3, the arbiter calculates the schedule for the subsequent transfer stage during the arbitration stage. An arbitration cycle starts with each node sending a configuration packet to the arbiter. This packet combines information supplied by both the initiator and the target residing on a node. It contains a request vector that identifies the targets for which the initiator has data packets. Also contained in the configuration packet is an enable vector that names the initiators from which the target accepts packets. An initiator is disabled, for example, if it is suspected to be malfunctioning. The arbiter can now calculate a conflict-free schedule. The schedule is communicated to the initiators with the help of grant packets. Each initiator receives a grant packet that reports whether one of its requests was granted and, if so, which target it can address in the following transfer stage.
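To illustrate what the arbiter computes, the sketch below implements one plausible matching policy over the request and enable vectors. The paper does not specify the actual algorithm, so the rotating-priority policy here is an assumption; the point is only that the result is conflict-free in the required sense, with every initiator sending to at most one target and every target receiving from at most one initiator.

    # Sketch: one arbitration cycle of the central scheduler.
    # requests[i] = set of targets initiator i has packets for (request vector);
    # enables[t]  = set of initiators target t accepts packets from (enable vector).
    def schedule(requests, enables, n_ports, start=0):
        grants = {}        # initiator -> granted target
        taken = set()      # output ports already allocated in this slot
        for k in range(n_ports):
            i = (start + k) % n_ports       # rotate priority to avoid starvation
            for t in sorted(requests.get(i, ())):
                if t not in taken and i in enables.get(t, ()):
                    grants[i] = t           # reported to initiator i in its grant packet
                    taken.add(t)
                    break
        return grants

    # Slot 1 of Fig. 3: I0 and I1 both request T0 and T1;
    # I0 is granted T0 and I1 is granted T1.
    print(schedule({0: {0, 1}, 1: {0, 1}}, {0: {0, 1}, 1: {0, 1}}, 2))  # {0: 0, 1: 1}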
3.3 Quick Channel

The quick channel is optimized for low latency rather than high throughput. In contrast to the bulk channel, nodes do not schedule the transmission of quick packets and do not allocate a path through the quick switch before they send off packets. Instead, a best-effort approach is taken in that packets are sent as soon as they are available. This approach leads to possible collisions in the quick switch, in which case packets are dropped. Note the difference with collisions on a shared medium such as a bus: while in our case progress is made in that one packet is forwarded, this is not the case for a shared medium, which requires all packets involved in a collision to be retransmitted. Collisions in the quick switch happen infrequently since the quick channel is only lightly loaded thanks to its excess bandwidth. While we expect typical applications to generate only light load for the quick channel, a throttling mechanism is added to the NIC to limit usage of quick channel bandwidth.

Similar to the bulk channel, the quick channel contains no buffer memory and forwards packets with fixed delays. However, in contrast to the bulk channel, the quick channel is not pipelined since round-trip times are relatively short. Thus, every initiator can have only one outstanding request packet at a time.

The quick switch uses a minimal arbiter that applies a first-come, first-served policy. When a packet arrives at an input port, a request is sent to the output port specified by the routing information contained in the packet header. When quick packets collide in the switch, the packet that arrives first wins and is forwarded, and packets that arrive later lose and are dropped. If colliding packets arrive simultaneously, a round-robin scheme is used to pick the winner and the losers. This way starvation is avoided. Further, this arbiter is able to make quick routing decisions so that forwarding latency on the quick channel is kept low. Unlike the bulk switch, the quick switch can process and generate packets. To allow for sufficient processing time, internal data paths are parallel and packets are latched at the switch boundaries, that is, near the input ports as well as the output ports.

Transmission errors are again recognized by a request-acknowledge protocol. A target returns an acknowledgment in response to every request. If an initiator does not receive an acknowledgment a fixed amount of time after the request was sent, either the request or the acknowledgment packet was lost, and the request packet is resent.

Similar to the bulk switch, the ports of the quick switch can be enabled and disabled with the help of an enable vector provided as part of the configuration packet mentioned in section 3.2.

The quick channel also serves as a signaling channel for the bulk channel. More specifically, it transports the configuration and grant packets needed to schedule the bulk channel, and it transports the bulk acknowledgment packets. These packets are transmitted at fixed times, that is, at fixed offsets within a bulk slot. This simplifies the scheduler as well as the request-acknowledge protocol. Further, we make use of this property to synchronize the nodes: since the timing of the signaling packets is implicitly known to the nodes and the switch, the grant packets are used to synchronize the nodes. The grant packets are also used to assign a unique identifier to each node; the node identifier corresponds to the number of the switch output port from which the packet was sent.

Since any loss of bulk channel bandwidth is undesirable, the signaling packets are treated differently from regular quick packets so that signaling packets cannot be dropped due to collisions. If a signaling packet collides with a regular packet, the signaling packet is given priority and the regular packet is dropped. Still, there is a possibility that signaling packets collide with each other. While it is obvious that the configuration packets as well as the grant packets will not cause any collisions among themselves, the transmission of the bulk acknowledgment packets needs an explanation. If we look at the switch connections needed for forwarding the bulk acknowledgment packets over the quick channel, we note that they correspond to the connections of the schedule calculated for forwarding the corresponding bulk request packets over the bulk channel, only with the directions reversed. Thus, since the schedule calculated for transmitting the bulk request packets is conflict-free, the transmission of the bulk acknowledgment packets is also conflict-free.
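A minimal sketch of the arbitration policy at one output port of the quick switch, as described above (the data structures are my own simplification): the earliest request wins, simultaneous arrivals are resolved round-robin, and losing packets are dropped for the initiators to retransmit.

    # Sketch: first-come arbitration at one quick switch output port.
    # arrivals: list of (arrival_time, input_port) requests for this port.
    def arbitrate(arrivals, rr_pointer, n_ports):
        if not arrivals:
            return None, rr_pointer
        first = min(t for t, _ in arrivals)                  # earliest arrival wins
        tied = sorted(p for t, p in arrivals if t == first)  # simultaneous arrivals
        # Round-robin tie-break: pick the first tied port at or after the
        # pointer, then advance the pointer so no input port starves.
        winner = next((p for p in tied if p >= rr_pointer), tied[0])
        return winner, (winner + 1) % n_ports

    # Two packets arrive simultaneously at input ports 3 and 5; pointer is 4.
    print(arbitrate([(0, 3), (0, 5)], rr_pointer=4, n_ports=16))  # (5, 6)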
4 Simulation Results

In this section I analyze the Clint architecture with the help of a behavioral simulation model. I use a synthetic workload to examine the latency and bandwidth characteristics of the two channels.

4.1 Simulation Methodology

Figure 4 shows a block diagram of the model used to simulate the Clint network. Each node contains a packet generator that deposits packets into a packet queue. This queue behaves like a FIFO queue of unlimited length. Packet sizes range from 32 Byte to 2 kByte, and the distribution of sizes is uniform. The mean value of the injection rate maintained by the packet generator is a parameter of the simulation, while the injection interval varies according to a uniform distribution. An option is provided to generate packets in bursts; again, the distribution of burst lengths is uniform.
Figure 4: Simulation Model. (The figure shows 16 nodes N0 through N15, each containing a packet generator (PG), a packet queue (PQ), buffer pools (BP), and send registers (SR), connected to separate quick channel (QC) and bulk channel (BC) switches.)

The data path forks at the exit of the packet queue and leads into separate buffer pools for the bulk channel and the quick channel. The size of the packet is compared with a threshold value to determine the buffer pool the packet is to be transferred to. The buffer pools are kept as full as possible, that is, whenever there is both an empty buffer and a packet of the right size at the head of the packet queue, the packet is transferred. The buffer pools can be accessed in random order. Finally, packets are forwarded from the buffer pool to a send register whenever the send register is empty and there is at least one full buffer.

There are separate switches for the quick and bulk channel. Unless otherwise noted, both channels have the same bandwidth. The switches contain no buffer memory and, therefore, do not add any latency. The data paths of the switches are scheduled prior to the departure of the packets from the hosts to avoid collisions. Hosts are served in round-robin order. If an idle host has a packet in the buffer pool destined for an available switch output port, the corresponding path is allocated, the packet is transferred into the send register, and packet transmission is started.

A simulation run takes 10^7 cycles. A cycle corresponds to the time it takes to transmit one byte. Latency times are given in numbers of cycles. They refer to the time a packet spends in the packet queue as well as in the buffer pool. More specifically, the latency time starts when the packet is generated and deposited into the packet queue and stops when transmission of the packet out of the send register begins. The latency numbers do not include the serialization delay, which can be a dominant factor for small latency values.
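The front end of a simulated node could be sketched as follows; the parameter names are mine, and the behavior follows the description above (uniform packet sizes between 32 Byte and 2 kByte, a threshold-based channel split, and head-of-queue dispatch into bounded, randomly accessible buffer pools).

    import random
    from collections import deque

    PACKET_MIN, PACKET_MAX = 32, 2048   # packet sizes in Bytes, uniform
    THRESHOLD = 128                     # Bytes; <= THRESHOLD -> quick channel

    packet_queue = deque()              # unbounded FIFO fed by the generator
    bulk_pool = deque(maxlen=16)        # bounded buffer pools; entries can be
    quick_pool = deque(maxlen=16)       # forwarded in random order later

    def generate(burst_max=1):
        """Deposit a burst of packets with uniformly distributed sizes."""
        for _ in range(random.randint(1, burst_max)):
            packet_queue.append(random.randint(PACKET_MIN, PACKET_MAX))

    def dispatch():
        """Move head-of-queue packets into the pool for their channel."""
        while packet_queue:
            size = packet_queue[0]
            pool = quick_pool if size <= THRESHOLD else bulk_pool
            if len(pool) == pool.maxlen:    # no empty buffer: head must wait
                break
            pool.append(packet_queue.popleft())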
4.2 Segregated Architecture

I first examined the effect of segregating high-bandwidth traffic and low-latency traffic. As explained before, the assumption is that low-latency transmission is more important for small packets than for large packets. Therefore, in the simulation model packets are assigned to the two channels according to their sizes: if the packet size is smaller than or equal to a threshold value, the quick channel is used; otherwise, the bulk channel is used.

The graphs in Figure 5 give the average latency for each channel as a function of the packet size threshold. The threshold value varies from 32 Byte to 1 kByte. Graphs are shown for injection rates of 0.1, 0.3, and 0.5 – these rates stand for the fraction of the maximal rate. Higher injection rates are not considered since they saturate the network, as will be explained later. Simulation results are shown for buffer pools with one and 16 entries, respectively. Two types of traffic patterns were simulated: non-bursty and bursty. Bursty traffic consists of bursts of up to five packets.

The graphs clearly show that the segregated architecture leads to a dramatic reduction in latency for the quick channel. As the various graphs illustrate, there are several factors contributing to latency. Obviously, higher injection rates increase latency. In addition, bursty traffic and smaller buffer pools also increase latency. Increasing the number of buffers, in this example from one to 16, reduces the latency. This is particularly true for a high injection rate. The reason is that high injection rates cause more competing requests for output ports of the switch. If there are more buffers that can hold packets destined for different output ports, it is more likely that one of the requested output ports is available and the corresponding packet can be forwarded than if there are fewer buffers and, with them, fewer choices. As a result, packets spend less time in the buffer pool and latency is reduced.
Figure 5: Latency Simulations. (Four panels plot average latency in cycles against the packet size threshold in Bytes, from 0 to 1024, for the quick and bulk channels at injection rates 0.1, 0.3, and 0.5: one bulk buffer and one quick buffer without bursts, 16 bulk buffers and 16 quick buffers without bursts, one bulk buffer and one quick buffer with bursts, and 16 bulk buffers and 16 quick buffers with bursts.)

Figure 6: Bandwidth Simulations. (Two panels plot the consumed fraction of the available bandwidth against the packet size threshold in Bytes, from 0 to 1024, for the quick and bulk channels at injection rates 0.1, 0.5, and 0.9: one bulk buffer and one quick buffer with bursts, and 16 bulk buffers and 16 quick buffers with bursts.)
Figure 7: Bandwidth Distribution. (Four panels plot quick channel latency in cycles against the quick channel bandwidth, given as a fraction of the bulk channel bandwidth from 0 to 1.0, at injection rates 0.2 and 0.8 with a packet size threshold of 128 Byte: 16 bulk buffers with 16 quick buffers and with one quick buffer, each without and with bursts.)

Figure 6 shows the average amount of bandwidth consumed during the discussed simulation runs. I only show graphs for the bursty traffic pattern since they are similar to the ones for the non-bursty traffic pattern. Looking at the host with only one buffer, we see that the consumed bandwidth does not exceed 60% of the available bandwidth. This significant reduction in network performance is caused by head-of-line blocking [6]: the packet in the buffer needs to be forwarded before any other packet in the packet queue can be forwarded, even though a packet in the packet queue might be destined for an output port of the switch that is available. This shortcoming is eliminated by providing multiple buffers. The graph resulting from the simulation with 16 buffers shows that nearly the full bandwidth can be utilized. Without the ability to globally schedule network traffic, head-of-line blocking is an inherent performance bottleneck of many networks. This is, for example, true for Myrinet [11]: when a packet leaves a host, it is not known whether the necessary output ports of the switches along its way to the destination are available in time. If an output port is not available, the packet is stopped and potentially blocks other packets behind it that are destined for available output ports.
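The mechanism can be illustrated in a few lines (my sketch, not the simulator itself): with a single buffer, a head packet destined for a busy port idles output ports that are actually free; with several randomly accessible buffers, the free ports get used.

    # Sketch: head-of-line blocking with one buffer versus several buffers.
    # Buffered packets are labeled by their destination output port.
    free_ports = {1, 2}        # output ports available in this slot
    single_buffer = [0]        # head packet wants busy port 0: nothing is sent
    multi_buffer = [0, 2, 1]   # randomly accessible buffers

    def sendable(buffers, free):
        return [p for p in buffers if p in free]

    print(sendable(single_buffer, free_ports))  # [] -> free ports stay idle
    print(sendable(multi_buffer, free_ports))   # [2, 1] -> free ports are used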
4.3 Bandwidth Distribution

Given the anticipated traffic types, we might conclude that the quick channel requires less bandwidth than the bulk channel. To examine this proposition, I varied the amount of quick channel bandwidth and determined the resulting change in quick channel latency. The graphs in Figure 7 show the results. The quick channel bandwidth is given as the fraction of the bulk channel bandwidth. I show four versions that differ in the number of quick channel buffers and the burstiness of the traffic. In all cases, the bulk channel has 16 buffers. The packet size threshold is 128 Byte, that is, packets with sizes ranging from 32 to 128 Byte use the quick channel. The injection rates are 0.2 and 0.8.

We first observe that the latency decreases exponentially as the bandwidth increases. As the bandwidth of the quick channel approaches the full bandwidth of the bulk channel, the return in decreased latency diminishes. Therefore, it is possible to operate the quick channel at a fraction of the bandwidth of the bulk channel without a significant performance penalty. We can further notice that the latencies change little for the chosen injection rates. The reason is that variations in the injection rate affect the quick channel only minimally: the chosen packet size threshold directs the bulk of the traffic to the bulk channel, leaving the quick channel only lightly loaded. Finally, the charts show that bursty traffic causes significantly higher latencies if the quick channel has only one buffer. If traffic is not bursty, having more than one buffer does not reduce latency and, therefore, is of no benefit.

I now want to contrast these results with the characteristics of the prototype switch. For the prototype switch, the ratio of quick channel bandwidth to bulk channel bandwidth is 1:4. Looking at Figure 7, the graphs begin to flatten out at a quick channel bandwidth of 0.25, and, therefore, we might conclude that a reasonable design point was chosen. In terms of buffering, the prototype resembles the simulation model with one quick buffer. As we have seen, one buffer performs well as long as traffic is not bursty. Since it is unclear how bursty real traffic is, it is probably premature to evaluate this design decision.
4.4 Buffering

The previous discussions have shown that increasing the number of buffers improves throughput as well as latency. So far I have only compared scenarios with either one or 16 buffers. Figure 8 provides a more detailed analysis of the performance improvement resulting from increasing the number of buffers. The graphs give the bulk channel latency as a function of the number of buffers, which varies from one to 16. I consider injection rates of 0.5 to 0.9. The latency decreases exponentially as the number of buffers is increased. A small number of buffers suffices to obtain nearly the full performance benefits. For example, assuming a maximal injection rate of 0.8, we observe a noticeable decrease in latency up to about eight buffers.

Figure 8: Buffering. (The figure plots bulk channel latency in cycles against the number of buffers, from one to 16, for injection rates of 0.5, 0.6, 0.7, 0.8, and 0.9, with bursty traffic and a packet size threshold of 128 Byte.)

5 Related Work

Myrinet is a switched network widely used for cluster applications [11]. Switches have up to 16 ports, and each port has a full-duplex bandwidth of 1.28 Gbit/s. Arbitrary topologies are possible by cascading switches. Crossbar switches are used with cut-through forwarding. Switches contain a minimal amount of receive buffers to implement stop-and-go link-level flow control. Low latency can only be guaranteed as long as the network is lightly loaded. Further, Myrinet will only perform well under high load if conflicts in the switches occur infrequently. However, this is only the case with certain types of regular traffic patterns; other, irregular traffic patterns will degrade performance due to HOL blocking.

The global communication network of the T3E [12,13] has properties similar to those of the quick channel. Messages are transmitted by writing them into E-registers, from which they are delivered to message queues mapped into user or system memory space. State flags associated with the E-registers indicate whether the message was accepted. The sending processor checks the flags and retransmits messages if necessary. However, unlike the quick channel, the T3E network is flow-controlled. More specifically, credit-based flow control prevents the buffers of the routers from overflowing.

The use of separate mechanisms to transport short and long messages has been proposed by the Illinois Fast Messages (FM) protocol [14] and the Scheduled Transfer Protocol (ST) [7] for HIPPI-6400. The FM protocol provides two send routines: FM_send to send a long message and FM_send_4 to send a short four-word message. FM_send uses send queues in main memory; FM_send_4 uses registers on the NIC. Thus, these routines are similar to sending a bulk packet and a quick packet, respectively. Unlike our approach, both routines use the same channel. ST uses two separate channels: a high-bandwidth data channel and a low-latency control channel. The control channel is used by upper-layer protocols and is not available for the exchange of low-latency user-level messages.

It is important to recognize the differences between the segregated architecture of Clint and other architectures with multiple channels such as HIPPI-6400 [7]. While Clint implements its two channels with physically separate networks, that is, with separate links and switches, other approaches multiplex multiple channels onto a single shared network. Some degree of decoupling can be achieved by using separate buffer queues [5]. This technique is quite efficiently used to prevent HOL blocking in input-buffered switches [6]. Still, the channels compete for shared resources such as links or switching fabrics. Another important difference is the usage model for the channels. Both channels of Clint are intended for general-purpose usage. In particular, both channels are available to user programs. This is not true for networks such as HIPPI-6400, which provides a control channel in addition to a regular data channel and restricts the usage of the control channel to protocol processing.
6 Conclusions

I have described a segregated network architecture that provides two physically separate channels with different characteristics: a bulk channel for the scheduled transfer of large packets at high bandwidth and a quick channel for best-effort delivery of small packets with low forwarding latency. Only by separating these two concerns can we obtain a network that achieves high throughput and, at the same time, low latency.

I have provided simulation results that validate the segregated architecture. In my simulations, packets were assigned to the channels according to their sizes. The next step will be to show that the assumed correlation between packet size and latency budget is exhibited by real applications. Once we look into implementing real applications and communication protocols on top of Clint, we might find that there are better-suited criteria such as the type of message. For example, synchronization operations and resource management operations could benefit from using the quick channel. Yet another possibility is to have the application explicitly choose one of the channels when setting up a data transfer.

I have shown how a global schedule makes it possible to make full use of the available channel capacity. This was achieved by imposing a rigid timing regime that relies on fixed forwarding delays and coordinated packet transmission. This allows transmission paths to be allocated at the time a packet leaves a host, so that packets passing through the network do not cause any collisions. By avoiding such conflicts, the head-of-line blocking common to many networks is avoided.

Finally, I have shown that the quick channel bandwidth can be reduced to a fraction of the bulk channel bandwidth without a significant loss of performance. A lower channel speed is helpful since the quick channel uses a more complicated switch than the bulk channel. While the bulk channel only requires a simple flow-through switch, the quick channel requires a more complicated switch since it has to process packets, in particular to take part in scheduling the bulk channel. Given the difference in speed and complexity, the two-channel architecture has the potential to scale well to higher speeds.

Acknowledgments

Neil Wilhelm helped with the initial design of Clint. Nils Gura designed and implemented an FPGA that realizes the quick switch and the scheduler for the bulk channel. Alan Mainwaring specified the NIC.

References

[1] IBM: OS/390 V2R4.0 JES2 Introduction. Document Number GC28-1794-02.
[2] N. Kronenberg, H. Levy, W. Strecker: VAXclusters: A Closely Coupled Distributed System. ACM Transactions on Computer Systems, vol. 4, no. 2, May 1986.
[3] Sun Microsystems: The Sun Enterprise Cluster Architecture. Technical White Paper, October 1997.
[4] D. Libertone: Windows NT Cluster Server. Prentice Hall, 1998.
[5] W. Dally: Virtual-Channel Flow Control. Proc. of the 17th Int. Symposium on Computer Architecture, ACM SIGARCH, vol. 18, no. 2, May 1990, pp. 60-68.
[6] M. Karol, M. Hluchyj, S. Morgan: Input versus Output Queueing on a Space-Division Packet Switch. IEEE Transactions on Communications, C-35(12):1347-1356, December 1987.
[7] National Committee for Information Technology Standardization: Scheduled Transfer Protocol (ST). Task Group T11.1, rev. 3.6, January 31, 2000, www.hippi.org.
[8] A. Mainwaring: Active Message Application Programming Interface and Communication Subsystem Organization. University of California at Berkeley, Computer Science Department, Technical Report UCB CSD-96-918, October 1996.
[9] A. Mainwaring and D. Culler: Design Challenges of Virtual Networks: Fast, General-Purpose Communication. ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Atlanta, Georgia, May 4-6, 1999.
[10] B. Chun, A. Mainwaring, D. Culler: Virtual Network Transport Protocols for Myrinet. IEEE Micro, vol. 18, no. 1, January/February 1998, pp. 53-63.
[11] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, W. Su: Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, vol. 15, no. 1, February 1995, pp. 29-36.
[12] S. Scott: Synchronization and Communication in the T3E Multiprocessor. Proc. of ASPLOS VII, Cambridge, October 2-4, 1996.
[13] S. Scott, G. Thorson: The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus. Proc. of HOT Interconnects IV, Stanford, August 15-16, 1996.
[14] S. Pakin, V. Karamcheti, A. Chien: Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively Parallel Processors. IEEE Concurrency, vol. 5, no. 2, 1997, pp. 60-73.