JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 27, 493-509 (2011)
An Efficient Architecture for a TCP Offload Engine Based on Hardware/Software Co-design*

HANKOOK JANG1, SANG-HWA CHUNG1,+, DONG KYUE KIM2 AND YUN-SUNG LEE1

1 Department of Computer Engineering, Pusan National University, Busan, 609-735 Korea
2 Division of Electronics and Computer Engineering, Hanyang University, Seoul, 133-791 Korea

To achieve both the flexibility of software and the performance of hardware, we design a hybrid architecture for a TCP offload engine (TOE) that is based on hardware/software co-design. In this architecture, the transmission and reception paths of TCP/IP are completely separated with the aid of two general embedded processors, so that data transmission and reception are processed simultaneously. We implement this architecture on an FPGA that has two general embedded processor cores. In experiments on gigabit Ethernet, the hybrid TOE has a minimum latency of 13.5 μs. The CPU utilization is less than 3%, at least eighteen times lower than that of general gigabit Ethernet adapters. The maximum unidirectional bandwidth of the hybrid TOE is 110 MB/s, comparable to that of general gigabit Ethernet adapters, although the embedded processors operate with a clock speed seven times lower than that of the host CPU. By using two embedded processors, the bidirectional bandwidth of the hybrid TOE improves to about 201 MB/s, comparable to that of general gigabit Ethernet adapters and a 34% improvement over an experimental TOE implementation that uses only one embedded processor.

Keywords: TCP offload engine, hardware/software co-design, embedded system, embedded processor, FPGA, gigabit Ethernet

Received May 27, 2009; revised September 9 & December 11, 2009; accepted February 9, 2010. Communicated by Ren-Hung Hwang.
* This work was supported by the Grant of the Korean Ministry of Education, Science and Technology (The Regional Core Research Program/Institute of Logistics Information Technology).
+ Corresponding author.
1. INTRODUCTION

1.1 Introduction of the Hybrid TOE

Gigabit Ethernet (GbE) is the most popular choice for the system area networks of recent cluster-based computer systems, and more than 50% of the world's top 500 supercomputers are based on GbE [1]. This is not only because it is compatible with Ethernet-based legacy networks, but also because it provides considerable communication performance at a low price, although it has somewhat higher latency than other high-performance network technologies such as Myrinet [2] and InfiniBand [3]. In Ethernet-based cluster systems, TCP/IP and TCP/IP-based communication mechanisms such as the message passing interface (MPI) [4] are commonly used. In addition, advanced protocol suites based on TCP/IP have been developed for system area networks and storage area networks: for example, the remote direct memory access
protocol (RDMAP) [5] has been proposed for system area networks, and the transport protocol for the internet small computer systems interface (iSCSI) [6] has been proposed for storage area networks. These protocols can also be used on 10 gigabit Ethernet, so it can be expected that many cluster systems will adopt an Ethernet-based unified network for both the system area network and the storage area network.

TCP/IP has traditionally been implemented as a layered protocol stack in an operating system (OS) and processed by host CPUs. This layered structure imposes a heavy load on a host CPU, which increases as the physical network bandwidth increases, thereby degrading the overall performance of the computer system [7]. Moreover, it is predicted that a single host CPU can never keep up with TCP/IP processing if the physical bandwidth is greater than 10 Gbps, because roughly one cycle of the CPU clock is required for each 1 bps of TCP/IP processing speed [8].

The TCP offload engine (TOE), in which TCP/IP is processed on a network adapter instead of on a host CPU, has been introduced as the most attractive solution to this problem. If TCP/IP is processed by the network adapter, most of the load imposed on the host CPU is removed, and the host CPU can devote far more resources to computations such as data processing and calculations. This results in a great improvement in the overall performance of a computer system.

For implementing the TOE, a software-based approach (software TOE) based on a general embedded processor was used first [9]. Software TOEs are easier to build than implementing all operations in hardware, but they are inferior in performance [9, 10] because general embedded processors are typically slower than host CPUs. A hardware-based approach (hardware TOE), meaning an ASIC that consists of dedicated hardware logic for TCP/IP processing, is now commonly used. Hardware TOEs have an advantage in performance [11, 12], but they have drawbacks in expandability, for the following reason. TCP/IP is a complex protocol, and several enhancements including updates of specifications have been proposed over the years, although most basic operations of TCP/IP have not changed significantly since the TCP (RFC 793) [13] and IP (RFC 791) [14] specifications. In this evolving environment, it is difficult for hardware TOEs to accept such enhancements. When all operations are implemented in hardware, it is difficult to offload the entire protocol stack from the operating system on the host CPU (host OS). In addition, it is difficult to add new features such as security functions and specification updates if these were not included in the ASIC.

On the other hand, hybrid architectures for implementing the TOE (hybrid TOE), in which some TCP/IP operations are performed by hardware and the others by software, have been proposed recently [15-17]. Hybrid TOEs can achieve high performance by processing time-critical operations in hardware, yet they retain expandability because it is straightforward to add evolutions of TCP/IP and new features in software.

In this paper, we propose a hybrid TOE architecture based on hardware/software co-design in which the software and the hardware on a network adapter coprocess the protocol stack. Our hybrid TOE offloads the entire protocol stack under the socket layer from the host OS.
We design the transmission/reception (TX/RX) path separation mechanism in which most operations for data transmission and reception are performed in parallel by using two embedded processors: one processor controls the TX path and the other
controls the RX path. The hardware modules in our architecture are controlled by the two processors and perform the operations that are critical to the performance of data transmission and reception: fetching/storing data from/into the host memory using DMA; calculating checksums; creating data packet headers; analyzing the headers of incoming data packets; and generating acknowledgement (ACK) packets. The hardware also includes features to support coprocessing with the software. For the TX/RX path separation mechanism, we design a coprocessing control mechanism that provides efficient data sharing between the embedded processors and the hardware while using as little shared data as possible. The software performs the operations that are not critical for performance, such as connection establishment, ARP/ICMP processing, flow control, congestion control, and retransmission, with the assistance of the hardware.

Unlike our hybrid TOE, two embedded processors could instead be used to process different connections simultaneously without TX/RX path separation; that is, each processor would manage both transmission and reception. In that case, the hardware modules must either be shared or be duplicated for each processor, and it is difficult to achieve load balancing between the two processors.

1.2 Contribution

In this paper, we describe the design and implementation of a hybrid TOE architecture based on hardware/software co-design and the TX/RX path separation mechanism. We also develop a real GbE-based hybrid TOE network adapter based on an FPGA that has two general embedded processor cores. Precisely, our contributions are:

(1) The hybrid TOE architecture achieves very low utilization of the host CPU (below 3%) by offloading the entire protocol stack under the socket layer from the host OS. In addition, the hybrid TOE exhibits low latency and considerable bandwidth by implementing the major operations of data transmission and reception in hardware.
(2) We show that our architecture can overcome the performance limitations of a single embedded processor, which has much lower performance than the host CPU, during simultaneous bidirectional data transmission. This is because two general processors control the TX path and the RX path independently, allowing data transmission and reception to be processed in parallel.
(3) Our architecture guarantees expandability by using general embedded processors and embedded Linux, because it is easy for the software to accept new features such as evolutions of TCP/IP, security functions, specification updates, and offloading mechanisms for upper-level protocols.

1.3 Organization

The remainder of this paper is organized as follows. In section 2, related work is described. In section 3, we describe the structure of our hybrid TOE architecture, the coprocessing mechanism between the software and the hardware, the hardware implementation, and the software implementation based on embedded Linux. In section 4, experimental results and their analysis are presented. We conclude in section 5.
2. RELATED WORK

Intel's PRO/1000T IP Storage Adapter and an evaluation adapter developed by Hewlett-Packard [9] are examples of software TOE implementations based on a general embedded processor. Neither is still manufactured: the former had poor performance [10] and the latter was never more than a test bed. Although multiple processors integrated in a single chip may be used to implement a software TOE, such a multi-processor system-on-chip implementation is known to have problems: it suffers from memory contention between the different processors competing for the path to off-chip memory, and it suffers from "cache thrashing" when scaling to a large number of connections [18].

There are ASIC-based hardware TOE products for gigabit Ethernet, such as Alacritech's SLIC technology, QLogic's QLA4050C adapter, and Broadcom's BCM5706 controller. Furthermore, there are hardware TOE products based on a full ASIC implementation and 10 gigabit Ethernet, such as Chelsio's Terminator 3 chip and NetEffect's NE010 adapter. Chelsio's Terminator 3 chip has some features similar to our hybrid TOE architecture, such as a general processor embedded in the ASIC chip.

There have been some papers on TOE implementations based on an FPGA. Wu and Chen introduced a hybrid TOE implementation [17]; their hybrid TOE processes the IP, ARP, and ICMP protocols in hardware and processes TCP in software on a single embedded processor core. Liu et al. introduced a study of a software TOE implementation and the interface between the TOE and the host OS, based on an FPGA evaluation board [19]; their implementation did not exhibit sufficient performance to process TCP/IP at a speed of 1 Gbps. However, there has been no TOE implementation like our hybrid TOE architecture, in which hardware acceleration modules maximize the performance of data transmission and two general embedded processors control separate paths for data packet transmission and reception.

Some studies introduced features needed in the host OS to utilize a TOE network adapter [20, 21]. Wang et al. introduced strategies to improve the performance of a TOE [22]. Westrelin et al. predicted the performance of a protocol offloading mechanism with an emulation technique [23]. Gilfeather and Maccabe introduced the concept of activation and deactivation to maintain a large number of TCP connections efficiently on the limited resources of TOEs [24]. Kim and Rixner introduced a connection handoff mechanism, based on a programmable gigabit Ethernet adapter, in which some TCP connection information managed by the host CPU is transferred to the network adapter [25]. Other studies examined protocol processing engines for very high-speed networks above 10 Gbps, based on hardware implementations [26, 27].
3. DESIGN AND IMPLEMENTATION OF THE HYBRID TOE

3.1 Overview of the Hybrid TOE Architecture

Fig. 1 shows the block diagram of our hybrid TOE architecture based on hardware/software co-design.
[Fig. 1. Block diagram of the hybrid TOE architecture: within a Xilinx Virtex-II Pro FPGA, the host interface on the 64-bit/66-MHz host PCI bus; the TX and RX processors, each with SDRAM and flash memory on its PLB/OPB; the coprocessing control module; the TX and RX engines; and the gigabit Ethernet interface on the 64-bit/66-MHz local PCI bus.]
In this architecture, synchronous dynamic random access memory (SDRAM) and flash memory are connected to each processor through the processor local bus (PLB) and the on-chip peripheral bus (OPB), respectively. The OPB is also used to connect the hardware modules to each processor.

We define which operations are performed by hardware and which by software in the hybrid TOE architecture based on our previous research [15, 16]. The operations implemented in hardware are the most time-consuming operations in data transmission and reception. In the data transmission path, the operations implemented in the hardware TX engine are: (1) fetching data from the host memory using DMA; (2) calculating the TCP checksum and the IP header checksum; and (3) creating the headers of data packets. In the data reception path, the operations implemented in the hardware RX engine are: (1) checking and analyzing incoming packets; (2) generating ACK packets; and (3) storing data in the host memory using DMA. The software controls the operations of the hardware modules and performs the operations that are not implemented in hardware.

For the TX/RX path separation mechanism, two general embedded processors, the TX and RX processors, control the TX path and the RX path of TCP/IP separately in cooperation with the hardware. In this mechanism, the TX processor, the RX processor, and the hardware modules must share data essential to TCP/IP processing, such as data packet creation and processing, flow control, and congestion control. Therefore, we implement a coprocessing control module (CCM) that provides fast and efficient data sharing.

The host interface provides an interface between the host CPU and the CCM through a 64-bit/66-MHz PCI bus. To minimize the overhead of DMA to the host memory, we merge 64-bit/66-MHz PCI bus interface logic with very low DMA latency [28] into the host interface. The gigabit Ethernet interface controls an external media access control (MAC) chip and provides an interface between the TX/RX engines and the MAC chip through a 64-bit/66-MHz PCI bus, so it contains the same PCI bus interface logic as the host interface.

3.2 Design of the Hybrid TOE
[Fig. 2. Processing sequence for data transmission and reception: (1) TX request from the host CPU; (2) initializing DMA; (3) fetching data using DMA; (4) generating headers; (5) storing the data packet to the gigabit Ethernet interface; transmitting/receiving the data packet; (6) fetching the data packet; (7) checking the data packet; (8) generating the ACK packet; (9) initializing DMA; (10) transmitting the ACK packet; (11) storing data using DMA.]
If the embedded processors performed all the operations for data transmission and reception, it would be difficult to improve performance because all the instructions would execute sequentially in the embedded processors. Therefore, we design hardware modules that accelerate data transmission and reception by performing important operations in parallel. Fig. 2 shows the processing sequence of the hardware modules for data transmission and reception. Sections 3.2.1 and 3.2.2 describe the implementation of the TX and RX engines in Fig. 2.

3.2.1 Design of the transmission engine

We analyze the creation of the TCP, IP, and MAC headers of data packets; the results are shown in Table 1. We find that some fields of the headers are unchanged after a connection is established, and that the others, which change in every packet, can be calculated rapidly by hardware instead of by the TX processor. Default values for the fields are stored in the CCM (described in section 3.2.3) by the TX processor when a connection is established; these are then modified and used as the base information while the headers are generated by hardware.

Table 1. Header fields that are unchanged and those that can be calculated by hardware.

Header fields that are unchanged after connection establishment:
  TCP: source port number, destination port number
  IP:  version, type of service, flags, time to live, source IP address, destination IP address
  MAC: destination address, source address, type

Header fields that can be calculated by hardware:
  TCP: sequence number, ACK number, header length, flags, window size, TCP checksum
  IP:  header length, total length, header checksum, identification
  MAC: length
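This split suggests keeping a per-connection header template: the invariant fields are written once at connection establishment, and the hardware patches only the per-packet fields. The following minimal C sketch groups the fields as in Table 1; the struct layout and field names are our illustration, not the adapter's actual register map:

```c
#include <stdint.h>

/* Per-connection header template kept in the CCM (hypothetical layout).
 * Fixed fields are written once by the TX processor at connection
 * establishment; per-packet fields are patched by the header generator. */
struct hdr_template {
    /* Fixed after connection establishment */
    uint8_t  eth_dst[6], eth_src[6];   /* MAC destination/source address */
    uint16_t eth_type;                 /* 0x0800 for IPv4 */
    uint8_t  ip_ver_tos[2];            /* version / type of service */
    uint8_t  ip_ttl;                   /* time to live */
    uint32_t ip_saddr, ip_daddr;       /* source/destination IP address */
    uint16_t tcp_sport, tcp_dport;     /* source/destination port */

    /* Recomputed by hardware for every packet */
    uint16_t ip_tot_len;               /* IP total length */
    uint16_t ip_id;                    /* identification */
    uint16_t ip_check;                 /* IP header checksum */
    uint32_t tcp_seq, tcp_ack;         /* sequence/ACK numbers */
    uint16_t tcp_window;               /* advertised window size */
    uint16_t tcp_check;                /* TCP checksum */
};
```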
Based on this analysis, we design a TX engine that creates data packets in hardware. If the TX processor were used to create data packets, the software could take considerable time to perform operations (2) to (5) of Fig. 2 in sequence. In our implementation, the TX engine performs operations (3) and (4) of Fig. 2 in parallel in hardware, so the processing time for data packet creation is reduced greatly.

In addition, we design a data segmentation mechanism to reduce the overhead imposed on the host CPU and to improve data transmission performance. In this mechanism, when transmitting data larger than the maximum transmission unit (MTU) size, the host CPU generates a single request for all of the data in the transmission buffer of the host memory, and the hybrid TOE then segments the large data into multiple packets; a sketch of this loop follows. If the data is not larger than the MTU size, it is processed immediately. Without this mechanism, the host CPU would have to generate one request per data segment and would suffer the overhead of generating many requests to transmit a large amount of data.
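As a rough software rendering of this mechanism (the names `MTU`, `MSS`, `tx_one_packet`, and `segment_and_send` are our own, not the adapter's interface):

```c
#include <stddef.h>
#include <stdint.h>

#define MTU 1500          /* Ethernet maximum transmission unit */
#define TCP_IP_HDR 40     /* TCP header (20) + IP header (20), no options;
                           * with 12 bytes of TCP options, as in the
                           * experiments in section 4, the payload per
                           * packet drops to 1,448 bytes */
#define MSS (MTU - TCP_IP_HDR)

/* Hypothetical hardware hook: transmit one <= MSS-byte segment by DMA. */
void tx_one_packet(uint64_t host_addr, size_t len);

/* One host request covers the whole transmission buffer; the TOE cuts
 * it into MTU-sized packets, so the host CPU issues a single request. */
void segment_and_send(uint64_t host_addr, size_t total_len)
{
    size_t off = 0;
    while (off < total_len) {
        size_t chunk = total_len - off;
        if (chunk > MSS)
            chunk = MSS;              /* data larger than the MTU is split */
        tx_one_packet(host_addr + off, chunk);
        off += chunk;
    }
}
```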
Fig. 3 shows the structure of the TX engine, which consists of the header generator and the TX buffer manager.

[Fig. 3. Structure of the TX engine: the header generation controller and the MAC/IP/TCP header buffer in the header generator; the TX buffer manager with its TX buffer (header and data regions) and outbound FIFO; and their connections to the CCM, the TX processor, the host memory, and the gigabit Ethernet interface.]
The header generator creates the headers of data packets and the TX buffer manager fetches data from the host memory. The details of data transmission are as follows:

(1) After checking the congestion status in the CCM, the TX processor requests the TX engine to create data packets.
(2) The TX buffer manager calculates the partial TCP checksum while fetching data from the host memory. At the same time, the header generator creates the fields of the headers, except the TCP checksum and the IP header checksum.
(3) After the data is fetched, the header generator calculates the TCP checksum and the IP header checksum using the partial TCP checksum from step (2), and then completes the creation of the TCP/IP/MAC headers.
(4) The header generator completes a data packet by storing the TCP/IP/MAC headers in the TX buffer. The TX buffer manager transmits the data packet and then triggers an interrupt to the TX processor to report the transmission.
(5) The TX engine repeats steps (2) to (4) until all transactions are performed.
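Step (2) produces a partial one's-complement sum over the payload while it streams in by DMA, and step (3) folds in the pseudo-header and header words to finish the TCP checksum. A standard Internet-checksum sketch of that split (the function names are ours):

```c
#include <stddef.h>
#include <stdint.h>

/* Accumulate the one's-complement sum over a byte range; this is the
 * "partial TCP checksum" produced while the payload is fetched by DMA. */
static uint32_t csum_partial(const uint8_t *p, size_t len, uint32_t sum)
{
    while (len > 1) {
        sum += (uint32_t)p[0] << 8 | p[1];  /* 16-bit big-endian words */
        p += 2;
        len -= 2;
    }
    if (len)
        sum += (uint32_t)p[0] << 8;         /* odd trailing byte */
    return sum;
}

/* Fold the 32-bit accumulator to 16 bits and complement it. The header
 * generator adds the pseudo-header and TCP header words to the payload
 * sum from csum_partial() before this final fold. */
static uint16_t csum_fold(uint32_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}
```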
3.2.2 Design of the reception engine

We design an RX engine that performs operations (6) to (11) of Fig. 2. In the RX engine, operations (6) to (8) of Fig. 2 are performed in parallel, and then operations (9) and (10) of Fig. 2 are performed in parallel. Compared to an implementation using only the RX processor, this parallelism reduces the time spent on data packet reception considerably.
[Fig. 4. Structure of the RX engine: the packet tester (checksum calculator, header analyzer, and data packet manager); the ACK generator (ACK generation controller and ACK packet buffer); and the RX buffer manager with its RX buffer (header and data regions) and inbound FIFO, together with their connections to the CCM, the RX processor, the host memory, the TX buffer manager, and the gigabit Ethernet interface.]
Fig. 4 shows the structure of the RX engine, which consists of the packet tester, the ACK generator, and the RX buffer manager. The packet tester analyzes incoming data packets and the ACK generator creates ACK packets. The RX buffer manager stores incoming data in the host memory using DMA. The details of data packet reception are as follows; steps (1a) to (1c) correspond to operations (6) to (8) of Fig. 2, and steps (2a) and (2b) correspond to operations (9) and (10) of Fig. 2:

(1a) The RX buffer manager supplies the incoming packet to the packet tester while copying it to the RX buffer.
(1b) In the packet tester, the checksum calculator calculates the TCP and IP checksums. While the checksum calculator operates, the header analyzer checks the headers of the incoming packet, splits out all fields of the headers, stores some information in the CCM through the data packet manager, and provides some values to the ACK generator.
(1c) After the header analyzer finishes, the ACK generator creates the corresponding
ACK packet before the checksum calculator finishes its work.
(2a) The RX buffer manager stores the clean data in the host memory using DMA. Then the data packet manager in the packet tester updates the shared data in the CCM.
(2b) If there is no error in the packet, the ACK generator transmits the ACK with the assistance of the TX buffer manager. The ACK can be piggy-backed if there is a data packet in the TX buffer whose destination is the same as the ACK packet's destination; otherwise, the ACK packet is transmitted immediately.
(3) If the packet is erroneous (bad checksum, out-of-order, etc.), the packet in the RX buffer is dropped, which causes the source node to retransmit the correct packets.
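Steps (1a) to (3) amount to a short fast-path decision. The following C sketch is our illustration only: the helper names and the packet record are hypothetical, and the real checks run in parallel in hardware rather than sequentially as here:

```c
#include <stdbool.h>
#include <stdint.h>

struct rx_pkt { uint32_t seq; uint16_t len; bool csum_ok; };

/* Hypothetical hooks standing in for the RX buffer manager,
 * the ACK generator, and the shared state in the CCM. */
extern uint32_t expected_seq;                 /* kept in the CCM */
void dma_to_host(const struct rx_pkt *pkt);   /* store clean data by DMA */
void send_or_piggyback_ack(uint32_t ack_no);  /* ACK generator */
void drop_packet(const struct rx_pkt *pkt);   /* source will retransmit */

void rx_fast_path(const struct rx_pkt *pkt)
{
    /* (3) erroneous packets (bad checksum, out-of-order) are dropped */
    if (!pkt->csum_ok || pkt->seq != expected_seq) {
        drop_packet(pkt);
        return;
    }
    dma_to_host(pkt);                         /* (2a) store clean data   */
    expected_seq += pkt->len;                 /* update shared state     */
    send_or_piggyback_ack(expected_seq);      /* (2b) transmit the ACK   */
}
```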
3.2.3 Design of the coprocessing control module

In the TX/RX path separation mechanism, two general embedded processors control and process the transmission path and the reception path of TCP/IP separately in cooperation with the hardware. We design the coprocessing control module (CCM), which is used to share information for coprocessing between the TX processor, the TX engine, the RX processor, and the RX engine, as shown in Fig. 5.

[Fig. 5. Block diagram of the coprocessing control module: a quad-port memory shared by the TX processor, the RX processor, the TX engine, and the RX engine, plus a dual-port memory and an interrupt generator between the two processors.]
The CCM supports fast and efficient data sharing among the four components via a quad-port memory. The dual-port memory is used by the TX processor and the RX processor to exchange data while they perform operations, such as ARP/ICMP processing, that do not require hardware assistance. The interrupt generator triggers interrupts to the processors; it is also used by the host interface to deliver requests from the host CPU to the TX processor.

Table 2 shows the data shared between the TX path and the RX path for each TCP connection. Each datum is 4 bytes, and the amount of shared data per TCP connection is less than 32 words (128 bytes) in the current implementation; consequently, the hybrid TOE can maintain approximately 1,000 connections with a 128-KB quad-port memory.
Table 2. Shared data between the TX path and the RX path (field; writer; reader(s); use).

- Header information (sport/dport, etc.); writer: TX processor; readers: TX engine, RX engine; use: header creation and processing.
- Status of TX buffer in host memory; writer: TX processor; reader: TX engine; use: data fetch by DMA.
- Status of RX buffer in host memory; writer: TX processor; reader: RX engine; use: calculation of window size for flow control and storing data by DMA.
- Window size in incoming packet; writer: RX engine; reader: TX processor; use: flow control.
- Number of packets transmitted; writer: TX engine; readers: TX processor, RX processor; use: congestion control and retransmission.
- Number of packets acknowledged; writer: RX engine; readers: TX processor, RX processor; use: congestion control and retransmission.
- Number of ACK packets arrived; writer: RX engine; readers: TX processor, RX processor; use: congestion control and retransmission.
- Time at which the last ACK packet arrived; writer: RX engine; readers: TX processor, RX processor; use: congestion control and retransmission.
- Expected sequence number; writer: RX engine; readers: RX engine, RX processor; use: out-of-order packet detection and retransmission.
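Taken together, the shared data amounts to a small per-connection record in the quad-port memory. A hypothetical C layout consistent with Table 2 and the 32-word budget (the field names and widths are our assumption):

```c
#include <stdint.h>

/* Per-connection record in the CCM quad-port memory. Each field is one
 * 4-byte word; the whole record stays under 32 words (128 bytes), so a
 * 128-KB quad-port memory holds about 1,000 connections. */
struct ccm_conn {
    uint32_t hdr_info[8];        /* sport/dport etc., written by TX processor */
    uint32_t host_tx_buf_status; /* data fetch by DMA */
    uint32_t host_rx_buf_status; /* window calculation, storing data by DMA */
    uint32_t peer_window;        /* window size in the last incoming packet */
    uint32_t pkts_transmitted;   /* congestion control and retransmission */
    uint32_t pkts_acknowledged;  /* congestion control and retransmission */
    uint32_t acks_arrived;       /* congestion control and retransmission */
    uint32_t last_ack_time;      /* congestion control and retransmission */
    uint32_t expected_seq;       /* out-of-order detection, retransmission */
};
```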
The TX processor and the RX processor establish TCP connections with the CCM as follows:

(1) The application on the TX processor processes requests for connection establishment from the host CPU by invoking the socket functions corresponding to the requests.
(2) The TX processor creates the packets for three-way handshaking and transmits them in sequence.
(3) In the destination node, the RX processor processes incoming packets if they are for connection establishment, and delivers summary information about the packets to the TX processor through the CCM.
(4) After a connection is established, the TX processor creates the shared data needed for the connection and stores it in the CCM.

When transmitting data packets, the TX processor requests the TX engine to create data packets and maintains a list of transaction information in preparation for retransmission. It performs flow and congestion control with the support of the CCM. Flow control is based on information such as the status of the receive buffer in the host memory and the window size in the last incoming packet. Congestion control uses information such as the number of packets transmitted and the number of ACK packets received. Retransmission is performed by the TX processor at the request of the RX processor: the RX processor processes ACK packets and, if needed, delivers a retransmission request to the TX processor through the CCM.
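With these shared counters, the TX processor's pre-transmission check can be pictured as the usual TCP window test. A hedged sketch reusing the ccm_conn record sketched in section 3.2.3; the packet-counted congestion window cwnd_pkts is our simplification of the state the TX processor actually keeps:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the TX processor's flow/congestion check before it asks the
 * TX engine to create packets; ccm_conn is the record sketched above. */
bool may_transmit(const struct ccm_conn *c, uint32_t cwnd_pkts)
{
    /* congestion control: packets sent but not yet acknowledged */
    uint32_t in_flight = c->pkts_transmitted - c->pkts_acknowledged;
    if (in_flight >= cwnd_pkts)
        return false;
    /* flow control: do not overrun the receiver's advertised window */
    return c->peer_window > 0;
}
```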
3.3 Implementation of the Hybrid TOE

Fig. 6 shows a picture of the hybrid TOE network adapter that was developed to verify the operation of the hybrid TOE architecture and to evaluate its performance.

[Fig. 6. Picture of the hybrid TOE network adapter, showing the Xilinx Virtex-II Pro FPGA, the Intel 82545EM MAC/PHY, the flash memories, the SDRAMs, the MagJack connector, and two RS-232 ports.]

The adapter is equipped with a Xilinx Virtex-II Pro FPGA that embeds two PowerPC 405 processor cores, an Intel 82545EM gigabit Ethernet MAC/PHY controller chip, two flash memory chips, and four synchronous dynamic random access memory (SDRAM) chips. Each of the two processor cores is connected to two SDRAM chips through its own processor local bus (PLB) and to one flash memory chip through its own on-chip peripheral bus (OPB). Each OPB is also used to connect its core to the hardware
modules of the hybrid TOE. The flash memories store the images of the software, and the SDRAMs hold the running software. The two cores operate with a 300-MHz core clock, a 100-MHz PLB clock, and a 50-MHz OPB clock. The hardware modules operate with a 66-MHz PCI clock so as to be synchronized with the 64-bit/66-MHz PCI bus. The two PowerPC 405 cores in the FPGA serve as the TX processor and the RX processor.

The dual-port memory shown in Fig. 5 is implemented using a Block SelectRAM [29], which has two physical input/output port pairs through which it can be accessed by two devices at the same time. The coprocessing control module (CCM) shown in Fig. 5 must provide a fast and efficient data sharing mechanism to the four components. Ideally, a very fast synchronous SRAM would be used to hold the shared data. However, because there is no SRAM on the adapter shown in Fig. 6, and complex arbitration logic would be required for the four components to access a shared memory at the same time, the CCM in this paper is implemented using a virtual quad-port memory [30] that is also based on a Block SelectRAM and supports simultaneous accesses from four devices, as shown in Fig. 5. Each physical input/output port pair of a Block SelectRAM is shared between two devices by using a memory clock that is twice as fast as the devices' clock, a multiplexer for the input port, and two output buffers. Data inputs from the two devices to the input port are switched via the multiplexer, which is controlled by the device clock. The data output of the output port is stored in one of two temporary buffers, one for each device; one buffer operates with the device clock and the other with a negated version of the device clock. Each device thus reads data stored in the memory through its corresponding temporary buffer.
4. EXPERIMENTS AND ANALYSES

4.1 CPU Utilization, Latency, and Bandwidth
[Fig. 7. Comparison of CPU utilizations (TOE, TG3, and E1000; utilization in % vs. data size in KB).]
[Fig. 8. Comparison of latencies (TOE, TG3, and E1000; latency in μs vs. data size in bytes).]
[Fig. 9. Comparison of unidirectional bandwidths (TOE, TG3, and E1000; bandwidth in MB/s vs. data size in KB).]
For the experiments, we used two computers based on an AMD Opteron 246 CPU (2 GHz); each is equipped with 1 GB of main memory, a Marvell Tigon-3 gigabit Ethernet controller, an Intel gigabit Ethernet adapter based on the 82545EM chip, and the hybrid TOE adapter. We used Linux kernel 2.6.29 for the x86-64 architecture as the host operating system, with a timer frequency of 1,000 Hz and kernel preemption disabled. We measured latencies, bandwidths, and host-CPU utilizations with the NetPIPE [31] benchmark program, using the Linux default sizes for the send and receive buffers.

Fig. 7 compares the CPU utilizations of the hybrid TOE and the general gigabit Ethernet adapters, Fig. 8 compares their latencies, and Fig. 9 compares their unidirectional bandwidths. "TG3" denotes the Tigon-3 gigabit Ethernet controller installed on the main board, "E1000" denotes Intel's adapter, and "TOE" denotes the hybrid TOE.

The TOE exhibited very low utilization of the host CPU, less than 3%, approximately an eighteen-fold reduction compared with the TG3 and the E1000. The TOE and the E1000 showed minimum latencies of 13.5 μs and 15.2 μs, respectively. In contrast, the TG3 showed a much higher minimum latency of 42.9 μs, more than three times that of the TOE. The maximum unidirectional bandwidth of the TOE was approximately 110 MB/s, 4% lower than those of the TG3 and the E1000; however, it should be remembered that the hybrid TOE operates with much slower clocks than the host CPU clock (2 GHz). This slight penalty was caused by the device driver on the host CPU, which had not yet been fully optimized.
[Fig. 10. Comparison of bidirectional bandwidths (TOE, TG3, E1000, and ONE_PPC; bandwidth in MB/s vs. data size in KB).]
These results show that the hybrid TOE can achieve high TCP/IP processing performance by using hardware that accelerates data transmission and reception, despite using 300-MHz embedded processors that by themselves do not have sufficient performance to process TCP/IP at a speed of 1 Gbps.

To prove the effectiveness of the TX/RX path separation mechanism, Fig. 10 compares the bidirectional bandwidths of the general GbE adapters, the hybrid TOE that uses two PowerPC 405 cores, and "ONE_PPC", a TOE implementation in which just one PowerPC 405 core is used and the TX/RX paths are not processed simultaneously. The hybrid TOE, with its TX/RX path separation mechanism, reached a higher maximum bidirectional bandwidth than the ONE_PPC. The bandwidth of the ONE_PPC did not improve significantly, unlike that of the TOE, because the single embedded processor became the bottleneck. The maximum bidirectional bandwidth of the hybrid TOE was 201 MB/s, 34% higher than that of the ONE_PPC and 95% of the E1000's maximum bidirectional bandwidth of 210 MB/s.

To achieve performance comparable to the general GbE adapters, approximately 80,000 1500-byte packets, each containing 1,448 bytes of data, must be transmitted per second, which allows only 12.5 μs to process each packet. That is, data transmission and reception have to be completed within 12.5 μs at each node under burst transactions. Table 3 shows the times taken to perform the major hardware operations of Fig. 2 when creating, transmitting, and receiving a 1500-byte data packet. As shown in Table 3, all operations satisfy this time limit.

Table 3. Elapsed times of the main operations processed by hardware.
Transmission:
  (1) Storing host command to command queue through PCI bus ........ 0.6 μs
  (2) DMA initialization ........................................... 1.1 μs
  (3) Fetching data from host memory using DMA ..................... 2.8 μs
  (4) Header generation ............................................ 0.7 μs
  (5) Storing packet to gigabit Ethernet interface ................. 2.9 μs

Reception:
  (6) Fetching incoming packet from gigabit Ethernet interface ..... 2.9 μs
  (7) Packet header test and analysis .............................. 0.6 μs
  (9) DMA initialization ........................................... 1.1 μs
  (11) Storing data to host memory using DMA ....................... 2.8 μs
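Summing the table confirms the claim: both directions fit comfortably inside the 12.5-μs per-packet budget.

```latex
\begin{align*}
t_{\mathrm{TX}} &= 0.6 + 1.1 + 2.8 + 0.7 + 2.9 = 8.1\ \mu\mathrm{s} < 12.5\ \mu\mathrm{s},\\
t_{\mathrm{RX}} &= 2.9 + 0.6 + 1.1 + 2.8 = 7.4\ \mu\mathrm{s} < 12.5\ \mu\mathrm{s}.
\end{align*}
```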
4.2 Analysis for 10 Gigabit Ethernet

In this section, we analyze the performance requirements and features essential for applying our hybrid TOE architecture to 10 Gigabit Ethernet. To achieve performance comparable to Chelsio's TOE adapter, a full ASIC implementation with a unidirectional bandwidth of 7.6 Gbps on 10 Gigabit Ethernet [18], approximately 633,000 1500-byte packets have to be transmitted per second if jumbo frames are not used. In this case, only 1.6 μs is available to process each packet; that is, each functional block in Fig. 2 has to complete within 1.6 μs if the functional blocks are executed as a pipeline. If 9000-byte jumbo frames are used, approximately 105,000 packets are transmitted per second, so this time limit is eased to 9.5 μs.

As shown in Table 3, the operations that cannot satisfy the 1.6-μs time limit for 10 Gigabit Ethernet are: operations (3) and (11), which access the host memory using DMA; operation (5), which passes outgoing packets to the Ethernet interface; and operation (6), which fetches incoming packets from the Ethernet interface. In the current hybrid TOE implementation, these operations are bound to the 64-bit/66-MHz PCI interface logic, whose maximum bidirectional bandwidth is 4.2 Gbps. They could satisfy the time limit if we adopted PCI-Express with a x4 link, which has a maximum bidirectional bandwidth of 20 Gbps. With this enhancement, the hardware implementation and the processing sequence of Fig. 2 can be carried over to a hybrid TOE architecture for 10 Gigabit Ethernet.

In the current software implementation, the time taken to process an interrupt is approximately 4 μs, which satisfies the 9.5-μs time limit when 9000-byte jumbo frames are used on 10 Gigabit Ethernet. However, with 1500-byte packets the current software cannot satisfy the 1.6-μs time limit. For the software of the hybrid TOE to process every interrupt within 1.6 μs, a clock higher than 300 MHz is required for the embedded processors; we roughly calculate that the embedded processors, with the aid of caches, would have to operate with about 850-MHz clocks.
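For reference, the packet-time budgets used in this section follow directly from the 7.6-Gbps target:

```latex
\begin{align*}
\frac{7.6 \times 10^{9}\ \mathrm{b/s}}{1500 \times 8\ \mathrm{b/packet}} \approx 633{,}000\ \mathrm{packets/s}
  &\;\Rightarrow\; \frac{1}{633{,}000\ \mathrm{/s}} \approx 1.6\ \mu\mathrm{s/packet},\\
\frac{7.6 \times 10^{9}\ \mathrm{b/s}}{9000 \times 8\ \mathrm{b/packet}} \approx 105{,}000\ \mathrm{packets/s}
  &\;\Rightarrow\; \frac{1}{105{,}000\ \mathrm{/s}} \approx 9.5\ \mu\mathrm{s/packet}.
\end{align*}
```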
5. CONCLUSION

In this paper, we proposed a hybrid TOE architecture based on hardware/software co-design and on the TX/RX path separation mechanism in order to overcome the low performance of software TOEs and the low expandability of hardware TOEs. We designed a coprocessing control mechanism with minimal data sharing to support efficient coprocessing between the software and the hardware. The hardware of the hybrid TOE performs the time-critical operations of data transmission and reception, such as creating packet headers, processing incoming data packets, fetching data from the host memory, storing data in the host memory, and generating ACK packets. The software of the hybrid TOE is based on embedded Linux to improve expandability; it performs the operations that are not performance-critical, such as connection establishment, flow control, congestion control, retransmission, and ARP/ICMP processing. Finally, we developed a hybrid TOE network adapter that is equipped with a gigabit Ethernet MAC/PHY chip and an FPGA that has two embedded processor cores.
Based on this adapter, we implemented the hybrid TOE architecture, verified its operation, and evaluated its performance. In the experiments, the hybrid TOE showed a minimum latency of approximately 13.5 μs. Its maximum unidirectional bandwidth was approximately 110 MB/s, although the embedded processors operated with a clock seven times slower than that of the host CPU. The utilization of the host CPU was less than 3%, approximately an eighteen-fold reduction from the maximum utilizations of the general gigabit Ethernet adapters. By using two embedded processors for the TX/RX path separation mechanism, the bidirectional bandwidth of the hybrid TOE improved to about 201 MB/s, approximately 34% higher than the bidirectional bandwidth achieved with only one embedded processor. These results prove that our hybrid TOE is suitable for cluster systems based on gigabit Ethernet. Finally, our analysis showed that the hybrid TOE architecture could be applied to 10 Gigabit Ethernet by adopting PCI-Express with a x4 link and embedded processors faster than those of the current hybrid TOE, together with optimized software.
REFERENCES

1. TOP500 Supercomputer Sites, http://www.top500.org.
2. Myri-10G Solutions, http://www.myri.com.
3. InfiniBand Architecture Specification, InfiniBand Trade Association, 2006.
4. Message Passing Interface Forum, http://www.mpi-forum.org.
5. R. Recio, P. Culley, D. Garcia, and J. Hilland, An RDMA Protocol Specification, RDMA Consortium, 2002, http://www.rdmaconsortium.org.
6. J. Satran, et al., Internet Small Computer Systems Interface, IETF RFC 3720, 2004.
7. N. Bierbaum, "MPI and embedded TCP/IP gigabit Ethernet cluster computing," in Proceedings of the 27th Annual IEEE Conference on Local Computer Networks, 2002, pp. 733-734.
8. E. Yeh, et al., "Introduction to TCP/IP offload engine (TOE)," 10 Gigabit Ethernet Alliance, 2002.
9. B. S. Ang, "An evaluation of an attempt at offloading TCP/IP protocol processing onto an i960RN-based iNIC," Technical Report HPL-2001-8, 2001.
10. S. Aiken, D. Grunwald, A. R. Pleszkun, and J. Willeke, "A performance analysis of the iSCSI protocol," in Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, 2003, pp. 123-134.
11. H. Ghadia, "Benefits of full TCP/IP offload (TOE) for NFS services," in Proceedings of the NFS Industry Conference, 2003, http://nfsconf.com/pres03/adaptec.pdf.
12. W. Feng, et al., "Performance characterization of a 10-gigabit Ethernet TOE," in Proceedings of the 13th Symposium on High Performance Interconnects, 2005, pp. 58-63.
13. Information Sciences Institute, University of Southern California, Transmission Control Protocol, DARPA Internet Program Protocol Specification, 1981.
14. Information Sciences Institute, University of Southern California, Internet Protocol, DARPA Internet Program Protocol Specification, 1981.
15. S. C. Oh, H. Jang, and S. H. Chung, "Analysis of TCP/IP protocol stack for a hybrid TCP/IP offload engine," in Proceedings of the 5th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2004, pp. 406-409.
16. H. Jang, S. H. Chung, and S. C. Oh, "Implementation of a hybrid TCP/IP offload engine prototype," in Proceedings of the 10th Asia-Pacific Computer Systems Architecture Conference, 2005, pp. 464-477.
17. Z. Z. Wu and H. C. Chen, "Design and implementation of TCP/IP offload engine system over gigabit Ethernet," in Proceedings of the 15th International Conference on Computer Communications and Networks, 2006, pp. 245-250.
18. Chelsio Communications, "The terminator architecture: The case for a VLIW processor vs. multi-processor SOC to terminate TCP and process L5-L7 protocols," http://www.chelsio.com.
19. T. H. Liu, H. F. Zhu, C. S. Zhou, and G. R. Chang, "Research and prototype implementation of a TCP/IP offload engine based on the ML403 Xilinx development board," in Proceedings of the 2nd International Conference on Information and Communication Technologies, Vol. 2, 2006, pp. 3163-3168.
20. S. C. Oh and S. W. Kim, "An efficient Linux kernel module supporting TCP/IP offload engine on grid," in Proceedings of the 5th International Conference on Grid and Cooperative Computing, 2006, pp. 228-235.
21. D. J. Kang, C. Y. Kim, K. H. Kim, and S. I. Jung, "Design and implementation of kernel S/W for TCP/IP offload engine (TOE)," in Proceedings of the 7th International Conference on Advanced Communication Technology, Vol. 1, 2005, pp. 706-709.
22. W. F. Wang, J. Y. Wang, and J. J. Li, "Study on enhanced strategies for TCP/IP offload engines," in Proceedings of the 11th International Conference on Parallel and Distributed Systems, Vol. 1, 2005, pp. 398-404.
23. R. Westrelin, et al., "Studying network protocol offload with emulation: approach and preliminary results," in Proceedings of the 12th Annual IEEE Symposium on High Performance Interconnects, 2004, pp. 84-90.
24. P. Gilfeather and A. B. Maccabe, "Connection-less TCP," in Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium, Workshop 9, Vol. 10, 2005, pp. 210-212.
25. H. Y. Kim and S. Rixner, "TCP offload through connection handoff," in Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems, 2006, pp. 279-290.
26. Y. Hoskote, et al., "A TCP offload accelerator for 10 Gb/s Ethernet in 90-nm CMOS," IEEE Journal of Solid-State Circuits, Vol. 38, 2003, pp. 1866-1875.
27. H. Shrikumar, "40Gbps de-layered silicon protocol engine for TCP record," in Proceedings of the Conference on Design, Automation and Test in Europe, Vol. 1, 2006, pp. 1-6.
28. S. Park, S. H. Chung, and B. Lee, "Implementation and performance study of a hardware-VIA-based network adaptor on gigabit Ethernet," Journal of Systems Architecture, Vol. 51, 2005, pp. 602-616.
29. Xilinx, Inc., "Virtex-II Pro and Virtex-II Pro X platform FPGAs," Data Sheet, http://www.xilinx.com.
30. N. Sawyer and M. Defossez, "Quad-port memories in Virtex devices," Application Note XAPP228 (v1.0), Xilinx, Inc., 2002, http://www.xilinx.com.
31. The Scalable Computing Laboratory, "A network protocol independent performance evaluator (NetPIPE)," http://www.scl.ameslab.gov/netpipe.
Hankook Jang received his Ph.D. and B.S. degrees in Computer Engineering from Pusan National University, Busan, Republic of Korea, in 2008 and 1999, respectively. He has been working at Samsung Electronics since 2008. His current research interests are in computer architecture, cluster systems, high-speed system area networks, real-time and embedded systems, and hardware/software co-design.
Sang-Hwa Chung received the B.S. degree in Electrical Engineering from Seoul National University in 1985, the M.S. degree in Computer Engineering from Iowa State University in 1988, and the Ph.D. degree in Computer Engineering from the University of Southern California in 1993. He was an Assistant Professor in the Electrical and Computer Engineering Department at the University of Central Florida from 1993 to 1994. He is currently a Professor in the Computer Engineering Department at Pusan National University, Korea. His research interests are in the areas of computer architecture and high-performance computer networking.
Dong Kyue Kim received the B.S., M.S. and Ph.D. degrees in Computer Engineering from Seoul National University in 1992, 1994, and 1999, respectively. From 1999 to 2005, he was an Assistant Professor in the Division of Computer Science and Engineering at Pusan National University. He is currently an Associate Professor in the Division of Electronics and Communications Engineering at Hanyang University, Korea. His research interests are in the areas of embedded security systems, crypto-coprocessors, information security, and hardware implementation for cryptographic devices. He is now a board member of IEEE Seoul Section, and a director of KIISC, KIPS, and KMMS.
Yun-Sung Lee received the B.S. and M.S. degrees in Computer Science Engineering from Pusan National University in 2006 and 2008, respectively. He is currently a doctoral candidate at Pusan National University. His current research interests are in wireless mesh networks, embedded systems, and hardware/software implementation.