A Cost-Effective TCP/IP Offload Accelerator Design for Network Interface Controller

K. Hashimoto and V.G. Moshnyaga
Department of Electronics Engineering and Computer Science, Fukuoka University, Fukuoka, Japan

Abstract - In this paper, we propose a new design optimization of a TCP/IP offload engine which significantly reduces the size of the FIFO buffers in the Network Interface Controller while maintaining high-throughput TCP/IP processing. Experimental evaluation of a prototype TOE receiver design shows that the receiver can achieve 13.1 Mbit/s throughput at a very low (25 MHz) frequency while utilizing only 57.5K 2-input NAND logic gates in control logic.

Keywords: Network, controller, TCP/IP, hardware, design

1 Introduction

Transmission Control Protocol (TCP) and Internet Protocol (IP) are two core protocols of the Internet protocol suite, commonly referred to as TCP/IP. Whereas IP handles lower-level message transmission from computer to computer across the Internet, TCP operates at a higher level, concerning itself only with the end systems, e.g., a web browser and a web server. Traditionally, TCP/IP processing is accomplished by software running on the central processing unit (CPU), while physical access to the networking medium is implemented by the Network Interface Controller (NIC). As the network speed increases, the CPU becomes burdened with a tremendous amount of TCP/IP protocol processing: reassembling out-of-order packets, performing resource-intensive memory copies, and servicing interrupts. As a result, the CPU spends more time handling network traffic than running user applications. Driving a full TCP/IP stack on a 1 Gbps network, for example, consumes all the processing power of a 1 GHz CPU [1,10]. To reduce the CPU load of running the TCP/IP stack, the TCP/IP processing functions are offloaded onto special hardware, a TCP/IP Offload Engine (TOE). In this paper we focus on a TOE implementation for embedded systems and present a design optimization capable of significantly reducing the TOE cost. Due to the stringent requirements of embedded applications, the target TOE-based network interface controller must provide megabit throughput at a very low clock frequency. The rest of the paper is organized as follows. In the next section we describe the NIC design background. Section 3 analyzes related research. Section 4 describes the proposed design optimization. Section 5 presents the results of an experimental evaluation of a TOE receiver designed with the proposed optimizations. Section 6 presents conclusions and outlines future work.

2 Background

A Network Interface Controller (NIC) is a computer hardware component designed to allow computers to communicate over a computer network. The NIC has a ROM chip that contains a unique 48-bit network hardware identifier, the Media Access Control (MAC) address, burned into it. The NIC operates on both the physical layer (layer 1) and the data link layer (layer 2) of the OSI model, as it provides physical access to the networking medium as well as a low-level addressing system through the use of MAC addresses. Conventional Ethernet NICs are designed to reside on a standard I/O bus (e.g., PCI) that is physically distant from, and clocked much more slowly than, the CPU, such that accesses to device registers may require thousands of CPU cycles [12]. Traditionally, a NIC uses DMA to manage large main-memory-based FIFOs, copying data to and from these structures into its on-board FIFOs as needed. To provide flexibility in memory allocation, these main-memory FIFOs are non-contiguous, represented by lists of memory-resident DMA descriptor data structures. Each descriptor contains the address and length of a contiguous buffer. To transmit a packet, the device driver creates a DMA descriptor for each of the internal kernel buffers that make up the packet (often one for the protocol header and one for the payload), writes the DMA descriptors to the in-memory transmit queue, and then writes a NIC control register to alert it to the presence of the new descriptors. The NIC then performs a DMA read operation to retrieve the descriptors, a DMA read for each data buffer to copy the data into the NIC-resident hardware FIFOs, and then a DMA write to mark the descriptors as having been processed. The device driver later reclaims the DMA descriptors and buffers. Receive operations are similar, except that the device driver pre-allocates empty buffers and places corresponding DMA descriptors on a queue for the NIC to fill with received packets. After each buffer is filled, the NIC marks its descriptor accordingly and interrupts the CPU. The device driver then processes the filled buffers, converting them to internal kernel format, and passes them to the kernel's protocol stack. Although fetching, processing, and updating DMA descriptors is conceptually simple, it incurs a non-trivial amount of memory bandwidth and processing overhead, both on the NIC and in the device driver. Willmann et al. [14] analyzed a commercial 1 Gbps Ethernet NIC that implements DMA in firmware and determined that an equivalent 10 Gbps NIC must sustain 435 MIPS to perform these tasks at line rate. Note that, other than possibly calculating checksums, this computational effort includes no inspection or processing of the packets whatsoever.
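As a minimal illustration of the descriptor mechanism described above, the following C sketch shows a plausible layout for a DMA descriptor and the driver-side transmit steps; the structure fields, flag values, and the doorbell helper are hypothetical and are not taken from any particular NIC.

```c
#include <stdint.h>

/* Hypothetical DMA descriptor: address and length of one contiguous buffer. */
struct dma_desc {
    uint64_t buf_addr;   /* physical address of the kernel buffer        */
    uint16_t buf_len;    /* length of the buffer in bytes                */
    uint16_t flags;      /* ownership / completion flags (assumed names) */
};

#define DESC_OWNED_BY_NIC  0x0001
#define DESC_DONE          0x0002

/* In-memory transmit queue (ring) shared between driver and NIC. */
struct tx_ring {
    struct dma_desc *desc;   /* array of descriptors in main memory */
    unsigned         head;   /* next slot the driver will fill      */
    unsigned         size;
};

/* Stub for the MMIO "doorbell" write that alerts the NIC to new descriptors. */
static void nic_write_doorbell(unsigned tail_index)
{
    (void)tail_index;  /* a real driver would write a NIC control register here */
}

/* Post one packet made of a header buffer and a payload buffer. */
static void tx_post_packet(struct tx_ring *r,
                           uint64_t hdr_addr, uint16_t hdr_len,
                           uint64_t pay_addr, uint16_t pay_len)
{
    struct dma_desc *d;

    d = &r->desc[r->head % r->size];          /* descriptor for the header  */
    d->buf_addr = hdr_addr;
    d->buf_len  = hdr_len;
    d->flags    = DESC_OWNED_BY_NIC;
    r->head++;

    d = &r->desc[r->head % r->size];          /* descriptor for the payload */
    d->buf_addr = pay_addr;
    d->buf_len  = pay_len;
    d->flags    = DESC_OWNED_BY_NIC;
    r->head++;

    /* The NIC will DMA-read the descriptors and buffers, then DMA-write
       DESC_DONE so the driver can later reclaim them. */
    nic_write_doorbell(r->head % r->size);
}
```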

One significant issue with Ethernet has been the relatively high CPU overhead of a full TCP/IP stack and the relatively high latency compared to other network technologies. Usually, the NIC does not offload any of the TCP/IP packet processing. In a typical computer system, the OS allocates the protocol stack for sending and receiving communication packets. In the ITRON TCP/IP API specification Ver.2.0 [20], for example, the stack establishes a connection through a "communication end point". To implement the TCP/IP protocol in an embedded system, we need two important blocks (PHY and MAC) that operate at the physical layer and the data link layer, respectively. At the MAC layer, each MAC frame arriving from the PHY layer is verified by the CRC check and then placed in the Rx buffer. The TCP/IP protocol stack (set up upon the TCP connection) decides the status of the received IP packet based on the communication end point (the payload data and the header of the top-level protocol). If a packet stored in the Rx buffer satisfies the above condition, it is passed to the channel.

To receive data at the IP layer, the incoming packets need to be stored and reordered into a consecutive stream at the communication end point. Large data streams (such as media content files) must be broken up, or segmented, into multiple packets. When a packet is dropped during transmission, it must be retransmitted. When packets arrive out of order, an OS-based TOE waits until all packets arrive before it begins reassembling the segments, so the OS load increases.
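A minimal sketch of the reassembly step described above, assuming a hypothetical segment record keyed by its TCP sequence number; it only illustrates why out-of-order arrival forces buffering until the byte stream is contiguous.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record for one received TCP segment. */
struct segment {
    uint32_t seq;            /* sequence number of the first payload byte */
    uint32_t len;            /* payload length                            */
    uint8_t  data[1460];     /* payload (up to one MSS)                   */
    struct segment *next;    /* pending segments kept sorted by seq       */
};

/* Deliver bytes to the application; stub for illustration. */
static void deliver(const uint8_t *data, uint32_t len) { (void)data; (void)len; }

/* Accept a segment; buffer it until the byte stream is contiguous. */
static void tcp_rx(struct segment **pending, uint32_t *next_seq, struct segment *s)
{
    /* insert into the pending list, sorted by sequence number */
    struct segment **p = pending;
    while (*p && (*p)->seq < s->seq)
        p = &(*p)->next;
    s->next = *p;
    *p = s;

    /* release every segment that is now in order */
    while (*pending && (*pending)->seq == *next_seq) {
        struct segment *head = *pending;
        deliver(head->data, head->len);
        *next_seq += head->len;
        *pending = head->next;
        free(head);
    }
}
```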

3 Related Research

The TCP Offload Engine (TOE) is a basic technology used in network interface devices to offload processing of the TCP/IP stack to the network controller and thus ease the networking bottleneck. Although there are multiple protocols under the umbrella of the TCP/IP protocol stack (TCP, IP, UDP, ICMP and others), TOE implementations [2-9] usually focus on offloading TCP and IP processing. Existing TOE implementations can be classified as processor-based, ASIC-based, and hybrid. A processor-based TOE [8] is realized using off-the-shelf components such as a network processor or a microprocessor running a real-time operating system (RTOS) and a MAC/PHY. The protocol processing is offloaded from the host CPU to the protocol stack in the RTOS. The advantage of this implementation is the flexibility of the solution and the wide availability of the components; however, its scalability to 10 Gigabit Ethernet and beyond is coming under scrutiny.

The ASIC-based TOE implementations, such as [2] and [11], are customized for TCP/IP protocol offload. In [11], for example, the hardware implements connection setup, packet transfer, disconnection, session management, etc., i.e., all functions which traditionally were performed by TCP/IP software. ASIC designs are usually scalable and offer better performance than processor-based designs at the expense of flexibility.

Hybrid implementations [3,4,6,7] take advantage of both the processor-based and the ASIC-based approaches. Their goal is to provide scalability and flexibility while maintaining performance. In [6], for example, an FPGA circuit implements the Rx functions (address check, checksum, message buffering and allocation, hash-key calculation, and control), the Tx-buffering functions, control of the incoming and outgoing messages, host bus control, etc. The microprocessor and firmware are responsible for TCP state transfer, TCP stream transactions, timeout event management, host interaction, etc. The architecture uses two 64MB SDRAMs, achieves up to 800 Mbps over Ethernet, and saves 3/4 of the computational power of a conventional TCP/IP stack.

The hybrid TOE developed by Dell, Broadcom and Microsoft [7] assumes that the OS controls TCP/IP tasks such as connection setup, connection termination, system resource allocation, prevention of denial-of-service attacks, and error and exception handling, while the Broadcom NetXtreme II based TOE hardware handles the data processing of TCP/IP connections. Namely, it implements the Physical (layer 1), the Data Link/MAC (layer 2), the Network/IP (layer 3), and the Transport/TCP (layer 4) layers in the NIC. The main feature of this architecture is the ability to create a direct connection between the top of the protocol stack and the software drivers to enable partial offload of the protocol stack. The TOE transfers data to the top of the protocol stack without moving it through the intermediate protocol layers. The GigEx TOE [9] integrates an FPGA-based hardware accelerator and a conventional 32-bit microprocessor to provide a complete TCP/IP stack, including the application layer (HTTP and DHCP), the transport layer (UDP and TCP), the internet layer (IPv4 and ICMP), and ARP and Ethernet from the link layer. Internally, it integrates Ethernet MAC, checksum offload, IPv4 (including reassembly), UDP processing and TCP flow hardware modules. The microprocessor implements control and the higher-level protocols, including DHCP and HTTP.

Our work continues in the spirit of this research, but focuses on optimizing the NIC interface for the needs of the kernel's TCP/IP stack. Several efforts have addressed this issue. Work [5] reported the benefit of integrating a conventional DMA-based NIC on the processor die, but did not consider modifying the NIC's interface to exploit its proximity to the CPU. Intel announced an "I/O Acceleration Technology" (I/OAT) initiative [15] that explicitly discounts TOEs in favor of a "platform solution" [16]. Intel researchers proposed a "TCP on-loading" model in which one CPU of an SMP is dedicated to TCP processing [17].

"Zero-copy" receivers have been presented in [18]. The most common technique is page flipping, where the buffer is copied from the kernel to the user address space by remapping a physical page. Trapeze [19] provided zero-copy in this manner, but has limitations such as requiring page-size MTUs (much larger than the Internet standard 1500 bytes). The header and payload separation done by Trapeze is very similar to the zero-copy mechanism that we use to demultiplex packets, though we do not require that the extra intelligence for this separation exist in the NIC. In addition, the required page-table manipulations, while faster than an actual page copy, are not quick in an absolute sense. Zero-copy behavior can also be achieved using "remote DMA" (RDMA) protocol extensions, where the receiving application pre-registers receive buffers in such a way that the NIC can identify them when they are referenced in incoming packets. In addition to requiring a sophisticated NIC, RDMA is a significant change to both applications and protocols, and requires support on both ends of a connection.

Figure 1. Formats of the MAC (Ethernet) frame and of the IPv4, TCP and UDP packets. The 14-byte MAC header carries the receiver and transmitter MAC addresses and a type field; its payload is a top-level packet of up to 1500 bytes. The 20-byte IPv4 header carries the checksum and the receiver and transmitter IP addresses. The TCP packet consists of a 20-byte header and a data payload of up to 1460 bytes; the UDP packet consists of an 8-byte header and a data payload of up to 472 bytes.
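For reference, the header layouts summarized in Fig.1 can be written down as C structures. This is a generic sketch of the standard Ethernet, IPv4, TCP and UDP headers, not code from the paper's design.

```c
#include <stdint.h>

/* 14-byte MAC (Ethernet) header. */
struct eth_hdr {
    uint8_t  dst_mac[6];   /* receiver MAC address    */
    uint8_t  src_mac[6];   /* transmitter MAC address */
    uint16_t ether_type;   /* e.g., 0x0800 for IPv4   */
} __attribute__((packed));

/* 20-byte IPv4 header (without options). */
struct ipv4_hdr {
    uint8_t  ver_ihl;      /* version (4) and header length in 32-bit words */
    uint8_t  tos;
    uint16_t total_len;    /* header + payload length             */
    uint16_t id;
    uint16_t frag_off;
    uint8_t  ttl;
    uint8_t  protocol;     /* 6 = TCP, 17 = UDP                   */
    uint16_t checksum;     /* header checksum                     */
    uint32_t src_ip;       /* transmitter IP address              */
    uint32_t dst_ip;       /* receiver IP address                 */
} __attribute__((packed));

/* 20-byte TCP header (without options); a payload of up to 1460 bytes follows. */
struct tcp_hdr {
    uint16_t src_port;
    uint16_t dst_port;
    uint32_t seq;          /* sequence number, monitored by the receiver */
    uint32_t ack;
    uint16_t flags;        /* data offset + flags (SYN, ACK, FIN, RST)   */
    uint16_t window;
    uint16_t checksum;     /* covers header, payload and pseudo-header   */
    uint16_t urgent;
} __attribute__((packed));

/* 8-byte UDP header. */
struct udp_hdr {
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t length;       /* header + payload length */
    uint16_t checksum;
} __attribute__((packed));
```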


4 The Proposed Design

In traditional network systems, the IP, TCP and UDP headers are separated from the TCP/UDP data and checksum, as shown in Fig.1. Though this separation reduces the overall load of protocol stack processing in software, only a few multi-functional network cards handle the IP, TCP, and UDP headers separately from the data. We found that if information about the location of the corresponding descriptor in main memory could be passed between the application layer and the NIC, the number of data copies which have to be made between the NIC, the protocol stack and the application layer would be significantly reduced, and the processing time shortened. Furthermore, existing TOEs place an incoming MAC frame into a FIFO queue at the MAC layer and keep it there until the frame is read by the protocol stack. The FIFO has to be long enough to prevent data loss at the given throughput; for example, 30 Kbytes of SRAM are required to store 20 MAC frames. Due to such large Rx and Tx buffers, existing TOE designs consume large area and power. To reduce the protocol stack buffer, we limit the receiving and transmitting buffers to a single packet in size and allocate the descriptor queue in main memory, which does not require DMA to access. In our design, the receiver TOE analyzes the header of an incoming MAC frame at the IP, TCP and UDP layers, computes the address at which the packet data is to be stored in main memory, and saves the packet in the buffer. Fig.2 outlines our NIC architecture. The PHY and MAC blocks denote modules that establish connections at the physical and MAC layers, respectively.
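The FIFO sizing claim above follows directly from the maximum MAC payload of Fig.1; the constants below are a back-of-the-envelope check, not values quoted from the design.

```c
/* Conventional TOE: buffer ~20 maximum-size MAC frames. */
#define MAC_PAYLOAD_MAX   1500u                                 /* bytes, from Fig.1      */
#define FRAMES_BUFFERED   20u
#define CONVENTIONAL_FIFO (MAC_PAYLOAD_MAX * FRAMES_BUFFERED)   /* = 30000 B, i.e. ~30 KB */

/* Proposed TOE: a single-frame buffer; the packet queue lives in main memory. */
#define PROPOSED_FIFO     MAC_PAYLOAD_MAX                       /* = 1500 B               */
```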

Figure 2. The proposed TOE architecture (PHY, MAC with MII and FCS, TOE receiver and TOE sender connected with the CPU/cache, SRAM and DRAM interface over the SoC high-speed bus; I/O and flash memory attached via a peripheral bus adapter and a low-speed bus).

The TOE receiver and TOE transmitter represent hardware accelerators for data reception and transmission, respectively. Both accelerators are connected to the CPU and memory via a high-speed (AMBA AHB) local bus. The I/O units and flash memory are connected to the bus through a low-speed (APB) bus and a bus adapter. The NIC TOE design analyzes the header of an incoming MAC frame at the IP, TCP and UDP layers and defines the address at which the packet data is stored in main memory. Below we list the distinguishing features of our TOE design.

1. The proposed TOE receiver keeps in hardware the address at which the TCP/UDP data payload (see Fig.1) is stored in main memory, while the data is delivered to the target TCP/UDP communication end-point specified by the application.

2. As in a conventional NIC, the address is used to access the Descriptor Table associated with the packet. In order to lower the cost, the bit-width of the internal registers is minimized.

3. The internal clock frequency (X, in MHz) of the TOE is at most twice the effective throughput (Y, in Mbit/s) of the TOE data receiver; i.e., Y×2 > X is our design target.

4. The memory buffer in the TOE receiver is limited to one packet in size.

5. The TOE has functional compatibility with existing NIC designs.

Figure 3. The block diagram of the TOE receiver (PACKET_ANALYZER, PACKET_FIFO with FIFO_CTRL and SRAM, PREG, DESC_PROC and DESC_DMA, with APB and AHB interfaces).

Fig.3 depicts the internal configuration of the TOE receiver. The PACKET_ANALYZER takes an incoming MAC frame and analyzes all embedded headers (see Fig.1). First, it extracts the IP or ARP/RARP type from the MAC header. If the packet belongs to the IPv4 layer, it computes the sizes of the IP, TCP and UDP headers and data and stores them in the register PREG. Next, it initiates a parallel load of the MAC frame into the PACKET_FIFO while simultaneously computing the checksum. If the checksum shows an error, the FIFO and PREG are flushed and an error message is sent to the DESC_PROC unit.

The PACKET_FIFO consists of a 1-input/1-output SRAM and a simple controller. In general, the number of PREG registers corresponds to the number of packets simultaneously stored in the PACKET_FIFO. Since our design has a FIFO of one MAC packet in size, it uses only one PREG register.

The DESC_PROC unit takes the information written in PREG and in the Descriptor Table, computes the main memory address, and then stores the packet header and data at that address.

The DESC_DMA reads a packet from the PACKET_FIFO and, based on the packet address and size provided by DESC_PROC, writes the packet onto the AHB.
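The checksum that the PACKET_ANALYZER computes while the frame streams into the PACKET_FIFO can be accumulated incrementally, word by word. The following is a generic sketch of the standard Internet checksum (RFC 1071), not the paper's actual circuit; the function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* One's-complement Internet checksum (RFC 1071), accumulated incrementally so
   it can be updated as each word of the frame is written into the FIFO. */
static uint32_t csum_accumulate(uint32_t acc, const uint8_t *data, size_t len)
{
    while (len > 1) {
        acc += (uint32_t)((data[0] << 8) | data[1]);
        data += 2;
        len  -= 2;
    }
    if (len)                       /* odd trailing byte */
        acc += (uint32_t)(data[0] << 8);
    return acc;
}

static uint16_t csum_finalize(uint32_t acc)
{
    while (acc >> 16)              /* fold carries back into the low 16 bits */
        acc = (acc & 0xFFFF) + (acc >> 16);
    return (uint16_t)~acc;         /* one's complement */
}

/* Verification succeeds when the finalized sum over the pseudo-header, header
   and payload (with the received checksum field included) equals zero. */
```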

Figure 4. The block diagram of the DESC_PROC unit (data registers, a DESC_PROC status register, a sequential logic circuit and an AHB burst-access circuit for the Descriptor Table, with an APB interface for CPU access).

Figure 5. The Descriptor-Table-based packet queue (Tables #1 to #M linked by 32-bit next-table pointers; each table holds control flags, header and data sizes, and the base addresses of the header and TCP/UDP data storage segments).

Fig.4 shows the internal structure of the DESC_PROC. In contrast to related designs, it includes n data registers to process requests from n endpoints simultaneously (the number of registers equals the number of endpoints in the protocol stack). The sequential logic circuit implements the following control functions.

1. Reading a packet from the FIFO and storing it in main memory. This function is carried out by the DESC_DMA and controlled by the DESC_PROC. The DESC_PROC may contain a single or several data endpoint registers.

2. Recording information about the incoming packet in the corresponding Descriptor Table (Fig.5) of the protocol stack. Each Descriptor Table contains the packet's header, the base address of the data in main memory, the size, and the protocol analysis report. A Descriptor Table is transmitted between the DESC_PROC and the protocol stack.

3. Descriptor Tables are combined in a packet queue by pointers, as shown in Fig.5. An object in the queue reflects a packet. By providing the size of the packet data implicitly, we reduce the amount of hardware required for storing and transmitting the packet. Also, by monitoring the TCP sequence numbers, we drop from processing those packets which do not match the TCP identifier and so save energy. In order to transmit a TCP-RST packet, a Descriptor Table corresponding to its endpoint data is placed into the packet queue. For each remaining (TCP-SYN, TCP, UDP) packet that does not match the data endpoint condition, we also allocate a Descriptor Table and place it in the queue before storing the packet in main memory.

4. All data registers shown by the dotted line in Fig.4 can be accessed from the CPU. To speed up access to the Descriptor Table, we provide a special AHB burst-access circuit. As a packet is processed, the DESC_PROC circuit resets the PREG register and waits for the next incoming packet. If there are multiple data end-points, they are processed sequentially.
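A plausible C rendering of the Descriptor Table queue of Fig.5, as discussed in items 2 and 3 above; the field names follow the labels in the figure, but their exact semantics and widths are assumptions made for illustration.

```c
#include <stdint.h>

/* One Descriptor Table entry, linked into a circular packet queue (Fig.5). */
struct descriptor_table {
    /* control flags and protocol analysis report */
    uint8_t  received;        /* a packet has been written for this entry     */
    uint8_t  usable;          /* entry may be reused by the protocol stack    */
    uint8_t  fin_rst;         /* TCP FIN/RST observed                         */
    uint8_t  prot;            /* protocol of the payload (TCP/UDP/other)      */
    uint8_t  adr_mth;         /* 5-bit address-generation method              */
    uint16_t id;              /* endpoint / connection identifier             */

    /* header and data sizes (13-bit fields such as EHS/EDS, MHS/RHS, MDS/RDS
       in Fig.5; stored here as 16-bit integers) */
    uint16_t hdr_size;
    uint16_t data_size;

    /* where DESC_DMA stores the packet in main memory */
    uint32_t hdr_base;        /* base address of the header storage segment  */
    uint32_t data_base;       /* base address of the TCP/UDP data segment     */

    /* 32-bit pointer in hardware; the last table (#M) points back to #1 */
    struct descriptor_table *next;
};

/* The protocol stack walks the queue and consumes completed entries. */
static void consume_queue(struct descriptor_table *head)
{
    for (struct descriptor_table *t = head; t != 0; t = t->next) {
        if (!t->received)
            break;                /* no more completed packets               */
        /* ... hand hdr_base / data_base to the communication end point ... */
        t->received = 0;
        t->usable   = 1;
        if (t->next == head)
            break;                /* circular queue: wrapped around          */
    }
}
```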

TABLE I. THE TOE RECEIVER COMPLEXITY

TOE module          Logic cells
PACKET_ANALYZER     5998
PREG                3989
FIFO_CTRL           883
DESC_DMA            4628
DESC_PROC           40016
Total               55514

5 Experimental Results

We implemented the proposed TOE system in hardware and measured its efficiency on a number of tests. The circuit was described in Verilog-HDL, synthesized with the Synopsys Design Compiler, and realized in a standard-cell LSI based on the Hitachi 0.18 um design rules and cell library provided by VDEC. Table 1 shows the design complexity in terms of 2-input logic cells. As we can see, the total complexity of the receiver is less than 60K cells. In reality, a TOE system also includes the TOE sender and the MAC unit. The combined complexity of these two units nowadays does not exceed 100K cells. Consequently, even with the FIFO SRAM, the total complexity of the embedded TOE system is not high. Next, we developed a simulation model of the proposed TOE receiver (Fig.6) and ran it in an HDL simulator to evaluate the receiver performance on various data. In this model, the MAC, the TOE transmitter and the CPU are modeled as dummy modules.

Figure 6. The TOE simulation model (the TOE receiver connected over the AHB to the main memory and a dummy CPU that receives the interrupt; a dummy MAC and a dummy TOE transmitter attached through a virtual connection).

The MAC has direct links with the virtual connection model, and the delivery of MAC frames, normally performed by the PHY layer, is carried out over the virtual connection. Additionally, the model assumes the following:

- 32-bit parallel I/O;
- a CRC-based checksum for the MAC frame;
- IP cores in the MAC specified in the same way as in existing designs;
- 100 Mbps data throughput between the MAC and the virtual communication model;
- a 25 MHz internal clock;
- zero delay for packet delivery over the virtual connection and no delay between packet transmissions in a sequence;
- a 1460-byte data payload in each TCP packet;
- 4 packets per transmission sequence.

After storing an incoming packet in main memory, the TOE receiver sends an interrupt signal to the CPU, and the process is repeated for the next packet. The simulation revealed that the TOE receiver takes 4131 clock cycles from the start of packet transmission to the end (i.e., to the time the interrupt is generated). In existing TCP/IP engines, such as the TOE from NEC [21] that implements the protocol stack in software running on an ITRON OS, it takes 3.4 ms to receive a TCP-ACK packet after an interrupt. Thus, without loss of accuracy, we can assume that our TOE model has the same delay to produce an acknowledgement (ACK). Consequently, at a 25 MHz clock frequency (40 ns period) the TOE requires 3.4 ms/40 ns + 4131 = 89,131 clock cycles per iteration of the packet transmission (see Fig.7). As a result, the peak throughput achievable by the TOE receiver is (1460 Byte × 4 packets × 8 bit) ÷ (89,131 cycles × 40 ns), or about 13.1 Mbps. Although this value is obtained by an approximate calculation, it allows us to judge that the proposed TOE satisfies the target throughput requirement.
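The cycle and throughput figures above can be reproduced with a few lines of arithmetic; the short program below is only a sanity check of the numbers quoted in the text.

```c
#include <stdio.h>

int main(void)
{
    const double clock_hz    = 25e6;            /* 25 MHz internal clock     */
    const double period_ns   = 1e9 / clock_hz;  /* 40 ns                     */
    const double ack_wait_ns = 3.4e6;           /* 3.4 ms ACK delay [21]     */
    const double rx_cycles   = 4131.0;          /* cycles to receive + store */

    double cycles_per_iter = ack_wait_ns / period_ns + rx_cycles;  /* 89131      */
    double bits_per_iter   = 1460.0 * 4 * 8;                       /* 46720 bits */
    double throughput_bps  = bits_per_iter / (cycles_per_iter * period_ns * 1e-9);

    printf("cycles per iteration = %.0f\n", cycles_per_iter);       /* 89131 */
    printf("throughput = %.1f Mbit/s\n", throughput_bps / 1e6);     /* ~13.1 */
    return 0;
}
```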

Figure 7. The TOE packet transmission flow between the TOE model and the virtual model (SYN, SYN/ACK, ACK handshake, followed by iterations of 1460 Byte × 4 data transfers, each answered by an ACK).

6 Conclusions

In this paper we showed that a TOE can effectively increase the throughput of a receiver. From the point of view of communication stability and application response, however, a more drastic improvement in processing performance is necessary. The proposed TOE receiver architecture basically adopts the existing NIC configuration and the layer structure of the protocol stack. To increase the TOE performance drastically, the following problems must be solved:

1. Checksum execution. Verification requires a checksum to be computed for the IP header, TCP and UDP. When we extract the information of the IP header, TCP and UDP, parallel execution is possible. However, when checking the sum of TCP and UDP, we have to access the packet data entirely. Therefore our circuit activates the DESC_PROC module only after the PACKET_ANALYZER completes its work and writes a packet into the FIFO. This is the bottleneck of our circuit.

2. End-point data search. Because of the stringent cost requirements imposed on embedded systems, we limit the number of registers used in the design and store the packet data in main memory. When an incoming packet has multiple data end-points, checking the status of the packet data requires multiple memory accesses, which account for many clock cycles and delay.

3. PHY layer speed-up. The physical layer has evolved from 10Base and 100Base to 1000Base, increasing the throughput of MAC frame transmission. Accelerating the receiver throughput without enlarging the FIFO queues and the system cost is difficult. When the throughput of the transmitter is larger than the throughput of the receiver, many MAC frames cannot be received and must be retransmitted. As a result, the overall communication throughput becomes low.

One of the ways to overcome the above problems is a speculative checksum computation in parallel with the DMA memory access. We are investigating techniques to implement this idea. Future work will also cover the LSI design of a custom TCP/IP TOE chip.

Acknowledgement

The work was sponsored by the Ministry of Education, Culture, Sports, Science and Technology of Japan under the Knowledge Cluster Initiative (The Second Stage) and a Grant-in-Aid for Scientific Research (C), No. 21500063. The authors are thankful for the support.

7 References

[1] "TCP/IP offload Engine (TOE) — Everything about Ethernet @ 10Gea.org", http://www.10gea.org/tcp-ip-offloadengine-toe.htm
[2] Y. Hoskote, A. Bradly, A. Bloechel, et al., "A TCP Offload Accelerator for 10Gb/s Ethernet in 90-nm CMOS", IEEE JSSC, vol. 38, no. 11, pp. 1866-1875, 2003.
[3] H. Jang, S-H. Chung, S-C. Oh, "Implementation of a hybrid TCP/IP offload engine prototype", Proc. 10th Asia-Pacific Computer Systems Architecture Conf., pp. 464-477, 2005.
[4] T-H. Liu, H-F. Zhu, et al., "Research and prototype implementation of a TCP/IP offload engine based on the ML403 Xilinx development board", 2nd ICTTA Inf. and Comm. Technologies, vol. 2, pp. 3163-3168, 2006.
[5] Z-Z. Wu and H-C. Chen, "Design and Implementation of TCP/IP Offload Engine System over Gigabit Ethernet", Proc. 15th Int. Conf. on Computer Comm. and Networks, pp. 245-250, 2006.
[6] S-M. Chung, C-Y. Li, H-H. Lee, et al., "Design and Implementation of TCP/IP Offload Engine", IEEE 2007 Int. Symp. on Comm. and Inf. Technologies (ISCIT 2007), pp. 574-579.
[7] P. Gupta, A. Light, I. Hameroff, "Boosting Data Transfers with TCP/IP Offload Engine Technology on Ninth-Generation Dell PowerEdge Servers", Dell Power Solutions, Dell Inc., pp. 18-22, Aug. 2006.
[8] H. Jang, S-H. Chung, D-H. Yoo, "Design and implementation of protocol offload engine for TCP/IP and remote direct memory access based on hardware/software co-processing", Microprocessors and Microsystems, vol. 33, issues 5-6, pp. 333-342, Aug. 2009.

[9] "Off-loading TCP-IP into hardware makes Gigabit Ethernet a reality for your application", Orange Tree Technologies Ltd, www.orangetreetech.com
[10] A. Baldus, "TOE: TCP/IP offload engine relieves CPU burden", Embedded Computing Design, March 2005.
[11] "TCP-offload engine SOC IP from Intelop undergoes major enhancements", www.intelop.com
[12] N. L. Binkert, L. R. Hsu, A. G. Saidi, R. G. Dreslinski, A. L. Schultz, and S. K. Reinhardt, "Performance analysis of system overheads in TCP/IP workloads", Proc. 14th Ann. Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT), pp. 218-228, Sept. 2005.
[13] N. L. Binkert, A. G. Saidi, S. K. Reinhardt, "Integrated Network Interfaces for High-Bandwidth TCP/IP", Proc. ACM ASPLOS'06, 2006.
[14] P. Willmann, H. Kim, S. Rixner, and V. S. Pai, "An efficient programmable 10 Gigabit Ethernet network interface card", Proc. 11th Int'l Symp. on High-Performance Computer Architecture (HPCA), Feb. 2005.
[15] K. Lauritzen, T. Sawicki, T. Stachura, and C. E. Wilson, "Intel I/O acceleration technology improves network performance, reliability and efficiency", Technology@Intel Magazine, Mar. 2005. http://www.intel.com/technology/magazine/communications/Intel-IOAT-0305.pdf
[16] P. Gelsinger, H. G. Geyer, and J. Rattner, "Speeding up the network: A system problem, a platform solution", Technology@Intel Magazine, Mar. 2005. http://www.intel.com/technology/magazine/communications/speeding-network-0305.pdf
[17] G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong, "TCP onloading for data center servers", IEEE Computer, 37(11):48-58, Nov. 2004.
[18] J. Chase, High Performance TCP/IP Networking, chapter 13, "Software Implementation of TCP", Prentice-Hall, 2003.
[19] A. Gallatin, J. Chase, and K. Yocum, "Trapeze/IP: TCP/IP at near-gigabit speeds", Proc. 1999 USENIX Technical Conference, Freenix Track, 1999.
[20] "ITRON TCP/IP API Specification Ver.2.0", TRON Ltd, July 2006, available from http://www.assoc.tron.org/
[21] Y. Hasegawa, H. Shinoshi, M. Tsutomu, "Behavioral verification and performance evaluation of hardware TCP-NIC (TCP Offload Engine)", 13th Conference on Internet Technology, Committee Research Meeting no. 163, May 2005 (in Japanese).
