
Efficient Caching Techniques for Server Network Acceleration

Li Zhao*, Srihari Makineni+, Ramesh Illikkal+, Ravi Iyer+ and Laxmi Bhuyan*

+ Communications Technology Lab, Intel Corporation
* Department of Computer Science, University of California, Riverside

{srihari.makineni, ramesh.g.illikkal, ravishankar.iyer}@intel.com; {zhao,bhuyan}@cs.ucr.edu

Abstract –

A majority of server applications, such as web servers, databases, e-mail, and storage, process a large amount of network data, making them very network I/O intensive. TCP/IP is the network protocol most commonly used by these applications, and it runs over Ethernet, the de facto Local Area Network (LAN) protocol. The rapid growth of Internet-enabled applications has resulted in the development of faster Ethernet technologies (from 100 Mbps to 1 and 10 Gbps). This sudden jump in Ethernet speeds requires TCP/IP processing to scale proportionately so that server applications can ultimately benefit. This turns out to be a big challenge for server platforms, given that CPU and memory speeds improve at a slower rate (tracking Moore's law). Work done in the past has shown that TCP/IP processing, especially on the receive side, is very memory intensive and therefore limits achievable network throughput considerably. To understand this, we performed a thorough cache characterization of TCP/IP processing. The results show that only a small fraction of TCP/IP data (TCBs and hash nodes) exhibits temporal locality, while the majority (descriptors, protocol headers, and payload) exhibits no temporal locality and thus generates a large number of memory accesses. To address these issues, we apply two caching techniques in this paper: (1) Cache Region Locking (CRL) with automatic update actions, and (2) Cache Region Prefetching (CRP). We describe these techniques and analyze their performance benefits. To measure the benefits, we implemented these caching techniques in an execution-driven CPU simulator and applied them to a FreeBSD TCP/IP stack. Our simulation results show that the proposed caching techniques improve TCP/IP performance by up to 65% by reducing memory stall time.

1 INTRODUCTION

TCP/IP [6,7,8] over Ethernet is the dominant packet processing protocol in data centers and on the Internet, and it is used by many commercial server applications such as web servers, e-commerce, databases, and storage over IP. Until recently, the commonly deployed Ethernet network speed was around 100 Mbps. However, the rapid increase in Internet usage required faster Ethernet networks, which led to the development of 1 and 10 Gbps Ethernet technologies. In order for network-intensive server applications to benefit from these increased Ethernet speeds, TCP/IP processing has to scale to these speeds first.

Figure 1 shows current TCP/IP performance on a 2.6 GHz Intel Xeon processor with Hyper-Threading Technology running the Microsoft Windows* 2003 Server Enterprise Edition operating system. For a 1460-byte application payload size (shown on the x-axis), the TCP/IP stack was able to achieve about 1 Gbps on transmit and 750 Mbps on receive, and doing so required an entire CPU. This data makes it clear that achieving 10 Gbps and beyond on server platforms is a major challenge. Further, data from recent work [11] on TCP/IP processing characterization in commercial server workloads, such as front-end web servers and back-end database servers with iSCSI storage, shows that TCP/IP processing is a significant portion (~28-35%) of the overall processing. So, it is equally important to keep the TCP/IP processing cost low while trying to scale to higher bit rates. As a way to achieve these objectives, some vendors are developing TCP/IP offload solutions [3,5,9], whereby TCP/IP processing is offloaded from the server processor to a peripheral device. However, a study [17] shows that TCP/IP offload is only suitable for bulk data transfer applications and suffers from several issues. Our focus is therefore on processor and system architecture enhancements as a way to solve this problem.

[Figure 1. TCP/IP Performance: transmit and receive throughput (Mbps) and CPU utilization (%) on the Xeon processor as a function of TCP payload size (128 bytes to 64 KB).]

Work done in the past [2,4,12] identified various sources of overhead in TCP/IP processing. These studies, as well as our own measurements, show that TCP/IP processing, especially the receive side, is memory intensive. Our measurements show that in order to process a 1460-byte packet, despite having a 512 KB L2 cache, the TCP/IP stack incurs approximately 20 cache misses on the transmit side and 63 on the receive side. So, it is important to understand the nature of these memory accesses and improve the caching behavior of TCP/IP processing.

The contributions of this paper are as follows. By analyzing the cache properties of the various data types associated with TCP/IP processing, we show that the majority of the data exhibits no temporal locality, and hence existing cache policies do not help. Based on this observation, we propose and evaluate two new cache schemes to improve TCP/IP performance: (1) Cache Region Locking (CRL) with auto-updates, and (2) a Cache Region Prefetching (CRP) instruction that lets software prefetch a region of memory into the processor cache efficiently.

In this paper, we focus mainly on the TCP/IP data path, which includes transmit- and receive-side processing; we intend to study connection processing at a later time. Also, when we mention TCP/IP processing, it includes interfacing with the Network Interface Card (NIC) hardware on one end, the applications on the other end, and all the processing in between.

2 OVERVIEW OF TCP/IP PROCESSING

In this section, we provide a high-level overview of TCP/IP receive- and transmit-side processing. The intention is not to delve into TCP/IP protocol specifics, but to introduce the various data types involved in the processing along with the data flow that takes place at the various layers of a typical TCP/IP stack.

2.1 Receive-Side Processing

Receive-side processing begins when the NIC hardware receives an Ethernet frame from the network. The NIC extracts the packet embedded inside the frame by removing the frame delineation bits and updates a data structure, called a descriptor, with the packet information. The NIC driver software supplies these descriptors to the NIC, organized in circular rings. Through these descriptors, the driver informs the NIC of, among other things, the address of a memory buffer (NIC buffer) in which to store the incoming packet data. The stack allocates several memory buffers to receive incoming packets; these buffers, like the descriptors, are reused. The NIC copies the incoming data into this memory buffer using a DMA operation. Once the packet is placed in memory, the NIC updates a status field inside the descriptor to indicate to the driver that the descriptor holds a valid packet and generates an interrupt. This kicks off processing of the received packet.

Figure 2 shows the overall flow for receive-side processing. The NIC device driver reads the descriptor to get the packet header and application payload. Once the descriptor is read, the TCP/IP stack has access to the memory buffer containing the header and payload data. The next step is to identify the connection to which this packet belongs. The TCP/IP software stores state information for each open connection in a data structure called the TCP/IP Control Block (TCB). Since there can be several thousand open connections, and hence many TCBs, the TCP/IP software uses hashing for fast lookup of the right TCB. The hash value is calculated from the IP addresses and port numbers of both the source and destination machines. Several fields inside the TCB (sequence numbers for received/acknowledged bytes, the application's pre-posted buffers, etc.) are updated whenever a new packet is received. If the application has already posted a buffer to receive the incoming data, the TCP/IP stack copies the incoming data from the NIC buffer directly into the application buffer. Otherwise, the data is stored in a temporary buffer for later delivery to the application. The memory copy (or copies) from the payload buffer to the application buffer is one of the most time-consuming operations in receive-side processing.

[Figure 2. Data Flow in Receive-Side Processing: the NIC DMAs the Ethernet packet (header + payload) into driver-managed header and data buffers described by NIC descriptors; the TCP/IP stack consults the TCB, copies the payload into a socket or application buffer, and signals the application through the sockets interface.]
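The hash-based TCB lookup described above can be illustrated with a short sketch. This is not the FreeBSD code; the structure layout, hash function, and bucket table are simplified assumptions made purely for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical, simplified TCB and hash-bucket layout (not the FreeBSD structures). */
struct tcb {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint32_t rcv_nxt;            /* next expected receive sequence number */
    uint32_t snd_una;            /* oldest unacknowledged sequence number */
    struct tcb *next;            /* chains TCBs that hash to the same bucket */
};

#define HASH_BUCKETS 4096        /* assumed power-of-two table size */
static struct tcb *tcb_hash[HASH_BUCKETS];

/* Hash the 4-tuple that identifies a connection. */
static unsigned tcb_hash_index(uint32_t sip, uint32_t dip,
                               uint16_t sport, uint16_t dport)
{
    uint32_t h = sip ^ dip ^ ((uint32_t)sport << 16 | dport);
    return h & (HASH_BUCKETS - 1);
}

/* Walk the bucket's linked list to find the matching connection. */
struct tcb *tcb_lookup(uint32_t sip, uint32_t dip,
                       uint16_t sport, uint16_t dport)
{
    unsigned idx = tcb_hash_index(sip, dip, sport, dport);

    for (struct tcb *t = tcb_hash[idx]; t != NULL; t = t->next) {
        if (t->src_ip == sip && t->dst_ip == dip &&
            t->src_port == sport && t->dst_port == dport)
            return t;            /* TCB found: update rcv_nxt, copy payload, etc. */
    }
    return NULL;                 /* no connection state for this packet */
}
```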

2.2 Transmit-Side Processing

Transmit-side processing starts when an application wants to transmit data and passes a data buffer to the TCP/IP stack. The data structure used to communicate between the TCP/IP stack and applications is called a socket. The application passes the socket id along with the data buffer, and the TCP/IP stack uses it to locate the TCB for the connection. The TCP/IP stack may then copy the application's data into an internal buffer; optimizations to avoid this copy are common in most of today's TCP/IP stacks. When it is time to transmit data (based on the receiver's window size), the TCP/IP stack divides the accumulated data into segments; the maximum segment payload is typically 536 bytes on the Internet and 1460 bytes on Ethernet LANs. It then computes the TCP and IP headers (20 bytes each, assuming no options). The segments are then prefixed with the TCP, IP, and Ethernet headers and passed down to the NIC driver. The driver sets up the DMA to transfer the headers and then the application data to the NIC.
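The segmentation arithmetic above can be illustrated with a small, self-contained sketch; it is not taken from any particular stack, and the constants are simply the values assumed in the text.

```c
#include <stdio.h>

#define MSS_ETHERNET   1460   /* max TCP payload per segment on an Ethernet LAN */
#define TCP_HDR_BYTES  20     /* TCP header, no options */
#define IP_HDR_BYTES   20     /* IPv4 header, no options */
#define ETH_HDR_BYTES  14     /* Ethernet header */

/* Report how a send of 'len' bytes would be segmented before being
 * handed to the NIC driver.  Purely illustrative arithmetic. */
static void segment_send(size_t len)
{
    size_t full = len / MSS_ETHERNET;
    size_t tail = len % MSS_ETHERNET;
    size_t segments = full + (tail ? 1 : 0);
    size_t wire_overhead = segments * (TCP_HDR_BYTES + IP_HDR_BYTES + ETH_HDR_BYTES);

    printf("%zu bytes -> %zu segment(s), %zu bytes of header overhead on the wire\n",
           len, segments, wire_overhead);
}

int main(void)
{
    segment_send(8192);   /* an 8 KB application write -> 6 segments */
    return 0;
}
```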

3 MEASUREMENTS/ANALYSIS METHODOLOGY

In order to use real-world network traffic loads for our studies, we collected network traces generated by two popular commercial workloads: SPECweb99 [15] and TPC-W (web server) [16]. We captured the network traces using an Ethernet sniffer [13] while running these two benchmark applications.

Figure 3 shows the payload size distribution for SPECweb99 and TPC-W (web server). We also synthetically created packet traces for a receive-intensive workload, referred to as the Receive-intensive (RX Intense) Workload, based on the packet traces of the TPC-W image server; here we assume that packets are received with the same locality as on the transmit side. The packet sizes and distribution vary considerably between the workloads. As shown in the figure, roughly 20 to 40% of the packets are control packets (containing zero-length payloads). We find that all the workloads we studied exhibit only a small fraction of packets with a payload size smaller than 1400 bytes; for these servers, almost 60% of the packets have a payload size close to the maximum Ethernet payload of 1460 bytes. The average packet sizes for the servers we studied are given in Table 1.

[Figure 3. TCP/IP payload size distribution: percentage of packets vs. payload size (0 to 1400 bytes) for SPECweb99, the TPC-W web server, and the RX Intense Workload.]

  Workload               Direction   Packet %   Avg. data size (bytes)
  SPECweb99              RX          34%        22
  SPECweb99              TX          66%        1183
  TPC-W Web Server       RX          26%        136
  TPC-W Web Server       TX          74%        1226
  RX Intense Workload    RX          100%       907
  RX Intense Workload    TX          0%         n/a

Table 1: Average Packet Sizes

We also analyzed the network traces from these benchmark applications to determine the number of back-to-back packets for the same connection, which we call a train of packets. These packet trains form the basis for our locality study; later, we show that a larger number of back-to-back packets for the same connection improves the locality exhibited by some of the TCP/IP data types. Figure 4 charts the cumulative distribution of packets against the length of the train they belong to.

[Figure 4. Packet trains in server workloads: CDF of packets vs. train length (1 to 15 packets) for SPECweb99 (1600 connections), TPC-W Web Server (10K browsers), and the RX Intense Workload (10K browsers).]

As shown in Figure 4, all the workloads contain packet trains, although the Receive-intensive workload has a higher percentage of trains with two or more packets.

3.1 Simulation Methodology

To understand the cache behavior of TCP/IP processing, we used an execution-driven simulation methodology based on the SimpleScalar* simulator [14]. Our base system configuration is a four-way fetch/issue/commit MIPS microprocessor with a 128-entry instruction window, two integer units, two load/store units, and a floating-point unit. We simulate a two-level cache hierarchy. The L1 instruction and data caches are each 32 KB, 4-way set-associative, with 64-byte cache lines; the data cache is write-back, write-allocate, and non-blocking with two ports. The L2 is a unified, 8-way, 1 MB cache with 64-byte lines and a 15-cycle hit latency. The main memory latency in the base configuration is 300 cycles. We added cache region locking as well as region and line prefetching capabilities to the simulator in order to study the benefits of these features for TCP/IP processing. We recognize that SimpleScalar has some limitations, and we would like to move to a more comprehensive simulator in the near future; that said, since our main focus is on studying cache behavior rather than detailed performance measurement, this simulator is sufficient for our purposes.

The TCP/IP stack used in this study is derived from the FreeBSD OS and ported to the SimpleScalar (MIPS) environment. To emulate the NIC hardware, we created a simple NIC emulator in software. It reads traces stored in files and converts them into Ethernet packets. The NIC emulator uses descriptors, just like NIC hardware, for data exchange with the stack. It also uses two shared queues (for transmit and receive) in lieu of interrupts; these queues are used by both the stack and the NIC emulator to post descriptors that point to packets to be transmitted or received. Since our focus in this study is only on data path processing, we pre-populated the TCB data structures with the necessary connection state information and stored them in memory.
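A minimal sketch of the descriptor-queue handshake between the stack and a software NIC emulator of this kind might look like the following. The structure layout, queue size, and function names are hypothetical, not the actual emulator code, and overflow handling is omitted.

```c
#include <stdint.h>
#include <string.h>

#define RING_SIZE     256            /* assumed number of RX descriptors */
#define NIC_BUF_BYTES 1514           /* one full Ethernet frame */

struct rx_descriptor {
    uint8_t  buffer[NIC_BUF_BYTES];  /* NIC buffer the emulator writes the frame into */
    uint16_t length;                 /* bytes placed in the buffer */
    uint8_t  done;                   /* set by the "NIC" when the frame is valid */
};

/* Shared RX queue used in lieu of interrupts: the emulator produces, the stack consumes. */
static struct rx_descriptor rx_ring[RING_SIZE];
static unsigned rx_head, rx_tail;    /* emulator writes at head, stack reads at tail */

/* Emulator side: turn one trace record into a received frame. */
void nic_emulator_deliver(const uint8_t *frame, uint16_t len)
{
    struct rx_descriptor *d = &rx_ring[rx_head];
    memcpy(d->buffer, frame, len);   /* stands in for the DMA copy */
    d->length = len;
    d->done = 1;
    rx_head = (rx_head + 1) % RING_SIZE;
}

/* Stack side: poll the shared queue instead of taking an interrupt. */
struct rx_descriptor *stack_poll_rx(void)
{
    struct rx_descriptor *d = &rx_ring[rx_tail];
    if (!d->done)
        return NULL;                 /* nothing new to process */
    rx_tail = (rx_tail + 1) % RING_SIZE;
    return d;                        /* caller parses headers, looks up the TCB, copies payload */
}
```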

In order to study the cache behavior of the TCP/IP data types, we modified the cache subsystem of the SimpleScalar simulator by adding a dedicated network cache, as illustrated in Figure 5. This allows us to study the locality of the TCP/IP data types as well as their size requirements. The simulator redirects references to memory addresses belonging to the particular TCP/IP data type under study to this cache; we were able to do this because we noted the addresses of the various TCP/IP data types and programmed the simulator accordingly.

[Figure 5. Network Cache Configuration: a dedicated network cache (the focus of this study) sits alongside the L1 instruction and data caches, in front of the L2 cache and main memory.]

4 CACHE/MEMORY BEHAVIOR OF TCP/IP

In this section, we discuss the locality behavior, cache size requirements, and the impact of the number of simultaneous connections for each of the TCP/IP data types. Before doing that, we show the breakdown of TCP/IP processing time for the three workloads studied in this paper. To measure this, we ran the network traces through the simulator and separated the memory access time from the total execution time. Figure 6 shows this data: the memory access time is about 60% for the transmit-intensive workloads and 85% for the receive-intensive workload. This is a significant amount of time and must be reduced for TCP/IP processing to scale efficiently.

[Figure 6. Compute vs. memory access time in TCP/IP: breakdown of processing time into compute and memory components for SPECweb99, TPC-W Web Server, and the RX Intense Workload.]

To address this problem, we need to understand which data types are used in TCP/IP processing and how they contribute to this memory access time. Figure 7 shows all the data types used in TCP/IP processing; we evaluate each of them in the rest of this section.

[Figure 7. Memory Accesses in TCP/IP RX Processing: the source and destination IP addresses and ports in the packet header (NIC buffer) select the TCP/IP control block (connection context fields, application buffer pointer); the NIC descriptor (status fields, packet header and data pointers) locates the NIC buffers; the packet data is then memory-copied from the NIC buffer into the application buffer.]

4.1 Descriptors

As explained before, NIC descriptors are data structures used by the TCP/IP stack to communicate with the NIC. Typically, the TCP/IP stack allocates some number of descriptors (256, for instance) for each direction and places them in a circular queue. The size of a descriptor is typically 64 bytes. When the NIC receives a packet, it updates fields in the descriptor with information related to the packet and then copies the descriptor and the packet into memory using Direct Memory Access (DMA). To maintain cache coherence, any cached copies of the descriptor and packet data are invalidated. When the stack is ready to process packets, it first reads the descriptor; this read results in a cache miss because the cache lines were invalidated earlier. In the case of transmit, the TCP/IP stack updates the descriptor fields and sets up the DMA engine to transfer headers and data to the NIC device, and upon transmitting the data, the NIC updates the descriptors. Similar to the receive side, the TCP/IP stack incurs one or more misses when accessing the descriptors. In summary, the descriptors show no temporal locality, and hence accesses to them result in cache misses.
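As a concrete illustration of the descriptor behavior discussed above, the sketch below shows one plausible 64-byte receive descriptor. The field names and layout are hypothetical, not those of any particular NIC.

```c
#include <stdint.h>

/* Illustrative 64-byte RX descriptor, sized to occupy exactly one cache line.
 * The NIC writes it via DMA, which invalidates any cached copy, so the
 * stack's next read of 'status' misses in the cache. */
struct nic_rx_descriptor {
    uint64_t header_buf_addr;   /* physical address of the header buffer  */
    uint64_t data_buf_addr;     /* physical address of the payload buffer */
    uint16_t packet_length;     /* bytes DMAed for this packet            */
    uint16_t checksum;          /* checksum computed by the NIC, if offloaded */
    uint8_t  status;            /* descriptor-done, error bits, etc.      */
    uint8_t  errors;
    uint16_t vlan_tag;
    uint8_t  reserved[40];      /* pad to 64 bytes (one cache line)       */
};

/* Compile-time check that the descriptor really is one cache line. */
typedef char rx_desc_is_64_bytes[(sizeof(struct nic_rx_descriptor) == 64) ? 1 : -1];
```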

4.2 TCP/IP Header

In the case of receive-side processing, the TCP/IP stack has to read header data that was copied into system memory by the NIC. A TCP/IP header can be anywhere from 40 to 128 bytes, depending on whether any TCP/IP option fields are included. As a result, accesses to the header fields result in one or more cache misses. Once the headers are processed, they either get evicted from the cache to make space for other data or get invalidated later when the NIC copies a new incoming packet header into the same memory buffer. This makes incoming TCP/IP header data non-temporal. In the case of transmit-side processing, when it is ready to transmit data, the stack creates the header fields and sets up the DMA engine to transfer header and payload data to the NIC. Since the processor is not involved in transferring the header data to the NIC, there is no memory access overhead on the processor when transmitting.

4.3 Payload

Payload represents the application data in a TCP/IP packet (i.e., the packet without the TCP/IP protocol headers). For a regular Ethernet frame size of 1514 bytes, the payload size without the header information is 1460 bytes. The NIC copies the payload data into a memory buffer provided by the stack. The stack then needs to copy this data either into an application-provided buffer, if one is available, or into an internal buffer until the application is ready to receive the data. Since the source buffer for the copy operation is in memory and the destination buffer may or may not be in the cache, several memory accesses are needed to complete the copy operation. Hence, the payload, just like the header and descriptor data, shows no temporal locality. In the case of transmit-side processing, when applications need to send data, they pass a data buffer to the TCP/IP stack. The stack may have to copy this data internally if there is a delay in sending it. Whether the data is copied internally or transferred to the NIC (using DMA) directly from the application buffer, it is accessed twice and therefore exhibits some degree of temporal locality.

4.4 TCP/IP Control Block (TCB)

The TCB is a data structure that the TCP/IP stack uses to store connection context information. The TCB size in our TCP/IP stack is 512 bytes. However, the stack typically touches only a portion of the TCB, depending on what is happening on that connection (transmitting or receiving data, sending acknowledgements, etc.). We expect the TCB data structure to have good cache locality, since packets belonging to the same connection access the same TCB. To understand the impact of the number of simultaneous connections on TCB accesses, we measured the TCB miss ratio for the server workloads at varying numbers of emulated browsers/connections, with 8 KB of cache space to store the TCBs. To accomplish this, we modified the simulator by adding a separate 8 KB cache for TCBs and redirected all TCB accesses to this cache. The graph in Figure 8 shows the simulation results. As expected, the TCB miss ratio increases as the number of simultaneous connections increases.

[Figure 8. TCB miss ratio vs. connections: TCB miss ratio (%) with an 8 KB TCB cache as the number of emulated browsers (TPC-W image server) grows from 100 to 20,000.]

To understand the TCB locality further, we measured the miss ratio of the TCBs as a function of the cache size (for a fixed number of simultaneous connections). Figure 9 shows the results from our simulation for the server workloads. It can be observed from the graph that the miss ratio decreases as the cache size is increased; however, the reduction in miss ratio slows down beyond 4 KB (for three of the four server workloads). This shows that the TCBs exhibit cache locality across multiple packets, and that this locality depends heavily on the number of back-to-back packets processed for a connection.

[Figure 9. TCB miss ratio at various cache sizes: miss ratio (%) vs. TCB cache size (0.5 KB to 64 KB) for SPECweb99, TPC-W Web Server, and the RX Intense Workload.]

4.5 Hash Node

Many server applications (such as web, ftp, and mail) typically handle thousands of simultaneous connections, each requiring a TCB data structure. In order to look up the TCB for a connection quickly, TCP/IP stacks typically employ hashing on the receive side; the transmit-side TCB is usually reached through a handle (e.g., a socket handle) that the application specifies when initiating the send operation. The hash key is calculated using the source and destination IP addresses and port numbers. There have been several studies [13] on how to achieve faster lookups. The FreeBSD stack, for instance, uses a hash table in which each entry (hash node) points to a linked list that stores the TCBs falling into the same hash bucket. To take advantage of back-to-back packet arrivals on the same connection, and to avoid traversing the linked list multiple times for the same TCB, the FreeBSD TCP/IP stack moves recently accessed TCBs to the front of the linked list. Figure 10 shows the hash node miss ratio as a function of the cache size (for a fixed number of connections); we measured this by simulating a dedicated cache for the hash nodes and modifying the SimpleScalar simulator to redirect any accesses to the hash nodes to this cache. As expected, the behavior looks similar to that of the TCB miss ratio shown in Figure 9.
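The move-to-front behavior described above can be sketched as follows. As before, this is an illustrative layout with assumed names, not the FreeBSD implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Same simplified TCB layout as in the earlier lookup sketch (hypothetical). */
struct tcb {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    struct tcb *next;            /* chains TCBs that hash to the same bucket */
};

/* Look up a connection in one hash bucket and, on a hit, move the TCB to
 * the head of the list so that a train of back-to-back packets on the same
 * connection finds it on the first comparison. */
struct tcb *tcb_lookup_mtf(struct tcb **bucket,
                           uint32_t sip, uint32_t dip,
                           uint16_t sport, uint16_t dport)
{
    struct tcb *prev = NULL;

    for (struct tcb *t = *bucket; t != NULL; prev = t, t = t->next) {
        if (t->src_ip == sip && t->dst_ip == dip &&
            t->src_port == sport && t->dst_port == dport) {
            if (prev != NULL) {          /* not already at the front */
                prev->next = t->next;    /* unlink from its current position */
                t->next = *bucket;       /* relink at the head of the bucket */
                *bucket = t;
            }
            return t;
        }
    }
    return NULL;                         /* not in this bucket */
}
```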

[Figure 10. Hash node locality: hash node miss rate (%) vs. cache size (0.5 KB to 64 KB) for the RX Intense Workload.]

4.6 TCP/IP Stack Variables

Apart from the data types described earlier, the stack also uses local variables (like any other program). These also have temporal locality, because the same code and local variables are used for each data packet. Since this data is relatively small and its size is independent of the number of simultaneously open connections, we did not perform any specific studies to quantify its locality.

In summary, of all the TCP/IP data types, only the TCBs, hash nodes, and local variables exhibit some degree of temporal locality (provided that packets for the same connection arrive before the TCB is evicted from the cache). NIC descriptors, headers, and payload, on the other hand, show no temporal locality. Figure 11 summarizes the effect of varying cache sizes on TCP/IP temporal data: the cache misses generated by this data, for a fixed number of connections, stabilize around 16 KB, and any further increase in cache size has negligible impact. Figure 12 summarizes the effect of varying cache sizes on TCP/IP non-temporal data: the cache misses per packet generated by this data stay constant (they do not decrease with increased cache size) because caches do not help non-temporal data.

[Figure 11. Impact of larger caches on TCP temporal data: misses per packet vs. cache size (1 KB to 256 KB) for SPECweb99, TPC-W Web Server, and the RX Intense Workload.]

[Figure 12. Impact of larger caches on TCP non-temporal data: misses per packet vs. cache size (0.5 KB to 16 KB) for SPECweb99, TPC-W Web Server, and the RX Intense Workload.]

5 REDUCING CACHE MISSES IN TCP/IP

In this section, we describe two novel techniques to reduce the number of cache misses caused by TCP/IP non-temporal data and show the benefits of these techniques. Each technique is best applied to certain TCP/IP data types.

5.1 Cache Region Locking with Auto-Updates

The TCP/IP stack reads and updates the NIC descriptors every time it receives or transmits a packet. Typically, TCP/IP stacks create a fixed number of descriptors (e.g., 256) and reuse them. In spite of this reuse, whenever the TCP/IP stack needs to access the descriptors, it ends up fetching them from system memory instead of the processor cache: the descriptors were previously updated by the NIC with information related to incoming packets, which invalidated any corresponding cache lines.

To reduce or eliminate these cache misses, one could use the software prefetch instruction available on most modern processors, but there are several problems with this approach. First, the prefetch instruction is implemented as a hint on some processors (e.g., x86 processors), so there is no guarantee that the data will actually be prefetched. Second, there is not enough processing to do between issuing the software prefetch and accessing the descriptor, which limits the benefit of the prefetch. Finally, current-generation NICs support interrupt coalescing to reduce the interrupt processing overhead on processors; as a result, multiple prefetch instructions are required to cover all the descriptors and headers, and executing multiple prefetches adds significant processing overhead (see Section 5.2.2 for details).

In order to prevent the cache misses related to the descriptors without adding to the execution time, we propose Cache Region Locking (CRL) with auto-updates. This technique is a novel extension of the generally known cache line locking. Cache line locking [22,23] has mainly been used in embedded applications to achieve deterministic behavior; we apply locking in the context of TCP/IP processing. CRL allows a contiguous memory region to be locked in the cache, monitors for any updates to the locked data, and performs an update action to keep the cached data current. While locking guarantees that the cache lines (memory addresses) always stay in the cache, the auto-update mechanism guarantees that the data in the locked lines is always up to date.

In the case of descriptors, the stack needs to lock the descriptor regions in the cache when the stack comes up and unlock them before the stack is unloaded. This scheme can easily be extended to the TCP/IP headers of incoming packets. The difference is that, unlike descriptors, the addresses used for headers keep changing; this requires the stack to lock a small region (64 bytes) every time it allocates a new buffer to receive the header portion of an incoming packet.

Since this technique locks cache regions, some space in the cache becomes inaccessible to other applications. Assuming 256 descriptors, with each descriptor and each TCP/IP header requiring one cache line, the total cache area that needs to be locked is 512 lines. This is a small area compared to the cache sizes in today's server platforms (e.g., 3% of a 1 MB cache).
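CRL is a proposed mechanism, so no real intrinsic exists for it today. The sketch below assumes hypothetical cache_region_lock()/cache_region_unlock() intrinsics and shows where a driver might invoke them for the descriptor ring and per-packet header buffers, following the usage described above; it compiles but cannot link or run on current hardware.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical intrinsics for the proposed CRL instruction: lock (and
 * auto-update) a contiguous region of memory in the cache, or release it.
 * These do not exist on current processors. */
extern void cache_region_lock(const void *addr, size_t bytes);
extern void cache_region_unlock(const void *addr, size_t bytes);

#define NUM_DESC   256
#define DESC_BYTES 64              /* one cache line per descriptor     */
#define HDR_REGION 64              /* one cache line per packet header  */

/* Descriptor ring; assumed to be cache-line aligned by the allocator. */
static uint8_t rx_descriptors[NUM_DESC * DESC_BYTES];

/* Driver load: pin the whole descriptor ring (256 lines) in the cache.
 * Auto-update keeps the locked lines current when the NIC DMAs into them. */
void driver_init(void)
{
    cache_region_lock(rx_descriptors, sizeof(rx_descriptors));
}

/* Each time a fresh header buffer is posted, lock its first cache line so
 * the arriving header is already in the cache when the stack reads it. */
void post_header_buffer(void *header_buf)
{
    cache_region_lock(header_buf, HDR_REGION);
}

/* Driver unload: release the locked region. */
void driver_exit(void)
{
    cache_region_unlock(rx_descriptors, sizeof(rx_descriptors));
}
```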

5.1.1 Performance Analysis

In order to measure the benefit of CRL, we added a new instruction for cache region locking. The simulator keeps track of all the locked regions in a table; when a cache line needs to be replaced, the cache controller consults this table to make sure the line is not locked. We also modified the TCP/IP stack to use this new instruction to lock descriptors and headers. This locking is done in the L2 cache of the simulator. The graph in Figure 13 shows the impact of CRL with auto-updates on TCP/IP performance, relative to the base (no locking) case at a 2 MB cache size. The x-axis shows the three workloads being studied. The primary y-axis shows the percentage reduction in cycles per packet (CPP), while the secondary y-axis shows the reduction in misses per packet (MPP). From the graph, it is clear that CRL improves the performance (CPP) of TCP/IP processing by 25% for SPECweb99, 15% for the TPC-W web server, and ~20% for the Receive-intensive workload. In the transmit-intensive workloads (SPECweb99 and TPC-W web server) there are fewer receive packets and they are small compared to those of the Receive-intensive workload, which is why the reduction in MPP is smaller for the transmit-intensive workloads.

[Figure 13. Effect of cache region locking: performance improvement (% reduction in clocks per packet) and reduction in misses per packet for SPECweb99, TPC-W Web Server, and the RX Intense Workload.]

5.1.2 Implementation Options

In order to enable locking support in the processor, one approach is to create a new locking memory type and tag loads/stores as locked to inform the cache about the nature of the memory access. The problem here is the auto-update of locked regions. It should be kept in mind that locking does not restrict the state of the cache line to any subset of the typical MESI states. As a result, an external invalidation to a locked line would leave the data no longer available to the CPU; with auto-updates, the line is updated instead of invalidated. Two approaches can be used to accomplish this. (1) Hybrid protocols: when a snoop invalidation is sent to a processor, the originator of the snoop may not know whether the line is locked. The snooped node can identify the nature of the memory address (locked versus unlocked) and send back this information in the snoop response; once the originator receives this response, it can update the cache line with the data it is writing. Update-based protocols are usually avoided because of the bandwidth overhead they impose, but in the case of networking, as we have seen in this section, it is desirable to update some of the incoming data (descriptors and headers) to speed up protocol processing. (2) Auto-fill: another approach is to generate a prefetch after a locked line is invalidated by another processor or device. This prefetch can be queued into the CPU's hardware prefetcher once a snoop invalidation is detected on a locked memory address. To reduce the impact of locking on other applications, locking should be limited to 1 or 2 ways in a cache set. This aspect, and the policies to deal with contention, are not covered here; we intend to study them in the near future.

5.2 Network-Aware Prefetching

We have applied CRL only to descriptors and headers, as they require a very small cache space to be locked (3% of a 1 MB cache). We still need to address the cache misses that result when the incoming payload data is copied into either a TCP/IP internal buffer or an application-provided buffer. In addition, accesses to the application buffer may also generate cache misses if the data has not been touched recently. As a result, this copy operation generates several memory accesses, causing significant CPU stalls. In order to reduce the number of memory accesses and speed up the copy operations in TCP/IP processing, we propose a new technique called Cache Region Prefetching (CRP). CRP is an extension of the single-cache-line prefetch instruction supported by many current-generation processors: it allows applications to prefetch a region of memory into the cache with one instruction. This requires the processor to forward prefetch requests to the hardware prefetcher, which then prefetches the data in the background. The reason we need a better software prefetching mechanism is that issuing multiple single-line prefetches consumes a lot of valuable processor resources, such as instruction queue, re-order buffer, memory order buffer entries, and load fill buffers. The efficiency of prefetch instruction execution varies from processor to processor, but we measured that 23 prefetch instructions took roughly 1000 processor clocks on an Intel Pentium M processor (prefetch instructions are squashed on this processor, as opposed to an Intel Xeon processor). Given that memory speeds are improving slowly relative to processor speeds, this number is not expected to drop significantly. We can further enhance the CRP instruction by adding hints about the data being prefetched. One such hint tells the prefetcher not to check the L1 and L2 caches and to read the data directly from memory; this is applicable to the source buffer in TCP/IP copy operations, where the source is incoming packet data that resides in memory. Another hint tells the prefetcher whether the prefetched data will be modified immediately; this allows the prefetcher to fetch the data directly in "Exclusive" mode rather than getting it in "Shared" mode first and then obtaining "Exclusive" access. The implementation details, ramifications, and applicability of these additional optimizations are not studied in this paper. A lot of research has been done on various methods of prefetching: some focused on hardware-based prefetching mechanisms [20], some on compiler-generated prefetches [19,21], and some on hybrid approaches [18]. From this past work it is clear that region (multi-line or block) prefetching has been considered in hardware and compiler-guided prefetching schemes. The difference here, in addition to applying it in the context of TCP/IP processing, is that we create a region prefetch instruction and let the programmer decide when to use it. Since programmers have knowledge of their own programs, the issue of prefetching unnecessary data does not arise.
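Like CRL, CRP is a proposed instruction, so the sketch below assumes a hypothetical cache_region_prefetch() intrinsic with the two hints discussed above (read the source straight from memory, fetch the destination in Exclusive state); the names and enum values are invented for illustration. For a full 1460-byte payload, one such call per buffer would replace roughly 23 single-line prefetches (1460 bytes / 64-byte lines) on each side of the copy.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hints and intrinsic for the proposed CRP instruction;
 * nothing like this exists in current ISAs. */
enum crp_hint {
    CRP_HINT_NONE      = 0,
    CRP_HINT_FROM_MEM  = 1 << 0,   /* skip L1/L2 lookup, read straight from memory   */
    CRP_HINT_EXCLUSIVE = 1 << 1,   /* fetch lines in Exclusive state (will be written) */
};

extern void cache_region_prefetch(const void *addr, size_t bytes, int hints);

/* Receive-side payload copy with region prefetches issued as soon as the
 * source (NIC buffer) and destination (application buffer) are known. */
void copy_payload(void *app_buf, const void *nic_buf, size_t len)
{
    /* Incoming payload is known to be in memory, not in the caches. */
    cache_region_prefetch(nic_buf, len, CRP_HINT_FROM_MEM);

    /* Destination will be overwritten immediately, so ask for Exclusive. */
    cache_region_prefetch(app_buf, len, CRP_HINT_EXCLUSIVE);

    /* ... header processing and TCB updates would overlap with the
     * background prefetch here ... */

    memcpy(app_buf, nic_buf, len);
}
```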

5.2.1 Performance Impact

In this section, we quantify the benefits of the CRP technique. We added to the simulator a new prefetch instruction that lets software prefetch a region of memory into the L2 cache. This instruction takes a starting address and a number of bytes to prefetch. It is implemented such that a prefetch hint is passed to the prefetching unit (which we added to the simulator), upon which the instruction is retired; the prefetching unit then prefetches the data in the background. We modified the TCP/IP stack to issue this region prefetch instruction for both the source and destination buffers of the copy operation as soon as their addresses are known. The graph in Figure 14 shows the benefits of CRP over the base (no prefetch) case at a 2 MB cache size. The x-axis shows the three workloads we are studying. The primary y-axis shows the performance improvement as measured by the percentage reduction in CPP, and the secondary y-axis shows the reduction in MPP. As expected, the benefit of CRP is largest (50%) for the Receive-intensive workload, as it has more incoming packets and the average size of each packet is around 900 bytes. For the same workload, the number of cache misses is reduced by about 14 (from 30 in the base case). We also observed that CRP saves about 300 cycles per packet (15% of the total processing) compared to multiple single-line prefetches.

[Figure 14. Effect of cache region prefetching: performance improvement (% reduction in clocks per packet) and reduction in misses per packet for SPECweb99, TPC-W Web Server, and the RX Intense Workload.]

5.2.2 Implementation Options

The multi-line prefetch can be accomplished using a prefetch instruction that encodes the size of the prefetch region. When this instruction is executed, the required information, such as the starting address of the region, the number of bytes to be prefetched, and any hints, is extracted and passed to the hardware prefetcher at the L2 cache. The instruction is retired at that point, and the hardware prefetcher starts prefetching data into the L2 cache in the background. Current hardware prefetchers do not have the capability to receive prefetch requests from software, so this functionality needs to be added to them.

5.3 Combined Effect of CRL and CRP

Now that we have seen the benefits of CRL and CRP individually, we look at the combined impact of these two techniques on TCP/IP processing. The graph in Figure 15 shows this data for the Receive-intensive workload at a 2 MB L2 cache size. The different techniques we compare are plotted on the x-axis; the percentage reduction in CPP is plotted on the primary y-axis, and the secondary y-axis shows the percentage improvement in network throughput. As expected, the combined effect of CRL and CRP is the highest, at about 65% savings in processor cycles over the base case. Another way to look at the data is in terms of maximum achievable network throughput: throughput improves by 220% over the base case when both techniques are used.

[Figure 15. Effect of CRL, CRP and CRL+CRP on the RX Intense Workload: performance improvement (% reduction in clocks per packet) and throughput improvement for the Base, Lock, Prefetch, and Prefetch + Lock configurations.]

6 CONCLUSIONS AND FUTURE WORK

In this paper, we studied the cache behavior exhibited by the various types of TCP/IP data. We showed that descriptors, headers, and application data do not exhibit temporal locality and therefore do not benefit from caching. On the other hand, TCBs and hash nodes show temporal locality that depends on the number of back-to-back packets over the same connection in server workloads.

We then proposed Cache Region Locking (CRL) with an auto-update technique to address the memory accesses caused by TCP/IP descriptor and header data, and showed that it can completely eliminate the memory accesses caused by these data types; CRL alone yielded a 15% reduction in CPU cycles per packet. Next, to address the memory accesses caused by the TCP/IP copy operation, in which the incoming payload data is copied, we proposed the Cache Region Prefetching (CRP) technique; it reduced the number of misses generated by the copy operation from 30 to 14. Finally, we showed the cumulative benefit of these two techniques, a ~65% reduction in cycle count per packet, which amounts to a 220% increase in network throughput.

Notices: Intel and Pentium are registered trademarks of Intel Corporation. * Other names or brands may be claimed as the property of other parties.

REFERENCES

[1] J. Chase et al., "End System Optimizations for High-Speed TCP", IEEE Communications, Special Issue on High-Speed TCP, June 2000.
[2] D. Clark et al., "An Analysis of TCP Processing Overhead", IEEE Communications, June 1989.
[3] A. Earls, "TCP Offload Engines Finally Arrive", Storage Magazine, March 2002.
[4] A. Foong et al., "TCP Performance Analysis Re-visited", IEEE International Symposium on Performance Analysis of Systems and Software, March 2003.
[5] K. Kant, "TCP Offload Performance for Front-End Servers", to appear in Globecom, San Francisco, 2003.
[6] J. B. Postel, "Transmission Control Protocol", RFC 793, Information Sciences Institute, Sept. 1981.
[7] J. Nagle, "Congestion Control in IP/TCP Internetworks", RFC 896, FACC, January 1984.
[8] V. Jacobson et al., "TCP Extensions for High Performance", RFC 1323, LBL, ISI and Cray Research, May 1992.
[9] M. Rangarajan et al., "TCP Servers: Offloading TCP/IP Processing in Internet Servers. Design, Implementation, and Performance", Rutgers University, Dept. of Computer Science Technical Report DCS-TR-481, March 2002.
[10] G. Regnier et al., "ETA: Experience with an Intel Xeon Processor as a Packet Processing Engine", Symposium on High Performance Interconnects (Hot Interconnects), 2003.
[11] S. Makineni and R. Iyer, "Performance Characterization of TCP/IP Processing in Commercial Server Workloads", 6th IEEE Workshop on Workload Characterization (WWC-6), Oct. 2003.
[12] S. Makineni and R. Iyer, "Architectural Characterization of TCP/IP Packet Processing on the Pentium M Microprocessor", 10th International Symposium on High Performance Computer Architecture (HPCA-10), Feb. 2004.
[13] Finisar Systems, http://www.finisar.com/
[14] SimpleScalar LLC, http://www.simplescalar.com
[15] "SPECweb99 Design Document", available online at http://www.specbench.org/osg/web99/docs/whitepaper.html
[16] "TPC-W Design Document", available online on the TPC website at www.tpc.org/tpcw/
[17] J. Mogul, "TCP Offload Is a Dumb Idea Whose Time Has Come", Workshop on Hot Topics in Operating Systems (HotOS), 2003.
[18] Z. Wang et al., "Guided Region Prefetching: A Cooperative Hardware/Software Approach", ISCA, 2003.
[19] T. C. Mowry, A. K. Demke and O. Krieger, "Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications", Operating Systems Design and Implementation (OSDI), October 1996.
[20] J.-L. Baer and T.-F. Chen, "An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty", Proceedings of Supercomputing '91, 1991.
[21] T. Mowry and A. Gupta, "Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors", Journal of Parallel and Distributed Computing, 12(2):87-106, 1991.
[22] X. Vera et al., "Data Cache Locking for Higher Program Predictability", SIGMETRICS, 2003.
[23] M. Campoy, A. P. Ivars and J. V. Busquets-Mataix, "Static Use of Locking Caches in Multitask Preemptive Real-Time Systems", Proceedings of the IEEE/IEE Real-Time Embedded Systems Workshop, 2001.
