Connection-less TCP

Patricia Gilfeather
Scalable Systems Lab
Department of Computer Science
University of New Mexico
[email protected]

Arthur B. Maccabe
Scalable Systems Lab
Department of Computer Science
University of New Mexico
[email protected]

Supported by the Los Alamos Computer Science Institute SC R71700H-29200001 and the Albuquerque High Performance Computing Center through IBM SUR.
Abstract

TCP is an important protocol in high-performance computing. It is used extensively in graphics programs and file systems and it is often the protocol used for the cluster control mechanism. As the breadth of applications increases, the need for a scalable and efficient implementation of TCP becomes more important. In addition to other bottlenecks that must be alleviated, TCP connection management must be made scalable. This becomes critical as we consider offloading TCP processing onto TCP offload engines (TOEs) or intelligent network interface cards (iNICs). In this paper, we show how to take advantage of special characteristics of the high-performance computing environment and apply existing operating system mechanisms in a unique way to address some of the scalability concerns in offloaded TCP. Specifically, we implement methods for activating and deactivating TCP connections. These allow us to maintain a large store of open TCP connections without a large amount of storage overhead.
1. Introduction Clusters are getting bigger. In clusters of hundreds of thousands of nodes, resource management of communication will become more critical. One of the aspects of this scalability bottleneck is the amount of memory necessary to maintain communication. This problem is severe when we consider offloading protocol processing onto different architectures like iNICs, TOEs and processors in memory (PIMs). TCP/IP implementations and applications are widespread and therefore inherently appealing for use in cluster computing. Additionally, if TCP/IP can be made
competitive with respect to performance attributes like latency and overhead, message-passing libraries like MPI can be implemented over TCP and cluster administrators can maintain fewer protocols. Finally, TCP/IP is well-maintained, well-tested, well-understood and highly interoperable. One way to make TCP competitive with respect to latency and overhead for large clusters is to offload some protocol processing. However, offload engines can become very expensive. If a cluster designer wants to create a competitive large cluster using a TCP offload engine, the amount of memory devoted to connection management must remain small. Ideally, large-scale systems will only need to provide resources for a small working set of active connections. This paper describes mechanisms to facilitate the working-set method of managing TCP connections. Our goal is to lessen the overhead of inactive connections by deactivating the heavy-weight socket and replacing it with a placeholder that allows the connection to be reactivated when it is needed. This decreases the amount of memory needed to maintain connections and facilitates offloading TCP communication processing because only a small working set of active communications is fully instantiated at any time. One of the advantages of working with commodity protocols is that the Linux implementation of TCP has data structures and methods that can be leveraged to accomplish deactivation and reactivation of connections. Originally, minisocks and open requests were created to decrease resource usage for connections in the time-wait state of TCP and for connections in the process of being created. The former is essential for large-scale web servers in order to maintain protocol correctness during the connection shutdown process. The latter is used to survive denial-of-service attacks. In this paper, we show how to modify these existing data structures and methods to create the deactivation and reactivation methods that drastically decrease the memory requirements for TCP in large-scale clusters. This is accomplished by creating a small working set of active connections.
The first part of this paper reviews TCP connection management and offloading TCP. Next, we measure the resource constraints associated with TCP offload. Then, we outline the modifications we implemented to deactivate and reactivate sockets and present the results. Finally, we discuss future plans for further addressing TCP/IP bottlenecks.
2. TCP Working Sets

One problem with TCP is the question of connection state. Applications or libraries must make decisions about how to best allocate and maintain TCP connections. There are three options: 1) open a TCP connection when it is needed and close it again after the message is sent; 2) open a TCP connection when it is needed and keep the connection open in case it is needed again; 3) open all possible TCP connections at application launch and use them as needed. There are inefficiencies associated with each of the three methods for handling TCP connection state. Option one, opening and closing a connection with each message, is the most inefficient. The cost of opening a connection is incurred with each message which increases latency too much. Option two is often used now. However, as we move either into larger clusters or onto NICs or PIMs that do not have large memories, this method will suffer from resource overcommitment pressure. Option three, opening all connections at application load, causes application startup to be drastically slower. Also, connections may be created that are never used. We will review the process of opening and closing a connection in TCP in order to more fully understand the costs associated with each of the three methods for handling TCP connection state discussed above. Furthermore, this review will provide background needed to explain the methods and data structures we used to deactivate and reactivate connections.
2.1. TCP Connection Establishment

When a sender wants to send data to a receiver, the sender must establish a TCP connection with the receiver. As illustrated in Figure 1, TCP connection establishment occurs in three steps. Data is not allowed to be sent with the SYN message or with the SYN-ACK message. This means that the cost for connection startup between two hosts is a full round-trip time. Clearly, the round-trip cost of connection startup should be paid only once, as it is a high-latency operation. A security concern associated with opening a connection, the SYN flood, was the motivation behind the open request data structure used in Linux to hold the place of a potential fully-instantiated connection. We discuss this in detail below, as well as how we modified it to create reactivation of sockets.
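To make the round-trip cost concrete, the sketch below times a blocking connect() from user space. The peer address and port are placeholders; the program only illustrates that connect() returns after the SYN/SYN-ACK exchange, so the measured interval is roughly one round trip.

/* Sketch: measure TCP connection-establishment latency from user space.
 * The peer address and port are placeholders for illustration. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in peer;
    struct timeval t0, t1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                        /* placeholder port */
    inet_pton(AF_INET, "192.168.1.2", &peer.sin_addr);  /* placeholder host */

    gettimeofday(&t0, NULL);
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        return 1;
    gettimeofday(&t1, NULL);   /* connect() returns after the SYN/SYN-ACK
                                  exchange, so this interval is roughly one RTT */
    printf("connect: %ld usec\n",
           (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
    close(fd);
    return 0;
}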
Figure 1. Three-way handshake of TCP connection startup: the source sends SYN (seq X), the destination replies with SYN (seq Y) and ACK (X+1), and the source completes the handshake with ACK (Y+1), which carries data.
Figure 2. Close of a TCP connection: the source sends FIN (seq U) and enters FIN_WAIT_1; the destination ACKs (U+1) and enters CLOSE_WAIT while the source moves to FIN_WAIT_2; the destination sends its own FIN (seq V); the source ACKs (V+1) and enters TIME_WAIT while the destination closes.
2.2. TCP Connection Close

Figure 2 shows a TCP connection close. Source sends a message with the FIN flag set in the TCP header. The connection on Source must remain in the TIME_WAIT state until there is no possibility that a message intended for this connection can be received. This wait is called the 2MSL wait and is equal to twice the maximum segment lifetime (MSL) of a segment. The MSL is implementation specific and can be as short as 30 seconds or as long as about 2 minutes; the MSL of Linux is 30 seconds. Because the activator of a close must remain in the time-wait state for at least 30 seconds, clusters must account for this. Linux uses minisocks as placeholders to alleviate the memory costs associated with the time-wait state. We explain these in detail below, as well as how we modify minisocks to create deactivated sockets.
2.3. Connection Working Sets

Ideally, an application or library would be able to maintain a working set of currently active TCP connections. For example, after a period of inactivity, a connection between two hosts is deactivated and a small amount of state is stored. When the host sends a message on the inactive connection, that connection is reactivated into a fully instantiated socket without paying the cost of a TCP three-way handshake. Connection working sets are especially powerful in environments with limited memory, where the amount of memory available can significantly limit the number of active connections. Regardless of how connections are opened (at the beginning of an application or on first use) and regardless of how long they live, there will only be a small set of fully instantiated, active TCP connections. If the working set is small, the amount of memory needed to maintain the state of each connection will be manageable for TCP implementations that are offloaded onto commodity NICs or TOEs with a limited amount of memory.
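From the application's side, the working-set idea can be pictured as in the sketch below. The deactivate() wrapper is hypothetical (our kernel modification exposes deactivation through a new system call, described in Section 4.3, but the user-space binding shown here is only an illustration); reactivation is assumed to happen transparently when traffic flows on the connection again.

/* Sketch: explicit working-set management of TCP connections.
 * deactivate() is a hypothetical wrapper around the deactivation
 * system call described in Section 4.3; reactivation happens
 * transparently when traffic arrives for an inactive connection. */
#include <sys/socket.h>
#include <unistd.h>

extern int deactivate(int sockfd);   /* assumed wrapper, not a standard call */

void send_to_peer(int fd, const void *buf, size_t len)
{
    send(fd, buf, len, 0);           /* an inactive connection is reactivated
                                        by the stack when data flows again */
}

void idle_peer(int fd)
{
    deactivate(fd);                  /* keep only a placeholder for this
                                        connection; memory drops from a full
                                        socket to a small time-wait bucket */
}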
3. Offloading TCP

There is a great deal of work being done to offload TCP either onto iNICs or TOEs. The most well-known of these efforts is the work of Chase et al. on offloaded TCP [5]. Our research has shown that TCP latency can be reduced by as much as 40% when parts of the TCP stack are offloaded [6]. Also, offloading all or part of the TCP stack decreases overhead associated with communication processing. This decrease comes from reduced memory-copy overhead [5], reduced interrupt pressure [8, 3], and from offloading the communication progress thread, which makes an event-driven model easy to implement [9]. Figure 3 shows the number of open connections possible as the number of 4KB pages of memory allocated for the TCP stack decreases. As memory becomes more limited, the number of possible active connections is reduced. We measured the growth of memory with respect to the number of open connections for the Linux 2.4.25 TCP stack. First, we limited the memory associated with TCP connections by modifying the tcp_mem proc filesystem file. This interface allows us to place a maximum memory limit, in terms of pages, on the TCP stack. We measured the memory overhead associated with a fully instantiated socket without attached data buffers. Protocol offloading exacerbates these issues by moving connection management onto a memory-limited resource. Current iNICs have about 2MB of memory [1]. While there are iNICs now that have up to 4GB of memory, we are interested in the commodity NIC market, where memory will continue to be a constrained resource. Figure 3 shows that if there were no other firmware, and no data buffers, the maximum number of offloaded TCP connections on a 2MB (512 4KB pages) iNIC would be less than 1000. In fact, Linux will not run with TCP stacks of less than about 3000 pages. Offloading TCP onto commodity NICs using traditional stacks is not scalable.
Figure 3. Active connections versus available memory: number of open connections as a function of the size of tcp_mem.
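For reference, the tcp_mem limit used in this measurement can be set through the proc filesystem; it takes three page counts (low, pressure, max). The sketch below shows one way such a limit might be imposed, using 512 pages to mimic the 2MB iNIC scenario above; the helper name is our own.

/* Sketch: cap the memory available to the TCP stack via /proc.
 * tcp_mem takes three page counts: low, pressure, and max. */
#include <stdio.h>

static int set_tcp_mem(long low, long pressure, long max)
{
    FILE *f = fopen("/proc/sys/net/ipv4/tcp_mem", "w");
    if (!f)
        return -1;
    fprintf(f, "%ld %ld %ld\n", low, pressure, max);
    fclose(f);
    return 0;
}

int main(void)
{
    /* 512 pages of 4KB = 2MB, the memory of a typical commodity iNIC */
    return set_tcp_mem(512, 512, 512);
}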
4. Connection-less TCP

The WAN world has been able to reduce the resource pressure associated with connection management by decreasing the amount of memory needed for a TCP connection during startup and during tear-down. Can we use these techniques to reduce the memory footprint for a socket during its lifetime?
4.1. Characteristics of High-Performance Networks

There has been a great deal of research done in the WAN world on decreasing the cost of connection setup in TCP. HTTP 1.1 made some web-serving traffic connection-oriented by default. Also, methods of caching and sharing connection information in order to avoid connection startup costs have been proposed [2, 4, 10, 13, 14]. All of this research concentrates on working around the inconsistencies of large-scale heterogeneous networks. Specifically, routes change, so RTT estimates cannot be cached for long, and congestion is highly variable, so the congestion window must be regularly recalculated. Large-scale clusters, however, are generally homogeneous with static routing. This allows us to reuse routes and RTT estimations. Because we are in a high-performance networking environment, we are able to move sockets into and out of an inactive state without paying a performance penalty due to stale flow-control information. The result is a connection-less version of TCP.
4.2. Connection Management in Linux TCP

Because we are working with a commodity protocol in a commodity operating system, it is important that we show that we can reuse methods and data structures created by the community-at-large. We use data structures created to protect servers against time-wait loading to deactivate connections and maintain a placeholder, and we use methods created to protect servers against denial-of-service attacks to reactivate the connections. We introduce the mechanisms used by the Linux implementation of TCP to protect resources in two common scenarios for high-traffic web servers: a denial-of-service attack and time-wait loading. Then we explain the modifications we made to protect resources in a scenario common to high-performance clusters: offloading.

4.2.1. Denial of Service

One of the most common denial-of-service attacks on the Web is a SYN-flooding attack. In this attack, a source floods a destination with SYN messages. The destination opens thousands of connections and sends SYN-ACK messages back. Since the SYN-ACK messages are ignored, thousands of half-open connections must time out. Any definitive protection against SYN-flooding must occur at the routers, since some resources will always need to be allocated at the server when handling an open connection request [11]. However, the most successful defense against a denial-of-service attack is still to simply survive it. This is the reasoning behind one of the earliest defenses the Linux stack implemented, the open request data structure. When the Linux TCP stack receives a SYN request, instead of opening an entire socket, the stack creates a smaller data structure with just enough information to eventually open a large socket if the three-way handshake completes. This smaller data structure is called an open request. Open requests are held in a separate hash table, so during a SYN-flood the hash table of established connections remains small. This allows legitimate connections to remain open as long as they don't time out during the flood. Additionally, the size of an open request is approximately 64 bytes whereas a fully-instantiated socket is 832 bytes. The memory savings are significant. Syncookies are a further defense against SYN flooding. They reduce the need to even create the open request data structure. A web server generates a cookie based on the IP address, port number, write sequence number and MSS of the client and uses the cookie as the sequence number in the acknowledgment of the SYN. Upon receipt of the final ACK, the server creates the open request data structure from the listening socket, the cookie and the final acknowledgment. With the listening socket, the open request structure, the incoming message, and the route table entry, a new socket can be created.
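The cookie computation sketched below is purely illustrative (a stand-in hash rather than the actual Linux syncookie algorithm), but it shows the idea of folding the parameters listed above into the server's initial sequence number so that no per-connection state is needed until the handshake completes.

/* Illustrative sketch of the syncookie idea: derive the SYN-ACK sequence
 * number from connection parameters so no per-connection state is kept
 * until the final ACK arrives.  The hash is a stand-in, not the real
 * Linux algorithm. */
#include <stdint.h>

static uint32_t mix(uint32_t x)          /* toy hash for illustration */
{
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    x ^= x >> 16;
    return x;
}

uint32_t make_cookie(uint32_t saddr, uint32_t daddr,
                     uint16_t sport, uint16_t dport,
                     uint32_t client_isn, uint8_t mss_index)
{
    uint32_t h = mix(saddr ^ mix(daddr ^ ((uint32_t)sport << 16 | dport)));
    /* fold in the client's sequence number; the low bits carry an index
       into a small table of MSS values so it can be recovered later */
    return (mix(h ^ client_isn) & ~7U) | (mss_index & 7U);
}

/* On the final ACK, the server recomputes the cookie from the packet's
   addresses, ports and sequence numbers and compares it with (ack_seq - 1);
   if it matches, an open request is rebuilt and the socket is created. */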
4.2.2. Time-Wait Loading

Originally, clients were expected to actively close connections. It was assumed that small clients with few resource constraints would do most of the waiting in the time-wait state [12]. However, because of the still widely-used HTTP 1.0 protocol (which is message oriented), web servers actually do most of the active closes. In the HTTP 1.0 protocol, the client issues a GET message, the server sends the requested information, and that is the end of the interaction. The server is forced to close the connection and therefore must provide large amounts of resources for maintaining connections in time-wait [7]. In order to lighten the memory load for time-wait on high-traffic web servers, minisocks were introduced into the Linux 2.4.0 TCP stack. The data structure associated with minisocks is called a tcp_tw_bucket. Because it is supposedly possible to move from a closing state back to an established state, the original large socket is not destroyed as long as there is a context to it. If an application closes or dies without closing its sockets, then the sockets are orphaned. Orphaned sockets must also go through the time-wait state, so tcp_tw_buckets must be created and ports must remain bound, but because there is no context with which to re-establish the connection, the large socket can be destroyed. Additionally, tcp_tw_buckets are hashed into a separate hash table. This is ostensibly done to keep the established-connection hash table from growing very large during time-wait loading, since it was assumed that a small established-connection hash table would keep demultiplexing latency low. Generally, web servers use an individual thread to service each request. When the thread dies, the large socket becomes orphaned and only the tcp_tw_bucket remains. In this way high-traffic web servers are able to avoid the heavy load of fully-instantiated sockets being held during time-wait. Again, the savings are substantial: 96 bytes for a tcp_tw_bucket versus 832 bytes for a full socket.
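To see why the placeholder is so much smaller, consider the information a time-wait socket actually needs: the connection identity for demultiplexing and enough sequence and timestamp state to answer late segments. The layout below is a simplified illustration, not the exact kernel definition of tcp_tw_bucket.

/* Illustrative, simplified layout of a time-wait placeholder.  The real
 * tcp_tw_bucket in Linux 2.4 holds roughly this information in about
 * 96 bytes, versus about 832 bytes for a fully instantiated socket. */
#include <stdint.h>

struct tw_placeholder {
    uint32_t saddr, daddr;      /* connection identity for demultiplexing */
    uint16_t sport, dport;
    uint32_t snd_nxt, rcv_nxt;  /* sequence state needed to ACK correctly */
    uint32_t ts_recent;         /* last timestamp seen                    */
    uint8_t  substate;          /* FIN_WAIT_2 or TIME_WAIT                */
    /* hash-table linkage and a 2MSL timer reference complete the picture */
};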
4.3. Deactivating and Reactivating Connections

The methods and data structures associated with tcp_tw_buckets and syncookies were created by the general networking community to address memory-constrained situations in the normal TCP stack. We can leverage these methods and data structures to address similar memory constraints in high-performance computing.

4.3.1. Deactivation

Deactivating a connection simply means putting a connection into the time-wait state. There are, however, some additions that must be made to the time-wait bucket structure. These additions are necessary in order to reconstitute the route information when the connection is reactivated. The additional overhead is: 4 bytes for the connection flags, 20 bytes for the IP options, and 4 bytes for the pointer to the route table entry. Additionally, we added a new state allowed for the tw_substate field. The INACTIVE state is set so that we may later determine that a new message is valid for this time-wait bucket. We attempted to reuse the tcp_time_wait call to create the modified time-wait bucket, but were unable to do so because we needed to initialize the extra data fields and because we needed to move the socket into the closed state so that the process of reclaiming memory on orphaned time-wait sockets would proceed directly rather than waiting until the thread is closed. Because we determined that latency will not be affected by the size of the established-connection hash table, we chose not to move the time-wait bucket into the time-wait hash. This saves us the cost of rehashing the connection on reactivation and will not cost us extra latency during demultiplexing. Finally, we wrap the tcp_deactivate call in a system call, tcp_deactivate_connection. This system call resolves the socket descriptor to the socket and inode structures, which are the wrappers for the sock data structure that has been replaced by the time-wait bucket. Another implementation could allow the kernel to decide when to deactivate a connection, but we feel that the application or library using TCP may have better information about deactivation. Therefore, the differences between tcp_time_wait and tcp_deactivate are: the close on the fully-instantiated socket that removes the context from the sock data structure and allows the tcp_done method to free the socket memory; the initialization of the added fields discussed above; and the removal of the hashing and scheduling of timers.

4.3.2. Reactivation

We leverage the methods used by the syncookies mechanism to reactivate a connection upon receipt of a message. First, a message follows the path of a message bound for a socket in the time-wait state. We added a check on the tw_substate field at the beginning
of the tcp_timewait_state_process function. If the connection is inactive, TCP_TW_REACTIVATE is returned to the main receive process. We then pass the time-wait bucket and the incoming message to a process modeled after the cookie_v4_check method. An open request data structure is created. The open request, the pointer to the route table entry, the time-wait bucket and the incoming message are all passed to the tcp_v4_syn_recv_sock method. We cannot fully reuse this method because we are not rehashing the newly constituted socket and because the call to create the child socket is different; otherwise, this method is identical. The call that allocates the socket from the time-wait bucket is tcp_create_timewait_child. It is modeled after the tcp_create_openreq_child method. There are some substantial differences between these two methods. The tcp_create_openreq_child method copies all data and initializes the new socket with data from the listening socket; we do not have that socket. Instead, we initialize the newly allocated socket from a mixture of time-wait bucket and static data.
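Putting the pieces of Section 4.3 together, a deactivated connection can be pictured as a time-wait bucket extended with just enough state to rebuild a full socket. The sketch below summarizes our modifications; the structure layout, constant value and comments are approximations rather than verbatim kernel code.

/* Sketch of the deactivated-connection placeholder described in 4.3.1.
 * Names and layout approximate our modified stack; sizes assume 32-bit. */
#define TCP_TW_INACTIVE  3          /* new tw_substate value (illustrative) */

struct inactive_bucket {
    /* ... the ordinary tcp_tw_bucket fields (identity, sequence state) ... */
    unsigned int   conn_flags;      /*  4 bytes: saved connection flags      */
    unsigned char  ip_options[20];  /* 20 bytes: saved IP options            */
    void          *route_entry;     /*  4 bytes: pointer to route table entry */
    int            tw_substate;     /* set to TCP_TW_INACTIVE on deactivation */
};

/* Deactivation: close the full socket so tcp_done() can free it, keep
 * only the placeholder, and skip the time-wait hashing and timers.       */
/* Reactivation: on an incoming segment, tcp_timewait_state_process()
 * returns TCP_TW_REACTIVATE, an open request is built from the bucket
 * (as cookie_v4_check would), and tcp_create_timewait_child() allocates
 * the new socket from the bucket plus static defaults.                   */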
5. Results

All measurements were made using a modified version of Linux on one host and an unmodified version of the same kernel on the other host. The tests were performed on 993MHz Pentium IIIs with 1Gb AceNIC Ethernet cards connected back-to-back with a crossover cable. The server code simply listens on a well-known port and accepts requests as they are received.
5.1. Memory Usage

The client code (running on the modified Linux 2.4.25) first reads /proc/slabinfo in order to get a baseline measurement of cache usage. In the control test, the client code loops, creating a socket, connecting it and reading /proc/slabinfo. During the deactivated run, the client loops, creating a socket, connecting it, deactivating it and reading /proc/slabinfo. The memory measurements are produced by multiplying the number of tcp_tw_buckets by the size of a tcp_tw_bucket and adding the product of the number of sockets and their size; the initial values are then subtracted to get a relative memory-used-per-socket measurement. In addition to showing memory usage per socket for regular sockets (called connected sockets), we created a system call that either simply puts a socket in time-wait or orphans a socket and puts it in time-wait. Figure 4 shows memory use for connected sockets, sockets in the time-wait state, orphaned sockets in the time-wait state, and deactivated sockets.
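The per-socket numbers in Figure 4 follow directly from the slab counts described above. The helper below sketches that arithmetic; in the real measurement the object counts and sizes are parsed from /proc/slabinfo.

/* Sketch of the memory accounting used for Figure 4:
 *   used = n_tw_buckets * sizeof(tcp_tw_bucket)
 *        + n_sockets    * sizeof(full socket)
 * reported relative to a baseline taken before the run. */
struct slab_snapshot {
    long n_tw_buckets;   /* active objects in the tcp_tw_bucket cache */
    long n_sockets;      /* active objects in the TCP socket cache    */
    long tw_size;        /* object sizes as reported by /proc/slabinfo */
    long sock_size;
};

long tcp_mem_used(const struct slab_snapshot *now,
                  const struct slab_snapshot *baseline)
{
    long cur  = now->n_tw_buckets * now->tw_size
              + now->n_sockets    * now->sock_size;
    long base = baseline->n_tw_buckets * baseline->tw_size
              + baseline->n_sockets    * baseline->sock_size;
    return cur - base;   /* memory attributable to the test's sockets */
}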
Figure 4. Memory used as a function of the number of active sockets, for Connected, Time Wait, Time Wait Orphan, and Deactivated sockets.

Figure 5. Demultiplexing latency of an 8 byte message (average latency in usec versus number of connections) for the Open, Timewait, and Timewait/Orphan configurations.
5.2. Demultiplexing Latency

Minisocks were originally introduced into the Linux kernel in order to move sockets that were in the time-wait state out of the main hash table. The idea was that a smaller hash table would decrease the time it took to demultiplex established connections. The established hash table and the time-wait hash table occupy the top half and the bottom half of the connection hash table. We wanted to test the hypothesis that the smaller hash table really does decrease demultiplexing latency. To test the latency of a connection as a function of the number of active connections, we first opened x connections and measured the ping-pong latency of the first connection made. This measures the demultiplexing speed of the active-connection hash table; the time-wait portion of the hash table is empty. Second, we measured the ping-pong latency of the first connection made when all other connections were moved into the time-wait state as soon as they were opened. This measures the demultiplexing speed when the time-wait hash table is full. Finally, we measured the ping-pong latency of the first connection when all other connections are moved into time-wait and orphaned. This measures the demultiplexing speed of an empty active-connection hash table and an empty time-wait hash table. Figure 5 shows the demultiplexing latency of the Linux 2.6.9 implementation of the TCP stack for these configurations.

5.3. Reactivation Latency

Reactivation is only a legitimate method of opening a connection if it is faster than opening a connection from scratch. Figure 6 shows the start-up latency of a single connection on Linux 2.6.9 for active connections, deactivated connections and closed connections. We looped through the process of opening a connection, sending and receiving a message and closing the connection. Next, we repeated the experiment, but instead of opening and closing the connection, we simply deactivated and reactivated the connection. Finally, we measured the latency of a connection that remains open.
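The client loops behind Figure 6 have the structure sketched below; do_pingpong() stands for the send/receive pair, open_conn() for socket creation plus connect(), and deactivate() for the hypothetical user-space binding of the deactivation call from Section 4.3.

/* Sketch of the two client loops behind Figure 6. */
#include <unistd.h>

extern int  open_conn(void);          /* socket() + connect() to the server */
extern void do_pingpong(int fd);      /* one send/recv round trip           */
extern int  deactivate(int fd);       /* hypothetical wrapper, Section 4.3  */

void run_open_close(int iters)        /* open, exchange, close every time   */
{
    for (int i = 0; i < iters; i++) {
        int fd = open_conn();
        do_pingpong(fd);
        close(fd);
    }
}

void run_deactivate(int iters)        /* keep a placeholder between rounds  */
{
    int fd = open_conn();
    for (int i = 0; i < iters; i++) {
        do_pingpong(fd);              /* first packet reactivates the socket */
        deactivate(fd);
    }
    close(fd);
}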
Figure 6. Ping-pong latency (average latency in usec versus message size in bytes) for the Open, Open/Close, and Deactivate configurations.
6. Discussion
As we see in Figure 4, deactivated sockets save substantial memory. Because sockets with a context are not released, the time-wait run shows both the memory overhead of the sockets and the memory overhead of the tcp_tw_buckets. The deactivated sockets require slightly more memory than the orphaned, time-wait sockets because of the additional state necessary to reconstitute the socket. The traditional Linux stack only allows around 2000 connections as measured by the slab cache information. If we use deactivated sockets, we increase the number of connections allowed in 2MB of memory to over 20,000. This is a ten-fold increase in scalability. Figure 5 shows no significant correlation between demultiplexing speed and the number of connections populating the hash table. The speed appears to be constant, and no differences are found between the various methods for storing inactive sockets. These findings are significant because they call into question the reasoning behind the use of minisocks in the standard Linux kernel. The timewait-and-orphan measurement shows the performance of standard Linux servers; there is no latency advantage. The only significant advantage of minisocks is the memory savings. Furthermore, there is no reason to remove deactivated sockets from the established-connection hash table. As we see in Figure 6, reactivation moderately increases startup latency over leaving connections open. On the other hand, reactivation decreases latency by approximately 40% for the first arriving packet of a message compared to closing and reopening a connection. When memory pressure requires space-saving methods, reactivation is clearly a viable method. In addition, note that this latency cost is only paid on the packet that initiates reactivation. The most significant result of these experiments is that there are methods that allow full offloading of TCP processing for large clusters using commodity NICs with limited memory resources. We drastically decreased the memory used for inactive sockets while only moderately increasing the latency of the reactivation message.
7. Future Work

Deactivation and reactivation of sockets is a good beginning in our research to decrease latency and increase scalability of TCP on large clusters. First, we must streamline the reactivation process in an attempt to decrease the latency cost. The next step is to use this mechanism to offload a small subset of highly latency-sensitive connections to a NIC to allow for polling and direct access without interrupts or memory copies.
8. Conclusions

We were able to drastically reduce memory usage for open TCP connections, thus increasing the scalability of TCP, especially for offloaded TCP. The latency cost of the first arriving packet on an inactive connection is significant, but it is much lower than the cost of closing and re-opening the connection. We were able to reuse mechanisms created in other areas of network research that reduce resource commitment for communication. By making small modifications to an existing operating system, we were able to drastically reduce resource usage. This is the great advantage of working with commodity protocols: often a good deal of research has already been implemented. Commodity protocols, especially TCP/IP, will always be an important part of cluster computing. Certainly, as more scientific fields come to rely on high-performance computation, there will be more rather than less of a dependency on this interoperable, easy-to-program protocol. We must push TCP to the limits of its performance and efficiency with respect to high-performance environments and commodity hardware, as we cannot expect commodity protocols or components to disappear. Deactivation is a method that comes from understanding the TCP stack in the context of high-performance computing. It begins to address a key problem with TCP: scalability with respect to state management for TCP implementations that are offloaded onto commodity hardware.
References

[1] AceNIC gigabit Ethernet for Linux. Web: http://jes.home.cern.ch/jes/gige/acenic.html, August 2001.
[2] M. Allman, S. Floyd, and C. Partridge. RFC 2414: Increasing TCP's initial window, September 1998. Status: EXPERIMENTAL.
[3] A. Barak, I. Gilderman, and I. Metrik. Performance of the communication layers of TCP/IP with the Myrinet gigabit LAN. Computer Communications, 22(11), July 1999.
[4] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz. TCP behavior of a busy internet server: Analysis and improvements. In IEEE INFOCOM, March 1998.
[5] J. Chase, A. Gallatin, and K. Yocum. End-system optimizations for high-speed TCP. In IEEE Communications, special issue on TCP Performance in Future Networking Environments, volume 39, 2000.
[6] B. Duncan. Splinter TCP to decrease small message latency in high-performance computing. Technical Report TR-CS2003-27, University of New Mexico, 2003.
[7] T. Faber, J. Touch, and W. Yue. The time-wait state in TCP and its effect on busy servers, 1999.
[8] P. Gilfeather and T. Underwood. Fragmentation and high performance IP. In Proc. of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[9] S. Majumder and S. Rixner. Comparing Ethernet and Myrinet for MPI communication. In Proc. of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, October 2004.
[10] V. Padmanabhan and R. Katz. TCP Fast Start: a technique for speeding up web transfers. In IEEE Globecom '98 Internet Mini-Conference, November 1998.
[11] C. L. Schuba, I. V. Krsul, M. G. Kuhn, E. H. Spafford, A. Sundaram, and D. Zamboni. Analysis of a denial of service attack on TCP. In Proceedings of the 1997 IEEE Symposium on Security and Privacy, pages 208-223. IEEE Computer Society Press, May 1997.
[12] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, 1994.
[13] J. Touch. RFC 2140: TCP control block interdependence, April 1997. Status: EXPERIMENTAL.
[14] Y. Zhang, L. Qiu, and S. Keshav. Optimizing TCP start-up performance. Technical Report TR99-1731, 1999.