
Reliable Sockets

Victor C. Zandy and Barton P. Miller
Department of Computer Sciences
University of Wisconsin at Madison
1210 West Dayton Street, Madison, WI 53706
(608) 265-5907 (phone)  (608) 262-9777 (fax)
{zandy,bart}@cs.wisc.edu

Abstract

Reliable Sockets (rocks) are a portable, user-level replacement for the sockets interface that transparently protects applications from network connection failures. Rocks preserve TCP connections and UDP sessions across failures that commonly arise in mobile computing, including host movement, network reconfiguration, link failures, and extended periods of disconnection. Rocks detect connection failures within seconds of their occurrence and automatically recover, reconnecting within a couple of seconds once connectivity is restored. Failures may occur at any time, even while data is in flight. To interoperate with ordinary sockets, rocks use a new rock detection protocol that is safe, efficient, and suitable for general use in other socket-enhancing systems. Rocks do not require any modifications to application binaries or to the operating system kernel. The data transfer performance of rocks is comparable to that of ordinary sockets for all but small (less than 64 byte) data transfers, and connection latency, while greater than that of ordinary sockets, is on the order of a millisecond.

1 INTRODUCTION

We present reliable sockets (rocks), a portable, user-level sockets interface that transparently enables applications to tolerate network connection failures, particularly those caused by the ordinary travails of mobile computing. Reliable sockets replace the standard sockets API, allowing them to be dropped into any sockets-based application with no modification. Reliable sockets automatically detect and recover from TCP connection failures. They can recover a TCP connection after one (but not both) of the endpoints has moved, and they always recover any data that was in flight at the time of the failure, no matter when it occurred. They can also be used for mobile and reliable UDP service. Reliable sockets are implemented entirely in user-level code that does not require special privileges to install or use, and they can be used by applications without reprogramming, recompilation, or static re-linking.

Mobile computing frequently generates events that TCP was not designed to handle. These events include: moving a computer with open connections to a new location, migrating a running application with open connections to a new computer, replacing the network interface on a computer, and disconnecting a computer from the network for a period that exceeds the retransmission timeout period. When any of these events occurs, TCP terminates the connection. To the application, this response is visible as an error returned by calls to the sockets API.

Our reliable sockets are one component of roaming applications, a new system we are currently developing. Roaming applications are ordinary applications, in execution, that move with their user, both as the user moves to new computers, and as the user moves their computer to a new location. Our goal is to enable the entire context of a roaming application to be mobile, including network connections, user interface, and I/O state, without introducing any modifications to applications, operating systems, or network infrastructure, and without making any requirements of correspondent hosts or software.

The contributions of this paper are:

❏ An analysis of the failure modes of TCP.

❏ A portable, user-level, fault-tolerant technique for automatically re-establishing a failed TCP connection that is transparent to applications. This technique allows one of the endpoints to be relocated to a new IP address, and it can handle arbitrary periods of disconnection.

❏ A portable, user-level mechanism for preserving the in-flight data of a TCP connection that is transparent to applications. The mechanism is based on a simple user-level buffering scheme that is designed to ensure that a copy of any data that could be in transmission is preserved.

❏ A portable, user-level mechanism for remotely detecting the presence of a reliable socket, or any enhanced socket API, that does not interfere with ordinary client or server sockets and that does not affect the performance of servers for ordinary sockets.

❏ An extension to the sockets API that gives finer control over reliable sockets to mobile-aware applications.
Reliable sockets are distinguished from previous work on network connection mobility primarily in that they work with common existing infrastructure, and that they emphasize reliability in addition to mobility. Other techniques, such as Mobile IP[17], MSOCKS[13], and Snoeren's TCP migration[22], require modifications to kernel-level network subsystems. Such modifications are system-dependent, may be difficult to port and maintain, and they (hopefully) require privileged access to install. In contrast, reliable sockets are designed to be ready-to-use by ordinary, unprivileged users, using a simple run-time re-linking mechanism. We discuss related work further in Section 6.

Since the interface between reliable sockets and the application is a superset of the sockets API, reliable sockets are immediately portable to any sockets-based system, including all Unix platforms of which we are aware. (We are looking into porting reliable sockets to the extended Winsock interface on Windows NT.) Reliable sockets are also novel because they introduce portable user-level mechanisms for automatic reconnection and in-flight data recovery and tolerate arbitrary periods of disconnection. Our implementation of reliable sockets is not a prototype, but a complete implementation that we distribute and maintain, ready to be used with real applications.

In addition to their mobile computing applications, reliable sockets can be used to extend the scope of process checkpointing mechanisms to multi-process programs that communicate over sockets. In fact, reliable sockets are an ideal abstraction for such functionality: as a user-level abstraction, they are portable and can be compatible with both user-level and kernel-level checkpointing mechanisms, while as a replacement to the sockets API, they can save communication state without being aware of application-specific details such as message formats and buffering. We describe how we have used reliable sockets to perform process migration of MPI programs in Section 4.

We have evaluated the performance of rock data throughput and latency, connection latency, and reconnection latency. Rock data transfer performance is comparable to that of ordinary sockets for all but small (less than 64 byte) data transfers. While the throughput of small data transfers falls more rapidly than that of ordinary sockets, the latency of such transfers remains about the same. The initialization that rocks perform when they connect causes connection time to be about 19 times that of ordinary sockets, but overall it is on the order of one millisecond, which we deem acceptable for the added reliability and mobility. Once connectivity between the ends of a suspended reliable socket connection is restored, reconnection latency is no more than 2 seconds.

The rest of the paper is organized as follows. In the next section, we discuss the failure modes of TCP that motivated this work. In Section 3, we present the architecture and functionality of reliable sockets and discuss the implementation. In Section 4, we describe some applications of our system. We present our performance results in Section 5 and related work in Section 6.


2 TCP FAILURE MODEL

We start with an analysis of the failure modes of a TCP connection, beginning with a review of TCP with particular attention to the mechanisms crucial to its reliability: connection specification, data buffering, and retransmission. We then review the sockets API, the user-level interface to TCP, and explain the ways that TCP will abort a connection. Finally, we analyze the origins of these failures as they relate to mobility events, process migration, and application failure. Our attention is not on the internal state machine and parameters of TCP, but rather on the behavior of TCP as it appears to the applications on each end of a TCP connection. We discuss the internals of TCP only to the extent necessary to understand its external behavior. Our discussion of TCP is based on the original specification of the protocol in RFC 793[18] and the Host Requirements RFC 1122[3]. For issues not covered by these specifications, and for common violations of the specifications, we cite the behavior of current operating systems.

[Figure 1: An established TCP connection between two hosts. On each host, the application refers to the connection through a socket descriptor and the sockets API; the kernel TCP socket on each end records the source and destination IP addresses and ports (here 128.105.31.1:12198 and 205.178.148.141:22) and holds a send buffer and a receive buffer; data is shown in flight in both directions.]

2.1 TCP and Sockets Overview

TCP provides reliable and bi-directional byte stream communication between two processes running on networked hosts called the local host and the peer host (see Figure 1). The operating system kernel on each host maintains the state of its end of the connection in a TCP socket. A TCP socket is identified by an internet address comprised of an IP address and a port number, and a pair of internet addresses identifies a TCP connection.


Application processes manipulate the socket through calls to the sockets API, using an integer descriptor for the TCP socket. Figure 1 omits additional TCP socket attributes, such as parameters for flow control and retransmission, that are irrelevant to our analysis of TCP failures.

For our purposes, a pair of buffers in each socket and a scheme for acknowledging and retransmitting data comprise the TCP reliability mechanism. When the local application process writes to the socket, the local kernel copies the data to the socket's send buffer before transmitting it to the peer. This data remains in the send buffer until the kernel receives an acknowledgement for it from the peer. When the local kernel receives data from the peer, it copies it to the destination socket's receive buffer and sends back an acknowledgement. The receive buffer holds data until it is consumed by the application process. To trigger retransmission of data lost in flight on the network, the local kernel sets a timer when it inserts data in the send buffer. When the timer expires, the local kernel resends any unacknowledged data and resets the timer. The local kernel dynamically adjusts the timer interval as unacknowledged data accumulates and as network performance varies; however, these details of TCP are not pertinent in this paper. The key property of this mechanism is that it ensures that a copy of the data resides in either the sender's send buffer or the receiver's receive buffer until it is consumed.

The kernel does not automatically expand the send or receive buffers as it fills them, although the application can manually resize them with setsockopt. When the receive buffer fills, the kernel signals the peer to suspend further data transfer, and it ceases to buffer or acknowledge incoming data. After the application begins to consume data, the kernel signals the peer to resume sending data. On the other hand, when the send buffer fills, the kernel blocks the application from further writing until the receiver begins to acknowledge the buffered data. Consequently, applications cannot pass more data to TCP than can be stored in the combined space of the local send buffer and the peer receive buffer. This sets a limit on the maximum amount of in-flight data, data that has been passed from the application to the kernel on one end, but not yet consumed by the application from the kernel on the other end, that can exist at any time over a TCP connection.

The main problem for reliable sockets is to preserve the in-flight data after a TCP connection failure. Unfortunately, the sockets API does not provide an interface for accessing in-flight data. Although the sockets API has been extended in some research systems to return a checkpoint of the TCP state to the application[1,16,21,27,30], such functionality has not become standard in any operating system. As we explain in Section 3.7, our solution is to maintain a separate send buffer in each direction that is capable of holding the maximum amount of in-flight data.
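To make the buffer mechanics above concrete, the following sketch (ours, not code from the rocks distribution) shows how an application can inspect and manually resize a TCP socket's kernel buffers with getsockopt and setsockopt; the function name and the 64KB target size are arbitrary.

    /* Sketch: inspect and resize a TCP socket's kernel buffers.
       The function name and the 64KB target are illustrative only. */
    #include <stdio.h>
    #include <sys/socket.h>

    void show_and_resize_buffers(int sock)
    {
        int sndbuf, rcvbuf, want = 64 * 1024;
        socklen_t len = sizeof(sndbuf);

        getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
        len = sizeof(rcvbuf);
        getsockopt(sock, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len);
        printf("send buffer: %d bytes, receive buffer: %d bytes\n", sndbuf, rcvbuf);

        /* The kernel never grows these on its own; the application must ask. */
        setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &want, sizeof(want));
        setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));
    }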

2.2 Aborted Connections

TCP connection failures occur when TCP aborts a connection. Reliable sockets detect when TCP aborts a connection and initiate recovery. Once a connection has been aborted, the application can no longer use its socket to send or receive data. TCP will abort a connection in five cases:

1. Too many retransmissions. The kernel limits either the amount of time spent retransmitting, or the number of retransmissions attempted, and aborts the connection when the limit is exceeded. Although RFC 1122 requires that a connection can have unlimited retransmissions, in practice most kernels do not allow the retransmission limit to be set on a per-connection basis and they do not provide the option to specify unlimited retransmissions.

2. Application request. RFC 793 requires an application interface for aborting a connection. For the sockets API, the application sets the SO_LINGER socket option to a special value. Then, close will abort the connection rather than initiate the usual TCP close protocol (see the sketch below).

3. Peer reset. A reset is a message sent by a TCP socket in place of an ordinary TCP data packet. TCP sends a reset to its peer when the peer has referenced a connection that has been aborted or does not exist, and when the application has requested that the connection be aborted. When a TCP socket receives a reset, it aborts the connection.

4. Too many "keep alive" probes. Ordinarily, no data flows over an idle TCP connection, so the retransmission limit does not cause an idle connection to be aborted. Applications may optionally set the keep-alive option on a TCP socket to have the kernel send a probe to the peer when the connection has been idle for a specific interval of time, and eventually abort the connection when some number of probes go unanswered.

5. IP errors. RFC 1122 specifies that a TCP implementation should abort a connection when it receives some types of "destination unreachable" errors from the lower layer of the network stack. These errors include protocol unreachable, port unreachable, and fragmentation needed for a packet that had the don't-fragment bit set.

Ordinarily, when an application process terminates normally or abnormally, its TCP connections are not aborted.

Instead, the kernel continues to transmit any remaining data in its send buffer, and then it initiates the TCP close protocol. However, if the SO_LINGER option is set for a connection as we described above, then the connection is aborted upon process termination.
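As an illustration of case 2 above, the standard way for an application to request an abort is the SO_LINGER option with a zero linger time, after which close sends a reset instead of performing the normal close protocol. The sketch below is ours, not code from the rocks library.

    /* Sketch: ask TCP to abort (reset) the connection on close rather than
       perform the normal close protocol. Standard sockets idiom. */
    #include <sys/socket.h>
    #include <unistd.h>

    void abort_connection(int sock)
    {
        struct linger lg;
        lg.l_onoff = 1;    /* enable SO_LINGER */
        lg.l_linger = 0;   /* zero timeout: close() aborts the connection */
        setsockopt(sock, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        close(sock);       /* the peer sees a reset instead of end-of-file */
    }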

2.3 Events Leading to Aborts

Mobile computer users routinely perform actions that can lead to TCP connection failures. Although we have seen that there are several ways to abort a TCP connection, the primary reason for these failures is that they cause the TCP retransmission limit to be exceeded. Our model of the behavior of mobile computer users includes the following actions:

❏ Disconnection: A mobile host becomes disconnected when the link becomes unreachable (such as when the user moves out of wireless range), when the link fails (such as when a modem drops a connection), or when the host is suspended. If the peer has data to send, eventually it will exceed its retransmission limit and abort the connection. Unless it is suspended, the mobile host will also abort the connection if it has data to send. These particular cases of connection failure could be avoided if TCP implementations supported unlimited retransmissions on a per-connection basis, as required by RFC 1122.

❏ Change of IP address: A host might move to a new physical subnet, requiring a new IP address, or a suspended host might lose its IP address and DHCP might subsequently provide a different one. This change of IP address will lead to a failure of the TCP connection whenever one endpoint has data to send to the other. The failure will result from exceeding the retransmission limit. A move to a new subnet may involve switching to a different network interface on the host, but it is the retransmission failure resulting from the IP address change, not the new device, that will cause the connection to fail.

❏ Change of physical address: Through process migration, applications in execution may move from one computer to another. Process migration is an important mobility feature for people who own or use multiple computers, such as a laptop for travel and separate desktop computers at home and work, because it avoids the need for users to restart their applications when they move to a new computer. Migration causes the open TCP connections of such applications to fail for two reasons. First, it changes the IP address of the application, which as just described leads to retransmission failure. Second, if the process migration mechanism does not migrate kernel state, it will separate the application socket descriptor from the underlying kernel socket. The original kernel socket will be closed or aborted, according to the application's use of SO_LINGER, while further use of the descriptor by the application will refer to a non-existent socket. This characteristic of sockets has long been an obstacle to migration of applications using sockets.

❏ Host crashes: The peer of the crashed host will either reach its retransmission limit while the host is down or, after the host reboots, receive a reset message in response to any packet it sends to the host. We do not explore host crashes in this paper because they entail application recovery, while we consider only connection recovery.

To summarize, TCP connection failures affect applications because they prevent further use of the application socket for communication and they destroy in-flight data. Most connection failures stem from an inability to transmit data to the peer. Failures are not immediately evident on idle connections, since no data is exchanged over idle connections.

3 RELIABLE SOCKETS

We present the design and implementation of reliable sockets. We begin with an overview of the reliable sockets architecture and operation. In the remainder of the section, we discuss the major features of reliable sockets in greater detail. We motivate our support for UDP sockets, describe the options for loading the rocks library into ordinary applications, present the rock detection mechanism that enables rocks to interoperate with ordinary sockets, describe how a rock establishes its auxiliary control connection and how it establishes a new connection following a failure, discuss the details of in-flight data recovery, and describe how a reliable socket connection is shut down.

3.1 Reliable Sockets Overview

Reliable sockets are implemented as a library interposed between the application and the kernel at both ends of a TCP connection (see Figure 2). The library exports the sockets API to the application to allow reliable sockets to be used transparently by ordinary applications. The library also exports the rocks expanded API (RE-API), which enables rock-aware applications to set policies for the behavior of the reliable sockets library and to manually control some of its mechanisms. We describe the specific features of this API where appropriate. Internally, each reliable socket consists of a small amount of state, the most important of which are a fixed-size buffer and counters for bytes sent and received that maintain the state of in-flight data.

[Figure 2: The reliable sockets architecture. On each host, the application holds a reliable socket descriptor and calls the reliable sockets API; the reliable sockets layer records the connection addresses, send and receive sequence counts, a shared secret key, and an in-flight buffer, and holds the descriptor of the underlying kernel TCP socket, which maintains the usual addresses and send and receive buffers. In the example shown, the local host has moved to 24.6.204.15, so the kernel connection runs between 24.6.204.15:62121 and 205.178.148.141:22, while the reliable sockets at both ends still record the original addresses 128.105.31.1:12198 and 205.178.148.141:22.]

The operation of a reliable socket can be summarized by the state diagram shown in Figure 3. A reliable socket exists in one of three states: CLOSED, CONNECTED, or SUSPENDED. Note that these states correspond to reliable socket behavior that affects the application, not the internal TCP socket state maintained by the kernel. A reliable socket begins in the CLOSED state. To establish a reliable socket connection for TCP or UDP communication, the application makes the usual sequence of sockets API calls. For TCP sockets, the rock connection is initiated when the sockets establish an initial TCP connection. For UDP clients, the rock connection is initiated either when the application calls connect to bind the destination address of the socket, or when it sends its first datagram to its peer if it does not call connect. For UDP servers, a rock connection is initiated when a datagram from a new client is received. All subsequent communication between UDP clients and servers is sent over TCP connections created by the reliable sockets library. To complete the connection, the library performs the following steps:

1. Test for a remote rock. The rock library verifies that the peer is a reliable socket. If it is not, the library reverts the socket to ordinary socket behavior with none of the reliable sockets functionality, and returns to the application. This test should not affect ordinary sockets that attempt to communicate with a reliable socket.

2. Establish the data connection. The data connection is a TCP connection that, once the reliable socket connection is established, is used for exchanging application data. This connection is used for data transfer even if the applications are using UDP sockets.

3. Initialize. During initialization, the rocks negotiate a secret shared key through a Diffie-Hellman key exchange[7], and create the in-flight buffer based on the sizes of the TCP buffers at each end.

4. Establish the control connection. The control connection is a separate TCP connection that runs parallel to the data connection. It is mainly used to detect the failure of the data connection.

[Figure 3: The reliable socket state diagram. A rock begins in the CLOSED state and moves to CONNECTED on connect or accept; a TCP failure moves it from CONNECTED to SUSPENDED; a successful reconnect returns it to CONNECTED; close or abort returns it to CLOSED.]

After these steps complete, the rock changes to the CONNECTED state.

To subsequently distinguish the rocks, we call the rock that initiated the connection the client rock, and the other one the server rock. Once connected, the application can use the rock as it would use any ordinary socket. It transfers data to and from the peer by calling any of the I/O routines of the sockets API such as read, write, recv, and send. As the application calls these routines, the rock maintains a copy of potentially in-flight data. When the application sends data, the rock puts a copy of the data into the in-flight buffer, and increments the count of bytes sent. Older data in the in-flight buffer is discarded to make room for the new data; the in-flight buffer is large enough to guarantee that data that has not yet been received by the peer will remain in the buffer. When the application receives data, the rock increments the count of bytes received.

Connection failures are detected primarily by heartbeat probes that are periodically exchanged over the control connection. Unlike the TCP retransmission mechanism, heartbeats detect connection failures within seconds instead of minutes, and their sensitivity can be tuned with the RE-API on a per-connection basis. Each heartbeat probe is sent as a byte of urgent TCP data, which allows rocks to be asynchronously notified when a heartbeat arrives. When a rock detects that it has not received several successive heartbeats, it switches to the SUSPENDED state. By default, heartbeat probes are sent once per second, and a rock suspends itself after it misses fifteen probes. In cases when the control connection cannot be established or when a rock-enabled application disables heartbeats, the rock also detects connection failures by monitoring the return values of sockets API calls made on the data connection. Although the TCP keep-alive probe may appear to provide functionality that is similar to our heartbeat, it is actually poorly suited for detecting connection failure. RFC 1122 requires that the default period for sending keep-alive probes be at least two hours. On many platforms this period can be reduced to under two hours on a system-wide basis by privileged users, but not per-connection by ordinary users. The much smaller period of heartbeat probes provides good interactive response to connection failures.

A suspended rock automatically attempts to reconnect to its peer, subject to a reconnection policy. If the application was using the rock when the failure was detected, the rock immediately attempts reconnection. Otherwise, it delays the reconnection until the next time the application uses the socket to prevent rocks that are not being used from unnecessarily blocking the application while they are being reconnected. A special case of this policy is when the application polls a set of descriptors that includes some suspended rocks.


In this case, the rocks library attempts to reconnect the suspended rocks if none of the other descriptors are ready. The reconnection policy can be adjusted with the RE-API. Once a rock begins reconnection, it performs the following four steps:

1. Establish a new data connection. Each rock simultaneously attempts to establish a connection with its peer based on the IP addresses of the failed connection. If at least one rock does not change its location, this mechanism eventually establishes a new connection.

2. Authenticate. The rocks mutually authenticate through a challenge-response protocol that is based on the key they established during initialization.

3. Establish a new control connection. The new control connection is established in the same manner as the original control connection.

4. Recover in-flight data. The rocks perform a go-back-N retransmission of any data that was in-flight at the time of the connection failure. Each rock determines the amount of in-flight data it needs to resend by comparing the number of bytes that were received by the peer to the number of bytes it sent.

Suspended rocks attempt to reconnect for three days, a period of time that handles disconnection periods that span a weekend. The time-out is used to prevent prolonged resource consumption when the reliable socket connection cannot be resumed, such as when the peer host or application has crashed. Using the RE-API, this time-out can be adjusted on a per-rock basis, or disabled to allow rocks to attempt reconnection indefinitely. When reconnection times out, the reliable socket changes to the CLOSED state, and all state associated with the rock is destroyed. This is the only way that a rock connection is aborted.

Reliable sockets preserve the illusion of a fixed address to the application. When an application calls getsockname or getpeername, the sockets API calls for determining the address of connection endpoints, the reliable sockets library returns the addresses associated with the first data connection that was established between the rocks. For example, in the rock connection instance shown in Figure 2, the addresses maintained by the rocks are those of the original connection, not the current one. If this behavior is undesirable, it can be changed with the RE-API to return the addresses associated with the most recent data connection.

A reliable socket is closed gracefully when the application calls close or shutdown. If the application attempts to close a reliable socket while it is suspended, the reliable socket continues to try to reconnect to the peer, and then automatically performs a normal close once it is reconnected, preserving in most cases the usual close semantics of TCP.
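To make the heartbeat mechanism described earlier in this overview concrete, the sketch below shows one way a one-byte probe can be sent as TCP urgent data and its arrival delivered asynchronously; the function names are ours, and the real library's bookkeeping (missed-probe counting and per-connection tuning through the RE-API) is omitted.

    /* Sketch: heartbeat probes over the control connection using TCP urgent
       data. Names are illustrative; error handling and the missed-probe
       counter that triggers suspension are omitted. */
    #include <fcntl.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static volatile sig_atomic_t beats;

    static void on_urgent(int sig)
    {
        (void)sig;
        beats++;                               /* a heartbeat arrived */
    }

    void heartbeat_init(int ctrl_fd)
    {
        signal(SIGURG, on_urgent);             /* asynchronous notification */
        fcntl(ctrl_fd, F_SETOWN, getpid());    /* route SIGURG to this process */
    }

    void heartbeat_send(int ctrl_fd)           /* called once per second */
    {
        char probe = 1;
        send(ctrl_fd, &probe, 1, MSG_OOB);     /* one byte of urgent data */
    }

    void heartbeat_consume(int ctrl_fd)
    {
        char probe;
        recv(ctrl_fd, &probe, 1, MSG_OOB);     /* drain the pending urgent byte */
    }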

3.2 UDP

Although it may seem strange and contradictory to introduce a reliability layer under UDP, we see several applications. First, many RPC protocols such as NFS are built over UDP. Applications based on these protocols can benefit from the mobility and disconnection protection provided by reliable sockets. Second, UDP is likely to be a foundation for new transport protocols, particularly those used for streaming media, where the reliability semantics of TCP can be too aggressive for real-time delivery on wide-area networks[6]. Finally, the roaming application system we are developing redirects all long-term application network communication, including both TCP and UDP, through a proxy. By extending reliable sockets to UDP, we have a single mechanism that handles all redirection.


On the other hand, reliable sockets are not appropriate for uses of UDP in which the socket is used to exchange datagrams with multiple peers, since reliable sockets are designed to preserve the communication state between two ends. Reliable sockets are also inappropriate for short-lived communication, such as DNS requests. In these cases, which we detect by comparing the socket source and destination ports against services known to be multiplexed or short-lived, we revert to ordinary UDP.

Since UDP over reliable sockets uses TCP, its performance can be worse than that of ordinary UDP. This penalty has two components. First, TCP uses a three-way handshake to set up a connection, while UDP does not have any setup protocol. The additional initialization overhead of reliable sockets exacerbates this difference. Second, some applications, such as streaming media applications, prefer UDP to TCP because they can tolerate the data loss associated with UDP more easily than they can tolerate the slow data delivery that can occur under TCP retransmissions. For these applications, UDP over reliable sockets may defeat the original reason for using UDP. We are considering retargeting rocks to operate over other user-level transport layers built on top of UDP, such as SCTP[25].
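Because rocks carry UDP traffic over a TCP stream, datagram boundaries must be re-imposed at user level; Section 5.2 notes that the receiver reads a length and then a payload for each datagram. The sketch below shows one plausible length-prefix framing; the 4-byte network-order header and the helper names are our assumptions, since the paper does not specify the wire format.

    /* Sketch: carrying UDP datagrams over a TCP stream with a length prefix
       so datagram boundaries survive. The 4-byte network-order length and
       the helper names are assumptions; partial-send handling is omitted. */
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    static int read_full(int fd, void *buf, size_t n)
    {
        char *p = buf;
        while (n > 0) {
            ssize_t r = recv(fd, p, n, 0);
            if (r <= 0) return -1;             /* error or end-of-file */
            p += r;
            n -= (size_t)r;
        }
        return 0;
    }

    int send_datagram(int fd, const void *data, uint32_t len)
    {
        uint32_t nlen = htonl(len);
        if (send(fd, &nlen, sizeof(nlen), 0) != (ssize_t)sizeof(nlen)) return -1;
        return send(fd, data, len, 0) == (ssize_t)len ? 0 : -1;
    }

    int recv_datagram(int fd, void *buf, uint32_t max, uint32_t *len)
    {
        uint32_t nlen;
        if (read_full(fd, &nlen, sizeof(nlen)) < 0) return -1;
        *len = ntohl(nlen);
        if (*len > max) return -1;             /* datagram larger than buffer */
        return read_full(fd, buf, *len);
    }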

3.3 Loading the Reliable Sockets Library

The reliable sockets library can be linked with an application in three ways:

1. Static linking. The application can be re-linked against the reliable sockets library. This requires access to the object files of the application.

2. Runtime linking. The library can be loaded into the application when the application is executed. Many operating system loaders provide an environment variable, commonly called LD_PRELOAD, that lists libraries that are to be loaded into a process when it is executed. This mechanism does not require any modification to the application binary and it is easy to use.

3. Runtime re-linking. The application, already in execution, can be re-linked with the reliable sockets using the Dyninst API [28].

We prefer runtime linking because it is simple and fast. We distribute reliable sockets with a command-line tool called rock that hides the system-specific details of runtime linking from the user (see Section 4.1).
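The runtime linking option relies on standard symbol interposition: a library named by LD_PRELOAD defines sockets API entry points and forwards to the real implementations it looks up with dlsym. The sketch below shows the general shape of such a wrapper for connect; it is our illustration of the technique, not the rocks source, and the file and library names are made up.

    /* Sketch: interposing on connect() from a preloaded shared library.
       Build roughly as: cc -shared -fPIC -o libwrap.so wrap.c -ldl
       and run as:       LD_PRELOAD=./libwrap.so <program>
       The library and file names are illustrative. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/socket.h>

    static int (*real_connect)(int, const struct sockaddr *, socklen_t);

    int connect(int fd, const struct sockaddr *addr, socklen_t len)
    {
        if (!real_connect)                       /* find the next definition */
            *(void **)(&real_connect) = dlsym(RTLD_NEXT, "connect");

        /* An interposition layer such as rocks would run its own setup here
           (rock detection, key exchange, control connection) and fall back
           to the plain call when the peer is not a rock. */
        return real_connect(fd, addr, len);
    }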

3.4 Interoperability

When establishing a new connection, a rock detects whether the remote socket to which it is connecting is also a rock or an ordinary socket. With rare exceptions, this rock detection protocol is transparent to the applications at both ends. Beyond reliable sockets, our technique can be used as a general-purpose approach for safe, portable, and user-level remote detection of clients and servers that use any sort of enhanced sockets. Servers can freely deploy this technology since it provides rock-awareness capability to server applications with trivial performance penalty when accepting ordinary socket clients. All significant costs of the technique are incurred by clients that request reliable socket connections. We have verified that our technique works with many standard services, including ssh, telnet, ftp, and X windows.

It is tricky to remotely distinguish a reliable socket from an ordinary socket. The problem for the rock is to unmistakably indicate its presence to the remote socket without interfering with applications that do not use reliable sockets. The rock cannot indicate its presence by sending special data over the connection, as others have suggested[19], since an ordinary socket that received this data would pass it directly to the application, where it likely would be garbage. It is also problematic to verify the presence of a rock over a separate connection, because any approach based on this idea would depend on a pre-arranged scheme for selecting port numbers for the second connection. Such a scheme could conflict with other sockets that happen to use the same port number and with NATs or firewalls present in the network. Additionally, the sockets API lacks a way to create a distinct socket configuration, such as an unusual combination of socket options, that could be remotely sensed to identify a reliable socket.

[Figure 4: The rock detection protocol. The client connects, shuts down its half of the connection, reads the server's "ROCKS" announcement, resets the connection, and then reconnects from the same address and port; the server accepts the second connection as a rock connection.]

Our rock detection protocol allows rock clients and servers to mutually recognize each other and, if either end is not a rock, to safely fall back to ordinary socket behavior without disturbing the application. It is an example of a gray-box technique [1] for implicit information exchange in that it probes the remote peer, without negatively affecting its state, to glean information that is not explicitly provided by the sockets interface. The protocol is as follows (see Figure 4; a client-side sketch appears after the list):

1. The client establishes a TCP connection with the server.

2. The client closes the connection for writing using the sockets API function shutdown. This action causes the server application to see end-of-file on its end of the connection, but allows the client to continue to read from the server.

3. The server detects the end-of-file and identifies the client as a suspected rock. The server announces its rocks-awareness to the client by sending a distinctive message to the client.

4. The client reads the message and now knows that the server is rocks aware. The client resets the connection.

5. The client connects again to the server from the same IP address and port number as its connection in Step 1. Since the client reset the previous connection, the client socket does not enter the TCP TIME_WAIT state that would otherwise prevent the client from reusing the connection port number. Upon accepting a connection from a suspected rock port number, the server concludes that the new connection is from a rock. The client and server proceed with rocks initialization.
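The client side of these steps might look like the following sketch. The announcement string, the one-second timeout, and the helper name are our assumptions taken from the description above; error paths, consuming application data that precedes the announcement, and most of the ordinary-socket fallback are omitted.

    /* Sketch: client side of the rock detection probe (steps 1-5 above).
       Assumes an IPv4 server address; names and constants are illustrative. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define ANNOUNCE "ROCKS"                 /* the server's distinctive message */

    int rock_detect_connect(const struct sockaddr *srv, socklen_t srvlen,
                            int *is_rock)
    {
        struct sockaddr_storage self;
        socklen_t selflen = sizeof(self);
        struct linger abrt = { 1, 0 };       /* close() will send a reset */
        struct timeval tmo  = { 1, 0 };      /* give up after about one second */
        char buf[sizeof(ANNOUNCE)];
        int fd, n;

        fd = socket(AF_INET, SOCK_STREAM, 0);
        connect(fd, srv, srvlen);                      /* step 1 */
        getsockname(fd, (struct sockaddr *)&self, &selflen);
        shutdown(fd, SHUT_WR);                         /* step 2: server sees EOF */

        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo));
        n = recv(fd, buf, sizeof(buf) - 1, 0);         /* steps 3-4: announcement? */
        *is_rock = (n == (int)strlen(ANNOUNCE) &&
                    memcmp(buf, ANNOUNCE, (size_t)n) == 0);

        setsockopt(fd, SOL_SOCKET, SO_LINGER, &abrt, sizeof(abrt));
        close(fd);                                     /* reset: avoids TIME_WAIT */

        fd = socket(AF_INET, SOCK_STREAM, 0);
        if (*is_rock) {                                /* step 5: reuse the same  */
            int one = 1;                               /* address and port        */
            setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
            bind(fd, (struct sockaddr *)&self, selflen);
        }                                              /* else a fresh port,      */
        connect(fd, srv, srvlen);                      /* ordinary socket path    */
        return fd;
    }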


Our protocol works in the case when the client or the server is not rocks aware. When a rocks-aware client performs the first two steps of the protocol with an ordinary server, the server closes or resets its end of the connection, obliviously sends data, or does nothing. In any case, the client will not receive the rocks announcement from the server, and so it will not perform any further actions that could confuse the server. The client will wait for a small period of time before abandoning the rock detection protocol and concluding that the server is not rocks aware. It will then reconnect to the server from a different port number and proceed as an ordinary socket.

In the other direction, if only the server is rocks-aware, there are two ways the server can interfere with the ordinary client application. First, if the client application happens to perform the first two steps of our protocol, including leaving its end of the connection open for reading, then the server will send the rocks announcement to the unsuspecting client. We ignore this problem because ordinary clients are very unlikely to exhibit such behavior. Second, another process may connect to the server from the same IP address and port number as a suspected rock client. The server will proceed to initialize the connection as a rocks connection, which will confuse the client application. To reduce the chance of this occurring, the server only allows the client a short period of time to reconnect as a rock.

A rocks-aware server may send an arbitrary amount of application data to the client before sending the rocks announcement. Since the announcement will be the last data sent by the server, the client can determine whether the server has sent the announcement by consuming all data from the server and comparing the last bytes it receives to the expected announcement. However, the client cannot wait for an end-of-file from the server, since the server does not close the connection (if it did, it would either force the client into the TIME_WAIT state, or possibly cause the client to lose the rocks announcement, depending on whether it did a normal close or an abort). Instead, the client must continuously read from the server, and give up on rock detection if it does not receive the announcement within a certain period of time; we currently set that period to 1 second.

Although the client reset sent to the server in Step 4 can be lost, this is not a problem because such a loss creates an instance of a half-open connection on the server [18], which TCP was designed to handle. When the client attempts to reconnect a half-open connection, the server responds to the client's connection attempt (the first SYN packet of the 3-way handshake) with an acknowledgement from the context of the previous connection. Since the connection no longer exists on the client, the client responds by sending another reset to the server, and retrying the connection attempt. Once a reset has been received by the server, the connection attempt will succeed.

Our protocol has an important limitation. In networks that use network address translation (NAT) devices to maintain a private network address space [8], port numbers allocated by hosts within the network are usually dynamically mapped to different port numbers in the external network. Since our detection mechanism relies on clients being able to successively connect twice from the same port number, it fails unless the NAT device also happens to allocate the same external port for both connection attempts.
To avoid this problem, the RE-API provides an option for suppressing the rock detection mechanism.

3.5 The Control Connection

The use of a separate control connection was motivated by a problem that arises when trying to combine application and control data on the same connection. Since firewalls and NAT devices can hinder the establishment of a separate control connection, we use a bidirectional scheme to maximize the chance that a control connection can be successfully created.

The problem with a single connection for data and control is that there must be a way to transmit heartbeat probes even when ordinary data flow is blocked by TCP; otherwise rocks would falsely suspend blocked connections. TCP urgent data appears to be well-suited to solve this problem, but it has several limitations. First, although sockets can receive urgent data out-of-band, sending the heartbeat over the same connection as application data would interfere with applications, such as telnet and rlogin, that make use of urgent data. Second, on some operating systems, such as Linux, when new out-of-band data is received, any previously received out-of-band data that has not been consumed by the application is merged into the data stream without any record of its position, possibly corrupting the application data. Since we cannot guarantee that an out-of-band heartbeat is consumed before the next one arrives, we cannot prevent this corruption. Third, on some operating systems, when the flow of normal TCP data is blocked, so is the flow of both urgent data and urgent data notification.

To establish a control connection, each end first creates a new passive socket that is bound to an arbitrary port number, and sends the port number to the peer over the data connection. Then each end receives the port number for its peer, and attempts to connect to the peer at that port number. With this scheme, the control connection can be established as long as the placement of firewalls and NAT devices between the rocks allows one of the connections to be established. If such devices block connections in both directions, reliable sockets proceed to use the data connection without a heartbeat probe. If both connection attempts happen to succeed, the server rock selects one of the connections and closes the other.
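A minimal sketch of that bidirectional setup, under the assumption of IPv4 addresses and with our own helper names: each end listens on a kernel-chosen port, advertises it over the data connection, and also tries the port the peer advertised. The tie-break when both directions succeed and all error handling are omitted.

    /* Sketch: bidirectional control-connection setup over an established
       data connection. Names are illustrative; error handling, non-blocking
       operation, and the tie-break between two successful connections are
       omitted. */
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int control_setup(int data_fd, const struct sockaddr_in *peer_data_addr)
    {
        struct sockaddr_in self, peer_ctrl;
        socklen_t len = sizeof(self);
        uint16_t my_port, peer_port;
        int lsock, csock;

        /* Listen on an arbitrary, kernel-chosen port. */
        lsock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&self, 0, sizeof(self));
        self.sin_family = AF_INET;
        bind(lsock, (struct sockaddr *)&self, sizeof(self));
        listen(lsock, 1);
        getsockname(lsock, (struct sockaddr *)&self, &len);
        my_port = self.sin_port;                 /* already in network order */

        /* Exchange port numbers over the data connection. */
        send(data_fd, &my_port, sizeof(my_port), 0);
        recv(data_fd, &peer_port, sizeof(peer_port), MSG_WAITALL);

        /* Try an outgoing control connection; fall back to accepting the
           peer's attempt if the outgoing direction is blocked. */
        peer_ctrl = *peer_data_addr;
        peer_ctrl.sin_port = peer_port;
        csock = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(csock, (struct sockaddr *)&peer_ctrl, sizeof(peer_ctrl)) == 0)
            return csock;
        close(csock);
        return accept(lsock, NULL, NULL);
    }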

3.6 Reconnection: Establishing a New Connection

A reliable socket obtains a new connection with its peer during reconnection, allowing the rocks to reconnect even after one (but not both) of the connection endpoints has changed its IP address. The mechanism is summarized in Figure 5. When a reliable socket connection fails (Figure 5a), each rock records its last known address, the IP address and port number of its end of the data connection, and the last known address of the peer's end of the data connection, which it obtains from the sockets API calls getsockname and getpeername.

When a rock begins reconnecting, it attempts to establish an active TCP socket connection to its peer's last known address. It also, if possible, binds a passive socket to its own last known address in anticipation of a similar connection attempt by the peer (Figure 5b). It then waits for a connection to be established on either of these sockets. The rock simultaneously polls both sockets by making the active socket non-blocking. The rock will not be able to create the passive socket if it cannot bind its last known address, which happens when the rock has moved to a new IP address or the port number is bound to another socket on the rock's host. In this case, the rock just attempts to connect to the peer. As long as one rock can be reached at its last known address by its peer, a new connection will eventually be established (Figure 5c).

There are two problems with this mechanism. First, it can produce two new connections if neither socket moves from its last known address. As with the control connection, when this happens the server rock is responsible for selecting one of the connections and closing the other. Second, suspended rocks cannot establish a new connection when they both have moved from their last known addresses, or when they are separated by a NAT or some other device that manipulates network addresses. In this case, a rock-aware application can use the RE-API to manually establish a new connection, or otherwise the reconnection will time out and the rock will abort.
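The simultaneous attempts described above can be expressed with a non-blocking connect polled alongside the passive socket; the sketch below shows one reconnection round under that scheme. The names are ours, and authentication, the new control connection, cleanup of the losing socket, and pacing of retries across the reconnection window are all omitted.

    /* Sketch: one round of reconnection. Try an active connect to the peer's
       last known address while, if the local last known address can still be
       bound, also listening for the peer's symmetric attempt; poll both
       without blocking. Assumes IPv4; names are illustrative. */
    #include <fcntl.h>
    #include <poll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int reconnect_once(const struct sockaddr *self, socklen_t self_len,
                       const struct sockaddr *peer, socklen_t peer_len)
    {
        struct pollfd pfd[2];
        int active, passive, ready = -1;

        active = socket(AF_INET, SOCK_STREAM, 0);
        fcntl(active, F_SETFL, O_NONBLOCK);            /* don't block in connect */
        connect(active, peer, peer_len);               /* usually "in progress"  */

        passive = socket(AF_INET, SOCK_STREAM, 0);
        if (bind(passive, self, self_len) == 0) {      /* may fail after a move  */
            listen(passive, 1);
        } else {
            close(passive);
            passive = -1;
        }

        pfd[0].fd = active;  pfd[0].events = POLLOUT;  /* connect completion */
        pfd[1].fd = passive; pfd[1].events = POLLIN;   /* incoming attempt   */

        if (poll(pfd, passive >= 0 ? 2 : 1, 1000) > 0) {
            if (pfd[0].revents & POLLOUT) {
                int err = 0;
                socklen_t elen = sizeof(err);
                getsockopt(active, SOL_SOCKET, SO_ERROR, &err, &elen);
                if (err == 0)
                    ready = active;                    /* our connect succeeded */
            }
            if (ready < 0 && passive >= 0 && (pfd[1].revents & POLLIN))
                ready = accept(passive, NULL, NULL);   /* the peer reached us */
        }
        return ready;                                  /* -1: try again later */
    }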

[Figure 5: Reliable socket reconnection. (a) Host B moves, changing its IP address and causing the rock connection to be suspended; (b) Host A and Host B attempt to connect to each other at their last known addresses; (c) the connection from Host B to Host A completes, and the rock connection is resumed.]

3.7 In-flight Data Recovery

As discussed in Section 2.1, the TCP send and receive buffer sizes limit the amount of data that can be in-flight at any time over a TCP connection. When a reliable socket connection is initialized, each rock exchanges the size of its TCP receive buffer with its peer. Each rock then creates an in-flight buffer that is large enough to hold the combined capacity of its TCP send buffer and its peer's receive buffer. At any time during the lifetime of the connection, the application may adjust the size of the TCP buffers on its side of the connection. The in-flight buffers are resized accordingly. When the application resizes its receive buffer, the rock uses the control connection to inform the peer rock to resize its in-flight buffer. If the application reduces the size of one of the buffers, the receiving rock creates a temporary receive buffer into which the sending rock flushes its in-flight buffer. When the control connection cannot be established, rocks instead create in-flight buffers of maximum size, and ignore application attempts to resize the TCP buffers.

TCP urgent data requires special treatment. During reconnection, resent urgent data must be handled by the receiving TCP socket as urgent data. Since TCP only maintains state for the most recently received urgent data, each rock only needs to record the byte counter value of the most recently sent byte of urgent data. During reconnection, if that byte is resent, the rock transmits it as urgent data. Note that whether the urgent data is received out-of-band is determined by the receiver, and does not affect the way it is handled by the sending rock.
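The bookkeeping behind in-flight recovery is small. The sketch below captures the sizing rule and the go-back-N resend computation described in this section and in the reconnection steps of Section 3.1; the structure and function names are ours.

    /* Sketch: in-flight buffer sizing and the go-back-N resend computation.
       Field and function names are illustrative. */
    #include <stddef.h>
    #include <stdint.h>

    struct inflight {
        char     *buf;        /* copy of the most recently sent bytes        */
        size_t    size;       /* local send buffer + peer receive buffer     */
        uint64_t  sent;       /* bytes the local application has sent        */
        uint64_t  peer_rcvd;  /* peer's received-byte count, learned during  */
                              /* reconnection                                */
    };

    /* The buffer must cover everything that can be in flight: bytes sitting
       in the local kernel send buffer plus bytes sitting in the peer's
       kernel receive buffer. */
    size_t inflight_size(size_t local_sndbuf, size_t peer_rcvbuf)
    {
        return local_sndbuf + peer_rcvbuf;
    }

    /* After reconnection, resend the suffix the peer never received; by
       construction this never exceeds the in-flight buffer size. */
    uint64_t bytes_to_resend(const struct inflight *f)
    {
        return f->sent - f->peer_rcvd;
    }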

3.8 Shutdown

Reliable sockets must preserve the application semantics of TCP connection shutdown. There are two problems.

First, a rock may be suspended when the application attempts to close it. To ensure that the peer rock is notified of the close, the rock must reconnect to the peer before it can be closed. More importantly, the application may implicitly close a reliable socket by exiting the process. To handle this case, the reliable sockets library attempts to trap process exits and reconnect suspended rocks to complete the close. If the reconnection time-out expires before the reconnection completes, or if the library cannot trap the exit (such as when the process is terminated by a signal that cannot be caught), the peer rock will be aborted instead of closed.

Second, instead of closing a connection, some applications may intentionally abort it. With reliable sockets, however, connections are also aborted when they are suspended. When a rock detects that its peer has aborted a connection, it cannot determine whether it was aborted because of connection failure or by intention of the application. To handle this problem, reliable sockets prohibit applications from intentionally aborting connections by converting an attempt to abort a connection into an ordinary connection close. We are currently investigating alternate approaches that would preserve the ability of the application to abort connections.
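One portable way to trap normal process exit, as the close handling above requires, is an atexit handler registered when the library is loaded; this is a minimal sketch with hypothetical names, and, as noted above, it cannot cover exits caused by uncatchable signals.

    /* Sketch: completing the close of suspended rocks at normal process exit.
       rocks_close_all() stands in for the library's real cleanup, which would
       reconnect each suspended rock and then perform an ordinary close. */
    #include <stdlib.h>

    static void rocks_close_all(void)
    {
        /* for each suspended rock: reconnect (bounded by the reconnection
           time-out), then close the data and control connections */
    }

    static void rocks_library_init(void)
    {
        atexit(rocks_close_all);    /* runs on exit() or return from main() */
    }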

4 APPLICATIONS

We have used reliable sockets to run interactive sockets-based applications on mobile computers and to checkpoint and migrate parallel programs.

4.1 Interactive Applications

We have developed two programs, rock and rockd, that make it simple to use rocks with ordinary sockets-based applications. rock starts ordinary programs as rock-enabled programs by using runtime linking to link the application with the reliable sockets library. rockd is a reliable socket server that redirects its client's connection to another server over an ordinary TCP connection. Rock-enabled clients can effectively establish reliable socket connections with ordinary servers by running rockd on the same host as the server. Although the connection between rockd and the server is not reliable, it is immune to TCP failures, including server host IP address changes, since it is local to the host. To simplify the use of rockd, rock detects the use of some common client applications, and automatically attempts to start rockd on the server host if necessary.

Four typical examples of the way we use reliable socket connections from our laptops are shown in Figure 6.

1. An interactive terminal on the laptop connected to Host A: The terminal server (sshd) is made rock-enabled by modifying the daemon server (inetd) on Host A to start the daemon with rock; the terminal client (ssh) is also started by the user with rock.

2. An X Windows application, such as Emacs or Framemaker, displayed on the laptop, but running on Host B: The X server is made rock-enabled by modifying its start-up script to use rock.

3. An interactive terminal on the laptop connected to Host B: Since we do not have privileged access on Host B, we redirect a reliable socket connection through rockd.

4. An X Windows application that is running on the laptop but displayed on Host C: We do not have privileged access on Host C, so we use rockd. This connection allows us to use the laptop remotely from Host C.

[Figure 6: Rock-enabled applications. The laptop holds rock connections to Host A (a rock-enabled sshd), to Host B (a rock-enabled X application, and an sshd reached through rockd), and to Host C (an X server display reached through rockd); on Hosts B and C, rockd bridges the rock connection to the ordinary server over a local ordinary connection.]

In each of these examples, the connections between the laptop and the other hosts are all fully-functional reliable socket connections.

4.2 Checkpointing Parallel Programs

Reliable sockets can extend process checkpointing mechanisms to handle parallel programs, such as those that use the MPI[14] and PVM[10] libraries. Since rocks reside entirely in user space, the checkpoint mechanism transparently picks up the communication state as it saves the process address space. The checkpoint mechanism does not need to manipulate or even be aware of the library-level communication semantics of the application; all our reliability mechanisms operate on the underlying sockets (i.e., the least common denominator). In contrast, other systems for checkpointing parallel programs, such as CoCheck[23,24] and MIST[4], are explicitly aware of the particular communication library used by the application. As an additional benefit, a checkpointed rocks-enabled process can be migrated simply by restarting it on another host. Rocks automatically restore the communication of a migrated process without any assistance from the checkpoint mechanism.

We have combined rocks with a user-level checkpoint library similar to that of Condor[12] to checkpoint and migrate the processes of an ordinary MPI application running under MPICH[11]. Our application runs on a cluster of workstations using the MPICH P4 device for clusters. We modified rock to work with mpirun, the MPICH command for starting a job, and to runtime link the checkpoint library along with the rock library. Once the application is started, each process can be checkpointed by sending a signal to it. The process terminates after it is checkpointed.

This functionality can be used in several ways. To tolerate the failure of a single node, the process running on that node can be checkpointed and then restarted when the node recovers.


The same checkpoint can also be used to migrate the process to another node by restarting the checkpoint on the new node. In the same manner, the entire application can be migrated to a new set of hosts, although this migration must be performed one process at a time to ensure that the rock reconnection succeeds. Alternatively, the network proxy we are developing for roaming applications enables any subset of the processes to be migrated at the same time, and more generally, the RE-API can be used to link an arbitrary mechanism for locating migrated processes with the rocks library.

Rocks can be used to obtain a global checkpoint of a parallel application from which the application can be restarted after an arbitrary hardware failure. Care must be taken to ensure that the checkpoint is globally consistent. One approach is to stop each process after it checkpoints. Once all processes have checkpointed, the application can be resumed. A more general approach that does not require the entire application to be stopped is to take a Chandy and Lamport distributed snapshot [5]. We plan to use the recently introduced Condor support for MPI applications [26] to allow our rock and checkpoint-enabled MPI applications to be managed by Condor. To obtain a globally consistent checkpoint, each process will stop itself after it checkpoints and Condor will be responsible for restarting them.

5 PERFORMANCE

We have evaluated reliable socket throughput, data transfer latency, connection latency, and reconnection latency on a pair of identical 600MHz Pentium IIIs connected by 100Mb/s switched Ethernet, running Linux 2.4.4. Overall, there are few surprises. Rock connection time is about 19 times that of ordinary sockets, but once connected, the TCP throughput overhead of rocks is only noticeable in transfers of blocks smaller than 64 bytes and TCP latency increases by a small constant. Since rocks transfer UDP datagrams over TCP, the UDP latency and throughput of rocks are slightly worse than those of rocks TCP transfers, reflecting the overhead of datagram handling.

5.1 Throughput

To study the impact of rocks on throughput, we compared TCP and UDP transfer rates of reliable sockets and ordinary sockets at various block sizes. Block size is the size of the buffer passed to the socket send system call. When the block size is too large and send returns without sending the entire buffer, we make additional calls to send until the entire buffer is sent. We measured sender and receiver throughput.

For our TCP experiment, the sender transferred 64MB from memory with block sizes that varied from 8 bytes to 8MB, and we measured the time elapsed between the first call to send and the return from the final call to send. At the receiver, we measured the time elapsed between the establishment of its connection to the sender and the return of the final call to recv. To maximize its consumption rate, the receiver consumed data at maximum block size. We zero-filled all buffers before timing to minimize memory management effects. We repeated this experiment for UDP, but over a smaller range of block sizes. Since UDP does not have flow control, transfers of blocks smaller than 512 bytes suffer too much packet loss to sustain a meaningful throughput measurement, while blocks larger than 32KB exceed the maximum size of a UDP datagram.

June 26, 2001

Page 18

TCP Block Size

UDP

Sockets Sender (Mb/s)

8 bytes 16 bytes 32 bytes 64 bytes 128 bytes 256 bytes 512 bytes 1 KB 2 KB 32 KB 512 KB 8 MB

35.7 66.5 89.9 89.9 89.9 89.9 89.9 89.9 89.9 89.9 89.9 89.9

Rocks

Receiver (Mb/s) 35.7 66.5 89.8 89.8 89.8 89.8 89.8 89.8 89.8 89.8 89.8 89.8

Sender (Mb/s) 19.8 40.3 75.2 89.9 89.9 89.9 89.9 89.9 89.9 89.9 89.9 89.9

Sockets

Receiver (Mb/s) 19.8 40.3 75.2 89.8 89.8 89.8 89.8 89.8 89.8 89.8 89.8 89.8

Sender (Mb/s)

84.5 89.7 90.0 91.7

Rocks

Receiver (Mb/s)

Sender (Mb/s)

84.5 89.6 89.9 91.7

89.2 89.6 89.7 89.9

Receiver (Mb/s)

89.1 89.4 89.6 89.8

Table 1: TCP and UDP Throughput. TCP and UDP send and receive rates for 64MB data transfer from process to process at varying block sizes. Connection establishment time is not included. UDP block sizes are restricted between the maximum size for UDP transfers (32KB) and the smallest size that does not produce overwhelming packet loss (512 bytes). Experiments were performed on two 600MHz Pentium IIIs running Linux 2.4.4 and connected by 100Mb/s switched ethernet. Rates shown are the maximum over five runs. rocks, including the overhead of copying into the in-flight buffer, periodic heartbeat interruptions, and the rocks wrappers to underlying socket API functions. Rocks UDP performance is slightly worse than rocks TCP performance. The overhead is due to the additional handling necessary to preserve UDP datagram boundaries as they are transferred over the TCP stream. Compared to ordinary UDP throughput, rocks are surprisingly competitive. At large block sizes, ordinary UDP throughput is slightly higher than rocks, but due to TCP buffering, rock throughput is higher for small blocks.

5.2 Latency

We measured TCP and UDP data transfer latency for small block sizes. We measured the total time for 25,000 round-trip data transfers of varying block size, with and without the in-flight buffer enabled. The results are shown in Table 2. Compared to an ordinary TCP round trip, rocks incur a constant overhead of about 6 usec for TCP connections, and about 15 usec for UDP datagram transfers. In both cases, the extra copy of the block into the in-flight buffer incurs about one microsecond of overhead. For TCP, the remaining overhead comes from the wrapper functions, which perform an array reference and some validity tests. The additional overhead of UDP is due to the datagram handling, which requires the receiver to make two calls to recv for each datagram: one to receive the length of the datagram and one to receive the datagram payload.
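
The wire encoding rocks use for datagram boundaries is not specified here; the following sketch assumes a simple 4-byte length prefix purely to illustrate why the receiver needs one read for the length and one for the payload.

/* Sketch of one way to preserve datagram boundaries over a TCP
 * stream, as rocks must do for UDP traffic.  The 4-byte length
 * prefix is an assumption for illustration; the actual rocks
 * encoding may differ. */
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <errno.h>

static int read_full(int sock, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(sock, p, len, 0);
        if (n <= 0)
            return -1;                  /* error or EOF */
        p += n;
        len -= n;
    }
    return 0;
}

/* Receive one framed datagram; returns its length, or -1 on error. */
ssize_t recv_datagram(int sock, void *buf, size_t bufsize)
{
    uint32_t len;
    if (read_full(sock, &len, sizeof(len)) < 0)   /* first read: length */
        return -1;
    len = ntohl(len);
    if (len > bufsize) {
        errno = EMSGSIZE;
        return -1;
    }
    if (read_full(sock, buf, len) < 0)            /* second read: payload */
        return -1;
    return (ssize_t)len;
}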

                   TCP                                    UDP
Block Size   Sockets   Rocks    Rocks No Buffer     Sockets   Rocks    Rocks No Buffer
             (usec)    (usec)   (usec)              (usec)    (usec)   (usec)
8 bytes      108       114      113                 92        124      123
16 bytes     111       117      116                 92        127      126
32 bytes     117       123      122                 97        133      132
64 bytes     129       135      134                 109       145      144
128 bytes    153       159      158                 133       169      168

Table 2: Round-trip latency. Round-trip TCP and UDP latency for a data transfer from process to process at varying block size. Connection establishment time is not included. Experimental setup is the same as in Table 1. Times shown are the average over the 25,000 runs.

5.3 Connection

We measured the time it takes for a client rock to connect to a server rock. We timed 100 application calls to connect and close. Several features of rocks that affect connection time are optional: rock detection, key exchange, control (heartbeat) connection establishment, and in-flight data buffer initialization. To measure their role in connection time, we also measured connection time with these features selectively disabled. Our results are summarized in Table 3.

Full-featured rock connection time is about 19 times larger than the time for an ordinary socket connection. With all features disabled, rock connection time is only about 20 microseconds larger, which can be accounted for by a few kernel calls to read and initialize the socket state. The most expensive feature is the rock detection mechanism. Rock detection involves the detection by the server application of the end-of-file generated by the client and the establishment of a second connection. The second most expensive feature is the key exchange for authentication, an operation that involves large integer arithmetic. Of the remaining two features, the establishment of the control connection, which involves a port number exchange and a connection set-up, is slightly more expensive than the initialization of the in-flight data buffer, which involves an exchange of TCP buffer size parameters and memory allocation. Although rock connection is expensive compared to ordinary socket connection, it is still on the order of a millisecond, which we deem an acceptable cost for the added reliability and mobility.

Connection Type                            Time (usec)   Increase over Rocks
                                                         with no Features (usec)
Ordinary socket                            163           -
Rocks (with no optional features)          182           -
Rocks (with only in-flight data buffer)    394           212
Rocks (with only heartbeat)                739           557
Rocks (with only authentication)           1024          842
Rocks (with only rock detection)           1490          1308
Rocks (all features)                       3040          2858

Table 3: TCP connection time. Time to connect and close, averaged over 100 calls. Experimental setup is the same as in Table 1.
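
For reference, a minimal sketch of a connection-timing harness of this kind appears below; the server address, port, and output format are illustrative assumptions, and with rocks the identical connect and close calls are simply routed through the interposed library.

/* Sketch: average the cost of 100 connect()/close() pairs against a
 * listening server.  Address and port are assumed values. */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/time.h>
#include <string.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    struct sockaddr_in addr;
    struct timeval start, end;
    int i;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                    /* assumed server port */
    addr.sin_addr.s_addr = inet_addr("10.0.0.2");   /* assumed server host */

    gettimeofday(&start, NULL);
    for (i = 0; i < 100; i++) {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        if (s < 0 || connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect");
            return 1;
        }
        close(s);
    }
    gettimeofday(&end, NULL);

    double usec = (end.tv_sec - start.tv_sec) * 1e6
                + (end.tv_usec - start.tv_usec);
    printf("average connection time: %.1f usec\n", usec / 100);
    return 0;
}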

5.4 Reconnection

We measured the amount of time it takes to reconnect a suspended rock. Reconnection time is the time, following a restoration of network connectivity, that a rock spends establishing a new data and control connection with the peer and recovering in-flight data. For our experiment, we suspended a reliable socket connection by disabling the network interface on one machine, and then measured the time elapsed from when we re-enabled the interface to when the reliable socket on that machine returned to the ESTABLISHED state. While established, the applications at the ends of the connection exchanged data at a high rate to ensure the presence of in-flight data at the time of disconnection. The elapsed time over multiple runs of the experiment ranged from 1.2 seconds to 1.5 seconds. We are satisfied with this performance since it is comfortable for interactive use, less than the time required to restart most non-trivial applications that would fail without reliable sockets, and small on the time scale of the events that typically lead to network connection failures, such as a change of link device, link device failure, laptop suspension, re-dial and connect, or process migration.

6 RELATED WORK

Many techniques for network connection mobility have been proposed. Unlike these systems, rocks emphasize reliability over mobility, viewing mobility as just another cause of network connection failure. Rocks provide reliability by integrating mechanisms for rapid failure detection and unassisted reconnection with a mechanism for preserving connection state. The other distinguishing features of rocks are that they are implemented entirely outside of the kernel, they support UDP as well as TCP, and they interoperate safely with ordinary sockets.

Techniques that enable the endpoints of a TCP connection to be re-assigned to a different IP address include the TCP Migrate option [22], Mobile TCP sockets [19,20], Persistent Connections [29], and MSOCKS [13]. The TCP Migrate option is an experimental kernel extension to TCP that is not widely deployed. It does not handle TCP failures, requiring disconnected peers to reconnect before TCP aborts the connection, and it depends on external support to locate and initiate reconnection with the disconnected peer. Mobile TCP sockets and Persistent Connections interpose, like rocks, a library between the application and the sockets API that preserves the illusion of a single unbroken connection over successive connection instances. Between connections, Mobile TCP sockets preserve in-flight data by using an unspecified kernel interface to the contents of the TCP send buffer (such interfaces are not common), while Persistent Connections makes no attempt to preserve in-flight data. Mobile TCP sockets cannot handle TCP failures that result in the abort of the TCP socket, since that action destroys the contents of the socket send buffer. Both of these techniques depend on external support to re-establish contact with a disconnected peer, and neither interoperates safely with ordinary applications. MSOCKS is a proxy-based system that, like the rockd, enables a client application to have a mobile connection with an ordinary server. The proxy has a kernel modification called a TCP splice that allows the client, as it moves, to close its end of the connection and establish a new one without disrupting the server. Like the TCP Migrate option, MSOCKS does not handle TCP failures. The client end of an MSOCKS connection is responsible for in-flight data sent in both directions, using a mechanism similar to the rocks in-flight buffer to preserve data sent from the client to the server, and depending on the ability to read from the socket used in the previous connection to preserve data sent from the server to the client.

As an alternative to TCP-specific techniques, Mobile IP [17] routes all IP packets, including those used by TCP and UDP, between a mobile host and ordinary peers by redirecting the packets through a home agent, a proxy on a fixed host with a specialized kernel. Except for masking IP address changes from TCP sockets, Mobile IP does not handle failures of TCP connections. It depends on external mechanisms for detecting disconnection and initiating reconnection.


Socket hand-off mechanisms have been developed and used to migrate sockets to other hosts for mobility, fault-tolerance, load-balancing, and quality-of-service [1,16,21,27,30]. All of these mechanisms are based on special interfaces for checkpointing and restoring the kernel state of a TCP socket. They must be combined with a connection mobility technique such as Mobile IP or TCP Migrate to ensure that data follows the socket to its new location. While these mechanisms could be used for the migration of processes with network connections, rocks can also provide this functionality without the need for any kernel mechanisms.

7 CONCLUSION AND FUTURE WORK

Reliable sockets transparently protect ordinary applications from network connection failures, including those caused by a change of IP address, link failure, and extended periods of disconnection. Besides being an unavoidable part of life with mobile computers, these failures can also occur unexpectedly during non-mobile communication, such as when modems fail or dynamic IP address leases expire. Rocks automatically detect connection failures and suspend the application from further use of the connection. When the connection is resumed, which rocks do automatically if at least one of the endpoints does not move, rocks recover any in-flight data lost at the time of the failure. Rocks provide this protection without modifications to kernels or network infrastructure, work transparently with ordinary application binaries, handle both TCP and UDP communication, and transparently revert to ordinary socket behavior when communicating with ordinary peers. With the rocks expanded API, rock-aware applications can adjust the parameters of reliable sockets behavior and manually reconnect suspended rocks. We routinely use reliable sockets for end-user interactive applications, such as remote shells and remote GUI-based applications, and we have used reliable sockets to checkpoint and migrate parallel programs.

As part of our ongoing work on roaming applications, we are using reliable sockets to develop a network proxy for more general network connection mobility. This proxy will provide support for simultaneous movement of both ends of a reliable socket connection, and will generalize the use of reliable sockets with ordinary peers that do not support reliable sockets. In addition, firewalls and NATs are a pervasive obstacle to network connection mobility. Our mechanisms for failure detection, interoperation with ordinary peers, and rock reconnection are all complicated by barriers imposed by these devices. With our network proxy, we are investigating approaches to alleviate this problem, enabling network connection reliability and mobility to interoperate better with existing network infrastructure.

REFERENCES

[1] A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau. Information and Control in Gray-Box Systems. 18th ACM Symposium on Operating Systems Principles (SOSP ’01), Chateau Lake Louise, Banff, Canada, October 2001.

[2] A. Bakre and B. R. Badrinath. I-TCP: Indirect TCP for Mobile Hosts. 15th International Conference on Distributed Computing Systems (ICDCS ’95), Vancouver, British Columbia, Canada, May 1995.

[3] R. T. Braden. Requirements for Internet Hosts - Applications and Support. Internet Request for Comments RFC 1122, October 1989.

[4] J. Casas, D. L. Clark, R. Konuru, S. W. Otto, R. M. Prouty, and J. Walpole. MPVM: A Migration Transparent Version of PVM. Computing Systems 8, 2, Spring 1995, pp. 171-216.

[5] K. M. Chandy and L. Lamport. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Transactions on Computer Systems 3, 1, February 1985, pp. 63-75.

[6] D. D. Clark and D. L. Tennenhouse. Architectural Considerations for a New Generation of Protocols. ACM Symposium on Communications Architectures and Protocols (SIGCOMM ’90), Philadelphia, PA, September 1990.

[7] W. Diffie and M. E. Hellman. New Directions in Cryptography. IEEE Transactions on Information Theory, 22, 6, November 1976, pp. 644-654.

[8] K. Egevang and P. Francis. The IP Network Address Translator (NAT). Internet Request for Comments RFC 1631, May 1994.

[9] P. Ferguson and D. Senie. Network Ingress Filtering. Internet Request for Comments RFC 2267, May 2000.

[10] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine: A Users’ Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, Massachusetts, 1994.

[11] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard. Parallel Computing, 22, 6, September 1996, pp. 789-828.

[12] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System. Technical Report #1346, Computer Sciences Department, University of Wisconsin, April 1997.

[13] D. A. Maltz and P. Bhagwat. MSOCKS: An Architecture for Transport Layer Mobility. INFOCOM ’98, San Francisco, CA, April 1998.

[14] Message Passing Interface Forum. MPI: A Message Passing Interface Standard. May 1994.

[15] The Open Group. CAE Specification, Networking Services (XNS), Issue 5. The Open Group, Reading, Berkshire, U.K., 1997.

[16] V. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum. Locality-Aware Request Distribution in Cluster-based Network Servers. 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), San Jose, CA, USA, October 1998.

[17] C. Perkins. IP Mobility Support. Internet Request for Comments RFC 2002, October 1996.

[18] J. Postel. Transmission Control Protocol. Internet Request for Comments RFC 793, September 1981.

[19] X. Qu, J. X. Yu, and R. P. Brent. A Mobile TCP Socket. Technical Report TR-CS-97-08, Computer Sciences Laboratory, RSISE, The Australian National University, Canberra, Australia, April 1997.

[20] X. Qu, J. X. Yu, and R. P. Brent. A Mobile TCP Socket. International Conference on Software Engineering (SE ’97), San Francisco, CA, USA, November 1997.

[21] A. C. Snoeren, D. G. Andersen, and H. Balakrishnan. Fine-Grained Failover Using Connection Migration. 3rd USENIX Symposium on Internet Technologies and Systems (USITS ’01), San Francisco, CA, March 2001.

[22] A. C. Snoeren and H. Balakrishnan. An End-to-End Approach to Host Mobility. 6th IEEE/ACM International Conference on Mobile Computing and Networking (Mobicom ’00), Boston, MA, August 2000.

[23] G. Stellner. CoCheck: Checkpointing and Process Migration for MPI. 10th International Parallel Processing Symposium, Honolulu, HI, 1996.

[24] G. Stellner and J. Pruyne. Resource Management and Checkpointing for PVM. 2nd European PVM User Group Meeting, Lyon, France, 1995. P. Vixie, S. Thomson, Y. Rekhter, and J. Bound. Dynamic Updates in the Domain Name System (DNS UPDATE). Internet Request for Comments RFC 2136, April 1997.

[25] R. Stewart, Q. Xie, K. Morneault, C. Sharp, H. Schwarzbauer, T. Taylor, I. Rytina, M. Kalla, L. Zhang, and V. Paxson. Stream Control Transmission Protocol. Internet Request for Comments RFC 2960, October 2000.

[26] D. Wright. Cheap Cycles from the Desktop to the Dedicated Cluster: Combining Opportunistic and Dedicated Scheduling with Condor. Proceedings of Linux Clusters: The HPC Revolution, Champaign-Urbana, IL, USA, June 2001.

[27] D. K. Y. Yau and S. S. Lam. Migrating Sockets -- End System Support for Networking with Quality of Service Guarantees. IEEE/ACM Transactions on Networking, 6, 6, December 1998, pp. 700-716.

[28] V. C. Zandy, B. P. Miller, and M. Livny. Process Hijacking. Eighth International Symposium on High Performance Distributed Computing (HPDC ’99), Redondo Beach, CA, August 1999, pp. 177-184.

[29] Y. Zhang and S. Dao. A “Persistent Connection” Model for Mobile and Distributed Systems. 4th International Conference on Computer Communications and Networks (ICCCN), Las Vegas, NV, September 1995.

[30] B. Zenel. A Proxy Based Filtering Mechanism for the Mobile Environment. PhD Dissertation, Columbia University, December 1998.
