Tapping TCP Streams

Maxim Orgiyan, Christof Fetzer
[email protected], [email protected]
AT&T Labs - Research, USA

Abstract

Providing transparent replication of servers has been a major goal in the fault-tolerance community. Transparent replication is particularly challenging for highly non-deterministic applications, such as those that use multi-threading. For such applications, keeping replicas in a consistent state becomes non-trivial. One way to deal with the non-determinism is to use a leader/follower approach. In this paper we describe the design and performance of a TCP tapping mechanism we implemented. This mechanism was designed to improve the efficiency of leader/follower replication. We argue that TCP tapping can address a major efficiency bottleneck of leader/follower replication.

Keywords: fault-tolerance, distributed systems, leader/follower replication, TCP tapping.

This paper appears in the Proceedings of the IEEE International Symposium on Network Computing and Applications, NCA2001, Boston, MA, USA, Feb 2002.

1. Introduction





Replication can be used to increase the availability of services in the presence of computer and network failures. The finite state machine approach [1] facilitates replication of deterministic processes. However, many newer service implementations are non-deterministic. For instance, one source of non-determinism is multi-threading, which is widely used as a natural way to model the parallelism inherent in an application; it also allows the application to harness the power of multi-processor machines. Hence, support for non-deterministic applications is an important requirement.

Leader/follower replication [2] can be used to replicate non-deterministic processes. The idea is that one of the replicas is elected to become the leader, which performs each non-deterministic action first. The leader then notifies the followers of the result of the non-deterministic action. The followers use this notification to perform the same state transformation as the leader. Since the followers perform the same state transformations as the leader, all replicas stay consistent. If the leader fails, one of the followers can take over the role of the leader. To simplify the description, in the remainder of this paper we restrict ourselves to the case of exactly one follower. However, it is straightforward to generalize the presented approach to multiple followers.

TFT [3] is a system based on the idea of leader/follower replication (TFT evolved from hypervisor-based fault tolerance [4], which shares the idea of having a leader to enable replication of non-deterministic state machines). TFT is implemented at the system call level: all potentially non-deterministic system calls are intercepted. The leader records and relays its non-deterministic choices to the follower, which uses this information to perform the same state transitions. We are working on a similar system at the C-library level that addresses the performance issues of TFT. [3] reports that the normalized performance overhead of the system varies widely between workloads, from random disk writes to sequential disk reads. In our work we attempt to reduce the performance overhead inherent in a system such as TFT.

While [3] does not explain the very large variance in system call performance, one fundamental problem is the following (see Figure 1). When the leader performs a disk or a network read, it must appear to the follower that it has performed the same action. There are two obvious ways to approach this problem. First, the leader can let the follower know that it has read a block of data by sending a return code (e.g., the number of KBytes read from file descriptor 5). The follower can then independently perform an equivalent read of the data (see Figure 1(a)). This solution works for read-only files, but not for network connections or files that can be modified concurrently by other processes. Second, a general solution is that the leader forwards the read data as well as the return code to the follower (see Figure 1(b)). This solution performs poorly if the process reads large volumes of network or disk data, since all data has to be forwarded to the follower. Hence, it is not acceptable for the replication of busy server systems, because their throughput might be severely affected.

Figure 1. Problem description. (a) Follower issues a second read request. (b) Forwarding of data by leader.

Another solution to this problem is based on atomic broadcast [5] (see Figure 2(a)). The file server and the network clients send their data via atomic broadcast to both the leader and the follower. Hence, the leader does not have to forward the data to the follower. Note that the leader still has to notify the follower of how many bytes it has read, to ensure that the follower reads the same number of bytes. This solution requires that all clients and services be capable of sending data via atomic broadcast. At least for services that serve remote clients this might not be acceptable, since most clients might only be able to communicate via TCP (e.g., due to firewalls and administrative issues).

In this paper we investigate the performance of an alternative tapping mechanism that can be used to solve this problem (see Figure 2(b)). We illustrate our approach by applying it to the problem of replicating services that rely on TCP for communication. The follower taps the TCP connections of the server, i.e., the connections to the clients and to the services that the leader uses. The tapping might or might not be perfect (i.e., the follower might not always get all the data). However, the leader can provide the missing data when notifying the follower of the read result. Note that even in this case the performance overhead at the leader should be minimal as long as the TCP tap drop rate is minimal.

Figure 2. Problem description. (a) File server sends data via atomic broadcast. (b) Follower gets data via TCP tapping.

In what follows we investigate the performance of various TCP tapping alternatives. In Section 2 we describe the architecture of our system. Although we concentrate on the connection tapping aspect of our system, for the sake of completeness we briefly explain how our system allows the follower to take over the connection in case of leader failure. Section 3 discusses the pros and cons of various network topologies for TCP tapping. Our TCP tapping implementation is described in Section 4, and its performance in Section 5. Before we conclude in Section 7, we describe system status and related work on TCP fault-tolerance in Section 6.

2. System Architecture

The TCP tapping approach to replication was designed as part of our ongoing project for masking TCP endpoint failures, such as server host crashes and server process crashes. The system relies on TCP tapping to enable efficient leader/follower replication. In case of leader failure, the connection can be taken over by the follower, which becomes the new leader. The design goals of the system are as follows:

1. Additional hardware and software used to implement our approach must not significantly impact the performance of the TCP connection between the client and the leader.

2. Our approach must not require modification of the operating system or application code.

3. The TCP connection must appear to be intact to the client application, despite server-side failures.

The third design goal separates our system from other cluster-based systems available on the market today, e.g., [6]. In case of server failure, these systems forward new TCP connections to another server in the cluster. However, existing TCP connections to the failed server are dropped, and the state associated with them is lost.

Figure 3. System architecture. The client, primary server, and backup server each run a modified libc that exports the BSD sockets interface and sits above TCP/IP; the backup server taps the TCP/IP connection between the client and the primary server.

We introduce redundancy in the form of semi-active replication, i.e., we have a leader and one or more followers. In order to preserve TCP connections in case of permanent host failures of the leader, the follower server must be located on a separate host. Our approach is entirely library-based: we deploy modified system libraries on the client, the follower, and the leader hosts (see Figure 3). As specified by our second design goal, no operating system or application code needs to be changed. In order to use our solution, applications link with our system libraries. While the fact that the client has to link with a special system library might be viewed as a drawback to deployment, this approach is used in other well-known systems that provide complementary TCP functionality. For example, typical implementations of SOCKS [7], a standard protocol used for firewall traversal, require the client to link with a special version of the system library.

The follower's modified system library taps the TCP connection between the client and the leader, and captures all of the packets exchanged on this connection. Various tapping techniques are described in Section 4.2. As specified by our first design goal, tapping should be introduced in a way that does not penalize the performance of the TCP connection between the client and the leader. The follower server gets access to all of the traffic coming from the client to the leader server via a standard BSD socket interface exported by the modified system library. In failure-free mode, the modified system library simply discards the follower server's output.

The challenge is to keep the follower from diverging from the leader, despite various sources of non-determinism, such as certain system calls (e.g., time()) and multi-threading. This is handled by the leader/follower protocol, which is a work in progress and is outside the scope of this paper; we briefly describe it here for completeness. Whenever a non-deterministic system call is executed, the follower's system library blocks until it receives the result of this call from the leader. The leader's system library sends the result of the call and blocks until it receives an acknowledgment from the follower's system library (otherwise, the leader and the follower might enter different states). The result of the call is then returned to the leader server. The actual data is tapped by the follower and does not need to be sent from the leader; rather, a return code indicating the number of bytes read and the file descriptor is sent. Once received, this result is returned to the follower server. Non-determinism due to multi-threading can be handled in a similar fashion, by implementing the leader/follower protocol within the standard thread library, such as POSIX threads.

The leader/follower protocol we are developing is configurable. Less synchronization between the leader and the follower might be acceptable for applications that can tolerate some non-determinism. For example, if the application does not perform different actions depending on the exact number of bytes returned by a read() call, the read bytes can be passed up to the leader application immediately, without waiting for the follower acknowledgment.
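To make this synchronization concrete, the following C sketch shows how a modified system library might wrap read(). It is only an illustration: lf_is_leader(), lf_send_result(), lf_wait_result(), lf_send_ack(), and lf_wait_ack() are assumed placeholders for the leader/follower control channel described above, and only tap_client_read() corresponds to the tapping interface of Section 4.1.

/* Illustrative sketch of a wrapped read() under the leader/follower
 * protocol.  The lf_* helpers are assumed placeholders for the control
 * channel between the two modified libraries. */
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

struct lf_result {            /* record relayed from leader to follower */
    int     fd;               /* descriptor the call was issued on      */
    ssize_t nbytes;           /* return code of the leader's read()     */
};

extern int  lf_is_leader(void);                        /* assumed */
extern void lf_send_result(const struct lf_result *);  /* assumed */
extern void lf_wait_result(struct lf_result *);        /* assumed */
extern void lf_send_ack(void);                         /* assumed */
extern void lf_wait_ack(void);                         /* assumed */
extern ssize_t tap_client_read(int fd, void *buf, size_t count);

ssize_t wrapped_read(int fd, void *buf, size_t count)
{
    struct lf_result r;

    if (lf_is_leader()) {
        r.fd = fd;
        r.nbytes = read(fd, buf, count); /* the real, non-deterministic call */
        lf_send_result(&r);              /* only the return code is relayed  */
        lf_wait_ack();                   /* keep leader and follower in step */
        return r.nbytes;
    }

    /* Follower: block for the leader's result, acknowledge it, and obtain
     * the same bytes from the tapped TCP stream instead of the socket. */
    lf_wait_result(&r);
    lf_send_ack();
    if (r.nbytes <= 0)
        return r.nbytes;
    return tap_client_read(r.fd, buf, (size_t)r.nbytes);
}

In the configurable variant mentioned above, the leader could return r.nbytes to the application before lf_wait_ack() completes.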

For the sake of completeness, we also briefly describe how connection takeover is performed. If the leader crashes, the client's modified system library initiates a new TCP connection to the follower's modified system library. At this point the follower server becomes the leader. The follower server performs reads or writes through the standard BSD socket interface exported by its modified system library, which in turn uses the new TCP connection to communicate with the modified system library at the client. Note that the existence of this new connection is hidden within the modified system libraries, and is thus completely transparent to the client and the follower applications. Ongoing reads and writes aborted because of the crash must be completed once the follower takes over. For this purpose, the modified library at the client maintains state information (e.g., how many bytes it has received so far).
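The client-side bookkeeping this requires can be pictured with the following sketch. The structure and the reconnect routine are hypothetical; the paper does not show this code, and the real library would also have to exchange the byte counts with the follower's library.

/* Hypothetical per-connection state kept by the client's modified library
 * so that reads and writes aborted by a leader crash can be completed
 * against the follower (which becomes the new leader). */
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

struct takeover_state {
    int                fd;             /* descriptor used by the application */
    struct sockaddr_in follower_addr;  /* where to reconnect after a crash   */
    unsigned long      bytes_sent;     /* bytes successfully sent so far     */
    unsigned long      bytes_received; /* bytes successfully received so far */
};

/* Called when the connection to the leader fails: open a new TCP
 * connection to the follower; the byte counts above tell both sides how
 * much of the interrupted read or write remains to be completed. */
static int reconnect_to_follower(struct takeover_state *st)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&st->follower_addr,
                sizeof(st->follower_addr)) < 0) {
        close(fd);
        return -1;
    }
    st->fd = fd;   /* the new connection stays hidden inside the library */
    return 0;
}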

3. System Deployment

We propose several ways of introducing the follower into the system.

3.1. Router Topology

One approach is to install the follower server on a machine configured to act as a router for the leader host. The router has at least two interfaces; one of them is on the Ethernet segment it shares with the leader. This topology is illustrated in Figure 4(a). A rule in the gateway's routing table specifies that all packets with the destination IP of the leader server (132.239.55.165) should be sent to the first interface of the follower (132.239.50.164). The follower is set up to forward these packets to the leader. Similarly, the leader has a routing rule according to which all outgoing packets are sent to the second interface of the follower (132.239.55.164). The follower then forwards these packets to the gateway. The follower can capture all of the TCP traffic between the client and the leader because it is physically interposed on the path between them. In fact, the follower sees a superset of the traffic seen by the leader's TCP stack (note that packets with the destination IP of the leader might be dropped after the follower captures them, but before they are processed by the leader server).

Figure 4. Tapping topologies. (a) Router topology: the client reaches the leader (132.239.55.165) through a gateway and the follower, whose routing table sends packets destined for the leader out its second interface (If2: 132.239.55.164) and all other packets out its first interface (If1: 132.239.50.164). (b) Proxy ARP topology: the follower answers ARP requests for the leader (132.239.50.165) with the hardware address of its first interface (If1: 132.239.50.164) and forwards the packets to the leader over its second interface. (c) Ethernet topology: the client reaches the leader through the gateway and a router (or proxy ARP) machine running the logger; the follower sniffs the leader's Ethernet segment.

3.2. Proxy ARP Topology

A variation of the router topology described in the previous section uses a machine configured to perform proxy ARP for the leader host (see Figure 4(b)). The gateway assumes that the leader is on the same Ethernet segment (the network part of the leader's IP address is 132.239.50, which is the network address associated with the Ethernet segment the gateway is on). When the gateway receives a packet with the destination IP of the leader (132.239.50.165), it issues an ARP request asking for the corresponding physical address (assuming the address is not already in its ARP cache). The follower, which is configured to perform proxy ARP, replies with the hardware address of its first interface (132.239.50.164). Once the follower receives a packet with the destination IP of the leader, it forwards this packet to the leader through the second interface. The proxy ARP approach might be more practical than the router approach when it is not possible to modify the gateway's routing table.

3.3. Ethernet Topology

A potentially more flexible, robust, and efficient approach is depicted in Figure 4(c). In this case, the router and the follower are decoupled, which eases the computational load on the follower machine, which would otherwise be on the critical path between the client and the leader. The follower, connected to the same Ethernet segment as the leader, attempts to capture all TCP traffic between the client and the leader by using a sniffer. Since we cannot assume that the sniffer will capture all of the traffic, we introduce a logger, which can retransmit missed packets to the follower. The logger runs at the router machine and has access to all traffic between the client and the leader (just like the follower in the router topology of Figure 4(a)). Note that in this topology, a machine configured to perform proxy ARP can be used instead of a router, as described in the previous section.

If retransmissions are rare, the impact on the performance of the TCP connection between the client and the leader should be small. The follower machine should be fast enough to keep up with the leader. The advantage of this topology is the decoupling of routing and replication, which is important if the service performs computationally intensive tasks. Another advantage of this topology is that it is particularly easy to introduce additional sniffer-based followers. Moreover, if the leader's failure is temporary, it is easy to reintegrate it into the system once it reboots, since it can assume the role of a follower.

Note that the Ethernet between the logger, the leader, and the follower can be switched. An Ethernet switch can increase the cumulative bandwidth of Ethernet by forwarding packets only to the port to which the destination host is attached, which typically prevents sniffing of packets addressed to the leader. If the leader's IP address, however, is mapped to a multicast Ethernet address, the follower will be able to sniff the packets addressed to the leader. Additionally, the use of a multicast address simplifies the IP address takeover by the follower in case the leader has failed.

4. Implementation Details

4.1. Tapping Interface

The interface used by the modified system library to tap into the TCP stream between the client and the leader is shown in Figure 5. It is designed to be analogous to the well-known BSD socket interface, which the modified system library exports to the follower server.

int tap_socket(int sockfd, int domain, int type, int protocol)
int tap_bind(int sockfd, struct sockaddr *my_addr, socklen_t addrlen)
int tap_listen(int s, int backlog)
int tap_accept(int sockfd, int s, struct sockaddr *addr, socklen_t *addrlen)
ssize_t tap_client_read(int fd, void *buf, size_t count)
ssize_t tap_server_read(int fd, void *buf, size_t count)
int tap_server_close(int fd)
int tap_client_close(int fd)

Figure 5. Tapping interface.

The semantics of some of these calls differ from the BSD semantics. For example, a socket descriptor is actually passed as the first parameter of tap_socket() (i.e., int sockfd). This is needed to provide the ability to use the same descriptor as the one used by the leader server, thus ensuring that the leader and the follower maintain the same state. As part of the leader/follower protocol, the descriptor is sent from the modified system library at the leader to the modified system library at the follower. In general, we rely on the leader/follower protocol to ensure that the leader and the follower do not enter divergent states due to the various sources of non-determinism in the system.

Once the connection is established, the follower system library can read both the stream of data from the client to the leader server (via tap_client_read()) and the stream from the leader server to the client (via tap_server_read()). If the follower system library decides to stop reading, it uses tap_server_close() and tap_client_close() to close the server and the client data streams, respectively. This causes a cleanup of the state, primarily various queues of socket buffers, that was allocated to these streams.

Since the follower should not be allowed to enter a different state than the leader, calls to the above interface are made only when the leader executes an analogous BSD call. This synchronization is achieved via the leader/follower protocol. For instance, when the leader application accepts a connection, its modified system library will, as part of the leader/follower protocol, send a message to the follower indicating that tap_accept() can now be called. The leader's modified system library will also supply the source IP address and port number of the accepted connection, as well as the corresponding socket descriptor. The interface presented above abstracts the method by which the TCP connection is actually tapped, thus making it transparent to the follower system library. The various connection tapping techniques are described in Section 4.2.

4.2. Tapping Implementation

The tapping code spawns an additional thread. This "tapping thread" handles packets, while the application thread is used to execute the calls provided by the interface shown in Figure 5. The tapping thread is started during the first call to tap_listen(), and is responsible for acquiring and processing the packets. Packet capture methods might differ depending on the particular follower deployment topology. Currently, packets are acquired using pcap (the standard packet capture library under UNIX). We use the version of pcap that comes with our implementation environment, the SuSE 6.4 Linux distribution. This version relies on the Berkeley Packet Filter (BPF) [8] to perform in-kernel packet filtering. The advantage of this method is that only the relevant subset of the captured packets is copied to user space, which significantly reduces the load on the follower machine and, consequently, helps lower the drop rate [9].

When the follower is deployed on a machine physically interposed between the client and the leader, as shown in Figure 4(a) and Figure 4(b), one can deploy a packet capture method that is guaranteed to capture all packets. For instance, this can be achieved with the divert socket [10] feature of Linux, which allows a user-level program to inspect all IP packets that the host receives. The disadvantage of this method, however, is that packets are pulled out to user space and then inserted back into the kernel. Another approach is to modify the kernel to output a copy of each relevant packet to the library. This could be implemented in the device driver, in the TCP/IP stack, or between the device driver and the TCP/IP stack. With this approach, the unnecessary copying between user and kernel space is avoided.
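As an illustration of this capture path, a pcap-based tapping thread could be set up roughly as follows. This is a sketch, not the library's actual code: the filter expression and handle_packet() are placeholders, and the real implementation would hand the captured packets to the processing steps described next.

/* Sketch of a pcap capture loop with an in-kernel BPF filter, so that only
 * packets of the tapped connection are copied to user space. */
#include <pcap.h>
#include <stdio.h>

static void handle_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
{
    /* integrity checks, IP defragmentation and TCP reassembly go here */
    (void)user; (void)hdr; (void)bytes;
}

int run_tapping_thread(const char *device, const char *leader_ip, int port)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    char filter[128];
    struct bpf_program prog;
    pcap_t *p;

    /* e.g. "tcp and host 132.239.55.165 and port 80" */
    snprintf(filter, sizeof(filter), "tcp and host %s and port %d",
             leader_ip, port);

    /* promiscuous mode is needed when sniffing packets addressed to the
     * leader from a separate host on the same Ethernet segment */
    p = pcap_open_live(device, 65535, 1 /* promisc */, 1000, errbuf);
    if (p == NULL)
        return -1;
    if (pcap_compile(p, &prog, filter, 1, 0 /* netmask */) < 0 ||
        pcap_setfilter(p, &prog) < 0) {
        pcap_close(p);
        return -1;
    }
    return pcap_loop(p, -1 /* run until error or break */, handle_packet, NULL);
}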

The processing that takes place following packet capture can be roughly broken down into three parts: packet integrity checks, IP defragmentation, and TCP reassembly. These steps are analogous to the steps performed by the TCP/IP stack. Packet integrity is verified via TCP/IP checksums and packet length tests. The next step is the IP defragmentation code, which assembles fragmented IP packets. IP packets might get fragmented because the maximum packet size might differ for different networks. The fragments belonging to a single packet share an IP ID, which is a 16-bit field in the IP header. The defragmentation code joins fragments with the same IP ID. The final step is the reassembly of TCP segments into the original data stream. Out-of-order data, for instance, is placed in its correct location in the data stream during this step.
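The reassembly step can be pictured with the following toy sketch: segments are kept in a list ordered by sequence number, and data is handed to the tap reader only once it is contiguous with everything delivered so far. It is an illustration of the idea only; it ignores sequence number wrap-around, overlapping segments, and retransmissions of already delivered data, all of which the real code has to handle.

/* Toy TCP reassembly: out-of-order segments are held in a sorted list and
 * deliver() is called only for data contiguous with next_seq. */
#include <stdlib.h>
#include <string.h>

struct segment {
    unsigned long   seq;      /* sequence number of the first byte */
    size_t          len;
    unsigned char  *data;
    struct segment *next;
};

struct stream {
    unsigned long   next_seq; /* next byte expected in order       */
    struct segment *pending;  /* out-of-order segments, sorted     */
};

extern void deliver(const unsigned char *data, size_t len); /* to tap reader */

void tcp_insert(struct stream *s, unsigned long seq,
                const unsigned char *data, size_t len)
{
    struct segment **pp, *seg = malloc(sizeof *seg);
    seg->seq  = seq;
    seg->len  = len;
    seg->data = malloc(len);
    memcpy(seg->data, data, len);

    /* keep the pending list sorted by sequence number */
    for (pp = &s->pending; *pp && (*pp)->seq < seq; pp = &(*pp)->next)
        ;
    seg->next = *pp;
    *pp = seg;

    /* deliver every segment that is now contiguous with next_seq */
    while (s->pending && s->pending->seq == s->next_seq) {
        struct segment *head = s->pending;
        deliver(head->data, head->len);
        s->next_seq += head->len;
        s->pending = head->next;
        free(head->data);
        free(head);
    }
}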

4.3. Resource Deallocation

One problem we encountered is that of resource deallocation. The IP defragmentation code we use allocates certain data structures to keep track of the fragments received for each fragmented IP packet. Suppose that several fragments corresponding to one particular IP packet were received, and that the rest of the fragments needed to complete defragmentation of this IP packet were lost in the network. The fragments remaining in the IP defragmentation code now represent a potential memory leak, because the TCP payload contained in this IP packet might be retransmitted in a different way by TCP (say, in several packets that do not get fragmented). Furthermore, if the fragments are allowed to remain within the defragmentation module indefinitely, the IP ID they contain might be reused, and, consequently, new packets with the same IP ID might be incorrectly defragmented. This possibility exists in any networking stack implementing the IP protocol as specified in [13]. Its occurrence, however, is more likely in our library scenario, since fragments remain in the defragmentation code until they can be assembled into a complete packet. Assuming such fragments pass the TCP checksum after defragmentation, wrong data might be introduced into the data stream.

A simple solution to this problem is to implement a timeout, which is the approach used in the TCP/IP stack. With a timeout, however, we cannot guarantee that some fragments still needed for reassembly will not be dropped. Note that this is not a problem for the real TCP/IP stack, since TCP will ask for the data to be retransmitted if needed. In our case, however, we can use the leader to forward the missing data via the leader/follower protocol, or, if the system deploys a logger, the follower can ask the logger to retransmit the missing data.

4.4. Inconsistent Retransmission

During TCP operation, packets might be retransmitted, and the receiver will see several versions of the data corresponding to a particular range of sequence numbers. If the sender's TCP stack is operating correctly, these copies should be identical. Unfortunately, there are buggy TCP stacks that will generate different versions [15][17]. This problem can also occur if a malicious attacker uses a modified TCP stack at the client to purposely trick the library into accepting incorrect data. This problem often comes up in Intrusion Detection Systems (IDS) [16][17], which contain functionality similar to our tapping code. These systems typically monitor TCP streams and try to find attack patterns within them. If an IDS can be tricked into accepting erroneous data, it becomes desynchronized from the machine that it is supposed to protect.

This problem can be easily addressed with the leader/follower protocol, by having the leader send the follower checksums for the data along with the return codes. The tapping library can then checksum the data it is about to pass up to the application, compare it with the checksum received for the same data from the leader, and detect an inconsistency if the two checksums do not match.
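The check itself is simple, as the following sketch shows. lf_checksum() is an assumed helper shared by both modified libraries; the leader's checksum would arrive together with the read result over the leader/follower protocol.

/* Sketch: before tapped bytes are passed up to the follower application,
 * compare their checksum with the one computed by the leader over the data
 * it actually delivered to its application. */
#include <stddef.h>
#include <stdint.h>

extern uint32_t lf_checksum(const void *buf, size_t len);   /* assumed */

/* Returns 0 if the tapped data matches the leader's data, and -1 on an
 * inconsistent retransmission (the leader must then supply the data). */
int verify_tapped_data(const void *tapped, size_t len, uint32_t leader_sum)
{
    return (lf_checksum(tapped, len) == leader_sum) ? 0 : -1;
}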

5. Performance

5.1. Measures Of Performance

To evaluate the efficiency of our approach, we performed latency, bandwidth, and drop rate experiments. We first describe the latency and bandwidth experiments. They are motivated by the fact that, with any deployment topology, if we want to avoid forwarding data from the leader or the client to the follower, we need at least one "logger" host through which all of the TCP traffic between the client and the leader is guaranteed to flow. (We can avoid the resulting single point of failure by routing traffic around a failed logger. Moreover, one can use multiple logger hosts to mask multiple failures, together with a "stable" main memory mechanism like the one introduced by the Rio file cache [18] to recover from permanent logger failures.) This is a fundamental requirement, because if it is not met, the follower might not be able to capture all TCP traffic between the client and the leader. The latency and bandwidth experiments were designed to investigate whether the addition of such a logger host would have a significant negative effect on the performance of the TCP connection between the client and the leader.

We used latency and bandwidth as measures of performance for the following reasons. Low latency is important for applications that send brief periodic messages (such as network games that transmit changes in player positions). High bandwidth, on the other hand, is important for applications that send large amounts of data, such as FTP servers.

Figure 6. Base TCP round trip time (RTT in seconds vs. measurement number). (a) Excluding connection establishment and teardown. (b) Including connection establishment and teardown.

Figure 7. Extra host TCP rtt (RTT in seconds vs. measurement number). (a) Excluding connection establishment and teardown. (b) Including connection establishment and teardown.

The drop rate is a measure of the efficacy of sniffing as a TCP connection tapping technique. We were interested in evaluating sniffing because the Ethernet topology appears to be the most appealing (see the discussion in Section 3.3), assuming that the sniffer can capture most of the TCP traffic between the client and the leader. We present drop rate results for a sniffer located on a machine sharing an Ethernet segment with the leader, and for a sniffer located on a machine performing proxy ARP for the leader. Note that in the latter case, we can use one of the reliable packet capture techniques previously described, because the proxy ARP machine is physically interposed between the client and the leader. Though we are primarily interested in the drop rate for the Ethernet sniffer, we present results for both topologies, because they give insight into why the sniffer drops packets, and because sniffing is the only packet capture technique we have fully implemented so far. The drop rate results below show that, with an appropriately large socket buffer, sniffing appears to be a safe technique in the proxy ARP topology (though we cannot, of course, make that guarantee). We used PentiumIII machines connected by 100Mbit Ethernet to perform our experiments.

5.2. Latency And Bandwidth

The latency results for the base topology are shown in Figure 6(a) and Figure 6(b). In this topology, the client and the leader are on the same Ethernet segment. We wrote a simple client-server application in which the client sent a single byte of data to the server, and the server replied with the same byte of data. We measured the round trip time (rtt) of this byte of data. Figure 6(a) shows the rtt on an established TCP connection, and Figure 6(b) shows the rtt including TCP connection establishment and teardown time. Both measurements are important, since some applications care about fast connection establishment and teardown time, while others do not (consider, for example, a web browser that is using HTTP 1.0 versus one that is using HTTP 1.1).

The average rtt excluding connection establishment and teardown was 0.2 msec, and the average rtt including connection establishment and teardown was 0.54 msec (the standard error over 10000 experiments was negligible in both cases). We then introduced a host between the client and the leader; this host was configured to perform proxy ARP for the leader. The latency results for this topology are shown in Figure 7(a) and Figure 7(b). The average rtt excluding connection establishment and teardown was 0.27 msec, and the average rtt including connection establishment and teardown was 0.7 msec (again, the standard error was negligible). Thus, the latency overhead induced by the extra host is about 35 percent without TCP connection establishment, and about 30 percent with TCP connection establishment and teardown. While this overhead appears to be significant, it should be tolerable for many applications, because the absolute round trip times with the extra host on the critical path are very small.

The bandwidth results for both the base and the extra host topologies are shown in Figure 8(a). We wrote a client-server application that sent messages of increasing size on an established TCP connection, until the maximum attainable bandwidth was reached. The bandwidth is calculated as

bandwidth = m / (T - rtt/2),

where m is the message size, T is the time taken to send the message to the server and get an acknowledgment, and rtt is the average round-trip time described above. T is measured by the client and includes the time to get an acknowledgment from the server application indicating that all the data has been received; the rtt/2 term is subtracted from T to account for this acknowledgment. In practice, however, the rtt/2 factor is negligible, especially for large values of m, and we ignore it in the bandwidth calculations.
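As a rough illustration of why the correction can be ignored (the exact values of T are not reported, so the numbers below are only indicative): sending a 32 Mbyte message at the observed maximum of roughly 70 Mbits/second takes

T = (32 × 8) Mbit / 70 Mbit/s, which is approximately 3.7 s,

while rtt/2 is about 0.1 msec (half of the 0.2 msec base rtt measured above), so dropping the rtt/2 term changes the computed bandwidth by far less than 0.01 percent.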

Figure 8. Observed bandwidth (bits per second vs. message size in bytes, log scale; curves: base topology and extra proxy ARP host). (a) Base vs. extra host with 64Kbyte socket buffers. (b) Base vs. extra host with 256Kbyte socket buffers.

We varied the message size, which is plotted on the x axis using a log scale, from 1Kbyte to 32Mbyte. The observed bandwidth is plotted on the y axis. Default (64Kbyte) socket buffers were used. Each data point is the average of ten experiments, and the associated error bar represents the standard error (i.e., the standard deviation divided by the square root of the number of experiments). The results are surprising because, for large message sizes, the maximum observed bandwidth is higher in the extra host topology. We think that this might be due to the additional buffering that the extra host provides. A set of bandwidth results with larger socket buffers is shown in Figure 8(b). We increased the socket buffers to 256Kbyte, which allowed us to reach the maximum bandwidth (further increases in socket buffer size do not lead to higher bandwidth). This graph shows that, for large message sizes, the maximum bandwidth in the extra host topology is roughly the same as the maximum bandwidth in the base topology.

5.3. Drop Rate

We wrote a client-server application that sent data on a TCP connection between the client and the leader at bandwidths ranging from 8Mbits/second to the maximum attainable bandwidth. We then installed a pcap-based sniffer and counted the number of sniffed packets. We were interested in the drop rate given by:

drop rate = (n_ldr - n_both) / n_ldr,

where n_ldr is the number of packets received by the leader, and n_both is the number of packets that are received by both the sniffer and the leader. Since this drop rate is difficult to measure (we would need to make sure a given packet is received by both the leader and the sniffer before incrementing n_both), we used an approximate drop rate measure given by

drop rate = (n_snt - n_rcv) / n_snt,

where n_snt is the number of packets sent by the client and n_rcv is the number of packets received.
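As a purely illustrative example (the packet counts are hypothetical): if the client sends 100,000 packets and the sniffer captures 99,830 of them, the approximate drop rate is (100,000 - 99,830) / 100,000 = 0.17 percent.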

Figure 9. Drop rate with default (64Kbyte) socket buffer. Each panel plots the drop rate (%) against the transmission rate (Mbits/second) for four curves: sniffer and server drop rates for the client to server stream, and sniffer and client drop rates for the server to client stream. (a) Ethernet sniffer. (b) Proxy ARP sniffer.

Figure 10. Drop rate with 256Kbyte socket buffers. Same curves as Figure 9. (a) Ethernet sniffer. (b) Proxy ARP sniffer.

For the client to leader part of the TCP stream, we measured the drop rates both at the sniffer and at the leader. The drop rate at the sniffer indicates how many packets need to be retransmitted by the logger. Counts of packets sent by the client and received by the leader were obtained by reading the TCP statistics, reported in the /proc file-system in Linux. These statistics are cumulative for all TCP connections to and from the host, so we shut off all other TCP connections and services during measurement. The number of packets received by the sniffer was obtained by counting the number of packets received with pcap. Even though our application sent data from the client to the leader, there is also the leader to client part of the TCP stream, which consists of packets generated by the leader's TCP stack, such as acknowledgments. For this stream, we measured the drop rates at the sniffer and at the client. At first, to get an indication of the raw packet capture capability of pcap, we measured the drop rates without the modified system library.

Note that the version of pcap we used relies on packet filtering in the kernel, via BPF. Figure 9(a) shows the drop rates for the Ethernet sniffer, and Figure 9(b) shows the drop rates observed with the sniffer located at the proxy ARP machine. Default (64Kbyte) socket buffers were used to obtain both of these results. Each graph contains four curves:

- sniffer drop rate for the client to leader part of the TCP stream
- leader drop rate for the client to leader part of the TCP stream
- sniffer drop rate for the leader to client part of the TCP stream
- client drop rate for the leader to client part of the TCP stream

Figure 9(b) shows that when the sniffer is located at the proxy ARP machine, all drop rates are zero.

Figure 11. Drop rate at library level with 64Kbyte follower socket buffer. Each panel plots the drop rate (%) against the transmission rate (Mbits/second) for the library and server drop rates on the client to server stream and the library and client drop rates on the server to client stream. (a) Ethernet sniffer. (b) Proxy-ARP sniffer.

Figure 12. Drop rate at library level with 48Mbyte follower socket buffer. Same curves as Figure 11. (a) Ethernet sniffer. (b) Proxy-ARP sniffer.

This is not surprising, since all packets going to and from the leader physically traverse this machine. With the sniffer located on the host sharing an Ethernet segment with the leader, we observed positive (but very small) drop rates. The maximum drop rate we observed in this case is about a tenth of a percent (the second point on the curve for the stream going from the leader to the client in Figure 9(a)).

While these results are encouraging, the maximum bandwidth we observed was about 64Mbits/second, and we wanted to find out if the drop rate increases when the data is sent at a higher rate. We increased the size of the socket buffers until no further increase in bandwidth was possible, and repeated the experiments. We got a maximum bandwidth of about 74Mbits/second with 256Kbyte socket buffers. For the Ethernet topology, the results are shown in Figure 10(a), and, for the proxy ARP topology, the results are shown in Figure 10(b). The new proxy ARP results are similar to the results we obtained with default socket buffers (see Figure 9(b)). The only difference between the two results is that there is a slightly positive client drop rate for the leader to client stream. Apparently, packets are getting lost sometime after they are registered at the proxy ARP host (the proxy ARP host, for example, might be dropping them). We are, however, primarily interested in the sniffer drop rates, which are still zero.

The results for the Ethernet topology are shown in Figure 10(a). The drop rates are positive, but still fairly low (the maximum follower drop rate, which is the first point on the curve for the stream of packets going from the leader to the client, is about 1.7 percent).

We then repeated the drop rate experiments described above with the pcap sniffer replaced by a modified system library that implements the tapping interface described in Section 4.1. Although this library uses exactly the same sniffer, we expected the follower drop rate to increase, since the library puts additional load on the machine (both in terms of memory and processing), which is likely to reduce the number of packets the sniffer is able to capture.

In fact, the results in Figure 11(a) and Figure 11(b) show that the drop rates increase significantly for both topologies (note that the default pcap socket buffer is used in these figures). In particular, the drop rate with the library deployed at the proxy ARP host is extremely high (up to 18 percent). This is because the proxy ARP machine is overloaded: it has to perform IP forwarding, packet capture, and run the library. The drop rate with the library deployed on the Ethernet host has also increased (with a maximum of about 3.5 percent for the follower drop rate, which is the next to last point on the curve for the stream of packets going from the leader to the client in Figure 11(a)).

We experimented with the size of the pcap socket buffer at the follower host, and found that increasing it (up to 48Mbyte) significantly reduces the drop rate. The results with the library at the proxy ARP machine are shown in Figure 12(b). The drop rates are much lower than the ones in Figure 11(b), and, in fact, are back to the near-zero levels we were getting without the library, as shown in Figure 10(b). The results for the Ethernet topology are shown in Figure 12(a). Here, the drop rates are also significantly lower compared to Figure 11(a), and back to the levels we were getting without the library, as shown in Figure 10(a).

Note that while we can decrease the drop rates to (practically) zero by increasing the socket buffer of the proxy ARP sniffer, this is not the case with the Ethernet sniffer. The difference is that all packets must traverse the proxy ARP machine, while the Ethernet machine has to capture the packets as they pass by. In any case, the conclusion is that increasing the size of the socket buffer used by pcap can significantly decrease the number of dropped packets. With appropriately large socket buffers, sniffing appears to be an effective TCP connection tapping technique.

6. System Status and Related Work

The sniffer-based tapping library has been fully implemented, and the drop rate experiments described in the previous section were performed with this library in place. A perfect failure detection protocol is implemented [19]: this protocol makes sure that the leader is down before a follower is promoted to leader, and thus prevents executions in which two leaders exist at the same point in time. The implementations of the logger and the leader/follower protocol are in progress.

To fully evaluate the performance of our approach, bandwidth and latency experiments with the logger in place still need to be performed. However, we do not expect the logger-follower protocol to have a significant performance overhead, for the following reasons. First, the drop rate experiments show that with large enough socket buffers, very few packets are dropped; hence, we expect retransmissions from the logger to occur rarely. Second, the logger-follower protocol does not involve any sophisticated packet processing and, therefore, should not impose a significant computational load on the logger machine.

An approach for logging and recovery of TCP connections is described in [20]. The primary focus of [20] is different from ours: they attempt to keep a connection alive while a failed server recovers, rather than switching it over to a hot backup. The authors of [20] do mention that their solution can be used with a hot backup instead of a logger. However, they do not address the technical problems that occur in this scenario (e.g., the issue of the hot backup's TCP stack receiving ACKs for bytes that have been produced by the primary's TCP stack, but have not yet been produced by the backup's TCP stack). Unlike our system, the approach of [20] does not need special libraries on the client side. Since modern link loaders simplify the wrapping of system libraries, wrapping client side libraries is not a major technical problem; however, the wrappers have to be distributed to the client side, e.g., by bundling them with the client program.

The system of [20] is implemented in two layers of software, called wrappers, interposed below and above the TCP stack. The approach of [20] suffers from performance problems, due to the fact that sequence numbers are adjusted by the wrapper to make sure that the server does not acknowledge data not yet processed by the logger. This interferes with the normal TCP congestion control mechanism. The authors cope with the problem by introducing a much faster (10x) link between the server and the logger, effectively making the link between the client and the server the bottleneck. This approach, however, does not appear to scale well if the link between the client and the server is also upgraded. Another drawback of the approach is that the server crash has to be masked from the client's TCP stack. For example, when the server process crashes, FINs are sent on all open socket descriptors of that process. The challenge is to differentiate such FINs from FINs generated by normal close() calls, and to intercept them before they are sent out to the client. The authors rely on the availability of OS-specific information to do this. Also, the authors only consider the problem of restarting a failed server from a log. This requires storage of all communication between the client and the server for each logged connection; furthermore, replay of this communication implies that the recovery process does not scale well.

Another approach for migrating TCP connections is described in [21]. Unlike our technique, this approach is more suitable in cases where the backup server is located across a wide area network from the primary server. This approach, however, has a number of limitations. First, it requires a modified TCP stack which implements a protocol for connection migration. All of the communicating hosts, i.e., the client, the primary server, and one or more backup servers, have to run such a modified TCP stack. Second, connection migration is only supported at particular times. For example, if the primary crashes in the middle of processing the initial request from the client (e.g., a GET in the case of HTTP), the connection cannot be migrated and will be lost. The reason is that migration information, which includes the name of the object requested, has to be generated and propagated to the backup(s) before any migration can be performed. This also means that there must be a handler which can parse requests of each supported protocol and generate the initial connection migration information, such as the name of the object being requested. Third, this approach is not a general-purpose TCP connection migration technique, since it can only be used with certain applications. In particular, the authors consider deterministic servers that, upon an initial request, stream data to the client; the example used by the authors is a web server. After the TCP connection is migrated according to the protocol implemented in the modified TCP stacks, it is up to the backup server to resume data transmission starting at the byte following the last byte successfully received by the client. Thus, the application must provide an API which allows data transmission from a particular byte offset. In the case of a web server, an HTTP Range request (specified in HTTP 1.1) must be supported. Also, the approach requires a handler which strips the protocol header from the data stream produced by the backup server, because the header has already been transmitted by the primary server to the client at the time of connection migration. Thus, in the case of a web server, there must be a handler which strips the HTTP header from the stream going to the client. Such a handler needs to be aware of every supported protocol.

7. Conclusion

In this paper we described a TCP tapping mechanism that we designed to improve the performance of leader/follower replication. The idea is that a follower can reduce the load on the leader by sniffing the data from the network instead of relying on the leader to forward it. We implemented a tapping mechanism that can reconstruct TCP streams from sniffed network packets, and we performed extensive performance measurements to evaluate the effectiveness of this approach. In our future work we plan to integrate this TCP tapping mechanism with a leader/follower protocol we have been building.

References

[1] Fred Schneider. Implementing fault-tolerant services using the state machine approach. ACM Computing Surveys, 22(4):299-319, December 1990.
[2] P. A. Barret, A. M. Hilborne, P. G. Bond, D. T. Seaton, P. Verissimo, L. Rodrigues, and N. A. Speirs. The Delta-4 extra performance architecture (XPA). In Proc. of the 20th International Symposium on Fault-Tolerant Computing, pages 481-488, June 1990.
[3] Thomas C. Bressoud. TFT: A software system for application-transparent fault tolerance. In Proc. of the Symposium on Fault-Tolerant Computing, pages 128-137, 1998.
[4] Thomas Bressoud and Fred Schneider. Hypervisor-based fault-tolerance. ACM Transactions on Computer Systems, 14(1):80-107, February 1996.
[5] F. Cristian, H. Aghili, R. Strong, and D. Dolev. Atomic broadcast: From simple message diffusion to Byzantine agreement. Information and Computation, 118(1):158-179, April 1995.
[6] Cisco Systems. Cisco LocalDirector 400 Series. http://www.cisco.com/warp/public/cc/pd/cxsr/400/index.shtml.
[7] M. Leech, M. Ganis, Y. Lee, R. Kuris, D. Koblas, and L. Jones. SOCKS Protocol Version 5. RFC 1928, March 1996.
[8] Steven McCanne and Van Jacobson. The BSD packet filter: A new architecture for user-level packet capture. In USENIX Technical Conference Proceedings, San Diego, CA, pages 259-269, 1993.
[9] Vern Paxson. Automated packet trace analysis of TCP implementations. In Proceedings of SIGCOMM, 1997.
[10] Ilia Baldine. Divert Sockets mini HOWTO. http://www.linuxdoc.org/HOWTO/mini/Divert-Sockets-mini-HOWTO.html.
[11] G. R. Wright and R. W. Stevens. TCP/IP Illustrated, Volume 2: The Implementation. Addison-Wesley Publishing Company, Inc., 1995.
[12] V. Jacobson, R. T. Braden, and D. A. Borman. TCP Extensions for High Performance. RFC 1323, 1992.
[13] J. B. Postel. Internet Protocol. RFC 791, 1981.
[14] J. B. Postel. Transmission Control Protocol. RFC 793, 1981.
[15] V. Paxson et al. Known TCP Implementation Problems. RFC 2525, 1999.
[16] T. H. Ptacek and T. N. Newsham. Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection, 1998.
[17] V. Paxson. Bro: A system for detecting network intruders in real-time. In Proceedings of the 7th USENIX Security Symposium, 1998.
[18] Peter M. Chen, Wee Teck Ng, Subhachandra Chandra, Christopher Aycock, Gurushankar Rajamani, and David Lowell. The Rio file cache: Surviving operating system crashes. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 74-83, 1996.
[19] Christof Fetzer. Enforcing perfect failure detection. In Proceedings of the 21st International Conference on Distributed Computing Systems, Phoenix, AZ, April 2001.
[20] L. Alvisi, T. C. Bressoud, A. El-Khashab, K. Marzullo, and D. Zagorodnov. Wrapping server-side TCP to mask connection failures. In Proceedings of INFOCOM 2001, pages 329-337, 2001.
[21] Alex C. Snoeren, David G. Andersen, and Hari Balakrishnan. Fine-grained failover using connection migration. In Proc. of the Third Annual USENIX Symposium on Internet Technologies and Systems (USITS), March 2001.