A Fault-Tolerant TCP Scheme Based on Multi-Images
Hai Jin, Jie Xu, Bin Cheng, Zhiyuan Shao, and Jianhui Yue
Huazhong University of Science and Technology, Wuhan, 430074, China
Email: [email protected]

Abstract—The fault tolerance of TCP is a key technology for guaranteeing the availability of server-side services. This paper discusses a fault-tolerant TCP scheme based on multi-images. In this scheme, each TCP connection has two synchronous connection images, so there is no need to back up the status of every TCP connection. The scheme is implemented as a module mechanism in the Linux kernel and does not affect the software running on either the client or the server. It guarantees that each TCP connection is taken over seamlessly when it fails, transparently to the user. At the same time, it tries to reduce the side effect on system performance while ensuring the fault tolerance of each connection.

This work was supported in part by the National High-Tech 863 project of China under Grant 2002AA1Z2102.
I. INTRODUCTION
More and more Internet services are built over TCP connections, such as HTTP, FTP, and Telnet. The running status of these services and their connections is lost if they fail abruptly. The traditional TCP protocol cannot provide fault tolerance for TCP connections, so the fault tolerance of TCP is a key technology for guaranteeing the availability of services.
Several schemes for fault-tolerant TCP exist. One approach [1][2][3] recovers the TCP connection by inserting a layer of software between the TCP layer and the application layer on both the client and the server. This layer checkpoints the TCP connection state on both sides, so the connection between the old client and a new server can be re-established consistently when the old server crashes. The main drawback of this approach is that the client code needs to be modified, which is not transparent to the client. A second approach [4][5] redirects TCP connections through a proxy, leaving the client and server code untouched. The proxy maintains the state of every TCP connection and can recover a connection when the server crashes. The main drawback is that the proxy is a single point of failure. A third approach is FT-TCP [6], a scheme for transparent recovery of a crashed process with open TCP connections. A wrapper around the TCP layer intercepts and logs reads by the process for replay during recovery, and shields the remote endpoint from the failure. The wrapper is composed of three components: the SSW, the NSW, and the Logger. The SSW sits between the IP layer and the TCP layer. It receives data packets from the IP layer and passes them to both the Logger and the TCP layer; in the other direction, it receives data packets from the TCP layer and sends them to the client through the IP layer. The NSW sits between the TCP layer and the application layer and logs the amount of data returned by each read socket call. The Logger runs on another server and stores information for recovery purposes, such as the connection state information, the data,
and the read lengths. With FT-TCP, every packet passing through the TCP and IP layers must be recorded in the Logger, which makes the time to recover a particular connection high, because reincarnating a connection must replay it from the beginning.
This paper proposes and studies a fault-tolerant TCP scheme based on multi-images, called MI_TCP, which does not suffer from the drawbacks of the previous approaches. The scheme gives each TCP connection two synchronous connection images and does not need to back up the status of every TCP connection. It is implemented as a module mechanism in the Linux kernel and does not affect the software running on either the client or the server. This guarantees that each TCP connection can be taken over seamlessly when it fails, transparently to the user. Because no proxy is used, there is no single point of failure. At the same time, the scheme tries to reduce the side effect on system performance while ensuring the fault tolerance of each connection.
The remainder of this paper is organized as follows. Section 2 gives an overview of the major components of MI_TCP. Section 3 discusses failure takeover. Section 4 presents the performance evaluation of MI_TCP. We end with conclusions and future work in Section 5.
II. SYSTEM ARCHITECTURE
MI_TCP is implemented in a cluster of two servers. One server is the Dispatch Server, which dispatches each TCP connection to each server in the cluster and makes sure that both servers receive the requests from the client. The other server is the Aggregate Server, which aggregates the responses from both servers in order to guarantee the data consistency of the two servers. We call this system the highly available (HA) cluster (see Fig. 1). Each server in the cluster has two network cards (eth0 and eth1), and there are two logically independent networks, a private network and a public network. Both eth1 interfaces are connected to the private network, which provides a channel for internal cluster communication such as heartbeat monitoring. Both eth0 interfaces are connected to the public network to communicate with the client. Only the eth0 bound to the Dispatch Server is activated, and its IP address is the only address advertised to the Internet; we call it the virtual IP address (VIP for short).
Three components are provided in the MI_TCP scheme: a heartbeat monitoring process, which monitors whether a server has crashed; an IP takeover process, which takes over the virtual IP address when the Dispatch Server crashes; and a fault-tolerant TCP connection process, which resumes the data transfer seamlessly when a server crashes.
[Fig. 1: HA cluster architecture. The Dispatch Server and the Aggregate Server each run the IP takeover, heartbeat monitor, and fault-tolerant TCP connection components over a common data set, and both serve the client.]
A. Heartbeat monitoring
The endpoint of a connection changes when a server crashes. The cluster must be able to detect this, after which the new server can be selected and the connection resumed. We call the process in the cluster that monitors the states of the servers the heartbeat monitor. There are several possible designs for a heartbeat monitor; they can be broadly classified into centralized and distributed implementations, and we use the latter. The focus of this paper is not on novel mechanisms for heartbeat monitoring; instead, we leverage related work that has already been done in this area [7][8]. The heartbeat monitor process we designed is composed of two threads: the HeartbeatThread and the RcvThread. The HeartbeatThread sends a heartbeat to the other server to show that it is alive, and the RcvThread receives heartbeats from the other server (a simplified sketch follows after the next subsection).
B. IP takeover
The virtual IP address, through which clients communicate with the cluster, is bound to eth0 of the Dispatch Server when the cluster starts up. The resources of the Dispatch Server must all be taken over by the Aggregate Server when the Dispatch Server fails. With MI_TCP, the Aggregate Server only needs to take over the VIP; its other resources are already consistent with those of the Dispatch Server and need not be taken over. When the VIP is bound to the Aggregate Server, a gratuitous ARP [9] request composed of the MAC address of the Aggregate Server and the VIP is sent out, so that all clients learn that the Aggregate Server now holds the VIP. The Aggregate Server can then receive the data packets that clients send to the server holding the VIP, and clients do not notice the failure of the Dispatch Server.
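To make the two-thread design concrete, the following user-space C sketch shows one possible shape of the heartbeat monitor. It is a minimal illustration, not our actual module: the UDP port, the peer address on the private network, and the 250 ms interval are assumptions made for the example.

/*
 * A minimal user-space sketch of the two-thread heartbeat monitor.
 * The UDP port, the peer address, and the 250 ms interval are
 * assumptions made for this example.
 */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

#define HB_PORT        9999      /* assumed private-network port */
#define HB_INTERVAL_US 250000    /* assumed interval: 250 ms     */

static volatile time_t last_update_time;   /* LastUpdateTime */

/* HeartbeatThread: periodically tell the peer that we are alive. */
static void *heartbeat_thread(void *peer_ip)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in peer = { .sin_family = AF_INET,
                                .sin_port   = htons(HB_PORT) };
    inet_pton(AF_INET, (const char *)peer_ip, &peer.sin_addr);
    for (;;) {
        sendto(s, "alive", 5, 0, (struct sockaddr *)&peer, sizeof peer);
        usleep(HB_INTERVAL_US);
    }
    return NULL;
}

/* RcvThread: stamp the local clock on every heartbeat received. */
static void *rcv_thread(void *unused)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in me = { .sin_family      = AF_INET,
                              .sin_addr.s_addr = INADDR_ANY,
                              .sin_port        = htons(HB_PORT) };
    char buf[16];
    bind(s, (struct sockaddr *)&me, sizeof me);
    for (;;)
        if (recv(s, buf, sizeof buf, 0) > 0)
            last_update_time = time(NULL);  /* the peer is alive */
    return NULL;
}

int main(void)
{
    pthread_t hb, rcv;
    pthread_create(&rcv, NULL, rcv_thread, NULL);
    pthread_create(&hb, NULL, heartbeat_thread, (void *)"10.0.0.2"); /* assumed peer */
    pthread_join(hb, NULL);
    return 0;
}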
C. Fault-tolerant TCP connection
We now describe the operation of the fault-tolerant TCP connection process. After introducing the workflow of this process, we describe the algorithms of the Dispatcher and the Aggregater, its two primary components.
1) Workflow: Fig. 2 describes the workflow of the fault-tolerant TCP connection.
[Fig. 2: The workflow of the fault-tolerant TCP connection. Six steps connect the client, the Dispatcher on the IP level of the Dispatch Server, and the Aggregater on the IP level of the Aggregate Server.]
The client sends a request to the server holding the VIP (Step 1). The request received by the Dispatch Server is copied and sent to the Aggregate Server by the Dispatcher, which is located on the IP layer of the Dispatch Server (Step 2). Both the Dispatch Server and the Aggregate Server process the requests they have received (Step 3). After processing the requests, both servers send their results to their IP layers (Step 4). When its response reaches the IP layer, the Dispatch Server modifies the destination address to the IP address of the Aggregate Server, recomputes the checksum, and sends the modified response to the Aggregater, which is located on the IP layer of the Aggregate Server (Step 5). The aggregated response is sent to the client after its source IP address has been modified to the VIP (Step 6).
2) Dispatcher: The Dispatcher is responsible for intercepting data packets and dispatching them to the Aggregater. If the destination address of an incoming packet is not the VIP, the packet is dropped. Otherwise the Dispatcher makes a copy of the packet, modifies the copy's destination address to the Aggregate Server's address, recomputes the checksum, and forwards it; the original packet is delivered locally by calling ip_rcv_finish. A sketch of this hook appears below.
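The following sketch shows how such a Dispatcher could be written as a Linux 2.4-era netfilter PRE_ROUTING hook. It is illustrative only: the VIP and Aggregate Server addresses are placeholders, and forward_to_aggregater() is a hypothetical helper standing in for routing and transmitting the rewritten copy.

/*
 * A minimal sketch of the Dispatcher as a Linux 2.4-era netfilter
 * PRE_ROUTING hook. VIP and AGGREGATE_IP are placeholder addresses,
 * and forward_to_aggregater() is a hypothetical helper.
 */
#include <linux/module.h>
#include <linux/ip.h>
#include <linux/skbuff.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>

#define VIP          htonl(0xC0A80164)   /* 192.168.1.100, assumed */
#define AGGREGATE_IP htonl(0xC0A80165)   /* 192.168.1.101, assumed */

extern void forward_to_aggregater(struct sk_buff *skb);  /* hypothetical */

static unsigned int dispatcher_hook(unsigned int hooknum,
                                    struct sk_buff **pskb,
                                    const struct net_device *in,
                                    const struct net_device *out,
                                    int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph = (*pskb)->nh.iph;
    struct sk_buff *copy;

    if (iph->daddr != VIP)
        return NF_DROP;                    /* not addressed to the VIP */

    copy = skb_copy(*pskb, GFP_ATOMIC);    /* duplicate for the second image */
    if (copy != NULL) {
        struct iphdr *ciph = copy->nh.iph;
        ciph->daddr = AGGREGATE_IP;        /* redirect the copy */
        ip_send_check(ciph);               /* recompute the IP checksum */
        forward_to_aggregater(copy);
    }
    return NF_ACCEPT;   /* original proceeds to local TCP via ip_rcv_finish */
}

static struct nf_hook_ops dispatcher_ops = {
    .hook     = dispatcher_hook,
    .pf       = PF_INET,
    .hooknum  = NF_IP_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init dispatcher_init(void)  { return nf_register_hook(&dispatcher_ops); }
static void __exit dispatcher_exit(void) { nf_unregister_hook(&dispatcher_ops); }

module_init(dispatcher_init);
module_exit(dispatcher_exit);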
3) Aggregater: The responses from the Dispatch Server and the Aggregate Server may differ because of their different resources, such as the window size. In order to synchronize the status of the two servers, an Aggregater is introduced. MI_TCP maintains the following data structures: node_info, which records the sending sequence number, the acknowledgement sequence number, and the window size; and ha_conn, which records the information needed for synchronization. The Aggregater works as follows (a simplified decision sketch appears after the list):
(i) If a data packet arrives, the send_seq and ack_seq of the sending server are updated and the Aggregater recomputes min_send_seq and min_ack_seq of ha_conn, then goes to (iv);
(ii) If the packet is an ACK packet without data, the Aggregater sends it directly, then goes to (iv);
(iii) If the packet is a retransmitted packet, the Aggregater sends it directly to avoid losing the packet;
(iv) The Aggregater drops the packet if it arrives out of order and waits for the in-order packet;
(v) If the packet arrives in order, the Aggregater checks whether min_send_seq has changed. If so, the Aggregater sends the smaller of the packets sent by the two servers;
(vi) The Aggregater modifies the packet to be sent to the client: the window field of the TCP header is set to min_window, its ack_seq is set to min_ack_seq, and its source IP address is set to the VIP. At the same time, the TCP checksum and the IP checksum are both recomputed;
(vii) The modified packet is sent to the client.
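The following C sketch illustrates steps (i) through (v) of this decision logic for a single connection. The node_info and ha_conn structures mirror the ones named above, but the extra expected_seq field, the action enum, and the neglect of sequence-number wraparound are simplifying assumptions.

/*
 * A simplified user-space sketch of steps (i)-(v) for one connection.
 * expected_seq, the action enum, and ignoring sequence wraparound
 * are assumptions made for this example.
 */
#include <stdint.h>

struct node_info {                 /* per-image state */
    uint32_t send_seq;             /* sending sequence number */
    uint32_t ack_seq;              /* acknowledgement number  */
    uint16_t window;               /* advertised window size  */
};

struct ha_conn {                   /* per-connection sync state */
    struct node_info srv[2];       /* [0] Dispatch, [1] Aggregate */
    uint32_t min_send_seq;
    uint32_t min_ack_seq;
    uint16_t min_window;
    uint32_t expected_seq;         /* assumed: next in-order seq */
};

enum action { SEND_DIRECT, SEND_MERGED, DROP_AND_WAIT, HOLD };

static uint32_t min32(uint32_t a, uint32_t b) { return a < b ? a : b; }

enum action aggregate(struct ha_conn *c, int srv, uint32_t seq,
                      uint32_t ack, uint16_t win,
                      int has_data, int is_retransmit)
{
    uint32_t old_min;

    if (is_retransmit)              /* (iii) forward retransmissions */
        return SEND_DIRECT;
    if (!has_data)                  /* (ii) pure ACK: forward */
        return SEND_DIRECT;

    /* (i) record this image's state and refresh the minima */
    c->srv[srv].send_seq = seq;
    c->srv[srv].ack_seq  = ack;
    c->srv[srv].window   = win;
    old_min = c->min_send_seq;
    c->min_send_seq = min32(c->srv[0].send_seq, c->srv[1].send_seq);
    c->min_ack_seq  = min32(c->srv[0].ack_seq,  c->srv[1].ack_seq);
    c->min_window   = c->srv[0].window < c->srv[1].window ?
                      c->srv[0].window : c->srv[1].window;

    if (seq != c->expected_seq)     /* (iv) out of order: drop, wait */
        return DROP_AND_WAIT;
    if (c->min_send_seq != old_min) /* (v) minimum advanced: send   */
        return SEND_MERGED;         /*     the smaller packet       */
    return HOLD;
}

The packet rewriting of steps (vi) and (vii), i.e., setting the window, ack_seq, and source address and recomputing the checksums, would be performed by the caller on the chosen packet.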
D. Synchronization of initializing TCP connection
The fault-tolerant TCP connection process works properly on the premise that the initial sequence numbers (ISNs) of the two TCP connection images are consistent. The procedure is illustrated in Fig. 3. When the Dispatcher captures a SYN packet delivered by the client, the Dispatch Server fills the unused acknowledgement-number field with a secure sequence number that does not conflict with other connections and sets the reserved field (tcph->res1) in the packet. The modified SYN packet is then delivered to the TCP layer of the Dispatch Server and, after being copied, is also sent to the Aggregate Server. When the TCP layer of the Dispatch Server receives the modified SYN packet, it fetches the secure sequence number from where the acknowledgement number used to be and uses it as the ISN for the second handshake. Likewise, when the Aggregate Server receives a SYN packet with the special flag, it does the same as the Dispatch Server. In this way the initial sequence numbers of the two TCP connection images are synchronized (a sketch follows Fig. 3).
[Fig. 3: Synchronization of initializing the TCP connection. The client sends syn(J); the Dispatch Server forwards syn(J) with the flag set; both servers answer with ack(J+1), syn(K) built from the shared secure sequence number K.]
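The following sketch illustrates how the secure sequence number can be stashed in the SYN packet as described above. choose_secure_isn() is a hypothetical helper, and the TCP checksum update is omitted here.

/*
 * A sketch of stashing the secure ISN in the client's SYN.
 * choose_secure_isn() is a hypothetical helper; the checksum
 * recomputation is omitted for brevity.
 */
#include <linux/types.h>
#include <linux/tcp.h>
#include <asm/byteorder.h>

extern __u32 choose_secure_isn(void);      /* hypothetical helper */

/* Dispatch Server: tag the SYN before local delivery and copying. */
static void tag_syn_with_isn(struct tcphdr *tcph)
{
    if (tcph->syn && !tcph->ack) {
        /* A SYN carries no acknowledgement, so ack_seq is unused:
         * stash the shared secure sequence number there. */
        tcph->ack_seq = htonl(choose_secure_isn());
        tcph->res1 = 1;        /* reserved bits flag the embedded ISN */
    }
}

/* Either server's TCP layer: recover K (Fig. 3) for its SYN+ACK. */
static __u32 extract_secure_isn(const struct tcphdr *tcph)
{
    return tcph->res1 ? ntohl(tcph->ack_seq) : 0;
}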
III. FAILOVER
A. Fault Diagnosing
The HeartbeatThreads on the two servers send heartbeats to each other at a fixed interval. The RcvThread updates the LastUpdateTime parameter from its local clock whenever it receives a heartbeat from the other server. When a server fails, the other server no longer receives its heartbeats; if the time since the last heartbeat exceeds a limit, the surviving server sets the status of the other server to inactive and regards it as crashed (a small check sketch appears at the end of this section).
B. Fault Failover
The MI_TCP scheme gives each TCP connection two synchronous connection images, one on the Dispatch Server and the other on the Aggregate Server. When one of them fails, the other can resume seamlessly. Our failover scheme is simple: the Aggregate Server takes over the VIP and deletes its Aggregater when it detects the fault of the Dispatch Server, while the Dispatch Server only deletes its Dispatcher when it detects the failure of the Aggregate Server.
C. Fault Recovery
When the failed server recovers, it sends its heartbeat to the running server and starts an Aggregater, whereupon the running server starts a Dispatcher when it receives this information. A highly available cluster is thus formed again.
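A minimal sketch of the fault-diagnosing check, assuming the LastUpdateTime variable maintained by the RcvThread and the one-second limit (four heartbeats) configured as described in Section IV; check_peer() and the peer_active flag are illustrative names.

/*
 * A minimal sketch of the fault-diagnosing check. The one-second
 * limit corresponds to four heartbeats, as configured; check_peer()
 * and peer_active are illustrative.
 */
#include <time.h>

#define HB_LIMIT_SECONDS 1                 /* four heartbeats */

static volatile time_t last_update_time;   /* set by the RcvThread */
static int peer_active = 1;

static void check_peer(void)
{
    if (time(NULL) - last_update_time > HB_LIMIT_SECONDS) {
        peer_active = 0;   /* regard the other server as crashed:
                              the Aggregate Server takes over the VIP
                              and deletes its Aggregater; the Dispatch
                              Server deletes its Dispatcher. */
    }
}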
IV. PERFORMANCE EVALUATION
A. System configurations and test tools
The experiments were performed on our HA cluster of two servers (illustrated in Fig. 1) and one client. Each computer was a 450 MHz Pentium II workstation with 512 KB cache, 128 MB of memory, two 100 Mbps 3COM 3c59x network interface cards, and a Maxtor 40 GB IDE hard disk. All nodes were connected via a 3COM 100 Mbps switch. The operating system on all nodes was Linux with kernel version 2.4.2, and we implemented a Linux kernel module for MI_TCP on both servers.
We used an application in which the client transmits to the server a data stream whose size can be controlled by the client. The server receives and prints the data and transmits it back to the client. Through this application, we measured the throughput of MI_TCP and the additional latency introduced by MI_TCP. The tcpdump utility was used to collect timestamps and packet information for the connections under test. It was executed on the client machine to obtain accurate client-side measurements of the latency of MI_TCP connections and of the recovery time.
B. Throughput
Fig. 4 presents the throughput of the two systems. We notice that the throughput of our HA cluster degrades to 85.75% of that of the single server. However, this decrease in performance is unavoidable if availability is to be improved; there must be a tradeoff between high availability and high performance. We can also see a penalty when the packet size is close to or
slightly larger than a multiple of the MTU size. This is due to the additional datagram fragmentation needed in these cases to accommodate the encapsulation header.
[Fig. 4: Throughput (Mbps) of the single server and our HA cluster versus packet size (bytes).]
C. Latency time
Latency time is defined as the interval between the client sending the data and the client receiving the response. Fig. 5 shows that the increase in latency between the single server and the cluster implementing MI_TCP is small, especially when packets are smaller than 4 KB. For example, when the packet size is 2187 B, the latency of the single server is 517.2632 µs and the latency of our HA cluster is 550.70 µs. From Fig. 5 we can also see that the average latency increases by only 17.51%, while the availability is higher.
[Fig. 5: Latency (microseconds) of the single server and our HA cluster versus packet size (bytes).]
D. Recovering time
Recovering time is defined as the interval from a TCP connection crashing to the other connection image resuming the data transfer; it too can be measured with tcpdump. As the two connection images are synchronized, the recovery time (T_recovery) includes only two periods: the time to detect a failure (T_found) and the time to take over (T_takeover). That is:
T_recovery = T_found + T_takeover
T_found is the time from a server crashing to the other server detecting the failure, which is determined by the configuration file; T_takeover is the time from detecting the failure to the surviving server taking over the VIP. The recovery time measured by tcpdump is very close to T_found, because T_takeover is transient compared with T_found. We usually set T_found to four heartbeats in the configuration file, which is one second (i.e., a heartbeat interval of 250 ms).
V. CONCLUSION AND FUTURE WORK
A new fault-tolerant TCP connection scheme (MI_TCP) has been described in this paper. It has several advantages compared with other schemes. First, both the Dispatch Server and the Aggregate Server run the same task, and the Aggregate Server maintains the consistency of the two servers; when one server fails, the other can take over seamlessly and transparently to the client. Second, the Dispatcher and the Aggregater are located on different servers, which balances the load. Finally, the time to recover a connection is short: since the two TCP connection images are always consistent, it is easy to recover a connection from a failure, unlike schemes that must reincarnate a connection from the beginning. In the future, we plan to use this scheme in our cluster file system to achieve high availability of the metadata.
REFERENCES
[1] R. Nasika and P. Dasgupta, “Transparent migration of distributed computing processes”, Proc. of the Thirteenth International Conference on Parallel and Distributed Computing Systems, Las Vegas, 2000.
[2] A. C. Snoeren, D. G. Andersen, and H. Balakrishnan, “Fine-Grained Failover Using Connection Migration”, Technical Report MIT-LCS-TR-812, MIT, September 2000.
[3] C. Fetzer and S. Mishra, “Transparent TCP/IP based Replication”, Proceedings of the 29th International Symposium on Fault Tolerant Computing - Fast Abstracts, Madison, Wisconsin, June 1999.
[4] D. Maltz and P. Bhagwat, “TCP splicing for application layer proxy performance”, IBM Research Report 21139, Computer Science/Mathematics, IBM Research Division, 17 March 1998.
[5] W. Zhang, “Linux Virtual Server for Scalable Network Services”, Proceedings of the Ottawa Linux Symposium, 2000.
[6] L. Alvisi, T. C. Bressoud, A. El-Khashab, K. Marzullo, and D. Zagorodnov, “Wrapping Server-Side TCP to Mask Connection Failures”, Technical Report, Department of Computer Sciences, The University of Texas at Austin, July 2000.
[7] E. Amir, S. McCanne, and R. Katz, “An active service framework and its application to real-time multimedia transcoding”, Proc. of ACM SIGCOMM’98, September 1998.
[8] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum, “Locality-aware request distribution in cluster-based network servers”, Proc. of ASPLOS’98, October 1998.
[9] W. R. Stevens, TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley, 1994.