AMTCP: An Adaptive Multi-path Transmission Control Protocol

Long Li
Institute of Computing Technology, CAS
University of Chinese Academy of Sciences, China
[email protected]

Nongda Hu
Communications Technology Laboratory, Huawei Technologies Co., Ltd., China
hunongda@huawei.com

Ke Liu†
Institute of Computing Technology, CAS
[email protected]

Binzhang Fu
Institute of Computing Technology, CAS
[email protected]

Mingyu Chen
Institute of Computing Technology, CAS
[email protected]

Lixin Zhang
Institute of Computing Technology, CAS
[email protected]

ABSTRACT

Enabling multiple paths in datacenter networks is a common practice to improve performance and robustness. Multi-path TCP (MPTCP) exploits multiple paths by splitting a single flow into multiple subflows. The number of subflows in MPTCP is determined before a connection is established and usually remains unchanged during the lifetime of that connection. While MPTCP improves both bandwidth efficiency and network reliability, additional subflows incur additional overhead, especially for small (so-called mice) flows. Moreover, it is difficult to choose an appropriate number of subflows for each TCP connection that achieves good performance without incurring significant overhead. To address this problem, we propose an adaptive multi-path transmission control protocol, AMTCP, which dynamically adjusts the number of subflows according to application workloads. Specifically, AMTCP divides time into small intervals, measures the throughput of each subflow over the latest interval, and then adjusts the number of subflows dynamically, with the goal of reducing resource and scheduling overheads for mice flows while achieving high throughput for elephant flows. Our evaluations show that AMTCP increases throughput by over 30% compared to conventional TCP, and decreases the average number of subflows by more than 37.5% while achieving a similar throughput compared to MPTCP.

Keywords

MPTCP, datacenter networks, TCP, subflows

∗He and the listed first author made an equal contribution.
†Corresponding author: Ke Liu ([email protected])

CF'15, May 18-21, 2015, Ischia, Italy.
Copyright 2015 ACM 978-1-4503-3358-0/15/05. http://dx.doi.org/10.1145/2742854.2742871

1. INTRODUCTION

Many datacenters have hundreds of thousands of servers and run a large number of distributed applications, each of which can spread computation and storage across hundreds of nodes. Because many datacenters separate computation nodes from storage nodes, these applications generate a large amount of communication between the two. As a result, east-west traffic (i.e., intra-datacenter traffic) dominates overall datacenter traffic: it is estimated that more than 80% of packets stay inside the datacenter [14]. To improve performance and reliability, today's datacenters usually adopt topologies with redundant paths between endpoints, such as FatTree [2], VL2 [9] and BCube [10]. In those networks it is possible to spread traffic across multiple paths to improve aggregate bandwidth and network resilience. However, single-path routing provides only one of the best paths, e.g., one of the shortest paths to the destination, and thus cannot effectively utilize the redundant paths. Multi-path routing is therefore used to exploit multiple paths between a source-destination pair and dispatch flows to the same destination across the available paths, which can dramatically improve network throughput.

Equal-cost multi-path (ECMP) routing, one of the most widely used multi-path routing protocols, distributes packets destined to the same destination across multiple equal-cost paths, e.g., multiple shortest paths. To avoid out-of-order delivery, a flow-based hash maps all packets of a flow onto the same path, so that they reach the destination in order. However, ECMP achieves its optimal performance only when there are sufficiently many simultaneous flows [3]. Otherwise, flow collisions (multiple flows routed onto the same link) may overload some links while others stay idle, reducing the achieved network throughput to less than 50% of the full bisection bandwidth [16]. Moreover, since a given flow is routed along a single path, its throughput is limited by the capacity of that path even if there are many available paths.
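To make the flow-based hashing concrete, the toy sketch below (illustrative only, not any particular switch's hash function) maps a flow's 5-tuple to one of the equal-cost next hops: packets of the same flow always take the same path and stay in order, but two elephant flows can hash onto the same link and collide.

```c
#include <stdint.h>

/* A flow is identified by its 5-tuple. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Pick one of num_paths equal-cost next hops for this flow. */
static unsigned ecmp_pick_path(const struct flow_key *k, unsigned num_paths)
{
    /* Simple FNV-1a-style mix over the 5-tuple fields. */
    uint32_t fields[5] = { k->src_ip, k->dst_ip, k->src_port, k->dst_port, k->proto };
    uint32_t h = 2166136261u;

    for (unsigned i = 0; i < 5; i++) {
        h ^= fields[i];
        h *= 16777619u;
    }
    return h % num_paths;  /* same flow -> same path; distinct flows may collide */
}
```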

To address the problem of flow conflicts in ECMP, some researchers have proposed to schedule traffic based on traffic characteristics and network status [3, 8]. By collecting traffic information of the whole network, Hedera [3] globally schedules elephant flows¹, i.e., flows whose transmitted data exceeds 10% of an egress link's capacity, from overloaded links to under-utilized ones to reduce flow conflicts, and thus obtains better network utilization than ECMP. However, datacenter traffic is bursty and lacks regular patterns [7], so the scheduler must be activated frequently to cope with traffic variations; to achieve optimal performance it should run every 100ms [16]. Furthermore, Hedera performs global flow scheduling through a time-consuming loop: collecting flow statistics, identifying elephant flows, estimating the demand matrix, calculating routing paths, and installing flow table entries. This complex process limits Hedera's reaction time, so it cannot achieve optimal performance in high-bandwidth datacenters [16].

Unlike ECMP and Hedera, which route a flow along a single path, Barre proposes MPTCP [6, 16], which routes a flow along multiple paths. Rather than sending all data over a single flow, the data of a connection is striped across several subflows at the sender. These subflows may traverse different paths between the source-destination pair simultaneously; at the receiver, the data from all subflows of a connection is merged in order and delivered to the corresponding application. By exploiting multiple paths, a connection can achieve an aggregate throughput much higher than that of a single path. To address the flow collision problem of ECMP, MPTCP directs a subflow to the path with the minimal measured round-trip time (RTT), so traffic is fairly distributed among all paths after convergence. If some subflows experience a link failure, the remaining data can still be delivered through the other active subflows. MPTCP therefore provides better network performance and reliability. However, MPTCP may incur additional resource and scheduling overheads: for instance, more open subflows means more data transmissions and more interrupts to receive those data, and thus more CPU cycles.

To address this problem, we propose AMTCP, which dynamically adjusts the number of subflows of a connection according to its application workload. AMTCP divides time into small intervals and measures the throughput of each subflow over the latest interval; every subflow is associated with a throughput threshold. If every subflow's throughput exceeds its threshold, AMTCP increases the number of subflows by one; otherwise the number of subflows remains unchanged. Since more than 80% of flows in a datacenter are mice flows [7], AMTCP is expected to reduce the CPU overhead and the number of subflows dramatically without sacrificing throughput in practice.

The rest of this paper is organized as follows. Section 2 gives more background and introduces the motivation of this paper. Section 3 gives an overview of AMTCP. Section 4 discusses the implementation of AMTCP. Section 5 presents the results of our experiments. Section 6 describes related work, and Section 7 concludes this paper.

¹Briefly speaking, flows smaller than 1MB in size are regarded as mice flows and the others as elephant flows [4].

2. BACKGROUND AND PROBLEM

In this section, we first briefly review the MPTCP algorithm and then introduce the motivation of this paper.

2.1 MPTCP Algorithm

To explore multiple paths in the network, MPTCP opens several subflows between endpoints. There are two main ways to open an additional subflow: (1) the additional subflow can use the same pair of IP addresses as the initial one but different ports; or (2) the additional subflow can use any other IP address pair that the sender and receiver may have. In case (1), the number of subflows is determined by a configurable kernel parameter, and MPTCP relies on ECMP routing to distribute the subflows over different paths. In case (2), the number of subflows is determined by the combinations of the sender's and receiver's IP addresses; e.g., for a pair of endpoints that each have two IP addresses (say A, B and C, D respectively), a TCP connection has four subflows, and the paths are implicitly identified by the source and destination IP addresses (A→C, A→D, B→C and B→D). In both cases, the number of subflows of a TCP connection remains unchanged during the lifetime of that connection, and in the current implementation of MPTCP (version 0.89) the maximum number of subflows is hard-coded as eight.

MPTCP uses a two-level sequence space to avoid out-of-order delivery: a connection-level sequence number tracks the order of packets seen by an application and guarantees that all packets are delivered to the application in order, while a subflow-level sequence number records the order of packets within each subflow. Moreover, each subflow has its own congestion window and performs a TCP-like coupled congestion control [16], which shifts traffic from more congested paths to less congested ones while sharing bandwidth fairly with regular TCP.
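As a rough illustration of this two-level sequence space (the field names and widths below are illustrative, not the actual MPTCP option layout):

```c
#include <stdint.h>

/* Per-segment mapping between the two sequence spaces described above. */
struct mptcp_seq_mapping {
    uint64_t data_seq;     /* connection-level sequence number: position of this
                              payload in the byte stream delivered to the app    */
    uint32_t subflow_seq;  /* subflow-level sequence number: position within the
                              TCP subflow that carries the segment               */
    uint16_t length;       /* number of payload bytes covered by this mapping    */
};

/* Each subflow ACKs subflow_seq like a regular TCP connection on its own path,
 * while the receiver uses data_seq to merge payload from all subflows back into
 * application order before delivery. */
```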

2.2 Problem Statement

Common datacenter topologies (e.g., multi-rooted trees) include redundant paths between endpoints. FatTree [2] is a typical example, where all edge and aggregate switches are grouped into pods. A three-layer FatTree built from K-port switches provides up to (K/2)^2 shortest paths between every pair of endpoints located in different pods; e.g., a FatTree with 24-port switches contains 144 shortest paths between such a pair. MPTCP leverages these redundant paths by opening multiple subflows to achieve high bandwidth utilization. However, opening more subflows is expensive for three reasons: (1) spreading data across multiple subflows increases resource and scheduling overheads, since more subflows mean more running tasks in the operating system, higher CPU utilization, and more interrupts for the CPU to handle; (2) splitting a short-lived flow into multiple subflows causes frequent connection setup and tear-down, and thus frequent resource allocation and de-allocation, which gets worse as the number of subflows grows; (3) in emerging network architectures such as OpenFlow [15], more subflows consume more on-chip memory (e.g., TCAM and SRAM) used to store flow entries in network devices such as switches and routers, which might limit the scalability of MPTCP. For example, endpoints in datacenters usually have 10 concurrent TCP flows [9]; considering a datacenter of 3,000,000 servers, opening one more subflow for each TCP connection results in an increment of 3,000,000 flows in the network, consuming a large amount of memory in network devices.

MPTCP opens the same number of subflows for every TCP flow, which might not be cost-effective for two reasons. (1) In common datacenters, mice flows dominate the communication between endpoints; for instance, it is estimated that more than 80% of flows are smaller than 10KB [7]. In addition, some control messages, e.g., heartbeat detection, last for a long time but carry only a small amount of traffic. Mice flows are thus usually not throughput-intensive, so each of their subflows may transmit very little data while still costing the same resources as any other subflow. (2) In contrast, MPTCP [16] has been shown to improve aggregate throughput significantly compared to conventional TCP, but mainly for elephant flows, which have a large amount of data to transmit and are usually throughput-intensive; an elephant flow may still fail to utilize the full bisection bandwidth if it opens too few subflows. Both observations suggest that the number of subflows should be adapted to each subflow's throughput.

To validate this, we conduct the following experiment: two hosts H1 and H2 are connected by a switch, where H1 is the sender and H2 is the receiver, and both have a 4-core CPU. H1 sends a request to H2 for a 100KB file, and H2 transmits that file to H1 using MPTCP. We vary the number of subflows and record the corresponding CPU usage. The results, shown in Fig. 1, indicate that MPTCP's CPU usage grows with the number of subflows; for example, the CPU usages for one, two and ten subflows are 9.9%, 15.2% and 19.1%, respectively. Interestingly, the CPU usage increases sharply when going from one to two subflows, but only slowly as the number of subflows increases further. The increase in CPU usage is mainly caused by the growing number of subflows, because more CPU cycles are needed to schedule the transmissions of all the subflows.

Figure 1: The CPU usages of MPTCP for different numbers of subflows.

To save CPU resources, we could decrease the number of subflows in MPTCP, but doing so might degrade MPTCP's throughput or flow completion time. To show this, we evaluate the flow completion times (FCTs) with different numbers of subflows. In these experiments two hosts H1 and H2 are connected by a switch; both hosts have two network interface cards, so there are two available paths between them. H1 downloads a file of a given size from H2 using MPTCP with one subflow and with two subflows, with file sizes ranging from 1KB to 100MB. We repeat each experiment 1000 times and compute the average FCT for each file size. As shown in Table 1, increasing the number of subflows from one to two reduces the FCT only slightly for mice flows, but the reduction grows with the file size: it is only 20.01us for a 10KB file, but as large as 123.6ms for a 100MB file. We expect the FCT to improve further as the file size increases, e.g., for elephant flows. These results indicate that elephant flows need multiple subflows to achieve high aggregate throughput, but that it is unnecessary to open as many subflows for every mice flow. Therefore, it is essential to dynamically adjust the number of subflows of a connection according to the application workload. The remaining question is how to determine the minimum appropriate number of subflows for every flow without sacrificing its throughput. Motivated by the above experiments, we propose an adaptive algorithm in the next section.

Table 1: Comparison of flow completion times (FCTs) for one and two subflows. The Delta column is the absolute difference between the FCTs with one and two subflows.

File Size   FCT, one subflow (us)   FCT, two subflows (us)   Delta (us)
1KB         199.91                  197.09                   2.82
2KB         201.65                  198.30                   3.35
5KB         234.38                  218.75                   15.63
10KB        280.38                  260.37                   20.01
100KB       1065.47                 1042.21                  23.27
1MB         9168.75                 9105.13                  63.62
10MB        92704.44                79322.33                 13382.11
100MB       917572.33               793982.77                123589.56

3. AMTCP OVERVIEW

In this section we first present the detailed algorithm of AMTCP and then discuss its key parameters.

3.1 Algorithm

MPTCP improves bandwidth utilization and network resilience, but it does not distinguish mice flows from elephant flows and thus unnecessarily opens too many subflows for mice flows. We therefore extend MPTCP into AMTCP, which dynamically adjusts the number of subflows according to each subflow's behavior, characterized by its measured sending throughput. AMTCP has two goals: (1) minimize the resource and scheduling overheads (2) without sacrificing throughput. It achieves these goals by adding and removing subflows gradually, as follows (a simplified code sketch is given after Sec. 3.2):

(i) AMTCP starts with only one subflow per TCP connection and is associated with a predefined throughput threshold, denoted T in the following.

(ii) Each subflow is associated with a counter that tracks the number of bytes it has transmitted within a given time interval. To adapt to the application workload quickly, AMTCP splits the lifetime of a TCP connection into constant intervals; the counter is reset at the beginning of each interval, and all subflows of a connection share the same intervals. We use ∆ to denote the interval length and b_{i,j} to denote the number of bytes transmitted by the i-th subflow during the j-th interval, so the throughput (bytes per second) of the i-th subflow in the j-th interval is measured as r_{i,j} = b_{i,j}/∆.

(iii) When transmitting a new segment of data, AMTCP tries to assign it to a subflow that has not yet reached the throughput threshold in the current interval. Whenever a subflow transmits data, its counter is updated accordingly.

(iv) Once all subflows have reached the throughput threshold, i.e., r_{i,j} > T for all i, AMTCP opens a new subflow.

(v) If a subflow's throughput stays below half of the throughput threshold for a long period of time (such as 10∆), AMTCP deletes that subflow.

(vi) To avoid creating too many subflows, AMTCP limits the number of active subflows of a TCP connection to N_max. Once this limit is reached, no new subflows are created even if all subflows of the connection have reached the throughput threshold; in that case, a subflow is selected randomly to carry the data.

Figure 2: The main structure of AMTCP (data scheduler, subflow manager, and subflows with per-subflow congestion control).

3.2 Key Parameters

The AMTCP algorithm has three parameters that affect its performance.

Throughput threshold T: The threshold is used to differentiate mice flows from elephant flows. Larger thresholds trigger fewer subflows, reducing the overhead but also lowering the sustained bandwidth; smaller thresholds generate more subflows, increasing the overhead but also the sustained bandwidth. In our base configuration, the throughput threshold is set to C/N_max (cf. Sec. 5), where C is the capacity of the associated link.

Measurement interval ∆: This parameter controls the granularity of measurement and action, trading timeliness against accuracy. Our implementation of AMTCP adopts a constant ∆ = 2ms (cf. Sec. 5). We considered using the RTT instead, but it is rather small in datacenters, e.g., 100us [19], and caused large fluctuations in the measured throughput.

Maximum number of subflows N_max: To prevent the number of subflows from growing out of control, we limit the maximum number of subflows of a TCP connection. It has been shown that the efficiency of MPTCP increases significantly as the number of subflows goes from one to eight but not much beyond eight, so AMTCP uses eight as the maximum. AMTCP allows the throughput of any subflow to exceed the throughput threshold once the number of subflows reaches N_max, so the aggregate throughput is not limited by the product of N_max and the threshold.
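The per-interval adjustment of Sec. 3.1 can be summarized with the following sketch. It only illustrates steps (i)-(vi) under the parameters above; the data structures and helper functions are hypothetical and do not correspond to the actual kernel code.

```c
#include <stdbool.h>
#include <stdint.h>

#define NMAX          8    /* maximum number of subflows, N_max (Sec. 3.2)        */
#define LOW_INTERVALS 10   /* delete after ~10*delta below T/2, step (v)          */

struct amtcp_subflow {
    uint64_t bytes_sent;   /* b_{i,j}: bytes sent in the current interval         */
    unsigned low_count;    /* consecutive intervals with throughput < T/2         */
};

struct amtcp_conn {
    struct amtcp_subflow sf[NMAX];
    unsigned nsubflows;    /* currently active subflows                           */
    double   threshold;    /* T in bytes/s, e.g. C / NMAX for link capacity C     */
    double   interval_s;   /* delta in seconds, e.g. 0.002 (2 ms)                 */
};

/* Placeholder hooks; the real work is done by the subflow manager (Sec. 4). */
static void open_subflow(struct amtcp_conn *c)                  { c->nsubflows++; }
static void mark_for_deletion(struct amtcp_conn *c, unsigned i) { (void)c; (void)i; }

/* Called at the end of every measurement interval. */
static void amtcp_interval_end(struct amtcp_conn *c)
{
    bool all_above_threshold = true;

    for (unsigned i = 0; i < c->nsubflows; i++) {
        /* r_{i,j} = b_{i,j} / delta (bytes per second) */
        double rate = (double)c->sf[i].bytes_sent / c->interval_s;

        if (rate <= c->threshold)
            all_above_threshold = false;

        /* step (v): remember subflows that stay well below the threshold */
        c->sf[i].low_count = (rate < c->threshold / 2.0) ? c->sf[i].low_count + 1 : 0;
        if (c->sf[i].low_count >= LOW_INTERVALS && c->nsubflows > 1)
            mark_for_deletion(c, i);

        c->sf[i].bytes_sent = 0;   /* reset the per-interval counter, step (ii)   */
    }

    /* step (iv): open one more subflow only if every subflow exceeded T */
    if (all_above_threshold && c->nsubflows < NMAX)
        open_subflow(c);
}
```

In this sketch, the data scheduler of Sec. 4 would call such a hook once per interval ∆ and increment bytes_sent whenever it assigns a segment to a subflow.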

4. AMTCP IMPLEMENTATION

The design of AMTCP is based on MPTCP, so its implementation also builds on the MPTCP implementation [1]. In AMTCP, a TCP connection can open multiple subflows, each of which behaves similarly to conventional TCP. All subflows of a TCP connection share the same send and receive buffers. Two levels of sequence numbers are used by the receiver to reorder the received TCP segments: the connection-level sequence number and the subflow-level sequence number. At the sender, TCP segments are distributed among the subflows as described in Sec. 3.1. Within each subflow, TCP segments are tracked by subflow-level sequence numbers; at the receiver, TCP segments from different subflows are reordered according to connection-level sequence numbers. As shown in Fig. 2, the AMTCP algorithm can be divided into three components: subflow management, segment distribution and congestion control.

Subflow Management: To dynamically adjust the number of subflows, AMTCP adopts a subflow manager that tracks the number of bytes transmitted over each interval for every subflow and estimates the throughput of every subflow at the end of each interval. Additionally, it receives notifications from the data scheduler to open or delete subflows. In practice, we found that the number of subflows would oscillate if the application demand varies substantially, which could cause non-negligible overhead due to the creation and deletion of subflows. To alleviate this problem, the subflow manager deletes only the subflow with the smallest congestion window (CWnd) among all subflows marked for deletion; the other subflows marked for deletion are saved in an internal data structure of the subflow manager. If a new subflow should be created, the subflow manager first reuses a subflow previously marked for deletion but not yet destroyed. To avoid retaining too many deleted subflows, subflows that have not been used for a long time are permanently deleted at the next activation of the subflow manager. In our implementation we hard-code that period to 5s, since the lifetimes of most connections in a datacenter are shorter than 5s [8].

Data Distribution: AMTCP adopts a data scheduler that distributes TCP segments among the subflows. Algorithm 1 (below) shows the data scheduling procedure. We first define two sets: S_R is the set of subflows whose throughput is less than the throughput threshold, and S_Cwnd is the set of subflows whose CWnds have available space and thus can transmit more TCP segments. The data scheduler first intersects these two sets to find the subflows whose CWnd has room and whose throughput is below the threshold, and randomly selects a subflow from this set to transmit the data. Otherwise, if the number of subflows N_subflow is less than N_max, the data scheduler asks the subflow manager to create a new subflow. If N_subflow equals N_max, the data scheduler chooses the subflow in S_Cwnd with the smallest RTT.

Congestion Control: AMTCP uses the same congestion control algorithm as MPTCP, coupled congestion control [17]: each subflow maintains its own CWnd and applies an additive-increase and multiplicative-decrease (AIMD) rule instead of inheriting the conventional TCP [5] rule that increases CWnd by 1/CWnd for each received ACK during congestion avoidance. The coupled algorithm increases CWnd_i by min(α/CWnd_total, 1/CWnd_i), where CWnd_i is the congestion window of subflow i, CWnd_total is the sum of the CWnds of all subflows, and α is a constant that determines how aggressively the CWnd of one subflow grows. Coupled congestion control has been shown to share bandwidth fairly at bottleneck links. It decreases the CWnd multiplicatively and adopts the same slow start and fast recovery algorithms as conventional TCP.
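As a rough illustration of the coupled increase rule just described (a sketch only, not the Linux MPTCP code; α is treated as a precomputed constant, as in the text):

```c
/* Congestion-avoidance increase of subflow i on one received ACK, following the
 * coupled rule min(alpha/cwnd_total, 1/cwnd_i) described above.  Windows are in
 * segments; alpha is assumed to be computed elsewhere (see RFC 6356). */
static double coupled_increase(double cwnd_i, double cwnd_total, double alpha)
{
    double inc_coupled = alpha / cwnd_total; /* connection-wide, capped increase  */
    double inc_reno    = 1.0 / cwnd_i;       /* what regular TCP would add        */

    /* Never be more aggressive than a single TCP flow on the same path. */
    return cwnd_i + ((inc_coupled < inc_reno) ? inc_coupled : inc_reno);
}
```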

Algorithm 1: Data Distribution Algorithm
  Input:  S_R, the set of subflows whose throughput is less than the throughput threshold;
          S_Cwnd, the set of subflows whose congestion windows are not full.
  Output: a subflow S to transmit data on.
  1: set S_A = S_R ∩ S_Cwnd
  2: if S_A is not empty then
  3:     randomly select a subflow from S_A as S
  4: else if N_subflow < N_max then
  5:     notify the subflow manager to create a new subflow S
  6: else if S_Cwnd is not empty then
  7:     select the subflow in S_Cwnd with the minimal RTT as S
  8: end if
  9: return S

5. PERFORMANCE EVALUATIONS

This section describes the experimental studies we have done to evaluate AMTCP. First, it presents the CPU overhead of AMTCP relative to MPTCP on a real system. It then provides experimental results from a simulator.

5.1 CPU overhead of AMTCP

Since AMTCP is built atop MPTCP [1], we are interested in the overhead introduced by the AMTCP implementation itself. We implement AMTCP in the Linux kernel based on the MPTCP implementation, and use two hosts (H1 and H2) connected by a switch. Both hosts have two network interface cards (NICs), so there are two separate paths between them. H1 transmits TCP segments continuously to H2 using AMTCP or MPTCP. For a fair comparison, we set the number of subflows in AMTCP and MPTCP to be the same. As there are only two paths between H1 and H2, we collect the CPU overheads of AMTCP and MPTCP with one and two subflows, respectively. As shown in Fig. 3, the CPU usage of AMTCP is close to that of MPTCP for the same number of subflows; e.g., with one subflow, the CPU usages of MPTCP and AMTCP are 16.3% and 16.5%, respectively.

Figure 3: The comparison of CPU overheads for MPTCP and AMTCP.

Figure 4: The FatTree topology contains 128 hosts.

5.2 Simulation Results

To evaluate AMTCP in large-scale networks, we implement AMTCP in an enhanced version of BookSim, a flit-level network simulator [13]. BookSim has a realistic router model and can accurately simulate the network. We extend BookSim to support the tail-drop queues used by many Ethernet switches and routers: when a packet arrives at a full queue, the arriving packet itself is dropped rather than a packet already in the queue. We also implement conventional TCP (TCP Reno), AMTCP and MPTCP in the simulator for comparison. As many datacenter networks are built as 3-layer multi-rooted trees, we conduct all simulations in a 3-layer FatTree topology with 128 hosts, as shown in Fig. 4. A flow-based hash is implemented in all switches to select a path among all available shortest paths for every subflow/flow at every hop. We use a traffic pattern adapted from [16], a random permutation matrix, in which each endpoint transmits data to another randomly selected endpoint under the constraint that no endpoint receives data from more than one host. Two types of application workloads are simulated: high load, simulated by continuously transmitting 15MB files, and low load, simulated by sending 1.5KB files with a frequency determined by the offered load. In all experiments, packets have a fixed size of 20 flits, and we repeat each experiment 10 times and report the average results.

5.3 Parameters Impact

The measurement interval and the throughput threshold can significantly impact the performance of AMTCP. The interval determines the granularity of throughput measurement: with a smaller interval, AMTCP is more sensitive to bursty transmission, while with a larger interval AMTCP might underestimate the throughput and thus reduce the number of subflows.

Fig. 5 shows the average number of subflows used by AMTCP for various measurement intervals; the x-axis is the interval in microseconds and the y-axis is the average number of subflows. At heavy load, e.g., 1.0, the number of subflows reaches the maximum of 8. The results show that as the measurement interval increases, the number of subflows decreases, since bursts are smoothed out over the longer interval. For example, a burst transmission of 1.5KB yields a measured throughput of 30Mbps with a 400us interval but only 15Mbps with an 800us interval, so AMTCP is more likely to open additional subflows with a small measurement interval.

Figure 5: The number of subflows for different measurement intervals.

Fig. 6 plots the average aggregate throughput achieved with various measurement intervals. We find that the measurement interval has little impact on the aggregate throughput. At low load, e.g., 0.05-0.50, AMTCP always achieves the optimal performance; at an offered load of 1.0, AMTCP achieves a throughput of 0.87. The gap from the optimum is caused by network overheads such as the inter-packet gap (IPG), the minimum idle period between two Ethernet transmissions. These results indicate that AMTCP can achieve near-optimal performance.

Figure 6: The aggregate throughput for different measurement intervals.

The throughput threshold limits the throughput of each subflow. A large threshold reduces the number of subflows but might under-utilize the available bandwidth. Fig. 7 shows the average number of subflows for different throughput thresholds normalized to link capacity. The threshold impacts the number of subflows significantly: increasing it reduces the number of subflows when the offered load is not high, e.g., 0.05-0.50, while it has little effect when the offered load is high, e.g., 1.0, since at high load each subflow is likely to reach even a large threshold and thus trigger a new subflow.

Figure 7: The number of subflows for different threshold values. All values are normalized to link capacity.

Fig. 8 shows the aggregate throughput of AMTCP for different throughput thresholds. The threshold only affects the aggregate throughput when the offered load is high; otherwise it has little effect. Therefore, setting both parameters requires trading off the number of subflows against the aggregate throughput; we set the measurement interval to 2000us and the (normalized) throughput threshold to 0.125.

Figure 8: The aggregate throughput for various threshold values.
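For concreteness, the burst numbers quoted above and the chosen threshold follow directly from the definitions in Sec. 3.2 (a worked check; the 1.5KB burst is the low-load file size):

```latex
r_{i,j} = \frac{b_{i,j}}{\Delta}:\qquad
\frac{1500\,\mathrm{B}\times 8}{400\,\mu\mathrm{s}} = 30\,\mathrm{Mbps},\qquad
\frac{1500\,\mathrm{B}\times 8}{800\,\mu\mathrm{s}} = 15\,\mathrm{Mbps},\qquad
T = \frac{C}{N_{\max}} = \frac{C}{8} = 0.125\,C
```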

5.4 Application Workload Adaptation

Fig. 9 shows the number of subflows and the aggregate throughput under various workloads. As the offered load increases from 0.10 to 0.80, both the aggregate throughput and the number of subflows increase linearly, indicating that AMTCP dynamically adjusts the number of subflows according to the workload. Beyond an offered load of 0.80, limited by the maximum number of subflows and the offered load, the aggregate throughput and the number of subflows converge to 0.87 and 8, respectively. Note that, due to packet overheads in the simulations such as the IPG, the maximum achievable throughput is 0.87.

Figure 9: The number of subflows and the aggregate throughput at different communication loads.

5.5 Performance Comparison

We compare the performance of AMTCP with TCP and MPTCP under different offered loads. Fig. 10 shows the number of subflows for TCP, MPTCP and AMTCP at different workloads. Since TCP has no additional subflows, its number of subflows is always one, while the number of subflows in MPTCP is eight by default. Unlike TCP and MPTCP, AMTCP adjusts the number of subflows according to the workload, increasing it linearly as the workload grows. Compared to MPTCP, AMTCP significantly reduces the number of subflows when the workload is low and hence reduces the resource and scheduling overheads. For example, at an offered load of 0.5, AMTCP uses fewer than 5 subflows versus MPTCP's 8; overall, AMTCP reduces the average number of subflows by more than 37.5% compared to MPTCP.

Figure 10: The number of subflows for TCP, MPTCP and AMTCP at different communication loads.

Fig. 11 shows the aggregate throughput under various workloads. AMTCP achieves a throughput similar to that of MPTCP at low offered load and a better throughput at high offered load. Both AMTCP and MPTCP outperform TCP in throughput by up to 30%.

Figure 11: The aggregate throughput obtained at different communication loads.

6. RELATED WORK

There are too many TCP-related proposals to name them all, thus we only list some of them that are more relevant to AMTCP in this section. pTCP [11] is an end-to-end transport layer protocol that considers a multi-homed mobile host which potentially has

RELATED WORK

subscriptions and accesses to more than one wireless network at a given time, and addresses the problem of achieving bandwidth aggregation by stripping data across the multiple interfaces of the mobile host. In pTCP, it proposed a structure for data distribution in the transport layer, where it decouples functionalities associated with per-path characteristics from those that pertain to the aggregate connection. Specifically, a striped connection manager (SM) handles aggregate connection functionalities while TCP-virtual (TCPv) performs per-path functionalities. Each TCP-v corresponds to a subflow which behaves as a normal TCP connection. mTCP [20] is another end-to-end transport layer protocol to stripe data across multiple paths to aggregate bandwidth. To avoid an unfair sharing of bandwidth under congestion, it uses a shared bottleneck detection mechanism, but its detection time is long, e.g., 15 seconds. Considering that the duration of datacenter tasks is within five seconds [8], this shared bottleneck detection mechanism is infeasible. In addition, mTCP uses resilient overlay network (RON) as the underlying routing layer, thus the information about paths can only be obtained by querying the RON. Stream Control Transmission Protocol (SCTP) is a reliable, message-oriented transport layer protocol [18]. It supports multi-stream and multi-homing services. If one path fails, the other paths can still be used to deliver data. However, in the current implementation of SCTP, only one path is selected for data transmission while other paths are used as backups. When the primary path fails, one of the backup paths will be selected for transmission. CMT [12] uses SCTP’s multi-homing feature to simultaneously transfer new data across multiple paths. Three negative side-effects of reordering introduced by CMT are identified, and corresponding algorithms have been proposed to avoid these side-effects.

7. CONCLUSIONS

In this paper we propose AMTCP, which dynamically adjusts the number of subflows according to application workloads. Our evaluations show that AMTCP outperforms conventional TCP in throughput by over 30%. Compared to MPTCP, AMTCP reduces resource and scheduling overheads by decreasing the number of subflows while maintaining a similar aggregate throughput.

8. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their feedback. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences under grant No. XDA06010401, Huawei Research Program YBCB2011030, and the National Science Foundation of China under grants No. 61221062, No. 61331008 and No. 61202056.

9. REFERENCES

[1] Multipath TCP - Linux kernel implementation.
[2] M. Al-Fares, A. Loukissas, and A. Vahdat. A scalable, commodity data center network architecture. In SIGCOMM, 2008.
[3] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, and A. Vahdat. Hedera: Dynamic flow scheduling for data center networks. In NSDI, 2010.
[4] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan. Data center TCP (DCTCP). In SIGCOMM, 2010.
[5] M. Allman, V. Paxson, and E. Blanton. TCP congestion control. RFC 5681.
[6] S. Barre. Implementation and Assessment of Modern Host-based Multipath Solutions. PhD thesis, Universite catholique de Louvain, 2011.
[7] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In IMC, 2010.
[8] T. Benson, A. Anand, A. Akella, and M. Zhang. MicroTE: Fine grained traffic engineering for data centers. In CoNEXT, 2011.
[9] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM, 2009.
[10] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A high performance, server-centric network architecture for modular data centers. In SIGCOMM, 2009.
[11] H. Y. Hsieh and R. Sivakumar. A transport layer approach for achieving aggregate bandwidths on multi-homed mobile hosts. In MobiCom, 2002.
[12] J. R. Iyengar, P. D. Amer, and R. Stewart. Concurrent multipath transfer using SCTP multihoming over independent end-to-end paths. IEEE/ACM Transactions on Networking, 14(5):951–964, 2006.
[13] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. J. Dally. A detailed and flexible cycle-accurate network-on-chip simulator. In ISPASS, 2013.
[14] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, and R. Chaiken. The nature of data center traffic: Measurements & analysis. In IMC, 2009.
[15] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner. OpenFlow: Enabling innovation in campus networks. ACM SIGCOMM Computer Communication Review, 38(2):69–74, 2008.
[16] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley. Improving datacenter performance and robustness with multipath TCP. In SIGCOMM, 2011.
[17] C. Raiciu, M. Handley, and D. Wischik. Coupled congestion control for multipath transport protocols. RFC 6356.
[18] R. Stewart. Stream Control Transmission Protocol. RFC 4960.
[19] H. Wu, Z. Feng, C. Guo, and Y. Zhang. ICTCP: Incast congestion control for TCP in data-center networks. IEEE/ACM Transactions on Networking, 21(2):345–358, 2013.
[20] M. Zhang, J. Lai, A. Krishnamurthy, L. Peterson, and R. Wang. A transport layer approach for improving end-to-end performance and robustness using redundant paths. In ATEC, 2004.