Inf Syst Front (2014) 16:35–44 DOI 10.1007/s10796-013-9463-4
IDTCP: An effective approach to mitigating the TCP incast problem in data center networks Guodong Wang · Yongmao Ren · Ke Dou · Jun Li
Published online: 5 November 2013 © Springer Science+Business Media New York 2013
Abstract Recently, the TCP incast problem in data center networks has attracted a wide range of industrial and academic attention. Many attempts have been made to address this problem through experiments and simulations. This paper analyzes the TCP incast problem in data centers by focusing on the relationship between the TCP throughput and the congestion control window size of TCP. The root cause of the TCP incast problem is explored, and the essence of the current methods to mitigate TCP incast is explained. The rationality of our analysis is verified by simulations. The analysis as well as the simulation results provides significant implications for the TCP incast problem. Based on these implications, an effective approach named IDTCP (Incast Decrease TCP) is proposed to mitigate the TCP incast problem. Analysis and simulation results verify that our approach effectively mitigates the TCP incast problem and noticeably improves the TCP throughput.

Keywords Congestion control · Data center networks · TCP incast
G. Wang () · K. Dou
Graduate University of Chinese Academy of Sciences, Beijing, China
G. Wang e-mail: [email protected]
K. Dou e-mail: [email protected]

G. Wang · Y. Ren · K. Dou · J. Li
Computer Network Information Center of Chinese Academy of Sciences, Beijing, China
Y. Ren e-mail: [email protected]
J. Li e-mail: [email protected]
1 Introduction

Data centers have become more and more popular for storing large volumes of data. Companies like Google, Microsoft, Yahoo, and Amazon use data centers for web search, storage, e-commerce, and large-scale general computations. Building data centers with commodity TCP/IP and Ethernet networks remains attractive because of their low cost and ease of use. However, TCP easily suffers a drastic reduction in throughput when multiple senders communicate with a single receiver in these networks, which is termed incast.

To mitigate the TCP incast problem, many methods have been proposed. The accepted methods include enlarging the switch buffer size (Phanishayee et al. 2008), decreasing the Server Request Unit (SRU) (Zhang et al. 2011a) and shrinking the Maximum Transmission Unit (MTU) (Zhang et al. 2011b). Reducing the retransmission timeout (RTO) to microseconds has also been suggested to shorten TCP's response time, although the availability of such a timer is a stringent requirement for many operating systems (Vasudevan et al. 2009). Another commonly used method is to modify the TCP congestion control algorithm. Phanishayee et al. (2008) tried disabling the Slow Start mechanism of TCP to slow down the congestion control window (cwnd) updating rate, but the experimental results show that this approach does not alleviate TCP incast. Alizadeh et al. (2010) provide fine-grained, congestion-window-based control by using Explicit Congestion Notification (ECN) marks. Wu et al. (2010) measure the sockets' incoming data rate periodically and then adjust the rate by modifying the announced receive window.

In view of these methods, this paper provides the following contributions. First, we analyze the root causes of the TCP incast problem from the perspective of the cwnd
size of TCP. Based on the analysis, the relationships among the TCP throughput, the SRU, the switch buffer size and the cwnd are discussed in this paper. Second, we explore the essence of the current methods which are adopted to mitigate the TCP incast problem, including enlarging the switch buffer size, decreasing the SRU and shrinking the MTU. Third, a series of simulations are conducted to verify the accuracy of our analysis. The analysis as well as the simulation results provides significant implications for the TCP incast problem. Finally, based on these implications, an effective method named IDTCP is proposed to mitigate the TCP incast problem.

The remainder of this paper is organized as follows. Section 2 describes the related work. Section 3 presents the analysis of the TCP incast problem. Section 4 validates our analysis by simulation and presents the implications. Section 5 describes the IDTCP congestion control algorithm. Section 6 conducts a series of simulations to validate the performance of IDTCP. Section 7 concludes this paper.
2 Related work

The problem of incast has been studied in several papers. Gibson et al. (1998) first mentioned the TCP incast phenomenon. Phanishayee et al. (2008) found that TCP incast could be related to the overflow of the bottleneck buffer. The phenomenon was then analyzed in Chen et al. (2009) and Zhang et al. (2011a), relating it to the number of senders and the different sizes of SRU. Similarly, Zhang et al. (2011b) used a smaller MTU to increase the bottleneck buffer size in terms of number of packets. Phanishayee et al. (2008) outlined several solutions, which include the use of a different TCP variant for better packet-loss recovery, disabling slow-start to slow down the increase of the TCP sending rate, or even reducing the retransmission timeout (RTO) to microseconds so that there would be a shorter idle time upon RTO. Vasudevan et al. (2009) further investigated the RTO in detail and found it very effective if a high-resolution timer in the OS kernel could be used. However, the availability of such a timer for each TCP connection is a stringent requirement for many operating systems. Although Phanishayee et al. (2008) pointed out that disabling slow-start did not alleviate TCP incast, the explanation and the detailed analysis were not presented.

Throttling TCP flows to prevent the bottleneck buffer from being overwhelmed is a straightforward idea. DCTCP (Alizadeh et al. 2010) used ECN to explicitly notify the network congestion level, and achieved fine-grained congestion control by using the number of ECN marks. However, the deployment of DCTCP may require updating switches to support ECN. ICTCP (Wu et al. 2010)
measured the bandwidth of the total incoming traffic to obtain the available bandwidth, and then controlled the receive window of each connection based on this information. The weakness of ICTCP is that the number of concurrent senders it can support without TCP incast is still limited, and its flow-rate adaptation is not immediate.
3 Analysis

3.1 Model

Data center networks are usually composed of high-speed, low-propagation-delay links with limited switch buffer. In addition, client applications usually stripe data over different servers for reliability and performance. The model presented in Phanishayee et al. (2008) abstracts the most basic representative setting in which TCP incast occurs. In order to simplify our analysis we use the same model, as shown in Fig. 1. In our model, a client requests a data block, which is striped over n servers. Table 1 shows the specific model notation which we use.

Fig. 1 A simplified TCP incast model

3.2 The congestion control window and TCP incast

Traditional TCP's cwnd updating mechanism combines two phases: Slow Start and Congestion Avoidance. In the Slow Start phase, TCP doubles its cwnd from the default value (in standard TCP, the default value is 1) every RTT, while in the Congestion Avoidance phase, the cwnd increases approximately linearly with time. The standard TCP cwnd updating rules are given by

W_{i+1} = \begin{cases} W_i + 1, & \text{Slow Start} \\ W_i + 1/W_i, & \text{Congestion Avoidance} \end{cases} \qquad (1)

where the index i denotes the reception of the ith ACK. For a given SRU, the number of ACKs which is used to acknowledge the received packets can be described by

ACK = \frac{SRU}{MSS} \qquad (2)
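For concreteness, the following Python sketch steps the standard cwnd through the per-ACK updates of Eq. 1 and counts the ACKs of Eq. 2. It is only an illustrative model (no losses, timeouts, or receiver-window handling), and the ssthresh value that separates the two phases is assumed for the example.

```python
import math

MSS = 1460  # bytes

def acks_for_sru(sru_bytes: int) -> int:
    """Eq. 2: number of ACKs needed to acknowledge one SRU."""
    return math.ceil(sru_bytes / MSS)

def next_cwnd(cwnd: float, ssthresh: float) -> float:
    """Eq. 1: per-ACK cwnd update of standard TCP.

    Slow Start while cwnd < ssthresh (doubles the cwnd once per RTT),
    Congestion Avoidance afterwards (adds about one packet per RTT).
    """
    if cwnd < ssthresh:
        return cwnd + 1.0          # Slow Start
    return cwnd + 1.0 / cwnd       # Congestion Avoidance

# Example: cwnd growth while transferring a 64 KB SRU, starting from cwnd = 1.
cwnd, ssthresh = 1.0, 16.0         # ssthresh value assumed for illustration
for _ in range(acks_for_sru(64 * 1024)):
    cwnd = next_cwnd(cwnd, ssthresh)
print(f"ACKs for 64KB SRU: {acks_for_sru(64 * 1024)}, final cwnd: {cwnd:.2f}")
```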
Table 1 Model notation

Symbol   Description
n        server number
B        switch buffer size (in packets)
BDP      Bandwidth Delay Product
SRU      Server Request Unit (in KB)
ACK      the number of Acknowledgements
MSS      Maximum Segment Size (1460 Byte)
MTU      Maximum Transmission Unit (1500 Byte)
cwnd     congestion control window (in packets)

To find the maximum cwnd (cwnd_max) corresponding to these ACKs, we suppose the link is not congested and TCP is in the Slow Start phase. Each ACK's arrival increases the cwnd of TCP by 1 from the initial value, so the maximum cwnd can be given by

cwnd_{max} = ACK + 1 = \frac{SRU}{MSS} + 1 \qquad (3)

The range of the current cwnd (cwnd(t)) of each server (n) can be given by

1 \le cwnd(t)_n \le cwnd_{max,n} \qquad (4)

From Fig. 1, each server will send cwnd(t)_n × MTU data synchronously to the client; n servers mean that \sum_{i=1}^{n} cwnd(t)_n \times MTU data will be synchronously appended to the switch buffer. As the servers fundamentally share the same switch, in order to ensure that the concurrent packets do not exceed the link capacity, the following condition should be met:

\sum_{i=1}^{n} cwnd(t)_n \times MTU \le B + BDP \qquad (5)

From Eq. 5, in order to increase the number of servers (n), one approach is to enlarge the switch buffer size (B) (Phanishayee et al. 2008). Another approach is to decrease the cwnd, which can be achieved by reducing the SRU (Zhang et al. 2011a): the maximum cwnd is in direct proportion to the SRU, as shown in Eq. 3, so limiting the SRU limits the maximum cwnd. The third approach is to decrease the MTU (Zhang et al. 2011b). Therefore, the essence of these approaches (Phanishayee et al. 2008; Zhang et al. 2011a, b) can be well explained by Eq. 5. Besides, cwnd(t) can be given by

cwnd(t) = \begin{cases} 2^{t/RTT}, & \text{Slow Start} \\ \frac{t - t_\gamma}{RTT} + \gamma, & \text{Congestion Avoidance} \end{cases} \qquad (6)

where t represents the elapsed time, and t_γ and γ are, respectively, the time and the cwnd when TCP exits the Slow Start phase. To mitigate the TCP incast problem, another approach is therefore to decrease the cwnd growth rate in both the Slow Start and Congestion Avoidance phases.

3.3 The congestion control window size in DCN

From Eq. 6, the relationship between the cwnd and the updating time (T(cwnd), the time that changes with the growth of cwnd) can be given by

T(cwnd) = \begin{cases} RTT \times \log_2 cwnd, & \text{Slow Start} \\ RTT \times (cwnd - \gamma), & \text{Congestion Avoidance} \end{cases} \qquad (7)

As TCP transmits approximately cwnd packets within one RTT, the average throughput (TP) can be given by

TP = \frac{cwnd \times MSS}{RTT} \qquad (8)

where the default MSS is 1460 Byte. From Eq. 8, it is clear that cwnd = TP × RTT / MSS. Taking a typical data center network as an example, with a bandwidth of 1 Gbps and an RTT of 0.1 ms, cwnd = (1 Gbps × 0.1 ms) / 1460 Byte = 8.562. Based on this discussion, we calculate the cwnd and the cwnd updating time of the traditional TCP for different BDPs in Table 2.

Table 2 The relationships between BDP and cwnd updating time

BDP (Mbps×ms)   cwnd     SStime (ms)   CAtime (ms)
1000×0.1        8.562    0.210         0.428
1000×0.2        17.12    0.620         1.712
1000×0.3        25.68    1.105         3.853
1000×0.4        34.25    1.639         6.849
1000×0.5        42.81    2.210         10.702

Table 2 shows that in a link with 1000 Mbps bandwidth and 0.1 ms RTT, a small cwnd (8.562) is enough for TCP to fill the 1 Gbps pipe. The time to finish Slow Start (supposing Slow Start ends when the throughput is 50 % of the bandwidth, following the standard TCP mechanism that reduces the cwnd to 50 % of its current value when congestion occurs) is only 0.210 ms (calculated by Eq. 7: RTT × log2 cwnd = 0.1 ms × log2 4.281 = 0.210 ms), which is too abrupt for the standard TCP to control the flow. The requirement for a small cwnd in data center networks indicates that the standard TCP is too aggressive to avoid the TCP incast problem. Therefore, slowing down the cwnd growth rate of the traditional TCP is a critical step to mitigate the incast problem. Additionally, Table 2 shows that a small cwnd (8.562) can fill the 1 Gbps bandwidth, so the total throughput will not be excessively affected by slowing down the cwnd growth rate.
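The Table 2 entries can be reproduced directly from Eqs. 7 and 8. The sketch below does so, assuming, as in the text, that Slow Start is considered finished at 50 % of the link bandwidth (i.e., γ = cwnd/2).

```python
import math

MSS_BITS = 1460 * 8  # bits per segment

def dcn_cwnd(bandwidth_mbps: float, rtt_ms: float) -> float:
    """Eq. 8 rearranged: cwnd needed to fill the pipe, cwnd = BW * RTT / MSS."""
    return (bandwidth_mbps * 1e6) * (rtt_ms * 1e-3) / MSS_BITS

def slow_start_time_ms(cwnd: float, rtt_ms: float) -> float:
    """Eq. 7, Slow Start branch, evaluated at gamma = cwnd / 2 (50% of bandwidth)."""
    return rtt_ms * math.log2(cwnd / 2)

def cong_avoid_time_ms(cwnd: float, rtt_ms: float) -> float:
    """Eq. 7, Congestion Avoidance branch: RTT * (cwnd - gamma) with gamma = cwnd / 2."""
    return rtt_ms * (cwnd - cwnd / 2)

# Reproduce the rows of Table 2 for a 1000 Mbps link.
for rtt in (0.1, 0.2, 0.3, 0.4, 0.5):
    w = dcn_cwnd(1000, rtt)
    print(f"BDP=1000x{rtt}: cwnd={w:.3f}, "
          f"SStime={slow_start_time_ms(w, rtt):.3f}ms, "
          f"CAtime={cong_avoid_time_ms(w, rtt):.3f}ms")
```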
4 Validation and implications

In this section, we validate the accuracy of Eq. 5 through simulation on the NS-2 platform (Fall and Vardhan 2007) and discuss the impact of parameters upon TCP incast throughput. The network topology is shown in Fig. 1. The bottleneck link has 1 Gbps capacity. The RTT between the server and the client is 0.1 ms. The module for TCP incast was developed by Phanishayee et al. (2008). First, we use Eq. 5 to estimate the TCP incast point, which is the number of concurrent servers when TCP incast occurs. Suppose the MSS is 1460 Byte (and the MTU is 1500 Byte accordingly) and the switch buffer size (B) is 128 KB. To explore the extreme case, we set the cwnd of all servers to 1, the minimum value. From Eq. 5, the TCP incast point can be calculated by

\text{incast point} = \frac{B + BDP}{MTU} = \frac{128 \times 1024\ \text{Byte} + 1\ \text{Gbps} \times 0.1\ \text{ms}}{1500\ \text{Byte}} = \frac{131072\ \text{Byte} + 12500\ \text{Byte}}{1500\ \text{Byte}} = 95.71 \qquad (9)
In the same way, we estimate some other TCP incast points and show them in Table 3. In order to validate the TCP incast points estimated in Table 3, the following experiments are conducted.

4.1 The extreme case when cwnd is 1

We first validate the extreme case when cwnd is set to 1. In this experiment, the SRU is set to 64 KB and the switch buffer size increases from 32 KB to 256 KB. Figure 2 shows the throughput for different numbers of servers when cwnd is 1. When the switch buffer size is 32 KB, the incast point is around 28 servers, which approximates the value presented in Table 3.
Table 3 The estimated TCP incast points

B (KB)   MTU (Byte)   cwnd   incast points
32       1500         1      30.18
64       1500         1      52.02
128      1500         1      95.71
256      1500         1      183.09
32       1500         2      19.25
64       1500         2      30.18
32       1500         3      15.61
64       1500         3      22.89
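For cross-checking, a minimal sketch of the incast-point estimate follows. For cwnd = 1 it is exactly Eq. 9; the generalization to cwnd > 1 used here, (B/cwnd + BDP)/MTU, is our reading of the Table 3 entries rather than a formula stated explicitly in the text.

```python
def incast_point(buffer_bytes: float, cwnd: int = 1,
                 bandwidth_bps: float = 1e9, rtt_s: float = 0.1e-3,
                 mtu_bytes: int = 1500) -> float:
    """Estimate the number of concurrent servers at which incast occurs.

    For cwnd = 1 this is exactly Eq. 9: (B + BDP) / MTU.
    For cwnd > 1 we assume (B / cwnd + BDP) / MTU, which reproduces Table 3.
    """
    bdp_bytes = bandwidth_bps * rtt_s / 8
    return (buffer_bytes / cwnd + bdp_bytes) / mtu_bytes

# Eq. 9: B = 128 KB, cwnd = 1 -> 95.71
print(f"{incast_point(128 * 1024):.2f}")
# Assumed generalization: B = 64 KB, cwnd = 3 -> prints 22.90 (Table 3 lists 22.89)
print(f"{incast_point(64 * 1024, cwnd=3):.2f}")
```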
With the increase of the switch buffer size, the number of concurrent servers increases accordingly. When the buffer size is 64 KB, 128 KB and 256 KB, the simulated incast points are around 50, 97 and 180, respectively, which are in conformity with the estimated points presented in Table 3.

4.2 The throughput when cwnd is 2 or 3

We further validate the cases when cwnd is set to 2 or 3. In order to compare with the results presented in Fig. 2, the SRU is set to 64 KB and the switch buffer size varies from 32 KB to 64 KB. Figure 3 shows that with the increase of cwnd, the number of concurrent servers supported with the same SRU and switch buffer size decreases accordingly. This trend is well anticipated from Eq. 5. Most importantly, the simulated results are also in conformity with the estimated incast points presented in Table 3.

4.3 The throughput with a varying SRU

In this section, we validate the throughput with a varying SRU when cwnd is set to the minimum value 1. In this experiment, the switch buffer size increases from 64 KB to 128 KB. Figure 4 shows that when cwnd is set to a constant value, the influence of a varying SRU on the incast points is significantly weakened. This can be explained by Eq. 5, since there is no direct relationship between the SRU and the TCP incast points. However, it should be noted that, as described in Section 3.2, shrinking the SRU can greatly increase the TCP incast points when cwnd is not a constant value (as in the traditional TCP or other TCP variants), because the nature of shrinking the SRU is to reduce the maximum cwnd (cwnd_max) as described in Eq. 3, and reducing the maximum cwnd can indeed increase the TCP incast points according to Eq. 5. However, if the cwnd is a constant value, the maximum cwnd is fixed, and the TCP incast point will not be influenced by the value of the SRU. Figure 4 also shows that after TCP incast happens, more bandwidth is exploited by a larger SRU, because a larger SRU takes more time to be transmitted and therefore grabs more bandwidth.

All in all, these experiments once again validate the accuracy and rationality of Eq. 5: most of the relationships among the cwnd, the switch buffer size and the SRU can be explained by it. Besides, Eq. 5 and the simulation results also provide significant implications, as listed below.
Fig. 2 Throughput for different numbers of servers when cwnd is 1 (x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: B=32KB, B=64KB, B=128KB, B=256KB)
• Optimizing the TCP congestion control algorithm alone cannot solve the TCP incast problem, only mitigate it. To mitigate TCP incast, the cwnd growth rate should be reduced. To solve the TCP incast problem, an application-layer approach (Podlesny and Williamson 2012), such as scheduling the concurrent servers, may be a way out.
• The switch buffer size and the MTU play a significant role in mitigating the TCP incast problem. To avoid packet loss, Eq. 5 should be met.
• Setting the cwnd to 1 maximizes the number of concurrent servers, but the bandwidth utilization will be affected when the number of concurrent servers is small.
5 IDTCP congestion control algorithm

Based on the above discussion, a novel congestion control algorithm named IDTCP is proposed in this section. IDTCP mitigates the TCP incast problem through the following strategies.
5.1 Constantly monitoring the congestion level of the link

Using packet loss as the only congestion signal, as traditional TCP does, is insufficient for accurate flow control (Katabi et al. 2002). The binary signal (loss or no loss) only expresses two extreme states of the network link; therefore, a delicate and effective mechanism which can continuously estimate the bottleneck status is introduced in IDTCP. In large-BDP networks, estimating the number of queuing packets (Brakmo and Peterson 1995) is widely used to measure the congestion level of the link. However, in data center networks, the cwnd required to fill the 1 Gbps bandwidth is much smaller than that in large-BDP networks, as shown in Table 2. As a consequence, using the number of queuing packets to estimate the congestion level of the link in data center networks is not as accurate as in large-BDP networks. Therefore, in the IDTCP algorithm, we use α = (RTT_avg − RTT_base)/RTT_base to estimate the congestion level of the link. RTT_base is the minimum RTT measured by the sender, and RTT_avg is the average RTT estimated during the current cwnd packets' updating period.
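As a rough illustration of this estimator, the sketch below keeps RTT_base as the minimum observed RTT and averages the samples seen during one cwnd period. The exact sampling and averaging window used by IDTCP is not spelled out here, so this is only a plausible reading.

```python
class CongestionLevelEstimator:
    """Estimate alpha = (RTT_avg - RTT_base) / RTT_base, clamped to [0, 1]."""

    def __init__(self):
        self.rtt_base = float("inf")   # minimum RTT ever measured
        self.samples = []              # RTT samples in the current cwnd period

    def on_ack(self, rtt_sample: float) -> None:
        self.rtt_base = min(self.rtt_base, rtt_sample)
        self.samples.append(rtt_sample)

    def alpha(self) -> float:
        """Congestion level for the period; alpha = 1 means RTT_avg >= 2 * RTT_base."""
        if not self.samples:
            return 0.0
        rtt_avg = sum(self.samples) / len(self.samples)
        self.samples.clear()           # start a new cwnd measurement period
        return min(1.0, max(0.0, (rtt_avg - self.rtt_base) / self.rtt_base))

# Example: base RTT 0.10 ms, current samples around 0.15 ms -> alpha = 0.40
est = CongestionLevelEstimator()
for r in (0.10, 0.15, 0.15, 0.16):
    est.on_ack(r)
print(f"alpha = {est.alpha():.2f}")
```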
Fig. 3 Throughput for different numbers of servers when cwnd is 2 and 3 (x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: cwnd=2,B=32KB; cwnd=2,B=64KB; cwnd=3,B=32KB; cwnd=3,B=64KB)
Fig. 4 Throughput for different numbers of servers when cwnd is 1 and the switch buffer increases from 64KB to 128KB ((a) Buffer Size=64KB, (b) Buffer Size=128KB; x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: SRU=32KB, SRU=64KB, SRU=128KB, SRU=256KB)
5.2 Slowing down and dynamically adjusting the congestion control window growth rate

TCP's Slow Start algorithm is not slow at all, for it doubles the cwnd every RTT. Table 2 verifies the aggressiveness of Reno in data center networks by computing the Slow Start time. Besides, Eqs. 5 and 6 imply that slowing down the cwnd growth rate can significantly increase the number of concurrent TCP servers. Therefore, IDTCP slows down and dynamically adjusts the cwnd growth rate according to the congestion level, which is estimated by α, as follows:

W(t) = (1 + m)^{t/RTT}, \quad 0 \le \alpha \le 1 \qquad (10)

where m is the cwnd growth rate of IDTCP, and α is the normalized queuing delay measured from the link. Compared to standard Reno as shown in Eq. 6, the Slow Start and Congestion Avoidance mechanisms of Reno are integrated into one mechanism in IDTCP. The cwnd growth in IDTCP is a discrete exponential increase per RTT, and the base is dynamically adjusted according to the congestion level of the link. The specific relationship between m and α is shown in Fig. 5. Figure 5 shows that the cwnd of IDTCP starts up with the Slow Start mechanism when there is little congestion (α ≤ 0.1) in the link. With the increase of the congestion level, the cwnd growth rate of IDTCP decreases accordingly, allowing as many concurrent servers as possible to join the network.

5.3 Setting the cwnd to 1 if the link is totally congested

From Eq. 5 and the simulation results in Section 4.1, we can find that setting the cwnd to 1 maximizes the number of concurrent servers. The shortcoming of setting the cwnd to 1 is that when the number of concurrent servers is small (e.g., fewer than 8 concurrent servers in the typical DCN with an RTT of 0.1 ms and a bottleneck bandwidth of 1 Gbps), the total throughput under such a cwnd cannot sufficiently utilize the bandwidth.
Fig. 5 The relationships between α and m (x-axis: The value of α; y-axis: The value of m; both ranging from 0.1 to 1)
That is, when the total cwnd (\sum_{i=1}^{n} cwnd(t)_n) is 8 (8 servers, the cwnd of each server being 1), the throughput calculated by Eq. 8 is TP = 1460 Byte × 8 / 0.1 ms = 934.4 Mbps, which is less than the bandwidth of 1 Gbps. On the other hand, if the number of concurrent servers is more than 8, the 1 Gbps bandwidth
can be fully utilized even if the cwnd of each server is 1. There are many approaches that could be adopted to obtain the number of concurrent servers, but for simplicity, IDTCP uses the congestion level of the link to decide when the cwnd should be set to 1. Specifically, if the propagation delay is equal to the queuing delay, which means that RTT_avg is 2 times RTT_base and α = 1, we consider the link totally congested and the cwnd is set to 1.
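To tie Sections 5.1–5.3 together, the following per-RTT sketch of the IDTCP window update shrinks the growth base (1 + m) as the congestion level α rises and resets the window to 1 when α = 1. The mapping m(α) used below (m = 1, i.e. Slow Start, for α ≤ 0.1, then decreasing with α) is an assumed stand-in for the curve in Fig. 5, which is given graphically rather than in closed form.

```python
def growth_rate(alpha: float) -> float:
    """Assumed stand-in for the Fig. 5 curve: m = 1 (Slow Start) while
    alpha <= 0.1, then decreasing toward 0 as congestion builds."""
    if alpha <= 0.1:
        return 1.0
    return max(0.0, 1.0 - alpha)

def idtcp_update(cwnd: float, alpha: float) -> float:
    """One RTT of IDTCP: Eq. 10 applied per RTT, i.e. W <- W * (1 + m);
    if the link is totally congested (alpha = 1), reset cwnd to 1."""
    if alpha >= 1.0:
        return 1.0                       # Section 5.3: totally congested
    return cwnd * (1.0 + growth_rate(alpha))

# Example trace: congestion level rising over successive RTTs.
cwnd = 1.0
for alpha in (0.0, 0.05, 0.3, 0.6, 0.9, 1.0):
    cwnd = idtcp_update(cwnd, alpha)
    print(f"alpha={alpha:.2f} -> cwnd={cwnd:.2f}")
```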
Fig. 6 Throughput for different numbers of servers when SRU is 64KB ((a) SRU=64KB, Buffer Size=32KB; (b) SRU=64KB, Buffer Size=64KB; x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: IDTCP, Reno)
Fig. 7 Throughput for different numbers of servers when SRU is 128KB ((a) SRU=128KB, Buffer Size=128KB; (b) SRU=128KB, Buffer Size=256KB; x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: IDTCP, Reno)
6 Performance evaluation

In this section, we use the model developed by Phanishayee et al. (2008), described in Section 4, to verify the performance of IDTCP.
Fig. 8 Throughput for different numbers of servers when SRU varies (x-axis: The Number of Servers (n); y-axis: Throughput (Mbps); curves: SRU=32KB, SRU=64KB, SRU=128KB, SRU=256KB)
The performance metric is the throughput of the bottleneck link, defined as the total number of bytes received by the receiver divided by the completion time of the last sender. We explore throughput by varying parameters such as the number of servers, the switch buffer size and the server request unit size.

Figure 6 shows the throughput for different numbers of servers when the SRU is 64 KB. When the switch buffer size is 32 KB, only 3 concurrent servers cause Reno TCP to collapse. Our approach, IDTCP, increases the number of concurrent servers to 25 under such a small buffer size. With the increase of the switch buffer size, the number of concurrent servers increases accordingly, and the increase for IDTCP is much more prominent than for Reno. When the buffer size is 64 KB, the number of concurrent servers of IDTCP reaches 47 without throughput collapse.

Figure 7 shows the throughput for different numbers of servers when the SRU is fixed to 128 KB. The increase of the buffer size greatly expands the number of concurrent servers of IDTCP. When the switch buffer size rises to 128 KB, the number of concurrent servers of IDTCP reaches nearly 100 when throughput collapse occurs, while the number for Reno is still no more than 10. If the buffer size is further expanded to 256 KB, the number of concurrent servers of IDTCP increases to 175 accordingly, while Reno supports fewer than 20.

We further evaluate the performance of IDTCP when the switch buffer size is fixed but the SRU varies from 32 KB to 256 KB, as shown in Fig. 8. It is clear that before the network is congested, the larger the SRU, the greater the throughput, which differs from the results shown in Fig. 4. However, with the increase of the number of concurrent servers, the network is inevitably congested. Once the network is totally congested, the cwnd of IDTCP is set to 1, and the performance of IDTCP then approaches the extreme case shown in Fig. 4.
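The throughput metric defined above is straightforward to compute from a simulation trace; a minimal sketch follows, with a hypothetical per-flow trace format (bytes received and completion time per sender).

```python
def incast_throughput_mbps(bytes_per_flow: list[int],
                           completion_times_s: list[float]) -> float:
    """Total bytes received divided by the completion time of the last sender,
    as defined in Section 6 (returned in Mbps)."""
    total_bits = 8 * sum(bytes_per_flow)
    last_finish = max(completion_times_s)
    return total_bits / last_finish / 1e6

# Example: 4 senders, one 64 KB SRU each, last sender finishing at 2.4 ms.
tp = incast_throughput_mbps([64 * 1024] * 4, [1.9e-3, 2.0e-3, 2.1e-3, 2.4e-3])
print(f"{tp:.1f} Mbps")  # about 873.8 Mbps for this hypothetical trace
```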
7 Conclusion

In this paper, we discuss the TCP incast problem in data centers by analyzing the relationship between the TCP throughput and the cwnd size of TCP. Our analysis explores the essence of the current methods which are adopted to mitigate the TCP incast problem. We verify the rationality and accuracy of our analysis by simulations. The analysis and the simulation results provide many valuable implications for the TCP incast problem. Based on these implications, we further propose a new congestion control algorithm named IDTCP to mitigate the TCP incast problem. Simulation results verify that our algorithm greatly increases the number of concurrent servers and
effectively mitigates TCP incast in data center networks. More detailed theoretical modeling, as well as more extensive evaluation in actual networks, is the subject of our ongoing work.

Acknowledgments This work is an extension of the conference paper entitled 'The Effect of the Congestion Control Window Size on the TCP Incast and its Implications', which was presented at IEEE ISCC 2013. This work is partially supported by the foundation funding of the Internet Research Lab of the Computer Network Information Center, Chinese Academy of Sciences, the President Funding of the Computer Network Information Center, Chinese Academy of Sciences under Grant No. CNIC ZR 201204, and the Knowledge Innovation Program of the Chinese Academy of Sciences under Grant No. CNIC QN 1303.
References

Alizadeh, M., Greenberg, A., Maltz, D., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M. (2010). Data center TCP (DCTCP). In ACM SIGCOMM computer communication review (Vol. 40, no. 4, pp. 63–74). ACM.
Brakmo, L., & Peterson, L. (1995). TCP Vegas: end to end congestion avoidance on a global internet. IEEE Journal on Selected Areas in Communications, 13(8), 1465–1480.
Chen, Y., Griffith, R., Liu, J., Katz, R., Joseph, A. (2009). Understanding TCP incast throughput collapse in datacenter networks. In Proceedings of the 1st ACM workshop on research on enterprise networking (pp. 73–82). ACM.
Fall, K., & Vardhan, K. (2007). The Network Simulator (ns-2). Available: http://www.isi.edu/nsnam/ns.
Gibson, G., Nagle, D., Amiri, K., Butler, J., Chang, F., Gobioff, H., Hardin, C., Riedel, E., Rochberg, D., Zelenka, J. (1998). A cost-effective, high-bandwidth storage architecture. In ACM SIGOPS operating systems review (Vol. 32, no. 5, pp. 92–103). ACM.
Katabi, D., Handley, M., Rohrs, C. (2002). Congestion control for high bandwidth-delay product networks. In ACM SIGCOMM computer communication review (Vol. 32, no. 4, pp. 89–102). ACM.
Phanishayee, A., Krevat, E., Vasudevan, V., Andersen, D., Ganger, G., Gibson, G., Seshan, S. (2008). Measurement and analysis of TCP throughput collapse in cluster-based storage systems. In Proceedings of the 6th USENIX conference on file and storage technologies.
Podlesny, M., & Williamson, C. (2012). Solving the TCP-incast problem with application-level scheduling. In IEEE 20th international symposium on modeling, analysis & simulation of computer and telecommunication systems (MASCOTS), 2012 (pp. 99–106). IEEE.
Vasudevan, V., Phanishayee, A., Shah, H., Krevat, E., Andersen, D., Ganger, G., Gibson, G., Mueller, B. (2009). Safe and effective fine-grained TCP retransmissions for datacenter communication. In ACM SIGCOMM computer communication review (Vol. 39, no. 4, pp. 303–314). ACM.
Wu, H., Feng, Z., Guo, C., Zhang, Y. (2010). ICTCP: incast congestion control for TCP in data center networks. In Proceedings of the 6th international conference (p. 13). ACM.
Zhang, J., Ren, F., Lin, C. (2011a). Modeling and understanding TCP incast in data center networks. In Proceedings IEEE INFOCOM, 2011 (pp. 1377–1385). IEEE.
Zhang, P., Wang, H., Cheng, S. (2011b). Shrinking MTU to mitigate TCP incast throughput collapse in data center networks. In Communications and mobile computing (CMC), 2011 third international conference (pp. 126–129). IEEE.
Guodong Wang received the Ph.D. degree from the University of Chinese Academy of Sciences in 2013. Dr. Wang is currently with the Computer Network Information Center, Chinese Academy of Sciences, China. His research focuses on Future Networks and Protocols, Fast Long Distance Networks, and Data Center Networks.
Yongmao Ren is an associate professor at Computer Network Information Center, Chinese Academy of Sciences. He received his B.Sc. in Computer Science and Technology in 2004 at University of Science and Technology of Beijing. He obtained the Ph.D. in Computer Software and Theory from Graduate University of the Chinese Academy of Sciences in 2009. His main research interests are on congestion control and future Internet architecture.
Ke Dou received the M.Sc. degree from the University of Chinese Academy of Sciences in 2013. He is currently pursuing the Ph.D. degree at the University of Virginia. His research focuses on Data Center Networks.
Jun Li is currently a professor, Ph.D. supervisor and vice chief engineer of the Computer Network Information Center, Chinese Academy of Sciences (CAS). He earned his M.S. and Ph.D. degrees from the Institute of Computing Technology, CAS, in 1992 and 2006, respectively. Before that, he received his B.S. degree from Hunan University in 1989. He has been working on research and engineering in the field of computer networks for over 20 years and has made many achievements in the fields of Internet routing, architecture, protocols and security. He once developed the first router in China and won national technological progress awards. He has been PI for many large research projects, such as an 863 program project. He has published over 50 peer-reviewed papers and one book. His current research interests include network architecture, protocols and security.