Explicit Rate Control for Anypoint Communication
Ken Yocum, Department of Computer Science, University of California at San Diego
Jeff Chase Department of Computer Science Duke University
[email protected]
[email protected]
Abstract— This paper investigates explicit multiflow rate control for common forms of ensemble communication. We address rate control in the context of Anypoint, a communication model that combines strong transport-layer guarantees with the flexibility of Anycast-style indirection. For example, Anypoint supports cluster-based services where extensible routers at the network edge route inbound requests according to pluggable service-specific redirection policies. Anypoint also supports multipoint-to-point communication in the reverse or outbound direction. Our approach extends recent work on an Explicit Control Protocol (XCP) and demonstrates compelling benefits for multiflow rate control. For example, it allows us to unify flow and congestion control for multipoint traffic. The explicit rate control approach defines a set of XCP header transformations to split and merge XCP traffic flows in an Anypoint edge router. We describe a prototype XCP-enabled transport protocol and an Anypoint-XCP switch implementing the split/merge transformations. Experimental results evaluate our approach in a prototype Anypoint cluster in the ModelNet emulation environment, extended with support for emulated XCP routers. The results show how Anypoint-XCP flows compete fairly with conventional XCP flows and meet a range of multiflow rate control objectives in a flexible and efficient way.
I. Introduction
Several factors are driving an increased interest in multiflow rate control and ensemble communication. Web clients often draw and assemble content from multiple sites to satisfy a single request [1]. And while network-layer Anycast [2] sees limited real-world deployment, many Internet content services employ the underlying communication model. For example, Web server farms [3] and scalable storage systems [4, 5] often use indirection to allow dynamic binding to a specific server from a set of servers that can provide a given piece of content. Similarly, environments for large-scale data-intensive computing have inspired new multiflow approaches for cluster-to-cluster communication [6] and more general multipoint-to-point communication in which multiple flows converge at a single site [7].
Fig. 1. The Anypoint communication model encapsulates both point-to-multipoint (Anycast) and multipoint-to-point communication. It allows two clusters to communicate without explicit knowledge of the nodes participating in the other cluster.
This paper considers multiflow rate control in the context of Anypoint [8], a transport-layer abstraction for communication with ensemble sites such as cluster-based network services. Anypoint is designed for transports that transfer data as rate-controlled streams of frames, such as SCTP. Extensible Anypoint switches at the ensemble edge mediate communication; an Anypoint switch represents the ensemble as a single virtual site to the outside network, and implements a redirection policy to route incoming frames to ensemble members. Section II discusses Anypoint in more detail. Crucially, Anypoint's rate control issues are similar to the core challenges for general multipoint communication and multiflow rate control. Figure 1 illustrates the range of multipoint communication that occurs between an ensemble and a connection "peer" or client in the Anypoint model. Inbound traffic arriving at an Anypoint ensemble is distributed across the ensemble members according to a service-specific routing policy. Outbound traffic from multiple ensemble members may converge on the same client. In addition, Anypoint sites may be combined for cluster-to-cluster communication, or they may cascade to form arbitrary trees to support tiered service architectures. This creates several rate control challenges that are common to other multipoint environments:
• The rate of an inbound connection to the ensemble must be limited to avoid network path and endpoint buffer overflows.
• The fraction of inbound traffic destined to an ensemble member must not exceed the rate at which the member can receive and process it.
• The outbound traffic on a given Anypoint connection or session originates from multiple ensemble members (a multipoint-to-point flow). The aggregate flow to the client must not exceed the rate at which the client can receive and process it. Ensemble members must allocate rates to share this capacity effectively.
• The traffic on a given Anypoint connection or session should compete fairly with other connections sharing a bottleneck link. In particular, traffic on an Anypoint connection should be session-fair with respect to competing connections: a connection's fair share of a bottleneck resource is independent of the number of constituent ensemble member flows.
We develop and evaluate an approach to multiflow rate control for Anypoint traffic. Our approach is based on explicit rate coordination in the Anypoint switches—the network points where the flows of a given Anypoint connection split and merge. To deliver rate feedback to the end systems, we use a derivative of the Explicit Control Protocol (XCP [9]), in which routers and switches along the network path for each flow mark the packet headers with rate signals. We recognize that although XCP has many compelling benefits, it is unlikely to be deployed in the Internet core. However, it is an important and valuable research perspective to understand the full impact of such a redesign of Internet congestion control. This work extends XCP's benefits beyond those already demonstrated for point-to-point flows, and finds that XCP is a powerful basis for multipoint communication. This work makes the following contributions:
• Multipoint explicit rate control. The key to our approach is a new set of transformation rules applied to XCP headers within a network switch. These transformations can be applied to any XCP-based transport, not only to our prototype Anypoint-compatible transport protocol; they do not require a frame-based transport, nor any Anypoint switch functionality beyond parsing and writing XCP headers [9].
• Rate-based flow control. We address flow control in a unified way by treating it as a congestion control problem; receive buffer space is a shared resource subject to congestion. This soft flow control approach is flexible enough to meet flow control needs for multipoint communication, and our experiments confirm that it avoids the pitfalls of obvious credit-based flow control policies [8].
• Prototype-based evaluation. We have prototyped a complete rate-controlled Anypoint implementation (Section V). It includes kernel extensions to FreeBSD for a host-based Anypoint switch and an Anypoint-compatible transport protocol that supports application-layer framing and XCP-based congestion control. We also added support for emulating XCP-enabled routers in the ModelNet Internet emulation environment [10]. This provides a direct execution environment for evaluating our XCP rate control scheme or any general XCP implementation. We find the combination of these two point-to-point transport facilities—XCP and framing—particularly powerful for Anypoint communication.
II. Background
The original impetus for the Anypoint model was to generalize "L4-L7" server switches that support load balancing and content-aware request routing for Web server clusters; the objective is to define a general redirecting switch architecture that accommodates pluggable indirection policies for a wide range of service protocols, not limited to HTTP over TCP. A fundamental goal of Anypoint is to reconcile transparent Anycast-style communication with strong transport guarantees, including reliable, ordered, rate-controlled transmission.
For example, consider an Anypoint cluster implementing a request/response service, such as a network file service. Clients of the service open connections to the service in the usual manner, and transmit streams of requests. As in Anycast, the server to receive each request is selected dynamically, and the selection policy may consider the content of the request [3], such as the specific file or block requested [4]. Thus, different ensemble members may concurrently execute requests on behalf of any given client. To a client, the server cluster appears as an ordinary endpoint for a reliable, ordered transport connection, i.e., it appears to be a service running at a single host at a virtual IP address. Anypoint hides the selection policy and the structure of the ensemble from the client.
The key to Anypoint is a set of rules for maintaining state and transforming packets to implement the redirection policy in an extensible Anypoint switch. An Anypoint switch does not terminate connections, but merely transforms packets to maintain end-to-end transport protocol guarantees at the end systems. The transport protocol itself is a general-purpose protocol with no Anypoint-specific functionality, although it does require advanced features—application-layer framing and partial ordering—which exist today in transports such as SCTP. To an end system, Anypoint connections are indistinguishable from point-to-point connections using the same transport: we refer to this property as transport equivalence. The switch functions include sequence number translations, acknowledgment coalescing, and coordination of rate control signals as described in this paper. Because the switch operates at the transport layer, we refer to it as a transport switch. It maintains per-flow soft state that is proportional to the size of the cluster.
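As a purely hypothetical illustration of the first of these functions, one way a transport switch could translate between the connection-wide sequence space seen by the peer and each member flow's private sequence space is a per-flow offset table; the sketch below is ours and not the mechanism specified in [8]:

    # Hypothetical sketch (ours): per-flow sequence number translation
    # via offsets between the peer's sequence space and each member's.
    class SeqTranslator:
        def __init__(self):
            self.offset = {}  # member id -> (peer_seq - member_seq)

        def to_member(self, member, peer_seq):
            return peer_seq - self.offset[member]

        def to_peer(self, member, member_seq):
            return member_seq + self.offset[member]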
The Anypoint transport switch is a trusted component of the service implementation, designed to operate at the edge of a server cluster (e.g., it is allowed to decrypt incoming data). The switch is extensible, supporting pluggable application-layer routing modules (ALRMs) that implement the service-specific routing policy. The ALRM determines the indirection schedule, or how the flow is split across ensemble endpoints. In our previous work [8] we demonstrated the Anypoint transport switching approach as a building block for scalable IP storage appliances, and compared it to alternative approaches such as TCP proxies with respect to performance, scalability, and reliability.
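The ALRM interface itself is not detailed in this paper; a minimal sketch of what such a pluggable module might look like (the class and method names are our assumptions, not the authors' API):

    # Hypothetical ALRM interface (ours): the text specifies only that
    # an ALRM maps inbound frames to ensemble endpoints.
    class ALRM:
        def route(self, frame) -> int:
            """Return the index of the ensemble endpoint that should
            receive this inbound frame (the indirection schedule)."""
            raise NotImplementedError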
III. Anypoint Rate Control
This work addresses the challenge of splitting and merging transport flows while avoiding network congestion and endpoint buffer overflows. We do this at explicit points, Anypoint switches, within the network, at the granularity of connections. An Anypoint connection consists of an inbound flow into the ensemble and an outbound flow to the connection peer. Figure 1 illustrates inbound and outbound flows. The inbound flow is split by the switch into separate flows to each ensemble member. In the other direction, individual ensemble flows are merged into a single outbound flow. Here we may speak of the flow between switch and peer, or the separate flows between switch and ensemble endpoints. In the parlance of multipoint-to-point rate control, an Anypoint connection, consisting of one or more flows, is analogous to a session.
In Anypoint, transport equivalence defines the fair behavior of connections. Transport equivalence implies that end nodes do not change their rate control policies to use Anypoint. Each flow abides by the same notion of rate control fairness; each is limited by the available bandwidth between the endpoint and the switch. That bandwidth is determined by the rate control policy at the endpoint, e.g., a split or merged TCP flow should remain TCP-compatible, sharing network paths fairly with other TCP connections. Thus all Anypoint flows are "session-fair," sharing bottlenecks fairly independent of the ensemble size. This is in contrast to "connection-fair" behavior, where session rates are proportional to the number of flows within them.
A. Inbound and Outbound Flows
The inbound rate control challenge is to maximize the total inbound rate while avoiding overflowing the different network paths to each ensemble member. Observe that the peer's sliding window paces outgoing data at the rate of returning acknowledgments that open up free buffer space for new transmissions. Any outstanding data on the slowest or most congested path stops the peer from transmitting until the data is acknowledged. This limits the amount of outstanding data down all paths. Thus the optimal inbound rate is a function of the switch's indirection policy. Because the peer is unaware that its traffic is split across the ensemble, we must merge congestion information from multiple network paths in proportion to how much they are used.
In the outbound direction, ensemble endpoints must coordinate to collectively achieve the appropriate capacity across the bottleneck link to the connection peer, but also to share that capacity among themselves. The key challenge here is that an ensemble endpoint does not know how the rest of the ensemble is using the bottleneck capacity.
B. Flow Control
In a unicast scenario, flow control is a simple contract between a sender and a receiver to manage endpoint resources (buffering) of a single flow. However, outbound flows must share the peer's buffer among the ensemble, and an inbound flow may consume receive buffering at any ensemble endpoint. Assuming credit-based flow control (e.g., TCP window advertisements), the switch could split the peer's available credit evenly across the ensemble to control the outbound flow. But then actively transmitting endpoints may starve as quiescent endpoints sit on surplus credits. Similarly, the switch could throttle an inbound connection to the slowest ensemble endpoint. Both "strict" policies avoid buffer overruns but limit the ability of the switch to optimally manage receiver buffers. We overcome these limitations through soft flow control (SFC), an optimistic rate-based flow control, detailed in Section IV-A, that allows dynamic allocations. For comparison, consider the behavior of multiple uncoordinated TCP connections when the peer's receive buffer space is constrained. These connections can match the throughput of SFC only by optimistically overcommitting the receiver's memory to each of the connections; this is the default policy in most TCP implementations, but then the receiver must drop packets if all senders transmit simultaneously. In effect, SFC reallocates receive buffer memory among the senders according to their demand, similarly to congestion management on a shared link.
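To make the contrast concrete, the following sketch (ours, under the simplifying assumption that per-sender demand is known to the allocator) compares a strict even credit split with a demand-aware reallocation in the spirit of SFC:

    # Illustrative sketch (ours): strict even credit split vs. a
    # demand-aware allocation in the spirit of soft flow control.
    # 'buffer' is the peer's receive buffer; 'demands' are per-sender
    # backlogs, assumed known here for simplicity.
    def strict_split(buffer, demands):
        share = buffer / len(demands)
        return [min(share, d) for d in demands]  # surplus credit strands

    def demand_aware(buffer, demands):
        # Water-filling: redistribute credit unused by quiescent senders.
        alloc = [0.0] * len(demands)
        active = [i for i, d in enumerate(demands) if d > 0]
        remaining = buffer
        while active and remaining > 1e-9:
            share = remaining / len(active)
            for i in list(active):
                grant = min(share, demands[i] - alloc[i])
                alloc[i] += grant
                remaining -= grant
                if alloc[i] >= demands[i]:
                    active.remove(i)
        return alloc

    # Example: buffer=8, demands=[8, 0, 0]. strict_split grants the one
    # busy sender only 8/3, while demand_aware grants it the full buffer.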
C. A Transport Switching Rate Control Architecture
For the switch to coordinate rates, we make two requirements: the transport carries explicit rate information in its headers, and the transport employs receiver-based congestion control. We assume that transport headers carry rate, the current transmit rate, and fdbk, the future sending rate. The receiver uses the returning fdbk to adjust its transmission rate. Protocols that fit this model include TFRC and XCP [11, 9]. The switch coordinates outbound and inbound flows by transforming packet headers, merging outbound flow headers and splitting inbound flow headers.
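As a minimal sketch of the per-packet state this architecture assumes (the field names follow the text above; the record type itself is ours, not the protocol's wire format):

    # Minimal sketch (ours) of the rate information assumed in headers:
    # 'rate' is the flow's current transmit rate, and 'fdbk' is the
    # future sending rate signal, rewritten by switches along the path.
    from dataclasses import dataclass

    @dataclass
    class RateHeader:
        rate: float  # current transmit rate
        fdbk: float  # future sending rate signal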
Fig. 2. A transport switch modifies XCP headers for outbound and inbound data flows. This diagram shows two transport switches mediating communication between two clusters. An outbound flow for the first transport switch becomes the inbound flow for the second. In this direction the switch updates all three XCP header fields: cwnd, rtt, and fdbk.
IV. Transport Switching with XCP
The eXplicit Control Protocol (XCP) [9] addresses many of the issues with traditional point-to-point transports in the Anypoint context. XCP is designed for networks with high bandwidth-delay products, virtually eliminates congestion loss, is independent of RTT, runs the network at a high level of efficiency, and removes the onus of inferring congestion from secondhand end-host observations. By eliminating congestion loss, XCP obviates the need for mechanisms like fast retransmit and loss differentiation in contexts where link-layer loss is rare. Critically, explicit rate control allows a network switch to dynamically coordinate flow rates, and is compatible with our rate control architecture. XCP details are found in [9]; here we give an overview as it pertains to Anypoint.
XCP endpoints are window based; the size of the congestion window determines the transmission rate. XCP requires support from core routers along the network path between source and destination. Each router adjusts individual flow rates to match the total arrival rate to the outgoing link's capacity. XCP routers have separate controllers for efficiency and fairness. The efficiency controller determines the available capacity on the outgoing link, and the fairness controller allocates that capacity across the current flows. XCP packet headers contain sufficient information to run both controllers without maintaining per-flow state in the routers. The routers update rate information in the packet headers as packets travel from source to sink.
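For concreteness, a simplified sketch of the router's efficiency controller, following the control law and stability constants given in [9] (the fairness controller, which apportions this aggregate feedback across individual packets, is omitted):

    # Simplified sketch (ours) of the XCP efficiency controller from [9]:
    # each control interval (one average RTT), the router computes an
    # aggregate feedback from spare bandwidth and persistent queue.
    ALPHA, BETA = 0.4, 0.226  # stability constants from [9]

    def aggregate_feedback(capacity, input_rate, queue, avg_rtt):
        spare = capacity - input_rate  # spare bandwidth on the link
        return ALPHA * avg_rtt * spare - BETA * queue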
XCP headers contain three fields: the round-trip time (rtt), the sender's current congestion window (cwnd), and a fdbk field initialized to the sender's desired rate increase (in bytes). To control the sender's congestion window, XCP routers along the network path update fdbk according to the read-only fields cwnd and rtt. Returning acknowledgments carry fdbk to the sender; the sender sets its cwnd to Max(cwnd + fdbk, s), where s is the packet size. Note that the bottleneck XCP router determines the feedback returned to the source: XCP routers only decrease fdbk.
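A one-line sketch (ours) of that sender-side update:

    # Sketch (ours) of the sender's window update on receiving feedback:
    # cwnd never falls below one packet of size s.
    def apply_feedback(cwnd: int, fdbk: int, s: int) -> int:
        return max(cwnd + fdbk, s)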
In this and the next sections we use the following terminology. For any given connection, the achieved throughput to/from the peer is λ_p, and the achieved throughput to/from each ensemble endpoint is λ_i. There are N ensemble endpoints. For inbound flows we define β_i, the receive ratio for ensemble member i, as the fraction of traffic directed to that endpoint: β_i = λ_i / λ_p.
The goal of the switch transforms in Section III is to make sure that the rate control information in the headers is consistent before and after the switch. This is distinct from XCP controllers, which share a link among multiple point-to-point transport flows. A transport switch splits an inbound XCP flow into N transport-equivalent XCP flows, where N is the ensemble size. In the reverse direction, it merges ensemble flows into a unified XCP outbound flow.
There are four switch transforms. Each transform ensures that fdbk and rate are consistent with the actual rate of the flow along that path and that the resulting flow is transport equivalent. The first two modify rate and fdbk from source to sink (data flow), and the second two modify fdbk from sink to source (acknowledgment flow). For each flow there is a merge (outbound) and a split (inbound) operation. The switch merges data headers as they arrive from the ensemble. The merged data flow represents the collective desires of the N ensemble members as rate = Σ_{i=1}^{N} rate_i and fdbk = Σ_{i=1}^{N} fdbk_i. In contrast, we split a data flow header by multiplying rate and fdbk by β_i, which maintains λ_i = λ_p · β_i.
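A sketch (ours) of the two data-flow transforms just described, with headers modeled as simple records:

    # Illustrative sketch (ours) of the data-flow header transforms.
    def merge_data_headers(headers):
        # Outbound: headers from N ensemble members combine into one
        # header whose rate and fdbk are sums over all members.
        return {"rate": sum(h["rate"] for h in headers),
                "fdbk": sum(h["fdbk"] for h in headers)}

    def split_data_header(header, beta_i):
        # Inbound: the header for member i is scaled by its receive
        # ratio beta_i, maintaining lambda_i = lambda_p * beta_i.
        return {"rate": header["rate"] * beta_i,
                "fdbk": header["fdbk"] * beta_i}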
The acknowledgment flow returns fdbk to the source(s). The merge transform cannot simply sum fdbk from each ensemble member, because doing so allows unused paths to increase the input rate, potentially overflowing used paths. The fdbk_i returning from each path i indicates a rate increase or decrease. We know that λ_p = λ_i / β_i, and thus the change in the peer's rate is proportional to the change in i's rate (∆_p = ∆_i / β_i). Thus for any particular i, a fdbk of fdbk_i / β_i achieves the rate change down path i. Finally, we find the rate change along the most constrained inbound path by taking the minimum across all paths i (Equation 1):

    fdbk = Min_{∀i} ( fdbk_i / β_i )    (1)
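Equation 1 in code form (a sketch of ours):

    # Sketch (ours) of the acknowledgment merge transform (Equation 1):
    # normalize each member's feedback by its receive ratio and take the
    # minimum, so the most constrained inbound path governs the peer.
    def merge_ack_feedback(fdbks, betas):
        return min(f / b for f, b in zip(fdbks, betas))

    # Example: fdbks=[4, 9] with betas=[0.5, 0.3] yields min(8, 30) = 8;
    # the peer's rate change is limited by the first, constrained path.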
The last acknowledgment flow transform splits the returning feedback among the ensemble sources. The transform operates in a fashion analogous to the AIMD controllers in standard TCP. It spreads a rate increase evenly among all ensemble sources, and it reduces ensemble rates in proportion to their contribution to the total outbound rate. Thus, for ensemble member i, positive feedback is fdbk / N, and negative feedback is fdbk · rate_i / rate.
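And a matching sketch (ours) of that split:

    # Sketch (ours) of the acknowledgment split transform: additive
    # increase is spread evenly; multiplicative decrease is apportioned
    # by each source's share of the total outbound rate.
    def split_ack_feedback(fdbk, rates):
        total = sum(rates)
        if fdbk >= 0:
            return [fdbk / len(rates) for _ in rates]
        return [fdbk * (r / total) for r in rates]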