Run-time System for Scalable Network Services

Upendra Shevade†   Ravi Kokku‡   Harrick M. Vin†
†The University of Texas at Austin, {upendra,vin}@cs.utexas.edu
‡NEC Labs, Princeton, [email protected]

ABSTRACT

Sophisticated middlebox services, such as network monitoring and intrusion detection, DDoS mitigation, worm scanning, XML parsing and protocol transformation, are becoming increasingly popular in today's Internet. To support high throughput, these services are often deployed on Distributed Memory, Multi-Processor (DM-MP) hardware platforms such as a cluster of network processors. Scaling the throughput of such platforms, however, is challenging because of the difficulties and overheads of accessing persistent, shared state maintained by the services.

In this paper, we describe the design and implementation of Oboe, a run-time system for DM-MP platforms that addresses the above challenge through two foundations: (1) category-specific management of shared state, and (2) adaptive flow-level load distribution for addressing persistent processor overload. Our simulations demonstrate that Oboe can achieve performance within 0-5% of an ideal adaptive system. Our prototype implementation of Oboe on a cluster of IXP2400 network processors demonstrates the scalability achieved with increasing number of processors, number of flows and state size.

I. INTRODUCTION

The designs of most modern network services follow a canonical computational model: a sequence of packets arrives at the service, gets classified as belonging to a flow, undergoes service-specific computation, and gets forwarded to a subsequent service, a client or a server (see Figure 1). Examples of such middlebox services include network monitoring and intrusion detection [4], DDoS mitigation [23] and worm scanning [33], XML parsing [1], Enterprise Service Bus (ESB) protocol transformation, and ESB policy enforcement [1]. To support high throughput (thousands to millions of packets per second), these services are often deployed on a distributed-memory multi-processor (DM-MP) platform (e.g., a cluster of network processors, general-purpose processors, and co-processors [16]). The design of such platforms is guided by the following requirements: (1) scale the service throughput linearly with the number of parallel processors; (2) minimize the increase in per-packet processing time resulting from the DM-MP architecture; (3) maintain intra-flow packet order even though multiple packets are processed in parallel; and (4) minimize the programming effort required to exploit parallelism on such platforms. Meeting these requirements has proved to be challenging for three reasons.

[Fig. 1. Throughput-oriented network service model. The figure depicts a request-processing model: requests arrive from a service, client or server; a DM-MP platform of single/multi-core processors and memory identifies the flow of each packet, performs computation over flow-specific and flow-independent state, and forwards the results to a subsequent service, client or server.]

First, most modern network services are stateful; they maintain persistent state, such as TCP session state or a worm signature table, that is shared across a sequence of related packets (referred to as a flow). Second, for most of these services, packet processing times are dominated by memory access latencies. Third, flow sizes and arrival rates (i.e., the number of packets that belong to a flow and their arrival rate) can be highly skewed; further, the size or rate of a flow cannot be easily predicted when the first packet of a flow arrives at the service.

State-of-the-art solutions for the scalability of DM-MP platforms [14], [18], [27], [32] either (1) assume that the services are stateless or have only read-only state, (2) map shared state to a globally accessible memory and perform packet-level load distribution, or (3) pin each new flow to a particular processor and forward all packets of the flow to the same processor. Whereas the approach of mapping shared state to global memory significantly increases the packet processing time due to both remote memory accesses and locking, static pinning of flows to processors can lead to significant load imbalance because of skewed flow characteristics.

In this paper, we present the design and implementation of Oboe, a run-time system that improves the scalability of DM-MP platforms by jointly and efficiently performing state management and load distribution. Oboe's design is based on two foundations:

• Light-weight, category-specific state management: Oboe exploits the observation that much of the state management complexity and overhead arises from accesses and updates to persistent, per-flow state, and not from transient data such as the stack, or predominantly read-only data such as a virus signature table or a route table. Oboe performs category-specific optimizations to make state management light-weight.

[Fig. 2. Scalable Platform Model. Packets arrive at a set of load distributors (LD1 ... LDn), each of which resolves a mapping F(flowdef) -> PE and schedules packets onto processing engines (PE1 ... PEm); processed packets then leave the platform.]

• Adaptive, flow-level load distribution: Oboe separates mechanisms and policies for supporting adaptive, flow-level load distribution. The mechanisms address the question: how should a flow be migrated atomically from an originating processor O to a target processor T? The policies address three questions: (1) when should the pinning of flows be adapted? (2) which flows should be re-pinned? and (3) what should the new mapping of flows to processors be? The main novelty and challenge in realizing such a policy in DM-MP platforms for network services is in minimizing the policy's computation and memory footprint.

Through a prototype implementation on a cluster of IXP2400 network processors, we demonstrate that the two foundations enable us to identify and employ techniques that are simple, yet effective and light-weight to implement; these characteristics are desirable for handling high line rates (such as millions of packets per second). Our simulation results show that Oboe achieves throughput within 5% of the optimal for small to medium-sized clusters. Our prototype experiments demonstrate the scalability achieved with increasing number of processors, number of flows and state size. We believe that the techniques developed in this paper are useful for building scalable platforms for many high-throughput network services, including the GENI high-performance network platform [37].

The paper is organized as follows. In Section II, we formulate the problem of designing a run-time system for scalable throughput-oriented middleboxes. In Section III, we describe our solutions for light-weight state management and adaptive, flow-level load distribution. We discuss our prototype implementation in Section IV. Evaluations of our design and prototype implementation are presented in Section V. We discuss related work in Section VI, and summarize our contributions in Section VII.

II. PROBLEM FORMULATION

A. System Model

We consider a DM-MP system consisting of two types of components: a set of M processing engines (PEs) that host the computation component of the services and process packets, and N Load Distributors (LDs) that examine packets and schedule them onto the PEs (Figure 2). LDs communicate among themselves and with the PEs to share system-level metrics such as processor load, memory usage, and per-flow packet arrival rates.

Each PE can be either a uni-processor or a multi-threaded multi-processor. We assume that each PE has an associated local memory. Each PE hosts a replica of the same service. Each replica has an input packet queue in its local memory, to which the Load Distributor enqueues packets to be processed at the PE.

Data accessed during packet processing can be categorized into four classes: (1) stack data, which is active while processing a packet, (2) data contained in each packet, (3) per-replica packet queues, and (4) persistent application data, which is active across packet-processing boundaries. In our model, only persistent data is shared across PEs, since we assume that, once dequeued, a packet is completely processed on a single PE. (Throughout the paper, a flow represents a sequence of packets related by a predefined set of fields. For instance, all packets of an end-to-end TCP connection have the same 5-tuple (SrcIP, DestIP, SrcPort, DestPort, Protocol).)

B. Design Challenges

Shared persistent state in network services makes it difficult to meet scalability requirements in DM-MP platforms. In a DM-MP architecture, persistent state is distributed over multiple distributed memories and is accessed by multiple processors. Traditionally, two classes of solutions have been developed to utilize parallel platforms: one class exploits packet-level parallelism, and the other exploits flow-level parallelism, each enabling different strategies for distributing persistent state. In what follows, we present results from trace-driven simulations to demonstrate that both packet-level and static flow-level load distributors are inadequate for achieving throughput scalability, and make a case for adaptive flow-level parallelism.

Experimental Setup. We simulate a DM-MP system with a single load distributor and multiple PEs, with the network service modeled as a three-stage computation; the second stage represents the critical-section processing, where the per-flow element of shared state corresponding to the packet's flow is locked, read and written. We discuss results obtained for a packet trace collected from the link connecting the Argonne National Labs (ANL) to its service provider [26]. This trace contains 427900 packets, which get clustered into 1529 flows when the flow definition is DestIP and 5574 flows when the flow definition is the 5-tuple. The per-packet processing time is set to (total trace time / number of packets in the trace). For the experiments where we vary the number of PEs in the system, for the comparison to be meaningful, we keep the total processing capacity of the system constant by scaling down the processing capacity of each PE in proportion to the number of PEs. (The alternative methodology of scaling the traces up with cluster size was not used due to difficulties in maintaining flow characteristics while scaling.)

We use the metric Loss in Throughput, defined as LT = 1 - T_A / T_I, where T_A represents the throughput achieved by the load distribution strategy under consideration, and T_I is the throughput achieved by an ideal adaptive flow-level scheduler that can adapt the flow-to-PE mapping at per-packet granularity with zero remapping latency.
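As a concrete illustration of the metric (the numbers below are hypothetical, not measured values from our traces): if a given strategy achieves T_A = 0.8 Mpps on a trace for which the ideal adaptive scheduler achieves T_I = 1.0 Mpps, then

    LT = 1 - \frac{T_A}{T_I} = 1 - \frac{0.8}{1.0} = 0.2,

i.e., a 20% loss in throughput.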

[Fig. 3. Loss in throughput for packet-level distributor: (a) Locking effects, plotting percentage loss in throughput against the number of processors for csf = 0.1, 0.2, 0.3, 0.5, 0.75 and 0.9; (b) Remote access effects, plotting percentage loss in throughput against the write fraction for r/l = 10, 50, 100 and 200.]

The ideal throughput represents the maximum value a practical adaptive load distributor can achieve; the actual throughput of a practical scheduler will depend on the efficacy of its policy choices and mechanism implementation.

1) Packet-level Parallelism: These solutions map each incoming packet to the least-loaded PE at the instant of its arrival. Since a packet belonging to any flow may be processed at any PE in a packet-level system, the data structure is partitioned among PE memories, with locking used to coordinate writes. While packet-level load distribution can achieve fine-grained distribution of workload across PEs, it can increase the packet processing time due to remote state accesses and locking. Further, an additional mechanism is needed to correct intra-flow packet reordering.

Figure 3(a) shows the effect of locking on system throughput. In order to isolate the effect of locking from that of remote accesses, for this experiment we assume that the costs of remote and local accesses are equal. csf represents critical-section processing as a fraction of the total processing time. For a particular csf value, as the number of PEs increases, the fraction of state located locally reduces. In addition, the number of entities contending for any lock increases. As a result, the loss in throughput increases with a higher number of PEs. For a particular number of PEs, as csf increases, PEs spend more time inside the critical section. As a result, other PEs need to spend more time waiting to obtain the lock, and the loss in throughput increases.

Figure 3(b) shows the effect of remote access latency on system throughput in a system with eight PEs. For this experiment, we assume that no locking occurs. Write Fraction represents the fraction of per-packet processing cycles spent in writing to shared data, and r/l represents the ratio of remote access latency to that of local access. The graph shows that for a particular value of Write Fraction, as remote accesses become more costly with respect to local accesses, the loss in throughput becomes significant. For a particular value of r/l, as the write fraction (which is proportional to the number of accesses performed) increases, the loss in throughput increases. As an example, even for 1% writes, an r/l of 100 leads to a 50% loss in throughput. Note that the above results show the loss due to locking and remote accesses in isolation; the loss percentage will be greater when both locking and remote accesses occur in a real system.

2) Flow-level Parallelism: These solutions map a flow to a PE and direct all subsequent packets of the flow to the same PE. As a result, per-flow state can be localized to the corresponding PE memory, and no global locks are required. (For simplicity, we assume that a PE processes a packet of a flow completely before dequeuing another packet from the queue, so we do not model local locks for shared state in the simulations.)

[Fig. 4. Loss in throughput for static flow-level distributor: (a) Hash-based; (b) Load-based. Each panel plots percentage loss in throughput against the number of processors for three flow definitions: FiveTuple, SourceIP-DestIP and DestIP.]

Thus, this approach reduces the overhead of persistent state accesses. In addition, intra-flow packet order is automatically maintained as long as no PE internally disrupts the order. Flow-level distributors can be either static or adaptive, based on whether the flow-to-PE mapping is changed during a flow's lifetime. Static distribution leads to a simple and low-overhead system, and does not require support for migration of live flow state. However, the non-uniformities in flow durations and in the arrival rate of packets within a flow often yield poor balancing of load across processors for a static scheme.

Proposed static schemes are either hash-based or load-based. Figure 4(a) plots the loss in throughput for three different flow definitions with a static scheme that uses the CRC hash value of the flow identifier to map a flow to a PE. The graph shows that the loss can be significant even for a small number of PEs, and increases with increasing cluster size. Also, extending the analysis by Shi et al. [32], we can show that the discrepancy between the heaviest-loaded PE and the least-loaded PE is, in general, proportional to the number of PEs in the system, which can be significantly worse than our current trace-based observations. We present the analysis in the Appendix. In load-based approaches [14], the first packet of a flow is pinned to the least-loaded PE at that instant, and all subsequent packets go to that PE. Figure 4(b) shows that for a coarse flow definition such as DestIP, the loss in throughput can be more than 30% for a 24-PE configuration. Also note that such a scheme imposes considerable space and lookup overhead for maintaining all flow-to-PE mappings individually, since they are based on run-time values at the instant of arrival of the first packet of a flow.

Having demonstrated quantitatively the limitations of packet-level and static flow-level scheduling, we argue that adaptive flow-level parallelism represents the best choice for scalable packet-processing systems (PPSs). In the next section, we present the design and implementation of Oboe and describe how it enables scalable packet processing.

III. OUR APPROACH

In this section, we describe Oboe, a run-time system that (1) performs adaptive flow-level load distribution to avoid persistent load on any individual PE, and (2) performs category-specific state management to minimize the overhead of state migration during flow re-distribution.

TABLE I
DATA STRUCTURE CLASSIFICATION FOR EXAMPLE NETWORK SERVICES

Service                        | Main data structures                                  | Category
IP forwarding                  | Trie for route table [11], [34]                       | Flow-independent, read-only
Metering [RFC 2698]            | Flow record elements to store packet and byte counts  | Flow-specific, read-write
Snort port scan detector [4]   | Splay tree for ports scanned by flows                 | Flow-specific, read-write
NAT-PT [RFC 2766]              | IPv4-to-IPv6 translation and vice versa               | Flow-specific, read-write
TCP splicer [41]               | Sequence number mapping between splices               | Flow-specific, read-write
Web load balancer [14], [27]   | Flow to processor mapping                             | Flow-specific, read-only or read-write
DDoS mitigation [23]           | Flow-specific traffic characteristics                 | Flow-specific, read-write
Snort pattern matcher [4]      | Signature table                                       | Flow-independent, read-only
TCP flow assembler [4], [19]   | Partially assembled TCP flows                         | Flow-specific, read-write
VPN gateways [RFC 2612]        | Per-flow encryption/decryption keys                   | Flow-specific, read-only
Worm identification [33]       | Pattern data structures and pattern counts            | Flow-specific, read-write

In addition, it minimizes the programming overhead necessary to exploit parallelism in the hardware. Careful design of Oboe ensures packet order and minimizes per-packet latency while imposing minimal overhead. In what follows, we briefly describe the design of Oboe.

A. Category-specific State Management

As discussed in Section II-A, we consider only persistent data relevant for adaptation. Accesses to such data structures account for a large fraction of total accesses, in many cases as high as 90% of non-temporary accesses [24]. Oboe classifies persistent data into multiple categories, exploits specific properties of each category, and utilizes simple application-level extensions to perform efficient state migration.

1) Data Structure Classification: Persistent data can be divided into two categories: flow-specific and flow-independent. Table I shows the classification of the data structures of several popular network services. More details on application data structure characteristics can be found in [30].

Flow-Specific Data Structures. These data structures have a one-to-one correspondence to flows; a particular element of the structure is accessed only by packets of one flow. For example, per-flow meters or packet counters in a metering application [RFC 2698] are accessed/updated by all packets of a flow and only packets of that flow. Each per-flow element of the data structure can be of fixed size or variable size. For example, data structures storing partially assembled TCP flows [4] have per-flow elements of variable size. Flow-specific structures can be either read-only or read-write. (By read-only, we mean data structures that are read-only for the data plane; they may be initialized at application start time or written by the control plane at run-time.) An example of a read-only structure is the route cache, where each entry specifies the next hop for a flow. Examples of the read-write category include the splay trees used for monitoring flow activity and reassembling TCP flows in the Portscan detector and Stream-4 modules of Snort, respectively [4]. A detailed analysis of middlebox services showed that for a subset of services, for example the above-mentioned Snort modules and DDoS mitigation [23], all persistent data structures are flow-specific read-write.
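To make the flow-specific read-write category concrete, the sketch below shows what a per-flow element for the metering example above might look like. The field and type names are illustrative assumptions, not taken from Oboe's or the metering service's actual source.

    #include <stdint.h>

    /* Illustrative flow identifier: the 5-tuple used as a flow definition. */
    struct flow_id {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  protocol;
    };

    /* Flow-specific, read-write element (cf. Metering in Table I):
     * created when the flow becomes active, updated by every packet
     * of that flow, and destroyed when the flow terminates. */
    struct meter_entry {
        struct flow_id id;       /* key: which flow this element belongs to */
        uint64_t packet_count;   /* read-write per-flow counters */
        uint64_t byte_count;
    };

    /* Update path executed for every packet of the flow. */
    static void meter_update(struct meter_entry *e, uint32_t pkt_len)
    {
        e->packet_count += 1;
        e->byte_count   += pkt_len;
    }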


Elements of read-write data structures are often dynamically created when the flow becomes active, and destroyed when the flow terminates. We partition flow-specific read-write data structures among the PEs according to the flow-level load distribution mapping. Read-only structures are replicated on all PEs, since they are either not written after application initialization or are written infrequently by the control plane.

Flow-independent Data Structures. These data structures can be accessed by any packet; no finer-grained flow definition produces a one-to-one mapping between flows and elements of the structures. An example of a flow-independent data structure is the virus signature table in a virus scanning application; any entry in the table may be accessed by the signature matcher while processing any packet. The lack of a relationship between packets and structures can make it harder to avoid remote accesses to flow-independent structures. To address this problem, we sub-divide flow-independent structures into read-only and read-write, and replicate flow-independent read-only data structures on all PEs with available space. To handle flow-independent read-write data structures, our run-time system requires a solution similar to a general-purpose Distributed Shared Memory [6], [17] or distributed data structures [15]. We have not yet implemented DSM support in our current prototype. Our current study of network services did not reveal any flow-independent read-write data structures. Hence, the frequency of occurrence of such data structures and their impact on scalability remains to be studied and is part of our future work.

The data structure category can be identified using either annotations provided by the programmer or support from the compiler (e.g., the Shangri-La compiler [9]). For instance, a structure storing partially assembled TCP connections may be annotated as a flow-specific read-write structure with the 5-tuple (SrcIP, DestIP, SrcPort, DestPort, Protocol) as the flow definition; a sketch of such an annotation appears below. Domain-specific languages, such as Click [20], are becoming increasingly popular for writing network applications, and annotation support can be easily incorporated into such languages. The Lagniappe programming environment [5] for network services supports data structure annotation for use by Oboe. In any case, who specifies annotations is an orthogonal problem for the run-time system; it just uses the annotations provided to perform state management.
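As an illustration of what such an annotation could look like, the fragment below marks a TCP reassembly structure as flow-specific read-write with a 5-tuple flow definition, and a signature table as flow-independent read-only. The macro names and attribute syntax are hypothetical; they are not the actual Shangri-La or Lagniappe annotation syntax.

    #include <stdint.h>

    /* Hypothetical annotation macros; a real system could map these to
     * compiler attributes or to entries in a separate description file. */
    #define OBOE_FLOW_SPECIFIC_RW(flowdef)  /* category + flow definition */
    #define OBOE_FLOW_INDEPENDENT_RO        /* replicated, read-only      */

    /* Partially assembled TCP connections: one element per flow,
     * read and written by every packet of that flow. */
    OBOE_FLOW_SPECIFIC_RW("SrcIP,DestIP,SrcPort,DestPort,Protocol")
    struct tcp_reassembly_state {
        uint32_t next_expected_seq;
        uint8_t  *partial_payload;
        uint32_t partial_len;
    };

    /* Worm/virus signature table: accessible from any packet, never
     * written on the data plane, so it is replicated on all PEs. */
    OBOE_FLOW_INDEPENDENT_RO
    struct signature_table {
        const char **patterns;
        uint32_t     num_patterns;
    };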

2) Application Extensions for State Migration: To enable the migration of flow-specific read-write data structures and the replication of read-only structures, each structure is required to adhere to a template. Specifically, along with the element-specific data, the programmer is required to expose three methods, get(structure element id), put(structure element id, element data) and remove(structure element id), that enable Oboe to read and un-install flow state at one PE and install it at another PE. The structure element id contains the tuple (structure id, flow id). Requiring programmers to specify these methods is a programming overhead added by our approach. However, applications typically create flow state dynamically when the flow becomes active, access flow state during the flow's life, and destroy flow state when the flow terminates. Hence, programs already have get(), put() and remove() functions in some form. The run-time system only needs the programmer to expose these methods externally; the programmer can write a wrapper for each of the internal functions and export them. Hence, we believe that this overhead is not significant and is worthwhile, given the scalability benefits it enables.
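A minimal sketch of what such a template could look like in C is shown below. The exact types and signatures are assumptions made for illustration; the paper only specifies that get(), put() and remove() must be exposed and that elements are named by a (structure id, flow id) tuple.

    #include <stddef.h>
    #include <stdint.h>

    /* Element identifier: (structure id, flow id), as described in the text. */
    struct oboe_elem_id {
        uint32_t structure_id;
        uint64_t flow_id;        /* derived from the structure's flow definition */
    };

    /* Template each migratable data structure exposes to the run-time system.
     * get()    : serialize the element so Oboe can ship it to another PE
     * put()    : install a previously serialized element on this PE
     * remove() : un-install (and free) the element on the originating PE      */
    struct oboe_state_ops {
        int (*get)(struct oboe_elem_id id, void *buf, size_t buf_len, size_t *out_len);
        int (*put)(struct oboe_elem_id id, const void *data, size_t len);
        int (*remove)(struct oboe_elem_id id);
    };

In practice, these would typically be thin wrappers around the lookup, insert and delete routines the application already uses to manage per-flow state, which is why the extra programming effort is argued to be small.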

B. Adaptive Load Distribution

Oboe determines a flow definition for scheduling based on the categories of data structures present in the application, and uses several mechanisms and policies for efficient load distribution and adaptation.

1) Flow Definition: Oboe derives the flow definition for load distribution from the individual annotations of the data structures; a definition that is coarse enough to subsume all individual annotations is chosen. For example, if the set of annotations is {5-tuple, (SrcIP, DestIP), (DestIP, DestPort), (DestIP)}, then (DestIP) is chosen for load distribution. This ensures that all flow-specific data structures will be available locally on a PE during system operation. Notice, however, that coarser definitions may lead to greater load imbalance among PEs (as in Figure 4). Hence, in order to keep the flow definition as fine-grained as possible, we replicate all read-only data structures that could coarsen the definition and do not consider them while deriving the flow definition. Once the flow definition is decided, each data structure is given a unique identifier, and structure elements are identified by (structure identifier, flow identifier). The load distributor uses the derived flow definition to schedule packets onto the PEs.

2) Adaptation Policy: To maximize throughput, Oboe dynamically corrects load imbalance on the PEs by adapting flow-to-PE mappings.

Adaptation Algorithm. To minimize the overhead of migration, we minimize the number of flows migrated. To achieve this objective, we use the guideline that system throughput is maximized if the following conditions hold true:
1) If the overall system is underloaded, then no individual PE is overloaded.
2) If the system is overloaded, then no individual PE is underloaded.
If either of the above conditions is violated, the policy triggers adaptation of flow-to-PE mappings. Observe that the trigger policy defined above is novel and better than traditional load balancing and load sharing policies; it maximizes throughput with a minimal number of adaptations. Load balancing policies attempt to equalize load on all PEs at all times, whereas load sharing policies only attempt to ensure that no PE is idle if there is work waiting in the system [22]; i.e., both will trigger adaptation even if no PE is persistently overloaded.

Algorithm 1 shows the steps involved in remapping flows to PEs; the key idea of the algorithm is to move, at each step, the heaviest flow among the overloaded PEs to the least-loaded PE. Observe that the algorithm terminates in a single pass over HF_List, so the running time depends only on the number of heavy-hitter flows maintained per PE, and not on the total number of flows. A C-level sketch of this loop appears after the algorithm.

Algorithm 1 Adaptation Algorithm
1: Initialize Overloaded and Underloaded PE pools
2: HF_List <- sorted heavy-hitter flows from all overloaded PEs
3: while (TriggerCondition AND HF_List not empty) do
4:   HF <- heaviest flow in HF_List, belonging to PE_over (whose migration will not cause PE_over to become underloaded, if the system is overall overloaded)
5:   PE_under <- least-loaded PE in the Underloaded pool that can fit HF (without becoming overloaded, if the system is overall underloaded)
6:   if (no PE_under found for HF) then
7:     Remove HF from HF_List
8:   else
9:     Move HF to PE_under and remove HF from HF_List
10:    Recompute Overload/Underload status of PE_over and PE_under
11:  end if
12: end while
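The following is a minimal C sketch of Algorithm 1. The data types (flow, pe) and helper functions (trigger evaluation, fit checks, migration) are assumptions made for illustration; only the overall control flow, a single pass over the sorted heavy-hitter list, follows the algorithm above.

    #include <stdbool.h>
    #include <stdint.h>

    struct flow { uint64_t id; double rate; int pe; };   /* heavy-hitter flow */

    /* Helpers assumed to exist elsewhere; shown only as prototypes. */
    bool trigger_condition(void);                   /* condition (1) or (2) violated? */
    bool system_overloaded(void);
    bool migration_keeps_source_loaded(const struct flow *hf);
    int  least_loaded_underloaded_pe(const struct flow *hf);  /* -1 if none fits */
    void migrate_flow(struct flow *hf, int target_pe);        /* safe migration (Sec. III-B3) */
    void recompute_status(int pe_a, int pe_b);

    /* hf_list is sorted by decreasing rate and drawn from overloaded PEs only. */
    void adapt(struct flow *hf_list, int n)
    {
        int i = 0;
        while (trigger_condition() && i < n) {
            struct flow *hf = &hf_list[i];

            /* Step 4: when the whole system is overloaded, skip flows whose
             * migration would leave their source PE underloaded. */
            if (system_overloaded() && !migration_keeps_source_loaded(hf)) {
                i++;
                continue;
            }

            /* Step 5: least-loaded underloaded PE that can absorb HF. */
            int target = least_loaded_underloaded_pe(hf);
            if (target < 0) {          /* steps 6-7: no suitable PE_under */
                i++;
                continue;
            }

            int source = hf->pe;
            migrate_flow(hf, target);             /* step 9: safe migration */
            recompute_status(source, target);     /* step 10 */
            i++;                                  /* single pass over HF_List */
        }
    }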

3) Mechanisms: Oboe employs the following mechanisms to perform flow-to-PE adaptation.

Safe Flow Migration Mechanism. Migrating a flow from one PE to another requires us to ensure safety. The mechanism is safe if it ensures intra-flow packet order and avoids conditions where either a packet is being processed at the old PE after its state has been migrated to the new PE, or a packet is being processed at the new PE before its state has been migrated from the old PE. To ensure safety, Oboe uses the following sequence for migration: (1) Once a flow F is chosen for migration from PE O to PE U, the load distributor changes the mapping F-to-O to F-to-HoldQ, where HoldQ is a queue maintained at the load distributor to hold packets of the flow until it is safe to send them to U. (2) PE O is instructed to un-install the state specific to F when it is safe to do so. To know when it is safe, O marks the current last packet in its input queue and waits until the marked packet has been processed. Since all new packets belonging to F are being held in the HoldQ, the departure of the marked packet indicates that there exists no packet belonging to F in O, and hence the state corresponding to F will not be accessed any more. O then un-installs F's state. (3) Oboe migrates F's state to U, which installs the state locally. (4) The load distributor then dispatches the packets from HoldQ to U, and changes the mapping from F-to-HoldQ to F-to-U.

Measurement Mechanism. This mechanism maintains, and provides to the adaptation policy, per-PE and per-flow arrival rates in the form of exponential moving averages (EMA). Since the adaptation policy chooses large flows, as an optimization, arrival rates are maintained only for the heavy hitters. This optimization ensures that the memory overhead of the flow load indices is constant and independent of the number of flows. Heavy hitters are detected using sketches [10] and added to a table that stores their arrival rate and PE mapping. Note that an assumption implicit in this mechanism is that all packets have similar processing times, which is true in most network services. Where this is not true, more sophisticated load estimation can easily be incorporated.

Mapping Resolution Mechanism. The mapping resolution mechanism decides which processor a packet should be scheduled on, in accordance with the flow-to-processor mapping. We currently implement the table-and-hash approach [32]. The table maintains the mapping for large flows that have been remapped by the policy to processors other than the hash result. This table is the same as the one used by the measurement mechanism to store identified heavy hitters. The table is searched first. If no entry is found, a hash is computed on the flow identifier to resolve the mapping. In order to prevent the table from becoming excessively large, table entries are periodically garbage-collected during low system load. Flows having very low EMA values are remapped to the processors obtained from the hash function, and the corresponding table entries are deleted.
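A compact sketch of the table-and-hash resolution path is shown below. The table layout and hash function are illustrative assumptions; the paper specifies only that the heavy-hitter table is consulted first and that a hash over the flow identifier is used otherwise.

    #include <stdint.h>

    #define NUM_PES 16

    struct hh_entry {            /* heavy-hitter table entry */
        uint64_t flow_id;
        int      pe;             /* PE the policy pinned this flow to */
        double   ema_rate;       /* per-flow arrival rate (EMA), used for GC */
        int      valid;
    };

    /* Assumed helpers: a lookup into the heavy-hitter table and a CRC-style
     * hash over the flow identifier. */
    struct hh_entry *hh_lookup(uint64_t flow_id);
    uint32_t flow_hash(uint64_t flow_id);

    /* Resolve the PE for a packet's flow: table first, hash otherwise. */
    int resolve_pe(uint64_t flow_id)
    {
        struct hh_entry *e = hh_lookup(flow_id);
        if (e && e->valid)
            return e->pe;                        /* remapped heavy hitter */
        return flow_hash(flow_id) % NUM_PES;     /* default static mapping */
    }

This ordering also explains the latency asymmetry reported later in Table II: heavy hitters stop at the successful table lookup, while other flows pay for an unsuccessful lookup plus the hash computation.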

IV. PROTOTYPE IMPLEMENTATION

Our prototype DM-MP platform is implemented with a cluster of Radisys ENP-2611 boards, each hosting an IXP2400 Network Processor (NP) [16]. One group of NPs acts as LDs and another group acts as PEs (as shown in Figure 2), with a GigE data plane network connecting the two groups. Oboe modules reside either on the data plane, where they operate on each packet, or on the control plane, where infrequent tasks such as evaluation of the trigger condition and state migration are performed.

Figure 5 shows the software architecture of Oboe. The Oboe modules are distributed across the LDs and PEs. The Oboe Controller module coordinates and controls the various Oboe modules located in the LD and PE control and data planes. The Controller is hosted on one of the LDs, and communicates with the other Oboe modules via a 100 Mbps control plane network. The Controller, LD control plane and PE control plane modules are implemented in C++ and run on MVLinux [3] hosted on the XScale processor. These modules communicate using TCP/IP sockets.

[Fig. 5. Software architecture. The figure shows Oboe's policy module (the load sharing policy) and mechanism modules: mapping resolution (MR), flow load (FL) and PE load (PL) information, LD and PE managers (LDM, PEM), and the state manager (SM). Each mechanism is split into a primary at the Controller and corresponding parts on the LD and PE control planes ("top") and data planes ("bottom"). The control-plane modules communicate over a TCP/IP socket layer, control and data planes interact via kernel modules and ioctl(), and modules are classified as platform-independent or platform-dependent.]

The data plane modules at the LDs and PEs are written in our domain-specific language and run on the microengines. Communication between the data and control planes occurs via kernel modules and ioctl().

Network services are developed using the Shangri-La programming environment [9] for our prototype studies. Shangri-La provides a network-domain-specific programming language (much like Click [20]) and an associated compiler suite that allows services to be compiled for IXP2400 network processors. The compiler decides where the data accessed during packet processing is placed within the various levels of memory available on the IXP2400. In addition, the Shangri-La run-time system provides Oboe with access to the application ingress queue required by the safe flow migration mechanism.

Extensibility and Load Distributor Scalability. Figure 5 shows the separation between policy and mechanism modules. Policy modules implement run-time policies for adaptive load distribution and for handling processor joins and leaves. Mechanism modules implement the LD mapping resolution (MR); provide information to the load distribution policy regarding PE loads (PL) and flow loads (FL); manage LD and PE joins and leaves (LDM and PEM); and perform state management (SM) functions, such as coordinating migration and replication. The mechanism modules are divided into primaries at the Controller that control the corresponding secondaries at either the LD or the PE. Separation of Oboe modules into mechanisms and policies leads to a cleaner design and makes the system extensible: new policies and mechanisms can be easily plugged in as long as the required interfaces are implemented. More importantly, by separating the LD mapping resolution and measurement mechanisms from the policy, we can have multiple LDs operating in parallel and implementing the mapping decisions of a centralized adaptation policy. This allows scaling the PPS beyond the line rates supported by a single LD.

Portability. Oboe modules can be classified as either platform-dependent or platform-independent, as shown in Figure 5. Retargeting Oboe for different data plane hardware requires reimplementation of only the platform-dependent components.

V. EVALUATION

We demonstrate the efficacy of Oboe using both simulations and prototype evaluation. Each of these environments allows us to highlight different aspects of Oboe effectively.

[Fig. 6. Static vs. Adaptive load distribution for the ANL trace, and sensitivity of adaptive scheduling to measurement parameters. Each panel plots percentage loss in throughput against the number of processors for StaticHash and Adaptive: (a) Benefits (I = 2.5 ms and A = 0.99), with migration latency L = 0, 500, 1000 and 2000; (b) Measurement interval, with I = 100, 10, 5, 2.5, 1 and 0.5 ms; (c) EMA smoothing parameter, with A = 0.1, 0.5, 0.9 and 0.99.]

A. Simulations

We use the same simulator setup as described in Section II, with DestIP as the flow definition. We compare the performance of our load distribution policy (Adaptive) and of a CRC-based hashing scheme (StaticHash), which does not dynamically adapt flows once they are mapped to a PE, against the throughput achieved by an ideal scheduler that can balance load at each packet arrival with zero overhead. Our load distribution policy is invoked periodically every I time units. During adaptation, the migration of each flow incurs a latency of L microseconds.

Adaptation Benefits. Figure 6(a) plots the percentage loss in throughput with increasing number of processors in the system, and with varying latency L for Adaptive. The graph shows that Adaptive achieves performance close to Ideal for cluster sizes of 8 or less. The loss increases for larger clusters but is still 15 to 20% lower than that of static hash-based load distribution. Observe that, at times, Adaptive loses more throughput with L = 0 than with non-zero L (e.g., when the number of processors is 32 in Figure 6(a)). This is because a non-zero L smooths the traffic by holding packets of heavy hitters during overload and releasing them later. We observe similar results for other traces; we skip showing them due to space constraints.

Sensitivity to parameters I and A. We study the sensitivity of Oboe to two parameters: the adaptation interval I, and the EMA smoothing parameter A used in the measurement mechanism; a higher A indicates that greater importance is given to the current measurement interval. Figure 6(b) demonstrates that I should be as low as possible, since frequent monitoring exposes adaptation opportunities. Figure 6(c) shows that choosing higher A values (i.e., giving greater importance to the current monitoring interval) allows the system to react more quickly to load imbalance. The exact values chosen for each of these parameters strike a tradeoff between how quickly we react to fix load imbalance and how much overhead we incur in the process.
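For reference, the standard exponential-moving-average recurrence we assume here (the paper does not spell out its exact form) makes the role of A explicit: with x_t the rate measured in the current interval,

    \mathrm{EMA}_t = A \cdot x_t + (1 - A) \cdot \mathrm{EMA}_{t-1},

so A = 0.99 weights the latest interval almost exclusively and reacts quickly, while A = 0.1 averages over many past intervals and reacts slowly.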

B. Prototype Evaluation

1) Setup: Our experimental setup contains five Radisys ENP-2611 boards; one of them acts as the LD, and the other four act as PEs. Traffic is generated using the IXIA traffic generator [2].

[Fig. 7. Comparison of system throughput with and without adaptation. Throughput (Mpps) is plotted against the number of heavy-hitter flows (1 to 9), with adaptation on and adaptation off.]

A metering service written using the Shangri-La programming language [9] runs on the PEs, and counts the number of packets per flow using a read-write, flow-specific counting data structure. The data structure is implemented as a 16-4-4-4 trie. The service uses the destination IP as the flow definition; load distribution is performed at the same granularity.

2) Scalability: To first ensure that one LD can support all four PEs, we measured the throughput of the LD and of each PE hosting the metering service. We observed that the LD supports a throughput of 2.37 Mpps (million packets per second), whereas each PE supports 0.37 Mpps, making one LD sufficient.

In this set of experiments, we demonstrate that Oboe increases the throughput of the DM-MP system by adapting the mapping of heavy-hitter flows to PEs. We consider an extreme scenario where all the heavy hitters among the generated flows are initially mapped to the same processor. In Figure 7, we vary the number of heavy-hitter flows, where each heavy-hitter flow corresponds to an arrival rate of 0.2 Mpps, and measure the throughput of the system with and without adaptive load distribution. Since each PE can support a maximum throughput of 0.37 Mpps, a PE will be overloaded if more than one such flow gets mapped to it. With adaptation, the system redistributes the flows among the four PEs and throughput increases linearly up to four flows. The throughput increase is sub-linear with five to eight flows, as more PEs are successively overloaded with two flows each. The overall system is overloaded with eight flows, and the throughput of the system saturates.
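The saturation point follows directly from the stated rates; as a back-of-the-envelope check using only the numbers above:

    4 \times 0.37\ \text{Mpps} = 1.48\ \text{Mpps (aggregate PE capacity)}
    8 \times 0.2\ \text{Mpps}  = 1.6\ \text{Mpps (offered load)} > 1.48\ \text{Mpps},

so with eight heavy hitters the offered load exceeds what the four PEs can process even under a perfect redistribution, and the system throughput levels off.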

[Fig. 8. Adaptation among 4 PEs with traffic consisting of four, eight and sixteen heavy-hitters respectively: (a) Four, (b) Eight, (c) Sixteen. Each panel plots per-PE and overall throughput (Mpps) against time (seconds), along with the single-PE adaptation threshold.]

TABLE II
PER-PACKET LATENCY AT THE LOAD DISTRIBUTOR

Flow type               | Min (µs) | Max (µs) | Avg (µs)
Heavy-hitter flows      | 18.5     | 19.1     | 18.9
Non-heavy-hitter flows  | 19.8     | 21.2     | 20.5

TABLE III
FLOW MIGRATION LATENCY BREAKDOWN FOR 4-BYTE PER-FLOW STATE

State migration step              | Latency (µs)
Processing at LD                  | 420
State uninstall                   | 215
State install                     | 205
Total socket communication        | 1130
Complete flow migration           | 1970
Each ioctl() call to data plane   | 30

Figure 8(a) shows the variation of the throughput achieved by each individual PE and of the total system throughput over time when 4 heavy-hitter flows are initially mapped to one PE. Traffic is started at 4 seconds. The graph shows that Oboe detects the overload and remaps three heavy-hitter flows, one by one, to minimize load imbalance in the system, thereby increasing the system throughput. For Figures 8(b) and (c), we reduce the per-heavy-hitter arrival rate to 0.1 Mpps and 0.05 Mpps, and increase the number of heavy hitters to 8 and 16 respectively. Again, all the heavy hitters are mapped to the same PE initially. The graphs show observations similar to Figure 8(a). However, the time it takes for the system to reach its maximum throughput is greater with a greater number of flows, since Oboe monitors and reacts in steps: it remaps a flow, measures the change due to the remapping, and then remaps again if needed. Also, recall that the trigger policy of Oboe does not attempt to equalize the load on each processor; it only relieves the overload on an overloaded processor by mapping a heavy hitter to a different processor that is underloaded. Figure 8(c) captures this behavior.

3) Oboe Overheads: Table II shows the per-packet latency for different types of flows at the LD. These values include the time required for resolving the flow-to-PE mapping and updating the flow-load and PE-load metrics. The per-packet latency is slightly higher for non-heavy hitters due to the mapping resolution mechanism; it consists of an unsuccessful search of the heavy-hitter table followed by a hash computation and a sketch update. In contrast, heavy hitters incur only a successful table lookup. The latency incurred by a packet that identifies its flow as a heavy hitter, causing it to be added to the heavy-hitter table, is about 25 µs. These results demonstrate that our adaptive LD imposes negligible latency overhead.

Table III shows the breakdown of the different overheads during flow migration with a per-flow element size of 4 bytes. Safe flow migration involves four ioctl() calls to the data plane: one each for state uninstall and install, and two at the LD to change the flow mapping. The overall adaptation latency is about 2 ms, with socket communication accounting for more than half of the value.


[Fig. 9. Effect of per-flow state size on state migration: (a) Migration latency (µs) vs. per-flow state element size (bytes); (b) Throughput (Mpps) vs. time (seconds) for state sizes of 4 bytes, 1 KB, 10 KB and 50 KB.]

The state-migration part of the latency, i.e., the time required to uninstall, transfer and install flow state, depends on the size of the per-flow state. Figure 9(a) shows that the state migration latency increases linearly as we vary the state size from 4 bytes to 4096 bytes. Figure 9(b) studies the effect of this latency on system throughput. We observe that with different state sizes, throughput stabilizes at 0.8 Mpps, with minor differences in the time to stabilization.

VI. RELATED WORK

The emergence of network processors as an attractive vehicle for rapidly prototyping and deploying network services has led to several related research efforts. Kencl and Le Boudec [18] and Shi et al. [32] present load distribution strategies that are suitable for stateless applications, applications with only read-only state, or cases where the hardware provides cache coherence support. Shi et al. [31] investigate the effect of load splitting on cache performance; they consider only the route cache, which is read-only application state for the data plane. Our work focuses on a more general scenario of scaling a wide range of stateful services by performing category-specific state management. However, we do adopt some of the previously developed techniques to guide the design of Oboe.

Our work is complementary to the research efforts in building programming languages [9], [13] and operating systems [21], [28], [39] for network processors. For instance, Wolf et al. [39] discuss the design considerations for operating systems for a single network-processor system. Kokku et al. [21] and Raghunath et al. [28] discuss mechanisms and policies for light-weight run-time adaptation on a shared-memory multi-core system, as opposed to the distributed-memory model considered in this paper. Related work on parallel protocol processing and cache-affinity scheduling has focused on the problem of TCP processing over shared-memory hardware [7], [25], [29], [40]. We leverage the insights gained in that area to address the problem for distributed-memory machines. TCP connection migration has been proposed as a technique for improving the availability of Internet services via transparent fail-overs [35], [36]. However, the solutions proposed are specific to TCP and other application-layer protocols, and their main goal was to migrate connections in a manner transparent to the end-user. Our safe flow migration mechanism allows remapping of flows of any definition, without violating application semantics or packet order.

In the context of scaling Web servers [8], popular load balancing solutions include [14], which uses a static flow-level distributor, and [27], which proposes an adaptive flow-level distributor that does not consider flow load information during adaptation and is suitable only for read-only state. TACC [12] and Ninja [38] propose general frameworks for building Internet services but do not address the challenges introduced by per-flow state. We address these challenges and provide concrete policies and mechanisms for throughput maximization. DDS [15] is a persistent data management layer that could be used to host flow-independent read-write structures in Oboe.

VII. CONCLUSION

In this paper, we describe the design and implementation of Oboe, a run-time system for DM-MP platforms that host high-throughput services. Our design is based on two foundations: (1) category-specific management of shared state, and (2) adaptive flow-level load distribution for addressing persistent processor load. We demonstrate the efficacy of the two foundations using both simulations and prototype evaluation. Our prototype implementation of Oboe on a cluster of IXP2400 network processors demonstrates the scalability achieved with increasing number of processors, number of flows and state size.

APPENDIX

We present analysis to show that a static hash-based flow-level load distributor can lead to arbitrarily high load imbalance. Extending the analysis by Shi et al. [32], we demonstrate that the lower bound of the coefficient of variation (CV) of PE loads, a metric that captures the discrepancy between the heaviest-loaded PE and the least-loaded PE, is proportional to the number of PEs in the system, which can be significantly worse than our trace-based observations reported in Section II.

Let m be the number of PEs in the system. We assume that the flow definition is DestIP, with K representing the size of the IP address space. Let q_j (0 < j < m) be the number of flows assigned to PE j by the hash function. [32] shows that flow sizes follow a Zipf distribution and that

    CV[q_j]^2 = \frac{K(m-1)}{(K-1)\left[\sum_{i=1}^{K} i^{-\alpha}\right]^2} \sum_{i=1}^{K} i^{-2\alpha} - 1    (1)

where \alpha (> 1) is the coefficient of the Zipf distribution. We compute the lower bound on CV as follows:

    \lim_{K \to \infty} CV[q_j]^2 = \lim_{K \to \infty} \frac{(m-1)\sum_{i=1}^{K} i^{-2\alpha}}{\left(1-\frac{1}{K}\right)\left[\sum_{i=1}^{K} i^{-\alpha}\right]^2} - 1 = \lim_{K \to \infty} \frac{(m-1)\,N}{\left(1-\frac{1}{K}\right) D^2} - 1    (2)

where N = \sum_{i=1}^{K} i^{-2\alpha} = 1 + \sum_{i=2}^{K} i^{-2\alpha} and D = \sum_{i=1}^{K} i^{-\alpha}.

    \text{Lower bound of } N = 1 + \int_{1}^{K} x^{-2\alpha}\, dx = \frac{K^{1-2\alpha}}{1-2\alpha} - \frac{2\alpha}{1-2\alpha}    (3)

    \text{Upper bound of } D = \int_{1}^{K+1} x^{-\alpha}\, dx = \frac{(K+1)^{1-\alpha}}{1-\alpha} - \frac{1}{1-\alpha}    (4)

To compute the lower bound of CV[q_j]^2, we substitute (3) and (4) in (2):

    \text{LB of } \lim_{K \to \infty} CV[q_j]^2 = \lim_{K \to \infty} \frac{(m-1)\left[\frac{K^{1-2\alpha}}{1-2\alpha} - \frac{2\alpha}{1-2\alpha}\right]}{\left(1-\frac{1}{K}\right)\left[\frac{(K+1)^{1-\alpha}}{1-\alpha} - \frac{1}{1-\alpha}\right]^2} - 1 = (m-1)(\alpha-1)^2 \frac{2\alpha}{2\alpha-1} - 1    (5)
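As a hypothetical numerical illustration (these values are not from the paper's traces): for \alpha = 1.5 and m = 16 PEs, the bound in (5) gives

    (m-1)(\alpha-1)^2 \frac{2\alpha}{2\alpha-1} - 1 = 15 \cdot 0.25 \cdot \frac{3}{2} - 1 = 4.625,

i.e., CV[q_j] is at least about 2.15, and the bound grows linearly as m increases.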

Equation 5 shows that the lower bound on the CV of PE loads is proportional to m, the number of PEs, and can be arbitrarily high.

REFERENCES

[1] ESB Roundup Part Two: Use Cases. http://www.infoq.com/news/ESB-Roundup-Part-Two–Use-Cases.
[2] IXIA. http://www.ixiacom.com.
[3] MontaVista Software. http://www.mvista.com.
[4] Snort. http://www.snort.org/.
[5] The Lagniappe Programming Environment. http://www.cs.utexas.edu/users/riche/lagniappe/.
[6] J. Bennett, J. Carter, and W. Zwaenepoel. Munin: Distributed Shared Memory Based on Type-specific Memory Coherence. In PPOPP '90.
[7] M. Bjorkman and P. Gunningberg. Locking Effects in Multiprocessor Implementations of Protocols. In SIGCOMM '93.
[8] V. Cardellini, E. Casalicchio, M. Colajanni, and P. S. Yu. The state of the art in locally distributed web-server systems. ACM Computing Surveys, 34(2):263–311, 2002.


[9] M. K. Chen, X. F. Li, R. Lian, J. H. Lin, L. Liu, T. Liu, and R. Ju. Shangri-La: Achieving High Performance from Compiled Network Applications while Enabling Ease of Programming. In PLDI '05.
[10] G. Cormode and M. Garofalakis. Sketching streams through the net: Distributed approximate query tracking. In VLDB '05.
[11] W. Eatherton, G. Varghese, and Z. Dittia. Tree bitmap: hardware/software IP lookups with incremental updates. SIGCOMM Comput. Commun. Rev., 34(2), 2004.
[12] A. Fox, S. D. Gribble, Y. Chawathe, E. A. Brewer, and P. Gauthier. Cluster-based scalable network services. In SOSP '97.
[13] L. George and M. Blume. Taming the IXP Network Processor. In PLDI '03.
[14] G. Goldszmidt, G. Hunt, R. King, and R. Mukherjee. Network Dispatcher: A Connection Router for Scalable Internet Services. In WWW '98.
[15] S. D. Gribble, E. A. Brewer, J. M. Hellerstein, and D. Culler. Scalable, distributed data structures for internet service construction. In OSDI '00.
[16] Intel IXP Family of Network Processors. http://www.intel.com/design/network/products/npfamily/index.htm.
[17] P. Keleher, S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. TreadMarks: Distributed shared memory on standard workstations and operating systems. In Winter 1994 USENIX Conference.
[18] L. Kencl and J.-Y. L. Boudec. Adaptive load sharing for network processors. In INFOCOM '02.
[19] H.-A. Kim and B. Karp. Autograph: Toward Automated, Distributed Worm Signature Detection. In USENIX Security '04.
[20] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click Modular Router. ACM ToCS, 18(3):263–297, August 2000.
[21] R. Kokku. SHARE: Run-time System for High-Performance Virtualized Routers. PhD thesis, The University of Texas at Austin, 2005.
[22] P. Krueger and M. Livny. The Diverse Objectives of Distributed Scheduling Policies. In ICDCS '87.
[23] A. Mahimkar, J. Dange, V. Shmatikov, H. Vin, and Y. Zhang. dFence: Transparent Network-based Denial of Service Mitigation. In NSDI '07.
[24] J. Mudigonda, H. M. Vin, and R. Yavatkar. Overcoming the Memory Wall in Packet Processing: Hammers or Ladders? In ANCS '05.
[25] E. Nahum, D. Yates, J. Kurose, and D. Towsley. Cache behavior of network protocols. In SIGMETRICS '97.
[26] NLANR Network Traffic Packet Header Traces. http://pma.nlanr.net/Traces/.
[27] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. M. Nahum. Locality-aware request distribution in cluster-based network servers. In ASPLOS '98.
[28] A. Raghunath, V. Balakrishnan, A. Kunze, and E. Johnson. Framework for Supporting Multi-Service Edge Packet Processing on Network Processors. In ANCS '05.
[29] J. D. Salehi, J. F. Kurose, and D. Towsley. The effectiveness of affinity-based scheduling in multiprocessor network protocol processing. IEEE/ACM Transactions on Networking, 4(4):516–530, 1996.
[30] U. Shevade. The Sahyadri Run-Time System for Scalable Packet Processing. Master's thesis, The University of Texas at Austin, 2005.
[31] W. Shi, M. H. MacGregor, and P. Gburzynski. Effects of a Hash-based Scheduler on Cache Performance in a Parallel Forwarding System. In Proceedings of CNDS '03, 2003.
[32] W. Shi, M. H. MacGregor, and P. Gburzynski. Load balancing for parallel forwarding. IEEE/ACM Transactions on Networking, 2005.
[33] S. Singh, C. Estan, G. Varghese, and S. Savage. Automated worm fingerprinting. In OSDI '04.
[34] K. Sklower. A Tree-Based Packet Routing Table for Berkeley Unix. In Proceedings of the Winter 1991 USENIX Conference, January 1991.
[35] A. Snoeren, D. G. Andersen, and H. Balakrishnan. Fine-Grained Failover Using Connection Migration. In USITS '01.
[36] F. Sultan, K. Srinivasan, D. Iyer, and L. Iftode. Migratory TCP: Connection migration for service continuity in the Internet. In ICDCS '02.
[37] J. S. Turner. A proposed architecture for the GENI backbone platform. In ANCS '06.
[38] J. R. von Behren, E. A. Brewer, N. Borisov, M. Chen, M. Welsh, J. MacDonald, J. Lau, and D. E. Culler. Ninja: A framework for network services. In USENIX ATC '02.
[39] T. Wolf, N. Weng, and C. Tai. Design Considerations for Network Processor Operating Systems. In ANCS, 2005.
[40] D. J. Yates, E. M. Nahum, J. F. Kurose, and D. Towsley. Networking support for large scale multiprocessor servers. In SIGMETRICS '96.
[41] L. Zhao, Y. Luo, L. Bhuyan, and R. Iyer. SpliceNP: A TCP Splicer using Network Processors. In ANCS, 2005.
