Framework For Supporting Multi-Service Edge Packet Processing On Network Processors

Arun Raghunath, Aaron Kunze, Erik J. Johnson
Intel Research and Development
2111 NE 25 Avenue, Hillsboro, OR 97124
1-503-264-4892
{arun.raghunath, aaron.kunze, erik.j.johnson}@intel.com

Vinod Balakrishnan
Openwave Systems Inc.
2000 Seaport Blvd, Redwood City, CA 94063
[email protected]

ABSTRACT

Network edge packet-processing systems, as are commonly implemented on network processor platforms, are increasingly required to support a rich set of services. These multi-service systems are also subjected to widely varying and unpredictable traffic. Current network processor systems do not simultaneously deal well with a variety of services and fluctuating workloads. For example, current methods of worst-case, static provisioning can meet performance requirements for any workload, but provisioning each service for its worst case reduces the total number of services that can be supported. Alternatively, profile-driven automatic-partitioning compilers create efficient binaries for multi-service applications for specific workloads, but they are sensitive to workload fluctuations. Run-time adaptation is a potential solution to this problem: the mapping of services to system resources can be dynamically adjusted based on the workload. We have implemented an adaptive system that automatically changes the mapping of services to processors and handles migration of services between different processor core types to match the current workload. In this paper we describe our adaptive system, built on the Intel® IXP2400 network processor, and demonstrate that it outperforms multiple profile-driven compiled solutions for most workloads and performs within 20% of the optimal compiled solution for the remaining workloads.

Categories and Subject Descriptors
D.4.1 [Operating Systems]: Process Management – Scheduling, Multiprocessing/multiprogramming/multitasking; C.4 [Computer Systems Organization]: Performance of systems – Measurement techniques, performance attributes; D.4.7 [Operating Systems]: Organization and Design – Real-time systems and embedded systems; C.2.m [Computer-communication Networks]: Miscellaneous

General Terms
Algorithms, Measurement, Performance

Keywords
Run-time adaptation, Network processors, Edge packet processing

1 INTRODUCTION
Packet processing systems experience workloads with unpredictable fluctuations, ranging from short-term spikes in traffic (e.g., flash crowds) to long-term variations (e.g., time-of-day fluctuations) [32, 33, 34]. Currently, on network processor (NP) platforms, these fluctuations are handled by provisioning for the worst case traffic expected for a service. However, the worst case traffic is seen only a small percentage of the time; as a result, most of the allocated resources are underutilized [24]. At the same time, commercial network access systems are required to support more sophisticated services like content inspection and compression, as well as encryption and decryption [2, 7, 20]. Typical network processors, like the Intel® IXP2xxx family, have fixed-size control stores with no hardware support for caching. As a result, current NP-based systems can support only a few services efficiently.

Recently, sophisticated compilers have been developed that compile multi-service application code by profiling it for a particular workload [10, 17]. However, because workload fluctuations at run time are unpredictable, there is no such thing as a single correct workload. Consequently, the code generated by these compilers, though very efficient for the specific workload for which it was generated, is unable to sustain high system throughput for other workloads.

Run-time adaptation is a potential solution to this problem. Since the traffic for individual services fluctuates, and not all services experience worst case traffic at the same time, an adaptive system can support more services at a higher throughput by dynamically loading the code for the currently active services onto more of the processing cores in the system. Furthermore, within a particular set of active services, the adaptive system can better handle traffic variations by identifying the bottlenecks and allotting more resources specifically to those components. When resources are not being used they can be switched off to save power. Thus, the ability to adapt dynamically allows the system not only to address the problems arising from the application and workload characteristics typical of the packet processing domain, but also to take advantage of them.

We have implemented a system capable of run-time adaptation on the Intel® IXP2400 network processor. This paper describes the requirements and steps involved in supporting run-time adaptation, details the design of our adaptation framework and the policies implemented, and presents a performance evaluation of our solution. Section 2 describes the adaptation steps. Sections 3 and 4 describe the individual steps in further detail and how our system implements them. Section 5 describes the experimental setup we used to evaluate our system and the results obtained. Section 6 discusses related work. We conclude by summarizing our contributions in Section 7 and discussing areas for future work in Section 8.




2 ADAPTATION STEPS
Most packet processing applications can be expressed as a graph of packet processing stages, each of which performs some operation on a packet and then sends it on to the subsequent stage [6, 11, 30]. Typically, the stages can be run in parallel and are connected by queues. Network processors provide numerous hardware resources like processing cores, multiple memory levels, inter-core queuing and signaling mechanisms, and crypto assist [9]. When packet processing applications are executed on such architectures, the application stages need to be mapped to these hardware resources. We refer to the ability to change these mappings dynamically as run-time adaptation.

In order for a system to adapt at run time, the first step is to build mechanisms by which the system can be correctly migrated from one application-to-hardware mapping to another. This change of mapping should occur as quickly as possible, to minimize the periods when the system is running at less than maximum capacity. Long periods of reduced system capacity will eventually cause an overflow of the internal buffers and result in packet loss.

Given the mechanisms to correctly adapt mappings, the next step is to determine when to utilize these mechanisms, i.e., when the system should adapt. For the system to make this decision automatically, the run-time must be able to identify how the individual stages are handling the workload. This entails determining that some application stage is unable to meet the processing requirements imposed by the current workload and consequently triggering the adaptation.

The final step in run-time adaptation is determining what changes to make to the resource mappings of the individual stages so that they may better handle the current workload. Moreover, when there are multiple stages whose resource mappings need to be changed, the run-time system must prioritize among them and arrive at a mapping that maximizes system performance. In the next sections we focus on each of the steps needed for adaptation, describing the requirements further and explaining how we have implemented these functionalities in our system.
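To make the notion of an application-to-hardware mapping concrete, the following C sketch shows one plausible representation of a stage graph and its mapping onto cores. The type and field names here are our own illustration, not the actual data structures of our run-time system.

    /* Hypothetical representation of a stage graph and its core mapping. */
    struct queue;                  /* connects two stages */

    enum core_type { CORE_ME, CORE_XSCALE };

    struct stage {
        const char *name;          /* e.g. "l3_fwdr" */
        struct queue *in, *out;    /* queues connecting this stage to others */
    };

    struct mapping {
        struct stage *stage;       /* which application stage */
        enum core_type type;       /* which kind of core runs it */
        int ncores;                /* how many cores are currently allocated */
    };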


3 ADAPTATION MECHANISMS
The first mechanism needed is the ability to stop the execution at a safe point and then restart it with the new resource mappings. While general-purpose checkpointing solutions [13, 16, 27, 28] would suffice, it is possible to optimize this stopping and restarting mechanism for the domain of packet processing. Our packet processing applications are represented as data-flow graphs with each stage implemented as an infinite loop. Moreover, no local state (such as locks or local variables) is live at the start of the infinite loop. This property makes it easy to implement a light-weight checkpoint mechanism. We use a compiler [17] that adds placeholders for checkpoints at the start of each application stage's infinite loop, where no stack or processor state needs to be saved. When the run-time system decides to adapt code executing on the microengines, it writes a "halt" instruction to the checkpoint location. On the Intel XScale® core, it sets a flag in memory.¹ Compiler-inserted code checks this flag in each iteration and, if it is set, the thread running that stage exits.

Another mechanism for adaptation on architectures with asymmetric cores is migrating stages from one core type to another. To accomplish this, the compiler generates two functionally-equivalent binaries for each stage: one for a microengine (ME) and one for the Intel XScale® core, enabling any stage to be executed on either of the two kinds of cores available on the IXP network processor.

The final mechanism is the ability to bind physical resources to application stages. In our system, the application code is generated with calls to abstract resources such as packet channels, queues and locks, which are exposed as part of a resource abstraction layer (RAL). This allows the run-time system to choose the physical resources appropriate for the current application-to-hardware mapping. A range of binding approaches exists, including compile-time, run-time, and load-time binding. Compile-time binding involves recompiling the application each time the system adapts. Run-time binding uses a run-time conditional flag to determine which type of resource to use. Compile-time binding results in fast, small code, while run-time binding takes the least amount of time to change a bound interface from one resource to another. We chose an approach we call adaptation-time linking, whose code speed, size, and adaptation time fall between those of the compile-time and run-time approaches. In this approach, choosing a particular resource mapping involves linking the physical resource implementation binary into the application each time adaptation occurs. This approach has neither the execution overhead and large code-store footprint of run-time binding, nor the long adaptation time of compile-time binding.

¹ Intel XScale® and Intel® IXP2400 are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
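As a sketch of the checkpoint placement, a compiler-generated stage on the Intel XScale® core might look like the loop below. The names (checkpoint_flag, channel_get, process, channel_put) are illustrative assumptions, not the compiler's actual output; on an ME, a halt instruction is written at the checkpoint location instead of testing a flag.

    /* Illustrative compiler-generated stage loop with a checkpoint at the
     * top of the loop, where no stack or processor state is live. */
    struct packet;
    struct channel;

    extern struct packet *channel_get(struct channel *in);
    extern void channel_put(struct channel *out, struct packet *p);
    extern void process(struct packet *p);

    volatile int checkpoint_flag;   /* set by the run-time to request a stop */

    void stage_main(struct channel *in, struct channel *out)
    {
        for (;;) {
            if (checkpoint_flag)    /* compiler-inserted check (XScale case) */
                return;             /* thread exits; stage can be remapped */
            struct packet *p = channel_get(in);
            process(p);             /* stage-specific packet processing */
            channel_put(out, p);
        }
    }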


3.1 State Monitoring
An adaptive run-time system must determine when to trigger adaptation. Collecting the information necessary to make this determination should impose little overhead on the performance of the application and must have low latency. The ability to extract the necessary information quickly is essential for the system to adapt to workload variations in a timely and accurate fashion.

We implemented a local, rate-based monitoring system in which the system monitors the rate at which packets arrive at and depart from any queue connecting two stages. The arrival and departure rates are indicative of the processing needs of the attached stages, and hence indicate not just when to adapt but how to adapt as well. The local, or per-queue, nature of the approach also means a distributed implementation is possible, which could scale better.

In order to determine arrival and departure rates we implemented a packet channel that counts the number of packets added and retired. This costs two extra instructions per packet traversing the channel (asynchronous increment/decrement instructions ensure no time is spent waiting for memory). Also, since the counts are maintained in scratchpad memory, we use up some of the internal bus bandwidth to memory.² A user space thread running on the Intel XScale® core periodically queries the counts from the channels and stores the timestamp when the values were polled. The rate can then be calculated as

rate = (count[n] – count[n-1]) / (timestamp[n] – timestamp[n-1])
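As an illustration, the polling side of this monitoring scheme might look like the following C sketch. The struct layout and function name are our own for exposition; the actual implementation keeps the counts in scratchpad memory and updates them with asynchronous increment/decrement instructions.

    /* Hypothetical poller for the per-channel packet counts. */
    struct chan_stats {
        volatile unsigned long count;   /* packets seen so far */
        unsigned long prev_count;       /* count at the previous poll */
        double prev_ts;                 /* time of the previous poll, seconds */
    };

    /* Returns the rate, in packets per second, over the last interval. */
    double poll_rate(struct chan_stats *s, double now)
    {
        unsigned long c = s->count;
        double rate = (double)(c - s->prev_count) / (now - s->prev_ts);
        s->prev_count = c;
        s->prev_ts = now;
        return rate;
    }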

4 ADAPTATION POLICIES
The monitoring mechanism described in the previous section is used by our adaptive system to make various policy decisions. In this paper, we only explore the adaptation of processor cores. The Allocation Policy determines when a particular stage might need more cores and how many cores that stage needs. The Deallocation Policy determines when a stage has excess cores and how many cores to remove. The Resource Allocator is a global entity that arbitrates the requests from the individual stages.

The goal of the adaptation policies we describe is to maximize throughput. This is done by attempting to minimize packets dropped due to queue overflows. The Allocation and Deallocation policies are stage-local and focus on the queue connecting a source stage and a sink stage, as shown in Figure 1. The policies run independently for every queue connecting application stages, and determine when and how the cores allocated to the sink stage are to be adapted.

[Figure 1: System model. A source stage feeds a queue of size Qsize at arrival rate Rarr; a sink stage drains it at departure rate Rdep. The adaptation threshold Qadapt and the worst-case fill (Rworst – Rdep)(tsw + tmon) are annotated on the queue.]

In our model, the worst case arrival rate into any queue, Rworst, and the time to switch on a processing core, tsw, are design parameters which need to be specified when starting the system. Note that both of these parameters are fixed for a given system and can be determined easily. All other parameters are determined by the system automatically.

The current arrival rate, Rarr, and departure rate, Rdep, of the queue, in packets per second, are obtained using the monitoring mechanism explained in the previous section. We assume that Rdep scales linearly with the number of processors.³ Hence, given n, the number of cores currently allocated to the sink, we calculate the departure rate of the sink running on one core as Rdep1 = Rdep / n.

4.1 Allocation Policy
The number of cores needed to sustain a departure rate R is NumCores(R) = R / Rdep1. Given the worst case arrival rate, Rworst, this gives the maximum number of cores needed for a stage as

NumCores(Rworst) = Rworst / Rdep1

However, this puts the system in the worst case provisioned state, and as we discussed earlier, the system typically does not experience worst case arrival rates all the time. Instead, by observing the current arrival rate, the system can switch on only as many cores as are needed to handle this rate:

NumCores(Rarr) = Rarr / Rdep1

This gives the number of cores to allocate, k, as

k = Rarr / Rdep1 – n

With this basic idea we define the allocation policy as follows:

if k >= 1, request floor(k) cores immediately
if k < 1, request 1 core, deferred until the queue depth reaches Qadapt

Now let us look at each of the two scenarios in further detail. First, when k >= 1:

k >= 1  ⇒  Rarr / Rdep1 – n >= 1  ⇒  Rarr – Rdep >= Rdep1

Now, Rarr – Rdep is the rate at which the queue fills up. This means the queue will fill up quickly, as the arrival rate exceeds the departure rate by at least the capacity of a full core. Hence, in this case the allocation policy is to allocate the needed number of cores immediately. Second, when k < 1:

k < 1  ⇒  Rarr – Rdep < Rdep1

In this case, the queue depth will increase slowly, since the arrival rate is only slightly higher than the departure rate. We would like to defer the decision to allocate as long as possible, to leave the necessary resources either powered down or available for other services. Hence, the policy in this case is not to add any cores at this time. Instead, the policy must determine the depth, Qadapt, such that if a core is requested when this depth is reached, the sink stage will not drop any packets.

To find Qadapt, let tmon be the interval at which we monitor the queues, and recall that it takes time tsw to switch on the cores requested. In the worst case, the arrival rate might increase to the maximum value right after we monitored the rates. Hence, to prevent packet loss, Qadapt is chosen so that

Qadapt + (Rarr – Rdep)·tmon = Qsize – (Rworst – Rdep)·tsw – Rdep1·tsw    (1)

Note that if the number of cores needed, NumCores(Rworst), exceeds N, the total number of cores in the system, then nothing can be done by an adaptive system to meet that arrival rate. Also, if Rworst > Qsize / tsw, nothing can be done to prevent loss of packets during adaptation.⁴

To summarize, this allocation policy allocates cores aggressively when the departure rate is much lower than the arrival rate. When the departure rate is only slightly less than the arrival rate, the allocation is delayed as much as possible.

⁴ This value is obtained by substituting tmon = tsw / N and Rdep = Rworst / N in (1) and simplifying to eliminate Rdep.
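The allocation policy can be stated compactly in code. The sketch below follows the formulas above under the paper's linear-scaling assumption; request_cores is an invented interface, not part of our system's actual API.

    #include <math.h>

    extern void request_cores(int k);   /* hypothetical request to the allocator */

    /* Per-queue allocation policy: rarr/rdep are the monitored rates, n is
     * the number of cores currently allocated to the sink, and qdepth/qadapt
     * are the current queue depth and the deferred-allocation threshold. */
    void allocation_policy(double rarr, double rdep, int n,
                           double qdepth, double qadapt)
    {
        double rdep1 = rdep / n;          /* single-core departure rate */
        double k = rarr / rdep1 - n;      /* additional cores needed */

        if (k >= 1.0)
            request_cores((int)floor(k)); /* queue fills fast: allocate now */
        else if (k > 0.0 && qdepth >= qadapt)
            request_cores(1);             /* fills slowly: deferred single core */
    }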

4.2 De-allocation Policy
How do we determine that we have more cores allocated than needed? We have implemented a mechanism wherein the estimated single-core departure rate for a stage, Rdep1, is latched while increasing the allocation for that stage. When it is detected that the queue has not grown for a configurable number, D, of monitoring cycles, the latched departure rate is used to determine whether some cores should be de-allocated.

4.3 Resource Allocator
The allocation and de-allocation policies identify the needs of the individual stages of the application and accordingly request or release resources. The resource requests made by the modules implementing the policies are handled by the resource allocator. The resource allocator is a global entity that is aware of all resources available on the physical system, the specific capabilities of those resources, and all the requests made by the individual stages of the application. With this global knowledge the resource allocator determines which requests to service and the specific physical resources to provide in order to fulfill the requests.

The policies and mechanisms we described are put together as follows to enable adaptation. The monitoring infrastructure collects statistics as the packets flow through the system. A separate Intel XScale® core thread queries the gathered statistics periodically and, based on the monitored values, decides for each individual stage whether it needs a change in the cores it has been allotted. The requests from all the individual stages are presented to the resource allocator, which then makes a global decision on the physical cores that will be allocated to each stage. Once the final resource allocation decision is made, the affected stages are moved from the previous state to the new one using the adaptation mechanisms. Specifically, checkpoints are activated in the affected stages. Once a checkpoint is hit, the core running the stage is stopped. New resources are linked into the appropriate binary. The stage is now in the new state; the relevant cores are restarted and the application continues its task.
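The cycle just described might be structured as the control loop below. Every function here is a placeholder name for the corresponding step in the text, and the de-allocation test is one plausible reading of the latched-rate mechanism; none of this is the system's literal API.

    /* Sketch of one adaptation cycle (all names are placeholders). */
    #define D 4   /* assumed number of stable monitoring cycles before dealloc */

    struct queue;  /* connects a source stage to a sink stage */

    extern int num_queues;
    extern struct queue *queues[];

    extern void poll_rates(struct queue *q);             /* Section 3.1 */
    extern void run_allocation_policy(struct queue *q);  /* Section 4.1 */
    extern int stable_cycles(struct queue *q);           /* cycles without growth */
    extern int excess_cores(struct queue *q);            /* from latched Rdep1 */
    extern void release_cores(struct queue *q, int k);
    extern void arbitrate_and_apply(void);               /* allocator + mechanisms */

    void adaptation_cycle(void)
    {
        for (int i = 0; i < num_queues; i++) {
            poll_rates(queues[i]);
            run_allocation_policy(queues[i]);
            if (stable_cycles(queues[i]) >= D && excess_cores(queues[i]) > 0)
                release_cores(queues[i], excess_cores(queues[i]));
        }
        arbitrate_and_apply();  /* global decision, then checkpoint/link/restart */
    }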


5 SYSTEM EVALUATION
Our run-time system implementation was done on a RadiSys, Inc. ENP-2611* with a single 600 MHz IXP2400 network processor, running MontaVista* Linux*. The board has 3 optical Ethernet ports, each capable of supporting 1 Gbps of bidirectional traffic. We used an IXIA traffic generator [12] for packet stimulus. To evaluate our implementation we first measured the cost of the individual adaptation mechanisms using a set of application-agnostic micro-benchmarks. We then used a real application to evaluate the cumulative effects of the adaptation mechanisms. Finally, we measured the benefits of run-time adaptation.

5.1 Adaptation Costs
We used a 37.5 MHz hardware-based timer on the IXP2400 network processor to measure the time taken for the different operations. We measured the overhead of invoking the method used to read the timer value as 0.53 µs from both Linux user and kernel space. In the following sections we outline the experimental setup and the results of these studies. The measurements were taken in the Linux kernel on the Intel XScale® core unless otherwise mentioned.

5.1.1 Cost of Checkpointing: To evaluate the checkpoint mechanisms we used two metrics:
• time to inform the processing unit to stop at the beginning of the loop
• time to check if the threads in the processing unit have all stopped execution
The results on both the Intel XScale® core and the ME are shown in Table 1.

Table 1: Overhead of checkpointing

The ME numbers are larger than the Intel XScale® core numbers because ME checkpointing consists of stopping the ME, writing a thread halt instruction into the code store at the checkpoint location, and restarting the ME execution. The Intel XScale® core checkpointing, on the other hand, involves just setting a flag in a known memory location.

5.1.2 Cost of Starting Processing Units: We measured the costs of starting a processing unit. The IXA SDK [8] provides methods for starting an ME; we measured this to be 36 µs. For the Intel XScale® core, the cost is equal to the cost of creating and starting a Linux kernel thread, which we measured to be 97 µs.

5.1.3 Cost of Loading Code: We measured the performance of our loading mechanism on the ME and the Intel XScale® core. The Intel XScale® core load involves associating a function entry point with the Linux kernel thread and was measured to be 54 µs. The ME load takes as input the name of an application stage binary file that resides on a memory-mapped file system, and thus incurs the overhead of reading the binary and accessing its instructions before writing them into the ME code store. In the experimental setup we generated ME binaries with varying numbers of instructions and measured the time to load each binary into the ME code store using the load method of the ME processing unit. The result is shown in Figure 2.

[Figure 2: Microbenchmark results for loading code on the ME (Source: Intel)]

The graph shows the total time taken by the ME load as a function of the number of instructions, and the contribution of the different steps in the implementation: reading the binary, writing the code store, and cleanup (freeing resources allocated in reading the binary). The result shows that there is a fixed overhead incurred in reading the binary (4 ms) and in cleanup (0.63 ms). The write time into the ME code store is proportional to the number of instructions in the binary, as expected.

5.1.4 Cost of Binding: We measured the overhead of the adaptation-time linking approach on both the MEs and the Intel XScale® core. The ME instructions have no hardware support for relative addressing. Thus ME linking must explicitly relocate the RAL implementation binary and append its instructions to those of the application stage binary. On the Intel XScale® core, since the RAL implementations are already loaded as Linux kernel modules, no overhead for relocating the RAL implementation occurs. The Intel XScale® core adaptation-time linking implementation incurs the overhead of the Linux insmod program, which is used to load the application stage binary into the kernel.

In order to measure the overhead, we generated binaries with varying numbers of:
• relocatable instructions (for ME only)
• call sites that invoke a RAL method; these were equally distributed among a fixed number of RAL instances
We measured the cost of adaptation-time linking for each binary and the contribution of the various steps involved in the current implementation. The steps are: reading the binary from a memory-mapped filesystem on the Intel XScale® core, relocating RAL implementations (ME only), patching call sites, and writing the linked binary to the filesystem.
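As an illustration of the call-site patching step, the ME case might look like the following; the fix-up record format and the make_branch encoding helper are invented for exposition, not the actual binary format.

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical fix-up record for a RAL call site: the compiler leaves a
     * branch with an unresolved target at 'offset' in the stage binary. */
    struct callsite_fixup {
        size_t offset;        /* instruction index of the branch to patch */
        int ral_method;       /* which RAL method this call site invokes */
    };

    extern uint32_t make_branch(uint32_t target);  /* encodes a branch instr. */

    /* Patch every call site to branch to the relocated RAL implementation. */
    void patch_callsites(uint32_t *code, const struct callsite_fixup *fix,
                         size_t nfix, const uint32_t *ral_entry)
    {
        for (size_t i = 0; i < nfix; i++)
            code[fix[i].offset] = make_branch(ral_entry[fix[i].ral_method]);
    }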

* Other names and brands may be claimed as the property of others.


Figure 3 shows the results for ME adaptation-time linking as a function of the number of call sites.

[Figure 3: Microbenchmark results for binding code for the ME (Source: Intel)]

As we can see, the ME adaptation-time linking (labeled "Total link time") varies linearly with the number of call sites, because the overhead of patching call sites varies linearly with their number. Each call site invoking a RAL method contains a branch instruction with an unresolved target address. The ME binding mechanism patches the branch instruction at each RAL method call site with the instruction address of the correct RAL implementation method that was linked in. The ME linking also incurs a fixed overhead for reading (4.5 ms), writing (3.5 ms) and relocation (2.1 ms). The relocation cost consists of the cost of relocating a packet channel implementation and a lock implementation.

[Figure 4: Microbenchmark results for binding code for the Intel XScale® core (Source: Intel)]

Figure 4 shows the results of the Intel XScale® core adaptation-time linking. The total link time is around 160 ms for the Intel XScale® core, of which there is a fixed overhead (80 ms) involved in the write operation, which uses the insmod utility. The Intel XScale® core patching times are independent of the number of call sites, since the Intel XScale® core binding involves renaming the methods in the symbol table, which only needs to be done once for each unique method per RAL instance. This is the crucial difference between the Intel XScale® core and ME binding results.

5.1.5 Cumulative Effects of Adaptation: We used a layer 3 switching and forwarding application (Figure 5), implemented using the Baker language [11], to measure the total adaptation time. Total adaptation time is defined as the time taken by the system to reach the final mapping from the initial mapping. This metric is useful for determining how fast the system can adapt.

[Figure 5: Layer 3 switching and forwarding]

We measured the overhead of adapting between the same processing unit implementations (ME to ME) using the mapping configurations shown in Table 2, and the overhead of adapting between processing units of different implementations (ME to Intel XScale® core) as shown in Table 3. Only the application stages whose mapping changed are shown in the tables.

Table 2: Configuration to measure ME to ME adaptation overhead

              L3 fwdr    L2 bridge    Channel 2
    Initial   1 ME       3 MEs        sram
    Final     3 MEs      1 ME         scratchpad

Table 3: Configuration to measure ME to Intel XScale® core adaptation overhead

              L3 fwdr               L2 bridge
    Initial   4 MEs                 Intel XScale® core
    Final     Intel XScale® core    4 MEs

Appropriate timer probes were inserted into the code for these measurements. The results were obtained by running 5 iterations and averaging the values (the standard deviation was small):

Total time to adapt: ME to Intel XScale® core = 254.3 ms; ME to ME = 99.5 ms

We found that with these values for adaptation time and the queue sizes supported in our implementation, our system was unable to prevent packets from being dropped during the period of adaptation for worst case arrival rates (2.5 Gbps). This is to be expected, based on the required relationships between Qsize, Rworst and tsw explained in Section 4.1.
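To see why, consider the Section 4.1 condition Rworst <= Qsize / tsw with these measured numbers. A back-of-the-envelope check, ignoring Ethernet framing overhead and treating the 99.5 ms ME-to-ME adaptation time as tsw:

    #include <stdio.h>

    int main(void)
    {
        double rworst = 2.5e9 / (64 * 8); /* 64-byte packets at 2.5 Gbps, ~4.9 Mpps */
        double tsw = 99.5e-3;             /* measured ME-to-ME adaptation time */
        double qneeded = rworst * tsw;    /* depth needed to absorb adaptation */

        printf("packets to buffer during adaptation: %.0f\n", qneeded);
        /* Prints roughly 486000 packets -- far larger than the queues our
           implementation supports, hence the drops observed above. */
        return 0;
    }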


5.2 Adaptation Benefits
In order to quantify the benefits of run-time adaptation we need to measure the ability of the adaptive system to handle multiple services efficiently while the workload varies. Moreover, any workload might be experienced by the system for long enough periods that failing to support high data rates during those periods impacts overall system performance. So we need metrics that measure the ability of the system to handle long term variations in workloads.

We wish to compare a system with run-time adaptation against a system which is compiled by profiling with a single workload. The obvious problem here is choosing the correct workload to use for profiling. We handle this by creating a set of binaries, each profiled with a different workload. We use the profile-driven Shangri-la auto-partitioning compiler [17] to generate the binaries for each profile. We compare the performance of the adaptive system against the performance of all of these "static" binaries.

5.2.1 Testing Methodology: We create different traffic streams, each consisting of a particular packet type. We define a workload to comprise particular percentages of the different packet types. In this manner we create numerous workloads with varying percentages of constituent packet types, representing the traffic destined for different network services. To measure the ability of the system to handle all the workloads we ensure each workload is experienced for the same amount of time. We also ensure a sufficient period with no traffic between workloads, so that earlier workloads do not affect future behavior. We measure the forwarding rate for each workload. The overall system performance is the set of forwarding rates supported by the system, one for each of the workloads.

5.2.2 Experimental Setup: For our experiments we chose the same layer 3 switching and forwarding application used while analyzing adaptation costs (Figure 5). This application has sufficient complexity that the generated code does not all fit in the code store of one ME. The application handles different packet types, each of which requires different stages of the application. Furthermore, each stage has enough computation and memory accesses to require multiple copies to be running on the MEs in order to support high data rates. Consequently, a designer following the traditional approach of provisioning would have to choose a subset of the stages to place on the fast path, with the remaining stages placed on the Intel XScale® core. As such this application, though simple, brings out the same challenges that would be faced by the designer of a system supporting numerous network services in a real world device such as a router or a switch.

We created traces with two different packet types: one which would be bridged by the application (L2) and one which would be forwarded (L3). We created 7 different binary sets of the application, each profiled with a trace containing a different mix of the two packet types. For the adaptive system we chose the binary set that offered the run-time system the maximum flexibility for adaptation: each application stage in a separate binary. This allows the run-time system the flexibility to load only the specific stages needed onto the MEs while all others are loaded onto the Intel XScale® core. We fed each binary set with workloads containing different percentages of the two packet types from 3 single-gigabit ports. Each packet was of minimum size (64 bytes). Each workload was sent for 60 s, followed by 30 s of no traffic.

[Figure 6: Adaptation benefits (Source: Intel). The graph plots the packets-received rate as a percentage of the packets-sent rate (0–100%) against the input traffic mix (l2%, l3%), ranging from (0%, 100%) through (3%, 97%), (40%, 60%), (50%, 50%), (60%, 40%), (80%, 20%), (97%, 3%) to (100%, 0%), at an absolute rate of 2.5 Gbps.]
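The methodology above reduces to a simple schedule of traffic mixes. The table below mirrors the mixes on Figure 6's x-axis together with the send/idle timing from the setup; the structure is our own illustration, not the IXIA generator's configuration format.

    /* Illustrative workload schedule for the Figure 6 experiment. */
    struct workload { int l2_pct, l3_pct; };

    static const struct workload mixes[] = {
        {0, 100}, {3, 97}, {40, 60}, {50, 50},
        {60, 40}, {80, 20}, {97, 3}, {100, 0},
    };

    enum { SEND_SECS = 60, IDLE_SECS = 30 };  /* traffic on, then silence */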

5.2.3 Results: The graph in Figure 6 plots the results. The x-axis shows the different workloads fed to each of the binaries. The y-axis is the output rate expressed as a percentage of the input rate. Each line on the graph shows the performance of one binary set. As can be seen from this graph, a binary set that is compiled for a particular workload supports high forwarding rates for that workload, but the rate drops as the workload changes. The performance of the adaptive system, however, remains roughly constant across all the workloads. Another point to note is that the absolute rate supported by the adaptive system for a particular workload is always less than that of a binary compiled specifically to handle that workload. This is to be expected, as the binary set chosen for the adaptive system is the one which offers the most flexibility for adaptation, i.e., each stage is in a separate binary. A natural consequence is that we lose some compile-time optimizations for these binaries, which accounts for the difference in output rate between the adaptive system and the binary compiled for a particular workload. The point to note, however, is that all the statically compiled binary sets have a lower forwarding rate than the adaptive system for almost half of the workloads, and many are worse for 80% of the workloads. More importantly, we see this behavior with just two services. As the number of services that need to be supported increases, it will become more and more difficult for a compiler to fit the critical components of all those services into a binary that can run on the fast path.

6 RELATED WORK
Several research efforts [3, 14, 15, 21, 22] have focused on the problem of making packet processing applications on network processors easier to program while sustaining high throughput. These efforts have typically involved provisioning a particular application for the worst case traffic expected. This results in less than optimal usage of the resources, wasting power and allowing only a few services to be supported on given hardware [24]. Work has also been done in the area of router extensibility and flexibility through dynamic resource allocation. Router Plugins [4] and PromethOS [26] have explored this issue on general purpose processors, while VERA [29] focuses on systems with a general purpose host processor coupled with processors on intelligent NICs or a network processor. NetBind [18] allows different pieces of machine code to be composed together efficiently on network processor systems using dynamic binding. However, none of these efforts focus on handling traffic fluctuations.

Load balancing in the face of workload variations has been an area of active research in the context of web servers and hosting centers [1, 5, 19, 23, 31]. These differ from the edge networking domain in that edge packet processing needs to happen at a much higher rate, with the time taken for changing allocations being on the same time scale as that of the traffic fluctuations. ShaRE [25], which is implemented using the same resource abstraction layer code base as ours, has similar requirements, but the goals are different. Their algorithm (Everest) attempts to minimize the probability of delay tolerance violations, while our algorithm attempts to maximize throughput by reducing the probability of queue overflows. One consequence of the difference in approaches is that the lengths of the queues used in Everest may need to vary with the number of cores assigned to a stage, whereas our algorithm assumes fixed length queues. Both algorithms rely on online measurements to determine the values of certain parameters. An important parameter in such systems is the monitoring interval; [25] discusses the advantage of smaller monitoring intervals. We account for the monitoring interval in our equations and also derive an upper bound for it.

7 CONCLUSION
We have implemented a system that is capable of dynamically adapting the processing cores allocated to network services, with the goal of maximizing the overall throughput of the active services. The algorithm we developed automatically balances application pipelines by increasing the resources allocated to bottleneck stages. It overcomes the code store limitation of the IXP network processor by allowing services to migrate freely between the limited-instruction-store microengines and the instruction-cached Intel XScale® core. Our algorithm does not require the system administrator to specify hard-to-determine parameters for correct operation. Finally, we demonstrated that the adaptive system performs better than a profile-driven compiled solution for most workloads; when it is slower, it remains within 20% of the optimal solution in the worst case and within about 10% on average.

8 FUTURE WORK
An interesting area of future work would be to determine how well an adaptive system can support short term fluctuations in traffic. Without adaptation, when the traffic differs from the profiled case, the compiled solution will drop packets. While the adaptive solution can change the core allocation to handle the new traffic distribution, the time taken to adapt is crucial, since during this time packets could be dropped. It would be interesting to study what kinds of short term fluctuations are seen in the real world and whether an adaptive system would drop more packets than a non-adaptive system in the face of a large number of short term fluctuations.

9 REFERENCES
[1] A. Chandra, et al. Dynamic Resource Allocation for Shared Data Centers Using Online Measurement. In Proceedings of the International Workshop on QoS, 2003.
[2] CyberGuard Corporation. http://www.cyberguard.com/.
[3] CloudShield Technologies. http://www.cloudshield.com.
[4] D. Decasper, et al. Router Plugins: A Software Architecture for Next Generation Routers. In Proceedings of ACM SIGCOMM, 1998.
[5] D. C. Steere, et al. A feedback-driven proportion allocator for real-rate scheduling. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, pages 145–158, 1999.
[6] E. Kohler, et al. The Click Modular Router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.
[7] F5 Networks, Inc. http://www.f5.com/.
[8] Intel® Internet Exchange Architecture Software Development Kit. www.intel.com/design/network/products/npfamily/sdk.htm.
[9] Intel® Network Processors. www.intel.com/design/network/products/npfamily/index.htm.
[10] Intel Corporation. Introduction to Auto-Partitioning Programming Model. http://www.intel.com/design/network/papers/254114.htm.
[11] Intel Corporation. Advanced Software Framework, Tools, and Languages for the IXP Family. http://www.intel.com/technology/itj/2003/volume07issue04/art06_tools/vol7iss4_art06.pdf.
[12] IXIA. http://www.ixiacom.com/.
[13] K. Li, et al. Real-time, concurrent checkpoint for parallel programs. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 79–88, 1990.
[14] L. George and M. Blume. Taming the IXP network processor. In Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pages 26–37. ACM Press, 2003.
[15] M. Adiletta, D. Hooper, and M. Wilde. Packet Over SONET: Achieving 10 Gigabit/sec Packet Processing with IXP2800. Intel Technology Journal, 6(3), 2002.
[16] M. Litzkow, M. Livny, and M. Mutka. Condor - a hunter of idle workstations. In Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988.
[17] M. K. Chen, et al. Shangri-la: Achieving high performance from compiled network applications while enabling ease of programming. In ACM SIGPLAN 2005 Conference on Programming Language Design and Implementation (PLDI), Chicago, Illinois, USA, June 12–15, 2005.
[18] M. E. Kounavis, et al. Programming the Data Path in Network Processor-based Routers. Software: Practice and Experience, 2004.
[19] M. Welsh, et al. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. In Proceedings of the ACM Symposium on Operating Systems Principles, October 2001.
[20] NetScaler, Inc. http://www.netscaler.com/.
[21] N. Shah, W. Plishker, and K. Keutzer. NP-Click: A Programming Model for the Intel IXP1200. In 2nd Workshop on Network Processors (NP-2), 9th International Symposium on High Performance Computer Architecture (HPCA-9), 2003.
[22] PayloadPlus family of network processors. http://www.agere.com/enterprise_metro_access/network_processors.html.
[23] P. Pradhan, et al. An observation-based approach towards self-managing web servers. In International Workshop on Quality of Service, 2002.
[24] R. Kokku, et al. A Case for Run-time Adaptation in Packet Processing Systems. In 2nd Workshop on Hot Topics in Networks, November 2003, Cambridge, MA, USA.
[25] R. Kokku. ShaRE: Run-time System for High-performance Virtualized Routers. Ph.D. Thesis, The University of Texas at Austin, 2005.
[26] R. Keller, et al. PromethOS: A Dynamically Extensible Router Architecture Supporting Explicit Routing. In Proceedings of the Fourth Annual International Working Conference on Active Networks (IWAN), 2002.
[27] S. Osman, et al. The design and implementation of Zap: a system for migrating computing environments. SIGOPS Operating Systems Review, 36(SI):361–376, 2002.
[28] Sapuntzakis, et al. Optimizing the migration of virtual computers. SIGOPS Operating Systems Review, 36(SI):377–390, 2002.
[29] S. Karlin and L. Peterson. VERA: an extensible router architecture. Computer Networks, 38(3), 2002.
[30] TejaNP™: A Software Platform for Network Processors. http://www.teja.com.
[31] T. F. Abdelzaher, et al. Performance Guarantees for Web Server End-Systems: A Control-Theoretical Approach. IEEE Transactions on Parallel and Distributed Systems, 13(1):80–96, 2002.
[32] V. Paxson and S. Floyd. Wide area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on Networking, 3(3):226–244, 1995.
[33] Y. Qiao, et al. Multiscale Predictability of Network Traffic. Technical Report, Northwestern University.
[34] Z. Zhang, V. Ribeiro, S. Moon, and C. Diot. Small-Time scaling behaviors of Internet backbone traffic: An empirical study. In Proceedings of IEEE INFOCOM, 2003.