Considering Processing Cost in Network Simulations

Ramaswamy Ramaswamy, Ning Weng, and Tilman Wolf
Department of Electrical and Computer Engineering
University of Massachusetts, Amherst, MA 01003 USA
[email protected], [email protected], [email protected]

ABSTRACT

In many network simulations and models the cost of processing a packet is considered negligible or overly simplified. The functionality of routers is steadily increasing, and complex processing of packet payloads is being implemented (deep packet classification, encryption, content transcoding). We show two examples where processing cost can contribute a significant portion of the overall packet delay. To enable a more precise consideration of processing delay, we present a tool called NPEST (Network Processing Estimator). NPEST is a framework on top of which packet processing functionality can be implemented and simulated using an actual processor simulator. NPEST can be programmed in C and greatly simplifies the implementation and simulation process as compared to using network processor simulators. The results derived from NPEST can either be used directly or be aggregated into processing statistics for network simulations. We present such results for two prototype applications: IP forwarding and IP security. We also show a comparison between the results obtained from NPEST and an Intel IXP1200 network processor.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ACM SIGCOMM 2003 Workshops, August 25-27, 2003, Karlsruhe, Germany. Copyright 2003 ACM 1-58113-748-6/03/0008...$5.00.

1. INTRODUCTION

Network analysis and simulation are key methods for networking researchers to understand, model, and improve the functionality and performance of computer networks. In various current models, much emphasis is put on queuing and propagation delay. One important component that has consistently been simplified or ignored in network models and simulations is the processing cost, or processing delay, on a network node.

Traditionally, this delay has been considered negligible, as only simple packet forwarding functions needed to be implemented. With networks emerging as "service platforms," more complex processing is performed on a router, and this assumption is no longer suitable for future networks. Processing packets and modifying their payload (e.g., for ad insertion, encryption, or content transcoding) takes considerable processing time and incurs delays on the order of several milliseconds. With emerging technologies like network processors, these functions become easier to implement inside the network and can be expected to proliferate further. Therefore it is crucial that such processing cost is considered in networking simulations and analysis, as it will make results more realistic, accurate, and reproducible.

To illustrate the impact of processing cost, we briefly discuss the various network delays. When sending a packet from one node to another, the following delays occur: (1) transmission delay (the time it takes to put a packet onto the wire), (2) propagation delay (the time it takes the signal to travel across the wire), (3) processing delay (the time it takes to handle a packet on the networking equipment), and (4) queuing delay (the time the packet is buffered before it can be sent). Table 1 shows a simple back-of-the-envelope calculation of these delay components for a 1 Gb/s link, a 1250 B (= 10 kb) packet, and a 200 km link. In most cases, the key contributors to delay are (2) and (4), and these are therefore considered in simulations and measurements. The transmission delay (1) is usually negligibly small for fast links and small packets and is therefore not considered. Traditionally (column "Simple Packet Forwarding"), the processing delay (3) has also been negligible. In our previous work [37] and in Section 4 of this paper, we show that packet processing can take considerable time when payload modifications are involved. This can require on the order of 100 general-purpose processor instructions per byte of payload. As a result, the processing delay can contribute as much as 50% of the overall packet delay (column "Complex Payload Modifications" in Table 1).

Another example of processing delay impacting network measurements is ICMP processing. High-performance routers forward packets in the "fast path," which is optimized for handling the common case of packet forwarding. Packets that need special attention (e.g., IP option processing or ICMP processing) are diverted to the "slow path," which examines these packets more carefully. As a result, generating an ICMP response to a ping request can take more processing time than forwarding a packet. This can lead to measurements such as those shown in Figure 1, where the routers on a path are pinged progressively. At the indicated hops, the delay between consecutive hops decreases, seemingly contradicting common sense. By considering processing cost, this behavior can be explained in a straightforward manner.
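The entries in Table 1 can be reproduced with a quick back-of-the-envelope calculation. The propagation speed of 2 x 10^8 m/s and the instruction counts below are illustrative assumptions chosen to match the table's 100 MIPS processor and the per-byte processing estimate above, not measured values:

    transmission delay                 = 10 kb / 1 Gb/s                        = 10 µs
    propagation delay                  = 200 km / (2 x 10^8 m/s)               = 1,000 µs
    processing (simple forwarding)     ≈ 1,000 instructions / 100 MIPS         = 10 µs
    processing (payload modification)  ≈ 100 instr/byte x 1,250 B / 100 MIPS   ≈ 1,250 µs

The last value is on the order of the 1,000 µs listed in Table 1; queuing delay depends entirely on the load and can range from zero to unbounded.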

Estimating the processing cost or processing delay for packets is not an easy task. With routers implementing a host of network services, any general-purpose processing could be performed and needs to be considered. While analytic models have been used successfully to estimate highly regular processing such as IP table lookups, they reach their limitations in the case of general-purpose processing. For this reason, we have developed a tool called NPEST (Network Processing Estimator). NPEST allows networking researchers to implement packet processing functionality as a simple C program, with the NPEST framework providing all the necessary packet trace processing and memory management. Using a standard processor simulator, processing time and other relevant metrics (memory bandwidth, etc.) can be obtained. We show in our evaluation that the results are comparable to those obtained from actual network processors. The key benefit of NPEST is that implementing applications is significantly easier than on any network processor platform. (We use "application" and "packet processing code" interchangeably to indicate the processing that is performed inside the network on the router. This is not to be confused with the application program on the end-system, which we are not considering here.) This allows users to quickly obtain processing time estimations without the need for expertise in network processor software development.

There are two ways in which NPEST results can be integrated and used in network simulations and models, as illustrated in Figure 2. One way is to integrate NPEST with a network simulator that uses actual packet traces and simulate the processing of these packets. This leads to accurate results, but is limited to small network simulations where packet traces are used. The other way of using realistic processing cost estimations is to obtain statistics from NPEST and then use these in network simulations or models. This greatly simplifies the usage, but is limited in its accuracy. Either way will significantly improve the accuracy of network models and lead to more realistic network simulations.

The remainder of this paper is organized as follows: Section 2 discusses related work. We present the NPEST tool in detail in Section 3. Section 4 shows NPEST simulation results for two applications, and we provide a comparison between the results of NPEST and an IXP1200 network processor. Since some aspects of NPEST are still work in progress, we discuss future work in Section 5. Section 6 summarizes and concludes this paper.

2. RELATED WORK

Processing of traffic on a network node is not limited to forwarding functions anymore. It is common that routers perform firewalling [25], network address translation (NAT) [10], web switching [2], IP traceback [32], and many other functions.


Table 1: Networking Delay Components. A 1 Gb/s link, 10 kb packet size, 100 MIPS processor, and a link distance of 200 km are assumed.

  Delay                                         Simple Packet Forwarding   Complex Payload Modifications
  Transmission delay                            10µs                       10µs
  Propagation delay                             1,000µs                    1,000µs
  Processing delay                              10µs                       1,000µs
  Queuing delay                                 0...∞                      0...∞
  Fraction of processing delay to total delay   ∼1%                        ∼50%

Figure 1: Observed Delay Decrease due to ICMP Processing. (Per-hop ping delays along a path; at the marked hops the delay decreases between consecutive hops.)




Figure 2: Network Node Delays and NPEST. (On the network node, packets experience processing delay and queuing delay in addition to transmission and propagation delay. NPEST, consisting of a networking application on top of the NPEST framework and a processor simulator, produces application-specific processing statistics that can feed either an analytic model or a simulation.)

With increasingly heterogeneous end-systems (e.g., mobile devices and "thin" clients), computationally more demanding services have been moved into the network. Examples are content transcoding, advertisement insertion, and cryptographic processing. Many of these functions are not implemented as proxies that terminate TCP connections, but on a per-packet level, transparently to the end-system. Therefore, the delay that is caused by this processing can be directly observed and is reflected in network measurements. It can be expected that this trend towards more functionality on the router will continue. A few programmable network platforms that have been developed by the active networking community [35, 8, 29] have reached the necessary level of maturity. Also, there are standardization efforts in progress by the IETF OPES (Open Pluggable Edge Services) working group to define such networking service platforms.

In practice, these processing functions can be implemented in a variety of ways, ranging from software-based routers (a workstation acting as a specialized router) to specialized hardware (an ASIC implementation on a router line card). In recent years, so-called "network processors" (NPs) have become available for performing general-purpose processing on high-bandwidth data links. These network processors are system-on-a-chip multiprocessors that are optimized for high-bandwidth I/O and highly parallel processing of packets. A few of the numerous examples are the Intel IXP1200 [16], IBM PowerNP [15], and EZchip NP-1 [11]. One of the main challenges with network processors is that they require a significant amount of hardware expertise to be programmed. It is doubtful that processing cost measurement can be done easily on network processors and their simulators. Therefore we propose the NPEST tool, which uses a simple C code interface to implement networking applications and obtains results comparable to those of a network processor (see Section 4).

As mentioned above, the capability to specify processing cost in current network simulators is limited. Node delays can be specified in ns-2 [23] by attaching a timer to a node and performing some actions when the timer expires. The NEST network simulator [5] provides a slumber() method which can be used to suspend node execution for a definite period of time. The OPNET network simulator [27] can simulate processes, which are behavioral descriptions of the functionality of network nodes. Processes in turn can add events to the event queue, which causes simulation time to evolve inside the process. This offers a rather involved method of specifying processing cost in a network simulation. Wang et al. [36] have developed a new simulator that combines network-level simulation (using ns) with OS/server simulation (using logsim [28]) to provide detailed server processing statistics along with network statistics.

In order to accurately estimate processing cost, we need to use a processor simulator. Numerous academic and commercial processor simulators are currently in use. Typically, these simulators accept a program in an appropriate binary format as input and simulate every instruction in the program. Statistics such as the number of instructions executed per cycle and the number of memory accesses made are returned by the simulator.

The ARMulator [3] is an instruction set simulator for the ARM microprocessor. Similarly, the MIPS Free GNU toolkit [24] includes a simulator for the MIPS architecture. Shade [9] is an instruction simulator which can simulate the SPARC and MIPS instruction sets. SimpleScalar [7] is a well-known processor simulator which currently can perform functional simulation of the Alpha, x86, and ARM instruction sets. It is important to choose the right type of processor simulator when evaluating network processors. Most network processing cores are based on RISC architectures. NPEST utilizes the ARM version of the SimpleScalar toolset for binary simulation due to its widespread academic use and the free availability of its source code.

Network processor vendors offer software development kits (SDKs) for their products [17, 1]. These usually include a simulator which can be used to execute code that has been compiled for the network processor and obtain detailed hardware statistics about program execution on the processor. The IXA software development kit for the IXP1200 [17] can compile programs written in microengine C or assembly and simulate the binary on the IXP1200 simulator. However, programming with the SDK can be very involved, and the user is required to have considerable knowledge of the network processor architecture and assembly instructions in order to obtain accurate results. This problem can be alleviated by using tools such as TejaNP [34], which raise the level of abstraction by accepting application logic described as state machines and library components. Code generators are used to map these entities onto the IXP hardware in an efficient way. While this provides a more abstract programming model, it is limited to particular network processor architectures.

General programming models for network processors have also been developed. NP-Click [30] is a programming model for the IXP1200 network processor based on the Click modular router [22]. NP-Click makes it possible to write efficient applications on the network processor without having to know all the details of the processor architecture. However, the metrics returned by NP-Click are application specific (throughput, data rate) and not processor specific (instruction count, memory accesses), which makes it hard to evaluate processing cost for particular applications.

3. NPEST TOOL

In this section, we present NPEST, the Network Processing Estimator tool, with which the processing cost of packet processing can be derived. As indicated earlier, analytic processing cost models are of limited benefit in a general-purpose processing environment. Data-dependent processing, the halting problem, and other system-related issues make it almost impossible to obtain accurate estimates without employing a processor simulator. With NPEST, we provide a programming and simulation environment that is modelled after a network processor. The main benefit of NPEST is that code can be developed much more easily and results can be obtained more quickly than with a network processor development environment. We first discuss the challenges involved in processing simulation and then show how they are addressed in NPEST. Then we present the NPEST framework and the API that is exposed to an NPEST user. Finally, we discuss the processing cost metrics that can be derived and used in networking simulations.


Figure 3: NPEST Architecture. (A packet trace is read by the NPEST framework, which performs packet preprocessing and packet memory management; the NPEST API connects the framework to the NPEST applications; everything runs on a processor simulator (SimpleScalar, ARMulator) and produces processing statistics and a processed packet trace.)


3.1 Processing Simulation Issues

In the networking domain, a first-order simulation of data transmission can be done relatively easily. Packet transmission is affected by only a few, simple parameters (packet size, link speed, link load, etc.), and results can be obtained with reasonably good accuracy. Reaching a similar level of accuracy in processing simulations is challenging for the following reasons:

• System Configuration. When considering processing, the number of system parameters to be considered increases significantly (e.g., processor clock, processor instruction set, memory access time, etc.).

• Data Dependent Processing. Unlike data transmission, processing is data dependent, and the processing time can vary drastically between packets (even for the same flow and packet size).

• Small Time Scales. Accurate simulations need to simulate every processor cycle, which makes them time-consuming and computationally intensive.

To address these issues, NPEST uses a full processor simulator that can achieve the necessary accuracy. Processing statistics can then be derived and used as accurate estimations for large-scale networking simulations.

3.2 NPEST Overview

The goal of NPEST is to emulate the functionality of a network processor. The system components of NPEST are shown in Figure 3. The NPEST framework implements the functions that are necessary to read and write packets, control the processor simulator, and manage memory. Through the NPEST API, the networking application receives packets and returns them to the framework. The detailed functions of each component are:

• NPEST Framework. The framework provides basic "layer 2" functions that prepare packets for processing, which include reading the packet from the input trace file, extracting packet headers, and writing back the processed packets to an output trace file. The framework also manages the memory allocated to packets and controls the processor simulator.

• NPEST Applications. The application component is the software that is to be evaluated for processing cost. The application interfaces with the API to receive packets for processing and to return them to the framework.

• NPEST API. The API defines the interface between the framework and the application. The details of the API are discussed below.

A networking application that is developed for NPEST can be fully compiled and executed as a stand-alone program. The packet trace that is processed by the application is actually modified (including any payload modifications implemented by the application).

There is another important issue that arises in the context of simulating packet processing. To obtain a stand-alone executable that can be simulated, it is typically necessary to provide some means of reading and writing the packets that need to be processed. On a router, this functionality is provided by the system hardware and therefore is not relevant to us. In a simulator, however, it requires considerable amounts of processing, which can be an order of magnitude larger than the comparatively simple processing that is performed on the packet. If this issue is ignored, the simulation results reflect the characteristics of reading and writing packets on a particular operating system rather than the packet processing. In NPEST, we therefore make a clear distinction between the application and all other functions. This separation allows us to adjust the simulator to generate statistics for the application processing only and to ignore the supporting framework functions.

3.3 Programming Environment

The NPEST API defines how packets are provided to an application and how an application can return them to the framework after processing is complete. Within the application, arbitrary C code can be written; NPEST places no restrictions on the application other than that it needs to adhere to the API. The three main functions that are defined in the API are:

• void *init(). This function is implemented by the application and called by the framework before any packets are processed. It allows the application to initialize any data structures that are required for packet processing (e.g., a routing table). The processing that occurs as part of init() is not counted towards processing cost.

• void (*process_packet_function)(packet *). This function is the packet handler that is implemented by the application. The variable process_packet_function contains the pointer to the actual packet handler. It is called once for each packet that is processed by the framework. A pointer to the packet is passed as an argument. The packet memory is managed by the framework.

• void write_packet_to_file(packet *, int). This function is implemented by the framework and called by the application when processing is complete. It writes the packet to the trace file (specified by the second parameter).

struct meta {
    int ll_length;              /* packet length at link layer */
    unsigned short ll_type;     /* packet type specified by link layer */
    struct packet *next;        /* next packet */
};

struct packet {
    unsigned char data[PACKET_SIZE - sizeof(struct meta)];  /* packet data */
    struct meta meta_buffer;    /* stores control information for the packet */
};

int process_packet(packet *p) {
    ...
    /* custom application processing code */

    write_packet_to_file(p, TRACE_OUTPUT);  /* output of processed packet */
}

Figure 4: NPEST Application Code.

The packet processing function has access to the content of the packet from the layer 3 header onwards. An example of an NPEST application skeleton is shown in Figure 4. This figure also shows the packet data structure. The "meta" data structure contains information that might be useful to the packet processing function but is not available in the packet itself (e.g., the LLC/SNAP value). The NPEST API is similar to the processing abstractions on a network processor. On a network processor, the framework functionality is implemented in dedicated hardware (e.g., the queue manager) and the application functionality is implemented by the general-purpose processing cores.
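To make the API concrete, here is a minimal sketch of what an NPEST application might look like: it decrements the IPv4 TTL of each packet and incrementally updates the header checksum before returning the packet to the framework. The init()/process_packet_function/write_packet_to_file interface and the packet structure follow the description above; the header file name, the field offsets, and the checksum shortcut are illustrative assumptions rather than part of the published code.

#include "npest.h"   /* hypothetical framework header providing packet, TRACE_OUTPUT, and the API */

#define TTL_OFFSET       8    /* TTL is the 9th byte of the IPv4 header */
#define CHECKSUM_OFFSET 10    /* IPv4 header checksum starts at byte 10 */

void *init(void) {
    /* nothing to set up here; a routing table would normally be built in init(),
     * and that work is not counted towards processing cost */
    return NULL;
}

static void decrement_ttl(packet *p) {
    unsigned char *ip = p->data;    /* packet content starts at the layer 3 header */
    unsigned long sum;

    if (ip[TTL_OFFSET] > 0) {
        ip[TTL_OFFSET]--;
        /* incremental checksum update: the TTL sits in the high byte of a 16-bit
         * word, so the stored checksum increases by 0x0100 with end-around carry */
        sum = ((ip[CHECKSUM_OFFSET] << 8) | ip[CHECKSUM_OFFSET + 1]) + 0x0100;
        sum = (sum & 0xFFFF) + (sum >> 16);
        ip[CHECKSUM_OFFSET]     = (sum >> 8) & 0xFF;
        ip[CHECKSUM_OFFSET + 1] = sum & 0xFF;
    }

    write_packet_to_file(p, TRACE_OUTPUT);    /* return the packet to the framework */
}

/* the framework invokes the handler through this pointer for every packet */
void (*process_packet_function)(packet *) = decrement_ttl;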

3.4 NPEST Simulation

For our prototype implementation, we use SimpleScalar [7] as the processor simulator. The version of SimpleScalar in our prototype supports the ARM [4] instruction set architecture and can be obtained from [31]. We chose this simulator because the ARM architecture is very similar to the architecture of the core processor and the microengines found in the Intel IXP1200 network processor [18]. The tools are set up to work on an Intel x86 workstation running RedHat Linux 7.3. We use packet traces that have been collected with tcpdump [33] using the pcap interface. The NPEST application and framework are compiled with a cross-compilation tool chain based on the GNU C compiler with an ARM back-end. The executable is then functionally verified by comparing the generated output trace with the expected output.

The processor simulator is adapted to separate the processing cost that is generated by the application (which is the cost that is relevant and that we are interested in) from the cost that the framework generates (which is implemented in hardware on a real router or network processor and thus is not a processing cost per se). We achieve this separation by augmenting the SimpleScalar processor simulator (sim-profile) to evaluate the address of each simulated instruction and check whether it is in the address range of the application or the framework. We can obtain these address ranges easily from the compiled object file with nm or objdump. These addresses are then passed to sim-profile as input parameters. The SimpleScalar simulator was further modified to provide more detailed statistics about program execution, such as cumulative instruction counts for the packet processing function and the number of memory accesses made to a particular variable in the packet handler. The output of the simulator is post-processed using Perl scripts to obtain the required results.
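The exact changes to sim-profile are not listed in the paper; the fragment below is a hypothetical sketch of this kind of address-range accounting, with assumed type and function names:

/* An instruction counts toward the application only if its address falls inside
 * the application's text range (bounds obtained beforehand with nm or objdump). */
typedef unsigned long addr_t;

struct addr_range {
    addr_t lo;    /* first address of the application code */
    addr_t hi;    /* first address past the application code */
};

static int in_application(addr_t pc, const struct addr_range *app) {
    return pc >= app->lo && pc < app->hi;
}

/* called once for every simulated instruction */
static void account_instruction(addr_t pc, const struct addr_range *app,
                                unsigned long *app_instr, unsigned long *framework_instr) {
    if (in_application(pc, app))
        (*app_instr)++;           /* counted as application processing cost */
    else
        (*framework_instr)++;     /* framework/support code, reported separately */
}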

3.5 NPEST Metrics

There are several metrics that we can derive from our NPEST prototype. While in principle we only care about the overall processing time, obtaining more detailed metrics is helpful for deriving processing statistics and adapting the results to different system configurations. Currently, the following metrics can be measured:

• Instruction Count. The number of instructions that are executed on a processor is a way to measure processing cost independently of the actual system configuration of a processor. This value can also be used to obtain a rough processing time estimation.

• Memory Accesses. The number of memory references performed is an important metric because memory accesses can cause processor stalls, thread switching, and performance bottlenecks.

• Instruction Mix. The instructions executed on a processor can be divided into different categories (e.g., load/store instructions, arithmetic/logic operations, branch and other control flow instructions, etc.). The instruction mix gives the percentage of instructions that belong to the different categories. This might be a less relevant metric from a networking point of view, but it gives good insight into the application behavior from a computer architecture point of view. Highly regular, computationally intense processing typically shows large numbers of integer computations as well as loads and stores. More irregular applications show larger numbers of conditional branches.

As can be seen from this list, there is no metric that gives the actual time for the processing or the delay that the packet experiences. We could simulate the number of cycles that it takes to process a packet, but the result would be highly dependent on the system architecture and configuration we choose. A better method is to count the number of instructions executed (the first metric) and then use the following formula to get a rough processing time estimate, t_p, for packet p:

    t_p = instr_p / MIPS_thread = instr_p / (MIPS_NP / threads).    (1)

This time depends on the number of instructions executed for the packet, instr_p, and the processing power that is available to a packet. On a typical network processor, one thread is dedicated to one packet (which we assume to have a processing power of MIPS_thread). Overall, network processors can have significantly more total processing power (on the order of GIPS). This is achieved by employing numerous parallel processor cores and multiple hardware threads. When determining the delay for a single packet, however, the performance of a single thread needs to be examined, as packet processing is typically performed by one thread and not split over multiple parallel threads or processors. As an approximation for MIPS_thread, the overall processing power, MIPS_NP, can be divided by the total number of hardware threads. In previous work, we have developed a more detailed network processor performance model [12] that can be used if more precise results are necessary. With Equation 1, we can estimate the processing delay for different network processor systems. Therefore it is sufficient if we focus on determining the number of instructions executed. Other mechanisms for adapting processing performance measurements to different system configurations are described in [13].
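As a small sketch of how Equation 1 can be applied, the C fragment below converts instruction counts into rough per-packet delays. The instruction counts are taken from the results in Section 4; the aggregate processing power of 2,400 MIPS and 24 hardware threads are illustrative assumptions chosen so that MIPS_thread is about 100, matching the assumption used with Table 1, and are not the parameters of any specific device.

#include <stdio.h>

/* Rough per-packet processing delay following Equation 1. */
static double processing_delay_seconds(double instr_per_packet,
                                       double mips_np, int threads) {
    double mips_thread = mips_np / threads;     /* processing power of one thread */
    return instr_per_packet / (mips_thread * 1.0e6);
}

int main(void) {
    /* illustrative NP: 2,400 MIPS aggregate, 24 hardware threads -> 100 MIPS/thread */
    double mips_np = 2400.0;
    int threads = 24;

    /* average instruction counts from Tables 2 and 4 */
    printf("IP forwarding: %.1f us/packet\n",
           processing_delay_seconds(437.0, mips_np, threads) * 1.0e6);
    printf("IPsec:         %.1f us/packet\n",
           processing_delay_seconds(40977.0, mips_np, threads) * 1.0e6);
    return 0;
}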

4. EVALUATION

We present results from two application implementations on NPEST: IP forwarding and IPsec. The motivation for this choice is that we can compare the IP forwarding results with those obtained from the IXP1200 network processor, which lets us validate our model. The IPsec implementation, on the other hand, performs computationally intense payload modification and is an example of an application that incurs high processing delays.

4.1 NPEST IP Forwarding Results

The IP forwarding application that we implemented on our NPEST prototype performs IP forwarding according to RFC 1812 [6]. As the lookup algorithm, we use a trie-based lookup that resolves 4 bits in each step. We chose this algorithm because it is the same one used by the reference implementation of IP forwarding on the IXP1200 network processor, to which we compare these results below. In the test setup, a packet trace of 10000 packets and a small routing table were used. We chose all prefixes in the routing table to be of the same size to avoid unbalanced results due to using a LAN trace.

The results that are derived from NPEST are shown in Table 2. The table is split into two parts: "Overall Statistics" are the results from simulating the executable, including all the instructions and memory accesses contributed by the framework; "Application Statistics" show the processing cost that is contributed by the application only. This second set of results is the most relevant because it corresponds to the processing that is actually performed on the network processor cores.

Table 2: Statistics for IPv4 Forwarding with Trie Lookup (10000 packets).

  Overall Statistics
    Total instructions                  39391720
    Total memory accesses               16529171
  Application Statistics
    Total instructions                   4370890
    Total memory accesses                1125209
    Avg. instr/packet                        437
    Avg. packet memory access/packet          17
    Avg. lookup memory access/packet           6

From the table it can be seen that the framework actually executes an order of magnitude more instructions than the application. This is due to various operating system interactions for reading and writing the trace file. The main point is that in a naïve implementation and simulation setup the framework could not be distinguished from the application, and it would be hard to obtain meaningful results from the aggregated statistics. In NPEST, the clear separation of application and framework through the NPEST API allows us to distinguish the two parts and extract the relevant results for the application.

The overall, averaged results for IP forwarding show that a packet requires about 437 instructions of processing, a total of 17 memory accesses to the packet data, and 6 memory accesses to the forwarding table. Since the NPEST framework provides the memory management for packets, it is possible to separate memory accesses to packet data from accesses to other data structures. Most of the processing instructions are spent on the routing table lookup and the IP checksum computation. The memory accesses to the packet are due to the need to read the header at least once and to write the new TTL and checksum. The average of 6 memory accesses to the trie lookup structure corresponds to 24-bit prefix lookups (each trie node corresponds to 4 bits). Overall, these results are what can be expected from IP forwarding. Without the ability to separate the application from the NPEST framework, the results would have been an order of magnitude larger and could not easily have been interpreted.

Another statistic that can be gathered is the instruction mix for the application (shown in Table 3). For the case of IP forwarding, the instruction mix is dominated by integer operations due to the checksum and trie lookup computations. The lookup in the trie data structure also contributes a significant portion of the conditional branches. The loads and stores also include accesses to local memory, which is not shown in the memory accesses in Table 2.
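The trie lookup itself is not shown in the paper; the following sketch, under assumed node layout and field names, illustrates a 4-bit-stride longest-prefix-match lookup and why a 24-bit prefix costs about six node accesses:

/* Sketch of a longest-prefix-match lookup in a trie with 4-bit strides.
 * Node layout and field names are illustrative assumptions. */
struct trie_node {
    int next_hop;                 /* valid next hop for a prefix ending here, or -1 */
    struct trie_node *child[16];  /* one child per 4-bit nibble of the address */
};

int lookup(const struct trie_node *root, unsigned int dst_addr) {
    const struct trie_node *node = root;
    int best = -1;                /* longest match seen so far */
    int shift = 28;               /* start with the most significant nibble */

    while (node != NULL && shift >= 0) {
        if (node->next_hop >= 0)
            best = node->next_hop;
        node = node->child[(dst_addr >> shift) & 0xF];   /* one memory access per step */
        shift -= 4;
    }
    if (node != NULL && node->next_hop >= 0)
        best = node->next_hop;
    return best;   /* resolving a 24-bit prefix takes 6 of these 4-bit steps */
}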

4.2 NPEST IPsec Results

As an example of an application that performs payload modifications, we also implemented IP Security Protocol processing according to RFC 2401 [21]. In our setup, we had the application encapsulate and encrypt the packets that were processed. This corresponds to the processing performed on the ingress side of an IPsec tunnel (RFC 2406 [20]). The encryption algorithm for this setup is AES [26].

Table 3: NPEST Instruction Mix for IPv4 Forwarding.

  Load               16.17%
  Store               9.57%
  Uncond branch       3.65%
  Cond branch        10.25%
  Int computation    60.36%
  FP computation      0.00%

Table 4: IPsec Encryption Statistics (10000 packets).

  Overall Statistics
    Total instructions                 441444046
    Total memory accesses              263721776
  Application Statistics
    Total instructions                 409771412
    Total memory accesses              249590742
    Total accesses to packet memory      8497384
    Avg. instr/packet                      40977
    Avg. packet memory access/packet         850

The results for IPsec are shown in Tables 4 and 5. The first and most important observation for IPsec is that the average number of instructions that are needed to process a packet is on the order of 40000. This is about two orders of magnitude more than for IP forwarding. This observation is the main premise and motivation for our argument that processing cost needs to be considered in networking simulations and models. The maximum processing cost observed for large packets is on the order of 100000 instructions. Translating this number into a processing delay on a typical network processor, we obtain about 1 ms, as shown in Table 1 (assuming the NP executes about 100 MIPS for one thread).

Another observation is that the overhead for the framework is roughly the same as in the IP forwarding application. This is exactly what is expected, as we process the same number of packets. Finally, the instruction mix reflects the regularity of cryptographic processing with a large number of loads and stores as well as integer computation. Most of the memory accesses are to local memory, and only about 850 accesses are made to packet data (to read the payload and write back the encrypted data).

Table 5: IPsec Encryption Instruction Mix (10000 packets).

  Load               46.45%
  Store              14.46%
  Uncond branch       1.85%
  Cond branch         3.14%
  Int computation    34.09%
  FP computation      0.00%

Figure 5: NPEST Processing Cost as a Function of Packet Size. (Number of instructions per packet versus packet size in bytes, 0 to 1600 bytes, with one curve for IPsec and one for IP forwarding.)

4.3 Processing Cost Statistics

The above results show that we can obtain relevant processing cost results from the NPEST environment. Using the number of instructions executed, we can estimate the actual time it takes to perform the processing on various network processors. The memory access statistics give us an indication of whether memory could become a bottleneck for a given application, and the instruction mix indicates what kind of processor architecture is suitable. The question remains how to use all this information and integrate it into a network simulation without actually using the trace-based simulator. In particular, since the processing cost depends on the networking application, packet size, packet data, and various other parameters, these need to be taken into account.

To illustrate the approach we are proposing, Figure 5 shows the individual packet processing time for packets of different sizes for the two applications that we implemented in our prototype. For IP forwarding, the processing cost is basically constant and independent of the packet size because only the packet header is processed. For IPsec, the processing cost increases linearly with the packet size because the encryption processing needs to modify the entire packet payload. To quantify this relationship, we use two parameters, α_a and β_a, which need to be specified for each network application a:

• Per-Packet Processing Cost α_a. This parameter reflects the processing that needs to be done for each packet independent of its size (i.e., the y-axis offset in Figure 5). The processing cost of IP forwarding is an example.

• Per-Byte Processing Cost β_a. This parameter reflects the processing cost that depends on the packet size (i.e., the slope in Figure 5).

These parameters also depend somewhat on the network processor system, but this is not considered here. The total processing cost, c_{a,l}, for a packet of length l using application a can then be approximated by

    c_{a,l} = α_a + β_a · l.    (2)

Table 6: Application Statistics.

  Application a     α_a     β_a
  IP Forwarding     437     0.05
  IPsec               0     61.37


For the two applications considered here, we obtain the parameters shown in Table 6. Of course, this is only one way of approximating processing cost, and not all applications match the linear behavior that is observed for IP forwarding and IPsec. We have shown in previous work that such an approximation is still applicable to a range of applications [38]. It is likely that this model needs to be expanded for more complex networking applications. Examples are flows where processing is unevenly distributed among packets (e.g., an HTTP load balancer or web switch [2], where most processing is performed on transmission of the initial URL). We are in the process of investigating how much can be gained by more complex models. For now, we believe the two-parameter model is a good first step towards integrating processing cost into network simulations.

Both α_a and β_a can easily be derived with the NPEST tool and made broadly available for a set of common applications and network processor systems. How accurately these parameters need to be derived is an important question. Among different packet forwarding systems (ASIC-based routers, NP-based routers, software-based routers), these parameters can vary significantly. Even within one system (e.g., an NP-based router), different system configurations (processor clock rates, memory access times, etc.) can lead to different processing cost. Also, there is a fundamental question whether processing cost should be measured in a system-independent metric (e.g., number of instructions per packet), as we propose, or in a possibly more accurate but less portable metric (e.g., number of cycles per packet), as done in [14]. The main reason for using instructions per packet rather than actual processing time is that this metric describes the application complexity independently of the system. The metric can then be adapted to various heterogeneous packet forwarding systems. Galtier et al. have developed such a methodology in [13]. For most practical purposes (i.e., within an ns simulation), a reasonable approximation of these metrics can be expected to be sufficient.
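As a sketch of how the two-parameter model could be plugged into a simulation, the fragment below evaluates Equation 2 with the values from Table 6; the 1500-byte example packet length is an arbitrary illustration:

#include <stdio.h>

/* Two-parameter processing cost model, Equation 2: c_{a,l} = alpha_a + beta_a * l */
struct app_cost {
    const char *name;
    double alpha;    /* per-packet cost in instructions (Table 6) */
    double beta;     /* per-byte cost in instructions (Table 6) */
};

static double cost_instructions(const struct app_cost *a, int packet_len) {
    return a->alpha + a->beta * packet_len;
}

int main(void) {
    struct app_cost fwd   = { "IP Forwarding", 437.0,  0.05 };
    struct app_cost ipsec = { "IPsec",           0.0, 61.37 };
    int len = 1500;    /* example packet length in bytes */

    printf("%s, %d bytes: %.0f instructions\n",
           fwd.name, len, cost_instructions(&fwd, len));
    printf("%s, %d bytes: %.0f instructions\n",
           ipsec.name, len, cost_instructions(&ipsec, len));
    return 0;
}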

4.4 Validation

In order to validate NPEST, we compare NPEST simulation statistics with those obtained from the Intel IXP1200, a network processor widely used in industry and academia. For this purpose we use the IXP1200 simulator, which is cycle-accurate, and compare the IP forwarding applications. The application is distributed with the Intel IXP Developer Kit and can be assumed to be a correct and efficient implementation. We ported the Microengine C code implementation of the forwarding routine to NPEST to be able to do a fair comparison using the same trie-based lookup algorithm.

The IXP1200 is a system-on-a-chip multiprocessor with six "microengines" that perform processing. For the IP forwarding application, four microengines are used for receiving and processing packets and two microengines are used for scheduling and transmitting packets. Each microengine supports four threads that each handle individual packets. A receive thread transfers a packet from the external device, stores the packet data in SDRAM, filters based on the Ethernet protocol, verifies the IP header, performs an IP lookup, and enqueues the packet for output processing.

Due to various system issues, we need to make a few simplifying assumptions in order to obtain results that can be compared to NPEST. The IXP evaluation was done as follows:

• We ignore instructions that poll for packets, since they do not contribute to useful processing and are only used when processors become idle.

• A microengine can transfer large amounts of data from one memory to another with a single instruction. This can take significant amounts of time and would correspond to multiple instructions and memory accesses in NPEST. Therefore an instruction of the form sram[read, $xfer, addr1, addr2, count], which moves count words, is counted as count NPEST instructions.

• The number of routes supported by the IXP IP forwarding application is limited to the order of 100. We used 64 routes in the experimental setup, the same number as used by the NPEST application.

• As traffic input we were not able to use the same trace as used by NPEST. Instead, we used synthetically generated traffic with varying packet lengths and random source and destination addresses.

• We used the following system configuration on the Intel Developer Workbench [19]: microengine speed: 200 MHz; IX Bus speed: 80 MHz; PCI speed: 66 MHz. Simulation configuration: unbounded mode; minimum-size packets: 64 bytes.

Table 7 shows the number of instructions executed by the various phases of the IPv4 forwarding application on both NPEST and the IXP1200 simulator. The instruction counts are somewhat different due to the fact that we compare different instruction set architectures. However, the percentage of instructions executed by the tasks that process the packet (IP header processing) is very similar for both tools (182/437 = 41.6% for NPEST and 34/78 = 43.6% for the IXP simulator). Similarly, the ratios of instructions for the tasks that access memory (Copy IP header, Lookup, and Queue packet) are 58.4% and 56.4% for NPEST and the IXP simulator, respectively. In terms of memory accesses, the results are more similar. For NPEST, the average number of memory accesses is 17 to packet memory and 6 to lookup memory.

Table 7: Comparison between NPEST and IXP1200 for IP Forwarding (10000 packets).

  Task                    NPEST    IXP1200
  Copy IP header            118          6
  IP header processing      182         34
  Lookup                     80         26
  Queue packet               57         12
  Total                     437         78

For the IXP implementation, the number of accesses is 17 to packet memory and 4 to lookup memory. The differences between NPEST and the IXP can partially be explained by the differences in key system components (packet memory management, hardware support for certain functions, etc.). Our next step to further validate, and if necessary adapt, NPEST is to implement a payload processing application on the IXP. This will let us compare a computationally intense application between the IXP and NPEST. We expect that the differences in the results will be less prominent in this case, because more time is spent on the actual processing.

5. CURRENT AND FUTURE WORK

The development of NPEST is work in progress, and we are currently exploring several issues that need more attention:

• Processing Statistics. The current model of linear approximation of processing cost is limited to applications that display this particular behavior. We are exploring other statistics that capture the processing cost distribution in a simple and intuitive manner.

• Validation. We are planning to compare NPEST to the IXP simulator for a complex payload-modifying application to see how well the results match in this case. We are also considering looking into other network processor architectures.

• Integration with Network Simulators. Once we have application statistics for a set of different network applications, we are planning to integrate the processing cost model into a network simulator, such as ns, to see how results change when processing cost is considered.

6. SUMMARY

We have presented a tool for estimating processing cost on a network node. It is becoming increasingly important to consider processing delay as more and more functionality is being implemented on network routers. Using NPEST, a networking researcher can easily implement a networking application and simulate the processing delay that it incurs. This is a much simpler method than implementing network-processor-specific code, which requires considerable expertise. From the trace-based simulations of NPEST we can extract processing cost results and make them available as processing cost statistics to be used in other network simulators. We believe this is an important first step towards improving the accuracy of network simulations and models in terms of processing delay.

7. REFERENCES

[1] J. Allen, B. Bass, C. Basso, R. Boivie, J. Calvignac, G. Davis, L. Frelechoux, M. Heddes, A. Herkersdorf, A. Kind, J. Logan, M. Peyravian, M. Rinaldi, R. Sabhikhi, M. Siegel, and M. Waldvogel. IBM PowerNP network processor: Hardware, software, and applications. IBM Journal of Research and Development, 47(2/3):177–194, 2003.
[2] G. Apostolopoulos, D. Aubespin, V. Peris, P. Pradhan, and D. Saha. Design, implementation and performance of a content-based switch. In Proc. of IEEE INFOCOM 2000, Tel Aviv, Israel, Mar. 2000.
[3] ARM Ltd. ARM Development Suite 1.2, 2003.
[4] ARM Ltd. ARM7 Datasheet, 2003.
[5] D. F. Bacon, A. Dupuy, J. Schwartz, and Y. Yemini. NEST: A network simulation and prototyping tool. In USENIX Conference Proceedings, pages 71–77, Dallas, TX, 1988.
[6] F. Baker. Requirements for IP version 4 routers. RFC 1812, Network Working Group, June 1995.
[7] D. Burger and T. Austin. The SimpleScalar tool set version 2.0. Computer Architecture News, 25(3):13–25, June 1997.
[8] A. T. Campbell, H. G. De Meer, M. E. Kounavis, K. Miki, J. B. Vincente, and D. Villela. A survey of programmable networks. Computer Communication Review, 29(2):7–23, Apr. 1999.
[9] R. F. Cmelik and D. Keppel. Shade: A fast instruction-set simulator for execution profiling. In Proc. of ACM SIGMETRICS, pages 128–137, Nashville, TN, May 1994.
[10] K. B. Egevang and P. Francis. The IP network address translator (NAT). RFC 1631, Network Working Group, May 1994.
[11] EZchip Technologies Ltd., Yokneam, Israel. NP-1 10-Gigabit 7-Layer Network Processor, 2002. http://www.ezchip.com/html/pr np-1.html.
[12] M. A. Franklin and T. Wolf. A network processor performance and design model with benchmark parameterization. In P. Crowley, M. A. Franklin, H. Hadimioglu, and P. Z. Onufryk, editors, Network Processor Design: Issues and Practices I, chapter 6, pages 117–138. Morgan Kaufmann Publishers, Oct. 2002.
[13] V. Galtier, K. L. Mills, Y. Carlinet, S. Leigh, and A. Rukhin. Expressing meaningful processing requirements among heterogeneous nodes in an active network. In Proc. of the Second International Workshop on Software and Performance, Ottawa, Canada, Sept. 2000.
[14] M. Gries, C. Kulkarni, C. Sauer, and K. Keutzer. Comparing analytical modeling with simulation for network processors: A case study. In Proceedings of Design, Automation, and Test in Europe (DATE), Munich, Germany, Mar. 2003.

[15] IBM Corp. IBM Power Network Processors, 2000. http://www.chips.ibm.com/products/wired/communications/network processors.html.
[16] Intel Corp. Intel IXP1200 Network Processor, 2000. http://developer.intel.com/design/network/ixp1200.htm.
[17] Intel Corporation. Intel IXA Software Developers Kit 2.01, 2003.
[18] Intel Corporation. IXP1200 Network Processor Datasheet, 2003.
[19] Intel Corporation. Intel IXP1200 Network Processor Family: Development Tools User's Guide, Mar. 2002.
[20] S. Kent and R. Atkinson. IP encapsulating security payload (ESP). RFC 2406, Network Working Group, Nov. 1998.
[21] S. Kent and R. Atkinson. Security architecture for the Internet protocol. RFC 2401, Network Working Group, Nov. 1998.
[22] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, Aug. 2000.
[23] LBNL, Xerox PARC, UCB, and USC/ISI. The Network Simulator - ns-2. http://www.isi.edu/nsnam/ns/.
[24] MIPS Technologies Inc. MIPS Free GNU Toolkit, 2003.
[25] J. C. Mogul. Simple and flexible datagram access controls for UNIX-based gateways. In USENIX Conference Proceedings, pages 203–221, Baltimore, MD, June 1989.
[26] NIST. Advanced encryption standard (AES). Technical Report FIPS-197, National Institute of Standards and Technology, Nov. 2001.
[27] OPNET Technologies. OPNET Modeler, 2003. http://www.opnet.com.
[28] V. S. Pai, M. Aron, G. Banga, M. Svendsen, P. Druschel, W. Zwaenepoel, and E. Nahum. Locality-aware request distribution in cluster-based network servers. In Proc. of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 205–216, San Jose, CA, Oct. 1998.
[29] K. Psounis. Active networks: Applications, security, safety, and architectures. IEEE Communications Surveys, 2(1), Q1 1999.
[30] N. Shah, W. Plishker, and K. Keutzer. NP-Click: A programming model for the Intel IXP1200. In 2nd Workshop on Network Processors (NP-2) in conjunction with the 9th Intl. Symposium on High Performance Computing Architectures (HPCA-9), pages 100–111, Feb. 2003.
[31] SimpleScalar LLC. http://www.simplescalar.com, 2003.
[32] A. S. Snoeren, C. Partridge, L. A. Sanchez, C. E. Jones, F. Tchakountio, S. T. Kent, and W. T. Strayer. Hash-based IP traceback. In Proc. of ACM SIGCOMM 2001, pages 3–14, San Diego, CA, Aug. 2001.
[33] TCPDUMP Repository. http://www.tcpdump.org, 2003.
[34] Teja Technologies. TejaNP Datasheet, 2003.
[35] D. L. Tennenhouse, J. M. Smith, W. D. Sincoskie, D. J. Wetherall, and G. J. Minden. A survey of active network research. IEEE Communications Magazine, 35(1):80–86, Jan. 1997.
[36] L. Wang, V. Pai, and L. Peterson. The effectiveness of request redirection on CDN robustness. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, Boston, MA, Dec. 2002.
[37] T. Wolf and M. A. Franklin. CommBench - a telecommunications benchmark for network processors. In Proc. of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 154–162, Austin, TX, Apr. 2000.
[38] T. Wolf, P. Pappu, and M. A. Franklin. Predictive scheduling of network processors. Computer Networks, 41(5):601–621, Apr. 2003.
