NP-Click: A Productive Software Development Approach for Network Processors

Writing high-performance code for modern network processors is difficult because of their complexity. NP-Click is a simple programming model that permits programmers to reap the benefits of a domain-specific language while still allowing for target-specific optimizations. Results for the Intel IXP1200 indicate that NP-Click delivers a large productivity gain at a slight performance expense.
Niraj Shah, William Plishker, Kaushik Ravindran, and Kurt Keutzer
University of California, Berkeley
Application-specific integrated circuit design is too risky and prohibitively expensive for many applications. This trend, combined with increasing silicon capability on a die, is fueling the emergence of application-specific programmable architectures. Researchers and developers have demonstrated examples of such architectures in networking, multimedia, and graphics. To meet the high performance demands in these application domains, many of these devices use complex architectural constructs. For example, network processors have many complicated architectural features: multiple processing elements each with multiple hardware-supported threads, distributed memories, special-purpose hardware, and a variety of on-chip communication mechanisms.1 This focus on architecture design for network processors has made programming them
an arduous task. Current network processors require in-depth knowledge of the architecture just to begin programming the device. However, for network processors to succeed, programmers must efficiently implement high-performance applications on them. Ideally, we would like to program network processors with a popular, domain-specific language for networking, such as Click.2 Although Click is natural for the application designer to use, it is challenging to implement on a network processor. To address this problem, we create an abstraction of the underlying architecture that exposes enough detail to write efficient code, yet hides less-essential complexity. We call this abstraction a programming model.
Existing software development approaches
Figure 1. Intel IXP1200 microarchitecture (StrongARM core with instruction and data caches, six microengines, SDRAM controller, scratchpad SRAM, hash engine, control status registers, transmit and receive FIFOs, IX bus unit, and PCI interface).
More than 30 distinct architectures for network processing have been introduced in the past five years.3 Although their architectures vary greatly, we can describe them using the following design space: multiple hardware-multithreaded processing elements, a variety of memories under explicit user control, special-purpose coprocessors, various types of on-chip communication mechanisms, and varying levels of peripheral integration. Although designers add these features to achieve high performance, they complicate programming the device. For practical implementations, the programmer must manually determine the optimal functional decomposition of the application across threads and processing elements, define thread interaction, and use multiple memories effectively. For example, a representative network processor, the Intel IXP1200, has six identical RISC processors called microengines, plus a StrongARM processor, as shown in Figure 1. The StrongARM mostly handles control and management plane operations. The microengines are RISC processors geared for data plane processing. Each microengine supports hardware multithreading for four threads that share a fixed-size program memory. Hence, there can be up to 24 threads concurrently processing packets. It is the programmer's responsibility to divide the application into balanced tasks and arrange
these tasks across threads and microengines. The memory architecture has several regions: large off-chip SDRAM, faster external SRAM, internal scratchpad, and local register files for each microengine. Each of these areas is under the user's direct control, and there is no hardware support for caching data from slower memory into smaller, faster memory (except for the small cache accessible only to the StrongARM). To ensure a high-performing implementation, the programmer must allocate data to appropriate memories, accounting for each memory's throughput, latency, visibility, available space, and bus contention. The Intel proprietary IX bus is the main interface for receiving and transmitting packet data, exchanging it with external devices such as Ethernet media access controllers and other IXP1200s. The microengines can directly interact with the IX bus, so any microengine thread can receive or transmit data on any port without StrongARM intervention. The IX bus interacts with the microengines via circular buffers called transmit and receive FIFOs.

A software development environment is critical to manage the complexity of network processor architectures. It provides the primary interface for users of the network processor. However, despite all the architectural research, relatively little effort has been devoted to programming environments.
The current state of the art is to use C-based languages. Although C is a popular programming language, it doesn't offer much support for expressing multiprocessor, multithreaded programs. As a result, these languages are unnatural to use and force the application programmer to learn the architecture's intricate details. For example, for the IXP1200, Intel supports using a subset of C (which we refer to as IXP-C) to program the microengines.4 IXP-C supports loops, conditionals, functions, intrinsics (function calls using C syntax that direct instruction selection), basic data types, and abstract data types such as structs and bit fields. However, the multithreading model is explicit: The programmer must not only partition his application among threads and microengines, but also manually define all interthread and interprocessor communication. For practical implementations, the programmer must also allocate data to specific memory regions (for example, SRAM, SDRAM, Scratchpad SRAM, or data registers) at declaration time. Teja Technologies also has a programming environment for Intel IXP network processors called Teja C.5 (That company also employs a language with C syntax and constructs.) Teja C allows users to decompose an application into threads and memory spaces. The user then maps this decomposition onto an architectural model that includes constructs for processing elements, memory banks, and buses. Teja C also gives programmers the ability to define state machines that control the interaction of multiple processing elements. However, this state-machine-based approach is not natural for designing data plane packet-processing applications.
Programming model

Ideally, programmers would like to program network processors with a networking-specific language, such as Click. However, there is currently a large gap between these languages and the complex programmable architectures used for implementation, such as Intel's IXP1200. In this section, we introduce and define the concept of a programming model to assist in bridging this gap.
Implementation gap

We believe Click to be a natural environment
for describing packet processing applications. In Click, applications are composed of elements, the base unit of computation. Elements correspond to common networking operations such as classification, routing-table lookup, and header verification. Elements communicate by passing packets using well-defined semantics. There are two types of communication: push and pull. The source element initiates push communication, which effectively models the arrival of packets into the system. The sink element initiates pull communication, which often models the availability of hardware resources (such as buffer space) for egress packet flow. This simple yet powerful concept, along with Click's large element library, provides a natural abstraction for programmers to quickly create functional descriptions of their applications. This situation stands in stark contrast to the main concepts required to program the IXP1200. As described earlier, when implementing an application on this device, the programmer must carefully determine how to partition his application across the six microengines, make effective use of the multiple memories, arbitrate access to shared resources, and communicate with peripherals. We call this mismatch of concerns between the application environment and target architecture the implementation gap. To facilitate bridging this gap, we propose an intermediate layer, called a programming model, which presents a powerful abstraction of the underlying architecture while still providing a natural way of describing applications.
What is a programming model?

A programming model presents an abstraction that exposes only the relevant details of the architecture necessary for a programmer to efficiently implement an application. It is a programmer's view of the architecture that balances opacity and visibility:

• Opacity abstracts the underlying architecture. Doing so obviates the need for the programmer to learn the architecture's intricate details just to begin programming the device.

• Visibility enables design space exploration. It allows the programmer to improve the efficiency of his implementation by trading off design parameters,
such as thread boundaries, data locality, and element implementations.

Our goal is that the full computational power of a target device be realizable through the programming model. In summary, a programming model supplies an approach to harvesting the power of the programmable platform in a productive way. Such a model inevitably balances a programmer's two competing needs: the desire for ease of programming and the requirement for efficient implementation. Further, we believe the programming model is a necessary, but not sufficient, condition for closing the implementation gap.
NP-Click

To close the implementation gap, we propose the NP-Click programming model for the Intel IXP1200.6 (The "NP" indicates that this model is for network processors.) Rather than promote a new style and syntax of programming, we based our model on Click. Like Click, NP-Click builds applications by composing elements that communicate using push and pull. In NP-Click, we implement the elements in IXP-C to leverage the existing IXP1200 compiler. To improve implementation efficiency, NP-Click complements the abstraction of Click with features that provide visibility into salient architectural details. Specifically, NP-Click enables programmers to

• control thread boundaries to effectively manage processor and thread utilization,

• map data to different memories (registers, Scratchpad, SRAM, and SDRAM), and

• separate the design concerns of arbitration of shared resources and functionality.

Our experience with programming hardware multithreaded architectures shows that arriving at the correct allocation of elements to threads is a key aspect of achieving high performance. Thus, we enable the programmer to easily explore different mappings of elements to threads. In a symmetric multiprocessor system, a path of push (pull) elements can execute in a single thread by calling the source (sink) element.7
We implement a similar mechanism in NP-Click; however, because of the fixed number of threads on the IXP1200, we also let the programmer map multiple paths to a single thread. To implement the latter, we synthesize a scheduler that fires each path within that thread. The amount of parallelism present in the target architecture places pressure on shared resources. For example, the IXP1200 has 24 threads that can each simultaneously query a control status register. This can lead to bus saturation, which in turn can cause lengthy delays in requests. Such situations lead to potential hazards that necessitate arbitration schemes for sharing resources. Recognizing the importance of sharing common resources, we separate arbitration schemes from computation and present them as interfaces to the elements. The two main resources that require arbitration on the IXP1200 are the control status registers and the transmit FIFO.
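To make this concrete, the following is a minimal sketch, in plain C, of how an element and an arbitration interface might fit together. It is not actual NP-Click or IXP-C source; all names (np_packet, out_port, tfifo_iface, to_device_elem) are hypothetical and only illustrate the separation of packet-processing logic, port connections, and shared-resource arbitration described above.

#include <stdint.h>

typedef struct np_packet np_packet;        /* opaque packet handle (hypothetical) */

/* An output port: a bound "push" connection to the next element. */
typedef struct {
    void (*push)(void *next_elem, np_packet *p);
    void *next_elem;
} out_port;

/* Arbitration of a shared resource (here, the transmit FIFO) is presented to
 * elements as an interface, so element code never touches the resource directly. */
typedef struct {
    int  (*reserve_slot)(void *arb_state);                    /* returns a slot index, or -1 */
    void (*commit_slot)(void *arb_state, int slot, np_packet *p);
    void *arb_state;
} tfifo_iface;

/* Element state: configuration plus bound ports and interfaces.  Nothing here
 * fixes where this data lives; the register/scratchpad/SRAM decision is
 * deferred to the mapping step. */
typedef struct {
    uint8_t      out_port_id;   /* configuration from the application description   */
    out_port     discard;       /* port for packets that cannot be sent right now    */
    tfifo_iface *tx;            /* shared-resource interface, bound at mapping time  */
} to_device_elem;

/* The element's push entry point: only packet-processing logic. */
void to_device_push(to_device_elem *e, np_packet *p)
{
    int slot = e->tx->reserve_slot(e->tx->arb_state);
    if (slot < 0) {
        /* No transmit slot granted by the arbiter: hand the packet downstream. */
        e->discard.push(e->discard.next_elem, p);
        return;
    }
    e->tx->commit_slot(e->tx->arb_state, slot, p);
}

Because the arbitration policy sits behind an interface such as tfifo_iface, trying a lower-overhead arbitration scheme means swapping the interface implementation rather than rewriting the element.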
Usage model

There are two main usage modes of NP-Click: implementing elements and implementing the application. If the elements required for an application are not in the library, the user must implement them. Although we implement NP-Click elements in IXP-C, they are much simpler to write. In most cases, implementing an element is easy because the inputs and outputs are defined, access to shared resources is via interfaces, and the assignment of data to memory is deferred. This focuses the programmer's effort and attention on writing the code that implements an element's function. After writing all the elements required for the application, the programmer focuses on assembling the elements to describe the application functionality. He can then provide additional information on mapping to the architecture. For example, a programmer can manually define tasks and map them to threads on the architecture. He can also change the memory allocation of a data variable to optimize memory accesses. NP-Click separates the application description from implementation choices, such as functional partitioning across microengines or arbitration schemes for shared resources. For example, after initially implementing his application in NP-Click, a programmer can change the thread boundaries to try different
implementations without changing the application description.
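As an illustration of what the thread-mapping step produces, here is a rough sketch, again in plain C with hypothetical names, of the kind of per-thread scheduler NP-Click synthesizes when several paths are mapped onto one hardware thread; it is not actual generated code.

#include <stddef.h>

/* A path is a chain of push (or pull) elements entered at its source (or sink). */
typedef void (*path_fn)(void *path_state);

typedef struct {
    path_fn fire;    /* entry point of one path                    */
    void   *state;   /* the path's elements and their connections  */
} mapped_path;

/* One loop of this form is synthesized per hardware thread; the path table is
 * fixed by the element-to-thread mapping the programmer chose. */
void thread_main(mapped_path *paths, size_t npaths)
{
    for (;;) {
        for (size_t i = 0; i < npaths; i++) {
            paths[i].fire(paths[i].state);    /* fire each mapped path in turn */
        }
        /* On the IXP1200 the generated code would also yield the hardware thread
         * (for example, while waiting on a memory reference) so another thread on
         * the same microengine can make progress. */
    }
}

Changing thread boundaries then amounts to regenerating these path tables; the element code and the application description stay the same.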
Automatic allocation

Even with a good programming model, network processor programmers must allocate computation to processing elements (PEs), data to distributed memories, and intertask communication to on-chip interconnects. Programmers often perform these mappings manually using ad hoc techniques. It is a time-consuming and challenging problem because of a huge and irregular design space; resource constraints further exacerbate the problem. Automation could help to quickly explore this space and arrive at efficient implementations, so we have recently developed methods to automatically map application tasks to hardware threads on a multiprocessor system.8 The goal of this work is to further ease the programming burden, because the mapping of applications to architectures is increasingly difficult. We approach the mapping problem in three steps:

• construct a simplified model that captures only the salient application parameters and resource constraints;

• encode the constraint system as a 0-1 integer linear program (ILP) formulation; and

• solve the optimization problem using an efficient solver to determine an optimal configuration, focusing on the primary driver of IXP1200 performance: the mapping of tasks to microengines.

We view the multiprocessor as a symmetric shared-memory architecture in which all PEs have the same average access time to each memory region. In addition, we incorporate per-microengine instruction store limits as a resource constraint to model the small instruction memory present in most network processors. This enables the formulation to trade off between instruction store and execution cycles for individual tasks. Our application model categorizes tasks into classes. Tasks in the same class are functionally equivalent, but can have different implementations, which might differ in their execution cycles and number of instructions. Our model makes some simplifying
assumptions about the application and architecture. We assume that the application consists of independently executing tasks connected by queues. For simplicity, our model assumes that tasks have the same periodicity, but our formulation is easily extendable to accommodate multiple execution rates. Our model measures a task's PE utilization by the number of execution cycles it consumes (execution time less long-latency events). Our goal is to allocate tasks onto PEs with the objective of minimizing the makespan (the maximum execution cycles over all tasks running on the system). We determine the number of execution cycles for each task by profiling; the number of cycles can vary with implementation. However, in our experience, this variation is less than 10 percent. We encode the resource-constrained decision problem as a 0-1 integer linear program; the variables in our constraint system are 0-1 variables indicating the assignment of each task to a particular implementation and PE. Constraints fall into the following categories:

• Exclusionary constraints specify that each task must execute in exactly one implementation and be assigned to exactly one PE.

• Instruction store constraints ensure that the total instruction count of tasks allocated to a PE is less than that PE's instruction store.

• Execution constraints guarantee that the total execution time of all tasks, in their selected implementations, on each PE is less than the current makespan.

• Symmetry-breaking constraints eliminate logically redundant configurations.

The search strategy is to perform a binary search on the makespan to find the optimum possible execution time and a corresponding implementation and PE assignment for each task. We note that the decision problem formulated earlier generalizes the basic bin-packing problem, and hence it is NP-complete. Though encoding the problem as an ILP gives us the flexibility to specify varied constraints, solving such problems in the general case is inefficient even for reasonably sized instances. However, we take advantage of recent advances in search algorithms and heuristics for solving 0-1 ILP formulations to efficiently compute solutions that are optimal with respect to our problem model.
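For concreteness, the core of this constraint system can be sketched as follows; the notation is ours, not the exact formulation used by the tool. Let x_{t,i,p} be a 0-1 variable indicating that task t runs in implementation i on PE p, let c_{t,i} and s_{t,i} be that implementation's profiled execution cycles and instruction count, let S_p be PE p's instruction store, and let M be the candidate makespan being tested.

\begin{align*}
\sum_{i}\sum_{p} x_{t,i,p} &= 1 && \text{for every task } t \text{ (exclusionary)}\\
\sum_{t}\sum_{i} s_{t,i}\, x_{t,i,p} &\le S_p && \text{for every PE } p \text{ (instruction store)}\\
\sum_{t}\sum_{i} c_{t,i}\, x_{t,i,p} &\le M && \text{for every PE } p \text{ (execution)}\\
x_{t,i,p} &\in \{0,1\}
\end{align*}

A binary search over M, with the 0-1 solver deciding feasibility at each candidate value, then yields the smallest makespan and a corresponding implementation and PE assignment; the symmetry-breaking constraints, omitted here, simply prune logically redundant assignments.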
We use Galena,9 a fast pseudo-Boolean SAT (satisfiability) solver, to solve the constraint system. For our two representative applications, the runtime of our approach is less than one second, with the resulting implementations performing within 5 percent of the aggregate data rate of implementations in which a designer manually allocated tasks to PEs. Automatic task allocation is one piece of the design process that will enable designers to explore different task implementations and identify optimal mappings. This in turn expedites the overall application design flow for network processors.

Figure 2. Performance comparison of 16-port IPv4 packet forwarding implementations (aggregate data rate, in Mbps, of the NP-Click and IXP-C implementations for input packet sizes from 64 to 1,518 bytes and the BMWG mix).
Performance results

To evaluate NP-Click as a programming model, we implemented two applications on the Intel IXP1200 using NP-Click (with manual task allocation) and IXP-C. For each application, we compare the performance and development process of these two approaches. The two applications are packet forwarding and a DiffServ interior node. The Internet Protocol
version 4 (IPv4) packet forwarding application is a performance-centric benchmark with relatively narrow functionality. The second application, a differentiated services (DiffServ) interior node, is a functionally rich application with lower performance requirements.
IPv4 packet forwarding

IPv4 packet forwarding10 is a common kernel of many network processor applications. We chose to implement the data plane of a 16-port fast Ethernet (100 Mbps) IPv4 router. To measure performance for the IPv4 packet forwarding application, we test each implementation with a variety of single-packet-size input streams (64, 128, 256, 512, 1,024, 1,280, and 1,518 bytes) and the IETF Benchmarking Methodology Workgroup (BMWG) mix.11 The BMWG packet mix provides a more realistic input data set because it contains an even, random distribution of seven packet sizes ranging from 64 to 1,518 bytes. For each input packet stream, we measure the maximum sustainable aggregate data rate. Figure 2 shows the results of our experiments for IPv4 packet forwarding. The aggregate bandwidth of the NP-Click
implementation ranges from 880 to 1,360 Mbps. The IXP-C implementation performs at 85 percent of line rate (1,360 Mbps aggregate) across all single-packet-size input streams; for the BMWG packet mix, its performance is slightly lower (1,200 Mbps aggregate) because of dynamic-load-balancing effects. We attribute the consistent data rate across all packet sizes to suboptimal arbitration of the transmit FIFO, a circular buffer, shared by all threads attempting to transmit data, that is used to send packet data to the media access controller. For the BMWG packet mix, a more realistic data set, the NP-Click version achieved 93 percent of the IXP-C implementation's rate (1,120 Mbps aggregate).

A key architectural difference between the NP-Click and IXP-C software implementations is responsible for the large performance difference in processing small packets. The packet abstraction that all NP-Click elements rely on comes at a cost: The receiver must have the entire packet before packet processing can begin. However, IXP-C has no such packet abstraction. By overlapping header processing with packet receive operations, the IXP-C software architecture processes packets more efficiently. This architectural difference is most noticeable for small packets. For large packets, the time to receive and transmit a packet dominates header processing time. Hence, the optimized NP-Click implementation's shortfall in performance occurs for streams of small packets.
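As a check on how the rate figures above relate, with 16 fast Ethernet ports providing the aggregate line rate:

16 ports x 100 Mbps = 1,600 Mbps aggregate line rate
1,360 Mbps / 1,600 Mbps = 85 percent of line rate (IXP-C, single-packet-size streams)
1,120 Mbps / 1,200 Mbps = approximately 93 percent (NP-Click relative to IXP-C on the BMWG mix)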
DiffServ interior node

DiffServ is a method of facilitating end-to-end quality of service over an existing IP network.12 DiffServ networks rely on traffic conditioning at the boundary nodes to simplify the job of the interior nodes. The boundary nodes of a DiffServ network aggregate ingress traffic into several categories, called behavior aggregates, using the differentiated services codepoint (DSCP).13 The boundary nodes aggregate packets into multiple classes of traffic, each with varying degrees of packet loss, latency, and jitter. For this comparison, we implement an interior DiffServ node that supports four fast Ethernet ports (100 Mbps). Our DiffServ benchmark is based on the data plane IPv4 packet forwarding functionality. Both the NP-Click
and IXP-C implementations received packets at line rate (100 Mbps per port). For all higher-priority traffic classes in the tested packet mix, NP-Click's egress bandwidth was within 10 percent of that for IXP-C.14 The performance shortfall is due to NP-Click's modularity. Because the compiler is unable to effectively optimize across functions, the NP-Click implementation has more concurrently live variables. Not all of the live variables fit into data registers; thus, the NP-Click implementation allocates some frequently accessed variables to SRAM. This increased SRAM traffic also impacts other portions of the implementation by increasing the overall average SRAM access time. The IXP-C compiler implements only basic compiler optimizations. We believe the application of more sophisticated optimizations would eliminate these spills and greatly increase NP-Click's performance.
Development process

This section compares and contrasts the development process of the applications using IXP-C and NP-Click. Specifically, we focus on the debugging and performance improvement process, the allocation of design time, and total design effort. Software developers often use lines of code as a proxy for design effort. However, when comparing vastly different programming methodologies, this metric can be very misleading. Instead, we measure person-hours. Using NP-Click, we began with a Click description of the application and created an initial, functionally correct implementation on the IXP1200 within a few days. We spent the majority of the design effort exploring the design space of implementations: pinpointing design bottlenecks, changing the mapping of elements to microengines, and simulating. Because of NP-Click's modularity, profiling different implementations was easy. Performance improvement came in three major categories: changing the mapping of elements to threads and microengines, writing better element implementations, and implementing lower-overhead arbitration schemes. We spent relatively little effort debugging, and when we spotted an error, it was easy to pinpoint which portion of the implementation was the cause. Common errors included incorrectly specifying element configurations and minor functional bugs within an element.
The design effort for the IPv4 application required 100 person-hours. For the DiffServ application, we were quickly able to arrive at an initial functional implementation. This initial implementation was a naïve mapping of elements to microengines, which we then optimized. The additional design effort for the NP-Click DiffServ implementation required 120 person-hours. Using IXP-C, we spent most of the development effort arriving at a functionally correct initial implementation. More than half of that effort involved fixing bugs that arose from thread interactions. Some of these interactions were not obvious from the design or the code, which made debugging even more difficult. We spent the remainder of the design effort attempting to improve the implementations. Given the relatively low-level abstraction of IXP-C, both profiling and optimizing the implementation proved difficult. As a result, we could only implement and test a few design alternatives. All the design alternatives we implemented were incremental changes to the initial implementation. Large design changes would have required even more effort, with no guarantee of performance improvement. Our total design effort for the IPv4 packet forwarder using IXP-C required 400 person-hours. For the DiffServ implementation, we started with a hand-coded data plane implementation of an IPv4 packet forwarder, then added the DiffServ functionality. The additional design effort for this implementation required 320 person-hours.
Applicability to additional network processors

This research implemented NP-Click on the Intel IXP1200, a popular network processor. However, we can generalize NP-Click concepts to a broader class of architectures. To support additional network processors, NP-Click must consider several common architectural features: multiple processing elements, hardware multithreading, multiple memories exposed to the programmer, coprocessors for common operations, and different methods of peripheral interaction. Most network processors employ multiple processing elements for packet processing. The two common configurations of processing elements are symmetric and pipelined.3
Symmetric multiprocessor systems can employ a mechanism similar to that used in the IXP1200 implementation. To target architectures composed of processors in a pipeline, the NP-Click representation would need to organize tasks into levels to map packet processing stages to processor stages.

Within a single processing element, network processors typically use hardware multithreading. Because NP-Click exports threads to the programmer, it maps naturally to most multithreaded architectures. For example, threads that share data can reside on the same processing element. This enables threads to use local memory to efficiently share data, dramatically reducing traffic on memory buses and other communication mechanisms.

Most network processors expose multiple memories to the programmer; these memories have different capacities, access times, and bandwidths. In many cases (such as with the IXP1200), effectively using them is paramount to an efficient implementation. There are two important memory considerations when implementing NP-Click on a new network processor: packet layout and data layout. All elements in NP-Click rely on a common packet layout to access packet headers and payload data, as well as packet metadata. In most network processor architectures, size restrictions require storing the packet payload in SDRAM. Within an element, NP-Click provides the programmer with keywords to scope the sharing of data. Some network processor architectures, like the Intel IXP2xxx, in which each PE has a small local memory, might present many options for allocating shared data. Potentially, this local memory could store all the data shared by threads on a single microengine.

Many network processors integrate coprocessors for common kernels of computation (such as lookup and cyclic redundancy check computation). Although these coprocessors can vary greatly in function, they are generally limited in application scope. As a result, we can usually encapsulate such coprocessors in a single NP-Click element. Exposing coprocessors through an element presents the programmer with an application-level abstraction for a potentially obtuse hardware block.
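A minimal sketch of this encapsulation, in plain C, might look like the following. The names are illustrative assumptions (they are not NP-Click's actual element or API names), and the coprocessor call appears only as a declaration because its body would be entirely device specific.

#include <stdint.h>

typedef struct np_packet np_packet;                 /* opaque packet handle */

/* The only coprocessor-specific piece: issue a lookup to the hardware unit
 * (for example, via the hash engine) and return the chosen egress port.
 * Declaration only; the implementation is device specific. */
uint32_t hw_route_lookup(uint32_t ipv4_dst);

/* Helper assumed to read the destination address from the common packet layout. */
uint32_t packet_ipv4_dst(const np_packet *p);

/* The element the rest of the application sees: an ordinary push element. */
typedef struct {
    void (*push_out)(void *next_elem, np_packet *p, uint32_t egress_port);
    void *next_elem;
} route_lookup_elem;

void route_lookup_push(route_lookup_elem *e, np_packet *p)
{
    uint32_t port = hw_route_lookup(packet_ipv4_dst(p));  /* coprocessor hidden here */
    e->push_out(e->next_elem, p, port);                   /* normal element behavior */
}

On an architecture without the coprocessor, only hw_route_lookup (or the element that contains it) would change, for instance to a software table lookup, while the rest of the application is untouched.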
Network processor architectures differ most in their methods of peripheral interaction. Some network processors have an integrated, on-chip media access controller; others have a programmable media access controller that requires separate programming; yet others have specialized hardware for interacting with media access control (MAC) devices. With NP-Click, we abstract the various MAC interaction methods using two elements: FromDevice and ToDevice. These elements are responsible for abstracting ingress and egress ports. Hence, they coordinate with the media access controller to receive and transmit packets. So although each new peripheral interaction scheme will require new implementations of the FromDevice and ToDevice elements, the remainder of an application remains the same.
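The following sketch shows where that boundary sits; the C function names are hypothetical stand-ins for whatever MAC interaction a particular network processor requires, and only the element names FromDevice and ToDevice come from the text above.

typedef struct np_packet np_packet;   /* opaque packet handle */

/* FromDevice: coordinate with the media access controller (integrated,
 * programmable, or external) to obtain the next packet from an ingress port.
 * Returns a null pointer when no packet is available. */
np_packet *from_device_receive(int port);

/* ToDevice: hand a packet to the media access controller for transmission
 * on an egress port. */
void to_device_transmit(int port, np_packet *p);

/* Everything behind these two entry points is specific to one network
 * processor's peripheral interaction scheme; everything above them, the
 * application description and the other elements, carries over unchanged. */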
Newer architectures, such as the Intel IXP2800,15 which has 16 microengines and a more complex memory architecture, are increasing the programming difficulty. Implementing applications on these newer architectures will be even more difficult because there are more mapping decisions to consider. To cope with these complexities, we are extending our ILP framework to allocate data to distributed memories and map intertask communication to on-chip interconnect mechanisms. Additionally, the ILP formulation grows polynomially with the number of PEs, memory elements, and communication channels on chip. However, the problem space grows exponentially. We will continue to employ our current methodology until the runtime to reach an optimal solution becomes impractical. We will then use heuristics, hierarchy, and designer input to intelligently navigate the design space.

References

1. N. Shah and K. Keutzer, "Network Processors: Origin of Species," Proc. 17th Int'l Symp. Computer and Information Sciences (ISCIS 02), CRC Press, 2002, pp. 41-45.

2. E. Kohler et al., "The Click Modular Router," ACM Trans. Computer Systems, vol. 18, no. 3, Aug. 2000, pp. 263-297.

3. N. Shah, Understanding Network Processors, master's thesis, Electrical Eng. and Computer Sciences Dept., Univ. of California, Berkeley, 2001.

4. Intel Microengine C Compiler Support: Reference Manual, Intel Corp., 2002.
5. K. Crozier, "A C-Based Programming Language for Multiprocessor Network SoC Architectures," Network Processor Design: Issues and Practices, vol. 2, Academic Press, Nov. 2003, pp. 427-443.

6. N. Shah, W. Plishker, and K. Keutzer, "Programming Models for Network Processors," Network Processor Design: Issues and Practices, vol. 2, Dec. 2003, pp. 181-202.

7. B. Chen and R. Morris, "Flexible Control of Parallelism in a Multiprocessor PC Router," Proc. Usenix Ann. Tech. Conf. (USENIX 01), ACM Press, 2001, pp. 333-346.

8. W. Plishker et al., "Automated Task Allocation on Single Chip, Hardware Multithreaded, Multiprocessor Systems," Proc. Workshop on Embedded Parallel Architectures (WEPA 04), 2004; http://www.gigascale.org/pubs/493/plishker_auto_task.pdf.

9. D. Chai and A. Kuehlmann, "A Fast Pseudo-Boolean Constraint Solver," Proc. 40th Design Automation Conf. (DAC 03), IEEE Press, 2003, pp. 830-835.

10. F. Baker, "Requirements for IP Version 4 Routers," RFC 1812, Network Working Group, June 1995; http://rfc.sunsite.dk/rfc/rfc1812.html.

11. S. Bradner and J. McQuaid, "A Benchmarking Methodology for Network Interconnect Devices," RFC 2544, Internet Engineering Task Force (IETF), Mar. 1999.

12. S. Blake et al., "An Architecture for Differentiated Services," RFC 2475, Internet Engineering Task Force (IETF), Dec. 1998.

13. K. Nichols et al., "Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers," RFC 2474, Internet Engineering Task Force (IETF), Dec. 1998.

14. N. Shah, W. Plishker, and K. Keutzer, "Comparing Network Processor Programming Environments: A Case Study," Workshop on Productivity and Performance in High-End Computing (P-PHEC 04), 2004, pp. 19-26; http://researchweb.watson.ibm.com/people/r/rajamony/pphec2004.

15. Intel IXP2800 Network Processor: Product Brief, Intel Corp., 2002; http://www.intel.com/design/network/prodbrf/279054.htm.
Niraj Shah is a PhD student in the EECS Department at the University of California, Berkeley. His research interests include methodologies and tools for designing high-
performance programmable embedded systems. Shah has a BS in computer engineering from the University of Arizona and an MS in electrical engineering from the University of California, Berkeley. He is a student member of the IEEE.

William Plishker is a PhD student in the Electrical Engineering and Computer Sciences (EECS) Department at the University of California, Berkeley. His research interests include design automation techniques for programmable embedded systems. Plishker has a BS in computer engineering from the Georgia Institute of Technology. He is a student member of the IEEE.

Kaushik Ravindran is a PhD student in the EECS Department at the University of California, Berkeley. His research activities include reconfigurable computing and embedded system design. Ravindran has a BS in computer engineering from the Georgia Institute of Technology. He is a student member of the IEEE.

Kurt Keutzer is a professor in the EECS Department at the University of California, Berkeley. His research interests include software support for the design of application-specific instruction processors. Keutzer has a BS in mathematics from Maharishi International University, Fairfield, Iowa, and an MS and a PhD in computer science from Indiana University. He is a fellow of the IEEE.

Direct questions and comments about this article to William Plishker, c/o Jennifer Stone, 211 Cory Hall, #1722, Berkeley, CA 94720;
[email protected]. For further information on this or any other computing topic, visit our Digital Library at http://www.computer.org/publications/dlib.