Appears in Proceedings of ISCIS XVII, The Seventeenth International Symposium on Computer and Information Sciences, 2002.

Network Processors: Origin of Species

Niraj Shah, Kurt Keutzer
University of California, Berkeley
{niraj,keutzer}@eecs.berkeley.edu

Abstract

Numerous programmable alternatives for network processing have emerged in the past few years to meet the current and future needs of network equipment. They all promise various trade-offs between performance and flexibility. In this paper we attempt to understand these new network processing alternatives. We present five major aspects of network processor architectures: approaches to parallel processing, elements of special-purpose hardware, structure of memory architectures, types of on-chip communication mechanisms, and use of peripherals. For each of these aspects, we include examples of specific network processor features.
1. Motivation

Many system designers are choosing to drop hardwired ASIC solutions in favor of application-specific instruction processors (ASIPs) for their systems. Nowhere is this trend more apparent than in communication network equipment. The past four years have witnessed over 30 attempts at programmable solutions aimed at packet processing for communication networks. Classifying these architectures will help us to:
• evaluate the right match between an application and the architectural features of a network processor
• develop a programming model that enables efficient programming of multiple network processors
More generally, understanding the evolution of network processors will help us understand the migration of programmable solutions to other application areas. In this paper, we dissect the space of network processor architectures from five major perspectives. We first give our definition of a network processor (NPU). Then we classify NPUs from those perspectives, drawing examples from the numerous NPUs currently on the market or in development. Lastly, we speculate on the future direction of these devices.

2. What is a Network Processor?

Top down, we define a network processor as any processor able to efficiently process packets for network communication. We purposely broaden the definition because we want to note that even general-purpose processors are used for packet forwarding. Bottom up, we note that there are over 30 different processors self-identified as network processors [1]. As we shall see, self-identifying as a network processor does not indicate any unique architectural features or approach. Thus, from the outset we emphasize that our approach to identifying network processors is descriptive, not prescriptive. Moreover, we claim that a descriptive approach is the only sensible approach to understanding network processors at this time, which makes the means of classification all the more important.

2.1. Means of Classification

To better understand the architectural space, we present five major strategies network processors have employed: approaches to parallel processing, elements of special-purpose hardware, structure of memory architectures, types of on-chip communication mechanisms, and use of peripherals.
2.2. Parallel Processing

To meet increasing line speed requirements, network processing systems have taken advantage of the parallelism present in various networking algorithms. NPU architectures exploit parallelism at three different levels: processing element level, instruction level, and word/bit level. It is important to note that while these approaches are orthogonal, a decision at one level clearly affects the others. Details on these three levels are given immediately below.

2.2.1. Processing Element Level

Before delving into different approaches to processing element level concurrency, we give our definition of a processing element: a processing element (PE) is an instruction set processor that decodes its own instruction stream. Most NPUs employ multiple PEs to take advantage of the data parallelism present in packet processing. Of those NPUs, there are two prevalent configurations:
• Pipelined: each processor is designed for a particular packet processing task
• Symmetric: each PE is able to perform similar functionality
In the pipelined approach, inter-PE communication is very similar to data-flow processing: once a PE finishes processing a packet, it sends it to the next downstream element. Examples of this architectural style include Cisco's PXF [2], EZchip's NP-1 [3], Vitesse's IQ2000 [4], and Xelerated Packet Devices [5]. In general, these architectures are easier to program, since the communication between programs on different PEs is restricted by the architecture. However, each program on every PE must meet timing requirements.

NPUs with symmetric PEs are normally programmed to perform similar functionality. They are often paired with numerous co-processors to accelerate specific types of computation, and arbitration units are often required to control access to the many shared resources. The Cognigine [6], Intel IXP1200 [7], IBM PowerNP [8], and Lexra NetVortex [9] are examples of this type of macroarchitecture. While these architectures are more flexible, they are harder to use, as programming them closely resembles the generic multi-processor programming problem. A conceptual sketch of the two organizations follows.
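To make the contrast concrete, here is a minimal C sketch of the two organizations; this is our own illustration under simplifying assumptions (the queue, stage, and function names are hypothetical, and the single-slot mailbox stands in for a real hardware FIFO), not any vendor's API. A pipeline hands each packet to a fixed downstream stage, while a symmetric pool lets any PE run the whole program.

```c
#include <stddef.h>

typedef struct { unsigned char data[1514]; size_t len; } packet_t;

/* Toy single-slot mailbox standing in for an on-chip hardware queue.
 * Not synchronized: real NPUs use hardware FIFOs with arbitration. */
typedef struct { packet_t *slot; } queue_t;

static void enqueue(queue_t *q, packet_t *p) {
    while (q->slot) ;                /* wait for the slot to drain */
    q->slot = p;
}

static packet_t *dequeue(queue_t *q) {
    packet_t *p;
    while (!(p = q->slot)) ;         /* spin until a packet arrives */
    q->slot = NULL;
    return p;
}

/* Pipelined organization: each PE runs one fixed stage and forwards
 * the packet downstream, data-flow style. */
void parse_stage(queue_t *in, queue_t *out) {
    for (;;) {
        packet_t *p = dequeue(in);
        /* ... parse headers ... */
        enqueue(out, p);             /* hand off to the next PE's stage */
    }
}

/* Symmetric organization: every PE runs the same full program, so any
 * PE may take any packet; shared tables and queues need arbitration. */
void symmetric_pe(queue_t *rx, queue_t *tx) {
    for (;;) {
        packet_t *p = dequeue(rx);
        /* ... parse, look up, modify: the entire fast path ... */
        enqueue(tx, p);
    }
}
```

The restriction visible in the pipelined version is exactly what simplifies programming: a stage only ever talks to its neighbors, whereas the symmetric version reintroduces the general shared-resource coordination problem.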
2.2.2. Instruction Level

Many network processors have chosen not to implement multiple-issue architectures, likely based on the observation that most networking applications, in contrast to signal processing applications, do not have enough instruction-level parallelism to warrant it. However, some architects have implemented architectures that issue multiple instructions per cycle per processing element. For multiple-issue architectures, there are two main tactics for determining the available parallelism: at compile time (e.g. VLIW) or at run time (e.g. superscalar). While superscalar architectures have had success exploiting parallelism in general-purpose architectures (e.g. Pentium), VLIW architectures have been used effectively in domains like signal processing, where compilers are able to extract enough parallelism. VLIW architectures are often preferred because they consume less power. The success of VLIW architectures in networking will largely depend on their target applications; control-plane code, for example, is largely control-dominated and therefore lends itself more to a superscalar implementation. The Agere Routing Switch Processor [10], Brecis' MSP5000 [11], and Cisco's PXF [2] use VLIW architectures, allowing them to exploit intra-thread instruction-level parallelism (ILP) at compile time by leveraging sophisticated compiler technology. Clearwater Networks takes another approach: a multiple-issue superscalar architecture in which a hardware engine finds the available ILP at run time [12]. Cognigine also has multiple-issue PEs (4-way), but with a run-time configurable instruction set that defines data types, operations, and predicates [6].
[Figure 1: scatter plot of issue width per PE (x-axis, 0 to 10) versus number of PEs (y-axis, up to 64), with iso-curves at 8, 16, and 64 instructions per cycle. Plotted devices include the EZchip NP-1, Cisco PXF, IBM PowerNP, Lexra NetVortex, Motorola C-5, Cognigine RCU/RSF, Xelerated, Vitesse IQ2x00, Intel IXP1200, Alchemy, Mindspeed CX27470, AMCC np7120, Agere PayloadPlus, BRECIS, Broadcom 12500, and Clearwater CNP810.]

Figure 1. Trade-offs between number of PEs and issue width.
2.2.3. Bit Level

Depending on the data types and operations present in an application, it is possible to exploit bit-level parallelism. For example, some NPUs have circuitry to efficiently compute the CRC field of a packet header; a software sketch of this computation follows the summary below.

2.2.4. Summary

Figure 1 plots the number of PEs versus the issue width per PE. It is easy to see that NPU architects have faced a large trade-off between processing-element-level and instruction-level concurrency. Clearwater Networks, at one extreme, has a single PE with 10 issue slots, while EZchip has 64 scalar PEs. On this chart, we have also plotted iso-curves of issuing 8, 16, and 64 instructions per cycle. While the clock speed and specialized hardware employed by network processors are not represented in this figure, it does illustrate the trade-offs NPUs have made between multiple levels of parallelism. Most NPUs have opted for many stripped-down processing elements instead of fewer multi-issue PEs. This severely complicates the programming model, as such NPUs have more independently executing units.
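To illustrate why the CRC computation mentioned in 2.2.3 benefits from dedicated circuitry, here is a minimal bit-serial CRC-32 routine in C. This is the textbook formulation using the standard IEEE 802.3 reflected polynomial 0xEDB88320, not any particular NPU's implementation: in software each input byte costs eight shift/XOR iterations, while hardware can process the same bits in parallel.

```c
#include <stdint.h>
#include <stddef.h>

/* Bit-serial CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320).
 * Software pays ~8 shift/XOR steps per byte; NPU hardware computes
 * the same function in dedicated parallel logic. */
uint32_t crc32(const uint8_t *buf, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];                            /* fold in next byte */
        for (int b = 0; b < 8; b++)               /* one bit per step  */
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1u));
    }
    return ~crc;
}
```

As a sanity check, crc32 over the ASCII string "123456789" should yield the well-known CRC-32 test value 0xCBF43926.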
2.3. Special-Purpose Hardware

Another strategy employed to meet increasing network processing demands is to implement common functions in hardware instead of accepting a slower implementation on a standard ALU. The major concern with special-purpose hardware for NPUs is the granularity of the implemented function: there is a trade-off between the applicability of the hardware and the speedup obtained. Special-purpose hardware can be broadly divided into two categories: co-processors and special functional units.

2.3.1. Co-Processors

A co-processor is a computational block that is triggered by a processing element (i.e. it does not have an instruction decode unit) and computes results asynchronously. In general, a co-processor is used for more complicated tasks, may store state, and may have direct access to memories and buses. As a result of this increased complexity, a co-processor is more likely to be shared among multiple processing elements. A co-processor may be accessed via a memory map, special instructions, or bus transactions. Most NPUs have integrated co-processors for common networking tasks; many have more than one. Operations ideally suited for co-processor implementation are well defined, expensive and/or cumbersome to
execute within an instruction set, and prohibitively expensive to implement as an independent special functional unit. The functions of co-processors vary from algorithm-dependent operations to entire kernels of network processing. For example, the Hash Engine in the Intel IXP1200 is only useful for lookup if the algorithm employed requires hashing; for IP routing, the most common algorithms (trie-based tables) do not use hash tables. This limits the freedom of software implementation on network processors: the programmer is forced to implement a task using the specific algorithm that can make use of the co-processor.

The most common integrated co-processors execute lookup and queue management functions. The functionality of lookup is clear: given a key, look up a value in a mapping table. The main design parameter is the size of the key; for additional flexibility, some co-processors also support variable-sized keys. Since lookup often references large memory blocks, it needs to operate asynchronously from a processing element. Common uses of lookup are determining next-hop addresses and accessing connection state. The global aspect of lookup operations (with respect to the device) requires the co-processor to be shared by all processing elements. Queue management is another good candidate for an integrated co-processor, as the memory requirement for packet queues is large and queues are relatively cheap to implement in hardware; the small silicon overhead eliminates many memory read and write operations that would otherwise be required. Other common co-processors handle pattern matching, checksum/CRC computation, and encryption/authentication.

2.3.2. Special Functional Units

A special functional unit is a specialized computational block that computes a result within the pipeline stage of a processing element. Most network processors have special functional units for common networking operations like pattern matching and bit manipulation. The computation required for these operations is cumbersome and error-prone to implement in software (with a standard instruction set), yet very easy to implement in hardware. For example, Intel's IXP1200 has an instruction to find the first bit set in a register in a single cycle [7]; with a standard instruction set, this would be quite tedious and take numerous cycles, as the sketch below illustrates. As with co-processor candidates, the transistor overhead is well worth the convenience and speedup. Cognigine takes a different approach: each PE has four execution units that can be dynamically reconfigured to match the application. Their VISC (Variable Instruction Set Computing) "instruction" determines operand sizes, operand routing, base operation, and predicates [6].
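As a concrete illustration of the gap the IXP1200's single-cycle instruction closes, here is what finding the first set bit looks like in plain C. This is a generic software loop of our own, not Intel microcode: without hardware support, the scan costs up to one iteration per bit position.

```c
#include <stdint.h>

/* Software find-first-set: returns the index (0 = LSB) of the lowest
 * set bit, or -1 if no bit is set. Up to 32 loop iterations on a
 * standard instruction set; the IXP1200 does this in one cycle. */
int find_first_set(uint32_t word) {
    if (word == 0)
        return -1;
    int idx = 0;
    while ((word & 1u) == 0) {   /* examine one bit position per loop */
        word >>= 1;
        idx++;
    }
    return idx;
}
```

Many general-purpose ISAs have since added comparable instructions (compilers expose them through builtins such as GCC's __builtin_ctz), but a loop like this was the portable software fallback.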
2.4. Memory Architectures

A third strategy NPUs have employed concerns the structure of their memory architectures. The major memory-related tactics are multi-threading, memory management, and task-specific memories. For data-plane processing, it is unclear whether the overhead of an operating system (OS) is warranted; NPUs have responded by including hardware support for common OS functions, like multi-threading and memory management.

Hiding memory access latency is a key aspect of efficiently using the hardware of a network processor. The most common approach to hiding latency is multi-threading, which efficiently multiplexes a processing element's hardware. The stalls associated with memory access are well known to waste many valuable cycles; multi-threading allows the hardware to process other streams while a thread waits for a memory access (or for a co-processor or another thread). Without dedicated hardware support, the cost of operating system multi-threading would dominate computation time, since the entire state of the machine would need to be stored and a new one loaded. As a result, many NPUs (Agere, AMCC, ClearSpeed, Cognigine, Intel, Lexra, and Vitesse) have separate register banks for different threads and hardware units that schedule threads and swap them with zero overhead. Clearwater Networks takes a slightly different approach: eight threads execute in parallel on the same processing element (which can issue 10 instructions per cycle), and the PE employs superscalar techniques to dynamically determine the available instruction-level parallelism and functional unit usage [12].

On the Intel IXP1200, memory management is handled in a similar fashion: the SRAM LIFO queues can be used as free lists, obviating the need for a separate OS service routine [7]; a sketch of this idea follows below. Some NPUs have special hardware that handles the common I/O path (i.e. packet flow). Clearwater's Packet Management Unit copies data from a MAC device into a memory shared by the core [12]; IBM, Motorola, Intel, and EZchip have similar units.

Just as co-processors and special functional units are specializations of a generic computational element for a specific purpose, task-specific memories are blocks of memory coupled with logic for specific storage applications. For example, Xelerated Packet Devices has an internal CAM (Content Addressable Memory) for classification [5], and on the Vitesse IQ2200, the Smart Buffer Module manages packets from the time they are processed until they are sent on an output port [13].
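To sketch why a hardware LIFO makes a natural free list, here is a generic C model of the idea (our own illustration of the concept, not the IXP1200's actual interface; buffer sizes and names are hypothetical): buffer allocation and release each become a single push or pop, with no allocator bookkeeping in software.

```c
#include <stddef.h>

#define NBUF     256
#define BUF_SIZE 2048

/* A LIFO (stack) of free packet buffers. On the IXP1200 the push and
 * pop would each be a single access to an SRAM-managed LIFO queue;
 * here we model the same idea in plain C. */
static unsigned char pool[NBUF][BUF_SIZE];
static unsigned char *free_list[NBUF];
static int top = 0;

void freelist_init(void) {
    for (int i = 0; i < NBUF; i++)
        free_list[top++] = pool[i];
}

unsigned char *buf_alloc(void) {        /* pop: one "SRAM" access */
    return (top > 0) ? free_list[--top] : NULL;
}

void buf_free(unsigned char *buf) {     /* push: one "SRAM" access */
    free_list[top++] = buf;
}
```

The hardware version has one further advantage this C model lacks: the queue unit arbitrates concurrent pushes and pops from multiple PEs, whereas software would need a lock around top.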
In summary, most NPUs use hardware-supported multi-threading to hide latency; some have taken this further to accelerate additional OS services (like memory management) and other common memory-intensive tasks.
2.5. On-Chip Communication Mechanisms

In general, on-chip communication mechanisms are tightly related to the PE configuration. For NPUs with pipelined processing elements, most communication is point-to-point, between processing elements, co-processors, memory, or peripherals. NPUs with symmetric PE configurations often have full connectivity over multiple buses. For example, Motorola's C-5 DCP has three buses with a peak bandwidth of 60 Gbps that connect 16 channel processors and 5 co-processors: the Payload Bus provides high-bandwidth, fixed-latency communication of packet payloads between the channel processors, the buffer management co-processor, and the queue management co-processor; the Ring Bus provides bounded-latency communication to and from the lookup co-processor; and the Global Bus provides access to most of the processor's memory through a conventional monolithic memory-mapped addressing scheme [14].

Brecis has taken the alternative approach of mapping application characteristics directly onto its bus architecture. Its Multi-Service Bus Architecture has a 3.2 Gbps peak bandwidth and connects the major devices of the network processor, including the DSPs, control processor, security co-processor, Ethernet MACs, and peripheral subsystem. It supports three priority levels, which correspond to the three types of packets the device processes: voice, data, and control. This allows programmers to handle the different latency and throughput requirements of these packet types. In addition, the bus interface for each processor consists of a packet classifier and three packet queues, which map directly to the three traffic types [11]. This enables efficient implementation of Quality of Service applications, a key to supporting voice and data on the same processor; the sketch below illustrates the idea.
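Here is a minimal sketch of the classify-into-priority-queues idea in C. This is our own illustration, not the Brecis hardware interface: the queue depth, the strict-priority service order (voice first, as the most latency-sensitive class), and the classifier stub are all assumptions.

```c
#include <stddef.h>

typedef enum { PRI_VOICE = 0, PRI_DATA = 1, PRI_CONTROL = 2 } pri_t;
typedef struct { unsigned char data[1514]; size_t len; } packet_t;

/* Three per-interface queues, one per traffic class. */
#define QLEN 64
typedef struct { packet_t *slots[QLEN]; int head, tail; } ring_t;
static ring_t queues[3];

static int ring_push(ring_t *r, packet_t *p) {
    int next = (r->tail + 1) % QLEN;
    if (next == r->head) return -1;          /* queue full: drop */
    r->slots[r->tail] = p;
    r->tail = next;
    return 0;
}

static packet_t *ring_pop(ring_t *r) {
    if (r->head == r->tail) return NULL;     /* queue empty */
    packet_t *p = r->slots[r->head];
    r->head = (r->head + 1) % QLEN;
    return p;
}

/* Hypothetical classifier stub; real hardware inspects header fields. */
pri_t classify(const packet_t *p) {
    (void)p;
    return PRI_DATA;                         /* placeholder decision */
}

void on_packet_arrival(packet_t *p) {
    ring_push(&queues[classify(p)], p);      /* bin by traffic class */
}

/* Strict-priority service: drain voice before data before control. */
packet_t *next_to_send(void) {
    for (int pri = PRI_VOICE; pri <= PRI_CONTROL; pri++) {
        packet_t *p = ring_pop(&queues[pri]);
        if (p) return p;
    }
    return NULL;
}
```

Keeping the classes in separate queues is what lets one device honor very different latency budgets: a burst of bulk data can never sit in front of a voice packet.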
2.6. Peripherals

In addition to performing layer 3+ tasks, NPUs must also consider packet movement onto and off of the chip. When placed in a line-card environment [15], a network processor has two main interfaces: network and switch. Some NPUs have integrated network interfaces on-chip (e.g. MAC devices). Ethernet is the most common protocol supported; SONET and ATM are also supported by a couple of NPUs targeting higher line speeds, as those protocols are used more in the network core. For example, Brecis' MSP5000 has two on-chip 10/100 Ethernet MACs [11]. While this may limit the applicability of the chip, it makes integration into a system much easier.
In addition, the tight integration of a MAC makes it less likely to become a bottleneck. Systems containing more than one NPU require a switch fabric for inter-NPU communication, and a few NPUs include a dedicated switch interface (e.g. UTOPIA or SPI). For example, the IBM PowerNP has dedicated ingress and egress switch fabric interfaces to support higher line rates [8]. NPUs without a dedicated network or switch interface normally handle these transactions over a high-bandwidth shared bus; for example, the IX bus on the Intel IXP processors is used for communication with the MACs and switch fabric. Some NPUs even have programmable peripherals to support multiple protocols. The Motorola C-5 DCP has two parallel Serial Data Processors (one for send, one for receive) that can be micro-coded to support a number of different layer 2 interfaces, like ATM, Ethernet, or SONET. These Serial Data Processors handle programmable field parsing, header validation, extraction, insertion, deletion, CRC validation/calculation, framing, and encoding/decoding [14]. A sketch of the kind of field parsing such a peripheral performs appears below.
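To give a flavor of the field parsing and extraction such programmable peripherals perform, here is a generic C sketch of layer-2 Ethernet header extraction; it is our own illustration of the task, not Motorola microcode, and the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Extracted Ethernet header fields, as a layer-2 parser might
 * deliver them to the rest of the chip. */
typedef struct {
    uint8_t  dst[6];      /* destination MAC address */
    uint8_t  src[6];      /* source MAC address */
    uint16_t ethertype;   /* e.g. 0x0800 = IPv4 */
} eth_hdr_t;

/* Parse and validate an Ethernet frame header.
 * Returns 0 on success, -1 if the frame is too short. */
int parse_eth(const uint8_t *frame, size_t len, eth_hdr_t *out) {
    if (len < 14)                       /* minimum Ethernet header */
        return -1;
    for (int i = 0; i < 6; i++) {
        out->dst[i] = frame[i];
        out->src[i] = frame[6 + i];
    }
    /* The EtherType field is big-endian on the wire. */
    out->ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    return 0;
}
```

A micro-coded Serial Data Processor performs this kind of extraction (plus validation, insertion, and CRC work) at line rate as bytes arrive, freeing the PEs for layer 3+ processing.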
3. Whither Network Processors?

Although they are largely attacking the same applications, network processors currently exhibit tremendous architectural diversity. In this paper we have simply tried to capture and organize this architectural diversity, which was spawned in a nutrient-rich (venture capital) environment. That diversity has now led to survival strategies as the business environment has considerably worsened. We expect that a pruning of the companies, and of the architectural design space, will naturally occur. Nevertheless, the use of programmable processors for packet processing in communication networks is well motivated, and we expect that at least a few hearty species will successfully adapt to specific market niches.

4. Acknowledgements

The authors would like to thank Chidamber Kulkarni, Christian Sauer, Scott Weber, the rest of the MESCAL team, and the anonymous reviewers for their invaluable feedback.

5. References

[1] N. Shah, "Understanding Network Processors", Master's thesis, Dept. of Electrical Engineering & Computer Sciences, Univ. of California, Berkeley, 2001.
[2] Cisco Systems, "Parallel eXpress Forwarding in the Cisco 10000 Edge Service Router", White Paper, October 2000.
[3] EZchip Technologies, "Network Processor Designs for Next-Generation Networking Equipment", White Paper, December 1999.
[4] SiTera Corp., "PRISM IQ2000", Product Brief, February 2000.
[5] Thomas Eklund (Xelerated Packet Devices), "The World's First 40Gbps (OC-768) Network Processor", Presentation, Network Processor Forum, June 2001.
[6] Rupan Roy (Cognigine), "A Monolithic Packet Processing Architecture", Presentation, Network Processor Forum, June 2001.
[7] Intel Corp., "Intel IXP1200 Network Processor", Product Datasheet, December 2001.
[8] IBM Corp., "IBM Network Processor (IBM32NPR161EPXCAC100)", Product Overview, November 1999.
[9] Bob Gelinas, Paul Alexander, Charlie Cheng, W. Patrick Hays, Ken Virgile, William J. Dally (Lexra), "NVP: A Programmable OC-192c Powerplant", Presentation, Network Processor Forum, June 2001.
[10] Agere, "PayloadPlus Routing Switch Processor", Preliminary Product Brief, Lucent Technologies, Microelectronics Group, April 2000.
[11] BRECIS Communications, "MSP5000 Multi-Service Processor", Product Brief, May 2001.
[12] Narendra Sankar (Clearwater Networks), "CNP810™ Network Services Processor Family", Presentation, Network Processor Forum, June 2001.
[13] Vitesse Semiconductor Corp., "IQ2200 VSC2232 Network Processor", Preliminary Data Sheet, 2001.
[14] Motorola Corp., "Motorola C-5 DCP Architecture Guide", 2001.
[15] M. Tsai, C. Kulkarni, C. Sauer, N. Shah, K. Keutzer, "A Benchmarking Methodology for Network Processors", 1st Network Processor Workshop at the 8th Int. Symp. on High-Performance Computer Architecture (HPCA), Boston, MA, February 2002.