Programming Challenges in Network Processor Deployment

Chidamber Kulkarni¹, Matthias Gries¹, Christian Sauer¹·², Kurt Keutzer¹

¹ Electronics Research Lab, University of California, Berkeley
² Infineon Technologies, Corporate Research, Munich

{kulkarni|gries|sauer|keutzer}@eecs.berkeley.edu

ABSTRACT


Programming multi-processor ASIPs, such as network processors, remains an art due to the wide variety of architectures and the limited support for exploring different implementation alternatives. We present a study that implements an IP forwarding router application on two different network processors in order to better understand the main challenges in programming such multi-processor ASIPs. The goal of this study is to identify the elements central to a successful deployment of such systems, based on a detailed profiling of the two architectures. Our results show that inefficient partitioning can degrade throughput by more than 30%; that better arbitration of resources increases throughput by at least 10%; and that localizing computation close to the memories can double the available bandwidth on internal buses. The main observation of our study is that there is a critical lack of tools and methods supporting an integrated approach to partitioning, scheduling and arbitration, and data transfer management for such system implementations.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Design studies; D.1.m [Programming Techniques]: Miscellaneous; C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems, network processors

General Terms
Performance, Design, Measurement

Keywords
Programming heterogeneous architectures, resource sharing, programming model, multi-threading, mapping, IPv4 forwarding

1. INTRODUCTION

Network processors (NPUs) are the response to the need for cost-efficient yet flexible system solutions for evolving applications that process packets at high data rates. Network equipment shows different characteristics in performance and flexibility depending on its position within the network hierarchy. The access segment of the network must perform many different tasks for multiple protocols and applications at relatively moderate data rates; traditional network access equipment is therefore realized as embedded software running on general-purpose architectures. In the core network, by contrast, higher performance is required for a small set of simple tasks to meet line speed, and such equipment is typically based on specialized ASICs. Network processors target the levels between these two extremes, where flexibility at high data rates is mandatory. Due to these requirements, NPU designs trade off constraints on performance, flexibility, and cost. Consequently, NPUs exhibit a wide range of architectures for performing similar tasks: from simple general-purpose RISC cores with dedicated peripherals, in pipelined and/or parallel organization, to heterogeneous multiprocessors based on complex multi-threaded cores with customized instruction sets [32]. Although diverse, all NPUs exploit task-level concurrency in applications by means of parallel programmable processing elements (PEs) in order to meet line-speed requirements.

Programming such concurrent systems still remains an art. Due to the lack of tool support, the programmer must manually partition and balance the load of the application among multiple PEs; yet reliable performance estimates, which are in turn required for partitioning and balancing, only become available once each individual task has been implemented, often in assembly, on multiple elements. Programming tools even lack sufficient profiling functionality to enable the analysis of performance bottlenecks caused by accesses to shared resources, such as buses and memories, in these heterogeneous multiprocessor ASIPs. In summary, multiple interrelated key factors contribute to the difficulty and complexity of programming network processors:



• Achievements of general-purpose multi-processing research are not directly applicable to such multi-core systems because of the much more fine-grained interaction among tasks on different cores, caused by the need to hide the latencies of frequent memory accesses. This tight coupling, in combination with the performance requirements, also prevents the direct use of (distributed) operating system layers.

• Driven by the need for efficiency, the individual processing elements of a network processor may differ significantly. They have, for instance, different instruction set architectures that not only require individual or retargetable compilers (EZchip Inc.) but sometimes even different programming languages (Agere Systems Inc.). Other cores may use completely different or disjoint programming approaches, such as synthesis for re-configurable on-chip fabrics. Since, as of today, each individual core has to be programmed separately, the programmer must handle multiple programming environments concurrently.

• Even more homogeneous architectures, e.g. those using one type of core plus a control processor and co-processors, are cumbersome to program at the assembly level (or using assembly-like modifications of C). In general-purpose computing, on the other hand, the programmer is relieved of details like scheduling or memory management by the abstraction an operating system provides. The run-time overhead of such an abstraction, however, is prohibitive in NPU implementations, so all such details must be handled explicitly in network processors in order to achieve the required performance metrics.

• At the application level, there is concurrency at the level of network traffic flows as well as at the level of individual packets. However, exploiting and managing this degree of concurrency given the limited resources available in any current network processor is a manual process.

Given these facts, programming heterogeneous multiprocessor ASIPs such as network processors is a challenge. The main goal of this paper is to clearly understand the bottlenecks and key elements in programming network processors that are common among different architectures. The obtained insights should guide the development of tools, methods, and programming models for existing network processors, and may also help redefine the hardware-software interface in future generations of NPUs.

The paper is organized as follows. We first introduce our application, an IP forwarding router implemented on two different network processors, and discuss the corresponding results. We then identify the critical elements in the successful deployment of any programmable network processor. Finally, we discuss existing work in this area and provide conclusions and directions for future work.

2. ROUTER IMPLEMENTATION ON NETWORK PROCESSORS

In this section we present a brief discussion of the specification and implementation of a 16-port IPv4 router on two different network processors, namely the Intel IXP1200 [15] and the Motorola C-Port C-5 [13]. The goal of this exercise is to understand the programming effort associated with different aspects of system implementations and to identify critical elements in programming such systems based on a quantitative comparison.

2.1 Router specification

Our functional specification of the router is based on RFC 1812, Requirements for IP Version 4 Routers [5]. We model only the data plane of the application, not the control plane. However, we recognize that adding control-plane functionality will further complicate the scheduling and arbitration of resources, as well as the data transfer management between the control processor and the data-plane processors. The main points of the specification are (see the sketch at the end of this subsection):

• A packet arriving on port P is to be examined and forwarded on a different port P'. The next-hop location that implies P' is determined through a longest prefix match (LPM) on the IPv4 destination address field.
• Broadcast packets, packets with IP options, and packets with special IP sources and destinations are discarded.
• The packet header and payload are checked for validity, and the packet header checksum and TTL fields are updated.
• Packet queue sizes and buffers may be configured optimally for the network processor architecture, unless large buffer sizes interfere with the ability to measure sustained performance.
• The network processor must maintain all non-fixed tables (i.e. tables for the LPM) in memory that can be updated with minimal intrusion to the application.
• Routing tables for the LPM must be able to address any valid IPv4 destination address.

It is important to note that, in addition to the core functionality, a number of steps are required to receive the packets from the media access control (MAC) unit into the network processing unit and extract the packet header, on which the above operations are performed. Lastly, the modified packet header and the packet payload need to be written back to the MAC unit, sometimes via a bus. These additional operations, in fact, account for most of the programming effort on many network processors. For the IXP1200, for example, 14 detailed tasks are required to perform the core functionality of our benchmark, whereas 42 detailed tasks are needed to perform the ingress and egress operations on each packet. See [40, 9] for a more detailed description of a disciplined approach to benchmarking network processors. We will now introduce both network processors as well as the implementations of our router application on them.
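To make the core functionality concrete, the sketch below spells out the per-packet operations from the list above: header validation, a TTL decrement with an incremental checksum update in the style of RFC 1624, and the LPM-based forwarding decision. It is a minimal illustration in plain C; struct ipv4_hdr and lpm_lookup are our own placeholder names, and byte-order handling is assumed to be done by the caller.

    #include <stdint.h>

    /* Minimal IPv4 header, assuming the caller has dealt with byte order. */
    struct ipv4_hdr {
        uint8_t  ver_ihl;      /* version (high nibble) + header length */
        uint8_t  tos;
        uint16_t total_len;
        uint16_t id;
        uint16_t frag_off;
        uint8_t  ttl;
        uint8_t  proto;
        uint16_t checksum;
        uint32_t src, dst;
    };

    /* Placeholder for the routing table: returns the output port P' for a
     * destination via longest prefix match, or -1 to discard. */
    extern int lpm_lookup(uint32_t dst_addr);

    /* Incremental checksum update after the TTL decrement, in the style of
     * RFC 1624: HC' = ~(~HC + ~m + m'), applied to the 16-bit word that
     * holds TTL and protocol. */
    static void decrement_ttl(struct ipv4_hdr *h) {
        uint16_t old_word = (uint16_t)((h->ttl << 8) | h->proto);
        h->ttl--;
        uint16_t new_word = (uint16_t)((h->ttl << 8) | h->proto);
        uint32_t sum = (uint16_t)~h->checksum + (uint16_t)~old_word + new_word;
        sum = (sum & 0xffff) + (sum >> 16);   /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
        h->checksum = (uint16_t)~sum;
    }

    /* Core forwarding decision for one header; returns the output port or
     * -1 if the packet must be dropped. Checks for broadcasts and special
     * sources/destinations are elided for brevity. */
    int forward(struct ipv4_hdr *h) {
        if ((h->ver_ihl >> 4) != 4)  return -1;   /* not IPv4 */
        if ((h->ver_ihl & 0xf) != 5) return -1;   /* IP options: discard */
        if (h->ttl <= 1)             return -1;   /* TTL expired */
        decrement_ttl(h);
        return lpm_lookup(h->dst);
    }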

2.2 Intel IXP1200 Network Processor

The IXP1200 comprises six micro-engines for computation, with four threads on each micro-engine. Each micro-engine has an instruction memory of 4 KB and 128 32-bit registers. Four unidirectional on-chip buses connect the off-chip memories (SRAM and SDRAM) to the micro-engines. External media access control (MAC) units are connected to the IXP1200 via an off-chip IX bus. The IX bus unit interfaces with the IX bus and has the logic and memories required to receive packets from and transmit packets to the external MAC units. The IX bus unit has a scratchpad memory (SRAM) of 4 KB and two FIFO memories,

each with 16 entries of 64 bytes. In addition, the SDRAM unit is connected to the IX bus unit via a separate on-chip bus, which is used to transfer packet payloads directly, based on micro-engine commands. Lastly, an on-chip command bus carries events and signals between the micro-engines and the IX bus unit. A StrongARM processor is also present on-chip for control-plane functions; it interacts with the micro-engines via the IX bus unit. A more detailed description of the IXP1200 architecture can be found in, e.g., [15].

[Figure 1: Packet flow and micro-architecture of the Intel IXP1200. The diagram shows the receive and transmit micro-engines with their I/D memories and registers, the external MAC ports, the IX bus Rx/Tx buffers, and the SRAM bus (packet descriptors and table lookup requests) and SDRAM bus (headers and payload) to the external memories.]
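The receive path just described can be summarized as pseudocode. The sketch below is illustrative only: the helper functions (wait_for_rx_fifo, issue_payload_move, and so on) are hypothetical placeholders, not Intel's micro-engine C API, but the structure — payload moved to SDRAM by command, header fetched into registers, descriptor queued via the SRAM bus — follows Figure 1; forward() is the core function from the Section 2.1 sketch.

    struct ipv4_hdr;                              /* as in the Section 2.1 sketch */
    extern void wait_for_rx_fifo(int port);       /* thread swaps out until the
                                                     IX bus unit signals arrival */
    extern int  claim_rx_fifo_element(int port);
    extern void issue_payload_move(int elem);     /* Rx FIFO -> SDRAM, issued as
                                                     a command, completes async */
    extern struct ipv4_hdr *read_header(int elem);/* header into registers */
    extern int  forward(struct ipv4_hdr *h);      /* core function, Section 2.1 */
    extern void enqueue_descriptor(int port, int elem);  /* via the SRAM bus */
    extern void drop_packet(int elem);

    void receive_thread(int port) {
        for (;;) {
            wait_for_rx_fifo(port);
            int elem = claim_rx_fifo_element(port);
            issue_payload_move(elem);             /* payload bypasses the PE */
            int out_port = forward(read_header(elem));
            if (out_port >= 0)
                enqueue_descriptor(out_port, elem);
            else
                drop_packet(elem);
        }
    }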

2.3 IXP1200 implementation

We developed the application in micro-engine C (a variant of C provided by Intel) following the above specification, based on the Intel reference code; the transmit threads were modified to improve performance. The application was partitioned so that sixteen threads on four micro-engines were assigned one port each on the receive (and forwarding) part, while the transmit part of the application was assigned eight threads on two micro-engines. This partitioning is justified because the end-to-end delay for a packet on the receive part is more than twice that on the transmit part. The packet flow through the Intel IXP micro-architecture is sketched in Figure 1. Performance on the IXP1200 was measured using version 2.01 of the Developer Workbench, assuming a clock frequency of 200 MHz; the IX bus is 64 bits wide and clocked at 80 MHz. Two IXF440 external media access control units (with eight duplex fast Ethernet ports each) are connected to the IX bus, and Ethernet IP packets are streamed from these units to the IXP1200 and back. The packets for the application contain destination addresses evenly distributed across the IPv4 32-bit address space. We employ different packet sizes, from 64 bytes to 1518 bytes. The range of destination addresses and associated next-hop destinations provides an evenly distributed load on every output port.
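The thread-to-port assignment described above can be written down compactly. The following sketch (our own constants and types, not vendor code) shows the static mapping: one receive thread per port on micro-engines 0-3, and one transmit thread per pair of output ports on micro-engines 4-5, matching the "Tx Queue ports 2i / 2i+1" split in Figure 1.

    enum { N_PORTS = 16, THREADS_PER_UE = 4, N_TX_THREADS = 8 };

    struct thread_assign { int microengine; int thread; int port; };

    /* Receive: port p -> thread (p mod 4) on micro-engine (p div 4). */
    static void build_rx_map(struct thread_assign rx_map[N_PORTS]) {
        for (int p = 0; p < N_PORTS; p++) {
            rx_map[p].microengine = p / THREADS_PER_UE;   /* uE 0..3 */
            rx_map[p].thread      = p % THREADS_PER_UE;   /* ctx 0..3 */
            rx_map[p].port        = p;
        }
    }

    /* Transmit: thread t on micro-engines 4-5 drains the queues of
     * output ports 2t and 2t+1. */
    static void build_tx_map(int tx_ports[N_TX_THREADS][2]) {
        for (int t = 0; t < N_TX_THREADS; t++) {
            tx_ports[t][0] = 2 * t;
            tx_ports[t][1] = 2 * t + 1;
        }
    }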

2.4 Motorola C-5 Network Processor

The C-5 comprises sixteen channel processors (CPs), each with four threads, a local data memory of 12 KB, and an instruction memory of 6 KB. It is important to note that the C-5 supports connecting consecutive CPs in a pipelined fashion using special registers. Each CP includes an on-chip configurable media access controller, called the serial data processor (SDP), that can be configured for 10/100/1000 Mbps Ethernet, packet-over-SONET, or ATM. CPs can be assigned to a port (namely an SDP) individually or in an aggregate mode. In addition to the CPs, there is an executive processor (XP) that serves as a centralized computing resource for the C-5 and manages the system interfaces, such as PCI, the serial bus interface, and the PROM interface. A fabric processor interfaces with a switch or backplane fabric. Three on-chip buses connect all these computational resources to the external memories: the payload bus, the ring bus, and the global bus, with widths of 128, 64, and 32 bits, respectively. Three specialized units are part of the memory controllers: the table lookup unit (TLU), which accelerates six different types of table lookup algorithms with 11 dedicated instructions; the buffer management unit, which accelerates the creation and destruction of variable-width buffers for payload data stored in SDRAM; and the queue management unit, which

accelerates the creation and destruction of queues for packet descriptor data stored in SRAM. A more detailed description of the C-5 architecture can be found in, e.g., [13]. In summary, the C-5 is much more specialized for network processing than the IXP1200.

[Figure 2: Packet flow and micro-architecture of the C-Port C-5 network processor. The diagram shows the channel processors with their RxSDP/TxSDP extract and merge units, the on-chip MACs, the DMA and ring bus connections, and the payload bus (packets), global bus (forwarding descriptors), and ring bus (table lookup requests) to the external SDRAM/SRAM and the TLU, queue, and cell buffer interfaces.]

2.5 C-5 implementation

For our study, we derived the application in C from another reference application, an Ethernet-POS-Gigabit Ethernet switch. In particular, we adapted the channel processor code to reflect our specification of an IPv4 router and configured the serial data processors to act as fast Ethernet media access controllers. The other changes concerned the assignment of buffers in the buffer management unit, where we assigned 16 buffers to each channel processor, and the queue management unit, where we assigned one queue to each channel processor (i.e. output port). Our configuration thus allows a maximum of 256 packets in flight at any given instant. The application was partitioned so that two threads on each channel processor were assigned to the receive and transmit functions, respectively. Note that the receive and transmit parts are decoupled and pipelined, i.e., each of them can process packets independently; the same holds for the serial data processors (SDPs) on the receive and transmit sides. The packet flow through the C-5 micro-architecture is shown in Figure 2. Performance on the C-5 was measured using version 2.1 of the C-5 development environment, assuming a clock frequency of 200 MHz. The packets for the application contain destination addresses evenly distributed across the IPv4 32-bit address space. We again employ different packet sizes, as in the IXP1200 case.
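The buffer and queue dimensioning above is easy to state as configuration constants. The sketch below (our own symbolic names, not Motorola's configuration API) makes the in-flight bound explicit: 16 channel processors times 16 buffers each yields the 256-packet limit.

    /* C-5 buffer/queue dimensioning used in our setup (symbolic names
     * are ours, not Motorola's API). */
    enum {
        N_CHANNEL_PROCESSORS  = 16,
        BUFFERS_PER_CP        = 16,                    /* buffer mgmt unit */
        N_QUEUES              = N_CHANNEL_PROCESSORS,  /* one per port */
        MAX_PACKETS_IN_FLIGHT = N_CHANNEL_PROCESSORS * BUFFERS_PER_CP
    };

    /* 16 CPs x 16 buffers = 256 packets in flight at most. */
    _Static_assert(MAX_PACKETS_IN_FLIGHT == 256, "in-flight bound");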

2.6 Results and discussion

In this section we present the main results of our study. The three main metrics for comparison are the network traffic throughput achieved by each processor, the utilization of resources (buses), and the programming effort in terms of lines of code and the resulting latencies per packet.

2.6.1 Network traffic throughput

Figure 3 shows the throughput achieved by both processors for our router application. The throughput for the C-5 was observed at the output of the SDPs (namely the fast Ethernet MACs) and aggregated, whereas for the IXP1200 the Developer Workbench provides aggregated throughput numbers based on the output of the IXF440 MAC units. We observe that the C-5 always achieves near line-rate performance (1590 Mb/s); for our application the aggregated line rate is 1600 Mb/s (duplex), using 16 ports at 100 Mb/s at the input. The IXP1200 achieves throughputs from 960 Mb/s to 1440 Mb/s for different packet sizes. At first glance, only the C-5 is able to sustain a constant, higher throughput across all packet sizes.

[Figure 3: Throughput rates for the IXP1200 and C-Port C-5 implementations, for packet sizes from 64 to 1518 bytes.]
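As a sanity check on these numbers, the offered load can be computed directly. The short program below is a back-of-the-envelope sketch; the 8-byte preamble and 12-byte inter-frame gap are standard Ethernet assumptions on our part, not figures from the measurements. It prints the aggregate line rate and the per-port packet rates for the packet sizes we used.

    #include <stdio.h>

    int main(void) {
        const double port_rate = 100e6;    /* bit/s per fast Ethernet port */
        const int    n_ports   = 16;
        const int    overhead  = 8 + 12;   /* preamble + inter-frame gap, bytes */
        const int    sizes[]   = { 64, 128, 256, 512, 1024, 1280, 1518 };

        printf("aggregate line rate: %.0f Mb/s (duplex)\n",
               n_ports * port_rate / 1e6);
        for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
            double pps = port_rate / (8.0 * (sizes[i] + overhead));
            printf("%5d-byte packets: %8.0f pkt/s per port, %9.0f aggregate\n",
                   sizes[i], pps, n_ports * pps);
        }
        return 0;
    }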

2.6.2 Resource utilization

Figure 4 shows the utilization of the three main buses on the IXP1200 and the C-5 for different packet sizes. The three buses on the C-5 are the global bus, used for communicating the packet descriptors with the SRAM memory; the payload bus, used for transporting the payload from the SDRAM memory; and the ring bus, used to access the lookup table in SRAM memory. We observe that the global bus and the payload bus show a larger utilization for smaller packet sizes, because there is a fixed overhead in transporting every packet to its queue or buffer. For example, for a 72-byte packet the payload bus has 23% utilization, compared to 13% for a 512-byte packet at the same traffic throughput. Since the achieved traffic throughput is constant, independent of the packet size, this illustrates a significant per-packet overhead in communicating over both the global and the payload bus. For the ring bus we do not observe much overhead, because our lookup table is relatively small.

For the IXP1200, in contrast, the SRAM and SDRAM bus utilizations are 28% and 32% for a 64-byte packet, and 15% and 36% respectively for a 512-byte packet. The C-5 has a lower utilization on its payload bus than the IXP1200 has on its SDRAM bus; however, since the traffic throughput of the IXP1200 is lower than that of the C-5, we cannot draw concrete conclusions from this number alone. For large packet sizes, though, where the traffic throughput is almost equal for both processors, the difference in bus utilization can be explained by the difference in internal bus bandwidth: the internal bandwidth of the payload bus (C-5) is roughly a factor of 2.4 larger than that of the SDRAM bus in the IXP. In contrast, the SRAM bus utilization of the IXP1200 is more than a factor of ten higher than the combined utilization of the ring bus and the global bus, even though the IXP1200 implementation attains a lower traffic throughput and the internal SRAM bus bandwidth of the IXP is only a factor of 1.8 lower than the combined internal bandwidth of the ring and global buses in the C-5. This clearly shows an overhead in SRAM memory and the related bus utilization on the IXP compared to the C-5. One can hypothesize that the workloads placed on both architectures are similar, but that one reason the C-5 achieves better throughput is the decoupling of the memory accesses for the table lookup from those for the packet descriptors, which reside in separate SRAM memories on the C-5 but in the same SRAM memory on the IXP1200.
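A simple model captures the per-packet overhead effect discussed above (our own abstraction, not a vendor formula): at a fixed traffic throughput, every packet costs its own bytes plus a fixed overhead on the bus, so utilization grows as packets shrink.

    /* Bus utilization under a fixed per-packet overhead (sketch). */
    double bus_utilization(double throughput_bps, double bus_bw_bps,
                           int pkt_bytes, int overhead_bytes_per_pkt) {
        double pkts_per_s  = throughput_bps / (8.0 * pkt_bytes);
        double bytes_per_s = pkts_per_s * (pkt_bytes + overhead_bytes_per_pkt);
        return 8.0 * bytes_per_s / bus_bw_bps;   /* fraction of capacity */
    }

Setting the measured 72-byte and 512-byte utilizations (23% and 13%) equal under this model implies a per-packet overhead on the order of 75 bytes on the C-5 payload bus.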

2.6.3 Programming effort

The programming effort indicates the amount of work required to reach a certain performance criterion. It is difficult to assess in our study, since the two processors attain different performance levels. We therefore use effective lines of code (see [10] and the references therein) as a software metric for comparing the programming effort of each implementation, together with a profile of where the execution time is spent, in Figure 5. For the IXP1200, we need 620 lines of micro-engine C code to implement our specification, of which 63% are for computation, 31% for communication (involving bus and memory accesses), and 6% for initialization. In contrast, the C-5 requires 1280 lines of C code, of which 43% are for computation, 2% for communication, and 55% for initialization. For a 128-byte packet, the IXP1200 spends 29% of its cycles on computation, 28% on communication, and 43% idle; the C-5 spends 55% on computation and communication combined and 45% idle. If we compare the total number of useful lines of code for the IXP1200 (94% of 620) and the C-5 (45% of 1280), they are almost equal. However, for the IXP a relatively large fraction of the code is dedicated to communication. Writing communication code for the IXP is particularly time-consuming and error-prone, since the programmer has to understand the underlying scheduling and arbitration schemes of the architecture, as will be discussed in the next section. Looking at the number of cycles executed per packet, the time spent in active (non-idle) work is also similar for both processors. In summary, although both architectures have a similar utilization per packet and require similar programming effort, the achieved performance (traffic throughput) is noticeably different. This is primarily due to the decoupled and pipelined approach to packet processing in the C-5, using additional accelerators such as the serial data processors (SDPs), supported by a layered software environment.

2.7 Comparison of IXP1200 and C-5 implementations

Based on the results discussed above, we present our main observations regarding the differences between the IXP1200 and C-5 implementations. The C-5 architecture expects the programmer to adopt a pipelined approach to processing packets. Towards this goal, the C-5 architecture has been optimized around a programming model that forces users to program every stage of the pipeline and to take decisions based on certain parameters of the current, preceding, and succeeding stages. As a result, the programmer takes some global decisions, such as setting up buffers, queues, and the parameters for extracting packet headers, and then programs the individual stages largely independently. Optimizing across the boundaries of these stages is not feasible; hence, the approach boils down to exploring different parameter settings of the architecture and evaluating the performance metrics. In cases where the required performance cannot be achieved this way, it is almost impossible to deploy such an architecture. The IXP1200, on the other hand, has a fairly flexible architecture in which the programmer has complete control over most of the hardware. This flexibility results in a need to optimize the implementation to meet given performance criteria.

[Figure 4: Bus utilization for different packet sizes on the IXP1200 (SRAM and SDRAM buses) and the C-Port C-5 (global, payload, and ring buses).]

[Figure 5: Comparison of programming effort for the IPv4 routing implementations: lines of code (computation, communication, initialization) and executed cycles per 128-byte packet (computation, communication, idle) for the IXP1200 and the C-5.]

For our application, the decoupled and pipelined approach of the C-5 seems to perform better. Indeed, our main observation has been that for networking applications that require only header manipulation, the C-5 is easier to program and deploy than the IXP1200, for the following reasons:

• The partitioning of functionality into receive and transmit threads chosen for the IXP1200 and for the C-5 is clearly only one of many possibilities. Although the C-5 partitioning appears natural for our application, since we have 16 ports and the C-5 has 16 channel processors, it is questionable whether the partitioning we chose for the IXP1200 is in any way good or bad. This observation clearly calls for approaches that explore partitionings over different processors and threads.

• The optimization effort on the IXP1200 is larger, primarily because the programmer must make more scheduling- and arbitration-related decisions than on the C-5. Examples are the arbitration schemes at the receive and transmit FIFOs of the IXP micro-architecture in Figure 1, which directly affect the number of lines of communication code in Figure 5. Each of these decisions would ideally be supported by tools that estimate and explore the options automatically; such tools are currently unavailable, making the process tedious and error-prone.

• The localization of computation related to the SRAM memory units, namely in the table lookup unit (TLU) and the queue management unit (QMU) of the C-5, results in a lower overhead than on the IXP1200, where the computation related to the memory units and the corresponding signals must be handled in the micro-engines. Such localization of computation is important both for achieving higher performance and for reducing the programming effort.

3. CRITICAL ELEMENTS IN PROGRAMMING NETWORK PROCESSORS

We have presented the main observations from our study of implementing a 16-port router on two different network processors. Based on these observations, we identify three main elements that we believe are central to any successful deployment of programmable network processors and that must be dealt with in an integrated fashion to achieve optimal designs: partitioning of functionality onto processors and threads, scheduling and arbitration of resources (buses, specialized units, memories, etc.), and data transfer management.

3.1 Partitioning

Partitioning functionality onto processors and threads is an important first step towards an implementation that meets the required performance metrics. The programmer has to match properties of the application with characteristics of the underlying hardware. The application might reveal natural cutting points that can be used to distribute the application load onto several processing elements. For IPv4 packet routing, the application could logically be split into packet receive, header processing, and packet transmit. This cutting, however, might not be a feasible partition onto processing elements, since these sections of the packet flow are not necessarily well balanced. Moreover, fine-granular thread swaps are initiated by the application's memory access patterns. Determining thread boundaries, and guaranteeing deterministic behavior under the constraint of the fixed number of threads provided by the hardware, is therefore a cumbersome task.

Observation

As an example, we implemented a pipelined version of our router on the IXP1200. We repartitioned the code on the four concurrent receive micro-engines to separate the receive, header check, route lookup, and packet descriptor write functions. Four input ports were mapped to one thread each, and the remaining three functions were mapped to three different threads in a pipelined fashion. This yields a completely pipelined implementation of the receive functionality on four micro-engines supporting 16 ports; the transmit part remained the same as in the original implementation. The pipelined implementation required twice as many registers, so we had to implement the inter-stage communication via SRAM, since the backlog of packets prevented us from using the scratchpad memory. This implementation was only able to sustain a throughput of 35% (560 Mb/s) of the line rate (1600 Mb/s), about one third less than the original implementation, which achieved 880 Mb/s. Indeed, balancing pipelined implementations is a cumbersome task. Although our partitioning onto pipeline stages was a logical, 'natural' result of the application description, it turned out to be an unbalanced implementation: the computational utilization of the micro-engines varied by more than a factor of two along the pipeline. This illustrates that a good partitioning is essential for achieving the required performance. From a programming perspective, it is important that both the architecture and the programming model support a natural partitioning of the particular application.
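The register pressure mentioned above comes from the inter-stage handoff. The sketch below uses our own types; the real code passes per-packet state through registers until they run out, then through SRAM. It shows a single-producer/single-consumer ring of per-packet contexts between two pipeline stages; on real hardware the indices would additionally need explicit signaling or memory-ordering guarantees.

    #include <stdint.h>

    struct pkt_ctx { uint32_t elem; uint32_t dst; int out_port; };

    #define RING_SLOTS 32                 /* capacity of one stage boundary */
    struct stage_ring {
        struct pkt_ctx slot[RING_SLOTS];
        volatile uint32_t head, tail;     /* producer / consumer indices */
    };

    static int ring_put(struct stage_ring *r, const struct pkt_ctx *c) {
        if (r->head - r->tail == RING_SLOTS) return 0;   /* full: stage stalls */
        r->slot[r->head % RING_SLOTS] = *c;
        r->head++;
        return 1;
    }

    static int ring_get(struct stage_ring *r, struct pkt_ctx *c) {
        if (r->head == r->tail) return 0;                /* empty: stage idles */
        *c = r->slot[r->tail % RING_SLOTS];
        r->tail++;
        return 1;
    }

Throughput is then set by the slowest stage: since our stages differed by more than a factor of two in utilization, the faster stages idled and the pipeline sustained only 560 Mb/s.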

3.2 Scheduling and arbitration

Since network processors have many shared resources, such as memories, buses, and specialized units (accelerators), these resources must be scheduled and arbitrated. On the one hand, thread scheduling on a processing element is usually implemented in hardware and often follows a round-robin scheme. On the other hand, memory controllers and buses employ a variety of dynamic arbitration mechanisms, which all have to be harmonized to achieve efficiency. Finally, the chosen partitioning and data layout also influence arbitration effects.

Observation

In our study, we observe that estimating such scheduling and arbitration delays is central to achieving performance with reduced programming effort. For example, the IXP1200 implementation of the router would have run out of memory space due to processing backlog and stopped functioning had we not optimized the transmit FIFO's interaction with the micro-engine transmit thread that schedules writes to this FIFO buffer. We reduced the worst-case arbitration penalty for writing the FIFO by removing speculative executions from the reference code and by adding proper signal checks. This optimization step alone enhances the throughput of the system by at least 10%. The programming effort needed to determine such an optimization is significant, and is also reflected in the lines of code spent on communication on the IXP in Figure 5. Other IXP-specific solutions to the arbitration of shared resources can be found in [33, 35]: Spalink et al. [35], for instance, implement a token-passing scheme using inter-thread signaling for FIFO arbitration, and Shah et al. [33] restrict the number of concurrent accesses to control status registers.
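To make such an arbitration discipline concrete, the sketch below shows a signal-checked, token-passed FIFO write in the spirit of Spalink et al. [35]. The signaling helpers are hypothetical placeholders, not a vendor API; only the token holder writes, which makes the worst-case arbitration penalty predictable.

    extern void wait_signal_from(int thread);  /* blocks; the thread swaps out */
    extern void send_signal_to(int thread);
    extern int  tx_fifo_has_room(void);        /* explicit ready check */
    extern void tx_fifo_write(const void *entry);

    /* Token-passed, signal-checked write to the shared transmit FIFO. */
    void tx_fifo_put(int my_id, int n_threads, const void *entry) {
        wait_signal_from((my_id + n_threads - 1) % n_threads); /* acquire token */
        while (!tx_fifo_has_room())
            ;                          /* check status instead of writing
                                          speculatively; a real thread would
                                          swap out here rather than spin */
        tx_fifo_write(entry);
        send_signal_to((my_id + 1) % n_threads);               /* pass token */
    }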

3.3 Data transfer management

Managing data transfers between different processors, memories, and threads remains a challenge. The impact of any data transfer management scheme on both performance and programming effort is significant and cannot be neglected. There are two aspects to the problem: first, controlling the data transfer between memory and computational units, and second, assigning data to particular memories. Data structures used in network processing can be very fine-grained, e.g. a fraction of a header field or a packet descriptor, but also very large and complex, such as packet payloads, packet headers, lookup tables, and queue management structures. Multiple memories of different kinds are consequently used in network processors to support different styles of memory access. Finally, apart from the data layout, the latency of a memory access can also depend on the order of the accesses issued to the memory before it. Memory access latencies in turn directly affect the timing of thread swaps (thread scheduling).

Observation

In our study we observe that, by moving the computation related to the SRAM memory units into specialized memory controllers (namely the TLU and QMU), the C-5 architecture localizes the memory accesses and thereby preserves bandwidth for other functions. Such an approach has a large impact both on performance and on other metrics, such as power consumption (due to the movement of data on the buses). This motivates the need for more application-specific memory controllers that localize the computation related to memories. In addition, one needs to explore the alternatives for assigning data to the various memories, since there is a strong correlation between the assignment of data to memory and the scheduling of memory accesses. Approaches to building customized memory controllers exist; however, most of them do not address the problem of combining an effective programming interface with the development of customized memory controllers. We will now review the existing work related to the above three elements, as well as work related to the compilation and programming of ASIPs and embedded systems.
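The bandwidth effect of localization can be seen by contrasting the two styles of queue manipulation. This is a sketch with hypothetical helpers, not the C-5 or IXP1200 API: done on the processing element, an enqueue costs several bus transactions per descriptor; offloaded to a queue management unit, it costs a single command, and the pointer updates stay inside the controller.

    #include <stdint.h>

    extern uint32_t sram_read(uint32_t addr);   /* each call crosses the bus */
    extern void     sram_write(uint32_t addr, uint32_t val);
    extern void     qmu_enqueue(int queue_id, uint32_t desc);  /* one command */

    /* PE-side enqueue: three bus transactions per packet descriptor.
     * The queue's head/tail pointers live at q_addr in SRAM. */
    void enqueue_on_pe(uint32_t q_addr, uint32_t desc) {
        uint32_t tail = sram_read(q_addr + 4);   /* 1: fetch tail pointer */
        sram_write(tail, desc);                  /* 2: store descriptor   */
        sram_write(q_addr + 4, tail + 4);        /* 3: update tail        */
    }

    /* Controller-side enqueue: one command; pointer chasing stays local
     * to the queue management unit. */
    void enqueue_on_qmu(int queue_id, uint32_t desc) {
        qmu_enqueue(queue_id, desc);
    }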

4. RELATED WORK

Related work on programming application-specific processors, heterogeneous systems, and multiprocessors can be found in the context of development frameworks, compiling techniques, and high-level synthesis.

4.1 Development frameworks

There are three main approaches to programming models for network processors: those based on libraries, such as Click [19], NP-Click [33], and Teja [37]; those based on domain-specific languages [39]; and those based on dynamic scheduling of tasks, e.g. Calisto [27]. Library-based approaches are limited by the fact that any scaling of the architecture invariably requires recoding most of the library components. The dynamic scheduling of tasks as implemented in [27] requires additional hardware, such as control and status registers. Thus, very little work exists on supporting the partitioning and scheduling of tasks on network processors. One approach to early design decisions based on an analytical model is presented in [14]. Programming environments in the signal processing domain are mainly based on vector and matrix computations exploiting multiply-and-accumulate operations; in MathWorks' Matlab language, for instance, matrices are a primitive data type. Matlab and Synopsys' COSSAP provide common signal processing solutions in libraries and offer code generation for a single DSP core with a Harvard-architecture memory model.

4.2 Compiling techniques

Common concerns in DSP compilation are strength reduction of algebraic expressions in order to use simple, single-cycle computations; software pipelining to work around dependency constraints in loops and increase resource utilization; exploiting loop hardware; and mapping computation to a fixed, small set of data types. In short, DSP compilers mainly focus on static analysis to resolve fine-grained latency issues. Examples are CoSy [1], SPAM [29], RECORD [23], and IBM's e-lite [11]. Compilers targeted at embedded systems [24] and ASIPs, such as CoWare's LISATek [2, 31], Chess/Checkers [3, 22], Tensilica's XCC [4], and FlexWare [25], specialize in retargetability, code density, power issues, and reliability for a single processing core. In addition, there are VLIW compilers focusing on instruction scheduling for enhanced instruction-level parallelism (ILP), static cache management, and the like, such as those for Trimedia [17] and from Texas Instruments. Compilers for general-purpose computing, such as Impact [18], focus in particular on exploiting ILP and cache efficiency; scheduling at compilation time is performed at the granularity of instructions. Parallelizing compilers, such as SUIF [16], Titanium [21], and Polaris [6], mainly address scientific programs targeted at shared-memory multiprocessors by exploiting concurrency at the level of regular computation loops. In the networking domain, most of the effort has focused on efficient code generation for a uniprocessor or a single thread [12, 41]. The transfer of ASIP, DSP, and general-purpose compilation techniques to the domain of heterogeneous multiprocessor ASIPs, such as network processors, is not straightforward, since the boundary conditions and constraints of these embedded systems require fine-grained interleaving and scheduling of computations on different processors using threads, multiple heterogeneous memory areas, and coordination of accesses to specialized units, as identified earlier.

4.3 High-level synthesis

Recent work on partitioning tasks onto threads [36, 34] transforms the problem into known graph partitioning problems; however, the question of how to determine task boundaries remains. In the area of embedded system synthesis, memory optimization approaches [28, 8] use data layout transformations and access reordering to minimize power dissipation, access latency, and memory allocation costs. Moreover, the impact of scheduling and arbitration on the performance and predictability of single-processor systems has led to a large number of related works; see [7, 20] and the references therein. Recently, methods have been introduced that integrate different scheduling analysis techniques and domains into a coherent framework [30] and that generalize those techniques for the analysis of heterogeneous multiprocessor platforms [38]. However, little work so far identifies a set of critical elements in programming heterogeneous single-chip multiprocessor ASIPs, such as network processors. In this paper we have presented an analysis of two real-world network processors using IPv4 packet forwarding as a benchmark, and have revealed programming challenges that are characteristic of this domain of heterogeneous multiprocessors and unlike the properties of conventional embedded systems. In summary, there are works that address aspects of each of our elements individually, for some cost function, in a particular application domain. Recently there have been attempts to address more than one of the identified elements as an integrated optimization problem. To our knowledge, however, there have been no attempts to combine all three main elements and address them in a single integrated environment.

5. CONCLUSIONS AND FUTURE WORK

For the next generation of network-processor-based system implementations, we strongly believe that considerable emphasis will be put on performance-per-cost aspects (for example, power consumption) and on support for appropriate programming models. In this study we have identified the main elements that are central to a successful deployment of multiprocessor ASIPs, such as network processors. Our results clearly indicate that the impact of the identified elements on both the achievable performance and the programming effort is significant, varying from 10% to 35% in performance for hand-coded implementations. We strongly believe that approaches which integrate the three issues – partitioning over threads and processors, scheduling and arbitration of resources, and data transfer management – into a single environment and provide a path to implementation are central to making these multiprocessor ASIPs successful. There are two directions for future work in this domain. From a pure programming perspective, we need to address the mapping of existing domain-specific environments, such as Click and Matlab, onto such systems. Secondly, we need to explore architecture design environments that integrate the identified programming-related elements up-front in the architecture design process, in order to optimize the architecture-programming-model interface. In the Mescal project [26], we are currently exploring both approaches to making the deployment of such multiprocessor ASIPs successful.

Acknowledgment. This work was supported, in part, by the Microelectronics Advanced Research Consortium (MARCO) and Infineon Technologies, and is part of the efforts of the Gigascale Silicon Research Center (GSRC).

6. REFERENCES

[1] ACE Associated Compiler Experts bv, the Netherlands. CoSy compiler development system. http://www.ace.nl.
[2] CoWare, Inc. LISATek EDA tools. http://www.coware.com.
[3] Target Compiler Technologies n.v. Chess/Checkers retargetable tool-suite. http://www.retarget.com.
[4] Tensilica, Inc. Xtensa C/C++ Compiler (XCC). http://www.tensilica.com.
[5] F. Baker. Requirements for IP version 4 routers. RFC 1812, Internet Engineering Task Force (IETF), June 1995.
[6] W. Blume, R. Eigenmann, J. Hoeflinger, D. Padua, P. Petersen, L. Rauchwerger, and P. Tu. Automatic detection of parallelism: A grand challenge for high-performance computing. IEEE Parallel and Distributed Technology, 2(3):37–47, 1994.
[7] G. C. Buttazzo. Hard Real-Time Computing Systems. Kluwer Academic Publishers, 1997.
[8] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, and T. Omnes. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Academic Publishers, Mar. 2002.
[9] P. Chandra, F. Hady, R. Yavatkar, T. Bock, M. Cabot, and P. Mathew. Benchmarking network processors. In P. Crowley, M. Franklin, H. Hadimioglu, and P. Onufryk, editors, Network Processor Design: Issues and Practices, volume 1, pages 11–25. Morgan Kaufmann Publishers, Oct. 2002.
[10] V. Côté, P. Bourque, S. Oligny, and N. Rivard. Software metrics: An overview of recent results. Journal of Systems and Software, 8(2):121–131, Mar. 1988.
[11] D. Naishlos, M. Biberstein, and A. Zaks. Compiler vectorization techniques for a disjoint SIMD architecture. Technical Report H0146, IBM Research, Nov. 2002.
[12] L. George and M. Blume. Taming the IXP network processor. In ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation (PLDI '03), June 2003.
[13] G. Giacalone, T. Brightman, A. Brown, J. Brown, J. Farrell, R. Fortino, T. Franco, A. Funk, K. Gillespie, E. Gould, D. Husak, E. McLellan, B. Peregoy, D. Priore, M. Sankey, P. Stropparo, and J. Wise. A 200 MHz digital communications processor. In IEEE International Solid-State Circuits Conference (ISSCC), pages 416–417, Feb. 2000.
[14] M. Gries, C. Kulkarni, C. Sauer, and K. Keutzer. Exploring trade-offs in performance and programmability of processing element topologies for network processors. In Second Workshop on Network Processors at the 9th International Symposium on High Performance Computer Architecture (HPCA9), Mar. 2003.
[15] T. R. Halfhill. Intel network processor targets routers. Microprocessor Report, 13(12), Sept. 1999.
[16] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer, special issue on multiprocessors, Dec. 1996.
[17] J. Hoogerbrugge and L. Augusteijn. Instruction scheduling for Trimedia. Journal of Instruction-Level Parallelism, 1(1), Feb. 1999.
[18] W. W. Hwu, R. E. Hank, D. M. Gallagher, S. A. Mahlke, D. M. Lavery, G. E. Haab, J. C. Gyllenhaal, and D. I. August. Compiler technology for future microprocessors. Proceedings of the IEEE, 83(12):1625–1640, 1995.
[19] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems, 18(3):263–297, Aug. 2000.
[20] H. Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Number 395 in Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, 1997.
[21] A. Krishnamurthy and K. Yelick. Analyses and optimizations for shared address space programs. Journal of Parallel and Distributed Computing, 1996.
[22] D. Lanneer, J. V. Praet, A. Kifli, K. Schoofs, W. Geurts, F. Thoen, and G. Goossens. CHESS: Retargetable code generation for embedded DSP processors. In Code Generation for Embedded Processors, pages 85–102. Kluwer Academic Publishers, 1995.
[23] R. Leupers and P. Marwedel. Retargetable compilers for embedded DSPs. In 7th European Multimedia, Microprocessor Systems and Electronic Commerce Conference (EMMSEC), Nov. 1997.
[24] R. Leupers and P. Marwedel. Retargetable Compiler Technology for Embedded Systems – Tools and Applications. Kluwer Academic Publishers, Oct. 2001.
[25] C. Liem and P. Paulin. Compilation techniques and tools for embedded processor architectures. In J. Staunstrup and W. Wolf, editors, Hardware/Software Co-Design: Principles and Practice. Kluwer Academic Publishers, 1997.
[26] A. Mihal, C. Kulkarni, K. Vissers, M. Moskewicz, M. Tsai, N. Shah, S. Weber, Y. Jin, K. Keutzer, C. Sauer, and S. Malik. Developing architectural platforms: A disciplined approach. IEEE Design & Test of Computers, 19(6):6–16, 2002.
[27] J. Nickolls, L. J. Madar III, S. Johnson, V. Rustagi, K. Unger, and M. Choudhury. Broadcom Calisto: A multi-channel multi-service communication platform. In 14th Hot Chips Symposium, Aug. 2002.
[28] P. Panda, F. Catthoor, N. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandecappelle, and P. G. Kjeldsberg. Data and memory optimization techniques for embedded systems. ACM Transactions on Design Automation of Electronic Systems, 6(2), Apr. 2001.
[29] S. Rajan, M. Fujita, A. Sudarsanam, and S. Malik. Development of an optimizing compiler for a Fujitsu fixed-point digital signal processor. In International Workshop on Hardware/Software Codesign (CODES), 1999.
[30] K. Richter, M. Jersak, and R. Ernst. A formal approach to MpSoC performance verification. IEEE Computer, 36(4):60–67, Apr. 2003.
[31] O. Schliebusch, A. Hoffmann, A. Nohl, G. Braun, and H. Meyr. Architecture implementation using the machine description language LISA. In 15th International Conference on VLSI Design, pages 239–244, Jan. 2002.
[32] N. Shah. Understanding network processors. Master's thesis, Dept. of Electrical Engineering and Computer Sciences, University of California, Berkeley, Sept. 2001.
[33] N. Shah, W. Plishker, and K. Keutzer. NP-Click: A programming model for the Intel IXP1200. In 2nd Workshop on Network Processors (NP2) at the 9th International Symposium on High Performance Computer Architecture (HPCA9), Feb. 2003.
[34] Y. Shin and K. Choi. Software synthesis through task decomposition by dependency analysis. In IEEE/ACM International Conference on Computer-Aided Design, 1996.
[35] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a robust software-based router using network processors. In 18th ACM Symposium on Operating Systems Principles (SOSP), Oct. 2001.
[36] X. Tang and G. R. Gao. Automatically partitioning threads based on remote paths. In International Conference on Parallel and Distributed Systems, pages 14–16, Dec. 1998.
[37] Teja Technologies. IPv4 forwarding application performance. White paper, July 2002.
[38] L. Thiele, S. Chakraborty, M. Gries, A. Maxiaguine, and J. Greutert. Embedded software in network processors – models and algorithms. In First Workshop on Embedded Software (EMSOFT), pages 416–434, Oct. 2001.
[39] A. Tillmann. A case for using a specialized language for NPU design. EE Times, Aug. 2002.
[40] M. Tsai, C. Kulkarni, C. Sauer, N. Shah, and K. Keutzer. A benchmarking methodology for network processors. In P. Crowley, M. Franklin, H. Hadimioglu, and P. Onufryk, editors, Network Processor Design: Issues and Practices, volume 1, pages 141–165. Morgan Kaufmann Publishers, Oct. 2002.
[41] J. Wagner and R. Leupers. C compiler design for an industrial network processor. In ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES), pages 155–164, 2001.