Automatic partitioning and mapping of stream-based applications onto the Intel IXP Network Processor

Sjoerd Meijer, Johan Walters, David Snuijf, and Bart Kienhuis
Leiden University, Leiden Institute of Advanced Computer Science, Niels Bohrweg 1, 2333 CA, Leiden, Netherlands
{smeijer,kienhuis}@liacs.nl

Published in the proceedings of the Workshop on Software & Compilers for Embedded Systems (SCOPES'07), Nice, April 20, 2007

Abstract

When studying the IXP Network processor architecture from Intel, we found several interesting aspects that make the IXP attractive for stream-based applications. The architecture is highly optimized for streaming data, albeit in the form of internet packets. Furthermore, the architecture has Gigabit Ethernet connectors for handling incoming and outgoing traffic, and it can process this data in real time using dedicated microengines. In this paper, we try to answer three questions: 1) Can we use the IXP architecture for stream-based applications? 2) Can we map applications written as a KPN onto an IXP? 3) Can we integrate the generation of KPNs using the Compaan compiler with a tool flow that maps the KPN onto an IXP, thereby making the programming of the IXP much simpler? As will be shown, all three steps can be performed, and we show that we can automatically map two non-internet stream-based applications (QR and DWT) onto the IXP.

1. INTRODUCTION

The complexity of embedded multimedia and signal processing applications has reached a point where the performance requirements of these applications can no longer be supported by embedded system platforms based on a single processor. Instead, multi-core or multi-processor architectures are being introduced to meet the required compute power of the applications. Examples are the Cell processor [8] of IBM for gaming, the Wasabi architecture [13] of Philips Research for video processing, and the IXP processor of Intel for internet traffic processing. The availability of these architectures is the first step in meeting the performance requirements. The next step, and the challenge, is to take full advantage of these architectures; applications that previously ran in a single thread must be carefully partitioned and mapped onto the architecture. At Leiden University, we are interested in mapping stream-based applications in the domains of imaging, multimedia, and classical digital signal processing onto multiprocessor architectures. Over the years, we have built up considerable expertise in exploiting the Kahn process networks model of computation to map applications onto multiprocessor architectures, mainly via the Compaan project [6, 15]. When studying the IXP architecture, we found several interesting aspects that make the IXP attractive for stream-based applications. The architecture is highly optimized for streaming data, albeit in the form of internet packets. Furthermore, the architecture has Gigabit Ethernet connectors for handling incoming and outgoing traffic and can process this data in real time using dedicated microengines. Besides the microengines, a single XScale processor is present. Over the years, Intel has developed a range of IXP

architectures, as given in Table 1. Each new IXP architecture has more resources running at a higher clock frequency, thereby providing an ever increasing amount of processing power [1].

Feature                      IXP2325               IXP2400               IXP2855
XScale core                  900 MHz max.          600 MHz max.          750 MHz max.
Microengines, organized      2 at 600 MHz max.,    8 at 600 MHz max.,    16 at 1.5 GHz max.,
  into                       one cluster           two clusters of 4     two clusters of 8
SRAM                         16 MB (1 channel)     128 MB (2 channels)   256 MB (4 channels)
DRAM                         2 GB (2 channels)     1 GB (1 channel)      2 GB (3 channels)

Table 1: Overview of the components of the IXP2325, IXP2400, and IXP2855 processors

After studying the IXP architecture, we asked ourselves the following three questions. 1) Can we use the IXP architecture for applications other than internet traffic, in particular stream-based applications? 2) Can we easily map applications written as a KPN onto an IXP? 3) Can we integrate the generation of KPNs for the IXP from sequential Matlab code using our Compaan tool flow, thereby making the programming of the IXP much simpler?

1.1 Stream-based Applications and Network Processors

The IXP Network processor is built to operate in real time on internet traffic while being completely programmable. The architecture uses microengines that have hardware multi-threading support and various communication structures to move streams of data around as efficiently and quickly as possible. We observe that stream-based applications have the same requirements. Therefore, can we not use an IXP for stream-based applications? We already cooperated in a preliminary study in 2004, in which we mapped the BLAST application from the domain of bioinformatics onto an IXP [2]. Inspired by this work, others have also tried to use the IXP for bioinformatics computation [16]. Although these efforts show promising results, the problem remains that the implementations are done using low-level programming models based on assembly or low-level C code. As a consequence, programmers need very detailed knowledge of this intricate architecture. The low-level programming model makes the deployment of a broader range of stream-based applications onto IXPs difficult. We are therefore interested to see if we can provide a simpler programming model, and perhaps even an automated flow, to program an IXP from a single specification.

1.2 Kahn Process Networks

Mapping an application written in an imperative language like Matlab, C, or Java onto a multiprocessor architecture is a difficult task. Manually partitioning the original code over multiple threads, while making sure that the communication between the threads is correct, is a tedious and error-prone process. We believe that automatically mapping an application onto a multiprocessor architecture is difficult because the computational model of an imperative language is in complete contrast with the multiprocessor architecture. Imperative languages use the concept of a single thread of control and a large global memory space. Multiprocessor architectures, on the other hand, use concepts like autonomous processes and distributed memories. To bridge these two concepts, we believe that the KPN model of computation is very appropriate [12]. The KPN model expresses an application in terms of autonomous processes that communicate with each other using unbounded FIFOs [5]. The processes synchronize using a blocking read: if a process tries to read data from an empty FIFO channel, the process blocks until data becomes available. Once data becomes available again, the process unblocks, reads the data from the FIFO channel, and continues processing. To map a KPN onto an IXP, we need to establish a relationship between the processes and FIFO channels of the KPN model and the architecture components of the IXP.

1.3 Compaan Compiler

If we can map KPNs onto the IXP, we still need to obtain the KPNs from the applications we are interested in. To distill the task-level parallelism from an application written in a subset of Matlab, we rely on the Compaan compiler developed within the computer systems group of Leiden University. It uses advanced dataflow analysis techniques to automatically extract tasks from an application and expresses them in a KPN model. If we can establish the relationship between the KPN elements and the IXP architecture, we end up with a framework that can program an IXP from a single imperative specification. This paper is organized as follows. In Section 2, we report on related work. Next, we present the IXP architecture in Section 3 and illustrate some of its strong features. We also discuss the problem of programming the IXP. In Section 4, we present our solution to program the IXP using the KPN model of computation. In Section 5, we explain how the components of a KPN fit with the architecture components of the IXP. In Section 6, we present our IMCA back-end, which automates the process of mapping KPNs onto an IXP. In Section 7, we present two case studies we performed, and we finish in Section 8 with conclusions.

2. RELATED WORK

To program the IXP, programming models other than the microengine assembly and low-level C languages have been developed. For example, NP-Click [10] offers a higher level of abstraction over the underlying hardware of the IXP1200 network processor. It is mainly targeted at network processing applications, offering support for header verification, route table lookup, etc. Programming is mainly done by clicking data flow streams together. Another effort to improve the programming of an IXP is the µL programming language and the µC compiler by Network Speed Technologies [14, 3]. Like NP-Click, it offers a high level of abstraction over the hardware, where the programmer focuses on the data streams of network processing applications. Both NP-Click and the µL programming language focus on internet packet handling, while we are interested in a programming model that supports the class of stream-based applications.

Intel has developed an auto-partitioning C compiler, as described in [7]. In their approach, an input application is specified as a set of sequential C programs called packet processing stages (PPSes). These PPSes closely correspond to the Communicating Sequential Processes (CSP) model of computation [4]. Expressing a program in PPSes is the responsibility of the programmer. In contrast, the Compaan compiler automatically generates KPNs from applications written in a small subset of Matlab supporting nested-loop programs [15]. CSP and Kahn process networks both belong to the category of process network models. The difference between these two models is that, due to the unbounded FIFO channels, KPN processes are much more decoupled. This leads to more autonomous processes, giving more freedom in the mapping stage. Furthermore, KPNs are always deterministic, while CSP supports non-deterministic constructs. The deterministic property of KPNs is exploited in the scheduling freedom: irrespective of the schedule chosen (i.e., sequential, fully parallel, or anything in between), the input/output relation of a network will always be correct.

3. IXP 2400 ARCHITECTURE

Over the years, Intel has developed a whole range of IXP processors, as shown in Table 1. Since we only had access to an IXP2400, we focus on this architecture in this paper. A schematic of the IXP2400 is shown in Figure 1. It shows the Intel XScale core and eight microengines (ME 0:0 - ME 1:3) clustered in two blocks of four. Other relevant parts are the specialized controllers that communicate data with off-chip DRAM and SRAM, the scratchpad memory, and the Media and Switch Fabric (MSF) interface. The MSF interface governs the communication with the Ethernet connection. The IXP2400 can receive and transmit data on this interface at a speed of 2.5 Gbps.

Figure 1: Schematic of the IXP2400, showing the Intel XScale core, the eight microengines (ME 0:0 - ME 1:3), the DRAM and SRAM controllers, the PCI unit, the Media Switch Fabric interface with its RBUF and TBUF, the scratchpad memory, the hash unit, and the CAP.

The XScale core is a RISC general-purpose processor similar to the processing units found in other embedded hardware, such as handhelds and cell phones. On the network processor, its intended use is to control and support the processes on the microengines where needed. The XScale core is easy to program; tools, compilers, and operating systems are widely available for this processor. Nevertheless, we did not make use of this processor in this paper.

A microengine is a simple RISC processor that is optimized for fast-path packet processing tasks. Its instruction set is specifically tuned for processing network data. It consists of over 50 different instructions, including arithmetic and logical operations that operate at bit, byte, and long-word levels and can be combined with shift and rotate operations in a single instruction. Integer multiplication is supported; floating-point operations are not. The structure of a microengine is given in Figure 2. It shows the two operands (A and B), the execution datapath, and the general-purpose registers (GPRs). Besides these elements, the microengine has special registers to communicate quickly and efficiently with DRAM and SRAM, its neighbors, and local memory. For example, to communicate with neighboring microengines within a cluster, a microengine uses dedicated hardware support: via special registers, it can send data to a neighbor and receive data from a neighbor. The program code for each microengine is stored in a 4K instruction store.

Figure 2: Structure of a microengine, showing the 640-longword local memory, the two banks of 128 GPRs, the 128 next-neighbor registers, the 128 DRAM and 128 SRAM read and write transfer registers, the 4K instruction store, the CRC unit, the execution datapath (shift, add, subtract, multiply, find first bit set), the local CSRs, and the 16-entry CAM.

3.1 Programming Environment

To program the IXP, we make use of the Intel IXA Software Development Kit, which can be downloaded from Intel's website. This kit contains an assembler for the microengine assembly language, a compiler for the microengine C language, and an integrated desktop environment called the Developer's Workbench. The microengines are programmed in microengine C, a special C variant developed by Intel to program the IXP. The syntax is ANSI C, except that function pointers and recursion are not supported. Type safety, pointers to memory, functions, enums, structures, and arrays are supported. Because of the many exposed memory and register types, variable declarations usually include an extra modifier, __declspec, to inform the compiler where to store the given variable. For example, we define an integer variable x stored in scratchpad memory as follows: __declspec(scratch) int x. Not all features of the IXP2400 hardware can be expressed in ANSI C, such as asynchronous memory access, waiting for signals, and special memory operations. Therefore, a library of intrinsics is provided. Each intrinsic function call is replaced by assembly code, thereby implementing special operations. For example, to read a value from scratchpad memory and increase the value afterwards in one atomic operation, the intrinsic scratch_test_and_incr() is provided in microengine C.

Despite the numerous debugging tools available in the Developer's Workbench, debugging is still a painful task. This is partly due to the complexity of the IXP hardware, which remains hard to oversee. As a programmer, you constantly have to keep track of which type of register or memory each piece of data is stored in, and thus need a detailed picture of the IXP architecture. Furthermore, many tasks happen in parallel, and you constantly have to perform proper synchronization to get the correct data at the correct time. It is very easy to make a mistake in this synchronization, and such a mistake is hard to debug.

While working with the Developer's Workbench, we also found that the microengine compiler is sometimes too aggressive in its optimizations. In many cases, we had to add extra code to instruct the compiler not to optimize certain variables away, in order to observe their contents for debugging purposes. In such cases, the programmer must add an implicit_read statement for these variables to prevent them from being optimized away. Another issue is that quite often the value of variables cannot be observed at all, for unknown reasons.

4. APPROACH

To execute applications unrelated to internet traffic on an IXP, we have to overcome two issues. One is that programming the complex IXP architecture needs to be made simpler. The other is that we need to find tasks in applications that can be mapped onto the microengines. To solve these two issues, we want to leverage the environment we developed, the Compaan compiler. This compiler automatically finds very lightweight tasks in applications written in a subset of Matlab (i.e., nested-loop programs) and converts the application into a KPN representation: processes that communicate via FIFOs. Although the typical target of the Compaan compiler is high-end FPGAs, we want to use it for the IXP. To express a KPN on the IXP, we have to express each component of the KPN in terms of the microengine C language. The many different microengine C programs we obtain after code generation can then be compiled by the Intel microengine compiler and mapped onto the IXP. So, if we are able to automatically extract a KPN from an application and subsequently express it in terms of microengine C, we hide a lot of details from the programmer, making it easier to map stream-based applications onto the IXP.

5. MAPPING A KPN TO THE IXP

To map a KPN onto the IXP, we have to express how the components and properties of a KPN map onto the IXP architecture. The IXP2400 has eight microengines, and each microengine has hardware support for eight threads. On each thread, we map a single KPN process. In principle, we should thus be able to map 64 KPN processes onto the IXP. However, the Media and Switch Fabric (MSF) claims three microengines in the process of transmitting and receiving packets, which leaves us with 40 vacant threads. It may happen that we end up with KPNs that contain more than 40 processes. In that case, we can rely on a merging technique described in [11], which allows us to always limit the number of KPN processes to at most 40.

Since only 4K of instruction memory is present on a microengine, the code of a single thread has to be very small: about 512 instructions per thread. This is a serious limitation, as it means the operations a thread can perform have to be very simple. Thread switching on a microengine is done when a thread reaches a blocking-read situation. Suppose a thread wants to read data from a FIFO channel that is empty. In this case, the thread has to stall to properly implement the KPN semantics. This provides a natural moment for thread switching.

5.1 FIFO channel mapping

The IXP Network processor was designed to process internet packets in real time. Therefore, the IXP provides different FIFO channels to move streams of data through the architecture. These FIFOs map directly onto the FIFO channels used in a KPN. The FIFOs support asynchronous communication using signals. A thread that is blocked can be woken up using such signals. Signals are an important and integral feature of the IXP, as many of the processes running on the IXP need to synchronize. Microengines can very efficiently test for the presence or absence of certain signals. Using signals, we can very efficiently implement the blocking read and write semantics in a producer/consumer interaction.

    // Microengine 1
    01 __declspec(remote)  SIGNAL sig1;
    02 __declspec(visible) SIGNAL sig2;
    03
    04 main () {
    05     signal_next_ME(sig1);
    06
    07
    08     __wait_for_all(&sig2);
    09 }

    // Microengine 2
    10 __declspec(visible) SIGNAL sig1;
    11 __declspec(remote)  SIGNAL sig2;
    12
    13 main () {
    14
    15     __wait_for_all(&sig1);
    16     signal_previous_ME(sig2);
    17 }
    18

Figure 3: Synchronisation between microengines using signals

An example of the use of signals is given in Figure 3. Lines 1-9 are executed on microengine 1, lines 10-18 on microengine 2. In line 5, a signal (i.e., sig1) is sent to microengine 2, which waits for that signal to arrive in line 15. Next, in line 16, microengine 2 sends a signal (i.e., sig2) back to microengine 1, which waits for that signal to arrive in line 8.

Figure 4: The six FIFO realization options on the IXP2400: (1) local memory, (2) next-neighbor registers, (3) direct transfer registers, (4) scratchpad hardware rings (16 rings), (5) scratchpad software rings (64 rings), and (6) SRAM rings.

Since the FIFO is such a central element in the IXP, different implementations exist. We have found that six different FIFO types can be realized on the IXP, as shown in Figure 4. The various realizations make different trade-offs between speed, claimed resources, and size. A short description of each realization is now given.

1. Local memory. Each microengine has a fast accessible local memory of 640 longwords that is shared among all threads running on that microengine. It can be used to implement a very fast FIFO channel between processes mapped onto the same microengine.

2. Next-neighbor. Each microengine has next-neighbor registers. They can be used to implement a very fast, dedicated FIFO channel between processes mapped onto the limited set of microengines that are neighbors. The registers can be used in three modes: as an extra set of general-purpose registers, as one FIFO channel of 128 longwords, or as 128 separate registers accessible by the neighboring microengine.

3. Direct transfer (xfer) registers. Each microengine has 128 SRAM and 128 DRAM read registers. They can be used to exchange data with a process on any other microengine. The direct xfer registers use the standard bus and are slower than the previous communication styles.

4. Scratchpad rings. There are 16 sets of special ring registers available on the scratchpad unit. These ring registers provide hardware support for the head and tail pointers of a FIFO channel allocated in scratchpad memory.

5. Scratchpad memory. The ring pointers can also be implemented directly in software. These software rings implement the head and tail pointers of a FIFO channel allocated in scratchpad memory. This is much slower than the hardware ring support.

6. SRAM. SRAM rings are a hardware-supported FIFO channel implementation. Each SRAM memory channel has a queue descriptor table which can hold 64 values. Since the IXP2400 has two SRAM memory channels, a total of 18 rings are available.

When very fast FIFO channels are needed, the local memory or next-neighbor registers should be employed. If this is not possible, the hardware-supported ring registers on the scratchpad should be used. If this is not possible either, the software-supported rings should be used. Finally, the SRAM-supported FIFO channels should be used: they are the slowest, but they can hold the largest amount of data and are the most plentiful.

5.2 Sink and Source Processes

The sink and source processes of a KPN need to be able to communicate efficiently with the outside world. For this purpose, the IXP provides two hardware units: a well-known PCI controller and the Media and Switch Fabric (MSF) interface. The MSF connects to a physical layer device, which is the network. The MSF can accept standard TCP/IP packets and extract the payload of each packet. This payload is presented to a KPN as an incoming stream. The outgoing stream is stored in the payload of a stream of packets. Being able to communicate with the IXP using a standard TCP/IP packet interface makes the integration of an IXP, running a mapped KPN, into an application very simple. Packet reception and transmission on the IXP is a complex act of segmenting and reassembling small partial packets, called mpackets. At least one thread should be assigned to handle incoming data and one thread to handle outgoing data on the MSF. Although a single thread can do the work, in practice it is too slow. Typically, all eight threads of one microengine are used for incoming streams and all eight threads of another microengine for outgoing streams.

6. IMCA

To automatically express a KPN in terms of microengine C, we developed a special back-end to our Compaan compiler that we called IMCA (IXP Mapper for Compaan Applications). The tool-flow, given in Figure 5, shows how IMCA fits in between the Compaan compiler and the IXP. The Compaan compiler is responsible for discovering the task-level parallelism in the Matlab input applications and generates a Kahn process network. Next, the IMCA back-end generates the microengine C code as explained in the previous section. A separate C file is created for each microengine. The obtained microengine C files are compiled by the Intel compiler and loaded onto the IXP. Besides the C files, IMCA also generates

a number of configuration files. The common function libraries for microengines implement common functionality needed by all microengines, for example, implementations of the various FIFO types on the IXP2400 hardware and a general interface for accessing these FIFO types. The configuration files for microengines provide additional data for each microengine, such as which microengine C files are mapped onto which microengine and the number of threads that should be active.

Figure 5: Design flow: an application described in Matlab is converted by the Compaan compiler into a KPN specification; the IMCA tool, guided by a mapping specification and a platform specification, generates program code, common function libraries, and configuration files for the microengines; the Intel IXP C compiler then produces the code that is loaded onto the microengines of the IXP2400.

Besides the Kahn process network, the IMCA back-end uses two additional specifications to successfully map an application onto the IXP: the platform and mapping specifications.

• The platform specification describes the hardware components of an IXP Network Processor platform. This includes the number of microengines, the number of hardware-assisted FIFO queues, and the size of the scratchpad memory.

• The mapping specification describes how the elements of a KPN should be assigned to the hardware of an IXP.

Although the mapping can be specified manually, we implemented a very simple assignment strategy in IMCA for the processes and FIFO channels of a KPN. FIFO channels are assigned in a greedy way: first to scratchpad memory with hardware-supported ring buffers, next the remaining scratchpad is filled with software-supported ring buffers, and the remaining FIFO channels are mapped onto SRAM. A big improvement can be made on this mapping strategy. Using topological information of the KPN, we should exploit local memory and next-neighbor communication. Also, using a compile-time computation of the bandwidth of all FIFO channels, we should make a better assignment to the various resources available to realize a FIFO channel. KPN processes are mapped to the threads on the microengines. Of the 64 threads available on the IXP2400, 24 are already taken by the MSF for handling packet streams. The remaining 40 threads are assigned in a greedy way. Here too, much improvement can be made: using the FIFO channel information, particular threads should be grouped in such a way that smart use can be made of the local memory and next-neighbor communication.

7. CASE STUDY AND RESULTS

The tool-flow shown in Figure 5 enabled us to automatically and quickly map stream-based applications written in a subset of Matlab onto the IXP2400. To show that the IXP can be used to execute stream-based applications, we took two stream-based applications that have no relation to internet packet processing: QR, a classical DSP application (a matrix decomposition algorithm) used in, for example, radar systems, and DWT, an important wavelet compute kernel used in the JPEG2000 standard.

The KPN of the QR algorithm consists of 5 nodes and 12 FIFO channels [17]. Each process is mapped onto its own microengine, and all FIFO channels are mapped onto hardware-assisted scratchpad memory rings, as only 12 FIFOs need to be mapped. The main computations in QR (i.e., vectorize and rotate) are not performed, as they require a fixed-point implementation, which wasn't available; a microengine does not support floating-point calculations. Processing the workload of a 5x6 version of QR took 40247 cycles. The throughput of the source is measured at 7.2 Mbps. Note that this measurement only says something about the communication (the read and write phases of a thread), as no real function is executed.

The KPN of the DWT algorithm consists of 23 nodes and 41 FIFO channels. DWT requires FIFO channels of a size larger than the scratchpad can support, and they are therefore realized as SRAM FIFO channels. All FIFO channels and processes are assigned using the greedy algorithm discussed previously. Contrary to the QR example, the DWT algorithm is implemented fully functionally. Processing a 32x32 image took 576947 cycles, and the throughput at the source was measured at 35.6 Mbps.

The results shown here are obtained using the very simple greedy mapping strategy, with all FIFO channels mapped to either scratchpad memory or SRAM. We anticipate that with a more advanced mapping strategy and a larger collection of FIFO channel realizations, the performance can be improved considerably.

7.1 Alternative Platforms

The KPNs generated by the Compaan compiler can be implemented on different platforms using different back-ends. We use the IMCA back-end to generate code for the IXP processor. To generate a full hardware version for FPGAs (Xilinx Virtex-Pro), we use the LAURA back-end tool [18], and to generate a multiprocessor version using MicroBlazes, we use the ESPAM back-end tool [9]. In the full hardware version, each process is mapped in hardware using a dummy function; see [17] for a discussion of the hardware implementation with real IP cores. To get an alternative multi-microprocessor mapping of QR, we create a MicroBlaze microprocessor for each process of the KPN, connected using a dedicated crossbar. In this case, we also used dummy functions for the vectorize and rotate functions, as we did for the IXP and full hardware versions. The results on the IXP are given in the first row of Table 2, using MicroBlazes in the second row, and a full hardware implementation in the third row. For the IXP mapping, 40247 clock cycles are needed for the sink node in the network to finish. When the sink node finishes, we know that all other nodes have finished as well. In the second row, the FPGA with 5 MicroBlaze microprocessors takes 3865 cycles. In the third row, a full hardware implementation takes 213 cycles.

Arch.           # clock cycles   CPU freq.   time (µs)
IXP             40247            600 MHz     67
FPGA, 5 MB      3865             100 MHz     39
FPGA, full HW   213              108 MHz     2

Table 2: Results for QR

Table 2 is interesting as it shows the performance of one and the same algorithm on a homogeneous multiprocessor architecture (i.e., the IXP), a heterogeneous multiprocessor architecture (i.e., an FPGA with MicroBlazes), and a fully custom hardware solution. The more dedicated the communication gets, the higher the performance. In the IXP we need to share a bus; in the FPGA with MicroBlazes we share a crossbar to communicate between MicroBlazes. In the full hardware version, only dedicated FIFO channels are used to communicate between processes. The hardware version is very fast as the logic in the read and write phase is evaluated in a single cycle. The IXP is a complete software-only solution, the MicroBlaze option requires software compilation and hardware synthesis, and the full hardware solution requires only hardware synthesis. Between the two extremes a factor of 20 can be found, which is not unusual. A synthesis step easily takes hours, while programming the IXP solution takes minutes.

8. CONCLUSION AND FUTURE WORK

In the introduction, we posed three questions we wanted to answer. The first question was whether the IXP can be used for stream-based applications. The second: can we easily map KPNs onto the IXP? And the third: can we do this in an automated and efficient way? By using the Compaan compiler and the developed IMCA back-end, we showed how the QR and DWT algorithms can be automatically partitioned and mapped onto the IXP. Therefore, we believe all three questions can be answered with a yes. But a remark has to be made here. The IXP has some very attractive hardware elements, making the architecture potentially attractive for a broader class of stream-based applications besides internet packet traffic handling. However, to determine whether the performance will be enough to make a platform like the IXP interesting for stream-based applications, we need to perform many more experiments and obtain more quantitative information. In all, the Compaan compiler with the IMCA back-end provides a simple programming model to a developer, making the deployment of stream-based applications on an IXP much easier. In this paper, we present only a limited number of experiments and quantitative results, making it hard to make firm statements about the applicability of the IXP architecture for stream-based applications. Still, this work may put the IXP in a different light, possibly inspiring others besides us to experiment with mapping stream-based applications onto the IXP. Furthermore, we believe that dedicated hardware support for the KPN semantics (i.e., FIFO channels and blocking reads) is very interesting when trying to map KPNs onto multiprocessor architectures.

9. ACKNOWLEDGMENT

We thank Todor Stefanov and Hristo Nikolov for the useful discussions and generation of the results for the FPGA platform using ESPAM. We thank Bin Jiang for the results of QR using LAURA.

10. REFERENCES

[1] Matthew Adiletta, Mark Rosenbluth, and Debra Bernstein. The next generation of Intel IXP network processors. Intel Technology Journal, 6(3), Aug 15, 2002.

[2] Herbert Bos and Kaiming Huang. On the feasibility of using network processors for DNA queries. In Proceedings of the Third Workshop on Network Processors & Applications (NP-3), pages 183–195, 2004.

[3] Lal George and Matthias Blume. Taming the IXP network processor. In PLDI '03: Proceedings of the ACM SIGPLAN 2003 Conference on Programming Language Design and Implementation, pages 26–37, New York, NY, USA, 2003. ACM Press.

[4] C. A. R. Hoare. Communicating sequential processes. Commun. ACM, 21(8):666–677, 1978.

[5] G. Kahn. The semantics of a simple language for parallel programming. In J. L. Rosenfeld, editor, Information Processing, pages 471–475, Stockholm, Sweden, Aug 1974. North Holland, Amsterdam.

[6] Bart Kienhuis, Edwin Rijpkema, and Ed F. Deprettere. Compaan: Deriving process networks from Matlab for embedded signal processing architectures. In Proc. 8th International Workshop on Hardware/Software Codesign (CODES'2000), San Diego, CA, USA, May 3–5 2000.

[7] Long Li, Bo Huang, Jinquan Dai, and Luddy Harrison. Automatic multithreading and multiprocessing of C programs for IXP. In PPoPP '05: Proceedings of the Tenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 132–141, New York, NY, USA, 2005. ACM Press.

[8] D. Pham et al. The design and implementation of a first-generation Cell processor. In ISSCC Digest of Technical Papers, pages 184–185, 2005.

[9] Hristo Nikolov, Todor Stefanov, and Ed Deprettere. Multi-processor system design with ESPAM. In CODES+ISSS '06: Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis, pages 211–216, New York, NY, USA, 2006. ACM Press.

[10] Niraj Shah, William Plishker, Kaushik Ravindran, and Kurt Keutzer. NP-Click: A productive software development approach for network processors. IEEE Micro, 24(5):45–54, 2004.

[11] Todor Stefanov, Bart Kienhuis, and Ed Deprettere. Algorithmic transformation techniques for efficient exploration of alternative application instances. In CODES '02: Proceedings of the Tenth International Symposium on Hardware/Software Codesign, pages 7–12, New York, NY, USA, 2002. ACM Press.

[12] Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis, and Ed Deprettere. System design using Kahn process networks: The Compaan/Laura approach. In Proceedings of DATE 2004, Paris, France, Feb 16–20 2004.

[13] Paul Stravers and Jan Hoogerbrugge. Homogeneous multiprocessing and the future of silicon design paradigms. In Proc. International Symposium on VLSI Technology, Systems, and Applications (VLSI-TSA 2001), April 2001.

[14] NST: Network Speed Technologies. http://www.network-speed.com.

[15] Alexandru Turjan, Bart Kienhuis, and Ed Deprettere. Translating affine nested loops to process networks. In International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES), Washington, D.C., September 23–25 2004.

[16] Ben Wun, Jeremy Buhler, and Patrick Crowley. Exploiting coarse-grained parallelism to accelerate protein motif finding with a network processor. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques (PACT'05), pages 173–184, 2005.

[17] Claudiu Zissulescu. Synthesizing Process Networks to VHDL. PhD thesis, LIACS, Leiden University, The Netherlands, 2007.

[18] Claudiu Zissulescu, Todor Stefanov, Bart Kienhuis, and Ed Deprettere. LAURA: Leiden Architecture Research and Exploration Tool. In Proc. 13th Int. Conference on Field Programmable Logic and Applications (FPL'03), 2003.