An Efficient and Low-Cost Input/Output Subsystem for Network Processors †

Ioannis Sourdis
Microprocessor & Hardware Laboratory, ECE Department
Technical University of Crete
Chania, Crete, GR 73 100, Greece
Tel. +30 28210 37295
[email protected]

Dionisios Pnevmatikatos
Microprocessor & Hardware Laboratory, ECE Department
Technical University of Crete
Chania, Crete, GR 73 100, Greece
Tel. +30 28210 37344
[email protected]

Kyriakos Vlachos
Bell Laboratories AT - Lucent Technologies
1200 BD Hilversum, The Netherlands
Tel. +31 (0)35 687 4801
[email protected]

ABSTRACT
We present the architecture and implementation of an input/output subsystem for a cost-effective network processor. We believe that adding processing power to a networking chip is relatively straightforward; it is the transfer of data to and from the processor(s) that cannot keep up with high wire speeds. To address this limitation we use a hardwired input/output subsystem that transfers data directly into the processing core's register file. Using a simple scalar RISC core at 200 MHz, we are able to sustain stateful-inspection firewall processing of TCP traffic at 2.5 Gbps.

Categories and Subject Descriptors
B.4.1 [INPUT/OUTPUT AND DATA COMMUNICATIONS]: Data Communications Devices. C.1.3 [PROCESSOR ARCHITECTURES]: Other Architecture Styles.

General Terms
Design, Performance.

Keywords
Network Processors, Processor Architecture, Input/Output Subsystems.

1. INTRODUCTION
With the expansion of the Internet and ever-increasing line speeds, the execution of the various networking protocols is becoming the main bottleneck in high-speed communications. Gigabit Ethernet is already available and deployed, while products already exist for 10 Gigabit per second transfer rates. New bandwidth-eager end-user software applications and faster processors in desktop and server systems are placing enormous demands on the current networking infrastructure. As a rule of thumb, the use of network bandwidth doubles every four months. Furthermore, guaranteed quality and priority customization are anticipated for many data, voice and video applications. To meet these stringent processing demands, designers are faced with two alternatives. The first is to create a custom solution for their application with an ASIC. The second is to use a commercially available network processor [1][2]. The ASIC approach can certainly achieve very high processing speeds, but it is inflexible, since changes in the chip-set behavior are either not permitted, or allowed only to a very limited extent. Since applications and protocols continue to evolve and are extended to meet user-desired functionality, this is a significant concern. In addition, ASIC design times are long, leading to potentially increased time to market.

Commercial network processors (such as Intel's IXP [2][3], Motorola's C-Port [4], etc.) are programmable, so they can be reprogrammed to comply with newer protocols. However, they represent a brute-force approach, in that they use multiple (simple) processing cores in order to achieve the desired processing performance levels. The developer has to comply with several system restrictions, write a parallel program in a (potentially) heterogeneous environment, and ensure that all the hard real-time constraints of network processing are met. In addition, although considerable progress has been made, the software development tools for such processors are still in their infancy.

† Also with the Institute of Computer Science (ICS), Foundation for Research and Technology-Hellas (FORTH).

For each packet {
    1. Identify connection ID (flow), packet classification
    2. Get state information (i.e. last packet seen, etc.)
    3. Consult selected fields (parts of header, body)
    4. Execute protocol code on state and selected fields
    5. Update (?) packet and flow state
    6. Send (?) updated packet
    7. Create (??) other control packets
}

Figure 1. Protocol Processing Abstract Pseudo-code
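To make the abstract steps concrete, the C sketch below walks the same per-packet sequence. All types and helper names are illustrative placeholders, not an API from any system discussed in this paper.

    #include <stdint.h>

    struct packet;       /* raw packet as received from the wire         */
    struct flow_state;   /* per-connection state (e.g. last packet seen) */

    /* Helper routines assumed to exist for illustration only. */
    extern uint32_t classify(const struct packet *p);                  /* step 1   */
    extern struct flow_state *get_state(uint32_t flow_id);             /* step 2   */
    extern void run_protocol(struct packet *p, struct flow_state *s);  /* steps 3-5 */
    extern void send_updated(struct packet *p);                        /* step 6   */

    void process_one_packet(struct packet *p)
    {
        uint32_t flow_id = classify(p);             /* identify connection ID */
        struct flow_state *s = get_state(flow_id);  /* fetch flow state       */
        run_protocol(p, s);      /* consult fields, execute protocol code,    */
                                 /* update packet and flow state (optional)   */
        send_updated(p);         /* send the updated packet (optional)        */
        /* step 7: the protocol code may also create further control packets */
    }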

We believe that currently available processors provide sufficient processing power to execute the networking protocols in use today. However, we see that even fast workstations fail to perform this processing even at a 1 Gbps rate. The reason is that while the processor offers high processing potential, its input/output abilities lag considerably. To alleviate this bottleneck, we propose an intelligent input/output technique that offers two important advantages. First, it relieves the processing core from I/O duties, freeing resources that can be used to meet the protocol-processing needs. Second, it facilitates application development, since existing tool-chains (compilers, etc.) can be augmented to utilize the added features with minor modifications. We demonstrate the viability of this approach by implementing the proposed technique in the context of Pro3, a single-chip network processor capable of delivering stateful-inspection firewall processing of TCP packets at 2.5 Gbps. The contributions of this work are twofold: (i) we propose an efficient I/O architecture for network processing cores, and (ii) we implement and evaluate the circuit performance and cost of the proposed architecture.

The rest of the paper is organized as follows. In Section 2, we describe a typical context for protocol processing, elaborate on the various bottlenecks that need attention, and describe the Pro3 architecture, the framework in which we developed our own. In Sections 3 and 4, we present the proposed architecture and its implementation, and discuss their implications on hardware and software development. We also discuss the hardware/software interface that is needed to efficiently utilize the proposed architecture. In Section 5, we present quantitative results that show the potential of our approach, and in Section 6 we compare our approach to other research. Finally, we draw our conclusions in Section 7.

2. NETWORK PROCESSING FUNDAMENTALS
Protocol-processing functions can be abstracted according to the pseudo-code of Figure 1 (question marks indicate optional steps). The basic functions of a network processor correspond to these steps. Flow classification takes the header information (such as source and destination IP address) and produces a tag (flow-ID) that is used internally for indexing the connection-related internal data structures. Subsequently, these structures must be read in order to obtain information about the state of this connection. This information, together with the selected header fields, is usually sufficient for the protocol software to decide its course of action. The action usually involves forwarding the packet (possibly with modifications) to its destination, and can potentially create more control (and possibly data) packets. This general framework for network processing is applicable to many applications such as firewalls, routers/gateways, etc. A general-purpose processor can perform most of these steps at a high processing rate. However, the pin bandwidth for transferring the header fields (and the packet in general) and the state information is limited. Processors are optimized to perform very well for predictable programs, where caches offer low-latency accesses. The connection tables and the packets present limited locality, underutilizing the processor's abilities. Similarly, the low-performance computation engines found in today's network processors are limited in both processing and input/output potential. While the actual processing portion of the above code outline is, for most applications, relatively short and manageable (register manipulation), the input and output portions are expensive operations: the locations present only spatial and no temporal locality, and the latency to collect all the data is quite high. Based on this observation, we propose that regular processing cores, coupled with intelligent I/O, can provide the required performance levels at low cost. We next describe the overall architecture of Pro3, a programmable network processor chip that incorporates the proposed technique to achieve stateful-inspection firewall processing of TCP packets at 2.5 Gbps.
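As an illustration of the classification step, the following hedged C sketch hashes the connection 5-tuple down to a flow-ID. In Pro3 this step is performed in hardware (a CAM lookup); the hash here is only a software stand-in, and unlike a CAM it would need collision handling in a real classifier.

    #include <stdint.h>

    /* The 5-tuple that identifies a TCP/IP connection. */
    struct five_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
        uint8_t  proto;
    };

    #define FLOW_ID_BITS 19   /* 2^19 = 512K flows, the Pro3 target */

    /* Map header fields to a compact flow-ID via a simple
     * multiplicative hash; illustrative only. */
    uint32_t classify_tuple(const struct five_tuple *t)
    {
        uint32_t h = t->src_ip;
        h = h * 31u + t->dst_ip;
        h = h * 31u + t->src_port;
        h = h * 31u + t->dst_port;
        h = h * 31u + t->proto;
        return h & ((1u << FLOW_ID_BITS) - 1);
    }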

Figure 2. The Pro3 Architecture (a RISC CPU with CAM, CPU-RAM, Control-RAM and SDRAM interfaces and timers; two RPMs, each an FEX/PPE/FMO pipeline with insert/extract logic on the internal bus; ATM/CPCS RX/TX pre- and post-processing layers with CRC; a packet classifier; a data memory manager with storage DRAM; traffic scheduling with its scheduling memory; an external host CPU is optional)

2.1 Pro3 Architecture

The Pro3 system architecture [8][9][10] aims at accelerating the execution of telecom protocols by extending a scalar RISC core with programmable, pipelined hardware. The concept of Pro3 is to provide the required processing power through a novel architecture that incorporates parallelism and pipelining wherever possible, integrating generic micro-programmed engines with hardwired components optimized for specific protocol-processing tasks. The system is designed to support 2.5 Gb/s links for up to 512K active connections (or flows), corresponding to 7.5 Mpackets/sec of worst-case TCP traffic. The Pro3 processor (Figure 2) integrates a low-cost, power-efficient scalar RISC processor [11] with a reconfigurable, pipelined module able to deliver the processing power needed to support many thousands of (different) protocol instances, called flows. The heart of the protocol processing occurs in the Reconfigurable Pipelined Module (RPM). Two RPM modules, shown in Figure 2, operate in parallel to allow the execution of protocols with different incoming and outgoing data-flow processing, as well as load balancing for higher throughput.

2.2 Reconfigurable Processing Module
The RPM is an innovative module optimized to perform packet processing. Each RPM consists of a Protocol Processing Engine (PPE), which includes a modified Hyperstone RISC core [11], surrounded by a Field Extraction (FEX) engine that loads the required protocol data directly into the RISC for processing, and a Field Modification (FMO) engine for packet construction and header modification. These three modules form a powerful three-stage pipeline that is the processing heart of the system (Figure 3).

The Field Extraction (FEX) engine is a small RISC with a three-stage pipeline architecture. It is fully programmable and runs protocol- or application-specific firmware. Only specific fields are extracted from the received portion of the packet; the payload does not need to be sent in its entirety from the DMM to the RPM for processing. The Field Modification (FMO) engine is similar to FEX but performs the dual operation. It also adopts a three-stage, programmable pipeline architecture. Its task is to compose the outgoing protocol message (byte stream), taking as input the processing results from the PPE (fields) and the data in the Delay FIFO, according to the protocol specifications (firmware controlled). The components (FEX, PPE, FMO) are able to process data with a maximum throughput of 6.4 Gbps, accomplished by using a 32-bit-wide data path and an operating clock frequency of 200 MHz (the system clock).

Figure 3. The Processing Pipeline (the FEX engine feeds the packet header fields and flow state, selected by flow-ID, to the Protocol Processing Engine; the FMO engine turns the modified fields and modified flow state into the modified packet)

We concentrate on the PPE module that is the heart of the protocol processing in Pro3.

3. THE PPE ARCHITECTURE
The PPE module employs a smart I/O subsystem to off-load the processor core from the transfer of packet data and state information. The architecture of our system is shown in Figure 4. It is based on the assumption that the processor can be modified to provide additional ports to its register file. The control logic reads the incoming packet data from the transceivers and passes it directly to the processor registers. The state control logic matches the incoming packet with its flow state, reads the required fields from memory, and again passes the information to the processor. Here we assume that the classification step has already been performed; flow classification is well defined, and can easily be performed either by hardwired blocks or using CAMs.

The number of registers used in our technique presents a tradeoff: larger register files are slower, but allow more data to be transferred without stalling. We performed an analysis of protocols and found that 32 registers are sufficient to hold the necessary state information and the packet header. Processing the packet body requires more registers, but can be performed in a pipelined fashion, using 32 registers at a time. Processors nowadays contain more than 32 registers (for register renaming), and 64 registers can be reasonably fast. We overlap the use of registers between the input and output functions in order to reduce the size of the register file. In our simulations we assume that two sets of 32 registers are used for processing and I/O of packets, and another 32 registers are used for program variables and computations.
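As an illustration of processing a packet body 32 registers at a time, consider the C sketch of the software view below. The window refill that the PPE control logic performs in hardware is modeled here as a plain copy, and compute_on_window is a hypothetical placeholder.

    #include <stdint.h>

    #define WINDOW_WORDS 32   /* packet-body words per register set */

    /* Protocol computation over one register window; illustrative. */
    extern void compute_on_window(const uint32_t *regs, unsigned words);

    /* Stream a packet body through the register file in 32-word
     * windows; in the PPE one set is refilled while the other is
     * being processed. */
    void process_body(const uint32_t *body, unsigned words)
    {
        uint32_t regs[WINDOW_WORDS];   /* models one set of I/O registers */
        for (unsigned base = 0; base < words; base += WINDOW_WORDS) {
            unsigned n = words - base;
            if (n > WINDOW_WORDS)
                n = WINDOW_WORDS;
            for (unsigned i = 0; i < n; i++)
                regs[i] = body[base + i];   /* "input" stage fills window */
            compute_on_window(regs, n);     /* "process" stage runs on it */
        }
    }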

The processor is activated when all the necessary information is available, and commences the protocol processing in its registers. The produced results are also placed in registers, and the control logic extracts the data and transmits it (if needed) to the outside world.

The advantages of this architecture are:
• It off-loads the processor: the processor need not execute load and store instructions for state and packet I/O.
• It creates a high-level, three-stage pipeline between input packet-data transfer, processing, and output of data packets. This pipeline allows the overlapping of these tasks and leads to better overall performance.

Issues and potential limitations of this architecture are:
• Input, processing and output must be synchronized, since they all share the processor's registers.
• The total number of registers is a potential limit of this architecture.
• A powerful hardware/software interface is needed to interact with the input and output portions of the control logic to transfer the data packets.

Fortunately, solutions for all these issues exist. In order to achieve maximum bandwidth utilization with a single flow, the processing of packets of the same flow has to occur in a pipelined fashion. However, there is a dependency between the processing of different packets of the same flow via the flow state information (i.e., via the Control RAM). To achieve pipelined operation while maintaining correctness in the presence of pipelined processing of same-flow packets, it is necessary to add bypass logic that short-circuits the latest version of the flow state to the processing of later packets.

Figure 4. The Proposed PPE Architecture (the input control logic moves packet fields, and the state control logic moves flow state from the state memory, directly into the processor's in/out registers; the output path extracts updated packet fields and updated flow state)

The software/hardware interface is a very important subject, and we discuss it in detail in the next sub-section.

3.1 Hardware/Software Interface
For our architecture to work, the hardware must coordinate with the software that runs on the processor in two ways: (i) upon input of a packet, the software should initiate the packet processing, and (ii) upon completion of the processing, the hardware should be notified to commence the output of the processed packet. These are two very different requirements. For the first, we devised a simple handshake scheme: a wire indicates to the processor the availability of a new packet to be processed, while the processor maintains an "Idle" signal when it is not processing any packet. When the Idle signal is deactivated, the packet processing has begun; when it is activated again, the processing is completed, and the output of the processed packet can begin. For efficiency reasons, we use another mechanism to initiate the packet processing. The processor can spin using the following instructions:

    Wait:   move          R4, Wait
            jump_register R4

This is an infinite loop, but our hardware, after transferring the packet data, writes into R4 the address of the packet handler (different from Wait). Therefore, the loop ends, and the processing of the packet begins. To achieve this, the hardware supports a dispatch table, with which each type of packet is associated with a starting point in the software code.
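To make the wake-up mechanism concrete, here is a minimal C model of the spin loop and dispatch table. The names (r4, hw_packet_loaded, etc.) are ours, and the hardware write to R4 is modeled as a plain store to a volatile variable.

    #include <stdint.h>

    #define NUM_PACKET_TYPES 16

    typedef void (*handler_fn)(void);

    static void idle(void) { /* target of the self-jump: keep spinning */ }

    /* One software entry point per packet type, filled at initialization. */
    static handler_fn dispatch_table[NUM_PACKET_TYPES];

    /* Software model of register R4: the input hardware overwrites it
     * with dispatch_table[type] once the packet data sits in the
     * register file. */
    static volatile handler_fn r4 = idle;

    /* Model of the hardware side: a packet of a given type was loaded. */
    void hw_packet_loaded(unsigned type)
    {
        r4 = dispatch_table[type % NUM_PACKET_TYPES];
    }

    /* Software equivalent of "Wait: move R4, Wait / jump_register R4". */
    void dispatcher(void)
    {
        for (;;) {
            handler_fn h = r4;   /* jump through R4                       */
            if (h == idle)
                continue;        /* still pointing at Wait: keep spinning */
            r4 = idle;           /* re-arm the loop for the next packet   */
            h();                 /* enter the handler for this type       */
        }
    }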

On the output side, the interface is more complicated. To output a packet, the control logic needs to know the size, where to send it, and other similar information. Moreover, the result of processing may be multiple packets, not just one. To handle these cases, we defined a format for a Software Result Register. This register is the first register read by the output logic, and it defines all the subsequent actions. The format divides the register into the following fields:

    Length           How many words to transfer
    Start register   Number of the first register
    MoreFields       More fields follow (pipelining)
    Type             Type of packet (internal)
    MemPacket        Allows large or multiple packets to be stored in memory
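To make the layout concrete, a sketch of composing such a result value in C follows. The paper does not specify the bit positions or widths of these fields, so the ones below are assumptions made purely for illustration.

    #include <stdint.h>

    /* Assumed bit layout of the Software Result Register (illustrative). */
    #define SRR_LENGTH(n)    ((uint32_t)((n)  & 0xFFu))        /* words to transfer     */
    #define SRR_START_REG(r) (((uint32_t)((r) & 0x3Fu)) << 8)  /* first register number */
    #define SRR_MORE_FIELDS  (1u << 14)                        /* more fields follow    */
    #define SRR_TYPE(t)      (((uint32_t)((t) & 0xFu)) << 15)  /* internal packet type  */
    #define SRR_MEM_PACKET   (1u << 19)                        /* packet kept in memory */

    /* Example: emit 12 words starting at register 8, internal type 3,
     * with no further field groups and no memory-resident packet. */
    uint32_t example_result = SRR_LENGTH(12) | SRR_START_REG(8) | SRR_TYPE(3);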


These fields are interpreted by the output control logic. Note that in the case of multiple output packets, we use one Software Result Register per packet. Because it is register-based, the proposed hardware/software interface can easily be expressed in software with a function call (or a system call) that on return updates the appropriate registers, as sketched below. Therefore, even with the defined interface, traditional compiler tools can be used to develop, optimize, and debug the application code. This advantage can be crucial for timely product development.
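A minimal sketch of such a wrapper follows; the names (emit_packet, write_result_register, assert_idle) are ours, not an actual Pro3 API.

    #include <stdint.h>

    extern void write_result_register(uint32_t value); /* modeled register access     */
    extern void assert_idle(void);                     /* handshake: processing done  */

    /* Hide the register-based output interface behind a plain function
     * call, so an unmodified compiler toolchain can be used. */
    static inline void emit_packet(uint32_t software_result_register)
    {
        write_result_register(software_result_register); /* read first by output logic */
        assert_idle();   /* signal the hardware to start the output stage */
    }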

4. PPE IMPLEMENTATION
The PPE module consists of three units, shown in Figure 5: the Modified RISC core (MHY), the Read-Write control RAM unit (RWR), and the RPM Glue Logic (RPG). The Input module transfers packets (fields, flow state, dispatch PC) into the MHY register file for processing. The Output module transfers the processing results (fields, flow state, commands) to the Field Modifier module, and CMM is a control module responsible for the initialization of all internal structures (dispatch table, etc.). The Input module is also responsible for internal state bypass: it detects where the latest version of the flow state for each incoming packet is located. Since the processing is pipelined, the correct state information can be found either in the MHY register file (just updated by the previous packet), in the RWR FIFOs (the state was updated but not yet written to memory), or in the state FIFO. The most important issue in the RPG is the synchronization between input, output and the RISC processor, especially because of the variable duration of every pipeline stage. The RWR module performs three major tasks. For each packet, it reads the appropriate state information from the Control RAM and provides it to the Input sub-module of the RPG. It receives the updated state of the newly processed packet and writes it to the Control RAM. Finally, it acts as a searchable write buffer to ensure that reading the Control RAM will always provide the correct results.

Figure 5. The micro-architecture of PPE (the RPG contains the Input and Output blocks, the dispatch table and the CMM, and connects the modified HY RISC to the rest of the chip over the uP bus; the RWR contains read and write data blocks, flow-ID and state FIFOs, an external bypass FIFO, and a write buffer on the Control RAM interface)
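The searchable write buffer can be pictured as follows: a hedged C sketch of the lookup path, with an assumed buffer depth and flow-state size, and control_ram_read as a stand-in for the memory interface.

    #include <stdint.h>
    #include <string.h>

    #define WBUF_ENTRIES 8    /* assumed write-buffer depth              */
    #define STATE_WORDS  16   /* assumed flow-state size (32-bit words)  */

    struct wbuf_entry {
        uint32_t flow_id;
        uint32_t state[STATE_WORDS];
        int      valid;
    };

    static struct wbuf_entry wbuf[WBUF_ENTRIES];

    /* Backing Control RAM access; illustrative stand-in. */
    extern void control_ram_read(uint32_t flow_id, uint32_t out[STATE_WORDS]);

    /* Read the flow state for a packet. Pending writes are searched
     * first, so a state update still queued for memory is bypassed to
     * the next packet of the same flow; only on a miss is the Control
     * RAM actually read. */
    void read_flow_state(uint32_t flow_id, uint32_t out[STATE_WORDS])
    {
        for (int i = 0; i < WBUF_ENTRIES; i++) {
            if (wbuf[i].valid && wbuf[i].flow_id == flow_id) {
                memcpy(out, wbuf[i].state, sizeof wbuf[i].state);
                return;                    /* bypass hit: newest version */
            }
        }
        control_ram_read(flow_id, out);    /* miss: fetch from memory */
    }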

We implemented all the PPE modules except the RISC processor in VHDL. The modified Hyperstone RISC was provided to us as a hard macro, along with a gate-level model that was used for simulation purposes. Our performance target was to operate the PPE module at 200 MHz, the operating frequency of the Pro3 chip, which is sufficient to achieve line-speed processing at 2.5 Gbps. The design consists of semi-custom logic for all control and storage, except for large buffers that use generated memories. The resulting design was synthesized for the UMC 0.18µm technology, and after minor design modifications we achieved the target operating frequency. The design was incorporated in the Pro3 chip, which has been fabricated and is currently under testing.

5. PPE EVALUATION
Our evaluation consists of (a) an analysis of the processing efficiency of the architecture, and (b) an analysis of the implementation cost of the architecture.

In order to measure the performance of the architecture we used the VHDL simulations to obtain the actual number of useful cycles and the number of stalled cycles, for a variety of input packets (of different lengths) and total processing times. We assumed 10 different active IP connections, with processing times varying from 20 to 60 cycles. We randomly selected 20% of the packets to produce more than one (two) responses. We also varied the Control RAM response time between 5 and 10 cycles to model contention between multiple processing engines, and we simulated the system for packet sizes of 40, 80 and 120 bytes (10, 20 and 30 32-bit words). The results of these simulations are presented in detail in Table 1. The first column of Table 1 gives the number of cycles required to execute the protocol code, the next three columns give the resulting packet latency, and the last three the corresponding efficiency (the percentage of time during which useful work is performed), relative to the absolute minimum processing time, which excludes all input and output of data to and from the processor.

    Process         Packet latency (cycles)        Efficiency
    duration    40-byte   80-byte   120-byte   40-byte   80-byte   120-byte
    (cycles)    packets   packets   packets    packets   packets   packets
       20          27        42        42        75%       47%       47%
       30          33        40        42        91%       75%       71%
       40          43        44        45        93%       91%       89%
       50          53        53        53        94%       94%       94%
       60          63        63        63        95%       95%       95%

Table 1: Processing efficiency versus process duration.
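In other words, the efficiency column is (to within rounding) the ratio of protocol-process cycles to total packet latency. The short check below recomputes the efficiency column from the duration and latency columns of Table 1; it reproduces the table to within about a percentage point.

    #include <stdio.h>

    int main(void)
    {
        /* Process durations and measured packet latencies (cycles) from
         * Table 1; latency columns are 40-, 80- and 120-byte packets. */
        static const int duration[5]   = { 20, 30, 40, 50, 60 };
        static const int latency[5][3] = {
            { 27, 42, 42 },
            { 33, 40, 42 },
            { 43, 44, 45 },
            { 53, 53, 53 },
            { 63, 63, 63 },
        };

        for (int i = 0; i < 5; i++) {
            printf("%2d cycles:", duration[i]);
            for (int j = 0; j < 3; j++)
                printf("  %5.1f%%", 100.0 * duration[i] / latency[i][j]);
            printf("\n");
        }
        return 0;
    }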

The conclusion we can draw from these results is that the hardwired I/O mechanism is quite effective when the computation cost is high. This is usually the case, since 40 or 50 instructions correspond to low-level, highly tuned protocol code; in several cases the cost can be even larger, further improving the efficiency of the architecture. Note also that this efficiency should be compared to the efficiency of the same code including the instructions that perform the I/O, which generally generate a significant number of stall cycles. Figure 6 plots the efficiency of processing as a function of process duration in a more intuitive fashion. The operating efficiency of the pipeline improves as the processing time for the packet grows, meaning that for larger processing times, a larger fraction of the input and output of packet and state data is hidden and provided by the system "for free". The figure also shows that our technique achieves most of its benefits when the processing time exceeds 40 cycles.

Figure 6: Efficiency versus protocol process duration (process cycles over total cycles, for 40-, 80- and 120-byte packets)

These results are encouraging, but do not address the optimality of our approach compared to other architectural alternatives. To address this limitation we defined two simulated architectures. The first, Serial I/O (SEQ), operates exactly like the PPE but assumes memory-mapped FIFO interfaces that are read and written using load and store instructions; it models an unmodified processor integrated in the processing pipeline, performing the input, processing, and output operations in sequence. SEQ is, however, unrealistic in that it assumes that the FIFOs always have data to deliver and are never full. Therefore, SEQ models the best case for the serial processing of packets.

The second modeled alternative is an optimized version of SEQ, called SEQ2. In SEQ2, only half of the header fields and state information is read or written by the processor. Hence, SEQ2 models a more aggressive implementation that only loads the useful header and body fields. Note, however, that to support random access between fields, such an architecture requires that packets be placed in memory rather than in FIFOs. To compare the PPE approach and the two architectural alternatives we performed a series of experiments. We varied the process duration from 20 to 60 cycles, and we simulated packet sizes of 40, 80 and 120 bytes. The results are shown in Figures 7-9. In these three figures we can clearly see that the PPE approach is successful in overlapping the I/O operations required for packet processing. SEQ is clearly worse, and in some cases requires almost twice as much time as PPE to process a packet. SEQ2 is better, since it makes fewer references to the memories. However, as described above, both SEQ and SEQ2 do not model the overheads between the pipeline and their data; had we modeled these overheads, the difference between the PPE performance and the SEQ and SEQ2 performance would have been even greater.

Figure 7: Efficiency versus process duration for 40-byte packets (SEQ, SEQ2, PPE)

Figure 8: Efficiency versus process duration for 80-byte packets (SEQ, SEQ2, PPE)

Figure 9: Efficiency versus process duration for 120-byte packets (SEQ, SEQ2, PPE)
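A back-of-the-envelope model illustrates why: a serial design pays roughly input + process + output per packet, while a full three-stage pipeline pays only the slowest of the three stages. The sketch below uses assumed I/O costs (13 cycles each, roughly a 40-byte packet plus some state over a 32-bit path) and, like SEQ/SEQ2, ignores FIFO and synchronization overheads.

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative per-packet I/O costs in cycles; assumptions only. */
        const int in = 13, out = 13;

        for (int proc = 20; proc <= 60; proc += 10) {
            int serial = in + proc + out;   /* SEQ-like: steps in sequence  */
            int pipe   = proc;              /* PPE-like: slowest stage wins */
            if (in  > pipe) pipe = in;
            if (out > pipe) pipe = out;
            printf("proc=%2d  serial=%3d  pipelined=%3d  speedup=%.2f\n",
                   proc, serial, pipe, (double)serial / pipe);
        }
        return 0;
    }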

The next question is: how much must we spend to achieve this performance? To answer this question we synthesized the HDL code for the input and output portions of the design. We do not include the gates and area of the processing core, since we want to measure just the cost of the extension. We used a general 0.18µm library and the Synopsys dc_shell to perform the synthesis. The results are presented in Table 2.

    Table 2: Synopsys Synthesis Results (area in λ², for λ = 0.18 µm)

            Cell#     Combinational Area   Non-Combinational Area   Total Area
    PPE     10,028    136,974              933,235                  1,070,206

These results include all the structures needed by our architecture, but exclude the cost of changes in the processing core. This cost is relatively hard to measure, since it varies depending on the architecture of the core; it comprises the extra registers in the register file (64 in our simulations) and the implementation of the hardwired handshake mechanism for signaling processing completion.

6. RELATED WORK

The Pipe processor [5][6] used (two) register-mapped queues to synchronize the producer/consumer relationship in its decoupled architecture. The iWarp processor [7] also used a register-mapped network interface to communicate efficiently with its systolic peers. We use a similar handshaking mechanism to wake up the software efficiently, but we advocate the use of many registers to present the packet data efficiently to the application code.

Many network-processing architectures have been proposed, and several products are available today. The network processor that exhibits the most similarities to our work is the IXP. In the IXP, each micro-engine can initiate a transfer of a 64-byte block from the main chip buffer that holds the packets. However, in the IXP each micro-engine supports up to four threads, along with the necessary thread synchronization and scheduling. Furthermore, these operations do not map easily onto compilers, leaving programmers to manage this functionality explicitly. Our approach targets simpler processing cores, and is easier to integrate into an existing compiler chain.

7. CONCLUSIONS

We have presented the design of a custom I/O subsystem for augmenting a regular processor core with network-processing abilities. Our design is simple, requires modest resources for its implementation, and requires only modest changes to pre-existing processing cores for its integration. Furthermore, it allows significant overlap between the I/O and the processing activities, especially when the processing takes 40 or more cycles. We were able to use this architecture to augment a simple, scalar RISC processor, and achieve the execution of a stateful-inspection firewall application for TCP packets at 2.5 Gbps. We believe that our approach is ideal for cost-conscious products, and that it can even be used in high-end systems that incorporate multiple processing cores.

8. ACKNOWLEDGMENTS
This work has been performed under a Lucent subcontract in the context of the IST Pro3 project. Jorge Sanchez, Nikos Nikolaou, and Kostas Pramataris have contributed to the development of the PPE architecture. We thank them all.

9. REFERENCES

[1] Samuel J. Barnett, "When Shopping For Network Processors, One Size Does Not Fit All", Electronic Design Magazine, April 2, 2001.

[2] Peter N. Glaskowsky, "Network Processors Mature in 2001", Microprocessor Report (http://www.mpronline.com/), February 19, 2002.

[3] Intel IXP1200 home page: http://developer.intel.com/design/network/products/npfamily/ixp1200.htm

[4] Motorola, C-Port, http://e-www.motorola.com/

[5] M. Farrens, A. Pleszkun, "Implementation of the Pipe Processor", IEEE Computer, Vol. 24, No. 1, January 1991, pp. 65-70.

[6] J. R. Goodman, J. T. Hsieh, K. Liou, A. R. Pleszkun, P. B. Schechter and H. C. Young, "PIPE: a VLSI Decoupled Architecture", in Proc. of the 12th Annual International Symposium on Computer Architecture, pp. 20-27, June 1985.

[7] S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, J. Webb, "Supporting Systolic and Memory Communication in iWarp", in Proc. of the 17th Annual International Symposium on Computer Architecture, 1990.

[8] G. Konstantoulakis, Ch. Georgopoulos, Th. Orphanoudakis, N. Nikolaou, M. Steck, D. Verkest, G. Doumenis, D. Reisis, J.-A. Sanchez and N. Zervos, "A Novel Architecture for Efficient Protocol Processing in High Speed Communication Environments", in Proc. of the IEEE European Conference on Universal Multiservice Networks (ECUMN'2000), Colmar, France, October 2-4, 2000.

[9] C. Georgopoulos, G. Konstantoulakis, T. Orphanoudakis, N. Nikolaou, J.-A. Sanchez, N. Mouratidis, K. Pramataris, N. Zervos, "A Protocol Processing Architecture Backing TCP/IP-based Security Applications in High Speed Networks", INTERWORKING'2000, Bergen, Norway, October 2000.

[10] N. Nikolaou, J. Sanchez-P., T. Orphanoudakis, D. Pollatos, N. Zervos, "Application Decomposition for High-Speed Network Processing Platforms", in Proc. of the 2nd European Conference on Universal Multiservice Networks (ECUMN'2002), Colmar, France, April 8-10, 2002.

[11] Hyperstone Electronics, E1-32X RISC/DSP, http://www.hyperstone-electronics.com
