A New Network Processor Architecture for High-Speed Communications

Frank Engel and Gerhard Fettweis Chair Mobile Communications Dresden University of Technology 01069 Dresden, Germany [engel,fettweis] @ifn.et.tu-dresden.de

Xiaoning Nie and Lajos Gazsi Infineon Technologies St.-Martin-Str. 76 81541 München, Germany [Xiaoning.Nie,Lajos.Gazsi] @infineon.com

Abstract - Today many applications require high-speed communications. To provide for protocol processing with efficient hardware and software, the use of a flexible and efficient platform becomes very important for high-speed communication networks. In this paper we first describe a top-level view of the implementation platform for handling communication protocols. From this top-level view we derive the requirements on a Network Processor (NP) that is particularly useful for high-speed communication devices. An efficient NP architecture is then designed and implemented to meet these requirements. The key features of this NP include bit-field instructions, port-based instructions and zero-latency task switching, among others.

1. Introduction

Today's communications technology is ready for high-quality multimedia applications at speeds of several megabits per second. Throughputs of up to 19 Mbit/s and 38 Mbit/s are being discussed for a 6 MHz terrestrial broadcast channel and a cable television channel, respectively [1]. xDSL and cable modems make it possible to provide even Personal Computers (PCs) with a bandwidth of more than 2 Mbit/s. The rapidly increasing consumption of bandwidth significantly shapes the development of transport networks and communication terminals. All-optical networks are introduced to carry gigabits of data traffic, and gigabit routers and switches are used to route the data through the network. Running high-speed networking functionality on today's common systems requires that many functions be implemented in a hardwired manner. However, due to the large variety of standards and the fast development of new applications in the field of broadband communication technologies, it is desirable to make as many parts of the system as possible programmable and configurable.

When considering communication networks it is common to make use of the abstraction of layered models [2]. Each layer is characterized by the corresponding protocol. A standardized layering scheme is the so-called Open Systems Interconnection (OSI) model developed by the International Organization for Standardization (ISO) [2]. The ISO/OSI model proposes seven layers: the application layer, the presentation layer, the session layer, the transport layer, the network layer, the data link layer and the physical layer. The Internet protocol stack can be mapped onto the ISO/OSI model. For example, a TELNET application over TCP/IP and a LAN is set up on Ethernet (physical/data link layer), TCP/IP (network/transport layer) and the application layer.

The protocol of each layer is specified in terms of syntax, semantics and timing requirements. It generally does not imply any implementation details. How each layer is implemented in a particular communication device depends on the kind of protocols used, the speed of the data transmission and the available technologies, among other trade-offs. In a lower-speed communication interface, such as a 144 kbit/s ISDN transceiver port, the physical layer may be implemented by means of a programmable digital signal processor (DSP). The upper layers of the ISDN modem could then be handled by software rather than hardware. The situation changes drastically if a very high-speed network is connected. Consider a high-quality MPEG-2 application running on a terminal: the processor would have to work very hard to handle both the MPEG-2 functions and the lower layer protocols of the network interface. If we take low-power constraints into account, involving the host processor in lower layer protocol processing would be much worse in power consumption than optimized hardwired solutions. In fact, the protocols up to the transport layer have some typical characteristics which general-purpose processors are not optimized for [3, 4]. A general-purpose processor would waste a significant portion of its processing power when protocols of the lower layers are processed.
Therefore, lower layer protocols are often handled by hardwired ASICs, particularly within network equipment. As mentioned previously, however, it is highly desirable to make the implementations programmable and configurable. In this paper we first present a top-level view of systems for processing high-speed communication protocols. From this top-level architecture we derive a new processor architecture dedicated to lower layer protocol processing in a high-speed network element.

2. The Top-Level Architecture

Following the ISO/OSI communication model there are seven logical layers [2]. In order to map the functionality of these layers onto implementation means, we define a corresponding top-level architecture of the overall system, which processes the seven layers. A system is an implementation which carries out the functions of protocol processing from layer 1 to layer 7. To describe the system architecture we introduce three system layers, namely the host layer, the adaptation layer and the media layer. Each layer is characterized by its implementation and its targeted functionality. The host layer corresponds to the implementation which carries out functions


such as web browsing or telnet sessions. It may consist of one or several general-purpose processors. The adaptation layer is the implementation which prepares the information for physical transmission and delivers the application-relevant information to the host layer. It may be implemented in a hardwired manner, by a general-purpose processor, or by a dedicated protocol processor, as we shall describe later. The media layer is the implementation which is responsible for reliable physical transmission and reception of the information. It may be hardwired or may consist of one or more DSPs programmed with signal processing algorithms.

Figure 1: Mapping of the OSI layers onto the system layers (host: application layer, adaptation: network layer, media: physical layer)

The partitioning of the functions of the OSI layers onto the implementation layers may differ from case to case. In a lower-speed network it may be advantageous to also map the OSI session, transport and network layer functions onto the system host layer. In high-speed equipment such as routers or switches it is likely necessary to move the transport and network layer functions onto the system adaptation layer. The media layer preferably implements physical layer functions such as modulation and demodulation.

3. Network Processor Architecture

While the system media layer commonly consists of DSPs and some hardwired hardware, the system host layer consists of one or more general-purpose microprocessors. In computer systems it is common to further use the host processor for protocol translation between the application and the lower layers [5]; the system adaptation layer does not appear as a separate hardware block. For high-speed applications, e.g. high-resolution video terminals or routers/hubs, it is in general necessary to separate the middle OSI layers and map their functionality onto the system adaptation layer. For the system adaptation layer a network processor is useful which can process the protocols more efficiently than the general-purpose host processor. What tasks are typical for the network processor (NP)? Basically we deal with


protocols that always contain some header information and the payload information. The header information should be parsed and used as input for the controlling Finite State Machine (FSM). Interface states or state changes should also be notified to higher-layer entities and may cause some programmed actions. The NP should be scalable for processing multiple OSI layers with different protocols and multiple channels. The operations which have to be performed by an NP can be characterized as follows:
• The operations for data transfer (move, load, store, parse) consume a significant portion of the processing power.
• There can be very frequent interrupts, especially if buffering has to be limited due to network delay constraints, e.g. in video communication applications.
• There are many operations involved in the finite state machines.
• For routing/switching, efficient access to lookup tables is essential.
• Depending on the protocol, there can be many operations that are not word- or even byte-aligned.
For the above operations an NP should be area-efficient, compact in code size and meet low-power constraints. Furthermore, it is highly desirable that the NP is configurable from the architecture through to the compiler and simulator. First of all, we choose a fixed-point RISC-based architecture as the underlying architecture. The reason is that even at the network layer, with routing functionality, we usually do not need floating-point operations. To be efficient in silicon area and power consumption, the data path is made scalable to 8 bit, 16 bit and 32 bit. To make the processing of different kinds of communication interfaces efficient, port-based instructions are incorporated. The ports are addressed directly, in the same manner as registers. Data can be moved between register and port, or memory and port. The load/store paradigm is relaxed to allow the powerful operation of moving data from one port to another.
This helps a lot in minimizing cycles and power consumption. The assembler instruction looks like

    move port1@bit1 port2@bit2

Particularly for processing bit-level functions, the concept of ports can encapsulate these operations and leaves the designer enough freedom for hardware/software partitioning. The number of ports is configurable.
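As a rough software model of this port-based move (the function, port width and values below are our own illustration, not the NP's actual ISA), ports can be treated as integer registers and a bit field copied from one to another:

```python
# Hypothetical model of the NP's port-based move:  move port1@bit1 port2@bit2
# Ports are modeled as plain integers; the 16-bit port width is an assumption.

WORD_BITS = 16  # assumed port width

def move_bits(src: int, src_off: int, dst: int, dst_off: int, width: int) -> int:
    """Copy `width` bits of `src` starting at `src_off` into `dst` at `dst_off`."""
    mask = (1 << width) - 1
    field = (src >> src_off) & mask       # extract the field from the source port
    dst &= ~(mask << dst_off)             # clear the destination bits
    dst |= field << dst_off               # insert the field
    return dst & ((1 << WORD_BITS) - 1)

# move port1@3 port2@0 with a 4-bit field
port1 = 0b0000000001111000   # bits 3..6 hold 0b1111
port2 = move_bits(port1, 3, 0, 0, 4)
print(bin(port2))  # 0b1111
```

In hardware a single instruction performs this extract-clear-insert sequence, which is what saves the cycles compared to a shift-and-mask sequence on a conventional RISC.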


Figure 2: Block diagram of the NP (NP core with code RAM, I/O ports and buffers)

The memory buffers for communication interface data can be separated from the memories addressed by the control part of the software. This makes it possible to use a different bus width for the fast transfer of payload data than for the processing of protocol header information. A further architectural decision is to provide for multiple tasks with zero-overhead, hardware-supported threads. How many such threads are supported is again configurable to fit the application requirements. The multi-threading feature makes it possible to program for multiple channels very easily, without an operating system. The multi-threading functionality is further embedded within a sophisticated interrupt concept that enables very efficient protocol processing for high-speed and low-delay applications. Operations on bit fields unaligned to word boundaries are provided as an important part of the NP architecture. Great attention has been paid to the various load and store operations. An example of a load operation is

    LOAD dest@bit source@bit bit-width

A block-move instruction that also works with a programmable bit offset helps to be extremely efficient in code size and power. Combined with this block load, we provide a parallel processing unit that works on the high-speed data bus, such that in the other processing unit the control processing can proceed in parallel to the data transfer. Due to the multi-threading support of the NP, this parallel processing of instructions is very simple to implement and easy to program. An example of a store operation is

    /* start block move and let the task #task run in parallel */
    Repeat counter #task
    STORE_BL port@bit dest@bit number

We let the data transfer run in one thread in parallel to the other thread. In comparison to DMA-oriented processing we achieve better control within the software, even when the block-move operation has to be interrupted.
Furthermore, the operations for managing a DMA controller are saved, and we get a smaller code size and lower power consumption. The hardware cost is also lower, since the registers can be reused instead of providing extra registers in a DMA controller. We further combine arithmetic operations with conditional jumps into a single instruction for efficient FSM implementations. We observe from our application profiles that arithmetic operations are frequently involved in flow control. Particularly for loops, the compute & jump operations can be very efficient.
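As an illustration of why arithmetic results so often drive flow control in protocol FSMs, here is a hypothetical Python sketch (the frame format is our own toy example) in which a decremented length counter directly decides the next state, as a single subtract-and-branch instruction would on the NP:

```python
# Toy protocol FSM: frames consist of a 1-byte length header followed by
# that many payload bytes. The decrement of `remaining` and the test that
# selects the next state correspond to one fused compute & jump on the NP.

def parse_frame(frame: bytes) -> bytes:
    state = "HEADER"
    remaining = 0
    payload = bytearray()
    for byte in frame:
        if state == "HEADER":
            remaining = byte                 # load the counter from the header
            state = "PAYLOAD" if remaining else "HEADER"
        elif state == "PAYLOAD":
            payload.append(byte)
            remaining -= 1                   # compute ...
            if remaining == 0:               # ... & jump, fused on the NP
                state = "HEADER"
    return bytes(payload)

print(parse_frame(b"\x03abc"))  # b'abc'
```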

4. Architecture Principles

As mentioned in the previous section, the processor architecture is based on the RISC principle. Arithmetic operations are register-register operations. Data transport is done by load/store and move instructions, also called memory instructions. The basic frame consists of a datapath containing hardware such as an ALU, a register file with resources for up to 8 active threads, a decoder that decodes instructions and controls instruction execution, and a branch unit responsible for branches and thread switching. The processor has separate program and data memories.

Bitmanipulation Support

Since the data transfers of lower layer protocols are stream-oriented, a lot of attention has to be paid to data alignment operations. Operations on variable bit strings are introduced. To enable bit access, bit addressing is necessary. First the data word containing the required bits is selected. Then the bits themselves are addressed by specifying the beginning and the length of a bit sequence within the element. This principle has the advantage of overcoming word element boundaries: a combination of bitwidth and bitoffset may provoke an overflow into the next data element. So data streams can be handled as what they are: a sequence of bits.
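A minimal sketch of this bit addressing (the element width and function name are our assumptions, not the NP's) might extract a field that overflows from element n into element n+1:

```python
# Our own model of (bitoffset, bitwidth) addressing over a stream of
# fixed-width elements; the field may cross an element boundary.

ELEM_BITS = 8  # assumed element width

def extract(elements: list[int], bitoffset: int, bitwidth: int) -> int:
    """Read `bitwidth` bits starting `bitoffset` bits into the stream."""
    value = 0
    for i in range(bitwidth):
        pos = bitoffset + i
        elem, bit = divmod(pos, ELEM_BITS)   # may cross into element n+1
        value |= ((elements[elem] >> bit) & 1) << i
    return value

# A 6-bit field starting at offset 5 spans elements 0 and 1.
stream = [0b11100000, 0b00000111]
print(bin(extract(stream, 5, 6)))  # 0b111111
```

A hardware implementation would of course fetch both elements and merge them with a funnel shifter rather than loop over single bits; the loop only makes the addressing scheme explicit.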

Figure 3: A bit field addressed by bitoffset and bitwidth, spanning elements n and n+1

This principle influences the instruction structure, since additional parameter fields are needed. Assuming a RISC instruction set with two operands, this leads to the following structure of a register-register operation executing a logical AND function:


    Ry.Oy := Ry.Oy AND Rx.Ox @ Width

Here, Width bits of register Rx, beginning at offset Ox, are ANDed with Width bits of register Ry, beginning at offset Oy. The result is stored in register Ry. The addressing of one operand thus consists of two fields: the register number and the bit offset. In Table 1, a comparison is made with the ARM instruction set [8].

ARM processor:

    ADR R2, a               // pointer a
    ADR R3, b               // pointer b
    LDR R1, [R3]      {0}   // load b
    LDR R0, [R2]      {+1}  // load a by shift
    AND R0, R4, bx00011000  // extract 2 bit
    AND R1, R2, bx11100111  // clear 2 bit
    OR  R2, R0, R5          // compose
    STR R5, [R3]            // write to b

    code size: 8 x 32 bit, cycles: 8

The new NP:

    LDMR R1 a               // load a
    LDMR R2 b               // load b
    MVRR R2.2 R1.3 2        // move bits
    STMR R2 b               // write to b

    code size: 4 x 21 bit, cycles: 4

Table 1
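The bit-field AND introduced above can also be modeled in software; the following Python sketch is our interpretation of its semantics, not the NP's implementation:

```python
# Our model of the NP's bit-field AND:  Ry.Oy := Ry.Oy AND Rx.Ox @ Width

def and_bits(ry: int, oy: int, rx: int, ox: int, width: int) -> int:
    """AND `width` bits of rx@ox into ry@oy; other bits of ry are preserved."""
    mask = (1 << width) - 1
    fy = (ry >> oy) & mask                  # field from Ry at offset Oy
    fx = (rx >> ox) & mask                  # field from Rx at offset Ox
    result = fy & fx
    return (ry & ~(mask << oy)) | (result << oy)

ry = 0b10110110
rx = 0b00000100
# AND the 2 bits of rx at offset 2 with the 2 bits of ry at offset 1
print(bin(and_bits(ry, 1, rx, 2, 2)))  # 0b10110010
```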

Data Movement

There are 3 levels of memory:
• registers
• ports, buffers, interfaces (I/O)
• memory
In comparison to a pure RISC architecture, where data is moved only between data memory and registers, an extended set of memory instructions leads to a speedup by reducing the number of executed instructions [6]. This is important since the largest part of a thread is the transfer of payload information. Hence more types of data movement instructions are introduced:
• Load/Store (data memory -> register/port)
• Load/Store indirect (data memory -> register/port)
• Move (register/port -> register/port)
• Immediate Load (value -> register/port)

Loop Support by Compute & Jump

Since data transfer operations are transports of data elements from one point to another, they can be programmed as loops, for which a counter is needed. Thus, in addition to conditional branches, there are arithmetic instructions extended by conditional branch functionality. Therefore, short loops can be executed more efficiently.

Conventional:

    LDI  R1 0Ah     // Init Count
L1:                 // Operations
    SUBI R1 1h      // Dec Count
    BNZ  R1         // End?

New:

    LDIR8 R1 0Ah    // Init Count
L1:                 // Operations
    SUBI R1 1h #L1  // Dec&End?

Figure 4

The SUBI instruction decrements the counter, and a jump to label L1 is executed until the counter value reaches zero. The second example is the implementation of a "for-loop". For the inner loop instructions, the number of cycles is half the number needed by the ARM.

    for (i=0; i
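A back-of-the-envelope model (assuming, as an illustration only, one cycle per instruction) shows where the halving for tight inner loops comes from:

```python
# Loop-control overhead per iteration: the conventional version spends two
# instructions (SUBI + BNZ), the NP's fused SUBI-with-branch spends one.

def loop_cycles(iterations: int, body_ops: int, control_ops: int) -> int:
    """Total cycles assuming one cycle per instruction (an assumption)."""
    return iterations * (body_ops + control_ops)

n = 10  # 0Ah iterations, as in Figure 4
conventional = loop_cycles(n, body_ops=0, control_ops=2)  # SUBI + BNZ
fused        = loop_cycles(n, body_ops=0, control_ops=1)  # SUBI ... #L1
print(conventional, fused)  # 20 10
```

For a minimal loop body the fused instruction halves the cycle count; as the body grows, the relative saving shrinks accordingly.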
