
A Fully-Programmable Memory Management System Optimizing Queue Handling at Multi Gigabit Rates

G. Kornaros, I. Papaefstathiou, A. Nikologiannis, N. Zervos
Ellemedia Technologies
223 Siggrou Av., GR-17121, Athens, Greece
{kornaros,yanni,anikol,nzervos}@ellemedia.com

ABSTRACT

Two of the main bottlenecks when designing a network embedded system are very often the memory bandwidth and its capacity. This is mainly due to the extremely high speed of state-of-the-art network links and to the fact that, in order to support advanced quality of service (QoS), per-flow queueing is desirable. In this paper we describe the architecture of a memory manager that can provide up to 10 Gb/s of aggregate throughput while handling 512K queues. The presented system supports a complete instruction set and thus we believe it can be used as a hardware component in any suitable embedded system, particularly network SoCs that implement per-flow queuing. When designing this scheme several optimisation techniques have been evaluated and the most cost- and performance-effective ones were used. These techniques minimize both the memory bandwidth and the memory capacity needed, which is considered a main advantage of the proposed scheme. The proposed architecture uses a simple DRAM for data storage and a typical SRAM for keeping the data structures (pointers), therefore minimising the system's cost. The device has been fabricated within a novel programmable network processor designed for efficient protocol processing in high-speed networking applications. It consists of 155K gates and occupies 5.23 mm² in UMC 0.18µm CMOS.



Categories and Subject Descriptors B.3 Memory Structures, B.6 Logic Design, B.7 Integrated Circuits

General Terms: Design, Performance.
Keywords: Memory management, Network Processor.

1. INTRODUCTION

The rapid growth in the number of network nodes, along with the ever-increasing number of network protocols, has imposed the development and deployment of high-capacity telecommunication systems that are also fully programmable, so as to support the different network protocols.

In the current networks these systems are designed using an embedded-systems methodology, since they consist of hardware components (implementing high-performance functions) and software ones (so as to be fully programmable). One of the main problems when designing such systems is the memory bandwidth needed, especially due to the large number of network queues these devices support. In this paper we present a memory scheme that can support 512K such queues and provide an overall bandwidth of 10 Gb/s from a simple DRAM device and a relatively small SRAM one. It uses the principles of parallelism, pipelining and pre-fetching in order to achieve such a high performance, together with a highly sophisticated segmentation and reassembly technique. The device supports memory access reordering, very efficient free list organisation and memory access arbitration so as to reduce the bandwidth required from the two memories. It can handle either fixed-size or variable-length pieces of data and it supports a large number of general instructions; therefore, we claim that it is a useful component for every embedded system that manipulates queues.

In particular, the system has been integrated in a network embedded system, the PRO3. There, it offers support for multiple network flows while maintaining 2.5 Gb/s throughput on each of the 4 different interfaces (two input and two output ones). Its flexibility allows wire-speed buffer management while it uses shared buffering for best memory utilization. In this paper we present the architecture of this Memory Management System (called the DMM hereafter), focusing on the techniques used for increasing its performance and reducing its cost and complexity. The methodology we used in order to choose the correct hardware components and circuit architecture is also described.

In general, we claim that the proposed system is either considerably faster than the ones described in the literature ([1], [2]) or less expensive while providing at least 30% higher performance ([3] and [4]). It is also the only one supporting a full instruction set. Additionally, the fact that it uses standard DRAM memory chips, rather than highly sophisticated ones (like the other proposed approaches), allows it to be integrated even in low-cost embedded systems.


2. PROGRAMMABLE PROTOCOL PROCESSOR (PRO3) ARCHITECTURE
PRO3 [6] is designed to offer a programmable solution, able to perform demanding applications in any known networking environment. The system is designed to support up to 2.5 Gb/s links for up to 512K flows, while at the same time being fully programmable.



The PRO3 network processor is an embedded system integrating three RISC cores with re-configurable pipelined modules, able to deliver the needed processing power in terms of both speed and programmability. The main programmable modules are the Protocol Processing Engines (PPE in Figure 1). There are two of them, operating in parallel so as to achieve higher throughput. Each PPE consists of a modified RISC core, a field extraction (FE) programmable engine that directly loads the required protocol data into the RISC for processing, and a Packet Modification module (PM in Figure 1) for flexible packet construction. These 3 modules form a 3-stage pipeline. The PPE RISC core belongs to the same family as the generic RISC CPU of the system, but it is optimized for efficient context switching (using a set of shadow registers) and enhanced with glue logic for I/O handling. There is also one more processor (Control CPU in Figure 1) for processing non time-critical data.

Figure 1: PRO3 Architecture

Figure 2: DMM Architecture

The PRO3 uses packet buffering prior to protocol processing. A Data Memory Management sub-system (DMM) is responsible for memory allocation and data buffering. Packets are stored per-flow in the external DRAM, in queues implemented as linked-list data structures, and can be retrieved and delivered by the DMM in response to specific commands. The data can be sent to the PPE modules, to the control RISC, to a host CPU, or directly to the output interface.

3. DATA MEMORY MANAGER (DMM)
The main function of the DMM is to store the incoming traffic in the data memory, retrieve parts of the stored packets and forward them to the CPUs for protocol processing. The DMM is also responsible for forwarding the stored traffic to the output, based on a programmable traffic-shaping pattern. It supports 2 incoming and 2 outgoing data paths at 2.5 Gbps line rate each: one for receiving traffic from the network (input), one for transmitting traffic to the network (output), and one bi-directional for receiving and sending traffic from/to the Internal Bus. It performs per-flow queuing for up to 512K flows. The DMM operates on both fixed-length and variable-length data items. It uses DRAM for data storage and SRAM for the segment and packet pointers. Thus, all manipulations of data structures occur in parallel with data transfers, keeping DRAM accesses to a minimum.

The architecture of the DMM is shown in Figure 2. It consists of five main blocks: Data Queue Manager (DQM), Data Memory Controller (DMC), Internal Scheduler, Segmentation block and Reassembly block. Each block is internally pipelined in order to exploit parallelism and increase performance. In order to achieve efficient memory management in hardware, the incoming data items are partitioned into fixed-size segments of 64 bytes each. Then, the segmented packets are stored in the data memory, which is segment aligned. The internal scheduler forwards the incoming commands from the four ports to the DQM, giving a different service priority to each port. The DQM organizes the incoming packets into queues; it handles and updates the data structures kept in the pointer memory. The DMC performs the low-level reads and writes to the data memory, minimizing bank conflicts in order to maximize DRAM throughput (as described in the Performance section).
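As a rough illustration of this segmentation step, the following C sketch splits an incoming packet into 64-byte segments, padding the tail of the last one; the function names are illustrative and are not part of the PRO3 interface.

#include <stdint.h>
#include <string.h>

#define SEG_SIZE 64  /* fixed segment size used by the DMM */

/* Number of 64-byte segments needed to hold a packet of len bytes. */
static uint32_t segments_needed(uint32_t len)
{
    return (len + SEG_SIZE - 1) / SEG_SIZE;   /* ceiling division */
}

/* Copy a packet into consecutive segment-sized buffers; the tail of the
 * last segment is zero-padded (this is the internal fragmentation loss). */
static void segment_packet(const uint8_t *pkt, uint32_t len,
                           uint8_t segs[][SEG_SIZE])
{
    uint32_t n = segments_needed(len);
    for (uint32_t i = 0; i < n; i++) {
        uint32_t chunk = (len > SEG_SIZE) ? SEG_SIZE : len;
        memset(segs[i], 0, SEG_SIZE);
        memcpy(segs[i], pkt + (uint64_t)i * SEG_SIZE, chunk);
        len -= chunk;
    }
}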

The DMM provides a large instruction set in order to support the diverse protocol-processing requirements of any device handling queues. Beyond the primitive commands "enqueue" and "dequeue", the DMM features a set of 18 commands that perform various manipulations on its data structures (a list of the commands is given in Table 1). Thus it can be incorporated in any embedded system that needs to handle queues.
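To make the programming model concrete, the C sketch below models a DMM command as software might issue it; the opcode names follow commands mentioned in the text and Table 1, but the structure layout and field widths are illustrative assumptions rather than the actual command encoding.

#include <stdint.h>

/* Illustrative opcodes; the real command set (Table 1) distinguishes
 * packet-level and segment-level variants of these operations. */
typedef enum {
    DMM_ENQUEUE,
    DMM_DEQUEUE,
    DMM_READ,        /* read data without removing it from the queue   */
    DMM_APPEND,      /* append data to the last packet of a queue      */
    DMM_OVERWRITE,   /* overwrite part of a stored packet              */
    DMM_IGNORE,      /* drop incoming data without storing it          */
    DMM_DELETE       /* remove a packet, e.g. one with a bad checksum  */
} dmm_opcode;

/* A command as a host CPU or on-chip engine might issue it (assumed layout). */
typedef struct {
    dmm_opcode op;
    uint32_t   flow_id;   /* one of the 512K supported queues            */
    uint32_t   length;    /* payload length in bytes, where applicable   */
} dmm_command;

static dmm_command make_enqueue(uint32_t flow, uint32_t len)
{
    dmm_command c = { DMM_ENQUEUE, flow, len };
    return c;
}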

3.1 Memory Technology
A critical decision for the DMM architecture was the choice of the memory technology. Due to the complexity of the supported operations, the data structures are located in a different memory (the control memory) from the packets (the data memory). DRAM is selected for the data memory because it provides large storage space for the 512K supported queues at a low cost. We chose DDR-SDRAM for the data memory because it provides adequate throughput (see the Performance section) for supporting the operations of the PRO3 network processor. The DDR-SDRAM module used has a 64-pin data bus running at a 133 MHz clock frequency, providing 17.024 Gbps of total throughput.

The large number of required pointer memory accesses calls for a high-throughput, low-latency pointer memory. We use SRAM for the pointer memory, which provides the required performance. Modern SRAMs working at a 133 MHz clock frequency provide 133 M accesses per second, or about 8.5 Gb/s.
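Both raw-throughput figures follow directly from the bus widths and the clock rate; the short calculation below reproduces them, assuming a 64-bit SRAM word (an assumption consistent with the 8.5 Gb/s figure).

#include <stdio.h>

int main(void)
{
    double clock_hz  = 133e6;   /* both external memories run at 133 MHz */
    double ddr_bits  = 64.0;    /* DDR-SDRAM data bus width              */
    double sram_bits = 64.0;    /* assumed SRAM word width               */

    /* DDR transfers data on both clock edges, hence the factor of 2. */
    double ddr_gbps  = ddr_bits * clock_hz * 2.0 / 1e9;   /* = 17.024 Gb/s */
    double sram_gbps = sram_bits * clock_hz / 1e9;        /* ~ 8.5 Gb/s    */

    printf("DDR-SDRAM peak: %.3f Gb/s\n", ddr_gbps);
    printf("SRAM peak:      %.3f Gb/s (133 M accesses/s)\n", sram_gbps);
    return 0;
}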




3.2 Data Structures and Queue Organization

We organize our data memory space into fixed-size buffers (named "segments" hereafter). We chose 64-byte segments because this size minimizes fragmentation loss [5]. For each segment in the data memory, a segment pointer and a packet pointer are assigned. The addresses of the data segments and the corresponding pointers are aligned, as shown in Figure 3, in the sense that a data segment is indexed by the same address as its corresponding pointer. For example, the packet and segment pointers of segment 0 are at address 0 in the pointer memory. The data structures of the DMM are shown in Figure 3.
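The segment/pointer alignment amounts to simple index arithmetic, sketched below under the assumption of a 256 MByte data memory of 64-byte segments (4M segments, hence 22-bit indices).

#include <stdint.h>

#define SEG_SIZE  64u
#define NUM_SEGS  (4u * 1024u * 1024u)   /* 256 MByte / 64 bytes = 4M segments, 22-bit index */

/* A segment and its pointer entry share the same index, so a single 22-bit
 * value locates both the data in the DRAM and its pointers in the SRAM. */
static inline uint32_t seg_byte_addr(uint32_t idx)  { return idx * SEG_SIZE; }
static inline uint32_t ptr_entry_addr(uint32_t idx) { return idx; }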

Figure 3: DMM data structures

The data queues are maintained as singly-linked lists of segments. The DMM can handle variable-length objects (called packets hereafter) as well as fixed-size data items. This is achieved by using two linked lists per flow: one per segment and one per packet. Each entry in the segment-level list stores the pointer that indicates the next entry in the list. The maximum number of entries a data queue can consist of is equal to the maximum number of segments the data memory supports. The packet pointer field has the valid bit set only in the entry that corresponds to the first segment of a packet. The packet pointer also indicates the address of the last segment of a packet. In a typical situation the number of entries of the packet lists is lower than the number of entries of the corresponding segment lists. However, in the worst case, the maximum number of entries in the packet lists is equal to the number of entries in the segment lists, which in turn is equal to the maximum number of supported segments in the data memory.

Head and Tail pointers are also required for each individual queue; the head pointer points to the first segment of the head packet in the queue, while the tail pointer indicates the first segment of the tail packet. Usually, singly-linked lists are accessed from head to tail; however, our linked-list organization also allows deleting the last inserted packet from the tail of a queue¹. This is achieved by keeping the start address of the packet before the last one in the segment pointer of the last segment of the queue (which would otherwise be null).

¹ This function is very useful when receiving a network packet with a wrong error redundancy code or, in general, a corrupted network packet.

Since we support two types of queues (packet and segment), we need two free lists, one per type. This results in double the accesses for allocating and releasing pointers. We use flexible data structures, which minimize memory accesses and can support the worst-case scenarios. The two types of linked lists are identical and aligned; in other words, there is only one linked list with two fields: a segment pointer and a packet pointer.
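A software model of these structures might look like the following C sketch; the 22-bit pointer width, the NIL encoding and the enqueue routine are illustrative assumptions, intended only to show how the aligned segment and packet lists are threaded through a single table.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SEGS  (4u * 1024u * 1024u)   /* 4M segments, 22-bit indices      */
#define NUM_FLOWS (512u * 1024u)         /* 512K per-flow queues             */
#define NIL       0x3FFFFFu              /* illustrative "null" index        */

/* One aligned entry per segment: a next-segment link plus a packet link that
 * is valid only on the first segment of a packet (it points to its last segment). */
typedef struct {
    uint32_t next_seg;      /* segment-level next pointer            */
    uint32_t pkt_ptr;       /* packet-level pointer                  */
    bool     pkt_valid;     /* set on the first segment of a packet  */
} ptr_entry;

/* Per-flow queue descriptor: head points at the first segment of the head
 * packet, tail at the first segment of the tail packet. */
typedef struct {
    uint32_t head;
    uint32_t tail;
    bool     empty;
} queue_desc;

static ptr_entry  ptr_mem[NUM_SEGS];   /* models the external pointer SRAM  */
static queue_desc queues[NUM_FLOWS];   /* per-flow head/tail table          */

/* Append a packet occupying segments first..last (already linked through
 * next_seg) to the tail of a flow's queue. */
static void enqueue_packet(uint32_t flow, uint32_t first, uint32_t last)
{
    ptr_mem[first].pkt_valid = true;
    ptr_mem[first].pkt_ptr   = last;    /* packet pointer marks its last segment */
    ptr_mem[last].next_seg   = NIL;

    queue_desc *q = &queues[flow];
    if (q->empty) {
        q->head  = first;
        q->empty = false;
    } else {
        /* link the last segment of the old tail packet to the new packet */
        uint32_t old_tail_last = ptr_mem[q->tail].pkt_ptr;
        ptr_mem[old_tail_last].next_seg = first;
    }
    q->tail = first;
}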

3.3 Internal DMM Organization

Figure 4: Internal DMM Architecture

Internally the DMM uses FIFOs at the input interfaces to handle traffic fluctuations. At the "IN" interface a two-level FIFO is used: one level for "probably valid" packets and the other for valid packets. Initially the internal scheduler prioritises the incoming commands, and then the high-level (HL) data queue manager is responsible for breaking the packet commands down into segment commands and for the synchronization between the system clock domain and the memory management clock domain.

3.4 Optimization techniques
The following subsections describe the optimization techniques used in the design to increase its performance while keeping the cost of the system low.

3.4.1 Free List Organization
The DRAM provides high throughput and capacity at the cost of high latency and throughput fluctuations due to bank conflicts. A bank conflict occurs when successive accesses address the same memory bank, in which case the second access must be delayed until the bank is available. Hence, special care must be given to the buffer allocation and deallocation process. [5] proved that by using a single free list one can minimize the memory accesses for buffer releasing (i.e. the delete or dequeue of a large packet requires O(1) accesses to the pointer memory). However, this scheme increases the possibility of a bank conflict during an enqueue operation. On the other hand, using one free list per memory bank (8 banks in total in current DRAM chips) minimizes or even avoids bank conflicts during enqueueing, but increases the number of memory accesses during packet dequeueing or deletion to O(N), where N is the number of segments per packet. A trade-off between these two schemes, which minimizes both the memory accesses and the bank conflicts [5], is to use two identical free lists, one holding the addresses of the first 4 consecutive free segments, the other the next 4, and so on. Additionally, the support of page-based addressing in the DRAM results in a reduction of up to 70% in the number of bank conflicts during writes and 46% during reads, as shown in [6].
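A minimal sketch of the two interleaved free lists is given below; the stack-style organisation and its initialisation are assumptions, but the alternation between two lists of 4-segment groups follows the scheme described above.

#include <stdint.h>
#include <stddef.h>

/* Two identical free lists: list 0 holds one group of 4 consecutive free
 * segments, list 1 the next group of 4, and so on. Allocations alternate
 * between the lists, so back-to-back writes target different DRAM banks.
 * The stacks are assumed to be filled with all free groups at start-up. */
typedef struct {
    uint32_t *groups;   /* stack of group base addresses (4 segments each) */
    size_t    top;
} free_list;

static free_list fl[2];
static int       next_alloc;   /* list that serves the next allocation        */
static int       next_free;    /* list that absorbs the next released group   */

static uint32_t alloc_group(void)          /* returns the base of 4 free segments */
{
    free_list *l = &fl[next_alloc];
    next_alloc ^= 1;
    return l->groups[--l->top];            /* caller must check for emptiness */
}

static void release_group(uint32_t base)   /* gives 4 consecutive segments back */
{
    free_list *l = &fl[next_free];
    next_free ^= 1;
    l->groups[l->top++] = base;
}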





3.4.2 Memory Access Reordering
The execution of an incoming command, such as an Enqueue, Dequeue, Delete or Append packet command, issues read and write accesses to the pointer memory in order to update the corresponding data structures. Due to access dependencies, the latency of executing an operation may increase. By reordering the accesses in an effective manner the execution latency is minimised and the system performance is thus increased. This reordering is performed for every operation, and we measured that it achieves a 30% reduction in the mean access latency.
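The effect of this reordering can be illustrated with a toy scheduler that issues an access as soon as the access it depends on has been issued, so independent accesses overlap with the latency of earlier ones; the real DMM applies a fixed, per-command schedule, so the code below is only an illustration.

#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int  id;
    int  depends_on;   /* index of the access whose result is needed, or -1 */
    bool issued;
} ptr_access;

/* Issue accesses in dependency order: anything whose prerequisite has already
 * been issued (or that has none) goes out in the current pass, so independent
 * accesses are not serialised behind dependent ones.
 * Assumes an acyclic dependency graph. */
static void schedule(ptr_access *a, size_t n, void (*issue)(int id))
{
    size_t done = 0;
    while (done < n) {
        for (size_t i = 0; i < n; i++) {
            if (a[i].issued)
                continue;
            int d = a[i].depends_on;
            if (d < 0 || a[d].issued) {
                issue(a[i].id);
                a[i].issued = true;
                done++;
            }
        }
    }
}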

3.4.3 Memory Access Arbitration


Using the described free list organization we control the write accesses to the data memory in order to minimize bank conflicts. An analogous control cannot be applied to read accesses because they are random and unpredictable. Thus, a special memory access arbiter is used in the data memory controller block to shape the flow of read accesses in order to avoid bank conflicts. Memory accesses are classified into four FIFOs (one FIFO per port). The arbiter implements a round-robin policy: it selects an access only if it targets a non-busy bank. Bank availability information is obtained by keeping the data memory access history (the last 3 accesses are remembered). This function reduces bank conflicts by 23%, as the results of our experiments show. It also reduces the hardware complexity of the DDR memory controller.
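The arbitration policy can be sketched as a round-robin scan over the four per-port request queues that skips any request whose DRAM bank appears in the 3-entry access history; the single-entry FIFO heads and the bank-mapping function below are simplifying assumptions.

#include <stdint.h>
#include <stdbool.h>

#define NUM_PORTS 4
#define HISTORY   3          /* the arbiter remembers the last 3 accesses */
#define NUM_BANKS 8

typedef struct { uint32_t addr; bool valid; } request;

static request  head[NUM_PORTS];                 /* head of each port FIFO (simplified) */
static uint32_t recent_bank[HISTORY] = { NUM_BANKS, NUM_BANKS, NUM_BANKS };
static int      hist_pos;
static int      rr_next;                         /* round-robin starting port */

static uint32_t bank_of(uint32_t addr) { return addr % NUM_BANKS; }  /* assumed mapping */

static bool bank_busy(uint32_t bank)
{
    for (int i = 0; i < HISTORY; i++)
        if (recent_bank[i] == bank) return true;
    return false;
}

/* Pick the next read to issue: scan the ports round-robin and select the
 * first pending request that targets a non-busy bank. Returns the port or -1. */
static int arbitrate(void)
{
    for (int i = 0; i < NUM_PORTS; i++) {
        int p = (rr_next + i) % NUM_PORTS;
        if (head[p].valid && !bank_busy(bank_of(head[p].addr))) {
            recent_bank[hist_pos] = bank_of(head[p].addr);
            hist_pos = (hist_pos + 1) % HISTORY;
            rr_next  = (p + 1) % NUM_PORTS;
            head[p].valid = false;
            return p;
        }
    }
    return -1;   /* all pending requests hit busy banks this cycle */
}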


3.4.4 Internal backpressure
The data memory manager uses internal backpressure in order to delay incoming operations that correspond to blocked flows or blocked devices. The DMM keeps data FIFOs per output port. As soon as these FIFOs are about to overflow, backpressure signals are asserted in order to suspend the flow of incoming operations related to the blocked datapath. Internal backpressure avoids overflows and data loss. Using this technique we increase the reliability of our architecture.
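A watermark-based version of this mechanism is sketched below; the FIFO depth and the two thresholds are arbitrary illustrative values.

#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 256   /* illustrative depth of a per-output-port data FIFO */
#define HIGH_WM    224   /* assert backpressure when the FIFO is nearly full  */
#define LOW_WM     128   /* release it once enough room has been drained      */

typedef struct {
    uint32_t fill;        /* current occupancy in entries */
    bool     backpressure;
} out_fifo;

/* Called whenever the occupancy of an output FIFO changes. While the flag is
 * set, the internal scheduler stops accepting new operations for this port,
 * so the FIFO can never overflow and no data is lost. */
static void update_backpressure(out_fifo *f)
{
    if (!f->backpressure && f->fill >= HIGH_WM)
        f->backpressure = true;
    else if (f->backpressure && f->fill <= LOW_WM)
        f->backpressure = false;
}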

4. SYSTEM PERFORMANCE

In order to analyse the performance of the DMM, we examine the possible bottlenecks of our architecture. In general, these bottlenecks can be due either to the memories themselves (DRAM, SRAM) or to the hardware complexity of the implemented logic. We first do a quantitative analysis of the DMM performance and then present experimental results that prove that our analysis holds when the DMM executes a real-world application. The total required incoming throughput to the data memory is the actual input traffic from the two receive ports (5 Gb/s in total) plus the throughput lost due to memory fragmentation and bank conflicts. Given the analysis in [6], the fragmentation losses and the bank conflict losses are about 20% each, so the total bandwidth needed from the DRAM should be about 7 Gbps. All the traffic coming into the data memory should at some point come out of it; thus the total required throughput of this memory is 14 Gbps. Regarding the pointer memory, current SRAMs work at 133 MHz and thus provide 133 M accesses per second. The memory manager must be able to service one command per port per time slot, where a time slot is the time required to receive or send a 64-byte segment from/to the network at 2.5 Gb/s line rate, i.e. 204.8 ns. Assuming the average number of required pointer memory accesses per command is 6, 24 accesses per time slot are required, so the necessary throughput is 118 M accesses per second. Finally, regarding the hardware complexity of the system, the memory manager has two clock domains: a 200 MHz one and a 133 MHz one. The DMM blocks that communicate with the internal PRO3 subsystems work at 200 MHz, while the blocks that access the external memories work at 133 MHz. As the synthesis and P&R tools indicated, and as the experiments of the next section prove, the DMM core hardware implementation can handle up to 12 Gb/s of throughput.
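The numbers in this budget can be reproduced with a few lines of arithmetic; the sketch below simply restates the analysis above.

#include <stdio.h>

int main(void)
{
    double in_gbps       = 5.0;     /* two 2.5 Gb/s receive ports                 */
    double frag_overhead = 0.20;    /* ~20% lost to 64-byte segmentation          */
    double bank_overhead = 0.20;    /* ~20% lost to DRAM bank conflicts           */

    double write_gbps = in_gbps * (1.0 + frag_overhead + bank_overhead); /* ~7 Gb/s  */
    double dram_gbps  = 2.0 * write_gbps;       /* everything written is later read: ~14 Gb/s */

    double slot_ns       = 204.8;   /* time to receive one 64-byte segment at 2.5 Gb/s */
    double cmds_per_slot = 4.0;     /* one command per port per time slot              */
    double accesses_cmd  = 6.0;     /* average pointer-memory accesses per command     */
    double sram_maccs    = cmds_per_slot * accesses_cmd / slot_ns * 1e3;  /* ~117-118 M accesses/s */

    printf("DRAM budget: %.1f Gb/s, SRAM budget: %.1f M accesses/s\n",
           dram_gbps, sram_maccs);
    return 0;
}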

4.1 Experiment results
Extensive experiments on the DMM were performed with the support of micro-code specifically developed for the embedded micro-engines of the PRO3. Table 1 lists the DMM commands. Note that the packet commands are internally translated into segment commands, and the segment commands are then executed. Table 1 also shows the measured latency of these commands and the number of necessary pointer-manipulation accesses. The actual data accesses to the data memory can be done almost in parallel with the pointer handling. In particular, a data access can start as soon as the first pointer memory access of a command has completed; this is possible because the pointer memory accesses of each command have been scheduled so that the first one provides the data memory address. It is clear that the DMM can always handle a queue instruction within 65 ns. Since the data memory is accessed in about 50-60 ns (in the average case), and the major part of the queue handling is done in parallel with the data access, we claim that our system introduces minimal latency. In other words, in terms of latency, the queue handling comes almost "for free", since the DMM's latency is about the same as that of a typical DRAM subsystem supporting only reads and writes.
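The latency argument can be checked with the same kind of back-of-the-envelope arithmetic; the 55 ns DRAM access time used below is an assumed midpoint of the 50-60 ns range quoted above.

#include <stdio.h>

int main(void)
{
    double cycle_ns  = 5.0;             /* DMM clock cycle (200 MHz)                     */
    double worst_cmd = 13 * cycle_ns;   /* longest command in Table 1: 13 cycles = 65 ns */
    double dram_ns   = 55.0;            /* assumed typical data-memory access time       */

    /* The data access starts right after the first pointer access, so queue
     * handling is almost entirely hidden behind the DRAM access itself. */
    double exposed = (worst_cmd > dram_ns) ? (worst_cmd - dram_ns) : 0.0;
    printf("queue-handling latency not hidden by the DRAM access: %.0f ns\n", exposed);
    return 0;
}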


Table 2 depicts the performance results measured after stressing the DMM (inside the PRO3 environment) with real TCP traffic fed into the incoming interface. This table demonstrates the performance of the DMM in terms of both bandwidth and number of instructions serviced. It also presents the memory bandwidth required by our design in order to provide the specified performance. Note that the input bandwidth serviced cannot exceed 2.5 Gbps, since this is the maximum input rate of the PRO3.

Since the DMM in a network processor environment (such as the PRO3) actually services each packet four times, the maximum aggregate throughput it services is 10 Gb/s. From the results of Table 2 it can easily be derived that the worst case for our system is when there is only one incoming flow consisting of very small packets. Even in this case, the DMM can service the maximum bandwidth offered to it. Within the PRO3 the DMM works at 200 MHz (the system clock), and at this rate it can satisfy the performance requirements of this network device while having a large number of idle cycles (more than 25% even in the worst case). As described above, a simple DRAM can provide up to 17 Gb/s of real bandwidth and the SRAM up to 8.5 Gb/s. The maximum memory bandwidth utilization figures show that even in the worst-case scenario the bandwidth required from the DRAM is up to about 14 Gb/s (equal to the DMM bandwidth plus the measured 37% overhead due to bank conflicts and fragmentation), and that required from the SRAM is up to 4.5 Gb/s. As the internal hardware of the DMM is idle for more than 30% of the time in any of these cases, we claim that the DMM can support even 12 Gb/s without a problem.

Table 1: Packet commands and segment commands - pointer-manipulation latency (in clock cycles of 5 ns) and pointer memory accesses (r: read, w: write)

Packet-level commands include Enqueue, Dequeue, Read, Append, Ignore, Delete and Ignore&Delete; each is internally translated into the corresponding segment commands.

Command                             Clock cycles   Pointer memory accesses
Read                                10             3r
Read_N                              10             3r
Dequeue                             13             min 5 (3r2w), max 8 (3r5w)
Dequeue_N                           13             min 5 (3r2w), max 8 (3r5w)
Overwrite                           10             3r
Overwrite_Segment_length            7              2r1w
Ignore                              4              0
Ignore&Overwrite_Segment_length     7              2r1w
Overwrite_Segment_length&Append     11             6r4w
Overwrite_Segment&Append            11             6r3w

Figure 5: Pointer memory bandwidth fluctuations (DMM bandwidth 7.02 Gb/s, 128-byte packets, single flow)

Notice that the above bandwidth numbers are all average numbers. We have also measured the fluctuations of the bandwidth required from the SRAM, and the results are shown in Figure 5. The mean SRAM bandwidth for the presented case is 3.74 Gb/s and, as shown, the maximum bandwidth required from the SRAM is 4.23 Gb/s. Since this figure demonstrates the worst-case scenario among those measured, we claim that the fluctuations of the required SRAM bandwidth cannot restrict the performance of the DMM. The data memory accesses show no fluctuations, since for each command exactly one data memory access is needed.

Table 2: Performance of the DMM

Number of Flows   AVG packet size (bytes)   MOperations/s Serviced   Pointer Memory BW (Gb/s)   DMM bandwidth utilization (Gb/s)
2                 100                        8.22                    4.53                       7.60
2                  90                       10.08                    4.45                       9.72
2                 128                       11.26                    4.40                       9.20
4                 128                       10.05                    3.70                       8.40
4                 128                        9.44                    3.80                       9.20
Single             64                       10.47                    2.68                       5.32
Single             64                       13.70                    3.74                       7.04
Single             64                       15.43                    4.50                       9.52
Single             50                       13.43                    4.42                       6.88

5. CONCLUSIONS
This paper focuses on a Memory Management system that provides fully programmable, ultra-high-speed manipulation of hundreds of thousands of data queues. We argue that this system can effectively be used in any embedded system that handles large numbers of queues, since it supports a full instruction set, can handle both fixed-size and variable-length data items, and uses simple external memories. The DMM has been implemented within a network processor and provides an aggregate throughput of 10 Gb/s using an SDRAM memory device and a typical SRAM one, while supporting 512K different queues. Its latency is about the same as that of a simple DRAM controller.

Its hardware complexity is about 155K gates, including 130K bits of memory, and it occupies 5.23 mm² of die area when implemented in UMC's 0.18µm technology. The paper also presents the optimization techniques used to increase the performance and reduce the cost (which is a critical factor in any embedded system), together with experimental results that support the stated performance characteristics.



6. REFERENCES
[1] B. Suter, T.V. Lakshman, D. Stiliadis, A.K. Choudhury, "Buffer Management Schemes for Supporting TCP in Gigabit Routers with Per-Flow Queueing", IEEE Journal on Selected Areas in Communications, August 1999.
[2] P. Andersson, C. Svensson (Lund University, Sweden), "A VLSI Architecture for an 80 Gb/s ATM Switch Core", IEEE Innovative Systems in Silicon Conference, October 1996.
[3] A. Nikologiannis, M. Katevenis, "Efficient Per-Flow Queueing in DRAM at OC-192 Line Rate using Out-of-Order Execution Techniques", Proceedings of ICC 2001, Helsinki, Finland, June 2001.
[4] D. Whelihan, H. Schmit, "Memory Optimization in Single Chip Network Switch Fabrics", Proceedings of the 39th Design Automation Conference, New Orleans, Louisiana, USA, June 2002.
[5] Ch. Ykman et al., "System-level Performance Optimization of the Data Queueing Memory Management in High-Speed Network Processors", Proceedings of the 39th Design Automation Conference, New Orleans, Louisiana, USA, June 2002.
[6] K. Vlachos, T. Orphanoudakis, N. Nikolaou, G. Kornaros, K. Pramataris, S. Perissakis, J-A. Sanchez, G. Konstantoulakis, "Processing and Scheduling Components in an Innovative Network Processor Architecture", Proceedings of the 16th International Conference on VLSI Design, New Delhi, India, January 2003.
