A Parallel IP Lookup Algorithm for Terabit Router

Kai Zheng, Hongbin Lu, Bin Liu
Department of Computer Science and Technology, Tsinghua University, Beijing, China 100084
[email protected], [email protected], [email protected]

This research is supported by NSFC (60173009) and the National 863 High-tech Plan (No. 2001AA112082).

Abstract—IP address lookup is a key bottleneck for high performance routers because they need to find the longest matching prefix. With the traditional memory organization, core routers can hardly improve their performance much under the restriction of memory access speed. By analyzing the statistical distribution of the IP prefixes, this paper presents a novel parallel IP lookup algorithm based on a new memory organization, which achieves a much higher throughput rate while keeping the memory consumption unchanged. With current 5ns SRAM, the proposed mechanism furnishes approximately 600 million routing lookups per second.

Keywords—IP address; lookup; terabit router

I. INTRODUCTION

Classless Inter-Domain Routing (CIDR) IP address lookup is a major bottleneck in high performance routers. The key problem arises from the fact that the prefix length of an IP route is variable, so one destination IP address may match many IP prefixes. In this case the longest matching prefix must be chosen, which is called the Best Matching Prefix (BMP). Recently, with the sharply increasing throughput demand on line cards, high-speed interface standards such as OC48, OC192 and OC768 have appeared in succession, which also puts significantly more pressure on the lookup subsystem. Under the restriction of memory access speed, raising the parallelism of routing lookups and decreasing the number of memory accesses is the key to the design of BMP lookup engines for gigabit and terabit routers.

The rest of the paper is organized as follows. Several existing schemes are reviewed and their associated problems are discussed in Section II. A description of the proposed algorithm, together with its mathematical model and the solution to that model, is presented in Section III. Finally, the simulation results and a summary are given in Sections IV and V.

II. PREVIOUS WORK

The classic binary trie algorithm is time-consuming. Gupta presented an algorithm called DIR-24-8 [2], which is based on the multi-bit trie. DIR-24-8 uses at most 2 memory accesses to finish a BMP lookup, but its forwarding table is about 33MB (for 40,000 routing entries) in size. Later, Gupta extended DIR-24-8 to DIR-21-3-8 [2], in which a mid-table is introduced; the maximum number of memory accesses per lookup increases to 3, while the memory consumption is reduced to about 9MB. When implemented as a pipeline, it can finish one routing lookup per memory access. Inspired by Gupta's idea, Huang brought forward a table compression scheme based on bitmaps [3], referred to as BC-16-16 (Bitmap Compression 16-16) here, in which the maximum number of memory accesses per lookup is still 3, but the memory used shrinks sharply to 450-470KB. The cost of this algorithm is its implementation complexity and the extra computation needed for table updates. Since the table size drops to the KB level, SRAM can replace SDRAM, which means faster IP address lookups: the 50ns per memory access is reduced to 5ns. But the throughput rate is still restricted by the memory access time, and it is clear that 200Mpps (one lookup every 5ns at best) is the limit of these non-parallel lookup mechanisms, which cannot meet the wire-speed demand of terabit routers.
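For reference, the following is a minimal C sketch of a DIR-24-8-style lookup as described above. The 16-bit entry layout (the top bit flags whether the remaining bits hold the next hop directly or point into the second-level table), the table sizes and all identifiers are illustrative assumptions, not the exact structures of [2].

#include <stdint.h>

/* Illustrative DIR-24-8-style tables (layout assumed, not taken from [2]):
 * tbl24 is indexed by the top 24 bits of the destination address; if the
 * top bit of an entry is 0 the remaining 15 bits are the next hop, otherwise
 * they select a 256-entry block in tbl8 indexed by the low 8 address bits. */
static uint16_t tbl24[1 << 24];   /* ~32MB with 2-byte entries               */
static uint16_t tbl8[256 * 4096]; /* second-level blocks; size is arbitrary  */

static uint16_t dir24_8_lookup(uint32_t dip)
{
    uint16_t e = tbl24[dip >> 8];               /* first memory access        */
    if ((e & 0x8000u) == 0)
        return e;                               /* next hop found in one step */
    uint32_t block = e & 0x7FFFu;
    return tbl8[block * 256 + (dip & 0xFFu)];   /* second memory access       */
}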

III. A PARALLEL LOOKUP ALGORITHM AND ITS PERFORMANCE ANALYSIS

Huang's BC-16-16 performs fairly well in both speed and memory consumption. In this paper we present a parallel lookup algorithm based on an analysis of the statistical characteristics of the IP prefixes. Cooperating with BC-16-16, it achieves a lookup speedup relative to BC-16-16.

A. Study of the Distribution of IP Prefixes

By analyzing the prefix data provided by the IPMA (Internet Performance Measurement and Analysis) project, a joint effort of the University of Michigan and Merit Network (http://www.merit.edu), we find that the prefixes can be divided into several nearly equal classes according to certain bits of the address. For instance, using bit15 and bit16 (bits in an IP address are numbered from left to right, with bit1 in the leftmost position and bit32 in the rightmost position) to classify some 98,028 IP prefixes provided by the IPMA project gives the results shown in Table I.

TABLE I. IP PREFIXES DISTRIBUTION

Bit15 & Bit16    00        01        10        11
Number           24941     24241     24904     23942
Ratio            25.44%    24.73%    25.40%    24.42%
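Such a distribution study is easy to reproduce. The helper below is our own sketch (the prefix array is assumed to be already loaded, and prefixes shorter than 16 bits would need separate handling); it simply counts how many prefixes fall into each of the four classes defined by bit15 and bit16, which is how a tally like Table I can be produced.

#include <stdint.h>
#include <stdio.h>

/* Count prefixes per (bit15, bit16) class. */
void count_classes(const uint32_t *prefixes, size_t n_prefixes)
{
    size_t count[4] = {0};
    for (size_t i = 0; i < n_prefixes; i++) {
        unsigned id = (prefixes[i] >> 16) & 0x3u;   /* bit15 and bit16 */
        count[id]++;
    }
    for (unsigned id = 0; id < 4; id++)
        printf("class %u%u: %zu (%.2f%%)\n", (id >> 1) & 1u, id & 1u,
               count[id], 100.0 * (double)count[id] / (double)n_prefixes);
}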

B. Distributed Memory Organization

Unlike software implementations, the multiple memory modules of a hardware mechanism make parallel table lookups possible. The key point is to allocate prefixes among the different memory modules evenly, so that a speedup can be achieved in BMP lookups. According to the study of the distribution of the IP prefixes above, we can classify the incoming IP addresses by certain bits of the address; the lower-order the bits used as the identifier, the more evenly (stochastically) the addresses are spread [4]. For instance, we use 4 memory modules to store the route prefixes, with bit15 and bit16 as the identifier (ID), as shown in Fig. 1 and Table II.

Figure 1. Route entry sample: bit1-bit14 are used as the segment, bit15-bit16 as the ID, and bit17-bit32 as the offset.

TABLE II. MEMORY ALLOCATIONS

ID               00    01    10    11
Memory Module    #1    #2    #3    #4

The whole parallel lookup mechanism and the distributed memory organization are shown in Fig. 3.
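As an illustration of this entry layout, a hypothetical helper that splits a destination address according to Fig. 1 and yields the module ID of Table II could look as follows; the field widths follow the figure, but the struct and names are ours.

#include <stdint.h>

/* Fields of a destination address, following Fig. 1 (this struct and helper
 * are an illustration, not part of the paper's hardware design). */
struct dip_fields {
    uint16_t segment;  /* bit1-bit14: index into the per-module segment table  */
    uint8_t  id;       /* bit15-bit16: selects memory module #1..#4 (Table II) */
    uint16_t offset;   /* bit17-bit32: offset into the NHA / CNHA              */
};

static struct dip_fields split_dip(uint32_t dip)
{
    struct dip_fields f;
    f.segment = (uint16_t)(dip >> 18);          /* top 14 bits                  */
    f.id      = (uint8_t)((dip >> 16) & 0x3u);  /* 00->#1, 01->#2, 10->#3, 11->#4 */
    f.offset  = (uint16_t)(dip & 0xFFFFu);      /* low 16 bits                  */
    return f;
}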

A Parallel Lookup Algorithm with Bitmap Compression

Input: Destination IP Address (abbreviated to DIP)
Output: Next Hop IP Address (abbreviated to NIP)

Step 1. Use the ID bits of the DIPs to classify them. Allocate and push them into the first-in-first-out buffers (FIFOs) of the corresponding memory modules. Give each DIP a sequence number (tag) to identify it (only a short one is needed here).

Step 2. For every memory module Mi, in parallel, do
    While (true) do   /* endless loop */
    Begin
        If the FIFO is empty then continue;
        Pop a DIP from the local FIFO, and use the Segment bits as an index to locate the entry;
        If the entry contains the NIP directly then
            Push the NIP into the output buffer, together with the sequence number of the DIP
        Else   /* the entry points to the Next Hop Array (NHA) [3] */
        Begin
            If the offset length <= 3 then
                Do a normal NHA lookup
            Else
                Do a compressed NHA (CNHA [3]) lookup;
            Push the NIP into the output buffer, together with the sequence number of the DIP;
        End;
    End;   /* while (true) do */

Step 3. For each output buffer of the memory modules, do
    Pop out a result, and use the sequence number with it to associate the NIP with its corresponding DIP.

Step 4. Stop (loop back).
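To make the control flow of Steps 1-3 concrete, the following is a simplified, software-only C sketch of our own, not the hardware design: the per-module BC-16-16 lookup is stubbed out, the FIFOs are plain ring buffers, and the modules are served round-robin instead of truly in parallel.

#include <stdio.h>
#include <stdint.h>

#define N_MODULES 4
#define FIFO_SIZE 16    /* the analysis below shows only tiny FIFOs are needed */

struct request { uint32_t dip; uint32_t tag; };
struct result  { uint32_t nip; uint32_t tag; };
struct fifo    { struct request buf[FIFO_SIZE]; unsigned head, tail; };

static struct fifo   in_fifo[N_MODULES];
static struct result out_buf[N_MODULES][FIFO_SIZE];
static unsigned      out_cnt[N_MODULES];

/* Placeholder for the per-module BC-16-16 lookup (segment table + NHA/CNHA). */
static uint32_t module_lookup(unsigned module, uint32_t dip)
{
    (void)module;
    return dip;   /* a real implementation returns the next hop for dip */
}

/* Step 1: classify a DIP by its ID bits (bit15, bit16), tag it, and push it
 * into the FIFO of the memory module that owns that class of prefixes. */
static int dispatch(uint32_t dip, uint32_t tag)
{
    unsigned id = (dip >> 16) & 0x3u;
    struct fifo *f = &in_fifo[id];
    if ((f->tail + 1) % FIFO_SIZE == f->head)
        return 0;                                /* FIFO full: apply back-pressure */
    f->buf[f->tail] = (struct request){ dip, tag };
    f->tail = (f->tail + 1) % FIFO_SIZE;
    return 1;
}

/* Step 2: each module pops a DIP from its local FIFO, performs the lookup,
 * and pushes the NIP plus the sequence number into its output buffer.
 * (Here the modules are served in a loop; in hardware they run in parallel.) */
static void serve_modules(void)
{
    for (unsigned m = 0; m < N_MODULES; m++) {
        struct fifo *f = &in_fifo[m];
        if (f->head == f->tail)
            continue;                            /* local FIFO is empty */
        struct request r = f->buf[f->head];
        f->head = (f->head + 1) % FIFO_SIZE;
        out_buf[m][out_cnt[m]++] = (struct result){ module_lookup(m, r.dip), r.tag };
    }
}

/* Step 3: drain the output buffers and use the tag (sequence number) to
 * re-associate every NIP with the DIP it answers. */
static void collect(void)
{
    for (unsigned m = 0; m < N_MODULES; m++) {
        for (unsigned i = 0; i < out_cnt[m]; i++)
            printf("tag %u -> next hop %08x\n",
                   (unsigned)out_buf[m][i].tag, (unsigned)out_buf[m][i].nip);
        out_cnt[m] = 0;
    }
}

int main(void)
{
    for (uint32_t tag = 0; tag < 8; tag++)
        dispatch(0x0A000000u | (tag << 16), tag);   /* sample DIPs spread over the IDs */
    serve_modules();
    serve_modules();   /* second pass drains modules that received two requests */
    collect();
    return 0;
}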

C. Performance Analysis

Assumption: we use queuing theory to model the lookup subsystem. We assume that the arrival process of the incoming IP addresses is a Poisson process with average arrival rate λ. The service type of a lookup is a deterministic process with service rate µ and service time Ts = 1/µ. According to the study and assumptions above, we use N M/D/1 queues, as shown in Fig. 2, to model the system. Evidently, the average arrival rate of each M/D/1 queue reduces to λ/N while the service rate is still µ.

Figure 2. The theoretical queuing model of the lookup subsystem: the arrival stream of rate λ is split into N streams of rate λ/N, each served by one memory module of rate µ.

Then we use classic queuing theory to solve the model and calculate the corresponding parameters; these are the standard M/D/1 results, i.e. the Pollaczek-Khinchine formula with deterministic service:

Traffic intensity of each memory module:
    ρ = Ts × λ / N;                      (1)

Service throughput rate:
    T = ρ × N × µ;                       (2)

Average number of DIPs in each memory module:
    q = ρ² / (2(1 − ρ)) + ρ;             (3)

Average number of DIPs waiting for service in each memory module:
    w = ρ² / (2(1 − ρ));                 (4)

Average time spent waiting for service:
    Tw = Ts × ρ / (2(1 − ρ));            (5)

Figure 3. Distributed memory organization and parallel lookup mechanism.

For example, we use N = 4 memory modules to store the route prefixes (entries) and assume that λ/µ = 3. The traffic intensity of each memory module is then ρ = λ/(N × µ) = 3/4. According to the formulas given above, the parameters of the lookup subsystem are: T = 3µ; q = 15/8; w = 9/8; Tw = 3Ts/2. Here w represents the average number of DIPs in each FIFO of the lookup units, while Tw stands for the average time a DIP waits in a FIFO for service. We can see that only tiny FIFOs are required, and not much delay is introduced into the lookup operation. According to (2), the lookup throughput rate T is directly proportional to the traffic intensity ρ, and the lookup speedup is equal to N × ρ; ρ = 3/4 here means a speedup of 3 is achieved and the throughput rate T is increased to 3µ.
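These numbers follow directly from formulas (1)-(5); a small check program of the following form (our own illustration) reproduces T = 3µ, q = 15/8, w = 9/8 and Tw = 1.5Ts for N = 4 and λ/µ = 3.

#include <stdio.h>

int main(void)
{
    double N = 4.0, mu = 1.0, lambda = 3.0 * mu;          /* lambda/mu = 3       */
    double Ts  = 1.0 / mu;
    double rho = Ts * lambda / N;                         /* (1): 0.75           */
    double T   = rho * N * mu;                            /* (2): 3*mu           */
    double q   = rho * rho / (2.0 * (1.0 - rho)) + rho;   /* (3): 15/8 = 1.875   */
    double w   = rho * rho / (2.0 * (1.0 - rho));         /* (4): 9/8  = 1.125   */
    double Tw  = Ts * rho / (2.0 * (1.0 - rho));          /* (5): 1.5*Ts         */
    printf("rho=%.3f T=%.3f q=%.4f w=%.4f Tw=%.4f\n", rho, T, q, w, Tw);
    return 0;
}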

The results of the model are fairly good, in terms of both lookup throughput rate and memory consumption, and only tiny FIFOs are needed to buffer the incoming IP addresses. We also examine the queuing delay introduced by the proposed scheme. Based on the analysis presented above, Fig. 4 shows the average queuing delay (in units of Ts) as a function of the traffic intensity ρ. The figure shows that the queuing delay does not grow sharply as long as ρ remains under 0.8, which is a payload level commonly adopted in networking applications.

Figure 4. Growing trend of the average delay time with ρ.

IV. SIMULATION RESULTS

We use Matlab and C programs to simulate the lookup subsystem furnished with the proposed algorithm, focusing on the effect of the ratio ρ = λ/(N × µ). Since ρ should be less than unity [5] and we want a high speedup in lookup operations, we let λ/µ = N − 1 here; that is, ρ is equal to (N − 1)/N. Fig. 5 and Fig. 6 show the simulation results, together with a comparison against the theoretical analysis.
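As a rough stand-in for such a simulation (ours, not the authors' Matlab/C code), the sketch below feeds Poisson arrivals to N deterministic servers and measures the average waiting time per DIP, which can then be compared with formula (5); uniform random dispatch stands in here for the ID-bit classification.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 4

int main(void)
{
    double mu = 1.0, lambda = (N - 1) * mu;   /* lambda/mu = N-1, so rho = (N-1)/N */
    double Ts = 1.0 / mu;
    long   packets = 1000000;
    double t = 0.0, busy_until[N] = {0.0}, total_wait = 0.0;
    double rho = (double)(N - 1) / N;

    srand(1);
    for (long i = 0; i < packets; i++) {
        /* Poisson process: exponential inter-arrival times with rate lambda. */
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        t += -log(u) / lambda;
        int m = rand() % N;                     /* module chosen by the ID bits    */
        double start = busy_until[m] > t ? busy_until[m] : t;
        total_wait += start - t;                /* time spent waiting in the FIFO  */
        busy_until[m] = start + Ts;             /* deterministic service time      */
    }
    printf("average wait = %.3f Ts (theory: %.3f Ts)\n",
           (total_wait / packets) / Ts, rho / (2.0 * (1.0 - rho)));
    return 0;
}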

V. SUMMARY

… lookup subsystem. For example, if the lookup throughput demand of a router is 200Mpps, then when we upgrade the BMP lookup engine to a parallel mechanism equipped with 4 lookup units, the single 400KB 5ns SRAM can be replaced by 4 × 100KB 10ns SRAMs, and this alteration has no impact on the throughput rate of the system. The cost of the scheme is some extra use of tiny, fast on-chip cache and a little more processing delay. We also bring forward a mathematical model of the scheme, together with its solution. Using the results presented, we can trade off implementation simplicity against speedup, and processing delay against lookup throughput rate.

TABLE III. COMPARISON AMONG LOOKUP SCHEMES BASED ON THE MULTI-BIT TRIE ALGORITHM

Scheme    Memory Accessing Time / Lookup    Memory Requirement (KB / 8,000 Entries)    Supported Line Speed (Gbps)
