A Parallel IP Lookup Algorithm for Terabit Router

Kai Zheng, Hongbin Lu, Bin Liu
Department of Computer Science and Technology, Tsinghua University, Beijing, China 100084
[email protected], [email protected], [email protected]
Abstract—IP address lookup is a key bottleneck for high-performance routers because they must find the longest matching prefix for every packet. With a traditional memory organization, core routers can hardly improve their performance much, being limited by memory access speed. By analyzing the statistical distribution of IP prefixes, this paper presents a novel parallel IP lookup algorithm based on a new memory organization, which achieves a much higher throughput rate while keeping the memory consumption unchanged. With current 5 ns SRAM, the proposed mechanism furnishes approximately 600 million routing lookups per second.

Keywords—IP address; Lookup; Terabit router

(This research is supported by NSFC (60173009) and the National 863 High-tech Plan (No. 2001AA112082).)

I. INTRODUCTION

With Classless Inter-Domain Routing (CIDR), IP address lookup is a major bottleneck in high-performance routers. The key problem arises from the fact that IP prefix lengths are variable, so one destination IP address may match many prefixes; in this case the longest matching prefix, called the Best Matching Prefix (BMP), must be chosen. Recently, with the sharply increasing throughput demand on line cards, high-speed interface standards such as OC48, OC192, OC768, and beyond have appeared successively, which also puts significantly more pressure on the lookup subsystem. Given the limits of memory access speed, raising the parallelism of routing lookups and decreasing the number of memory accesses are the keys to designing BMP lookup engines for gigabit and terabit routers.

The rest of the paper is organized as follows. Several existing schemes are reviewed and their associated problems are discussed in Section II. The proposed algorithm, together with its mathematical model and the solution to that model, is presented in Section III. Finally, the simulation results and a summary are given in Sections IV and V.

II. PREVIOUS WORK

The classic binary trie algorithm is time-consuming. Gupta presented an algorithm called DIR-24-8 [2], based on the multi-bit trie, which needs at most 2 memory accesses to finish a BMP lookup; however, the forwarding table of DIR-24-8 is about 33 MB in size (for 40,000 routing entries). Later, Gupta extended DIR-24-8 to DIR-21-3-8 [2], in which a mid-table is introduced; the maximum number of memory accesses per lookup increases to 3, while the memory consumption is reduced to about 9 MB. When implemented in a pipelined fashion, it can achieve one routing lookup per memory access. Inspired by Gupta's idea, Huang brought forward a bitmap-based table compression scheme [3], referred to here as BC-16-16 (Bitmap Compression 16-16), in which the maximum number of memory accesses per lookup is still 3, but the memory required drops sharply to 450-470 KB. The cost of this algorithm is its implementation complexity and the extra computation needed for table updates. Since the table size drops to the KB level, SRAM can replace SDRAM, which means higher-speed lookups: the 50 ns per memory access becomes 5 ns. But the throughput rate is still restricted by memory access time, and clearly 200 Mpps (at most one packet per 5 ns) is the throughput limit of these non-parallel lookup mechanisms, which cannot meet the wire-speed demand of terabit routers.
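As a back-of-the-envelope check of the figures quoted above (our own arithmetic, not taken from [2] or [3]): a DIR-24-8 first-level table with 2^24 entries of 2 bytes each (the 2-byte entry size is an assumption) occupies roughly 33 MB, and a memory completing one lookup per 5 ns access is capped at 200 million lookups per second. A minimal C sketch:

/* Back-of-the-envelope check of the figures quoted above (our arithmetic,
 * assuming 2-byte entries for the DIR-24-8 first-level table). */
#include <stdio.h>

int main(void) {
    double tbl24_bytes = (double)(1 << 24) * 2;           /* 2^24 entries x 2 B   */
    printf("DIR-24-8 first table : %.1f MB\n", tbl24_bytes / 1e6);   /* ~33.6 MB */

    double access_ns = 5.0;                               /* 5 ns SRAM            */
    printf("1 lookup per access  : %.0f Mpps ceiling\n", 1e3 / access_ns); /* 200 */
    return 0;
}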
III. A PARALLEL LOOKUP ALGORITHM AND ITS PERFORMANCE ANALYSIS

Huang's BC-16-16 performs fairly well in both speed and memory consumption. In this paper we present a parallel lookup algorithm based on an analysis of the statistical characteristics of IP prefixes; cooperating with BC-16-16, it achieves a lookup speedup relative to BC-16-16 alone.

A. Study of the Distribution of IP Prefixes

By analyzing the prefix data provided by the IPMA (Internet Performance Measurement and Analysis) project, a joint effort of the University of Michigan and Merit Network (http://www.merit.edu), we find that the prefixes can be divided into several roughly even flows depending on certain bits. For instance, using bit15 and bit16 (bits in an IP address are numbered from left to right, with bit1 in the leftmost position and bit32 in the rightmost) to classify some 98,028 IP prefixes from the IPMA data gives the results shown in Table I.
TABLE I.  IP PREFIXES DISTRIBUTION

  Bit15 & Bit16     00         01         10         11
  Number            24941      24241      24904      23942
  Ratio             25.44%     24.73%     25.40%     24.42%
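The classification behind Table I can be reproduced with a few lines of code. The sketch below is illustrative only: the prefix array is a hypothetical stand-in for the IPMA data, and the bit numbering follows the convention above (bit1 is the leftmost, most significant bit).

/* Tally prefixes of length >= 16 by their (bit15, bit16) ID, as in Table I. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* value and length of a few sample prefixes (hypothetical stand-ins) */
    uint32_t prefix[] = { 0xC0A80000u, 0x0A018000u, 0xAC120000u, 0x82634000u };
    int      plen[]   = { 16, 17, 20, 18 };
    long     count[4] = { 0 };

    for (int i = 0; i < 4; i++) {
        if (plen[i] < 16) continue;            /* bits 15-16 undefined otherwise */
        int id = (prefix[i] >> 16) & 0x3;      /* (bit15, bit16) as a 2-bit ID   */
        count[id]++;
    }
    for (int id = 0; id < 4; id++)
        printf("ID %d%d : %ld prefixes\n", (id >> 1) & 1, id & 1, count[id]);
    return 0;
}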
B. Distributed Memory Organization

Different from software implementations, the multiple memory modules of a hardware mechanism make parallel table lookups possible. The key point is to allocate prefixes among the memory modules evenly, so that a speedup can be achieved for BMP lookups. According to the study of the IP prefix distribution above, we can classify incoming IP addresses by certain bits; the later (lower-order) the bits used, the more evenly the addresses spread across the modules [4]. For instance, we use 4 memory modules to store the route prefixes, with bit15 and bit16 as the identifier (ID), as shown in Table II and Fig. 1.

Figure 1. Route entry sample: Bit1-Bit14 are used as the segment, (Bit15, Bit16) as the ID, and Bit17-Bit32 as the offset.

TABLE II.  MEMORY ALLOCATIONS
  ID                00         01         10         11
  Memory Module     #1         #2         #3         #4
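As an illustration of Fig. 1 and Table II, the following sketch splits a destination address into the segment, ID and offset fields and maps the ID to a memory module. The field widths follow Fig. 1; the concrete address value is arbitrary.

/* Split a destination IP into segment (bit1-bit14), ID (bit15-bit16) and
 * offset (bit17-bit32), and map the ID to a module as in Table II. */
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t dip = 0x0A01C234u;                    /* example destination IP   */
    uint32_t seg = (dip >> 18) & 0x3FFF;           /* bit1  - bit14            */
    uint32_t id  = (dip >> 16) & 0x3;              /* bit15 - bit16            */
    uint32_t off = dip & 0xFFFF;                   /* bit17 - bit32            */
    printf("segment=%u  id=%u -> memory module #%u  offset=%u\n",
           seg, id, id + 1, off);                  /* module numbering of Table II */
    return 0;
}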
The whole parallel lookup mechanism and the distributed memory organization are shown in Fig. 3.

A Parallel Lookup Algorithm with Bitmap Compression

Input: Destination IP Address (abbreviated to DIP)
Output: Next Hop IP Address (abbreviated to NIP)

Step 1. Use the ID bits of the DIPs to classify them; allocate and push them into the first-in-first-out buffers (FIFOs) of the corresponding memory modules. Give each DIP a sequence number (tag) to identify it (only a short one is needed here).

Step 2. For every memory module Mi, in parallel, do
  While (true) do  /* endless loop */
  Begin
    If the FIFO is empty then continue;
    Pop a DIP from the local FIFO, and use the segment bits as an index to locate the entry;
    If the entry contains the NIP directly then
      Push the NIP into the output buffer, together with the sequence number of the DIP
    Else  /* the entry points to the Next Hop Array (NHA) [3] */
      Begin
        If the offset length <= 3 then
          Do a normal NHA lookup
        Else
          Do a compressed NHA (CNHA [3]) lookup;
      End;
      Push the NIP into the output buffer, together with the sequence number of the DIP;
  End;  /* while (true) do */

Step 3. For each output buffer of the memory modules, do
  Pop out a result, and use the sequence number with it to associate the NIP with its corresponding DIP;

Step 4. Stop (loop back).
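For concreteness, the sketch below mirrors the structure of Steps 1-3 in software. It is a single-threaded illustration, not the authors' implementation: the per-module BC-16-16 lookup is replaced by a dummy bc16_lookup() stub, and in hardware each module would drain its FIFO concurrently.

/* A minimal, single-threaded sketch of Steps 1-3. */
#include <stdint.h>
#include <stdio.h>

#define N_MODULES  4
#define FIFO_DEPTH 64

typedef struct { uint32_t dip; uint32_t seq; } request_t;

typedef struct {
    request_t q[FIFO_DEPTH];
    int head, tail;                               /* simple ring buffer */
} fifo_t;

static fifo_t fifo[N_MODULES];

static int fifo_push(fifo_t *f, request_t r) {
    int next = (f->tail + 1) % FIFO_DEPTH;
    if (next == f->head) return -1;               /* full */
    f->q[f->tail] = r; f->tail = next; return 0;
}
static int fifo_pop(fifo_t *f, request_t *r) {
    if (f->head == f->tail) return -1;            /* empty */
    *r = f->q[f->head]; f->head = (f->head + 1) % FIFO_DEPTH; return 0;
}

/* Placeholder for the BC-16-16 segment/NHA/CNHA lookup in module m. */
static uint32_t bc16_lookup(int m, uint32_t dip) {
    (void)m;
    return dip ^ 0xffffffffu;                     /* dummy next-hop */
}

int main(void) {
    uint32_t dips[] = { 0x0a000001u, 0x0a010001u, 0x0a020001u, 0x0a030001u };
    uint32_t nips[4];

    /* Step 1: classify by the ID bits (bit15, bit16), tag with a sequence number. */
    for (uint32_t s = 0; s < 4; s++) {
        int id = (dips[s] >> 16) & 0x3;
        request_t r = { dips[s], s };
        fifo_push(&fifo[id], r);
    }

    /* Step 2: each module drains its own FIFO (done sequentially here). */
    for (int m = 0; m < N_MODULES; m++) {
        request_t r;
        while (fifo_pop(&fifo[m], &r) == 0)
            nips[r.seq] = bc16_lookup(m, r.dip);  /* Step 3: seq associates NIP with DIP */
    }

    for (int s = 0; s < 4; s++)
        printf("DIP %08x -> NIP %08x\n", dips[s], nips[s]);
    return 0;
}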
C. Performance Analysis

Assumptions: We use queuing theory to model the lookup subsystem. We assume that the arrival process of the incoming IP addresses is a Poisson process with average arrival rate λ, and that the lookup service is a deterministic process with service rate µ and service time Ts = 1/µ. According to the study and assumptions above, we model the system as N M/D/1 queues (as shown in Fig. 2). Evidently, the average arrival rate of each M/D/1 queue reduces to λ/N while the service rate remains µ.

Figure 2. The theoretical queuing model of the lookup subsystem: an arrival stream of rate λ split over N queues, each with arrival rate λ/N and deterministic service rate µ.

Then we use classic queuing theory to solve the model and calculate the corresponding parameters as follows.

Traffic intensity of each memory module:
    ρ = Ts × λ / N ;                          (1)

Service throughput rate:
    T = ρ × N × µ ;                           (2)

Average number of DIPs in each memory module:
    q = ρ² / (2(1 − ρ)) + ρ ;                 (3)

Average number of DIPs waiting for service in each memory module:
    w = ρ² / (2(1 − ρ)) ;                     (4)

Average time spent waiting for service:
    Tw = Ts × ρ / (2(1 − ρ)) ;                (5)
For example, suppose we use N = 4 memory modules to store the route prefixes (entries) and assume that λ/µ = 3. The traffic intensity of each memory module is then ρ = λ/(N × µ) = 3/4. According to the formulas above, the parameters of the lookup subsystem are T = 3µ, q = 15/8, w = 9/8, and Tw = 3Ts/2. Here w represents the average number of DIPs in each FIFO of the lookup units, while Tw stands for the average time a DIP waits in a FIFO for service. We can see that only tiny FIFOs are required and not much delay is introduced into the lookup operation. According to (2), the lookup throughput rate T is directly proportional to the traffic intensity ρ, and the lookup speedup equals N × ρ; ρ = 3/4 here means that a speedup of 3 is achieved and the throughput rate T increases to 3µ.

Figure 3. Distributed memory organization and parallel lookup mechanism.
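The example figures above can be checked numerically. The short program below simply evaluates Eqs. (1)-(5) with N = 4 and λ/µ = 3, taking Ts as the time unit (so µ = 1):

/* Numeric check of Eqs. (1)-(5) for the example above (N = 4, lambda/mu = 3). */
#include <stdio.h>

int main(void) {
    double N = 4.0, mu = 1.0, lambda = 3.0 * mu;           /* lambda/mu = 3        */
    double Ts = 1.0 / mu;
    double rho = Ts * lambda / N;                          /* (1) intensity: 0.75  */
    double T   = rho * N * mu;                             /* (2) throughput: 3*mu */
    double q   = rho * rho / (2.0 * (1.0 - rho)) + rho;    /* (3) DIPs/module: 15/8*/
    double w   = rho * rho / (2.0 * (1.0 - rho));          /* (4) DIPs waiting: 9/8*/
    double Tw  = Ts * rho / (2.0 * (1.0 - rho));           /* (5) mean wait: 1.5*Ts*/
    printf("rho=%.3f T=%.3f*mu q=%.4f w=%.4f Tw=%.3f*Ts\n", rho, T, q, w, Tw);
    return 0;
}

It prints ρ = 0.75, T = 3µ, q = 1.875, w = 1.125 and Tw = 1.5 Ts, matching the values quoted above.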
The results of the model are fairly good, in terms of both lookup throughput rate and memory consumption, and only tiny FIFOs are needed to buffer the incoming IP addresses. We also examine the queuing delay introduced by the proposed scheme. Based on the analysis presented above, Fig. 4 shows the average queuing delay as a function of the traffic intensity ρ. The figure shows that the queuing delay does not grow sharply as long as ρ remains under 0.8, which is a common payload rate adopted by networking applications.

Figure 4. Growing trend of the average delay time (in units of Ts) with ρ.
IV. SIMULATION RESULTS
We use Matlab and C programs to simulate the lookup subsystem furnished with the proposed algorithm, focusing on the effect of the ratio ρ = λ/(N × µ). Since ρ should be less than unity [5] and we want a high speedup in lookup operations, we let λ/µ = N − 1 here; that is, ρ equals (N − 1)/N. Fig. 5 and Fig. 6 show the simulation results, together with a comparison against the theoretical analysis.

Figures 5 and 6. Simulation results (vertical axis: average delay time in units of Ts), compared with the theoretical analysis.
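As a rough stand-in for the Matlab/C simulation described above (the authors' code is not available to us), the following Monte-Carlo sketch feeds Poisson arrivals into N independent deterministic servers and measures the mean FIFO waiting time; for N = 4 and λ/µ = N − 1 it should converge to the M/D/1 value Tw = 1.5 Ts derived in Section III.

/* A rough Monte-Carlo sketch: Poisson arrivals of rate lambda are spread
 * uniformly at random over N_MODULES servers (mimicking the near-uniform
 * ID-bit distribution), each with deterministic service time Ts, and the
 * mean waiting time is estimated.  This is our simplification, not the
 * authors' simulation code. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N_MODULES 4

int main(void) {
    const double mu = 1.0, Ts = 1.0 / mu;
    const double lambda = (N_MODULES - 1) * mu;    /* rho = (N-1)/N = 0.75 per module */
    const long   ARRIVALS = 1000000;

    double busy_until[N_MODULES] = { 0 };          /* time each module becomes free  */
    double t = 0.0, total_wait = 0.0;
    srand(12345);

    for (long i = 0; i < ARRIVALS; i++) {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
        t += -log(u) / lambda;                     /* exponential inter-arrival time */
        int m = rand() % N_MODULES;                /* module chosen by the ID bits   */
        double start = (busy_until[m] > t) ? busy_until[m] : t;
        total_wait += start - t;                   /* time spent in the FIFO         */
        busy_until[m] = start + Ts;                /* deterministic service          */
    }
    printf("measured mean wait = %.3f Ts (M/D/1 predicts %.3f Ts)\n",
           total_wait / ARRIVALS, 0.75 / (2.0 * (1.0 - 0.75)));
    return 0;
}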
... lookup subsystem. For example, if the lookup throughput demand of a router is 200 Mpps, then when we upgrade the BMP lookup engine to a parallel mechanism equipped with 4 lookup units, the single 400 KB 5 ns SRAM can be replaced by 4 × 100 KB 10 ns ones, and this change has no impact on the throughput rate of the system. The cost of the scheme is some extra use of tiny, fast on-chip caches and a little more processing delay. We also bring forward a mathematical model of the scheme, as well as its solution. Using the results presented, we can trade off between implementation simplicity and speedup, and between processing delay and lookup throughput rate.

TABLE III.  COMPARISON AMONG LOOKUP SCHEMES BASED ON THE MULTI-BIT TRIE ALGORITHM

  Scheme    Memory Access Time / Lookup    Memory Requirement (KB / 8,000 Entries)    Supported Line Speed (Gbps)