A Longest Prefix Match Search Engine for Multi-Gigabit IP Processing

†Masayoshi Kobayashi, †Tutomu Murase and ‡Atsushi Kuriyama
†C&C Media Research Laboratories, NEC Corporation  ‡System LSI Operations Unit, NEC Corporation
†{masayosi,murase}@ccm.CL.nec.co.jp, ‡[email protected]

Abstract— We propose an IP forwarding table search engine architecture, VLMP (Vertical Logical operation with Mask-encoded Prefix-length), for routers with multi-gigabit/sec speed links. We discuss the existing approaches and the requirements for search engines, and go on to propose the VLMP search engine architecture, which expands upon a Content Addressable Memory (CAM) and can perform wire-speed packet processing of an OC-192 (9.6 Gb/s) link. In this architecture, prefixes can be stored in arbitrary order, whereas existing ternary CAMs require prefixes to be stored in the order of their lengths. Also presented is a newly developed search LSI in which the architecture is implemented.
TABLE I
EXAMPLE FORWARDING TABLE

prefix   Next Hop Info.
110      Interface #3, Next-hop IP address
10001    Interface #1, etc.
11011    Interface #2, etc.
1101     Interface #4, etc.
I. INTRODUCTION

Today, in order to cope with explosive increases in IP (Internet Protocol) traffic, the physical links of networks have been upgraded to multi-gigabit/sec speed. For example, Gigabit Ethernet (1.0 Gb/s) and OC-48 (2.4 Gb/s) are at present supported by many vendors, and OC-192 (9.6 Gb/s) interfaces are expected to be widely deployed soon. Furthermore, the switching fabrics of IP routers are capable of moving packets from input to output interfaces at multi-gigabit/sec speed [1]. The packet forwarding process, however, which consists of a forwarding table search and the manipulation of a packet header, lags behind this increasing network speed. In the IP packet forwarding process, the major performance bottleneck is the forwarding table search needed to determine the next hop of every incoming packet on the basis of its destination IP address. This is because the IP forwarding table search performs a Longest Prefix Match (LPM) search in a CIDR (Classless Inter-Domain Routing [2]) environment, i.e. one which allows arbitrary prefix lengths. In recent years, a number of methods have been proposed for improving the performance of LPM searches [3, 4, 5, 6, 7, 8, 9]. Some of them achieve adequately high average search throughput; however, they gain search speed via the prefix expansion technique [9], and this results in unacceptably slow forwarding table updating. In approaches using a ternary CAM (Content Addressable Memory), search latency is deterministic and short enough to support multi-gigabit/sec link speed [10, 11, 12]. In such approaches, however, entries for the forwarding table must be stored in the CAMs in the order of their prefix lengths; this necessitates complex update operations on the forwarding table and reduces its level of performance. In this paper, we propose a novel LPM search engine architecture, VLMP (Vertical Logical operation with Mask-encoded Prefix-length), which avoids the above drawbacks of ternary CAM approaches, and we present an actual LSI in which this architecture has been implemented.

II. MULTI-GIGABIT IP FORWARDING TABLE SEARCH

A. Longest Prefix Match Search

First of all, let us consider the nature of the Longest Prefix Match (LPM) search (referred to as a "longest match" in [13]). As illustrated in Table I, each entry in an IP forwarding table consists of a prefix and its associated next hop information. A prefix may be of an arbitrary length (so long as it does not exceed the network address length), and it consists of a bit string specifying the initial substring of a network address. No two entries contain the same prefix. An LPM search is a search for the longest prefix among those that match an initial substring of a given network address. For example, if the given network address is 11011111, a search of Table I will yield matches in the first, third and fourth entries, and the prefix of the third entry will be selected as the LPM search result because it is the longest of the matching entries.
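As a behavioral sketch only (a linear scan, not the parallel hardware mechanism proposed later), the LPM search over Table I can be written as follows; the next-hop labels are abbreviated from Table I:

```python
# Longest Prefix Match by linear scan over the Table I entries.
# A prefix matches when it is an initial substring of the address.
table = {
    "110":   "Interface #3",
    "10001": "Interface #1",
    "11011": "Interface #2",
    "1101":  "Interface #4",
}

def lpm(address: str):
    """Return (prefix, next_hop) of the longest matching prefix, or None."""
    best = None
    for prefix, next_hop in table.items():
        if address.startswith(prefix):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, next_hop)
    return best

print(lpm("11011111"))  # ('11011', 'Interface #2') -- the third entry wins
```

For the key 11011111 this reproduces the result in the text: prefixes 110, 11011 and 1101 all match, and 11011 is selected because it is the longest.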
B. Requirements for LPM Search Engine Architecture

Next, let us consider the requirements for an LPM search engine. We primarily require that the LPM search engine be able to perform LPM searches at least at the speed of OC-192 (9.6 Gb/s) and be usable in a network of any size, from LAN to Internet backbone.
• Prefix Length: In order to support CIDR, arbitrary prefix lengths must be allowed.
• Search Throughput: We have set our target to wire-speed forwarding of an OC-192 (9.6 Gb/s) link, i.e., the required search throughput is 25 million searches/second. (By wire-speed forwarding we mean the forwarding of all packets, free of queuing, even when packets of the shortest length arrive in a continuous stream with no intervals between them; assuming the shortest packet length to be 48 bytes, the maximum arrival rate of packets on an OC-192 link is 25 million packets/second.)
• Searchable entry size: In order to be usable with LAN routers, an engine must be capable of storing a minimum of 4K entries, while the required capacity for use with Internet backbone routers is currently 60K and increasing. With
0-7803-6286-1/00/$10.00 (c) 2000 IEEE
this in mind, we require that the searchable entry size be expandable to 128K entries.
• Search Key Length: We have set our search key length requirement at 64 bits rather than the 32 bits of an IPv4 address because of additional requirements such as QoS routing.
• Update Cost: The routing table of the Internet backbone is updated very frequently [14]. While the forwarding table stored in a search engine is being updated, the forwarding process is suspended, and a long update latency will considerably reduce the performance of the router. We require a search engine to be able to update its forwarding table quickly enough to avoid any significantly adverse effect on performance.
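The throughput requirement above follows from simple arithmetic, which can be checked directly (assuming, as stated, a 48-byte minimum packet arriving back-to-back on a 9.6 Gb/s link):

```python
# Wire-speed search-rate requirement for an OC-192 link.
link_rate_bps = 9.6e9      # OC-192 rate used in this paper (bits/second)
min_packet_bits = 48 * 8   # shortest packet assumed to be 48 bytes

packets_per_second = link_rate_bps / min_packet_bits
print(packets_per_second)  # 25000000.0 -> 25 million searches/second
```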
[Fig. 1. Existing ternary CAM solution and our solution: (a) in the existing ternary CAM solution, prefixes must be stored in order of prefix length and a priority encoder selects the lowest matching address; (b) in our solution, prefixes are stored in arbitrary order and VLMP selects the longest matching prefix.]

C. Existing Methods
Trie-Based Algorithm Solutions: In recent years, several algorithms [3, 4, 5, 6, 7, 8, 9] have been proposed to improve LPM search performance. In these algorithms, prefixes are represented in a prefix tree, and the tree is traversed according to a given search key. Some algorithms [8, 7], which have improved LPM performance to the extent of wire-speed forwarding on multi-gigabit/sec links, are based on the same technique, called "Prefix Expansion" in [9]. This technique reduces a set of arbitrary-length prefixes to a predefined set of lengths. By this reduction, the depth of the prefix tree that search algorithms have to traverse is reduced and search speed is increased. However, when the forwarding table is updated, a number of prefixes that were expanded from short prefixes have to be deleted or updated. This results in unacceptably slow updating of the forwarding table. Ternary CAM Based Hardware Solution: A Content Addressable Memory (CAM) is able to search itself with a very short latency (typically, tens of nanoseconds) for the location of an entry whose content bit string matches a given key. The short search latency is accomplished by "Parallel Comparison", in which the given search key is compared with the content bit strings of all entries in parallel. In a ternary CAM, each entry has, in addition to a content bit string, a mask bit string that indicates those bit positions at which the content bit string does not need to match the given key. In [10], approaches that employ CAMs are well summarized, and a ternary CAM based solution is introduced. Some LSI vendors appear to have taken this ternary CAM based approach [11, 12]. Fig. 1 (a) illustrates the way searches are made in the existing ternary CAM solution. A ternary CAM consists of a fixed number of entries, each of which contains a prefix. Prefixes that could each conceivably match a single given key have to be stored in the order of their prefix lengths.
In a search operation, prefixes are compared with a given search key in parallel. The results of this comparison are provided to the Priority Encoder, which then encodes the lowest address among those of the matching entries. Note that the resulting address automatically coincides with the LPM entry address because of the way in which prefixes are stored. This restriction on ordering is the major drawback of the ternary CAM based solution. If a new prefix appears when the forwarding table is being updated, all of the prefixes that are shorter than the new one have to be re-stored at upper addresses in the ternary CAM to make space for the new one while keeping the order of prefix lengths. Therefore, it is essentially impossible to update a forwarding table incrementally; it would be necessary to re-store all prefixes from scratch. Other Hardware Solutions: One currently available LPM search LSI [15] employs a proprietary technique for supporting LPM searches. Unlike the ternary CAM based approaches, it does not restrict the storage order of prefixes, but its search throughput of 4.1 million searches/second is not sufficient for multi-gigabit link speed.
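The ordering restriction can be made concrete with a toy software model of a ternary CAM (an illustrative sketch, not any vendor's actual device): the priority encoder simply returns the lowest matching address, so the result is the longest prefix only if entries are pre-sorted by prefix length:

```python
# Toy ternary CAM: every entry is compared in parallel in hardware;
# here we model the priority encoder as "lowest matching address wins".
def ternary_cam_search(entries, key):
    """entries: list of (data, mask) 8-bit ints; list index = CAM address."""
    for address, (data, mask) in enumerate(entries):
        if (key & mask) == (data & mask):  # masked compare
            return address                 # priority encoder: lowest wins
    return None

# Table I prefixes stored longest-first: 10001, 11011, 1101, 110
sorted_entries = [
    (0b10001000, 0b11111000),  # 10001
    (0b11011000, 0b11111000),  # 11011
    (0b11010000, 0b11110000),  # 1101
    (0b11000000, 0b11100000),  # 110
]
print(ternary_cam_search(sorted_entries, 0b11011111))  # 1 -> prefix 11011 (correct LPM)

# The same entries in a different order yield a shorter matching prefix:
shuffled = [sorted_entries[3], sorted_entries[2],
            sorted_entries[0], sorted_entries[1]]
print(ternary_cam_search(shuffled, 0b11011111))  # 0 -> prefix 110 (not the LPM)
```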
III. PROPOSED SEARCH LSI ARCHITECTURE

A. VLMP Search Engine Architecture

We propose a new scheme, "Vertical Logical operation with Mask-encoded Prefix-length (VLMP)", which removes the restriction in existing ternary CAM solutions that prefixes have to be stored in the order of their lengths. Fig. 1 (b) depicts our solution. In addition to "Parallel Comparison", which is commonly used in existing ternary CAM solutions as in Fig. 1 (a), our solution employs VLMP to determine the longest prefix among the matched prefixes, which are stored in arbitrary order. Unlike the priority encoder in the existing solution, our solution utilizes the length information of the matched prefixes to determine the longest prefix. Both "Parallel Comparison" and VLMP are based on logical operations applied to a number of bits, which are typically implemented by wired logic in hardware. The concept of VLMP is explained in Fig. 2 (a) and (b). Fig. 2 (a) illustrates how a comparison is done in each entry in "Parallel Comparison". First, each bit of a stored bit string is compared with the corresponding bit of a given search key.
[Fig. 2. Functional view of VLMP search engine architecture: (a) Horizontal AND Operation — each entry bitwise-compares its stored bit string with the search key and ANDs all bits of the resulting bit string; (b) Vertical OR Operation — a vertical bitwise logical OR is applied across the mask bit strings (which encode prefix lengths) of the entries.]

[Fig. 3. Vertical Logical operation with Mask-encoded Prefix-length (VLMP): for search key 11011111, the masked compare matches P1 (P1_data=11000000, P1_mask=11100000), P3 (P3_data=11011000, P3_mask=11111000) and P4 (P4_data=11010000, P4_mask=11110000), but not P2 (P2_data=10001000, P2_mask=11111000); the vertical bitwise logical OR of the matched mask strings is 11111000.]
The result is also a bit string in which matched bit positions are expressed by 1's and others by 0's. Then a logical AND operation (Horizontal AND) is performed over all the bits of the resulting bit string. We refer to the length direction of a bit string as "horizontal" in this paper. In contrast, VLMP (see Fig. 2 (b)) is a logical OR operation applied to corresponding bits from different mask bit strings. We refer to the direction binding different bit strings as "vertical" in this paper. In an LPM search, while the primary role of a mask bit string is to indicate the prefix portion of a stored bit string in Parallel Comparison, it also encodes the length of the prefix in the form of a bit string containing contiguous 1's from its MSB (Most Significant Bit), where the number of contiguous 1's is identical to the prefix length. VLMP makes use of mask bit strings to obtain the maximum length of the prefixes matching a given key. The following example further explains this idea. In Fig. 3, four prefixes, P1=110, P2=10001, P3=11011 and P4=1101, are stored. Note that prefixes can be stored at arbitrary addresses. Each entry contains a pair of bit strings (a data string and a mask string) whose length is equivalent to the length of a search key. Each data string contains a prefix from its MSB; the rest of the bits are padded with 0's. Each mask string contains contiguous 1's from its MSB, whose number equals the prefix length; the rest of the bits are padded with 0's. For example, the first entry contains a data string P1_data = 11000000 and a mask string P1_mask = 11100000, which represent the prefix P1=110. For the given search key 11011111, Parallel Comparison is performed and the first, third and fourth entries yield matches on their prefixes. VLMP is a bitwise logical OR operation applied to the mask strings whose prefixes matched the given key. In this example, VLMP is performed on the mask strings P1_mask=11100000, P3_mask=11111000 and P4_mask=11110000, and the result is 11111000. It should be noted that the result is identical regardless of the storage order of these mask bit strings.
[Fig. 4. LPM search using the result of VLMP: an exact compare of the matched entries' mask strings against 11111000 selects address 3 as the LPM entry.]

Once the result of VLMP is obtained, we can then find the Longest Prefix Match entry by searching the matched entries for one whose mask string exactly matches the result of VLMP (see Fig. 4). This second search also employs Parallel Comparison. In this example, we search P1_mask, P3_mask and P4_mask for 11111000 (the result of VLMP) and find the third entry to be the Longest Prefix Match entry.
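The two-stage search just described can be sketched in software as a behavioral model (the hardware performs the per-entry operations in parallel; here 8-bit strings are used as in Fig. 3, and the entries are deliberately stored in arbitrary order):

```python
# Behavioral model of the two VLMP search stages.
def make_entry(prefix: str, width: int = 8):
    """Build (data, mask): prefix zero-padded, and contiguous 1's from MSB."""
    data = int(prefix.ljust(width, "0"), 2)
    mask = int("1" * len(prefix) + "0" * (width - len(prefix)), 2)
    return data, mask

entries = [make_entry(p) for p in ["110", "10001", "11011", "1101"]]

def vlmp_search(key: int):
    # Stage 1: parallel masked compare, then vertical OR of matched masks.
    match1 = [(key & m) == (d & m) for d, m in entries]
    rv = 0
    for (_, m), hit in zip(entries, match1):
        if hit:
            rv |= m                # vertical bitwise OR over the VLMP lines
    # Stage 2: exact compare of RV against each matched entry's mask.
    for addr, ((_, m), hit) in enumerate(zip(entries, match1)):
        if hit and m == rv:
            return addr            # address of the LPM entry (unique)
    return None

print(vlmp_search(0b11011111))     # 2 -> the entry holding prefix 11011
```

Because the vertical OR is order-independent, the same address (the one holding 11011) is returned no matter how the entries are permuted, which is exactly the property the priority-encoder approach lacks.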
B. Detailed Architecture
Fig. 5 shows an internal block diagram of the proposed search LSI architecture. It contains a fixed number of entries (indicated by the dashed-line rectangles in Fig. 5), each of which contains a pair of bit strings: a data string DS and a mask string MS, both of the same length as the key (n bits). These bit strings are generated from a prefix in the same way as explained in the previous section. A search key is stored in the Key Register, and the LPM search is performed as follows.
First stage
1-1 A search key, K, stored in the Key Register is provided to each entry by means of Comparand Line1.
1-2 In each entry, a masked comparison, R1 := (K & MS) XOR (DS & MS), is performed, where "&" and XOR are a bitwise AND operation and a bitwise exclusive OR operation, respectively.
[Fig. 5. Block diagram of LSI architecture: each entry holds a data string DS and a mask string MS (n bits each); the Masked Compare R1 := (K & MS) XOR (DS & MS) drives Match Line1, selector S2 places MS (or all 0's) onto the VLMP Lines, the Vertical OR over the VLMP Lines produces RV at the VLMP port, the Exact Compare R2 := RV XOR MS drives Match Line2, selector S1 drives the LPM Line, and the Encoder outputs the result at the Search Result Port.]

1-3 An AND operation over all the bits of R1 is performed, and the result is provided to Match Line1.
1-4 If Match Line1 is set to 1, selector S2 outputs MS onto the VLMP Lines. Otherwise, S2 outputs all 0's onto the VLMP Lines.
1-5 On each bit position of the VLMP Lines, a vertical bitwise logical OR operation (VLMP) is performed. The result is referred to as RV in the following.
Second stage
2-1 RV is provided to each entry by means of Comparand Line2.
2-2 In each entry, the two bit strings RV and MS are exactly compared; that is, R2 := RV XOR MS is performed.
2-3 An AND operation over all the bits of R2 is performed, and the result is provided to Match Line2.
2-4 If an entry's Match Line1 and Match Line2 are both 1, selector S1 outputs 1 onto the LPM Line (this will be true for only one of the entries, because no two entries contain the same prefix). If not, S1 outputs 0 onto the LPM Line.
2-5 The address of the entry whose LPM Line is 1 can then be obtained from the Encoder.
In updating the forwarding table, new prefixes are stored merely by generating the two bit strings, i.e. a data string and a mask string, from the prefix and then storing them at an arbitrarily determined address in the LSI. Deleting prefixes is also easy: simply erase the two bit strings.

C. Improvement by Pipelining

We can apply a pipelining technique to this architecture by inserting a latch at A for each entry, as well as at each of the bit positions at B (see Fig. 5). With these latches, the respective components of the first and second stages are isolated from each other, and both stages can be processed in parallel.

D. Extending Capacity

To increase the number of searchable entries, we connect the VLMP Ports (see Fig. 5) of a number of LSIs. This connection expands the Vertical OR operation in step 1-5 to all of the connected LSIs. The result of VLMP is used in the second stage in all the connected LSIs.

IV. PERFORMANCE

The performance of our proposed VLMP architecture is determined by the time consumed in the Horizontal AND operation and the Vertical OR operation (see Fig. 2). We have estimated the performance of our VLMP architecture through delay simulation results obtained for a currently available 0.25 µm CMOS process. Fig. 6 is a timing chart illustrating the steps explained in the previous section. Let us define T1 and T2 as the delays of the first and second stages, respectively. T1 consists of t1 and t2: the delay of the Horizontal AND operation on an entry's Match Line1 and that of the Vertical OR operation on the VLMP Lines, respectively. T2 consists of the time consumed in the Horizontal AND operation on Match Line2. We have estimated these delays using a circuit simulation tool, for an entry bit length of 64 and for 4K entries. Simulation results show the following typical values: t1 = 7.5 ns (nanoseconds), t2 = 8.5 ns, and T2 = 15.0 ns. That is to say, we obtain a total delay of Ttotal = T1 + T2 = t1 + t2 + T2 = 31.0 ns. Taking into account other wiring delays that might be expected to occur in an LSI layout, we conservatively estimate the search latency to be less than 40 ns, which suggests that the proposed LSI architecture can be operated at 25 MHz. In other words, the LPM engine throughput would be no less than 25 million searches/second with a fixed latency of 40 ns. When we apply the pipelining technique explained in the previous section C, the operational frequency of the LSI is determined by the larger of the two stage delays. Under the same delay simulation conditions as above, the estimated delays of the first and second stages are T1 = t1 + t2 = 7.5 + 8.5 = 16.0 ns and T2 = 15.0 ns, respectively. We estimate 20 ns to be long enough for either of the stages to be completed, taking into account other wiring delays. That is to say, a pipelined architecture can be operated at a 50 MHz clock,
[Fig. 6. Timing chart of the two search stages: T1 (1st stage) = t1 + t2; T2 (2nd stage).]
TABLE II
SPECIFICATIONS OF DEVELOPED LSI AND PREVIOUS LSI

                              Developed LSI     Existing LSI [15]
Clock Freq. (MHz)             33                66
Data Length (bit)             64                32
# of entries/chip             4K                8K
Search Latency                2 clocks (60 ns)  420 ns
Throughput (search/second)    16.5 M            4.1 M
Max. Link Speed for
  Wire-Speed Forwarding       6.3 Gb/s          1.6 Gb/s

TABLE III
SPECIFICATIONS WHEN MULTIPLE LSIS ARE CONNECTED

# of Searchable Entries       16K (4 LSIs)      128K (32 LSIs)
D (Delay of 1st Stage)        1 clock           2 clocks
Search Latency                2 clocks (60 ns)  3 clocks (90 ns)
Throughput (search/sec)       16.5 M            11 M
and the throughput of an LPM search will be 50 million searches/second with a fixed latency of 40 ns. This performance is adequate for wire-speed forwarding of an OC-192 link (recall that wire-speed forwarding of OC-192 links requires a performance of 25 million searches/second).
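The delay figures above combine as follows; this is a simple arithmetic check using the quoted simulation values (all in nanoseconds):

```python
# Delay budget from the 0.25 um CMOS simulation values quoted above.
t1, t2, T2 = 7.5, 8.5, 15.0   # Match Line1 AND, VLMP vertical OR, Match Line2 AND

T1 = t1 + t2                  # first-stage delay: 16.0 ns
total = T1 + T2               # non-pipelined search delay: 31.0 ns

# Non-pipelined: ~40 ns budget with wiring margin -> 25 MHz, 25 Msearch/s.
# Pipelined: the clock is set by the slower stage, max(T1, T2) = 16.0 ns,
# budgeted at 20 ns with wiring margin -> 50 MHz, 50 Msearch/s,
# with latency still fixed at 2 stages x 20 ns = 40 ns.
print(total, max(T1, T2))     # 31.0 16.0
```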
V. DEVELOPED LSI
In order to evaluate the VLMP architecture, we have developed an LSI (VLMP Search Engine, Fig. 7). In this first version of the LSI, we chose a slower clock of 33 MHz rather than 50 MHz and disabled pipelined operation. Table II shows the evaluated performance of this LSI, which supports a maximum link speed of 6.3 Gb/s for wire-speed forwarding. The table also compares the performance specifications of an existing LSI [15] which, to the best of our knowledge, is the only alternative search LSI without a prefix sorting restriction. Table III shows the evaluated performance when multiple developed LSIs are connected. Even when a number of entries large enough for current Internet backbone routers is accommodated (e.g. 128K entries), the LSI can perform 11 million searches/second with a fixed latency of 90 ns.

VI. CONCLUSION
In this paper, we introduced a novel Longest Prefix Match search scheme, "Vertical Logical operation with Mask-encoded Prefix-length (VLMP)", and proposed a VLMP based search engine architecture that builds upon existing ternary CAMs. VLMP frees the search engine from the restriction of existing ternary CAM based search engines that prefixes must be stored in order of their lengths. With this architecture, a search performance of 50 million searches/second can be achieved with a fixed search latency of 40 nanoseconds. This performance is adequate for wire-speed forwarding of OC-192 (9.6 Gb/s) links. Moreover, unlike with conventional ternary CAMs, updating of the forwarding table can be completed quickly because prefixes can be stored in arbitrary order. Furthermore, the number of searchable entries can be increased by connecting multiple LSIs. Even when the huge number of entries required by Internet backbone routers is accommodated, the search latency remains adequately short for multi-gigabit/sec link speed. A developed LSI based on the proposed architecture was found to be capable of performing 16.5 million searches/second with a fixed search latency of 60 nanoseconds.

[Fig. 7. Developed LSI (VLMP Search Engine)]

ACKNOWLEDGMENT

The authors would like to thank Satoshi Hasegawa, Hideaki Tani, Hisao Tateishi and Masahiro Nomura for their valuable and useful comments on this paper.

REFERENCES

[1] C. Partridge, P. P. Carvey, E. Burgess, I. Castineyra, T. Clarke, L. Graham et al., "A 50-Gb/s IP Router," IEEE/ACM Trans. Networking, Vol. 6, No. 3, pp. 237-248, June 1998.
[2] V. Fuller, T. Li, J. Yu and K. Varadhan, "Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy; RFC 1519," Internet Request for Comments, September 1993.
[3] M. Waldvogel, G. Varghese, J. Turner and B. Plattner, "Scalable High Speed IP Routing Lookups," Proc. ACM SIGCOMM '97, September 1997.
[4] M. Degermark, A. Brodnik, S. Carlsson and S. Pink, "Small Forwarding Tables for Fast Routing Lookups," Proc. ACM SIGCOMM '97, September 1997.
[5] S. Nilsson and G. Karlsson, "IP-Address Lookup Using LC-Tries," IEEE J. Select. Areas Commun., Vol. 17, No. 6, pp. 1083-1092, June 1999.
[6] H. H. Tzeng and T. Przygienda, "On Fast Address-Lookup Algorithms," IEEE J. Select. Areas Commun., Vol. 17, No. 6, pp. 1067-1082, June 1999.
[7] N. Huang and S. Zhao, "A Novel IP-Routing Lookup Scheme and Hardware Architecture for Multigigabit Switching Routers," IEEE J. Select. Areas Commun., Vol. 17, No. 6, pp. 1093-1104, June 1999.
[8] P. Gupta, S. Lin and N. McKeown, "Routing Lookups in Hardware at Memory Access Speeds," Proc. IEEE INFOCOM '98, Vol. 3, pp. 1240-1247, March 1998.
[9] V. Srinivasan and G. Varghese, "Faster IP Lookups using Controlled Prefix Expansion," Proc. ACM SIGMETRICS '98, June 1998.
[10] A. J. McAuley and P. Francis, "Fast Routing Table Lookup Using CAMs," Proc. IEEE INFOCOM '93, Vol. 3, pp. 1382-1391, March 1993.
[11] MUSIC Semiconductors, "MUAC Routing CoProcessor (RCP) Family," http://www.music-ic.com/muac.pdf, October 1998.
[12] NetLogic Microsystems, Inc., "Ternary Synchronous Content Addressable Memory (IPCAM)," http://www.netlogicmicro.com/pdf/NL82721.PDF.
[13] W. R. Stevens, TCP/IP Illustrated, Volume 1 – The Protocols, Addison-Wesley, 1994.
[14] C. Labovitz, G. R. Malan and F. Jahanian, "Internet Routing Instability," IEEE/ACM Trans. Networking, Vol. 6, No. 5, pp. 515-527, October 1998.
[15] Kawasaki LSI USA, Inc., "Longest Match Engine KE5BLME008," http://www.klsi.com/products/lme.html.