Bit-Map Trie: A Data Structure for Fast Forwarding Lookups

Seung-Hyun Oh and Jong-Suk Ahn
Computer Engineering Department, Dongguk University
Chung-Gu Pil-Dong 3-26, Seoul, Korea

Abstract -- This paper proposes a data structure that performs forwarding lookups at Gigabit speed by condensing the routing table of backbone routers into cache, thus eliminating slow memory accesses. The proposed structure is based on the conventional trie, known to be well suited for partial string searches. For longest-prefix IP lookups, each of its levels denotes a segment of the IP address, and a node has multiple links, each of which represents one value of the address segment assigned to that level. When a given IP address reaches a dead-end node by following the links that match the address segment by segment, that node points to the routing entry used to forward the packet. To reduce its size, the trie compresses the pointers to child locations and routing entries into bit arrays in which a single bit encodes a pointer; hence we call the proposed structure the bit-map (BM) trie. For better performance, the BM trie can jump to the appropriate node at a middle level rather than starting from the root node. Experiments show that it compacts backbone routers' tables into a 512-Kbyte cache and achieves around 2.4 million lookups per second on a Pentium II processor.

1. INTRODUCTION

The need to furnish the Internet with high-speed backbones strongly motivates the development of fast routers operating at the aggregate speed of their input optical links. One research area for fast routers focuses on speeding up forwarding lookups, one of the main bottlenecks in router performance. This is because routers must locate, in a large forwarding table, the entry with the longest prefix matching the destination IP address of each incoming packet, and this LPM (Longest Prefix Matching) lookup can hardly achieve O(1) complexity. The time complexity of conventional LPM algorithms tends to grow in proportion to either the routing table size or the IP address size [1].

Existing research on accelerating lookups falls into three categories: hardware-based, software-based, and protocol-based techniques. Hardware-based solutions [2,3,4,5] try to expedite routing lookups by employing special hardware such as high-speed SRAM and CAM. Protocol-based schemes [6,7] such as MPLS (Multiprotocol Label Switching) avoid the complicated IP routing lookup by assigning a small index called a label to a connection at the edge router, which the edge router later attaches to incoming packets from that connection. After each router along the path to the destination learns the association of the label with the appropriate output link, future lookups for this connection become a simple index operation that uses the label as an offset into a table pairing each label with its outgoing link. In contrast to the above two approaches, software-based methods accelerate forwarding lookups simply by employing novel data structures and their accompanying algorithms. Recently, many forwarding structures [8,9,10,11] have been proposed that shorten lookup time to a few hundred nanoseconds per packet, fast enough to keep up with Gigabit links. Because of its deployment feasibility, this approach has attracted the most attention. For speedup, it either compacts the forwarding table until it is small enough to fit in cache or transforms the longest partial match problem into an exact match problem. It tries to limit the number of memory accesses, since memory access time is usually about ten times slower than the instruction cycle of modern processors.

This paper presents a forwarding structure named the bit-map (BM) trie that processes forwarding lookups at Gigabit speed by compacting the large routing tables of Internet backbone routers until they are small enough for cache. Like the conventional trie [12] for word search, the BM trie consists of links denoting segments of the IP address and nodes holding pointers to routing entries. To find the entry with the longest prefix matching a given IP address, it traces down the links whose assigned values equal the corresponding segments of the address. To reduce its size, the BM trie encodes child and entry pointers into two bit arrays, unlike other trie-based techniques [13,14], even though they also compact the forwarding table for cache. By representing the presence of a child node or a routing entry pointer with a single bit, their locations can be computed by counting the 1's in the bit array. Experiments show that the BM trie condenses the routing tables of backbone routers into less than 512 Kbytes and performs around 4 million lps (lookups per second) on a Pentium II processor for randomly generated IP addresses, or 2.4 million lps for real traffic traces from our campus network.

This paper is organized as follows. Section 2 details the BM trie and analyzes its search complexity, Section 3 presents the experimental results, and Section 4 concludes with future work.

This work was supported by grant No.2001-1-30300-00503 from the Basic Research program of the Korea Science & Engineering Foundation.

0-7803-7208-5/01/$17.00 (C) 2001 IEEE

2. BIT-MAP TRIE

2.1 Data Structure of Bit-Map Trie

The trie [12] is a data structure commonly adopted for storing character strings of different lengths that share common prefixes. When a dictionary is organized as a trie, for example, the search for a word simply navigates down a sequence of links from the root node, comparing each character of the queried word with the one assigned to each link. The comparison continues until there is no matching link or no character left to compare. The node finally reached holds an index to the place in the dictionary where the queried word's definition appears. By the same token, the routing table can be represented as a trie if we treat each bit of the IP address as a character. Starting from the trie's root, for example, the left and right links are assigned bits 0 and 1 respectively. We repeat this procedure until all the entries of the routing table are represented as nodes in the trie. Each node then stores a pointer to the routing entry whose destination address equals the bit string spelled out by the link path leading to that node. This representation, however, requires a lot of memory because of the huge number of child and entry pointers, especially when the forwarding table is large, as shown in [13]. This space complexity prevents the trie from being adopted as an index structure for the forwarding table: if the index database resided in main memory, the longest-prefix search over the trie would be inadequate for Gigabit switching because of the expensive memory accesses.

To shrink the conventional trie, we propose a data structure called the BM trie that encodes child and entry pointers with two bit arrays, named the prefix BM and the child BM respectively. The two BM arrays recover pointers by counting the number of 1's stored in the bit string up to a certain bit position. The bit position is derived from the IP address for the prefix BM, or from the number assigned to each node for the child BM. Note that nodes are numbered in breadth-first order.

The prefix BM encodes the set of entry pointers into a bit array by first transforming the trie into a complete binary trie, as explained in [9]. This expansion to the complete tree is equivalent to linearly spreading all possible routing entries over the whole IP address range, from all 0's to all 1's. After building the complete binary tree as in Fig. 1, a 1 or 0 is assigned to each leaf node depending on whether or not the link path leading to that leaf has a corresponding entry in the forwarding table. The prefix BM is then the sequence of bits assigned to the leaf nodes, starting from the leftmost one. Fig. 1 shows an example of building a trie from a small forwarding table, where a black (prefix) node has a matching entry in the forwarding table while a white one does not. To construct the complete binary tree from this trie, we simply append the dotted nodes below some nodes in Fig. 1 when they do not have children down to the lowest level. After that, a 1 or 0 is set at each leaf node depending on whether it is colored black or white. For the dotted nodes, only the leftmost one among its siblings is marked with 1, since the other siblings share the same routing entry.

Fig. 1. Bit-Map Construction By Expanding Trie to Complete Binary Tree

Then, the accumulated number of 1's from the leftmost bit of the prefix BM up to the bit position determined by the IP address is an index into the routing table. Refer to [12] for details.

If we constructed a single prefix BM for the whole 32-bit address space in this way, however, it would require 2^32 bits, wasting huge amounts of memory on 0's, since the routing entries cover only a small fraction of the IP address range. For compaction, each node instead represents a segment of the IP address rather than a single bit as in Fig. 1. Each node contains its own small prefix BM and multiple links, and the trie only needs to be expanded until all the prefixes in the routing table are covered by the prefix BMs of its nodes. Since this trie does not unfold into a complete binary tree, it saves a lot of memory unless the routing table is very dense. Of course, the storage is saved at the expense of visiting a number of small prefix BMs instead of directly indexing one large prefix BM to compute the entry pointer. Fig. 2, for example, illustrates a BM trie with four levels, each representing an 8-bit segment of the 32-bit address; each node, depicted as a triangle, manages a 256-bit prefix BM since it covers an 8-bit segment. The table beside the root node in Fig. 2 shows the node table, which holds the child BMs specifying the locations of the up to 256 children of each node together with the prefix BMs of all nodes.

Finding the routing entry in the BM trie consists of two steps: locating the destination node whose prefix BM contains the longest prefix matching the queried IP address, and counting the 1's in the located prefix BM to obtain the routing entry pointer. For the destination, the BM trie traces down the child nodes by iteratively computing their locations in the node table from each segment of the given IP address (an 8-bit segment in Fig. 2) until there is no link to follow. For the routing entry pointer, after reaching the destination it computes the bit position in the destination's prefix BM from the address segment belonging to the destination's level. The entry index is the sum of all 1's in the prefix BMs of the nodes before the destination and the 1's up to the located bit position in the destination's prefix BM. Note again that nodes in the BM trie are numbered in breadth-first order.
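To make the expansion concrete, the following C fragment is a minimal sketch of building one node's 256-bit prefix BM, assuming (as in Fig. 2) that the node covers an 8-bit segment and that the prefixes ending inside that segment are given as value/length pairs. It spreads each prefix over the leaf positions it covers and sets a bit wherever the longest-matching entry changes, i.e. at the start of each interval; prefixes extending beyond the segment would continue in child nodes and are not shown. The types, names, and bit ordering here are our own illustrative choices, not the authors' code.

    /* Sketch: build a node's 256-bit prefix BM from the prefixes that end
     * inside its 8-bit segment. A bit marks the start of each interval of
     * leaves that map to the same routing entry. */
    #include <stdint.h>
    #include <string.h>

    struct seg_prefix {
        uint8_t value;   /* prefix bits, left-aligned in the 8-bit segment */
        uint8_t len;     /* prefix length inside this segment, 0..8        */
        int     entry;   /* index of the routing entry it maps to          */
    };

    void build_prefix_bm(const struct seg_prefix *p, int n, uint8_t bm[32]) {
        int owner[256];                     /* longest-matching entry per leaf */
        int owner_len[256];
        for (int i = 0; i < 256; i++) { owner[i] = -1; owner_len[i] = -1; }

        /* spread every prefix over the leaves it covers, keeping the longest */
        for (int k = 0; k < n; k++) {
            int span = 1 << (8 - p[k].len);
            int base = p[k].value & ~(span - 1);
            for (int i = base; i < base + span; i++)
                if (p[k].len > owner_len[i]) {
                    owner[i]     = p[k].entry;
                    owner_len[i] = p[k].len;
                }
        }

        /* set a bit at the start of each interval of identical owners */
        memset(bm, 0, 32);
        for (int i = 0; i < 256; i++)
            if (owner[i] != -1 && (i == 0 || owner[i] != owner[i - 1]))
                bm[i >> 3] |= (uint8_t)(0x80 >> (i & 7));   /* leftmost bit first */
    }

Note that, exactly as described in Section 2.2 below, several set bits can map to the same routing entry, which is what the conversion table later resolves.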


As an example of this computation, when the bit position determined by the given IP address falls at the 120th bit of node k, the index to the routing entry is the sum of all 1's in the prefix BMs of nodes 1 through k-1 and the 1's up to the 120th bit in the prefix BM of node k. By assigning a 1 to each node that represents a prefix in the routing table, the prefix BM distinguishes one pointer from another. To lessen the counting overhead, each prefix BM stores the precomputed number of 1's accumulated over all previous prefix BMs. The prefix BM of each node, furthermore, is segmented into a few chunks whose incremental sums are also precalculated and stored. In our implementation, we split the 256-bit prefix BM into four chunks and reserved 5 bytes for this precomputed information. The precomputed sums of the 1's in all previous prefix BMs and in the first three chunks are kept in the four columns tp, p1, p2, and p3 of Fig. 2. So, for example, when the BM trie locates the 4th bit position of the third chunk of node n as the longest prefix, the pointer is the sum of tp, p1, p2, and the 1's up to the 4th bit of that chunk. The storage for this information amounted to 35 Kbytes for a backbone router's table with 50,000 entries in our code.

In addition to the prefix BM, the BM trie also needs child BMs indicating the locations of child nodes. As with the prefix BM, a 1 or 0 at each bit of the child BM indicates the presence or absence of the child node attached to that link; note that a link is instantiated as a bit. When a bit is set to 1, the forwarding table has an entry with a prefix longer than the prefix specified by that bit's position. As with the prefix BM, the pointer to the next child is the sum of the 1's in all previous nodes and the 1's up to the bit position in the child BM of the current node determined by the IP address segment. For example, each node in Fig. 2 has 256 bits in its child BM, one per value of the 8-bit segment, so each 8-bit segment of a given 32-bit IP address selects the appropriate bit position. The sum of the 1's up to this bit position is an offset to the lower-level node in the node table, which holds the two BMs in breadth-first order. For speedy counting, the child BM also stores precomputed counts of 1's, namely tc, c1, c2, and c3 in Fig. 2.

Fig. 2. Bit-Map Trie Configuration
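The node layout and the index computation just described can be sketched as follows. The field names mirror the tp/p1/p2/p3 and tc/c1/c2/c3 columns of Fig. 2, but the exact field widths (the paper reserves 5 bytes of precomputed sums per BM; a wider total is used here for safety) and the bit ordering inside a chunk are our assumptions, not the authors' implementation.

    /* Sketch of a BM-trie node covering an 8-bit segment, and the
     * entry-index computation of Section 2.1. */
    #include <stdint.h>

    typedef struct bm_node {
        uint64_t prefix_bm[4];  /* 256-bit prefix BM, four 64-bit chunks       */
        uint32_t tp;            /* 1's in the prefix BMs of all earlier nodes  */
        uint8_t  p1, p2, p3;    /* 1's in the first, second, third chunk       */
        uint64_t child_bm[4];   /* 256-bit child BM                            */
        uint32_t tc;            /* children in all earlier nodes               */
        uint8_t  c1, c2, c3;    /* children in the first, second, third chunk  */
    } bm_node;

    /* Count the 1's in the low (bit+1) bits of a chunk. */
    static unsigned ones_up_to(uint64_t chunk, unsigned bit) {
        uint64_t mask = (bit == 63) ? ~0ULL : ((1ULL << (bit + 1)) - 1);
        chunk &= mask;
        unsigned n = 0;
        while (chunk) { chunk &= chunk - 1; n++; }   /* clear lowest set bit */
        return n;
    }

    /* Index (before the conversion table of Section 2.2) for the node that
     * terminated the search and the 8-bit address segment at its level. */
    uint32_t prefix_index(const bm_node *node, uint8_t seg) {
        unsigned chunk = seg >> 6;                   /* which 64-bit chunk, 0..3 */
        unsigned bit   = seg & 63;                   /* position in that chunk   */
        uint32_t sum   = node->tp;                   /* earlier nodes            */
        const uint8_t before[3] = { node->p1, node->p2, node->p3 };
        for (unsigned i = 0; i < chunk; i++)         /* earlier chunks of node   */
            sum += before[i];
        return sum + ones_up_to(node->prefix_bm[chunk], bit);
    }

Locating the next child during the trace-down is the same computation applied to child_bm with tc, c1, c2, and c3.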

2.2 Search Algorithm’s Complexity

At each packet arrival, the BM trie is navigated to find the node with the narrowest prefix range covering the packet's destination address, counting 1's in the child BM of each visited node. When the navigation has no link to follow or reaches a leaf node, the search algorithm computes the routing entry pointer by counting all the 1's up to the bit position of the destination node's prefix BM indexed by the IP address segment. After the index into the routing table is determined, it must be translated once more to account for the fact that several indexes may point to the same routing entry. This is because the transformation of a trie into the corresponding complete binary tree generates more prefix nodes than there are routing entries [9]. When the address range of a prefix node overlaps some middle portion of its parent's prefix range, the transformation generates three prefix nodes to differentiate three prefix ranges; since the first and last ranges belong to the parent prefix, their prefix nodes should indicate the same routing entry, while the second prefix, covering the middle range, should indicate a different one. When the forwarding table has two entries with prefixes 11 and 1101, for example, the complete binary tree generates three nodes covering the intervals 11-1101, 1101-111, and the remainder from 111. For this purpose, the BM trie is accompanied by a conversion table coupling each computed index with the real offset into the routing table. This conversion table takes less than 100 Kbytes for 50,000 entries and requires one cache access.

As explained above, the search algorithm is divided into two steps: locating the destination node using the child BMs and computing the offset from the prefix BM. The worst case occurs when the queried address is all 1's and its routing entry exists. In this case, the BM trie traverses three levels, each of which needs 4 additions and 64 shift operations when each BM is divided into 4 chunks. After locating the destination node, it needs another 4 additions and 64 shifts to determine the entry index. The worst case therefore needs 16 additions and 256 shifts in total. We furthermore replace the 64 shifts at each level with 7 additions and 8 shifts by using a 256-entry table whose entries translate each possible byte of a BM into the number of 1's it contains. The worst-case computation is thus reduced to 44 (16 + 7 * 4) additions and 32 shifts. In terms of memory accesses, each trace-down step and the one index computation retrieve four 32-bit words from the two BMs; the four words hold the 5 bytes of precomputed sums and a 64-bit chunk. Finally, the search algorithm translates the computed index to the real entry index by fetching the corresponding pair from the conversion table. Assuming an instruction cycle of 2.5 ns on a 400 MHz CPU and a word access time from cache of 10 ns, the worst-case lookup time per packet in our implementation would be 360 ns: 190 ns for computation and 170 ns for cache accesses. The theoretical worst-case performance of this implementation is therefore around 2.7 million lps.
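The byte-table optimization described above can be sketched as follows: a 256-entry table maps every possible byte value to the number of 1's it contains, so counting the 1's in a 64-bit chunk costs 8 table lookups (shifts) and 7 additions. This is an illustrative fragment under the same assumptions as the earlier sketches, not the authors' exact code.

    /* 256-entry byte popcount table and byte-wise chunk counting. */
    #include <stdint.h>

    static uint8_t ones_in_byte[256];

    void init_ones_table(void) {
        for (int v = 0; v < 256; v++) {
            uint8_t n = 0;
            for (int b = v; b; b >>= 1)
                n += (uint8_t)(b & 1);
            ones_in_byte[v] = n;
        }
    }

    /* Byte-wise replacement for the bit-by-bit loop in ones_up_to() above:
     * 8 lookups and 7 additions per 64-bit chunk. */
    unsigned count_chunk(uint64_t chunk) {
        unsigned n = 0;
        for (int i = 0; i < 8; i++)
            n += ones_in_byte[(chunk >> (8 * i)) & 0xff];
        return n;
    }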


To improve the worst-case lookup time, the BM trie allows the lookup to start from the third level rather than from the root node by using another bitmap called the fast BM. The fast BM is a bit array encoding the presence of nodes at the third level, like the other two BMs. For example, the 64-Kbit fast BM in our implementation encodes the existence of the 2^16 possible nodes at the third level. The location of the destination node at the third level is simply the number of 1's placed in the fast BM before the bit position specified by the first 16-bit segment of the IP address. To avoid the counting operation, the fast BM also keeps precomputed incremental sums for each 64-bit chunk. When the BM trie finds that no node covers the first 16-bit segment of a given IP address, it goes back to the root node to look up shorter prefixes; otherwise, it proceeds to the third and fourth levels for longer prefixes. The fast BM thus reduces the worst-case trace-down cost from four levels to three.
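A minimal sketch of the fast-BM shortcut follows, reusing count_chunk() from the previous fragment. The names fast_bm and fast_pre, the return convention, and the bit ordering are our illustrative choices.

    /* The first 16 bits of the destination address index a 64-Kbit bitmap
     * over the possible third-level nodes; if the bit is set, the node's
     * position in the node table is the number of 1's before that bit,
     * otherwise the lookup restarts from the root. */
    #include <stdint.h>

    unsigned count_chunk(uint64_t chunk);      /* byte-table counter, above */

    extern uint64_t fast_bm[1024];             /* 2^16 bits = 1024 chunks   */
    extern uint32_t fast_pre[1024];            /* 1's before each chunk     */

    /* Returns the breadth-first index of the third-level node covering the
     * top 16 bits of dst_ip, or -1 if no such node exists. */
    int32_t fast_start(uint32_t dst_ip) {
        uint32_t top16 = dst_ip >> 16;
        uint32_t chunk = top16 >> 6, bit = top16 & 63;

        if (!((fast_bm[chunk] >> bit) & 1))
            return -1;                         /* fall back to the root     */

        uint64_t before = fast_bm[chunk] & ((1ULL << bit) - 1);
        return (int32_t)(fast_pre[chunk] + count_chunk(before));
    }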

3. PERFORMANCE MEASUREMENTS

To evaluate the BM trie's average performance over real routing tables, we adopt the routing information available from [15] and use a 400 MHz Pentium II platform (32-Kbyte L1 cache, 512-Kbyte L2 cache) running Linux 2.2.x. Table 1 shows the build time and size of the BM tries for five different routers' routing tables. The results confirm that the BM trie can accommodate a backbone router's table with close to 50,000 entries in a node table of no more than 304 Kbytes.

The average performance of the BM trie is measured under four different scenarios combining two kinds of routing tables with two kinds of input traffic. For routing tables, we use the backbone routers' tables archived by IPMA [15] and our campus router's table built by BGP (Border Gateway Protocol). For input traffic, we employ randomly generated IP addresses and real IP addresses collected at our campus's link to the outside. To keep the forwarding table in cache, we repeat the same lookup twice for each input IP address, measure the total execution time with the getrusage() function, and finally divide the total time by the number of input addresses. Tables 2 through 5 list the performance results for the four combinations of the two input traffic types and the two kinds of routing tables.
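The measurement loop described above can be sketched as follows. bm_lookup() stands in for the BM-trie search routine, which is not shown; the double lookup keeps the node table warm in cache, and the total user CPU time is divided by the number of input addresses, as in the text.

    /* Sketch of the timing harness: repeated lookups timed with getrusage(). */
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/resource.h>

    extern uint32_t bm_lookup(uint32_t dst_ip);    /* assumed search routine */

    static double user_seconds(void) {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    }

    void measure(const uint32_t *addrs, long n) {
        double start = user_seconds();
        for (long i = 0; i < n; i++) {
            bm_lookup(addrs[i]);                   /* first pass loads the cache  */
            bm_lookup(addrs[i]);                   /* repeated, cache-warm lookup */
        }
        double total = user_seconds() - start;
        printf("%.0f ns per address\n", 1e9 * total / (double)n);
    }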

TABLE 1
Bit-Map Trie Size and Built Time for Each Forwarding Table

  Routing Site   Bit-Map Table Size (KB)   # of Prefix Nodes   # of Entries   Built Time
  Mae-East       304                       74,366              48,290         290
  Mae-West       265                       45,476              29,588         200
  PacBell        242                       37,672              25,275         180
  AADS           203                       24,498              16,864         140
  Paix           151                       12,743              9,618          80

Table 2 shows the average lookup time of the two BM tries in the two bottom rows, labeled Seek time and Fast seek time, when the IPMA tables forward as many packets as specified in the '# of packets' row. The seek time of the BM trie ranges from 215 ns to 260 ns and that of the fast BM trie from 215 ns to 250 ns. The resolved ratio with random IP addresses is quite low, less than 33%, meaning that only 33% of the packets locate a corresponding routing entry while most packets are simply dropped because no associated routing entry exists. In other words, most lookups for random addresses end at the root node without traveling further down the trie. Table 3, however, shows that lookup times increase when real traffic is used, along with a much higher resolved ratio. The seek time doubles or triples while the hit ratio approaches 100%, meaning that the search for the routing entry of most IP packets descends at least below the first level. Note that most routing entries in the IPMA tables have prefixes longer than 16 bits [14].

Tables 4 and 5 estimate the performance of the two BM tries when the routing table of our campus router is employed. In these tables, the columns R11-R23 represent six routing tables captured at different times. Both kinds of tables verify that the two BM tries can sustain Gigabit switching, while the comparison between them shows that the lookup time doubles for real traffic. Based on this observation, we believe that some published algorithms [9,10,13] need to be evaluated over real traffic to accurately predict their behavior in real networks. Finally, the comparison of the four tables indicates that, unlike the input traffic, the routing table size does not noticeably affect the performance of the BM trie.

TABLE 2
Performance Measurements Over IPMA DB With Random IP Addresses

                        Mae-East   Mae-West   PacBell   AADS   Paix
  # of packets (K)      2000       2000       2000      2000   2000
  Resolved ratio (%)    32.9       32.9       26.4      26.1   21.6
  Seek time (ns)        260        250        240       240    215
  Fast seek time (ns)   250        245        230       230    215

TABLE 3
Performance Measurements Over IPMA DB With Real IP Addresses

                        Mae-East   Mae-West   PacBell   AADS   Paix
  # of packets (K)      6199       6199       6199      6199   6199
  Resolved ratio (%)    100        99.9       99.9      99.9   99.9
  Seek time (ns)        492        501        492       503    492
  Fast seek time (ns)   417        417        416       417    416

TABLE 4
Performance Measurements Over Our Routing Table With Random IP Addresses

                        R11    R12    R13    R21    R22    R23
  # of packets (K)      2000   2000   2000   2000   2000   2000
  Resolved ratio (%)    30.5   31.3   31.3   30.5   31.3   31.3
  Seek time (ns)        190    195    195    190    195    195
  Fast seek time (ns)   210    210    215    210    215    210


TABLE 5
Performance Measurements Over Our Routing Table With Real IP Addresses

                        R11    R12    R13    R21     R22    R23
  # of packets (K)      2262   6199   6184   12446   3088   2553
  Resolved ratio (%)    100    100    100    100     100    100
  Seek time (ns)        344    312    312    314     314    321
  Fast seek time (ns)   260    243    244    245     242    242

Fig. 3 shows why the performance differs with the traffic characteristics by depicting the resolved ratio at each trie level, using the IPMA Mae-East table and our campus (DGU) R12 table. From Fig. 3, we can see that random IP addresses are resolved almost entirely at levels 1 and 2, whereas real addresses hit either at level 2 for the campus routing table or at level 3 for the backbone router's table. Compared with the corresponding hit ratios for random addresses, real traffic must walk deeper into the BM trie to be resolved.

Fig. 3. The Forwarding Lookup Resolution Level Distribution

4. CONCLUSIONS

This paper proposes an indexing structure named the BM trie for accelerating forwarding lookups to Gigabit speed. The experiments show that it condenses a forwarding table with as many as 48,000 entries into about 300 Kbytes. They also indicate that it processes one lookup within 190 ns to 250 ns for randomly generated IP addresses, equivalent to around 4 million lps, while performance slows to 240 ns to 410 ns for real IP addresses, still supporting 2.4 million lps. Our immediate future research will compare a variety of existing data structures designed for fast lookups with our bit-map structure on the same platform, using IP addresses captured over real networks. Another direction is to extend the BM trie to parallel machines so as to achieve Gigabit speed even for 128-bit IPv6 addresses.

REFERENCES

[1] K. Sklower, "A Tree-Based Routing Table for Berkeley Unix", Technical Report, University of California, Berkeley.
[2] T.-B. Pei and C. Zukowski, "Putting Routing Tables in Silicon", IEEE Network Magazine, January 1992.
[3] A. J. McAuley and P. Francis, "Fast Routing Table Lookup Using CAMs", in Proceedings of IEEE INFOCOM '93, vol. 3, pp. 1382-1391, San Francisco, 1993.
[4] A. J. McAuley, P. F. Tsuchiya, and D. V. Wilson, "Fast Multilevel Hierarchical Routing Table Using Content-Addressable Memory", U.S. Patent serial number 034444.
[5] P. Gupta, et al., "Routing Lookups in Hardware at Memory Access Speeds", in Proceedings of IEEE INFOCOM '98, San Francisco, April 1998.
[6] A. Bremler-Barr, Y. Afek, and S. Har-Peled, "Routing with a Clue", in Proceedings of ACM SIGCOMM '99, Cambridge, September 1999.
[7] E. Rosen, et al., "Multiprotocol Label Switching Architecture", ftp://ds.internic.net/internet-drafts/draft-ietf-mpls-arch-07.txt, July 2000.
[8] D. R. Morrison, "PATRICIA - Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM, 15(4):514-534, October 1968.
[9] M. Degermark, A. Brodnik, S. Carlsson, and S. Pink, "Small Forwarding Tables for Fast Routing Lookups", in Proceedings of ACM SIGCOMM '97, October 1997.
[10] B. Lampson, V. Srinivasan, and G. Varghese, "IP Lookups Using Multiway and Multicolumn Search", in Proceedings of IEEE INFOCOM '98, March 1998.
[11] S. Venkatachary and G. Varghese, "Faster IP Lookups Using Controlled Prefix Expansion", in Proceedings of ACM SIGMETRICS '98, June 1998.
[12] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, "Data Structures and Algorithms", Addison-Wesley, 1983.
[13] T. Kijkanjanarat and H. J. Chao, "Fast IP Lookups Using a Two-Trie Data Structure", in Proceedings of IEEE GLOBECOM '99, 1999.
[14] P. A. Yilmaz, A. Belenkiy, N. Uzun, and N. Gogate, "A Trie-Based Algorithm for IP Lookup Problem", in Proceedings of IEEE GLOBECOM '00, 2000.
[15] University of Michigan and Merit Network, Internet Performance Measurement and Analysis (IPMA) project, http://nic.merit.edu/~ipma.
[16] S. Deering and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", IETF RFC 2460, December 1998.

