
Computer Networks 52 (2008) 303–326 www.elsevier.com/locate/comnet

A cache-based internet protocol address lookup architecture

Soraya Kasnavi a, Paul Berube b, Vincent Gaudet a,*, José Nelson Amaral b

a Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada T6G 2V4
b Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8

Received 16 January 2007; received in revised form 22 June 2007; accepted 29 August 2007
Available online 22 September 2007
Responsible Editor: A. Marshall

Abstract

This paper proposes a novel Internet Protocol (IP) packet forwarding architecture for IP routers. This architecture is comprised of a non-blocking Multizone Pipelined Cache (MPC) and of a hardware-supported IP routing lookup method. The paper also describes a method for expansion-free software lookups. The MPC achieves lower miss rates than those reported in the literature. The MPC uses a two-stage pipeline for a half-prefix/half-full address IP cache that results in lower activity than conventional caches. MPC's updating technique allows the IP routing lookup mechanism to freely decide when and how to issue update requests. The effective miss penalty of the MPC is reduced by using a small non-blocking buffer. This design caches prefixes but requires significantly less expansion of the routing table than conventional prefix caches. The hardware-based IP lookup mechanism uses a Ternary Content Addressable Memory (TCAM) with a novel Hardware-based Longest Prefix Matching (HLPM) method. HLPM has lower signaling activity when processing short matching prefixes than alternative designs. HLPM has a simple solution to determine the longest matching prefix and requires a single write for table updates.
© 2007 Elsevier B.V. All rights reserved.

Keywords: IP lookup; IP caching; Content addressable memory (CAM); Packet forwarding architectures

1. Introduction

The increasing volume of Internet traffic, coupled with the growing quality-of-service demands of new services such as voice-over-IP, requires commensurately more powerful routers that can forward packets with both high throughput and low latency. Software-based solutions for packet routing fail to meet the throughput demand in routers that are

* Corresponding author. Tel.: +1 780 492 6486. E-mail address: [email protected] (V. Gaudet).

located close to the core of the network. Thus, high-performance routers use specialized circuitry to perform packet forwarding in hardware. To cope with the demand increase, this circuitry has become increasingly complex and power-hungry. Circuit complexity increases router costs, and high power consumption introduces heat-dissipation challenges and increases operating costs. This paper introduces low-complexity enhancements to the packet forwarding architecture of a router. The improvements can be adapted to handle longer IP addresses if required, and are geared towards low signaling

1389-1286/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2007.08.010


activity, a key indicator of dynamic power consumption. The target application for this new routing architecture is routers that are placed close to the core of the network.

Routing traffic is a fundamental task in a communication network. Fig. 1a depicts the architecture of a general bus-based router for Internet Protocol (IP) packets. IP routers receive packets from an input line, inspect the packet headers, and forward the packets to an outgoing interface according to the destination address and local routing information. The arrows in Fig. 1b show two possible paths through a router. Routers look up the destination address of each packet in their lookup table (LUT) to determine which interface the packet should be forwarded to. Fig. 1a highlights that routers either store LUT information in main memory and use a software-based method to perform the lookup, or use dedicated hardware for the LUT. With large routing tables, routing lookup is non-trivial. Lookup delay can be a bottleneck in high-throughput routers. Thus, fast routers use the classic technique of storing recent lookup results in a cache memory for later reuse. The example in Fig. 1b illustrates the impact of caching on forwarding. Packet 1 hits the cache and is forwarded to the third interface. Packet 2 misses the cache. The processor fetches the missing information from the lookup table before forwarding packet 2 to its output interface.

This paper presents an efficient and high-throughput forwarding architecture for Internet routers. The goals in the design of this new architecture are to: (1) reduce cache miss rate; (2) reduce effective cache miss penalty; and (3) lower power


consumption. We focus our attention on two key components: the cache and the lookup table. The cache design uses a non-blocking Multizone-Pipelined Cache (MPC) that we first presented in [15]. MPC uses prefix caching in multiple zone caches to improve cache miss ratio. A non-blocking buffer in MPC reduces the effective cache miss penalty. MPC’s pipelined design implements a novel search technique and reduces power consumption. The hardware-based LUT described in this paper uses a novel Hardware-based Longest Prefix Matching (HLPM) method that we first presented in [14], to deliver fast table updates as well as fast search operations without requiring table management. The techniques described in this paper improve the throughput of address forwarding in IP routers while addressing the issue of power consumption in such routers. The main original contributions presented here are: (1) a new forwarding architecture with a hardware-based lookup table method that improves the resolution of the Longest-Prefix Matching process on an IP routing table, (2) a TCAM-based longest prefix matching solution that delivers fast updates, addresses power consumption by having lower activity, and implements smart search operations, (3) an expansion-free software lookup method and a new short prefix table expansion method that increases cache locality and keeps the lookup table small and simple, (4) a new multizone pipelined cache that has better miss ratio and lower signaling activity (a key predictor of power consumption) than alternative designs, and finally (5) a performance evaluation of the architecture based on real traces, which indicates that the design yields potential power savings resulting from low

Fig. 1. Architecture of an Internet Router.


signaling activity for certain specific traffic patterns in routing table structures, and that the MPC miss rate is comparable to the miss rate of full-prefix caching. This paper is organized as follows. Section 2 discusses background material and Section 3 describes related work. Section 4 describes the routing-cache design and explains the motivation for an efficient forwarding cache. An implementation of a hardware-based LUT using the HLPM technique is presented in Section 5. The details of the MPC architecture and features are further discussed in Section 6. Section 7 presents simulation results. Finally Section 8 concludes the paper.

2. Background

2.1. Routing lookup

IP routers forward packets to their destination using information stored in routing tables. Rather than storing full destination addresses, routing tables often store routing prefixes, thus reducing table size. A prefix of length n is comprised of the n most-significant bits of the IP address followed by don't cares. Classless Inter-Domain Routing (CIDR) based routing prefixes may have variable length [23]. When multiple prefixes match an IP address, the correct lookup result is the longest-matching prefix (lmp). Therefore, when searching a routing table, routers must perform an operation known as Longest-Prefix Matching (LPM).1 In order to ensure high throughput and low latency under load, routers require a fast lookup mechanism, since the LPM operation is performed in every router along an IP packet's path from source to destination. An example of a LUT for a 4-bit addressing scheme is shown in Fig. 2a. According to the figure, a lookup for a packet with destination 1001 would return prefix I, and the corresponding port ID would be port A. Similarly, 1101 would match with both prefixes I and II. Here, prefix II is the lmp, and the port ID is therefore port B. Fig. 2b depicts the LUT of Fig. 2a organized as a trie. A trie representation of a LUT is a graph organized as a tree. The root corresponds to the most-significant bit of an IP address.

1 We denote the longest-matching prefix (lmp) as the prefix returned by the Longest-Prefix Matching (LPM) process.
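To make the LPM example above concrete, the following Python sketch (an editorial illustration, not part of the paper) builds the three-prefix table of Fig. 2a as a binary trie and resolves the longest-matching prefix for the two addresses discussed in the text.

```python
# Minimal longest-prefix match over a binary trie stored as nested dicts.
# The prefixes and ports are those of the Fig. 2a example.

def insert(root, prefix, port):
    """Insert a prefix such as '110' (trailing don't-care bits omitted)."""
    node = root
    for bit in prefix:
        node = node.setdefault(bit, {})
    node["port"] = port

def longest_prefix_match(root, address):
    """Walk the trie along the address bits and keep the last prefix seen."""
    node, best = root, None
    for bit in address:
        if bit not in node:
            break
        node = node[bit]
        best = node.get("port", best)
    return best

root = {}
insert(root, "1", "A")    # Prefix I   (1xxx) -> Port A
insert(root, "110", "B")  # Prefix II  (110x) -> Port B
insert(root, "01", "A")   # Prefix III (01xx) -> Port A

print(longest_prefix_match(root, "1001"))  # 'A' -- only prefix I matches
print(longest_prefix_match(root, "1101"))  # 'B' -- prefix II is the lmp
```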


Lookup table of Fig. 2a: Prefix I (1xxx) -> Port A; Prefix II (110x) -> Port B; Prefix III (01xx) -> Port A.

Fig. 2. An example of a longest matching prefix.

Paths that lead away from the root go towards less significant bits. A branch to the right indicates that a specific bit is a 1, and a branch to the left indicates that the bit is a 0. All possible IP addresses are enumerated in a trie. For our prefix-based tries, only the nodes that are required to form paths to each prefix are stored, thus limiting the space requirements for the trie. Nodes are sequentially numbered, in a breadth-first fashion, from top to bottom and left to right. Key metrics that are used to evaluate routing table lookup efficiency include the lookup speed, routing table update latency, energy consumption per search, and routing table memory footprint. Several algorithms have been proposed in the literature to increase routing lookup efficiency [23]. Schemes that achieve high lookup speeds or small memory footprints generally do so at the expense of greater implementation complexity or slower routing table updates. Software-based solutions typically require 4–6 memory accesses, thus they can be slow and do not easily scale up to 10-Gbps data rates. On the other hand, Ternary Content-Addressable Memories (TCAMs) can resolve lookups in a single memory access, and are thus desirable solutions for IP lookup due to their high speed. However, they are not necessarily efficient in terms of power consumption and chip area when compared to RAM-based solutions. Also, when there are multiple matching prefixes, complicated table management schemes are required to find the LPM in TCAM-based LUTs.

2.2. Routing cache

A functional description of a routing cache designed by Berube et al. is shown in Fig. 3. A Destination Address Array (DAA) stores IP addresses or prefixes, and a Next-Hop Array (NHA) stores next-hop information [4,3].


Fig. 3. A functional view of a routing cache [4].

The DAA and NHA are co-indexed, such that each NHA entry corresponds to a single DAA entry. The NHA can be implemented using a standard CMOS SRAM process technology, and the DAA can be implemented using a CAM or a TCAM, also in a standard CMOS process. A CAM is a fully associative binary memory that can match a key against all its entries in a parallel fashion. A TCAM can store and search three values: 0, 1, and don't-care. A don't-care value is a wild card that matches both a 1 and a 0 during a search process. If prefixes are cached, a TCAM is used to implement the DAA. During an IP address search in a routing cache, all entries of the DAA are searched simultaneously. A cache hit occurs if an address is found. The corresponding next-hop identifier is then read from the NHA, and the packet is directly forwarded to the appropriate output port. If the IP address is not found in the DAA, then a cache miss occurs. A software or hardware lookup is then performed in the routing table, and the routing cache is accordingly updated with the new destination-address/next-hop pair. The impact of a cache on the total throughput of a system is directly related to the miss rate and to the miss penalty of the cache. Characteristics of IP traffic, such as its temporal and spatial locality, affect the miss rate of a routing cache. If an IP address stream has good temporal locality, then a cached IP address will typically be accessed many times within a short time span, thus lowering the miss rate and increasing the utility of that cached address. Similarly, when addresses in the same numerical range are accessed in close succession, we say that there is good spatial locality.
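The following sketch (illustrative only; the class and method names are not taken from the paper) models the co-indexed DAA/NHA organization and the hit/miss/update flow just described, with a list scan standing in for the parallel (T)CAM search of the DAA.

```python
# Behavioral sketch of a routing cache with a co-indexed DAA and NHA.
# In hardware the DAA is searched associatively in parallel; here a simple
# scan over (value, mask) pairs stands in for that search.

class RoutingCache:
    def __init__(self, size):
        self.size = size
        self.daa = []   # list of (value, mask); zero mask bits are don't-cares
        self.nha = []   # co-indexed next-hop identifiers

    def lookup(self, addr):
        """Return the next hop on a hit, or None on a miss."""
        for i, (value, mask) in enumerate(self.daa):
            if (addr & mask) == (value & mask):
                return self.nha[i]
        return None

    def update(self, value, mask, next_hop):
        """Insert a lookup result; FIFO replacement keeps the sketch simple."""
        if len(self.daa) >= self.size:
            self.daa.pop(0)
            self.nha.pop(0)
        self.daa.append((value, mask))
        self.nha.append(next_hop)

# Example with 4-bit addresses: cache the prefix 110x with next hop 'B'.
cache = RoutingCache(size=4)
cache.update(value=0b1100, mask=0b1110, next_hop="B")
print(cache.lookup(0b1101))  # 'B'  -- cache hit
print(cache.lookup(0b1001))  # None -- cache miss, so the LUT must be searched
```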

By caching prefixes rather than full addresses, a single cache entry can cover a large number of full addresses within the same numerical range, thus converting a traffic stream's spatial locality into temporal locality in the cache access stream. Since cache misses require relatively slow table lookups, the miss penalties for IP caches can be very large. The lookup speed can be improved by using complex lookup techniques in the routing table. However, this is done at the expense of increased time required to perform table updates. Improving the cache miss rate through the appropriate usage of prefixes can thus compensate for a large cache miss penalty, and may additionally allow a simple LUT with fast table updates. Alternatively, a non-blocking cache could be used, and would hide memory latency by overlapping LUT operations with cache accesses [6,15].

3. Related work

This paper covers several concepts. Related work can be divided into routing caches, prefix caching, multizone caching, and the implementation of longest prefix matching using ternary CAMs.

3.1. Routing caches

There exists a wide body of literature related to routing caches. Some papers focus on the locality in IP traffic [11,30,32]. Other papers present designs for more efficient IP routing caches [9,7,19,31]. Feldmeier demonstrates that the lookup time in network gateways can be reduced by 65% by using a routing table cache, with a fully associative cache providing the best performance [11]. Berube et al. demonstrate an implementation method for high-density, fully associative, CAM-based caches that uses a Xilinx VirtexE Field Programmable Gate Array (FPGA) [4]. Chiueh et al. present a CPU-style IP caching methodology, and show that general-purpose microprocessors are a feasible solution for high-performance routing [7]. However, further analysis shows that differences in the data streams presented to network processors and general-purpose CPUs used in the studies require significantly different cache architectures in order to achieve acceptable performance [8]. Their studies additionally show that significant temporal locality exists in Internet traffic, based on analysis of real packet traces. IP address coverage in a cache can be improved by caching ranges of addresses rather than individual IP addresses. Finally, they show that IP traffic with low spatial locality lends itself well to small cache blocks, preferably


with a single entry. Talbot et al. show that the IP lookup process can be sped up by caching IP destination addresses [32]. Shyu et al. report significant temporal locality in backbone routers [30]. Bhuyan et al. study the impact of instruction-level parallelism (ILP) and cache architectures on the performance of routers, using execution-driven simulation [5]. Performance improvements of up to 37% are reported on their traces, with out-of-order execution and non-blocking loads indicated as primary factors. A non-blocking cache can be used to hide memory latency, by overlapping memory accesses with computations [6].

3.2. Prefix caching

Liu coined the term IP Prefix Caching to describe caching address ranges instead of destination addresses [19]. Liu demonstrates that the greater locality in prefixes results in lower miss ratios when prefixes are cached instead of addresses. A potential problem with IP-prefix caching is multiple-prefix hits. If only one match is cached, packets may be forwarded to the wrong output port. To solve the problem, Liu applies a transformation algorithm to the lookup table, as discussed in Section 4. Improving the replacement policy of an IP cache is another way to reduce the miss rate [20,30]. Feldmeier demonstrated that for large caches the performance of a First-Input–First-Output (FIFO) policy is almost as good as the performance of a Least-Recently-Used (LRU) policy. However, FIFO performs poorly in small caches [11]. A Least Frequently Used (LFU) replacement policy outperforms FIFO and LRU [30]. Liu presents the mLRU replacement policy applied to prefix caches. mLRU aims to combine the advantages of LRU and LFU, but it is complicated to implement and maintain [20].

3.3. Multiple zone caching

Caches can naturally exploit temporal locality when it exists. IP routers, however, typically handle traffic from many different hosts. If a single cache is used, streams having low locality may pollute the cache with low-utility entries, and thus reduce the cache's effectiveness for the combined IP traffic. Chvets and MacGregor [9] and Shyu et al. [31] reported that splitting a cache into multiple caches may improve system performance. For instance, in a split cache, one section could store entries for shorter routing prefixes, and the other section could


store entries for longer routing prefixes. In such a multizone cache, the lack of locality in one IP traffic stream does not prevent the exploitation of locality in the rest of the traffic. Multizone caches can have miss rates about half those of conventional caches [9]. Splitting a cache into multiple zones is not without drawbacks. The control logic necessary to coordinate the search into multiple caches introduces overhead into cache searches and updates, possibly necessitating longer, multi-cycle cache operations. The MPC design balances the competing concerns of reducing miss rates and streamlining the cache architecture operation to achieve the benefits of a split cache with short cache search and update operations.

3.4. Longest prefix matching in ternary CAMs

Among hardware-based LUT implementations, TCAM-based lookup tables are preferred due to their high speed and to the simplicity of their architecture. However, TCAM-based solutions have drawbacks: (1) large area because of large memory cells; (2) high power consumption due to precharging and discharging long memory lines; and (3) a potentially complex mechanism to resolve the longest matching entry. The power consumption of a TCAM can be reduced by lowering the total number of entries that are searched during each search operation. For example, in a pipelined TCAM, searches of non-matching entries are avoided after each stage [21]. A pre-computational TCAM proposed in [18] stores a parameter in the form of a few binary bits per entry. A search operation first calculates the parameter for an IP address (Talbot et al. use the four most-significant bits [33]) and then compares it with all the existing parameters in the table. Only the matching entries are fully searched with the IP address. Another solution to reduce the number of searched entries is block selection [25,36,37,2,35,22]. Block selection partitions a large LUT into small blocks or pages. Incoming IP addresses go through a range-detection phase to resolve which TCAM block needs to be searched. A partitioning algorithm allows incremental updates to insert new prefixes in correct TCAM blocks. Partitioning allows multiple TCAM blocks to be searched concurrently. However, the temporal and spatial locality in the IP address stream makes consecutive accesses to the same TCAM block likely, limiting


the potential improvement in throughput. Akhbarizadeh et al. propose storing recently referenced prefixes in a separate TCAM block [2]. Other potential solutions require complicated table management to balance the load on each TCAM block [25,37]. Wang et al. propose a combination of TCAM and DRAM to implement the LUT [34]. The prefixes are merged for storage in the TCAM, while the DRAM stores the full table. Searching the TCAM provides an index to a segment of the DRAM which is fully searched. The simplest method to find the lmp in a TCAM is to sort the prefixes stored in the TCAM based on their length. Then a priority encoder can easily select the lmp. However, an insertion of a new entry in a sorted TCAM storing N prefixes might result in shifting O(N) TCAM entries, and this would be undesirable because forwarding tables may require hundreds to thousands of updates per second [17]. Some TCAM lookup tables reserve empty entries between sets of different length prefixes to simplify table updates. However, empty entries lead to under-utilization of the TCAM, and do not change the worst-case complexity of updates. Nonetheless, if these reserved entries are in the middle of the TCAM, an empty space can be made at an arbitrary location with no more than L/2 shifts, where L is a prefix length (e.g. L = 32 for IPv4 prefixes) [24]. However, the efficiency of such TCAMs is reduced due to space management overhead and non-uniform update delays. Furthermore, due to longer prefixes, these problems get worse with IPv6 addresses, where L = 128. A TCAM can store the length of a prefix as a 32-bit (or 128-bit) mask in an additional table. Each mask is formed by storing consecutive ones for the prefix bits, followed by zeros for the don't-care bits. However, this encoding results in at least a 70% increase in the TCAM's memory requirements [26]. Binary CAMs can use a mask to simulate a TCAM [16], resulting in slower search times and lower memory density. Existing TCAM LPM solutions have many limitations. They have long worst-case update delays, slow down lookup speeds, require table management and maintenance schemes, may require significant silicon area and may not easily scale to the IPv6 standard. The HLPM technique described in this paper2 is a power-efficient hardware solution

2 A preliminary description of HLPM appeared in [14].

for the LPM that provides very fast table updates with no table management requirements.

4. Consistency in prefix caching

Prefixes stored in caches may cover large ranges of the IP address space and hence can reduce the cache miss rate. It is, however, possible for an IP address to match more than one prefix, and the requirement for the router to return the longest-matching prefix (lmp) complicates the caching process. A case that may cause complications is the one where two prefixes match an IP address, the shorter prefix is cached, but the lmp is not. In this case, the short prefix would produce a cache hit, and the wrong routing decision would be taken. Consider once more the example in Fig. 2, shown again for convenience in Fig. 4a. If the cache contains prefix I but not prefix II, IP addresses whose lmp in the trie is prefix II (e.g. address 1101) will be matched by prefix I, in other words the wrong prefix, and be forwarded to port A rather than port B. Since node 2 is on the path from the root to node 13, we say that the prefix in node 2 (prefix I) encompasses the prefix in node 13 (prefix II). To ensure correct LPM results, encompassing prefixes must be non-cacheable. The full addresses must be cached when a lookup results in a non-cacheable prefix. In Fig. 2, the prefix

Fig. 4. Trie presentation of a small lookup table. Prefix nodes are shaded. Prefix name (output port) is shown below the nodes.


at node 4 (prefix III) does not encompass any other prefix and is therefore cacheable. If non-cacheable prefixes are common lookup results, then a prefix cache degrades into a full-address cache. Moreover, the lookup mechanism must decide if a prefix is cacheable or not. Simulation results based on real traffic indicate that up to 45% of lookup results are non-cacheable prefixes (see Section 7).

4.1. Prefix caching and table expansion

To increase the locality of data stored in a forwarding cache, full-prefix caching is preferred. A full-prefix cache requires a full LUT expansion. In Fig. 4b, the trie from the example in Fig. 4a is fully expanded. This expansion is achieved by appending "0" to prefix I to form prefix I-1 in node 5, and by appending "11" to prefix I to form prefix I-2 in node 14. The original encompassing prefix is removed from the trie and its forwarding information (port A) is added to both new prefixes. Note that this process has increased the number of valid prefixes and has increased the size of the LUT. In [19], Liu reports that the table expansion process can increase the table size by up to 118%. In hardware implementations this expansion is undesirable, since it leads to memory-area limitations, greater power consumption, and cost. Prefixes are also pushed lower into the trie, thus increasing the search time for software-based searches. Updates in fully expanded tables can also be challenging. An example of a partial LUT expansion, reported in [19], in which non-cacheable prefixes are expanded, but only to their first possible level of expansion, is shown in Fig. 4c. For every non-cacheable prefix, only one new cacheable prefix is added to the table. The new prefix is formed by adding either a 0 or a 1 to the non-cacheable prefix. The non-cacheable prefix is not removed from the table and the new prefix is added only if it is cacheable. This expansion increases the chance that the lmp of more IP addresses is cacheable. If a new prefix is cacheable, then this prefix is the lmp of half of the IP addresses covered by the original non-cacheable prefix. As depicted in Fig. 4c, a new prefix PI-1, that has the same forwarding information as prefix PI, is added to the trie at node 5. In the original trie, the lmp of all the IP addresses covered by nodes 5 and 14 is prefix I (node 2). Since prefix I is not cacheable, those IP addresses must be cached in full. After the addition of prefix PI-1 to the trie (see Fig. 4c), the lmp of all


IP addresses covered by node 5 is prefix PI-1, which is cacheable. Caching PI-1 instead of IP addresses increases the locality of the cache and results in lower miss rates. The LUT generated by this partial expansion is smaller than a fully expanded table. However, the level-one expansion method is not useful for some non-cacheable prefixes that remain non-cacheable after a 0 or a 1 is added. This is especially important because the short prefixes close to the top of the trie, which cover most of the IP addresses, encompass many other prefixes. The lookup scheme must also decide whether a prefix is cacheable. Even though the routing table grows less than with full expansion, updates are still complicated and non-cacheable prefixes still exist in the trie.

4.2. Expansion-free (EF) software lookups

Software lookups proceed down the trie to find the lmp for an IP address. The process ends when there are no more branches that can be taken [23]. The lmp is the last prefix encountered during the process. We have developed an Expansion-Free (EF) method, used during a software lookup, that generates cacheable prefixes by using a simple and inexpensive mechanism. Akhbarizadeh et al. independently discovered a similar method and used simulations to evaluate its effectiveness [1]. They report lower miss rates for a prefix cache with the EF method compared to a full IP address cache. With the EF methodology, cacheable prefixes are dynamically generated and forwarded to the cache, but are not stored in the LUT, thus requiring no LUT expansion. The lookup table can be easily updated with no constraints, because there is no table transformation. In Fig. 5b the EF method is illustrated, as applied to the example of Fig. 4. The EF method has three rules, described below (a code sketch of the rules follows the list). Let n refer to the last node visited during a traversal of the trie, and let p refer to the last node containing a prefix that was visited during the traversal.

1. If p = n, and p is a leaf node, then the prefix in node p can be cached and forwarded to the prefix cache. In Fig. 5b, IP addresses covered by node 4 have p = n = 4, and the prefix in node 4 is cacheable.

2. If p = n, and p is not a leaf node, then the prefix in node p is not cacheable. In Fig. 5b, if an IP address matches node 5, then n = p = 2.


Fig. 5. Table transformation.

A cacheable prefix is produced by adding a 0 (the IP address' next bit) to the path from the root to node p. This dynamically generated prefix is forwarded to the cache.

3. If p ≠ n, then the prefix in node p is non-cacheable. In Fig. 5b, if an IP address matches node 14, then p = 2 and n = 6. A cacheable prefix is dynamically produced by adding a 1 (the IP address' next bit) to the path from the root to node n. This dynamically generated prefix is forwarded to the cache.
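A minimal sketch of the three EF rules is given below. It is an editorial illustration (the trie is represented as nested dictionaries, and rules 2 and 3 are merged, since both append the address's next bit to the deepest visited path); it is not the paper's implementation.

```python
# Sketch of the Expansion-Free (EF) rules on a binary trie stored as nested
# dicts: each node maps '0'/'1' to a child and may carry a 'port' entry.

def insert(root, prefix, port):
    node = root
    for bit in prefix:
        node = node.setdefault(bit, {})
    node["port"] = port

def ef_lookup(root, address):
    """Return (cacheable_prefix, port), assuming the address is longer than
    the visited path; returns (None, None) when no prefix matches."""
    node, path = root, ""
    p_port = None
    for bit in address:
        if bit not in node:
            break
        node = node[bit]
        path += bit
        if "port" in node:
            p_port, p_path = node["port"], path
    if p_port is None:
        return None, None
    is_leaf = "0" not in node and "1" not in node
    if path == p_path and is_leaf:
        return p_path, p_port                 # rule 1: the prefix is cacheable
    # rules 2 and 3: append the address's next bit to the deepest visited path
    return path + address[len(path)], p_port

root = {}
insert(root, "1", "A")    # Prefix I   (1xxx)
insert(root, "110", "B")  # Prefix II  (110x)
insert(root, "01", "A")   # Prefix III (01xx)

print(ef_lookup(root, "0111"))  # ('01', 'A')   rule 1
print(ef_lookup(root, "1011"))  # ('10', 'A')   rule 2 (node 5 generated)
print(ef_lookup(root, "1111"))  # ('111', 'A')  rule 3 (node 14 generated)
```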

4.3. Short prefix expansion (SPE)

We propose a new expansion method, called Short-Prefix Expansion (SPE), that provides more cacheable prefixes to improve cache locality, while keeping the LUT small and simple. Short prefixes provide more address coverage than longer ones. Although there are fewer short prefixes in a routing table, short prefixes are referenced more frequently (see Section 7). SPE fully expands the trie for prefixes that are shorter than 17 bits. As depicted in Fig. 5a, SPE pushes short prefixes down the trie until they either become leaf nodes, or become 17 bits long. Thus all short prefixes are cacheable. SPE has these advantages:

• Expansion of the LUT is restricted to those prefixes that provide the most significant IP address space coverage.
• All short prefixes can be cached, and hence at most one short prefix may match an IP address.
• Short prefixes are cacheable leaves of the trie, and hence no IP address can match with both a short prefix and a long prefix. However, we note that an address might match with multiple long prefixes, and that in this case, the LUT must resolve the LPM.
• It is sufficient to check the length of a prefix in order to determine if it is cacheable or non-cacheable.

If the LUT is implemented in a two-stage pipelined TCAM, the HLPM technique can be employed to easily find the longest cacheable prefix on a cache miss [14].

5. Hardware-based lookup table

This section describes a TCAM-based lookup table that stores IP prefixes expanded by the SPE method and is designed to work together with the MPC. The TCAM employs HLPM to resolve whether a prefix is cacheable or not and to find the lmp of non-cacheable prefixes.

5.1. LUT search operation

Fig. 6 depicts a two-stage-pipelined TCAM storing 32-bit IPv4 prefixes. The first stage stores the 17 Most-Significant Bits (MSBs) and the second stage stores the 15 Least-Significant Bits (LSBs). A don't care in the last bit of the first stage of the TCAM indicates that the second stage need not be searched. The traces studied contain many cacheable 16-bit prefixes. Thus the 17-bit first stage allows the identification of these 16-bit prefixes in the first stage and dispenses with a second-stage search for them. For each entry, the Last-Prefix-Bit (LPB) field records, as a 4-bit binary-coded value, the position of the last bit of the prefix of an entry in the second stage.
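The following sketch (an editorial illustration that anticipates the LPB encoding rule detailed after Fig. 6) shows how a prefix would map onto the 17-bit first stage, the 15-bit second stage, and the 4-bit LPB field, and when the don't-care in the 17th cell lets the second-stage search be skipped.

```python
# Sketch of how a prefix maps onto the two TCAM stages and the LPB field
# (17-bit first stage, 15-bit second stage, 4-bit LPB). Names are
# illustrative and are not taken from the paper.

def encode_entry(prefix_bits):
    """prefix_bits: string of '0'/'1' of length 1..32 (an IPv4 prefix)."""
    length = len(prefix_bits)
    stage1 = prefix_bits[:17].ljust(17, "x")   # 'x' marks a stored don't-care
    stage2 = prefix_bits[17:].ljust(15, "x")
    # LPB: 0000 for prefixes of at most 17 bits, otherwise length - 17.
    lpb = format(max(length - 17, 0), "04b")
    # A don't-care in the 17th cell tells HLPM that the second stage need not
    # be searched for this entry (the prefix is at most 16 bits long).
    skip_stage2 = stage1[16] == "x"
    return stage1, stage2, lpb, skip_stage2

for p in ["1100110011001100",          # 16-bit prefix: stage 2 never searched
          "110011001100110011",        # 18-bit prefix: LPB = 0001
          "110011001100110011001100"]: # 24-bit prefix: LPB = 0111
    s1, s2, lpb, skip = encode_entry(p)
    print(len(p), lpb, "skip stage 2" if skip else "search stage 2")
```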


Fig. 6. Two-stage pipelined TCAM with HLPM [14].

By design, the value stored in this field is only relevant if the lowest of the MSBs is not a don't-care value. New prefixes can be inserted at any location, regardless of the prefix length. If the length of a new prefix is less than or equal to 17 bits, 0000 is stored in the corresponding LPB field. For prefixes that are longer than 17 bits, the binary value of the prefix length minus 17 is stored in the corresponding entry of the LPB. For example, 0000 is stored in the corresponding LPB entry of a prefix with size 17 since there is no non-don't-care bit in the second stage, and 0111 is stored for a prefix of length 24. When an IP address misses the cache, it is searched in the LUT. This search is done in two steps in the two-stage-pipelined TCAM. In step one, the MSBs of the IP address are applied to the first half of the TCAM (all entries are searched in parallel). If the IP address matches with an entry that has a don't-care value in its last cell (the 17th bit), SPE ensures that this prefix is the lmp and cacheable (see Section 4.3). Thus the search ends and the second stage is a no-operation task for that IP address. For example, in Fig. 6, if an IP address matches with prefix II, this prefix and its output port can be forwarded to the cache. To find out whether an entry has a don't care in its last cell or not, the TCAM entry is modified. If the search in the first stage results in a matching entry without a don't care in the last cell, the lmp is longer than 16 bits. Consider, for example, an IP address that matches prefix I in Fig. 6. In the second stage, the LSBs of the IP address (bits 18–32) are applied

to the second part of the matching entry (entry I). There is no need to search all the entries in the second stage because if the first half of an IP address does not match with the first half of a prefix, the IP address and the prefix do not match. If a prefix matches with the IP address in the second stage as well, the matching prefix is found. In this case, the full IP address and the corresponding output port can be forwarded to the cache. However, multiple long prefixes might match with an IP address after the second stage. For example, an IP address might match with both prefixes I and III as in the example in Fig. 6. In this case, the data in the Last-Prefix-Bit field resolves the LPM.

5.2. Modification of the last cell of a TCAM entry

Fig. 7 depicts an entry in the first stage of the TCAM. The rightmost cell is modified by adding two extra transistors, labeled M1 and M2, that are controlled by the complementary data values stored in the cell. For this cell (the 17th bit), a search for a don't care is performed in addition to the conventional search for a match between the stored data and the key. The circuits required to perform this extra search are similar to those used for the conventional search. During the pre-charge operation phase, a very short match line, labeled MLx in Fig. 7, is pre-charged to logic high. During the evaluation phase, the last cell searches for a don't care, corresponding to a stored value of 00. If the cell contains a don't care stored value, the circuit paths


Fig. 7. An entry in the two-stage pipelined TCAM with HLPM [14].

from the ML line to ground remain open-circuited, but the path from the MLx line to ground is shorted. In that case the MLx line is discharged, and the output is a logic high. We note that although the ML and MLx search operations are similar, the MLx evaluation is much faster due to the very short length of the MLx line. It also consumes less power, and does not require complicated sense amplifier circuits. Smart sensing and precharging circuits, such as the one described in [27], can be used for additional power savings, by only pre-charging the ML line in stage 2 for those entries that match in stage 1. Note that an entry in the second stage is very similar to the entry depicted in Fig. 7, but it has no extra transistors in its last cell and its pre-charging circuit is controlled by the search results of the entry in the first stage.

5.3. Last-prefix-bit field

In the second stage of the pipelined TCAM, two or more entries might match with the IP address (e.g., entries I and III in Fig. 6). The Last-Prefix-Bit (LPB) field is used to resolve the LPM in this case. Fig. 8a depicts a bank of LPB fields. The LPB field contains the length of the corresponding prefix minus 17, encoded as a binary number. For each bit i of the LPB field j, a Search Further Signal, SFS_i(j), is used to indicate when two or more matching fields have identical values up to bit i − 1. Each bit of the LPB field requires an SRAM cell to store the value of Data_i(j), the ith bit of field j, and SF propagation logic (shown in Fig. 8b). When a search in an LPB bank starts, SFS_0(j) = 1 for all entries j that have matched on the second stage of the pipelined TCAM.

Fig. 8. Last prefix bit (LPB) field.

To decide which fields should be searched further, the SF propagation logic must determine whether at least one of the propagating fields stores a 1 in its SRAM cell. Thus a data propagation signal, DP_i(j), is computed for bit i of each field j. When the Propagate Signal for bit position i in the bank, PS_i, is 1, the fields that have DP_i(j) = 1 should be searched further. In the example given in Fig. 6, the SFS for bit 0 of the LPB fields for entries I and III are set. The LPB field containing the maximum value corresponds to the lmp.
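The bit-serial selection of the maximum LPB value can be summarized by the following behavioral sketch (an editorial illustration; the hardware performs the same elimination with the SFS flags and propagate signals described above).

```python
# Behavioral sketch of the bit-serial maximum search over LPB fields.
# Each candidate keeps a "search further" flag (SFS); at every bit, if any
# surviving candidate stores a 1 (the propagate signal PS), candidates
# storing a 0 are dropped. The survivor holds the largest LPB value, i.e.
# the longest matching prefix.

def lpb_select(lpb_values, matched):
    """lpb_values: list of 4-bit strings; matched: entries that hit stage 2."""
    sfs = {j: True for j in matched}            # SFS_0(j) = 1 for all matches
    for i in range(4):                          # MSB to LSB of the LPB field
        ps = any(lpb_values[j][i] == "1" for j in matched if sfs[j])
        if ps:                                  # keep only fields storing a 1
            for j in matched:
                sfs[j] = sfs[j] and lpb_values[j][i] == "1"
    return [j for j in matched if sfs[j]]       # indices of the longest prefix

# Entries I and III of Fig. 6 both match; their LPB fields encode lengths
# 24 (0111) and 18 (0001), so entry I (index 0) must win.
print(lpb_select(["0111", "xxxx", "0001"], matched=[0, 2]))  # [0]
```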


HLPM uses a 4-stage-pipelined bit-serial approach to find the maximum length in the LPB bank. The simple LPB field pipeline can be clocked faster than the TCAM pipeline. It could be sensitive to both clock edges, while a digit-serial approach can be used to provide shorter latency. The LPB field is a compromise between the speed and simplicity of the design and the area required for implementation. The use of the LPB field does not require an update algorithm and does not restrict where prefixes can reside in the TCAM. LPB simplifies TCAM control logic and provides guaranteed constant-time TCAM updates. The area required for the LPB field is determined by the width of the largest pipeline stage. If the width of the stages is reduced, the LPB may also be reduced. For example, if two additional pipeline stages were added to increase throughput, this would also reduce the width of the LPB field by 1 bit (20%). On the other hand, if additional pipeline stages are added to support longer addresses (e.g., 128-bit IPv6 addresses), the space requirement for the LPB field remains constant.

5.4. HLPM special features

The HLPM architecture has several key advantages. First, HLPM saves power for short matching prefixes that have fewer than 17 bits. For matching entries with a don't care in the last cell of the first stage, HLPM does not need to search the remainder of the entry. This is possible because all cells to the right of a don't care are also don't cares. Section 7 discusses the power savings due to this technique. Secondly, HLPM resolves the LPM without requiring LUT management or sorting of the entries in the TCAM. Table updates require a single write operation. Third, HLPM is simple and scales well with TCAM width. The LPB field search does not change if the TCAM has additional 16-bit stages. For example, for 128-bit IPv6 prefixes, a 17-bit first stage, one 15-bit stage and six 16-bit stages would be used, with a single 4-bit LPB field that indicates the length of the prefix in the last stage that has non-don't-care bits.

6. The multizone pipelined cache (MPC)

Fig. 9 presents a structural description of our design for a Multizone-Pipelined Cache (MPC) [15]. In the MPC the DAA is divided into two independently sized horizontal parts that form the two zones of the cache. The upper zone, denoted as the prefix zone, stores short IP prefixes that are at most 16 bits long. Full 32-bit IPv4 destination addresses are stored in the lower zone,

Fig. 9. MPC memory allocation.


denoted as the full-address zone. These 32-bit entries are further split into two 16-bit sections that are stored in two sections of the full-address zone, denoted as CAM1 for the most-significant bits and CAM2 for the least-significant bits. A TCAM is used to implement the prefix zone, while a binary CAM is used to implement the full-address zone of the MPC. The three separate pieces of the DAA, along with the NHA RAM, can be pipelined into three stages as follows: (1) a lookup in CAM1; (2) depending on the result of the first stage, a lookup in either CAM2 or in the prefix zone; (3) on a hit, an access to the NHA RAM for the lookup result. The result of a lookup is either the forwarding information or a cache-miss indication. In the first pipeline stage the 16 most-significant bits of the IP address are applied to the CAM1 block. If CAM1 returns any matches, the extensions of the match lines into CAM2 corresponding to entries that matched in CAM1 are pre-charged. Since only entries with pre-charged match lines can be searched, the second stage searches the extensions into CAM2 of matching entries, with the 16 least-significant bits of the IP address, in order to complete the search of a full IP address in the full-address zone. Otherwise, if there are no matches in the first stage, then the second stage searches the prefix zone using the 16 most-significant bits of the IP address. If there is a second-stage match, either in CAM2 or in the prefix zone, then in the third stage the RAM location corresponding to the matching entry is accessed, in order to read the next-hop data (the lookup result). SPE ensures that an address either hits the full-address zone or the prefix zone, but not both. A cache miss is reported when there is no match in either cache zone. In this case a search of the LUT is conducted. The routing information that is returned by the search is then cached in the MPC. We refer to the miss penalty as the amount of time required to complete the routing table search and to store the result in the MPC. The time required to service a miss may be long, due to slow main-memory access times and contention for the lookup mechanism between misses from multiple line cards. The MPC's recent misses are stored in an Outstanding-Miss Buffer (OMB) until lookup results are returned by the processor. The pipeline flow diagram for an MPC search is shown in Fig. 10a.
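The following behavioral sketch (illustrative only; container names such as prefix_zone and omb are not from the paper) summarizes the three-stage search decision just described, including the fall-back to the OMB on a miss.

```python
# Behavioral sketch of the three-stage MPC search. The CAMs are modelled as
# simple Python containers; a list scan stands in for the associative search.

def mpc_search(addr32, cam1, cam2, prefix_zone, nha, omb, omb_size):
    """addr32: 32-bit address as a bit string. Returns the next hop,
    'MISS' (address parked in the OMB), or 'STALL' (OMB full)."""
    msb, lsb = addr32[:16], addr32[16:]
    # Stage 1: search the 16 MSBs in CAM1 (full-address zone, upper half).
    rows = [i for i, entry in enumerate(cam1) if entry == msb]
    if rows:
        # Stage 2a: only the CAM2 extensions of stage-1 matches are searched.
        rows = [i for i in rows if cam2[i] == lsb]
        hit = ("full", rows[0]) if rows else None
    else:
        # Stage 2b: otherwise the prefix zone (TCAM) is searched with the MSBs.
        rows = [i for i, (value, plen) in enumerate(prefix_zone)
                if msb[:plen] == value[:plen]]
        hit = ("prefix", rows[0]) if rows else None
    if hit is not None:
        return nha[hit]                       # Stage 3: read the next hop
    if len(omb) < omb_size:                   # non-blocking: park the miss
        omb.append(addr32)
        return "MISS"
    return "STALL"                            # OMB full: the pipeline blocks

# One-entry example: the 16-bit prefix 1100110011001100 cached with hop 'B'.
prefix_zone = [("1100110011001100", 16)]
nha = {("prefix", 0): "B"}
omb = []
print(mpc_search("11001100110011000000000000000001", cam1=[], cam2=[],
                 prefix_zone=prefix_zone, nha=nha, omb=omb, omb_size=2))  # 'B'
print(mpc_search("10101010101010100000000000000001", cam1=[], cam2=[],
                 prefix_zone=prefix_zone, nha=nha, omb=omb, omb_size=2))  # 'MISS'
```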

Fig. 10. Flow diagram of the cache operation.

6.1. Outstanding-miss buffer

As stated above, the OMB is a buffer that is used by the MPC to store recent misses until they are resolved by the processor, thus avoiding costly MPC pipeline stalls. The throughput of a cache can be severely hindered by blocking, since subsequent cache searches cannot be performed until the blocking lookup is resolved and the cache is updated, especially when subsequent searches would hit the cache. On the other hand, in a non-blocking cache, the address causing the miss is stored in the OMB and the cache can proceed with further lookups. A hit under miss is said to occur when a subsequent IP address hits the MPC and a previous miss

Fig. 11. Pipeline diagram of the cache.

is being serviced. Similarly, a miss under miss (secondary miss) is said to occur when a subsequent IP address also misses the MPC while a previous miss is being serviced. These new misses are stored in the OMB until the buffer is full, and at that point the MPC is blocked and the processor is stalled until previous outstanding misses have been serviced. Fig. 11 depicts an example with a two-entry OMB. Here, addresses labeled IP2, IP4, and IP6 represent cache misses. While the IP2 miss is serviced, the MPC can search for IP3 and IP5, and forward their routing information to the processor. However, after the IP2 and IP4 misses, the OMB is full and the MPC stalls while searching for IP6. At this point no new IP addresses can be searched until IP2 has been serviced and removed from the OMB, leaving room for IP6. Maintaining the order of packets traveling from one host to another is important because packets arriving out of order at the destination can, incorrectly, signal network congestion and cause the sending host to throttle the rate at which it sends packets. The use of the OMB does not result in packet reordering. All packets in a flow will encounter the same cache miss until the miss is serviced. At that point, the misses are retried in order. Therefore, the packets will be forwarded in the order in which they were received.

6.2. Cache update

After the result of a main-memory lookup is returned, either the prefix zone or the full-address zone of the MPC is updated according to the cache replacement policy. As explained in Section 4, SPE ensures that the lookup result is either a short, cacheable prefix that can be stored in the prefix zone, or a full 32-bit IP address that can be stored in the full-address zone. A pipeline stall is required to update the MPC. Contents are updated stage by stage, after which the IP address causing the update is removed from

the OMB. However, there may be other IP addresses in the OMB that are identical to the IP address causing the update. Also, the result of the update could be an IP prefix that covers many pending entries in the OMB. The OMB is organized as a 33-bit CAM, where each entry stores a 32-bit IP address as well as a valid bit, ensuring that the same lookup result does not get written into the MPC more than once. Lookups and cache updates are only performed for valid OMB entries. Thus after each cache update, an associative search finds all matching entries within the OMB. In order to ensure that IP prefixes match with the entries in the OMB, their don't-care bits are externally masked prior to the search. After the search the OMB clears the valid bits of any matching entries. A new MPC search for those entries will result in a hit, providing the routing information. Fig. 10b shows the flow diagram for an MPC update. Clearing the OMB and updating the cache is atomic, and therefore there is no possibility for another packet lookup to be processed at the same time. To avoid packet reordering, the outstanding items in the OMB must be served before new packets are searched. This simple update scheme allows the cache to issue update requests one by one. However, a more elaborate update scheme can be employed to allow the lookup algorithm and the corresponding interface to decide when pending requests are serviced, how many requests are serviced at a time, and the order in which the requests are serviced, with no restrictions. This new Out-Of-Order Cache Update (OCU) algorithm is presented in Fig. 12b. OCU lets both the interface to the lookup and the lookup algorithm benefit from out-of-order processing and batch requests to move data more efficiently. The general functionality of OCU is similar to the simple update algorithm, but there are three issues that require special consideration: update/lookup interlacing, batch updates, and contention for the OMB. Packet reordering may occur with this more elaborate update scheme.
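The OMB-clearing step that follows every cache update, which is common to the simple scheme and to OCU, can be sketched as follows (an editorial illustration; the hardware performs the masked comparison associatively in a single step, and the entry layout follows the 33-bit CAM description above).

```python
# Sketch of the update step: after the lookup result is written into the MPC,
# the OMB is searched and the valid bit of every covered entry is cleared.

def clear_omb_on_update(omb, result_value, result_plen):
    """omb: list of [addr_bits, valid]; the result covers addresses whose
    first result_plen bits equal result_value (don't-cares masked off)."""
    for entry in omb:
        addr, valid = entry
        if valid and addr[:result_plen] == result_value[:result_plen]:
            entry[1] = False      # a later MPC search for this address will hit

omb = [["11001100110011000000000000000001", True],
       ["11001100110011001111111111111111", True],
       ["10101010101010100000000000000001", True]]
# The LUT returns the 16-bit prefix 1100110011001100; both covered misses
# are invalidated, while the unrelated one stays pending.
clear_omb_on_update(omb, "1100110011001100", 16)
print([valid for _, valid in omb])   # [False, False, True]
```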


Fig. 12. Search/update diagram with free interface.

Cache lookups and cache updates are both pipelined operations that move through the same pipeline stages. Therefore, both lookups and updates can be fed arbitrarily into the pipeline without introducing stalls. If an update precedes a lookup, the lookup will see the updated data at each stage. Alternatively, if the lookup generates a miss and writes to the OMB before an update starts, the update will clear that entry from the OMB. However, a lookup that occurs shortly before an update

could produce a miss on the same data that the update is writing into the cache. If the miss was written into the OMB, a second main table lookup would return the same result, and the second update would write a second, identical entry into the cache. This situation should be avoided. Therefore, before an update starts, the main-table lookup result is stored in a Pending-Update Register (PUR). Every cache miss checks the missed value against the address or prefix in the PUR before writing it to the OMB. If there is a match, the next-hop information in the PUR can be forwarded as a cache hit. Otherwise, the miss is written to the OMB. After an update is finished, the PUR is replaced with a zero value, which cannot match any address. To improve data movement efficiency, the lookup system may batch update requests and lookup results. A batch request can easily be formed by taking multiple valid entries from the OMB. However, batch updates conflict with the serialized cache-update mechanism. Furthermore, batch and out-of-order lookups may generate the same lookup result, which should only be written to the cache once. To rectify this situation, lookup results are placed in a FIFO, and sequentially moved into the PUR. Therefore, each lookup result generates a separate cache update, and these updates are completely serialized. A second identical update result does not match any valid entries in the OMB, and is discarded. With OCU, the OMB becomes a point of contention in the system. Cache misses write to the OMB, while cache updates read from and write to the OMB. Only one of these actions can be performed in a given clock cycle. The write from a miss occurs at the end of the lookup pipeline, while an update accesses the OMB before it enters the pipeline. Also, multiple updates are serialized. Therefore, miss writes have priority over update accesses. In the worst case, a long sequence of misses may cause an update to be delayed until the OMB fills and blocks further cache lookups. This situation is mitigated by the fact that the OMB is small and any cache hit provides an opportunity for an update to proceed. Fig. 13 presents all three cases. Each update request returns to the cache after two cycles. Thus the update result for IP2 is written to the PUR at the end of cycle 5. The miss for IP4 at the end of cycle 5 must be compared with the PUR and written to the OMB at cycle 6. IP4 is not written in the


Fig. 13. Update complications.

OMB if IP4 matches with the PUR. Otherwise, IP4 is written to the OMB with its valid bit set. At cycle 7, another missing IP is waiting to be compared with the PUR and to be written in the OMB. In cycle 8, the OMB starts the search operation before this update. During cycles 9, 10 and 11, the cache is updated with the lookup result for IP2. However, if the OMB was full at the end of cycle 5, IP4 would not be written to the OMB in cycle 6. Instead, the OMB would search for update results of IP2 at cycle 6 and write IP4 in cycle 7. If the OMB has no empty entries in cycle 7, a stall of the pipeline is required.

7. Performance evaluation

This section presents a simulation-based performance evaluation of the new forwarding mechanism described in this paper. This study uses a high-level architectural simulator run with real IP traces and LUTs. The new forwarding architecture is comprised of an MPC and a LUT. Section 7.2 presents simulation results and analysis of MPC performance. These results are independent of the type of LUT used. Results for the HLPM LUT are presented in Section 7.6. The main findings are as follows:

• Pipelined searches in the MPC result in significant potential for power savings (30% or more).
• The MPC miss rate is comparable to those for full-prefix caching with fully expanded LUTs.
• Even a small OMB in the MPC implementation is very effective at hiding lookup latency.
• HLPM yields a potential power savings of up to 14% compared to a conventional TCAM.
• LPB searching is only required for 60% of the searches.

7.1. Trace characterization

This performance evaluation uses real, non-sanitized IP traces and the corresponding routing tables from three distribution (neither core nor edge) routers of local service providers in Edmonton, Alberta. Table 1 presents the sizes of the traces and routing tables. Shi et al. demonstrate that the frequency of reference of a given IP address in a trace follows Zipf's Law [29]. Fig. 14 plots the frequency of reference of individual IP addresses in the traces of Table 1 for three prefix lengths: full address, 24 bits, and 16 bits. The graph in Fig. 14a follows Zipf's Law. Thus, the spatial locality of the traces of Table 1 is consistent with traffic studied in the literature, i.e. a small number of addresses are responsible for a large percentage of the traffic (elephants) while a large number of addresses are rarely encountered in the trace (mice). This paper focuses on prefix caching, thus analyzing the effect of prefixes on spatial locality is important. In Fig. 14b and c the least significant 8 (or 16) bits of each address are masked off, as if each address were matched by a 24-bit (or 16-bit) prefix. The resulting prefixes are then counted. Reducing the size of the prefix only affects the spatial locality of the ISP3 trace, where most of the mice disappear from the trace, and a small number of elephants account for most of the traffic. Normalized entropy, H_n, measures the distribution of addresses in a trace:

Table 1
Trace and routing table sizes

                                  ISP1      ISP2      ISP3
Trace length (packets)            99,117    98,948    98,142
Routing table size (prefixes)     10,219    10,219    6355


Fig. 14. Spatial locality in IP traces.

Normalized entropy, H_n, measures the distribution of addresses in a trace:

H_n = -\frac{\sum_{i=1}^{N} p_i \log_2 p_i}{\log_2 N},

where N is the number of distinct addresses in the trace and p_i is the empirically determined probability of the ith distinct address in the trace [12]. Normalized entropy ranges between 0 and 1: a value of 0 indicates that only one address is referenced in the trace, while a value of 1 indicates that every address in the trace is equally frequent. Table 2 presents the normalized entropy of the traces of Table 1. As expected, considering only the 16 most-significant bits greatly reduces entropy, which supports the concept of prefix caching. The increased entropy in the ISP3 trace when the 24 MSBs are considered is evidence that the 24-bit prefixes are referenced more uniformly than the original full addresses.

Table 2
Normalized entropy in traces

Trace   32 bits   24 MSBs   16 MSBs
ISP1    0.769     0.724     0.460
ISP2    0.700     0.656     0.450
ISP3    0.605     0.636     0.495

Temporal locality in a trace can also be measured by calculating the reuse distance of each reference in the trace [28]. The reuse distance of the nth reference to address a, d(a_n), is the number of distinct references in the trace between a_{n-1} and a_n, inclusive. The reuse distance d(a_1) (i.e., of the first reference to a) is defined as the number of distinct addresses encountered in the trace up to and including a_1. Fig. 15 presents the complementary cumulative distribution function (CCDF)³ of the reuse distances for the traces. Given a monolithic M-entry cache using a Least-Frequently-Used replacement policy, the CCDF predicts the cache miss rate for the trace.

³ The CCDF of the reuse distance d(a) gives the probability P(d(a) > x).
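For completeness, the normalized entropy defined above can be computed directly from a trace. The short Python sketch below is an illustration of the formula (not the authors' tooling); it can be applied to full addresses or to masked prefixes.

import math
from collections import Counter

def normalized_entropy(references):
    """H_n = -(sum_i p_i log2 p_i) / log2 N, where p_i is the empirical
    probability of the ith distinct address and N is the number of
    distinct addresses; the result lies in [0, 1]."""
    counts = Counter(references)
    total = sum(counts.values())
    n_distinct = len(counts)
    if n_distinct <= 1:
        return 0.0   # a single distinct address: entropy is 0 by convention
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(n_distinct)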

Fig. 15. Temporal locality in IP traces (CCDF of reuse distance, log–log scale, for each trace).

Only M distinct entries can be stored in the cache. Thus, any reference with a reuse distance larger than M must result in a cache miss. With full IP addresses, as shown in Fig. 15a, most references have a reuse distance greater than 100, despite the packet-train nature of IP traffic. Furthermore, more than 10% of the references have reuse distances larger than 1000. These references have the potential to cause thrashing in small caches. When the least-significant bits of the addresses are disregarded to simulate prefix references, there is little change in the observed temporal locality of the ISP1 and ISP2 traces. Meanwhile, temporal locality in the ISP3 trace increases dramatically when only the 16 most-significant bits of each address are considered.
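The reuse-distance argument above can be turned into a simple miss-rate estimator: compute the reuse distance of every reference and count how many exceed the cache size M, which is the CCDF of Fig. 15 evaluated at M. The Python sketch below illustrates the idea with a straightforward (quadratic) implementation; it is not the paper's simulator.

def reuse_distances(trace):
    """Reuse distance of every reference: the number of distinct addresses
    between the previous reference to the same address and this one
    (inclusive); a first reference counts the distinct addresses seen so far."""
    last_pos = {}            # address -> index of its most recent reference
    distances = []
    for i, addr in enumerate(trace):
        window = trace[last_pos[addr]: i + 1] if addr in last_pos else trace[: i + 1]
        distances.append(len(set(window)))
        last_pos[addr] = i
    return distances

def estimated_miss_rate(trace, cache_entries):
    """Fraction of references whose reuse distance exceeds the cache size M,
    i.e. references that must miss in an M-entry cache (the CCDF at x = M)."""
    d = reuse_distances(trace)
    return sum(1 for x in d if x > cache_entries) / len(d)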

7.2. MPC performance analysis

Section 4 asked: To what extent do non-cacheable prefixes impact cache performance? If the LPMs of most IP addresses are cacheable prefixes, there is little benefit from table expansion. This question can be answered by simulating a cache-less architecture in which all IP addresses reference the LUT directly. Table 3 shows that the percentage of non-cacheable prefixes varies significantly across the traces. For ISP1, the LPMs of close to 47% of all IPs are non-cacheable prefixes. In this case, the spatial locality of the cache decreases dramatically if IP addresses are cached in full. Table 4 compares the miss rates of different cache types. Based on these simulation results, if non-cacheable prefixes are cached in full (the Prefix Cache), performance is similar to that of the full-address cache. Thus, the non-cacheable prefixes degrade the prefix cache to a full-address cache, despite the expensive TCAM implementation.

Table 3
Non-cacheable prefixes in a LUT

                                          ISP1    ISP2    ISP3
Referenced non-cacheable prefixes (%)     46.8    37.8    2


Table 4
Miss rates (%) vs. cache sizes (no. of entries) for three traces

                     ISP1                  ISP2                  ISP3
Entries              512    1024   2048    512    1024   2048    512    1024   2048
IP cache             22.7   15.4   10.5    10.8   7.2    4.9     3.6    2.2    1.9
Prefix cache         17.2   11.2   6.5     7.1    4.5    3.1     0.6    0.6    0.6
MPC                  15.5   7.9    3.7     6.2    3.3    2.0     3.0    2.0    1.6
Full-prefix cache    7.4    2.5    1.4     2.9    1.3    1.2     0.5    0.5    0.5

Table 5
Number of prefixes after table expansion

                   ISP1                 ISP2                 ISP3
                   Entries   % Larger   Entries   % Larger   Entries   % Larger
Original table     10,219    –          10,219    –          6355      –
Full expansion     30,620    199        30,620    199        7313      15
SPE                17,485    71         17,485    71         6469      2
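For reference, the "% Larger" column in Table 5 is simply the relative growth of the table after expansion; for example (values from Table 5):

def percent_larger(expanded, original):
    return 100.0 * (expanded - original) / original

print(percent_larger(30620, 10219), percent_larger(17485, 10219))
# full expansion and SPE for ISP1: about 199.6 and 71.1, reported as 199 and 71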

However, the ISP3 trace references very few non-cacheable prefixes. A careful investigation of ISP3 finds that all the IP addresses in this trace are covered by 7% of the prefixes in the LUT, which indicates very strong locality in the traffic. This degree of locality is not observed in routers closer to the backbone.

In order to evaluate the performance of the MPC, several cache types are simulated and compared. Table 4 compares the miss rates of the MPC with an IP Cache, a Prefix Cache (as explained above) and a Full-Prefix Cache. For a fair comparison, the simulated IP Cache has two equal-sized zones with two pipeline stages and caches full IP addresses; it is implemented as a 32-bit binary CAM. The Prefix Cache and the Full-Prefix Cache are implemented as 32-bit TCAMs. The Prefix Cache stores cacheable prefixes, and stores full addresses when the lookup result is a non-cacheable prefix. The Full-Prefix Cache uses a fully expanded version of the real lookup table (LUT) in which all prefixes are cacheable. The Full-Prefix Cache outperforms all other caches. However, the Prefix Cache and the Full-Prefix Cache are implemented using a 32-bit TCAM, which occupies nearly twice the area of a 32-bit binary CAM. Therefore, for the same number of entries, the MPC and the IP Cache use half the area of the prefix caches. For a fair comparison of cache performance, the prefix caches should have only half as many entries as the MPC and the IP Cache. The results in Table 4 show that when considering

caches of comparable storage area, the miss rate of the MPC is almost as good as that of the Full-Prefix Cache. In addition, the Full-Prefix Cache requires a complete expansion of the LUT, whereas the MPC only requires SPE. The total number of prefixes in the LUT after expansion, for a Prefix Cache and for the MPC, is given in Table 5.

7.3. Eliminating LUT redundancy

Lookup tables contain redundant information. A prefix Pi is redundant if the lookup table still returns correct results when Pi is removed from the table. Fig. 16 gives two examples of redundancy in a LUT. P1 at node 1 encompasses P2 at node 8, but they forward to the same port.

Fig. 16. Examples of existing redundancy in a LUT: (a) the original trie with prefixes P1–P4 and their output ports; (b) the trie after the redundant prefix P2 is removed; (c) the trie after P3 and P4 are merged into the shorter prefix P5.


Since there is no other prefix on the path from P1 to P2, P2 duplicates information in P1. This duplication can be removed without changing lookup results. Fig. 16b depicts the new trie after P2 is removed. Alternatively, P3 at node 5 and P4 at node 6 differ only in their last bit and forward to the same port. These two prefixes can be replaced by a new prefix that is one bit shorter, P5 at node 2. Fig. 16c depicts the trie with no redundancy. Some software-based lookup tables compress the trie by removing all redundant prefixes [23]. Although the initial table size is significantly reduced in such a compressed table, it is difficult to perform some table updates without generating redundant prefixes. Thus, lookup tables usually contain some redundancy. When redundant entries are removed from a table there are fewer prefixes to cache, and caching effectiveness should improve. Furthermore, redundant prefixes often cause short prefixes to unnecessarily encompass other prefixes, keeping those short prefixes non-cacheable. In the ISP1 and ISP2 LUTs, 27.0% of the prefixes are redundant, while 28.6% of the prefixes in the ISP3 LUT are redundant. To investigate the impact of table redundancy on cache performance, we removed the redundant information from the LUTs and transformed them to ensure correct cache results. Table 6 indicates that the cache miss rate does not degrade after removing redundancy.
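The two kinds of redundancy illustrated in Fig. 16 can be removed with a single recursive pass over the binary trie. The Python sketch below is a hedged illustration of that idea under simple assumptions (one next-hop port per node, no incremental updates); it is not the compression algorithm of [23], and repeated passes may be needed to remove redundancy created by the merging step.

class TrieNode:
    def __init__(self, port=None):
        self.children = [None, None]   # child for bit 0 and child for bit 1
        self.port = port               # next hop if this node stores a prefix

def remove_redundancy(node, inherited_port=None):
    """Apply the two rules of Fig. 16:
    (1) drop a prefix whose nearest ancestor prefix forwards to the same port;
    (2) merge two sibling leaf prefixes with the same port into their parent."""
    if node is None:
        return None
    # Rule 1: this prefix only repeats what an ancestor prefix already provides.
    if node.port is not None and node.port == inherited_port:
        node.port = None
    covering = node.port if node.port is not None else inherited_port
    node.children = [remove_redundancy(c, covering) for c in node.children]
    c0, c1 = node.children
    # Rule 2: both children are leaf prefixes with the same port -> one shorter prefix.
    if (node.port is None and c0 is not None and c1 is not None
            and c0.port is not None and c0.port == c1.port
            and not any(c0.children) and not any(c1.children)):
        node.port = c0.port
        node.children = [None, None]
    # Prune nodes that no longer store or lead to any prefix.
    if node.port is None and not any(node.children):
        return None
    return node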

7.4. Effect of the outstanding-miss buffer on throughput

A bufferless cache cannot store outstanding misses, and must therefore stall on each miss until the cache update result is received. To circumvent this issue, the MPC uses a small buffer (the OMB) to hide the miss penalty. The simulator uses a Latency parameter to model the miss penalty. Let Clocks Per Output (CPO) denote the average number of clock cycles required to provide the next-hop information for an IP lookup. The CPO metric can be used to evaluate the impact of the miss penalty on the operation of a router.

Simulation results for CPO versus latency for an MPC with no OMB, with a single-entry OMB, and with a 10-entry OMB are shown in Fig. 17. In a cache without an OMB, CPO increases linearly with latency, as expected. For the MPC with a single-entry OMB and small latency values, the CPO is largely independent of latency. As the latency increases, the CPO eventually grows linearly, but remains smaller than for a bufferless cache. The OMB becomes more important in systems where a single LUT handles the misses from multiple caches. Contention produces longer and non-uniform latencies. Therefore, the overall effect of contention can be estimated by observing cache throughput under longer-latency conditions.
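A back-of-the-envelope model helps interpret Fig. 17: in a bufferless cache every miss stalls the pipeline for the full latency, so CPO grows linearly with latency, whereas a single-entry OMB only exposes the part of the latency that is not hidden before the next miss arrives. The sketch below is a rough analytical model under strong assumptions (uniform miss spacing, one outstanding miss), not the paper's simulator; the 4% miss rate is an arbitrary illustrative value.

def cpo_bufferless(miss_rate, latency):
    """One cycle per lookup plus a full stall of `latency` cycles on every miss."""
    return 1.0 + miss_rate * latency

def cpo_single_entry_omb(miss_rate, latency):
    """With one outstanding-miss slot and misses roughly 1/miss_rate lookups apart,
    only the portion of the latency not covered by that gap causes stalls."""
    mean_gap = 1.0 / miss_rate
    return 1.0 + miss_rate * max(0.0, latency - mean_gap)

for lat in (5, 25, 100):
    print(lat, cpo_bufferless(0.04, lat), cpo_single_entry_omb(0.04, lat))
# The buffered CPO stays near 1 at small latencies, then grows linearly but
# remains below the bufferless curve, matching the qualitative shape of Fig. 17.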

7.5. Search power savings

The power consumption of CAM-based systems is a design constraint that has been extensively addressed in the literature [10,21]. The power consumption of a CAM has three components: search power, input power and peripheral-circuit power [13]. Dynamic power consumption in digital integrated circuits is linearly related to the amount of switching activity on signal wires. Therefore, the search power should correlate strongly with the number of entries searched in a CAM. In order to analyze the reduction in activity independently of the physical implementation of the architecture, we consider the average reduction in the number of entries searched in the design. For instance, if on average half the entries in a cache are searched every clock cycle and the other half are searched only 50% of the time, then only 75% of the entries are searched, leading to a 25% power saving in comparison to a design that searches all the entries for each lookup. As described in detail in Section 7.1, the traces used to simulate the performance of the proposed architecture consist of IP lookups that are applied to the cache and then, in case of a miss, to the LUT.

Table 6
Miss rates (%) vs. cache sizes (no. of entries) for three traces with no redundancy in the LUTs

                     ISP1                  ISP2                  ISP3
Entries              512    1024   2048    512    1024   2048    512    1024   2048
Full-IP cache        22.7   15.4   10.5    10.8   7.2    4.9     3.3    2.0    1.9
MPC                  15.5   7.9    3.7     6.0    3.2    2.0     3.0    2.0    1.6
Full-prefix cache    7.4    2.5    1.4     2.5    1.1    1.1     0.3    0.3    0.3


Fig. 17. CPO vs. latency for 1K-entry (equally sized zones) MPC.


Note that identical IP lookups do appear in the trace; this repetition is essential for studying the effectiveness of the cache mechanism. The MPC is a multi-zone cache containing a TCAM-based prefix zone and a CAM-based full-address zone. In our power-consumption studies we assume that half of the cache entries are contained in each zone. Recall that the prefix zone is only searched after an address is not found in the CAM1 block of the full-address zone. A result of this vertical pipelining is that not all cache entries are searched for every IP address: the number of cache entries searched in each routing operation is smaller than the number of entries in the cache, resulting in lower activity (and hence lower search power) compared to caches that search all entries. Hence, a study of the hit rate in CAM1 provides information about the potential power-consumption reduction for the CAM2 and TCAM blocks. Simulation results are given in Table 7. In all scenarios, over 60% of IP address lookups hit CAM1 and do not require a subsequent TCAM search. Therefore, on average we expect a greater than 30% reduction in the number of searched entries in the cache, compared to a cache in which all entries are searched. We also expect a commensurate 30% reduction in signaling activity, a key predictor of power consumption in digital integrated circuits. Note that the total cache hit rate is less than 100%; thus a miss in CAM1 does not automatically correspond to a hit in the TCAM. The miss rate of the MPC as a whole is reported in Tables 4 and 6.
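The "greater than 30%" figure follows directly from the CAM1 hit rates in Table 7: with equally sized zones, the CAM1 half of the entries is always searched and the other half only on a CAM1 miss. A small sketch of this arithmetic (a simple activity model, not a circuit-level power estimate):

def searched_fraction(cam1_hit_rate, cam1_share=0.5):
    """Average fraction of cache entries searched per lookup in the two-stage MPC:
    the CAM1 zone is always searched, the second zone only on a CAM1 miss."""
    return cam1_share + (1.0 - cam1_hit_rate) * (1.0 - cam1_share)

for hit in (0.62, 0.65, 0.74):      # Table 7 hit rates for ISP1, ISP2 and ISP3
    f = searched_fraction(hit)
    print(f"hit rate {hit:.0%}: {f:.0%} of entries searched, ~{1 - f:.0%} fewer entries searched")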

Table 7
CAM1 hit rates

# Entries   ISP1 (%)   ISP2 (%)   ISP3 (%)
512         62         64         74
1024        63         65         74
2048        63         65         74

7.6. The HLPM-based LUT performance evaluation

In the forwarding architecture described in Section 5, a cache miss requires a reference to the LUT. The LUT stores the routing prefixes after SPE. Table 8 shows the simulation results for a 1K-entry MPC, both with the actual lookup tables and with lookup tables from which the redundant information has been removed. The first row of the table shows the percentage of short prefixes in the LUT. The second row shows how often these short prefixes are referenced during the complete simulated lookup process (cache and LUT). Table 8 indicates that, for ISP1 and ISP2, approximately 35% of the expanded prefixes stored in the LUT are short, and almost 28% of the IP lookups that miss in the cache match these short, cacheable prefixes. Cacheable prefixes of the ISP3 table are referenced infrequently because most IPs in this trace match a very small set of cacheable prefixes. In other words, the ISP3 trace has very high spatial locality: when those prefixes are cached, most IP addresses hit the cache. This result is also observed in the MPC simulation results given in Tables 3 and 4. In a prefix cache, the miss rates do not improve when the cache size increases. This insensitivity to cache size suggests that only a small number of prefixes need to be cached to hit most addresses in the trace. The miss rates improve very little for a larger MPC or a larger full-address cache; this small improvement is due to the caching of full IP addresses. The simulation results in Table 8 suggest that the power-saving benefit of the architecture is more pronounced for applications with lower spatial locality. As shown in Table 8, a lookup in the LUT requires a second-level search to find the longest matching prefix for at most 83% of the references to the LUT. Note that these references are the MPC misses. For ISP1 and ISP2, at least 40% of the LUT references do not need the second-level search. This simplifies the search operation and

Table 8
Simulation results for HLPM-based LUTs (tables with redundancy are the real tables)

                                  ISP1           ISP2           ISP3
Redundancy                        Yes    No      Yes    No      Yes    No
LUT short prefixes (%)            36.6   41.6    36.6   41.6    19.5
Referenced short prefixes (%)     27.8   27.8    27.2   29.1
Second level search (%)           55.8   55.8    59.2   57.5
Power savings (%)                 14.0   14.0    13.6   14.5