
IEEE COMMUNICATIONS LETTERS, VOL. 9, NO. 9, SEPTEMBER 2005

Throughput Increase in Packet Forwarding Engines Using Adaptive Block-Selection Scheme

Mohammad J. Akhbarizadeh and Mehrdad Nourani

Abstract— We propose a new approach for using the block-selection scheme to increase the search throughput in a multi-block TCAM-based packet forwarding engine. While the existing methods try to counter and forcibly balance the inherent bias of Internet traffic, our method takes advantage of it, thereby improving the flexibility of table management and scalability toward high rates of change in traffic bias. This approach also offers higher throughput than the current art.

Index Terms— Router, packet forwarding, ternary CAM, block-selection scheme.

I. INTRODUCTION

The block-selection scheme was proposed in [1] as a system-level remedy for the high power consumption of ternary content addressable memory (TCAM). In this approach the entire forwarding table is partitioned into K equally sized portions by a partitioning algorithm. Each portion is placed into a separate TCAM chip as small as 1/K-th the expected table size. In the data path, these TCAM chips are preceded by an ASIC that includes a range detector (RD). The RD examines every incoming search key (i.e., destination IP address) and instantly determines the unique TCAM block that contains the longest matching prefix for that key. Only that block is then enabled and given the key to search; all other blocks remain idle during that cycle. Thus, the peak power consumption of the resulting architecture is roughly 1/K-th that of a conventional TCAM of equal size. The partitioning algorithm allows incremental table updates even when a partition is full. As each TCAM block is a separate chip, practical values of K are in the range of 4 to 8. The authors of [2] propose two more categories of TCAM block-partitioning methods. All designs based on this concept need a two- (or more-) stage pipelined implementation.

Reference [1] also suggested that the partitioning technique can be used to increase the search throughput. With K TCAMs and M parallel RDs (M ≤ K), up to M concurrent lookups can be fulfilled per cycle. All RDs receive search keys independently and choose their own target TCAM block. If every RD chooses a different block, then M lookups are fulfilled in that cycle. We call such a system a Multi-Selector, Multi-Block (MSMB) structure. Internet traffic is strongly biased, and in practice there are many conflicts in which multiple RD units choose the same TCAM block in the same cycle. One possible method to make the distribution less biased is to divide the forwarding table into a larger number of small partitions (say 4 × K) and randomly assign four partitions to each of the K TCAMs; the randomness of the assignment helps reduce bias toward any one TCAM.
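To make the mechanism concrete, the following is a minimal behavioral sketch (ours, not the circuit from [1]) of how a range detector can map a destination address to its unique TCAM block, assuming the partitioning yields K contiguous address ranges; the boundary values and function names are hypothetical:

import bisect

# Inclusive upper bound of each partition's address range (K = 4).
# These boundaries are illustrative; a real partitioner derives them
# from the prefix trie so each block holds roughly 1/K of the table.
RANGE_BOUNDS = [0x3FFFFFFF, 0x7FFFFFFF, 0xBFFFFFFF, 0xFFFFFFFF]

def range_detect(dst_ip: int) -> int:
    # Binary search over the K boundaries; in hardware this would be a
    # small parallel comparator tree resolving in a single cycle.
    return bisect.bisect_left(RANGE_BOUNDS, dst_ip)

# Only the selected block is enabled and searched; the other K - 1 stay
# idle, which is the source of the ~1/K peak-power reduction.
assert range_detect(0x0A000001) == 0   # 10.0.0.1 falls in block 0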


Manuscript received December 21, 2004. The associate editor coordinating the review of this letter and approving it for publication was Dr. Nikos Nikolaou. The authors are with the Center for Integrated Circuits & Systems, University of Texas at Dallas, Richardson, TX (e-mail: {eazadeh, nourani}@utdallas.edu). Digital Object Identifier 10.1109/LCOMM.2005.09022.

Instead of random assignment, one could use a learning algorithm that tries to predict the future behavior of incoming traffic from its current distribution. At the beginning of each period, the algorithm rearranges the assignment of table partitions to TCAMs based on its latest knowledge, to balance the load over the TCAM blocks [1]. An enhanced version of the latter idea, combined with the notion of storage redundancy, is described in [3]. The drawbacks of the state-of-the-art MSMB approaches are:

1) They require periodic table reconstruction and rearrangement, which degrades table-management flexibility; they do not support incremental table updates when a block is full.
2) They fall short when the traffic bias changes rapidly, because the existing algorithms need adequate time to learn the traffic distribution and then more time to rearrange the table based on this knowledge.
3) They duplicate popular prefixes inside the TCAM, using expensive TCAM storage inefficiently.
4) Their use of internal queuing to handle contentious situations increases design complexity.

II. A FLEXIBLE SOLUTION FOR INCREASING THROUGHPUT

The previous solutions try to counter the inherent bias of the incoming traffic and balance its distribution over the parallel TCAM blocks in order to improve their utilization and thereby increase throughput. In this work, instead of countering the traffic bias we take advantage of it. We propose a practical solution that lets an MSMB engine rapidly adapt itself to changes in the traffic pattern. Our solution is based on the simple observation that when the bias of the IP address stream toward a small subset of route prefixes increases, the stream inevitably exhibits some temporal locality of reference; the less uniform the traffic, the stronger the locality. The system described in this section (Fig. 1-a) exploits this property to improve parallelism regardless of the bias in the incoming traffic. As the figure shows, the system is divided into two stages. Our technique affects only the first stage, in which an ASIC chip contains selectors augmented with small associative memory units that promote parallelism. In the second stage, the K TCAM blocks (i.e., K separate TCAM chips) simply contain the K-way partitioned lookup table. The main mechanisms devised for stage 1 are as follows.

Resolving contentions: The first step in handling dynamic traffic is to smoothly handle the contentious situations in which two or more RDs select the same block in the same cycle. For that purpose, contention resolver (CR) units are added to the system, one per TCAM block.


Fig. 1. MSMB-PT high-level block diagram and search flow diagram. (a) MSMB-PT with three selectors and four TCAM blocks. (b) Search operation diagram of MSMB-PT.

When a contention for block TCAM_i occurs, CR_i permits the highest-priority RD to proceed with its search request and puts the remaining requesting RDs on hold. This keeps the design simple and efficient by avoiding internal queues.

Handling biased traffic: Contentions degrade the system's throughput. For a 4-selector, 4-block MSMB, our simulations showed a speedup of less than 1.60 instead of the ideal value of 4. To prevent most contentions and get close to the ideal speedup factor when the traffic is biased, we turn to the temporal locality of reference. Our system dynamically stores a few recently popular route prefixes in a table, called the Popular-prefix Table (PT), next to each selector; hence we call our system model MSMB-PT. The PT units work in parallel and have identical contents; together they improve the average number of lookups per cycle by increasing parallelism. A PT is a small fully associative memory unit capable of storing prefixes, i.e., a small TCAM in its own right (our simulations show that a size of 64 × 32 bits suffices for a very large IPv4 forwarding engine). Note that the entire stage 1 is placed in a single ASIC chip, which is connected to K (typically K = 4) TCAM chips that store the lookup table. The combination of an RD and a PT is called a selector (Sel) in Fig. 1-a. The figure is a simplified illustration: in practice there is a dedicated request signal from each selector to each CR, a Hold signal coming back from the CR units to each selector, and only the search keys and selector priorities are broadcast from each selector to all CR units. In general there are M selectors, K CR units, and K TCAM blocks. In the previous art usually M = K, while in MSMB-PT it is advantageous to have M ≥ K; it is merely for simplicity of illustration that M = 3 and K = 4 in Fig. 1-a.

The diagram in Fig. 1-b shows the life cycle of each search operation in MSMB-PT. Up to M such operations take place in every cycle. When Sel_i receives a search key, PT_i is looked up simultaneously with RD_i. If there is a match in PT_i, the result is produced and the lookup of the second-stage TCAM blocks is canceled for this particular key.
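The per-cycle interplay of PT lookup, range detection, and contention resolution can be sketched as follows. This is a behavioral model of ours under simplifying assumptions, not the authors' RTL: the PT is modeled as an exact-match dictionary rather than a prefix-matching TCAM, and all names are illustrative:

def one_arbitration_cycle(keys, pt, priority, range_detect):
    """keys: M search keys (None = idle selector); pt: shared PT content,
    modeled as {key: result} for brevity (a real PT does prefix matching);
    priority: per-selector priority, rotated each cycle."""
    results = [None] * len(keys)
    requests = {}                                 # block id -> selector ids
    for i, key in enumerate(keys):
        if key is None:
            continue
        if key in pt:                             # PT hit: result is ready,
            results[i] = ("PT_HIT", pt[key])      # stage-2 lookup canceled
        else:                                     # PT miss: request the CR
            block = range_detect(key)             # of the RD-selected block
            requests.setdefault(block, []).append(i)
    for block, sels in requests.items():          # each CR grants one RD;
        winner = max(sels, key=lambda i: priority[i])
        for i in sels:                            # the losers are held and
            results[i] = ("TCAM", block) if i == winner else ("HOLD", block)
    return results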

In this case the search result is delayed for one cycle so that everything stays synchronized and the pipelining remains simple. If there is no match in PT_i, the search key is passed to the second-stage TCAM block that the RD selects (say TCAM_j). If other selectors also request TCAM_j in the same cycle, CR_j chooses the one with the highest priority. If our selector loses this contention, it receives a HOLD signal back from CR_j, which puts it in the hold state; it must repeat the operation in the next cycle. The selector priorities are incremented every cycle, up to M and back to 1, so no selector stays on hold for more than M − 1 cycles. The two stages are folded into a simple pipelined structure, similar to other multi-block architectures.

Placement into PT units: In this design, a placement into PT is scheduled whenever there is a contention in the first stage. When CR_i receives two or more requests for TCAM_i, it schedules the next search result of TCAM_i to be placed in all PT units. This is a greedy method based on guessing, but our studies have shown it to be effective and more efficient than blindly placing every matched prefix. The placement process is asynchronous and does not halt ongoing search tasks. A parent prefix (a prefix that encompasses other prefixes in the same table) cannot be placed into PT directly, as it could cause wrong forwarding decisions. In this case the minimal expansion procedure, introduced by the authors in [4], is applied asynchronously; it finds the shortest disjoint extension of the given parent prefix that can partially represent it in the PT units. Updating the PT units takes 2 cycles for disjoint prefixes and at most 2 + w cycles for parent prefixes, where w = 32 (128) for IPv4 (IPv6).

The first contribution of this paper is that, by taking advantage of temporal locality of reference, we can keep the performance of multi-block search engines high even when the traffic is highly biased, without any need for complicated, unscalable arrangements. Note, however, that MSMB-PT is fundamentally different from a regular cache system, in which a cache hit is several times faster than a memory fetch (cache miss). A cache reduces the average search/fetch latency, whereas in MSMB-PT both the PT and TCAM searches take virtually one clock cycle.
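The minimal expansion step can be illustrated with prefixes as bit strings. This sketch follows our reading of the procedure described in [4]; the representation and helper are ours:

def minimal_expansion(parent: str, key: str, children: list) -> str:
    """Shortest extension of `parent` (along the key it just matched) that
    is disjoint from every child prefix, so it can safely stand in for the
    parent in PT. Prefixes and the key are '0'/'1' strings; `key` matched
    `parent` as its longest prefix, so no child is a prefix of `key`."""
    cand = parent
    # Grow one bit per step, taking bits from the matched key; this is why
    # a parent-prefix placement costs up to w extra cycles (w = 32 or 128).
    while any(child.startswith(cand) for child in children):
        cand += key[len(cand)]
    return cand

# Parent 10* has child 101*; for a key starting 100..., the expansion 100*
# is disjoint from 101* yet still yields the parent's next hop in PT.
assert minimal_expansion("10", "1000", ["101"]) == "100"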

The gain lies solely in the fact that multiple PT units work in parallel, which lets MSMB-PT complete multiple searches per cycle. A single cache unit placed in front of the entire forwarding engine could not achieve this, because it would restrict the system to one search per cycle.

MSMB-PT is a two-faced architecture. When there is little or no bias, the search requests are nicely distributed over the TCAM blocks, contentions are few, and the speedup is close to K. When the traffic is biased and plenty of contentions can occur, the parallel PT units come into play to avoid the majority of contentions and boost the speedup factor. The main achievements of MSMB-PT are listed below:

1) A significant boost in throughput regardless of bias in the ingress traffic.
2) When M > K, MSMB-PT can achieve speedups higher than K (excess speedup). Our simulations show that with 4 TCAM blocks (K = 4), up to 8-fold speedup is achievable. This property of MSMB-PT is unique.
3) Unlike other approaches, MSMB-PT leaves the structure of the lookup table unaffected and imposes no requirements on table management, making it a scalable solution. All the dynamism happens in stage 1, and incremental table updates remain possible.
4) MSMB-PT adapts rapidly to changes in traffic bias, while the previous art needs time to learn the traffic bias and calibrate its weight variables, and more time to rearrange several lookup-table sub-ranges among the TCAM blocks based on that calibration.

III. EXPERIMENTAL RESULTS

Fig. 2-a shows the simulation results obtained from a clock-cycle-accurate model of MSMB-PT with K = 4. A real-life routing table and IP packet traces are used for this experiment [5]. The bar chart shows the speedup versus PT size (N_PT). For each N_PT, four values of M are tried: 4, 8, 12, and 16. The graph clearly shows three major characteristics of MSMB-PT: 1) effectiveness: even a small N_PT makes a big difference (compare N_PT = 0 with N_PT = 16); 2) efficiency: N_PT does not need to be very large, as the speedup saturates near N_PT = 256; and 3) excess speedup. We find M = 12, K = 4, and N_PT = 64 to be a practical and efficient choice in our case. The stage-1 ASIC for this configuration is estimated at about 1.4 million transistors, an average size for such an application. It provides a minimum throughput equivalent to 13 OC-192 links in the worst case, assuming it can only deliver a speedup of 4 and all packets have the minimum size of 40 bytes.

Fig. 2-b compares MSMB-PT's adjustment to traffic-bias change with the previous art. MSMB-PT results are shown for two cases, M = K = 4 and M = 12, K = 4, with N_PT = 64. In our definition, traffic-bias change (the horizontal axis) is a change of at least 10% of the prefixes in the list of the 100 currently most popular prefixes, obtained by a C program. To normalize, we divide that rate by the traffic rate (link speed); thus 0.1% on the horizontal axis means that on average, after every 1000 lookups, 10 of the popular prefixes change. Both the bar chart and the curve clearly demonstrate achievements 1, 2, and 4 in the list above. MSMB-PT offers high performance and excess speedup. With the assistance of the PT units, MSMB-PT adapts to changes in traffic bias efficiently even when the changes are radical.
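As a sanity check on the 13-link figure, the arithmetic below reproduces it under one assumption the letter does not state explicitly: a per-block TCAM search rate of 100 million searches per second, which we take as typical for devices of that generation:

OC192_BPS   = 9.953e9      # OC-192 line rate in bits/s
PKT_BITS    = 40 * 8       # worst case: minimum-size 40-byte packets
SEARCH_RATE = 100e6        # assumed searches/s per enabled TCAM block
SPEEDUP     = 4            # worst-case MSMB-PT speedup

pps_per_link = OC192_BPS / PKT_BITS          # ~31.1 M packets/s per link
engine_lps   = SEARCH_RATE * SPEEDUP         # 400 M lookups/s in total
print(engine_lps / pps_per_link)             # ~12.9, i.e., about 13 links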

Fig. 2. Graphed experimental results. (a) Speedup graph for varying M and different sizes of PT; for each N_PT, speedup bars are shown for M = 4, 8, 12, and 16. (b) Speedup vs. rate of traffic-bias change for MSMB-PT and the previous art.

IV. CONCLUSION AND FUTURE WORK

The MSMB-PT model is a parallel TCAM-based forwarding engine that employs the block-selection scheme. It improves on the state of the art in terms of throughput, storage efficiency, scalability to high rates of change in traffic bias, and table-management flexibility. With one ASIC chip that implements the technique and four commodity TCAM chips that store the partitioned lookup table, MSMB-PT is the only architecture that can achieve a speedup of more than four. Further mathematical analysis, a hardware implementation, the details of a dynamic priority mechanism that avoids starvation, and a complete performance evaluation of the prototype are due in the next phase.

ACKNOWLEDGMENTS

This work is supported in part by Cisco Systems. We thank Rina Panigrahi and Samar Sharma for their helpful discussions.

REFERENCES

[1] R. Panigrahy and S. Sharma, "Reducing TCAM power consumption and increasing throughput," in Proc. IEEE Hot Interconnects 10, Stanford University, Aug. 2002.
[2] F. Zane, G. Narlikar, and A. Basu, "CoolCAMs: power-efficient TCAMs for forwarding engines," in Proc. IEEE INFOCOM, vol. 1, pp. 42-52, Mar. 2003.
[3] K. Zheng, C. Hu, H. Lu, and B. Liu, "An ultra high throughput and power efficient TCAM-based IP lookup engine," in Proc. IEEE INFOCOM, vol. 3, pp. 1984-1994, Mar. 2004.
[4] M. Akhbarizadeh and M. Nourani, "Efficient prefix cache for network processors," in Proc. IEEE Hot Interconnects 12, Aug. 2004.
[5] "BGP Routing Table Analysis Reports," http://bgp.potaroo.net, 2004.
