Design of Multi-field IPv6 Packet Classifiers Using Ternary CAMs

Nen-Fu Huang, Whai-En Chen, Jiau-Yu Luo, Jun-Min Chen
Department of Computer Science, National Tsing-Hua University, Hsinchu, Taiwan, R.O.C.

Abstract – Typically, high-end routers and switches classify a packet by examining multiple fields of the IP/TCP headers to recognize which flow the packet belongs to. Several packet classification algorithms have been proposed to accelerate packet processing and reduce the memory requirement, but it is not easy to implement these algorithms in hardware so that the multiple fields are looked up at the same time. This paper designs a novel packet classification engine capable of searching multiple fields simultaneously, especially for IPv6 packets with their relatively long (128-bit) addresses. To classify IPv6 packets at wire speed, a CLM (CAM-Like Memory)-based hardware architecture is considered, and five fields (source IPv6 address, destination IPv6 address, source port, destination port, and protocol) are used as the search key. Evaluation results indicate that, compared with a typical market-leading search engine, the proposed hardware architecture provides a speed-up of about 30%. A compact method is also provided to compress the bit-width required to represent the multiple fields of an IPv6 packet; it reduces the memory space required for the IPv6 rule table by about 20%.

Keywords: IPv6, Multi-field, Classification, CAM

1. Introduction

To deal with today's exploding Internet traffic, network interfaces have been upgraded to speeds of several gigabits per second, and OC-192/OC-768 interfaces are expected to be deployed quickly. The packet forwarding process must therefore also handle multi-gigabit traffic. It comprises the forwarding engine, IP address lookup, and layer-4 packet classification modules, each of which should achieve a processing speed equivalent to that of the physical interfaces.
Several network applications require packet classification, such as firewalls, policy-based QoS management, DiffServ, network security (IDS), and advanced billing systems. Typically, at least five fields are required for layer-4 packet classification: source and destination IP addresses, source and destination ports, and protocol (TCP/UDP). These fields determine the rule(s) to which a packet belongs. For IPv6 packets, the total bit-width of these fields is 296 bits (2 × 128 + 2 × 16 + 8). Although several configurable CAM (Content Addressable Memory) products allow the width and depth (number of entries) of the CAM to be configured, high-density ternary CAMs are still very expensive [14-19]. With a common CAM interface, whose data bus is typically 70-80 bits wide, it takes at least four clocks to input these fields into the CAM, and the comparison itself may take another one or two clocks.
The well-known trie structures have been proposed in several articles to compare the five fields of the IP/TCP headers sequentially [1],[2],[7],[8]. Multi-dimensional structures have also been proposed for the same purpose [10],[12]. Note that in multi-field classification a rule is selected only when all five parameters match simultaneously. Therefore, the sequential algorithms [8],[9],[13] may need to be executed recursively and spend more search time; in other words, the penalty of a lookup miss is very heavy. Given this, it is desirable to design an algorithm capable of simultaneously accelerating the packet lookup and reducing the bit-width requirement, especially for IPv6 packet information.

The fundamental idea proposed in this paper is to feed the multi-field parameters into CLMs at the same time, so that the CLMs perform their lookups simultaneously. Each CLM then outputs an MA (Match Array), which indicates the list of matched entries. This paper then proposes a VLOP (Vertical Logical OPeration) and an HLOP (Horizontal Logical OPeration) to determine the matched rule. An ADE (Address Decode Engine) is also designed to output the Matched Index of the matched rule in O(1) time. To reduce the bit-width required to represent the multi-field information of IPv6 packets, a compacting method is also proposed to compress some of these fields. Simulation results indicate that the proposed method speeds up the search by 30% and reduces the required memory by 20%, compared with a single-CAM-based implementation.

The rest of this paper is organized as follows. Section 2 presents the major issues of IPv6 packet classification and related work. Section 3 describes the proposed hardware architecture, and Section 4 the compact method. Performance evaluation is presented in Section 5. Finally, conclusions and future work are given in Section 6.

2. Issues of IPv6 packet classification

Several algorithms have been proposed in recent years to improve multi-field packet classification. Some of these papers present software designs [4],[6],[7] to find the matched rule, while others [8],[9],[13] design hardware architectures, for example using CAMs, to accelerate the search procedure. In the worst case, however, these algorithms may suffer from backtracking in the search procedure, which is undesirable in ultra-high-speed lookup engines. For example, consider the multi-level trie structure shown in Figure 1(a). When a packet matches node C in the DST-Tree but fails in the SRC-Tree, the algorithm must go back to node B in the DST-Tree and then search the SRC-Tree again. This “backtrack” procedure may take a
0-7803-7206-9/01/$17.00 © 2001 IEEE
lot of time. To avoid backtracking, some improved mechanisms add extra links to the data structure, which makes the data structure more complicated.
[Figure 1 labels: Table 1 to Table 5; IPv6 SRC Address, IPv6 DST Address, Protocol, SRC Port, DST Port]
Figure 1. Typical packet classification algorithms, panels (a) and (b).
Nevertheless, the same situation may also occur in a hardware architecture. Figure 1(b) shows that if the destination address of a packet matches multiple entries in the DST table, the search procedure may need to check the related source information for each of the matched destination entries.

These algorithms need to backtrack because filter entries (rules) in the hash tables or search trees may match more than once. When this happens, it is difficult to choose the one correct result to use as the index for the next level of the search. An elementary idea for resolving this issue is to select a rule only when all parameters match simultaneously. This paper designs several logic circuits capable of choosing the correct rule and exporting the search result within a deterministic latency.

3. Design of IPv6 packet classifier

Table 1 shows an example rule table. Generally, five fields are used to assign a packet to a flow: source and destination IP addresses (128 bits each for IPv6), source and destination transport-layer port numbers (16 bits each), and the protocol field (8 bits). When a packet arrives, the packet classifier searches for these five fields in the rule table. A rule is matched only when all of these fields of the packet match that rule. An action, or a sequence of actions, then follows the match. Taking a firewall as an example, the action may be as simple as forwarding the packet or discarding it.

Table 1. Rule table example.

Source Address           Destination Address         Source   Destination   Protocol   Action
(address/prefix length)  (address/prefix length)     Port     Port
3ffe:3600:0001::1/48     3ffe:3600:0002::1/48        gt 1023  range 20-21   TCP        Pass (FTP)
3ffe:3600:0001::1/48     3ffe:3600:0002::1/48        *        80            TCP        Pass (WWW)
3ffe:3600::1/32          3ffe:3600:000B:0001::1/64   *        *             *          Filter
3ffe:3600::1/32          3ffe:2400:0002::1/32        *        *             UDP        Filter

A hardware architecture with two logical operation circuits is proposed to choose the correct rule and export the search result within a deterministic latency, as shown in Figure 2.
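As a point of reference for the hardware that follows, the matching semantics of a rule table can be sketched in software. The sketch below is only an illustration under stated assumptions: the `Rule` type, the `classify` function, the field encodings (port ranges as inclusive pairs, `None` for the wildcard `*`) and the two sample rules, loosely modelled on Table 1, are hypothetical and are not taken from the paper's implementation. It returns the index of the first rule for which all five fields match.

```python
from dataclasses import dataclass
from ipaddress import IPv6Address, IPv6Network
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    """One rule-table entry (hypothetical encoding, for illustration only)."""
    src: IPv6Network
    dst: IPv6Network
    sport: Optional[Tuple[int, int]]  # inclusive port range, None means '*'
    dport: Optional[Tuple[int, int]]  # inclusive port range, None means '*'
    proto: Optional[str]              # None means '*'
    action: str


def _in_range(port: int, rng: Optional[Tuple[int, int]]) -> bool:
    # A wildcard range matches every port.
    return rng is None or rng[0] <= port <= rng[1]


def rule_matches(rule: Rule, src: str, dst: str,
                 sport: int, dport: int, proto: str) -> bool:
    # A rule matches only if ALL five fields match at once.
    return (IPv6Address(src) in rule.src
            and IPv6Address(dst) in rule.dst
            and _in_range(sport, rule.sport)
            and _in_range(dport, rule.dport)
            and (rule.proto is None or rule.proto == proto))


def classify(rules: List[Rule], src: str, dst: str,
             sport: int, dport: int, proto: str) -> Optional[int]:
    # Sequential software scan; the hardware compares all entries in parallel.
    for index, rule in enumerate(rules):
        if rule_matches(rule, src, dst, sport, dport, proto):
            return index
    return None


# Two hypothetical sample rules.
RULES = [
    Rule(IPv6Network("3ffe:3600:1::/48"), IPv6Network("3ffe:3600:2::/48"),
         (1024, 65535), (20, 21), "TCP", "Pass (FTP)"),
    Rule(IPv6Network("3ffe:3600:1::/48"), IPv6Network("3ffe:3600:2::/48"),
         None, (80, 80), "TCP", "Pass (WWW)"),
]
```

A packet whose source address, ports, and protocol match but whose destination address lies outside the rule's prefix matches nothing, which is the all-fields-or-nothing behaviour the hardware must reproduce.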
Figure 2. Hardware architecture for packet classification. [Block labels: “Select a rule with exactly matching all fields” (Match Flag); “Transform the result into Index (Address)” (Match Index); Output]

The hardware architecture finds the rule, if any, that matches all five fields of the packet. Three steps are involved in recognizing a packet and outputting the matched index. First, each field is searched in its corresponding CLM concurrently, which yields five MAs (Match Arrays). Second, logical operations find the correct rule that simultaneously matches all of these fields. Finally, a transformation circuit outputs the result (Index or Address) within a constant latency. The Index can be output directly or used as the address input of an SRAM.

In the first step, the header information is put into the CLM as a search key. The CLM is a CAM-like memory in which each entry contains a content array and a mask array. The header information is compared with every entry at the same time, and if an entry matches the input key, the corresponding match bit in the match array (MA) is set.

[Figure 3 labels: 128-bit CAM-like Memory; Header Information; 128-bit Entry (Content, Mask); Bitwise Compare; Match Bit; Match Array]
Figure 3. The detailed structure of a CLM.

The “Bitwise Comparison” circuit, which compares each bit within a deterministic delay, is shown in Figure 4. According to the rules, there are two possible cases when matching a bit of the header information, denoted the Content-bit (C), against the corresponding bit in the CLM, denoted the Prefix-bit (P). If the two bits are identical (C = P), the output, denoted the Result-bit (R), is set (R = 1). Otherwise (C ≠ P), the output is reset (R = 0). The corresponding bit in the mask, denoted the Mask-bit (M), also determines the output, where M marks a “don’t care” bit. If M is “1”, then R is set regardless of C and P. If M is “0”, then R depends on C and P: R = 1 if C = P, and R = 0 otherwise. The truth table is shown in Figure 4, and logic circuits are designed to meet this requirement. The match bit of an entry is then obtained from the string of all its Result-bits by logical AND and OR operations.
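The per-bit rule above (R = 1 when M = 1 or C = P) extends to a whole entry with word-wide bit operations. The following is a minimal software sketch, not the circuit itself; the function names `entry_matches`, `clm_lookup`, and `prefix_mask`, the integer encoding of keys, and the assumption that an entry matches when every Result-bit is 1 are illustrative choices.

```python
from typing import List, Tuple


def entry_matches(key: int, content: int, mask: int, width: int = 128) -> bool:
    """Ternary compare of one entry: a bit passes if masked or if key == content."""
    all_ones = (1 << width) - 1
    # XOR marks differing bits; inverting gives 1 where the bits agree (C = P).
    result_bits = (mask | ~(key ^ content)) & all_ones
    # Assumption: the match bit is the AND of all Result-bits.
    return result_bits == all_ones


def clm_lookup(key: int, entries: List[Tuple[int, int]],
               width: int = 128) -> List[int]:
    """Return the match array: one 0/1 match bit per (content, mask) entry."""
    return [1 if entry_matches(key, content, mask, width) else 0
            for content, mask in entries]


def prefix_mask(prefix_len: int, width: int = 128) -> int:
    """Mask with 1 ("don't care") in every bit below the prefix."""
    return (1 << (width - prefix_len)) - 1
```

For example, with 8-bit entries, the key 0b10110101 matches an entry holding the 3-bit prefix 101 and an all-wildcard entry, but not an entry holding the 2-bit prefix 11. In hardware every entry is compared at once; the list comprehension here only models the resulting match array.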
[Figure 6 labels: Result Array; Address Decode Engine; Address Bus; SRAM; Action]
Figure 6. Address Decode Engine.

Truth table:

Mask  Prefix  Data  Result
0     0       0     1
0     0       1     0
0     1       0     0
0     1       1     1
1     0       0     1
1     0       1     1
1     1       0     1
1     1       1     1
Figure 4. Bitwise comparison.

Each CLMi outputs a match array, denoted MAi, which contains k entries, where k is the number of entries in CLMi. Let MAi,j denote the jth element of MAi. A parallel comparison is made to find the correct rule that simultaneously matches all of the fields. To achieve this, two logical operations, VLA (Vertical Logical AND) and HLO (Horizontal Logical OR), are designed to indicate the correct rule and to output the search match flag, as shown in Figure 5. For each entry j, the elements MAi,j for all i undergo a VLA operation, and the result (true or false) is kept in a result array. A bit of “1” indicates that all fields are matched (a rule is matched); otherwise, the bit is set to “0”. To output the search match flag, the HLO operation is executed: all bits of the result array are ORed together to obtain the search match flag.

For firewall applications, there are typically two types of action for a matched rule: forward the packet or discard it. If a packet matches two rules with conflicting actions, the rule table is not appropriate and should be updated to avoid the conflict. If, on the other hand, the actions of the matched rules are the same, pre-computation can be employed to eliminate the overlap between these rules. Accordingly, there will be only one match in the result array for firewall applications.

After the VLA and HLO operations, the result is stored in the result array. However, it would still take O(k) time to search a result array of k elements for the bit that indicates which rule is matched. In the proposed scheme, an ADE (Address Decode Engine) is designed to output the index of the matched rule, if any, in the rule table, as shown in Figure 6. This index can also be used as the address input of an SRAM, in which case the result is output within one SRAM read-cycle time.
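The three stages just described can be modelled in a few lines of software. This is a minimal sketch under assumptions, not the circuit: the function names `vla`, `hlo`, and `ade` are hypothetical, match arrays are plain lists of 0/1 with element j belonging to rule j, and the ADE is modelled as a binary encoder whose output bit i is the OR of every result bit whose position j has bit i set. That encoder yields the correct index only when at most one result bit is 1, which is the single-match property argued above for firewall rule tables.

```python
from typing import List


def vla(match_arrays: List[List[int]]) -> List[int]:
    """Vertical Logical AND: result[j] is the AND over all fields i of MA[i][j]."""
    return [int(all(column)) for column in zip(*match_arrays)]


def hlo(result_array: List[int]) -> int:
    """Horizontal Logical OR: match flag is 1 if any rule matched."""
    return int(any(result_array))


def ade(result_array: List[int]) -> int:
    """Address decode modelled as OR gates, one per output index bit."""
    k = len(result_array)
    n_gates = max(1, (k - 1).bit_length())  # 3 gates for k = 8
    index = 0
    for i in range(n_gates):
        gate = 0
        for j, bit in enumerate(result_array):
            if (j >> i) & 1:      # result bit j is wired to OR gate i
                gate |= bit
        index |= gate << i
    return index
```

Because result bit 0 is wired to no OR gate, an index of 0 is produced both when rule 0 matches and when nothing matches; in this model the match flag from `hlo` is what distinguishes the two cases.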
[Figure 5 labels: CLM-1 to CLM-5; IPv6 SRC Address, IPv6 DST Address, Protocol, SRC Port, DST Port; MA-1 to MA-5; Vertical Logical AND; Result; Horizontal Logical OR; Match Flag]
Figure 5. The VLA and HLO logical operations.

Assume the result array contains k entries. Let Bj, 0 ≤ j ≤ k-1, denote the jth bit of this array, and let the index j of Bj be expressed as

\[
j = \sum_{n=0}^{\lfloor \log_2 (k-1) \rfloor} x_n 2^n
  = x_{\lfloor \log_2 (k-1) \rfloor}\, 2^{\lfloor \log_2 (k-1) \rfloor} + \dots + x_w 2^w + \dots + x_0 2^0,
  \qquad x_n \in \{0, 1\},\ 0 \le w \le \lfloor \log_2 (k-1) \rfloor
\]

In this expression, if the coefficient of 2^i is “1”, then the output of Bj is linked to the i-th OR gate. For illustration, Figure 7 shows an ADE circuit for a result array of k = 8 entries (B7, B6, …, B0). The outputs of these 8 bits are connected to the OR gates as follows:

Case 1. 7 = 1*2^2 + 1*2^1 + 1*2^0; B7 connects to OR gates 2, 1, and 0.
Case 2. 6 = 1*2^2 + 1*2^1 + 0*2^0; B6 connects to OR gates 2 and 1.
Case 3. 5 = 1*2^2 + 0*2^1 + 1*2^0; B5 connects to OR gates 2 and 0.
Case 4. 4 = 1*2^2 + 0*2^1 + 0*2^0; B4 connects to OR gate 2.
Case 5. 3 = 0*2^2 + 1*2^1 + 1*2^0; B3 connects to OR gates 1 and 0.
Case 6. 2 = 0*2^2 + 1*2^1 + 0*2^0; B2 connects to OR gate 1.
Case 7. 1 = 0*2^2 + 0*2^1 + 1*2^0; B1 connects to OR gate 0.
Case 8. 0 = 0*2^2 + 0*2^1 + 0*2^0; B0 does not connect to any OR gate.

Consequently, the ADE circuit delivers the match index in constant time.

4. Compact Method

For classifying IPv6 packets, each entry of the rule table usually needs a total width of 296 bits, comprising the source and destination IPv6 addresses (128 bits each), the source and destination port numbers (16 bits each), and the protocol field (8 bits). The capacity of a CAM chip is generally defined as the bit-width multiplied by the number of entries, so for a given capacity a larger bit-width leaves fewer entries. Additionally, it is difficult and expensive to manufacture CAMs with enough bit-width to support IPv6 packet classification, especially multi-field classification. To perform IPv6 packet classification more efficiently, the IPv6 addresses, port numbers, and protocol ID can be compressed.

According to the IPv6 address specification, the 128-bit address consists of a 64-bit prefix assigned by the ISP and an EUI-64 host ID derived from the physical address. For Ethernet, the 48-bit MAC address is encoded in EUI-64 format. We can therefore decode the EUI-64 ID back to the 48-bit MAC address, reducing the length of an IPv6 address to 112 bits (64 + 48).
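The address compaction described above can be illustrated in software. The sketch below assumes the interface ID was formed from an Ethernet MAC address by the modified EUI-64 procedure of the IPv6 addressing architecture (0xFFFE inserted between the third and fourth MAC bytes, and the universal/local bit inverted). The function names `eui64_to_mac` and `compact_ipv6` are hypothetical, and addresses whose interface ID lacks the 0xFFFE marker are simply reported as not compactable. It is not the paper's circuit, only one way to realize the 128-bit to 112-bit reduction.

```python
from ipaddress import IPv6Address
from typing import Optional


def eui64_to_mac(interface_id: int) -> Optional[int]:
    """Recover the 48-bit MAC from a modified EUI-64 interface ID, if possible."""
    raw = interface_id.to_bytes(8, "big")
    if raw[3:5] != b"\xff\xfe":
        return None               # not derived from a 48-bit MAC
    mac = bytearray(raw[:3] + raw[5:])
    mac[0] ^= 0x02                # undo the universal/local bit inversion
    return int.from_bytes(mac, "big")


def compact_ipv6(address: str) -> Optional[int]:
    """Pack the 64-bit prefix and the 48-bit MAC into a 112-bit integer."""
    value = int(IPv6Address(address))
    prefix = value >> 64
    mac = eui64_to_mac(value & ((1 << 64) - 1))
    if mac is None:
        return None
    return (prefix << 48) | mac
```

For instance, the MAC address 00:11:22:33:44:55 gives the interface ID 0211:22ff:fe33:4455, so `compact_ipv6("3ffe:3600:1:0:211:22ff:fe33:4455")` packs the prefix 3ffe:3600:0001:0000 together with the original MAC. How a real rule table should treat addresses that are not EUI-64 derived is left open here.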
[Figure 7 labels: Result Array, bits 7 to 0, holding 0 0 1 0 0 0 0 0; Address Decode Engine; Match Index 1 0 1]
Figure 7. Example of the Address Decode Engine for k = 8.

It is also worthwhile to compress the protocol field, because the frequently used protocols include only TCP, UDP, ICMP, IGMP, (E)IGRP, GRE, IPINIP, and the wildcard ‘*’ [3],[8]. The protocol field can therefore be mapped onto this small set, for which three bits are enough.

The port number fields of the transport layer take varied forms, and many are range specifications. For example, range 20-21 is for FTP, range 137-139 is for NetBIOS, range 161-162 is for SNMP, and gt1023 may be defined to represent the ports greater than 1023. Although the port number field contains 16 bits, the popular Internet services [5],[9] can be classified into 16 classes (Table 2), and four bits are enough to represent these classes.

Table 2. Service classes of popular Internet applications.

Class  Protocol and Service  Range       Detail
1      Echo                  7           Echo Service
2      DayTime               13          Day-Time Service
3      FTP                   20-21       File Transfer Protocol
4      TELNET                23          Remote Login Protocol
5      SMTP                  25          Simple Mail Transfer Protocol
6      DNS                   53          Domain Name Service
7      Bootp                 67-68
8      Gopher                70
9      WWW(HTTP)             80          World Wide Web
10     POP                   110         Post Office Protocol
11     News(NNTP)            119         Usenet News
12     NetBIOS               137-139
13     SNMP                  161-162     Simple Network Management Protocol
14     IRC                   194         Internet Relay Chat
15     gt1023                1024-65535
16     *                     *           Wildcard

After compacting the IPv6 address fields, the protocol field, and the port number fields, each entry is only 235 bits wide (2 × 112 address bits + 2 × 4 port-class bits + 3 protocol bits). Consequently, the memory space required for the rule table is reduced by about 20%.

5. Performance Evaluation

For packet classification, the IP/TCP headers are first parsed when a packet arrives, and the selected fields are then used to find a matching rule. However, because each field may match multiple entries, backtracking may occur unless additional link information is kept in the data structure. The hardware architecture proposed in this paper employs simple and efficient logic circuits to prevent the backtracking problem and to provide the search result with deterministic latency (constant time).

Although several vendors already deliver ternary CAMs [14-19], which are known to be suitable for packet classification, devices with high density and configurable width and depth are still very expensive. It is therefore not cost-effective to use these high-density ternary CAMs to support multi-field IPv6 classification. This paper proposes a method that employs multiple 128-bit ternary CAM-like memories, which is more cost-effective than a single high-density CAM, and designs efficient logic circuits that achieve wire-speed IPv6 packet classification while also solving the backtracking problem.
As mentioned before, each bit position of the Match Arrays takes one logical AND gate to produce its result. To support a rule table of 64K entries, 64K logical AND gates are required for VLA. One logical gate is then used to output the match flag in the HLO step. To export the match index within a deterministic latency, 16 logical OR gates (2^16 = 64K) are enough.

Two measurements are employed to evaluate the proposed IPv6 classifier: classification speed and required memory space. CAMs are hardware-based solutions known to be very effective for packet classification [14-19]. Although some ternary CAMs provide a configurable entry bit-width, only a few products [14],[17] furnish enough density to support IPv6 packet classification, because most do not support bit-widths of up to 300 bits; most ternary CAM products support packet classification only up to a width of 128 bits. The design proposed in this paper is built on such 128-bit ternary CAMs.

To evaluate the classification speed for IPv6 packets, the classification procedure of the Network Search Engine (NSE) of NetLogic Microsystems Inc. [17] is compared with our solution. The whole classification operation includes reading the data, the lookup, and outputting the result. For the NSE, the data bus is 72 bits wide, and inputting the five fields of an IPv6 header takes 4 clock cycles. The NSE then takes only one clock to look up the correct result, and another clock cycle to output the result. For our proposed solution, only two clocks are required in the first step, and one clock is enough for the lookup. In the last step, our solution takes one clock plus several gate delays (constant time) to produce the result. Consequently, our solution is about 30% faster than the NSE (Table 3).

Table 3. Comparison of classification latency.

Item     NetLogic's NSE    Our solution
Read     4 clock cycles    2 clock cycles
Lookup   1                 1
Output   1                 1 + several gate delays
Total    6                 4
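The figures in Table 3 and the gate counts above can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names (`total_cycles`, `or_gates_for`) and ignores the extra constant gate delay on the output step, so the one-third cycle reduction it computes is an upper bound on, rather than a restatement of, the roughly 30% figure quoted above.

```python
def total_cycles(read: int, lookup: int, output: int) -> int:
    """Clock cycles for one classification: read + lookup + output."""
    return read + lookup + output


def or_gates_for(entries: int) -> int:
    """Number of index bits (OR gates) needed to address `entries` rules."""
    return max(1, (entries - 1).bit_length())


nse = total_cycles(read=4, lookup=1, output=1)    # NSE column of Table 3
ours = total_cycles(read=2, lookup=1, output=1)   # gate delays ignored
reduction = (nse - ours) / nse                    # fraction of cycles saved

print(f"NSE: {nse} cycles, proposed: {ours} cycles, saved: {reduction:.1%}")
print(f"OR gates for 64K entries: {or_gates_for(64 * 1024)}")
```

The same helper gives 3 OR gates for the k = 8 example of Figure 7.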
Furthermore, the proposed method can be performed in a pipelined fashion. The procedure can be separated into two steps: the first performs the read and lookup operations, and the second exports the results. With pipelining, each lookup takes only three clock cycles. If the ternary CAMs run at 66 MHz, our solution can therefore perform about 20 million packet classification operations per second (66 MHz / 3 cycles ≈ 22 million).

The compacting method employed in this paper organizes the frequently used protocols and applications into several classes, so that only 235 bits, instead of 296 bits without compression, are required to represent an entry of the rule table. This saves about 20% of the memory space.

6. Conclusions

A novel packet classification hardware architecture capable of searching multiple fields simultaneously has been proposed. Ternary CAM-like memories are used to perform the lookup in hardware. In addition, two hardware operations, VLOP (Vertical Logical OPeration) and HLOP (Horizontal Logical OPeration), are designed to match the correct rule. An ADE (Address Decode Engine) is also proposed to output the Matched Index in O(1) time. Evaluation results indicate that, compared with a typical market-leading search engine, the proposed hardware architecture provides a speed-up of about 30%. A compact method is also provided to compress the bit-width required to represent the multiple fields of an IPv6 packet; it reduces the memory space required for the IPv6 rule table by about 20%.

In this paper, only five fields of the IPv6/TCP headers are considered for packet classification. For the emerging content-based classification, layer-7 information (such as URLs and cookies) will be required to differentiate applications. In such an environment, it will be even more difficult and expensive to design a single, ultra-high-density ternary CAM to support the most critical task, packet classification. The hardware architecture proposed in this paper is relatively cost-effective and scalable, and can therefore serve as a hardware implementation platform for emerging content-based packet classification.

References

[1] Feldman, A.; Muthukrishnan, S., “Tradeoffs for packet classification”, IEEE INFOCOM 2000, Israel, March 2000, P1193-P1202.
[2] Ching-Fong Su, “High-speed packet classification using segment tree”, IEEE GLOBECOM 2000, San Francisco, November 2000, P582-P586.
[3] Che, H.; Li, S.-Q., “MPOA flow classification design and analysis”, IEEE INFOCOM '99, New York, March 1999, P1497-P1504.
[4] Singh, K., “A configurable 5-D packet classification engine with 4Mpacket/s throughput for high-speed data networking”, ISSCC 2000, San Francisco, February 2000, P82-P83.
[5] Ilvesmaki, M.; Luoma, M.; Kantola, R., “Learning vector quantization in flow classification of IP switched networks”, IEEE GLOBECOM '98, Sydney, Australia, November 1998, P3017-P3022.
[6] Jiří Matoušek, “Range searching with efficient hierarchical cuttings”, Proceedings of the Eighth Annual Symposium on Computational Geometry, Berlin, Germany, June 1992, P276-P285.
[7] Borg, N.; Svanberg, E.; Schelen, “Efficient multi-field packet classification for QoS purposes”, IWQoS '99, London, June 1999, P109-P118.
[8] V. Srinivasan, G. Varghese, S. Suri and M. Waldvogel, “Fast and scalable layer four switching”, Proceedings of ACM SIGCOMM '98, Vancouver, Canada, August 1998, P191-P202.
[9] Pankaj Gupta and Nick McKeown, “Packet classification on multiple fields”, Proceedings of ACM SIGCOMM '99, Cambridge, August 1999, P147-P160.
[10] V. Srinivasan, S. Suri and G. Varghese, “Packet classification using tuple space search”, Proceedings of ACM SIGCOMM '99, Cambridge, August 1999, P135-P146.
[11] T. V. Lakshman and D. Stiliadis, “High-speed policy-based packet forwarding using efficient multi-dimensional range matching”, Proceedings of ACM SIGCOMM '98, Vancouver, Canada, August 1998, P203-P214.
[12] Priyank W.; Subhash Suri; George V., “Fast Packet Classification for Two-Dimensional Conflict-Free Filters”, IEEE INFOCOM 2001, Alaska, April 2001.
[13] Jun Xu; Singhal, M.; Degroat, J., “A novel cache architecture to support layer-four packet classification at memory access speeds”, IEEE INFOCOM 2000, Israel, March 2000, P1445-P1454.
[14] Kawasaki LSI’s Classification CAM products web page [online]. Available: http://www.klsi.com/products/CAM.htm.
[15] MOSAID Technologies Inc.’s MOSAID Class-IC Ternary Content Addressable Memories product page [online]. Available: http://www.mosaid.com/semiconductor/class-ic.htm.
[16] MUSIC Semiconductors’ Routing Co-processor and Ternary CAM [online]. Available: http://www.music-ic.com/product/products.html.
[17] NetLogic’s Network Search Engine (NSE) web page [online]. Available: http://209.10.226.212/netlogic/html/products/nse.html.
[18] SiberCore Technologies’ ternary Content Addressable Memory (T-CAM) technology [online]. Available: http://www.sibercore.com/products.htm.
[19] Virage Logic’s NetCAM web page [online]. Available: http://www.viragelogic.com/products.