This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.

A Pattern Matching Coprocessor for Deep and Large Signature Set in Network Security System

Chih-Chiang Wu*, Sung-Hua Wen**, Nen-Fu Huang**, and Chia-Nan Kao**
*Computer and Communication Research Center (CCRC), **Institute of Communications Engineering, National Tsing Hua University, Taiwan, ROC
[email protected]

Abstract - As networks grow quickly and viruses spread across them more frequently, the network intrusion prevention system (NIPS) is becoming more and more important. Intrusion prevention has traditionally been done purely in software on a high-performance CPU. However, this approach is out of date now that gigabit networks are booming and high throughput is required. In recent years, programmable hardware solutions have been proposed, but they cannot handle deep patterns and large signature sets, and they lack flexibility as signature sets grow. In this paper, we propose a novel pattern-matching coprocessor that overcomes the difficulties of a TCAM implementation when patterns are deep and the signature set is large. Since all patterns are stored in TCAM, the system is scalable and flexible.

Keywords - FPGA, Network Security, Pattern Matching, TCAM.

I. INTRODUCTION

Signature matching is the most popular method for detecting misuse behavior in intrusion detection and for locating viruses in an anti-virus gateway. Signature patterns are mainly pre-defined strings that indicate the presence of worms or viruses, so the most critical part of signature matching is locating these pre-defined strings in packet payloads. Compared to traditional layer-4 security devices such as firewalls, signature matching is much more computation-intensive, since it involves payload scanning rather than only header-field comparison. To improve processing efficiency, numerous hardware-based solutions to pattern matching have been investigated. Two techniques are most common. One is based on finite state automata (FSA) [1] [2] [3] [4] [5] [6], mostly implemented in FPGAs for flexibility. The other is based on Ternary Content Addressable Memory (TCAM) or CAM-like circuits [7] [8] [9]. Beyond these two, [10] uses a Bloom filter to match pattern prefixes first and then performs exact matching. Since a system usually needs to update its signature set frequently and immediately, hardwiring patterns into the circuit is impractical even though FPGAs are reconfigurable. Storing the patterns in memory is the suitable way to achieve flexibility and scalability.

This work was supported by the MOE Program for Promoting Academic Excellence of Universities (II) under grant number NSC-94-2752-E007-002-PAE.


DRAM is the most cost-effective memory, but its time-consuming refresh makes it less desirable at gigabit rates. SRAM, on the other hand, is fast but lacks parallel comparison capability. With TCAM prices slipping, the cost issue becomes negligible, and TCAM's parallel comparison capability makes it practical for pattern matching under gigabit traffic. Even so, cascading TCAMs for deep pattern matching is inefficient. In particular, for the virus patterns announced in Clam AV [11], the longest pattern exceeds 2000 bytes, far beyond the word width of a single TCAM. Another problem with TCAM devices is that only the highest-priority matched position is reported. Each pattern occupies one entry, and entries are arranged in descending order of length, so extra, complex tables must be built to record multiple hits. Such a method consumes considerable table-lookup time and host computing resources. The main contribution of this paper is a pattern matching coprocessor that solves the above problems.

This paper is organized as follows. Section II introduces the related work. Section III defines the problem more precisely. Section IV presents the functionality of the whole system. Section V describes the architecture of the proposed coprocessor. Section VI analyzes the hardware resource consumption. Section VII gives the implementation and simulation results. Finally, some conclusions are given.

II. RELATED WORK

Current software-based NIDS cannot meet the bandwidth requirements of a multi-gigabit network. For example, the open-source IDS SNORT [12], running on a dual 1-GHz Pentium III system, can sustain a throughput of only 50 Mbps. Several hardware solutions have been proposed to solve this problem, many of them using FPGAs.
FPGAs can be used for fast pattern matching thanks to their reconfigurability and parallelism. Sidhu et al. mapped Nondeterministic Finite Automata (NFA) for regular expressions onto an FPGA to perform fast pattern matching [4]. Moscola et al. then translated regular expressions into deterministic finite automata (DFA) [5]. Young et al. implemented a fast pattern-matching engine with a parallel-pipelined series of reconfigurable lookup tables (LUTs) that can sustain a bandwidth of 3.2 Gbps [2]. Sourdis mapped a similar design with a deeper pipeline to achieve a bandwidth

0-7803-9415-1/05/$20.00 © 2005 IEEE



of 10Gbps [6]. The compilation and reconfiguration time required by an FPGA design can be on the order of minutes to days. Such reconfiguration delay may not be acceptable, since new network intrusions and worms are released frequently; moreover, the placement-and-routing result differs on every compilation, so the performance of the updated design is not guaranteed to be as good as the current one. Dharmapurikar et al. proposed an approximate method using Bloom filters [10]. Their approach can handle thousands of patterns and detect them at 600 Mbps, but with some false positives, so an additional exact string comparison is needed to filter the false positives out. Gokhale et al. proposed a fast reprogrammable pattern search system using Content Addressable Memory (CAM) [7]. Most of these works handle only a few patterns of short length.

1234
123*
abcd
efg*
56**

The "*" means a "don't care" byte. "1234" and "abcd" are both head subpatterns. Subpatterns can be extended and arranged in another way, called "extended TCAM", to speed up the comparison further [9]. In that method, each entry in TCAM is expanded into w entries, so that the input can shift up to w bytes at a time when no entry matches.

B. Multi-addresses Output Mechanism

III. PROBLEM DEFINITION

Given an n-byte input text T = T0T1...Tn-1 and a finite set of r patterns P = {P1, P2, ..., Pr}, the multi-pattern matching problem is to locate and identify every substring of T that is identical to some pattern Pj = a0a1...am-1, 1 <= j <= r, where ai = Ts+i and s is the pattern's start location. Thus, if Ts...Ts+m-1 = a0a1...am-1, the pattern matches the substring and its location s is obtained. Since the word width of TCAM is limited to w, and only the highest-priority match is reported when multiple patterns match, two problems must be solved:

A subpattern may overlap with short patterns. For example, suppose there is a head subpattern "abcdefgh" and two short patterns "abcd" and "abc". When the head subpattern matches, the two short patterns also match simultaneously, but the TCAM reports only the address of the head subpattern. However, the selector needs the addresses of all patterns in the same TCAM group to generate the corresponding pattern IDs. Therefore, the address multiplexer illustrated in Figure 1 is used. Each matched address is distributed to k multiplexers, so at most k addresses can be output. Since group sizes differ (the maximum group size is k), an invalid address is fed to a multiplexer when it has no address to output.

1. How to report all the matched patterns in one clock cycle?
2. How to handle deep pattern matching?
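As a concrete reference point, the required behavior can be stated as a naive software matcher (an illustrative sketch; the names are ours, and a real system would use an efficient algorithm rather than this brute-force scan):

```python
def match_all(text: bytes, patterns: list[bytes]) -> list[tuple[int, int]]:
    """Return (pattern_index, start_position) for every occurrence of
    every pattern in the text -- the behavior the coprocessor must
    reproduce in hardware while shifting one byte per clock cycle."""
    hits = []
    for j, p in enumerate(patterns):
        m = len(p)
        for s in range(len(text) - m + 1):
            if text[s:s + m] == p:
                hits.append((j, s))
    return hits

# Overlapping patterns are all reported, unlike a plain TCAM lookup,
# which would report only the highest-priority entry:
print(match_all(b"xabcdey", [b"abcde", b"abc", b"zz"]))
```

Reporting every overlapping hit, with its location, is exactly what problem 1 above asks the hardware to do in a single clock cycle.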


In this paper, r is very large and m varies from 1 to a large number. Our goal is to report all matched patterns and their locations, within one packet or across continuous packets, at gigabit speed.

IV. REQUIRED FUNCTIONALITY


Figure 1. Address Multiplexer.

To solve the above problems, several functions must be provided. We explain them as follows.

C. Position Counter

A. Subpattern Format

Suppose the TCAM width is w. We call a pattern a short pattern if its length is at most w; otherwise we call it a long pattern. Because of the limit on the TCAM word width, a long pattern is partitioned into several subpatterns, each of length exactly w except possibly the last. We call the first subpattern the head subpattern. For example, suppose the TCAM width is 4 bytes and there are three patterns: Pattern 1 is "123456", Pattern 2 is "abcdefg", and Pattern 3 is "123". The entries saved in TCAM are then 1234, 56**, abcd, efg*, and 123*.
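This partitioning can be sketched in a few lines, using the 4-byte example width (the helper name is ours, not the paper's):

```python
TCAM_W = 4  # TCAM word width in bytes (the example value above)

def to_entries(pattern: bytes, w: int = TCAM_W) -> list[bytes]:
    """Split a pattern into w-byte TCAM entries, padding the last
    (or only) piece with b'*' don't-care bytes.  The first entry of
    a long pattern is its head subpattern."""
    pieces = [pattern[i:i + w] for i in range(0, len(pattern), w)]
    pieces[-1] = pieces[-1].ljust(w, b"*")
    return pieces

print(to_entries(b"123456"))   # Pattern 1
print(to_entries(b"abcdefg"))  # Pattern 2
print(to_entries(b"123"))      # Pattern 3
```

Running this on the three example patterns reproduces the five TCAM entries listed in the text.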


In order to handle pattern-correlated rules, the position of each matched pattern must be saved. A counter therefore counts the incoming characters to track the current position. For example, the rule "P1 P2 distance: 3" means that P2 should occur 3 characters after P1 appears. The counter value is saved as soon as P1 matches; when P2 matches, the counter value is again saved in a queue. The post processor or host then takes P1 and P2 together with their positions and checks them against the rule table. We call patterns in the same rule


correlated patterns. In the example above, P1 and P2 are correlated patterns.

V. SYSTEM ARCHITECTURE

The whole system architecture is depicted in Figure 2. In each clock cycle, the incoming string is shifted by one byte and w bytes of it are sent to the TCAM. If a subpattern or a short pattern hits, the comparison result is sent to the central control unit (CCU). The CCU stores the IDs of matched patterns and their locations in SRAM. The post processor reads this information from SRAM and compares it against the rule table to identify a rule match. When a single pattern or a set of correlated patterns matches a rule, the post processor alerts the host.

Figure 3 shows the detailed block diagram of a CCU, which consists of four main blocks: processor elements, a selector, a pattern table, and a download controller. The processor element (PE) is the kernel of the pattern-matching coprocessor. When a head subpattern matches in TCAM, the matched address and a "CAM hit" signal are sent to the selector, which enables an idle PE to match the following subpatterns of the same pattern. Once enabled, the PE downloads the corresponding information from the pattern table through the download controller: the number of remaining subpatterns, the associated pattern ID, and the set of subpattern addresses in TCAM. The PE then checks the match address from TCAM every w clock cycles. If the match address is not the expected subpattern address, the PE releases itself and notifies the selector that it can be reused for another pattern. Otherwise, if the matched subpattern is the last one, the associated pattern ID and the pattern position recorded by the byte counter are stored in SRAM, and the PE is likewise released. Some last subpatterns may overlap. For example, given a subpattern "abcdef" and a subpattern "abcd", the TCAM will report only the address of "abcdef".
Nevertheless, the PE waiting for the address of "abcd" should also be notified of the match. Therefore, we adopt the mechanism introduced in Figure 1 in the comparator to generate the related subpattern addresses; each PE compares these addresses with its expected address to identify a subpattern match. Several processor elements can handle the subpattern match results from the TCAM in parallel. Several patterns may share the same head subpattern; in that case, the selector enables one PE per pattern. The selector is not only an initiator for PEs but also a collector for short patterns: it stores the pattern ID and position in SRAM directly when the matched pattern is shorter than w bytes. For patterns longer than w bytes, any matched head subpattern causes the selector to enable a PE by picking an idle one from the available-PE list. Another important function of the selector is to generate all the matched IDs for short patterns. For example, if pattern "abcde" matches, then the short pattern "abc" also matches. The TCAM reports only the address of "abcde", but the


selector should store both patterns' IDs simultaneously. Therefore, the selector adopts the mechanism illustrated in Figure 1: one multiplexer outputs the address of "abcde", another outputs the address of "abc", and the remaining multiplexers output the invalid address. Since the input addresses of each multiplexer are stored in registers, they can be configured with different values according to different rules.
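The PE bookkeeping described above can be modeled in a few lines of software (a simplified behavioral sketch under our own naming, not the paper's RTL):

```python
class PE:
    """Minimal model of a processor element: after its head subpattern
    hits, it checks the TCAM match result once every w clock cycles
    against the next expected subpattern address."""
    def __init__(self, pattern_id: int, tail_addrs: list[int]):
        self.pattern_id = pattern_id
        self.expected = list(tail_addrs)  # remaining tail-subpattern addresses
        self.busy = True

    def step(self, matched_addrs: set[int]):
        """matched_addrs holds every address the comparator reports at this
        check (the Figure 1 mechanism, so overlapped subpatterns are all
        visible).  Returns the pattern ID on a full match, else None."""
        if self.expected[0] not in matched_addrs:
            self.busy = False          # mismatch: release the PE
            return None
        self.expected.pop(0)
        if not self.expected:
            self.busy = False          # last subpattern matched: release too
            return self.pattern_id
        return None

# Hypothetical pattern 7 whose tail subpatterns sit at TCAM rows 12 and 13:
pe = PE(pattern_id=7, tail_addrs=[12, 13])
print(pe.step({12, 40}))  # first tail subpattern seen -> None, keep waiting
print(pe.step({13}))      # last subpattern seen -> 7
```

The release-on-mismatch behavior is what lets a small pool of PEs be shared across all in-flight long patterns.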

Figure 2. System architecture.


Figure 3. Block diagram of central control unit (CCU).

The pattern table, stored in the embedded SRAM, records the addresses of the following subpatterns. Because multiple PEs may access this table, it is best organized as an array so that each entry can be accessed individually. Since rule matching operates much like pattern matching, the same architecture can handle rule matching as well: the rule table is treated as the pattern table, and the patterns with their locations are treated as the incoming string. For each rule, we put the pattern ID and its location condition in the TCAM, like a subpattern. The TCAM can be separated into several blocks that are enabled individually, with pattern matching and rule matching using different blocks, so that the two stages operate as a pipeline. In that case, the host contains not only a CPU but also a rule-matching engine, called the post processor, which is almost the same as the pattern-matching

This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE GLOBECOM 2005 proceedings.

When the TCAM width is 8 bytes, the sum of the TCAM size and the SRAM size is the lowest.

We assume the total number of long patterns is N, the total number of short patterns is M, and the length of the longest pattern is L. The total number of PEs required to handle the worst-case condition (i.e., continuous hits) is

(w × ⌈L/w⌉) × (1 + r) ≅ (L + 1) × (1 + r),


VI. ANALYSIS OF THE PROPOSED SYSTEM


where r is the number of other patterns that share a head subpattern with the given pattern. The formula shows that the number of PEs depends strongly on the signature characteristics and the incoming payload. For example, if r = 10 and L = 16, the maximum number of PEs needed to handle the worst case is 187.
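Plugging the example values into the bound above:

```python
# Worst-case PE count, approximately (L + 1) * (1 + r) as derived above.
L, r = 16, 10   # longest-pattern length and same-head-subpattern count
max_pes = (L + 1) * (1 + r)
print(max_pes)  # 187, the figure quoted in the text
```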

(M + Σh ⌈Ph/w⌉) × w,

where Ph is the length of pattern h. That is, a longer TCAM word causes a larger TCAM. According to general TCAM characteristics, the drawbacks of a longer word are slower processing and higher power consumption. Although the pattern-table size is proportional to the number of patterns, it also depends on the TCAM word length w: each pattern-table entry contains a pattern ID, the number of subpatterns, and their addresses, so a different w yields a different SRAM size. The SRAM size is

Σh ( K + ⌈log2⌈Ph/w⌉⌉ + ⌈Ph/w⌉ × ⌈log2(Σh ⌈Ph/w⌉)⌉ ),

where K is the number of bits of a pattern ID.
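The two sizing formulas can be evaluated directly; the sketch below is illustrative only (the pattern-ID width K and the sample pattern lengths are assumed values, not taken from the rule set):

```python
from math import ceil, log2

def sizes(short_count: int, long_lengths: list[int], w: int, K: int = 12):
    """TCAM size in bytes and pattern-table SRAM size in bits for word
    width w.  short_count: patterns that fit in one entry; long_lengths:
    byte lengths of the long patterns; K: assumed pattern-ID width."""
    entries = [ceil(p / w) for p in long_lengths]
    total = sum(entries)                  # total long-pattern entries
    addr_bits = ceil(log2(total))         # bits per subpattern address
    tcam_bytes = (short_count + total) * w
    sram_bits = sum(K + ceil(log2(e)) + e * addr_bits for e in entries)
    return tcam_bytes, sram_bits

# e.g. 1000 short patterns plus a few long ones, at w = 8 bytes:
print(sizes(1000, [61, 40, 33, 20], w=8))
```

Sweeping w over candidate widths with a helper like this is how the TCAM/SRAM trade-off of Figure 5 can be reproduced.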

VII. SIMULATION AND IMPLEMENTATION RESULTS

A. Simulation Results

We used a modified SNORT rule set with 2010 rules and 1717 patterns, the longest of which is 61 bytes, to evaluate the coprocessor's performance. Figure 4 shows the distribution of pattern lengths; 6-byte patterns are the most common. To characterize the rule set, we used the formulas of Section VI to calculate the TCAM and SRAM sizes. Figure 5 shows the relation among TCAM size, SRAM size, and TCAM width. From this analysis we infer that the optimal TCAM width is 8 bytes.


Figure 4. The distribution of pattern length.


We do not need the extended TCAM structure unless more than 10 Gbps of throughput is required. For the normal case, we want the TCAM to be as small as possible even though all patterns are stored in it. The size of the TCAM is


engine. The detailed operation of the post processor is outside the scope of this paper.

Figure 5. TCAM size and SRAM size.

To examine the characteristics of different pattern sets, we extracted subsets of the whole rule set to form a new rule set for each simulation case. Four cases were used, containing 592, 526, 714, and 633 patterns, respectively; only a few patterns are duplicated among them. The TCAM word width varies from 2 bytes (16 bits) to 256 bytes (2048 bits). To simulate real-world attacks, we extracted packets from a dump file downloaded from DEFCON [13] to construct the test packets. MaxPE denotes the maximum number of PEs enabled by the selector simultaneously. Figure 6 shows the simulation result: after running all the test packets, we obtain the MaxPE value for each case. The results indicate that different pattern sets have different characteristics. Cases 3 and 4 behave similarly because they contain many subpatterns whose lengths are larger than 16 bytes and smaller than 32 bytes; with a 16-byte TCAM width, their MaxPE is still large, whereas the MaxPE of Case 1 is close to zero. Figure 7 shows the average of MaxPE over all packets, AvgMaxPE for short. From these results, an 8-byte TCAM word width satisfies most packets' requirements.



VIII. CONCLUSION


In this paper, we proposed a pattern matching coprocessor to handle deep pattern matching, and described its detailed architecture and operation. We also analyzed the number of PEs required in the worst case, the required pattern table size, and the TCAM size for different TCAM word widths; from these analyses, the optimal TCAM word width can be estimated. Simulations with real-world attack traffic were performed to evaluate the performance of the proposed system, and the system was implemented in an FPGA. The simulation and implementation results show that the proposed system is practical and flexible enough to handle real gigabit network attacks.


Figure 6. MaxPE.


REFERENCES


Figure 7. Average of MaxPE.

B. Implementation

According to the preceding analysis and simulation results, we implemented a coprocessor with 32 PEs in ALTERA's Cyclone II FPGA (EP2C20F484), with w = 8 bytes and a total TCAM size of 2048 × 64 bits. Table I shows the resource usage; per the compilation report, 6305 logic elements and 22K bits of memory are used. Since the coprocessor operates at 150 MHz, the system throughput is 1.2 Gbps with a compact TCAM and 64 Gbps with the extended TCAM structure. Since the coprocessor architecture is very regular and scalable, the system can be implemented in an ASIC as well as an FPGA.

TABLE I. FPGA RESOURCES USED FOR EACH MODULE.

Module          Resource Usage
Selector        530 LEs (1% of total LEs)
PE              150 × 32 LEs (26% of total LEs)
Pattern Table   22K bits (9% of memory)
I/O Pin         210 (50% of total pins)
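The compact-TCAM throughput figure above can be sanity-checked with back-of-envelope arithmetic (the extended-TCAM figure additionally depends on how many bytes are consumed per cycle, which this sketch does not model):

```python
# Compact TCAM: the input window shifts one byte per clock cycle,
# so the sustained rate is one byte of payload per cycle.
clock_hz = 150e6       # coprocessor operating frequency from the report
bits_per_cycle = 8     # one byte consumed per cycle
throughput_gbps = clock_hz * bits_per_cycle / 1e9
print(throughput_gbps)  # 1.2
```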


[1] R. Franklin, D. Carver, and B. L. Hutchings, "Assisting Network Intrusion Detection with Reconfigurable Hardware," in Proc. of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), Napa, California, USA, April 2002, pp. 121-130.
[2] Young, et al., "Deep Network Packet Filter Design for Reconfigurable Devices," in Proc. of the 12th Conference on Field Programmable Logic and Applications (FPL'02), Montpellier, France, September 2002.
[3] Young, et al., "Programmable Hardware for Deep Packet Filtering on a Large Signature Set," in Proc. of the First IBM Watson P=ac2 Conference, Yorktown, NY, October 2004.
[4] R. Sidhu and V. K. Prasanna, "Fast Regular Expression Matching using FPGAs," in Proc. of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'01), Rohnert Park, California, USA, April 2001, pp. 223-232.
[5] J. Moscola, J. Lockwood, R. P. Loui, and M. Pachos, "Implementation of a Content-Scanning Module for an Internet Firewall," in Proc. of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'03), Napa, California, USA, April 2003, pp. 31-38.
[6] Sourdis, et al., "Fast, Large-Scale String Match for a 10Gbps FPGA-based Network Intrusion Detection System," in Proc. of the 13th Conference on Field Programmable Logic and Applications (FPL'03), Lisbon, Portugal, September 2003, pp. 880-889.
[7] M. Gokhale, et al., "Granidt: Towards Gigabit Rate Network Intrusion Detection Technology," in Proc. of the 12th International Conference on Field Programmable Logic and Applications (FPL'02), Montpellier, France, September 2002, pp. 404-413.
[8] M. Silberstein, et al., "Designing a CAM-based Coprocessor for Boosting Performance of Antivirus Software," Technion technical report, March 2004.
[9] R. T. Liu, N. F. Huang, C. H. Chen, and C. N. Kao, "A Fast String Matching Algorithm for Network Processor-Based Intrusion Detection System," ACM Transactions on Embedded Computing Systems, Vol. 3, Issue 3, August 2004, pp. 614-623.
[10] S. Dharmapurikar, et al., "Deep Packet Inspection using Parallel Bloom Filters," IEEE Micro, Vol. 24, No. 1, January 2004, pp. 52-61.
[11] Clam Anti-virus signature database, http://www.clamav.net.
[12] SNORT official web site, http://www.snort.org.
[13] DEF CON web site, http://www.defcon.org.
