A Fast Pattern-Match Engine for Network Processor-based Network Intrusion Detection System

Rong-Tai Liu¹, Nen-Fu Huang¹, Chia-Nan Kao², Chih-Hao Chen¹, Chi-Chieh Chou¹

¹Department of Computer Science, National Tsing Hua University, Taiwan
{tie, nfhuang, mr914238, mr904397}@cs.nthu.edu.tw
²Institute of Communications Engineering, National Tsing Hua University, Taiwan
[email protected]

Abstract

Network Intrusion Detection Systems (NIDS) are one of the latest developments in security. The matching of packet strings against collected signatures dominates the performance of signature-based NIDS. This work presents FNP2, an efficient pattern-matching engine designed for the Network Processor platform, which matches sets of patterns in parallel. Combining our string matching methodology, the hashing engine supported by most Network Processors, and the characteristics of current Snort signatures frequently improves performance and reduces the number of memory accesses compared to current NIDS pattern matching algorithms. Another contribution is to highlight that, besides the total number of search patterns, the shortest pattern length is also a major influence on the performance of NIDS multi-pattern matching algorithms.

1. Introduction

Network Intrusion Detection Systems (NIDS) are designed to identify attacks against networks or a host that are invisible to firewalls, thus providing an additional layer of security [12]. The performance of signature-based NIDSes has been shown to be dominated by the speed of the string matching algorithms used to compare packets with signatures [6, 7]: implementing a different algorithm achieves up to a 500 percent increase in performance for Snort 2.0 [6, 14]. A NIDS must use an efficient string matching algorithm, since an under-performing passive system drops many packets and misses many attacks, while an under-performing inline system creates a bottleneck for network performance [7]. This work presents a multiple pattern matching algorithm that works efficiently regardless of the ruleset size or the minimal pattern length. Moreover, the design utilizes the hardware accelerators of network processors to hide search latency and improve performance.

_____________________
This research was partially supported by the MOE Program for Promoting Academic Excellence of Universities under Grant 89-EFA04-1-4.
Network processors are part of an emerging class of programmable ICs based on system-on-a-chip technology that perform communications-specific functions more efficiently than general-purpose processors. Network processors generally use pipelining, parallelism, and multi-threading to hide latency, and also employ hardware accelerators for hashing, tree searches, frame forwarding, filtering, and alteration [9, 13, 16]. The design presented here employs multi-threading for parallel processing and a hardware-accelerated hashing engine that identifies matching entries via a linked list in the event of hash collisions, saving processor cycles. The hashing engine checks the linked entries individually from a given starting address until it identifies a matching entry or reaches the end of the linked list. Searching entries with the hashing engine hides latency and improves performance, since the processor context-switches before a search result is returned. The remainder of this paper is organized as follows. Section 2 reviews previous work, after which Section 3 describes the algorithm developed here. Section 4 then presents experiments involving FNP2 and compares its performance with the best existing alternatives. Finally, Section 5 summarizes the contribution of this research and presents conclusions.
2. Previous Works

Aho and Corasick [1] proposed an algorithm for concurrently matching multiple strings. The Aho-Corasick (AC) algorithm uses a finite automaton that accepts all strings in the set. The automaton processes the input characters individually and tracks partially matching patterns. The AC algorithm has proven linear performance, making it suitable for searching a large set of rule signatures [7]. Marc Norton's implementation [14] (the so-called optimized AC algorithm) is used for the experiments in this paper; its running time is independent of the pattern set. Fisk and Varghese designed a set-wise Boyer-Moore-Horspool (SBMH) algorithm [7], adapting the Boyer-Moore-Horspool [8] algorithm to
Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’04) 0-7695-2108-8/04 $ 20.00 © 2004 IEEE
simultaneously match a rule set. The set of patterns can be compared quickly against any position in the text by storing the reversed patterns in a trie. Their experiments show that this algorithm is faster than both the Aho-Corasick and Boyer-Moore algorithms [3] for medium-size pattern sets. However, the maximum number of shifts is bounded by the length of the shortest pattern (denoted LSP) in the pattern set. Markatos, Antonatos, Polychronakis, and Anagnostakis provided an exclusion-based signature matching algorithm called ExB [11]. ExB is based on a simple observation: if there is at least one character in pattern P that does not appear in text T, then P is not in T. Every time a new payload arrives, the patterns are checked one by one for a character that appears in the pattern but not in the payload. If such a character exists, the pattern is skipped, since there is no way for it to match; otherwise, the Boyer-Moore algorithm is invoked to search for the pattern in the payload. The basic concept of E2xB is the same as that of ExB, except for the way a match is denoted. In [2], their implementation performs better on a Pentium-3 1 GHz processor with a 512KB L2 cache than on a Pentium-4 1.7 GHz processor with a 256KB L2 cache, showing that the performance of ExB relies heavily on cache size. Therefore, when implemented on a network-processor platform, performance suffers from large memory access latency, since cache memory is usually absent and even the internal memory is very small. The MWM algorithm is another widely used multi-pattern matching algorithm, designed by Wu and Manber [17]. The current implementation of Snort [14] uses this algorithm as the default engine when the search-set size exceeds ten. The MWM algorithm uses the Boyer-Moore algorithm with a two-byte shift table established by preprocessing all patterns.
This algorithm hashes a two-byte prefix into a group of patterns, which are then checked beginning from the final character when a partial match occurs. The MWM algorithm is used in agrep [17] and has been shown to handle large numbers of patterns efficiently. However, the performance of the MWM algorithm also depends considerably on the LSP, because the maximum number of shifts equals this value minus one.
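As a concrete illustration of the shift-table idea, the following is a minimal Wu-Manber-style sketch in Python (block size 2, shifts bounded by LSP − 1, LSP ≥ 2 assumed). It is not the Snort or agrep implementation; all function and variable names are illustrative.

```python
# Sketch of a Wu-Manber-style multi-pattern search (simplified, block size B = 2).
# Illustrative only; assumes every pattern has length >= 2 (LSP >= 2).

def build_tables(patterns):
    lsp = min(len(p) for p in patterns)          # length of shortest pattern (LSP)
    shift = {}                                   # 2-byte block -> safe shift distance
    buckets = {}                                 # block at the window's right edge -> candidates
    for p in patterns:
        for i in range(lsp - 1):                 # every 2-byte block inside the first LSP bytes
            block = p[i:i + 2]
            d = lsp - i - 2                      # distance from this block to the right edge
            shift[block] = min(shift.get(block, lsp - 1), d)
        buckets.setdefault(p[lsp - 2:lsp], []).append(p)
    return lsp, shift, buckets

def search(text, patterns):
    lsp, shift, buckets = build_tables(patterns)
    matches, pos = [], lsp - 2                   # pos indexes the block at the window's right edge
    while pos <= len(text) - 2:
        block = text[pos:pos + 2]
        d = shift.get(block, lsp - 1)            # unseen block: maximal safe shift of LSP - 1
        if d == 0:
            start = pos - (lsp - 2)              # left edge of the LSP-byte window
            for p in buckets.get(block, []):     # verify candidates character by character
                if text.startswith(p, start):
                    matches.append((start, p))
            d = 1
        pos += d                                 # note: the shift never exceeds LSP - 1
    return matches
```

Note how the maximal shift is LSP − 1 regardless of how long the individual patterns are, which is exactly the LSP dependence discussed above.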
3. Design and Implementation

This work addresses the string matching problem formally before introducing the proposed FNP2 algorithm. Given an input text T = t_0 t_1 … t_n and a finite set of patterns P = {P_1, P_2, …, P_r}, the string matching problem involves locating and identifying every substring of T that is identical to some P_j = a^j_0 a^j_1 … a^j_{m-1}, 1 ≤ j ≤ r; that is, every position s such that t_s … t_{s+m-1} = a^j_0 … a^j_{m-1}.

FNP2 is an MWM-like algorithm based on the following simple reasoning. For an arbitrary pattern P_j = a^j_0 a^j_1 … a^j_{m-1}, if the w sequential bytes of T starting at location s satisfy

    t_s … t_{s+w-1} ≠ a^j_i … a^j_{i+w-1}   for every i = 0, 1, …, m − w,

then P_j does not contain t_s … t_{s+w-1}, and these w bytes can be skipped safely during searching. Conversely, if t_s … t_{s+w-1} is identical to a^j_i … a^j_{i+w-1} for some 0 ≤ i ≤ m − w, the algorithm further verifies whether the last w bytes of P_j (a^j_{m-w} … a^j_{m-1}) are identical to t_{s+m-i-w} … t_{s+m-i-1}. Once the w-byte suffix of P_j is matched at some position in T, an exact match is performed. To clarify this point, this study uses a Prefix Sliding Window (denoted PSW) of length w that shifts from the leftmost byte toward the rightmost byte of T. Every time the PSW shifts, the algorithm determines whether S, the w sequential bytes covered by the PSW, equals a^j_i … a^j_{i+w-1} of some pattern P_j, where 0 ≤ i ≤ m − w.

The following details the design of the FNP2 algorithm. The off-line pre-processing constructs the necessary rule tables and lookup tables, while the runtime processing scans the payload and identifies matches. For simplicity, this work assumes w = 3, which yields good performance.

3.1. Off-line Pre-Processing

This stage constructs the Skip Distance Table (SDT), the Rule Hashing Table (RHT), and the Rule Status Table (RST) during initialization of the engine. The SDT determines how many sequential bytes can be skipped safely in the searching phase. During initialization, all SDT entries are set to LSP. Every entry whose last (rightmost) 8 bits of address are identical to a one-byte prefix of any pattern is set to LSP − 1, and every entry whose last (rightmost) 16 bits of address are identical to a two-byte prefix of any pattern is set to LSP − 2. Every entry whose address is identical to a^j_{LSP-3-i} … a^j_{LSP-1-i} is set to i, for 1 ≤ j ≤ r and LSP − 3 ≥ i ≥ 0.

Figure 1. An example of the construction of the SDT

Figure 1 demonstrates an example of the construction of the SDT. The LSP in this example is 6, and the maximal skip distance is also 6. The skip distance of other NIDS pattern matching algorithms is bounded by LSP, while the minimal skip distance of FNP2 is 3. The next step is to insert rules into the RHT, which stores the pattern rules. The first three bytes of every content string are used to derive the hashing value for that string. In the event of a collision, a linked list preserves the colliding entries. Other fields, such as pattern length, content string, and rule id, are also filled into the RHT. Multiple-content rules (which match if and only if all of their contents are matched) are split so that every content string has its own entry. The MATCH flag of every RST entry is set to false, and entries corresponding to single-content rules are marked as HEADs. For multiple-content rules, only the longest content entry is marked as a HEAD. When several entries belong to the same multiple-content rule, the HEAD entry contains pointers to the other entries. During comparison, if a pattern is found in T, the MATCH flag of the corresponding entry is set to true so that comparing this entry again is unnecessary. Once T has been scanned entirely, the hash co-processor is used again to search for HEAD entries whose MATCH flag is set, so the CPU need not perform a linear search to identify the highest-priority matched entry. A multiple-content rule is matched if every one of its contents matches T. Figure 2 illustrates an example of inserting a multiple-content rule into the RST and RHT.
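The SDT construction rules above (w = 3) can be sketched as follows; a Python dict stands in for the 2^24-entry hardware table, and all names are illustrative rather than taken from the paper's micro-code.

```python
# Sketch of the SDT construction for w = 3: interior 3-byte blocks of the
# LSP-byte prefix get their exact skip distance, and windows whose suffix
# matches a 1- or 2-byte pattern prefix get LSP - 1 or LSP - 2 respectively.

def build_sdt(patterns):
    lsp = min(len(p) for p in patterns)          # length of shortest pattern
    sdt = {}                                     # 3-byte block -> skip distance

    for p in patterns:
        for i in range(lsp - 2):                 # i = 0 .. LSP - 3
            key = p[lsp - 3 - i:lsp - i]         # block ending i bytes before the prefix edge
            sdt[key] = min(sdt.get(key, lsp), i)

    def lookup(window):                          # window: the 3 bytes under the PSW
        if window in sdt:
            return sdt[window]
        for p in patterns:                       # last 2 bytes match a 2-byte prefix
            if window[1:] == p[:2]:
                return lsp - 2
        for p in patterns:                       # last byte matches a 1-byte prefix
            if window[2:] == p[:1]:
                return lsp - 1
        return lsp                               # no overlap: skip a full LSP bytes
    return lookup
```

With the single pattern "abcdef" (LSP = 6), this reproduces the Figure 1 behavior: "def" skips 0, "abc" skips 3, a window ending in "ab" skips 4, and an unrelated window skips the full 6 bytes.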
3.2. Runtime Processing

The matching procedure of the proposed FNP2 algorithm is quite simple. Initially, the PSW is aligned with the first byte of the incoming payload. The string S within the PSW (t_0…t_2) is fetched, and its skip distance N is looked up in the SDT. If N is nonzero, the PSW is shifted right by N bytes in the next round. If N is zero, the 3-sequential-byte key is passed to the hashing lookup engine for further searching. A key that must be searched in the RHT is handed to the lookup co-processor, after which the thread context-switches. When the thread wakes up again, the lookup co-processor either returns the address of the matching entry or sets a bit indicating that the lookup failed. If the lookup fails, the PSW is shifted one byte to the right and scanning continues. If the lookup succeeds, the MATCH flag of the entry in the RST is checked first, avoiding wasted time rechecking already-matched entries. If the entry has not yet been matched, an exact match is conducted between the payload and the remaining content. If the remaining content matches the payload, the MATCH flag of the corresponding RST entry is set to TRUE to indicate a match; if not, the following entries in the collision chain are searched with the lookup co-processor until all of them have been checked. After the entire payload has been processed, the matching entry with the highest priority is selected from the RST; if no matching entry exists, the payload is clean. Notably, the search of the RST can also be performed by the lookup co-processor, so the CPU need not traverse the whole table to select the highest-priority entry.
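The runtime loop above can be sketched in Python as follows. For clarity this sketch collapses the general SDT to the LSP = 3 case (a skip of zero means the PSW covers the 3-byte prefix of some pattern) and uses a conservative default skip of 1 in place of the full SDT; the hash co-processor and the RST MATCH flags are simulated with a dict and a set, and all names are illustrative.

```python
# Sketch of the FNP2 runtime loop: slide the PSW, skip by the SDT distance,
# and on a skip of zero walk the RHT collision chain to verify candidates.
# Assumes every pattern is at least 3 bytes long (w = 3).

def fnp2_scan(payload, patterns):
    sdt = {}                                     # 3-byte window -> skip distance
    rht = {}                                     # 3-byte key -> collision chain of candidates
    for p in patterns:
        sdt[p[:3]] = 0                           # pattern prefixes force a lookup
        rht.setdefault(p[:3], []).append(p)

    matched, pos = set(), 0                      # RST stand-in: set of matched patterns
    while pos + 3 <= len(payload):
        s = payload[pos:pos + 3]                 # bytes covered by the PSW
        n = sdt.get(s, 1)                        # conservative default skip of 1
        if n == 0:
            for p in rht[s]:                     # walk the collision chain
                if p not in matched and payload.startswith(p, pos):
                    matched.add(p)               # set the MATCH flag
            n = 1                                # shift one byte and continue scanning
        pos += n
    return matched
```

The `p not in matched` test mirrors the RST MATCH-flag check in the text: an entry that has already matched is never verified a second time.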
4. Experiments
Figure 2. An Example of the RST and RHT
To verify the effectiveness of the proposed FNP2 algorithm, its performance was evaluated against the previously mentioned SBMH, AC, MWM, and E2xB algorithms. Because of the difficulty of implementing all five algorithms in Network Processor micro-code, the latter four were implemented on general PCs to simulate the network-processor environment, while FNP2 was implemented on the Vitesse IQ2000 [16] Network Processor. The current Snort ruleset [14], containing 1,942 rules with 2,475 patterns, was employed as the default search pattern set. Full-packet traces were obtained from the "Capture the Capture The Flag" (CCTF) project held annually at DEFCON [5]; the DEFCON9 packet traces used in the present experiments were the most up-to-date available.
[Figure 3: four panels (LSP = 1, 2, 3, 4) comparing the AC, SBMH, MWM, and FNP2 algorithms across search-set sizes from 32 to 2048]
4.1. Evaluation of the Number of Memory Accesses
Figure 3. Number of memory accesses during pattern matching processing

As previously described, the proposed FNP2 algorithm requires fewer memory accesses, so better performance can be achieved. The five algorithms were evaluated with different search-set sizes and LSPs by counting the number of memory accesses. The 900MB packet trace defcon_eth0.dump2 [5] was employed to generate realistic test traffic; it was selected for its low compression rate compared to other packet traces and for its considerably more complicated content, which increases test fairness. Figure 3 illustrates the results for four of the algorithms across different search-set sizes and LSPs. The case of the MWM algorithm with LSP = 1 was not assessed because the MWM algorithm does not support this situation. The results for E2xB are not listed in Figure 3 because of the large gap between them and the other four: the number of memory accesses in E2xB's simulation ranges from 1,112M to 23,354M. Clearly, E2xB does not work well when the rule-set size is large and cache memory is unavailable, as on a Network Processor platform. Figure 3 shows that the FNP2 algorithm clearly outperformed the other algorithms in this respect. Notably, two major factors affect the performance of multi-pattern matching algorithms in a NIDS: the LSP value and the pattern ruleset size. Interestingly, previous works focused only on the latter factor [2, 4, 7, 11], while neglecting the former. Figure 3 reveals that search-set size does not influence the number of memory accesses required for the MWM algorithm to complete the multi-pattern matching, but for LSP = 2, 3, 4 the required number of memory accesses is approximately 1800M, 950M, and 800M, respectively. The SBMH algorithm displays the same phenomenon, which indicates that the LSP value is indeed a major influence on the performance of multi-pattern matching algorithms.
4.2. Implementing FNP2 on the Vitesse IQ2000 Platform

Figure 4. The performance of the FNP2 algorithm with different packet lengths

To further evaluate the practical performance of the FNP2 algorithm, this work implements it on the Vitesse IQ2000 Network Processor platform. The IQ2000 has four 200 MHz RISC Packet Processing Engines (PPEs), each containing five sets of 32-bit registers and allowing up to four separate contexts to be active simultaneously. Each PPE also contains a lookup co-processor used to search for a given key in a specified linked list; this facility can search both the RST and the RHT. Each PPE contains 2KB of internal memory, with 512 bytes assigned to each context. The system also has 512MB of Direct Rambus DRAM (RDRAM) as main memory. To write efficient micro-code, the IQ2000 technical documents suggest reducing direct RDRAM accesses and moving data into the internal memory instead. In the present case, since tables such as the SDT are too large to fit into the 2KB local memory, manipulating the packet payloads is the only way to reduce the number of direct RDRAM accesses. To confirm the impact of the number of memory accesses on performance, this work implements the FNP2 algorithm in two ways. The first method (Exp1) accesses packet payloads from RDRAM directly, eight bytes at a time, fetching the next eight bytes whenever the PSW moves beyond the boundary of the current 8-byte payload. The second method (Exp2) first fetches the payload via DMA into internal memory, 384 bytes at a time, then fetches the next 384 bytes through DMA when the PSW exceeds the 256-byte boundary. The SmartBits 6000B [15] and SmartApplication [15] are employed to generate UDP traffic at Gigabit rate. Figure 4 illustrates the throughput of both experiments; LSP is set to 4 in both. Notably, the performance
measurement results presented here are inline forwarding rates, not passive processing rates. The SmartApplication generates UDP packets of different sizes, and the performance of the FNP2 program is better for small packets than for large packets. This appears related to the fact that the program ignores the header parts (MAC, IP, and UDP headers) of a packet, and the payload makes up a smaller proportion of a small packet than of a large one. Figure 4 shows that the program in Exp2 is more efficient than that in Exp1. The only difference between the two programs is the method of moving packet payloads, so the results demonstrate that reducing the number of memory accesses during processing significantly improves program performance. Notably, program performance also depends on hardware capacity. We believe that performance can be improved markedly by using higher-end Network Processor Units, such as the Vitesse IQ2200 [16] (with four 400 MHz PPEs), the Intel IXP2400 [10] (with eight 600 MHz PPEs), or even the Intel IXP2800 [10] (with sixteen 1.4 GHz PPEs).
5. Conclusions

This work examined the importance of the pattern matching algorithm for NIDS, and designed and implemented a fast and efficient algorithm named FNP2 for network processor platforms. FNP2 exploits the characteristics of NIDS rulesets and the hardware facilities of the Network Processor to maximize performance. Owing to the difficulty of simultaneously implementing the other multi-pattern matching algorithms (AC, SBMH, E2xB, and MWM) in micro-code, only the FNP2 algorithm was implemented on the Vitesse IQ2000 Network Processor platform to evaluate the relation between performance and the number of memory accesses during multi-pattern matching. The experimental results reveal that FNP2 outperforms the other algorithms in this respect. Network Processors are known to be powerful at handling L3/L4 traffic, and this design exploits that capability to process L7 payloads efficiently. Moreover, the search algorithm benefits most when the LSP is small, which is the common case in NIDS applications. Generally, a NIDS detection engine conducts flow classification, header-field comparison, and multi-pattern matching. Although multi-pattern matching is the most time-consuming task, a fast packet processing flow is desirable for integrated handling of these issues, and using the facilities provided by the Network Processor may be a good solution. This direction is left for future work.
6. References

[1] A. Aho and M. Corasick, "Efficient string matching: An aid to bibliographic search," Communications of the ACM, vol. 18, no. 6, June 1975, pp. 333-343.
[2] K. G. Anagnostakis, E. P. Markatos, S. Antonatos, and M. Polychronakis, "E2xB: A domain-specific string matching algorithm for intrusion detection," Proceedings of the 18th IFIP International Information Security Conference (SEC2003), May 2003.
[3] R. S. Boyer and J. S. Moore, "A fast string searching algorithm," Communications of the ACM, vol. 20, no. 10, Oct. 1977, pp. 762-772.
[4] C. Jason Coit, Stuart Staniford, and Joseph McAlerney, "Towards faster pattern matching for intrusion detection or exceeding the speed of snort," Proceedings of the 2nd DARPA Information Survivability Conference and Exposition (DISCEX II), June 2001.
[5] DEFCON. http://www.shmoo.com/cctf/
[6] Neil Desai, "Increasing Performance in High Speed NIDS."
[7] M. Fisk and G. Varghese, "An analysis of fast string matching applied to content-based forwarding and intrusion detection," Technical Report CS2001-0670 (updated version), University of California, San Diego, 2002.
[8] R. Nigel Horspool, "Practical fast searching in strings," Software Practice and Experience, vol. 10, no. 6, 1980, pp. 501-506.
[9] IBM, "The Network Processor: Enabling Technology for High-Performance Networking." Available from http://www.npforum.org/pressroom/whitepapers.shtml
[10] "Intel(R) Network Processor," http://www.intel.com/
[11] E. P. Markatos, S. Antonatos, M. Polychronakis, and K. G. Anagnostakis, "ExB: Exclusion-based signature matching for intrusion detection," Proceedings of the IASTED International Conference on Communications and Computer Networks (CCN), Cambridge, USA, November 2002, pp. 146-152.
[12] Network ICE, "Protocol Analysis vs Pattern Matching in Network and Host Intrusion Detection Systems."
[13] P. Paulin, F. Karim, and P. Bromley, "Network Processors: A Perspective on Market Requirements, Processor Architectures and Embedded S/W Tools," Proceedings of DATE 2001 on Design, Automation and Test in Europe, IEEE Press, 2001, pp. 420-429.
[14] "Snort.org," http://www.snort.org/
[15] "Spirent.com," http://www.spirentcom.com/
[16] "Vitesse.com," http://www.vitesse.com/
[17] Sun Wu and Udi Manber, "A fast algorithm for multi-pattern searching," Tech. Rep. TR94-17, Department of Computer Science, University of Arizona, May 1994.