ANCS 2009
Evaluating Regular Expression Matching Engines on Network and General Purpose Processors

Michela Becchi, Charlie Wiseman, Patrick Crowley
Washington University Computer Science and Engineering, St. Louis, MO 63130-4899
[email protected], [email protected], [email protected]
ABSTRACT
In recent years we have witnessed a proliferation of data structure and algorithm proposals for efficient deep packet inspection on memory-based architectures. In parallel, we have observed an increasing interest in network processors as target architectures for high-performance networking applications. In this paper we explore design alternatives in the implementation of regular expression matching architectures on network processors (NPs) and general purpose processors (GPPs). Specifically, we present a performance evaluation on an Intel IXP2800 NP, on an Intel Xeon GPP and on a multiprocessor system consisting of four AMD Opteron 850 cores. Our study shows how to exploit the Intel IXP2800 architectural features in order to maximize system throughput, identifies and evaluates algorithmic and architectural trade-offs and limitations, and highlights how the presence of caches affects overall performance. We provide an implementation of our NP designs within the Open Network Laboratory (http://www.onl.wustl.edu).

1. INTRODUCTION
Deep packet inspection based on regular expression evaluation is a critical mechanism in modern network security. While deep packet inspection will never be a comprehensive security solution, it represents the standard technique for detecting malicious patterns in network traffic. As a result of its importance, researchers have been active in proposing improved algorithms, data structures, and architectures for high-speed implementations.

This flurry of activity has induced two problems. First, much of the work has developed in a disjointed fashion, resulting in a disconnect between the high-level goals and characteristics of the research proposals and the practical realities of regular expression rule-sets and implementation technologies. In order to reduce these ideas to practice in real systems, it is necessary to understand how the techniques for building efficient regular expression representations relate to one another and what implications they have on a given implementation technology. Second, in a very short time period, competing proposals have become increasingly sophisticated and complex, making both objective comparisons and independent verification of results difficult. Consequently, the practical value of these recent proposals is unclear.

In this paper, we address this situation directly. In particular, we show how three significant regular expression representations can be efficiently realized on a fully operational high-speed networking platform built around Intel IXP2800 NPs [22][23]. The Intel IXP2800 NP is a highly integrated multi-core processor designed specifically for networking tasks. One characteristic of the IXP memory hierarchy is the lack of caches. For comparison, we evaluate the same regular expression representations on two GPPs (provided with a standard two-level cache hierarchy): an Intel Xeon and a multiprocessor system consisting of four AMD Opteron 850 cores.

Representations for high-speed regular expression evaluation are traditionally based either on deterministic or nondeterministic finite automata (DFAs and NFAs, respectively) [1]. In this work, we consider DFAs, NFAs and hybrid-FAs [9], a data structure proposed in previous work that brings together the strengths of DFAs and NFAs while avoiding their weaknesses. Our experiments are performed on a set of 534 complex regular expressions drawn from the Snort network intrusion detection system [2][3].

Our evaluation shows the following results. On a moderately provisioned NP-based system, the sustainable performance varies from 20.2 Mbps to 90.5 Mbps depending on the underlying automata representation and on the fraction of malicious content in the packet stream (which we vary between 35% and 95%). Similarly, the throughput varies from 2.4 Mbps to 192.7 Mbps on the Intel Xeon, and from 15.6 Mbps to 534 Mbps on a 4-way AMD Opteron 850. Finally, our hybrid-FA-based engine outperforms the pcregrep [4] engine used by Snort [2][3] by a factor of 50-180X.

The contributions of this paper can be summarized as follows. To the best of our knowledge, this is the first work to report experimental measurements of high-speed regular expression representations on NPs and GPPs using rule sets of realistic size and complexity. Additionally, we describe and evaluate how DFAs, NFAs and hybrid-FAs can be reduced to practice on the Intel IXP2800 NP. Third, we make available to the research community everything needed to reproduce these experiments and extend them, which includes our source code and access to our experimental infrastructure for high-speed networking research based on Washington University's Open Network Laboratory [18][21].

The remainder of this paper is organized as follows. In Section 2, we provide some background on the construction of efficient finite automata. In Section 3, we discuss our hardware setup. In Section 4, we present our tool-chain. In Section 5, we present our concrete implementations. In Section 6, we report our experimental evaluation. In Section 7, we briefly discuss further opportunities for optimizing these implementations. The paper concludes with Section 8.
2. BACKGROUND ON REGULAR EXPRESSIONS
High-speed regular expression matching systems are based on either nondeterministic or deterministic finite automata (NFAs and DFAs). The theory for both constructing an NFA for a given regular expression set and converting an NFA into its DFA counterpart is well known [1]. However, in some cases, NFA-DFA conversion can lead to state explosion, leaving solutions based on a single DFA impractical [7][9]. In this section, we provide background on the use of finite automata to perform regular expression matching.

2.1 Using NFAs and DFAs
In Figure 1, we represent the NFA and DFA accepting regular expressions a+bc, bcd+ and cde, constructed in the standard way [1]. The NFA is built using the reduction algorithm presented in [17]. To evaluate a representation, whether an NFA, DFA, or other data structure, we must consider two metrics: the amount of memory needed to store it and the amount of memory bandwidth needed to operate it.

In an NFA representation, each state presents only transitions corresponding to progress in the match operation. Therefore, NFA size depends only on the number of characters in the pattern set. This is true even if some regular expressions contain simple and repeated character ranges. On the other hand, a DFA state must have one and only one outgoing transition for each character of the alphabet. Moreover, as will be detailed later, the number of states in a DFA can exceed that of the corresponding NFA. Thus, the memory storage requirement of a DFA representation is in general higher than that of the nondeterministic counterpart.

To find the operating memory bandwidth, we must understand how the NFA and DFA traversal works. The pattern matching operation starts in both cases from the entry state 0, as shown in Figure 1. A match is reported every time an accepting state (double-circled and gray) is traversed. The characters in the input text are processed in sequence, and all the outgoing transitions from the active state labeled with the current input character are taken. In the NFA case, since each state can have more than one transition on any given character, many states can be active in parallel. We will call these states the active set. Since every state traversal implies one or more memory operations, the size of the active set gives a measure of the memory bandwidth requirement and, in case of sequential memory accesses, of the processing time.

Figure 1: (a) NFA and (b) DFA accepting regular expressions: (1) a+bc, (2) bcd+ and (3) cde. In the DFA, omitted transitions lead to state 0.

As an example, let us process the input text abcda. The NFA traversal will involve the following active set sequence (states 3 and 6 are accepting):

  (0) -a-> (0,1) -b-> (0,2,4) -c-> (0,3,5,7) -d-> (0,6,8) -a-> (0,1)

In this case, the maximum active set size is 4. The worst case traversal often reported in the literature corresponds to an active set including all the states in the NFA. This worst case is in practice never achieved. As an example, state 1 can never be active together with any of states 2, 3, 4, 5, 6, 7, 8 and 9, as it is entered upon a different input character. However, the NFA memory bandwidth requirement is in general data-dependent and, therefore, nondeterministic.

In the DFA case, since each state has one and only one state transition on any given symbol, processing an input character will involve a single state traversal. As an example, processing the input text abcda on the DFA in Figure 1 will lead to the following activation sequence:

  (0) -a-> (1) -b-> (2) -c-> (3) -d-> (6) -a-> (1)

Thus, DFAs offer the advantage of a limited and deterministic memory bandwidth requirement, and are therefore preferred to NFAs when memory-centric architectures are targeted.

2.2 Compressing DFAs
A substantial body of research work has focused on compression techniques aimed at reducing the amount of memory needed to represent DFAs. In [5] it is observed that DFAs from practical data-sets exhibit a conspicuous state transition redundancy, which can be exploited to achieve space savings in excess of 90%. Namely, transitions common to two states sx and sy can be eliminated from one of them, say sy, by introducing a non-consuming transition from sy to sx, called a default transition. The traversal of a compressed DFA is performed as follows. When processing input character c in state sy, first it is established whether a regular (labeled) transition on c exists. If so, it is taken and c is consumed. Otherwise, the default transition is followed and the matching operation on c is repeated on the default transition target state. Clearly, this compression technique is based on a trade-off between memory storage and memory bandwidth requirements.

In this paper we make use of default transition compression to encode DFAs. In particular, we compute default transitions through the algorithm proposed in [6]. This algorithm has several attractive properties, the most important being a worst case bound guarantee of 2N state traversals to process an input string of length N. Additionally, we make use of the alphabet compression algorithm described in [6] for DFAs and extended in [17] to NFAs. In the implementation proposed in this paper we do not consider other techniques proposed in the literature (bit-split DFAs [12] and hashing schemes [13]). In fact, as detailed in [19], these proposals are applicable only to very restricted classes of regular expressions.
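To make the default-transition traversal concrete, the following minimal sketch (our own illustration, not the paper's actual data layout; all type and helper names are hypothetical) consumes one input character on a default-transition-compressed DFA:

    #include <stdint.h>

    typedef struct {
        int      num_labeled;     /* number of labeled (consuming) transitions */
        uint8_t *labels;          /* symbols with an explicit transition       */
        int     *targets;         /* target state for each labeled symbol      */
        int      default_target;  /* non-consuming default transition          */
    } comp_state_t;

    /* Return the next state after consuming symbol c starting from state cur.
     * Default transitions are followed, without consuming c, until a labeled
     * transition on c is found.  Termination is guaranteed because the root
     * of every default-transition chain is assumed to be a full state that
     * defines a transition for every symbol; with the algorithm of [6] the
     * traversal performs at most 2N state visits over an input of length N. */
    int compressed_dfa_next(const comp_state_t *dfa, int cur, uint8_t c)
    {
        for (;;) {
            const comp_state_t *s = &dfa[cur];
            for (int i = 0; i < s->num_labeled; i++)
                if (s->labels[i] == c)
                    return s->targets[i];   /* labeled transition: c consumed */
            cur = s->default_target;        /* default transition: retry c    */
        }
    }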
2.3 Dealing with State Blowup
The generation of a single DFA accepting a given pattern-set is not always feasible. In fact, exponential state blow-up can take place when performing NFA-to-DFA conversion. As pointed out in related work [7][9][10], problematic regular expressions present in practical data-sets typically contain constrained and unconstrained repetitions of wildcards and large character ranges.
Table 1: Number of DFAs, size (in terms of number of states) and memory footprint of DFAs obtained with various TDFA settings on a set of 534 patterns from Snort NIDS.

  TDFA  | # DFA | max size | avg size | total size | memory footprint
  10K   |  17   |   8,582  |   3,147  |   53,502   |   802KB
  20K   |  15   |  12,454  |   4,114  |   61,705   | 1,147KB
  50K   |  13   |  12,454  |   4,426  |   48,687   | 1,480KB
  100K  |  11   |  93,834  |  14,199  |  127,794   | 4,285KB
  200K  |  10   | 136,034  |  32,748  |  261,980   | 9,155KB

Figure 2: Simplification of the rule-partitioning algorithm (the sorted rule set REi..REj undergoes NFA-DFA conversion; if the resulting DFA has fewer than TDFA states it is kept as DFAi,j, otherwise the set is split in half and the process is repeated recursively on each half).
2.3.1 Multiple DFAs
A way to address the problem is to partition the set of regular expressions into multiple smaller pattern sets, and compute a different DFA for each of them. Notice that, if k DFAs are generated, the memory bandwidth requirement is increased by a factor k (for every input character, k state traversals must be performed). Rule partitioning can be addressed in different ways. In [7], F. Yu et al. proposed incremental pattern clustering. Specifically, two regular expressions REx and REy are combined into a set if the size of the combined DFAxy does not exceed the sum of the sizes of DFAx and DFAy. The operation is repeated recursively. Unfortunately, this technique may lead to an excessive memory bandwidth requirement. On a memory-centric architecture with sufficient storage provisioning some degree of state blow-up can be tolerated, especially if it is paid for by a decreased memory bandwidth requirement.

In our work we adopt a different approach. We set a threshold TDFA on the maximum size we allow for a DFA. We first try to compile together all regular expressions under consideration. If the size of the resulting DFA exceeds TDFA, we split the set into two and repeat the operation recursively, as sketched below. This approach is exemplified in Figure 2. Note that there exists a trade-off between memory size and memory bandwidth. This fact is highlighted by the data reported in Table 1, where we show the results of processing the considered 534 regular expressions from the Snort NIDS with different TDFA settings.
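A compact rendering of this recursive split (our own sketch; build_dfa(), dfa_size() and the other helper names are hypothetical stand-ins for the regex processor's internals):

    /* Partition rules[lo..hi] so that every resulting DFA stays below t_dfa
     * states.  If even a single expression exceeds the threshold, its DFA is
     * kept anyway to guarantee termination. */
    void partition_rules(regex_t *rules, int lo, int hi, int t_dfa, dfa_list_t *out)
    {
        dfa_t *dfa = build_dfa(rules, lo, hi);       /* NFA-to-DFA conversion    */
        if (dfa_size(dfa) <= t_dfa || lo == hi) {
            dfa_list_add(out, dfa);                  /* small enough: keep it    */
            return;
        }
        dfa_free(dfa);                               /* too large: split the set */
        int mid = lo + (hi - lo) / 2;
        partition_rules(rules, lo, mid, t_dfa, out);
        partition_rules(rules, mid + 1, hi, t_dfa, out);
    }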
2.3.2 Hybrid-FAs
Hybrid-FAs, an alternative approach proposed in [9], provide a compromise between NFAs and DFAs. The basic idea is the following. NFA-to-DFA conversion is interrupted on states where DFA blowup would originate. This leads to a hybrid automaton with a head-DFA and multiple tail-NFAs. Each tail-NFA can in turn be converted to a DFA. The structure of the resulting automaton is exemplified in Figure 3. Note that, during processing, the head-DFA is always active. Conversely, each tail-DFA is activated only upon border state traversal. The work proposed in [9] provides a more detailed analysis of how to perform the construction so as to allow at most one activation of each tail-DFA state. This approach presents multiple advantages. First, in the average case the traversal will be confined to the head-DFA, leading to one state access per input character. Second, the worst case is bound by the number of tail-DFAs in the hybrid-FA. Third, the compression algorithms proposed for DFAs can be utilized.

Figure 3: Hybrid-FA simplification (a head-DFA connected to tail-DFA1 .. tail-DFAk).

2.3.3 Other techniques
Recently two additional techniques have been proposed to address state explosion: namely History-based FAs [10] and XFAs [11]. These proposals are similar in spirit: state blowup is prevented by adding some state information in scratch memory in the form of bitmaps and counters. These variables keep track of the traversal history. However, these proposals have some constraints which limit their applicability. History-based FAs have two limitations. First, the entries of the history information must be accessed (and sometimes updated) in parallel upon each state traversal. This slows down the average case. Second, for large and complex data-sets, the size of the additional data structure and the number of outgoing transitions per state can increase considerably. XFAs have been described for regular expressions consisting of non-overlapping sub-patterns separated by dot-star terms. This is not the case for many regular expressions considered in this work, for which either the first assumption is violated, or the separating term is a large character range repetition.

3. HARDWARE SETUP
As mentioned, we chose to evaluate the different regular expression matching designs on the Intel IXP 2800 NP and on two GPPs, an Intel Xeon and a 4-way AMD Opteron 850. In our IXP experiments we took advantage of the Open Network Laboratory (ONL) [18][21], a networking test-bed which provides access to programmable routers built on the Intel IXP 2800. This allowed us to perform real network experiments without implementing a fully functional device from scratch. As a drawback, only restricted resources of the IXP were available to our application. In our GPP experiments we did not run a full networking pipeline. Instead, we evaluated only the regular expression matching engine, which we implemented as a server program receiving input data via TCP sockets. On the uniprocessor system, different connections are handled sequentially. On the multiprocessor system, multiple connections can be handled in parallel by spawning a new thread on a different core. However, flow parallelism is limited to the number of available cores. Table 2 summarizes the hardware resources (processors, caches and memories) available to the regex application code in the different systems. In the remainder of this section we provide some background on both the IXP and ONL.

3.1 IXP 2800 and ONL
Like most network processors, the IXP 2800 is designed for rapid development of high performance networking applications. To enable the desired high performance there are 16 multi-threaded
Micro-Engine (ME) cores which handle the majority of the packet processing tasks. These cores typically run at 1.4GHz. Each ME supports eight thread contexts via eight distinct sets of registers, although only one thread is actually executing at any given time. Context switching between the threads takes only 2-3 cycles, and there is no preemption, i.e. a cooperative threading model must be used. Primarily, these threads are how IXP programs overcome the memory latency gap. In most IXP applications, threads operate in round-robin ordering, context switching whenever a memory operation is issued. There are no caches (data or instruction) in the IXP, as networking applications rarely have locality-of-reference that can be exploited. The MEs also have other features such as local Content Addressable Memories and hardware-supported FIFOs connecting all 16 MEs together in a pipeline, but our implementations do not currently use them.

The IXP 2800 has 3 DRAM channels and 4 SRAM channels, as well as a shared on-chip scratchpad memory. Each ME has a small segment of local data memory along with an 8K instruction program store. DRAM is primarily used for packet buffers, while SRAM is used for packet meta-data. The SRAM controller also provides hardware support for using SRAM blocks as a simple FIFO for inter-ME communication. Each IXP also has an ARM XScale Management core, which is provided with direct access to the entire NP, including MEs and memory. The XScale is responsible for starting and stopping the MEs, loading code into the ME program stores, and any other ME management tasks. It also handles any packets or tasks which are not implemented by the data path. In a standard router implementation, this likely includes handling of control messages such as ICMP packets or routing updates.

ONL is a network test-bed which allows users to remotely access network resources for running experiments on actual hardware, rather than simulation. Users interact with ONL through the Remote Laboratory Interface (RLI), a graphical user interface to build network topologies and allocate resources. Users are given full access to their allocated resources, and those resources are not shared with other users. The experiments in this paper use two types of resources: end hosts and IXP-based programmable routers. The hosts run a standard Linux operating system, and serve primarily as traffic sources and sinks. Each IXP-based router [18] runs on a single IXP 2800 and has five 1-Gbps Ethernet interfaces. IP routes are supported via a Ternary Content-Addressable Memory and each interface has 8K outgoing queues for quality of service and traffic engineering needs. The TCAM also supports more general filters for packet classification. Most importantly, five of the MEs are set aside for user-written plug-ins. Any code can be loaded onto those MEs in order to expand the functionality of the router. In our case, we will use the five MEs to run the regular expression matching code.

Most of the memory is already in use by the rest of the router, but the plug-in MEs have access to 1KB of the shared scratchpad, 5MB of SRAM, and 128MB of DRAM. The RLI is used to install plug-ins on the routers and then packet filters are set up to direct packets to all plug-ins. Each plug-in ME has a dedicated SRAM FIFO that matching packets are put into by the main router pipeline. It is up to the plug-in code to pull the packet out of that FIFO and do whatever processing is desired before forwarding the packet back to the main pipeline. Our plug-ins take advantage of one of the packet filter options which sends a copy of the packet to the plug-in and a copy down the rest of the pipeline. This mechanism allows plug-ins to monitor packet streams without adding delay to the actual packet flow.

Table 2: Hardware configuration of systems in consideration. We report size/cache line/set associativity for caches, and size/latency for memories.

  System    | CPUs                    | I1 cache     | D1 cache     | L2 cache      | scratchpad | SRAM        | DRAM/sys
  NP        | 5 MEs @ 1.4 GHz         | -            | -            | -             | 1KB/60 clk | 5MB/150 clk | 128MB/300 clk
  1-way GPP | 1 Intel Xeon @ 2.4 GHz  | 16KB/32B/8-w | 8KB/64B/4-w  | 512KB/64B/8-w | -          | -           | 3GB
  4-way GPP | 4 AMD Opteron @ 2.4 GHz | 64KB/64B/2-w | 64KB/64B/2-w | 1MB/64B/8-w   | -          | -           | 16GB

Figure 4: Regular expression processing infrastructure (regex processor, followed by the memory layout generator, feeding either the ONL regex-plugin for the IXP NP or the regex processor code for the GPPs).
4. PROPOSED INFRASTRUCTURE
One of the objectives of this work is to provide to the scientific community a tool-chain to evaluate regular expression matching alternatives on a network processor based workbench or on a general purpose processor. We want the design to be flexible enough to support data-sets of varying size and complexity. Additionally, we aim to minimize the need for user configuration. Our approach is to decompose the tool into three components: (i) a regex processor, (ii) a memory layout generator and (iii) the regex application code, to run either on the IXP NP (in which case we provide an ONL plug-in) or on a GPP. As exemplified in Figure 4, these components act sequentially.

The regular expression processor generates the finite automaton accepting a given pattern set. Its input consists of the regular expression set and the desired representation (either DFA with a specified threshold TDFA, or NFA, or hybrid-FA). Its outputs are: (1) the corresponding compressed automaton and (2) the alphabet translation table (as mentioned, alphabet reduction [6] is performed in all cases). The implementation uses the regex-tool that was made available by M. Becchi at [20].

The memory layout generator produces the maps used to populate the NP/GPP memories for proper operation of the application code. The generated memory configuration consists of: (i) the FA structure, in terms of states and transitions, and (ii) automaton-specific parameters, such as the base address of each DFA, the number of tail-DFAs in the hybrid-FA, and so on. The input to this block is the output of the previous one. In the IXP case, the information about the base address and size of each available memory (namely, scratchpad, SRAM and DRAM) is also required.

Finally, the regex application code is the main code that runs on the NP/GPP and performs regular expression matching. It is programmed through the memory maps produced by the previous block. We provide three different implementations of the application: one based on DFAs, one based on NFAs and one based on hybrid-FAs. Specifics about the implementation of this block are provided in Section 5.
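As an illustration of the kind of automaton-specific parameters the memory layout generator emits, the descriptor below is a hypothetical sketch (the field names and the MAX_FA bound are ours, not the tool's actual output format):

    #include <stdint.h>

    #define MAX_FA 32                     /* assumed upper bound on automata */

    /* Hypothetical per-configuration descriptor consumed by the regex
     * application code when it is programmed with the memory maps. */
    typedef struct {
        uint32_t num_automata;            /* DFAs, or 1 head + tails for a hybrid-FA */
        uint32_t num_tail_dfas;           /* hybrid-FA only, 0 otherwise             */
        uint32_t alphabet_size;           /* reduced alphabet (74 for our rule set)  */
        uint8_t  alphabet_tx[256];        /* alphabet translation table              */
        uint32_t base_addr[MAX_FA];       /* base address of each automaton          */
        uint8_t  region[MAX_FA];          /* 0 = scratchpad, 1 = SRAM, 2 = DRAM      */
        uint32_t entry_state[MAX_FA];     /* address of each automaton's state 0     */
    } fa_layout_t;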
4.1 Memory Generation Algorithm
The memory generation algorithm is trivial in the case of GPPs, where the states can simply be laid out sequentially; however, some more consideration is required in the case of the IXP. One architectural characteristic of the IXP network processor is the availability of memory banks with different sizes and access latencies. This fact is central in the design of the memory layout, which is guided by the following objectives. First, for each FA, states more likely to be traversed should be stored in faster memory. Second, in case of DFA based designs, none of the DFAs should be penalized: all of them should be assigned a fair share of fast memory. In the hybrid-FA case, this holds for the tail-DFAs. Being always active, the head-DFA should be privileged over the tails.

To this end, the IXP memory generation algorithm operates as follows. It performs a breadth-first traversal of the available automata and lays out their states starting from scratchpad, then moving to SRAM and finally to DRAM, preventing each state from spanning across different memories. In case of DFA designs, scratchpad and SRAM are equally partitioned among the DFAs. In case of hybrid-FAs, the whole scratchpad and a large portion of SRAM are assigned to the head-DFA; the remaining SRAM is equally partitioned among the tail-DFAs. The DFAs and tail-DFAs are initially sorted according to increasing size. In order to avoid unassigned scratchpad and SRAM words, some adjustments are possibly made after laying out each DFA.
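A minimal sketch of this breadth-first layout pass (our own simplification under assumed data structures; the region bookkeeping and bounds are illustrative, not the actual generator code):

    #include <stdint.h>

    #define MAX_STATES 262144                  /* assumed bound on state count */

    typedef struct {                           /* one memory region            */
        uint32_t base, size, used;             /* word addresses               */
    } region_t;

    typedef struct {                           /* minimal per-state view       */
        uint32_t size_words;                   /* words needed for the state   */
        uint32_t addr;                         /* assigned word address        */
        int      num_succ;
        int      succ[8];                      /* successor state ids          */
        int      visited;
    } state_t;

    /* Lay out states breadth-first from the entry state: states closer to the
     * entry state are more likely to be traversed and therefore fill the
     * faster regions (scratchpad, then SRAM, then DRAM) first.  A state never
     * spans two regions. */
    void layout_bfs(state_t *st, int entry, region_t *regions, int num_regions)
    {
        static int queue[MAX_STATES];
        int head = 0, tail = 0;
        queue[tail++] = entry;
        st[entry].visited = 1;
        while (head < tail) {
            state_t *s = &st[queue[head++]];
            for (int r = 0; r < num_regions; r++) {     /* first region with room */
                if (regions[r].used + s->size_words <= regions[r].size) {
                    s->addr = regions[r].base + regions[r].used;
                    regions[r].used += s->size_words;
                    break;
                }
            }
            for (int i = 0; i < s->num_succ; i++)
                if (!st[s->succ[i]].visited) {
                    st[s->succ[i]].visited = 1;
                    queue[tail++] = s->succ[i];
                }
        }
    }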
Figure 5: DFA state layout (the scratch/sram/dram indicators are not used in the GPP case). Full states store one 32-bit transition word per symbol of the reduced alphabet (match flag, compressed flag, 30-bit state word address). Compressed states store a set of labeled transitions plus a default transition, each encoded in 32 bits (end-of-state flag, match flag, 8-bit accepted symbol, 2-bit scratch/sram/dram indicator, compressed flag, 19-bit delta state word address).
4.2 State Layout
After default transition compression and alphabet reduction, DFA states can be classified into two categories: full states and compressed states. Full states are represented through |∑| outgoing transitions, one for each symbol of the reduced alphabet. Given an input character c and a full state s, the next state can be determined by direct indexing within s. A compressed state is represented through the default transition and a number of labeled transitions. Each input c must be compared with all the characters on which a labeled transition is defined in order to determine the next state. This processing overhead is not justified if the state has a number of transitions exceeding a threshold t∑. Therefore, t∑ is used to determine which states to compress and which not.

When setting the value of t∑, we considered the IXP capability of reading multiple words in the same memory read. Specifically, we observed in simulation that an 8-word read would not affect the performance in a negative way. We also observed that more than 90% of the states have fewer than 8 outgoing transitions. Therefore, we set the value of t∑ to 8, thus allowing a single memory read for both full and compressed states. Note that this value is acceptable also in the GPP case, in that it fits the cache line of both the Intel Xeon and AMD Opteron 850 processors.

As can be observed in Figure 5, all states contain control bits indicating whether the next state is matching and compressed. Compressed transitions encode the accepted symbol and, because of the reduced addressing space, use a differential representation of the next state address (with respect to the base address of the corresponding DFA). NFA states, which have a varying number of transitions and may present multiple transitions on the same character, are represented as DFA compressed states. The hybrid-FA is analogous to the DFA representation, with the exception of special handling of border-states within the head-DFA and of dead-states within the tails.
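One plausible C rendering of the 32-bit transition words of Figure 5 (a hypothetical sketch for illustration only; the bit-field ordering is ours and not necessarily the on-chip layout):

    #include <stdint.h>

    /* Transition word inside a full state: indexed directly by the
     * (translated) input symbol. */
    typedef struct {
        uint32_t match      : 1;    /* target state is accepting        */
        uint32_t compressed : 1;    /* target state is compressed       */
        uint32_t state_addr : 30;   /* word address of the target state */
    } full_tx_t;

    /* Transition word inside a compressed state: matched sequentially
     * against the input symbol; in this sketch the last word carries
     * the default transition. */
    typedef struct {
        uint32_t end_of_state : 1;  /* last word of this state                 */
        uint32_t match        : 1;  /* target state is accepting               */
        uint32_t symbol       : 8;  /* accepted (translated) symbol            */
        uint32_t region       : 2;  /* scratch/SRAM/DRAM indicator (IXP only)  */
        uint32_t compressed   : 1;  /* target state is compressed              */
        uint32_t delta_addr   : 19; /* target address relative to the DFA base */
    } comp_tx_t;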
5. REGEX-APPLICATION CODE
Regular expression matching is a memory intensive application. As introduced in Section 2, processing each input character results in one or more FA-state traversals, which, in turn, translate into a set of memory queries. The presence of caches on GPPs makes the regex-application code straightforward. On the contrary, the particular architectural features of the Intel IXP NP call for a more specific design. In Sections 5.1-5.3 we describe the implementation of the ONL regex plug-in; in Section 5.4 we briefly mention the basic aspects of the GPP implementation.

The design of the plug-in is guided by two goals: (i) taking advantage of the framework provided by the ONL Plug-in Architecture and (ii) exploiting the architectural characteristics of the Intel IXP NP in order to maximize system throughput. In particular, we take advantage of the following architectural features: (i) availability of different memories (local memory, scratchpad, SRAM and DRAM) varying in size and access latency, (ii) availability of asynchronous memory operations and (iii) parallelism given by the multiprocessor and multithreaded environment.

As explained in Section 3, in the current deployment of the ONL Plug-in Architecture five micro-engines are devoted to the execution of plug-in code. We deploy our regex plug-in on each of those cores, which we name matching engines. Each matching engine processes input data from a different packet flow. However, all matching engines are configured to process the same pattern-set (i.e., they all work on the same finite automata, deployed in shared memory). In our experiments (Section 6) we compute the aggregate throughput across the different matching engines. As mentioned above, we provide three versions of the regex application code: one DFA based, one NFA based and one hybrid-FA based.
5.1 DFA-based ONL Plug-in
The DFA-based implementation assumes that a certain number of DFAs, say NDFA, are given as input. As discussed in Section 2, NDFA depends on the size and complexity of the regular expression set. The plug-in must be configurable and support arbitrary NDFA. Each incoming packet must be processed against all the DFAs.

The ideas at the basis of our design are the following. First, given a certain number of DFAs each packet must be processed against, take advantage of multi-threading in order to parallelize the processing. Second, as detailed in Section 4.1, assign to each DFA a fair share of the available fast memory and store in fast
memory the states which are more likely to be accessed. Finally, hide the memory latencies by using asynchronous memory operations and by taking advantage of the IXP multithreading model.

Figure 6 schematizes the internals of the DFA-based matching engine. Configured DFAs are equally distributed among so-called matching threads. The number of matching threads per matching engine can be configured. Context switching is performed in order to grant all threads a fair share of the processor and hide memory access latencies. To minimize waiting delays on slower threads, at any given time the matching threads are allowed to be processing different input characters. A synchronization mechanism is therefore necessary, and is implemented through a single synchronization thread. Therefore, up to seven matching threads can be configured on each matching engine.

Figure 6: DFA-based matching engine on the Intel IXP NP (a synchronization thread and matching threads 1..k share a BUFFER and a BUF_IDX array in shared local memory; each thread locally keeps the SCRATCH/SRAM/DRAM base addresses, ALPH_SIZE and the ALPH_TX_TABLE; the automata reside in scratchpad/SRAM/DRAM).

5.1.1 Synchronization thread
The synchronization thread has essentially two tasks: handle synchronization and read packet data. Synchronization is implemented through a circular buffer of configurable size stored in shared local memory. The synchronization thread writes packet data into the buffer, and the matching threads read them from it. Each matching thread records the index of the next value to be processed in a buffer_index array, also stored in shared local memory. The synchronization thread keeps track of the oldest not yet processed entry in the buffer, and of the most recently inserted element. It uses those values along with the content of the buffer_index array to determine buffer fullness and to periodically flush the already processed data.

    synchronization_loop() {
        while (no_input_data_available)
            ctx_swap();
        if (full_buffer) {
            wait_for_signal(stalled_matching_thread);
            flush_buffer();
        }
        read_data_into_buffer();
        ctx_swap();
    }

The synchronization loop is represented in the pseudo-code above. Note that the buffer is not flushed unless there is the need for it (in fact, this operation involves several local memory accesses and is time consuming). Also, the synchronization thread yields the ME when waiting for incoming data, when the buffer is full and after processing an entry.
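The fullness test described above can be pictured as follows (our own simplified sketch with hypothetical names; the plug-in keeps these values in shared local memory, and the sketch assumes monotonically increasing, non-wrapping indices):

    #include <stdint.h>

    #define BUF_SIZE    64              /* circular buffer entries (configurable) */
    #define MAX_THREADS 7               /* up to seven matching threads per ME    */

    uint32_t buffer[BUF_SIZE];          /* 32-bit packet data entries             */
    uint32_t buf_index[MAX_THREADS];    /* next entry each matching thread reads  */
    uint32_t newest;                    /* most recently inserted entry           */
    uint32_t oldest;                    /* oldest not-yet-processed entry         */

    /* The buffer is full when inserting one more entry would overwrite data
     * that the slowest matching thread has not consumed yet; flushing simply
     * advances 'oldest' to the minimum per-thread index. */
    int buffer_full(int nthreads)
    {
        uint32_t slowest = buf_index[0];
        for (int t = 1; t < nthreads; t++)
            if (buf_index[t] < slowest)
                slowest = buf_index[t];
        oldest = slowest;
        return (newest - oldest) >= BUF_SIZE;
    }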
5.1.2 Matching threads
Matching threads retrieve 32-bit entries from the buffer and process them. The first operation performed on each character is alphabet translation (the alphabet translation table is stored in local memory). To be able to accommodate the alphabet translation table in local memory even for large NDFA, alphabet translation is performed globally across all DFAs. That is, a single, globally minimal alphabet is used. After alphabet translation is performed, the next state is queried. As explained in Section 4.2, next-state processing differs depending on whether we are in a full or in a compressed state. The matching loop is represented in the pseudo-code below. Note that matching threads yield the ME in the following situations: when the desired input entry is not yet available in the buffer, when waiting for an asynchronous memory operation to resume, and after processing an entry.

    DFA_matching_loop() {
        while (no_input_data_available) {
            send_signal(stalled_matching_thread);
            ctx_swap();
        }
        input_char = read_buffer();
        symbol = alphabet_tx[input_char];
        do {
            if (full_state) {
                memory_query(state_address + symbol, 1 word);
                ctx_swap();
                update(state_address);
                default_tx_taken = false;
            } else {
                memory_query(state_address, 8 words);
                ctx_swap();
                update(state_address);
                update(default_tx_taken);
            }
        } while (default_tx_taken);
        ctx_swap();
    }
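The update(state_address) step for a compressed state hides the scan of the (up to eight) 32-bit words fetched by the bulk read; a simplified rendering of that scan, reusing the hypothetical comp_tx_t encoding sketched in Section 4.2:

    /* Scan the words fetched for a compressed state.  A labeled transition on
     * 'symbol' consumes it; otherwise the default transition (the word with
     * end_of_state set in this sketch) is taken and the same symbol is
     * re-processed on the target state. */
    static uint32_t compressed_next(const comp_tx_t *w, uint8_t symbol,
                                    uint32_t dfa_base, int *default_taken)
    {
        for (int i = 0; ; i++) {
            if (!w[i].end_of_state && w[i].symbol == symbol) {
                *default_taken = 0;                 /* symbol consumed          */
                return dfa_base + w[i].delta_addr;
            }
            if (w[i].end_of_state) {
                *default_taken = 1;                 /* retry symbol on target   */
                return dfa_base + w[i].delta_addr;
            }
        }
    }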
5.2 NFA-based ONL Plug-in
The NFA based implementation is single-threaded. In fact, since a single NFA must be processed, there is no need for splitting the data structure across different threads and introducing the overhead of synchronization mechanisms. The presence of a dynamic active set and the need for performing a variable number of state traversals per input character introduce some added complexity. In our design we use two techniques to maximize throughput. Specifically: (i) we hide memory latencies by issuing multiple parallel memory queries; (ii) we allow code processing to overlap with memory accesses. The matching loop is summarized by the pseudo-code below.

    NFA_matching_loop() {
        input_char = read_input_data();
        symbol = alphabet_tx[input_char];
        while (!active_set_processed()) {
            memory_bulk_query(state_addr[NREQ], NREQ, SREQ);
            for (R < NREQ) {
                wait_for_signal(mem_ready_R);
                if (symbol in words_read_R)
                    update(active_set);
                while (!state_processed(state_R))
                    memory_query(state_addr[state_R] + processed_tx_R, SREQ);
            }
        }
    }

Parameters NREQ and SREQ bound the maximum number of outstanding memory requests. In particular, NREQ gives an upper bound on the number of states queried in parallel. For each state query, SREQ indicates the size of the memory read. In the experiments described in Section 6, those parameters have been set to 5 and 6, respectively. As can be observed, the plug-in operates as follows. For each input character, it first issues up to NREQ memory reads of size SREQ. Results are processed as they become available (as signaled by
memory). The active set is updated as needed. If the SREQ words are not sufficient to process the considered state, more memory reads are issued. Recall that NFA state processing consists of a linear traversal of its transitions, and that the size of a state can be larger than SREQ words. The operation continues until the whole active set has been processed.
5.3 Hybrid-FA-based ONL Plug-in
The hybrid-FA is characterized by an always active DFA (the head) and a variable number of DFAs which can be dynamically activated and de-activated (the tails). Our implementation aims at hiding memory latencies by exploiting thread parallelism while keeping the synchronization mechanisms as simple as possible. Each hybrid-FA based matching engine presents a head-matching thread and a number of tail-matching threads, configurable between zero and seven.

5.3.1 Head-Matching Thread
The head-matching thread performs several tasks: it reads the input data, manages synchronization among threads, handles border-state traversal and tail-DFA activation, processes the head-DFA and, if necessary, processes some active tails. The operation of this thread is summarized in the pseudo-code below. The head-DFA is handled as in the DFA based matching engine case. The current state and its attributes are stored in general purpose registers. Tail-DFA activations are handled using some shared data structures. In particular, the active_tail parameter stores the address of each active tail-state, and is represented through a dynamically handled array in shared local memory. The enable_tail and executing_tail variables, stored in shared general purpose registers, are used for thread preemption and synchronization. For each input character, the head-matching thread updates, if necessary, the set of active tail-states. If some tail-DFAs are active, then the operation of the tail-matching threads is enabled. After processing the head-DFA, the head-matching thread participates in tail-DFA processing. It waits until all tail-DFAs have been completely processed before moving to the next input symbol.
    HybridFA_head_loop() {
        input_char = read_buffer();
        symbol = alphabet_tx[input_char];
        if (border_state)
            update(active_tails);
        if (active_tails ≠ Ø) {
            enable_tail = true;
            tail_ptr = 0;
        }
        do {                                   /* head processing */
            if (full_state) {
                memory_query(state_address + symbol, 1 word);
                ctx_swap();
                update(state_address);
                default_tx_taken = false;
            } else {
                memory_query(state_address, 8 words);
                ctx_swap();
                update(state_address);
                update(default_tx_taken);
            }
        } while (default_tx_taken);
        if (active_tails ≠ Ø) {                /* tail processing */
            while (tail_ptr < size(active_tails))
                HFA_tail_loop();
            while (executing_tail ≠ Ø)
                ctx_swap();
            enable_tail = false;
        }
    }

5.3.2 Tail-Matching Threads
The goal of the tail-matching threads is to process any active tail-DFA. Their operation is summarized in the pseudo-code below (which is also used within the head-matching thread). As can be observed, the shared tail_ptr variable is used to fetch the first not yet processed active tail-DFA. Tail-DFA processing is similar to what was seen above, with the exception that a dead-state traversal can invalidate the current tail-DFA.
    HybridFA_tail_loop() {
        if (enable_tail) {
            while (tail_ptr < size(active_tails)) {
                state_address = active_tails[ptr++];
                executing_tail = executing_tail U {ctx()};
                do {
                    if (full_state) {
                        memory_query(state_address + symbol, 1 word);
                        ctx_swap();
                        update(state_address);
                        default_tx_taken = false;
                    } else {
                        memory_query(state_address, 8 words);
                        ctx_swap();
                        update(state_address);
                        update(default_tx_taken);
                    }
                    if (dead_state)
                        invalidate_tail();
                } while (tail_valid & default_tx_taken);
                executing_tail = executing_tail \ {ctx()};
            }
        }
    }
As far as synchronization is concerned, we opted for a different design choice compared to our DFA-based implementation. Specifically, we forced the head- and the tail-DFAs to operate in lock-step, that is, to process the same input character. This decision is motivated by the complexity associated with handling tail-DFAs presenting a dead-state, which can get repeatedly activated and de-activated.
5.4 GPP Regex Application Code The GPP application code consists of a simplified version of the algorithms described above. The presence of caches avoids all the complexity in handling memory operations. Cache-line reads are in effect bulk reads: contiguous transitions belonging to the same
state are implicitly fetched in parallel. Since there is no need for hiding memory latencies, the processing of an input stream is single-threaded in all cases. As a consequence, all DFAs are handled in a single thread both in the DFA-based and in the hybrid-FA-based code. As mentioned above, multiple threads (pthreads) are used only in the multi-processor setup to allow parallel processing over multiple input streams. In the NFA case, the active set size is dynamic. We evaluated two implementations: one using dynamic memory allocation (in the form of linked lists) and one using static arrays of predefined maximum size. In Section 6, we report only the results obtained with the second implementation, which are better by a factor of 2-3X.
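A minimal sketch of the static-array variant of the active set (our own illustration; the nfa_t/nfa_state_t types and the MAX_ACTIVE bound are assumptions, not the actual GPP code, and duplicate states are not filtered here):

    #include <stdint.h>

    #define MAX_ACTIVE 64                  /* assumed bound on the active set */

    typedef struct { uint8_t symbol; int target; } nfa_tx_t;
    typedef struct { int num_tx; nfa_tx_t *tx; } nfa_state_t;
    typedef struct { int num_states; nfa_state_t *states; } nfa_t;

    typedef struct { int states[MAX_ACTIVE]; int count; } active_set_t;

    /* Build the next active set by following every transition labeled with
     * 'symbol' out of every currently active NFA state. */
    static void nfa_step(const nfa_t *nfa, const active_set_t *cur,
                         active_set_t *next, uint8_t symbol)
    {
        next->count = 0;
        for (int i = 0; i < cur->count; i++) {
            const nfa_state_t *s = &nfa->states[cur->states[i]];
            for (int t = 0; t < s->num_tx; t++)
                if (s->tx[t].symbol == symbol && next->count < MAX_ACTIVE)
                    next->states[next->count++] = s->tx[t].target;
        }
    }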
6. EXPERIMENTAL EVALUATION
In this section, we present the results of our experimental evaluation based on the aforementioned set of 534 rules drawn from the Snort NIDS. All the considered regular expressions contain character sets, and more than half of them contain wildcard and large character set repetitions in the form [^\n\r]*. The sizes of the resulting DFAs are reported in Table 1. In this evaluation, we focus on the DFAs obtained setting the threshold TDFA to 50K, 100K and 200K. The corresponding NFA consists of 9,498 states and has a memory footprint of 85KB. Finally, the hybrid-FA has a head-DFA of 466KB, 22 tail-DFAs and an overall memory footprint of 5,876KB. The compressed alphabet size is 74 in all cases.

To test each of the rule-sets, we generated packet traces using the regular expressions from the sets. Specifically, we used the traffic generator feature of the regex-tool [20]. The tool traverses the state machines and produces traces that move deeper into the state machines with a certain probability pM. We used four pM settings (0.35, 0.55, 0.75 and 0.95), and for each of them generated ten 1MB traces using ten probabilistic seeds. Note that pM equal to 0.95 represents the worst-case scenario of a malicious attacker able to repeatedly cause random walks traversing all FA states.

In the ONL setup, one router is connected to five hosts in a standard star topology. As mentioned in the previous section, each of the five plug-in MEs is loaded with the same matching engine and the same regular expression set. This allows each ME to process a different packet stream. Specifically, filters are added to direct all packets which will leave one router interface through the same matching engine. The hosts each send traffic into the router at the same rate such that the output rate at each interface is equal. In order to determine if the matching engines are keeping up with the input rate we use another feature provided to ONL plug-in developers. Plug-ins have access to a number of fast counters that
they can update and that are visible in real-time charts in the RLI. In all of the matching engine implementations we increment one such counter whenever we begin processing the data in a new packet. The arriving and departing packet rates at each port are also visible, so those rates can be compared with the processing rate in each matching engine. If the processing rate falls behind for an extended period of time then the matching engine is not keeping up with the input rate. In our experiments, we record the maximum throughput for which the matching engines can keep up with the input rate.

In the GPP setup, clients send data to the matching engine via TCP sockets. The reported performance takes into account the time spent by the engine to spawn a new pthread, to read the input data from the socket and to process them against the finite automata. We verified that the regex matching task accounts for the largest fraction of the processing time; pthread creation and data reception cumulatively take an average of 8-10 ms per input stream. For each trace, we ran ten simulations, resulting in one hundred simulations per (FA, pM) data point (recall that ten seeds are used for each pM). The reported results are obtained by eliminating ten outliers and averaging the remaining data values.

Table 3 summarizes the throughput reported on the three systems in consideration.

Table 3: Throughput (in Mbps) achieved on 534 regular expressions from the Snort NIDS using 1MB packet traces with different pM settings.

  System            | pM   | NFA  | DFA50K | DFA100K | DFA200K | HFA
  IXP2800           | 0.35 | 29.5 | 23.5   | 28      | 35.5    | 95.0
  IXP2800           | 0.95 | 20.2 |        |         |         | 38.1
  Intel Xeon        | 0.35 | 11.7 | 19.6   | 22.0    | 25.3    | 192.7
  Intel Xeon        | 0.55 | 12.0 | 19.6   | 22.2    | 25.8    | 165.0
  Intel Xeon        | 0.75 | 5.2  | 20.9   | 24.4    | 27.0    | 44.6
  Intel Xeon        | 0.95 | 2.4  | 21.2   | 25.3    | 28.0    | 32.3
  4-way AMD Opteron | 0.35 | 70.9 | 127.6  | 136.1   | 146.0   | 533.9
  4-way AMD Opteron | 0.55 | 70.1 | 124.3  | 139.3   | 157.4   | 481.6
  4-way AMD Opteron | 0.75 | 32.7 | 125.4  | 142.2   | 157.7   | 185.3
  4-way AMD Opteron | 0.95 | 15.6 | 125.3  | 139.6   | 161.3   | 151.4

In Figure 7 we also visualize the average throughput across the different pM values.

Figure 7: Average throughput on the three systems using different automaton representations.

The following observations can be made. First, for DFA based designs there is little variability over pM. Second, the hybrid-FA performs the best overall. This is not surprising since in the average case the head-DFA is the only active DFA and the processing needs are similar to those of a standard DFA. On the IXP, NFA and DFAs perform similarly; on both GPPs, DFAs outperform the NFA. In general, the NFA is the most compact representation but also the most complex in terms of processing needs and in how many memory operations are required for each input byte. Third, the DFA performance increases with TDFA. Recall that a larger TDFA implies more memory being used but fewer DFAs, leading to less overall processing time and fewer total memory accesses on each byte of the data stream. Finally, the performance on the IXP is similar to that on the Intel Xeon. The 4-way AMD Opteron 850, better provisioned in both CPUs and caches, presents an aggregate throughput 6X larger on NFA and DFAs and 2.5-4X larger on the hybrid-FA.
Table 4: Traversal characteristics with pM set to 0.35.

           | avg active set | max active set | trans/char | default trans/char | full states/char | compr states/char
  NFA      | 3.62           | 12             | 81.65      | -                  | -                | -
  DFA50K   | 13             | 13             | 42.37      | 11.09              | 11.75            | 12.34
  DFA100K  | 11             | 11             | 38.28      | 9.85               | 9.84             | 11.01
  DFA200K  | 10             | 10             | 35.14      | 8.97               | 8.96             | 10.01
  HFA      | 1.00           | 2.00           | 5.97       | 0.79               | 0.68             | 1.10

Table 5: Traversal characteristics with pM set to 0.95.

           | avg active set | max active set | trans/char | default trans/char | full states/char | compr states/char
  NFA      | 28.36          | 51             | 228.73     | -                  | -                | -
  DFA50K   | 13             | 13             | 38.04      | 7.05               | 12.08            | 7.98
  DFA100K  | 11             | 11             | 31.32      | 5.93               | 10.19            | 6.74
  DFA200K  | 10             | 10             | 26.11      | 5.15               | 9.40             | 5.75
  HFA      | 10.10          | 14.00          | 16.03      | 2.25               | 9.89             | 2.46
For completeness, we evaluated the behavior of pcregrep [4] (version 7.9) on the considered pattern-set. Pcregrep is the text-driven regex matching engine used by Snort to match Perl Compatible Regular Expressions. The reported throughput on all 534 patterns for pM spanning from 0.35 to 0.95 varies from 1.06 Mbps down to 0.3 Mbps on the Intel Xeon, and from 4.54 Mbps down to 1.12 Mbps on the 4-way AMD Opteron 850. Not only are those performance numbers far smaller than those reported using our FA-based engines, but they are also strictly dependent on the fraction of malicious activity present in the input stream. It must be said that Snort invokes pcregrep only after a pre-filtering operation (based on exact-match patterns). For a fairer comparison, we ran pcregrep on 50 (that is, about 10%) of the considered patterns. The performance achieved was 7-8 times that reported above: still far worse than that of the FA-based engines.

In order to better understand our FA-based results, let us analyze the dynamic behavior of the finite automata. Tables 4 and 5 report the basic characteristics of the automata traversal when using an input trace with pM equal to 0.35 and 0.95, respectively (again, averages over ten seeds are reported). Note that those measurements are independent of the underlying hardware platform utilized. Recall that the active set size represents the number of states which are active in parallel. The active set size is static and equal to the number of automata in the DFA case. However, default transition traversals (column five) may double the total number of state traversals. We report the effective average number of state traversals per character in columns six and seven. The active set size is dynamic in the NFA and hybrid-FA cases and, as could be expected, it increases with pM. This motivates the decrease in performance with the increase of pM that we observe in the NFA and hybrid-FA cases.

On a finer granularity, the throughput depends on the number of instructions executed as well as on the number of memory operations. The number of transition traversals per character (column four) gives an indication of the total number of instructions needed to process the input stream. As anticipated in Section 4.2, in the DFA and hybrid-FA cases full states require one transition traversal, while compressed states require up to eight (t∑). In the NFA case, a linear traversal of all transitions up to the matching one is always required, leading to worse behavior. To quantify the memory operations that are performed on the IXP, we must recall that in our implementation we perform bulk memory reads. In the DFA and hybrid-FA cases the number of memory requests is equal to that of state traversals. In the NFA case, for each state with t traversed transitions, t/SREQ memory requests are performed, where SREQ is set to 6. This leads to an average of 13.6 and 38.1 memory requests per character processed when pM is equal to 0.35 and 0.95, respectively.

In Tables 6 and 7 we report a characterization of the behavior of the regex application code on the Intel Xeon processor. The data have been collected using the cachegrind tool [25]. As can be observed, the number of instructions executed is proportional to the number of transitions traversed. Almost half of the instructions are memory reads. Most of them hit the D1 cache, especially in the case of the NFA, which is characterized by a limited memory footprint. The miss rate is in any case limited. In the DFA case, the hit rate increases with pM (leading to a higher throughput). This is partially due to the fact that deeper states tend to be compressed: a memory read can therefore fetch multiple contiguous states in the same cache-line.

It is worthwhile at this point to consider the best possible performance achievable given the characteristics of the IXP 2800. At best, each of these state machines must do one memory access for each byte in the data stream to find the next state. Most of the states are in SRAM, so if we assume that the SRAM latency is the only limiting factor we can compute an upper bound on the throughput. The nominal SRAM latency, when the SRAM channel is not overloaded, is 150 cycles. The MEs run at 1.4GHz, so 150 cycles/byte translates to approximately 70 Mb/s for one ME. An approximate upper bound on throughput for our system is then around 350 Mb/s across the 5 matching engines. Our best results are one third of that, so there is room for improvement.

We also duplicated these tests in Intel's IXP simulation environment to determine where we might be able to improve the code. In each case except the NFA, the matching engines are always utilizing the CPU and never idle waiting on memory operations to complete. That is, the computation needed for each byte of the data stream is the dominant limiting factor. In the NFA case, the CPU was utilized 91% of the time, leaving the ME idle waiting on memory around 9% of the time. The NFA, then, is mainly limited because the number of active states is too high to be handled efficiently by a single thread.
7. DISCUSSION
In this work we aimed at evaluating design alternatives for deep packet inspection on NPs and GPPs. We considered three distinct FA based designs and state-of-the-art NFA and DFA compression techniques. We proposed implementations aimed at exploiting the architectural characteristics of the IXP2800 network processor. Even if we acknowledge that further optimizations can be evaluated to achieve better throughput, one of the goals is to consider the practicality of regular expression matching on network processors and to project the results onto better-equipped systems.
7.1 Projection of Results
The best aggregate throughput reported on the IXP is 95 Mbps. This was achieved using a limited fast memory provisioning, as detailed in Section 3. Moreover, the regex plug-in code has been deployed on five Micro-Engines. If we project these data onto a larger chip, like the Netronome NFP-32XX [24], provisioned with 40 Micro-Engines, we can expect to multiply the aggregate throughput by a factor of about 8X without big changes in the infrastructure. If, on top of this, we assume a better fast memory provisioning, it should be possible to achieve 1 Gbps throughput without further optimizing the implementation.
Table 6: Cache behavior on Intel Xeon with pM set to 0.35.

           | Instructions   | Cache reads   | D1 read misses
  NFA      | 2,011,159,667  | 973,749,915   | 18,367
  DFA50K   | 1,096,606,818  | 592,969,061   | 3,396,611
  DFA100K  | 974,912,544    | 525,688,917   | 4,520,419
  DFA200K  | 893,323,771    | 481,503,339   | 1,374,317
  HFA      | 135,766,044    | 67,460,400    | 25,138

Table 7: Cache behavior on Intel Xeon with pM set to 0.95.

           | Instructions   | Cache reads   | D1 read misses
  NFA      | 8,872,396,833  | 3,321,727,996 | 4,962,127
  DFA50K   | 965,649,550    | 515,059,757   | 1,444,472
  DFA100K  | 806,200,163    | 430,816,504   | 668,065
  DFA200K  | 694,761,552    | 373,782,799   | 875,565
  HFA      | 794,847,313    | 419,742,419   | 233,078

7.2 Possible Directions
A way to double the throughput without a significant change in the infrastructure consists of using multiple-stride DFAs. However, as detailed in [21], this may significantly affect the alphabet size, especially when large DFAs are considered. As proposed in [21], alternative implementations of fully specified states can be adopted to limit the memory storage requirement. As an example, full states may be represented as decoders. This opens the way to new directions of exploration: in particular, the opportunity to represent frequently taken state transitions as instructions and store them in the instruction store rather than in data memory.

7.3 Comparison with Different Architectures
When addressing regular expression matching, peak performance can be achieved using NFAs on FPGA implementations [14][15][16][17]. Encoding the NFA in logic allows processing one or more characters per clock cycle independent of the active set size. It must be said that this scheme operates on a single flow. If multiple incoming flows with varying throughput must be handled, a DFA/hybrid-FA based scheme may be preferable and is worth exploring and optimizing.

8. CONCLUSION
To summarize, in this work we implemented a tool-chain to realize and evaluate different design alternatives in the context of regular expression matching on network and general purpose processors. On a relatively large and complex data-set, our evaluation shows aggregate performance up to 95 Mbps using five Micro-Engines on the Intel IXP2800 NP, 193 Mbps on an Intel Xeon and 0.5 Gbps on a 4-way AMD Opteron 850. Simulation also highlights that, in an IXP implementation hiding memory access latencies, the time spent processing instructions cannot be ignored and significantly affects the performance. If we wish to achieve significantly higher throughputs on a network processor we should either have a better system provisioning, or think about alternative schemes beyond the ones which have been proposed in the context of traditional and ASIC memory-centric architectures.

9. REFERENCES
[1] J. E. Hopcroft and J. D. Ullman, "Introduction to Automata Theory, Languages, and Computation," Addison Wesley, 1979.
[2] M. Roesch, "Snort: Lightweight Intrusion Detection for Networks," in 13th System Administration Conference, November 1999.
[3] Snort: http://www.snort.org.
[4] PCRE: http://www.pcre.org.
[5] S. Kumar et al., "Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection," in ACM SIGCOMM, September 2006.
[6] M. Becchi and P. Crowley, "An Improved Algorithm to Accelerate Regular Expression Evaluation," in ACM/IEEE ANCS 2007.
[7] F. Yu et al., "Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection," in ANCS 2006.
[8] B. Brodie, D. E. Taylor and R. K. Cytron, "A Scalable Architecture for High-Throughput Regular-Expression Pattern Matching," in ISCA 2006.
[9] M. Becchi and P. Crowley, "A Hybrid Finite Automaton for Practical Deep Packet Inspection," in ACM CoNEXT 2007.
[10] S. Kumar et al., "Curing Regular Expressions Matching Algorithms from Insomnia, Amnesia, and Acalculia," in ACM/IEEE ANCS 2007.
[11] R. Smith et al., "Deflating the Big Bang: Fast and Scalable Deep Packet Inspection with Extended Finite Automata," in ACM SIGCOMM, August 2008.
[12] L. Tan and T. Sherwood, "A High Throughput String Matching Architecture for Intrusion Detection and Prevention," in ISCA 2005.
[13] S. Kumar et al., "HEXA: Compact Data Structures for Faster Packet Processing," in ICNP 2007.
[14] R. Sidhu and V. K. Prasanna, "Fast Regular Expression Matching using FPGAs," in FCCM 2001.
[15] C. Clark and D. Schimmel, "Efficient reconfigurable logic circuit for matching complex network intrusion detection patterns," in FPL 2003.
[16] A. Mitra, W. Najjar and L. Bhuyan, "Compiling PCRE to FPGA for Accelerating SNORT IDS," in ACM/IEEE ANCS 2007.
[17] M. Becchi and P. Crowley, "Efficient Regular Expression Evaluation: Theory to Practice," in ACM/IEEE ANCS 2008.
[18] C. Wiseman et al., "A Remotely Accessible Network Processor-Based Router for Network Experimentation," in ACM/IEEE ANCS 2008.
[19] M. Becchi, M. Franklin and P. Crowley, "A Workload for Evaluating Deep Packet Inspection Architectures," in IEEE IISWC 2008.
[20] M. Becchi, regex-processor: http://regex.wustl.edu.
[21] Open Network Lab: http://onl.arl.wustl.edu.
[22] Johnson and Kunze, "IXP2400/2800 Programming: The Complete Microengine Coding Guide," Intel Press, 2003.
[23] Intel IXP 2xxx Product Line of Network Processors: http://www.intel.com/design/network/product.
[24] Netronome NFP-32XX: http://www.netronome.com/pages/network-flow-processors.
[25] Valgrind: http://valgrind.org.