Real-time Packet Filtering with the Connex Array
Dominique Thiébaut, Dept. Computer Science, Smith College, Northampton, MA 01063, USA, [email protected]
Mihaela Malita, Dept. Computer Science, St Anselm College, Manchester, NH 03102, USA, [email protected]
Abstract
A fundamental problem in network intrusion detection on high-speed networks is the ability to filter and analyze packet headers in real time. In this paper we present the Connex Array™ circuit, a new general-purpose programmable processor-in-memory (PIM) architecture, and show how it can be used to accelerate real-time packet filtering, more specifically the problem of matching IP addresses against profile tables. We show that with the current 1024-cell circuit, 200 rules can be tested in parallel against an incoming IP address in 45 PIM cycles, for a raw performance of 700 K IP/sec at the current 5-ns PIM cycle time. Because the Connex Array is designed to integrate easily in multi-chip solutions, this performance scales linearly with the number of circuits.

1. Introduction
The purpose of a Network Intrusion Detection System is to help ensure that malicious packets sent by hackers are not allowed to enter a network [1]. In order to provide high levels of security, firewalls have to perform many types of analyses on the packets, some involving just the header information (such as applying rules to validate the source and destination IP and port numbers), some involving the payload with the goal of discovering suspicious patterns. The sophistication of the pattern-searching algorithms and the bandwidth of delivery of packets put a tremendous computational burden on the firewall hardware. In this paper we present a new circuit, the Connex Array™, a Processor-In-Memory (PIM) architecture for a wide range of applications, proposed by Connex Array Technologies [3], which can also be described as belonging to the ALU-in-memory class of architectures. In this architecture 1024 active memory cells contain simple ALUs that can perform basic content-addressable-memory (CAM), logic, and arithmetic operations on 16-bit words under the control of instructions broadcast by a controller. These active memory/ALU elements all operate in parallel and execute each instruction in 1 cycle at an operating frequency of 200 MHz. In addition, each active memory cell has access to a bank of private regular memory words which can be used as registers in arithmetic and logic operations, or as regular storage. The combination of intelligent cells and regular memory cells on the same chip yields a transistor density only half that of regular static memory. While other parallel solutions have been proposed for packet filtering, this one is remarkable in that the accelerator circuit is a programmable, general-purpose architecture that can easily adapt to changes in filtering algorithms. This same circuit is currently manufactured for video processing, and has been shown to be effective at inexact DNA string matching [4][5] and at image processing, where it implements all the stages of the image pipeline of a digital camera.

2. Related Research
Specialized processors have been developed to handle the highly compute-bound task of filtering packets. ClassiPi [7] and Strata II [8] are examples of serial custom processors designed for deep packet filtering. Their performance, however, is limited by the typical Von Neumann bottleneck, and scales poorly with increasing network traffic and with an increasing number of rules. Parallel solutions have also been investigated, such as Cho, Navab, and Mangione-Smith's FPGA rule-based multi-layer firewall inspection system [9], which processes rules in parallel. The claimed performance is the filtering of 2.88 Gbps of data at an operating frequency of 90 MHz, for 105 rules. The main drawbacks of these solutions are the low operating speed and the limited amount of parallelism obtained compared to a VLSI solution, along with the rigidity of the end solutions, which are not programmable. Sourdis and Pnevmatikatos [10] also propose an FPGA-based solution mixing CAM and discrete logic for the more general problem of intrusion detection, with the added feature of reconfiguration to update the filtering rules.
3. The Connex Array Circuit
In this section we present the basic organization and operations of the Connex Array circuit.

2.1. Basic organization
The basic organization of the Connex Array (CA) is shown in Figure 1.

Figure 1: General architecture of the CA (blocks shown: Controller, PIM, RAM, DMAC).

A linear organization of intelligent memory cells forms the Processor-In-Memory (PIM) part of the circuit. Each PIM cell is connected to its left and right neighbors, forming a chain network. Each PIM cell is also connected to its own bank of RAM cells, forming the equivalent of a register file for each PIM cell. A controller broadcasts operations to the PIM array and gathers status information on the array, while a direct-memory-access controller (DMAC) oversees the loading and unloading of data in the array. The PIM cells execute single-cycle instructions in synchrony with the controller. The two entities work in parallel: the controller reads instructions from its own memory and broadcasts PIM-specific instructions to the PIM array, where they are executed in parallel. The instructions support basic fixed-point arithmetic operations, basic bit-logic operations, content-addressable search operations, insert and delete operations, as well as parallel exchange of data between the PIM cells and their associated RAM cells. In the current implementation of the CA the controller keeps the PIM array fed with instructions and data, sustaining a peak execution rate of one instruction per 5-ns cycle. Note that the combination of PIM cells and their associated RAM cells allows an abstraction of the processing where the PIM array is simply a vector processor, and the RAM a storage for vectors.
Data are input into the array in three basic ways: insertion, memory-write operation, or Direct Memory Access (DMA). The first two are programmed operations, under the supervision of the controller. The last one is independent of the controller and executed under the supervision of a DMA controller serving I/O devices, with a peak transfer rate in excess of 3 GBps. In the next section we present some of the PIM instructions that best illustrate the CA mode of operation.

2.2. Basic search operations in the Connex Array
The first command we introduce is find(x), where x is a symbol. Assume that the CA (now shown horizontally) contains the symbols forming the string “cool cats can”, as shown in Figure 2.a.

Figure 2. The CA with its original contents (2.a), and after the execution of find(c) (2.b), conditional-find(a) (2.c), and conditional-find(t) (2.d).

Assume that the command find(‘c’) is now issued to the CA. In our context, issuing a command means that an instruction and a symbol are fed to the device over a bus in a manner similar to a conventional memory-write operation. In one clock cycle, the symbol ‘c’ is broadcast to all the cells and their contents are compared to it. When a match takes place, a bit called the marker is set in all the cells where the match occurs, as shown in Figure 2.b. We indicate that a cell has a marker set by underlining the character it contains. Observe that all the ‘c’ symbols are marked.
The purpose of having a marker in each cell becomes evident when we introduce the next CA command, conditional-find(x), which also takes a symbol x as argument. During the execution of a conditional-find command, only the cells that have a left neighbor with a marker bit set perform a comparison. The others do not. Assuming that the contents of the CA are those shown in Figure 2.b, applying conditional-find(‘a’) to the CA changes its contents to those shown in Figure 2.c. Two cells remain marked, those containing the symbol ‘a’. The CA circuit is engineered such that a signal is output to the controlling entity to indicate whether one or more markers remain set after the last instruction. If we continue with conditional-find(‘t’), then the CA gets set to the state illustrated in Figure 2.d.
We have just performed a search for the string “cat” in the CA: find(‘c’), conditional-find(‘a’), and conditional-find(‘t’). Note that this search requires three commands, and thus takes only 3 memory cycles, one cycle for each symbol in the query string, independently of the length of the string stored in the CA. Had the CA contained several instances of “cat”, all would have been found and marked. Note furthermore that no pre-processing of the data is required, only initially storing the string in the CA, which can be done in a variety of ways as mentioned earlier. Searching for a substring of N symbols in a string of M symbols stored in the CA requires exactly N cycles, whatever the value of M.
The CA supports forward string searches, as presented above, as well as reverse string searches. In this case, looking for the substring “cat” would require the execution of find(‘t’), reverse-conditional-find(‘a’), and reverse-conditional-find(‘c’). This simple example shows how searching can be performed with little effort with the CA.
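The forward search described above can be mimicked in software. The following is a minimal sketch, not the circuit's actual instruction set: the class name `ConnexSearch` and its methods are hypothetical, each list element stands for one PIM cell, and each call is counted as one cycle, as the text states for the hardware.

```python
class ConnexSearch:
    """Toy model of the CA search commands: one symbol and one marker bit per cell."""

    def __init__(self, text):
        self.cells = list(text)
        self.markers = [False] * len(self.cells)
        self.cycles = 0

    def find(self, x):
        # Every cell compares its own symbol with x during the same cycle.
        self.markers = [c == x for c in self.cells]
        self.cycles += 1
        return any(self.markers)  # models the "some marker is still set" signal

    def conditional_find(self, x):
        # Only a cell whose left neighbor is marked performs the comparison.
        left = [False] + self.markers[:-1]
        self.markers = [l and c == x for l, c in zip(left, self.cells)]
        self.cycles += 1
        return any(self.markers)

    def search(self, query):
        # Assumes a non-empty query: one find, then one conditional-find per symbol.
        found = self.find(query[0])
        for symbol in query[1:]:
            found = self.conditional_find(symbol)
        return found

    def marked(self):
        return [i for i, m in enumerate(self.markers) if m]


ca = ConnexSearch("cool cats can")
found = ca.search("cat")
print(found, ca.marked(), ca.cycles)  # True [7] 3
```

The marker ends up on the last symbol of each occurrence (cell 7, the ‘t’ of “cats”), and the cycle count depends only on the length of the query, not on the length of the stored string.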
2.3. Basic multi-operand operations
The association of regular memory cells with each PIM cell also provides for vector operations, as illustrated below. In this example we start with a series of positive numbers, keep every fourth number of the series intact, and replace all the others by -1. This is illustrated in Figure 3.

Figure 3. Multi-operand operations with the CA (panels a to g, showing the RAM vector and the PIM cells after each step).

In Figure 3.a the original series of numbers is in the PIM cells. All the marker bits are set, and hence all the numbers are shown underlined. In a first step we store the contents of these marked cells in one of the vectors associated with the CA, represented here in grey. This is done by issuing the instruction ramStore. The result is in Figure 3.b. The controller then issues the instruction index, which initializes the contents of the CA cells with their index, shown in Figure 3.c in hexadecimal. The next instruction, and 3, performs the logical AND of the contents of the selected PIM cells with 3, resulting in the pattern shown in Figure 3.d. In 3.e the instruction notEqual 0 deselects all the cells containing 0. The next two instructions, store -1 and complementSelection, then store the value -1 in all the remaining selected cells, and complement the marker bits to switch the selected status of the cells (Figure 3.f). Finally, the ramLoad 0 instruction loads in parallel the contents of Vector 0 only in the selected cells (Figure 3.g). Since every instruction takes one cycle, this simple program of seven instructions (ramStore 0, index, and 3, notEqual 0, store -1, complementSelection, ramLoad 0) takes only 7 consecutive cycles.
The checking of IP addresses against a profile table requires similar operations. In the next section we present the problem of filtering packets and our proposed solution.

4. Packet Filtering
4.1. Profile Table
One of the operations essential for detecting illegal packets on a network is the comparison of the IP addresses (source and destination) of incoming packets against rules stored in a profile table. A profile table contains a set of criteria that are used to profile packets in order to label them as safe, or as suspicious and requiring further processing. For our purpose, we assume that the profile table is populated with rules containing several bit fields defined as follows:
SourceIP (sip): 4 bytes
DestinationIP (dip): 4 bytes
PacketSize: typically 1500 bytes for Ethernet
SourcePort (sport): 2 bytes
DestinationPort (dport): 2 bytes
SessionIdentifier: 1 bit; flag used to identify whether all packets in a session should be inspected
EndTime: multi-byte timestamp
Not all fields need to be populated for each entry in the profile table. Don’t-care values are allowed. These bit fields correspond to similar bit fields found in the header of incoming packets, and each packet header must be compared to the rules in the table to determine the fate of the packet. All the fields of a packet must match the conditions expressed in the fields of a rule (including possible don’t-cares) for the packet to match the given rule. No match means the packet is safe and can continue on its way. A match means that an action associated with the rule must be followed, either triggering a deeper analysis or dropping the packet right away. In general, there are different protocol validation engines for each protocol that is being monitored. To avoid complexity, we concentrate here only on domain-name server (DNS) requests.
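To make the matching semantics concrete, here is a minimal serial sketch of a profile table with don’t-care fields. It is an illustration only: the names `Rule`, `ip`, `rule_matches`, and `classify`, the action strings, and the use of `None` for a don’t-care are assumptions, and the PacketSize, SessionIdentifier, and EndTime fields are left out for brevity. The sample rule uses DNS port 53 and an address pattern with don’t-care bytes.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Octets = Tuple[Optional[int], ...]  # None marks a don't-care byte


@dataclass(frozen=True)
class Rule:
    sip: Octets = (None,) * 4      # source IP, all don't-care by default
    dip: Octets = (None,) * 4      # destination IP
    sport: Optional[int] = None    # source port
    dport: Optional[int] = None    # destination port
    action: str = "inspect"        # hypothetical action label


def ip(text):
    """Parse dotted notation; an 'X' component is a don't-care byte."""
    return tuple(None if part == "X" else int(part) for part in text.split("."))


def field_matches(pattern, value):
    return all(p is None or p == v for p, v in zip(pattern, value))


def rule_matches(rule, pkt):
    # Every populated field of the rule must agree with the packet header.
    return (field_matches(rule.sip, pkt["sip"])
            and field_matches(rule.dip, pkt["dip"])
            and rule.sport in (None, pkt["sport"])
            and rule.dport in (None, pkt["dport"]))


def classify(table, pkt):
    for rule in table:
        if rule_matches(rule, pkt):
            return rule.action
    return "safe"  # no rule matched: the packet continues on its way


table = [Rule(sip=ip("131.225.X.X"), dport=53, action="drop")]
pkt = {"sip": ip("131.225.7.9"), "dip": ip("10.0.0.1"), "sport": 4000, "dport": 53}
print(classify(table, pkt))  # drop
```

This loop tests one rule at a time; the point of the Connex Array approach is to perform these comparisons for many rules at once.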
4.2. Implementing a Profile Table in the Connex Array
Our solution is to store all the rules of the profile table in the RAM associated with the PIM cells, and to replicate the header of an incoming packet in the PIM. Having several copies of the header allows its fields to be compared with several rules in parallel. In general, the processing of packets requires three steps:
1. Input the packet header into the Connex Array.
2. Replicate the header in the PIM cells.
3. Analyze the contents of the packet.
The first step can be carried out in several ways, such as under program control, where the controller manages the transfer, or under faster DMA control. The replication is a quick procedure and is performed in a manner similar to the one illustrated in Figure 3: a repetitive pattern is created in the PIM cells, and all the cells containing the number 0 are selected. The first block of the packet header is assigned to all the selected cells. Next, the PIM cells containing the next number in the pattern are selected, and the second block of the header is assigned to all these cells, and so on. Since selecting cells and assigning them values are operations that take one cycle each, storing a header made of N chunks of data takes just 2N Connex cycles. We can thus replicate the packet header in the PIM array very quickly.
Figure 4 illustrates this concept. Here, for simplicity, we assume that one header consists of four chunks labeled A, B, C, and D that fit in four PIM cells, and that each chunk requires the same type of processing. The vectors in the RAM cells are filled with rules, also divided into groups of four chunks. Figure 4 shows three rules, R, S, and Q. Matching the header to these rules is therefore done in parallel, matching, in one step, Chunk A of the header to chunks R0, S0, and Q0, Chunk B to R1, S1, and Q1, and so on.

RAM   R0 R1 R2 R3 S0 S1 S2 S3 Q0 Q1 Q2 Q3 ...
PIM   A  B  C  D  A  B  C  D  A  B  C  D  ...

Figure 4: The header ABCD is replicated in the PIM array and tested in parallel against three rules, R, S, and Q.

A careful analysis of the steps required to test all the fields of a packet against their associated fields in the rules reveals that the operations are not simply tests for equality of two quantities. In some cases, the operations required are of the type “if Field_i is equal to α, then verify that Field_j is equal to β.” Others are of the type “if Field_i contains a value between δ and γ.” These require a careful coding of the tests in a program whose instruction count yields the performance information which we develop in the next section.
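The replication and parallel comparison steps can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the actual Connex program: the function names are hypothetical, the repetitive pattern is kept in a separate list (so that storing a chunk cannot disturb later selections), a modulo stands in for the index and AND instructions, `None` in the RAM vector stands for a don’t-care, and the rule contents are made up. Only the select and store steps are counted, one cycle each, which gives the 2N figure quoted above.

```python
def replicate_header(chunks, n_cells):
    """Tile the header chunks across n_cells PIM cells; return (pim, cycles)."""
    n = len(chunks)
    pattern = [i % n for i in range(n_cells)]  # 0 1 2 3 0 1 2 3 ... as in Figure 3.d
    pim = [None] * n_cells
    cycles = 0
    for k, chunk in enumerate(chunks):
        selected = [p == k for p in pattern]   # one cycle: select cells holding k
        cycles += 1
        pim = [chunk if s else old for s, old in zip(selected, pim)]  # one cycle: store
        cycles += 1
    return pim, cycles


def match_rules(pim, ram, n):
    """Compare each PIM cell with its RAM word, then combine per group of n cells."""
    equal = [r is None or r == p for p, r in zip(pim, ram)]
    return [all(equal[g:g + n]) for g in range(0, len(equal), n)]


header = ["A", "B", "C", "D"]
ram = ["A", "B", "C", "D",    # rule R: identical to the header
       "A", None, "C", "D",   # rule S: second chunk is a don't-care
       "A", "B", "C", "E"]    # rule Q: last chunk differs
pim, cycles = replicate_header(header, len(ram))
print("".join(pim), cycles)                # ABCDABCDABCD 8
print(match_rules(pim, ram, len(header)))  # [True, True, False]
```

The per-group combination is written here as a plain Python reduction; on the array it would itself take a few instructions, which this sketch does not try to count.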
5. Performance Evaluation and Comparison
We can separate the processing of a packet header against a given rule into six distinct groups of actions:
Group 1: operations that must be performed serially, and before all others.
Group 2: operations requiring testing bits, bytes, or words against one value, which can all be performed in parallel.
Group 3: operations requiring testing two fields of the packet in a conditional manner, one against the other, as mentioned above.
Group 4: comparing the Internet Protocol (IP) address of the packet against a collection of rules (e.g. testing if the four bytes of the IP address match 131.225.X.X, where X represents a don’t-care condition).
Group 5: checking the validity of a series of domain names.
Group 6: performing a multi-branch switch with the (optional) validation of another series of domain names.
Coding the tests in these different groups yields a series of equations expressing the total execution times in Connex cycles for operations in each group, which are summarized in Table 1.

Group            Connex-Array cycles
Groups 1, 2, 3   38
Group 4          45 * (floor((R - 1)/200) + 1)
Group 5          93 * S
Group 6          19 + 93 * S

Table 1: Projected execution time of Analyzer operational groups, where R represents the number of rules the packet header is tested against, and S the number of strings contained in Group 6 operations.

The processing of the packet requires both serial and parallel processing, but fortunately the level of parallelism for Group 4 is high, and can be exploited well by distributing and packaging up to 200 rules in a 1024-cell vector of the Connex Array. In this fashion, up to 200 rules can be tested in parallel against a single IP block of the header in 45 cycles. When 201 to 400 rules must be checked, the rules must be distributed over 2 vectors, requiring twice the number of cycles, or 90 cycles. For 401 to 600 rules, 135 cycles are required, and so on.
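The cycle counts of Table 1 can be turned into a rough throughput estimate. The sketch below rests on assumptions the text does not spell out: that every packet goes through all four rows of the table once, that S = 1, and that the Group 4 cost is rounded so that up to 200 rules cost 45 cycles and 201 to 400 cost 90, as described in the paragraph above. The function names are hypothetical; the 5-ns cycle and the linear multi-chip scaling come from the paper.

```python
import math

CYCLE_NS = 5            # PIM cycle time quoted in the paper
RULES_PER_VECTOR = 200  # rules packed into one 1024-cell vector


def cycles_per_packet(rules, strings=1):
    """Sum of the Table 1 rows for one packet (assumes rules >= 1)."""
    groups_1_2_3 = 38
    group_4 = 45 * math.ceil(rules / RULES_PER_VECTOR)
    group_5 = 93 * strings
    group_6 = 19 + 93 * strings
    return groups_1_2_3 + group_4 + group_5 + group_6


def packets_per_second(rules, strings=1, chips=1):
    """Estimated throughput, assuming linear scaling with the number of chips."""
    return chips * 1e9 / (cycles_per_packet(rules, strings) * CYCLE_NS)


for r in (200, 400, 600):
    print(r, cycles_per_packet(r), round(packets_per_second(r)))
```

Under these assumptions a 200-rule table costs 288 cycles per packet, or about 694 K packets per second on one chip, which is in line with the roughly 700 K figure quoted in the paper.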
Figure 5: Number of packets processed per second as a function of profile table size (x-axis: Profile Table Size, in rules, 100 to 900; y-axis: Number of Packets, in K Pkts/sec, 0 to 3000; series: 1 CA and 4-CA).

Figure 5 above shows the total number of packets, in thousands, processed by one and by four 200-MHz Connex Arrays as a function of the number of rules in the profile table. We show the four-chip performance data to highlight one of the important features of the chip, one that allows the network of PIMs of one chip to be connected to the network of another chip, and the controller of one chip to control both PIM arrays. This allows the construction of multi-chip boards with a minimum of outside hardware. Note that for tables of up to 200 rules, close to 700 KPkts can be processed per second, or 2700 KPkts/sec for a 4-CA board. The performance drops predictably as the number of rules grows, with steps at multiples of 200.
This number is not directly comparable to proposed solutions that tackle the more general problem of deep packet analysis, where the actual throughput expressed in bps is more appropriate. For the purpose of a simple comparative analysis, assuming that packets contain an average of 1 KByte, the actual throughput of our solution on the 4-CA board, for 200 rules, would be 2.7 GBytes per second, comparable to other published solutions, but in a totally programmable and scalable design with linear speedup.
It should be noted that the Linux community has generated some very efficient data structures and algorithms for implementing high-performance network filtering when the number of rules is very high. NF-HiPac [3], for example, boasts almost no loss in network throughput over a range spanning 25 to 25,600 rules, staying in the high 90% range of the 100 Mbps maximum bandwidth. An independent test reported on HiPac’s Web site shows a sustained 95% of the maximum throughput of a 1 Gbit/sec connection when filtering packets with 3072 rules.
This performance comes at a high price in preprocessing (10,000 rules require creating 30,000 B-trees), in memory (1 MByte of storage for the B-trees), in processing power (dual high-end Pentium circuits), and in dissipated power in the 100-Watt range, while post-production Connex Array chips have been operated at less than 5 Watts in power dissipation. While the numbers reported by HiPac translate into low packet throughputs, in the 8 K to 80 Kpps range, the limiting factor here is the network bandwidth used in the tests (100 Mbps and 1 Gbps), and not necessarily the processing power. Whether today’s embedded processors could maintain that high a throughput on an OC48 link operating at 2.488 Gbps remains an open question.

5. Conclusion
The problem of detecting covert intrusion in a network is paramount for maintaining the integrity of a computer system. In this paper we present a solution to the problem of real-time checking of IP addresses against profile tables using a new PIM architecture. The packet-filtering operation, which tests the IP address of incoming packets against a profile table containing hundreds of rules, can be implemented on the Connex Array such that 200 rules can be tested in parallel in as little as 45 cycles. The computation scales linearly with the number of rules, with steps at multiples of 200 rules, and scales up linearly with the number of CA circuits used. The main advantages provided by the Connex Array solution are size, power dissipation, and operations that require no pre-processing of the data. The Connex Array itself is a single chip which integrates 1024 PIMs and their associated registers, and typically dissipates less than five Watts at a 200 MHz operating frequency. The additional circuits required for the operation of the PIM array are all embedded on the same chip. Because of the simplicity of the computation required in comparing one IP address against 200 separate rules, no complex preprocessing of the rules is necessary before storing them in the array. Furthermore, the general-purpose nature of the array makes updating the rule table an elementary operation. It should be noted that pre-processing and high-power processors still present interesting competing solutions, but may not be appropriate in embedded systems because of their power dissipation.
6. References
[1] J. McHugh, A. Christie, J. Allen, "Defending Yourself: The Role of Intrusion Detection Systems," IEEE Software Magazine, Sept./Oct. 2000.
[2] G. Stefan and D. Thiebaut, "A memory engine for the inspection and manipulation of data," United States Patent 6,760,821, granted July 6, 2004.
[3] Connex Array Technology, http://www.connextechnology.com/.
[4] D. Thiebaut and G. Stefan, "Local Alignment of DNA Sequences with the Connex Engine," Poster, WABI 2001, 1st Workshop on Algorithms in BioInformatics, BRICS, Univ. of Aarhus, Denmark, Aug. 2001.
[5] D. Thiebaut and G. Stefan, "Ziv-Lempel compression with the Connex Engine," Tech. Rep. 077, Dept. Computer Science, Smith College, Northampton, MA 01063, Jan. 2002.
[6] D. Thiebaut, G. Stefan, M. Malita, "Image Processing on the Connex Array," Technical Report.
[7] PMC Sierra Inc., "PM2329 ClassiPi Network Classification Processor Datasheet," Product Datasheet, PMC-2010146, Issue 4, 2001.
[8] Broadcom Inc., "Strada Switch II BMC5616, Integrated Multilayer Switch"; High Performance Packet Classification, http://www.hipac.org/
[9] Y. Cho and W. Mangione-Smith, "Deep Packet Filter with Dedicated Logic and Read Only Memories," in IEEE Symposium on Field-Programmable Custom Computing Machines, Napa Valley, CA, April 2004.
[10] I. Sourdis and D. Pnevmatikatos, "Fast, Large-Scale String Match for a 10Gbps FPGA-based NIDS," in New Algorithms, Architectures, and Applications for Reconfigurable Computing, Patrick Lysaght and Wolfgang Rosenstiel (Eds.), Chapter 16, pp. 195-207, ISBN 1-4020-3127-0, Springer, 2005.