Packet Filtering in Gigabit Networks Using FPGAs

Johannes Loinig, Johannes Wolkerstorfer, Alexander Szekely

Institute for Applied Information Processing and Communications, Graz University of Technology

Abstract

Network security is an important aspect of a networked information society. The ever-growing data rates of Internet traffic call for dedicated hardware solutions to protect networks from malicious attacks. Packet filtering is one of a number of measures to ensure the availability and reliability of networks. Packet filters classify the network traffic by rules and drop those packets that might affect the network security. Our work shows that the latest series of Xilinx FPGAs is a very effective platform for realizing such filters in hardware. The Xilinx Virtex-4 series provides hard macros for network access, has ample memory resources to store filter rules, and provides embedded processors to manage and maintain the resulting network devices. Our pipelined packet-filter architecture on Virtex-4 FPGAs is able to filter Gigabit Ethernet traffic. On a small Virtex-4 FX-12 device, 32 complex filter rules can be applied by linear search to decide whether a packet is accepted or dropped. The filter causes a latency of only 2,300 ns, which is orders of magnitude faster than software solutions. The pipelined architecture scales to larger rule sets on larger FPGAs and to multi-Gigabit throughput rates when moving to word sizes larger than the currently applied 8-bit datapaths.

1 Introduction

IT security plays a key role in today's networks. Neither e-Commerce and e-Government applications nor mobile banking is possible without securing the network infrastructure. Securing networks is a broad activity that cannot be reduced to one particular measure. Instead, a combination of several security approaches guarantees the security of data. Data security in networks means protecting against denial-of-service attacks, maintaining the confidentiality and integrity of sent information, and protecting other assets. For instance, the authenticity of communication partners is of importance. Several of the listed security goals can be achieved by applying cryptography. Nevertheless, comparatively simple approaches like packet filtering protect network devices against a vast number of serious security threats like malicious intrusion or denial of service.

Figure 1. Application of packet filters in security gateways (Secure LAN 1 and Secure LAN 2 are each connected through a security gateway to the insecure Internet (WAN))

Efficient packet filtering helps ensure high availability of systems [1]. Packet filtering analyzes the network traffic by monitoring the packet information. Simple packet filters examine the header information only, while fully-fledged firewalls also scrutinize the payload of the packet. They even check the plausibility of the state of higher-level protocols like HTTP (stateful inspection). Simple packet filtering and stateful inspection have in common that they apply a set of rules to the network traffic. Each packet is classified by the first rule that matches it, and that rule decides whether to accept the packet or to drop it. The set of rules is usually a sorted list where earlier entries have higher priority than later ones. If none of the rules applies, a default action is taken. When the default action is 'accept', the rule set is a so-called 'black list'. When the default action is 'drop', packets are only accepted if at least one rule explicitly applies, so these rules are called a 'white list'.
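The first-match semantics with a default action can be illustrated by a minimal software sketch. The names `Rule`, `classify`, and the `dst_port` header field are hypothetical and chosen only for illustration; they are not part of the presented hardware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Packet = Dict[str, int]  # hypothetical header representation: field name -> value


@dataclass
class Rule:
    matches: Callable[[Packet], bool]  # predicate over header fields
    action: str                        # "accept" or "drop"


def classify(packet: Packet, rules: List[Rule], default_action: str) -> str:
    """Return the action of the first (highest-priority) matching rule."""
    for rule in rules:
        if rule.matches(packet):
            return rule.action
    # No rule applied: fall back to the default action.
    return default_action


# White list: default action is "drop", rules name what is accepted.
white_list = [Rule(lambda p: p["dst_port"] in (80, 443), "accept")]
# Black list: default action is "accept", rules name what is dropped.
black_list = [Rule(lambda p: p["dst_port"] == 23, "drop")]
```

With these rule sets, `classify({"dst_port": 4000}, white_list, "drop")` falls through to the default and drops the packet, while the same packet passes the black list because its default action is 'accept'.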

1.1 Application of Packet Filters

Packet filters are implemented in devices like security gateways that have a direct connection to the insecure Internet. See Figure 1 for such a gateway application scenario. Packet filters can be implemented as software in the network stack. This type of implementation is usually sufficient for personal packet filters like the Windows firewall or Netfilter/IPtables used in Linux systems. Larger company networks usually apply network filters on their Internet gateways to protect against malicious packets. The quality requirements for packet filters in gateways differ from those for personal firewalls. Gateways have to cope with data throughput on the Gigabit scale while introducing latencies of no more than a few milliseconds. Most software implementations of network stacks are not able to meet such requirements. Thus, many commercially available gateways are a combination of hardware and software.

1.2 Related Work

Cisco Systems, the world-market leader for network switches and routers, offers a security module for its Catalyst series of routers that provides packet filtering on network layer 2 [2]. The solution consists of several special-purpose hardware accelerators and conventional processors running software. Throughput rates of 5 Gbps can be achieved at a latency of 13 ms. Although packet filtering is used in many networked devices, little is published about implementation approaches. Software solutions are better covered by scientific literature than hardware solutions.

A well-known implementation of packet filtering in software is Netfilter/IPtables of the Linux operating system [3]. Packet filtering is done in the Linux kernel. Filter rules are provided by the user-space program IPtables. The search algorithm applied in Netfilter makes the performance sensitive to the number of applied rules. This is not the case for HiPAC, which uses different algorithms as described in [15]. That system can perform packet classification in a time that is independent of the number of filter rules. The filtering is split into a classification and a lookup procedure, which accelerates the filtering. However, Linux systems are not capable of filtering Gigabit network traffic and providing good quality-of-service parameters simultaneously. Packets have to traverse large parts of the network stack, which introduces latencies of more than 10 ms. This is not appropriate for commercial Internet gateways.

The project Safecard realizes a network filter on a network interface card equipped with a single Intel IXP2400 network processor [4]. It achieves a throughput of 940 Mbps. Besides packet filtering, Safecard can also inspect the state of TCP streams.

Multiple projects with goals similar to ours, packet filtering for use in IPsec gateways, were conducted at Washington University by John Lockwood, Haoyu Song et al. [5, 6]. Their approaches to packet filtering are based either on ternary content-addressable memory (TCAM) or on Bloom filters. A TCAM is a content-addressable memory that allows searching in constant time [5]. It is costly to realize TCAMs on FPGA platforms. More recent work makes use of Bloom filters, which also allow searching for given strings in the payload of packets [6].

1.3 Our Contribution

In this article, we present the implementation of a packet filter for Gigabit Ethernet realized on Xilinx FPGAs. The complexity of Xilinx FPGAs and their ability to be reconfigured in the field turn out to be valuable assets that facilitate the implementation of flexible network applications. We present a pipelined architecture on a Xilinx Virtex-4 FX-12 device that is able to filter network traffic on layers 2 and 3 at a throughput rate of 1 Gbps. A latency of only 2,300 ns (at a clock frequency of 125 MHz) is achieved by a pure hardware implementation. Our proof-of-concept on the rather small FX-12 device is limited to 32 filter rules. Moreover, it supports 32 entries of routing information for routing packets to their final destination. The approach is scalable to a medium number of filter rules on larger FPGA devices. The architecture is tuned to utilize Xilinx FPGA resources like distributed RAM, block RAM, and the integrated hard-macro media-access controller (MAC) for Ethernet access. It does not use ternary memory but instead utilizes the enormous bandwidth to the on-chip RAM resources.

The remainder of this article is structured as follows: After giving an introduction to packet filtering in Section 2, Section 3 shows the architecture of our pipelined FPGA packet-filtering approach. Section 4 presents the achieved results, and Section 5 draws conclusions and raises questions that will be addressed in future work.

2 Packet Filtering

All of today's computer networks are packet oriented. This means that a communication stream is split into packets that are sent over the network to the receiver. These packets need to have a defined format, which is provided by a set of network protocols. A network packet usually consists of a header (or a set of headers), the body (also called payload), and the trailer. The header describes the necessary transport information and the nature of the data in the payload. The trailer usually holds a checksum for error detection.

Packet filtering is a packet-classification problem. To distinguish between packets that are allowed to pass through the system and those that are not, packet filters must be able to sort the packets into classes like ALLOW, DROP, etc. These classes are also called actions. The classification is based on attributes stored in the packet header. According to the OSI reference model, the network is split into several layers. Each layer has its own purpose and its own protocols. This results in a packet structure composed of the headers of several network protocols. Each protocol provides its own header with information about the payload. These headers might not have a constant length. Thus, filtering-relevant information is spread over the header area, and neither the position nor the presence of a given field is guaranteed.

Static packet filtering is defined as a system that classifies packets individually, only by analyzing packet headers. A real-life implementation of a packet filter has to provide a rule set that can be (re-)configured during runtime. The configuration is done by loading filter rules, each of which defines a set of packets and the corresponding class of action. These sets of packets are defined by ranges for header fields. Examples of rule ranges are IP-address ranges or TCP-port-number ranges. Through the usage of such ranges, it is possible that one and the same packet matches more than one filter rule. [11] distinguishes between two types of such conflicts: the subset conflict, where one rule selects packets that were already selected by another rule, and the overlapping conflict, which is caused by rules that are sensitive to different header fields. Subset conflicts can be resolved by reordering filter rules. Overlapping conflicts can be resolved by splitting conflicting filter rules. Both methods can be very complex if the rule set consists of a huge number of rules. Although the authors of [11] propose the usage of best-matching filters instead of priority-based systems, the most common way in today's implementations is to give the filter rules priorities by means of an explicit order.

Filter-rule ranges can be described in different ways. Two commonly used ways are the number range (like TCP port numbers 1-30) and the prefix notation (like IP addresses 10.0.0.*). However, these two notations are not completely equivalent. In terms of granularity, the prefix notation can be understood as a subset of the number-range notation. Because network addresses are very often organized in subgroups, filter rules usually use prefixes for IP addresses. Number ranges are used where grouping is not practicable, as is the case for TCP ports. Not every classification algorithm (explained in the next section) can handle rules in both notations equally well. Although filter rules can be converted from one notation to the other, this usually leads to a different number of rules. For example, the rule 'allow 10.0.0.1-10.0.0.254', which contains a range, cannot be described in prefix notation with a single rule. At least the two rules 'drop 10.0.0.255; allow 10.0.0.*' are needed, if priorities are applied.

Realizing packet filters in hardware is not straightforward, as there are plenty of choices for algorithms and architectures. The kind and number of filter rules play an important role when choosing the optimal classification algorithm and system architecture. The most widely used classification algorithms are described next.
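The growth in rule count when converting a range into prefixes can be checked with a short sketch. It uses Python's standard `ipaddress` module and considers the case without priorities, where the range has to be covered exactly by 'allow' prefixes alone. This is an illustration only and not part of the presented system.

```python
import ipaddress

# The range from the example rule 'allow 10.0.0.1-10.0.0.254'.
first = ipaddress.IPv4Address("10.0.0.1")
last = ipaddress.IPv4Address("10.0.0.254")

# Smallest list of prefixes (CIDR blocks) that covers exactly this range.
prefixes = list(ipaddress.summarize_address_range(first, last))

for prefix in prefixes:
    print(prefix)
print(len(prefixes), "prefix rules replace one range rule")
```

Under these assumptions the single range rule expands into 14 prefix rules, which is why a classification algorithm that only handles prefixes pays a penalty for range rules.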

2.1 Classification Algorithms

Taylor differentiates in [12] between four kinds of packet-classification techniques: Exhaustive Search, Decision Tree, Decomposition, and Tuple Space. However, many algorithms applied in practice are based on more than one of these principal techniques.

Exhaustive Search. Exhaustive Search is the most trivial way to find a matching rule. One example of an Exhaustive Search is the linear search, where the filter rules are searched sequentially until the first rule matches. The priority of the rules is given implicitly if the search is stopped after the first match. The search can be parallelized if more than one memory block (holding the rules) is used. A fully parallelized search can be achieved by using an associative memory. TCAMs (Ternary Content-Addressable Memories) find a matching rule in constant time O(1). Due to their hardware complexity, TCAMs have the disadvantages of high cost, storage inefficiency, high power consumption, and limited scalability for long input keys [12]. Furthermore, number ranges and negations in filter rules are not well supported by TCAMs, even though [14] has introduced a way to perform range searches with so-called Extended TCAMs (E-TCAMs). Unfortunately (but not surprisingly), the E-TCAM approach doubles the hardware cost in terms of gate count.

Decision Tree. The Decision Tree algorithm uses bits of the header field as the search key in the traversal of a preprocessed tree structure which represents the filter rules [12]. The leaves of the tree contain the actions to perform on the packet. Due to the way trees are searched, the search time depends on the length of the search key. Therefore, the search time depends on the type of the analyzed packet. E-TCAMs use multiple levels of decision trees with constant depth. This gives highly deterministic timing behavior. Several algorithms based on decision trees are commonly referred to as cutting algorithms. They are geometrical approaches where filter rules with d different fields are represented as d-dimensional rectangles in a d-dimensional space. Each decision while traversing the tree is equivalent to a d-dimensional cut through the spanned filter space. Typical cutting algorithms for packet classification are Hierarchical Intelligent Cuttings (HiCuts) and HyperCuts [12].

Decomposition. Parallelizing the Decision Tree approach leads to the Decomposition algorithm. In decomposition, the filter rules are split up into multiple search trees (e.g. separate trees for certain header fields) [12]. This reduces the search time, but the results for the different header fields are disjoint. To find the final matching rule, these results have to be evaluated in combination. Typical examples of decomposition are the parallel bit-vector algorithm and its performance improvement, the aggregated bit-vector algorithm. Both are based on a geometric view of the rules. Decomposition algorithms are known to be very memory consuming. In addition, some algorithms (like Crossproducting) make it impossible to change single rules easily because the filter rules are used in an entangled way [12].

Tuple Space. The Tuple Space method reduces the search space by grouping similar filter rules [12]. Each group is defined by a tuple that specifies the filter rules within the group. One can think of the tuple as a kind of hash value with which it is easy to check whether a packet matches that tuple or not. The search algorithm only identifies the matching tuple. The resulting subset of filter rules is searched afterwards to find the final matching rule. Because the tuples depend on the character of the filter rules, the search time does as well.

2.2 Comparison of Classification Algorithms

Implementing the algorithms in hardware implies different memory architectures. A simple linear search can use a common memory architecture like RAM to access the rules sequentially. A more sophisticated implementation like the TCAM has a very complex memory structure, as it provides parallel search facilities in addition to information storage. It cannot be implemented efficiently on FPGAs. Decision Tree and Decomposition need memory structures that allow storing and accessing search trees efficiently. Although modern FPGAs include a fair number of relatively big embedded memory blocks, it is not obvious how to implement search trees efficiently in hardware. Either memory utilization would be very low or the time needed for searching the tree would be high. Thus, algorithms that make use of TCAMs or search trees, such as Decision Tree and Decomposition, are not the first choice for FPGAs.

Concerning the search time, linear search has the most predictable behavior (apart from TCAMs) because it depends only on the number of rules (which can be limited). The tree-based approaches have search times that vary with the length of the header fields of the packets. The search time for Tuple Space depends on how similar the filter rules are. The more similar the rules are, the more rules can be grouped in tuples. This makes the search space smaller.

Every filter implementation based on grouping performs well for filter rules in prefix notation. However, packet filters also have to support rules that define ranges, at least for TCP and UDP ports. All classification algorithms except the linear search suffer from inefficient implementations of these range rules, because ranges have no suitable binary representation and thus require multiple rules. The linear search does not depend on the notation of rules and has no need to precompute data structures (e.g. to set up search trees). The precomputation step in other algorithms allows faster searches. This precomputation has to be done in software and has to be repeated every time a filter rule is added, deleted, or modified, or when the priority of the rules changes. However, the linear search also has the slowest search time, which increases linearly with the number of rules. Fortunately, analysis has shown that the number of rules in deployed systems is not very high [12]. When only a moderate and constant number of filter rules is allowed, linear search is the most efficient approach.

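Figure 2. Hardware and software components (software: embedded Linux for administration and monitoring, exchanging packets to/from the gateway with the hardware; hardware: packet filter / IPsec between two MACs that connect the secure LAN and the insecure Internet (WAN))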

In summary, only the linear search has a search time that does not depend on the packet's size, the size of the rules, their structure, or existing rule interdependencies.

3 Architecture

The considerations about classification algorithms and their implementation on FPGAs showed that the choice of FPGA greatly influences implementation details. When the internal memory blocks of FPGAs are used for storing filter rules, a linear search is the first choice of classification algorithm.

3.1 General Architecture

Our implementation was done on the Xilinx ML403 evaluation platform [10]. The board is equipped with a Xilinx Virtex-4 FPGA [8] and many peripherals, of which only the Ethernet physical-layer device (PHY) is used in our prototype. The Ethernet can operate at 10, 100, or 1000 Mbps. The Ethernet PHY is connected to the FPGA via an MII, GMII, or RGMII interface. The Virtex-4 FX-12 FPGA provides 12,312 logic cells within 5,472 slices, an embedded PowerPC 405, and two tri-mode Ethernet media-access controllers (TMACs) [9]. Furthermore, the FX-12 has 36 block RAMs of 18 kbit each, resulting in a total storage of 648 kbit.

To provide sufficient speed for the Gigabit network, we decided to split the system into a hardware and a software part as sketched in Figure 2. The software components are used for managing the filter rules. The hardware part processes the packets. This article concentrates on the hardware components of the system, which implement the packet filter. The hardware handles all incoming and outgoing packets. Packets that are addressed to the management software are passed to the software. The software parts are not yet implemented. They will be realized on embedded Linux running on the PowerPC CPU. The management software will provide administration, monitoring, and management functions like setting up new filter rules or managing routing entries.

Another future extension of the packet filter is the support of IPsec functionality [13]. IPsec will encrypt and decrypt the network traffic and ensure the integrity of data by message authentication. IPsec relies on packet classification too, because it has to determine which packets have to be encrypted or decrypted. This will extend the classes of actions to ALLOW, DROP, and IPSEC.

Figure 3 shows an overview of the entire filter architecture, consisting of the TMACs implemented as hard cores on the FPGA and the filters for the inbound and outbound directions. Additional interfaces to the PowerPC for administration/monitoring and for passing packets to the software are depicted as well. All hardware blocks in the inbound and outbound directions use an interface that is similar to that of the TMAC hard cores. As the TMACs have different interfaces for the RX and TX directions, adapters were necessary to allow packet forwarding. These adapters are implemented in the FIFO components, which are able to store an Ethernet jumbo frame of up to 9,048 bytes. The FIFO supports retransmission of frames if requested by the TMAC.

Figure 3. Filter structure (a TMAC towards the insecure WAN and a TMAC towards the secure LAN, connected by an inbound filter and an outbound filter, each followed by a FIFO; each filter has software RX/TX FIFOs and an administration/monitoring interface)

3.2 Architecture Details

Figure 4 shows the basic architecture of the inbound filter. The entire filter is based on a pipelined structure containing modules for packet-header extraction, packet labeling and multiplexing, IPsec encryption/decryption, bypassing of packets, and routing. The extracted header information is stored in a header FIFO and evaluated in the rule-seeker module. This module implements the Exhaustive Search algorithm (a parallelized linear search) to find a matching filter rule and labels the packet with the desired action. A basic consideration in the architecture is to equip all pipeline components with an interface that is similar to the RX interface of the TMAC. This facilitates the introduction of additional components into the pipeline.

3.2.1 The Pipeline

No component of the pipeline structure stores a complete packet. Each component holds only as many bytes as are necessary to fulfill its function, such as extracting the IP header information. This reduces the hardware complexity and lowers the latency of the components. Using such a pipelined structure was necessary to support jumbo frames [7], which are common in Gigabit networks. A typical length for jumbo frames is 9,048 bytes. Jumbo frames reduce the overhead caused by the packet headers relative to the transmitted payload. Packets of this length are impractical to buffer. A packet-classification system based on a buffer approach needs to store the entire packet in memory before starting the needed operations. First, this would take (in the worst case) more than 9,000 clock cycles before the classification can start, as the interfaces of the TMACs are byte-oriented and work at 125 MHz. Secondly, another packet can arrive directly after the first one. Thus, the buffer has to be much larger than 9,000 bytes to hold parts of both packets. This additional size depends on how long the operations on the first packet take.

The pipeline modules are based on 8-bit-wide shift registers. Packets are handled by the TMACs as a stream of bytes which are shifted through these registers. Additional control logic is responsible for detecting the start and the end of a packet. The control logic also indicates the point in time when a shift register holds information that is needed to classify the packet. In particular, it signals when the shift registers contain header information of the packet. The advantage of such a pipelined structure (in comparison to a structure based on a buffer) is that the number of needed registers is independent of the maximum packet length. In most cases the pipeline needs as many stages as the queried header information has bytes.

An additional advantage is the modularity of a pipeline structure. The stages essentially do not affect each other. This helps to divide the system into submodules which are easier to analyze and to implement. Submodules do not interfere with each other, and therefore the complexity is reduced.

Each pipelined structure introduces packet latency that is directly proportional to the number of register stages in the pipeline. Each register stage introduces a delay of $t = \frac{1}{125\,\mathrm{MHz}} = 8\,\mathrm{ns}$ because the complete circuit uses a single clock domain that is synchronous to the TMAC. Thus, even very complex header-extraction modules do not raise the latency of the whole system to the magnitude of the transport delays of the network (milliseconds).

The extracted information is stripped from the data stream in the pipeline. Packet assemblers can later add new header information to the payload. This simplifies the modification of header information, as is needed for example in the routing module.
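The byte-wise shift-register principle can be modeled in software as follows. This is a behavioral sketch only, not the authors' HDL: the class `ShiftRegisterExtractor` and its parameters are hypothetical, and the example assumes an untagged Ethernet II frame carrying IPv4, where the source IP address occupies frame bytes 26 to 29.

```python
from collections import deque


class ShiftRegisterExtractor:
    """Byte-wide shift register that captures one header field from a stream."""

    def __init__(self, field_offset: int, field_length: int):
        self.field_end = field_offset + field_length
        # One register stage per byte of the queried header field.
        self.stages = deque([0] * field_length, maxlen=field_length)
        self.count = 0      # control logic: bytes seen since start of frame
        self.field = None   # captured header field

    def start_of_frame(self) -> None:
        self.count = 0
        self.field = None

    def clock(self, byte_in: int) -> int:
        """Shift one byte in and return the byte that leaves the register."""
        byte_out = self.stages[0]
        self.stages.append(byte_in)  # maxlen drops the oldest stage
        self.count += 1
        if self.count == self.field_end:
            # The register now holds exactly the queried field.
            self.field = bytes(self.stages)
        return byte_out


# Assumed layout: 14-byte Ethernet header + offset 12 in the IPv4 header.
extractor = ShiftRegisterExtractor(field_offset=26, field_length=4)
frame = bytes(i % 256 for i in range(9048))  # jumbo-sized dummy frame
extractor.start_of_frame()
delayed = [extractor.clock(b) for b in frame]
```

The model needs only four register stages for a 9,048-byte frame, and the outgoing stream is delayed by four clock cycles (32 ns at 125 MHz), independent of the frame length.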

Figure 4. Inbound filter (filter_top: packets from the MAC pass through header_extraction, shift_reg, packet_mark_demux, ipsec_decrypt, packet_mux, route, and packet_unmark_demux to the software FIFO or the jumbo-frame FIFO; the extracted header fields are stored in header_fifos and evaluated by rule_seeker, which labels the packets with an action)

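Figure 5. Linear search for filter rules (an address counter, starting at the start address, addresses the RAM with the filter rules; a set of comparators checks the extracted header fields against the addressed rule and outputs the action and whether the rule matches)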

3.2.2 The Linear Search

For the reasons stated above, we decided to implement the packet classification based on the linear search algorithm, using the FPGA-internal memory structures to hold the filter rules. The implemented scheme is relatively simple and depicted in Figure 5. It consists of RAM blocks holding the set of filter rules and comparators that check whether the extracted header fields match the addressed rule in the memory. The filter rules cover the most important kinds of rules implemented in Netfilter, with additional IPsec functionality. The linear search starts by checking whether the extracted header fields match the first entry in the RAM, which is the rule with the highest priority. The address counter is incremented every clock cycle to check against rules with lower priority. The action of the first matching rule is fetched to label the packet. This label is evaluated in further modules and determines the packet processing.

The main problem of this scheme is the implementation of the comparators with a critical path short enough to allow operation at 125 MHz. Especially the comparators for the Internet Protocol addresses (128-bit comparison, 64 bits per range, 32 bits per address) had to be implemented using a pipelined approach. The pipelined comparators compute hmax − h and h − hmin in order to verify whether the extracted header information h is in the range [hmin, hmax]. This is the case when both subtractions yield a non-negative result. When several ranges are checked (e.g. IP address and TCP port), the results of the comparators have to be combined by an AND function to determine whether the whole rule applies or not.
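The range test by two subtractions and the AND combination over several fields can be sketched in software as follows. The function names and the dictionary-based rule layout are hypothetical and serve only to illustrate the comparison; the hardware evaluates the sign of each subtraction in a pipelined comparator.

```python
import ipaddress


def in_range(h: int, h_min: int, h_max: int) -> bool:
    """h lies in [h_min, h_max] iff neither subtraction goes negative."""
    return (h_max - h) >= 0 and (h - h_min) >= 0


def rule_applies(header: dict, rule: dict) -> bool:
    """AND over the per-field range comparisons of one rule."""
    return all(in_range(header[field], lo, hi)
               for field, (lo, hi) in rule.items())


def ip(text: str) -> int:
    """Helper: dotted IPv4 address as a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


# Hypothetical rule: source address range and destination port range.
rule = {
    "src_ip": (ip("10.0.0.1"), ip("10.0.0.254")),
    "dst_port": (1, 30),
}
```

For example, a header with source address 10.0.0.7 and destination port 22 satisfies both ranges, whereas port 80 fails the port range and therefore the whole rule.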

The linear search approach does not scale very well with the number of filter rules. Therefore, we developed a seeker architecture that is tailored to the architecture of Xilinx FPGAs, which support multiple internal memories. This is in contrast to software-based systems or to hardware using external memory, where only one memory block with one address bus and one data bus is available. Xilinx FPGAs offer the ability to use either LUTs as 16 × n memory or the block RAM resources with configurable address and data bus widths. Using multiple such memory structures is very advantageous because it increases the bandwidth to the memory. Memory bandwidth is usually the limiting factor when realizing classification algorithms.

In our approach, we fragment the rule set into blocks of 16 rules. Each rule-seeker module matches the extracted header information against these 16 rules. Doing all the comparisons in one block takes 17 clock cycles because of the pipelined comparator structure. This architecture is sketched in Figure 5, where two memory blocks are searched in parallel to support 32 rules in total. The architecture can be scaled to an arbitrary number of blocks. The maximum number of blocks depends only on the distributed RAM resources of the FPGA used. The search for matching rules is done in all blocks in parallel to speed up the search. In this scheme the first matching filter rule found is not necessarily the right one, as shown in Figure 6. There may be another matching rule with a higher priority in another block. Thus, the search cannot be stopped after the first match. All rules have to be checked. If several rules match, the rule with the highest priority (lowest block number, lowest rule index) is taken. The search takes 17 clock cycles, independent of the number of filter rules and seeker blocks.
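The block-parallel search and its priority resolution can be modeled as follows. This is a software sketch under simplifying assumptions: each rule is reduced to a single hypothetical port range, the blocks are evaluated one after another instead of concurrently, and the names `search_block` and `find_rule` are illustrative only.

```python
from typing import List, Optional, Tuple

BLOCK_SIZE = 16          # rules per seeker block, as in the text
Rule = Tuple[int, int]   # simplified rule: (port_min, port_max)


def search_block(block: List[Rule], port: int) -> Optional[int]:
    """Scan every rule of one block; report the first match in the block."""
    first = None
    for index, (lo, hi) in enumerate(block):
        if lo <= port <= hi and first is None:
            first = index
    return first


def find_rule(rules: List[Rule], port: int) -> Optional[int]:
    """Return the global index of the highest-priority matching rule."""
    blocks = [rules[i:i + BLOCK_SIZE] for i in range(0, len(rules), BLOCK_SIZE)]
    # In hardware all blocks are searched at the same time.
    hits = [search_block(block, port) for block in blocks]
    # Priority: lowest block number first, then lowest index within the block.
    for block_no, hit in enumerate(hits):
        if hit is not None:
            return block_no * BLOCK_SIZE + hit
    return None


NO_MATCH = (1, 0)               # empty range that never matches
rules = [NO_MATCH] * 32
rules[15] = (80, 80)            # "Rule 16" in Figure 6 (1-based numbering)
rules[17] = (0, 1023)           # "Rule 18" in Figure 6
```

For port 80, block 2 reports its match at in-block position 1, long before block 1 reaches position 15, yet `find_rule(rules, 80)` returns index 15 (rule 16), which mirrors the situation of Figure 6.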

Figure 6. Filter rules in two blocks (Block 1 holds rules 1 to 16 and Block 2 holds rules 17 to 32, both addressed by the same address; rule 18 is the first matching rule found, but rule 16 is the correct matching rule)

3.2.3 Block RAM vs. Distributed RAM

Block RAMs are embedded modules in the FPGA which are usually utilized by synthesis tools to implement memory. Due to their high storage capacity, only a few such modules are available on a typical FPGA. Each block RAM stores 18 kbit of data and can be configured in modes ranging from 16k × 1 and 8k × 2 up to 512 × 36. The blocks are cascadable to generate deeper and wider memory elements. Each module has two symmetrical synchronous ports for read and write accesses [8]. The usage of block RAM should be restricted to modules where the utilization of the 18 kbit per module is high. Each of our filter rules needs 311 bits of storage. Even if a 512 × 36 configuration is chosen for a block RAM, at least 94% of the memory would be unused per block because only 16 out of 512 words are used. Alternatively, with smaller word sizes, accessing a rule would take much longer. Therefore we chose distributed RAM to store the filter rules. This decision is also supported by the fact that the future IPsec functionality will make heavy use of the block RAM resources, so storing the rules in distributed memory utilizes the FPGA best.

The Xilinx Virtex-4 allows using its configurable logic blocks (CLBs) in a distributed-RAM mode. This means that the SRAM cells of the lookup tables (LUTs) in the CLBs, which usually hold data representing combinational logic, are used as memory elements instead. Each LUT can implement a 16 × 1-bit synchronous RAM. Multiple LUTs can be combined to provide larger single-ported and dual-ported memory elements. Because distributed RAM can be sized in a very fine-grained manner, it is a good choice for memory elements with unusual aspect ratios like our 16 × 311-bit memory.
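The utilization argument can be reproduced with a few lines of arithmetic. The sketch assumes that a complete 311-bit rule must be readable in a single clock cycle and that a single-ported distributed RAM needs one 16 × 1 LUT per bit of word width; real synthesis results may differ.

```python
import math

RULE_BITS = 311          # storage per filter rule (from the text)
RULES_PER_BLOCK = 16     # rules per seeker block (from the text)
BRAM_WORDS, BRAM_WIDTH = 512, 36   # widest block RAM configuration
LUT_DEPTH = 16           # each LUT implements a 16 x 1-bit RAM

# Block RAMs needed side by side to read one 311-bit rule per cycle.
brams_needed = math.ceil(RULE_BITS / BRAM_WIDTH)

# Fraction of the words in each block RAM that would stay empty.
unused_fraction = 1 - RULES_PER_BLOCK / BRAM_WORDS

# LUTs for a single-ported 16 x 311-bit distributed RAM (assumption above).
luts_needed = RULE_BITS * math.ceil(RULES_PER_BLOCK / LUT_DEPTH)

print(f"{brams_needed} block RAMs, {unused_fraction:.1%} of words unused")
print(f"{luts_needed} LUTs as distributed RAM")
```

Under these assumptions one seeker block would occupy 9 of the 36 block RAMs of the FX-12 while leaving about 97% of their words empty, whereas the distributed-RAM variant needs 311 LUTs and wastes no depth, because 16 rules exactly fill the 16 entries of a LUT.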

4

Results

We have implemented a prototype of the described gateway system on a Xilinx ML403 evaluation board. Unfortunately, this board has only one physical Ethernet port. To evaluate our approach with a single network interface, the prototype acts as a kind of loopback device by swapping the IP addresses and using the same MAC for input and output. Changing the MAC addresses is done implicitly by the routing module and suitable routing entries.

Our implementation consists of one packet filter, a simplified IPsec module, and a routing module. It filters incoming packets and sends them back to the source of the packet. It can handle all sizes of Ethernet packets and is not limited to a maximum number of packets per second. The header extraction modules of the filter consider almost every header field of Ethernet, IPv4, ARP, ICMP, TCP, UDP, and ESP packets. The filter itself matches every incoming packet against up to 32 filter rules within 17 clock cycles. Each rule defines whether a matching packet is dropped, sent to the rudimentary IPsec module, sent to the software part of the gateway, or allowed to bypass the filter. The rudimentary IPsec module can decrypt an ESP packet with the null algorithm if the packet matches one of the 32 stored Security Associations, which are not discussed in this article.

The routing module provides a routing table with 32 entries. It performs a table lookup to retrieve the corresponding Ethernet MAC address for an IP address. This lookup mechanism is again range-based and uses the same memory and comparator structures as the rule-seeker module. All filter rules, Security Associations, and routing entries can be changed individually via a software interface. This interface connects the filter to the Processor Local Bus (PLB) of the embedded PowerPC. To avoid inconsistent entries in the packet filter hardware, the interface also ensures that the memory content cannot be changed while a packet is being processed.

The prototype runs at a maximum frequency of 139 MHz, which is above the 125 MHz required for Gigabit Ethernet networking. The system causes a packet latency of 283 clock cycles (2,264 ns at 125 MHz) for every packet, independent of the packet's protocol, the number of filter rules, or the action of the matching filter rule.
This number does not include the latency caused by the two TMACs.

FPGA Utilization. A test implementation consisting of the inbound filter structure, a loop-back device, and an output FIFO was verified on the ML403 board. The Xilinx Virtex-4 FX-12 FPGA used is utilized to 54% (5,913) of the Slice Flip Flops, and 46% of the FPGA LUTs are used. A rough estimate therefore puts the overall FPGA utilization at approximately 50%. The realized system comprises approximately 50% of all necessary components compared with a system that includes both inbound and outbound filters. If we disregard the detailed IPsec implementation, it would almost be possible to implement both the inbound and the outbound filter on the FX-12 device. The Virtex-4 FX-20 is the next larger FPGA and provides 82% more LUTs than the one we used. With this FPGA it will be possible to implement the complete filter (not considering the IPsec modules).

Verification and Prototype. The prototype implementation on the ML403 board was limited because the board has only one physical Ethernet driver. The correct functionality of the prototype was verified with a PC equipped with a Gigabit network card. The network traffic generated by the PC, which ran Linux, was filtered by the FPGA board. Packets passing the filter were sent back to the PC through the loop-back device, which exchanged the source and target addresses of IP packets. The MAC addresses of the Ethernet frames were corrected by the routing module. The correct functionality of the filter was verified using the Wireshark tool, which allows network traffic to be monitored and analyzed.

The correctness of the complete circuit, which could not be realized on the prototype board, was assured by simulation. A user-mode Linux network model was used to generate realistic network traffic. This traffic was recorded using TUN/TAP interfaces, which connect the virtual Linux machines to the host system. The recorded data was fed into the simulation running on the Cadence NCsim simulator. The simulation could be done in two ways. The first way is to connect the user-mode Linux directly to the simulation core. This setup behaves like a network with very high latency (caused by the rather slow simulation). In the second approach, the transmitted packets were written into test files so that the same tests could be re-run without requiring the user-mode Linux or virtual network interfaces. The simulation was controlled using Tcl scripts.
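The timing figures reported in this section follow from simple arithmetic, which the following sketch reproduces. The function names are hypothetical. The cycle counts and clock frequencies are the ones given above, and the 8-bit datapath width is the one stated in the abstract.

```python
def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Latency in nanoseconds for a given cycle count and clock frequency."""
    return cycles * 1000.0 / clock_mhz


def throughput_gbps(datapath_bits: int, clock_mhz: float) -> float:
    """Raw throughput in Gbit/s when one data word is processed per cycle."""
    return datapath_bits * clock_mhz / 1000.0


# Whole filter pipeline: 283 cycles at the 125 MHz Gigabit Ethernet clock.
print(latency_ns(283, 125))      # 2264.0
# Rule search alone: 17 cycles at 125 MHz.
print(latency_ns(17, 125))       # 136.0
# 8-bit datapath at 125 MHz gives the Gigabit line rate.
print(throughput_gbps(8, 125))   # 1.0
```

The same functions indicate how the design might scale: a wider datapath at the same clock would raise the raw throughput proportionally, assuming one word per cycle can still be sustained.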

5

Conclusions

In this paper we have presented a Gigabit packet filter tailored for implementation on FPGAs. Our solution stores the filter rules in distributed RAM on the FPGA and uses a linear search to find a matching rule. It exploits the fact that internal RAM resources on FPGAs can be connected with wide data paths and are not subject to the restricted pin counts that limit external memories. We have implemented a prototype on the Xilinx ML403 board, achieving a throughput of one Gigabit per second and a latency of only 2,264 ns, which is orders of magnitude faster than comparable software approaches. Furthermore, our results show that FPGAs are not only useful as prototyping platforms but are also well suited for Gigabit network filtering in real-life applications. This is particularly true when the specifics of FPGAs are taken into account in the choice of the system architecture.

5.1

Future work

As future work, we will extend the gateway with real IPsec functionality supporting the Encapsulating Security Payload (ESP) protocol with the AES encryption algorithm. Our goal is a complete IPsec gateway in hardware, controlled by a Linux system running on the embedded PowerPC core. This requires software drivers that synchronize the hardware filters and the IPsec engine with the filter and IPsec subsystems in the Linux kernel. In addition, we will investigate whether higher throughput rates can be achieved on FPGAs.

References

[1] Lessing A. Linux-Firewalls – Ein praktischer Einstieg, 2nd edition, O'Reilly, ISBN 3-89721-446-6, 2006.
[2] Cisco Systems. Cisco Catalyst 6500 Series Firewall Services Module, Product Literature, 2007. http://www.cisco.com/.
[3] Russell R. Linux 2.4 Packet Filtering HOWTO, Revision 1.26, 2002. http://www.netfilter.org/.
[4] Bruijn W., Slowinska A., Reeuwijk K., Hruby T., Xu L., Bos H. SafeCard: A Gigabit IPS on the Network Card, Proceedings of RAID'06, Hamburg, 2006.
[5] Song H., Lockwood J. Efficient Packet Classification for Network Intrusion Detection using FPGA, International Symposium on Field-Programmable Gate Arrays – FPGA'05, 2005.
[6] Dharmapurikar S., Song H., Turner J., Lockwood J. Fast Packet Classification Using Bloom Filters, ACM/IEEE Symposium on Architecture for Networking and Communications Systems – ANCS'06, 2006.
[7] IEEE Computer Society. IEEE Std 802.3: Gigabit Ethernet, 1999. http://standards.ieee.org/getieee802/.
[8] Xilinx Corporation. Virtex-4 Family Overview, 2005. http://www.xilinx.com/.
[9] Xilinx Corporation. Virtex-4 Embedded Tri-Mode Ethernet MAC, 2005. http://www.xilinx.com/.
[10] Xilinx Corporation. ML401/ML402/ML403 Evaluation Platform User Guide v2.5, May 2006. http://www.xilinx.com/.
[11] Hari A., Suri S., Parulkar G. Detecting and Resolving Packet Filter Conflicts. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Volume 3, 2000.
[12] Taylor D.E. Survey and Taxonomy of Packet Classification Techniques. Technical Report, Department of CSE, Washington University, USA, 2004.
[13] Kent S., Seo K. RFC 4301, Security Architecture for the Internet Protocol. Technical Report, Network Working Group, 2005. http://rfc.net/rfc4301.html.
[14] Spitznagel E.W. CMOS Implementations of a Range Check Circuit. Technical Report, Department of CSE, 2004.
[15] Feldman A., Muthukrishnan S. Tradeoffs for Packet Classification. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies, Volume 3, pages 1193-1202, March 2000.
