J Supercomput DOI 10.1007/s11227-011-0558-8
Improving IPS by network processors

Pablo Cascón · Julio Ortega · Yan Luo · Eric Murray · Antonio Díaz · Ignacio Rojas
© Springer Science+Business Media, LLC 2011
Abstract Many present applications require high communication throughput. Multiprocessor nodes and multicore architectures, as well as programmable NICs (Network Interface Cards), provide new opportunities to take advantage of the multigigabit-per-second link bandwidths now available. Nevertheless, to reach adequate communication performance levels, efficient parallel processing of the network tasks and interfaces has to be considered. In this paper, we leverage network processors, heterogeneous microarchitectures with several multithreaded cores suited to packet processing, to investigate the use of parallel processing to accelerate the network interface and, with it, the network applications built on top of it. More specifically, we have implemented an intrusion prevention system (IPS) on such a network processor. The offloaded implementation allows faster packet processing of both normal and corrupted traffic: placing the IPS close to the network, on a specialized network processor, provides the legitimate traffic with latencies that are several times lower and with higher available bandwidth.

Keywords Network processors · Offloading · IPS · Parallel network interface · Multithreading

P. Cascón · J. Ortega · A. Díaz · I. Rojas
Department of Computer Architecture and Technology, University of Granada, Granada, Spain
e-mail: [email protected]

Y. Luo · E. Murray
Department of Electrical and Computer Engineering, University of Massachusetts Lowell, Lowell, MA, USA

1 Introduction

Network processing is one of the most important parts of many present applications. The availability of high bandwidth links (Gigabit Ethernet, GigaNet, SCI, Myrinet,
QsNet, HIPPI-6400, etc.) and the scaling of network I/O bandwidths up to multiple gigabits per second have shifted the communication bottleneck towards the network nodes. The performance of the network interface (NI) is therefore becoming decisive in the overall communication path, and it is essential to reduce the communication protocol overhead caused by context switching, multiple data copies, and interrupt handling. It is thus attractive to have cores whose microarchitectures can accelerate the applications whose performance mainly depends on the efficiency of network processing.

Network processors (NPs) usually have heterogeneous microarchitectures with cores optimized for packet processing. Their use is driven not only by the increasing demand for high throughput but also by the flexibility required for packet processing: compared with application-specific circuits (ASICs) in these NICs, the need for adaptability, differentiation, and short time to market makes network processors a suitable design choice. NP cores usually provide multithreading capabilities that can be used to improve the performance of network applications by offloading parts of them to the NP, taking advantage of the parallelism and the memory hierarchy implemented in the NP microarchitecture. In this way, not only can the networking applications be accelerated thanks to their implementation close to the network, but the CPU load is also reduced, and the freed cycles can be allocated to other tasks, improving server throughput. Along this same research line, [18] describes how a content-aware switch implemented on an NP can reduce the latency of HTTP processing, improve the packet throughput, and optimize cluster server architectures by processing requests and distributing them to the servers according to application-level information. The authors used the IXP2400 NP [2] and compared their NP-based switch with a Linux-based one: the latency of the NP-based switch is reduced by between 83% (for small file sizes) and 89% (for 1024-Kbyte sizes), and the throughput is improved by between 5.7 and 2.2 times.

In this work, we propose and analyze an offloaded implementation of a network intrusion prevention system. The demand for firewalls and network intrusion detection and prevention systems has grown with the increasing importance of network services and infrastructure, along with the difficulty of designing end-system security strategies [17]. An intrusion prevention system (IPS) needs to analyze both the headers and the content of the packets at the higher-level protocols. The functions implemented by the IPS must also be updated with new detection procedures as the attacks evolve. Since this application requires high-performance processing capabilities and flexibility, it is a good candidate for implementation on a network processor. The use of network processors for IPS implementation has been reported in recent papers [3, 17]. Both works implement their respective IPS on an IXP1200 network processor, and a 45% performance improvement (for a 1 Gbit/s link) is reported in [17]. In this paper, we follow this line, trying to improve the exploitation of the parallelism and other NP resources by taking advantage of the upgraded features of the IXP2XXX network processors (indeed, the future work in both [17] and [3] comprises the use of more powerful NPs). In Sect.
2, the main topics of IPSs are presented. In Sect. 3, we describe the offloaded IPS we propose. Section 4 describes the test environment and shows the experimental results obtained. In Sect. 5, previous work in this research line is reviewed, and in Sect. 6, the conclusions and future work are given.

2 Related issues on intrusion prevention systems

An Intrusion Prevention System (IPS) is a system that prevents network attacks. To perform this task, it monitors networks and systems to detect undesired behavior. In our research work, we focus on monitoring the network. The most common IPS setup monitors all the traffic entering and exiting the network of an organization by placing a computer running IPS software at the organization's main Internet connection. Packets coming from the Internet enter the organization through this computer, which processes them before eventually letting them reach the organization's network and systems. Usually, packets are received by this computer through a regular NIC and processed at the CPU, which decides whether to stop each packet or let it reach the computer it is destined for through another regular NIC. This approach has the disadvantage that all packets have to reach, and be processed at, the CPU. This processing can consume a lot of CPU resources; therefore, if there are too many packets, the IPS software is forced to discard some of them, which affects the computers of the organization.

Our approach is to move the IPS, partially or completely, from the host CPU to the network processor (NP) on the NIC by using the network interface detailed in our paper [5]. Thanks to this network interface, we can control where to place the different parts of the IPS and evaluate how the placement affects the overall performance of the applications using the network.

The most widely used open-source IPS is Snort [15]. It is structured into three subsystems: a packet decoder, the detection engine, and the logging and alerting subsystem. The packet decoder subsystem analyzes every packet and, layer by (protocol) layer, extracts the different fields required by the set of rules that configure the behavior of the IPS. The detection engine determines which packet or connection matches which rule. The logging and alerting subsystem writes logs and warnings to help system administrators, or to be used by other software that can trigger actions based on these alerts. The two subsystems we are interested in here are the packet decoder and the alerting subsystem: both can be moved to a specialized network processor to relieve the host CPU of part of its IPS tasks.

In this paper, we compare Snort [15] with a prototype IPS implemented on the network processor in order to take advantage of its parallel microarchitecture. The comparison measures how much the performance of the legitimate traffic is affected when the IPS has to drop packets belonging to corrupted traffic. The corrupted traffic is the traffic that matches a set of rules, usually created to reflect attacks; the legitimate traffic is the rest of the traffic.
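To make this division of labor concrete, the following C sketch outlines the three-stage, per-packet structure just described (decode, match, alert). It is our own illustration, not Snort code: the types, the simplified IPv4/TCP decoder, and the single hard-coded rule are hypothetical stand-ins for the real subsystems.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Fields the packet decoder extracts, layer by layer */
struct pkt {
    uint32_t src_ip, dst_ip;
    uint8_t  proto;
    uint16_t dst_port;
    const uint8_t *payload;
    size_t   payload_len;
};

/* Packet decoder: minimal IPv4/TCP parser (no option validation,
 * no checksums; a real decoder handles many more cases) */
static bool decode(const uint8_t *ip, size_t len, struct pkt *p)
{
    if (len < 40 || (ip[0] >> 4) != 4 || (ip[0] & 0x0f) < 5)
        return false;
    size_t ihl = (size_t)(ip[0] & 0x0f) * 4;
    memcpy(&p->src_ip, ip + 12, 4);
    memcpy(&p->dst_ip, ip + 16, 4);
    p->proto = ip[9];
    if (p->proto != 6 || len < ihl + 20)      /* TCP only in this sketch */
        return false;
    const uint8_t *tcp = ip + ihl;
    p->dst_port = (uint16_t)((tcp[2] << 8) | tcp[3]);
    size_t doff = (size_t)(tcp[12] >> 4) * 4;
    if (len < ihl + doff)
        return false;
    p->payload = tcp + doff;
    p->payload_len = len - ihl - doff;
    return true;
}

/* Detection engine: one hard-coded rule standing in for the rule set;
 * it mimics "drop TCP to port 80 when the payload contains |beef|" */
static bool match_rules(const struct pkt *p)
{
    if (p->dst_port != 80)
        return false;
    for (size_t i = 0; i + 1 < p->payload_len; i++)
        if (p->payload[i] == 0xbe && p->payload[i + 1] == 0xef)
            return true;
    return false;
}

/* Logging and alerting subsystem */
static void alert(void)
{
    fprintf(stderr, "alert: rule hit, packet dropped\n");
}

/* Per-packet IPS decision: drop corrupted traffic, forward the rest */
bool ips_should_drop(const uint8_t *ip_packet, size_t len)
{
    struct pkt p;
    if (!decode(ip_packet, len, &p))
        return false;                 /* not IPv4/TCP: pass through */
    if (match_rules(&p)) {
        alert();
        return true;
    }
    return false;
}

Offloading, as studied in this paper, amounts to executing this kind of per-packet function at the NIC instead of at the host CPU.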
3 A multi-threaded IPS on a network processor

In this research work, we have used the Netronome NFE-i8000 card, equipped with an IXP2855 network processor. As can be seen elsewhere [2, 4], the IXP28XX series implements microarchitectures with a high level of parallelism, including several programmable processors: a general-purpose Intel XScale processor (a RISC architecture compatible with the ARM architecture) and 16 coprocessors optimized for packet processing, called MicroEngines. Moreover, each MicroEngine implements eight threads of execution that change context with little overhead (the cost of refilling the pipeline): each MicroEngine has registers and counters for each thread, so a quick context change from thread to thread is possible. This characteristic is essential for hiding latencies, and it should be applied efficiently to take advantage of the network processor parallelism.

To implement our IPS, we first developed a Linux kernel module (tested on versions 2.6.18 to 2.6.22) that implements a network interface exposing the four 1 Gbps ports of the NFE-i8000 card as virtual network interfaces (a schematic sketch of this kind of driver is given below). This network interface is detailed in our paper [5]. Once the network interface had been developed and stabilized, it became possible to place the IPS either at the MicroEngines or at the host. Figure 1 illustrates the different locations of the IPS: it can be either in the “Parallel IPS” box or in the “Application (IPS)” box of Fig. 1(a), which corresponds in Fig. 1(b) to the IPS running in the “NP-based NIC” box or in the “Host CPU” box, respectively. When running in the microcode at the MEs, the IPS checks every received packet that is to be sent to the host: the corrupted traffic is stopped, and the legitimate traffic follows the path to the host. When the IPS is executed by the host CPU, the host receives all packets, both normal and corrupted traffic. Its position closer to the network and the specialized hardware of the MicroEngines make the first alternative the candidate to give better results. This will be checked in the experimental results section, where we also study how the location of our IPS affects the processing of corrupted traffic with respect to the legitimate traffic.

Fig. 1 IPS location at the application level or in the microcode (a); and reception and transmission path in an NP (b)
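The driver itself is described in [5]; the sketch below is only a schematic illustration, written against the classic interface of the 2.6.18–2.6.22 kernels we used, of how one such virtual interface could be registered and how transmission would hand frames to the network processor. All names (nfe_*) are hypothetical, and the interaction with the MicroEngines is reduced to a comment.

#include <linux/init.h>
#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* One virtual interface standing in for one NFE-i8000 port
 * (the real module exposes all four ports) */
static struct net_device *nfe_dev;

static int nfe_open(struct net_device *dev)
{
    netif_start_queue(dev);
    return 0;
}

static int nfe_stop(struct net_device *dev)
{
    netif_stop_queue(dev);
    return 0;
}

/* Transmission entry point: here the real driver hands the frame
 * to the MicroEngines across the PCI Express interface */
static int nfe_xmit(struct sk_buff *skb, struct net_device *dev)
{
    /* ... enqueue skb->data on the NP transmit ring ... */
    dev->trans_start = jiffies;
    dev_kfree_skb(skb);
    return 0;
}

static int __init nfe_init(void)
{
    nfe_dev = alloc_etherdev(0);
    if (!nfe_dev)
        return -ENOMEM;
    nfe_dev->open = nfe_open;               /* pre-2.6.29 style ops */
    nfe_dev->stop = nfe_stop;
    nfe_dev->hard_start_xmit = nfe_xmit;
    random_ether_addr(nfe_dev->dev_addr);
    return register_netdev(nfe_dev);
}

static void __exit nfe_exit(void)
{
    unregister_netdev(nfe_dev);
    free_netdev(nfe_dev);
}

module_init(nfe_init);
module_exit(nfe_exit);
MODULE_LICENSE("GPL");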
Figure 1(b) provides a scheme of the communication path in the node through the NFE-i8000 card, including the IXP2855 network processor. The packets enter the network processor from the port through the MSF (Media Switch Fabric) (step 1); the MicroEngines move the data to the host memory through the I/O bus and the north bridge (steps 2–4), where the user-level application running on the CPU accesses it (steps 5 and 6). A similar path goes from the host memory to the corresponding port through the north bridge, the I/O bus, the MicroEngines, and the MSF (steps 7–10).

The microcode for this IPS is based on the one used for the network interface. One of the MicroEngines used for packet processing was modified to act as the IPS: it drops the packets that match a set of rules so that they never reach the CPU. The code has been written using the SDK provided by the manufacturer, which includes an ME assembler. This SDK allowed us to exploit the multithreading capabilities of the MEs in an explicit way: it is necessary to specify which code runs on which ME, and every thread within an ME runs the same code, although if instructions can be used to determine which thread runs which part of the code.
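The resulting style is "same code on every thread, branch on the thread identifier". As a rough analogue only (ours, not ME microcode), the pattern looks as follows in POSIX threads; on the MicroEngines, the eight contexts are hardware threads whose switches cost only the pipeline refill, with no operating system scheduler involved.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8   /* each MicroEngine runs eight hardware threads */

/* Every thread executes the same code; a branch on the thread id
 * selects its role, as the ME microcode does with if instructions */
static void *me_thread(void *arg)
{
    long id = (long)arg;
    if (id == 0) {
        /* thread 0: e.g., housekeeping / descriptor management */
        printf("thread %ld: control duties\n", id);
    } else {
        /* remaining threads: per-packet processing */
        printf("thread %ld: packet processing\n", id);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, me_thread, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}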
4 Experimental setup and results

To measure the performance of the offloaded network interface that supports our IPS, we used an experimental configuration based on three computers with Intel Xeon processors at 2 GHz, with 8 cores and 6 GB of DRAM each. One of the computers is equipped with the Netronome NFE-i8000 card, while the other two each have a standard (nonprogrammable) 1 Gb/s Ethernet card (Intel Pro/1000). We used the Netpipe benchmark [14] to measure the communication latency and throughput according to the type of packets used; we chose Netpipe because it is a well-known and widely used benchmark. One of the computers with a standard Ethernet card acts as the server of this benchmark, while the other is used to inject corrupted traffic.

The measurements were obtained using the microcode (the code that runs in the MicroEngines) provided by the card manufacturer, in which one MicroEngine is dedicated to packet transmission, another to packet reception, and two more to the PCI Express bus transfers. Moreover, we also took measurements exploring the parallelism that can be obtained by devoting more MicroEngines to a specific communication task: as detailed in our previous paper [5], the best results were obtained using two MicroEngines (up to 16 threads) for transmission and another two for reception.

The results obtained in the intrusion prevention system tests are very satisfactory. In our experiment, both corrupted and legitimate traffic are sent through the MicroEngines to the host. In the first configuration, the corrupted traffic is dropped at the MicroEngines, while in the second one the IPS activity is carried out at the host. The corrupted traffic matches a set of Snort rules, implemented both at the host and at the ME, while the legitimate traffic is the TCP traffic generated by the Netpipe communication benchmark [14]. We measure how the processing associated with the detection of the corrupted traffic affects the performance of this benchmark. The corrupted traffic is composed of HTTP and DNS queries, and the rules used in the MicroEngines and in the Snort IPS are the same. For simplicity, only part of the code used at the MicroEngines is detailed here:
;copy IP header from DRAM to transfer register
;extracted fields: proto, source_addr, ...
.if (source_addr == PEN_IP_ADDR)        ; first check: source ip address
  .if (dest_addr == IXP2_IP_ADDR)       ; 2nd: dest. ip address
    .if (proto == IPPROT_TCP)           ; ip protocol number for tcp
      .if (dest_port == 0x50)           ; port 80 for webserver
        .if ((data == 0xbeef) || (data == 0xcafe) || ...)
          ;in this case we drop the packet
          alu[dl_eop_buf_handle, --, b, IX_NULL]
          move[dl_next_block, IX_DROP]
          br[FAIL_LABEL]
        .endif
        ...
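Restated in C for readability (an illustration only; the deployed code is the microcode above, and the constant values shown here are placeholders for the testbed addresses):

#include <stdbool.h>
#include <stdint.h>

#define PEN_IP_ADDR   0xc0a86601u  /* sender address (placeholder)   */
#define IXP2_IP_ADDR  0xc0a86602u  /* 192.168.102.2 (protected host) */

/* Returns true when the MicroEngine must drop the packet */
bool drop_at_me(uint32_t source_addr, uint32_t dest_addr, uint8_t proto,
                uint16_t dest_port, uint16_t data)
{
    return source_addr == PEN_IP_ADDR        /* first check: source IP  */
        && dest_addr == IXP2_IP_ADDR         /* 2nd: destination IP     */
        && proto == 6                        /* IP protocol number: TCP */
        && dest_port == 0x50                 /* port 80, web server     */
        && (data == 0xbeef || data == 0xcafe);   /* payload patterns    */
}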
The corresponding rule used in Snort is the following:

drop tcp any any -> 192.168.102.2 80 (msg:"detected 0xbeef"; content:"|beef|"; sid:100000;)
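For illustration, corrupted traffic matching rules of this kind can be injected with a small socket program along the following lines. This is a hypothetical sketch, not the generator actually used in the experiments (which sustained HTTP and DNS query streams at rates of 100–1000 Mbps); it assumes a matching rule for UDP port 53.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Floods the protected host with DNS-like UDP packets carrying the
 * 0xbeef marker that the drop rules look for */
int main(void)
{
    int s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return 1;
    struct sockaddr_in dst = { .sin_family = AF_INET,
                               .sin_port   = htons(53) };
    inet_pton(AF_INET, "192.168.102.2", &dst.sin_addr);
    unsigned char payload[64] = { 0xbe, 0xef };  /* rest zero-filled */
    for (;;)                     /* rate control/termination omitted */
        sendto(s, payload, sizeof payload, 0,
               (struct sockaddr *)&dst, sizeof dst);
}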
Figure 2 presents a latency comparison between the two setups as the corrupted traffic rate varies from 100 Mbps to 1000 Mbps. When the corrupted traffic is stopped at the MicroEngines, the latency of the legitimate traffic is lower, hardly affected by sharing the same path with the corrupted traffic. When the IPS runs at the host (Snort), it performs similarly to the IPS running at the network processor (NP) for corrupted traffic rates of up to 600 Mbps. From 700 Mbps to 1000 Mbps of corrupted traffic, however, the IPS running at the host performs poorly in terms of latency, and the network benchmark does not even run to completion because of timeouts: the latency is significantly higher than when the IPS runs at the network processor, and corrupted traffic rates above 600 Mbps make the host IPS drop legitimate packets.

Fig. 2 Latency (IPS in the NP vs. IPS in the host)

The same behavior is observed when comparing the performance, in terms of throughput, of the network benchmark depending on the IPS placement (Fig. 3). If the corrupted packets are stopped at the MicroEngine level, the CPU of the host can give a better service to the legitimate traffic: when the IPS is located at the NIC, the legitimate traffic is not affected even beyond the 700 Mbps corrupted traffic limit.

Fig. 3 Throughput (IPS in the ME vs. IPS in the host)

We have also compared the performance of the network benchmark with and without corrupted traffic sent to the host. As expected, when only non-corrupted traffic is sent, performance is better, in terms of lower latency and higher throughput, for both implementation alternatives (IPS executed at the host CPU or running at the MicroEngines). Nevertheless, the difference between the performance with non-corrupted and with corrupted traffic is larger when the IPS is executed at the host CPU than when it runs at the MicroEngines. This is shown in Fig. 4, where there is almost no difference between the latencies with corrupted and non-corrupted traffic for an IPS executed in the MicroEngines, and in Fig. 5, where there is a clear difference in latency when the IPS runs at the host CPU. Thus, the processing of corrupted packets degrades the communication performance much more when it is done at the host CPU than at the network processor. The conclusion for the IPS processing is clear: running it on a network processor hardly affects the rest of the (legitimate) traffic, whereas implementing it at the host consumes many CPU cycles that cannot be used for other functions (such as processing normal traffic or computation).

Fig. 4 Latency (IPS in the ME)

Fig. 5 Latency (IPS in the host CPU)
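To make the latency metric concrete: Netpipe essentially measures the round-trip time of messages of increasing size between two hosts. A minimal ping-pong probe in the same spirit could look as follows; this is our own sketch against a hypothetical echo server (address and port invented), and all numbers reported above come from Netpipe itself.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define ROUNDS 1000
#define MSG    1024

int main(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port   = htons(5002) };  /* echo server */
    inet_pton(AF_INET, "192.168.102.3", &srv.sin_addr);
    if (connect(s, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect");
        return 1;
    }
    char buf[MSG];
    memset(buf, 0xa5, sizeof buf);
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ROUNDS; i++) {
        write(s, buf, sizeof buf);            /* ping */
        size_t got = 0;
        ssize_t n;
        while (got < sizeof buf &&            /* wait for the full echo */
               (n = read(s, buf + got, sizeof buf - got)) > 0)
            got += (size_t)n;
    }
    gettimeofday(&t1, NULL);
    double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("~%.1f us one-way for %d-byte messages\n", us / ROUNDS / 2, MSG);
    close(s);
    return 0;
}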
5 Previous work

The use of some of the processors available in a node to reduce the communication cost and to release cycles in the processor that executes the network application has been proposed in several works [10, 11, 13]. Besides offloading communication tasks to processors included in the network card, the possibility of using other general-purpose cores of a CMP for communication has also been proposed. This alternative, usually called onloading [12], has been marketed by Intel, along with other techniques, through the so-called I/OAT (I/O Acceleration Technology) [1]. The parallelization of network protocols and the use of the parallelism available in programmable network cards have also been proposed and analyzed in previous works [6, 9, 10, 16]. In [10], the performance effect of a proper allocation of application and network interface tasks among the cores of a multicore node communicating through 10 Gigabit/s Ethernet is considered. The benefits of different protocol stack parallelization strategies, for both message-level and connection-level parallelism, are discussed in [16]. In [8], a network interface is developed with the Radisys ENP-2505 card, which includes the Intel IXP1200 network processor (two generations earlier than the processor used in our work). In [3], an IXP network processor is used to implement an IDS, but only as a traffic splitter that distributes the traffic to several intrusion detection sensors. In [17] and [7], IXP network processors are used in an IDS to implement the rules that detect corrupted traffic; the authors focus on the string matching part of the detection rather than on a comparison with the host performance.
6 Conclusions

This paper describes and evaluates an intrusion prevention system based on a multithreaded network interface. This interface makes it possible to take advantage of the parallelism implemented in network processors to improve not only the latency but also the bandwidth of the legitimate traffic that shares the communication path with the corrupted traffic.
Compared with previous works, our contribution resides in the fact that we take advantage of the IXP28xx features. Moreover, with the multithreaded network interface we have developed, the IPS can be placed in two different positions (the MicroEngines or the host), so a better and fairer comparison can be established between the IPS processing done at a general-purpose host CPU and at a network processor. The benefit of placing the IPS close to the network, by using specialized network processors, is a latency several times lower and a higher bandwidth available to the legitimate traffic.

The main tasks for our future work are the analysis of other possible optimizations to the IPS processing, such as moving to the NP-based NIC the Snort stage that requires the most communication processing, along with the evaluation of the effect of these improvements on real communication applications. At this stage, the basic IPS functionality is carried out at the MicroEngines, and no other IPS software is running at the CPU. As future work, we plan to integrate both the CPU and the MicroEngines by running a complete Snort at the host that communicates with the MicroEngines: the MicroEngines would perform the packet decoding and detection, leaving the logging and alerting subsystem to the CPU. The results of our prototype show that there is a huge benefit to the legitimate traffic in moving the IPS from the CPU to the MicroEngines; therefore, a hybrid system, using the power of the network processor for certain tasks and the flexibility of the CPU for others, can become a high-performance combination.

Acknowledgements This work has been funded by projects TIN2007-60587 (Ministry of Science and Technology of Spain) and TIC01935 (Regional Government of Andalusia). Funding from NSF award CNS0709001 provided network processor development equipment.
References

1. Intel I/O Acceleration Technology. http://www.intel.com/technology/ioacceleration/
2. Intel network processors. http://www.intel.com/design/network/products/npfamily/
3. Bos H, Xu L, van Reeuwijk K, Cristea M, Huang K (2005) Network intrusion prevention on the network card. In: IXA education summit, Hudson, MA, USA, September 2005
4. Byrne J, Gwennap L (2005) A guide to network processors. The Linley Group, Mountain View
5. Cascón P, Ortega J, Haider WM, Díaz AF, Rojas I (2009) A multi-threaded network interface using network processors. In: Proc of the 17th Euromicro international conference on parallel, distributed, and network-based processing, February 2009
6. de Bruijn W, Bos H (2008) Model-T: rethinking the OS for terabit speeds. In: Computer communications workshops, 2008. INFOCOM. IEEE conference on, pp 1–6
7. Luo Y, Xiang K, Fan J, Zhang C (2009) Distributed intrusion detection with intelligent network interfaces for future networks. In: IEEE international conference on communications, Dresden, Germany, June 2009
8. Mackenzie K, Shi W, Mcdonald A, Ganev I (2003) An Intel IXP1200-based network interface. In: Proceedings of the workshop on novel uses of system area networks at HPCA (SAN-2 2003)
9. Willmann P, Brogioli M, Rixner S (2006) Parallelization strategies for network interface firmware. In: Proceedings of the workshop on optimizations for DSP and embedded systems
10. Narayanaswamy G, Balaji P, Feng W (2007) An analysis of 10-Gigabit Ethernet protocol stacks in multicore environments. In: Proceedings of the 15th annual IEEE symposium on high-performance interconnects. IEEE Comput Soc, Los Alamitos, pp 109–116
11. Ortiz A, Ortega J, Díaz AF, Prieto A (2010) Network interfaces for programmable NICs and multicore platforms. Comput Netw 54(3):357–376
12. Regnier G, Makineni S, Illikkal I, Iyer R, Minturn D, Huggahalli R, Newell D, Cline L, Foong A (2004) TCP onloading for data center servers. Computer 37(11):48–58
13. Shalev L, Makhervaks V, Machulsky Z, Biran G, Satran J, Ben-Yehuda M, Shimony I (2006) Loosely coupled TCP acceleration architecture. In: Proceedings of the 14th IEEE symposium on high-performance interconnects. IEEE Comput Soc, Los Alamitos, pp 3–8
14. Snell Q, Mikler A, Gustafson J, Helmer G (2007) NetPIPE: a network protocol independent performance evaluator. http://www.scl.ameslab.gov/netpipe/
15. Snort (2009) Snort open source network intrusion prevention and detection system (IDS/IPS). http://www.snort.org
16. Willmann P, Rixner S, Cox AL (2006) An evaluation of network stack parallelization strategies in modern operating systems. In: Proceedings of the USENIX '06 annual technical conference, Boston, MA. USENIX Association
17. Xinidis K, Anagnostakis K, Markatos E (2005) Design and implementation of a high-performance network intrusion prevention system. In: Security and privacy in the age of ubiquitous computing, pp 359–374
18. Zhao L, Luo Y, Bhuyan LN, Iyer R (2006) A network processor-based, content-aware switch. IEEE Micro 26(3):72–84