IP Routing Processing with Graphic Processors

Shuai Mu¹, Xinya Zhang³, Nairen Zhang², Jiaxin Lu², Yangdong Steve Deng¹, Shu Zhang⁴
¹Tsinghua University, ²University of Wisconsin-Madison, ³Fudan University, ⁴NVidia Corporation
{mus04ster, nairan.wisc, ljx1110, yangdong.deng}@gmail.com, [email protected],
[email protected]

Abstract—Throughput and programmability have always been central, but generally conflicting concerns for modern IP router designs. Current high performance routers depend on proprietary hardware solutions, which make it difficult to adapt to ever-changing network protocols. On the other hand, software routers offer the best flexibility and programmability, but can only achieve a throughput one order of magnitude lower. Modern GPUs offer significant computing power, and their data-parallel computing model matches the typical patterns of packet processing on routers well. Accordingly, in this research we investigate the potential of CUDA-enabled GPUs for IP routing applications. As a first step toward exploring the architecture of a GPU based software router, we developed GPU solutions for a series of core IP routing applications such as IP routing table lookup and pattern matching. For the deep packet inspection application, we implemented both a Bloom-filter based string matching algorithm and a finite automata based regular expression matching algorithm. A GPU based routing table lookup solution is also proposed in this work. Experimental results prove that the GPU can accelerate routing processing by one order of magnitude. Our work suggests that, with proper architectural modifications, GPU based software routers could deliver significantly higher throughput than previous CPU based solutions.

Keywords—GPU; CUDA; router; table lookup; deep packet inspection; Bloom filter; DFA
I. INTRODUCTION
Since the invention of the Internet, Internet traffic has been rising exponentially. With the wide adoption of on-line voice, audio, video, TV, and gaming applications, Internet traffic will keep growing at an accelerated rate in the future [1]. Internet routers, therefore, have to deliver ever-increasing processing capacity. Besides traditional router applications like routing table lookup and packet classification, modern routers often need to perform data-intensive applications such as intrusion detection [2], which certainly pose new challenges to router throughput. To keep pace with the growth of Internet traffic, current high performance routers depend on custom hardware to meet the ever-increasing demand for throughput (e.g., [3] and [4]). On the other hand, it is extremely difficult for proprietary router hardware to meet the requirement for programmability [5]. The custom hardware solutions make it difficult for core routers to adapt to fast-changing network protocols. The incompatibility among different router solutions also incurs significant overhead in router configuration and management [6]. To improve programmability and re-configurability, today's hardware routers are increasingly using programmable network processors (NPs) to perform data-plane routing applications such as table lookup, packet classification and deep packet inspection (DPI) [7]. However, NPs are hard to
program due to the lack of mature programming models and software development tools, as well as the incompatibility of architectures [8]. The situation is further worsened by the relatively small size of the developer community for NPs. Pure software routers, in contrast, perform all significant processing steps (i.e., per-packet protocol processing, route lookup, forwarding) in software running on commodity server/PC platforms (e.g., [9] and [10]). Such a pure software solution certainly offers the best flexibility and extensibility. Unfortunately, the processing throughput of today's software routers is still lower than that of proprietary routers by at least one order of magnitude. In fact, current software routers do not scale beyond the 1-3 Gbps range, while currently available carrier-grade equipment starts at 40 Gbps and scales as high as 92 Tbps [11]. As a result, current software routers can only be deployed in enterprise level networks. Modern multi-core platforms are offering significant computing power. It is thus appealing to leverage this strong processing capability for router applications. As a matter of fact, there are already a few papers exploring different architectures [12-15] to promote the performance of software routers. In industry, Intel has been exploring packet processing solutions on their multi-core CPUs [16]. Recently, graphics processing units (GPUs) have been rising as an exciting new trend in high-performance computing. While multi-core CPUs generally exploit task level parallelism, GPU computing is based upon a data parallel programming model. When receiving a workload, a GPU launches tens of thousands of fine-grain threads concurrently, with each thread executing the same program but on a different chunk of data. The latest GPUs can deliver a peak floating-point throughput of around 1 TFLOPS, which is higher than the fastest CPU by a factor of 30. Meanwhile, GPU programming has been made accessible to non-graphics programmers with the introduction of NVidia's Compute Unified Device Architecture (CUDA) technology [17]. Based on our analysis of typical NP architectures [18], it can be concluded that NPs actually bear a lot of similarities to GPUs. They both integrate a number of identical processing elements and utilize a single instruction multiple data (SIMD) programming model. Accordingly, it is attractive to deploy a GPU in a software router to serve as a dedicated packet processor. The benefits of such an approach are multifold. First, it is feasible to leverage GPUs' strong computing power for a higher packet processing throughput. Second, the software router is completely assembled with off-the-shelf hardware, and the open architecture guarantees cost efficiency. Third, GPUs are backed by relatively mature software development tools because they address a mass market. In [19] and [20], GPUs were already used to address the network intrusion problem, and impressive speedups over CPUs were attained.
Figure 1. Diagram of the proposed GPU based software router
In this work, we seek a systematic solution for a GPU based software router. The proposed software router architecture is illustrated in Figure 1. This is a purely PC based router in which an NVidia GPU serves as a packet processing engine. The GPU, together with its memory, is installed on a graphics card, which is plugged into the motherboard through a PCIe bus [21]. The north-bridge, i.e., the memory controller, has two 16-lane PCIe interfaces, with one connected to the graphics card and the other to network interface cards (NICs). Each NIC needs 4 lanes, and thus up to 4 NICs can be installed. Currently, we use the main memory as the buffer for packet storage, but in an ongoing research effort we are investigating effective mechanisms for direct communication between a NIC and the GPU. As a first step toward a GPU based software router, in this paper we explore GPU solutions for two key IP routing applications, IP routing lookup and network intrusion detection. For the IP routing lookup problem, we selected a typical longest prefix matching algorithm [22] based on a radix tree. For the network intrusion problem, we have implemented both a Bloom-filter based string matching algorithm and an Aho-Corasick [23] style regular expression matching algorithm. Our results prove that the GPU can accelerate packet processing by one order of magnitude. The speedup suggests that GPUs can be deployed in a software router to achieve good computing throughput and programmability at the same time. The rest of this paper is organized as follows. In section II, we review the hardware architecture of NVidia GPUs and the corresponding data parallel programming model. In section III, we present our GPU solutions for deep packet inspection based on the Bloom filter and deterministic finite automata (DFA), respectively. We introduce our GPU implementation details for IP routing lookup in section IV. The performance results are evaluated in section V. Finally, we conclude and propose future research directions in section VI.
II. OVERVIEW OF CUDA ARCHITECTURE
In this work, we use NVidia's CUDA platform [17] to develop GPU based router implementations. Here we briefly review the CUDA hardware architecture and programming model.
A. Hardware Architecture

The architecture of NVidia's latest flagship GPU chip, the GT200, is illustrated in Figure 2. The main computing resource is organized as an array of 240 streaming processors (SPs), evenly distributed into 30 streaming multiprocessors (SMs). With complete instruction fetching and decoding hardware, an SM dispatches instructions to its 8 internal SPs. Two special functional units (SFUs) and a double precision unit are also installed inside each SM. The SFUs have dedicated logic for mathematical functions, while the double precision unit is a new feature introduced with the GT200 series GPUs. Given proper computing and data accessing patterns, the peak floating-point throughput of the GT200 series can reach 933 GFLOPS [17].
Figure 2. NVidia GPU architecture
During GPU computing, data is stored in the so-called global memory, i.e., the video memory integrated on the graphics card. Although the memory bus can deliver a bandwidth of 141.7 GB/s, the latency of accessing the global memory is still high, in the range of 400-800 cycles. Today's GPUs are enhanced with a memory coalescing mechanism such that accesses to adjacent memory addresses by neighboring processing elements can be merged into a single operation. Such a blocked accessing mechanism
significantly boosts the effective memory bandwidth by taking advantage of the parallel architecture of the memories. Every SM is equipped with a 16-KB shared memory, which can provide up to 16 4-byte words of data in one clock cycle. It is completely software-controlled, so frequently used data can be placed close to the computing resources without repeatedly suffering the global memory latency.

B. CUDA Programming Model

A CUDA program is composed of code running on both the CPU and the GPU. The GPU code is executed concurrently by the GPU as coordinated by the CPU. A function called by the CPU but executed on the GPU is called a kernel. One CUDA program can have multiple kernels executed sequentially. According to the CUDA model, a GPU application can launch tens of thousands of threads, with each running the same program on different data sets. A thread is the minimum unit of parallel execution and its internal code runs sequentially. A number of threads are organized into thread blocks in a 1-D, 2-D, or 3-D manner. The arrangement should match the problem structure so as to simplify programming. The threads inside a block can exchange data through the shared memory and synchronize with one another. A kernel is composed of a grid of thread blocks arranged as a 1-D or 2-D array.
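For illustration, the following minimal CUDA sketch (not part of our router code; the kernel, data size, and block size are arbitrary assumptions) shows how the host launches a kernel over a 1-D grid of 1-D thread blocks, with each thread operating on one data element:

#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n) {
    // Each thread handles one element; blockIdx/threadIdx give its
    // position inside the 1-D grid of 1-D thread blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main() {
    const int n = 1 << 20;                 // illustrative data size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // ... copy input data from the host with cudaMemcpy ...

    int threadsPerBlock = 256;             // threads form a block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // blocks form the grid
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);       // kernel launch by the CPU
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}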
III. NETWORK INTRUSION DETECTION
To meet the demand for sophisticated services such as intrusion detection, traffic shaping, and quality of service, network intrusion detection techniques are increasingly deployed in modern network devices. The central task here is signature matching, which checks whether network payloads contain pre-supplied signatures at line rates. For example, Cabrera et al. [24] reported that signature matching alone accounts for approximately 60% of the processing time in the Snort Intrusion Prevention System [2]. The critical operation in signature matching is to search a predefined rule set of strings and/or regular expressions in a text, which is generally a packet in the context of DPI. Many efficient algorithms have been proposed to solve this problem (please refer to [25] for a thorough treatment of this subject). We developed efficient GPU implementations for both the string and regular expression matching problems. In the remainder of this section, we first briefly introduce the algorithms used in this work. Then the details of the implementation and optimization are presented.
A. Algorithm description

1) Bloom filter
A Bloom filter, first introduced in the 1970s [26], is a generalization of the hash table. It provides a simple, space-efficient data structure to represent a data set for fast membership queries. The basic data structure of a Bloom filter is a bit array, the so-called Bloom vector, with each bit initialized to 0. In a preprocessing step, a series of hash functions is evaluated on each string in a given rule set. Each hash function returns an integer, h, and the h-th bit of the Bloom vector is set to 1. With the pre-computed table, the checking process can be straightforwardly realized by computing the hash functions on each string in a packet and then comparing the results with the corresponding entries in the Bloom vector. A match happens when all those entries are equal to 1. Of course, it is possible to have false positives with such an approach, but the probability of such errors can be controlled by properly choosing the type and number of hash functions. Obviously, the Bloom filter algorithm is very appropriate for a GPU implementation since the evaluations on different strings are completely independent. Meanwhile, the algorithm is compute bound when a large number of hash functions need to be evaluated.

2) Aho-Corasick (AC)
Although the Bloom filter is very efficient for string matching, it can only deal with traditional string rules. With the increasing risk of network attacks, regular expression based patterns are also widely adopted as the standard for expressing signatures. In this work, we use the classical Aho-Corasick [23] algorithm for this purpose. The AC algorithm uses a Deterministic Finite Automaton (DFA) [27] to recognize patterns of regular expressions. A DFA is created in a preprocessing step as a directed graph in which each node represents a unique state and each edge a possible transition between states. An efficient data structure for the DFA is a 2-dimensional transition table, where each row corresponds to a state and each column represents a different letter of the acceptable alphabet. Matching a given string begins at a start state and then follows the corresponding transitions from state to state as each byte of the input is read. The DFA based technique again provides sufficient data level parallelism if we check multiple packets in parallel. However, an efficient GPU implementation is challenging because the transition table has to be accessed in every step, and thus such an algorithm is memory bound.

B. Implementation
The overall structures of our Bloom filter and AC implementations are similar, except that they use different data structures to store the rule set, so we discuss their implementations together. The construction of the Bloom vector and the transition table is performed on the CPU because it is not suitable for parallel computing and only needs to be executed once. As illustrated in Figure 3, the matching process consists of three steps: transferring packets to the GPU, the pattern matching processing on the GPU, and copying the results back to the CPU.
Figure 3. Workflow of the matching process
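As a concrete illustration of the Bloom-filter membership test described in the algorithm description above, the following device-side sketch probes a bit-packed Bloom vector with two placeholder hash functions; the hash family, the number of hash functions, and the vector size are illustrative assumptions rather than our actual choices:

// Assumed Bloom vector size in bits (illustrative only).
#define BLOOM_BITS (1 << 20)

__device__ unsigned int hash1(const unsigned char *s, int len) {
    unsigned int h = 5381;              // djb2-style placeholder hash
    for (int i = 0; i < len; ++i) h = h * 33 + s[i];
    return h % BLOOM_BITS;
}

__device__ unsigned int hash2(const unsigned char *s, int len) {
    unsigned int h = 2166136261u;       // FNV-style placeholder hash
    for (int i = 0; i < len; ++i) h = (h ^ s[i]) * 16777619u;
    return h % BLOOM_BITS;
}

// Returns 1 if all probed bits are set, i.e. a (possibly false-positive) match.
__device__ int bloom_query(const unsigned int *bloom_vec,
                           const unsigned char *s, int len) {
    unsigned int b1 = hash1(s, len);
    unsigned int b2 = hash2(s, len);
    int hit1 = (bloom_vec[b1 >> 5] >> (b1 & 31)) & 1;   // word index, bit index
    int hit2 = (bloom_vec[b2 >> 5] >> (b2 & 31)) & 1;
    return hit1 & hit2;
}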
In router applications, it is of key importance to efficiently move packets from an input port to a processing unit and then to an output port. On our target software router, arriving packets are queued in a buffer located in the CPU main memory. To use a GPU for the packet processing, the simplest approach would be to transfer each packet individually to GPU memory for timely handling. However, to reduce the overhead of GPU data transfers, a better way is to batch many small transfers into a larger one instead of making each transfer separately. Here the number of packets in a batch has an obvious influence on the transfer efficiency. Through experiments, we found that a batch size of ~3000 packets delivers the smallest packet transfer time on average. We used page-locked memory to store the packets on the CPU so that the bandwidth between CPU memory and GPU memory can be maximized. We also adopted NVidia GPUs' streaming transfer mechanism [17] so that packet transfer and processing can be overlapped. It should be noted that the batched transfer might not guarantee the quality of service for certain packets, because they have to wait until there are enough packets in a batch. One of our ongoing projects is to devise a better packet scheduling mechanism by properly modifying the GPU hardware architecture. The Bloom vector and transition table also have to be transferred into GPU memory. Unlike packets, they are determined by the given rule set and do not need frequent updates, so we store them in the GPU texture memory, which is automatically cached, to reduce the memory access latency. In our GPU implementations, the Bloom vector is stored as a one-dimensional array, and the transition table as a two-dimensional array. Once the packets have been transferred to the GPU, pattern matching operations are performed concurrently on the processing elements. Intuitively, we could assign one thread to each packet. Here a major concern is load balancing: the size of packets varies from several bytes to over 1 KB. Since on a GPU every 32 threads in a warp have the same instruction execution schedule, the processing time of a warp is actually determined by the slowest thread. Accordingly, a more efficient way is to divide each packet into smaller chunks of equal size so that each thread has exactly the same workload. However, as indicated in [20], we might miss some matched texts if they happen to be cut into different chunks. To eliminate such a possibility, every two neighboring chunks overlap by a length equal to that of the longest pattern.
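The chunk-per-thread organization just described can be sketched as follows; the identifiers, the flat state-by-byte transition table layout, and the handling of packet boundaries (ignored here for brevity) are simplifying assumptions rather than our exact implementation:

// Each thread scans one fixed-size chunk of the packed payload buffer through
// the DFA; neighboring chunks overlap by (maxPatLen - 1) bytes so that
// patterns straddling a chunk boundary are not missed. The grid is sized by
// the host to cover ceil(totalBytes / chunkSize) chunks.
__global__ void dfa_match_chunks(const unsigned char *payloads,  // packed packet payloads
                                 int totalBytes,
                                 const int *transTable,           // numStates x 256 transition table
                                 const unsigned char *isAccept,   // 1 if a state is accepting
                                 int chunkSize, int maxPatLen,
                                 int *matchFlags)                 // one flag per chunk
{
    int chunk = blockIdx.x * blockDim.x + threadIdx.x;
    int start = chunk * chunkSize;
    if (start >= totalBytes) return;

    // Extend the scan into the next chunk by maxPatLen - 1 bytes (the overlap).
    int end = min(start + chunkSize + maxPatLen - 1, totalBytes);

    int state = 0;                        // DFA start state
    int found = 0;
    for (int i = start; i < end; ++i) {
        state = transTable[state * 256 + payloads[i]];
        found |= isAccept[state];
    }
    matchFlags[chunk] = found;
}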
IV. ROUTING TABLE LOOKUP FOR PACKET FORWARDING
Routing table lookup is the central application of a router. A routing table records the mapping from destination address prefixes to output ports. Upon receiving a packet, a router looks up the table to determine to which port the packet should be forwarded. In modern routers, the routing table lookup is realized through a longest prefix match (LPM) technique, for which many efficient algorithms have been proposed [28]. To the best of our knowledge, this work is the first to port a routing table lookup procedure to GPUs. We extended a radix tree based LPM algorithm, which is taken from RouterBench [29] and is already deployed in Berkeley Unix [30], for GPU implementation. The routing table is organized as a tree
structure in which a node represents a given state in the search process and edges correspond to the values of bits in the destination IP address. A matching process is actually a traversal of a given path in the tree structure. Unlike the AC approach, where the state transitions can be stored in a 2-D array, today's routing tables are very large and only a tree-like sparse data structure is feasible. The overall flow of our routing table lookup procedure is similar to that used in DPI. The routing table is calculated and managed on the CPU. When the routing table is constructed or updated, it is transferred to GPU memory. The parallel organization is then trivial: we use one thread to process the destination IP of one packet. In the original radix tree, pointers are used to manage tree nodes and edges. Such pointer chasing operations are extremely difficult on GPUs. One key problem is that the pointers inside a tree structure cannot be directly copied to GPU memory. In our implementation, we propose a modified data structure called the "portable routing table" (PRT), which uses displacements instead of pointers for tree operations. The routing table is stored in the texture memory to leverage the GPU's on-chip cache. The released code of RouterBench uses a radix tree for routing table storage. In a radix tree, an edge can be labeled with multiple bits as long as no ambiguity is incurred; in other words, a state transition can be triggered by multiple bits at a time. This mechanism proved to be effective for CPU implementations. For the GPU implementation, however, we find that a simpler, so-called "trie" structure [31] actually performs better. A trie is similar to a radix tree, but each transition is activated by a single bit (i.e., a radix tree actually compresses the paths of a trie). It incurs almost 50% fewer memory accesses than a radix tree. As a result, we used a trie in our GPU implementation.
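The displacement-based PRT idea can be sketched as follows; the node layout and field names are illustrative assumptions rather than our exact data structure:

// Trie nodes live in one flat array and reference their children by array
// indices (displacements), so the whole table can be copied to GPU memory
// without any pointer fix-up.
struct PrtNode {
    int child[2];     // displacement of the 0-child / 1-child, -1 if absent
    int nextHop;      // output port for the prefix ending here, -1 if none
};

// One thread per packet: walk the trie bit by bit along the destination IP
// and remember the last node that carried a next hop (longest prefix match).
__global__ void lpm_lookup(const unsigned int *dstAddrs, int numPackets,
                           const PrtNode *trie, int *outPorts)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPackets) return;

    unsigned int ip = dstAddrs[p];
    int node = 0;                        // displacement of the root node
    int bestPort = -1;
    for (int bit = 31; bit >= 0 && node >= 0; --bit) {
        if (trie[node].nextHop >= 0)
            bestPort = trie[node].nextHop;   // longest prefix matched so far
        node = trie[node].child[(ip >> bit) & 1];
    }
    if (node >= 0 && trie[node].nextHop >= 0)
        bestPort = trie[node].nextHop;       // check the final node as well
    outPorts[p] = bestPort;
}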
V. RESULTS
In this section, we report the performance evaluation of our prototype GPU-based IP routing processing system. We developed both CPU and GPU implementations for table lookup, the Bloom filter, and the AC algorithm. All experiments are performed on a Linux PC with a 3.33-GHz Core 2 Duo processor, an NVidia GTX280 graphics card, and 4 GB of RAM. All programs are compiled with CUDA release 2.2 [17]. For deep packet inspection, our string rule sets are randomly picked from Snort [2], and the processed packets are collected from the China educational network using a program we wrote on the basis of the libpcap tool [32]. We also use a full payload trace [33] captured at an access link to replay real network traffic. For routing table lookup, the packet traces were downloaded from the Routing Information Service (RIS) [34], a RIPE NCC project that collects and stores real-world routing data from the Internet, and from the Finnish University and Research Network (FUNET) [35], which provides a snapshot of the Internet traffic induced by both university and student activities.

A. Deep packet inspection
We first evaluated the scalability of our DPI implementations with regard to the number of patterns. We use a synthetic payload trace that contains 1663K TCP packets with an average packet size of 96 bytes. The patterns are randomly
collected from Snort rule sets. Figures 4 and 5 show the performance results of the CPU, GPU processing plus memory transfer, GPU processing only, and GPU processing plus page-locked memory transfer, respectively. From Figure 4, we can see that the throughput of GPU based Bloom filter processing is about 19 Gbit/s, which is more than 30 times faster than the CPU (0.6 Gbit/s). Even counting the packet transfer time (which can be hidden through streamed transfers and/or architectural enhancements), a 5X speedup is still attainable in this case. We can also use the page-locked memory technique to effectively hide the transfer time. It can be seen that the throughput of the page-locked GPU version is almost the same as that of the GPU kernels without considering transfer time, about 30 times faster than the CPU for the Bloom filter. For the DFA based AC implementation, the page-locked GPU kernel can achieve a throughput of up to 3.2 Gbit/s, about 5 times faster than the CPU (0.6 Gbit/s).
Figure 4. Throughput sustained for Bloom filter

Figure 5. Throughput sustained for AC

Another fact revealed in Figures 4 and 5 is that, when the number of rules increases, the performance of both the CPU and the GPU remains almost the same. In the case of the Bloom filter, the false positive ratio will increase if the size of the Bloom vector remains unchanged. For the DFA, as the number of regular expressions increases, the transition table may become too large. On the other hand, the transition table is very sparse, and thus corresponding techniques can be used to accelerate the processing (e.g., [36]). Figure 6 illustrates the performance comparison for varying packet sizes (assuming all packets have the same size). Clearly, the CPU performance remains almost constant, while the GPU performance of the DFA improves rapidly and approaches a peak throughput of 9.2 Gbit/s, which is more than 15 times faster than the CPU.

Figure 6. Throughput for different packet sizes

Table I compares our implementations with previously published results. Our GPU based AC implementation is about 15 times faster than the best CPU solution. It also outperforms Gnort [20], which was the first work to use a GPU for network intrusion detection. At the same time, our GPU based Bloom filter is faster than the CPU solution by a factor of 30X. In addition, its throughput is almost twice that achievable by an FPGA [37].

TABLE I. PERFORMANCE COMPARISON FOR RELATED WORK

Hardware                  Algorithm               Throughput (Gbit/s)
nVidia GTX280 GPU         AC                      9.3
nVidia GTX280 GPU         Bloom filter            19.2
FPGA [37]                 AC with Bloom filter    12.9
nVidia 8600GT GPU [20]    Gnort AC                1.4
P4 3.4GHz [20]            Snort AC                0.6

B. Routing table lookup
A large number of real-world packet traces can be downloaded from RIS [34] and FUNET [35]. We randomly picked several traces and their route tables. Table II shows the performance of GPU route lookup on these traces. The GPU time reported includes both GPU processing and data transfer between the CPU and the GPU. Our GPU implementation achieves an over 6X speedup. Note that this is a memory-intensive application, which is challenging on GPUs.
TABLE II. PERFORMANCE COMPARISON FOR ROUTING LOOKUP

Packet trace    #entries of route table    #packets of trace    CPU time (ms)    GPU time (ms)    Speedup
FUNET           41709                      99840                22670            3459             6.6
RIS 1           243667                     121465               24875            3827             6.5
RIS 2           573810                     144908               25637            4135             6.2
VI. CONCLUSION AND FUTURE WORK
In this paper, we developed efficient GPU implementations for a series of key router applications, including IP routing table lookup and pattern matching for network intrusion detection. Our results prove that GPUs can accelerate packet processing by
up to one order of magnitude. To the best of our knowledge, this work is the first to perform routing table lookup and Bloom filter based string matching on GPUs. At the same time, our DFA based regular expression matching procedure also outperforms previous results. The speedup reveals that GPUs can be deployed in a software router to deliver high throughput and good programmability. Nevertheless, with the high processing throughput of GPUs, the limited bandwidth between the GPU and the CPU main memory has become a bottleneck for packet processing. We are addressing the problem through three correlated approaches. First, we are using a detailed GPU microarchitecture simulator to study how to enhance the memory hardware and scheduling mechanisms of GPUs for faster packet delivery. In addition, we are exploring different integration schemes so that the GPU can better communicate with the network interface card. For instance, one possible solution would be to build the GPU and the network interface onto the same card. Finally, 3-D VLSI technology (e.g., [38] and [39]) is offering new possibilities to integrate heterogeneous devices into a PC platform. We will investigate such a 3-D software router solution for future high performance network applications.

REFERENCES
[1] M. Shin and Y. Kim, "New Challenges on Future Network and Standardization," Advanced Communication Technology, vol. 1, pp. 754-759, Feb. 2008.
[2] The Snort Project, Snort users manual 2.8.0, http://www.snort.org/docs/snort/manual/2.8.0/snort manual.pdf.
[3] Cisco, Cisco Carrier Routing System, http://www.cisco.com/en/US/products/ps5763/.
[4] Juniper, Juniper Networks T Series Core Routers Architecture Overview, http://www.juniper.net/us/en/products-services/routing/t-tx-series/.
[5] P. Saif, P. W. Anderson, A. Degangi and P. Agarwail, "Gigabit routing on a software-exposed tiled-microprocessor," Architecture for Networking and Communications Systems, pp. 51-60, Oct. 2005.
[6] A. Bianco and R. Birke, "Control and management plane in a multi-stage software router architecture," High Performance Switching and Routing, pp. 15-17, May 2008.
[7] G. Varghese, Network Algorithmics, Elsevier/Morgan Kaufmann, 2005.
[8] C. Kulkarni, M. Gries, C. Sauer and K. Keutzer, "Programming challenges in network processor deployment," Architecture and Synthesis for Embedded Systems, pp. 178-187, Nov. 2003.
[9] E. Kohler, et al., "The Click Modular Router," ACM Transactions on Computer Systems, vol. 18, pp. 263-297, Aug. 2000.
[10] M. Handley, E. Kohler, A. Ghosh, O. Hodson and P. Radoslavov, "Designing extensible IP router software," Symposium on Networked Systems Design & Implementation, pp. 189-202, May 2005.
[11] W. Eatherton, "The push of network processing to the top of the pyramid," keynote at ANCS, http://www.cesr.ncsu.edu/ancs/keynotes.html, 2005.
[12] C. Partridge, et al., "A 50-Gb/s IP router," IEEE/ACM Transactions on Networking, vol. 6, pp. 237-248, June 1998.
[13] J. S. Turner, P. Crowley, et al., "Supercharging PlanetLab: a high performance, multi-application, overlay network platform," ACM SIGCOMM Computer Communication Review, vol. 37, pp. 85-96, 2007.
[14] K. Argyraki, S. Baset, et al., "Can software routers scale?" Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, pp. 21-26, 2008.
[15] R. Bolla, R. Bruschi, and A. Ranieri, "Performance and power consumption modeling for green COTS software routers," Communication Systems and Networks and Workshops, pp. 1-8, Jan. 2009.
[16] J. Trank, Packet Processing with Intel Multi-core Processors, Solution Brief, 2008.
[17] NVidia, CUDA Programming Guide, CUDA Driver, Toolkit, and SDK code samples, http://www.nvidia.com/object/cuda_get.html, 2009.
[18] P. Crowley, P. Z. Onufryk and M. A. Franklin, Network Processor Design: Issues and Practices, Morgan Kaufmann, 2002.
[19] R. Smith, N. Goyal, J. Ormont, K. Sankaralingam and C. Estan, "Evaluating GPUs for network packet signature matching," Performance Analysis of Systems and Software, pp. 175-184, April 2009.
[20] G. Vasiliadis, S. Antonatos, et al., "Gnort: High Performance Network Intrusion Detection Using Graphics Processors," Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, pp. 116-134, Sep. 2008.
[21] PCI-SIG, PCI Express Base Specification, Revision 1.0, 2002.
[22] H. J. Chao, "Next generation routers," Proceedings of the IEEE, vol. 90, pp. 1518-1558, Sept. 2002.
[23] A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search," Communications of the ACM, vol. 18, pp. 333-340, June 1975.
[24] J. B. Cabrera, J. Gosar, W. Lee and R. K. Mehra, "On the statistical distribution of processing times in network intrusion detection," IEEE Conference on Decision and Control, vol. 1, pp. 75-80, Dec. 2004.
[25] M. Crochemore, C. Hancart, and T. Lecroq, Algorithms on Strings, Cambridge University Press, 2007.
[26] B. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, pp. 422-426, Jul. 1970.
[27] J. E. Hopcroft, R. Motwani, and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, 2000.
[28] H. J. Chao and B. Liu, High Performance Switches and Routers, Wiley, 2007.
[29] Y. Luo, L. Bhuyan, and X. Chen, "Shared Memory Multiprocessor Architectures for Software IP Routers," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 12, Dec. 2003.
[30] K. Sklower, "Tree-Based Packet Routing Table for Berkeley Unix," Proceedings of the Winter Usenix Conference, pp. 93-99, 1991.
[31] E. Fredkin, "Trie Memory," Communications of the ACM, vol. 3, no. 9, Sept. 1960.
[32] The libpcap/tcpdump project, http://www.tcpdump.org/.
[33] A. Turner, Tcpreplay, http://tcpreplay.synfin.net/trac/.
[34] RIPE NCC Routing Information Service raw data, http://www.ripe.net/projects/ris/rawdata.html.
[35] FUNET router traces, http://www.nada.kth.se/%7Esnilsson/public/code/router/Data/.
[36] Y. Deng, B. Wang, and S. Mu, "EDA Applications on GPUs," Proceedings of the International Conference on Computer-Aided Design, 2009.
[37] A. Mitra, W. Najjar, and L. Bhuyan, "Compiling PCRE to FPGA for Accelerating SNORT IDS," ACM/IEEE Symposium on Architectures for Networking and Communications Systems, Orlando, FL, December 2007.
[38] Y. Xie, G. Loh, B. Black, and K. Bernstein, "Design Space Exploration for 3D Architecture," ACM Journal on Emerging Technologies in Computing Systems, vol. 2, no. 2, pp. 65-103, April 2006.
[39] Y. Deng and W. Maly, A New 3-Dimensional VLSI Scheme: 2.5-Dimensional Integration, Springer, Jan. 2010.