
High-performance Multi/Many-core Architectures with Shared and Private Queues: Network Processing Approaches

Reza Falamarzi 1, Bahram Bahrambeigy 2, Mahmood Ahmadi 1*, Amir Rajabzadeh 1

1 Computer Engineering Department, Razi University, Kermanshah, Iran
2 Information Technology Department, Islamic Azad University of Kermanshah, Kermanshah, Iran
{rezafalamarzi, bahramwhh}@gmail.com, {m.ahmadi, rajabzadeh}@razi.ac.ir

Abstract

Software solutions are often ineffective for network applications because of their low throughput. A hardware implementation on FPGA not only provides sufficient flexibility but also increases throughput considerably. In this paper, two multi-core architectures are proposed for Bloom filter and CRC, two main network processing core functions. These architectures are called the multi-core architecture with shared queue and the multi-core architecture with private queue. The proposed architectures are implemented for 1, 2, 4, 8 and 16 cores. Experimental results show that the multi-core architecture with private queue achieves higher throughput than the one with shared queue. Compared with the Bloom filter, the CRC application imposes less computational load and consequently achieves higher throughput. Moreover, the Bloom filter is implemented on GPU and CPU, and the results are compared with each other. When the number of packets in GPU memory is 16384, the GPU implementation using CUDA achieves a speedup of about 274 times over the CPU implementation. However, the FPGA results outperform the GPU: with 16 cores, the throughput of the first architecture (shared queue) and the second architecture (private queue) is almost 5.5 and 7.1 times higher than the GPU throughput, respectively.

Keywords: Bloom filter; Cyclic Redundancy Check (CRC); Field-Programmable Gate Arrays (FPGA); Graphics Processing Unit (GPU); Multi-core/Many-core Processors.

* Corresponding author

1- Introduction

In recent years, parallel architectures have become increasingly important for achieving higher performance and speed-up in a wide range of programs. According to Moore's law, the number of transistors on a chip doubles every 18 to 24 months. This growth in the number of transistors has increased the efficiency of on-chip hardware. However, the law concerns hardware only, and development in the software area lags far behind hardware development [1]. Advances in processor design no longer aim mainly at optimizing the serial performance of general-purpose processors. Instead, parallelism and the use of more cores are the main goals. The reason is that increasing the frequency of a single-core CPU leads to more power consumption, which is the main driver of the move toward multi-core technology [2],[3]. Multi-core processors are used in many areas of computing, such as signal processing, image processing, embedded systems and desktops. These processors can provide the required processing power at a reasonable level of power consumption.

Moreover, the Internet and other computer networks are growing rapidly. This growth is driven by the increasing number of services such as firewalls, Quality of Service (QoS), Virtual Private Networks (VPNs) and network security. As the number of users and services grows, more bandwidth is required. Furthermore, the hardware and software of next-generation network devices, especially routers, must be capable of supporting these services [4], [5]. Consequently, routers with high speed and high throughput that are capable of fast and accurate packet processing are essential.

Packet processing is widely used in network devices such as routers, switches and firewalls. Its main purpose is to provide services such as QoS, VPNs, network security and policy-based routing [6]. To provide these services, routers have to classify incoming packets according to one or more fields in the packet header. These fields can be the source/destination addresses, the protocol field, and the source/destination port numbers. Packets are classified against rule-sets: for each incoming packet, a search is performed in the rule-set to find the proper operation for that packet. Given the huge number of packets and of rules in a rule-set, this operation is extremely time-consuming. Earlier solutions to this problem relied on hash functions to accelerate the search, while a faster and more convenient solution is based on Bloom filters [7],[8]. A Bloom filter is a space-efficient data structure that can be used to check whether an input element (query) is a member of a data-set.

In heavy network processing tasks such as packet classification, custom cores (e.g. Bloom filter and CRC cores) can be applied to achieve higher performance. Field-Programmable Gate Arrays (FPGAs), as reconfigurable platforms, are well suited to implementing these cores, both to optimize memory usage and to accelerate the cores when they are used as network processors. In addition, the high arrival rate of network packets and the potential for processing individual packets in parallel motivate the use of multi/many-core architectures.

In this paper, high-performance multi-core architectures for network processing applications are proposed. The cores in these architectures can be selected from different network processing applications; here, Bloom filter and CRC cores are selected for implementation. Moreover, the Bloom filter is also implemented on a GPU as a many-core platform, and the hardware implementation of the Bloom filter-based processor on FPGA is compared with this GPU implementation. The contributions of this paper are as follows:

1- The proposal of two high-performance multi-core architectures (with shared and private queues) for network processing.
2- Hardware implementation of Bloom filter and CRC as network processing cores on these architectures using FPGA.
3- Software implementation of Bloom filter on GPU as a many-core architecture, for comparison with the proposed architectures.

The rest of this paper is organized as follows: related work is summarized in Section 2. An overview of Bloom filter, CRC and GPGPU is given in Section 3. In Section 4, the proposed architectures are explained. Implementation results are presented in Section 5. Finally, Section 6 concludes the paper.
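To make the cost of per-packet classification concrete, the rule-set search described in the introduction can be sketched as a linear scan over header fields. This is only an illustrative sketch: the `Packet` and `Rule` types, the `classify` function, the sample rules and the default action are hypothetical names and values chosen for the example, not part of the proposed architectures.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Packet:
    """Header fields commonly used for classification (hypothetical type)."""
    src: str
    dst: str
    proto: int
    sport: int
    dport: int


@dataclass(frozen=True)
class Rule:
    """One rule of a rule-set (hypothetical type); None acts as a wildcard."""
    src_net: str
    dst_net: str
    proto: Optional[int]
    dport: Optional[int]
    action: str

    def matches(self, p: Packet) -> bool:
        return (ip_address(p.src) in ip_network(self.src_net)
                and ip_address(p.dst) in ip_network(self.dst_net)
                and self.proto in (None, p.proto)
                and self.dport in (None, p.dport))


def classify(packet: Packet, rules: List[Rule], default: str = "drop") -> str:
    """Return the action of the first matching rule (linear search)."""
    for rule in rules:          # cost grows with the number of rules
        if rule.matches(packet):
            return rule.action
    return default              # assumed default action for unmatched packets


# Illustrative rule-set: deny SSH from 10.0.0.0/8, allow other TCP traffic.
RULES = [
    Rule("10.0.0.0/8", "0.0.0.0/0", 6, 22, "deny"),
    Rule("10.0.0.0/8", "0.0.0.0/0", 6, None, "allow"),
]
```

Because every packet may traverse the whole rule list, the work per packet grows with the rule-set size, which is the motivation given above for hash-based and Bloom filter-based lookups.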

2- Related Work

With recent improvements in multi-core systems, parallel packet processing has gained popularity. This section reviews some of the many works in the field of network processing, especially packet processing. In [9], an optimized template matching algorithm is proposed for multi-core platforms dedicated to network processing. The method is implemented on a platform of 16 MIPS cores. In [10], a packet classification method optimized for multi-core network processors is presented. The evaluation was done on the Intel IXP 2850 network processor. Moreover, multi-core processors have been studied in network security, where high-speed packet classification is needed [11], [12], [13]. The high throughput and performance of multi-core processors make them an excellent candidate. Bloom filters are also used in various computer network applications [8], especially in network security [14]. For example, it has been shown that a Bloom filter can speed up IP lookups in a software router [15]. In addition, much work has been done on cyclic redundancy codes, such as parallel implementations of CRC [16], [17].

In [26], a packet classification task that is deployed in virtual switches and exploits Bloom filter searches is implemented on GPU. The authors build GSwitch as a GPU-accelerated software switch. In [27], a 2-dimensional pipelined architecture for packet classification on FPGA is proposed; this architecture achieves high throughput while supporting dynamic updates. In this architecture, modular Processing Elements (PEs) are arranged in a 2-dimensional array. Each PE accesses its designated memory locally and supports prefix match and exact match efficiently. In [28], the authors propose a GPU-based multiple-pattern matching algorithm that filters malicious packets by using a Bloom filter to inspect the packet payload, leveraging the highly parallel computing power of the GPU. They compared the proposed algorithm with different GPU-implemented techniques and with the Bloom filter algorithm on different platforms, and achieved a 58x speedup in comparison to a single-thread platform.
In [29], a two-layer NIDS is proposed to improve the performance and processing capacity of the Snort NIDS. To accelerate Snort, the preprocessor and detection engine functions of the most frequently triggered rules are offloaded dynamically, on the fly, to NetFPGA. A Bloom filter technique is implemented on NetFPGA to match incoming packets against the offloaded rules. In [30], a hybrid computing architecture is proposed that enables communication between the Android OS and a traffic analysis hardware accelerator coexisting on the same chip. To this end, the proposed architecture is hosted on a new FPGA chip family, the Xilinx’s Zynq, a SoPC based on a dual-core ARM.

In [31], a distributed and parallel data statistical modeling algorithm is implemented within the MapReduce framework. The big data in a given unit block can be assigned to several distributed compute nodes, and a statistic combination strategy is introduced so that the intermediate results from each block can be combined into the global result for the entire dataset. In [32], a novel design and implementation of the MILC compression algorithm, denoted “Parallel MILC”, is proposed to exploit the power and capabilities of the parallel computing paradigm. The resulting algorithm can be executed on several heterogeneous device types that support the OpenCL framework, such as CPUs, GPUs and FPGAs. To achieve this, the authors redesigned the MILC compression strategy according to the OpenCL framework. The speedup achieved by the proposed algorithm ranges from 4 to 36 times over MILC. In [33], a high-level overview of existing parallel data processing systems is given. The systems are categorized by data input as batch processing, stream processing, graph processing and machine learning processing, and representative projects are introduced in each category. The authors survey batch-processing systems, including the general-purpose systems Dryad, Nephele/PACT and Spark. The SQL-like systems covered are Hive, Shark, SCOPE, AsterixDB and Dremel. For stream processing systems, Storm and S4 are introduced as representatives. Scalability is one of the bottlenecks of ML algorithms, and the survey discusses how graph-centric systems such as Pregel and GraphLab, and ML-centric systems such as Petuum, parallelize the graph and ML model, as well as their distinctive characteristics.

In the standard Bloom filter, different types of hash functions (e.g. CRC32 and CRC32C in this paper) are used to generate the k indexes to be set in the Bloom filter bit array. These functions can also be computed in parallel. Ma et al. [18] implemented the Bloom filter on GPU, specifically targeting a genome bio-sequence alignment application. In their design, long queries are split into multiple sub-queries, and the sub-queries are processed independently by the threads. The sub-query size influences two performance indicators, throughput and false positive rate. There is an inherent trade-off between them: changing the sub-query size in order to decrease the false positive rate reduces the throughput as well. The number of indexes and the number of sub-queries are limited. Furthermore, increasing the number of sub-queries increases the false positive rate. Therefore, the previous models cannot exploit the capacity for parallel execution of a large number of threads offered by modern GPUs. In our previous work [25], two high-performance architectures with shared and private queues were presented; in this work, the results and details of these architectures are highlighted.
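The idea of deriving k bit-array indexes from hash functions can be illustrated with a small software sketch. The paper uses CRC32 and CRC32C, but how its k indexes are derived is not stated in this section, so the code below makes its own assumptions: Adler-32 stands in for CRC32C (which is not in the Python standard library), and the k indexes are produced by double hashing of the two values. The class name `BloomFilter` and the parameters `m_bits` and `k` are hypothetical.

```python
import zlib
from typing import List


class BloomFilter:
    """Minimal Bloom filter sketch; not the paper's hardware design."""

    def __init__(self, m_bits: int, k: int) -> None:
        self.m = m_bits                           # size of the bit array
        self.k = k                                # number of indexes per item
        self.bits = bytearray((m_bits + 7) // 8)  # all bits start at 0

    def _indexes(self, item: bytes) -> List[int]:
        # Assumption: double hashing, index_i = (h1 + i * h2) mod m.
        h1 = zlib.crc32(item)
        h2 = zlib.adler32(item) | 1   # stand-in for CRC32C; odd step size
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: bytes) -> None:
        # Set the k bits selected by the hash values.
        for i in self._indexes(item):
            self.bits[i >> 3] |= 1 << (i & 7)

    def __contains__(self, item: bytes) -> bool:
        # The item may be a member only if all k bits are set.
        return all(self.bits[i >> 3] & (1 << (i & 7))
                   for i in self._indexes(item))


bf = BloomFilter(m_bits=1024, k=4)
bf.add(b"10.0.0.1:80")
print(b"10.0.0.1:80" in bf)   # an inserted item is always reported as present
```

A query for an item that was never inserted can still return True (a false positive), while an inserted item is never missed. Each of the k index computations is independent of the others, which is what makes them candidates for parallel evaluation.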

3- Background Fundamentals

In this section, Bloom filter and CRC, which are used primarily in the processing cores, are explained. Afterwards, a brief introduction to general-purpose computing on GPU (GPGPU) is presented.

3-1- Bloom Filter

A Bloom filter is a space-efficient randomized data structure for representing a data-set in order to support membership queries. Burton Bloom introduced Bloom filters in the 1970s [19]. A set S = {x1, x2, ..., xn} of n elements is represented by an array V of m bits that are initially all set to 0. A set of k independent hash functions h1, h2, ..., hk (each with an output range between 1 and m) is used to set k bits in array V at positions h1(x), h2(x), ..., hk(x) for every x in S. More precisely, for each element x ∈ S, the bits at positions hi(x) are set to 1 for 1 ≤ i ≤ k. A location can be set to 1 multiple times. To verify whether an item y is a member of the set S, the same set of hash functions is used to determine hi(y) (for 1