An Efficient Parallel String Matching Algorithm Based on DFA
Yujian Fan, Hongli Zhang, Jiahui Liu, and Dongliang Xu
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
[email protected],
[email protected], {hitljh,xudongliang}@pact518.hit.edu.cn
Abstract. Classical string matching algorithms face a great challenge in speed due to the rapid growth of information on the Internet. Meanwhile, multi-core CPUs have become widespread. However, classical string matching algorithms do not adapt flexibly to multi-core CPUs, which not only limits run-time speed but also wastes CPU resources. In this paper, we propose a parallel string matching algorithm based on DFA that addresses this problem. By classifying the patterns according to their first letter, all CPU cores can work at the same time without conflict. Experiments demonstrate that the algorithm performs well whether the hit rate is high or low.
Keywords: string matching, parallel algorithm, DFA.
1 Introduction
String matching is a classic problem in computer science and an important part of many fields, such as pattern recognition, virus scanning, content filtering, search engines, and network intrusion detection systems (NIDS). In many systems, the speed of the string matching algorithm has become the bottleneck. For example, in an NIDS, pattern matching is the most time-consuming part. Its role is to detect attacks by comparing the patterns in the rules with the network traffic. The paper [1] shows that pattern matching accounts for 30% of the running time of an NIDS, and that this share can rise to 80% under intensive traffic. Thus, improving the speed of string matching has become a key issue for the performance of these systems.

Meanwhile, multi-core CPUs have become increasingly popular; each core has a separate cache, and the cores share the memory. However, traditional string matching algorithms do not make full use of multi-core CPUs, which not only limits run-time speed but also wastes CPU resources. In this paper, we propose a parallel string matching algorithm based on deterministic finite automata (DFA). It uses the first letter of each pattern to classify the patterns set and builds automata that coexist in memory. All CPU cores can work independently and collaboratively.

The rest of the paper is organized as follows. In Section 2, we summarize related work on string matching. In Section 3, we describe our parallel string matching algorithm and a method to compress its memory. Section 4 presents experimental results comparing our algorithm with the AC algorithm. Finally, Section 5 concludes the paper.

Y. Yuan, X. Wu, and Y. Lu (Eds.): ISCTCS 2012, CCIS 320, pp. 349–356, 2013. © Springer-Verlag Berlin Heidelberg 2013
2 Related Work
The problem of string matching is to find the number of times each pattern appears in a text T. Throughout the paper, P = {P1, P2, …, Pk} denotes the patterns set, and k is the number of patterns. According to the way matching proceeds, Navarro et al. [2] divided string matching algorithms into three categories: algorithms based on prefix matching, algorithms based on suffix matching, and algorithms based on substring matching. One representative algorithm of each category is described below.

1. AC algorithm [3], based on prefix matching. The AC algorithm puts all patterns in P into one automaton and uses it to scan the text T. The scan starts from the initial state and moves to the next state according to the current state and the next character read. Each time it reaches a new state, it determines whether a match has occurred.
2. Wu-Manber algorithm [4], based on suffix matching. The Wu-Manber (WM) algorithm builds on the idea of the BM algorithm [5], and its main feature is jumping. It creates a scan window whose length equals the length of the shortest pattern. Within the window, the scan direction is from right to left. It searches for the longest common suffix of the window and the text T. It can skip portions of the text when it meets bad characters, which improves the scanning speed.
3. SBOM algorithm [6], based on substring matching. The scan window of the SBOM algorithm is similar to that of WM, and the scan direction within the window is also from right to left. However, it uses the Factor Oracle data structure to check whether the text T contains a substring of the patterns.

Besides these traditional algorithms, many parallel string matching methods have been proposed in recent years. The main kinds are as follows.

1. Cut the Text. The text T is divided into T = {T1, T2, …, TN} according to the number of CPU cores, each part is assigned to one core for matching, and the results are then summarized. This may cause problems, such as: (a) matches may be lost or duplicated where two sections of the text meet, which makes the results inaccurate; (b) if the hit rate is higher in some sections of the text than in others, the running time depends on the slowest section, which destroys the load balance of the parallel run.
2. Classify Patterns Set P. A classification is established according to a characteristic of the patterns set P, such as the first letter of each pattern. Each core processes the same text T with different patterns, and the matching results are summarized at the end. This approach also has drawbacks. For example, if the patterns are classified by the first letter, there may be 26 categories, whereas a CPU commonly has 2 or 4 cores. The number of categories is then far greater than the number of cores, so each core has to search the same text T many times with different pattern categories.
3. Choose Different Algorithms for Different Patterns. The paper [7] maintains that the speed of a string matching algorithm mainly depends on the number and the minimal length of the patterns. Its authors proposed a heuristic algorithm that uses dynamic programming and greedy techniques to divide the patterns set and choose an optimal string matching algorithm for each part. It balances the running time across the cores in order to minimize the total time.
4. Rely on the Hardware. Devices such as field-programmable gate arrays (FPGA) [8] or graphics processing units (GPU) [9] have structures that allow parallel string matching algorithms to run on them. However, the hardware resources are limited, so the application scenarios are restricted.

We develop a parallel string matching algorithm extended from the approaches above.
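To make problem (a) of the text-cutting approach concrete, the following minimal Python sketch counts the occurrences of a single pattern in a text that is split into chunks. The function names `count_occurrences` and `count_in_chunks`, and the overlap-based repair, are illustrative assumptions and are not taken from any of the methods cited above.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, i))


def count_in_chunks(text, pattern, n_chunks, overlap=0):
    """Count matches chunk by chunk, as if each chunk went to one core.

    A match is credited to the chunk in which it starts. With overlap=0,
    a match that straddles a chunk boundary is lost. With
    overlap=len(pattern)-1, each chunk is extended far enough to see
    such a match, and it is still counted only once.
    """
    size = max(1, -(-len(text) // n_chunks))  # ceiling division, at least 1
    total = 0
    for start in range(0, len(text), size):
        chunk = text[start:start + size + overlap]
        # Only consider matches that start inside the chunk's own region.
        for i in range(min(size, len(chunk) - len(pattern) + 1)):
            if chunk.startswith(pattern, i):
                total += 1
    return total


text = "aaabcaaa"                       # "abc" straddles the middle boundary
print(count_occurrences(text, "abc"))
print(count_in_chunks(text, "abc", 2))
print(count_in_chunks(text, "abc", 2, overlap=2))
```

Without overlap the straddling match is missed; extending each chunk by the pattern length minus one recovers it, at the cost of reading some characters twice.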
3 Parallel String Matching Algorithm Based on DFA
In this section, we present our parallel string matching algorithm based on DFA (PSMBD). We first outline a classic string matching algorithm using DFA in Section 3.1. We introduce the PSMBD algorithm in detail in Section 3.2. A theoretical analysis is presented in Section 3.3. Finally, we give a method to compress the memory in Section 3.4.

3.1 Description of String Matching Algorithm Using DFA
Among string matching algorithms using DFA, the AC algorithm is the most classic. In its pre-processing phase, the patterns set P is converted into a deterministic finite automaton in which each state is identified by a number. The input is the text T and the patterns set P; the output is the number of times each pattern occurs. The matching process is as follows: read the text T in order and, starting from the initial state, move to the next state according to the current state and the current input character. After each transition, check whether there is a match and record it. This is the procedure of the AC algorithm. It follows that when a character of T is read, the current location in the automaton depends on the characters read before it and is therefore not known in advance, so the cores cannot process the text in parallel. Our algorithm solves this problem.
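The procedure just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it stores transitions in dictionaries with failure links rather than in a full transition matrix, it assumes the patterns are distinct and non-empty, and the names `build_ac` and `ac_count` are hypothetical.

```python
from collections import deque


def build_ac(patterns):
    """Build goto, failure, and output tables for a set of patterns."""
    goto = [{}]      # goto[state][char] -> next state; state 0 is initial
    output = [[]]    # output[state] -> patterns recognized at this state
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(p)

    fail = [0] * len(goto)
    queue = deque(goto[0].values())   # depth-1 states fail to state 0
    while queue:                      # breadth-first over the trie
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Patterns ending at the failure state also end here.
            output[nxt] = output[nxt] + output[fail[nxt]]
            queue.append(nxt)
    return goto, fail, output


def ac_count(text, patterns):
    """Return how many times each pattern occurs in text."""
    goto, fail, output = build_ac(patterns)
    counts = {p: 0 for p in patterns}
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for p in output[state]:
            counts[p] += 1
    return counts


print(ac_count("ushers", ["he", "she", "his", "hers"]))
```

The variable `state` is carried from one character to the next, which is the sequential dependency that prevents the cores from starting at arbitrary positions of the text.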
3.2 PSMBD Algorithm
The main idea of the PSMBD algorithm is to classify the patterns set by first letter and to put the patterns that share a first letter into one DFA. These DFAs coexist in memory so that all CPU cores can access them independently. In the matching process, each core enters the corresponding DFA according to the character it reads. The main phases are as follows.

Pre-processing phase: read the patterns and insert them into different DFAs according to their first letters. The automata can be stored as matrices. Unlike the AC automaton, this phase does not need to consider whether one pattern is a prefix, suffix, or substring of another; the matching process avoids these situations, which saves pre-processing time. After this phase, the number of DFAs equals the number of distinct first letters, and the DFAs coexist in memory so that each CPU core can access them independently.

Matching process: each CPU core reads the text T in order. The character read is regarded as the initial character, and the core enters the corresponding DFA according to it. Within the DFA, the core moves from the initial state to the next state according to the characters read next. Each time it reaches a state, it checks whether a match has occurred and records it. One search does not complete until it returns to the initial state or the number of transitions exceeds the maximum length of all patterns in the patterns set.

3.3 Theoretical Analysis
Lemma 1. When a CPU core enters the corresponding DFA according to the character read, all patterns that begin with this character and occur at the current location of the text T are found.

Proof. In the matching process, after a CPU core enters the corresponding DFA according to the character read, it moves to the next state according to the next character read. After each transition, it checks whether a match has occurred. The search in the DFA does not finish until it returns to the initial state, which indicates a failed match, or until the number of transitions exceeds the maximum length of all patterns in the patterns set, which also indicates a failed match because no pattern is longer than that maximum. After this search, all patterns that begin with this character and occur at the current location of the text T have been found.

Lemma 2. Every character in the text T enters the corresponding DFA as the initial character.

Proof. In the matching process, each CPU core reads the text T in order and then enters the corresponding DFA. After one search is complete, the cores continue to read the rest of the text T in order, which ensures that every character in the text T enters the corresponding DFA as the initial character.
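The two phases of Section 3.2, and the per-position search that Lemmas 1 and 2 reason about, can be sketched in Python as follows. Several choices here are assumptions made for illustration: the DFAs are nested dictionaries rather than transition matrices, `"$end"` is a hypothetical marker key for accepting states, the patterns are assumed distinct and non-empty, and the start positions are divided among workers as contiguous ranges, which the text above does not specify.

```python
from concurrent.futures import ThreadPoolExecutor


def build_dfas(patterns):
    """Pre-processing: one trie-shaped DFA per distinct first letter."""
    dfas = {}
    for p in patterns:
        node = dfas.setdefault(p[0], {})
        for ch in p[1:]:
            node = node.setdefault(ch, {})
        node.setdefault("$end", []).append(p)  # hypothetical accept marker
    return dfas


def search_from(text, start, dfas):
    """Return every pattern that begins at text[start] (Lemma 1)."""
    node = dfas.get(text[start])    # enter the DFA for the initial character
    found = []
    pos = start + 1
    while node is not None:         # None plays the role of the initial state
        found.extend(node.get("$end", []))
        if pos >= len(text):
            break
        node = node.get(text[pos])
        pos += 1
    return found


def match_range(text, lo, hi, dfas):
    """Use each position in [lo, hi) as an initial character (Lemma 2)."""
    counts = {}
    for start in range(lo, hi):
        for p in search_from(text, start, dfas):
            counts[p] = counts.get(p, 0) + 1
    return counts


def psmbd_count(text, patterns, workers=4):
    """Count pattern occurrences with start positions split among workers."""
    dfas = build_dfas(patterns)
    step = max(1, -(-len(text) // workers))   # ceiling division
    ranges = [(lo, min(lo + step, len(text)))
              for lo in range(0, len(text), step)]
    total = {p: 0 for p in patterns}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda r: match_range(text, r[0], r[1], dfas), ranges)
        for part in parts:
            for p, c in part.items():
                total[p] += c
    return total


print(psmbd_count("ushers", ["he", "she", "his", "hers"], workers=3))
```

Because every worker sees the whole text, a search that starts near the end of one range may read past it, so no match is lost at range boundaries. In a trie the depth never exceeds the longest pattern, so the explicit transition bound is implicit in this sketch. Threads in the standard CPython interpreter typically do not execute CPU-bound code in parallel, so the sketch illustrates the partitioning and its correctness rather than the speedup.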
Lemma 3. In the parallel environment, this algorithm produces neither duplicate matches nor lost matches.

Proof. Every character in the text T enters the DFA as the initial character of patterns exactly once, so no duplicate match is produced. In one search, each CPU core reads a single initial character and searches for all patterns that begin with this character, so no match is lost.

Theorem 1. All patterns that occur in the text T are found.

Proof. According to Lemma 1, Lemma 2, and Lemma 3, every character in the text T enters the DFA as the initial letter exactly once, and within the DFA all patterns that begin with this character are found. Therefore, all patterns, whatever their first letters, that occur in the text T are found in the matching process.

3.4 Compress Transition Matrix
The PSMBD algorithm uses a matrix data structure to store the DFAs, so the state table may contain many zero states, each of which means jumping to the initial state. Such a sparse matrix wastes a lot of space. Because the data structure used in PSMBD is similar to that used in the AC algorithm, the memory compression method for the AC algorithm can be used to compress the transition matrix of PSMBD. Norton proposed a matrix format named banded-row that reduces the storage space [10]. It deletes the unnecessary space at both ends of each row and still allows random access to the valid data.
Before compression: 0 0 0 2 4 0 0 0 6 0 7 0 0 0 0 0 0 0 0 0
Banded-row format: Number of Items: 8; Start Index: 4; Valid Values: 2 4 0 0 0 6 0 7
After compression: 8 4 2 4 0 0 0 6 0 7
Fig. 1. Banded-Row Format
The banded-row format stores the elements from the first non-zero value to the last non-zero value. To keep random access to the data, we need to record the number of data elements and the starting index of the data, as shown in Figure 1.
Before compression, the number of items is 20. After conversion to banded-row format, the number is halved, which decreases the required memory. By recording the number of data elements and the starting index, we can still access the valid data randomly. This adjustment should be applied to each row of the transition matrix, and it greatly reduces the total storage space occupied.
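The conversion in Figure 1 and the random access it preserves can be sketched in Python as follows. The function names `to_banded_row` and `lookup` are hypothetical, the start index is taken as 1-based so that the output agrees with the figure, and encoding an all-zero row as `[0, 0]` is an assumption of this sketch.

```python
def to_banded_row(row):
    """Compress one transition-matrix row to [count, start, values...].

    start is the 1-based column of the first non-zero entry, as in Fig. 1.
    An all-zero row is encoded as [0, 0] (an assumption of this sketch).
    """
    nonzero = [i for i, v in enumerate(row) if v != 0]
    if not nonzero:
        return [0, 0]
    first, last = nonzero[0], nonzero[-1]
    return [last - first + 1, first + 1] + list(row[first:last + 1])


def lookup(banded, col):
    """Return the entry at 1-based column col; outside the band it is 0."""
    count, start = banded[0], banded[1]
    offset = col - start
    if 0 <= offset < count:
        return banded[2 + offset]
    return 0


row = [0, 0, 0, 2, 4, 0, 0, 0, 6, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0]
banded = to_banded_row(row)
print(banded)              # the row of Fig. 1 after compression
print(lookup(banded, 9))   # column 9 of the original row
```

A lookup costs one subtraction and one bounds check, so access within a row remains constant-time after compression.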
4 Experimental Results
In the experiment, the text and the patterns are generated randomly. The length of the input text is 10 MB, and the numbers of patterns are {10000, 20000, 40000, 80000}. The operating system is Windows 7. The CPU is an Intel i5-2300, which has four cores. The RAM is 4 GB, and the compiler is Visual Studio 2008. We compared the PSMBD algorithm with the AC algorithm.
[Figure: Time (s) versus No. of Patterns (×10^4), for AC and PSMBD]
Fig. 2. Comparison on a higher hit rate
Figure 2 shows that, in the case of a higher hit rate, the running time of the PSMBD algorithm grows more slowly than that of the AC algorithm as the size of the patterns set increases, and its time is far lower than that of the AC algorithm. This experiment suggests that, in the case of a higher hit rate, PSMBD is at best more than three times as fast as the AC algorithm.
[Figure: Time (s) versus No. of Patterns (×10^4), for AC and PSMBD]
Fig. 3. Comparison on a lower hit rate
Figure 3 shows that, in the case of a lower hit rate, the time spent by the PSMBD algorithm stays basically flat as the size of the patterns set increases, and is far lower than that of the AC algorithm. For the AC algorithm, the time rises gradually as the size of the patterns set increases. This experiment suggests that, in the case of a lower hit rate, PSMBD is at best more than three times as fast as the AC algorithm.
[Figure: Memory (×10^3 K) versus No. of Patterns (×10^4), for PSMBD and Banded-Row]
Fig. 4. Comparison between the original PSMBD and the one with Banded-Row
Figure 4 shows that the memory occupied by the PSMBD algorithm increases by a large margin as the size of the patterns set increases. When the banded-row format is applied to the data structures, the memory occupied by the algorithm is clearly lower, and it grows by a smaller increment and at a gentler rate.
5 Conclusion
We proposed a parallel string matching algorithm for multi-core CPUs. It inserts patterns into different DFAs according to their first letters. These DFAs coexist in memory, so that each CPU core can work independently and without conflict. Each CPU core enters the corresponding automaton according to the character it reads in order to complete the matching. Experiments show that, under the same conditions and whether the hit rate is high or low, the PSMBD algorithm can be more than three times as fast as the AC algorithm.

Acknowledgment. This work is partially supported by the National Grand Fundamental Research 973 Program of China (Grant No. 2011CB302605), the High-Tech Research and Development Plan of China (Grant No. 2010AA012504, 2011AA010705), and the National Natural Science Foundation of China (Grant No. 61173145).
References
1. Fisk, M., Varghese, G.: An Analysis of Fast String Matching Applied to Content-based Forwarding and Intrusion Detection. Technical Report CS2001-0670, University of California, San Diego (2002)
2. Navarro, G., Raffinot, M.: Flexible Pattern Matching in Strings: Practical On-line Search Algorithms for Texts and Biological Sequences. Cambridge University Press (2002)
3. Aho, A.V., Corasick, M.J.: Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM 18(6), 333–340 (1975)
4. Wu, S., Manber, U.: A Fast Algorithm for Multi-pattern Searching. Technical Report TR94-17 (1994)
5. Boyer, R.S., Moore, J.S.: A Fast String Searching Algorithm. Communications of the ACM 20(10), 762–772 (1977)
6. Commentz-Walter, B.: A String Matching Algorithm Fast on the Average. In: Proc. the 6th Colloquium on Automata, Languages and Programming, Graz, Austria, pp. 118–132 (1979)
7. Tan, G., et al.: Revisiting Multiple Pattern Matching Algorithms for Multi-core Architecture. Journal of Computer Science and Technology 26(5), 866–874 (2011)
8. Sidhu, R., Prasanna, V.K.: Fast Regular Expression Matching using FPGAs. In: Proc. the 9th Ann. IEEE Symp. Field-Programmable Custom Computing Machines, Rohnert, USA, pp. 227–238 (2001)
9. Qiao, G., et al.: A Graphics Processing Unit Based Multi-string Matching Algorithm for Anti-virus Systems. Energy Systems and Electrical Power, 8864–8868 (2011)
10. Norton, M.: Optimizing Pattern Matching for Intrusion Detection (2004), http://docs.idsresearch.org/OptimizingPatternMatchingForIDS.pdf