EXPERIMENTAL RESULTS ON ALGORITHMS FOR MULTIPLE KEYWORD MATCHING

Charalampos S. Kouzinopoulos
Parallel and Distributed Processing Laboratory
Department of Applied Informatics, University of Macedonia
156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece
email: [email protected]

Konstantinos G. Margaritis
Parallel and Distributed Processing Laboratory
Department of Applied Informatics, University of Macedonia
156 Egnatia str., P.O. Box 1591, 54006 Thessaloniki, Greece
email: [email protected]
ABSTRACT

Multiple keyword matching is a basic problem in computer science: locating all the appearances of a finite set of keywords (the so-called "patterns") inside an input string (the so-called "text"). This paper presents experimental results for the well known Commentz-Walter, Wu-Manber, Set Backward Oracle Matching and Salmela-Tarhio-Kytöjoki multiple keyword matching algorithms. The algorithms are compared in terms of running time on random texts over a binary alphabet and on texts over the English alphabet, and the experimental results are validated against the theoretical complexity.

KEYWORDS

Algorithms, Multiple Pattern Matching, String Searching
1. INTRODUCTION

Multiple keyword matching is a basic problem in computer science. It is a variant of string matching for multiple keywords that is commonly used to locate all the appearances of a finite set of keywords (the so-called "patterns") inside an input string (the so-called "text"). Given an input string T = t1t2...tn over an alphabet Σ and a finite set of r keywords P = {p1, p2, ..., pr}, where all keywords have a length of m characters and the total size of all keywords is denoted by |P|, the multiple keyword matching problem can be defined as the problem of reporting all locations i in T where there is an exact occurrence of any keyword (Navarro, 2002).

The multiple keyword matching problem has many applications across several areas of scientific computing, including information retrieval, data filtering, DNA sequence matching, virus detection and packet classification in intrusion detection systems. Multiple keyword matching algorithms are also used by many well known tools: Snort (Snort, 2010) uses a variant of the Aho-Corasick algorithm to perform intrusion detection, the Wu-Manber algorithm is utilized by Agrep (Wu, 1992) to perform fuzzy string searching, while a Commentz-Walter-like algorithm is used by GNU Grep (GNU Grep, 2010) when searching for occurrences of multiple keywords in a text.

The simplest solution to the multiple keyword matching problem is to apply a single keyword matching algorithm iteratively on the input string, once for each keyword. While frequently used in the past, this technique is not efficient when a large keyword set is involved. The aim of all modern multiple keyword matching algorithms is to scan the input string T in a single pass to locate the occurrences of all keywords. These algorithms are based on single keyword matching algorithms, with some of their functions generalized to process multiple
keywords simultaneously during the preprocessing phase, generally with the use of trie structures and hashing.

Aho-Corasick (Aho, 1975), a variant of the Knuth-Morris-Pratt algorithm, was the first algorithm to solve the multiple keyword matching problem in linear time. Commentz-Walter (Commentz-Walter, 1979) later combined Aho-Corasick with the Boyer-Moore algorithm to improve its performance. A simpler variant of Commentz-Walter is Set Horspool (Navarro, 2002), which directly extends the Horspool algorithm to multiple keywords by using a generalized bad character function. Wu-Manber (Wu, 1994) is another algorithm based on the Horspool algorithm; it reads the input string in blocks to effectively increase the size of the alphabet and then applies a hashing technique to reduce the necessary memory space. The Salmela-Tarhio-Kytöjoki (Salmela, 2006) variants of the Horspool, Shift-Or (Baeza-Yates, 1992) and BNDM (Navarro, 1998) algorithms combine a technique called q-grams, which also increases the alphabet size, with hashing to perform multiple keyword matching. The Set Backward Oracle Matching (SBOM) algorithm (Navarro, 2002) is a natural extension to multiple keywords of the Backward Oracle Matching algorithm, a string matching algorithm that uses a factor oracle to locate matches in an input string. Finally, Kim and Kim introduced in (Kim, 1999) a multiple keyword matching algorithm that also takes the hashing approach.

This paper presents experiments on the running time of the well known Commentz-Walter, Wu-Manber, SBOM and Salmela-Tarhio-Kytöjoki algorithms in a uniform way for different types of data and compares the experimental findings with the expected theoretical complexity. A detailed analysis of the multiple pattern matching algorithms presented in this paper, additional experiments on different types of data, as well as a study of the preprocessing time and the memory requirements of the algorithms can be found in (Kouzinopoulos, 2010).
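As a point of reference for the algorithms above, the naive iterative approach can be sketched in C. This is a minimal sketch; the function name and interface are illustrative and not taken from the paper's implementation:

```c
#include <string.h>

/* Naive baseline: run a single keyword scan once per keyword and
 * count every position in the text where some keyword occurs.
 * This is the one-scan-per-keyword behaviour, with cost growing in
 * both n and the total keyword size |P|, that modern multiple
 * keyword matching algorithms avoid by scanning the text once. */
static long naive_multi_search(const char *text, long n,
                               const char **keywords, int r)
{
    long matches = 0;
    for (int k = 0; k < r; k++) {
        long m = (long)strlen(keywords[k]);
        for (long i = 0; i + m <= n; i++)
            if (memcmp(text + i, keywords[k], (size_t)m) == 0)
                matches++;
    }
    return matches;
}
```

The inefficiency is visible in the structure: the text pointer is rewound to the start for every keyword, so a large keyword set multiplies the scanning work.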
This paper is organized as follows. Section 2 describes the experimental methodology used to compare the algorithms and Section 3 discusses the experimental results of the algorithms. Finally, Section 4 presents some conclusions as well as ideas for future research.
2. EXPERIMENTAL METHODOLOGY

The experiments were executed locally on an Intel Core 2 Duo CPU with a 3.00GHz clock speed, 2 GB of memory, 64 KB of L1 cache and 6 MB of L2 cache. The operating system used was Ubuntu Linux and during the experiments only the typical background processes ran. To decrease random variation, the time results were averaged over 100 runs. The algorithms presented in the previous section were implemented in the ANSI C programming language and were compiled using the GCC 4.4.3 compiler with the "-O2" and "-funroll-loops" optimization flags.

To compare the keyword matching algorithms, the practical running time was used as a measure. The practical running time is the total time in seconds an algorithm needs to find all occurrences of the keywords in an input string, including any preprocessing time, and was measured using the MPI_Wtime function of the Message Passing Interface.

The data set was a subset of the one used in (Lecroq, 2007). It consisted of randomly generated texts and natural language texts:
• The CIA World Fact Book from the Large Canterbury Corpus. The text had a size of n = 2.473.400 and an alphabet of size 94.
• Randomly generated texts of size n = 4.000.000 with a binary alphabet.

Two keyword sets were used. The first was created randomly using the same alphabet as the text, resulting mostly in mismatches (noted as "misses" in the Figures), while the second was created from randomly chosen subsequences of each text, with a match occurring every m positions (noted as "hits" in the Figures). The keyword sets had sizes ranging between 100 and 100.000 keywords and all keywords had a length of m = 8 or m = 32 characters.
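The timing methodology above can be sketched as follows. The paper measured time with MPI_Wtime; this sketch substitutes the standard C clock() function, which measures processor time rather than wall time, so that the sketch stays within portable ANSI C. The search_fn type and function names are illustrative assumptions, not the paper's harness:

```c
#include <time.h>

typedef void (*search_fn)(void);  /* illustrative: one full search run */

/* Average the running time of a search over several repetitions, as
 * done in the paper to decrease random variation.  The measured
 * interval includes the preprocessing phase.  NOTE: the paper used
 * MPI_Wtime (wall time); clock() measures CPU time and is used here
 * only to keep the sketch self-contained. */
static double average_runtime(search_fn search, int runs)
{
    double total = 0.0;
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();     /* preprocessing + matching start */
        search();
        total += (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    return total / runs;
}
```

Averaging over many runs, as the paper does with 100 repetitions, smooths out interference from background processes and cache effects.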
3. EXPERIMENTAL RESULTS

In the previous section the experimental methodology was discussed. In this section, the performance of the algorithms is evaluated for different sets of data.

Figure 1. Running time of the algorithms for English texts
Figures 1 and 2 present the running time of the algorithms for an English alphabet and a binary alphabet, for keyword lengths of m = 8 and m = 32, for both keyword sets and for 100 to 100.000 keywords, including preprocessing. As can generally be seen from the Figures, varying parameters such as the size and the type of the keyword set as well as the length of the keywords can affect the performance of the algorithms in different ways. The experimental study showed that no algorithm is the best for all values of the problem parameters.

The Commentz-Walter algorithm had the best performance compared to the rest of the algorithms when used on texts with a binary alphabet with a keyword length of m = 32. It is important to note that Commentz-Walter had the worst performance when used on texts with an English alphabet with a small keyword length and a lot of matches, and thus it is not recommended for scanning these types of data. As the size of the keyword set increased, the running time of Commentz-Walter increased linearly on all types of data, as expected from its theoretical complexity.

Wu-Manber had an average performance on texts with an English alphabet and the worst performance on texts with a binary alphabet. Although linear in |P|, the running time of the algorithm was affected more than that of Commentz-Walter when the size of the keyword set increased. The performance of the Wu-Manber
algorithm was also affected by the amount of exact matches in the text, especially when texts with a binary alphabet were used.

Figure 2. Running time of the algorithms for random binary texts
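The block-reading idea behind Wu-Manber can be illustrated by the construction of its SHIFT table. Below is a hedged C sketch assuming blocks of B = 2 characters, keywords of equal length m, and an illustrative hash function and table size; these parameter choices are not taken from the paper:

```c
#define B 2              /* block size: characters read at a time */
#define TABLE_SIZE 4096  /* hash table size; illustrative value */

/* Hash a block of B = 2 characters into the SHIFT table. */
static unsigned block_hash(const char *block)
{
    return (((unsigned char)block[0] << 5) ^ (unsigned char)block[1])
           % TABLE_SIZE;
}

/* Build the SHIFT table for r keywords of equal length m.  When the
 * block of B characters at the end of the search window hashes to h,
 * the window can safely be shifted by shift[h] positions.  A block
 * that occurs in no keyword yields the maximal shift m - B + 1;
 * blocks that occur in some keyword get smaller shifts, down to 0
 * for a block ending at a keyword's last character (which forces a
 * verification of the window). */
static void build_shift(int shift[TABLE_SIZE],
                        const char **keywords, int r, int m)
{
    for (int h = 0; h < TABLE_SIZE; h++)
        shift[h] = m - B + 1;             /* maximal safe shift */
    for (int k = 0; k < r; k++)
        for (int j = 0; j + B <= m; j++) {
            int s = m - (j + B);          /* block end to keyword end */
            unsigned h = block_hash(keywords[k] + j);
            if (s < shift[h])
                shift[h] = s;
        }
}
```

Reading B characters at a time effectively enlarges the alphabet, so on small alphabets most blocks keep the maximal shift and the window skips ahead quickly; hashing keeps the table size fixed regardless of the keyword set.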
SBOM was the fastest algorithm for texts with a binary alphabet when keywords of length m = 8 were used, similar to the performance reported in (Navarro, 2002). When texts with an English alphabet were used with a keyword length of m = 8, SBOM had the worst performance compared to the rest of the algorithms, as confirmed by the experiments presented in (Salmela, 2009). The performance of the SBOM algorithm was affected by the amount of exact matches in the text, especially when English alphabet texts were used.

The HG, SOG and BG algorithms had the best performance when used on English texts and average results for binary alphabet texts. In practice, SOG was the fastest algorithm among the Salmela-Tarhio-Kytöjoki algorithms, as was expected from the theoretical analysis, since it needs the least theoretical preprocessing and running time. It is stated in (Salmela, 2009) that although the algorithms were not designed for searching texts that contain a lot of matches, they performed well in practice. This is confirmed by the experimental results: although the q-grams algorithms were affected more by the increase in the number of matches in the text, they were still competitive with the rest of the algorithms in terms of running time.
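The q-grams technique that HG, SOG and BG rely on treats q consecutive characters as a single character over the larger alphabet of size |Σ|^q, which lengthens shifts and sharpens filtering on small alphabets such as the binary one. The following hedged C sketch maps a q-gram to a hash table index; Q, HASH_BITS and the folding scheme are illustrative assumptions, not the paper's parameters:

```c
#define Q 3                        /* q-gram length (illustrative) */
#define HASH_BITS 11
#define HASH_SIZE (1u << HASH_BITS)

/* Fold Q consecutive bytes into one "super-character" and reduce it
 * to a hash table index.  Conceptually this reads the text over the
 * alphabet of size 256^Q instead of 256, so a single table lookup
 * carries much more information than a single-character lookup,
 * which is what makes these filters effective on small alphabets. */
static unsigned qgram_hash(const unsigned char *s)
{
    unsigned h = 0;
    for (int i = 0; i < Q; i++)
        h = (h << 3) + s[i];       /* fold Q bytes into one value */
    return h & (HASH_SIZE - 1);
}
```

In the filtering phase a table indexed by such hashes tells the algorithm whether the current q-gram can occur in any keyword; candidate windows that survive the filter are then verified exactly.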
4. CONCLUSION

In this paper experimental results for the well known Commentz-Walter, Wu-Manber, SBOM and Salmela-Tarhio-Kytöjoki algorithms were presented. The algorithms were compared in terms of running time on texts with an English and a binary alphabet, for keyword sets of size between 100 and 100.000 keywords and for keyword lengths of m = 8 and m = 32. It was shown that for different data sets, different algorithms are preferable: the Salmela-Tarhio-Kytöjoki algorithms had the best performance on English text for either keyword length, SBOM was the fastest algorithm for texts with a binary alphabet when keywords of length m = 8 were used, while the Commentz-Walter algorithm was faster on the same data when keywords of length m = 32 were used.

The work presented in this paper could be extended with experiments that use keywords of varying length and larger keyword sets, while focusing on the preprocessing time and memory requirements of the algorithms. It would also be interesting to examine the application of multiple pattern matching algorithms in areas of scientific computing including indexing and computational biology. Since the data that need to be processed by multiple keyword matching are usually inherently parallel in nature, future research could focus on the speedup of the existing algorithms when processed in parallel on traditional parallel architectures like cluster environments (Michailidis, 2002) and multicore systems, as well as on modern parallel systems like GPU architectures (Kouzinopoulos, 2009).
REFERENCES

A.V. Aho and M.J. Corasick, 1975. Efficient string matching: an aid to bibliographic search. Communications of the ACM, Vol. 18, No. 6, pp. 333-340.
R.A. Baeza-Yates and G.H. Gonnet, 1992. A new approach to text searching. Communications of the ACM, Vol. 35, No. 10, pp. 74-82.
B. Commentz-Walter, 1979. A string matching algorithm fast on the average. Proceedings of the 6th Colloquium on Automata, Languages and Programming, pp. 118-132.
GNU Grep, 2010. Webpage containing information about the GNU Grep search utility. Available on-line: http://www.gnu.org/software/grep. Checked 18.5.2010.
S. Kim and Y. Kim, 1999. A fast multiple string-pattern matching algorithm. Proceedings of the 17th AoM/IAoM International Conference on Computer Science, pp. 1-6.
C.S. Kouzinopoulos and K.G. Margaritis, 2009. String Matching on a Multicore GPU Using CUDA. Proceedings of the 13th Panhellenic Conference on Informatics, pp. 14-18.
C.S. Kouzinopoulos and K.G. Margaritis, 2010. Algorithms for multiple keyword matching: survey and experimental results. Technical Report, University of Macedonia.
T. Lecroq, 2007. Fast exact string matching algorithms. Information Processing Letters, Vol. 102, No. 6, pp. 229-235.
P.D. Michailidis and K.G. Margaritis, 2002. Parallel Implementations for String Matching Problem on a Cluster of Distributed Workstations. Neural, Parallel and Scientific Computations, Vol. 10, pp. 287-312.
G. Navarro and M. Raffinot, 1998. A bit-parallel approach to suffix automata: Fast extended string matching. Lecture Notes in Computer Science, Vol. 1448, pp. 14-33.
G. Navarro and M. Raffinot, 2002. Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences. Cambridge University Press.
L. Salmela, J. Tarhio and J. Kytöjoki, 2006. Multipattern string matching with q-grams. Journal of Experimental Algorithmics, Vol. 11, pp. 1-19.
L. Salmela, 2009. Improved Algorithms for String Searching Problems. PhD thesis, Helsinki University of Technology.
Snort, 2010. Webpage containing information on the Snort intrusion prevention and detection system. Available on-line: http://www.snort.org. Checked 18.5.2010.
S. Wu and U. Manber, 1992. Agrep - A Fast Approximate Pattern-Matching Tool. Proceedings of the USENIX Technical Conference, pp. 153-162.
S. Wu and U. Manber, 1994. A fast algorithm for multi-pattern searching. Technical Report TR-94-17, University of Arizona.