ON-LINE APPROXIMATE STRING SEARCHING ALGORITHMS: SURVEY AND EXPERIMENTAL RESULTS

P. D. MICHAILIDIS and K. G. MARGARITIS*

Parallel and Distributed Processing Laboratory, Department of Applied Informatics, University of Macedonia, 156 Egnatia Str., P.O. Box 1591, 54006, Thessaloniki, Greece

(Received 9 March 2000)

The problem of approximate string searching comprises two classes of problems: string searching with k mismatches and string searching with k differences. In this paper we present a short survey and experimental results for well known sequential approximate string searching algorithms. We consider algorithms based on different approaches including dynamic programming, deterministic finite automata, filtering, counting and bit parallelism. We compare these algorithms in terms of running time against pattern length and for several values of k for four different kinds of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet. Finally, we compare the experimental results of the algorithms with their theoretical complexities.

Keywords: String searching; Hamming distance; Edit distance; String searching with k mismatches; String searching with k differences

C.R. Categories: F.2.2; H.3.3; I.5.4
1. INTRODUCTION

String searching is a very important component of many problems, including text processing, information retrieval, database operations, library systems, compilers, command interpreters, DNA processing, signal processing, error correction, speech and pattern recognition and several other fields [HD80, Aoe94, Ste94, Nav98]. The basic string searching problem can be defined as follows. Let Σ be a given alphabet (a finite set of characters), P = P[1]P[2]...P[m] a short pattern string of length m and T = T[1]T[2]...T[n] a large text string of length n, where both the pattern and the text are sequences of characters from Σ with m ≤ n. The string searching problem consists of finding one or, more generally, all the exact occurrences of the pattern P in the text T. Surveys and experimental results of well known algorithms for this string searching problem can be found in [Aho90, CR94, Ste94, MM99a, Smi82, DB86, BY89, Leq95, MF96, MM99b].
The approximate string searching problem is a generalization of the exact string searching problem, which involves finding substrings of a text string close to a given pattern string. More specifically, the approximate string searching problem can be formally stated as follows: Let Σ be a given alphabet, P a short pattern string of length m, T a large text string of length n with m ≤ n, k ≥ 0 an integer and d a distance function. The problem consists of finding all the substrings S of T such that d(P, S) ≤ k. The distance d(P, S) between two strings P and S over an alphabet Σ is the cost of the minimum-cost sequence of operations needed to transform P into S. The cost of a sequence of operations is the sum of the costs of the individual operations. The cost of an operation is a positive real number. In string searching applications the most interesting operations are: (a) changing one character to another single character (a substitution), (b) deleting one character from the given string (a deletion), and (c) inserting a single character into the given string (an insertion).

There are several distance functions; two very well known ones, which are used in this paper, are the Hamming distance and the Levenshtein distance. The Hamming distance between two strings of equal length is defined as the number of positions with mismatching characters in the two strings. In other words, it allows only substitutions, each of cost 1. The approximate string searching problem with d being the Hamming distance is called string searching with k mismatches. The Levenshtein or edit distance between two strings of not necessarily equal lengths is the minimum number of character insertions, deletions and substitutions, each of cost 1, required to transform one string into the other. Algorithms for computing the edit distance between a pair of strings are presented in [WF74, MP80, Ukk85a].
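To make the two distance functions concrete, the following minimal Python sketch computes both. The function names hamming_distance and edit_distance are illustrative and do not come from the paper. The edit distance uses the classical prefix-by-prefix dynamic programming recurrence in the spirit of [WF74], assuming unit cost for every operation.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


print(hamming_distance("karolin", "kathrin"))  # 3
print(edit_distance("kitten", "sitting"))      # 3
```

Only one row of the array is kept at a time, so the sketch needs space proportional to the shorter dimension it iterates over, while the running time stays proportional to the product of the two lengths.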
The approximate string searching problem with d being the Levenshtein or edit distance is called string searching with k differences (or sometimes string searching with k errors). Together, the above two problems are called approximate string searching. The solutions to the two problems differ depending on whether the algorithm has to be on-line (that is, the text is not known in advance) or off-line (the text can be preprocessed). In this paper, we focus on on-line algorithms for these two problems. There are numerous algorithms for the approximate string searching problem; see for example the reviews in [GG88, Aho90, Ste94, BY96, JTU96, Nav98, MM99c].

In general, an on-line approximate string searching algorithm consists of two phases: the preprocessing phase on P and the searching phase of P in T. The preprocessing phase gathers information about the pattern that can be used for a fast implementation of primitive operations in the searching phase, or constructs a finite automaton that recognizes all strings at distance at most k from the pattern. The searching phase consists of scanning the text or constructing an array in order to find all approximate occurrences of the pattern in the text. In general, the searching phase is based on one of four approaches: dynamic programming (classical), deterministic finite automata, filtering (counting) and bit-parallelism. More specifically, for the string searching with k mismatches problem, the algorithms can be divided into four categories:

Classical algorithms: Brute-Force algorithm, Landau–Vishkin [LV86] algorithm, Galil–Giancarlo [GG86] algorithm, and Tarhio–Ukkonen [TU93] algorithm.

Deterministic finite automata algorithms: Partial-DFA [BYG94, Nav97b] algorithm.

Counting algorithms: Grossi–Luccio [GL89] algorithm, BY-filter/counting [BYG94] algorithms, EM [EMC96] algorithm, Pevzner–Waterman [PW95] algorithm, and Baeza-Yates–Perleberg [BYP96] algorithm.

Bit-parallelism algorithms: Shift-Or [BYG92] algorithm, Dermouche [Der95] algorithm, and BNDM [NR98] algorithm.
Similarly, the algorithms for the string searching with k differences problem are divided into four categories:

Dynamic programming algorithms: Sellers [Sel80] algorithm, CUTOFF [Ukk85b] algorithm, Landau–Vishkin [LV88, LV89] algorithm, Galil–Park [GP90] algorithm, Ukkonen–Wood [UW93] algorithm and Chang–Lampe [CL92] algorithm.

Deterministic finite automata algorithms: Ukkonen [Ukk85b] algorithm, Wu–Manber–Myers [WMM96] algorithm and Partial-DFA [BYG94, Kur96, Nav97b] algorithm.

Filtering algorithms: Tarhio–Ukkonen [TU93] algorithm, COUNT [GL89, JTU96, Nav97a] algorithm, Maximal matches [Ukk92] algorithm, Chang–Lawler [CL94] algorithm, Takaoka [Tak94] algorithm, Sutinen–Tarhio [ST95] algorithm and the algorithm of Baeza-Yates et al. with the exact partitioning technique [BYP96].

Bit-parallelism algorithms: Wu–Manber [WM92] algorithm, Baeza-Yates–Navarro [BYN96a, BYN96b, BYN99] algorithm, BNDM [NR98] algorithm, Wright [Wri94] algorithm and Myers [Mye98, Mye99] algorithm.

With such a variety of approaches to the same problem, it is difficult to select an appropriate algorithm for each problem of approximate string searching. The theoretical analyses given in the literature are useful, but it is important that the theory is complemented by sufficiently extensive experimental comparisons. Several experiments on the string searching with k differences problem have already been reported [JTU96]. In [JTU96], Jokinen et al. compare the practical running time of seven algorithms for the string searching with k differences problem only. More specifically, they compare two algorithms (Sellers [Sel80] and CUTOFF [Ukk85b]) based on the dynamic programming approach, two algorithms (Galil–Park [GP90] and Ukkonen–Wood [UW93]) based on the diagonal transition approach and three algorithms (Tarhio–Ukkonen [TU93], [JTU96] and Maximal matches [Ukk92]) based on the filtering approach. They do not cover the bit-parallelism algorithms or the algorithms for the k mismatches problem.
In this paper we report extensive experiments on the running times of well known and recent algorithms for the k mismatches and the k differences problems. Finally, we examine whether these experiments confirm the theoretical analysis of the algorithms. This paper is organized as follows: in the next section we briefly describe the algorithms tested for both the k-mismatches and the k-differences problem. In the third section we describe the experimental methodology, including the test environment, the types of test data and the measures used for the comparison of the algorithms. In section four we present the results of our experiments, in the form of performance tables and graphs. Finally, we present some conclusions and suggest some further research issues.
2. APPROXIMATE STRING SEARCHING ALGORITHMS

In this section we give the formal description of string searching with k mismatches and string searching with k differences as well as the sequential algorithms tested. For the details and the coding of the algorithms, the reader is referred to [MM99c] and the original references. We start by reviewing the basic algorithms from each category for the string searching with k mismatches problem. In all the algorithms below we suppose that the pattern and the text are stored in arrays P[1..m] and T[1..n].
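As a point of reference for the k mismatches problem, a brute-force search can be sketched as follows. This is only an illustrative sketch, not the code used in the experiments: the function name k_mismatch_search is hypothetical, Python strings are indexed from 0 rather than from 1 as in the arrays above, and reporting the starting positions of the occurrences is an assumption of this sketch.

```python
def k_mismatch_search(pattern: str, text: str, k: int) -> list[int]:
    """Return every 0-based start s such that text[s:s+m] differs from
    the pattern in at most k positions (Hamming distance <= k)."""
    m, n = len(pattern), len(text)
    starts = []
    for s in range(n - m + 1):
        mismatches = 0
        for i in range(m):
            if text[s + i] != pattern[i]:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        else:
            # The inner loop finished without exceeding k mismatches.
            starts.append(s)
    return starts


print(k_mismatch_search("abc", "xabcabd", 1))  # [1, 4]
```

Each of the n - m + 1 windows is compared character by character, so the worst case is proportional to mn; the early exit once k + 1 mismatches are seen is what makes the method competitive for very small k.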
2.1 String Searching with k Mismatches

2.1.1 Problem Definition

Let Σ be a given alphabet, P = P[1]P[2]...P[m] a short pattern string of length m and T = T[1]T[2]...T[n] a large text string of length n, over an alphabet Σ of size |Σ|, where m, n > 0 and m [...] 0 and m ≤ n, and an integer maximum number of differences allowed k ≥ 0, find all the text positions j such that the edit distance (i.e. the number of differences) between P and some substring of T ending at T[j] is at most k. We say that there is an approximate occurrence of P at position j of T.

2.2.2 Algorithms for String Searching with k Differences

DYNAMIC PROGRAMMING APPROACH

The dynamic programming approach is a classical solution which has been proposed independently by many researchers, notably Wagner and Fischer [WF74], for computing the edit distance between two strings: the distances between longer and longer prefixes of the strings are successively evaluated from previous values until the final result is obtained. Later, Sellers [Sel80] converted this classical solution into a search algorithm (in short, SEL) that finds all approximate occurrences of P in T. This algorithm has running time O(mn) in the worst and average case. There are many results that improve the SEL algorithm and take advantage of the geometric properties of the dynamic programming array (i.e. values in neighboring cells differ by at most one) [Ukk85a] in order to compute kn instead of mn entries. For example, Ukkonen [Ukk85b] developed an algorithm called CUTOFF whose expected running time is O(kn), obtained by computing only a part of the dynamic programming array.

Subsequently, new algorithms were developed that are based on the diagonal transition approach. The basic idea of the diagonal transition algorithms is that the values along the diagonals of the dynamic programming array are monotonically increasing. The algorithms compute in constant time the positions where the values along the diagonals are incremented. There are four algorithms based on the diagonal transition approach: Brute Force, Landau–Vishkin [LV88, LV89], Galil–Park [GP90] and Ukkonen–Wood [UW93]. In our experiments we include the Galil–Park algorithm (in short, GP). The preprocessing phase of the GP algorithm takes O(m²) space and time and the searching phase takes O(kn) time in both the worst and the average case. Like the Landau–Vishkin algorithm, it uses reference triples that represent matching substrings of the pattern and the text.

Finally, the Chang–Lampe algorithm [CL92] (in short, CL) is a variation of dynamic programming and is also a very efficient and practical algorithm. This adaptation of the simple dynamic programming approach is based on a "column partition" approach and has expected time O(kn/√|Σ|). The preprocessing time and the space requirement of this algorithm are O(m|Σ|).

DETERMINISTIC FINITE AUTOMATA APPROACH

Although this approach is rather old, it has received little attention. It is based on re-expressing the problem by means of an automaton. The basic idea is to convert the general automaton into a deterministic one and to reduce the number of states and the memory requirements.
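As a baseline for the automaton-based methods that follow, the dynamic programming search of Sellers (SEL) can be sketched in a few lines. The sketch below is illustrative only: the name sellers_search is hypothetical, it keeps a single column of the dynamic programming array, and it reports 1-based end positions j in the sense of the problem definition. The only change with respect to the plain edit distance computation is that the top cell of every column is 0, so an occurrence may start anywhere in the text.

```python
def sellers_search(pattern: str, text: str, k: int) -> list[int]:
    """Return the 1-based text positions j at which some substring of the
    text ending at j is within edit distance k of the pattern."""
    m = len(pattern)
    # col[i] = minimum edit distance between pattern[:i] and any
    # substring of the text ending at the current position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]  # value of the cell up-left of the one being computed
        col[0] = 0     # an occurrence may start at any text position
        for i in range(1, m + 1):
            new = min(diag + (pattern[i - 1] != c),  # substitution or match
                      col[i] + 1,                    # extra text character
                      col[i - 1] + 1)                # skipped pattern character
            diag = col[i]
            col[i] = new
        if col[m] <= k:
            ends.append(j)
    return ends


print(sellers_search("abc", "xabcx", 0))  # [4]
print(sellers_search("abc", "xabcx", 1))  # [3, 4, 5]
```

Every text character updates all m cells of the column, which is the O(mn) behaviour quoted above; cut-off style improvements avoid computing the cells whose value is already known to exceed k.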
Ukkonen proposed the idea of such a deterministic finite automaton (DFA) and devised an algorithm based on it [Ukk85b]. However, this algorithm has the disadvantage that a large number of automaton states may be generated. As a result, it has large time and space requirements which may limit its applicability. Later, Wu et al. looked again into this problem [WMM96]. The idea was to trade some time for space using a Four Russians technique [ADKF75], giving an O(kn/log n) expected time algorithm, which is a log factor improvement over the O(kn) expected time of the CUTOFF algorithm, and an O(mn/log n) time algorithm in the worst case, using O(n + m|Σ|/log n) space for the universal lookup array. The running time for preprocessing is O(m|Σ|). Finally, [Kur96] and [Nav97b] proposed another way to reduce the space requirements. It is an adaptation of [BYG94], who first proposed it for the Hamming distance. The idea is to build the automaton in lazy form, i.e. to build only the states and transitions actually reached while processing the text. This algorithm has running time O(n + m min(t, n)), where t is the total number of transitions in the complete automaton, and it requires O(min(|Σ|, n) min(m, |Σ|)) space in the worst case. The average time complexity of this algorithm is O(n + mt(1 − e^(−n/t))). In our experiments this algorithm is limited to m ≤ 10, because for longer patterns it requires large amounts of memory. In our experiments we include two algorithms from this category, the Wu–Manber–Myers algorithm (in short, WMM) and the partial DFA algorithm (in short, PDFA).

FILTERING APPROACH

This method is a much newer trend and is currently very active. It is based on finding fast algorithms to discard large areas of the text that cannot match, and applying another algorithm to the rest, using the simple dynamic programming approach.
First, Tarhio and Ukkonen [TU93] devised an approximate string searching algorithm (in short, TUD) that uses Boyer–Moore–Horspool techniques [BM77, Hor80] to filter the text. The preprocessing phase takes O((k + |Σ|)m) time and the algorithm requires O(|Σ|m) space. The searching phase of the TUD algorithm takes O(mn/k) time in the worst case and O((|Σ|/(|Σ| − 2k)) kn (k/(|Σ| + 2k²) + 1/m)) time in the average case.

Navarro [Nav97a] developed an algorithm (in short, COUNT), a variant of those in [JTU96] and [GL89], which is a filter based on counting matching positions. In other words, the key idea is to search for substrings of the text whose distribution of characters differs from the distribution of characters in the pattern by at most as much as is possible under k differences. The preprocessing phase of the COUNT algorithm takes O(|Σ| + m) time and the searching phase takes O(n) time if the number of verifications is negligible. This algorithm uses O(|Σ|) space.

Wu and Manber [WM92] proposed a simple filter called the pattern partition approach. This approach is based on the following fact: an occurrence with at most k differences of a pattern of length m implies that at least one substring of length r of the pattern matches a substring of the text occurrence exactly, where r = ⌊m/(k + 1)⌋. There are many ways to use this idea. Perhaps the simplest one, used in [WM92], is to search for the first k + 1 consecutive blocks of size r of the pattern P. If any of the blocks matches exactly, we try to extend the match, checking whether there are at most k differences. This idea was used in conjunction with the extension of the SO algorithm [BYG92] to string matching with differences. We call the combination of the pattern partition approach with the SO algorithm MULTIWM. This algorithm has O(mn/w) time complexity, where w is the number of bits in a computer word. In our experiments this algorithm is limited to m ≤ 31.

Finally, Baeza-Yates and Perleberg [BYP96] suggested an algorithm (in short, BYPEP) which combines the pattern partition approach with traditional multiple string searching algorithms. The simplest variant builds an Aho–Corasick machine [AC75] (the extension of the KMP algorithm [KMP77, MM99a] to multiple patterns) for the k + 1 blocks of length r. For every match found, we extend the match, checking whether there are at most k differences by using the standard dynamic programming algorithm for the edit distance between two strings. With the AC machine this algorithm has O(n) expected search time for k ≤ O(m/log m), using O(m²) extra space. Moreover, the searching phase of the above algorithm can be improved by using a multiple string searching algorithm based on the Boyer–Moore algorithm [CW79].
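The pattern partition idea described above can be sketched as follows. This is a hedged illustration, not the BYPEP or MULTIWM code: the names dp_end_positions and partition_search are hypothetical, Python's built-in str.find stands in for the Aho–Corasick machine or the bit-parallel multi-pattern search, and each candidate is verified with plain dynamic programming over a window of at most m + 2k text characters around the exact block match.

```python
def dp_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Dynamic programming verifier: 1-based end positions in text of
    substrings within edit distance k of the pattern."""
    m = len(pattern)
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        diag, col[0] = col[0], 0
        for i in range(1, m + 1):
            new = min(diag + (pattern[i - 1] != c), col[i] + 1, col[i - 1] + 1)
            diag, col[i] = col[i], new
        if col[m] <= k:
            ends.append(j)
    return ends


def partition_search(pattern: str, text: str, k: int) -> list[int]:
    """Filter: find exact matches of the first k+1 blocks of length
    r = m // (k+1), then verify only the text around each match."""
    m = len(pattern)
    r = m // (k + 1)
    if r == 0:
        raise ValueError("the filter needs m >= k + 1")
    found = set()
    for block in range(k + 1):
        off = block * r
        piece = pattern[off:off + r]
        pos = text.find(piece)
        while pos != -1:
            # Any occurrence containing this block lies inside the window.
            lo = max(0, pos - off - k)
            hi = min(len(text), pos - off + m + k)
            for e in dp_end_positions(pattern, text[lo:hi], k):
                found.add(lo + e)
            pos = text.find(piece, pos + 1)
    return sorted(found)


print(partition_search("abc", "xabcx", 1))  # [3, 4, 5]
```

The filter is only worthwhile when exact block matches are rare, that is, when r is not too small relative to the alphabet size; for large k the blocks become short, almost every text position triggers a verification, and the cost approaches that of running the verifier everywhere.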
BIT-PARALLELISM APPROACH

We have seen this technique applied to the k-mismatches problem. It can be applied in a similar way to the k-differences problem. There are two main alternatives: parallelization of the non-deterministic finite automaton (NFA) and parallelization of the dynamic programming array. The Wu and Manber algorithm [WM92] (in short, WM) uses this approach to simulate the automaton by rows. This algorithm has a preprocessing phase which requires O(m|Σ| + k⌈m/w⌉) time. The searching phase runs in O(kn⌈m/w⌉) time in the worst and average case, which is O(kn) for patterns typical in text searching (i.e. m ≤ w). Moreover, this algorithm requires O(m|Σ|) space. In our experiments, this algorithm is limited to m ≤ 31. Baeza-Yates and Navarro [BYN96a, BYN96b, BYN99] proposed another algorithm (in short, BYN) which parallelizes the NFA by diagonals using the bits of the computer word. The preprocessing phase of the BYN algorithm takes O(|Σ| + m min(m, |Σ|)) time and requires O(|Σ|) space. The search phase needs O(n) time in the worst and average case. In our experiments this algorithm is limited to m ≤ 9 for w = 32 bits.
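The row-wise simulation used by the WM algorithm can be illustrated with the following sketch. It is written under stated assumptions rather than taken from [WM92]: the name wu_manber_search is hypothetical, the Shift-And convention (a set bit means an active state) is used, and Python's unbounded integers replace the w-bit machine word, so the m ≤ 31 limit of the experiments does not apply here. One bit vector is kept per error level d = 0, ..., k.

```python
def wu_manber_search(pattern: str, text: str, k: int) -> list[int]:
    """Row-wise bit-parallel NFA simulation for k differences.
    Returns the 1-based end positions of approximate occurrences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1
    accept = 1 << (m - 1)
    table = {}  # table[c] has bit i set iff pattern[i] == c
    for i, ch in enumerate(pattern):
        table[ch] = table.get(ch, 0) | (1 << i)
    # state[d] bit i set: pattern[:i+1] matches a suffix of the text read
    # so far with at most d differences. Initially d deletions are allowed.
    state = [((1 << d) - 1) & mask for d in range(k + 1)]
    ends = []
    for j, ch in enumerate(text, start=1):
        b = table.get(ch, 0)
        prev_old = state[0]
        state[0] = ((state[0] << 1) | 1) & b & mask
        for d in range(1, k + 1):
            old = state[d]
            state[d] = ((((old << 1) | 1) & b)   # match
                        | prev_old               # insertion in the text
                        | (prev_old << 1)        # substitution
                        | (state[d - 1] << 1)    # deletion from the pattern
                        | 1) & mask              # first pattern character
            prev_old = old
        if state[k] & accept:
            ends.append(j)
    return ends


print(wu_manber_search("abc", "xabcx", 1))  # [3, 4, 5]
```

Each text character costs k + 1 vector updates, which corresponds to the O(kn⌈m/w⌉) search time quoted above when every vector fits in ⌈m/w⌉ machine words.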
TABLE II Time and space complexities for string searching with k differences

Algorithm   Search time (worst case)   Search time (average case)                     Preprocessing time        Extra space
SEL         mn                         mn                                             –                         mn
CUTOFF      mn                         kn                                             –                         m
GP          kn                         kn                                             m²                        m²
CL          mn                         kn/√|Σ|                                        m|Σ|                      m|Σ|
WMM         mn/log m                   kn/log n                                       m|Σ|                      n + m|Σ|/log n
PDFA        n + n min(t, n)            n + mt(1 − e^(−n/t))                           –                         min(|Σ|, n) min(m, |Σ|)
TUD         mn/k                       (|Σ|/(|Σ| − 2k)) kn (k/(|Σ| + 2k²) + 1/m)      (k + |Σ|)m                m|Σ|
COUNT       mn                         n                                              |Σ| + m
MULTIWM     mn/w                       mn/w                                           |Σ| + m
BYPEP                                  n (k ≤ m/log n)                                m                         m²
WM          kn⌈m/w⌉                    kn⌈m/w⌉                                        m|Σ| + k⌈m/w⌉             m|Σ|
BYN         n                          n                                              |Σ| + m min(m, |Σ|)       |Σ|
MYE         mn/w                       kn/w                                           m|Σ|                      |Σ|
Finally, Myers [Mye98, Mye99] developed an algorithm (in short, MYE) which is based on a bit-parallel simulation of the dynamic programming array. The parallelization has optimal speedup, and the time complexity is O(kn/w) on average and O(mn/w) in the worst case. The preprocessing phase of the MYE algorithm requires O(m|Σ|) time and O(|Σ|) space. In our experimental study this algorithm is limited to m ≤ 31. The known time and space complexities of the algorithms for string searching with k differences are shown in Table II for both the worst case and the average case.

We must note that the algorithms SEL, CUTOFF, PDFA, COUNT, BYPEP and MULTIWM were developed for the string searching with k differences problem. However, they can be applied to the string searching with k mismatches problem with slight modifications. Therefore, we developed the algorithms SELM, CUTOFFM, PDFAM, COUNTM, BYPEPM and MULTIWMM for the k mismatches problem and included them in our experimental study.
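The bit-parallel simulation of the dynamic programming array behind the MYE algorithm can be sketched as below. This is a sketch of the commonly published single-word formulation of Myers' bit-vector search, not the code used in the experiments: the name myers_search and the variable names are illustrative, and Python's unbounded integers are masked to m bits in place of a w-bit machine word. The vectors pv and mv encode which vertical differences in the current column are +1 and −1, and score tracks the last cell of the column.

```python
def myers_search(pattern: str, text: str, k: int) -> list[int]:
    """Bit-vector simulation of the dynamic programming column.
    Returns the 1-based end positions of occurrences with <= k differences."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    peq = {}  # peq[c] has bit i set iff pattern[i] == c
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m  # initial column is 0, 1, ..., m
    ends = []
    for j, ch in enumerate(text, start=1):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask  # horizontal differences equal to +1
        mh = pv & xh                   # horizontal differences equal to -1
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # Shift in 0: the top row of the search array is constantly 0.
        ph = (ph << 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            ends.append(j)
    return ends


print(myers_search("abc", "xabcx", 1))  # [3, 4, 5]
```

The work per text character is a constant number of word operations regardless of k, which is why the method is attractive whenever the pattern fits in one machine word.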
3. EXPERIMENTAL METHODOLOGY

In this section we present the testing methodology used in our experiments to compare the relative performance of approximate string searching algorithms. The parameters that describe the performance of the algorithms are: (a) the text size, (b) the pattern length, (c) the number of allowed mismatches or differences, and (d) the alphabet size. It is known that none of the algorithms is optimal or best in all four respects. Therefore, the main goal of our experimental study is to explore the practical performance of the algorithms and to verify their theoretical analysis against the length of the pattern (short and long patterns) and against the number of allowed mismatches or differences (small and large values of k) under alphabets of different sizes (or types of text), i.e. binary alphabet, alphabet of size 8, English alphabet and DNA alphabet, which have different characteristics.
3.1 Test Environment

The experiments were run on a Sun UltraSparc-1, a 32-bit machine with a 143 MHz clock, 64 MB of RAM and a 2.1 GB local hard disk. The operating system was Solaris 2.5. During all experiments, this machine was not performing other heavy tasks (or processes). The data structures used in the testing were all in physical memory during the experiments. Finally, the algorithms presented in section 2 were implemented in the ANSI C programming language [KR78] in a homogeneous way, so as to keep their comparison meaningful, using the cc compiler.
3.2 Types of Test Data

Because the performance of approximate string searching algorithms depends on the statistical properties of the pattern and of the text string from which the test patterns are obtained, experiments were performed on four different types of text: binary alphabet, alphabet of size 8, English alphabet and DNA alphabet.

Binary alphabet: The alphabet is Σ = {0, 1}. The text consists of 150,000 characters and was randomly built. For each pattern length between 3 and 100 we searched for ten randomly built patterns.

Alphabet of size 8: The alphabet is Σ = {a, b, c, d, e, f, g, h}. The text consists of 150,000 characters and was randomly built. For each pattern length between 3 and 100 we searched for ten randomly built patterns.

English alphabet: We used an English-language document from a web page. The alphabet consists of 70 different characters. The text consists of 148,188 characters and we searched for ten patterns of each length from 3 to 100 characters, chosen at random from words inside the text.

DNA alphabet: The DNA alphabet consists of the four nucleotides a, c, g and t (standing for adenine, cytosine, guanine and thymine, respectively) used to encode DNA. Therefore, the alphabet is Σ = {a, c, g, t}. The text consists of 997,642 characters and we searched for ten patterns of each length from 10 to 100 characters. The text and the patterns are a portion of the GenBank DNA database, as distributed by Hume and Sunday [HS91].
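The randomly built texts and patterns for the binary alphabet and the alphabet of size 8 can be reproduced in spirit with a sketch like the one below. It is not the generator used for the reported experiments: the helper names random_text and random_patterns, the uniform character distribution and the fixed seed are all assumptions made for illustration.

```python
import random


def random_text(alphabet: str, length: int, rng: random.Random) -> str:
    """Draw `length` characters independently and uniformly from alphabet."""
    return "".join(rng.choices(alphabet, k=length))


def random_patterns(alphabet: str, m: int, count: int,
                    rng: random.Random) -> list[str]:
    """Build `count` random patterns of length m over the alphabet."""
    return [random_text(alphabet, m, rng) for _ in range(count)]


rng = random.Random(2000)  # fixed seed (an assumption) for repeatable runs

binary_text = random_text("01", 150_000, rng)
size8_text = random_text("abcdefgh", 150_000, rng)

# Ten patterns of one of the tested lengths, here m = 8.
binary_patterns = random_patterns("01", 8, 10, rng)
size8_patterns = random_patterns("abcdefgh", 8, 10, rng)
```

Fixing the seed keeps the ten patterns per length identical across algorithms, so that differences in running time are not caused by different inputs.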
3.3 Measures of Comparison

For the comparison of the approximate string searching algorithms we used the practical running time as the measure. The running time is the total time of calling an algorithm to search for a pattern in the text, including the preprocessing time for building the auxiliary arrays. The running time is obtained by calling the C function clock() and is measured in seconds. We measured the running time of all the algorithms in section 2 in order to examine the effect of the pattern length and the effect of an absolute k. We performed two test series:

(a) We measured the effect of the pattern length in a test series with varying m = 3, 4, 8, 10, 20, 30, 40, 60, 80, 100 and fixed k = 3. In the case of the DNA alphabet we used longer patterns, because this alphabet has biological applications involving long patterns. For this alphabet we therefore measured the effect of the pattern length in a test series with varying m = 10, 20, 30, 40, 50, 100 and fixed k = 3, and
(b) We measured the effect of an absolute k in three test sub-series: (b.1) with varying k = 1, 2, 4, 6, 8 and fixed m = 8, except for the DNA alphabet; (b.2) with varying k = 1, 8, 10, 15, 19 and fixed m = 20; and (b.3) with varying k = 1, 6, 13, 25, 40 and fixed m = 50.

Finally, to decrease random variation, the results of the algorithms are averages of ten runs with different patterns of each length.

4. EXPERIMENTAL RESULTS

In the previous sections we briefly presented the most well known approximate string searching algorithms and the experimental methodology of our tests. In this section, we present the experimental results both for the string searching algorithms with k mismatches and for the string searching algorithms with k differences. The performance of the algorithms for these two problems was measured on ten iterations over the four types of text. In particular, the performance of each algorithm was plotted against the length of the pattern and against the absolute k for each type of text.

4.1 Results for the String Searching Algorithms with k Mismatches

4.1.1 Performance versus the Pattern Length

Here, we report the performance and theoretical results for k = 3 and variable pattern length. Figures 1–4 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. Further, Figures 5–8 show the theoretical time complexity for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We observe that there is general agreement between the experimental and theoretical results of the algorithms in most cases. However, the experimental results of the filtering algorithms (such as COUNTM, BYPEPM and MULTIWMM) do not follow the theoretical calculation of time complexity. We must note that only for large alphabets, such as the English alphabet, and only for long patterns do the experimental results of the COUNTM algorithm agree with the theoretical calculations.

FIGURE 1 |Σ| = 2 and k = 3.
FIGURE 2 |Σ| = 8 and k = 3.
FIGURE 3 |Σ| = 70 and k = 3.
FIGURE 4 |Σ| = 4 and k = 3.
FIGURE 5 |Σ| = 2 and k = 3.
FIGURE 6 |Σ| = 8 and k = 3.
FIGURE 7 |Σ| = 70 and k = 3.
FIGURE 8 |Σ| = 4 and k = 3.

4.1.2 Performance versus the Values of k

Figures 9–12 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively, with patterns of size m = 8, m = 20 and m = 50 and all possible values of k. Figures 13–16 show the theoretical time complexity for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We can observe that the theoretical time complexity of the group of filtering algorithms is not confirmed in practice for any alphabet size or for the several values of k.

FIGURE 9 |Σ| = 2 and m = 8, 20, 50.
FIGURE 10 |Σ| = 8 and m = 8, 20, 50.
FIGURE 11 |Σ| = 70 and m = 8, 20, 50.
FIGURE 12 |Σ| = 4 and m = 20, 50.
FIGURE 13 |Σ| = 2 and m = 8, 20, 50.
FIGURE 14 |Σ| = 8 and m = 8, 20, 50.
FIGURE 15 |Σ| = 70 and m = 8, 20, 50.
FIGURE 16 |Σ| = 4 and m = 20, 50.

4.2 Results for the String Searching Algorithms with k Differences

4.2.1 Performance versus the Pattern Length

Figures 17–20 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. Figures 21–24 show the theoretical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We observe that the group of filtering algorithms, such as the COUNT, MULTIWM and BYPEP algorithms, in most cases does not confirm the theoretical time complexity measures. Furthermore, in our experiments the CL algorithm, which is based on the dynamic programming approach, does not fully agree with the theoretical results for any alphabet size. Finally, the computational behavior of the remaining algorithms generally agrees with the theory in most cases.

FIGURE 17 |Σ| = 2 and k = 3.
FIGURE 18 |Σ| = 8 and k = 3.
FIGURE 19 |Σ| = 70 and k = 3.
FIGURE 20 |Σ| = 4 and k = 3.
FIGURE 21 |Σ| = 2 and k = 3.
FIGURE 22 |Σ| = 8 and k = 3.
FIGURE 23 |Σ| = 70 and k = 3.
FIGURE 24 |Σ| = 4 and k = 3.
4.2.2 Performance versus the Values of k

Figures 25–28 show the practical running time for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively, with patterns of size m = 8, m = 20 and m = 50 and all possible values of k. Figures 29–32 show the theoretical time complexity for a binary alphabet, an alphabet of size 8, an English alphabet and a DNA alphabet, respectively. We can observe that the theoretical analysis of filtering algorithms such as the COUNT, MULTIWM and BYPEP algorithms is not valid in practice in most cases. In addition, the CL and MYE algorithms do not exhibit the expected behavior. Finally, the practical performance of the remaining algorithms is in accordance with the theoretical results.
FIGURE 25 |Σ| = 2 and m = 8, 20, 50.
FIGURE 26 |Σ| = 8 and m = 8, 20, 50.
FIGURE 27 |Σ| = 70 and m = 8, 20, 50.
FIGURE 28 |Σ| = 4 and m = 20, 50.
FIGURE 29 |Σ| = 2 and m = 8, 20, 50.
FIGURE 30 |Σ| = 8 and m = 8, 20, 50.
FIGURE 31 |Σ| = 70 and m = 8, 20, 50.
FIGURE 32 |Σ| = 4 and m = 20, 50.

4.2.3 General Remarks

Based on the empirical results it can be concluded that in all cases the BF, SELM and SEL algorithms are linear in the running time. These algorithms produce relatively good running times despite their simplicity. More specifically, the BF algorithm is the best approach, with the exception of the SO and PDFAM algorithms, when the pattern size is very small or k is close to m. In addition, the SEL algorithm is also among the fastest when m is small or k is close to m. This observation is valid for all the alphabets. It should be noted that the BF, SELM and SEL algorithms have no special memory requirements and no complex coding. However, these algorithms perform poorly for long patterns and for large values of k.

Our experiments confirm the O(kn) expected running time of the CUTOFFM, CUTOFF and GP algorithms, although they are quadratic in the theoretical worst case. In addition, we note that our measurements do not confirm the statement of Chang and Lampe [CL92] according to which the CL algorithm is always faster than the CUTOFF algorithm. In particular, for long patterns and small alphabets, the CL algorithm is much slower than the CUTOFF algorithm. However, our experiments showed that the CL algorithm is better than the CUTOFF algorithm for large values of k. Finally, we must note that the CUTOFFM and CUTOFF algorithms are the best approaches for all alphabets when k is relatively large.

We have experimentally shown that the PDFAM and PDFA algorithms outperform all the other algorithms for small values of k and short patterns. In addition, the WMM algorithm achieves subquadratic worst case running time and very good expected running time for large values of k and long patterns. This algorithm has the advantage that it can work for patterns that contain a class of characters, a complement of a character or a class, and don't care symbols. However, a main drawback of these methods is that they require large amounts of memory.

The theoretical analysis of the filtering algorithms is not valid in practice in most cases. The algorithms COUNTM, COUNT, BYPEPM, BYPEP, MULTIWMM and MULTIWM allow the text to be scanned in linear expected time. Here, we must observe that the BYPEPM, BYPEP, MULTIWMM and MULTIWM algorithms have slightly better performance than the COUNTM and COUNT algorithms only for small values of k. On the other hand, for large values of k they have lower running time than the rest of the algorithms. The COUNTM and COUNT algorithms are simple and fast in practice for small values of k. Further, they are fast for long patterns and large alphabets (i.e. the alphabet of size 8 and the English alphabet). However, they are not faster than the best sublinear filters because they inspect all text characters. According to our experimental study, the BYP algorithm achieves O(n) worst case time independent of k and without restrictions on m. It can be easily adapted to find the "best match" (smallest k). This is desirable in many cases where a bound on k is not known a priori. Also, the BYP algorithm has been used for two dimensional text searching [BYR93]. Finally, the running time of the TU and TUD algorithms decreases as the pattern length and the alphabet size increase. This fact supports the theoretical evidence that the TU and TUD algorithms are sublinear in average running time. Therefore, these algorithms perform well on average for small k, long patterns and large alphabets.

We experimentally demonstrated that all bit-parallelism algorithms, with the exception of the WM algorithm, are among the fastest for typical text searching. More specifically, for short patterns the SO, BYN and MYE algorithms scan the text in linear time, regardless of the value of k. It can also be seen that they are the fastest for short patterns and medium values of k for all the alphabets. The WM algorithm is also linear according to our experiments, but it is a less efficient scheme nowadays. These algorithms are fairly simple to implement and are also very flexible. Additionally, they can be applied to cases where the pattern may contain a class of characters and don't care symbols. We have not studied other cases, such as very long patterns, because the bit-parallelism algorithms do not perform as well as other algorithms there.
5. CONCLUSIONS

We have presented the results of an extensive set of experiments on the most well known approximate string searching algorithms based on the dynamic programming, deterministic finite automata, filtering and bit-parallelism approaches. We now report the general conclusions regarding the algorithms and their testing procedures. As a general conclusion, we can say that testing the algorithms on four different types of text (binary alphabet, alphabet of size 8, English alphabet and DNA alphabet) indicates that varying parameters such as the pattern length, the value of k and the alphabet size can produce slightly different algorithm performances. Our experimental study therefore showed that none of the algorithms, for either the k mismatches or the k differences problem, is the best for all values of the problem parameters.

Finally, we discuss a number of directions for future research related to this paper. First, we will present data parallel algorithms for the three distributed problems related to exact and approximate string searching using simple sequential algorithms: we search for all the occurrences of a pattern in a text, where the pattern and/or the text can be either single strings or multiple ones. In other words, we can easily extend the simple sequential algorithms described in section 2 for the classical Single Pattern versus Single Text (SPST) problem to data parallel algorithms for the three other problems: the Single Pattern versus Multiple Text (SPMT) problem, the Multiple Pattern versus Single Text (MPST) problem and the Multiple Pattern versus Multiple Text (MPMT) problem. Second, we will report an extensive experimental study of data parallel algorithms for the three distributed string searching problems using the Message Passing Interface (MPI).
