Improved Text Scanning Approach for Exact String Matching
Muhammad Zubair, Fazal Wahab, Iftikhar Hussain and Junaid Zaffar
Department of Computing and Technology, Iqra University, Islamabad Campus, Islamabad, Pakistan
Abstract: Exact string matching is an important subject in the domain of text processing and an essential component of practical computer system applications. In this research we propose a new algorithm, TSPLFC (Text Scanning for Pattern Last and First Character), which solves the exact string matching problem by scanning the text string for the last and first characters of the pattern in its preprocessing phase. In its matching phase, TSPLFC compares the pattern with the text window from both directions simultaneously. Experimental results verify that TSPLFC is more efficient than a number of existing algorithms; its time complexity is O(km) in the average case and O(1) in the best case.
In this research we compare the results of the TSPLFC algorithm with the Boyer-Moore [1] bad character rule, Horspool Boyer-Moore [8], Quick Search [9], Berry Ravindran [16], Smith Boyer-Moore [17] and Raita [21] algorithms, which are considered efficient in terms of the number of character comparisons and the number of attempts needed to complete the task.
Keywords-string matching; text scanning; exact pattern matching; bidirectional; text window
Related works: Researchers have developed several exact pattern matching algorithms with a view to enhancing the searching process by minimizing the number of character comparisons and maximizing the length of the shifts [20].
I. INTRODUCTION
The exact string matching problem is defined as finding all occurrences of a pattern P of size m in a text string T of size n. Researchers have paid considerable attention to this area, and many algorithms have been proposed and designed to solve the problem. Exact string matching algorithms are widely used in the domain of text processing and are implemented in many operating system applications [20]. They are used in bibliographic search, question answering applications, molecular biology, text processing applications, virus scanning and information retrieval from databases [3, 4]. For exact string matching algorithms it is important to locate the possible occurrences of the pattern in the text [22]. Other applications of exact string matching algorithms are DNA pattern matching, protein sequence analysis and database queries. The efficiency of an exact string matching algorithm has a great impact on the performance of its application [3]. According to the literature survey, authors focus on reducing the number of character comparisons and attempts, as in [6], and the processing time, as in [7, 8, 10], in all three cases (best, worst and average).

In this study we propose the TSPLFC algorithm, which compares the user-defined pattern with a text window of the same size from both sides simultaneously. In case of a mismatch or a complete match, the preprocessing phase first scans the text string to the right of the current text window for the rightmost character of the pattern. If that character is found, it looks for the leftmost character of the pattern in the text string at the same distance as in the pattern. It continues scanning the text string until both the rightmost and leftmost characters are found, and the pattern is then aligned with the newly selected text window for comparison.
Section II describes the working, implementation and analysis of the TSPLFC algorithm. Section III compares the TSPLFC algorithm with some existing algorithms. Finally, Section IV draws the conclusion.
The Boyer-Moore (BM) [1] algorithm compares characters from right to left of the pattern and does not require the whole pattern to be searched in case of a mismatch. In case of a mismatch or a complete match, it uses two shifting rules, the bad character rule and the good suffix rule, to shift the pattern toward the right. The time complexity of the preprocessing phase is O(m). The worst case running time of the searching phase is O(nm + |∑|). The best case of the searching phase is O(n/m).

Boyer-Moore Horspool (BMH) [8] is an improved version of the Boyer-Moore algorithm. It uses only the bad character rule of the Boyer-Moore algorithm to improve the length of the shifts. In its preprocessing phase, it scans the pattern from right to left for the rightmost character of the current text window. When an occurrence of the character is found, it aligns the found character with the rightmost character of the text window. If the character is not found in the pattern, it takes the maximum shift of size m. Its preprocessing time complexity is O(m) and its searching time complexity is O(mn).

Boyer-Moore Smith (BMS) [17] is based on the observation that the shift computed by BMH is sometimes larger than the shift computed by QS. It uses both the bad character shifting rule of BMH and the bad character rule of QS to shift the pattern. Its preprocessing time complexity is O(m+|∑|) and its searching time complexity is O(mn).

The preprocessing phase of the Quick Search (QS) [9] algorithm scans the pattern from right to left for the character immediately to the right of the current text window and determines the shift by applying the bad character shifting rule. It compares characters from left to right of the pattern with the selected text window. The worst case time complexity of the QS algorithm is the same as that of the BMH algorithm, but it can take more steps in practice.

The Raita algorithm [21] first compares the last, then the first and finally the middle character of the pattern before the actual comparison of the pattern with the text window. The preprocessing phase of the Raita algorithm is similar to the bad character shift rule of the Horspool BM algorithm. The time complexity of the Raita algorithm is O(m) in the preprocessing phase and O(nm) in the searching phase.

The Two Way (TW) algorithm [15] uses an idea related to the short maximal suffix of the pattern to calculate the shift lengths of the pattern in the text string. The time complexity of the Two Way algorithm is O(n).

Berry Ravindran (BR) [16] performs shifts by applying the bad character shifting rule to the two consecutive characters to the right of the current text window. Its preprocessing time complexity is O(m) and its searching time complexity is O(mn).

The FLC-RJ [3] algorithm searches the whole text string for the first and last characters of the pattern and maintains an occurrence list by storing the indexes of the corresponding characters. The time complexity of the preprocessing phase of the FLC-RJ algorithm is O(n), and the space complexity is also O(n) in all three cases (best, worst and average), as it uses an array equal in size to the text string, n, to maintain the occurrence list. The time complexity depends upon the occurrences of the corresponding characters in the text string.

978-1-4244-8003-6/10/$26.00 ©2010 IEEE

II. TSPLFC EXACT PATTERN MATCHING ALGORITHM
A. Overview of TSPLFC Algorithm
In the sliding window method, most string matching algorithms search for text character(s) in the pattern during the preprocessing phase in order to maximize the length of the shift [20]. Instead, it is more reasonable to search for the pattern character(s) in the text string during the preprocessing phase [3]. The TSPLFC algorithm differs from FLC-RJ [3]: the FLC-RJ algorithm runs its preprocessing phase only once, before the searching phase, and requires O(n) memory for maintaining indexes and O(n) time for scanning the text string. TSPLFC, however, first starts searching like most other algorithms and calls the preprocessing phase after every attempt, whether it ends in a match or a mismatch.

Suppose we have a text string T[0…n-1] and a pattern P[0…m-1], and we search for P in T. The TSPLFC algorithm compares the pattern with the selected text window from both the right and left sides simultaneously, as shown in Fig. 1. In case of a match or a mismatch, with i the index of the last character of the current window, it scans T[i+1…n-1] for the character P[m-1]. If a position i' with T[i'] = P[m-1] is found, it looks at T[i'']. The index i'' is calculated by subtracting the distance between the last and first characters of the pattern from i', that is, i'' = i' - (m-1). If T[i'] = P[m-1] and T[i''] = P[0], the preprocessing phase returns the index i'; the TSPLFC algorithm then calculates the shift length, aligns P[m-1] with T[i'] and P[0] with T[i''], and starts matching again, as shown in Fig. 1. Otherwise it continues scanning T[i+1…n-1] for P[m-1]. If the combination of P[m-1] and P[0] is not found in T[i+1…n-1], the preprocessing phase returns an index that causes the TSPLFC algorithm to terminate. An overview of the TSPLFC exact string matching algorithm is shown in Fig. 1.
[Figure 1: a text string and a pattern string, with the window index i and the scanned positions i' and i'' marked on the text.]
Figure 1. Overview of pattern matching and Text Scanning
B. Example of TSPLFC
The working mechanism of the TSPLFC algorithm is shown in the following example. We select a text string T = "ABCDGHDBDHABABABDCGHDB" and a pattern P = "ABDCGHDB". At the beginning of the search, the pattern is aligned with the text window T[0…7] at the leftmost position of the text string.

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
Text:     A  B  C  D  G  H  D  B  D  H  A  B  A  B  A  B  D  C  G  H  D  B
Pattern:  A  B  D  C  G  H  D  B

In the first searching attempt the TSPLFC algorithm compares six characters, three with each pointer; the left pointer finds a mismatch at index T[2].

At this mismatch the preprocessing phase is called. It scans the text string for the last character 'B' of the pattern. It finds T[11] = P[m-1] and then looks for the first character at T[i'-(m-1)], which in this case is T[4]. Since T[4] ≠ P[0] ('A'), this candidate is rejected.

The preprocessing phase continues scanning T[i+1…n-1] for P[m-1] and finds T[13] = P[m-1] ('B'), but T[6] ≠ P[0] ('A').

It continues scanning for the next occurrence of P[m-1] in the text. This time it finds T[15] = P[m-1], but T[8] ≠ P[0].

The preprocessing phase again continues scanning for the next occurrence of P[m-1] and P[0]. This time it finds T[21] = P[m-1] and T[14] = P[0].

The preprocessing phase returns the index 21, and the searching phase aligns the pattern with the text window as shown below.

Index:    0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
Text:     A  B  C  D  G  H  D  B  D  H  A  B  A  B  A  B  D  C  G  H  D  B
Pattern:                                            A  B  D  C  G  H  D  B

The searching phase now starts matching the pattern P with the text window T[14…21].
In this attempt the searching phase finds a match at index 14 when the left and right pointers cross each other. In this example the TSPLFC algorithm compares fourteen characters in total, makes two attempts and finds one occurrence of the pattern in the text string.

C. Implementation of TSPLFC Algorithm
The TSPLFC algorithm is divided into two phases: the preprocessing phase and the searching phase. The proposed algorithm is given in pseudocode form in Fig. 2 and 3.

Preprocessing Phase
The preprocessing phase of the TSPLFC algorithm takes as input the text T[], the last character P[m-1] and the first character P[0] of the pattern P[], the starting index i+1, and an integer dif that represents the distance between P[m-1] and P[0]. The output of this phase is the first index x in T[i+1…n-1] at which T[x] = P[m-1] and T[x-dif] = P[0]; when such an index is found, the preprocessing phase breaks out of the loop and returns it. If the combination of P[m-1] and P[0] is not found at any T[x] and T[x-dif] in T[i+1…n-1], it returns an invalid index that causes the TSPLFC algorithm to terminate. Detailed steps are shown in pseudocode form in Fig. 2.
Prepro_phase(T[], P[m-1], P[0], index, dif)
    Ret_index ← length[T]
    for x ← index to length[T]-1
        if T[x] = P[m-1] and T[x-dif] = P[0]
            Ret_index ← x
            break
    return Ret_index
Figure 2: Pseudocode of Preprocessing phase of TSPLFC Algorithm
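The preprocessing scan of Fig. 2 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `scan_last_first` and the optional `trace` list of rejected candidates are assumptions introduced here, and returning `len(text)` as the invalid index is one possible convention.

```python
def scan_last_first(text, last, first, start, dif, trace=None):
    """Return the first x >= start with text[x] == last and text[x - dif] == first.

    Returns len(text) (an invalid index) when no such position exists.
    `trace`, if given, collects the positions where only `last` matched
    (hypothetical helper for illustration, not part of Fig. 2).
    """
    # max(start, dif) keeps x - dif from becoming a negative index.
    for x in range(max(start, dif), len(text)):
        if text[x] == last:
            if text[x - dif] == first:
                return x
            if trace is not None:
                trace.append(x)
    return len(text)


if __name__ == "__main__":
    T = "ABCDGHDBDHABABABDCGHDB"
    P = "ABDCGHDB"
    rejected = []
    # After the first attempt on T[0..7], scanning starts at index 8.
    x = scan_last_first(T, P[-1], P[0], 8, len(P) - 1, rejected)
    print(x, rejected)  # 21 [11, 13, 15]
```

On the text and pattern of the example in Section II-B, the rejected candidates 11, 13 and 15 and the returned index 21 agree with the steps described there.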
Searching Phase
The searching phase takes the text T[] and the pattern P[] as input and returns all occurrences of the pattern in the text. Detailed steps are shown in Fig. 3. It uses two pointers (left and right) to compare the pattern with the selected text window from both sides simultaneously. The outer loop searches for the pattern in the text, and the inner while-loop matches the pattern against the selected text window. In each iteration of the inner loop the left pointer is incremented and the right pointer is decremented; when the two pointers cross each other at the middle of the pattern, we have a match. In either case, match or mismatch, the preprocessing phase is called to obtain the index of the next text window, which determines the length of the shift.

Searching_Phase(T[], P[])
    n ← T.length
    m ← P.length
    i ← m-1
    while i ≤ n-1
        left ← 0
        right ← m-1
        while left ≤ right and P[right] = T[i-left] and P[left] = T[i-right]
            left ← left+1
            right ← right-1
        if left > right
            We have a match at: i-(m-1)
        i ← Prepro_phase(T, P[m-1], P[0], i+1, m-1)
Figure 3: Pseudocode of Searching phase of TSPLFC
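Putting the two phases together, a runnable Python sketch of the whole algorithm might look like the following. It is an illustration under stated assumptions, not the authors' code: the names `tsplfc_search` and `_scan` are invented here, the inner loop runs while `left <= right` so that the middle character of an odd-length pattern is also checked, and the attempt and comparison counters follow the counting used in the example of Section II-B (one comparison per pointer per step).

```python
def _scan(text, last, first, start, dif):
    """Preprocessing scan: first x >= start with text[x] == last and
    text[x - dif] == first, or len(text) if there is none."""
    for x in range(start, len(text)):
        if text[x] == last and text[x - dif] == first:
            return x
    return len(text)


def tsplfc_search(text, pattern):
    """Return (matches, attempts, comparisons) for pattern in text.

    Sketch of the TSPLFC idea; the counters are illustrative assumptions.
    """
    n, m = len(text), len(pattern)
    matches, attempts, comparisons = [], 0, 0
    if m == 0 or m > n:
        return matches, attempts, comparisons
    i = m - 1  # index of the last character of the current window
    while i < n:
        attempts += 1
        left, right = 0, m - 1
        while left <= right:
            comparisons += 1  # right pointer comparison
            if pattern[right] != text[i - left]:
                break
            comparisons += 1  # left pointer comparison
            if pattern[left] != text[i - right]:
                break
            left += 1
            right -= 1
        else:
            matches.append(i - (m - 1))  # pointers crossed: full match
        # Match or mismatch, jump to the next candidate window.
        i = _scan(text, pattern[-1], pattern[0], i + 1, m - 1)
    return matches, attempts, comparisons


if __name__ == "__main__":
    print(tsplfc_search("ABCDGHDBDHABABABDCGHDB", "ABDCGHDB"))
    # ([14], 2, 14)
```

For the example text and pattern this reports one occurrence at index 14, two attempts and fourteen character comparisons, the same figures as the worked example.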
D. Analysis of TSPLFC Algorithm
In the preprocessing phase the TSPLFC algorithm does not need any extra memory; therefore its space complexity in all cases (best, worst and average) is O(1). In this phase, if the last and first characters are found, P[m-1] = T[x] and P[0] = T[x-dif], at the first position examined, the time complexity is O(1). However, if P[m-1] and P[0] are not found at any T[x] and T[x-dif] in T[i+1…n-1], the time complexity is O(n-m+1), which can be represented as O(n-m).

In the searching phase, suppose the preprocessing phase returns an index k times by finding P[m-1] and P[0] in T[i+1…n-1]. Matching the m characters of the pattern with each text window from both sides takes m/2 steps, so the k windows take km/2 time and the average time complexity is O(km/2). If the preprocessing phase returns an index at every location, the searching phase matches the m characters of P with every text window; this is the worst case of the TSPLFC algorithm, and its time complexity is O(nm). If the preprocessing phase does not find P[m-1] and P[0] in the text string, the searching phase terminates immediately; this is the best case of the TSPLFC algorithm, and its time complexity is O(1).

The space complexity of the TSPLFC algorithm in the best, average and worst cases is the sum of the text and pattern sizes, as no extra memory is required; it can be represented as O(n+m+Σ), which simplifies to O(n+m). If m is negligible compared to n, the space complexity is O(n).

The TSPLFC algorithm is compared with six existing algorithms: Boyer-Moore bad character rule, Boyer-Moore Horspool, Quick Search, Boyer-Moore Smith, Berry Ravindran and Raita. The average case time complexities of these exact string matching algorithms are shown in the table in Fig. 4. The comparison shows that the average case time complexity O(km/2) of TSPLFC is better than that of the existing algorithms.
Algorithm                    Preprocessing phase    Searching phase
Boyer-Moore Bad Character    O(m+|∑|)               O(mn)
Boyer-Moore Horspool         O(m+|∑|)               O(mn)
Quick Search                 O(m+|∑|)               O(mn)
Boyer-Moore Smith            O(m+|∑|)               O(mn)
Berry Ravindran              O(m+(|∑|)^2)           O(mn)
Raita                        O(n)                   O(m^2)
TSPLFC                       O(m+|∑|)               O(km/2)
Figure 4: Comparison of time complexity
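The searching-phase bounds stated in the analysis can be written out as a short worked derivation. The symbols are notation introduced here for illustration: C is the number of pattern-text character comparisons in the searching phase, and k is the number of windows attempted, counting the initial window. The derivation assumes that the inner loop performs at most ⌈m/2⌉ steps per window with two character comparisons per step; the cost of the preprocessing scan is bounded separately in the analysis and is not included.

```latex
\begin{align*}
C &\le k \cdot 2\left\lceil \tfrac{m}{2} \right\rceil \le k\,(m+1)
   && \Rightarrow\; C = O(km) \text{ with about } km/2 \text{ loop steps,}\\
k &\le n - m + 1
   && \Rightarrow\; C = O\big((n-m+1)\,m\big) = O(nm) \text{ in the worst case,}\\
k &= 1,\ \text{mismatch on the first comparison}
   && \Rightarrow\; C = 1 = O(1) \text{ in the best case.}
\end{align*}
```

For the example of Section II-B, k = 2 and m = 8, so the first bound gives C ≤ 16, against the fourteen comparisons actually counted there.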
III. EXPERIMENTAL RESULTS AND DISCUSSION
The TSPLFC algorithm is implemented and compared with the Boyer-Moore bad character rule, Boyer-Moore Horspool, Quick Search, Boyer-Moore Smith, Berry Ravindran and Raita algorithms, which are considered efficient in exact pattern matching. In the experiments we took a text string T of sixty thousand characters and patterns P of lengths 6, 12, 18, 24, 30, 36, 42, 48, 54 and 60. The text string consists of the four characters Σ = {A, C, G, T}, the characters that occur in DNA sequences. The patterns are also over Σ = {A, C, G, T} and are formed by randomly concatenating these characters. Several experiments were conducted, and the results obtained are compared with those of the six algorithms above, as shown in Fig. 5 and 6. The comparison is made on two bases: the total number of characters compared by each algorithm, and the number of attempts taken by each algorithm to find all occurrences of the pattern in the text.

A. No. of Characters Compared Base Comparison
In the experiments we took a text string of 60,000 characters and patterns of different sizes, both formed by combining the DNA characters A, C, G and T. The graph in Fig. 5 shows the result of the experiments. The X-axis represents the pattern length and the Y-axis represents the total number of characters compared by the different algorithms. Fig. 5 shows that TSPLFC is more efficient than the Boyer-Moore bad character rule, Boyer-Moore Horspool, Quick Search, Boyer-Moore Smith, Berry Ravindran and Raita algorithms, as it compares fewer characters.
[Chart: No. of Characters Compared (Y-axis, 0 to 70000) against Pattern Size (X-axis, 6 to 60) for Boyer Moore BC, Horspool BM, Quick Search, Berry Ravindran, Smith BM, Raita and TSPLFC.]
Figure 5: Characters Compared Base Comparison of TSPLFC
In all experiments with different pattern sizes, TSPLFC provides better results than the existing algorithms, as shown in Fig. 5.

B. No. of Attempts Base Comparison
The comparison based on the number of attempts is shown in Fig. 6. The X-axis of the graph represents the pattern length and the Y-axis represents the attempts made by the different exact string matching algorithms. The result in Fig. 6 shows that the TSPLFC algorithm takes fewer attempts to match the pattern in the text string: the Boyer-Moore bad character rule, Boyer-Moore Horspool, Quick Search, Boyer-Moore Smith, Berry Ravindran and Raita algorithms took more attempts than TSPLFC to complete the matching process. The efficiency of TSPLFC is thus verified through experiment, as shown in Fig. 6.
[Chart: No. of Attempts (Y-axis, 0 to 35000) against Pattern Size (X-axis, 6 to 60) for Boyer Moore BC, Horspool BM, Quick Search, Berry Ravindran, Smith BM, Raita and TSPLFC.]
Figure 6: Attempts Base comparison result of TSPLFC
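The experimental setup described in this section can be approximated with a small counting harness. The sketch below is an illustration under several assumptions, not the authors' benchmark: it builds a random 60,000-character text over {A, C, G, T} with an arbitrary fixed seed, draws random patterns of the ten lengths listed above, and counts attempts and character comparisons for a Horspool-style baseline, one of the compared algorithms. The function names and the counting convention (one attempt per window, one comparison per character pair examined) are choices made here, and the numbers it prints are not the measurements reported in Fig. 5 and 6.

```python
import random


def horspool_count(text, pattern):
    """Horspool search returning (matches, attempts, comparisons)."""
    n, m = len(text), len(pattern)
    # Bad character table: distance from the last occurrence of each
    # character in pattern[:-1] to the end of the pattern.
    shift = {c: m - 1 - j for j, c in enumerate(pattern[:-1])}
    matches, attempts, comparisons = [], 0, 0
    pos = 0  # left end of the current window
    while pos + m <= n:
        attempts += 1
        j = m - 1
        while j >= 0:
            comparisons += 1
            if pattern[j] != text[pos + j]:
                break
            j -= 1
        if j < 0:
            matches.append(pos)
        # Shift by the table entry of the window's last character.
        pos += shift.get(text[pos + m - 1], m)
    return matches, attempts, comparisons


def random_dna(length, rng):
    """Random string over the DNA alphabet used in the experiments."""
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    rng = random.Random(2010)  # arbitrary fixed seed for repeatability
    text = random_dna(60_000, rng)
    for m in range(6, 61, 6):
        pattern = random_dna(m, rng)
        found, attempts, comparisons = horspool_count(text, pattern)
        print(m, len(found), attempts, comparisons)
```

The same two counters can be attached to any of the other algorithms so that attempts and character comparisons are measured under one convention.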
The main intention of this research is to reduce the number of attempts made by exact string matching algorithms by increasing the lengths of the shifts, and to reduce the number of character comparisons by using two pointers that match simultaneously. The experimental results verify that decreasing the number of attempts helps to increase the efficiency of the exact pattern matching process. The experimental results and the best case time complexity, O(1), of the searching phase of the TSPLFC algorithm show that it is more reasonable to look for the pattern characters in the text string when calculating shift lengths in the preprocessing phase.

IV. CONCLUSION
In this study we proposed a new exact string matching algorithm, TSPLFC. In addition to the proposed algorithm, the Boyer-Moore bad character rule, Boyer-Moore Horspool, Quick Search, Boyer-Moore Smith, Berry Ravindran and Raita algorithms were run in the same experiments. The proposed algorithm is compared with these algorithms on the basis of the number of characters compared by each algorithm in the matching phase and the number of attempts made by each algorithm to find all occurrences of the pattern in the text. In the preprocessing phase of TSPLFC, the last and first characters of the pattern are searched for in the text string. The index of these characters in the text string is used to calculate the shift length and align the pattern with the next text window. Its average case time complexity is O(km/2) and its best case is O(1). The analysis and the experimental results show that the TSPLFC algorithm is more efficient than the other algorithms tested. In future work, we will try to improve the worst case and the results of the text scanning approach to exact string matching.
REFERENCES
[1] R.S. Boyer, J.S. Moore, "A fast string searching algorithm," Communications of the ACM, Vol. 20, No. 10, 1977, pp. 762-772.
[2] Knuth, D., Morris, J. H., Pratt, V., "Fast pattern matching in strings," SIAM Journal on Computing, Vol. 6, No. 2, doi: 10.1137/0206024, 1977, pp. 323-350.
[3] Rami H. Mansi, and Jehad Q. Odeh, "On Improving the Naïve String Matching Algorithm," Asian Journal of Information Technology, Vol. 8, No. 1, ISSN 1682-3915, 2009, pp. 14-23.
[4] Ziad A.A. Alqadi, Musbah Aqel, & Ibrahiem M. M. El Emary, "Multiple Skip Multiple Pattern Matching Algorithm," IAENG International Journal of Computer Science, Vol. 34, No. 2, IJCS_34_2_03, 2007.
[5] Ababneh Mohammad, Oqeili Saleh and Rawan A. Abdeen, "Occurrences Algorithm for String Searching Based on Brute-force Algorithm," Journal of Computer Science, Vol. 2, No. 1, ISSN 1549-3636, 2006, pp. 82-85.
[6] A. Apostolico and R. Giancarlo, "The Boyer-Moore-Galil string searching strategies revisited," SIAM J. Comput., Vol. 15, No. 1, 1986, pp. 98-105.
[7] L. Colussi, Z. Galil, and R. Giancarlo, "On the exact complexity of string matching," 31st Symposium on Foundations of Computer Science I, IEEE (October 22-24 1990), pp. 135-143.
[8] R. N. Horspool, "Practical fast searching in strings," Software-Practice and Experience, Vol. 10, No. 3, 1980, pp. 501-506.
[9] Sunday, D.M., "A very fast substring search algorithm," Communications of the ACM, Vol. 33, No. 8, 1990, pp. 132-142.
[10] Cormen, T.H., Leiserson, C.E., Rivest, R.L., Introduction to Algorithms, Chapter 34, MIT Press, 1990, pp. 853-885.
[11] Karp, R.M., Rabin, M.O., "Efficient randomized pattern matching algorithms," IBM Journal on Research Development, Vol. 31, No. 2, 1987, pp. 249-260.
[12] Apostolico, A., Crochemore, M., "Optimal canonization of all substrings of a string," Information and Computation, Vol. 95, No. 1, 1991, pp. 76-95.
[13] Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W., Rytter, W., "Speeding up two string matching algorithms," Algorithmica, Vol. 12, No. 4/5, 1994, pp. 247-267.
[14] Colussi, L., "Fastest pattern matching in strings," Journal of Algorithms, Vol. 16, No. 2, 1994, pp. 163-189.
[15] Crochemore, M. and Rytter, W., "Jewels of Stringology," World Scientific, Singapore, 2002.
[16] Berry, T., Ravindran, S., "A fast string matching algorithm and experimental results," in Proceedings of the Prague Stringology Club Workshop-99, Collaborative report DC-99-5, Czech Technical University, Prague, Czech Republic, 1999, pp. 16-26.
[17] Smith, P.D., "Experiments with a very fast substring search algorithm," Software-Practice and Experience, Vol. 21, No. 10, pp. 1065-1074.
[18] Amjad, H. et al., "A Fast Pattern Matching algorithm with two Sliding Window (TSW)," Journal of Computer Science, Vol. 4, No. 5, ISSN 1549-3636, 2008, pp. 393-401.
[19] Christian Charras, Thierry Lecroq, "Not So Naïve Algorithm," Handbook of Exact Pattern Matching Algorithm, Publication 1999, pp. 81-85.
[20] Thierry Lecroq, "Fast exact string matching algorithms," Information Processing Letters, Vol. 102, No. 6, 2007, pp. 229-235.
[21] Raita, T., "Tuning the Boyer-Moore Horspool string searching algorithm," Software-Practice & Experience, Vol. 22, No. 10, 1992, pp. 879-884.
[22] Charras, C. and T. Lecroq, Hand Book of Exact String-Matching Algorithms, Publication 2004, First Edn., ISBN: 978-0-7546-64, pp. 19-24.