An Effective Fast Searching Algorithm for Internet Crawling Usage
Chia Zhen Hon, Nor Azhar Ahmad
Fakulti Sistem Komputer & Kejuruteraan Perisian, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Gambang, Pahang, MALAYSIA
Highlights: The search algorithm is a crucial part of any internet application. One important use is internet crawling, which finds the most relevant information from keywords. Searching is among the best-known classes of algorithm and is used throughout existing software such as Google, Facebook and Twitter. An effective search algorithm is therefore essential, because most users depend on it, and this project benefits all of those users. The prototype was developed using PHP and JavaScript, wrapped in a Bootstrap-based web interface. It was tested on eBook searching for the UMP Library, where it ran successfully and searched faster than the approach it was compared against when crawling more than 10k eBook titles and their content. Further development could combine a faster method with the proposed search engine.

Key words: Search, Algorithm, Effective search

Introduction
Algorithms have existed and been used in daily life since early civilization. Search algorithms in particular are widely used and are regarded as a universal problem-solving mechanism. A search algorithm looks for an item (the pattern) within a collection of data (the text). The outcome of a search form is usually a list of data that matches predefined criteria. This type of algorithm is used to retrieve more specific information from an available database. For instance, a search algorithm is applied to find a phrase in a document or to look up a specific date or time. Common text-searching algorithms include the Brute-Force algorithm (BF), the Rabin-Karp algorithm (RK), the Boyer-Moore algorithm (BM) and the Boyer Moore Horspool algorithm (BMH).

This research is concerned with the pattern-matching problem: checking for matches between a pattern and the text held in the database. Text matching is essential in fields such as computer science, biochemical engineering and literature. To check whether a specific pattern is a substring of the text, the algorithm compares a short string (the pattern) with a longer string (the text).

E-book Search Algorithm Development Framework
A prototyping model was applied in developing the system. Each stage was carried out to ensure effective prototype development. Figure 1 shows the project flow.
Figure 1: Development Framework
The model starts with Requirement Planning. In the Rapid Application Development (RAD) model, this stage combines planning and analysis. Without this brief planning, the project cannot be organised well, so the stage is needed to identify the algorithm to be used. In this phase, a suitable title, the problem statements, the objectives and the scope of the project are determined. The title for this project is Implementation of Boyer Moore Horspool Algorithm for Library e-book Searching. The first problem statement concerns the amount of memory space needed by the computations. The second concerns standard measures of running time, so that the search can be carried out in the fastest way. The objectives are to develop a searching function using the Boyer Moore Horspool algorithm based on a keyword in the e-book database, to provide an automatic search routine in the e-book software, and to evaluate the performance of the proposed algorithm through character match percentage.

In the design phase, data are collected from the e-book database so that the Boyer Moore Horspool algorithm can be applied to them. This phase also covers the design of the system prototype for the BMH algorithm. The project interface design consists of a single search interface, where users enter their search and the related output is displayed.
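To make the first objective concrete, a keyword search based on the Boyer Moore Horspool algorithm can be sketched as follows. This is a minimal illustration in Python, not the PHP and JavaScript prototype itself; the function names `build_shift_table` and `bmh_search` and the sample titles are assumptions made for this example and do not come from the project.

```python
from __future__ import annotations


def build_shift_table(pattern: str) -> dict[str, int]:
    """Horspool bad-character table.

    For every character of the pattern except the last one, store the
    distance from its last occurrence to the end of the pattern.
    """
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def bmh_search(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = build_shift_table(pattern)
    matches = []
    pos = 0
    while pos <= n - m:
        # Compare the pattern with the current window from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            matches.append(pos)
        # Slide the window using the text character under the pattern's
        # last position; characters not in the table skip the full length.
        pos += shift.get(text[pos + m - 1], m)
    return matches


# Hypothetical e-book titles, used only to show a keyword lookup.
titles = ["Blood and Bone", "Glossary of Terms", "Loss Prevention"]
keyword = "loss"
hits = [t for t in titles if bmh_search(t.lower(), keyword)]
print(hits)
```

Lower-casing the title before searching is a design choice of this sketch. It also shows why a plain substring search returns words such as "glossary" for the keyword "loss": the keyword occurs inside the longer word.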
Result and Discussion
Character match percentage is used to evaluate the performance of the proposed searching algorithm. After the experimental data were processed, the total average for the Boyer Moore Horspool algorithm indicates a strong match, with a 78 % accuracy result. To evaluate the performance of the proposed algorithm, several experiments were conducted on the data. A sample of 1000 records of different sizes was taken and divided into five datasets of 200 records each. Twenty different keywords were selected and searched to test accuracy on the data. The calculation indicates whether or not the algorithm is suitable for searching. The results obtained from testing one of the datasets are given below.
Table 1: Testing Set

No | Keyword | Expected result | Actual result | Accuracy (%)
1 | Blood | Blooded, warmblooded, coldblooded, bloodstream, blood | Blood | 62
2 | Weak | Weak, Weakness, weaken | Weak | 50
3 | Loss | Loss, Losses, Glossaries, glossary | Loss | 40
4 | Today | Today, today’s | Day | 75
5 | Morning | Morning, morning’s | Morning | 78
6 | God | Godmother, god, godforsaken | God | 43
7 | Knight | Knight, knightly, knighted | Knight | 60
8 | bottle | Bottle, bottles | bottle | 86
9 | car | Carousing, carriage, career, car | car | 55
10 | computer | Computerbased, computer, computers | computer | 73
11 | Apple | Apple, apples | apple | 83
12 | research | Research, researcher | search | 72
13 | human | Human, humans | human | 83
14 | mouse | Mouse, mousse, mouser | mouse | 83
15 | database | Database, databases | database | 89
16 | hot | Hot, hotel | hot | 60
17 | baby | Baby, babyish | baby | 57
18 | honey | Honey, honeys, honeymoon | honey | 60
19 | sword | Sword, swordfish | sword | 55
20 | airplane | Airplanes, airplane | airplane | 89
21 | shaking | Shaking | Shaking | 100

TOTAL AVERAGE: (62 + 50 + 40 + 75 + 78 + 43 + 86 + 55 + 73 + 73 + 83 + 67 + 83 + 89 + 60 + 57 + 60 + 55 + 89 + 100) / 21 = 72 %
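The total average is an arithmetic mean over the Accuracy column, so it can be recomputed as a check. The sketch below is in Python and takes the 21 accuracy values as they are listed for keywords 1 to 21 in Table 1; the variable names are illustrative only. The terms in the printed formula are not exactly the same as the values listed in the table, so the mean obtained here (about 69 %) should be read as a check on the listed values rather than as a corrected result.

```python
# Accuracy (%) for keywords 1 to 21, copied in order from Table 1.
accuracy = [62, 50, 40, 75, 78, 43, 60, 86, 55, 73, 83,
            72, 83, 83, 89, 60, 57, 60, 55, 89, 100]

# Arithmetic mean over all listed keywords.
mean_accuracy = sum(accuracy) / len(accuracy)
print(f"{len(accuracy)} keywords, mean accuracy {mean_accuracy:.1f} %")
```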
Advantages
1. Provides a real-time search routine in the system.
2. Calculates the accuracy of the proposed searching algorithm through character match percentage.
3. Increases knowledge of search algorithms among Malaysians.

Acknowledgement
The researchers would like to thank UMP for giving students the chance to carry out their research.