An Effective Fast Searching Algorithm for Internet ...

4 downloads 71072 Views 530KB Size Report
existing software, such as Google, Facebook, Twitter. Therefore, an .... Apple. Apple, apples apple. 83. 12 research. Research, researcher search. 72. 13 human.
An Effective Fast Searching Algorithm for Internet Crawling Usage Chia Zhen Hon, Nor Azhar Ahmad Fakulti Sistem Komputer & Kejuruteraan Perisian, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300 Gambang, Pahang, MALAYSIA [email protected], [email protected]

Highlights: The search algorithm is a crucial part in any internet applications. One of the crucial usages is internet crawling to find the best information using keywords. The searching algorithm has been one of the famous algorithm which is used throughout most of the existing software, such as Google, Facebook, Twitter. Therefore, an effective search algorithm is very crucial as most of the users are depending on it. This project will benefit to all users. The prototype has been developed using PHP, JavaScript enveloped in a Bootstrap web creator. It has been tested in eBook searching for UMP Library and produced success story. It could search faster in comparison to the other approach when crawling more than 10k eBook titles and content. Further development could be a combination of a faster method inside the proposed searching engine. Key words: Search, Algorithm, Effective search Introduction The algorithm has been existed and been used in our daily life ever since the early civilization. Search algorithm itself is being commonly used and it is known as the universal problem –solving mechanism. Actually search algorithm is just an algorithm for searching an item (pattern) within a set of collection of data (text). Usually, it is a list of data that matches the predefined matching criteria which is the outcome of the search

58

form. This type of algorithm is used to look for more specific information from the available database. For instance, a search algorithm is applied when to search for a phrase in a document or search for a specific date or time. Some of the common text searching algorithm is Brute-Force algorithm (BF), Rabin-Karp algorithm (RK), Boyer-Moore algorithm (BM) and Boyer Moore Horspool algorithm (BMH). This research is concerned about the pattern matching problems. It is applied to check on the matches between pattern matching and the text from the database itself. In the field such as computer science, bio- chemical engineering, literature and more, text matching processing is essential and important. To check whether the specific pattern is a substring of the test, an algorithm is used to compare a short text string with a longer string text. E-book Search Algorithm Development Framework Prototype model exploration applied in the system. Each stage has been implemented to make sure the effective prototype development. Figure 1 shows the project flow.

Figure 1: Development Framework

The model starts with Requirement Planning. In the RAD model, planning is a combination of planning and

59

analysis. Without the brief planning, the project cannot be arranged well. Hence, this stage is needed in the system in order to identify he algorithm selected which is carried out along this stage. In this phase, the suitable title, the problem statement, objectives and scopes of the project are determined. Hence, the title for this project is Implementation of Boyer Moore Horspool Algorithm for Library e-book Searching. This problem is proposed due to amount of memory space needed by the computations. The second problem statement is the means of standard measures of a running time in order to search in the fastest way. The objectives are to develop a searching function using Boyer Moore Horspool algorithms based on a keyword in e-book database, to provide an automatic search routine in the e-book software and evaluate performance of the proposed algorithm through character match percentage. For the design phase, it is needed to collect data provided from the e-book database. This data need in order to be applied to the database in the Boyer Moore Horspool algorithm. Besides, this phase will involve in design for system prototype for BMH algorithm. In project interface design, it will include only one interface which is designed for searching algorithm interface where user need to input their search and display the related output.

60

Result and Discussion The character matching percentage is being used to evaluate performance of the proposed searching algorithm. As we can see, after experimental data was handled, we can conclude that the total average of the Boyer Moore Horspool algorithm is a strong match with a 78 % accuracy result. In order to evaluate the performance of purpose algorithm, a few experiments on data is conducted. In the experiment, 1000 data from different size is taken as a sample. Therefore, based on 1000 data, the data are divided into five datasets which will each handle 200 data. Twenty different keywords is selected and search in order to test the accuracy for the data. Based on the calculation, we will know that the algorithm is suitable use for searching or not. Below are the one of the dataset results gains from the testing.

No

Keyword

1

Blood

2

Weak

3

Loss

Table 1: Testing Set Expected Actual result result Blooded, warmblooded, coldBlood blooded, bloodstream, blood Weak, Weakness, Weak weaken Loss, Losses, Glossaries, Loss glossary

Accuracy (%)

62

50

40

61

4

Today

5

Morning

6

God

7

Knight

8

bottle

9

car

10

computer

11

Apple

12

research

13

human

14

mouse

15

database

16 17

hot baby

18

honey

19

sword

62

Today, today’s Morning, morning’s Godmother, god, godforsaken Knight, knightly, knighted Bottle, bottles Carousing, carriage, career, car Computerbased, computer, computers Apple, apples Research, researcher Human, humans Mouse, mousse, mouser Database, databases Hot, hotel Baby, babyish Honey, honeys, honeymoon Sword, swordfish

Day

75

Morning

78

God

43

Knight

60

bottle

86

car

55

computer

73

apple

83

search

72

human

83

mouse

83

database

89

hot baby

60 57

honey

60

sword

55

20

airplane

21

shaking

TOTAL AVERAGE

Airplanes, airplane Shaking

airplane

89

Shaking

100

(62 + 50 + 40 + 75 + 78 + 43 + 86 + 55 + 73 +73 + 83 + 67 + 83 + 89 + 60 + 57 + 60 + 55 + 89 + 100) / 21 = 72 %

Advantages 1. Able to provide a real-time search routine in the system. 2. Ability to calculate the accuracy of the proposed searching algorithm through character match percentage. 3. Increase of knowledge among Malaysian towards search algorithm. Acknowledgement The researcher would like to thank UMP for giving the student’s chance to carry out their research. References Reviewer-Kintali, S. (2015). Algorithms Unplugged by B. Vöcking, H. Alt, M. Dietzfelbinger, R. Reischuk, C. Scheideler, H. Vollmer, and D. Wagner: The Power of Algorithms by Giorgio Ausiello and Rossella Petreschi. ACM SIGACT News, 46(4), 14-16. Eng, C. H. B. (2015). Efficient compression of large repetitive strings (Doctoral dissertation, RMIT University, Melbourne). Cisłak, A. (2015). Full-text and Keyword Indexes for String Searching. arXiv preprint arXiv:1508.06610.

63

Suggest Documents