Performance Evaluation of a Regular Expression Crawler and Indexer
Sadi Evren SEKER
Department of Computer Engineering, Istanbul University, Istanbul, Turkey
[email protected]
Abstract. This study aims to find a solution for optimizing the indexer and crawler modules of a search engine when the possible varieties of the search phrases are known in advance as a regular expression. A search engine can be considered an expert in an area if the search domain is narrowed and the crawling and indexing modules are optimized for this domain. A general expertise of a search engine can be modeled with regular expressions, such as searching only emails or telephone numbers on the Internet. This paper mainly discusses several alternatives for an expert search engine and evaluates the performance of several varieties.
Keywords: Regular Expression, Search Engine, Crawler, Indexer
1. Introduction
Any achievement in search engine technology can benefit all Internet users, since search engines are the main gateways to the information on the Internet. A classical search engine tries to index the information on the Internet by crawling web pages. During the crawling phase, the web spider downloads a web page, extracts the information, indexes the extracted information and then continues to the next web page. Any search engine tries to index the extracted information in a general form of notation for a great variety of search possibilities. As shown in Fig. 1, a spider connects to the Internet and supplies information to the indexer, which is responsible for keeping the information for queries. This information can be kept in a database or can stay in memory for faster results. Finally, a user connects to the search engine through a user interface and queries the data in the indexer. This study mainly concentrates on the question, “What if the search engine knows the regular expression representation of the searched keywords in advance?”. In this case the search engine does not need to index and store information unnecessary for the search result, and a reasonable performance increase would also occur during the processing of the web pages.
Fig. 1 A sample view of a web spider and its components (Internet, web crawler, indexer, indexer database, user interface)
This approach can be useful if an expert search engine is designed, for example, to search only personal information on the Internet. Let us take the case of searching for personal information, such as the email address or telephone number of a given name and surname, on the Internet. In this case all the search engine components should become expert on personal information only. In this study, the representation of the search query is accepted as known in advance in regular expression notation. The web crawler is free to crawl any web page by following the classical crawling algorithms. The indexer is specially optimized for the regular expressions and built on a b+ tree data structure [1], which is also a finding of our previous research [2]. An overview of the developed system is demonstrated in Fig. 2:
Fig. 2 Deployment diagram of the expert search engine on a given regular expression
From the above diagram it is obvious that the web crawler gets the regular expression [3] of its expertise from the user, and crawls and indexes the Internet using this regular expression. The user can also query from the indexer data structure any information obeying the initially provided regular expression.
Throughout this paper, personal information will be used as the example of regular expressions on the Internet. Please note that the initial regular expression, and thus the expertise of the search engine, can easily be updated through user interaction.
2. Regular Expression Spider
A web spider should discover the links in a page and follow them, while creating a list of traversed sites and a follow-up queue for the next sites. Fig. 3 holds the flow chart of the web spider algorithm. The spider gets a URL from the GUI and starts with this initial page. An important check should also be done before processing any URL, against the robots.txt file provided by the web site. If the site permits the spider to go forward, then the spider simply tries to find all the links in the web page and adds these links to a list for further traversal. Finally, the spider gets another link from the “to search” list. This operation keeps looping until the “to search” list is empty.
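As an illustration of the loop just described, a minimal Java sketch of the spider is given below; the RobotsChecker and LinkExtractor interfaces are hypothetical placeholders, since the paper does not specify these modules in code.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// A minimal sketch of the spider loop of Fig. 3: take a URL from the
// "to search" list, honor robots.txt, extract links, and loop until empty.
// RobotsChecker and LinkExtractor are hypothetical helper interfaces.
public class SpiderSketch {

    public static void crawl(String startUrl, RobotsChecker robots, LinkExtractor extractor) {
        Queue<String> toSearch = new ArrayDeque<>();   // "to search" list
        Set<String> searched = new HashSet<>();        // already-searched list

        toSearch.add(startUrl);
        while (!toSearch.isEmpty()) {                  // loop until the list is empty
            String url = toSearch.poll();
            if (searched.contains(url)) continue;      // avoid a double entry
            searched.add(url);

            if (!robots.isAllowed(url)) continue;      // check robots.txt first

            for (String link : extractor.extractLinks(url)) {
                if (!searched.contains(link)) toSearch.add(link);
            }
        }
    }

    // Hypothetical interfaces standing in for the unspecified modules.
    interface RobotsChecker { boolean isAllowed(String url); }
    interface LinkExtractor { Iterable<String> extractLinks(String url); }
}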
Fig. 3 Flow chart of the spider
While producing the “to search” list for the spider's internal use, the list of already searched sites should also be checked to prevent double entries for a site.
Fig. 4 IPO diagram of the spider
A simple input is fetched from the GUI and all the outputs are sent to the indexer.
3. Indexer
This module is responsible for extracting the keywords from the URLs received from the spider. As already discussed in the analysis part, the HTML tokenizer is a part of the indexer and parses the keywords from the sites. Another job of the indexer is keeping the keywords in an appropriate data structure, and there should also be a connection between the indexer data structure and the GUI modules.
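A minimal sketch of such a keyword store is given below, assuming a sorted map as a stand-in for the b+ tree; java.util.TreeMap is used only because the Java standard library offers no b+ tree, so this is an approximation rather than the actual implementation.

import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// A minimal sketch of the indexer's keyword store. The paper uses a b+ tree;
// TreeMap (a red-black tree) is substituted here because the Java standard
// library has no b+ tree implementation.
public class IndexSketch {
    private final TreeMap<String, Set<String>> keywordToUrls = new TreeMap<>();

    // Record that a keyword was extracted from a URL.
    public void add(String keyword, String url) {
        keywordToUrls.computeIfAbsent(keyword, k -> new TreeSet<>()).add(url);
    }

    // Query all URLs indexed under a keyword (empty set if unknown).
    public Set<String> query(String keyword) {
        return keywordToUrls.getOrDefault(keyword, new TreeSet<>());
    }
}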
The two most important modules of the indexer are listed below:
• Data Structure
• Tokenizer
The deployment diagram of the indexer is shown in Fig. 5.
Fig. 5 Connection between the spider and the indexers, and the tokenizers in the indexers
Fig. 5 demonstrates the connection between the spider and the indexers. Each indexer keeps a tokenizer to extract the keywords from HTML pages.
Fig. 6 Data structures between the GUI and the indexer
Fig. 6 demonstrates the connection between the indexer and the graphical user interface.
4. HTML Tokenizer
This module is responsible for extracting the keywords from a given web page. Since all the information on the Internet is transferred in the HTML format, the indexer should parse the HTML format. For performance reasons all the modules should run concurrently, so each HTML tokenizer should run in a concurrent thread. By using a multithreaded implementation, busy waiting by the spider and the rest of the indexing jobs is avoided, as in the sketch below. A simple view of the HTML tokenizer is shown in Fig. 7.
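A minimal sketch of the threading arrangement is given below; the fixed pool size of 8 and the shape of the tokenizer task are illustrative assumptions, not values from the paper.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// A sketch of running each HTML tokenizer in its own thread so the spider
// and the indexer never busy-wait on page processing. The pool size of 8
// is an arbitrary illustrative choice.
public class ConcurrentTokenizerSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(8);

    // Submit one tokenizer job per page; returns a future keyword list.
    public Future<List<String>> submit(Callable<List<String>> tokenizerTask) {
        return pool.submit(tokenizerTask);
    }

    public void shutdown() {
        pool.shutdown();
    }
}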
Fig. 7 Flowchart of the HTML tokenizer
Fig. 7 demonstrates a simple flow chart of the HTML tokenizer. The initial step of the tokenizer is getting the target URL from the web spider. The URL and the keywords extracted from it are returned to the indexer for future queries. The end condition of the HTML tokenizer is having processed all the keywords in the web page; this information can be gathered from the file pointer created on the target web page. Since all the information on the Internet is downloaded to local memory, the currently viewed web page should also be kept on the local computer with a file pointer. The file operations at this level are left to the Java network library. The file operations using this file pointer should keep track of the strings and HTML tokens. Fortunately, all the HTML tags are kept within the “<” and “>” symbols, so a string tokenizer with knowledge of HTML tokens can easily be converted into an HTML tokenizer. The extracted keywords should be kept in a result set with a separator. The result set can be a composite data structure or a well-formatted string for which the indexer obeys the same protocol.
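The loop of Fig. 7 can be sketched as follows, assuming the page body has already been downloaded into a local string; the comma separator for the result string is an illustrative choice.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A sketch of the Fig. 7 loop: scan the downloaded page, keep every token
// that matches the search engine's regular expression, and return the
// matches as a separator-joined result string for the indexer.
public class TokenizerSketch {

    public static String extractKeywords(String pageBody, Pattern regex) {
        StringBuilder result = new StringBuilder();
        Matcher matcher = regex.matcher(pageBody);
        while (matcher.find()) {                          // "page contains more token?"
            if (result.length() > 0) result.append(",");  // separator (assumed)
            result.append(matcher.group());               // "add to result string"
        }
        return result.toString();                         // returned to the indexer
    }
}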
5. Reverse Indexing
One of the major improvements of this study is implementing the indexer in a manner suited to regular expressions. A regular expression can consist of multiple tokens. For example, in Table 1 the sites are listed with their token varieties:
TABLE 1 SAMPLE SITES AND KEYWORDS
Site URL                  Keywords
http://www.microsoft.com  Microsoft, product, support, help, research, training, Office, Windows, software, download, …
http://www.mit.edu        research, offices, about, education, news, students, faculty, …
Indexing of the above list can be done in two ways: either from web site to keywords, or from keywords to web sites. The latter method is called reverse indexing [4] and increases the access time performance. In order to keep the regular expression in a tree with reverse indexing, an efficient way of modeling the regular expression is required. The regular expressions are kept in the tree by tokens. For example, a regular expression for an email address can be represented as below:
[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}
The above regular expression checks the validity of email addresses and has 5 sub-parts:
TABLE 2 SUB PARTS OF THE REGULAR EXPRESSION
1  [A-Z0-9._%+-]+  name of the mail account
2  @               the @ sign of the email
3  [A-Z0-9.-]+     domain of the email
4  \.              the . sign of the domain
5  [A-Z]{2,4}      extension of the domain (2 to 4 chars like com, org, edu)
Let us consider the examples below and their separations according to Table 2.
TABLE 3 PARSING OF SAMPLE EMAILS
Email             1     2  3        4  5
ali@baba.com      ali   @  baba     .  com
john@hotmail.com  john  @  hotmail  .  com
bill@hotmail.com  bill  @  hotmail  .  com
paul@yahoo.com    paul  @  yahoo    .  com
dean@mit.edu      dean  @  mit      .  edu
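To make the sub-parts of Table 2 concrete, the sketch below parses a sample address with a capture group around each sub-part; the grouping and the case-insensitive flag are assumptions added for illustration, since the paper gives the expression without groups.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parse an email into the five sub-parts of Table 2 using capture groups.
// CASE_INSENSITIVE is assumed so lowercase addresses like the Table 3
// samples also match the [A-Z...] character classes.
public class EmailPartsSketch {
    private static final Pattern EMAIL = Pattern.compile(
        "([A-Z0-9._%+-]+)(@)([A-Z0-9.-]+)(\\.)([A-Z]{2,4})",
        Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) {
        Matcher m = EMAIL.matcher("dean@mit.edu");
        if (m.matches()) {
            for (int i = 1; i <= 5; i++) {   // columns 1..5 of Table 3
                System.out.println(i + ": " + m.group(i));
            }
        }
    }
}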
The Table 3 results of the regular expression can be indexed in either direction, as in the tree representation of Fig. 8:
Fig. 8 Tree representation of regular expression results
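Assuming the leveled tree of Fig. 8 stores one meaningful regex token per level (extension, then domain, then account name, following the reverse indexing direction), a minimal nested-map sketch is given below; a production version would use the b+ tree described in Section 6.

import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// A sketch of the Fig. 8 leveled tree for reverse indexing: one level per
// meaningful regex token (extension, then domain, then account name).
// Nested sorted maps stand in for the b+ tree levels of the actual system.
public class RegexTreeSketch {
    private final TreeMap<String, TreeMap<String, Set<String>>> root = new TreeMap<>();

    // Insert a parsed email (Table 3 columns 5, 3 and 1).
    public void insert(String extension, String domain, String name) {
        root.computeIfAbsent(extension, e -> new TreeMap<>())
            .computeIfAbsent(domain, d -> new TreeSet<>())
            .add(name);
    }

    public static void main(String[] args) {
        RegexTreeSketch tree = new RegexTreeSketch();
        tree.insert("com", "baba", "ali");     // ali@baba.com
        tree.insert("com", "hotmail", "john"); // john@hotmail.com
        tree.insert("edu", "mit", "dean");     // dean@mit.edu
        System.out.println(tree.root);         // {com={baba=[ali], hotmail=[john]}, edu={mit=[dean]}}
    }
}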
6. Performance Evaluation of Regular Expression Search Engine
Table 4 holds the crawl statistics of the sample domains, and Table 5 holds the time performance of the search engine on these domains for the regular expression search.
TABLE 4 PERFORMANCE TABLE OF SAMPLE DOMAINS
URL                # of Keywords  Depth  # of pages  # of links
www.yildiz.edu.tr  116            5      85          145
www.setegitim.com  1711           5      366         446
www.mit.edu        313            5      156         792
www.sun.com        5041           5      720         5201
TABLE 5 PERFORMANCE TABLE OF SAMPLE DOMAINS (CONTINUED)
URL                Indexing Time  Reverse Indexing Time  Search Time  Reverse Search Time
www.yildiz.edu.tr  54211          48921                  5196         2176
www.setegitim.com  618822         543228                 5282         2043
www.mit.edu        139864         127464                 6093         2584
www.sun.com        7939231        5094323                7387         2983
Table 5 displays the performance of the indexing and reverse indexing algorithms for the email regular expression covered previously. The data structure for the tree implementations is the b+ tree. Previous research on the data structure of the indexer showed that the best possible data structures are the b-tree variants. Based on this research, the name part of the regular expressions is kept in a b+ tree structure. The search results are measured on email addresses with 50% existing and 50% non-existing search queries.
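As a hedged sketch of how the search times of Table 5 could be measured, the harness below times a batch of queries in which half of the addresses exist in the index and half do not, mirroring the 50%/50% mix described above; the SearchIndex interface is a hypothetical placeholder.

import java.util.List;

// A sketch of the timing method of Section 6: measure the total search
// time over a query mix of 50% existing and 50% non-existing addresses.
// SearchIndex is a hypothetical stand-in for the b+ tree index.
public class SearchBenchmarkSketch {
    interface SearchIndex { boolean contains(String email); }

    public static long timeSearches(SearchIndex index, List<String> queries) {
        long start = System.nanoTime();
        for (String q : queries) {
            index.contains(q);                           // result ignored; timing only
        }
        return (System.nanoTime() - start) / 1_000_000;  // elapsed milliseconds
    }
}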
7. Conclusion
It is obvious that some Internet queries can be represented with regular expressions. This study aims to find an optimized way of crawling and indexing for search engines in the case where the regular expression of the search queries is known in advance. To the best of our knowledge, the parsing result of the regular expression had not been kept in a leveled tree before this research. This new indexing approach has increased the speed of indexing and queries, and a possible data structure and reverse indexing algorithm are suggested for the case. A better version of the indexing data structure could also be applied to this study.
ACKNOWLEDGEMENT
This study was supported by the Scientific Research Projects Coordination Unit of Istanbul University, project number YADOP-16728.
8. References
[1] National Institute of Standards and Technology, nist.gov, 2009.
[2] Ş.E. Şeker and B. Diri, Web Spider Performance and Data Structure Analysis, 2009.
[3] G. Berry and R. Sethi, From regular expressions to deterministic automata, Theoretical Computer Science, 48:117–126, 1986.
[4] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Science Department, Stanford University, 1999.
[5] TUSSE (Turkish Speaking Search Engine), http://www.shedai.net/tusse, 2008.
[6] G.M. Adelson-Velsky and E.M. Landis, An algorithm for the organization of information, Soviet Mathematics 3 (1962), pp. 1259–1263.