Performance Evaluation of a Regular Expression Crawler and Indexer

Sadi Evren SEKER
Department of Computer Engineering, Istanbul University, Istanbul, Turkey
[email protected]

Abstract. This study aims to find a solution for optimizing the indexer and crawler modules of a search engine when the possible varieties of the search phrases are known in advance as a regular expression. A search engine can be considered an expert in an area if the search domain is narrowed and the crawling and indexing modules are optimized for this domain. Such expertise can be modeled with regular expressions, for example searching only for email addresses or telephone numbers on the Internet. This paper discusses several design alternatives for such an expert search engine and evaluates their performance.

Keywords: Regular Expression, Search Engine, Crawler, Indexer


1. Introduction

Any achievement in search engine technology benefits all Internet users, since search engines are the main gateways to the information on the Internet. A classical search engine tries to index the information on the Internet by crawling web pages. During the crawling phase, the web spider downloads a web page, extracts and indexes its information, and then continues to the next page. A search engine tries to index the extracted information in a general notation that supports a great variety of search possibilities. As shown in Fig. 1, a spider connects to the Internet and supplies information to the indexer, which is responsible for keeping the information for queries. This information can be kept in a database or can stay in memory for faster results. Finally, a user connects to the search engine through a user interface and queries the data in the indexer. This study concentrates on the question: "What if the search engine knows in advance the regular expression representation of the keywords being searched?" In this case the search engine does not need to index and store information unnecessary for the search result, and a reasonable performance increase can be expected during the processing of the web pages.


Fig. 1. A sample view of a web spider and its components

This approach can be useful when an expert search engine is designed, for example, to search only for personal information on the Internet. Consider the case of searching for personal information, such as the email address or telephone number of a given name and surname. In this case, all the search engine components should become expert on personal information only. In this study, the representation of the search query is assumed to be known in advance in regular expression notation. The web crawler is free to crawl any web page by following the classical crawling algorithms. The indexer is specially optimized for the regular expressions and is built on a b+ tree data structure [1], a finding from our previous research [2]. An overview of the developed system is shown in Fig. 2.



Throughout this paper, personal information on the Internet is used as the running example of such regular expressions. Note that the initial regular expression, and thus the expertise of the search engine, can easily be updated through user interaction.

2. Regular Expression Spider

A web spider should find the links on a page and follow them, while maintaining a list of traversed sites and a follow-up queue for the next sites. Fig. 3 holds the flow chart of the web spider algorithm. The spider gets a URL from the GUI and starts with this initial page. An important check should be done before processing any URL: the robots.txt file provided by the web site. If the site permits the spider to go forward, the spider simply tries to find all the links in the web page and adds them to a list for further traversal. Finally, the spider gets another link from the "to search" list. This operation keeps looping until the "to search" list is empty.


Fig. 2. Deployment diagram of the expert search engine for a given regular expression

From the diagram it can be seen that the web crawler gets the regular expression [3] of its expertise from the user, then crawls and indexes the Internet using this regular expression. The user can then query, from the indexer data structure, any information obeying the regular expression provided initially.


Fig. 3. Flow chart of the spider

While producing the "to search" list for the spider's internal use, the list of already searched sites should also be checked, to avoid entering the same site twice.
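The crawling loop described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: `LinkExtractor` and `RobotsPolicy` are hypothetical hooks standing in for the real network fetch, link parsing, and robots.txt handling code.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Spider {
    /**
     * Breadth-first crawl: take a URL from the "to search" list, skip it
     * if already searched or disallowed by robots.txt, otherwise extract
     * its links and queue them. Stops when the list is empty.
     * Returns the set of searched URLs.
     */
    public static Set<String> crawl(String startUrl,
                                    LinkExtractor extractor,
                                    RobotsPolicy robots) {
        Deque<String> toSearch = new ArrayDeque<>();
        Set<String> searched = new HashSet<>();   // "already searched" list
        toSearch.add(startUrl);
        while (!toSearch.isEmpty()) {
            String url = toSearch.poll();
            if (!searched.add(url)) continue;     // double-entry check
            if (!robots.allows(url)) continue;    // robots.txt check
            for (String link : extractor.extractLinks(url)) {
                if (!searched.contains(link)) toSearch.add(link);
            }
        }
        return searched;
    }

    /** Hypothetical hook that returns the links found on a page. */
    public interface LinkExtractor {
        List<String> extractLinks(String url);
    }

    /** Hypothetical robots.txt check, reduced to a predicate. */
    public interface RobotsPolicy {
        boolean allows(String url);
    }
}
```

The `HashSet` doubles as the "already searched" list, so the double-entry check is a single `add` call.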

Another job of the indexer is keeping the keywords in an appropriate data structure. There should also be a connection between the indexer data structure and the GUI modules.

Fig. 4. IPO diagram of the spider

As Fig. 4 shows, a simple input is fetched from the GUI and all the outputs are sent to the indexer.

3. Indexer

This module is responsible for extracting the keywords from the URLs received from the spider. As already discussed in the analysis part, the HTML tokenizer is a part of the indexer and parses the keywords from the sites.


The two most important modules of the indexer are:
• Data Structure
• Tokenizer
The deployment diagram of the indexer is shown in Fig. 5.


Fig. 6 Data structures between the GUI and indexer


Fig. 5. Connection between the spider and the indexers, and the tokenizers inside the indexers

Fig. 5 demonstrates the connection between the spider and the indexers. Each indexer keeps a tokenizer to extract the keywords from HTML pages.

Fig. 6 demonstrates the connection between the indexer and the graphical user interface.

4. HTML Tokenizer

This module is responsible for extracting the keywords from a given web page. Since the information on the Internet is transferred in HTML format, the indexer should parse HTML. For performance, all the modules should run concurrently, so each HTML tokenizer should run in its own thread. The multi-threaded implementation avoids busy waiting in the spider and in the rest of the indexing jobs. A simple view of the HTML tokenizer is given in Fig. 7.
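One minimal way to run each tokenizer in its own thread, as described above, is a fixed thread pool. This is a sketch under assumptions: the `tokenize` function is a stand-in for the actual HTML tokenizer, and pages are given as already-downloaded strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ConcurrentTokenizers {
    /**
     * Submits one tokenizer task per page to a fixed thread pool, so
     * the spider never busy-waits on any single page being tokenized.
     * Returns a map from each URL to its extracted keyword list.
     */
    public static Map<String, List<String>> tokenizeAll(
            Map<String, String> pages,                // url -> raw page text
            Function<String, List<String>> tokenize,  // stand-in tokenizer
            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Map.Entry<String, Future<List<String>>>> futures = new ArrayList<>();
            for (Map.Entry<String, String> e : pages.entrySet()) {
                futures.add(Map.entry(e.getKey(),
                        pool.submit(() -> tokenize.apply(e.getValue()))));
            }
            Map<String, List<String>> result = new HashMap<>();
            for (Map.Entry<String, Future<List<String>>> f : futures) {
                try {
                    result.put(f.getKey(), f.getValue().get());
                } catch (InterruptedException | ExecutionException ex) {
                    throw new RuntimeException(ex);
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```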

Fig. 7. Flowchart of the HTML tokenizer

Fig. 7 demonstrates a simple flow chart of the HTML tokenizer. The initial step of the tokenizer is getting the target URL from the web spider. The URL and the keywords extracted from it are returned to the indexer for future queries. The end condition of the HTML tokenizer is exhausting all the tokens in the web page, which can be detected from the file pointer created on the target web page. Since the information is downloaded to local memory, the web page currently viewed is also kept on the local computer behind a file pointer. The file operations at this level are left to the Java network library. Above the file operations, the code using the file pointer keeps track of the strings and HTML tokens. Fortunately, all the HTML tags are kept within the "<" and ">" symbols, so a string tokenizer with knowledge of HTML tokens can easily be converted into an HTML tokenizer. The extracted keywords are kept in a result set with a separator; the result set can be a composite data structure or a well-formatted string for which the indexer obeys the same protocol.

5. Reverse Indexing

One of the major improvements in this study is implementing the indexer in a manner suited to regular expressions. A regular expression can consist of multiple tokens. For example, Table 1 lists sample sites with their keyword varieties:

TABLE 1. SAMPLE SITES AND KEYWORDS

Site URL                  Keywords
http://www.microsoft.com  Microsoft, product, support, help, research, training, Office, Windows, software, download…
http://www.mit.edu        research, offices, about, education, news, students, faculty, …
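Following the flow of Fig. 7, the per-token step — match the token against the expert regular expression, and add it to the result on a match — can be sketched as below. This is a simplification of the paper's tokenizer: it scans raw page text directly instead of stepping through HTML tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTokenizer {
    /**
     * Scans the page text and keeps only the substrings that match
     * the expert regular expression, as in the loop of Fig. 7:
     * get next token, test against the regex, add to result on match.
     */
    public static List<String> extract(String pageText, String regex) {
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(pageText);
        List<String> result = new ArrayList<>();
        while (m.find()) {          // "page contains more token?"
            result.add(m.group());  // "add to result string"
        }
        return result;
    }
}
```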

Indexing of the above list can be done in two ways: either from web site to keywords, or from keywords to web sites. The latter method is called reverse (inverted) indexing [4] and improves access time. In order to keep the regular expression in a tree with reverse indexing, an efficient way of modeling the regular expression is required. The regular expressions are kept in the tree by tokens. For example, the regular expression for an email address can be represented as:

[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}

This regular expression checks the validity of email addresses and has 5 sub-parts, listed in Table 2.

TABLE 2. SUB-PARTS OF THE REGULAR EXPRESSION

#  Sub-part         Meaning
1  [A-Z0-9._%+-]+   name of the mail account
2  @                the @ sign of the email
3  [A-Z0-9.-]+      domain of the email
4  \.               the . sign of the domain
5  [A-Z]{2,4}       extension of the domain (2 to 4 chars, like com, org, edu)

Let us consider the examples below and their separation according to Table 2.

TABLE 3. PARSING OF SAMPLE EMAILS

Email                  1     2  3        4  5
[email protected]       ali   @  baba     .  com
[email protected]  john  @  hotmail  .  com
[email protected]  bill  @  hotmail  .  com
[email protected]    paul  @  yahoo    .  com
[email protected]       dean  @  mit      .  edu
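The separation in Table 3 can be reproduced by turning each sub-part of Table 2 into a capture group, so the matcher's group numbers line up with the columns of Table 3. A minimal sketch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailParts {
    // Each sub-part of Table 2 becomes one capture group, so group 1..5
    // correspond to columns 1..5 of Table 3.
    private static final Pattern EMAIL = Pattern.compile(
            "([A-Z0-9._%+-]+)(@)([A-Z0-9.-]+)(\\.)([A-Z]{2,4})",
            Pattern.CASE_INSENSITIVE);

    /** Returns the five sub-parts of the email, or null if it does not match. */
    public static String[] split(String email) {
        Matcher m = EMAIL.matcher(email);
        if (!m.matches()) return null;
        return new String[] { m.group(1), m.group(2), m.group(3),
                              m.group(4), m.group(5) };
    }
}
```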

The results of Table 3 can be indexed in either direction; Fig. 8 shows a tree representation of the regular expression results.
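The reverse (keyword-to-site) direction can be sketched with an ordered map. Note this is an illustrative stand-in: a `TreeMap` keeps its keys ordered like the b+ tree of the actual implementation, but it is not a b+ tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReverseIndex {
    /**
     * Builds the reverse (inverted) index: from each keyword to the
     * list of URLs containing it. A TreeMap is used as a simple
     * ordered stand-in for the b+ tree used in the real indexer.
     */
    public static Map<String, List<String>> build(
            Map<String, List<String>> urlToKeywords) {
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : urlToKeywords.entrySet()) {
            for (String keyword : e.getValue()) {
                index.computeIfAbsent(keyword, k -> new ArrayList<>())
                     .add(e.getKey());
            }
        }
        return index;
    }
}
```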

TABLE 5. PERFORMANCE TABLE OF SAMPLE DOMAINS (CONTINUED)

URL                 Indexing Time   Reverse Indexing Time   Search Time   Reverse Search Time
www.yildiz.edu.tr        54211             48921               5196              2176
www.setegitim.com       618822            543228               5282              2043
www.mit.edu             139864            127464               6093              2584
www.sun.com            7939231           5094323               7387              2983

Table 5 displays the performance of the indexing and reverse indexing algorithms for the email regular expression covered previously. The data structure for the tree implementations is the b+ tree; our previous research on the indexer data structure [2] showed that the best choices are b-tree variants. Based on this research, the name part of the regular expressions is kept in a b+ tree structure. The search results are measured over email addresses with 50% existing and 50% non-existing search queries.
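The measurement setup described above can be sketched as follows. This is an assumed harness, not the paper's benchmark code: a `TreeMap` stands in for the b+ tree, and the query mix is half existing and half non-existing keys.

```java
import java.util.List;
import java.util.TreeMap;

public class SearchTiming {
    /**
     * Times lookups over a query mix of 50% existing and 50%
     * non-existing keys, mirroring the measurement setup described
     * for Table 5. Returns the total elapsed nanoseconds.
     */
    public static long timeSearches(TreeMap<String, String> index,
                                    List<String> existing,
                                    List<String> missing) {
        long start = System.nanoTime();
        for (String key : existing) index.get(key);   // hits
        for (String key : missing) index.get(key);    // misses
        return System.nanoTime() - start;
    }
}
```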


Fig. 8. Tree representation of the regular expression results

6. Performance Evaluation of the Regular Expression Search Engine

Table 4 holds the crawling statistics of the sample domains, and Table 5 holds the time performance of the search engine on the same domains for the email regular expression.

TABLE 4. PERFORMANCE TABLE OF SAMPLE DOMAINS

URL                 # of Keywords   Depth   # of Pages   # of Links
www.yildiz.edu.tr        116          5         85           145
www.setegitim.com       1711          5        366           446
www.mit.edu              313          5        156           792
www.sun.com             5041          5        720          5201

7. Conclusion

It is obvious that some Internet queries can be represented with regular expressions. This study aims to find an optimized way of crawling and indexing for search engines in the case where the regular expression of the search queries is known in advance. To the best of our knowledge, the parsing result of a regular expression had not been kept in a leveled tree before this research. This new indexing approach has increased the speed of both indexing and querying, and a suitable data structure and reverse indexing algorithm are suggested for this case. A better version of the indexing data structure could also be applied to this study in future work.

ACKNOWLEDGEMENT

This study was supported by the Scientific Research Projects Coordination Unit of Istanbul University, project number YADOP-16728.

8. References

[1] National Institute of Standards and Technology, nist.gov, 2009.
[2] Şadi Evren ŞEKER and Banu Diri, Web Spider Performance and Data Structure Analysis, 2009.
[3] Gerard Berry and Ravi Sethi, From regular expressions to deterministic automata, Theoretical Computer Science, 48:117–126, 1986.
[4] Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Computer Science Department, Stanford University, 1999.
[5] TUSSE (Turkish Speaking Search Engine), http://www.shedai.net/tusse, 2008.
[6] G.M. Adelson-Velsky and E.M. Landis, An algorithm for the organization of information, Soviet Mathematics 3 (1962), pp. 1259–1263.
