Information Sciences 120 (1999) 1-11
www.elsevier.com/locate/ins
A neural network-based intelligent metasearch engine

Bo Shu, Subhash Kak *

Department of Electrical and Computer Engineering, Louisiana State University, Baton Rouge, LA 70803, USA

* Corresponding author. E-mail address: [email protected] (S. Kak).

Received 4 July 1998; received in revised form 4 November 1998; accepted 20 May 1999

Communicated by George Georgiou
Abstract

Determining the relevancy of web pages to a query term is basic to the working of any search engine. In this paper we present a neural network-based algorithm to classify the relevancy of search results on a metasearch engine. The fast learning neural network technology used by us enables the metasearch engine to handle a query term in a reasonably short time and return the search results with high accuracy. © 1999 Elsevier Science Inc. All rights reserved.

Keywords: World Wide Web; Search engine; CC4 neural network; Information retrieval
1. Introduction

The last 10 years of this century have witnessed the great success of the Internet and the World Wide Web. It is estimated that there are now at least 800 million pages on the Web [1]. To help users mine the Web efficiently, Web search engines have been developed. These search engines collect and index Web pages. Once a search engine accepts a query of keywords from the user, it retrieves Web pages from its database that match the query according to certain criteria. Web search is one of the biggest new industries on the Internet. Some of the well-known search engines are Yahoo, Northern Light, WebCrawler, HotBot, Alta Vista, Excite and Infoseek.
Web search engines are of great help for Web surfing. However, as the Web is growing rapidly and is a dynamic, distributed and autonomous information system, information retrieval from the Web is more difficult than from conventional systems. The performance of the major search engines is far from satisfactory. We list some of the major disadvantages of current search engines below.

First, the coverage of any single search engine is severely limited. Research shows that none of the major search engines indexes more than one-sixth of the total Web pages, and the coverage of search engines may vary by an order of magnitude [1].

Second, the search results are not accurate. The accuracy of the search results is measured by the relevancy of the Web pages to the query terms. A simple query on a search engine can often generate tens of thousands of Web pages. It is obvious that if these pages are not sorted and listed in an appropriate order, almost no meaningful information can be retrieved from such a huge amount of data. All search engines therefore rank the relevancy of the search results to the query terms and display them so that the more relevant pages come first and the less relevant pages come last. The ranking algorithms are usually based on information retrieval models such as the Vector Space Model, probability models and fuzzy logic models [2]. These models depend on the frequency of the query keywords in the document to determine the similarity between the query terms and the document content [3] (a simple frequency-based score of this kind is sketched at the end of this section). However, the frequency of keywords reflects the content of a Web page only very roughly: a high frequency of keywords does not necessarily mean a high relevancy of the Web page. Moreover, standard search engines are primarily concerned with handling queries quickly and therefore tend to use relatively simple, fast ranking schemes [4]. All of this may cause a search engine to give a poor ranking of the search results.

For example, when the sample query "China Sports Express" is submitted to the search engines Yahoo, Excite, Infoseek and WebCrawler, we find that all of them perform poorly. For this search it is assumed that the user aims to find Web pages that provide sports news of China; any Web page that does not offer China sports news or provide a direct hyperlink to a page that does is considered irrelevant to the query term. The analysis of the search results is shown in Table 1. These results show that search engines may yield highly inaccurate search results. Studies by other researchers also show that up to 75% of the search results can be irrelevant [5].
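To make the limitation concrete, the sketch below computes a purely frequency-based similarity score of the kind these models rely on. The function and the sample strings are our own illustration, not an algorithm used by any of the search engines named above; the point is simply that a page that repeats the query keywords can outscore a page that actually provides the desired service.

    from collections import Counter

    def term_frequency_score(query, document):
        """Toy frequency-based similarity: count how often each query
        keyword occurs in the document text and normalise by document
        length. This reflects the content of a page only very roughly."""
        query_terms = query.lower().split()
        doc_terms = document.lower().split()
        if not doc_terms:
            return 0.0
        counts = Counter(doc_terms)
        return sum(counts[t] for t in query_terms) / len(doc_terms)

    # A page that merely repeats the keywords scores higher than a page
    # that actually offers the service the user is looking for.
    spam = "china sports china sports china sports express express"
    real = "daily sports news from china with scores and match reports"
    print(term_frequency_score("china sports express", spam))  # 1.0
    print(term_frequency_score("china sports express", real))  # 0.2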
Table 1
Search results for "China Sports Express"

Search engine    No. of irrelevant pages in the first 30 search results    Total number of pages found
Yahoo            1                                                         3
Excite           17                                                        2,432,530
Infoseek         12                                                        6,959,280
WebCrawler       24                                                        188,118
2. Metasearch engine and neural classification

As discussed in the previous section, each single standard search engine covers only a small fraction of the indexable Web pages. Due to the different technologies used in collecting and indexing Web pages, these search engines yield different results for each query term. It is obvious that if the power of several standard search engines could be combined, the coverage of Web pages would be greatly improved. This is the basic idea of the metasearch engine.

Metasearch engines do not have to index and search for Web pages themselves. Instead, they submit queries to several standard search engines at once [5]; it is the standard search engines that do the real searching. The metasearch engine collects the search results from the different search engines, throws away the redundant pages, combines the rest and displays them in a consistent user interface. Some of the commercial metasearch engines are MetaCrawler, SavvySearch and Dogpile.

Metasearch engines have greatly increased the coverage of the standard search engines. However, as they rely on the standard search engines to provide the summaries and documents of Web pages, they also inherit the limited precision and the vulnerability to keyword spamming of the standard search engines [4]. In this paper, we investigate how to improve the search accuracy of metasearch engines, and we have developed an experimental metasearch engine named Anvish. This technology has been licensed to a commercial search engine company.

As displayed in Table 1, a simple query may return millions of hits from a standard search engine. Clearly the metasearch engine cannot process all of these Web pages. Experience shows that only the first few tens of pages in the search results list provide valuable information to the user. Usually a metasearch engine takes only the top 10 or 20 Web pages from each standard search engine. Anvish takes the top 20 Web pages from Yahoo, 10 from Excite, 10 from Infoseek and 25 from WebCrawler, and then combines them (a sketch of this step is given below).

The next step is to present these Web pages in an appropriate order. The most straightforward way is to place the pages in their natural order: first the pages from Yahoo, next the pages from Excite, then the pages from Infoseek, and finally the pages from WebCrawler, with the pages from each search engine kept in the order in which that search engine returned them. This is how Dogpile presents its search results. Although this method is simple, it has serious problems, because there may be many irrelevant Web pages among those returned by the standard search engines.
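As an illustration of the combination step just described, the sketch below merges the per-engine result lists in their natural order and discards duplicate URLs. The per-engine result counts are those quoted above; the record format, the function names and the use of the URL as the duplicate key are our assumptions rather than details of Anvish or Dogpile.

    # Number of top results taken from each standard search engine,
    # as quoted in the text.
    RESULTS_PER_ENGINE = {"Yahoo": 20, "Excite": 10, "Infoseek": 10, "WebCrawler": 25}

    def combine_results(results_by_engine):
        """Merge the per-engine result lists in their natural order
        (Yahoo, Excite, Infoseek, WebCrawler), dropping duplicate URLs.
        Each result is assumed to be a dict with 'url', 'title' and
        'summary' fields taken from the engine's result page."""
        combined, seen = [], set()
        for engine in ("Yahoo", "Excite", "Infoseek", "WebCrawler"):
            top_n = RESULTS_PER_ENGINE[engine]
            for page in results_by_engine.get(engine, [])[:top_n]:
                if page["url"] not in seen:      # throw away redundant pages
                    seen.add(page["url"])
                    combined.append(page)
        return combined  # natural order, as in Dogpile's presentation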
If the search results are placed in their natural order, it is quite possible that irrelevant Web pages from Yahoo will appear at the top of the results, above relevant Web pages from WebCrawler, and so on. Note that most Web surfers check only the top 20 or 30 matches (the first two pages) of the search results [6]. Under such circumstances, it is very likely that the surfer will miss important information. This is unacceptable. In this paper, we propose a neural network-based classification algorithm to determine the relevancy of these Web pages and to reorder them appropriately. This is the core of the Anvish search engine.

Even though the search results may not be accurate, the top few results returned by each standard search engine are more likely to be relevant to the query term than the last few. It is also reasonable to assume that the relevant Web pages obtained from the different standard search engines are more similar to each other than to irrelevant Web pages, and vice versa. The similarity between two Web pages could be determined more precisely if the complete contents of the pages were loaded and reviewed, but this would be too complex and time consuming. Often the titles and summaries are already good enough to reflect the content of the Web pages. Hence, Anvish analyzes only the titles and summaries returned by the standard search engines: the more common keywords two Web pages share in their titles and summaries, the more similar in content they are considered to be.

Based on these assumptions, we can take the top few and the last few Web pages from each search engine and assume that their classifications (relevant or irrelevant) are already known. These already classified Web pages can then be used as training samples for a neural network. In Anvish, the top two pages from each standard search engine are used as relevant training samples and the last ones are used as irrelevant training samples. Once the training of the supervised neural network is finished, the network is used to classify the remaining pages. The pages classified as relevant are placed at the top of the list of search results, while the irrelevant pages are placed at the end; pages with the same relevancy are simply kept in their natural order (this scheme is sketched below).

The selection of the neural network is based on two considerations: first, the neural network must have good generalization ability; second, the training process must be extremely fast. As the search engine is expected to handle a query in minutes or even seconds, the training must finish almost instantaneously. To satisfy these two criteria, a novel type of feedforward neural network, the CC4 neural network, has been chosen to process the search results.
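Before turning to the details of the CC4 network, the classify-and-reorder scheme just described can be illustrated with the sketch below. It is not the Anvish source code: the record format, the helper names and the use of a shared keyword vocabulary are our assumptions, while the labelling rule (top two pages per engine relevant, last page irrelevant) and the stable reordering follow the description above.

    def build_vocabulary(pages):
        """Collect the keywords appearing in the titles and summaries of all
        candidate pages; these positions define the binary input vector."""
        words = {w for p in pages
                 for w in (p["title"] + " " + p["summary"]).lower().split()}
        return {word: i for i, word in enumerate(sorted(words))}

    def to_binary_vector(page, vocab):
        """1 in position i if keyword i occurs in the page's title or summary."""
        vec = [0] * len(vocab)
        for w in (p := (page["title"] + " " + page["summary"]).lower()).split():
            vec[vocab[w]] = 1
        return vec

    def training_samples(results_by_engine):
        """Top two pages per engine are assumed relevant (label 1),
        the last page per engine is assumed irrelevant (label 0)."""
        samples = []
        for pages in results_by_engine.values():
            samples += [(p, 1) for p in pages[:2]]
            samples += [(p, 0) for p in pages[-1:]]
        return samples

    def reorder(pages, classify):
        """Place pages classified as relevant first; within the same
        relevancy class the natural order of `pages` is preserved,
        because Python's sort is stable."""
        return sorted(pages, key=lambda p: 0 if classify(p) == 1 else 1)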
The CC4 algorithm, proposed by Kak and Tang [7,10], is a new type of corner classification training algorithm for three-layered feedforward neural networks. It requires each training sample to be presented to the network only once; this kind of training is called prescriptive learning. Compared to backpropagation, the corner classification approach has been shown to have comparable generalization capability for pattern recognition and prediction problems while being much faster [8-10].

The CC4 network maps an input binary vector X to an output vector Y. The architecture of the network can be seen in Fig. 1. The input and output layers are fully connected. The neurons are all binary neurons with the binary step activation function

    y = f\left(\sum_i x_i\right) = \begin{cases} 1, & \sum_i x_i > 0, \\ 0, & \sum_i x_i \le 0, \end{cases}    (1)

where x_i = 1 or 0. The number of input neurons is equal to the length of the input vector plus one, the additional neuron being the bias neuron, which has a constant input of 1. The number of hidden neurons is equal to the number of training samples, with each hidden neuron representing one training sample.

The training of the CC4 neural network is very simple. Let w_ij (i = 1, 2, ..., N and j = 1, 2, ..., H) be the weight of the connection from input neuron i to hidden neuron j, and let X_ij be the input to the ith input neuron when the jth training sample is presented to the network. Then the weights are assigned as follows:
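As an illustration of this prescriptive training, the following sketch assumes the standard CC4 weight assignment described by Kak and Tang [7,10]: a weight of +1 from input neurons whose training bit is 1, -1 from those whose bit is 0, a bias weight of r - s + 1 (where s is the number of 1s in the training vector and r is the radius of generalization), and hidden-to-output weights of +1 or -1 according to the desired class. It is a minimal sketch under these assumptions, not the Anvish implementation.

    def train_cc4(samples, labels, r=0):
        """One-pass (prescriptive) CC4 training.
        samples: list of binary input vectors; labels: 0 or 1."""
        hidden = []   # one (weights, bias_weight) pair per training sample
        output = []   # hidden-to-output weight per hidden neuron
        for x, label in zip(samples, labels):
            s = sum(x)                                  # number of 1s in the sample
            w = [1 if bit == 1 else -1 for bit in x]    # +1 for 1-bits, -1 for 0-bits
            hidden.append((w, r - s + 1))               # weight from the bias neuron
            output.append(1 if label == 1 else -1)
        return hidden, output

    def classify_cc4(network, x):
        """Binary step activation (Eq. (1)) at every hidden and output neuron."""
        hidden, output = network
        h = [1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
             for (w, b) in hidden]
        return 1 if sum(wo * ho for wo, ho in zip(output, h)) > 0 else 0

With this rule a hidden neuron fires exactly when the input lies within Hamming distance r of its training sample, which is the corner-classification behaviour. In the Anvish setting, `samples` would be the binary keyword vectors of the training pages selected above, and a small wrapper around `classify_cc4` that first converts a page to its binary vector would supply the `classify` function used in the reordering sketch.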