GENIMINER: WEB MINING WITH A GENETIC-BASED ALGORITHM

F. Picarougne, N. Monmarché, A. Oliver, G. Venturini
Laboratoire d'Informatique, Université de Tours, 64, Avenue Jean Portalis, 37200 Tours, France

ABSTRACT

We present in this paper a genetic search strategy for a search engine. We begin by showing that important relations exist between statistical studies of the Web, search engines, and standard optimization techniques: the Web is a graph that can be searched for relevant information with an evaluation function and with operators based on standard search engines or on local exploration. It is then straightforward to define an evaluation function that is a mathematical formulation of the user request, and a steady state genetic algorithm that evolves a population of pages with binary tournament selection and specific operators. The creation of individuals is performed by querying standard search engines. The mutation operator explores the neighborhood of a page through the links going out of that page. We present a comparative evaluation performed with the same protocol as used in optimization. Our tool obtains pages that are significantly better than those found by standard search engines for complex queries. We conclude by showing that our framework for Web search could be generalized to other optimization techniques such as parallel genetic algorithms.

KEYWORDS

Search engine, local search, meta-search, genetic algorithm.

1. INTRODUCTION

Standard index-based search engines on the Web have astonishing properties: they hold over a billion Web documents in their index, they handle millions of requests per day, and they return voluminous answers almost in real time, at the cost of huge human and computer resources. But even though these search engines have many advantages, they also have drawbacks: rather simple query languages, a presentation of the results that carries little information, and the fact that the user must very often explore many results by himself before finding interesting pages. In this paper, we make the assumption that the user can wait for his results, for one or two hours for instance, provided that he spends only a very short time in manual analysis of the results. Our final aim is to define a search engine for strategic watch that establishes, at a given time interval (one day for instance), a complete report on how a given subject is presented on the Web. Our approach is complementary to standard index-based search engines. If we assume that the computer has a few hours to give its answer, then it is possible to perform additional computation, something standard search engines cannot afford. This may consist, for instance, in formulating a ''richer'' request, in downloading the pages in order to better analyze their content (Lawrence and Giles 1999a), in proposing a textual clustering of the results (Zamir and Etzioni 2000), or in performing additional search with a given strategy. We deal in this paper with this last point, and we make use of the (near) optimality of genetic algorithms (GAs) (Holland 1975) with respect to the exploration versus exploitation dilemma: for a given number of trials (i.e. Web page downloads), what strategy should one adopt to maximize the expected gain (i.e. finding the most interesting pages for the user)? From this intuitive view, we show that GAs, and more generally evolutionary algorithms, can positively contribute to the problem of defining an efficient search strategy on the Web.


The remainder of this paper is organized as follows: section 2 formalizes the problem we deal with as an optimization problem, by relating concepts used in optimization to concepts used in studies of the Web's statistical properties and of Web search. Section 3 gives the principles of our GA, which evolves a population of Web pages. Section 4 reports on the experimental tests and on a comparison with meta-search. Section 5 concludes with the perspectives that can be derived from this work.

2. SEARCHING THE WEB AS AN OPTIMIZATION PROBLEM

Table 1. Modeling the problem of information search on the Web as an optimization problem

Optimization            Notation                     Web
Search space            S                            Set of pages/documents
Fitness function        f: S → R+                    Relevance of the page w.r.t. the user request
Optimal solution        s* = arg max_{s ∈ S} f(s)    Page that maximizes relevance
Neighborhood relation   V: S → S^k                   Links going out of a page
Creation operator       O_creation: · → S^k          Random IP address generation; results of standard search engines
Local search operator   O: S → S^k                   Exploration of a page's links

Thanks to recent Web studies, we argue that Web search can be seen as a standard optimization problem, and may thus benefit from the knowledge gained in previous optimization studies. We establish a parallel between Web search and the general problem of function optimization, as summed up in table 1. Recent statistical studies have modeled the Web as a graph in which the nodes are Web pages and the edges are the links between those pages (Albert et al. 1999) (Broder et al. 2000). The search space S of our optimization problem is the set of Web pages; it is structured by a neighborhood relationship V: S → S^k between the points of S, given by the links between pages. We associate to this search space an evaluation or fitness function f: S → R+ which numerically evaluates Web pages. A search engine tries to output pages which maximize this function, and thus tries to solve this optimization problem. To scan this search space S, optimization algorithms and search engines both make use of similar search operators:

1. creation operators that initialize points of S. In optimization, random generation is a common creation operator, but in the Web context, randomly generating IP addresses has already been studied (Lawrence and Giles 1999b) for other purposes and only yields a valid Web server with low probability (about one in several hundred). So this kind of random creation operator does not seem suitable for Web search. Another example of creation operator in optimization is a heuristic that builds a solution from the description of the problem. Many search engines, whether based on meta-search or on agents, use such an operator for the Web by querying one or more index-based search engines and outputting the obtained links. From the evolutionary computation point of view, such an operator would normally only generate the initial population; since we need it during all stages of our algorithm, we treat it as a full-fledged operator,

2. operators that modify existing points in the population. Web robots, and more generally Web agents (Menczer et al. 1995) (Moukas 1997), use such a strategy by exploring the links found in pages. From this point of view, a standard optimization heuristic such as hill climbing can be directly adapted to the Web: starting from a given page, explore its neighbors and select the best one according to f as the new starting point.

Since one may define a hill-climbing operator for searching the Web, it appeared natural to us to define an evolutionary algorithm.
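To make the correspondence of table 1 concrete, the hill-climbing heuristic mentioned above can be written down directly. The following sketch (in Python) is our illustration, not part of GeniMiner; fetch_page and extract_links are assumed stand-ins for an HTTP client and an HTML link parser, neither of which is specified here.

    def hill_climb(start_url, fitness, fetch_page, extract_links, budget=100):
        """Hill climbing on the Web graph: the neighborhood of a page is the
        set of pages reachable through its outgoing links (cf. table 1)."""
        best_url = start_url
        best_page = fetch_page(best_url)           # a point s of the search space S
        best_score = fitness(best_page)
        downloads = 1
        while downloads < budget:
            improved = False
            for url in extract_links(best_page):   # V(s): the outgoing links of s
                if downloads >= budget:
                    break
                page = fetch_page(url)
                downloads += 1
                score = fitness(page)
                if score > best_score:             # keep the best neighbor so far
                    best_url, best_page, best_score = url, page, score
                    improved = True
            if not improved:
                break                              # local optimum of f reached
        return best_url, best_score

As in any hill climber, this strategy can get trapped on a locally good page whose links all lead to worse ones, which is precisely the motivation for the population-based approach of the next section.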


3. A GENETIC-BASED SEARCH ENGINE

3.1 Main algorithm

1. Get the user request and define the evaluation function f,
2. Pop ← ∅ (initially empty; the population progressively grows until it reaches a size of PopMax),
3. Generate an offspring page O:
4.   With probability (1 - Pmut) (or if |Pop| < 2): O ← heuristic creation (a page obtained from standard search engines),
5.   With probability Pmut: select one parent page P from the best pages in Pop and let O ← Mutation(P) (exploration of P's links),
6. Evaluate f(O),
7. Insert O in Pop if |Pop| < PopMax, or if f(O) is greater than the fitness of the worst page in Pop, which is then deleted,
8. Go to 3 or stop (Pop is the output given to the user).

Figure 1. The steady state GA used in GeniMiner.

The GA that we propose for Web mining combines the concepts described in the previous section with those of a steady state GA (Whitley 1989) and is shown in figure 1. An individual in the population is a Web page which can be numerically evaluated with a fitness function. Initially, the first individuals are mostly generated with a heuristic creation operator which queries standard search engines (see section 3.3) in order to obtain pages. Then, individuals can be selected or deleted according to their fitness, and can give birth to offspring either through selection and mutation (with probability Pmut) or through the creation operator (with probability 1 - Pmut). Mutation of a parent page P consists in 1) selecting P in the population with a binary tournament selection (randomly choose two pages in Pop and keep the better one), 2) choosing one of the best links l going out of P, and 3) proposing the page it points to as an offspring (see section 3.3). Mutation thus performs local search, while creation performs more global search using index-based search engines. Intuitively, the behavior of this search algorithm can range from that of a meta-search engine (Pmut = 0), which only analyzes and evaluates the results of standard search engines, to that of a search engine which explores in parallel as many local links as possible (Pmut = 1), with selective pressure guiding the search through the links. When Pmut > 0, the selection strategy of the GA decides about the survival of a page in the population and about the number of offspring it will produce; it controls the intensity with which the links of pages are explored. We rely on the assumption that the GA will solve (almost) optimally the decision problem of choosing which pages to explore. As far as we know, other applications of GAs to problems centered on the Web include (Sheth 1994) (Menczer et al. 1995) (Morgan and Kilgour 1996) (Moukas 1997) (Fan et al. 1999) (Monmarché et al. 1999) (Vakali and Manolopoulos 1999). For instance, (Menczer et al. 1995) present an adaptive search with a population of agents, selected according to the relevance of the documents they return to the user. Our approach models the problem at a level closer to the fitness landscape: the GA does not optimize the parameters of searching agents but directly deals with points in the search space.
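The loop of figure 1 can be transcribed almost line for line. The sketch below is a minimal Python rendering; create_from_engines, mutate and fitness are assumed stand-ins for the operators of section 3.3 and the functions of section 3.2, and error handling (e.g. an exhausted creation list) is omitted.

    import random

    def geniminer(fitness, create_from_engines, mutate,
                  pop_max=100, p_mut=0.5, budget=500):
        """Steady state GA of figure 1; pop holds (score, url) pairs."""
        pop = []
        for _ in range(budget):
            if len(pop) < 2 or random.random() < 1.0 - p_mut:
                # heuristic creation: next page proposed by the search engines
                offspring = create_from_engines()
            else:
                # binary tournament: draw two pages at random, keep the better
                parent = max(random.sample(pop, 2))
                # mutation: follow a promising link going out of the parent
                offspring = mutate(parent[1])
            score = fitness(offspring)
            if len(pop) < pop_max:
                pop.append((score, offspring))
            elif score > min(pop)[0]:
                pop.remove(min(pop))      # the worst page is deleted
                pop.append((score, offspring))
        return sorted(pop, reverse=True)  # the final population is the answer

Setting p_mut = 0 reduces this loop to the meta-search baseline studied in section 4, while p_mut close to 1 spends almost the whole download budget on local link exploration.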

3.2 User request and evaluation function

The fitness function f that evaluates Web pages is a mathematical formulation of the user query, and numerous evaluation functions can be defined. In this paper we have used two functions, f1 and f2. f1 is very close to the evaluation functions used in standard search engines and, for a given page P, equals:


f1(P) = 0, if ∃i such that #(Ki) = 0
f1(P) = ∑i #(Ki), otherwise

where # means ''number of occurrences in P'' and where (K1, K2, ...) are the keywords given by the user. This function requires that all keywords are present in P, and it favors pages with many keyword occurrences. The other function, f2, is more complex than f1. It equals:

f2(P) = (∑i=1..k Uniform(p(Ki))) × (3 ∑i #(Ki) + ∑i #(Si) − ∑i #(Sni) + 4 ∑i Linkeval(Li))

where Uniform(p(Ki)) is an entropy-like function taking values between 0 and 1 which penalizes pages with a non-uniform keyword distribution, where (S1, S2, ...) and (Sn1, Sn2, ...) denote respectively lists of words that should and should not be present in P, and where Linkeval(Li) is a function that evaluates the interest of the i-th link found in P. For the latter, we consider that each keyword near the link increases the evaluation of this link according to its proximity to the link. Thus f2(P) is such that P gets a very high score if it contains many keywords from (K1, K2, ...) in uniform proportions, many words from (S1, S2, ...), no word from (Sn1, Sn2, ...), and many links that might lead to relevant pages. We do not give further details about the possibilities offered in the query; we simply mention that it is possible to look for additional information in the pages, like regular expressions or referenced files (MP3 files, etc.), and to analyze other file formats than HTML.
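For illustration, f1 and the Uniform term of f2 might be transcribed as below. The paper does not give a formula for Uniform; normalized Shannon entropy of the keyword proportions is our assumption of what ''entropy-like'' means, and the Linkeval term of f2 is omitted.

    import math

    def f1(counts):
        """counts[i] = #(Ki), the number of occurrences of keyword Ki in P."""
        if any(c == 0 for c in counts):
            return 0.0                    # all keywords must be present in P
        return float(sum(counts))

    def uniform(counts):
        """Entropy-like score in [0, 1]: 1 for perfectly uniform keyword
        proportions, lower when the distribution is skewed (our assumption)."""
        total = sum(counts)
        if total == 0 or len(counts) < 2:
            return 0.0
        probs = [c / total for c in counts if c > 0]
        entropy = -sum(p * math.log(p) for p in probs)
        return entropy / math.log(len(counts))  # divide by the maximum entropy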

3.3 Genetic operators and other search mechanisms

We use a heuristic creation operator which outputs the address of a Web page taken from the results given by five standard search engines (Altavista, Google, Lycos, Voila, Yahoo). It consists in querying each search engine with the keywords (K1, K2, ...) and in extracting the results. The links found are stored in a list that interleaves the rankings of the engines (1st link of the 1st engine, 1st link of the 2nd engine, ..., 2nd link of the 1st engine, ...), and each time the creation operator is called it outputs the next link on this list, as sketched below. When none of the engines can provide further links, the creation operator is not used anymore and is replaced by the mutation operator. This creation operator allows the genetic search to start with points of good quality. As will be seen in the results section, those heuristically generated individuals can be greatly improved by the mutation operator. From a selected parent page P, the mutation operator generates an offspring O by exploring the local neighborhood of P. For this purpose, the links found in P are sorted in decreasing order of the link evaluation function Linkeval (see previous section). In this way, the most promising links are explored first. Each time the mutation operator is called, the next link on the list is given as output. When the list is empty, the creation operator is used instead. In order to speed up page evaluation and to avoid downloading the same page twice, we maintain a ''black list'' of pages which have already been explored. If one of the two previous operators outputs a page of this list, the operator is run again. One should notice that the graph structure of the Web does not allow us to define a crossover operator in a straightforward way. It could be possible to define such an operator by combining links present in two parent pages P1 and P2: if those two pages have a link in common, or links pointing to the same Web site, then it might be desirable to combine this information and to focus the search on these common links or sites.
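The interleaving performed by the creation operator amounts to a round-robin merge of the engines' ranked result lists, with the ''black list'' filtering out pages already seen. A sketch, assuming a hypothetical query_engine(engine, keywords) that returns an engine's ranked list of URLs:

    from itertools import zip_longest

    def creation_stream(keywords, query_engine,
                        engines=("Altavista", "Google", "Lycos", "Voila", "Yahoo")):
        """Yield candidate URLs: the 1st link of each engine in turn, then the
        2nd links, and so on, skipping addresses already explored."""
        rankings = [query_engine(e, keywords) for e in engines]
        seen = set()                      # the ''black list'' of known pages
        for rank in zip_longest(*rankings):
            for url in rank:
                if url is not None and url not in seen:
                    seen.add(url)
                    yield url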

4. RESULTS

4.1 Experimental settings

All the algorithms mentioned in this paper have been run on a standard computer (PC Celeron 500 MHz, 128 MB of memory, Internet connection up to 64 KB/s). We have defined ten queries (see table 2) across several domains, and we have used a testing methodology similar to the one used in


optimization: all tested algorithms are given the same number of page evaluations, 500 in our case. A run represents one to two hours of downloading and computing time. Results are averaged over 4 runs.

Table 2. The ten queries that we have used in our experimental tests

Num  Keywords (K1, K2, ...)              Should (S1, S2, ...)        Should not (Sn1, Sn2, ...)
1    telecom crisis analysis             mobile future network       France
2    fiber optic technology information
3    text poet flower wind               Baudelaire                  Rimbaud
4    wine excellent price good buy       Bourgueil                   Bordeaux
5    cd music michael jackson            download mp3
6    mouse disney movie animation        DVD
7    artificial ant algorithm            experimental comparison
8    genetic algorithm artificial ant
9    javascript window opener
10   dll export class template           tutorial free code example

4.2 Studying the parameters

Table 3. Final population mean quality (15 best individuals averaged quality) for different values of Popmax

Popmax   30      50      100     150     200     300
Mean     398.1   411.5   457.9   429.5   427.1   428.5

Figure 2. Evolution of the population mean quality (for all the individuals in the population) as a function of the number of downloaded pages, for different values of Pmut (0, 0.3, 0.5 and 0.7).

We have first analysed the two parameters of our search engine, namely Popmax and Pmut. Table 3 shows the mean quality of the best individuals in the final population for different values of the population size (Popmax), for request 2. If the population is too large, the binary tournament selection does not concentrate the search on the important pages; if it is too small, the search narrows and the quality of the results decreases. We have confirmed this tendency on other queries, and have therefore used Popmax = 100 in the other tests. Figure 2 shows the results obtained with different values of Pmut, as a function of the number of pages downloaded, for request 3. The beginning of the curve is rather noisy because the initial population


contains only a few individuals. After 100 generations, once the population has filled up, the mean quality increases progressively for all curves. The meta-search (i.e. Pmut = 0, no local link exploration) obtains the best performance until roughly the 600th generation; then, the local exploration of links outperforms the meta-search. When Pmut is too large (i.e. Pmut = 0.7), the search strategy spends too much time in local and unsuccessful exploration of links. We have set Pmut = 0.5 in the following because it seems a good compromise between the heuristic search and the local search. Similar results can be observed with the other queries.

4.3 All queries

Table 4. Comparative results for function f1

         GeniMiner (Pmut = 0.5)        GeniMiner (Pmut = 0)
Query    Min     Max     Mean          Min     Max     Mean
1        18      177     35.4          20      177     39.5
2        37      856     119.0         42      856     131.0
3        33      922     144.4         46      8011    196.8
4        68      908     163.7         81      908     170.0
5        45      10006   324.8         27      10006   267.6
6        66      819     137.9         90      869     188.9
7        33      317     76.1          32      715     73.5
8        52      932     145.1         55      1017    171.1
9        68      672     132.3         68      672     140.1
10       42      489     94.9          54      2504    156.6

Table 5. Comparative results for function f2

         GeniMiner (Pmut = 0.5)         GeniMiner (Pmut = 0.3)         GeniMiner (Pmut = 0)
Query    Min     Max      Mean          Min     Max      Mean          Min     Max      Mean
1        37.2    1034.4   83.2          39.5    1034.4   88.2          41.2    1034.4   91.3
2        102.7   810.0    212.9         109.8   825.5    222.5         99.3    794.5    205.4
3        48.8    1223.7   195.7         53.4    1839.1   182.1         54.6    1102.7   147.4
4        100.6   977.4    234.1         108.2   977.4    204.0         115.4   992.0    243.3
5        277.4   2229.7   439.6         219.8   2107.7   418.5         172.3   2107.7   375.2
6        150.3   841.4    256.7         151.8   1037.2   284.3         155.2   1037.2   281.7
7        63.7    575.3    129.3         59.7    575.3    120.3         51.4    575.3    96.6
8        122.7   1462.3   271.6         107.7   1510.8   273.5         129.3   1510.8   310.3
9        87.3    1097.0   198.2         78.8    1097.0   177.6         88.5    1097.0   198.4
10       99.2    2091.7   258.7         130.4   2091.7   328.8         141.5   3371.7   430.3

We present in tables 4 and 5 the results obtained for the 10 queries, averaged over 4 trials. For f1, GeniMiner obtains comparable results with the different Pmut values. Notice that with Pmut = 0, the GeniMiner process is very close to a meta-search process (based on Altavista, Google, Lycos, Voila, Yahoo). One can thus conclude from this experiment, and for the tested parameter values, that the genetic search is not really useful when the page evaluation function is close to those used in standard search engines. In the following, ''GeniMiner'' denotes runs with Pmut values greater than 0 and ''meta-search'' denotes runs with Pmut = 0. For function f2, which is much more complex than f1, GeniMiner significantly outperforms the meta-search-like process. This is due to the following reason: evaluation functions used in standard search engines are designed to give a very quick answer to simple queries. For instance, they rank pages without downloading and deeply analyzing their content, because this would be too time consuming. This is why they handle simple queries only. Such queries are well adapted to the numerous basic Internet users, but not to more specific and complex information searches, as required for instance in strategic watch. So if function f2 is the evaluation of pages that the user really has in mind, then he is going to spend a lot of time analyzing the results of standard search engines. With GeniMiner, he may find the relevant information without such a long analysis. But to achieve this, GeniMiner requires much more


computation time than standard search engines, and it also needs the results given by such engines for its heuristic creation operator. This is why we mentioned in the introduction that our approach is complementary to standard search engines.

5. DISCUSSION AND CONCLUSION

We have described in this paper an important part of a search engine, namely the search strategy. We have shown that important relationships exist between studies dealing with the Web and studies dealing with optimization. We have thus been able to define a search strategy which implements as directly as possible the concepts used in genetic and evolutionary algorithms. We have shown experimentally the relevance of this approach on several queries by comparing the genetic search with a standard meta-search. The results we obtain show that evolutionary algorithms have an opportunity to help the user find relevant pages in a reasonable time. The efficiency of GeniMiner is due at least to the two following reasons: 1) the evaluation function can be easily changed and adapted to the problem, which increases the relevance of the results, 2) the search strategy can efficiently minimize the search time (including the manual analysis of the results). Among the perspectives that can be derived from this work, we mention the following. We have used a rather standard GA, but one could define a more advanced algorithm with additional evolutionary techniques such as population restarts (Maresky et al. 1995) or self-adaptation of the mutation probability (Eiben et al. 1999). These ideas can easily be integrated in our framework. Also, search engines (like Google for instance) use a parallel architecture. Parallel genetic and evolutionary algorithms have existed for many years (Cantú-Paz 2000), and there is no doubt that such a parallel approach is extremely interesting for our problem: an island model, for instance, would combine the efficiency of the genetic search with an increased rate of page evaluations per second. Finally, there are several other aspects of this search engine that we have not detailed here because they are somewhat out of the scope of evolutionary algorithms, but which are nevertheless crucial for the effective usefulness of such a tool. We simply mention some of the characteristics on which we are currently working: a presentation of the results where pages are clustered together, personalization of the search, a large disk cache for speeding up the downloading of pages, and an efficient Web server for remote searching. We are developing a real-world application in the context of strategic watch.

REFERENCES

Albert R., Jeong H. and Barabasi A.-L. (1999), Diameter of the World Wide Web. Nature, 401:130-131, 1999.
Broder A., Kumar R., Maghoul F., Raghavan P., Rajagopalan S., Stata R., Tomkins A. and Wiener J. (2000), Graph structure in the Web, Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.
Brin S. and Page L. (1998), The anatomy of a large-scale hypertextual Web search engine, Computer Networks and ISDN Systems, 30, 1-7, pp 107-117, 1998.
Cantú-Paz E. (2000), Efficient and Accurate Parallel Genetic Algorithms, Kluwer Academic Publishers.
Eiben A.E., Hinterding R. and Michalewicz Z. (1999), Parameter Control in Evolutionary Algorithms, IEEE Transactions on Evolutionary Computation, Vol 3, 2, 1999.
Fan W., Gordon M.D. and Pathak P. (1999), Automatic generation of a matching function by genetic programming for effective information retrieval, Proceedings of the 1999 Americas Conference on Information Systems, pp 49-51.
Holland J.H. (1975), Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.
Lawrence S. and Giles C.L. (1999a), Accessibility of information on the Web, Nature 400, pp 107-109.
Lawrence S. and Giles C.L. (1999b), Text and image meta-search on the Web, International Conference on Parallel and Distributed Processing Techniques and Applications, 1999.
Maresky J., Davidor Y., Gitler D., Aharoni G. and Barak A. (1995), Selectively destructive re-start, Proceedings of the Sixth International Conference on Genetic Algorithms, 1995, L. Eshelman (Ed.), Morgan Kaufmann, pp 144-150.
Menczer F., Belew R.K. and Willuhn W. (1995), Artificial life applied to adaptive information agents, Spring Symposium on Information Gathering from Distributed, Heterogeneous Databases, AAAI Press, 1995.


Monmarché N., Nocent G., Slimane M. and Venturini G. (1999), Imagine: a tool for generating HTML style sheets with an interactive genetic algorithm based on genes frequencies, 1999 IEEE International Conference on Systems, Man, and Cybernetics (SMC'99), Interactive Evolutionary Computation session, October 12-15, 1999, Tokyo, Japan.
Morgan J.J. and Kilgour A.C. (1996), Personalising information retrieval using evolutionary modelling, Proceedings of PolyModel 16: Applications of Artificial Intelligence, ed. by A.O. Moscardini and P. Smith, pp 142-149, 1996.
Moukas A. (1997), Amalthea: information discovery and filtering using a multiagent evolving ecosystem, Applied Artificial Intelligence, 11(5):437-457, 1997.
Sheth B.D. (1994), A learning approach to personalized information filtering, Master's thesis, Department of Electrical Engineering and Computer Science, MIT, 1994.
Vakali A. and Manolopoulos Y. (1999), Caching objects from heterogeneous information sources, Technical report TR99-03, Data Engineering Lab, Department of Informatics, Aristotle University, Greece.
Whitley D. (1989), The Genitor algorithm and selective pressure: why rank-based allocation of reproductive trials is best, Proceedings of the Third International Conference on Genetic Algorithms, 1989, J.D. Schaffer (Ed.), Morgan Kaufmann, pp 116-124.
Zamir O. and Etzioni O. (2000), Grouper: a dynamic clustering interface to Web search results, Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.
