Information Systems 30 (2005) 299–316
Generating page clippings from web search results using a dynamically terminated genetic algorithm

Lin-Chih Chen a,*, Cheng-Jye Luh b, Chichang Jou c

a Department of Information Management, National Taiwan University of Science and Technology, 43 Keelung Road, Section 4, Taipei, Taiwan
b Department of Information Management, Yuan-Ze University, 135 Yuan-Tung Road, Jung-Li, TaoYuan, Taiwan
c Department of Information Management, Tamkang University, 151 Ying-Chuan Road, Tamsui, Taipei, Taiwan

Received 7 January 2004; received in revised form 18 March 2004; accepted 26 April 2004
Abstract

We present a page clipping synthesis (PCS) search method to extract relevant paragraphs from other web search results. The PCS search method applies a dynamically terminated genetic algorithm to generate a set of best-of-run page clippings in a controlled amount of time. These page clippings provide users the information they are most interested in and therefore save the users time and trouble in browsing lots of hyperlinks. We justify that the dynamically terminated genetic algorithm yields cost-effective solutions compared with solutions reached by conventional genetic algorithms. Meanwhile, an effectiveness measure confirms that PCS performs better than general search engines.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: Genetic algorithm; Information retrieval; Web search; Page clipping synthesis; Termination criteria
The experimental search engine used in this paper is available at: http://webminer.mis.yzu.edu.tw/cayley/dissertation/ or http://cayley.sytes.net/dissertation/.

*Corresponding author. Tel.: +886-3-4638800; fax: +886-3-4352077.
E-mail addresses: [email protected] (L.-C. Chen), [email protected] (C.-J. Luh), [email protected] (C. Jou).

1. Introduction

With the introduction of HTML, the web has become the largest accessible repository of information. The semi-structured nature of the web makes it very difficult to retrieve information relevant to specific users' needs. Additionally, the rapid growth and the fast change rate of the web make information retrieval (IR) even harder. With current search engine techniques, users' queries are usually answered with thousands of returned documents. Users usually spend lots of time browsing these documents for the treasures they really want. Researchers have proposed that next-generation search engines should use Information Extraction (IE) to return "things" (like people, jobs, companies, and events), their relations, facts, and trends [1]. Until these so-called next-generation search engines are implemented, it would be very helpful to the users if deeper relevant
information from search results could be extracted automatically.

Most search engine results show, for each matched web page, one short paragraph with matched keywords highlighted. Since valuable related information may also be associated with other paragraphs of the matched web pages, it would be helpful to provide the paragraphs of the matched web pages in a page clipping fashion. However, the complexity of properly displaying the paragraphs of the matched web pages grows exponentially, since the number of paragraphs is usually huge.

To generate page clippings from web search results, we propose a genetic algorithm-based page clipping synthesis (PCS) method that treats a set of special words as genes. These special words are collected from the words surrounding queries in the correct answers of previous IR research. If a paragraph matches one special query phrase pattern, which includes one query term and one special word within the range of three words, this paragraph is encoded as a gene of '1'; otherwise, it is encoded as '0'. We assign a weight to each paragraph in a page clipping based on the relative position of the paragraph, to represent the importance of the paragraph. The fitness function associated with a page clipping is then defined as the sum of the weights of its paragraphs. IR in the form of page clippings is thus transformed into finding a set of page clippings with maximum fitness values.

Since the problem of finding the maximal page clippings in PCS is hard to solve, we propose to dynamically determine its termination criteria based on the improvement ratio and the standard deviation of the improvement. Based on observed user behavior, we assume that Internet users prefer quick responses from search engines. PCS is intended to yield cost-effective solutions within a controlled amount of time rather than to reach the global optimum.

The rest of this paper is organized as follows: Section 2 introduces related work. Section 3 discusses PCS in detail. We then present three experiments on PCS in Section 4. The first experiment focuses on the termination criteria; the second experiment justifies that PCS yields cost-effective solutions; and the third experiment
presents an effectiveness measure of PCS with comparisons to other search engines. Finally, Section 5 concludes this paper.
2. Related work

2.1. Application of genetic algorithms to information retrieval

Genetic algorithms have been applied extensively in IR. Gordon [2,3] proposed a genetic algorithm-based approach for document indexing. In his formulation, a keyword represents a gene, a document's list of keywords represents a chromosome, and a collection of relevant documents judged by a user represents the population. The population then evolves through generations and eventually finds a set of keywords which, in terms of the fitness function, best describes the documents. Petry et al. [4] applied genetic algorithms to a weighted IR system. In their design, a weighted Boolean query was modified to improve the recall rate and precision rate. They found that the form of the fitness function had a significant effect on IR performance. Yang and Korfhage [5] used relevance feedback to develop adaptive retrieval methods based on genetic algorithms and the vector space model. They reported the effect of adopting genetic algorithms in large databases, the impact of genetic operators, and the genetic algorithm's parallel searching capability. Chen et al. [6,7] used the best-first search algorithm and the genetic algorithm to develop a web spider system. They concluded that the genetic algorithm spider did not outperform the best-first search spider, but they found both results to be comparable and complementary. Nick and Themis [8] employed a genetic algorithm in an intelligent agent system that recommends web pages directly to users. In order to assist the intelligent agent in learning a user's interests, the user is requested to provide some web page examples of interest in advance. Picarougne et al. [9] also developed a web spider system, called GeniMiner. They claimed that the genetic search can be valuable when (1) the user can wait for a longer time than in standard search engines, and (2) queries are more
complex or more precise than a list of keywords. However, users usually expect quick responses from search engines. Vrajitoru [10] used a genetic algorithm to improve performance in IR systems. To avoid the classical crossover operator producing less fit offspring than their parents, he used two cross points instead of one and treated the two input individuals differently.

Many studies adopt genetic algorithms to solve the relevance feedback problem, one of the applications of IR [11–13]. These studies found that the fitness function plays an important role in improving the precision rate. Thus, the design of the fitness function considers not only the documents that were retrieved but also the order in which they were retrieved. How to find appropriate ranking metrics to retrieve more relevant documents and fewer non-relevant documents for users remains a big challenge, and using a static ranking function cannot guarantee good performance under all situations. Several researchers [14–16] used genetic programming to develop the ARRANGER technique, which can discover ranking functions automatically. They claimed that the advantage of ARRANGER lies in its ability to learn the "optimal" ranking functions for different contexts by effectively combining multiple types of evidence in an automatic and systematic way. Several other studies also use genetic algorithms to solve information filtering problems such as newsgroup filtering [17–19], e-mail marketing [20], and junk e-mail filtering [21].

2.2. Genetic algorithm's termination criteria

Genetic algorithms are generally terminated once they reach either a predefined objective condition or a predefined maximum number of generations (NG). Due to the goal of converging to the optimum, most genetic algorithms are time-consuming [22–24]. The majority of genetic algorithm research aims at reaching the global optimum in a reasonable amount of time. Several researchers have conducted work either to prevent early convergence or to adjust the threshold. Eshelman and Schaffer [25] suggested using incest prevention with an elitist selection replacement strategy to prevent premature convergence.
Rudolph and Sprave [26] developed a self-adjusting threshold mechanism that chooses offspring for the next generation. They claimed that the mechanism converges to the global optimum if it runs through a predefined maximum NG. Sheth [19] adopted user feedback to adjust the allowable maximum NG in newsgroup filtering. Several researchers [27–29] used Markov chains to calculate the smallest NG required to guarantee an optimal solution with a fixed probability. In their proofs, the genetic algorithm can reach the optimal solution within an upper bound of time. Koza [24] described a probability model to minimize the total number of individuals that need to be processed until a predefined maximum NG is reached. He pointed out that there is a point after which the cost of extending a given run exceeds the benefits obtained from the increase in the cumulative probability of success. Our idea of developing a dynamically terminated genetic algorithm to solve the page clipping generation problem is similar to Koza's suggestion.

2.3. Question answering

The goal of question answering (QA) is to return the most likely answers to a given question that can be located in a predefined, locally accessible corpus of documents. Salton et al. [30] developed a dual text comparison system in order to verify both the global vector similarity between query and document texts and the coincidence in the respective local contexts. Documents whose global similarity falls below a stated threshold are rejected. Globally similar text pairs that also have a sufficient number of locally similar substructures are assumed to be related. Hovy et al. [31] developed a QA system based on IR and natural language processing techniques. A given question is first parsed to create a query in order to retrieve the top-ranked documents. These top-ranked documents are then split into segments and further ranked. Finally, potential answers are extracted and sorted according to a ranking function involving the match with the question type and manually constructed patterns. The matching of each question and potential answer to these patterns is done using rules learned from a
machine learning-based grammar parser. Radev et al. [32] proposed a probabilistic algorithm, question answering using statistical models (QASM), that learns the best query paraphrase of a natural language question. A total of 10 query modulation operators (INSERT, DELETE, DISJUNCT, etc.) are identified and trained over a wide range of training questions. The best operators tailored to different questions are then identified and applied for later online QA using web search engines. Radev et al. [33] next developed an architecture that augments existing search engines so that they can support natural language QA. The process consists of the following steps: query modulation, question type recognition, document retrieval, passage (sentence) retrieval, answer extraction, and answer ranking. They concluded that sentence ranking can indeed improve the performance of QA.

2.4. Search engine vector voting and hyperlink prediction

This paper builds on several preceding projects in which we developed the search engine vector voting (SVV) and hyperlink prediction (HLP) search methods. The SVV mechanism was developed to rate a web page among the search results returned from the following six well-known search engines: Google, Yahoo, AltaVista, LookSmart, Overture, and Lycos. (Google has been blocking queries from metasearch engines for about 2 years. In order to access Google using the SVV method, we make our crawler emulate Opera 7.11 by modifying the user-agent of the HTTP header to "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0) Opera 7.11".) A web page wins a vote from a particular search engine if its URL is placed among the top 50 records of that search engine's results. That is, the weight of a web page is the dot product of its weights from the six source search engines. The SVV method arranges its own search results in descending order of the weights. The HLP mechanism was then developed to predict all the hyperlinks that users are most likely to visit starting from the SVV search results. The HLP method is primarily based on the tournament competition concept, in which hyperlinks with
larger weights have a higher chance of being collected. More details about HLP can be found in our previous research [34].

3. Application of genetic algorithm to page clipping generation from web search results

3.1. Genetic algorithm formulation of page clipping generation

Given an ordered list of weighted web pages produced by a query to a search engine, our goal is to generate a set of page clippings, consisting of paragraphs from these web pages, to provide instructive hints about the query. A method called PCS is proposed to solve this problem. In the following, we use the search results of the HLP method [34] as input for PCS. As mentioned in Section 2.4, these web pages are sorted in descending order of their weights. Most of the paragraphs in the page clippings generated by PCS exhibit the following properties:

1. They are among the first few paragraphs of a web page.
2. They come from the first few web pages.
3. They match the query and a special word within the range of three words.

If a query term and a special word occur together within the range of three words in a paragraph, we deem the paragraph to be of interest to the user. For example, let the query be "php". A paragraph containing the phrase "...php is..." might be an answer the user wants. More sophisticated methods for finding paragraphs of interest to the user could easily be incorporated into PCS. A genetic algorithm is adopted in PCS for assigning weights to the paragraphs and generating page clippings. Most paragraphs in a page clipping with a high fitness value would exhibit the above properties.

3.1.1. Parameter encoding

The parameters of the page clipping problem are translated into those of the genetic algorithm as follows:

* Population: For any generation, the population size represents the number of all page clippings produced in that generation for a query. The initial population consists of the result web pages of HLP for the query.
* Chromosome: A chromosome is a sequence of genes. It is used to represent one page clipping.
* Gene: A gene is used to represent a paragraph in a page clipping. If a paragraph matches one special query phrase pattern, which includes one query term and one special word within the range of three words, this paragraph is encoded as a gene of '1'; otherwise, it is encoded as '0'. This gene encoding method is similar to Specific Express Forms, which transforms queries into questions and performs pattern matching for every question [35]. The special words notably play an important role in encoding the genes. In order to find commonly used special words, we counted the occurrences of the words near the answers to the 500 queries provided in the TREC 2002 QA data [36]. Note that we reduce the plural and/or past tense forms of a word to its base form, i.e., the present singular form. For example, the variant forms of BE, such as are, was, and were, are covered by the base form is. Consequently, we found that the following 20 words are most frequently used: the, in, of, and, to, a, is, on, as, at, by, it, for, that, from, with, have, or, which, and about. A sketch of this encoding is given after this list.
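To make the encoding concrete, the following Python sketch shows one way to turn the paragraphs of a page clipping into a gene string under the three-word co-occurrence rule described above. The tokenization, the helper names, and the example paragraphs are our own illustrative assumptions, not the authors' implementation.

```python
import re

# The 20 most frequent special words found near answers in the TREC 2002 QA data.
SPECIAL_WORDS = {"the", "in", "of", "and", "to", "a", "is", "on", "as", "at", "by",
                 "it", "for", "that", "from", "with", "have", "or", "which", "about"}

def paragraph_gene(paragraph, query_terms, window=3):
    """Encode one paragraph as '1' if some query term and some special word
    occur within `window` words of each other, and as '0' otherwise."""
    words = re.findall(r"\w+", paragraph.lower())
    terms = {t.lower() for t in query_terms}
    for i, word in enumerate(words):
        if word in terms:
            nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
            if any(w in SPECIAL_WORDS for w in nearby):
                return "1"
    return "0"

def chromosome(paragraphs, query_terms):
    """A chromosome is the concatenation of the genes of all paragraphs."""
    return "".join(paragraph_gene(p, query_terms) for p in paragraphs)

# "...PHP is..." matches the query term "php" and the special word "is".
print(chromosome(["Ever wondered how popular PHP is?", "Unrelated text."], ["php"]))
# -> "10"
```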
3.1.2. Fitness value computation

For each page clipping, in addition to a sequence of '0's and '1's representing the sequence of paragraphs, the information about the weight of each original page and the relative position of each paragraph in its original page could be utilized in the fitness function. For a query with term sequence ts and a page clipping pc, we use the following formula to compute the fitness value f_{pc,ts}:

f_{pc,ts} = \frac{\sum_{\forall p \in pc} (G_{p,pc} \cdot W_{p,pc,ts})}{\log T_{pc}},   (1)

where "\forall p \in pc" stands for all paragraphs p in pc; G_{p,pc} is the gene representing p in pc, and would be either '1' or '0'; W_{p,pc,ts} is the weight of p in pc for term sequence ts, and will be discussed later in formula (4); T_{pc} is the total number of paragraphs in pc; and \log T_{pc} is the penalty cost.

The following characteristics of common user behavior are captured in calculating W_{p,pc,ts}: users generally prefer top-ranked search results, and this preference gradually decreases. This phenomenon is called the primacy effect in psychology [37]. Several researchers [38,39] also have similar observations of most users' behavior. Consequently, users usually spend much more time on top-ranked search results. Based on the primacy effect, we define the user behavior function (UBF) for the ith item l_i within an ordered item list l as follows:

UBF(l, l_i) = a \cdot i^{b}   (where b < 0),   (2)

where a is the user's preference of the first item, which in general would significantly influence the user's impression of the items in the item list, and b is the user preference decay factor. Fig. 1 demonstrates the effect of the user preference decay factor for b = -0.3 and -0.9. When |b| is small, the UBF value decreases slowly.

Fig. 1. Examples of UBF trajectory.
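The following Python sketch illustrates formulas (1) and (2). The function names and parameter values are illustrative assumptions, and the weights W_{p,pc,ts} are assumed to be supplied as a list (they are defined in formula (4) below).

```python
import math

def ubf(a, b, i):
    """User behavior function (formula 2): preference for the i-th item of an
    ordered list, given first-item preference a and decay factor b < 0."""
    return a * i ** b

def fitness(genes, weights):
    """Fitness of a page clipping (formula 1): weighted sum of the matching
    paragraphs ('1' genes) divided by the penalty cost log(T_pc).
    Assumes the clipping has more than one paragraph, so log(T_pc) > 0."""
    t_pc = len(genes)
    matched = sum(w for g, w in zip(genes, weights) if g == "1")
    return matched / math.log(t_pc)

# A small |b| makes the preference decay slowly (cf. Fig. 1).
print([round(ubf(1.0, -0.3, i), 3) for i in range(1, 6)])
print(fitness("10110", [0.9, 0.7, 0.5, 0.4, 0.3]))
```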
Based on the UBF, the initial weight w_{pc,p} of a paragraph p in page clipping pc is defined as follows:

w_{pc,p} = a_{pc} \cdot x_{pc,p}^{b_1},   (3)

where a_{pc} is the initial weight of the first paragraph in pc, which it inherits from its original web page; x_{pc,p} is the sequential-order number of p in pc; and b_1 is the user preference decay factor of the paragraphs in pc.

We also need to consider which terms in a query are matched. For example, if the query is "php mysql apache", users generally expect search results satisfying the property that a paragraph matching more leading terms is assigned a higher weight. Let w_{ts} be the weight of a paragraph matching the term sequence ts. For the query "php mysql apache", users would expect w_{php, mysql, apache} > w_{php, mysql} > w_{php, apache} > w_{php}. The UBF is again applied to this situation. The weight W_{p,pc,ts} of paragraph p in page clipping pc for a query with term sequence ts is adjusted as follows:

W_{p,pc,ts} = w_{pc,p} \cdot \sum_{\forall t \in ts} x_{p,t,ts}^{b_2},   (4)

where "\forall t \in ts" stands for all query terms t in ts; w_{pc,p} is the initial weight of the paragraph p in pc; x_{p,t,ts} is 0 if p does not match any phrase pattern of t, and otherwise is the rank of term t in ts; and b_2 is the user preference decay factor of the relative order of the terms in a query. For example, if the query is {php, mysql, apache} and the paragraph p contains phrase patterns matching "php" and "apache", then x_{p,php,{php,mysql,apache}} = 1 and x_{p,apache,{php,mysql,apache}} = 3.

Inverse document frequency (IDF) [43] is a popular measure of a word's importance used in information retrieval. A drawback of IDF is that all the texts that contain a certain term are treated equally. According to the primacy effect, users prefer terms in the front over those in the back. Thus, we adopt the UBF instead of IDF to redefine term weighting.
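A minimal Python sketch of formulas (3) and (4) follows. The treatment of non-matching terms as contributing nothing to the sum, and the concrete parameter values, are our assumptions for illustration.

```python
def initial_weight(a_pc, x, b1):
    """Formula (3): w_{pc,p} = a_pc * x**b1, where x is the paragraph's
    sequential-order number in the clipping and b1 < 0 is the decay factor."""
    return a_pc * x ** b1

def adjusted_weight(w_p, term_ranks, b2):
    """Formula (4): W_{p,pc,ts} = w_{pc,p} * sum of x_{p,t,ts}**b2 over the query
    terms; terms that do not match the paragraph (rank 0) are assumed to
    contribute nothing to the sum."""
    return w_p * sum(r ** b2 for r in term_ranks if r > 0)

# Query {php, mysql, apache}: the paragraph matches "php" (rank 1) and
# "apache" (rank 3) but not "mysql" (rank 0).
w_p = initial_weight(a_pc=10.0, x=2, b1=-0.3)
print(adjusted_weight(w_p, term_ranks=[1, 0, 3], b2=-0.9))
```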
The pseudo-code of our genetic algorithm is listed as follows:

Algorithm {
  1. Encoding: Translate the parameters of the page clippings problem to the population, chromosomes, and genes of the genetic algorithm.
  2. Computing the fitness value: Compute the fitness value for each chromosome in the population.
  3. While (termination criteria are not satisfied) {
       While (offspring's population size < initial population size) {
         3.1. Selection: Randomly choose two chromosomes from the population.
         3.2. Crossover: The crossover operator is performed on these two selected chromosomes if a randomly generated float number is less than the crossover rate.
         3.3. Mutation: The mutation operator is performed individually on each of the resulting chromosomes if a randomly generated float number is less than the mutation rate.
         Increase offspring's population size.
       } // end of while
       3.4. Compute the fitness value: Compute the fitness value of the offspring.
       3.5. Substitute: Choose the top population-size chromosomes among the offspring and the current generation as the next generation.
       3.6. Increase the generation number.
     }
  4. Return the best solution.
} // end of algorithm
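For illustration, the pseudo-code above can be rendered as the following Python skeleton. The operator, fitness, and termination functions are assumed to be supplied as described in Sections 3.2 and 3.3, and the crossover and mutation rates shown are arbitrary defaults, not values reported in this paper.

```python
import random

def run_ga(population, fitness_fn, select, crossover, mutate, terminated,
           crossover_rate=0.8, mutation_rate=0.2):
    """Plain rendering of the pseudo-code: build an offspring pool of the same
    size as the population, then keep the fittest chromosomes overall."""
    pop_size = len(population)
    generation = 0
    while not terminated(population, generation):           # step 3
        offspring = []
        while len(offspring) < pop_size:
            a, b = select(population, fitness_fn)            # 3.1 selection
            if random.random() < crossover_rate:             # 3.2 crossover
                a, b = crossover(a, b)
            for child in (a, b):                             # 3.3 mutation
                if random.random() < mutation_rate:
                    child = mutate(child)
                offspring.append(child)
        ranked = sorted(population + offspring, key=fitness_fn, reverse=True)
        population = ranked[:pop_size]                       # 3.5 substitution
        generation += 1                                      # 3.6
    return max(population, key=fitness_fn)                   # 4. best solution
```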
3.2. Genetic operators in the application

In this section, we explain how the three genetic operators (selection, crossover, and mutation) are performed on the chromosomes. A code sketch of these operators is given after Fig. 3.

* Selection: We employ Roulette Wheel selection [22,40–42] to randomly choose two chromosomes from the current generation. The higher the fitness value of a page clipping, the higher its chance of being chosen to participate in the crossover and mutation operators.
* Crossover: We first generate two random numbers from the range [0..1]. If the first random number is less than the second one, then crossover takes place. The lengths of these two chromosomes may not be identical, since individual page clippings contain a different
number of paragraphs. Thus, the shorter chromosome is padded with Xs (Don't Care) to make the two equally long. The crossover operator is performed by randomly choosing two crossing points and exchanging the genes in between from one chromosome to the other. All the Xs are erased afterwards. As shown in Fig. 2, the chromosomes 110011110111 and 01011101011100110010 are selected for crossover with positions 7 and 9 (as indicated by the vertical bars '|').

Fig. 2. Example of crossover for two strings with different length.

* Mutation: The normal bit-inversion mutation is not applicable to the page clipping problem, since a gene represents whether or not its corresponding paragraph contains special phrase patterns matching at least one query item. Our mutation instead reverses the order of some genes to achieve improvements in the fitness value. Two random numbers are generated from the range [0..1]. If the first random number is less than the second one, then mutation takes place. A two-point mutation operator is applied individually to the two chromosomes after crossover, such that the order of the genes between two randomly chosen positions is reversed. As an example, consider the two chromosomes in Fig. 3 with positions 7 and 9 chosen.

Fig. 3. Example of mutation.
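The following Python sketch illustrates the three operators described above: roulette wheel selection, two-point crossover with 'X' padding, and two-point order-reversal mutation. The way random crossing points are drawn and the helper names are illustrative assumptions.

```python
import random

def roulette_select(population, fitnesses):
    """Roulette Wheel selection: chromosomes are drawn (with replacement) with
    probability proportional to their fitness values."""
    return random.choices(population, weights=fitnesses, k=2)

def two_point_crossover(c1, c2):
    """Pad the shorter chromosome with 'X' (Don't Care), swap the genes between
    two randomly chosen crossing points, then erase the padding again."""
    length = max(len(c1), len(c2))
    a, b = list(c1.ljust(length, "X")), list(c2.ljust(length, "X"))
    i, j = sorted(random.sample(range(length + 1), 2))
    a[i:j], b[i:j] = b[i:j], a[i:j]
    return "".join(a).replace("X", ""), "".join(b).replace("X", "")

def two_point_mutation(c):
    """Reverse the order of the genes between two randomly chosen positions."""
    genes = list(c)
    i, j = sorted(random.sample(range(len(genes) + 1), 2))
    genes[i:j] = reversed(genes[i:j])
    return "".join(genes)

# The crossing points are drawn at random, so the output varies from run to run.
print(two_point_crossover("110011110111", "01011101011100110010"))
print(two_point_mutation("11100110010"))
```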
3.3. Termination criteria in the application

We assume that most users want quick responses from search engines, even if those responses are neither globally optimal nor suboptimal. These responses must be cost effective in the sense that further computation could not yield comparable improvements in the solution. This is different from traditional genetic algorithms, which aim at finding the optimal solution. The genetic algorithm we adopt here is terminated either when the global best-fit solution is reached or when the maximum NG is reached. We discuss these two situations below.

In the first case, let the number of all "best-fit" page clippings C be defined as follows:

C = \| \{ pc \mid f_{pc} > \bar{f} + n \sigma_f \} \|,   (5-1)

where \| \cdot \| is the number of items in a set; \bar{f} is the average fitness value of the initial population; \sigma_f is the standard deviation of the fitness values of the initial population; and n is some positive integer, which we set to n = 3 in PCS. If a page clipping satisfies the "best-fit" condition f_{pc} > \bar{f} + n \sigma_f, it can be treated as a very good solution. Our genetic algorithm is adopted to maximize C. The global best-fit solution of the page clipping problem is defined as follows:

C = initial population size.   (5-2)

This means that once all the chromosomes are very good solutions, there is not much room to further improve the results.

In the second case, the maximum NG is usually set as a fixed number through experiments. It is quite difficult to claim what the maximum NG should be. On the one hand, a large NG may waste a significant amount of computing resources on relatively slim improvement. On the other hand, a small NG may result in premature convergence. It would be cost effective if the maximum NG were dynamically determined by the status of fitness value progress at each generation.

We first give several definitions. If the maximal fitness value of the chromosomes at the current generation is equal to that of the previous generation, we say that the current generation has made no improvement. If the consecutive NG
without improvement is large, then there is a slim chance of making further improvement. For generation g we define the improvement ratio I_g as follows:

I_g = \frac{\bar{f}_g - \bar{f}_{g-1}}{\bar{f}_{g-1}}   (g > 1),   (6)

where \bar{f}_g and \bar{f}_{g-1} are the average fitness values of all page clippings at generations g and g - 1. Let \bar{I}_g be the average of all I_j, 1 \le j \le g. Generation g is said to make "no significant improvement" if

I_g \le \bar{I}_g.   (7)

That is, the improvement ratio at the current generation is not larger than the average improvement ratio over all generations up to now. We then define the allowable NG without significant improvement for generation g as follows:

MG_g = Pop\_size \cdot \frac{I_g}{\bar{I}_g} \cdot \frac{\sigma_g}{\bar{\sigma}_g},   where I_g > 0, \sigma_g > 0,   (8)

where Pop_size is the initial population size; \sigma_g is the standard deviation of all I_j, 1 \le j \le g; and \bar{\sigma}_g is the average of \sigma_j, 1 \le j \le g. As indicated in the above definition, the allowable NG without significant improvement is dynamically determined on the basis of the improvement ratio at the current generation and the improvement progress history. If either I_g / \bar{I}_g or \sigma_g / \bar{\sigma}_g is large, then there is significant progress in the fitness value at generation g. That implies that the probability of having more progress is also high and the probability of having the optimal solution at this generation is low.

We refine the above heuristics into our second termination criterion as follows: the genetic algorithm is terminated whenever the consecutive NG without improvement is larger than the allowable NG without significant improvement at the current generation. We consider that a sign that further computation could not yield much progress.
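A sketch of the resulting termination check is given below, assuming the improvement ratios and their running standard deviations are tracked per generation; the rounding of MG_g is not specified in the paper, and the function names are our own.

```python
import statistics

def allowable_generations(pop_size, I_g, I_bar, sigma_g, sigma_bar):
    """Formula (8): MG_g = Pop_size * (I_g / I_bar) * (sigma_g / sigma_bar),
    defined for I_g > 0 and sigma_g > 0."""
    if I_g <= 0 or sigma_g <= 0:
        return 0.0
    return pop_size * (I_g / I_bar) * (sigma_g / sigma_bar)

def should_terminate(best_fit_count, pop_size, consecutive_no_improvement,
                     improvement_history, deviation_history):
    """First criterion (5-2): all chromosomes are 'best fit'.  Second criterion:
    the consecutive NG without improvement exceeds the allowable NG MG_g."""
    if best_fit_count == pop_size:
        return True
    I_g, sigma_g = improvement_history[-1], deviation_history[-1]
    mg = allowable_generations(pop_size, I_g, statistics.mean(improvement_history),
                               sigma_g, statistics.mean(deviation_history))
    return consecutive_no_improvement > mg

# Generation 36 of Table 1: 71 * (0.05822/0.09722) * (0.03812/0.04011) ~= 40.4,
# in line with the MG_g = 41 reported there.
print(allowable_generations(71, 0.05822, 0.09722, 0.03812, 0.04011))
```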
3.4. A sample run

A sample run of the genetic algorithm has been conducted to create a number of page clippings with the highest fitness values in response to the query term "php", as illustrated in Fig. 4. The sample page clipping not only marks the query term in a different color but also indicates each gene's value with colored balls, in which a green ball represents the gene value 1 and a red ball represents the gene value 0. For example, the first, second, fourth, and fifth paragraphs are recommended by the system as answers the user wants, whereas the third paragraph might not be an answer the user wants. "Fit", as shown on the upper bar, is the fitness value of the current page clipping. "AVG" is the average fitness value of all page clippings generated so far. "STD" is the standard deviation of the fitness values of all page clippings generated so far. "Response" is the response time for all page clippings generated so far. "Pop size" is the source population size, and "C" is the accumulated number of "best-fit" page clippings. The "IntelligentGuess" description text appears if more than half of the paragraphs in the page clipping are selected from the same web page; the user can then click on the "TryIt" link to look at that web page directly. Meanwhile, the user is able to look at a particular paragraph in the context of its original web page. When the user clicks the "ReadIt" link in Fig. 4 next to the paragraph "Ever wondered how popular PHP is? see the Netcraft Survey.", the original text is popped up and shown in a colored box on the source web page, as illustrated at the left corner of Fig. 5.
4. Experiments

We designed and conducted the following three experiments: the first experiment verifies the appropriateness of the two termination criteria; the second experiment runs our genetic algorithm to several predefined numbers of generations and to a system-determined NG to justify the cost effectiveness of PCS; and the final experiment measures the effectiveness of PCS in comparison with other search engines.
Fig. 4. A sample piece of page clipping.
4.1. Appropriateness verification of termination criteria

Two simulation runs were conducted in this experiment to demonstrate that our genetic algorithm terminates properly when (1) the global best-fit solution is reached, or (2) the consecutive NG without improvement exceeds its allowable NG without significant improvement. We randomly choose the following parameters: the number of chromosomes in the population, the number of genes in each chromosome, the value ('1' or '0') for each gene, and the number of matched query terms contained in each gene.

The simulation results of the first criterion are shown in Table 1, where MaxGen is the maximum NG set by humans (we set MaxGen = 200); SystemRunGen is the current generation highlighted; Gen# is the generation number; and ConseGen is the consecutive NG without improvement. The other parameters, including \bar{f}_{g-1}, \bar{f}_g, I_g, \bar{I}_g, \sigma_g, \bar{\sigma}_g, and MG_g, were defined previously in Section 3.3. (The simulation results are available at: http://webminer.mis.yzu.edu.tw/cayley/simulation/simulation.php or http://cayley.sytes.net/simulation/simulation.php.)

The simulation was run with a randomly chosen population size Pop_size = 71, and the global best-fit solution was already reached (C = 71) at generation 36. The simulation run should be terminated at this generation. We purposefully continue to run additional generations in order to demonstrate that continued computation after generation 36 wastes computing resources with no improvement in C at all.

In the second simulation, we randomly chose a large population size, Pop_size = 194.
Fig. 5. ReadIt link connecting a paragraph to its source web page.
As illustrated in Table 2, the second termination criterion was reached at generation 107, where the consecutive NG without improvement (ConseGen = 88) is larger than the allowable NG without significant improvement (MG_g = 87). We continue this run up to 200 generations in order to demonstrate that continued computation after generation 107 wastes computing resources. The number of best-fit page clippings C was found to stay at 192 and failed to reach its maximum number of 194 at the end of the simulation.
4.2. Justifying the cost effectiveness of PCS

In this section, we compare the cost effectiveness of solutions reached by our dynamically terminated genetic algorithm with solutions produced by a predefined NG. We check 1000 solutions of our genetic algorithm, each with a randomly generated population size, that terminate by satisfying the second termination criterion. The predefined numbers of generations (NG) are 50, 100, 150, and 200. We define the success rate SR between the real C achieved and the global best-fit C (C = Pop_size) as follows:

SR = C_{achieved} / C_{bestfit}.   (9)
Fig. 6 shows the SR curves over 1000 simulation runs. The average SRs (ASR) for the five cases are 0.7275 (NG = 50), 0.9331 (NG = 100), 0.9490 (NG = 150), 0.9624 (NG = 200), and 0.9255 (NG = System Determined), respectively. For demonstration purposes, each point on a given curve in Fig. 6 stands for the ASR of 50 runs for that case. Fig. 7 displays the NG achieved for the five cases. The NG for the cases NG = 50, 100, 150, and 200 are all constant, since they run to a predefined NG. In the case NG = System Determined, the curve fluctuates, since each run was dynamically terminated. The average NGs (ANGs) for the five cases over 1000 simulation runs are 50, 100, 150, 200, and 76.601, respectively.
Table 1
Simulation results of reaching the global best-fit solution C = Pop_size (MaxGen = 200, SystemRunGen = 36)

Gen# | Pop_size | \bar{f}_{g-1} | \bar{f}_g | C | I_g | \bar{I}_g | \sigma_g | \bar{\sigma}_g | MG_g | ConseGen
27 | 71 | 141.30377 | 150.67259 | 53 | 0.10901 | 0.10901 | 0.0371 | 0.04087 | 40 | 16
28 | 71 | 150.67259 | 160.73373 | 55 | 0.06677 | 0.1075 | 0.03727 | 0.04074 | 41 | 17
29 | 71 | 160.73373 | 170.8358 | 57 | 0.06285 | 0.10596 | 0.03753 | 0.04063 | 39 | 18
30 | 71 | 170.8358 | 181.501 | 59 | 0.06243 | 0.10451 | 0.03772 | 0.04054 | 40 | 19
31 | 71 | 181.501 | 193.07562 | 61 | 0.06377 | 0.1032 | 0.0378 | 0.04045 | 42 | 20
32 | 71 | 193.07562 | 204.95651 | 63 | 0.06153 | 0.1019 | 0.03791 | 0.04037 | 41 | 21
33 | 71 | 204.95651 | 217.72415 | 65 | 0.06229 | 0.10069 | 0.03794 | 0.04029 | 42 | 22
34 | 71 | 217.72415 | 230.68797 | 67 | 0.05954 | 0.09948 | 0.03803 | 0.04023 | 41 | 23
35 | 71 | 230.68797 | 244.3312 | 69 | 0.05914 | 0.09833 | 0.03808 | 0.0417 | 41 | 24
36 | 71 | 244.3312 | 258.55654 | 71 | 0.05822 | 0.09722 | 0.03812 | 0.04011 | 41 | 25
37 | 71 | 258.55654 | 272.60726 | 71 | 0.05434 | 0.09606 | 0.03824 | 0.04006 | 39 | 26
38 | 71 | 272.60726 | 287.19611 | 71 | 0.05352 | 0.09494 | 0.03835 | 0.04001 | 39 | 27
39 | 71 | 287.19611 | 302.1749 | 71 | 0.05216 | 0.09384 | 0.03846 | 0.03997 | 38 | 28
40 | 71 | 302.1749 | 317.21514 | 71 | 0.04977 | 0.09274 | 0.03859 | 0.03994 | 37 | 29
41 | 71 | 317.21514 | 332.64784 | 71 | 0.04865 | 0.09167 | 0.03873 | 0.03991 | 37 | 30
42 | 71 | 332.64784 | 348.18768 | 71 | 0.04672 | 0.09059 | 0.03887 | 0.03989 | 36 | 31
43 | 71 | 348.18768 | 363.33336 | 71 | 0.0435 | 0.0895 | 0.03907 | 0.03987 | 34 | 32
44 | 71 | 363.33336 | 378.86113 | 71 | 0.04274 | 0.08844 | 0.03926 | 0.03985 | 34 | 33
45 | 71 | 378.86113 | 393.92343 | 71 | 0.03976 | 0.08736 | 0.03948 | 0.03984 | 33 | 34
46 | 71 | 393.92343 | 408.90864 | 71 | 0.03804 | 0.08628 | 0.03971 | 0.03984 | 32 | 35
47 | 71 | 408.90864 | 423.8672 | 71 | 0.03658 | 0.08523 | 0.03994 | 0.03984 | 31 | 36
48 | 71 | 423.8672 | 438.65753 | 71 | 0.03489 | 0.08418 | 0.04017 | 0.03985 | 30 | 37
49 | 71 | 438.65753 | 454.19567 | 71 | 0.03542 | 0.08318 | 0.04036 | 0.03986 | 31 | 38
50 | 71 | 454.1956 | 469.14006 | 71 | 0.0329 | 0.08218 | 0.04057 | 0.03988 | 29 | 39
51 | 71 | 469.14006 | 483.7596 | 71 | 0.03116 | 0.08118 | 0.0408 | 0.03989 | 28 | 40
52 | 71 | 483.7596 | 497.82824 | 71 | 0.02908 | 0.08017 | 0.04103 | 0.03992 | 27 | 41
53 | 71 | 497.82824 | 511.91269 | 71 | 0.02829 | 0.0702 | 0.04126 | 0.03994 | 27 | 42
For n > m, we define the performance ratio PR_{n,m} between NG = n and NG = m by dividing the increased success rate by the increased percentage of NG as follows:

PR_{n,m} = improved\_ASR_{n,m} / increased\_ANG\_ratio_{n,m},   (10)

where improved_ASR_{n,m} = ASR_n - ASR_m and increased_ANG_ratio_{n,m} = |ANG_n - ANG_m| / ANG_m.
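As a quick check of formulas (9) and (10), the following sketch recomputes two of the PR values reported in Table 3 from the average success rates and average NGs given above; the function name is illustrative.

```python
def performance_ratio(asr_n, asr_m, ang_n, ang_m):
    """Formula (10): improvement in average success rate per unit of relative
    increase in the average number of generations, against benchmark m."""
    improved_asr = asr_n - asr_m
    increased_ang_ratio = abs(ang_n - ang_m) / ang_m
    return improved_asr / increased_ang_ratio

# Benchmark NG = 50 (ASR 0.7275, ANG 50):
print(performance_ratio(0.9255, 0.7275, 76.601, 50))  # System Determined -> ~0.3722
print(performance_ratio(0.9331, 0.7275, 100.0, 50))   # NG = 100          -> 0.2056
```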
Using the case NG = 50 as the benchmark, the PR values for the cases NG = System Determined, 100, 150, and 200 are shown in Table 3. We arranged the columns in ascending order of the ANGs of these cases. We found that the case NG = System Determined significantly outperforms the other cases. The PR values drop rapidly beyond the system-determined NG.
Similarly, using the case NG = System Determined as the benchmark, the PR results for the cases NG = 50, 100, 150, and 200 are shown in Table 4. Obviously, the case NG = 50 performed worse than the System Determined case. The other PR values are lower than 2.5%. This means that, at the price of long processing time, a run rarely improves much beyond the system-determined NG. For example, in the case NG = 100, we pay an extra 30.55% of NG for only about a 0.76% improvement in success rate. The situation is even worse for the cases NG = 150 and 200. Note that increasing the NG beyond 76.601 does increase the success rate; however, the cost of this increased success rate, as measured by the increased amount of computation, outweighs its benefits. In summary, we conclude that our dynamically terminated genetic algorithm can yield cost-effective solutions in a comparatively short amount of time.
Table 2
Simulation results of reaching the termination criterion MG_g < ConseGen (MaxGen = 200, SystemRunGen = 107)

Gen# | Pop_size | \bar{f}_{g-1} | \bar{f}_g | C | I_g | \bar{I}_g | \sigma_g | \bar{\sigma}_g | MG_g | ConseGen
87 | 194 | 284.87309 | 291.90684 | 170 | 0.02469 | 0.04456 | 0.02099 | 0.02221 | 102 | 68
88 | 194 | 291.90684 | 299.04971 | 172 | 0.02447 | 0.04433 | 0.02098 | 0.0222 | 102 | 69
89 | 194 | 299.04971 | 306.26201 | 174 | 0.02412 | 0.0441 | 0.02097 | 0.02218 | 101 | 70
90 | 194 | 306.26201 | 313.56732 | 176 | 0.02385 | 0.04388 | 0.02096 | 0.02217 | 100 | 71
91 | 194 | 313.56732 | 320.9603 | 178 | 0.02358 | 0.04365 | 0.02095 | 0.02216 | 100 | 72
92 | 194 | 320.9603 | 328.4007 | 180 | 0.02318 | 0.04343 | 0.02094 | 0.02214 | 98 | 73
93 | 194 | 328.4007 | 335.91739 | 182 | 0.02289 | 0.04321 | 0.02094 | 0.02213 | 98 | 74
94 | 194 | 335.91739 | 343.59833 | 184 | 0.02287 | 0.04299 | 0.02093 | 0.02212 | 98 | 75
95 | 194 | 343.59833 | 351.32598 | 186 | 0.02249 | 0.04278 | 0.02092 | 0.0221 | 97 | 76
96 | 194 | 351.32598 | 359.20183 | 188 | 0.02242 | 0.04257 | 0.02092 | 0.02209 | 97 | 77
97 | 194 | 359.20183 | 367.20819 | 190 | 0.02229 | 0.04236 | 0.02091 | 0.02208 | 97 | 78
98 | 194 | 367.20819 | 375.32846 | 192 | 0.02211 | 0.04215 | 0.0209 | 0.02207 | 97 | 79
99 | 194 | 375.32846 | 383.59458 | 192 | 0.02202 | 0.04195 | 0.02098 | 0.02206 | 97 | 80
100 | 194 | 383.59458 | 391.88046 | 192 | 0.0216 | 0.04174 | 0.02089 | 0.02204 | 96 | 81
101 | 194 | 391.88046 | 400.25727 | 192 | 0.02138 | 0.04154 | 0.02088 | 0.02203 | 95 | 82
102 | 194 | 400.25727 | 408.62844 | 192 | 0.02091 | 0.04134 | 0.02088 | 0.02202 | 94 | 83
103 | 194 | 408.62844 | 417.07735 | 192 | 0.02068 | 0.04114 | 0.02087 | 0.02201 | 93 | 84
104 | 194 | 417.07735 | 425.42663 | 192 | 0.02002 | 0.04094 | 0.02088 | 0.022 | 91 | 85
105 | 194 | 425.42663 | 433.69656 | 192 | 0.01944 | 0.04073 | 0.02088 | 0.02199 | 88 | 86
106 | 194 | 433.69656 | 442.05893 | 192 | 0.01928 | 0.04053 | 0.02089 | 0.02198 | 88 | 87
107 | 194 | 442.05893 | 450.39653 | 192 | 0.01886 | 0.04033 | 0.02089 | 0.02197 | 87 | 88
108 | 194 | 450.39653 | 458.8773 | 192 | 0.01881 | 0.04013 | 0.0209 | 0.02196 | 87 | 89
109 | 194 | 458.8773 | 467.4351 | 192 | 0.01866 | 0.03993 | 0.0209 | 0.02195 | 87 | 90
110 | 194 | 467.4351 | 476.00875 | 192 | 0.01834 | 0.03973 | 0.02091 | 0.02194 | 86 | 91
111 | 194 | 476.00875 | 484.59132 | 192 | 0.01803 | 0.03954 | 0.02091 | 0.02193 | 85 | 92
112 | 194 | 484.59132 | 493.2813 | 192 | 0.01793 | 0.03935 | 0.02092 | 0.02192 | 85 | 93
113 | 194 | 493.2813 | 502.03701 | 192 | 0.01775 | 0.03915 | 0.02092 | 0.02191 | 84 | 94
114 | 194 | 502.03701 | 510.72948 | 192 | 0.01731 | 0.03896 | 0.02093 | 0.0219 | 83 | 95
115 | 194 | 510.72948 | 519.58163 | 192 | 0.01733 | 0.03877 | 0.02094 | 0.0219 | 83 | 96
116 | 194 | 519.58163 | 528.49209 | 192 | 0.01715 | 0.03859 | 0.02094 | 0.02189 | 83 | 97
117 | 194 | 528.49209 | 537.35253 | 192 | 0.01677 | 0.0384 | 0.02095 | 0.02188 | 82 | 98
118 | 194 | 537.35253 | 546.19555 | 192 | 0.01646 | 0.03822 | 0.02096 | 0.02187 | 81 | 99
4.3. Effectiveness measure

We used the mean reciprocal rank (MRR) to measure the effectiveness of the search engines. We applied the 500 queries provided in the TREC 2002 QA data [36] individually to Google, Yahoo, AltaVista, LookSmart, Overture, Lycos, SVV, HLP, and PCS. The MRR of each individual query is the reciprocal of the rank at which the first correct answer was returned, or zero if none of the top 10 results contained a correct answer. The score for the 500 queries is the mean of the individual queries' reciprocal ranks. Table 5 lists the ranks at which correct answers were returned by the search engines. For example, out of 500 cases, Google returned the correct page at rank one in 262 cases and returned an "other" answer in 95 cases.
(Note: SVV is a metasearch engine which collects search results from six well-known search engines. Both HLP and PCS are not metasearch engines; strictly speaking, they are post-search methods that augment the search results of any search engine, including SVV.)

(Note: in our judgment, an "other" answer indicates that the engine provided the answer at a rank greater than 10, that there was no correct answer, or that a dead link (404 Not Found, 403 Forbidden, or timeout) appeared in the engine's top 10.)
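The MRR scores can be reproduced from the rank distributions in Table 5; for example, Google's column yields an MRR of about 0.621, i.e. the 62.111% shown in Fig. 8. The function and variable names below are illustrative.

```python
def mean_reciprocal_rank(rank_counts, other, total=500):
    """MRR over `total` queries: each query contributes 1/rank of its first
    correct answer, and 0 when only an 'other' answer was returned."""
    assert sum(rank_counts.values()) + other == total
    return sum(count / rank for rank, count in rank_counts.items()) / total

# Google's column of Table 5.
google = {1: 262, 2: 54, 3: 31, 4: 19, 5: 17, 6: 8, 7: 5, 8: 5, 9: 0, 10: 4}
print(mean_reciprocal_rank(google, other=95))  # ~0.6211, i.e. 62.111% in Fig. 8
```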
Fig. 6. Success rate (C_achieved/C_bestfit) curves over 1000 simulation runs on the second termination criterion, for NG = 50, 100, 150, 200, and System Determined.
Fig. 7. NG achieved over 1000 simulation runs on the second termination criterion, for NG = 50, 100, 150, 200, and System Determined.
Considering the MRR performance of all nine engines, an analysis of variance (ANOVA) shows that F = 3.341734 (Table 6) is larger than or equal to F^{8}_{492}(0.001) = 3.333211 (F-distribution). Thus, we reject the hypothesis H0 that there is no significant difference among the search engines.
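The reported F value follows directly from the sums of squares and degrees of freedom in Table 6, assuming a standard one-way ANOVA; the short sketch below recomputes it.

```python
def anova_f(ss_treatments, df_treatments, ss_error, df_error):
    """One-way ANOVA F statistic: ratio of treatment and error mean squares."""
    return (ss_treatments / df_treatments) / (ss_error / df_error)

# Values from Table 6: 5.316767 / 1.591021 ~= 3.3417.
print(anova_f(42.53414, 8, 781.1911, 491))
```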
Table 3
PR for NG = System Determined, 100, 150, and 200, based on NG = 50

NG | 50 | System Determined | 100 | 150 | 200
Average SR | 0.7275 | 0.9255 | 0.9331 | 0.9490 | 0.9624
Average NG | 50 | 76.601 | 100 | 150 | 200
Improved ASR | | 0.1980 | 0.2056 | 0.2215 | 0.2349
Increased ANG ratio | | 0.5320 | 1 | 2 | 3
PR | | 0.3722 | 0.2056 | 0.1108 | 0.0783
Table 4
PR for NG = 50, 100, 150, and 200, based on NG = System Determined

NG | 50 | System Determined | 100 | 150 | 200
Average SR | 0.7275 | 0.9255 | 0.9331 | 0.9490 | 0.9624
Average NG | 50 | 76.601 | 100 | 150 | 200
Improved ASR | 0.1980 | | 0.0076 | 0.0235 | 0.0369
Increased ANG ratio | 0.3473 | | 0.3055 | 0.9582 | 1.6109
PR | 0.5701 | | 0.0249 | 0.0245 | 0.0229
Table 5
Rank distribution at which the first correct answer was returned by the search engines out of the 500 queries

Rank | Google | Yahoo | AltaVista | LookSmart | Overture | Lycos | SVV | HLP | PCS
1 | 262 | 266 | 209 | 115 | 224 | 172 | 277 | 278 | 313
2 | 54 | 58 | 71 | 55 | 84 | 74 | 64 | 42 | 39
3 | 31 | 34 | 34 | 35 | 33 | 47 | 20 | 27 | 11
4 | 19 | 15 | 24 | 33 | 16 | 31 | 11 | 12 | 8
5 | 17 | 12 | 22 | 27 | 14 | 17 | 9 | 6 | 3
6 | 8 | 8 | 10 | 24 | 8 | 11 | 9 | 3 | 1
7 | 5 | 4 | 8 | 9 | 6 | 7 | 2 | 4 | 1
8 | 5 | 5 | 7 | 7 | 9 | 10 | 5 | 5 | 5
9 | 0 | 1 | 5 | 4 | 4 | 6 | 2 | 6 | 4
10 | 4 | 3 | 6 | 4 | 2 | 4 | 2 | 4 | 4
Other | 95 | 94 | 104 | 187 | 100 | 121 | 99 | 113 | 111
Fig. 8 shows the MRR values achieved by the nine search engines. We found that all our search methods, SVV, HLP, and PCS, performed better than the other search engines, and PCS performed the best. This is because SVV collectively expresses the voting behavior of the six other search engines on correct answers. Therefore, correct answers that appear in any source search engine's top 10 results are highly likely to also appear in SVV's top 10 results, but not the other way around.
Table 6
ANOVA analysis of MRR

Source | SS | DF | MS | F
Treatments | 42.53414 | 8 | 5.316767 | 3.341734
Error | 781.1911 | 491 | 1.591021 |
Total | 823.7253 | 499 | |
HLP further expanded the links inside the top-ranked URLs of the SVV results into its top 10 results and possibly pushed correct answers out of the top 10. Thus, HLP performed a little worse than SVV.
Fig. 8. MRR achieved by the engines (search engine reports as at 11/03/2003 to 11/18/2003).
Finally, PCS uses phrase matching to combine the matched paragraphs from the top-ranked web pages. Note that PCS is capable of moving paragraphs with correct answers higher among its top 10 results, or into its top 10 results from source web pages that are not in the top 10. Thus, PCS has a higher probability of surpassing SVV and HLP. For example, the correct answer for Qid = 1507 (Q: What is the national anthem in England? A: God Save the Queen) appeared at rank 1 in PCS, but at rank 4 in HLP and rank 2 in SVV.

Next, we take the search results of the most popular search engines, Google and Yahoo, instead of HLP, as inputs to PCS in order to demonstrate that PCS can be applied to improve the search results of any search engine. We designate the combined search engines PCS_Google and PCS_Yahoo, respectively. Fig. 9 shows the MRR achieved by Google (62.111%), PCS_Google (67.202%), Yahoo (63.085%), and PCS_Yahoo (68.525%). Obviously, PCS_Google and PCS_Yahoo outperform their respective origins, Google and Yahoo. More importantly, the improvement is due to our genetic algorithm. In the same manner, PCS can be applied to AltaVista, LookSmart, Overture, and Lycos individually and should gain improvements over the original sources.
5. Conclusions

We have applied a dynamically terminated genetic algorithm to generate page clippings from web search results. A set of best-of-run page clippings was demonstrated to be generated in a controlled amount of time. The page clippings provide users the information they are most interested in and therefore save the users time and trouble in browsing lots of hyperlinks. The dynamically terminated genetic algorithm was shown to yield cost-effective solutions. Meanwhile, the effectiveness measure confirmed that PCS performs better than general search engines.
Fig. 9. MRR achieved by Google, PCS_Google, Yahoo, and PCS_Yahoo (search engine reports as at 11/03/2003 to 11/18/2003).
Although a preliminary user study also indicates that users are more satisfied with our search methods than with other search engines, we need more user feedback to fine-tune our search mechanisms. Moreover, the genetic algorithm currently employs a simple text-based pattern matching method in gene encoding. Such an approach is limited in encoding multimedia web pages and makes it hard to generate new combinations of paragraphs, such as things, relations, facts, and trends. We are therefore also investigating content-based pattern matching methods for gene encoding so that the proposed genetic algorithm can be applied to more web applications.
Acknowledgements

This work was supported in part by the National Science Council, Taiwan, under grant NSC 922213-E155-061.

References

[1] A. McCallum, Information extraction from the World Wide Web, available online at http://www-2.cs.cmu.edu/Web/Groups/NIPS/NIPS2002/nips-tutorials.html, 2002.
[2] M. Gordon, Probabilistic and genetic algorithms for document retrieval, Commun. ACM 31 (10) (1988) 1208–1218.
[3] M.D. Gordon, User-based document clustering by redescribing subject descriptions with a genetic algorithm, J. Am. Soc. Inf. Sci. Technol. 42 (5) (1991) 311–322.
[4] F. Petry, B. Buckles, D. Prabhu, D. Kraft, Fuzzy information retrieval using genetic algorithms and relevance feedback, in: Proceedings of the ASIS Annual Meeting, 1993, pp. 122–125.
[5] J. Yang, R.R. Korfhage, Effects of query term weights modification in document retrieval: a study based on a genetic algorithm, in: Proceedings of the Second Annual Symposium on Document Analysis and Information Retrieval, 1993, pp. 271–285.
[6] H. Chen, Y. Chung, M. Ramsey, C. Yang, An intelligent personal spider (agent) for dynamic Internet/Intranet searching, Decision Support Systems 23 (1) (1998) 41–58.
[7] H. Chen, Y. Chung, C. Yang, M. Ramsey, A smart Itsy Bitsy Spider for the Web, J. Am. Soc. Inf. Sci. Technol. 49 (7) (1998) 604–618, Special Issue on AI Techniques for the Emerging Information Systems Applications.
[8] Z.Z. Nick, P. Themis, Web search using a genetic algorithm, IEEE Internet Comput. 5 (2) (2001) 18–26.
[9] F. Picarougne, N. Monmarché, A. Oliver, G. Venturini, Web mining with a genetic algorithm, in: Proceedings of the Eleventh International World Wide Web Conference, 2002.
[10] D. Vrajitoru, Genetic algorithms in information retrieval, AIDRI97: Learning: From Natural Principles to Artificial Methods, Genève, 1997.
[11] J.T. Horng, C.C. Yeh, Applying genetic algorithms to query optimization in document retrieval, Inf. Proc. Manage. 36 (2000) 737–759.
[12] C. López-Pujalte, V.P. Guerrero-Bote, F. de Moya-Anegón, A test of genetic algorithms in relevance feedback, Inf. Proc. Manage. 38 (6) (2002) 795–807.
[13] C. López-Pujalte, V.P. Guerrero-Bote, F. de Moya-Anegón, Order-based fitness functions for genetic algorithms applied to relevance feedback, J. Am. Soc. Inf. Sci. Technol. 54 (2) (2003) 152–160.
[14] W. Fan, M.D. Gordon, P. Pathak, A generic ranking function discovery framework by genetic programming for information retrieval, Inf. Proc. Manage. (2004), in press.
[15] W. Fan, M.D. Gordon, P. Pathak, Discovery of context-specific ranking functions for effective information retrieval by genetic programming, IEEE Trans. Knowledge Data Eng. 16 (4) (2004) 523–527.
[16] L. Wang, W. Fan, R. Yang, W. Xi, M. Luo, Y. Zhou, E.A. Fox, Ranking function discovery by genetic programming for robust retrieval, in: Proceedings of the Twelfth TREC Conference, 2003.
[17] J. Bao, Y. Lin, Information filtering based on behavior evolutional genetic algorithm, course project and paper, available online at http://www.cs.iastate.edu/~baojie/acad/current/begenetic algorithm/572project/begenetic algorithm-news.htm, 2001.
[18] A. Chouchoulas, A genetic algorithm based information filter for Usenet, B.Sc. Thesis, University of Edinburgh, available online at http://bedroomlan.dyndns.org/~alexios/files/genetic algorithmBIFU.ps.gz, 1997.
[19] B.D. Sheth, A learning approach to personalized information filtering, Master's Thesis, MIT, available online at http://agents.www.media.mit.edu/groups/agents/publications/newt-thesis/main.html, 1994.
[20] Y.K. Kwon, B.R. Moon, Personalized email marketing with a genetic programming circuit model, in: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), 2001, pp. 1352–1358.
[21] H. Katirai, Filtering junk e-mail: a performance comparison between genetic programming & Naïve Bayes, available online at http://members.rogers.com/hoomank/katirai99filtering.pdf, 1999.
[22] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[23] J. Holland, Adaptation in Natural and Artificial Systems, The University of Michigan Press, Ann Arbor, 1975.
[24] J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, 1992, pp. 191–203.
[25] L.J. Eshelman, J.D. Schaffer, Preventing premature convergence in genetic algorithms by preventing incest, in: Proceedings of the Fourth International Conference on Genetic Algorithms, 1991, pp. 115–122.
[26] G. Rudolph, J. Sprave, A cellular genetic algorithm with self-adjusting acceptance threshold, in: Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, 1995, pp. 365–372.
[27] H. Aytug, G.J. Koehler, Stopping criteria for finite length genetic algorithms, ORSA J. Comput. 8 (1996) 183–191.
[28] H. Aytug, S. Bhattacharrya, G.J. Koehler, A Markov chain analysis of genetic algorithms with power of 2 cardinality alphabets, Eur. J. Operational Res. 96 (1996) 195–201.
[29] D. Greenhalgh, S. Marshall, Convergence criteria for genetic algorithms, SIAM J. Comput. 30 (1) (2000) 269–282.
[30] G. Salton, J. Allan, C. Buckley, Approaches to passage retrieval in full text information systems, in: ACM SIGIR Conference on R&D in Information Retrieval, 1993, pp. 49–58.
[31] E. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.Y. Lin, Question answering in Webclopedia, in: Proceedings of the Ninth TREC Conference, 2000.
[32] D.R. Radev, W. Fan, H. Qi, Z. Zheng, S.B. Goldensohn, Z. Zhang, J. Prager, Mining the web for answers to natural language questions, in: Proceedings of the 2001 International Conference on Information and Knowledge Management, 2001, pp. 143–150.
[33] D.R. Radev, W. Fan, H. Qi, H. Wu, A. Grewal, Probabilistic question answering from the web, in: Proceedings of the 11th WWW Conference, 2002, available online at http://www2002.org/CDROM/refereed/19/.
[34] C. Chen, C.J. Luh, The design of an intelligent search engine with web page prediction capability, J. Intelligent Inf. Systems (2003), submitted for publication, available online at http://webminer.mis.yzu.edu.tw/cayley/dissertation/SVV-HLP.pdf.
[35] S. Lawrence, C.L. Giles, Context and page analysis for improved web search, IEEE Internet Comput. 2 (4) (1998) 38–46.
[36] TREC 2002 QA data, available online at http://trec.nist.gov/data/qa/t2002_qadata.html, 2003.
[37] C.G. Morris, A. Levine, A.A. Maisto, Psychology: An Introduction, 11th Edition, Prentice-Hall, Englewood Cliffs, NJ, 2002.
[38] R. Lempel, S. Moran, The stochastic approach for link-structure analysis (SALSA) and the TKC effect, Comput. Networks 33 (2000) 387–401.
[39] A. Paepcke, H. Garcia-Molina, G. Rodriguez-Mula, J. Cho, Beyond document similarity: understanding value-based search and browsing technologies, SIGMOD Record 29 (1) (2000) 80–92.
[40] S. Austin, An introduction to genetic algorithms, AI Expert 5 (3) (1990) 48–53.
[41] M. Obitko, Introduction to genetic algorithms with Java applets, available online at http://cs.felk.cvut.cz/~xobitko/genetic algorithm/, 1998.
[42] D.W. Patterson, Introduction to Artificial Intelligence and Expert Systems, Prentice-Hall, Englewood Cliffs, NJ, 1990.
[43] K.S. Jones, A statistical interpretation of term specificity and its application in retrieval, J. Doc. 28 (1) (1972) 11–21.