Information Retrieval
Adaptive Web Search: Evolving a Program That Finds Information

Michael Gordon, University of Michigan
Weiguo (Patrick) Fan, Virginia Tech
Praveen Pathak, University of Florida
This approach uses genetic programming to automatically evolve new retrieval algorithms based on a user's evaluation of previously viewed documents.

Anyone who's used a computer to find information on the Web knows that the experience can be frustrating. Search engines are incorporating new techniques (such as examining document link structures) to increase effectiveness. However, searchers all too often face one of two outcomes: reviewing many more Web pages than they'd prefer or failing to find as much useful information as they really want.

The problem might be most frustrating to knowledge workers whose livelihoods depend on keeping current with some aspect of the world. Even when such individuals have unchanging information needs, this consistency doesn't yield any extra advantage in terms of improved retrieval performance: search engines don't typically tailor their algorithms to knowledge workers' persistent information needs.

Here, we introduce a new retrieval technique that exploits users' persistent information needs. These users might include business analysts specializing in genetic technologies, stockbrokers keeping abreast of wireless communications, and legislators needing to understand computer privacy and security developments. To help such searchers, we evolve effective search programs by using feedback based on users' judgments about the relevance of the documents they've retrieved. The algorithm we use is based on genetics: it uses selection and modification to "breed" increasingly effective retrieval algorithms.
Information retrieval and evolutionary algorithms

Historically, information retrieval (IR) has attempted to develop ways to retrieve documents effectively. Computer programs can't read and understand text written about an unrestricted range of topics. So, IR has adopted certain heuristics to represent documents' contents so that we can store these representations in computer databases and later retrieve the documents they represent.
By far the most common heuristic for doing this is attaching keywords to documents—each keyword suggesting a topic that the document is supposed to address. Similarly, searchers typically enter queries using keywords. A matching function matches the information in document representations with the information in the query representation to assign each document a retrieval score, and the application presents documents to the user in decreasing order of this retrieval score. One of the most popular matching functions, for example, is the cosine measure.1 It finds the angle between a query keyword vector and a document keyword vector; the closer these two vectors are to each other, the higher the document's matching score.

A problem with using keyword representations is that, although two keywords might each accurately describe the same document, one might be more applicable given the document's contents.

Term weighting

To accommodate such differences, IR algorithms have many methods of weighting terms and phrases. For instance, we presume that a document that uses a particular term frequently is better represented by that term than a document that uses it less frequently. Similarly, when a relatively rare term—such as "accordion"—occurs in a document, we might deem it more indicative of the document's subject than when a more common term—such as "and" or "computer"—occurs. So, term-weighting strategies often weight a term on the basis of its number of occurrences in a document as well as its rarity in the document collection.
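For concreteness, here is a minimal Python sketch (our illustration, not code from the article) of this classic scheme: terms are weighted by their in-document frequency times their rarity across the collection, and documents are scored with the cosine measure. The function and variable names are ours.

import math
from collections import Counter

def weight_vector(terms, doc_freq, num_docs):
    # Weight = occurrences in this document x rarity across the collection
    # (a tf-idf-style weight).
    tf = Counter(terms)
    return {t: tf[t] * math.log(num_docs / doc_freq[t])
            for t in tf if doc_freq.get(t, 0) > 0}

def cosine(query_vec, doc_vec):
    # The cosine measure: the closer the two vectors, the higher the score.
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    norm = (math.sqrt(sum(w * w for w in query_vec.values()))
            * math.sqrt(sum(w * w for w in doc_vec.values())))
    return dot / norm if norm else 0.0

Ranking then amounts to sorting documents by cosine(query_vec, doc_vec) in decreasing order.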
Obviously, different term-weighting strategies result in different retrieval performance because the retrieval score differs with each strategy. No uniformly best approach exists for supplying term weights. Various careful studies have compared different term-weighting strategies, but none consistently outperforms the others when measured by its ability both to identify relevant documents and to insulate the user from nonrelevant documents.1,2 Several factors exacerbate the difficulty of developing effective weighting strategies: differences among document collections (their contents, style, and vocabularies), among users (from novice to sophisticated), and among queries (number of terms used and terms' specificity).

However, although a single best general-purpose approach to weighting terms might not exist, a best method does exist for weighting terms for a particular user with a recurring need for information. By employing an adaptive algorithm using genetic programming (see the "Intro to Genetic Programming" sidebar), we seek to evolve a matching function that approximates that method.

Genetic programming

To determine an effective IR weighting program for a specific user, we first choose the appropriate features (see table 1). We can easily compute each of these descriptive statistics for any term, any document, and any document collection. Additionally, we used five operators to construct our "weighting trees": +, −, ×, /, and log.

The token frequency (tf) function returns a number indicating how frequently a given word or phrase occurs in a document. As we've suggested, this measure provides information about the term's suitability as an index term—with terms occurring more frequently thought to be better descriptors and to deserve larger weights. The document frequency (df) function returns a number that tells, for the entire document collection, how many documents contain a particular term. This also provides useful information, on the assumption that a rare term's appearance offers more meaning than that of a more common term. Table 1 describes other terminals, such as tf_max and tf_avg.

Genetic programming can help us generate a program that effectively combines document features for a particular user with a persistent information need. For example, figure 1 shows one combination of features that the program might produce.

Figure 1. A sample genetic programming tree (tf denotes token frequency; df denotes document frequency); the tree computes (tf / 8) × (df − tf).
Table 1. Terminals used in our genetic programming system.

Terminal      Statistical meaning
tf            Token frequency—the number of times a given term appears in a document
tf_max        A document's maximum tf
tf_avg        A document's average tf
tf_doc_max    The maximum tf in a document collection
df            Document frequency—the number of documents in which a given term appears
df_max        The maximum df for all terms in a user query
N             The total number of documents in a document collection
Length_avg    The average length of a document in a collection
R             A real-number constant, randomly generated by the system
n             A document's number of unique terms
The program's semantics gives each term in a document this weight: 1/8th its token frequency times the difference between its document frequency and its token frequency. Although this is simply an example, the fact that our program could have evolved it is important for two related reasons. First, genetic programming often produces feature combinations that differ significantly from what a knowledgeable designer might think up. (Most weighting algorithms involve a ratio-like relationship between a term's token frequency and its document frequency.) Second, despite the strange (even cumbersome) programs that genetic programming produces, they often deliver strikingly effective results—even if an expert programmer would have an extremely difficult time reading, let alone writing, the program. What matters is performance, not comprehensibility or ease of maintenance.

Genetic approaches

Genetic algorithms are attractive because of their intrinsically parallel search mechanisms and powerful global exploration capabilities in a high-dimensional space. Programmers have used genetic algorithms to adapt document descriptions, modify query representations, and fine-tune parameters in matching functions.3–7 However, up to now, genetic programming has seen little application in IR.
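To make figure 1's evolved program concrete, here is a minimal sketch (ours, with hypothetical inputs): each query term contributes the weight (tf / 8) × (df − tf), and a document's matching score is the sum of those weights.

def evolved_weight(tf, df):
    # Figure 1's tree: one-eighth of the token frequency times
    # (document frequency - token frequency).
    return (tf / 8.0) * (df - tf)

def matching_score(query_terms, doc_tf, df):
    # Sum the evolved weights over the query's terms for one document.
    return sum(evolved_weight(doc_tf.get(t, 0), df.get(t, 0)) for t in query_terms)

# Hypothetical term statistics, for illustration only.
print(matching_score(["acquisition", "takeover"],
                     {"acquisition": 3, "takeover": 1},
                     {"acquisition": 120, "takeover": 85}))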
Experiments and results

We ran 35 separate mini-experiments to determine how well we could evolve effective weighting programs. For each experiment, we began with a query selected from a TREC (Text Retrieval Conference) test collection of stories from the Associated Press. TREC is the primary conference on IR techniques for large-scale text collections.8 Many IR systems and commercial engines use TREC data as a testbed for validating and evaluating their performance. Figures 2a and 2b show a sample TREC document and query.

We randomly divided the TREC data into three sets: one each for training, validation, and testing. Each data set contained approximately 80,000 documents. A three-data-set design is common practice in machine learning experiments to avoid overfitting a solution to the given data. We used the training data set to train the algorithm and the validation data set to pick the best tree available from the training phase (in the hope that it would perform well on unseen documents). Finally, we used the test data set to evaluate the algorithm on unseen documents and to report the results. Table 2 lists the experimental settings.

Table 2. Genetic programming experiment settings.

Feature             Value
Population size     200
Crossover rate      0.9
Reproduction rate   0.1
Generations         30
Fitness function    P_Avg
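The article doesn't give the partitioning procedure itself; a minimal sketch of such a random three-way split (names ours) might be:

import random

def three_way_split(doc_ids, seed=0):
    # Shuffle once, then cut into thirds: training, validation, and test sets.
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    k = len(ids) // 3
    return ids[:k], ids[k:2 * k], ids[2 * k:]

train, validation, test = three_way_split(range(240000))  # roughly 80,000 documents each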
Intro to Genetic Programming

Genetic programming is based on genetic algorithms, which mimic natural genetic operations: the genetic material of fitter entities flourishes, while crossover and modifications to their genes ensure the exploration of new genetic combinations.1 With genetic algorithms, strings of bits or real values typically represent individual chromosomes, although other data structures are possible. John Koza showed that you can evolve tree structures representing programs, resulting in a technique he called genetic programming.2

In genetic programming, each potential solution to a problem is represented as a tree, with many solutions (trees) in a generation. The system computes each solution's fitness and ranks the solutions accordingly. Fitter trees from one generation get more representation in the next generation (the selection operation). The trees also undergo crossover, an operation that exchanges genetic material between two parent trees. Sometimes another operation, mutation, is applied to randomly selected nodes in the trees to keep the search from getting stuck in a locally optimal solution. Selection, crossover, and mutation repeat until a solution is found or until no further significant performance improvement occurs from one generation to the next.

For example, suppose you wanted to evolve a program that could compute a number's square root. (Naturally, this wouldn't be a typical genetic-programming application, which normally evolves a program for which there is no known solution or acceptable approximation.) At some intermediate generation, we might have two programs in the population of possible solutions, as shown in figures A1 and A2. Each of these solutions (or trees) operates on the input value n. Table A shows how the two programs would perform for several values of n. Although neither program computes an effective approximation of the square root function, the second performs consistently better. So, its constituent parts would appear more often in the intermediate stage between generations, just before crossover.

Crossover would involve the exchange of subtrees. Crossing the second tree with the tree in figure A3 could result in the two trees shown in figure A4. (We say "could" because the resulting trees depend on the crossover points used in each tree.) If applied, a mutation operator could randomly change the value of one or more nodes in these trees. These trees' performance would be computed and compared against the performance of all other trees during the then-current generation, thus determining how often each of these trees would be represented in the next intermediate generation before crossover occurs again. Typically, the trees become increasingly fit (that is, they represent the solution better and better) from one generation to the next. Eventually, they settle on or near the optimal solution.

As this example illustrates, genetic programming features two different types of nodes. Leaf nodes are constants or variables that have their own values. Nonleaf nodes are operators and functions that compute values based on subtrees, variables, or constants. Because computing a square root is a numeric problem, we chose numeric leaf nodes and operators for the example. Programmers choose the "raw materials" for genetic programming before adaptation, and this choice is an important factor in determining whether the evolution will produce a favorable result. We might say that programmers must determine the right features for a problem to be solved, and the genetic program itself attempts to determine the best combination of those features.

Table A. Values in the genetic programming example.

n      Value of 7 + n    Value − √n    Value of (n − 2)/3    Value − √n
4      11                9             2/3                   −1 1/3
25     32                27            7 2/3                 2 2/3
100    107               97            32 2/3                22 2/3

Figure A. Solutions for the square root problem: the (1) first tree (7 + n), (2) second tree ((n − 2)/3), and (3) third tree, and (4) the two trees after crossover.
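To ground the example, here is a compact, runnable Python sketch (our illustration, not the authors' code) of the whole cycle: random trees over the leaf set {n, constants} and arithmetic operators, fitness measured as error against the true square root on a few sample inputs, followed by selection, crossover, and mutation. The settings echo table 2 in the main article (population 200, 30 generations, 10 percent reproduction), but the problem here is the sidebar's toy one.

import copy
import math
import random

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b,
       '/': lambda a, b: a / b if abs(b) > 1e-9 else 1.0}  # protected division

class Node:
    def __init__(self, op=None, left=None, right=None, value=None):
        self.op, self.left, self.right, self.value = op, left, right, value

    def eval(self, n):
        if self.op is None:                        # leaf: the input n or a constant
            return n if self.value == 'n' else self.value
        return OPS[self.op](self.left.eval(n), self.right.eval(n))

def random_tree(depth=3):
    if depth == 0 or random.random() < 0.3:        # grow a leaf
        return Node(value=random.choice(['n', round(random.uniform(-10, 10), 2)]))
    return Node(op=random.choice(list(OPS)),
                left=random_tree(depth - 1), right=random_tree(depth - 1))

def error(tree, samples=(4, 9, 25, 49, 100)):
    # Lower is better: total absolute error against the true square root.
    err = sum(abs(tree.eval(n) - math.sqrt(n)) for n in samples)
    return err if math.isfinite(err) else float('inf')

def nodes(tree):
    yield tree
    if tree.op is not None:
        yield from nodes(tree.left)
        yield from nodes(tree.right)

def crossover(a, b):
    # Swap randomly chosen subtrees between copies of the two parents.
    a, b = copy.deepcopy(a), copy.deepcopy(b)
    na, nb = random.choice(list(nodes(a))), random.choice(list(nodes(b)))
    na.__dict__, nb.__dict__ = dict(nb.__dict__), dict(na.__dict__)
    return a, b

def mutate(tree):
    # Replace a randomly chosen node with a fresh random subtree.
    random.choice(list(nodes(tree))).__dict__ = random_tree(depth=2).__dict__
    return tree

def tournament(pop, k=4):
    # Pick k trees at random; the fittest (lowest error) wins.
    return min(random.sample(pop, k), key=error)

pop = [random_tree() for _ in range(200)]
for generation in range(30):
    pop.sort(key=error)
    children = pop[:20]                            # the fittest 10% survive unchanged
    while len(children) < 200:
        c1, c2 = crossover(tournament(pop), tournament(pop))
        children += [mutate(c1) if random.random() < 0.1 else c1, c2]
    pop = children[:200]
print('best error:', error(min(pop, key=error)))

Protected division keeps randomly generated trees from crashing on divide-by-zero, a standard genetic-programming convention.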
References

1. J.H. Holland, Adaptation in Natural and Artificial Systems, 2nd ed., MIT Press, 1992.
2. J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992.
Beginning with 200 randomly generated weighting programs associated with a particular user's query, we determined the effectiveness of each in retrieving 500 documents from a training set of nearly 80,000 documents. That is, for each of the nearly 80,000 documents, a weighting program computed document weights for each associated term in the user's query. Added together, the weights produced an overall matching score for that document, according to that particular weighting program. By imposing a document cutoff value, or DCV (the number of documents the user is willing to see), of 500 and using TREC's predetermined judgments of which documents were relevant for this query, we could tell how many of the top 500 ranked documents were and weren't relevant. This let us calculate the fitness function, P_Avg (defined in table 3), for each individual weighting program.
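Putting these pieces together, one program's fitness evaluation might look like the following sketch (ours, not the authors' code). Here matching_score is the term-weight summing function sketched earlier, p_avg is the average-precision routine shown after table 3, and the document attributes (doc.tf, doc.doc_id) are hypothetical.

def fitness(program, query_terms, collection, df, relevant_ids, dcv=500):
    # Rank every training document by the program's matching score, highest first.
    ranked = sorted(collection,
                    key=lambda doc: program(query_terms, doc.tf, df),
                    reverse=True)
    top = ranked[:dcv]                   # the document cutoff value: at most 500 docs
    # r(d_i) = 1 for relevant documents, 0 otherwise; fitness is P_Avg.
    relevance = [1 if doc.doc_id in relevant_ids else 0 for doc in top]
    return p_avg(relevance, total_relevant=len(relevant_ids))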
We separately performed a similar set of computations and evaluations for each of the other 199 weighting programs in use for this query. The top 10 percent of the programs by fitness (the reproduction rate) automatically entered the next generation; we then used tournament selection to select the remaining trees for the next generation. We reproduced more copies of the better-performing programs, subjected most to crossover, and tested this new set of programs against the same training documents. We repeated this process 30 times, noting, for each generation, the best 10 of the 200 programs.

At the end of a given query's training period (conducted to evolve programs that we hoped would be effective for retrieval), we used the 300 saved weighting programs on a different set of validation documents in the TREC collection to select the best-performing weighting program—that is, one we thought would work well on new, unseen data. We ran this program against the test documents in the TREC collection to obtain the final performance result. For each of the 35 mini-experiments, we followed the same process using a different query.

We compared the best performing of these weighting programs in several ways. (All results reported in this section are statistically significant at p < 0.05.) Table 3 shows the performance measures.

First, we compared our program's performance against a retrieval program called SMART, which has performed extremely well under various retrieval conditions in the TREC competitions.9 Using a DCV of 1,000 for testing and TREC's benchmark statistic P_Avg, our genetic programming approach was superior for 34 of 35 test queries, usually many times better. We also compared the two systems' performance when only the first 10 documents were retrieved (P_10)—a situation that resembles many real-life retrieval situations in which users require only a few documents. Again, our system outperformed SMART—this time for all 35 queries.

Next, we compared our program's performance against a competing adaptive approach—a neural network. Neural networks have also been used widely in IR and routing experiments.10 In terms of P_Avg, genetic programming outperformed the neural network approach 30 out of 35 times. Most of the performance comparisons are quite dramatic (more than 100 percent improvement).
(a)
AP881223-0040 AP-NR-12-23-88 0401EST r i PM-Obit-Suroi 12-23 0160 PM-Obit-Suroi,0165
Yugoslav Ambassador To Madrid Dies In Car Accident
MADRID, Spain (AP) Redzai Suroi was 59. Suroi died Thursday near Guadalajara as he returned alone to Madrid from a private trip, said embassy counselor Zoran Raicevic. Suroi, a native of Erizren in the autonomous province of Kosovo, had held the post of adjunct to the Yugoslav foreign minister before coming to Spain in October 1985. He also served as ambassador to Mexico from 1978 to 1982 and to Bolivia from 1970 to 1974, Raicevic said. Suroi, a law graduate of Belgrade University, worked as a journalist for 15 years, and became director of Radio Pristina in Kosovo before beginning his diplomatic career in 1970, Raicevic said. Survivors include his wife, a son and a daughter, the counselor said.

(b)
Tipster Topic Description
Number: 002
Domain: International Economics
Topic: Acquisitions
Description: Document discusses a currently proposed acquisition involving a U.S. company and a foreign company.
Narrative: To be relevant, a document must discuss a currently proposed acquisition (which may or may not be identified by type, e.g., merger, buyout, leveraged buyout, hostile takeover, friendly acquisition). The suitor and target must be identified by name; the nationality of one of the companies must be identified as U.S. and the nationality of the other company must be identified as NOT U.S.
Concept(s):
1. acquisition, takeover
2. suitor, target
3. merger, buyout, leveraged buyout (LBO)
4. arb, arbitrage, arbitrager, leverage, offer, bid, tender, purchase
5. anti-takeover, poison pill, white knight, restructure, sale of assets, recapitalization

Figure 2. A sample (a) TREC (Text Retrieval Conference) document and (b) TREC query (topic).
We obtained similar performance results when considering only the top 10 documents retrieved. Overall, genetic programming outperformed the neural network by more than 100 percent in both P_Avg and P_10. However, a neural network's performance partly depends on its tuning; although we strove to find effective parameter settings, we selected them empirically, not systematically through an optimization routine.
We repeated all our experiments with a slight variation. Whereas our original experiments took into account only the title and description portions of topics such as the one shown in figure 2b to serve as a user query (we call this a short query), our modified experiments considered these fields plus other fields in figure 2b, such as the narrative and concepts, to form a long query.
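For illustration, with a topic represented as a dictionary of the fields in figure 2b (a hypothetical sketch, not the authors' parsing code), the two query types differ only in which fields they concatenate:

def build_query(topic, long_form=False):
    # Short queries use only title and description; long queries add
    # the narrative and concepts fields as well.
    fields = ['title', 'description']
    if long_form:
        fields += ['narrative', 'concepts']
    return ' '.join(topic.get(f, '') for f in fields)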
Table 3. Performance measures.

P_Avg. The average of the precision scores (precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved) calculated every time a new relevant document is found, normalized by the total number of relevant documents in the collection. We define this mathematically as

P\_Avg = \frac{\sum_{i=1}^{|D|} \left( r(d_i) \times \frac{\sum_{j=1}^{i} r(d_j)}{i} \right)}{T\_Rel}

where r(d) ∈ {0, 1} is the relevance score assigned to a document (1 if the document is relevant and 0 otherwise), |D| is the total number of retrieved documents, and T_Rel is the total number of relevant documents for the query. It's easy to see that this measure combines the traditional precision and recall performance measures into a single measure.

P_10. The precision obtained for the top 10 retrieved documents. This measure is important when the user isn't interested in looking at many documents, so we require high precision within the top 10 documents retrieved.

T_Rel_Ret. The total number of relevant documents (for all 35 queries) retrieved in the top 1,000 documents.
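The two precision-based measures translate directly into code. In this sketch (ours), ranked_relevance holds r(d_i) for the ranked list, 1 if the ith retrieved document is relevant and 0 otherwise:

def p_avg(ranked_relevance, total_relevant):
    # Average the precision observed at each newly found relevant document,
    # normalized by the number of relevant documents in the whole collection.
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / total_relevant if total_relevant else 0.0

def p_10(ranked_relevance):
    # Precision within the top 10 retrieved documents.
    return sum(ranked_relevance[:10]) / 10.0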
With this modified query representation, genetic programming still outperformed SMART and the neural network.

Figures 3a through 3c graphically compare the three systems' results for P_Avg, P_10, and T_Rel_Ret, for both the short and long queries. Interestingly, as the basis of query representation expanded to include narrative and concepts as well as title and description (that is, as we began to use long rather than short queries), all three systems improved. For all three performance measures, the neural network improved the most on a percentage basis; for P_Avg and P_10, genetic programming improved approximately 20 percent more than the SMART system did. For T_Rel_Ret, the two systems improved about the same. These results imply that adaptive retrieval systems benefit from rich query representations.

Figure 3. Comparison of the performance of genetic programming, SMART, and the neural network for (a) P_Avg, (b) P_10, and (c) T_Rel_Ret, for both short and long queries.

The Authors

Michael Gordon is a professor of business information technology at the Ross School of Business at the University of Michigan. His research interests include using information and communication technology to help alleviate poverty and improve health and education; the retrieval and discovery-based uses of textual information; information-based communities; and appropriate uses of technology to support teaching, learning, and information sharing. He received his PhD in computer science from the University of Michigan. Contact him at the Univ. of Michigan, Ross School of Business, Wyly, Ann Arbor, MI 48109; [email protected].

Weiguo (Patrick) Fan is an associate professor of information systems and computer science at the Virginia Polytechnic Institute and State University. His research interests focus on the design and development of novel information technologies (data mining, text and Web mining, personalization, and knowledge management techniques) to support better management of business information and improved decision making. He received his PhD in computer and information systems from the University of Michigan's Ross School of Business. Contact him at Virginia Tech, 3007 Pamplin Hall, Blacksburg, VA 24061; [email protected].

Praveen Pathak is an assistant professor of decision and information sciences at the University of Florida. His research interests include information retrieval, adaptive algorithms, AI, Web mining, e-commerce, and knowledge management. He received his PhD in computer and information systems from the University of Michigan's Ross School of Business. Contact him at the Univ. of Florida, Warrington College of Business, PO Box 117169, Gainesville, FL 32611; [email protected].
We've shown that genetic programming offers a new way to develop IR programs. Translated into everyday practice, the technique we developed can help provide knowledge workers who regularly seek information on the same topic with increasingly effective, customized retrieval programs.

Much exciting work remains in applying genetic programming to the problem of IR. Others might use a broader set of features (functions, operators, and terminals) than we used in our work. And although we applied the same weight to every term under consideration, you don't have to have identical weights for all terms. This work doesn't represent the last word on evolving programs that are highly effective in retrieving information from the World Wide Web or other large data stores. But we believe it is the first word, and it represents a promising avenue of research and application.
References

1. G. Salton and C. Buckley, "Term Weighting Approaches in Automatic Text Retrieval," Information Processing and Management, vol. 24, no. 5, 1988, pp. 513–523.
2. J. Zobel and A. Moffat, "Exploring the Similarity Space," SIGIR Forum, vol. 32, no. 1, 1998, pp. 18–34.
3. H. Chen et al., "A Machine Learning Approach to Inductive Query by Examples: An Experiment Using Relevance Feedback, ID3, Genetic Algorithms, and Simulated Annealing," J. Am. Soc. Information Science and Technology, vol. 49, no. 8, 1998, pp. 693–705.
4. O. Cordon, F. Moya, and M. Zarco, "A New Evolutionary Algorithm Combining Simulated Annealing and Genetic Programming for Relevance Feedback in Fuzzy Information Retrieval Systems," Soft Computing, vol. 6, no. 5, 2002, pp. 308–319.
5. M. Gordon, "User-Based Document Clustering by Redescribing Subject Descriptions with a Genetic Algorithm," J. Am. Soc. Information Science and Technology, vol. 42, no. 5, 1991, pp. 311–322.
6. J. Horng and C. Yeh, "Applying Genetic Algorithms to Query Optimization in Document Retrieval," Information Processing & Management, vol. 36, no. 1, 2000, pp. 737–759.
7. C. Lopez-Pujalte, V.P. Guerrero-Bote, and F.D. Moya-Anegon, "Order-Based Fitness Functions for Genetic Algorithms Applied to Relevance Feedback," J. Am. Soc. Information Science and Technology, vol. 54, no. 2, 2003, pp. 152.
8. D.K. Harman, "Overview of the First Text REtrieval Conference (TREC-1)," The First Text REtrieval Conf. (TREC-1), D.K. Harman, ed., NIST Special Publication 500-207, 1993, pp. 1–20.
9. A. Singhal et al., "Document Length Normalization," Information Processing and Management, vol. 32, no. 5, 1996, pp. 619–633.
10. H. Schutze, D. Hull, and J.O. Pedersen, "A Comparison of Classifiers and Document Representations for the Routing Problem," Proc. 18th Ann. Int'l ACM SIGIR Conf. (SIGIR 95), ACM Press, 1995, pp. 229–237.