INFORMATION VALUE THEORY IN QUERY AUGMENTATION

ANDY DONG, J. ENRIQUE BARRETO, ALICE M. AGOGINO
University of California at Berkeley
Department of Mechanical Engineering
5136 Etcheverry Hall, Berkeley, CA 94720-1740
Phone: +1 510 642 6450  Fax: +1 510 643 8982
{adong, jorge, aagogino}@best.ME.Berkeley.EDU

Abstract. Information retrieval (IR) systems interact with users by returning a ranked list of relevant documents in response to a query. Through feedback mechanisms such as relevance feedback and automated keyword expansion, IR systems attempt to guide users in constructing search queries which better represent their information needs. These mechanisms, however, neither offer the user more insight into the content of the documents in the IR database nor indicate which search terms might yield search results that are more relevant, or more certain to contain the information the user intended to retrieve. This paper presents a methodology based on the decision-analytic concept of expected value of perfect information for controlling query augmentation in information retrieval. The system dynamically learns the content of the documents in the database to compute the utility (measured in terms of relevance) of retrieving certain documents in response to queries, where the words in the queries represent the random variables. By computing the expected value of perfect information for each query term, the system either suggests new search terms or suggests that the user terminate the search.

1. Introduction

The decision-analytic concept of expected value of perfect information (EVPI), applied within an intelligent real time problem solving framework, is used to evaluate choices when uncertainties in the available information exist and when it is important to balance limited time or computational resources against the quality of the decision made (Hartsough and Turner, 1990). The pay-off depends on both the alternative chosen and the state of the world at the time the decision is made. For example, in designing a mechanical device for transmitting power between two devices, the engineer might consider several alternatives among a wide selection, such as gears, belts or shafts. In the conceptual design stages, the available information for making selection decisions is often uncertain, or only known approximately within ranges. The designer is then forced either to make the best possible decision based on the current state of information or to obtain more information in order to reduce the uncertainty or ambiguity, perhaps by performing an analysis or experiment or by conducting market research (Bradley and Agogino, 1991). The degree of refinement or reduction in uncertainty that is possible with an information gathering task comes at some cost, such as the costs associated with computation, experimentation, or research, or the real time costs associated with delaying a decision. The EVPI specifies an upper bound on the cost that should be spent in order to reduce uncertainty in the decision.

This intelligent real time problem and solution strategy applies equally well to information retrieval. Here, the information retrieval system selects among a collection of documents in a database which are potentially relevant to the information needs of the user. For example, suppose the database contains a collection of documents relating to the design of automatic controls for mechatronic¹ devices. Then, suppose the user, who does not necessarily know what information is contained in the database but knows exactly which document is required, poses a query such as “control in the complex plane.” In general the database is very large and it would not be worthwhile to perform an exhaustive search on the whole database to look for the “most relevant” document. At some point, the expected cost of further search would outweigh the expected benefits. The goal then is to focus the user’s resources (query) towards those documents which would yield the most relevant information by (1) selecting among a pool of keywords that would reduce the uncertainty in selecting relevant documents; and (2) informing the user to stop querying when the cost outweighs the benefits.

The paper describes a decision-theoretic framework for defining a search strategy to guide users in formulating queries over full-text document databases. The motivation to supply guidance in constructing search queries is apparent. Ullman finds that 25% of the design time is spent gathering more information to refine or broaden the designer’s knowledge (Ullman et al., 1988). However, commonly the designer is not cognizant of the content of the design documents or of how to formulate queries to locate relevant information. According to Lewis and Jones (1996), “many end users have little skill or limited experience in formulating initial search requests or modifying their requests after observing failure. Even when relevance feedback is available, it needs to be leveraged from a sensible starting point.” Thus, the need exists for (1) IR systems to “know” the content of the documents in the database, i.e., to have an internal representation of the document set; and (2) systems that then use that representation to guide the user’s search.

Information retrieval research has primarily focused on retrieval techniques given a query. This paper focuses on the integration of a preference function, namely the similarity between the user’s query and the target document, into the determination of the search terms for the query. In Section 2 we discuss the utility-theory approach to information retrieval, including the utility function, the decision-making context and the decision-making strategy. Section 3 presents a formalization of the full-text retrieval problem based on utility theory. Section 4 presents a performance study of full-text retrieval under the framework described. Section 5 presents some of the implications and extensions arising from this research.

¹ The term “mechatronics” refers to electro-mechanical devices with embedded computing for controlling electronic components. Examples of “mechatronic” devices include disk drives, VCRs and CD-ROM players.


2. Utility Theory in Information Retrieval

2.1 UTILITY AND RELEVANCE

It has been argued that the utility of a document is not necessarily related to the relevance of the document (Cooper, 1973a). In fact, one could bring forth numerous arguments that illustrate how relevance and utility are not the same metric: (1) the first time that the system presents a relevant document to the user, the document would have a higher utility than after, say, the tenth time; (2) a document from a journal publication could have a higher utility than one from a conference proceedings; (3) the worth of a document depends upon how many relevant or irrelevant documents the system has presented to the user. In short, the measurement of utility is based on the perspective of the user, including the experience with all previous documents in the search sequence (Cooper, 1973b). Cooper argues that this utility measurement is also independent of the utility of the IR system, that is, how well the user judges the system to have found useful information. This perspective on document utility highlights the difficulty in formulating a document utility function which adequately describes the preferences of the user.

However, in order to formulate a retrieval strategy, one would need to quantify the utility of the action of retrieving a particular document based upon the utility of the document to the user. That is, the decision-making (retrieval strategy) process consists of computing the expected utility of an action (retrieving the document) based upon the available information about the state of the relevant variables in the decision (how specific the search terms are, or how sure the user is that these search terms represent the user’s information needs). The decision to retrieve a document is typically based upon the degree of association between the query string and the document, using measures such as Jaccard’s coefficient, Dice’s coefficient, or the cosine measurement. However, as Gordon has argued (1990), retrieving solely on the magnitude of similarity is somewhat limited. While we agree that document utility is not equivalent to document relevance, formulating document relevance as document utility does provide a first order approximation to capturing the intuition that the user prefers to retrieve relevant information over irrelevant information. Therefore, for the purposes of presenting and testing our methodology, we use document relevance as the utility of a given query. Our theoretical framework and the formulation of the EVPI, however, place no restrictions on the utility function. Mathematically, a convex utility function simplifies the optimization and satisfies normative properties. An additive, multi-criteria utility function in which relevance expresses one of the criteria is one possibility for a more complex utility function.

Therefore, to a first approximation, we assume that the utility of the document is maximized when the IR system selects those documents with the highest expected relevance (Gordon and Lenk, 1991). That is, the similarity function is an individual utility function which represents the user’s preference over a given set. Implicit in this assumption is that the document contains all the information required to make the selection decision. The similarity function can be interpreted as a measure of the quality of information in the document. The prior over the decision expresses the uncertainty in the information needs of the user, which is characterized by the specificity of the search terms. Typically, the uncertainty in finding the document (the utility of the action of finding the document) arises from the probability of relevance of the query string to the document set, i.e., P(document = relevant | query). In a conventional information retrieval system, the user is forced to translate these preferences into a query that the database system can answer. We attempt to integrate the decision problem with the database system by viewing the retrieval process as a formalization, in part, of the decision-maker’s decision strategy. Taking this decision-theoretic approach, we compute the merit of expending additional effort to collect more information about certain problem variables to assure that more relevant information is not overlooked. In essence, we augment the query to modify the uncertainty in selecting documents. This approach to selecting from alternative choices under uncertainty is the expected utility method, and the upper bound is the expected value of perfect information (EVPI).

2.2 THE DECISION MODEL

At the time the system searches for information, the keywords which would produce the most useful query will not be known precisely (since the user might not have specified them in the query). There exists no oracle to foretell which documents will be relevant to which queries or which queries will retrieve certain documents.² The degree of uncertainty in the information required will depend on the specificity of the keywords the user expresses in the query. One can characterize the specificity of the words in the query as a probability distribution. In fact, studies have been conducted to characterize the frequency of word occurrence in English (Nelson and Kucera, 1982). Specific words will have tighter ranges of possible weights (derived from standard term weighting schemes based on term frequency and inverse document frequency) whereas generic words will have broader distributions, and both might be centered at different weights depending upon the knowledge-base used to derive the distributions. This is not the same as how sure the user is of the information the user wants and then how well the user expresses this need in the query. The estimation of the relevance of the keywords in the query to the information needs of the user is the subject of other research (Koll and Srinivasan, 1990). In our implementation, we derive the distributions from the corpus of documents the user is selecting from.

To designate specific keywords requires that the user (or the system) understand the content of the target documents and that the query be precisely modeled. The amount of effort the user wishes to expend on searching is proportional to the certainty of knowing that the target document exists and its likelihood of being found, as well as the cost associated with augmenting the query to increase the likelihood of finding that document. That is, the user in tandem with the system could spend more time to locate more specific search terms related to the information needs of the user, but this is often not advantageous.

² If complete information about relevant and non-relevant documents given a query were known, then one could estimate the discriminating power of a term. For a reference on this, see (van Rijsbergen, 1979).


Thus, information retrieval not only involves decisions concerning the specification of the search terms themselves but also meta-decisions concerning the formulation of the search strategy. The system must decide on the appropriate amount of information to acquire and the best document selection to make. In this paper, we model uncertainty and imprecision with probability theory, taking a decision-analytic approach.

In modeling full-text information retrieval in a probabilistic decision-making framework, one must define the extremes of the probability density/distribution function (PDF) characterizing the uncertainties in the decision. That is, what is the zero probability case and the no uncertainty or “perfect” case? In the example of selecting among a gear, belt or shaft, the “perfect” case is, perhaps, finding a gear with the desired properties as specified by the designer. A similar analogy exists in the document selection case. The case of zero probability is the trivial case: the word does not appear in the document collection. The “perfect” case occurs when the document contains an exact match for the query words being posed by the user at the exact level of specificity requested, i.e., the weight on the words requested and the weight on the words found in the document are exactly the same. That is, the “perfect” case occurs when the user has expressed a query and found a document which not only contains the search terms but contains them at the desired level of specificity of concept as embodied by the meaning of the search terms. Presumably, to get to this point, the user must know not only the content of the document but also the specificity (the potential relevance of search terms) to the document database. These two pieces of information would constitute perfect information of the document database. In practice, this state is difficult to attain.

The user’s goal is to select from the database the record(s) that will maximize his or her utility. Part of this utility is captured in the sense that we assume the user prefers relevant information over irrelevant information. The state space of the decision is all of the possible words that the user can use to express the information needs. It embodies all of the relevant aspects of the decision about which the decision-maker is uncertain, because the user is possibly uncertain which words will return the desired set of information or is uncertain of the current information needs. With a traditional IR system, users must construct their own search strategy (query), typically unassisted by the system. The statement of the query is a translation of preferences into a feasible set of alternatives. The response to the query by the system might be: (1) an acceptable set of documents, i.e., relevant and concise; (2) an unacceptable set of documents, i.e., irrelevant or too long; or (3) no documents. Depending upon the type of response, the user may decide to recast the query into a new characterization of a feasible set of alternatives and repeat the search process. Whether or not the repetition of the search process will likely yield useful information depends upon the user’s knowledge of the contents of the database and upon the user’s willingness to expend resources to refine the search.

3. Mathematical Model

3.1 DOCUMENT MODEL

We use the vector model of documents (Raghavan and Wong, 1986). Each document D_i is a vector that can be expressed as


D_i = \sum_{k=1}^{z} w_{d_{ki}} t_k \qquad (i = 1, 2, \ldots, n) \qquad (1)

where the coefficient w_{d_{ki}} is the weight assigned to the term t_k (where t is a vector direction) in document D_i, and there are z unique terms in the n documents. In this system, we use the tf.idf metric (the original version was proposed by Fagan (1987)). The tf.idf metric is given by

tf.idf_{t_k D_i} = \frac{tf_{t_k D_i}}{\max tf_{D_i}} \cdot \ln \frac{n}{df_{t_k}} \qquad (2)

where tf_{t_k D_i} is the frequency of term t_k in the document represented by vector D_i, n is the number of documents in the collection, and df_{t_k} is the number of documents containing the word or term t_k at least once. The cosine normalization yields the final weight, w_{t_k D_i}:

w_{t_k D_i} = \frac{tf.idf_{t_k D_i}}{\sqrt{\sum_{i=1}^{n} tf.idf_{t_k D_i}^{2}}} \qquad (3)
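To make Equations (1) through (3) concrete, the sketch below builds cosine-normalized tf.idf document vectors for a toy corpus. This is our illustrative sketch, not the authors' code: the names are ours, and the normalization is applied per document over that document's terms, the conventional cosine norm.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build cosine-normalized tf.idf vectors (Equations 1-3).
    `docs` is a list of token lists; each vector maps term -> weight."""
    n = len(docs)
    df = Counter()                        # df_tk: number of documents containing tk
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)                 # tf_tkDi: frequency of tk in Di
        max_tf = max(tf.values())
        raw = {t: (tf[t] / max_tf) * math.log(n / df[t]) for t in tf}   # Eq. (2)
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0       # guard zero norm
        vectors.append({t: w / norm for t, w in raw.items()})           # Eq. (3)
    return vectors
```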

For any query q, the corresponding query vector has the expression

q = \sum_{j=1}^{r} w_{q_j} t_j

where r is the number of terms (noun-phrases) in the query and w_{q_j} is the weight assigned to the term t_j in query q. Typically, one assigns a weight of 1.0 to each query term, but the weights can be modified by the user or the system to emphasize particular keywords. The similarity between the query and the document is then given by the inner-product function of q and D_i. The objective is to maximize similarity. We define the utility U_i to be the objective function for similarity defined by the cosine measurement (Noreault, McGill and Koll, 1981):

U_i(q, D_i) = q \cdot D_i \qquad (4)

and utility calculations can be taken for each of the documents in the collection.

3.2 RETRIEVAL MODEL

Qualitatively, the retrieval rule will be: select the document(s) that appear(s) best given the present query if the cost of re-writing and re-running the query would outweigh the expected benefit to be had with the new query result(s); in other words, “buy” the information whose expected value most exceeds its cost. For document selection, the expected value of information E is:

E = the expected utility of the document chosen given the “optimal” query minus the expected utility of the document chosen given the “present” query.

We will choose to select the document only when E exceeds the expected cost of information C:

E > C \qquad (5)
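A matching sketch of the inner-product utility of Equation (4), reusing `tfidf_vectors` from the sketch above. Query terms default to weight 1.0, as in the text; the corpus and query below are invented for illustration.

```python
def utility(query_weights, doc_vector):
    """U_i(q, D_i) = q . D_i (Equation 4): inner product of the query
    weights and the document's tf.idf weights."""
    return sum(w * doc_vector.get(t, 0.0) for t, w in query_weights.items())

# Illustrative usage: rank a toy corpus against a two-term query.
docs = [["gear", "torque", "gear", "shaft"],
        ["belt", "pulley", "torque"],
        ["controller", "damping", "frequency"]]
vectors = tfidf_vectors(docs)
query = {"gear": 1.0, "torque": 1.0}
ranking = sorted(range(len(docs)),
                 key=lambda i: utility(query, vectors[i]), reverse=True)
```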

This expected cost is the price the user is willing to pay to acquire the information, i.e., how much control the user wishes to give the system in refining or broadening the search. For n documents in each of i information sources, we will select the action a with the largest value for E − C:

a = \arg\max_i (E_i - C_i) \qquad (6)
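Equation (6) is a one-line selection over net value. A trivial sketch, with illustrative inputs:

```python
def select_action(expected_values, costs):
    """Equation (6): choose the action a with the largest E_i - C_i."""
    return max(range(len(expected_values)),
               key=lambda i: expected_values[i] - costs[i])
```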

Without loss of generality, for purposes of illustration, we simplify the problem by considering only individual actions, and not sequences, and assume that the optimal sequence results from individually optimal actions. This is known as the “one-step horizon assumption” (Russell and Wefald, 1989). We assume that the cost of re-writing and re-running the query is zero. Therefore, if the user decides to re-run the query, the important questions are: which of the query terms should be modified to reduce the uncertainty most in the decision, and what is the upper bound (EVPI) on the expected gain in utility?

Let us represent the utility of the similarity as U_i(q, D_i) for document i, where U_i is the similarity function, a deterministic function of the random variable q. The parameter q_j is the word in the query for which the EVPI is of interest. Even if there is more than one word in the query for which a PDF is known, we consider only the single-frontier case, i.e., the EVPI for a single attribute rather than for a collection of attributes as a whole. We then consider each parameter individually and treat the other parameters as a variable. The domain of the vector composed of the values that q_j can take on may be constrained, in which case we will assume that suitable mathematical constraints in terms of q_j can be formulated. That is, no arbitrary word appears infinitely many times in a document, which would give rise to an infinite tf.idf score. If there are n documents available for retrieval, and the user is powerless to change the uncertainty in the parameters q, the optimal solution would be the option i, a document, for which the expected value of the utility is a maximum. The system should select the best document i* such that:

i^* = \arg\max_i \; E_p \left[ \max_q U_i(q, D_i) \right] \qquad (7)

Since the utility function is simply the similarity function, this rule simplifies to the standard sorted query-document similarity list generated by many IR systems. However, since there exists a probability distribution on the likelihood of the query term being related to the document, the order of the list is not expected to be the same as in the case of no probability distribution. In theory, the best alternative is identified at each possible value of the variable. In the document retrieval case, the best document(s) is(are) retrieved given the weight on a search term. This alternative is termed the conditional best. The expected value of perfect information (EVPI) given parameter q_j is computed as the payoff for perfect information about the uncertain parameter, minus the payoff for the best decision possible given the current state of the uncertain parameter, multiplied by the probability of obtaining perfect information about the uncertain parameter. Equation (8) expresses the EVPI for uncertain parameter q_j:


EVPI_{q_j} = \int_{a_j}^{b_j} P(q_j) \times \max_i \left\{ \int_q \left( U_i(\bar{q}, D_i) - U^*(q, D_i) \right) P(q) \, d(q) \right\} dq_j \qquad (8)

where:

i = document
a_j, b_j = lower and upper bounds of the variable q_j
P(q_j) = probability of finding a term with a given relevance (to the document) in the target document collection
q = the array of basic problem variables (query)
\bar{q} = the array of basic problem variables (query) where the j-th term takes on the current value of q_j from the integration loop
U^*(q, D_i) = the unconditional best alternative, i.e., the one with the largest expected utility, defined by:

U^*(q, D_i) = \max_i \int_{a_j}^{b_j} P(q_1 \ldots q_j \ldots q_r) \, U(q_1 \ldots q_j \ldots q_r, D_i) \, dq \qquad (9)

P(q_1 \ldots q_j \ldots q_r) = probability of all the variables which influence the relevance of a word to the document

In order to simplify computational complexity, we assume that the terms are conditionally independent given the document set, i.e.,

P(q_1 \ldots q_j \ldots q_r) = \prod_{m=1}^{r} P(q_m)

To express Equation (8) in words, the EVPI for a particular query term q_j is the utility of the most relevant document the user could find if the user knew that particular search term exactly (knew how relevant the term is to the document set), minus the utility of the most relevant document given what the user knows about the relevance of the search term now, multiplied by the probability of the search term having a given relevance. To improve the decision-making, that is, the confidence of retrieving relevant documents, the user can add closely related keywords to words which have high EVPI and remove terms which have low EVPI, since there is no value in being more precise about the concept described by a low-value search term. Given that the estimates of probability and utility are actually discrete, Equation (8) becomes:

EVPI_{q_j} = \sum_{a_j}^{b_j} P(q_j) \times \max_i \left\{ \sum_q \left( U_i(\bar{q}, D_i) - U^*(q, D_i) \right) P(q) \right\} \qquad (10)
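Under the single-uncertain-term simplification used in the sketches above, the discrete EVPI of Equation (10) can be sketched directly: the unconditional best is the document with the highest expected utility (Equations 7 and 9), and the conditional best re-selects the document after the term's weight is revealed. `utility` is reused from the earlier sketch; `term_pdf`, an assumed input, maps candidate tf.idf values of q_j to probabilities. This is our simplified reading, not the authors' implementation.

```python
def expected_utility(doc_vector, query_weights, term, term_pdf):
    """E_p[U_i(q, D_i)]: average the Equation (4) utility over the
    discrete PDF of the single uncertain term weight."""
    return sum(p * utility({**query_weights, term: v}, doc_vector)
               for v, p in term_pdf.items())

def evpi(term, term_pdf, doc_vectors, query_weights):
    """Discrete EVPI for one query term (Equation 10)."""
    # Unconditional best U*: best single document under current uncertainty.
    u_star = max(expected_utility(vec, query_weights, term, term_pdf)
                 for vec in doc_vectors)
    # Conditional best: for each possible weight value, re-pick the best
    # document, then average the improvement over the term's PDF.
    gain = 0.0
    for v, p in term_pdf.items():
        best_given_v = max(utility({**query_weights, term: v}, vec)
                           for vec in doc_vectors)
        gain += p * (best_given_v - u_star)
    return gain
```

By construction the gain is non-negative; terms with high EVPI are the ones worth augmenting with related keywords, while terms with EVPI near zero can be dropped, matching the guidance above.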

Equation (8) therefore indicates that perfect information for the term q_j yields a particular improvement in the uncertainty of what the user is searching for. The system attempts to locate topics of interest to the user by solving a dynamically generated belief network (BN) describing the content of the documentation, in which the events represent topics in the document collection and the arcs represent dependencies between topics. Figures (1a) and (1b) illustrate portions of the belief network which was generated by learning over a document collection of design documents relating to automatic controls (Dong and Agogino, 1996). The network is slightly different from those used for IR since the leaves do not represent documents but rather “general topics” (Fung and Del Favero, 1995). In Figure (1a), a dependency relation exists between air pressure, output pressure and the available supply pressure. In Figure (1b), evidence of the “transient response” of the control system could be revealed through characteristics of the “natural frequency,” “time constant,” or “damping ratio” of the system.

[Figure 1a/b. Portions of the Belief Network. Nodes include control system, air pressure, output pressure, supply pressure, back pressure, basic control, control law, natural frequency, damping ratio, and time constant.]

Given the value of EVPI for each particular word in the query, the question is what should the decision be now — continue the query and provide the user with newer search terms or quit? The algorithm for making this decision is depicted in Figure 2.


[Figure 2. EVPI Computation in the Retrieval Process. Flowchart: START; run the query and parse it into noun phrases; compute tf.idf statistics over the document collection; compute EVPI against the belief network; prompt the user to re-write the query by augmenting the highest-EVPI query term; if YES, the BN solver suggests keywords and the query is re-run; if NO, QUIT.]
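The Figure 2 loop can be sketched as follows, reusing `evpi` and `utility` from the earlier sketches. The belief-network solver is out of scope here, so the hypothetical `suggest_keywords` callback stands in for it, and the interactive user prompt is replaced by an automatic stopping rule on the highest EVPI; all names and thresholds are illustrative assumptions.

```python
def evpi_guided_search(query_terms, doc_vectors, term_pdfs, suggest_keywords,
                       evpi_threshold=0.0, max_rounds=5):
    """Sketch of EVPI computation in the retrieval process (Figure 2).
    `term_pdfs` maps each term to its discrete weight PDF; `suggest_keywords`
    returns contextually related terms (the BN solver's role)."""
    weights = {t: 1.0 for t in query_terms}            # default query weights
    for _ in range(max_rounds):
        scores = {t: evpi(t, term_pdfs[t], doc_vectors, weights)
                  for t in weights if t in term_pdfs}
        if not scores:
            break
        best_term, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score <= evpi_threshold:               # cost outweighs benefit: quit
            break
        for kw in suggest_keywords(best_term):         # augment around high-EVPI term
            weights.setdefault(kw, 1.0)
    # Final ranked list by query-document similarity (Equation 4).
    return sorted(range(len(doc_vectors)), reverse=True,
                  key=lambda i: utility(weights, doc_vectors[i]))
```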

3.3 PROBABILITY DISTRIBUTIONS

The range of values for tf.idf weights for each word varies between 0 and 1 and is both continuous and random. In information retrieval, this weight is often interpreted as the probability that a document is relevant given the query term, i.e., P(document D_i = relevant | term = t_k) = tf.idf_{t_k D_i}. It is assumed, though, that each word has a particular occurrence pattern depending upon the document collection. That is, although the appearance of words is random in the document collection, they will occur regularly given the content of the document set. As a result, the randomness of the words is characterized as a discrete distribution. The probability distribution function (PDF) for a word measures the probability of finding a word with a specified relevance (i.e., tf.idf weight) to the target document collection. That is, the probability distribution measures the probability of a term taking on a particular value given the document collection, i.e., P(tf.idf_{t_k D_i} = q_j | document collection).

The state space of decision attributes is the range of values for tf.idf weights. Since each decision attribute value has a corresponding probability of occurrence, the utility of the decision at each step depends upon the likelihood of the decision attribute taking on a particular value. By summing the utility over the probability of obtaining the information required to make the decision, one can arrive at the expected value of perfect information for the random variable.
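One plausible way to obtain such a discrete PDF is to histogram a term's tf.idf weight across the collection. This is our reading of the text with illustrative names, not a prescribed procedure; the bin width is an assumption.

```python
from collections import Counter

def term_weight_pdf(term, doc_vectors, bins=10):
    """Estimate P(tf.idf_tkDi = qj | document collection) by binning the
    term's weight in every document vector to a 1/bins grid."""
    values = [vec.get(term, 0.0) for vec in doc_vectors]
    counts = Counter(round(v * bins) / bins for v in values)
    return {v: c / len(values) for v, c in counts.items()}
```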


4. Performance Study

The goal of this performance study is to determine whether the EVPI computation, calculated for each particular decision attribute or word in the query, allows the user to identify those words in the query which more closely match the user’s information needs, and whether expanding on the word with the highest EVPI, using related keywords, will contribute the most in finding the key relevant documents. This process can then be implemented by allowing the system to prompt the user with related keywords on the decision attribute with the highest EVPI, so that the user can select one or more keywords to expand the query and repeat the querying process to obtain relevant documents. The performance test consists of the following steps:

(1) Determine the relevant documents to a particular query
(2) Run the query on the document set using freeWAIS-sf
(3) Compute EVPI for each search term in the query
(4) Add keywords in the contextual locality of the terms with the highest and lowest EVPI
(5) Check the change in precision/recall (Pr/Re) for the expanded query

To avoid biasing the results, graduate students in mechanical engineering specializing in controls were asked to construct queries using pre-determined terminology based on a learned representation (Dong and Agogino, 1996) of the target document collection, a chapter on mechanical control design of The Mechanical Engineers’ Handbook (Kutz, 1986). Other controls students were also asked to select the relevant documents to the queries from the handbook. To circumvent coding bias, the relevancy judgments were performed on every document in the target corpus relative to the entire range of initial test queries and independent of the actual test. The query-document relevance assessments were then applied in an algorithmic manner to the results of freeWAIS-sf and our algorithm.

The first case, Case A, is the “raw” freeWAIS-sf scores obtained using the original query and serves as the benchmark. In Case B, related keywords are added to the word identified as having the highest EVPI value. Case C consists of expanding on the word with the lowest EVPI value, and Case D consists of running the original query without the term with the lowest EVPI value. One would expect the results to show the highest precision for Case B, expanding upon the term with the highest EVPI. Note that the goal of the EVPI computation is to narrow the search and to guide the user towards the potentially most relevant document. In the probabilistic sense, expanding upon the term with the highest EVPI increases the probability of the relevance of the highest-value document given the evidence, i.e., the search terms.

Note, however, that we cannot necessarily compare the augmented precision results to the precision results for the original query. By augmenting the query, we have modified the joint probability distribution for the query. However, the computation of EVPI shown in Equation 8 does not include a feedback loop to compute the change in the value of the decision under modification, namely the addition of terms to the query. The equation makes a relative comparison between the highest and the lowest EVPI term given perfect knowledge of the state of those variables, rather than between the state of the world under modification (addition) of the decision variables and non-addition. However, it would be interesting to note the difference in precision and recall under query augmentation, given the likelihood that recall should increase under query augmentation, which might cause a drop in precision relative to the un-augmented case.

5. Results and Discussion

The standard information-retrieval definition of precision is applied as the number of relevant documents retrieved above a certain similarity score, divided by the number of returned documents above that same score. In this case, if a document from the return set had a similarity score of 300 or more, it was deemed “relevant.” Figure 3 shows the precision results for all the queries, comparing the original freeWAIS-sf scores to the scores obtained by augmenting the queries with the terms from the belief net.

[Figure 3. Variation in Precision Under EVPI-Guided Query Augmentation. Precision (0% to 80%) by query number (1 to 16) for Case A (Baseline), Case B, and Case C.]

In general, precision performance improved under the guidance of EVPI computation for query augmentation. From a probabilistic standpoint, the conclusion is that the “best” document(s) scored higher, resulting in better precision, since adding terms contextually similar to the term with the highest EVPI reduced the uncertainty in the decision for picking the “correct” document. Figure 3 also shows that precision decreases when expanding on the term with the lowest EVPI, since there is no value in being more precise about a term that is not related to the information being searched.

Recall tells us about the completeness of the search. Figure 4 shows the recall results for all the queries, comparing again the original freeWAIS-sf scores to the scores obtained by augmenting the queries with the terms from the belief net. Since we wanted to explicitly consider the value of the decision, a different metric was used to display the results for recall. This metric is applied as the summation of the freeWAIS-sf scores of the pre-determined relevant documents retrieved above a certain similarity score, divided by the number of pre-determined relevant documents it should have returned above that same score multiplied by the maximum freeWAIS-sf score of 1000, as shown in Equation 11.

\text{Recall} = \frac{\sum_i \text{Relevant Document}(i) \times \text{Score}(i)}{\text{Total Number of Relevant Documents} \times 1000} \qquad (11)
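The two evaluation metrics can be restated in code. A sketch under the paper's stated conventions (relevance threshold of 300, maximum freeWAIS-sf score of 1000); `results`, an assumed input, maps document ids to freeWAIS-sf scores, and `relevant` is the set of pre-determined relevant documents.

```python
def precision_at(results, relevant, threshold=300):
    """Relevant documents retrieved above the score threshold, divided by
    all documents retrieved above it."""
    above = [d for d, s in results.items() if s >= threshold]
    return sum(1 for d in above if d in relevant) / len(above) if above else 0.0

def score_weighted_recall(results, relevant, threshold=300, max_score=1000):
    """Equation (11): summed scores of relevant documents retrieved above
    the threshold, normalized by the relevant count times the maximum score."""
    gained = sum(s for d, s in results.items() if d in relevant and s >= threshold)
    return gained / (len(relevant) * max_score)
```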

[Figure 4. Variation in Recall Under EVPI-Guided Query Augmentation. Recall (0% to 100%) by query number (1 to 16) for Case A (Baseline), Case B, and Case C.]

The results in Figure 4 show that recall also improved when expanding on the term with the highest EVPI relative to the other cases. A per-query analysis of the precision and recall results shows that the typical inverse relationship found in the other cases still exists. That is, precision still went up at the expense of recall and vice versa, but in this case the average score for all the queries was higher.

Figure 5 shows the results of Case D, running all the original queries after discarding the term with the lowest EVPI. Except for Query 6 and Query 12, where the relevant documents were already found and were located at the top of the returned list of documents, the precision increased, indicating that the low-EVPI term actually lowered the value of the decision. Therefore, excluding the lowest-EVPI term decreased the uncertainty in selecting the relevant documents.


[Figure 5. Variation in Precision with Low EVPI Term Discarded. Precision (0% to 100%) by query number (1 to 16) for Case D and Case A (Baseline).]

6. Conclusions and Future Directions

The theoretical contribution of this paper is the formulation of full-text search as a decision-theoretic information value model in which the benefit is modeled as the maximum amount of relevant information attainable. By applying information value theory to the determination of valuable search terms, and by rationally substituting keywords into the search, the “decision-theoretic augmented information retrieval system” directs full-text search based on the utility of the search strategy (query). The results of this preliminary performance study suggest that identifying the high-value attributes in the query improves decision-making, that is, the assurance of retrieving relevant documents because they closely match the information needs of the user. Conversely, discarding useless terms results in a similar improvement in precision. The primary assumption in our work is that adding contextually similar keywords reduces the uncertainty in describing the concept of a query. Keywords which better describe the domain may cause the system to perform better than shown, and vice versa.

In essence, these preliminary results suggest that the EVPI computation offers guided navigation through documentation by assisting the user in constructing queries which better represent the user’s information needs. That is, it offers a way of evaluating the accuracy of the search when the user adds specific related terms to the high-EVPI term in the query. The principal weakness in this approach is the assumption that the user’s probability distribution on the uncertainty in the words chosen is the same as the probability distribution of word scores in the document database. Instead, the system should convert user confidence judgments into the probability distributions. However, using the system’s derived probability distribution accurately models the likelihood of finding a particular piece of information.

The algorithm does not suggest the optimal strategy for choosing keywords to direct the user to the most relevant document, since the change in value is not fed back into the EVPI computation. Further, in practice, it might be more efficient to perform the EVPI computation over a subset of the returned documents rather than the entire document collection. How would this affect the confidence in the value? Finally, the integration of other metrics of document utility into the utility function poses both a theoretical and a practical challenge. These questions, and a more complete performance study, hold great promise for fully developing the theory and practice of decision theory for full-text information retrieval.

References

Barreto, J. Enrique: 1996, Augmenting Information Retrieval Using EVPI Computation, Master’s Project Report, Department of Mechanical Engineering, University of California, Berkeley.

Bradley, Stephen R., and Agogino, Alice M.: 1991, “Intelligent Real Time Design Application to Prototype Selection,” Artificial Intelligence in Design, AID’90, John S. Gero, (ed.), Oxford: Butterworth-Heinemann Publishers, 815-937.

Cooper, William, Chen, Aitao, and Gey, Fredric: 1994, Experiments in the Probabilistic Retrieval of Full Text Documents, Proceedings of the Third Text Retrieval Conference (TREC-3), Gaithersburg, MD, November 2, 1994.

Cooper, William S.: 1976, The Paradoxical Role of Unexamined Documents in the Evaluation of Retrieval Effectiveness, Information Processing and Management, 12, 367-375.

Cooper, William S.: 1973a, On Selecting a Measure of Retrieval Effectiveness, Journal of the American Society for Information Science, Arthur W. Elias, (ed.), 24(2), 87-100.

Cooper, William S.: 1973b, On Selecting a Measure of Retrieval Effectiveness, Part II: Implementation of the Philosophy, Journal of the American Society for Information Science, Arthur W. Elias, (ed.), 24(6), 413-423.

Dong, Andy, and Agogino, Alice M.: 1996, “Text Analysis for Constructing Design Representations,” Artificial Intelligence in Design - AID ’96, John Gero and Fay Sudweeks, (eds.), The Netherlands: Kluwer Academic Publishers, 21-38.

Fagan, Joel L.: 1987, Experiments in Automatic Phrase Indexing For Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods, Doctoral Dissertation, Department of Computer Science, Cornell University, Ithaca, New York.

Fung, Robert, and Del Favero, Brendan: 1995, Applying Bayesian Networks to Information Retrieval, Communications of the ACM, March, 38(3), 42-48, 57.


Gordon, Michael, and Lenk, Peter: 1991, A Utility Theoretic Examination of the Probability Ranking Principle in Information Retrieval, Journal of the American Society for Information Science, Donald H. Kraft, (ed.), 42(10), 703-714.

Hartsough, Bruce R., and Turner, John L.: 1990, A Streamlined Approach for Calculating Expected Utility and Expected Value of Perfect Information, Decision Support Systems, Hans-Jochen Schneider, Andrew Whinston, (eds.), 6(1), 1-11.

Koll, Matthew, and Srinivasan, Padmini: 1990, Fuzzy versus Probabilistic Models for User Relevance Judgments, Journal of the American Society for Information Science, June 1990, 41(4), 264-271.

Kutz, Myer (ed.): 1986, Mechanical Engineers’ Handbook, New York: John Wiley and Sons, Inc.

Lewis, David D., and Jones, Karen Sparck: 1996, Natural Language Processing for Information Retrieval, Communications of the ACM, January 1996, 39(1), 92-101.

Moore, James C., Richmond, William B., and Whinston, Andrew: 1990, A Decision-Theoretic Approach to Information Retrieval, ACM Transactions on Database Systems, Gio Wiederhold, (ed.), September, 15(3), 311-340.

Nelson, Francis W., and Kucera, Henry: 1982, Frequency Analysis of English Usage, Boston: Houghton Mifflin.

Noreault, Terry, McGill, Michael, and Koll, Matthew B.: 1981, A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representations in a Boolean Environment, Information Retrieval Research, R. N. Oddy, S. E. Robertson, C. J. van Rijsbergen, and P. W. Williams, (eds.), London: Butterworths.

Pfeiffer, Paul E.: 1990, Probability for Applications, New York: Springer-Verlag New York Inc.

Raghavan, Vijay V., and Wong, S. K. M.: 1986, A Critical Analysis of Vector Space Model for Information Retrieval, Journal of the American Society for Information Science, 37(5), 279-287.

Ruge, Gerda: 1995, Human Memory Models and Term Association, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Edward A. Fox, Peter Ingwersen, and Raya Fidel, (eds.), ACM Press.

Russell, Stuart, and Wefald, E.: 1989, On Optimal Game-Tree Search Using Rational Metareasoning, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, J. Saveland, (ed.), Morgan Kaufmann, 4(63-65), 334-340.

Salton, Gerald, and McGill, Michael J.: 1983, Introduction to Modern Information Retrieval, New York: McGraw-Hill Book Company.


Ullman, David G., Wood, Stephen, and Craig, David: 1990, The Importance of Drawing in the Mechanical Design Process, Computers and Graphics, 14(2), 263-274.

Ullman, David G., Dietterich, Thomas G., and Stauffer, Larry A.: 1988, A Model of the Mechanical Design Process Based on Empirical Data, Artificial Intelligence in Engineering Design and Manufacturing, 2(1), 33-52.

van Rijsbergen, C. J.: 1979, Information Retrieval, Second Edition, Boston: Butterworths.

