Gathering Information Based on its Value

Joshua Grass and Shlomo Zilberstein
Computer Science Department
University of Massachusetts
Amherst, MA 01003 U.S.A.
{jgrass,shlomo}@cs.umass.edu

Abstract

As the Internet continues to grow, it enables computer systems to make complex decisions based entirely on information gathered from different on-line resources. Unfortunately, retrieving information from sites on the Internet in order to make a decision is a time-consuming and potentially expensive task. This paper presents a system designed to control the information gathering process so as to optimize the value of information to the user. The system uses a database describing the contents, costs, and responsiveness of different information sources, and a description of the user's decision model. By predicting the expected querying value for a set of sites, the Value-Driven Information Gathering (VDIG) system can determine an ordering for querying sites and monitor and alter the plan as it executes. Preliminary results show that significant gains in performance can be achieved by taking into account the time/cost/quality tradeoffs offered by the different information sources.

1 Introduction

As more information becomes available on the Internet, there is a growing potential for using this information to make more informed decisions. But the large number of information sources and their differing levels of reliability and cost present a complex information gathering problem. One approach to designing information gathering systems is to maintain a database of meta-information about the expected cost, time and quality of the sites that the system will use in its queries. This meta-level information is easy to learn, does not require much space to store, and can drastically increase the quality of a decision. Since many sites on the Internet contain equivalent information, the meta-level information about a site itself can be the most important factor in deciding which sites to investigate and which to avoid. For example, a site may have a very high likelihood of returning precise information, but it may take a long time. Alternatively, a site may be cheap or free to access, but the quality of the information may be low. Depending on the user's situation, each of these sites may be the best one to query. The behavior of sites may also depend on environmental variables, such as the time of day or the amount of traffic at a particular site. The question is how to control the information gathering process in a way that is sensitive to these characteristics.

The value-directed information gathering algorithm presented in this paper shares many characteristics with anytime algorithms because of its ability to smoothly trade quality for execution time [1; 6; 5; 15]. The expected querying value for a set of sites is very similar to the performance profile that an anytime algorithm uses to calculate the optimal execution time (see Figure 7). As the VDIG system executes, the probability of a node in the user decision model becoming instantiated increases. (We use an influence diagram to represent the user decision model, with nodes representing specific qualities that influence the decision.) If an unexpected event occurs (e.g. a site returns a result earlier than planned), the VDIG system can re-plan and take advantage of the event. This allows the VDIG system to gradually increase the quality of a decision, predict the quality at any given time, and monitor and re-plan if events change during execution.

[Figure 1: The components of the VDIG system: the user decision model (feature names, feature values, cost of time function), the site library database (site characteristics, information source model), and the value-directed query planner and monitor, which issues queries to the Internet and receives results while tracking site status and current cost.]

In this paper we present a system that creates an intelligent query plan from three components that are described in detail later in the paper. Figure 1 shows the basic layout of the three components and the planner of the VDIG system.

A user decision model – The user decision model is an influence diagram that relates specific features to a decision. Knowing the value of a feature will improve the decision and increase the utility of the system. Using an influence diagram allows the system to calculate the utility of discovering the value of each specific feature. These utilities influence the value of querying sites with that information.

A database of information sources – The database of information sources contains information about which sites may return the value for each feature. It also contains information about the cost of accessing a site and the probability of a site returning the value of a feature in a certain amount of time.

A cost of time function – The cost of time function is defined by the user and represents how long they are willing to wait for a decision.

Currently, there is a great deal of interest in gathering and processing information from the Internet, and specifically the world wide web. Several other groups have discovered the value of maintaining meta-level information about sites. This information can vary from different access methods, to information about the quality of the data, to a mapping from topics to locations on the Internet [9; 14; 2]. The site library in the VDIG system serves a similar purpose. At the moment the VDIG system runs in a simulated environment, but in the future the system will be linked to a web-based information extraction engine such as the ones presented in [3; 10; 4; 12] that can process web pages and return the specific information requested. Since we generate the feature weights from an influence diagram, and these weights help determine our plan of action, the VDIG work is also related to work on decision analysis and planning [7; 8; 11; 13].

Section 2 describes the user decision model, which is defined by the instantiation of particular values for features in the influence diagram and a user-defined cost of time function. Section 3 describes the information sources database. In section 4, we describe how the anytime planner uses the information from these components to create an optimal expected utility plan and how it monitors that plan as the sites are queried. Section 5 describes the experimental results of the prototype VDIG system that we have built. We conclude with a summary of the system and future work.

[Figure 2: Determining the value of information for each node in the influence diagram. In the example, a car-buying decision model with feature nodes Four Wheel Drive, Anti-lock Brakes, Depreciation, and Gas Mileage (feeding a Buy Car decision and a Value of Car node) is converted into a list of feature values: Anti-lock Brakes 5.2, Four Wheel Drive 4.3, Depreciation 1.2, Gas Mileage 3.4.]

2 User decision model

2.1 Decision model instantiation

A user decision model is created by an outside component and passed to the VDIG system. The user decision model is an influence diagram whose nodes represent specific features that will be used to make a decision. A specific instance of that model contains information about an individual's ranking of the features in their particular situation. The value of information can be calculated for each feature in the influence diagram using well-known techniques [11]. Figure 2 shows an example user decision model being converted into a list of feature values.

As the values of nodes are instantiated in the user decision model, the values of the features in the remaining nodes may change. When a result for a node is discovered, the user decision model is re-evaluated to determine the value of the remaining unknown feature nodes. Each node has several sites that can instantiate it, so the value of a node affects the value of several sites. Once any one of these sites has returned a value, all of them are removed from the list of possible sites to explore.

For now we do not address the issue of information extraction from particular sites on the web. Numerous papers deal with approaches to converting web pages, gopher sites and databases into a form that a decision system can use. In the future we would like to maintain meta-level information about the expected time and quality of these extraction engines and incorporate their characteristics into the planning. A small worked example of the value-of-information calculation is sketched below.
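To make the value-of-information computation concrete, the following is a minimal sketch of the standard calculation for a single decision and one unobserved feature: the expected utility of deciding after observing the feature, minus the expected utility of the best decision made without it. The prior, the utility table, and the car-buying numbers are illustrative assumptions, not the influence-diagram machinery of the VDIG prototype.

```python
# Minimal value-of-information sketch for one discrete feature.
# Hypothetical inputs: a prior over the feature's values and a utility
# table utility[decision][feature_value]; not the authors' implementation.

def expected_utility(utility, prior):
    """Expected utility of each decision under the prior over the feature."""
    return {
        d: sum(prior[v] * u_v for v, u_v in by_value.items())
        for d, by_value in utility.items()
    }

def value_of_information(utility, prior):
    """VOI = E[utility of deciding after observing the feature]
             - utility of the best decision made without observing it."""
    eu_without = max(expected_utility(utility, prior).values())
    eu_with = sum(
        prior[v] * max(by_value[v] for by_value in utility.values())
        for v in prior
    )
    return eu_with - eu_without

if __name__ == "__main__":
    # Example: decide whether to buy a car given the (unknown) anti-lock
    # brakes feature; the numbers are illustrative only.
    prior = {"has_abs": 0.6, "no_abs": 0.4}
    utility = {
        "buy":      {"has_abs": 10.0, "no_abs": -2.0},
        "dont_buy": {"has_abs": 0.0,  "no_abs": 0.0},
    }
    print(value_of_information(utility, prior))  # ~0.8: expected gain from learning this feature first
```

In the full system a quantity of this kind would be recomputed for every remaining feature node each time a result arrives, which is what drives the re-evaluation described above.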

2.2 Cost of time

The other information passed to the VDIG system by the user is a cost of time function. This function can be any non-decreasing function that returns a utility cost as a function of time. In the prototype VDIG system we use the following cost of time function:

C(t) = \begin{cases} 0 & \text{if } t \le t_{min} \\ V_{total}\left(\dfrac{t - t_{min}}{t_{max} - t_{min}}\right)^{curve} & \text{otherwise} \end{cases}

where V_{total} = \sum_{i=1}^{n} V_i, the sum of the value of knowing all the features. Figure 3 shows a graph of the cost function. This simple cost of time function allows the user to set three parameters: a minimum exploration time (t_{min}), a deadline (t_{max}), and a curve constant (curve) that controls the rate at which the cost of time increases between the minimum time and the deadline.
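The cost of time function is straightforward to implement. The sketch below is one possible reading of the formula above, with t_min, t_max, curve, and v_total supplied as assumed parameters.

```python
def cost_of_time(t, t_min, t_max, curve, v_total):
    """Cost of time used in the prototype: zero until t_min, then rising
    toward v_total at the deadline t_max at a rate set by the curve constant."""
    if t <= t_min:
        return 0.0
    return v_total * ((t - t_min) / (t_max - t_min)) ** curve

# Example: with a 1-second minimum, a 2-second deadline, and curve = 2,
# half a second past t_min costs a quarter of the total feature value.
# cost_of_time(1.5, 1.0, 2.0, 2, 12.0)  ->  3.0
```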

[Figure 3: The cost of time function used in the VDIG system. The cost is zero until t_min and rises to V_total at the deadline t_max.]

3 Information sources database

An information sources database maintains meta-level information about the sites that the VDIG system accesses in order to make a decision. This information allows the VDIG system to predict the cost of accessing a site in terms of an access fee, the amount of time needed to retrieve an answer, how likely it is that the site will return an answer, and how likely that answer is to be correct. The information sources database could also contain access information and a mapping from topics to specific sites on the web. This database must be local so that it may be accessed quickly, and because the characteristics it contains may change from site to site. Using the VDIG system's previous experience with a site, it is not difficult to update the entry in the database and increase its accuracy.

In our current implementation the information sources database contains two pieces of information for each site, indexed by the state of the environment: a result probability histogram and the cost of accessing the site. We assume for now that each feature has several sites from which its value can be extracted and that each site contains perfect information about only one feature. The access cost is a fee for interacting with the site; it does not guarantee that you will find the information you are looking for. The result probability histogram represents the probability of receiving the value of the feature in any particular time slice, as well as a time-out value, at which point the probability of the site ever returning an answer drops to zero. The result probability histogram covers both the time for information to travel across the Internet and the time to process the information and see if the result is present. Currently, the information sources database that the VDIG system uses is perfect (i.e. the same information is used by the simulator), but in the future the VDIG system will learn the model based on interactions with information sites. The state variables we use in our system are the time of day and the day of the week; Internet traffic, and thus response time, is greatly influenced by these two variables in the real world, so we included them in the system. Figure 4 shows an example set of sites returned by the information sources database.

Once the information sources database has found the set of all the sites that can instantiate the features, along with each site's result probability histogram and access cost, the VDIG system can begin to evaluate which site is the best to explore first.
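To make the contents of the information sources database concrete, the sketch below shows one possible representation of a site entry, indexed by an environment state (time of day, day of week). The field names and example figures are illustrative assumptions, not the prototype's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Environment state the database is indexed by: (time-of-day bucket, day-of-week).
EnvState = Tuple[str, str]

@dataclass
class SiteEntry:
    """Meta-level information kept for one site and one feature."""
    url: str
    feature: str            # the single feature this site can instantiate
    query_cost: float       # fee charged for accessing the site
    # Probability of the result arriving in each 0.1 s time slice; the
    # histogram implicitly ends at a time-out, after which the probability
    # of ever getting an answer is taken to be zero.
    result_histogram: Dict[EnvState, List[float]]

# Illustrative entry (URL and query cost taken from Figure 4; histogram made up).
honda = SiteEntry(
    url="http://honda.com",
    feature="gas-mileage",
    query_cost=1.3,
    result_histogram={
        ("evening", "weekday"): [0.0, 0.05, 0.15, 0.30, 0.25, 0.15, 0.10],
    },
)
```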

[Figure 4: Retrieving the site information. Given a feature name, the cost of time, and the environment variables from the user decision model, the site library returns the list of sites that can instantiate the feature (e.g. http://honda.com, query cost 1.3; http://CarandDriver.com, query cost 0.6; http://Consumer.com, query cost 2.4), each with its result probability histogram.]

4 Value-directed information gathering

The VDIG system instantiates the user decision model by querying sites in order of their expected query utility. The expected query utility is calculated from the information value derived from the influence diagram, the cost of accessing the site, the result probability histogram and the cost of time function. The VDIG system queries one site at a time and will continue querying that site until either the site returns the value of the feature or the expected utility of querying another site rises above the expected utility of continuing to query the current site. The sites are ranked by finding the maximum expected query utility over all possible amounts of time that the site may be queried. So site a may have a higher utility than site b when both are queried for 3 seconds, but site b may have a higher utility after being queried for 9 seconds than site a has for any amount of querying time (see Figure 5).

[Figure 5: Expected query utility as a function of the time a site is queried, for two sites a and b, with the maximum expected query utility of each site marked.]

At each step in execution the expected querying utility is calculated for all of the sites, including the site currently being queried, and depending on that value the VDIG system chooses one of three operations to perform during the next time step:

1. Begin querying a new site
2. Continue querying the current site
3. Stop execution

Execution time is divided into discrete units that match the time steps in the result probability histogram for each site (see Figure 4). In the implementation we used a time unit of 0.1 seconds; since most Internet operations take place on the order of seconds, this time step size seemed appropriate.

To determine which action to take we have to calculate the maximum expected query utility for each site that has not yet been explored, mequ(site), and the maximum expected continuing query utility for the site that is currently being queried, mecqu(site). Once the expected utility for each site is calculated, the planner queries the site with the maximum utility. The expected query utility is defined as:

equ(n) = \sum_{i=0}^{n} Pr(t_i)\,\bigl(V - C(t_i + ct)\bigr) - \prod_{i=0}^{n} \bigl(1 - Pr(t_i)\bigr)\, C(t_n + ct) - qc \qquad (1)

The formula states that the expected utility is equal to the utility of knowing the node multiplied by the probability of finding it at a particular time, minus the cost of time. We also have to subtract the cost of time if we do not find the node in the given amount of time, and finally we must subtract the cost of accessing the site. Where:

n       number of time steps to query the site
Pr(t)   probability of the site returning a result at time t
t_i     time step i (in our prototype t_0 = 0, t_47 = 4.7 seconds)
V       value of the feature this site returns
C(t)    cost of time at time t
ct      current time
qc      cost to access the site

The expected query utility function evaluates the utility of querying the site for the next n time steps starting at the current time. The maximum expected query utility is the expected query utility for the best value of n:

mequ = \max_{n}\, equ(n) \qquad (2)

If this value is negative, then it is not worthwhile to query the site. This can often happen for low-value sites once we start to approach the deadline.

The expected continuing query utility is slightly more complicated to determine because the probability of the site returning a value is not equal to the entries in the result probability histogram. The probability of the site returning in a given time step is influenced by the fact that the site has not returned a value so far. Instead of using Pr(t_i), the VDIG system uses Pr(t_i | t_0, ..., t_j) (where i > j) to more accurately predict the probability of the site returning during a time slice. We define CPr(t_i, t_j) as the continuing probability: the probability of the site returning a result at time slice t_i given that the site has not yet returned a result between time t_0 and t_j of its execution, where t_j is the current amount of time that the process has been waiting for the current site to return:

CPr(t_i, t_j) = Pr(t_i \mid t_0, \ldots, t_j) = \frac{Pr(t_0, \ldots, t_j \mid t_i)\, Pr(t_i)}{(1 - Pr(t_0)) \cdots (1 - Pr(t_j))} = \frac{Pr(t_i)}{\prod_{k=0}^{j} (1 - Pr(t_k))} \qquad (3)

Since a site that returns a result at time t_i with i > j certainly did not return one during time slices t_0 through t_j, the conditional in the numerator is 1, and the continuing probability reduces to the histogram entry divided by the probability that no result has arrived so far. Substituting equation 3 for Pr(t) in equation 1 defines the expected continuing query utility function:

ecqu(n) = \sum_{i=0}^{n} CPr(t_i, t_j)\,\bigl(V - C(t_i - t_j + ct)\bigr) - \prod_{i=j+1}^{n} \bigl(1 - CPr(t_i, t_j)\bigr)\, C(t_n - t_j + ct) \qquad (4)

where t_j is the amount of time the site has been queried already. To estimate the utility of continuing to query the site, the VDIG system determines which value of n maximizes the expected utility:

mecqu = \max_{n}\, ecqu(n) \qquad (5)

Using equation 2 and equation 5, the VDIG system can evaluate the expected utility of querying every site for the best amount of time, given the current time, the site's result probability histogram, the site access cost, the feature values, and the amount of time the VDIG system has been querying the current site. The VDIG system pursues the site with the best expected utility at each time step: it continues querying the site that is currently being queried, begins querying a new site that is more promising, or, if none of the maximum expected utilities are positive, stops computation and returns a result.
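A compact way to see how equations 1–5 drive the planner is the sketch below, which computes the maximum expected query utility for a fresh site and the maximum expected continuing query utility for the site already being queried. It follows the formulas as reconstructed above, measures time in histogram steps (0.1 s units), and takes any cost-of-time callable such as the one sketched in section 2.2; it is an illustration, not the LispWorks prototype.

```python
# hist[i] is the histogram probability of a result in slice i, value is the
# feature's information value, qc the access fee, ct the current time in
# slices, and cost_of_time a single-argument callable C(t) in slice units.

def equ(hist, value, qc, cost_of_time, ct, n):
    """Expected query utility (eq. 1) of querying a fresh site for slices 0..n."""
    u, p_none = -qc, 1.0
    for i in range(n + 1):
        u += hist[i] * (value - cost_of_time(ct + i))
        p_none *= 1.0 - hist[i]
    return u - p_none * cost_of_time(ct + n)

def mequ(hist, value, qc, cost_of_time, ct):
    """Maximum expected query utility over all querying durations (eq. 2)."""
    return max(equ(hist, value, qc, cost_of_time, ct, n) for n in range(len(hist)))

def cpr(hist, i, j):
    """Continuing probability (eq. 3): chance of a result in slice i given that
    none arrived through slice j; zero for slices that have already passed."""
    if i <= j:
        return 0.0
    p_no_result_yet = 1.0
    for k in range(j + 1):
        p_no_result_yet *= 1.0 - hist[k]
    return hist[i] / p_no_result_yet if p_no_result_yet > 0 else 0.0

def ecqu(hist, value, cost_of_time, ct, j, n):
    """Expected continuing query utility (eq. 4) of waiting through slice n on
    the site we have already queried for j slices (the access fee is sunk)."""
    u, p_none = 0.0, 1.0
    for i in range(n + 1):
        p = cpr(hist, i, j)
        u += p * (value - cost_of_time(ct + i - j))
        if i > j:
            p_none *= 1.0 - p
    return u - p_none * cost_of_time(ct + n - j)

def mecqu(hist, value, cost_of_time, ct, j):
    """Maximum expected continuing query utility (eq. 5)."""
    return max((ecqu(hist, value, cost_of_time, ct, j, n)
                for n in range(j + 1, len(hist))),
               default=0.0)  # past the time-out there is nothing left to wait for
```

At each 0.1-second step the planner would compare mequ over the unexplored sites with mecqu of the current site and take the best action, stopping once every maximum is negative.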

5 Implementation & preliminary results

The prototype VDIG system was created in LispWorks and contains a user interface for each of the three components (the site library, the cost of time function and the feature value list). Figure 6 shows the site library interface and Figure 7 shows the expected utility graph for querying each site as the system runs.

[Figure 6: The site library of the prototype VDIG system.]

[Figure 7: The expected utility for each site at 0.0 sec (left) and 0.3 sec (right).]

The VDIG system was tested against two simpler query strategies in a simulated environment: a random querying algorithm and a querying algorithm sorted by feature value. In each experiment the system ran 500 times and the results were averaged.

        Minimum time   Deadline time   VDIG Utility   Random selection Utility   Feature sorted Utility
Set A   1.0 sec        2.0 sec         1.931          0.830                      1.283
Set A   2.0 sec        5.0 sec         2.756          1.390                      2.341
Set B   4.0 sec        5.0 sec         2.941          1.858                      2.432
Set B   8.0 sec        10.0 sec        4.076          3.189                      3.735

The utility columns give the average information utility generated by the VDIG, random-selection and feature-sorted strategies. In all of these examples, the VDIG strategy generated better information for the user in the same amount of time. As the time was extended the difference became less noticeable because eventually all three systems had enough time to query most or all of the sites. Both of these test cases had few sites (around 20) to query compared to the number that a fully implemented system would have, so we would expect the results in a larger system with more sites to favor the VDIG approach even more. These results are a good indication that a value-directed approach is useful.

6 Conclusion

In this paper we have described a system for making decisions based on information retrieved from the Internet. Using the Internet as a source of information has opened up a number of potential research areas. The work we have done so far has demonstrated that value-directed information gathering is a good strategy for deciding how to order queries. The cost of maintaining a meta-level database of information about sites is easily outweighed by the increase in information utility. As more information becomes available on the Internet, predictive information about the behavior of sites will become even more useful. Using the information sources database lets the VDIG system determine the expected utility of querying each site and maximize the expected utility of the decision.

There is still a great deal of work to be done in the area of value-directed information gathering. We hope to expand the system to take into account the time requirements for processing a document retrieved from a site as well as the probability of a site returning the document at any given moment (currently, we model only the latter). Using the processing time for the documents, as well as the retrieval time, the system will determine how many sites to access at once without over-burdening the system and possibly missing a document being returned. We would also like to incorporate the notion of sites having the results for multiple features. This step will allow the system to decide between gathering information from a large number of specific sites or from fewer, more general sites.

Further in the future, we would like to have the system address the quality of a site's result and combine multiple answers for the same feature. There are several different approaches to information fusion we would like to explore, from techniques used with real-world sensors to traditional probability theory. The system also does not yet incorporate information about the current traffic at a site. In the future, sites may return information about the current load on their resources, allowing a VDIG system to more accurately predict the response histogram. Finally, we would like to combine the VDIG system with an influence diagram package and a natural language system so that we can build a web page where users can interact with the system and watch it produce decisions in real time.

Acknowledgments

Support for this work was provided in part by the National Science Foundation under grants IRI-9624992 and IRI-9634938, and in part by Rome Laboratory under grant F30602-95-10012.

References

[1] M. Boddy and T. L. Dean. Solving time-dependent planning problems. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 979–984, Detroit, Michigan, 1989.
[2] J. Callan, W. B. Croft, and S. Harding. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, pages 78–83, 1992.
[3] K. S. Decker, V. R. Lesser, M. V. Nagendra Prasad, and T. Wagner. An architecture for multi-agent cooperative information gathering. In Proceedings of the CIKM'95 Intelligent Information Agent Workshop, Baltimore, Maryland, 1995.
[4] O. Etzioni and D. Weld. A softbot-based interface to the Internet. Communications of the ACM, July 1994.
[5] J. Grass and S. Zilberstein. Programming with anytime algorithms. In IJCAI-95 Workshop on Anytime Algorithms and Deliberation Scheduling, pages 22–27, Montreal, Canada, 1995. Available on-line at http://anytime.cs.umass.edu/~jgrass.
[6] E. J. Horvitz. Reasoning about beliefs and actions under computational resource constraints. In Proceedings of the 1987 Workshop on Uncertainty in Artificial Intelligence, Seattle, Washington, 1987.
[7] R. A. Howard. Information value theory. IEEE Transactions on Systems Science and Cybernetics, 2(1):22–26, 1966.
[8] R. A. Howard and J. E. Matheson. Influence diagrams. In Principles and Applications of Decision Analysis, volume 2, 1984.
[9] S. B. Huffman. Learning information extraction patterns from examples. In IJCAI-95 Workshop on New Approaches to Learning for Natural Language Processing, August 1995.
[10] T. Oates, M. V. Nagendra Prasad, V. Lesser, and K. S. Decker. A distributed problem solving approach to cooperative information gathering. In AAAI Spring Symposium, Stanford, CA, March 1995.
[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Los Altos, California, 1988.
[12] E. Riloff and W. Lehnert. Automated dictionary construction for information extraction from text. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, pages 93–99, 1993.
[13] R. D. Shachter. Evaluating influence diagrams. Operations Research, 34(6):871–882, 1986.
[14] D. Steier, S. B. Huffman, and W. C. Hamscher. Meta-information for knowledge navigation and retrieval: What's in there. In Working Notes of the 1995 AAAI Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, October 1995.
[15] S. Zilberstein. Optimizing decision quality with contract algorithms. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1576–1582, Montreal, Canada, 1995.
