Evaluation Methods for Learning about Users

I. Zukerman, A. E. Nicholson and D. W. Albrecht
School of Computer Science and Software Engineering, Monash University
Clayton, VICTORIA 3168, AUSTRALIA
{ingrid,annn,dwa}[email protected]

Abstract

This paper describes the evaluation methods applied to assess the performance of the user models developed in the framework of a three-year plan recognition project. The user models were learned from observations of the behaviour of large numbers of users, and were used to predict users' immediate activities and eventual goals in two related domains: a Multi-User Dungeon (MUD) adventure game and the World Wide Web (WWW). The evaluation methods were influenced by the features of the domains and the applied modeling techniques.

1 Introduction

Systems that learn about users typically perform the following tasks: (1) collect data, (2) consider the features of the domain to identify models that are suitable for representing the data, (3) use the data to learn the parameters (and structure) of the models, and (4) evaluate the learned models. In the last three years, we have been working on a project which involves learning user models from observations of the behaviour of large numbers of users. The objective of the learned models was to predict the future behaviour of users. The domains that we chose for our investigation were a Multi-User Dungeon (MUD) adventure game and the World Wide Web (WWW) (Section 2). The uncertainty inherent in these domains suggested the use of probabilistic models, while the differences between these domains warranted the use of different types of models within this class. During the course of our project, we evaluated our models separately for each domain using several evaluation methods (Section 4). In this paper, we take advantage of the experience gained from this project to consider jointly the different evaluation methods used throughout the project, and analyze them in relation to the features of the domains and the applied modeling techniques. We then compare our evaluation techniques with those used by other researchers who learn predictive user models (Section 5), and discuss the contribution of our work (Section 6).

2 The Domains

2.1 Keyhole Plan Recognition in the MUD

Overview. In [Albrecht et al., 1998], we presented an approach to keyhole plan recognition which uses a dynamic belief (Bayesian) network to represent features of the domain that are needed to identify users' plans and goals. The application domain was a MUD adventure game. We proposed several network structures which represent the relations in the domain to varying extents, and compared their performance in predicting a user's current goal, next action and next location. The conditional probability distributions for each network were learned during a training phase, which dynamically builds these probabilities from observations of users' behaviour. We then applied simple abstraction and learning techniques in order to speed up the performance of the most promising dynamic belief networks without a significant change in the accuracy of goal predictions. Our experimental results in the application domain showed a high degree of predictive accuracy.

Domain description. Our results were obtained for the "Shattered Worlds" MUD, which is a text-based virtual reality game where players compete for limited resources in an attempt to achieve various goals. The MUD has over 4,700 locations and 20 different quests (goals); more than 7,200 actions were observed. Quests vary in complexity from simple quests such as the "Teddy-bear rescue", which involves locating and retrieving a lost teddy bear, to more complex quests which involve achieving a number of sub-goals, e.g., obtaining potions and solving puzzles. Players usually know which quest or quests they wish to achieve, but they don't always know which actions are required to complete a quest. In addition, they often engage in activities that are not related to the completion of a specific quest, such as chatting with other players or fighting with MUD agents. As a result, players typically perform between 25 and 500 actions until they complete a quest, even though only a fraction of these actions may actually be required to achieve this quest. The plan recognition problem is further exacerbated by the presence of spelling mistakes, typographical errors, snippets of conversations between players, newly defined commands and abbreviations of commands.

Data collection. The MUD software collected the actions performed by each player and the quest instance completed by each player. Each data point takes the form

{session-number, time, character-name, location, action}.

For example, the record {125, 773335264, spillage, players/paladin/room/trading post, buy} indicates that the character spillage was buying something at the trading post at time 773335264 in session 125 (only the first word of each command, e.g., buy, was considered during training and testing). The log of a player's actions is often incomplete, since the MUD software did not record an agent's movements on the horizontal plane. The data points for each player were grouped into runs, where a run begins either after a player enters the MUD or completes the previous quest, and ends when a new quest is achieved. Our corpus contains 3,017 quest-achieving runs. Since one player could have more than one MUD character or several players could have the same character at different times, we couldn't find a one-to-one correspondence between people and characters. Hence, we assumed we were modeling 3,017 different users, each performing a different run.
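To make the record format concrete, the following sketch shows one way such log records could be parsed and grouped into quest-achieving runs. It is illustrative only: the field layout follows the record format above (assuming single-token fields), but the helper names and the quest-completion predicate are hypothetical, since the paper does not describe the original tools.

from collections import namedtuple

# One MUD observation, following the record format described above:
# {session-number, time, character-name, location, action}.
Record = namedtuple("Record", ["session", "time", "character", "location", "action"])

def parse_record(line):
    # Keep only the first word of the command as the action, as in
    # the training/testing regime described above.
    session, time, character, location, *command = line.split()
    return Record(int(session), int(time), character, location, command[0])

def group_into_runs(records, quest_completed):
    # A run ends when a quest is achieved; quest_completed is a
    # hypothetical predicate standing in for the MUD's quest log.
    runs, current = [], []
    for rec in records:
        current.append(rec)
        if quest_completed(rec):
            runs.append(current)
            current = []
    return runs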

2.2 A Decision-Theoretic Approach to Pre-sending Documents on the WWW

Overview. In [Albrecht et al., 1999], we proposed a decision-theoretic agent which pre-sends WWW documents to a user based on predictions of documents the user is likely to request. In [Zukerman et al., 1999], we compared several predictive models, the best of which is used by the decision-theoretic agent. The agent selects for pre-sending the document that yields the highest expected positive immediate benefit for the user. The calculation of this benefit takes into account the increased cost of transmitting documents that are not requested versus the reduction in waiting time for documents that are requested.

Domain description. Our results were obtained for a server on an internet WWW site which handles users' requests for documents, either from the local site or from other internet sites. There is a cost to the user associated with receiving a requested document: (1) the cost of transmitting the page from the server where the document is located to the user's server; and (2) the cost of the delay while waiting to receive the requested document. The transmission cost typically depends on the number of bytes being transmitted (hence it is a function of the size of the document) and on the cost per byte, which may vary with the network load. The cost of waiting depends on the cost of a timed internet connection, and is also influenced by less quantifiable factors, such as the user's dissatisfaction at having to wait, and the loss of productivity while waiting for work-related information. As stated above, our decision-theoretic approach pre-sends the document with the highest expected positive immediate benefit, that is, it considers only documents a user might request next, rather than further in the future. However, users have caches that store documents for extended time periods, hence the user may still benefit from a pre-sending action even if s/he requests a pre-sent document later on. This has implications for the evaluation of a pre-sending agent.

Data collection. Our results were obtained using training data from a 50-day time window of data logged by our web server. The collected data points were divided into sessions. Each session contains the temporal sequence of requests from a single client, where a request takes the form {referer, requestedDoc, time, size}. The referer is the current internet location (http address) of the user. This location may be a local (previously requested) web page on the server site, an external web page on another internet site, or '-' (empty) when the information has not been provided. The requestedDoc is the http address of the document being requested by the user. The time is a time stamp (in seconds) indicating when the request was received. The size is the number of bytes in the requested document. Our data consisted of 1,095,730 document requests, where 59,486 clients at 21,692 referer locations requested 17,332 different documents (one session per client); 14,023 of the referers were requested documents, and there were 103,972 different referer/document combinations. As for the MUD, we couldn't determine whether different users used the same client server or one user used several client servers. Hence, we assumed that each client corresponds to a single user.

2.3 Comparison of the MUD and WWW Domains

The similarities between the two domains are summarized as follows.

• The state space is very large, the observed data is sparse, and there are large numbers of candidate predictions.

• It is extremely difficult to obtain a perspicuous representation of the domain, and the domain is constantly changing. For example, in the MUD wizards can independently change the MUD (e.g., add new features, actions, locations and non-playing characters), while in the WWW there may be links to pages from external locations, which we cannot model, and the existence, location and size of documents are all subject to continual change, as are the links between documents.

• The data are naturally divided into runs or sessions which vary in length, that is, number of actions. In the MUD, a run ends when a quest is completed; in the WWW, a session ends when the user stops making requests.

• We predict the next event: next action/location in the MUD, and next request in the WWW.

• Observations are missing: movements on the horizontal plane are not logged in the MUD, and only limited state information is logged. In the WWW, back-links are not directly recorded, and visits to some external sites and requests for documents already in the client's cache are not observed.

• There are many ways for a user to reach the same outcome. In the MUD, there may be more than one way to achieve a quest, while in the WWW, because links between documents form a graph, there may be several candidate paths from a document to a desired location.

• Users may combine different behaviours in the same run or session. MUD players may socialize while trying to achieve a quest, while WWW users may alternate between browsing and being more focused in their sequence of document requests.

There are also significant differences between these domains, which affect the types of evaluation methods that are applicable.

• The MUD has a limited number of specific goals (20 quests), whereas there are no specific goals in the WWW – some users may be browsing, others may be seeking specific information, but there is no indication of goal achievement even when this information is found. This means that we can predict a final event in the MUD (i.e., the quest), but there is nothing equivalent in the WWW.

• The WWW domain involves modeling a user's cache; there is no counterpart in the MUD.

• In the MUD, the outcome of a user's actions is uncertain, i.e., the performance of an action is not a sufficient condition for the achievement of the action's intended effect (e.g., due to the presence of other agents who affect the state of the system). In the WWW, although there is some uncertainty about whether a user receives a document, in general we can safely assume that a sent document is successfully transmitted, i.e., it arrives in the user's cache.


3 Models and Techniques

3.1 MUD

We developed and investigated several Dynamic Belief Network (DBN) models involving nodes for action, location and quest (Figure 1). The different network structures reflect different independence assumptions. For example, Figure 1(a) shows the most complex of these models (called mainModel): this model stipulates that the location L_i at time step i depends on the current quest, Q', and the previous location at time step i-1, and that the action A_i depends on the previous action, the current quest and the current location. The conditional probability distributions in the DBNs are obtained from the data by using frequency counts; many combinations of the state variables are unobserved, so the conditional probability tables are very sparse. Once a DBN is constructed, new data from a user is incorporated into the network as evidence, and belief updating is performed. This means that the posterior probability distribution for that user's current quest, next action and next location is calculated given the most recent evidence.

Figure 1: Dynamic Belief Networks for the MUD: (a) mainModel; (b) indepModel; (c) actionModel; (d) locationModel.
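As a minimal sketch of the parameter-learning step described above (frequency counts over observed state combinations), the following code estimates one conditional probability table and queries it. The class and the toy observations are illustrative assumptions, and full DBN belief updating over the quest is not shown.

from collections import Counter, defaultdict

class ConditionalTable:
    # Conditional probability table estimated by frequency counts.
    def __init__(self):
        self.counts = defaultdict(Counter)  # parent tuple -> child counts

    def observe(self, parents, child):
        self.counts[parents][child] += 1

    def distribution(self, parents):
        # Distribution over the child variable given its parents; many
        # parent combinations are unobserved, so the table is sparse.
        children = self.counts[parents]
        total = sum(children.values())
        return {c: k / total for c, k in children.items()} if total else {}

# In mainModel, action A_i depends on A_{i-1}, the current quest Q'
# and the current location L_i (Figure 1(a)); toy observations follow.
action_cpt = ConditionalTable()
action_cpt.observe(("move", "TeddyBearRescue", "nursery"), "get")
action_cpt.observe(("get", "TeddyBearRescue", "nursery"), "move")
print(action_cpt.distribution(("move", "TeddyBearRescue", "nursery")))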

3.2 WWW

In preliminary work [Nicholson et al., 1998], we used a simple Time Markov prediction model and evaluated the pre-sending system only in terms of its immediate benefit to the user. In [Zukerman et al., 1999], we compared the predictive performance of the following Markov models: Time, Space, Second-order Time and Linked. Figure 2 illustrates these models when they are derived from a small sequence of requests (the requests in boldface correspond to back-links which are inferred by our software).



Figure 2: Sample scenario and Markov models for the WWW: (a) site structure; (b) record of requests; (c) Time Markov model; (d) Second-order Time Markov model; (e) Space Markov model; (f) Linked Markov model.

We then empirically derived hybrid predictive models which combine the features of several Markov models. The best of these hybrid models, called maxHybrid, consults all four Markov models, and then makes its own predictions using the model which made a prediction with the highest probability. In [Albrecht et al., 1999], this model was used by agents which pre-send documents a user is likely to request. We then evaluated the immediate and the eventual benefit to the user as a result of pre-sending a document under different operating conditions. We also compared our decision-theoretic pre-sending policy to a naive policy which pre-sends the document that is most likely to be requested [Bestavros, 1996], and identified which policy provides a clear overall benefit to the user under different operating conditions.
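A hedged sketch of the simplest of these predictors, a first-order Time Markov model trained by frequency counts, together with a maxHybrid-style selection rule that adopts the most confident model's prediction. The state encoding is simplified (each model here conditions only on the current request), so this is not the original implementation of the Space, Second-order or Linked models.

from collections import Counter, defaultdict

def train_time_markov(sessions):
    # P(next request | current request), estimated by frequency counts.
    transitions = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            transitions[current][nxt] += 1
    return transitions

def predict(transitions, current):
    counts = transitions[current]
    total = sum(counts.values())
    return {doc: k / total for doc, k in counts.items()} if total else {}

def max_hybrid(models, current):
    # Consult all models and answer with the prediction of whichever
    # model assigned the highest probability to its top candidate.
    best = {}
    for model in models:
        dist = predict(model, current)
        if dist and max(dist.values()) > max(best.values(), default=0.0):
            best = dist
    return best

sessions = [["D1", "D2", "D1", "D3", "D8"], ["D1", "D3", "D8", "D1", "D4", "D5"]]
time_model = train_time_markov(sessions)
print(predict(time_model, "D1"))  # {'D2': 0.25, 'D3': 0.5, 'D4': 0.25}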

4 Evaluation

In this section, we consider the various techniques used to evaluate the models developed for the MUD and the WWW, and discuss how the features of these domains and the applied modeling techniques influence the evaluation methods.

4.1 MUD

In the MUD, our main aim is to recognize a user's current goal, namely the quest being attempted, as early as possible. However, we are also interested in predicting a user's next action and next location (which also affect the belief in which quest is being attempted and vice versa). So our method of evaluation needs to determine the model which best predicts the user's current quest, next action and next location, and the stage of quest completion where these predictions are made. In order to compare across runs where the number of recorded actions varies, we used the percentage of actions taken to complete a quest. That is, we applied our measures of performance at 0.1%, 0.2%, 0.3%, ..., 1%, 1.1%, ..., 100% of quest completion. These percentages guarantee that for all quests there is at least one data point corresponding to each action; the quests with only a few actions have several data points for a single action.

The evaluation of our prediction models depends on the method used to make a prediction. Two ways to make a prediction for a variable are: (1) selecting a value from a distribution, and (2) choosing the value with the highest probability (taking into account that sometimes this value will be randomly selected among several equiprobable values which have the highest probability). To evaluate the performance of the models with respect to these two prediction methods we used two measures: average prediction and average score.

• Average prediction is the average across all test runs of the predicted probability of a domain variable, i.e., the actual quest, next action or next location, at each point during the performance of a quest:

\[
\text{average prediction} = \frac{1}{n} \sum_{i=1}^{n} \Pr(\text{actual value of variable in the $i$-th test run}),
\]

where variable may be either current quest, next action or next location, and n is the number of test runs performed.

• Average score consists of using the following function to compute the score of a prediction, and then computing the average of the scores at each point during the performance of a quest:

\[
\text{score} =
\begin{cases}
\dfrac{1}{|\text{top predicted values}|} & \text{if } \Pr(\text{actual value}) = \Pr(\text{top}), \\
0 & \text{otherwise},
\end{cases}
\]

\[
\text{average score} = \frac{1}{n} \sum_{i=1}^{n} \text{score in the $i$-th test run}.
\]

The average score measures the percentage of times the probability of the actual value of a domain variable is the highest, while taking into account the possibility that other values for this variable may have been assigned an equally high probability.

The two measures were used to evaluate each of the four models, mainModel, indepModel, locationModel and actionModel, on 20% of the data with cross-validation using 20 different splits (training was performed on 80% of the data). Figure 3(a-c) shows the average prediction and Figure 4(a-c) the average score for action, location and quest predictions for the four models. Note that locationModel cannot make action predictions, and actionModel cannot make location predictions.

Figure 3: Performance comparison of models. Average prediction for (a) actions, (b) locations, and (c) quests.

Figure 4: Performance comparison of models. Average score for (a) actions, (b) locations, and (c) quests.

As seen from Figures 3 and 4, both measures of performance produce generally consistent assessments of the various models. The assessment produced by the average score accentuates some differences in performance compared to the assessment produced by the average prediction, while it blurs certain other differences. For example, for the first 50% of a run, the average action predictions of mainModel are quite close to those of locationModel and indepModel (Figure 3(a)), while the average action scores obtained by mainModel are much lower than the scores obtained by the other two models (Figure 4(a)). This is because with an average prediction probability of less than 0.2 for the actual action, such as that yielded by mainModel, it is quite possible that there are other actions predicted with the same probability or a higher probability. In the first case, the score assigned to the actual action is reduced (it is divided by the number of equiprobable actions), and in the second case, it is 0. The average score tends to blur the differences between actionModel and indepModel for action predictions, and the differences between locationModel, indepModel and mainModel for location predictions. This is because different models may have assigned different probabilities to the actual value of a particular variable. However, if all the models assign the highest probability to the same n values for this variable (including the actual value), then all the models will obtain the same score. Likewise, for quests the average scores of locationModel are quite close to those of mainModel and indepModel, while the average quest predictions obtained by locationModel are much lower than the average predictions obtained by the other two models. This is because the probabilities assigned to the current quest by locationModel are lower than those assigned by these two models, but on average the current quest is still assigned the highest probability among its competitors.
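The two measures translate directly into code. The sketch below computes the score and both averages from a list of predicted distributions and the corresponding actual values; it is a straightforward reading of the definitions above, at a fixed point of quest completion.

def score(dist, actual):
    # 1/|top predicted values| if the actual value is among the
    # equiprobable top-ranked values, and 0 otherwise.
    if not dist or actual not in dist:
        return 0.0
    top = max(dist.values())
    if dist[actual] < top:
        return 0.0
    return 1.0 / sum(1 for p in dist.values() if p == top)

def average_prediction(dists, actuals):
    # Average probability assigned to the actual value over test runs.
    return sum(d.get(a, 0.0) for d, a in zip(dists, actuals)) / len(dists)

def average_score(dists, actuals):
    return sum(score(d, a) for d, a in zip(dists, actuals)) / len(dists)

# Two equiprobable top actions share the score:
print(score({"get": 0.4, "move": 0.4, "say": 0.2}, "get"))  # 0.5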

4.2 WWW

A WWW agent which pre-sends documents on the WWW requires (1) an accurate prediction model, which determines documents a user is likely to request, and (2) a pre-sending policy, which determines which of these documents to pre-send, if any. The selection of a prediction model requires a comparative evaluation between candidate models in order to select the best model, and a comparative evaluation against a base-line, in order to determine the "absolute" goodness of our models. The selection of a pre-sending policy requires a comparative evaluation of the benefit of candidate policies under different operating conditions. All the evaluations described below were performed on 20% of the data, after training on 80%.

Prediction models. Unlike the MUD, we are never told if users reach their goal, and users may often not have a specific goal (e.g., browsers). Therefore, we cannot evaluate the performance of the prediction models relative to goal completion. Instead, we propose two complementary evaluation methods: (1) lift curves, and (2) recall/precision. We used the first method both to perform a comparative evaluation of different models and to compare them against a base-line. The second method was used to evaluate our best model under different operating conditions, assuming a particular pre-sending policy.

Lift curves. In these curves, the probability with which a model predicts the next request made by the user is represented on the x-axis. The y-axis represents the average percentage of predictions whose probability is greater than or equal to the probability shown on the x-axis (Figure 5). For both the MUD and the WWW, at each step during testing, our models provide a prediction in the form of a probability distribution over the candidate options. Both the average prediction measure and the lift curves consider the probability with which the actual next event is predicted. However, the two evaluation methods differ in how they represent this probability. For the average prediction, we take the average across all the test runs at a certain point toward quest completion. For the lift curves, we consider separately each prediction probability for the actual next event over the whole run, and find the percentage of the predictions (over the whole run) that are greater than or equal to this prediction probability. We then compute the average of these percentages over all the test runs. In general, the model corresponding to the lift curve which dominates the other curves is the one that is performing the best. However, one rarely sees a single curve that dominates all the other curves for all values of x. Instead, one finds regions where one model performs better than the others. This information can be used to develop hybrid models which combine the best features of several models.
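Read as an algorithm, a lift curve can be computed per run and then averaged, as in the sketch below; the threshold grid is an assumption, since the paper does not state how the curves were discretized.

def lift_curve(actual_probs, grid=None):
    # For each threshold x, the percentage of the run's predictions
    # whose probability for the actual next event is >= x.
    grid = grid if grid is not None else [i / 100.0 for i in range(101)]
    n = len(actual_probs)
    return [100.0 * sum(p >= x for p in actual_probs) / n for x in grid]

def average_lift_curve(runs):
    # Average the per-run lift curves over all test runs.
    curves = [lift_curve(run) for run in runs]
    return [sum(col) / len(curves) for col in zip(*curves)]

# Toy example: probabilities assigned to the actual next request
# at each step of two sessions.
curve = average_lift_curve([[0.9, 0.4, 0.7], [0.2, 0.8]])
print(curve[50])  # average percentage of predictions with probability >= 0.5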

The lift curves in Figure 5 compare the hybrid models developed for the WWW to the overall best of the individual prediction models, the Linked model (Figure 2(f)). All the hybrid models perform significantly better than the Linked Markov model for probabilities greater than 0.5. The predictive accuracy of the orderedHybrid model is higher than that of the other models until the probability reaches 0.39, at which point the maxHybrid model starts performing better than the other models. Since the prediction probabilities obtained by our models seemed rather low in absolute terms, we also compared each model with its naive counterpart in order to assess the predictive power of our models relative to a base-line. The naive version of a model replaces the predictions of the model (i.e., a distribution of what documents might be requested next) with a uniform distribution based on the number of children of the current node in the model. All models were shown to perform better than their naive counterpart.

Figure 5: Performance comparison of models. Lift curves of the prediction probabilities obtained with the hybrid models.

Recall/precision. These measures, which are borrowed from Information Retrieval, were used to determine the impact of different operating scenarios on the predictive performance of our best prediction model, maxHybrid. For this evaluation, we assumed that a naive pre-sending policy is applied, which pre-sends the document with the highest probability of being requested next. Recall indicates the percentage of requested documents that were previously pre-sent, and precision indicates the percentage of pre-sent documents that were subsequently requested. While this formulation of these measures is specific to our domain (i.e., in terms of pre-sent and requested documents), these measures may be cast in more general terms that should be applicable across a range of user modeling domains. Recall indicates how many of the user's requirements an agent has met, while precision indicates how many of the agent's actions or decisions were in fact required by the user (and hence how many were wasted).

Figure 6(a) depicts the recall of the maxHybrid model for three scenarios: no-memory/next-request, where the client's cache holds a document (in addition to the document being read by the user) only until another document is sent; 8-hours/cache, where the cache holds documents for 8 hours; and ∞/cache, where the cache holds documents indefinitely. The x-axis shows the number of documents requested by a client during a session (in order to smooth the graph, each point on the x-axis represents groups of clients who have requested a similar number of documents).¹ The y-axis shows the average percentage of requested documents that were pre-sent within the event memory span of each scenario (0, 8 hours or ∞).

Figure 6(b) depicts the precision performance of the maxHybrid model. Both graphs show that predictive performance improves when the client has a larger cache (8 hours or ∞). Figure 6(b) also shows that the precision for the ∞/cache scenario rises above the precision for the 8-hours/cache scenario in sessions where more than 71 documents were pre-sent. This happens because 60% of these sessions take longer than 8 hours, and for this 60% the decision-theoretic policy pre-sends more documents for the 8-hours/cache scenario than for the ∞/cache scenario (where the cache holds every previously sent document). Despite this, for the purpose of evaluating pre-sending policies, the performance of the maxHybrid model for the ∞/cache scenario was considered sufficiently similar to its performance for the 8-hours/cache scenario to justify dropping the ∞/cache scenario from consideration.

Figure 6: Performance of the maxHybrid model: (a) recall; (b) precision.

¹ In order to view the data more clearly, we exclude the final data point, e.g., x=6143 in Figure 6(a); this still leaves 99% of the data.
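In code, the two measures reduce to the overlap between pre-sent and requested documents, as in the hedged sketch below; real sessions would also respect the order of events and the cache-expiry window (0, 8 hours or ∞), which is omitted here.

def recall_precision(requested, pre_sent):
    # Recall: percentage of requested documents that were pre-sent.
    # Precision: percentage of pre-sent documents that were requested.
    requested, pre_sent = set(requested), set(pre_sent)
    hits = requested & pre_sent
    recall = 100.0 * len(hits) / len(requested) if requested else 0.0
    precision = 100.0 * len(hits) / len(pre_sent) if pre_sent else 0.0
    return recall, precision

print(recall_precision(["D1", "D3", "D8"], ["D3", "D8", "D4"]))  # (66.7, 66.7)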

Pre-sending policy. The operating conditions considered for the evaluation of candidate pre-sending policies depend on domain parameters such as the cost of sending a document over the internet and waiting for a document, and the transmission rate. We compared a decision-theoretic policy and a naive policy (which pre-sends the document with the highest probability) under different configurations of these domain parameters.

Figures 7(a) and 7(b) show the average total benefit (y-axis) achieved by the pre-sending policies in terms of the number of requests in a session (x-axis) for two scenarios: no-memory/next-request and 8-hours/cache (the data points on the x-axis are grouped as described for the results in Figure 6). Each line in Figure 7 is labelled with a waiting cost (measured as cost per second) and a tag that indicates the pre-sending policy (d for decision-theoretic and n for naive). By varying the waiting cost and fixing the other domain parameters (transmission rate and transmission cost), we can determine the domain parameter configurations under which it is advantageous to pre-send documents. Figure 7 also shows that for the same configurations, the decision-theoretic pre-sending policy generally outperforms the naive policy. However, when the cost of waiting is large (e.g., ≥ 20000) the eventual benefit of the naive policy is higher than that of the decision-theoretic policy. This is because the naive policy, which pre-sends a document after every request, sometimes achieves a large reduction in cost, which offsets its losses from its unnecessary transmissions.

Figure 7: Effect of the pre-sending policy and cost parameters on the average total benefit: (a) immediate benefit, no-memory/next-request; (b) eventual benefit, 8-hours/cache.
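The decision-theoretic policy can be sketched as follows. The paper does not spell out the exact benefit expression, so the form below (expected waiting cost saved minus transmission cost) and all parameter values are assumptions for illustration.

def expected_benefit(p_request, size_bytes, cost_per_byte,
                     waiting_cost_per_sec, transmission_rate):
    # Assumed form: with probability p_request the user saves the
    # waiting time; the transmission cost is incurred either way.
    waiting_saved = waiting_cost_per_sec * (size_bytes / transmission_rate)
    return p_request * waiting_saved - cost_per_byte * size_bytes

def decision_theoretic_policy(predictions, sizes, **costs):
    # Pre-send the document with the highest expected positive benefit,
    # or nothing if no benefit is positive; the naive policy would
    # always pre-send max(predictions, key=predictions.get).
    def benefit(doc):
        return expected_benefit(predictions[doc], sizes[doc], **costs)
    best = max(predictions, key=benefit)
    return best if benefit(best) > 0 else None

print(decision_theoretic_policy(
    {"D2": 0.5, "D3": 0.3}, {"D2": 40000, "D3": 2000},
    cost_per_byte=0.001, waiting_cost_per_sec=400, transmission_rate=10000))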

5 Related Research

Both Davison and Hirsh (1998) and Motoda and Yoshida (1998) learn models which predict the next action of a user in the Unix domain. The prediction model developed by Davison and Hirsh, which assigns a higher weight to recent events than to earlier events, was evaluated using two measures: macroaverage, which measures the average (over all users) of the average predictive performance for each user, and microaverage, which measures the average performance across all the data. The former measure gives equal weight to the predictive performance obtained for each user, while the latter gives greater weight to the performance obtained for users with longer sessions. The predictions obtained by their model were also compared to those obtained by C4.5 [Quinlan, 1993]. In addition, Davison and Hirsh performed top-n evaluations, where they determined whether the command executed next was among the top n predicted commands (1 ≤ n ≤ 5). Such an evaluation was also undertaken in our early MUD research [Albrecht et al., 1997].

The predictive models learned by Motoda and Yoshida included scripts (macro-operators involving more than one Unix command) and the file involved in a command, so that it can be pre-fetched. Their learning approach is based on graph-based induction together with an information gain measure. Their system predicts one or more commands the user is likely to execute next, and considers a prediction to be successful if the user selects a command from those predicted. This evaluation method is similar to the top-n evaluation method described above. For n = 1, the top-n metric resembles our score (Section 4.1). However, unlike our measure, it assumes that there is a single top-predicted event. Thus, if there are two or more equiprobable top-predicted events, and one of them is executed next, their evaluation will (inappropriately) reward the prediction model with a "hit", while our score gives a fractional reward.

Chiu and Webb (1998) consider the prediction of a student's next action in a tutoring domain. Their model, which is feature based, fails to make a prediction when there is an unresolvable conflict between the predictions made by different classification rules. This leads to an evaluation measure composed of two parts: (1) prediction accuracy given that a prediction is actually made, and (2) prediction rate, that is, how often a prediction is made. Chiu and Webb's research then focuses on how to improve the prediction rate while maintaining the prediction accuracy. This two-part measure is adequate in light of the modeling technique used by these researchers. However, it is not sufficient for evaluating the model's predictive performance, since unlike the above evaluation techniques and those described in this paper, it does not give an overall indication of the system's predictive accuracy relative to its opportunity to make a prediction.

Lesh (1997) evaluates goal recognizers using a measure which gives the average percentage of task completion when the following conditions are satisfied: (1) the top-predicted goal is the current goal, and (2) the probability of this prediction reaches some probability threshold. Lesh's measure requires the pre-selection of thresholds, which may vary between different domains. Further, it assumes that once a threshold is reached, the plan recognizer will not change its mind. Finally, Lesh assumes that there will be a single top-predicted goal only, which does not occur in our domains.

Gmytrasiewicz et al. (1998) use the Recursive Modeling Method (RMM) to monitor and process the models an agent may use to interact with other agents. They use Bayesian propagation to update the agent's belief regarding which model is correct, based on the success of the models' predictions of the behaviour of other agents. The agents' decision-making process is modeled by means of payoff matrices (which lead to decision-theoretic behaviour). The evaluation methods used by Gmytrasiewicz et al. assess indirectly the degree of coordination between the agents in a group in terms of their performance in two domains: a pursuit game, where a prey must be captured, and air defense, where incoming missiles must be intercepted. These measures of performance depend on the features of the domain. In the pursuit game, the evaluation was in terms of the time to capture as a function of the initial uncertainty of the agents regarding which was the prey. In the air defense domain, the evaluation measured the average number of targets the defense units attempted to intercept, and the average total expected damage from the incoming missiles after the defense actions were performed. The former is a variant of the recall measure, while the latter corresponds (inversely) to our total eventual benefit measure.
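For contrast with our fractional score, a top-n evaluation in the style of Davison and Hirsh can be sketched as a simple membership test; the tie-breaking behaviour noted above is exactly what this formulation leaves unspecified.

def top_n_hit(dist, actual, n=5):
    # A hit if the executed command is among the n highest-ranked
    # predictions; ties at rank n are broken arbitrarily by the sort,
    # which is the issue our fractional score addresses.
    ranked = sorted(dist, key=dist.get, reverse=True)[:n]
    return actual in ranked

print(top_n_hit({"ls": 0.3, "cd": 0.3, "vi": 0.2}, "cd", n=1))  # outcome depends on tie-breaking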

6 Discussion

Our project involved learning probabilistic user models from the observations of large numbers of users. These models were used to compute a probability distribution predicting a user's next action, location or goal. The evaluation methods used in the project fall into two categories: those which evaluate these prediction models in isolation, and those which evaluate the performance of an agent which uses these models.

Evaluating prediction models in isolation. The techniques which are appropriate for evaluating a probabilistic prediction model depend on the method used to select a prediction from the distribution returned by the model. We have described two methods for selecting a prediction – randomly selecting a prediction according to the distribution, and selecting the highest-probability prediction. These methods led to two measures for comparing predictive accuracy – average prediction and average score, each of which highlights slightly different performance features. Both of these measures are suitable for selecting a model to be used by agents which need a single prediction, as well as for selecting a model to be used by decision-theoretic agents that incorporate the whole distribution into their action choice.

When the domain involves an explicit goal, an item of interest is the performance of a prediction model as goal completion draws nearer. This information was obtained by plotting the above measures against the percentage of goal completion. In the absence of an explicit goal, these measures may be plotted against some other quantity, e.g., the percentage of actions performed. Another option suggested in our work consists of representing the performance of a model by means of lift curves, which plot how often a prediction or score is above a particular value.

Evaluating the performance of an agent. Once the prediction models are embedded in an agent, other evaluation methods are appropriate. The recall/precision measures require an action based on predictive performance, e.g., pre-sending a document. These measures may be used to compare prediction models and action policies (e.g., decision-theoretic versus selecting the highest-probability document), as well as to assess the performance of a single model or policy under different operating conditions (e.g., different cache sizes). Our final measure, the cumulative benefit to the user as a result of an agent's actions, supports a quantitative comparison of the gain from using particular action policies (compared to recall and precision, which indicate how often a particular policy outperforms another). This measure requires the quantification of the costs and benefits associated with different courses of action.

Acknowledgments

This research was supported in part by grant A49600323 from the Australian Research Council.

References

[Albrecht et al., 1997] Albrecht, D. W., Zukerman, I., Nicholson, A. E., and Bud, A. (1997). Towards a Bayesian model for keyhole plan recognition in large domains. In UM97 – Proceedings of the Sixth International Conference on User Modeling, pages 365–376, Sardinia, Italy.

[Albrecht et al., 1998] Albrecht, D. W., Zukerman, I., and Nicholson, A. E. (1998). Bayesian models for keyhole plan recognition in an adventure game. User Modeling and User-Adapted Interaction, 8(1-2):5–47.

[Albrecht et al., 1999] Albrecht, D. W., Zukerman, I., and Nicholson, A. E. (1999). Pre-sending documents on the WWW: A comparative study. In IJCAI99 – Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden.

[Bestavros, 1996] Bestavros, A. (1996). Speculative data dissemination and service to reduce server load, network traffic and service time in distributed information systems. In Proceedings of the 1996 International Conference on Data Engineering.

[Chiu and Webb, 1998] Chiu, B. C. and Webb, G. (1998). Using decision trees for agent modeling: improving prediction performance. User Modeling and User-Adapted Interaction, 8(1-2):131–152.

[Davison and Hirsh, 1998] Davison, B. and Hirsh, H. (1998). Predicting sequences of user actions. In Notes of the AAAI/ICML 1998 Workshop on Predicting the Future: AI Approaches to Time-Series Analysis, Madison, Wisconsin.

[Gmytrasiewicz et al., 1998] Gmytrasiewicz, P. J., Noh, S., and Kellog, T. (1998). Bayesian update of recursive agent models. User Modeling and User-Adapted Interaction, 8(1-2):49–69.

[Lesh, 1997] Lesh, N. (1997). Adaptive goal recognition. In IJCAI97 – Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, pages 1208–1214, Nagoya, Japan.

[Motoda and Yoshida, 1998] Motoda, H. and Yoshida, K. (1998). Machine learning techniques to make computers easier to use. Artificial Intelligence, 103:295–321.

[Nicholson et al., 1998] Nicholson, A. E., Zukerman, I., and Albrecht, D. W. (1998). A decision-theoretic approach for pre-sending information on the WWW. In PRICAI'98 – Proceedings of the Fifth Pacific Rim International Conference on Artificial Intelligence, pages 575–586, Singapore.

[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California.

[Zukerman et al., 1999] Zukerman, I., Albrecht, D., and Nicholson, A. (1999). Predicting users' requests on the WWW. In UM99 – Proceedings of the Seventh International Conference on User Modeling, Banff, Canada.
