To appear in the 2005 International Conference on Case-Based Reasoning Workshop on Computer Games and Simulation Environments.
Evaluating Case-Based Systems in Virtual Games

Keith Needels (1), Matthew Molineaux (2), David W. Aha (1)

(1) Navy Center for Applied Research in Artificial Intelligence; Naval Research Laboratory (Code 5515); Washington, DC 20375
(2) ITT Industries; AES Division; Alexandria, VA 22303

(1,2) [email protected]
Abstract. TIELT is a software testbed that facilitates the integration and testing of learning-embedded decision systems on user-selected tasks from virtual gaming simulators. A key component of TIELT is its support for user-defined experimental methodologies. This paper examines how some case-based agents have been evaluated in game environments, describes how such experiments can be supported using TIELT, and discusses how future extensions of TIELT will better meet the needs of case-based reasoning researchers.
1 Introduction

When considering environments for developing knowledge-intensive learning systems, video games are a natural choice. Modern games simulate complex worlds with large decision spaces. There are hundreds of game titles on the market, available in many genres, and different games can have very different objectives, ranging from winning a car race to commanding an army to victory. This diversity of tasks makes games an excellent research platform for studying case-based and other learning systems.

The development of advanced AI in video games carries more than academic interest. The video game industry achieved sales of nearly $10 billion in 2004, and early figures show a 23% increase in sales for the first quarter of 2005 (NPD Group, 2005). As game graphics approach a point where little further improvement can be made without requiring consumers to purchase overly expensive hardware, developers have begun to focus on improving the realism of the computer-controlled agents that populate their game worlds (Laird & van Lent, 2001).

This paper focuses on evaluating decision systems on tasks selected from virtual gaming simulators (i.e., game engines). In this article, a decision system is a system that makes decisions based on the world state and (possibly) internal knowledge. Such systems can, for example, select actions for an agent to execute in the game world, provide advice to a human player, or predict an opponent's future actions. Unfortunately, integrating decision systems, whether they are based on case-based reasoning (CBR) or other AI techniques, with game engines has high costs (e.g., in time and money). To complicate matters, comparison studies often involve M decision systems and N game engines, which requires M * N separate integration efforts.
To simplify the process of integrating decision systems with game engines, Aha and Molineaux (2004) created the Testbed for Integrating and Evaluating Learning Techniques (TIELT). TIELT acts as middleware that bridges the gap between decision systems and gaming simulators. Rather than integrating each decision system directly with each game engine, each is integrated once with TIELT, using a set of knowledge bases that specify their communication. This reduces the number of required integrations from M * N (each decision system to each game) to M + N (each decision system and each game to TIELT). Knowledge bases for integrating TIELT with a specific game engine can be shared with other researchers, further reducing the effort required to evaluate their decision systems.

As with any heuristic approach, empirical evaluation is imperative for studying case-based reasoning approaches. While experiment methodologies for some classic problem types (e.g., classification) are now well defined, evaluating the performance of decision systems on tasks defined in complex virtual worlds is not always straightforward; success cannot always be measured simply as classification accuracy or mean error. Thus, TIELT must support the variety of experiment methodologies that CBR researchers require for their empirical investigations.

This paper focuses on TIELT's support for empirical evaluations of decision systems on tasks defined in virtual gaming simulators. Section 2 describes the TIELT system. The following two sections cover the two basic variable types in any experiment and how they are handled in TIELT: Section 3 discusses a system's performance measures (the dependent variables), and Section 4 concerns an experiment's controlled variables (the independent variables). Finally, Section 5 explores ways that TIELT can be improved, and the improvements that are planned.
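Before turning to TIELT's components, the M + N saving described above can be pictured with a minimal sketch of the middleware pattern. The interface and class names below (GameConnector, DecisionSystemConnector, Middleware) are hypothetical illustrations, not TIELT's actual API; the point is only that each game and each decision system is adapted once, to the middleware, so M decision systems and N games need M + N adapters rather than M * N pairwise integrations.

```java
// Hypothetical sketch of the middleware pattern; not TIELT's actual API.
interface GameConnector {             // one adapter per game engine (N total)
    String readStateMessage();        // receive a state update from the game
    void sendAction(String action);   // forward an action to the game
}

interface DecisionSystemConnector {   // one adapter per decision system (M total)
    String decide(String state);      // ask the decision system for an action
}

class Middleware {
    // Any decision system can drive any game through the two adapter types,
    // so adding a new game or system requires only one new adapter.
    void runEpisode(GameConnector game, DecisionSystemConnector ds, int steps) {
        for (int i = 0; i < steps; i++) {
            String state = game.readStateMessage();
            game.sendAction(ds.decide(state));
        }
    }
}
```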
2 An Overview of TIELT

TIELT is a free software system that can be downloaded from http://nrlsat.ittid.com. This website also contains more detailed information about TIELT, including a manual, a tutorial, technical papers, and slide presentations.

2.1 The Knowledge Bases

All the information that TIELT has about games and decision systems is defined by a user in five knowledge bases. TIELT's user interface simplifies the process of creating and editing these knowledge bases.

Much of the time spent integrating a game engine with a decision system is spent modifying the game engine, the decision system, or both so that they can work together. This can be difficult for highly complex systems, and it becomes even more challenging for game engines that are not open source. TIELT attempts to solve this problem by leveraging existing interfaces when available. Rather than defining an interface specification that software systems must use to communicate with TIELT, TIELT employs a flexible interface strategy that allows a user to describe how TIELT must communicate with a game engine and a decision system. Two knowledge bases exist for this purpose.
Table 1: An overview of the TIELT Experiment Methodology.

Game Model Reference: Reference to a higher-level description of the target environment.
Independent Variables: Primitive types varied independently, with bounded ranges over which to evaluate experimental settings.
Agent Descriptions: A set of agents to be compared within the target environment.
Decision Systems: A set of case-based reasoning (or other) systems to be evaluated within the target environment.
Dependent Variables: Measures of performance arising from the interaction of the environment and agents/decision systems.
The Game Interface Model describes how TIELT should communicate with the game engine, and the Decision System Interface Model describes how it should communicate with the decision system.

For the decision system to sense the game world, TIELT needs to store the current game state. The Game Model knowledge base describes the game world. However, it need not store a complete game state or contain a complete and correct model of the game engine; it only needs the information that will be passed between the game and the decision system. The Game Model can also store background knowledge about a game, such as its rules, which a decision system may use to interact with environments for which it has no internal knowledge.

The Agent Description identifies the subtasks that a decision system focuses on in the game engine. It describes an agent that listens for updates from the game world and can take action in the game world. This agent can communicate with decision systems through their corresponding Decision System Interface Models. An Agent Description may communicate with multiple decision systems, which permits TIELT to be used with learning ensembles. However, an agent need not have an associated decision system; it can be designed to give scripted responses to game input (e.g., to act as a "straw man" in an experiment). The agent is aware of the game state as described in the Game Model, and it is notified of Game Model events when they occur.

2.2 Running Experiments in TIELT

TIELT's fifth knowledge base is the Experiment Methodology, whose features are summarized in Table 1. It is used to define experiments, with reference to the other four knowledge bases. The user creates an Experiment Methodology by first selecting a Game Model and a Game Interface Model, and then defining or identifying the independent and dependent variables to be used in the experiment.

An experiment is defined as a set of trials. Each trial, implemented as a user-defined script, identifies a set of dependent and independent variables. Dependent variables may include task-specific measures such as elapsed time, resources expended or lost, and the degree to which the task was completed successfully. Independent variables may refer to characteristics of the initial state (e.g., starting location, amount and location of resources) and capabilities of the player-controlled units. Decision systems can also serve as independent variables, in that different decision systems can be used for successive trials, allowing their performance to be compared on the same task. Dependent variables are selected from among the variables defined in the Game Model, or are introduced by the user, as shown in Figure 1.

Figure 1: Dependent variables defined for the game Reversi.

After each trial run, TIELT records the values of the independent and dependent variables. Currently, these results can be stored in an Excel spreadsheet or a MySQL database; TIELT also allows the user to configure other databases through JDBC or ODBC connectivity. Experiments involving complex games often take hours, and sometimes days, to complete. TIELT has been tested on long-running experiments spanning multiple days, during which the researcher need not be involved in the experimental process.
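To make the trial structure concrete, the following stand-alone sketch shows how a trial loop might vary an independent variable, run a trial, and record the dependent variables after each run. It is not TIELT's script language; the method and variable names (runOneTrial, startingGold) are invented for illustration.

```java
// Hypothetical illustration of an experiment's trial loop; the method and
// variable names are invented and do not correspond to TIELT's script language.
import java.io.FileWriter;
import java.io.IOException;

public class TrialLoop {
    public static void main(String[] args) throws IOException {
        int[] startingGold = {500, 1000, 2000};   // independent variable: initial state
        int trialsPerSetting = 10;
        try (FileWriter results = new FileWriter("results.csv")) {
            results.write("startingGold,trial,elapsedMs,taskCompleted\n");
            for (int gold : startingGold) {
                for (int trial = 0; trial < trialsPerSetting; trial++) {
                    long start = System.currentTimeMillis();
                    boolean completed = runOneTrial(gold, trial);    // one game trial
                    long elapsedMs = System.currentTimeMillis() - start;
                    // dependent variables recorded after each trial run
                    results.write(gold + "," + trial + "," + elapsedMs + "," + completed + "\n");
                }
            }
        }
    }

    // Stand-in for configuring the scenario, running the game, and reading the outcome.
    static boolean runOneTrial(int startingGold, int trial) {
        return (startingGold + trial) % 2 == 0;   // placeholder outcome
    }
}
```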
3 Performance Measures

When evaluating decision systems, dependent measures are used to gauge a system's performance. For some simple tasks, performance can be measured by classification accuracy or mean error. However, more complex gaming tasks often require hundreds of decisions per trial, and there may be no obvious way to assess a decision system's performance at each decision point, for which the "correct" response may be unknown. In this section we discuss performance measures used in game-oriented CBR applications. Table 2 summarizes some of this previous work, highlighting the dependent and independent variables used along with our categorization of each experiment's type.

3.1 Game Playing Agents

Game playing agents can be evaluated in much the same way that human game players are evaluated. For example, game playing agents often have the goal of winning a game, and the evaluation of such systems can focus on how often they win. In the first reported experiment conducted with TIELT, Aha et al. (2005) developed and evaluated the Case-Based Tactician (CAT), a learning system designed to play and win the real-time strategy game Wargus. Their primary dependent variable was the percentage of games won against a static opponent. A secondary measure was the ratio of CAT's score to the sum of CAT's and the opponent's scores.
Table 2: A partial summary of CBR research on games.

Publication | Game | Dependent Variables | Independent Variables | Experiment Type
De Jong & Schultz (1988) | Othello: board game | Point advantage, experience base usage, experience base size | Number of games played | Decision system comparison
Goodman (1993) | Bilestoad: real-time individual | Point advantage | Projective agent vs. non-projective agent | Decision system comparison
Fasciano (1996) | SimCity: city-building game | Successful plan ratio | Plan failure recovery on/off, learning on/off | Decision system improvement
Fagan & Cunningham (2003) | Space Invaders: real-time individual | Prediction accuracy, prediction rate | Plan library size | Decision system comparison
Powell et al. (2004) | Checkers: board game | Win ratio, case usage | Number of games played | Decision system comparison
Ulam et al. (2004) | Freeciv: turn-based strategy | Successful trial percentage | Type of adaptation used | Decision system improvement
Aha et al. (2005) | Wargus: real-time strategy | Win percentage, point advantage | Number of games played | Decision system comparison
This experiment exemplifies how a game and a CBR decision system can be integrated with TIELT, and how a decision system can be evaluated using TIELT's experimentation support.

Many other experiments could use similar performance measures across a variety of game genres. For example, De Jong and Schultz (1988) developed GINA, a case-based Othello player, and measured the difference between the scores of GINA and a simple opponent. Powell et al. (2004) compared two versions of CHEBR, a case-based checkers learning agent, by measuring their win/loss ratios. In his work on projective visualization, Goodman (1993) measured the success of an agent playing the real-time fighting game "The Bilestoad" by the mean difference in scores between two agents. Recording these types of measures is easy in TIELT if the game's outcome and score are provided by the game engine, as this allows TIELT to simply store the information as part of the game state and record it when each trial ends.

However, winning is not the goal of all CBR investigations of game playing agents. Some games have no pre-defined win/loss condition, and some agents do not play games with the intention of winning. The first is true of Fasciano's (1996) MAYOR, a planning agent for playing SimCity. In this city-building game, the player (either a human or MAYOR) sets their own goals. Fasciano measured MAYOR's performance by the ratio of plan executions that MAYOR itself considers successful, which cannot be calculated without recording internal variables from MAYOR. TIELT's Experiment Methodology scripts can set the values of dependent variables to the values of variables in the decision system; this allows an experimenter to track data intrinsic to the decision system.

Some agents are not concerned with winning, but with accomplishing subtasks within the game world. For example, Ulam et al. (2004) focused on the task of defending a city in Freeciv, a turn-based strategy game; their system learned which adaptation strategy to use to avoid planning failures. However, the goal was not to win the game, but to prevent the city from being destroyed or captured by another player. For TIELT, measuring success on this task is functionally identical to measuring a win/loss ratio because the city's status is stored in the Game Model.
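As a sketch of the bookkeeping that the outcome-based measures discussed above require, the fragment below computes a win percentage and a CAT-style score ratio from per-trial results. The TrialResult fields are hypothetical placeholders for whatever the game engine reports at the end of each trial, and the aggregation choices are illustrative only.

```java
// Hypothetical sketch: outcome-based performance measures computed from
// per-trial results reported by a game engine at the end of each trial.
import java.util.List;

class TrialResult {
    final boolean won;
    final double agentScore, opponentScore;
    TrialResult(boolean won, double agentScore, double opponentScore) {
        this.won = won;
        this.agentScore = agentScore;
        this.opponentScore = opponentScore;
    }
}

class OutcomeMeasures {
    // Percentage of trials won (CAT's primary measure against a static opponent).
    static double winPercentage(List<TrialResult> trials) {
        long wins = trials.stream().filter(t -> t.won).count();
        return 100.0 * wins / trials.size();
    }

    // Ratio of the agent's score to the combined scores (CAT's secondary measure),
    // aggregated here over all trials for simplicity.
    static double scoreRatio(List<TrialResult> trials) {
        double agent = trials.stream().mapToDouble(t -> t.agentScore).sum();
        double opponent = trials.stream().mapToDouble(t -> t.opponentScore).sum();
        return agent / (agent + opponent);
    }
}
```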
3.2 Advice Giving Systems
The goal of an advice-giving decision system is to give helpful advice to a game-playing human. A major difference between advice-giving systems and game-playing systems is that an advice-giving system usually does not make enough decisions to play a game on its own. Since these decision systems do not act in the game world, they cannot be measured based on win rates, point advantage, and so on. Figure 2 shows how such a decision system interacts with the game and the game player.

Figure 2: An illustration of an advice-giving system.

Advice-giving systems have not yet received attention in the case-based reasoning literature on game-related tasks. However, studies such as the one performed by Sweetser and Dennis (2003) show that these systems are of interest in the machine learning community, and systems that advise a human in other domains are popular in the CBR literature (e.g., conversational case-based reasoning; Aha et al., 2001). Sweetser and Dennis created a system that dispenses useful advice to players learning to play a real-time strategy game. To evaluate the system, they split the experiment into two phases, with six participants each. In each phase, as advice was displayed on the screen, each participant indicated whether or not the advice was useful. The system performed self-improvement between phases.

To perform this type of experiment with TIELT, the user must create an advice evaluation interface as part of the game engine or decision system. Each time advice is presented to the user, the decision system or game engine must ask the user to evaluate the advice and send a message to TIELT with the results of the user's evaluation. We plan to include this type of evaluation interface as a feature of TIELT.

3.3 Other Measures

Some decision systems may focus on an analysis task (e.g., classification, prediction) as opposed to a synthesis task (e.g., planning). For example, Fagan and Cunningham (2003) developed COMETS, a system that observes a human playing the game Space Invaders and attempts to predict the human's next move. They used prediction accuracy to evaluate their system. To perform this type of experiment in TIELT, the decision system can send its predictions to TIELT, and TIELT can compare the predictions to the actual outcomes. Alternatively, the decision system itself could inform TIELT of the accuracy of its predictions, since it has access to the game state.

Another measure of interest is system efficiency. For example, some games involve solving a puzzle, which can take significant time and resources. Instead of simply measuring correct solution frequency, elapsed time, effort, and other measures of resource usage could be recorded to permit the comparison of multiple systems with similar solution frequencies. Slow responses from a decision system may also contribute to poor performance in non-puzzle domains. TIELT can measure this using a time function to record the elapsed time between calls to a function.
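Both kinds of measure described in this subsection amount to simple bookkeeping around each call to the decision system. The sketch below illustrates one way to do it (the Predictor interface and method names are hypothetical and are not part of TIELT): each prediction call is timed, and predictions are scored once the player's actual move is known.

```java
// Hypothetical sketch: recording prediction accuracy and decision-system
// response time around each prediction call; names are illustrative only.
class PredictionEvaluator {
    interface Predictor { String predictNextMove(String gameState); }

    private int calls = 0, scored = 0, correct = 0;
    private long totalNanos = 0;

    // Ask the predictor for the player's next move and time the call.
    String timedPredict(Predictor p, String gameState) {
        long start = System.nanoTime();
        String predicted = p.predictNextMove(gameState);
        totalNanos += System.nanoTime() - start;
        calls++;
        return predicted;
    }

    // Score a prediction once the player's actual move has been observed.
    void score(String predicted, String actualMove) {
        scored++;
        if (predicted.equals(actualMove)) correct++;
    }

    double accuracy()       { return scored == 0 ? 0.0 : (double) correct / scored; }
    double meanResponseMs() { return calls == 0 ? 0.0 : totalNanos / 1e6 / calls; }
}
```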
4 Common Experiment Types

The performance measures discussed in Section 3 are the dependent variables of agent experiments. Here we examine the independent variables of these experiments. We describe three types of experiments and how they are supported in TIELT.

4.1 Improving Decision Systems

When developing a decision system, a researcher often wants to change aspects of the system, either to optimize those aspects or to examine their effect on performance tasks. Three typical types of experiments in this vein, as identified by Langley and Kibler (1991), are: (1) varying internal agent parameters, (2) deleting certain system components (lesion experiments), and (3) changing the knowledge representation. An example of (1) is the COMETS experiment: Fagan and Cunningham (2003) varied plan library size and observed its effect on prediction accuracy. An example of (2) was performed by Fasciano (1996) on MAYOR; he compared the plan success rate with and without learning enabled, as well as with and without failure recovery enabled. Although important, experiments that compare different knowledge representations in CBR systems are not yet common in games-related research.

This category of experimentation is straightforward to perform using TIELT. The user simply identifies the parameters that will serve as independent variables, and then instructs TIELT to tell the decision system what values to use for each trial.

4.2 Comparing Decision Systems

One of the more common types of experiments in case-based learning tests multiple algorithms on the same task to determine which performs best. Even when not comparing different algorithms, researchers usually test their approach against a "straw man" non-learning algorithm to obtain evidence that learning improves performance. TIELT is being developed with these experiments in mind, in part so that it can support a wide variety of challenge problems. Figure 3 illustrates how such an experiment is run through TIELT.

Comparison studies have often been used to investigate different algorithms for inducing classifiers. Similarly, multiple decision systems can be compared on a specified task from a game world, provided that the same scenario or scenario distribution can be used to test each system. These tasks can vary greatly depending on the game being played. For example, MAYOR has the task of building a city, while CAT's task is to defeat an opponent. This type of experiment requires the researcher to specify a dependent variable that can be justified as a performance measure for the attempted task, and each system is then evaluated independently on this measure. Each of these elements can be specified using TIELT's Experiment Methodology, as the sketch below illustrates.
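The sketch assumes each trial can be reproduced from a scenario seed, so every system faces the same scenario distribution; the Player interface and playScenario method are hypothetical names, not part of TIELT.

```java
// Hypothetical sketch: comparing several decision systems (including a
// non-learning "straw man") on the same sequence of scenarios.
import java.util.LinkedHashMap;
import java.util.Map;

public class SystemComparison {
    interface Player {
        String name();
        double playScenario(long scenarioSeed);   // returns the chosen performance measure
    }

    static Map<String, Double> compare(Player[] players, long[] scenarioSeeds) {
        Map<String, Double> meanScores = new LinkedHashMap<>();
        for (Player p : players) {
            double total = 0.0;
            for (long seed : scenarioSeeds) {     // every player sees the same scenarios
                total += p.playScenario(seed);
            }
            meanScores.put(p.name(), total / scenarioSeeds.length);
        }
        return meanScores;
    }
}
```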
Figure 3: Multiple decision systems are compared on the same game task.

Another way to compare game-playing agents is to have them compete with one another in the game world. For example, Powell et al. (2004) tested variants of CHEBR using this method. Several platforms have been developed for agent competition. For example, RoboCup (2005) is perhaps the best-known competitive gaming platform for agent development; it consists of both hardware and software competitions for comparing the ability of AI agents (or teams of AI agents) to play soccer. Some examples of CBR studies performed with the RoboCup platform include work by Wendler et al. (2001), Gabel and Veloso (2001), and Karol et al. (2003).

Currently, running a multiple-agent competition in TIELT is not as straightforward as running single-agent experiments. A single instance of TIELT will not allow users to run multiple decision systems simultaneously. Instead, they must run a separate instance of TIELT per decision system, with each TIELT instance sharing the same game engine. This may not be a problem for one-on-one competitions, but it could be for experiments involving many separate agents acting through TIELT.

4.3 Knowledge Transfer Experiments

When evaluating a case-based learning system, we want to make sure it does more than merely memorize solutions to specific problems. An intelligent agent should be able to generalize knowledge learned in one scenario for reuse in different situations. In a knowledge transfer experiment, some aspect of the (e.g., gaming) environment is varied. This variation could be as simple as changing agent starting locations, or it could be a significant change, such as changing the game that is being played, as illustrated in Figure 4.

Figure 4: A decision system is tested on multiple games to measure knowledge transfer.

Knowledge transfer is of significant interest to the USA's Defense Advanced Research Projects Agency (DARPA), which is currently promoting research on cognitive systems. Recognizing that a key cognitive ability of humans is the ability to generalize and reuse previous experiences in novel situations, DARPA is initiating a program on transfer learning. TIELT will be used in this program to assist researchers with performing experiments on challenge problems.
Changing aspects of a game is easy to do in TIELT. It is similar to changing aspects of a decision system in a decision system improvement experiment, except that TIELT informs the game engine, rather than the decision system, of the values to use for the independent variables in each trial. Currently, however, changing the game itself cannot be done from within a single experiment in TIELT. To conduct this type of experiment, the user must create a different Experiment Methodology knowledge base for each game. These experiments must then be run separately, and the decision system must retain all of its learned knowledge between runs. We discuss this further in Section 5.
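Until multi-game experiments are supported within a single Experiment Methodology, the decision system itself must carry its learned knowledge across the separate per-game runs. Below is a minimal sketch of one way to do so, assuming the cases are serializable; the class name, file handling, and the use of strings as stand-in cases are all hypothetical.

```java
// Hypothetical sketch: persisting a case base to disk between separate
// per-game experiment runs so that learned knowledge carries over.
import java.io.*;
import java.util.ArrayList;

public class CaseBaseStore {
    @SuppressWarnings("unchecked")
    static ArrayList<String> load(File file) throws IOException, ClassNotFoundException {
        if (!file.exists()) return new ArrayList<>();   // first run: start with an empty case base
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (ArrayList<String>) in.readObject();
        }
    }

    static void save(ArrayList<String> cases, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(cases);   // reloaded at the start of the next game's run
        }
    }
}
```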
5 Future Improvements

While TIELT is largely functional, we are planning several enhancements, and there are always more ways it can be improved to better fit the needs of researchers. This section highlights a few planned improvements related to TIELT's support for experimentation with CBR decision systems.

5.1 Performance Measure Improvements

We plan to integrate an advice evaluation interface with TIELT that can be used with advice-giving agents. This will allow TIELT to directly log a user's assessments of the advice's usefulness, and it will make running experiments with advice-giving agents more convenient, since no changes will need to be made to the decision system or game engine. A general evaluation interface would be useful for more than evaluating advice; in some cases, we may want a human to evaluate how a game-playing agent is performing.

TIELT can also be improved with additional support for measuring system efficiency. Currently, users can record the time taken by a decision system call, but this information must be tracked in the Agent Description, and it must be set up in a convoluted manner. In a future version of TIELT, automated logic will track a decision system's response time when requested, removing the need for explicit agent controls.
5.2 Experiment Support Improvements

Although TIELT supports experiments that compare different agents competing against static opponents, there is no easy way to compare agents competing against other agents when both agents are controlled by TIELT. A researcher may also want multiple agents in the same game to cooperate with each other on the same task, rather than compete. As mentioned previously, running multiple agents in the same game requires one instance of TIELT for each decision system, as shown in Figure 5a.
Figure 5: Experiments with multiple agents with associated decision systems. (a) Currently, such experiments require a separate instance of TIELT for each decision system. (b) TIELT will be improved to allow multiple agents with associated decision systems to compete through one instance of TIELT.
Allowing multiple agents to run through one instance of TIELT, as shown in Figure 5b, would be more convenient. In the long term, this will be supported via a more complex communications interface.

Another improvement would be to allow users to run a single experiment over different game engines. Currently, a different Experiment Methodology must be loaded manually each time the game is changed, so a user cannot let a multi-game experiment run unattended overnight or for several days at a time. To permit this, we will create a new experiment type that allows multiple games to be compared as an independent variable.
6 Conclusion

TIELT is a useful utility for CBR researchers who want to conduct experiments involving tasks defined in virtual gaming simulators. Aha et al. (2005) demonstrated how TIELT can be used in CBR research, and we expect it will be used by many other researchers as it matures. By simplifying the integration and experimentation process, we hope to encourage research on knowledge-intensive learning systems. We described TIELT's utility to researchers and how various experiments (e.g., those for comparing and improving learning systems) can be created and performed using TIELT. We also briefly discussed desirable future functionality, including a facility for simultaneous action by separate agents, a timing system for efficiency measurement, and the ability to describe a multi-environment experiment. Going forward, we will continue to improve the system to meet the needs of case-based reasoning researchers.
Acknowledgements

This research was supported by DARPA and the Naval Research Laboratory.
References

Aha, D.W., Breslow, L.A., & Munoz-Avila, H. (2001). Conversational case-based reasoning. Applied Intelligence, 14(1), 9-32.
Aha, D.W., & Molineaux, M. (2004). Integrating learning in interactive gaming simulators. In D. Fu & J. Orkin (Eds.) Challenges in Game AI: Papers of the AAAI'04 Workshop (Technical Report WS-04-04). San Jose, CA: AAAI Press.
Aha, D.W., Molineaux, M., & Ponsen, M. (2005). Learning to win: Case-based plan selection in a real-time strategy game. To appear in Proceedings of the Sixth International Conference on Case-Based Reasoning. Chicago, IL: Springer.
De Jong, K., & Schultz, A.C. (1988). Using experience-based learning in game playing. Proceedings of the Fifth International Conference on Machine Learning (pp. 284-290). Ann Arbor, MI: Morgan Kaufmann.
Fagan, M., & Cunningham, P. (2003). Case-based plan recognition in computer games. Proceedings of the Fifth International Conference on Case-Based Reasoning (pp. 161-170). Trondheim, Norway: Springer.
Fairclough, C., Fagan, M., Mac Namee, B., & Cunningham, P. (2001). Research directions for AI in computer games. Proceedings of the Twelfth Irish Conference on Artificial Intelligence & Cognitive Science (pp. 333-344). Maynooth, Ireland: Unknown publisher.
Fasciano, M.J. (1996). Everyday-world plan use (Technical Report TR-96-07). Chicago, IL: The University of Chicago, Computer Science Department.
Gabel, T., & Veloso, M. (2001). Selecting heterogeneous team players by case-based reasoning: A case study in robotic soccer simulation (Technical Report CMU-CS-01-165). Pittsburgh, PA: Carnegie Mellon University, School of Computer Science.
Goodman, M. (1993). Projective visualization: Acting from experience. Proceedings of the Eleventh National Conference on Artificial Intelligence (pp. 54-60). Washington, DC: AAAI Press.
Karol, A., Nebel, B., Stanton, C., & Williams, M.-A. (2003). Case based game play in the RoboCup four-legged league: Part I the theoretical model. In D. Polani et al. (Eds.) RoboCup 2003: Robot Soccer World Cup VII. Padua, Italy: Springer.
Laird, J.E., & van Lent, M. (2001). Interactive computer games: Human-level AI's killer application. AI Magazine, 22(2), 15-25.
Langley, P., & Kibler, D. (1991). The experimental study of machine learning. Unpublished manuscript. Moffett Field, CA: NASA Ames Research Center, AI Research Branch.
The NPD Group (2005). Annual 2004 U.S. Video Game Industry Retail Sales and First Quarter 2005 U.S. Video Game Industry Retail Sales. Port Washington, NY. [http://www.npdfunworld.com/funServlet?nextpage=pr_content.html&show=ALL]
Powell, J.H., Hauff, B.M., & Hastings, J.D. (2004). Utilizing case-based reasoning and automatic case elicitation to develop a self-taught knowledgeable agent. In D. Fu & J. Orkin (Eds.) Challenges in Game Artificial Intelligence: Papers from the AAAI Workshop (Technical Report WS-04-04). San Jose, CA: AAAI Press.
RoboCup (2005). The RoboCup Soccer Simulator. [http://sserver.sourceforge.net/]
Sweetser, P., & Dennis, S. (2003). Facilitating learning in a real time strategy computer game. In R. Nakatsu & J. Hoshino (Eds.) Entertainment Computing: Technologies and Applications. Boston, MA: Kluwer Academic Publishers.
Ulam, P., Goel, A., & Jones, J. (2004). Reflection in action: Model-based self-adaptation in game playing agents. In D. Fu & J. Orkin (Eds.) Challenges in Game Artificial Intelligence: Papers from the AAAI Workshop (Technical Report WS-04-04). San Jose, CA: AAAI Press.
Wendler, J., Kaminka, G.A., & Veloso, M. (2001). Automatically improving team cooperation by applying coordination models. In B. Bell & E. Santos (Eds.) Intent Inference for Collaborative Tasks: Papers from the AAAI Fall Symposium (Technical Report FS-01-05). Falmouth, MA: AAAI Press.