Learning Agents in Quake III

Remco Bonse, Ward Kockelkorn, Ruben Smelik, Pim Veelders and Wilco Moerman
Department of Computer Science, University of Utrecht, The Netherlands

Abstract

This paper shows the results of applying Reinforcement Learning (RL) to train combat movement behaviour for a Quake III Arena [3] bot. Combat movement of a Quake III bot is the part that steers the bot while in battle with an enemy bot or human player. The aim of this research is to train a bot to perform better than the standard Quake III bot it is based on, by only changing its combat movement behaviour. We extended a standard bot with a Q-learning algorithm that trains a Neural Network to map a game state vector to Q-values, one for each possible action. This RL bot (to which we refer as NeurioN) is trained in a reduced Quake III environment, to decrease noise and to make the training phase more effective. The training consists of one-to-one combat with its non-learning counterpart, in runs of up to 100,000 kills (frags). Reward is given for avoiding damage, thereby letting the bot learn to avoid getting hit. We found it is possible to improve a Quake III bot by using RL to train combat movement. The trained bot outperforms its non-learning counterparts, but does not appear smarter in combat with human opponents.

1 Introduction

Games are more and more considered a suitable testing ground for artificial intelligence research [5]. Reasons for this include their ever increasing realism, their limited and simulated environment (in which agents do not require sensors and imaging techniques to perceive their surroundings), their accessibility and inexpensiveness, and the fact that the game industry is big business. Artificial Intelligence has been successfully applied to classic games like Othello and Chess, but contemporary computer games, like First Person Shooters (FPS), which can be regarded as the most popular games nowadays, typically have limited artificial intelligence (rule-based agents or agents based on finite state machines). As a result, FPS agents (called "bots") are by far not able to compete with human players in their tactical and learning abilities. With the 3D rendering computations being moved more and more to the graphics card's GPU, CPU cycles become available for AI. More elaborate and computationally intensive AI techniques might now be used in these games, adding to the game's realism and creating more human-like artificial opponents.

This research explores the application of Machine Learning (ML) methods to FPSs, focusing on one of the most popular FPS games: Quake III Arena [3]. Bots in Quake III have limited intelligence and are no match for an expert human player. We will extend such a bot with learning capabilities and evaluate its performance against non-learning bots and human players. For more information on Quake III and its bots, we refer to [10, 11].

Previous research on applying AI techniques to FPS bots includes [4], in which Laird et al. present a bot for Quake II, based on 800+ rules, that can infer how to set up an ambush, in a way quite resembling a human player. Related to our research is the work of Zanetti et al. [12]. They have implemented a Quake III bot that uses three neural networks for: movement during combat, aiming and shooting (including selecting which weapon to use), and path planning in non-combat situations. Their networks are trained using Genetic Algorithms (GA) and a training set of recordings of matches of expert human players. The goal is for the networks to imitate the behaviour of these experts. The resulting bot turns out to be far from competitive, but still has learned several expert behaviours.

A bot has several independent decision nodes (e.g. aiming and shooting, goal selection, chatting). Our focus is on the movement behaviour of a bot in combat mode (i.e. upon encountering a nearby enemy). This behaviour typically includes dodging rockets and moving in complex patterns so as to increase the aiming difficulty for the enemy. Human expert players excel at combat movement with techniques such as "circle strafing" and, of course, a lot of jumping around, thereby making it very difficult for their opponents to get a clear shot. Combat movement is one of the parts of FPS AI that is not often studied in this type of research (whereas goal and weapon selection are). This made it interesting for us to examine.

As mentioned, Evolutionary Algorithms have already been applied successfully to similar problems in FPSs, albeit limited to research environments, yet little is known about the suitability of Reinforcement Learning (RL, see e.g. [1, 9]) in such games. Therefore, we have chosen to apply RL to improve the combat performance of a Quake III bot. As we did not have the time and resources (i.e. expert players) to implement supervised learning, and because supervised learning has already been applied (somewhat) successfully [12], we use an unsupervised learning algorithm.

2 Reinforcement Learning

We have implemented the Q-learning algorithm (see [9]) as described in Listing 1.

1  Initialize Q(s, a) arbitrarily
2  Repeat (for each combat sequence):
3      Initialize state s
4      Repeat (for each step of the combat sequence):
5          Select action a based on s using softmax
6          Execute a, receive reward r and next state s'
7          Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
8          s ← s'
9      Until s is non-combat or the bot is killed

Listing 1: Q-learning algorithm

As follows from Listing 1, we consider each combat sequence an RL episode. A combat sequence starts when the combat movement decision node of the Quake III AI is first addressed and ends when more than 10 in-game seconds have passed since the last call, in order to keep the Q-value function smooth. Since the Quake III engine decides when to move into combat mode, it might be 100 milliseconds or just as well 100 seconds since the last combat movement call. In this (fast) game this would mean a completely different state, which has minimal correlation with the action taken in the last known state. If the time limit has expired, the last state in the sequence cannot be rewarded properly and is discarded. However, if the bot is killed not too long after the last combat movement, it still receives a (negative) reward.
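For illustration only (a Python sketch, not the actual Quake III C code), the loop of Listing 1 could be written as follows; qnet (a Q-value approximator with predict and update methods), env (the combat environment) and softmax_select (the Boltzmann selection of Equation 2) are assumed helpers, not names from the original implementation.

import numpy as np

def run_combat_episode(qnet, env, alpha=0.01, gamma=0.95, tau=50.0):
    """One RL episode = one combat sequence (lines 3-9 of Listing 1)."""
    s = env.initial_state()                        # 10-element state vector
    while True:
        a = softmax_select(qnet.predict(s), tau)   # line 5: Boltzmann action selection (Equation 2)
        r, s_next, done = env.step(a)              # line 6: execute a, observe reward and next state
        target = r if done else r + gamma * np.max(qnet.predict(s_next))
        qnet.update(s, a, target, alpha)           # line 7: move Q(s, a) towards the TD target
        if done:                                   # line 9: non-combat state or bot killed
            break
        s = s_next                                 # line 8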

We have also experimented with an adaptation of Q-learning known as Advantage Learning. Advantage Learning is implemented by replacing line 7 in Listing 1 with the following equation:

    A(s, a) ← A(s, a) + α [r + γ max_{a'} A(s', a') − A(s, a)] / (∆T · k)    (1)

Here the A-value of an (s, a) pair is the advantage the agent gets by selecting a in state s. ∆T stands for the time passed since the last time the action was selected and k is a scaling factor. Both α (in Q-learning) and k (in Advantage Learning) are in (0, 1]. Since we work with discrete time, the ∆T term is always 1. Advantage Learning uses the 'advantage' that a certain state-action pair (Q-value) has over the current Q-value. This advantage is then scaled (using the scaling factor k). This algorithm is useful when the Q-values do not differ very much. Normal Q-learning would have to become increasingly accurate to be able to represent the very small but meaningful differences in adjacent Q-values, since the policy has to be able to accurately determine the maximum over all the Q-values in a given state. Since the Q-values are approximated, this poses a severe problem for the function approximator. It is easy for the approximator to decrease the overall error by roughly approximating the Q-values; however, it is hard, requires a lot of training time, and might even be impossible (given the structure of the approximator, for instance not enough hidden neurons) to accurately approximate the small but important differences in Q-values. Advantage Learning learns the scaled differences (advantages) between Q-values, which are larger, and therefore has a better chance of approximating these advantages than Q-learning has of approximating the Q-values. Since the advantages correlate with the Q-values, the policy can simply take the maximum advantage when the best action needs to be selected [2].
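As a sketch (our own illustration, under the reconstruction of Equation 1 above), the modified update for line 7 differs from Q-learning only in the scaling by ∆T·k; with k = 1 and ∆T = 1 it reduces to the ordinary Q-learning step.

def advantage_update(A_s, a, r, A_next_max, alpha=0.01, gamma=0.95, k=0.5, dT=1.0):
    """One Advantage Learning step (Equation 1) on the value of action a in the current state."""
    td_error = r + gamma * A_next_max - A_s[a]    # A_next_max = max over a' of A(s', a')
    A_s[a] += alpha * td_error / (dT * k)
    return A_s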

For action selection we use a softmax policy with Boltzmann selection. The chance P that action a is chosen in state s is defined as in Equation 2:

    P = e^{Q_t(a)/τ} / Σ_{b=1}^{n} e^{Q_t(b)/τ}    (2)

As is common with softmax, action a_1 is chosen more often in state s if Q(s, a_1) ≥ Q(s, a_i), i ∈ A. But since the policy is stochastic, there will always be some exploration. In the beginning, much exploration is performed, because all Q-values are initialized with random values. However, as the learning process progresses, actions that have been rewarded highly will have higher Q-values than others, thereby exponentially increasing their chance of being selected.
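A minimal sketch of this selection rule (illustrative Python; the subtraction of the maximum is only for numerical stability and does not change the probabilities):

import numpy as np

def softmax_select(q_values, tau=50.0, rng=None):
    """Boltzmann (softmax) action selection as in Equation 2."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()          # numerical stability only; resulting probabilities are unchanged
    p = np.exp(prefs)
    p /= p.sum()
    return rng.choice(len(p), p=p)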

Because the state space is continuous, we use a Multilayer Perceptron (MLP) to approximate the Q-value function. The input layer consists of 10 neurons for the state vector; the output layer consists of 18 neurons that contain the Q-values for each action. We use one hidden layer, and vary the number of hidden neurons during our experiments, ranging from 0 to 30, where 0 hidden neurons means there is no hidden layer. We use sigmoid activation functions for the hidden neurons and linear functions for the output neurons. This is because we expect a continuous output, but we also want to reduce the effects of noise: the sigmoid functions in the hidden layer filter out small noise in the input. In most experiments, we initialized the weights with random values drawn uniformly from [−0.01, 0.01]. We also conducted some experiments with a larger margin. More about the parameters of the MLP can be found in Section 4.
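A minimal NumPy sketch of such an approximator is given below (our illustration, not the original implementation): 10 inputs, one sigmoid hidden layer, 18 linear outputs, weights drawn uniformly from [−0.01, 0.01], and a single gradient step on the output of the chosen action, as used by the update in Listing 1.

import numpy as np

class MLP:
    """Sketch of the Q-value approximator: 10 inputs, one sigmoid hidden layer, 18 linear outputs."""
    def __init__(self, n_in=10, n_hidden=15, n_out=18, init=0.01, rng=None):
        rng = rng or np.random.default_rng()
        self.W1 = rng.uniform(-init, init, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.uniform(-init, init, (n_out, n_hidden))
        self.b2 = np.zeros(n_out)

    def predict(self, s):
        self._h = 1.0 / (1.0 + np.exp(-(self.W1 @ s + self.b1)))   # sigmoid hidden layer
        return self.W2 @ self._h + self.b2                          # linear output layer

    def update(self, s, a, target, alpha):
        """One gradient step pushing Q(s, a) towards the TD target (squared error)."""
        q = self.predict(s)
        err = q[a] - target
        h = self._h
        dh = err * self.W2[a]          # backprop through the single output unit a
        dz1 = dh * h * (1.0 - h)
        self.W2[a] -= alpha * err * h
        self.b2[a] -= alpha * err
        self.W1 -= alpha * np.outer(dz1, s)
        self.b1 -= alpha * dz1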

We considered several reward rules that might lead to good tactical movement. Possible reward rules are:

    R(s_t, a) = − (health_{t−1} − health_t) / health_{t−1}    (3)

    R(s_t, a) = numFrags − (health_{t−1} − health_t) / health_{t−1}    (4)

    R(s_t, a) = (enemyHealth_{t−1} − enemyHealth_t) / enemyHealth_{t−1} − (health_{t−1} − health_t) / health_{t−1}    (5)

    R(s_t, a) = accurateHits − (health_{t−1} − health_t) / health_{t−1}    (6)
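As a small illustration (our own Python sketch, not the bot's C code), rules 3 and 5 amount to the following; rules 4 and 6 add a frag count or an accurate-hit count to the same damage term.

def reward_rule3(health_prev, health_now):
    """Rule 3: punish (only) the fraction of own health lost since the previous step."""
    return -(health_prev - health_now) / health_prev

def reward_rule5(health_prev, health_now, enemy_prev, enemy_now):
    """Rule 5: reward the fraction of enemy health removed, punish the fraction of own health lost."""
    return ((enemy_prev - enemy_now) / enemy_prev
            - (health_prev - health_now) / health_prev)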

Rule 3 mainly lets the bot minimize damage to itself and thus leads to self-preserving bots, which favour evasive actions over aggressive actions. Rule 4 will eventually lead to good game play because it is our intuition that this is what human players do: keep good health and, whenever possible, frag¹ someone. However, frags do not happen very often and a good hit does not always mean a frag. Therefore, we came up with the next rule. In rule 5 every hit is taken into account, and it becomes more attractive to attack a player with low health or to run away if your own health is low. But we argued that the health loss of the enemy is not directly related to tactical movement. For example, if the enemy happens to take a health pack, the last chosen action will be considered bad, while it could have been a good evasive maneuver. Rule 6 does not have this disadvantage, but depends too much on the aiming skills of the bot. We have chosen to use rule (3), which rewards evasive actions, as this is the most important goal of combat movement: evading enemy fire. Rule 3 takes into consideration the minimal amount of information needed to evaluate the action.

¹ Fragging is slang for killing someone in a computer game.

Figure 1: Reduced training environments. (a) Cube map; (b) Round Rocket map

3 Environment

To eliminate a part of the randomness commonly present in FPSs, we use a reduced training environment in the form of a custom level: a simple round arena in which the bots fight (Figure 1(b)). In this map, bots enter combat mode almost immediately when they spawn. To eliminate the influence of different types of weapons, we have chosen to allow the use of only one type of weapon. To speed up the learning process, the custom map does not feature any cover, health packs or other items, other than this weapon and its corresponding ammunition.

The bot we created, which we call NeurioN, is based on one of the standard Quake III bots, named Sarge. Except for their looks, sounds, and, most importantly, their combat movement behaviour, these bots are identical. For NeurioN we train this part of the AI with RL, while Sarge remained fixed during training. The combat movement node is called irregularly by the bot AI decision loop. It is not necessary for a bot to be in combat mode to shoot at an opponent, but when a bot is close to an opponent and has the opponent in sight, the combat mode will be called. By making a small custom map in which the bots cannot hide from each other, we forced the bots to enter their combat mode more often.

The normal behaviour of a bot in combat mode is based on its preference to move and/or jump in unpredictable directions. In determining these directions, the environment is not taken into account. The same goes for incoming rockets, to which a bot is 'blind'. The only exceptions to this are collisions with walls and falling off edges. Because it is not easy to extract information about imminent collisions or drops off edges, we have chosen not to include this information in the state vector of the bot we train.

3.1 Problems with the environment

The environment used for the experiment, Quake III Arena [3], seems very suitable for these types of experiments, since its source code is freely available and the game is very stable and well known. However, it is not without its disadvantages. The first problem we ran into is the fact that, despite the release of the source code of Quake III, it is still obvious that the game is a commercial product that was not created to be an experimental environment. The source code is hardly documented and, despite some webpages on the internet [7, 8, 6], information is scarce. The engine is created to work in real-time. While speeding up the simulation is possible, the training is still slow.

Figure 2: Advantage for the first spawned bot (win percentage of Sarge 1 versus Sarge 2 in a control run of roughly 250,000 frags; moving average over 5000 frags, shown every 500)

During the creation of our reduced training environment, we encountered several other problems. One problem was the cube-shaped map we started out with (Figure 1(a)). When in combat mode, the learning bot does not take its static surroundings into account. Therefore, more often than desirable, NeurioN would end up in a corner, with a very large chance of getting shot. To counter this problem a round map was created, so that there are no corners where the bot can get stuck. Another problem we discovered while training in our map is that, when the map does not have any items for the bots to pick up, they remain standing still, having no goals at all. If both bots are standing still with their backs facing each other, nothing will happen. This situation is quite rare, and it therefore took a while to discover what was ruining our experiments. The problem was tackled by adding excessive amounts of weapons and ammunition crates.

We also discovered a problem that is caused by the order in which the different bots in a game are given time to choose their actions. This is done sequentially, which results in a small but significant advantage for the bot that is added first to the arena. When two identical (non-learning, standard Quake III) bots are added to a game, the bot first in the list wins about 52% of the time, as can be seen in Figure 2. This is a significant difference.
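To give an idea of why 52% over a run of this length is significant (a back-of-the-envelope check of ours, not taken from the paper), a simple binomial z-test, using the roughly 250,000 frags of the control run in Figure 2 as a placeholder sample size:

from math import sqrt

def win_rate_zscore(wins, games, p0=0.5):
    """z-score of an observed win rate against a fair 50/50 null hypothesis."""
    p_hat = wins / games
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / games)

# Placeholder numbers in the spirit of the control run: 52% of 250,000 frags.
print(win_rate_zscore(int(0.52 * 250_000), 250_000))   # about 20 standard deviations from chance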

Finally, we found out that the shooting/aiming precision of Sarge at the highest difficulty level is so good that there is not much room to evade the shots. This is especially the case with weapons that fire their ammunition at a high velocity. The bullets of these weapons travel almost instantaneously, which means that they impact directly when fired. Because we wanted the bot to learn combat movement, it was better to use only the rocket launcher as the weapon. The rocket launcher is the weapon with the slowest ammunition in Quake III, so it is the easiest weapon to evade and thus the best weapon for learning tactical movement in combat.

3.2 States

After experimenting with various sets of state information, we finally chose to use the following state vector:

• distance to opponent: a number between 0 and 1, where 1 is very close and 0 is far away. This means that when the opponent is out of sight the distance will also be 0;
• relative angle of placement of opponent (relative angle of bot to opponent): a number between 0 and 1, with 1 if the opponent is standing in front of the bot and 0 if the opponent is standing behind the bot. See Figure 3(a);
• relative side of placement of opponent: 1 if the opponent is standing to the left, -1 if the opponent is standing to the right;
• relative angle of opponent to bot: a number between 0 and 1, with 1 if the bot is standing in front of the opponent and 0 if the bot is standing behind the opponent. See Figure 3(b);
• relative side of placement of the bot: -1 or 1, depending on the sign of the previous input;
• distance to the nearest projectile: a number between 0 and 1, where 1 is very close and 0 is far away. If there is no projectile heading towards NeurioN, 0 is also given;
• relative angle of placement of the nearest projectile: a number between 0 and 1, with 1 if the projectile is in front of the bot and 0 if the projectile is behind the bot. See Figure 3(c);
• relative side of placement of the nearest projectile: 1 if the projectile is to the left, -1 if the projectile is to the right;
• relative angle of nearest projectile to bot: a number between 0 and 1; 1 if the projectile is heading straight for the bot, 0 if the angle between the projectile and the bot is equal to or larger than a certain threshold. Any angle beyond the threshold indicates that the projectile is not a threat to the bot. See Figure 3(d);
• relative side of bot with regard to the nearest projectile: 1 if NeurioN is to the left, -1 if NeurioN is to the right of the projectile.

Figure 3: The state information. (a) Opponent placement; (b) Opponent viewangle; (c) Projectile placement; (d) Projectile direction
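A sketch of how such inputs might be normalised (illustrative Python; the cut-off distance and threat threshold are assumptions of ours, not values from the NeurioN code):

import numpy as np

MAX_DIST = 2000.0        # assumed cut-off in game units; beyond this the opponent counts as "far"
ANGLE_THRESHOLD = 0.5    # assumed threshold (radians) beyond which a projectile is not a threat

def norm_distance(d, visible=True):
    """Map a distance to [0, 1], with 1 = very close and 0 = far away or out of sight."""
    if not visible:
        return 0.0
    return max(0.0, 1.0 - d / MAX_DIST)

def norm_facing_angle(angle):
    """Map an angle in [0, pi] to [0, 1], with 1 = straight ahead and 0 = directly behind."""
    return 1.0 - angle / np.pi

def norm_projectile_angle(angle):
    """1 if the projectile heads straight for the bot, 0 at or beyond the threat threshold."""
    return max(0.0, 1.0 - angle / ANGLE_THRESHOLD)

def side_sign(cross_z):
    """-1 or 1 depending on which side the other object is on (sign of a cross product's z-component)."""
    return 1.0 if cross_z >= 0 else -1.0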

3.2.1 Information not used for states

Some information that seems important to describe the game state is left out, for the following reasons:

• Health of NeurioN: the health of a player has no influence on the movement of that player when in direct combat;
• Health of the opponent: same as above, the health of the opponent has no effect on the combat movement;
• Current weapon of the opponent: in our training environment two ballistic weapons are available: the machine gun, for which there is no extra ammunition available, and the rocket launcher. The rocket launcher is by far the more popular of the two for Sarge, and thus also for NeurioN. As soon as they get one, they will switch to the rocket launcher. This results in effectively only one weapon being used; therefore the weapon that the opponent uses is known and it is not necessary to include this information in the state vector.


Figure 4: The actions visualized

3.3 Actions

We consider 18 (legal) combinations of five elementary actions:

• Move forward (W);
• Move backward (S);
• Strafe left (A);
• Strafe right (D);
• Jump (J).

So, the possible actions are {−, W, WA, WJ, . . . , SDJ, D, DJ} (Figure 4). Other combinations would be theoretically possible, for example moving forward and backward (WS) simultaneously, but these are of course impossible, so such combinations are left out. This results in the 18 legal moves: 3 options for forward/backward movement (forward, nothing, backward), 3 options sideways in the same manner, and 2 options for jumping (to jump or not to jump). The orientation is not considered in our combat movement, because it is determined by the aiming function of the bot, which would overwrite any change made to the rotation of the bot with its own values. The movement of a bot is relative to its orientation vector, as given by the aiming function.
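The 18 legal combinations can be enumerated mechanically, as in this small sketch (the single-character codes follow the list above; '-' stands for doing nothing):

from itertools import product

FORWARD = ['W', '', 'S']      # forward, nothing, backward
SIDEWAYS = ['A', '', 'D']     # strafe left, nothing, strafe right
JUMP = ['J', '']              # jump or not

# 3 x 3 x 2 = 18 legal combinations.
ACTIONS = [''.join(c) or '-' for c in product(FORWARD, SIDEWAYS, JUMP)]
print(len(ACTIONS))   # 18
print(ACTIONS)        # e.g. ['WAJ', 'WA', 'WJ', 'W', ..., 'SDJ', 'SD']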

4 Setup of Experiments

When using RL and neural networks, a number of parameters have to be 'tweaked', as they have a large influence on the resulting training performance. The training phase consists of a game with a very high fraglimit (up to 200,000 frags). Each setting is run several times to validate the results (as the game is stochastic and the network initialization may be of influence). Because RL starts each training run with a randomly initialized network and contains several variables of which the optimal setting is unknown, we did a broad sweep of the parameter space. The settings we tried consist of the following variables:

    Number of hidden neurons    n ∈ {0, 5, 15, 30}
    Discount                    γ ∈ {0.95, 0.80}
    Temperature                 τ ∈ {10, 50}
    Learning rate               α ∈ {0.001, 0.003, 0.01}
    Timescale factor²           k ∈ {1, 0.5, 0.25}

² k = 1 is equal to Q-learning.
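For illustration, the sweep can be expressed as a grid over these sets (a sketch of ours; the paper reports a broad sweep, not necessarily every one of the combinations below):

from itertools import product

grid = {
    'hidden_neurons': [0, 5, 15, 30],
    'gamma':          [0.95, 0.80],
    'tau':            [10, 50],
    'alpha':          [0.001, 0.003, 0.01],
    'k':              [1, 0.5, 0.25],      # k = 1 corresponds to plain Q-learning
}

settings = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(settings))   # 4 * 2 * 2 * 3 * 3 = 144 combinations in a full sweep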


Because we had some trouble determining which information had to be part of the state vector, and initially implemented parts of it incorrectly, many training runs performed earlier unfortunately had to be discarded.

5 Results

During the broad sweep, several combinations of parameters were found that resulted in successful learning after some 100,000 or more frags. It was found that 15 hidden neurons, combined with a learning rate of 0.01, a discount of 0.95 and a temperature of 50, worked well, see Figure 5(a). This was achieved with the Q-learning algorithm (and therefore the advantage factor k equals 1). This combination of parameters therefore served as the 'base' for further investigation, and if not otherwise mentioned, these values are used.

Increasing the learning rate above 0.01 resulted in unstable neural networks, which sometimes learned but most of the time remained at the initial level of success, see Figure 5(b). A learning rate of 0.01 was therefore used in further experiments. Increasing the range in which the initial weights were randomly distributed, to 0.1 or even 0.5, did not result in any stable learning (at least with the 'base' values for the other parameters). Sometimes a decent learning curve was seen, but most of the time learning did not occur, see Figure 6(a). A margin of 0.01 for the initialisation of the random weights was therefore used.

Varying the timescale factor k never resulted in a bad-learning or non-learning situation, see Figure 6(b). Most of the values for k resulted in roughly the same learning curves, although k = 0.1 resulted in a quicker learning phase, but a less stable, more jagged curve, as the large standard deviation indicates.

Decreasing or increasing the number of hidden neurons (e.g. to 10) did not give good results (Figure 7(a)), or at least not in the timeframe that the 'base' setting with 15 hidden neurons needed to achieve a good success rate. And since the 'base' setting already needed 100,000-200,000 frags, taking some 10 hours of computing time, other numbers of hidden neurons were not extensively investigated.

A lower temperature of 10 did not result in learning; a higher temperature of 80, however, did (Figure 7(b)). A higher temperature resulted in much quicker learning, but also a more jagged learning curve. As a comparison, the temperature was varied when 30 hidden neurons were used (other settings were the same as the 'base'). This experiment revealed that the temperature did not have any significant effect on the average that was reached, but the variation increased with the higher temperature, resulting in some good runs and some runs where nothing was learned, see Figure 8(a).

6 Conclusion

Teaching a bot how to move in combat situations is a difficult task. The experiment depends both on choosing the right input vector and settings and on the environment used, Quake III in our case. However, using an MLP with 15 hidden neurons, the learning bot can be improved to a 65% chance of winning a 1-on-1 combat against its non-learning equal. This means that a bot with a trained network wins almost twice as often as a preprogrammed bot. The learning phase takes quite some time, often more than 100,000 to 200,000 frags; on a normal PC this takes about 10 hours. Some settings show a capricious learning process, but the speed of learning is higher in these cases.

Despite the fact that the numerical results show an improvement, human players will hardly notice any difference in behaviour between the standard bot and the learning bot. This is largely due to the fact that the bot is not always in combat mode when shot at (as one might expect), and hence is not always using its trained network.


6.1 Future Work

As an extension of the research described in this paper, it would be possible to look into the influence of a larger state vector, in which information concerning the surroundings of the bot is taken into account. Moreover, it would be interesting to let the bot learn its combat movement in a more complex environment. One can think of more opponents, weapons and items, or simply more complex maps, so as to let the bot learn in a normal Quake III game. Lastly, the use of memory could be a good research subject, in which the bot uses its last action as an input. This could be done by using recurrent neural networks.

References

[1] M. Harmon and S. Harmon. Reinforcement learning: a tutorial, 1996. http://www-anw.cs.umass.edu/~mharmon/rltutorial/frames.html.

[2] M. E. Harmon and L. C. Baird. Multi-player residual advantage learning with general function approximation. Technical Report WL-TR-1065, Wright Laboratory, WL/AACF, 2241 Avionics Circle, Wright-Patterson Air Force Base, OH 45433-7308, 1996.

[3] id Software. Quake III Arena, 1999. http://www.idsoftware.com/games/quake/quake3-arena/.

[4] J. E. Laird. It knows what you're going to do: adding anticipation to a Quakebot. In AGENTS '01: Proceedings of the Fifth International Conference on Autonomous Agents, pages 385–392, New York, NY, USA, 2001. ACM Press.

[5] J. E. Laird and M. van Lent. Human-Level AI's Killer Application: Interactive Computer Games. In Proceedings of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence, pages 1171–1178. AAAI Press / The MIT Press, 2000.

[6] PhaethonH. Quake III: Arena, baseq3 mod commentary. http://www.linux.ucla.edu/~phaethon/q3mc/q3mc.html.

[7] PlanetQuake.com. Code3arena. http://code3arena.planetquake.gamespy.com/.

[8] [email protected]. Quake 3 game-module documentation. http://www.soclose.de/q3doc/index.htm.

[9] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[10] J. van Waveren. The Quake III Arena Bot. Master's thesis, Delft University of Technology, 2003.

[11] J. van Waveren and L. Rothkrantz. Artificial player for Quake III Arena. International Journal of Intelligent Games & Simulation (IJIGS), 1(1):25–32, March 2002.

[12] S. Zanetti and A. E. Rhalibi. Machine learning techniques for FPS in Q3. In ACE '04: Proceedings of the 2004 ACM SIGCHI International Conference on Advances in Computer Entertainment Technology, pages 239–244, New York, NY, USA, 2004. ACM Press.


Figure 5: Results, panels (a) and (b)

Figure 6: Results, panels (a) and (b)

Figure 7: Results, panels (a) and (b)

Figure 8: Results, panel (a)