University of Nevada, Reno

Multiagent Monte Carlo Tree Search with Difference Evaluations and Evolved Rollout Policy

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Mechanical Engineering

by Nicholas Zerbel

Dr. Logan Yliniemi/Thesis Advisor

May 2018

THE GRADUATE SCHOOL

We recommend that the thesis prepared under our supervision by

NICHOLAS ALEXANDER ZERBEL

entitled

Multiagent Monte Carlo Tree Search With Difference Evaluations And Evolved Rollout Policy

be accepted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Logan Yliniemi, Ph.D., Advisor

Matteo Aureli, Ph.D., Committee Member

Sushil J. Louis, Ph.D., Graduate School Representative

David W. Zeh, Ph.D., Dean, Graduate School

May, 2018


Abstract

by Nicholas Zerbel

Monte Carlo Tree Search (MCTS) is a best-first search algorithm that has produced many breakthroughs in AI research. MCTS has been applied to a wide variety of domains including turn-based board games, real-time strategy games, multiagent systems, and optimization problems. In addition to its ability to function in a wide variety of domains, MCTS is also a suitable candidate for performance-improving modifications such as the improvement of its default rollout policy. In this work, we propose an enhancement to MCTS called Multiagent Monte Carlo Tree Search (MAMCTS) which incorporates multiagent credit evaluations in the form of Difference Evaluations. We show that MAMCTS can be successfully applied to a cooperative system called Multiagent Gridworld. We then show that the use of Difference Evaluations in MAMCTS offers superior control over agent decision making compared with other forms of multiagent credit evaluations, namely Global Evaluations. Furthermore, we show that the default rollout policy can be improved using a Genetic Algorithm with (µ + λ) selection, resulting in a 37.6% increase in overall system performance within the training domain. Finally, we show that the trained rollout policy can be transferred to more complex multiagent systems resulting in as high as a 14.6% increase in system performance compared to the default rollout policy.


Acknowledgements

This work would not have been possible without the unwavering support of my family, friends, and colleagues. I would like to thank all of them for helping me with the completion of this project.

First, I would like to thank my mentor, advisor, and committee chairman: Dr. Logan Yliniemi. He has provided me with valuable guidance during my time as a student, and he assisted me in setting reasonable goals for the completion of this project. He also helped me navigate the intricate world of academia, and pushed me to keep working when I wanted to give up. Thank you so very much, Dr. Yliniemi, for all of your help these past two years.

I would like to thank the members of my graduate committee: Dr. Sushil J. Louis and Dr. Matteo Aureli. As a student in their classes, I learned many valuable skills which were immensely helpful in the completion of this project. Their help and guidance during my time as a student has been extremely valuable. Thank you both for agreeing to be on my committee and for helping me complete my graduate program.

I would like to thank my parents, Carol and David Zerbel, as well as my grandfather, Charles Sublett, for their love and support throughout this entire process. My grandfather helped spark my interest in science and mathematics when I was younger, and that interest led me to pursue a career as an engineer. My mom and dad have always supported me in both my academic and extra-curricular pursuits, and they taught me the value of hard work and perseverance. I cannot thank my family enough for always making the time to listen to my worries, for letting me vent about my frustrations, and for never letting me give up on my goals.

Finally, I would like to thank my friends and labmates, Scott Forer and Sierra Gonzales, for all of their support these past two years. Scott and Sierra gave me extremely valuable feedback on my work, helped me with debugging my code, and offered words of encouragement when things seemed bleak. I would not be where I am today without their help and friendship. I wish both of them the best of luck in their future endeavors and careers.


Contents

Abstract
Acknowledgements
List of Figures
1 Introduction
2 Background Information
   2.1 Tree Searches
   2.2 Monte Carlo Tree Search
       2.2.1 Monte Carlo Tree Search Enhancement
   2.3 Multiagent Systems
   2.4 Multiagent Credit Evaluation
   2.5 Genetic Algorithms
       2.5.1 GA with (µ + λ) Selection
   2.6 Transfer Learning
   2.7 Related Works
       2.7.1 MCTS in Multiagent Systems
       2.7.2 MCTS Rollout Policy Enhancement
       2.7.3 MCTS in Non-Game Domains
3 Methodology
   3.1 Multiagent Monte Carlo Tree Search
   3.2 Genetic Algorithm
   3.3 Rollout Policy Transfer
4 Test Domain and Setup
   4.1 Multiagent Gridworld
       4.1.1 Multi-Rover Domain
       4.1.2 Multiagent Gridworld Domain
   4.2 Test Setup
       4.2.1 Parameters
       4.2.2 Rollout Policy Training
       4.2.3 Multiagent Gridworld Testing
5 Experimental Results
   5.1 Multiagent Gridworld Problem
       5.1.1 5x5 Multiagent Gridworld
       5.1.2 8x8 Multiagent Gridworld
       5.1.3 10x10 Multiagent Gridworld
       5.1.4 20x20 Multiagent Gridworld
   5.2 Discussion
6 Conclusions and Future Work
   6.1 Future Work
Bibliography


List of Figures

2.1 Example of a search tree data structure with key structural components labelled.
2.2 Monte Carlo Tree Search diagram showing the four step cycle: selection, expansion, simulation, and back-propagation.
4.1 A 4 x 4 MAG featuring three agents and three goals. A possible, optimal solution to this system is represented by arrows.
5.1 Reliability of MAMCTS on a 5x5 MAG. Results are displayed as the total number of successful statistical runs out of thirty.
5.2 Average number of iterations needed for a solution (100% goal capture) to be found on a 5x5 MAG.
5.3 Average maximum agent cost for MAMCTS-agents on a 5x5 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.
5.4 Reliability of MAMCTS on an 8x8 MAG. Results are displayed as the total number of successful statistical runs out of thirty.
5.5 Average number of iterations needed for a solution (100% goal capture) to be found on an 8x8 MAG.
5.6 Average maximum agent cost for MAMCTS-agents on an 8x8 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.
5.7 Reliability of MAMCTS on a 10x10 MAG. Results are displayed as the total number of successful statistical runs out of thirty.
5.8 Average number of iterations needed for a solution (100% goal capture) to be found on a 10x10 MAG.
5.9 Average maximum agent cost for MAMCTS-agents on a 10x10 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.
5.10 Reliability of MAMCTS on a 20x20 MAG. Results are displayed as the total number of successful statistical runs out of thirty.
5.11 Average number of iterations needed for a solution (100% goal capture) to be found on a 20x20 MAG.
5.12 Average maximum agent cost for MAMCTS-agents on a 20x20 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.


Chapter 1

Introduction

Monte Carlo Tree Search (MCTS) is a best-first search which has produced many breakthroughs in AI and search-based decision making research. Much of the early work in MCTS focuses on its use as a game AI, and there is heavy emphasis placed on its successful application to the game of Go [1–3]. Since then, MCTS has been applied to a wide variety of computerized card games and board games such as Scrabble, Poker, Othello, and Settlers of Catan [4–7]. MCTS has also been applied to several different video games including real-time games such as Ms. Pac-Man, and multiplayer games such as Starcraft and Civilization II [1, 8–10]. These examples of the application of MCTS demonstrate that MCTS has the ability to adapt to games featuring a large number of possible actions and a complex set of system variables. It also demonstrates the potential MCTS has for dealing with both single agent and

multiagent systems. That being said, the application of MCTS to non-game related domains, particularly in the area of multiagent systems, remains largely unexplored. Part of what makes MCTS such a versatile algorithm is that it can be readily adapted and modified to improve the algorithm’s performance within specific test domains. One major improvement made to the MCTS algorithm was to combine MCTS with a heuristic algorithm called Upper Confidence Bounds for Trees (UCT). Using this approach, Gelly et al. were able to develop an improved game AI for computerized Go which is capable of beating master level players on a 9 x 9 board [11]. Another focus for MCTS improvement lies in the algorithm’s rollout policy which allows MCTS to focus on promising actions while ignoring unprofitable actions [9]. The default rollout policy used in MCTS consists of a set of purely random moves. With this approach, a single rollout may not accurately estimate the value of a particular action; however, with enough rollouts, the optimal decision path in a tree will present itself [4]. Unfortunately, this default policy can be uninformative if MCTS is used within complex, real-time games or in multiagent systems [12]. These uninformative rollouts drastically reduce the performance of MCTS. In this work, we propose a new enhancement for MCTS called Multiagent Monte Carlo Tree Search (MAMCTS) which combines Monte Carlo Tree Search with multiagent credit evaluations in the form of Difference Evaluations. Using MAMCTS, we demonstrate that MCTS can be used to control the decision making ability of autonomous agents in such a way that agent performance in cooperative, multiagent

environments optimizes system performance. We also show that using Difference Evaluations with MAMCTS leads to superior system performance compared to other techniques, such as Global Evaluations, in multiagent path planning problems. In this work, we demonstrate the effectiveness of MAMCTS by applying the algorithm to a domain called Multiagent Gridworld (MAG) which is similar to a multi-rover domain where multiple rovers (robots/agents) must learn to navigate to points of interest located at fixed points on a two dimensional plane [13, 14]. This work also shows that overall multiagent system performance can be improved significantly by using a Genetic Algorithm (GA) with (µ + λ) selection to train a non-random rollout policy for use in MAMCTS. Furthermore, we demonstrate that transfer learning, transferring knowledge from one problem to another, can be used to transfer a rollout policy trained on a relatively simple multiagent system to a more complex multiagent system [15]. Finally, this work shows that transferring trained rollout policies in this manner results in improved system performance compared with the default rollout policy. The primary contributions of this thesis work are:

• Creating a multiagent enhancement for MCTS, in the form of MAMCTS with Difference Evaluations, to control agent decision making in multiagent path planning applications,

• Demonstrating that pairing MAMCTS with Difference Evaluations yields superior multiagent system performance, in a cooperative setting, compared to other evaluation techniques such as Global Evaluations,

• Demonstrating that using a GA with (µ + λ) selection to train a deterministic rollout policy for MAMCTS leads to a significant improvement in overall system performance compared with using the default rollout policy in MAMCTS,

• Demonstrating that transferring rollout policies, trained on relatively simple multiagent systems, to more complex multiagent systems produces a better system performance than solely relying on the default rollout policy.

The remainder of this thesis is divided up into six chapters with this chapter being Chapter 1. In Chapter 2, we discuss key background information and related works. In Chapter 3, we discuss the MAMCTS algorithm, the GA used to train rollout policies for MAMCTS, and the logistics of transferring rollout policies to more complex systems. In Chapter 4, we discuss the MAG domain and its relationship with a multi-rover domain, and we outline the test parameters and procedures used for testing MAMCTS. In Chapter 5, we discuss the results of the tests conducted on varying sizes of MAG. Finally, in Chapter 6, we present our concluding remarks, and we discuss possibilities for future work.


Chapter 2

Background Information

In this chapter, key background information needed to understand the content of this work is presented. This information is broken up into the following sections: Tree Searches (Section 2.1), Monte Carlo Tree Search (Section 2.2), Multiagent Systems (Section 2.3), Multiagent Credit Evaluation (Section 2.4), Genetic Algorithms (Section 2.5), Transfer Learning (Section 2.6), and Related Works (Section 2.7).

2.1 Tree Searches

A search tree is defined as a data structure which is initially accessed from a root node before it branches out to additional nodes in deeper levels of the tree [16]. Nodes, also referred to as leaves, in the tree are organized into levels and are interconnected by branches. These nodes contain state information pertaining to the search space

the tree is constructed in. When two nodes are connected by a branch, the two states represented by those nodes are related via some sort of action. This relationship between nodes in a search tree also implies that a domain must be deterministic in order for a tree search to be successfully applied. A node which has branches to additional nodes in the next level of the tree is referred to as a parent node, and the nodes which stem from that parent node are referred to as child nodes [16]. Nodes contained within the same level of a tree are typically referred to as sibling nodes if they share a common parent. These structural features are pointed out in Fig. 2.1, and all tree data structures share the same basic structural components depicted there.

Figure 2.1: Example of a search tree data structure with key structural components labelled.

The topology of a tree structure is often unique to the particular domain it is used in and the variety of tree search used to construct the tree. The process in which a tree is constructed is known as a tree search. These tree searches typically fall into one of two categories: exhaustive search or best-first search. An exhaustive search, also known as a brute-force search, does not utilize domain-specific information other than the root state, and instead searches through a space action by action until a solution is found [17]. In direct contrast to the exhaustive search, a best-first search utilizes domain-specific information to focus the tree search on actions which are more likely to lead to a good solution and produce a higher reward value [18].

Two standard examples of exhaustive search are breadth-first and depth-first search. Breadth-first search expands the search tree one level at a time, adding a child node for each action which can be taken from a parent node, until a goal state is reached [19]. Although this method guarantees that the first solution found will be the solution of shortest action length, this technique has several drawbacks. The most significant drawback of this technique is its demand on memory space and the time complexity of the algorithm [17]. Instead of expanding the tree one full level at a time, depth-first search focuses on sub-tree expansion. Depth-first tree search is a descendant tree search which adds a new child node to the tree from the most recent parent until a cutoff point is reached [19]. After the cutoff is reached, the search travels back up the tree until reaching a parent node which can be expanded further. Like breadth-first search, depth-first search also has several noteworthy drawbacks.

If a cutoff point is arbitrarily defined, it is possible that depth-first search may not find a solution if that cutoff point is defined to be too shallow in the search tree [17]. On the other hand, if the cutoff depth is defined to be too deep, the time complexity of the search and the memory space demands can become extremely taxing [17]. Due to the descendant nature of depth-first search, it is also not guaranteed that the first solution found is the solution of the shortest action length which, in many cases, represents the optimal solution. In this work, we focus on Monte Carlo Tree Search, which is an example of a best-first search. Monte Carlo Tree Search is described in further detail in the following section.
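Before moving on to MCTS, the contrast between the two exhaustive strategies can be made concrete with a short sketch. The Node class, goal test, and cutoff handling below are illustrative assumptions, not code from this thesis.

    from collections import deque

    class Node:
        # Minimal tree node used only for illustration.
        def __init__(self, state, children=None):
            self.state = state
            self.children = children or []

    def breadth_first_search(root, is_goal):
        # Expand the tree one full level at a time; the first goal found
        # is therefore the solution of shortest action length.
        frontier = deque([root])
        while frontier:
            node = frontier.popleft()
            if is_goal(node.state):
                return node
            frontier.extend(node.children)
        return None

    def depth_first_search(root, is_goal, cutoff):
        # Descend along one sub-tree until the cutoff depth is reached,
        # then backtrack to the most recent expandable parent.
        stack = [(root, 0)]
        while stack:
            node, depth = stack.pop()
            if is_goal(node.state):
                return node
            if depth < cutoff:
                stack.extend((child, depth + 1) for child in node.children)
        return None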

2.2 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a variety of best-first tree search which uses Monte Carlo methods to probabilistically sample a given domain and which organizes the search tree incrementally and asymmetrically [1]. MCTS also attempts to balance exploration vs. exploitation within the search space so that optimal solutions may be discovered more quickly [1]. The algorithm itself can be broken down into 4 steps: selection, expansion, rollout, and back-propagation [1, 4].

Selection: Starting at the root node of the tree, MCTS selects actions based on action values stored within each node. During the selection process nodes are chosen level by level until an unexpanded node in the tree is reached.

Expansion: After reaching an unexpanded node during selection, MCTS proceeds into the expansion phase. From the current parent state, a node is added into the next level of the tree for each possible action which can be taken from the parent state to an adjoining child state. A node is only expandable when it represents a non-terminal state within the domain [1].

Simulation/Rollout: After the parent state has been completely expanded, MCTS moves into the simulation phase. This phase is also referred to as the rollout phase. During this phase, one of the newly added nodes is selected and a simulated playout is conducted from that child node. Using the default rollout policy, the rollout consists of a set of random actions taken sequentially in the domain. After the simulation is complete, either by reaching a predetermined number of moves or by reaching a terminal state, an action value estimation is assigned to the child node based on the results of the rollout. The rollout can be conducted from a single node per MCTS cycle, or it can be conducted from any number of nodes, up to all of the newly added child nodes, per MCTS cycle. In the scientific literature, many authors consider simulation to be an all-encompassing term used to describe selection, expansion, and rollout [11, 20, 21]. In this work, simulation and rollout will only be used to describe the singular process in MCTS whereby a simulated playout is conducted to assign an initial action value to a newly expanded node.

Back-Propagation: After completing the rollout phase, MCTS proceeds into the fourth and final step in the cycle: back-propagation. Using the knowledge gained from the rollout, the entire sub-tree, starting at the parent state, is updated all the way up to the root node. The specific policy for updating action values varies. One popular method of back-propagation is to evaluate states based on their mean outcome during simulations [2, 11, 20]. The equation used for these evaluations is shown in Eqn. 2.1.

Q(s,a) = Q̄_{s,a} + (R − Q̄_{s,a}) / N_s    (2.1)

In Eqn. 2.1, Q̄_{s,a} represents the current average action value of state s, R represents the result of the most recent rollout, and N_s represents the number of times state s has been visited. The four-step MCTS process is summarized in Fig. 2.2.

Figure 2.2: Monte Carlo Tree Search diagram showing the four step cycle: selection, expansion, simulation, and back-propagation.
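A minimal sketch of the mean-outcome update in Eqn. 2.1 is shown below; it assumes each node stores a visit count and a running average action value (the attribute names are illustrative, not the thesis implementation).

    def backpropagate(decision_path, rollout_result):
        # Apply Eqn. 2.1 to every node on the decision path, from the
        # newly expanded node back up to the root. `rollout_result` is R.
        for node in reversed(decision_path):
            node.visits += 1                       # N_s
            # Incremental mean: Q <- Q + (R - Q) / N_s
            node.q_value += (rollout_result - node.q_value) / node.visits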


2.2.1 Monte Carlo Tree Search Enhancement

In this work, we use a well-known enhancement of MCTS called UCB1, where UCB stands for Upper Confidence Bounds. This variant is also referred to as Upper Confidence Bounds applied to Trees (UCT) [22]. The UCB1 algorithm is meant to balance the trade-off between the exploration of new actions and the exploitation of known, valuable actions in MCTS by treating the leaves in each level of the tree as a Multi-Armed Bandit problem [20]. The UCB1 technique, originally proposed by Auer et al., balances the exploration/exploitation trade-off by characterizing the value estimation of a state-action pair as a function of expected outcome and regret [23]. In this application, regret is defined as the expected loss due to a policy not always selecting the best possible action [23]. The UCB1 evaluation formula is defined in Eqn. 2.2. The left side of the UCB1 equation, Q̄(s,a), encourages the exploitation of known reward values, and the right side of the equation, ϵ √(ln(n_p)/n_s), encourages the exploration of new actions [1].

Q_UCB1 = Q̄(s,a) + ϵ √( ln(n_p) / n_s )    (2.2)

In Eqn. 2.2, Q̄(s,a) represents each state's average action value, which is averaged based on the number of times an action, a, has been selected from a state, s. The exploration constant, ϵ, is used to adjust the balance between exploration and exploitation, n_p represents the number of visits to the current state, or parent state, and n_s represents the number of times action a has been taken from the parent state. In other words, n_s represents the number of times the child node has been visited. The UCB1 value, Q_UCB1, is used to select nodes in the tree for sampling; it is not used to calculate the action values of the nodes [24].

This thesis work proposes an enhancement to MCTS for use in a cooperative, multiagent environment which combines MCTS with a multiagent credit evaluation technique called Difference Evaluations. The MAMCTS algorithm itself is discussed in greater detail in Chapter 3; however, multiagent systems and multiagent credit evaluation are introduced in the following sections.
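As a concrete illustration of how Eqn. 2.2 is used during selection, the sketch below ranks children by their UCB1 score while leaving the stored action values untouched. The node attributes and the default value of ϵ are assumptions made for illustration.

    import math

    def ucb1(child, parent_visits, epsilon=1.41):
        # Exploitation term Q̄(s,a) plus the exploration bonus of Eqn. 2.2.
        if child.visits == 0:
            return float("inf")  # unvisited children are sampled first
        return child.q_value + epsilon * math.sqrt(
            math.log(parent_visits) / child.visits)

    def select_child(parent):
        # UCB1 only decides which node to sample next; it does not
        # overwrite the stored action values of the nodes.
        return max(parent.children, key=lambda c: ucb1(c, parent.visits))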

2.3 Multiagent Systems

In this work, an agent is defined as an intelligent, artificial entity, capable of autonomous decision making within its environment, which tries to achieve a pre-defined goal [25, 26]. It then follows that a multiagent system is defined as a system which features multiple, individual agents which can communicate with each other either directly or indirectly within their environment [27]. In a multiagent environment, agents must have some method of estimating the impact other agents have on the system. Typically, there are two forms of multiagent systems: cooperative systems, where agents share common goals and work cooperatively to achieve those goals, and competitive systems, where agents do not share common goals and try to outperform other agents in the system [28].

Multiagent systems can offer several distinctive advantages over single-agent systems, also referred to as centralized systems, if the domain is appropriate for multiagent application. For example, multiagent systems can break down complex tasks into simpler tasks running in parallel [25]. In cases where controllers may be modelled as agents in control systems, multiagent control systems (decentralized control systems) can increase the overall robustness of the system and prevent catastrophic failure if any single controller should fail [25]. In other cases, the domain simply requires a multiagent approach [25]. In terms of application, multiagent systems have been applied to a wide range of problems. One example is a domain known as the Predator vs. Prey domain, which is also sometimes referred to as the Pursuit domain [25]. In the Predator vs. Prey domain introduced by Benda et al., four predatory agents must capture a prey agent [29]. The Predator vs. Prey domain is not typically classified as a complex, real-world problem; however, it helps convey concepts and details some of the complexities of coordination among agents in a system [25]. Multiagent systems have also been applied to domains such as multiplayer video games and robotic soccer where several international competitions have been held [25, 30–32]. Moving beyond these conceptual and game-based applications, multiagent systems have also been applied to several real-world problems including: air traffic control (both manned and unmanned), human-robot interaction, and multiagent space exploration [33–37]. In this work, we focus on cooperative, multiagent domains in which agents share

common goals and must work as a team to accomplish those goals. Specifically, we focus on problems involving multiagent path planning. In the next section, we discuss multiagent credit evaluation which is a key component of facilitating coordination among autonomous agents in a cooperative environment.

2.4 Multiagent Credit Evaluation

In the field of multiagent systems there are two main difficulties: promoting coordination among the agents in the system, so they do not work in opposition with one another, and dealing with the increased learning complexity caused by individual credit assessments being drowned out by the actions of other agents in the system [37]. Both of these problems pertaining to multiagent systems can be addressed by carefully choosing the reward signal used by the agents in the system. When it comes to multiagent credit evaluation, there are many strategies used to assign credit to agents; however, in this work, we focus on the following three techniques: Local Evaluations, Global Evaluations, and Difference Evaluations. Local Evaluations produce local rewards, L(s), which are a highly individualized reward signal in which an agent receives a reward based on its personal performance in the system. Although this approach can work well in simple, or competitive, systems, Local Evaluations often encourage selfish behavior in agents which, in a team setting, can lead to agents working against each other causing a decrease in overall system

performance [38]. Particularly in large multiagent systems, this may cause agents to behave inefficiently in a cooperative setting and not contribute positively to the overall system performance. The domain used in this work is a cooperative, multiagent system; therefore, Local Evaluations are not suitable as an evaluation technique in this setting.

Global Evaluations produce global rewards, G(s), which use the overall system performance as the reward signal for each agent. This approach prioritizes the overall system performance over the performance of individual agents. The formula for calculating a global reward is summarized in Eqn. 2.3.

G(s) = Σ_{i=1}^{n} R_i(s)    (2.3)

In Eqn. 2.3, G(s) represents the global reward at state s, and R_i(s) represents the reward generated by agent i for reaching state s. In certain cases it may be desirable to place heavier emphasis on system performance. However, if the system features a large number of agents, the reward signal can become diluted by the actions of other agents making it nearly impossible for individual agents to assess their impact on system performance [38]. Although this downfall is the opposite problem of Local Evaluations, the result is the same: agents behave inefficiently resulting in lowered system performance.

Difference Evaluations try to incorporate the better parts of Local Evaluations and

Global Evaluations. Difference Evaluations utilize shaped reward signals which enable individual agents to learn how their actions affect overall system performance [39]. Using this approach, actions which increase an agent's reward value simultaneously increase the overall system performance and discourage agents from working at cross-purposes [38]. The reward evaluation is defined below in Eqn. 2.4.

D_i(s) = G(s) − G_i(s)    (2.4)

In Eqn. 2.4, D_i(s) represents the difference reward gained by agent i for reaching state s, G(s) represents the global reward, and G_i(s) represents the global reward if agent i were completely removed from the system. In this work, we focus on using Difference Evaluations in conjunction with MAMCTS. We also compare the performance of Difference Evaluations and Global Evaluations within MAMCTS since both techniques measure rewards in terms of system performance. In addition to the proposed multiagent enhancement of MCTS, this work also proposes utilizing a Genetic Algorithm to improve the rollout policy of MAMCTS. In the following section, we introduce Genetic Algorithms.
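The counterfactual in Eqn. 2.4 can be computed directly from any system-level evaluation function, as in the minimal sketch below; `system_eval` stands in for a domain-specific implementation of G(s) and is an assumption, not part of the thesis code.

    def difference_reward(system_eval, agent_states, i):
        # Eqn. 2.4: D_i(s) = G(s) - G_i(s). `system_eval` computes the
        # global reward G(s) of Eqn. 2.3 from the joint state, and G_i(s)
        # is the same evaluation with agent i removed from the system.
        g = system_eval(agent_states)
        g_without_i = system_eval(agent_states[:i] + agent_states[i + 1:])
        return g - g_without_i

    # For a purely additive system, G(s) is simply the sum of the
    # per-agent rewards, matching Eqn. 2.3.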

2.5 Genetic Algorithms

Genetic Algorithms (GA) are a type of probabilistic search algorithm which function in search spaces that can be broken down into numeric strings [40]. Unlike a standard

hillclimber, which traverses a search space starting from a single state, GAs explore a larger portion of the search space from multiple initial states by maintaining a population of individuals which evolve over generations based on performance scores, also known as fitness values [41, 42]. The actual method in which the population evolves largely mimics natural selection; in other words, individuals in the population which are more fit to survive pass on their “genetic” information to offspring in the form of genetic crossover and mutation [41, 42]. The performance of a GA is strongly influenced by the manner in which crossover and mutation are conducted, and there are many variants of GAs which are tuned to perform well in certain types of problems [41, 42]. How parents are selected also has a strong impact on GA performance. Some standard methods of selecting parents for crossover include: fitness proportional selection, linear ranking, uniform ranking, and random selection [43, 44].

2.5.1 GA with (µ + λ) Selection

In this work, we use a GA featuring (µ + λ) selection to optimize the fitness function used to evaluate individuals in the GA population. The specifics of the GA used in this work will be discussed further in Chapter 3. With (µ + λ) selection, the GA has two primary populations of sizes µ and λ. The parent population keeps track of the µ best individual policies discovered so far, and the offspring population consists of

λ individuals generated from the parent population through crossover and mutation operations [43–45]. As mentioned earlier, the GA in this work is used to train a rollout policy for use in MAMCTS. In addition to pre-training rollout policies, this work also proposes transferring rollout policies, which are trained on a relatively simple multiagent system, to a more complex multiagent system. By transferring rollout policies, knowledge gained from the simple domain can be of use in the more difficult domain, and the long run-times associated with policy training on larger systems can be avoided. This process is known as Transfer Learning which is introduced in the next section.
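A single generation of (µ + λ) selection might look like the sketch below. The `fitness`, `crossover`, and `mutate` callables are placeholders for the operators described in Chapter 3, and fitness proportional selection of parents is assumed; none of the names are taken from the thesis code.

    import random

    def mu_plus_lambda_step(parents, mu, lam, fitness, crossover, mutate):
        # Build lambda offspring from the parent population, then keep the
        # mu fittest individuals from the combined parent + offspring pool.
        scores = [fitness(p) for p in parents]
        offspring = []
        for _ in range(lam):
            # fitness proportional selection of two parents
            a, b = random.choices(parents, weights=scores, k=2)
            offspring.append(mutate(crossover(a, b)))
        combined = parents + offspring
        combined.sort(key=fitness, reverse=True)
        return combined[:mu]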

2.6 Transfer Learning

Transfer Learning describes the transfer of knowledge from one domain, or system, to another [15]. There are numerous examples of transfer learning which can be observed in a person’s everyday life. For example, learning to recognize a particular variety of fruit may help one recognize other varieties of fruit, or learning to play one type of musical instrument could help one learn a second instrument [46]. The study of Transfer Learning is of particular interest to the field of machine learning, and it has been categorized under a variety of different names including: life-long learning, knowledge transfer, multi-task learning, and meta learning [46, 47]. Within the research area of Transfer Learning, there are three primary research questions:

what should be transferred, how should it be transferred, and when should it be transferred [46]. Another important aspect to take into account is what factors affect transferability. Knowledge that is transferred between systems and domains must be generalizable [48]. As the “distance” between the training problem and the target problem grows, that is, as the two problems become more dissimilar, the transferability of knowledge is adversely affected. However, in certain situations, the transfer of knowledge between distant tasks can be better than having no transfer of knowledge at all [48]. Transfer Learning is often utilized to try to improve either learning speed or system performance within a target domain [47]. For these applications, there are several methods for measuring success; for example, if the goal is to improve learning time, one might measure the training time needed to reach a certain performance threshold when using a transferred policy π_T in place of a completely untrained initial policy, π_i [49]. The application of transfer learning covers a diverse range of problems, and it is particularly relevant to the field of machine learning. Some examples of Transfer Learning applications include: pattern and facial recognition, the translation of languages (particularly webpages), creating a robust game AI, and even the development of improved spam filters [46, 50–52]. Transfer Learning has even been applied to MCTS. In their work, Powley et al. propose a new enhancement for MCTS which they call the Information Capture and ReUse Strategy (ICARUS) which can be used to analyze the impact of combining different MCTS enhancements [53]. The fundamental

idea behind ICARUS is that it can collect information from visits to one part of the search tree, and then use that information to determine which enhancements may be beneficial in other areas of the tree [53]. In this work, we focus on the transferal of a trained rollout policy, π_GA, that contains a generalized set of actions which can be applied to more complex test domains. We then show that the trained rollout policy outperforms the default rollout policy, π_D, used in MCTS.

2.7 Related Works

In this section, we discuss works which are similar in scope to the content presented in this work. This section is divided into several categories for clarity. Those categories include: MCTS used in Multiagent Systems (section 2.7.1), MCTS Rollout Policy Enhancement (section 2.7.2), and MCTS used in Non-Game Domains (section 2.7.3).

2.7.1 MCTS in Multiagent Systems

Starting with the application of MCTS in multiagent domains, there are several works which bear similarities to this work. In 2011, Marcolino and Matsubara proposed a multiagent adaptation of MCTS with UCT used to develop strong players in Go. In their multiagent approach to MCTS in Go, an agent is selected from among a database of agents to make a move in Go, then, when it is time to make the next

move, a new agent is selected [54]. The logic behind this strategy is that having a bank of agents with different preferences in strategy will allow for a more natural game [54]. Interestingly, Marcolino found that using all the agents in their databank produced a weaker Go player than the standard Fuego program, which is a highly rated Go AI program. To combat this, Marcolino utilized both a simple hillclimber and simulated annealing to choose a set of agents that would produce a stronger Go player. By implementing these optimization algorithms, the authors found that their approach had a significant improvement over traditional Fuego for both the hillclimber strategy and simulated annealing. The authors conclude that implementing a multiagent approach to MCTS greatly improved the performance of the Go agent [54]. Branching out from Go, Szita et al. utilized MCTS augmented with limited domain knowledge in the multiplayer game Settlers of Catan. In this multiagent environment, a single agent using MCTS plays against three other default game AI programmed specifically for Settlers of Catan. In their work, Szita et al. found that the MCTS agent had greater strength and consistently outperformed the other AI used in the experiment [5]. That being said, MCTS did not perform well when playing against an experienced human opponent. The authors believe that this is due to the fact that human players often employ a mix of two dominant strategies while the MCTS agent typically focuses on a single strategy [5]. This work demonstrates the versatility of MCTS, and it also demonstrates that MCTS can be successfully applied to a competitive multiagent domain. This work differs in its approach in several major ways. First, instead of a single MCTS agent competing against other non-MCTS agents, all

agents utilize MAMCTS in a cooperative, multiagent environment. Additionally, this work explores the effects of modifying the rollout policy to avoid completely random simulations in favor of a deterministic rollout policy.

Also applying MCTS to a multiplayer game, Branavan et al. implement a non-linear regression within MCTS to allow agents to generalize between related states and actions [10]. This allows MCTS to gain more accurate evaluations using a smaller number of rollouts [10]. The authors applied their strategy in the multiplayer game of Civilization II, and they found that their agent using MCTS was able to outperform the default game AIs over 78% of the time when competing against several AI simultaneously [10]. This work is interesting because it also shows a successful application of MCTS to a competitive multiagent domain, but it also demonstrates how MCTS can be improved by enhancing the evaluation techniques used. In this application, the use of non-linear regression is an enhancement made to the calculation of action values and not to the rollout policy itself. Instead of enhancing the action value estimations within MCTS, this thesis work develops an evolved rollout policy which is transferable to other multiagent systems.

2.7.2 MCTS Rollout Policy Enhancement

In addition to examples of MCTS being applied to multiagent systems, there are also several noteworthy examples of works where authors attempt to modify the rollout policy in MCTS. In their 2011 paper, Robles et al. proposed using Temporal

Difference Learning to improve the reliability of the rollout policy in MCTS. Robles et al. sought to improve the performance of MCTS being used as an AI in a well-known board game called Othello [6]. To do this, the authors use Temporal Difference Learning to learn a linear function approximator which they then use to bias the selection of actions within the search tree [6]. Based on the results of their research, Robles et al. conclude that, although a “perfect” Othello agent has yet to be found, incorporating Temporal Difference Learning did improve the reliability of the rollout function and the overall performance of MCTS in Othello [6]. The work presented by Robles et al. is, conceptually, very similar to this thesis work. Both works utilize learning algorithms in an attempt to improve the performance of MCTS. This thesis work differs in that we propose using a GA instead of Temporal Difference Learning. There are, however, other works which do also use a GA for this purpose. Using genetic programming, Alhejali and Lucas attempt to improve the performance of an agent using MCTS in the game Ms. Pac-Man [8]. Using a GA to evolve agent rollout policies, Alhejali and Lucas were able to develop a MCTS-based agent for use in Ms. Pac-Man which can outperform a MCTS-based agent using the default rollout policy. The authors demonstrate that agents using the trained rollout policy outperformed agents using the default rollout policy by as much as 18% [8]. That being said, the authors did note that MCTS-based agents using the default rollout policy did perform at the same level as an agent using an evolved rollout policy when given a large enough number of rollouts in the search space [8]. The authors also noted that one setback to their approach is that implementing a GA with MCTS

drastically increases the run-time of the program which forced them to reduce the overall number of evaluations in the program [8]. In this thesis work, it was also found that using a GA to train MCTS rollout policies resulted in a significantly long run-time (approximately one hour per statistical run). This long run-time issue is discussed in further detail in Chapter 5. To combat the drawback of increased run-times when trying to pre-evolve a rollout policy, Lucas et al. worked to develop a fast evolutionary adaptation for MCTS which would improve the algorithm's performance in real-time video game and control problems [12]. Lucas et al. then test their algorithm using the Mountain Car RL benchmark and a simplified version of Space Invaders [12]. In their work, Lucas et al. show that evolving a small number of weights which bias the default rollout policy while MCTS is running can drastically reduce the run-time while still offering a better performance over the unbiased default policy. The authors did note, however, that for larger domains with higher branching factors, this approach still required a significant run-time [12]. Although each of these examples of MCTS rollout enhancement is similar to the methods used in this work, all of the examples seen in the literature have been applied to game-related domains. In this work, we focus on applying these methods to a non-game related domain.


2.7.3 MCTS in Non-Game Domains

One major non-game application of MCTS falls under the category of combinatorial optimization problems [1]. The set of problems dealing with combinatorial optimization is very diverse and ranges from assessing vulnerabilities in image-based authentication systems to famous combinatorial problems such as the Travelling Salesman Problem [1, 55, 56]. Problems like the Travelling Salesman Problem (TSP) are specifically known as shortest path problems and are very common testbeds for evolutionary computation because they are easy to describe yet difficult to solve [57]. MCTS has been successfully applied to several of these problems including the Canadian Traveller Problem which is a variation of the TSP in which certain edges and paths are blocked [58]. Additionally, Kocsis and Szepesvári show that MCTS using the UCB1 enhancement can be used in a sailboat domain in which a sailboat must find the shortest path between two points when subjected to varying wind conditions [24]. Several other examples of MCTS being used in combinatorial optimization problems include: mixed integer programming, physics simulations such as inverted pendulum and bicycle problems, and Lipschitz function approximation [1, 22, 59, 60]. Another popular non-game application of MCTS is in scheduling optimization problems [1]. Scheduling optimization problems include problems involving time management, resource management, and route planning. MCTS has been applied to many scheduling optimization problems including: a Monte Carlo random walk planner,

a mean-based heuristic search for anytime planning, printer scheduling, production management problems, and bus regulation [1, 61–65]. There have even been some applications of MCTS in the procedural generation of content such as art, music, and language processing [1, 66, 67]. Another interesting example comes from a work introduced by Mahlmann et al. where the authors use MCTS to generate content for strategy games such as Civilization and Starcraft [1, 68]. This last example is included in the list of non-game applications because MCTS is being used to generate digital content for a game and not being used to develop a strong AI player for game play.

In this chapter, we discussed relevant background information crucial to the understanding of the content presented in this work including: Monte Carlo Tree Search, Multiagent Systems, Genetic Algorithms, and Transfer Learning. In Chapter 3, the methods utilized in this work are discussed in detail, and the frameworks of both algorithms (MAMCTS and the GA with (µ + λ) selection) are introduced.


Chapter 3

Methodology

In this chapter, we introduce the proposed framework for Multiagent Monte Carlo Tree Search (MAMCTS). The MAMCTS algorithm itself is outlined in section 3.1. In section 3.2 we discuss the GA with (µ + λ) selection used to train rollout policies which improve agent decision making for agents using MAMCTS. Finally, in section 3.3, we summarize the logistics of how transfer learning is incorporated within this framework and how trained rollout policies are transferred to different multiagent systems.

3.1 Multiagent Monte Carlo Tree Search

In this work, we propose a specialized, multiagent variant of MCTS which combines MCTS with multiagent credit evaluation. We call this MCTS enhancement

Multiagent Monte Carlo Tree Search (MAMCTS). The specific multiagent credit evaluation technique used in this work is Difference Evaluations; however, the algorithm is structured in such a way that any multiagent credit evaluation technique can be incorporated into the algorithm. MAMCTS is summarized in Algorithm 1.

In MAMCTS, each agent, a, performs I_max iterations of MCTS using the UCB1 enhancement. After each agent has completed a single cycle of MCTS, selection through back-propagation, all agents perform a system rollout from the root node. The first step in the system rollout is to calculate the global reward for the system. To determine the global reward, each agent selects actions based on their action values, Q(s,a), stored in each node of the tree. When each agent has reached a terminal node, or an unexpanded node, the overall system performance is measured and recorded as the global reward. Next, each agent is reset to the root position in the tree, and the difference reward for each agent is calculated. For each agent, a, a simulation is conducted to see how the system would perform without agent a. The simulation without agent a is then used to generate G_i(s) from Eqn. 2.4. Using G(s) and G_i(s), the difference reward for each agent is calculated. Finally, the difference reward is used to update the action value of the terminal node in each agent's tree. After the system rollout is complete, the action values along the decision path in each agent's tree are updated all the way back to the root node during a second back-propagation phase. This process continues until the agents have found a solution, or until a specified cutoff point is reached.

Algorithm 1 Multiagent Monte Carlo Tree Search
    while Termination = False do
        for a < n_agents do
            for i < I_max do
                Selection
                Expansion
                Simulation
                Back_Propagation(a)
            end for
        end for
        System_Rollout(n_agents)
        for a < n_agents do
            Back_Propagation(a)
        end for
        Check_For_Termination
        if Termination = True then
            Break
        end if
    end while
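For clarity, the System_Rollout step of Algorithm 1 is sketched below. The agent and tree interfaces (greedy descent from the root, updating the terminal node) and the `system_eval` function are assumptions made for illustration, not the thesis implementation.

    def system_rollout(agents, system_eval):
        # Each agent greedily follows its stored action values from the
        # root of its own tree to a terminal (or unexpanded) node.
        terminal_states = [agent.greedy_descent_from_root() for agent in agents]
        # Overall system performance gives the global reward G(s).
        g = system_eval(terminal_states)
        for i, agent in enumerate(agents):
            # Counterfactual performance with agent i removed, G_i(s).
            g_without_i = system_eval(terminal_states[:i] + terminal_states[i + 1:])
            # Credit the terminal node with the difference reward D_i(s).
            agent.update_terminal_value(g - g_without_i)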

3.2 Genetic Algorithm

To evolve the rollout policy of MAMCTS, we use a GA with (µ + λ) selection. The rollout policy itself is a set of consecutive actions carried out by an agent. Therefore, the GA contains a population of binary strings, and each binary string maps to a list of numerically designated actions. The fitness function used to evaluate each rollout policy in the GA population is based on the performance of MAMCTS using the evolved rollout policy. The three performance measurements used are: reliability, the number of successful runs out of the total number of runs; speed, how many iterations it takes, on average, for a solution to be found; and agent cost, how many actions agents take, on average, to complete an objective. The fitness function is further

defined in Eqn. 3.1.

F_i = N_S/N_T + (I_max − I_A)/I_max + (S_max − S_A)/S_max    (3.1)

In Eqn. 3.1, N_T represents the total number of runs of MAMCTS conducted, N_S represents the number of successful runs, I_max represents the maximum number of iterations of MAMCTS that may be performed before cutoff, I_A represents the average number of iterations needed to find a solution, S_max represents an imposed maximum number of consecutive actions which may be taken by an agent, and S_A represents the average number of actions needed for agents to find a solution to the problem. Each performance measure in the fitness function is expressed as a fraction, which means that the fitness value is bounded between 0 and 3. The upper bound of 3 is not attainable in practice because I_A and S_A cannot be 0: in any given system, a solution cannot be found in zero iterations of MAMCTS or with agents taking zero actions, on average, to complete an objective.

The application of the GA is summarized in Algorithm 2. The GA starts by creating two populations: the parent population, which is of size µ, and the offspring population, which is of size λ. After creating the two populations, the parent population is evaluated to assess the initial fitness value for each policy in the population. Each individual in the parent population is tested against a specified number of test cases (sets of initial conditions) to measure the performance of each

policy over a number of different scenarios. Testing against a number of different configurations allows for a general policy to be learned instead of a policy that is specific to any one set of initial conditions. After an individual policy, j, has been tested against all configurations, the overall fitness of the policy is evaluated using Eqn. 3.1. After the parent population has been evaluated, parents are selected via fitness proportional selection and the offspring population is created through crossover and mutation. After λ individuals have been created using crossover, each policy is mutated based on the defined mutation rate. For the offspring population, each individual policy is tested against the same test configurations just like the parent population. After each individual in the offspring population has been tested and evaluated, the parent population and offspring population are combined and ranked, from highest to lowest, in terms of their fitness values. The top µ individuals in the population then become the new parent population. A new offspring population is then created and tested. This process continues until the maximum number of generations has been reached.

Algorithm 2 Genetic Algorithm with (µ + λ) Selection
    Create_Initial_Populations(µ, λ)
    for i < max_generations do
        if i == 0 then
            for j < µ do
                Rollout_Policy ← population.at(j)
                for x < n_configurations do
                    Run_MAMCTS
                    Measure_Fitness(x_i)
                end for
                Evaluate_Fitness(j)
            end for
        end if
        Create_Offspring_Population(λ)
        for j < λ do
            Rollout_Policy ← new_population.at(j)
            for x < n_configurations do
                Run_MAMCTS
                Measure_Fitness(x_i)
            end for
            Evaluate_Fitness(j)
        end for
        Combine_Pops(µ + λ)
        Rank_Individuals
        Create_Parent_Population(µ)
    end for
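As a concrete reading of Eqn. 3.1, the sketch below computes a policy's fitness from the three performance measures gathered over the test configurations; the function and argument names are illustrative, not taken from the thesis code.

    def rollout_policy_fitness(n_success, n_total,
                               avg_iterations, max_iterations,
                               avg_steps, max_steps):
        # Each term lies in [0, 1], so the fitness is bounded above by 3;
        # the bound is never reached because the average iteration and
        # step counts cannot be zero.
        reliability = n_success / n_total
        speed = (max_iterations - avg_iterations) / max_iterations
        agent_cost = (max_steps - avg_steps) / max_steps
        return reliability + speed + agent_cost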

3.3 Rollout Policy Transfer

At the end of each statistical run of the GA, the policy with the highest fitness value is recorded. The fitness value of each best policy is also recorded in a separate file. This allows the trained rollout policies to be imported when running tests of MAMCTS in larger, more complex multiagent systems. During tests of MAMCTS, only the recorded policy with the highest fitness value is used. An example of a possible trained rollout policy can be seen in Eqn. 3.2. Each number in π

would correspond with a potential action an agent would take during rollout. In this example, the rollout policy would consist of 10 consecutive actions. After the last action in the sequence is taken, an estimated action value would be assigned to the node the agent initiated rollout at, and then back-propagation would occur.

Rollout Policy = π = [0, 1, 1, 3, 1, 4, 4, 2, 1, 0]    (3.2)
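The mapping from a GA individual to a rollout policy like the one in Eqn. 3.2 could be implemented as below. The 3-bits-per-action encoding and the modulo wrap are assumptions made for illustration, since the thesis only specifies that each binary string maps to a list of numerically designated actions.

    def decode_rollout_policy(bitstring, n_actions=5, bits_per_action=3):
        # Translate a binary string into a sequence of action indices
        # (e.g. 0-4 for the five MAG moves: up, down, left, right, stay).
        policy = []
        for start in range(0, len(bitstring), bits_per_action):
            chunk = bitstring[start:start + bits_per_action]
            policy.append(int(chunk, 2) % n_actions)
        return policy

    # decode_rollout_policy("000001001011001100100010001000")
    # -> [0, 1, 1, 3, 1, 4, 4, 2, 1, 0], the policy shown in Eqn. 3.2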

In this chapter, we discussed the methods utilized in this work. In section 3.1, the MAMCTS algorithm was summarized. In section 3.2 the fitness function used to evaluate rollout policies was defined, and the GA with (µ + λ) selection was outlined. Finally, in section 3.3, the logistics of transferring the evolved rollout policy were discussed. In Chapter 4, we introduce Multiagent Gridworld which is the domain used for testing in this work, and we describe the setup and procedure of each test.


Chapter 4

Test Domain and Setup

In this chapter, we introduce the testing domain used in this work: a cooperative, multiagent domain known as Multiagent Gridworld (MAG). The MAG domain used in this work is similar to a multi-rover domain seen in some of the literature for multiagent systems. Both the multi-rover and MAG domains are discussed in section 4.1. The test setup and test procedure used in this work are summarized in section 4.2.

4.1 Multiagent Gridworld

In this section we introduce the multi-rover domain as a point of reference to describe the MAG domain. We then outline MAG and discuss the differences between it and the multi-rover domain.


4.1.1 Multi-Rover Domain

In a multi-rover domain, several rovers, or agents, are tasked with exploring points of interest (PoI) at fixed positions on a two dimensional plane [13, 14]. Each PoI on the plane has a specific reward value associated with it which is factored into the rover’s reward score when the rover observes that PoI. The actual reward received by a rover is a function of the value of the PoI and also the rover’s observational distance [13, 14]. In terms of the overall system reward, no additional points are earned for additional observations of a PoI, made by a rover, which do not contribute new information. In other words, only the closest observations of a PoI made by a single rover are counted towards the total system reward. Since it is only the closest observation of a PoI which is counted towards the overall system score, rovers must have a way to prioritize the exploration of PoIs which they are closer to compared with other rovers in the system. Therefore, any method used to control a rover’s decision making in this domain must be able to take into account an individual rover’s contribution to the overall system reward as well as the actions of other rovers in the system.

4.1.2 Multiagent Gridworld Domain

Similar to the multi-rover domain, MAG is also a two-dimensional environment containing PoIs, goals, at fixed positions on the grid. Unlike the multi-rover domain,

agents in MAG have more stringent rules governing their movement. In the multi-rover domain, rovers are given a full range of motion in two dimensions, meaning they can move along any vector composed of x and y components. In MAG, agents can only move one space per turn and only in cardinal directions (up, down, left, or right). Agents can also choose to remain stationary. Furthermore, the boundaries of MAG are treated as an impassable wall. Any agent attempting to move beyond these boundaries is “pushed” back onto the grid. In addition to the rules governing agent movement, agents in MAG also cannot cohabit a single state. In MAG, n agents and n goals are assigned unique, random starting coordinates. Similar to the multi-rover domain, agents are tasked with locating goals (PoIs) within MAG; however, agent and system rewards are calculated differently in MAG. In the multi-rover domain, each PoI is worth a certain number of points, and a rover’s reward value is calculated as a function of the PoI’s value and also the rover’s observational distance from that PoI. In MAG, agents do not observe PoIs and then move on to observe additional PoIs; instead, each agent is responsible for capturing an individual goal. Each goal in MAG is worth the same number of points when captured. An agent’s reward score becomes a function of how many steps (actions) it took the agent to reach the goal from its initial state, and whether or not that goal has been captured by another agent. Agents do not receive any points for reaching a goal which has already been captured; in fact, they are penalized for doing so. The overall system reward is defined in Eqn. 4.1. In this equation, Rsys represents the total system reward, ncap represents the number

of goals captured by a single agent, RGoal represents the reward given to an agent for capturing a goal, n represents the total number of agents and goals in the system, and Ns represents the number of steps taken by an agent, i.

Rsys = ncap ∗ RGoal − Σ_{i=1}^{n} Ns,i        (4.1)
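Eqn. 4.1 reads directly as code. The short sketch below assumes a per-run record of how many goals were captured and how many steps each agent took; the goal reward value of 100 is taken from the parameter table given later in this chapter (Table 4.1).

def system_reward(n_cap, steps_per_agent, r_goal=100):
    # Eqn. 4.1: reward for captured goals minus the total number of steps taken.
    # With the step penalty of 1 used in Table 4.1, the step cost is simply the
    # summed step count.
    return n_cap * r_goal - sum(steps_per_agent)

# Three agents, each capturing its own goal in 4, 6, and 5 steps respectively.
print(system_reward(n_cap=3, steps_per_agent=[4, 6, 5]))  # 3 * 100 - 15 = 285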

Agents in MAG must learn optimal paths to goals, but they also need to learn which goal to capture such that the overall system performance is optimal. Similar to the multi-rover domain, this requires that any method used to control an agent’s decision making must be able to distinguish between the individual agent’s contribution to the system performance and the influence other agents have on system performance. Without such a system in place, multiple agents may attempt to capture the same goal which, in a real-world scenario, would waste system resources resulting in poor system performance. An illustration of a 4x4 MAG with three agents and three goals is presented in Fig. 4.1. Figure 4.1 shows one possible, optimal solution to this system. This figure also demonstrates some of the complexity of this cooperative, multiagent domain. In the lower left quadrant of the grid, agent A3 is equidistant from two goals: G1 and G3. If agent A3 were trying to minimize the individual number of steps it must take, it could optimally solve the problem by pursuing either goal. However, it is in the best interests of the overall system performance for it to pursue goal G3 so that agent A1 in the upper left quadrant can capture goal G1.
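The movement rules described above (one cardinal step or staying put, impassable borders, and no two agents in one cell) translate into a small state-transition function. The sketch below is an assumed illustration of those rules, not the simulator used in this work; in particular, refusing a move into an occupied cell is one possible way of enforcing the no-cohabitation rule.

MOVES = {"stay": (0, 0), "up": (0, 1), "down": (0, -1),
         "left": (-1, 0), "right": (1, 0)}

def step(agent_pos, action, grid_size, occupied):
    # Apply one MAG move: clamp to the grid (the agent is "pushed" back at the
    # border) and refuse moves into a cell already occupied by another agent.
    dx, dy = MOVES[action]
    x = min(max(agent_pos[0] + dx, 0), grid_size - 1)
    y = min(max(agent_pos[1] + dy, 0), grid_size - 1)
    return agent_pos if (x, y) in occupied else (x, y)

# A 4x4 grid: moving left from the left edge leaves the agent in place.
print(step((0, 2), "left", 4, occupied={(1, 2)}))   # (0, 2)
print(step((0, 2), "right", 4, occupied={(2, 2)}))  # (1, 2)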


Figure 4.1: A 4 x 4 MAG featuring three agents and three goals. A possible, optimal solution to this system is represented by arrows.

4.2

Test Setup

In this section, we discuss the test setup and procedure used in this work. First, the control variables and test parameters are introduced, then, the rollout policy training procedure is explained, and, finally, the procedure for testing in MAG is outlined.

4.2.1

Parameters

In this work, there are several parameters used by both the GA and MAMCTS. These parameters are control variables which are kept consistent throughout all tests. For example, the exploration modifier, ϵ, used in the calculation of QUCB1 (Eqn. 2.2), is utilized by both the GA and MAMCTS proper. Other important parameters include: the number of rollout steps, SR, which determines the number of steps taken by an agent during each simulation; the goal reward, RGoal, which determines the reward value received by an individual agent for being the first to capture a goal; the rollout reward, RRollout, which determines the reward an agent receives during simulation for reaching a free goal state; the penalty, P, which is a negative reward value incurred by an agent for reaching a goal that has already been captured; and the step penalty, PStep, which represents the cost incurred for each step taken by an agent in MAG. These parameters are summarized in Table 4.1 along with their associated values. The values of the parameters in Table 4.1 were determined experimentally over numerous trials. This particular set of parameters has, so far, produced the best results.

Table 4.1: Global test parameters used during testing.

Parameter              Symbol     Value
Exploration Modifier   ϵ          7
Rollout Steps          SR         10
Goal Reward            RGoal      100
Rollout Reward         RRollout   1
Penalty                P          100
Step Penalty           PStep      1
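For reference, the global parameters of Table 4.1 can be collected in a single configuration structure. The dictionary below simply restates the table and implies nothing about how the original code stores these values.

GLOBAL_PARAMS = {
    "exploration_modifier": 7,   # epsilon in QUCB1 (Eqn. 2.2)
    "rollout_steps": 10,         # SR: actions taken per simulation
    "goal_reward": 100,          # RGoal: reward for being first to capture a goal
    "rollout_reward": 1,         # RRollout: reaching a free goal during rollout
    "penalty": 100,              # P: reaching an already-captured goal
    "step_penalty": 1,           # PStep: cost per step taken in MAG
}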

4.2.2

Rollout Policy Training

Before tests are conducted using MAMCTS, a rollout policy is trained using the GA with (µ + λ) selection. The training occurs on a 5x5 MAG with five agents and five goals, which is the maximum number of agents and goals used on a 5x5 MAG. This particular configuration is used during rollout policy training because it represents the most complex multiagent system for a 5x5 MAG. In other words, training should occur on a configuration which is more difficult to solve in order to create a more robust policy. During training, each individual in the GA’s population is tested against a number of different test configurations, each containing a unique set of initial conditions for the system. That is to say, a specified number of sets of different starting coordinates for agents and goals is used during training. Each policy’s performance is averaged across the total number of test configurations. This allows a more generalizable policy to be learned instead of a policy that only performs well on a specific set of initial conditions. The other parameters used specifically by the GA include the following: the parent population size, µ; the offspring population size, λ; the probability of crossover, Pcross; the probability of mutation, Pmut; the maximum number of generations, Genmax; and the number of testing configurations, Ncon. These parameters, and their associated values, are summarized in Table 4.2.

Table 4.2: Testing parameters used by the GA with (µ + λ) selection.

Parameter                    Symbol    Value
Parent Population Size       µ         54
Offspring Population Size    λ         54
Probability of Crossover     Pcross    0.95
Probability of Mutation      Pmut      0.9
Number of Generations        Genmax    50
Number of Configurations     Ncon      10

As was the case with the global test parameters, the values of the GA parameters

were experimentally determined. The relatively small population size and number of generations are necessary to keep the training run-time at a reasonable level. Due to the small population size and low number of generations, high rates of crossover and mutation are needed to encourage a faster rate of exploration through the search space of possible rollout policies. During rollout policy training, two separate policies are developed: one for MAMCTS with Difference Evaluations and one for MAMCTS with Global Evaluations. Training separate policies allows for a proper comparison of the two multiagent credit evaluation schemes used with MAMCTS.

4.2.3

Multiagent Gridworld Testing

After policy training is complete, each trained policy, πGA, is compared with the default policy, πD, for both Difference Evaluations and Global Evaluations. The first test is conducted on a 5x5 MAG, which is the same size of MAG on which training is conducted. During each test, each credit evaluation scheme is tested for 30 statistical runs with each policy. This means that four pairs are tested in each test: Difference Evaluations with πD, Difference Evaluations with πGA, Global Evaluations with πD, and Global Evaluations with πGA. For each statistical run, a unique set of starting coordinates (initial conditions) is selected from a bank of randomly generated configurations. Storing these configurations in a bank ensures that each multiagent credit evaluation

technique and rollout policy pair is tested against the same set of initial conditions in each statistical run. When testing the rollout policies, a maximum number of iterations is defined as a cutoff point. In these tests, an iteration is defined as a full execution of the while loop in Algorithm 1. If the maximum iteration count is reached before a solution is found, the statistical run is terminated and counted as a failed run. For the 5x5 MAG, two tests are run: a test with a maximum iteration count of 1,000 and a test with a maximum iteration count of 5,000. In addition to testing on a 5x5 MAG, tests are also run on 8x8, 10x10, and 20x20 MAGs. For each of these tests on larger MAGs, the rollout policies trained on a 5x5 MAG are transferred over and compared with the default rollout policy. For each size of MAG, the same test procedure is used, and two tests are run: 1,000 max iterations and 5,000 max iterations. The only exception is that, for the 20x20 MAG, a 10,000 max iteration test is run instead of a 5,000 max iteration test.
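The bank of starting configurations described above can be produced by sampling unique, non-overlapping coordinates for the n agents and n goals of each run. The generator below is a hedged sketch of that idea; the original work may build and store its bank differently.

import random

def make_configuration(n, grid_size, rng):
    # Sample 2n unique cells: the first n are agent starts, the rest are goals.
    cells = rng.sample([(x, y) for x in range(grid_size) for y in range(grid_size)], 2 * n)
    return cells[:n], cells[n:]

def make_bank(n_runs, n, grid_size, seed=0):
    # A fixed seed ensures every evaluation-scheme/policy pair sees the same runs.
    rng = random.Random(seed)
    return [make_configuration(n, grid_size, rng) for _ in range(n_runs)]

bank = make_bank(n_runs=30, n=5, grid_size=5)
agent_starts, goal_positions = bank[0]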


Chapter 5

Experimental Results

In this chapter, we summarize the results of this work. Section 5.1 summarizes the results of tests conducted in MAG including tests on the 5x5 MAG (section 5.1.1), the 8x8 MAG (section 5.1.2), the 10x10 MAG (section 5.1.3), and the 20x20 MAG (section 5.1.4). Finally, the chapter is concluded with an in-depth discussion of the results in section 5.2.

5.1

Multiagent Gridworld Problem

For the MAG problem, a separate rollout policy was trained for MAMCTS with both Difference Evaluations and Global Evaluations on a 5x5 MAG. After each statistical run of the GA, the rollout policy with the highest fitness was recorded to a text file, and those policies were then imported into MAMCTS to be compared with the

default rollout policy. In the remainder of this chapter, the default rollout policy will be referred to as πD and the GA trained rollout policy will be referred to as πGA . To measure the performance of MAMCTS in MAG, we utilize the following metrics: reliability, speed, and cost. Reliability is defined as the total number of successful statistical runs out of 30 total statistical runs. A successful run is defined as a run in which a solution was found within the maximum number of iterations. Speed is defined as the average number of iterations it took all agents in the system to find a free goal, and cost is defined as the average number of steps it took the last free agent to reach the last free goal.
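These three metrics can be computed from per-run records of whether a solution was found, how many iterations it took, and the step count of the last agent. The sketch below is one possible bookkeeping; how failed runs enter the averages is an assumption here, since the text does not specify it.

def summarize(runs, max_iterations):
    # runs: list of (solved, iterations, last_agent_steps) tuples, one per statistical run.
    # Reliability counts successful runs; speed and cost are averaged over successful
    # runs only (an assumption made for this sketch).
    solved = [r for r in runs if r[0]]
    reliability = len(solved)  # out of 30 statistical runs
    speed = sum(r[1] for r in solved) / len(solved) if solved else max_iterations
    cost = sum(r[2] for r in solved) / len(solved) if solved else float("nan")
    return reliability, speed, cost

runs = [(True, 420, 12), (True, 610, 15), (False, 1000, 0)]
print(summarize(runs, max_iterations=1000))  # (2, 515.0, 13.5)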

5.1.1

5x5 Multiagent Gridworld

In the 5x5 MAG tests, we compare the performance of Difference Evaluations using πGA and πD and Global Evaluations using πGA and πD . A set of 30 different initial starting positions for agents and goals is used, resulting in 30 different statistical runs for each multiagent system size. Both the 1,000 max iteration test and the 5,000 max iteration test used the same set of initial conditions for each statistical run. For the 5x5 MAG, multiagent systems were tested starting with two agents and increasing incrementally up to five agents. The results of these tests are shown in Fig. 5.1 - Fig. 5.3. Based on the results of these tests, we note several interesting patterns which emerge. First, we observe that, for all metrics, Difference Evaluations outperform Global

Evaluations using both πD and each evaluation technique's respective πGA . These results are consistent with examples seen in the multiagent credit evaluation literature, discussed in Chapter 2, where Global Evaluations are unable to accommodate multiple agents acting in a system due to the reward signal noise generated by their actions [27, 38, 39]. This result also confirms that MAMCTS can be paired with Difference Evaluations to effectively control agent decision making in a cooperative, multiagent environment. This result will be discussed in greater detail in section 5.2.

Figure 5.1 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Reliability of MAMCTS on a 5x5 MAG. Results are displayed as the total number of successful statistical runs out of thirty.

Figure 5.2 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max; series: DE Default, DE GA, GE Default, GE GA]: Average number of iterations needed for a solution (100% goal capture) to be found on a 5x5 MAG.

Figure 5.3 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Average maximum agent cost for MAMCTS-agents on a 5x5 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.

Another interesting result can be seen in the comparison of MAMCTS with Difference Evaluations using πD and MAMCTS with Difference Evaluations using πGA . The results indicate that πGA outperforms πD in all metrics for both the 1,000 max iteration test and the 5,000 max iteration test. We observe that, for the 1,000 max iteration tests, πGA was 40.8% more reliable, found solutions in 39.2% fewer iterations, and was 33.2% more cost effective than πD . These results confirm that MAMCTS with Difference Evaluations can effectively control agent decision making in a cooperative, multiagent system. Furthermore, the results indicate that pre-training a rollout policy enables agents to make more optimal decisions in terms of overall system performance than they would using the default rollout policy. In the 5,000 max iteration test, we observe that πD begins to catch up with πGA in terms of performance. This result is consistent with examples seen in the literature

such as the work presented by Alhejali and Lucas who note that, given enough time to perform more rollouts, the default rollout policy performed at or near the same level as the evolved policy with their MCTS-based, Ms. Pac-Man agent [8]. Although the performance of πD improved greatly in the 5,000 max iteration trial, πGA still has a superior performance. It was found that πGA is 23.4% more reliable, finds solutions in 30.7% fewer iterations, and is 19.7% more cost effective, on average, compared to πD in this multiagent system. In addition to πGA outperforming πD in the 5,000 max iteration test, the results also indicate that πGA in the 1,000 max iteration test also outperforms πD in the 5,000 max iteration test. Examining the results from πGA in the 1,000 max iteration test and πD in the 5,000 max iteration test, we see that πGA is approximately 20.0% more reliable, discovers solutions in 18.2% fewer iterations, and is 17.2% more cost effective. The results of both the 1,000 max iteration test and 5,000 max iteration test are summarized side-by-side in Table 5.1.

Table 5.1: MAMCTS with Difference Evaluations system performance using πD and πGA in a 5x5 MAG.

Metric        Max Iterations   Default Policy (πD)   Trained Policy (πGA)
Reliability   1,000            55.0%                 95.8%
Reliability   5,000            75.8%                 99.2%
Speed         1,000            43.2%                 82.1%
Speed         5,000            63.9%                 94.6%
Cost          1,000            54.4%                 21.2%
Cost          5,000            38.4%                 18.7%

Figure 5.4 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Reliability of MAMCTS on a 8x8 MAG. Results are displayed as the total number of successful statistical runs out of thirty.

5.1.2

8x8 Multiagent Gridworld

In this section we explore the results of transferring πGA , which was trained on a 5x5 MAG, to an 8x8 MAG. As was the case for the 5x5 MAG, a 1,000 max iteration test and a 5,000 max iteration test were performed. For the 8x8 MAG, the multiagent system starts with a two agent system and then increases to a ten agent system in increments of two. The results of these tests are illustrated in Fig. 5.4 - Fig. 5.6. As was the case with the 5x5 MAG, the results indicate that Global Evaluations perform very poorly in this multiagent system using both πD and πGA . Again, this result is consistent with examples from the literature. With the increased size of an 8x8 MAG and the added complexity of having more agents in the system, MAMCTS with Difference Evaluations also does not perform as well compared to the results of tests conducted on the 5x5 MAG. However, it is noted that πGA still has a significantly better performance than πD for both the 1,000 max iteration test and the 5,000 max

iteration test. The results of MAMCTS with Difference Evaluations using πD and πGA are summarized in Table 5.2.

Figure 5.5 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Average number of iterations needed for a solution (100% goal capture) to be found on a 8x8 MAG.

Figure 5.6 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Average maximum agent cost for MAMCTS-agents on a 8x8 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.

Table 5.2: MAMCTS with Difference Evaluations system performance using πD and πGA in a 8x8 MAG.

Metric        Max Iterations   Default Policy (πD)   GA Policy (πGA)
Reliability   1,000            22.0%                 34.7%
Reliability   5,000            30.7%                 46.7%
Speed         1,000            16.6%                 26.1%
Speed         5,000            25.3%                 39.7%
Cost          1,000            82.7%                 71.8%
Cost          5,000            76.7%                 63.3%

Examining the results presented in Table 5.2, for the 1,000 max iteration test, it is observed that πGA is 12.7% more reliable, finds solutions in 9.5% fewer iterations, and is 10.9% more cost effective than πD . Again, these results indicate that MAMCTS

with Difference Evaluations can be successfully implemented within a cooperative, multiagent system to control agent decision making. Additionally, these results indicate that a rollout policy trained on a comparatively simple multiagent system can be transferred to a more complex multiagent system and still yield a better system performance than relying on the default rollout policy. Similar to the results of tests run on the 5x5 MAG, the results of the 5,000 max iteration test on the 8x8 MAG indicates improvement in overall system performance for both πD and πGA . Furthermore, πGA still exhibits a better performance compared to πD . Specifically, in the 5,000 max iteration test, πGA is 16.0% more reliable, finds solutions in 14.3% fewer iterations, and is 13.4% more cost effective than πD . Although the overall system performance improves between the 1,000 max iteration test and the 5,000 max iteration test, that improvement is not as significant as the improvement observed in the results of the 5x5 MAG tests. However, the 8x8 MAG is approximately 2.5 times larger than the 5x5 MAG, and the maximum number of agents operating in the system has doubled. Despite the added complexity of this system, the performance of πGA in the 1,000 max iteration test still outperforms πD

in the 5,000 max iteration test. This result further demonstrates the value of training a rollout policy on a comparatively simple multiagent system and transferring that knowledge to more complex systems. It is believed that the drop in overall system performance between the 5x5 MAG and the 8x8 MAG is a direct result of the large increase in the overall size of the state-space as well as the added complexity created from adding more agents to the system. The effects of increasing sizes of MAG and increasing numbers of agents will be discussed further in section 5.2.

5.1.3

10x10 Multiagent Gridworld

In this section, the results of increasing the size of MAG from 8x8 to 10x10 are examined. As was the case for the 5x5 MAG, a 1,000 max iteration test and a 5,000 max iteration test were performed. In the 10x10 MAG, the system starts with a two agent system and then increases to a ten agent system in increments of two. The results of these tests are illustrated in Fig. 5.7 - Fig. 5.9. Based on the information presented in Fig. 5.7 - Fig. 5.9, we can see that the results for the 10x10 MAG are very similar to the results seen in the 8x8 MAG. Global Evaluations still perform very poorly in this system, and the behavior observed from MAMCTS with Difference Evaluations is very similar to the behavior witnessed in the 8x8 MAG. The metrics are also similar when comparing Difference Evaluations using πD and Difference Evaluations using πGA in both the 8x8 MAG and the 10x10

MAG. For the 1,000 max iteration test, πGA is approximately 4.6% more reliable, discovers solutions in 4.1% fewer iterations, and is 4.0% more cost effective than πD . For the 5,000 max iteration test, πGA is 10.0% more reliable, discovers solutions in 7.5% fewer iterations, and is 8.6% more cost effective than πD . These results are summarized in Table 5.3.

Figure 5.7 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Reliability of MAMCTS on a 10x10 MAG. Results are displayed as the total number of successful statistical runs out of thirty.

Figure 5.8 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Average number of iterations needed for a solution (100% goal capture) to be found on a 10x10 MAG.

Figure 5.9 [(a) 1,000 Iteration Max; (b) 5,000 Iteration Max]: Average maximum agent cost for MAMCTS-agents on a 10x10 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.

Table 5.3: MAMCTS with Difference Evaluations system performance using πD and πGA in a 10x10 MAG.

Metric        Max Iterations   Default Policy (πD)   GA Policy (πGA)
Reliability   1,000            14.7%                 19.3%
Reliability   5,000            22.7%                 32.7%
Speed         1,000            8.7%                  12.8%
Speed         5,000            17.7%                 25.2%
Cost          1,000            88.3%                 84.3%
Cost          5,000            83.1%                 74.5%

The results presented in Table 5.3 show a new trend in the data that was not seen in previous tests: πGA in the 1,000 max iteration test no longer has a stronger performance than πD in the 5,000 max iteration test. We believe this is a result of the increasing complexity in the system as the overall size of MAG is scaled up. This complexity is discussed in greater detail in section 5.2.


5.1.4

20x20 Multiagent Gridworld

For both the 8x8 MAG and the 10x10 MAG, it was observed that transferring πGA trained on the 5x5 MAG produced very similar results. Specifically, the overall system performance was not as strong as in the 5x5 MAG; however, πGA still demonstrated that it was more effective than πD in dealing with a more complex multiagent system. To test the limits of this transfer learning approach, we transfer πGA to a 20x20 MAG which is four times larger than the 10x10 MAG, over six times larger than the 8x8 MAG, and 16 times larger than the 5x5 MAG. Most of the test procedures are the same as those used in previous tests; however, instead of comparing 1,000 max iterations to 5,000 max iterations, we compare 1,000 max iterations to 10,000 max iterations in this test. The change from 5,000 to 10,000 max iterations was made due to there being no substantial difference in system and agent performance between the 1,000 max iteration and the 5,000 max iteration tests during initial testing. The results of the tests run on the 20x20 MAG are illustrated in Fig. 5.10 - Fig. 5.12. The results shown in Fig. 5.10 - Fig. 5.12 indicate that the complexity of the 20x20 MAG appears to pose a significant challenge for MAMCTS with Difference Evaluations. When comparing Difference Evaluations using πD to Difference Evaluations using πGA , we observe that the system performance is significantly reduced compared to the previous tests conducted in MAG. For the 1,000 max iteration test, πGA is 2.00% more reliable, finds solutions in 1.06% fewer iterations, and is 1.8% more cost effective than πD . The 10,000 max iteration test shows a marginal improvement in

performance for both πGA and πD compared with each policy's performance in the 1,000 max iteration test. Additionally, the results indicate that the performances of πD and πGA are nearly identical in this test. The results of the 10,000 max iteration test reveal that πGA is 0.68% more reliable, finds solutions in 0.56% fewer iterations, and is 0.7% more cost effective than πD . These results are outlined in Table 5.4.

Figure 5.10 [(a) 1,000 Iteration Max; (b) 10,000 Iteration Max]: Reliability of MAMCTS on a 20x20 MAG. Results are displayed as the total number of successful statistical runs out of thirty.

Figure 5.11 [(a) 1,000 Iteration Max; (b) 10,000 Iteration Max]: Average number of iterations needed for a solution (100% goal capture) to be found on a 20x20 MAG.

Figure 5.12 [(a) 1,000 Iteration Max; (b) 10,000 Iteration Max]: Average maximum agent cost for MAMCTS-agents on a 20x20 MAG. Results are displayed as the average number of steps taken by the last free agent to reach the final free goal.

Table 5.4: MAMCTS with Difference Evaluations system performance using πD and πGA in a 20x20 MAG.

Metric        Max Iterations   Default Policy (πD)   GA Policy (πGA)
Reliability   1,000            2.67%                 4.67%
Reliability   10,000           5.33%                 6.01%
Speed         1,000            0.96%                 2.02%
Speed         10,000           3.63%                 4.19%
Cost          1,000            97.7%                 95.9%
Cost          10,000           95.6%                 94.9%

5.2

Discussion

One result seen consistently throughout testing is that MAMCTS with Difference Evaluations vastly outperforms MAMCTS with Global Evaluations in the MAG domain. When using Global Evaluations, solutions in two agent systems are occasionally found, but it is extremely unreliable. Beyond the two agent system, MAMCTS with Global Evaluations fails completely for both πD and πGA . We believe there are several reasons for this result. One reason is that using Global Evaluations makes it

difficult for agents to assess their individual impact on the system due to the noise generated by the actions of other agents in the system. On the other hand, Difference Evaluations are evaluated in such a way that actions which increase an agent's reward value simultaneously increase the system's reward score. This allows agents to choose actions which have positive impacts on system performance. This issue with Global Evaluations is well documented in the multiagent credit evaluation literature discussed in Chapter 2 [27, 38, 39]. A second explanation of Global Evaluations' poor performance in MAG is that there are inherent challenges within MAG which Difference Evaluations can overcome but Global Evaluations cannot. Within the MAG domain, there are many actions which are equally valuable to other actions in the domain. We believe that this overabundance of equally valuable actions coupled with the reward signal noise generated from multiple agents in the system makes it nearly impossible for MAMCTS with Global Evaluations to control agent decision making effectively. These difficulties become even more severe as MAG scales in size. When the size of MAG and the number of agents in the system increase, the results show that MAMCTS with Difference Evaluations still performs well; however, there is a noticeable drop-off in performance that occurs when the system is scaled up in size. Consider that, from most states within MAG, an agent has five potential moves it can make: up, down, left, right, and remain stationary. In MAMCTS, certain actions within the search tree are pruned; additionally, there are fewer actions available to agents when they are near the border of MAG. For that reason, the number of nodes added in each expansion of the tree is assumed to be

four nodes on average. Assuming that the maximum depth of the tree is X + Y levels, where X and Y represent the dimensions of MAG, this means that a fully expanded tree for an agent in a 5x5 MAG would contain roughly 349,525 nodes. That number increases to 1.43 × 10^9 nodes in an 8x8 MAG, 3.66 × 10^11 nodes in a 10x10 MAG, and 4.03 × 10^23 nodes in a 20x20 MAG (the arithmetic behind these counts is reproduced in the short sketch at the end of this section). Also taking into account that each agent in the system has a unique decision tree, it is clear that the complexity of the system increases immensely as the size of MAG and the number of agents in MAG increase. This massive, exponential increase in system complexity also helps to explain the long run times associated with using a GA to train rollout policies. Similar to the works discussed in Chapter 2 by Alhejali and Lucas, we found that training rollout policies on a 5x5 MAG took roughly one hour per statistical run to complete [8, 12]. Although implementing πGA within the same size of MAG it was trained on led to a large increase in system performance, attempting to train rollout policies on larger sizes of MAG with more agents in the system would be prohibitively time expensive. For this reason, this work focused on training rollout policies in comparatively simple multiagent systems and then transferring them to larger, more complex systems. Fortunately, the results shown in this chapter indicate that this Transfer Learning approach results in better system performance than only relying on the default rollout policy. Furthermore, the results indicate that, when the max iteration cap was increased, πGA shows a considerable performance increase as well. This trend suggests that allowing MAMCTS with Difference Evaluations to run for a longer amount of

iterations may overcome the disadvantage of not being able to train rollout policies on larger MAGs. Despite the inherent complexities present within MAG, MAMCTS with Difference Evaluations delivers an impressive performance. When paired with πGA , MAMCTS with Difference Evaluations delivers near 100% reliability within the 5x5 MAG, in addition to having an extremely low agent cost (within 25% of the estimated maximum cost). On a more general scale, these results also indicate that MAMCTS with Difference Evaluations can be successfully applied to a cooperative, multiagent system, in a non-game setting, to control agent decision making.
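The node counts quoted earlier in this section follow from summing a branching factor of four over X + Y levels of the tree; the few lines below reproduce that geometric-series arithmetic for the four grid sizes discussed.

def full_tree_nodes(x, y, branching=4):
    # Nodes in a tree with `branching` children per node and x + y levels:
    # the sum of branching**k for k = 0 .. x + y - 1 (a geometric series).
    depth = x + y
    return (branching ** depth - 1) // (branching - 1)

for size in (5, 8, 10, 20):
    print(size, full_tree_nodes(size, size))
# 5  -> 349,525          8  -> ~1.43 x 10^9
# 10 -> ~3.66 x 10^11    20 -> ~4.03 x 10^23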


Chapter 6

Conclusions and Future Work

With this thesis work, we demonstrated that MAMCTS with Difference Evaluations can be successfully applied to a cooperative, multiagent system as a means to control agent decision making leading to optimal system performance in multiagent path planning problems. In addition, we also demonstrated that MAMCTS using Global Evaluations was unable to overcome the complexity associated with the MAG domain; however, MAMCTS with Difference Evaluations did overcome these difficulties and delivered a vastly superior performance compared to Global Evaluations. This work also shows that a better rollout policy can be trained using a GA with (µ + λ) selection. Although the training process is time expensive, the results of this work show that the trained rollout policy offers as much as a 37.6% increase in overall system performance, compared to the default rollout policy, when applied within the same domain it was trained on.

Finally, this work demonstrates that, by transferring the trained rollout policy to a more complex multiagent system, we can improve reliability as much as 16.0%, we can improve solution discovery speed by as much as 14.4%, and we can reduce agent cost as much as 13.4% compared with using the default rollout policy. Furthermore, this Transfer Learning approach circumvents the need to spend hours (possibly days) training rollout policies on these large, complex systems.

6.1

Future Work

This work was largely presented as a proof of concept; however, with the successful outcome of this research, we suggest several areas for future work with this concept. One potential area for research would be to examine how MAMCTS would perform in a MAG with obstacles, such as walls, embedded in the plane. This may help decrease the number of actions which are equally valuable in the domain; however, the added difficulty of the obstacles may pose new and unforeseen challenges. When considering a new domain, a good next step would be to apply MAMCTS to a multi-rover domain. As discussed earlier in this work, MAG is similar to a multi-rover domain, but different in terms of reward value calculations and agent/rover movement. Applying MAMCTS to a multi-rover domain may reveal interesting insights in how the algorithm adapts to a problem with slightly different mechanics.

Both the MAG domain and the multi-rover domain involve multiagent path planning. Continuing on with this theme, one possible future application of this research could be determining how to run wires for spacecraft electronics in the most efficient way possible so that the overall mass of the materials used is reduced. Due to the large cost per pound to send objects into space, any optimization in wire planning could potentially save thousands of dollars. In such an application, it could be possible to treat each wire (or group of wires) as an agent which must find the shortest efficient path to the appropriate connection point. Another potential area for application of this research is air traffic control. Airspace across the planet is becoming increasingly crowded, particularly with the rising popularity of drones and other unmanned aerial vehicles. Treating the airspace as a large, cooperative, multiagent system could allow this research to be used within this context. Similarly, it may also be possible to apply MAMCTS with Difference Evaluations to control or guide decision making in autonomous vehicles. Cars travelling along roadways share a common set of laws which govern what actions are legal and which actions are illegal. Furthermore, these vehicles share common goals: to travel from point A to point B as quickly and safely as possible. For these reasons, it may be possible to model roadway travel as a cooperative, multiagent system so that MAMCTS can be used to assist autonomous vehicles with decision making.


Bibliography [1] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012. [2] Sylvain Gelly, Yizao Wang, Olivier Teytaud, Modification Uct Patterns, and Projet Tao. Modification of uct with patterns in monte-carlo go. 2006. [3] Yngvi Bjornsson and Hilmar Finnsson. Cadiaplayer: A simulation-based general game player. IEEE Transactions on Computational Intelligence and AI in Games, 1(1):4–15, 2009. [4] Guillaume Chaslot, Sander Bakkes, Istvan Szita, and Pieter Spronck. Montecarlo tree search: A new framework for game ai. In AIIDE. AAAI, 2008. [5] István Szita, Guillaume Chaslot, and Pieter Spronck. Monte-carlo tree search in settlers of catan. In Advances in Computer Games, pages 21–32. Springer, 2009.

64 [6] David Robles, Philipp Rohlfshagen, and Simon M Lucas. Learning non-random moves for playing othello: Improving monte carlo tree search. In Computational Intelligence and Games (CIG), 2011 IEEE Conference on, pages 305–312. IEEE, 2011. [7] Marc JV Ponsen, Geert Gerritsen, and Guillaume Chaslot. Integrating opponent models with monte-carlo tree search in poker. In Interactive Decision Theory and Game Theory, 2010. [8] Atif M Alhejali and Simon M Lucas. Using genetic programming to evolve heuristics for a monte carlo tree search ms pac-man agent. In Computational Intelligence in Games (CIG), 2013 IEEE Conference on, pages 1–8. IEEE, 2013. [9] Spyridon Samothrakis, David Robles, and Simon Lucas. Fast approximate maxn monte carlo tree search for ms pac-man. IEEE Transactions on Computational Intelligence and AI in Games, 3(2):142–154, 2011. [10] SRK Branavan, David Silver, and Regina Barzilay. Non-linear monte-carlo search in civilization ii. In IJCAI, pages 2404–2410, 2011. [11] Sylvain Gelly and David Silver. Achieving master level play in 9 x 9 computer go. In AAAI, volume 8, pages 1537–1540, 2008. [12] Simon M Lucas, Spyridon Samothrakis, and Diego Perez. Fast evolutionary adaptation for monte carlo tree search. In European Conference on the Applications of Evolutionary Computation, pages 349–360. Springer, 2014.

65 [13] Adrian Agogino and Kagan Tumer. Efficient evaluation functions for multi-rover systems. In Genetic and Evolutionary Computation Conference, pages 1–11. Springer, 2004. [14] Adrian K Agogino and Kagan Tumer. Analyzing and visualizing multiagent rewards in dynamic and stochastic domains. Autonomous Agents and MultiAgent Systems, 17(2):320–338, 2008. [15] Matthew E Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633–1685, 2009. [16] Paul E. Black. Nist dictionary of algorithms and data structures, 1999. URL https://xlinux.nist.gov/dads//HTML/tree.html. Accessed: 1/31/2018. [17] Richard E Korf. Depth-first iterative-deepening: An optimal admissible tree search. Artificial intelligence, 27(1):97–109, 1985. [18] Gordon Goetsch et al. A quantitative study of search methods and the effect of constraint satisfaction. 1984. [19] mycodeschool. Binary tree traversal - breadth-first and depth-first strategies, Feb 2014. URL https://www.youtube.com/watch?v=9RHO6jU--GU. [20] Sylvain Gelly and David Silver. Monte-carlo tree search and rapid action value estimation in computer go. Artificial Intelligence, 175(11):1856–1875, 2011.

66 [21] David Silver and Joel Veness. Monte-carlo planning in large pomdps. In Advances in neural information processing systems, pages 2164–2172, 2010. [22] Pierre-Arnaud Coquelin and Rémi Munos. Bandit algorithms for tree search. arXiv preprint cs/0703062, 2007. [23] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002. [24] Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. In ECML, volume 6, pages 282–293. Springer, 2006. [25] Peter Stone and Manuela Veloso. Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, 8(3):345–383, 2000. [26] Gerhard Weiss. Multiagent systems: a modern approach to distributed artificial intelligence. MIT press, 1999. [27] Michael Wooldridge. An introduction to multiagent systems. John Wiley & Sons, 2009. [28] Wiebe Van der Hoek and Michael Wooldridge. Multi-agent systems. Foundations of Artificial Intelligence, 3:887–928, 2008. [29] Miroslav Benda. On optimal cooperation of knowledge sources. Technical Report BCS-G2010-28, 1985.

67 [30] Michael K Sahota, Alan K Mackworth, Stewart J Kingdon, and Rod A Barman. Real-time control of soccer-playing robots using off-board vision: the dynamite testbed. In Systems, Man and Cybernetics, 1995. Intelligent Systems for the 21st Century., IEEE International Conference on, volume 4, pages 3690–3693. IEEE, 1995. [31] Hiroaki Kitano, Minoru Asada, Yasuo Kuniyoshi, Itsuki Noda, Eiichi Osawai, and Hitoshi Matsubara. Robocup: A challenge problem for ai and robotics. In Robot Soccer World Cup, pages 1–19. Springer, 1997. [32] A Sanderson.

Micro-robot world cup soccer tournament (mirosot). IEEE Robotics & Automation Magazine, 4(4):15–15, 1997. [33] Anand S Rao, Michael P Georgeff, et al. Bdi agents: from theory to practice. In ICMAS, volume 95, pages 312–319, 1995. [34] Scott Forer and Logan Yliniemi. Monopolies can exist in unmanned airspace. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1105–1112. ACM, 2017. [35] Logan Yliniemi, Adrian K Agogino, and Kagan Tumer. Evolutionary agent-based simulation of the introduction of new technologies in air traffic management. In Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, pages 1215–1222. ACM, 2014.

68 [36] J Alfredo Sanchez, Flavio S Azevedo, and John J Leggett. Paragente: Exploring the issues in agent-based user interfaces. In ICMAS, pages 320–327, 1995. [37] Logan Yliniemi, Adrian K Agogino, and Kagan Tumer. Multirobot coordination for space exploration. AI Magazine, 35(4):61–74, 2014. [38] Sam Devlin, Logan Yliniemi, Daniel Kudenko, and Kagan Tumer. Potentialbased difference rewards for multiagent reinforcement learning. In Proceedings of the 2014 international conference on Autonomous agents and multi-agent systems, pages 165–172. International Foundation for Autonomous Agents and Multiagent Systems, 2014. [39] Logan Yliniemi and Kagan Tumer. Multi-objective multiagent credit assignment through difference rewards in reinforcement learning. In Asia-Pacific Conference on Simulated Evolution and Learning, pages 407–418. Springer, 2014. [40] David E Goldberg and John H Holland. Genetic algorithms and machine learning. Machine learning, 3(2):95–99, 1988. [41] John H Holland. Genetic algorithms. Scientific american, 267(1):66–73, 1992. [42] Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998. [43] Thomas Bäck and Frank Hoffmeister. Extended selection mechanisms in genetic algorithms. 1991.

69 [44] Fulya Altiparmak, Mitsuo Gen, Lin Lin, and Turan Paksoy. A genetic algorithm approach for multi-objective optimization of supply chain networks. Computers & industrial engineering, 51(1):196–215, 2006. [45] Thomas Bäck. Generalized convergence models for tournament-and (µ, lambda)selection. 1995. [46] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2010. [47] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012. [48] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014. [49] Matthew E Taylor, Peter Stone, and Yaxin Liu. Value functions for rl-based behavior transfer: A comparative study. In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 880. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, 2005. [50] Xiao Ling, Gui-Rong Xue, Wenyuan Dai, Yun Jiang, Qiang Yang, and Yong Yu. Can chinese web pages be classified with english data source? In Proceedings of the 17th international conference on World Wide Web, pages 969–978. ACM, 2008.

70 [51] Gregory Kuhlmann and Peter Stone. Graph-based domain mapping for transfer learning in general games. In European Conference on Machine Learning, pages 188–200. Springer, 2007. [52] Steffen Bickel.

Discovery challenge. URL http://www.ecmlpkdd2006.org/challenge.html. [53] Edward J Powley, Peter I Cowling, and Daniel Whitehouse. Information capture and reuse strategies in monte carlo tree search, with applications to games of hidden information. Artificial Intelligence, 217:92–116, 2014. [54] Leandro Soriano Marcolino and Hitoshi Matsubara. Multi-agent monte carlo go. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pages 21–28. International Foundation for Autonomous Agents and Multiagent Systems, 2011. [55] Yasuhiro Tanabe, Kazuki Yoshizoe, and Hideki Imai. A study on security evaluation methodology for image-based biometrics authentication systems. In Biometrics: Theory, Applications, and Systems, 2009. BTAS’09. IEEE 3rd International Conference on, pages 1–6. IEEE, 2009. [56] Arpad Rimmel, Fabien Teytaud, and Tristan Cazenave. Optimization of the nested monte-carlo algorithm on the traveling salesman problem with time windows. In European Conference on the Applications of Evolutionary Computation, pages 501–510. Springer, 2011.

71 [57] Pedro Larranaga, Cindy M. H. Kuijpers, Roberto H. Murga, Inaki Inza, and Sejla Dizdarevic. Genetic algorithms for the travelling salesman problem: A review of representations and operators. Artificial Intelligence Review, 13(2):129–170, 1999. [58] Zahy Bnaya, Ariel Felner, Dror Fried, Olga Maksin, and Solomon Eyal Shimony. Repeated-task canadian traveler problem. In Fourth Annual Symposium on Combinatorial Search, 2011. [59] Ashish Sabharwal, Horst Samulowitz, and Chandra Reddy. Guiding combinatorial optimization with uct. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pages 356–361. Springer, 2012. [60] Christopher R Mansley, Ari Weinstein, and Michael L Littman. Sample-based planning for continuous action markov decision processes. In ICAPS, 2011. [61] Hootan Nakhost and Martin Müller. Monte-carlo exploration for deterministic planning. In IJCAI, volume 9, pages 1766–1771, 2009. [62] Damien Pellier, Bruno Bouzy, and Marc Métivier. An uct approach for anytime agent-based planning. In Advances in Practical Applications of Agents and Multiagent Systems, pages 211–220. Springer, 2010.

72 [63] Shimpei Matsumoto, Noriaki Hirosue, Kyohei Itonaga, Kazuma Yokoo, and Hisatomo Futahashi. Evaluation of simulation strategy on single-player montecarlo tree search and its discussion for a practical scheduling problem. In Proceedings of the International MultiConference of Engineers and Computer Scientists, volume 3, pages 2086–2091, 2010. [64] GMJB Chaslot, Steven De Jong, Jahn-Takeshi Saito, and JWHM Uiterwijk. Monte-carlo tree search in production management problems. In Proceedings of the 18th BeNeLux Conference on Artificial Intelligence, pages 91–98, 2006. [65] Tristan Cazenave, Flavien Balbo, Suzanne Pinson, et al. Monte-carlo bus regulation. In 12th international IEEE conference on intelligent transportation systems, pages 340–345, 2009. [66] Cameron Browne. Towards mcts for creative domains. In ICCC, pages 96–101, 2011. [67] Jonathan Chevelu, Thomas Lavergne, Yves Lepage, and Thierry Moudenc. Introduction of a new paraphrase generation tool based on monte-carlo sampling. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 249–252. Association for Computational Linguistics, 2009. [68] Tobias Mahlmann, Julian Togelius, and Georgios N Yannakakis. Towards procedural strategy game generation: Evolving complementary unit types. In European Conference on the Applications of Evolutionary Computation, pages 93–102. Springer, 2011.