On the classification of maze problems

Anthony J. Bagnall and Zhanna V. Zatuchna
University of East Anglia, Norwich, NR4 7TJ, England
[email protected] [email protected]
1 Introduction
A maze is a grid-like two-dimensional area of any size, usually rectangular, composed of cells. A cell is an elementary maze item, a formally bounded space interpreted as a single site. The maze may contain any number of obstacles, and some cells may be significant for learning purposes, such as virtual food. The agent is placed at random on an empty cell and may move in all directions, but only through empty space. The task is to learn a policy that reaches the food as quickly as possible from any square. Once the food is reached, the agent is reset to a random position and the task is repeated.

Maze environments have been widely used as testbed problems in machine learning research [4], especially in the learning classifier system literature [27, 13, 2, 1]. Several approaches to measuring problem complexity for learning agents have been proposed [23, 8], some of them specific to the maze problem domain [15, 25]. The aim of this paper is, through a thorough survey and analysis of research using alternative maze structures, to define metrics that quantify the complexity of maze problems. To this end we collected 50 different mazes, 44 of which have been used in at least 55 publications. Mazes have frequently been given different names in different publications. Where possible, we assign a maze the name most commonly used in the literature. If there is no commonly accepted name, the maze is named after the author of the first publication referring to it, suffixed by the year of publication. The techniques applied to mazes can be categorised as follows: XCS [27, 11, 12, 7] (13 mazes in 17 papers); ZCS [26, 5, 7, 1] (7 mazes in 10 papers); ACS [22, 3, 18] (18 mazes in 11 papers); and other methods, such as the Witness algorithm [4], Q-learning with added memory [14] and ATNoSFERES [10] (25 mazes in 20 papers). A more complete description can be found in [28].

The characteristics of mazes that determine the complexity of the learning task fall into two classes: those that are independent of the learning algorithm, such as the number of squares and the density of obstacles (described in Section 2), and those that are an artifact of the agent's ability to correctly detect its current state, such as the number of aliasing squares. In Section 3 we examine the effect of agent perception on problem complexity and describe how alternative types of aliasing may affect the complexity of mazes. Generalization and noise problems in maze environments are considered in Section 4. Section 5 outlines areas of future work, and Section 6 summarizes our conclusions.
2 Agent independent maze attributes
Size. The number of cells a maze contains obviously affects its complexity. The mazes considered range from 18 cells (Cassandra4 [4], Fig. 1(a)) to 1044 cells (Woods7 [26]). Mazes with fewer than 50 cells are classified as small (19 mazes). Medium mazes, such as MiyazakiC [19] (Fig. 1(b)), have between 50 and 100 cells, and large mazes have more than 100 cells. We denote the number of cells of a maze m as s_m.
Fig. 1. (a) Cassandra4, s_m = 18, φ_m = 1.33; (b) MiyazakiC, s_m = 64, φ_m = 3.37
Distance to food. The average distance to food, φ_m, is an important characteristic of maze complexity: the larger the value, the more difficult the maze. Among the mazes considered it varies from φ_m = 1.29 for Koza92 [9] to φ_m = 14.58 for Nevison-maze3 [20]. We classify a maze as having a short distance to food if φ_m ≤ 5, a medium distance if 5 < φ_m < 10 and a long distance if φ_m ≥ 10.
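For concreteness, φ_m can be computed with a multi-source breadth-first search seeded at the food cells. The following Python sketch assumes a simple grid encoding ('.' empty, '#' wall, 'F' food) and 8-directional movement, as is usual in Woods-style mazes; partitions and other object types are not modelled, and the function name and encoding are our own illustration rather than part of any system described here.

```python
from collections import deque

def avg_distance_to_food(grid):
    """Compute phi_m: the mean shortest-path length from every empty
    cell to the nearest food cell, using 8-connected movement.
    Unreachable cells are silently excluded."""
    rows, cols = len(grid), len(grid[0])
    dist, queue = {}, deque()
    # Seed the BFS at every food cell simultaneously.
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 'F':
                dist[(r, c)] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols \
                        and grid[nr][nc] != '#' and (nr, nc) not in dist:
                    dist[(nr, nc)] = dist[(r, c)] + 1
                    queue.append((nr, nc))
    empty = [d for (r, c), d in dist.items() if grid[r][c] == '.']
    return sum(empty) / len(empty)
```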
Fig. 2. (a) Russell&Norvig, o_m = 10, δ_m = 0.48; (b) Gerard-Sigaud-2000, o_m = 12, δ_m = 0.71
Obstacles. Mazes may contain walls, partitions or both. A wall is a complete cell that the agent cannot occupy or see through, whereas a partition is a barrier between cells. For example, the Russell&Norvig maze [21] (Fig. 2(a)) is a wall maze, Gerard-Sigaud-2000 [6] (Fig. 2(b)) is a partition maze, and MiyazakiC (Fig. 1(b)) is a wall-and-partition maze. Mazes like E2 (Fig. 3(b)), which contain only surrounding walls, are empty mazes. The number of obstacles in a maze, o_m, is defined as the total number of internal wall cells, plus the total number of partitions, plus half the total number of surrounding wall cells. Thus, for mazes with a surrounding wall, we adjust the number of obstacles to allow for the fact that a surrounding wall cell may only ever present an obstacle from a single direction.
Density. The density of a maze is the proportion of obstacles to squares, δ_m = o_m / s_m. A maze is spacious when δ_m ≤ 0.3 and restricted when δ_m ≥ 0.6; mazes with intermediate values of δ_m are of average density. Spacious mazes, for example Woods7, may be extremely difficult for an agent because the maze does not provide enough reference points for the agent to distinguish the environment state. Toroidal mazes are mazes without a border of obstacles; under this measure, a toroidal maze is classified as more difficult than the same maze enclosed by a wall.

We consider size, density and distance to food to be the most important agent-independent characteristics affecting complexity. Other features of mazes that influence the complexity of the problem include:

Type of objects. In addition to a target (food) state, some mazes contain a penalizing state, such as an enemy. For example, the Russell&Norvig maze has an enemy marked E. Enemy and enemy+food mazes present a different learning problem to food mazes, and have been used only by Russell and Norvig [21] and Littman [16]. Mazes may also have different types of obstacles as well as different kinds of food (Woods2 [27]). The number of types of object affects the agent's ability to perceive its environment, and hence influences the number of aliasing states.

Maze dynamics. Some mazes involve cells which change state. For example, a multi-agent maze has cells that are sometimes empty and sometimes occupied. A dynamic maze with a moving enemy has an uncertain position for the negatively rewarding object. Other mazes, such as Sutton98 [24], may include moving walls. On the whole, there are three sources of non-static mazes: dynamics of indifferent objects (walls), dynamics of principal objects (food/enemy) and multi-agent systems. Dynamic mazes are obviously more difficult than static ones and represent a completely different kind of maze problem.

The complexity of the learning problem is only partially dependent on the physical complexity of the maze. Of perhaps greater importance is the ability of the agent to perceive the environment.
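As an illustration of the agent-independent measures, o_m and δ_m can be computed directly from a grid description. The sketch below handles wall-only mazes (partitions would need a separate count) and applies the half-weight adjustment for surrounding wall cells; it assumes, as a simplification, that s_m counts all cells of the grid, and uses the same hypothetical encoding as above.

```python
def obstacle_count_and_density(grid):
    """Return (o_m, delta_m) for a wall-only maze: interior wall cells
    count 1, surrounding wall cells count 1/2; partitions, if present,
    would add 1 each. s_m is taken here as the total number of cells."""
    rows, cols = len(grid), len(grid[0])
    s_m = rows * cols
    o_m = 0.0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == '#':
                border = r in (0, rows - 1) or c in (0, cols - 1)
                o_m += 0.5 if border else 1.0
    return o_m, o_m / s_m
```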
3 Agent dependent maze attributes
The agent may not be able to distinguish one square from another, despite the fact that they are in different locations, because the environment signals the agent receives in the squares are the same. Cells that appear identical under a particular detector are commonly called aliasing cells, and a maze containing at least two aliasing cells is called an aliasing maze.
Aliasing mazes deserve special emphasis in the context of maze classification because they represent the most difficult class of problem to solve.

In [25] Wilson proposes a scheme for classifying reinforcement learning environments with respect to the sensory capabilities of the agent. An environment belongs to Class 1 if the sensory capabilities of the agent are sufficient to determine the entire state of the environment. In Class 2 environments the agent has only partial information about the true state of the environment. Class 2 environments are said to be partially observable with respect to the agent, or equivalently non-Markov with respect to the agent's actions; accordingly, the agent is said to suffer from the hidden state problem. Littman [15] presents a more formal classification of reinforcement learning environments, based on the simplest agent that can achieve optimal performance. Two parameters, h and β, characterize the complexity of an agent: an (h, β) environment is best solved by an (h, β) agent that uses the input information provided by the environment and at most h bits of local storage to choose the action which maximizes the next β reinforcements. Hence, Class 1 environments correspond to (h = 0, β = 1) and (h = 0, β > 1) environments, while Class 2 environments correspond to (h > 0, β > 1) (non-Markov) environments. Of the 50 mazes considered, 21 are Class 1 mazes and 29 are Class 2. Whilst this classification is useful, there is still a large degree of variation in complexity within Class 2 problems, and the nature of the aliasing may alter the difficulty of the learning problem.

Alternative types of aliasing. In reinforcement learning terminology, the presence of aliasing states is reflected in the characteristics of the transition matrix of the decision problem of an agent in a maze. A transition matrix describes the probability of moving from one state to another for any given action. Mazes with no aliasing squares have the characteristic that for any state-action pair there is one state with a transition probability of 1 (i.e. any action in any square always moves the agent to the same square). We define an aliasing state as one where, for at least one action, the probability of moving to some other state is neither 0 nor 1. An aliasing square is a cell in the maze which is included in an aliasing state; thus, two or more aliasing squares may appear to be a single aliasing state to the agent. However, the complexity of the maze cannot be determined from the transition matrix alone. Some mazes (e.g. Woods2) may produce a transition matrix with uncertainty but still be easily solved by a memoryless agent and are, according to Littman [15], Class 1 environments. The complexity of the problem is determined not only by the uncertainty, but also by the optimal strategy: Woods2 is classified as Class 1 because the optimal strategy in the squares that appear the same is identical.
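This definition translates directly into a check on the transition probabilities. The sketch below is an illustration rather than any published implementation: it assumes the dynamics have been tabulated as a tensor T[s, a, s'], the probability of reaching perceived state s' by taking action a in perceived state s, and flags the states with any transition probability strictly between 0 and 1.

```python
import numpy as np

def aliasing_states(T, tol=1e-9):
    """T: array of shape (S, A, S) of transition probabilities between
    perceived states. A state is aliasing if, for at least one action,
    some transition probability lies strictly between 0 and 1."""
    return [s for s in range(T.shape[0])
            if np.any((T[s] > tol) & (T[s] < 1 - tol))]
```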
The complexity of a maze for an LCS agent with a particular detector can be quantified by how long, on average, an agent using a Q-table trained by the Q-learning algorithm takes to find food, compared to the optimal number of steps to food. If Q-learning can disambiguate all squares then, assuming it has been trained for long enough, it will find the optimal route to food. If, however, its detector introduces aliasing, it will take longer whenever the aliasing affects the optimal strategy. We use a standard version of the Q-learning algorithm, γ = 0.2, α = 0.71, with roulette-wheel action selection in exploration mode and greedy action selection (max Q) in exploitation mode, and n = 20000 trials. Let φ^Q_m be the average steps to food of a trained Q-learning agent that can only detect the surrounding squares. The complexity measure ψ_m is then defined as ψ_m = φ^Q_m / φ_m.

This measure gives us a metric that can quantify the effects of aliasing. For example, mazes E2 [18] (Fig. 3(b)) and Cassandra4x4 [4] (Fig. 3(a)) both have aliasing squares and similar average steps to goal and density values. However, Cassandra4x4 is much easier to solve than E2, because the aliasing squares of Cassandra4x4 do not affect the optimal strategy. This is reflected in their widely different complexity values of ψ_m = 251 for E2 and ψ_m = 1 for Cassandra4x4.
Fig. 3. (a) Cassandra4x4, δ_m = 0.38, φ_m = 2.27, ψ_m = 1; (b) E2, δ_m = 0.25, φ_m = 2.33, ψ_m = 251
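A minimal sketch of the ψ_m measurement follows, using the parameter values quoted above. The environment interface (reset/step/actions) is hypothetical, standing in for any maze simulator whose observations are the agent's perceived surrounding squares; the step cap in exploitation guards against the loops that severe aliasing can cause.

```python
import random
from collections import defaultdict

GAMMA, ALPHA, TRIALS = 0.2, 0.71, 20000   # settings quoted in the text

def roulette(qs):
    """Roulette-wheel selection over Q-values shifted to be positive."""
    lo = min(qs)
    weights = [q - lo + 1e-6 for q in qs]
    return random.choices(range(len(qs)), weights=weights)[0]

def psi(env, phi_m, eval_trials=1000, max_steps=1000):
    """Train tabular Q-learning on perceived states, then return
    psi_m = phi_Q / phi_m: average trained steps to food divided by
    the optimal average distance to food."""
    n_actions = len(env.actions)
    Q = defaultdict(lambda: [0.0] * n_actions)

    for _ in range(TRIALS):                       # exploration mode
        obs, done, t = env.reset(), False, 0
        while not done and t < max_steps:
            a = roulette(Q[obs])
            obs2, reward, done = env.step(a)
            Q[obs][a] += ALPHA * (reward + GAMMA * max(Q[obs2]) - Q[obs][a])
            obs, t = obs2, t + 1

    total = 0
    for _ in range(eval_trials):                  # exploitation mode
        obs, done, t = env.reset(), False, 0
        while not done and t < max_steps:         # cap: aliasing can trap the agent
            a = max(range(n_actions), key=lambda i: Q[obs][i])
            obs, _, done = env.step(a)
            t += 1
        total += t
    phi_q = total / eval_trials                   # average steps to food
    return phi_q / phi_m
```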
Lanzi [11] noticed that the disposition of aliasing cells plays a significant role in maze complexity. For most LCS agents there are two major factors with a significant influence on the learning process: the minimal distance to food, d, and the correct direction to food (the right action), a. Let d1 and d2 be the minimal distances to food from aliasing cell 1 and aliasing cell 2 respectively, and let a1 and a2 be the optimal actions for the cells. There are four possible situations:
– when the distance is the same and the direction is the same (d1 = d2 and a1 = a2), the squares are pseudo-aliasing;
– when the distance is different but the direction is the same (d1 ≠ d2, a1 = a2), these are type I aliasing squares;
– when the distance is different and the direction is different (d1 ≠ d2, a1 ≠ a2), these are type II aliasing squares;
– when the distance is the same but the direction is different (d1 = d2, a1 ≠ a2), the squares are type III aliasing squares.

Thus, there are three types of genuine aliasing squares and one type of pseudo-aliasing condition. Woods2 is an example of a maze with pseudo-aliasing cells. It can be seen from Fig. 4(a) that for Littman57 [16] the aliasing cells marked 1, 2 and 3 have the same direction to food (aliasing type I). Fig. 4(b) shows MazeF4 [22] with type II aliasing squares marked 1: both squares have different distances to food as well as different directions. Woods101 [17] (Fig. 6(a)) is an example of a maze with type III aliasing squares. In some maze cells there is more than one optimal direction to the nearest food, so there are two additional subcategories, which we consider variants of aliasing type I (see the sketch after Fig. 4):
– when the distance is the same and the direction sets intersect (d1 = d2 and a1 ∩ a2 ≠ ∅), and
– when the distance is different and the direction sets intersect (d1 ≠ d2 and a1 ∩ a2 ≠ ∅).
Fig. 4. (a) Littman57, aliasing maze type I; (b) MazeF4, aliasing maze type II
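The taxonomy above, including the two intersecting-set variants, can be captured in a small classification function. This is our reading of the scheme, not code from the original studies, with optimal actions represented as sets to allow several optimal directions per cell.

```python
def aliasing_type(d1, d2, a1, a2):
    """Classify a pair of aliased squares by minimal distance to food
    (d1, d2) and sets of optimal actions (a1, a2)."""
    a1, a2 = set(a1), set(a2)
    if a1 == a2:                        # identical optimal directions
        return "pseudo-aliasing" if d1 == d2 else "type I"
    if a1 & a2:                         # intersecting but not identical:
        return "type I (variant)"       # the two extra subcategories
    return "type III" if d1 == d2 else "type II"
```

For instance, aliasing_type(3, 5, {'N'}, {'N'}) returns "type I", while aliasing_type(4, 4, {'N'}, {'S'}) returns "type III".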
Influence of aliasing types on maze complexity. Each aliasing type produces a distinctive kind of noise in the agent's reward function, and understanding the internal structure of this noise may help us develop mechanisms for improving the agent's learning. The obtained results show that the mazes with a large value of ψ_m (ψ_m > 150) all have type III aliasing squares (see Fig. 5). The majority of mazes whose highest aliasing type is type II have 10 ≤ ψ_m ≤ 150, and mazes that include only type I aliasing produce ψ_m < 10. Each maze can thus be categorized by the type of aliasing cells it includes. For mazes with combined aliasing (more than one aliasing type), we define the aliasing group a maze belongs to by the highest aliasing type it contains. Thus, type III aliasing mazes may be considered the most difficult group of aliasing mazes, type II mazes are of medium complexity and type I mazes are the easiest.

MASS system. The collected mazes have been assessed using the purpose-built Maze Assessment Software System (MASS), which analyses maze domains by: width; height; average steps to goal; maximum steps to goal; density; number of pseudo-aliasing and aliasing states and squares; average Q-learning steps; and the types and locations of aliasing squares. MASS also produces the following outputs: transition matrix, step-to-food map, Q-learning coefficient map and Q-learning step map. A detailed description of the properties of all mazes considered can be found in [28]. The source code is available from either of the authors.
Fig. 5. Maze complexity chart.
Further aliasing metrics. Solving an aliasing non-Markov maze implies bringing it to a condition in which it becomes Markov, and hence predictable for the agent. Thus, it is the agent's structure and abilities that make an aliasing maze Markov or non-Markov, while dynamic mazes are completely agent-independent in their non-Markov properties. Different learning systems may have different attributes that influence complexity. For example, agents belonging to the class of predictive modelling systems, like Anticipatory Classifier Systems (ACS) [22], predict not only reward but also the next environmental state s′. Aliasing can thus be more complex, and a wider classification is suitable:
– d1 = d2, a1 = a2, s′1 = s′2 — pseudo-aliasing;
– d1 = d2, a1 = a2, s′1 ≠ s′2 — pseudo-aliasing, predictive mismatch;
– d1 ≠ d2, a1 = a2, s′1 = s′2 — type I;
– d1 ≠ d2, a1 = a2, s′1 ≠ s′2 — type I, predictive mismatch;
– d1 ≠ d2, a1 ≠ a2, s′1 = s′2 — type II;
– d1 ≠ d2, a1 ≠ a2, s′1 ≠ s′2 — type II, predictive mismatch;
– d1 = d2, a1 ≠ a2, s′1 = s′2 — type III;
– d1 = d2, a1 ≠ a2, s′1 ≠ s′2 — type III, predictive mismatch.
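The extended taxonomy adds only one comparison on top of the earlier classification, so it can be sketched as a thin wrapper around the aliasing_type function above, with the anticipated next states passed in as extra arguments.

```python
def acs_aliasing_type(d1, d2, a1, a2, next1, next2):
    """Extended classification for predictive modelling systems:
    the base type, plus a predictive-mismatch flag whenever the
    anticipated next states s'1 and s'2 differ."""
    base = aliasing_type(d1, d2, a1, a2)
    return base if next1 == next2 else base + ", predictive mismatch"
```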
In addition, some aliasing mazes may have aliasing chains, like Woods102 [5] (Fig. 6(b)) with adjacent aliasing squares 1 and 2. Other mazes may have communicating aliasing cells, like Woods101 (Fig. 6(a)), where two aliasing cells border on the same neighbouring cell. The chains may be composed of different aliasing states or, on the contrary, of the same aliasing states (e.g. E2). Such environments may present a task of increased complexity for some kinds of predictive modelling agents, compared to aliasing mazes without such conditions. According to the maze complexity chart (Fig. 5), some aliasing mazes are much harder for Q-learning than might be expected. For example, among type III aliasing mazes, such small, short-distanced mazes as Woods100 and Woods101-1/2 produce extremely high ψ_m coefficients, matching those of much bigger mazes with numerous aliasing states, such as Maze10 and E2. Among type II aliasing mazes, MazeF4 occupies the same position, surpassing the quite intricate Sutton90 maze, which features 7 different aliasing states across 23 aliasing squares.
Fig. 6. (a) Woods101, communicating aliasing cells; (b) Woods102, aliasing chains
Upon examination it can be noticed that MazeF4, Woods100 and Woods101-1/2, as well as some other mazes of higher complexity, share a property: from the majority of starting positions, to reach the food the agent has to pass through a wall-isolated aliasing square situated close to the food object. The presence of such an alias gate may make a maze significantly harder for some Q-based agents. Quantifying and specifying the alias-gate effect may be necessary in further research.
4 Generalization and uncertainty issues
Generalization. In LCS terms, generalization is the reduction of the number of significant bits used to represent an environment situation. The process groups similar states together into a less specialized state based on common attributes, substituting the don't-care symbol for 'zero' and 'one'. The goal of generalization is to extend the range of states that can be represented by a smaller population without it being too crowded or too sparse. The main question is how correct generalization can be distinguished from overgeneralization. Any generalization process applied to a maze introduces aliasing. As long as the generalized states have the same distance and the same directions to food (i.e. they fall into the pseudo-aliasing category), the generalization is correct and beneficial. Generalization leading to type I aliasing (same directions, different distance) can also be beneficial, although error-based classifier systems may be sensitive to the resulting changes in reward, so some disturbance to the learning process should be expected. Any generalized state that contains type II or type III aliasing is overgeneralized, because the squares concealed in the state demand completely different actions.

Noise. Noise in LCS is a disturbance of a random nature in the agent's information system, bringing uncertainty either to its actions or to the environment signals it receives. Detector noise means that every perceived state s_per is a probabilistic function of the original environment state s. Thus, each environment state corresponds to a set of perceived environment states: s ⇒ {s_per1, s_per2, ..., s_pern}. The size of the set depends on the noise function used, and is limited by the number of states the detector is able to perceive. As a result, the number of states increases significantly, and so does the learning time. The outcome of learning will depend on the resulting sets of perceived states. If, for every two states s_i and s_j in a non-aliasing maze, the sets of perceived states do not intersect, the maze can be solved by the same agent with the same number of classifiers, provided an appropriate generalization technique is used. Otherwise, if the sets of perceived states intersect, the noise function introduces aliasing, and the outcome will depend on the characteristics of the noise function and on how large the intersections are. In either case, the performance of the learning agent is considerably affected.

Effector noise means that the executed action a_cond of the agent is a probabilistic function of the intended action a. Thus, for each state-action pair (s_{t-1}, a) in the environment there will be a set of possible next environment states {s_t1, s_t2, ..., s_tn}, whose size cannot be greater than the number of actions available to the agent. Effector noise always introduces aliasing, although it appears simpler than the aliasing introduced by detector noise because the overall number of states in the maze remains the same. The outcome of learning will likewise depend on the noise function.
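Both kinds of noise can be illustrated as thin wrappers around an agent's perception and action. The bit-flip and slip models below are common simple choices rather than ones prescribed here, and the probabilities flip_p and slip_p are hypothetical parameters.

```python
import random

def detector_noise(state_bits, flip_p=0.05):
    """Detector noise: each bit of the perceived state string flips with
    probability flip_p, so one environment state s maps to a whole set
    of perceived states {s_per1, ..., s_pern}."""
    return ''.join(b if random.random() > flip_p
                   else ('1' if b == '0' else '0')
                   for b in state_bits)

def effector_noise(action, actions, slip_p=0.1):
    """Effector noise: with probability slip_p a random other action is
    executed instead of the intended one, so each (state, action) pair
    can lead to several next states."""
    if random.random() < slip_p:
        return random.choice([a for a in actions if a != action])
    return action
```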
5 Future research
Future research may include investigations into maze complexity for predictive modelling systems, and testing different LCS agents on the mazes to determine their sensitivity to alternative aliasing types. Future research may also examine the influence of further aliasing metrics (alias gates, aliasing chains, communicating aliasing cells) on the learning process. Investigation of different generalization techniques and specific noise functions may also be beneficial. Finally, the study of maze topology and special-purpose maze generation seems an essential direction for maze research in the near future.
6 Conclusion
Maze problems are useful and popular test problems for reinforcement learning algorithms, particularly LCS. This research covered 50 different mazes, of a wide range of complexity, that have been or could be used for LCS research. We examined agent-independent and agent-dependent maze attributes and proposed a set of metrics for measuring maze complexity. We reviewed existing definitions of aliasing, highlighted the effect of the nature of aliasing squares on maze difficulty, and introduced alternative aliasing types for a Q-based learning agent with a detector only able to perceive the surrounding squares. In addition, we proposed an approach to different aliasing types for predictive modelling systems, considered further aliasing metrics, and took a brief look at the influence of generalization and noise on maze complexity. The introduced metrics provide a clearer mechanism for assessing the learning ability of new algorithms. The research also offers tools for analyzing the correlation between a learning agent and the kinds of mazes it experiences difficulties with, which may provide a better understanding of the agent's weaknesses and facilitate improvements to its structure.
References
1. Bull, L.: Lookahead and Latent Learning in ZCS. GECCO-2002 (2002) 897–904
2. Bull, L., Hurst, J.: ZCS: Theory and Practice. Tech. Report UWELCSG01-001, University of the West of England (2001)
3. Butz, M.V., Goldberg, D.E., Stolzmann, W.: Introducing a Genetic Generalization Pressure to the Anticipatory Classifier System. GECCO-2000 (2000)
4. Cassandra, A.R., Kaelbling, L.P., Littman, M.L.: Acting Optimally in Partially Observable Stochastic Domains. Proc. of the 12th Nat. Conf. on Artificial Intelligence (1994)
5. Cliff, D., Ross, S.: Adding Memory to ZCS. Adaptive Behavior 3(2) (1994) 101–150
6. Gerard, P., Sigaud, O.: YACS: Combining Dynamic Programming with Generalization in Classifier Systems. Advances in Learning Classifier Systems. Springer (2001) 52–69
7. Hurst, J., Bull, L.: A Self-Adaptive Classifier System. Advances in Learning Classifier Systems. Springer (2000) 70–79
8. Kovacs, T., Kerber, M.: What Makes a Problem Hard for XCS? Advances in Classifier Systems. Springer (2001) 80–99
9. Koza, J.R.: Evolution of Subsumption Using Genetic Programming. Proc. of the 1st European Conference on Artificial Life (1992) 110–119
10. Landau, S., Picault, S., Sigaud, O., Gerard, P.: A Comparison Between ATNoSFERES and XCSM. GECCO-2002 (2002) 926–933
11. Lanzi, P.L.: Solving Problems in Partially Observable Environments with Classifier Systems. Tech. Rep. 97.45, Politecnico di Milano (1997)
12. Lanzi, P.L., Colombetti, M.: An Extension to XCS for Stochastic Environments. GECCO-99 (1999) 353–360
13. Lanzi, P.L., Wilson, S.W.: Toward Optimal Classifier System Performance in Non-Markov Environments. Evol. Comp. 8(4) (2000) 393–418
14. Lanzi, P.L.: Adaptive Agents with Reinforcement Learning and Internal Memory. 6th Intern. Conf. on the Simulation of Adaptive Behavior (SAB2000) (2000) 333–342
15. Littman, M.L.: An Optimization-Based Categorization of Reinforcement Learning Environments. 2nd Intern. Conf. on Simulation of Adaptive Behavior. MIT (1992)
16. Littman, M.L., Cassandra, A.R., Kaelbling, L.P.: Learning Policies for Partially Observable Environments. The 12th Intern. Conference on Machine Learning (1995)
17. McCallum, R.A.: Overcoming Incomplete Perception with Utile Distinction Memory. Proc. of the 10th Intern. Machine Learning Conference (1993)
18. Metivier, M., Lattaud, C.: Anticipatory Classifier System Using Behavioral Sequences in Non-Markov Environments. 5th Intern. Workshop, IWLCS-2002
19. Miyazaki, K., Kobayashi, S.: Proposal for an Algorithm to Improve a Rational Policy in POMDPs. Proc. of Int. Conf. on Systems, Man and Cybernetics (1999)
20. Nevison, C.: Maze Lab 1: Event Loop Programming. Colgate University (1999)
21. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall (1994)
22. Stolzmann, W.: An Introduction to Anticipatory Classifier Systems. Learning Classifier Systems: From Foundations to Applications. Springer (2000) 175–194
23. Smith, S.J., Wilson, S.W.: Rosetta: Toward a Model of Learning Problems. Proc. of the Third ICGA (1989) 347–350
24. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT (1998)
25. Wilson, S.W.: The Animat Path to AI. Proc. of the 1st Intern. Conference on the Simulation of Adaptive Behavior. MIT (1991)
26. Wilson, S.W.: ZCS: A Zeroth Level Classifier System. Evol. Comp. 2(1) (1994) 1–18
27. Wilson, S.W.: Classifier Fitness Based on Accuracy. Evol. Comp. 3(2) (1995)
28. Zatuchna, Z.V.: To the Studies on Maze Domains Classification in the Framework of LCS Research. Technical Report CMP-C04-02, University of East Anglia (2004)