Fusion of Multiple Behaviors Using Layered Reinforcement Learning

Kao-Shing Hwang, Senior Member, IEEE, Yu-Jen Chen, and Chun-Ju Wu

Abstract—This study introduces a method to enable a robot to learn how to perform new tasks through human demonstration and independent practice. The proposed process consists of two interconnected phases; in the first phase, state-action data are obtained from human demonstrations, and an aggregated state space is learned in terms of a decision tree that groups similar states together through reinforcement learning. Without the post-process of trimming in tree induction, the tree encodes a control policy that can be used to control the robot by means of repeatedly improving itself. Once a variety of behaviors is learned, more elaborate behaviors can be generated by selectively organizing several behaviors using another Q-learning algorithm. The composed outputs of the organized basic behaviors on the motor level are weighted using the policy learned through Q-learning. This approach uses three diverse Q-learning algorithms to learn complex behaviors. The experimental results show that the learned complicated behaviors, organized from individual basic behaviors by the three Q-learning algorithms on different levels, can function more adaptively in a dynamic environment.

Index Terms—Behavior-based control, intelligent robots, reinforcement learning.
I. INTRODUCTION

In reinforcement learning (RL), since an agent selects an action according to the state currently visited and its value function, it is important to define or refine the state space prior to the learning process. Several methods have been proposed to tackle this problem. A popular method of accelerating the learning process is to aggregate relevant states in a continuous state space into a discrete state, but this solution suffers from the curse of dimensionality, which results from the need for fine resolution [1]. This type of approach usually uses methods akin to a lookup table scheme to estimate the value function and ensure the essential property of convergence. These methods must provide a means to discern the state spaces for optimal controllers in a continuous space for Markov processes. The simplest method involves partitioning the state space into several uniform intervals along each dimension [2]. However, this approach requires enormous computational power if the discretization is extremely fine, particularly in a high-dimensional state space. Moreover, a certain sensory input may be important in one local subspace but trivial in another. If prior knowledge about the spatial characteristics of the value function or the environment can be obtained in advance, the state space can be discretized efficiently.
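As a concrete illustration of the uniform-interval discretization discussed above, the sketch below (an illustrative example under assumed bounds and bin counts, not code from this paper) maps a continuous sensory vector to a single grid-cell index; the size of the resulting lookup table grows exponentially with the number of dimensions, which is the curse of dimensionality referred to here.

```python
import numpy as np

def uniform_discretize(x, lows, highs, bins_per_dim):
    """Map a continuous state vector x to one grid-cell index.

    lows, highs: per-dimension bounds; bins_per_dim: number of uniform
    intervals per dimension. All values here are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    lows = np.asarray(lows, dtype=float)
    highs = np.asarray(highs, dtype=float)
    # Clip into range, then compute the interval index along each dimension.
    ratios = np.clip((x - lows) / (highs - lows), 0.0, 1.0 - 1e-9)
    per_dim = (ratios * bins_per_dim).astype(int)
    # Flatten the per-dimension cell coordinates into one table index.
    index = 0
    for b in per_dim:
        index = index * bins_per_dim + int(b)
    return index

# A 6-dimensional sensory input with 20 intervals per dimension already
# needs a 20**6 = 64,000,000-entry table for a tabular value function.
idx = uniform_discretize([0.3, -0.1, 0.7, 0.0, 0.5, -0.4],
                         lows=[-1] * 6, highs=[1] * 6, bins_per_dim=20)
print(idx)
```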
Manuscript received February 9, 2010; revised January 17, 2011 and June 1, 2011; accepted August 6, 2011. Date of publication February 16, 2012; date of current version June 13, 2012. This paper was recommended by Associate Editor K. Hirota.
K.-S. Hwang is with the Department of Electrical Engineering, National Sun Yat-sen University, Taiwan (e-mail: [email protected]).
Y.-J. Chen is with the Department of Electrical Engineering, National Chung Cheng University, Taiwan (e-mail: [email protected]).
C.-J. Wu is with the Energy and Environment Research Laboratories, Industrial Technology Research Institute, Taiwan (e-mail: [email protected]).
Digital Object Identifier 10.1109/TSMCA.2012.2183349
Unfortunately, prior knowledge of the value function is usually absent or difficult to gather. Generalized neural networks have been utilized to predict state values without discretization of the state space, particularly in high-dimensional spaces [3]. However, these networks cannot guarantee convergence, since they may oscillate or diverge, particularly for a more complex value function, owing to their inability to update the value function locally. Similarly, the decision tree approach can be employed to tackle the problem [4], [5]. This approach partitions the state space adaptively. Although a decision tree considers dissimilarity, it spends comparatively more time learning.

This study proposes an RL method to induce a decision tree. The proposed algorithm regards tree construction as a sequential decision-making process, in which the decision to split each node is based on long-term cost; this eases the design complexity, and the pruning operation can be omitted entirely. With this Q-learning algorithm, a skilled instructor only needs to demonstrate a certain behavior to a robot several times, and the robot can duplicate the instructor's experience by inducing the mapping from sensory inputs to outputs in the framework of the decision tree.

Ultimately, designing appropriate behaviors for robotic control is a complex task because of the diversity between the system and the working environment. Recent studies in robotics have highlighted some aspects of layered control systems and emphasized either the decomposition of control into multiple intertwined behavior systems (top–down fashion) or the organization of individual efforts into more elaborate behavior systems (bottom–up fashion) [6]–[9]. However, these approaches increase the problems caused by the complexity of decomposed or individual behaviors, particularly in conflict resolution and prioritization. Because of the inherent complexity of multiple-behavior systems, there has been much interest in using machine learning or rule-based techniques to help deal with this complexity. Action selection, or arbitration, has been proposed as a resolution mechanism to overcome this difficulty [6]–[10], since it develops some basic behavior modules to achieve the task goal. Under general conditions, only one behavior is activated by suppressing the other behaviors in such algorithms, but this may lead to other problems. In the subsumption architecture [6], [10], for example, the robot must decide which behavior should suppress the others under every condition, i.e., there is a problem of priority assignment. Furthermore, this method fails when one or more behavioral modules fail to connect to the schemes, so the system might fail to function or might malfunction [11]. The modular method proposed in [7] uses a forward model (predictor) as the state transition model for counterparts' actions (macro actions) and a behavior learner (action planner), which estimates the state-action value function based on the forward model, in the manner of RL. However, the use of macro actions associated with other agents' state values leads to dimensionality problems in state space aggregation.

A Q-learning algorithm for organizing individual behaviors into more sophisticated behaviors is proposed to overcome the problems mentioned earlier. Although this, like the subsumption method or the modular method, still requires some basic behaviors to be developed beforehand, it utilizes an RL algorithm to learn weighting parameters and sums up the motor command outputs from all behaviors, instead of relying on sole selection or arbitration. In brief, the proposed process consists of three interconnected phases. In the first phase, basic behaviors are recorded from human demonstrations or emulations in terms of a decision tree, which induces an aggregated state space by grouping similar states together through RL.
The reward function of the RL algorithm is defined as a composition of the purity, entropy, and splitting cost of the growing tree. Without trimming, the tree encodes a control policy that can be used to control the robot, by means of repeatedly improving itself. In the second phase, a behavior recorded in a decision tree is refined through a Q-learning process while interacting with the working environment. The process learns an additional, parallel action set for the same states. These additional actions are not physical behaviors but a numerical "bias" that is added to the output of the original decision tree policy. This method enables the robot to learn individual basic behaviors. Once the pool of basic behaviors has been amassed, more elaborate behaviors can be displayed by organizing a set of behaviors based on a weighting mechanism whose parameters are learned by another Q-learning algorithm, which dynamically adjusts the contribution of each individual behavior to the imminent situation. A multilayered behavioral learning system is established through the execution of the three phases. Each phase strongly depends on the preceding algorithm to successively learn more complex behaviors.

The remainder of this paper is organized as follows. Section II reviews work related to Q-learning, decision tree induction, and the subsumption method. Section III describes the cloning of a single behavior from an expert by the proposed RL-based decision tree method. Details of the proposed algorithm, including the structure, the meaning of each parameter, and the processes of the algorithm, are also presented in this section. Section IV discusses experiments conducted on a soccer robot using the proposed algorithm and the subsumption method. Section IV also analyzes the compositions of the three individual behaviors executed and compares the results. Finally, conclusions are presented in Section V.

II. BACKGROUND

A. Decision Tree Induction

The process of constructing a decision tree, also known as tree induction, involves the identification of the parameters of a decision tree [4], [5]. In general, tree induction is a recursive procedure that assigns an optimal test to each visited node. Once a decision tree has been induced, additional time is usually needed to prune it, in order to produce a small tree and avoid overfitting. It has been shown that the construction of optimal decision trees, which require a minimal number of tests to classify an unknown sample, is an NP-complete problem [12]. Therefore, it is necessary to identify efficient heuristics that allow the construction of near-optimal decision trees. Most tree induction algorithms are based on a top–down greedy strategy, which always makes locally optimal decisions at each node. Because of the greedy and local nature of these decisions, an optimal decision at the current node may result in more subsequent tests than a nonoptimal one. Therefore, the evaluation of different methods of tree induction and their implementation has become a major field of study for the improvement of decision tree performance [13]. Nongreedy approaches, such as look-ahead methods, have been adopted by some investigators [14], [15]. In [15], a single-step look-ahead search did not produce significantly better decision trees. If a tree induction problem is considered to be a decision process, i.e., a process that determines each split at each visited node, the induction problem can be modeled as an RL problem. Therefore, a self-organizing decision tree induction method based on Q-learning is suitable. The method can address the problems associated with the greedy selection that is present in the tree induction methods already mentioned. The decision to split and the selection of a splitting point, as determined by the method, consider not only the local evaluation of impurity, which is usually used to measure the performance of splitting points, but also the consequential evaluations that are accumulated before a split is selected.
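For contrast with the Q-learning-based inducer described in Section III, the following sketch (illustrative code, not taken from the paper) shows the conventional greedy split selection criticized above: each candidate half-split is scored only by its immediate information gain, with no look-ahead or long-term evaluation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (natural log) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values()) if n else 0.0

def greedy_half_split(patterns, labels, lows, highs):
    """Pick the dimension whose half-split maximizes immediate information gain.

    patterns: list of attribute vectors; lows/highs: current node bounds.
    Returns (best_dim, threshold, gain); purely local, no long-term evaluation.
    """
    parent_h, n = entropy(labels), len(labels)
    best = (None, None, 0.0)
    for d in range(len(lows)):
        thr = (lows[d] + highs[d]) / 2.0          # half-split point of dimension d
        left = [y for x, y in zip(patterns, labels) if x[d] < thr]
        right = [y for x, y in zip(patterns, labels) if x[d] >= thr]
        if not left or not right:
            continue
        gain = (parent_h
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best[2]:
            best = (d, thr, gain)
    return best
```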
Fig. 1. Architecture of the behavior composition algorithm.
B. Subsumption Method

A classical control model usually has to build a world model and then infer, plan, and control the robot. Brooks advocated the subsumption method, which has a structure of multiple layers and is a reactive, behavior-based model [6], [10]. Each layer implements a behavior and achieves a particular goal. Higher layers are more abstract, while the lower layers act like reflexes. The programmer can provide every layer with different systems, memory, and processors and then use response signals to coordinate the layers.

III. LAYERED BEHAVIOR-LEARNING ALGORITHM

Since it is difficult to construct a classical control model, a layered behavior-learning algorithm is proposed to systematize this process. The architecture of the algorithm is shown in Fig. 1. As in the subsumption method, some basic behavioral agents must be developed before the system is constructed. These are denoted as Bi and are constructed in the preceding phases. The learning system outputs a weighted action composed of the basic behaviors. In other words, the weighted action sums up all products of the action ref_ai, decided by each behavioral agent, and the associated weight wi, derived from the composition system, which is an RL agent. The environment that interacts with the weighted action then sends back a reward, as a performance (utility) index of the interaction, and the composition system uses this reward as evaluative feedback to learn the most appropriate set of weights.

A. First Phase—Behavior Developing

Humans usually learn new behaviors through imitation and repeated practice. This idea inspired the use of a self-organizing decision tree to emulate the behavior of a skilled operator. First, the instructor controls the robot based on his/her instincts or experience, and the robot records the patterns of the sensory inputs, denoted as attributes. The robot's outputs are denoted as classes. Second, the robot classifies all of the patterns, according to the attributes, by means of a tree structure. The robot can then imitate the behavior demonstrated by the instructor. Upon completion of the construction of a tree that records a basic emulated behavior, the Q-learning process can again be successively applied to improve dexterity [16], [17].

1) Behavior Imitations: The target task is performed with the Federation of International Robot-soccer Association (FIRA) robotic soccer simulator. At first, an instructor demonstrates the target behavior for the robot, and the robot records the relationship between the sensory inputs and the outputs for each time period. The outputs are the rotating velocities of the two wheels of the robot. A training pattern for the self-organizing decision tree uses the sensory inputs as its attributes and the output as its membership of one of the categorized classes. Since the classes of patterns are discrete but the outputs are continuous values, the velocity of each wheel is quantized into 29 levels, and the output pairs are encoded into 841 (i.e., 29 × 29) classes.
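As an illustration of the class encoding just described, the sketch below maps a demonstrated wheel-velocity pair to one of the 841 classes and records a training pattern of (attributes, class). The symmetric velocity range and linear quantization are assumptions for the sake of the example; the paper does not specify them.

```python
N_LEVELS = 29                      # quantization levels per wheel
V_MAX = 125.0                      # assumed maximum wheel speed (illustrative only)

def quantize(v, v_max=V_MAX, levels=N_LEVELS):
    """Map a continuous wheel velocity in [-v_max, v_max] to a level 0..levels-1."""
    v = max(-v_max, min(v_max, v))
    return int(round((v + v_max) / (2 * v_max) * (levels - 1)))

def encode_class(v_left, v_right):
    """Encode a wheel-velocity pair as one of 29 * 29 = 841 discrete classes."""
    return quantize(v_left) * N_LEVELS + quantize(v_right)

# During each demonstration step, the robot records one training pattern:
# the sensory inputs are the attributes, and the encoded action is the class.
training_set = []

def record(sensory_inputs, v_left, v_right):
    training_set.append((tuple(sensory_inputs), encode_class(v_left, v_right)))

record([42.0, 0.35], 60.0, -20.0)   # e.g., distance and angle to the ball
```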
2) Induction of the Self-Organizing Decision Tree: In general, decision tree induction algorithms use a top–down greedy strategy to make a decision at each node. However, this implies that the algorithms may settle on a local optimum [14], [18]. In other words, an optimal decision tree must take successor nodes into account when making a decision. In this paper, the process of decision tree induction is regarded as a Markov decision process in which each decision specifies the most appropriate split, i.e., the split with the maximum long-term (accumulated) benefit to the decision tree.

The entire state space of the RL problem corresponds to the root node of the tree, and each unique state corresponds to a unique leaf node of the tree. Each state is represented by a vector that denotes the ranges of the sensory inputs in that state. The state corresponding to node φ is defined as

s(φ) ≡ (x1^min, x1^max, . . . , xNdim^min, xNdim^max)

which is a vector consisting of the coordinates of the vertices of the hypercuboid where the sensory inputs reside. Here, Ndim denotes the dimension of the sensory inputs, and xi^max and xi^min denote the maximum and minimum input values of the ith dimension in the node, respectively. The sensory inputs {x ∈ R^Ndim | xi^min ≤ xi < xi^max, i = 1, . . . , Ndim} belong to the state s(φ).

The proposed method consists of two stages. In the first stage, the inducer traverses the subtree space to update the long-term evaluation estimates of the splits at each visited node. The update procedure transfers the local evaluation back to the preceding subtrees. In the second stage, nodes are proliferated by the split that carries the maximal long-term evaluation. The action set A(s) of state s contains a no-split action and a half-split action for every input dimension. If the state s(φ) meets one of the following stop-splitting criteria, only the no-split action can be selected: 1) the range of state s(φ) is too small; 2) there are only a few patterns in s(φ); 3) after taking a half-split action, one of the child nodes contains no data; 4) the impurity in node φ is already low enough.

In general, the performance of a decision tree is evaluated by its error rate and tree size. Since the reward function defines the learning direction, the reward function contains two parts. The first part is the information gain Igain(φ, a)

Igain(φ, a) = Entropy(φ) − [(nφl/nφ) Entropy(φl) + (nφr/nφ) Entropy(φr)]    (1)

The entropy function is defined as

Entropy(φ) = − Σ_{i=1}^{C} pi ln(pi)    (2)

where C is the total number of classes, pi is the probability (or percentage) of patterns belonging to class i in node φ, and nφl, nφr, and nφ are the numbers of patterns in the left child node φl, the right child node φr, and the parent node φ, respectively. If the chosen action yields a larger information gain, the patterns in the child nodes are purer, so the error rate is smaller. However, without a restriction on the splitting process, decision tree induction may produce trivial splits and overfit. Therefore, a splitting cost rsplit is introduced into the reward function as a small penalty for the effort of splitting. The splitting cost is a negative constant that biases the growing decision tree toward ceasing growth. The reward function is defined as

R(s(φ), a) = { Igain(φ, a) + rsplit, if a is a splitting action;  0, if a is a no-split action }    (3)
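One way the state s(φ) and the stop-splitting criteria above might be represented is sketched below. This is an illustrative reading of the text, not the authors' implementation; the numeric thresholds are assumptions, and criterion 4 (already-low impurity) would additionally be tested on the node's entropy.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_RANGE = 1e-2     # assumed threshold for criterion 1 (range too small)
MIN_PATTERNS = 5     # assumed threshold for criterion 2 (too few patterns)

@dataclass
class Node:
    # s(phi): the (x_i^min, x_i^max) bounds per input dimension
    bounds: List[Tuple[float, float]]
    # the demonstration patterns inside the node, as (attributes, class) pairs
    patterns: List[Tuple[Tuple[float, ...], int]]

    def half_split(self, dim: int) -> Tuple["Node", "Node"]:
        """Split the node at the midpoint of dimension dim, yielding (phi_l, phi_r)."""
        lo, hi = self.bounds[dim]
        mid = (lo + hi) / 2.0
        left = [p for p in self.patterns if p[0][dim] < mid]
        right = [p for p in self.patterns if p[0][dim] >= mid]
        lb, rb = list(self.bounds), list(self.bounds)
        lb[dim], rb[dim] = (lo, mid), (mid, hi)
        return Node(lb, left), Node(rb, right)

    def can_split(self, dim: int) -> bool:
        """Check stop-splitting criteria 1-3 for a half-split along dim."""
        lo, hi = self.bounds[dim]
        if hi - lo < MIN_RANGE:                  # 1) range too small
            return False
        if len(self.patterns) < MIN_PATTERNS:    # 2) too few patterns
            return False
        left, right = self.half_split(dim)       # 3) a split must not leave an empty child
        return bool(left.patterns) and bool(right.patterns)
```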
In general, in RL, after taking an action, the current state transits to only one subsequent state. In the self-organizing decision tree, however, after taking a splitting action, both child nodes are subsequent states. The update rule is therefore modified as in (4) and (5), where φl and φr are the left and right child nodes of parent node φ:

Qt+1(s(φ), a) = Qt(s(φ), a) + α [Q(s(φ), a) − Qt(s(φ), a)]    (4)

Q(s(φ), a) = { R(s(φ), a) + γ [max_{b∈A(s)} Qt(s(φl), b) + max_{b∈A(s)} Qt(s(φr), b)] / 2, if a is a splitting action;  −Entropy(s(φ)), if a is a no-split action }    (5)
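A compact sketch of the modified update (4)-(5) is given below. It is an illustrative rendering, not the authors' code: it reuses the Node and entropy helpers from the earlier sketches, the Q-table is keyed by the node's bounds, and the learning rate and discount factor are assumed values.

```python
from collections import defaultdict

GAMMA = 0.9   # assumed discount factor
ALPHA = 0.1   # assumed learning rate

q_table = defaultdict(float)   # Qt(s(phi), a), keyed by (node bounds, action)

def state_key(node):
    return tuple(node.bounds)              # s(phi) is identified by its hyper-rectangle

def actions(node):
    """Action set A(s): one no-split action plus a half-split per input dimension."""
    return ["no-split"] + [("split", d) for d in range(len(node.bounds))]

def backup_target(node, action, reward, entropy_fn):
    """Q(s(phi), a) as in (5): for a split, average the best Q values of the two
    successor child nodes; for the no-split action, return the negative entropy."""
    if action == "no-split":
        return -entropy_fn(node)
    left, right = node.half_split(action[1])   # both children are successor states
    best_l = max(q_table[(state_key(left), b)] for b in actions(left))
    best_r = max(q_table[(state_key(right), b)] for b in actions(right))
    return reward + GAMMA * (best_l + best_r) / 2.0

def q_update(node, action, reward, entropy_fn):
    """One tabular step of the modified update (4)."""
    key = (state_key(node), action)
    q_table[key] += ALPHA * (backup_target(node, action, reward, entropy_fn) - q_table[key])
```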
B. Second Phase—Behavior Composition Using Q-Learning

In this section, a behavior composition algorithm is proposed, which controls a robot with multiple simultaneous behaviors. The algorithm inherits the merit of simplicity by designing individual behaviors into a template structure, which is built in the preceding phases. In other words, a robot imitates and refines individual behaviors one by one. In this phase, it combines these refined behaviors by using a set of weighting parameters. These parameters are learned through Q-learning such that the robot can behave adaptively and optimally in an unknown environment. Depending on the sensory inputs observed from the environment, the learning system decides how to compose a more sophisticated behavior from the pool of learned basic behaviors.

Before composition, each individual behavior must have been recorded and refined in its own corresponding decision tree. Since each behavior induces an individual decision tree, the representation of the state space of the Q-learning algorithm in this phase must consist of all joint behaviors, represented in a concise fashion. For each behavior, the sensory inputs always activate a corresponding leaf node. Therefore, the activated leaf node of each behavior's decision tree serves as one variable of the state set in the state space of the Q-learning algorithm used for composition. Actions from the learning system must also be outputs combined with the action from each individual behavior. When the states, or actions, are composed of sets of variables, they are referred to as "factored" states, or actions. It is common for problems with large state or action spaces to have states and actions expressed in a factored form. During learning, instead of sending control signals to the robot directly, the action space is actually composed of sets of weighting parameters for the outputs generated by the joint behaviors. The architecture of the composition algorithm is shown in Fig. 1.

When a sensory input is received, each behavior Bi activates a leaf node ci and outputs a reference action ref_ai. These activated leaf nodes are combined to represent the input state of the composition system, denoted as s = (c1, c2, . . . , cn). In each state, the system maintains n weighting parameters as its outputs and n Q values, one for each weighting parameter. The Q value is updated by (6). The reward is divided according to the weight of each behavior: the more effective the behavior, the more responsibility it takes

Qk+1(s, Bi) = Qk(s, Bi) + α [wk(s, Bi) r + γ Σ_{j=1}^{n} wk(s′, Bj) Qk(s′, Bj) − Qk(s, Bi)]    (6)

In the proposed algorithm, the Q value represents the experience accumulated over many iterations, whereas the weight values determine the actual output action. The weights are updated by reference to the Q values, as shown in (7)

wk+1(s, Bi) = wk(s, Bi) + { (1 − wk(s, Bi)) · δ (Qk+1(s, Bi) + f), if Bi = arg max_B Q(s, B);  −wk(s, Bi) · δ (Qk+1(s, Bi) + f), otherwise }    (7)

The weight value increases if the behavior is better than the others, and the other weight values decrease at the same time. Here, δ is a scale parameter, and f is an offset that ensures the weight values are always positive. To limit the sum of the weight values, they are normalized as

wk+1(s, Bi) = wk+1(s, Bi) / Σ_{j=1}^{n} wk+1(s, Bj), for all Bi    (8)

Before briefly proving the convergence of the learning composition system, some symbols are defined. rmax is the maximum value of the reward function over all states and behaviors. Qmax is the maximum of Q*(s, B) over all states and behaviors, where Q*(s, B) is the optimal Q value of state s and behavior B. Hence, there are two inequalities

w(s, Bi) · r ≤ rmax    (9)

Σ_{j=1}^{n} w(s′, Bj) Qk(s′, Bj) ≤ Σ_{j=1}^{n} w(s′, Bj) Qmax = Qmax    (10)

From (6), (9), and (10)

ΔQk(s, Bi) = wk(s, Bi) r + γ Σ_{j=1}^{n} wk(s′, Bj) Qk(s′, Bj) − Qk(s, Bi) ≤ rmax + γ Qmax − Qk(s, Bi)    (11)

If the learning rate α satisfies the constraints Σ_{k=1}^{∞} αk = ∞ and Σ_{k=1}^{∞} αk² < ∞, the Q value can converge.
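The following sketch shows one way the weighted action of Fig. 1 and the updates (6)-(8) might be realized. It is illustrative only: the values of δ, f, α, and γ are assumptions, the weight clipping is a safeguard added here, and the class itself is not the authors' implementation.

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.9   # assumed learning rate and discount factor
DELTA, F = 0.01, 1.0      # assumed scale parameter delta and positive offset f of (7)

class Composer:
    """Composes n basic behaviors with state-dependent weights learned by Q-learning."""

    def __init__(self, n_behaviors):
        self.n = n_behaviors
        self.Q = {}   # maps factored state s = (c1, ..., cn) to an n-vector of Q values
        self.W = {}   # maps s to an n-vector of weights

    def _init_state(self, s):
        if s not in self.Q:
            self.Q[s] = np.zeros(self.n)
            self.W[s] = np.full(self.n, 1.0 / self.n)

    def act(self, s, ref_actions):
        """Weighted action: sum_i w_i(s) * ref_a_i, where ref_a_i is the motor
        command (e.g., a wheel-velocity pair) proposed by behavior B_i."""
        self._init_state(s)
        return self.W[s] @ np.asarray(ref_actions, dtype=float)

    def update(self, s, s_next, reward):
        """Apply (6), (7), and (8) after observing the reward and the next factored state."""
        self._init_state(s)
        self._init_state(s_next)
        q, w = self.Q[s], self.W[s]
        target = GAMMA * float(self.W[s_next] @ self.Q[s_next])
        # (6): each behavior is credited with a share of the reward given by its weight.
        q += ALPHA * (w * reward + target - q)
        # (7): raise the weight of the best-valued behavior, lower the others.
        best = int(np.argmax(q))
        for i in range(self.n):
            step = DELTA * (q[i] + F)
            w[i] += (1.0 - w[i]) * step if i == best else -w[i] * step
        # (8): renormalize so that the weights sum to one (clipped as a safeguard).
        w = np.clip(w, 1e-6, None)
        self.W[s] = w / w.sum()
```

In a control step, act() would be called with the reference actions produced by the behavior trees, the summed wheel command sent to the robot, and update() called with the observed reward and the next activated-leaf tuple.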
Fig. 2. Variables used in the three single behaviors developed for FIRA. (a) Variables in the ball-seeking behavior. (b) Variables in the wall-avoiding behavior. (c) Variables in the approaching behavior.

Fig. 3. Subsumption method.

IV. SIMULATION

To verify the learning composition algorithm, the FIRA simulation used in the FIRA SimuroSot competition is performed. Three single behaviors are developed in advance, based on human experience. The three behaviors are described as follows.
1) Ball-seeking behavior: This behavior allows the robot to trace and find the ball as soon as possible. The sensory inputs are the distance from the ball to the robot and the angle between the direction of the robot and the direction of the ball, as shown in Fig. 2(a).
2) Wall-avoiding behavior: This behavior keeps the robot safe and helps it avoid becoming trapped by the wall. The closest point of the wall is defined as the position of the wall. The sensory inputs are then the distance from this point on the wall to the robot and the angle between the direction of the robot and the direction of the point, as shown in Fig. 2(b).
3) Approaching behavior: This behavior pushes the ball toward the goal ahead. It moves the robot toward the target along the direction of approach. Since the target is set as the goal, the robot maneuvers around the ball and then pushes the ball in the direction of the goal. The sensory input is the direction of approach of the robot to the target, as shown in Fig. 2(c).
Each behavior is learned from human training, using the methods described in Section III. After inducing the decision trees that imitate the human actions, the decision trees for the ball-seeking, wall-avoiding, and approaching behaviors have 122, 11, and 37 leaf nodes, respectively. These single behaviors address different intentions. The simulation utilizes these behaviors to accomplish an offense strategy. The ball is set at the free-ball position, and the robot is set in front of the blue goal. The robot quickly scores a goal when the three behaviors are used wisely.

A. Simulation With the Subsumption Method

The subsumption method fuses these behaviors as shown in Fig. 3. If more than one behavior is triggered, a suppression signal (S) maintains harmony. The lowest level is the ball-seeking behavior, used to dribble the ball. The robot uses the wall-avoiding behavior if it is near the wall. The approaching behavior, which allows the robot to stand in a good position to score a goal, is the highest level. The robot kicks the ball until it scores 50 goals, and it fails if the ball is kicked into its own goal.
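A minimal sketch of the fixed-priority arbitration just described is given below. It is one interpretation of Fig. 3: the trigger conditions, the threshold, and the in_scoring_position predicate are assumptions made for illustration, not details from the paper.

```python
WALL_THRESHOLD = 10.0   # assumed distance (in simulator units) at which wall avoidance triggers

def in_scoring_position(sensors):
    """Assumed predicate: the robot is behind the ball and roughly facing the goal."""
    return sensors.get("behind_ball", False) and abs(sensors.get("goal_angle", 180.0)) < 15.0

def subsumption_action(sensors, approaching, wall_avoiding, ball_seeking):
    """Pick one behavior's wheel command; higher layers suppress lower ones.

    Each behavior argument is a callable mapping sensors to (v_left, v_right).
    The layering follows the text: approaching > wall avoiding > ball seeking.
    """
    if in_scoring_position(sensors):
        return approaching(sensors)               # highest layer
    if sensors["wall_distance"] < WALL_THRESHOLD:
        return wall_avoiding(sensors)             # suppresses ball seeking near the wall
    return ball_seeking(sensors)                  # default, lowest layer
```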
Fig. 4. Total steps and collision times of the subsumption method.
Fig. 5. Total steps and collision times of the proposed system.
In the simulation, the robot failed 12 times before succeeding 50 times. For the 50 successful courses, the total number of steps from the start to the goal is shown on the left side of Fig. 4, where the bold line is the moving average with a period of 5. The average number of steps lies between 1200 and 3700. To verify the effect of the wall-avoiding behavior, the number of collisions in each simulation is shown on the right side of Fig. 4, again with a moving average of period 5. The average number of collisions is stable, between 21.2 and 80.8.

B. Experiment Using the Composition Algorithm

Since the basic behaviors described earlier have 122, 11, and 37 leaf nodes, the proposed system has 122 × 11 × 37 = 49 654 states. The rewards for success and failure are 10 and −10, respectively, and the reward for a collision is −5; otherwise, the reward is 0. The large state space and sparse rewards would result in an impractically long learning time. To reduce the time taken by the initial learning stage, the subsumption method is first used to guide the composer for 20 episodes: the robot takes the action selected by the subsumption method and then updates the Q values and weights according to the composition algorithm. After the 20 instructional episodes, the robot selects its actions according to the composition algorithm while continuing to update the Q values and weights of the learning composition system. The robot performs the same simulation as with the subsumption method, until it scores 50 goals. Using this approach, it failed only four times. For the 50 successful courses, the total numbers of steps and collisions for each course are shown on the left and right sides of Fig. 5, respectively; the bold line is the moving average with a period of 5. The average number of steps is between 300 and 2300, and the trend is downward. The right side of Fig. 5 also shows that the trend of the collision counts is downward.

Fig. 6 compares the success rates of the subsumption method and the proposed system. The success rate of the proposed system increases with the number of successful attempts. Figs. 5 and 6 show that performance improves as the number of successful attempts increases, meaning that the proposed system learns to fuse the basic behaviors to score goals. A comparison of the total number of steps also shows that the proposed system is better than the subsumption method, but a comparison of collision counts shows the opposite trend. Although the subsumption method is better than the proposed system in the early episodes, the proposed system improves faster than the subsumption method.
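A sketch of the guided initial learning stage described above is given next. It is illustrative only: the environment and behavior interfaces are assumed, and it reuses the hypothetical Composer and subsumption_action sketches from earlier sections rather than the authors' code.

```python
GUIDED_EPISODES = 20   # number of episodes driven by the subsumption teacher

def run_episode(env, behaviors, composer, teacher, episode_idx):
    """One training episode of the composition system (env/behavior APIs are assumed).

    For the first GUIDED_EPISODES episodes the action comes from the subsumption
    teacher; afterwards it comes from the learned weighted composition. The
    composer's Q values and weights are updated by (6)-(8) in both cases.
    """
    sensors, state = env.reset()            # state = tuple of activated leaf nodes (c1, ..., cn)
    done = False
    while not done:
        ref_actions = [b.act(sensors) for b in behaviors]    # ref_a_i from each behavior tree
        if episode_idx < GUIDED_EPISODES:
            action = teacher.act(sensors)                    # subsumption-selected wheel command
        else:
            action = composer.act(state, ref_actions)        # weighted sum of ref_a_i
        sensors, state_next, reward, done = env.step(action)
        composer.update(state, state_next, reward)
        state = state_next
```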
Fig. 6. Success rate comparison for the proposed system and the subsumption method.
Some of the reasons for the increased number of collisions with the proposed system are as follows: 1) the proposed system optimizes the long-term reward; and 2) in the subsumption method, the wall-avoiding behavior takes priority whenever the robot is near the wall, so it is very difficult for the robot to collide with the wall.

V. CONCLUSION

This study proposes a systematic method for the design of a behavior controller, particularly for a complex task. The approach incorporates ideas from three extensively studied research fields: decision tree learning, RL, and learning by demonstration. Using a divide-and-conquer strategy, the proposed method adopts several key ideas of layered control systems and emphasizes both the decomposition of control into multiple intertwined behavior systems and the organization of individual efforts into more elaborate behavior systems. In designing basic behaviors, the decision tree algorithm enables a robot to imitate an expert through demonstration. The main purpose of the self-organizing decision tree process is not only to create a lookup table for state-action mapping but also to cluster similar states together based on demonstration data. The composition system spontaneously fuses basic behaviors together with different weights, learned by Q-learning, in accordance with environmental conditions. Since each behavior is regarded as an individual agent from the viewpoint of the composition system, when one or more behaviors fail in this multiagent (multibehavior) system, the remaining agents can still perform their behaviors and learn to take over some of the tasks originally executed by the failed agent. This means that a robot using the proposed method is inherently more robust and has greater survivability.
REFERENCES
[1] D. Dong, C. Chen, H. Li, and T. J. Tarn, "Quantum reinforcement learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 5, pp. 1207–1220, Oct. 2008.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[3] P. He and S. Jagannathan, "Reinforcement learning-based output feedback control of nonlinear systems with input constraints," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 35, no. 1, pp. 150–154, Feb. 2005.
[4] S. R. Safavian and D. Landgrebe, "A survey of decision tree classifier methodology," IEEE Trans. Syst., Man, Cybern., vol. 21, no. 3, pp. 660–674, May/Jun. 1991.
[5] S. K. Murthy, "Automatic construction of decision trees from data: A multi-disciplinary survey," Data Mining Knowl. Disc., vol. 2, no. 4, pp. 345–389, Dec. 1998.
[6] R. Brooks, "A robust layered control system for a mobile robot," IEEE J. Robot. Autom., vol. RA-2, no. 1, pp. 14–23, Mar. 1986.
[7] Y. Takahashi, K. Noma, and M. Asada, "Efficient behavior learning based on state value estimation of self and others," Adv. Robot., vol. 22, no. 12, pp. 1379–1395, 2008.
[8] A. Kleiner, M. Dietl, and B. Nebel, "Towards a life-long learning soccer agent," in Proc. Int. RoboCup Symp., Jun. 2002, pp. 119–127.
[9] H. Hattori, Y. Nakajima, and T. Ishida, "Learning from humans: Agent modeling with individual human behaviors," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 41, no. 1, pp. 1–9, Jan. 2011.
[10] J. Simpson, C. L. Jacobsen, and M. C. Jadud, "Mobile robot control: The subsumption method and Occam-Pi," in Proc. Commun. Process Arch., 2006, pp. 225–236.
[11] L. Busoniu, R. Babuska, and B. Schutter, "A comprehensive survey of multiagent reinforcement learning," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 38, no. 2, pp. 156–172, Mar. 2008.
[12] L. Hyafil and R. L. Rivest, "Constructing optimal binary decision trees is NP-complete," Inf. Process. Lett., vol. 5, no. 1, pp. 15–17, 1976.
[13] W. Tong, H. Hong, H. Fang, Q. Xie, and R. Perkins, "Decision forest: Combining the predictions of multiple independent decision tree models," J. Chem. Inf. Comput. Sci., vol. 43, no. 2, pp. 525–531, Mar./Apr. 2003.
[14] S. K. Murthy and S. Salzberg, "Lookahead and pathology in decision tree induction," in Proc. 17th Int. Joint Conf. Artif. Intell., San Mateo, CA, 1995, pp. 1025–1031.
[15] O. Simsek and A. G. Barto, "Using relative novelty to identify useful temporal abstractions in reinforcement learning," in Proc. 21st Int. Conf. Mach. Learn., 2004, p. 95.
[16] K. S. Hwang, Y. J. Chen, and T. H. Yang, "Behavior cloning by a self-organizing decision tree," in Proc. IEEE ICIT, 2007, pp. 20–24.
[17] K. S. Hwang, T. W. Yang, and C. J. Lin, "Self organizing decision tree based on reinforcement learning and its application on state space partition," in Proc. IEEE Int. Conf. Syst., Man, Cybern., 2006, pp. 5088–5093.
[18] M. Dong and R. Kothari, "Look-ahead based fuzzy decision tree induction," IEEE Trans. Fuzzy Syst., vol. 9, no. 3, pp. 461–468, Jun. 2001.