Improving Reinforcement Learning Algorithms by the Use of Data Mining Techniques for Feature and Action Selection
Davi C. de L. Vieira, Paulo J. L. Adeodato, Paulo M. Gonçalves Jr.
Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
[email protected],
[email protected],
[email protected]
Abstract-Data mining can be seen as an area of artificial intelligence that seeks to extract information or patterns from large amounts of data, either stored in databases or flowing in streams. The main contribution of this work is to show how the LVF data mining technique improves the Sarsa(λ) algorithm combined with the tile-coding technique by selecting the most relevant features and actions of reinforcement learning environments. The objective of this selection is to reduce the complexity of the problem and the amount of memory used by the agent, thus leading to faster convergence. The motivation of this work was inspired by the rationale behind Occam's razor, which states that a complex model tends to be less accurate than another with lower complexity. The difficulty in using data mining techniques in reinforcement learning environments is the lack of data in a database, so this paper proposes a storage schema for the states visited and the actions performed by the agent. In this study, the selection of features and actions is applied to a specific problem of RoboCup soccer, the dribble. This problem comprises 20 continuous variables and 113 actions available to the agent, which results in a memory consumption of approximately 4.5 MB when the traditional Sarsa(λ) algorithm is combined with the tile-coding technique. The experimental results show that the number of variables in the environment was reduced by 35% and the number of actions by 65%, which resulted in a reduction in memory consumption of 43% and an increase in performance of up to 23%, according to the relative frequency distribution of the agent's success. The approach proposed here is both easy to use and efficient.
Index Terms-Reinforcement Learning, Intelligent Agents, RoboCup, Data Mining, Feature and Action Selection.
I. INTRODUCTION
Reinforcement learning has been widely used in problems to which other learning approaches cannot be applied. Such tasks do not offer the support of a teacher who can provide the desired answers or expected behavior. Models that require a teacher are known as supervised models and use supervised learning as the basis of their training [1]. Widrow et al. [2] used the term "learning with a critic" to distinguish learning by reinforcement from models that learn from a teacher. It is the critic's responsibility to evaluate and encode the effect of the actions taken by the agent into a reward signal, which indicates how "pleasurable" performing a particular action in a given state is. The agent should choose the actions that maximize the reward signal it receives in the long
run, because actions that lead to states which apparently give the greatest instantaneous rewards may lead to others which do not [3]. Feature and action selection is important for reinforcement learning problems because an unnecessarily large set of features and actions increases the size of the search space, making it more difficult for the learning algorithm to find a near-optimal solution [4], [5]. Research on the use of data mining techniques in reinforcement learning environments has been carried out in [6], [7]. However, data mining techniques have never been used to determine which features and actions are relevant for the agent to achieve its goal, although several successful applications in other learning paradigms encourage research in this direction, as presented in [8], [9]. Data mining is an area of artificial intelligence that seeks to extract information or patterns from large amounts of data stored in databases [10]. This work aims to use data mining techniques in reinforcement learning environments in order to identify and select the most relevant features and actions for a given problem (in this article, the term relevant indicates the variables and actions that contribute most to the convergence of the learning algorithm), in such a manner that these techniques can be applied to reinforcement learning regardless of the specific learning algorithm. The idea is that techniques from data mining may add value to reinforcement learning solutions; the purpose is not to compare the performance of different data mining techniques in reinforcement learning applications. The difficulty of using data mining techniques in reinforcement learning environments is the lack of data in a database. To overcome this problem, this article presents a snowflake storage schema for the visited states and the actions taken by the agent, which enables the use of data mining techniques in reinforcement learning environments. The transformations performed in this schema are based on the correlations between data mining and reinforcement learning exposed by Maimon [11]. Another contribution to be considered is the solution of a new, complex testbed environment for reinforcement learning methods presented in this article. The selection described in this paper is a step of the KDD
(Knowledge Discovery in Databases) process [10] related to the identification and selection of the most relevant features and actions of the environment. To the best of our knowledge, there is no previous research on the selection of features and actions in reinforcement learning environments. This paper shows how to do that; reducing the number of variables and actions is a possible way to reduce the complexity of the problem (the "curse of dimensionality" [12]), enhance the generalization capability, speed up the learning process, improve the model interpretability and reduce the amount of memory used by the agent, which can lead to faster convergence [13]. The motivation of this work was inspired by the rationale behind Occam's razor [13], which says that a complex model tends to be less accurate than another with lower complexity. The difficulty of feature selection without data mining techniques lies in the exponential growth of the time needed to find an optimal selection of features: without a probabilistic approach it is not possible to find an optimal selection without testing all possible combinations of the N features, i.e., applying the criterion 2^N - 1 times. Therefore, this paper uses the probabilistic approach proposed by Liu in [14] for feature selection in reinforcement learning problems, to identify the most relevant features, improve the performance of the agent and reduce the memory consumption.

The environment used in the experiments of this paper is provided by the RoboCup soccer simulator. This environment includes continuous state variables, external influences on the agent such as wind, uncertainty, and noise in sensors and actuators, among other characteristics that justify and motivate its use for testing new learning approaches [15]. These and other complexities make RoboCup soccer a real-time simulator capable of providing a challenging environment for testing new computational models. The RoboCup simulator is a fully distributed environment with players and opponents. Formally, the environment provided by the RoboCup soccer simulator is defined as partially observable, stochastic, episodic, dynamic, continuous and multi-agent [16]. As described by Stone [17], the simulator operates in discrete time, where each time step lasts 100 ms; if the agent does not perform an action within this interval, the cycle is lost. The RoboCup simulator also includes hidden states, i.e., each agent has only a partial view of the environment; by default, the view of each agent is limited to an angle of 90 degrees, and the accuracy of perceived objects decreases as distance increases. Moreover, the model needs to be tolerant to noise, because the agents have noise in their sensors and actuators, which means that they do not perceive the world as it really is and cannot affect it exactly as they wish. These features make the RoboCup simulator realistic and challenging.

In this paper we present how simple applications of some data mining techniques can be used to improve the performance and reduce the memory consumption of an agent in a reinforcement learning environment. The paper is organized as follows. In Section II, the concept of dribble
is defined and the testbed environment is presented. Section III presents the model used for training the agent to perform the dribble. The storage, selection and transformation of the training data are shown in Section IV. The data mining techniques for feature and action selection are presented and applied in Sections V and VI, respectively. The results of the experiments are presented in Section VII, and an analysis and conclusion of these results are shown in Section VIII.

II. THE TESTBED ENVIRONMENT
The dribble is defined as the ability of the player to conduct the ball to a specific point of the field without losing it to its opponents. During the dribbling process the agent must learn to select, among the possible actions, those with the largest probability of success, with the goal of inducing the opponent to a position that favors the dribble. In a real football game the ability to dribble is considered one of the most difficult to perform and one of the most efficient in an attacking movement. With this ability it is expected that the players will be able to improve the performance of the attack, allowing individual moves that can increase the chances of scoring, although this can only be verified in matches with all players in the simulator. In many articles the dribble is treated solely as the player's skill to conduct the ball, which is not accurate. In other words, a movement is considered a dribble if:
[Equations (1)-(3), which formalize the dribble in terms of the positions of the agent, the ball and the opponent over a window of n time steps, are not legible in the source.]

The goal of the agent is to satisfy Equations 1 and 2 without exceeding the boundaries of the region delimited by the light lines in Figure 1. The dark line is considered the end of the dribble region and, if the agent crosses it, the agent must be rewarded even if it does not satisfy the dribble Equations 1 and 2. The episode is restarted and the agent is punished if it exceeds the boundaries of the region delimited by the lighter lines or if Equation 3 is false. There are no constraints on the size of the dribble region, but in this paper it has been defined as 6 m x 8 m. The opponent position is randomized at each episode, and the agent position is fixed at the center of the left edge of the dribble region, since the general attacking direction is towards the opponent's goal.

The environment is represented to the agent through the 20 variables described in Table I. Several of these variables were conceived as transformations of the original variables, aiming at embedding human knowledge about the application domain; this is a procedure similar to the goal-scoring modelling previously developed for the same domain [18]. The 113 [= 3 x (180°/5° + 1) + 1 + 1] actions available to the agent are described in Table II. The action space A(s) for a state s is defined as:

A(s) = ...          (4)

[The right-hand side of Equation 4 is not legible in the source; it restricts the available actions according to the agent's distance to the ball,]

where dmax is the kickable margin defined in the RoboCup server parameters. In this paper, at each step during the episode the agent received a reward of 0, and at the end of each episode it received 1 if it performed the dribble and -1 if it did not. The opponent behavior was taken from the UvA Trilearn team [19], but it could have been taken from any team; UvA Trilearn was chosen due to its reputation and the availability of its source code.

TABLE II
POSSIBLE AGENT ACTIONS (only the entry Conduct(angle) is legible in the source).
III. THE LEARNING ALGORITHM

Learning by temporal difference, also called TD, emerged from the combination of ideas taken from two reinforcement learning methods: Monte Carlo and Dynamic Programming [3]. This type of learning updates its estimates based on the difference between estimates made at different times and can be used for learning in real time. Among the temporal-difference methods, this article describes the operation of the Sarsa(λ) algorithm, used to solve the problem described in Section II. This algorithm updates its estimates Q of how "pleasant" or "painful" it is to perform an action a in state s according to Equation 5:

Q(s,a) ← Q(s,a) + α [r + γQ(s',a') − Q(s,a)]          (5)

where s' and a' are the state and action following s and a, respectively, α is a constant that indicates the learning rate, r is the reward received in s and γ is a constant that indicates the discount rate.

Because the state variables are continuous, the tabular model described by Equation 5 becomes impracticable, since an infinite amount of memory would be necessary to store the estimates, due to the exponential growth of the number of states with the number of state variables. The solution to this problem is the generalization that can be obtained by combining learning techniques. Such techniques have been reported in large-scale domains, including backgammon [20], elevator control [21], helicopter control [22], pole balancing [23], RoboCup Soccer Keepaway [24], and others. As done by Stone et al. [24], this article uses the Sarsa(λ) [25] algorithm combined with the tile-coding technique [26], based on the cerebellar model articulation controller (CMAC) [27], to obtain the desired generalization. Tile coding consists in partitioning the input space into regions called tiles. A set of tiles is called a tiling, and several overlapping tilings can be used, with a shift of 1/c between them, where c is the number of tilings. Figure 2 illustrates this concept for two overlapping tilings. Each tile contains an estimate for its region. The estimated value of Q is obtained by Equation 6:
Q(s,a) = Σ_{i=1..c} θ(s,a,i)          (6)

where θ(s,a,i) corresponds to the estimate contained in the tile of the i-th tiling of action a in which state s is contained, and c is the number of tilings used.
The update of each tile is given by Equation 5 with only one modification: the value of α is replaced by α/c. The action a to be performed by the agent in state s is defined by:

a = argmax_a Q(s,a)          (7)

Equation 7 ensures that the agent always performs the actions that apparently lead to the largest rewards over time. To ensure that the agent explores a greater number of states, a random action is selected with probability p, which reduces the probability of the agent getting trapped in local maxima. The last detail of the Sarsa(λ) algorithm not yet mentioned is the eligibility traces (λ) [28]. The technique of eligibility traces ensures that past state-action pairs receive a share e_t(s,a) of the reward received in the current state s, as described by the equation below:

for all s, a:  e_t(s,a) = γλ e_{t-1}(s,a), if s ≠ s_t
               e_t(s,a) = 1,               if s = s_t          (8)

where at each step e_t(s,a) decays by γλ for every state s and action a, and is set to 1 for the state visited at time t. Equation 5 is modified to support the eligibility traces, and the update of each tile is now defined as:

for all s, a:  Q(s,a) ← Q(s,a) + e(s,a) · α · δ          (9)

where

δ = r + γQ(s',a') − Q(s,a)          (10)

We tried several configurations of the model parameters and obtained the best results with the following one. For the results described in this article, the parameters α, λ, γ, c and p were set to 0.03125, 0.5, 0.95, 32 and 0.1, respectively. In the experiments, 32 one-dimensional tilings were used for each variable; thus, for the 20 variables, a state s has 640 tiles per action a. The tile widths for distance, angle and velocity variables were set to 2 meters, 5 degrees and 0.2 m/s, respectively.

Fig. 2. Two overlapping tilings; each tile is addressed by two variables, x and y. Generalization is obtained in the overlapping regions, which allow the experience obtained in one region to be shared with its neighbors.
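For illustration, Equations 5-10 combined with the tile-coding scheme described above could be implemented along the following lines. This is a minimal Python sketch, not the code used in the experiments; the class names, the per-variable hashing of tiles and the replacing-trace bookkeeping are our assumptions, while the default parameter values are the ones reported above.

    import random
    from collections import defaultdict

    class TileCoder:
        """One-dimensional tilings per state variable, as in Section III:
        c tilings per variable, each shifted by width/c."""
        def __init__(self, widths, n_tilings=32):
            self.widths = widths          # tile width of each state variable
            self.c = n_tilings

        def active_tiles(self, state, action):
            # One active tile per (variable, tiling): 640 tiles for 20 variables.
            tiles = []
            for v, (x, w) in enumerate(zip(state, self.widths)):
                for t in range(self.c):
                    offset = t * w / self.c
                    tiles.append((action, v, t, int((x + offset) // w)))
            return tiles

    class SarsaLambdaAgent:
        def __init__(self, actions, coder, alpha=0.03125, gamma=0.95,
                     lam=0.5, p=0.1):
            self.actions, self.coder = actions, coder
            self.alpha, self.gamma, self.lam, self.p = alpha, gamma, lam, p
            self.theta = defaultdict(float)   # tile estimates (Equation 6)
            self.e = defaultdict(float)       # eligibility traces (Equation 8)

        def q(self, state, action):           # Equation 6: sum over active tiles
            return sum(self.theta[t]
                       for t in self.coder.active_tiles(state, action))

        def select_action(self, state):
            # Equation 7 with exploration: random action with probability p.
            if random.random() < self.p:
                return random.choice(self.actions)
            return max(self.actions, key=lambda a: self.q(state, a))

        def update(self, s, a, r, s_next, a_next, terminal=False):
            q_next = 0.0 if terminal else self.q(s_next, a_next)
            delta = r + self.gamma * q_next - self.q(s, a)       # Equation 10
            for t in self.coder.active_tiles(s, a):              # Equation 8
                self.e[t] = 1.0                                  # replacing traces
            for t, e_t in self.e.items():
                # Equation 9, with alpha replaced by alpha/c as stated above.
                self.theta[t] += e_t * (self.alpha / self.coder.c) * delta
                self.e[t] *= self.gamma * self.lam               # decay for next step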
IV. DATA PREPARATION

Subsection IV-A presents a snowflake storage schema for the agent's training data. An agent was trained for up to 12 thousand episodes with a time window equal to 300, performing 10 training repetitions using the model described in Section III. Subsection IV-B defines how the selection and transformation of these data were performed.

A. Storage

The tables and major fields of the storage schema represented by Figure 3 are described below:

• Table result: stores general data of an episode.
  - field episode: the episode number.
  - field execution: the number of the experiment repetition.
  - field success: indicates whether the agent performed the dribble.
  - field window: indicates the episodes that belong to the same time window.
• Table action_result: stores all the environment variables and the actions performed by the agent at each time step.
  - field number: the time step.
• Table action: stores all possible actions of the agent.
• Table experiment: stores general data of the experiment.
  - field size_model: the amount of memory used by the model.

Fig. 3. A Snowflake Storage Schema.
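As an illustration of the schema in Figure 3, the sketch below creates an equivalent structure in SQLite from Python. The column types and the exact column names are assumptions inferred from the field descriptions above and from the query in Figure 4; they are not the authors' original definition.

    import sqlite3

    DDL = """
    CREATE TABLE experiment (
        cod_experiment INTEGER PRIMARY KEY,
        start_date DATE, start_time TIME,
        end_date DATE,   end_time TIME,
        size_model BLOB                 -- amount of memory used by the model
    );
    CREATE TABLE action (
        cod_action INTEGER PRIMARY KEY,
        description VARCHAR(45)
    );
    CREATE TABLE result (               -- one row per episode
        cod_result INTEGER PRIMARY KEY,
        cod_experiment INTEGER REFERENCES experiment(cod_experiment),
        episode INTEGER,
        execution INTEGER,              -- experiment repetition
        success VARCHAR(1),             -- did the agent perform the dribble?
        window INTEGER                  -- time window the episode belongs to
    );
    CREATE TABLE action_result (        -- one row per time step
        cod_result INTEGER REFERENCES result(cod_result),
        cod_action INTEGER REFERENCES action(cod_action),
        number INTEGER,                 -- time step
        bDist REAL, bAngle REAL, bVelX REAL, bVelY REAL,
        aVelX REAL, aVelY REAL, aSD REAL, aSA REAL,
        aBodyA REAL, aBodyN REAL,
        oDist REAL, oAngle REAL, oVelX REAL, oVelY REAL,
        oSD REAL, oSA REAL, oBAngle REAL, obDist REAL,
        oBodyA REAL, oBodyAS REAL
    );
    """

    conn = sqlite3.connect("dribble_training.db")
    conn.executescript(DDL)
    conn.commit()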
B. Selection and Transformation

The selection and transformation of the data contained in the storage schema described in Section IV-A are defined by the query shown in Figure 4. Note that the query averages (avg) the environment variables, grouping them by the variables success, cod_action and window. The success and cod_action variables were chosen in order to make it possible to separate the actions that took part in successful episodes from those in unsuccessful ones. The window variable decreases the loss of information caused by averaging the grouped training data. The parameter episode was set to 10000; this value must be set close to the episode number at which the agent's training starts to converge. This ensures that the training data used to select the most relevant features vary systematically with category membership [29], since at the beginning of the training the actions taken by the agent are purely random.

    select avg(bDist), avg(bVelX), avg(bVelY), avg(aVelY), avg(aVelX), avg(aSD),
           avg(oDist), avg(oAngle), avg(aSA), avg(bAngle), avg(oVelX), avg(oVelY),
           avg(obDist), avg(oSD), avg(oBAngle), avg(oSA), avg(aBodyA), avg(aBodyN),
           avg(oBodyA), avg(oBodyAS), b.success
      from action_result a, result b
     where b.cod_result = a.cod_result
       and cod_experiment = ?
       and episode > ?
     group by b.success, a.cod_action, b.window

Fig. 4. Selection and transformation of the data contained in the storage schema.
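For readers who prefer to work outside the database, the aggregation of Figure 4 can be reproduced roughly as follows with pandas. This is an illustrative sketch that relies on the schema assumed in the previous sketch; the experiment identifier is an arbitrary example and the episode threshold of 10000 mirrors the parameter discussed above.

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("dribble_training.db")
    steps = pd.read_sql_query(
        "SELECT ar.*, r.success, r.window, r.episode "
        "FROM action_result ar JOIN result r ON r.cod_result = ar.cod_result "
        "WHERE r.cod_experiment = ? AND r.episode > ?",
        conn, params=(1, 10000))     # experiment id 1 is a hypothetical example

    state_vars = [c for c in steps.columns
                  if c not in ("cod_result", "cod_action", "number",
                               "success", "window", "episode")]
    # Average the environment variables per (success, action, window) group,
    # as in the query of Figure 4.
    training_data = (steps.groupby(["success", "cod_action", "window"])[state_vars]
                          .mean()
                          .reset_index())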
V. FEATURE SELECTION

This paper uses the Las Vegas Filter (LVF) [14] algorithm for feature selection. It works as follows: given a data set with N features, the LVF algorithm generates a random subset S of the features in every round. If the number of features C in S is smaller than that of the current best subset (C < Cbest), the data D restricted to the features in S are checked against the inconsistency criterion (specified in the next paragraph). If the inconsistency rate is below a pre-specified value (γ), this subset is the best found so far; it is saved as the current best (Sbest) and printed to the user. If the number of features in the random subset is equal to the number of features of the current best set and the inconsistency criterion is satisfied, then an equally good current best has been found and is printed (but not saved). The algorithm then restarts with a new random subset of features and stops after a pre-defined number of iterations.

The inconsistency criterion is the key to the success of LVF. The criterion specifies to what extent the dimensionally reduced space can be accepted: when the inconsistency rate is below a pre-specified value, the dimensionally reduced space is acceptable. The inconsistency rate is calculated in a three-step process. First, two instances are considered inconsistent if they match except for their class labels. Second, the inconsistency count is computed: it is the number of matching instances (without considering their class labels) minus the largest number of instances with the same class label. For example, consider that there are n matching instances and that, among them, c1 instances belong to label1, c2 to label2 and c3 to label3, where c1 + c2 + c3 = n. If c3 is larger than c1 and c2, the inconsistency count is (n - c3). Finally, the inconsistency rate is the sum of all the inconsistency counts divided by the total number of instances. Algorithm 1 is reproduced below as presented in the original article [14]:

Algorithm 1: LVF
  Input: MAX-TRIES, D (data set), N (number of attributes),
         γ (allowable inconsistency rate)
  Output: sets of M features satisfying the inconsistency criterion
  Cbest := N;
  for i := 1 to MAX-TRIES do
      S := randomSet(seed);
      C := numOfFeatures(S);
      if C < Cbest then
          if InconCheck(S, D) < γ then
              Sbest := S; Cbest := C;
              printCurrentBest(S);
          end
      else if (C = Cbest) and (InconCheck(S, D) < γ) then
          printCurrentBest(S);
      end
  end

The main benefits of LVF are:
1) it is simple to implement;
2) it obtains results quickly;
3) it is not affected by any bias of a learning algorithm;
4) it presents possible solutions whenever they are found, and also presents better solutions when these are found; and
5) it is independent of a specific data mining algorithm.

The main drawback of LVF is that it is applicable only to discrete features. To overcome this limitation, it is possible to apply a discretization algorithm before running LVF. In general, the options are either to treat a continuous feature as a discrete one or to apply LVF only to the discrete features, the latter being recommended when the number of features is large. The following features were selected by the LVF algorithm: bDist, bVelX, bVelY, aVelY, aVelX, aSD, oDist, aSA, oVelX, oVelY, oSA, aBodyA, obDist.
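A compact Python rendering of Algorithm 1 and of the inconsistency rate defined above is sketched below. It assumes the aggregated records have already been discretized, as LVF requires, and the helper names are ours rather than those of the original implementation.

    import random
    from collections import Counter, defaultdict

    def inconsistency_rate(data, labels, features):
        """Sum of the inconsistency counts divided by the number of instances."""
        groups = defaultdict(Counter)
        for row, label in zip(data, labels):
            groups[tuple(row[f] for f in features)][label] += 1
        count = sum(sum(c.values()) - max(c.values()) for c in groups.values())
        return count / len(data)

    def lvf(data, labels, n_features, gamma, max_tries=1000, seed=0):
        """Las Vegas Filter: keep a random subset when it is smaller than the
        current best and its inconsistency rate stays below gamma."""
        rng = random.Random(seed)
        best = list(range(n_features))          # Cbest = N
        for _ in range(max_tries):
            subset = rng.sample(range(n_features), rng.randint(1, n_features))
            if len(subset) < len(best):
                if inconsistency_rate(data, labels, subset) < gamma:
                    best = subset
                    print("current best:", sorted(best))
            elif len(subset) == len(best) and \
                    inconsistency_rate(data, labels, subset) < gamma:
                print("equally good:", sorted(subset))
        return best

The random subsets are what make the search probabilistic, avoiding the 2^N - 1 exhaustive evaluations mentioned in Section I.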
VI. ACTION SELECTION

The data mining technique used here is based on two measures: support and confidence. These measures are normally used to assess the impact and quality of the rules extracted from a database; the usual objective is to extract association rules [10]. In this paper, support and confidence are used to measure the degree of use and the importance of each action for the success of the agent's dribble. If both measures do not surpass their respective previously defined thresholds, the action is discarded. Support measures the degree of use of an action A; Equation 11 shows how it is calculated:
%support(A) = count(A) / count(ALL) × 100          (11)

where count(A) is the number of times action A is performed by the agent and count(ALL) is the count of all the actions performed by the agent. The second measure, confidence, indicates how important an action is for the success of the dribble and is defined by:

%confidence(A ⇒ S) = count(A ∪ S) / count(A) × 100          (12)

where count(A ∪ S) is the number of times action A occurs when the dribble is performed successfully by the agent and count(A) is the number of times action A is performed, regardless of success. Thus, if support(A) < θs and confidence(A ⇒ S) < θc, action A is discarded. In this article the parameters for action selection were experimentally defined as θs = 2% and θc = 20%, which resulted in 70 eliminated actions, with only 43 actions remaining. The support of all actions is shown in Figure 5. The support rate is low for almost all actions, due to the large number of actions in the agent's action space, and almost all actions also have a low confidence level (see Figure 6); that is, they do not contribute to the success of the dribble, which makes learning difficult. Eliminating these actions is expected to reduce the convergence time of the learning algorithm because of the reduction in the search space; the actions that do not help the agent to perform the dribble would otherwise be unnecessarily consuming resources.

Fig. 5. Support values of all actions.

Fig. 6. Confidence values of all actions.
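Equations 11 and 12 and the discarding rule can be computed directly from the logged steps. The sketch below is one possible implementation; it assumes each logged step has been reduced to a (cod_action, success) pair and that success is stored as 'T'/'F', which is our assumption about the VARCHAR(1) field.

    from collections import Counter

    def select_relevant_actions(steps, theta_s=2.0, theta_c=20.0):
        """steps: list of (cod_action, success) pairs, one per time step.
        Returns the actions kept by the support/confidence thresholds."""
        total = len(steps)
        count_a = Counter(a for a, _ in steps)
        count_a_and_s = Counter(a for a, success in steps if success == "T")
        relevant = []
        for a, n in count_a.items():
            support = 100.0 * n / total                    # Equation 11
            confidence = 100.0 * count_a_and_s[a] / n      # Equation 12
            if support >= theta_s or confidence >= theta_c:
                relevant.append(a)                         # kept; otherwise discarded
        return relevant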
VII. EXPERIMENTAL RESULTS

The agent was first trained using all the variables described in Section II and its performance was measured. After that, the most relevant variables identified in Section V were selected and the performance was compared with the All Variables performance. Since the most irrelevant variables were not present in the training, the performance was expected to be higher than with all the variables. Figure 7 shows this comparison. Table III also shows that the training with Relevant Variables and All Actions reduced the memory usage by 29% (1,290 KB) with respect to the training with All Variables and All Actions.

Fig. 7. Learning curves for feature selection (averaged over 10 runs): All Variables vs. Relevant Variables.

Figure 8 shows the learning curves for the action selection. The curve Relevant Actions (RA) shows the learning of the agent using only the most relevant actions, selected according to the procedure proposed in Section VI. No improvement in performance was obtained by removing the irrelevant actions, but the memory usage was reduced by about 34% compared with the training with all actions and variables. The next step was to combine the techniques used to identify the most relevant variables and actions, to see whether an improvement in performance or a reduction in memory usage would be obtained. Table III shows this comparison. The learning with relevant variables and all actions achieved a performance greater than the learning with all variables and actions, even though its memory usage was lower. The use of Relevant Actions in training did not help to improve the performance, with either all or relevant variables, but it decreased the memory usage. As shown in Table III, the variable selection decreased the memory usage by up to 43% and increased the learning performance by up to 23%.

Fig. 8. Learning curves for action selection (averaged over 10 runs): All Actions vs. Relevant Actions.

TABLE III
AVERAGED MEMORY USAGE OVER 10 RUNS AND TEST PERFORMANCE ON DRIBBLE OVER 1000 EPISODES.

Feature Selection      Action Selection     Memory Usage   Performance
All Variables          All Actions          4,461 KB       34.2%
All Variables          Relevant Actions     2,940 KB       35.2%
Relevant Variables     All Actions          3,171 KB       57.8%
Relevant Variables     Relevant Actions     2,531 KB       51.2%
VIII. DISCUSSION AND CONCLUSION

The main objective of this paper, identifying and selecting the most relevant variables and actions in reinforcement learning, was reached. The experimental verification was obtained through the comparison of the learning curves in Figures 7 and 8 and through the analysis of Table III.

The first set of trials investigated feature selection for memory usage reduction and for enhancing the agent's success rate. In Figure 7, the Relevant Variables training obtained better performance with a memory usage smaller than that of the All Variables training. The second set of trials investigated action selection for memory usage reduction and performance enhancement of the agent. Figure 8 shows that action selection does not enhance the performance of the agent but can reduce the memory usage as efficiently as feature selection does. These results are summarized in Table III. Since the combination of feature and action selection reduced the agent's performance by up to 6%, the action selection was performed again, this time using the data generated by the Relevant Variables (RV) training instead of the All Variables (AV) training. The new results are shown and compared with the action selection using the AV training data in Figure 9.

Fig. 9. Comparison of RA (Relevant Actions) selection using different training data: learning curves (averaged over 10 runs) for RV+RA with the old (AV) data and RV+RA with the new (RV) data.

Using the data generated by the RV training, the agent achieved a better test performance, reaching up to 58% over 300 episodes, 7% more than the action selection using the AV training data, which reached 51%. Using the RV training data to select the relevant actions with θs = 2% and θc = 20%, four more actions were removed than in the previous selection with the AV training data, and 20 of the removed actions were different. No improvement in performance was noticed with the action selection, but the learning process was faster. A reduction in performance was noticed only during training; the performance in the test was essentially the same, and the RV+RA training had the lowest memory usage.

The experiments have shown that only the variable selection was capable of increasing the agent's performance; the action selection was not capable of improving it. However, the action selection reduced the memory usage by almost 50% without much loss in performance. The approach suggested in this article can be used with any reinforcement learning algorithm, with no restriction, since the techniques described here use data retrieved from a database. The complexity of the algorithms used here is not a major concern, because without a probabilistic approach it would be necessary to train the agent 2^N - 1 times to try all possible combinations of the N features and find the best set for a given problem. In some cases the time necessary to train the agent may exceed 20 hours [24], so the time needed to find the best set of features by exhaustive search would be (2^N - 1) x 20 hours. For the dribble, the training of the agent took about 3 hours to complete.

In Sections V and VI the data mining techniques were applied to an episodic environment, so we do not know whether they can be used in non-episodic (continuing) environments, since such environments do not provide the final success/failure feedback required by the selection and transformation presented in Section IV-B. The main drawback of the proposed method is that the training needs to be completed at least once and then redone from scratch. However, it seems that this drawback can be avoided by saving the model after the first training and then continuing the training after removing the identified irrelevant variables.

In the testbed environment described in Section II, the agent had to choose the actions that would induce the opponent to a position that favors the dribble. The high number of actions and variables available in the problem made the task more difficult, but the techniques applied here proved highly effective in eliminating the irrelevant actions and variables, making their application possible in complex environments. As future work, we intend to investigate why the action selection was incapable of improving the performance of the agent; more detailed experiments on this step must be performed. Perhaps, if exploration had been more strongly encouraged in the agent's learning step, the technique used here for action selection could have reached better results. It is also intended to investigate other data mining techniques for
feature and action selection.

REFERENCES

[1] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall, 1998.
[2] B. Widrow, N. K. Gupta, and S. Maitra, "Punish/reward: Learning with a critic in adaptive threshold systems," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-3, pp. 455-465, 1973.
[3] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 1st ed. The MIT Press, 1998.
[4] K. E. Coons, B. Robatmili, M. E. Taylor, B. A. Maher, D. Burger, and K. S. McKinley, "Feature selection and policy optimization for distributed instruction placement using reinforcement learning," in PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. New York, NY, USA: ACM, 2008, pp. 32-42.
[5] M. Riedmiller, "Concepts and facilities of a neural reinforcement learning control architecture for technical process control," Neural Computing & Applications, vol. 8, no. 4. Springer London, December 1999, pp. 323-338.
[6] R. Alhajj and M. Kaya, "Employing OLAP mining for multiagent reinforcement learning," Design and Application of Hybrid Intelligent Systems, pp. 759-768, 2003.
[7] G. Kheradmandian and M. Rahmati, "Automatic abstraction in reinforcement learning using data mining techniques," Robotics and Autonomous Systems, July 2009. [Online]. Available: http://dx.doi.org/10.1016/j.robot.2009.07.002
[8] K. Matoušek and P. Aubrecht, "Data modelling and pre-processing for efficient data mining in cardiology," in New Methods and Tools for Knowledge Discovery in Databases. Information Society, 2006, pp. 77-90.
[9] H. Lu, S. Yuan, and S. Y. Lu, "On preprocessing data for effective classification," in ACM SIGMOD'96 Workshop on Research Issues on Data Mining and Knowledge Discovery. Montreal: ACM Press, June 1996.
[10] J. Han and M. Kamber, Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems), 2nd ed. Morgan Kaufmann, 2005.
[11] O. Maimon and L. Rokach, "Reinforcement-learning: An overview from a data mining perspective," Data Mining and Knowledge Discovery Handbook, pp. 469-486, 2005.
[12] R. Bellman, Adaptive Control Processes: A Guided Tour. Princeton: Princeton University Press, 1961.
[13] H. Liu and H. Motoda, Feature Selection for Knowledge Discovery and Data Mining. Norwell, MA, USA: Kluwer Academic Publishers, 1998.
[14] H. Liu and R. Setiono, "A probabilistic approach to feature selection - a filter approach," in Proceedings of the 13th International Conference on Machine Learning. Morgan Kaufmann, 1996, pp. 319-327.
[15] H. Kitano, M. Asada, Y. Kuniyoshi, I. Noda, and E. Osawa, "RoboCup: The robot world cup initiative," in Proceedings of the First International Conference on Autonomous Agents (Agents-97), 1997.
[16] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, 2nd ed. Prentice Hall, 2002.
[17] P. Stone, "Layered learning in multi-agent systems," Ph.D. dissertation, School of Computer Science, Carnegie Mellon University, 1998.
[18] R. M. Oliveira, P. J. L. Adeodato, A. G. Carvalho, I. B. V. Silva, C. D. A. Daniel, and T. I. Ren, "A data mining approach to solve the goal scoring problem," International Joint Conference on Neural Networks - IJCNN, 2009.
[19] R. de Boer and J. Kok, "The incremental development of a synthetic multi-agent system: The UvA Trilearn 2001 robotic soccer simulation team," Master's thesis, University of Amsterdam, 2002.
[20] G. Tesauro, "TD-Gammon, a self-teaching backgammon program, achieves master-level play," Neural Computation, vol. 6, no. 2, pp. 215-219, 1994.
[21] R. H. Crites and A. G. Barto, "Improving elevator performance using reinforcement learning," in Advances in Neural Information Processing Systems 8. MIT Press, 1996, pp. 1017-1023.
[22] J. A. D. Bagnell and J. Schneider, "Autonomous helicopter control using reinforcement learning policy search methods," in Proceedings of the International Conference on Robotics and Automation. IEEE, May 2001, pp. 1615-1620.
[23] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," Artificial Neural Networks: Concept Learning, pp. 81-93, 1990.
[24] P. Stone, R. S. Sutton, and G. Kuhlmann, "Reinforcement learning for RoboCup soccer," International Society for Adaptive Behavior, vol. 13, pp. 165-188, 2005.
[25] G. A. Rummery and M. Niranjan, "On-line Q-learning using connectionist systems," Cambridge University, Engineering Department, Tech. Rep. 166, 1994.
[26] C.-S. Lin and H. Kim, "CMAC-based adaptive critic self-learning control," IEEE Transactions on Neural Networks, 1991, pp. 530-533.
[27] J. S. Albus, "A new approach to manipulator control: The cerebellar model articulation controller (CMAC)," Journal of Dynamic Systems, Measurement, and Control, vol. 97, pp. 220-227, 1975.
[28] C. J. C. H. Watkins, "Learning from delayed rewards," Ph.D. dissertation, Cambridge University, 1989.
[29] J. H. Gennari, P. Langley, and D. Fisher, "Models of incremental concept formation," Artificial Intelligence, vol. 40, no. 1-3, pp. 11-61, 1989.