TO APPEAR IN PROCEEDINGS OF ISRAM'94, FIFTH INTERNATIONAL SYMPOSIUM ON ROBOTICS AND MANUFACTURING, AUGUST 15–17, 1994, MAUI, HI, USA

THE EFFECT OF SENSORY INFORMATION ON REINFORCEMENT LEARNING BY A ROBOT ARM

MARCO DORIGO
IRIDIA, Université Libre de Bruxelles, Avenue Franklin Roosevelt 50, CP 194/6, 1050 Bruxelles, Belgium. [email protected]

MUKESH J. PATEL and MARCO COLOMBETTI Progetto di Intelligenza Artificiale e Robotica, Dipartimento di Elettronica e Informazione, Politecnico di Milano, Piazza Leonardo da Vinci 32, 20133 Milano, Italy. [email protected], [email protected]

ABSTRACT

In this paper we present an application of ALECSYS, a distributed learning classifier system, to the control of a robot arm. ALECSYS is initialised with a set of randomly generated rules and is trained to control a robot arm whose task is to reach a non-moving light source. At this stage of our research, our results concern the simulation of a real robot arm (an IBM 7547 with a SCARA geometry), which will be the target of the final implementation of our learning system.

INTRODUCTION

ALECSYS (Dorigo, 1993; 1994), an implementation of a learning classifier system (Booker, Goldberg and Holland, 1989; Holland and Reitman, 1978) on a network of transputers, was used to train a robot arm to solve a light-approaching task. This task, as well as more complicated ones, had already been learnt by ALECSYS implemented on the AutonoMouse, a small autonomous robot (Colombetti and Dorigo, 1992; Dorigo and Colombetti, 1994). The main differences between the present and previous applications are, first, that the robot arm has (asymmetric) constraints on its effectors and, second, that given its higher number of internal degrees of freedom and its non-anthropomorphic shape, it was not obvious, as it was with the AutonoMouse, where to place the visual sensors. We report results of a number of exploratory simulations of the robot arm's relative success in learning to perform the light-approaching task. On the basis of these trials it was possible to derive a near-optimal combination of sensors, which is now being implemented on a real robot arm (an IBM 7547 with a SCARA geometry).

ALECSYS

ALECSYS is a learning system made up of a set of ICSs (Improved Classifier Systems) which co-operate to learn to solve a given task (see Dorigo, 1993; 1994). Each ICS can be implemented on a set of transputers, and different sets of transputers implement different ICSs. These ICSs can be organised in different architectures (for a detailed description, see Dorigo and Colombetti, 1994); for the purposes of the present study the simplest one, a single ICS (monolithic architecture) with 270 classifiers, was used, parallelised over 3 transputers of 90 classifiers each. ALECSYS was set to its reactive mode (that is, no internal messages were used); a steady-state GA was applied with a crossover rate pc=0.5, a mutation rate pm=0.25, and a mutespec operator probability of 0.5 (see Dorigo, 1993, for details on this unconventional operator).

EXPERIMENT METHODS AND PROCEDURE

The results reported here are based on simulation, though work has now moved on to the training of a real robot arm. To constrain the complexity of the learning task, and to ensure that it did not depart too radically from that of the previous studies with an autonomous agent, the simulation was carried out in a two-dimensional space; hence the simulated task solution did not depend on up-down or wrist-rotation manipulations. The domain of the robot arm's movement was constrained to the whole shaded area in Figure 1, while the light source was randomly placed in the lightly shaded subsection of this area. The mechanical arm could be provided with the following sensors (which varied) and effectors (which remained the same):
• visual sensors (on wrist and elbow);
• proprioceptive sensors (elbow joint and shoulder joint);
• motor effectors.
In the simulation experiments the motor effectors closely mimicked those on the real robot arm, and the focus was on the effect of various numbers and types of sensors on learning the light-approaching task. We compared the effect on learning of a single visual sensor on the wrist with that of two visual sensors, one on the wrist and the other on the elbow. Further, the field of vision (360 degrees) of a sensor (or of both, when two were being used) was divided into either eight equal sectors or four unequal sectors (see Fig. 2). These four possible visual sensor configurations, each providing varying degrees and sorts of (visual) information about the light source, were tested with information from one, two or neither proprioceptive sensors.

Figure 1. The simulated mechanical arm.
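As an illustration of the two visual encodings, the sketch below maps the direction of the light (an angle relative to the sensor's frontal axis) onto a sector index. The function names and the exact boundaries of the four unequal sectors are our assumptions for illustration; the paper specifies only that the 360-degree field is split into eight equal sectors or four unequal (frontal, left, rear, right) sectors.

```python
def sector_8(angle_deg):
    """Index (0-7) of the 45-degree sector containing the light
    direction (eight equal sectors, as in Figure 2a)."""
    return int((angle_deg % 360.0) // 45.0)

def sector_4_unequal(angle_deg):
    """Map the light direction onto four unequal sectors as in
    Figure 2b: a narrow frontal sector, wide left and right
    sectors, and a narrow rear sector. Boundary angles are
    assumed, not taken from the paper."""
    a = angle_deg % 360.0
    if a < 22.5 or a >= 337.5:   # assumed 45-degree frontal sector
        return "frontal"
    elif a < 157.5:              # assumed 135-degree left sector
        return "left"
    elif a < 202.5:              # assumed 45-degree rear sector
        return "rear"
    else:                        # assumed 135-degree right sector
        return "right"
```

A coarser division produces fewer distinct sensory messages, and hence a smaller space of condition strings for the classifiers to cover, which is precisely the trade-off the experiments probe.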
The first proprioceptive sensor provided (binary) information on whether the elbow angle was between 0 and 80 degrees or between 81 and 160 degrees (160 degrees being the upper limit, as shown in Figure 1). The second sensor provided similar information about the extent of the shoulder joint, that is, whether the position of the upper arm was between 0-90, 91-180 or 181-200 degrees (200 degrees being the upper limit). Finally, we used two motor effectors, one each for the elbow and shoulder joints. These motors can rotate right, rotate left or stay still. Once the best configuration (supporting efficient and robust learning of this task) was identified, we further investigated the influence of different visual field divisions, namely four and sixteen equal sectors, on learning.

A reinforcement program (RP) was developed to evaluate the utility (vis-à-vis the task) of the arm's moves at each time step (cycle). It compared the distance between the wrist and the light source before and after a move: if the distance had decreased, the learning system was rewarded; if not, it was punished. The reward (punishment) value of +20 (-20) was determined by empirical observations of this sort of reinforcement learning. Here we report results of various sensory combinations on learning the light-approaching behaviour. Each combination was subjected to 12 trials of learning and test phases. Normally, a learning phase lasted for 20,000 cycles, which consisted of a number of episodes, an episode being defined as a successful intercept of the randomly placed light source. The actual number of episodes in each trial varied according to the relative success during learning in approaching the light source with the aid of reinforcement (by the RP) and the variable operation of the GA (to generate new classifiers). The test phase of 5,000 cycles checked the extent and efficiency of the learnt behaviour: learning is switched off, and only the learnt rules are used by the control system to choose an action at each cycle; applicable rules are selected with a probability proportional to their strength.
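The reinforcement program just described reduces to a distance comparison. The +20/-20 values are those given in the text; the point-based geometry and function names are illustrative simplifications.

```python
import math

def reinforcement(wrist_before, wrist_after, light, reward=20.0):
    """Reinforcement program (RP): reward the learner when a move
    brings the wrist closer to the light source, punish it otherwise.
    Positions are (x, y) pairs in the simulated 2-D workspace."""
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    before = dist(wrist_before, light)
    after = dist(wrist_after, light)
    return reward if after < before else -reward
```

Note that under this rule a move that leaves the distance unchanged is punished, which discourages the arm from standing still.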

Figure 2. The visual sensors. a) The visual space is divided into eight equal sectors. b) The visual space is divided into four unequal sectors.
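The test-phase action choice described in the methods (applicable rules selected with probability proportional to their strength) corresponds to standard roulette-wheel selection. The (action, strength) pair representation below is a simplification of ALECSYS's actual classifier format, used only for illustration.

```python
import random

def select_action(applicable_rules):
    """Pick an action from the rules matching the current sensory
    message, with probability proportional to rule strength
    (roulette-wheel selection). `applicable_rules` is a list of
    (action, strength) pairs with non-negative strengths."""
    total = sum(strength for _, strength in applicable_rules)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for action, strength in applicable_rules:
        acc += strength
        if pick <= acc:
            return action
    return applicable_rules[-1][0]  # guard against floating-point drift
```

With learning off, strengths are frozen, so this stochastic choice is the only source of behavioural variability during a test session.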

RESULTS

We collected data on both local and global performance indexes. The local index gives the ratio of moves that brought the arm closer to the light to the total number of cycles (either of learning or of testing); the global index gives the average number of cycles per successful intercept during the learning phase or the test session. Results, averaged over twelve trials, are given for different combinations of number and location of visual and proprioceptive sensors, and of number of sectors in the field of vision. Figures 3 and 4 illustrate the local and global indexes respectively for the test sessions.

Performance during learning for all combinations was reasonably good and did not vary much between trials: local performance ranged between 0.77 and 0.90, and global performance between (on average) 15 and 25 cycles per intercept. It is in the test phases that radical differences between trials and among different sensory combinations are observed. This may indicate partial (or incomplete) learning, that is, rules for some particular input combinations are missing or had not been reinforced during learning. In the learning phase this is not so obvious, because whenever a rule is missing it would eventually be found by the genetic algorithm (which is switched off during the test session).

It is interesting to observe from Figures 3 and 4 that the less proprioceptive information from the elbow and shoulder joints, the better the average test performance. This inverse relationship seems to reflect the adverse effect of an increase in ambiguity, as sometimes the same wrist position vis-à-vis the light position required different actions. However, this cannot be stated with a great deal of certainty, since the test performance for all combinations varied a lot between trials; it was not unusual for one trial to display a high proportion of good behaviour while another was totally abysmal (test performances below 0.01 were not infrequent). The best test performance combination is an 8-sector visual field sensor located on the wrist combined with information about the elbow joint (160 degrees divided into two sectors). In this case, the performance is much better than when knowledge about the relative position of the shoulder joint is available but, unlike the rest, it also performs better when no proprioceptive information is given. This pattern of trends is mostly replicated in the data on intercept success, as can be seen in Figure 4. The observed differences in local and global test performance results are highly significant overall (Kruskal-Wallis p
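The two performance indexes defined at the start of this section are simple ratios; the function names below are ours, not from the paper.

```python
def local_index(approach_moves, total_cycles):
    """Local index: fraction of moves that brought the arm closer
    to the light over all cycles of a learning or test phase.
    Values near 1.0 indicate consistently useful moves."""
    return approach_moves / total_cycles

def global_index(total_cycles, intercepts):
    """Global index: average number of cycles needed per successful
    intercept of the light source (lower is better)."""
    return total_cycles / intercepts
```

For example, a learning phase with a local index of 0.77 and a global index of 20 falls within the ranges reported above (0.77-0.90 and 15-25 cycles per intercept).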
