Algorithmic foundation of the Clockwork Orange Robot Soccer Team

Pieter Jonker 1, Bas Terwijn 2, Jev Kuznetsov 1, Bram van Driel 1

1 Quantitative Imaging Group, Faculty of Applied Sciences, Delft University of Technology
2 Intelligent Autonomous Systems Group, Science Faculty, University of Amsterdam
{pieter, jev, bram}@ph.tn.tudelft.nl, [email protected]

Abstract. The Dutch robot soccer team Clockwork Orange is a research project of the Delft University of Technology and the University of Amsterdam. In this project we study multi-robot collaboration, robot vision and robot control. Our research focus, overviewed in this paper, is on automatic calibration for robust colour segmentation, behaviour-based robotics, reinforcement-trained dribbling with the ball, team roles based on voting, and world modelling based on particle filters.

1 Introduction

Robot soccer is an environment to research, develop and test algorithms for cooperating autonomous robots. The robot teams operate in a simple physical world with simple rules. A worldwide competition (the RoboCup) drives the evolutionary development of the robots: concepts and algorithms that are not good enough do not survive. The RoboCup has several leagues, such as the simulation, small size, middle size, four-legged (AIBO) and humanoid robot leagues. The middle size league includes the development of robots, sensors, actuators, hardware and software, and is as such the most challenging league. Worldwide, about 32 teams take part in the competition in this league. Each year the rules of the game are updated to bring them more in accordance with the FIFA soccer rules, and the world of the robots is made more like the real physical world. The end goal is that in 2050 a humanoid robot team beats the human soccer world champion of that time, in analogy with the chess computer Deep Blue that has "proven to be a better chess player" than humans [1].

The Dutch robot soccer team Clockwork Orange is a research project of the Delft University of Technology and the University of Amsterdam. In this project we study multi-robot collaboration, robot vision and robot control [2]. We use a single type of robot, formerly known as Nomad Scouts, but with re-engineered hardware, so that only the housing, the power electronics and the motor-driving electronics remain original. The system has an embedded processing board with a 1 GHz Pentium, a PCI-bus frame-grabber and several PC104 boards for motion control, sensing control, etc.

The system has a colour camera, simple sonar sensors and collision bumpers. For wireless communication we use IEEE 802.11b WLAN.

Fig. 1. The software architecture (modules: Team Skills, Player Skills, Motion, Coach/Referee Interface, World_Model and Vision; System Integrity and Communication interface with all modules).

Our software architecture is basically a layered and modular system in which higher layers represent processes at higher levels of abstraction [3]. The software of all modules in all layers runs in parallel, and within each module multiple threads can exist. The software currently consists of three layers: the intelligent devices (motion, vision, communication), the skills (player skills, world observation) and the top layer (team play, system health monitoring and automatic error recovery, and coach/referee interaction). The coach and referee-box are laptops at the side of the field. The coach sends commands to the robots indicating the direction of play, the initial roles of the players ... The referee sends commands such as: kickoff_team_A, start, pause, continue, free_kick_A, ... As the aims of the software are an annually moving target, we describe the latest developments below.
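As an illustration only (hypothetical code, not the project's actual classes), the layered decomposition can be captured by a minimal module interface: every module runs its own loop in parallel, and the layers differ only in their level of abstraction. All names below are assumptions.

#include <atomic>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical sketch of the layered, modular architecture: each module
// (one box in Fig. 1) runs in parallel in its own thread.
class Module {
public:
  virtual ~Module() = default;
  virtual void step() = 0;   // one iteration of the module's work
  void start() {
    worker_ = std::thread([this] { while (running_) step(); });
  }
  void stop() {
    running_ = false;
    if (worker_.joinable()) worker_.join();
  }
private:
  std::atomic<bool> running_{true};
  std::thread worker_;
};

// Device layer: Motion, Vision, Communication would derive from Module.
// Skill layer: PlayerSkills, WorldModel.
// Top layer: TeamSkills, SystemIntegrity, CoachRefereeInterface.
int main() {
  std::vector<std::unique_ptr<Module>> modules;
  // modules.push_back(std::make_unique<Vision>());  // etc., per box in Fig. 1
  for (auto& m : modules) m->start();
  // ... play the game ...
  for (auto& m : modules) m->stop();
}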

2 Robust, lighting-independent colour vision system

One of the upcoming rule changes in the middle size league concerns the lighting conditions: daylight spots and large shadows on the field will be allowed, as our ultimate goal is to be able to play in the open air. Since object detection is based on colour segmentation, the vision of the robot is severely hindered by a changing colour of the illumination [4]. Moreover, a colour classifier built and calibrated before each game and for each different field is not sensible for a fully autonomous agent. Hence we made a self-calibrating vision system using a strategy of emergent behaviour [5]. Emergence can basically be described as using information from various orthogonal but overlapping sensors that constantly observe the world. Currently we use the robot's Polaroid sonars to estimate the free space around the robot. Mapping this estimate onto the camera images, we can estimate which pixels and colours belong to the field. With this information we can continuously recalibrate the colour segmentation by growing, shrinking, resizing or moving the colour classifier in the appropriate colour space. Since the classifier is a 3D look-up table, this method works just as well for YUV as for RGB colour spaces. The colour segmentation only uses this classifier, which is in fact a (temporal) generalization of the sensor data. There is no separate learning and testing phase, so if the lighting conditions change, the robot starts forgetting what it knows about colours and learns new colour labels. This way it adapts itself to the new situation in a few seconds.

Fig. 2 shows: Top-left: projecting the sonar data into the image, indicating the green open space in front; this space is used for the auto-iris and white balancing (in this case a better term is green balancing). Top-right: a camera on which we have put brown sunglasses; the image was automatically adapted by our procedure through adjusting the R+G+B, R and B amplifiers on the camera. Bottom-left: idem, but adapted by adjusting the parameters on the camera as well as by further fine-tuning the illumination level and the white balance in software. Bottom-middle and bottom-right: the same, but now eliminating the effect of the blue sunglasses we held in front of the camera.

Fig. 2. Adapting the camera to changing lighting conditions.
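A minimal sketch of this idea follows (hypothetical code, not the team's implementation): the classifier is a 3D look-up table over a quantized colour space; pixels that the sonar-based free-space estimate marks as field reinforce their colour cells, while all cells slowly decay, so the table forgets outdated lighting conditions within seconds. Bin counts, rates and thresholds are assumptions.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical 3D colour look-up table with learn/forget dynamics.
// Works for any 3-channel colour space (RGB, YUV, ...): indices are the
// channel values quantized to 32 bins each.
class ColourClassifier {
public:
  // Reinforce the cell of a pixel believed to be field (sonar free space
  // projected into the image).
  void learnFieldPixel(uint8_t c0, uint8_t c1, uint8_t c2) {
    float& cell = table_[idx(c0, c1, c2)];
    cell = std::min(1.0f, cell + kLearnRate);
  }
  // Called once per frame: every cell decays, so colours no longer observed
  // as field are forgotten after a few seconds.
  void decay() {
    for (float& cell : table_) cell *= (1.0f - kForgetRate);
  }
  bool isField(uint8_t c0, uint8_t c1, uint8_t c2) const {
    return table_[idx(c0, c1, c2)] > kThreshold;
  }
private:
  static constexpr int kBins = 32;
  static constexpr float kLearnRate = 0.05f;
  static constexpr float kForgetRate = 0.01f;
  static constexpr float kThreshold = 0.3f;
  static std::size_t idx(uint8_t a, uint8_t b, uint8_t c) {
    return (static_cast<std::size_t>(a >> 3) * kBins + (b >> 3)) * kBins + (c >> 3);
  }
  std::array<float, kBins * kBins * kBins> table_{};
};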

3 Behaviour-based soccer skills

The player skills of our robots used to be formed by a module that could execute actions such as: Goto(object), Search(object), Dribble_to(object), Turn_and_Shoot_to(object), ..., with object = {ball, goal, centerfield, (x,y,φ), ...}. All actions were conditioned with reflexes such as collision avoidance, at-line behaviour, shoot_to_goal, etc. Within the player skills more than 50 parameters were used, many without a physical meaning [6]. This module was too complex and entangled, and we decided to change the system from a single action selected by the team skills module (possibly interrupted by autonomous reflexes) to a behaviour-based system in the tradition of the Subsumption Architecture [see e.g. 4]. A behaviour-based approach has advantages such as reduced complexity, due to the separation into simple behaviours, and relative ease of adding and testing new behaviours.

The current action behaviours are: goto(object), get_ball, dribble, side_tick_ball, avoid_collision, retract_from_collision and stay_on_field. The current sense behaviours are: displacement, freespace, ball, goal and static_objects. Behaviours are scheduled based on the belief state of the individual robot. For the scheduling we have chosen a finite state machine with fixed activation values, in contrast to dynamic activation functions, to allow for easier control and fine-tuning. To fuse the output of the behaviours we use a voting mechanism in which each behaviour votes on a large set of possible movements by assigning a value between 0 and 1. The votes of each behaviour are multiplied by its activation value, and for each movement all votes are added; the movement with the highest sum of votes is executed. The module has shown good results in complex situations in which multiple obstacles and the position of the ball have to be considered. Especially good results come from the specially designed 'rotate with ball' behaviour, which enables the robot to turn with a relatively short radius while keeping control of the ball. The 'goal sensor' and 'ball sensor' give the robot a good measure of the chance of scoring at any time, allowing for long-distance shooting when appropriate.
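The fusion step can be illustrated with a small sketch (hypothetical code; the actual movement set and behaviours differ): every behaviour assigns a vote in [0, 1] to each candidate movement, votes are weighted by the behaviour's activation value, summed per movement, and the movement with the highest total is executed.

#include <cstddef>
#include <functional>
#include <vector>

struct Movement { double v_left, v_right; };   // candidate wheel speeds

struct Behaviour {
  double activation;                            // fixed activation value
  std::function<double(const Movement&)> vote;  // returns a vote in [0, 1]
};

// Weighted-vote fusion: returns the index of the winning movement.
std::size_t fuseBehaviours(const std::vector<Behaviour>& behaviours,
                           const std::vector<Movement>& movements) {
  std::size_t best = 0;
  double bestSum = -1.0;
  for (std::size_t m = 0; m < movements.size(); ++m) {
    double sum = 0.0;
    for (const auto& b : behaviours)
      sum += b.activation * b.vote(movements[m]);  // weight vote by activation
    if (sum > bestSum) { bestSum = sum; best = m; }
  }
  return best;
}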

4 Training a robot to dribble with the ball

Our present software heavily relies on hand-coded algorithms along with many empirical constants. This approach produces fairly good results in the short term, but as the software becomes increasingly complex, human comprehension of what is going on becomes more difficult. At a certain point so many empirical factors have been introduced that it is almost impossible to find their optimal values, as many of them are correlated. Consequently, we implemented and tested Q-learning [8-11] in our dribble behaviour. The requirements for this behaviour are quite complex, due to the rules that at most one third of the ball may be covered by the robot and that, while trying to score, holding the ball is not allowed. At the same time, the robot should avoid obstacles and use its vision system to orient itself on its absolute position on the field and its relative position with respect to the other players. This need to look around generates motions of its own. A typical layout of a ball handler's geometry and the involved variables is presented in Fig. 3.

Fig. 3. a) A Clockwork Orange robot; b) layout of the ball handler with the involved variables (x0, y0, xb, yb, xhave, yhave, xlost, ylost).
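For illustration, a hypothetical state/action representation for the dribble learner is sketched below. The geometry fields follow the variable names in Fig. 3 (their exact meaning is assumed here), and the learner inputs follow the angle/distance vectors to the ball and goal mentioned in Section 5; the actual encoding in our code differs.

// Hypothetical state and action containers for the dribble behaviour.
// Continuous values; in practice they are sampled (tile-coded) before learning.
struct BallHandlerGeometry {
  double x0, y0;        // origin of the ball handler in robot coordinates
  double xb, yb;        // current ball position relative to the handler
  double xhave, yhave;  // region in which the ball is considered held (assumed)
  double xlost, ylost;  // region beyond which the ball is considered lost (assumed)
};

struct DribbleState {
  double angleToBall, distToBall;  // relative polar coordinates of the ball
  double angleToGoal, distToGoal;  // relative polar coordinates of the goal
};

struct DribbleAction {
  double v_left, v_right;          // set velocities of the two wheels
};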

Using reinforcement learning, the robot can try to optimize its behaviour by learning what to do in each situation on the field. Moreover, it can be made to adapt itself to new situations, e.g. to changes in its ball handler geometry, or to the friction of the ball on the field. In general, a learning algorithm for a system with a large number of parameters needs several hours to converge to its optimal performance. Each training episode starts with positioning the robot and the ball at arbitrary positions on the field, which is impractical for a real physical robot and a human trainer. With an adequate simulator this training can be done considerably faster than with a real robot. After pre-training with the simulation and downloading the result onto a real robot, we should only need a limited number of extra training runs to correct for the factors that were not present in the simulation. This probably also holds for small adaptations to the ball handler geometry and to the field-ball friction on the various soccer fields.

Fig. 4. The learning procedure: a) pre-training, with the supervisor and learning module coupled to the simulator; b) post-training, with the supervisor and learning module coupled to the real robot.

So, in order to develop a learning algorithm for the robot, a dynamic system model was made that describes the kinematics of the robot and its interaction with the ball. When robot and ball collide elastically, both momentum and kinetic energy are conserved. We derived equations for the (angular) velocity of the robot and the (angular) velocity of the ball, resolved in components parallel and perpendicular to the ball handler, both before and after the collision. We set up a set of equations for a partially elastic collision to simulate the collisions between the ball and a robot with a piecewise-linear kicking device. We verified that the simulator and the real robot controller showed the same behaviour. On top of the dribble simulator we implemented reinforcement learning. The algorithm is Watkins's Q(λ), as described in [7, 9, 10]. After about 3500 training runs, our simulated robot could dribble very well [15]. The success rate achieved in simulation on a field of 15 x 25 m was about 80%, with a performance of 60%. We define the performance as the time it takes the agent to complete a task, relative to the theoretical minimum amount of time necessary. This minimum was calculated by dividing the distance to the goal by the maximum velocity of the robot, which gives the shortest time in which the robot can complete its task. Training on a 10 x 20 m field resulted in a decrease of the success rate from 80% to 70%, and on a 4 x 8 m field to only 40%. We then tested the post-training on the real robot, and the robot could dribble using the pre-learned behaviour. With an algorithm of no more than 30 logical statements and almost no expert data, a complex behaviour was achieved, resulting in a robot chasing the ball and pushing it into the goal.
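To make the update rule concrete, a minimal tabular sketch of Watkins's Q(λ), as described in [7, 9, 10], is given below. It is hypothetical and simplified: class and parameter values are illustrative, and our actual implementation works on CMAC tile-coded continuous inputs and actions (see Section 5) rather than a table.

#include <cstddef>
#include <random>
#include <vector>

// Minimal tabular sketch of Watkins's Q(lambda) with epsilon-greedy exploration.
class WatkinsQLambda {
public:
  WatkinsQLambda(int nStates, int nActions)
      : nA_(nActions), q_(static_cast<std::size_t>(nStates) * nActions, 0.0),
        e_(q_.size(), 0.0) {}

  int selectAction(int s) {                       // epsilon-greedy exploration
    if (uni_(rng_) < epsilon_)
      return std::uniform_int_distribution<int>(0, nA_ - 1)(rng_);
    return greedyAction(s);
  }

  // One learning step after observing (s, a, r, s'); aNext is the action
  // actually executed in s'.
  void update(int s, int a, double r, int sNext, int aNext) {
    int aStar = greedyAction(sNext);              // greedy action in next state
    double delta = r + gamma_ * q_[idx(sNext, aStar)] - q_[idx(s, a)];
    e_[idx(s, a)] += 1.0;                         // accumulate eligibility
    for (std::size_t i = 0; i < q_.size(); ++i) {
      q_[i] += alpha_ * delta * e_[i];
      // Watkins's variant: traces are cut when an exploratory action is taken.
      e_[i] = (aNext == aStar) ? gamma_ * lambda_ * e_[i] : 0.0;
    }
  }

private:
  int greedyAction(int s) const {
    int best = 0;
    for (int a = 1; a < nA_; ++a)
      if (q_[idx(s, a)] > q_[idx(s, best)]) best = a;
    return best;
  }
  std::size_t idx(int s, int a) const { return static_cast<std::size_t>(s) * nA_ + a; }

  int nA_;
  double alpha_ = 0.1, gamma_ = 0.95, lambda_ = 0.8, epsilon_ = 0.1;  // assumed values
  std::vector<double> q_, e_;
  std::mt19937 rng_{42};
  std::uniform_real_distribution<double> uni_{0.0, 1.0};
};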

5 Training a robot to dribble while avoiding obstacles

In our behaviours module we fuse simple behaviours, until now all hand-coded algorithms. Dribble-to-goal and collision avoidance are two disjoint simple behaviours. In Section 4 we showed that we could replace the dribble-to-goal behaviour with a version based on Q-learning. We now want to augment this with collision avoidance. The main problem with reinforcement learning algorithms, however, is scalability. Every extra input variable adds an extra dimension to the state space. During learning, the agent should visit all states repeatedly to find the optimal policy. Increasing the number of inputs/outputs adds extra dimensions to the state space, which increases the learning times exponentially. Still, long convergence times to an optimal policy are not even the biggest problem. In high-dimensional spaces the agent is very likely to land in local minima, and different policies tend to interfere with each other. With Q-learning in particular, extra inputs which are un- or weakly correlated with the rewards, and therefore irrelevant, degrade the system's performance. This is inherent to the algorithm, as only the positive part of the 'noise' is propagated forward, biasing the system. High dimensionality, along with the inability to reject irrelevant data and poor generalization, results in poor performance.

Fig. 5. Dribbling with collision avoidance requires 6 inputs (the vectors to the ball, to the goal and to the nearest obstacle) and 2 outputs (the set velocities of the wheels).

For a real-life robot soccer game, fast reaction is required, meaning that the time steps (the rate at which decisions are taken) should be kept as small as possible. Ideally, continuous decision making would be desirable; unfortunately, reinforcement learning algorithms do not support continuous time. In a robot soccer game, robots move at rates of about 1 m/s, so when using a 1 s sampling time, a decision to stop just in front of an obstacle can come too late: the robot has already smashed into its opponent. Using smaller time steps, however, makes learning more problematic, as Table 1 shows.

Table 1. Success rate and performance in relation to the time step, for the task 'dribble with ball to goal'

  Time step size [s]   Success rate (mean)   Performance (mean)
  0.25                 0.42                  0.20
  0.50                 0.64                  0.26
  1.00                 0.69                  0.38

This is caused by the fact that the agent can 'feel' the rewards only during a certain number of time steps from the reinforcement. This number of steps depends on the length of the eligibility trace [7]. Making traces longer enables the robot to feel the rewards further from a goal, but long traces also pose a problem, as an uncertainty is introduced at each step. Choosing a trace length of about 5 time steps appeared to be reasonable, provided that the time steps are small enough to enable the agent to keep up with the changes in its environment. If we want to make a combined behaviour based on Q-learning that involves dribbling while performing collision avoidance, this means, in terms of sampling rates, that reaction times of less than 0.2 s, i.e. sampling frequencies of more than 5 Hz, are needed.

Each input and each output (action) is a continuous one-dimensional variable. In Q-learning the actions to perform are just considered extra inputs. Our actions, setting the speeds of the two wheels, are sampled using CMAC tile coding. Originally we used the vectors (angle, distance) to the ball and to the goal as inputs. Now, for each obstacle we want to avoid, another vector (2 inputs) is needed. Our many attempts revealed that the Q-learning algorithm cannot cope with more than 8 inputs, given the sampling of the space required by the 5 Hz sampling rate we need. But this is not enough to keep track of the ball and multiple obstacles simultaneously: at least two obstacles must be avoided at the same time, to cope with being closed in by opponents or corners. The problem of the large input/output space can be reduced by using scaling functions, filtering, clipping, etc., but this adds complexity that the reinforcement learning system has to solve in addition.

The great advantage of a learning system is that the designer only tells it what to do, and not how. From a simple reward structure the agent should be able to figure out the policy its designer had in mind. Simple rewards, however, soon prove to be inefficient. Take as an example a simple 'go to ball' behaviour. The simplest reward structure is a reward for getting to the ball and a penalty for every time step. This reward structure soon proved to be inefficient, as the effect of a reward at the end of the entire training episode is propagated back in time only for a certain number of steps, according to the eligibility trace, meaning that the agent cannot 'feel' the goal when it is more than the eligibility trace length (in time) away from it. Increasing the sampling rate makes such simple rewards even less efficient. To solve this, a designer decides to give reinforcements at every time step: positive for getting closer to the ball, and negative for moving away from it. This results in rewards resembling a potential field, pushing the robot towards the goal. This points to an important caveat: when complex behaviours are expected and the agent fails to find the policy that satisfies its designer, the designer tends to start telling the agent how to accomplish its task, implying complex potential-field reward structures and completely robbing the agent of its freedom to explore.

So the solution is the same as we have experienced before: decouple behaviours into simple behaviours and fuse them to obtain transitions between the behaviours. This should also be done when learning behaviours with Q-learning; use a sort of "prioritized learning". Our desired behaviour is to approach the ball at a predefined angle while avoiding obstacles. While far away from the ball, keeping the heading is important; only close by should the approach at the right angle be achieved. So our distributed learning system learns one policy on top of an already learned behaviour. Let A(sA) be a first, 'base' behaviour like 'go to ball', learned only with its relevant states sA. Let B be the obstacle avoidance behaviour, etc. Let S be the total state and v an input variable. First, behaviour A chooses an appropriate action based on the data it receives. The second behaviour, B, only gets the inputs which are relevant for it (the obstacle avoidance behaviour does not need to know where the ball is). Along with its actions to drive the wheels, behaviour B has a special 'intervention' action. This is a binary action that determines whether the actions of all previous behaviours (in this case A) will be overridden. By making this intervention very costly, the agent learns to use it only in the cases where a collision is most probably unavoidable. The selected action now becomes:

S = v1 + v2 + ... + vn,        si = ∑ vrelevant,        a = A(sA) ∨ B(sB) ∨ C(sC) ∨ ...        (1)

where S is the total state built from all input variables, si contains only the inputs relevant for behaviour i, and a higher-priority behaviour overrides the lower ones only through its intervention action. We need B (and all subsequent behaviours) to be active only in the relevant situations, with back-propagation times of only a couple of steps, making a simple reward structure possible. One great advantage of using such compensators is that their preferred action when learning starts is always 0, meaning one can fill all Q-values accordingly before starting the actual learning.
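A hypothetical sketch of this prioritized composition follows (names are illustrative, not the team's code): each learned behaviour sees only its relevant inputs; a higher-priority compensator such as obstacle avoidance outputs an action plus a binary 'intervene' flag, and its action overrides the base behaviour only when that flag is set.

#include <cstddef>
#include <vector>

struct Action {
  double v_left = 0.0, v_right = 0.0;  // wheel speed set-points
  bool intervene = false;              // compensators only: override lower layers?
};

// A learned behaviour maps its own (relevant) sub-state to an action.
// In the real system each behaviour would wrap a Q(lambda) learner; here it
// is only an interface.
class LearnedBehaviour {
public:
  virtual ~LearnedBehaviour() = default;
  virtual Action act(const std::vector<double>& relevantInputs) = 0;
};

// Prioritized fusion: the base behaviour (e.g. 'go to ball') always proposes
// an action; each compensator (e.g. obstacle avoidance) may override it, but
// the intervention carries a high cost during learning, so it is used only
// when a collision is (almost) unavoidable.
Action fusePrioritized(LearnedBehaviour& base,
                       const std::vector<double>& baseInputs,
                       const std::vector<LearnedBehaviour*>& compensators,
                       const std::vector<std::vector<double>>& compInputs) {
  Action a = base.act(baseInputs);
  for (std::size_t i = 0; i < compensators.size(); ++i) {
    Action c = compensators[i]->act(compInputs[i]);
    if (c.intervene) a = c;            // costly override, learned to be rare
  }
  return a;
}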

6 Team skills based on voting

Our newly developed team strategy is implemented as a voting mechanism. Periodically, all team members vote on the overall team strategy, taking into account the position of the ball; the strategy can be 'Attack', 'Defend' or 'Search Ball'. Depending on the selected team strategy, a number of roles become available, on which another round of voting is held to distribute the roles over the available team members, the more important roles being distributed first. After role assignment, a team member selects the position at which to perform its task using a potential field that depends on its role; the potential field takes into account the positions of all relevant objects, such as the ball, team members, opponents and the goals [13]. The resulting team play is found to be satisfactory: only one team member goes to the ball, while the other players position themselves in support of the attacking member. While defending, the team members have shown a good ability to quickly build a defensive wall relative to the changing position of the ball.
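A hypothetical sketch of the role-assignment round follows (role names and the greedy assignment rule are assumptions, not the team's actual code): each robot submits a suitability vote per available role, and roles are handed out in order of importance to the highest remaining bidder.

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One robot's votes: a suitability value in [0, 1] for each available role.
using RoleVotes = std::map<std::string, double>;

// Greedy role assignment: roles are listed in decreasing importance; the most
// important role goes to the robot that voted highest for it, and so on.
std::map<int, std::string> assignRoles(const std::vector<std::string>& rolesByImportance,
                                       const std::vector<RoleVotes>& votesPerRobot) {
  std::map<int, std::string> assignment;          // robot id -> role
  std::vector<bool> taken(votesPerRobot.size(), false);
  for (const auto& role : rolesByImportance) {
    int best = -1;
    double bestVote = -1.0;
    for (std::size_t r = 0; r < votesPerRobot.size(); ++r) {
      if (taken[r]) continue;
      auto it = votesPerRobot[r].find(role);
      double v = (it != votesPerRobot[r].end()) ? it->second : 0.0;
      if (v > bestVote) { bestVote = v; best = static_cast<int>(r); }
    }
    if (best >= 0) { assignment[best] = role; taken[best] = true; }
  }
  return assignment;
}
// Example: with strategy 'Attack', rolesByImportance might be
// {"attacker", "supporter", "defender"}; each robot would compute its votes
// from, e.g., its distance to the ball and its position on the field.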

7 Position Estimation

The team skills, and to a lesser extent the behaviour-based player skills, rely on knowing, with some precision, the global positions of the robots on the field. The position of each robot is estimated by tracking a fixed number of position hypotheses using a particle filter. At each time step the particle filter takes as input the odometry information, to displace the hypotheses in the same way the robot is displaced, and position candidates, to evaluate the existing hypotheses and to generate new hypotheses. These position candidates result from matching the robot's visual observations with a static model of the soccer field, taking into account the lines and goals. Additional position candidates can result from matching the dynamic objects observed by the robot with a global world model constructed from the dynamic objects observed by team members together with their position hypotheses. In this way, observing objects, e.g. a moving ball, and knowing where on the soccer field these objects are observed by team members, helps the robot's position estimation.

The particle filter algorithm has been adapted to be robust against collisions and wheel spin, in combination with noise in the observations caused by robots and people in and around the field. The adaptation evaluates a hypothesis not only on the basis of the latest position candidates, but also on the basis of candidates received in the past [14]. By registering the number of times a hypothesis has been supported by candidates, the most consistent hypotheses can be identified and preserved, even when there is no support from position candidates during a number of time slices. In this way, a short period of incorrect and possibly consistent position candidates will not invalidate the most consistent hypotheses. Experiments showed that with this approach a robot is quickly able to recover its position after it has been displaced, even when the sensors registering displacement could not do so properly. We observed that we could adequately control the trade-off between quickly recovering the correct position of the robot and being robust against consistent noise in the environment.
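A simplified, hypothetical sketch of the hypothesis-tracking loop follows (the real filter, described in [14], also fuses shared observations from team members; all numeric constants and the weighting rule are assumptions): each particle is displaced by odometry, scored against the position candidates, and its support counter records past agreement so that consistently supported hypotheses survive short bursts of bad candidates.

#include <cmath>
#include <random>
#include <vector>

struct Pose { double x, y, theta; };

// One position hypothesis, with a counter for how often it has been supported
// by position candidates in the past (the adaptation described above).
struct Particle {
  Pose pose;
  double weight = 1.0;
  int support = 0;
};

class PositionFilter {
public:
  // Displace all hypotheses according to odometry (with some noise).
  void predict(double dx, double dy, double dtheta) {
    std::normal_distribution<double> n(0.0, 0.02);
    for (auto& p : particles_) {
      p.pose.x += dx + n(rng_);
      p.pose.y += dy + n(rng_);
      p.pose.theta += dtheta + n(rng_);
    }
  }
  // Evaluate hypotheses against position candidates from vision (field lines,
  // goals); candidates close to a hypothesis increase its weight and support.
  void correct(const std::vector<Pose>& candidates) {
    for (auto& p : particles_) {
      double w = 1e-3;                       // small floor: no support this step
      for (const auto& c : candidates) {
        double d = std::hypot(p.pose.x - c.x, p.pose.y - c.y);
        if (d < 0.5) { w += std::exp(-d * d); ++p.support; }
      }
      // Consistently supported hypotheses are favoured at resampling time, so
      // a short burst of wrong candidates does not wipe them out.
      p.weight = w * (1.0 + 0.1 * p.support);
    }
    // resample(particles_);  // standard importance resampling, omitted here
  }
private:
  std::vector<Particle> particles_ = std::vector<Particle>(100);
  std::mt19937 rng_{7};
};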

8 Conclusions

Setting up and maintaining a team of autonomous cooperating robots, albeit functioning in a simple world with simple rules, involves many different algorithms, all with their own parameter settings. As each robot's algorithmic “install base” currently contains about 200,000 lines of C/C++ code and an estimated state space of over 3000 states, the calibration of all involved parameters is a tedious job. Apart from a proper layered and modular architecture, auto-calibration, learning and emergent behaviour must be applied in order to cope with this complexity. In this paper we presented the various techniques we used:
1) Overlapping sensors of different kinds that measure the same item, i.e. measuring the colour of the free floor space to perform auto-gain and white balancing for colour segmentation under outdoor lighting conditions; a first step towards emergent behaviour.
2) Applying Q-learning in a cascaded way to learn combined simple behaviours, without stepping into the pitfall of large state spaces with high sample rates. It performs at least as well as hand-coded algorithms, involves less code and contains no magic numbers.
3) Team behaviour through continuous voting on the roles of the team members.
4) Position estimation using particle filters, combining sensor information of varying sources and quality.

9 References

[1] The RoboCup foundation, www.robocup.org; Dutch branch: www.robocup.nl
[2] Robot soccer team Clockwork Orange, http://www.ph.tn.tudelft.nl/~phrobosoc
[3] P.P. Jonker, B. Terwijn, The Clockwork Orange Team, Team Description Paper, submitted to the RoboCup International Symposium 2004, July 4-5, Lisbon.
[4] P.P. Jonker, J. Caarls, W. Bokhove, Fast and Accurate Robot Vision for Vision-based Motion, in: P. Stone, T. Balch, G. Kraetzschmar (eds.), RoboCup 2000: Robot Soccer World Cup IV, Proc. 4th Int. Workshop on RoboCup (Melbourne, Australia, Aug. 31 - Sep. 1, 2000), Lecture Notes in Artificial Intelligence, vol. 2019, Springer Verlag, Berlin, 2001, pp. 149-158.
[5] R. Pfeifer, C. Scheier, Understanding Intelligence, The MIT Press, Cambridge, Massachusetts, ISBN 0-262-16181-8.
[6] W. Altewischer, Implementation of Robot Soccer Player Skills Using Vision Based Motion, MSc thesis, Delft University of Technology, November 2001 (available on request from the author).
[7] R.S. Sutton, A.G. Barto, Reinforcement Learning, MIT Press, Cambridge, MA, ISBN 0-262-19398-1, 1998.
[8] J.C. Santamaria, R.S. Sutton, A. Ram, Experiments with Reinforcement Learning in Problems with Continuous State and Action Spaces, Technical Report, Lederle Graduate Research Center, University of Massachusetts, Amherst, December 1996.
[9] C. Watkins, P. Dayan, Technical Note: Q-learning, Machine Learning, 8(3-4):279-292, 1992.
[10] C.J.C.H. Watkins, Learning from Delayed Rewards, PhD thesis, Cambridge University, 1989.
[11] J.R. Kok, N. Vlassis, Sparse Tabular Multiagent Q-learning, in: A. Nowé, T. Lenaerts, K. Steenhaut (eds.), Benelearn 2004: Proceedings of the Annual Machine Learning Conference of Belgium and The Netherlands, Brussels, Belgium, pp. 65-71.
[12] J. Latombe, Robot Motion Planning, Kluwer Academic Publishers, 1991.
[13] B.D. Damas, P.U. Lima, L.M. Custodio, A Modified Potential Fields Method for Robot Navigation Applied to Dribbling in Robotic Soccer, Instituto Técnico, Av. Rovisco Pais 1, 1094-001 Lisboa, Portugal, 2002.
[14] B. Terwijn, J.M. Porta, B.J.A. Kröse, A Particle Filter to Estimate Non-Markovian States, Proceedings of the 8th Int. Conf. on Intelligent Autonomous Systems, Amsterdam, March 2004.
[15] See also: http://www.ph.tn.tudelft.nl/~pieter/simulator.html