REINFORCEMENT LEARNING AND GENETIC PROGRAMMING FOR MOTION ACQUISITION

Adam Szarowicz 1, Jaroslaw Francik 2,* and Ewa Lach 3

1 School of Maths, 2 School of Computing and Information Systems
Kingston University, Penrhyn Road, KT1 2EE Kingston, United Kingdom
E-mail: {a.szarowicz, jarek}@kingston.ac.uk

3 Institute of Computer Science, Silesian University of Technology
Akademicka 16, 44-100 Gliwice, Poland
E-mail: [email protected]

KEYWORDS
Animated avatars, Q-learning, motion prototyping, Genetic Programming

ABSTRACT
While computer animation is currently widely used to create characters in games, films and various other applications, techniques such as motion capture and keyframing are still relatively expensive. Automatic acquisition of secondary motion and/or motion prototyping using machine learning might be a solution to this problem. This paper presents the application of two machine learning algorithms to the generation of action sequences for animated characters: Reinforcement Learning (RL) and Genetic Programming (GP). RL can be used in both deterministic and non-deterministic environments to generate actions which can later be incorporated into more complex animation sequences; the paper presents an application of both the deterministic and the non-deterministic update of the Q-learning algorithm to the automatic acquisition of motion. Results obtained from the learning system are compared to human motion and conclusions are drawn. The second approach represents a virtual agent as an automaton trying to achieve a well-defined goal by executing an internal program; the agent's program is built using the layered learning genetic programming technique.


INTRODUCTION

This paper presents learning techniques for animated characters that mimic human behaviour, especially in the context of interaction with physical objects. Anthropomorphic characters have previously been used to simulate object manipulation as well as interactions between individuals (Tomlinson et al, 2000, Russell and Blumberg, 1999), but this work did not include action learning. A good example of an architecture addressing the problem of action learning is C4 (Isla et al, 2001). C4 tackled the problem of learning on the cognitive level - characters (usually virtual dogs) learn how to respond to new commands and events. An extension of C4, which includes much greater learning capabilities, is described in Blumberg et al (2002). The applied learning algorithm is a modification of the reinforcement learning technique (Sutton and Barto, 1998). However, the task of the learning engine is not to learn the necessary motor skills; rather, it is defined on a higher level, "with respect to a motivational goal of moving in a certain way", and learning happens in real time during interaction with the system. The system uses a so-called pose-graph to generate motion, the nodes of which are derived from source animation amended by an interpolation technique (Downie, 2000). Thus the animation is realistic and transitions can be generated in real time, but the actions must be prepared by an animator and pre-programmed into the system.

Another example of applying reinforcement learning (RL) to animation is Yoon et al (2000), where RL techniques were used to create motivational and emotional states for a human character. This system incorporates concepts such as motivation-driven learning (where the source of the reinforcement signal was the creature's motivational module), organisational learning and concept learning, but not motor learning.

* Work supported by the European Commission as part of a Marie Curie fellowship.

As before, the learning occurs on a higher level and only affects the character's behaviour in an indirect way. These systems visualise the motion of the characters using a blend of motion capture, keyframing and kinematics-based techniques.

Quite often, however, motion is generated using dynamic simulation, which allows characters with very complex motor skills to be created. Terzopoulos and his colleagues (Tu and Terzopoulos, 1994, Terzopoulos et al, 1996) present a system for animating dynamically simulated fish and snakes. They employed machine learning to acquire complex motor skills for the simulated fish: the virtual characters are able to learn low-level motions as well as high-level behaviours. In this approach a physics-based simulation was used, based on a dynamic model of the fish with muscles and springs. However, a similar approach applied to the dynamic simulation of human figures requires characters with many degrees of freedom, which makes it computationally expensive. Despite that complication a lot of research is being conducted in this field. Hodgins et al (1995) propose controllers for three different athletic behaviours; apart from dynamic simulation they also use state machines, techniques for reducing disturbances introduced to the system by idle limbs, and inverse kinematics. Van de Panne and others (van de Panne et al, 2000) propose a limit cycle algorithm for the animation of a walking biped and a dynamic motion planner for simplified characters (Acrobot, Luxo). Laszlo et al (1996) applied the limit cycle control technique to a 19 degree-of-freedom model of a human; similarly, Anderson and Pandy (1999) investigated realistic simulation of human gait using a 23 degree-of-freedom model. There have been few attempts to build dynamic controllers which could control more than one specific motion. Examples are the controller proposed by Pandy and Anderson (1999), intended to be applicable to both jumping and walking behaviours, and the work presented by Faloutsos and his colleagues (Faloutsos, 2002), who combined several different controllers and additionally applied Support Vector Machines (see Christianini and Shawe-Taylor, 2000) to learn the preconditions of different dynamic actions automatically as an off-line process.

An alternative to physically-based simulation was proposed by Lee et al (2000), who implemented a system in which the constraints imposed on the motion of a character are calculated in a procedural way. The calculations are thus faster and more stable, and can easily be used in real-time applications. Metoyer and Hodgins (2000) presented a framework for rapid crowd-motion prototyping in which simplified bipeds play American football. Additionally, their agents can learn high-level behaviours from real data using a memory-based learning algorithm.

Classic reinforcement learning has been applied to create successful implementations of board games with unmanageable state spaces (Schraudolph et al, 1994, Thrun, 1995), backgammon being the most successful example (Tesauro, 1994). Reinforcement learning has also been used in robotics to control one or more robotic arms (Davison and Bortoff, 1994, Schaal and Atkeson, 1994). Sutton (1996) successfully applied RL to various optimisation tasks, including control of the Acrobot - a two-link robot actuated only at the second joint - and Boone (1997) compared Q-learning with other control methods for the Acrobot task, including A* search. Recently Tedrake and Seung presented a reinforcement learning technique for expanding a controller for a planar one-legged hopping robot (Tedrake and Seung, 2002). Solutions based on the Q-learning algorithm (Watkins, 1989) have also been modified and adapted; examples include ant systems (Gambardella and Dorigo, 1995, Monekosso et al, 2002) and reward shaping (Ng et al, 1999), a technique in which additional rewards are used to guide the learning. A survey of reinforcement learning techniques can be found in Kaelbling et al (1996), an excellent tutorial on reinforcement learning was published by Harmon and Harmon (1996), and Touzet (1999) describes techniques for combining Q-learning and neural networks in the context of robotics.

Animation prototyping is a topic which has recently gained much popularity in the animation research community. Rapid prototyping techniques offer an opportunity to quickly sketch an animation sequence without the need for fully simulated motion. Fang and Pollard (2003) proposed a system for fast generation of motions for characters having from 7 to 22 degrees of freedom using physical simulation.

Another recent system for creating and editing character animation based on motion capture was presented by Dontcheva et al (2003). Similarly, the system presented by Lee and his colleagues (Lee et al, 2002) allows the user to combine clips from a database of motion-captured data by identifying possible transitions between motion segments. The system works in real time and can additionally be controlled by sketching the required motions or by acting them out in front of a camera; the generated results are comparable to recorded human motion. Zordan and Van Der Horst (2003) presented a new solution for mapping motion captured with optical motion capture systems to joint trajectories for a fixed limb-length skeleton based on virtual springs, which allowed them to generate smooth and uniform motion applied to virtual avatars. Kovar and Gleicher (2003) proposed a novel technique for motion blending - a technique which allows new motions to be created by combining multiple clips according to some criteria. Li and his colleagues (Li et al, 2002) described a system for synthesis of complex human motion (dancing) from motion-captured data; the system learns so-called motion textons (repetitive patterns in complex motion) and their distributions and can synthesise new motion. A similar concept was introduced by Liu and Popovic (2002), who presented a system for rapid prototyping of realistic (highly dynamic) character motion from a simple animation provided by an animator. The system learns an estimator for predicting transition poses from examples taken from a database of motion-captured motions.

Evolutionary programming has previously been applied to simulate intelligent behaviour of virtual characters. Tang and Wan (2002) proposed genetic algorithms to simulate a virtual human that learns by applying muscle forces to its body joints while attempting to correctly perform a jumping task. Duthen et al (1999) used genetic algorithms to produce binary rules which define the behaviour of autonomous players in a virtual soccer game. Gustafson and Hsu (2002) applied layered learning genetic programming to evolve the behaviour of a team of agents playing keep-away soccer. In their implementation two genetic programming trees are searched: the agents' kick tree, which gives the direction and distance in which to kick the ball, and the move tree, which computes where to move the agent. The terminal set consists of vectors describing the agents' distance to other players and the ball. Evolution of keep-away soccer players has also been addressed by other researchers (Kohl et al, 2003, Barne et al, 2002).

In order to create believable characters, both the physical and cognitive aspects of an avatar must be implemented - or some variants thereof (Funge, 1999, Isla et al, 2001, Szarowicz and Forte, 2003). Modelling learning agents also requires a more or less complex structured environment in which the characters thrive (Monzani, 2002). Realistic environments are usually implemented by imposing internal and external physical constraints, such as gravity, obstacles and body limitations (for example the limited movements of body limbs).

THE AVATAR MODEL

The avatar model used here borrows its biomechanical characteristics from robotics. An avatar has a set of joints whose movements can be either prismatic (movements constrained to a 3D plane) or revolute (movements involving a rotation about an axis in 3D space). The kinematics of manipulators (Craig, 1989) then governs all possible movements of joints as combinations of prismatic and revolute elemental movements. A standard goal usually includes more or less complex object manipulations. In fact, using forward and inverse kinematics for a simple but articulated avatar, the optimal sequence of simple actions fulfilling a goal can indeed be learnt (Szarowicz and Remagnino, 2004). Learning is implemented by creating a suitable state space and applying reinforcement learning techniques to learn the optimal movements to reach an object of interest. Figure 1 illustrates the concepts of forward and inverse kinematics and the current model of the avatar.

Figure 1 The position of the end effector can easily be calculated when all joint rotations are given (forward kinematics); the opposite task is the problem of inverse kinematics.
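To make the distinction concrete, the sketch below shows forward kinematics for a planar two-link arm. It is a minimal illustration only, not the avatar model used in the paper: the link lengths, angle names and planar simplification are assumptions made for the example.

```python
import math

# Illustrative only: a planar two-link arm (shoulder + elbow).
# upper_len / fore_len and the joint angles are hypothetical parameters,
# not values taken from the paper's avatar model.
def forward_kinematics(shoulder_deg, elbow_deg, upper_len=0.30, fore_len=0.25):
    """Return the (x, y) position of the end effector (the hand)
    given the two joint rotations - the forward kinematics task."""
    a1 = math.radians(shoulder_deg)
    a2 = math.radians(shoulder_deg + elbow_deg)  # elbow angle measured relative to the upper arm
    x = upper_len * math.cos(a1) + fore_len * math.cos(a2)
    y = upper_len * math.sin(a1) + fore_len * math.sin(a2)
    return x, y

# Forward kinematics: joint angles -> hand position.
print(forward_kinematics(45.0, 30.0))
# The inverse problem (hand position -> joint angles) generally has zero, one
# or several solutions and must be solved analytically or numerically.
```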


REINFORCEMENT LEARNING FOR AUTONOMOUS AVATARS

The avatars can perform a number of actions. The standard way of adding new actions and behaviours (seen as compositions of actions) to an avatar's repertoire is to script them manually. Ideally, an avatar should allow for new actions but should also have a form of automatic generation of new actions. The reinforcement learning technique lends itself very well to the automatic acquisition of actions and behaviours; the implemented avatars use the Q-learning technique. All standard reinforcement learning techniques, and Q-learning in particular, assume a scene evolving along a discrete timeline, indexed by the variable t. A suitable state space is defined, as well as all available actions for each defined state. The reader should refer to (Watkins, 1989, Sutton and Barto, 1998) for a detailed discussion of reinforcement learning techniques. The quality of an action is kept up to date either using a table of quality values Q(s_t, a_i) or a neural network (Bertsekas and Tsitsiklis, 1996), or more stable alternatives (for instance Baird and Moore, 1999). Results for both the deterministic and the non-deterministic approach are described in the next sections. In all experiments the state space and the goals of the agent are explicitly defined. Q-learning was implemented by discretising the space into states and using a Q-table (a minimal illustrative sketch of the update rule is given after Table 1 below). The following list describes more details of the current implementation (see also Szarowicz et al, 2005, Szarowicz et al, 2003):

• An avatar can perform a number of simple actions including arm, forearm and hand motion, illustrated in Figure 2 and textually described in Table 1.

• The state space is different for each mode of control, but in both cases it is discretised by defining a number of degrees of freedom for the joints used. In the case of forward kinematics the degrees of freedom of the arm were defined as rotations around spatial axes (see the first four illustrations of Figure 2); all rotations were discretised and constrained to realistic physical movements. In the case of inverse kinematics the discretisation is performed on the 3D location of the end effector of the avatar, that is its hand (see the last illustration of Figure 2).
• In both forward and inverse kinematics, walking along one dimension is considered an additional action. Discretisation here is implemented along one axis in the two main directions of the ground plane on which the avatar moves. Similarly, grabbing an object is considered an additional action.
• Other external objects (such as the door shown in the experiments) were represented as additional state variables.
• For each state space dimension there are always two possible actions, indicating a movement of a body part (i.e. an arm, a forearm, a hand etc.) along that dimension in the two opposite directions. Examples include the avatar walking forwards or backwards, or moving its hand along the vertical axis, resulting in lowering or raising the hand.
• Successful fulfilment of a goal is rewarded; collisions with the environment and violations of the biomechanical constraints are punished.

Although Q-learning convergence is not affected by the initial state, for optimisation reasons all animation experiments were biased towards a realistic starting state (i.e. with the avatar upright and both arms aligned with the body).

Table 1 Low-level actions used to train the avatar

Forward kinematics:
1. Rotate arm up/down by ∆α
2. Rotate arm forward/backward by ∆α
3. Rotate forearm by ∆α
4. Rotate hand along Z axis by ∆α
5. Rotate shoulder along Z by ∆α
6. Perform a grabbing action
7. Move forward/backward by ∆x

Inverse kinematics (numbering aligned with the forward kinematics column):
1. Move palm by ∆x
2. Move palm by ∆y
3. Move palm by ∆z
6. Perform a grabbing action
7. Move forward/backward by ∆x

Figure 2 Avatar degrees of freedom for the teapot task: FK (first four) and IK (last) control
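The sketch below shows a plain tabular Q-learning step in Python, as referred to above. It is a hypothetical reconstruction rather than the authors' code: the state encoding, reward source and the decaying learning-rate schedule of the non-deterministic variant are assumptions; only the discount factor γ = 0.95 and the zero-initialised Q-table follow the text.

```python
from collections import defaultdict

GAMMA = 0.95            # discount factor used in all experiments (from the text)
Q = defaultdict(float)  # Q-table, implicitly initialised to 0 as in the paper
visits = defaultdict(int)  # per state-action visit counts for the decaying learning rate

def best_action(state, actions):
    """Greedy action with respect to the current Q-table."""
    return max(actions, key=lambda a: Q[(state, a)])

def q_update_deterministic(state, action, reward, next_state, actions):
    # Deterministic update: the environment response is treated as exact,
    # so the old estimate is simply overwritten.
    Q[(state, action)] = reward + GAMMA * max(Q[(next_state, a)] for a in actions)

def q_update_nondeterministic(state, action, reward, next_state, actions):
    # Non-deterministic update: new evidence is blended into the old estimate
    # with a decaying learning rate (one common choice, shown as an assumption).
    visits[(state, action)] += 1
    alpha_n = 1.0 / (1.0 + visits[(state, action)])
    target = reward + GAMMA * max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1.0 - alpha_n) * Q[(state, action)] + alpha_n * target
```

In the deterministic form the table entry is overwritten directly, while the non-deterministic form averages over randomised outcomes of the same state-action pair.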

LEARNING TASKS

Two example learning tasks are presented below; for more details the reader is referred to Szarowicz and Remagnino (2004).

The door opening task
For this task the goal of the agent was to get through a locked door. The door would be unlocked upon touching the door handle; the avatar would then have to push the door and pass through it. The agent was rewarded whenever its position was behind the door.

The simple actions available to the agent were selected from Table 1 (actions 1, 2, 3, 4, 5 and 7 of the forward kinematics column); for the FK experiment α was set to 20 degrees, the step size (∆x in action 7) was 35 cm, and in all experiments γ = 0.95. Eight simple actions were available to the agent at each time step: three rotations - two for the arm and one for the forearm - in the two opposite directions, plus walking along one axis (2*3+2). The task of the IK-controlled experiment was the same as for the previous FK one, but the mode of control and the state and state-action spaces were changed. The simple actions available to the agent were actions 1, 2, 3 and 7 of the inverse kinematics column of Table 1, with a walking step of 35 cm, ∆x = ∆y = ∆z = 5 cm for the motion of the hand, and γ = 0.95. The agent could therefore choose from 8 simple actions - hand motion along the 3 spatial axes in the two opposite directions for each axis, plus walking (2*3+2).
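One hypothetical way to encode the FK door task for the Q-table sketched above is shown below. The state layout, reward values and collision handling are illustrative assumptions, not the paper's implementation; only the eight-action set, the discretisation steps (20 degrees, 35 cm) and the reward-on-success / punish-on-collision scheme follow the text.

```python
from dataclasses import dataclass

ALPHA_DEG = 20   # FK rotation step for the door task (from the text)
STEP_CM = 35     # walking step size (from the text)

# Eight discrete actions: two arm rotations and the forearm rotation in both
# directions, plus walking forward/backward (2*3 + 2).
ACTIONS = [
    ("arm_up_down", +1), ("arm_up_down", -1),
    ("arm_fwd_back", +1), ("arm_fwd_back", -1),
    ("forearm", +1), ("forearm", -1),
    ("walk", +1), ("walk", -1),
]

@dataclass(frozen=True)
class DoorState:
    arm_up_down: int   # rotation index, in multiples of ALPHA_DEG
    arm_fwd_back: int
    forearm: int
    position: int      # multiples of STEP_CM along the walking axis
    behind_door: bool  # goal predicate

def door_reward(state, collided):
    # Reward scheme as described in the text: success rewarded, collisions and
    # violated joint limits punished. The numeric values are assumptions.
    if collided:
        return -10.0
    if state.behind_door:
        return 100.0
    return 0.0
```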

The teapot lifting task
The goal here was to lift a teapot (the z co-ordinate of the teapot's position had to increase); the agent was therefore rewarded whenever the end position of the teapot was higher than the start position. The simple actions available to the agent were selected from Table 1; for the FK experiment these were actions 1, 2, 3, 4, 6 and 7. The learning parameters were set as follows: α was set to 10 degrees and γ = 0.95. The dimensionality of the task was 5 - 2 degrees of freedom for the left arm, 1 for the left forearm, 1 for hand rotation and 1 for the state of the teapot. Ten simple actions were available to the agent at each time step (2 for each state-space dimension, as described earlier). An experiment with biped control using inverse kinematics was also conducted. The simple actions available to the agent in this case were actions 1, 2, 3 and 7 of the inverse kinematics column of Table 1, with ∆x = ∆y = ∆z = 8 cm for the motion of the hand and γ = 0.95. The state space was therefore 4-dimensional and the agent could choose from 8 simple actions - hand motion along the 3 spatial axes in the two opposite directions for each axis, plus the grabbing action. In all experiments the Q-table was represented as a lookup table and the values were initialised to 0 before the simulation.

LEARNING USING THE NON-DETERMINISTIC ALGORITHM

This section presents results obtained when applying the non-deterministic update of the Q-learning algorithm to the task of action acquisition. The task implemented using this technique is the IK-controlled teapot problem. The state space is the same as in the deterministic implementation, and the length of the shortest solution obtained is also the same (10 simple actions). Convergence is reached faster - in approximately 800 iterations as opposed to about 3000 in the deterministic case (Figures 3 and 4) - and the time necessary to reach the optimum solution is shorter as well, about 90 minutes on average (550 iterations). The convergence is also more stable (Figure 3). This suggests that the non-deterministic version of the algorithm generates comparable results in a shorter amount of time.

LEARNING WITH NON-DETERMINISTIC ACTION SELECTION

An additional simulation with the non-deterministic update has also been executed, in which the outcome of the action selection was randomised in some percentage of cases: the action selected by the agent according to its Q-table was replaced with a random action with some probability. The results of this simulation for different levels of action randomisation are presented in Figure 5.
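A sketch of the action-randomisation wrapper described above is given below; the function name and the default randomisation level are illustrative assumptions (the paper reports several levels but does not fix one here).

```python
import random

def select_action(state, actions, q_table, randomisation=0.2):
    """Pick the greedy action from the Q-table, but replace it with a random
    action with probability `randomisation`, as in the experiment of Figure 5.
    The 0.2 default is an arbitrary example level, not a value from the paper."""
    greedy = max(actions, key=lambda a: q_table.get((state, a), 0.0))
    if random.random() < randomisation:
        return random.choice(actions)  # simulated uncertainty in the action outcome
    return greedy
```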

Figure 3 Convergence graph (total Q value against epoch number) for the IK teapot non-deterministic problem

Figure 4 Convergence graph (total Q value against epoch number) for the IK teapot deterministic problem

As presented in Figure 5, the speed of convergence decreases as the uncertainty of the action selection mechanism grows. However, convergence is still reached, even for relatively high uncertainty levels. Although the results of this experiment have little significance in a fully predictable animated landscape, they suggest a possible utilisation of the action acquisition scheme in robotic environments.

Figure 5 Convergence for randomised action selection updates

EVALUATION OF THE LEARNING RESULTS

The results obtained from applying the learning mechanism indicate that the IK learning mode is faster and easier to implement. Convergence is reached in a smaller number of iterations, compared to the FK case, and is more pronounced. However, the ultimate assessment can only be made by analysing the resulting animations. For the simpler door problem the generated motion resembles human actions to a large extent (Figure 6). Experiments with the teapot task have shown that a relatively low resolution of the state space discretisation is sufficient to generate a believable result (α = 10 degrees), and in this case the motion can also be compared to human performance. A lower resolution (α = 20 degrees) for the FK teapot task generates motion which is too jerky and inaccurate. Some artefacts are still present, even in the best-quality motion for the FK problem; these mainly concern unnecessary motions, and especially a zigzag-like way of approaching the teapot present in some animations. Nevertheless the result resembles human motion in sufficient detail, as demonstrated in Figures 7 and 8. Indeed, the way of executing the action achieved using the FK mode of control matches the way the same action was executed by a human actor who was given no additional guidelines prior to performing the task (Figures 7 and 8).

Figure 6 Motion generated in the Door experiment compared to human motion

Figure 7 FK controlled motion compared to motion of a human actor

Figure 8 FK controlled motion compared to motion of a human actor

Results yielded by the IK controlled experiments (both deterministic and non-deterministic) are also interesting. First of all, the state space is substantially smaller than for the FK experiments and therefore solutions are found in fewer iterations. The resulting motion looks realistic, despite the fact that the human actor did not initially perform the action in the way suggested by the IK solution. This does not mean humans cannot perform the lifting task in this way, as demonstrated in Figures 9 and 10; the reason this way is less natural is only that the table was relatively high. Reducing the height of the table changes the way humans perform the task (the hand does not have to be moved around the table). Moreover, the generated motion still looks natural and contains fewer unnecessary artefacts than the FK solution, because IK control implicitly rejects some of the unnecessary moves. The IK state space can also be represented in a more compact way (only three values need to be stored regardless of the hand position). This, however, causes problems when more expressive motion or a combination of different modes of control are required (for the door opening task it was necessary to combine IK hand control and walking), as the representation of the state space for such extensions is more uniform when using the FK approach. The main problem with the FK approach is its extensibility - additional degrees of freedom very quickly expand the state space and substantially increase the number of iterations required to find a solution. Therefore tasks for which more than 6-7 degrees of freedom are necessary may have to be simulated using the more compact IK control.

Figure 9 IK controlled motion compared to motion of a human actor

Figure 10 IK controlled motion compared to motion of a human actor

It also appears that the non-deterministic algorithm generates the solution faster than the deterministic one, while maintaining the same quality of results. Future implementations should therefore rely on this version of the Q-learning technique.


LEARNING BY MEANS OF GENETIC PROGRAMMING

Genetic Programming (GP) is another technique implemented to automate the process of animating virtual humans. GP is a domain-independent problem-solving approach in which a population of computer programs (individuals) is evolved to find a solution. The simulated evolution in GP is based on the Darwinian principle of reproduction and survival of the fittest. For more details on Genetic Programming the reader is referred to Koza (1992). In our implementation GP is used to create decision trees that control the behaviour of an avatar (see Figure 11). The trees are built from nodes representing actions, which the avatar can perform, and tests, which return information about the environment or the avatar's state (see also Lach, 2004). The actions include arm and forearm motion (see the first three illustrations of Figure 2), turning and walking forward. The tests make the obtained solution more general and give the avatar an appearance of intelligence: the avatar shows awareness of the environment when it conditionally executes one of the possible actions after checking the environment state or its own position with respect to certain conditions. For a virtual world consisting of a plane with a single door, seven tests were defined (see Table 2).

Table 2 Low-level actions and tests used to build the decision tree

Actions (forward kinematics) | Node name | Subtrees
1. Rotate arm up/down by ∆α | RRAUD, MRAUD | 1
2. Rotate arm forward/backward by ∆α | RRAFB, MRRAFB | 1
3. Rotate forearm by ∆α | RRFUD, MRRFUD | 1
4. Turn left/right by ∆α | RL, RR | 1
5. Move forward by ∆x | MOVE | 1

Tests | Node name | Subtrees
6. Avatar's position relative to the door | WD | 3
7. Avatar's facing direction | FD | 5
8. Distance between avatar and door | HC | 4
9. Avatar's hand position relative to the knob | WHX, WHY, WHZ | 3
10. Door being opened | IFOPEN | 2
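To make the tree representation concrete, the sketch below shows one possible encoding of such decision-tree individuals in Python. The node classes, the branch semantics and the avatar/world interfaces (`avatar.perform`, `world.evaluate_test`) are illustrative assumptions; the paper itself only specifies the node names and subtree counts of Table 2.

```python
# Hypothetical encoding of a GP individual: an action node carries out a motion
# and then continues with its single subtree; a test node inspects the world
# and descends into one of its branches.

class ActionNode:
    def __init__(self, name, child=None):
        self.name = name        # e.g. "RRAUD", "MOVE" (Table 2: one subtree each)
        self.child = child

    def execute(self, avatar, world):
        avatar.perform(self.name)             # assumed avatar API
        if self.child is not None:
            self.child.execute(avatar, world)

class TestNode:
    def __init__(self, name, branches):
        self.name = name        # e.g. "WD" (3 subtrees), "FD" (5 subtrees)
        self.branches = branches

    def execute(self, avatar, world):
        outcome = world.evaluate_test(self.name, avatar)   # assumed test API
        self.branches[outcome % len(self.branches)].execute(avatar, world)
```

An individual such as the one in Figure 11 is then simply a test node at the root (WD) whose branches mix further tests and chains of actions.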

After choosing appropriate nodes, an initial population of individuals (trees) is generated at random. Each individual tree in the population is then measured in terms of how well it performs on a particular task. This measure is called the fitness function and its definition depends on the problem being solved. To acquire a more general solution, each tree is tried out over a number of different fitness cases (initial, different avatar states), so that its fitness is measured as a sum or an average over a variety of representatives of different situations. Usually the individuals in generation 0 have a very poor fitness measure. Nonetheless, some trees in the population are fitter than others, and these differences in performance are exploited by Genetic Programming. The individuals in the population are iteratively evaluated for fitness, and genetic operations are performed on those individuals to generate a new population. The force driving this highly parallel, locally controlled, decentralised process uses only the observed fitness values of the individuals in the current population. The algorithm produces populations of decision trees which, over many generations, tend to exhibit increasing average fitness in dealing with the environment; a sketch of such a generational step is given below.
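The sketch below outlines the generational loop just described. It assumes the node classes above plus hypothetical `select`, `crossover` and `mutate` operators; fitness is averaged over several fitness cases as in the text, and the operator probabilities use the values quoted later for the door experiment.

```python
import random

def average_fitness(tree, fitness_fn, fitness_cases):
    # Fitness is measured over several initial avatar states (fitness cases)
    # and averaged, as described in the text.
    return sum(fitness_fn(tree, case) for case in fitness_cases) / len(fitness_cases)

def next_generation(population, fitness_fn, fitness_cases,
                    select, crossover, mutate,
                    p_crossover=0.9, p_mutation=0.05):
    # select / crossover / mutate are assumed operator implementations.
    scored = [(average_fitness(t, fitness_fn, fitness_cases), t) for t in population]
    new_population = []
    while len(new_population) < len(population):
        if random.random() < p_crossover:
            child = crossover(select(scored), select(scored))
        else:
            child = select(scored)        # straight reproduction
        if random.random() < p_mutation:
            child = mutate(child)         # per-individual mutation
        new_population.append(child)
    return new_population
```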


Figure 11 Decision tree - an individual of the GP population used to control the behaviour of an avatar.

For complex problems with low-level primitive operations it is intractable to search for a direct solution using GP, mainly due to the combinatorial explosion of the GP search space as a function of the problem state space. Many GP researchers who faced this obstacle looked for the answer in decomposing the main problem into simpler ones (for instance Kohl et al, 2003). One of the proposed solutions is the application of layered learning (Gustafson and Hsu, 2002). Applying the layered learning paradigm to a problem consists of breaking that problem up into a hierarchy of subproblems; the original problem is solved sequentially by using the learning results from all the member problems of each layer in the next layer. This is conceptually similar to many other divide-and-conquer learning paradigms, but an important difference is that the structure of the solution does not necessarily reflect this procedural hierarchy of training.
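The sketch below illustrates the layered-learning idea under the assumptions of the previous sketches: the same population is first evolved against the fitness objective of layer 1 and then carried over, unchanged in structure, to further evolution against the layer 2 objective. The generation count and helper names are hypothetical.

```python
def layered_learning_gp(initial_population, layer_objectives, fitness_cases,
                        select, crossover, mutate, generations_per_layer=250):
    """Evolve one population through successive layers; each layer reuses the
    individuals produced by the previous one (illustrative sketch only)."""
    population = initial_population
    for fitness_fn in layer_objectives:   # e.g. [fitness_layer1, fitness_layer2]
        for _ in range(generations_per_layer):
            population = next_generation(population, fitness_fn, fitness_cases,
                                         select, crossover, mutate)
    return population
```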


SOLVING THE DOOR OPENING TASK

In order to test the effectiveness of Genetic Programming on the animation learning task described in the previous sections, the locked door example was used. For this task the simplest fitness measure would take into account only the distance between the avatar and a location behind the door. However, the large number of degrees of freedom of the virtual human and the size of the environment quickly lead to a combinatorial explosion of the GP search space: with such a fitness measure the avatar would have to be very "lucky" to come across the door handle, open the door and fulfil the goal. To solve this problem we use a bottom-up decomposition of the task, where individuals first learn simpler tasks and then compose and coordinate them to solve larger ones. In our experiment we create two layers. The fitness objective of the first layer is to come within reaching distance of the door handle and lay a hand on it, while the fitness objective of the second layer is to minimise the distance between the avatar and the location behind the door. The fitness measures are computed as follows:

f_1(i,t) = 1 / (d_k + d_d * 0.6 + p_s + p_f + p_e + 1),      (1)

f_2(i,t) = 1 / (d_bd + p_s + p_f + p_e + 1),      (2)

where f_j(i,t) is the fitness of an individual i at generation time step t for layer j, d_d is the distance between the avatar and any location from which the avatar can reach the door knob, d_k is the distance between the avatar's hand and the door handle, d_bd is the distance between the avatar and the location behind the door, p_s is the number of actions performed by the avatar multiplied by a constant c, p_f is the value of the penalty for a wrong facing direction of the avatar, and p_e is the penalty assessed for illegal motion and for the avatar's collisions with objects in the environment. The fitness measures are larger for better-adjusted individuals in the population. During the evolution, genetic operations were performed on the individuals, including selection (with the probability that an individual passes to the next generation equal to 0.1), mutation (with individual probability at each generation equal to 0.05) and crossover (with probability 0.9).
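The two layer objectives can be written directly as code. The sketch below assumes hypothetical helper functions for the distance and penalty terms; only the formulas themselves come from equations (1) and (2).

```python
def fitness_layer1(individual, fitness_case,
                   dist_hand_to_knob, dist_to_reach_zone, penalties):
    """Layer 1 objective, equation (1): reach the door handle.
    The distance and penalty helpers are assumptions, not the paper's code."""
    d_k = dist_hand_to_knob(individual, fitness_case)
    d_d = dist_to_reach_zone(individual, fitness_case)
    p_s, p_f, p_e = penalties(individual, fitness_case)  # step count * c, facing, collisions
    return 1.0 / (d_k + 0.6 * d_d + p_s + p_f + p_e + 1.0)

def fitness_layer2(individual, fitness_case, dist_behind_door, penalties):
    """Layer 2 objective, equation (2): get behind the door."""
    d_bd = dist_behind_door(individual, fitness_case)
    p_s, p_f, p_e = penalties(individual, fitness_case)
    return 1.0 / (d_bd + p_s + p_f + p_e + 1.0)
```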

EXPERIMENTAL RESULTS

As explained in the previous sections, evolution with only one fitness function (equation (2)) provided very weak results: for 10 fitness cases the average number of hits for the population did not exceed 0.005 over 500 generations (Figure 12).

Figure 12 Classic GP: average number of hits (opened doors) per population, plotted against generation number

Experiments with Layered Learning GP have shown that, without changing any other parameters, adding a second layer alone significantly improves the generated results (Figure 13).

Figure 13 Average number of hits (opened doors) per population, plotted against generation number, for Classic GP and Layered Learning GP (both T500p400g10c)

The experiments also showed that population size has a significant influence on the speed with which adequate decision trees are obtained, as well as on their quality.

For a fixed computation time, estimated by the number of processed individuals, better results were obtained after reducing the population size from 2500 to 250 individuals (see Figure 14). Decreasing the population size allowed more generations of evolution, and more generations yielded better results than fewer generations with bigger populations. However, reducing the population size can improve the results only up to a point: with small populations the search space GP is capable of exploring is smaller than with big ones. This is the reason why the population size decrease was stopped at 250 individuals.

Figure 14 Average population fitness, plotted against the number of processed individuals, for experiments with populations of 250 (T250p1000g10c), 500 (T500p400g10c), 1000 (T1000p400g10c) and 2500 (T2500p200g10c) individuals

The evolved decision trees let the avatar get through the door from different starting positions. Unfortunately, the generated motion is not perfect: the avatar performs many unnecessary movements, and after getting through the door it usually "forgets" to return its arm to the initial position (see Figure 15). Modifying the fitness function may solve these problems.

Figure 15 Motion derived by the best solution found

CONCLUSIONS

In summary, the learning technique presented here generated satisfactory results when applied to a non-trivial task. Comparison of the motion generated by Q-learning to the motion of a human actor indicates that the sequences are sufficiently realistic to be applied in an animation system mimicking human behaviour. Although the technique appears to be only moderately scalable, some extensions will be possible, especially using IK control and neural networks for state space approximation; these would allow a few additional degrees of freedom to be added, so that tasks requiring the use of both hands or head motion could be simulated by the biped. Additionally, results obtained with a better hardware configuration suggest that a modern computer will improve the learning times by at least one order of magnitude.

The second technique, Genetic Programming, also generated very promising results. The enormous difference between Classic GP and Layered Learning GP suggests that future research should address other divide-and-conquer techniques (ADFs, switching fitness). The fitness function also ought to be adjusted so that it is best suited to the animation task at hand. In the near future a comparison between the two methods will be attempted.

REFERENCES


Anderson F. C. and Pandy M. G., Three-Dimensional Computer Simulation of Gait, Bioengineering Conference, Big Sky, Montana, June 16-20, 1999

Baird L. C. and Moore A. W., Gradient descent for general reinforcement learning, in Advances in Neural Information Processing Systems 11, eds. M. S. Kearns, S. A. Solla and D. A. Cohn, MIT Press, Cambridge, MA, 1999

Barne L., Di Pietro A. and While L., Learning in RoboCup Keepaway using Evolutionary Algorithms, in Proceedings of the GECCO, 1065-1072, 2002

Bertsekas D.P. and Tsitsiklis J.N., Neuro-Dynamic Programming, Athena Scientific, 1996

Blumberg B., Downie M., Ivanov Y., Berlin M., Johnson M. P., Tomlinson B., Integrated learning for interactive synthetic characters, ACM Transactions on Graphics, Vol. 21, Iss. 3, pp. 417-426, July 2002

Christianini N. and Shawe-Taylor J., Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000

Craig J.J., Introduction to Robotics: Mechanics and Control, Addison Wesley, 1989



Davison D. E. and Bortoff S. A., Acrobot software and hardware guide, Technical Report 9406, Systems Control Group, University of Toronto, Toronto, Ontario, Canada, June 1994

Dontcheva M., Yngve G., Popovic Z., Layered Acting for Character Animation, Proceedings of the 2003 ACM SIGGRAPH, San Diego, California, USA, 2003

Duthen Y., Luga H., Panatier C. and Sanza C., Adaptive Behavior for Cooperation: A Virtual Reality Application, in Proceedings of the International Workshop on Robot and Human Interaction, 76-81, 1999

Faloutsos P., Composable Controllers for Physics-Based Character Animation, Ph.D. Thesis, Department of Computer Science, University of Toronto, 2002

Fang A. C. and Pollard N. S., Efficient Synthesis of Physically Valid Human Motion, ACM Transactions on Graphics 22(3), pp. 417-426, SIGGRAPH 2003 Proceedings, 2003

Funge J. D., AI for Games and Animation: A Cognitive Modeling Approach, A K Peters, Natick, Massachusetts, 1999

Gustafson S.M. and Hsu W.H., Genetic Programming and Multi-Agent Layered Learning by Reinforcements, in Proceedings of the GECCO, 764-771, 2002

Hodgins J. K., Wooten W. L., Brogan D. C., O'Brien J. F., Animating Human Athletics, Proceedings of SIGGRAPH '95, in Computer Graphics, pp. 71-78, 1995

Isla D., Burke R., Downie M., Blumberg B., A layered brain architecture for synthetic creatures, in Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence IJCAI-01, pp. 1051-1058, Seattle, USA, 2001

Kaelbling L.P., Littman M.L., Moore A.W., Reinforcement learning: A survey, Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996

Kohl N., Miikkulainen R., Stone P., Whiteson S., Evolving Keepaway Soccer Players through Task Decomposition, in Proceedings of the GECCO, 356-368, 2003

Koza J.R., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, 1992

Lach E., Learning Intelligent Behaviours of Animated Agents by Means of Genetic Programming, in Proceedings of the VI International Conference on Artificial Intelligence AI-19'2004, No 23, 159-170, 2004

Lee J., Chai J., Reitsma P. S. A., Hodgins J. K., Pollard N. S., Interactive Control of Avatars Animated With Human Motion Data, Proceedings of the 2002 ACM SIGGRAPH, San Antonio, Texas, USA, 21-26 July 2002

Lee J. W., Baek N., Kim D., Hahn J. K., A Procedural Approach to Solving Constraints of Articulated Bodies, Eurographics 2000 Short Presentations Programme, Interlaken, Switzerland, August 20-25, 2000

Li Y., Wang T., Shum H.-Y., Motion Textures: A Two-Level Statistical Model for Character Motion Synthesis, Proceedings of the 2002 ACM SIGGRAPH, San Antonio, Texas, USA, 21-26 July 2002

Liu K. C. and Popovic Z., Synthesis of Complex Dynamic Character Motion From Simple Animations, Proceedings of the 2002 ACM SIGGRAPH, San Antonio, Texas, USA, 21-26 July 2002

Metoyer R. A. and Hodgins J. K., Animating Athletic Motion Planning By Example, Proceedings of Graphics Interface 2000, pp. 61-68, Montreal, Quebec, Canada, May 15-17, 2000

Monekosso N. D., Remagnino P., Szarowicz A., An Improved Q-Learning Algorithm Using Synthetic Pheromones, in From Theory to Practice in Multi-Agent Systems, Lecture Notes in Computer Science, vol. 2296, eds. Dunin-Keplicz B. and Nawarecki E., Springer-Verlag, pp. 197, March 2002

Monzani J.-S., An Architecture for the Behavioural Animation of Virtual Humans, Ph.D. dissertation, Ecole Polytechnique Fédérale de Lausanne, 2002

Ng A. Y., Harada D., Russell S., Policy invariance under reward transformations: Theory and application to reward shaping, Proceedings of ICML-99, Bled, Slovenia, 1999

Pandy M. G. and Anderson F. C., Three-Dimensional Computer Simulation of Jumping and Walking Using the Same Model, in Proceedings of the VIIth International Symposium on Computer Simulation in Biomechanics, August 1999

Russell K. B. and Blumberg B., Behavior-friendly graphics, in Computer Graphics International, pp. 44-, 1999

Schaal S. and Atkeson C., Robot juggling: An implementation of memory-based learning, Control Systems Magazine, 14, 1994

Sutton R. S., Generalization in reinforcement learning: Successful examples using sparse coarse coding, in Touretzky D. S., Mozer M. C. and Hasselmo M. E. (eds.), Advances in Neural Information Processing Systems: Proceedings of the 1995 Conference, pp. 1038-1044, MIT Press, Cambridge, MA, 1996

Sutton R. S. and Barto A. G., Reinforcement Learning: An Introduction, MIT Press, 1998

Szarowicz A. and Forte P., Combining intelligent agents and animation, in AIxIA 2003 - Eighth National Congress on AI, Lecture Notes in Artificial Intelligence, vol. 2829, Pisa, Italy, Springer-Verlag, 2003

Szarowicz A., Francik J., Mittmann M., Remagnino P., Layering and Heterogeneity as Design Principles for Animated Embedded Agents, International Journal of Information Sciences, Elsevier, to appear, 2005

Szarowicz A., Mittmann M., Remagnino P., Francik J., Automatic Acquisition of Actions for Animated Agents, 4th Annual European GAME-ON Conference, November 19-21, London, United Kingdom, 2003

Szarowicz A. and Remagnino P., Avatars That Learn How to Behave, European Conference on Artificial Intelligence ECAI 2004, Springer, Valencia, Spain, 2004

Tang W. and Wan T. R., Intelligent Self-learning Characters for Computer Games, in Proceedings of the 20th EGUK, 2002

Tedrake R. and Seung H. S., Improved Dynamic Stability using Reinforcement Learning, Proceedings of the International Conference on Climbing and Walking Robots (CLAWAR'02), 2002

Terzopoulos D., Rabie T., Grzeszczuk R., Perception and Learning in Artificial Animals, Artificial Life V: Proc. 5th Inter. Conf. on the Synthesis and Simulation of Living Systems, Nara, Japan, 1996

Tesauro G., TD-Gammon, a self-teaching backgammon program, achieves master-level play, Neural Computation, 6(2), pp. 215-219, 1994

Thrun S., Learning to play the game of chess, in Tesauro G., Touretzky D. S. and Leen T. K. (eds.), Advances in Neural Information Processing Systems 7, MIT Press, Cambridge, MA, 1995

Tomlinson B., Blumberg B., Nain D., Expressive autonomous cinematography for interactive virtual environments, in Proceedings of the Fourth International Conference on Autonomous Agents, Barcelona, Spain, 2000

Touzet C. F., Neural Networks and Q-Learning for Robotics, IJCNN '99 Tutorial, International Joint Conference on Neural Networks, Washington, DC, July 10-16, 1999

van de Panne M., Laszlo J., Huang P., Faloutsos P., Dynamic Human Simulation: Towards Agile Animated Characters, Proceedings of the IEEE International Conference on Robotics and Automation 2000, pp. 682-687, San Francisco, CA, 2000

Watkins C. J. C. H., Learning from Delayed Rewards, Ph.D. dissertation, Cambridge University, Psychology Department, 1989

Yoon S. Y., Blumberg B. M., Schneider G. E., Motivation driven learning for interactive synthetic characters, in Proceedings of Autonomous Agents 2000
