Motivation and Emotion in Anticipatory Behavior of Consequence Driven Systems

Stevo Bozinovski
Mathematics and Computer Science Department
South Carolina State University
[email protected]

The work addresses the basic problem of what motivation and emotion are in anticipatory behavior. The result of this work is that what motivates an agent's anticipatory behavior is the value of the anticipated consequence of that behavior; emotion is an evaluation system for computing the value of that consequence. The paper first presents an overview of consequence driven systems theory and then introduces motivational graphs and motivational polynomials on the way toward the obtained results.

1. Introduction: Problem Statement

In this work we try to find a relation between the concepts of motivation, emotion, and anticipation in adaptive learning systems. The work is a continuation of our previous effort to develop a sound consequence driven systems theory. In the sequel we first describe the basic concepts of the theory, and then introduce two new concepts, motivational graphs and motivational polynomials, with which we present our relations between motivation, emotion, and anticipatory behavior.

2. Consequence Driven Systems Theory

Consequence Driven Systems theory is an attempt to understand agent personality. It tries to find an architecture that grounds notions such as motivation, emotion, learning, disposition, anticipation, curiosity, confidence, and behavior, among others, which are usually present in discussions of agent personality. It originated as an attempt to solve the delayed reinforcement learning problem. In this chapter we describe the origin of the theory and some of its features.

2.1. Origin and Framework

Consequence driven systems theory originated in an early Reinforcement Learning research effort to solve the assignment of credit problem using a neural network. Such an effort was undertaken in 1981 within the Adaptive Networks (ANW) Group at the Computer and Information Science Department of the University of Massachusetts at Amherst. Two instances of the assignment of credit problem were considered: the maze learning problem and the pole balancing learning problem. Two anticipatory learning architectures were proposed within the Group: the Actor/Critic (A/C) architecture (Sutton and Barto) and the Crossbar Adaptive Array (CAA) architecture (Bozinovski). Although both architectures were designed to solve the same basic problem, the maze learning problem, the challenges of the designs were different. The A/C architecture effort was challenged by the mazes of animal learning experiments, where there are many states and only one rewarding state (food). The CAA architecture effort was challenged from the start by the mazes defined in the computer game Dungeons and Dragons, where there is one goal state but many rewarding and punishing states along the way. So, from the very beginning, the CAA effort adopted the concept of dealing with pleasant and unpleasant states, feelings, and emotions. The resulting architectures are shown in Figure 1.

Figure 1. Examples of two anticipatory learning systems: the Actor/Critic architecture (Figure 1a) and the Crossbar Adaptive Array architecture (Figure 1b)

As Figure 1 shows, the obvious difference is that the A/C architecture needs two identical memory structures, V and W, to compute the internal reinforcement and the action, while the CAA architecture uses only one memory structure, W, of the same size as one of the A/C memory structures, for both computations. The most important difference, however, is the design philosophy: in contrast to the A/C architecture, the CAA architecture does not use any external reinforcement r; it uses only the current situation X as input. A further, more subtle difference, not discussed in this paper, is computational complexity: the A/C architecture uses four incremental equations, two for learning rules and two for memory traces, and the equations are of second order; in contrast, the CAA architecture uses only one first-order incremental equation. In addition, the CAA approach introduces the concept of state evaluation and connects it to the concept of feeling, which it uses as the internal reinforcing entity.

Simulation experiments with agents based on those architectures were carried out by Sutton and Bozinovski (office-mates at that time). Interestingly enough, the CAA approach proved more efficient in solving the maze learning instance of the credit assignment problem, and it was the only architecture whose solution was presented before the ANW group in 1981. The interested reader may find more details in the first published reports by Barto, Sutton, and Anderson (1983) and Bozinovski (1982). While the former paper keeps the statement of the problem as the credit assignment problem, the latter reformulates it as the delayed reinforcement learning problem.

The original CAA idea of having one memory structure for crossbar computation of both state values and action values was later also implemented in reinforcement learning architectures such as the Q-learning system (Barto, Sutton and Watkins 1990; Watkins 1989) and the Dyna architecture (Sutton 1990). Q-learning (Watkins 1989) uses exactly the same memory structure as the CAA memory structure W, and denotes it the Q-table. The main difference between the CAA approach and the Q-learning approach remains the use of external reinforcement r. The CAA learning rule has the form w'ij = wij + vk, while the Q-learning rule has the form w'ij = (1-α)wij + α(rij + γvk), where wij is the crossbar value (or Q-value) of action i in state j, vk is the anticipated emotion in the consequence state k, rij is the immediate external reinforcement received in state j after performing action i, γ is a discount factor, and α is a forgetting parameter. In the philosophy of the CAA approach there is no external reinforcement; the parameter r could be considered only as an internal cost, -cij, of performing i in j, in a learning rule of the form w'ij = wij + vk - cij (Bozinovski 1995), but not as an external reinforcement. The Dyna architecture (Sutton 1990) extends the concept of the crossbar adaptive array, introducing more arrays in order to build and use a model of the environment. We denote this extension as the crossbar adaptive tensor (CAT) architecture (Bozinovski 1995).
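To make the contrast concrete, here is a minimal Python sketch of the two update rules side by side, using the notation above; the table size, the function names (caa_update, q_update, state_emotion), and the choice of the maximum as the emotion aggregate are illustrative assumptions, not part of the original specifications.

```python
import numpy as np

# Minimal sketch of the two crossbar update rules discussed above.
# Table size, parameter values, and the max-based emotion are assumptions.

n_actions, n_states = 4, 10
W = np.zeros((n_actions, n_states))  # crossbar memory: W[i, j] = value of action i in state j

def state_emotion(W, k):
    """Anticipated emotion v_k of consequence state k, derived from the crossbar memory."""
    return W[:, k].max()

def caa_update(W, i, j, v_k, c_ij=0.0):
    """CAA rule: w'_ij = w_ij + v_k (optionally minus an internal cost c_ij)."""
    W[i, j] += v_k - c_ij
    return W

def q_update(W, i, j, r_ij, v_k, alpha=0.1, gamma=0.9):
    """Q-learning rule: w'_ij = (1 - alpha) * w_ij + alpha * (r_ij + gamma * v_k)."""
    W[i, j] = (1.0 - alpha) * W[i, j] + alpha * (r_ij + gamma * v_k)
    return W
```

The essential contrast is visible in the signatures: q_update consumes an external reinforcement r_ij, while caa_update is driven only by the internally computed emotion of the consequence state.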

The CAA approach introduced learning systems that can learn without external reinforcement. In the philosophy of the CAA approach, all learning rules that use external reinforcement are classified into the supervised learning class. Figure 2 shows our taxonomy of learning systems, which is also a framework for studying emotion learning systems.

Learning
  supervised
    advice learning
    reinforcement learning
  unsupervised
    emotion learning
    similarity learning

Figure 2. A framework for studying emotion learning systems

As Figure 2 shows, in a supervised learning system a supervisor (e.g. a teacher) can give an external reinforcement of the agent's behavior, and/or an advice on how the agent should choose its future actions. In unsupervised (self-supervised) systems, having no external teacher of any kind, the agent must develop an internal state evaluation system in order to compute an emotion (internal reinforcement). A kind of unsupervised learning can also be observed in agents that use some measure of similarity in tasks of adaptive classification of input situations or data. Reinforcement Learning (e.g. Barto 1997) can be considered part of the supervised learning class, which has recently been recognized by other researchers (e.g. Pushkin and Savova 2002). Other researchers, such as Gadanho (1999) and Butz, Sigaud and Gerard (2002), recognize the distinction between external reinforcement learning as supervised learning and internal reinforcement learning as self-supervised learning.

2.2. Main Concepts of the Theory

After this exposition of the origins of the Consequence Driven Systems theory, we will present its grounding concepts. Those are:
1. three environments
2. past-present-future: evaluation-emotion-moral
3. crossbar adaptive array architecture
4. emotional graphs
5. evaluate states, backpropagate emotions, remember behaviors
6. understanding interactions by parallel programming

2.2.1. Three environments

The theory assumes that an agent should always be considered as a three-environment system. The agent expresses itself in its behavioral environment, where it behaves. It has its own internal environment, where it synthesizes its behavior. It also has access to a genetic environment, from which it receives its initial conditions for existing in its behavioral environment. The genetic and the behavioral environments are related: the initial knowledge transferred through the imported genome properly reflects, in the value system of the agent, the situations that are dangerous for the agent in the behavioral environment. It is assumed that all agents import some genome at the time of their creation, but not all of them are able to export their genomes after some learning period. This concept applies to biological as well as non-biological agents. However, for non-biological agents, instead of a genetic environment we use the concept of a generic environment. For example, a BIOS ROM is an example of an imported generome for a non-biological agent.

2.2.2. Past-present-future: Evaluation-emotion-moral

The theory emphasizes that each agent architecture should be able to understand temporal concepts, such as the past, the present, and the future, in order to be able to self-organize in an environment. The past is associated with the evaluation of the agent's previous performance, the present with the emotion that the agent computes toward the current situation (or the current state), and the future with the moral (self-advice) the agent will learn for its future behavior. A Generic Architecture for Learning Agents (GALA architecture) is shown in Figure 3.

Figure 3. The GALA architecture (inputs: X = current situation, U = advice for future behavior, r = evaluation of previous behavior; outputs: Y = current behavior, e = current emotion)

Note that it is a genuinely generic, black-box architecture: only the inputs and outputs are specified. Yet the way the inputs and outputs are defined is very specific, and the architecture is ready to generate instance architectures by reconnection. The generic GALA architecture can generate advice learning, reinforcement learning, and emotion learning architectures just by using appropriate feedbacks.

It is a reconnection-reconfigurable architecture. An advice learning agent is generated directly, by connecting the inputs and outputs of the agent to the environment. A reinforcement learning agent is obtained by connecting the current behavior output to the advice input (U ← Y). The third type of learning agent, the emotion learning agent, is obtained from a reinforcement learning agent by connecting the current emotional output to the input for evaluation of the agent's past performance (r ← e). Figure 4 shows the construction of an emotion learning agent from the generic GALA architecture by a neural rewiring mechanism; a minimal code sketch of this rewiring follows.
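The following Python sketch wires the GALA inputs and outputs in the three ways just described; the class and function names (GalaAgent, advice_learning_step, reinforcement_learning_step, emotion_learning_step) and the placeholder internals are our own assumptions, not part of the original architecture.

```python
# Sketch of GALA reconnection, under our own naming assumptions.
# A GALA agent exposes inputs (X, U, r) and outputs (Y, e); the three learning
# agent types differ only in which outputs are fed back to which inputs.

class GalaAgent:
    def step(self, X, U=None, r=None):
        """Consume situation X, optional advice U and evaluation r;
        return current behavior Y and current emotion e (placeholder logic)."""
        Y = 0       # behavior selection would go here
        e = 0.0     # internal emotional evaluation would go here
        return Y, e

def advice_learning_step(agent, X, U, r):
    # inputs and outputs connected directly to the environment
    return agent.step(X, U=U, r=r)

def reinforcement_learning_step(agent, X, r, prev_Y):
    # U <- Y: the agent's own previous behavior replaces external advice
    return agent.step(X, U=prev_Y, r=r)

def emotion_learning_step(agent, X, prev_Y, prev_e):
    # U <- Y and r <- e: no external teaching signal of any kind remains
    return agent.step(X, U=prev_Y, r=prev_e)
```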

Y

U = advice for future behavior S = current situation

Y = current behavior emotion learning agent e = current emotion

r = evaluation of previous behavior r

e

Figure 4: Emotion learning agent generated from the GALA architecture Let us note that the advice learning and reinforcement learning agents do not necessarily require the emotional output of the agent. They can learn from the advices and reinforcements they receive from the environment. The emotion learning agent, in contrast, needs an internal emotional system in order to value its previous behavior and current situations. The value system of an agent can be emotional or rational or both. Lower level organisms have emotion-based value system, while higher-level ones can develop a ratiobased evaluation system. Here we are interested in lower level, emotion based value system. 2.2.3. The Crossbar Adaptive Array architecture Crossbar Adaptive Array architecture (Figure 5) is derived as an emotion learning agent (Figure 4). In a crossbar fashion, this architecture computes both state evaluations (emotions) and behavior evaluations. It contains three basic modules: crossbar learning memory, state evaluator, and behavior selector. In its basic routine, the CAA architecture firstly computes the emotion of being in the current state, and then, using a feedback loop, computes the possibility of choosing again, in a next time, the behavior to which the current situation is the consequence. The state evaluation module computes the global emotional state of the agent and broadcasts it (e.g. by way of a neuro-hormonal signal) to the crossbar learning memory. The

The behavior computation module, using some kind of behavior algebra, initially performs a curiosity-driven default behavior, but gradually that behavior is replaced by a learned behavior.

Figure 5. Crossbar Adaptive Array architecture (modules: Crossbar Adaptive Memory, State Evaluator, Behavior computation, Personality parameters; signals: situation, emotion, learned behavior, curiosity driven behavior, patience, behavior)

The fourth module defines specific personality parameters of a particular agent, such as curiosity and a level (threshold) of tolerance (patience). It also provides some physiological parameters, such as urge signals. The urge signals produce interrupt behaviors in the behavior algebra of the agent.

2.2.4. Emotional graphs

The emotional graph is the basic concept of the mental representation of agents in the consequence driven systems theory. Environment situations are represented as emotionally colored nodes. The emotional value can be represented in different ways, for example using numerical values, but stylizations of facial expressions are preferred whenever possible. Transitions between states are behaviors. The term "behavior" covers both simple actions and possibly complex networks of actions. Sometimes there are states that are not reachable by the behavior repertoire of the agent. The emotional value is given to a state either by the genetic mechanism or by learning, after the agent visits that state. A state can be a neutral one and can change its emotional value after being visited. There are different concepts of what an agent learns in an environment: it can learn the whole graph, like a cognitive map, or it can learn only a policy (the set of states and the behaviors associated with those states). In the case of policy learning, the environment itself provides the actual map, the interconnection network between the states.
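A minimal Python sketch of such an emotional graph representation follows; the class name (EmotionalGraph), the use of plain numbers for emotional values (the paper prefers stylized facial expressions), and the toy states are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class EmotionalGraph:
    emotion: dict = field(default_factory=dict)    # state -> emotional value (0.0 = neutral)
    behaviors: dict = field(default_factory=dict)  # (state, behavior) -> consequence state

    def add_state(self, s, value=0.0):
        self.emotion[s] = value

    def add_behavior(self, s, b, s_next):
        self.behaviors[(s, b)] = s_next

    def visit(self, s, learned_value):
        # a neutral state may change its emotional value after being visited
        self.emotion[s] = learned_value

# Toy example: one neutral start state, one unpleasant state, one pleasant (goal) state.
g = EmotionalGraph()
g.add_state("start")
g.add_state("trap", -1.0)
g.add_state("goal", +1.0)
g.add_behavior("start", "go_left", "trap")
g.add_behavior("start", "go_right", "goal")
```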

2.2.5. Evaluate states, backpropagate emotions, remember behaviors

The Consequence Driven Systems theory introduced the principle of remembering only the behavior evaluations, not the state evaluations. Each behavior is assigned a value. Each state computes its emotional value from its behavior values, and the behavior values are computed from the values of their consequence states. What is actually stored are the behavior values; there is no need to store the state values, since they can be computed from the behavior values (a small worked sketch of this principle is given at the end of Section 2.2). Let us note that before 1981, research in Dynamic Programming (Bellman 1957) was based on memorizing states. The relation between Dynamic Programming and delayed reinforcement learning was established by Watkins (1989). The terminology now used in reinforcement learning is highly influenced by Dynamic Programming terminology.

2.2.6. Understanding interactions by parallel programming

In consequence driven systems theory, parallel programming is understood as a way of thinking about the agent-environment interaction. It was actually used in carrying out the pole balancing learning experiments, where the CAA controller ran on one VAX/VMS terminal, while the environment, the pole balancing dynamics (designed by Charles Anderson), ran on another terminal (Bozinovski 1981).
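To close Section 2.2, here is the small worked sketch promised in Section 2.2.5: a toy three-state chain in which only behavior values are stored, state values are derived from them on the fly, and the pleasant value seeded at the final state propagates backward over repeated traversals. The chain itself, the max-based state value, and the way the final state's pleasantness is seeded are our own illustrative assumptions.

```python
import numpy as np

# Toy illustration (our own assumptions): only behavior values W are stored;
# state values are derived from them; the emotion of the consequence state is
# backpropagated to the behavior that leads to it, over a chain s0 -> s1 -> s2.

W = np.zeros((1, 3))      # one behavior per state, three states
W[0, 2] = 1.0             # s2 seeded as pleasant by giving its only behavior a positive value

def state_value(W, j):    # derived on the fly, never stored
    return W[:, j].max()

for episode in range(2):
    for j, k in [(0, 1), (1, 2)]:      # traverse the chain
        W[0, j] += state_value(W, k)   # backpropagate the consequence emotion

print(W)  # after two traversals the pleasant value has reached the behavior in s0
```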

3. Anticipatory Behavior: Motivations, Urges, Emotions, and Learning

An anticipatory behavior is mostly a consequence driven, learned behavior. However, an agent performing an anticipatory behavior is often forced to deviate from it in order to execute some self-maintaining behavior (thirst, hunger, repair, etc.). We distinguish such interrupt behaviors from the main, purposive, anticipatory behavior. In this part of the paper we introduce the concept of a motivational graph, by which we precisely distinguish between anticipatory, motivated behaviors and interrupt behaviors.

3.1. Motivational Graphs and Motivational Polynomials

The motivational graph (or motivational net) is a representational concept for the value system of an agent. This representation distinguishes between motivations and emotions such that motivations are associated with behaviors, while emotional values are associated with the states of an agent. Figure 6 shows the basic elements of a motivational graph.

Figure 6. Basic elements of a motivational graph (a behavior B1 with its cost and urge, leading to a consequence state S1; a motivation update driven by the emotional value of the consequence state; a second behavior B2 with its own cost)
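Gathering the elements named in Figure 6, the following Python sketch assembles one motivational graph element; all field names, the motivation update rule (consequence emotion minus cost), and the urge-threshold selection are illustrative assumptions rather than the paper's definitions.

```python
from dataclasses import dataclass

# Sketch of a motivational graph element, assembled from the elements named in
# Figure 6; field names, the update rule, and the selection rule are assumptions.

@dataclass
class Behavior:
    name: str                # e.g. "B1"
    consequence: str         # anticipated consequence state, e.g. "S1"
    cost: float = 0.0        # cost of executing the behavior
    urge: float = 0.0        # physiological urge that can force this behavior
    motivation: float = 0.0  # motivation associated with the behavior

def update_motivation(b, emotion_of):
    """Motivation update: driven by the emotional value of the anticipated
    consequence state, net of the behavior's cost."""
    b.motivation = emotion_of[b.consequence] - b.cost
    return b.motivation

def select(behaviors, urge_threshold=1.0):
    """Urges above threshold trigger an interrupt behavior; otherwise the most
    motivated (anticipatory) behavior is chosen."""
    urgent = [b for b in behaviors if b.urge > urge_threshold]
    if urgent:
        return max(urgent, key=lambda b: b.urge)
    return max(behaviors, key=lambda b: b.motivation)
```

This selection rule mirrors the distinction drawn above between anticipatory, motivated behaviors and interrupt behaviors.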