ISSN 0918-2802 Technical Report

On Some Asymptotic Properties of Learning Automaton Networks

Taisuke Sato

TR99-0003, August

Department of Computer Science Tokyo Institute of Technology

Ookayama 2-12-1, Meguro, Tokyo 152-8552, Japan

http://www.cs.titech.ac.jp/

© The author(s) of this report reserve all the rights.

Abstract

In this report, we analyze the collective behavior of learning automata used in a programming language under development that combines reinforcement learning and symbolic programming [2, 6]. A learning automaton can automatically improve its behavior by using the response from a stationary random environment, but when automata are connected with each other, their behavior becomes much more complex and harder to analyze. We analyzed a class of learning automaton networks and proved that they eventually take the most rewarding action with probability one when they use an appropriately decaying learning rate.

1 Introduction

Learning automata are basic learning devices used in a new programming language we are currently developing which combines reinforcement learning and symbolic programming [2, 6]. We embed them in a program and let them learn optimal decisions by using the response from a stationary random environment to the computed results. The behavior of a single learning automaton is well known [3, 4], but when automata are connected with each other, as they are when embedded in a program, little is known about their collective behavior. We therefore present an analysis of the asymptotic behavior of learning automaton networks, i.e. learning automata organized as a dag (directed acyclic graph). We chose this class for analysis because dags often appear as the skeletal structure of automata embedded in a program, yet they remain tractable compared to general graph structures containing loops. We first review the properties of a single learning automaton in section 2. The class of learning automaton networks is introduced in section 3. In section 4, we prove a main theorem on learning automaton networks stating that the best action will eventually be taken with probability one if an appropriately decaying learning rate is used.


2 A single learning automaton

A learning automaton [3, 4] is a reinforcement learning device applicable in a situation where there are several possible actions with different probabilistic rewards, and we want to know the best action, the one that maximizes the average reward. This kind of situation arises all the time, for instance when searching for a good doctor while trying to cure a disease.

Figure 1: A learning automaton. The automaton LA chooses among actions $\alpha_1, \ldots, \alpha_N$ with probabilities $p_1, \ldots, p_N$; the environment returns a reward whose average for action $\alpha_i$ is $d_i$.

Figure 1 shows a learning automaton with actions $\alpha_1, \ldots, \alpha_N$. On each trial, some action $\alpha_i$ is chosen with probability $p_i$ ($1 \le i \le N$). Put $p(n) = \langle p_1(n), \ldots, p_N(n) \rangle$ ($\sum_{i=1}^{N} p_i(n) = 1$), a vector representing the probability distribution over actions at time $n$. By executing $\alpha_i$, the automaton gets a reward $\beta$ ($0 \le \beta \le 1$) from a random environment (we assume the environment is stationary), and updates $p(n)$ by the $L_{R-I}$ (Linear Reward-Inaction) scheme [3] as follows:

$$p_i(n+1) = p_i(n) + c_n \beta\,(1 - p_i(n))$$
$$p_j(n+1) = (1 - c_n \beta)\, p_j(n) \qquad (j \ne i)$$

where $c_n$ ($0 < c_n < 1$) is the learning rate at time $n$. Since $\beta$ is a random variable, $p(n)$ becomes a random vector evolving with time $n$.
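As a concrete illustration (not part of the original report), here is a minimal Python sketch of the $L_{R-I}$ update; the Bernoulli-reward environment and all names are our own illustrative choices:

```python
import random

def lri_update(p, i, beta, c):
    """One L_{R-I} step: action i was taken, reward beta in [0, 1], learning rate c."""
    return [pk + c * beta * (1.0 - pk) if k == i else (1.0 - c * beta) * pk
            for k, pk in enumerate(p)]

# Tiny usage example with Bernoulli rewards whose means are d = (0.2, 0.8).
d = [0.2, 0.8]                  # average reward of each action (assumed environment)
p = [0.5, 0.5]                  # initial action probabilities
for n in range(5000):
    i = random.choices(range(len(p)), weights=p)[0]   # choose an action
    beta = 1.0 if random.random() < d[i] else 0.0     # binary reward from the environment
    p = lri_update(p, i, beta, c=0.01)                # constant learning rate here
print(p)   # typically close to [0, 1]: the higher-reward action dominates
```

With a constant learning rate the automaton can occasionally lock onto the inferior action; this is exactly the risk discussed at the end of this section.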

The $L_{R-I}$ scheme improves the average performance of the automaton. Let $M(n)$ be the average reward at time $n$ conditioned on $p(n)$, defined as

$$M(n) \stackrel{\mathrm{def}}{=} E[\beta \mid p(n)] = \sum_{j=1}^{N} p_j(n)\, d_j$$

where $E[\cdot]$ denotes expectation and $d_j$ is the average reward for action $\alpha_j$ ($1 \le j \le N$). Note that $M(n)$ is a random variable as a function of $p(n)$. We see

$$E[M(n+1) \mid p(n)] = E\Big[\sum_{i=1}^{N} p_i(n+1)\, d_i \;\Big|\; p(n)\Big] \qquad (1)$$
$$= M(n) + \frac{c_n}{2} \sum_{i,j} (d_i - d_j)^2\, p_i(n)\, p_j(n) \qquad (2)$$

Taking expectation of both sides of (2) gives $E[M(n+1)] \ge E[M(n)]$. $E[M(n)]$, the expected average reward of the automaton, thus improves with time under the $L_{R-I}$ scheme, but what about the sample behaviors of $M(n)$ and $p_i(n)$? First, it follows from (2) that $\{M(n)\}_{n \ge 1}$ is a submartingale [1]. Since $|M(n)| \le 1$, it almost surely converges, which implies $\lim_{n \to \infty} E[M(n+1) \mid p(n)] - M(n) = 0$ a.s. So if the learning rate $c_n$ is constant, say $c$, it follows from (1) that $\lim_{n \to \infty} p_j(n) = 0$ or $1$ a.s. for all $j$ ($1 \le j \le N$). In other words, with a constant learning rate, the automaton converges (a.s.) to a specific action, but the problem is that this action is not guaranteed to be an optimal one in the sense of average reward. It is mathematically proved that we cannot avoid the risk of convergence to a non-optimal action as long as the learning rate is constant, though the risk can be made arbitrarily small by choosing a sufficiently small $c$ [3]. How to obtain a.s. convergence to the optimal action, on the other hand, seems to have remained unknown until recently, when Poznyak and Najim proved that it is obtainable by using the decaying learning rate $c_n = \frac{b}{n+a}$ ($a \ge 1 > b > 0$) [4].
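The identity (2) can be checked numerically. The snippet below (an illustrative check, not taken from the report) computes $E[M(n+1) \mid p(n)]$ exactly by averaging the $L_{R-I}$ update over the chosen action, replacing $\beta$ by its conditional mean $d_i$ (legitimate because the update is linear in $\beta$), and compares it with the right-hand side of (2).

```python
def expected_next_M(p, d, c):
    """Exact E[M(n+1) | p] for the L_{R-I} scheme: average over the chosen action,
    with beta replaced by its conditional mean d[i] (the update is linear in beta)."""
    total = 0.0
    for i, pi in enumerate(p):
        # expected action probabilities after taking action i
        q = [(pk + c * d[i] * (1 - pk)) if k == i else (1 - c * d[i]) * pk
             for k, pk in enumerate(p)]
        total += pi * sum(qk * dk for qk, dk in zip(q, d))
    return total

p, d, c = [0.2, 0.5, 0.3], [0.1, 0.6, 0.9], 0.05
M = sum(pi * di for pi, di in zip(p, d))
rhs = M + (c / 2) * sum((d[i] - d[j]) ** 2 * p[i] * p[j]
                        for i in range(len(p)) for j in range(len(p)))
print(expected_next_M(p, d, c), rhs)   # the two numbers coincide up to rounding
```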

3 LA networks

In this section, we introduce a class of learning automaton networks, a generalization of the hierarchical systems of [3]. Formally, a learning automaton network is a finite dag such that a node and its outgoing edges connecting it to its child nodes comprise a learning automaton, and there exists a unique node, called the root, that has only outgoing edges. (In what follows, for simplicity, a learning automaton network is referred to as a LA network, and we use the term "node" and the learning automaton containing it interchangeably.) Each edge is labeled by an action and the choice probability associated with it (see Figure 2). A node with no child nodes is called a leaf. On each execution of the root, edges are chosen one by one with their associated probabilities from the root down to a leaf, defining an execution path. The action taken last determines a reward $\beta$ ($0 \le \beta \le 1$) from the environment (we consider that the leaves of the LA network represent our actual choices). The reward is delivered to every node on the execution path and is used to adjust choice probabilities by the $L_{R-I}$ updating algorithm.

Figure 2: A LA network. Execution proceeds from the root along chosen edges down to a leaf, and the reward from the environment is delivered to every node on the execution path.

It must be emphasized that LA networks are harder to analyze than a single learning automaton. Generally speaking, when there are multiple learning automata connected with each other, the other automata are, from the point of view of a given automaton, part of its environment, and the average reward for a selected action becomes non-stationary, i.e. it changes with time as their learning proceeds. This, for example, prevents the derivation of (1), which implies $E[M(n+1)] \ge E[M(n)]$.

A hierarchical system [3] (a LA network which is a tree) can ensure (1) w.r.t. the root automaton by computing different learning rates for the descendant automata at each level on every trial. When a constant learning rate $c$ is used for the root automaton, however, the system works as a single automaton with learning rate $c$, and hence it cannot avoid the risk of convergence to a non-optimal action. If we use the decaying learning rate $c_n = \frac{b}{n+a}$ instead, we face the following problem. The updating method in [3] forces a uniform decay of the learning rates of every automaton, even though only some of them actually take an action on a given trial. This causes unduly fast decay of the learning rates of the non-chosen learning automata, and might give rise to slower convergence, or even non-convergence, to their optimal actions. As a result, we cannot be sure that the root will eventually take the optimal action with probability one. In the next section, we first show that (1) holds for a tree-structured LA network with binary reward, and second that a.s. convergence to the optimal action is obtainable for general LA networks by using "locally decaying learning rates", i.e. learning rates that decay with each node's own execution count.
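To make the setting concrete, here is a minimal sketch of a tree-structured LA network in Python (all class and function names are our own; the reward model, a Bernoulli draw at the leaf, is an illustrative assumption). Each node keeps its own local execution count and uses the locally decaying rate $c_n = b/(n+a)$ introduced in the next section.

```python
import random

class Node:
    """A learning automaton: actions lead either to child Nodes or to leaf rewards."""
    def __init__(self, children, a=1.0, b=0.5):
        self.children = children          # Node instances, or average rewards for leaves
        self.p = [1.0 / len(children)] * len(children)
        self.local_time = 0               # local time: how many times this node was executed
        self.a, self.b = a, b

    def execute(self):
        """Run one root-to-leaf execution below this node; return the received reward."""
        self.local_time += 1
        i = random.choices(range(len(self.p)), weights=self.p)[0]
        child = self.children[i]
        if isinstance(child, Node):
            beta = child.execute()        # recurse until a leaf is reached
        else:                             # leaf: child is the average reward d of this action
            beta = 1.0 if random.random() < child else 0.0   # assumed binary reward
        c = self.b / (self.local_time + self.a)              # locally decaying learning rate
        self.p = [pk + c * beta * (1 - pk) if k == i else (1 - c * beta) * pk
                  for k, pk in enumerate(self.p)]            # L_{R-I} update at this node
        return beta                       # the same reward is passed up to every ancestor

# A depth-2 tree: the root's second action leads to the subtree containing the best leaf (0.9).
root = Node([Node([0.2, 0.4]), Node([0.3, 0.9])])
for _ in range(20000):
    root.execute()
print(root.p)   # the probability of the root's second action approaches 1
```

Every node on the execution path updates with the same reward $\beta$, but with a learning rate driven by its own local time; this is the "locally decaying" scheme analyzed in the next section.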

4 Asymptotic behavior of LA networks

Let $N$ be a LA network. We define a well-founded relation $\prec$ over the nodes in $N$ by

$A \prec B$ if-and-only-if node $A$ is an ancestor of node $B$.

In this section, we prove, by induction on $\prec$, that $N$ eventually takes the best action with probability one.

We define for each node $A$ in $N$ the best action and the corresponding maximum average reward as follows. If $A$ is a leaf, the best action is one that yields the maximum average reward from the environment. Otherwise $A$ has edges (actions) $\alpha_1, \ldots, \alpha_N$, each leading to a child node $A_j$ for which the maximum average reward $d_j$ ($1 \le j \le N$) is already defined. Then the best action for $A$ is $\alpha_i$ such that $i = \mathrm{argmax}_j\, d_j$, and the corresponding maximum average reward is $d_i$. If there is more than one $d_j$ that achieves the maximum, choose one of them.

Next, note that while the root node is executed on every trial, descendant nodes are executed only probabilistically, and during the interval between two executions of a node, no change occurs to its descendants. To count how many times a node is executed, we introduce local time, which starts at 1 and is incremented by 1 every time $A$ is executed. By contrast, we call the usual time global time. In what follows, time means global time unless otherwise stated.

Now suppose node $A$ has child nodes $A_i$ with choice probability $p_i(n)$ for action $\alpha_i$ ($1 \le i \le N$) at time $n$. $M(n)$, the average reward of $A$ at time $n$, is calculated recursively as

$$M(n) = \sum_{i=1}^{N} p_i(n)\, M_i(n) \qquad (3)$$

where $M_i(n)$ is the average reward of $A_i$ ($1 \le i \le N$). We use $p(n)$ to denote the vector of all choice probabilities in $N$ at time $n$.
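The recursive definitions just given, the best action / maximum average reward of a node and the average reward $M(n)$ of (3), can be written down directly. The sketch below is an illustrative rendering (our own encoding of a tree as nested lists of leaf average rewards), not code from the report.

```python
def max_average_reward(node):
    """Maximum average reward d of a node: a leaf carries its own average reward,
    an inner node takes the best value among its children (its 'best action')."""
    if isinstance(node, (int, float)):        # leaf: its average reward is given
        return float(node)
    return max(max_average_reward(child) for child in node)

def average_reward(node, probs):
    """M(n) of equation (3): for an inner node, the p_i-weighted average of the
    children's average rewards; probs maps id(node) -> its choice probabilities."""
    if isinstance(node, (int, float)):
        return float(node)
    p = probs[id(node)]
    return sum(pi * average_reward(child, probs) for pi, child in zip(p, node))

# Example: the root has two actions; the second subtree contains the best leaf (0.9).
tree = [[0.2, 0.4], [0.3, 0.9]]
probs = {id(tree): [0.5, 0.5], id(tree[0]): [0.5, 0.5], id(tree[1]): [0.5, 0.5]}
print(max_average_reward(tree))      # 0.9, the maximum average reward of the root
print(average_reward(tree, probs))   # 0.45, the current M(n) of the root
```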

First we prove that $E[M(n)]$ improves with time under certain conditions.

Lemma 4.1  Let $A$ be a node in $N$ and $M(n)$ the average reward of $A$ at time $n$. If the reward $\beta$ is binary, i.e. $\beta = 0, 1$, and $N$ is a tree, we have for $l = 0, 1$,

$$E[\beta^l M(n+1) \mid p(n)] \;\ge\; E[\beta^l \mid p(n)]\, M(n)$$

Proof  By induction on $\prec$. Details are omitted in this extended abstract (see the Appendix).

Theorem 4.2  Let $N$ be a LA network and $M(n)$ the average reward of the root automaton of $N$. If $N$ is a tree and the reward is binary, then

$$E[M(n+1) \mid p(n)] \;\ge\; M(n)$$

Hence $E[M(n)]$ improves with time.

Proof  Use Lemma 4.1 and put $l = 0$. Since $E[\beta^0 \mid p(n)] = E[1 \mid p(n)] = 1$, we are done.  Q.E.D.

Next we prove the a.s. convergence to the best action w.r.t. the root automaton in the case of a general LA network. We first look at how often a node is executed. In the following, we assume that $A$ uses the learning rate $c_n$ defined below at local time $n$ (i.e. at its $n$-th execution),

$$c_n \stackrel{\mathrm{def}}{=} \frac{b}{n+a} \qquad (a \ge 1 > b > 0) \qquad (4)$$

and that the initial choice probabilities are all positive and less than 1.

Lemma 4.3  If node $A$ is executed infinitely often with probability 1, every child node of $A$ is executed infinitely often with probability 1.

Proof  Let $n$ be a local time for $A$ and $c_n$ the learning rate at local time $n$. Since $A$ is executed infinitely often, we can define a sequence of random variables $T_n$ ($T_1 < T_2 < T_3 < \cdots$) such that when $A$ is executed the $n$-th time, the global time is $T_n$. We write $\alpha_i$ (resp. $\alpha_i(n)$) for the event that action $\alpha_i$ is taken (resp. that $\alpha_i$ is taken at $T_n$), and $\bar{E}$ for the complement of an event $E$. For $[T_m, \infty)$, the half-open time interval from $T_m$, we see

$$P(\bar{\alpha}_i \text{ occurs in } [T_m, \infty)) = P\Big(\bigcap_{n > m} \bar{\alpha}_i \text{ occurs in } [T_m, T_n]\Big) = \lim_{n \to \infty} P(\bar{\alpha}_i \text{ occurs in } [T_m, T_n])$$

On the other hand, from

$$P(\alpha_i(n) \mid \bar{\alpha}_i(n-1), \ldots, \bar{\alpha}_i(m), p_i(T_m)) \;\ge\; (1 - c_{n-1}) \cdots (1 - c_m)\, p_i(T_m), \qquad p_i(T_m) \;\ge\; D_m \stackrel{\mathrm{def}}{=} (1 - c_1) \cdots (1 - c_{m-1})\, p_i(1)$$

it follows that

$$P(\bar{\alpha}_i(n) \mid \bar{\alpha}_i(n-1), \ldots, \bar{\alpha}_i(m), p_i(T_m)) \;\le\; \exp\big\{-(1 - c_{n-1}) \cdots (1 - c_m)\, D_m\big\}$$

Therefore, using

$$P(\bar{\alpha}_i(n), \bar{\alpha}_i(n-1), \ldots, \bar{\alpha}_i(m) \mid p_i(T_m)) = P(\bar{\alpha}_i(n) \mid \bar{\alpha}_i(n-1), \ldots, \bar{\alpha}_i(m), p_i(T_m)) \cdots P(\bar{\alpha}_i(m) \mid p_i(T_m))$$

and

$$(1 - c_{n-1}) \cdots (1 - c_m) \;\ge\; \Big(\frac{m - 1 + a - b}{n - 1 + a}\Big)^b \qquad \text{[4]}$$

we conclude that

$$P(\bar{\alpha}_i(n), \bar{\alpha}_i(n-1), \ldots, \bar{\alpha}_i(m) \mid p_i(T_m)) \;\le\; (1 - D_m)\, \exp\Big\{-D_m \sum_{k=m}^{n-1} \Big(\frac{m - 1 + a - b}{k + a}\Big)^b\Big\}$$

Hence (for an event $A$ and a random variable $X$ we have $P(A) = E[\mathbf{1}_A] = E[E[\mathbf{1}_A \mid X]] = E[P(A \mid X)]$, where $\mathbf{1}_A$ is the characteristic function of $A$; so if $P(A \mid X) \le \epsilon$, then $P(A) \le \epsilon$ as well),

$$P(\bar{\alpha}_i \text{ occurs in } [T_m, T_n]) = P(\bar{\alpha}_i(n), \bar{\alpha}_i(n-1), \ldots, \bar{\alpha}_i(m)) \;\le\; (1 - D_m)\, \exp\Big\{-D_m \sum_{k=m}^{n-1} \Big(\frac{m - 1 + a - b}{k + a}\Big)^b\Big\}$$

Since $b < 1$, the sum in the exponent diverges as $n \to \infty$, so $\lim_{n \to \infty} P(\bar{\alpha}_i \text{ occurs in } [T_m, T_n]) = 0$, and by substitution $P(\bar{\alpha}_i \text{ occurs in } [T_m, \infty)) = 0$. Accordingly,

$$P(\alpha_i \text{ occurs finitely many times}) \;\le\; \sum_{m=1}^{\infty} P(\bar{\alpha}_i \text{ occurs in } [T_m, \infty)) = 0$$

which implies $P(\alpha_i \text{ occurs infinitely often}) = 1$.  Q.E.D.
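The quantitative step borrowed from [4], namely the bound $(1-c_m)\cdots(1-c_{n-1}) \ge \big(\frac{m-1+a-b}{n-1+a}\big)^b$ for $c_k = b/(k+a)$, is easy to spot-check numerically. The snippet below is such an illustrative check, not part of the report.

```python
import math

def product_bound_holds(a, b, m, n):
    """Check (1 - c_m)...(1 - c_{n-1}) >= ((m-1+a-b)/(n-1+a))**b for c_k = b/(k+a)."""
    lhs = math.prod(1.0 - b / (k + a) for k in range(m, n))
    rhs = ((m - 1 + a - b) / (n - 1 + a)) ** b
    return lhs >= rhs

# Spot-check the inequality over a range of parameters and horizons.
print(all(product_bound_holds(a, b, m, n)
          for a in (1.0, 2.0, 5.0)
          for b in (0.1, 0.5, 0.9)
          for m in (1, 3, 10)
          for n in (m + 1, m + 10, m + 100)))   # prints True
```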

From Lemma 4.3, we conclude that every node in $N$ is executed infinitely often. We next prove a main theorem.

Theorem 4.4  Let $N$ be a LA network such that every node has a unique best action. Suppose $A$ is a node in $N$ with average reward $M(n)$ at time $n$, and that the best action for $A$ is $\alpha_i$, with choice probability $p_i(n)$ at time $n$. If the decaying learning rate $c_n$ defined by (4) is used at $A$'s $n$-th execution, we have

$$\lim_{n \to \infty} p_i(n) = 1 \quad \text{a.s.} \qquad (5)$$
$$\lim_{n \to \infty} M(n) = d_i \quad \text{a.s.} \qquad (6)$$

where $d_i$ is the maximum average reward corresponding to $\alpha_i$. Hence $N$ eventually returns the maximum average reward with probability 1.

Proof  We use induction on $\prec$ (the base case is provable based on the result in [4]; see the remark below). So assume (5) and (6) hold for every child node $A_j$ ($1 \le j \le N$) of $A$ with average reward $M_j(n)$. Since (5) implies (6) in view of (3) and the inductive assumption, it is sufficient to prove (5). We remark that we only have to prove (5) w.r.t. $A$'s local time $n$. This is because we can introduce global times $T_n$ ($T_1 < T_2 < \cdots$) owing to Lemma 4.3, and (5) being true w.r.t. $A$'s local time $n$ means $\lim_{n \to \infty} p_i(T_n) = 1$ a.s. at global times $T_n$, which is equivalent to $\lim_{m \to \infty} p_i(m) = 1$ a.s., since $\{p_i(T_n)\}_{n \ge 1}$ is a subsequence of $\{p_i(m)\}_{m \ge 1}$ and $p_i(m)$ changes only at some $T_n$.

With the above remark and by the induction hypothesis, we may also assume $\lim_{n \to \infty} M_j(n) = d_j$ a.s. w.r.t. $A$'s local time $n$, where $d_j$ is the maximum average reward of $A_j$ ($1 \le j \le N$). These are all different by assumption, so there is a unique best action $\alpha_i$ with maximum average reward $d_i$ such that $d_i > d_j$ ($j \ne i$). We can therefore find some local time $n_0$ (depending on the sample process) and a constant $\delta = \frac{1}{2} \min_{j \ne i} (d_i - d_j) > 0$ such that for any $n > n_0$, $M_i(n) - M_j(n) > \delta$ for every $j$ ($j \ne i$). In what follows, $n$ denotes $A$'s local time.

First we take up $\frac{1}{p_i(n)}$ (that $p_i(n) > 0$ is proved by induction on $n$) and introduce $W(n) \stackrel{\mathrm{def}}{=} \frac{1}{p_i(n)} - 1$ ([4]). We bound its conditional expectation for local time $n > n_0$ as follows. In the calculation we drop $n$ where convenient, write $\alpha_j$ for the event that action $\alpha_j$ ($1 \le j \le N$) is taken at time $n$, and recall that $E[\beta \mid p, \alpha_j] = M_j(n)$ by (3):

$$E\Big[\frac{1}{p_i(n+1)} \,\Big|\, p\Big] = E\Big[\frac{1}{p_i + c_n \beta (1 - p_i)} \,\Big|\, p, \alpha_i\Big]\, p_i + \sum_{j \ne i} E\Big[\frac{1}{(1 - c_n \beta)\, p_i} \,\Big|\, p, \alpha_j\Big]\, p_j$$
$$\le \frac{1}{p_i}\Big\{\big(1 - c_n M_i W + c_n^2\, E[\beta^2 \mid p, \alpha_i]\, W^2\big)\, p_i + \sum_{j \ne i} \Big(1 + c_n M_j + \frac{c_n^2}{1 - c_n}\Big)\, p_j\Big\}$$

Hence, using $\sum_{j \ne i} p_j = p_i W$ and $M_i - M_j > \delta$ for $j \ne i$,

$$E[W(n+1) \mid p] \;\le\; \Big\{1 - \big(\delta - E[\beta^2 \mid p, \alpha_i]\, c_n W\big)\, c_n + \frac{c_n^2}{1 - c_n}\Big\}\, W \;\le\; \Big\{1 - (\delta - c_n W)\, c_n + \frac{c_n^2}{1 - c_n}\Big\}\, W$$

Since we can prove $\lim_{n \to \infty} c_n W(n) = 0$ a.s., $\sum_n c_n = \infty$ and $\sum_n \frac{c_n^2}{1 - c_n} < \infty$, it follows from Lemma A.3-1 in [4] (an applied form of Robbins and Siegmund's theorem [5]) that $\lim_{n \to \infty} W(n) = 0$ a.s. Since $\lim_{n \to \infty} W(n) = 0$ a.s. implies $\lim_{n \to \infty} p_i(n) = 1$ a.s., the induction case is proved.  Q.E.D.
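The conditions used in the last step are easy to observe for the schedule (4): $\sum_n c_n$ diverges (it grows like $b \log n$) while $\sum_n c_n^2/(1-c_n)$ converges, and in simulation $W(n) = 1/p_i(n) - 1$ indeed drifts toward 0. The sketch below is our own illustration for a single automaton with an assumed Bernoulli environment, not code from the report.

```python
import random

a, b = 1.0, 0.9

def rate(n):
    return b / (n + a)            # the decaying schedule (4)

# Partial sums of the two series appearing in the Robbins-Siegmund-type argument.
print(sum(rate(n) for n in range(1, 100001)))                   # keeps growing: divergent series
print(sum(rate(n) ** 2 / (1 - rate(n)) for n in range(1, 100001)))   # bounded: convergent series

# Track W(n) = 1/p_i(n) - 1 for the best action of a single automaton.
d = [0.3, 0.8]                    # assumed average rewards; action 1 is the best action
best = 1
p = [0.5, 0.5]
random.seed(0)
for n in range(1, 200001):
    i = random.choices(range(2), weights=p)[0]
    beta = 1.0 if random.random() < d[i] else 0.0
    cn = rate(n)
    p = [pk + cn * beta * (1 - pk) if k == i else (1 - cn * beta) * pk
         for k, pk in enumerate(p)]
print(1 / p[best] - 1)            # W(n) is typically small: p for the best action is near 1
```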

When the learning rate is not decaying with time but constant, we have to give up the assurance that the network eventually takes its best action with probability 1. Suppose the root node has actions (edges) $\alpha_i$ ($1 \le i \le N$). For an action taken by a leaf automaton, we say it is reachable from $\alpha_i$ if there is a path containing both $\alpha_i$ and that action. If there is some action $\alpha_i$ such that the leaf action is not reachable without executing $\alpha_i$, we call the leaf action potentially unreachable (if it is not potentially unreachable, this means that whatever action is taken at the root node, we can reach it by choosing subsequent actions appropriately). We can prove:

Proposition 4.5  If the learning rate is set to a constant $c > 0$ in a LA network, every potentially unreachable action has a positive probability of never being executed.

Proof  Apparently it is enough to prove that every action of the root node has a positive probability of never being executed. Let $\alpha_i$ be an arbitrary action of the root node and $E_n$ the event that action $\alpha_i$ has not been taken from time 1 to $n$. We have

$$P(E_n) = \prod_{k=0}^{n-1} \big(1 - (1 - c)^k\, p_i(1)\big)$$

Choose $n_0$ and $c$ such that $(1 - c)^k\, p_i(1) < \frac{1}{k^2}$ for all $k \ge n_0$. Since $\prod_{k \ge n_0} \big(1 - \frac{1}{k^2}\big) \ge \frac{1}{2}$, we have for this $n_0$ and $n > n_0$,

$$P(E_n) \;>\; \prod_{k=0}^{n_0 - 1} \big(1 - (1 - c)^k\, p_i(1)\big) \prod_{k=n_0}^{n} \Big(1 - \frac{1}{k^2}\Big) \;>\; \frac{1}{2} \prod_{k=0}^{n_0 - 1} \big(1 - (1 - c)^k\, p_i(1)\big)$$

Therefore, since $E_{n+1} \subseteq E_n$, we see

$$P(\alpha_i \text{ is never taken}) = P\Big(\bigcap_n E_n\Big) = \lim_{n \to \infty} P(E_n) \;\ge\; \frac{1}{2} \prod_{k=0}^{n_0 - 1} \big(1 - (1 - c)^k\, p_i(1)\big) \;>\; 0$$

Q.E.D.

So, if an action taken by a leaf happens to give the maximum average reward among the actions taken by the leaves in $N$ but is potentially unreachable, the root node has a non-zero probability of never achieving the maximum average reward with a constant learning rate.
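In the special case where the environment always returns reward $\beta = 1$ (an assumption we make purely for this illustration; it makes the product formula for $P(E_n)$ exact for a single automaton), the probability of never taking an action can be compared against a Monte Carlo estimate:

```python
import random

def never_taken_estimate(c, p1, horizon, runs=20000, seed=1):
    """Fraction of runs in which action 0 (initial probability p1) is never taken
    up to `horizon`, for a constant learning rate c and reward beta = 1 always."""
    rng = random.Random(seed)
    count = 0
    for _ in range(runs):
        p = p1
        taken = False
        for _ in range(horizon):
            if rng.random() < p:
                taken = True
                break
            p *= (1.0 - c)          # action 0 not taken, beta = 1: p shrinks by (1 - c)
        if not taken:
            count += 1
    return count / runs

c, p1, horizon = 0.2, 0.3, 1000
exact = 1.0
for k in range(horizon):
    exact *= 1.0 - (1.0 - c) ** k * p1   # the product formula for P(E_n) under beta = 1
print(never_taken_estimate(c, p1, horizon), exact)   # the two values are close
```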

References

[1] Feller, W., An Introduction to Probability Theory and Its Applications (2nd ed.), Wiley, 1971.

[2] Kaelbling, L.P. and Littman, M.L., Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research 4, pp. 237-285, 1996.

[3] Narendra, K.S. and Thathachar, M.A.L., Learning Automata: An Introduction, Prentice-Hall Inc., 1989.

[4] Poznyak, A.S. and Najim, K., Learning Automata and Stochastic Optimization, Lecture Notes in Control and Information Sciences 225, Springer, 1997.

[5] Robbins, H. and Siegmund, D., A Convergence Theorem for Non-negative Almost Supermartingales and Some Applications, in Optimizing Methods in Statistics (Rustagi, J.S., ed.), Academic Press, New York, pp. 233-257, 1971.

[6] Sato, T., Reactive Logic Programming by Reinforcement Learning, submitted for ICLP'99, 1999.


Appendix

Lemma 4.1  Let $p(n)$ be the vector of choice probabilities in a LA network $N$ at time $n$, $A$ a node in $N$, and $M(n)$ the average reward of $A$ at time $n$, respectively. Suppose $A$ has child nodes $A_i$ with choice probabilities $p_i(n)$ and average rewards $M_i(n)$ ($1 \le i \le N$). If the reward $\beta$ is binary, i.e. $\beta = 0, 1$, and $N$ is a tree, we have for $l = 0, 1$ and for any learning rate $c_n$ at time $n$,

$$E[\beta^l M(n+1) \mid p(n)] \;\ge\; E[\beta^l \mid p(n)]\, M(n)$$

and, conditioned on the event $\alpha$ that $A$ is executed at time $n$, the sharper bound

$$E[\beta^l M(n+1) \mid \alpha, p(n)] \;\ge\; E[\beta^l \mid \alpha, p(n)]\, M(n) + \frac{c_n}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \big(M_i(n) - M_j(n)\big)^2\, p_i(n)\, p_j(n)$$

Proof  We write $p$ for $p(n)$, $p_i$ for $p_i(n)$, $\alpha$ for the event that $A$ is executed at time $n$, and $\alpha_i$ for the event that action $\alpha_i$ is selected at time $n$. First,

$$E[\beta^l M(n+1) \mid p, \alpha] = \sum_{i=1}^{N} E[\beta^l M(n+1) \mid \alpha, \alpha_i, p]\, P(\alpha_i \mid \alpha, p)$$
$$= \sum_{i=1}^{N} \sum_{j=1}^{N} E[\beta^l M_j(n+1)\, p_j(n+1) \mid \alpha, \alpha_i, p]\, p_i$$
$$= \sum_{i=1}^{N} \Big\{ E\big[\beta^l M_i(n+1)\big(p_i + c_n \beta (1 - p_i)\big) \,\big|\, \alpha, \alpha_i, p\big] + \sum_{j \ne i} E\big[\beta^l M_j(n+1)(1 - c_n \beta)\, p_j \,\big|\, \alpha, \alpha_i, p\big] \Big\}\, p_i$$
$$\stackrel{\mathrm{def}}{=} (I) + c_n\, (II)$$

For the first part,

$$(I) = \sum_{i=1}^{N} \Big\{ E[\beta^l M_i(n+1) \mid \alpha, \alpha_i, p]\, p_i + \sum_{j \ne i} E[\beta^l M_j(n+1) \mid \alpha, \alpha_i, p]\, p_j \Big\}\, p_i$$
$$\ge \sum_{i=1}^{N} \Big\{ E[\beta^l \mid \alpha, \alpha_i, p]\, M_i\, p_i + \sum_{j \ne i} E[\beta^l \mid \alpha, \alpha_i, p]\, M_j\, p_j \Big\}\, p_i$$

(by the induction hypothesis, and the fact that action $\alpha_i$ does not affect $A_j$ for $j \ne i$)

$$= \sum_{i=1}^{N} E[\beta^l \mid \alpha, \alpha_i, p]\, \Big\{ M_i\, p_i + \sum_{j \ne i} M_j\, p_j \Big\}\, p_i = \sum_{i=1}^{N} E[\beta^l \mid \alpha, \alpha_i, p]\, M\, p_i = E[\beta^l \mid \alpha, p]\, M$$

For the second part,

$$(II) = \sum_{i=1}^{N} \sum_{j \ne i} E\big[\beta^{l+1}\big(M_i(n+1) - M_j(n+1)\big) \,\big|\, \alpha, \alpha_i, p\big]\, p_j\, p_i$$
$$= \sum_{i=1}^{N} \sum_{j=1}^{N} E\big[\beta\big(M_i(n+1) - M_j(n+1)\big) \,\big|\, \alpha, \alpha_i, p\big]\, p_j\, p_i \qquad (\beta^{l+1} = \beta \text{ as } \beta \text{ is binary})$$
$$\ge \sum_{i=1}^{N} \sum_{j=1}^{N} E[\beta \mid \alpha, \alpha_i, p]\, (M_i - M_j)\, p_j\, p_i \qquad (\text{for the same reason as for } (I))$$
$$= \frac{1}{2} \sum_{i,j=1}^{N} (M_i - M_j)^2\, p_j\, p_i \qquad (\text{because } E[\beta \mid \alpha, \alpha_i, p] = M_i)$$

Therefore,

$$E[\beta^l M(n+1) \mid \alpha, p] \;\ge\; E[\beta^l \mid \alpha, p]\, M + \frac{c_n}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (M_i - M_j)^2\, p_j\, p_i \;\ge\; E[\beta^l \mid \alpha, p]\, M$$

On the other hand, if $A$ is not executed at time $n$, then $E[\beta^l M(n+1) \mid \bar{\alpha}, p] = E[\beta^l \mid \bar{\alpha}, p]\, M$ holds, as no learning is done below $A$ ($\bar{\alpha}$ is the complement of $\alpha$). Hence,

$$E[\beta^l M(n+1) \mid p] = E[\beta^l M(n+1) \mid \alpha, p]\, P(\alpha \mid p) + E[\beta^l M(n+1) \mid \bar{\alpha}, p]\, (1 - P(\alpha \mid p))$$
$$\ge E[\beta^l \mid \alpha, p]\, M\, P(\alpha \mid p) + E[\beta^l \mid \bar{\alpha}, p]\, M\, (1 - P(\alpha \mid p)) = E[\beta^l \mid p]\, M$$

Q.E.D.
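The algebraic identity used in the last step of the bound for $(II)$, namely $\sum_{i,j} M_i (M_i - M_j)\, p_i p_j = \frac{1}{2}\sum_{i,j} (M_i - M_j)^2\, p_i p_j$, can be verified numerically; the snippet below is a small illustrative check of our own, not part of the report.

```python
import random

random.seed(0)
for _ in range(5):
    N = random.randint(2, 6)
    M = [random.random() for _ in range(N)]              # arbitrary average rewards
    w = [random.random() for _ in range(N)]
    p = [x / sum(w) for x in w]                          # arbitrary probability vector
    lhs = sum(M[i] * (M[i] - M[j]) * p[i] * p[j] for i in range(N) for j in range(N))
    rhs = 0.5 * sum((M[i] - M[j]) ** 2 * p[i] * p[j] for i in range(N) for j in range(N))
    print(abs(lhs - rhs) < 1e-12)                        # True: the two sides coincide
```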