Global Reinforcement Learning in Neural Networks

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 18, NO. 2, MARCH 2007


Xiaolong Ma and Konstantin K. Likharev

Abstract—In this letter, we have found a more general formulation of the REward Increment × Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility (REINFORCE) learning principle first suggested by Williams. The new formulation has enabled us to apply the principle to global reinforcement learning in networks with various sources of randomness, and to suggest several simple local rules for such networks. Numerical simulations have shown that, for simple classification and reinforcement learning tasks, at least one family of the new learning rules gives results comparable to those provided by the famous Rules A_{r-i} and A_{r-p} for the Boltzmann machines.

Index Terms—Neural networks (NNs), reinforcement learning, stochastic weights.

I. INTRODUCTION

Most previous work on reinforcement training of neural networks (NNs) [1], [2] has been focused on systems in which the randomness, necessary for policy exploration, is provided by stochastic neural cells ("Boltzmann machines" [3]). For such networks operating in the simplest conditions of instant reward, Williams [4] has derived a general class of REward Increment × Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility (REINFORCE) learning algorithms which achieve a statistical gradient ascent of the average reward E{r | w} in the multidimensional space of deterministic synaptic weights w_{ij} which relate presynaptic and postsynaptic signals

x_i = \sum_j w_{ij} y_j.    (1)

Our effort to generalize these results is motivated by our group's development of the so-called CMOS/nanowire/MOLecular hybrid (CMOL) CrossNets, a specific nanoelectronic implementation of NNs (see [5] for a review). In CMOL hardware, the synapses are by their very nature stochastic, though the degree of their randomness may be regulated by physical parameters of the nanodevices ("latching switches") playing the role of elementary synapses. In [6], we have shown that Williams' formulation may be generalized. Namely, for any stochastic system with a set of random signals v = {v_1, v_2, ...} and a probability function p(v; \theta) controlled by a set of deterministic internal parameters \theta = {\theta_1, \theta_2, ...}, the following learning rule [7]:

\Delta\theta_k = \eta_k r e_k    (2a)

e_k = \partial \ln[p(v; \theta)] / \partial\theta_k    (2b)

performs a statistical ascent on the average reward

\Delta E\{r | \theta\} \approx \nabla_\theta E\{r | \theta\} \cdot E\{\Delta\theta | \theta\} \ge 0.    (3)
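To make (2) concrete, here is a minimal Python sketch (our toy illustration, not part of the letter) that applies the rule to a single Gaussian random signal whose mean is the only trainable parameter; the reward function, the target value of 2, and all hyperparameters are illustrative assumptions.

```python
import numpy as np

# Toy illustration of the generalized REINFORCE rule (2a)-(2b): one random signal
# v ~ N(theta, sigma^2), one trainable parameter theta (the mean of p), and a
# reward that peaks at v = 2. All numbers are illustrative, not from the letter.
rng = np.random.default_rng(0)
theta, sigma, eta = 0.0, 1.0, 0.02

def reward(v):
    return -(v - 2.0) ** 2              # the closer v is to 2, the higher the reward

for _ in range(5000):
    v = theta + sigma * rng.standard_normal()   # sample v from p(v; theta)
    e = (v - theta) / sigma ** 2                # eligibility d ln p / d theta, Eq. (2b)
    theta += eta * reward(v) * e                # stochastic ascent step, Eq. (2a)

print(f"theta after training: {theta:.2f}  (the average reward peaks at theta = 2)")
```

On average, each step moves theta along the gradient of the expected reward, which is exactly the statement of (3).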

In the case of NNs, with their local relation between signals, (2b) leads to simple learning rules, because the partial derivative with respect to a particular weight depends only on local signals (for details, see [6]). For example, in networks with "Bernoulli-logistic" stochastic cells (and deterministic synapses), (2) yields the so-called Rule A_{r-i} [2] as follows.

Rule A_{r-i}:

\Delta w_{ij} = \eta r (y_i - \langle y_i \rangle) y_j    (4)

(which had been obtained from the initial REINFORCE formulation by Williams). By adding a small detrapping term with \lambda \ll 1, this equation leads to the famous Rule A_{r-p} as follows.

Rule A_{r-p}:

\Delta w_{ij} = \eta [r (y_i - \langle y_i \rangle) y_j + \lambda (1 - r)(-y_i - \langle y_i \rangle) y_j]    (5)

which is probably the most efficient global reinforcement rule for Boltzmann machines with instant reward [1], [2].

In this letter, we use (2) to explore several new learning rules for NNs with various sources of randomness, and compare their performance with A_{r-i} and A_{r-p} for a few basic problems of supervised and reinforcement learning.

II. RULES A

Let us consider a feedforward multilayered perceptron (MLP) in which the somatic cells are deterministic, y_i = g(x_i), while the sources of fluctuations are located at the somatic cell inputs

x_i = \langle x_i \rangle + \delta x_i    (6)

\langle x_i \rangle = \sum_j w_{ij} y_j.    (7)

Let us assume that the added noise \delta x_i is Gaussian, so that its probability density function is

p_i(x_i; w^m, y^{m-1}) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[ -\frac{(x_i - \langle x_i \rangle)^2}{2\sigma_i^2} \right]    (8)

where \sigma_i^2 is the fluctuation variance. Let us identify the variables v participating in (2) with the set of input signals x, and \theta with the set of synaptic weights w_{ij}. Then, from (2b), the eligibility component is [6]

e_{ij} = \partial \ln p_i / \partial w_{ij} = (x_i - \langle x_i \rangle) y_j / \sigma_i^2.    (9)

Using the freedom that (2) allows in the choice of \eta_{ij} [6], we may take \eta_{ij} = \eta \sigma_i^2 and obtain the following simple local learning rule.

Rule A1:

\Delta w_{ij} = \eta r (x_i - \langle x_i \rangle) y_j    (10)

which is very close in structure to A_{r-i}.

Fig. 1. Process of training a fully connected MLP (4-10-1) with random synaptic weights to implement the four-input parity function. Plots show the sliding average reward as a function of training epoch number for ten independent simulation runs. The parameters are \sigma(0) = 10 and \gamma = 1 unless specified otherwise. (a) Rule A1: \eta = 0.1. (b) Rule A2: \eta = 0.1 and \lambda = 0.005. (c) Rule A3: \eta = 0.03 and \lambda = 0.002. (d) Rule B: \eta = 0.02. (e) Rule C1: \eta = 0.4 and \sigma(0) = 1. (f) Rule C2: \eta = 0.4, \sigma(0) = 1, and \lambda = 0.005.

Fig. 1(a) shows the learning dynamics of an MLP trained with Rule A1 to perform the parity function. In this simple experiment, the inputs were binary (-1 and +1),^1 and the neural cells were deterministic, with the following activation function:

y_i = \tanh(h_i) = \tanh( G x_i / \sqrt{N_{m-1}} )    (11)

where G = 4 throughout this letter and N_{m-1} is the number of cells in the previous layer.^2 The reward signal was simply r = +1 for the correct sign of the output signal and r = -1 for the wrong sign. The network performance was measured by the sliding average reward defined as

r_a(t) = 0.99 r_a(t-1) + 0.01 r(t)    (12)

where r(t) is r averaged over all patterns at the t-th epoch. The following "fluctuation quenching" procedure was used:

\sigma(t) = \sigma(0) [1 - r_a(t)]^\gamma    (13)

with the positive parameter \gamma, which allows the quenching speed to be controlled.

^1 Because of such a symmetric data representation, a certain number of bias cells with constant output (+1) had to be added to the input and hidden layers, both in this task and in those described in the following sections. These biases are not included in the cell count.

^2 In these simulations, as well as in those of the following sections, we have used the normalized input h_i in place of x_i in all the learning rules.

Fig. 1(a) shows that Rule A1 generally works, but, just as Rule A_{r-i}, it suffers from occasional trapping in local minima of the effective multidimensional potential profile. This analogy makes it natural to assume that the problem can be tackled by adding a similar antitrapping \lambda term, resulting in the following.

Rule A2:

\Delta w_{ij} = \eta [r (x_i - \langle x_i \rangle) + \lambda (1 - r)(-x_i - \langle x_i \rangle)] y_j.    (14)
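To make the Rules A training loop concrete, the following NumPy sketch (ours, not the authors' code) trains a 4-10-1 MLP with Gaussian noise at the somatic inputs on the four-input parity task, using Rule A2 (which reduces to Rule A1 for \lambda = 0), the sliding average reward (12), and the quenching schedule (13). The gain G and all hyperparameter values are illustrative assumptions and are not tuned to reproduce Fig. 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# 4 inputs (+ bias) -> 10 hidden (+ bias) -> 1 output, roughly as in Fig. 1.
# G, eta, lam, sigma0 and gamma are illustrative guesses, not the letter's values.
G, eta, lam = 4.0, 0.05, 0.005            # set lam = 0.0 to recover Rule A1
sigma0, gamma = 1.0, 1.0

W1 = 0.1 * rng.standard_normal((10, 5))   # hidden weights (last column acts on the bias cell)
W2 = 0.1 * rng.standard_normal((1, 11))

# All 16 four-bit patterns coded as +/-1; the parity target is their product.
X = np.array([[2 * ((i >> b) & 1) - 1 for b in range(4)] for i in range(16)], float)
T = np.prod(X, axis=1)

def noisy_layer(W, y_prev, sigma):
    """Mean somatic input (7), added Gaussian noise (6), and tanh activation (11)."""
    mean_x = W @ y_prev
    x = mean_x + sigma * rng.standard_normal(mean_x.shape)
    return mean_x, x, np.tanh(G * x / np.sqrt(len(y_prev)))

ra, sigma = 0.0, sigma0
for epoch in range(2000):
    r_sum = 0.0
    for x_in, target in zip(X, T):
        y0 = np.append(x_in, 1.0)                    # bias cell with constant output +1
        m1, x1, y1 = noisy_layer(W1, y0, sigma)
        y1 = np.append(y1, 1.0)
        m2, x2, y2 = noisy_layer(W2, y1, sigma)
        r = 1.0 if y2[0] * target > 0 else -1.0      # reward from the sign of the output
        # Rule A2 (14): dw_ij = eta * [r*(x_i - <x_i>) + lam*(1 - r)*(-x_i - <x_i>)] * y_j
        W1 += np.outer(eta * (r * (x1 - m1) + lam * (1 - r) * (-x1 - m1)), y0)
        W2 += np.outer(eta * (r * (x2 - m2) + lam * (1 - r) * (-x2 - m2)), y1)
        r_sum += r
    ra = 0.99 * ra + 0.01 * r_sum / len(X)           # sliding average reward, Eq. (12)
    sigma = sigma0 * (1.0 - ra) ** gamma             # fluctuation quenching, Eq. (13)
    if epoch % 200 == 0:
        print(f"epoch {epoch:4d}: sliding average reward {ra:+.3f}")
```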

The learning dynamics for Rule A2 for the same parity problem (and for the same network as described previously) is shown in Fig. 1(b). We can see a clear improvement of the performance in comparison with Rule A1.

Now, let us note that our derivation of Rules A1 and A2 is valid regardless of whether the fluctuations \delta x_i are the result of some additive noise or of the randomness of the synaptic weights. In the latter case, w_{ij} in (1) are random numbers, and we may identify \theta with their statistical averages \langle w_{ij} \rangle = \theta_{ij}. In this case, Rules A1 and A2 are valid for the adaptation of these average values.

In a hardware implementation, however, it may be hard to keep track of the average input \langle x_i \rangle. In that case, the following rule can be used as a substitute for Rule A2.

Rule A3:

\Delta\theta_{ij} = \eta [r (x_i - x'_i) + \lambda (1 - r)(-x_i - x'_i)] y_j    (15)

where x_i is as in (1) and

x'_i = \sum_j w'_{ij} y_j    (16)

where w_{ij} and w'_{ij} are independent random weights with the same mean values, and r is the reward corresponding to w_{ij}. Indeed, let us denote the weight changes derived from A2 and A3, respectively, as \Delta\theta_{ij} and \Delta\theta'_{ij}, and calculate their expectation values (with \lambda = 0). First,

E\{\Delta\theta'_{ij}\} = \eta E\{ r (x_i - x'_i) y_j \}.    (17)

Since w_{ij} and w'_{ij} are independent, w'_{ij} does not correlate with r or y_j, which are all functions of w_{ij}. Thus, we can replace w'_{ij} in (17) with their expectation values \theta_{ij}. This immediately gives us

E\{\Delta\theta'_{ij}\} = \eta E\{ r (x_i - \langle x_i \rangle) y_j \} = E\{\Delta\theta_{ij}\}.    (18)

Therefore, A2 and A3 on the average follow the same gradient.^3

^3 Note that when implementing the second set of random weights (16), the output signals are memorized from the first perturbation, i.e., y_j = g(x_j); otherwise, we would not have been able to obtain the same expectation values.
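The expectation argument of (17) and (18) is easy to check numerically. The sketch below (our illustration, with toy sizes, a toy reward, and illustrative parameters) compares the average A2 and A3 updates for a single cell with Gaussian stochastic weights and a fixed presynaptic vector.

```python
import numpy as np

# Numeric check of Eqs. (17)-(18): for a single cell with Gaussian random weights,
# the average Rule A3 update (which uses a second, independent weight draw) matches
# the average Rule A2 update (which needs the exact mean input <x_i>).
rng = np.random.default_rng(2)
theta = np.array([0.5, -1.0, 0.3])        # mean weights theta_ij = <w_ij>
y = np.array([1.0, -1.0, 1.0])            # fixed presynaptic signals y_j
sig_w, n = 0.2, 200_000

w  = theta + sig_w * rng.standard_normal((n, 3))   # weights that produce the reward
wp = theta + sig_w * rng.standard_normal((n, 3))   # independent second draw, Eq. (16)
x, xp = w @ y, wp @ y
r = np.tanh(x)                                     # arbitrary smooth reward of the cell input

upd_a2 = (r * (x - theta @ y))[:, None] * y        # Rule A2 with lambda = 0: uses <x_i>
upd_a3 = (r * (x - xp))[:, None] * y               # Rule A3: uses the second sample x'_i
print("average A2 update:", upd_a2.mean(axis=0))
print("average A3 update:", upd_a3.mean(axis=0))   # agrees with A2 within sampling noise
```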



Learning dynamics for Rule A3 on the same parity problem is shown in Fig. 1(c). One can see that it performs approximately on a par with Rule A2.

III. RULES B

Rules A have been obtained by looking at the reward as a function of x. However, there is another legitimate way to look at the reward: at a fixed network input, we may consider it a function of the synaptic weight set only, r = r(w). From this standpoint, in (2), we can replace v with w and \theta with the average weights \theta_{ij} = \langle w_{ij} \rangle, and derive the following rule.

Rule B:

\Delta\theta_{ij} = \eta r (w_{ij} - \theta_{ij}).    (19)

This is perhaps the simplest learning rule suggested for NNs.^4 It is also extremely localized: it involves no information about processes in the rest of the network other than that provided by the reward signal. Fig. 1(d) shows the parity function learning process using Rule B. These results, and also those described in Section V, show that Rule B does not work as well as Rules A2 and A3 for classification tasks. Our efforts to improve it by adding a similar antitrap term were unsuccessful.

^4 One can show that Rule B can also be applied to binary weights.
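To illustrate how little information Rule B needs, here is a toy Python sketch (ours, with an assumed linearly separable task and illustrative parameters) in which a single threshold cell with stochastic weights w ~ N(\theta, \sigma^2) is trained from the scalar reward alone.

```python
import numpy as np

# Toy illustration of Rule B (19): a single threshold cell with stochastic weights
# learns a linearly separable rule using only the scalar reward. The data set,
# sizes, and hyperparameters are illustrative assumptions.
rng = np.random.default_rng(3)
theta = np.zeros(3)                        # mean weights (2 inputs + bias)
sig, eta = 0.3, 0.02

X = np.hstack([rng.uniform(-1, 1, (200, 2)), np.ones((200, 1))])
t = np.sign(X[:, 0] + 0.5 * X[:, 1])       # target rule: a fixed separating line

for epoch in range(201):
    correct = 0.0
    for x, target in zip(X, t):
        w = theta + sig * rng.standard_normal(3)     # sample the stochastic weights
        out = np.sign(w @ x)
        r = 1.0 if out == target else -1.0
        theta += eta * r * (w - theta)               # Rule B update, Eq. (19)
        correct += float(out == target)
    if epoch % 50 == 0:
        print(f"epoch {epoch:3d}: training accuracy {correct / len(X):.2f}")
```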

IV. RULES C

Now, it is natural to consider MLPs in which the Gaussian fluctuations are added to the outputs rather than the inputs of the somatic cells

y_i = \langle y_i \rangle + \delta y_i = g(x_i) + \delta y_i    (20)

where x_i is as in (1). Using (2), it is straightforward to derive the following learning rule.

Rule C0:

\Delta w_{ij} = \eta r (y_i - \langle y_i \rangle) y_j g'(x_i).    (21)

The simplest activation function g(x) one can think of is the piecewise-linear function

g(x) = \begin{cases} x, & |x| \le 1 \\ \pm 1, & |x| > 1 \end{cases}    (22)

which leads to an even simpler rule as follows.

Rule C1:

\Delta w_{ij} = \begin{cases} \eta r (y_i - \langle y_i \rangle) y_j, & |x_i| \le 1 \\ 0, & |x_i| > 1. \end{cases}    (23)

Testing this rule on the same parity problem shows that it does not work very well [see Fig. 1(e)]. One could imagine that this poor performance is related to the lack of weight adjustment at large signals, and might be corrected by using a smoother activation function, e.g., g(x) = \tanh(x), for which g'(x) = 1 - g^2(x). For this option, adding also an antitrap term, we obtain the following rule.

Rule C2:

\Delta w_{ij} = \eta [r (y_i - \langle y_i \rangle) + \lambda (1 - r)(-y_i - \langle y_i \rangle)] (1 - \langle y_i \rangle^2) y_j.    (24)

This change leads to some performance improvement [see Fig. 1(f)], but it is still insufficient for Rule C2 to compete with A2 or A3.
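For completeness, the fragment below (illustrative shapes and values, not the authors' code) shows a single Rule C2 update for one layer with Gaussian noise added to the cell outputs, as in (20).

```python
import numpy as np

# One Rule C2 update (24) for a single layer with output noise (20).
rng = np.random.default_rng(4)
eta, lam, sig = 0.05, 0.005, 0.1
W = 0.1 * rng.standard_normal((3, 5))        # 5 presynaptic cells -> 3 postsynaptic cells
y_prev = rng.choice([-1.0, 1.0], size=5)     # presynaptic signals y_j

x = W @ y_prev                               # deterministic somatic inputs, Eq. (1)
y_mean = np.tanh(x)                          # <y_i> = g(x_i)
y = y_mean + sig * rng.standard_normal(3)    # noisy outputs, Eq. (20)
r = 1.0                                      # stand-in scalar reward from the environment

# Eq. (24): dW_ij = eta*[r*(y_i - <y_i>) + lam*(1 - r)*(-y_i - <y_i>)]*(1 - <y_i>^2)*y_j
factor = eta * (r * (y - y_mean) + lam * (1 - r) * (-y - y_mean)) * (1 - y_mean ** 2)
W += np.outer(factor, y_prev)
print("updated weights:\n", W)
```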

V. COMPARISON ON MONK'S PROBLEMS

Although the preliminary testing on the simplest parity problem gives a leading edge to Rules A, we have carried out a more detailed comparison of all the new rules on the set of larger "Monk's" problems [8], which are widely used for NN algorithm benchmarking. The set contains three classification tasks with two classes. Each of the three problems contains 432 data vectors with 17 binary components each. For problems 1, 2, and 3, there are, respectively, 124, 169, and 122 vectors in the training sets; the rest of the data are used as the test sets.

For the comparison of the different training methods, we have used MLPs of the same size as used earlier by other authors: 17-3-1. The positive output was treated as representing one class and the negative one as representing the other class. We have used r = +1 for correct classifications and r = -1 for wrong classifications.^5 Training was stopped either when the sliding average reward r_a(t) exceeded 0.99 or after 10 000 epochs.

TABLE I. GENERALIZATION PERFORMANCE FOR MONK'S PROBLEMS

Table I shows the generalization performance (the percentage of correct classifications on the test set) after training the networks with the new and some well-known algorithms, including both the most popular supervised training rule, error backpropagation (BP) [8], and the global reinforcement rules (A_{r-i} and A_{r-p} [9]). The error bars correspond to the standard deviation of the results of five experiments, except for the case of A_{r-p}, A2, and A3, where they are the results of 20 experiments. The results for BP have been borrowed from [10]. The last column of the table shows the parameters used for each training rule. In order to ensure a fair comparison between the best rules (A_{r-p}, A2, and A3), we have optimized their parameters for individual problems. Fig. 2 shows, as an example, the optimization on the second problem. We can see that the performance is not very sensitive to the parameters within reasonable ranges of their variation.

^5 Rule A_{r-p} had been originally designed for r in [0, 1]; however, in our simulations, we found a marginal improvement of performance using r = ±1.


Fig. 2. Optimization of the generalization performance with respect to the learning parameters on the second of the Monk's problems. Results were averaged over 20 independent experiments, and the error bar represents the standard deviation divided by \sqrt{20}. The values of \sigma(0) and \gamma in (c) and (d) are the best values obtained from (a) and (b). (a), (b) A2 at \eta = 0.1 and \lambda = 0.003. (c), (d) A2 at \sigma(0) = 1.0 and \gamma = 0.2. (e), (f) A_{r-p}.

Fig. 3. Cart–pole balancing problem. The force applied to the cart is a(t)F, where -1 \le a(t) \le 1 is a function of time. In our particular example F = 10 N, the masses of the cart and the pole are 1.0 kg and 0.1 kg, respectively, and the length of the pole is 1 m. The dynamics of the system is simulated with a time step of 0.02 s, which is small in comparison with the dynamics time scales (which are of the order of 1 s).

VI. COMPARISON ON THE CART–POLE BALANCING PROBLEM

One may argue that the global reinforcement rules should also be characterized on problems which do not allow direct supervision. We have done this for the cart–pole balancing task [11], in which the system tries to balance a pole hinged to a cart moving freely on a track (Fig. 3) by applying a horizontal force to the cart. A failure occurs when either the pole incline angle magnitude exceeds 12° or the cart hits one of the walls (x = ±2.4 m). A reward of r = -1 is issued upon failure and r = 0.1 otherwise. To solve this delayed-reward problem, the usual actor-critic method [2] was used. The actor is a 4-30-1 MLP which takes the state vector {x(t), \dot{x}(t), \theta(t), \dot{\theta}(t)} of the cart–pole system as input and produces a single output a(t) as the action. This network has been trained by either A1 or A_{r-i} (for this task, antitrapping terms are not necessary) with the temporal difference (TD) error [2]

\delta(t) = r(t) + \gamma V(t+1) - V(t)    (25)

playing the role of the instant reward signal. In (25), r(t) is the real reward at time t, V(t) is the value function, and \gamma is the discount factor. For example, in the case of TD(\lambda), the A1 rule takes the form

\Delta w_{ij}(t) = \eta_a \delta(t) e_{ij}(t)    (26a)

e_{ij}(t) = \lambda_{TD} e_{ij}(t-1) + [x_i(t) - \langle x_i(t) \rangle] y_j(t).    (26b)
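The bookkeeping behind (25), (26a), and (26b) is sketched below in Python (ours, not the authors' code); the environment step, the critic value function, and all sizes and parameter values are stand-ins rather than the cart–pole setup of Figs. 3 and 4.

```python
import numpy as np

# Sketch of the actor update (26a)-(26b): Rule A1 driven by the TD error (25)
# through per-synapse eligibility traces. Everything below is a stand-in.
rng = np.random.default_rng(5)
n_state, n_hidden = 4, 30
gamma, lam_td, eta_a, sig = 0.95, 0.6, 0.01, 1.0     # illustrative values

W = 0.1 * rng.standard_normal((n_hidden, n_state))   # first-layer actor weights
E = np.zeros_like(W)                                 # eligibility traces e_ij(t)

def critic_value(s):              # stand-in for the separately trained critic MLP
    return float(-np.sum(s ** 2))

def env_step(s, a):               # stand-in dynamics and reward (not the cart-pole of Fig. 3)
    return 0.99 * s + 0.02 * a + 0.01 * rng.standard_normal(n_state), 0.1

s = rng.standard_normal(n_state)
for t in range(200):
    mean_x = W @ s                                       # mean somatic inputs, Eq. (7)
    x = mean_x + sig * rng.standard_normal(n_hidden)     # exploration noise, Eq. (6)
    a = float(np.tanh(x).mean())                         # toy readout standing in for the actor output
    s_next, r = env_step(s, a)
    delta = r + gamma * critic_value(s_next) - critic_value(s)   # TD error, Eq. (25)
    E = lam_td * E + np.outer(x - mean_x, s)                     # trace update, Eq. (26b)
    W += eta_a * delta * E                                       # actor weight update, Eq. (26a)
    s = s_next
print("actor weight norm after 200 steps:", round(float(np.linalg.norm(W)), 3))
```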

One more option here is to use an additional adaptation of the fluctuation intensity instead of the global quenching used in the previous tasks. Indeed, by identifying the set of variances \sigma_i with \theta in (2) and letting \eta_i \propto \sigma_i^2, one arrives at the following.^6

Rule \Sigma:

\Delta\sigma_i = \eta_\sigma r [ (x_i - \langle x_i \rangle)^2 - \sigma_i^2 ] / \sigma_i.    (27)

Unfortunately, this rule seems inconvenient for hardware implementation.

^6 Actually, this rule had been derived by Williams [4] for random somas with Gaussian statistics of the output signal fluctuations.
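A toy Python illustration of Rule \Sigma (ours, with an assumed reward that penalizes large excursions and illustrative parameters) shows how the fluctuation intensity of a single cell shrinks under the rule when exploration noise hurts the reward.

```python
import numpy as np

# Toy illustration of Rule Sigma (27): the fluctuation intensity sigma of a single
# cell is adapted by the reward itself. All values here are illustrative.
rng = np.random.default_rng(6)
mean_x, sigma, eta_sig = 0.0, 2.0, 0.001

for _ in range(20000):
    x = mean_x + sigma * rng.standard_normal()
    r = -abs(x)                                                       # toy reward
    sigma += eta_sig * r * ((x - mean_x) ** 2 - sigma ** 2) / sigma   # Eq. (27)
    sigma = max(sigma, 1e-3)                                          # keep sigma positive
print(f"adapted sigma: {sigma:.3f}")
```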

The critic is a 5-30-1 MLP which takes the state-action vector {x(t), \dot{x}(t), \theta(t), \dot{\theta}(t), a(t)} as input and produces a single output V(t) as a value function estimate. The critic has been trained by BP with the TD error. In the case of TD(\lambda)

\Delta w(t) = \eta_c \delta(t) e(t)    (28a)

e(t) = \lambda_{TD} e(t-1) + \nabla_w V(t).    (28b)

All the somatic cells in the critic network have the tanh activation function (11), except for the output cell, which is linear:

V = y_i = 0.1 h_i.    (29)

Fig. 4. Training dynamics for the cart–pole balancing, using Rules A1, A1\Sigma, and A_{r-i}. All results were averaged over 20 independent experiments. After each failure, the system is restored to its initial condition (x = \dot{x} = \theta = \dot{\theta} = 0) and the experiment is continued. Parameters used in the training are as follows. For Rule A1: \eta = 0.01 and \sigma = 10 for all cells (no quenching). For Rule \Sigma: \sigma = 10 initially and \eta_\sigma = 0.0002. For Rule A_{r-i}: \eta = 0.02. For BP: \eta = 0.6. For TD(\lambda): \gamma = 0.95.

The results of the simulation are shown in Fig. 4. As we can see, although Rules A_{r-i} and A1\Sigma (which is a combination of Rules A1 and \Sigma) lead to faster training, the simple Rule A1 is also able to fully solve this problem (i.e., to learn how to balance the pole without failure indefinitely) eventually. In comparison with the usual reinforcement learning using an RBF network [2] or the cerebellar model articulation controller (CMAC) [12], the learning is slow. However, unlike those methods, Rule A1 learns directly in the continuous space. (No discretization whatsoever is involved.) We believe this makes our method applicable to a broader range of tasks.

VII. DISCUSSION

The most important result of this letter is a constructive proof that NNs with stochastic synapses, trained using at least one rule family (A), can perform classification and reinforcement tasks on a par with the Boltzmann machines using Rules A_{r-i} and A_{r-p}. The new rules are very simple and local, giving hope that they may be readily implemented in hardware, in particular in ultradense nanoelectronic networks like CMOL CrossNets [5]. Such an implementation is our most immediate next goal.

Another important open question is whether the efficiency of the new rules may be sustained with the growth of the network size. (The CMOL hardware development promises future CrossNets with as many as 10^10 cells and 10^14 synapses [5].) Answering this question hinges on finding benchmark problems with a variable length L of the input vector, for which the performance of the known methods such as A_{r-i} and A_{r-p} is either insensitive to L or a well-understood function of the problem size. So far, we have been unable to find such problems in the literature, and we plan to develop some tests ourselves.

ACKNOWLEDGMENT

The authors would like to thank P. Adams, D. Hammerstrom, J. H. Lee, and T. Sejnowski for valuable discussions, as well as the anonymous Reviewer 1 for important suggestions.

Manuscript received February 21, 2006; revised September 6, 2006; accepted October 11, 2006. This work was supported in part by the Air Force Office of Scientific Research (AFOSR), by MARCO via the Functional Engineered Nano Architectonics (FENA) Center, and by the National Science Foundation (NSF). The authors are with Stony Brook University, Stony Brook, NY 11794-3800 USA (e-mail: [email protected]). Color versions of one or more of the figures in this letter are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TNN.2006.888376.

REFERENCES

[1] J. Hertz, R. G. Palmer, and A. S. Krogh, Introduction to the Theory of Neural Computation. Redwood City, CA: Addison-Wesley, 1991.
[2] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[3] G. E. Hinton and T. J. Sejnowski, "Learning and relearning in Boltzmann machines," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, D. E. Rumelhart, Ed. Cambridge, MA: MIT Press, 1986, vol. 1, pp. 282-317.
[4] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, pp. 229-256, 1992.
[5] Ö. Türel, J. H. Lee, X. L. Ma, and K. K. Likharev, "Neuromorphic architectures for nanoelectronic circuits," Int. J. Circuit Theory Appl., vol. 32, no. 5, pp. 277-302, 2004.
[6] X. Ma and K. K. Likharev, "Global reinforcement learning in neural networks with stochastic synapses," in Proc. World Congr. Comput. Intell./Int. Joint Conf. Neural Netw. (WCCI/IJCNN), 2006, pp. 47-53.
[7] J. Baxter and P. L. Bartlett, "Infinite-horizon policy-gradient estimation," J. Artif. Intell. Res., vol. 15, pp. 319-350, 2001.
[8] S. B. Thrun, "The MONK's problems: A performance comparison of different learning algorithms," Comput. Sci. Dept., Carnegie Mellon Univ., Pittsburgh, PA, Tech. Rep. CS-91-197, 1991.
[9] A. G. Barto and M. I. Jordan, "Gradient following without back-propagation in layered networks," in Proc. 1st Annu. Int. Conf. Neural Netw., San Diego, CA, 1987, vol. 2, pp. 629-636.
[10] K. P. Unnikrishnan and K. P. Venugopal, "Alopex: A correlation-based learning algorithm for feed-forward and recurrent neural networks," Neural Comput., vol. 6, pp. 469-490, 1994.
[11] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, no. 5, pp. 834-846, 1983.
[12] J. S. Albus, "A new approach to manipulator control: The cerebellar model articulation controller (CMAC)," Trans. ASME J. Dyn. Syst. Meas. Control, vol. 97, no. 3, pp. 220-227, 1975.
