A Self-Generating Neuro-Fuzzy System Through Reinforcements

Mu-Chun Su, Chien-Hsing Chou*, and Eugene Lai*
Department of Computer Science & Information Engineering, National Central University, Taiwan, R.O.C.
*Department of Electrical Engineering, Tamkang University, Taiwan, R.O.C.

Abstract: In this paper, a novel self-generating neuro-fuzzy system trained through reinforcements is proposed. Not only the weights of the network but also the architecture of the whole network are learned through reinforcement learning. The proposed neuro-fuzzy system is applied to the inverted pendulum system to demonstrate its performance.

Key-words: reinforcement learning, neural network, neuro-fuzzy system
1. Introduction

In recent years, a common approach for dealing with complex real-world problems is fuzzy systems. Fuzzy systems are conceptually simple and intuitively appealing, as they provide high-level, human-like reasoning ability. Although fuzzy systems have been successfully and widely applied to many fields, in practice knowledge acquisition is the bottleneck in implementing them. Thus, incorporating the learning abilities of neural networks into the design of fuzzy systems has recently become a very active research area, e.g., the fuzzy adaptive learning control network [1], back-propagation fuzzy systems [2], adaptive neuro-fuzzy inference systems [3], fuzzy associative memory [4], and fuzzy hyperrectangular composite neural networks [5]-[6]. Such integration renders the combined system more useful than either approach alone. When supervised learning can be used to train a neuro-fuzzy system, it is preferable to other learning paradigms [7]. However, for some real-world applications, precise data for training are difficult or expensive to obtain, or simply unavailable. Therefore, there has been a growing interest in reinforcement learning [8]-[17]. In reinforcement learning, there is no supervisor to critically judge the chosen action at each time step; learning proceeds through trial-and-error interactions with a dynamic environment. An overview of reinforcement learning from a computer-science perspective can be found in [18]. The approaches to solving reinforcement-learning problems can be roughly divided into two strategies. The first is to search the space of behaviors for one that performs well in the environment; this approach has been taken by work in genetic algorithms and genetic programming as well as some more novel search techniques [12]. The second is to use statistical techniques and dynamic programming methods to estimate the utility of taking actions in states of the world, for example, the adaptive heuristic
critic [13], temporal difference (TD(λ)) [14], and Q-learning [15]-[16]. In this paper, we propose a self-generating neuro-fuzzy system that uses a special reinforcement learning algorithm to incrementally construct its architecture and tune the system's parameters. The organization of the paper is as follows. In Section 2, the proposed neuro-fuzzy system is introduced. The simulations of balancing an inverted pendulum system are described in Section 3. Finally, Section 4 concludes the paper.
2. The Proposed Neuro-Fuzzy System

The proposed self-generating neuro-fuzzy system can not only tune the parameters of the system but also incrementally construct the architecture of the whole system through learning. The development of the proposed neuro-fuzzy system was motivated by the so-called classifier system, a machine learning system that learns syntactically simple string rules (called classifiers) [19]-[20].
2.1 Brief Review of Classifier Systems

The study of reinforcement learning relates to credit assignment: given the results of a process, one has to distribute reward or blame to the individual elements (or rules) contributing to that performance. Classifier systems employ the so-called "bucket brigade" algorithm to pass reward back through chains of decisions. The bucket brigade algorithm may be viewed as an information economy in which the right to trade information is bought and sold by classifiers. This service economy contains two main components: an auction and a clearinghouse. To participate in the auction, each classifier maintains a record of its net worth, called its strength S. Each matched classifier makes a bid B proportional to its strength. In this way, rules that are highly fit (have accumulated a large net worth) are given preference over other rules.
The bucket brigade algorithm is summarized as follows [21]:
Step 1: Classifier 2 matches a message posted by classifier 1.
Step 2: Classifier 2 wins the bid competition and posts its message.
Step 3: Classifier 2 pays its bid back to classifier 1.
Step 4: Classifier 3 matches the message posted by classifier 2.
Step 5: Go to Step 1.
This algorithm allows rules that do not directly participate in a reward-obtaining step to be indirectly rewarded through the bids of the classifiers they cause to fire. If one classifier repeatedly causes the firing of another classifier that gains strength through reward, then the first classifier's strength will increase as a result. One thing that should be emphasized is that the rules in a conventional classifier system are crisp and represented as strings.
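As an illustration, the following Python sketch shows how the auction and clearinghouse bookkeeping of the bucket brigade might be coded; the Classifier fields and the bid fraction are our own illustrative assumptions, not notation from the classifier-system literature.

```python
# A minimal bucket-brigade sketch; the Classifier fields and the bid
# fraction are illustrative assumptions, not the paper's notation.
class Classifier:
    def __init__(self, condition, message, strength=1.0):
        self.condition = condition    # pattern this classifier matches
        self.message = message        # message it posts when it wins
        self.strength = strength      # accumulated net worth S

BID_FRACTION = 0.1                    # fraction of strength offered as a bid

def auction(classifiers, posted_message, poster, reward=0.0):
    """One auction/clearinghouse cycle of the bucket brigade."""
    matched = [c for c in classifiers if c.condition == posted_message]
    if not matched:
        return None
    bids = {c: BID_FRACTION * c.strength for c in matched}   # bid B proportional to strength S
    winner = max(bids, key=bids.get)
    # Clearinghouse: the winner pays its bid to the classifier that posted
    # the matched message, passing credit back along the chain.
    winner.strength -= bids[winner]
    if poster is not None:
        poster.strength += bids[winner]
    winner.strength += reward          # any external reward goes to the winner
    return winner
```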
2.2 The Architecture of the Neuro-Fuzzy System

Basically, the proposed neuro-fuzzy system is similar to a two-layer Radial Basis Function (RBF) network. An RBF network is a two-layer neural network whose output node forms a linear combination of basis functions. The output of the network is computed by

Out(x) = \sum_{j=1}^{J} c_j m_j(x)    (1)

where x = (x_1, ..., x_n)^T is an input pattern, Out(x) is the output of the network, c_j is the connection weight between node j and the output node, and m_j(x) is the output of the jth hidden node. The most common basis function is a Gaussian kernel function of the form:

m_j(x) = \exp\left( -\frac{\| x - w_j \|^2}{2\sigma_j^2} \right), \quad j = 1, 2, ..., J    (2)
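For concreteness, a short Python sketch of Eqs. (1)-(2) is given below; the array shapes and variable names are our own illustration. Each hidden node holds a center w_j and width sigma_j, and the network output is the weighted sum of the Gaussian activations.

```python
import numpy as np

def rbf_output(x, centers, sigmas, c):
    """Eqs. (1)-(2): Out(x) = sum_j c_j * exp(-||x - w_j||^2 / (2 sigma_j^2)).

    x       : (n,)   input pattern
    centers : (J, n) hidden-node centers w_j
    sigmas  : (J,)   widths sigma_j
    c       : (J,)   connection weights c_j
    """
    dist2 = np.sum((centers - x) ** 2, axis=1)      # ||x - w_j||^2
    m = np.exp(-dist2 / (2.0 * sigmas ** 2))        # Gaussian basis m_j(x)
    return float(np.dot(c, m))

# Example with two hidden nodes in a 2-D input space
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
sigmas = np.array([0.5, 0.5])
c = np.array([1.0, -1.0])
print(rbf_output(np.array([0.2, 0.1]), centers, sigmas, c))
```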
In some sense, each hidden node corresponds to a fuzzy rule, which can be represented as

IF (x \in HS_1) THEN Out(x) is c_1;
  ...
ELSE IF (x \in HS_{N \times N}) THEN Out(x) is c_J;    (3)

The domain defined by the antecedent is a fuzzy hypersphere, \| x - w_j \|^2 = constant.
In the reinforcement-learning environment, our neuro-fuzzy system works as follows. The proposed neuro-fuzzy system, based on the reinforcement-learning model, is schematically shown in Fig. 1. On each time step, the system receives two inputs: x(t), which indicates the current state of the environment at time t, and v(t), an evaluation signal that evaluates the state at time t. The system then chooses an action, Out(x(t)), to generate an output. After the output action changes the state of the environment, the system receives a new state x(t+1) and an updated evaluation signal v(t+1) at time t+1. We then obtain a scalar internal reinforcement signal, r(t+1), which is defined as

r(t+1) = v(t+1) - v(t)    (4)

Finally, the internal reinforcement signal r(t+1) is used to update the parameters of the system at time t+1.

Fig. 1. The reinforcement-learning model.
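A hypothetical interaction loop illustrating Eq. (4) is sketched below; `system`, `env`, and `evaluate_state` are placeholder interfaces for whatever environment and state-evaluation index are used, and are not part of the paper's specification.

```python
# Sketch of one interaction step of the reinforcement-learning model (Eq. (4)).
# `system`, `env`, and `evaluate_state` are assumed interfaces, not the paper's API.
def interaction_step(system, env, evaluate_state, x_t, v_t):
    action = system.output(x_t)            # Out(x(t))
    x_next = env.step(action)              # environment moves to state x(t+1)
    v_next = evaluate_state(x_next)        # evaluation signal v(t+1)
    r_next = v_next - v_t                  # internal reinforcement r(t+1), Eq. (4)
    system.update(x_t, action, r_next)     # tune strengths and weights at t+1
    return x_next, v_next
```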
Fig. 2. The architecture of the proposed neuro-fuzzy network.

To participate in the computation of the final output of the system, each fuzzy rule maintains a record of its net worth, called its strength, S_j(t). In addition, each fuzzy rule makes a bid, B_j(t), proportional to the product of its strength and its matching degree, m_j(x(t)). In this way, fuzzy rules that are highly fit (have accumulated a large net worth) contribute more to the computation of the final system output. Therefore, the final output of our system is computed as follows:

Out(x(t)) = \sum_{j=1}^{J} c_j(t) \, out_j(x(t))    (5)

where

out_j(x(t)) = \frac{C_{bid} S_j(t) m_j(x(t))}{\sum_{j=1}^{J} C_{bid} S_j(t) m_j(x(t))} = \frac{B_j(t)}{\sum_{j=1}^{J} B_j(t)}    (6)

C_{bid} is a constant bid coefficient (0 < C_{bid} \le 1), m_j(x(t)) is the membership function of the jth hidden node, and B_j(t) is the bid of the jth hidden node at time t. In addition, the membership function m_j(x(t)) is defined as

m_j(x(t)) = \exp\left( -\sum_{i=1}^{n} \frac{(x_i(t) - w_{ji}(t))^2}{2\sigma_{ji}(t)^2} \right)    (7)
where \sigma_{ji} is a regulating parameter which controls how fast the membership function m_j(x(t)) decreases. Fig. 2 shows the architecture of the proposed neuro-fuzzy system. The fuzzy rules in the system can then be represented as

IF (x \in HS_1) THEN Out(x) is c_1 with a strength S_1;
  ...
ELSE IF (x \in HS_{N \times N}) THEN Out(x) is c_J with a strength S_J;    (8)
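The bid-weighted combination of Eqs. (5)-(7) could be coded as follows; this is only a sketch under the stated notation, with NumPy arrays standing in for the rule base.

```python
import numpy as np

C_BID = 0.01  # bid coefficient, 0 < C_bid <= 1

def membership(x, W, Sigma):
    """Eq. (7): m_j(x) = exp(-sum_i (x_i - w_ji)^2 / (2 sigma_ji^2))."""
    return np.exp(-np.sum((x - W) ** 2 / (2.0 * Sigma ** 2), axis=1))

def system_output(x, W, Sigma, strengths, c):
    """Eqs. (5)-(6): bids B_j = C_bid * S_j * m_j(x), normalized to out_j."""
    m = membership(x, W, Sigma)               # matching degrees m_j(x(t))
    bids = C_BID * strengths * m              # B_j(t)
    out = bids / np.sum(bids)                 # out_j(x(t)), Eq. (6)
    return float(np.dot(c, out)), m, bids     # Out(x(t)), Eq. (5)
```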
2.3 Learning Mechanisms

In our system, there are several different parameters, such as the strength S_j(t) and the hidden-node weights w_j and \sigma_{ji}, which need to be updated during the learning procedure. These parameters are updated according to different learning mechanisms.

2.3.1 Reward Mechanism

The reward mechanism is used to update the strength S_j(t). The idea of the reward mechanism is very simple. In our system all fuzzy rules cooperate in the computation of the system's output. If the performance improves, then the strength of each fuzzy rule is increased according to the degree of its contribution; otherwise, its strength is decreased. The internal reinforcement signal r(t+1) is used to indicate the performance of the neuro-fuzzy network. If r(t+1) is positive, the state at time t+1 is better than the state at time t; otherwise, the current performance is worse than the previous state. First, each fuzzy rule receives a reward from the environment according to its contribution. The reward, R, from the environment is defined as

R = S_r \times r(t+1)    (9)

where the parameter S_r is a positive value that is determined by the user. The reward R is then distributed to each fuzzy rule according to its contribution. The way we tune the strength of each fuzzy rule is as follows:
Condition 1: IF r(t+1) > r_t THEN S_j(t+1) = S_j(t) + B_j(t) + R \times out_j(x(t))    (10)
Condition 2: IF r(t+1) < -r_t THEN S_j(t+1) = S_j(t) - B_j(t) + R \times out_j(x(t))    (11)
Condition 3: IF -r_t \le r(t+1) \le r_t THEN S_j(t+1) = S_j(t) + R \times out_j(x(t))    (12)
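As an illustration of Conditions 1-3 (the threshold r_t and the strength bound S_max are explained in the paragraph that follows), a hedged Python sketch of the strength update might look like this; the numeric constants are the values later reported in Section 3.

```python
import numpy as np

R_T, S_MAX = 0.01, 2.0   # threshold r_t and strength ceiling S_max (Section 3 values)

def update_strengths(strengths, bids, out, r_next, S_r=5.0):
    """Eqs. (9)-(12): reward R = S_r * r(t+1), distributed via out_j(x(t))."""
    R = S_r * r_next
    if r_next > R_T:           # Condition 1: performance improved, bids returned
        new_s = strengths + bids + R * out
    elif r_next < -R_T:        # Condition 2: performance worsened, bids forfeited
        new_s = strengths - bids + R * out
    else:                      # Condition 3: only the shared reward is applied
        new_s = strengths + R * out
    return np.clip(new_s, 0.0, S_MAX)   # keep strengths within [0, S_max]
```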
The parameter r_t is a positive value which is used to decide whether we should return the bid to the fuzzy rule or not. When r(t+1) is larger than r_t, the fuzzy rule not only receives a reward from the environment but also takes its bid back. On the contrary, when r(t+1) is smaller than -r_t, the rule turns over its bid and receives a penalty. In the remaining situation, each fuzzy rule changes its strength only according to the reward R. Notice that we limit the updated strength S_j(t+1) to the range between 0 and S_max, where S_max is a constant determined by the user.

From many simulations, we found that the results were sometimes not as good as we expected. In order to overcome this problem, we apply a second reward mechanism to further improve the performance of the proposed neuro-fuzzy system. The second reward mechanism is applied only when the system achieves the goal or fails in the middle of the training procedure. In reinforcement learning, the learning system obtains a scalar evaluation of its performance of the task according to a given index. The objective of the learning system is then to improve its performance by generating appropriate outputs. However, reinforcement learning is not as direct, immediate, and informative as supervised learning, and the simulation results were sometimes not in correspondence with the original goal. The second reward mechanism adjusts the strength of each rule as follows:

S_j(T+1) = S_j(T) + R_F \times \frac{\sum_{t=1}^{T} B_j(t)}{\sum_{j=1}^{J} \sum_{t=1}^{T} B_j(t)}    (13)
where R_F represents a reward which is distributed to each fuzzy rule at the end of the training procedure and T is the number of time steps in the training procedure. If the system achieves its goal, then we let R_F be a positive value so as to greatly enhance the strengths of the rules. If the system fails to achieve its goal or fails in the middle of the training procedure,
we then let R_F be a negative value to penalize those rules. From our many simulations, we found that the second reward mechanism can further improve the efficiency of the proposed system.

2.3.2 Updating Weights

Here we present how we update the weights, w_j and \sigma_{ji}, in the hidden nodes. First, we let P_j represent the jth hidden node's weights in the network, which include the weighting vector w_j and the regulating parameters \sigma_{ji}. Notice that the connection weight c_j is fixed during the training procedure. The intent of computing Out(x(t)) is to maximize v(t), so that the system ends up in a good state and avoids failure. Thus, the learning task is done by a gradient method that maximizes the objective function v(t) as a function of P_j. The learning rule is defined as

P_j(t+1) = P_j(t) + \eta(t) \frac{\partial v(t)}{\partial P_j(t)} = P_j(t) + \eta(t) \frac{\partial v(t)}{\partial Out(x(t))} \frac{\partial Out(x(t))}{\partial P_j(t)}    (14)

To apply this learning rule, we need the two derivatives on the right-hand side, which in general depend on the state. Since it is complex to compute the derivative \partial v(t) / \partial Out(x(t)), Berenji and Khedkar [10] proposed approximating it by the instantaneous difference ratio

\frac{\partial v(t)}{\partial Out(x(t))} \approx \frac{dv(t)}{dOut(x(t))} \approx \frac{v(t) - v(t-1)}{Out(x(t)) - Out(x(t-1))} \approx \mathrm{sgn}\left( \frac{v(t) - v(t-1)}{Out(x(t)) - Out(x(t-1))} \right)    (15)

We then rewrite the learning rule as

P_j(t+1) = P_j(t) + \eta(t) \, \mathrm{sgn}\left( \frac{v(t) - v(t-1)}{Out(x(t)) - Out(x(t-1))} \right) \frac{\partial Out(x(t))}{\partial P_j(t)}    (16)

and

\eta(t) = \eta_1 (1 - v(t)^2) + \eta_2    (17)

where \eta(t) is the learning-rate function, and \eta_1 and \eta_2 are two learning parameters that control the learning rate.

2.3.3 Construction of the Architecture of the System

In order to learn behavior through trial-and-error interaction with a dynamic environment, the proposed system must explicitly explore its environment and exploit new rules. The architecture of the neuro-fuzzy system is incrementally constructed during the training procedure. First, the system starts with no hidden node (i.e., no fuzzy rule). The first rule is initialized as follows:

S_1(t+1) = 1,    (18)
w_1(t+1) = x(t),    (19)
c_1(t+1) = random(range_output)    (20)

where the variable range_output is the range of the output value. In other words, we generate an action at random. The neuro-fuzzy system then outputs an action to the environment, tunes the strength, and updates the weights of the first rule at each time step according to the aforementioned learning mechanisms. During the training procedure, we generate a new rule if none of the existing rules has a good enough response to the current state (i.e., \max_{j=1,...,J} m_j(x(t)) \le \theta_m); the parameters of the new rule are initialized as follows:

S_{J+1}(t+1) = 1,    (21)
w_{J+1}(t+1) = x(t),    (22)
c_{J+1}(t+1) = Out(x(t)) if r(t+1) \ge 0, and c_{J+1}(t+1) = random(range_output) if r(t+1) < 0    (23)
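A hedged sketch of the sign-approximated update of Eqs. (15)-(17) is given below, applied to the centers w_j. The gradient of Out(x(t)) with respect to the centers is our own simplified derivation for illustration (it treats out_j as locally constant and ignores the bid-normalization coupling); it is not spelled out in the paper.

```python
import numpy as np

ETA1, ETA2 = 1e-4, 1e-4   # learning parameters eta_1, eta_2 (Section 3 values)

def update_centers(W, Sigma, c, out, x, v_t, v_prev, Out_t, Out_prev):
    """Sign-approximated update, Eqs. (15)-(17), applied to the centers w_j.

    The gradient dOut/dw_ji used here treats out_j as locally constant
    (ignoring the bid-normalization coupling); it is an illustrative
    simplification, not the paper's exact derivative.
    """
    eta = ETA1 * (1.0 - v_t ** 2) + ETA2                        # Eq. (17)
    denom = Out_t - Out_prev
    if denom == 0.0:
        return W
    s = np.sign((v_t - v_prev) / denom)                          # Eq. (15)
    grad = (c * out)[:, None] * (x - W) / (Sigma ** 2)           # approximate dOut/dw_ji
    return W + eta * s * grad                                    # Eq. (16)
```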
The idea is simple and reasonable. If r(t+1) \ge 0, the system has performed better than in the previous state, so it is reasonable to assign the latest output action Out(x(t)) to c_{J+1}(t+1); otherwise, a random output is assigned to c_{J+1}(t+1). During the training procedure, we also need to eliminate useless rules so as to keep the number of rules as small as possible. Fuzzy rules whose strengths are smaller than a pre-specified threshold (i.e., S_j(t) < \theta_s) are eliminated. This elimination criterion is reasonable: if a rule's strength falls below the threshold \theta_s, it means that the rule often causes the system to take bad actions during the training procedure. Thus, we eliminate this useless rule to improve the system's efficiency.
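The structure-learning step described above could be sketched as follows; the array-based rule representation is our own illustration, and the width initialization sigma_init is an assumption, since Eqs. (21)-(23) do not specify how \sigma_{ji} of a new rule is set.

```python
import numpy as np

THETA_M, THETA_S = 0.3, 0.1   # creation and elimination thresholds (Section 3 values)

def grow_and_prune(W, Sigma, strengths, c, x, m, Out_t, r_next,
                   out_range=(-15.0, 15.0), sigma_init=1.0):
    """Add a rule when no rule matches well (Eqs. (21)-(23)); prune weak rules."""
    if m.size == 0 or np.max(m) <= THETA_M:
        # New rule centered on the current state, Eqs. (21)-(22).
        # sigma_init is an assumed width, not specified in the paper.
        W = np.vstack([W, x]) if W.size else x[None, :]
        new_sigma = np.full_like(x, sigma_init)
        Sigma = np.vstack([Sigma, new_sigma]) if Sigma.size else new_sigma[None, :]
        strengths = np.append(strengths, 1.0)
        # Eq. (23): reuse the latest action if it improved performance.
        new_c = Out_t if r_next >= 0 else np.random.uniform(*out_range)
        c = np.append(c, new_c)
    # Eliminate rules whose strength has fallen below theta_s.
    keep = strengths >= THETA_S
    return W[keep], Sigma[keep], strengths[keep], c[keep]
```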
3. Experimental Results

In our simulations, the maximum strength S_max is chosen to be 2, and the bid coefficient C_bid is chosen to be 0.01. The parameters r_t, S_r, \theta_m, and \theta_s are chosen to be 0.01, 5, 0.3, and 0.1, respectively. The reward R_F is 0.5 when the system achieves the goal, and -0.5 when the system fails. The learning parameters \eta_1 and \eta_2 are both chosen to be
0.0001. As shown in Fig. 3, the inverted pendulum problem is a classic non-linear control problem of learning how to balance an upright pole. In the simulations, the movement of both the pole and the cart is restricted to the vertical plane. The state of the system is described by the pole's angle and angular velocity, and the cart is allowed to move without limit in the left or right direction. Our control goal is to balance the pole by supplying an appropriate force F to the cart. The state equations for the inverted pendulum can be expressed as [22]-[23]:

\dot{\theta} = \frac{\delta\theta}{\delta t}    (24)

and

\ddot{\theta} = H_2(\theta, \dot{\theta}, F) = \frac{g \sin(\theta) + \cos(\theta) \left( \dfrac{-F - mL\dot{\theta}^2 \sin(\theta)}{m_c + m} \right)}{L \left( \dfrac{4}{3} - \dfrac{m \cos^2(\theta)}{m_c + m} \right)}    (25)

where g (acceleration due to gravity) is 9.8 m/sec^2, m_c (mass of the cart) is 1.0 kg, m (mass of the pole) is 0.1 kg, L (half the length of the pole) is 0.5 m, and F is the applied force in newtons (-15 < F < 15). These equations were simulated by the Euler method, which uses an approximation to the above equations with a time step of 0.02 second. In addition, the evaluation function is defined as

v(t) = \exp(-5(\theta^2 + \dot{\theta}^2))    (26)

The proposed neuro-fuzzy system is applied to this problem. We assume that a failure happens when |\theta| > 30^\circ or when a training run exceeds 4 seconds, and we give random initial states (|\theta| < 25^\circ and |\dot{\theta}| < 25^\circ) to train the neuro-fuzzy system. Finally, the neuro-fuzzy system established 23 neuro-fuzzy rules automatically through reinforcement learning. Fig. 4 shows how the pole angle and angular velocity converge in about 2.5 seconds from a random initial condition (23, 21). Fig. 5 shows the simulation result for another random initial condition (-23, -15), which converges in about 3 seconds. The simulation results show that the proposed system works well for the inverted pendulum problem.

Fig. 4. The initial condition is (23, 21): (a) pole angle; (b) pole angular velocity.
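A short simulation sketch of Eqs. (24)-(26) using the Euler method with the stated constants is given below; the controller is left as a placeholder, since in the actual experiments the trained neuro-fuzzy system supplies the force F.

```python
import numpy as np

G, M_C, M, L, DT = 9.8, 1.0, 0.1, 0.5, 0.02   # constants of Eq. (25) and the Euler time step

def pendulum_step(theta, theta_dot, F):
    """One Euler step of Eqs. (24)-(25); angles in radians."""
    num = G * np.sin(theta) + np.cos(theta) * (
        (-F - M * L * theta_dot ** 2 * np.sin(theta)) / (M_C + M))
    den = L * (4.0 / 3.0 - M * np.cos(theta) ** 2 / (M_C + M))
    theta_ddot = num / den
    theta += DT * theta_dot
    theta_dot += DT * theta_ddot
    return theta, theta_dot

def evaluate(theta, theta_dot):
    """Evaluation signal of Eq. (26)."""
    return np.exp(-5.0 * (theta ** 2 + theta_dot ** 2))

# Placeholder controller: the trained neuro-fuzzy system would supply F here.
theta, theta_dot = np.radians(23.0), np.radians(21.0)
for _ in range(int(4.0 / DT)):
    F = 0.0                                   # stand-in for Out(x(t)), limited to (-15, 15)
    theta, theta_dot = pendulum_step(theta, theta_dot, F)
    v = evaluate(theta, theta_dot)            # v(t), Eq. (26)
```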
Fig. 5. The initial condition is (-23, -15): (a) pole angle; (b) pole angular velocity.
Fig. 3. The inverted pendulum system.
4. Conclusions

In this paper, we have presented a novel self-generating neuro-fuzzy system trained through reinforcements. The approach taken is motivated by so-called classifier systems. The proposed self-generating neuro-fuzzy system can not only tune the weights of the network but also incrementally construct the architecture of the whole network through learning.

Acknowledgement

This work is supported by the National Science Council, Taiwan, R.O.C., under Grant NSC 90-2213-E-008-051.

References
[1] C.-T. Lin and C. S. G. Lee, "Neural-network-based fuzzy logic control and decision system," IEEE Trans. on Computers, vol. 40, no. 12, pp. 1320-1336, 1991.
[2] L.-X. Wang and J. H. Mendel, "Back-propagation fuzzy systems as nonlinear dynamic system identifiers," Proc. IEEE Int. Conf. on Fuzzy Systems, San Diego, pp. 1163-1170, 1992.
[3] J.-S. Roger Jang, "ANFIS: adaptive-network-based fuzzy inference systems," IEEE Trans. on Systems, Man, and Cybernetics, vol. 23, no. 3, pp. 665-685, 1993.
[4] B. Kosko, Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[5] M. C. Su, "Identification of singleton fuzzy models via fuzzy hyper-rectangular composite NN," in Fuzzy Model Identification: Selected Approaches, H. Hellendoorn and D. Driankov, Eds., Springer-Verlag, pp. 193-212, 1997.
[6] M. C. Su, C.-W. Lin, and S.-S. Tsay, "Neural-network-based fuzzy model and its application to transient stability prediction in power systems," IEEE Trans. on Systems, Man, and Cybernetics, vol. 29, no. 1, pp. 149-157, 1999.
[7] A. G. Barto and M. I. Jordan, "Gradient following without backpropagation in layered networks," in Proc. IEEE First Annual Conf. on Neural Networks, pp. 11629-11636, 1987.
[8] G. E. Hinton, "Connectionist learning procedures," Artificial Intelligence, vol. 40, no. 1, pp. 143-150, 1989.
[9] C. T. Lin and C. S. G. Lee, "Reinforcement structure/parameter learning for neural-network-based fuzzy logic control systems," IEEE Trans. on Fuzzy Systems, vol. 2, no. 1, pp. 46-63, Feb. 1994.
[10] H. R. Berenji and P. Khedkar, "Learning and tuning fuzzy logic controllers through reinforcements," IEEE Trans. on Neural Networks, vol. 3, no. 5, pp. 724-740, Sep. 1992.
[11] A. Bonarini, "Reinforcement distribution for fuzzy classifiers: a methodology to extend crisp algorithms," in 1998 IEEE Int. Conf. on Evolutionary Computation Proceedings, pp. 699-704, 1998.
[12] J. Schmidhuber, "A general method for multi-agent learning and incremental self-improvement in unrestricted environments," in X. Yao (Ed.), Evolutionary Computation: Theory and Applications, Scientific Publ. Co., Singapore, 1996.
[13] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike adaptive elements that can solve difficult learning control problems," IEEE Trans. on Systems, Man, and Cybernetics, vol. 13, no. 5, pp. 834-846, 1983.
[14] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9-44, 1988.
[15] C. J. C. H. Watkins, Learning from Delayed Rewards, Ph.D. thesis, King's College, Cambridge, UK, 1989.
[16] C. J. C. H. Watkins and P. Dayan, "Q-Learning," Machine Learning, vol. 8, no. 3, pp. 279-292, 1992.
[17] D. A. Berry and B. Fristedt, Bandit Problems: Sequential Allocation of Experiments, Chapman and Hall, London, UK.
[18] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: a survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237-285, 1996.
[19] J. H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975.
[20] D. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, MA, 1989.
[21] R. E. Smith and D. E. Goldberg, "Reinforcement learning with classifier systems," IEEE International Conference, pp. 184-192, 1990.
[22] M. A. Lee and H. Takagi, "Integrating design stages of fuzzy systems using genetic algorithms," Proc. IEEE-IEE Int. Conf. on Vehicle Navigation and Information Systems, pp. 612-617, 1993.
[23] R. Jang, "Fuzzy controller design without domain experts," Proc. IEEE Int. Conf. on Fuzzy Systems, pp. 289-296, 1992.