Proceedings of the American Control Conference, San Diego, California • June 1999

Reinforcement Tuning of Type II Fuzzy Systems

Cleon Davis, Pei-Yuan Peng

United Technologies Research Center, East Hartford, CT 06108

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332

Abstract

A type II fuzzy system is refined by a reinforcement learning scheme in this paper. By tuning the parameters of the type II fuzzy controller, we demonstrate that reinforcement learning can help to achieve good performance. Results from the pole-balancing problem are given, with comparisons of different fuzzy control schemes. It is shown that the learned type II fuzzy controller can achieve the control goals as well as the other schemes.

I. Introduction

It is often difficult to provide an optimal decision-making framework for controls. The main problem arises from there not being a systematic approach to improving system performance. For instance, a challenge to supervised control is that learning requires training data or a teacher of the subject domain for adaptation to take place. In most real-world applications, training data is often hard to obtain or may not be available at all. Furthermore, the control structures may be prefixed or may only be appropriate for a limited set of problems (Berenji and Khedkar 1992). An approach to solving this problem is based on an unsupervised learning paradigm known as reinforcement learning. Interest in this paradigm stems from the desire to build systems that learn from autonomous interaction with their environments.

In reinforcement learning, it is assumed that the equations describing the system are not known to the controller and that the only information available is the states of the system and a feedback signal evaluating performance via a failure or success signal (Barto et al. 1983). Due to this limited amount of information, the controller or "agent" has to learn an appropriate policy that transfers an unknown system from its current state to a target state, which is expected to be better (Berenji and Khedkar 1992). The agent learns what to do by trial-and-error exploration through the state space, receiving a scalar reward (or punishment) for a given action; in other words, the agent maps situations to actions so as to maximize the scalar reward signal. Not being told what actions to take, the agent must discover which actions yield the most reward by trying them (Lin 1992). If an action is tried that produces a favorable response, a reward is given to that action to increase the probability of the system repeating that action should it find itself in the same or a similar state. On the other hand, if the system reaches a non-favorable state, a penalty or punishment is associated with the action that produced the unfavorable response from that state. Because of this reward system, if an action is followed by a satisfactory state of affairs (as defined by some clearly defined goal), then the tendency to reproduce that action is strengthened, or reinforced. Reinforcements received by the agent can only be used to learn how to predict the outcome of the selected actions in the future (Berenji 1994). Reinforcement learning techniques assume that, during the learning process, no supervisor is present to directly judge the quality of the selected control actions, and therefore the final evaluation of a process is only known after a long sequence of actions.
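As a concrete illustration of this interaction (a minimal sketch, not the method developed in this paper), the code below shows an agent that selects actions by trial and error and strengthens whichever action is rewarded in a given state. The action set, exploration rate, and preference-update rule are illustrative assumptions.

```python
import random

# Minimal sketch of the reinforcement learning interaction described above:
# the agent never sees the system's equations, only the current state and a
# scalar reward (or punishment) for the action it tried. All parameters and
# the (state, action) encoding are illustrative assumptions.
class TrialAndErrorAgent:
    def __init__(self, actions, learning_rate=0.1, exploration_rate=0.1):
        self.actions = actions
        self.alpha = learning_rate
        self.epsilon = exploration_rate
        self.preference = {}  # (state, action) -> learned tendency to repeat

    def select_action(self, state):
        # Trial-and-error exploration: occasionally try a random action,
        # otherwise pick the action with the strongest learned preference.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.preference.get((state, a), 0.0))

    def reinforce(self, state, action, reward):
        # A reward strengthens the tendency to repeat the action in the same
        # (or a similar) state; a punishment (negative reward) weakens it.
        key = (state, action)
        self.preference[key] = self.preference.get(key, 0.0) + self.alpha * reward
```

In a pole-balancing setting, for example, the reward could simply be a penalty delivered to the recent actions when the failure signal occurs.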


Reinforcement learning is often used for training intelligent controllers where only minimal a priori knowledge is available. It is also a tool of machine intelligence that combines the spirit of dynamic programming and supervised learning to solve difficult problems that neither discipline can address individually. This combination can yield a powerful machine learning tool. In the past ten years, much work has been done on the development of reinforcement learning as a tool of artificial intelligence, computer learning, and intelligent controls (Kaelbling et al. 1996).

Several researchers have experimented with reinforcement learning in the control of various applications. Barto, Sutton and Anderson (1983) used reinforcement learning to balance an inverted pendulum. They split the state space into discrete regions and utilized two separate systems in the implementation: one system selects the actions to take (the ASE) and the other tries to predict the occurrence of failure (the ACE). This method is sometimes called actor-critic or, using the term credited to Sutton (1988), Temporal Difference (TD). It is termed temporal difference because it tries to determine the success of an action based on the prediction of failure given the current state and the action taken to reach that state. Gullapalli, Franklin and Benbrahim (1994) used reinforcement learning to balance a ball on a beam using a variation of Sutton's TD algorithm called stochastic real-valued reinforcement learning. This algorithm is basically a connectionist (neural network) implementation of TD that utilizes a clever scheme to generate actions stochastically. They also used this algorithm to train a robot to insert a peg in a hole. Both of these implementations were performed on real apparatus. Franklin (1992) developed a reinforcement learning controller for a system that consisted of a ball rolling on a track. The controller had to evolve to be able to predict the behavior of the system, and then it had to be able to control it. This reinforcement learning algorithm is also based on TD, but it is different in that it uses qualitative reasoning quantity states instead of discrete states. A qualitative reasoning quantity space is based on critical points that define the effects of quantities on the behavior of the system. Constraints are derived that describe relevant structural relationships on the basis of the quantity space, and these constraints compose a qualitative model of the system. Using a reasoning mechanism, all possible behaviors of the system are predicted given the constraints. This produces a behavioral representation of a system's output (Franklin 1992). Tesauro (1992) applied reinforcement learning, using a connectionist implementation of TD, to the game of backgammon; the backgammon program learned to play at an expert level, starting from scratch, through the use of self-play. Another widely used reinforcement learning method is referred to as Q-learning (Watkins and Dayan 1992). For instance, Q-learning systems maintain estimates of utilities for all state-action pairs and make use of these estimates to select actions.
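As a rough sketch of the Q-learning idea just mentioned (not the tuning scheme used in this paper), the code below maintains a table of state-action utilities and adjusts each entry with a temporal-difference error after every transition; the step size, discount factor, and exploration rate are assumed values chosen only for illustration.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch (in the spirit of Watkins and Dayan 1992):
# a utility estimate is kept for every state-action pair and moved toward the
# temporal-difference target after each transition. The step size, discount
# factor, and exploration rate below are assumed values, not from the paper.
Q = defaultdict(float)  # (state, action) -> estimated utility

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(next_state, a)] for a in actions)
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error

def select_action(state, actions, epsilon=0.1):
    # The stored utility estimates are used to select actions, with occasional
    # random exploration so that untried actions still get sampled.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```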

II. Fuzzy Logic Model

There are three types of fuzzy control systems, as defined in (Sugeno 1999). The first one, which we shall call type I, was introduced by Mamdani and Assilian (1975). It is the conventional fuzzy logic controller that has fuzzy antecedent labels and fuzzy consequence labels. The second fuzzy controller is the type II system (Sugeno 1999). It uses fuzzy variables for the antecedent labels and singletons for the consequence labels. The third type of fuzzy controller is the Takagi-Sugeno-Kang (TSK) method, or type III system. The TSK controller utilizes fuzzy variables for the antecedent labels and linear functions of the input variables as the consequences (Berenji et al. 1996).

The type II system is a combination of both the type I and type III systems. In (Sugeno 1999), the stability of the type II fuzzy controller is proven for discrete and continuous systems. In the type II approach, fuzzy rules with singleton consequences have the following form:

IF x_1 is A_1^i and x_2 is A_2^i and ... and x_n is A_n^i, THEN y = c^i,   i = 1, 2, ...

where the A_j^i are fuzzy antecedent labels and c^i is the singleton consequence of the i-th rule.
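To make this rule form concrete, the following sketch (a minimal illustration, not the controller reported in the paper) evaluates a small type II rule base: the antecedent labels are assumed to be triangular membership functions, the consequences are singletons, and the output is the firing-strength-weighted average of those singletons. The membership parameters and example rules are purely illustrative assumptions.

```python
# Illustrative type II fuzzy inference: fuzzy antecedent labels, singleton
# consequences, and an output formed as the firing-strength-weighted average
# of the singletons. Membership functions and rules are made-up examples.
def triangular(x, left, center, right):
    """Triangular membership function (an assumed antecedent label shape)."""
    if x <= left or x >= right:
        return 0.0
    if x <= center:
        return (x - left) / (center - left)
    return (right - x) / (right - center)

def type2_inference(inputs, rules):
    """Each rule is (list of membership functions, singleton consequence c_i)."""
    num, den = 0.0, 0.0
    for antecedents, c_i in rules:
        # Firing strength: conjunction (min) of the antecedent memberships.
        w = min(mf(x) for mf, x in zip(antecedents, inputs))
        num += w * c_i
        den += w
    return num / den if den > 0.0 else 0.0

# Hypothetical two-input rule base with singleton consequences.
rules = [
    ([lambda x: triangular(x, -1.0, -0.5, 0.0),
      lambda x: triangular(x, -1.0, -0.5, 0.0)], -1.0),
    ([lambda x: triangular(x, -0.5, 0.0, 0.5),
      lambda x: triangular(x, -0.5, 0.0, 0.5)], 0.0),
    ([lambda x: triangular(x, 0.0, 0.5, 1.0),
      lambda x: triangular(x, 0.0, 0.5, 1.0)], 1.0),
]
print(type2_inference([0.2, 0.1], rules))
```

In a reinforcement tuning setting, the parameters adjusted by learning would typically be the singleton consequences c^i and the membership function parameters.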
