SHERPA: a safe exploration algorithm for Reinforcement Learning controllers

T. Mannucci∗, E. van Kampen†, C. C. de Visser‡ and Q. P. Chu§
TU Delft, Delft, Zuid Holland, the Netherlands

∗ PhD student, Aerospace Faculty, Control & Operations Department
† Assistant professor, Aerospace Faculty, Control & Operations Department
‡ Assistant professor, Aerospace Faculty, Control & Operations Department
§ Associate professor, Aerospace Faculty, Control & Operations Department
The problem of an agent exploring an unknown environment under limited prediction capabilities is considered in the scope of using a reinforcement learning controller. We show how this problem can be handled by the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), which relies on interval estimation of the dynamics of the agent during the exploration phase, along with a limited capability of the agent to perceive incoming fatal occurrences. An application to a simple quadrotor model is included to show the algorithm's performance.
Nomenclature

RL    Reinforcement Learning
MDP   Markov Decision Process
FSS   Fatal State Space
RSS   Restricted State Space
RFSS  Restricted Fatal State Space
LTF   Lead-to-fatal (state)
IA    Interval Analysis
I. Introduction
In the field of aerospace engineering, classic control schemes such as PID still enjoy widespread use in the face of increasing confidence in newer, more modern control tools. This can be partially explained by the amount of effort needed to provide affordable yet efficient dynamic models of aerospace platforms for more advanced control schemes. In the wake of this consideration, special attention in the aerospace control community has been dedicated to control schemes which require less precise knowledge of a model to achieve satisfactory performance. Robust control constitutes an example of controller design developed to tolerate modeling error while guaranteeing a lower bound on performance. Adaptive control represents a promising field for developing new controllers with increased performance and reduced model dependency.1, 2 A virtually model-free option amidst adaptive control is Reinforcement Learning (RL). Reinforcement learning is a knowledge-based control scheme popularized by Sutton and Barto3 that mimics the adaptation of animals to their own environment. The notions of environment and animal can be translated into those of plant and controller (or agent). In reinforcement learning, a-priori knowledge of the plant is not always needed and is commonly replaced by an exploratory phase inspired by animal adaptation to the environment, during which, after every action, the agent receives an appropriate “reward” generated from the plant. The reward received is stored in the memory of the agent. This information is used later on in transitioning from an exploring agent that tries new actions and receives new reward, to an exploiting agent that acts according to the reward received.
This scheme proves very effective in finding optimal and complex control strategies even when refined knowledge of the plant is not available. It is clear that inefficient exploration cannot lead to profitable exploitation. For this reason, much effort has been dedicated in the past to encouraging exploration. As a parallel problem, some attention has been given to enforcing a lower bound on the reward acquired during exploration, to guarantee minimum performance.4, 5 Neither of these approaches, however, considers the case of a hostile environment, where exploration can be potentially fatal for the agent. Such an environment poses a double challenge for the agent. In the aerospace field, multiple such conditions might be experienced by the agent during exploration if no preventive measures are taken. Consider as an example a fixed-wing aircraft whose elevator is controlled by a RL agent. Stochastic, exploratory use of the elevator may rapidly lead the aircraft into a stall or a dive and ultimately into a crash. Therefore, safety represents a primary obstacle for RL controllers in aerospace. Hans et al.6 (2008) propose an algorithm for plant control that avoids fatal transitions; however, the algorithm relies on an a-priori known safety function (acting as a go/no-go decision maker over possible actions) and on a fixed backup policy valid over the whole workspace. Garcia and Fernandez7 (2011) take a similar approach: by introducing a variable amount of perturbation into a given safe but inefficient baseline controller, new trajectories for task completion can be discovered, at the cost of accepting a certain degree of risk. Both approaches require a guaranteed safe controller or backup policy in order to prevent catastrophic exploration when facing critical decisions. Moldovan and Abbeel8 (2012) define safety in terms of ergodicity of the exploration, and introduce an algorithm that still relies on beliefs about the system but not on a predefined baseline policy or safe controller. Safety of the exploration is again guaranteed only within a certain degree of reliability. Gillula et al.9 (2010) and Gillula and Tomlin10 (2011), while not dedicated to RL exploration proper, show very promising applications of reachability analysis to the problem of planning safe control. While both publications have been inspirational for the development of SHERPA, they assume that the set of fatal states is defined from the start and limit the algorithm's responsibility to avoiding that set. The goal of this paper is to provide insight into the problem of safe exploration and its definition, as well as to provide an algorithm that solves the problem online. Although safe exploration is not in itself limited to applications in RL and could find uses elsewhere in control design, especially in planning algorithms, performing a satisfactory exploration is a problem typical of RL, and it is in RL applications that the problem is most compelling, due to the limited knowledge of the system usually available. The contribution of this paper with respect to the previous methods is the definition of a new combined strategy, stemming from a generalization of already known techniques (mainly collision avoidance and reachability analysis) and dedicated to the exploratory problem, that entirely avoids off-line training and reduces the a-priori knowledge of the agent dynamics to the essential, with the goal of providing safe exploration.
A further contribution of the paper is the proposal of the aforementioned strategy in the form of the Safety Handling Exploration with Risk Perception Algorithm (SHERPA), of which an application to a quadrotor exploration task is presented. The rest of the paper is structured as follows. In section II, the fundamentals of RL in both continuous and discrete spaces of states and actions will be covered, introducing the problem of safe exploration in a formal way. In this section various assumptions will also be presented. In section III, Interval Analysis will be introduced as the main tool used by SHERPA. In section IV the algorithm itself will be thoroughly discussed. In section V a simulated application of SHERPA to a quadrotor task will be shown. Finally, in section VI the conclusions will be drawn.
II. Fundamentals

A. Reinforcement learning in a fatal environment
The fundamentals of RL and of the exploratory problem will now be introduced. The evolution of a system controlled by a decision-maker can often be conveniently modeled as a Markov Decision Process (MDP). In a MDP, the system can assume a number of discrete states, defining the state space. In each state a set of actions is available for the agent to take, and different sets of actions might be available in different states. The agent selects one action, and a transition occurs from the original state where the action was taken to another state, or onto the original state itself, depending on the action. Transitions do not depend on the previous sequence of states and actions; in other words, the evolution of the system from the current state
does not depend on previous history. This is called the Markov property. Transitions are generally allowed to be stochastic, meaning that the same action might trigger a transition to different states depending on some probability distribution; otherwise the MDP is said to be deterministic. A game of chess is a perfect example of a deterministic MDP, with all legal combinations of pieces defining the state space, with different deterministic moves available to the player, and with fixed outcomes of moves depending only on the current disposition. Although the MDP framework was defined for finite states and actions, and for discrete-time choices by the agent, a continuous-time and continuous-state version of a MDP can also be defined. After a transition, the MDP yields a real-valued number, called the reward, which quantifies the immediate benefit of the transition for the agent, e.g. the value of the captured piece during a move of chess. The agent's goal in the long term is then to maximize the sum of the expected future reward (in case of an infinite task, the reward is artificially reduced with time to achieve convergence). The sum of the expected reward in a state is called the value of the state. The best action for the agent in each state is the one that, considering transition probabilities, leads to the states with the highest average value. Various methods have been proposed to solve the problem or to approximate the solution; one such method is reinforcement learning (RL). In RL, the agent explores the environment with the goal of discovering the rewards associated with different actions, and modifies its belief of the value distribution in the state space in the form of a value function. The exploration phase usually gives way to an exploitation phase during which the agent's goal turns into maximizing the sum of expected reward. It can be proven that, under certain conditions, this strategy leads to the discovery of the best actions in the form of an optimal policy, or of its approximation. The presence of the exploratory phase makes RL ideal for those MDPs where transition rules are uncertain or unknown. In this paper a further consideration will be added to the traditional RL approach exposed above, by considering harmful environments. Depending on the action taken in such an environment, in addition to receiving reward the agent can be harmed, and exploration can consequently be halted. Such an event will be called a fatal occurrence, and the action that led to it a fatal action. One could argue that a possible solution would be to remove fatal actions from all action sets. In section II.C it will be shown why this is not always possible for the case under examination. This paper will take a different approach. Instead of removing the fatal actions from the set of allowed actions, it suggests searching the current set of actions for those whose safety can be proven beforehand. This will be done as follows. First, a model of how to represent fatal occurrences in physical agents will be drawn. Then, a limited prediction capability, denominated risk perception, will be assumed to be available to the agent. Third, the evolution of the system will be overestimated with the aid of a conservative model of the uncertain dynamics.
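As a concrete, hypothetical illustration of these notions (the value of a state, and the best action given transition probabilities), the short Python sketch below performs value iteration on a toy discrete MDP; the transition matrices, rewards and discount factor are invented for the example and do not come from the paper.

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[a][s, s'] = transition probability,
# R[a][s] = immediate reward for taking action a in state s.
# All numbers are illustrative only.
P = [np.array([[0.9, 0.1, 0.0],
               [0.0, 0.9, 0.1],
               [0.0, 0.0, 1.0]]),
     np.array([[0.2, 0.8, 0.0],
               [0.1, 0.1, 0.8],
               [0.0, 0.0, 1.0]])]
R = [np.array([0.0, 0.0, 0.0]),
     np.array([-0.1, -0.1, 1.0])]
gamma = 0.95          # discount: future reward is "artificially reduced" with time

V = np.zeros(3)
for _ in range(200):  # value iteration until (approximate) convergence
    # Q[a, s]: expected return of taking a in s and acting optimally afterwards
    Q = np.array([R[a] + gamma * P[a] @ V for a in range(2)])
    V = Q.max(axis=0)

policy = Q.argmax(axis=0)   # best action per state, weighted by transition probabilities
print("state values:", V)
print("greedy policy:", policy)
```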
B. Risk Perception
Let $s \in S$ indicate the state of the system and let $S \subseteq \mathbb{R}^n$ be the state space. Let $A(s)$ indicate the set of actions for each state $s$. It will be assumed for ease of exposition that the set $A$ is the same for all $s \in S$. This assumption is not binding and could be dropped. Let then $A_{fat}(s) \subseteq A$ be the (unknown) fatal actions in state $s$. Finally, let the mapping $D$:

$$D : S \times A \rightarrow S \tag{1}$$
describe the dynamics of a deterministic agent. Assump. 1 states the belief that fatal occurrences are inherently related to fatal states and not to actions; e.g. if a fixed-wing aircraft hits the ground, resulting in a fatal occurrence, that could be due to the position, attitude and speed of the aircraft with respect to the ground, but not to the history of transitions leading to this state. Any action leading to the same configuration would result in the same fatal occurrence. This could be considered an extension of the Markov property to the fatal occurrences of a MDP.

Assumption 1 (Fatal states) If for states $\bar{s}, s_{fat} \in S$, and for action $\bar{a} \in A$, $D(\bar{s}, \bar{a}) = s_{fat}$, $\bar{a} \in A_{fat}(\bar{s})$ $\Rightarrow$ $\forall s \in S$ and $a \in A$ s.t. $D(s, a) = s_{fat}$, $a \in A_{fat}(s)$.

Assump. 1 allows a transition to a state-only formulation of fatal occurrences. This consideration makes it possible to define the fatal state space (FSS) as:

Definition 1 (Fatal state space) Define the set FSS of all fatal states as:
$$FSS = \{ s_{fat} \mid D(s, a) = s_{fat},\ a \in A_{fat}(s) \}$$

In analogy to the definition of the FSS we can define the safe state space (SSS) as:

Definition 2 (Safe state space) Define the set SSS of all safe states as:

$$SSS = S \setminus FSS$$

While it is not a strictly fundamental assumption for the SHERPA algorithm, Assump. 2 provides a more organic view of harmful environments. If fatal occurrences are the consequence of one or more causes of different nature, it is reasonable to assume that not all the components of the state vector will be involved in each cause. Consider for example that the FSS of a fixed-wing platform could be represented by all states for which the aircraft hits the ground, plus all states for which the aerodynamic forces and moments can fatally affect either the pilot or the structure. This example has an interesting property: the two causes of fatality depend on different components of the state vector. The “crashing” can be related to the position of the agent and its speed relative to the ground, while the “structural damage” can be related to airspeed, angle of attack, etc., but not to the position. In this paper it will be assumed that, while the environment might not be known with accuracy, at least the possible modes of fatal occurrence can be derived from the nature of the agent itself. Assump. 2 involves a more complex approach to the problem of safety, but it must be considered that such a formulation is also more “spontaneous” for a designer, since it expresses the nature of the risk involved in the exploration. The formulation is also very high-level, so that the designer is not required to consider dangerous situations but only those that are known to be certainly fatal. Writing the state vector in a standard basis notation:

$$s = \sum_{i=1}^{n} s_i \cdot e_i \tag{2}$$
each component of the state vector can be associated with an element of the standard basis. Given the state space $S \subseteq \mathbb{R}^n$, define then as a restricted state space (RSS) any subspace spanned by the set of standard basis vectors for which a fatal occurrence can be defined. So if, for example, a fatal occurrence is defined for a certain agent in terms of the vertical coordinate $z$ (associated with $e_3$) and in terms of the speed components $v_x$, $v_y$ and $v_z$ (associated respectively with $e_4$, $e_5$ and $e_6$), the subspace with the spanning set $\{e_3, e_4, e_5, e_6\}$ is a RSS. In analogy to the definition of the FSS, the set of all those elements of a RSS for which there is a fatal occurrence is defined as the restricted fatal state space (RFSS) of the RSS.

Assumption 2 (Restricted state spaces) The agent has a-priori knowledge of all RSSs, but not of the respective RFSSs.

Note that, from the definition of RSS, in the event that no assumption on the nature of fatal occurrences can be made, the whole state space $S$ constitutes a RSS, whose RFSS is the whole set FSS. A fundamental assumption for the existence of the approach will now be introduced in Assump. 3.

Assumption 3 (Risk perception) The agent has knowledge of whether an element of the RFSS is within a given Euclidean distance from its current position inside the relative RSS.

Assump. 3 constitutes a step forward in treating the safety problem: the aforementioned property will be denominated informally as risk perception. While assuming such a property can be considered restrictive, the following considerations apply:

• laser range-finders, sonars and other ranged sensors involved in collision avoidance perfectly fit this framework;

• some fatal occurrences could be unknowingly related to the states of the RFSS, but at the same time might depend on an easily measurable quantity. For example, the load factor inside a cabin is an easily measurable quantity which, in the case of an unspecified aircraft, is related to angle of attack and airspeed by an unknown relation. The problem then becomes estimating, even under rough assumptions, the correspondence between the threshold in load factor and the thresholds in airspeed and angle of attack;
Figure 1: An agent exploring a state space (bold dotted square) with two different RFSSs. The first cause of fatality is independent of the value of component $s_1$, and a Euclidean distance $d_1$ is provided for the risk perception. A second cause of fatality depends on both components of the state, so that a Euclidean distance $d_2$ in the whole state space is given. In the above condition, the agent would receive a warning due to RFSS 1 being in reach.
• in the absence of a sensor, artificial boundaries can be imposed so as to restrict the agent to reportedly safe portions of the RSS, effectively faking the presence of the sensor. So, for example, if an agent is subject to overheating and the exact critical temperature cannot be perceived, but it can be estimated from below, a warning signal can be sent when the temperature nears the estimated one.

Figure 1 shows an example of risk perception. Assump. 3 can be rephrased as follows. There exists an n-dimensional “cylinder” $H$ of non-zero volume in the state space $S$, centered on the current state of the agent, such that, if the intersection between $H$ and the FSS is not empty, this is known to the agent in the form of a binary information when state $s$ is visited. Note that no information on the position and volume of the intersection is known, i.e. the agent is not aware of which RFSS triggered the warning. This can be formalized by introducing the mapping $W : S \rightarrow \{0, 1\}$, where $W(s)$ is equal to 0 if $H \cap FSS = \emptyset$ and 1 otherwise. The combination of Assump. 2 and Assump. 3 has an additional useful result. Suppose the whole $S$ were considered instead of the RSS framework. The result would be that the agent, ignorant of the relation between states and different risks, would need to consider fatal states in all of $S$. Instead, by including the relation between states and different risks, we can infer the presence of safe states in locations of $S$ where the agent has never directly “looked”. Consider the following example of a car moving on a track of varying width, delimited at the sides by two cliffs (figure 2). Assume also that, in an unknown position of the track, there is a speed bump. There are two possible fatal occurrences: falling from the cliffs (case A), and running over the speed bump fast enough to damage the car (case B). Let $d$ be the distance between the position of the car and its projection on the mid-line of the track, $x$ the curvilinear coordinate of the projection, $\theta$ the steering of the car with respect to the mid-line and $v$ the speed of the car. The two risks do not share the same RSS. The RSS related to falling involves states $x$ and $d$, while the RSS related to speeding over the bump involves $x$ and $v$. It can also be noted that one state is unrelated to risk (the steering $\theta$). If the agent sees that state $s_1 = (\bar{x}, d_1, \theta_1, v_1)$ is safe, it can be deduced that it is safe to drive at speed $v_1$ while in $\bar{x}$. If at a later time the agent observes that state $s_2 = (\bar{x}, d_2, \theta_2, v_2)$ is safe, it can combine this new information with the previous one to automatically infer that the set:

$$\{ (\bar{x}, d_2, \theta, v_1) ;\ (\bar{x}, d_1, \theta, v_2) \}$$

is also safe.
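To make the inference above concrete, the following sketch (hypothetical, not part of the paper) stores the observed safe projections of each RSS separately and declares a combined state safe whenever all of its RSS projections have been observed safe; components belonging to no RSS, such as the steering $\theta$, are ignored.

```python
# Hypothetical sketch of RSS-based inference of safe states.
# State: (x, d, theta, v). RSS 1 involves (x, d); RSS 2 involves (x, v);
# theta belongs to no RSS and is irrelevant for safety.

safe_xd = set()   # observed safe projections onto RSS 1
safe_xv = set()   # observed safe projections onto RSS 2

def observe_safe(state):
    """Record the RSS projections of a state observed to be safe."""
    x, d, theta, v = state
    safe_xd.add((x, d))
    safe_xv.add((x, v))

def inferred_safe(state):
    """A state is inferred safe if every RSS projection was already seen safe."""
    x, d, theta, v = state
    return (x, d) in safe_xd and (x, v) in safe_xv

observe_safe((1.0, 0.2, 10.0, 3.0))   # s1 = (x_bar, d1, theta1, v1) observed safe
observe_safe((1.0, -0.4, -5.0, 7.0))  # s2 = (x_bar, d2, theta2, v2) observed safe

# Never visited, but inferred safe by combining the two observations:
print(inferred_safe((1.0, -0.4, 42.0, 3.0)))   # (x_bar, d2, any theta, v1) -> True
print(inferred_safe((1.0, 0.2, 0.0, 7.0)))     # (x_bar, d1, any theta, v2) -> True
```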
Figure 2: The car must avoid falling from the cliffs at the sides of the track and speeding over the speed bump (in grey).
C. Lead-to-fatal states
In this subsection, a new category of states, called lead-to-fatal (LTF), will be formally introduced. Consider for ease of exposition the continuous-time version of a MDP. Indicate by $\sigma_a(s, \tau)$ the state of the system at time $\tau > t_0$ when applying action history $a(t)$ from state $s$ at time $t_0$ up until $\tau$. Define as $\mathcal{A}$ the set of all possible action histories $a(t)$. The set $L$ of the LTF states is then defined as:

$$L = \{ s \mid \forall a(t) \in \mathcal{A},\ \exists \tau : \sigma_a(s, \tau) \in FSS \} \tag{3}$$
The lead-to-fatal states are a direct extrapolation, to the generalized state space, of the inevitable collision states introduced by Freichard and Asama11 (2003). In simpler terms, LTF states are those states that, while not belonging to the FSS, are such that, when visited, they will lead the agent to end up in the FSS with probability one. It will be shown in this paper that lead-to-fatal states are a main concern for the exploring task.
D. Bounding model
In RL, a fundamental distinction exists between model-less and model-based RL. In model-based RL, the controller has at its disposal a model of the environment and one of the relative reinforcement, which are used to train the controller off-line and improve the policy in between exploration sessions. The model might be refined during on-line exploration to better represent the real process being modeled. Model-less RL instead does not rely on a model and does not perform off-line training; the policy refinement is based on on-line training only. In SHERPA, an intermediate position between these two instances is adopted. A bounding model of the agent will be used, with the following property:

Definition 3 Given an agent dynamic model $D(s, a) = \bar{s}$, a function $\Delta(s, a) = \bar{S}$ such that:

$$\text{if } D(s, a) = \bar{s},\ \Delta(s, a) = \bar{S} \ \Rightarrow\ \bar{s} \in \bar{S} \tag{4}$$

represents a bounding model for the original model. The bounding model yields a set $\bar{S}$ that includes the state $\bar{s}$ yielded by the true model. Such a model is not used to perform off-line policy improvements (as in a model-based RL scheme) but only to predict the boundaries of the dynamical evolution of the agent. $\Delta$ need not represent the exact structure of the true dynamic model of the agent as long as Eq. 4 is satisfied. As described in section IV, the bounding model will be put to use to predict the evolution of the system in such a way that not only fatal states but also LTF states are prevented. Since both these computations can be conservative, bounding models with high uncertainty with respect to the true model can also be put to use.
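As a minimal sketch of Def. 3 (with invented scalar dynamics and numbers), the function Delta below over-approximates a one-step model D whose damping coefficient is only known to lie in an interval; the assertion checks the containment property of Eq. 4.

```python
# Hypothetical illustration of a bounding model (Def. 3) for scalar dynamics
#   D(s, a) = s + dt * (-c * s + a),  with the true damping c unknown
# but known to lie in the interval [c_lo, c_hi].

dt = 0.1
c_true = 0.7                  # unknown to the controller
c_lo, c_hi = 0.5, 1.0         # known interval [c]

def D(s, a):
    """True (unknown) one-step dynamics."""
    return s + dt * (-c_true * s + a)

def Delta(s, a):
    """Bounding model: returns an interval guaranteed to contain D(s, a)."""
    candidates = [s + dt * (-c * s + a) for c in (c_lo, c_hi)]
    return min(candidates), max(candidates)   # affine in c, so endpoints suffice

s, a = 2.0, 0.3
lo, hi = Delta(s, a)
assert lo <= D(s, a) <= hi    # the property required by Eq. 4
print(D(s, a), (lo, hi))
```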
E. Assumptions

In conclusion, for the application of the SHERPA algorithm the following hypotheses are made:

I. States and actions are defined in a continuous, deterministic, time-independent framework represented as a MDP;

II. The environment is such that assumptions 1, 2, and 3 over fatal occurrences and risk perception hold;

III. A bounding model of the agent is available.
III. Tools

In section II the necessity of a bounding model was discussed. In this section Interval Analysis (IA) will be introduced as a means of obtaining such a model; in addition, Reachability Analysis will be mentioned as a way of treating the dynamical evolution of the system under uncertainties.12 IA is an arithmetic introduced during the 1950s and 1960s with the intention of bounding uncertainties resulting from numerical computation (e.g. rounding errors). A 1-dimensional interval is a set of the form:

$$[a, b] \equiv \{ x : a \le x \le b \} \tag{5}$$
and is equivalent, from a geometrical perspective, to a segment on the real line. The notion of interval can be extended to higher dimensions as a tuple of 1-dimensional intervals; every vector whose components belong to the respective 1-dimensional intervals belongs to the n-dimensional one. The geometrical equivalent of an n-dimensional interval is then a hyper-rectangle, commonly called a box. IA makes it possible to compute uncertainty propagation in the dynamics of a system by treating uncertain parameters as intervals. If the dynamics of a system are described by:

$$\dot{s} = f(s, \bar{p}) \tag{6}$$

where $s$ is the state and $\bar{p}$ is a vector of parameters belonging to the interval box $[p]$, then:

$$\dot{s} \in [\dot{s}] \equiv f(s, [p]) \tag{7}$$

This can be extended to include the state itself among the uncertain parameters:

$$\dot{s} \in [\dot{s}] \equiv f([s], [p]) \tag{8}$$
A feature of using intervals for uncertainty propagation is that, since the result of the computation is an interval box of dimension n, it can be represented by a tuple of only 2n elements; this comes, however, at the price of overestimating the size of $[\dot{s}]$. In addition to this limitation of the representation, an additional overestimation is due to treating all instances of an interval that appears multiple times in the computation as if they were independent of each other, an effect known as the dependency problem. Therefore, applying uncertainty propagation using interval analysis has an intrinsic limitation in the number of iterations allowed before the overestimation becomes unacceptable. A reachable set is the set of all possible states that, as the name suggests, the system can assume in a finite time by choosing every possible action from a limited set. It is evident that, even in a deterministic formulation, one starting state will usually result in a whole set of ending states, increasing in size with time. With respect to other techniques for reachable set computation, the IA approach used in SHERPA has the advantage of allowing the inclusion of uncertainty in the autonomous part of the system as well as in the control input. For an overview of some reachable set computation techniques and issues, see Maler13 (2008).
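A minimal sketch of both points, under invented scalar dynamics: naive interval arithmetic coded by hand, the dependency problem (x − x evaluating to a non-degenerate interval), and the growth of the propagated box over a few Euler steps.

```python
# Minimal interval arithmetic (illustrative only).
def i_add(a, b): return (a[0] + b[0], a[1] + b[1])
def i_sub(a, b): return (a[0] - b[1], a[1] - b[0])
def i_mul(a, b):
    p = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(p), max(p))

x = (1.0, 2.0)

# Dependency problem: x - x should be exactly 0, but treating the two
# occurrences of x as independent intervals yields [-1, 1].
print(i_sub(x, x))            # (-1.0, 1.0) instead of (0.0, 0.0)

# One Euler step of uncertain dynamics  s_dot = -c*s + u  with c in [c]:
def step(s_box, u, c_box, dt=0.1):
    s_dot = i_add(i_mul(i_sub((0.0, 0.0), c_box), s_box), (u, u))
    return i_add(s_box, i_mul((dt, dt), s_dot))

box = (1.0, 1.0)
for _ in range(5):            # the box width grows with every iteration
    box = step(box, u=0.2, c_box=(0.5, 1.0))
    print(box)
```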
IV. Algorithm

A. Algorithm motivations
The Safety Handling Exploration with Risk Perception Algorithm (SHERPA) combines the assumptions over the partition of the FSS of section II with the uncertainty propagation computation under IA of section III. The following statements summarize the driving concepts that led to the development of SHERPA:
• Constraints limit the overall control on the agent dynamics from the controller. In a discrete environment the state transition can be very simple, as in the gridworld examples of Sutton and Barto.3 In a continuous-state framework, control over the evolution of the system will be subject to different limitations. Consider for example the classic mass-spring-damper system (MSD). This system has only 2 states, the displacement of the mass $x$ and the speed of the mass $v = \dot{x}$. In the presence of a force $F$ acting on the mass $m$:

$$\begin{cases} \dot{x} = v \\ \dot{v} = (-c \cdot v - k \cdot x + F)/m \end{cases} \tag{9}$$

The MSD is a controllable, linear system. However, it is immediate to see that, if only a limited force $|F| < F_{max}$ can be exerted on the mass, we have a limit on the rate of state $v$. Boundedness of the control therefore imposes a constraint on the possible evolution of the system.

• Exploration must avoid LTF states. LTF states were introduced in section II. Recall that LTF states are those for which a fatal state will eventually be encountered, regardless of the action chosen by the agent. It is immediate to see that the constraints on the dynamics mentioned at the previous point are ultimately the cause of the existence of LTF states. Consider again the MSD and assume that a fatal occurrence happens if the displacement of the spring exceeds a certain critical value. For each value of the displacement $x$ of the spring, there is a threshold value of the speed $v$ for which collision is inevitable regardless of future control action (a numerical sketch is given after this list). From an operational point of view, LTF states are as dangerous as fatal states; in fact even more so, since the agent is not capable of perceiving LTF states in the same way as it does fatal states. Eq. 3 does not allow all LTF states to be computed beforehand, because the FSS as a whole is unknown; however, it can be of use to prove that a certain state $s \notin L$.

• Interpreting fatal occurrences as terminal states, exploration can be defined “safe” if it can be guaranteed for an indefinite amount of iterations. If a check for $s \in L$ were to be performed by means of Eq. 3, formally it would have to be done for every $\tau > 0$. This could be interpreted as an analogue, in a continuous framework, of the ergodicity definition of safety expressed by Moldovan and Abbeel.8
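As announced in the list above, a simplified numerical sketch of lead-to-fatal states: the spring and damper of Eq. 9 are dropped (k = c = 0) so that the threshold speed is exact and given by the stopping distance under full braking; mass, force bound and critical displacement are invented for the example.

```python
import math

# Simplified illustration of LTF states: neglect the spring and damper of
# Eq. 9 (k = c = 0), so the mass is a double integrator with |F| <= F_max,
# and a fatal occurrence is x > x_crit. Under full braking the stopping
# distance from speed v is m*v^2 / (2*F_max); if even that overshoots
# x_crit, no feasible control avoids the fatal set: the state is LTF.
m, F_max, x_crit = 1.0, 1.0, 1.0     # hypothetical values

def is_ltf(x, v):
    if x > x_crit:
        return True                   # already fatal
    if v <= 0:
        return False                  # moving away (or at rest): F = 0 keeps it safe
    return x + m * v**2 / (2.0 * F_max) > x_crit

def threshold_speed(x):
    """Largest safe approach speed at displacement x."""
    return math.sqrt(2.0 * F_max * max(x_crit - x, 0.0) / m)

for x in (0.0, 0.5, 0.9):
    print(x, threshold_speed(x), is_ltf(x, threshold_speed(x) + 0.1))
```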
B. Definitions and assumptions
In this subsection the mathematical frame of the algorithm will be presented in the wake of the previous motivations. The notation will now be slightly altered from the previous part of the paper; in the preceding sections the state vector was indicated as “s”, as is common when referring to MDPs. From this point onward, as usual in the control literature, the state vector will be referred to with the letter “x” ($x(t)$ in the continuous case). For analogous reasons, action “a” will be replaced by “u” ($u(t)$ in the continuous case) and will from now on be referred to as control. All controls considered in the remaining part of this paper will be considered feasible if $\forall t$ they satisfy the condition:

$$\underline{u} \le u(t) \le \bar{u} \tag{10}$$
where $\underline{u}$ and $\bar{u}$ represent the vector boundaries on the control action, e.g. saturation. In addition, if Eq. 10 is strictly satisfied, the control is also said to be strictly feasible. Define as a first step a safe control as:

Definition 4 Control action $u(t)$ is a safe control for state $x_1 = x(t_1)$ and $t \in [t_1\ t_2]$ if: $\sigma_u(x_1, t) \in SSS,\ \forall t \in [t_1\ t_2]$.

Define also an ideal backup from $x_1 = x(t_1)$ to $x_2 = x(t_2)$ as:

Definition 5 A control action $u_b(t),\ t \in [t_1\ t_2]$ is an ideal backup from $x_1 = x(t_1)$ to $x_2 = x(t_2)$ if:

$$\sigma_{u_b}(x_1, t) \in SSS,\ \forall t \in [t_1\ t_2] \tag{11}$$

and:

$$\exists \tau < t_1 : x_2 = x(\tau) \tag{12}$$
In other terms, an ideal backup is a safe control that brings the agent back to a previously explored state $x(\tau)$. It is immediate to see that, if a control action as in Def. 5 exists in a state $x$, then a safe control for $[t\ \infty)$ exists in $x$: a dynamical loop has been found, so that, if necessary, safety can be guaranteed for an indefinite period of time by repeating the loop. Unfortunately, the condition defined by Eq. 12 cannot be checked when taking into account the uncertainties of the bounding model, since it is not possible to exactly predict the next state. The following assumption will then be made:

Assumption 4 If $u(t)$ is a safe control for $\bar{x} = x(\bar{t})$ and $t \in [t_1\ t_2]$, then $\forall \epsilon > 0$, $\forall x$ s.t. $\|x - \bar{x}\| < \epsilon$, $\exists M \in \mathbb{R}$ and $\exists u^*(t)$ safe control for $x$ and $t \in [t_1\ t_2]$ such that:

$$\max_t \|u(t) - u^*(t)\| \le M$$

and:

$$M \rightarrow 0 \ \text{for} \ \epsilon \rightarrow 0$$

In short, Assump. 4 says that, if there exists a safe control in a state, then in a neighborhood of that state we can find a new safe control by increasing (or decreasing) the previous control by a finite amount at each time. It also states that, as the neighborhood size tends to zero, the finite amount also tends to zero. With Assump. 4 given, let us introduce a softer definition of a backup from $x_1$ to $x_2$:

Definition 6 A control action $u_b(t),\ t \in [t_1\ t_2]$ is a backup with reach $\epsilon$ from $x_1 = x(t_1)$ to $x_2 = x(t_2)$ if:

$$\sigma_{u_b}(x_1, t) \in SSS,\ \forall t \in [t_1\ t_2] \tag{13}$$

and:

$$\exists \tau < t_1 : \|x_2 - x(\tau)\| \le \epsilon \tag{14}$$

It is evident that, if an ideal backup exists from $x_1$ to $x_2$, then there also exists a backup with reach $\epsilon$ as small as desired. With this new definition, the following theorem applies:

Theorem 1 Let $x_1 = x(t_1) \in SSS$, let $u_S(t)$ be a strictly feasible safe control for $t \in [t_0\ t_1]$, and let $u_I(t)$ be a strictly feasible ideal backup from $x_1$ to:

$$x_2 = \sigma_{u_I}(x_1, t_2)$$

Then there exist $u_b$, a backup from $x_1$ to $x_2$ with reach $\epsilon > 0$, and a control $u^*(t)$ for $t \ge t_2$ such that:

$$u(t) = \begin{cases} u_b(t) & \text{if } t \le t_2 \\ u^*(t) & \text{otherwise} \end{cases}$$

is a feasible safe control for $t \in [t_1\ \infty)$.

Demonstration: the existence of the ideal backup $u_I(t)$ means that the iterative control action $u_{iter}(t) = (u_I(t), u_S(t), u_I(t), \cdots)$ is a strictly feasible safe control. The existence of $u_I(t)$ means that there exists a backup with reach $\epsilon$ as small as desired. Assump. 4 guarantees that there exists a safe control $u(t)$ such that $\|u(t) - u_{iter}(t)\| \le M$ after the application of the backup. Since $u_{iter}(t)$ satisfies Eq. 10 strictly, and since $M \rightarrow 0$ for $\epsilon \rightarrow 0$, there exists $\epsilon$ s.t. $u(t)$ also satisfies Eq. 10 and is therefore feasible. □

It is important to state that Theorem 1 is not by itself a guarantee of safe exploration: while it proves the existence of a backup that can guarantee safety, it does not specify the reach of said backup. Consider now the following. If for a state $x_E = x(t_E) \in SSS$ there exists $u_E = u(t_E)$ such that:

$$\frac{dx}{dt}(x_E, u_E) = 0 \tag{15}$$

it is immediate to see that the control action $u(t) = u_E$ is a safe control for $x_E$, $t > t_E$. Any state $x_E$ that satisfies the equation can be defined as an equilibrium point. The presence of such points can be of use in expanding the definition of backup:
Definition 7 A control action $u_b(t),\ t \in [t_1\ t_2]$ is a backup from $x_1 = x(t_1)$ to $x_2 = x(t_2)$ if:

$$\sigma_{u_b}(x_1, t) \in SSS,\ \forall t \in [t_1\ t_2] \tag{16}$$

and if:

$$\exists \tau < t_1 : \|x(t_2) - x(\tau)\| \le \epsilon \tag{17}$$

in which case it is said to have reach $\epsilon$, or if there exists an equilibrium point $x_E$ s.t.:

$$\|x(t_2) - x_E\| \le \delta \tag{18}$$

in which case it is said to have reach $\delta$.

Recalling again Assump. 4, and by imposing that $u_E$ is strictly feasible, it can be demonstrated that:

Theorem 2 Let $x_1 = x(t_1) \in SSS$, let $u_S(t)$ be a strictly feasible safe control for $t \in [t_1\ t_2]$, and let:

$$x_2 = \sigma_{u_S}(x_1, t_2)$$

be an equilibrium point for the system:

$$\frac{dx}{dt}(x_2, u_E) = 0 \tag{19}$$

where $u_E$ is strictly feasible. Then there exist $u_b$, a backup from $x_1$ to $x_2$ with reach $\delta > 0$, and a control $u^*(t)$ for $t \ge t_2$ such that:

$$u(t) = \begin{cases} u_b(t) & \text{if } t \le t_2 \\ u^*(t) & \text{otherwise} \end{cases}$$

is a feasible safe control for $t \in [t_1\ \infty)$.

The demonstration is analogous to that of Theorem 1 and will therefore be skipped. The background for the illustration of SHERPA is now complete.
C. SHERPA layout
In SHERPA, the assumptions of the previous section are translated into a feasible computational framework with the use of Interval Analysis. In this sense, a backup can be explicitly reformulated in interval notation as in Def. 8:

Definition 8 Let $[x_{k+1}] = \Delta([x_k], u)$ be the discrete-time, interval bounding model of the agent dynamics $\dot{x} = D(x, u)$. Let $u_b$ be a sequence of $p$ control vectors, $u_b = \{u_{b1}, \ldots, u_{bp}\}$, and let $u_{bi} = (u_{bi,1}, u_{bi,2}, \cdots, u_{bi,m})$ be strictly feasible. Then $u_b$ is a backup in interval form for $[x_p]$ if, given:

$$[x_{b1}] = \Delta([x_p], u_{b1}), \qquad [x_{bi}] = \Delta([x_{b(i-1)}], u_{b(i)})$$

it is:

$$[x_{bi}] \subset SSS,\ i \in \{1, 2, \cdots, p\} \tag{20}$$

and if:

$$\forall x \in [x_{bp}],\ \exists \bar{k} < p : \|x - x_{\bar{k}}\| < \epsilon \tag{21}$$

in which case the backup is said to have reach $\epsilon$, or if there exists an equilibrium point $x_E$ s.t.:

$$\forall x \in [x_{bp}],\ \|x - x_E\| \le \delta \tag{22}$$

in which case the backup is said to have reach $\delta$.
A convenient adaptation of the previous results to the IA framework has been made in the choice of the quantities $\epsilon$ and $\delta$. The conditions $\|x - x(\bar{t})\| < \epsilon$ and $\|x - x_E\| < \delta$ in Def. 8 have been exchanged with their interval equivalents:

$$|x_i - x_i(\bar{t})| \le v_{\epsilon i} \ ;\quad |x_i - x_{Ei}| \le v_{\delta i}$$

where the inequalities are interpreted on an element-by-element basis and the elements of $v_\epsilon$ and $v_\delta$ are positive numbers. In other words, instead of requesting that the final state of the backup belongs to a hypersphere of radius $\epsilon$ (or $\delta$) with center $x_{\bar{k}}$ (or $x_E$), it is required that it belongs to an interval centered in $x(\bar{t})$ (or $x_E$) and of width $v_\epsilon$ (or $v_\delta$). As mentioned, Theorems 1 and 2 do not specify which values of $\epsilon$ and $\delta$ are eligible for a backup to be safe. Therefore the choice of the vectors $v_\epsilon$ and $v_\delta$ has been mostly empirical in nature. For $v_\epsilon$, a correlation between the control history and the actual value of its components is hinted at in Theorem 1. This would also mean that the value of the correct $\epsilon$ mentioned in the backup definition would change depending on previous history. To take this fact into account, as soon as a state $x$ is visited, an interval $[x - v_{\epsilon max}\ \ x + v_{\epsilon max}]$ centered on the state is stored in the memory of the agent as a safe reach for a backup. At each time step, all the intervals computed in previous iterations are shrunk in every dimension by a factor $\eta$:

$$\eta = \mathrm{geom}(\eta_1, \eta_2, \cdots, \eta_m)\ ;\qquad \eta_i = 1 - \sqrt{ \frac{ \left| u_i(t) - \tfrac{1}{2}(\underline{u}_i + \bar{u}_i) \right| }{ \tfrac{1}{2}(\bar{u}_i - \underline{u}_i) } } \tag{23}$$

where geom is the geometric mean operator and $\underline{u}_i$, $\bar{u}_i$ are components of the bounding vectors $\underline{u}$, $\bar{u}$. It can be seen from Eq. 23 that the shrinking factor $\eta$ at each time interval is indeed dependent on the control action applied, and that the more the control action nears the boundaries $\underline{u}$ and $\bar{u}$, the more $\eta$ tends to zero. Therefore the feasible interval defined by $v_\epsilon$ shrinks continuously in time, and previous states become less and less “reliable” the more SHERPA proceeds with time steps. This empirical formulation for $v_\epsilon$ was used in the results presented in the next section. A more educated formulation seems a profitable field of investigation. At the start of the exploration, the agent is in state $x_0$. It will be assumed that $x_0$ is an equilibrium point, and that the agent does not perceive any fatal state: $W(x_0) = 0$. As a start, a random control action $u_p$ is proposed. SHERPA computes the predicted state $[x_p] = \Delta(x_0, u_p)$ using IA, and subsequently performs a series of checks. The first one consists in checking whether the interval $[x_p]$ is included in the known SSS. If the prediction model effectively bounds the real dynamics of the agent, this check alone ensures that at the next time step the agent will not encounter any fatal state. If this check fails, then a different random control action is proposed. If the check succeeds, it may still be that $[x_p] \cap L \ne \emptyset$, that is, that the next state might lead the agent to a fatal one in the future. This requires a second check on the existence of a backup as defined in Def. 8. As previously noted, an informed choice must be made, in the sense that too small thresholds would result in too strict requirements that make the search for a backup unfeasible, while too big thresholds would result in too relaxed conditions. A random finite control sequence $u_b$ is generated and, for each step, Def. 8 is checked. If the checked sequence is a backup with acceptable reach, then the procedure ends, $u_b$ is stored and $u_p$ is taken as the current action.
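Before moving on, a minimal reading of the shrink factor of Eq. 23 in code; the function is an illustrative interpretation, not the authors' implementation, and the control bounds in the usage lines are only indicative of those used in section V.

```python
import math

# Sketch of the shrink factor of Eq. 23 (illustrative implementation).
# u_lo, u_hi are the per-component control bounds; u is the applied (feasible) control.
def shrink_factor(u, u_lo, u_hi):
    etas = []
    for ui, lo, hi in zip(u, u_lo, u_hi):
        mid, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
        etas.append(1.0 - math.sqrt(abs(ui - mid) / half))
    # geometric mean of the per-component factors
    return math.prod(etas) ** (1.0 / len(etas))

u_lo, u_hi = [1.3, -270.0], [10.3, 270.0]        # indicative bounds (cf. section V)
print(shrink_factor([5.8, 0.0], u_lo, u_hi))     # control at the centre: eta = 1
print(shrink_factor([10.0, 250.0], u_lo, u_hi))  # near saturation: eta close to 0
```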
It has been assumed for simplicity that the next system state $x_1 = D(x_0, u_p)$ can be observed without error. (This is not a fundamental assumption as long as the observer error is bounded, so that the next state can at least be located inside a known interval.) In the new state $x_1$, SHERPA makes use of the risk perception. If no risk is perceived, i.e. if $W(x_1) = 0$, then the n-dimensional cylinder $H$ is added to the SSS; otherwise, no addition is performed. This procedure is followed iteratively. From the definition of backup, we have that the exploration is safe: in the event that no action can be found that satisfies both checks, the agent still has a safe option in taking the backup as its next control sequence. The assumption on $x_0$ being an equilibrium point guarantees the existence of an initial backup. Since, invoking Assump. 4, the existence of a safe control from any state of the ending interval is guaranteed, safety of exploration is guaranteed as well. In the previous paragraph it was assumed that SHERPA would eventually find a backup or a feasible action, provided they exist, and would resort to a backup otherwise. However, SHERPA is meant to work during online exploration, with a limited amount of iterations allowed for every check. This leads to the conclusion that, even with the aforementioned assumptions, there is no guarantee that SHERPA will identify, in the allowed amount of iterations, a feasible control and backup even when they actually exist.
This calls for a slight modification of the algorithm. After the amount of iterations for performing the checks has been depleted without results, SHERPA will force the agent to take the currently stored backup action. During the execution of the backup, SHERPA will also perform a check to find another feasible backup starting from the “arrival” interval $[x_{bp}]$. This is necessary to cope with the possibility that no feasible control will be found at the arrival interval either. The whole algorithm, including the aforementioned modifications, is summarized in figure 3.
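The complete per-step decision procedure just described (and summarized in the flowchart of figure 3) can be sketched as follows. The functions propagate, in_sss, backup_reach_ok and sample_action are placeholders for, respectively, the interval bounding model ∆, the check [x] ⊂ SSS, the reach condition of Def. 8 and the random control generator; this is a simplified illustration under those assumptions, not the authors' implementation.

```python
def sherpa_step(x_box, sss, stored_backup, propagate, in_sss, backup_reach_ok,
                sample_action, n_action_tries=10, n_backup_tries=40, p=5):
    """One SHERPA decision step (simplified sketch of the flowchart in figure 3)."""
    for _ in range(n_action_tries):
        u_p = sample_action()                      # randomly proposed control
        x_pred = propagate(x_box, u_p)             # interval prediction [x_p]
        if not in_sss(x_pred, sss):                # first check: [x_p] within known SSS
            continue
        for _ in range(n_backup_tries):            # second check: search for a backup
            ub = [sample_action() for _ in range(p)]
            boxes, ok = [x_pred], True
            for u in ub:                           # propagate the candidate backup
                boxes.append(propagate(boxes[-1], u))
                if not in_sss(boxes[-1], sss):
                    ok = False
                    break
            if ok and backup_reach_ok(boxes[-1]):  # reach condition (Def. 8)
                return u_p, ub                     # take u_p, store ub as new backup
    # Iteration budget exhausted: fall back on the previously stored backup
    # (a new backup is then searched for during its execution).
    return stored_backup[0], stored_backup[1:] or stored_backup
```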
Figure 3: Flowchart summarizing the procedure performed by SHERPA. A randomly proposed action is checked multiple times; if all checks succeed, the action is taken and the new backup is stored in place of the previous. Conversely, if the iteration limit is reached without success, SHERPA recollects the previously stored backup.
V. Results
An application of SHERPA is shown for clarification. A two-dimensional quadrotor exploratory task is simulated in a MATLAB® environment. The nonlinear bounding model described by Eq. 24 and Eq. 25 was used:

$$\ddot{x} = -\frac{[c]}{[m]}\,|\dot{x}|\,\dot{x} + \frac{T}{[m]}\,|\sin\theta|\sin\theta\ , \qquad \ddot{z} = -\frac{[c]}{[m]}\,|\dot{z}|\,\dot{z} + \frac{T}{[m]}\,|\cos\theta|\cos\theta - g \tag{24}$$

$$\dot{\theta} = [k]\, I_\tau \tag{25}$$

where $c$ and $[c]$ are respectively a damping coefficient and its known interval; $m$ and $[m]$ are the mass of the quadrotor and its known interval; $x$, $z$ and $\theta$ are respectively the horizontal coordinate and the vertical coordinate of the “point mass” quadrotor, and its pitch angle; $T$ is the total thrust generated by the rotors; $k$ is an efficiency for the application of the torque impulse $I_\tau$, and $[k]$ its known interval; finally, $g$ is the gravity acceleration. The choice of introducing an impulse as the primary control of the pitch angle, instead of a pitching torque, has two motivations. The first one is to reduce the computational effort by eliminating one intermediate state (the pitch rate itself). In addition to that, it was observed that such a model resulted in a more realistic simulation of the fast dynamics of real-life quadrotors. For the bounding model, $[c]$, $[m]$ and $[k]$ were chosen as:

$$[m] = [0.344\ \ 0.516]\ \mathrm{kg}\ ;\quad [c] = [0.72\ \ 1.08] \times 10^{-5}\ \mathrm{kg/m}\ ;\quad [k] = [0.8\ \ 1.2]\ ;$$

with the control action bounded as:
$$T \in [0\ \ 12.7]\ \mathrm{N} \quad \text{and} \quad I_\tau \in [-300°\ \ 300°]\ \mathrm{s}^{-1}$$

The “true” model of the quadrotor used for the computation of the real dynamics was obtained from the bounding model by randomly selecting, for each of the parameters, a value belonging to the respective interval. A new random selection was performed at the start of each run. The task was defined as follows. The quadrotor has to fly safely inside a restricted space, namely a room 3 meters wide and 2 meters high. The quadrotor is initially in hovering condition at 1 meter from the ground, equidistant from the walls. The rotors' thrusts are assumed to be perfectly symmetrical with respect to the vertical plane of the quadrotor itself, resulting in a task that is two-dimensional. The set of fatal states (unknown to the agent) is represented by all those conditions where, regardless of speed and pitch angle, the point-mass quadrotor touches the walls or the floor/ceiling. Assump. 3 is introduced by assuming that the quadrotor will receive a warning when at half a meter from the ceiling/floor of the room or when at 1 meter distance from a wall. For this task, the vector:

$$v_{\epsilon max} = \left[\,0.5\ \mathrm{m}\ \ \ 0.5\ \mathrm{m/s}\ \ \ 0.5\ \mathrm{m}\ \ \ 0.5\ \mathrm{m/s}\ \ \ \tfrac{\pi}{6}\,\right]$$
has been selected. The control has been constrained to stay between roughly 10% and 90% of the bounding intervals, to prevent the factor $\eta$ of Eq. 23 from vanishing: $T \in [1.3\ \ 10.3]$ N and $I_\tau \in [-270°\ \ 270°]$ s$^{-1}$. The vector $v_\delta$ has been chosen as a fixed interval, instead of a shrinking interval as for $v_\epsilon$. Taking into consideration the dynamics of the (bounding) model of Eq. 24, it is evident that the position $(x, z)$ of the quadrotor itself does not influence the value of the derivatives (i.e. the dynamics are “global”). For this reason, the interval $\dot{x} \in [-1\ \ 1]$, $\dot{z} \in [-1\ \ 1]$, $\theta \in [-45°\ \ 45°]$, containing the equilibrium point $\dot{x} = 0$, $\dot{z} = 0$, $\theta = 0$, was selected. Note that the position of the quadrotor is still taken into consideration during the generation of the backup in Eq. 20. A series of 4 runs of SHERPA is shown in figure 4. At each time step, a maximum of 10 attempts for the first check (i.e. for selecting a safe control action) and of 40 attempts for the second check (i.e. for the existence of the backup) were allowed, for a total of 400 maximum iterations at each time step. As anticipated, the parameters $c$, $m$ and $k$ have been randomly chosen at the start of each run from the intervals $[c]$, $[m]$ and $[k]$, to show the independence of the algorithm from the actual model. The values are indicated in the figures' captions. The figures will now be briefly commented. The black dots indicate the position of the quadrotor at different time steps from the starting point in (0, 4). The blue square represents the fatal states, i.e. the walls and the floor/ceiling of the room. The dotted-and-dashed magenta line represents the contour of the region where no risk is perceived ($W(x, z) = 0$) and the SSS is therefore updated. The dashed black line represents instead the known SSS at the end of the run. In 3 out of 4 cases the quadrotor reaches the very boundaries of the SSS, and nonetheless it manages to safely divert its trajectory (figures 4.a and 4.b) or to wait in position until another path has been found (figure 4.c). In general, the agent chooses actions that result in safe trajectories, even if the random nature of the proposed control is still noticeable. The parameters show notable excursions between runs, especially the mass, which varies from 345 g to 505 g.
[Figure 4 panels: plots of vertical position [m] vs. horizontal position [m] for each run.
(a) Run with m = 345 g, c = 1.06 × 10⁻⁴ kg/m, k = 0.92;
(b) Run with m = 505 g, c = 9.60 × 10⁻⁵ kg/m, k = 1.17;
(c) Run with m = 465 g, c = 9.64 × 10⁻⁵ kg/m, k = 1.18;
(d) Run with m = 427 g, c = 1.02 × 10⁻⁴ kg/m, k = 0.98.]
Figure 4: Four different runs of the SHERPA algorithm in an exploration task. The quadrotor position is represented by the black dots. The blue square represents the fatal states that the agent must avoid. The magenta dot-and-dashed square delimits the states where the agent does not perceive risk. The black dashed square represents instead the known SSS at the end of each run.
VI. Conclusion

In this paper, an approach for handling and categorizing fatal occurrences for autonomous agents has been presented. The concepts of fatal state, lead-to-fatal state, and risk perception have been formalized to provide the background and the motivation for the approach, along with the required assumptions. In this context, the notions of safe control and backup have also been introduced in order to present an algorithm for safe exploration, SHERPA. The algorithm has been illustrated and subsequently simulated on a quadrotor platform in a safe exploration task, showing that safety was achieved under a set level of model uncertainty. The approach and the algorithm present some points of future development. In this paper, risk perception was introduced in a very generic form that includes different sources of such information. Further investigation of possible sources seems a promising field of expansion for the approach. Another point where investigation might be profitable is the choice of the vectors $v_\epsilon$ and $v_\delta$. An investigation into these two points could lead to a general improvement of the algorithm, especially in the computational effort required, which is currently high. Nonetheless, the approach shown in this paper constitutes a significant effort toward tackling the exploration problem for RL agents on a general level, without focusing on a particular category of tasks. This is also reflected in the absence, inside the algorithm, of a definite model of the whole exploration process. This strategy is therefore in line with the model-free approach of Reinforcement Learning and, more generally, with the model-free approach of adaptive controllers.
References

1. van Kampen, E., Chu, Q.P., Mulder, J.A., Continuous Adaptive Critic Flight Control aided with Approximated Plant Dynamics, Proceedings of the AIAA Guidance, Navigation, and Control Conference and Exhibit, Keystone, Colorado, 2006.
2. Ferrari, S., Stengel, R.F., Online Adaptive Critic Flight Control, Journal of Guidance, Control, and Dynamics, Vol. 27, No. 5, 2004, pp. 777-786.
3. Sutton, R.S., Barto, A.G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
4. Coraluppi, S.P., Marcus, S.I., Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes, Automatica, Vol. 35, Iss. 2, 1999, pp. 301-309.
5. Heger, M., Consideration of Risk in Reinforcement Learning, 11th International Machine Learning Conference, Rutgers University, New Brunswick, NJ, July 1994.
6. Hans, A., Schneegaß, D., Schäfer, A.M., and Udluft, S., Safe Exploration for Reinforcement Learning, ESANN'2008 Proceedings, European Symposium on Artificial Neural Networks - Advances in Computational Intelligence and Learning, Bruges, Belgium, April 2008.
7. García, J., Fernández, F., Policy Improvement through Safe Reinforcement Learning in High-Risk Tasks, IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), Paris, France, April 2011, pp. 76-83.
8. Moldovan, T.M., Abbeel, P., Safe Exploration in Markov Decision Processes, Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.
9. Gillula, J.H., Huang, H., Vitus, M.P., and Tomlin, C.J., Design of guaranteed safe maneuvers using reachable sets: Autonomous quadrotor aerobatics in theory and practice, IEEE International Conference on Robotics and Automation (ICRA), Anchorage, AK, May 2010, pp. 1649-1654.
10. Gillula, J.H., Tomlin, C.J., Guaranteed safe online learning of a bounded system, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), San Francisco, CA, Sept. 2011, pp. 2979-2984.
11. Freichard, T., Asama, H., Inevitable collision states. A step towards safer robots?, Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), Oct. 2003, pp. 388-393.
12. Moore, R.E., Interval Arithmetic and Automatic Error Analysis in Digital Computing, Ph.D. Dissertation, Department of Mathematics, Stanford University, Stanford, California, Nov. 1962. Published as Applied Mathematics and Statistics Laboratories Technical Report No. 25.
13. Maler, O., Computing reachable sets: an introduction, Tech. rep., French National Center of Scientific Research, 2008.