5th IFAC Workshop on Distributed Estimation and Control in Networked Systems
September 10-11, 2015. Philadelphia, USA
Available online at www.sciencedirect.com

IFAC-PapersOnLine 48-22 (2015) 168–173

Game Theoretic Controller Synthesis for Multi-Robot Motion Planning - Part II: Policy-based Algorithms

Devesh K. Jha ∗, Minghui Zhu ∗∗, Asok Ray ∗∗∗

∗ Mechanical & Nuclear Engineering Department, University Park, PA 16802 USA (e-mail: [email protected]).
∗∗ Electrical Engineering Department, University Park, PA 16802 USA (e-mail: [email protected]).
∗∗∗ Mechanical & Nuclear Engineering Department, University Park, PA 16802 USA (e-mail: [email protected]).
⋆ M. Zhu was partially supported by NSF grant CNS-1505664.

Abstract: This paper presents the problem of distributed feedback motion planning for multiple robots. The problem of feedback multi-robot motion planning is formulated as a differential non-cooperative game. We leverage existing sampling-based algorithms and value iterations to develop an incremental policy synthesizer. The proposed algorithm makes use of an iterative best response algorithm to incrementally improve the estimate of the value functions of the individual robots in the multi-robot motion-planning setting. We show the asymptotic convergence of the limiting policies induced by the proposed Feedback iNash-Policy algorithm for the underlying non-cooperative game. Furthermore, we show that the value iterations allow estimation of the cost-to-go functions for the robots without requiring convergence of the value functions for the sampled graph at any particular iteration.

© 2015, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Motion planning, Multiple robots, Anytime algorithms, Non-cooperative games.

1. INTRODUCTION

Informally speaking, given a robot with a description of its dynamics, a description of its environment, an initial set of states, and a set of goal states, the motion planning problem is to find a sequence of control inputs so as to guide the robot from the initial state to one of the goal states while avoiding collision in a cluttered environment. Many planning algorithms have been proposed, including, to name a few, discretized approaches such as A* LaValle (2006); continuous approaches involving navigation functions and potential fields LaValle (2006); and, more recently, sampling-based algorithms such as Rapidly-exploring Random Trees (RRT) LaValle and Kuffner (2001), Probabilistic RoadMaps (PRM) Kavraki et al. (1996) and their optimal variants RRT* and PRM* Karaman and Frazzoli (2011). PRMs and RRTs (and their optimal variants PRM* and RRT* Karaman and Frazzoli (2011)) have arguably been the most influential and widely-used sampling-based motion planning algorithms of the last two decades. They have been shown to work well in practice and possess theoretical guarantees such as probabilistic completeness.

Inspired by the success of sampling-based algorithms, there has been a lot of work on using them to develop feedback planners that use different metrics to select the best path. Some examples, addressing stochastic control and approach-evasion games, can be found in Huynh et al. (2012); Mueller et al. (2013). Even though a large body of literature is available on sampling-based motion planning for a single robot, there is no systematic study on the use of these algorithms for multiple robots. Recently, a game theoretic trajectory-based algorithm was proposed in Zhu et al. (2014).

Traditionally, multi-robot motion planning approaches have been studied under three broad categories: centralized planning (e.g., Sánchez and Latombe (2002)), decentralized (decoupled) planning (e.g., Siméon et al. (2002)) and priority planning (e.g., Buckley (1989)). A centralized approach typically constructs a path in the composite configuration space, which is formed by taking the Cartesian product of the configuration spaces of the individual robots and is therefore computationally expensive. A decentralized (decoupled) or priority approach typically generates paths for each robot independently and then considers the interactions between the robots. The suitability of one over the other is generally determined by the trade-off between the computational complexity associated with a given problem and the amount of completeness that is lost LaValle and Hutchinson (1998).





This paper formulates the problem of feedback multi-robot motion planning as a differential non-cooperative game where individual robots have different goals. There have been a very limited number of feedback differential games whose closed-form solutions are known, including the homicidal-chauffeur and the lady-in-the-lake games Basar and Olsder (1999); Isaacs (1999). Methods based on partial differential equations, e.g., Bardi and Capuzzo-Dolcetta (1997); Bardi et al. (1999); Souganidis (1999), viability theory, e.g., Aubin (2009), and level-set methods, e.g., Sethian (1999), have been proposed to determine numerical solutions to differential games. However, the aforementioned papers only study one-player and two-player differential games. Zhu et al. (2014) proposed and analyzed the first distributed anytime computation algorithm to compute open-loop Nash equilibria for non-cooperative robots. To the best knowledge of the authors, there is no systematic approach with provable performance for multi-player feedback differential games beyond unconstrained linear-quadratic differential games; e.g., Basar and Olsder (1999).

Contributions. This paper proposes and analyzes the first distributed anytime algorithm for non-cooperative feedback differential games. In particular, we leverage the existing Rapidly-exploring Random Graphs (RRG) algorithm in Karaman and Frazzoli (2011), value iterations, model checking and an iterative best response algorithm to propose a distributed anytime computation algorithm, namely the Feedback iNash-Policy algorithm. We show the asymptotic convergence of the value functions computed using the Feedback iNash-Policy algorithm. Furthermore, we show that the value iterations do not require convergence of the value functions on the sampled graph at any particular iteration for asymptotic convergence to an optimal value function.

2. PROBLEM FORMULATION

In this section, we introduce some preliminaries which serve as the basic elements of the algorithms developed and analyzed later. We then present the problem statement using the primitives defined.

2.1 Primitives

• Sampling: The Sample(A) procedure returns uniformly random samples from the set A.
• Local steering: Given two states x, y, the Steer procedure returns a state z by steering x towards y for at most η > 0 distance, i.e., Steer(x, y) ≜ argmin_{z ∈ B(x,η)} ‖z − y‖. This procedure returns a time-optimal trajectory that starts from x and reaches z exactly, when such a trajectory exists. This amounts to solving a two-point boundary value problem for an ordinary differential equation. For systems such as the single integrator, the double integrator, or a curvature-constrained car (i.e., a Dubins vehicle), analytical solutions to this boundary value problem exist Bertsekas (2000); Dubins (1957).


• Nearest neighbor: Given a state x and a finite set S of states, the Nearest procedure returns the state in S that is closest to x, i.e., Nearest(S, x) ≜ argmin_{y ∈ S} ‖y − x‖.
• Near vertices: Given a state x, a finite set S and a positive real number r, the NearVertices procedure returns the states in S that are r-close to x: NearVertices(S, x, r) ≜ {y ∈ S | ‖x − y‖ ≤ r}.
• Collision check of policy: Given a policy π and a set of policies Π, the CollisionFreePolicy(π, Π) procedure returns 0 if the trajectory corresponding to the policy π collides with any path corresponding to the policies in Π; otherwise it returns 1. A policy incurs infinite cost if it leads to a collision, which can thus be easily incorporated in the value iterations.
• Feasible policies: Given the policy set Π_i and π^{[-i]}, Feasible_i(Π_i, π^{[-i]}) is the set of policies π^{[i]} ∈ Π_i such that CollisionFreePolicy(π^{[i]}, π^{[-i]}) = 1.
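These primitives can be prototyped directly. The sketch below is a minimal illustration assuming a Euclidean metric and an axis-aligned sampling region; all names are our own choices and the steering is purely geometric rather than the time-optimal steering described above.

    import numpy as np

    def sample(low, high, rng=np.random.default_rng()):
        """Sample(A): uniform sample from an axis-aligned box A = [low, high]."""
        return rng.uniform(low, high)

    def steer(x, y, eta):
        """Steer(x, y): move from x towards y by at most eta (Euclidean case)."""
        d = np.linalg.norm(y - x)
        return y.copy() if d <= eta else x + eta * (y - x) / d

    def nearest(S, x):
        """Nearest(S, x): state in the finite set S closest to x."""
        return min(S, key=lambda s: np.linalg.norm(s - x))

    def near_vertices(S, x, r):
        """NearVertices(S, x, r): all states in S within distance r of x."""
        return [s for s in S if np.linalg.norm(s - x) <= r]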

2.2 Problem Statement

Consider a team of robots labeled by V_R ≜ {1, ..., N}. Each robot is associated with a dynamic system governed by the following differential equation:

    \dot{x}_i(t) = f_i(x_i(t), u_i(t)),   (1)

where x_i(t) ∈ X_i ⊆ R^{n_i} is the state of robot i and u_i(t) ∈ U_i is the control of robot i. For system (1), the set of admissible control strategies for robot i is given by

    \mathcal{U}_i ≜ \{ u_i(\cdot) : X_i \to U_i,\ \text{measurable} \},

where U_i ⊆ R^{m_i}. Let the sets X_i^O ⊂ X_i and X_i^G ⊂ X_i denote the obstacle region and the goal region, respectively, for robot i. Define the obstacle-free space as X_i^F := X_i \ X_i^O.
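As a simple illustration of the form (1) (our own example, not one taken from the paper), a planar double integrator with position p_i, velocity v_i and bounded acceleration input can be written as:

    x_i = \begin{bmatrix} p_i \\ v_i \end{bmatrix} \in \mathbb{R}^4, \qquad
    \dot{x}_i(t) = f_i(x_i(t), u_i(t)) = \begin{bmatrix} v_i(t) \\ u_i(t) \end{bmatrix}, \qquad
    u_i(t) \in U_i = \{ u \in \mathbb{R}^2 : \|u\|_\infty \le u_{\max} \}.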

Let Π_i denote the set of dynamically feasible policies for robot i. In general, a control policy for a robot is a mapping from its state space to its control set. However, in the multi-robot motion planning setting, a policy of robot i depends on the states of the other robots in the environment, and hence the domain for a feasible policy is X ≜ ∏_{i ∈ V_R} X_i, where ∏ denotes the Cartesian product. For each robot i, π^{[i]} : X → U_i is a dynamically feasible policy if there exists T_{π^{[i]}} ≥ 0 such that: (i) \dot{x}^{[i]}(t) = f_i(x^{[i]}(t), π^{[i]}(x)); (ii) x^{[i]}(0) = x_init^{[i]}; (iii) x^{[i]}(t) ∈ X_i^F for t ∈ [0, T_{π^{[i]}}]; (iv) x^{[i]}(T_{π^{[i]}}) ∈ X_i^G. The optimal policy is the one whose resulting trajectory incurs the minimum cost during motion, where the cost is defined using an appropriate metric.

We consider reachability tasks where each robot has to reach an open goal set X_i^G ⊂ X_i while keeping its state x_i(t) inside a closed constraint set X_i^F ⊆ X_i. Furthermore, the robots may be constrained to follow various specifications during their motion, depending on the requirements of their mission. Such complex constraints can be encoded by an automaton and expressed in Linear Temporal Logic (LTL) Baier et al. (2008). LTL is a widely used specification language Baier et al. (2008) which allows one to reason about how system properties change over time, and thus to specify a wide variety of tasks such as safety, response, persistence, and recurrence. To see this more clearly, let AP_i be the finite set of atomic propositions on X_i.



The semantics of LTL are defined over infinite sequences ω of truth assignments to the atomic propositions p_i ∈ AP_i. For simplicity of exposition, we consider the example in which a robot only needs to always eventually reach the goal while staying clear of the obstacles; however, any complex specification for robot i can be encoded as an LTL formula Φ_i. Every closed-loop policy intrinsically generates a path of a Kripke structure Baier et al. (2008), i.e., an ω-word over 2^{AP_i}, based on the trajectory the robot follows from the initial point to the goal set. The word corresponding to the policy satisfies the LTL formula if it belongs to the language generated by the corresponding automaton. We denote by [Φ_i] ⊆ Π_i the set of policies satisfying Φ_i.
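For the reach-avoid example mentioned above, one illustrative way to write such a specification (our own rendering, not a formula given in the paper, with atomic propositions goal_i and obs_i holding on X_i^G and X_i^O, respectively) is:

    \Phi_i \;=\; \Box\, \neg \mathrm{obs}_i \;\wedge\; \Box \Diamond\, \mathrm{goal}_i,

which reads: always avoid the obstacle region and always eventually visit the goal region.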

In addition to finding a policy that satisfies these specifications, the robot may have several other objectives, such as reaching the goal region in the shortest possible time. In such a setting, we formulate the motion planning problem for multiple robots as a differential non-cooperative game.

Definition 2.1. (Differential Non-Cooperative Game). Let x ∈ R^M describe the state of the game, evolving according to the ODE

    \dot{x}(t) = f(t, x, u_1, \ldots, u_N),   (2)

where t ∈ [0, T], with initial condition x(0) = x_0 and u_i ∈ U_i. The state of the game evolves according to all players' decisions. The payoff for every player i is defined using a metric, and every player tries to maximize her own payoff.

Then the problem we are interested in in this paper is finding the optimal cost-to-go function for each robot, if attainable. The cost-to-go function for robot i under a policy π^{[i]} is a mapping from Π ≜ ∏_{i ∈ V_R} Π_i defined, for a fixed initial condition, as

    J^{[i]}(\pi^{[i]}, \pi^{[-i]}) = \int_0^{T_{\pi^{[i]}}} g_i\big(x(t), u_i(t), \pi^{[-i]}\big)\, dt + h_i\big(x_i(T_{\pi^{[i]}})\big),   (3)

where g_i and h_i are bounded and measurable functions called the cost rate function and the terminal cost function, respectively. The functions g_i and h_i are also assumed to be continuous. The optimal cost-to-go function is then defined as J^{[i]}(\pi^{[-i]}) = \inf_{\pi^{[i]} \in \Pi_i} J^{[i]}(\pi^{[i]}, \pi^{[-i]}). (We use -i to denote all the robots other than i.)
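To make the functional in equation (3) concrete, the sketch below evaluates a time-discretized version of it along a given trajectory; the integration rule and all names are illustrative assumptions rather than the paper's implementation.

    import numpy as np

    def cost_to_go(xs, us, dt, g, h):
        """Discretized version of J[i] in (3): integral of the cost rate g
        along a trajectory plus the terminal cost h at the final state.

        xs : sequence of joint states x(t_0), ..., x(t_T)
        us : sequence of robot i's controls u_i(t_0), ..., u_i(t_{T-1})
        dt : integration time step
        g  : cost rate, g(x, u) -> float (other robots' policies held fixed)
        h  : terminal cost, h(x_T) -> float
        """
        running = sum(g(x, u) * dt for x, u in zip(xs[:-1], us))  # left-endpoint Riemann sum
        return running + h(xs[-1])

    # Illustrative choices: quadratic control effort plus distance-to-goal terminal cost.
    goal = np.array([5.0, 5.0])
    g = lambda x, u: float(np.dot(u, u))
    h = lambda x: float(np.linalg.norm(x[:2] - goal))
    # J = cost_to_go(xs, us, 0.1, g, h)  # for given state/control sequences xs, us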

The above multi-robot motion planning problem is formulated as a feedback non-cooperative game in which each robot seeks a policy that is collision-free, fulfills its task specifications, and minimizes its cost-to-go function given the policies of the other robots. That is, given π^{[-i]} ∈ Π_{-i}, each robot i wants to find the best policy in the feasible set Feasible_i(Π_i, π^{[-i]}) ≜ {π^{[i]} ∈ Π_i | π^{[i]} ∈ [Φ_i], CollisionFreePolicy(π^{[i]}, π^{[-i]}) = 1}, where the procedure CollisionFreePolicy was defined earlier. The solution notion we use is the feedback Nash equilibrium, stated formally as follows.

Definition 2.2. (Feedback Nash equilibrium). The collection of policies (\bar{\pi}^{[i]})_{i ∈ V_R} ∈ Π is a feedback Nash equilibrium if for any i ∈ V_R it holds that \bar{\pi}^{[i]} ∈ Feasible_i(Π_i, \bar{\pi}^{[-i]}) and there is no π^{[i]} ∈ Feasible_i(Π_i, \bar{\pi}^{[-i]}) such that J^{[i]}(\pi^{[i]}, \bar{\pi}^{[-i]}) ≤ J^{[i]}(\bar{\pi}^{[i]}, \bar{\pi}^{[-i]}), where the inequality holds point-wise.

Our approach, outlined in the next section, approximates the optimal cost-to-go function (see equation (3)) and the corresponding optimal policy for multiple robots in an anytime fashion using incremental sampling-based algorithms.

3. FEEDBACK INASH-POLICY ALGORITHM

We leverage the RRG algorithm in Karaman and Frazzoli (2011) along with value iterations, the theory of regular languages and an iterative best response algorithm to synthesize an incremental feedback Nash-Policy (Feedback iNash-Policy) algorithm. At each iteration k of the algorithm, robot i adds a node to its vertex set V_{k-1}^{[i]} by sampling a random point x_rand^{[i]}. The graph G_{k-1}^{[i]} is then extended towards x_rand^{[i]} by a local steering function (which finds a time-optimal control input while connecting x_rand^{[i]} to G_{k-1}^{[i]}), and the corresponding inputs are stored in U_k^{[i]}. Note that the set U_k^{[i]} contains all the inputs corresponding to all the nearest neighbors of a node x_k^{[i]} ∈ G_k^{[i]}.

After the construction of the graphs is complete for all the robots, the set of active robots plays a game on their product graph for one round in a sequential manner. At every iteration, every active robot performs one update of its value function and finds a better policy on the graph G_k^{[i]}. Robot i is active at iteration k if its goal set is reachable through some path on G_k^{[i]}. The BestResponse algorithm is implemented in the same distributed fashion as in Zhu et al. (2014); however, instead of listing the trajectories and finding a better trajectory, we update the value function and thus the corresponding policies. The active robot with the least index updates its values first and finds a better policy; it then broadcasts its policy to the other robots, which sequentially improve their policies. Once all the active robots have finished their policy updates, the game terminates for iteration k, and the same steps are repeated at the next iteration. The iNash-Policy algorithm is formally stated in Algorithm 1; the graph extension and best response steps are given in Algorithm 2 and Algorithm 3, respectively.

Algorithm 1: The iNash-Policy Algorithm

    for i = 1 : N do
        V^{[i]}(0) ← x_init^{[i]};  E^{[i]}(0) ← ∅;
    A_k ← ∅;  k ← 1;
    while k < K do
        for i = 1 : N do
            x_rand^{[i]} ← Sample(X_i);
            G_k^{[i]} ← Extend(G_{k-1}^{[i]}, x_rand^{[i]});
        for i ∈ V_R \ A_{k-1} do
            if V_k^{[i]} ∩ X_i^G ≠ ∅ then A_k ← A_{k-1} ∪ {i};
        for i ∈ A_k do
            π̃_k^{[i]} ← π_{k-1}^{[i]};
        for i = 1 : N do
            if i ∈ A_k then
                Σ_k^{[i]} ← {π_k^{[j]}}_{j ∈ A_k, j < i} ∪ {π̃_k^{[j]}}_{j ∈ A_k, j > i};
                π_k^{[i]} ← BestResponse(G_k^{[i]}, Σ_k^{[i]});
        k ← k + 1;

Algorithm 2: The Extend Procedure

    V ← V_{k-1}^{[i]};  E ← E_{k-1}^{[i]};
    x_nearest ← Nearest(E, x_rand^{[i]});
    x_new ← Steer(x_nearest, x_rand^{[i]});
        /* the returned time-optimal input u_new may be added to U_k^{[i]}(x_new) */
    if ObstacleFree(x_nearest, x_new) then
        X_near ← NearVertices(E, x_new, min{γ (log k / k)^{1/n}, η});
        V ← V ∪ {x_new};
        for x_near ∈ X_near do
            if ObstacleFree(x_near, x_new) then
                E ← E ∪ {(x_near, x_new), (x_new, x_near)};
                    /* time-optimal inputs obtained by solving the boundary value problem
                       are added to U_k^{[i]}(x_new) and U_k^{[i]}(x_near) */
    return G = (V, E)
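For concreteness, a purely geometric prototype of the Extend procedure might look as follows. It is a sketch under simplifying assumptions: Euclidean steering, a placeholder obstacle check, and illustrative names, whereas the paper's procedure solves a time-optimal two-point boundary value problem for each connection.

    import math
    import numpy as np

    def extend(V, E, x_rand, k, dim, eta=1.0, gamma=10.0,
               obstacle_free=lambda a, b: True):
        """One RRG-style extension of the graph G = (V, E) towards x_rand.

        V : list of np.ndarray vertices; E : set of (i, j) index pairs.
        obstacle_free(a, b) stands in for collision checking of the local path.
        """
        x_nearest = min(V, key=lambda v: np.linalg.norm(v - x_rand))
        d = np.linalg.norm(x_rand - x_nearest)
        x_new = x_rand if d <= eta else x_nearest + eta * (x_rand - x_nearest) / d

        if obstacle_free(x_nearest, x_new):
            # Shrinking connection radius, as in the RRG of Karaman and Frazzoli (2011).
            r = min(gamma * (math.log(k) / k) ** (1.0 / dim), eta) if k > 1 else eta
            near = [idx for idx, v in enumerate(V) if np.linalg.norm(v - x_new) <= r]
            V.append(x_new)
            new_idx = len(V) - 1
            for idx in near:
                if obstacle_free(V[idx], x_new):
                    E.add((idx, new_idx))   # bidirectional edges, as in the RRG
                    E.add((new_idx, idx))
        return V, E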




The Bellman operator for the best response of robot i at iteration k is denoted by T_k^{[i]} and is defined as follows (see Algorithm 3):

    T_k^{[i]}\big(J_k^{[i]}(x), \pi_k^{[-i]}(x)\big) = \min_{u_i \in U_k^{[i]}(x_i) \cap \Pi_k^{[i]}(x_i)} \Big[ g_i(x_i, u_i) + \alpha_k^{[i]} \tilde{J}_{k-1}^{[i]}\big(f_i(x_i, u_i), \ldots, f_{-i}(x_{-i}, u_{-i})\big) \Big],   (4)

where x_i ∈ G_k^{[i]}, U_k^{[i]}(x_i) is the input set at any x_i ∈ G_k^{[i]}, and α_k^{[i]} ∈ (0, 1) is the discount factor for robot i, chosen such that lim_{k→∞} α_k^{[i]} = 1. The policies of the other robots are held fixed given x.

We would like to emphasize here that the Feedback iNash-Policy algorithm allows incremental synthesis of the feedback Nash equilibrium policies without the need to compute the optimal policies on G_k^{[i]}. The idea is that, as the RRG algorithm samples a monotonically increasing dense subset of the state space of the robots, the Bellman update equations solved on the sampled graph will monotonically improve the estimate of the cost-to-go functions and will arbitrarily closely approximate the policies on the continuous state space of the robots.

To verify the correctness of a robot's behavior with respect to the LTL formula Φ_i, the paths corresponding to the policies are checked for satisfaction of Φ_i. For this purpose, a Büchi automaton Baier et al. (2008) representing the formula Φ_i is constructed. As every state in X_i corresponds to a state in the Büchi automaton, the policies which lead to infeasible transitions in the Büchi automaton are removed (or disabled) so that the specifications are satisfied. This is, however, a computationally intensive process.

Algorithm 3: The BestResponse Procedure

    for π̂_k^{[i]} ∈ Π_k^{[i]} \ Π_{k-1}^{[i]} do
        if π̂_k^{[i]} ∈ [Φ_i] then
            Π_k^{[i]} ← Π_{k-1}^{[i]} ∪ {π̂_k^{[i]}};
                /* the set of collision-free policies satisfying Φ_i */
    for (x_i, x_{-i}) ∈ G_k do
        J_k^{[i]}(x) ← min_{u_i ∈ U_k^{[i]}(x_i) ∩ Π_k^{[i]}(x_i)} [ g_i(x_i, u_i) + α_k^{[i]} J̃_{k-1}^{[i]}(f_i(x_i, u_i), f_{-i}(x_{-i}, u_{-i})) ];
        π_k^{[i]}(x) ← the minimizing u_i in the step above;
            /* g_i(x_i, u) = ∞ if u from x leads to a collision with other robots */
            /* J̃_{k-1}^{[i]} denotes the interpolated values used for initializing values at newly sampled points (see Section 4) */
    return π_k^{[i]}
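A minimal sketch of the discounted Bellman update (4) on a finite sampled graph is given below; it treats the other robots' policies as fixed and uses a plain dictionary-based graph, which is our own simplification of the procedure in Algorithm 3, not the paper's implementation.

    import math

    def best_response_sweep(nodes, inputs, succ, g, J_prev, alpha):
        """One sweep of the discounted Bellman update (4) over the sampled graph.

        nodes  : iterable of hashable graph states x
        inputs : inputs[x] -> iterable of admissible inputs u at x
        succ   : succ[x][u] -> successor state reached from x under u
                 (with the other robots' fixed policies folded in)
        g      : g(x, u) -> stage cost; math.inf encodes a collision
        J_prev : dict of previous (interpolated) value estimates
        alpha  : discount factor in (0, 1)
        """
        J, pi = {}, {}
        for x in nodes:
            best_val, best_u = math.inf, None
            for u in inputs[x]:
                val = g(x, u) + alpha * J_prev.get(succ[x][u], math.inf)
                if val < best_val:
                    best_val, best_u = val, u
            J[x], pi[x] = best_val, best_u
        return J, pi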

4. PERFORMANCE ANALYSIS

In this section, we show the asymptotic convergence of the incremental policies generated by the Feedback iNash-Policy algorithm described in Section 3. The algorithm incrementally estimates the optimal value functions of a feedback Nash equilibrium of the differential non-cooperative game. In the following, we show the existence of the limit points obtained by the incremental Feedback iNash-Policy algorithm. The norm used in the next two theorems is defined as \|v\|_S = \sup_{x \in S} |v(x)|.

Lemma 4.1. Consider a graph G_k ⊂ X and a scalar function v defined on G_k. An interpolation of v, denoted f_v, is defined as f_v(x) = v(x) for all x ∈ G_k and f_v(x) = v(\arg\min_{y \in G_k} \|x - y\|) for all x ∈ X \ G_k. Then f_v ≤ f_{\tilde{v}} holds for any two functions v and \tilde{v} defined on G_k with v ≤ \tilde{v}, where the inequalities hold point-wise (on G_k and on X, respectively).

Proof 1. This follows immediately since v ≤ \tilde{v} holds point-wise on G_k.

Remark 4.1. Theorem 4.1 below allows incremental estimation of the optimal value functions on the continuous state space without the need to first compute the optimal value function on the sampled state space at every iteration of the Feedback iNash-Policy algorithm. Note that in the following theorems, for all i ∈ V_R, J_n^{[i]} is defined on G_k^{[i]}, k > n, by interpolating J_n^{[i]} on x_k^{[i]} ∈ G_k^{[i]} \ G_n^{[i]} in the sense of Lemma 4.1; the interpolated value function on G_k^{[i]} is denoted by \tilde{J}_n^{[i]}. All interpolated functions in the next two theorems are defined in the sense of Lemma 4.1. We denote the fixed point of the iterative best response on G_k^{[i]} by \bar{J}_k^{[i]} for robot i; it is obtained as the limiting function of the iterative best response (defined in equation (4)) on G_k^{[i]}, i.e., \bar{J}_k^{[i]}(x) = \lim_{n \to \infty} T_{k,n}^{[i]}(J_k^{[i]}, \bar{\pi}_k^{[-i]})(x) for all x ∈ G_k^{[i]}, where the subscript n denotes the n-th iterate of the best response and J_k^{[i]} is the initial estimate of the value function on G_k^{[i]}.

Assumption 4.1. For all i ∈ V_R, the fixed point induced by the iterative best response is unique at every iteration k, i.e., \bar{J}_k^{[i]} is unique for every k.
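For illustration, the nearest-neighbor interpolation of Lemma 4.1 can be prototyped as follows (the Euclidean metric and the names are our own choices); its monotonicity is immediate, since the interpolated function only copies values attained on G_k.

    import numpy as np

    def interpolate(graph_nodes, values, x):
        """f_v(x) from Lemma 4.1: the value at x itself if x is a graph node,
        otherwise the value at the nearest graph node (nearest-neighbor extension)."""
        nodes = [np.asarray(n) for n in graph_nodes]
        dists = [np.linalg.norm(np.asarray(x) - n) for n in nodes]
        return values[int(np.argmin(dists))]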


Lemma 4.2. The best response operator defined in equation (4) is monotonic, i.e.,

    T_k^{[i]}(\hat{J}_k^{[i]}, \pi_k^{[-i]})(x) \le T_k^{[i]}(\bar{J}_k^{[i]}, \pi_k^{[-i]})(x),   (5)

where \hat{J}_k^{[i]} \le \bar{J}_k^{[i]} point-wise on G_k^{[i]}.

Proof 2. This follows from the definition of the best response in equation (4) and the fact that the inequality holds point-wise.

Theorem 4.1. The following is true for all i ∈ V_R. For all x_k^{[i]} ∈ G_k^{[i]}, let J_k^{[i]} be the estimate of the cost-to-go for robot i computed by Feedback iNash-Policy after k iterations under the policy π_k^{[i]}, and let \bar{J}_k^{[i]} be the fixed point obtained by the iterative best response for robot i at the k-th iteration. Then the following holds with probability one:

    \lim_{k \to \infty} \| J_k^{[i]} - \bar{J}_k^{[i]} \|_{G_k^{[i]}} = 0,

where the equality holds point-wise on G_k^{[i]}.

Proof 3. As defined in equation (4), we denote the discrete Bellman operator for robot i at the k-th iteration by T_k^{[i]}. Let \tilde{J}_k^{[i]} denote the interpolation of J_k^{[i]} onto G_{k+1}^{[i]} (in the sense of Lemma 4.1), let J_{k+1}^{[i]} = T_{k+1}^{[i]} \tilde{J}_k^{[i]} (i.e., the update at iteration k+1 is initialized by \tilde{J}_k^{[i]} on G_{k+1}^{[i]}), and let \tilde{J}_{k+1}^{[i]} = J_{k+1}^{[i]} on G_{k+1}^{[i]}; both \tilde{J}_k^{[i]} and \tilde{J}_{k+1}^{[i]} are defined on G_{k+1}^{[i]}. The interpolations \tilde{\bar{J}}_k^{[i]} and \tilde{\bar{J}}_{k+1}^{[i]} of the fixed points are defined analogously. At this point we recall that the robots sequentially update their policies from the set of collision-free policies; consequently \pi_{k-1}^{[i]} \in \mathrm{Feasible}_i(\Pi_k^{[i]}, \pi_k^{[-i]}) and thus T_k^{[i]}(\tilde{J}_{k-1}^{[i]}(x), \pi_k^{[-i]}(x)) \le \tilde{J}_{k-1}^{[i]}(x) for all x \in G_k^{[i]}. The existence of the fixed point follows from this together with the fact that the cost-to-go functions are bounded below by zero.

Consider c = \| \tilde{J}_k^{[i]} - \tilde{J}_{k+1}^{[i]} \|_{G_{k+1}^{[i]}}. Then we have

    \tilde{J}_{k+1}^{[i]} - c \le \tilde{J}_k^{[i]} \le \tilde{J}_{k+1}^{[i]} + c,   (6)

where the inequality holds point-wise on G_{k+1}^{[i]}. Consider the function e(x) = 1 for all x \in G_k^{[i]}; for a scalar r, define (re)(x) = r. Then from equation (4) we have T_{k+1}^{[i]}(J_k^{[i]} + re, \pi_k^{[-i]})(x) = T_{k+1}^{[i]}(J_k^{[i]}, \pi_k^{[-i]})(x) + \alpha_{k+1}^{[i]} r for all x \in G_k^{[i]}. Applying the best response Bellman operator to the inequality in (6), using this property with the scalar c and the monotonicity of the best response from Lemma 4.2, we obtain

    T_{k+1}^{[i]}(\tilde{J}_{k+1}^{[i]}, \pi_k^{[-i]}) - \alpha_{k+1}^{[i]} c \le T_{k+1}^{[i]}(\tilde{J}_k^{[i]}, \pi_k^{[-i]}) \le T_{k+1}^{[i]}(\tilde{J}_{k+1}^{[i]}, \pi_k^{[-i]}) + \alpha_{k+1}^{[i]} c,

where the inequality holds point-wise on G_{k+1}^{[i]}; hence the best response is a contraction mapping. Then the following is true:

    \| \tilde{\bar{J}}_{k+1}^{[i]} - \tilde{J}_{k+1}^{[i]} \|_{G_{k+1}^{[i]}} \le \alpha_{k+1}^{[i]} \| \tilde{\bar{J}}_k^{[i]} - \tilde{J}_{k+1}^{[i]} \|_{G_{k+1}^{[i]}}   (7)
        = \alpha_{k+1}^{[i]} \| (\tilde{\bar{J}}_k^{[i]} - \tilde{J}_k^{[i]}) + (\tilde{J}_k^{[i]} - \tilde{J}_{k+1}^{[i]}) \|_{G_{k+1}^{[i]}}   (8)
        \le \alpha_{k+1}^{[i]} \| \tilde{\bar{J}}_k^{[i]} - \tilde{J}_k^{[i]} \|_{G_{k+1}^{[i]}} + \| \tilde{J}_k^{[i]} - \tilde{J}_{k+1}^{[i]} \|_{G_{k+1}^{[i]}}   (9)
        = \alpha_{k+1}^{[i]} \| \tilde{\bar{J}}_k^{[i]} - \tilde{J}_k^{[i]} \|_{G_k^{[i]}} + \| \tilde{J}_k^{[i]} - \tilde{J}_{k+1}^{[i]} \|_{G_k^{[i]}},   (10)

where \alpha_{k+1}^{[i]} = e^{-\kappa_{k+1}} (for ease of notation, we assume the same discount factor for all the robots) and \kappa_k = 1/k, so that \lim_{k \to \infty} e^{-\kappa_k} = 1 and \{\kappa_k\}_{k \ge 1} is not summable. Equation (7) follows from (6) and the contraction property using the triangle inequality; equation (8) follows from the definition of the interpolated functions.

Let us denote the sequence \| \tilde{\bar{J}}_k^{[i]} - \tilde{J}_k^{[i]} \|_{G_k^{[i]}} by \beta_k^{[i]} and the sequence \| \tilde{J}_k^{[i]} - \tilde{J}_{k+1}^{[i]} \|_{G_k^{[i]}} by \gamma_k^{[i]}. Then (7)-(10) can be rewritten as

    \beta_{k+1}^{[i]} \le e^{-\kappa_{k+1}} \beta_k^{[i]} + \gamma_k^{[i]}.   (11)

One can obtain a sufficient condition for the convergence of the sequence \beta_k^{[i]} from equation (11), which requires that the sequence \gamma_k^{[i]} decay sufficiently fast relative to \beta_k^{[i]}, depending on the discount factor. By iterating over k, one can show that

    \beta_{k+1}^{[i]} \le e^{-\sum_{n=2}^{k+1} \kappa_n}\, \beta_1^{[i]} + \sum_{n=2}^{k} e^{-\sum_{m=n}^{k} \kappa_m}\, \gamma_n^{[i]}.   (12)

The first term on the right-hand side of equation (12) decays to zero because the sequence \{\kappa_n\}_{n \ge 1} is not summable and hence \lim_{k \to \infty} e^{-\sum_{n=2}^{k+1} \kappa_n} = 0. The second term is shown to converge by using the fact that the sequence \gamma_n^{[i]} diminishes (this follows from the convergence of \tilde{J}_k^{[i]}, which is shown in the next theorem). As such, for any \epsilon > 0 we can always pick N large enough so that \gamma_N^{[i]} \le \epsilon. Hence, the second term in equation (12) can be written as

    \sum_{n=2}^{k} e^{-\sum_{m=n}^{k} \kappa_m}\, \gamma_n^{[i]} \le \sum_{n=2}^{N} e^{-\sum_{m=n}^{k} \kappa_m}\, \gamma_n^{[i]} + \epsilon \sum_{n=N+1}^{k} e^{-\sum_{m=n}^{k} \kappa_m}.   (13)

Since \gamma_k^{[i]} converges, for the first term on the right-hand side of equation (13) we have, taking k \to \infty,

    \sum_{n=2}^{N} e^{-\sum_{m=n}^{k} \kappa_m}\, \gamma_n^{[i]} \to 0.   (14)

With some manipulation it can be seen that

    \sum_{n=N+1}^{k} e^{-\sum_{m=n}^{k} \kappa_m}\, \gamma_n^{[i]} \le \epsilon \sum_{n=N+1}^{k} e^{-\sum_{m=n}^{k} \kappa_m}.   (15)

Since \epsilon > 0 can be chosen arbitrarily small, we conclude that \lim_{k \to \infty} \beta_k^{[i]} = 0. Then \lim_{k \to \infty} \| J_k^{[i]} - \bar{J}_k^{[i]} \|_{G_k^{[i]}} = 0 follows from the fact that J_k^{[i]} = \tilde{J}_k^{[i]} and \bar{J}_k^{[i]} = \tilde{\bar{J}}_k^{[i]} on G_k^{[i]}.
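As a side check of the argument following equation (12) (our own illustration, not part of the paper), the non-summability of κ_n = 1/n makes the leading factor vanish numerically:

    import math

    # kappa_n = 1/n is not summable, so exp(-sum_{n=2}^{k+1} kappa_n) -> 0 as k grows.
    for k in (10, 100, 1000, 10000):
        s = sum(1.0 / n for n in range(2, k + 2))
        print(k, math.exp(-s))   # roughly proportional to 1/k; tends to 0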

Theorem 4.2. The estimate of the cost-to-go function J_k^{[i]} obtained by the Feedback iNash-Policy algorithm converges to an optimal cost-to-go function J^{[i]} for all i ∈ V_R. Furthermore, (J^{[i]})_{i ∈ V_R} coincides with the limit point of the sequence (\bar{J}_k^{[i]})_{i ∈ V_R}, i.e., \lim_{k \to \infty} \| J^{[i]} - \bar{J}_k^{[i]} \|_X = 0.

Proof 4. Without loss of generality, we assume A_k = V_R. Consider iteration k of the algorithm when robots 1 through i-1 have already updated their policies and it is the turn of robot i to update its policy. Then J_k^{[i]} is the estimate of the cost-to-go function obtained by the Feedback iNash-Policy algorithm. We extend the definition of J_k^{[i]} to X by interpolation (as in Lemma 4.1) and denote the extension by \hat{J}_k^{[i]}. Since G_{k-1}^{[i]} ⊆ G_k^{[i]} and U_{k-1}^{[i]} ⊆ U_k^{[i]} (Theorem 36 in Karaman and Frazzoli (2011)), and using the best response, it follows that J_{k+1}^{[i]} \le J_k^{[i]} point-wise on G_k^{[i]} (using Lemma 4.2 and Theorem 4.1). Then \{\hat{J}_k^{[i]}\} is a monotonically decreasing sequence (see Lemma 4.1), i.e.,

    \hat{J}_{k+1}^{[i]} \le \hat{J}_k^{[i]},

where the inequality holds point-wise on X. The estimate of the cost-to-go function is lower bounded by zero. Hence, the sequence \{\hat{J}_k^{[i]}\} converges to some limit \bar{J}^{[i]}. The convergence of J_k^{[i]} follows since \lim_{k \to \infty} \| J_k^{[i]} - \hat{J}_k^{[i]} \|_X = 0.

For iteration k, (\bar{J}_k^{[i]})_{i ∈ V_R} is the fixed point (feedback Nash equilibrium) obtained by the iterative best response on \prod_{i ∈ V_R} G_k^{[i]}. We define the interpolation \hat{\bar{J}}_k^{[i]} of \bar{J}_k^{[i]} onto X in the same way as in Lemma 4.1. Using G_{k-1}^{[i]} ⊆ G_k^{[i]}, U_{k-1}^{[i]} ⊆ U_k^{[i]}, Assumption 4.1 and the iterative best response, it follows that \{\hat{\bar{J}}_k^{[i]}\} is a monotonically decreasing sequence which is lower bounded by zero. Hence, it converges to a limit J^{[i]}, i.e.,

    J^{[i]} = \lim_{k \to \infty} \hat{\bar{J}}_k^{[i]},   (16)

where the equality holds point-wise on X. The convergence of \bar{J}_k^{[i]} follows since \lim_{k \to \infty} \bar{J}_k^{[i]} = \lim_{k \to \infty} \hat{\bar{J}}_k^{[i]} point-wise on X (we skip some details here for brevity). From Theorem 4.1 and equation (16), it follows that

    \bar{J}^{[i]} = \lim_{k \to \infty} J_k^{[i]} = J^{[i]},   (17)

where the equality holds point-wise on X. This completes the proof.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we presented and analyzed a distributed anytime algorithm to solve a class of multi-robot motion planning problems. A potential advantage of the presented algorithms is that there is no need for online tracking of trajectories, which is expensive and inefficient in practice. The Feedback iNash-Policy algorithm can find applications in different multi-robot scenarios, such as multi-robot surveillance and autonomous driving, where every robot has individual goals. Analysis of the convergence of the value functions to a feedback Nash equilibrium is a topic of future research.

REFERENCES

Aubin, J. (2009). Viability Theory. Springer.
Baier, C., Katoen, J.P., et al. (2008). Principles of Model Checking. MIT Press, Cambridge.
Bardi, M. and Capuzzo-Dolcetta, I. (1997). Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Birkhäuser.
Bardi, M., Falcone, M., and Soravia, P. (1999). Numerical methods for pursuit-evasion games via viscosity solutions. Annals of the International Society of Dynamic Games, 4, 105–175.


Basar, T. and Olsder, G.J. (1999). Dynamic Noncooperative Game Theory, volume 2. SIAM.
Bertsekas, D.P. (2000). Dynamic Programming and Optimal Control, volume 2. Athena Scientific, Belmont, MA.
Buckley, S.J. (1989). Fast motion planning for multiple moving robots. In Proceedings of the 1989 IEEE International Conference on Robotics and Automation, 322–326. IEEE.
Dubins, L.E. (1957). On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of Mathematics, 497–516.
Huynh, V.A., Karaman, S., and Frazzoli, E. (2012). An incremental sampling-based algorithm for stochastic optimal control. In Proceedings of the 2012 IEEE International Conference on Robotics and Automation (ICRA), 2865–2872. IEEE.
Isaacs, R. (1999). Differential Games: A Mathematical Theory with Applications to Warfare and Pursuit, Control and Optimization. Courier Corporation.
Karaman, S. and Frazzoli, E. (2011). Sampling-based algorithms for optimal motion planning. The International Journal of Robotics Research, 30(7), 846–894.
Kavraki, L., Svestka, P., Latombe, J.C., and Overmars, M. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4), 566–580. doi:10.1109/70.508439.
LaValle, S.M. (2006). Planning Algorithms. Cambridge University Press.
LaValle, S.M. and Hutchinson, S.A. (1998). Optimal motion planning for multiple robots having independent goals. IEEE Transactions on Robotics and Automation, 14(6), 912–925.
LaValle, S.M. and Kuffner, J.J. (2001). Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5), 378–400.
Mueller, E., Zhu, M., Karaman, S., and Frazzoli, E. (2013). Anytime computation algorithms for approach-evasion differential games. arXiv preprint. URL http://arxiv.org/abs/1308.1174.
Sánchez, G. and Latombe, J.C. (2002). On delaying collision checking in PRM planning: Application to multi-robot coordination. The International Journal of Robotics Research, 21(1), 5–26.
Sethian, J.A. (1999). Level Set Methods and Fast Marching Methods: Evolving Interfaces in Computational Geometry, Fluid Mechanics, Computer Vision, and Materials Science, volume 3. Cambridge University Press.
Siméon, T., Leroy, S., and Laumond, J.P. (2002). Path coordination for multiple mobile robots: A resolution-complete algorithm. IEEE Transactions on Robotics and Automation, 18(1), 42–49.
Souganidis, P. (1999). Two-player, zero-sum differential games and viscosity solutions. Annals of the International Society of Dynamic Games, 4(1), 69–104.
Zhu, M., Otte, M., Chaudhari, P., and Frazzoli, E. (2014). Game theoretic controller synthesis for multi-robot motion planning part I: Trajectory based algorithms. In Proceedings of the 2014 IEEE International Conference on Robotics and Automation (ICRA), 1646–1651. IEEE.
