Min Max Generalization for Deterministic Batch Mode Reinforcement Learning
Raphael Fonteneau (1,2)
The work presented in this talk was done in collaboration with: Bernard Boigelot (1), Damien Ernst (1), Quentin Louveaux (1), Susan A. Murphy (3), Louis Wehenkel (1)
(1) Department of EECS, University of Liège, Belgium
(2) Inria Lille – Nord Europe, France
(3) Department of Statistics, University of Michigan, Ann Arbor, USA
SequeL – Inria Lille – Nord Europe, France – June 15, 2012
Introduction
Goal
How to safely control a deterministic system living in a continuous state space, given the knowledge of:
- a batch collection of trajectories of the system,
- the maximal variations of the system (upper bounds on the Lipschitz constants)?
Menu
Introduction
I. Direct approach - CGRL algorithm
II. Reformulation of the original problem - 2 relaxation schemes in the two-stage case
III. Comparison of the 3 proposed solutions
Conclusions and future work
Introduction
Formalization
● Deterministic dynamics: $x_{t+1} = f(x_t, u_t)$, $t = 0, 1, \dots, T-1$.
● Deterministic reward function: $r_t = \rho(x_t, u_t)$.
● Fixed initial state: $x_0 \in \mathcal{X}$.
● Continuous state space $\mathcal{X} \subset \mathbb{R}^d$, finite action space $\mathcal{U}$.
● Return of a sequence of actions $(u_0, \dots, u_{T-1}) \in \mathcal{U}^T$: $J(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \rho(x_t, u_t)$.
● Optimal return: $J^* = \max_{(u_0, \dots, u_{T-1}) \in \mathcal{U}^T} J(u_0, \dots, u_{T-1})$.
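As a concrete illustration of this finite-horizon setting, here is a minimal Python sketch; the dynamics f and reward rho below are toy placeholders chosen for illustration (assumptions), not the system considered in the talk.

import numpy as np

def f(x, u):
    # Toy placeholder dynamics x_{t+1} = f(x_t, u_t) (assumption).
    return x + 0.1 * u

def rho(x, u):
    # Toy placeholder reward r_t = rho(x_t, u_t) (assumption).
    return -float(np.dot(x, x))

def sequence_return(x0, actions):
    # Return J(u_0, ..., u_{T-1}) = sum_t rho(x_t, u_t) along the induced trajectory.
    x, total = np.asarray(x0, dtype=float), 0.0
    for u in actions:
        total += rho(x, u)
        x = f(x, u)
    return total

U = [-1.0, 1.0]                                     # finite action space
print(sequence_return([0.5], [U[0], U[1], U[0]]))   # horizon T = 3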
Introduction
The "batch" setting
● Dynamics and reward function are unknown.
● A set of $n$ one-step transitions is given. We note $\mathcal{F}_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$, where $r^l = \rho(x^l, u^l)$ and $y^l = f(x^l, u^l)$.
● For all actions $u \in \mathcal{U}$, the sample contains at least one transition using $u$.
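For concreteness, a minimal sketch (with made-up numbers) of how such a batch of one-step transitions can be represented, together with the assumption above that every action appears at least once:

import numpy as np

# Each element is one observed transition (x^l, u^l, r^l, y^l); values are made up.
F_n = [
    (np.array([0.50]), -1.0, -0.25, np.array([0.40])),
    (np.array([0.40]),  1.0, -0.16, np.array([0.50])),
    (np.array([0.10]),  1.0, -0.01, np.array([0.20])),
]

U = [-1.0, 1.0]
# Every action of the finite action space appears in at least one transition.
assert all(any(u == u_l for (_, u_l, _, _) in F_n) for u in U)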
Introduction
Lipschitz continuity
● The system dynamics and the reward function are Lipschitz continuous: for all $x, x' \in \mathcal{X}$ and all $u \in \mathcal{U}$, $\|f(x, u) - f(x', u)\| \le L_f \, \|x - x'\|$ and $|\rho(x, u) - \rho(x', u)| \le L_\rho \, \|x - x'\|$, where $\|\cdot\|$ is the Euclidean norm over the state space.
● We assume that two constants $L_f$ and $L_\rho$ satisfying the above inequalities are known.
Introduction
Compatible environments
● Compatible dynamics and reward functions: a pair $(\hat f, \hat \rho)$ is compatible with the data if it is Lipschitz continuous with constants $L_f$ and $L_\rho$ and agrees with every observed transition, i.e. $\hat f(x^l, u^l) = y^l$ and $\hat \rho(x^l, u^l) = r^l$ for all $l$.
● Return of a sequence of actions under a compatible environment: $J^{(\hat f, \hat \rho)}(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \hat \rho(\hat x_t, u_t)$, with $\hat x_0 = x_0$ and $\hat x_{t+1} = \hat f(\hat x_t, u_t)$.
Introduction
Min max approach to generalization
● Worst possible return of a given sequence of actions: $\underline{J}(u_0, \dots, u_{T-1}) = \min_{(\hat f, \hat \rho)\ \text{compatible}} J^{(\hat f, \hat \rho)}(u_0, \dots, u_{T-1})$.
● Our objective: find a sequence of actions maximizing this worst possible return, i.e. solve $\max_{(u_0, \dots, u_{T-1}) \in \mathcal{U}^T} \underline{J}(u_0, \dots, u_{T-1})$.
Menu - Part I: Direct approach - CGRL algorithm
Direct approach
Bounds from a sequence of transitions
● For a given sequence of actions, any sequence of one-step transitions from the sample that is compatible with it (i.e. whose actions match it, in order) yields a lower bound on the "worst possible return" of that sequence of actions (a sketch of this bound follows).
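A hedged sketch of this bound, as it appears in the associated papers (the exact constants are reproduced here as an assumption): for a sequence of transitions $\tau = \big((x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})\big)_{t=0}^{T-1}$ taken from the sample with $u^{l_t} = u_t$ for every $t$,
$$B(\tau) = \sum_{t=0}^{T-1} \Big[ r^{l_t} - L_{Q_{T-t}} \, \| y^{l_{t-1}} - x^{l_t} \| \Big], \qquad y^{l_{-1}} := x_0, \qquad L_{Q_k} = L_\rho \sum_{i=0}^{k-1} L_f^{\,i}.$$
In Python:

import numpy as np

def L_Q(k, L_f, L_rho):
    # Lipschitz constant of the k-steps-to-go value: L_rho * (1 + L_f + ... + L_f^(k-1)).
    return L_rho * sum(L_f ** i for i in range(k))

def sequence_lower_bound(x0, tau, L_f, L_rho):
    # tau = [(x^l, u^l, r^l, y^l), ...] of length T, with u^{l_t} equal to the action
    # taken at time t; returns the lower bound B(tau) sketched above.
    T = len(tau)
    bound, prev_end = 0.0, np.asarray(x0, dtype=float)
    for t, (x_l, _u_l, r_l, y_l) in enumerate(tau):
        bound += r_l - L_Q(T - t, L_f, L_rho) * np.linalg.norm(prev_end - np.asarray(x_l, dtype=float))
        prev_end = np.asarray(y_l, dtype=float)
    return bound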
Direct approach
Maximal lower bound
● One can define a best lower bound over all possible sequences of transitions: $B^*(u_0, \dots, u_{T-1}) = \max_{\tau\ \text{compatible}} B(\tau)$.
● Such a bound is "tight" w.r.t. the dispersion of the batch collection of data: it becomes arbitrarily tight as the dispersion of the sample of transitions goes to zero.
Direct approach
The CGRL algorithm
● One can thus search for the sequence of actions with the highest such lower bound. Finding a maximizing sequence of transitions can be reformulated as a shortest path problem, and the sequence can be found without enumerating all sequences of actions. This is what the CGRL algorithm does (a dynamic-programming sketch follows).

Properties of the CGRL algorithm
● The CGRL solution converges towards an optimal sequence of actions when the dispersion of the sample of transitions converges towards zero.
● If an "optimal trajectory" is available in the batch sample, CGRL returns such an optimal sequence.
● Computational complexity: quadratic w.r.t. the size of the batch collection of transitions.
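A hedged Python sketch of how this maximization over sequences of transitions can be carried out by dynamic programming (the shortest-path view above), reusing the bound and the constant $L_{Q_k}$ from the earlier sketch; the interface and variable names are assumptions.

import numpy as np

def cgrl(x0, transitions, L_f, L_rho, T):
    # transitions: list of (x^l, u^l, r^l, y^l). Returns an action sequence of length T
    # whose lower bound over all sequences of transitions is maximal, together with that
    # bound. Cost: O(T * n^2) for n transitions, without enumerating action sequences.
    n = len(transitions)
    X = np.array([np.atleast_1d(x) for (x, _, _, _) in transitions], dtype=float)
    Y = np.array([np.atleast_1d(y) for (_, _, _, y) in transitions], dtype=float)
    R = np.array([r for (_, _, r, _) in transitions], dtype=float)
    LQ = [L_rho * sum(L_f ** i for i in range(k)) for k in range(T + 1)]

    # best[l]: best partial bound over sequences of transitions ending with transition l.
    best = R - LQ[T] * np.linalg.norm(X - np.atleast_1d(np.asarray(x0, dtype=float)), axis=1)
    gaps = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)   # chaining distances (n, n)
    parents = [np.full(n, -1)]
    for t in range(1, T):
        scores = best[:, None] + R[None, :] - LQ[T - t] * gaps     # rows: previous l', cols: current l
        parents.append(scores.argmax(axis=0))
        best = scores.max(axis=0)

    # Backtrack the maximizing sequence of transitions; the returned actions are u^{l_t}.
    seq = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        seq.append(int(parents[t][seq[-1]]))
    seq.reverse()
    return [transitions[l][1] for l in seq], float(best.max())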
Direct approach
The CGRL algorithm - illustration: the puddle world
[Figure: trajectories computed by CGRL and by FQI (Fitted Q Iteration) when the state space is uniformly covered by the sample.]
[Figure: trajectories computed by CGRL and by FQI when the information about the puddle area is removed from the sample.]
Menu - Part II: Reformulation of the original problem - 2 relaxation schemes in the two-stage case
Reformulation
Reformulation of min max problem
● The inner minimization over compatible environments is rewritten as an optimization over finitely many variables (the unknown states and rewards along the trajectory), subject to the Lipschitz and sample-consistency constraints.

Reformulation
The two-stage case
● The relaxation schemes below are developed for the two-stage case ($T = 2$).
Reformulation
Decomposition
[Diagram: the two-stage problem is decomposed step by step; the successive reformulations are equivalent, one subproblem is solved in closed form, and the remaining subproblem is NP-hard.]
Reformulation
Building relaxation schemes
● We focus on the remaining (NP-hard) problem (a sketch of its shape follows this list).
● We look for relaxation schemes that preserve the nature of the min max generalization problem, i.e. that still offer performance guarantees.
● We thus build relaxation schemes providing lower bounds on the return of the sequence of actions.
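A hedged sketch of the general shape of this two-stage program, with decision variables $\hat r_0$, $\hat r_1$ (the two unknown rewards) and $\hat x_1$ (the unknown intermediate state); the exact constraint set is the one of the associated arXiv paper and is treated here as an assumption:
$$\min_{\hat r_0,\, \hat r_1,\, \hat x_1}\ \hat r_0 + \hat r_1 \quad \text{s.t.} \quad
\begin{cases}
|\hat r_0 - r^l| \le L_\rho \, \|x_0 - x^l\| & \forall l:\ u^l = u_0,\\
|\hat r_1 - r^l| \le L_\rho \, \|\hat x_1 - x^l\| & \forall l:\ u^l = u_1,\\
\|\hat x_1 - y^l\| \le L_f \, \|x_0 - x^l\| & \forall l:\ u^l = u_0.
\end{cases}$$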
Reformulation
Trust-region relaxation scheme
● We keep one constraint of each type: the relaxed problem then becomes a trust-region-type problem that can be solved exactly (a sketch of the closed-form flavor of such subproblems follows).
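To see why keeping a single ball constraint makes the subproblem explicitly solvable, here is a minimal Python sketch of the kind of closed-form step involved (maximizing a distance over a Euclidean ball); it only illustrates the flavor of the relaxed subproblems, not the exact program of the talk.

import numpy as np

def max_distance_over_ball(a, b, delta):
    # Solve max_{||x - b|| <= delta} ||x - a|| in closed form: the optimum equals
    # ||b - a|| + delta, attained by moving from b away from a by delta.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    gap = b - a
    dist = np.linalg.norm(gap)
    direction = gap / dist if dist > 0 else np.eye(len(b))[0]  # any direction if a == b
    return dist + delta, b + delta * direction

# A reward term of the form r - L_rho * ||x - a||, minimized over the ball, then
# evaluates to r - L_rho * (||b - a|| + delta).
val, x_star = max_distance_over_ball([0.0, 0.0], [1.0, 1.0], 0.5)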
Reformulation
Lagrangian relaxation
● The constraints of the NP-hard problem are dualized with nonnegative multipliers; by weak duality, the value of the dual is a lower bound on the value of the original problem, and hence a valid lower bound on the return of the sequence of actions. The resulting problem can be cast as a conic-quadratic program (see the associated arXiv paper).
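As a reminder of the principle this scheme relies on, a tiny Python sketch of weak duality on a toy convex problem (a stand-in chosen for simplicity, not the actual relaxed program): every value of the dual function is a lower bound on the constrained optimum.

import numpy as np

# Toy primal problem: min c.x  s.t.  ||x - b||^2 <= delta^2  (optimum: c.b - ||c|| * delta).
c, b, delta = np.array([1.0, -2.0]), np.array([0.5, 0.5]), 1.0
primal_opt = float(c @ b - np.linalg.norm(c) * delta)

def dual(lam):
    # Dual function q(lam) = min_x [ c.x + lam * (||x - b||^2 - delta^2) ],
    # minimized in closed form at x = b - c / (2 * lam) for lam > 0.
    return float(c @ b - (c @ c) / (4.0 * lam) - lam * delta ** 2)

for lam in (0.1, 0.5, 1.0, 5.0):
    assert dual(lam) <= primal_opt + 1e-12   # weak duality: never exceeds the primal optimum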
Menu - Part III: Comparison of the 3 proposed solutions
Comparison
Direct approach vs Trust-region relaxation
● The trust-region relaxation yields lower bounds that are always greater than or equal to the bounds computed by the CGRL algorithm.

Comparison
Trust-region vs Lagrangian relaxation
● The Lagrangian relaxation yields lower bounds that are always greater than or equal to the trust-region bounds.

Comparison
Synthesis
● Ordering of the lower bounds: CGRL ≤ trust-region ≤ Lagrangian ≤ worst possible return.
Comparison
Illustration
● Dynamics:
● Reward function:
● Initial state:
● Action space:
● Grid generation:
● 100 batch collections of transitions drawn uniformly at random.
Comparison
Illustration - Maximal bounds
[Figure: maximal lower bounds obtained by the compared approaches, for grid-generated and uniformly sampled batches.]

Comparison
Illustration - Returns
[Figure: returns of the action sequences obtained by the compared approaches, for grid-generated and uniformly sampled batches.]
Menu - Conclusions and future work
Conclusions and future work
(non) Conclusion
● Problem still unsolved.

Conclusions and future work
Future work
● T-stage reformulation
● Stochastic case
● Exact solution for small dimensions?
● Infinite horizon
Associated publications
● "Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes". R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux. arXiv:1202.5298v1, 2012.
● "Towards Min Max Generalization in Reinforcement Learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Communications in Computer and Information Science (CCIS), Volume 129, pp. 61-77. Editors: J. Filipe, A. Fred, and B. Sharp. Springer, Heidelberg, 2011.
● "A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
● "Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, 30 March - 2 April, 2009.