Min Max Generalization for Deterministic Batch Mode Reinforcement Learning
Raphael Fonteneau (1,2)
The work presented in this talk was done in collaboration with: Bernard Boigelot (1), Damien Ernst (1), Quentin Louveaux (1), Susan A. Murphy (3), Louis Wehenkel (1)
(1) Department of EECS, University of Liège, Belgium
(2) Inria Lille – Nord Europe, France
(3) Department of Statistics, University of Michigan, Ann Arbor, USA
SequeL – Inria Lille – Nord Europe, France – June 15, 2012
Introduction
Goal
How to safely control a deterministic system living in a continuous state space, given the knowledge of:
- a batch collection of trajectories of the system,
- the maximal variations of the system (upper bounds on the Lipschitz constants)?
Menu
Introduction
I. Direct approach - CGRL algorithm
II. Reformulation of the original problem - 2 relaxation schemes in the two-stage case
III. Comparison of the 3 proposed solutions
Conclusions and future work
Introduction
Formalization
● Deterministic dynamics: $x_{t+1} = f(x_t, u_t)$, $t = 0, 1, \dots, T-1$.
● Deterministic reward function: $r_t = \rho(x_t, u_t)$.
● Fixed initial state: $x_0 \in \mathcal{X}$.
● Continuous state space $\mathcal{X} \subset \mathbb{R}^d$, finite action space $\mathcal{U}$.
● Return of a sequence of actions $(u_0, \dots, u_{T-1}) \in \mathcal{U}^T$: $J(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \rho(x_t, u_t)$.
● Optimal return: $J^* = \max_{(u_0, \dots, u_{T-1}) \in \mathcal{U}^T} J(u_0, \dots, u_{T-1})$.
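As a concrete illustration of this finite-horizon setting, here is a minimal Python sketch; the dynamics f and reward rho below are toy placeholders chosen for illustration (assumptions), not the system considered in the talk.

import numpy as np

def f(x, u):
    # Toy placeholder dynamics x_{t+1} = f(x_t, u_t) (assumption).
    return x + 0.1 * u

def rho(x, u):
    # Toy placeholder reward r_t = rho(x_t, u_t) (assumption).
    return -float(np.dot(x, x))

def sequence_return(x0, actions):
    # Return J(u_0, ..., u_{T-1}) = sum_t rho(x_t, u_t) along the induced trajectory.
    x, total = np.asarray(x0, dtype=float), 0.0
    for u in actions:
        total += rho(x, u)
        x = f(x, u)
    return total

U = [-1.0, 1.0]                                     # finite action space
print(sequence_return([0.5], [U[0], U[1], U[0]]))   # horizon T = 3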
Introduction
The "batch" setting
● Dynamics and reward function are unknown.
● A set of $n$ one-step transitions is given. We note $\mathcal{F}_n = \{(x^l, u^l, r^l, y^l)\}_{l=1}^{n}$, where $r^l = \rho(x^l, u^l)$ and $y^l = f(x^l, u^l)$.
● For all actions $u \in \mathcal{U}$, the sample contains at least one transition using $u$.
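For concreteness, a minimal sketch (with made-up numbers) of how such a batch of one-step transitions can be represented, together with the assumption above that every action appears at least once:

import numpy as np

# Each element is one observed transition (x^l, u^l, r^l, y^l); values are made up.
F_n = [
    (np.array([0.50]), -1.0, -0.25, np.array([0.40])),
    (np.array([0.40]),  1.0, -0.16, np.array([0.50])),
    (np.array([0.10]),  1.0, -0.01, np.array([0.20])),
]

U = [-1.0, 1.0]
# Every action of the finite action space appears in at least one transition.
assert all(any(u == u_l for (_, u_l, _, _) in F_n) for u in U)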
Introduction
Lipschitz continuity
● The system dynamics and the reward function are Lipschitz continuous: for all $x, x' \in \mathcal{X}$ and all $u \in \mathcal{U}$, $\|f(x, u) - f(x', u)\| \le L_f \, \|x - x'\|$ and $|\rho(x, u) - \rho(x', u)| \le L_\rho \, \|x - x'\|$, where $\|\cdot\|$ is the Euclidean norm over the state space.
● We assume that two constants $L_f$ and $L_\rho$ satisfying the above inequalities are known.
Introduction
Compatible environments
● Compatible dynamics and reward functions: a pair $(\hat f, \hat \rho)$ is compatible with the data if it is Lipschitz continuous with constants $L_f$ and $L_\rho$ and agrees with every observed transition, i.e. $\hat f(x^l, u^l) = y^l$ and $\hat \rho(x^l, u^l) = r^l$ for all $l$.
● Return of a sequence of actions under a compatible environment: $J^{(\hat f, \hat \rho)}(u_0, \dots, u_{T-1}) = \sum_{t=0}^{T-1} \hat \rho(\hat x_t, u_t)$, with $\hat x_0 = x_0$ and $\hat x_{t+1} = \hat f(\hat x_t, u_t)$.
Introduction
Min max approach to generalization
● Worst possible return of a given sequence of actions: $\underline{J}(u_0, \dots, u_{T-1}) = \min_{(\hat f, \hat \rho)\ \text{compatible}} J^{(\hat f, \hat \rho)}(u_0, \dots, u_{T-1})$.
● Our objective: find a sequence of actions maximizing this worst possible return, i.e. solve $\max_{(u_0, \dots, u_{T-1}) \in \mathcal{U}^T} \underline{J}(u_0, \dots, u_{T-1})$.
Menu - Part I: Direct approach - CGRL algorithm
Direct approach
Bounds from a sequence of transitions
● For a given sequence of actions, any sequence of one-step transitions from the sample that is compatible with it (i.e. whose actions match it, in order) yields a lower bound on the "worst possible return" of that sequence of actions (a sketch of this bound follows).
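A hedged sketch of this bound, as it appears in the associated papers (the exact constants are reproduced here as an assumption): for a sequence of transitions $\tau = \big((x^{l_t}, u^{l_t}, r^{l_t}, y^{l_t})\big)_{t=0}^{T-1}$ taken from the sample with $u^{l_t} = u_t$ for every $t$,
$$B(\tau) = \sum_{t=0}^{T-1} \Big[ r^{l_t} - L_{Q_{T-t}} \, \| y^{l_{t-1}} - x^{l_t} \| \Big], \qquad y^{l_{-1}} := x_0, \qquad L_{Q_k} = L_\rho \sum_{i=0}^{k-1} L_f^{\,i}.$$
In Python:

import numpy as np

def L_Q(k, L_f, L_rho):
    # Lipschitz constant of the k-steps-to-go value: L_rho * (1 + L_f + ... + L_f^(k-1)).
    return L_rho * sum(L_f ** i for i in range(k))

def sequence_lower_bound(x0, tau, L_f, L_rho):
    # tau = [(x^l, u^l, r^l, y^l), ...] of length T, with u^{l_t} equal to the action
    # taken at time t; returns the lower bound B(tau) sketched above.
    T = len(tau)
    bound, prev_end = 0.0, np.asarray(x0, dtype=float)
    for t, (x_l, _u_l, r_l, y_l) in enumerate(tau):
        bound += r_l - L_Q(T - t, L_f, L_rho) * np.linalg.norm(prev_end - np.asarray(x_l, dtype=float))
        prev_end = np.asarray(y_l, dtype=float)
    return bound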
Direct approach
Maximal lower bound
● One can define a best lower bound over all possible sequences of transitions: $B^*(u_0, \dots, u_{T-1}) = \max_{\tau\ \text{compatible}} B(\tau)$.
● Such a bound is "tight" w.r.t. the dispersion of the batch collection of data: it becomes arbitrarily tight as the dispersion of the sample of transitions goes to zero.
Direct approach
The CGRL algorithm
● One can thus search for the sequence of actions with the highest such lower bound. Finding a maximizing sequence of transitions can be reformulated as a shortest path problem, and the sequence can be found without enumerating all sequences of actions. This is what the CGRL algorithm does (a dynamic-programming sketch follows).

Properties of the CGRL algorithm
● The CGRL solution converges towards an optimal sequence of actions when the dispersion of the sample of transitions converges towards zero.
● If an "optimal trajectory" is available in the batch sample, CGRL returns such an optimal sequence.
● Computational complexity: quadratic w.r.t. the size of the batch collection of transitions.
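A hedged Python sketch of how this maximization over sequences of transitions can be carried out by dynamic programming (the shortest-path view above), reusing the bound and the constant $L_{Q_k}$ from the earlier sketch; the interface and variable names are assumptions.

import numpy as np

def cgrl(x0, transitions, L_f, L_rho, T):
    # transitions: list of (x^l, u^l, r^l, y^l). Returns an action sequence of length T
    # whose lower bound over all sequences of transitions is maximal, together with that
    # bound. Cost: O(T * n^2) for n transitions, without enumerating action sequences.
    n = len(transitions)
    X = np.array([np.atleast_1d(x) for (x, _, _, _) in transitions], dtype=float)
    Y = np.array([np.atleast_1d(y) for (_, _, _, y) in transitions], dtype=float)
    R = np.array([r for (_, _, r, _) in transitions], dtype=float)
    LQ = [L_rho * sum(L_f ** i for i in range(k)) for k in range(T + 1)]

    # best[l]: best partial bound over sequences of transitions ending with transition l.
    best = R - LQ[T] * np.linalg.norm(X - np.atleast_1d(np.asarray(x0, dtype=float)), axis=1)
    gaps = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)   # chaining distances (n, n)
    parents = [np.full(n, -1)]
    for t in range(1, T):
        scores = best[:, None] + R[None, :] - LQ[T - t] * gaps     # rows: previous l', cols: current l
        parents.append(scores.argmax(axis=0))
        best = scores.max(axis=0)

    # Backtrack the maximizing sequence of transitions; the returned actions are u^{l_t}.
    seq = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        seq.append(int(parents[t][seq[-1]]))
    seq.reverse()
    return [transitions[l][1] for l in seq], float(best.max())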
Direct approach
The CGRL algorithm - illustration: the puddle world
[Figure: trajectories computed by CGRL and by FQI (Fitted Q Iteration) when the state space is uniformly covered by the sample.]
[Figure: trajectories computed by CGRL and by FQI when the information about the puddle area is removed from the sample.]
Menu - Part II: Reformulation of the original problem - 2 relaxation schemes in the two-stage case
Reformulation
Reformulation of min max problem
● The inner minimization over compatible environments is rewritten as an optimization over finitely many variables (the unknown states and rewards along the trajectory), subject to the Lipschitz and sample-consistency constraints.

Reformulation
The two-stage case
● The relaxation schemes below are developed for the two-stage case ($T = 2$).
Reformulation
Decomposition
[Diagram: the two-stage problem is decomposed step by step; the successive reformulations are equivalent, one subproblem is solved in closed form, and the remaining subproblem is NP-hard.]
Reformulation
Building relaxation schemes
● We focus on the remaining (NP-hard) problem (a sketch of its shape follows this list).
● We look for relaxation schemes that preserve the nature of the min max generalization problem, i.e. that still offer performance guarantees.
● We thus build relaxation schemes providing lower bounds on the return of the sequence of actions.
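A hedged sketch of the general shape of this two-stage program, with decision variables $\hat r_0$, $\hat r_1$ (the two unknown rewards) and $\hat x_1$ (the unknown intermediate state); the exact constraint set is the one of the associated arXiv paper and is treated here as an assumption:
$$\min_{\hat r_0,\, \hat r_1,\, \hat x_1}\ \hat r_0 + \hat r_1 \quad \text{s.t.} \quad
\begin{cases}
|\hat r_0 - r^l| \le L_\rho \, \|x_0 - x^l\| & \forall l:\ u^l = u_0,\\
|\hat r_1 - r^l| \le L_\rho \, \|\hat x_1 - x^l\| & \forall l:\ u^l = u_1,\\
\|\hat x_1 - y^l\| \le L_f \, \|x_0 - x^l\| & \forall l:\ u^l = u_0.
\end{cases}$$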
Reformulation
Trust-region relaxation scheme
● We keep one constraint of each type: the relaxed problem then becomes a trust-region-type problem that can be solved exactly (a sketch of the closed-form flavor of such subproblems follows).
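To see why keeping a single ball constraint makes the subproblem explicitly solvable, here is a minimal Python sketch of the kind of closed-form step involved (maximizing a distance over a Euclidean ball); it only illustrates the flavor of the relaxed subproblems, not the exact program of the talk.

import numpy as np

def max_distance_over_ball(a, b, delta):
    # Solve max_{||x - b|| <= delta} ||x - a|| in closed form: the optimum equals
    # ||b - a|| + delta, attained by moving from b away from a by delta.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    gap = b - a
    dist = np.linalg.norm(gap)
    direction = gap / dist if dist > 0 else np.eye(len(b))[0]  # any direction if a == b
    return dist + delta, b + delta * direction

# A reward term of the form r - L_rho * ||x - a||, minimized over the ball, then
# evaluates to r - L_rho * (||b - a|| + delta).
val, x_star = max_distance_over_ball([0.0, 0.0], [1.0, 1.0], 0.5)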
Reformulation
Lagrangian relaxation
● The constraints of the NP-hard problem are dualized with nonnegative multipliers; by weak duality, the value of the dual is a lower bound on the value of the original problem, and hence a valid lower bound on the return of the sequence of actions. The resulting problem can be cast as a conic-quadratic program (see the associated arXiv paper).
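As a reminder of the principle this scheme relies on, a tiny Python sketch of weak duality on a toy convex problem (a stand-in chosen for simplicity, not the actual relaxed program): every value of the dual function is a lower bound on the constrained optimum.

import numpy as np

# Toy primal problem: min c.x  s.t.  ||x - b||^2 <= delta^2  (optimum: c.b - ||c|| * delta).
c, b, delta = np.array([1.0, -2.0]), np.array([0.5, 0.5]), 1.0
primal_opt = float(c @ b - np.linalg.norm(c) * delta)

def dual(lam):
    # Dual function q(lam) = min_x [ c.x + lam * (||x - b||^2 - delta^2) ],
    # minimized in closed form at x = b - c / (2 * lam) for lam > 0.
    return float(c @ b - (c @ c) / (4.0 * lam) - lam * delta ** 2)

for lam in (0.1, 0.5, 1.0, 5.0):
    assert dual(lam) <= primal_opt + 1e-12   # weak duality: never exceeds the primal optimum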
Menu - Part III: Comparison of the 3 proposed solutions
Comparison
Direct approach vs Trust-region relaxation
● The trust-region relaxation yields lower bounds that are always greater than or equal to the bounds computed by the CGRL algorithm.

Comparison
Trust-region vs Lagrangian relaxation
● The Lagrangian relaxation yields lower bounds that are always greater than or equal to the trust-region bounds.

Comparison
Synthesis
● Ordering of the lower bounds: CGRL ≤ trust-region ≤ Lagrangian ≤ worst possible return.
Comparison
Illustration
● Dynamics:
● Reward function:
● Initial state:
● Action space:
● Grid generation:
● 100 batch collections of transitions drawn uniformly at random.
Comparison
Illustration - Maximal bounds
[Figure: maximal lower bounds obtained by the compared approaches, for grid-generated and uniformly sampled batches.]

Comparison
Illustration - Returns
[Figure: returns of the action sequences obtained by the compared approaches, for grid-generated and uniformly sampled batches.]
Menu - Conclusions and future work
Conclusions and future work
(non) Conclusion
● Problem still unsolved.

Conclusions and future work
Future work
● T-stage reformulation
● Stochastic case
● Exact solution for small dimensions?
● Infinite horizon
Associated publications
● "Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes". R. Fonteneau, D. Ernst, B. Boigelot, Q. Louveaux. arXiv:1202.5298v1, 2012.
● "Towards Min Max Generalization in Reinforcement Learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Agents and Artificial Intelligence: International Conference, ICAART 2010, Valencia, Spain, January 2010, Revised Selected Papers. Communications in Computer and Information Science (CCIS), Volume 129, pp. 61-77. Editors: J. Filipe, A. Fred, and B. Sharp. Springer, Heidelberg, 2011.
● "A cautious approach to generalization in reinforcement learning". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART 2010), 10 pages, Valencia, Spain, January 22-24, 2010.
● "Inferring bounds on the performance of a control policy from a sample of trajectories". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst. Proceedings of the IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), 7 pages, Nashville, Tennessee, USA, 30 March - 2 April, 2009.