LP linearization used for OWA minimization: for each k, the sum of the k largest regrets equals
$$\min_{t_k,\, d_{ik}} \Big\{\, k\,t_k + \sum_{i=1}^{n} d_{ik} \;:\; \eta_i \le t_k + d_{ik},\;\ d_{ik} \ge 0\;\ \forall i \,\Big\},$$
hence
$$\min \mathrm{OWA}(\eta) \;=\; \min \sum_{k=1}^{n} w_k \Big( k\,t_k + \sum_{i=1}^{n} d_{ik} \Big) \quad \text{s.t.}\quad \eta_i \le t_k + d_{ik}\;\ \forall i,k, \qquad d_{ik} \ge 0\;\ \forall i,k.$$
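A hedged sketch checking this linearization with scipy (all names are illustrative; the per-term weights $w'_k$ are taken as consecutive differences of decreasing OWA weights $\omega$, the standard construction):

```python
import numpy as np
from scipy.optimize import linprog

eta = np.array([6.0, 2.0, 4.0])                   # fixed regret vector
omega = np.array([0.5, 0.3, 0.2])                 # decreasing OWA weights
n = len(eta)
wp = omega - np.append(omega[1:], 0.0)            # w'_k = omega_k - omega_{k+1}

# Variables: x = [t_1..t_n, d_11..d_nn], with d_ik at index n + i*n + k.
c = np.concatenate([wp * np.arange(1, n + 1),     # sum_k w'_k * k * t_k
                    np.tile(wp, n)])              # sum_k w'_k * d_ik, each i
A, b = [], []
for i in range(n):
    for k in range(n):
        row = np.zeros(n + n * n)
        row[k] = -1.0                             # -t_k
        row[n + i * n + k] = -1.0                 # -d_ik
        A.append(row)
        b.append(-eta[i])                         # eta_i <= t_k + d_ik
bounds = [(None, None)] * n + [(0, None)] * (n * n)  # t free, d >= 0
res = linprog(c, A_ub=A, b_ub=b, bounds=bounds)
print(res.fun)   # ~4.6 = OWA(eta) for these weights
```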
A Compromise Programming Approach to Multiobjective Markov Decision Processes

Wlodzimierz Ogryczak¹, Patrice Perny and Paul Weng
LIP6 - UPMC, Paris, France
June 13-17, 2011
21st International Conference on MCDM, Jyväskylä, Finland
¹ On leave from Warsaw University of Technology, Poland.
Sequential Decision Making under Uncertainty

[Figure: MDP transition diagram — from state s, taking action a yields reward r(s, a) and leads to state s′ with probability p(s, a, s′).]
Markov Decision Processes (MDPs)

Definition
• S: set of states
• A: set of actions
• $p : S \times A \times S \to [0, 1]$: transition function
• $r : S \times A \to \mathbb{R}$: reward function

Solution
• Pure/randomized decision rule δ
• (Stationary) pure policy π
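As a concrete illustration, such an MDP could be held in a small container like the sketch below; the names and the dict-based encoding are assumptions, not from the slides:

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: list         # S
    actions: list        # A
    p: dict              # (s, a, s') -> transition probability
    r: dict              # (s, a) -> immediate reward
    gamma: float = 0.95  # discount factor (gamma in the slides)
```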
Value Functions and Solution Methods

Value functions
• $v_t^\pi(s) = r(s, \delta_t(s)) + \gamma \sum_{s' \in S} p(s, \delta_t(s), s')\, v_{t-1}^\pi(s')$
• $\pi \succsim \pi' \Leftrightarrow \forall s,\ v^\pi(s) \ge v^{\pi'}(s)$
• $v^*(s) = \max_{a \in A}\ r(s, a) + \gamma \sum_{s' \in S} p(s, a, s')\, v^*(s')$

Family of solution methods
• Value/Policy iteration
• LP
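The $v^*$ equation above is the fixed point that value iteration computes by repeated Bellman backups; a minimal sketch, reusing the illustrative MDP container from earlier:

```python
def value_iteration(mdp, n_iter=1000, tol=1e-6):
    """Iterate the Bellman optimality operator until convergence."""
    v = {s: 0.0 for s in mdp.states}
    for _ in range(n_iter):
        v_new = {}
        for s in mdp.states:
            # One-step lookahead: best action value from state s.
            v_new[s] = max(
                mdp.r[(s, a)] + mdp.gamma * sum(
                    mdp.p.get((s, a, s2), 0.0) * v[s2] for s2 in mdp.states
                )
                for a in mdp.actions
            )
        if max(abs(v_new[s] - v[s]) for s in mdp.states) < tol:
            return v_new
        v = v_new
    return v
```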
Multiobjective MDPs (MMDPs)

[Figure: the same transition diagram, now with vector-valued rewards $(R_1(s, a), \ldots, R_n(s, a))$ and transition probability p(s, a, s′).]
Multiobjective MDPs (MMDPs)

Definition
• $R : S \times A \to \mathbb{R}^n$ (n criteria)
• $V^\pi(s) \in \mathbb{R}^n$

Value functions
• $V_t^\pi(s) = R(s, \delta_t(s)) + \gamma \sum_{s' \in S} p(s, \delta_t(s), s')\, V_{t-1}^\pi(s')$
• $\pi \succsim \pi' \Leftrightarrow \forall s,\ V^\pi(s) \ge_P V^{\pi'}(s)$ (Pareto dominance)
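For a fixed policy, the vector-valued backup is the same linear operator applied criterion-wise; a hedged sketch with numpy (the matrix encoding and names are assumptions):

```python
import numpy as np

def evaluate_policy_vector(P, R, gamma, n_iter=500):
    """Evaluate V^pi in R^n for a fixed policy pi.

    P: (|S|, |S|) matrix with P[s, s2] = p(s, pi(s), s2)
    R: (|S|, n) matrix of vector rewards R(s, pi(s))
    """
    V = np.zeros(R.shape)
    for _ in range(n_iter):
        V = R + gamma * P @ V   # one backup for all n criteria jointly
    return V
```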
Scalarizing Function for Compromise Search

Example
[Figure: value vectors of four policies a, b, c, d — left panel: pure policies; right panel: randomized policies.]

Scalarizing Function ψ
• $\psi : \mathbb{R}^n \to \mathbb{R}$, monotonic with respect to Pareto dominance
• $v(s) = \psi(V_1(s), \ldots, V_n(s))$
• A weighted sum provides no control over tradeoffs
Reference Point Method (RPM)

Generic Scalarizing Achievement Function (Wierzbicki, 82)
$$\psi_\varepsilon(y) = (1-\varepsilon)\,\max_{i=1\ldots n} \sigma_i(y_i) \;+\; \frac{\varepsilon}{n}\sum_{i=1}^{n} \sigma_i(y_i)$$
where each partial achievement $\sigma_i$ is a piecewise-linear regret equal to 1 at the reservation level $r_i^r$ and 0 at the aspiration level $r_i^a$, with a steep slope governed by $\alpha$ below the reservation level and a gentle slope governed by $\beta$ above the aspiration level ($\alpha > 1 > \beta > 0$):
$$\sigma_i(y_i) = \frac{1}{r_i^a - r_i^r}\,\max\big\{\, \alpha(r_i^r - y_i) + (r_i^a - r_i^r),\;\ r_i^a - y_i,\;\ \beta(r_i^a - y_i) \,\big\}$$

[Figure: plot of $\eta_i = \sigma_i(y_i)$ against $y_i$ — value 1 at $r_i^r$, 0 at $r_i^a$, slope α below $r_i^r$, slope β above $r_i^a$.]
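A small sketch of these two functions (the α, β, ε default values are illustrative):

```python
def sigma(y_i, r_a, r_r, alpha=4.0, beta=0.25):
    """Piecewise-linear regret: 1 at the reservation level, 0 at aspiration."""
    scale = r_a - r_r
    return max(alpha * (r_r - y_i) + scale,  # steep penalty below reservation
               r_a - y_i,                    # linear regret in between
               beta * (r_a - y_i)) / scale   # gentle slope above aspiration

def psi(y, r_a, r_r, eps=0.01):
    """Wierzbicki-style achievement: worst regret, regularized by the mean."""
    regrets = [sigma(yi, ra, rr) for yi, ra, rr in zip(y, r_a, r_r)]
    return (1 - eps) * max(regrets) + eps * sum(regrets) / len(regrets)
```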
RPM with an OWA

OWA
$$\mathrm{OWA}(\eta) = \sum_{i=1}^{n} \omega_i\, \eta_{\langle i \rangle} \qquad \text{where } \eta_i = \sigma_i(y_i)\ \ \forall i = 1 \ldots n,$$
where $\omega_1 > \omega_2 > \ldots > \omega_n > 0$ and $\eta_{\langle 1 \rangle} \ge \eta_{\langle 2 \rangle} \ge \ldots \ge \eta_{\langle n \rangle}$.

Example
$r^r = (0, 0, 0)$, $r^a = (10, 10, 10)$, $\omega = (5/10, 3/10, 2/10)$

y          η          η⟨1⟩  η⟨2⟩  η⟨3⟩  OWA  ψ0  ψε
(4, 5, 9)  (6, 5, 1)  6     5     1     4.7  6   6 + 4ε
(4, 8, 6)  (6, 2, 4)  6     4     2     4.6  6   6 + 4ε
(4, 7, 7)  (6, 3, 3)  6     3     3     4.5  6   6 + 4ε

Unlike ψ0 and ψε, which leave all three vectors tied, OWA discriminates between them and prefers the best-balanced one, (4, 7, 7).
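The OWA column can be reproduced directly; a short sketch using the table's unnormalized regrets $\eta_i = r_i^a - y_i$:

```python
def owa(eta, weights=(0.5, 0.3, 0.2)):
    """Ordered weighted average: the largest regret gets the largest weight."""
    return sum(w * e for w, e in zip(weights, sorted(eta, reverse=True)))

for y in [(4, 5, 9), (4, 8, 6), (4, 7, 7)]:
    eta = [10 - yi for yi in y]   # regrets w.r.t. aspiration r^a = 10
    print(y, eta, owa(eta))       # OWA: 4.7, 4.6, 4.5
```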
Main properties of OWA
• Symmetry: $\mathrm{OWA}(\eta_1, \ldots, \eta_n) = \mathrm{OWA}(\eta_{\tau(1)}, \ldots, \eta_{\tau(n)})$ for any permutation τ
• Pareto-monotonicity: $\eta >_P \eta' \Rightarrow \mathrm{OWA}(\eta) > \mathrm{OWA}(\eta')$
• Fairness (monotonicity w.r.t. Pigou-Dalton transfers): $\forall i, j \in \{1, \ldots, n\}$ s.t. $\eta_i > \eta_j$, $\forall \varepsilon \in (0, \eta_i - \eta_j)$,
$$\mathrm{OWA}(\eta_1, \ldots, \eta_i - \varepsilon, \ldots, \eta_j + \varepsilon, \ldots, \eta_n) < \mathrm{OWA}(\eta_1, \ldots, \eta_n)$$
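Continuing the OWA sketch above, a quick numeric check of the fairness property (transferring regret from a worse-off criterion to a better-off one lowers the score):

```python
eta = [6.0, 2.0, 4.0]
eta_transfer = [5.0, 3.0, 4.0]          # Pigou-Dalton transfer of 1 from eta_1 to eta_2
assert owa(eta_transfer) < owa(eta)     # 4.3 < 4.6: the more balanced vector wins
```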
RPM with a Weighted OWA

OWA is symmetric on regrets ⇒ WOWA

[Figure: different importance weights select different compromise solutions (points labeled I and A).]

RPM WOWA
$$\mathrm{WOWA}(\eta) = \sum_{i=1}^{n} w_i(\lambda, \eta)\, \eta_{\langle i \rangle}$$
where $w_i(\lambda, \eta) = \varphi\big(\textstyle\sum_{k \le i} \lambda_{\tau(k)}\big) - \varphi\big(\textstyle\sum_{k < i} \lambda_{\tau(k)}\big)$, with τ the permutation sorting η in nonincreasing order, λ the importance weights, and ϕ the piecewise-linear function interpolating the cumulated OWA weights.

Theorem
For any weights $\omega_1 > \omega_2 > \ldots > \omega_n > 0$ and any positive importance weights $\lambda_i$, if $\bar{y}$ is properly nondominated with tradeoffs bounded by $\Delta = n\bar{\lambda}\beta\omega_1 / (1 - n\bar{\lambda}\omega_1)$, where $\bar{\lambda} = \min_{i \in I} \lambda_i$, i.e. if for any attainable outcome vector $y$ the implication
$$y_i > \bar{y}_i \ \text{ and } \ y_k < \bar{y}_k \quad \Rightarrow \quad \frac{y_i - \bar{y}_i}{\bar{y}_k - y_k} \;\le\; \Delta = \frac{n\bar{\lambda}\beta\omega_1}{1 - n\bar{\lambda}\omega_1} \qquad (1)$$
is valid for any $i, k \in I$, then there exist aspiration levels $r_i^a$ and reservation levels $r_i^r$ such that $\bar{y}$ is an optimal solution of the corresponding problem.
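A sketch of the WOWA computation under the usual construction (ϕ linearly interpolating the cumulated ω-weights; the names and the interpolation choice are assumptions):

```python
import numpy as np

def wowa(eta, omega, lam):
    """Weighted OWA: OWA weights omega (decreasing), importance weights lam."""
    omega, lam = np.asarray(omega), np.asarray(lam)
    lam = lam / lam.sum()                 # normalize importance weights
    n = len(eta)
    # phi interpolates the cumulated OWA weights: phi(k/n) = omega_1+...+omega_k.
    xs = np.arange(n + 1) / n
    ys = np.concatenate(([0.0], np.cumsum(omega)))
    phi = lambda t: np.interp(t, xs, ys)
    tau = np.argsort(eta)[::-1]           # sort regrets in nonincreasing order
    cum = np.concatenate(([0.0], np.cumsum(lam[tau])))
    w = phi(cum[1:]) - phi(cum[:-1])      # w_i(lambda, eta)
    return float(np.dot(w, np.asarray(eta)[tau]))
```

With uniform importance weights λ = (1/n, …, 1/n), the increments of ϕ reduce to the ω's and WOWA collapses back to plain OWA.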
Example: Path-planning problems
Do we really need all the Pareto optimal solutions?

Example adapted from (Hansen, 80)

[Figure: chain of states 0, 1, …, N; at each state i, two actions lead to the next state, one with reward vector $(0, 2^i)$, the other with $(2^i, 0)$.]

For every pure policy π:
$$V_1^\pi(0) + V_2^\pi(0) = \sum_{i=0}^{N} 2^i = 2^{N+1} - 1$$

• Every pure policy attains a distinct value vector on this line, so all of them are Pareto optimal: the number of Pareto optimal pure policies grows exponentially with the number of states
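A tiny enumeration illustrating the blow-up (an illustrative sketch of this chain example):

```python
from itertools import product

N = 4  # one binary choice (which reward vector to take) at each of states 0..N
values = set()
for p in product([0, 1], repeat=N + 1):
    v1 = sum(2 ** i for i, c in enumerate(p) if c == 0)
    v2 = sum(2 ** i for i, c in enumerate(p) if c == 1)
    values.add((v1, v2))             # v1 + v2 == 2**(N+1) - 1 for every policy
# Every pure policy reaches a distinct point on that line, so all are Pareto optimal.
print(len(values), 2 ** (N + 1))     # 32 32
```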