Point-Based Value Iteration for Constrained POMDPs

Dongho Kim, Jaesong Lee, Kee-Eung Kim (Department of Computer Science)
Pascal Poupart (School of Computer Science)

IJCAI-2011, July 22, 2011

Motivation

[Figure: agent-environment loop; the agent executes actions and receives observations from the environment in pursuit of its goals]

• Partially observable Markov decision processes (POMDPs) [Kaelbling98]
   Modeling sequential decision making under partial or uncertain observations

 A single reward function encodes the immediate utility of executing actions.
 Different objectives must be manually balanced into this single reward function.

• Constrained POMDPs (CPOMDPs)
   Problems with limited resources or multiple objectives
   Maximizing one objective (reward) while constraining other objectives (costs)
   CPOMDPs have not received as much attention as constrained MDPs (CMDPs) [Altman99]
  • Exception: a DP method for finding deterministic policies [Isom08]


Motivation

• Resource-limited agent, e.g., a battery-equipped robot
   Accomplish as many goals as possible given a finite amount of energy

• Spoken dialogue system [Williams07]
   e.g., minimize dialogue length while guaranteeing a 95% dialogue success rate

 Reward: −1 for each dialogue turn
 Cost: +1 for each unsuccessful dialogue, 0 for each successful dialogue

[Figure: dialogue trajectory s₀ → s₁ → s₂ → … → s_T; each turn has R = −1, C = 0; at s_T, C = +1 if the dialogue is unsuccessful and C = 0 if it is successful]

• Goal: maximize 𝔼[∑_t γᵗ r_t]  s.t.  𝔼[∑_t γᵗ c_t] ≤ ĉ
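As a concrete reading of this objective, the following sketch (with hypothetical trajectory data, discount factor, and cost bound, none of which come from the slides) estimates the expected discounted reward and cost from sampled dialogues and checks the cost constraint:

```python
def discounted_sum(values, gamma):
    """Return sum_t gamma^t * values[t]."""
    return sum(gamma ** t * v for t, v in enumerate(values))

# Two sampled dialogues: R = -1 per turn; terminal cost C = +1 only if
# the dialogue is unsuccessful (numbers are illustrative).
trajectories = [
    {"r": [-1, -1, -1],     "c": [0, 0, 0]},     # successful, 3 turns
    {"r": [-1, -1, -1, -1], "c": [0, 0, 0, 1]},  # unsuccessful, 4 turns
]

gamma, c_bound = 0.95, 0.05  # hypothetical discount factor and cost bound

est_reward = sum(discounted_sum(t["r"], gamma) for t in trajectories) / len(trajectories)
est_cost = sum(discounted_sum(t["c"], gamma) for t in trajectories) / len(trajectories)

feasible = est_cost <= c_bound  # False here: half the sampled dialogues fail
```

With these numbers the estimated discounted cost is 0.95³/2 ≈ 0.43, which exceeds the bound, so a policy producing such trajectories would be infeasible.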

• We propose exact and approximate methods for solving CPOMDPs.


Suboptimality of deterministic policies in CPOMDPs

• Procrastinating student problem
[Figure: states AdvisorHappy, AdvisorAngry, JobDone; initial belief b₀ = (1, 0, 0); discount γ. Action "lazy" yields R = 0, C = 0, keeping AdvisorHappy with p = 0.9 and moving to AdvisorAngry with p = 0.1. Action "work" leads to JobDone with R = 1, C = 1 from AdvisorHappy and R = 0, C = 1 from AdvisorAngry; the reward and cost for "work" are incurred at each timestep.]
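The suboptimality claim can be illustrated with a small sketch (hypothetical payoff numbers, not the exact figures of the student problem): a randomized mixture of two deterministic policies can satisfy the cost bound while earning strictly more reward than every feasible deterministic policy.

```python
# Each policy is summarized by (expected discounted reward, expected discounted cost).
pi_work = (1.0, 1.0)  # high reward, but violates the cost bound on its own
pi_lazy = (0.0, 0.0)  # feasible, but earns nothing
c_bound = 0.5

# Choose the mixing probability p so the mixture meets the bound exactly.
p = (c_bound - pi_lazy[1]) / (pi_work[1] - pi_lazy[1])

mixed_reward = p * pi_work[0] + (1 - p) * pi_lazy[0]
mixed_cost = p * pi_work[1] + (1 - p) * pi_lazy[1]

# The mixture is feasible and strictly better than the best feasible
# deterministic policy (pi_lazy, reward 0).
assert mixed_cost <= c_bound and mixed_reward > pi_lazy[0]
```

Here p = 0.5, giving reward 0.5 at cost 0.5: no deterministic policy achieves this, which is why CPOMDP solution methods must search over randomized policies.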
