Non-Deterministic Policies In Markovian Processes

Mahdi Milani Fard

Master of Science

School of Computer Science

McGill University, Montreal, Quebec

June 16, 2009

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Master of Science.

© Mahdi Milani Fard, 2009

DEDICATION

To my parents and supporting friends.


ACKNOWLEDGEMENTS

I would like to thank all members of McGill's Reasoning and Learning lab, who have provided support and useful guidance since I joined the group two years ago. I am particularly thankful to my supervisor, Joelle Pineau, who bore with my unusual problems and guided my research seamlessly and closely over this time. I would also like to thank Doina Precup for her guidance throughout my studies at McGill. Special thanks go to Susan Murphy and her research group at the University of Michigan for their close research collaboration and helpful comments. My visit to the University of Michigan provided me with invaluable research experience and helped me greatly with this thesis. Finally, I would like to thank my parents for their unconditional support and encouragement throughout my life. This thesis is dedicated to them.


ABSTRACT

Markovian processes have long been used to model stochastic environments. Reinforcement learning has emerged as a framework to solve sequential planning and decision making problems in such environments. In recent years, attempts have been made to apply methods from reinforcement learning to construct adaptive treatment strategies, where a sequence of individualized treatments is learned from clinical data. Although these methods have proved to be useful in problems concerning sequential decision making, they cannot be applied in their current form to medical domains, as they lack widely accepted notions of confidence measures. Moreover, policies provided by most methods in reinforcement learning are often highly prescriptive and leave little room for the doctor's input. Without the ability to provide flexible guidelines and statistical guarantees, it is unlikely that these methods can gain ground within the medical community. This thesis introduces the new concept of non-deterministic policies to capture the user's decision making process. We use this concept to provide the user with a flexible choice among near-optimal solutions, and to provide statistical guarantees for decisions made under uncertainty. We provide two algorithms that propose flexible options to the user, while making sure the performance is always close to optimal. We then show how to provide confidence measures over the value function of Markovian processes, and finally use them to find sets of actions that will almost surely include the optimal one.


ABRÉGÉ

Les processus markoviens ont été depuis longtemps utilisés pour modéliser les environnements stochastiques. L'apprentissage par renforcement a émergé comme un framework convenable pour résoudre les problèmes de planification séquentiels et de prise de décision dans de tels environnements. Récemment, des méthodes basées sur l'apprentissage par renforcement ont été appliquées pour développer des stratégies de traitement adaptables où l'objectif est d'apprendre une séquence de traitements individuelle à partir de données cliniques. Malgré que ces méthodes se sont avérées utiles pour des problèmes de prise de décision séquentielle, elles ne peuvent pas être appliquées avec leur forme actuelle dans le domaine médical puisqu'elles ne fournissent pas les garanties généralement requises dans ce genre de domaine. D'un autre côté, les politiques retournées par la plupart des méthodes d'apprentissage par renforcement sont souvent très rigides et ne laissent pas d'intervalle de manoeuvre suffisant pour les médecins. Cette thèse présente un nouveau concept de politiques non-déterministes pour représenter le processus de prise de décision de l'utilisateur. Nous développons deux algorithmes qui proposent des options flexibles à l'utilisateur tout en s'assurant que la performance soit toujours proche de l'optimal. Nous montrons ensuite comment fournir des mesures de confiance sur la fonction de valeur des processus markoviens et finalement nous utilisons ces mesures pour identifier un ensemble d'actions qui vont presque sûrement inclure l'action optimale.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
ABSTRACT
ABRÉGÉ
LIST OF TABLES
LIST OF FIGURES

1  Introduction, Motivation and Rationale
   1.1  Problem Statement
   1.2  Contributions
        1.2.1  Near-Optimal Non-Deterministic Policies
        1.2.2  Confidence Intervals on Value Function
        1.2.3  Possibly-Optimal Policies

2  Technical Background and Notation
   2.1  Sequential Decision Making
   2.2  Markov Decision Processes
        2.2.1  Definitions and Notation
        2.2.2  Policy and Value Function
        2.2.3  Planning Algorithms and Optimality
   2.3  Partially Observable Markov Decision Processes
        2.3.1  Definitions and Notation
        2.3.2  Belief States
        2.3.3  Policy and Value Function
        2.3.4  Finite State Controllers

3  Sequential Clinical Trials
   3.1  Adaptive Treatment Strategies
   3.2  Randomized Clinical Trials
   3.3  STAR*D

4  Non-Deterministic Policies
   4.1  Motivation
        4.1.1  Providing Choice to the Acting Agent
        4.1.2  Handling Model Uncertainty
   4.2  Definitions and Notation
   4.3  ε-Optimal Non-Deterministic Policies
   4.4  Optimization Criteria
   4.5  Maximal ε-Optimal Policy
        4.5.1  Mixed Integer Programming Solution
        4.5.2  Heuristic Search
        4.5.3  Directed Acyclic Transition Graphs
        4.5.4  Empirical Results
   4.6  Discussion

5  Confidence Measures over Value Function
   5.1  Intrinsic and Extrinsic Variance
   5.2  Gaussian Assumption on Value Function
   5.3  Model-Based Variance Estimation for MDPs
   5.4  Model-Based Variance Estimation for POMDPs
   5.5  Experiment and Results
        5.5.1  POMDP Dialog Manager
        5.5.2  Medical Domain
   5.6  Discussion

6  Decision Making under Model Uncertainty
   6.1  Possibly-Optimal Non-Deterministic Policies
        6.1.1  Optimality Probability
        6.1.2  Monte Carlo Approximation
        6.1.3  General Case with Gaussian Assumption
        6.1.4  Directed Acyclic Transition Graphs
   6.2  Empirical Results
   6.3  Discussion

7  Conclusion
   7.1  Summary of Contributions
   7.2  Future Work

References



LIST OF TABLES

Table 4–1  Heuristic search algorithm to find ε-optimal policies with maximum size
Table 4–2  Policy and running time of the full search algorithm on the medical problem
Table 6–1  Running time of Monte Carlo compared to LVS


LIST OF FIGURES

Figure 4–1  Example MDP
Figure 4–2  Optimal policy
Figure 4–3  Conservative policy
Figure 4–4  Non-augmentable policies
Figure 4–5  MIP solution
Figure 4–6  MIP/search running time comparison
Figure 4–7  Noise effect
Figure 5–1  POMDP dialog manager
Figure 5–2  Policy graph for POMDP dialog manager
Figure 5–3  Estimated covariance accuracy
Figure 5–4  POMDP policy comparison
Figure 5–5  STAR*D policy graph
Figure 5–6  Estimated covariance accuracy (STAR*D)
Figure 5–7  POMDP policy comparison (STAR*D)


Chapter 1
Introduction, Motivation and Rationale

Planning and decision making have been well studied in the Artificial Intelligence (AI) community. Intelligent agents have been designed and developed to act in, and interact with, different environments. This usually involves sensing the environment, making a decision using some intelligent inference mechanism, and then performing an action on the environment [38]. Oftentimes, this process involves some level of learning, along with the decision making process, to make the agent more efficient in pursuing the intended goal. The agent will try to interact with the environment and learn from its experience to maximize its utility in the process.

In sequential decision making processes, where the agent has to interact with the environment over a sequence of time steps, there is usually a trade-off between short-term and long-term utilities. Actions that might seem beneficial in the short term could lead to bad results in the long term. Therefore some level of long-term planning is usually needed to achieve the desired results.

Reinforcement Learning (RL) is a branch of AI that tries to develop a computational approach to solving the problem of learning through interaction. RL is the process of learning what to do—how to map situations to actions—so as to maximize a numerical reward signal [47]. It is different from supervised learning in that samples of optimal actions are not provided to the agent by some domain expert. The agent has to interact with the environment (often intelligently) to gather information and draw conclusions regarding which action is the optimal one in each situation.

Many methods have been developed to solve the RL problem with different types of environments and different types of agents. They are usually based on probabilistic models of the environment, and assume a Markovian property which states that the future of the system—agent and environment—only depends on the current state and the future actions. A major pitfall of these methods is the fact that if the probabilistic model of the system is not accurate, the agent could end up with suboptimal behaviour. Even in model-free approaches, which try to solve the problem by directly looking at the desirability of actions, lack of complete knowledge of the system can cause the decision making to be sub-optimal.

1.1 Problem Statement

Most of the work in RL has focused on autonomous agents such as robots. RL controllers are thus designed to issue a single action at each time-step, which will be executed by the acting agent. However, in the past few years, methods developed by the RL community have been used in sequential decision support systems. In these systems, a human being makes the final decision. Usability and acceptance issues thus become important in these cases. Most RL methods will therefore require some level of adaptation for this purpose.

Medical domains are among the cases for which RL needs further adaptation. Although RL methods have proved to be useful in problems concerning sequential decision making, they cannot be applied in their current form to medical domains. The medical community is well familiar with statistical comparisons in single-step decision making. Therefore, methods from RL that lack statistical guarantees cannot easily gain ground within the medical community. We introduce methods in this thesis that are specifically useful in medical settings, where statistical guarantees and statistically meaningful comparisons are crucial for practical acceptance of the method.

Another important issue with decision support systems comes from the fact that the acting agent is often a human being, who of course has his/her own decision process. Therefore, the assumption that the controller should only send one clear signal to the acting agent is not necessary. In fact, as we will see in this thesis, removing this assumption might result in greater usability and robustness of the decision support system. Such a view of the decision process comes in handy in two different situations.

In many practical cases, we do not have an exact model of the system. What we have instead is a noisy model that is based on a finite number of interactions with the environment. This leads to a type of uncertainty that is usually referred to as extrinsic uncertainty. Most RL algorithms ignore this uncertainty and assume that the model is perfect. However, if we look closely, the performance of the optimal action based on an imperfect model might not be statistically different from that of the next best action. Bayesian approaches have looked at this problem by providing confidence measures over the agent's performance. In cases where the acting agent is a human being, we can provide the user with a complete set of actions, each of which might be optimal and for which we do not have enough evidence to differentiate. The user can then further use his/her expertise to make the final decision. Such methods should guarantee that the suggestions provided by the system are statistically meaningful and plausible.

On the other hand, even when we do have complete knowledge of the system and can uniquely identify the optimal action, there might still be other actions whose performance is not much different from the optimal one. In practice, these small differences are often not considered significant. Depending on the domain, these actions might be suggested to be "roughly equal" in performance. At this point, the decision between these near-optimal options could be left to the acting agent—the human being that is using the decision support system. This could have many advantages, ranging from better user experience to increased robustness and flexibility. Among the near-optimal solutions, the user can select based on further domain knowledge, or any other preferences, that are not captured by the system. For instance, in a medical diagnosis system that suggests treatments, providing the physician with several options might be useful, as the final decision could be made based on further knowledge of the patient's medical status or preferences regarding side effects.

Throughout this thesis we address the above issues by a combination of theoretical and empirical investigations. We introduce the new concept of non-deterministic policies, which capture the decision making process of the acting agent. We then apply this formulation to solve the problem of finding near-optimal policies and handling the extrinsic uncertainty.

We focus primarily on sequential decision making problems in clinical domains, where the system should provide suggestions on the best treatment options for patients. These decisions are provided over a sequence of treatment phases. These systems are especially interesting because, oftentimes, different treatment options seem to provide only slightly different results. Moreover, in many cases there is not enough data available to uniquely identify the optimal treatment. In both cases, providing the physician with several suggestions or with confidence bounds would be beneficial in improving the performance of the final decision.

In this thesis we investigate how we can suggest several options to the acting agent, while providing performance guarantees through best-case and worst-case analysis. We study how these methods are more robust to noise in the model than conventional RL methods. We propose several algorithms to solve these problems and provide approximation techniques to speed up the decision process. We pay special attention to problems with limited planning horizon and high-dimensional state spaces, which are characteristics of medical domains. We provide efficient algorithms for these special cases and see how they work on real medical datasets.

1.2 Contributions

This thesis includes three main technical contributions. In this section, we overview these contributions and see how they are organized in the thesis.

1.2.1 Near-Optimal Non-Deterministic Policies

We start in Chapter 4 by formally defining non-deterministic policies on Markovian processes as a framework to incorporate the decision process of the acting agent. We then use this framework to rigorously define near-optimality for non-deterministic policies. We discuss interesting optimization problems that arise from this formulation and focus on finding the largest near-optimal non-deterministic policy. Two methods are introduced to solve this problem: one is based on a mixed integer program and one uses a heuristic search algorithm. We evaluate the empirical performance of these methods on both synthetic and real-world datasets. The contributions of this chapter were published in [28].

1.2.2 Confidence Intervals on Value Function

In Chapter 5, we review the methods developed by Mannor et al. [26] to provide confidence intervals for Markov Decision Processes. We then extend this method to obtain confidence intervals over the value function of Partially Observable Markov Decision Processes. As one of the main contributions of this thesis, we show how this result provides better means to compare policies while working with partially observable state spaces. We also show how such methods can be used in medical domains and human-robot interaction environments to allow more meaningful comparison of policies. These contributions were published in [29].

1.2.3 Possibly-Optimal Policies

Chapter 6 extends the methods of Chapter 5 to provide a statistically meaningful comparison of policies on Markovian decision processes. We formally address the problem of possible optimality and provide two methods to solve this problem. One method uses an extension of the confidence measures of Chapter 5 to meaningfully compare any number of policies. As this method is intractable in the general case, we provide another method for the special case of problems with a finite decision horizon. We then apply this method to a synthetic dataset and compare it with solutions based on a computationally expensive sampling technique. We conclude that our method provides similar results at considerably lower computational cost.

Chapter 2
Technical Background and Notation

This chapter is a review of background material on decision theory in reinforcement learning (RL). We will look into the main concepts behind sequential decision making and the mathematical formulations used in RL. We will also go through some of the key algorithmic methods for learning and decision making in RL.

2.1 Sequential Decision Making

Decision theory was originally introduced in the context of game theory, with early applications in economics [33]. In its modern form, it can be thought of as the study of probabilistic processes along with utility theory. The assumption is that there is an agent acting in an environment, performing actions, and receiving signals. Utility theory assumes that part of this signal from the environment is a numerical value that specifies how good the outcome was. This could be the health measure of a patient, or the profit of a trading agent.

The system is assumed to evolve in a probabilistic manner. In the most general case, the future of the system is a probabilistic function of the entire history of the agent's interactions with its environment. However, we often deal with systems that are known to have the Markovian property. This property assumes that the system has a state space. At each point in time, the system is in one particular state (which might or might not be known to the agent) and the future of the system solely depends on the current state and the future actions.

The sequential decision process deals with the problem of finding strategies for the acting agent that will maximize the long-term future utility. This could be the average or sum of future utilities at each time step. Depending on the type of signal received by the agent and the probabilistic nature of the process, we can divide sequential decision-making problems into several classes. In this chapter, we look at the two most studied classes of problems in RL. First is the Markov Decision Process, which assumes the agent knows the state of the system. We then see the Partially Observable Markov Decision Process, where this assumption is lifted.

2.2 Markov Decision Processes

A Markov Decision Process (MDP) is a model of system dynamics in sequential decision problems that involves probabilistic uncertainty about future states of the system [2]. The system is assumed to be in a state at any given time. The agent observes the state and performs an action accordingly. The system will then make a transition to the next state and the agent will receive some reward. The MDP model allows us to find the optimal action at each state so as to maximize the long-term future reward.

2.2.1 Definitions and Notation

Formally, an MDP is defined by the 5-tuple (S, A, T, R, γ):

– States: S is the set of states. The state usually captures the complete configuration of the system. Once the state of the system is known, the future of the system is independent of the history. This means that the state of the system is a sufficient statistic of the history of the system. In a robotic application, for instance, the state could be the position of the robot. Thus once we know where the robot is, it is not important which path it took to get there.

– Actions: A : S → 2^A is the set of actions allowed in each state, where A is the set of all actions. A(s) is the set of actions the agent can choose from while interacting with the system in state s. For example, a robot can move in several directions, each of which constitutes an action. If all the actions can be performed in all the states, then we will denote the action set by A for all states.

– Transition Probabilities: T : S × A × S → [0, 1] defines the transition probabilities of the system. This function specifies how likely it is to end up at any state, given the current state and a specific action performed by the agent. As stated before, transition probabilities are specified based on the Markovian assumption. That is, if the state of the system at time t is denoted by s_t and the action at that time is a_t, then we have:

    \Pr(s_{t+1} \mid a_t, s_t, a_{t-1}, s_{t-1}, \ldots, a_0, s_0) = \Pr(s_{t+1} \mid a_t, s_t).    (2.1)

We focus on homogeneous processes in which the system dynamics are independent of time. Thus the transition function is stationary with respect to time:

    T(s, a, s') \overset{def}{=} \Pr(s_{t+1} = s' \mid a_t = a, s_t = s).    (2.2)

– Rewards: R : S × A × \mathbb{R} → [0, 1] is the probabilistic reward model. Depending on the current state of the system and the action taken, the agent will receive a reward drawn from this model. We focus on homogeneous processes in which, again, the reward distribution does not change over time. If the reward at time t is denoted by r_t, then we have:

    r_t \sim R(s_t, a_t).    (2.3)

Sometimes the reward is taken to be deterministic. We will be using the general probabilistic model throughout this thesis. The mean of this distribution will be denoted by \bar{R}(s, a).

– Discount Factor: γ ∈ [0, 1) is the discount rate used to calculate the long-term return. It is a way of trading off between short-term and long-term rewards. It can also be thought of as the probability that the process dies at each step.

The agent starts in an initial state s_0 ∈ S. At each time step t, an action a_t ∈ A(s_t) is taken by the agent. The system then makes a transition to s_{t+1} \sim T(s_t, a_t) and the agent receives an immediate reward r_t \sim R(s_t, a_t). The goal of the agent is to maximize the discounted sum of rewards, which is usually referred to as the return (denoted by D):

    D = \sum_t \gamma^t r_t.    (2.4)

In the finite horizon case, this sum is taken up to the horizon limit and the discount factor can be set to 1. However, in the infinite horizon case the discount factor should be less than 1 so that the return has a finite value. The return on the process depends on both the stochastic transitions and rewards, as well as the actions taken by the agent.
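As a concrete illustration of this notation, the following is a minimal sketch (in Python with NumPy) of a tabular MDP container and of the discounted return of Eqn 2.4 for a sampled reward sequence. The class and function names are our own and are not code from the thesis.

```python
import numpy as np

class MDP:
    """A minimal finite MDP (S, A, T, R, gamma) with tabular dynamics.

    T[s, a, s2] is the transition probability of moving from s to s2 under
    action a, R[s, a] is the mean reward (the full reward distribution is not
    needed for expected-return calculations), and gamma is the discount factor.
    """
    def __init__(self, T, R, gamma):
        self.T = np.asarray(T)        # shape (|S|, |A|, |S|)
        self.R = np.asarray(R)        # shape (|S|, |A|)
        self.gamma = gamma
        self.n_states, self.n_actions = self.R.shape

def discounted_return(rewards, gamma):
    """D = sum_t gamma^t r_t for one sampled trajectory (Eqn 2.4)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: a 2-state, 2-action MDP and the return of a short reward sequence.
T = np.zeros((2, 2, 2))
T[0, 0] = [0.9, 0.1]; T[0, 1] = [0.2, 0.8]
T[1, 0] = [0.0, 1.0]; T[1, 1] = [0.5, 0.5]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
mdp = MDP(T, R, gamma=0.95)
print(discounted_return([0.0, 1.0, 2.0], mdp.gamma))  # 0 + 0.95*1 + 0.95^2*2
```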

Policy and Value Function

Policy is a way of defining the agent’s behaviour with respect to the changes in the environment. A (probabilistic) policy on an MDP is a mapping from the state space to a distribution over the action space : π : S × A → [0, 1].

(2.5)

A deterministic policy is a policy that defines a single action per state. That is, π(s) ∈ A(s). We will later introduce the notion of non-deterministic policies on MDPs to deal with sets of actions. The agent interacts with the environment and takes actions according to the policy. The value function of the policy is defined to be the expectation of the return given that the agent acts according to that policy : # "∞ X def γ t rt |s0 = s, at = π(st ) . V π (s) = E[Dπ (s)] = E

(2.6)

t=0

Using the linearity of the expectation, we can write the above expression in a recursive form, known as the Bellman equation [2] : " # X X ¯ a) + V π (s) = π(s, a) R(s, T (s, a, s0 )V π (s0 ) .

(2.7)

s0 ∈S

a∈A

The value function has been used as the primary measure of performance in much of the RL literature. There are, however, some ideas that take the risk or the

12

variance of the return into account as a measure of optimality [17, 39]. But the more common criteria is to assume that the agent is trying find a policy that maximizes the value function. Such a policy is often referred to as the optimal policy. We can also define the value function over the state-action pairs. This is usually referred to as the Q-function, or the Q-value, of that pair. By definition : "∞ # X def Qπ (s, a) = E[Dπ (s, a)] = E γ t rt |s0 = s, a0 = a, t ≥ 1 : at = π(st ) .

(2.8)

t=0

That is, the Q-value is the expectation of the return, given that the agent takes action a first and then follows policy π. The Q-function also satisfies the Bellman equation : ¯ a) + Qπ (s, a) = R(s,

X

T (s, a, s0 )

X

π(s0 , a0 )Qπ (s0 , a0 ),

(2.9)

a0 ∈A

s0 ∈S

which can be rewritten as : ¯ a) + Qπ (s, a) = R(s,

X

T (s, a, s0 )V π (s0 ).

(2.10)

s0 ∈S

The Q-function is often used to compare the optimality of actions given a fixed policy. 2.2.3

Planning Algorithms and Optimality

The optimal policy, denoted by π ∗ , is defined to be the policy that maximizes the value function at the initial state : π ∗ = argmax V π (s0 ).

(2.11)

π∈Π

It has been shown [2] that for any MDP, there exists an optimal deterministic policy that is no worse than any other policy for that MDP. The value of the optimal

13

policy V ∗ satisfies the Bellman optimality equation [2] : " # X ¯ a) + V ∗ (s) = max R(s, T (s, a, s0 )V ∗ (s0 ) . a∈A

(2.12)

s0 ∈S

The deterministic optimal policy follows from this : " # X ∗ 0 ∗ 0 ¯ a) + π (s) = argmax R(s, T (s, a, s )V (s ) . a∈A

(2.13)

s0 ∈S

Alternatively we can write down these equations with the Q-function : ¯ a) + Q∗ (s, a) = R(s,

X

T (s, a, s0 )V ∗ (s0 ).

(2.14)

s0 ∈S

Thus V ∗ (s) = maxa Q∗ (s) and π ∗ (s) = argmaxa Q∗ (s). There is a large literature on computing of the value function of a fixed policy and finding the optimal policy for MDPs. Among these, we will focus on off-line methods which are most relevant to the research topics covered in this thesis. – Policy Evaluation : There are many methods developed to find the value of a policy. These methods are referred to as policy evaluation algorithms. The simplest way to find the value function is by solving the Bellman equation using matrix inversion. The set of linear Bellman equations can be written in matrix form : ¯ π + γT π V π , Vπ =R

14

(2.15)

¯ π and T π are the value, reward and transition under that specific where V π , R policy. That is : ¯ sπ = R

X

¯ a), π(s, a)R(s,

(2.16)

π(s, a)T (s, a, s0 ).

(2.17)

a∈A π Ts,s 0

=

X a∈A

We can solve the matrix form of Bellman’s equation using matrix inversion : ¯π . V π = (I − γT π )−1 R

(2.18)

Often times, when dealing with large state spaces, matrix inversion is not practical or might result in numerical stability problems. There are dynamic programming methods [2] that solve the same problem by iteratively applying the Bellman equation. We start with some initial estimate of the value function and then keep updating it until we see convergence : # " X X π ¯ a) + T (s, a, s0 )Vt−1 (s0 ) , Vtπ (s) = π(s, a) R(s,

(2.19)

s0 ∈S

a∈A

where Vtπ is our estimate of the value function at time t. – Linear Programming Solution : Much of the literature in RL is focused on finding the optimal policy. There are many methods developed for policy optimization. One way to find the optimal policy is to solve the Bellman optimality equation. The Bellman optimality equation involves a maximization with a set of linear constraints. This can be formulated as a simple linear

15

program [3] : minV µT V, subject to ¯ a) + γ P 0 T (s, a, s0 )V (s0 ) ∀s, a, V (s) ≥ R(s, s

(2.20)

where µ represents an initial distribution over the states. The solution to the above problem is the optimal value function. Notice that V is represented in matrix form in this equation. It is known that linear programs can be solved in polynomial time [20]. However, solving them might become impractical in large (or infinite) state spaces. Therefore often times methods based on dynamic programming are preferred to the linear programming solution. – Value Iteration : Value iteration is a dynamic programming algorithm [2] that gradually updates some estimate of V ∗ until further updates do not change the values. We start with some initial estimates (these could be any prior estimates of the values), and then iteratively update the values at each step using the Bellman optimality equations. If Vt is the estimate at time t, the update rule would be : "

#

¯ a) + Vt (s) = max R(s, a∈A

X

0

0

T (s, a, s )Vt−1 (s ) .

(2.21)

s0 ∈S

It has been shown [2] that under reasonable conditions, Vt → V ∗ as t → ∞. In practice, the algorithm stops when the difference between the previous values and the updated ones is less than some threshold  > 0 : ||Vt − Vt−1 ||∞ < . If π is the greedy policy based on the estimate Vt , we get the guarantee that ||V ∗ − V π ||∞ < 2γ/(1 − γ) [51]. 16

2.3

Partially Observable Markov Decision Processes In many practical cases, the assumption that the agent can observe the state of

the system at all times and without any noise or error is unrealistic. In many domains, the agent may only receive an observation that partially (or noisily) identifies the state of the system. Such a system is referred to as a Partially Observable Markov Decision Process (POMDP). For instance, consider the case where we get readings from a noisy GPS system on a robot. The readings can only approximately determine the position of the robot. There is usually some probabilistic model to this noisy observation given the state of the system. For instance, a GPS locator might return the position of the robot with some Gaussian noise. In this case, the probability of observation given the state has a Gaussian distribution. 2.3.1

Definitions and Notation

A POMDP is formally defined by the 7-tuple (S, A, Z, T, O, R, γ) [44] : – States : S is the state space. At each point in time the system is a state st ∈ S. However the agent cannot observe this state. – Actions : A is the action set. Here we assume that this action set is uniform on the state space, as the agent cannot know the state of the system for sure. – Observations : Z is the set of possible observations. This could be similar to the state space, such as the case with a GPS locator in which a noise is added to the state. In many cases, however, this set is different from S. – Transition Probabilities : T : S × A × S → [0, 1] is similar to the transition function for MDPs : def

T (s, a, s0 ) = P r(st+1 = s0 |at = a, st = s). 17

(2.22)

– Observation Model : O : S × A × Z → [0, 1] specifies the probability of receiving an observation when arriving at a specific state by doing a specific action : def

O(s0 , a, z) = P r(zt = z|at−1 = a, st = s0 ).

(2.23)

Obviously these probabilities should sum to one for any state-action pair. – Rewards : R is similar to the reward model for MDPs. – Discount Factor : γ is the discount rate similar to that of an MDP. It is used to calculate the long-term return of an agent. At each point at time t, the system in a state st ∈ S. The agent then takes an action at ∈ A and the system makes a transition to state st+1 ∼ T (st , at ). The agent then receives a reward rt ∼ R(st , at ) and also an observation zt ∼ O(st+1 , at ). The goal of the agent is to maximize the expectation of the long-term return specified similar to the case of MDPs (Eqn 2.4). 2.3.2

Belief States

The agent cannot necessarily observe the state of a POMDP. However, it is possible to keep a probability distribution over the state space that reflects the agent’s expectation about the possible state of the system given the history of interactions with the environment. Such distribution is often called the belief state [1]. The belief state is a sufficient statistics of the history of the agent’s interactions with the environment. That is, in order to calculate the belief state for future times, you only need to have the current belief state [42]. Formally, if the history of interactions up

18

to time t is ht : ht = (a0 , z1 , a1 , . . . , at , zt ),

(2.24)

then the belief state is defined to be bt (s) = Pr(st = s|ht , b0 , ),

(2.25)

where b0 is the initial belief. The initial belief reflects what the agent expects the system state to be, prior to the start of interactions with the environment. As stated before, the belief state is a sufficient statistic for the history. Once the agent takes an action and receives an observation, it is easy to update the belief state so that it reflects the new uncertainty about the state of the system : bt (s0 ) = Pr(st = s0 |bt−1 , at−1 , zt ) =

1 Pr(zt |bt−1 , at−1 )

(2.26)

O(s0 , at−1 , zt )

X

T (s, at−1 , s0 )bt−1 (s),

(2.27)

s∈S

where Pr(zt |bt−1 , at−1 ) is a normalization constant : Pr(z|b, a) =

X

O(s0 , a, z)

s0 ∈S

X

T (s, a, s0 )b(s).

(2.28)

s∈S

This update rule is ofter referred to as the belief update function, denoted by τ (b, a, z), where τ : B × A × Z → B. It can be thought of as an update operator that returns a new belief after taking an action and receiving an observation. We will next see how we can use the belief state as a basis to select actions.

19

2.3.3

Policy and Value Function

We can let the agent make the decision of which action to take based on the current belief state. Since the belief state is a sufficient statistic of the system, this process can continue as the agent interacts with the environment over time. A policy in this case is defined to be a mapping from belief states to a probability distribution over actions. That is, π : B × A → [0, 1], where B is the set of belief states, i.e. the set of probability distributions over the state space. Now the RL task is to find the optimal policy that maximizes the expected return. Similar to the case of MDPs, we can define a value function, this time on the belief states V π : B → R : def

V π (b) =

E[Dπ (b)],

(2.29)

where Dπ (b) is the long-term return of policy π, starting from belief state b. It is easy to see that any POMDP can be modeled by a belief MDP, for which the states are the belief states of the POMDP. For any POMDP (S, A, Z, T, O, R, γ), the belief MDP would be a 5-tuple (S 0 , A, T 0 , R0 , γ), where S 0 = B is the set of belief states we have for the POMDP ; T 0 specifies the probability of moving from one belief state to another one : T 0 (b, a, b0 ) =

X

I{τ (b,a,z)=b0 } Pr(z|b, a),

z∈Z

20

(2.30)

where I{τ (b,a,z)=b0 } is the indicator that the update rule return b0 . R0 can also be defined in a straight-forward way : R0 (b, a) =

X

¯ a). b(s)R(s,

(2.31)

s∈S

The actual distribution of R0 is more complicated. However, since we are only interested in maximizing expected rewards, this formulation is sufficient. Given this equivalence, the value function of the POMDP will be the same as the value function of the corresponding belief MDP. We will therefore get a similar Bellman’s equation for POMDPs : " V π (b) =

X a∈A

π(b, a)

# X

¯ a) + γ b(s)R(s,

s∈S

X

Pr(z|b, a)V π (τ (b, a, z)) .

(2.32)

z∈Z

The RL task is to find a policy that maximizes this value function. It is well known that the value function of the optimal policy of a POMDP in the finite horizon is a convex piecewise-linear function of the belief state [44]. It is often convenient to use a finite-horizon approximation in the infinite horizon case. Thus, in this work we will focus only on POMDPs with piecewise-linear value functions. 2.3.4

Finite State Controllers

Although belief states are mathematically convenient, often times it is difficult to work with them in real-world systems. It is fairly difficult to plan and learn in the infinite state spaces such as the belief state. Sonik [44] points out that an optimal policy for a finite-horizon POMDP can be represented as an acyclic finite-state controller, in which each of the machine states represents a linear piece (or the corresponding alpha-vector ) in the piecewise-linear

21

value function. The state of the controller is based on the observation history and the action of the agent will only be based on the state of the controller. For deterministic policies, each machine state i issues an action a(i) and then the controller transitions to a new machine state according to the received observation. This finite-state controller is usually represented as a policy graph. An example of a policy graph for a POMDP dialog manager is shown in Fig 5–2. Cassandra et al. [4] state that dynamic programming algorithms for infinitehorizon POMDPs, such as value iteration, sometimes converge to an optimal piecewise value function that is equivalent to a cyclic finite-state controller. In the case that the optimal value function is not piecewise-linear, it is still possible to find an approximate or suboptimal finite-state controller [35, 24]. Given a finite-state controller for a policy, we can extract the value function of the POMDP using a linear system of equations. To extract the ith linear piece of the POMDP value function, we calculate the value of each POMDP state over that linear piece. For each machine state i (corresponding to a linear piece), and each POMDP state s, the value of s over the ith linear piece (the expected return if the controller is in i and the system is in s) is : ¯ a(i)) + γ v i (s) = R(s,

X

T (s, a(i), s0 )O(s0 , a(i), z)v l(i,z) (s0 ),

(2.33)

s0 ,z

where l(i, z) is the next machine state from state i and given observation z [15]. If π is the policy (based on the finite state controller) we can rewrite the above system of equations in matrix form using the following definitions : – K π : finite set of machine states in the policy graph

22

– v k for k ∈ K π : |S| dimensional vector of coefficients representing a linear piece in the value function – V π : |S||K π | dimensional vector, vertical concatenation of v k ’s representing the POMDP value function – a(k) for k ∈ K π : the action associated with machine state k according to the fixed policy π ¯ a(k)) for k ∈ K : |S| dimensional vector of coefficients representing – rk = R(., a linear piece in the piecewise-linear immediate reward function – Rπ : |S||K π | dimensional vector, concatenation of rk ’s – T π : |S||K π |×|S||K π | dimensional block diagonal matrix of |K π |×|K π | blocks, with T (., a(k), .) as the kth diagonal sub-matrix – Oπ : |S||K π | × |Z||S||K π | dimensional block diagonal matrix of |K π | × |K π | blocks. Each diagonal block is a |S| × |Z||S| block diagonal sub-matrix of |S| × |S| sub-blocks. Each sub-block is therefore a |Z| dimensional row vector. The kth block, sth sub-block contains the sth row in the O(., a(k), .). – Ππ : |Z||S||K π |×|S||K π | dimensional block matrix of |K π |×|K π | blocks. Each block Ππk1 k2 is itself a |Z||S| × |S| block diagonal sub-matrix of |S| × |S| subblocks. Each sub-block is therefore a |Z| dimensional vector. For all s, the zth component of the sth diagonal block of the (k1 , k2 ) sub-matrix, [(Ππk1 k2 )s ]z , is equal to 1 if k2 is the succeeding index of the machine state when the machine state is k1 and the observation is z, and 0 otherwise. This matrix represents the transition function l(i, z) of the finite-state controller which are the arcs in the policy graph.

23

We can write the system of equations representing the value of a policy π in the following matrix form : V π = Rπ + γT π Oπ Ππ V π ,

(2.34)

V π = (I − γT π Oπ Ππ )−1 Rπ .

(2.35)

leading to :

These equations are similar to the matrix forms of Bellman’s equation we saw for MDPs (Eqn 2.15 and Eqn 2.18). In fact, the policy on MDPs is a special case of these controllers, in which the finite state controller is the same as the system state machine. The above equation can be used to calculate the value function of a given policy, if the models for T , O and R are known. This equation is at the core of most policy iteration algorithms for POMDPs [14, 15], including one of the most recent highly successful approximation method [18]. There are many other methods developed to approximate the value function of POMDPs. However, they are beyond the scope of the work presented in this thesis.

24

Chapitre 3 Sequential Clinical Trials In recent years, adaptive treatment strategies have been receiving attention in the medical communities, particularly for the treatment of chronic disorders [32, 30, 34]. Conventional treatment strategies for many chronic diseases, such as depression, schizophrenia, AIDS or epilepsy, provide varying results with different patients and conditions. Therefore, often times a series of treatments are tried on the patient to obtain the desired response [34]. This calls for adaptive changes in the medication, dosage and duration over the treatment period. Unlike the conventional treatment methods, which involve only a single step of medication, in adaptive methods a series of treatments are tried on the patient over time. As a result, one must consider long-term outcomes rather than short-term responses to each step of the treatment series. This means that the treatments in the first few steps should not be solely compared based on the immediate efficacy or tolerance levels. Another key difference with adaptive treatment methods is that they can be personalized for each patient based on time varying conditions and outcomes. For instance, if the patient does not response to a treatment, then an increase in the dosage or a change in the medication might be suggested. Because of these challenges, adaptive treatment strategies are difficult to design and implement in the classical medical framework. Many of the statistical tools developed for comparing medications and providing guidelines cannot be used in 25

sequential settings. However, many tools developed for RL can be used in such domains. Decision-making in medical domains has thus been receiving attention in the RL community in the past few years [34, 7, 13]. This chapter reviews some of the basic concepts in sequential medical decision making considered in this thesis. This material is included to provide a better understanding of the problems we address in medical domains and to illustrate how RL methods can be used in such settings. 3.1

Adaptive Treatment Strategies Formally, adaptive treatment strategies are series of decision rules, one per treat-

ment step [34]. They often have the form : if baseline assessments = z0 and prior treatments up to now = {a1 , . . . , at } and clinical assessments up to now = {z1 , . . . , zt } then apply at+1 at step t + 1 end The baseline assessments are the clinical assessments on the patient before he/she is enrolled in the treatment series. They could range from demographic data and allergy constraints, to medical assessments of symptom severity. At each step, a treatment is applied and outcomes are observed during a period of time. A responder is a patient that shows improvement with the current treatment method. Responders will either keep using the same medication in long lasting chronic disorders such as schizophrenia [46] or might remit and stop the medication after a

26

while. The non-responders, however, will remain on the adaptive treatment series. A change in the medication or dosage might be suggested based on their status and history of medication. Different objectives may be considered for treatment strategies. For some diseases, faster remission with fewer side effects are preferable. For other diseases, it may be preferable to keep the patient on the medication, in which case the goal is to maximize the time the patient stays on the medication [46]. The process of optimizing a sequence of treatment decisions can be captured by Markovian processes such as MDPs or POMDPs. The variability in the outcomes can be modeled by stochastic transitions and rewards. Once the system is modeled formally, RL methods can be used to find the optimal or near-optimal policies. 3.2

Randomized Clinical Trials Randomized clinical trials are often designed and implemented in order to in-

vestigate the comparative effectiveness of different treatment options. In randomized trials, patients are randomized into different groups each of which will receive a certain medication. The set of randomized options might depend on the outcome of previous medication or the preference indicated by the patient or clinician. As stated before, many of the chronic disorders require series of treatments to assess effectiveness and obtain the desired results. Therefore, in recent years, sequential randomized trials have been have been carried out in order to evaluate and compare medications and treatments in sequential settings [45, 49, 41, 9, 46]. In these cases, the patients are randomized multiple times during the study, each of which might depend on the previous outcomes or the preferences indicated by the

27

patient or the physician. These studies are often referred to as sequential multiple assignment randomized trials (SMART) [30, 31]. Unlike single-step randomized trials, SMART studies are not easy to analyse with conventional statistical tools. Such trials can be used in the context of RL to compare long-term effectiveness of treatment options in sequential settings. The trial can be thought of as the exploration part of RL, which aims to learn the dynamics of the system by selecting actions at random. That is, in order to understand the stochastic properties of the treatment strategy, we need to collect data by applying random medication and observing the results. We can then make the optimal decision based on the leaned models. We will next look into a SMART study designed on depression disorders. We will be using this study throughout this thesis as a benchmark for our proposed methods and algorithms. 3.3

STAR*D Sequenced Treatment Alternatives to Relieve Depression (STAR*D) is a study

to investigate the comparative effectiveness of different treatments provided in succession for patients not adequately benefiting from the initial or subsequent treatment steps [34]. The study is a multi-step randomized clinical trial of patients with major depressive disorders. At each step, patients were ask to choose one of the two different sets of treatment options and were given random treatments from the chosen set. Those who did not show remission were further moved to the next treatment step and were treated with other medications [9, 37]. There were 4 steps in this trial. Overall, 4041 patients went through this study in 41 different clinical sites. During the process, different medical information concerning the patient status were

28

collected using questioners. Among these we focus on a measure of depression symptom severity which is defined by the Quick Inventory of Depressive Symptomatology (QIDS-SR16 ) [9, 37]. Patient’s level of remission was also assessed using the QIDSSR16 score. In particular, if this score dropped to less than 5, then the trial was considered a success and the patient would exit the study and go to follow-up, where patients were occasionally monitored while staying on the same treatment. Previous studies [34] show that in the STAR*D trial, myopic analysis that only looks at immediate outcomes, provides conclusions about the optimality of actions that are different from the results of non-myopic long-term analysis. In this trial, the optimality of actions change when looking at long-term outcomes using techniques from the RL family. This makes STAR*D a good benchmark for the methods developed in this thesis. This trial is especially challenging as the amount of data collected during the study is often not enough to uniquely identify the best treatment for patients. It is thus not easy to draw a line between good and bad options as the effectiveness of different medications are close and often not statistically significant. This brings up questions of how to provide useful suggestions based on the relatively limited amount of data observed during the trial. The details of the treatment options and procedures for the STAR*D trial are not specifically relevant to this thesis. Further details are provided whenever needed in the following chapters.

29

Chapitre 4 Non-Deterministic Policies 4.1

Motivation We begin this chapter by looking at the problem of decision making, aimed to

be used in sequential decision support systems. In particular, MDPs have emerged as a useful framework for optimizing action choices in the context of medical decision support systems [40, 16, 25, 7]. Given an adequate MDP model (or data source), many methods can be used to find a good action-selection policy. This policy is usually a deterministic or stochastic function [3]. But policies of these types face a substantial barrier in terms of gaining acceptance from the medical community, because they are highly prescriptive and leave little room for the doctor’s input. In such cases, where the actions are executed by a human, it may be preferable to instead provide several (near-)equivalently good action choices, so that the agent can pick among those according to his or her own heuristics and preferences. To address this problem, this work introduces the notion of a non-deterministic policy, which is a function mapping each state to a set of actions, from which the acting agent can choose. We show how such formalism can be used to provide choice to the acting agent as well as handling model uncertainty. This can be used to implement more robust decision support systems with statistical guarantees of performance.

30

4.1.1

Providing Choice to the Acting Agent

Even when we have complete knowledge of the system we are planning for, and when we can accurately calculate actions’ utilities and pinpoint the optimal one, it might not be sufficient to provide the user with only the optimal choice of action at each time step. In many cases, the difference between the utility of the top few actions may not be substantial. In medical decision making, for instance, often times this difference is not medically significant based on the given state variables. In such cases, it seems natural to let the user decide between the top few options, using his/her own expertise in the domain. This results in a further injection of domain knowledge in the decision making process that makes it more robust and practical. Such decisions can be based on facts known to the user that are not incorporated in the automated planning system. It can also be based on preferences that might change from case to case. For instance, a doctor can get several recommendations as to how best to treat a patient. She could further decide what medication to apply based on her patient’s medical record or the preference on side effects, or medical expenses. This idea of providing choice to the user should be accompanied by reasonable guarantees on the performance of the final decision, regardless of the choice made by the user. A notion of near-optimality should be enforced to make sure the actions are never far from the best possible option. Such guarantees are enforced by providing a worst-case analysis on the decision process.

31

4.1.2

Handling Model Uncertainty

In many practical cases we do not have complete knowledge of the system at hand. Instead, we may get a set of trajectories collected from the system according to some specific policy. In some cases, we may be given the chance to choose this policy (in on-line and active RL), and in other cases we may have access only to data from some fixed policy. In medical trials, in particular, data is usually collected according to a randomized policy, fixed ahead of time, through consultation with the clinical researchers. Given a set of sample trajectories, we can either build a model of the domain (in model-based approaches) or directly estimate the utility of different options (with model-free approaches). However these models and estimates are generally not accurate as we only observe a finite amount of data. In many cases, the data is too sparse and incomplete to uniquely identify the best option. That is, the difference in the performance measure of different actions is not statistically significant. In such cases, it might be useful to let the user decide on the final choice between the few actions for which we do not have enough evidence to differentiate. This comes with the assumption that the user can identify the best choice among those that are recommended. The task would therefore be to provide the user with a small set of options that will almost surely include the optimal action. 4.2

Definitions and Notation In this section, we formulate the concept of non-deterministic policies and pro-

vide some definitions that are used throughout this chapter and later on in the thesis.


A non-deterministic policy Π on an MDP (S, A, T, R, γ) is a function that maps each state s ∈ S to a non-empty set of actions denoted by Π(s) ⊆ A(s). The agent can choose to do any action a ∈ Π(s) whenever the MDP is in state s. In this chapter we will provide a worst-case analysis, presuming that the agent may choose the worst action in each state. In Chapter 6 we will do a best-case analysis, whereby the agent makes the optimal choice from the given recommendations. The value of a state-action pair (s, a) according to a non-deterministic policy Π on an MDP M = (S, A, T, R, γ) is given by the recursive definition:

$$Q_M^\Pi(s,a) = \bar{R}(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \min_{a' \in \Pi(s')} Q_M^\Pi(s',a'), \qquad (4.1)$$

which is the worst-case expected return under the allowed set of actions. We define the value of state s according to a non-deterministic policy Π, denoted by $V_M^\Pi(s)$, to be $\min_{a \in \Pi(s)} Q_M^\Pi(s,a)$.

To calculate the value of a non-deterministic policy, we construct an evaluation MDP, M' = (S, A', R', T, γ), where A' = Π and R' = −R. Notice that the negated value of the non-deterministic policy Π is equal to that of the optimal policy on the evaluation MDP:

$$Q_M^\Pi(s,a) = -Q_{M'}^*(s,a). \qquad (4.2)$$

This follows from substituting the evaluation MDP parameters into the Bellman optimality equation:

$$Q^*(s,a) = \bar{R}'(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a \in A'} Q^*(s',a) \qquad (4.3)$$
$$= -\bar{R}(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \max_{a \in \Pi} Q^*(s',a), \qquad (4.4)$$

which gives a solution that is the negation of the solution of Eqn 4.1.

A non-deterministic policy Π is said to be augmented with state-action pair (s, a), denoted by Π' = Π + (s, a), if it satisfies:

$$\Pi'(s') = \begin{cases} \Pi(s'), & s' \neq s \\ \Pi(s') \cup \{a\}, & s' = s. \end{cases} \qquad (4.5)$$

If a policy Π can be achieved by a number of augmentations from a policy Π', we say that Π includes Π'. The size of a policy Π, denoted by |Π|, is the sum of the cardinalities of the action sets in Π: $|\Pi| = \sum_s |\Pi(s)|$. A non-deterministic policy Π is said to be non-augmentable according to a constraint Ψ if and only if Π satisfies Ψ, and for any state-action pair (s, a), Π + (s, a) does not satisfy Ψ. In this thesis we will be working with constraints that have this particular property: if a policy Π does not satisfy Ψ, any policy that includes Π does not satisfy Ψ. We will refer to such constraints as being monotonic.
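As an illustration of Eqn 4.1 and the worst-case semantics above, the following Python sketch evaluates a non-deterministic policy by value iteration with a minimum over the allowed actions. It is purely illustrative; the array layout and the function name are assumptions and not the thesis's implementation.

```python
import numpy as np

def evaluate_worst_case(T, R_bar, Pi, gamma, n_iter=1000, tol=1e-8):
    """Worst-case evaluation of a non-deterministic policy (Eqn 4.1).

    T      : array (S, A, S), T[s, a, s'] = transition probability
    R_bar  : array (S, A), expected immediate reward
    Pi     : list of sets, Pi[s] = allowed actions in state s
    Returns (Q, V) with V[s] = min over a in Pi[s] of Q[s, a].
    """
    S, A = R_bar.shape
    V = np.zeros(S)
    for _ in range(n_iter):
        Q = R_bar + gamma * T.dot(V)        # Q[s, a] = R + gamma * sum_s' T V
        V_new = np.array([min(Q[s, a] for a in Pi[s]) for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return Q, V
```

The same routine, applied to the evaluation MDP with negated rewards and a max instead of a min, recovers the relation of Eqn 4.2.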


4.3 ε-Optimal Non-Deterministic Policies

A non-deterministic policy Π on an MDP M is said to be ε-optimal (ε ∈ [0, 1]) if we have:¹

$$V_M^\Pi(s) \geq (1-\epsilon) V_M^*(s), \quad \forall s \in S. \qquad (4.6)$$

This can be thought of as a constraint Ψ on the space of non-deterministic policies, set to ensure that the worst-case expected return is within some range of the optimal value.

The ε-optimality constraint is monotonic. To prove that, suppose Π is not ε-optimal. Then for any augmentation Π' = Π + (s, a), we have:

$$Q_M^{\Pi'}(s,a) = \bar{R}(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \min_{a' \in \Pi'(s')} Q_M^{\Pi'}(s',a')$$
$$\leq \bar{R}(s,a) + \gamma \sum_{s' \in S} T(s,a,s') \min_{a' \in \Pi(s')} Q_M^{\Pi'}(s',a')$$
$$\leq Q_M^{\Pi}(s,a),$$

which implies:

$$V_M^{\Pi'}(s) \leq V_M^{\Pi}(s).$$

As Π was not ε-optimal, this means that Π' will not be ε-optimal either, as the value function cannot increase with the augmentation. More intuitively, it follows from the fact that adding more options cannot increase the minimum utility, as the former worst-case choice is still available after the augmentation.

¹ In some of the MDP literature, ε-optimality is defined as an additive constraint ($Q_M^\Pi \geq Q_M^* - \epsilon$) [21]. The derivations will be analogous in that case. We chose the multiplicative constraint as it has cleaner derivations.

A conservative ε-optimal non-deterministic policy Π on an MDP M is a policy that is non-augmentable according to the following constraint:

$$R(s,a) + \gamma \sum_{s'} \left( T(s,a,s') (1-\epsilon) V_M^*(s') \right) \geq (1-\epsilon) V_M^*(s), \quad \forall a \in \Pi(s). \qquad (4.7)$$

This constraint indicates that we only add those actions to the policy whose reward plus (1 − ε) of the future optimal return is within the sub-optimal margin. This ensures that the non-deterministic policy is ε-optimal by using the inequality:

$$Q_M^\Pi(s,a) \geq R(s,a) + \gamma \sum_{s'} \left( T(s,a,s') (1-\epsilon) V_M^*(s') \right), \qquad (4.8)$$

instead of solving Eqn 4.1 and using the inequality constraint in Eqn 4.6. Applying Eqn 4.7 guarantees that the non-deterministic policy is ε-optimal while it may still be augmentable according to Eqn 4.6, hence the name conservative. It can also be shown that the conservative policy is unique. This is because if there were two different conservative policies, then their union would be conservative, which violates the assumption that they are non-augmentable according to Eqn 4.7.

A non-augmentable ε-optimal non-deterministic policy Π on an MDP M is a policy that is not augmentable according to the constraint in Eqn 4.6. It is easy to show that any non-augmentable ε-optimal policy includes the conservative policy. This is because we can always add the conservative policy to any policy and remain within the ε bound. However, non-augmentable ε-optimal policies are not necessarily unique. In the remainder of this chapter, we focus on the problem of searching over the space of non-augmentable ε-optimal policies, so as to maximize some criteria. Specifically, we aim to find non-deterministic policies that give the acting agent more options while staying within an acceptable sub-optimal margin.

We now present an example that clarifies the concepts introduced so far. To simplify the presentation of the example, we assume deterministic transitions. However, the concepts apply as well to any probabilistic MDP. Fig 4–1 shows an example MDP. The labels on the arcs show action names and the corresponding rewards are shown in the parentheses. We assume γ ≈ 1 and ε = 0.05. Fig 4–2 shows the optimal policy of this MDP. The conservative ε-optimal non-deterministic policy of this MDP is shown in Fig 4–3.

Fig. 4–1 – Example MDP

Fig. 4–2 – Optimal policy

Fig. 4–3 – Conservative policy

Fig. 4–4 – Two non-augmentable policies

Fig 4–4 includes two possible non-augmentable ε-optimal policies. Although both policies in Fig 4–4 are ε-optimal, the union of these is not ε-optimal. This is due to the fact that adding an option to one of the states removes the possibility of adding options to other states, which illustrates why local changes are not always appropriate when searching in the space of ε-optimal policies.
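For completeness, the conservative construction of Eqn 4.7 can be written in a few lines once V* is known. The sketch below is an illustration only (the array conventions and function name are assumptions, not the thesis's code); it adds exactly those actions that pass the conservative test against V*.

```python
import numpy as np

def conservative_policy(T, R_bar, V_star, gamma, eps):
    """Conservative eps-optimal non-deterministic policy (Eqn 4.7).

    Returns Pi as a list of sets of action indices, one per state.
    """
    S, A = R_bar.shape
    Pi = []
    for s in range(S):
        allowed = set()
        for a in range(A):
            # one-step reward plus (1 - eps) of the optimal continuation
            lhs = R_bar[s, a] + gamma * np.dot(T[s, a], (1 - eps) * V_star)
            if lhs >= (1 - eps) * V_star[s]:
                allowed.add(a)
        Pi.append(allowed)
    return Pi
```

Because each candidate action is tested against V* only, no re-evaluation of the resulting policy is needed; this is precisely what makes the construction conservative.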

Optimization Criteria We formalize the problem of finding an -optimal non-deterministic policy in

terms of an optimization problem. There are several optimization criteria that can be formulated, while still complying with the -optimal constraint. Notice that the last two problems can be defined both in the space of all -optimal policies, or only the non-augmentable ones. – Maximizing the size of the policy : According to this criterion, we seek non-augmentable -optimal policies that have the biggest overall size. This provides more options to the agent while still keeping the -optimal guarantees. The algorithms proposed in later sections of this chapter use this optimization criterion. Notice that the solution to this optimization problem is non-augmentable according to the -optimal constraint, because it maximizes the overall size of the policy. 38

As a variant of this, we can try to maximize the sum of the log of the size of the action sets : X

log |Π(s)|.

(4.9)

s∈S

This enforces a more even distribution of choice on the action set. However, we will be using the basic case of maximizing the overall size as it will be an easier optimization problem. – Maximizing the margin : We can aim to maximize margin of a nondeterministic policy Π : max ΦM (Π),

(4.10)

Π

where :  ΦM (Π) = min s∈S

min

a∈Π(s),a0 ∈Π(s) /

 (Q(s, a) − Q(s, a )) . 0

(4.11)

This optimization criterion is useful when one wants to find a clear separation between the good and bad actions in each state. – Minimizing the uncertainly : If we learn the models from data we will have some uncertainly about the optimal action in each state. We can use some variance estimation on the value function [26] along with a Z-Test to get some confidence level on our comparisons and find the probability of having the wrong order when comparing actions according to their values. Let Q be ˆ be our empirical estimate based on some the value of the true model and Q dataset D. We aim to minimize the uncertainly of a non-deterministic policy

39

Π: min ΦM (Π),

(4.12)

Π

where :  ΦM (Π) = max s∈S

 p (Q(s, a) < Q(s, a )|D) . 0

max

a∈Π(s),a0 ∈Π(s) /

(4.13)

In the following sections we provide algorithms to solve the first optimization problem mentioned above, which aims to maximize the size of the policy. We focus on this criterion as it seems most appropriate for medical decision support systems, where it is desirable for the acceptability of the system to find policies that provide as much choice as possible for the acting agent. Developing algorithms to address the other two optimization criteria remains an interesting open problem. 4.5

Maximal -Optimal Policy In order to find the largest -optimal policy, we present two algorithms. We first

present a Mixed Integer Program (MIP) formulation of the problem, and then present a search algorithm that uses the monotonic property of the -optimal constraint. While the MIP method is useful as a general formulation of the problem, the search algorithm has potential for further extensions with heuristics. 4.5.1

Mixed Integer Programming Solution

Recall that we can formulate the problem of finding the optimal deterministic policy on an MDP as a simple linear program [3] : minV µT V, subject to ¯ a) + γ P 0 T (s, a, s0 )V (s0 ) ∀s, a, V (s) ≥ R(s, s 40

(4.14)

where µ can be thought of as the initial distribution over the states. The solution to the above problem is the optimal value function (V ∗ ). Similarly, having computed V ∗ using Eqn 4.14, the problem of a search for an optimal non-deterministic policy according to the size criterion can be rewritten as a Mixed Integer Program :2 maxV,Π (µT V + (Vmax − Vmin )eTs Πea ), subject to V (s) ≥ (1 − )V ∗ (s) ∀s P ∀s a Π(s, a) > 0 ¯ a) + γ P 0 T (s, a, s0 )V (s0 ) + Vmax (1 − Π(s, a)) ∀s, a. V (s) ≤ R(s, s

(4.15)

Here we are overloading the notation Π to define a binary matrix representing the policy, where Π(s, a) is 1 if a ∈ Π(s), and 0 otherwise. We define Vmax = Rmax /(1−γ) and Vmin = Rmin /(1 − γ). The e’s are column vectors of 1 with the appropriate dimensions. The first set of constraints makes sure that we stay within  of the optimal return. The second set of constraints ensures that at least one action is selected per state. The third set ensures that for those state-action pairs that are chosen in any policy, the Bellman constraint holds, and otherwise, the constant Vmax makes the constraint trivial. Notice that the solution to the above problem maximizes |Π| and the result is non-augmentable. Lemma 1 The solution to the mixed integer program of Eqn 4.15 is non-augmentable according to -optimality constraint.

2

Note that in this MIP, unlike the standard LP for MDPs, the choice of µ can affect the solution in cases where there is a tie in the size of Π. 41

As a counter argument, suppose that we could add a state-action pair to the solution Π, while still staying in  sub-optimal margin. By adding that pair, the objective function is increased by (Vmax − Vmin ), which is bigger than any possible decrease in the µT V term, and thus the objective is improved, which conflicts with Π being the solution. We can use any MIP solver to solve the above problem. Note however that we do not make use of the monotonic nature of the constraints. A general purpose MIP solver could end up searching in the space of all the possible non-deterministic policies, which would require exponential running time. 4.5.2

Heuristic Search

In this section, we develop a heuristic search algorithm to find the maximal optimal policy. We can make use of the monotonic property of the -optimal policies to narrow down the search. We start by computing the conservative policy. We then augment it until we arrive at a non-augmentable policy. We also make use of the fact that if a policy is not -optimal, neither is any other policy that includes it, and thus we can cut the search tree at this point. The following algorithm is a one-sided recursive depth-first-search-like algorithm that searches in the space of plausible non-deterministic policies to maximize a function g(Π). Here we assume that there is an ordering on the set of state-action pairs {pi } = {(sj , ak )}. This ordering can be chosen according to some heuristic along with a mechanism to cut down some parts of the search space. V ∗ is the optimal value function and the function V returns the value of the non-deterministic policy that can be calculated by solving the corresponding evaluation MDP.

42

Tab. 4–1 – Heuristic search algorithm to find -optimal policies with maximum size Function getOptimal(Π, startIndex, ) Πo ← Π for i ← startIndex to |S||A| do (s, a) ← pi if a ∈ / Π(s) & V (Π + (s, a)) ≥ (1 − )V ∗ then Π0 ← getOptimal (Π + (s, a), i + 1, ) if g(Π0 ) > g(Πo ) then Πo ← Π0 end end end return Πo We should make a call to the above function passing in the conservative policy Πm and starting from the first state-action pair : getOptimal(Πm , 0, ). The asymptotic running time of the above algorithm is O((|S||A|)d (tm + tg )), where d is the maximum size of an -optimal policy minus the size of the conservative policy, tm is the time to solve the original MDP and tg is the time to calculate the function g. Although the worst-case running time is still exponential in the number of state-action pairs, the run-time is much less when the search space is sufficiently small. The |A| term is due to the fact that we check all possible augmentations for each state. Note that this algorithm searches in the space of all -optimal policies rather than only the non-augmentable ones. If we set the function g(Π) = |Π|, then the algorithm will return the biggest non-augmentable -optimal policy. This search can be further improved by using heuristics to order the state-action pairs and prune the search. One can also start the search from any other policy

43

rather than the conservative policy. This can be potentially useful if we have further constraints on the problem. 4.5.3

Directed Acyclic Transition Graphs

One way to narrow down the search is to only add the action that has the maximum value for any state s :  Π = Π + s, argmax Q (s, a) . 0



Π

(4.16)

a

This leads to a running time of O(|S|d (tm + tg )). However this does not guarantee that we see all non-augmentable policies. This is due to the fact that after adding an action, the order of values might change. If the transition structure of the MDP contains no loop with non-zero probability (transition graph is directed acyclic, i.e. DAG), then this heuristic will produce the optimal result while cutting down the search time. In other cases, one might do a partial evaluation of the augmented policy to approximate the value after adding the actions, possibly by doing a few backups rather than using the original Q values. This offers the possibility of tradingoff computation time for better solutions. 4.5.4

Empirical Results

To evaluate our proposed algorithms, we first test both the MIP and search formulations on MDPs created randomly, and then test the search algorithm on a real-world treatment design scenario.

44

1, 0.4

S1

S4

S5

1, 0.4

3, 0.5

S1

3, 0.2

2, 0.7

S3

3, 9.9

S4

S5

3, 0.2 S3

3, 9.9

3, 0.9

3, 0.9

S2

1, 0.4

3, 0.5

S2

S4

S4

3, 0.5

1, 0.4

3, 0.5

4, 0.2 S1

2, 0.7

S5

3, 0.5

S1

3, 0.2

S5

3, 0.5 S3

3, 9.9

2, 0.7

3, 0.2

S3

3, 9.9

3, 0.9

3, 0.9

S2

S2

Fig. 4–5 – MIP solution for different values of ε ∈ {0, 0.01, 0.02, 0.03}. The labels on the edges are action indices, followed by the corresponding immediate rewards.

To begin, we generated random MDPs with 5 states and 4 actions. The transitions are deterministic (chosen uniformly at random) and the rewards are random values between 0 and 1, except for one of the states, which has reward 10 for one of the actions; γ was set to 0.95. The MIP method was implemented with MATLAB and CPLEX. Fig 4–5 shows the solution to the MIP defined in Eqn 4.15 for a particular randomly generated MDP. We see that the size of the non-deterministic policy increases as the performance threshold is relaxed.


To compare the running time of the MIP solver and the search algorithm, we constructed random MDPs as described above with more state-action pairs. Fig 4–6 shows the running time averaged over 20 different random MDPs, assuming ε = 0.01. It can be seen that both algorithms have exponential running time (note the exponential scale on the time axis). The running time of the search algorithm has a bigger constant factor, but has a smaller exponent base, which results in a faster asymptotic running time.


Fig. 4–6 – Running time of MIP and search algorithm as a function of the number of state-action pairs.

To study how stable non-deterministic policies are to potential noise in the models, we check to see how much the policy changes when Gaussian noise is added to the reward function. Fig 4–7 shows the percentage of the total state-action pairs that were either added or removed from the resulting policy, when adding noise to


the reward model (we assume a constant ε = 0.02). We see that the resulting non-deterministic policy changes somewhat, but not drastically, even with noise levels of similar magnitude as the reward function.


Fig. 4–7 – Average percentage of state-action pairs that were different in the noisy policy.

Next, we implemented the full search algorithm on an MDP constructed for a medical decision-making task involving real patient data. The goal is to find a treatment plan that maximizes the chance of remission on the STAR*D medical domain. As outlined in Chapter 3, the dataset includes a large number of measured outcomes. For the current experiment, we focus on a numerical score called the Quick Inventory of Depressive Symptomatology (QIDS), which was used in the study to assess levels of depression (including when patients achieved remission). For the purposes of our experiment, we discretize the QIDS scores (which range from 5 to

47

27) uniformly into quartiles, and assume that this, along with the treatment step (up to 4 steps were allowed), completely describes the patient's state. Note that the underlying transition graph can be treated as a DAG because the study is limited to four steps of treatment. There are 19 actions (treatments) in total; recall that action choices are different at each treatment step. There are at most 17 actions available at each state. A reward of 1 is given if the patient achieves remission (at any step) and a reward of 0 is given otherwise. The transition and reward models were generated empirically from the data using a frequentist approach.

Tab. 4–2 – Policy and running time of the full search algorithm on the medical problem.

                        ε = 0.02                ε = 0.015          ε = 0.01     ε = 0
    Time (seconds)      118.7                   12.3               3.5          1.4
    5 < QIDS < 9        CT, SER, BUP, CIT+BUS   CT, SER            CT           CT
    9 ≤ QIDS < 12       CIT+BUP, CIT+CT         CIT+BUP, CIT+CT    CIT+BUP      CIT+BUP
    12 ≤ QIDS < 16      VEN, CIT+BUS, CT        VEN, CIT+BUS       VEN          VEN
    16 ≤ QIDS ≤ 27      CT, CIT+CT              CT, CIT+CT         CT, CIT+CT   CT

Table 4–2 shows the non-deterministic policy obtained for each state during the second step of the trial (each acronym refers to a specific treatment). This is computed using the search algorithm, assuming different values of ε. Although this problem is not tractable with the MIP formulation (304 state-action pairs), a full search in the space of ε-optimal policies is still possible. Table 4–2 also shows the

running time of the algorithm, which as expected increases as we relax the threshold ε. Here we did not use any heuristics. However, as the underlying transition graph is a DAG, we could use the heuristic discussed in the previous section (Eqn 4.16) to get the same policies even faster. An interesting question is how to set ε a priori. In practice, a doctor may use the full table as a guideline, using smaller values of ε when s/he wants to rely more on the decision support system, and larger values when relying more on his/her own assessments.

4.6 Discussion

This chapter introduces a framework for computing non-deterministic policies

for MDPs. We believe this framework can be especially useful in the context of decision support systems, to provide more choice and flexibility to the acting agent. This should improve acceptability of decision support systems in fields where the policy is used to guide (or advise) a human expert, notably for the optimization of medical treatments. The framework we propose relies on two competing objectives. On the one hand, we want to provide as much choice as possible in the non-deterministic policy, while at the same time preserving some guarantees on the return (compared to the optimal policy). A limitation of our current approach is that the algorithms presented so far are limited to relatively small domains, and scale well only for domains with special properties, such as a DAG structure in the transition model, or good heuristics for pruning the search. This clearly points to future work in developing better approximation techniques. Nonetheless it is worth keeping in mind that many domains of


application may not be that large (see [40, 16, 25, 7] for examples) and the techniques as presented can already have a substantial impact. In the next chapters we will see the other major use of the proposed framework: handling uncertainty. We start by building confidence intervals over the value functions of MDPs and POMDPs. We then extend these methods to compare policies in a statistically meaningful way. We also investigate other means of comparing policies and finding possibly-optimal non-deterministic policies.


Chapter 5
Confidence Measures over Value Function

When dealing with real world problems, we often do not have a perfect model of the system. We are usually given a data set collected on the system and asked to find policies that will have good performance when deployed on that system. Most RL algorithms try to calculate an estimate of the value function of different policies and choose the one that has the biggest estimated value. As these data sets are finite, and often sparse and noisy, these estimates can be far from the actual values. Ignoring the uncertainty in these estimates, although mathematically convenient, might result in poor performance and lack of robustness in the system. In medical domains, in particular, it is very unlikely that policies based on such noisy estimates—with no confidence measures—can gain ground within the medical community. In similar cases, mostly when humans have a role in the decision making process, some form of confidence measure is needed to meaningfully compare the value of different policies. This is an important factor for sequential decision-making, since it will allow us to provide more formal guarantees about the quality of the policies we implement.

Using imperfect models will introduce some error in the estimated value function. As a general practice with learning methods, we might want to know how good this estimate of the value function is, given the error in the estimated models. This can be expressed in terms of the bias and variance of the calculated value function. This chapter will start by looking at some of the methods that were previously designed to provide confidence measures on the MDP value function. We then extend these, as a contribution of the thesis, to the case of POMDPs. We will also review some of the work done to provide confidence measures for model-free approaches. In this chapter, we focus on policy evaluation and not optimization.

5.1 Intrinsic and Extrinsic Variance

We earlier defined the return of a policy (Eqn 2.4) to be the discounted sum of future rewards, and the value of a state was defined to be the expectation of the return starting from that state. We thus have:

$$D = V + \Delta V, \qquad (5.1)$$

where ΔV is the variation in the return for any particular trajectory. We know by definition:

$$E[\Delta V] = 0. \qquad (5.2)$$

However, if there is stochasticity in the system, then the variance of this ΔV term is bigger than zero. This type of variance is often referred to as internal or intrinsic variance [27]. It is due to stochastic transitions and rewards, and causes the return to deviate from its mean value. This type of variance has been studied both with discounted rewards [43] and average rewards [10]. It can be thought of as the risk of running a policy on an MDP or a POMDP. There have been studies to address the issue of risky policies in the RL context. Heger [17] introduces the idea of changing the value function to include the risk probability, and Sato et al. [39] suggest penalizing the value function with a negative term proportional to the variance of the value function.

In this thesis, we will be focusing on another source of variance that is due to having finite amounts of data. In Eqn 5.1, the value function itself can be thought of as a random variable which has its own distribution and variance. We use the collected data set and calculate an estimate of the value function using standard methods from the RL family. Most of these methods converge to the true value function in the limit of infinite data from interactions with the system. However, in cases where we only observe a finite amount of data on the system, our estimate will not be accurate. The variability due to having only finite amounts of data is often referred to as external, extrinsic or parametric variance [27]. Having an estimate of this variance gives us a confidence measure over our calculated value function, which in turn can help us make more meaningful and robust decisions.

5.2 Gaussian Assumption on Value Function

The actual distribution over the value function given the collected data set is often complicated and tedious to handle. It is mathematically convenient to assume that this is close to a Gaussian distribution. This is particularly useful as we can model the whole distribution by its mean and covariance matrix. This Gaussian assumption is used both in model-based [26] and model-free [6] approaches. If the data was collected under a policy that is not too far from the optimal one, then this assumption is reasonably valid. In other cases, we might need to use more complicated techniques that require sampling or particle filters to handle more complex distributions.


5.3 Model-Based Variance Estimation for MDPs

Mannor et al. [26] introduce a method to estimate the extrinsic variance on the value function of MDPs based on the Gaussian assumption, and using a second order approximation. They provide a confidence measure over the value function with model-based approaches. We begin in this section by reviewing this approach. In the next section, we extend this result to the POMDP framework. Recall from Chapter 2 (Eqn 2.15) that we can write down the Bellman equation for a fixed policy π in matrix form. We saw (Eqn 2.18) that we can use matrix inversion to get the value function:

$$V^\pi = (I - \gamma T^\pi)^{-1} \bar{R}^\pi. \qquad (5.3)$$

This assumes that we have complete knowledge about the transition and reward models. However, in many practical cases, T^π and R^π are estimated from data. Suppose we have a number of trajectories collected on the system with an arbitrary policy or sampling mechanism. If $N^a_{ij}$ is the number of transitions from i to j while action a was taken, and $C^a_{ij}$ is the sum of the rewards collected in such transitions, then a frequentist approach will simply average the rewards and transition probabilities by simple division. If T̂ and R̂ are the empirical estimates¹, then we have:

$$\hat{T}(s,a,s') = \frac{N^a_{ss'}}{N^a_s}, \qquad (5.4)$$
$$\hat{R}(s,a) = \frac{\sum_{s'} C^a_{ss'}}{N^a_s}, \qquad (5.5)$$

where $N^a_s = \sum_{s'} N^a_{ss'}$ is the total number of transitions from s. Thus, the empirical reward and transition matrices for a policy π will be:

$$\hat{R}^\pi_s = \sum_{a \in A} \pi(s,a) \hat{R}(s,a), \qquad (5.6)$$
$$\hat{T}^\pi_{s,s'} = \sum_{a \in A} \pi(s,a) \hat{T}(s,a,s'). \qquad (5.7)$$

¹ We take R̂ to be our estimate of the average reward R̄, as we are only interested in estimating the expected return.

(5.8)

Mannor et al. [26] show that using these estimates instead of the true models will result in some bias and variance in the value function. Here we review a simpler version of their work that assumes we know the correct reward model. That is, we use only our empirical transition model. Because we build our empirical model based on a finite amount of data, our estimates contain some error terms : T π = T + T˜,

(5.9)

where T˜ is the error term in our empirical transition model. We know from probability theory that when the state space is finite, the (posterior) distribution of the transition

55

probabilities given the observed data set is defined by a Dirichlet distribution [11]. Thus, we can calculate the variance over the error term :

E[T˜(s, a, s0)T˜(s, a0, s0)] = E[T˜(s, a, s0)T˜(s, a, s00)] = E[T˜(s, a, s0)T˜(s, a, s0)] =

0, a Na −Nss 0 ss00 , 2 a (Ns ) (Nsa +1)

for a 6= a0 ,

(5.10)

for s0 6= s00 ,

(5.11)

a (N a −N a ) Nss 0 s ss0 . (Nsa )2 (Nsa +1)

(5.12)

Eqn 5.10 follows from the fact that the distribution of transition probabilities are independent for different actions. The other two equations follow from the covariance of the Dirichlet distribution. We can thus calculate the covariance of our transition matrix for any specific policy π : " ! !# X X E[T˜ssπ 0 T˜ssπ 00 ] = E π(s, a)T˜(s, a, s0 ) π(s, a)T˜(s, a, s00 ) (5.13) a∈A

=

E

a∈A

"

# X

π(s, a) T˜(s, a, s )T˜(s, a, s ) 0

2

0

(5.14)

a∈A

=

X

h i π(s, a)2 E T˜(s, a, s0 )T˜(s, a, s0 ) ,

(5.15)

a∈A

which can be simply calculated using the transition counts with Eqn 5.11 and Eqn 5.12. The other covariance terms, including the transition probabilities starting from different states, are zero due to Eqn 5.10. Note that the error term in the empirical model is zero-biased. That is, on average, we are not expected to see a bias in this estimate. However, the covariance terms will cause problems when we use our empirical models to calculate the value function. As we said before, when using empirical models with the Bellman equation, we get an empirical estimate for out value function that is different from the true 56

value. Substituting the empirical models with their definition in Eqn 5.9 we get : ¯π , V π + V˜ π = (I − γ(T π + T˜π ))−1 R

(5.16)

where V˜ π is the error term we get in our estimated value function. Using the Taylor expansion of this matrix inversion, we can write the expectation of our estimated value function as :

E[V π + V˜ π ]

= =

¯π ] E[(I − γ(T π + T˜π ))−1 R " #

(5.17)

¯π . γ k (T π + T˜π )k R

(5.18)

E

∞ X k=0

With a little bit of manipulation (see [27]), we get : ¯π + V π + E[V˜ π ] = (I − γT π )−1 R

∞ X

¯π , γ k E[fk ]R

(5.19)

k=1

where fk = X(T˜π X)k and X = (I − γT π )−1 . The first term in Eqn 5.19 is the true value function. The second term in Eqn 5.19 contains all the moments of T˜π . As explained before, T˜π has mean 0 and positive variance. This means that the second term in Eqn 5.19 is not zero, which leads to the result that the estimated values we calculate based on empirical models are biased. That is, the error term has non-zero mean :

E[V˜ π ] =

∞ X

¯π . γ k E[fk ]R

k=1

57

(5.20)

To estimate this bias term we can do a second order approximation in Eqn 5.19. We can ignore moments of T˜π that are higher than 2. Thus we get :

E[V˜ π ]

¯ π + γ 2 E[f2 ]R ¯π ' γ E[f1 ]R

(5.21)

¯ π + γ 2 E[X(T˜π X)2 ]R ¯π = γ E[X T˜π X]R

(5.22)

¯π . = γ 2 E[X(T˜π X)2 ]R

(5.23)

This follows from the fact that T˜π has zero mean. As we saw before, we can calculate the second moment of T˜π using the transition counts, which in turns lets us evaluate the above equation and estimate the bias term. We can apply the same technique to estimate the variance of this error term :

E[Vˆ π (Vˆ π )T ]

= =

¯ π (R ¯ π )T ((I − γ(T π + T˜π ))−1 )T ] E[(I − γ(T π + T˜π ))−1 R 

E

∞ X

!

γ k (T π + T˜π )k

¯ π (R ¯ π )T R

k=0



=

E

∞ X

(5.24)  ! T ∞ X γ k (T π + T˜π )k  k=0

! k

γ fk

¯ π (R ¯ π )T R

k=0

∞ X

!T  γ k fk

.

k=0

Using a second order approximation, we get :

E[Vˆ π (Vˆ π )T ]

¯ π (R ¯ π )T f T ] ' V π (V π )T + γ 2 E[f1 R 1 ¯ π (R ¯ π )T f2T ] +γ 2 E[f0 R ¯ π (R ¯ π )T f0T ]. +γ 2 E[f2 R

58

(5.25)

Using this, we can estimate the covariance of our estimated value function : cov(Vˆ π ) = ' =

E[Vˆ π (Vˆ π )T ] − E[Vˆ π ]E[Vˆ π ]T ¯ π (R ¯ π )T f1T ] γ 2 E[f1 R ¯ π (R ¯ π )T X T ˜(T π )T ]X T . γ 2 X E[T˜π X R

(5.26)

This again only includes the second moments of our empirical models that can be calculated using the transition counts. Eqn 5.26 can be used to estimate confidence bounds on our estimated value function. As we get more and more data points, this variance estimate gets progressively smaller. 5.4

Model-Based Variance Estimation for POMDPs In this section we extend the result of Mannor et al. [26] to provide confidence

bounds for POMDP value functions. For mathematical convenience, we will focus on the policies that are represented by a finite state controller. Recall that we can write down the Bellman equation for a POMDP value function in the matrix form (Eqn 2.34). The value function can then be calculated using matrix inversion : V π = (I − γT π Oπ Ππ )−1 Rπ .

(5.27)

We will again use our empirical estimates instead of the true model in the above equation. We assume here that we know the correct reward model, but we have ˆ π = Oπ + O ˜ π and estimated T and O from some sample labeled set of trajectories (O Tˆπ = T π + T˜π ). The assumption of having training data with known labeled states is a strong assumption, and in many POMDP domains may not be plausible. However, it is still more practical than the assumption of having exact true models of T and O. 59

In the case where EM-type algorithms are used to label the data [22], the derivation of the estimates with the above assumption is not exactly correct, but might still provide a useful guide to compare competing policies with confidence intervals over the value function. Substituting the empirical models in the Bellman equation leads to : ˆ π Ππ )−1 Rπ , Vˆ π = (I − γ Tˆπ O

(5.28)

which can be written in terms of true models and error terms : ˜ π )Ππ )−1 Rπ . V π + V˜ π = (I − γ(T π + T˜π )(Oπ + O

(5.29)

We can calculate the covariance term on T π , similar to the case of MPDs (Eqn 5.13). Similarly, we can derive the covariance equations for the observation models. If there a are Msa transitions to state s after doing action a, Msz of which resulted in the z

being observed, then we have : a ˆ 0 , a, z) = Ms0 z , O(s Msa0

(5.30)

and ˜ 0 , a, z)O(s ˜ 0 , a0 , z)] = E[O(s ˜ 0 , a, z)O(s ˜ 0 , a, z 0 )] = E[O(s ˜ 0 , a, z)O(s ˜ 0 , a, z)] = E[O(s

0,

for a 6= a0 ,

(5.31)

−Msa0 z Msa0 z0 , (Msa0 )2 (Msa0 +1)

for z 6= z 0 ,

(5.32)

Msa0 z (Msa0 −Msa0 z ) . (Msa0 )2 (Msa0 +1)

This will therefore give us the second moment terms of the Oπ matrix.

60

(5.33)

We further assume that the error terms in the observation model and the transition model are independent : ˜ π T˜π ] = 0. E[O ij kl

(5.34)

Now we can apply the same technique used for MDPs on the POMDP value function. We start by calculating the bias term in the estimated value function :

E[V π + V˜ π ]

= =

ˆ π Ππ )−1 Rπ ] E[(I − γ Tˆπ O " #

(5.35)

γ k fk R π ,

(5.36)

E

∞ X k=0

where X = (I − γT π Oπ Ππ )−1 ,

(5.37)

˜ π Π + T˜π O ˜ π Ππ ))k X. fk = (X(T˜π Oπ Ππ + T π O

(5.38)

This follows from using a Taylor expansion and applying the same technique as in MDPs. Using a second order approximation, we get : V π + E[V˜ π ] ' (I − γT π Oπ Ππ )−1 Rπ + γ 2 E[f2 ]Rπ .

(5.39)

The second moment of the value function can also be estimated with the same technique :

61

E[Vˆ π (Vˆ π )T ]

' V π (V π )T + γ 2 E[f1 Rπ (Rπ )T f1T ]

(5.40)

+γ 2 E[f0 Rπ (Rπ )T f2T ] +γ 2 E[f2 Rπ (Rπ )T f0T ]. Using this, we can calculate the covariance of our estimated value function : cov(Vˆ π ) = '

E[Vˆ π (Vˆ π )T ] − E[Vˆ π ]E[Vˆ π ]T γ 2 E[f1 Rπ (Rπ )T f1T ].

(5.41) (5.42)

Taking the independence assumption of Eqn 5.34 into account, the above leads to : cov(Vˆ π ) ' γ 2 X E[T˜π Oπ Ππ V π (V π )T (Ππ )T (Oπ )T (T˜π )T ]X T ˜ π Ππ V π (V π )T (Ππ )T (O ˜ π )T ](T π )T X T . +γ 2 XT π E[O 5.5

(5.43)

Experiment and Results The purpose of this section is two-fold. First, we aim to evaluate the approxima-

tions used when deriving our estimate of the variance in the value function. Second, we wish to illustrate how the method can be used to compare different policies for a given task. 5.5.1

POMDP dialog manager

We begin by evaluating the method on synthetic data from a human-robot dialog task. The use of POMDP-based dialog managers is well-established [8, 50, 36]. However, it is often not easy to get training data in human-robot interaction domains.

62

With small training sets, error terms tend to be important. Estimates of the error variance are therefore helpful to evaluate and compare policies. Here, we focus on a small simulated problem which requires evaluating dialog policies for the purpose of acquiring motion goals from the user. We presume a human operator is instructing an assistive robot to move to one of two locations (e.g. bedroom or bathroom). While the human intent (i.e. the state) is one of these goals, the observation received by the robot (presumably through a speech recognizer) might be incorrect. The robot has the option of asking the user to repeat its request to ensure the goal was understood correctly. Note however that the human may change his/her intent (the state) with a small probability. Fig 5–1 shows a model of the described situation. In the generative model (used to provide the training data), we assume the probability of a wrong observation is 0.15 and the human might change goals with probability 0.05. If the robot acts as requested, it gets a reward of 10 ; otherwise it

goto bedroom

goto bathroom

Fig. 5–1 – Example of a dialog POMDP. Nodes correspond to states of the system (user’s intention). Dashed lines refer to transitions while taking action “ask ”. Solid lines are the transitions when the robot moves.

63

goto x

goto y x/y x

ask

y

x/y ask

y x

ask

y x

Fig. 5–2 – Policy graph for the POMDP dialog manager. Nodes correspond to the states of the finite state controller. Edges show how the controller changes state as an observation is received. The labels on the nodes are the actions issued at each state of the controller. gets a −40 penalty. There is a small penalty of −1 when asking for clarification. We assume γ = 0.95. Fig 5–2 shows one possible policy graph for the described POMDP dialog manager. This policy graph corresponds to the policy where the robot keeps asking the human until it receives an observation twice more than the other one. We ran the following experiment : given the fixed policy of Fig 5–2 and a fixed number n, we draw on-policy labeled trajectories that on the whole contain n transiˆ and use Eqn 2.35 to tions. We use these to calculate the empirical models (Tˆ and O), calculate the value function. Then we use Eqn 5.43 to calculate the covariance and standard deviation of the value function at the initial belief point (b0 = [0.5; 0.5]). Let V (b0 ) be the expected value at the initial belief state b0 , and let α = [α1 ; α2 ] be the vector of coefficients describing the corresponding linear piece (at point b0 ) in the piecewise-linear value function. We have V (b0 ) = E[α · b] = (α1 + α2 )/2 and thus 64

the variance of V (b0 ) can be calculated as : var(V (b0 )) =

var(α1 ) + var(α2 ) + 2cov(α1 , α2 ) . 4

(5.44)

Note that α1 and α2 are elements of V for which we have an estimated covariance matrix. Thus, we can calculate the above by substituting the corresponding variance and covariance terms. Fixing the size of the training set, we run the above experiment 1000 times. For each trial, we calculate the empirical value of the initial belief state (Vˆ (b0 )), and estimate its variance using Eqn 5.44. We then calculate the percentage of cases in which the estimated value (Vˆ (b0 )) lies within 1 and 2 estimated standard deviations of the true value (V (b0 ), calculated using true models). Assuming that the error between the calculated and true value has a Gaussian distribution (this was confirmed by plotting the histogram of error terms), these values should be 68% and 95% respectively. Fig 5–3 confirms that the variance estimation we propose satisfies this criteria. The result holds for a variety of sample set sizes (from n=1000 to n=5000). To investigate how these variance estimates can be useful to compare competing policies, we calculate the variance of the value function for two other policies on this dialog problem (we presume these dialog policies are provided by an expert, though they could be acquired from a policy iteration algorithm, such as [18]). One policy is to ask for the goal only once, and then act according to that single observation. The other policy is to keep asking until the robot observes one of the goals three times more than the other one, and then act accordingly. Fig 5–4 shows the 1 standard deviation interval for the calculated value of the initial belief state as a function of

65

100

Percentage below 1 (+) and 2 (x) STDs

90 80 70 60 50 40 30 20 10 0 500

1000

1500

2000

2500 3000 3500 Number of samples

4000

4500

5000

5500

Fig. 5–3 – Percentage of the cases in which Vˆ (b0 ) lies within 1 (+) and 2 (×) approximately calculated standard deviations from V (b0 ) - the dialog problem. the number of samples, for each of our three policies (including the one shown in Fig 5–2). Given larger sample sizes, the policy in Fig 5–2 becomes a clear favorite, whereas the other two are not significantly different from each other. This illustrates how our estimates can be used practically to assess the difference between policies using more information than simply their expected value (as is usually standard in the POMDP literature).

66


Fig. 5–4 – 1 standard deviation interval for the calculated value of the initial belief state for different policies on the dialog problem. The variance on the value of all these policies approaches 0 in the limit of infinite samples.

5.5.2 Medical Domain

We now evaluate the accuracy of our approximation in a medical decision-making task involving real data. We will be constructing a POMDP model based on the STAR*D data set (described in Chapter 3). The POMDP framework offers a powerful model for optimizing treatment strategies from such data. However given the sensitive


nature of the application, as well as the cost involved in collecting such data, estimates of the potential error are highly useful. For the current experiment, we focus on a numerical score called the Quick Inventory of Depressive Symptomatology (QIDS), which roughly indicates the level of depression. This score was collected throughout the study in two different ways: a self-report version (QIDS-SR) was collected using an automated phone system; a clinical version (QIDS-C) was also collected by a qualified clinician. For the purposes of our experiment, we presume the QIDS-C score completely describes the patient's state, and the QIDS-SR score is a noisy observation of the state. To make the problem tractable with small training data, we discretize the score (which usually ranges from 0 to 27) uniformly according to quantiles into 2 states and 3 observations. The data set includes information about 4 steps of treatments. We focus on policies which only differ in terms of treatment options in the second step of the sequence (other treatment steps are held constant). There are seven treatment options at that step. A reward of 1 is given if the patient achieves remission (at any step); a reward of 0 is given otherwise. Although this is a relatively small POMDP domain, it is nonetheless an interesting validation for our estimate, since it uses real data, and highlights the type of problem where these estimates are particularly crucial. We focus on estimating the variance in the value estimate for the policy shown in Fig 5–5. This policy includes only three treatments: medication A is given to patients with low QIDS-SR scores, medication B is given to patients with medium QIDS-SR scores, and medication C is given to patients with high QIDS-SR scores.



Fig. 5–5 – The policy graph for the STAR*D problem. Nodes correspond to the states of the finite state controller. Edges show how the controller changes state as an observation is received. The labels on the nodes are the actions issued at each state of the controller.

Since we do not know the exact value of this policy (over an infinitely large data set), we use a bootstrapping estimate. This means we take all the samples in our dataset which are consistent with this policy, and presume that they define the true model and true value function. Now, to investigate the accuracy of our variance estimate, we subsample this data set, estimate the corresponding parameters, and calculate the value function using Eqn 2.35. To summarize the value function into a single value (denoted by V(B)), we simply take the average over the 3 linear pieces in the value function. Thus, the variance of V̂(B) is the average of the elements of the covariance matrix we calculated for the value function. To check the quality of the estimates, we calculate the percentage of cases in which the calculated value lies within 1 and 2 standard deviations from the true value. If the error term in the value function has a normal distribution, these

percentages should again be 68 and 95. Fig 5–6 shows the mentioned percentages as a function of the number of samples. Here again, the variance estimates are close to what is observed empirically. However we see a slight deviation from the expected, especially for N < 45, which we attribute to the approximation introduced by the bootstrapping estimate.


Fig. 5–6 – Percentage of cases in which Vˆ (B) lies within 1 (+) and 2 (×) approximately calculated standard deviations from V (B) in the STAR*D problem.


Finally, we conduct a sample experiment to compare policies with different choices of medications in the policy graph of Fig 5–5. During the STAR*D experiment, patients mostly preferred not to use a certain treatment (CT: Cognitive Therapy). To study the effect of this preference, we compared two policies, only one of which uses CT. As shown in Fig 5–7, the CT-based policy has a slightly better expected value, but much higher variation. While a standard RL analysis (which usually recommends the action with the highest expected value) would prescribe


Fig. 5–7 – 2 standard deviation interval for the calculated value of the summarized belief state for different policies on the STAR*D problem.


the policy that uses CT, using the result of this analysis one might prefer the non CT-based policy for two reasons: even with high empirical values, we have little evidence to support the CT-based policy, and moreover, CT is not preferred by most patients. Such a method can be applied in similar cases for comparison between an empirically optimal policy and medically preferred ones.

5.6 Discussion

This chapter discusses how the use of imperfect empirical models generated from sample data (instead of the true model) introduces bias and variance terms in the value function of Markovian decision processes. We extend the methods introduced for MDPs by Mannor et al. [26, 27] to the case of POMDPs. We present a method to approximately calculate these errors for the POMDP value function in terms of the statistics of the empirical models. Such information can be highly valuable when comparing different action selection strategies. During policy search, for instance, one could make use of these error terms to search for policies that have high expected value and low expected variance. Furthermore, in some domains (including human-robot interaction and medical treatment design), where there is an extensive tradition of using hand-crafted policies to select actions, the method we present is useful to compare hand-crafted policies with the best policy selected by an automated planning method. The method we present here can be further extended to cases where the reward model is also unknown and is approximated by sampling. However, the derived equations are more cumbersome as we need to take into account the potential correlations between the reward and transition models.


One main drawback of these techniques is that they are generally difficult to scale to domains with large state spaces. Further approximation techniques might prove to be useful in such cases. This is, however, beyond the scope of this thesis and remains an interesting open question for future work. In the next chapter, we will extend some of these methods to compare policies in statistically meaningful ways.


Chapter 6
Decision Making under Model Uncertainty

In Chapter 4 we define the notion of non-deterministic policies and demonstrate how they can be used to provide choice and options for a human user. In that context, the assumption is that the user may make the worst possible choice among the proposed actions. Thus, the idea is to bound the performance of the worst possible policy among the set of all options, while trying to provide as many action choices as possible. Chapter 4 proposes methods to guarantee a worst-case performance within some margin of the optimal policy. Alternatively, we could assume that the acting agent will be choosing the best possible option among any non-deterministic set of proposed actions. Now the task is to find the smallest non-deterministic policy that will almost surely include the optimal or near-optimal policies. Notice that this is relevant in cases where there is uncertainty in the model. This can be thought of as a pre-selection process that narrows down the options as far as is statistically plausible, given a finite amount of sample data, and asks the human user to make the final decision. We can then assume a best-case scenario in which the user will surely select the optimal choice. Such a mechanism can be especially useful with decision support systems. Take, for instance, a decision support system that must choose the optimal action among 10 options. Given a few sample results for each action, one might check and see that 2 of them are distinctly better than the others, but the difference between

the outcomes of these 2 is not statistically significant. A good decision would be to report both of these results, and indicate that there is not enough evidence to decide which one is better. In this chapter, we seek to bring this idea to the case of RL planning, in which multi-step policies are involved instead of single-step actions. We focus on possibly-optimal non-deterministic policies on MDPs. The extension to the POMDP case remains an interesting direction for future work.

6.1 Possibly-Optimal Non-Deterministic Policies

We want to find the smallest non-deterministic policy that will include the optimal (or an ε-optimal) deterministic policy with some probability 1 − δ. The task is to narrow down the options so that it will be easier to make a decision, while providing some guarantee that the optimal policy is still within the possible choices. This is particularly useful with decision support systems. Alternatively, we could specify the problem in terms of optimal actions at each state. That is, we seek to find the set of actions at each state that will almost surely (with probability 1 − δ) include the optimal action. This condition implies that the optimal policy is included with probability no less than (1 − δ)^n, where n is the size of the state space. Thus we can use each definition interchangeably by choosing the appropriate value for δ. This can be thought of as an extension to the Knows What It Knows (KWIK) framework [23]. The idea of the KWIK framework is that the learner can opt out of predictions by issuing an "I do not know" (⊥) action. The goal is then to learn


the task with a (polynomial) bound on the number of such ⊥ responses, with a tiny probability of ever giving a wrong answer. Here, instead of opting out of prediction (action selection in the case of RL planning), we will provide a set of possibly optimal actions. That is, we will prune down the action set as much as the data allows us to. We can of course extend this idea to a generalized KWIK learning framework that tries to bound the total size of the action sets over time, while making sure there is only a small chance of missing the optimal action. However, this too is beyond the scope of this thesis and remains an interesting area of future work.

6.1.1 Optimality Probability

The statistical analysis of near-optimality often involves calculating the probability that a particular policy π is ε-optimal:

$$p^*_\pi = \Pr\{\forall \pi', s : V^\pi(s) \geq V^{\pi'}(s) - \epsilon\}. \qquad (6.1)$$

We can define a similar probability in terms of the optimal actions. That is, we seek the probability that any particular action is ε-optimal:

$$p^*_{s,a} = \Pr\{\forall a' : Q^*(s,a) \geq Q^*(s,a') - \epsilon\}. \qquad (6.2)$$

If we can find a way to estimate these probabilities, we can then solve the proposed problem of finding possibly-optimal non-deterministic policies. Formally, given a non-deterministic policy Π, the probability that this policy does not include any ε-optimal action for state s is:

$$\prod_{a \in \Pi(s)} (1 - p^*_{s,a}). \qquad (6.3)$$

Thus, if we want to have the optimal action in this set with probability no less than (1 − δ), we need to add probable actions until the above product is less than δ.
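Given estimates of the p*_{s,a} (obtained, for instance, with the procedures discussed next), the per-state action set can be built greedily by adding the most probable actions until the failure probability of Eqn 6.3 drops below δ. A minimal sketch, with a hypothetical function name:

```python
def possibly_optimal_actions(p_star_s, delta):
    """Greedily build the action set whose product of (1 - p*_{s,a}) is below delta.

    p_star_s : dict mapping action -> estimated probability that it is eps-optimal
    """
    chosen, failure = [], 1.0
    for a, p in sorted(p_star_s.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(a)
        failure *= (1.0 - p)        # remaining probability of missing every eps-optimal action
        if failure < delta:
            break
    return chosen
```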

6.1.2 Monte Carlo Approximation

Given a finite set of samples, one might be able to build probability distributions over the model parameters using a Bayesian approach. For instance, given a set of sample trajectories on an MDP, the posterior of the transition function has a Dirichlet distribution. If the rewards are stochastic, then we know from the central limit theorem that the mean of the rewards (R̄) has a normal distribution that shrinks in variance as we get more samples. Now, given these probability distributions over T and R̄, we want to find the probability that an action is optimal in a particular state. One obvious way to solve this problem is by sampling, that is, to get a Monte Carlo approximation of the p*_{s,a}'s. We can basically sample T and R̄ from their distributions and then, for each sample, calculate the value of the optimal policy (using linear programming, value iteration, policy iteration, Q-learning, etc.). Then we can use a frequentist approach to estimate the probability p* that an action is optimal, simply by counting the number of times it comes up in the optimal policy. Terreault et al. [48] use this sampling method to find confidence intervals over the value of the optimal policy. Here, we can use it to build the set of possibly optimal actions. Jong


and Stone [19] use a similar sampling method to find the significance levels in the hypothesis testing of policy irrelevance of features in a feature selection problem. Although Monte Carlo approximation is straightforward to implement, it is generally slow and requires many samples to compute the p* values with reasonable accuracy. However, with enough samples, at a higher computational cost, we are guaranteed to get as accurate as needed on the p* values, as Monte Carlo is an unbiased estimator. This method does not put any specific constraints on the distribution of the value function and is thus the most general solution to our problem. We will take the results from this estimator to be the correct values for p* and will compare them with other proposed methods to evaluate their accuracy.
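A minimal sketch of this Monte Carlo estimator is given below. The inputs are assumptions (Dirichlet counts for the transition model, rewards treated as known), the +1 on the counts plays the role of a uniform prior, and value iteration is used to solve each sampled MDP.

```python
import numpy as np

def monte_carlo_p_star(N, R_bar, gamma, eps, n_samples=1000, rng=None):
    """Monte Carlo estimate of p*_{s,a} (Eqn 6.2) by sampling MDPs from the
    Dirichlet posterior of the transition model (rewards assumed known)."""
    rng = np.random.default_rng() if rng is None else rng
    S, A, _ = N.shape
    hits = np.zeros((S, A))
    for _ in range(n_samples):
        # Sample a transition model: one Dirichlet per (s, a); +1 is a uniform prior.
        T = np.array([[rng.dirichlet(N[s, a] + 1) for a in range(A)] for s in range(S)])
        # Solve the sampled MDP by value iteration.
        V = np.zeros(S)
        for _ in range(1000):
            Q = R_bar + gamma * T.dot(V)
            V_new = Q.max(axis=1)
            if np.max(np.abs(V_new - V)) < 1e-8:
                V = V_new
                break
            V = V_new
        Q = R_bar + gamma * T.dot(V)
        hits += (Q >= Q.max(axis=1, keepdims=True) - eps)   # count eps-optimal actions
    return hits / n_samples
```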

6.1.3 General Case with Gaussian Assumption

In this section, we extend the methods developed in Chapter 5 to compute the mean and covariance of the joint random variable containing the value functions of different policies. We will be extending the bias/variance method proposed by Mannor et al. [27]. This method relies on a Gaussian assumption on the value function and may therefore be biased. More specifically, we tackle the following sub-problem : given a set of policies, how can we decide if one of them is ε-better than the others with some probability 1 − δ ? Suppose we want to compare n policies π1 , . . . , πn . Here we assume that the reward model is known, and we estimate the transition probabilities using a frequentist approach. We define the following :


– Ti : the transition matrix for policy πi . These matrices have correlated entries whenever different policies use the same action in the same state.
– Vi : the vertical vector representing the value function of policy πi .
– Ri : the vertical vector containing the immediate rewards of the actions of policy πi .
– V : the vertical concatenation of the Vi 's.
– R : the vertical concatenation of the Ri 's.
– T : the n|S| × n|S| block-diagonal matrix with n diagonal blocks, each of size |S| × |S| ; the ith diagonal block is Ti .

Using the above definitions, we can write a joint Bellman equation for all n policies as one single matrix equation :

V = R + γ T V.    (6.4)

As we saw earlier, when we use the empirical transition probabilities we get the empirical values :

V̂ = R + γ T̂ V̂,    (6.5)

where V̂ = V + Ṽ and T̂ = T + T̃, and T̃ and Ṽ are the error terms. The expected value of the above will be the empirical value function (for which we can find the bias term). If we do the variance analysis as in Chapter 5 with a second order approximation (the exact same derivations), and assuming that the rewards are exactly known, the covariance matrix is approximated by :

Σ_V = cov(V) = γ² E[f₁ R R^T f₁^T],    (6.6)

where f_k = (X T̃)^k X and X = (1 − γT)^{−1}. We can rewrite the above as :

Σ_V = γ² X E[T̃ V V^T T̃^T] X^T.

To calculate this formula, we use the following lemma :

Lemma 2. Let Q be an n × n matrix given by Q = A X A^T, where A is an n × m matrix of zero-mean random variables and X is a constant m × m matrix. The ijth entry of E[Q] is equal to :

E[ Σ_{k,l} A_{ik} X_{kl} (A^T)_{lj} ] = Σ_{k,l} X_{kl} E[A_{ik} A_{jl}] = Σ_{k,l} X_{kl} cov(A_{ik}, A_{jl}),

which only depends on the four-dimensional covariance of the matrix A.

The covariance of T̃ can be fully defined in terms of the number of samples per state-action pair, and the actions taken in each state for each policy. Formally, if two elements of T correspond to transitions from the same state while taking the same action, then their covariance is not zero, and can be calculated from the Dirichlet distribution. We can use the above formulation to approximately calculate the covariance of the value functions of all the states and all n policies. We can also assume that the bias term is negligible, meaning that :

E[V] ≈ V̂.    (6.7)
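
As a sketch of how Lemma 2 can be turned into a computation, the snippet below assembles the covariance approximation of Equation (6.6). It takes as given the four-dimensional covariance tensor of T̃ (assembled from the Dirichlet posteriors as described above), and uses V̂ V̂^T in place of V V^T and the empirical T̂ inside X; these substitutions and all variable names are choices of this example, not derivations from the thesis.

    import numpy as np

    def lemma2_expectation(X, cov4):
        """E[A X A^T] for a zero-mean random matrix A, where
        cov4[i, k, j, l] = cov(A_ik, A_jl)  (Lemma 2)."""
        # E[A X A^T]_{ij} = sum_{k,l} X_{kl} cov(A_ik, A_jl)
        return np.einsum('kl,ikjl->ij', X, cov4)

    def value_covariance(T_hat, V_hat, cov4_T, gamma):
        """Second-order approximation Sigma_V ~ gamma^2 X E[T~ V V^T T~^T] X^T,
        with X = (1 - gamma T)^{-1} evaluated at the empirical T_hat and
        V V^T approximated by V_hat V_hat^T."""
        X = np.linalg.inv(np.eye(len(V_hat)) - gamma * T_hat)   # the resolvent X of the text
        middle = lemma2_expectation(np.outer(V_hat, V_hat), cov4_T)
        return (gamma ** 2) * X @ middle @ X.T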

Now, to compare policies, we take a weighted average with the initial probability distribution µ. We are therefore comparing the µV_i 's of the different policies, and we want to find the probability that a policy π_k is ε-better than the others :

Pr{ ∀i : µV_k ≥ µV_i − ε }.    (6.8)

Assuming that the V_i 's are jointly (correlated) normal, we can apply an affine transformation to V that subtracts the weighted value of π_k from those of the other policies and still obtain a normal distribution. Without loss of generality, we can assume that k = 1 and thus define the transformation matrix B_k as :

B_1 = [ −µ   µ   0   ⋯   0
        −µ   0   µ   ⋯   0
         ⋮               ⋱
        −µ   0   0   ⋯   µ ],    (6.9)

where each µ denotes the 1 × |S| row vector of initial probabilities and each 0 a row of zeros. This implies :

Pr{ ∀i : µV_k ≥ µV_i − ε } = Pr{ B_k V ≤ ε },    (6.10)

where the inequality is taken element-wise.

As mentioned, ∆_k = B_k V is an (n − 1)-dimensional multivariate normal. The covariance matrix of ∆_k is B_k Σ_V B_k^T and its mean is B_k E[V] ≈ B_k V̂. So we have a multivariate normal with known parameters, and thus we get :

Pr{ ∀i : µV_k ≥ µV_i − ε } = Pr{ B_k V ≤ ε } = Φ_k(ε),    (6.11)

where Φ_k is the c.d.f. of the multivariate normal random variable ∆_k .
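
Under the Gaussian assumption, this probability can be evaluated with a multivariate normal c.d.f. routine based on numerical integration of the kind described by Genz [12]. The sketch below takes the stacked empirical values V̂, the covariance approximation Σ_V from above, and the initial distribution µ as inputs; the function name and the handling of singular covariances are choices of this example, not part of the thesis.

    import numpy as np
    from scipy.stats import multivariate_normal

    def prob_policy_is_best(k, V_hat, Sigma_V, mu, eps):
        """Approximate Pr{forall i : mu V_k >= mu V_i - eps}  (Eq. 6.11), assuming
        the stacked value vector V is Gaussian with mean V_hat, covariance Sigma_V."""
        nS = len(mu)
        n = len(V_hat) // nS               # number of policies
        B = np.zeros((n - 1, n * nS))      # rows compute mu V_i - mu V_k, for i != k
        row = 0
        for i in range(n):
            if i == k:
                continue
            B[row, i * nS:(i + 1) * nS] = mu
            B[row, k * nS:(k + 1) * nS] = -mu
            row += 1
        delta_k = multivariate_normal(mean=B @ V_hat, cov=B @ Sigma_V @ B.T,
                                      allow_singular=True)
        return delta_k.cdf(np.full(n - 1, eps))   # Phi_k(eps)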

Non-tractable Solution

We now use this technique to solve the original problem. We begin by focusing on the policy version of the problem. That is, we seek a non-deterministic policy that will include an ε-optimal policy with probability greater than 1 − δ. For the set of all possible policies (exponential in the number of states), we construct the above problem and solve it to find, for each policy, the probability that it is ε-better than all the others. Let p*_π be the probability that policy π is ε-optimal. To solve the problem, we keep adding probable optimal policies until the probability of not having included an ε-optimal policy falls below δ :

∏_{π ∈ Π} (1 − p*_π) < δ.    (6.12)

This method, of course, does not scale to domains with a large number of possible policies. One might be tempted to think that we can check two policies at a time and get the same result. This is not possible, because

Pr{ [µV_k ≥ µV_i] ∧ [µV_k ≥ µV_j] } ≠ Pr{ µV_k ≥ µV_i } Pr{ µV_k ≥ µV_j },

as these events are not necessarily independent.

Tractable Approximate Solution

As an approximation, one might only consider the policies whose expected values are close to that of the optimal one. That is, instead of comparing all policies, we only compare a few policies sampled around the optimal one. Such sampling techniques could make this method tractable even in larger state spaces. In this thesis, however, we have not investigated this any further.


6.1.4 Directed Acyclic Transition Graphs

We saw that both the Monte Carlo and the distribution-based approximations have problems scaling to bigger domains for general MDPs. In this section, we focus on approximating the solution for the case of MDPs with DAG transition structures. This class of MDPs is particularly interesting as it includes multi-step decision making problems with finite horizon, such as those in medical domains. Unlike the general case, here we can calculate the joint probability distribution of the value function locally for each level of the DAG. We focus on multi-step decision making problems, although what we say here can be easily generalized to any DAG-structured MDP. We define a multi-step MDP as follows : there is a set of steps D = {1, 2, . . . , d}, where step k ∈ D contains n_k states, denoted by S_k . Further, we only have one-step transitions :

T(s, a, s') = 0,    ∀s ∈ S_k , s' ∉ S_{k+1}.    (6.13)

Let V*_{k,i} be the random variable describing the value of the optimal policy for the ith state at the kth step. Further, let V*_k denote the random vector containing the V*_{k,i}'s for i ∈ S_k . We will be looking for the joint probability distribution of the values of the states at each step (the probability distribution of the multivariate V*_k ). We will be approximating the probability distribution of the V*_k 's with Gaussians :

V*_k ∼ N(µ_k , Σ_k).    (6.14)

Similarly, we define Q*_{k,i,a} to be the random variable describing the optimal value of state-action pairs, and let Q*_k be the vector containing the Q*_{k,i,a}'s.

Here we describe a way to approximate the parameters of step k − 1 given the parameters of step k and the samples observed at step k − 1. Starting at the ith state of step k − 1 and taking action a, we might arrive at any state in S_k . Having some samples of this transition, we can build a Dirichlet distribution over the corresponding transition probabilities :

T_{k,i,a} ∼ Dir(α_{k,i,a}),    (6.15)

where T_{k,i,a} denotes the vector of probabilities of arriving at each state of S_k .

Because the structure of the MDP is a DAG, we can assume that T_{k,i,a} and V*_k are independent. Also, because the samples at the same step are taken independently, the T_{k,i,a}'s are independent of each other. We can therefore calculate Q*_{k−1,i,a} from V*_k and T_{k,i,a} . In order to write down the equations describing the parameters of Q*_{k−1,i,a} in terms of the parameters of V*_k and T_{k,i,a} , we make use of the second moment, denoted by µ₂ . To keep the notation consistent, we use the moments in function notation : µ(V*_k ), µ₂(Q*_{k−1,i,a}). We also make use of the Frobenius inner product :

A : B = Σ_i Σ_j A_{ij} B_{ij} = trace(A^T B) = trace(A B^T).    (6.16)

We repeat the following process iteratively until we arrive at the first level : (1) given the parameters of the multivariate V*_k , calculate the parameters of the multivariate Q*_{k−1} ; (2) then, using the distribution of Q*_{k−1} , approximately estimate the parameters of V*_{k−1} :

V*_d → Q*_{d−1} → V*_{d−1} → Q*_{d−2} → · · · → Q*_1 → V*_1 .    (6.17)

It is fairly easy to calculate the parameters of Q*_{k−1,i,a} from V*_k and T_{k,i,a} . The mean is simply the inner product of the means of T_{k,i,a} and V*_k :

µ(Q*_{k−1,i,a}) = µ(V*_k) : µ(T_{k,i,a}).    (6.18)

The covariance terms can also be calculated using the means and variances of T_{k,i,a} and V*_k :

E[(Q*_{k−1,i,a})²] = µ₂(V*_k) : µ₂(T_{k,i,a}),    (6.19)

E[Q*_{k−1,i,a} Q*_{k−1,i',a'}] = µ₂(V*_k) : ( µ(T_{k,i,a}) µ(T_{k,i',a'})^T ),    ∀(i, a) ≠ (i', a').    (6.20)

Now, having the parameters of Q*_{k−1} , we need to estimate the parameters of V*_{k−1} . Notice that at this point we need to select the best action to capture the optimal value function. This means that the random variable V*_{k−1,i} is the maximum of the random variables Q*_{k−1,i,a} . Also notice that the actual distribution of V*_{k−1} is not necessarily normal (even if the Q*_{k−1,i,a} 's are perfectly normal, their maximum is not necessarily normal). We are essentially approximating the distribution by its mean and variance under a normal assumption. Even so, there is no closed-form solution for this problem, and we thus need to use numerical or sampling methods. One way to estimate the mean and variance is to numerically calculate the integrals describing them, using quasi Monte Carlo methods of the kind often used for multivariate normal integration [12]. An easier way is to draw samples from Q*_{k−1} and use the maximum value over actions as a sample of V*_{k−1} . We can then use these samples to estimate the parameters of V*_{k−1} . Although the former solution is probably faster, the latter is easier to implement.
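
As a sketch of one backward step of this procedure (the local value sampling used in the experiments below), the following combines Equations (6.18)–(6.20) with the sampling-based estimate of the maximum. The Dirichlet-moment formulas, the array shapes, and the small ridge added for numerical stability are assumptions of this example; the thesis's equations cover the transition term, and here a known immediate reward R[i, a] is simply added as a constant shift of the mean, which leaves the covariance unchanged.

    import numpy as np

    def dirichlet_moments(alpha):
        """Mean vector and second-moment matrix E[t t^T] of a Dir(alpha) vector."""
        a0 = alpha.sum()
        m = alpha / a0
        cov = (np.diag(m) - np.outer(m, m)) / (a0 + 1.0)
        return m, cov + np.outer(m, m)

    def lvs_backward_step(mu_V, Sigma_V, alpha, R, n_samples=2000, seed=0):
        """One backward step:  N(mu_k, Sigma_k)  ->  N(mu_{k-1}, Sigma_{k-1}).
        alpha[i, a] : Dirichlet counts over next-step states for state i, action a.
        R[i, a]     : known immediate reward, added as a constant shift of the mean."""
        rng = np.random.default_rng(seed)
        S, A, S_next = alpha.shape
        T_mean = np.zeros((S, A, S_next))
        T_m2 = np.zeros((S, A, S_next, S_next))
        for i in range(S):
            for a in range(A):
                T_mean[i, a], T_m2[i, a] = dirichlet_moments(alpha[i, a])

        # Gaussian approximation of Q*_{k-1}  (Eqs. 6.18-6.20)
        M2_V = Sigma_V + np.outer(mu_V, mu_V)          # second moment of V*_k
        mu_Q = R + T_mean @ mu_V                       # (S, A)
        flat = T_mean.reshape(S * A, S_next)
        cov_Q = flat @ Sigma_V @ flat.T                # cross terms: T's independent
        for idx in range(S * A):
            i, a = divmod(idx, A)
            EQ2 = np.sum(T_m2[i, a] * M2_V)            # mu2(V) : mu2(T)
            cov_Q[idx, idx] = EQ2 - (T_mean[i, a] @ mu_V) ** 2

        # V*_{k-1,i} = max_a Q*_{k-1,i,a}: estimate its mean/covariance by sampling
        cov_Q += 1e-9 * np.eye(S * A)                  # ridge for numerical stability
        Q_samples = rng.multivariate_normal(mu_Q.ravel(), cov_Q, size=n_samples)
        V_samples = Q_samples.reshape(n_samples, S, A).max(axis=2)
        return V_samples.mean(axis=0), np.cov(V_samples, rowvar=False)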


In the end, we have constructed a distribution over the optimal value function of the state-action pairs at the first level. We can use this distribution to find the p* values, and then use them to select a set of actions that contains an ε-optimal action with probability at least 1 − δ, as in Section 6.1.1.

6.2 Empirical Results

The discussed methods were implemented on a toy problem resembling a multi-step medical trial. We experimented with a 5-step decision making process with 10 states in each step and uniform random rewards for all state-action pairs (between 0 and 10). A uniform sampling was performed on the state transitions. The Monte Carlo method was compared to the proposed method with local value sampling for MDPs with DAG transition graphs (LVS). As we saw before, the Monte Carlo solution is unbiased and can be thought of as the correct answer in the limit of infinite samples. However, LVS converges much faster than Monte Carlo to values close to the correct solution (always within the tolerance bound of the convergence). Table 6–1 shows the running time of the Monte Carlo method compared to the LVS method. This is the time it takes for each method to converge on the mean of the distribution within the tolerance bound specified in the table, averaged over 5 runs.

Tab. 6–1 – Running Time of Monte Carlo Compared to LVS

Tolerance        ε = 1            ε = 0.1          ε = 0.01
Monte Carlo      21.4 ± 0.2 (s)   36.6 ± 0.4 (s)   72.6 ± 0.5 (s)
LVS              1.0 ± 0.1 (s)    1.5 ± 0.2 (s)    2.0 ± 0.1 (s)

It can be seen that LVS is one to two orders of magnitude faster than Monte Carlo, while providing similar results.

6.3 Discussion

In this chapter we considered the problem of finding possibly-optimal non-deterministic policies. These are ensembles of deterministic policies, each of which has a high probability of being the optimal one. This is especially useful when the amount of sampled data gathered through interactions with the environment is not sufficient to pinpoint the optimal actions with reasonable statistical significance. Such a process can be thought of as a conservative planning system. The main idea is not to specify which action to take when the uncertainty is beyond a specific threshold. The KWIK framework suggests a null action, which indicates that the system “does not know” what to do. In our setting, however, the system is asked to provide the set of actions that are likely to be the optimal one. This is particularly useful in decision support systems where a human makes the final decision. Further analysis of the sample complexity of such algorithms is an interesting direction for future work.

As we saw before, a few methods have been developed to provide confidence bounds over the value function. In this chapter we formally defined the problem and extended those methods so that we can meaningfully compare policies and actions. Although we did not consider this problem in the case of POMDPs, the extension of the confidence interval method of Chapter 5 into a policy comparison technique should be similar to that for the MDPs discussed in this chapter. This, however, remains an interesting direction for future work and is beyond the scope of this thesis.

Confidence intervals over the value function have also been studied in model-free settings; GPTD [5] and GPSARSA [6] are examples of such methods. Extensions of these methods to statistically compare policies might be beneficial in decision support systems working with model-free learning methods. This might, however, require further extensions to the statistical tests we use in the model-based case. This also remains an interesting avenue for future work.


Chapter 7
Conclusion

This thesis introduces the new concept of non-deterministic policies and their potential use in decision support systems based on Markovian processes. In this context, we investigate how the assumption that a decision making system should return a single optimal action can be relaxed to instead return a set of actions. The benefits of this new perspective on sequential decision making are twofold. First, when we have several actions for which the differences in performance are negligible, we can report all of those actions as near-optimal options. For instance, in a medical setting, the difference between the outcomes of two treatment options might not be “medically significant”. In that case, it is beneficial to provide all the near-optimal options. This not only makes the system more flexible and user-friendly, but also more robust to noise and defects in the model used for decision making. In the medical decision making process, for instance, the physician can make the final decision among the near-optimal options based on medical expenses, side effects, the patient's preferences, or any other criterion that is not captured by the model used in the decision support system. The key constraint, however, is to make sure that, regardless of the final choice of actions, the performance of the active policy is always bounded near the optimal.

Another potential use of non-deterministic action sets in Markovian decision processes is to capture uncertainties in the optimality of actions. Oftentimes, the amount of data from which models are constructed is not sufficient to clearly identify a single optimal action. If we are forced to choose only one action as the optimal one, we might have a high chance of making the wrong decision. However, if we are allowed to provide a set of possibly-optimal actions, then we can make sure we include all the promising options while cutting off the obviously bad ones. In this setting, the task is to trim the action set as much as possible while providing the guarantee that the optimal action is still among the remaining options. Note that non-deterministic policies are inherently different from stochastic policies. Stochastic policies assume a randomized action selection strategy with specific probabilities, whereas non-deterministic policies do not impose such a constraint. We can thus use best-case and worst-case analyses with non-deterministic policies to reflect different scenarios involving the human user.

7.1 Summary of Contributions

Chapter 2 provides a quick survey of the problem of sequential decision making in Markovian processes, including how MDPs and POMDPs can be used to model the system's behaviour while interacting with an intelligent agent in observable or partially observable settings. As discussed, the value function is often used as the main performance measure when comparing different strategies and policies.

Chapter 3 examines how RL methods can be used to solve decision making problems in sequential clinical trials. These medical trials are particularly interesting as potential test-beds for the non-deterministic methods introduced in this thesis.

We introduce the new concept of non-deterministic policies on MDPs in Chapter 4, and discuss their potential use in providing choice to the acting agent and handling model uncertainty. The main contributions of that chapter are the two methods developed to find maximal ε-optimal policies. One method is based on a mixed-integer program. As this program is computationally expensive and has trouble scaling to bigger domains, we also introduce a general search algorithm that can be further extended with heuristics. The search algorithm is applied to a problem in sequential medical decision making as an example of its potential use.

We further investigate the problem of handling model uncertainty in Chapter 5, first reviewing the previous work on providing confidence intervals for MDPs. The primary contribution of that chapter is the introduction of a similar method for providing confidence intervals on the value function of POMDPs. We then examine how these confidence intervals can be used in medical decision making to compare policies in a more meaningful manner.

The problem is further formalized in Chapter 6, where we define near-optimal non-deterministic policies for Markovian processes. We extend the methods introduced in Chapter 5 to compare different policies in a statistically meaningful way. We then show how this new method compares to Monte Carlo methods, and how it can be used to provide sets of possibly optimal actions. We pay special attention to sequential decision making in medical domains, where statistical and performance guarantees are crucial for the practical acceptance of proposed strategies. We apply our methods to a medical problem concerning depression disorders and show how they can be used to provide meaningful comparisons and useful decision guidelines in such settings.


7.2 Future Work

Many of the methods and algorithms discussed in this thesis were developed as proofs of concept. The idea of non-deterministic policies, however, introduces a wide range of new problems and research topics.

In Chapter 4, we discuss the idea of near-optimal non-deterministic policies and address the problem of finding the one with the largest action set. As mentioned, there are other optimization criteria that might be useful for decision support systems. These include maximizing the decision margin (the margin between the worst selected action and the best action not selected), or alternatively minimizing the uncertainty of a wrong selection. Another avenue of future work could be the development of better approximations to the methods presented in Chapter 4, so that we can scale them to problems in bigger domains. This would be particularly useful when working with a large number of state variables. A more rigorous look into the heuristics considered for the search algorithm of Chapter 4 is also an interesting research topic.

Although we extended the confidence interval methods of Chapter 5 to provide statistical comparisons of policies on MDPs, we did not do the same for the case of POMDPs. The derivations and techniques should be similar to the ones proposed in Chapter 6. However, the practical use of these methods on POMDPs needs further research. Approximation techniques for DAG transition graphs could also be studied in the context of POMDPs. This might prove to be useful in medical domains, as the assumption of complete observability of the state is often impractical.


Finally, the idea of possibly optimal policies can be extended to model-free approaches in RL. GPTD and GPSARSA are among the methods developed to provide confidence bounds directly on the value function without keeping an explicit model of the system. This might, in turn, require continuous extensions of the statistical tests we use in the model-based case with finite and discrete domains. The techniques mentioned in this thesis can also be used within variable selection algorithms. Jong and Stone [19] have previously looked at this topic in small settings, but the idea can be further extended with the methods introduced here to develop algorithms that find policy-relevant features and variables. Such methods might prove to be useful in medical domains, where a few hundred variables are collected for each treatment case. This would, in turn, result in greater usability and flexibility of medical decision support systems.


References

[1] K. J. Astrom. Optimal control of Markov decision processes with incomplete state estimation. Journal of Mathematical Analysis and Applications, 10:174–205, 1965.
[2] R. Bellman. Dynamic Programming. Princeton University Press, 1957.
[3] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol 2. Athena Scientific, 1995.
[4] A. R. Cassandra, L. P. Kaelbling, and M. L. Littman. Acting optimally in partially observable stochastic domains. In Proceedings of the Twelfth National Conference on Artificial Intelligence (vol. 2), pages 1023–1028, 1994.
[5] Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: The Gaussian process approach to temporal difference learning. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), pages 154–161, 2003.
[6] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In Proceedings of the Twenty-Second International Conference on Machine Learning (ICML), pages 201–208, 2005.
[7] D. Ernst, G. B. Stan, J. Concalves, and L. Wehenkel. Clinical data based optimal STI strategies for HIV: a reinforcement learning approach. In Proceedings of the Fifteenth Machine Learning Conference of Belgium and The Netherlands (Benelearn), pages 65–72, 2006.
[8] F. Doshi and N. Roy. Efficient model learning for dialog management. In Proceedings of the ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 65–72, 2007.
[9] M. Fava, A. J. Rush, M. H. Trivedi, et al. Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatr Clin North Am, 26(2):457–94, 2003.

[10] J. A. Filar, L. C. M. Kallenberg, and H. M. Lee. Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1):147–161, 1989.
[11] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, Second Edition. Chapman & Hall/CRC, July 2003.
[12] A. Genz. Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1:141–150, 1992.
[13] A. Guez, R. D. Vincent, M. Avoli, and J. Pineau. Adaptive treatment of epilepsy via batch-mode reinforcement learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 1671–1678, 2008.
[14] E. A. Hansen. An improved policy iteration algorithm for partially observable MDPs. In Proceedings of the Tenth Annual Conference on Advances in Neural Information Processing Systems (NIPS), pages 1015–1021, 1997.
[15] E. A. Hansen. Solving POMDPs by searching in policy space. In Proceedings of the Fourteenth International Conference on Uncertainty in Artificial Intelligence (UAI), pages 211–219, 1998.
[16] M. Hauskrecht and H. Fraser. Planning treatment of ischemic heart disease with partially observable Markov decision processes. Artificial Intelligence in Medicine, 18(3):221–244, 2000.
[17] M. Heger. Consideration of risk in reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning (ICML), pages 105–111, 1994.
[18] S. Ji, R. Parr, H. Li, X. Liao, and L. Carin. Point-based policy iteration. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, pages 1243–1249, 2007.
[19] N. K. Jong and P. Stone. State abstraction discovery from irrelevant state variables. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence, pages 752–757, August 2005.
[20] N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4(4):373–395, December 1984.
[21] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49, 2002.

[22] S. Koenig and R. G. Simmons. Unsupervised learning of probabilistic models for robot navigation. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2301–2308, 1996.
[23] L. Li, M. L. Littman, and T. J. Walsh. Knows what it knows: a framework for self-aware learning. In Proceedings of the Twenty-Fifth International Conference on Machine Learning (ICML), pages 568–575, 2008.
[24] M. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
[25] P. Magni, S. Quaglini, M. Marchetti, and G. Barosi. Deciding when to intervene: a Markov decision process approach. International Journal of Medical Informatics, 60(3):237–253, 2000.
[26] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML), pages 308–322, 2004.
[27] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
[28] M. Milani Fard and J. Pineau. MDPs with non-deterministic policies. In Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS), pages 1065–1072, 2008.
[29] M. Milani Fard, J. Pineau, and P. Sun. A variance analysis for POMDP policy evaluation. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, pages 1056–1061, 2008.
[30] S. A. Murphy. An experimental design for the development of adaptive treatment strategies. Statistics in Medicine, 24(10):1455–1481, 2005.
[31] S. A. Murphy, K. G. Lynch, D. Oslin, J. R. McKay, and T. TenHave. Developing adaptive treatment strategies in substance abuse research. Drug and Alcohol Dependence, 88(Supplement 2):S24–S30, 2007.
[32] S. A. Murphy and J. R. McKay. Adaptive treatment strategies: an emerging approach for improving treatment effectiveness. Clin Sci (Newsletter of the American Psychological Association Division 12, Section III: The Society for the Science of Clinical Psychology), 2004.

[33] J. V. Neumann and O. Morgenstern. Theory of Games and Economic Behavior. Princeton University Press, 1944.
[34] J. Pineau, M. G. Bellemare, A. J. Rush, A. Ghizaru, and S. A. Murphy. Constructing evidence-based treatment strategies using methods from computer science. Drug and Alcohol Dependence, 88(Supplement 2):S52–S60, 2007.
[35] P. Poupart and C. Boutilier. Bounded finite state controllers. In Proceedings of the Sixteenth Annual Conference on Advances in Neural Information Processing Systems (NIPS), volume 16, pages 823–830, 2003.
[36] N. Roy, J. Pineau, and S. Thrun. Spoken dialog management for robots. In Proceedings of the Association for Computational Linguistics (ACL), 2000.
[37] A. J. Rush, M. Fava, S. R. Wisniewski, P. W. Lavori, M. H. Trivedi, H. A. Sackeim, M. E. Thase, A. A. Nierenberg, F. M. Quitkin, T. M. Kashner, D. J. Kupfer, J. F. Rosenbaum, J. Alpert, J. W. Stewart, P. J. McGrath, M. M. Biggs, K. Shores-Wilson, B. D. Lebowitz, L. Ritz, and G. Niederehe. Sequenced treatment alternatives to relieve depression (STAR*D): rationale and design. Controlled Clinical Trials, 25(1):119–142, 2004.
[38] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (Second Edition). Prentice Hall, 2003.
[39] M. Sato and S. Kobayashi. Variance-penalized reinforcement learning for risk-averse asset allocation. In Proceedings of the Second International Conference on Intelligent Data Engineering and Automated Learning, Data Mining, Financial Engineering, and Intelligent Agents, pages 244–249. Springer-Verlag, 2000.
[40] A. Schaefer, M. Bailey, S. Shechter, and M. Roberts. Handbook of Operations Research / Management Science Applications in Health Care, chapter Medical decisions using Markov decision processes. Kluwer Academic Publishers, 2004.
[41] L. S. Schneider, M. S. Ismail, K. Dagerman, S. Davis, J. Olin, D. McManus, E. Pfeiffer, J. M. Ryan, D. L. Sultzer, and P. N. Tariot. Clinical antipsychotic trials of intervention effectiveness (CATIE): Alzheimer's disease trial. Schizophr Bull, 29(1):57–72, 2003.
[42] R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21(5):1071–1088, 1973.

[43] M. J. Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19:794–802, 1982.
[44] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.
[45] R. M. Stone, D. T. Berg, S. L. George, R. K. Dodge, P. A. Paciucci, P. Schulman, E. J. Lee, J. O. Moore, B. L. Powell, and C. A. Schiffer. Granulocyte-macrophage colony-stimulating factor after initial chemotherapy for elderly patients with primary acute myelogenous leukemia. N Engl J Med, 332(25):1671–1677, 1995.
[46] T. S. Stroup, J. P. McEvoy, M. S. Swartz, M. J. Byerly, I. D. Glick, J. M. Canive, M. F. McGee, G. M. Simpson, M. C. Stevens, and J. A. Lieberman. The National Institute of Mental Health clinical antipsychotic trials of intervention effectiveness (CATIE) project: Schizophrenia trial design and protocol development. Schizophr Bull, 29(1):15–31, 2003.
[47] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, 1998.
[48] J. R. Tetreault, D. Bohus, and D. J. Litman. Estimating the reliability of MDP policies: a confidence interval approach. In HLT-NAACL, pages 276–283. The Association for Computational Linguistics, 2007.
[49] D. Tummarello, D. Mari, F. Graziano, P. Isidori, G. Cetto, F. Pasini, A. Santo, and R. Cellerino. A randomized, controlled phase III study of cyclophosphamide, doxorubicin, and vincristine with etoposide (CAV-E) or teniposide (CAV-T), followed by recombinant interferon-α maintenance therapy or observation, in small cell lung carcinoma patients with complete responses. Cancer, 80(12):2222–2229, 1997.
[50] J. D. Williams and S. Young. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language, 21(2), 2006.
[51] R. J. Williams and L. C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, Northeastern University, November 1993.

KEY TO ABBREVIATIONS

AI : Artificial Intelligence
CATIE : Clinical Antipsychotic Trials of Intervention Effectiveness
CT : Cognitive Therapy
DAG : Directed Acyclic Graph
KWIK : Knows What It Knows
LVS : Local Value Sampling
MDP : Markov Decision Process
MIP : Mixed Integer Program
POMDP : Partially Observable Markov Decision Process
QIDS : Quick Inventory of Depressive Symptomatology
RL : Reinforcement Learning
SMART : Sequential Multiple Assignment Randomized Trials
STAR*D : Sequenced Treatment Alternatives to Relieve Depression

