IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 59, NO. 4, APRIL 2014
Partial-Information State-Based Optimization of Partially Observable Markov Decision Processes and the Separation Principle Xi-Ren Cao, Fellow, IEEE, De-Xin Wang, and Li Qiu, Fellow, IEEE
Abstract—We propose a partial-information state based approach to the optimization of the long-run average performance in a partially observable Markov decision process (POMDP). In this approach, the information history is summarized (at least partially) by a (or a few) statistic(s), not necessarily sufficient, called a partial-information state, and actions depend on the partial-information state, rather than system states. We first propose the "single-policy based comparison principle," under which we derive an HJB-type optimality equation and policy iteration for the optimal policy in the partial-information-state based policy space. We then introduce the Q-sufficient statistics and show that if the partial-information state is Q-sufficient, then the optimal policy in the partial-information state based policy space is optimal in the space of all feasible information state based policies. We show that with some further conditions the well-known separation principle holds. The results are obtained by applying the direct comparison based approach initially developed for discrete event dynamic systems.
Index Terms—Direct comparison-based approach, finite state controller, HJB equation, performance potential, policy iteration, Q-factor, Q-sufficient statistics.
I. INTRODUCTION
In many real applications, the data collected are often corrupted by observation noise. The performance optimization of such a system is usually modeled as a partially observable Markov decision process (POMDP). The difficulty in solving a POMDP lies in the fact that the optimal decision at a particular time instant may depend on the entire information (observation + action) history before that time, and therefore the optimal policy may be defined on an infinitely large space, especially for infinite-horizon problems. It has been shown that searching for an optimal policy for a POMDP is PSPACE-complete [28].
It is well known that for partially observable linear systems with quadratic cost functions and Gaussian noises (LQG), the optimal policy can be obtained easily because the separation principle holds for such systems, see [2], [23], [40], and [41]. By the separation principle, the optimal policy for this problem exists in a subpolicy-space where all the policies are functions of the conditional expectation (mean) of the current system state given the information history, and the optimization problem can be separated into two steps: in the first step (filtering), we recursively calculate the conditional expectation of the system state, and in the second step (optimization), we use this conditional expectation as if it were the true state to get an optimal policy. With the separation principle, the LQG problem can be turned into a filtering problem and a fully observable Markov decision process (MDP) problem, for which many standard algorithms exist, see, e.g., [7], [31]. However, when the dynamics or the observations are nonlinear, the separation principle does not hold, and the problem becomes very complicated. There have been many excellent works on analyzing and solving general POMDP problems in the literature over the past 50 years (e.g., [4], [5], [3], [11], [13], [33]). Many works are based on the "information state," which is usually the (unnormalized) conditional distribution of the states. The original POMDP is equivalent to a completely observable Markov decision problem with the information state. The conditional probability can also be represented by a sufficient statistic [33]. The optimal policy with information states is usually defined on an infinite-dimensional space, except for some special systems. One interesting work is by Charalambous and Elliott [11], in which they found sufficient conditions under which the optimal policy can be computed explicitly for nonlinear systems when the nonlinearity appears in the dynamics only as the gradient of a potential function. In another direction, researchers in the reinforcement learning community are focusing on developing practically applicable algorithms that lead to optimal or suboptimal policies for POMDPs. Excellent progress has been made, and one of the promising approaches is the finite state controller [1], [16], [17], [21], [25], [29], [30], in which a finite set of states is used to approximate the information state and is then updated iteratively to contain more and more information. In this paper, we ask questions from a different direction: if the separation principle does not hold or the information state-based optimization is too complicated, then what is the best we can do? This question is addressed in three steps: First, based on some
general statistics of the information history, which may not be sufficient, can we find, in a feasible way, an optimal policy in the subpolicy-space in which policies only depend on these statistics, and how? If not, how good is what we can find? Second, if the answer is yes, then under what conditions is the optimal policy in this subpolicy-space also globally optimal among all feasible policies? If not, how close is the former to the latter? Finally, if the answer is yes again, then under what conditions does this optimal policy take the same form as the completely observable MDP, with the state replaced by the statistics considered as an estimate of the state, i.e., the separation principle holds? These questions are addressed with the direct comparison based approach to optimization [7]–[10], an alternative to dynamic programming. In this approach, policy iteration can be viewed as a discrete version of the gradient descent method, and optimization is done via a direct comparison of any two policies. The research in this direction originates from perturbation analysis of discrete event dynamic systems [7]. This approach is suitable for our study because it allows state aggregation, which is involved when we use statistics instead of the state. The direct comparison based approach provides a simple and intuitive way to answer the above questions. We use the terminology partial-information states to refer to any, or any set of, statistics of the information history (these statistics are usually not sufficient). A statistic can be any quantity derived from the information history. It can be viewed as an approximation or a simplified version of the sufficient statistics, or the information history [3], [11], [13]. The essential idea of our approach is the aggregation of all the states that the system may be in when a partial-information state is observed. Our results can be summarized as follows. First, we propose the "single-policy based comparison principle"; with this principle, we may verify whether a policy is optimal in a subpolicy-space where all the policies are functions of a partial-information state (we call such a policy "subspace optimal"), and find a better policy, if it is not, by analyzing the policy itself only. Policy iteration algorithms can be further developed for the subspace optimal policy. This subspace optimal policy may not be the same as the optimal policy for the underlying MDPs (in which the system state is completely observable), or the "true" optimal policy given the information history, and how good the subspace optimal policy is depends on the choice of the partial-information state. Second, we propose the "Q-sufficient statistics"; if a partial-information state is a Q-sufficient statistic, then, together with the single-policy based comparison principle, we may prove that the subspace optimal policy is the true optimal policy given the information history. This is an extension of the "informative statistics" for the finite horizon case defined in [33]. Third, we further derive conditions under which the subspace optimal policy has the same form as the one for the underlying MDPs. This revisits the separation principle with a new perspective; as expected, these conditions may appear strong. The rest of this paper is organized as follows. In Section II, we formulate the optimization of a general POMDP with a long-run average performance criterion to fit the direct comparison based approach, and introduce the partial-information states and the global states.
In Section III, we apply the direct-comparison
based optimization approach to solve for a subspace optimal policy for the POMDP with the partial-information-state based policies. The single-policy based comparison principle is proposed and conditions for it are derived. Under these conditions, an HJB optimality equation and policy iteration algorithms are derived for the subspace optimal policy. In Section IV, we propose the Q-sufficient statistics and provide conditions under which a partial-information-state based optimal policy is globally optimal. In Section V, we derive further conditions under which the separation principle holds for a POMDP. In Section VI, we provide three examples for the cases discussed in the above sections. Finally, in Section VII, we give some concluding remarks. The partial-information-state based optimization discussed in this paper is closely related to the finite state controller [16], [17], [21], [25], [29], [30], [32]. In a finite state controller, in addition to the system states, there is a finite set of machine states [16], or nodes [30], each of which may represent the current knowledge about the sufficient statistics and plays a role similar to that of a partial-information state. The main difference between the two approaches is the following: the finite state controller tries to make the machine state updates as informative as possible, and the update rule and the choice of actions are optimized simultaneously; in this paper, we analyze the policies on a partial-information state space and ask whether subspace optimal policies can be obtained and how good they are compared with the fully observable case. The space of machine states expands as it is updated, but the space of partial-information states is kept fixed and may be infinite; our results may shed light on the performance of the final policy obtained with a finite state controller.1 II. FORMULATION A. POMDPs In this paper, we study a discrete-time POMDP with continuous state space and continuous observation space denoted as , where is the state space, is the observation space, is the action space; they can be discrete or continuous (in this case, we consider , , and , with denoting the corresponding Borel -field); denotes the transition law, denotes the reward (cost) function, and denotes the observation law. We use the notations for the case with continuous spaces. Specifically, a POMDP evolves as follows: at time , if , an observation is obtained, with being a random variable having conditional distribution , then an action is applied to the system; then the system state jumps to , according to a transition probability , ; meanwhile the agent receives a reward . Then the process repeats from . Furthermore, we assume that for any action , is measurable , for any , is measurable , for any , is measurable , and for any and , is measurable . At time , the information available to the agent is the observation history and the action history . Together is called 1We thank one of the reviewers for pointing out the relation and difference between our approach and the finite state controllers.
the information history. At time , the action applied to the system can be determined according to some rules depending on , called policies, which can be deterministic or randomized. Let be the space of all information histories. A deterministic policy is a mapping from to : . With a randomized policy, the information history determines a probability distribution over , and the action applied to the system is chosen according to this distribution. Let denote the policy space. Different performance criteria can be defined for a POMDP. In this paper, we consider the following long-run average criterion: (1) which is assumed to exist (cf. the appendix). Here is the action applied to the system at time determined by a policy , and denotes a policy used by the agent. The superscript indicates that the performance depends on the policy. The goal of solving a POMDP is to determine a policy in that attains the best system performance (1) among all the policies in . If the state of a POMDP is assumed to be fully observable (i.e., for all ), it becomes a standard Markov decision process (MDP) problem, which is called the underlying MDP associated with this POMDP and can be denoted as . B. Global-State POMDPs In general, it is difficult to solve many POMDPs exactly because the information history may grow infinitely long. In this paper, we first define a quantity called the partial-information state to record the useful information contained in the information history ( is not necessarily an estimate of , despite what the notation might suggest), then find an optimal policy in a reduced policy space , where all the policies are functions of the current partial-information state; such a policy is said to be subspace optimal. Of course, the more information the partial-information state records, the better the subspace optimal policy; however, there is a tradeoff between the information recorded and the complexity of the partial-information state. A simple example is the conditional mean of the state given the information history; another example is the current observation or a number of the most recent observations such as , and a more complicated example of is an estimate of the system state generated by a filter (see Example 1 below). Generally, can be any one or more other statistics based on the information history. We further assume that at time , with a new observation , the partial-information state can be updated recursively; i.e., there is a function such that:
contains more than one statistic. By (2), we have . Thus, policy is in a subspace of the information history dependent policy space. There are two ways to interpret a partial-information state: 1) it is a quantity satisfying (2); with this understanding we can choose the form of function , and once is chosen, can be viewed as a statistic, or a set of statistics when is a vector, of the observation history; 2) it is one, or a set of, statistics of the observation history, with a given physical meaning, e.g., conditional mean or variance. With this understanding, once the statistic is chosen, the form of recursion (2) is fixed, which depends on the transition probability and therefore depends on the action . The recursion (2) is very natural for many statistics and has been used in many previous works, including in the finite state controller model [16], [30]. With the above interpretation, choosing a statistic as the partial-information state is equivalent to choosing a particular form of (2). How to choose the partial-information state , or the form of (2), depends on the particular problem. In some cases, a good choice of for some specific problems may even lead to a (global) optimal policy for the original POMDP (see [29] for an example). The choice of actually reflects which information contained in the observation is considered crucial. Equation (2) covers a wide range of problems including the well-known Kalman filter. 1) Example 1 (The Kalman Filter): Consider the linear-quadratic-Gaussian (LQG) problem in which the system evolves according to (3) where the system state is an -vector, the control variable is an -vector, and are two matrices with compatible sizes, and is an i.i.d. n-dimensional Gaussian noise with mean zero and variance matrix . The observation at time is a -dimensional random variable (4) where is a matrix and is an i.i.d. -dimensional Gaussian noise with mean zero and variance matrix (independent of ). It is well known that the conditional mean estimate of the state , , given the information history can be calculated as follows:
(2) This recursion can be viewed as the definition of the partial-information state . This assumption is not essential; the approach works in the same way if we assume that the iteration depends on a few previous observations, i.e., . The equation may be multidimensional, if
In this example we have ; thus, the above equations show that the relationship in the updating (2) holds ( and can be precalculated, with the initial condition , and therefore can be viewed as system parameters).
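To make the recursion concrete, the sketch below (a minimal Python illustration; the matrix names A, B, C, W, V and the specific numerical values are our own assumptions, not taken from the paper) implements the steady-state Kalman filter update as one instance of (2): the partial-information state is the conditional mean estimate, and phi maps the current estimate, the new observation, and the action to the next estimate.

import numpy as np
from numpy.linalg import inv

# Hypothetical system matrices (2 states, 1 input, 1 observation), for illustration only.
A = np.array([[1.0, 0.1], [0.0, 0.9]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
W = 0.01 * np.eye(2)   # process-noise covariance
V = 0.04 * np.eye(1)   # observation-noise covariance

def precompute_gain(A, C, W, V, n_iter=500):
    """Iterate the filter Riccati recursion to a (near) steady-state error covariance
    and return the corresponding Kalman gain; this plays the role of the precalculated
    parameters mentioned in Example 1."""
    Sigma = W.copy()
    for _ in range(n_iter):
        Sigma_pred = A @ Sigma @ A.T + W
        K = Sigma_pred @ C.T @ inv(C @ Sigma_pred @ C.T + V)
        Sigma = (np.eye(A.shape[0]) - K @ C) @ Sigma_pred
    return K

K = precompute_gain(A, C, W, V)

def phi(s, y_next, a):
    """One step of recursion (2): update the conditional-mean estimate s using
    the new observation y_next and the action a."""
    s_pred = A @ s + B @ a                      # predicted mean before seeing the new observation
    return s_pred + K @ (y_next - C @ s_pred)   # correct with the innovation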
If the noises are not Gaussian, we can still use the output of this filter as the partial-information state, but in this case, loses its physical meaning as the conditional mean. 2) Example 2 (The Stochastic Volatilities): It is now well accepted by both scholars and traders that volatility has its own dynamics and cannot be viewed as a constant. The stochastic nature of volatility makes it crucial in the risk management of derivative portfolios in practice. Let be the stock price at time ; then its dynamics in discrete time under a stochastic volatility model is given by
where is a continuous, finite variation process, is a Gaussian distributed noise, and follows another stochastic differential equation
with representing another Gaussian noise that might be correlated to . This is a highly nonlinear system since is a part of the state. However, the instantaneous volatility is not observable. The observable processes are market prices of stocks, and based on them we can construct our partial-information state, the moving average estimate of the variance (5) This has the form of (2) (precisely, ). Thus, the approach discussed in this paper can be applied to the optimization of such systems (cf. Section VI-C). In this paper, we apply the direct-comparison based approach to study the optimization problem in a subpolicy-space, in which every policy depends on the information history only via the partial-information state . We call such a subspace optimal policy a partial-information-state based optimal policy. Obviously, the more information contains, the better the subspace optimal performance we can obtain. In general, the process of the partial-information state may not be Markov. The system state and partial-information state together, , is called a global state. Theorem 1: If policies depend on the information history only via and it can be updated recursively according to (2), the process of global states forms a Markov process. Proof: Because , is determined by , which depends on and is independent of its past history. Furthermore, according to (2), we have . Because depends on , also depends on and is independent of its past history. Thus, is Markov. The theorem still holds if . With Theorem 1, we can reformulate the original POMDP into a new POMDP with this global state Markov process. In the global-state POMDP, the observation is , and the action depends only on . Thus, the global state is also not completely observable; only is accessible to the agent. Because of the updating mechanism (2), we do not need to record the information history.
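The following sketch illustrates Theorem 1 in the setting of Example 2. The exact form of (5) is garbled above, so we assume here an exponentially weighted moving average of squared returns; all parameter names and values are illustrative. Once the partial-information state is updated from the observation alone, the pair (system state, partial-information state) can be simulated as a Markov chain.

import numpy as np

rng = np.random.default_rng(0)
alpha, kappa, theta, xi = 0.05, 0.1, 0.04, 0.02   # illustrative constants, not from the paper

def step(x, s, a):
    """One transition of the global state (x, s).
    x = (log-price, variance) is the unobservable system state;
    s is the partial-information state (a moving-average variance estimate);
    a is the action (it does not influence the price dynamics here, cf. condition (44))."""
    logp, v = x
    ret = np.sqrt(max(v, 0.0)) * rng.normal()              # observed log-return
    v_next = v + kappa * (theta - v) + xi * rng.normal()   # latent variance dynamics
    x_next = (logp + ret, max(v_next, 1e-8))
    s_next = (1 - alpha) * s + alpha * ret**2              # recursion (2): s depends on (s, new observation, a)
    return x_next, s_next

x, s = (0.0, theta), theta
for t in range(10):
    a = 0.5 if s < theta else 0.1   # an s-based (partial-information-state based) policy
    x, s = step(x, s, a)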
When considering randomized policies (with deterministic policies as special cases), we assume that, corresponding to every partial-information state , there is a set of probability distributions over the action space , which is denoted as in continuous space control problems, or a countable set. To illustrate the idea, we consider the case where the set of distributions is countable and denoted as , with being the number of all possible action distributions over corresponding to ( may be infinite, but for simplicity, we may assume it to be finite). In general, a policy is a mapping from to . With a randomized policy , when is observed, an action is picked according to the distribution and applied to the system. We assume that with randomized policies the actions at different and for different are chosen independently. That is, we do not consider policies with constraints that require the actions at different and to satisfy certain relations. This is a standard assumption for fully observed as well as partially observed optimization problems including the LQG problem in Example 1. A deterministic policy specifies a particular action, denoted as (or the distribution is a -function). Apparently, Theorem 1 also holds for randomized policies. We assume that under any feasible policy the global-state Markov process has an invariant probability distribution and is ergodic. The existence of an invariant distribution and the ergodicity require certain assumptions on the Markov process, and many sufficient conditions exist. For readers' convenience, some of such conditions are stated in the appendix. There are many excellent works on this subject. For example, the Krylov–Bogolyubov theorem for invariant distribution [12] and the Birkhoff–Khinchin theorem for ergodicity [38]; also, it is proved in Appendix E of [18] that recurrence together with an invariant probability implies ergodicity. See also [19]. The ergodicity and the existence of an invariant probability are important research topics, and they are the basis of any infinite-horizon performance optimization problem. In practice, many systems are stable or can be stabilized, so we can assume that these conditions hold. For an ergodic global-state Markov process, there exists a steady state (invariant) probability distribution, denoted as , , and , where the superscript denotes a partial-information-state based policy used in the global-state POMDP. The one-step transition probability function under action is (6) where
is a set consisting of all satisfying , i.e., . The one-step transition probability function under policy is
(7) (8)
The steady-state probability distribution [18]:
satisfies
(9) . Here In the global-state POMDP , if action is applied at state , the reward received by the agents is the same as if the agent applies at state in the underlying MDP, that is (10) We define
(11) With
, the long run average performance (1) becomes
(12) This limit exists since we assume that the global state process satisfies the ergodicity conditions under any policy. We can verify that [cf. (1)], and therefore, a subspace optimal policy of the global state POMDP is also a subspace optimal policy of the original POMDP. With this formulation, we may apply the direct-comparison based optimization approach to solve this global-state POMDP for . In essence, this approach is to use to partially “summarize” the past observation history and hence the POMDP optimization problem is reduced to a new one with the observations expanded from to and the policies depending only on the current observations. The advantage of this expansion is that contains some information of system states before time while only contains information at time . The problem described by (6) and (12) with observation is called the global state POMDP. The formulation in this section is similar to that of the finite state controller, in which in addition to the system state , there is a “machine state” , which evolves according to a rule similar to (2) [1], [16], [17], [21], [25], [29], [30]. Therefore, the machine state plays the same role as the partial-information state. However, in this paper, we focus on analyzing the performance of the optimal policy and the finite state controller literature emphasizes implementation. In a finite state controller, the number of machine states is finite and both the number of machine states and iteration (2) can be updated so that the machine states may contain more and more accurate information. In this paper, the partial-information state as a statistic has a clear physical meaning and is usually not countable; the results in our paper can be applied to analyze the performance of the final policy obtained by a finite state controller, and may provide information regarding how to choose the machine states.
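As a numerical illustration of the long-run average criterion (12), the sketch below estimates the performance of a given partial-information-state based policy by averaging rewards along one long sample path of the global-state process. The functions step, policy, and reward are placeholders supplied by the model, not constructs from the paper.

def long_run_average(step, policy, reward, x0, s0, T=100_000):
    """Monte Carlo estimate of (12): the sample-path average of rewards under
    a policy that sees only the partial-information state s (not the state x)."""
    x, s, total = x0, s0, 0.0
    for _ in range(T):
        a = policy(s)            # the action depends on s only
        total += reward(x, a)    # the reward is earned at the (hidden) system state
        x, s = step(x, s, a)     # global-state transition; s is updated by (2)
    return total / T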
C. Events for Global-State POMDPs Obviously, different observation histories may lead to the same partial-information state . Thus, a partial-information state represents an aggregation of the information history. The direct-comparison based optimization approach [7], [8], [9] is suitable for studying the effect of aggregation. In this approach, aggregation is a special case of an event. Therefore, we may model the global-state POMDP in the event-based setting with viewed as an event. With this terminology, "observing a partial-information state " is called an event. Because of the simple structure of this particular problem, we may describe the approach directly for this problem without referring to other details of the event-based formulation. (Readers may not need to refer to [7], [8], [9] unless they wish to know more about this approach.) An event is observable (i.e., the partial-information state is accessible to the agent). In the event-based setting, actions depend on events, rather than states. The evolution of the global-state POMDP in the event-based framework is as follows. Suppose at time , the global state is . The agent observes only [obtained from by (2)], and based on this , s/he chooses one action distribution, denoted as , according to a policy denoted as . (As a convention, we denote a policy as a distribution over the action space.) When policy is chosen at partial-information state , action will be applied to the system according to the distribution . If is applied, the system state jumps to according to , then a new observation is made according to , and the partial-information state can be updated to by its updating scheme (2). Finally, the global state becomes , and the process repeats. As in the standard MDP, we assume that at different we can choose the action distribution independently. See equation (13) at the bottom of the next page. III. OPTIMAL PARTIAL-INFORMATION-STATE BASED POLICY The fundamental element of the direct-comparison based approach is the performance difference formula, which compares the performance of any two policies. A. Performance Difference Formula The basic concept in the performance difference formula is the performance potential, which satisfies the Poisson equation (14) below. 1) Poisson Equation and Performance Potentials: For ergodic systems, the sample path average converges with probability 1 to the steady-state mean, which is independent of the initial state (for more details, see [18], [12], [38], [19]). Thus, from (12), we have
with independent of the initial global state. To simplify the notation, we define an operator :

Thus,
.
We start with the Poisson equation (cf. Equation (10) in [43]) (14) where
We can simply verify that for any integer , . Thus, in the global-state POMDP, (17) implies
is an operator, corresponding to the transition function , defined for a measurable function as
follows: where constant
(18) . To make the form concise, we have added a to (17). From (10) and (11), we have
and denotes the identity operator
The solution to equation (14) is called a performance potential function. To find a solution to the Poisson equation, some technical conditions are required to ensure certain convergence properties. First, we define (for notational simplicity, we drop the superscripts and tilde ): the power of the operator is defined by , ; set (15) assuming the limit exists pointwise. Lemma 1 [43]: For any transition function and performance function , if
and hold for every
(16)
, then (17)
is a solution to the Poisson equation (14). This lemma can be verified directly by substituting (17) into the above Poisson equation. More specific technical conditions exist; see, e.g., [18]. Note that if is a solution to the Poisson equation, so is for any constant . We shall see that the optimization results do not depend on this constant, and this justifies the name "potentials."
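For a finite-state approximation (e.g., after discretizing the global state), the Poisson equation (14) becomes a linear system and the potentials can be computed directly, as in the sketch below. This is a numpy-based illustration under our own conventions; the normalization g[0] = 0 simply fixes the additive constant that, as noted above, does not affect the optimization results.

import numpy as np

def potentials(P, f):
    """Solve (I - P) g = f - eta * 1 for a finite ergodic chain with transition
    matrix P and reward vector f; eta is the steady-state average reward."""
    n = P.shape[0]
    # steady-state distribution: left fixed point of P, with the normalization sum(pi) = 1
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    eta = float(pi @ f)
    # one row of (I - P) is redundant; replace it with the normalization g[0] = 0
    M = np.eye(n) - P
    M[0, :] = 0.0; M[0, 0] = 1.0
    rhs = f - eta
    rhs = rhs.astype(float); rhs[0] = 0.0
    g = np.linalg.solve(M, rhs)   # matches the series solution (15)/(17) up to a constant
    return g, eta, pi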
(19) where is determined by policy . 2) Performance Difference Formula and Aggregated Q-Factors: Now, we are ready to derive the performance difference formula. Denote the steady state probability distribution of under policy as: (20) with being the marginal distribution of the partial-information states and the conditional distribution of given that the partial-information state is . We have (21) with . Consider two partial-information-state based policies and for the global-state POMDP . Denote their corresponding quantities as and . We first briefly derive a simple result for the continuous state Markov systems [43]. From (9), we can easily verify that for any function we have . Left-multiplying both sides of the Poisson equation (14) with and using , we can easily get
Note that this difference equation holds if is replaced by for any constant ( for any transition probability
(13)
function and constant ). Writing the right-hand side in the form of integration and continuing the analysis, we get equation (13), where (22)
to POMDPs. For simplicity, we discuss deterministic policies. When the system is observable, we may set in (22), (23), and (24), and becomes a function. Thus, the Q-factor (24) depends only on policy and reduces to the following simple form [7], [31]: (30)
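To make the completely observable case concrete, the following sketch performs one Q-factor based policy-improvement step in the sense of (32)-(34) for a finite MDP. The array names P_list and f_list are our own, and the routine reuses the potentials function from the sketch after Lemma 1; this is an illustration, not the paper's implementation.

import numpy as np

def policy_improvement(P_list, f_list, d):
    """One improvement step: d'(x) in argmax_a Q^d(x, a), cf. (32)-(34).
    P_list[a] is the n-by-n transition matrix and f_list[a] the reward vector
    under action a; d is the current deterministic policy (an integer array)."""
    n = P_list[0].shape[0]
    P_d = np.array([P_list[d[x]][x] for x in range(n)])                 # transitions under policy d
    f_d = np.array([f_list[d[x]][x] for x in range(n)], dtype=float)    # rewards under policy d
    g, eta, _ = potentials(P_d, f_d)        # potentials of the current policy (sketch after Lemma 1)
    # Q-factors as in (30); subtracting the constant eta would not change the argmax
    Q = np.stack([f_list[a] + P_list[a] @ g for a in range(len(P_list))], axis=1)
    return np.argmax(Q, axis=1), eta        # improved policy and current average reward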
is the aggregated performance function, and
(23) is the aggregated potential function. We define the aggregated -factor for action as (in the following, we will use the subscript to denote the aggregated version of each quantity) (24) Then for any two partial-information-state based policies and we have the following partial-information-state based performance difference formula:
with being the potential function. The term “Q-factor” is widely used in the reinforcement learning community [34]. The performance difference formula is (31) By (31), we may prove that if is not optimal (meaning it differs from the optimal policy on a set with a nonzero Lebesgue measure), then policy is better than with (32) and that a policy
is optimal if and only if (33)
(25) We further define the aggregated Q-factor for action distribution and as
(26) In general, for any action distribution
, we may define (27)
Then the performance difference formula (25) takes a simple form (28) When a policy is deterministic, i.e., specifies a particular action, denoted as (or the distribution is a -function), then we have (29) In this case, the and in the performance difference formula (28) are two actions, not action distributions. B. Optimization With Partial-Information-State Based Policies 1) Direct-Comparison-Based Approach for MDP: Let us first briefly review the direct-comparison based approach for the standard MDP where the system is completely observable [7], [43]. This approach provides a view different from standard dynamic programming and can be easily applied
The optimal policy is (34) Because all the Q-factors defined in (30) can be obtained by analyzing only one policy to get , from (32) and (33), the standard MDP satisfies the following crucial properties for performance optimization: The single-policy based comparison principle holds for an optimization problem, if: 1) given a policy , we may find a policy that is better than , if such a policy exists, by analyzing only policy ; 2) we may verify if a policy is optimal by analyzing only policy . By Property 1, policy iteration algorithm can be developed in searching for an optimal policy, and by Property 2, HJB-type of optimality equations can be derived. MDP satisfies Property 1 by (32), and it satisfies Property 2 by (33). 2) Single-Policy-Based Comparison Principle for POMDP: However, for partially observable systems, the aggregated -factor in (25), , contains items from both policy and policy . Thus, the single-policy based comparison principle does not apply to the general POMDP. The single-policy-based comparison principle is the basis for feasible optimization approaches for a simple reason: if it does not hold, which means to compare the performance of any two policies we need to analyze both these policies, then to search for an optimal policy, we basically need to analyze all the policies in the policy space in order to compare them. This is the exhaustive search method and is practically infeasible. Therefore, a natural question is, for the global-state POMDP , under what conditions can we still have this fundamental principle so that we may have the single-policy based
optimality equation similar to (33) for subspace optimal policies in the subpolicy-space and may develop a policy iteration algorithm for it? This question can be easily answered by the partial-information-state-based performance difference formula (25). To prove the following theorem, we need some technical conditions. Since is a statistic, the space of , , is a subset of some real space . Thus, every measurable subset of has a Lebesgue measure. We assume that the Lebesgue measure is equivalent to the steady state probability measure of under any policy (i.e., for any , if , then its Lebesgue measure is also positive, and vice versa). With this condition, "almost surely (a.s.)" under the steady-state measure of any policy is the same as "almost everywhere (a.e.)" under the Lebesgue measure. Now, we define
where [cf. (26)]
In general, the aggregated Q-factor for any action distribution , , is defined as [cf. (27)] (39) By (38), because is a positive measure, we conclude that for any two policies and , if a.e. in then, ,
(35) If is independent of , then we can choose and from (22), (23), and (24), we have
a set “ ,” if
. Moreover, for two functions and , we define relation if i) for all and ii) on having positive Lebesgue measure. With relation a.e. in
,
(36) Because of Theorem 2, this condition is called the single-policybased comparison condition. Theorem 2: Consider a POMDP with an information history , let be a statistic depending on the history. If: 1) can be updated recursively according to (2); 2) the aggregated -factor (24), , depends only on policy , i.e., (36) holds. Then the single-policy-based comparison principle holds for the global state POMDP. (The theorem implies the following two facts: a) By analyzing only policy to get , analytically, numerically, or by simulation, we may find a better policy if is not optimal; therefore, we may design a policy iteration algorithm. b) There exists a single-policy based optimality equation: a policy is subspace optimal if and only if
By this construction, we have . Thus, . Finally, by item b) of this theorem, if is not optimal, then we can always find a policy such that (41) holds and therefore . b) From the performance difference formula (38), because is a positive measure, is optimal if (37) holds. Next, assume that (37) does not hold. Then there is a set , with a positive Lebesgue measure, on which
(37)
(38)
(41)
. then we have From the above analysis, given any nonoptimal policy , we can analyze policy to get the aggregated Q-factor according to (35), then we may calculate, by (39), the aggregated Q-factor for all action distributions , , . Then we choose
Furthermore, if we define a policy
With this equation, we may verify whether a policy is a partial-information-state-based subspace optimal policy for the POMDP by analyzing only policy itself.) Proof: a) When the aggregated -factor , depends only on , by (27), the aggregated -factor also depends only on , and hence is denoted as . The performance difference formula (28) becomes
(40)
as if if
(42)
then if if
.
(43)
, i.e., is not optimal. Thus, by (38) we have Therefore, (37) is a necessary condition for an optimal policy. Furthermore, we have the following specific condition for the POMDP optimization:
Theorem 3: The single-policy based comparison principle (i.e., Theorem 2) holds for the global state POMDP, if one of the following conditions holds: a) The conditional probability distribution does not depend on , i.e., and b)
(44)
and (or
) for all . Proof: a) is obvious from (35) and Theorem 2. For b), we have
There may be other conditions for the aggregated Q-factor depending on only (cf. [7] for the case of discrete states), and we leave this topic for further study. As a special case, if the system equations (thus, the state transition probability functions) do not depend on actions, only the reward function does, then Theorem 3a) holds. This result has important application in the portfolio selection problem in finance [39]. In financial markets, a generally accepted assumption is that any individual investor’s trading activities cannot affect the market evolution. This implies that any investor’s trading policy cannot change the distribution of stock prices in the market. Therefore, if we use market statistics as the partial-information state , this assumption leads directly to our condition (44). That is to say, this condition holds automatically in many financial problems. Therefore, we may use the partial-information-state based approach to find the optimal portfolio selection; this leads to a theoretical justification/explanation to the technical analysis widely used in stock markets [39]. 3) Policy Iteration: By Theorem 2a), a policy iteration algorithm can be developed when -factor depends only on policy . Algorithm 1 (Policy Iteration): 1) Guess an initial policy , set ; 2) (Policy Evaluation) Obtain the potential via the Poisson equation (14) or by estimation from a system sample path under policy , then calculate the aggregated -factor for all and by (35), and for all and by (39). 3) (Policy Improvement) Choose
for all partial-information states . If for some , attains the maximum, then set . 4) If a.e. in , stop; otherwise, set and go to step 2. In step 3) (policy improvement), we need to assume (as we did) that the actions or action distributions can be chosen independently for each partial-information state. By Theorem 2a), the system performance improves at every iteration, and by Theorem 2b), when the algorithm stops, it stops at an optimal policy. 4) Discussions: We have developed an approach to the partially observable optimization problem with partial-information-state-based policies; the analysis is based on the direct-comparison based approach to optimization problems. This approach requires some conditions and therefore is restrictive. When the conditions hold, the approach enjoys the same advantages and drawbacks as the standard approach to MDPs. Such drawbacks include: 1) The convergence of the policy iteration algorithm is not guaranteed unless the policy space is finite. The problem of convergence of policy iteration in general policy spaces for standard MDPs has been extensively studied (e.g., [18]), and similar techniques can be applied to prove the convergence of the partial-information-state-based policy iteration. It is beyond the scope of this paper to discuss further details. Another related issue is whether an optimal stationary policy exists, or whether the optimality equation (37) has a solution; this requires the measurable selection theorem, and readers may refer to [36], [37] for details. 2) It is generally not an easy task to check the optimality equation (37) for every partial-information state for general POMDPs. It requires first solving the Poisson equation (14) and then calculating the aggregated -factor for all by (35), and for all by (39). All the variables are continuous and discretization is needed. These two difficulties, one theoretical and one practical, are common to problems with continuous state and action spaces. If closed form solutions exist (e.g., the partially observable LQG problem discussed in Section VI-A), this approach may lead to a partial-information-state-based subspace optimal policy in an intuitive way. Practically, we always need to discretize the partial-information state space and the action spaces, resulting in finite state and policy spaces. For systems with finite spaces, this approach may provide a practical way to find a partial-information-state-based subspace optimal policy. Furthermore, the finite state controller provides an efficient approach to update finite partial-information state spaces [1], [16], [17], [21], [25], [29], [30]. In addition, the condition (44) in Theorem 3 appears restrictive. On the one hand, a simple partial-information state is always desirable for computational purposes. On the other hand, if a simple cannot be found to satisfy condition (44), the information state, or belief state, itself (i.e., the conditional distribution of the system state) can always be used as the partial-information state; in this sense, the condition (44) can always be satisfied. This is the well-known information state model. However, since the information-state space is huge, we do need a simpler partial-information state in order to solve
the POMDP practically. For discrete event systems, some other conditions for the single-policy based comparison principle exist [7]; but such conditions may not be generally applicable. For problems where condition (44) does not hold, we may study how different these conditional distributions are for different 's, and then investigate the errors in the partial-information-state based subspace optimal policy if we replace with anyway. Although (44) is restrictive, it does hold for many systems, including the partially observable linear quadratic Gaussian system, many problems in financial markets, and the network control problems, which are discussed in Section VI. IV. Q-SUFFICIENT STATISTICS In the previous sections, to simplify the optimization of a POMDP, we introduced the partial-information state to summarize the information history to get a subspace optimal solution to the POMDP. We find that conditions such as (44) are required to establish the optimality equation and policy iteration algorithms, which can be verified and implemented by analyzing only one policy in a way similar to standard MDPs. Now, we address the second question: when is the partial-information-state based subspace optimal policy globally optimal among all feasible policies, i.e., an optimal policy over the policy space ? Apparently, this question is related to the well-investigated concept of "sufficient statistics" in the literature, and the most relevant paper is [33]. We show that the partial-information state has to be a Q-sufficient statistic as defined below. Consider a statistic , with being a generic notation for an information history. First, we assume that the statistic can be updated recursively according to (2). An -dependent policy in is also an -dependent policy in , . We use the same notation to denote , and thus, for any information history we also have the potential function and aggregated Q-factor . By definition, we have
(45) In addition, the distribution by ; therefore,
is completely determined (46)
for any and ; and the single-policy based comparison principle holds for . This equation is essentially the same as Theorem 1 in [33]. From (45) and (46), we have
for any policy . Q-sufficiency means that contains the same information as the total information history in terms of the -factor, which is weaker than the standard "sufficient statistic"; see Section VI-B for an example. In addition, Q-sufficiency may depend on the performance function ; an extreme example is that if for all states , then any statistic is Q-sufficient; in this case, all the policies are equally good and therefore optimal. The Q-sufficient statistic is equivalent to the "informative statistic" for finite horizon optimization problems in [33]. In some sense, the in Equation (3.1) in [33] can be viewed as the Q-factors in the finite horizon case, in which it depends on . corresponds to the performance itself, so in that case the optimality is straightforward. Another slight difference is that we use (2) as a requirement, but [33] requires that the statistic be determined by a conditional distribution. Our definition includes quantities such as , which cannot be covered by conditional distributions. This is important; see the example in Section VI-B. We need to address a technical issue before considering the long-run average performance optimization. If the process starts from , then the history at any finite time is of a finite length, and its length always increases; therefore, the process never visits the same history by definition. The situation is similar to the asymptotic stationarity of a Markov process. There are a number of ways to overcome the difficulty caused by this issue. First, we may view the process as started at , and the initial condition at time is an infinitely-long history . In this way, at any time we have an infinitely long history. Second, we may truncate the history into a long but finite length and , and let it go to infinity if necessary. Theorem 4: Consider a POMDP with an information history , let be a statistic depending on the history. If: 1) can be updated recursively according to (2); 2) is a Q-sufficient statistic, i.e., (48) holds; 3) the single-policy-based comparison condition (36) holds. Then an optimal policy in policy space exists in a subpolicy-space , i.e., the optimal -based policy is also a globally optimal one. Proof: With (2) and (36) and by Theorem 2, we may apply the single-policy-based comparison optimality equation (37). First, we consider deterministic policies. With (48), we may get (49), which holds for any policy . Suppose that is an optimal policy in . We take this as a policy in . Then by (49), we have
:
(47) . for any A statistic information history
is called a Q-sufficient statistic of an , if (48)
This means that is also an optimal policy in . The results can be easily extended to randomized policies because (48) holds for every action.
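Because the equations above are garbled, we summarize the argument in notation that we assume here (it is one plausible reconstruction, not necessarily the paper's exact symbols): write s = s(H) for the statistic, Q^{H,d}(H,a) for the history-based Q-factor, and Q_s^{d}(s,a) for the aggregated Q-factor of an s-based policy d.

% Q-sufficiency, cf. (48): the history enters the Q-factor only through s(H)
Q^{H,d}(H, a) \;=\; Q_s^{d}\bigl(s(H), a\bigr), \qquad \forall\, H,\; a .

% If d^* is optimal in the s-based subspace, i.e. (cf. (37))
% Q_s^{d^*}\bigl(s, d^*(s)\bigr) \;\ge\; Q_s^{d^*}(s, a) \quad \forall\, s, a,
% then, cf. (49), for every history H and action a,
Q^{H,d^*}\bigl(H, d^*(s(H))\bigr) \;\ge\; Q^{H,d^*}(H, a),

% and the performance difference formula gives \eta^{d} \le \eta^{d^*}
% for every history-dependent policy d, so d^* is optimal in the full policy space.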
Roughly speaking, this theorem says that if is a Q-sufficient statistic then an optimal policy in the subpolicy-space is also optimal in the global policy space . Theorem 5: Consider a POMDP with an information history . Let be a statistic satisfying (2), and be an based policy. If
Therefore, (51) becomes
(52)
(50) is a Q-sufficient statistic, and (36) holds. Proof: By (46) and (50), we have for any . Then by Theorem 3, (36) holds. Now, we prove that for any policy , if different histories have the same statistic , then they have the same potentials. From (18) (with as a special and superscript added)), we have
then
in which the second equality is due to (10) and (11). Next, because and the probability distribution of depends only on and the action , given a policy , the sample path is completely determined by and . Therefore,
By definition (48), is Q-sufficient. Equation (50) is used in the literature to define sufficient statistics, and in [33] it is called an "equivalent statistic," to distinguish it from the "informative statistics." Theorem 5 essentially says that a sufficient statistic is Q-sufficient. Summarizing Theorems 4 and 5, we get Theorem 6: Consider a POMDP with an information history , let be a statistic depending on the history satisfying the recursion (2). If for all , then an optimal policy in policy space exists in a subpolicy-space , i.e., the optimal -based policy is also a globally optimal one. Theorems 4 and 6 provide conditions under which the optimal policy for an infinite-horizon POMDP with the long-run average criterion exists in a subpolicy-space, in which all the policies are functions of statistics of the information history. If a simple statistic satisfying condition (50) or (48) exists, then the POMDP optimization can be greatly simplified since the above subpolicy-space may be much smaller than the original history-dependent policy space. Thus, on the one hand, we want to find a simple statistic to save computation; on the other hand, in order to satisfy (50) or (48), the statistic cannot be too simple except for some special systems, such as the partially observable linear quadratic Gaussian control problem discussed in the next section. V. SEPARATION PRINCIPLE FOR POMDPS
Because
does not depend on , (50) implies that also does not depend on . Now, the aggregated -factor in (45) can be written as
(51) Given , the probability of on . Thus,
in the above equation depend
Based on the above results, in this section, we investigate the separation principle for the optimal partial-information-state based policies. In the well-known LQG problem, separation principle means that the optimization problem can be solved in two steps: first to estimate the system state with its conditional mean based on the observation history, and then use this estimate as if it were the true state to get an optimal policy. We will find conditions under which the separation principle can be extended to other statistics. If a statistic is Q-sufficient, then the optimal policy based on this statistic is optimal among all policies. The separation principle says more than this [15]; if it holds, then the optimal policy of the POMDP takes the same form as the underlying completely observable MDP. First, with the separation principle, the partial-information state, or the statistic , should take the same form as a state and is called an estimate of the state, and thus we need . For completely observable MDPs, it is well known that a deterministic optimal policy exists and the optimality equation is (33). Therefore, to establish the separation principle, we may consider only deterministic policies in the partial-
information-state based problems, and the optimality equation (37) becomes a.e. in
constant does not affect the comparison, we may set for simplicity. Thus, for any , by setting , we have
(53) (57)
The optimal partial-information-state-based policy satisfies (54) Comparing (34) for the underlying MDP and (54) for the partial-information-state-based optimization, we get the following theorem. Theorem 7: Let be a statistic depending on the information history . Suppose: 1) can be updated recursively according to (2); 2) the single-policy-based comparison condition (36) holds. Let be an optimal policy for the underlying MDP and be an optimal partial-information-state based policy for the POMDP. If a.e. in (55) where is a constant, then the separation principle holds. That is, the partial-information state -based optimal policy for the POMDP has the same form as the one for the underlying MDP, with the system state replaced by its estimate. Proof: It follows directly from (34), (54), and (55)
That is, is also the optimal partial-information-state based policy. This theorem is intuitively clear. However, with (55), we may need to find the optimal partial-information-state based policy first and then to verify the separation principle. In fact, condition (55) may be slightly changed, as shown in the next theorem, so that we may verify the separation principle directly with only the optimal policy for the underlying MDP. Theorem 8 (Condition for Separation Principle): Assume that conditions 1 and 2 in Theorem 7 hold. Let be an optimal policy for the underlying MDP. If
is optimal for the underlying MDP, by (33) and Because from (57), we have
From (53), is indeed an optimal partial-information-state based policy, and thus the separation principle holds. VI. EXAMPLES AND APPLICATIONS A. Partially Observable LQG Problem To verify our results, we show that the conditions for the separation principle hold for the well-known LQG problem. Consider the LQG problem discussed in Example 1 with the system (3) and observation (4). The optimization goal is to find a feedback control law to minimize the following long-run average cost: (58) with and being symmetric positive semidefinite (positive definite) matrices. We also assume that in the system (3) is stabilizable, is detectable, and , [22]. The optimal policy for the completely observable LQG system is a linear feedback control: , where , with the stabilizing solution to the Riccati equation (in fact, the results hold for LQ problems in which the noise may have any i.i.d. distribution, see, e.g., [7]): (59) The separation principle implies that, when solving the partially observable LQG problem, we can first find the conditional mean of the system state (by a Kalman filter), and next find the optimal feedback gain for the corresponding completely observable case; then policy is an optimal policy. By the Kalman filter, at steady state, follows a Gaussian distribution with mean and a constant variance matrix , which is the stabilizing solution of the Riccati equation (60)
a.e. in (56) where is a constant number, then the separation principle holds [ is defined in (35)]. Proof: is a partial-information-state-based policy. We need to prove that under (56), is optimal. First, (56) is . Because the the same as
Thus, with this , knowing is sufficient to determine the distribution of the system state , i.e., is a sufficient statistic. In the global-state POMDP , by (22) and (23), noting that is a Gaussian distribution with variance , we have
where is an optimal policy for the completely observable LQG problem. Now, the left-hand side of (56) becomes
(61) while its right-hand side is given by (62) Thus, the condition for separation principle (56) is satisfied in the partially observable LQG problem. Separation principle is a very special case, where the true state can be replaced by an estimate. This almost requires the estimate (the conditional mean) to be a sufficient statistic, and therefore, the conditional distribution and hence the noises have to be Gaussian. The results in Sections III and IV show that under some conditions we may find an optimal policy with Q-sufficient statistics, which may not take the same form as the completely observable MDP. We give an example in the next subsection.
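As a concrete companion to Section VI-A, the sketch below assembles the separation-principle controller for the partially observable LQG problem: the feedback gain from the control Riccati equation (59), the steady-state Kalman gain from the filter Riccati equation (60), and the policy that acts on the estimate as if it were the true state. The gain names F and K and the use of scipy's discrete Riccati solver are our own choices for illustration, not the paper's.

import numpy as np
from scipy.linalg import solve_discrete_are

def separation_controller(A, B, C, W, V, Qc, Rc):
    """LQG by separation: LQR gain from (59) plus steady-state Kalman gain from (60)."""
    # Control Riccati equation (59) and the corresponding feedback gain
    S = solve_discrete_are(A, B, Qc, Rc)
    F = np.linalg.solve(Rc + B.T @ S @ B, B.T @ S @ A)
    # Filter Riccati equation (60): steady-state (predicted) error covariance, by duality
    Sigma = solve_discrete_are(A.T, C.T, W, V)
    K = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T + V)   # Kalman gain

    def policy(x_hat):
        return -F @ x_hat          # act on the estimate as if it were the true state

    def filter_step(x_hat, y_next, a):
        x_pred = A @ x_hat + B @ a
        return x_pred + K @ (y_next - C @ x_pred)          # recursion (2) for the estimate

    return policy, filter_step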
, which controls the transition probabilities , . Suppose that the state is not observable, and we can only observe , . The information state, or the sufficient statistic, is the conditional probability distribution of given the information history , denoted as , with denoting a control policy. Let be the conditional stationary distribution of given that the number of customers in every subnetwork is . Let be the visiting ratio to server in subnetwork , , . By the Gordon–Newell theorem [14] and Buzen's algorithm [6], the stationary distribution under policy has the product-form solution
where
, ,
determines the visiting ratio , and
,
B. Control of Networks of Networks Now we provide an example for which condition (44) holds. Consider a closed network of subnetworks of servers; each of them may represent a local network of computers and routers. Subnetwork , , consists of first-come-first-served (FCFS) servers, with exponential service times with means denoted as , . Customer transitions follow these rules: upon service completion in any server in subnetwork , a customer leaves subnetwork and enters subnetwork with probability , , ; and after entering subnetwork ( may be the same as ), it enters server with probability , , , where is the number of customers in subnetwork , . We can control the transition probabilities among the subnetworks, , , with denoting the action taken, but cannot control , , . There are a total of customers circulating among the subnetworks. The network is a closed Jackson (or Gordon–Newell) network [20], [14]. Let denote the number of customers in server of subnetwork at time , with . Let ; then is the system state at and it forms a Markov process, having a stationary probability with a product-form solution [14]. Let be the number of customers in subnetwork at time , and , and . Then . The process is a continuous-time process, but we can consider the discrete-time embedded chain at the transition epochs and apply the results in this paper. In addition, for simplicity, we assume that all the mean service times , , , are equal. Denote the sequence of all the transition epochs as ; then we obtain a discrete-time finite-state Markov chain , . If is completely observable, then a policy is a mapping from the space of to the action space,
The marginal distribution is
where
Thus, the conditional probability of
given
under policy
is
where (63) and , with being the visiting ratio to server in subnetwork . Next, we observe that (63) contains only keep the visiting ratio of the servers in subnetwork ; thus,
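As a small numerical check of this property (an illustrative construction of ours; the paper gives no code), the following Python sketch computes the conditional within-subnetwork distribution from the full product form and compares it with a standalone copy of that subnetwork, using Buzen's convolution algorithm [6] for the normalizing constant. The two-subnetwork layout and the numbers are assumptions.

    # Conditional distribution inside a subnetwork of a closed product-form
    # network, given that subnetwork's population, versus the distribution of a
    # standalone copy of the subnetwork.  rho[k][i] plays the role of v_{k,i}
    # (visiting ratio divided by service rate).
    from itertools import product
    from collections import defaultdict

    rho = [[1.0, 0.5], [2.0, 1.0]]          # two subnetworks, two servers each
    N = 4                                   # total customers in the closed network

    def buzen_G(rhos, n):
        """Buzen's convolution algorithm: normalizing constants G(0..n)."""
        G = [1.0] + [0.0] * n
        for r in rhos:
            for m in range(1, n + 1):
                G[m] += r * G[m - 1]
        return G

    # Brute-force the full product-form distribution over all states with N customers.
    servers = [r for sub in rho for r in sub]
    states = [s for s in product(range(N + 1), repeat=len(servers)) if sum(s) == N]
    weight = {s: 1.0 for s in states}
    for s in states:
        for cnt, r in zip(s, servers):
            weight[s] *= r ** cnt
    Z = sum(weight.values())

    # Condition on subnetwork 0 holding m customers ...
    m = 2
    cond = defaultdict(float)
    for s in states:
        if s[0] + s[1] == m:
            cond[(s[0], s[1])] += weight[s] / Z
    total = sum(cond.values())
    cond = {k: v / total for k, v in cond.items()}

    # ... versus the standalone product form of subnetwork 0 with m customers.
    G0 = buzen_G(rho[0], m)
    standalone = {(i, m - i): rho[0][0] ** i * rho[0][1] ** (m - i) / G0[m]
                  for i in range(m + 1)}
    print(cond)          # the two dictionaries agree up to rounding
    print(standalone)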
Thus, for any $k$, the visiting ratios $v_{k,i}$ and the conditional distribution in (63) are the same as those in a network with the same structure as subnetwork $k$, in which a departing customer feeds back into the network according to the (uncontrolled) within-subnetwork routing probabilities. Therefore, (63) does not depend on the inter-subnetwork transition probabilities $q^{\alpha}(k,k')$, i.e., it does not depend on the policy $d$; we may denote it as $p_k(\cdot\,|\,m(k))$, independent of $d$. Thus, (44) holds, leading to the single-policy-based comparison principle in Theorem 2, and an optimal policy in the $m$-based policy space can be found by policy iteration and is determined by (37). Note that $m$ is not a sufficient statistic.

C. Applications to Financial Engineering Problems

In financial markets, a generally accepted assumption is that an individual investor's trading activities cannot affect the market development. This implies that an investor's trading policy cannot change the distribution of stock prices in the market. Therefore, if we use market statistics as the partial-information state, this assumption leads directly to our condition (44); that is, the condition holds automatically in many financial problems. Let us illustrate this idea in more detail.

Let $S_t$ be the process of the (per-share) price of a financial asset, say a stock. Normally, we assume that $S_t$ follows a geometric Brownian motion

$dS_t=\mu_t S_t\,dt+\sigma_t S_t\,dW_t$,   (64)

where $W_t$ is a standard Brownian motion, $\mu_t$ is the appreciation-rate process, and $\sigma_t$ is the volatility process. In financial engineering, we need to consider markets with stochastic parameters [24]. Thus, both $\mu_t$ and $\sigma_t$ are stochastic, and therefore they should also be considered as part of the system state. Note that the expected value of the process (64) grows exponentially with time $t$; however, its equivalent annual growth rate has a stationary distribution provided that $\mu_t$ and $\sigma_t$ are stationary. Therefore, the state of this problem must include $\mu_t$ and $\sigma_t$ as well as the price information. Furthermore, the parameters in a real financial market are generally unobservable to investors, and the only available information is the historical prices of different securities, see [27], [42], [35]. Thus, in our problem, $\mu_t$ and $\sigma_t$ are unobservable to the investors, and only the price history is available; all decisions at time $t$ should be made based on this history. An investor may hold a number of stocks, and therefore $S_t$ in (64) should be considered as a multidimensional vector, with the other quantities understood as matrices of proper dimensions. The investor wishes to maximize his/her profit by applying an optimal trading strategy (when to sell or buy which stock, and for how much). The resulting process, with a suitably defined performance criterion (e.g., the long-run average growth rate), forms a POMDP.
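As an illustration (a construction of ours, not from the paper), the following Python sketch simulates a discrete-time version of (64) in which the pair (appreciation rate, volatility) switches according to a hidden two-state Markov chain, and then computes the conditional distribution of the hidden regime from the price history alone. No trading action enters this computation, which is precisely why condition (44) holds here.

    # Discrete-time approximation of (64) with a hidden regime driving (mu, sigma).
    # The Bayes filter below uses only observed returns; any trading policy sees
    # the same conditional distribution of the hidden parameters.
    import numpy as np

    rng = np.random.default_rng(1)
    dt = 1.0 / 252
    P = np.array([[0.99, 0.01], [0.02, 0.98]])   # regime transition matrix
    mu = np.array([0.15, -0.05])                 # appreciation rate per regime
    sigma = np.array([0.10, 0.30])               # volatility per regime

    # Simulate the hidden regime and the observed price path.
    T, z, S = 1000, 0, [1.0]
    regimes, rets = [], []
    for t in range(T):
        z = rng.choice(2, p=P[z])
        r = (mu[z] - 0.5 * sigma[z] ** 2) * dt + sigma[z] * np.sqrt(dt) * rng.normal()
        S.append(S[-1] * np.exp(r))              # exact one-step GBM update
        regimes.append(z); rets.append(r)

    def regime_filter(log_returns):
        """P(regime_t | price history): depends on prices only, not on any policy."""
        belief = np.array([0.5, 0.5])
        beliefs = []
        m = (mu - 0.5 * sigma ** 2) * dt
        s = sigma * np.sqrt(dt)
        for r in log_returns:
            like = np.exp(-0.5 * ((r - m) / s) ** 2) / s    # Gaussian likelihoods
            belief = like * (P.T @ belief)
            belief /= belief.sum()
            beliefs.append(belief.copy())
        return np.array(beliefs)

    b = regime_filter(rets)
    acc = np.mean((b[:, 1] > 0.5) == (np.array(regimes) == 1))
    print("fraction of steps where the filtered regime matches the true one:", acc)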
A policy (trading strategy) is based on the observation history. Because personal trading activities cannot affect the market behavior, the price dynamics are independent of the policy. In particular, policies cannot affect the conditional probability of the state given the price history; i.e., this conditional probability is the same for all policies. Because any partial-information state is a statistic of the price history, it follows that the conditional probability of the state given the partial-information state is also the same for all partial-information-state-based policies.
In summary, condition (44) is generally satisfied automatically in a financial market with unobservable and stochastic parameters. With such a nice property, we may apply our method to study some practical financial decision problems; this work is reported in [39].

VII. DISCUSSION AND CONCLUSION

The separation principle is a very special case of POMDP, and it is not generally applicable. In the well-known LQG problem, the conditional mean is a sufficient statistic, and the control policy does not affect the variance of the conditional distribution. In this paper, we consider the problem along a reverse direction: we start with any statistic, called a partial-information state, and discuss what is the best we can do with partial-information-state-based policies; then we discuss, step by step, under what conditions we can do better, up to the best situation, in which the separation principle holds. Specifically:

1a) We first proposed the single-policy-based comparison principle, which is the basis for many practically feasible optimization approaches. With this principle we may verify whether a partial-information-state-based policy is optimal among all the partial-information-state-based policies, or find a better policy if it is not (policy iteration), by analyzing only the policy itself.

1b) We found that under condition (36) or (44) the single-policy-based comparison principle holds for the partial-information-state-based policies, so the HJB-type optimality equation and policy iteration can be derived for a sub-space optimal policy.

2a) We introduced the Q-sufficient statistic, which is equivalent to the information state (sufficient statistic) in terms of optimization.

2b) We found that under the further condition (48) or (50), a partial-information state is Q-sufficient, and a sub-space optimal partial-information-state-based policy is also optimal among all possible policies for the long-run average performance.

3) We found that under the further condition (56), the sub-space optimal policy takes the same form as that for the completely observable MDP (separation principle).
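Items 1a) and 1b) center on policy iteration driven by the Q-factors of a single policy. The following Python sketch is a generic average-reward policy iteration on a finite aggregated state space; the arrays and the small solver are placeholders of ours and are not the specific procedure in (37), but they illustrate the evaluate-and-improve structure referred to above.

    # Generic average-reward policy iteration on a finite (aggregated) state space.
    # P[a, s, s'] are transition probabilities and r[a, s] one-step rewards, both
    # placeholder data; a unichain structure is assumed.
    import numpy as np

    def policy_iteration_avg(P, r, iters=100):
        n_a, n_s, _ = P.shape
        d = np.zeros(n_s, dtype=int)                 # initial policy
        for _ in range(iters):
            Pd = P[d, np.arange(n_s)]                # (n_s, n_s) under policy d
            rd = r[d, np.arange(n_s)]
            # Policy evaluation: solve the Poisson equation
            #   g + h(s) = r_d(s) + sum_{s'} P_d(s, s') h(s'),  with h(0) = 0.
            A = np.zeros((n_s + 1, n_s + 1))
            A[:n_s, :n_s] = np.eye(n_s) - Pd
            A[:n_s, n_s] = 1.0                       # coefficient of the gain g
            A[n_s, 0] = 1.0                          # normalization h(0) = 0
            b = np.append(rd, 0.0)
            sol = np.linalg.lstsq(A, b, rcond=None)[0]
            h, g = sol[:n_s], sol[n_s]
            # Policy improvement: compare Q-factors built from the current policy only.
            Q = r + np.einsum('ast,t->as', P, h)     # Q[a, s]
            d_new = Q.argmax(axis=0)
            if np.array_equal(d_new, d):
                break
            d = d_new
        return d, g, h

    # Tiny random example (placeholder data).
    rng = np.random.default_rng(0)
    P = rng.random((2, 3, 3)); P /= P.sum(axis=2, keepdims=True)
    r = rng.random((2, 3))
    d, g, h = policy_iteration_avg(P, r)
    print("optimal policy:", d, " long-run average reward:", g)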
TABLE I MAIN RESULTS
The results are summarized in Table I. Since condition (50) has nothing to do with the performance function, the conditional mean in a linear system with Gaussian noises (LG) is Q-sufficient, as listed in the table. We also provided three examples for the cases listed in the table.

The results in this paper are obtained with the direct-comparison-based approach; with this approach, optimization is achieved by a direct comparison of the performance of any two policies, policy iteration is viewed as a discrete version of the gradient-descent approach, and the HJB equation is a discrete counterpart of setting the gradient equal to zero. The partial-information state aggregates the information states together, and therefore it fits the event-based optimization framework. When condition (44) does not hold, we may explore how much difference this causes in the aggregated Q-factor in (35), and then study how large the error might be if we still apply the single-policy-based policy iteration or the optimality equation (37) by making the corresponding substitution in (23). This remains for further research.

APPENDIX
SOME CONDITIONS FOR STABILITY

For convenience, we state some of the stability conditions of Markov chains, i.e., the conditions for an invariant measure to exist and for ergodicity to hold. There are many such sufficient conditions in the literature, and we certainly cannot state all of them; in this section, we mainly restate some results in [26] and [18] with the notation of this paper.

Let $S$ be the state space, let $\mathcal{B}(S)$ be its Borel $\sigma$-field, and let $x\in S$, $A\in\mathcal{B}(S)$. The kernel $P^n(x,A)$, $n\ge 1$, of a Markov process $\Phi=\{\Phi_n,\,n\ge 0\}$ with transition probability $P(x,A)$, $x\in S$, $A\in\mathcal{B}(S)$, is defined as

$P^n(x,A)=\int_S P(x,dy)\,P^{n-1}(y,A)$,

where $P^n$ is the $n$-step transition probability, with $P^1=P$. The first return time on $A$ is

$\tau_A=\min\{n\ge 1:\Phi_n\in A\}$,

and the return time probability is

$L(x,A)=P_x(\tau_A<\infty)$,

where the subscript $x$ denotes the initial state.
A probability measure $\pi$ on $\mathcal{B}(S)$ is called invariant if

$\pi(A)=\int_S P(x,A)\,\pi(dx)$ for all $A\in\mathcal{B}(S)$.

A Markov process is called $\varphi$-irreducible if there exists a measure $\varphi$ on $\mathcal{B}(S)$ such that $L(x,A)>0$ for all $x\in S$ and all $A\in\mathcal{B}(S)$ with $\varphi(A)>0$. It is also called $\psi$-irreducible, with $\psi$ being the maximal irreducibility measure (its existence is proved in Proposition 4.2.2 of [26]). Let

$U(x,A)=\sum_{n=1}^{\infty}P^n(x,A)$.

A Markov process is called recurrent if it is $\psi$-irreducible and $U(x,A)=\infty$ for every $x\in S$ and every $A\in\mathcal{B}(S)$ with $\psi(A)>0$. Finally, we have the main results we need.

Theorem 9: (Theorem 10.4.4 in [26]) If $\Phi$ is recurrent, then it has a unique invariant probability measure.

Theorem 10: (Theorem E.13 in [18]) If $\Phi$ is recurrent with invariant probability measure $\pi$, then for any nonnegative measurable function $f$ with $\int_S f\,d\pi<\infty$, we have

$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} f(\Phi_n)=\int_S f\,d\pi$ a.s.

and

$\lim_{N\to\infty}\frac{1}{N}\sum_{n=0}^{N-1} E_{\nu}\big[f(\Phi_n)\big]=\int_S f\,d\pi$

for any initial distribution $\nu$.
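As a finite-state illustration of Theorems 9 and 10 (the transition matrix below is an arbitrary choice of ours), the following sketch computes the unique invariant probability vector of an irreducible chain and checks that a long-run time average of a function matches its expectation under that vector.

    # Finite-state illustration: an irreducible finite chain is recurrent, its
    # invariant probability vector is unique, and time averages of a function
    # converge to the corresponding expectation.
    import numpy as np

    rng = np.random.default_rng(2)
    P = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.4, 0.4, 0.2]])            # irreducible stochastic matrix

    # Invariant probability measure: left eigenvector of P for eigenvalue 1.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    pi = pi / pi.sum()

    f = np.array([1.0, 5.0, -2.0])             # a bounded measurable function

    # Simulate the chain and form the time average (cf. Theorem 10).
    N, x, total = 200_000, 0, 0.0
    for _ in range(N):
        total += f[x]
        x = rng.choice(3, p=P[x])
    print("time average   :", total / N)
    print("integral f dpi :", float(f @ pi))   # the two agree up to sampling error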
REFERENCES
[1] C. Amato, D. S. Bernstein, and S. Zilberstein, "Optimizing fixed-size stochastic controllers for POMDPs and decentralized POMDPs," Autonom. Agents Multi-Agent Syst., vol. 21, no. 3, pp. 293–320, 2009.
[2] K. J. Åström, Introduction to Stochastic Control Theory. New York, NY, USA: Academic, 1970.
[3] V. E. Benes and I. Karatzas, "On the relation between Zakai's and Mortensen's equations," SIAM J. Control Optimiz., vol. 21, pp. 472–489, 1983.
[4] A. Bensoussan, Stochastic Control of Partially Observable Systems. Cambridge, U.K.: Cambridge Univ. Press, 1992.
[5] A. Bensoussan and J. H. van Schuppen, "Optimal control of partially observable stochastic systems with exponential-of-integral performance index," SIAM J. Control Optimiz., vol. 23, pp. 599–613, 1985.
[6] J. P. Buzen, "Computational algorithm for closed queueing networks with exponential servers," Commun. ACM, vol. 16, pp. 527–531, 1973.
[7] X. R. Cao, Stochastic Learning and Optimization: A Sensitivity-Based Approach. New York, NY, USA: Springer, 2007.
[8] X. R. Cao, "Basic ideas for event-based optimization of Markov systems," Discrete Event Dynamic Syst.: Theory Applicat., vol. 15, pp. 169–197, 2005.
[9] X. R. Cao and J. Y. Zhang, "Event-based optimization of Markov systems," IEEE Trans. Autom. Control, vol. 53, no. 4, pp. 1076–1082, May 2008.
[10] X. R. Cao, D. X. Wang, T. Lu, and Y. F. Xu, "Stochastic control via direct comparison," Discrete Event Dynamic Syst.: Theory Applicat., vol. 21, pp. 11–38, 2011.
[11] C. D. Charalambous and R. J. Elliott, "Classes of nonlinear partially observable stochastic optimal control problems with explicit optimal control laws," SIAM J. Control Optimiz., vol. 36, no. 2, pp. 542–578, 1998.
[12] G. Da Prato and J. Zabczyk, Ergodicity for Infinite Dimensional Systems. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[13] R. J. Elliott, L. Aggoun, and J. B. Moore, Hidden Markov Models: Estimation and Control. Berlin, Germany: Springer-Verlag, 1995.
[14] W. J. Gordon and G. F. Newell, "Closed queueing systems with exponential servers," Operat. Res., vol. 15, pp. 252–265, 1967.
[15] M. Green and D. J. N. Limebeer, Linear Robust Control. Englewood Cliffs, NJ, USA: Prentice-Hall, 1995.
[16] E. A. Hansen, "An improved policy iteration algorithm for partially observable MDPs," in Proc. 10th Neural Inf. Process. Syst. Conf., Denver, CO, USA, Dec. 1997.
[17] E. A. Hansen, "Solving POMDPs by searching in policy space," in Proc. 14th Conf. Artif. Intell., Madison, WI, USA, 1998, pp. 211–219.
[18] O. Hernandez-Lerma and J. B. Lasserre, Discrete-Time Markov Control Processes: Basic Optimality Criteria. New York, NY, USA: Springer, 1996.
[19] O. Hernandez-Lerma and J. B. Lasserre, Markov Chains and Invariant Probabilities. Boston, MA, USA: Birkhauser, 2003.
[20] J. R. Jackson, "Networks of waiting lines," Operat. Res., vol. 5, pp. 518–521, 1957.
[21] S. Ji, R. Parr, H. Li, X. Liao, and L. Carin, "Point-based policy iteration," Assoc. Advance. Artif. Intell., pp. 1243–1249, 2007.
[22] T. Kailath, A. Sayed, and B. Hassibi, Linear Estimation. Englewood Cliffs, NJ, USA: Prentice-Hall, 2000.
[23] A. Lindquist, "On feedback control of linear stochastic systems," SIAM J. Control, vol. 11, pp. 323–343, 1973.
[24] J. Liu, "Portfolio selection in stochastic environments," Rev. Financial Studies, vol. 20, pp. 1–39, 2007.
[25] N. Meuleau, L. Peshkin, K.-E. Kim, and L. P. Kaelbling, "Learning finite-state controllers for partially observable environments," in Proc. 15th Conf. Uncertainty Artif. Intell. (UAI'99), 1999, pp. 427–436.
[26] S. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2nd ed. Cambridge, U.K.: Cambridge Univ. Press, 2009.
[27] H. Nagai and S. G. Peng, "Risk-sensitive dynamic portfolio optimization with partial information on infinite time horizon," Ann. Appl. Probab., vol. 12, pp. 173–195, 2002.
[28] C. H. Papadimitriou and J. N. Tsitsiklis, "The complexity of Markov decision processes," Math. Operat. Res., vol. 12, pp. 441–450, 1987.
[29] L. Peshkin, N. Meuleau, and L. Kaelbling, "Learning policies with external memory," in Proc. 16th Int. Conf. Mach. Learn., Morgan Kaufmann, 1999, pp. 307–314.
[30] P. Poupart and C. Boutilier, "Bounded finite state controllers," Adv. Neural Inf. Process. Syst., vol. 16, 2003.
[31] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. New York, NY, USA: Wiley, 1994.
[32] A. Sardag and H. L. Akin, "Kalman based finite state controller for partially observable domains," Int. J. Adv. Robot. Syst., vol. 3, pp. 331–342, 2006.
[33] C. Striebel, "Sufficient statistics in the optimum control of stochastic systems," J. Math. Anal. Applicat., vol. 12, pp. 576–592, 1965.
[34] R. S. Sutton and A. G.
Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: MIT Press, 1998. [35] J. Tu and G. F. Zhou, “Incorporating economic objectives into Bayesian priors: portfolio choice under parameter uncertainty,” J. Financial Quantitative Anal., vol. 45, pp. 959–986, 2010. [36] R. B. Vinter, Optimal Control. Boston, MA, USA: Birkhauser, 2010. [37] D. H. Wagner, “Survey of measurable selection theorems,” SIAM J. Control Optimiz., vol. 15, no. 5, pp. 859–903, 1977. [38] P. Walters, An Introduction to Ergodic Theory. New York, NY, USA: Springer, 1982. [39] D. X. Wang and X. R. Cao, “Event-based optimization for POMDP and its application in portfolio management,” in Proc. 18th IFAC World Congr., Milan, Italy, 2011.
[40] H. S. Witsenhausen, "Separation of estimation and control for discrete time systems," Proc. IEEE, vol. 59, no. 11, pp. 1557–1566, Nov. 1971.
[41] W. Wonham, "On the separation theorem of stochastic control," SIAM J. Control, vol. 6, pp. 312–326, 1968.
[42] J. Xiong and X. Y. Zhou, "Mean-variance portfolio selection under partial information," SIAM J. Control Optimiz., vol. 46, pp. 156–175, 2007.
[43] K. J. Zhang, Y. K. Xu, X. Chen, and X. R. Cao, "Policy iteration based feedback control," Automatica, vol. 44, pp. 1055–1061, 2008.
Xi-Ren Cao (S'82–M'84–SM'89–F'96) received the Ph.D. degree from Harvard University, Cambridge, MA, USA. He is a Chair Professor at Shanghai Jiao Tong University, Shanghai, China, and an affiliate member of the Institute for Advanced Study at the Hong Kong University of Science and Technology (HKUST). He has worked as a Consulting Engineer for Digital Equipment Corporation, was a Research Fellow at Harvard University, and was a Reader, Professor, and Chair Professor at HKUST. He owns three patents in data and telecommunications and has published three books in the areas of performance optimization and discrete event dynamic systems. Dr. Cao is a Fellow of IFAC and received best paper awards from the IEEE Control Systems Society and the Institution of Management Science. He is the Editor-in-Chief of Discrete Event Dynamic Systems: Theory and Applications, and has served as an Associate Editor at Large of the IEEE TRANSACTIONS ON AUTOMATIC CONTROL, as a Member of the Board of Governors of the IEEE Control Systems Society, and as a Member of the Technical Board of IFAC. His current research areas include financial engineering, stochastic learning and optimization, performance analysis of economic systems, and discrete event dynamic systems.
De-Xin Wang received the Ph.D. degree in electronic and computer engineering from Hong Kong University of Science and Technology, Hong Kong SAR, China, in 2011. He currently works in the Quantitative Research Beijing Center, JPMorgan Chase & Co., Beijing, China, as a Research Associate. His research interests include learning and optimization of stochastic systems, portfolio optimization, and risk management.
Li Qiu (S'85–M'90–SM'98–F'07) received the Ph.D. degree in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 1990. He joined Hong Kong University of Science and Technology, Hong Kong SAR, China, in 1993, where he is now a Professor of Electronic and Computer Engineering. His research interests include systems, control, information theory, and mathematics for information technology, as well as their applications in the manufacturing industry. He is also interested in control education and coauthored the undergraduate textbook Introduction to Feedback Control (Prentice-Hall, 2009). This book has so far had its North American edition, International edition, and Indian edition; the Chinese Mainland edition is to appear soon. Dr. Qiu served as an Associate Editor of the IEEE TRANSACTIONS ON AUTOMATIC CONTROL and as an Associate Editor of Automatica. He was the General Chair of the 7th Asian Control Conference, held in Hong Kong in 2009. He was a Distinguished Lecturer of the IEEE Control Systems Society from 2007 to 2010 and a member of its Board of Governors in 2012. He is a Fellow of IEEE and a Fellow of IFAC.