Stochastic Dynamic Programming Based Approaches to Sensor Resource Management

R.B. Washburn, M.K. Schneider, J.J. Fox
Fusion Technology and Systems Division
Alphatech, Inc.
Burlington, MA
{bob.washburn, michael.schneider, john.fox}@alphatech.com

Abstract – This paper describes a stochastic dynamic programming based approach to solving Sensor Resource Management (SRM) problems such as occur in tracking multiple targets with electronically scanned, multi-mode radar. Specifically, it formulates the SRM problem as a stochastic scheduling problem and develops approximate solutions based on the Gittins index rule. Novel results include a hybrid state stochastic model for the information dynamics of tracked targets, an exact index rule solution of the SRM problem for a simplified tracking model, and the use of approximate dynamic programming to extend the index rule solution to more realistic models.

Keywords: Sensor resource management, dynamic programming, Gittins index, multi-armed bandit.

1 Introduction

Modern sensor suites contain sensors that are agile, allowing fast redirection (e.g., electronic scanning) and fast switching between modes (e.g., radar modes such as SAR and MTI), giving a large number of control options. However, allocation of sensor resources is also subject to a variety of constraints: there may be many things to look at (complex environments with multiple targets), limited time to sense targets because of deadlines on the usefulness of information (urgent requests) or exposure to threat environments, limited sensor fields of view, and so on. Developing sensor control algorithms that satisfy such constraints and maximize the value of the information obtained is the goal of Sensor Resource Management (SRM) research.

Stochastic dynamic programming [1] provides a principled, systematic approach to formulating and solving SRM problems [8]. However, its application requires paying the price of developing mathematical models for the problem formulation and then finding efficient algorithmic solutions. In this paper we describe new results in developing efficient solutions based on stochastic dynamic programming for SRM problems such as occur in tracking multiple targets with electronically scanned, multi-mode radar. Specifically, we formulate the SRM problem as a stochastic scheduling (or multi-armed bandit) problem and develop approximate solutions based on the Gittins index rule [2]. Although there has been much research on index rules [7], [6], [3], [9], [5], there appear to be no applications to SRM except for the recent work by Krishnamurthy and Evans [4]. The reason for this is that the SRM problem, formulated as stochastic scheduling, does not usually satisfy the conditions necessary for the existence of an index rule solution. Thus, one is forced to develop approximations (e.g., the "restless bandit" work of [9], [5]). In this paper we show how to relax one of the "usual" conditions and develop index rule-based SRM algorithms for kinematic tracking. The new contributions of this paper include:

1. Stochastic scheduling formulation of SRM and a corresponding extension of the index rule solution to more general reward functions, more suitable for SRM problems.

2. Hybrid state stochastic model of information dynamics for tracking problems, which facilitates efficient solutions of stochastic dynamic programs and alleviates the "curse of dimensionality."

3. Exact index rule solution for a simple tracking model. This solution provides a basis for efficient approximations of more realistic tracking models.

4. Lookahead approximation of the stochastic dynamic program solution based on an index rule solution, applicable to more realistic tracking models.

Section 2 formulates the SRM problem as a stochastic scheduling problem and shows how to extend the index rule solution to more general types of reward functions. This section also discusses generalization of the formulation to controlling sensors with variable dwell times and selectable modes. Section 3 describes hybrid state models of information dynamics to use for tracking problems. These models use discrete-valued random variables to drive a continuous-valued information state. This approach helps reduce the dimensionality of the resulting stochastic dynamic programming equations. In Section 4 we find the optimal index rule solution for a simple version of the tracking problem and use this solution in Section 5 as the basis for approximate solutions of more complex models. Section 5 also describes numerical results of the lookahead approximation for a multi-target tracking example.

2 SRM Problem Formulation

2.1 Stochastic Scheduling Formulation

We formulate SRM as a stochastic control problem in which sensor actions u(t) are the controls and the controlled state, denoted x(t), is an information state. Conceptually, x(t) denotes the accuracy of the information in the track. Most generally, it can be the conditional probability of the target state given previous observations (as in [4]). In our problem x(t) will be an estimate of the track error covariance. In this paper we assume that the sensor action applies to one target at a time and that time is discrete (t = 0, 1, 2, ...). The control u(t) = i indicates which target i to look at at time t. The information state is initially x_i(0) = ξ_i, and x_i(t+1) is given by a stochastic equation

    x_i(t+1) = f_i(x_i(t), w_i(t), u(t))    (1)

for i = 1, ..., n, where the w_i(t) are random variables representing uncertainties in target dynamics and sensor measurements. We assume that the w_i(t) are independent random variables with distribution p_i(w_i). At each time t we must determine an action u(t) for the sensor to take, depending only on the information state

    x(t) = (x_1(t), ..., x_n(t))    (2)

of all targets. The SRM problem is to determine u(t) as a function of x(t) which maximizes the expected value of the total discounted reward, namely

    J(ξ) = max_π E{ Σ_{t=0}^∞ β^t Σ_{i=1}^n R_i(x_i(t)) | x(0) = ξ }    (3)

where the optimization is taken over all stationary control policies π (i.e., u(t) = π(x(t))). The parameter β is the discount factor. The dynamic programming equation is

    J(ξ) = max_u { Σ_{i=1}^n R_i(ξ_i) + β ∫ J(f(ξ, w, u)) p(w) dw }    (4)

where

    p(w) = Π_{i=1}^n p_i(w_i).    (5)

The optimal control policy is given by finding the action u = π*(ξ) which achieves the maximum in (4).

2.2 SRM and Index Rules

In the conventional stochastic scheduling problem (for which there are index rule solutions), one receives the discounted reward

    β^t R_{u(t)}(x_{u(t)}(t))    (6)

at time t. In the SRM problem one receives the reward

    β^t Σ_{i=1}^n R_i(x_i(t)).    (7)

In the conventional scheduling problem, the state x_i(t) of a machine evolves via

    x_{u(t)}(t+1) = f_{u(t)}(x_{u(t)}(t), w_{u(t)}(t))    (8)

and

    x_i(t+1) = x_i(t) if i ≠ u(t).    (9)

In the SRM problem (8) is still true, where the equation represents the prediction and update of the target track state over one time step. The noise w_i(t) expresses uncertainty in how the target state actually changes as well as uncertainties in the measurement of the target state. However, if the sensor doesn't look at the target, the evolution of the state is

    x_i(t+1) = g_i(x_i(t), w_i(t)) if i ≠ u(t)    (10)

where the equation represents the prediction of the target track state over one time step, without a measurement update. The noise w_i(t) expresses uncertainty in how the target state changes in one time step. In general, g_i(x_i, w_i) ≠ x_i. Thus, the SRM problem differs from the stochastic scheduling problem in two critical respects:

1. The reward depends on unsensed targets (in the conventional scheduling problem idle machines contribute no reward).

2. The state of an unsensed target can change (in the conventional scheduling problem the states of idle machines remain frozen).

In some SRM problems it is possible to have

    g_i(x_i, w_i) = x_i    (11)

so that the state of unsensed targets remains the same, just as in the scheduling problem. For example, this is the case if the x_i are statistics of a static parameter (e.g., target classification problems, where x_i is a probability vector of possible target types). If (11) is true, we can reformulate the SRM problem with an equivalent reward function so that the reward doesn't depend on unsensed targets, again just like the scheduling problem, and thus we can get index rules which are optimal solutions of the SRM problem.

Theorem 1. Suppose that

    x_i(t+1) = f_i(x_i(t), w_i(t), u(t))    (12)

for i = 1, ..., n, where the w_i(t) are independent for all i, t and w_i(t) has distribution p_i(w_i) for all i, t. Suppose that the control u(t) ∈ {1, ..., n} and u(t) depends on the state x(t). Suppose that

    f_i(x_i, w_i, u) = x_i    (13)

for all u ≠ i. Define

    R̃_i(x_i) = β ( ∫ R_i(f_i(x_i, w_i, i)) p_i(w_i) dw_i − R_i(x_i) ).    (14)

Then a policy optimizes

    E{ Σ_{t=0}^∞ β^t R̃_{u(t)}(x_{u(t)}(t)) | x(0) = ξ }    (15)

if and only if it optimizes

    E{ Σ_{t=0}^∞ β^t Σ_{i=1}^n R_i(x_i(t)) | x(0) = ξ }.    (16)

Proof. For any control u(t), show that

    E{ Σ_{t=0}^∞ β^t R̃_{u(t)}(x_{u(t)}(t)) | x(0) } = a E{ Σ_{t=0}^∞ β^t Σ_{i=1}^n R_i(x_i(t)) | x(0) } + b

for constants a > 0 and b.

Corollary 2. If the target state remains unchanged when the target is not viewed by the sensor, then there is an index rule which is an optimal solution of the SRM problem. That is, there are functions m_i(x_i), for each target i, such that the optimal SRM policy is given by

    u(t) = arg max_i { m_i(x_i(t)) }.    (17)

If J_i(x_i, M) is the solution of

    J_i(x_i, M) = max{ M, R̃_i(x_i) + β ∫ J_i(f_i(x_i, w_i), M) p_i(w_i) dw_i }    (18)

where

    R̃_i(x_i) = β ( ∫ R_i(f_i(x_i, w_i)) p_i(w_i) dw_i − R_i(x_i) ),    (19)

then the index function is defined by

    m_i(x_i) = min{ M : J_i(x_i, M) = M }.    (20)

Proof. Follows from the theorem above and results for finding index functions in [1].

The result shows that the assumption that idle machines do not contribute to the reward is not a critical assumption. As long as the states of idle machines are frozen, the theorem shows that it is possible to find an equivalent reward such that idle machines do not contribute. Thus, "frozen states of idle machines" is the key assumption in order to have an optimal index rule solution to the stochastic scheduling problem. Of course, the frozen state assumption is not realistic for tracking SRM problems. However, one can use index rules as the basis for good suboptimal approximations of the more general problem.
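To make the computation in (18)-(20) concrete, the following is a minimal sketch (our illustration, not code from the paper) of how the index of a single finite-state target can be computed numerically: for a candidate retirement reward M, iterate the recursion (18) to convergence, and locate the smallest M with J_i(x_i, M) = M by bisection. The transition matrix P, transformed reward vector r_tilde, and the small example arm are assumed purely for illustration.

```python
import numpy as np

def solve_retirement_value(P, r_tilde, beta, M, tol=1e-10, max_iter=10_000):
    """Value iteration for J(x, M) = max{ M, r_tilde(x) + beta * E[J(next, M)] } (cf. (18))."""
    J = np.full(len(r_tilde), M, dtype=float)      # start from "retire everywhere"
    for _ in range(max_iter):
        J_new = np.maximum(M, r_tilde + beta * P @ J)
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    return J

def gittins_index(P, r_tilde, beta, state, iters=60):
    """Smallest M with J(state, M) = M (cf. (20)), located by bisection on M."""
    M_lo = 0.0
    M_hi = np.max(np.abs(r_tilde)) / (1.0 - beta)  # crude upper bound on the index
    for _ in range(iters):
        M = 0.5 * (M_lo + M_hi)
        J = solve_retirement_value(P, r_tilde, beta, M)
        if J[state] > M + 1e-12:   # continuing is strictly better: index exceeds M
            M_lo = M
        else:                      # retiring is optimal: index is at most M
            M_hi = M
    return 0.5 * (M_lo + M_hi)

if __name__ == "__main__":
    # Illustrative 3-state arm: state 2 is an absorbing "goal" state with reward 1.
    beta = 0.9
    P = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
    r_tilde = np.array([0.0, beta * 0.5, 0.0])  # transformed reward (19) for R = 1(goal)
    for s in range(3):
        print(s, gittins_index(P, r_tilde, beta, s))
```

Because J_i(x_i, M) − M is non-increasing in M, the bisection in gittins_index is valid; in practice the index would be tabulated offline over a grid of information states.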

2.3 Extension of SRM Formulation

It is necessary to extend the problem formulation to model more realistic situations in which sensors have selectable modes and in which different modes require different amounts of time to complete (e.g., consider the difference between SAR and MTI modes for a radar). In such situations we formulate the scheduling problem so that the control u(t) and the information state x_i(t) are constant over the time interval τ(k) ≤ t < τ(k+1). Let x_i(k) denote the information state of target i at stage k. Stage k begins at time τ(k) and ends at time τ(k+1). At each stage k we must determine a control u(k) for the sensor to use over the period τ(k) ≤ t < τ(k+1). In this case the control u(k) has the form

    u(k) = (μ(k), σ(k))    (21)

where μ is the target i = 1, ..., n to look at and σ is a sensor mode to use. Time τ(0) = 0, and τ(k+1) is given by

    τ(k+1) = τ(k) + Δ(σ(k)),    (22)

so that the time duration Δ(σ(k)) of stage k depends on the sensor mode σ(k) used. This control u(k) depends only on the information state at the beginning of the stage, namely,

    x(k) = (x_1(k), ..., x_n(k)).    (23)

The information state evolves according to

    x_i(k+1) = f_i(x_i(k), w_i(k), u(k)),    (24)

where we can assume that w_i(k) depends only on x_i(k) and u(k) and has conditional distribution p_i(w_i | x_i, u). The SRM problem is to determine u(k) as a function of x(k) which maximizes the expected value of the total discounted reward, namely

    J(ξ) = max_π E{ ∫_0^∞ e^{−λt} Σ_{i=1}^n R_i(x_i(t)) dt | x(0) = ξ }    (25)

where the optimization is taken over all stationary control policies π (i.e., u(k) = π(x(k))). The parameter λ is the discount rate. Because the information state is piecewise constant, the total discounted value is

    ∫_0^∞ e^{−λt} Σ_{i=1}^n R_i(x_i(t)) dt = Σ_{k=0}^∞ (e^{−λτ(k)} − e^{−λτ(k+1)}) / λ · Σ_{i=1}^n R_i(x_i(k)).

Thus, we obtain an equivalent discrete time problem with the dynamic programming equation

    J(ξ) = max_u { (1 − e^{−λΔ(u)}) / λ · Σ_{i=1}^n R_i(ξ_i) + e^{−λΔ(u)} ∫ J(f(ξ, w, u)) p(w | ξ, u) dw }    (26)

where

    p(w | ξ, u) = Π_{i=1}^n p_i(w_i | ξ_i, u).    (27)
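As a concrete reading of (26), the sketch below (an illustration with made-up dwell times and a stand-in for the expectation integral, not the paper's implementation) performs one Bellman backup over candidate (target, mode) pairs, weighting the stage reward by (1 − e^{−λΔ})/λ and discounting the future value by e^{−λΔ}.

```python
import math

def stage_backup(current_reward_sum, modes, expected_next_value, lam=0.1):
    """One backup of the variable-dwell DP (26).

    current_reward_sum: sum_i R_i(xi_i) for the current information state.
    modes: dict mapping u = (target, mode) -> dwell time Delta(u).
    expected_next_value: callable u -> E[J(f(xi, w, u))]; a stand-in for the integral in (26).
    """
    best_u, best_value = None, -math.inf
    for u, dwell in modes.items():
        discount = math.exp(-lam * dwell)
        value = (1.0 - discount) / lam * current_reward_sum + discount * expected_next_value(u)
        if value > best_value:
            best_u, best_value = u, value
    return best_u, best_value

if __name__ == "__main__":
    # Two targets, two radar modes with different (assumed) dwell times.
    modes = {(0, "MTI"): 0.1, (0, "SAR"): 1.0, (1, "MTI"): 0.1, (1, "SAR"): 1.0}
    # Hypothetical expected future values; in practice these come from the model (24).
    future = {(0, "MTI"): 52.0, (0, "SAR"): 70.0, (1, "MTI"): 50.0, (1, "SAR"): 61.0}
    print(stage_backup(current_reward_sum=3.0, modes=modes, expected_next_value=future.get))
```

Longer dwells buy more informative modes at the price of a smaller discount factor e^{−λΔ}, which is exactly the trade-off the stage backup evaluates.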

3 Model of Information Dynamics for Tracking Problems

To design the SRM algorithm we need to model how the information state responds to sensor actions, as modeled by the equation

    x_i(t+1) = f_i(x_i(t), w_i(t), u(t)).    (28)

Modeling the information state as the conditional probability distribution of the target state variable is one possible model, but for many problems it gives a large dimensional stochastic evolution equation and a correspondingly large dimensional dynamic program to solve. Thus, we try to model the information dynamics with a smaller number of salient statistics in order to obtain a tractable stochastic dynamic program. For tracking problems, in which the SRM algorithm is trying to maximize the quality of kinematic tracking for multiple targets, the track error covariance is an effective information state x_i(t), and hybrid state equations, in which w_i(t) takes a finite number of values, provide a low complexity description of the information dynamics.

3.1 Modeling Detection Uncertainty

The simplest instance of the model has w_i(t) just indicate target detections, so that w_i(t) takes just two possible values (denote w_i(t) = D if the target is detected and w_i(t) = D̄ if the target isn't detected). Let α_i denote the probability that target i is detected, assuming that the sensor looks at target i. Then f_i has the following form:

    f_i(x_i, D, i) = F_i (x_i(t)^{−1} + H_i^T R_i^{−1} H_i)^{−1} F_i^T + G_i Q_i G_i^T    (29)

    f_i(x_i, D̄, i) = F_i x_i(t) F_i^T + G_i Q_i G_i^T.    (30)

If the sensor doesn't look at target i (u ≠ i), then

    f_i(x_i, w_i, u) = F_i x_i(t) F_i^T + G_i Q_i G_i^T    (31)

whether w_i = D or D̄. Here x_i(t) denotes an error covariance, and the matrices F_i, G_i, Q_i, H_i, R_i model the effect of prediction and update on the error covariance.
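As a concrete illustration of the hybrid state model (29)-(31), the following sketch (our illustration, not code from the paper) propagates a track error covariance one step: an information-form measurement update when the sensed target is detected, and prediction only otherwise. The specific matrices F, G, Q, H, R and the detection probability alpha are assumed for the example.

```python
import numpy as np

def propagate_covariance(x_cov, looked_at, detected, F, G, Q, H, R):
    """One step of the hybrid-state information dynamics (cf. (29)-(31)).

    x_cov: current track error covariance x_i(t).
    looked_at, detected: discrete part of the hybrid state for this target.
    """
    if looked_at and detected:
        # Measurement update in information form, then prediction: eq. (29).
        info = np.linalg.inv(x_cov) + H.T @ np.linalg.inv(R) @ H
        updated = np.linalg.inv(info)
    else:
        # No measurement information: prediction only, eqs. (30)-(31).
        updated = x_cov
    return F @ updated @ F.T + G @ Q @ G.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative 1-D constant-position model with assumed process and measurement noise.
    F = np.array([[1.0]]); G = np.array([[1.0]]); Q = np.array([[0.05]])
    H = np.array([[1.0]]); R = np.array([[1.0]])
    alpha = 0.8                       # detection probability when the sensor looks
    x_cov = np.array([[20.0]])        # initial position variance
    for t in range(5):
        looked = (t % 2 == 0)                         # e.g., sensor looks every other step
        detected = looked and (rng.random() < alpha)  # w_i(t) = D with probability alpha
        x_cov = propagate_covariance(x_cov, looked, detected, F, G, Q, H, R)
        print(t, looked, detected, float(x_cov[0, 0]))
```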



3.2 Modeling Association Uncertainty

We can also include the effect of misassociations by considering a four-state process w_i(t) as follows. Suppose that w_i can be one of the four states

    (A, D), (A, D̄), (Ā, D), (Ā, D̄)    (32)

where D stands for detection, D̄ for missed detection, A for association, and Ā for misassociation. Then f_i(x_i, w_i, u) has the following form:

    f_i(x_i, (A, D), i) = F_i (x_i(t)^{−1} + H_i^T R_i^{−1} H_i)^{−1} F_i^T + G_i Q_i G_i^T    (33)

    f_i(x_i, (A, D̄), i) = F_i x_i(t) F_i^T + G_i Q_i G_i^T    (34)

    f_i(x_i, (Ā, D), i) = f_i(x_i, (Ā, D̄), i) = F_i (x_i(t) + S_i) F_i^T + G_i Q_i G_i^T.    (35)

The idea is that S_i ≥ 0 models the bias induced by a misassociation. If the sensor doesn't look at target i (u ≠ i), then

    f_i(x_i, w_i, u) = F_i x_i(t) F_i^T + G_i Q_i G_i^T.    (36)

A simple model for S_i is

    S_i = a_i^2 / (d_i + 2) · K_i R̂_i K_i^T    (37)

where a_i is a gate size parameter (e.g., a_i = 3 or a_i = 4), d_i is the dimension of the measurement, K_i is the Kalman gain

    K_i = (x_i(t)^{−1} + H_i^T R_i^{−1} H_i)^{−1} H_i^T R_i^{−1}    (38)

and

    R̂_i = H_i x_i(t) H_i^T + R_i    (39)

is the predicted measurement covariance. The probabilities of the discrete events are given by

    Pr{w_i(t) = (A, D)} = P_{A|D} α_i    (40)
    Pr{w_i(t) = (A, D̄)} = P_{A|D̄} (1 − α_i)    (41)
    Pr{w_i(t) = (Ā, D)} = (1 − P_{A|D}) α_i    (42)
    Pr{w_i(t) = (Ā, D̄)} = (1 − P_{A|D̄}) (1 − α_i).    (43)

For d_i = 2, the association probabilities P_{A|D} and P_{A|D̄} are modeled in terms of the gate probability 1 − e^{−a_i^2/2} and a false-measurement density b_i (44).
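The misassociation model can be simulated in the same way; the sketch below (an illustration under assumed probabilities and matrices, not the paper's code) samples the four-state process w_i(t) according to (40)-(43) and applies the corresponding covariance update, including the bias term S_i of (37)-(39).

```python
import numpy as np

def sample_w(alpha, p_a_given_d, p_a_given_dbar, rng):
    """Sample (association, detection) with the probabilities of (40)-(43)."""
    detected = rng.random() < alpha
    p_assoc = p_a_given_d if detected else p_a_given_dbar
    associated = rng.random() < p_assoc
    return associated, detected

def covariance_step(x_cov, w, F, G, Q, H, R, a_gate=3.0):
    """One step of the four-state model (33)-(39)."""
    associated, detected = w
    R_hat = H @ x_cov @ H.T + R                        # predicted measurement covariance (39)
    info_update = np.linalg.inv(np.linalg.inv(x_cov) + H.T @ np.linalg.inv(R) @ H)
    if associated and detected:                        # (A, D): normal update, eq. (33)
        core = info_update
    elif associated and not detected:                  # (A, Dbar): prediction only, eq. (34)
        core = x_cov
    else:                                              # misassociation: biased result, eq. (35)
        K = info_update @ H.T @ np.linalg.inv(R)       # Kalman gain, eq. (38)
        d = H.shape[0]
        S = (a_gate**2 / (d + 2.0)) * K @ R_hat @ K.T  # bias term, eq. (37)
        core = x_cov + S
    return F @ core @ F.T + G @ Q @ G.T

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    F = np.array([[1.0]]); G = np.array([[1.0]]); Q = np.array([[0.05]])
    H = np.array([[1.0]]); R = np.array([[1.0]])
    x_cov = np.array([[20.0]])
    for t in range(5):
        w = sample_w(alpha=0.8, p_a_given_d=0.95, p_a_given_dbar=0.5, rng=rng)
        x_cov = covariance_step(x_cov, w, F, G, Q, H, R)
        print(t, w, float(x_cov[0, 0]))
```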

4 Tracking SRM Index Rule

4.1 Target Location Example

Suppose that each target i has a location and x_i(t) is the root mean square error of its location estimate. Define the goal set G to be

    G = [0, δ]    (47)

and define the individual target track reward

    R(x_i) = 1(x_i ∈ G),

so that target i earns reward 1 at each time step in which its RMS location error meets the goal δ. When the sensor looks at target i, a detection occurs with probability α and the measurement improves the location estimate; if the target is not detected, or not looked at, the estimate and hence x_i(t) are unchanged, so Corollary 2 applies.

4.2 Equations for Index Function

Transform x_i to z_i = x_i^{−2}. Then we can express the reward as

    R(z_i) = 1(z_i ∈ [δ^{−2}, ∞))    (51)

with the evolution equation

    f(z_i, w_i) = { z_i + γ^2 if the target is detected; z_i if the target is not detected }    (52)

where γ^2 denotes the information gained from a single detection. To apply the index rule to the sum of individual target rewards, we need to replace the reward function R(z_i) with R̃(z_i) defined by

    R̃(z_i) = β ( ∫ R(f(z_i, w_i)) p_i(w_i) dw_i − R(z_i) ),    (53)

which is zero except when z_i lies within one detection of the goal, i.e., z_i ∈ [a, a + γ^2) with a = δ^{−2} − γ^2. Following (18)-(20), we solve for J_i(z_i, M) by successive approximation, noting that the value function at any iteration is piecewise constant. The successive approximation iterations converge to the optimal value function

    J_i(z_i, M) = Σ_{k=−∞}^{0} ν_k(M) 1(z_i ∈ [a + kγ^2, a + (k+1)γ^2))

for z_i < a + γ^2, where

    ν_k(M) = { β̄^{|k|+1} (1 + M) if M ≤ β̄^{|k|+1} / (1 − β̄^{|k|+1}); M if M > β̄^{|k|+1} / (1 − β̄^{|k|+1}) }    (54)

for k ≤ 0 and

    β̄ = αβ / (1 − (1 − α)β) < 1.    (55)

4.3 Index Rule

The index function m(z_i) is defined as the smallest M such that

    J_i(z_i, M) = M.    (56)

For z_i ∈ [a + kγ^2, a + (k+1)γ^2), this is the smallest M such that ν_k(M) = M. Thus,

    m(z_i) = β̄^{|k|+1} / (1 − β̄^{|k|+1})    (57)

for z_i ∈ [a + kγ^2, a + (k+1)γ^2). Similarly, for z_i ∈ [a + γ^2, ∞), we have

    m(z_i) = 0.    (58)

In terms of the parameters δ and γ, we can express the index function as

    m(z_i) = { β̄^{|k|+1} / (1 − β̄^{|k|+1}) if δ^{−2} + (k−1)γ^2 ≤ z_i < δ^{−2} + kγ^2, k ≤ 0; 0 if z_i ≥ δ^{−2} }.    (59)

Thus, the index rule control is

    u(t) = arg max_i { m(z_i(t)) },    (60)

which is equivalent to

    u(t) = arg max_i { z_i(t) : z_i(t) < δ^{−2} }.    (61)

If we assign different goals δ_i as well as different values V_i to each target, then it is easy to see that the corresponding index functions m_i(z_i) are given by

    m_i(z_i) = { V_i β̄_i^{|k|+1} / (1 − β̄_i^{|k|+1}) if δ_i^{−2} + (k−1)γ_i^2 ≤ z_i < δ_i^{−2} + kγ_i^2, k ≤ 0; 0 if z_i ≥ δ_i^{−2} }    (62)

where

    β̄_i = α_i β / (1 − (1 − α_i)β) < 1.    (63)
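A minimal sketch of the resulting index rule (ours, with assumed parameter values loosely patterned on Table 1): compute β̄_i from (63), the number of detections needed to reach the accuracy goal, the index m_i(z_i) from (62), and point the sensor at the target with the largest index, as in (17).

```python
import math

def index_value(z, delta, gamma_sq, alpha, beta, value=1.0):
    """Gittins index m_i(z_i) of (62); z is the information state z_i = x_i**-2."""
    goal = delta ** -2
    if z >= goal:
        return 0.0                                             # goal already attained, eq. (58)
    beta_bar = alpha * beta / (1.0 - (1.0 - alpha) * beta)     # eq. (63)
    n_detections = math.ceil((goal - z) / gamma_sq)            # |k| + 1 in (62): looks-to-goal
    return value * beta_bar ** n_detections / (1.0 - beta_bar ** n_detections)

def index_rule_control(z_list, deltas, gamma_sq, alphas, beta, values):
    """Pick the target with the largest index, as in (17)/(60)."""
    indices = [index_value(z, d, gamma_sq, a, beta, v)
               for z, d, a, v in zip(z_list, deltas, alphas, values)]
    return max(range(len(z_list)), key=lambda i: indices[i]), indices

if __name__ == "__main__":
    # Illustrative values: high-value target 0, low-value targets 1-2.
    beta = 0.9
    z = [1.0, 0.2, 2.5]         # current inverse-variance information states
    deltas = [0.5, 2.0, 2.0]    # RMS error goals (z goal = delta**-2)
    alphas = [0.8, 0.8, 0.8]
    values = [100.0, 1.0, 1.0]
    u, m = index_rule_control(z, deltas, gamma_sq=1.0, alphas=alphas, beta=beta, values=values)
    print("indices:", [round(x, 4) for x in m], "-> look at target", u)
```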

4.4 Extension to Higher Dimensions

We can extend the previous scalar results to more general situations where the information state is a covariance matrix. Thus, let x_i be the target track error covariance for target i. Assume that x_i evolves according to

    f_i(x_i, w_i, u) = { f̄_i(x_i) if u = i and w_i = 1; x_i otherwise }    (64)

where w_i(t) is a 0-1 random variable and

    Pr{w_i(t) = 1} = α_i.    (65)

Assume that f̄_i(x) satisfies

    f̄_i(x) ≤ x    (66)

in the sense of positive definite matrices. Define the reward function R_i(x_i) by

    R_i(x_i) = { V_i if λ_i(x_i) ≤ δ_i; 0 otherwise }    (67)

where λ_i maps the covariance matrix to a scalar track accuracy measure. Assume that λ_i is monotonic in the sense that

    λ_i(x_i) ≤ λ_i(x_i′)    (68)

if x_i ≤ x_i′ in the sense of positive definite matrices. For example, λ_i(x_i) is monotonic in this sense if it is the trace or maximum eigenvalue of the matrix x_i. Finally, define the integer function ν_i(x_i) to be the minimum n such that

    λ_i(f̄_i^n(x_i)) ≤ δ_i.    (69)

The index rule for the scheduling problem is given by

    m_i(x_i) = { V_i β̄_i^{ν_i(x_i)} / (1 − β̄_i^{ν_i(x_i)}) if ν_i(x_i) > 0; 0 otherwise }    (70)

where

    β̄_i = α_i β / (1 − (1 − α_i)β) < 1    (71)

is defined as above.

Proof. One can prove this extension by using the same approach as for the simple scalar case, e.g., by solving the dynamic programming equation by successive approximation and computing the index rule. A more elegant approach can be obtained by using the insightful interpretation of the index function m_i(x_i) given by [7] as a maximization over stopping times τ (up to the normalization factor 1 − β). That is,

    m_i(x_i) = max_{τ>0} E{ Σ_{t=0}^{τ−1} β^t R̃_i(x_i(t)) | x_i(0) = x_i } / E{ Σ_{t=0}^{τ−1} β^t | x_i(0) = x_i }

where for the SRM problem R̃_i is the transformed reward function.
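For the matrix case, ν_i(x_i) in (69) can be computed by repeatedly applying the single-look improvement map f̄_i until the monotone statistic λ_i drops below the goal. The sketch below (our illustration) uses the trace as λ_i and an information-form measurement update as f̄_i; both choices, and the example matrices, are assumptions made only for the demonstration.

```python
import numpy as np

def looks_to_goal(x_cov, H, R, delta, lam=np.trace, max_looks=1000):
    """nu_i(x_i) of (69): minimum number of detections n with lam(f^n(x_i)) <= delta."""
    def f_bar(P):
        # Assumed improvement map: information-form update with measurement model (H, R).
        return np.linalg.inv(np.linalg.inv(P) + H.T @ np.linalg.inv(R) @ H)

    n = 0
    while lam(x_cov) > delta and n < max_looks:
        x_cov = f_bar(x_cov)
        n += 1
    return n

def matrix_index(x_cov, H, R, delta, alpha, beta, value=1.0):
    """Index m_i(x_i) of (70)-(71)."""
    nu = looks_to_goal(x_cov, H, R, delta)
    if nu == 0:
        return 0.0
    beta_bar = alpha * beta / (1.0 - (1.0 - alpha) * beta)
    return value * beta_bar ** nu / (1.0 - beta_bar ** nu)

if __name__ == "__main__":
    H = np.array([[1.0, 0.0]])       # position-only measurement of a 2-D (pos, vel) state
    R = np.array([[1.0]])
    x_cov = np.diag([20.0, 4.0])     # illustrative initial track covariance
    print("nu =", looks_to_goal(x_cov, H, R, delta=5.0))
    print("index =", matrix_index(x_cov, H, R, delta=5.0, alpha=0.8, beta=0.9, value=100.0))
```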

5 Approximations Based on Index Rule and Examples

5.1 Lookahead Approximation

The index rule is optimal only if the state of unsensed targets remains unchanged, which is not a realistic assumption for tracking problems. However, as noted in [4], it is a reasonable approximation to make when the state changes slowly. In our problem formulation, we require the track error covariance to grow slowly over the time scale 1/(1 − β) (the discount factor defines a "soft" time horizon for the problem). If the error covariance grows more rapidly over this time horizon, we can use the index rule as the basis for a better approximation of the stochastic dynamic programming problem. For example, we have used the index rule's value function to compute n-step lookahead approximations to the dynamic programming equations [1]. The state of the dynamic program is the vector x(t) of target tracking error covariances (x_i(t) is the covariance for target i). The dynamic programming equation has the form

    J(x) = max_u { R(x) + β ∫ J(F(x, u, w)) p(w) dw }    (72)

where R(x) is the immediate reward

    R(x) = Σ_{i=1}^n R_i(x_i)    (73)

and F(x, u, w) is the dynamic equation

    x(t+1) = F(x(t), u(t), w(t))    (74)

with noise w(t). If J_0(x) denotes an initial approximation for the value function, then

    J_n(x) = max_u { R(x) + β ∫ J_{n−1}(F(x, u, w)) p(w) dw }    (75)

defines the n-step lookahead approximation for n ≥ 1. We will use the index rule to define J_0 and compute the one- and two-step lookahead approximations numerically from it. J_0 is computed by assuming that the undetected target covariances x_i don't change. Let m_i(x_i) denote the index function of target i, and define β̄_i and ν_i(x_i) as in the previous section. Moreover, let μ_i(x_i) = m_i(x_i) if ν_i(x_i) > 0 and μ_i(x_i) = ∞ if ν_i(x_i) = 0. If we re-order the target indices i so that i > i′ implies μ_i(x_i) ≤ μ_{i′}(x_{i′}), we can show that

    J_0(x) = (1 − β)^{−1} Σ_{i=1}^n V_i Π_{j=1}^{i} β̄_j^{ν_j(x_j)}.    (76)

Using this expression for J_0 we can compute J_n numerically from (75) for n = 1, 2. The following example illustrates the performance of the lookahead approximations and compares them to the performance of the pure index rule (n = 0) approximation.
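The sketch below (our illustration with an assumed scalar information state and an assumed degradation model, not the paper's simulation code) shows how (75) and (76) combine into a one-step lookahead control: for each candidate target, take the expectation in (75) over detection/no-detection and evaluate J_0 on the frozen-target approximation.

```python
import math

def beta_bar(alpha, beta):
    return alpha * beta / (1.0 - (1.0 - alpha) * beta)       # eq. (63)/(71)

def nu(z, delta, gamma_sq):
    """Looks-to-goal for the scalar information state z_i = x_i**-2."""
    goal = delta ** -2
    return 0 if z >= goal else math.ceil((goal - z) / gamma_sq)

def J0(z_list, deltas, gamma_sq, alphas, beta, values):
    """Index-rule value function (76): serve targets in decreasing index order."""
    bbars = [beta_bar(a, beta) for a in alphas]
    nus = [nu(z, d, gamma_sq) for z, d in zip(z_list, deltas)]
    def mu(i):  # ordering key of Section 5.1: index if goal not yet met, +inf otherwise
        if nus[i] == 0:
            return math.inf
        b = bbars[i] ** nus[i]
        return values[i] * b / (1.0 - b)
    order = sorted(range(len(z_list)), key=mu, reverse=True)
    total, discount = 0.0, 1.0
    for i in order:
        discount *= bbars[i] ** nus[i]   # expected discount until target i's goal is met
        total += values[i] * discount
    return total / (1.0 - beta)

def one_step_lookahead(z_list, deltas, gamma_sq, alphas, beta, values, decay=0.98):
    """One-step lookahead (75) with J_{n-1} = J_0; 'decay' is an assumed information loss
    applied to every target over one time step (unsensed targets are not frozen)."""
    best_u, best_val = None, -math.inf
    for u in range(len(z_list)):
        val = 0.0
        for detected, prob in ((True, alphas[u]), (False, 1.0 - alphas[u])):
            z_next = [z * decay for z in z_list]              # assumed degradation model
            if detected:
                z_next[u] += gamma_sq
            val += prob * J0(z_next, deltas, gamma_sq, alphas, beta, values)
        reward_now = sum(v for z, d, v in zip(z_list, deltas, values) if z >= d ** -2)
        total = reward_now + beta * val
        if total > best_val:
            best_u, best_val = u, total
    return best_u, best_val

if __name__ == "__main__":
    z = [1.0, 0.2, 5.0]
    print(one_step_lookahead(z, deltas=[0.5, 2.0, 2.0], gamma_sq=1.0,
                             alphas=[0.8, 0.8, 0.8], beta=0.9, values=[100.0, 1.0, 1.0]))
```

The two-step lookahead repeats the same expectation with J_1 in place of J_0.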

5.2 Example

The one- and two-step lookahead and index rule controls are tested on a simple scenario with 10 targets, one of which has a high value and 9 of which have low values. Each target moves according to a simple one-dimensional Brownian motion model. The simulation parameters are given in Table 1. Performance is measured for each target in two ways. The first is the time-averaged relative mean square error

    (1 / ((T − T_s + 1) δ)) Σ_{t=T_s}^{T} (x̂_i(t) − x_i(t))^2    (77)

where T is the length of the run, T_s is the time from which the system is assumed to be in steady state, δ is the error goal, x_i is the true target position, and x̂_i is the tracker position estimate for a simple Kalman filter. For low-value targets, the MSE is averaged over the targets. The second performance measure is the time-averaged frequency that the error goal is attained,

    (1 / (T − T_s + 1)) Σ_{t=T_s}^{T} 1(|x_i(t) − x̂_i(t)| < δ).    (78)

For low-value targets, the frequency of goal attainment is averaged over the targets. These performance measures are computed for different values of the moving process noise. The results are plotted in Figures 1-4. Note that the one- and two-step lookahead controls keep the high-value target errors down better than the index rule for moderate to low values of the process noise. At high values, the two-step lookahead does slightly worse. This is because the two-step lookahead control is devoting some time to low-value targets, dramatically decreasing their errors.

Table 1: Simulation Parameters.

    Parameter                                                    Value
    Number of Monte Carlo Runs                                   100
    Time of Run (T)                                              200
    Steady-state Start Time (T_s)                                100
    Discount Factor (β)                                          0.9
    High Target Value (V_1)                                      100
    Low Target Value (V_i, i = 2, ..., 10)                       1
    High-Value Target Error Variance Goal (δ_1)                  0.25
    Low-Value Target Error Variance Goal (δ_i, i = 2, ..., 10)   4
    Measurement Noise Variance (q)                               1
    Probability of Detection (α)                                 0.8
    Initial Position Variance (x_i(0))                           20

[Figure 1: High-Value Target MSE — relative MSE versus moving process noise for the index rule, one-step lookahead, and two-step lookahead controls.]

[Figure 2: Low-Value Target MSE — relative MSE versus moving process noise for the index rule, one-step lookahead, and two-step lookahead controls.]

[Figure 3: High-Value Target Goal Attainment Frequency — goal attainment frequency versus moving process noise for the index rule, one-step lookahead, and two-step lookahead controls.]

[Figure 4: Low-Value Target Goal Attainment Frequency — goal attainment frequency versus moving process noise for the index rule, one-step lookahead, and two-step lookahead controls.]

References

[1] D.P. Bertsekas, Dynamic Programming and Optimal Control, Athena Scientific, Belmont, Massachusetts, 2001.

[2] J.C. Gittins, "Bandit processes and dynamic allocation indices," Journal of the Royal Statistical Society, Series B, Vol. 41, pp. 148-164, 1979.

[3] M.N. Katehakis and A.F. Veinott, Jr., "The multi-armed bandit problem: decomposition and computation," Mathematics of Operations Research, Vol. 12, No. 2, pp. 262-268, 1987.

[4] V. Krishnamurthy and R.J. Evans, "Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking," IEEE Transactions on Signal Processing, Vol. 49, pp. 2893-2907, December 2001.

[5] J. Niño-Mora, "Restless bandits, partial conservation laws and indexability," Dept. of Economics and Business, Universitat Pompeu Fabra, Barcelona, Spain, December 13, 1999.

[6] J.N. Tsitsiklis, "A Lemma on the Multiarmed Bandit Problem," IEEE Transactions on Automatic Control, Vol. AC-31, pp. 576-577, 1986.

[7] P.P. Varaiya, J.C. Walrand, and C. Buyukkoc, "Extensions of the Multiarmed Bandit Problem: The Discounted Case," IEEE Transactions on Automatic Control, Vol. AC-30, pp. 426-439, 1985.

[8] R. Washburn, A. Chao, D. Castanon, and R. Malhotra, "Stochastic Dynamic Programming for Far-Sighted Sensor Management," Proc. of the IRIS National Symposium on Sensor and Data Fusion, Vol. 2, pp. 277-295, 1997.

[9] P. Whittle, "Restless Bandits: Activity Allocation in a Changing World," in J. Gani (ed.), A Celebration of Applied Probability, Journal of Applied Probability, Vol. 25A, pp. 287-298, 1988.