Correction to "Hidden Markov Model Multi-arm Bandits: A Methodology for Beam Scheduling in Multi-Target Tracking"

Vikram Krishnamurthy∗
Department of Electrical Engineering, University of Melbourne, Victoria 3010, Australia
email: [email protected]

August 19, 2002

∗ This work was supported by an ARC large grant and the Centre of Expertise in Networked Decision Systems.
We have discovered an error in the return-to-state formulation of the HMM multi-armed bandit problem in our recently published paper [4]. This note briefly outlines the error in [4] and describes a computationally simpler solution. Complete details, including proofs, of this simpler solution appear in the submitted paper [3]. The error in [4] is in the return-to-state argument given in Result 1, equation (12). It should read:
\[
\begin{split}
V^{(p)}(x^{(p)}, \bar{x}^{(p)}) = \max\Bigg\{ & R'(p)\,x^{(p)} + \beta \sum_{m=1}^{M_p} V^{(p)}\!\left( \frac{B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}}{\mathbf{1}'\,B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}},\ \bar{x}^{(p)} \right) \mathbf{1}'\,B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}, \\
& R'(p)\,\bar{x}^{(p)} + \beta \sum_{m=1}^{M_p} V^{(p)}\!\left( \frac{B^{(p)}(m)\,A^{(p)\prime}\,\bar{x}^{(p)}}{\mathbf{1}'\,B^{(p)}(m)\,A^{(p)\prime}\,\bar{x}^{(p)}},\ \bar{x}^{(p)} \right) \mathbf{1}'\,B^{(p)}(m)\,A^{(p)\prime}\,\bar{x}^{(p)} \Bigg\}
\end{split} \tag{1}
\]
Unfortunately, there appears to be no obvious way of obtaining a finite dimensional characterization of the above value function in the return-to-state formulation. In this note, it is shown that by introducing the retirement formulation [2] of the multi-armed bandit problem, a finite dimensional value iteration algorithm can be obtained for computing
the Gittins index of a POMDP bandit. To avoid repetition, we use exactly the same notation as [4]. For each project p, let M^{(p)} denote a positive real number such that
\[
0 \le M^{(p)} \le \bar{M}^{(p)}, \qquad \bar{M}^{(p)} \triangleq \max_{i \in N_p} R(p, i). \tag{2}
\]
To simplify subsequent notation, we omit the superscript p in M^{(p)} and \bar{M}^{(p)}. In terms of a parameterized retirement reward M, the Gittins index [1], [2] of project p with information state x^{(p)} can be defined as
\[
\gamma^{(p)}(x^{(p)}) \triangleq \min\{ M : V^{(p)}(x^{(p)}, M) = M \} \tag{3}
\]
where V^{(p)}(x^{(p)}, M) satisfies the Bellman functional recursion
\[
V^{(p)}(x^{(p)}, M) = \max\left\{ R'(p)\,x^{(p)} + \beta \sum_{m=1}^{M_p} V^{(p)}\!\left( \frac{B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}}{\mathbf{1}'_{N_p}\,B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}},\ M \right) \mathbf{1}'_{N_p}\,B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)},\ M \right\} \tag{4}
\]
The N-th order approximation of V^{(p)}(x^{(p)}, M) is obtained via the following value iteration algorithm for k = 1, \dots, N:
\[
V_{k+1}^{(p)}(x^{(p)}, M) = \max\left\{ R'(p)\,x^{(p)} + \beta \sum_{m=1}^{M_p} V_k^{(p)}\!\left( \frac{B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}}{\mathbf{1}'_{N_p}\,B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)}},\ M \right) \mathbf{1}'_{N_p}\,B^{(p)}(m)\,A^{(p)\prime}\,x^{(p)},\ M \right\} \tag{5}
\]
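To make (3)–(5) concrete, the following is a minimal numerical sketch, not taken from [3] or [4], for the special case of a fully observed Markov chain bandit, where the information state reduces to the chain state and (5) becomes a finite dimensional iteration. It uses Whittle's lump-sum retirement reward, so the index is recovered as (1 − β) times the critical retirement reward; the function name and interface are hypothetical.

```python
import numpy as np

def gittins_full_obs(A, r, beta, n_grid=2000, n_iter=500, tol=1e-8):
    """Hypothetical sketch: Gittins indices of a fully observed Markov chain
    bandit via the retirement formulation. A: N x N transition matrix (rows
    are from-states), r: length-N reward vector, beta: discount in (0, 1)."""
    N = len(r)
    # Calibrate over a grid of lump-sum retirement rewards M; the critical M
    # can be as large as max(r) / (1 - beta).
    Ms = np.linspace(0.0, r.max() / (1.0 - beta), n_grid)
    V = np.tile(Ms, (N, 1))                 # V[i, j] ~ V(state i, reward Ms[j])
    for _ in range(n_iter):
        # Analogue of (5): continue and collect r, or retire and collect M.
        V = np.maximum(r[:, None] + beta * (A @ V), Ms[None, :])
    gamma = np.empty(N)
    for i in range(N):
        # Analogue of (3): smallest M at which retiring is already optimal.
        stop = np.flatnonzero(V[i] - Ms <= tol)
        M_crit = Ms[stop[0]] if stop.size else Ms[-1]
        gamma[i] = (1.0 - beta) * M_crit    # Whittle's normalization
    return gamma
```

For a constant reward vector r ≡ c, this returns γ ≈ c up to grid resolution, consistent with the remark later in this note that identical rewards give γ^{(p)}(x) = \bar{M} for all x.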
Note that Corollary 1 of [4] holds for the Gittins index. Similar to Sec. 2.4 of [4], we need to compute a finite dimensional characterization of the value iteration algorithm (5). This is done as follows. Define the (N_p + 1) dimensional augmented information state \bar{x} \in \{ [x', 0]', [\mathbf{0}'_{N_p}, 1]' \}, where x \in X^{(p)}. Define an augmented observation process \bar{y}_k \in \{1, \dots, M_p + 1\} and the corresponding (N_p + 1) \times (N_p + 1) transition and observation probability matrices as
\[
A_1^{(p)} = \begin{pmatrix} A^{(p)} & \mathbf{0}_{N_p} \\ \mathbf{0}'_{N_p} & 1 \end{pmatrix}, \qquad
A_2^{(p)} = \begin{pmatrix} \mathbf{0}_{N_p \times N_p} & \mathbf{1}_{N_p} \\ \mathbf{0}'_{N_p} & 1 \end{pmatrix}, \qquad
B_1^{(p)} = \begin{pmatrix} B^{(p)} & \mathbf{0}_{N_p} \\ \mathbf{0}'_{M_p} & 1 \end{pmatrix}, \qquad
B_2^{(p)} = I_{(N_p+1) \times (N_p+1)},
\]
\[
B_1^{(p)}(m) = \operatorname{diag}(\text{column } m \text{ of } B_1^{(p)}); \qquad
B_2^{(p)}(m) = \operatorname{diag}(\text{column } m \text{ of } B_2^{(p)}), \qquad m \in \{1, \dots, M_p + 1\} \tag{6}
\]
\[
z \triangleq \begin{pmatrix} M/\bar{M} \\ 1 - M/\bar{M} \end{pmatrix}, \qquad 0 \le M \le \bar{M}. \tag{7}
\]
Define the information state \pi and the following coordinate transformation:
\[
\pi = z \otimes \bar{x},
\]
\[
\bar{A}_1 = I_{2 \times 2} \otimes A_1 = \begin{pmatrix} A_1 & 0 \\ 0 & A_1 \end{pmatrix}, \qquad
\bar{A}_2 = I_{2 \times 2} \otimes A_2 = \begin{pmatrix} A_2 & 0 \\ 0 & A_2 \end{pmatrix},
\]
\[
\bar{B}_1^{(p)}(m) = I_{2 \times 2} \otimes B_1^{(p)}(m), \qquad
\bar{B}_2^{(p)}(m) = I_{2 \times 2} \otimes B_2^{(p)}(m),
\]
\[
\bar{R}_1(p) = \begin{pmatrix} R'(p) & 0 & R'(p) & 0 \end{pmatrix}, \qquad
\bar{R}_2(p) = \bar{M} \begin{pmatrix} \mathbf{1}'_{N_p} & 1 & \mathbf{0}'_{N_p} & 0 \end{pmatrix}. \tag{8}
\]
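To illustrate the augmentation, the sketch below assembles the quantities in (6)–(8) for a single project with NumPy Kronecker products. It is an illustrative construction under the definitions above, not code from [3]; the function name, interface, and example matrices are hypothetical.

```python
import numpy as np

def augmented_model(A, B, M, M_bar):
    """Hypothetical sketch of the augmentation (6)-(8) for one project.
    A: Np x Np transition matrix; B: Np x Mp observation probability matrix;
    M: parameterized retirement reward with 0 <= M <= M_bar."""
    Np, Mp = B.shape
    # (6): append an absorbing "retired" state (action 1 = continue, 2 = retire)
    A1 = np.block([[A,                  np.zeros((Np, 1))],
                   [np.zeros((1, Np)),  np.ones((1, 1))]])
    A2 = np.block([[np.zeros((Np, Np)), np.ones((Np, 1))],
                   [np.zeros((1, Np)),  np.ones((1, 1))]])
    # (6): augmented observation matrix; extra observation Mp+1 flags retirement
    B1 = np.block([[B,                  np.zeros((Np, 1))],
                   [np.zeros((1, Mp)),  np.ones((1, 1))]])
    # (8): barred quantities via Kronecker products with I_{2x2}
    B1_bar = [np.kron(np.eye(2), np.diag(B1[:, m])) for m in range(Mp + 1)]
    A1_bar = np.kron(np.eye(2), A1)
    A2_bar = np.kron(np.eye(2), A2)
    # (7): two-point weight vector z encoding the retirement reward M
    z = np.array([M / M_bar, 1.0 - M / M_bar])
    return A1_bar, A2_bar, B1_bar, z

# Example usage with a hypothetical two-state, two-observation project:
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
A1_bar, A2_bar, B1_bar, z = augmented_model(A, B, M=0.5, M_bar=1.0)
pi = np.kron(z, np.array([0.5, 0.5, 0.0]))   # pi = z kron x_bar, with x_bar = [x', 0]'
```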
Then, using arguments similar to [4], it can be shown that Theorem 1 of [4] holds. Moreover, Theorem 2 of [4] holds with the following minor changes: V_k^{(p)}(x^{(p)}, M) replaces V_k^{(p)}(x^{(p)}, \bar{x}^{(p)}) in Statement 1. Statement 2 is unchanged. Statement 3 is modified to the following explicit formula for the Gittins index:
\[
\gamma_N^{(p)}(x^{(p)}) = \max_{\lambda_{i,N} \in \Lambda_N^{(p)}} \frac{\bar{M}\, \lambda'_{i,N}(3)\, x^{(p)}}{\left( \lambda_{i,N}(3) - \lambda_{i,N}(1) \right)' x^{(p)} + \bar{M}} \tag{9}
\]
where each vector \lambda_{i,N} \in \Lambda_N^{(p)} is of the form
\[
\lambda_{i,k} = \begin{pmatrix} \lambda'_{i,k}(1) & 0 & \lambda'_{i,k}(3) & 0 \end{pmatrix}', \qquad \text{where } \lambda_{i,k}(1), \lambda_{i,k}(3) \in \mathbb{R}^{N_p}. \tag{10}
\]
Statement 3 above gives an explicit formula for the Gittins index of the HMM multi-armed bandit problem. Recall that x_k^{(p)} is the information state computed by the pth HMM filter at time k. Given that we can compute the set of vectors \Lambda_N^{(p)}, (9) gives an explicit expression for the Gittins index \gamma_N^{(p)}(x_k^{(p)}) at any time k for project p. Note that if all elements of R(p) are identical, then \gamma^{(p)}(x) = \bar{M} for all x. Sections 2.5, 3 and 4 of [4], including the beam scheduling algorithm for the hybrid sensor of Sec. 3.2, still hold. It is worth noting that the above solution is computationally simpler, since the information state \pi here is a 2(N_p + 1) dimensional vector, whereas the information state considered in [4] is an N_p^2 dimensional vector.
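Given the vector set, evaluating (9) at a filtered information state is a direct maximization over the vectors. The sketch below assumes the pieces \lambda_{i,N}(1) and \lambda_{i,N}(3) of each vector have been stacked as rows of two arrays; all names are hypothetical.

```python
import numpy as np

def gittins_from_vectors(Lam1, Lam3, x, M_bar):
    """Evaluate the explicit Gittins index formula (9).
    Lam1, Lam3: K x Np arrays whose i-th rows hold lambda_{i,N}(1) and
    lambda_{i,N}(3); x: information state (length Np, nonnegative, sums to 1)."""
    num = M_bar * (Lam3 @ x)             # M_bar * lambda_{i,N}(3)' x
    den = (Lam3 - Lam1) @ x + M_bar      # (lambda_{i,N}(3) - lambda_{i,N}(1))' x + M_bar
    return float(np.max(num / den))      # maximize over the vectors in Lambda_N
```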
References

[1] D. P. Bertsekas. Dynamic Programming and Optimal Control, volumes 1 and 2. Athena Scientific, Belmont, Massachusetts, 1995.

[2] J. C. Gittins. Multi-armed Bandit Allocation Indices. Wiley, 1989.

[3] V. Krishnamurthy. A value iteration algorithm for partially observed Markov decision process multi-armed bandits. Submitted to IEEE Transactions on Automatic Control, 2001.

[4] V. Krishnamurthy and R. J. Evans. Hidden Markov model multi-arm bandits: A methodology for beam scheduling in multi-target tracking. IEEE Transactions on Signal Processing, 49(12):2893–2908, December 2001.