Active Hypothesis Testing for Quickest Anomaly Detection Kobi Cohen and Qing Zhao
arXiv:1403.1023v1 [cs.IT] 5 Mar 2014
Abstract—The problem of quickest detection of an anomalous process among M processes is considered. At each time, a subset of the processes can be observed, and the observations from each chosen process follow one of two different distributions, depending on whether the process is normal or abnormal. The objective is a sequential search strategy that minimizes the expected detection time subject to an error probability constraint. This problem can be considered as a special case of active hypothesis testing, first considered by Chernoff in 1959, where a randomized strategy, referred to as the Chernoff test, was proposed and shown to be asymptotically (as the error probability approaches zero) optimal. For the special case considered in this paper, we show that a simple deterministic test achieves asymptotic optimality and offers better performance in the finite regime. We further extend the problem to the case where multiple anomalous processes are present. In particular, we examine the case where only an upper bound on the number of anomalous processes is known.
Index Terms—Sequential detection, anomaly detection, dynamic search, active hypothesis testing, controlled sensing.

I. INTRODUCTION

The authors are with the Department of Electrical and Computer Engineering, University of California, Davis. Email: {yscohen, qzhao}@ucdavis.edu. This work was supported by the Army Research Lab under Grant W911NF1120086. Part of this work was presented at the Information Theory and Applications (ITA) Workshop, San Diego, California, USA, Feb. 2014.

We consider the problem of detecting a single anomalous process among M processes. Borrowing terminology from target search, we refer to these processes as cells and to the anomalous process as the target, which can be located in any of the M cells. The decision maker is allowed to search for the target
DRAFT
over K cells at a time (1 ≤ K ≤ M). The observations from searching a cell are i.i.d. realizations drawn from one of two distributions, f or g, depending on whether the target is absent or present. The objective is a sequential search strategy that dynamically determines which cells to search at each time and when to terminate the search, so that the expected detection time is minimized under a constraint on the probability of declaring a wrong cell as containing the target. The problem under study applies to intrusion detection in cyber-systems when an intrusion into a subnet has been detected and the probability of each component being compromised is small (thus, with high probability, there is only one abnormal component). It also finds applications in target search, fraud detection, and spectrum scanning in cognitive radio networks.
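To make the observation model concrete, the sketch below draws one round of observations under an assumed Gaussian pair f = N(0, 1) and g = N(1, 1); the cell indices, the target location, and the Gaussian choice are all illustrative assumptions, not part of the model, which allows general f and g.

```python
import random

# Illustrative instance of the search model: a cell's observation follows
# f = N(0,1) when the cell is normal and g = N(1,1) when it holds the target.
M, K = 5, 2            # number of cells, cells searched per time slot
target = 3             # location of the anomalous cell (unknown to the searcher)

def observe(cell, rng):
    """Draw one observation y_cell(n): g if the cell holds the target, else f."""
    mean = 1.0 if cell == target else 0.0
    return rng.gauss(mean, 1.0)

rng = random.Random(0)
searched = [0, 3]                          # some K-subset chosen at time n
samples = {cell: observe(cell, rng) for cell in searched}
```

A search policy would feed these samples into per-cell log-likelihood ratios, as developed in the sequel.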
A. A Case of Active Hypothesis Testing The above anomaly detection problem can be considered as a special case of the sequential design of experiments problem first studied by Chernoff in 1959 [1]. In this problem, a decision maker aims to infer the state of an underlying phenomenon by sequentially choosing the experiment (thus the observation model) to be conducted at each time among a set of available experiments. Chernoff focused on the case of binary hypotheses and showed that a randomized strategy, referred to as the Chernoff test, is asymptotically optimal as the maximum error probability diminishes. Specifically, the Chernoff test chooses the current experiment based on a distribution that depends on the past actions and observations. Variations and extensions of the problem and the Chernoff test were studied in [2]–[6], where the problem was referred to as controlled sensing for hypothesis testing in [3], [4] and active hypothesis testing in [5], [6] (see a more detailed discussion in Section I-C). It is not difficult to see that the quickest anomaly detection problem considered in this paper is a special case of the active hypothesis testing problem considered in [1]–[6]. In particular, under each hypothesis that the target is located in the ith (i = 1, 2, . . . , M ) cell, the distribution (either f or g) of the next observation depends on the action of which cell to search (i.e., which experiment to carry out). The Chernoff test and its variations proposed in [2]–[6] thus directly apply to our problem. However, in contrast to the randomized nature of the Chernoff test and its variations, we show in this paper that a simple deterministic test achieves asymptotic optimality and offers better performance in the finite regime.
B. Main Results

Similar to [1]–[5], we focus on asymptotically optimal policies in terms of minimizing the detection time as the error probability approaches zero. The asymptotic optimality of the Chernoff test as shown in [1] requires that, under any experiment, any pair of hypotheses are distinguishable (i.e., has positive Kullback–Leibler (KL) divergence). This assumption does not hold in the anomaly detection problem considered in this paper. For instance, under the experiment of searching the ith cell, the hypotheses of the target being in the jth (j ≠ i) and the kth (k ≠ i) cells yield the same observation distribution f. Nevertheless, we show in Theorem 1 that the Chernoff test preserves its asymptotic optimality for the problem at hand even without this positivity assumption on all KL divergences. As a result, it serves as a benchmark for comparison. The Chernoff test, when applied directly to the anomaly detection problem, leads to a randomized cell selection rule: the cells to be searched at the current time are drawn randomly according to a distribution determined by past observations and actions. The main result of this paper is to show that a simple deterministic policy offers the same asymptotic optimality yet with significant performance gain in the finite regime and considerable reduction in implementation complexity. Specifically, under the proposed policy, the selection rule φ(n), which indicates which K cells should be observed at time n, is given by:

φ(n) = { (m^{(1)}(n), m^{(2)}(n), ..., m^{(K)}(n)), if D(g||f) ≥ D(f||g)/(M − 1) or K = M,
         (m^{(2)}(n), m^{(3)}(n), ..., m^{(K+1)}(n)), if D(g||f) < D(f||g)/(M − 1) and K < M,
where m^{(i)}(n) denotes the index of the cell with the ith highest sum of log-likelihood ratios collected from this
cell up to time n, and D(·||·) is the KL divergence between two distributions. Since D(g||f ) is the key quantity in the cell selection rule, we refer to the proposed deterministic policy as the DGF policy. We then extend the problem to the case where multiple anomalous processes are present. In particular, we examine the case where only an upper bound on the number of anomalous processes is known. Interestingly, we show that the Chernoff test may not be practically appealing under the latter setting. Hence, we propose a modified optimization problem for this case. Asymptotically optimal deterministic policies are developed for these cases as well.
C. Related Work

As discussed in Section I-A, the above anomaly detection problem can be considered as a special case of the active hypothesis testing problem first studied by Chernoff in [1], where the focus was on sequential binary composite hypothesis testing. The results in [1] were extended by Bessler to M-ary hypotheses in [2]. In [4], Nitinawarat et al. considered M-ary active hypothesis testing in both fixed sample size and sequential settings. Under the sequential setting, they extended the Chernoff test to relax the positivity assumption on all KL divergences required in [1], [2]. Furthermore, they examined the asymptotic optimality of the Chernoff test under constraints on decision risks, a stronger condition than the error probability constraint, and developed a modified Chernoff test to meet hard constraints on the decision risks. In [5], in addition to the asymptotic optimality criterion adopted by Chernoff in [1], Naghshvar and Javidi examined active sequential hypothesis testing under the notion of a non-zero information acquisition rate, obtained by letting the number of hypotheses approach infinity, and under a stronger notion of asymptotic optimality. They further studied in [6] the roles of sequentiality and adaptivity in active hypothesis testing by characterizing the gain of sequential tests over fixed sample size tests and the gain of closed-loop policies over open-loop policies.

Target search or target whereabouts problems have been widely studied under various scenarios. Results under the sequential setting can be found in [7]–[10]. Specifically, optimal policies were derived for the problem of quickest search over Wiener processes under the model that a single process is observed at a time [7]–[9]. In [10], an optimal search strategy was established under a different search model, where the target must be contacted by one sensor and identified by another, while contact investigation must not be interrupted until the contact is identified.
However, optimal policies under general distributions or with multi-process probing remain open questions. In this paper we address these questions in the asymptotic regime, as the error probability approaches zero. Target search under the setting of a fixed sample size was considered in [11]–[14]. In [11]–[13], searching a specific location provides a binary-valued measurement regarding the presence or absence of the target. Similar to this paper, Castanon considered in [14] continuous observations: the observations from a location without the target and with the target have distributions f and g, respectively. Different from this paper, where we consider sequential settings and obtain an asymptotically optimal policy that applies to general distributions, [14] focused on the fixed sample size setting and required a symmetry assumption on the distributions (specifically, f(x) = g(b − x) for some b) for the optimality of the proposed policy. The problem of universal outlier
hypothesis testing was studied in [15]. Under this setting, a vector of observations containing coordinates with an outlier distribution is observed at each given time. The goal is to detect the coordinates with the
outlier distribution based on a sequence of n i.i.d. vectors of observations. Another related problem is sequential detection involving multiple independent processes [16]–[23]. An optimal threshold policy was established for the problem of quickly detecting an idle period over multiple independent ON/OFF processes in [16]. In [17], the problem of quickest detection of the emergence of primary users in multi-channel cognitive radio networks was considered. The problem of quickest search for idle channels when the idle/busy state of each channel is fixed over time was studied in [18]. In [19], [20], optimal index probing strategies were derived for the anomaly localization problem, where the objective is to minimize the expected cost incurred by the abnormal processes. In [21], the problem of identifying the first abnormal sequence among an infinite number of i.i.d. sequences was considered. An optimal cumulative sum (CUSUM) test was established under this setting. Further studies on this model can be found in [22], [23]. These studies all assume models different from that of this paper.

D. Organization

In Section II we describe the system model and problem formulation. A discussion of the randomized Chernoff test and its asymptotic optimality for the anomaly detection problem is provided in Section III. In Section IV we propose an asymptotically optimal deterministic policy to solve the problem. In Section V we extend the problem to the case where multiple anomalous processes are present. In Section VI we provide numerical examples to illustrate the performance of the proposed policy as compared with the Chernoff test. Section VII concludes the paper.

II. SYSTEM MODEL AND PROBLEM FORMULATION
A. System Model

Consider the following anomaly detection problem. A decision maker is required to detect the location of a single anomalous object (referred to as a target) located in one of M cells. If the target is in cell m, we say that hypothesis H_m is true. The a priori probability that H_m is true is denoted by π_m, where Σ_{m=1}^{M} π_m = 1. To avoid trivial solutions, it is assumed that 0 < π_m < 1 for all m.
At each time, only K (1 ≤ K ≤ M ) cells can be observed. When cell m is observed at time n, an
observation y_m(n) is independently drawn, one observation at a time, from one of two distributions. If hypothesis m is false, y_m(n) follows distribution f(y); if hypothesis m is true, y_m(n) follows distribution g(y). Let P_m be the probability measure under hypothesis H_m and E_m the operator of expectation with respect
to the measure Pm .
We define the stopping rule τ as the time (i.e., detection delay) at which the decision maker finalizes the search by declaring the location of the target. Let δ ∈ {1, 2, ..., M} be the decision rule, where δ = m if the decision maker declares that H_m is true. Let φ(n) ∈ {1, 2, ..., M}^K be a selection rule indicating which K cells are chosen to be observed at time n. The time series vector of selection rules is denoted by φ = (φ(n), n = 1, 2, ...). Let y(n) = {(φ(t), y_{φ(t)}(t))}_{t=1}^{n} be the set of all the available observations and the chosen cell indices up to time n. A deterministic selection rule φ(n) at time n is a mapping from y(n − 1) to {1, 2, ..., M}^K. A randomized selection rule φ(n) is a mapping from y(n − 1) to probability mass functions over {1, 2, ..., M}^K.

Definition 1: An admissible strategy Γ for the sequential anomaly detection problem is given by the tuple Γ = (τ, δ, φ).

Let P_e(Γ) = Σ_{m=1}^{M} π_m α_m(Γ) be the probability of error under strategy Γ, where α_m(Γ) = P_m(δ ≠ m | Γ) is the probability of declaring δ ≠ m when H_m is true. Let E(τ|Γ) = Σ_{m=1}^{M} π_m E_m(τ|Γ) be the
average detection delay under Γ.

B. Objective
We adopt a Bayesian approach as in [1], [3] by assigning a cost of c to each observation and a loss of 1 to a wrong declaration. The Bayes risk under strategy Γ when hypothesis H_m is true is given by:

R_m(Γ) ≜ α_m(Γ) + c E_m(τ|Γ).   (1)

Note that c represents the ratio of the sampling cost to the cost due to wrong decisions. The average Bayes risk is given by:

R(Γ) = Σ_{m=1}^{M} π_m R_m(Γ) = P_e(Γ) + c E(τ|Γ).   (2)

The objective is to find a strategy Γ that minimizes the Bayes risk R(Γ):

inf_Γ R(Γ).   (3)
C. Notations

Let 1_m(n) be the indicator function, where 1_m(n) = 1 if cell m is observed at time n, and 1_m(n) = 0 otherwise. Let

ℓ_m(n) ≜ log [ g(y_m(n)) / f(y_m(n)) ]   (4)
and

S_m(n) ≜ Σ_{t=1}^{n} ℓ_m(t) 1_m(t)   (5)

be the log-likelihood ratio (LLR) and the observed sum LLRs at time n of cell m, respectively. We then define m^{(i)}(n) as the index of the cell with the ith highest observed sum LLRs at time n. Let

ΔS(n) ≜ S_{m^{(1)}(n)}(n) − S_{m^{(2)}(n)}(n)   (6)

denote the difference between the highest and the second highest observed sum LLRs at time n. Finally, we define

I*(M, K) ≜ { D(g||f) + D(f||g), if K = M,
             max[ D(g||f) + (K − 1)D(f||g)/(M − 1), K D(f||g)/(M − 1) ], if K < M.   (7)
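The rate function in (7) is straightforward to evaluate numerically. The sketch below assumes Gaussian f = N(0,1) and g = N(1,1) purely for illustration (so that both KL divergences equal 1/2) and checks the monotonicity of I*(M, K) in K and M discussed in the sequel.

```python
def kl_gauss(mu0, mu1, var=1.0):
    # D( N(mu0,var) || N(mu1,var) ) = (mu0 - mu1)^2 / (2 var)
    return (mu0 - mu1) ** 2 / (2.0 * var)

def rate_function(M, K, Dgf, Dfg):
    """I*(M, K) as defined in (7), with Dgf = D(g||f) and Dfg = D(f||g)."""
    if K == M:
        return Dgf + Dfg
    return max(Dgf + (K - 1) * Dfg / (M - 1), K * Dfg / (M - 1))

# Illustrative assumption: f = N(0,1), g = N(1,1), so D(g||f) = D(f||g) = 0.5
Dgf, Dfg = kl_gauss(1.0, 0.0), kl_gauss(0.0, 1.0)
by_K = [rate_function(10, K, Dgf, Dfg) for K in range(1, 11)]   # grows with K
by_M = [rate_function(M, 2, Dgf, Dfg) for M in range(3, 11)]    # shrinks with M
```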
In subsequent sections we show that I*(M, K) plays the role of the rate function, which determines the asymptotically optimal performance of the test. Increasing I*(M, K) decreases the asymptotic lower bound on the Bayes risk. Note that I*(M, K) increases with K (i.e., increasing the number of observed processes improves performance) and decreases with M (i.e., increasing the number of hypotheses deteriorates performance).

III. THE CHERNOFF TEST

In this section, we present the Chernoff test proposed in [1] as applied to the anomaly detection problem. We then prove the asymptotic optimality of the Chernoff test for this problem when the positivity assumption on KL divergences required in [1] no longer holds. The Chernoff test has a randomized selection rule. Specifically, for a general M-ary active hypothesis testing problem, the action u at time n under the Chernoff test is drawn from a distribution q = (q(u)) that depends on the past actions and observations:

q = arg max_{q(u)} min_{j ∈ M \ {î(n)}} Σ_u q(u) D(p^u_{î(n)} || p^u_j),   (8)

where M is the set of the M hypotheses, î(n) is the ML estimate of the true hypothesis at time n based on past actions and observations, and p^u_j is the observation distribution under hypothesis j when action u is taken.
It can be shown that the Chernoff test, when applied to the anomaly detection problem, works as follows. If D(g||f) ≥ D(f||g)/(M − 1), then the Chernoff test selects cell m^{(1)}(n) and draws the remaining K − 1 cells randomly with equal probability from the set of cells M \ {m^{(1)}(n)}. If, however, D(g||f) < D(f||g)/(M − 1), all K cells are drawn randomly with equal probability from the set of cells M \ {m^{(1)}(n)}.
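The randomized selection step just described can be sketched as follows; the values fixing D(g||f) and D(f||g), the sample sum-LLR vector, and all variable names are illustrative assumptions.

```python
import random

def chernoff_select(sum_llrs, K, Dgf, Dfg, rng):
    """One draw of the Chernoff-test selection rule for the anomaly problem.

    sum_llrs[m] is S_m(n), the sum LLR of cell m. If D(g||f) >= D(f||g)/(M-1),
    the leading cell m^(1)(n) is probed and the remaining K-1 cells are drawn
    uniformly from the others; otherwise all K cells are drawn uniformly from
    the non-leading cells."""
    M = len(sum_llrs)
    leader = max(range(M), key=lambda m: sum_llrs[m])
    others = [m for m in range(M) if m != leader]
    if Dgf >= Dfg / (M - 1):
        return [leader] + rng.sample(others, K - 1)
    return rng.sample(others, K)

rng = random.Random(1)
# Illustrative call: Dgf = Dfg = 0.5, so the leading cell is always included
cells = chernoff_select([0.2, 1.5, -0.3, 0.0], K=2, Dgf=0.5, Dfg=0.5, rng=rng)
```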
Even though the positivity assumption on KL divergences as required in the proof of the asymptotic
optimality of the Chernoff test given in [1] no longer holds for the anomaly detection problem, we show in Theorem 1 below that the Chernoff test preserves its asymptotic optimality in this case. Note that in [4], a modified Chernoff test was developed in order to handle indistinguishable hypotheses under some (but not all) actions. The basic idea of this modified test is to replace the action distribution given in (8) with a uniform distribution over a subsequence of time instants that grows at a sublinear rate with time. This sequence of arbitrary actions that are independent of past observations will in general affect the performance in the finite regime. Theorem 1 below shows that this modification is unnecessary for the anomaly detection problem.

Theorem 1: Let R′ and R(Γ) be the Bayes risks under the Chernoff test and any other policy Γ, respectively. Then¹,

R′ ∼ −c log c / I*(M, K) ∼ inf_Γ R(Γ) as c → 0.   (9)
Proof: The proof is given in Appendix VIII-B and is based on the argument of [4] and the proof of Theorem 2 given in Appendix VIII-A.

IV. THE DETERMINISTIC DGF POLICY

In this section we propose a deterministic policy, referred to as the DGF policy, to solve (3). Theorem 2 shows that the DGF policy is asymptotically optimal in terms of minimizing the Bayes risk (2) as c → 0.
¹The notation f ∼ g as c → 0 refers to lim_{c→0} f/g = 1.

A. The DGF Policy

At each time n, the selection rule φ(n) of the DGF policy chooses cells according to the order of their sum LLRs. Specifically, based on the relative order of D(g||f) and D(f||g)/(M − 1), either the K cells with the top K highest sum LLRs or those with the second to the (K + 1)th highest sum LLRs
are chosen, i.e.,
φ(n) = { (m^{(1)}(n), m^{(2)}(n), ..., m^{(K)}(n)), if D(g||f) ≥ D(f||g)/(M − 1) or K = M,
         (m^{(2)}(n), m^{(3)}(n), ..., m^{(K+1)}(n)), if D(g||f) < D(f||g)/(M − 1) and K < M.   (10)
The stopping rule and decision rule under the DGF policy are given by:

τ = inf {n : ΔS(n) ≥ − log c},   (11)

and

δ = m^{(1)}(τ).   (12)
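Putting the selection, stopping, and decision rules together, the following sketch simulates one run of the DGF policy under an assumed Gaussian model f = N(0,1), g = N(1,1), for which ℓ(y) = y − 1/2 and D(g||f) = D(f||g) = 1/2; the parameters and the Gaussian choice are illustrative, not prescribed by the paper.

```python
import math
import random

def dgf_search(M, K, c, target, rng):
    """One run of the DGF policy under an assumed Gaussian model.

    Returns (declared cell, stopping time). The per-sample LLR is y - 0.5."""
    Dgf = Dfg = 0.5                    # D(g||f) and D(f||g) for this model
    S = [0.0] * M                      # S_m(n): sum LLRs per cell
    n = 0
    while True:
        n += 1
        order = sorted(range(M), key=lambda m: -S[m])   # m^(1)(n), m^(2)(n), ...
        if Dgf >= Dfg / (M - 1) or K == M:
            probe = order[:K]                            # top-K cells
        else:
            probe = order[1:K + 1]                       # 2nd to (K+1)th cells
        for m in probe:
            y = rng.gauss(1.0 if m == target else 0.0, 1.0)
            S[m] += y - 0.5                              # add ell_m(n)
        order = sorted(range(M), key=lambda m: -S[m])
        if S[order[0]] - S[order[1]] >= -math.log(c):    # stopping rule
            return order[0], n                           # decision rule

decision, tau = dgf_search(M=5, K=2, c=1e-3, target=2, rng=random.Random(7))
```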
The deterministic selection rule under the DGF policy is intuitively satisfying. Consider the case where K = 1. If cell m^{(1)}(n) is selected at each given time n, the asymptotic detection time approaches − log c / D(g||f), since the true cell (say m) is observed at each given time with high probability (in the asymptotic regime) and the test is finalized once sufficient information is gathered from this cell (for a detailed asymptotic analysis see Appendix VIII-A). In this case, D(g||f) determines the asymptotically optimal performance of the test, since E_m(ℓ_m) = D(g||f). On the other hand, if cell m^{(2)}(n) is selected at each given time n, the asymptotic detection time approaches −(M − 1) log c / D(f||g), since one of the M − 1 false cells is observed at each given time with high probability and the test is finalized once sufficient information is gathered from all these cells. Since E_m(ℓ_j) = −D(f||g) for all j ≠ m, D(f||g)/(M − 1) determines the asymptotically optimal performance of the test. Therefore, the selection rule selects the strategy that minimizes the asymptotic detection time according to max [D(g||f), D(f||g)/(M − 1)]. Similar arguments apply to the case where K > 1.

B. Performance Analysis

The following main theorem shows that the DGF policy is asymptotically optimal in terms of minimizing the Bayes risk as c approaches zero:

Theorem 2 (asymptotic optimality of the DGF policy): Let R* and R(Γ) be the Bayes risks under the DGF policy and any other policy Γ, respectively. Then,

R* ∼ −c log c / I*(M, K) ∼ inf_Γ R(Γ) as c → 0.   (13)
Proof: For a detailed proof see Appendix VIII-A. We provide here a sketch of the proof. In App. VIII-A.1, we show that −c log c / I*(M, K) is an asymptotic lower bound on the Bayes risk that can be achieved by any policy Γ. Then, we show in App. VIII-A.2 that the Bayes risk R* under the DGF policy approaches the asymptotic lower bound as c → 0. Specifically, the asymptotic optimality property of DGF is based on Lemma 11, showing that the asymptotic expected search time approaches − log c / I*(M, K), while the error probability is O(c) following Lemma 4.

Establishing the asymptotic expected sample size under DGF in Lemma 11 is mainly based on Lemmas 7, 8 and 10, which break the total expected search time into three major epochs: τ1, τ2 = τ1 + n2 and τ = τ2 + n3. Roughly speaking, τ1 is the time after which the sum LLRs of the true cell (say m) is the highest among all the cells for all n ≥ τ1; τ2 is the time at which sufficient information has been gathered to distinguish hypothesis m from at least one false hypothesis j ≠ m; and τ is the termination time, at which sufficient information has been gathered to distinguish hypothesis m from all false hypotheses j ≠ m. Lemma 8 shows that the second phase, the total amount of time between τ1 and τ2, determines the asymptotic detection time. Thus, E(n2) ∼ − log c / I*(M, K) as c → 0. Lemma 7 shows that the first phase, the total amount of time from the beginning of the search until τ1 occurs, does not affect the asymptotic detection time. Thus, E(τ1)/E(n2) → 0. Note that, differing from [4], where only polynomial decay of P_m(τ1 > n) was shown in order to handle indistinguishable hypotheses under some (but not all) actions when applying the extended randomized Chernoff test, Lemma 7 shows exponential decay of P_m(τ1 > n) under DGF. Lemma 10 shows that the third phase, the time between τ2 and τ, does not affect the asymptotic detection time. Thus, E(n3)/E(n2) → 0. To prove Lemma 10, we define the dynamic range between the sum LLRs of false hypotheses at time t as DR(t) ≜ max_{j≠m} S_j(t) − min_{j≠m} S_j(t). Lemma 9 shows that DR(τ2) is sufficiently small. Following Lemma 9, Lemma 10 shows that the time required to gather information between τ2 and τ does not affect the asymptotic detection time.

The asymptotic performance (13) of the test is intuitively satisfying. For any fixed K, the Bayes risk increases with M, since increasing the number of cells increases the required number of observations before making a decision with sufficient reliability. On the other hand, for any fixed M, the Bayes risk decreases with K, since increasing the number of cells that can be observed at each given time reduces the total search time before making a decision with sufficient reliability.

V. EXTENSION TO MULTIPLE ANOMALOUS PROCESSES
In this section we extend the results reported in previous sections to the case where multiple processes are abnormal. In Section V-A we consider the detection of L abnormal processes, where L is known. In
Section V-B we consider the case where an unknown number ℓ ≥ 1 of abnormal processes are present and only an upper bound ℓ ≤ L is known. Throughout this section, we define M′ as the set of all possible combinations of target locations, with cardinality M′ = |M′| (i.e., a set of M′ hypotheses H_{m′}, indicating that the locations of all targets are given by the (m′)th set in M′), and π_{m′} as the a priori probability that H_{m′} is true. Here, the decision rule declares a set of target locations (i.e., hypothesis H_{m′}), and the error probability under policy Γ is defined as P_e(Γ) = Σ_{m′=1}^{M′} π_{m′} α_{m′}(Γ), where α_{m′}(Γ) = P_{m′}(δ ≠ H_{m′} | Γ) is the probability of declaring δ ≠ H_{m′} when H_{m′} is true.
A. A Case of L Abnormal Processes

Consider the case where L abnormal processes are located among the M cells and L is known. In this case, the detection problem involves M′ = (M choose L) hypotheses. We show below that a variation of the DGF policy, dubbed the DGF(L) policy, is asymptotically optimal under this setting.

The stopping rule and decision rule under the DGF(L) policy are similar to those under the DGF policy:

τ = inf {n : Δ_L S(n) ≥ − log c},   (14)

where Δ_L S(n) ≜ S_{m^{(L)}(n)}(n) − S_{m^{(L+1)}(n)}(n), and

δ = (m^{(1)}(τ), m^{(2)}(τ), ..., m^{(L)}(τ)).   (15)
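The stopping statistic Δ_L S(n) and the declaration in (14)-(15) can be sketched as follows; the sample sum-LLR values and all names are illustrative.

```python
import math

def dgfL_stop_and_declare(sum_llrs, L, c):
    """Stopping and decision of the DGF(L) policy, per (14)-(15): stop when the
    gap between the L-th and (L+1)-th highest sum LLRs exceeds -log(c), and
    declare the top-L cells as the abnormal set."""
    order = sorted(range(len(sum_llrs)), key=lambda m: -sum_llrs[m])
    gap = sum_llrs[order[L - 1]] - sum_llrs[order[L]]   # Delta_L S(n)
    stop = gap >= -math.log(c)
    return stop, (order[:L] if stop else None)

# Illustrative sum LLRs: cells 0 and 2 stand out; gap = 8.2 - 0.4 = 7.8 > -log(1e-3)
stop, declared = dgfL_stop_and_declare([9.0, 0.4, 8.2, -1.1], L=2, c=1e-3)
```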
The selection rule under the DGF(L) policy is more involved and depends on the relative order of K and L (or M − L). Specifically,

φ(n) = { φ_g(n), if D(g||f)/L ≥ D(f||g)/(M − L),
         φ_f(n), if D(g||f)/L < D(f||g)/(M − L),

where φ_g(n) and φ_f(n) select K cells according to the order of their sum LLRs, depending on the relative order of K and L (or K and M − L); in particular, for K ≤ M − L,

φ_f(n) = (m^{(L+1)}(n), m^{(L+2)}(n), ..., m^{(L+K)}(n)).   (18)
It is not difficult to see that when L = 1, the DGF(L) policy degenerates to the DGF policy.

Next, we analyze the performance of the DGF(L) policy. Let

I*(M, K, L) ≜ { I_g*(M, K, L), if D(g||f)/L ≥ D(f||g)/(M − L),
               I_f*(M, K, L), if D(g||f)/L < D(f||g)/(M − L),

where, for K ≤ M − L,

I_f*(M, K, L) ≜ K D(f||g)/(M − L).   (21)
The following theorem shows the asymptotically optimal performance of the DGF(L) policy:

Theorem 3: Let R* and R(Γ) be the Bayes risks under the DGF(L) policy and any other policy Γ, respectively. Then,

R* ∼ −c log c / I*(M, K, L) ∼ inf_Γ R(Γ) as c → 0.   (22)
Proof: See Appendix VIII-C.

Note that in the DGF(L) policy, all L targets are declared simultaneously at the termination time of the detection. A modification to DGF(L) leads to a policy where abnormal processes are declared sequentially during the detection. Consider, for example, K = 1 and D(g||f)/L ≥ D(f||g)/(M − L). It can be shown (with minor modifications to Theorem 3) that an asymptotically optimal policy is to test the cell with the largest sum LLRs and declare the first target once the largest sum LLRs exceeds the threshold − log c. The same procedure is then applied to the remaining M − 1 cells. This repeats until L abnormal processes have been declared, at which point the detection terminates. The asymptotic expected termination time is given by −L log c / D(g||f) with P_e = O(c). Even though the total detection time remains of the same order as under the DGF(L) policy, this modified version may be more appealing from a practical point of view. In particular, actions can be taken to fix each abnormal process the moment it is identified; the total impact to the system by these L abnormal processes can thus be reduced. If D(g||f)/L < D(f||g)/(M − L),

there exists G > 0 such that:

−Gc log c ≥ P_j(δ = m | Γ) ≥ P_j(ΔS_{m,j}(τ) ≤ −(1 − ǫ) log c, δ = m | Γ) ≥ c^{1−ǫ} P_m(ΔS_{m,j}(τ) < −(1 − ǫ) log c, δ = m | Γ).   (37)

As a result,

P_m(ΔS_m(τ) < −(1 − ǫ) log c, δ = m | Γ) ≤ Σ_{j≠m} P_m(ΔS_{m,j}(τ) < −(1 − ǫ) log c, δ = m | Γ) = O(−c^ǫ log c).   (38)

Finally,

P_m(ΔS_m(τ) < −(1 − ǫ) log c | Γ) = O(−c^ǫ log c).   (39)
Lemma 2: Assume that I*(M, K) = D(g||f) + (K − 1)D(f||g)/(M − 1). Define the following function:

d(t) ≜ t [ D(g||f) + ((Kn/t − 1)/(M − 1)) D(f||g) ].   (40)

Then, d(t) is monotonically increasing with t.

Proof: Note that I*(M, K) = D(g||f) + (K − 1)D(f||g)/(M − 1) implies:

D(g||f) + (K − 1)D(f||g)/(M − 1) ≥ K D(f||g)/(M − 1) ⟺ D(g||f) ≥ D(f||g)/(M − 1).

Differentiating d(t) with respect to t yields:

∂d(t)/∂t = D(g||f) − D(f||g)/(M − 1) ≥ 0,

which completes the proof.

Lemma 3: For any fixed ǫ > 0,

P_m( max_{1≤t≤n} ΔS_m(t) ≥ n (I*(M, K) + ǫ) | Γ ) → 0   (41)

as n → ∞, for all m and for any policy Γ.

Proof: It should be noted that polynomial decay of a similar condition under binary composite hypothesis testing was shown in [1, Lemma 5] using a variation of Kolmogorov's inequality. Here, we use a different approach to show exponential decay of (41). Let j*(t) = arg min_{j≠m} N_j(t) denote the cell (other than cell m) that has been observed the smallest number of times up to time t, and let ΔS*_m(t) ≜ S_m(t) − S_{j*(t)}(t). Since ΔS_m(t) ≤ ΔS*_m(t) for all m and t, we have:

P_m( max_{1≤t≤n} ΔS_m(t) ≥ n (I*(M, K) + ǫ) | Γ ) ≤ P_m( max_{1≤t≤n} ΔS*_m(t) ≥ n (I*(M, K) + ǫ) | Γ ).   (42)
Define

W*_m(t) ≜ Σ_{i=1}^{t} ℓ̃_m(i) 1_m(i) − Σ_{i=1}^{t} ℓ̃_{j*(i)}(i) 1_{j*(i)}(i).   (43)

Next, we consider three cases:

Case 1: K = M:

In this case I*(M, K) = D(g||f) + D(f||g). Furthermore, note that N_j(t) = t for all j and t. Thus,

ΔS*_m(t) = W*_m(t) + t (D(g||f) + D(f||g)) ≤ W*_m(t) + n I*(M, K).   (44)

Therefore, ΔS*_m(t) ≥ n (I*(M, K) + ǫ) implies W*_m(t) ≥ nǫ.
Using the i.i.d. property of ℓ_k(t) across the time series and applying the Chernoff bound yield²:

P_m( max_{1≤t≤n} ΔS*_m(t) ≥ n (I*(M, K) + ǫ) | Γ )
  ≤ P_m( max_{1≤t≤n} W*_m(t) ≥ nǫ | Γ )
  ≤ Σ_{t=1}^{n} P_m( Σ_{r=1}^{N_m(t)} ℓ̃_m(r) + Σ_{r=1}^{N_{j*(t)}(t)} (−ℓ̃_{j*(t)}(r)) ≥ nǫ | Γ )
  ≤ Σ_{t=1}^{n} Σ_{N_m(t)=0}^{t} Σ_{N_{j*(t)}(t)=0}^{t} [ E_m e^{s(ℓ̃_m(1)−ǫ/2)} ]^{N_m(t)} × [ E_m e^{s(−ℓ̃_{j*(t)}(1)−ǫ/2)} ]^{N_{j*(t)}(t)} × exp{ −s (ǫ/2) (2n − N_m(t) − N_{j*(t)}(t)) },   (45)

²Note that when K = M, then N_m(t) = N_j(t) = t for all t. However, we use the loosened bound in (45), since it applies to Cases 2 and 3 below.
for all s > 0. Clearly, a moment generating function (MGF) is equal to one at s = 0. Furthermore, since E_m(ℓ̃_m(1) − ǫ/2) = −ǫ/2 < 0 and E_m(−ℓ̃_{j*(t)}(1) − ǫ/2) = −ǫ/2 < 0 are strictly negative, differentiating the MGFs of ℓ̃_m(1) − ǫ/2 and −ℓ̃_{j*(t)}(1) − ǫ/2 with respect to s yields strictly negative derivatives at s = 0. Hence, there exist s′ > 0 and γ > 0 such that E_m[e^{s′(ℓ̃_m(1)−ǫ/2)}], E_m[e^{s′(−ℓ̃_{j*(t)}(1)−ǫ/2)}] and e^{−s′ǫ/2} are strictly less than e^{−γ} < 1. Since 2n − N_m(t) − N_{j*(t)}(t) ≥ 0, there exist C > 0 and γ > 0 such that summing over t, N_m(t), N_{j*(t)}(t) yields:

P_m( max_{1≤t≤n} ΔS*_m(t) ≥ n (I*(M, K) + ǫ) | Γ ) ≤ C e^{−γn} → 0 as n → ∞.   (46)
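To see why such s′ and γ exist, consider the illustrative Gaussian assumption f = N(0,1), g = N(1,1): taking ℓ̃_m to be the mean-centered LLR (consistent with (44)), ℓ̃_m(1) is standard normal under P_m, and the MGF factor in (45) has the closed form exp(s²/2 − sǫ/2), which dips below one exactly for 0 < s < ǫ.

```python
import math

def mgf_bound(s, eps):
    """E_m[ exp( s * (tilde_ell - eps/2) ) ] for tilde_ell ~ N(0, 1).

    Closed form exp(s^2/2 - s*eps/2), strictly less than 1 iff 0 < s < eps.
    The standard-normal tilde_ell is an illustrative Gaussian assumption."""
    return math.exp(s * s / 2.0 - s * eps / 2.0)

eps = 0.5
values = [mgf_bound(s, eps) for s in (0.0, 0.1, 0.25, 0.4)]
# values[0] == 1.0 (every MGF equals one at s = 0); the rest lie strictly below 1
```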
Case 2: K < M and I*(M, K) = K D(f||g)/(M − 1):

Note that:

ΔS*_m(t) = W*_m(t) + N_m(t) D(g||f) + N_{j*(t)}(t) D(f||g) ≤ W*_m(t) + D(f||g) [ N_m(t)/(M − 1) + N_{j*(t)}(t) ].   (47)

The last inequality holds since K D(f||g)/(M − 1) ≥ D(g||f) + (K − 1)D(f||g)/(M − 1) implies D(g||f) ≤ D(f||g)/(M − 1). Note that

N_{j*(t)}(t) ≤ (Kt − N_m(t))/(M − 1) ≤ (Kn − N_m(t))/(M − 1).

Hence,

ΔS*_m(t) ≤ W*_m(t) + D(f||g) Kn/(M − 1) = W*_m(t) + n I*(M, K).   (48)

Therefore, ΔS*_m(t) ≥ n (I*(M, K) + ǫ) implies W*_m(t) ≥ nǫ.
The rest of the proof is similar to Case 1.

Case 3: K < M and I*(M, K) = D(g||f) + (K − 1)D(f||g)/(M − 1):

Note that:

ΔS*_m(t) = W*_m(t) + N_m(t) D(g||f) + N_{j*(t)}(t) D(f||g)
         ≤ W*_m(t) + N_m(t) D(g||f) + ((Kn − N_m(t))/(M − 1)) D(f||g)
         = W*_m(t) + N_m(t) [ D(g||f) + ((Kn/N_m(t) − 1)/(M − 1)) D(f||g) ].   (49)

Since 0 ≤ N_m(t) ≤ n, by Lemma 2 we have:

ΔS*_m(t) ≤ W*_m(t) + n [ D(g||f) + ((K − 1)/(M − 1)) D(f||g) ] = W*_m(t) + n I*(M, K).   (50)

The rest of the proof is similar to Case 1. Hence, (41) follows.

The following theorem provides the lower bound on the Bayes risk (2):
Theorem 5: Let R(Γ) = O(−c log c) be the Bayes risk under policy Γ. Then,

R(Γ) ≥ −(1 + o(1)) c log(c) / I*(M, K)   (51)

for any policy Γ.

Proof: To show the lower bound on the Bayes risk, we use a similar argument as in [1]. For any ǫ > 0, let n_c = −(1 − ǫ) log c / (I*(M, K) + ǫ). Note that

P_m(τ ≤ n_c | Γ) = P_m(τ ≤ n_c, ΔS_m(τ) ≥ −(1 − ǫ) log c | Γ) + P_m(τ ≤ n_c, ΔS_m(τ) < −(1 − ǫ) log c | Γ)
                ≤ P_m( max_{t≤n_c} ΔS_m(t) ≥ −(1 − ǫ) log c | Γ ) + P_m(ΔS_m(τ) < −(1 − ǫ) log c | Γ).   (52)

The first term in the last inequality approaches zero as c → 0 by Lemma 3. The second term in the last inequality approaches zero as c → 0 by Lemma 1, due to the condition R(Γ) = O(−c log c). As a result, the expected sample size under policy Γ satisfies E_m(τ|Γ) ≥ −(1 + o(1)) log(c)/I*(M, K). Hence, R(Γ) ≥ −(1 + o(1)) c log(c)/I*(M, K).
2) Asymptotic Optimality of the DGF Policy: In this section we show that the DGF policy achieves the lower bound on the Bayes risk (51) as $c \to 0$. We mainly focus on the more interesting case where $K \ge 2$ cells are observed at a time. The case where $K = 1$ is simpler and follows with minor modifications. The proof follows the structure discussed in Section IV-B. Cumbersome computations omitted here can be found in [24].
Lemma 4: Assume that the DGF policy is implemented. Then, the error probability is upper bounded by:
$$P_e \le (M-1)c\,. \quad (53)$$
Proof: Let $\alpha_{m,j} = P_m(\delta = j)$ for all $j \ne m$. Thus, $\alpha_m = \sum_{j \ne m} \alpha_{m,j}$. Note that accepting $H_j$ (i.e., $\Delta S_j(n) \ge -\log c$) implies $\Delta S_{j,m} \ge -\log c$. By changing the measure, as in [1, Lemma 3], we can show that for all $j \ne m$ the following holds:
$$\alpha_{m,j} = P_m(\delta = j) = P_m(\Delta S_j(\tau) \ge -\log c) \le P_m(\Delta S_{j,m}(\tau) \ge -\log c) \le cP_j(\Delta S_{j,m}(\tau) \ge -\log c) \le c\,. \quad (54)$$
Finally,
$$\alpha_m = \sum_{j \ne m} \alpha_{m,j} \le (M-1)c\,.$$
Hence, (53) follows.
Lemma 5: Assume that the DGF policy is implemented. Fix $0 < q < 1$. Then, there exist $C > 0$ and $\gamma > 0$ such that
$$P_m(S_j(n) \ge S_m(n),\ N_j(n) \ge qn) \le Ce^{-\gamma n}\,, \quad (55)$$
for $m = 1, 2, \ldots, M$ and $j \ne m$.
Proof: Note that $E_m(\ell_m(t)) = D(g\|f) > 0$ and $E_m(\ell_j(t)) = -D(f\|g) < 0$. Therefore, the exponential decay in (55) can be shown by applying the Chernoff bound when $N_j(n) \ge qn$ for every fixed $0 < q < 1$. For a detailed computation, see [24].
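The mechanism behind Lemma 5 is that, under $P_m$, the sum LLR of a normal cell is a random walk with negative drift $-D(f\|g)$, so threshold crossings are large-deviation events. A minimal Monte Carlo sketch of this decay, assuming for illustration exponential densities $f$, $g$ with the rates of the simulation section (the function names are ours):

```python
import math
import random

random.seed(0)
LAM_F, LAM_G = 0.5, 10.0  # illustrative exponential rates for f (normal) and g (anomalous)

def llr(y):
    """One-step log-likelihood ratio log(g(y)/f(y)) for exponential f, g."""
    return math.log(LAM_G / LAM_F) - (LAM_G - LAM_F) * y

def crossing_prob(n, trials=20000):
    """Monte Carlo estimate of P_m(S_j(n) >= 0) for a normal cell j (y ~ f),
    i.e., for a negative-drift walk with mean step -D(f||g)."""
    hits = 0
    for _ in range(trials):
        s = sum(llr(random.expovariate(LAM_F)) for _ in range(n))
        if s >= 0:
            hits += 1
    return hits / trials

for n in (1, 2, 4):
    print(n, crossing_prob(n))  # estimates shrink rapidly as n grows
```

The estimates drop roughly geometrically in $n$, consistent with the Chernoff-bound exponent; the exact rate depends on the choice of $f$ and $g$.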
Lemma 6: Assume that the DGF policy is implemented. Fix $0 < q < 1$. Then, there exists $\gamma > 0$ such that
$$P_m(S_j(n) \ge S_m(n),\ N_m(n) \ge qn) \le e^{-\gamma n}\,, \quad (56)$$
for $m = 1, 2, \ldots, M$ and $j \ne m$.
Proof: The proof is similar to the proof of Lemma 5.
Definition 2: $\tau_1$ is the smallest integer such that $S_m(n) > S_j(n)$ for all $j \ne m$ for all $n \ge \tau_1$.
Lemma 7: Assume that the DGF policy is implemented. Then, there exist $C > 0$ and $\gamma > 0$ such that
$$P_m(\tau_1 > n) \le Ce^{-\gamma n}\,, \quad (57)$$
for $m = 1, 2, \ldots, M$.
Proof: Note that, differing from [4], where only polynomial decay of $P_m(\tau_1 > n)$ was shown in order to handle hypotheses that are indistinguishable under some (but not all) actions when applying the extended randomized Chernoff test, Lemma 7 shows exponential decay of $P_m(\tau_1 > n)$ under DGF. We focus on the case where $M > 2$. The case of $M = 2$ is simpler and follows with minor modifications. Note that:
$$P_m(\tau_1 > n) \le P_m\left(\max_{j \ne m} \sup_{t \ge n}\left(S_j(t) - S_m(t)\right) \ge 0\right) \le \sum_{j \ne m}\sum_{t=n}^{\infty} P_m(S_j(t) \ge S_m(t))\,. \quad (58)$$
Therefore, it suffices to show that there exist $C > 0$ and $\gamma > 0$ such that $P_m(S_j(n) \ge S_m(n)) \le Ce^{-\gamma n}$. Fix $0 < \rho < 1$. Thus,
$$\begin{aligned} P_m(S_j(n) \ge S_m(n)) \le\ & P_m(S_j(n) \ge S_m(n),\ N_j(n) < \rho n,\ N_m(n) < \rho n)\\ &+ P_m(S_j(n) \ge S_m(n),\ N_j(n) \ge \rho n)\\ &+ P_m(S_j(n) \ge S_m(n),\ N_m(n) \ge \rho n)\,. \end{aligned} \quad (59)$$
By Lemmas 5 and 6, there exist $\gamma_1 > 0$ and $C_1 > 0$ such that the second and third terms on the RHS are upper bounded by $C_1 e^{-\gamma_1 n}$. In the case of $K = M$, the first term on the RHS equals zero (since $N_j(n) = N_m(n) = n$ surely). Hence, it remains to show that the first term on the RHS decreases exponentially with $n$ for $K < M$. Note that the event $(N_j(n) < \rho n,\ N_m(n) < \rho n)$ implies that cells $j, m$ are not observed at least $\tilde{n} = n - N_j(n) - N_m(n) \ge n(1-2\rho)$ times. Let $\tilde{N}_r(n)$ be the number of times when cell $r \ne j, m$ has been observed and cells $j, m$ have not been observed up to time $n$. We refer to each such time as an $r_{\ne j,m}$-probing time. There exists a cell $r \ne j, m$ such that $\tilde{N}_r(n) \ge \frac{\tilde{n}}{M-2} = \frac{n(1-2\rho)}{M-2}$. Hence, (59) can be upper bounded by:
$$P_m(S_j(n) \ge S_m(n)) \le \sum_{r \ne j,m} P_m\left(\tilde{N}_r(n) > \frac{n(1-2\rho)}{M-2},\ N_j(n) < \rho n,\ N_m(n) < \rho n\right) + 2C_1 e^{-\gamma_1 n}\,. \quad (60)$$
It remains to show that each term in the summation on the RHS of (60) decreases exponentially with $n$. Note that the event $\tilde{N}_r(n) > \frac{n(1-2\rho)}{M-2}$ implies that at every $r_{\ne j,m}$-probing time, $S_j(n) \le S_r(n)$ or $S_m(n) \le S_r(n)$ must occur (otherwise, if $S_j(n) > S_r(n)$ and $S_m(n) > S_r(n)$, then $j$ or $m$ is observed). Note also that $E_m(\ell_r(t)) = -D(f\|g) < 0$ (thus, $S_r(n)$ decreases with high probability when cell $r$ is sampled a sufficient number of times). Since $0 < q < 1$ can be arbitrarily small and $\tilde{N}_r(n)$ can be arbitrarily close to $\frac{n}{M-2}$, exponential decay of each term in the summation on the RHS of (60) follows by applying the Chernoff bound as in (45). Specifically, a detailed computation can be found in [24] with $\rho = \frac{1}{4M-6}$.
For the next lemmas we define
$$D'(f\|g) \triangleq (K-1)D(f\|g)/(M-1)\,. \quad (61)$$
Definition 3: $\tau_2$ is defined as follows:
1) If $K = M$, $\tau_2$ denotes the smallest integer such that $\sum_{i=\tau_1+1}^{n} \ell_m(i)\mathbf{1}_m(i) \ge -\frac{D(g\|f)}{I^*(M,K)}\log c$ and $\sum_{i=\tau_1+1}^{n} \ell_{j_n}(i)\mathbf{1}_{j_n}(i) \le \frac{D(f\|g)}{I^*(M,K)}\log c$ for some $j_n \ne m$, for all $n \ge \tau_2$.
2) If $K < M$ and $I^*(M,K) = KD(f\|g)/(M-1)$, $\tau_2$ denotes the smallest integer such that $\sum_{i=\tau_1+1}^{n} \ell_{j_n}(i)\mathbf{1}_{j_n}(i) \le \log c$ for some $j_n \ne m$, for all $n \ge \tau_2$.
3) If $K < M$ and $I^*(M,K) = D(g\|f) + (K-1)D(f\|g)/(M-1)$, $\tau_2$ denotes the smallest integer such that $\sum_{i=\tau_1+1}^{n} \ell_m(i)\mathbf{1}_m(i) \ge -\frac{D(g\|f)}{I^*(M,K)}\log c$ and $\sum_{i=\tau_1+1}^{n} \ell_{j_n}(i)\mathbf{1}_{j_n}(i) \le \frac{D'(f\|g)}{I^*(M,K)}\log c$ for some $j_n \ne m$, for all $n \ge \tau_2$.
Definition 4: $n_2 \triangleq \tau_2 - \tau_1$ denotes the total amount of time between $\tau_1$ and $\tau_2$.
Lemma 8: Assume that the DGF policy is implemented. Then, for every fixed $\epsilon > 0$ there exist $C > 0$ and $\gamma > 0$ such that
$$P_m(n_2 > n) \le Ce^{-\gamma n} \quad \forall n > -(1+\epsilon)\log c/I^*(M,K)\,, \quad (62)$$
for all $m = 1, 2, \ldots, M$.
Proof: We prove the lemma for three cases.
Case 1: $K = M$:
In this case $I^*(M,K) = D(g\|f) + D(f\|g)$ and $\mathbf{1}_k(t) = 1$ for all $k, t$. Let $\tau_2^m$ and $\tau_2^j$ for $j \ne m$ be the smallest integers such that $\sum_{i=\tau_1+1}^{n} \ell_m(i) \ge -\frac{D(g\|f)}{I^*(M,K)}\log c$ for all $n > \tau_2^m$ and $\sum_{i=\tau_1+1}^{n} \ell_j(i) \le \frac{D(f\|g)}{I^*(M,K)}\log c$ for all $n > \tau_2^j$ for $j \ne m$, respectively. Similarly, $n_2^k$ denotes the total amount of time between $\tau_1$ and $\tau_2^k$. Clearly, $n_2 \le \max_k(n_2^k)$. As a result,
$$P_m(n_2 > n) \le \sum_{k=1}^{M} P_m\left(n_2^k > n\right)\,. \quad (63)$$
Thus, it remains to show that $P_m\left(n_2^k > n\right)$ decreases exponentially with $n$. Next, we prove the lemma for cell $m$. The proof for cell $j \ne m$ follows with minor modifications. Let $\epsilon_1 = D(g\|f)\epsilon/(1+\epsilon) > 0$. Thus,
$$\begin{aligned} \sum_{i=\tau_1+1}^{t+\tau_1} \ell_m(i) + \frac{D(g\|f)}{I^*(M,K)}\log c &= \sum_{i=\tau_1+1}^{t+\tau_1} \tilde{\ell}_m(i) + tD(g\|f) + \frac{D(g\|f)}{I^*(M,K)}\log c\\ &\ge \sum_{i=\tau_1+1}^{t+\tau_1} \tilde{\ell}_m(i) + t\epsilon_1\,, \end{aligned} \quad (64)$$
for all $t \ge n > -(1+\epsilon)\log c/I^*(M,K)$. As a result,
$$\sum_{i=\tau_1+1}^{t+\tau_1} \ell_m(i) \le -\frac{D(g\|f)}{I^*(M,K)}\log c \quad (65)$$
implies
$$\sum_{i=\tau_1+1}^{t+\tau_1} \tilde{\ell}_m(i) \le -t\epsilon_1\,. \quad (66)$$
Hence, for any $\epsilon > 0$ there exists $\epsilon_1 > 0$ such that
$$\begin{aligned} P_m(n_2^m > n) &\le P_m\left(\inf_{t \ge n} \sum_{i=\tau_1+1}^{t+\tau_1} \ell_m(i) \le -\frac{D(g\|f)}{I^*(M,K)}\log c\right)\\ &\le \sum_{t=n}^{\infty} P_m\left(\sum_{i=\tau_1+1}^{t+\tau_1} \ell_m(i) \le -\frac{D(g\|f)}{I^*(M,K)}\log c\right)\\ &\le \sum_{t=n}^{\infty} P_m\left(\sum_{i=\tau_1+1}^{t+\tau_1} -\tilde{\ell}_m(i) \ge t\epsilon_1\right)\,, \end{aligned} \quad (67)$$
for all $t \ge n > -(1+\epsilon)\log c/I^*(M,K)$. By applying the Chernoff bound, it can be shown that there exists $\gamma_1 > 0$ such that $P_m\left(\sum_{i=\tau_1+1}^{t+\tau_1} -\tilde{\ell}_m(i) \ge t\epsilon_1\right) \le e^{-\gamma_1 t}$ for all $t \ge n > -(1+\epsilon)\log c/I^*(M,K)$. Hence, there exist $C_1 > 0$ and $\gamma_1 > 0$ such that $P_m(n_2^m > n) \le C_1 e^{-\gamma_1 n}$ for all $n > -(1+\epsilon)\log c/I^*(M,K)$.
Case 2: $K < M$ and $I^*(M,K) = \frac{KD(f\|g)}{M-1}$:
In this case, the cell with the highest sum LLR is not observed at any time $n$. As a result, cell $m$ is not observed for all $n \ge \tau_1$, since $S_m(n) > S_j(n)$ for all $j \ne m$ for all $n \ge \tau_1$. Let $N_j'(\tau_1+t) \triangleq \sum_{i=\tau_1+1}^{\tau_1+t} \mathbf{1}_j(i)$. Let $j^*(\tau_1+t) = \arg\max_{j \ne m} N_j'(\tau_1+t)$ be the index of the cell that has been observed the largest number of times between the occurrence of $\tau_1$ and time $\tau_1+t$. Note that if $\sum_{i=\tau_1+1}^{\tau_1+t} \ell_{j^*(\tau_1+t)}(i)\mathbf{1}_{j^*(\tau_1+t)}(i) \le \log c$ for all $t \ge n$, then $n_2 \le n$. Hence,
$$P_m(n_2 > n) \le P_m\left(\sup_{t \ge n} \sum_{i=\tau_1+1}^{\tau_1+t} \ell_{j^*(\tau_1+t)}(i)\mathbf{1}_{j^*(\tau_1+t)}(i) \ge \log c\right)\,. \quad (68)$$
Note that $N'_{j^*(\tau_1+t)}(\tau_1+t) \ge \frac{Kt}{M-1}$ (since $Kt$ observations are taken from $M-1$ cells). Thus, for any $\epsilon > 0$ there exists $\epsilon_1 > 0$ such that:
$$\begin{aligned} \sum_{i=\tau_1+1}^{\tau_1+t} &\ell_{j^*(\tau_1+t)}(i)\mathbf{1}_{j^*(\tau_1+t)}(i) - \log c\\ &= \sum_{i=\tau_1+1}^{\tau_1+t} \tilde{\ell}_{j^*(\tau_1+t)}(i)\mathbf{1}_{j^*(\tau_1+t)}(i) - N'_{j^*(\tau_1+t)}(\tau_1+t)D(f\|g) - \log c\\ &\le \sum_{i=\tau_1+1}^{\tau_1+t} \tilde{\ell}_{j^*(\tau_1+t)}(i)\mathbf{1}_{j^*(\tau_1+t)}(i) - \frac{tKD(f\|g)}{M-1}\left[1 - \frac{-(M-1)\log c}{tKD(f\|g)}\right]\\ &\le \sum_{i=\tau_1+1}^{\tau_1+t} \tilde{\ell}_{j^*(\tau_1+t)}(i)\mathbf{1}_{j^*(\tau_1+t)}(i) - t\epsilon_1\,, \end{aligned} \quad (69)$$
for all $t \ge n > -(1+\epsilon)\log c/I^*(M,K) = -(1+\epsilon)(M-1)\log c/(KD(f\|g))$. The rest of the proof follows by applying the Chernoff bound.
Case 3: $K < M$ and $I^*(M,K) = D(g\|f) + \frac{(K-1)D(f\|g)}{M-1}$:
In this case, the cell with the highest sum LLR is observed at every time $n$. As a result, cell $m$ is observed for all $n \ge \tau_1$, since $S_m(n) > S_j(n)$ for all $j \ne m$ for all $n \ge \tau_1$. Therefore, a similar argument as in Case 1 applies to cell $m$. Next, we focus on cell $j^*(\tau_1+t) \ne m$ as in Case 2. Note that in this case $N_{j^*(\tau_1+t)}(\tau_1+t) \ge \frac{(K-1)t}{M-1}$ (since $t$ observations are taken from cell $m$ and $(K-1)t$ observations are taken from the $M-1$ cells $j \ne m$). Thus, (69) can be developed for this case as well with minor modifications.
Definition 5: The dynamic range of $S_j(n)$ at time $t$ is defined as follows:
$$DR(t) \triangleq \max_{j \ne m} S_j(t) - \min_{j \ne m} S_j(t)\,. \quad (70)$$
Lemma 9: Assume that the DGF policy is implemented. Then, for every fixed $\epsilon_1 > 0$, $\epsilon_2 > 0$ there exist $C > 0$ and $\gamma > 0$ such that
$$P_m(DR(\tau_2) > \epsilon_1 n) \le Ce^{-\gamma n} \quad \forall n > -(1+\epsilon_2)\log c/I^*(M,K)\,, \quad (71)$$
for all $m = 1, 2, \ldots, M$.
Proof: Note that
$$P_m(DR(\tau_2) > \epsilon_1 n) \le P_m(\tau_2 > n) + P_m(DR(\tau_2) > \epsilon_1 n,\ \tau_2 \le n)\,. \quad (72)$$
Since $\tau_2 = \tau_1 + n_2$, applying Lemmas 7 and 8 implies that the first term on the RHS of (72) decreases exponentially with $n$ for all $n > -(1+\epsilon_2)\log c/I^*(M,K)$ for every fixed $\epsilon_2 > 0$. It remains to show that the second term on the RHS of (72) decreases exponentially with $n$. Let $\bar{j} = \arg\max_{j \ne m} S_j(\tau_2)$ and $\underline{j} = \arg\min_{j \ne m} S_j(\tau_2)$. Let $t_0$ be the smallest integer such that $S_{\underline{j}}(t) \le S_{\bar{j}}(t)$ for all $t_0 < t \le \tau_2$. As a result, $DR(\tau_2) > \epsilon_1 n$ implies
$$\sum_{t=t_0}^{\tau_2} \left(\ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right) > \epsilon_1 n\,.$$
Note that the second term on the RHS of (72) can be rewritten as:
$$P_m(DR(\tau_2) > \epsilon_1 n,\ \tau_2 \le n) = P_m(DR(\tau_2) > \epsilon_1 n,\ \tau_2 \le n,\ t_0 \ge \tau_1) + P_m(DR(\tau_2) > \epsilon_1 n,\ \tau_2 \le n,\ t_0 < \tau_1)\,. \quad (73)$$
Let $\underline{N} = \sum_{t=t_0}^{\tau_2} \mathbf{1}_{\underline{j}}(t)$ and $\bar{N} = \sum_{t=t_0}^{\tau_2} \mathbf{1}_{\bar{j}}(t)$.
First, we upper bound the first term on the RHS of (73). Note that for all $\tau_1 \le t_0 < t \le \tau_2$, if $\mathbf{1}_{\underline{j}}(t) = 1$ then $\mathbf{1}_{\bar{j}}(t) = 1$ (since $S_m(t) > S_j(t)$ for all $t \ge \tau_1$, and the decision maker observes either the $K$ cells with the top $K$ highest sum LLRs or those with the second to the $(K+1)$th highest sum LLRs). Hence, $\underline{N} \le \bar{N}$. Thus,
$$\begin{aligned} \sum_{t=t_0}^{\tau_2} \left(\ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right) &= \sum_{t=t_0}^{\tau_2} \left(\tilde{\ell}_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \tilde{\ell}_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right) - D(f\|g)\left(\bar{N} - \underline{N}\right)\\ &\le \sum_{t=t_0}^{\tau_2} \left(\tilde{\ell}_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \tilde{\ell}_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right)\,. \end{aligned} \quad (74)$$
Similar to (45), applying the Chernoff bound completes the proof for this case.
Next, we upper bound the second term on the RHS of (73). Let $\epsilon_3 \triangleq \frac{\epsilon_1}{2D(f\|g)} > 0$. Note that
$$P_m(DR(\tau_2) > \epsilon_1 n,\ \tau_2 \le n,\ t_0 < \tau_1) \le P_m(\tau_1 > \epsilon_3 n) + P_m(DR(\tau_2) > \epsilon_1 n,\ \tau_2 \le n,\ t_0 < \tau_1,\ \tau_1 \le \epsilon_3 n)\,. \quad (75)$$
The first term on the RHS of (75) decreases exponentially with $n$ by Lemma 7. Thus, it remains to show that the second term on the RHS of (75) decreases exponentially with $n$. Note that $DR(\tau_2) > \epsilon_1 n$ implies
$$\left(\sum_{t=t_0}^{\tau_1} \ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right) + \left(\sum_{t=\tau_1+1}^{\tau_2} \ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right) > \epsilon_1 n\,.$$
Note that for all $t_0 < \tau_1 + 1 \le t \le \tau_2$, if $\mathbf{1}_{\underline{j}}(t) = 1$ then $\mathbf{1}_{\bar{j}}(t) = 1$. Thus, it remains to show that
$$P_m\left(\sum_{t=t_0}^{\tau_1} \ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t) \ge \epsilon_1 n,\ \tau_2 \le n,\ \tau_1 \le \epsilon_3 n\right)$$
decreases exponentially with $n$. Note that
$$\begin{aligned} \sum_{t=t_0}^{\tau_1} \ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t) &\le \sum_{t=t_0}^{\tau_1} \left[\tilde{\ell}_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \tilde{\ell}_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right] + D(f\|g)\tau_1\\ &\le \sum_{t=t_0}^{\tau_1} \left[\tilde{\ell}_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \tilde{\ell}_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right] + \frac{\epsilon_1}{2}n \end{aligned} \quad (76)$$
for all $\tau_1 \le \epsilon_3 n$. As a result,
$$\sum_{t=t_0}^{\tau_1} \ell_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \ell_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t) > \epsilon_1 n \quad (77)$$
implies
$$\sum_{t=t_0}^{\tau_1} \left[\tilde{\ell}_{\bar{j}}(t)\mathbf{1}_{\bar{j}}(t) - \tilde{\ell}_{\underline{j}}(t)\mathbf{1}_{\underline{j}}(t)\right] > \frac{\epsilon_1}{2}n \quad (78)$$
for all $\tau_1 \le \epsilon_3 n$. Similar to (45), applying the Chernoff bound completes the proof.
Definition 6: The dynamic range $DR_j(t)$ for $j \ne m$ at time $t$ is defined as follows:
$$DR_j(t) \triangleq \max_{k \ne m} S_k(t) - S_j(t)\,. \quad (79)$$
Let
$$\eta \triangleq \begin{cases} \frac{D(f\|g)}{I^*(M,K)}\log c\,, & \text{if } K = M\,,\\[2pt] \log c\,, & \text{if } K < M \text{ and } I^*(M,K) = \frac{KD(f\|g)}{M-1}\,,\\[2pt] \frac{D'(f\|g)}{I^*(M,K)}\log c\,, & \text{if } K < M \text{ and } I^*(M,K) = D(g\|f) + \frac{(K-1)D(f\|g)}{M-1}\,, \end{cases} \quad (80)$$
where $D'(f\|g)$ is defined in (61).
Definition 7: $\tau_3^j$ denotes the smallest integer such that $\sum_{i=\tau_1+1}^{n} \ell_j(i)\mathbf{1}_j(i) \le \eta + DR_j(\tau_1)$ for $j \ne m$, for all $n \ge \tau_3^j \ge \tau_2$.
Remark 1: Note that the total detection time is upper bounded by $\tau \le \max_{j \ne m} \tau_3^j$.
Definition 8: $n_3 \triangleq \tau - \tau_2$ denotes the total amount of time from $\tau_2$ until the search is terminated.
Lemma 10: Assume that the DGF policy is implemented. Then, for every fixed $\epsilon > 0$ there exist $C > 0$ and $\gamma > 0$ such that
$$P_m(n_3 > n) \le Ce^{-\gamma n} \quad \forall n > -\epsilon\log c/I^*(M,K)\,, \quad (81)$$
for all $m = 1, 2, \ldots, M$.
Proof: Let $N_3^j$ for $j \ne m$ denote the total number of observations taken from cell $j$ between $\tau_2$ and $\tau_3^j$. Note that $n_3 \le \sum_{j \ne m} N_3^j$. Thus, it suffices to show that $P_m\left(N_3^j > n\right)$ decreases exponentially with $n$. Note that
$$P_m\left(N_3^j > n\right) \le P_m\left(DR(\tau_2) > n\tfrac{D(f\|g)}{2}\right) + P_m\left(N_3^j > n \,\Big|\, DR(\tau_2) \le n\tfrac{D(f\|g)}{2}\right)\,. \quad (82)$$
By Lemma 9, the first term on the RHS of (82) decreases exponentially with $n$ for all $n > -\epsilon\log c/I^*(M,K)$. Thus, it remains to show that the second term on the RHS of (82) decreases exponentially with $n$. Let $t_1, t_2, \ldots$ denote the time indices at which cell $j$ is observed between $\tau_2$ and $\tau_3^j$. Since $\tau_2$ has occurred and $DR(\tau_2) \le n\frac{D(f\|g)}{2}$, $\tau_3^j$ occurs once $\sum_{i=1}^{r} -\ell_j(t_i) \ge n\frac{D(f\|g)}{2}$ holds for all $r \ge N_3^j$. As a result,
$$P_m\left(N_3^j > n \,\Big|\, DR(\tau_2) \le n\tfrac{D(f\|g)}{2}\right) \le P_m\left(\inf_{r > n} \sum_{i=1}^{r} -\ell_j(t_i) < n\tfrac{D(f\|g)}{2}\right) \le \sum_{r=n}^{\infty} P_m\left(\sum_{i=1}^{r} \tilde{\ell}_j(t_i) > r\tfrac{D(f\|g)}{2}\right)\,. \quad (83)$$
Thus, it suffices to show that there exists $\gamma > 0$ such that $P_m\left(\sum_{i=1}^{n} \tilde{\ell}_j(t_i) > n\tfrac{D(f\|g)}{2}\right) \le e^{-\gamma n}$. Applying the Chernoff bound and using the i.i.d. property of $\tilde{\ell}_j(t_i)$ completes the proof.
Lemma 11: Assume that the DGF policy is implemented. Then, the expected search time is upper bounded by:
$$E_m(\tau) \le -(1+o(1))\frac{\log(c)}{I^*(M,K)}\,, \quad (84)$$
for $m = 1, \ldots, M$.
Proof: Note that $\tau \le \tau_1 + n_2 + n_3$. Thus, combining Lemmas 7, 8 and 10 completes the proof.
Combining Lemmas 4, 11 and Theorem 5 yields the asymptotic optimality property of the DGF policy, presented in Theorem 2.
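To connect the lemmas to the policy itself, here is a self-contained, simplified sketch of a DGF-style search (our illustration, not the paper's implementation): cells are ranked by sum LLR, either the top $K$ cells or the cells ranked second through $(K+1)$th are probed depending on which candidate rate attains $I^*(M,K)$, and the search stops once the gap between the two largest sum LLRs reaches $-\log c$. The exponential choice of $f$, $g$ and all function names are assumptions.

```python
import math
import random

random.seed(1)
LAM_F, LAM_G = 0.5, 10.0                              # illustrative observation densities
D_FG = math.log(LAM_F / LAM_G) + LAM_G / LAM_F - 1.0  # D(f||g)
D_GF = math.log(LAM_G / LAM_F) + LAM_F / LAM_G - 1.0  # D(g||f)

def ranked(S):
    """Cell indices sorted by decreasing sum LLR, ties broken at random."""
    return sorted(range(len(S)), key=lambda j: (-S[j], random.random()))

def dgf_search(M, K, c, target):
    """One run of a DGF-style search; returns (sample size, declared cell)."""
    top_rule = D_GF >= D_FG / (M - 1)  # probe top-K cells, else cells ranked 2..K+1
    S = [0.0] * M                      # sum LLRs
    n = 0
    while True:
        n += 1
        order = ranked(S)
        probe = order[:K] if (top_rule or K == M) else order[1:K + 1]
        for j in probe:
            y = random.expovariate(LAM_G if j == target else LAM_F)
            S[j] += math.log(LAM_G / LAM_F) - (LAM_G - LAM_F) * y
        order = ranked(S)
        if S[order[0]] - S[order[1]] >= -math.log(c):  # stopping rule
            return n, order[0]

results = [(t, dgf_search(M=5, K=2, c=1e-2, target=t))
           for t in (random.randrange(5) for _ in range(200))]
avg_delay = sum(n for _, (n, _) in results) / len(results)
errors = sum(d != t for t, (_, d) in results)
print("average delay:", avg_delay, "errors out of 200:", errors)
```

In such runs the empirical error count stays well below an $(M-1)c$-style level, in the spirit of Lemma 4, while Lemma 11 says the average delay should scale like $-\log c/I^*(M,K)$ as $c \to 0$.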
B. Proof of Theorem 1
Following the same argument as in [4], it suffices to show that $P_m(\tau_1 > n)$ decreases polynomially with $n$ to prove the theorem. Since $D(g\|f) > 0$, the KL divergence between the true hypothesis $m$ and any false hypothesis $j \ne m$ is strictly positive under any observed cell. In the case where $D(g\|f) \ge D(f\|g)/(M-1)$, the Chernoff test selects $m^{(1)}(n)$ for all $n$. As a result, exponential decay of $P_m(\tau_1 > n)$ follows directly from Lemma 7. In the case where $D(g\|f) < D(f\|g)/(M-1)$, the Chernoff test selects $m^{(j)}(n)$ for $j \ne 1$ randomly for all $n$. As a result, polynomial decay of $P_m(\tau_1 > n)$ follows by a similar argument as in [4] for the extended Chernoff test.

C. Proof of Theorem 3
The proof follows a similar line of arguments as in the proof of Theorem 2. Hence, we provide here only a sketch of the proof. First, similar to Lemma 4, it can be verified that declaring the target locations once $S_{m^{(L)}} - S_{m^{(L+1)}} \ge -\log c$ occurs achieves an error probability $O(c)$. Second, similar to Lemma 11, it can be verified that the detection time approaches $-\log c/I^*(M,K)$. For example, if $\frac{D(g\|f)}{L} \ge \frac{D(f\|g)}{M-L}$ and $K \ge L$, then all the $L$ targets and a fraction $r = \frac{K-L}{M-L}$ of the false hypotheses are observed at each given time in the asymptotic regime. Therefore, the detection time approaches $\frac{-\log c}{D(g\|f) + rD(f\|g)}$. Similar arguments apply to the rest of the cases.

D. Proof of Theorem 4
The proof follows a similar line of arguments as in the proofs of Theorems 2 and 3. Hence, we provide only a sketch of the proof. With minor modifications to Theorem 3, it can be verified that (29) is the asymptotic lower bound on the Bayes risk when the number of targets $\ell$ is known, $K = 1$, and (23) holds. Similar to Lemma 4, it can be verified that declaring a target once $S_m(n) \ge -\log c$ occurs achieves an error probability $O(c)$. Following a similar argument as in Lemma 7, it can be verified that the $\ell$ targets
are tested before testing the $M - \ell$ normal processes with high probability in the asymptotic regime. Since the decision maker declares the target locations once $S_m(n) \ge -\log c$ for any $m$, the Bayes risk approaches (29) as $c \to 0$.

REFERENCES
[1] H. Chernoff, "Sequential design of experiments," The Annals of Mathematical Statistics, vol. 30, no. 3, pp. 755–770, 1959.
[2] S. Bessler, "Theory and applications of the sequential design of experiments, k-actions and infinitely many experiments: Part I–Theory," Tech. Rep., Applied Mathematics and Statistics Laboratories, Stanford University, no. 55, 1960.
[3] S. Nitinawarat, G. K. Atia, and V. V. Veeravalli, "Controlled sensing for hypothesis testing," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5277–5280, 2012.
[4] S. Nitinawarat, G. K. Atia, and V. V. Veeravalli, "Controlled sensing for multihypothesis testing," IEEE Transactions on Automatic Control, vol. 58, no. 10, pp. 2451–2464, 2013.
[5] M. Naghshvar and T. Javidi, "Active sequential hypothesis testing," Annals of Statistics, vol. 41, no. 6, 2013.
[6] M. Naghshvar and T. Javidi, "Sequentiality and adaptivity gains in active hypothesis testing," IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp. 768–782, 2013.
[7] K. S. Zigangirov, "On a problem in optimal scanning," Theory of Probability and Its Applications, vol. 11, no. 2, pp. 294–298, 1966.
[8] E. Klimko and J. Yackel, "Optimal search strategies for Wiener processes," Stochastic Processes and their Applications, vol. 3, no. 1, pp. 19–33, 1975.
[9] V. Dragalin, "A simple and effective scanning rule for a multi-channel system," Metrika, vol. 43, no. 1, pp. 165–182, 1996.
[10] L. D. Stone and J. A. Stanshine, "Optimal search using uninterrupted contact investigation," SIAM Journal on Applied Mathematics, vol. 20, no. 2, pp. 241–263, 1971.
[11] K. P. Tognetti, "An optimal strategy for a whereabouts search," Operations Research, vol. 16, no. 1, pp. 209–211, 1968.
[12] J. B. Kadane, "Optimal whereabouts search," Operations Research, vol. 19, no. 4, pp. 894–904, 1971.
[13] Y. Zhai and Q. Zhao, "Dynamic search under false alarms," in Proc. IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013.
[14] D. A. Castanon, "Optimal search strategies in dynamic hypothesis testing," IEEE Transactions on Systems, Man and Cybernetics, vol. 25, no. 7, pp. 1130–1138, 1995.
[15] Y. Li, S. Nitinawarat, and V. V. Veeravalli, "Universal outlier hypothesis testing," available at http://arxiv.org/abs/1302.4776, 2013.
[16] Q. Zhao and J. Ye, "Quickest detection in multiple on–off processes," IEEE Transactions on Signal Processing, vol. 58, no. 12, pp. 5994–6006, 2010.
[17] H. Li, "Restless watchdog: Selective quickest spectrum sensing in multichannel cognitive radio systems," EURASIP Journal on Advances in Signal Processing, vol. 2009, 2009.
[18] R. Caromi, Y. Xin, and L. Lai, "Fast multiband spectrum scanning for cognitive radio systems," IEEE Transactions on Communications, vol. 61, no. 1, pp. 63–75, 2013.
[19] K. Cohen, Q. Zhao, and A. Swami, "Optimal index policies for quickest localization of anomaly in cyber networks," in Proc. IEEE Global Conference on Signal and Information Processing (GlobalSIP), Dec. 2013.
[20] K. Cohen, Q. Zhao, and A. Swami, "Optimal index policies for anomaly localization in resource-constrained cyber systems," submitted to IEEE Transactions on Signal Processing, 2013.
[21] L. Lai, H. V. Poor, Y. Xin, and G. Georgiadis, "Quickest search over multiple sequences," IEEE Transactions on Information Theory, vol. 57, no. 8, pp. 5375–5386, 2011.
[22] M. L. Malloy, G. Tang, and R. D. Nowak, "Quickest search for a rare distribution," in IEEE Annual Conference on Information Sciences and Systems, pp. 1–6, 2012.
[23] A. Tajer and H. V. Poor, "Quick search for rare events," IEEE Transactions on Information Theory, vol. 59, no. 7, pp. 4462–4481, 2013.
[24] K. Cohen and Q. Zhao, "Active hypothesis testing for quickest anomaly detection," Technical Report, available at http://www.ece.ucdavis.edu/~qzhao/Report.html, 2013.
[Figures omitted; captions reproduced below.]
Fig. 1. Algorithms' performance for $M = 5$, $K = 1$, $\lambda_f = 0.5$, $\lambda_g = 10$. (a) Average sample sizes (average delay) achieved by the algorithms (proposed policy, Chernoff test) and the asymptotic lower bound as a function of the cost per observation, $-\log(c)$. (b) The loss $L_{DGF}$, $L_{Ch}$ in terms of Bayes risk under the DGF policy and the Chernoff test as compared to the asymptotic lower bound; $L_{DGF}$, $L_{Ch}$ approach $0$ as $c \to 0$.
Fig. 2. Algorithms' performance for $M = 5$, $K = 2$, $\lambda_f = 2$, $\lambda_g = 10$, with the same subfigures (a), (b) as in Fig. 1.
Fig. 3. Algorithms' performance for $M = 5$, $K = 2$, $\lambda_f = 0.5$, $\lambda_g = 10$, with the same subfigures (a), (b) as in Fig. 1.