IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. , NO. , 2018 (ACCEPTED)
arXiv:1807.08980v1 [math.ST] 24 Jul 2018

Asymptotic Optimality of Mixture Rules for Detecting Changes in General Stochastic Models
Alexander G. Tartakovsky, Senior Member, IEEE
Abstract—The paper addresses a sequential changepoint detection problem for a general stochastic model, assuming that the observed data may be non-i.i.d. (i.e., dependent and non-identically distributed) and the prior distribution of the change point is arbitrary. Tartakovsky and Veeravalli (2005), Baron and Tartakovsky (2006), and, more recently, Tartakovsky (2017) developed a general asymptotic theory of changepoint detection for non-i.i.d. stochastic models, assuming a certain stability of the log-likelihood ratio process, in the case of simple hypotheses when both pre-change and post-change models are completely specified. However, in most applications the post-change distribution is not completely known. In the present paper, we generalize previous results to the case of parametric uncertainty, assuming the parameter of the post-change distribution is unknown. We introduce two detection rules based on mixtures – the Mixture Shiryaev rule and the Mixture Shiryaev–Roberts rule – and study their asymptotic properties in the Bayesian context. In particular, we provide sufficient conditions under which these rules are first-order asymptotically optimal, minimizing moments of the delay to detection as the probability of false alarm approaches zero.

Index Terms—Asymptotic Optimality; Changepoint Problems; Expected Detection Delay; General Stochastic Models; Hidden Markov Models; Moments of the Delay to Detection; r-Complete Convergence.
I. INTRODUCTION

Suppose X1, X2, ... are random variables observed sequentially, which may change statistical properties at an unknown point in time ν ∈ {0, 1, 2, ...}, so that X1, ..., Xν are generated by one stochastic model and Xν+1, Xν+2, ... by another model. The value of the change point ν is unknown, and the fact of change must be detected as soon as possible while controlling a risk associated with false detections.

More specifically, let X^n = (X1, ..., Xn) denote a sample of size n and let {fθ,n(Xn|X^{n−1})}n≥1 be a sequence of conditional densities of Xn given X^{n−1}. If ν = ∞, i.e., there is no change, then the parameter θ is equal to θ0, so that fθ,n(Xn|X^{n−1}) = fθ0,n(Xn|X^{n−1}) for all n ≥ 1. If ν = k < ∞, then θ = θ1 ≠ θ0, so that fθ,n(Xn|X^{n−1}) = fθ0,n(Xn|X^{n−1}) for n ≤ k and fθ,n(Xn|X^{n−1}) = fθ1,n(Xn|X^{n−1}) for n > k. A sequential detection rule is a stopping time T with respect to an observed sequence {Xn}n≥1. That is, T is an integer-valued random variable such that the event {T = n}, which denotes stopping and taking an action after observing the sample X^n, belongs to the sigma-algebra Fn = σ(X^n) generated by the observations X1, ..., Xn. A false alarm is raised when the detection is declared before the change occurs, T ≤ ν. The goal of the quickest changepoint detection problem is to develop a detection rule that stops as soon as possible after the real change occurs under a given risk of false alarms.

In early stages, the work focused on the i.i.d. case where fθ,n(Xn|X^{n−1}) = fθ(Xn), i.e., when the observations are independent and identically distributed (i.i.d.) with density fθ0(Xn) in the pre-change mode and density fθ1(Xn) in the post-change mode. In the early 1960s, Shiryaev [1] developed a Bayesian sequential changepoint detection theory for the case where θ1 is known. This theory implies that the detection procedure based on thresholding the posterior probability of the change being active before the current time is strictly optimal, minimizing the expected delay to detection in the class of procedures with a given weighted probability of false alarm, if the prior distribution of the change point is geometric.

[Footnote: The work was supported in part by the Russian Federation 5-100 program, the Russian Federation Ministry of Science and Education Arctic program, and grant 18-19-00452 from the Russian Science Foundation at the Moscow Institute of Physics and Technology. A. G. Tartakovsky is Head of the Space Informatics Laboratory at the Moscow Institute of Physics and Technology, Russia, and Vice President of AGT StatConsult, Los Angeles, California, USA; e-mail: [email protected]. Manuscript received October 31, 2017; revised May 22, 2018; accepted May 26, 2018. Copyright (c) 2017 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].]
At the beginning of the 1970s, Lorden [2] showed that Page’s CUSUM procedure [3] is first-order asymptotically optimal in a minimax sense, minimizing the maximal expected delay to detection in the class of procedures with the prescribed average run length to false alarm (ARL2FA) as ARL2FA approaches infinity. In the mid-1980s, Moustakides [4] established
exact minimaxity of the CUSUM procedure for any value of the ARL2FA. Pollak [5] suggested modifying the conventional Shiryaev–Roberts statistic (see [1], [6], [7]) by randomizing the initial condition to make it an equalizer. His version of the Shiryaev–Roberts statistic starts from a random point sampled from the quasi-stationary distribution of the Shiryaev–Roberts statistic. He proved that, for a large ARL2FA, this randomized procedure is asymptotically third-order minimax within an additive vanishing term. The articles [8], [9] indicate that the Shiryaev–Roberts–Pollak procedure is not exactly minimax for all values of the ARL2FA by showing that a generalized Shiryaev–Roberts procedure that starts from a specially designed deterministic point performs slightly better. Shiryaev [1], [6] was the first to establish exact optimality of the Shiryaev–Roberts detection procedure among multicyclic procedures with prescribed mean time between false alarms, in the problem of detecting a change in the drift of Brownian motion occurring at a far time horizon after many re-runs. Pollak and Tartakovsky [10] extended Shiryaev's result to the discrete-time i.i.d. (not necessarily Gaussian) case. Third-order asymptotic optimality of generalized Shiryaev–Roberts procedures with random and deterministic head-starts was established in [11]. Another line of work, related to the evaluation of the performance of CUSUM and EWMA detection procedures, was initiated by the SPC (statistical process control) community (see, e.g., [12]–[22]). In many practical applications, the i.i.d. assumption is too restrictive. The observations may be non-identically distributed or correlated or both, i.e., non-i.i.d. Lai [23] generalized Lorden's asymptotic theory [2] to the general non-i.i.d. case, establishing asymptotic optimality of the CUSUM procedure under very general conditions in the pointwise, minimax, and Bayesian settings.
He also suggested a window-limited version of the CUSUM procedure, which is computationally less demanding than a conventional CUSUM, but still preserves asymptotic optimality properties. Tartakovsky and Veeravalli [24], Baron and Tartakovsky [25], and Tartakovsky [26] generalized Shiryaev’s Bayesian theory for the general non-i.i.d. case and for a wide class of prior distributions. In particular, it was proved that the Shiryaev detection rule is asymptotically optimal – it minimizes not only the expected delay to detection but also higher moments of the detection delay as the weighted probability of a false alarm vanishes. Fuh and Tartakovsky [27] specified the results in [24], [26] for finite-state hidden Markov models (HMM), finding sufficient conditions under which the Shiryaev and Shiryaev–Roberts rules are first-order asymptotically optimal, assuming that both pre-change and post-change distributions are completely specified, i.e., the post-
change parameter θ1 is known. Fuh [28] proved first-order asymptotic minimaxity of the CUSUM procedure as the ARL2FA goes to infinity. Pergamenchtchikov and Tartakovsky [29] established pointwise and minimax asymptotic optimality properties of the Shiryaev–Roberts rule for the general non-i.i.d. stochastic model in the class of rules with prescribed local conditional probability of false alarm (in a given time interval), as well as presented sufficient conditions for ergodic Markov processes. In a variety of applications, however, the pre-change distribution is known but the post-change distribution is rarely known completely. A more realistic situation is parametric uncertainty, when the parameter θ of the post-change distribution is unknown, since a putative value of θ is rarely representative. When the post-change parameter is unknown, so that the post-change hypothesis "H^ϑ_k: ν = k, θ = ϑ", ϑ ∈ Θ, is composite, and it is desirable to quickly detect a change in a broad range of possible values, the natural modification of the CUSUM, Shiryaev, and Shiryaev–Roberts procedures is based either on maximizing the corresponding statistics tuned to θ = ϑ over ϑ or on weighting them over a mixing measure W(ϑ). Maximization leads to the generalized likelihood ratio (GLR)-based procedures and weighting to mixtures. Lorden [2] was the first to establish first-order asymptotic minimaxity of the GLR-CUSUM procedure for i.i.d. exponential families as the ARL2FA goes to infinity (see also Dragalin [30] for refined results). Siegmund and Yakir [31] established third-order asymptotic minimaxity of the randomized mixture Shiryaev–Roberts–Pollak procedure for the exponential family with respect to the maximal Kullback–Leibler information. Lai [23] established pointwise and minimax asymptotic optimality of the window-limited mixture CUSUM and GLR-CUSUM procedures for general non-i.i.d. models. A further detailed overview and references can be found in the monographs [32], [33].
A variety of applications where sequential changepoint detection is important are discussed, e.g., in [17], [32]–[49]. In this paper, we generalize the asymptotic Bayesian theory, developed in [24], [26] for a simple post-change hypothesis, to the more important and typical case of the composite post-change hypothesis where the post-change parameter is unknown. We assume that the observations can have a very general structure, i.e., can be dependent and non-identically distributed. The key assumption in the general asymptotic theory is a stability property of the log-likelihood ratio process between the “change” and “no-change” hypotheses, which can be formulated in terms of a Law of Large Numbers and rates of convergence, e.g., as the r-complete convergence of the properly normalized log-likelihood ratio and its adaptive
version in the vicinity of the true parameter value.

The rest of the paper is organized as follows. In Section II, we introduce the mixture Shiryaev and the mixture Shiryaev–Roberts rules. In Section III, we formulate the asymptotic optimization problems in the class of changepoint detection procedures with a constraint imposed on the weighted probability of false alarm, which we address in the following sections. In Section IV, we establish the first-order asymptotic optimality of the mixture Shiryaev rule and, in Section V, we study the performance of the mixture Shiryaev–Roberts rule as the weighted probability of false alarm goes to zero. In Section VI, we prove asymptotic optimality of the mixture Shiryaev and mixture Shiryaev–Roberts rules in a purely Bayesian setup when the cost of delay in change detection approaches zero. In Section VII, we use several examples to illustrate the general results. Section VIII concludes.

II. THE SHIRYAEV AND SHIRYAEV–ROBERTS MIXTURE RULES

Let P∞ denote the probability measure corresponding to the sequence of observations {Xn}n≥1 when there is never a change (ν = ∞) and, for k = 0, 1, ... and ϑ ∈ Θ, let Pk,ϑ denote the measure corresponding to the sequence {Xn}n≥1 when ν = k < ∞ and θ = ϑ (i.e., Xν+1 is the first post-change observation), where θ ∈ Θ is a parameter (possibly multidimensional). Further, let pk,ϑ(X^n) = p(X^n | ν = k, θ = ϑ) denote a joint density of the sample X^n = (X1, ..., Xn), i.e., the density of the restriction P^(n)_{k,ϑ} of the measure Pk,ϑ to the sigma-algebra Fn = σ(X^n) with respect to a non-degenerate sigma-finite measure. Let {gn(Xn|X^{n−1})}n≥1 and {fθ,n(Xn|X^{n−1})}n≥1 be two sequences of conditional densities of Xn given X^{n−1}. With this notation, the general non-i.i.d. changepoint model in which we are interested can be written as

\[
p_{\nu,\theta}(\mathbf{X}^n) = p_\infty(\mathbf{X}^n) = \prod_{i=1}^{n} g_i(X_i \mid \mathbf{X}^{i-1}) \quad \text{for } \nu \ge n,
\]
\[
p_{\nu,\theta}(\mathbf{X}^n) = \prod_{i=1}^{\nu} g_i(X_i \mid \mathbf{X}^{i-1}) \times \prod_{i=\nu+1}^{n} f_{\theta,i}(X_i \mid \mathbf{X}^{i-1}) \quad \text{for } \nu < n. \tag{1}
\]

Therefore, gn(Xn|X^{n−1}) and fθ,n(Xn|X^{n−1}) are the pre-change and post-change conditional densities, respectively. Note that the post-change densities may depend on the change point ν, i.e., fθ,n(Xn|X^{n−1}) = f^{(ν)}_{θ,n}(Xn|X^{n−1}) for n > ν. We omit the superscript ν for brevity. While often the pre-change density gn belongs to the same parametric family as the post-change one fθ,n, i.e., gn = fθ0,n for some known value θ0, this is not necessarily the case, so we consider a more general scenario.

Let Ek,ϑ and E∞ denote expectations under Pk,ϑ and P∞, respectively. The likelihood ratio (LR) of the hypothesis "H^ϑ_k: ν = k, θ = ϑ" that the change occurs at ν = k with the post-change parameter θ = ϑ against the no-change hypothesis "H∞: ν = ∞" based on the sample X^n = (X1, ..., Xn) is given by the product

\[
LR_{k,n}(\vartheta) = \prod_{i=k+1}^{n} \frac{f_{\vartheta,i}(X_i \mid \mathbf{X}^{i-1})}{g_i(X_i \mid \mathbf{X}^{i-1})}, \quad n > k,
\]

and we set LR_{k,n}(ϑ) = 1 for n ≤ k.

Assume that the change point ν is a random variable independent of the observations with prior distribution πk = P(ν = k), k = 0, 1, 2, ..., with πk > 0 for k ∈ {0, 1, 2, ...} = Z+. We will also assume that the change point may take negative values, which means that the change has occurred by the time the observations become available. However, the detailed structure of the distribution P(ν = k) for k = −1, −2, ... is not important. The only value that matters is the total probability q = P(ν ≤ −1) of the change being in effect before the observations become available.

Let Ln(ϑ) = fϑ,n(Xn|X^{n−1})/gn(Xn|X^{n−1}). In [24], [26], for detecting a change from {gn(Xn|X^{n−1})} to {fϑ,n(Xn|X^{n−1})}, it was proposed to use the Shiryaev statistic

\[
S_n(\vartheta) = \frac{1}{\mathsf{P}(\nu \ge n)} \left( q \prod_{i=1}^{n} L_i(\vartheta) + \sum_{k=0}^{n-1} \pi_k \prod_{i=k+1}^{n} L_i(\vartheta) \right), \quad n \ge 1, \qquad S_0(\vartheta) = q/(1-q), \tag{2}
\]

where \prod_{i=j}^{s} L_i = 1 for s < j.

When the value of the parameter is unknown, there are two conventional approaches to overcome uncertainty – either to maximize or to average over ϑ. The second approach is usually referred to as mixtures. To be more specific, introduce a mixing measure W(ϑ), ∫_Θ dW(ϑ) = 1, which can be interpreted as a prior distribution if needed. Define the average (mixed) LR

\[
\Lambda^W_{k,n} = \int_\Theta LR_{k,n}(\vartheta)\, dW(\vartheta), \quad k < n, \tag{3}
\]

and the statistic

\[
S^W_n = \int_\Theta S_n(\vartheta)\, dW(\vartheta) = \frac{1}{\mathsf{P}(\nu \ge n)} \left( q \Lambda^W_{0,n} + \sum_{k=0}^{n-1} \pi_k \Lambda^W_{k,n} \right), \quad n \ge 1, \qquad S^W_0 = q/(1-q), \tag{4}
\]

where Sn(ϑ) is the Shiryaev statistic tuned to the parameter θ = ϑ defined in (2). We will call this statistic the Mixture Shiryaev (MS) statistic.
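For concreteness, the MS statistic is easy to compute when the mixing measure W is discrete. The following sketch is purely illustrative and not part of the paper's analysis: it assumes an i.i.d. Gaussian mean-shift model (N(0,1) pre-change, N(1,1) post-change), a three-point mixing measure, q = 0, and the geometric prior πk = ρ(1 − ρ)^k, for which the per-ϑ Shiryaev statistic (2) admits the recursion Sn(ϑ) = Ln(ϑ)(Sn−1(ϑ) + ρ)/(1 − ρ). The code cross-checks this recursion against the direct double sum in (2) and then mixes per (4).

```python
import numpy as np

rng = np.random.default_rng(0)

rho = 0.1                                # geometric prior: pi_k = rho*(1-rho)**k, q = 0
theta_grid = np.array([0.5, 1.0, 1.5])   # support of the mixing measure W (assumption)
w = np.array([0.25, 0.5, 0.25])          # mixing weights, sum to 1

nu, n_max = 10, 30                       # change point and horizon
x = np.concatenate([rng.normal(0.0, 1.0, nu),
                    rng.normal(1.0, 1.0, n_max - nu)])   # N(0,1) -> N(1,1)

# L_n(theta) = f_theta(X_n)/g(X_n) for N(theta,1) vs N(0,1)
L = np.exp(theta_grid * x[:, None] - 0.5 * theta_grid**2)  # shape (n_max, 3)

def shiryaev_direct(L_th, n):
    # Direct evaluation of (2) with q = 0: sum over pi_k * prod_{i=k+1}^n L_i
    s = 0.0
    for k in range(n):
        s += rho * (1 - rho)**k * np.prod(L_th[k:n])
    return s / (1 - rho)**n

# Recursive evaluation: S_n = L_n (S_{n-1} + rho) / (1 - rho), S_0 = 0
S_rec = np.zeros((n_max, len(theta_grid)))
prev = np.zeros(len(theta_grid))
for n in range(n_max):
    prev = L[n] * (prev + rho) / (1 - rho)
    S_rec[n] = prev

# The MS statistic (4) is the W-mixture of the per-theta Shiryaev statistics
S_mix = S_rec @ w

for n in (5, 15, 30):
    direct = np.array([shiryaev_direct(L[:, j], n) for j in range(len(theta_grid))])
    assert np.allclose(direct, S_rec[n - 1]), "recursion must match (2)"

print("MS statistic at n =", n_max, ":", S_mix[-1])
```

Because the mixture in (4) is linear in Sn(ϑ), running one scalar recursion per support point of W and averaging reproduces S^W_n exactly.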
In the sequel, we study the MS detection rule that stops and raises an alarm as soon as the statistic S^W_n reaches a positive level A, i.e., the MS rule is nothing but the stopping time

\[
T_A = \inf\{n \ge 1 : S^W_n \ge A\}, \tag{5}
\]

where A > 0 is a threshold controlling for the false alarm risk. In definitions of stopping times we always set inf{∅} = ∞.

Another popular statistic for detecting a change from {gn(Xn|X^{n−1})} to {fϑ,n(Xn|X^{n−1})}, which has certain optimality properties [9]–[11], [33], is the generalized Shiryaev–Roberts (SR) statistic

\[
R_n(\vartheta) = \omega\, LR_{0,n}(\vartheta) + \sum_{k=0}^{n-1} LR_{k,n}(\vartheta)
= \omega \prod_{i=1}^{n} L_i(\vartheta) + \sum_{k=1}^{n} \prod_{i=k}^{n} L_i(\vartheta), \quad n \ge 1, \tag{6}
\]

with a non-negative head-start R_0(ϑ) = ω, ω ≥ 0. The mixture counterpart, which we will refer to as the Mixture Shiryaev–Roberts (MSR) statistic, is

\[
R^W_n = \int_\Theta R_n(\vartheta)\, dW(\vartheta) = \omega \Lambda^W_{0,n} + \sum_{k=0}^{n-1} \Lambda^W_{k,n}, \quad n \ge 1, \qquad R^W_0 = \omega, \tag{7}
\]

and the corresponding MSR detection rule is given by the stopping time

\[
\widetilde{T}_A = \inf\{n \ge 1 : R^W_n \ge A\}, \tag{8}
\]
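Analogously, since mixing is linear, the MSR statistic (7) with a discrete mixing measure can be computed by running the SR recursion Rn(ϑ) = Ln(ϑ)(1 + Rn−1(ϑ)) for each support point of W and averaging. A hypothetical illustration follows; the Gaussian model, grid, weights, and threshold are assumptions for the sketch, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_grid = np.array([0.5, 1.0, 1.5])   # discrete mixing measure W (assumption)
w = np.array([0.25, 0.5, 0.25])
omega = 0.0                              # head-start R_0 = omega

nu, n_max = 8, 40
x = np.concatenate([rng.normal(0, 1, nu), rng.normal(1, 1, n_max - nu)])
L = np.exp(theta_grid * x[:, None] - 0.5 * theta_grid**2)   # (n_max, 3)

# SR recursion R_n = L_n (1 + R_{n-1}) per theta, then mix -> MSR statistic (7)
R = np.full(len(theta_grid), omega)
R_mix = []
for n in range(n_max):
    R = L[n] * (1.0 + R)
    R_mix.append(R @ w)
R_mix = np.array(R_mix)

# Direct evaluation of (6) at the final time as a sanity check
direct = np.array([omega * np.prod(L[:, j]) +
                   sum(np.prod(L[k:, j]) for k in range(n_max))
                   for j in range(len(theta_grid))])
assert np.allclose(direct @ w, R_mix[-1])

# MSR stopping rule (8): first n with R^W_n >= A (A is an illustrative value)
A = 50.0
hits = np.nonzero(R_mix >= A)[0]
tau = hits[0] + 1 if hits.size else None   # stopping time in 1-based time
print("MSR alarm at n =", tau)
```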
where A > 0 is a threshold controlling for the false alarm risk. In Section IV, we show that the MS detection rule T_A is first-order asymptotically optimal, minimizing moments of the stopping time distribution for a low risk of false alarms under very general conditions. In Section V, we establish asymptotic properties of the MSR rule, showing that it is also asymptotically optimal when the prior distribution becomes asymptotically flat, but not in general.

III. ASYMPTOTIC PROBLEMS

Let P^π_θ(A × K) = Σ_{k∈K} πk Pk,θ(A) denote the "weighted" probability measure and E^π_θ the corresponding expectation. For r ≥ 1, ν = k ∈ Z+, and θ ∈ Θ, introduce the risk associated with the conditional r-th moment of the detection delay

\[
\mathcal{R}^r_{k,\theta}(T) = \mathsf{E}_{k,\theta}\left[(T - k)^r \mid T > k\right]. \tag{9}
\]

In a Bayesian setting, the average risk associated with the moments of delay to detection is

\[
\bar{\mathcal{R}}^r_{\pi,\theta}(T) := \mathsf{E}^\pi_\theta\left[(T - \nu)^r \mid T > \nu\right]
= \frac{\sum_{k=0}^{\infty} \pi_k \mathcal{R}^r_{k,\theta}(T)\, \mathsf{P}_\infty(T > k)}{1 - \mathrm{PFA}(T)}, \tag{10}
\]

where

\[
\mathrm{PFA}(T) = \mathsf{P}^\pi_\theta(T \le \nu) = \sum_{k=0}^{\infty} \pi_k \mathsf{P}_\infty(T \le k) \tag{11}
\]

is the weighted probability of false alarm (PFA) that corresponds to the risk associated with a false alarm. Note that in (10) and (11) we used the fact that Pk,θ(T ≤ k) = P∞(T ≤ k): the event {T ≤ k} depends only on the observations X1, ..., Xk, which are generated by the pre-change probability measure P∞, since by our convention Xk is the last pre-change observation if ν = k.

In Section IV, we are interested in the Bayesian optimization problem

\[
\inf_{\{T : \mathrm{PFA}(T) \le \alpha\}} \bar{\mathcal{R}}^r_{\pi,\theta}(T) \quad \text{for all } \theta \in \Theta. \tag{12}
\]

However, in general this problem is not manageable for every value of the PFA α ∈ (0, 1). So we will focus on the asymptotic problem, assuming that the PFA α approaches zero. Specifically, we will be interested in proving that the MS rule is first-order asymptotically optimal, i.e.,

\[
\lim_{\alpha \to 0} \frac{\inf_{T \in \mathbb{C}(\alpha,\pi)} \bar{\mathcal{R}}^r_{\pi,\theta}(T)}{\bar{\mathcal{R}}^r_{\pi,\theta}(T_A)} = 1 \quad \text{for all } \theta \in \Theta, \tag{13}
\]

where C(α, π) = {T : PFA(T) ≤ α} is the class of detection rules for which the PFA does not exceed a prescribed number α ∈ (0, 1). In addition, we will prove that the MS rule is uniformly first-order asymptotically optimal in the sense of minimizing the conditional risk (9) for all change point values ν = k ∈ Z+, i.e.,

\[
\lim_{\alpha \to 0} \frac{\inf_{T \in \mathbb{C}(\alpha,\pi)} \mathcal{R}^r_{k,\theta}(T)}{\mathcal{R}^r_{k,\theta}(T_A)} = 1 \tag{14}
\]

for all θ ∈ Θ and all k ∈ Z+. In Section VI, we consider a "purely" Bayes problem with the average (integrated) risk, which is the sum of the PFA and the cost of delay proportional to the r-th moment of the detection delay, and prove that the MS rule is asymptotically optimal when the cost of delay to detection approaches 0. Asymptotic properties of the MSR rule \widetilde{T}_A will also be established.
For a fixed θ ∈ Θ, introduce the log-likelihood ratio (LLR) process {λk,n(θ)}n≥k+1 between the hypotheses Hk,θ (k = 0, 1, ...) and H∞:

\[
\lambda_{k,n}(\theta) = \sum_{j=k+1}^{n} \log \frac{f_{\theta,j}(X_j \mid \mathbf{X}^{j-1})}{g_j(X_j \mid \mathbf{X}^{j-1})}, \quad n > k
\]

(λk,n(θ) = 0 for n ≤ k). Let k ∈ Z+ and r > 0. We say that the sequence of normalized LLRs {n^{−1}λk,n(θ)}n≥1 converges r−completely to a number Iθ under the probability measure Pk,θ as n → ∞ if

\[
\sum_{n=1}^{\infty} n^{r-1}\, \mathsf{P}_{k,\theta}\left( \left| n^{-1} \lambda_{k,n}(\theta) - I_\theta \right| > \varepsilon \right) < \infty \quad \text{for all } \varepsilon > 0, \tag{15}
\]

and we say that {n^{−1}λk,n(θ)}n≥1 converges to Iθ uniformly r−completely as n → ∞ if

\[
\sum_{n=1}^{\infty} n^{r-1} \sup_{0 \le k < \infty} \mathsf{P}_{k,\theta}\left( \left| n^{-1} \lambda_{k,n}(\theta) - I_\theta \right| > \varepsilon \right) < \infty \quad \text{for all } \varepsilon > 0. \tag{16}
\]

Assume that there exists a positive and finite number Iθ such that the normalized LLR n^{−1}λk,n+k(θ) converges to Iθ r−completely. Then it follows from [26] that, when the parameter θ is known, the Shiryaev detection rule that raises an alarm at the first time the Shiryaev statistic Sn(θ) exceeds the threshold (1 − α)/α is asymptotically (as α → 0) optimal in class C(α, π). Below we extend this result to the case where θ is unknown. Specifically, we will show that the MS rule (5) is asymptotically optimal in problems (13) and (14) under condition (15) and some other conditions for a large class of priors and all parameter values θ ∈ Θ.

IV. ASYMPTOTIC OPTIMALITY OF THE MIXTURE SHIRYAEV RULE

To study asymptotic optimality we need certain constraints imposed on the prior distribution {πk} and on the asymptotic behavior of the decision statistics as the sample size increases (i.e., on the general stochastic model (1)). The following two conditions are imposed on the prior distribution:

CP1. For some 0 ≤ µ < ∞,

\[
\lim_{n \to \infty} \frac{1}{n} \left| \log \sum_{k=n+1}^{\infty} \pi_k \right| = \mu. \tag{17}
\]

CP2. If µ = 0, then in addition

\[
\sum_{k=0}^{\infty} \pi_k |\log \pi_k|^r < \infty \quad \text{for some } r \ge 1. \tag{18}
\]

The class of prior distributions satisfying conditions CP1 and CP2 will be denoted by C(µ). Note that if µ > 0, then the prior distribution has an exponential right tail, in which case condition (18) holds automatically. If µ = 0, the distribution has a heavy tail, i.e., belongs to the model with a vanishing hazard rate. However, we cannot allow this distribution to have a too heavy tail, which would generate very large time intervals between change points. This is guaranteed by condition CP2. Note that condition CP1 excludes light-tailed distributions with unbounded hazard rates (e.g., Gaussian-type or Weibull-type with shape parameter κ > 1), for which the time intervals with a change point are very short. In this case, prior information dominates the information obtained from the observations, the change can be easily detected at early stages, and the asymptotic analysis is impractical. Note also that constraint (18) is often guaranteed by finiteness of the r-th moment, Σ_{k=0}^∞ k^r πk < ∞.

For δ > 0 define Γδ,θ = {ϑ ∈ Θ : |ϑ − θ| < δ}. Regarding the general model for the observations (1), we assume that the following two conditions are satisfied:

C1. There exists a positive and finite number Iθ such that n^{−1}λk,k+n(θ) converges to Iθ in Pk,θ-probability and, for any k ∈ Z+ and ε > 0,

\[
\lim_{N \to \infty} \mathsf{P}_{k,\theta}\left( \frac{1}{N} \max_{1 \le n \le N} \lambda_{k,k+n}(\theta) \ge (1+\varepsilon) I_\theta \right) = 0 \quad \text{for all } \theta \in \Theta; \tag{19}
\]

C2. For any ε > 0 there exists δ = δε > 0 such that W(Γδ,θ) > 0 and, for every θ ∈ Θ, any k ∈ Z+, any ε > 0, and some r ≥ 1,

\[
\Upsilon_{k,r}(\varepsilon, \theta) := \sum_{n=1}^{\infty} n^{r-1}\, \mathsf{P}_{k,\theta}\left( \frac{1}{n} \inf_{\vartheta \in \Gamma_{\delta,\theta}} \lambda_{k,k+n}(\vartheta) < I_\theta - \varepsilon \right) < \infty. \tag{20}
\]

Note that condition C1 holds whenever λk,k+n(θ)/n converges almost surely to Iθ under Pk,θ:

\[
\frac{1}{n} \lambda_{k,k+n}(\theta) \xrightarrow[n \to \infty]{\mathsf{P}_{k,\theta}\text{-a.s.}} I_\theta \quad \text{for all } \theta \in \Theta. \tag{21}
\]

In order to establish asymptotic optimality, we first obtain, under condition C1, an asymptotic lower bound for the moments of the detection delay \bar{\mathcal{R}}^r_{\pi,\theta}(T) = E^π_θ[(T − ν)^r | T > ν] and \mathcal{R}^r_{k,\theta}(T) = Ek,θ[(T − k)^r | T > k] of any detection rule T from class C(α, π), and then we show that under condition C2
this bound is attained for the MS rule T_A when A = A_α is properly selected.

Asymptotic lower bounds for all positive moments of the detection delay are specified in the following lemma. Condition (19) (and hence the a.s. convergence condition (21)) is sufficient for this purpose.

Lemma 1. Let, for some µ ≥ 0, the prior distribution belong to class C(µ). Assume that for some positive and finite function I(θ) = Iθ, θ ∈ Θ, condition C1 holds. Then, for all r > 0 and all θ ∈ Θ,

\[
\liminf_{\alpha \to 0} \frac{\inf_{T \in \mathbb{C}(\alpha,\pi)} \bar{\mathcal{R}}^r_{\pi,\theta}(T)}{|\log \alpha|^r} \ge \frac{1}{(I_\theta + \mu)^r} \tag{22}
\]

and, for every k ∈ Z+, all r > 0, and all θ ∈ Θ,

\[
\liminf_{\alpha \to 0} \frac{\inf_{T \in \mathbb{C}(\alpha,\pi)} \mathcal{R}^r_{k,\theta}(T)}{|\log \alpha|^r} \ge \frac{1}{(I_\theta + \mu)^r}. \tag{23}
\]

Proof: The lower bound (22) follows from Lemma 1 in Tartakovsky [26]. The proof of (23) is a modification and generalization of the argument in the proof of Theorem 1 in [29] provided in the Appendix.

The following lemma provides the upper bound for the PFA of the MS rule.

Lemma 2. For all A > q/(1 − q) and any prior distribution of ν, the PFA of the MS rule T_A satisfies the inequality

\[
\mathrm{PFA}(T_A) \le 1/(1 + A), \tag{24}
\]

so that, for α < 1 − q,

\[
A = A_\alpha = (1 - \alpha)/\alpha \ \text{ implies } \ \mathrm{PFA}(T_{A_\alpha}) \le \alpha. \tag{25}
\]

Proof: Clearly, PFA(T_A) = E^π[P(T_A ≤ ν | F_{T_A}); T_A < ∞]. Using the Bayes rule and the fact that \prod_{i=j+1}^{n} L_i(θ) = 1 for j ≥ n, we obtain

\[
\mathsf{P}(\nu = k \mid \mathcal{F}_n) = \frac{\pi_k \Lambda^W_{k,n}}{q \Lambda^W_{0,n} + \sum_{j=0}^{n-1} \pi_j \Lambda^W_{j,n} + \mathsf{P}(\nu \ge n)},
\]

so that

\[
\mathsf{P}(\nu \ge n \mid \mathcal{F}_n) = \sum_{k=n}^{\infty} \mathsf{P}(\nu = k \mid \mathcal{F}_n)
= \frac{\mathsf{P}(\nu \ge n)}{q \Lambda^W_{0,n} + \sum_{j=0}^{n-1} \pi_j \Lambda^W_{j,n} + \mathsf{P}(\nu \ge n)}
= \frac{1}{S^W_n + 1}.
\]

Therefore, taking into account that S^W_{T_A} ≥ A on {T_A < ∞}, we have

\[
\mathrm{PFA}(T_A) = \mathsf{E}^\pi\left[ 1/(1 + S^W_{T_A});\; T_A < \infty \right] \le 1/(1 + A),
\]

and the inequality (24) follows. Implication (25) is obvious.

The following theorem is the main result in the general non-i.i.d. case, which shows that the MS detection rule is asymptotically optimal to first order under mild conditions on the observations and prior distributions. Its proof is given in the Appendix.

Theorem 1. Let r ≥ 1 and let the prior distribution of the change point belong to class C(µ). Assume that for some 0 < Iθ < ∞, θ ∈ Θ, the right-tail and left-tail conditions C1 and C2 are satisfied.

(i) Then, for all 0 < m ≤ r and all θ ∈ Θ, as A → ∞,

\[
\mathcal{R}^m_{k,\theta}(T_A) \sim \left( \frac{\log A}{I_\theta + \mu} \right)^m \quad \text{for all } k \in \mathbb{Z}_+ \tag{26}
\]

and

\[
\bar{\mathcal{R}}^m_{\pi,\theta}(T_A) \sim \left( \frac{\log A}{I_\theta + \mu} \right)^m. \tag{27}
\]

(ii) If A = A_α is selected so that PFA(T_{A_α}) ≤ α and log A_α ∼ |log α| as α → 0, in particular A = A_α = (1 − α)/α, where 0 < α < 1 − q, then T_{A_α} is first-order asymptotically optimal as α → 0 in class C(α, π), minimizing moments of the detection delay up to order r, i.e., for all 0 < m ≤ r and all θ ∈ Θ, as α → 0,

\[
\inf_{T \in \mathbb{C}(\alpha,\pi)} \mathcal{R}^m_{k,\theta}(T) \sim \left( \frac{|\log \alpha|}{I_\theta + \mu} \right)^m \sim \mathcal{R}^m_{k,\theta}(T_{A_\alpha}) \quad \text{for all } k \in \mathbb{Z}_+ \tag{28}
\]

and

\[
\inf_{T \in \mathbb{C}(\alpha,\pi)} \bar{\mathcal{R}}^m_{\pi,\theta}(T) \sim \left( \frac{|\log \alpha|}{I_\theta + \mu} \right)^m \sim \bar{\mathcal{R}}^m_{\pi,\theta}(T_{A_\alpha}). \tag{29}
\]
Theorem 1 covers a very wide class of non-i.i.d. models for the observations as well as a large class of prior distributions. However, condition (17) does not cover the case where µ is strictly positive but may go to zero, µ → 0. Indeed, as discussed in detail in [26], the distributions with an exponential right tail that satisfy condition (17) with µ > 0 do not converge as µ → 0 to heavy-tailed distributions for which µ = 0. As a result, the assertions of Theorem 1 do not hold with µ = 0 if µ approaches 0 at an arbitrary rate; the rate has to be matched with α. For this reason, we now consider the case where the prior distribution π^α = {π^α_k} of the change point depends on the PFA constraint α and becomes "flat" as α vanishes.

In the next lemma, which is analogous to Lemma 1, we provide asymptotic lower bounds for moments of the detection delay in class C(α) = C(α, π^α) when the prior distribution π^α = {π^α_k} depends on α and µ = µα → 0 as α → 0.

Lemma 3. Let the prior distribution π^α = {π^α_k} of the change point satisfy condition (17) with µ > 0 such that µ = µα → 0 as α → 0. Assume that for some 0 < Iθ < ∞, θ ∈ Θ, condition C1 holds. Then, for all r > 0 and θ ∈ Θ,

\[
\liminf_{\alpha \to 0} \frac{\inf_{T \in \mathbb{C}(\alpha)} \bar{\mathcal{R}}^r_{\pi^\alpha,\theta}(T)}{|\log \alpha|^r} \ge \frac{1}{I_\theta^r} \tag{30}
\]

and

\[
\liminf_{\alpha \to 0} \frac{\inf_{T \in \mathbb{C}(\alpha)} \mathcal{R}^r_{k,\theta}(T)}{|\log \alpha|^r} \ge \frac{1}{I_\theta^r} \quad \text{for all } k \in \mathbb{Z}_+. \tag{31}
\]

Proof: The lower bound (30) follows from Lemma 3 in [26]. A proof of the lower bound (31) is given in the Appendix.

Using this lemma, we now establish first-order asymptotic optimality of the MS rule when µ = µα approaches zero as α → 0. To simplify the proof, we strengthen condition C2 into the following uniform version:

C3. For any ε > 0 there exists δ = δε > 0 such that W(Γδ,θ) > 0 and, for every θ ∈ Θ, any ε > 0, and some r ≥ 1,

\[
\Upsilon_r(\varepsilon, \theta) := \sum_{n=1}^{\infty} n^{r-1} \sup_{k \in \mathbb{Z}_+} \mathsf{P}_{k,\theta}\left( \frac{1}{n} \inf_{\vartheta \in \Gamma_{\delta,\theta}} \lambda_{k,k+n}(\vartheta) < I_\theta - \varepsilon \right) < \infty. \tag{32}
\]

Theorem 2. Let r ≥ 1. Assume that the prior distribution π^α = {π^α_k} of the change point ν satisfies condition (17) with µ = µα → 0 as α → 0 and that µα approaches zero at such a rate that

\[
\lim_{\alpha \to 0} \frac{\sum_{k=0}^{\infty} \pi^\alpha_k |\log \pi^\alpha_k|^r}{|\log \alpha|^r} = 0. \tag{33}
\]

Assume that for some 0 < Iθ < ∞, θ ∈ Θ, the right-tail condition C1 and the uniform left-tail condition C3 are satisfied. If A = Aα is selected so that PFA(T_{Aα}) ≤ α and log Aα ∼ |log α| as α → 0, in particular Aα = (1 − α)/α, then the MS rule T_{Aα} is asymptotically optimal as α → 0 in class C(α), minimizing moments of the detection delay up to order r: for all 0 < m ≤ r and all θ ∈ Θ, as α → 0,

\[
\inf_{T \in \mathbb{C}(\alpha)} \bar{\mathcal{R}}^m_{\pi^\alpha,\theta}(T) \sim \left( \frac{|\log \alpha|}{I_\theta} \right)^m \sim \bar{\mathcal{R}}^m_{\pi^\alpha,\theta}(T_{A_\alpha}) \tag{34}
\]

and, for all k ∈ Z+,

\[
\inf_{T \in \mathbb{C}(\alpha)} \mathcal{R}^m_{k,\theta}(T) \sim \left( \frac{|\log \alpha|}{I_\theta} \right)^m \sim \mathcal{R}^m_{k,\theta}(T_{A_\alpha}). \tag{35}
\]

The proof of this theorem is given in the Appendix.

V. ASYMPTOTIC PERFORMANCE OF THE MIXTURE SHIRYAEV–ROBERTS RULE

Consider now the MSR detection rule \widetilde{T}_A defined in (7) and (8). The following lemma shows how to select the threshold A = Aα in the MSR rule to embed it in class C(α, π). Write

\[
\bar{\nu} = \sum_{k=0}^{\infty} k \pi_k = (1 - q) \sum_{k=1}^{\infty} k\, \mathsf{P}(\nu = k \mid \nu \ge 0) = (1 - q)\, \mathsf{E}[\nu \mid \nu \ge 0].
\]

Since we are not interested in negative values of ν, we will refer to ν̄ as the mean of the prior distribution. Recall that ω (ω ≥ 0) is the head-start of the MSR statistic R^W_n (see (7)).

Lemma 4. For all A > 0 and any prior distribution of ν with finite mean ν̄, the PFA of the MSR rule \widetilde{T}_A satisfies the inequality

\[
\mathrm{PFA}(\widetilde{T}_A) \le \frac{\omega b + \bar{\nu}}{A}, \tag{36}
\]

where b = \sum_{k=1}^{\infty} \pi_k, so that if

\[
A = A_\alpha = (\omega b + \bar{\nu})/\alpha,
\]

then PFA(\widetilde{T}_{Aα}) ≤ α, i.e., \widetilde{T}_{Aα} ∈ C(α, π).

Proof: Evidently, E∞[Rn(ϑ) | Fn−1] = 1 + Rn−1(ϑ) and hence

\[
\mathsf{E}_\infty[R^W_n \mid \mathcal{F}_{n-1}] = \int_\Theta dW(\vartheta) + \int_\Theta R_{n-1}(\vartheta)\, dW(\vartheta) = 1 + R^W_{n-1}.
\]

So {R^W_n − ω − n}n≥1 is a zero-mean (P∞, Fn)−martingale and the MSR statistic R^W_n is a (P∞, Fn)−submartingale with mean E∞[R^W_n] = ω + n. Applying Doob's submartingale inequality, we obtain that, for j = 1, 2, ...,

\[
\mathsf{P}_\infty(\widetilde{T}_A \le j) = \mathsf{P}_\infty\left( \max_{1 \le i \le j} R^W_i \ge A \right) \le (\omega + j)/A,
\]

and P∞(\widetilde{T}_A ≤ 0) = 0. Thus,

\[
\mathrm{PFA}(\widetilde{T}_A) = \sum_{j=1}^{\infty} \pi_j \mathsf{P}_\infty(\widetilde{T}_A \le j) \le \frac{\omega \sum_{j=1}^{\infty} \pi_j + \sum_{j=1}^{\infty} j \pi_j}{A},
\]

which proves inequality (36). Therefore, assuming ν̄ < ∞, setting A = Aα = (ωb + ν̄)/α implies \widetilde{T}_{Aα} ∈ C(α, π), and the proof is complete.

The following theorem, whose proof is postponed to the Appendix, establishes asymptotic operating characteristics of the MSR rule \widetilde{T}_A.
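The identity E∞[R^W_n] = ω + n at the heart of the proof of Lemma 4 can be probed numerically. The sketch below (pre-change N(0,1) observations and a two-point mixture are illustrative assumptions for the sketch) estimates the null mean of the MSR statistic by simulation.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_grid = np.array([0.3, 0.6])        # two-point mixing measure W (assumption)
w = np.array([0.5, 0.5])
omega, n_steps, n_rep = 1.0, 10, 100_000

# Simulate R^W_n under P_infty (no change): E_infty[R^W_n] should equal omega + n,
# since each likelihood ratio L_n(theta) has unit mean under the null.
x = rng.normal(0.0, 1.0, size=(n_rep, n_steps))
R = np.full((n_rep, len(theta_grid)), omega)
for n in range(n_steps):
    L = np.exp(np.outer(x[:, n], theta_grid) - 0.5 * theta_grid**2)
    R = L * (1.0 + R)                    # SR recursion per theta, vectorized over reps
mean_RW = (R @ w).mean()
print(f"E_inf[R^W_{n_steps}] ~= {mean_RW:.2f}  (theory: {omega + n_steps})")
assert abs(mean_RW - (omega + n_steps)) < 0.3
```

The tolerance is generous relative to the Monte Carlo standard error, because the exact expectation ω + n is known from the martingale argument above.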
Theorem 3. Let ν¯ < ∞ and 0 6 ω < ∞. Let r > 1. Assume that for some function 0 < Iθ < ∞, θ ∈ Θ, conditions C1 and C2 are satisfied. (i) Then, for all 0 < m 6 r and θ ∈ Θ e Rm 1 k,θ (TA ) = m m A→∞ (log A) Iθ lim
for all k ∈ Z+
(37)
and

$$
\lim_{A\to\infty}\frac{\bar R^m_{\pi,\theta}(\widetilde T_A)}{(\log A)^m}=\frac{1}{I_\theta^m}. \tag{38}
$$

(ii) If $A=A_\alpha$ is so selected that $\widetilde T_{A_\alpha}\in\mathbb{C}(\alpha,\pi)$ and $\log A_\alpha\sim|\log\alpha|$ as $\alpha\to0$, in particular $A_\alpha=(\omega b+\bar\nu)/\alpha$, then for all $0<m\le r$ and $\theta\in\Theta$

$$
\lim_{\alpha\to0}\frac{R^m_{k,\theta}(\widetilde T_{A_\alpha})}{|\log\alpha|^m}=\frac{1}{I_\theta^m}\quad\text{for all }k\in\mathbb{Z}_+ \tag{39}
$$

and

$$
\lim_{\alpha\to0}\frac{\bar R^m_{\pi,\theta}(\widetilde T_{A_\alpha})}{|\log\alpha|^m}=\frac{1}{I_\theta^m}. \tag{40}
$$

The next theorem addresses the case where the headstart $\omega=\omega_\alpha$ of the MSR statistic and the mean value $\bar\nu=\bar\nu_\alpha$ of the prior distribution approach infinity as $\alpha\to0$ with a certain rate. The proof is given in the Appendix.

Theorem 4. Assume that $\omega_\alpha\to\infty$ and $\bar\nu_\alpha\to\infty$ at such a rate that the following condition holds:

$$
\lim_{\alpha\to0}\frac{\log(\omega_\alpha+\bar\nu_\alpha)}{|\log\alpha|}=0. \tag{41}
$$

Assume further that for some $0<I_\theta<\infty$ and $r\ge1$ conditions C1 and C3 are satisfied. If threshold $A_\alpha$ is so selected that $\mathrm{PFA}(\widetilde T_{A_\alpha})\le\alpha$ and $\log A_\alpha\sim|\log\alpha|$ as $\alpha\to0$, in particular $A_\alpha=(\omega_\alpha b_\alpha+\bar\nu_\alpha)/\alpha$, then for all $0<m\le r$ and $\theta\in\Theta$, as $\alpha\to0$,

$$
\bar R^m_{\pi^\alpha,\theta}(\widetilde T_{A_\alpha})\sim\left(\frac{|\log\alpha|}{I_\theta}\right)^m\sim\inf_{T\in\mathbb{C}(\alpha)}\bar R^m_{\pi^\alpha,\theta}(T), \tag{42}
$$

and for all $k\in\mathbb{Z}_+$

$$
R^m_{k,\theta}(\widetilde T_{A_\alpha})\sim\left(\frac{|\log\alpha|}{I_\theta}\right)^m\sim\inf_{T\in\mathbb{C}(\alpha)}R^m_{k,\theta}(T). \tag{43}
$$

Therefore, the MSR rule $\widetilde T_{A_\alpha}$ is asymptotically optimal as $\alpha\to0$ in class $\mathbb{C}(\alpha)$, minimizing moments of the detection delay up to order $r$.

VI. ASYMPTOTIC OPTIMALITY WITH RESPECT TO THE INTEGRATED RISK

Instead of the constrained optimization problem (12) consider now the unconstrained, "purely" Bayes problem with the loss function

$$
L_r(T,\nu)=\mathbb{1}_{\{T\le\nu\}}+c\,(T-\nu)^r\,\mathbb{1}_{\{T>\nu\}},
$$

where $c>0$ is the cost of delay per unit of time and $r\ge1$. The unknown parameter $\theta$ is now assumed random and the weight function $W(\vartheta)$ is interpreted as the prior distribution of $\theta$. The expected loss (integrated risk) associated with the detection rule $T$ is given by

$$
\rho^{c,r}_{\pi,W}(T)=\mathsf{P}^\pi(T\le\nu)+c\int_\Theta\mathsf{E}^\pi_\vartheta\left[(T-\nu)^+\right]^r dW(\vartheta).
$$

Below we show that the MS rule $T_A$ with a certain threshold $A=A_{c,r}$ that depends on the cost $c$ is asymptotically optimal, minimizing the integrated risk $\rho^{c,r}_{\pi,W}(T)$ over all stopping times as the cost vanishes, $c\to0$. Define

$$
R^r_{\pi,W}(T)=\int_\Theta\bar R^r_{\pi,\vartheta}(T)\,dW(\vartheta).
$$

Observe that, if we ignore the overshoot, then $\mathrm{PFA}(T_A)\approx1/(1+A)$ and that, using approximation (27), we may expect that for a large $A$

$$
R^r_{\pi,W}(T_A)\approx\int_\Theta\left(\frac{\log A}{I_\vartheta+\mu}\right)^r dW(\vartheta)=(\log A)^r D_{\mu,r},
$$

where

$$
D_{\mu,r}=\int_\Theta\left(\frac{1}{I_\vartheta+\mu}\right)^r dW(\vartheta).
$$

So for a large $A$ the integrated risk of the MS rule is approximately equal to

$$
\rho^{c,r}_{\pi,W}(T_A)=\mathrm{PFA}(T_A)+c\,[1-\mathrm{PFA}(T_A)]\,R^r_{\pi,W}(T_A)\approx1/A+c\,D_{\mu,r}(\log A)^r:=G_{c,r}(A).
$$

The threshold value $A=A_{c,r}$ that minimizes $G_{c,r}(A)$, $A>0$, is a solution of the equation

$$
r\,D_{\mu,r}\,A(\log A)^{r-1}=1/c. \tag{44}
$$

In particular, for $r=1$ we obtain $A_{c,1}=1/(cD_{\mu,1})$. Thus, it is reasonable to conjecture that threshold $A_{c,r}$ optimizes the performance of the MS rule for a small $c$ and, hence, makes this rule asymptotically optimal as $c\to0$. In the next theorem, whose proof is given in the Appendix, we establish that the MS rule $T_{A_{c,r}}$ with threshold $A_{c,r}$ that satisfies (44) is indeed asymptotically optimal as $c\to0$ under conditions C1 and C2 when the set $\Theta$ is compact.

Theorem 5. Let the prior distribution of the change point belong to class $\mathbb{C}(\mu)$. Assume that for some $0<I_\theta<\infty$, $\theta\in\Theta$, the right-tail and left-tail conditions C1 and C2 are satisfied and that $\Theta$ is a compact set. Let $A=A_{c,r}$ be the solution of equation (44). Then, as $c\to0$,

$$
\inf_{T\ge0}\rho^{c,r}_{\pi,W}(T)\sim D_{\mu,r}\,c\,|\log c|^r\sim\rho^{c,r}_{\pi,W}(T_{A_{c,r}}). \tag{45}
$$
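Under these approximations, equation (44) can be solved numerically. The following sketch (illustrative only; the function name, bracket-doubling strategy, and iteration count are my own choices, not from the paper) exploits that the left-hand side of (44) is strictly increasing in $A>1$, so bisection suffices:

```python
import math

# Hedged sketch: solve equation (44), r*D*A*(log A)^(r-1) = 1/c, for the
# MS threshold A_{c,r}. Since g(A) = r*D*A*(log A)^(r-1) is increasing in
# A > 1, a simple bracketed bisection finds the root.
def threshold_A(c, D, r):
    g = lambda A: r * D * A * math.log(A) ** (r - 1)
    target = 1.0 / c
    lo, hi = 1.0 + 1e-9, 2.0
    while g(hi) < target:      # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(200):       # bisection to machine precision
        mid = 0.5 * (lo + hi)
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For $r=1$ the root is explicit, $A_{c,1}=1/(cD_{\mu,1})$, which provides a quick sanity check of the solver.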
Finally, the results analogous to Theorems 2 and 4 in the case where the prior distribution has an exponential tail, i.e., $\mu>0$, but $\mu=\mu_c\to0$ as $c\to0$, also hold for the integrated risk. Specifically, let $D_r=D_{\mu=0,r}$, i.e.,

$$
D_r=\int_\Theta\frac{1}{I_\vartheta^r}\,dW(\vartheta).
$$

Note that the mean of the prior distribution of the change point $\bar\nu=\sum_{j=1}^\infty j\pi_j^c=\bar\nu_c$, the headstart of the MSR statistic $\omega=\omega_c$, and the value of $b=\sum_{j=1}^\infty\pi_j^c=b_c$ are functions of the cost $c$. The following theorem spells out the details. The proof is given in the Appendix.

Theorem 6. Assume that for some $0<I_\theta<\infty$, $\theta\in\Theta$, the right-tail and left-tail conditions C1 and C3 are satisfied and that $\Theta$ is compact.

(i) If the prior distribution $\pi^c=\{\pi_k^c\}$ satisfies condition (17) with $\mu=\mu_c\to0$ as $c\to0$ at such a rate that

$$
\lim_{c\to0}\frac{\sum_{k=0}^\infty\pi_k^c\,|\log\pi_k^c|^r}{|\log c|^r}=0 \tag{46}
$$

and threshold $A=A_{c,r}$ of the MS rule $T_A$ is the solution of the equation

$$
r\,D_r\,A(\log A)^{r-1}=1/c, \tag{47}
$$

then, as $c\to0$,

$$
\inf_{T\ge0}\rho^{c,r}_{\pi^c,W}(T)\sim D_r\,c\,|\log c|^r\sim\rho^{c,r}_{\pi^c,W}(T_{A_{c,r}}). \tag{48}
$$

Therefore, the MS rule $T_{A_{c,r}}$ is asymptotically optimal as $c\to0$.

(ii) If the headstart $\omega_c$ and the mean of the prior distribution $\bar\nu_c$ approach infinity at such a rate that

$$
\lim_{c\to0}\frac{\log(\omega_c+\bar\nu_c)}{|\log c|}=0 \tag{49}
$$

and if the threshold $A=A_{c,r}$ of the MSR rule $\widetilde T_A$ is the solution of the equation

$$
r\,D_r\,A(\log A)^{r-1}=(\omega_c b_c+\bar\nu_c)/c, \tag{50}
$$

then, as $c\to0$,

$$
\inf_{T\ge0}\rho^{c,r}_{\pi^c,W}(T)\sim D_r\,c\,|\log c|^r\sim\rho^{c,r}_{\pi^c,W}(\widetilde T_{A_{c,r}}). \tag{51}
$$

Therefore, the MSR rule $\widetilde T_{A_{c,r}}$ is asymptotically optimal as $c\to0$.

VII. EXAMPLES

Remark 1. Obviously, the following condition implies conditions C2 and C3:

C4. For any $\varepsilon>0$ there exists $\delta=\delta_\varepsilon>0$ such that $W(\Gamma_{\delta,\theta})>0$.

Let the $\Theta\to\mathbb{R}_+$ function $I(\theta)=I_\theta$ be continuous and assume that for every compact set $\Theta_c\subseteq\Theta$, every $\varepsilon>0$, and some $r\ge1$,

$$
\Upsilon_r^*(\varepsilon,\Theta_c):=\sup_{\theta\in\Theta_c}\Upsilon_r(\varepsilon,\theta)=\sum_{n=1}^\infty n^{r-1}\sup_{\theta\in\Theta_c}\sup_{k\in\mathbb{Z}_+}\mathsf{P}_{k,\theta}\left(\frac{1}{n}\inf_{\vartheta\in\Gamma_{\delta,\theta}}\lambda_{k,k+n}(\vartheta)<I_\theta-\varepsilon\right)<\infty. \tag{52}
$$

Hence, it is sufficient for asymptotic optimality of the MS rule as well as for the asymptotic results related to the MSR rule. Note also that if there exists a continuous $\Theta\times\Theta\to\mathbb{R}_+$ function $I(\vartheta,\theta)$ such that for any $\varepsilon>0$, any compact $\Theta_c\subseteq\Theta$, and some $r\ge1$,

$$
\Upsilon_r^{**}(\varepsilon,\Theta_c):=\sum_{n=1}^\infty n^{r-1}\sup_{k\in\mathbb{Z}_+}\sup_{\theta\in\Theta_c}\mathsf{P}_{k,\theta}\left(\sup_{\vartheta\in\Theta_c}\left|\frac{1}{n}\lambda_{k,k+n}(\vartheta)-I(\vartheta,\theta)\right|>\varepsilon\right)<\infty, \tag{53}
$$

then condition C4, and hence conditions C2 and C3, are satisfied with $I_\theta=I(\theta,\theta)$, since for a sufficiently small $\delta$

$$
\mathsf{P}_{k,\theta}\left(\frac{1}{n}\inf_{|\vartheta-\theta|<\delta}\lambda_{k,k+n}(\vartheta)<I_\theta-\varepsilon\right)\le\mathsf{P}_{k,\theta}\left(\sup_{\vartheta\in\Theta_c}\left|\frac{1}{n}\lambda_{k,k+n}(\vartheta)-I(\vartheta,\theta)\right|>\varepsilon\right).
$$

As we will see, conditions C4 and (53) are useful in checking the applicability of the theorems in particular examples.

Example 1 (Detection of Signals with Unknown Amplitudes in a Multichannel System). Assume there is a multichannel system with $N$ channels (or, alternatively, an $N$-sensor system) and one is able to observe the output vector $X_n=(X_n^1,\dots,X_n^N)$, $n=1,2,\dots$, where the observations in the $i$th channel are of the form

$$
X_n^i=\theta_i S_n^i\,\mathbb{1}_{\{n>\nu\}}+\xi_n^i,\quad n\ge1.
$$
Here $\theta_i S_n^i$ is a deterministic signal with an unknown amplitude $\theta_i>0$ that may appear at an unknown time $\nu$ in additive noise $\xi_n^i$. For the sake of simplicity, suppose that all signals appear at the same unknown time $\nu$. Assume that the noises $\{\xi_n^i\}_{n\in\mathbb{Z}_+}$, $i=1,\dots,N$, are mutually independent $p_i$-th order Gaussian autoregressive processes AR($p_i$), i.e.,

$$
\xi_n^i=\sum_{j=1}^{p_i}\beta_j^i\,\xi_{n-j}^i+w_n^i,\quad n\ge1, \tag{54}
$$

where $\{w_n^i\}_{n\ge1}$ are mutually independent i.i.d. normal $\mathcal{N}(0,1)$ sequences and the initial values $\xi_{1-p_i}^i,\xi_{2-p_i}^i,\dots,\xi_0^i$ are arbitrary random or deterministic numbers; in particular, we may set zero initial conditions $\xi_{1-p_i}^i=\xi_{2-p_i}^i=\cdots=\xi_0^i=0$. The coefficients $\beta_1^i,\dots,\beta_{p_i}^i$ are known and all roots of the equations $z^{p_i}-\beta_1^i z^{p_i-1}-\cdots-\beta_{p_i}^i=0$ are in the interior of the unit circle, so that the AR($p_i$) processes are stable. Let $\varphi(x)=(2\pi)^{-1/2}e^{-x^2/2}$ denote the density of the standard normal distribution. Define the $p_n^i$-th order residual

$$
\widetilde X_n^i=X_n^i-\sum_{j=1}^{p_n^i}\beta_j^i X_{n-j}^i,\quad n\ge1,
$$

where $p_n^i=p_i$ if $n>p_i$ and $p_n^i=n$ if $n\le p_i$. Write $\theta=(\theta_1,\dots,\theta_N)$ and $\Theta=(0,\infty)\times\cdots\times(0,\infty)$ ($N$ times). It is easy to see that the conditional pre-change density is

$$
g(X_n|\mathbf{X}^{n-1})=\prod_{i=1}^N\varphi(\widetilde X_n^i),\quad n\ge1,
$$

and the post-change density is

$$
f_\theta(X_n|\mathbf{X}^{n-1})=\prod_{i=1}^N\varphi(\widetilde X_n^i-\theta_i\widetilde S_n^i),\quad\theta\in\Theta,
$$

where $\widetilde S_n^i=S_n^i-\sum_{j=1}^{p_n^i}\beta_j^i S_{n-j}^i$. Obviously, due to the independence of the data across channels, for all $k\in\mathbb{Z}_+$ and $n\ge1$ the LLR has the form

$$
\lambda_{k,k+n}(\vartheta)=\sum_{i=1}^N\left(\vartheta_i\sum_{j=k+1}^{k+n}\widetilde S_j^i\widetilde X_j^i-\frac{\vartheta_i^2\sum_{j=k+1}^{k+n}(\widetilde S_j^i)^2}{2}\right).
$$

Under measure $\mathsf{P}_{k,\theta}$ the random variables $\{\widetilde X_n^i\}_{n\ge k+1}$ are independent Gaussian random variables with mean $\mathsf{E}_{k,\theta}[\widetilde X_n^i]=\theta_i\widetilde S_n^i$ and unit variance, and hence under $\mathsf{P}_{k,\theta}$ the normalized LLR can be written as

$$
\frac{1}{n}\lambda_{k,k+n}(\vartheta,\theta)=\sum_{i=1}^N\left[(\vartheta_i\theta_i-\vartheta_i^2/2)\,\frac{1}{n}\sum_{j=k+1}^{k+n}(\widetilde S_j^i)^2+\vartheta_i\,\frac{1}{n}\sum_{j=k+1}^{k+n}\widetilde S_j^i\eta_j^i\right], \tag{55}
$$

where $\{\eta_j^i\}_{j\ge k+1}$, $i=1,\dots,N$, are mutually independent sequences of i.i.d. standard normal random variables.

Assume that

$$
\lim_{n\to\infty}\sup_{k\in\mathbb{Z}_+}\frac{1}{n}\sum_{j=k+1}^{k+n}|\widetilde S_j^i|^2=Q_i, \tag{56}
$$

where $0<Q_i<\infty$. This is typically the case in most signal processing applications, e.g., for harmonic signals $S_n^i=\sin(\omega_i n+\phi_n^i)$. Then for all $k\in\mathbb{Z}_+$ and $\theta\in\Theta$

$$
\frac{1}{n}\lambda_{k,k+n}(\theta)\xrightarrow[n\to\infty]{\mathsf{P}_{k,\theta}\text{-a.s.}}\sum_{i=1}^N\frac{\theta_i^2Q_i}{2}=I_\theta,
$$

so that condition C1 holds. Furthermore, since all moments of the LLR are finite, it is straightforward to show that conditions (52) and (53), and hence conditions C2 and C3, hold for all $r\ge1$. Indeed, using (55), we obtain that $I(\vartheta,\theta)=\sum_{i=1}^N(\vartheta_i\theta_i-\vartheta_i^2/2)Q_i$ and for any $\delta>0$

$$
\mathsf{P}_{k,\theta}\left(\sup_{\vartheta\in[\theta-\delta,\theta+\delta]}\left|\frac{1}{n}\lambda_{k,k+n}(\vartheta)-I(\vartheta,\theta)\right|>\varepsilon\right)=\mathsf{P}_{k,\theta}\left(|Y_{k,n}(\theta)|>\varepsilon\sqrt n\right),
$$

where

$$
Y_{k,n}(\theta)=\frac{1}{\sqrt n}\sum_{i=1}^N\theta_i\sum_{j=k+1}^{k+n}\widetilde S_j^i\eta_j^i
$$

is a sequence of normal random variables with mean zero and variance $\sigma_n^2=n^{-1}\sum_{i=1}^N\theta_i^2\sum_{j=k+1}^{k+n}(\widetilde S_j^i)^2$, which by (56) is asymptotic to $\sum_{i=1}^N\theta_i^2Q_i$. Thus, for a sufficiently large $n$ there exists $\delta_0>0$ such that $\sigma_n^2\le\delta_0+\sum_{i=1}^N\theta_i^2Q_i$, and we obtain that for all large $n$

$$
\mathsf{P}_{k,\theta}\left(\sup_{\vartheta\in[\theta-\delta,\theta+\delta]}\left|\frac{1}{n}\lambda_{k,k+n}(\vartheta)-I(\vartheta,\theta)\right|>\varepsilon\right)\le\mathsf{P}\left(|\hat\eta|>\frac{\varepsilon\sqrt n}{\sigma_n}\right)\le\mathsf{P}\left(|\hat\eta|>\frac{\varepsilon\sqrt n}{\sqrt{\delta_0+\sum_{i=1}^N\theta_i^2Q_i}}\right),
$$

where $\hat\eta\sim\mathcal{N}(0,1)$ is a standard normal random variable. Hence, for all $r\ge1$

$$
\sum_{n=1}^\infty n^{r-1}\sup_{k\in\mathbb{Z}_+}\sup_{\theta\in\Theta_c}\mathsf{P}_{k,\theta}\left(\sup_{\vartheta\in[\theta-\delta,\theta+\delta]}\left|\frac{1}{n}\lambda_{k,k+n}(\vartheta)-I(\vartheta,\theta)\right|>\varepsilon\right)<\infty,
$$

which implies (53) for all $r\ge1$.

Thus, the MS rule minimizes as $\alpha\to0$ all positive moments of the detection delay. All asymptotic assertions for the MSR rule presented in Section V also hold with $I_\theta=\sum_{i=1}^N\theta_i^2Q_i/2$. In particular, by Theorem 4, the MSR rule is also asymptotically optimal for all $r\ge1$ if the prior distribution of the change point is either heavy-tailed or asymptotically flat.

Since by condition C2 the MS and MSR procedures are asymptotically optimal for an almost arbitrary mixing distribution $W(\theta)$, in this example it is most convenient to select the conjugate prior $W(\theta)=\prod_{i=1}^N F(\theta_i/v_i)$, where $F(y)$ is the standard normal distribution and $v_i>0$, in which case the MS and MSR statistics can be computed explicitly.

Note that this example arises in certain interesting practical applications, as discussed in [33]. For example, surveillance systems (radar, acoustic, EO/IR) typically deal with detecting moving and maneuvering targets that appear at unknown times, and it is necessary to detect a signal from a randomly appearing target in clutter and noise with the smallest possible delay. In radar applications, often the signal represents a sequence of modulated pulses, and clutter/noise can be modeled as a Markov Gaussian process or, more generally, as a $p$-th order Markov process (see, e.g., [36], [50]). In underwater detection of objects with active sonars, reverberation creates very strong clutter that represents a correlated process in time [51], so that again the problem can be reduced to detection of a signal with an unknown intensity in correlated clutter. In applications related to detection of point and slightly extended objects with EO/IR sensors (on moving and still platforms such as space-based, airborne, ship-board, and ground-based), sequences of images usually contain a cluttered background which is correlated in space and time, and it is a challenge to detect and track weak objects in correlated clutter [44]. Yet another challenging application area where the multichannel model is useful is cyber-security [46], [47], [49]. Malicious intrusion attempts in computer networks (spam campaigns, personal data theft, worms, distributed denial-of-service (DDoS) attacks, etc.) incur significant financial damage and are a severe harm to the integrity of personal information. It is therefore essential to devise automated techniques to detect computer network intrusions as quickly as possible so that an appropriate response can be provided and the negative consequences for the users are eliminated. In particular, DDoS attacks typically involve many traffic streams resulting in a large number of packets aimed at congesting the target's server or network. As a result, these attacks usually lead to abrupt changes in network traffic and can be detected by noticing a change in the average number of packets sent through the victim's link per unit time.
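The statistics of Example 1 are simple enough to compute directly. The sketch below makes several simplifying assumptions of my own (a single channel $N=1$, AR(1) noise with zero initial condition, a discrete weight $W$ on a grid of candidate amplitudes instead of the conjugate normal prior, and no headstart), and computes the mixture Shiryaev–Roberts statistic from the whitened residuals:

```python
import math

# Hedged sketch, not the paper's implementation: one-channel version of
# Example 1. The AR(1) noise is whitened by the residuals
#   X~_n = X_n - beta*X_{n-1},   S~_n = S_n - beta*S_{n-1},
# so that X~_n ~ N(theta*S~_n, 1) after the change. The mixture SR statistic
# sums, over hypothesized change points nu = k, the likelihood ratios mixed
# over a discrete grid of amplitudes (thetas, weights).
def mixture_sr_statistic(X, S, beta, thetas, weights):
    n = len(X)
    Xr = [X[0]] + [X[t] - beta * X[t - 1] for t in range(1, n)]  # residuals
    Sr = [S[0]] + [S[t] - beta * S[t - 1] for t in range(1, n)]
    R = 0.0
    for k in range(n):  # change at nu = k: post-change samples are indices k..n-1
        a = sum(Sr[j] * Xr[j] for j in range(k, n))
        b = sum(Sr[j] ** 2 for j in range(k, n))
        for th, w in zip(thetas, weights):
            # log LR for amplitude th: th*a - th^2*b/2
            R += w * math.exp(th * a - 0.5 * th * th * b)
    return R
```

With a single grid point $\theta=0$ each likelihood ratio equals 1, so the statistic reduces to the number of observations, which provides a quick correctness check.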
Figure 1 illustrates how the multichannel anomaly Intrusion Detection System works for detecting a real UDP packet storm. The multichannel MSR algorithm with the AR(1) model and uniform prior $W(\theta_i)$ on a finite interval $[1,5]$ was used. The first plot shows the packet rate. It is seen that there is a slight change in the mean, which is barely visible. The second plot shows the behavior of the multicyclic MSR statistic $W_n=\log R_n^W$, which is restarted from scratch every time a threshold exceedance occurs. Threshold exceedances before the UDP DDoS attack starts (i.e., false alarms) are shown by green dots and the true detections are marked by red dots.

Example 2 (Detection of Changes in a Hidden Markov Model). The following example, which deals with a two-state hidden Markov model with i.i.d. observations in each state, may be of interest, in particular, for rapid detection and tracking of sudden spurts and downfalls in activity profiles of terrorist groups that could be caused by various factors such as changes in the organizational dynamics of terrorist groups, counterterrorism activity, changing socio-economic and political contexts, etc. In [52], based on the analysis of real data from the Fuerzas Armadas Revolucionarias de Colombia (FARC) terrorist group from Colombia (RDWTI), it was shown that two-state HMMs can be recommended for detecting and tracking sudden changes in activity profiles of terrorist groups. The HMM framework provides good explanation capability of past/future activity across a large set of terrorist groups with different ideological attributes.

Specifically, let $\upsilon_n\in\{1,2\}$ be a two-state Markov chain with the transition matrix

$$
[\mathsf{P}_\theta(\upsilon_{n-1}=i,\upsilon_n=l)]=\begin{pmatrix}1-\beta_\theta&\beta_\theta\\ \gamma_\theta&1-\gamma_\theta\end{pmatrix}
$$

and stationary initial distribution $\mathsf{P}_\theta(\upsilon_0=2)=1-\mathsf{P}_\theta(\upsilon_0=1)=\pi_\theta(2)=\gamma_\theta/(\beta_\theta+\gamma_\theta)$ for some $\beta_\theta,\gamma_\theta\in[0,1]$, where the parameter $\theta$ equals $\theta_0$ in the pre-change mode ($\theta_0$ is known) and $\theta\in\Theta$ in the post-change mode (unknown). Suppose that, conditioned on $\upsilon_n$, the observations $X_n$ are i.i.d. with densities $p_\theta(X_n|\upsilon_n=l)=p_\theta^{(l)}(X_n)$, $l=1,2$.

Introduce the probabilities $P_{\theta,n}:=\mathsf{P}_\theta(\mathbf{X}_n,\upsilon_n=2)$ and $\widetilde P_{\theta,n}:=\mathsf{P}_\theta(\mathbf{X}_n,\upsilon_n=1)$. Straightforward computations show that for $n\ge1$

$$
P_{\theta,n}=\left[P_{\theta,n-1}(1-\gamma_\theta)+\widetilde P_{\theta,n-1}\beta_\theta\right]p_\theta^{(2)}(X_n);\qquad
\widetilde P_{\theta,n}=\left[P_{\theta,n-1}\gamma_\theta+\widetilde P_{\theta,n-1}(1-\beta_\theta)\right]p_\theta^{(1)}(X_n)
$$

with initial values $P_{\theta,0}=\pi_\theta(2)$ and $\widetilde P_{\theta,0}=\pi_\theta(1)=1-\pi_\theta(2)$. Denote $p_{0,\theta}(X_n)=p_\theta(X_n)$ and $p_\infty(X_n)=p_{\theta_0}(X_n)$. Since $p_\theta(\mathbf{X}_n)=P_{\theta,n}+\widetilde P_{\theta,n}$, we obtain that

$$
p_{k,\theta}(\mathbf{X}_{k+1}^n|\mathbf{X}_0^k)=\frac{P_{\theta,n}+\widetilde P_{\theta,n}}{P_{\theta,k}+\widetilde P_{\theta,k}}
$$

and

$$
\lambda_{k,k+n}(\theta)=\log\left(\frac{P_{\theta,k+n}+\widetilde P_{\theta,k+n}}{P_{\theta_0,k+n}+\widetilde P_{\theta_0,k+n}}\cdot\frac{P_{\theta_0,k}+\widetilde P_{\theta_0,k}}{P_{\theta,k}+\widetilde P_{\theta,k}}\right).
$$

For the sake of simplicity, consider now the symmetric case where $\beta_\theta=\gamma_\theta=1/2$ for all $\theta\in\Theta\cup\{\theta_0\}$. Then

$$
P_{\theta,n}=\frac{1}{2^n}\,p_\theta^{(2)}(X_n)\prod_{i=1}^{n-1}\left[p_\theta^{(1)}(X_i)+p_\theta^{(2)}(X_i)\right],\qquad
\widetilde P_{\theta,n}=\frac{1}{2^n}\,p_\theta^{(1)}(X_n)\prod_{i=1}^{n-1}\left[p_\theta^{(1)}(X_i)+p_\theta^{(2)}(X_i)\right],
$$

and we obtain that the LLR is

$$
\lambda_{k,k+n}(\theta)=\sum_{i=k+1}^{k+n}\log\left(\frac{p_\theta^{(1)}(X_i)+p_\theta^{(2)}(X_i)}{p_{\theta_0}^{(1)}(X_i)+p_{\theta_0}^{(2)}(X_i)}\right).
$$
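In the symmetric case the LLR above depends on the data only through equal-weight two-component mixture densities, so it is trivial to compute sequentially. A minimal sketch, assuming (for illustration only; the paper does not fix them) unit-variance Gaussian state densities with given means:

```python
import math

# Hedged sketch of the symmetric HMM case (beta = gamma = 1/2): the LLR is a
# sum over observations of log-ratios of the equal-weight two-component
# mixture densities. Gaussian state densities with illustrative means are
# assumed here.
def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def llr_symmetric(X, mu_post, mu_pre):
    # mu_post, mu_pre: pairs (mu^(1), mu^(2)) of state means after/before the change
    s = 0.0
    for x in X:
        num = phi(x - mu_post[0]) + phi(x - mu_post[1])
        den = phi(x - mu_pre[0]) + phi(x - mu_pre[1])
        s += math.log(num / den)
    return s
```

When the post- and pre-change means coincide the LLR is identically zero, which gives a quick sanity check.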
[Figure 1 appears here: two panels, "Packet Rate" (packets per second vs. $t$, sec, with the attack onset marked) and "Detection Statistic" $W_n$ with the detection threshold marked.]

Fig. 1. Detection of the UDP DDoS packet storm attack: upper picture — raw data (packet rate); bottom — log of the MSR statistic.
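The multicyclic mode of operation shown in Fig. 1 can be sketched as follows (a minimal illustration assuming the standard Shiryaev–Roberts recursion $R_n=(1+R_{n-1})\Lambda_n$ for a scalar likelihood-ratio sequence; the variable names and the zero headstart after restarts are my own choices):

```python
# Hedged sketch of a multicyclic detection scheme: the SR statistic is
# monitored against a threshold A and restarted from scratch after every
# exceedance; all alarm times (false alarms and true detections) are recorded.
def multicyclic_alarms(lrs, A, headstart=0.0):
    alarms = []
    R = headstart
    for n, lr in enumerate(lrs, start=1):
        R = (1.0 + R) * lr          # Shiryaev-Roberts recursion
        if R >= A:
            alarms.append(n)        # raise an alarm ...
            R = headstart           # ... and restart from scratch
    return alarms
```

In the multicyclic regime, exceedances before the true change point play the role of false alarms, exactly as marked by the green dots in Fig. 1.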
Condition C1 holds with

$$
I_\theta=\int\log\left(\frac{p_\theta^{(1)}(x)+p_\theta^{(2)}(x)}{p_{\theta_0}^{(1)}(x)+p_{\theta_0}^{(2)}(x)}\right)\frac{1}{2}\left[p_\theta^{(1)}(x)+p_\theta^{(2)}(x)\right]dx
$$

being the Kullback–Leibler information number, since by the SLLN $n^{-1}\lambda_{k,k+n}(\theta)\to I_\theta$ $\mathsf{P}_{k,\theta}$-a.s. (assuming that $I_\theta<\infty$). Condition (52) usually holds if the $(r+1)$-th absolute moment of the increment of the LLR is finite:

$$
\int\left|\log\left(\frac{p_\theta^{(1)}(x)+p_\theta^{(2)}(x)}{p_{\theta_0}^{(1)}(x)+p_{\theta_0}^{(2)}(x)}\right)\right|^{r+1}\frac{1}{2}\left[p_\theta^{(1)}(x)+p_\theta^{(2)}(x)\right]dx<\infty.
$$

This is the case, for example, if the observations are Gaussian with unit variance and different mean values in the pre- and post-change modes as well as for different states, i.e., $p_\theta^{(l)}(y)=\varphi(y-\mu_\theta^{(l)})$ ($\theta=\theta_0$ or $\theta\ne\theta_0$, $l=1,2$). It is easily verified that the Kullback–Leibler number $I_\theta$ is finite and that condition (52) is satisfied for all $r\ge1$. Therefore, in this case, the MS and MSR detection rules are asymptotically optimal, minimizing asymptotically all positive moments of the detection delay.

VIII. CONCLUDING REMARKS

1. In the case where the increments $\{\Delta\lambda_i(\theta)\}$ of the LLR $\lambda_{k,n}(\theta)=\sum_{i=k+1}^n\Delta\lambda_i(\theta)$ are independent (but not necessarily identically distributed), condition C2 in Theorem 1 and Theorem 3 and condition C3 in Theorem 2 and Theorem 4 can be relaxed to the following condition: for all $\ell\ge k$ and $k\in\mathbb{Z}_+$,

$$
\mathsf{P}_{k,\theta}\left(\frac{1}{n}\int_{\Gamma_{\delta,\theta}}\lambda_{\ell,\ell+n}(\vartheta)\,dW(\vartheta)<I_\theta-\varepsilon\right)\xrightarrow[n\to\infty]{}0. \tag{57}
$$

More specifically, the MS rule asymptotically minimizes all moments of the delay to detection under the right-tail and left-tail conditions C1 and (57). Also, the assertions of Theorem 2 and Theorem 4 for the MSR rule hold for all $r\ge1$ under conditions C1 and (57). The proof can be built by considering the cycles $[k+(n-1)N_A,\,k+nN_A]$, $n=1,2,\dots$, of length $N_A=1+\lfloor\log A/(I_\theta+\mu-\varepsilon)\rfloor$ and slightly modifying the technique developed by Tartakovsky [26] in the case of complete knowledge of the post-change distribution.

2. Since we do not assume a class of models for the observations, such as Gaussian, Markov, or HMM, and build the decision statistics on the LLR process $\lambda_{k,k+n}(\theta)$, it is natural to impose conditions on the behavior of $\lambda_{k,k+n}(\theta)$, which is expressed by conditions C1, C2, and C3, related to the law of large numbers for the LLR and rates of convergence in the law of large numbers. The assertions of Theorems 1–4 hold if $n^{-1}\lambda_{k,k+n}(\theta)$ and $n^{-1}\log\Lambda^W_{k,k+n}$ converge uniformly $r$-completely to $I_\theta$ under $\mathsf{P}_{k,\theta}$, i.e., when for all $\varepsilon>0$ and $\theta\in\Theta$

$$
\sum_{n=1}^\infty n^{r-1}\sup_{k\in\mathbb{Z}_+}\mathsf{P}_{k,\theta}\left(\left|\frac{1}{n}\lambda_{k,k+n}(\theta)-I_\theta\right|>\varepsilon\right)<\infty,\qquad
\sum_{n=1}^\infty n^{r-1}\sup_{k\in\mathbb{Z}_+}\mathsf{P}_{k,\theta}\left(\left|\frac{1}{n}\log\Lambda^W_{k,k+n}-I_\theta\right|>\varepsilon\right)<\infty. \tag{58}
$$

However, verifying the $r$-complete convergence condition (58) for the weighted LLR $\log\Lambda^W_{k,k+n}$ is typically much more difficult than checking conditions C2 and C3 for the local values of the LLR in the vicinity of the true parameter value. For the simple post-change hypothesis, sufficient conditions for the class of ergodic Markov models are given in [29] and for HMMs in [27].
In these cases, the LLR process is a Markov random walk and one of the key conditions is finiteness of the $(r+1)$-th moment of its increment.

3. As expected, the results indicate that the MSR rule is not asymptotically optimal when the prior distribution of the change point has an exponential tail (i.e., $\mu>0$), but it is asymptotically optimal for heavy-tailed prior distributions (i.e., $\mu=0$) and also when $\mu\to0$ with a certain rate.

4. The results show that the first-order asymptotic optimality properties of the MS and MSR procedures hold for a practically arbitrary weight function $W(\theta)$, in particular for any prior that has strictly positive values on $\Theta$. Therefore, the selection of $W(\theta)$ can be based solely on computational aspects. The conjugate prior is typically the best choice when possible. However, if the parameter is a vector and the parameter space is intricate, constructing mixture statistics may be difficult. In this case, discretizing the parameter space and selecting the prior $W(\theta=\theta_i)$ concentrated on discrete points $\theta_i$, $i=1,\dots,N$, suggested and discussed in [53] for hypothesis testing problems, is perhaps the best option. Then one can easily compute the MS and MSR statistics (as long as the LR $\Lambda_{k,k+n}(\theta)$ can be computed) at the expense of losing optimality between the points $\theta_i$, since the resulting discrete versions are asymptotically optimal only at the points $\theta_i$.

ACKNOWLEDGEMENT

The author would like to thank Prof. Sergey Pergamenchtchikov for pointing out that conditions (52) and (53) are sufficient for condition C3, which was useful in the verification of C3 in the examples. Thanks also go to the referees whose comments improved the presentation.

APPENDIX: PROOFS

Proof of Lemma 1: For $\varepsilon\in(0,1)$ and $\delta>0$, define $N_\alpha=N_\alpha(\varepsilon,\delta,\theta)=(1-\varepsilon)|\log\alpha|/(I_\theta+\mu+\delta)$. By the Chebyshev inequality,

$$
R^r_{k,\theta}(T)\ge\mathsf{E}_{k,\theta}\left[(T-k)^+\right]^r\ge N_\alpha^r\,\mathsf{P}_{k,\theta}(T-k\ge N_\alpha)\ge N_\alpha^r\left[\mathsf{P}_{k,\theta}(T>k)-\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\right],
$$

where $\mathsf{P}_{k,\theta}(T>k)=\mathsf{P}_\infty(T>k)$, so that

$$
\inf_{T\in\mathbb{C}(\alpha,\pi)}R^r_{k,\theta}(T)\ge N_\alpha^r\left[\inf_{T\in\mathbb{C}(\alpha,\pi)}\mathsf{P}_\infty(T>k)-\sup_{T\in\mathbb{C}(\alpha,\pi)}\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\right]. \tag{A.1}
$$

Thus, to prove the lower bound (22) we need to show that, for arbitrarily small $\varepsilon$ and $\delta$ and all fixed $k\in\mathbb{Z}_+$,

$$
\lim_{\alpha\to0}\inf_{T\in\mathbb{C}(\alpha,\pi)}\mathsf{P}_\infty(T>k)=1 \tag{A.2}
$$

and

$$
\lim_{\alpha\to0}\sup_{T\in\mathbb{C}(\alpha,\pi)}\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)=0. \tag{A.3}
$$

First, note that

$$
\alpha\ge\sum_{i=k}^\infty\pi_i\,\mathsf{P}_\infty(T\le i)\ge\mathsf{P}_\infty(T\le k)\,\mathsf{P}(\nu\ge k),
$$

and hence

$$
\inf_{T\in\mathbb{C}(\alpha,\pi)}\mathsf{P}_\infty(T>k)\ge1-\alpha/\mathsf{P}(\nu\ge k),\quad k\in\mathbb{Z}_+, \tag{A.4}
$$

which approaches 1 as $\alpha\to0$ for any fixed $k\in\mathbb{Z}_+$. Thus, (A.2) follows.

Now, introduce

$$
U_{\alpha,k}(T)=e^{(1+\varepsilon)I_\theta N_\alpha}\,\mathsf{P}_\infty(k<T<k+N_\alpha),\qquad
\beta_{\alpha,k}(\theta)=\mathsf{P}_{k,\theta}\left(\frac{1}{N_\alpha}\max_{1\le n\le N_\alpha}\lambda_{k,k+n}(\theta)\ge(1+\varepsilon)I_\theta\right).
$$

By inequality (3.6) in [24],

$$
\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\le U_{\alpha,k}(T)+\beta_{\alpha,k}(\theta). \tag{A.5}
$$

Using inequality (A.4) and the fact that, by condition (17), for all sufficiently large $N_\alpha$ (small $\alpha$) there exists a (small) $\delta$ such that

$$
\frac{|\log\mathsf{P}(\nu>k+N_\alpha)|}{k+N_\alpha}\le\mu+\delta,
$$

in just the same way as in the proof of Lemma 1 in [26] we obtain that for a sufficiently small $\alpha$

$$
\sup_{T\in\mathbb{C}(\alpha,\pi)}U_{\alpha,k}(T)\le\exp\left\{-\frac{I_\theta\,\varepsilon^2|\log\alpha|}{I_\theta+\mu+\delta}+(\mu+\delta)k\right\}. \tag{A.6}
$$

The right-hand side approaches zero as $\alpha\to0$ for any fixed $k\in\mathbb{Z}_+$ and any $\varepsilon>0$ and $\delta>0$. Also, by condition C1, $\beta_{\alpha,k}(\theta)\to0$ for all $k\in\mathbb{Z}_+$, and therefore (A.3) holds. This completes the proof of the lower bound (22).

Proof of Theorem 1: (i) For $k\in\mathbb{Z}_+$, define the stopping times

$$
\tau_A^{(k)}=\inf\left\{n\ge1:\log\Lambda^W_{k,k+n}+|\log\mathsf{P}(\nu>k+n)|\ge\log(A/\pi_k)\right\}.
$$

Obviously, for any $n>k$,

$$
\log S_n^W\ge\log\frac{\pi_k\Lambda^W_{k,n}}{\mathsf{P}(\nu>n)}=\log\Lambda^W_{k,n}+\log\pi_k-\log\mathsf{P}(\nu>n),
$$

and hence, for every $A>0$, $(T_A-k)^+\le\tau_A^{(k)}$. Let $N_A=N_A(\varepsilon,\theta)=1+\lfloor\log(A/\pi_k)/(I_\theta+\mu-\varepsilon)\rfloor$. Using the same chain of equalities and inequalities as
in (A.5) in [26], we obtain that for any $k\in\mathbb{Z}_+$ the following inequality holds:

$$
\mathsf{E}_{k,\theta}\left[(T_A-k)^+\right]^r\le\mathsf{E}_{k,\theta}\left[\tau_A^{(k)}\right]^r\le N_A^r+r2^{r-1}\sum_{n=N_A}^\infty n^{r-1}\,\mathsf{P}_{k,\theta}\left(\tau_A^{(k)}>n\right). \tag{A.7}
$$

It is easily seen that for all $k\in\mathbb{Z}_+$ and $n\ge N_A$

$$
\mathsf{P}_{k,\theta}\left(\tau_A^{(k)}>n\right)\le\mathsf{P}_{k,\theta}\left(\frac{\log\Lambda^W_{k,k+n}}{n}<\frac{1}{n}\log\frac{A}{\pi_k}-\frac{|\log\mathsf{P}(\nu>k+n)|}{n}\right)\le\mathsf{P}_{k,\theta}\left(\frac{\log\Lambda^W_{k,k+n}}{n}<I_\theta+\mu-\varepsilon-\frac{|\log\mathsf{P}(\nu>k+n)|}{n}\right).
$$

Since, by condition CP1, $N_A^{-1}|\log\mathsf{P}(\nu>k+N_A)|\to\mu$ as $A\to\infty$, for a sufficiently large value of $A$ there exists a small $\kappa=\kappa_A$ ($\kappa_A\to0$ as $A\to\infty$) such that

$$
\mu-\frac{|\log\mathsf{P}(\nu>k+N_A)|}{N_A}<\kappa.
$$

Hence, for all sufficiently large $A$,

$$
\mathsf{P}_{k,\theta}\left(\tau_A^{(k)}>n\right)\le\mathsf{P}_{k,\theta}\left(\frac{1}{n}\log\Lambda^W_{k,k+n}<I_\theta-\varepsilon+\kappa\right).
$$

Also,

$$
\log\Lambda^W_{k,k+n}\ge\inf_{\vartheta\in\Gamma_{\delta,\theta}}\lambda_{k,k+n}(\vartheta)+\log W(\Gamma_{\delta,\theta}),
$$

where $\Gamma_{\delta,\theta}=\{\vartheta\in\Theta:|\vartheta-\theta|<\delta\}$. Thus, for all sufficiently large $n$ and $\varepsilon_1>0$,

$$
\mathsf{P}_{k,\theta}\left(\tau_A^{(k)}>n\right)\le\mathsf{P}_{k,\theta}\left(\frac{1}{n}\inf_{\vartheta\in\Gamma_{\delta,\theta}}\lambda_{k,k+n}(\vartheta)<I_\theta-\varepsilon+\kappa-\frac{1}{n}\log W(\Gamma_{\delta,\theta})\right)\le\mathsf{P}_{k,\theta}\left(\frac{1}{n}\inf_{\vartheta\in\Gamma_{\delta,\theta}}\lambda_{k,k+n}(\vartheta)<I_\theta-\varepsilon_1\right). \tag{A.8}
$$

Using (A.7), (A.8), and the inequality $\mathsf{P}_\infty(T_A>k)\ge1-[A\,\mathsf{P}(\nu\ge k)]^{-1}$ (see (A.4)), we obtain

$$
R^r_{k,\theta}(T_A)=\frac{\mathsf{E}_{k,\theta}\left[(T_A-k)^+\right]^r}{\mathsf{P}_\infty(T_A>k)}\le\frac{\left(1+\frac{\log(A/\pi_k)}{I_\theta+\mu-\varepsilon}\right)^r+r2^{r-1}\Upsilon_{k,r}(\theta,\varepsilon_1)}{1-1/(A\,\mathsf{P}(\nu\ge k))}. \tag{A.9}
$$

Since, by condition C2, $\Upsilon_{k,r}(\theta,\varepsilon_1)<\infty$ for all $k\in\mathbb{Z}_+$ and $\theta\in\Theta$, this implies the asymptotic upper bound

$$
R^m_{k,\theta}(T_A)\le\left(\frac{\log A}{I_\theta+\mu}\right)^m(1+o(1)),\quad A\to\infty \tag{A.10}
$$

(for all $0<m\le r$ and $\theta\in\Theta$), which along with the lower bound

$$
R^m_{k,\theta}(T_A)\ge\left(\frac{\log A}{I_\theta+\mu}\right)^m(1+o(1)),\quad A\to\infty, \tag{A.11}
$$

proves the asymptotic relation (26). Note that the lower bound (A.11) follows immediately from the lower bound (23) in Lemma 1 by replacing $\alpha$ with $1/(A+1)$, since it follows from (24) that $T_A\in\mathbb{C}(1/(A+1),\pi)$.

We now get to proving (27). Since the MS rule $T_A$ belongs to class $\mathbb{C}(1/(A+1),\pi)$, replacing $\alpha$ by $1/(A+1)$ in the asymptotic lower bound (22), we obtain that under the right-tail condition C1 the following asymptotic lower bound holds for all $r>0$ and $\theta\in\Theta$:

$$
\bar R^r_{\pi,\theta}(T_A)\ge\left(\frac{\log A}{I_\theta+\mu}\right)^r(1+o(1)),\quad A\to\infty. \tag{A.12}
$$

Thus, to prove (27) it suffices to show that, under the left-tail condition C2, for $0<m\le r$ and $\theta\in\Theta$,

$$
\bar R^m_{\pi,\theta}(T_A)\le\left(\frac{\log A}{I_\theta+\mu}\right)^m(1+o(1)),\quad A\to\infty. \tag{A.13}
$$

Using (A.7) and (A.8), we obtain that for any $0<\varepsilon<I_\theta+\mu$

$$
\mathsf{E}^\pi_\theta\left[(T_A-\nu)^+\right]^r=\sum_{k=0}^\infty\pi_k\,\mathsf{E}_{k,\theta}\left[(T_A-k)^+\right]^r\le\sum_{k=0}^\infty\pi_k\left(1+\frac{\log(A/\pi_k)}{I_\theta+\mu-\varepsilon}\right)^r+r2^{r-1}\sum_{k=0}^\infty\pi_k\,\Upsilon_{k,r}(\theta,\varepsilon_1). \tag{A.14}
$$

This inequality together with the inequality $1-\mathrm{PFA}(T_A)\ge A/(1+A)$ yields

$$
\bar R^r_{\pi,\theta}(T_A)=\frac{\sum_{k=0}^\infty\pi_k\,\mathsf{E}_{k,\theta}\left[(T_A-k)^+\right]^r}{1-\mathrm{PFA}(T_A)}\le\frac{\sum_{k=0}^\infty\pi_k\left(1+\frac{\log(A/\pi_k)}{I_\theta+\mu-\varepsilon}\right)^r+r2^{r-1}\sum_{k=0}^\infty\pi_k\,\Upsilon_{k,r}(\theta,\varepsilon_1)}{A/(1+A)}. \tag{A.15}
$$

By condition C2, $\sum_{k=0}^\infty\pi_k\Upsilon_{k,r}(\theta,\varepsilon_1)<\infty$ for any $\varepsilon_1>0$ and any $\theta\in\Theta$ and, by condition (18), $\sum_{k=0}^\infty\pi_k|\log\pi_k|^r<\infty$, which implies that, as $A\to\infty$, for all $0<m\le r$ and all $\theta\in\Theta$,

$$
\bar R^m_{\pi,\theta}(T_A)\le\left(\frac{\log A}{I_\theta+\mu-\varepsilon}\right)^m(1+o(1)).
$$

Since $\varepsilon$ can be arbitrarily small, the upper bound (A.13) follows and the proof of the asymptotic expansion (27) is complete.
(ii) Setting $A=A_\alpha=(1-\alpha)/\alpha$ in (26) and (27) yields, as $\alpha\to0$,

$$
R^m_{k,\theta}(T_{A_\alpha})\sim\left(\frac{|\log\alpha|}{I_\theta+\mu}\right)^m,\qquad
\bar R^m_{\pi,\theta}(T_{A_\alpha})\sim\left(\frac{|\log\alpha|}{I_\theta+\mu}\right)^m, \tag{A.16}
$$

which along with the lower bounds (23) and (22) in Lemma 1 completes the proof of (28) and (29). Obviously, all assertions in (ii) are correct if threshold $A_\alpha$ is so selected that $T_{A_\alpha}\in\mathbb{C}(\alpha,\pi)$ and $\log A_\alpha\sim|\log\alpha|$ as $\alpha\to0$. The proof is complete.

Proof of Lemma 3: Let $N_\alpha=(1-\varepsilon)|\log\alpha|/(I_\theta+\mu_\alpha+\delta_\alpha)$, where $\delta_\alpha>0$ and $\delta_\alpha\to0$ as $\alpha\to0$. Analogously to (A.1),

$$
\inf_{T\in\mathbb{C}(\alpha)}R^r_{k,\theta}(T)\ge N_\alpha^r\left[\inf_{T\in\mathbb{C}(\alpha)}\mathsf{P}_\infty(T>k)-\sup_{T\in\mathbb{C}(\alpha)}\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\right]\ge N_\alpha^r\left[1-\alpha/\mathsf{P}(\nu\ge k)-\sup_{T\in\mathbb{C}(\alpha)}\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\right], \tag{A.17}
$$

where we used the fact that $\inf_{T\in\mathbb{C}(\alpha)}\mathsf{P}_\infty(T>k)\ge1-\alpha/\mathsf{P}(\nu\ge k)$ (see (A.4)). Using (A.5) and (A.6), we obtain

$$
\sup_{T\in\mathbb{C}(\alpha)}\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\le\beta_{\alpha,k}(\theta)+\exp\left\{-\frac{\varepsilon^2I_\theta|\log\alpha|}{I_\theta+\mu_\alpha+\delta_\alpha}+(\mu_\alpha+\delta_\alpha)k\right\}.
$$

By condition C1, $\beta_{\alpha,k}(\theta)$ goes to zero as $\alpha\to0$ for all $k\in\mathbb{Z}_+$. Obviously, the second term vanishes as $\alpha\to0$ for all $\varepsilon\in(0,1)$ and all $k\in\mathbb{Z}_+$. It follows that, for all $k\in\mathbb{Z}_+$ and $\theta\in\Theta$,

$$
\sup_{T\in\mathbb{C}(\alpha)}\mathsf{P}_{k,\theta}(k<T<k+N_\alpha)\to0\quad\text{as }\alpha\to0,
$$

and using (A.17) we obtain that, for all $0<\varepsilon<1$, $r>0$, and $\theta\in\Theta$, as $\alpha\to0$,

$$
\inf_{T\in\mathbb{C}(\alpha)}R^r_{k,\theta}(T)\ge(1-\varepsilon)^r\left(\frac{|\log\alpha|}{I_\theta}\right)^r(1+o(1)).
$$

Since $\varepsilon$ can be arbitrarily small, the lower bound (31) follows.

Proof of Theorem 2: Setting $A=(1-\alpha)/\alpha$ in inequality (A.15), we obtain

$$
\bar R^r_{\pi^\alpha,\theta}(T_A)\le(1-\alpha)^{-1}\left[\sum_{k=0}^\infty\pi_k^\alpha\left(1+\frac{\log((1-\alpha)/(\alpha\pi_k^\alpha))}{I_\theta+\mu_\alpha-\varepsilon}\right)^r+r2^{r-1}\sup_{k\in\mathbb{Z}_+}\Upsilon_{k,r}(\theta,\varepsilon_1)\right].
$$

Using conditions (33) and C3 and taking into account that $\mu_\alpha\to0$ as $\alpha\to0$ yields

$$
\bar R^r_{\pi^\alpha,\theta}(T_{A_\alpha})\le\left(\frac{|\log\alpha|}{I_\theta-\varepsilon}\right)^r(1+o(1)).
$$

Since $\varepsilon$ can be arbitrarily small, we obtain the asymptotic upper bound

$$
\bar R^r_{\pi^\alpha,\theta}(T_{A_\alpha})\le\left(\frac{|\log\alpha|}{I_\theta}\right)^r(1+o(1)),\quad\alpha\to0,
$$

which along with the lower bound (30) proves (34). Clearly, this upper bound holds if we take any $A=A_\alpha$ such that $\mathrm{PFA}(T_{A_\alpha})\le\alpha$ and $\log A_\alpha\sim|\log\alpha|$ as $\alpha\to0$.

Next, substituting $A=(1-\alpha)/\alpha$ in (A.9) (or, more generally, any $A$ such that $\log A\sim|\log\alpha|$ and $\mathrm{PFA}(T_A)\le\alpha$), we obtain

$$
R^r_{k,\theta}(T_A)\le\frac{\left(1+\frac{\log((1-\alpha)/(\alpha\pi_k^\alpha))}{I_\theta+\mu_\alpha-\varepsilon}\right)^r+r2^{r-1}\Upsilon_{k,r}(\theta,\varepsilon_1+\delta_\alpha)}{1-1/(A\,\mathsf{P}(\nu\ge k))},
$$

which, due to conditions C3 and (33) and the fact that $\mu_\alpha,\delta_\alpha\to0$, implies that, for all fixed $k\in\mathbb{Z}_+$ and all $\theta\in\Theta$, as $\alpha\to0$,

$$
R^r_{k,\theta}(T_A)\le\left(\frac{|\log\alpha|}{I_\theta}\right)^r(1+o(1)).
$$

This upper bound together with the lower bound (31) yields (35) and the proof is complete.

Proof of Theorem 3: (i) For $\varepsilon\in(0,1)$, let $M_A=M_A(\varepsilon,\theta)=(1-\varepsilon)I_\theta^{-1}\log A$. Similarly to (A.1) we obtain

$$
R^r_{k,\theta}(\widetilde T_A)\ge M_A^r\left[\mathsf{P}_\infty(\widetilde T_A>k)-\mathsf{P}_{k,\theta}(k<\widetilde T_A<k+M_A)\right] \tag{A.18}
$$

and, similarly to (A.5),

$$
\mathsf{P}_{k,\theta}\left(0<\widetilde T_A-k<M_A\right)\le U_{A,k}(\widetilde T_A)+\beta_{A,k}(\theta), \tag{A.19}
$$

where

$$
U_{A,k}(\widetilde T_A)=e^{(1+\varepsilon)I_\theta M_A}\,\mathsf{P}_\infty\left(0<\widetilde T_A-k<M_A\right),\qquad
\beta_{A,k}(\theta)=\mathsf{P}_{k,\theta}\left(\frac{1}{M_A}\max_{1\le n\le M_A}\lambda_{k,k+n}(\theta)\ge(1+\varepsilon)I_\theta\right).
$$

Since

$$
\mathsf{P}_\infty\left(0<\widetilde T_A-k<M_A\right)\le\mathsf{P}_\infty\left(\widetilde T_A<k+M_A\right)\le(k+\omega+M_A)/A,
$$

we have

$$
U_{A,k}(\widetilde T_A)\le\frac{k+\omega+(1-\varepsilon)I_\theta^{-1}\log A}{A^{\varepsilon^2}}. \tag{A.20}
$$
Therefore, $U_{A,k}(\widetilde T_A)\to0$ as $A\to\infty$ for any fixed $k$. Also, $\beta_{A,k}(\theta)\to0$ by condition C1, so that $\mathsf{P}_{k,\theta}(0<\widetilde T_A-k<M_A)\to0$ for any fixed $k$. Since $\mathsf{P}_\infty(\widetilde T_A>k)\ge1-(\omega+k)/A$, it follows from (A.18) that for an arbitrary $\varepsilon\in(0,1)$, as $A\to\infty$,

$$
R^r_{k,\theta}(\widetilde T_A)\ge\left(\frac{(1-\varepsilon)\log A}{I_\theta}\right)^r(1+o(1)),
$$

which yields the asymptotic lower bound (for any fixed $k\in\mathbb{Z}_+$ and $\theta\in\Theta$)

$$
R^r_{k,\theta}(\widetilde T_A)\ge\left(\frac{\log A}{I_\theta}\right)^r(1+o(1)),\quad A\to\infty. \tag{A.21}
$$

To prove (37) it suffices to show that

$$
R^r_{k,\theta}(\widetilde T_A)\le\left(\frac{\log A}{I_\theta}\right)^r(1+o(1)),\quad A\to\infty. \tag{A.22}
$$

For $k\in\mathbb{Z}_+$, define the stopping times

$$
\widetilde\tau_A^{(k)}=\inf\left\{n\ge1:\log\Lambda^W_{k,k+n}\ge\log A\right\}.
$$

Obviously, for any $n>k$, $\log R_n^W\ge\log\Lambda^W_{k,n}$, and hence, for every $A>0$, $(\widetilde T_A-k)^+\le\widetilde\tau_A^{(k)}$. Analogously to (A.7), we have

$$
\mathsf{E}_{k,\theta}\left[(\widetilde T_A-k)^+\right]^r\le\mathsf{E}_{k,\theta}\left[\widetilde\tau_A^{(k)}\right]^r\le\widetilde M_A^r+r2^{r-1}\sum_{n=\widetilde M_A}^\infty n^{r-1}\,\mathsf{P}_{k,\theta}\left(\widetilde\tau_A^{(k)}>n\right), \tag{A.23}
$$

where $\widetilde M_A=\widetilde M_A(\varepsilon,\theta)=1+\lfloor\log(A)/(I_\theta-\varepsilon)\rfloor$. For a sufficiently large $n$, similarly to (A.8), we have

$$
\mathsf{P}_{k,\theta}\left(\widetilde\tau_A^{(k)}>n\right)\le\mathsf{P}_{k,\theta}\left(\frac{1}{n}\inf_{\vartheta\in\Gamma_{\delta,\theta}}\lambda_{k,k+n}(\vartheta)<I_\theta-\varepsilon\right). \tag{A.24}
$$

Using (A.23) and (A.24), we obtain the inequality

$$
\mathsf{E}_{k,\theta}\left[\left(\widetilde T_A-k\right)^+\right]^r\le\left(1+\frac{\log A}{I_\theta-\varepsilon}\right)^r+r2^{r-1}\Upsilon_{k,r}(\theta,\varepsilon), \tag{A.25}
$$

which along with the inequality $\mathsf{P}_\infty(\widetilde T_A>k)\ge1-(\omega+k)/A$ implies the inequality

$$
R^r_{k,\theta}(\widetilde T_A)=\frac{\mathsf{E}_{k,\theta}\left[(\widetilde T_A-k)^+\right]^r}{\mathsf{P}_\infty(\widetilde T_A>k)}\le\frac{\left(1+\frac{\log A}{I_\theta-\varepsilon}\right)^r+r2^{r-1}\Upsilon_{k,r}(\theta,\varepsilon)}{1-(\omega+k)/A}. \tag{A.26}
$$

Since, by condition C2, $\Upsilon_{k,r}(\theta,\varepsilon)<\infty$ for all $k\in\mathbb{Z}_+$ and $\theta\in\Theta$, this implies the asymptotic upper bound (A.22) and completes the proof of the asymptotic approximation (37).

We now continue with proving (38). By the Chebyshev inequality,

$$
\bar R^r_{\pi,\theta}(\widetilde T_A)\ge\mathsf{E}^\pi_\theta\left[(\widetilde T_A-\nu)^+\right]^r\ge M_A^r\,\mathsf{P}^\pi_\theta(\widetilde T_A-\nu\ge M_A)\ge M_A^r\left[\mathsf{P}^\pi_\theta(\widetilde T_A>\nu)-\mathsf{P}^\pi_\theta(\nu<\widetilde T_A<\nu+M_A)\right]\ge M_A^r\left[1-\frac{\bar\nu+\omega}{A}-\mathsf{P}^\pi_\theta\left(0<\widetilde T_A-\nu<M_A\right)\right]. \tag{A.27}
$$

Let $K_A$ be an integer number that approaches infinity as $A\to\infty$. Using (A.19) and (A.20), we obtain the following upper bound:

$$
\mathsf{P}^\pi_\theta(0<\widetilde T_A-\nu<M_A)=\sum_{k=0}^\infty\pi_k\,\mathsf{P}_{k,\theta}\left(0<\widetilde T_A-k<M_A\right)\le\mathsf{P}(\nu>K_A)+\sum_{k=0}^\infty\pi_k\,U_{A,k}(\widetilde T_A)+\sum_{k=0}^{K_A}\pi_k\,\beta_{A,k}\le\mathsf{P}(\nu>K_A)+\frac{\bar\nu+\omega+(1-\varepsilon)I_\theta^{-1}\log A}{A^{\varepsilon^2}}+\sum_{k=0}^{K_A}\pi_k\,\beta_{A,k}, \tag{A.28}
$$

where the first two terms go to zero as $A\to\infty$ since $\bar\nu$ and $\omega$ are finite (by Markov's inequality, $\mathsf{P}(\nu>K_A)\le\bar\nu/K_A$) and the last term also goes to zero by condition C1 and Lebesgue's dominated convergence theorem. Thus, for all $0<\varepsilon<1$, $\mathsf{P}^\pi_\theta(0<\widetilde T_A-\nu<M_A)$ approaches 0 as $A\to\infty$. Using inequality (A.27), we obtain that for any $0<\varepsilon<1$, as $A\to\infty$,

$$
\bar R^r_{\pi,\theta}(\widetilde T_A)\ge\left(\frac{(1-\varepsilon)\log A}{I_\theta}\right)^r(1+o(1)),
$$

which yields the asymptotic lower bound (for any $r>0$ and $\theta\in\Theta$)

$$
\bar R^r_{\pi,\theta}(\widetilde T_A)\ge\left(\frac{\log A}{I_\theta}\right)^r(1+o(1)),\quad A\to\infty. \tag{A.29}
$$

To obtain the upper bound it suffices to use inequality (A.25), which along with the fact that $\mathsf{P}^\pi(\widetilde T_A>\nu)\ge1-(\bar\nu+\omega)/A$ yields (for every $0<\varepsilon<I_\theta$)

$$
\bar R^r_{\pi,\theta}(\widetilde T_A)=\frac{\sum_{k=0}^\infty\pi_k\,\mathsf{E}_{k,\theta}\left[(\widetilde T_A-k)^+\right]^r}{\mathsf{P}^\pi(\widetilde T_A>\nu)}\le\frac{\left(1+\frac{\log A}{I_\theta-\varepsilon}\right)^r+r2^{r-1}\sum_{k=0}^\infty\pi_k\,\Upsilon_{k,r}(\theta,\varepsilon)}{1-(\omega+\bar\nu)/A}. \tag{A.30}
$$

Since, by condition C2,

$$
\sum_{k=0}^\infty\pi_k\,\Upsilon_{k,r}(\theta,\varepsilon)<\infty\quad\text{for any }\varepsilon>0\text{ and }\theta\in\Theta,
$$
we obtain that, for every $0<\varepsilon<I_\theta$, as $A\to\infty$,

$$
\bar R^r_{\pi,\theta}(\widetilde T_A)\le\left(\frac{\log A}{I_\theta-\varepsilon}\right)^r(1+o(1)).
$$

Since $\varepsilon$ can be arbitrarily small, this implies the asymptotic (as $A\to\infty$) upper bound

$$
\bar R^r_{\pi,\theta}(\widetilde T_A)\le\left(\frac{\log A}{I_\theta}\right)^r(1+o(1)), \tag{A.31}
$$

which along with the lower bound (A.29) completes the proof of (i).

(ii) To prove (40) and (39) it suffices to substitute $\log A\sim|\log\alpha|$ (in particular, we may take $A=(b\omega+\bar\nu)/\alpha$) in (38) and (37).

Proof of Theorem 4: The previous results make the proof elementary. Indeed, the substitution $A=A_\alpha=(b_\alpha\omega_\alpha+\bar\nu_\alpha)/\alpha$ in (A.30) yields the upper bound

$$
\bar R^r_{\pi^\alpha,\theta}(\widetilde T_{A_\alpha})\le\frac{\left(1+\frac{\log((\omega_\alpha+\bar\nu_\alpha)/\alpha)}{I_\theta-\varepsilon}\right)^r+r2^{r-1}\sup_{k\ge0}\Upsilon_{k,r}(\theta,\varepsilon)}{1-\alpha},
$$

which implies the asymptotic upper bound

$$
\bar R^r_{\pi^\alpha,\theta}(\widetilde T_{A_\alpha})\le\left(\frac{|\log\alpha|}{I_\theta}\right)^r(1+o(1)),\quad\alpha\to0,
$$

since, by condition (41), $\log[(\omega_\alpha+\bar\nu_\alpha)/\alpha]\sim|\log\alpha|$ and, by condition C3, $\sup_{k\ge0}\Upsilon_{k,r}(\theta,\varepsilon)<\infty$ for all $\theta\in\Theta$. This upper bound along with the lower bound (30) in Lemma 3 proves (42). Finally, the asymptotic upper bound

$$
R^r_{k,\theta}(\widetilde T_{A_\alpha})\le\left(\frac{|\log\alpha|}{I_\theta}\right)^r(1+o(1)),\quad\alpha\to0,
$$

for all $k\in\mathbb{Z}_+$ and $\theta\in\Theta$ follows immediately from (A.26) with $A=A_\alpha=(b_\alpha\omega_\alpha+\bar\nu_\alpha)/\alpha$, which along with the lower bound (31) in Lemma 3 proves (43).

Proof of Theorem 5: Since $\Theta$ is compact, it follows from the asymptotic approximation (27) in Theorem 1 that, as $A\to\infty$,

$$
\int_\Theta\bar R^r_{\pi,\vartheta}(T_A)\,dW(\vartheta)=\int_\Theta\left(\frac{\log A}{I_\vartheta+\mu}\right)^r dW(\vartheta)\,(1+o(1))=D_{\mu,r}(\log A)^r(1+o(1)).
$$

Since $\mathrm{PFA}(T_A)<1/A$, we obtain the following asymptotic approximation for the integrated risk:

$$
\rho^{c,r}_{\pi,W}(T_A)\sim D_{\mu,r}\,c\,(\log A)^r\quad\text{as }A\to\infty. \tag{A.32}
$$

Next, it is easily seen that, for any $r\ge1$ and $\mu\in(0,\infty)$, threshold $A_{c,r}$ goes to infinity as $c\to0$ at such a rate that $\log A_{c,r}\sim|\log c|$. As a result, we obtain that

$$
\rho^{c,r}_{\pi,W}(T_{A_{c,r}})\sim D_{\mu,r}\,c\,|\log c|^r\quad\text{as }c\to0.
$$

All that remains is to prove the lower bound

$$
\inf_{T\ge0}\rho^{c,r}_{\pi,W}(T)\ge D_{\mu,r}\,c\,|\log c|^r(1+o(1))\quad\text{as }c\to0. \tag{A.33}
$$

In fact, since

$$
\lim_{c\to0}\frac{G_{c,r}(A_{c,r})}{D_{\mu,r}\,c\,|\log c|^r}=1,
$$

it suffices to prove that

$$
\frac{\inf_{T\ge0}\rho^{c,r}_{\pi,W}(T)}{G_{c,r}(A_{c,r})}\ge1+o(1)\quad\text{as }c\to0. \tag{A.34}
$$

This can be done by contradiction. Indeed, suppose that (A.34) is wrong, i.e., that there exists a stopping rule $T=T_c$ such that

$$
\frac{\rho^{c,r}_{\pi,W}(T_c)}{G_{c,r}(A_{c,r})}<1+o(1)\quad\text{as }c\to0. \tag{A.35}
$$

Let $\alpha_c=\mathrm{PFA}(T_c)$. First, $\alpha_c\to0$ as $c\to0$ since

$$
\alpha_c\le\rho^{c,r}_{\pi,W}(T_c)<G_{c,r}(A_{c,r})(1+o(1))\to0\quad\text{as }c\to0.
$$

Second, it follows from Lemma 1 that, as $\alpha_c\to0$,

$$
R^r_{\pi,W}(T_c)\ge\int_\Theta(I_\vartheta+\mu)^{-r}dW(\vartheta)\,|\log\alpha_c|^r(1+o(1)),
$$

and hence, as $c\to0$,

$$
\rho^{c,r}_{\pi,W}(T_c)=\alpha_c+c\,(1-\alpha_c)R^r_{\pi,W}(T_c)\ge\alpha_c+c\,D_{\mu,r}|\log\alpha_c|^r(1+o(1)).
$$

Thus,

$$
\frac{\rho^{c,r}_{\pi,W}(T_c)}{G_{c,r}(A_{c,r})}\ge\frac{G_{c,r}(1/\alpha_c)+c\,D_{\mu,r}|\log\alpha_c|^r\,o(1)}{G_{c,r}(A_{c,r})}\ge\frac{\min_{A>0}G_{c,r}(A)}{G_{c,r}(A_{c,r})}+o(1)\ge1+o(1),
$$

which contradicts (A.35). Hence, (A.34) follows and the proof is complete.

Proof of Theorem 6: (i) Using (A.14), we obtain

$$
c\int_\Theta\mathsf{E}^{\pi^c}_\vartheta\left[(T_A-\nu)^+\right]^r dW(\vartheta)\le c\int_\Theta\sum_{k=0}^\infty\pi_k^c\left(1+\frac{\log(A/\pi_k^c)}{I_\vartheta+\mu_c-\varepsilon}\right)^r dW(\vartheta)+c\,r2^{r-1}\int_\Theta\Upsilon_r(\vartheta,\varepsilon_1)\,dW(\vartheta),
$$

where the last term is finite since $\Theta$ is compact and $\Upsilon_r(\vartheta,\varepsilon_1)<\infty$ for all $\vartheta\in\Theta$ due to condition C3, and where

$$
\sum_{k=0}^\infty\pi_k^c|\log\pi_k^c|^r=o(|\log c|^r)\quad\text{as }c\to0
$$

by assumption (46). Recall that, as established in the proof of Theorem 5 above, $\log A_{c,r}\sim|\log c|$ as $c\to0$ when $A_{c,r}$ satisfies (47), and that $\mu_c\to0$. Therefore, as $c\to0$,

$$
c\int_\Theta\mathsf{E}^{\pi^c}_\vartheta\left[(T_{A_{c,r}}-\nu)^+\right]^r dW(\vartheta)\le c\int_\Theta\left(\frac{|\log c|}{I_\vartheta-\varepsilon}\right)^r dW(\vartheta)\,(1+o(1)).
$$

Since $\varepsilon$ can be arbitrarily small, we obtain

$$
c\int_\Theta\mathsf{E}^{\pi^c}_\vartheta\left[(T_{A_{c,r}}-\nu)^+\right]^r dW(\vartheta)\le D_r\,c\,|\log c|^r(1+o(1))\quad\text{as }c\to0.
$$

Since $\mathrm{PFA}(T_{A_{c,r}})<1/A_{c,r}=o(c|\log c|^r)$, it follows that

$$
\rho^{c,r}_{\pi^c,W}(T_{A_{c,r}})\le D_r\,c\,|\log c|^r(1+o(1))\quad\text{as }c\to0. \tag{A.36}
$$

The lower bound

$$
\inf_{T\ge0}\rho^{c,r}_{\pi^c,W}(T)\ge D_r\,c\,|\log c|^r(1+o(1))\quad\text{as }c\to0 \tag{A.37}
$$

can be deduced using Lemma 3 and an argument essentially similar to that used in the proof of the lower bound (A.33) above, with $G_{c,r}(A)=1/A+c\,D_r(\log A)^r$. Using the asymptotic upper bound (A.36) and the lower bound (A.37) simultaneously, we obtain (48), which completes the proof of (i).

(ii) In order to prove (51) it suffices to prove the asymptotic upper bound

$$
\rho^{c,r}_{\pi^c,W}(\widetilde T_{A_{c,r}})\le D_r\,c\,|\log c|^r(1+o(1))\quad\text{as }c\to0. \tag{A.38}
$$

Define

$$
\widetilde G_{c,r}(A)=(\omega_c b_c+\bar\nu_c)/A+c\,D_r(\log A)^r.
$$

The threshold $A_{c,r}$ that satisfies equation (50) minimizes $\widetilde G_{c,r}(A)$, and it is easily seen that $\log A_{c,r}\sim|\log c|$ as $c\to0$ since, by assumption (49), $\log(\omega_c b_c+\bar\nu_c)=o(|\log c|)$ as $c\to0$. Using inequality (A.25), we obtain

$$
c\int_\Theta\mathsf{E}^{\pi^c}_\vartheta\left[(\widetilde T_A-\nu)^+\right]^r dW(\vartheta)\le c\int_\Theta\left(1+\frac{\log A}{I_\vartheta-\varepsilon}\right)^r dW(\vartheta)+c\,r2^{r-1}\int_\Theta\Upsilon_r(\vartheta,\varepsilon)\,dW(\vartheta),
$$

where the last term is finite since $\Theta$ is compact and $\Upsilon_r(\vartheta,\varepsilon)<\infty$ for all $\vartheta\in\Theta$ due to condition C3. Therefore, for an arbitrarily small $\varepsilon$, as $c\to0$,

$$
c\int_\Theta\mathsf{E}^{\pi^c}_\vartheta\left[(\widetilde T_A-\nu)^+\right]^r dW(\vartheta)
$$
Z 6 Θ
| log c| Iϑ − ε
r dW (ϑ)(1 + o(1)),
2018 (ACCEPTED)
which implies that, as c → 0, Z c c Eπϑ [(TeA − ν)+ ]r dW (ϑ) 6 Dr c | log c|r (1 + o(1)). Θ
Since PFA(TeAc,r ) 6 (ωc bc +¯ νc )/Ac,r = o(c| log c|r ), we obtain (A.38), which along with the lower bound (A.37) proves (51). The proof of (ii) is complete. R EFERENCES [1] A. N. Shiryaev, “On optimum methods in quickest detection problems,” Theory of Probability and its Applications, vol. 8, no. 1, pp. 22–46, Jan. 1963. [2] G. Lorden, “Procedures for reacting to a change in distribution,” Annals of Mathematical Statistics, vol. 42, no. 6, pp. 1897–1908, Dec. 1971. [3] E. S. Page, “Continuous inspection schemes,” Biometrika, vol. 41, no. 1–2, pp. 100–114, Jun. 1954. [4] G. V. Moustakides, “Optimal stopping times for detecting changes in distributions,” Annals of Statistics, vol. 14, no. 4, pp. 1379–1387, Dec. 1986. [5] M. Pollak, “Optimal detection of a change in distribution,” Annals of Statistics, vol. 13, no. 1, pp. 206–227, Mar. 1985. [6] A. N. Shiryaev, “The problem of the most rapid detection of a disturbance in a stationary process,” Soviet Mathematics – Doklady, vol. 2, pp. 795–799, 1961, translation from Doklady Akademii Nauk SSSR, 138:1039–1042, 1961. [7] S. W. Roberts, “A comparison of some control chart procedures,” Technometrics, vol. 8, no. 3, pp. 411–430, Aug. 1966. [8] G. V. Moustakides, A. S. Polunchenko, and A. G. Tartakovsky, “A numerical approach to performance analysis of quickest changepoint detection procedures,” Statistica Sinica, vol. 21, no. 2, pp. 571–596, Apr. 2011. [9] A. S. Polunchenko and A. G. Tartakovsky, “On optimality of the Shiryaev–Roberts procedure for detecting a change in distribution,” Annals of Statistics, vol. 38, no. 6, pp. 3445–3457, Dec. 2010. [10] M. Pollak and A. G. Tartakovsky, “Optimality properties of the Shiryaev–Roberts procedure,” Statistica Sinica, vol. 19, no. 4, pp. 1729–1739, Oct. 2009. [11] A. G. Tartakovsky, M. Pollak, and A. S. Polunchenko, “Thirdorder asymptotic optimality of the generalized Shiryaev–Roberts changepoint detection procedures,” Theory of Probability and its Applications, vol. 56, no. 
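To illustrate the rate $\log A_{c,r} \sim |\log c|$ used throughout the proofs of Theorems 5 and 6, one can minimize the tradeoff function $G_{c,r}(A) = 1/A + c\,D_r(\log A)^r$ numerically. The following Python sketch (with the hypothetical values $D_r = 1$ and $r = 1$, chosen only for illustration and not taken from the paper) checks that the minimizing threshold satisfies $\log A_{c,r}/|\log c| \to 1$ and $G_{c,r}(A_{c,r})/(c\,|\log c|^r) \to 1$ as $c \to 0$:

```python
import math

def argmin_G(c, D=1.0, r=1):
    """Brute-force minimize G(A) = 1/A + c*D*(log A)^r over a log-grid of A.

    The first-order condition 1/A = c*D*r*(log A)^(r-1) shows that, as c -> 0,
    log A_{c,r} grows like |log c|; the grid search below confirms this.
    """
    best_A, best_G = None, float("inf")
    for k in range(1, 4001):
        A = math.exp(k * 0.01)  # log A ranges over 0.01, 0.02, ..., 40.0
        G = 1.0 / A + c * D * math.log(A) ** r
        if G < best_G:
            best_A, best_G = A, G
    return best_A, best_G

for c in (1e-2, 1e-4, 1e-8):
    A, G = argmin_G(c)
    print(c, math.log(A) / abs(math.log(c)), G / (c * abs(math.log(c))))
```

Both printed ratios approach 1 as $c$ decreases, matching the asymptotics $\log A_{c,r} \sim |\log c|$ and $\rho^{c,r}(T_{A_{c,r}}) \sim D_r\,c\,|\log c|^r$.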
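When the mixing distribution $W$ is supported on a finite grid, the Mixture Shiryaev–Roberts statistic admits a simple per-parameter recursion. The sketch below is a minimal toy illustration for an i.i.d. Gaussian mean-shift model with unit variance; the grid of $\theta$ values, the weights, and the threshold are arbitrary choices made for the example, not quantities from the paper:

```python
import math

def mixture_sr_stop(xs, thetas, weights, A):
    """Mixture Shiryaev-Roberts rule for a mean shift in i.i.d. N(0,1) data.

    For each grid point theta, the SR statistic obeys the recursion
        R_n(theta) = (1 + R_{n-1}(theta)) * exp(theta*x_n - theta^2/2),
    and the rule stops at the first n with sum_j w_j * R_n(theta_j) >= A.
    Returns the stopping time (1-based) or None if no alarm is raised.
    """
    R = [0.0] * len(thetas)
    for n, x in enumerate(xs, 1):
        stat = 0.0
        for j, th in enumerate(thetas):
            R[j] = (1.0 + R[j]) * math.exp(th * x - 0.5 * th * th)
            stat += weights[j] * R[j]
        if stat >= A:
            return n
    return None
```

Before the change the mixture statistic fluctuates near a stationary level, while after a shift to mean $\theta$ it grows geometrically, so the alarm concentrates shortly after the changepoint, in line with the $\log A / I_\theta$ delay asymptotics established above.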
Alexander G. Tartakovsky (M’01-SM’02) research interests include theoretical and applied statistics; applied probability; sequential analysis; changepoint detection phenomena; and a variety of applications including statistical image and signal processing; video tracking; detection and tracking of targets in radar and infrared search and track systems; near-Earth space informatics; information integration/fusion; intrusion detection and network security; and detection and tracking of malicious activity. He is the author of two books (the third is in preparation), several book chapters, and over 100 papers. Dr. Tartakovsky is a Fellow of the Institute of Mathematical Statistics and a senior member of IEEE. He received several awards, including a 2007 Abraham Wald Award in Sequential Analysis. Dr. Tartakovsky obtained a Ph.D. degree and an advanced Doctor-of-Science degree both from Moscow Institute of Physics and Technology, Russia (FizTech). During 1981–92, he was first a Senior Research Scientist and then a Department Head at the Institute of Radio Technology (Moscow, Russian Academy of Sciences) as well as a Professor at FizTech, working on the application of statistical methods to optimization and modeling of information systems. From 1993 to 1996, Dr. Tartakovsky worked at the University of California, Los Angeles (UCLA), first in the Department of Electrical Engineering and then in the Department of Mathematics. From 1997 to 2013, he was a Professor in the Department of Mathematics and the Associate Director of the Center for Applied Mathematical Sciences at the University of Southern California (USC), Los Angeles. From 2013 to 2015, he was a Professor of Statistics in the Department of Statistics at the University of Connecticut, Storrs. Currently, he is the Head of the Space Informatics Laboratory at FizTech as well as Vice President of AGT StatConsult, Los Angeles, California.