Preprints of the 18th IFAC World Congress Milano (Italy) August 28 - September 2, 2011
Fault Detection and Isolation in Large-Scale IP Network
Lionel Fillatre and Igor Nikiforov
ICD - Université de Technologie de Troyes - UMR STMR - CNRS, 12 rue Marie Curie - BP 2060 - 10010, Troyes, France (e-mail:
[email protected]).

Abstract: This paper addresses the problem of fault detection and isolation by using a noisy measurement vector corrupted by some linear unknown nuisance parameters. An invariant constrained asymptotically uniformly minimax test is proposed. It minimizes the maximum false isolation probability as the signal-to-noise ratio becomes arbitrarily large, uniformly with respect to the unknown fault amplitude and independently of the nuisance parameters. The theoretical statistical performance of this algorithm is established. To illustrate the practical interest of this test, it is applied to the problem of fault detection and isolation in network origin-destination traffic demands from simple link load measurements. The test performance is evaluated by using real traffic data from a backbone network.

Keywords: Fault detection; Fault isolation; Statistical hypotheses tests; Network; Traffic control.

Copyright by the International Federation of Automatic Control (IFAC)

1. INTRODUCTION

Statistical decision tools for detecting and isolating faults in complex and large-scale systems have numerous applications in process control, condition-based maintenance and process supervision. As argued in Basseville and Nikiforov (2002), many industrial processes and systems rely on physical principles, which are written in terms of equations and thus parametric (static/dynamic) models. These parametric models are used for fault detection and isolation. In such a context, faults are modeled as deviations, with respect to nominal reference values, in the parameter vector of a stochastic (static/dynamic) system.

As mentioned in Basseville and Nikiforov (2002), the following two situations are distinguished in fault detection and isolation: a) hypotheses testing, namely deciding between several hypotheses, for detecting and isolating faults; b) quickest detection and isolation of changes after their onset time. The first situation corresponds to the offline diagnostic, i.e. the system remains in the same steady but unknown state (normal or abnormal) and our goal is to identify the true state of the system. The second situation corresponds to the online diagnostic. For instance, at the beginning of its life cycle, the system is in a normal state and, suddenly, due to a degradation, it moves from the normal state to an abnormal one.

As stressed in Fillatre and Nikiforov (2010), the monitored (typically large-scale) systems have a variable structure. The definition of anomalies is also time variable. This leads to an extremely complicated sequential strategy: an optimal solution to such a problem has not been found. In fact, it is worth noting that the theory of sequential decision is only well developed in the case of stationary systems (in the pre-change state). In contrast to the sequential strategy, the repeated fixed size sample (or off-line) one
is easily applicable to systems with a variable structure. A typical example of fault diagnosis in the case of variable structure systems is volume anomaly detection in the origin-destination flow traffic over a network (for example, due to denial-of-service, viruses/worms, external routing reconfigurations, etc.). Modern networks, like the Internet, are systems with a highly variable structure which practically cannot be considered as stationary (see Casas et al. (2010)).

As follows from Nikiforov (1997); Fillatre and Nikiforov (2010), the off-line diagnostic algorithms can be effectively used for quickest detection and isolation of changes. The detailed relations between the error probabilities of the off-line test and the mean detection delay, the mean time between false alarms and the probabilities of false isolation can be found in Nikiforov (1997); Fillatre and Nikiforov (2010). For this reason the goal of this paper is to study the off-line diagnostic.

1.1 Motivation of the theoretical study

From the statistical point of view, the problem of fault detection/isolation can be viewed as a hypothesis testing problem between several composite hypotheses (see Ferguson (1967); Lehmann (1986)). Each hypothesis is composite because i) the fault amplitude is usually unknown, the only information being its lower bound, and ii) there are some nuisance parameters. Let us assume that the parameterized distribution of the observations is denoted as $y \sim P_{\mu,\theta}$, where $\theta$ is the parameter of interest (fault) and $\mu$ is a nuisance one. From the practical point of view, dealing with nuisance parameters is an important issue in monitoring safety-critical complex systems. Distinguishing two subsets of distribution parameters, the parameters of interest and the nuisance parameters, is necessary because the nuisance parameters are generally crucial to complete the statistical model but they are of no interest for detection and isolation.
A general framework to design optimal tests is given by the minimax approach. Minimax tests with a prefixed level of false alarm, namely the constrained minimax tests, between multiple hypotheses composed of a finite number of parameters are established in Baygün and Hero (1995). This solution depends on some unknown coefficients, namely the optimal weights and the threshold, which are very difficult to calculate in practice. The goal of the present paper is to design the minimax test under the following three conditions (which have not been considered in the literature):

i) the geometry of the fault set, which is the union of non-intersecting half-lines with arbitrary directions, is too complex to infer easily the optimal weights;
ii) the Gaussian noise has a general covariance matrix, which increases the complexity of the fault set;
iii) the presence of unknown nuisance parameters, which affect the separability between the faults.

It is worth noting that in practice the optimal weights are usually assumed to be equal but there is no proof that the resulting test, called the M-ary Generalized Likelihood Ratio Test (MGLRT), is optimal in any sense. This gap is also filled in the paper.

1.2 Application to network monitoring

A recent practical problem which motivates the necessity to design such an optimal test is the detection and isolation of volume anomalies in Internet Protocol (IP) network traffic flows. The traffic is described by a traffic matrix that captures the amount of traffic transmitted between every pair of ingress and egress nodes in a network, also called the Origin-Destination (OD) flows. This problem is quite similar to the monitoring of transport networks described in Abrahamsson (1998) and deeply studied in the literature. In this case, the OD matrix contains information on the number of travelers that commute or the amount of freight shipped between different zones of a region. To improve the reliability and performance of the communication (or transport) network, it is important to detect and isolate significant volume anomalies in traffic matrices in order to take routing decisions and to improve traffic engineering, as explained in Zhang et al. (2005). A volume anomaly is an abnormal modification, with respect to the nominal one, of the traffic of an OD flow that spans multiple physical links of the network (for example a flash crowd event). Fault detection and isolation consists of detecting a volume anomaly in the traffic matrix and identifying the OD flow which is affected.

Unfortunately, collecting and processing direct OD flow measurements network-wide requires demanding hardware (see details in Zhang et al. (2005)). In contrast to such an infeasible method, the additive sum of OD flow measurements collected on the links is easily measured by using the "Simple Network Management Protocol" (SNMP). The detection and isolation of a fault in the traffic matrix at each time moment from the SNMP measurements is a difficult task because 1) the number of unknown OD flows is much greater than the number of SNMP measurements at the same time and 2) there is a large number of possible faults with unknown intensities.

The most popular method to detect and isolate faults is based on the Principal Component Analysis (PCA) approach proposed in Lakhina et al. (2004). The PCA approach consists of separating SNMP measurements into a normal subspace that captures the fault-free traffic behavior and an anomalous subspace that provides residuals sensitive to faults. Recent works like Ringberg et al. (2007) have shown that the PCA approach has to be highly tuned for each particular data set in order to provide reliable results. This makes it inappropriate in practice. Moreover, the statistical optimality of the PCA-based detection is not established.
1.3 Contribution and organization of the paper

The main contributions of this study are the following. Firstly, an invariant constrained asymptotically uniformly minimax test is proposed to detect and isolate faults. This algorithm is optimal in the sense that it minimizes the maximum probability of false isolation subject to a constraint on the probability of false alarm, provided that the minimum signal-to-noise ratio among all possible faults tends to infinity. Secondly, the statistical performance of the test is established. Finally, it is shown that the proposed test outperforms the PCA-based one.

The paper is organized as follows. Section 2 starts with the problem statement. Section 3 presents the invariant constrained asymptotically minimax test. Section 4 describes the problem of parametric network monitoring from SNMP measurements. Section 5 studies the numerical performance of the proposed detection/isolation algorithm on synthetic and real data. Finally, Section 6 concludes this paper.

2. PROBLEM STATEMENT

This section presents the detection and isolation problem. It is shown that the nuisance parameters can be eliminated by using the invariance principle, which leads to a reduced statistical problem.

2.1 Multiple hypotheses testing: problem statement

Let us consider the following linear regression model:
$$
y = H\mu + \xi \tag{1}
$$
where $\mu \in \mathbb{R}^q$ is a vector of unknown nuisance parameters, $\xi \sim \mathcal{N}(0, \gamma^2 I_n)$, $I_n$ is the identity matrix of size $n$ and $\gamma^2$ is a known noise variance. Without any loss of generality, the matrix $H$, of size $n \times q$ with $n > q$, is assumed to be full column rank. Otherwise, it suffices to keep the maximum number of linearly independent columns to get a full column rank matrix spanning the same linear space. The hypothesis testing problem consists of choosing between the $m + 1$ hypotheses:
$$
\begin{aligned}
\mathcal{H}_0 &: \{y \sim \mathcal{N}(H\mu, \gamma^2 I_n);\ \mu \in \mathbb{R}^q\},\\
\mathcal{H}_1 &: \{y \sim \mathcal{N}(\varrho\,\theta_1 + H\mu, \gamma^2 I_n);\ \varrho \ge \varrho_1,\ \mu \in \mathbb{R}^q\},\\
&\ \ \vdots\\
\mathcal{H}_m &: \{y \sim \mathcal{N}(\varrho\,\theta_m + H\mu, \gamma^2 I_n);\ \varrho \ge \varrho_m,\ \mu \in \mathbb{R}^q\}
\end{aligned}
\tag{2}
$$
where $\varrho_j > 0$ and $\theta_j \in \mathbb{R}^n$, $j = 1, \ldots, m$. Typically, in the context of network fault detection/isolation, the goal is to detect the presence of a large fault (of norm at least $\varrho_j \|\theta_j\|_2$) and to isolate (identify) the type $j$ of the additive fault $\varrho_j \theta_j$ in SNMP measurements. It must be noted that $m \gg n$ is possible.

2.2 Nuisance parameters and reduced problem

A characteristic feature of the fault detection/isolation problem given by (2) is the fact that the vector of nuisance parameters $\mu$ is completely unknown. The problem of fault detection with nuisance parameters has been previously investigated in Fillatre and Nikiforov (2007) by using the theory of invariant tests. The statistical problem (2) is naturally invariant with respect to the group of translations $G = \{g(y) = y + Hu,\ u \in \mathbb{R}^q\}$ (see Ferguson (1967) for a detailed description of how to use the invariance principle in such a case). In other words, when the translation $g(y) = y + Hu$ is applied to the Gaussian vector $y$ with mean $\theta + H\mu$, where $\theta$ is a possible fault, the resulting vector $y + Hu$ is still Gaussian with the nuisance parameter $u + \mu$ and the same fault $\theta$. In such a case, the statistical decision should be based on a maximal invariant to the group of translations $G$, i.e. all invariant tests with respect to $G$ are functions of a maximal invariant statistic (see details in Ferguson (1967)). It is shown (see for instance Fouladirad and Nikiforov (2005)) that the projection $z = W^T y$ of $y$ onto the left null space of the matrix $H$ is a maximal invariant. The matrix $W = (w_1, \ldots, w_{n-q})$ of size $n \times (n - q)$ is composed of the eigenvectors $w_1, \ldots, w_{n-q}$ of the projection matrix $P_H^\perp = I_n - H(H^T H)^{-1} H^T$ corresponding to the eigenvalue 1. The matrix $W$ satisfies the following conditions: $W^T H = 0$, $W W^T = P_H^\perp$ and $W^T W = I_{n-q}$.

In the rest of the paper, the following simplifying assumption is used. It is assumed that the parameters $\varrho_1, \ldots, \varrho_m$ satisfy
$$
\varrho_{\min} = \varrho_j \|W^T \theta_j\|_2 \quad \text{for } 1 \le j \le m, \tag{3}
$$
where $\varrho_{\min} > 0$ is a positive constant chosen a priori. It means that all projections $\varrho W^T \theta_j$ onto the subspace of invariant statistics have the same minimum norm $\varrho_{\min}$. For all $1 \le j \le m$, let $\phi_j = \varrho_j W^T \theta_j / \varrho_{\min}$, so that $\|\phi_j\|_2 = 1$. Hence, the initial statistical decision problem of choosing between the hypotheses $\mathcal{H}_0, \ldots, \mathcal{H}_m$ becomes the reduced problem of choosing between new hypotheses $H_0, \ldots, H_m$ defined as follows:
$$
\begin{aligned}
H_0 &: \{z \sim \mathcal{N}(0, \gamma^2 I_{n-q})\},\\
H_1 &: \{z \sim \mathcal{N}(\varrho\,\phi_1, \gamma^2 I_{n-q});\ \varrho \ge \varrho_{\min}\},\\
&\ \ \vdots\\
H_m &: \{z \sim \mathcal{N}(\varrho\,\phi_m, \gamma^2 I_{n-q});\ \varrho \ge \varrho_{\min}\}
\end{aligned}
\tag{4}
$$
Generally speaking, this decision problem consists in choosing the half-line which is associated with the theoretical mean of the measurement vector.

2.3 Criterion of optimality

Any non-randomized decision rule (to choose between $H_0, \ldots, H_m$) may be represented as an $(m + 1)$-dimensional
vector-function $\varphi = (\varphi_0, \ldots, \varphi_m)^T$ which is defined on $\mathbb{R}^{n-q}$ such that $\varphi(z) \in \{0, 1\}^{m+1}$ and
$$
\sum_{i=0}^{m} \varphi_i(z) = 1, \quad \forall z \in \mathbb{R}^{n-q}.
$$
The false alarm probability of the test $\varphi$ is given by $\alpha_0 = E_0[1 - \varphi_0(z)]$, where $E_\phi[\varphi_i(z)]$ stands for the expectation of $\varphi_i$ when $z$ follows the distribution $\mathcal{N}(\phi, \gamma^2 I_{n-q})$. The class of invariant tests with upper bounded false alarm probability is defined as
$$
D_\alpha = \left\{ \varphi : \sum_{i=0}^{m} \varphi_i(z) = 1,\ E_0[1 - \varphi_0(z)] \le \alpha \right\}.
$$
The statistical performance of a decision rule $\varphi$ in $D_\alpha$ is determined by the $m$ functions $\alpha_i(\varrho) = E_{\varrho\phi_i}[1 - \varphi_i(z)]$. In the rest of the paper, a modified version of the constrained minimax criterion proposed in Baygün and Hero (1995) is adopted to evaluate a decision rule in the context of the hypothesis testing problem (4). In fact, contrary to the hypotheses considered in Baygün and Hero (1995), each hypothesis $H_1, \ldots, H_m$ contains an infinity of possible parameters since the intensity $\varrho$ of the fault is unknown (it is just lower bounded by $\varrho_{\min}$). To remedy this situation, it is proposed to consider the constrained asymptotically uniformly minimax tests, i.e. the tests that become asymptotically minimax as the amplitude $\varrho_{\min}$ grows to infinity, uniformly with respect to all $\varrho \ge \varrho_{\min}$.
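The nuisance rejection step of Section 2.2 can be sketched numerically as follows. This is only an illustration under stated assumptions: the helper names (`dot`, `residual`, `orthonormal_complement`) and the toy matrix are hypothetical, and the basis W is built by Gram-Schmidt orthogonalization rather than by the eigendecomposition mentioned in the text (both give an orthonormal basis of the left null space of H).

```python
import math

def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def residual(v, basis):
    """Remove from v its components along an orthonormal family (Gram-Schmidt step)."""
    v = list(v)
    for b in basis:
        c = dot(v, b)
        v = [vi - c * bi for vi, bi in zip(v, b)]
    return v

def orthonormal_complement(H_cols, n, tol=1e-10):
    """Return w_1..w_{n-q}: an orthonormal basis of the left null space of H."""
    span_H = []  # orthonormal basis of the column space of H
    for col in H_cols:
        r = residual(col, span_H)
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            span_H.append([ri / norm for ri in r])
    W = []  # complete with the parts of the canonical vectors orthogonal to span(H)
    for k in range(n):
        e_k = [1.0 if i == k else 0.0 for i in range(n)]
        r = residual(e_k, span_H + W)
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            W.append([ri / norm for ri in r])
    return W

# Toy model (hypothetical numbers): n = 4 measurements, q = 2 nuisance directions.
H_cols = [[1.0, 1.0, 1.0, 1.0], [0.0, 1.0, 2.0, 3.0]]
W = orthonormal_complement(H_cols, n=4)

y = [0.3, -1.2, 2.5, 0.7]
u = [5.0, -2.0]  # arbitrary translation g(y) = y + H u
y_shifted = [yi + sum(uj * col[i] for uj, col in zip(u, H_cols))
             for i, yi in enumerate(y)]

z = [dot(w, y) for w in W]                  # z = W^T y
z_shifted = [dot(w, y_shifted) for w in W]  # same statistic after the translation
```

In this sketch `z` and `z_shifted` coincide up to rounding, which is the invariance property that makes z insensitive to the nuisance parameter.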
Definition 1. Let $\varphi^*(z) = (\varphi_0^*(z), \ldots, \varphi_m^*(z))^T$ be a test function and denote by $\alpha_1^*(\varrho), \ldots, \alpha_m^*(\varrho)$ its probabilities of false isolation as functions of $\varrho$. The test function $\varphi^* \in D_\alpha$ is a constrained asymptotically uniformly minimax test in the class $D_\alpha$ between $H_0, H_1, \ldots, H_m$ if, for any other test function $\varphi(z) = (\varphi_0(z), \ldots, \varphi_m(z))^T \in D_\alpha$, the following condition is fulfilled:
$$
\alpha^*_{\max}(\varrho) \le (1 + \varepsilon(\varrho_{\min}))\,\alpha_{\max}(\varrho), \quad \forall \varrho \ge \varrho_{\min},
$$
where $\alpha_{\max}(\varrho) = \max_{1 \le i \le m} \alpha_i(\varrho)$ and $\varepsilon(\varrho_{\min}) \to 0$ as $\varrho_{\min} \to +\infty$.

Hence, the criterion of optimality consists in finding the test with the smallest maximum probability of false isolation under the constraint on the false alarm probability in the class of $G$-invariant tests $D_\alpha$, whatever the value of $\varrho \ge \varrho_{\min}$.

3. ASYMPTOTICALLY OPTIMAL TEST

The goal of this section is to solve the multiple hypotheses problem (4). The proposed solution is based on minimizing the maximum probability of false isolation under a constraint on the probability of false alarm. Generally speaking, the design of the constrained asymptotically uniformly minimax test is still an open problem. Making a mild assumption on the geometry between the $\phi_i$'s is necessary to find this test in the case of problem (4). Let $d_{i,j}$ be the distance between two normalized faults $\phi_i$ and $\phi_j$ given by:
$$
d_{i,j} = d(\phi_i, \phi_j) = \|\phi_i - \phi_j\|_2 = d_{j,i}, \quad 1 \le i, j \le m. \tag{5}
$$
This distance is directly related to the Kullback-Leibler information (see Lehmann (1986)), which is known to be a natural measure of statistical separability. The minimum isolability distance $d_i$ of fault $\phi_i$ is given by
$$
d_i = \min_{1 \le j \ne i \le m} d_{i,j}
$$
and the minimum isolability distance over all faults is
$$
d^* = \min_{1 \le i \le m} d_i. \tag{6}
$$
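The distances (5) and (6) are straightforward to compute once the normalized faults are available. The sketch below is illustrative only: the function name `isolability_distances` and the three unit vectors (placed at angles 0°, 40° and 100° in the plane) are assumptions chosen so that the result can be checked by hand, since the chord between two unit vectors separated by an angle Δ has length 2 sin(Δ/2).

```python
import math

def isolability_distances(phis):
    """Return the matrix d[i][j] of eq. (5), the per-fault minima d_i, and d* of eq. (6)."""
    m = len(phis)
    d = [[math.dist(phis[i], phis[j]) for j in range(m)] for i in range(m)]
    d_min = [min(d[i][j] for j in range(m) if j != i) for i in range(m)]
    return d, d_min, min(d_min)

# Hypothetical normalized faults: unit vectors of R^2 at 0, 40 and 100 degrees.
angles_deg = [0.0, 40.0, 100.0]
phis = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in angles_deg]

d, d_min, d_star = isolability_distances(phis)
```

Here the closest pair is separated by 40°, so `d_star` equals 2 sin(20°), and the third fault has isolability distance 2 sin(30°) = 1.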
Let us define for a fault $\phi_i$ the number $\eta_i$ of faults $\phi_j$ such that $d_{i,j} = d_i$. In the following, it is assumed that $\eta_i = 1$ for all $1 \le i \le m$. This assumption depends on the application. For example, in network monitoring, this assumption is not severe in practice because the numbers $\eta_i$ can be chosen by the network operator through its routing policy. In other words, by conveniently choosing the set of rules which defines the relation between the router and the external world in terms of route exchanges and protocol interactions, network operators can adapt the separability between the OD flows. In addition, it is assumed that $0 < d^* < 2$, which means that the case of two alternative hypotheses ($m = 2$) such that $\phi_1 = -\phi_2$ is excluded. Under these assumptions, the constrained asymptotically uniformly minimax test is given by the following theorem. The standard Gaussian cumulative distribution function is denoted by $\Phi(\cdot)$ and its inverse is $\Phi^{-1}(\cdot)$. Let the well-known function $Q(\cdot)$ be defined by:
$$
Q(x) = \frac{1}{\sqrt{2\pi}} \int_x^{+\infty} \exp\left(-\frac{u^2}{2}\right) du.
$$
Theorem 1. Let $0 < \alpha < 1$. The constrained asymptotically uniformly minimax test function $\varphi^* = (\varphi_0^*, \ldots, \varphi_m^*)^T$ in $D_\alpha$ between $H_0, \ldots, H_m$ is given by:
$$
\varphi_0^*(z) =
\begin{cases}
1 & \text{if } \max\limits_{1 \le i \le m} \{\phi_i^T z\} \le \lambda,\\
0 & \text{if } \max\limits_{1 \le i \le m} \{\phi_i^T z\} > \lambda,
\end{cases}
\tag{7}
$$
and for $j = 1, \ldots, m$,
$$
\varphi_j^*(z) =
\begin{cases}
1 & \text{if } \phi_j^T z = \max\limits_{1 \le i \le m} \{\phi_i^T z\} > \lambda,\\
0 & \text{otherwise},
\end{cases}
\tag{8}
$$
where
$$
\lambda = \gamma\,\varrho_{\min}\,\Phi^{-1}\left(1 - \frac{\alpha}{m}\right).
$$
This test asymptotically verifies
$$
\alpha^*_{\max}(\varrho) \sim Q\left(\frac{d^*}{2\gamma}\,\varrho\right), \quad \forall \varrho \ge \varrho_{\min}, \tag{9}
$$
as $\varrho_{\min} \to +\infty$.
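The decision rule (7)-(8) and the asymptotic expression (9) can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the names `q_function`, `decide` and `asymptotic_false_isolation` are hypothetical, the threshold `lam` is passed in as an argument so that the sketch does not depend on how it is calibrated, and ties in the maximum (an event of probability zero) are broken in favour of the smallest index.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def decide(z, phis, lam):
    """Return 0 if H0 is accepted, otherwise the index j in 1..m of the accepted H_j."""
    scores = [sum(p * zi for p, zi in zip(phi, z)) for phi in phis]  # phi_i^T z
    best = max(range(len(scores)), key=scores.__getitem__)
    return 0 if scores[best] <= lam else best + 1

def asymptotic_false_isolation(rho, d_star, gamma):
    """Right-hand side of (9): Q(d* rho / (2 gamma))."""
    return q_function(d_star * rho / (2.0 * gamma))

# Hypothetical example with m = 2 orthogonal normalized faults in R^2.
phis = [(1.0, 0.0), (0.0, 1.0)]
lam = 2.0  # illustrative threshold value
no_fault = decide((0.5, 0.3), phis, lam)   # both scores below the threshold
fault_2 = decide((1.0, 5.0), phis, lam)    # largest score is along phi_2
fault_1 = decide((3.0, 1.0), phis, lam)    # largest score is along phi_1
```

The rule only needs the m inner products of z with the normalized faults, so its cost grows linearly with the number of hypotheses.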
4. PARAMETRIC NETWORK MONITORING

This section describes the fault detection/isolation problem for network monitoring from SNMP measurements. Let us consider a network composed of $r$ nodes and $n$ monodirectional links (see details in Lakhina et al. (2004)). The volume of traffic $s(\ell)$, typically in bytes, on the link $\ell$ at time $t$ is provided by SNMP link load measurements. Let $x(i, j)$ be the OD traffic demand from node $i$ to node $j$ at time $t$. This situation is shown in Fig. 1. To simplify the notations and since the proposed approach is based on a "snapshot" test, the subscript $t$ is omitted in the rest of the paper. The link loads and the traffic matrix are simply related by a linear equation
$$
s = M x \tag{10}
$$
where $s = (s(1), \ldots, s(n))^T$, $x = (x(1), \ldots, x(m))^T$ contains the $m$ ($m \gg n$) unknown traffic matrix elements $x(k) = x(i_k, j_k)$ rewritten as a vector (in lexicographic order) and $M = \{a(\ell, k)\}$ is the $n \times m$ routing matrix, where $0 \le a(\ell, k) \le 1$ represents the fraction of the volume of OD flow $k$ routed through link $\ell$. Here, $X^T$ denotes the transpose of the matrix $X$. Without loss of generality, the known matrix $M$ is assumed to be full row rank, $\mathrm{rank}(M) = n$.

[Figure 1: three-node network (Node 1, Node 2, Node 3) with link loads s(1), s(2), s(3), where s(1) = x(1, 2) + x(1, 3) + x(3, 2), and time plots (axis t, change time t0) of the OD flows x(1, 2) and x(1, 3).]

Fig. 1. Detection and isolation of faults in OD traffic volumes. A fault, which occurs in OD flow x(1, 3), is routed on links s(1) and s(2).

As mentioned above, the main problem with gathering the traffic matrix from SNMP measurements is that $n \ll m$. To overcome this problem, a parsimonious linear model of non-anomalous traffic has been developed in Casas et al. (2010). The idea of this model is that the non-anomalous (ambient) traffic $x$ can be represented at each time $t$ by using a known family of $q$ basis functions $B = (b_1, b_2, \ldots, b_q)$ such that $q < n$. In other words, Casas et al. (2010) have proposed that
$$
x = B\mu + \zeta \tag{11}
$$
where $\zeta \sim \mathcal{N}(0, \gamma^2 \Sigma)$ is a Gaussian noise with the $m \times m$ spatial diagonal covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_m^2)$. The matrix $\Sigma$ is assumed to be known and stable in time. On the contrary, the scalar $\gamma^2$ serves to model the mean level of the variance (due to the natural OD flow time variability) and it may depend on time. Let $\Upsilon = M \Sigma M^T$ and let $\Upsilon^{-1/2}$ be the inverse square root matrix of $\Upsilon$. Inserting (11) in (10) and left multiplying the result by $\Upsilon^{-1/2}$ leads to the regression model (1) with $y = \Upsilon^{-1/2} s$, $H = \Upsilon^{-1/2} M B$ and $\xi = \Upsilon^{-1/2} M \zeta$. Indeed, $M\zeta$ has covariance $\gamma^2 M \Sigma M^T = \gamma^2 \Upsilon$, so that $\xi$ has covariance $\gamma^2 I_n$ as required by (1).

The detection problem consists in detecting a significant fault in an OD flow $x(i, j)$ by using only the SNMP measurements $s(1), \ldots, s(n)$ or, equivalently, $y(1), \ldots, y(n)$ given by $y$. The isolation problem consists in identifying the indices $(i, j)$ of the OD flow carrying the abnormal volume of traffic. For example, in Fig. 1, it is necessary to detect the increase of the traffic volume $x(1, 3)$ by using $s(1)$, $s(2)$ and $s(3)$. In network monitoring, the fault $\theta_j$, associated with hypothesis $\mathcal{H}_j$, corresponds to the presence of an abnormal volume of traffic in OD flow $j$. Hence $\theta_j = \Upsilon^{-1/2} M(j)$, where $M(j)$ is the $j$-th column of $M$. This leads to the detection/isolation problem (2).
The anomalies θj are time variable since Υ and M vary in time (the monitored network is a variable structure system).
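To make the link-load relation (10) concrete, the sketch below applies it to the three-node network of Fig. 1, using the routing matrix written out in Example 1. The traffic volumes and the fault size are made-up illustrative numbers, and the helper `link_loads` is a hypothetical name.

```python
# Routing matrix of the three-node network (rows: links 1..3, columns: OD flows).
M = [
    [1, 1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
]
# Column ordering of the OD flows x(i, j), as in Example 1.
OD = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)]

def link_loads(M, x):
    """Compute s = M x: each link load is the sum of the OD flows routed over it."""
    return [sum(a * xk for a, xk in zip(row, x)) for row in M]

# Hypothetical nominal OD volumes, then an additive fault on OD flow x(1, 3).
x_nominal = [10.0, 20.0, 5.0, 8.0, 12.0, 6.0]
x_faulty = list(x_nominal)
x_faulty[OD.index((1, 3))] += 15.0

s_nominal = link_loads(M, x_nominal)
s_faulty = link_loads(M, x_faulty)
delta = [b - a for a, b in zip(s_nominal, s_faulty)]  # fault signature on the links
```

The signature `delta` is the fault amplitude times the column of M associated with x(1, 3): only links 1 and 2 are affected, as described in the caption of Fig. 1.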
Example 1. To illustrate the above notations and definitions, let us consider the network depicted in Fig. 1. This network is composed of 3 nodes, 3 monodirectional physical links and 6 OD flows. The $3 \times 6$ matrix $M$ is given by:
$$
M = \begin{pmatrix}
1 & 1 & 0 & 0 & 0 & 1\\
0 & 1 & 1 & 1 & 0 & 0\\
0 & 0 & 1 & 0 & 1 & 1
\end{pmatrix},
$$
with $x = (x(1, 2), x(1, 3), x(2, 1), x(2, 3), x(3, 1), x(3, 2))^T$. For simplicity, the matrix $M$ is composed of coefficients $a(\ell, k)$ taking only the two values 0 and 1.

5. NUMERICAL RESULTS

The objective of this section is twofold. First, it is difficult to apply an asymptotic approach in practice without theoretically warranted bounds for the criterion. For this reason, non-asymptotic lower and upper bounds for the probability of false isolation are proposed and compared with the asymptotic equation. Next, the performance of the proposed test is compared against the PCA-based test proposed in Lakhina et al. (2004).
5.1 Description of the Abilene network data set

The evaluation of the proposed methods requires the knowledge of the real OD traffic flows. Such measurements are quite difficult to obtain in a commercial network but they are available for the Abilene network. The Abilene backbone is composed of $r = 12$ core routers and $m = 144$ OD flows. For these numerical experiments, $n = 42$ backbone links are measured. More details on this network are given in Casas et al. (2010). The primary data inputs are the time series of link loads (bytes across interfaces) gathered through SNMP. The sampling rate is one measurement per 10 minutes. Two sets of measurements are used: the first one, the fault-free data set, is composed of 6 fault-free SNMP measurements (one hour measurement period) and the second one, the testing data set, is composed of 720 SNMP measurements (five days measurement period). Let $T_a$ (respectively $T_b$) be the set of time indices associated with the SNMP measurements of the fault-free (resp. testing) data set. The fault-free data set is measured one hour before the testing one.

To identify the set of "true" faults in the testing data set, unusual deviations from the mean in each OD flow are manually detected (see details in Casas et al. (2010)). Let $T_b^\circ \subset T_b$ be the set of time instances $t$ associated with the 680 (not necessarily consecutive) SNMP measurements of the testing data set manually declared as fault-free. The other 40 measurements of the testing data set are affected by at least one significant fault ("true fault"). This section is focused on the model (1) whose practical design is described in Section 4. The routing matrix $M$ has the size $42 \times 144$. The matrices $B$ and $\Sigma$ are described in Casas et al. (2010). The matrix $H = \Upsilon^{-1/2} M B$ has the size $42 \times 6$.

5.2 Lower and upper bounds for error probabilities

The detection/isolation problem (2) contains $m = 144$ hypotheses ($m \gg n$). To show the relation between the asymptotic probability of false alarm and its non-asymptotic lower and upper bounds, let us consider the problem (4) with $\varrho_{\min} = 8$. It is assumed that the variance $\gamma^2 = 1$. The most common lower bound (see Swaszek (1995)) of $\alpha_k^*(\varrho_{\min})$ is $\alpha_{k,-}(\varrho_{\min})$ given by
$$
\alpha_{k,-}(\varrho_{\min}) = \max_{0 \le i \ne k \le m} Q(\gamma_{k,i}). \tag{12}
$$
The most common upper bound $\alpha_{k,+}(\varrho_{\min})$ (see Swaszek (1995)) of $\alpha_k^*(\varrho_{\min})$ is the union bound
$$
\alpha_{k,+}(\varrho_{\min}) = \sum_{i=0,\, i \ne k}^{m} Q(\gamma_{k,i}), \tag{13}
$$
where
$$
\gamma_{k,i} = \frac{\varrho_{\min}\, d_{k,i}}{2\gamma}, \quad 1 \le i \ne k \le m, \tag{14}
$$
$$
\gamma_{k,0} = \frac{\varrho_{\min}}{\gamma} - Q^{-1}\left(\frac{\alpha}{m}\right), \tag{15}
$$
and $Q^{-1}(\cdot)$ is the inverse of $Q(\cdot)$. Fig. 2 shows the lower bound $\alpha_{k,-}(\varrho_{\min})$ and the upper bound $\alpha_{k,+}(\varrho_{\min})$ for the probability of false isolation, i.e. $\alpha_{k,-}(\varrho_{\min}) \le \alpha_k^*(\varrho_{\min}) \le \alpha_{k,+}(\varrho_{\min})$. The sharpness of these bounds depends on the mutual geometry between the columns of the routing matrix $M$ after the ambient traffic rejection, i.e. the distances between the $\phi_i$'s.

[Figure 2: probability of false isolation (logarithmic scale, from 10^-5 to 10^-1) versus hypothesis index (0 to 150); curves: upper bound, lower bound.]

Fig. 2. Probability of false isolation: lower (solid line) and upper (dashed line) bounds.

Fig. 3 shows the lower bound $\max_{1 \le k \le m} \alpha_{k,-}(\varrho_{\min})$ and the upper bound $\max_{1 \le k \le m} \alpha_{k,+}(\varrho_{\min})$ for the maximum probability of false isolation as functions of $\varrho_{\min}$.

[Figure 3: probability of false isolation (logarithmic scale, from 10^-4 to 10^-1) versus $\varrho_{\min}$ (5 to 16); curves: lower bound, upper bound, asymptotic probability of false isolation.]

Fig. 3. Lower (solid line) and upper (dashed line) bounds, and the asymptotic probability of false isolation (o-mark line) as functions of $\varrho_{\min}$.
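The bounds (12)-(15) of Section 5.2 can be evaluated with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the three normalized faults are toy unit vectors rather than the Abilene data, and the code index `k` runs from 0 to m - 1 over the faults (the term attached to hypothesis H0 is handled separately through (15)).

```python
import math
from statistics import NormalDist

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0, 1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def q_inverse(p):
    """Inverse of Q on (0, 1): Q^{-1}(p) = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)

def isolation_bounds(k, phis, rho_min, gamma, alpha):
    """Lower bound (12) and union upper bound (13) for fault k (0-based index)."""
    m = len(phis)
    # gamma_{k,0} of eq. (15): competition with the no-fault hypothesis H0
    margins = [rho_min / gamma - q_inverse(alpha / m)]
    # gamma_{k,i} of eq. (14): competition with every other fault
    margins += [rho_min * math.dist(phis[k], phis[i]) / (2.0 * gamma)
                for i in range(m) if i != k]
    terms = [q_function(g) for g in margins]
    return max(terms), sum(terms)

# Hypothetical normalized faults: unit vectors of R^2 at 0, 40 and 100 degrees.
phis = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
        for a in (0.0, 40.0, 100.0)]
bounds = [isolation_bounds(k, phis, rho_min=8.0, gamma=1.0, alpha=0.05)
          for k in range(len(phis))]
worst_lower = max(lo for lo, _ in bounds)  # max over k of the lower bounds
worst_upper = max(up for _, up in bounds)  # max over k of the upper bounds
```

Since the upper bound is a sum of m terms whose largest one is the lower bound, the two bounds can differ by at most a factor m; they are close whenever a single competing hypothesis dominates.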
5.3 Comparison with the PCA-based test

The parameter $\gamma_t^2$ is estimated from the short fault-free data set by using the maximum likelihood estimate of the noise variance in the residuals $z_t = W^T y_t$. During the test, at time $t$, if no fault has been declared one hour before, $\gamma_t^2$ is estimated by its value one hour before. The minimum intensity to detect is arbitrarily fixed at $\varrho_{\min} = 4$. The PCA test is described in Lakhina et al. (2004). The Principal Components (PC) subspace that captures the fault-free traffic behavior is defined by the number of PCs which are used and, consequently, the PCA anomaly detection test is based on the squared norm of the residuals. The isolation step is based on the projection of the observation onto the residual space. The results are presented in Table 1 for the PCA test with 6 PCs (because the matrix $H$ also has 6 columns). The number of correct isolations for the algorithm entitled 'label' is the sum $\sum_{k=1}^{m} a_k^{\text{label}}$, where $a_k^{\text{label}}$ is the number of correct decisions of type $k$ for the algorithm entitled 'label'. The number of missed isolations corresponds to the number of non-detections plus the number of detections with an incorrect isolation. Clearly, the PCA test is not as sensitive as the proposed test. In fact, the PCA decomposition of SNMP measurements is too rough to detect small (but significant) faults.

Table 1. Results of the detection/isolation for the spline-based and the PCA-based tests.

Type of situation     Spline-based      PCA (6 PCs)
Normal working        635 (93.38 %)     632 (92.94 %)
False alarms          45 (6.62 %)       48 (7.06 %)
Correct isolations    32 (80 %)         5 (12.50 %)
Missed isolations     8 (20 %)          35 (87.50 %)

Finally, Fig. 4 shows the mean correct isolation rates of the spline-based test and the PCA test for different false alarm rates varying between 0 and 1. The mean correct isolation rate is defined by
$$
\beta_b^{\text{label}} = \frac{1}{b} \sum_{k=1}^{m} a_k^{\text{label}}
$$
where $b = 40$ is the total number of testing measurements with faults. In order to appreciate the sensitivity of the PCA test to the number of principal components of the normal subspace, three different numbers of PCs are used: 1 PC, 3 PCs and 6 PCs. The proposed test clearly outperforms the PCA approach, at least for the Abilene data set.

[Figure 4: mean correct isolation rate (0 to 1) versus false alarm rate (0 to 1); curves: spline-based test, PCA test with 1 PC, PCA test with 3 PCs, PCA test with 6 PCs.]

Fig. 4. Mean correct isolation rate versus false alarm rate for the spline-based test (solid line) and for the PCA test with different numbers of PCs: 1 PC (dashed-dotted line), 3 PCs (dotted line) and 6 PCs (dashed line).

6. CONCLUSION
An optimal constrained asymptotically uniformly minimax test has been proposed to detect and isolate faults in a noisy measurement vector contaminated by linear nuisance parameters (see Theorem 1). This test is applied to fault detection and isolation in OD traffic flows from SNMP measurements. Results obtained with real traffic data from a backbone network show that the proposed detection/isolation approach outperforms the popular PCA-based test.

REFERENCES

Abrahamsson, T. (1998). Estimation of origin-destination matrices using traffic counts - a literature survey. Working paper, International Institute for Applied Systems Analysis.
Basseville, M. and Nikiforov, I. (2002). Fault isolation for diagnosis: nuisance rejection and multiple hypotheses testing. Annual Reviews in Control, 26(2), 189–202.
Baygün, B. and Hero, A.O. (1995). Optimal simultaneous detection and estimation under a false alarm constraint. IEEE Trans. Inform. Theory, 41(3), 688–703.
Casas, P., Vaton, S., Fillatre, L., and Nikiforov, I. (2010). Optimal volume anomaly detection and isolation in large-scale IP networks using coarse-grained measurements. Computer Networks, 54(11), 1750–1766.
Ferguson, T. (1967). Mathematical Statistics: A Decision Theoretic Approach. Academic Press.
Fillatre, L. and Nikiforov, I. (2007). Non-Bayesian detection and detectability of anomalies from a few noisy tomographic projections. IEEE Trans. Signal Processing, 55(2), 401–413.
Fillatre, L. and Nikiforov, I. (2010). A fixed size sample strategy for the sequential detection and isolation of non-orthogonal alternatives. Sequential Analysis, 29, 176–192.
Fouladirad, M. and Nikiforov, I. (2005). Optimal statistical fault detection with nuisance parameters. Automatica, 41(7), 1157–1171.
Lakhina, A., Crovella, M., and Diot, C. (2004). Diagnosing network-wide traffic anomalies. In ACM SIGCOMM.
Lehmann, E. (1986). Testing Statistical Hypotheses, Second Edition. Chapman & Hall.
Nikiforov, I. (1997). Two strategies in the problem of change detection and isolation. IEEE Trans. Inform. Theory, 43(2), 770–776.
Ringberg, H., Soule, A., Rexford, J., and Diot, C. (2007). Sensitivity of PCA for traffic anomaly detection. In ACM Sigmetrics.
Swaszek, P.F. (1995). A lower bound on the error probability for signals in white Gaussian noise. IEEE Trans. Inform. Theory, 41(3), 837–841.
Zhang, Y., Roughan, M., Lund, C., and Donoho, D. (2005). Estimating point-to-point and point-to-multipoint traffic matrices: an information-theoretic approach. IEEE/ACM Trans. Networking, 13(5), 947–960.