
Online Anomaly Detection using KDE

Tarem Ahmed
Department of Computer Science and Engineering
BRAC University, Dhaka, Bangladesh
Email: [email protected]

Abstract—Large backbone networks are regularly affected by a range of anomalies. This paper presents an online anomaly detection algorithm based on Kernel Density Estimates. The proposed algorithm sequentially and adaptively learns the definition of normality in the given application, assumes no prior knowledge regarding the underlying distributions, and then detects anomalies subject to a user-set tolerance level for false alarms. Comparison with the existing methods of Geometric Entropy Minimization, Principal Component Analysis and One-Class Neighbor Machine demonstrates that the proposed method achieves superior performance with lower complexity.

I. INTRODUCTION

Modern high-speed networks are commonly affected by various types of anomalies, and the broad field of network anomaly detection has thus recently attracted a vast amount of research [1]. My previous work showed that the usual signature-based or model-based approaches to network anomaly detection can only signal previously-seen anomalies, and even then only after a long stretch of time has elapsed [2]. In [2] I suggested an alternative approach: instead of modeling the anomalies, the proposed algorithm learns the nature of normality. The Kernel-based Online Anomaly Detection (KOAD) algorithm proposed in [2] is a recursive learning algorithm that is portable across applications and able to signal anomalies in high-dimensional timeseries in real time. This was shown in [3] with applications to various datasets.

This paper presents the Kernel Estimation-based Anomaly Detection (KEAD) algorithm, which was developed using the mathematical technique of Kernel Density Estimation (KDE) and is the next version of the KOAD algorithm. The anomaly detection threshold has been mathematically linked to the user's specified tolerance level for false alarms. Supplementary algorithms have been developed to set all the primary parameters of the earlier KOAD algorithm, thereby completely automating the anomaly detection process.

A. Related Work

While researchers have recently begun to use machine learning techniques to detect outliers in datasets from a variety of fields [4], the use of online machine learning algorithms specifically designed for network anomaly detection is still rare [3]. The algorithm that comes closest to this work is the technique of Heinz and Seeger using online kernel density estimation [5]. Heinz and Seeger's algorithm is based on the M-Kernels first proposed by Zhou et al. in [6], where an M-Kernel is essentially a weighted version of a conventional kernel estimate. They compare their online estimate with the best offline kernel density estimate to evaluate performance in a single dimension. Their algorithm performs exponential smoothing of estimates to account for the evolution of data streams over time. To contrast, KEAD processes high-dimensional datasets, and employs deletion of obsolete elements in addition to exponential forgetting to maintain a current dictionary of approximately linearly independent elements, as is explained later.

Kivinen et al. present an online algorithm applicable to novelty detection that is based on support vector machines [7]. Their algorithm minimises a risk function with the newly-arriving data point expressed as a kernel expansion of previous data points. They achieve sparsity by using a power series approximation in the update equations for the kernel expansion coefficients. Contrarily, KEAD achieves sparsity by identifying the most representative of the past observations. Duffield et al. present an adaptive architecture where anomalies are flagged using flow signatures that are themselves learnt from the data stream [8]. Another application of learning methods to network anomaly detection is the Geometric Entropy Minimization (GEM) approach of Hero [9]. GEM is a block-based, adaptive method that does not require a detection threshold. It is based on the minimal covering properties of entropic graphs when constructed from a set of training samples.

B. Outline of Paper

Section II introduces the mathematical techniques used, before the problem is stated in Section III. Section IV defines the detection statistics. Section V presents the KEAD algorithm and Section VI describes how the detection threshold is set. Section VII performs a complexity comparison of the various algorithms studied. Section VIII describes the data, while Section IX presents experimental results. Section X concludes.

II. BACKGROUND

A. Annotations Based on Minimum Volume Set

A Minimum Volume Set (MVS) $G_\beta$ for a generic and unknown probability distribution P, with mass at least β ∈ (0, 1) with respect to reference measure μ, is defined as:

$$G_\beta = \arg\min \{\mu(G) : P(G) \ge \beta\} \qquad (1)$$

where G is a measurable set [10].

¹A significant portion of this work was performed while the author was at McGill University, Montreal, QC, Canada.


Following ideas from Scott and Kolaczyk [11], for a block of T D-dimensional datapoints $\{\mathbf{x}_t\}_{t=1}^{T} \in \mathbb{R}^D$, let:

$$\beta_t := \inf_{0 \le \beta \le 1} \{\beta : \mathbf{x}_t \in G_{P,\beta}\}. \qquad (2)$$

That is, by letting β vary from 0 to 1, and consequently selecting regions corresponding to maximum through minimum density under P, each point in $\{\mathbf{x}_t\}_{t=1}^{T}$ is assigned to a set $G_{P,\beta_t}$ that, among all minimum volume sets, just barely includes $\mathbf{x}_t$. Then sorting the $\{\beta_t\}_{t=1}^{T}$ into descending order $\{\beta_{(T)}, \ldots, \beta_{(1)}\}$ naturally induces a ranking $\{\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(T)}\}$ of the observations, where (t) denotes the index of the t-th relatively most potentially anomalous observation. Scott and Kolaczyk present the Multivariate Nonparametric Simultaneous Contamination Annotation (MN-SCAnn) algorithm, which proposes a related annotation list:

$$\gamma_t := 1 - P_{FD}(G_{P,\beta_t}) \qquad (3)$$

where $P_{FD}(G_{P,\beta_t})$ is the probability of false discovery related to the MVS G under probability distribution P at level β. Details of the MN-SCAnn algorithm are provided later. Scott and Kolaczyk show that the descending sequence $\{\gamma_{(T)}, \ldots, \gamma_{(1)}\}$ preserves the ranking $\{\mathbf{x}_{(1)}, \ldots, \mathbf{x}_{(T)}\}$ induced on the observations by the sequence $\{\beta_{(T)}, \ldots, \beta_{(1)}\}$. The development of this related annotation list $\{\gamma_t\}_{t=1}^{T}$ as part of the MN-SCAnn algorithm is built on the work of Storey [12], which is an example of the various studies of the multiple hypothesis testing problem that were initiated by the seminal paper of Benjamini and Hochberg [13].

B. One-Class Neighbor Machine

The One-Class Neighbor Machine (OCNM) algorithm proposed by Muñoz and Moguerza provides an elegant means for estimating minimum volume sets [14]. It assumes a sample set S comprising T data points of dimension D, $\{\mathbf{x}_t\}_{t=1}^{T} \in \mathbb{R}^D$. The algorithm requires the choice of a sparsity measure, examples of which are the n-th nearest-neighbour Euclidean distance and the mean of the first n nearest-neighbour distances. Let g denote the sparsity measure. OCNM proceeds by sorting the values of g for the set of points in S, and subsequently identifying the pre-specified fraction μ of points that lie inside the MVS (an illustrative sketch is given at the end of this section). The application of the OCNM algorithm to anomaly detection was introduced in my previous work [2], [3].

C. Principal Component Analysis

The technique of Principal Component Analysis (PCA) has been used extensively by Lakhina et al. [15], [16] to detect anomalies in large IP backbone networks such as Abilene and GEANT. PCA exploits the inherent low dimensionality of such multivariate network measurements by separating the multivariate space into disjoint "normal" and "anomalous" subspaces. An anomaly is signalled when the magnitude of the projection onto the anomalous subspace exceeds an associated Q-statistic threshold, subject to a specified confidence level. The PCA subspace method has since become a popular basis for ongoing work in anomaly detection [17].
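The following Python sketch illustrates the OCNM idea described above, using the n-th nearest-neighbour distance as the sparsity measure. It is a simplified reading for illustration only, not the authors' implementation; the function name and parameters are hypothetical.

import numpy as np

def ocnm_mvs(X, mu=0.95, n=5):
    """Estimate a minimum volume set in the spirit of OCNM:
    rank points by the n-th nearest-neighbour Euclidean distance (the
    sparsity measure g) and keep the fraction `mu` with the smallest g.
    X is a (T, D) array; returns a boolean mask of points inside the MVS."""
    T = X.shape[0]
    # Pairwise Euclidean distances, O(T^2 D) as noted in the complexity section.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # g(x_t): distance to the n-th nearest neighbour
    # (column 0 of each sorted row is the point itself, at distance 0).
    g = np.sort(dists, axis=1)[:, n]
    inside = np.zeros(T, dtype=bool)
    inside[np.argsort(g)[: int(np.ceil(mu * T))]] = True
    return inside   # points with inside == False are flagged as outliers

# Example: flag roughly 5% of a toy Gaussian sample as outliers.
rng = np.random.default_rng(0)
mask = ocnm_mvs(rng.normal(size=(500, 2)), mu=0.95, n=5)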

III. PROBLEM DESCRIPTION

A. Problem Statements

My research addresses the two problems stated below.

Problem 1. Given a sequence of datapoints $\{\mathbf{x}_i\}_{i=t-L}^{t+L} \in \mathbb{R}^D$, I wish to determine whether $\mathbf{x}_t$ is a realisation of probability distribution $P_{n,t}$ or $P_a$. I do not know or specify $P_{n,t}$, but assume that the points $\{\mathbf{x}_i\}_{i=t-L}^{t+L}$ are independent observations from the mixture distribution $P_t$:

$$P_t = (1 - \pi) P_{n,t} + \pi P_a \qquad (4)$$

where π is the mixing fraction. Here $P_{n,t}$ is slowly time-varying, while $P_a$ is time-invariant. I wish to achieve an operator-specified probability of false discovery, $P^*_{FD}$.

Problem 2. I wish to make a preliminary decision about the underlying distribution of $\mathbf{x}_t$ at time t, and a final decision at time $t + \ell$, where $\ell \ll L$. Thus the preliminary decision is based on the data sequence $\{\mathbf{x}_i\}_{i=t-L}^{t}$ and the final decision on $\{\mathbf{x}_i\}_{i=t-L}^{t+\ell}$, instead of on $\{\mathbf{x}_i\}_{i=t-L}^{t+L}$.

B. Discussion and Methodology

In accordance with the problem statements, the component distributions $P_{n,t}$ and $P_a$ correspond to normal and anomalous traffic at time t. The respective a priori probabilities are $1 - \pi$ and π. The only information available is the set of independent observations $\{\mathbf{x}_i\}_{i=t-L}^{t+L}$ from the mixture $P_t$. The distribution governing normal data, $P_{n,t}$, is unknown and slowly time-varying, and the a priori probability of an anomaly occurring (π) is unknown, low, and time-invariant. In the absence of information regarding the anomaly distribution $P_a$, I adopt the approach of [11] and assume that it is the D-dimensional uniform distribution that bounds the support of $P_{n,t}$. While MVS approaches to anomaly detection typically assume that $P_a$ is known [18], the uniform distribution assumption can be justified as it has been shown to optimise the worst-case detection rate among all choices for the unknown anomaly distribution [19].

Algorithms that estimate minimum volume sets yield a set $\hat{G}_\beta$ that approximates the true MVS $G_\beta$. A missed detection occurs when $\mathbf{x}_t$ is a realisation of $P_a$ but is a posteriori inferred to be a realisation of $P_{n,t}$. A false discovery occurs when $\mathbf{x}_t$ is a realisation of $P_{n,t}$ but is inferred to be a realisation of $P_a$. After an MVS-estimating algorithm is used to obtain an estimate $\hat{G}_\beta$ of the true MVS $G_\beta$, the cases which correspond to a correct detection or a false discovery, and the corresponding probabilities, may be identified:

$$P_D = \Pr[\mathbf{x}_t \sim P_a \mid \mathbf{x}_t \notin G_\beta] \qquad (5)$$

$$P_{FD} = \Pr[\mathbf{x}_t \sim P_{n,t} \mid \mathbf{x}_t \notin G_\beta]. \qquad (6)$$

IV. DETECTION STATISTICS

A. Block-based Detection Statistic

Kernel density functions are a popular method of obtaining estimates of minimum volume sets [20] and are also a means of estimating the probability density function of a random variable. Assuming that the distribution governing the normal points, $P_{n,t}$, is stationary over $\{t-L : t+L\}$ leads to the following expression for the Kernel Density Estimate (KDE) at $\mathbf{x}_t$:

$$\tau_t = \frac{1}{2L} \sum_{i=t-L}^{t+L} k(\mathbf{x}_i, \mathbf{x}_t) \qquad (7)$$

where $k(\cdot,\cdot)$ represents a suitably chosen kernel function [4]. The KDE $\tau_t$ is expected to be relatively low if the input vector $\mathbf{x}_t$ arises from the anomaly distribution $P_a$, compared to when $\mathbf{x}_t$ arises from the normal distribution $P_{n,t}$. This is due to two primary reasons. First, $P_a$ is expected to be more voluminous in space than $P_{n,t}$. Second, the probability of a point arising from $P_a$ is expected to be small, so the majority of points are likely to be realisations of $P_{n,t}$. Following this reasoning, the KDE $\tau_t$ is proposed as the statistic for performing a block-based anomaly decision on $\mathbf{x}_t$.

B. Online Detection Statistic

As stated in Problem 2, the aim is to make a preliminary decision about the underlying distribution of $\mathbf{x}_t$ at time t, and a final decision at time $t + \ell$. Thus the decision needs to be made initially using $\{\mathbf{x}_i\}_{i=t-L}^{t}$, and finally using $\{\mathbf{x}_i\}_{i=t-L}^{t+\ell}$, in place of the full sample $\{\mathbf{x}_i\}_{i=t-L}^{t+L}$ from $P_t$. The following method of obtaining an estimated detection statistic $\hat{\tau}_t$ is proposed; $\hat{\tau}_t$ is subsequently used as the online detection statistic that approximates the probability of false discovery achieved with the algorithm for Problem 1.

At each timestep t, the proposed algorithm first evaluates the mean squared error $\delta_t$ in representing $\mathbf{x}_t$ using a relatively small dictionary of approximately linearly independent elements $\mathcal{D}_t = \{\tilde{\mathbf{x}}_j\}_{j=1}^{m_t}$ in the feature space defined by the kernel function. The error $\delta_t$ may be derived to be [21]:

$$\delta_t = \min_{\mathbf{a}_t} \left\{ \mathbf{a}_t^T \tilde{K}_{t-1} \mathbf{a}_t - 2 \mathbf{a}_t^T \tilde{\mathbf{k}}_{t-1}(\mathbf{x}_t) + k(\mathbf{x}_t, \mathbf{x}_t) \right\} \qquad (8)$$

where $[\tilde{K}_{t-1}]_{i,j} = k(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j)$ and $[\tilde{\mathbf{k}}_{t-1}(\mathbf{x}_t)]_j = k(\tilde{\mathbf{x}}_j, \mathbf{x}_t)$ for $i, j = 1, \ldots, m_{t-1}$. Note that here $\{\tilde{\mathbf{x}}_j\}_{j=1}^{m_{t-1}}$ represent those selections from $\{\mathbf{x}_i\}_{i=1}^{t-1}$ that have been entered into the dictionary up to time $t-1$. The optimum sparsification coefficient vector $\mathbf{a}_t$ that minimises $\delta_t$ at time t is then:

$$\mathbf{a}_t = \tilde{K}_{t-1}^{-1} \, \tilde{\mathbf{k}}_{t-1}(\mathbf{x}_t). \qquad (9)$$

The expression for the error $\delta_t$ may then be simplified to:

$$\delta_t = k_{tt} - \tilde{\mathbf{k}}_{t-1}(\mathbf{x}_t)^T \mathbf{a}_t. \qquad (10)$$
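As an illustration of (7)-(10), the following Python sketch computes the block KDE statistic and the sparsification (approximate linear dependence) test. A Gaussian kernel is assumed here for concreteness, and all function names are hypothetical; this is a sketch, not the released Matlab implementation.

import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); the paper only requires a
    # "suitably chosen" kernel, so the Gaussian form here is an assumption.
    return float(np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2)
                        / (2.0 * sigma ** 2)))

def block_kde_statistic(x_t, block, sigma=1.0):
    """Block-based statistic tau_t of (7): kernel values of x_t against the
    block {x_{t-L}, ..., x_{t+L}} (2L+1 points), normalised by 2L."""
    vals = [gaussian_kernel(x_i, x_t, sigma) for x_i in block]
    return sum(vals) / (len(block) - 1)

def ald_test(x_t, dictionary, K_inv, sigma=1.0):
    """Sparsification test of (8)-(10).
    dictionary : list of the m stored elements x~_1 .. x~_m
    K_inv      : inverse of the m x m kernel matrix over the dictionary
    Returns (a_t, delta_t): optimal coefficients (9) and projection error (10)."""
    k_vec = np.array([gaussian_kernel(xj, x_t, sigma) for xj in dictionary])
    a_t = K_inv @ k_vec                                       # equation (9)
    delta_t = gaussian_kernel(x_t, x_t, sigma) - k_vec @ a_t  # equation (10)
    return a_t, delta_t

If the returned delta_t is at least ν, the new point is treated as approximately linearly independent of the dictionary and is added to it, exactly as in Algorithm 1 below.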

I propose maintaining a sliding window $A_t$ of the optimal sparsification coefficient vectors $\mathbf{a}_t$ for the past L timesteps. One may then use the dictionary $\mathcal{D}_{t-1}$ and the matrix of past optimal sparsification coefficient vectors $A_t$ to obtain the online detection statistic $\hat{\tau}_t$:

$$\hat{\tau}_t = \frac{1}{L} \sum_{i=1}^{L} \sum_{j=1}^{m_{t-1}} a_{ij} \, k(\tilde{\mathbf{x}}_j, \mathbf{x}_t) = \frac{1}{L} A_{t-1} \cdot \tilde{\mathbf{k}}_{t-1}(\mathbf{x}_t). \qquad (11)$$
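Continuing the sketch above, one reading of (11) is an average of the dictionary kernel values of $\mathbf{x}_t$ weighted by the windowed coefficient vectors. The snippet below is illustrative only; `A_window` (a hypothetical name) holds the last L coefficient vectors as rows.

def online_statistic(x_t, dictionary, A_window, sigma=1.0):
    """Online detection statistic tau_hat_t of (11)."""
    k_vec = np.array([gaussian_kernel(xj, x_t, sigma) for xj in dictionary])
    # Each entry of A_window @ k_vec is one windowed kernel expansion;
    # the statistic averages them over the L most recent timesteps.
    return float(np.mean(A_window @ k_vec))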

Note that $\hat{\tau}_t$ is an approximation of $\tau_t$ in two respects. First, only the (sparse) dictionary sample of representative input vectors from the interval $\{t-L : t\}$ is used. The error introduced on this count is governed by the sparsification parameter ν. Second, the interval $\{t-L : t\}$ is used as representative of the interval of interest $\{t-L : t+L\}$. The error introduced on this count is governed by the window length L.

V. KEAD

Algorithm 1 presents pseudocode for the proposed Kernel Estimation-based Anomaly Detection (KEAD) algorithm. Matlab code implementing the algorithm, datasets and instructions on replicating my experiments are all available at [22]. Some of the processes involving updating the coefficient and covariance matrices remain the same as in the earlier KOAD algorithm [2]; details of these steps have thus been omitted.

KEAD proceeds at every timestep t by first computing the optimum sparsification coefficient vector $\mathbf{a}_t$, the projection error $\delta_t$, and the online detection statistic $\hat{\tau}_t$. The sparsification statistic $\delta_t$ is then compared with the sparsification parameter ν. If $\delta_t \ge \nu$, $\mathbf{x}_t$ is inferred to be approximately linearly independent of the space spanned by the dictionary at time t, with approximation error $\delta_t$. Input vector $\mathbf{x}_t$ is consequently added to the dictionary. Contrarily, if $\delta_t < \nu$, $\mathbf{x}_t$ is inferred to be approximately linearly dependent on the dictionary, and the dictionary is kept unchanged.

As KEAD is based on the principles of Recursive Least Squares (RLS), the kernel matrix $\tilde{K}_t$ and its inverse $\tilde{K}_t^{-1}$ must be updated every timestep, as must the covariance matrix $B_t = [A_t^T A_t]^{-1}$, where $A_t$ is the full $t \times m_t$ matrix of least squares coefficients $\mathbf{a} = [a_j]_{j=1}^{m_t}$. Structure Λ is a binary matrix that stores the Boolean dropping statistic λ, which subsequently determines the dropping criterion, and is computed as:

$$\lambda = \tilde{\mathbf{k}}_{t-1}(\mathbf{x}_t)^T \cdot \mathbf{a}_t^T > \nu / 10. \qquad (12)$$

The columns of Λ indicate whether the kernel values of $\tilde{\mathbf{x}}_j$ with $\mathbf{x}_t$, for the $j = 1, \ldots, m_{t-1}$ members of the dictionary, exceeded ν/10 during the previous L timesteps.

KEAD flags anomalies as follows. At every timestep t, it compares the detection statistic $\hat{\tau}_t$ with the detection threshold $\eta_t$. If $\hat{\tau}_t \ge \eta_t$, the KDE of $P_t$ at $\mathbf{x}_t$, computed using the dictionary and the window of the past L sparsification coefficient vectors, is high enough and $\mathbf{x}_t$ is inferred to represent normal traffic. Contrarily, if $\hat{\tau}_t < \eta_t$, the KDE of $P_t$ is low. In this case $\mathbf{x}_t$ either represents an anomaly, or $\{\mathbf{x}_i\}_{i=t-L}^{t}$ is not a sufficiently representative sample of $\{\mathbf{x}_i\}_{i=t-L}^{t+L}$ for the estimate to be accurate. In such a situation an "orange" alarm is raised at time t, $\mathbf{x}_t$ is stored for the next $\ell \ll L$ timesteps, and a firm decision on it is delayed for a further $\ell$ timesteps. Structure Θ stores the input vectors corresponding to unresolved orange alarms.

An orange alarm that was raised at time t is resolved at the end of timestep $t + \ell$ in the following manner. The detection statistic $\hat{\tau}$ is re-computed using $A_{t+\ell}$ and the kernel values of the $\mathbf{x}_t$ that had caused the orange alarm, with dictionary $\mathcal{D}_{t+\ell}$. The lag $\ell$ allows the window of sparsification coefficient vectors A to slide forward $\ell + 1$ steps, while the dictionary is also allowed $\ell + 1$ further additions. The objective is to enable one to differentiate between the cases where $\mathbf{x}_t$ represented a true anomaly, versus where the low value of $\hat{\tau}_t$ arose because $\{\mathbf{x}_i\}_{i=t-L}^{t}$ was not a representative enough sample of $\{\mathbf{x}_i\}_{i=t-L}^{t+L}$. By delaying the final decision by $\ell \ll L$ timesteps, the algorithm allows the decision to be based on the data sequence $\{\mathbf{x}_i\}_{i=t-L+\ell}^{t+\ell}$ instead of $\{\mathbf{x}_i\}_{i=t-L}^{t}$. If the re-computed $\hat{\tau}$ still falls below $\eta_t$, the orange alarm is elevated to red; otherwise, it is lowered to green.

Algorithm 1: Kernel Estimation-based Anomaly Detection
1  Set detection threshold: η
2  Choose ν, ℓ, L, c
3  Initialise: t = 1, D = {x_1}, m_1 = 1, P_1 = [1], A = [1], Λ = [1], K̃_1 = [k_11], K̃_1^{-1} = [1/k_11]
4  for t = 2, 3, ... do
     Data: (x_t)
5      Compute k̃_{t-1}(x_t)
6      Compute sparsification coefficient vector a_t from (9)
7      Calculate dropping statistic λ
8      Compute projection error δ_t using (10)
9      Compute online detection statistic τ̂_t using (11)
10     if δ_t ≥ ν then
11         Set D = D ∪ {x_t} and ã_t = a_t
12         Compute K̃_t, K̃_t^{-1}
13         Set a_t = (0 1)^T
14         Update A_t and Λ
15         Compute B_t
16         m_t = m_{t-1} + 1
17     else
18         Update A_t and Λ
19         Compute B_t
20         m_t = m_{t-1}
21     endif
22     if τ̂_t < η_t then
23         Raise Orange Alarm
24         Set Θ = Θ ∪ {x_t}
25     endif
26     if Orange(x_{t-ℓ}) then
27         Compute k̃_{t-1}(x_{t-ℓ})
28         Re-calculate τ̂ using k̃_t(x_{t-ℓ}) and A_t instead of k̃_{t-1}(x_t) and A_{t-ℓ-1}
29         if τ̂ ≥ η_t then
30             Lower Orange(x_{t-ℓ}) to Green
31         else
32             Elevate Orange(x_{t-ℓ}) to Red
33         endif
34         Remove Θ{1}
35     endif
36     for j = 1, ..., m_t do
37         if sum([Λ]_{1:end,j}) = 0 then
38             DropElement(j)
39         endif
40     endfor
41  endfor
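To make the per-timestep flow concrete, here is a highly condensed Python sketch of one KEAD iteration, continuing the helper functions defined earlier. It omits the covariance matrix B_t, the dropping statistic, exponential forgetting and the orange-alarm resolution, and it naively re-inverts the kernel matrix where the paper uses a recursive update; the `state` dictionary and function names are assumptions, so this is an illustration of the control flow rather than the KEAD implementation.

def kead_step(x_t, state, nu, eta, sigma=1.0):
    """One (simplified) KEAD timestep: sparsification test, dictionary
    update and provisional alarm decision."""
    D, K_inv, A_window = state["D"], state["K_inv"], state["A"]
    a_t, delta_t = ald_test(x_t, D, K_inv, sigma)
    tau_hat = online_statistic(x_t, D, A_window, sigma)

    if delta_t >= nu:
        # x_t is approximately linearly independent of the dictionary: add it.
        D.append(x_t)
        K = np.array([[gaussian_kernel(xi, xj, sigma) for xj in D] for xi in D])
        state["K_inv"] = np.linalg.inv(K)        # paper: recursive update instead
        a_t = np.zeros(len(D)); a_t[-1] = 1.0    # a_t = (0 ... 0 1)^T
        # Older coefficient vectors get zero weight on the new element
        # (a simplifying assumption for this sketch).
        A_window = np.hstack([A_window, np.zeros((A_window.shape[0], 1))])

    # Slide the window of the last L coefficient vectors forward by one step.
    state["A"] = np.vstack([A_window[1:], a_t])

    orange_alarm = tau_hat < eta   # to be confirmed or dismissed l steps later
    return orange_alarm, tau_hat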

As $P_{n,t}$ is slowly time-varying, the dictionary needs to be kept current. This is achieved by gradually disregarding the influence of past observations through exponential forgetting, and by periodically discarding obsolete elements from the dictionary. The forgetting factor c places time-dependent weights on past measurements. The mechanics of dropping a dictionary element, which are more involved as they require dimension reduction and differ from the more common downdating step in conventional RLS, were explained in [2].

VI. SETTING KEAD DETECTION THRESHOLD

The detection threshold η for the KEAD online detection statistic $\hat{\tau}_t$ needs to be set from the user-specified probability of false discovery, $P^*_{FD}$. This is done by running the MN-SCAnn algorithm [11] during a training period. MN-SCAnn provides the modified annotation list $\{\gamma_t\}_{t=1}^{T}$, which may be used to set the value of η that corresponds to the desired $P^*_{FD}$. The process is described below.

Combining (3) and (5) yields:

$$\gamma_t := 1 - P_{FD}(G_{P_{n,t},\beta_t}) = 1 - \Pr[\mathbf{x}_t \notin \hat{G}_{P_{n,t},\beta_t} \mid \mathbf{x}_t \sim P_{n,t}]. \qquad (13)$$

Once the acceptable probability of false discovery is specified as $P^*_{FD}$, one obtains:

$$\gamma^* = 1 - P^*_{FD}. \qquad (14)$$

The input vector $\mathbf{x}^*$ in the training dataset $\{\mathbf{x}_t\}_{t=1}^{T}$ that yields the smallest γ among those γ values in $\{\gamma_t\}_{t=1}^{T}$ that exceed $\gamma^*$ is then determined:

$$\mathbf{x}^* = \{\mathbf{x}_t : \min_t \gamma_t > \gamma^*\}. \qquad (15)$$

The detection threshold η is then set to the mean kernel density estimate of $\mathbf{x}^*$ with the training dataset $\{\mathbf{x}_t\}_{t=1}^{T}$:

$$\eta = \frac{1}{T} \sum_{t=1}^{T} k(\mathbf{x}^*, \mathbf{x}_t). \qquad (16)$$
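Assuming the MN-SCAnn annotations for a training block are already available, the threshold-setting steps (14)-(16) reduce to a few lines. This is an illustrative sketch reusing the gaussian_kernel helper above; MN-SCAnn itself, with its Monte Carlo volume estimation, is not shown.

def set_threshold(X_train, gamma, p_fd_star, sigma=1.0):
    """Set the KEAD detection threshold eta from training data and the
    MN-SCAnn annotation list, following (14)-(16).
    X_train   : (T, D) array of training points
    gamma     : length-T array of gamma_t values from MN-SCAnn
    p_fd_star : operator-specified tolerable probability of false discovery."""
    gamma_star = 1.0 - p_fd_star                         # equation (14)
    candidates = np.where(gamma > gamma_star)[0]
    t_star = candidates[np.argmin(gamma[candidates])]    # equation (15)
    x_star = X_train[t_star]
    # equation (16): mean kernel value of x* over the whole training block
    return float(np.mean([gaussian_kernel(x_star, x_t, sigma)
                          for x_t in X_train]))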

VII. COMPLEXITY ANALYSIS

The objective is to build an online anomaly detector suitable for use in multivariate, high-speed networks. Such an algorithm must adhere to strict limitations in terms of memory requirements, and its per-timestep processing speed must also conform to the rate at which modern network monitoring devices report statistics. The KEAD algorithm is intended to be run over the test period, after first running the MN-SCAnn algorithm over a training period to initialise KEAD. This section discusses the complexities involved in this technique, and compares them with those of the other approaches discussed.

MN-SCAnn evaluates the modified annotation values $\{\gamma_t\}_{t=1}^{T}$ by estimating the masses and volumes of the level sets. The volumes are estimated using a Monte Carlo technique. The kernel estimate of each Monte Carlo draw with the dataset is computed, and compared with the threshold defining each level set. The computational complexity bottleneck of the MN-SCAnn algorithm lies in performing these kernel evaluations. In typical implementations, the number of Monte Carlo draws C is expected to be much greater than the size of the data block T, which in turn should be greater than the number of dimensions F. Using a Gaussian kernel means a complexity of O(F²) in evaluating the kernel function. The overall computational complexity of MN-SCAnn is thus O(CTF² + CF) = O(CTF²). The proposed method is to run MN-SCAnn over blocks of timesteps in a sliding-window fashion, thereby continually updating the KEAD detection threshold.

KEAD is online and recursive, so its computational complexity per timestep is analyzed. The bottlenecks in KEAD are the kernel evaluations and the matrix multiplications. At every timestep t, KEAD first evaluates the kernel value of the arriving input vector $\mathbf{x}_t$ with each element currently in the dictionary. With m elements in the dictionary, using a Gaussian kernel function means a complexity of O(mF²) in computing the m × 1 column vector of kernel values. In timesteps where no element is dropped from the dictionary, KEAD performs a constant number of multiplications of an m × m matrix with an m × 1 column vector, requiring O(m²) operations. In the rare case that an element is removed from the dictionary, the algorithm must perform a multiplication of two m × m matrices, requiring O(m³) operations [2]. The overall complexity of KEAD is thus O(mF² + m²) for every standard timestep, and O(mF² + m³) when deletion of a dictionary element occurs. The complexity of KEAD is independent of time, thereby conforming to the requirement of an online algorithm. For a typical value, consider the Abilene dataset: here T = 2000, F = 121, and KEAD works with around m = 10-20 dictionary members. This yields a value of O(10⁵); a rough arithmetic check is given below.

To compare, the algorithm of Heinz and Seeger [5] involves a complexity of O(M) when M kernels are used to process a one-dimensional data stream, so the complexity there is also independent of time. However, although their algorithm itself may be extended to multiple dimensions, the calculation of merge costs involves solving quartic equations, indicating that the complexity in a high-dimensional feature space using Gaussian kernels would become high. Hero's GEM algorithm is block-based, with a sliding-window approach suggested for online applications, and the reported complexity is O(T² log T) when using a training set of T multidimensional points [9]. For the Abilene data, where T = 2000, this computes to O(10⁷).

Performing PCA over a block of T F-dimensional data points involves first calculating the covariance matrix of the data block, with a complexity of O(TF²), and then obtaining the eigenvectors of the covariance matrix, with a complexity of O(F³). Once the number of principal components, say R, is decided upon, obtaining the projection onto the principal and residual subspaces involves a complexity of O(RTF). Thus the overall computational complexity of running the full PCA algorithm is O(TF² + F³ + RTF). The suggestion for adapting PCA to online applications was to use a sliding window to identify normal and anomalous subspaces based on blocks of time, and then project subsequently arriving data onto the previous block's eigenvectors. However, the non-stationarity of real multivariate network data typically renders such a straightforward extension ineffective, as was discussed in [2]. PCA works on the F = 121-dimensional Abilene data using around R = 4-10 principal components, thereby yielding a complexity value of O(10⁷) over T = 2000 timesteps.
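As a rough sanity check of the O(10⁵) per-timestep figure quoted above for KEAD on the Abilene data (an illustrative calculation only, taking m ≈ 15 as a representative dictionary size):

$$mF^2 + m^2 \approx 15 \times 121^2 + 15^2 \approx 2.2 \times 10^5,$$

which sits well below the O(10⁷) block costs quoted above for GEM and PCA on the same data.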

Fig. 1. Example synthetic data training sequence. Points classified as "normal" and "anomalous" denoted by (o) and (x), respectively.

If the n-th nearest-neighbour distance function is used as the sparsity measure, the OCNM algorithm involves calculating the distance from every point $\mathbf{x}_t$ to every other point in the data. The complexity of OCNM is O(T²F) [3]. For the Abilene data, this computes to O(10⁸).

Comparison of the computational complexities of the algorithms studied thus reveals that the complexity of KEAD is independent of time, within bounds for high dimensions, and lower than those of the competing algorithms. After MN-SCAnn is run over a training block to initialise KEAD, recursive KEAD may subsequently continue to run over incremental timesteps. The actual runtime of KEAD is less than 5 minutes, the interval at which typical backbone routers report statistics. Although other approaches are possible to initialise KEAD, for example running PCA or OCNM during the training period, MN-SCAnn is chosen because it provides a direct way of setting the KEAD detection threshold as a function of the user's desired false-discovery-rate tolerance level.

VIII. DATA

A. Synthetic

The first set of experiments was conducted on synthetic data. 500 training points were generated in the following manner. Points classified as "normal" were drawn from a two-component Gaussian mixture with equal mixing coefficients. The components were centred at (0.5, 0.5), with covariance matrices [0.01 0.009; 0.009 0.01] and [0.01 −0.009; −0.009 0.01]. Points classified as "anomalous" were drawn from a two-dimensional uniform distribution defined over the unit square. The a priori probability of a point arising from the anomaly distribution, π, was 0.10. Another 2000 points were then similarly generated to make up the test sequence.



Fig. 2. ROC curves showing MN-SCAnn run on training sequence, KEAD on test sequence, and MN-SCAnn on test sequence. Comparable performance is observed between the three cases, indicating that it is possible to use MN-SCAnn to bootstrap KEAD.

Fig. 3. ROC curves for KEAD, GEM and OCNM run on test data. Online KEAD performs better than block-based GEM and OCNM.

This exercise was repeated 25 times, and the results presented constitute the ensemble average of these 25 runs. Figure 1 shows an example set of 500 training data points.

B. Abilene

The algorithm was tested on real data by examining performance on a timeseries of measurements from the Abilene backbone network. The Abilene data comprise a multivariate timeseries of the number of packets in each backbone flow, binned at five-minute intervals for a week (Dec. 15 to Dec. 21, 2003). A backbone flow is defined as traffic that enters the Abilene network at one of its 11 core routers and exits at another. This provides a data matrix of size T × F, where T = 2016 is the number of timesteps and F = 11 × 11 = 121 is the number of backbone flows.

IX. RESULTS

A. Synthetic Data

Figure 2 presents the Receiver Operating Characteristic (ROC) curves of the Probability of Detection ($P_D$) versus the False Discovery Rate (FDR) achieved under three different scenarios. The empirical FDR value is evaluated as:

$$FDR = \frac{\text{Number of false alarms}}{\text{Number of total alarms}}. \qquad (17)$$

MN-SCAnn was first run over the 500-point training block for a set of desired $P^*_{FD}$ values. The solid line in Fig. 2 shows the tradeoff between $P_D$ and the empirical FDR obtained for this set of $P^*_{FD}$ requirements. Note the distinction between the $P^*_{FD}$ parameter of MN-SCAnn, which is set before the experiment, and the empirical FDR value evaluated from the results. With the KEAD detection threshold η set from MN-SCAnn run over the training data, KEAD is then run over the next 2000-point test period. The dashed line in Fig. 2 denotes the ROC curve obtained for KEAD. Finally, MN-SCAnn is also run over the test data points. The dotted line in Fig. 2 presents this case. The three curves in Fig. 2 depict similar behaviour, thus showing that it is possible to use online KEAD after the detection threshold is initially set using MN-SCAnn over a training block.

To investigate the effects of sparsification on KEAD performance, KEAD was run over the test data with various values of the sparsification parameter ν. It was observed that the detection and false discovery rates both rose with increasing ν, that is, with a smaller dictionary and higher sparsity. Similarly, to investigate the effects of the window lengths L and ℓ on KEAD performance, KEAD was run over the test data for a range of values of L and ℓ. It was seen that the detection performance became invariant beyond about L = 100, which was consequently chosen as the default. The results indicated that $P_D$ and the empirically obtained FDR were not very sensitive to the value of ℓ, and a choice in the range 10-20 was sufficient for most desired $P^*_{FD}$ settings. Subsequently, ν = 0.10, ℓ = 20 and the no-forgetting case were used as defaults for the experiments conducted on the synthetic dataset.

The final set of experiments conducted on the synthetic dataset involved comparing the performance of the proposed online KEAD with block-based Geometric Entropy Minimization (GEM) and the One-Class Neighbour Machine (OCNM). Figure 3 compares the performances on the test sequence, for a range of settings for each algorithm. KEAD is run with detection threshold η corresponding to a range of desired $P^*_{FD}$ settings from 0.05 to 1. OCNM identified the 1% to 99% outliers. The n-th nearest-neighbour Euclidean distance was used as the OCNM sparsity measure with n = 5, and it was noticed that the results were not sensitive to the choice of n. With GEM, the false discovery rate is a function of the number of training points. With n = 5 selected as the nearest-neighbour parameter, GEM was trained on different training sequence lengths and then run on the test sequence. The maximum detection rate observed with GEM corresponds to a training sequence length of 6, which is the shortest allowable with n = 5. KEAD is seen to outperform both GEM and OCNM.


This is expected, as the method of kernel density estimation of the underlying probability distribution, as used by KEAD, is best suited to this synthetic dataset.
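For reference, the synthetic setup of Section VIII-A can be reproduced along the following lines. This is a sketch only: the random seed, the helper name and the exact sampling procedure of the original Matlab experiments are assumptions.

import numpy as np

def generate_synthetic(T, pi_anom=0.10, seed=None):
    """Draw T points from the mixture (4): normal points come from a
    two-component Gaussian mixture centred at (0.5, 0.5); anomalous points
    are uniform on the unit square.  Returns (X, is_anomaly)."""
    rng = np.random.default_rng(seed)
    covs = [np.array([[0.01, 0.009], [0.009, 0.01]]),
            np.array([[0.01, -0.009], [-0.009, 0.01]])]
    is_anomaly = rng.random(T) < pi_anom
    X = np.empty((T, 2))
    for t in range(T):
        if is_anomaly[t]:
            X[t] = rng.random(2)                      # uniform over unit square
        else:
            X[t] = rng.multivariate_normal([0.5, 0.5], covs[rng.integers(2)])
    return X, is_anomaly

X_train, y_train = generate_synthetic(500)    # training block
X_test,  y_test  = generate_synthetic(2000)   # test sequence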


B. Abilene Data
