On Threshold Selection for Principal Component based Network Anomaly Detection

Petar Djukic, MeshIntelligence Inc., Ottawa, Canada, [email protected]

Abstract—Principal component based anomaly detection has emerged as an important statistical tool for network anomaly detection. It works by projecting summary network information onto signal and noise sub-spaces and detecting anomalies in the noise sub-space. Recently, some major problems were detected with this approach to network anomaly detection. Chief among them is the difficulty of selecting the threshold used to declare that the energy in the noise sub-space contains a network anomaly. We show that the reason for this problem is that one of the assumptions previously used to select the threshold, namely that the traffic follows a Normal distribution, does not fit the reality of the available network traces. We then show that the energy in the noise sub-space can be modeled with the long-tailed Cauchy distribution and use this approximation to calculate reliable thresholds. Our analysis of network traces indicates that the Cauchy approximation of the energy distribution should significantly lower the false alarm rate.

Keywords-Network Anomaly Detection, Principal Component Analysis.

I. INTRODUCTION

Network anomaly detection is an important tool used by Internet Service Providers (ISPs) to ensure their networks are not under attack and that Service Level Agreements (SLAs) are honoured in case the network suddenly becomes congested. Typically, network anomaly detection is a component of network monitoring tools and raises alarms indicating that anomalies have been detected from available network statistics. Due to the sheer volume of data traversing modern networks, finding a network anomaly from traffic statistics is analogous to looking for a needle in a haystack. Nevertheless, there are statistical tools that can detect anomalies – outliers – in data, and they are currently used for network anomaly detection. Despite the availability of statistical network anomaly detection in network monitoring tools, ISPs routinely ignore their alarms due to their high false alarm rates. Indeed, a high number of false alarms creates its own haystack in which the ISP operators have to look for needles (true network anomalies) again.

(This work was performed while the first author was a postdoctoral researcher at Carleton University, Ottawa, Canada.)

We address the problem of high false alarm rates in network anomaly detection based on the statistical technique

Biswajit Nandy, Solana Networks, Ottawa, Canada, [email protected]

of principal component (PC) analysis [1]. PC analysis has emerged as an important technique for network anomaly detection [2]–[7], but many questions still remain about the appropriateness of PC-based network anomaly detection, due in part to its problems with false alarms [8].

PC analysis starts by finding the eigenvector space spanning the entire data set, where each eigenvector corresponds to a PC. The space is partitioned into a signal sub-space and a noise sub-space. The signal sub-space corresponds to a predetermined number of highest-energy eigenvectors and captures most of the energy in the data set. The noise sub-space corresponds to the lowest-energy eigenvectors, which are not in the signal sub-space. The data set is then projected onto the noise sub-space and the energy of the projection is checked against a threshold for outliers. Detection of outliers in network traffic data may indicate the presence of anomalies; however, the outliers may also come from the data itself, thus raising false alarms. The key issue for good false alarm rates is the selection of the PCs in the noise sub-space and the selection of a threshold that does not capture too many outliers from the original data.

We further investigate the validity of PC analysis for network anomaly detection. In particular, we examine how to select the number of eigenvectors in the noise sub-space and how to determine the threshold with which an anomaly is declared. We start by determining the distribution of the energy in the noise sub-space from real network traces. We show that this is a heavy-tailed distribution and that the current method of modeling it with the χ-square distribution is not appropriate. The χ-square distribution persistently underestimates the threshold, resulting in a large number of false alarms. We propose the Cauchy distribution to model the energy in the noise sub-space.
We estimate the parameters of the Cauchy distribution from network traces on-the-fly and use the estimated cumulative distribution function to determine a threshold for anomaly detection. The Cauchy distribution gives a pessimistic estimate of the anomaly detection threshold and results in a very low false alarm rate. We then analyze the performance of the Cauchy distribution when the number of eigenvectors in the noise sub-space changes and determine the range of eigenvectors for which

this approximation is valid.

PC analysis was initially proposed as a method to succinctly characterize network traffic [2]. As shown in [2], most of the traffic variation can be captured by projecting the data traffic on a few of the top eigenvectors of the entire traffic space, which makes it possible to analyse network traffic on a macro scale. Subsequently [3]–[7], PC analysis was also proposed as a way to find anomalies in traffic traces. These approaches suffered from several problems, which mainly stem from the questions of threshold and noise sub-space selection that we investigate in this paper. The authors of [8] show that PC-based anomaly detection is sensitive to both the number of eigenvectors in the signal sub-space and the threshold chosen to detect anomalies. More clarification was brought in [9] and [10], which finally put a solid mathematical basis behind PC analysis.
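The detection pipeline described above can be sketched in a few lines. This is a minimal illustration under our own function and parameter names, not the exact procedure of any of the cited works; NumPy's eigendecomposition stands in for whatever PC implementation a monitoring tool would use.

```python
import numpy as np

def noise_energy(X, y, k):
    """Residual energy of measurement y in the noise sub-space,
    given historical data X (n x p, rows are samples) and k retained PCs."""
    mu = X.mean(axis=0)
    Xc = X - mu                               # zero-mean the data
    C = Xc.T @ Xc / len(Xc)                   # empirical covariance matrix
    lam, U = np.linalg.eigh(C)                # eigenvalues in ascending order
    U = U[:, np.argsort(lam)[::-1]]           # reorder: lambda_1 >= ... >= lambda_p
    Us = U[:, :k]                             # signal sub-space: top-k eigenvectors
    yc = y - mu
    y_p = Us @ (Us.T @ yc)                    # projection on the signal sub-space
    return float(np.sum((yc - y_p) ** 2))     # energy left in the noise sub-space

# usage: an alarm is raised when the energy exceeds a threshold delta
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
e = noise_energy(X, X[0], k=3)
```

Retaining more PCs removes more energy from the residual, so the score is non-increasing in k; the threshold selection problem studied in this paper is choosing delta (and k) so that this score rarely exceeds delta on anomaly-free traffic.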

II. MATHEMATICS OF PRINCIPAL COMPONENT ANALYSIS

We summarize the mathematical theory behind PC analysis; we discuss how we applied it to network anomaly detection in the next section. The method of PCs is a well established statistical technique, used to detect outliers in multi-dimensional data sets [1]. The method checks if a p-dimensional random variable X = [X_1, ..., X_p]^T contains an anomaly. In the case of network anomaly detection, the random vector X represents some measurable network traffic quantity, such as total packet size or packet count. We explain how this quantity is measured in the next section.

The existence of a traffic anomaly is checked from a measurement of network traffic Y = X + a, where Y = [Y_1, ..., Y_p]^T is the measurement of network traffic X, which may contain an unknown anomaly a = [a_1, ..., a_p]^T. To find if the measurement contains an anomaly, PC analysis first finds a projection of the measurement Y on the signal sub-space of the random vector X, Y_P = [Ỹ_1, ..., Ỹ_p]^T, and checks if the energy in the noise sub-space is greater than some threshold δ. An anomaly is reported if ||Y − Y_P||² > δ. We explain how the energy in the noise sub-space is found shortly. For anomaly detection using this approach to be useful, the probability of a false positive (false alarm) should be small:

P_FA = Pr[ ||Y − Y_P||² > δ | a = 0 ] ≤ ǫ,   (1)

where ǫ is a small probability.

A. PC Analysis of Multi-dimensional Data

PC analysis uses second-order statistics of X to separate the signal and noise sub-spaces. For now we assume that the relevant statistics of the data are known a priori; we discuss how they are obtained in the next section. The second-order statistics are represented with the covariance matrix

C_X = E{X X^T},   (2)

where without any loss of generality we assume that the data vector has zero mean, E{X_i} = 0. Diagonalization of the covariance matrix C_X produces two p × p matrices U and Λ such that U^{-1} C_X U = Λ, where

U = [u_1, ..., u_p]

is the matrix composed of the p-dimensional eigenvectors u_i of C_X, and Λ is the diagonal matrix of eigenvalues of C_X with each diagonal entry λ_ii corresponding to an eigenvector u_i. By convention, λ_1 > λ_2 > ... > λ_p > 0, and the matrix U is assumed to be orthonormal, U^{-1} U = U^T U = I_p, where I_p is the p-dimensional identity matrix.

The value of diagonalizing C_X becomes evident if U is used to transform a sample Y to obtain a new p-dimensional random vector Z = [Z_1, ..., Z_p]^T:

Z = Λ^{-1/2} U^T Y,   (3)

where Λ^{-1/2} is obtained from Λ by replacing each of its entries λ_ii with 1/√λ_ii. Using simple matrix manipulation it can easily be shown that if there is no anomaly (a = 0),

E{Z Z^T} = E{ Λ^{-1/2} U^T Y Y^T U Λ^{-1/2} } = Λ^{-1/2} U^T C_X U Λ^{-1/2} = I_p.

So, the random vector Z consists of zero-mean uncorrelated random variables with variance 1. If X is a vector of p jointly Gaussian random variables, then Z is a vector of p independent random variables, each following the N(0, 1) distribution.

B. Outlier Detection with PC Analysis

The diagonalizing matrix U is an orthogonal basis spanning the space of the multi-dimensional random variable X. The signal sub-space of X is spanned by the first k columns of U, while the noise sub-space is spanned by the last p − k columns of U. Since the columns of U are ordered by decreasing eigenvalues, k can be selected so that the projection of the measured vector Y on the signal sub-space retains most of the signal energy.
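As a quick numerical illustration of (3), the following sketch (our own, using NumPy; not from the paper) whitens correlated Gaussian samples and checks that the empirical covariance of Z is the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 5, 200_000

# correlated zero-mean Gaussian samples: rows of Y have covariance A A^T
A = rng.normal(size=(p, p))
Y = rng.normal(size=(n, p)) @ A.T

C = Y.T @ Y / n                              # empirical covariance C_X
lam, U = np.linalg.eigh(C)                   # eigh returns eigenvalues ascending
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]             # enforce lambda_1 >= ... >= lambda_p

Z = Y @ U @ np.diag(1.0 / np.sqrt(lam))      # Z = Lambda^{-1/2} U^T Y, per (3)
Czz = Z.T @ Z / n                            # should equal I_p up to round-off
```

Because the whitening uses the same empirical covariance that was diagonalized, Czz equals I_p up to floating-point error, matching the identity E{Z Z^T} = I_p above.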

Mathematically, the projection on the signal sub-space is

Y_P = U Λ^{1/2} Z_P,   (4)

where

Z_P = I_{p,k} Z   (5)

is a "clipped" version of Z, which retains the first k dimensions of Z, (3), and

I_{p,k} = [I_k 0_{k,p−k}]^T [I_k 0_{k,p−k}]   (6)

is the modified p × p identity matrix with only k non-zero entries, where 0_{k,p−k} is a k × (p − k) zero matrix. The top k dimensions of Z_P correspond to the top k eigenvalues, which are associated with the highest correlation in the original vector Y.

To find the probability of false alarms, we find the energy in the noise sub-space when no anomaly is present. The energy in the noise sub-space is given by

E_C = ||Y − Y_P||²,   (7)

where ||x||² = x^T x is the squared vector norm. A simpler form of the energy expression is found by evaluating the norm:

||Y − Y_P||² = (Y − Y_P)^T (Y − Y_P)
            = (U Λ^{1/2} [I_p − I_{p,k}] Z)^T (U Λ^{1/2} [I_p − I_{p,k}] Z)
            = Z^T [I_p − I_{p,k}]^T Λ^{1/2} U^T U Λ^{1/2} [I_p − I_{p,k}] Z
            = Z^T [I_p − I_{p,k}] Λ [I_p − I_{p,k}] Z
            = Σ_{r=k+1}^{p} λ_r z_r²,

where we used the fact that

Y − Y_P = U Λ^{1/2} Z − U Λ^{1/2} Z_P = U Λ^{1/2} [I_p − I_{p,k}] Z.

So, the probability of false alarms is

P_FA = Pr[ ||Y − Y_P||² > δ ] = Pr[ Σ_{r=k+1}^{p} λ_r z_r² > δ ] ≤ ǫ,   (8)

and for a given upper bound on an acceptable false alarm rate, ǫ, we need to find a threshold δ so that the last inequality holds. If the distribution of the last summation is known, its cumulative distribution function (CDF) can be used to find the acceptable threshold:

F_{E_C}(δ) = 1 − ǫ.   (9)

However, this distribution is usually not readily available. For the special case where X is a vector of jointly Gaussian random variables, by (8) the energy in the noise sub-space is a weighted sum of squares of independent N(0, 1) random variables, that is, a weighted sum of χ-square random variables. Various approximations of this summation are possible [11], [12] and have been used in the context of network anomaly detection [3]–[7]. However, since measured network traffic does not follow the Normal distribution, the value of these approximations is suspect.

III. PC-BASED ANOMALY DETECTION

So far, we have assumed that the distribution of the data is known a priori, which is not the case in practice. Even if the distribution is known a priori, PC analysis only makes progress if the input data follow the Normal distribution, which is not the case for network traffic. Nevertheless, since PC analysis is based on second-order statistics, which can be easily estimated, it has the potential to be a powerful tool for network anomaly detection. We first show how the covariance matrix is estimated empirically and then apply this estimate to network traffic traces we have obtained from a large service provider. Using the empirical estimate of the covariance matrix, we show that the energy in the noise sub-space has a long-tailed distribution. We approximate the energy in the noise sub-space with the Cauchy distribution and show that this approximation is good for low false alarm rates. We start by discussing the origin of our network measurements.
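The identity derived in (7)–(8) can be checked numerically: since λ_r z_r² = (u_r^T y)², the weighted sum over the noise-sub-space coordinates equals the residual energy. A small self-contained sketch (our own construction, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, k = 6, 100_000, 2

Y = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated samples
C = Y.T @ Y / n                                         # empirical covariance
lam, U = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]                        # descending eigenvalues

y = Y[0]                                                # one measurement
z = (1.0 / np.sqrt(lam)) * (U.T @ y)                    # z = Lambda^{-1/2} U^T y, per (3)
y_p = U[:, :k] @ (U[:, :k].T @ y)                       # projection on signal sub-space

residual = float(np.sum((y - y_p) ** 2))                # ||y - y_P||^2, per (7)
weighted = float(np.sum(lam[k:] * z[k:] ** 2))          # sum_{r>k} lambda_r z_r^2, per (8)
```

The two quantities agree to floating-point precision, confirming that the threshold test on ||Y − Y_P||² can equivalently be applied to the weighted sum of squared noise-sub-space coordinates.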

A. Network Traffic Traces

We use Internet2 network traces [13], which are the result of NetFlow traffic sampling. This network has several large routers with hundreds of thousands of end-to-end flows at each router. The available network traces are 1-second samples of the traffic passing the router and give summary counts of the packets and the average size of packets for each flow passing the router. In the sequel, we use the network traffic information from the first 59 days of 2010, from one of the routers in the network ("KANS"). The traces are available for academic use, upon request.

The routing information is not available as it was in [3]–[7], so the method of using origin-destination (O-D) flows was not available to us. The method of origin-destination flows accumulates NetFlow counts according to how flows enter and exit the network. All flows entering and exiting the network on the same pair of routers are considered to be part of one O-D flow. In terms of PC analysis, each O-D flow corresponds to one element in the vector of measurements (Y) used in PC analysis, so having fewer O-D flows makes the PC analysis more manageable. To make the traffic traces manageable for PC analysis, we synthesize accumulated flows with hashing. This is equivalent to using the O-D method, but adds more flexibility. We use a very quick, randomly initialized hash function, which is based on Mersenne primes [14].
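The flow-binning step can be illustrated with a simple randomly initialized multiplicative hash modulo a Mersenne prime. This is only a sketch of the idea; the paper uses the tabulation-based 4-universal scheme of [14], and the class and key names below are ours:

```python
import random

MERSENNE_P = (1 << 61) - 1                  # Mersenne prime 2^61 - 1

class FlowHasher:
    """Randomly initialized multiplicative hash that maps a flow key
    (e.g. the 5-tuple packed into an integer) to one of m bins.
    Each bin accumulates counts and becomes one element of the
    measurement vector Y used by the PC analysis."""
    def __init__(self, m, seed=0):
        rnd = random.Random(seed)
        self.a = rnd.randrange(1, MERSENNE_P)
        self.b = rnd.randrange(0, MERSENNE_P)
        self.m = m

    def bin(self, flow_key):
        return ((self.a * flow_key + self.b) % MERSENNE_P) % self.m

# usage: accumulate per-flow packet counts into an m-dimensional vector
h = FlowHasher(m=8, seed=42)
counts = [0] * 8
for flow_key, packets in [(0xC0A80001, 10), (0x0A000001, 3), (0xC0A80001, 5)]:
    counts[h.bin(flow_key)] += packets
```

Choosing m sets the dimensionality p of the PC analysis directly, which is the flexibility the hashing approach adds over fixed O-D aggregation.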

Figure 2. Probability of Underestimating the Threshold (threshold estimation, ǫ = 0.99, plotted against the number of retained PCs).
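The comparison behind Fig. 2 can be sketched as follows. This is our own illustration, using SciPy's generic distribution fitting as a stand-in for the estimators described in the text; the function and variable names are assumptions:

```python
import numpy as np
from scipy.stats import cauchy

def threshold_log_ratio(energies, eps, dist):
    """Compare the empirical threshold with a fitted-distribution threshold.
    Returns log10(delta_est / delta_th); a positive value means the fitted
    family underestimates the empirical threshold (risking false alarms)."""
    delta_est = np.quantile(energies, 1.0 - eps)   # empirically estimated threshold
    params = dist.fit(energies)                    # maximum-likelihood fit
    delta_th = dist.ppf(1.0 - eps, *params)        # threshold from the fitted CDF, per (9)
    return float(np.log10(delta_est / delta_th))

# usage on synthetic Cauchy-distributed "energies" (illustration only):
# when the data really is Cauchy, the fitted threshold tracks the empirical one
E = cauchy.rvs(loc=100.0, scale=5.0, size=20_000, random_state=3)
r = threshold_log_ratio(E, eps=0.01, dist=cauchy)
```

Repeating this per sliding window and per candidate distribution, and counting how often the ratio is positive, yields curves like those in Fig. 2.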

The Normal threshold is estimated by approximating the empirical distribution with the Normal distribution. This method is not the moment estimator of the Normal distribution, which is normally used to estimate the Normal distribution. Once the parameters of the distribution are evaluated, the threshold is found from the analytical expression for the Normal distribution. The Gamma threshold δGamma is estimated using the maximum-likelihood estimator to find the parameters of the approximating distribution and using analytical means to recover the threshold.

From the empirical distribution we see that both the Normal and Gamma estimates underestimate the threshold. This means that they may produce false alarms. Since the distribution of the energy is heavy-tailed, we also approximate it with the heavy-tailed Cauchy distribution. We use the maximum log-likelihood estimator [15] to estimate the parameters of the Cauchy distribution, and then its CDF to find the threshold. We see that the Cauchy-estimated threshold δCauchy is an overestimate, meaning that it will not cause false alarms. So, the Cauchy distribution appears to be a much better fit to the distribution of energy in the noise sub-space than the previously used estimators.

We also examine how the number of selected PCs affects the threshold selection. We use the full 59 days of the traces available to us and use a sliding 2-hour window to estimate many energy distributions. The sliding windows overlap and move up in increments of 15 minutes. For each sliding window we find the empirically estimated threshold δest and a threshold estimated from a specific distribution δth and evaluate log10(δest/δth). If log10(δest/δth) > 0, the estimate underestimates the real threshold; otherwise the threshold is overestimated. From our previous discussion, the thresholds should be overestimated to decrease the number of false alarms.

Fig. 2 shows the probability that log10(δest/δth) > 0 for all possible values of k. We see that the Normal estimate

of the threshold almost always underestimates the true threshold. We also see that the Gamma threshold sometimes overestimates and sometimes underestimates the true threshold. The Cauchy distribution consistently overestimates the threshold for certain ranges of k (e.g. 55 ≤ k ≤ 75). So, if the number of retained PCs is chosen carefully, finding the threshold with the Cauchy distribution has a much better chance of giving low false alarm rates than the other two methods.

IV. CONCLUSION

We have investigated the problem of threshold selection for network anomaly detection with PC analysis. Previous works set the threshold based on the assumption that the traffic distribution follows the Normal distribution, which implies that the distribution of energy in the noise sub-space follows the χ-square distribution, which was in turn approximated with the Normal distribution. Using empirical data, we show that the energy in the noise sub-space follows the Cauchy distribution more closely. Indeed, using the χ-square approximation to select thresholds results in underestimated thresholds and possibly high false alarm rates, while using the Cauchy distribution results in a high threshold and low false alarm rates. We also find that the Cauchy approximation is valid for some ranges of retained PCs and not others, thus shedding some light on the relationship between the number of retained PCs and the threshold selection.

REFERENCES

[1] J. E. Jackson, A User's Guide to Principal Components. John Wiley and Sons, 1991.
[2] A. Lakhina, M. Crovella, and C. Diot, "Diagnosing network-wide traffic anomalies," in SIGCOMM '04: Proceedings of the 2004 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. New York, NY, USA: ACM, 2004, pp. 219–230.
[3] A. Lakhina, M. Crovella, and C. Diot, "Characterization of network-wide anomalies in traffic flows," in IMC '04: Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement. New York, NY, USA: ACM, 2004, pp. 201–206.
[4] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows," in SIGMETRICS '04/Performance '04: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems. New York, NY, USA: ACM, 2004, pp. 61–72.
[5] A. Lakhina, M. Crovella, and C. Diot, "Mining anomalies using traffic feature distributions," SIGCOMM Comput. Commun. Rev., vol. 35, no. 4, pp. 217–228, 2005.
[6] P. Barford, N. Duffield, A. Ron, and J. Sommers, "Network performance anomaly detection and localization," in INFOCOM 2009, IEEE, 2009, pp. 1377–1385.
[7] C. Issariyapat and K. Fukuda, "Anomaly detection in IP networks with principal component analysis," in Communications and Information Technology, 2009 (ISCIT 2009), 9th International Symposium on, 2009, pp. 1229–1234.

[8] H. Ringberg, A. Soule, J. Rexford, and C. Diot, "Sensitivity of PCA for traffic anomaly detection," SIGMETRICS Perform. Eval. Rev., vol. 35, no. 1, pp. 109–120, 2007.
[9] D. Brauckhoff, K. Salamatian, and M. May, "Applying PCA for traffic anomaly detection: Problems and solutions," in INFOCOM 2009, IEEE, 2009, pp. 2866–2870.
[10] D. Brauckhoff, K. Salamatian, and M. May, "A signal processing view on packet sampling and anomaly detection," in INFOCOM 2010, Proceedings IEEE, 2010, pp. 1–9.
[11] J. E. Jackson and G. S. Mudholkar, "Control procedures for residuals associated with principal component analysis," Technometrics, vol. 21, no. 3, pp. 341–349, 1979. [Online]. Available: http://www.jstor.org/stable/1267757
[12] H. Liu, Y. Tang, and H. H. Zhang, "A new Chi-square approximation to the distribution of non-negative definite quadratic forms in non-central normal variables," Computational Statistics and Data Analysis, vol. 53, no. 4, pp. 853–856, 2009.
[13] Internet2 NetFlow data, http://netflow.internet2.edu/.
[14] M. Thorup and Y. Zhang, "Tabulation based 4-universal hashing with applications to second moment estimation," in SODA '04: Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Philadelphia, PA, USA: SIAM, 2004, pp. 615–624.
[15] D. Bloch, "A note on the estimation of the location parameter of the Cauchy distribution," Journal of the American Statistical Association, vol. 61, no. 314, pp. 852–855, September 1966.
