IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, VOL. 8, NO. 4, JULY/AUGUST 2011
Anomaly Detection in Network Traffic Based on Statistical Inference and α-Stable Modeling

Federico Simmross-Wattenberg, Juan Ignacio Asensio-Pérez, Pablo Casaseca-de-la-Higuera, Marcos Martín-Fernández, Ioannis A. Dimitriadis, Senior Member, IEEE, and Carlos Alberola-López

Abstract—This paper proposes a novel method to detect anomalies in network traffic, based on a nonrestricted α-stable first-order model and statistical hypothesis testing. To this end, we give statistical evidence that the marginal distribution of real traffic is adequately modeled with α-stable functions and classify traffic patterns by means of a Generalized Likelihood Ratio Test (GLRT). The method automatically chooses the traffic windows used as a reference, against which the traffic window under test is compared, with no expert intervention needed to that end. We focus on detecting two anomaly types, namely floods and flash-crowds, which have been frequently studied in the literature. Performance of our detection method has been measured through Receiver Operating Characteristic (ROC) curves, and results indicate that our method outperforms the closely related state-of-the-art contribution described in [1]. All experiments use traffic data collected from two routers at our university—a 25,000-student institution—which provide two different levels of traffic aggregation for our tests (traffic at a particular school and at the whole university). In addition, the traffic model is tested with publicly available traffic traces. Due to the complexity of α-stable distributions, care has been taken in designing appropriate numerical algorithms to deal with the model.

Index Terms—Traffic analysis, anomaly detection, α-stable distributions, statistical models, hypothesis testing, ROC curves.
1 INTRODUCTION

Anomaly detection aims at finding the presence of anomalous patterns in network traffic. Automatic detection of such patterns can provide network administrators with an additional source of information to diagnose network behavior or find the root cause of network faults. However, as of today, a commonly accepted procedure to decide whether a given traffic trace includes anomalous patterns is not available. Indeed, several approaches to this problem have been reported in the literature (see [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], described in Section 2). Research proposals in anomaly detection typically follow a four-stage approach, in which the first three stages define the detection method, while the last stage is dedicated to validating the approach. So, in the first stage, traffic data are collected from the network (data acquisition). Second, data are analyzed to extract their most relevant features (data analysis). Third, traffic is classified as normal¹ or abnormal (inference); and fourth, the whole approach is validated with various types of traffic anomalies (validation). In this regard, as the literature shows (see Section 2), flood and flash-crowd anomalies are of interest to several anomaly detection contributors.
1. In this paper, the word “normal” will be used in the sense of “natural status” and not as a synonym of “Gaussian.”
The authors are with the Universidad de Valladolid, ETSI Telecomunicación, Paseo de Belén, 15, 47011 Valladolid, Spain. E-mail: {fedsim, juaase, jcasasec, marcma, yannis, caralb}@tel.uva.es.

Manuscript received 11 June 2010; accepted 14 Jan. 2011; published online 9 Feb. 2011. Recommended for acceptance by R. Sandhu. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TDSC-2010-06-0096. Digital Object Identifier no. 10.1109/TDSC.2011.14. 1545-5971/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society.
Following the aforementioned four-stage approach, we can mention that data acquisition is typically carried out by polling one or more routers periodically, so that traffic data are collected and stored for posterior analysis in the second stage. Some authors sample data at the packet level, gathering information from headers, latencies, etc., while others prefer to use aggregated traffic as the source of information, often through the use of the Simple Network Management Protocol (SNMP). Sampling data at the packet level provides more information, but at the cost of a higher computational load and the need for dedicated hardware. Aggregated traffic, on the other hand, gives less information from which to decide on the presence or absence of anomalies, but is a simpler approach and does not need any special hardware. Apart from this dichotomy, however, there seems to be a consensus on how to proceed in this stage. In the data analysis phase, several techniques can be applied to extract interesting features from current traffic. Some of them include information theory [4], [9], wavelets [6], statistics-based measurements [3], and statistical models [1]. Of these techniques, the use of statistical models as a means to extract significant features for data analysis has been found to be very promising, since they allow for a robust analysis even with small sample sizes (provided that the model is adequate for real data). Moreover, with a traffic model, its set of parameters can be used as the extracted traffic features, since any traffic sample is determined by the model parameters. Existing traffic models range from the classical Poisson model, first introduced for packet networks by Kleinrock [13], to more recent models, which state the importance of high variability and long-range dependence [14] in modeling network traffic. Nevertheless, anomaly detection is still
often based (at least partially) on classical models, such as Gamma distributions [1]. The fact that these models do not account for high variability may have a negative impact on capturing traffic properties and, as a consequence, on detecting anomalies. High variability manifests itself in the marginal (first-order) traffic distribution and states that traffic is inherently bursty. This results in traffic distributions exhibiting heavy tails, which cannot be properly modeled with, e.g., Gaussian functions. Long-range dependence, on the other hand, states that traffic is highly dependent over a wide range of time scales, i.e., its autocorrelation function exhibits a slowly decaying tail. Several statistical distributions are capable of modeling the high variability property. One such family is the α-stable family [15], which has been previously used to model network traffic (although with restrictions, as shown in Section 4), and which we briefly explored in [16] (where the detection problem is not addressed). To the best of our knowledge, these distributions have never been applied to anomaly detection. Moreover, in addition to properly modeling highly variable data, α-stable distributions are the limiting distribution of the generalized central limit theorem [17], a fact that sets them as good candidates for modeling aggregated network traffic. Regarding the time evolution model and long-range dependence, we show in Section 6 that the first-order α-stable model is appropriate to detect flood and flash-crowd anomalies, so we do not use a time evolution model in this paper. Several approaches have been used in the inference stage as well. Classification methods based on neural networks [10], [11], [18], statistical tests [2], information theory [4], and simple thresholding [19], to cite a few, can be found in the anomaly detection literature. There seems to be a common point in all of them, though.
The inference stage bases its decisions on the existence of a reference traffic window, which allows the classification method to assess whether the current traffic window is normal (i.e., sufficiently similar to the reference window) or abnormal (i.e., significantly different from the reference window). How the reference window is chosen not only has an impact on the final normal versus abnormal classification rate, but also determines the exact definition of a traffic anomaly. Some approaches [2], [12], [20] assume that an anomaly is an abrupt change in some of the features extracted from traffic, so the reference window is simply the previous-to-current traffic window. Other papers [1], [4] assume that the reference window has been previously chosen and approved by an expert, so they do not need to define anomalies as abrupt changes, but simply as traffic windows sufficiently different from the reference. Both of these approaches have disadvantages. The former can only detect traffic anomalies which include abrupt changes from one traffic window to the next, disregarding, for instance, slow trends in traffic data. The latter approach does allow for the detection of abrupt changes or slow trends, but needs the intervention of an expert. In addition, having just one reference window can be problematic due to the nonstationary nature of network traffic. It is widely accepted [21] that network traffic exhibits a cyclostationary behavior in periods of days and weeks, so a reference traffic window
which is appropriate for a given hour and weekday will probably not fit any other circumstances, or any other network for that matter. In the validation stage, researchers give quality measures of the detection capability of their method according to a chosen criterion, which is typically the detection rate in terms of false positives and false negatives (i.e., the fraction of normal traffic patterns incorrectly classified as anomalous and the fraction of anomalous traffic patterns incorrectly classified as normal, respectively), although some researchers prefer other quality measures (see Section 2). Where possible, authors often compare the performance of their methods to other previously reported proposals as well. In this paper, we propose an anomaly detection method based on α-stable distributions which does not require network administrators to choose reference traffic windows and is able to detect flood and flash-crowd anomalies regardless of the presence or absence of abrupt changes in network traffic. With this method, we expect to provide a data analysis stage giving more informative traffic features, as well as to address the problem of selecting appropriate reference windows. To this end, we provide statistical evidence about the suitability of α-stable distributions as a first-order model for network traffic, and we build a classifier with a generalized likelihood ratio test, the performance of which is measured by means of Receiver Operating Characteristic (ROC) curves. We compare our classification results to those reported in [1], a closely related state-of-the-art contribution which, additionally, describes its experiments and results with sufficient detail to let other researchers build fair comparisons with their methods.
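The general shape of a GLRT-based window classifier can be sketched numerically. The following is a minimal illustration only: it uses Gaussian likelihoods for tractability, whereas the paper fits α-stable models, and the window contents and threshold are synthetic stand-ins, not the paper's data.

```python
import numpy as np
from scipy import stats

def glrt_statistic(window, ref):
    """Generalized likelihood ratio statistic for two data windows.

    Null hypothesis: both windows come from one common distribution;
    alternative: each window has its own parameters.  Gaussian ML fits
    (sample mean/std) are used here purely as a tractable stand-in.
    """
    pooled = np.concatenate([window, ref])
    # Log-likelihood under the null: a single ML fit to the pooled data.
    mu0, s0 = np.mean(pooled), np.std(pooled)
    ll_null = stats.norm.logpdf(pooled, mu0, s0).sum()
    # Log-likelihood under the alternative: separate ML fits per window.
    ll_alt = (stats.norm.logpdf(window, np.mean(window), np.std(window)).sum()
              + stats.norm.logpdf(ref, np.mean(ref), np.std(ref)).sum())
    return 2.0 * (ll_alt - ll_null)   # large value => windows differ

rng = np.random.default_rng(0)
normal = rng.normal(10.0, 1.0, 360)    # reference traffic window (360 samples)
similar = rng.normal(10.0, 1.0, 360)   # normal-looking window under test
flood = rng.normal(25.0, 3.0, 360)     # flood-like window under test
print(glrt_statistic(similar, normal), glrt_statistic(flood, normal))
```

Comparing the statistic against a decision threshold, and sweeping that threshold, is what produces one operating point per threshold on an ROC curve.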
Due to the restrictions described in Section 6, the traffic data used in testing the proposed method come from two routers in our university, which should be representative of lightly and heavily loaded networks (see Section 3). The traffic model is tested with these data, as well as with public traffic traces. We choose aggregated traffic data over packet-level sampling for added simplicity, so that no special hardware is needed to implement our method. The paper is organized according to the aforementioned four-stage approach in order to enhance readability. Therefore, after describing the recent contributions in the field of anomaly detection in Section 2, we dedicate Section 3 to describing the framework used in our experiments, including data sampling and router specifications (stage one). Then, Section 4 justifies the α-stable first-order model (stage two). Section 5 deals with the inference stage and the methods we use to classify network traffic (stage three). Section 6 is divided into two subsections. The first shows statistical evidence that the α-stable marginal model is valid for our data under proper circumstances, as well as that it behaves better than other models even when those circumstances are not met. The second analyzes the detection performance of our method for two common types of anomalies (flood and flash-crowd) and compares it to the results reported in [1]. This section completes the fourth stage. The main conclusions and the foreseen steps of further research are presented in Section 7. Finally, Appendices A and B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2011.14, provide supplementary information on the mathematical methods designed and used in the paper to deal with the calculations in which the α-stable model is involved.
TABLE 1 Anomaly Detection Methods
2 BACKGROUND
In the last decade, several research teams have contributed to anomaly detection in network traffic. However, solutions given to each stage by different authors vary substantially across different papers. Table 1 provides a systematic comparative description of the main papers found in the literature. Thus, for each paper, the first three columns provide information related to the data acquisition stage,
i.e., where the data come from, their type, as well as whether they are publicly available to the community. The following two columns describe the employed anomaly detection algorithm, i.e., they refer to the data analysis and inference stages. Also, one can see the types of anomalies treated in each paper, together with the figures of merit employed for the assessment of each method, as well as the papers against which they test the performance.
From what can be observed in Table 1, most authors clearly prefer to use real data in their experiments, rather than simulated traffic. Also, note that authors tend to collect data from their own networks, rather than using publicly available traces. Not using public data in experiments may be objectionable, but it is advantageous in that detection methods are not tied to the information already available. Thus, authors are free to inject any kind of anomaly in the network and test it with the proposed method. As for anomalous patterns, proposals often inject anomalies in the network on purpose or have some traffic traces prelabeled as anomalous, although some other approaches detect unusual patterns without prior knowledge about anomaly types (these patterns are marked as "U" in Table 1). Regarding the types of (real) sampled traffic, some authors prefer to use aggregated counters, while others do anomaly detection at the packet level. As previously stated, this paper focuses on anomaly detection in aggregated traffic. In the second and third stages, Table 1 shows that a wide range of algorithms have been used to detect anomalies. However, there are no proposals using high-variability first-order models to analyze data; consequently, this paper aims at improving classification rates by making use of α-stable properties. As for the inference stage, most reported methods make use of simple thresholding, statistical tests, neural networks, or distance measurements. Among these techniques, approaches based on neural networks or statistical inference should yield better results than arbitrary thresholds or predefined distances, since they employ prior knowledge about the data. In addition, parametric approaches, such as the GLRT, are able to take advantage of the robustness of the traffic model (provided that it is an adequate model), while nonparametric methods like neural networks cannot.
Thus, inference in this paper is based on hypothesis testing by means of a generalized likelihood ratio test. It is also interesting to look at the anomaly types used in validating detection methods. While there is no consensus on what an anomaly is, or what kinds of anomalies should be detected, a trend toward three possible approaches can be observed, from most general to most specific anomalies: first, several papers detect general divergences in measured traffic features; some other authors detect a few anomaly types commonly found in computer networks; and the remaining papers test their methods with very specific attack procedures, such as known viruses or malware. Our approach is validated with common (flood and flash-crowd) anomalies. As such, it falls under the second approach, thus keeping a compromise between general deviations and specific attacks. Table 1 also shows a clear preference for evaluating detection methods via their false positive/negative rates, as we do in this paper (via ROC curves). Another interesting conclusion drawn from the papers in Table 1 regards reference traffic windows. Although it is not directly seen in the table, authors tend to assume that immediate past traffic is normal, or to leave the choice of appropriate reference windows up to the network manager. In this paper, we address the problem of setting reference windows so as not to make any assumption about immediate past traffic and not to depend on an expert's skill. Some papers, as Table 1 shows, compare their results to other similar contributions, when at all possible. In our case,
Fig. 1. A snapshot of instantaneous traffic passing through: (a) router 1 and (b) router 2 (10,000 samples each, taken in June 2007 and February 2007, respectively). Average traffic is 30.42 Mbps in (a) and 366.87 Kbps in (b).
within the cited papers that use common anomalies to test detection performance, [1] is the natural candidate for comparison for several reasons: first, it is fairly recent, so, to the best of our knowledge, it is a state-of-the-art paper. Second, the method is closely related to ours, so comparisons can provide insight into both the modeling and the inference problem. And, finally, the authors include exact figures for validating their method, so results are directly (and fairly) comparable.
3 DATA ACQUISITION
As mentioned in Section 1, the data used in this work to test the proposed detection method were collected from two routers at the University of Valladolid. Our university comprises four campuses, each one in a different city, for a total of 25,000 students and 2,500 faculty members. Router 1 is the core router for the whole University, and router 2 is the main router of the School of Telecommunications. Router 2 is directly connected to one of the ports in router 1. Both of them are able to operate at 1,000 Mbps. Data collection is done by periodically querying the routers via SNMP for accumulated byte counters at each physical port. Data have been continuously sampled from June 2007 to July 2008 for router 1, and from February 2007 to October 2008 for router 2 (with some brief interruptions due to unpredictable contingencies). Router 1 is a Cisco Catalyst 6509; it usually deals with average traffic amounts of several Megabits per second (40-70 Mbps typically). As mentioned, it is responsible for all network traffic coming from every campus in the University and comprises thousands of hosts directly or indirectly. Router 2, a Cisco Catalyst 3550, usually has a much lower workload, its average traffic typically ranging between practically 0 and 10 Megabits per second, depending on the chosen port. Router 2 alone manages traffic coming from hundreds of computers, which are in turn a fraction of those connected to router 1. These routers should be representative of heavily and lightly loaded networks, respectively. Fig. 1 depicts typical traffic traces from both routers. In the data acquisition stage, traffic samples are taken at intervals of Δt seconds, so that data windows of W seconds are continuously filled and passed to the second stage. W should be large enough to provide a minimum amount of data when trying to fit a statistical model, and short enough to preserve (at least) local stationarity. This is necessary
since we extract a single set of parameters from each time window, which we assume to be constant for W seconds. Traffic stationarity has been previously studied in [21], where the authors find one-hour periods to be reasonably stationary, so we make use of this assumption in this paper as well. However, in order to ensure that the model adequately fits the data (see Section 6.1), we chose a time window length of W = 30 minutes. Δt, on the other hand, should be short enough to, again, provide as many traffic samples as possible to the second stage, but we must also keep in mind that the shorter Δt is, the more loaded a router will be. Network managers often find it unacceptable for a router to spend any significant amount of time on monitoring tasks, so we chose Δt with this restriction in mind. In our experiments, we found that for Δt = 5 seconds the monitoring overhead in both mentioned routers ranged between 1 and 3 percent, which was deemed acceptable by the respective network administrators.
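The acquisition loop described above can be sketched as follows. This is an illustrative skeleton, not the authors' implementation: the cumulative SNMP byte counters (e.g., ifInOctets, a 32-bit wrapping counter) are differenced into per-interval rates and grouped into W-second windows; an actual deployment would obtain the readings through an SNMP library.

```python
# Sampling interval and window length as chosen in the paper.
DT = 5                           # Δt, seconds between SNMP polls
W = 30 * 60                      # window length, seconds (30 minutes)
SAMPLES_PER_WINDOW = W // DT     # 360 rate samples per window
COUNTER_MAX = 2**32              # ifInOctets is a 32-bit wrapping counter

def rate_from_counters(prev, curr, dt=DT):
    """Bits per second between two cumulative byte-counter readings.

    The modular difference is correct even if the counter wrapped
    around its 32-bit maximum between the two polls.
    """
    delta_bytes = (curr - prev) % COUNTER_MAX
    return 8.0 * delta_bytes / dt

def fill_window(counter_readings):
    """Turn successive counter readings into one window of rate samples."""
    return [rate_from_counters(a, b)
            for a, b in zip(counter_readings, counter_readings[1:])]

# Example: a steady 8 kbps link whose counter wraps once mid-window.
readings = [COUNTER_MAX - 1000, 4000, 9000]
print(fill_window(readings))   # two 5-second rate samples, in bits/s
```

Each completed window of SAMPLES_PER_WINDOW rates is then handed to the data analysis stage, which fits the marginal model to it.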
4 DATA ANALYSIS
As previously stated in Section 1, the use of statistical models in the data analysis stage can be advantageous, since an adequate model allows for a robust analysis even with small sample sizes. With traffic windows of W/Δt = 360 samples each, our sample size is rather small, so the use of a model is desirable. This approach has been previously used in works such as [1]; however, the model used there does not account for important traffic properties, such as high variability. Section 4.1 identifies these properties and discusses why classical models are not adequate for network traffic. To deal with the problem, Section 4.2 proposes the use of α-stable distributions, unrestricted in their parameter space, as a model for traffic marginals.
4.1 Existing Network Traffic Models

Traditionally, network traffic has been modeled as a Poisson process for historical reasons. Indeed, the Poisson model was successfully used in telephone networks for many years, and so it was inherited when telecommunication networks became digital and started to send information as data packets. Also, this model has a simple mathematical expression [32], and has only one parameter, λ, which is in turn very intuitive (the mean traffic in packets per time unit). In the last decade, however, several authors have studied network traffic behavior and proposed other models that overcome the limitations inherent to Poisson processes, the most notable ones probably being that the Poisson model has a fixed relationship between mean and variance (both are equal to λ), and that it does not account for high variability or long-range dependence. More recently proposed models are usually based on the assumption that network traffic is self-similar in nature, as originally stated in [33]. Intuitively, network traffic can be thought of as a self-similar process because it is usually "bursty" in nature and this burstiness tends to appear independently of the time scale. Thus, in [33], Fractional Brownian Motion (FBM) [34] is shown to properly fit accumulated network traffic data (note that FBM is an autoregressive process and so it can model accumulated
Fig. 2. A typical histogram of traffic passing through: (a) router 1 and (b) router 2 (10,000 samples each, taken in June 2007 and February 2007, respectively), along with Poisson (dotted), Gaussian (dashed), Gamma (dash-dot), and α-stable (solid) curves fitted to the data.
traffic, but not instantaneous traffic), but the authors impose a strict condition: analyzed traffic must be very aggregated² for the model to work. That is, the FBM model is only valid, the authors say, when many traffic traces are aggregated, in such a way that the number of aggregated traces is much larger than the length of a single trace (measured in number of traffic samples). Let us consider why this restriction is necessary. First of all, we used our collected data to check whether this constraint was needed in our particular network, and saw that it was indeed the case. A graph showing some of our data can be seen in Fig. 1. Note that there are some traffic peaks, or "bursts," scattered among the data, which otherwise tend to vary in a slower fashion. Recalling that instantaneous contributions to FBM are Gaussian random variables, we can calculate a histogram of traffic data like those in Fig. 2, which show typical cases of the instantaneous traffic distribution in routers 1 and 2, along with Poisson, Gaussian, Gamma, and α-stable curves fitted to the real data. Poisson, Gaussian, and Gamma curves were fitted using a Maximum Likelihood (ML) algorithm, while the α-stable curve was fitted with the algorithm described in the Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2011.14. Clearly, one can see that the marginal distribution of sampled data differs considerably from the Poisson, Gaussian, and Gamma probability density functions (PDFs), especially in the case of Fig. 2b. This happens due to the extreme values present in the data, which alter mean and variance estimates considerably. All of this means that a single traffic trace cannot be modeled as an FBM, because traffic marginals are not Gaussian. However, once many traffic traces are aggregated, the resulting data do follow a Gaussian distribution, and so the FBM model is valid.
This happens as a consequence of the Central Limit Theorem [32], which loosely states that the sum of many independent, identically distributed random variables converges to a Gaussian distribution. Note, however, that FBM can model the self-similarity properties of traffic, i.e., it includes a time evolution model which accounts for the long-range dependence that data usually exhibit.

2. Here, aggregated means exactly "averaged." In other words, many traffic traces must be summed up, and then divided by the number of summed traces.
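The effect of aggregation on the marginal shape can be reproduced numerically. The sketch below uses lognormal samples as an illustrative stand-in for bursty, right-skewed traffic marginals (not the paper's data): a single trace is strongly skewed, while the sample-by-sample average of many independent traces is pushed toward a symmetric, Gaussian-like shape, as the Central Limit Theorem predicts for finite-variance data.

```python
import numpy as np

rng = np.random.default_rng(2)

def skewness(x):
    """Sample skewness: third central moment over cubed standard deviation."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

# One bursty "trace": heavily right-skewed marginals (lognormal stand-in).
single = rng.lognormal(mean=0.0, sigma=1.2, size=1000)

# Average 2000 independent traces sample-by-sample (the "very aggregated"
# regime of [33]): the aggregate's marginal skewness shrinks toward zero.
aggregate = rng.lognormal(mean=0.0, sigma=1.2, size=(2000, 1000)).mean(axis=0)

print(skewness(single), skewness(aggregate))
```

For genuinely infinite-variance (α-stable, α < 2) marginals the aggregate would instead stay α-stable; real, bounded measurements are what make the Gaussian limit reappear.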
At this point, it should be clear that any model for instantaneous traffic marginals must be flexible enough to adapt to some properties observed in sampled traffic, namely:

1. Let C(t) be the amount of traffic accumulated at time t. Then, C(t) ≤ C(t+1) and C(t+1) − C(t) ≤ M, where M is the network maximum transmission rate.

2. The fact that at time t there is a certain amount of traffic C(t) does not imply in any way that at time t+1 the amount of traffic lies anywhere near C(t), due to the inherent burstiness of network traffic. This is equivalent to saying that network traffic exhibits the high variability property.

The latter property is also known as the "Noah effect" or the infinite variance syndrome [14], and is easily observed on a histogram like those in Fig. 2 as a heavy tail, usually on its right side. This data tail is not negligible, unlike, for example, the tails of Poisson, Gaussian, or Gamma distributions. On this aspect, note that the histogram in Fig. 2b shows only data under percentile 98 because the tail is so long. One effect heavy tails have when modeling traffic data is that they distort mean and variance estimates notably, which makes it difficult to fit Gamma, Gaussian, and Poisson curves, as seen in Fig. 2. On the other hand, the first property above states the obvious fact that network traffic has compact support between 0 and M. Compact support makes symmetric distributions (Gaussian distributions are symmetric) inappropriate, because if the traffic histogram concentrates on very low transmission rates, the model would allow negative traffic increments to occur with a non-negligible probability, and this can never be the case. Accordingly, if traffic data concentrate near the maximum transmission rate, a symmetric model would allow traffic increments larger than physically possible, again with a non-negligible probability. This also affects the Gamma distribution, since its tail always lies on its right side. As an illustrative example, if we extrapolated the Gaussian (dashed) curve in Fig. 2b toward the left, we would see that the probability of getting a negative Mbps rate is not negligible. Regarding the Poisson distribution, recall that it converges to Gaussian when λ is sufficiently large.³ With our data typically ranging within tens of packets per second, it makes sense to assume that Gaussian convergence holds, so the previous discussion applies. Neither of these problems occurs with the α-stable (solid) curve in the case of Fig. 2, a fact we briefly explored in our previous work [16], so we now focus our attention on these distributions in order to justify their use as a model for anomaly detection.
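The negative-rate argument is easy to quantify. The sketch below fits a Gaussian by ML to synthetic traffic-like data concentrated near zero (a lognormal stand-in, not the paper's measurements) and evaluates the fitted model's probability of a physically impossible negative rate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Traffic-like data: concentrated near zero Mbps with a heavy right tail
# (lognormal used purely as an illustrative stand-in for real marginals).
mbps = rng.lognormal(mean=-1.0, sigma=1.0, size=1000)

# ML Gaussian fit, then the probability mass the fit places below zero.
mu, sigma = stats.norm.fit(mbps)
p_negative = stats.norm.cdf(0.0, mu, sigma)
print(f"P(rate < 0) under the Gaussian fit: {p_negative:.3f}")
```

Although every observed rate is strictly positive, the symmetric fit assigns a clearly non-negligible probability to negative rates, which is exactly the failure mode described above.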
3. λ = 10 is often considered enough for this purpose.

4.2 α-Stable Models

α-stable distributions can be thought of as a superset of Gaussian functions and originate as the solution to the Central Limit Theorem when second-order moments do not exist [17], that is, when data can suddenly change by huge amounts as time passes. This fits nicely with the high
variability property seen in network traffic (the Noah effect). Moreover, α-stable distributions have an asymmetry parameter which allows their PDF to vary from totally left-asymmetric to totally right-asymmetric (the latter is almost the case in Fig. 2b), while genuine Gaussian distributions are always symmetric. This parameter makes α-stable distributions naturally adaptable to the first traffic property (compact support) even when average traffic is virtually 0 or very near the maximum theoretical network throughput (see Fig. 2 again). In addition, α-stable distributions give an explanation for the restriction imposed in [33] about the need to aggregate many traffic traces for them to converge to a Gaussian distribution. According to the Generalized Central Limit Theorem [17], which includes the infinite variance case, the sum of n α-stable distributions is another α-stable distribution, although not necessarily Gaussian. Since traffic data exhibit the Noah effect, we can assume infinite variance. Then, under the hypothesis that marginals are α-stable, the sum of a few traces will be α-stable but not necessarily Gaussian. In [33], however, after summing sufficiently many traces, the final histogram converges to a Gaussian curve. This occurs because any real measurement cannot be infinite, even if an infinite-variance model proves to reflect reality best. Section 6.1 is dedicated to validating this hypothesis, but first, although describing α-stable distributions in detail is beyond the scope of this paper, as there are several good references in this field ([15], [34], [35], for example), we briefly mention some of their properties for readability purposes. α-stable distributions are characterized by four parameters. The first two of them, α and β, provide the aforementioned properties of heavy tails (α) and asymmetry (β), while the remaining two, γ and δ, have analogous meanings to their counterparts in Gaussian functions (standard deviation and mean, respectively).
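The four-parameter family can be explored numerically. The sketch below uses SciPy's scipy.stats.levy_stable (an assumption of this illustration; the paper relies on its own numerical algorithms, described in the Appendix) to evaluate a right-skewed, heavy-tailed density and to check that α = 2 recovers a Gaussian.

```python
import numpy as np
from scipy.stats import levy_stable, norm

# alpha: tail heaviness, beta: asymmetry, loc = delta (center),
# scale = gamma (scatter).  Values here are illustrative only.
alpha, beta, gamma, delta = 1.6, 0.8, 1.0, 10.0
x = np.linspace(delta - 10.0, delta + 30.0, 9)
pdf = levy_stable.pdf(x, alpha, beta, loc=delta, scale=gamma)
print(pdf)  # a right-skewed, heavy-tailed density

# Sanity check: alpha = 2 is exactly Gaussian; in this parameterization
# the stable scale gamma corresponds to a Gaussian sigma of gamma*sqrt(2).
g = levy_stable.pdf(0.0, 2.0, 0.0)
print(g, norm.pdf(0.0, scale=np.sqrt(2.0)))
```

Note that for α ≠ 2 the density has no closed form and is computed by numerical integration, which is slow; this is one concrete face of the computational difficulties discussed below.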
Note that, while they have analogous senses (scatter and center), they are not equivalent, because α-stable distributions do not have, in general, a finite mean or variance, i.e., E{X} ≠ δ and STD{X} ≠ γ. The allowed values for α lie in the interval (0, 2], with α = 2 being the Gaussian case, while β must lie inside [−1, 1] (−1 means totally left-asymmetric and +1 totally right-asymmetric). The scatter parameter (γ) must be a positive number, and δ can have any real value. If α = 2, the distribution does not have heavy tails and β loses its meaning, since Gaussian distributions are always symmetric. Conversely, the tails of the PDF become heavier as α tends to zero. Despite their potential advantages, however, we must also state some reasons why α-stable distributions are difficult to use. First, the absence of a mean and variance in the general case makes the use of many traditional statistical tools impossible. Moreover, as mentioned before, these distributions do not have (to the best of our knowledge) a known closed analytical form for their PDF or their CDF, so powerful numerical methods are needed for tasks which are almost trivial with (for example) the Gaussian distribution, such as estimating the parameters for a given data set or even drawing a PDF. Also, the fact that they have four parameters, instead of just two, introduces two new dimensions to the problem, which can make processing times grow faster than
in the Gaussian approach. In our experiments, however, this is not an issue with recent hardware. The Appendix, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2011.14, describes the mathematical methods we used in dealing with α-stable distributions. In spite of the inadequacy of FBM to model network traffic (because of its marginals), it is nowadays commonly accepted in the literature that a proper traffic model should describe 1) the marginal distribution of empirical data and 2) how a given traffic sample depends on past ones (i.e., correlations between them). In accordance with this, other authors have proposed traffic models based on the findings in [33], which we review briefly here. In [36], traffic is modeled as a combination of Linear Fractional Stable Noise (LFSN) and Log-Fractional Stable Noise (Log-FSN) [34]. These are stochastic processes whose marginals are α-stable, but the authors must use them with some restrictions for the model to fit real data. For example, the center parameter δ must be zero for an α-stable process to be considered as either LFSN or Log-FSN. With this constraint, the first-mentioned property seen in traffic data cannot hold true, so the model is altered to consider the absolute value of the LFSN or Log-FSN process instead of the original one. This should not pose any limitation per se but, for similar reasons, they must restrict themselves to α-stable distributions having β = 0 and α > 1 (i.e., symmetric PDFs whose tails cannot be very heavy), which does limit the α-stable parameter space substantially. In a similar way, Karasaridis and Hatzinakos [37] propose a model based on totally right-skewed LFSN (β = 1). This kind of process, again, imposes some restrictions on the α-stable parameters. In addition to the fixed value of β, α must be greater than 1 for the process to have long-range dependency (and thus, self-similarity).
These restrictions, however, allow the authors to estimate the α-stable distribution using its inherent properties for the case α > 1 and β = 1. More related work on this subject can be found in [38], where the authors try to answer, from a mathematical point of view, the question of whether traffic data are better modeled with Stable Lévy Motion [34] (SLM) or FBM (among other differences, SLM contributions are α-stable while FBM ones are Gaussian). To this end, they use connection rates as an input parameter to some commonly used packet-source models, such as the ON/OFF and the infinite source Poisson models. Note that both SLM and FBM are cumulative processes, so they model accumulated rather than instantaneous traffic. Their conclusion is that FBM can be used for high connection rates, but SLM is more appropriate for low connection rates. This seems to be in concordance with our results, because data from router 1, which deals with higher connection rates than router 2, tend to be better modeled with Gaussian distributions than data from router 2 (see Section 6.1). Finally, in [1], the proposed model for the marginals is the Gamma distribution, which is combined with an ARFIMA process for correlations. Using the proposed model, the authors find that the marginals alone can be used to distinguish between normal traffic and flood anomalies,
which are induced by means of various Distributed Denial-of-Service (DDoS) attacks. Flash-crowd anomalies are also considered and found to be a priori distinguishable from normal traffic by a multiresolution study of the correlation evolution via the ARFIMA model, although a quantitative analysis is only carried out for flood anomalies. The goal of this paper is to detect anomalies in network traffic (particularly, flood and flash-crowd types), not to establish a full, novel network traffic model. To this end, we show in Section 6 that the analysis of marginals suffices to detect both kinds of anomalies, and that α-stable distributions, when used over their full parameter range, outperform other previously used models, both as a model for marginals and as a means to detect anomalies.
5 INFERENCE
In order to detect anomalies, we first need to define what an anomaly is or, in other words, what our method tries to detect. A common approach is to define anomalies as a sudden change in some quantity measured in the previous stage [2], [3], [12], or as a significant divergence between the current traffic window and a previously chosen reference [1], [4]. Note the difference between both strategies: the former compares current traffic to recent past traffic, while the latter does not assume recent traffic is necessarily normal. We feel the latter approach is superior, since some types of anomalies (for instance, slow trends) are detectable this way but would not be otherwise. However, how superior this approach is depends directly on the ability of the reference window to represent all kinds of normal traffic in any given circumstance. It is widely known that network traffic exhibits a cycle-stationary behavior with periods of days and weeks (see [21], for example) and, generally speaking, traffic patterns that are clearly anomalous in some network, at a given time, can be perfectly normal in some other network or time instant. Thus, the reference window should vary from one router port to another, and from any hour-weekday combination to any other, for the anomaly detection system to succeed in the real world. Possibly, holidays and other workday interruption periods should also be taken into account. Still, having exactly one reference window for each possible combination of port, hour, and weekday requires the intervention of an expert who can tell normal traffic apart from anomalous. Since one of our goals is to provide an automatic anomaly detection method, we propose, as a hypothesis, that p(normal traffic) ≫ p(anomalous traffic) in any correctly behaved network under any circumstance (it seems obvious that normal traffic should happen most of the time for it to be considered normal).
In other words, normal traffic should be that which has gone through the router most of the time in the past. Our data collection includes traffic samples from routers 1 and 2 for a period of at least one year, so we roughly have 2 × 24 × 365 = 17,520 30-minute windows for each port in the routers. That gives us 17,520/24/7 (≈ 104) traffic windows for each port-hour-weekday combination such that, by hypothesis, most of them are representative of normal traffic. For all these windows, we estimate the parameters of an α-stable PDF which fits the data and store those parameters in catalogs,
one catalog for each hour-weekday combination, where we assume typical stationarity cycles of days and weeks, and local stationarity within an hour. Consistently with this assumption, test windows are compared with stored training windows within the corresponding catalog. That leaves us with the problem of deciding when a particular traffic window is far enough from our set of (assumed to be) normal windows. Again, a common approach is to fix an arbitrary threshold which marks the boundary between normal and anomalous traffic, and trigger an alarm when the threshold is exceeded. Unfortunately, this approach is error-prone, since a human network administrator will probably feel either that it is too sensitive or, conversely, that it does not detect interesting anomalous behaviors. Also, the simple combination of normal windows and a threshold assumes that anything not previously seen is necessarily anomalous. To overcome this situation, we propose the use of sets of synthetic anomalies (one set per anomaly type), analogous to the normal traffic windows but known to be anomalous, and setting the threshold so that a given traffic window is classified as normal or anomalous based on its similarity to the normal and anomalous sets. Moreover, we propose to inform network administrators via an “abnormality index” instead of using binary yes/no alarms (see below). This way, they have a source of information which should be more informative and thus less error-prone. As for the classification algorithm itself, we choose a Generalized Likelihood Ratio Test (GLRT) [39] since, being a parametric test, it can take advantage of the α-stable model described in Section 4.2. Although there is no optimality associated with the GLRT, asymptotically it can be shown to be uniformly most powerful (UMP) among all invariant tests [39].
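The catalog construction described above can be sketched as follows. The fitting routine here is a crude, hypothetical stand-in (the paper uses the estimator of its online Appendix A); only the catalog bookkeeping reflects the text:

```python
from collections import defaultdict

def fit_alpha_stable(window):
    """Hypothetical stand-in for the estimator of online Appendix A.
    It only guesses center (delta) and scatter (gamma) from robust
    sample statistics and fixes (alpha, beta); a real implementation
    estimates all four parameters numerically."""
    n = len(window)
    delta = sorted(window)[n // 2]                      # median as center proxy
    gamma = sum(abs(x - delta) for x in window) / n or 1.0
    return (2.0, 0.0, gamma, delta)

def build_catalogs(labeled_windows):
    """labeled_windows: iterable of ((port, hour, weekday), samples) pairs.
    Returns one catalog of fitted parameter vectors per combination,
    matching the day/week stationarity cycles assumed in Section 5."""
    catalogs = defaultdict(list)
    for key, samples in labeled_windows:
        catalogs[key].append(fit_alpha_stable(samples))
    return catalogs
```

Each test window is then compared only against the catalog for its own port-hour-weekday key, which is how the method encodes cycle-stationarity without expert intervention.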
We intend to determine whether it is more likely that the current traffic window comes from the normal set or from one of the anomalous sets, so we test H0: current traffic is normal versus H1: current traffic is not normal. The GLRT will decide H1 if

  L_G(x) = max_{θ1j} p(x; θ1j, H1) / max_{θ0j} p(x; θ0j, H0) > λ,   (1)
where x is the vector of traffic samples in the current window, λ is the chosen threshold, and θ0j, θ1j are, respectively, the normal and anomalous sets of α-stable parameter vectors for the current port-hour-weekday combination. The test is repeated for each kind of anomaly to be detected. This translates into evaluating all the previously stored and catalogued α-stable PDFs for the current stationarity period, at the points specified by the current window samples. Thus, we obtain as many likelihood values as stored PDFs for the stationarity period being considered (i.e., for each catalog), from which we take the maximum. Of course, an appropriate threshold should be chosen, depending on the network administrator’s criterion. One approach is to choose a desired false positive rate and set λ accordingly, but see below. Our final implementation of the GLRT differs slightly from expression (1). Using this expression, α-stable likelihoods are evaluated as
  p(x; θij, Hi) = ∏_{k=1..n} f(x_k; θij, Hi),   (2)
where f(x_k; θij, Hi) is the PDF of the α-stable distribution whose likelihood is being calculated, evaluated at the point x_k, and n is the number of samples in each traffic window (30 × 60/5 = 360 in our scenario). However, α-stable PDFs can yield very high values, especially when γ → 0, to the point that a finite-precision machine may incorrectly evaluate the product p as an infinite value. To overcome this situation, we alter the GLRT to use log-likelihoods. Since f(x_k; θij, Hi) > 0 and log is a strictly increasing function, log(max(z)) = max(log(z)). Thus, the test will decide H1 if

  log(L_G(x)) = max_{θ1j} log p(x; θ1j, H1) − max_{θ0j} log p(x; θ0j, H0) > λ1,   (3)

where λ1 = log(λ), and

  log p(x; θij, Hi) = Σ_{k=1..n} log f(x_k; θij, Hi).   (4)
This solves the problem of finite machine precision, but λ1 must still be chosen appropriately to fit the network manager’s needs. Instead of choosing a fixed λ1, however, the raw value of log(L_G(x)) may be scaled and shifted so that it represents a conveniently bounded abnormality index, say between 0 and 100. Then, the administrator can judge for themselves whether the extracted log-likelihood exceeds their particular vision of an alarm threshold in each case. Transforming an unbounded value like log(L_G(x)) into a bounded index I ∈ (0, 100) is easily done (for instance) as

  I = (100/π) [tan⁻¹{log(L_G(x))} + π/2].   (5)
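A minimal sketch of the log-likelihood GLRT of (3)-(4) and the abnormality index of (5). As a simplifying assumption, the α-stable log-PDF is replaced by its α = 2 (Gaussian) member, which has a closed form; the paper instead evaluates numerically computed α-stable log-PDFs:

```python
import math

def gauss_logpdf(x, mu, sigma):
    # Stand-in log-PDF: the alpha = 2 (Gaussian) member of the stable
    # family; the paper's method plugs in numeric alpha-stable log-PDFs.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def log_likelihood(window, logpdf, params):
    # Eq. (4): sum of pointwise log-PDF values over the window samples
    return sum(logpdf(x, *params) for x in window)

def glrt_log_statistic(window, logpdf, normal_params, anomalous_params):
    # Eq. (3): difference of the maximum log-likelihoods over the
    # anomalous and normal catalogs; decide H1 when it exceeds lambda_1
    l1 = max(log_likelihood(window, logpdf, p) for p in anomalous_params)
    l0 = max(log_likelihood(window, logpdf, p) for p in normal_params)
    return l1 - l0

def abnormality_index(log_lg):
    # Eq. (5): squash the unbounded statistic into the interval (0, 100)
    return (100 / math.pi) * (math.atan(log_lg) + math.pi / 2)
```

A statistic of 0 (both hypotheses equally likely) maps to an index of 50; strongly normal windows approach 0 and strongly anomalous ones approach 100.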
This index may be monitored by the network manager, and its evolution followed over time, in the same fashion as other common network status indicators (such as average traffic levels, connection rates, etc.). This way, more information is provided (in comparison to a simple yes/no alarm) in the case that the network manager has to make a decision. Nevertheless, binary alarms may be triggered as well by setting an appropriate λ1 in (3), if desired. To do so, users should pick an anomaly intensity and set a desired false-alarm rate. Once this is done, λ1 is found by using ROC curves (see Section 6.2). In this paper, we focus on detecting flood and flash-crowd anomalies in order to present the results in Section 6.2. Flood anomalies include attacks, or any other circumstances, which result in a net growth of instantaneous traffic. One can think of flood anomalies as having one or more relatively constant traffic sources added to otherwise normal traffic. DDoS attacks typically give rise to anomalies of this kind. Flash-crowd anomalies encompass traffic patterns which are caused by a net growth of (usually human) users trying to access a network resource. Typical flash-crowd anomalies are related to overwhelming web server usage patterns. As stated before, we generate synthetic patterns for both kinds of anomalies. To this end, we assume that traffic
Fig. 3. Anomalous patterns for flood and flash-crowd anomalies.
resulting from aggregating two traffic sources is the sum of those particular traces. This implicitly assumes that the summed traces do not exceed the network capacity, so care should be taken in choosing the intensities of generated anomalous patterns. Synthetic anomalies are generated as follows: We start by using public domain programs to generate flood and flash-crowd anomalies in a virtually empty network (traffic peaks less than 3 Kbps). We used Iperf [40] for flood anomalies, and JMeter [41] for flash-crowds. Iperf is a command-line tool, very simple to use, that allows a constant traffic amount to be injected into a network. JMeter, on the other hand, is slightly more complex. It is designed to test the behavior of the Apache web server under configurable load conditions and allows a user-defined set of HTTP queries to be sent to a web server at random intervals. To ensure the simulation mimics human usage patterns, we implemented a lognormal timer with parameters μ = 3 seconds and σ = 1.1 for interclick periods, as described in [42]. Fig. 3 shows some flood and flash-crowd anomalous patterns. In the flood case, we generated 104 30-minute anomalies ranging from 16 Kbps to 128 Mbps; in the flash-crowd case, we generated 100 anomalies ranging from 10 to 1,000 threads each. Once we have these two pools of “pure anomalies,” we randomly choose as many of them as existing normal windows in each port-hour-weekday combination and add them together. This results in three sets of traffic windows for each port, hour, and weekday: a normal set, a flood-anomalous set, and a flash-crowd-anomalous set. As stated above, normal windows consist of strictly real traffic, while anomalous ones are synthetic (although they have also been built from real traffic). In the same way we did with normal traffic windows, we fit an α-stable PDF to the data in each anomalous window, and store the estimated parameters for their use in the GLRT classifier.
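The aggregation step can be sketched as follows. The clipping at link capacity is our illustrative addition; the paper instead chooses anomaly intensities low enough that the cap is never reached:

```python
def inject_anomaly(normal, anomaly, capacity):
    """Aggregate a pure-anomaly trace into a normal window, sample by
    sample. A real link cannot carry more than its capacity, so values
    are clipped here as a safeguard (hypothetical; the paper avoids the
    cap by limiting the generated anomaly intensities)."""
    assert len(normal) == len(anomaly)
    return [min(n + a, capacity) for n, a in zip(normal, anomaly)]
```

Applying this to each normal window with a randomly drawn pure anomaly yields the flood-anomalous and flash-crowd-anomalous training sets described above.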
Before presenting performance results in the next section, it should be noted that synthetic anomalies are not real abnormal patterns, since the latter tend to alter network behavior in various ways not directly observed in aggregated traces (network entities tend to drop packets, trigger retries and congestion control mechanisms, alter backoff timers, etc.), especially when network use is near its maximum capacity. Injecting anomalies on a real network to generate enough training patterns, however, is not an option, since doing so would interfere with normal network use. Nevertheless, in our tests, α-stable parameters obtained from estimation of synthetic anomalous patterns do not seem to differ significantly from those obtained by directly
Fig. 4. Distribution of injected versus synthetic anomalous patterns over the α-stable parameter space. Note that the distribution of injected abnormal traffic does not seem to differ from that of synthetic patterns.
injecting anomalous traffic in a real network. In this regard, Fig. 4 shows the distribution of some flood and flash-crowd patterns injected into a real network, along with the distribution of synthetically obtained abnormal patterns. It can be seen that α-stable parameters are distributed similarly for both types of patterns.
6 RESULTS
This section shows the results for the Data Analysis and Inference stages. For data analysis, we show statistical evidence that α-stable distributions are adequate as a model of network traffic marginals for the purpose of detecting anomalies, and compare our goodness-of-fit figures to those of other marginal models which have been used elsewhere. Regarding the Inference stage, we make extensive use of ROC curves to show our classification rates and, specifically, we compare them with the results reported in [1]. The α-stable traffic model is tested with real data collected from routers 1 and 2, as described in Section 3, as well as with well-known public traffic traces available at [22], [23] (respectively, the ITA and WITS data sets). As previously indicated, however, the proposed detection method needs a set of traffic data which is sufficiently dense and provides enough windows in each of the catalogs, so that they are representative of their respective stationarity periods. Unfortunately, we could not find any public traces that satisfy both criteria at the same time, so the ROC curves presented below use only data from our university routers.
6.1 Goodness-of-Fit of the α-Stable Model
A very common way of testing goodness of fit is the use of nonparametric tests, such as the Kolmogorov-Smirnov (KS) test [43]. Unfortunately, this and other similar tests assume that samples are independent and identically distributed (iid). Since we are assuming local stationarity within an hour from the beginning, all samples in one traffic window may be considered as being identically distributed, but not necessarily independent. In fact, several studies ([14], [33], for example) have detected a strong presence of positive dependence in sampled traffic, which in turn results in long-range dependency and the need to use sophisticated models for traffic correlations (such as ARFIMA processes). However, the effect of positive dependence in stationary stochastic processes has been studied in various works.
Fig. 5. KS tests for goodness-of-fit of various distributions to traffic marginals, shown as H0 acceptance rates (in percent). Window lengths range from 8.3 minutes (100 samples) to 1 hour, 23 minutes (1,000 samples). The inference stage uses 360-sample windows (30 minutes).
Weiss [44], for example, proposes a modification of the KS test for second-order Autoregressive Moving Average (ARMA) processes. There are also tests which are appropriate for strongly correlated α-stable distributed processes [45]. However, we decided against using these because 1) recent literature shows that network traffic is usually too correlated to be closely represented by ARMA processes, and 2) a test which is specific to α-stable distributions prevents us from comparing them with any other distribution. Instead, we make use of the results reported in [46], which state that under the presence of positive dependence, the KS test tends to reject the null hypothesis (H0). This way, if the test for iid variables accepts the hypothesis that traffic follows a specific distribution, then we can be confident that this distribution is adequate even for very positively correlated data. Of course, the drawback of this approach is that, when H0 is rejected, there can still be doubt that the affected distribution could have been an adequate model. Nevertheless, it is not our intention in this paper to measure exact values for the adequacy of a particular model, but just to validate α-stable distributions and corroborate that they can fit traffic marginals better than other previously used models. Thus, the distribution that yields the largest H0 acceptance rates should be the best model for traffic marginals. The KS test, however, has another drawback: it can only be applied when the theoretical distribution is completely specified, i.e., its parameters cannot be estimated from the data. Since the real distribution of traffic data is naturally unknown, the use of the KS test may be, again, objectionable. Nevertheless, the KS statistic can be corrected by simulation to avoid this problem.
Appendix B, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2011.14, is devoted to finding proper correction coefficients for the KS statistic, which are used here to present goodness-of-fit results. We have already referred to Fig. 2 as a pictorial indication that typical traffic histograms can be closely approximated using α-stable distributions. To give statistical evidence that this is indeed the case, we made several
tests with output traffic from the upstream port of routers 1 and 2 (all ports cannot be shown here for space reasons). Taking SNMP byte counters as inputs, data windows of 100 to 1,000 consecutive samples are randomly chosen for each of the ports. Then, for each window length, we make 1,000 experiments in which:
1. The four parameters of an α-stable distribution are estimated from the data using the algorithm described in Appendix A, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TDSC.2011.14.
2. A KS goodness-of-fit test is made with the null hypothesis (H0) being “data follow the estimated α-stable distribution.”
This process is repeated for the Gamma, Gaussian, and Poisson distributions, using their corresponding ML estimators. Since we want H0 to be accepted, the test should yield p-values greater than the significance level so that H0 is not rejected. Fig. 5 shows the results of test sets for routers 1 and 2, as well as for traces LBL-CONN-7 from ITA [22] and AUCKLAND-4 from WITS [23]. For each experiment set, H0 acceptance rates are shown, with a significance level of 0.05 for all tests. For clarity, results for the Poisson distribution are not shown: because the Poisson distribution has only one free parameter, and because of the aforementioned Gaussian convergence, Poisson fits always lie below Gaussian ones. Regarding these results, it should be noted that H0 acceptance rates tend to become smaller as the number of samples grows. This may be a consequence of strong positive correlation in sampled data, loss of local stationarity (since samples more than an hour apart are forced in this experiment to be identically distributed), or the simple fact that the KS test expects more convergence as the number of samples grows.
Nevertheless, the figures show that for 30-minute, five-seconds-per-sample windows (i.e., 360 samples), α-stable distributions fit traffic marginals best, and their H0 acceptance rates fall well above those of other previously used distributions, so they should constitute an adequate model for traffic marginals.
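The KS statistic used in these tests can be computed directly from the empirical CDF. A minimal sketch (the simulation-based correction of the critical values, described in online Appendix B, is not reproduced here):

```python
def ks_statistic(samples, cdf):
    """Two-sided Kolmogorov-Smirnov statistic D between the empirical
    CDF of a window and a fitted theoretical CDF. When the parameters
    were estimated from the same data, D must be compared against
    simulation-corrected critical values, not the standard KS table."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare against the empirical CDF just before and just after x
        d = max(d, abs((i + 1) / n - f), abs(i / n - f))
    return d
```

In practice `cdf` would be the numerically evaluated CDF of the fitted α-stable (or Gamma, Gaussian, Poisson) distribution; any callable mapping a sample to a probability works.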
6.2 Classifier Performance
Once the model for traffic marginals has been validated, we show performance results for our classification method. To this end, it is common practice to use graphs that show the detection probability (PD) as a function of the false-alarm probability (PFA). These graphs are called ROC curves; they are strictly increasing and always lie within the range from (0,0) to (1,1). The larger the area under one of these curves (AUC), the better the performance of the considered classifier. The ideal ROC curve would be one that included the point (0,1), indicating perfect detection capabilities and zero false-alarm probability, whereas the worst case is a straight line from (0,0) to (1,1), indicating no gain compared to a purely random classifier. ROC curves for our method are obtained by varying λ1 from (3) over a logarithmically spaced set of thresholds from −∞ to +∞. A good reference on ROC curves is [39]. To keep the training and test data sets from overlapping, we use a procedure based on the leave-one-out strategy [47], i.e., we cycle through all available patterns in such a way that, in each round, all patterns except one are used as the training data set and the remaining pattern is used as the test set. In our method, however, anomalous patterns are generated from normal ones, so some correlation exists between the two. To prevent this correlation from affecting the tests, we remove the test pattern and its corresponding pair (instead of just the test pattern) from the training data set, in what may be called a “leave-two-out” strategy. In [1], ROC curves are used to measure the performance of the anomaly detection method presented there, although only flood anomalies are tested. Since the anomalous data used in that paper are not publicly available, we implemented that method (with some modifications to adapt it to our scenario) and compared its results to ours.
In order to carry out a fair comparison, we briefly describe here our exact implementation of [1]:
1. For all collected traffic windows belonging to a particular combination of port, hour, and weekday, repeat steps 2 to 8.
2. Prepare three consecutive traffic windows. The first one is the reference window; the second and third act as normal and anomalous traffic windows.
3. Inject a synthetic flood anomaly of a given intensity into the third window.
4. For time resolutions of 5, 10, 20, 40, and 80 seconds per sample, repeat step 5.
5. Estimate the shape parameter of three Gamma distributions, one fitted to each traffic window, and compute the following quadratic distances: a) reference to normal window and b) reference to anomalous window.
6. Calculate mean quadratic distances over all time resolutions.
7. For a sufficiently dense, logarithmically spaced set of thresholds from 0 to +∞, repeat step 8.
8. Accumulate the number of false positives/negatives for each threshold.
9. Calculate false positive/negative ratios.
10. Plot the ROC curve and calculate its AUC.
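Steps 7-10 (threshold sweep, error counts, ROC, and AUC) can be sketched as follows, assuming per-window detection statistics have already been computed for known-normal and known-anomalous windows (function names are ours):

```python
def roc_points(normal_scores, anomalous_scores, thresholds):
    """One (PFA, PD) point per threshold; an alarm fires whenever the
    detection statistic exceeds the threshold, so PFA is the alarm rate
    on normal windows and PD the alarm rate on anomalous ones."""
    pts = []
    for t in sorted(thresholds):
        pfa = sum(s > t for s in normal_scores) / len(normal_scores)
        pd = sum(s > t for s in anomalous_scores) / len(anomalous_scores)
        pts.append((pfa, pd))
    return pts

def auc(points):
    # Trapezoidal area under the ROC, anchored at (0,0) and (1,1)
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
```

An AUC of 1.0 corresponds to perfect separation and 0.5 to blind classification, matching the interpretation used throughout this section.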
There are three differences between the described algorithm and the original found in [1]. First, the original algorithm uses 10 logarithmically spaced time resolutions from 1 to 1,024 ms. Our data, however, are sampled at 5-second intervals, so we have no other option than making the multiresolution calculations at a larger scale. Second, the authors do not elaborate on how to choose an appropriate reference window, but just indicate that the window preceding the injection of network attacks was used; therefore, we choose the window preceding the normal and anomalous ones as reference traffic. Third, [1] does not consider different combinations of port, hour, and weekday, but we do so here in order to fairly compare classification performance between both approaches. Step 5 deserves further explanation as well. In the original paper, the authors state that Mean Quadratic Distances (MQDs) are to be independently calculated for both parameters of the Gamma distribution. However, they then observe that the scale parameter does not change in the presence of flood anomalies and discard it, so we do not calculate it while reproducing their results. For the case of flash-crowd anomalies, the authors do not see any particular change in traffic marginals and prefer to use the correlation model to detect them. For space reasons, we cannot present all generated ROC curves. Instead, we present some of them in Figs. 6 and 7. Each plot shows three ROC curves: one for our method (A), a second for the method in [1] (B), and a third (Ref) obtained by applying a logistic regression classifier [48] to the α-stable parameter space (the implementation from the MATLAB [49] statistics toolbox, used as a reference). These reference AUCs give information for assessing the GLRT classification capabilities in comparison to a simple classification method.
In this regard, note that in some cases the ROC for the logistic regression classifier cannot be correctly estimated from the sample, since its results are close to random (see below for an explanation). In these cases, no reference curve is shown. ROC curves for each method are shown as 95 percent confidence intervals (estimated from the sample, since there is no evidence that the obtained ROC points follow any known distribution). We also summarize the median AUC results for both methods in Table 2. To assess whether there is any significant performance difference between both methods, we generate anomalies at 10, 25, 50, and 100 percent intensity, relative to the mean amount of traffic for each window, so that classification performance can be measured for increasingly easy-to-detect anomalies (note that anomalous traffic is generated at the training stage by adding pure anomalies of random, rather than fixed, intensities). Then, we pick pairs of AUCs (one for our method, the other for [1]) for every weekday, at 00:00, 06:00, 12:00, and 18:00, for a total of 7 × 4 = 28 AUC pairs per anomaly type and intensity. Then, we use the Mann-Whitney U test [50] to search for statistical significance at a level of 0.05. Note that 28 samples are very few to assume Gaussian convergence, and also that AUCs are defined in the interval [0.5, 1], so a Student's t test should be discarded in favor of a nonparametric test [50]. Also recall that, contrary to the previous section, we are now interested in rejecting the null
Fig. 6. ROC curves for flood anomalies: our method (solid lines, marked “A”) and the method described in [1] (dashed lines, marked “B”). Dotted lines (marked “Ref”) show results from a simple logistic regression classifier applied to the α-stable parameter space (only shown when results differ from random classification). Curves are shown as 95 percent confidence intervals (estimated from the sample). Larger areas under the curves indicate better classification performance. For space reasons, only inputs from routers 1 and 2 at Monday 12:00 are shown, with anomaly intensities of 10, 25, 50, and 100 percent.
hypothesis, so we look for p-values below the significance level of the test. It should be noted again that the authors of [1] rule out the use of their method to detect flash-crowd anomalies based solely on traffic marginals (they investigate the viability of detecting them by studying correlations, although no quantitative analysis is performed). Nevertheless, it is interesting to compare the performance of both methods for this case as well. Thus, we include it here for the sake of completeness and as an indication of a minimum acceptable classification rate for our method. Note also that, despite the authors' findings, some figures for their method are clearly better than simple blind classification.
As Table 2 shows, our method presents a net gain over [1] in all but three cases, in which both methods yield statistically indistinguishable figures. These cases, which give fairly poor classification rates, correspond to relatively low-intensity flood anomalies in port 1 of router 1. Since this port is the one connecting the whole university to the external world, it is by far the most loaded traffic port in our data sets. α-stable marginal fits for this port show central values around 40-70 Mbps and a fairly large sample deviation around them, which makes it hard for any classifier to detect subtle anomalous patterns, such as 10-25 percent flood anomalies.
Fig. 7. ROC curves for flash-crowd anomalies: our method (solid lines, marked “A”) and the method described in [1] (dashed lines, marked “B”). Dotted lines (marked “Ref”) show results from a simple logistic regression classifier applied to the α-stable parameter space (only shown when results differ from random classification). Curves are shown as 95 percent confidence intervals (estimated from the sample). Larger areas under the curves indicate better classification performance. For space reasons, only inputs from routers 1 and 2 at Monday 12:00 are shown, with anomaly intensities of 10, 25, 50, and 100 percent.
TABLE 2 Median Areas under ROC Curves for (A) Our Method, (B) That Described in [1], and (Ref) a Simple Logistic Regression Classifier Applied to the α-Stable Parameters
For methods A and B: underlined values are significantly larger with the corresponding p-value, as yielded by the Mann-Whitney U test. Both values are underlined if no significant difference is found. For methods A and Reference: ⋆ means no significant difference between the GLRT and logistic regression classifiers; † means the reference AUC is significantly larger than the corresponding GLRT AUC; “—” means results are indistinguishable from random classification. Unmarked reference AUC values are significantly smaller than the corresponding GLRT AUC. Each test is based on 28 AUCs corresponding to every weekday at 00:00, 06:00, 12:00, and 18:00. R = Router; P = Port; D = Direction.
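The Mann-Whitney U test underlying Table 2 can be sketched in a few lines. This is a hedged illustration (pure Python, normal approximation; the paper does not publish its test code, and for small samples an exact or tie-corrected variant would be preferable):

```python
import math

def mann_whitney_u_p(x, y):
    """Two-sided Mann-Whitney U test p-value via the normal
    approximation, adequate for samples of ~28 AUCs as in Table 2."""
    n1, n2 = len(x), len(y)
    # U statistic: pairs where x beats y, with ties counted as one half
    u = sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0
            for xi in x for yi in y)
    mu = n1 * n2 / 2.0                                   # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)    # std of U under H0
    z = (u - mu) / sigma
    # two-sided p-value from the standard normal survival function
    return math.erfc(abs(z) / math.sqrt(2))
```

Two AUC samples drawn from the same distribution yield a p-value near 1, while clearly separated samples yield a small p-value, flagging a significant difference between classifiers.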
A deeper inspection of Table 2 allows us to draw some conclusions, for which we give an explanation in the following paragraphs:

1. Figures for both methods tend to be better for router 2 than for router 1.
2. Our method seems to yield better classification results for flash-crowd anomalies than for the flood type.
3. Some p-values indicate a clear difference between both methods even though median AUCs are very similar.
4. p-values tend to decrease as anomaly intensities increase.
5. Reference AUCs are highly variable, sometimes giving classification results comparable to (or even slightly better than) the GLRT classifier, and in some other cases giving virtually random results.

The first conclusion is explained the same way as the previously mentioned cases with nonsignificant p-values. Router 2 carries around one order of magnitude less traffic than router 1. This affects centrality and scatter parameters (both for Gamma and α-stable distributions) in such a way that differences to any reference window tend to be more easily detected. The explanation for the second conclusion can be found in the α-stable parameter space: flood anomalies are detected essentially as an abnormal centrality value and a slight variation in the scatter parameter (as Fig. 3a may anticipate),
Fig. 8. An example of box-and-whiskers plots of the AUC distribution for (A) our method and (B) that reported in [1]. On the left, a 10 percent flood anomaly at router 2. On the right, a 100 percent flood anomaly at router 1. Both plots are based on 28 AUCs corresponding to every weekday at 00:00, 06:00, 12:00, and 18:00.
while the other two parameters (shape and skewness) remain mostly unaltered. But flash-crowd patterns, being much noisier and more skewed (see Fig. 3b), do alter the shape and skewness of the marginals. Therefore, α-stable distributions exploit their full potential when classifying flash-crowd anomalies, which is not the case for the flood type. Regarding the third and fourth conclusions, related to the behavior of p-values, both are explained by looking at box-and-whiskers plots of the AUCs, an example of which can be seen in Fig. 8. Generally speaking, as anomaly intensities increase, AUCs for our method tend to concentrate around their median faster than those for the method in [1]. This can result in AUCs that look similar in median, but differ significantly when considering their complete distribution. As for the fifth conclusion, the logistic regression classifier works well when decision regions are disjoint in the α-stable parameter space, i.e., when there is a clear boundary between normal and anomalous patterns, as in the case depicted in Fig. 9a. This happens especially for flood anomalies which, as stated before, essentially tend to alter the centrality parameter. In other cases, such as in Fig. 9b, decision boundaries are not easily found, causing classification results to be indistinguishable from random. As a final conclusion, recall that our results are based exclusively on traffic marginals, and no other measurements or heuristics were needed to achieve them. This
Fig. 9. Examples of decision boundaries obtained with a logistic regression classifier (a projection of the 4D α-stable parameter space onto a plane is shown). On the left, a clear boundary may be drawn between normal and flood traffic. On the right, the decision boundary causes an almost random classification outcome between normal and flash-crowd traffic.
results in added simplicity to the goal of detecting anomalies, both from a theoretical perspective and from a network administrator’s point of view.
7 CONCLUSIONS AND FURTHER WORK
In this paper, an anomaly detection method based on statistical inference and an α-stable first-order model has been studied. We follow a four-stage approach to describe each of the pieces from which to build a functional detection system (data acquisition, data analysis, inference, and validation), yielding the final classification results shown in Section 6.2. Our approach uses aggregated traffic as opposed to packet-level sampling, so dedicated hardware is not needed.

In the data analysis stage, we propose to use α-stable distributions as a model for traffic marginals. Since these distributions are able to adapt to highly variable data, and since they are the limiting distribution of the generalized central limit theorem, they are a natural candidate for modeling aggregated network traffic. In this regard, we give statistical evidence that unrestricted α-stable distributions constitute an adequate marginal model for real traffic in our scenario (as long as a local-stationarity hypothesis holds) and compare our α-stable fits to other marginal models which have been used elsewhere.

In the inference stage, we propose a novel strategy for choosing reference traffic windows, so that expert intervention is not needed. Neither do we assume that immediate past traffic is necessarily normal. To this end, we make use of a whole set of reference windows for all combinations of port, hour, and weekday to better reflect reality, and use them to distinguish normal from anomalous traffic. Synthetic anomalies are used to generate anomalous traffic windows. As for the classification itself, instead of applying a "sufficiently far from normal" threshold to raise an alarm, we propose the use of a GLRT to assess the similarity of a particular traffic trace to reference normal and anomalous traffic windows.
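The flavor of this likelihood-based comparison can be sketched as follows. This is a deliberately simplified illustration, not the paper's method: Gaussian likelihoods stand in for the numerically evaluated α-stable densities (which have no closed form), a fixed-parameter likelihood ratio stands in for the full GLRT (which maximizes likelihoods over the parameter space), and the logistic squashing to [0, 1] is one plausible bounding choice; all function names are ours.

```python
import math

def gauss_loglik(x, mu, sigma):
    """Log-likelihood of samples x under a Gaussian N(mu, sigma^2),
    used here only as a tractable stand-in for an alpha-stable density."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (v - mu)**2 / (2 * sigma**2) for v in x)

def abnormality_index(window, normal_ref, anomalous_ref):
    """Map the log-likelihood ratio between anomalous and normal
    reference models to [0, 1]; values near 1 suggest the window
    resembles the anomalous reference more than the normal one."""
    llr = (gauss_loglik(window, *anomalous_ref)
           - gauss_loglik(window, *normal_ref))
    # Naive logistic bounding; a production version would guard
    # against overflow for extreme log-likelihood ratios.
    return 1.0 / (1.0 + math.exp(-llr))
```

A bounded score of this kind, rather than a hard threshold, is what lets an administrator grade the importance of an alarm.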
Then, a conveniently bounded abnormality index is calculated, in order to let network administrators decide on the importance of a particular anomaly, should they wish to, instead of raising a binary yes/no alarm.

A topic not covered in this paper relates to self-adaptation of reference traffic windows (normal and abnormal) to newly seen real traffic. Since network traffic tends to change over time, it may be desirable to periodically update training traffic to fit new circumstances as time passes. Apart from the extra computation time, nothing prevents the proposed method from being fed with new traffic windows and periodically recalculating reference windows. This issue has been addressed in other works, such as [51], [52]. However, allowing the detection system to adapt itself to new traffic has other implications, such as rendering it vulnerable to low-rate attacks [53], in which attackers inject abnormal traffic into the network slowly and increasingly, a fact that would eventually lead the system to incorrectly recognize abnormal traffic as normal.

Statistical tests of our model show that α-stable distributions outperform other first-order models used in anomaly detection regardless of the window length. In testing the
performance of our method, we show classification rates, in the form of areas under ROC curves, for two anomaly types commonly found in the anomaly detection literature, namely floods and flash-crowds. Then, we compare our figures to those obtained with the state-of-the-art method reported in [1] (with some modifications to adapt it to our scenario). Table 2 shows a net gain for our method in all but a few cases, in which results for both approaches are statistically indistinguishable.

Traffic data used in our tests come from the core router of our university and from one of its schools, which should be representative of heavily and lightly loaded networks. Even though validation data for the inference stage come from networks in our university, we do not make any assumption on the nature of the data, apart from the fact that it is locally stationary in periods of 30 minutes. Thus, our results should extrapolate to other networks as long as the local stationarity restriction holds. Otherwise, network managers may have to adjust the traffic window length (W) and the sampling period (t) to fit their particular needs. W and t may need further adjustment so that sampled routers are not overloaded and sufficiently large traffic windows are fed to the analysis stage. Also, our approach has been validated with flood and flash-crowd anomalies, but other anomalies should be detectable with our method as well, provided that they influence the first-order distribution of aggregated traffic.

Despite the mentioned contributions, our approach still has some drawbacks. α-stable distributions outperform other statistical distributions used elsewhere in anomaly detection, but at a higher computational cost. Although this extra calculation time should not be an issue on current hardware, classical models will always be easier and faster to use.
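The windowing that W and t define amounts to a simple preprocessing step: traffic-rate samples taken every sampling period are grouped into fixed-length analysis windows. A minimal illustrative sketch (the function name is ours; real values of W and t should respect the 30-minute local-stationarity assumption discussed above):

```python
def make_windows(samples, samples_per_window):
    """Split a sequence of traffic-rate samples into consecutive,
    non-overlapping analysis windows; a trailing partial window
    is dropped rather than analyzed."""
    return [samples[i:i + samples_per_window]
            for i in range(0, len(samples) - samples_per_window + 1,
                           samples_per_window)]
```

Each resulting window is then fitted and classified independently, which is what makes the local-stationarity assumption matter.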
On the other hand, a set of reference traffic windows combined with synthetic anomalies reduces the need for a network manager's intervention, but a sufficiently large traffic data collection, able to represent normal traffic at any moment, must be available prior to deploying our detection method.

For all experiments in Section 6, a recent, average PC was used as the detection machine, and a two-year-old, 16-way server was used to calculate α-stable parameters of all collected traffic windows in the training stage. The runtime to obtain both flood and flash-crowd abnormality scores for the current traffic window on the detection machine was roughly five seconds. The offline training stage runtime to preprocess all traffic windows needed to obtain the results shown in Table 2 (about one year of 30-minute windows, two anomaly types, two routers, and two directions, for an approximate total of 150,000 windows) was roughly 12 hours on the server. All our algorithms have been implemented in MATLAB [49], so optimized implementations in a compiled language may dramatically lower these figures.

Further work on this subject falls into three main areas. First, the proposed model for network traffic is not complete, since we model traffic marginals only. As recent literature shows, traffic correlations should be taken into account if new insights into traffic nature are to be found, so this is clearly open ground for a deeper study in the data analysis stage. Although the inclusion of a time evolution model adds
complexity and computational load to traffic analysis, the long-range dependence property may provide additional information which could prove useful for the inference stage. Also, the GLRT has been chosen in the inference stage since, being a parametric classifier, it is able to take advantage of the traffic model's robustness, as well as for its asymptotically UMP characteristics. Nevertheless, other classifiers may prove able to yield better results or reduced calculation times. Second, our method and that reported in [1] differ in quite a few areas, so there is the open question of exactly how much each difference contributes to the final results. Some of these differences are difficult to measure (e.g., the robustness of a set of normal and anomalous reference windows for all port, hour, and weekday combinations versus the use of a single, normal traffic reference window) but, still, studying the contribution of every single variable seems necessary to further enhance performance figures. And third, the methods proposed in this paper have been tested in laboratory conditions, so more testing in production environments should be carried out, so as to further understand network managers' needs.
ACKNOWLEDGMENTS

The authors want to thank José Andrés González-Fermoselle and Carlos Alonso-Gómez, network managers of the University of Valladolid campus and the School of Telecommunications Engineering, respectively, for their patience and invaluable support in accessing and sampling traffic data at routers 1 and 2. The authors would also like to thank Dr. Antonio Tristán-Vega for helpful discussions about α-stable distributions. This work has been partially funded by the Spanish Ministry of Science and Innovation (TIN2008-03023), the Spanish Ministry of Education and Culture-European Regional Development Fund (TEC2007-67073/TCM), the Autonomous Government of Castilla y León, Spain (VA106A08, SAN126/VA33/09, and VA0339A10-2), and the Regional Health Ministry of Castilla y León, Spain (GRS 292/A/08 and GRS 388/A/09).

REFERENCES

[1] A. Scherrer, N. Larrieu, P. Owezarski, P. Borgnat, and P. Abry, “Non-Gaussian and Long Memory Statistical Characterizations for Internet Traffic with Anomalies,” IEEE Trans. Dependable and Secure Computing, vol. 4, no. 1, pp. 56-70, Jan. 2007.
[2] M. Thottan and C. Ji, “Anomaly Detection in IP Networks,” IEEE Trans. Signal Processing, vol. 51, no. 8, pp. 2191-2204, Aug. 2003.
[3] C. Manikopoulos and S. Papavassiliou, “Network Intrusion and Fault Detection: A Statistical Anomaly Approach,” IEEE Comm. Magazine, vol. 40, no. 10, pp. 76-82, Oct. 2002.
[4] Y. Gu, A. McCallum, and D. Towsley, “Detecting Anomalies in Network Traffic Using Maximum Entropy Estimation,” Proc. Internet Measurement Conf., Oct. 2005.
[5] A. Lakhina, M. Crovella, and C. Diot, “Diagnosing Network-Wide Traffic Anomalies,” Proc. ACM SIGCOMM ’04, pp. 219-230, Aug. 2005.
[6] P. Barford, J. Kline, D. Plonka, and A. Ron, “A Signal Analysis of Network Traffic Anomalies,” Proc. Second ACM SIGCOMM Workshop Internet Measurement, pp. 71-82, Nov. 2002.
[7] A. Ray, “Symbolic Dynamic Analysis of Complex Systems for Anomaly Detection,” Signal Processing, vol. 84, no. 7, pp. 1115-1130, 2004.
[8] S.C. Chin, A. Ray, and V. Rajagopalan, “Symbolic Time Series Analysis for Anomaly Detection: A Comparative Evaluation,” Signal Processing, vol. 85, no. 9, pp. 1859-1868, 2005.
[9] A. Wagner and B. Plattner, “Entropy Based Worm and Anomaly Detection in Fast IP Networks,” Proc. 14th IEEE Int’l Workshops Enabling Technologies: Infrastructures for Collaborative Enterprises, pp. 172-177, June 2005.
[10] M. Ramadas, S. Ostermann, and B. Tjaden, “Detecting Anomalous Network Traffic with Self-Organizing Maps,” Proc. Sixth Int’l Symp. Recent Advances in Intrusion Detection, pp. 36-54, 2003.
[11] S.T. Sarasamma, Q.A. Zhu, and J. Huff, “Hierarchical Kohonen Net for Anomaly Detection in Network Security,” IEEE Trans. Systems, Man and Cybernetics, Part B: Cybernetics, vol. 35, no. 2, pp. 302-312, Apr. 2005.
[12] V. Alarcon-Aquino and J.A. Barria, “Anomaly Detection in Communication Networks Using Wavelets,” IEE Proc.-Comm., vol. 148, no. 6, pp. 355-362, Dec. 2001.
[13] L. Kleinrock, Queueing Systems, Volume 2: Computer Applications. John Wiley and Sons, 1976.
[14] W. Willinger, M.S. Taqqu, R. Sherman, and D.V. Wilson, “Self-Similarity through High-Variability: Statistical Analysis of Ethernet LAN Traffic at the Source Level,” IEEE/ACM Trans. Networking, vol. 5, no. 1, pp. 71-86, Feb. 1997.
[15] G. Samorodnitsky and M.S. Taqqu, Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Chapman & Hall, 1994.
[16] F. Simmross-Wattenberg, A. Tristán-Vega, P. Casaseca-de-la-Higuera, J.I. Asensio-Pérez, M. Martín-Fernández, Y.A. Dimitriadis, and C. Alberola-López, “Modelling Network Traffic as α-Stable Stochastic Processes: An Approach Towards Anomaly Detection,” Proc. VII Jornadas de Ingeniería Telemática (JITEL), pp. 25-32, Sept. 2008.
[17] G.R. Arce, Nonlinear Signal Processing: A Statistical Approach. John Wiley and Sons, 2005.
[18] J. Jiang and S. Papavassiliou, “Detecting Network Attacks in the Internet via Statistical Network Traffic Normality Prediction,” J. Network and Systems Management, vol. 12, no. 1, pp. 51-72, Mar. 2004.
[19] W. Yan, E. Hou, and N. Ansari, “Anomaly Detection and Traffic Shaping under Self-Similar Aggregated Traffic in Optical Switched Networks,” Proc. Int’l Conf. Comm. Technology (ICCT ’03), vol. 1, pp. 378-381, Apr. 2003.
[20] J. Brutlag, “Aberrant Behavior Detection in Time Series for Network Monitoring,” Proc. USENIX 14th System Administration Conf. (LISA), pp. 139-146, Dec. 2000.
[21] V. Paxson and S. Floyd, “Wide Area Traffic: The Failure of Poisson Modelling,” IEEE/ACM Trans. Networking, vol. 3, no. 3, pp. 226-244, June 1995.
[22] Internet Traffic Archive, http://ita.ee.lbl.gov/, 2011.
[23] Waikato Internet Traffic Storage, http://wand.cs.waikato.ac.nz/wits/, 2011.
[24] Cooperative Assoc. for Internet Data Analysis, http://www.caida.org/, 2011.
[25] DiRT Group’s Home Page, Univ. of North Carolina, http://wwwdirt.cs.unc.edu/ts/, 2010.
[26] “Metrology for Security and Quality of Service,” http://www.laas.fr/METROSEC/, 2011.
[27] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen, “Sketch-Based Change Detection: Methods, Evaluation, and Applications,” Proc. Internet Measurement Conf. (IMC), pp. 234-247, Oct. 2003.
[28] DDoSVax, http://www.tik.ee.ethz.ch/ddosvax/, 2010.
[29] S. Stolfo et al., “The Third International Knowledge Discovery and Data Mining Tools Competition,” http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, 2011.
[30] G. Cormode and S. Muthukrishnan, “What’s New: Finding Significant Differences in Network Data Streams,” IEEE/ACM Trans. Networking, vol. 13, no. 6, pp. 1219-1232, Dec. 2005.
[31] Cisco Systems, “Cisco IOS NetFlow,” http://www.cisco.com/web/go/netflow, 2011.
[32] A. Papoulis, Probability, Random Variables, and Stochastic Processes, third ed., McGraw-Hill, 1991.
[33] W. Leland, M. Taqqu, W. Willinger, and D. Wilson, “On the Self-Similar Nature of Ethernet Traffic (Extended Version),” IEEE/ACM Trans. Networking, vol. 2, no. 1, pp. 1-15, Feb. 1994.
[34] P. Embrechts and M. Maejima, Selfsimilar Processes. Princeton Univ. Press, 2002.
[35] Lévy Processes: Theory and Applications, O.E. Barndorff-Nielsen, T. Mikosch, and S.I. Resnick, eds., Birkhäuser, 2001.
[36] J.R. Gallardo, D. Makrakis, and L. Orozco-Barbosa, “Use of α-Stable Self-Similar Stochastic Processes for Modelling Traffic in Broadband Networks,” Performance Evaluation, vol. 40, pp. 71-98, 2000.
[37] A. Karasaridis and D. Hatzinakos, “Network Heavy Traffic Modeling Using α-Stable Self-Similar Processes,” IEEE Trans. Comm., vol. 49, no. 7, pp. 1203-1214, July 2001.
[38] T. Mikosch, S. Resnick, H. Rootzén, and A. Stegeman, “Is Network Traffic Approximated by Stable Lévy Motion or Fractional Brownian Motion?” The Annals of Applied Probability, vol. 12, no. 1, pp. 23-68, 2002.
[39] S.M. Kay, Fundamentals of Statistical Signal Processing, Volume 2: Detection Theory. Prentice Hall, 1998.
[40] Iperf, http://iperf.sourceforge.net/, 2011.
[41] “Apache JMeter,” The Apache Jakarta Project, Apache Software Foundation, http://jakarta.apache.org/jmeter/, 2011.
[42] Z. Liu, N. Niclausse, and C. Jalpa-Villanueva, “Traffic Model and Performance Evaluation of Web Servers,” Performance Evaluation, vol. 46, nos. 2-3, pp. 77-100, 2001.
[43] M.A. Stephens, “EDF Statistics for Goodness of Fit and Some Comparisons,” J. Am. Statistical Assoc., vol. 69, no. 347, pp. 730-737, 1974.
[44] M.S. Weiss, “Modification of the Kolmogorov-Smirnov Statistic for Use with Correlated Data,” J. Am. Statistical Assoc., vol. 73, no. 364, pp. 872-875, 1978.
[45] R.S. Deo, “On Estimation and Testing Goodness of Fit for m-Dependent Stable Sequences,” J. Econometrics, vol. 99, pp. 349-372, 2000.
[46] L.J. Glesser and D.S. Moore, “The Effect of Dependence on Chi-Squared and Empiric Distribution Tests of Fit,” The Annals of Statistics, vol. 11, no. 4, pp. 1100-1108, 1983.
[47] A.K. Jain, R.P.W. Duin, and J. Mao, “Statistical Pattern Recognition: A Review,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, Jan. 2000.
[48] S.J. Press and S. Wilson, “Choosing between Logistic Regression and Discriminant Analysis,” J. Am. Statistical Assoc., vol. 73, no. 364, pp. 699-705, 1978.
[49] “MATLAB: The Language of Technical Computing,” The MathWorks, Inc., http://www.mathworks.com/products/matlab/, 2011.
[50] B. Rosner, Fundamentals of Biostatistics. Duxbury Thomson Learning, 2000.
[51] A. Stavrou, G.F. Cretu-Ciocarlie, M.E. Locasto, and S.J. Stolfo, “Keep Your Friends Close: The Necessity for Updating an Anomaly Sensor with Legitimate Environment Changes,” Proc. ACM/CSS Workshop Security and Artificial Intelligence (AISec), 2009.
[52] G.F. Cretu-Ciocarlie, A. Stavrou, M.E. Locasto, and S.J. Stolfo, “Adaptive Anomaly Detection via Self-Calibration and Dynamic Updating,” Proc. 12th Int’l Symp. Recent Advances in Intrusion Detection (RAID), Sept. 2009.
[53] G. Maciá-Fernández, J. Díaz-Verdejo, and P. García-Teodoro, “Evaluation of a Low-Rate DoS Attack against Application Servers,” Computers and Security, vol. 27, pp. 335-354, 2008.

Federico Simmross-Wattenberg received the BS and MS degrees in computer science and the PhD degree from the University of Valladolid in 1999, 2001, and 2009, respectively. He is currently a lecturer of Telematics Engineering at the University of Valladolid. In 2001, he joined the Laboratory of Image Processing (LPI) as a researcher, and later the Intelligent & Cooperative Systems Research Group (GSIC), where he has since contributed to various research projects. He has also worked for several years as a network administrator at the University of Valladolid. His current research interests include network traffic analysis, anomaly detection, and statistical models for signal processing.
Juan Ignacio Asensio-Pérez received the MSc and PhD degrees in telecommunications engineering from the University of Valladolid, Spain, in 1995 and 2000, respectively. He is currently an associate professor in the Department of Signal Theory, Communications and Telematics Engineering, University of Valladolid. His research interests include teletraffic engineering and technology-enhanced learning.
Pablo Casaseca-de-la-Higuera received the Ingeniero de Telecomunicación and PhD degrees from the University of Valladolid, Spain, in 2000 and 2008, respectively. He is currently an assistant professor at the ETSI Telecomunicación of the University of Valladolid, where he performs his research within the Laboratory of Image Processing (LPI). From December 2000 to November 2003, he worked as a design engineer for Alcatel Espacio S.A., where he contributed to several space programs, including the European satellite navigation project, Galileo. His activities there were all related to digital signal processing and Radio Frequency (RF) design for the Telemetry, Tracking and Command (TTC) subsystem. After this period, he joined the LPI with a research fellowship, which finished in October 2005 when his academic activities began. His research interests are statistical modeling and nonlinear methods for biomedical signal and image processing, and network traffic analysis.

Marcos Martín-Fernández received the Ingeniero de Telecomunicación and PhD degrees from the University of Valladolid, Spain, in 1995 and 2002, respectively. He is an associate professor at the ETSI Telecomunicación, University of Valladolid. From March 2004 to March 2005, he was a visiting assistant professor of Radiology at the Laboratory of Mathematics in Imaging (Surgical Planning Laboratory, Harvard Medical School, Boston, Massachusetts). His research interests are statistical and mathematical methods for image and signal processing. He is with the Laboratory of Image Processing (LPI) at the University of Valladolid, where he is currently performing his research. He was granted a Fulbright fellowship during his visit at Harvard. He is a reviewer for several international journals and a member of several international scientific committees. He has contributed more than 100 scientific publications.

Ioannis A. Dimitriadis received the BS degree in telecommunications engineering from the National Technical University of Athens, Greece, in 1981, the MS degree in telecommunications engineering from the University of Virginia, Charlottesville, in 1983, and the PhD degree in telecommunications engineering from the University of Valladolid, Spain, in 1992. He is currently a full professor of telematics engineering at the University of Valladolid. His research interests include technological support to learning and work processes, computer networks, as well as machine learning. He is a senior member of the IEEE, and a member of the IEEE Computer Society and the Association for Computing Machinery.

Carlos Alberola-López received the Ingeniero de Telecomunicación and PhD degrees from the Polytechnic University of Madrid, Spain, in 1992 and 1996, respectively. He is a professor at the ETSI Telecomunicación of the University of Valladolid, Spain. In 1997, he was a visiting scientist at the Thayer School of Engineering, Dartmouth College, New Hampshire. His research interests are statistical and fuzzy methods for signal and image processing applications. He is head of the Laboratory of Image Processing (LPI) at the University of Valladolid. He is a reviewer for several scientific journals and a consultant for the Spanish Government in the evaluation of research proposals.