Adaptive Thresholds: Monitoring Streams of Network Counts

Diane Lambert and Chuanhai Liu

This article describes a fast, statistically principled method for monitoring streams of network counts, which have long-term trends, rough cyclical patterns, outliers, and missing data. The key step is to build a reference (predictive) model for the counts that captures their complex, salient features but has just a few parameters that can be kept up-to-date as the counts flow by, without requiring access to past data. This article justifies using a negative binomial reference distribution with parameters that capture trends and patterns and method-of-moments estimators that can be computed quickly enough to keep up with the data flow. The reference distribution may be of interest in its own right for traffic engineering and load balancing, but a more challenging task is to use it to identify degraded network performance as quickly as possible. Here we detect changes in network performance not by monitoring quantiles of the predictive distribution directly but by applying control chart methodology to normal scores of the p values of the counts. Using p values adjusts for the lack of stationarity from one count to the next. Compared with thresholding isolated counts, control charting reduces the false-alarm rate, increases the chance of detecting ongoing low-level events, and reduces the time to detection of long events. This adaptive count thresholding procedure is shown to perform well on both real and simulated data.

KEY WORDS: Control charts; Data stream; Event detection; Sequential updates; Tracking.
1. BACKGROUND

Communications networks may have thousands of network elements, such as routers and switches, that report statistics about their health on a regular schedule. The number of packets handled, the number of packets dropped, and the numbers of processing errors of various types may be reported every minute or second for every network element, for example. Often, these counts become extreme when part of the network is in distress. Error counts may become too high, or the number of packets received may become too low. But what is the threshold for too high or too low? How can overall network health be monitored when thresholds vary across network elements? The first problem is often called monitoring, and the second, which is considered much harder, is a kind of data fusion. Usually, network engineers choose thresholds by trial and error, hoping to balance the rate of missed events against the rate of false alarms. This is a delicate and daunting task, because error counts and traffic volumes have rough time-of-day and day-of-week patterns, long-term trends, long tails, and regional differences. Ignoring these complexities reduces the number of thresholds that need to be set manually, but at the expense of poor performance. Yet treating each time period and region separately for each type of count leads to far too many thresholds to set by hand. As a result, thresholds that are set manually are often inappropriate, producing so many false alarms that only a small fraction can be investigated, which seriously dilutes the value of monitoring.

Figure 1, for example, shows two weeks of 5-minute counts for one type of error for one network element. Counts of 0 are numerous, but so are counts above 40, and variability increases with the median. Two smooths, computed using loess on the square root of the counts to tame variability and then squaring the fitted values to return to the original scale for plotting, are shown. The flatter, bolder curve, which corresponds to a large span (f = 2/3), shows that errors increased by about 10% over the 2 weeks. The more detailed, thinner curve, which corresponds to a short span (f = .025), shows that there is a strong daily pattern, with a peak near 7 PM that rises more steeply than it falls and a smaller peak later in the evening. About 4 weeks after the period shown here, there was a gap of 14 hours during which no data were collected.

These kinds of features are common in network traffic. Papagiannaki, Taft, Zhang, and Diot (2003), for example, showed 90-minute network counts exhibiting strong time-of-day effects, different behaviors on business days and weekends, and a long-term trend. Clearly, there is no one threshold that can be set in advance to identify anomalous counts, regardless of time of day or even date, when there are strong time-of-day and day-of-week effects and trends over time. Nonetheless, the goal is to detect events with each incoming count, without looking at the past raw counts (which may be too massive to retain or access quickly) or being distracted by strong cyclical patterns, trends, high background variability, or stretches of missing data. Accomplishing that requires adaptive thresholds that learn patterns in the data automatically and keep themselves up-to-date as data are collected.

There are a few statistically inspired procedures for setting thresholds on network counts in the presence of rough cyclical patterns. One approach is to batch the data into intervals that are sufficiently short that the counts in an interval can be considered identically distributed. Thottan and Ji (1998), for example, fit an autoregressive process with constant mean and normal errors to the counts in an interval, and at the end of the interval computed a likelihood ratio to compare the autoregressive model fit to the current interval with the autoregressive model fit to the previous interval. If the ratio is large, then the current interval is flagged for a possible network event. The problem is that quick event detection requires short intervals, because detection happens only at the end of an interval, but reliable likelihood ratios require long intervals. Thottan and Ji suggested testing batches of 10 counts for quick detection, but likelihood ratios based on so few observations may not be trustworthy.

Diane Lambert is currently a Research Scientist, Google Inc. (E-mail: [email protected]). Chuanhai Liu is currently Professor of Statistics, Purdue University (E-mail: [email protected]). The authors thank Mike Cleemput for first bringing this problem to their attention and the referees for their careful reading of the manuscript. This article was written while both authors were in the Statistics Research Department of Bell Laboratories, Lucent Technologies.

© 2006 American Statistical Association. Journal of the American Statistical Association, March 2006, Vol. 101, No. 473, Applications and Case Studies. DOI 10.1198/016214505000000943
Figure 1. Two Weeks of 5-Minute Error Counts With an Increasing Median and Strong Daily Pattern.
Feather (1992) and Feather, Siewiorek, and Maxion (1993), in contrast, assumed that daily patterns repeat, so that homogeneity holds not just within short intervals, but also across days. Specifically, counts that fall in the same time period, such as [9:00 AM, 9:15 AM), are assumed to share nearly the same normal distribution even if they occur on different days. An estimated mean and variance are stored for each interval and updated by exponentially weighted moving averaging with each newly reported count that falls in the period. Quantiles of the estimated normal distribution are then used as thresholds. This approach avoids making assumptions about the nature of the cyclical patterns, and it provides a way to detect events at each count rather than only after a batch of counts. On the other hand, its thresholds are discontinuous, and updating is slow because each interval of time is treated in isolation. A count at 8:59 AM does not affect the estimated threshold for [9:00 AM, 9:15 AM), for example.

From a statistician's perspective, good thresholds cannot be designed without a good model of the distribution of counts when the network is in control. Moreover, the tails of the model must be accurate, because it is only counts that are rare under the in-control distribution that suggest an event is in progress. Through detailed empirical analysis, this article justifies using negative binomial reference distributions that are parameterized by their means and variances to model the in-control behavior. The moments of the reference distributions are assumed to vary sufficiently smoothly to be interpolated from values on a coarse time grid. The grid values capture the cyclical patterns in the data. For example, the grid might have 24 hourly values if there are no day-of-week effects or 24 × 7 = 168 hourly values if the day of week matters. These grid values are updated by exponentially weighted moving averaging.
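As a concrete illustration of exponentially weighted moving-average updating of grid values, here is a minimal sketch. It is not the authors' implementation: the weight w = 0.1, the initial grid values, and the function names are our illustrative choices, and the variance update mirrors the sequential method-of-moments form the paper develops in Section 3.2.

```python
# Sketch: EWMA updating of hourly grid values for one network element.
# Assumptions (ours, not the paper's): 24 hourly cells, weight w = 0.1,
# grids initialized at 10.0.

def ewma_update(old, new, w=0.1):
    """Exponentially weighted moving average with weight w on the new value."""
    return (1.0 - w) * old + w * new

# One mean and one variance value per hour of the day.
grid_means = [10.0] * 24
grid_vars = [10.0] * 24

def absorb_count(hour, count, w=0.1):
    """Fold an observed count into the grid cell for its hour."""
    m_old = grid_means[hour]
    grid_means[hour] = ewma_update(m_old, count, w)
    # The variance update uses deviations from both the old and new means,
    # mirroring the sequential method-of-moments recursion of Section 3.2.
    grid_vars[hour] = (1.0 - w) * grid_vars[hour] + \
        w * (count - m_old) * (count - grid_means[hour])

absorb_count(9, 14)   # a count of 14 observed during hour 9
```

Because each hour's cell is touched only by counts from that hour, cells far from the current time are left untouched, which is why the paper adds interpolation between grid values on top of this kind of update.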
The combination of interpolation from grid values and exponentially weighted moving averaging smooths continuously over time and tracks both cyclical patterns and long-term trends. We do not advocate using quantiles of the reference distribution to threshold individual counts, however. Even rare counts occur frequently if the data stream is sufficiently fast, so thresholds on counts would need to be set at extreme quantiles to avoid excessive false alarms. But then extreme quantiles would miss low-level but persistent degradation. Thus, instead of formulating event detection as outlier detection, we formulate it as
statistical process control and apply control chart techniques to a severity metric that retains information from previous counts. In process control terms, each count is first approximately standardized, and then an exponentially weighted moving average (EWMA) of the standardized counts is thresholded. More precisely, each incoming count is standardized by computing its tail probability under its reference distribution and then taking the normal score of that tail probability. If the reference distribution were continuous and known exactly, then the tail probability p value, $p_t$, would have a uniform(0, 1) distribution, and its normal score, $Z_t = \Phi^{-1}(p_t)$, would be normal(0, 1) under any reference distribution, regardless of the time of day or day of week. The tail p value for a count is only approximately uniform under the reference distribution because of discreteness, but its normal score still lends itself to standard control chart methodology. Thus we define a severity metric $S_t$ as an EWMA of the standardized counts $Z_t$ and threshold $S_t$ against a constant limit to detect periods of network degradation. This gives a Q-chart in the terminology of statistical quality control.

Note that event magnitude and duration are both reflected in the severity metric $S_t$. One very extreme count (a high-magnitude but short-duration event) can push $S_t$ beyond a threshold, or a sequence of many less extreme but still unusual counts (a small-magnitude but long-duration event) can push $S_t$ beyond the threshold. Moreover, the normal scores standardization provides a way to combine or "fuse" information across time and network elements.

In summary, adaptive count thresholding consists of four basic steps that are applied whenever there is an incoming count $x_t$ at time t:

1. Interpolate the stored grid values to obtain the estimated parameters for the reference negative binomial distribution $F_t$ in effect at time t.
2. Score $x_t$ by computing its p value, $p_t$, under its reference distribution $F_t$ and its normal score $Z_t = \Phi^{-1}(p_t)$.
3. Threshold the updated severity metric, $S_t = (1 - w)S_{t-1} + wZ_t$, against a constant threshold.
4. Update the stored grid values with the count $x_t$, or with a random draw from $F_t$ if $x_t$ is missing, or with a random draw from the tail of $F_t$ if $x_t$ is an outlier.

Each of these steps is quick to compute, so keeping up with the data flow has not been a problem in our applications. Finally, the raw p values of the counts provide a natural way to monitor the performance of the thresholding system itself. These p values should be roughly uniform as long as the negative binomial reference distributions are valid; if they are not roughly uniform, then the monitoring process may need to be reinitialized, or else a different reference distribution may be required. Such online validation is especially important for automated systems.

The rationale and details of adaptive count thresholding are provided in this article. Section 2 presents some example data and empirically justifies a negative binomial model for the counts. Dynamic model estimation and model validation are described in Section 3. The thresholding algorithm is presented in Section 4. Results for real data and simulated datasets are given in Sections 5 and 6. Section 7 presents conclusions and further thoughts.
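The scoring steps (2 and 3 of the summary above) can be sketched in code. This is a minimal illustration, not the authors' implementation: the negative binomial parameters, EWMA weight, clipping constant, and example counts below are hypothetical, and the upper-tail p value convention $p_t = 1 - F_t(x_t - 1)$ follows Section 4.

```python
import math
from statistics import NormalDist

def nb_pmf(k, n, p):
    """Negative binomial pmf f(k) = Gamma(n+k) / (Gamma(n) k!) p^n (1-p)^k."""
    return math.exp(math.lgamma(n + k) - math.lgamma(n) - math.lgamma(k + 1)
                    + n * math.log(p) + k * math.log(1.0 - p))

def nb_upper_p(x, n, p):
    """Upper-tail p value p_t = 1 - F(x - 1) = P(X >= x), by direct summation."""
    return 1.0 - sum(nb_pmf(k, n, p) for k in range(x))

def score(x, mu, var, s_prev, w=0.2):
    """One monitoring step: count -> p value -> normal score -> severity EWMA."""
    p_param = mu / var                  # method-of-moments NB parameters
    n_param = mu * mu / (var - mu)
    pt = nb_upper_p(x, n_param, p_param)
    # Clip away from 0 and 1 so the normal score stays finite.
    zt = NormalDist().inv_cdf(min(max(pt, 1e-12), 1.0 - 1e-12))
    return (1.0 - w) * s_prev + w * zt

s = 0.0
for count in [12, 11, 30, 33, 35]:      # last three counts unusually large
    s = score(count, mu=10.0, var=25.0, s_prev=s)
# A large count has a small upper-tail p value, hence a negative normal
# score, so sustained large counts drive the severity metric s downward;
# an alarm rule would compare s with a fixed limit.
```

In the full procedure the mean and variance fed to `score` would come from the interpolated grid values of Section 3, not from fixed constants as here.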
2. BUILDING A REFERENCE MODEL FOR COUNTS

Through detailed empirical analysis of 8 weeks of minute counts, this section supports the use of negative binomial reference models. The results for another seven sets of 8 weeks of counts are similar.

2.1 An Estimate of the Mean for Model Building

Our first step is to validate that the counts $\{x_1, \ldots, x_n\}$ behave like independent random variables conditional on their means $\{s_1, \ldots, s_n\}$. (Note that the counts are only conditionally independent.) Here we assume that the means follow a smooth curve and check whether

$$\Pr(x_1, \ldots, x_n \mid s_1, \ldots, s_n) = \prod_{t=1}^{n} \Pr(x_t \mid s_1, \ldots, s_n). \qquad (1)$$
Because this model checking stage is off-line and not a part of the online procedure, there are many ways to estimate the curve $\{s_t\}$ nonparametrically. Here we choose to estimate it by repeated Hanning (Tukey 1977); that is, by repeated weighted averaging with weights .25, .5, and .25 on the values at t − 1, t, and t + 1. The more iterations of Hanning, the smoother the curve. Independence is checked at each iteration by comparing estimates of the first several lagged autocorrelations of the counts to 0. More precisely, starting from $s_t^{(0)} = x_t$ for t = 1, ..., n, the ith iterated estimate $s_t^{(i)}$ of the mean curve at t and the first five estimated autocorrelation coefficients $r_k^{(i)}$, k = 1, ..., 5, are computed as follows:

1. Impute. Impute missing values using linear interpolation and impute endpoints by setting $s_0^{(i-1)} = s_1^{(i-1)}$ and $s_{n+1}^{(i-1)} = s_n^{(i-1)}$.
2. Smooth. For t = 1, ..., n, compute
$$s_t^{(i)} = \begin{cases} \frac{1}{4}s_{t-1}^{(i-1)} + \frac{1}{2}s_t^{(i-1)} + \frac{1}{4}s_{t+1}^{(i-1)} & \text{if } x_t \text{ is observed} \\ \text{missing} & \text{if } x_t \text{ is missing.} \end{cases}$$
3. Evaluate. Compute the first k weighted sample autocorrelation coefficients,
$$r_k^{(i)} = \frac{\sum_t w_t^{(i)} (x_{t+k} - s_{t+k}^{(i)})(x_t - s_t^{(i)})}{\sum_t w_t^{(i)}},$$
where $w_t^{(i)} = 0$ if $x_t$ or $x_{t+k}$ is missing and $w_t^{(i)} = [(s_t^{(i)} + \delta)(s_{t+k}^{(i)} + \delta)]^{-1/2}$ otherwise, and δ is a small positive constant. (We used δ = .0001 and k = 1, ..., 5.) Using $s_t^{(i)}$ in the weight $w_t^{(i)}$ is appropriate for Poisson tails and conservative in the sense of giving larger absolute autocorrelations for longer tails.

Other smoothing methods, such as loess or those discussed by Fan and Gijbels (1996), could be used instead of Hanning; the important point is that the amount of smoothing is chosen to control the dependence in the counts given the mean curve. Ellner and Seifu (2002), for example, examined autocorrelations to choose a neighborhood width for local quadratic regression.

Figure 2 plots the first five autocorrelation coefficients against the Hanning iteration number for 8 weeks of 1-minute counts. (Two weeks of the same counts aggregated to 5-minute intervals are shown in Fig. 1.) The first five autocorrelations are
Figure 2. The First Five Estimated Autocorrelation Coefficients for 8 Weeks of Minute Counts Plotted Against Hanning Iteration Number. Lag corresponds to dash length, so lag 1 is dotted and lag 5 has the longest dashes. The autocorrelation coefficients are negligible by iteration 360.
negligible after 360 iterations, so the final estimate $\hat{s}_t$ is computed with 360 iterations of Hanning. Note that $\hat{s}_t$ is a weighted average of the counts, with symmetric weights that fall exponentially to 0. Here 99% of the weight is concentrated within 36 minutes of t, 75% is concentrated within 15 minutes of t, and 32% is concentrated within 5 minutes of t.

Loosely speaking, the fact that the autocorrelations, which are adjusted for the means of the counts, are negligible suggests that any apparent correlation in the raw counts can be attributed to lack of knowledge of the mean. For example, if the mean is unknown but changing smoothly over time, then the count for one period helps to predict the count for the next period, so in this sense adjacent counts are not independent. But if the mean is known, then the count for this period provides no information about the count for the next period.

2.2 Distribution of the Counts Conditional on the Mean

We next strengthen assumption (1) from independence conditional on the sequence of means to

$$\Pr(x_1, \ldots, x_n \mid s_1, \ldots, s_n) = \prod_{t=1}^{n} \Pr(x_t \mid s_t),$$
and validate the choice of probability model for $x_t \mid s_t$. The simplest (and thus most convenient) choice of $\Pr(x_t \mid s_t)$ is the Poisson. If the counts had the same means, then a plot of their empirical cumulative distribution function (cdf) against the Poisson cdf with mean equal to the average count would be linear, except for discrepancies caused by discreteness. When counts do not share the same mean, the sample is overdispersed relative to a Poisson, and the plot tends to be S-shaped instead of linear even if the counts are Poisson. To control for the differences in means in our data, we partition the counts into subsets according to their smooth mean estimates and then plot the empirical cdf against the Poisson cdf separately in each subset. More precisely, we partition the 80,640 Hanning minute mean estimates into the 16 intervals shown in the strip labels of Figure 3, compute the sample mean $\bar{x}$ of the associated counts
Figure 3. The Empirical cdf of Counts With Hanning Mean Estimate in the Interval Given in the Panel Label Against the Poisson cdf With Mean Equal to the Average Count in the Panel. The number of counts is given in each panel. Both axes are on the arcsin square root scale. Points in all panels lie close to the diagonal, supporting the Poisson assumption.
for each of the 16 intervals, and then plot the empirical cdf of the counts against the Poisson($\bar{x}$) cdf to obtain the panels in Figure 3. The horizontal and vertical axes in Figure 3 are on an arcsin square root scale, which not only stabilizes the variance of the empirical probabilities, but also stretches out the probabilities near 0 and 1, thus emphasizing discrepancies from the tails of the Poisson. This emphasis is consistent with our goal of detecting shifts toward the tails online. The linearity in each panel of Figure 3 suggests that the counts have Poisson distributions conditional on their means. A goodness-of-fit test for the Poisson was not used because the sample sizes are sufficiently large to detect unimportant departures from the Poisson.

2.3 Reference Distribution for the Counts

Section 2.2 justifies modeling consecutive counts at n uniformly spaced times as

$$x_1, \ldots, x_n \mid s_1, \ldots, s_n \sim \prod_{t=1}^{n} \text{Poisson}(s_t).$$
The Poisson cannot be used as a reference distribution for thresholding, however, because the means $\{s_t\}$ are unknown. Replacing them with estimates and ignoring the uncertainty in the estimates would give unrealistically short tails and hence too many false alarms. Here we incorporate uncertainty about
the means by treating them as random variables. For simplicity, we would prefer to use a conjugate gamma distribution with parameters that incorporate long-term trends and cyclical patterns. To check whether a gamma is plausible, we assume that the Poisson minute means during an hour are similar, which implies that the 60 Hanning minute mean estimates (Sec. 2.1) within each hour should behave approximately like a random sample of size 60 from a gamma distribution. We then simulate a random sample of size 60 from a gamma distribution whose mean and variance equal the sample mean and sample variance computed from the counts for that hour. The empirical distributions of the Hanning mean estimates should be close to the empirical cdf of the simulated gamma, if the gamma assumption is reasonable and means are roughly constant within an hour. Quantiles of the cube roots of the minute Hanning mean estimates are plotted against quantiles of the cube roots of the simulated gamma means for the 8-week period in Figure 4. (A cube-root transformation stabilizes the variance of a gamma distribution.) As Figure 4 shows, the smooth mean estimates act like gamma random variables, except beyond the .05 and .995 quantiles. The poor fit in the right tail is not surprising, because smoothing flattens peaks, leading to too few large estimated means and to a shorter-than-gamma tail. The poor fit in the left
tail shows that there are too many small fitted means relative to the gamma distribution with parameters equal to the hourly sample mean and variance. This too is not surprising. The simulation assumes that the mean count is constant throughout the hour, but if it is not, then there will be too few small simulated means, as happens here.

Figure 4. A QQ Plot of Cube Root Estimated Means Against Cube Root Simulated Gammas. The dotted lines show the .01, .1, .9, and .99 quantiles of the estimated means.

In any case, given the excellent fit for about the middle 95% of means, we assume that the unobserved Poisson rates $\{\lambda_t : t = 1, \ldots, n\}$ satisfy

$$\{\lambda_t \mid (\alpha_t, \beta_t), t = 1, \ldots, n\} \sim \prod_{t=1}^{n} \text{gamma}(\alpha_t, \beta_t),$$

where the parameters $\{(\alpha_t, \beta_t) : t = 1, \ldots, n\}$ evolve smoothly and gamma($\alpha_t, \beta_t$) has density

$$f(\lambda) = \frac{\beta_t^{\alpha_t} \lambda^{\alpha_t - 1} \exp(-\beta_t \lambda)}{\Gamma(\alpha_t)}, \qquad \lambda > 0.$$

Together, a Poisson distribution for the count $x_t$ and a gamma($\alpha_t, \beta_t$) distribution for the mean of the count imply that the marginal distribution of $x_t$ is negative binomial, NB($n_t, p_t$), with density

$$f(k) = \frac{\Gamma(n_t + k)}{\Gamma(n_t)\, k!}\, p_t^{n_t} (1 - p_t)^k \qquad \text{for } k = 0, 1, \ldots,$$

where

$$p_t = \frac{\beta_t}{\beta_t + 1} \qquad \text{and} \qquad n_t = \alpha_t.$$

A full Bayesian treatment of the problem would treat the gamma($\alpha_t, \beta_t$) distribution on the Poisson mean as a prior and put a hyperprior on $(\alpha_t, \beta_t)$ to accommodate uncertainty in these parameters. But if there is little uncertainty in the gamma parameters relative to the variability in the Poisson means, then a naive empirical Bayes approach (Carlin and Louis 2000) that assigns probability 1 to an estimate of $(\alpha_t, \beta_t)$ can work well. Here we use method-of-moments updating to obtain the estimate of $(\alpha_t, \beta_t)$, because it is straightforward for negative binomial parameters and maximum likelihood updating is not. Note that the negative binomial distribution for $x_t$ has mean $\mu_t$ and variance $\sigma_t^2$, defined by

$$\mu_t = E(x_t) = \frac{n_t(1 - p_t)}{p_t} \qquad \text{and} \qquad \sigma_t^2 = \text{var}(x_t) = \frac{n_t(1 - p_t)}{p_t^2},$$

and the standard negative binomial parameters can be computed as

$$p_t = \frac{\mu_t}{\sigma_t^2} \qquad \text{and} \qquad n_t = \frac{\mu_t^2}{\sigma_t^2 - \mu_t}.$$
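To make the method-of-moments conversion concrete, here is a small self-contained sketch; the numerical mean and variance are made-up illustrative values, not estimates from the paper's data.

```python
def nb_params_from_moments(mu, var):
    """Method-of-moments map from (mean, variance) to the standard
    negative binomial parameters: n = mu^2 / (var - mu), p = mu / var.
    Requires var > mu, i.e., overdispersion relative to the Poisson."""
    if var <= mu:
        raise ValueError("negative binomial requires variance > mean")
    return mu * mu / (var - mu), mu / var   # (n, p)

# Round-trip check with illustrative moments (mean 10, variance 25):
n, p = nb_params_from_moments(10.0, 25.0)
mean_back = n * (1 - p) / p        # recovers mu
var_back = n * (1 - p) / (p * p)   # recovers sigma^2
```

The guard `var <= mu` reflects the model: as the gamma prior on the Poisson rate degenerates, the variance approaches the mean and the negative binomial collapses to the Poisson, so moment matching is only defined for overdispersed counts.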
The remaining step in defining a reference distribution is to replace the parameters $\mu_t$ and $\sigma_t^2$ with estimates; this is done in Section 3.2. Although we are not aware of other articles that use negative binomial reference distributions for network counts, independent distributions with varying means and variances have previously been used to model network counts. For example, Cao, Davis, Vander Wiel, and Yu (2000) used normal distributions with varying means and standard deviations proportional to the means to estimate mean router link counts. For very large mean counts, it is possible that a normal distribution would be adequate in the tails and thus suitable for adaptive thresholding. On the other hand, raw, unaggregated packet, session, and connection arrival times in computer networks often exhibit long-range dependence (e.g., Paxson and Floyd 1995), although the dependence and departure from the Poisson dissipate as the traffic load increases (Cao, Cleveland, Lin, and Sun 2002). When arrival times exhibit long-range dependence, counts over very fine time intervals, say on the order of a millisecond or even less, are likely to be long-range dependent as well. In that case, using negative binomial models for the counts, which assume that the counts are independent conditional on their means, may not be adequate.

3. ONLINE PARAMETER ESTIMATION

The mean and variance parameters in effect at the time of an incoming count are obtained by interpolating stored values on a coarse grid, so it is the grid values that must represent roughly repeating patterns and trends over time. Smoothing the grid values over time accommodates slow changes in the cyclical patterns and trends. To simplify the terminology in this section, assume that a count $x_t$ is reported every minute, hourly parameters are stored, and each day of the week has a similar time-of-day pattern, so a cycle consists of 24 hours.
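Under the simplifying setup just described (minute counts, hourly grid values, a 24-hour cycle), the bookkeeping that maps a minute index to its cycle, hour, and minute might look like the following sketch; the function name is ours, and the 1-based ranges for h and m follow the notation used in Section 3.1.

```python
MINUTES_PER_HOUR = 60   # M in the paper's notation
HOURS_PER_CYCLE = 24    # H: one set of hourly grid values per 24-hour cycle

def grid_coordinates(t):
    """Map a 0-based minute index t to (cycle c, hour h, minute m),
    with 1 <= h <= H and 1 <= m <= M."""
    minutes_per_cycle = MINUTES_PER_HOUR * HOURS_PER_CYCLE
    c = t // minutes_per_cycle          # completed cycles (days)
    within = t % minutes_per_cycle      # minute offset within the cycle
    h = within // MINUTES_PER_HOUR + 1
    m = within % MINUTES_PER_HOUR + 1
    return c, h, m

# The first minute of the first day maps to (0, 1, 1); minute 1500 falls
# in the second day, during its second hour.
```

With a weekly cycle or separate weekday/weekend patterns, only `HOURS_PER_CYCLE` and the cycle arithmetic change, which is the sense in which the paper calls these extensions straightforward but notationally messy.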
If each day of the week had a different set of hourly parameters, then the cycle would be 1 week, or 168 hours. If the weekdays from Monday through Friday have similar time-of-day patterns and Saturday and Sunday each have their own patterns, then there are three sets of 24 hourly grid parameters, with a total of 72 grid parameters in all. The changes needed for these more complex patterns are straightforward, although the notation can be messy.

3.1 Interpolation

An incoming count $x_t$ arrives at time t, which corresponds to cycle c, hour h (1 ≤ h ≤ H), and minute m (1 ≤ m ≤ M). The
estimated mean, $\hat{\mu}_{h,m}$, and variance, $\hat{\sigma}^2_{h,m}$, of the reference distribution in effect at time t are obtained by quadratic interpolation from stored mean and variance grid values, $\{(U_h, V_h) : h = 1, \ldots, H\}$. By design, the interpolation is unbiased in the sense that the arithmetic average of the M interpolated reference means for an hour equals the stored grid value $U_h$ for the hour, and the average of the M interpolated reference variances equals the grid value $V_h$. More precisely, take three consecutive hours, (−1, 0], (0, 1], (1, 2], and define the quadratic interpolation coefficients (A, B, C) by

$$\int_{-1}^{0} (At^2 + Bt + C)\, dt = \frac{A}{3} - \frac{B}{2} + C = MU_{-1},$$

$$\int_{0}^{1} (At^2 + Bt + C)\, dt = \frac{A}{3} + \frac{B}{2} + C = MU_0,$$

and

$$\int_{1}^{2} (At^2 + Bt + C)\, dt = \frac{7A}{3} + \frac{3B}{2} + C = MU_1.$$

Solving for (A, B, C) gives

$$A = \frac{M(U_{-1} - 2U_0 + U_1)}{2}, \qquad B = M(U_0 - U_{-1}),$$

and

$$C = \frac{M(2U_{-1} + 5U_0 - U_1)}{6}.$$

Letting q = (m − 1)/M and r = m/M, and in the same spirit defining the mean for a period to be the integral over the period, the interpolated mean at time t corresponding to minute m of hour h is

$$\hat{\mu}_{h,m} = \int_{q}^{r} (At^2 + Bt + C)\, dt = \frac{A}{3M}(r^2 + rq + q^2) + \frac{B}{2M}(r + q) + \frac{C}{M};$$

the right side of this equation is used to compute $\hat{\mu}_{h,m}$ from the stored grid means. Therefore, by design, the average of the interpolated minute means in the hour equals the stored hour mean. Substituting stored variance grid values $V_{-1}$, $V_0$, and $V_1$ for $U_{h-1}$, $U_h$, and $U_{h+1}$ gives an estimated variance $\hat{\sigma}^2_{h,m}$ for the incoming count. The variance interpolation is also unbiased.

Note that the interpolation coefficients (A, B, and C) are computed only once for each hour. Moreover, the interpolation smooths both within and across hours, because the coefficients depend on the stored estimates for this hour and the two adjacent hours. Of course, despite the unbiasedness, the interpolation may not be adequate if the mean and variance are not at least approximately quadratic between grid points.

3.2 Updating the Stored Grid Values

Updating grid values is straightforward, but outliers must be handled with care. Using an outlier may inflate the mean and variance, giving overly long tails, leading to inflated p values and too many missed events. On the other hand, ignoring all outliers underestimates the stored variance, giving reference
distributions with too short tails and too many false alarms. Thus outliers cannot be used as-is or ignored during updating, but rather must be treated "robustly." Here we assume that the only useful information in an outlier is that a tail observation has occurred; its exact value is unimportant. This suggests replacing an outlier with a random draw from the tail of its reference distribution. For example, we might replace any observation that is beyond the .9999 quantile of its reference distribution $\hat{F}_t$ with a random draw beyond the .99 quantile of $\hat{F}_t$. So upper-tail outliers are replaced with large counts. Likewise, a lower-tail outlier is replaced by a random draw conditioned to be smaller than its .01 quantile. Note that outliers in both tails are replaced by random draws to protect the estimated mean and variance, even if only large counts indicate adverse events. If the count at minute t is missing, then $x_t$ is taken to be a random draw from the reference distribution. Ignoring missing counts will bias the stored parameter estimates if the reference distribution is not constant during the period between consecutive grid values.

The updated mean of the reference distribution at time t corresponding to minute m and hour h of cycle c (e.g., day c) is an EWMA of the interpolated mean estimate $\hat{\mu}_{h,m}$ and $X_t$, which is the observed count if it is not outlying or missing and a random draw otherwise; that is,

$$\hat{\mu}^{new}_{h,m} = (1 - w_c)\hat{\mu}_{h,m} + w_c X_t, \qquad \text{where } w_c = w + \frac{1 - w}{1 + c}, \qquad (2)$$

for a fixed weight w between 0 and 1. Here $w_c$ is an ad hoc cycle-adjusted weight that decreases to a constant weight w as cycles pass and allows the estimated mean parameter to be learned more quickly during system initialization. Similarly, the updated variance $\hat{\sigma}^2_{h,m}$ of the reference distribution is

$$(\hat{\sigma}^2_{h,m})^{new} = (1 - w_c)\hat{\sigma}^2_{h,m} + w_c (X_t - \hat{\mu}_{h,m})(X_t - \hat{\mu}^{new}_{h,m}).$$

To see why $\hat{\mu}_{h,m}$ and $\hat{\mu}^{new}_{h,m}$ are both used in the second term on the right side, consider updating the method-of-moments variance estimate $V_n = n^{-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$ sequentially, where $\bar{X}_n = n^{-1}\sum_{i=1}^{n} X_i$. Adding and subtracting $\bar{X}_n$ in the quadratic term for $V_{n+1}$ gives

$$(n + 1)V_{n+1} = nV_n + (X_{n+1} - \bar{X}_n)^2 + 2(\bar{X}_n - \bar{X}_{n+1})(X_{n+1} - \bar{X}_n) + (n + 1)(\bar{X}_n - \bar{X}_{n+1})^2.$$

Because $\bar{X}_n - \bar{X}_{n+1} = (n + 1)^{-1}(\bar{X}_n - X_{n+1}) = n^{-1}(\bar{X}_{n+1} - X_{n+1})$, a bit of algebra leads to

$$V_{n+1} = \frac{n}{n+1} V_n + \frac{1}{n+1} (X_{n+1} - \bar{X}_n)(X_{n+1} - \bar{X}_{n+1}),$$

which resembles the formula for $(\hat{\sigma}^2_{h,m})^{new}$ with $w_c = (n + 1)^{-1}$.

Stored grid values are not changed during the cycle, so all interpolated estimates during the hour are based on the same grid values. Temporary copies of the grid mean and variance values for the hour are initialized to 0 at the beginning of the
hour, and the updated minute mean and variance for time t are included in the grid values according to

$$U_h^{new} = \frac{MU_h + \hat{\mu}^{new}_{h,m}}{M} \qquad \text{and} \qquad V_h^{new} = \frac{MV_h + (\hat{\sigma}^2_{h,m})^{new}}{M}. \qquad (3)$$

The temporary copies are stored at the end of the cycle and used in the next cycle. The grid values incorporate smoothing over time, because the minute parameters from which they are computed smooth over time.

In summary, the reference distribution at time t, against which a count $x_t$ is compared, is negative binomial with mean $\hat{\mu}_{h,m}$ and variance $\hat{\sigma}^2_{h,m}$. Note that the variance of the reference distribution is not inflated to include the uncertainty in the mean estimate, as it would be if a nondegenerate prior were placed on $(\alpha_t, \beta_t)$, for example. Ignoring this uncertainty is appropriate in our application, because the effective sample size is in the hundreds, given that there are 60 counts per hour and grid values are exponentially smoothed from one cycle to the next.

3.3 Model Validation

Any automated system can fail, and ours can as well. The mean and variance estimates might become too inaccurate, for example. Fortunately, system performance can be monitored by tracking the p values of counts. If the reference distribution is correct, then the p values are uniformly distributed except for artifacts caused by discreteness of the counts. A continuous, uniform(0, 1) p value for a count $x_t$ with reference distribution $F_t$ and probability density function $f_t$ can be defined as

$$\begin{cases} (1 - F_t(x_t)) + U f_t(x_t) & \text{for upper-tail } p \text{ values} \\ F_t(x_t) - U f_t(x_t) & \text{for lower-tail } p \text{ values}, \end{cases}$$
where $U$ is a random draw from a uniform(0, 1) distribution. The continuity-corrected p value corresponding to $x_t$ can then be classified into a set of equal-width intervals that cover the unit interval [0, 1]. The p value table can be viewed as a histogram or subjected to a goodness-of-fit test at the end of every cycle, for example, to evaluate whether the departures from uniformity are important enough to distrust the shape of the reference distribution, to warrant reinitialization to adjust for poor parameter updating, or to redefine cycle lengths. Typically, uniformity would be checked not online but rather during system setup, and then periodically as a check on system quality. Finally, note that validation depends on the raw counts, not on a list of detected events and false alarms. The latter list is difficult to maintain in many real applications, and so is not suitable for routine model validation.

4. THRESHOLDING

4.1 Control Charts

Taking p values of counts adjusts for time-of-day effects and long-term trends that complicate monitoring. If small counts suggest network degradation, then the p value $p_t$ for a count $x_t$ with reference distribution $F_t$ is defined to be the lower-tail probability, $F_t(x_t)$. If large counts are suspicious, then $p_t = 1 - F_t(x_t - 1)$. Taking the normal score, $\Phi^{-1}(p_t)$, of the p value, where $\Phi^{-1}$ is the inverse of the standard normal cdf, makes it easier to distinguish small p values and allows application of standard control chart technology. Quesenberry (1991) published early work on monitoring an EWMA of normal score p values, which is known as Q-charting in quality control. In summary, we base event detection on thresholding
$$S_t = (1 - w)S_{t-1} + wZ_t, \qquad\text{where } Z_t = \Phi^{-1}(p_t),$$
for a weight $w$ in (0, 1]. Note that the number of extreme counts needed to trigger an alarm is not set in advance. One extreme count can push $S_t$ beyond a threshold, or many less extreme but still unusual counts may pull $S_t$ beyond a threshold. Thus both event magnitude and duration are incorporated into the measure of severity $S_t$. The performance of adaptive count thresholding depends on the choice of $w$ in $S_t$ and the threshold parameter $L$. If the process is under control and the reference distribution is appropriate, then $Z_t$ is approximately normal(0, 1), and $S_t$ is approximately normal(0, $\sigma_w^2$) with
$$\sigma_w^2 = \frac{w}{2 - w}.$$
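A minimal sketch of this chart, using only the Python standard library (the function name and the example p values are ours, not the paper's; here the p values are oriented so that suspicious counts give p near 1):

```python
import math
from statistics import NormalDist

W, L = 0.25, 3.68                    # (w, L) pair recommended in Section 6
SIGMA_W = math.sqrt(W / (2.0 - W))   # sd of S_t when the process is in control

def severities(p_values, w=W):
    """EWMA of normal scores: S_t = (1 - w) S_{t-1} + w * Phi^{-1}(p_t)."""
    inv_cdf = NormalDist().inv_cdf   # Phi^{-1}
    s = 0.0
    for p in p_values:
        s = (1.0 - w) * s + w * inv_cdf(p)
        yield s

# Fifty in-control minutes (p = .5, so Z = 0) followed by a run of
# suspicious minutes whose p values sit well into the upper tail:
ps = [0.5] * 50 + [0.999] * 10
alarms = [t for t, s in enumerate(severities(ps)) if s > L * SIGMA_W]
print(alarms)  # the alarm fires a couple of minutes into the run
```

No single p value of .999 crosses the threshold on its own; it is the accumulation over a few consecutive minutes that raises the alarm, which is exactly the duration effect described above.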
An alarm is raised whenever $S_t > L\sigma_w$, and the problem is to choose $L$ and $w$ to balance the time between false alarms, or average run length, against the ability to detect events of a specified magnitude. Insight into the choice of $w$ for normal data has been provided by Robinson and Ho (1978), Crowder (1987), Lucas and Saccucci (1990), and Vardeman and Jobe (1998), among others. A simulation study (not reported here) for negative binomial counts shows that a false-alarm rate of 1 per 10,000 counts, or about 1 per week with minute counts, can be achieved for the following $(w, L)$ pairs: (.05, 3.43), (.25, 3.68), (.50, 3.715), and (1.00, 3.719). Section 6 shows that the pair $w = .25$ and $L = 3.68$ gives the best detection results for a range of event durations and magnitudes.

4.2 The Procedure

Putting the pieces together yields the following algorithm for thresholding, assuming minute counts, hourly grid values, and a daily cycle. The steps are applied at every minute t:

1. Index. Compute the minute m, hour h, and day d for time stamp t.
2. Compute reference parameters. Compute the mean, $\hat\mu_{h,m}$, and variance, $\hat\sigma^2_{h,m}$, of the reference distribution $F_t$ in effect at time t by applying quadratic interpolation to the stored grid values for hours h − 1, h, and h + 1, using the stored interpolating coefficients (see Sec. 3.1).
3. Validate. If the count $x_t$ is not missing, then compute its continuity-corrected p value under $F_t$ and include it in the table of p values, as described in Section 3.3.
4. Threshold. If $x_t$ is not missing, then compute its normal score, $Z_t = \Phi^{-1}(p_t)$, under $F_t$, update the severity metric to $S_t = (1 - w)S_{t-1} + wZ_t$, and threshold $S_t$. When $x_t$ is missing, take $S_t = S_{t-1}$.
5. Outliers and missing data. If $x_t$ is missing, then take a random draw, $X_t$, from $F_t$. If $x_t$ is an outlier (beyond a specified quantile of its reference distribution) or if $S_t$ exceeds the threshold, then replace $x_t$ with a random draw, $X_t$, from the corresponding tail of $F_t$. Otherwise, set $X_t = x_t$.
6. Update reference distribution. Update the estimated mean and variance of the reference distribution for minute m of hour h with $X_t$, as described in Section 3.2.
7. Update grid values. Update the grid values $U_h$ and $V_h$ at each minute using (3). At the end of the cycle, replace the stored grid values with the updated grid values and compute and store the coefficients for quadratic interpolation.

The updating and thresholding steps are modified slightly during the first hour: $\hat\mu_{h,m}$ is defined as the sample mean of the counts collected before time t, and $\hat\sigma^2_{h,m}$ as their sample variance. At the end of the first hour, grid values are initialized to $U_h = T_\mu/N$ and $V_h = T_V/N$. No count is considered an outlier, and $S_t$ is not thresholded during the first hour. Interpolation coefficients are computed at the end of the last hour of the first cycle.

5. AN ILLUSTRATION

The results of applying adaptive count thresholding to 8 weeks of minute counts using an hourly grid, a 24-hour cycle, and threshold parameters $w = .25$ and $L = 3.68$ for the severity metric are described in this section. Figures 5, 7, and 8 show results for the first 2 weeks, middle 2 weeks, and final 2 weeks. (The counts for the middle two weeks, aggregated to 5-minute intervals, are shown in Fig. 1.) Counts are shown on the square root scale to tame their variability and make better use of the plotting region, but this also flattens the time patterns that would be seen on the raw count scale on which the data are analyzed. Each panel in these figures is labeled by a date and divided into an upper region and a lower region.
The upper region shows the counts and the evolution of the .0001, .5, and .9999 quantiles of the reference distribution. The .0001 quantile is often 0. The curve in the lower region shows the evolution of the severity metric St . The thresholds for setting an alarm are shown as black horizontal lines in the lower panel. When St crosses the threshold, the corresponding minute counts are plotted in a different color, and the background on the strip label for the day is darkened.
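The plotted quantiles follow directly from the tracked mean and variance. A minimal sketch, assuming the moment matching of Section 3 (the function name and example values are ours):

```python
import math

def nb_quantile(q, mu, var):
    """Smallest k with F(k) >= q, for the negative binomial with the
    given mean and variance (var > mu), moment-matched as in Section 3."""
    p = mu / var                      # method-of-moments parameters
    r = mu * mu / (var - mu)
    pmf = math.exp(r * math.log(p))   # f(0) = p^r
    cdf, k = pmf, 0
    while cdf < q:
        k += 1
        pmf *= (1.0 - p) * (r + k - 1.0) / k   # ratio f(k)/f(k-1)
        cdf += pmf
    return k

# e.g., a reference distribution with mean 20 and variance 40:
print([nb_quantile(q, 20.0, 40.0) for q in (.0001, .5, .9999)])
```

The pmf recurrence avoids recomputing gamma functions at every support point, which matters when quantiles are re-evaluated once a minute.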
5.1 Initialization

Figure 5 shows the first 10 days of monitoring. This period is not ideal for initialization, because the daily pattern is much less pronounced on the first 2 days than it is later in the period. Moreover, this initial period includes Christmas Eve (December 24) and Christmas Day, which are also atypical. Nonetheless, the daily pattern is learned quickly. For example, the peak around 7 PM is evident in the median and the .9999 quantile by the third day. For most of the period, the severity metric floats near 0, as it should when there is no event. There is only one alarm; on December 24 the counts around 2 PM are higher than expected for more than an hour, perhaps because people shifted their evening traffic to early afternoon in anticipation of evening festivities. It is encouraging that no other period was flagged during the start of monitoring, given that no prior information about the system and no baseline historical dataset are used to initialize monitoring. Also note that many events would have been declared (falsely) if individual counts that exceeded the .9999 quantile were flagged. That is, thresholding the counts themselves instead of the smoothed severity metric $S_t$ would give too many false alarms. The negative binomial reference distributions with dynamically updated parameters fit the counts well even during initialization. Figure 6, for example, shows a histogram of the continuity-corrected p values for the first 2 days of counts. There are too many p values near 1, but that tail is unimportant for thresholding. For comparison, Figure 6 also shows the continuity-corrected p values that would be computed using a Poisson reference distribution (for which the estimated variance plays no role) or a lognormal distribution. As would be expected, there are too many small p values with a Poisson reference model, which would lead to a high false-alarm rate.
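To make the comparison concrete, a continuity-corrected upper-tail p value under the moment-matched negative binomial can be sketched as follows (the function names and the mean/variance values are ours, chosen for illustration):

```python
import math
import random

def nb_pmf(k, mu, var):
    """Negative binomial pmf parameterized by mean mu and variance var > mu,
    via the method of moments: p = mu/var, r = mu^2/(var - mu)."""
    p = mu / var
    r = mu * mu / (var - mu)
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log(1.0 - p))

def upper_p_value(x, mu, var, u=None):
    """Continuity-corrected upper-tail p value: (1 - F(x)) + U * f(x)."""
    if u is None:
        u = random.random()          # the uniform draw U of Section 3.3
    cdf = sum(nb_pmf(k, mu, var) for k in range(x + 1))
    return (1.0 - cdf) + u * nb_pmf(x, mu, var)

# A count near the mean gets a moderate p value; one far above it, a tiny one.
print(upper_p_value(22, mu=20.0, var=40.0, u=0.5))
print(upper_p_value(60, mu=20.0, var=40.0, u=0.5))
```

Forcing the variance down toward the mean recovers the Poisson as a limiting case, which is one way to see why an underdispersed reference model produces too many extreme p values for overdispersed counts.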
Too many p values are also close to 1 under the Poisson, so neither tail fits the counts in this example. The lognormal is considered because it has longer tails than the Poisson and, with two parameters, is more flexible. But for these data, the fit of the lognormal is unacceptable and much worse than the fit of either the negative binomial or the Poisson.

5.2 Ongoing Monitoring

Figure 7 shows the results of monitoring the same stream of counts during weeks 4 and 5. Again, the severity metric $S_t$ floats near 0, as it should when there is no event, even though there
Figure 5. The Counts, Estimated Median, and .9999 Quantile of the Reference Distribution, and the Severity Metric St in the First 10 Days of Monitoring. A short period of elevated counts is highlighted on 12/24.
Figure 6. Histograms of Continuity Corrected p Values During the First 2 Days of Counts for Three Different Families of Reference Distributions: (a) Negative Binomial, (b) Poisson, and (c) Lognormal.
are isolated counts beyond the .9999 quantile on January 13, 14, and 17. On January 23, though, $S_t$ exceeds the threshold for a few hours, and many counts are far beyond the medians of their reference distributions. (This discrepancy is subtle in an absolute sense, however, and would not be apparent without a good description of the daily pattern.) Note that the outliers in Figure 7 have little effect on the medians for the next day, but the corresponding .9999 quantiles increase because uncertainty about the behavior of the counts has increased. These effects are desirable; outliers should not be able to move the estimated median of the reference distribution quickly (even though it is computed from mean and variance estimates), but outliers should stretch the tail of the reference distribution. Recall that the upper tail is stretched because an outlier is replaced with a random draw from the tail of the reference distribution. Here any point beyond the .9999 quantile was marked as an outlier and replaced with a random draw from beyond the .99 quantile of the reference distribution. Finally, Figure 8 shows the behavior of adaptive count thresholding during weeks 7 and 8, which include a stretch of 14 hours of missing data on February 7. The quantiles of the reference distribution are maintained, however, because each missing count is replaced with a draw from its reference distribution. Also note that the median tracks the data well throughout this period.

6. SIMULATED PERFORMANCE

The ability to detect an event depends not only on its duration and severity, but also on the trends and cyclical patterns present when there are no events. The longer or more severe the event, and the more stable and less dramatic the trends and cyclical patterns, the easier the event is to detect. Simulation results for a cyclical
mean and a 10% long-term increase in counts are described in this section. In each run of the simulation, 8 weeks of minute counts are generated, giving 80,640 counts per run. If no event is in progress at time t, then counts are generated according to a negative binomial distribution with mean
$$\mu_t = \left(1 + \frac{t}{10T}\right) 2e^{\sin(2\pi a_t)} + .5\,e^{\sin(-8\pi a_t)}$$
and variance $\sigma_t^2 = \max\{\mu_t, \mu_t^2/4\}$, where $a_t = t/(24 \times 60)$. The mean increases steeply from midnight until 5 AM, then decays more slowly, and has a secondary peak near 7 PM. This pattern of nonstationarity is typical for counts that are affected by people's activity. The means for the first and last days, assuming that no event occurs on these days, are shown in Figure 9. There are six simulation scenarios in all, each designed to evaluate the performance of adaptive thresholding against events of a specific magnitude and duration. For the purpose of evaluation, all events in each scenario have the same fixed duration, although of course we would not expect that to be the case in practice. Because the mean and variance of the background distribution depend on the time of day, the relative magnitude of the event is kept constant throughout the 8 simulated weeks by shifting the mean count during an event to the .9, .99, or .999 quantile of the reference distribution. Thus the nature of an event varies with the background. During an event, counts follow a Poisson distribution instead of a negative binomial distribution. If counts continued to follow a negative binomial distribution with standard deviation proportional to the
Figure 7. Adaptive Count Thresholding During Weeks 4 and 5.
Figure 8. Adaptive Count Thresholding During Weeks 7 and 8.
mean during an event, then extreme counts that make events unrealistically easy to detect would occur with high probability. Finally, events were generated according to a Poisson process with a rate of .5 event per day. Thus events in the 8 simulated weeks are not correlated. Events were detected using four pairs of threshold parameters (w, L), where the threshold parameter L is matched to w as in Section 4.1 to give a false-alarm rate of about 1 per 10,000 counts, which is roughly supported by these simulations. Results are given in Table 1. There, the mean time to detection is the mean time (in minutes or, equivalently, observations) until the severity St crosses the threshold Lσw , and mean false alarms is the average number of threshold crossings when no event is in progress divided by the time (measured in days) during which no event is in progress. As would be expected, thresholding isolated counts (w = 1) gives a high false-alarm rate compared with a modest amount of smoothing, whereas large amounts of smoothing (w = .05) tend to miss short events. A modest amount of smoothing (w = .25) is better than either no smoothing (w = 1) or much smoothing (w = .05), because it is more likely to detect events (especially less extreme, shorter events) without a high false-alarm rate. The results for time to detection are perhaps surprising. First, except for the lowest-level events (mean shifted to the .9 quantile of the reference distribution), 10-minute events that are detected at all are detected as quickly with w = .05 as they are with no smoothing, although one might have expected smoothing to delay event detection. For 30-minute events with mean
pushed to the .99 quantile or beyond, smoothing with $w = .25$ detects at least as many events as no smoothing, and given that an event is detected, it is detected earlier (nearly twice as quickly) with $w = .25$ than with $w = 1$. A plausible explanation is that smoothing uses the information in all of the counts, some of which are not far outlying, whereas $w = 1$ reacts only to the far outlying counts and thus is less efficient for detecting events. Finally, the conclusion that time to detection increases with event duration even when no other characteristic of the event changes is not surprising, because more events are detected when they last longer.

Figure 9. The Mean of the Negative Binomial Distribution, Assuming That No Event Is in Progress, for the First and the Last (56th) Simulated Days.

Table 1. Performance Over 100 Runs of 8 Simulated Weeks With the Background Mean Shown in Figure 9

Duration  Shift     Weight  Probability of  Mean time to  False-alarm rate
(min)     quantile  w       detection       detect (min)  (per day)
10        .9         .05      4.0 (.6)       7.54 (.23)   .0063 (.0015)
10        .9         .25     20.0 (1.1)      6.39 (.16)   .0248 (.0032)
10        .9         .50     13.9 (1.0)      5.78 (.20)   .0404 (.0040)
10        .9        1.00      4.6 (.6)       4.88 (.40)   .0900 (.0058)
10        .99        .05     41.6 (1.3)      6.77 (.08)   .0089 (.0021)
10        .99        .25     93.6 (.8)       4.63 (.07)   .0215 (.0029)
10        .99        .50     88.5 (1.0)      4.24 (.07)   .0444 (.0040)
10        .99       1.00     39.2 (1.6)      4.71 (.15)   .0911 (.0067)
10        .999       .05     63.8 (1.3)      5.88 (.06)   .0074 (.0015)
10        .999       .25     99.8 (.1)       3.07 (.04)   .0278 (.0031)
10        .999       .50     99.8 (.1)       2.38 (.04)   .0433 (.0044)
10        .999      1.00     85.9 (1.0)      3.24 (.08)   .0844 (.0050)
30        .9         .05     57.5 (1.5)     18.56 (.20)   .0063 (.0015)
30        .9         .25     64.0 (1.3)     14.91 (.23)   .0244 (.0030)
30        .9         .50     37.6 (1.4)     14.51 (.34)   .0430 (.0037)
30        .9        1.00     11.5 (.9)      14.50 (.80)   .0856 (.0057)
30        .99        .05     98.2 (.3)      12.94 (.17)   .0059 (.0015)
30        .99        .25    100.0 (0)        5.77 (.10)   .0270 (.0032)
30        .99        .50     98.6 (.3)       6.25 (.14)   .0452 (.0042)
30        .99       1.00     70.5 (1.2)     11.46 (.25)   .0763 (.0062)
30        .999       .05    100.0 (0)        9.39 (.13)   .0067 (.0014)
30        .999       .25    100.0 (0)        3.31 (.06)   .0230 (.0025)
30        .999       .50    100.0 (0)        2.87 (.09)   .0352 (.0034)
30        .999      1.00     94.0 (.6)       6.12 (.19)   .0622 (.0051)

NOTE: Events occur according to a Poisson process at a rate of .5 events per day. Counts during an event are Poisson with mean equal to a specified quantile of the negative binomial reference distribution. Standard errors of reported quantities are shown in parentheses.

7. DISCUSSION

This article presents a statistically principled way to threshold counts online, without access to past data, in the presence of unspecified cyclical patterns and trends and missing data, as
long as the cycle length can be specified in advance. Monitoring begins at system start; no data are held out for estimating a baseline, for example. The methodology was developed for counts from communication networks, but it could be relevant for other kinds of counts that are influenced by people's habits and so have time-of-day, day-of-week, or other timing patterns that are not reproducible. A referee has pointed out that the method may also be useful in epidemiology, for example.

At its heart, adaptive count thresholding is based on online estimates of means and variances that capture cyclical patterns through the use of a parameter grid over the cycle length and smooth over time through the use of interpolation. These means and variances are converted to quantiles of the negative binomial distribution. Because the negative binomial is a rich family of distributions, this provides a new way to track quantiles of discrete distributions incrementally in the presence of unspecified (nonparametric) trends and timing patterns in the data. Although the negative binomial, like any other distribution, is not appropriate for all applications, the fact that it has two parameters and naturally generalizes the Poisson makes it broadly applicable to counts. The changes needed for other two-parameter distributions, like the normal or lognormal, are obvious.

Event detection in this article is based on applying control chart methods to normal scores of the p values of the counts under their reference distribution. Taking p values standardizes the observations by controlling for differences in their means and variances. Alternatively, we could have centered and scaled the counts by subtracting the mean estimate and dividing by the standard deviation and then applied control chart methods to the standardized counts directly. This might not be appropriate for negative binomial counts with small or moderate means, but it would be appropriate for nearly normally distributed data.
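A small numerical illustration of this last point (our own sketch, with illustrative parameter values) shows why directly standardized counts cannot behave like normal scores when the mean is small:

```python
import math

# Illustrative parameters (ours, not the paper's): a small-mean,
# strongly skewed negative binomial reference.
mu, var = 2.0, 6.0
p, r = mu / var, mu * mu / (var - mu)

def pmf(k):
    # moment-matched negative binomial pmf
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log(1.0 - p))

# For normal data, (x - mu)/sigma falls below -1.645 about 5% of the
# time; for these counts it never can, because no count k >= 0 lies
# 1.645 standard deviations below a mean of 2.
sd = math.sqrt(var)
prob = sum(pmf(k) for k in range(200) if (k - mu) / sd < -1.645)
print(prob)  # 0
```

A lower-tail chart built on standardized counts therefore could not signal at its nominal rate here, whereas normal scores of (continuity-corrected) p values remain approximately standard normal.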
Results for real and simulated data show that adaptive count thresholding based on a negative binomial distribution performs well in our application. It initializes quickly, tracks the mean and variance effectively, and is able to detect changes in the behavior of counts. We have applied the procedure to network counts herein, but we believe that it is more widely applicable;
for example, it may provide an alternative to other changepoint methods for count data that rely heavily on the Poisson distribution and are not designed with cyclical patterns and trends in mind. [Received July 2004. Revised May 2005.]
REFERENCES

Cao, J., Cleveland, W. S., Lin, D., and Sun, D. X. (2002), "Internet Traffic Tends Toward Poisson and Independent as the Load Increases," in Nonlinear Estimation and Classification, eds. C. Holmes, D. Denison, M. Hansen, B. Yu, and B. Mallick, New York: Springer-Verlag, pp. 83–109.
Cao, J., Davis, D., Vander Wiel, S., and Yu, B. (2000), "Time-Varying Network Tomography: Router Link Data," Journal of the American Statistical Association, 95, 1063–1075.
Carlin, B., and Louis, T. (2000), Bayes and Empirical Bayes Methods for Data Analysis (2nd ed.), London: Chapman & Hall.
Crowder, S. V. (1987), "A Simple Method for Studying Run-Length Distributions of Exponentially Weighted Moving Average Charts," Technometrics, 29, 401–407.
Ellner, S. P., and Seifu, Y. (2002), "Using Spatial Statistics to Select Model Complexity," Journal of Computational and Graphical Statistics, 11, 348–369.
Fan, J., and Gijbels, I. (1996), Local Polynomial Modelling and Its Applications, London: Chapman & Hall.
Feather, F. W. (1992), "Fault Detection in Ethernet Networks via Anomaly Detection," unpublished doctoral thesis, Carnegie Mellon University, Dept. of Electrical and Computer Engineering.
Feather, F. W., Siewiorek, D., and Maxion, R. (1993), "Fault Detection in Ethernet Networks Using Anomaly Signature Matching," in Proceedings of ACM SIGCOMM '93, San Francisco, CA, pp. 279–288.
Lucas, J. M., and Saccucci, M. S. (1990), "Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements" (with discussion), Technometrics, 32, 1–29.
Papagiannaki, K., Taft, N., Zhang, Z. L., and Diot, C. (2003), "Long-Term Forecasting of Internet Backbone Traffic: Observations and Initial Models," in Proceedings of IEEE INFOCOM 2003, San Francisco, CA.
Paxson, V., and Floyd, S. (1995), "Wide-Area Traffic: The Failure of Poisson Modeling," IEEE/ACM Transactions on Networking, 3, 226–244.
Quesenberry, C. P. (1991), "SPC Q-Charts for Start-Up Processes and Short or Long Runs," Journal of Quality Technology, 23, 213–224.
Robinson, P. B., and Ho, T. Y. (1978), "Average Run Lengths of Geometric Moving Average Charts by Numerical Methods," Technometrics, 20, 85–93.
Thottan, M., and Ji, C. (1998), "Proactive Anomaly Detection Using Distributed Agents," IEEE Network, 12, 21–27.
Tukey, J. W. (1977), Exploratory Data Analysis, Reading, MA: Addison-Wesley.
Vardeman, S. B., and Jobe, J. M. (1998), Statistical Quality Assurance Methods for Engineers, New York: Wiley.