Finding Critical Thresholds for Defining Bursts

Bibudh Lahiri1,*, Ioannis Akrotirianakis2, and Fabian Moerchen2

1 Iowa State University, Ames, IA, USA 50011, [email protected]
2 Siemens Corporate Research, Princeton, NJ, USA 08540, {ioannis.akrotirianakis,fabian.moerchen}@siemens.com

Abstract. A burst, i.e., an unusually high frequency of an event in a time-window, is interesting in monitoring systems as it often indicates abnormality. While the detection of bursts is well addressed, the question of what "critical" thresholds, on the number of events as well as on the window size, make a window "unusually bursty" remains a relevant one. The range of possible values for either threshold can be very large. We formulate finding the combination of critical thresholds as a 2D search problem and design efficient deterministic and randomized divide-and-conquer heuristics. For both, we show that under some weak assumptions, the computational overhead in the worst case is logarithmic in the sizes of the ranges. Our simulations show that on average, the randomized heuristic beats its deterministic counterpart in practice. Keywords: Analytics for temporal data, Massive data analytics

1 Introduction

A burst is a window in time when an event shows an unusually high frequency of occurrence, and often indicates a deviation from the norm. E.g., in text streams from news articles or blogs, an important event like the 9/11 attack caused a burst of keywords like "twin towers" or "terror". A burst in clicks to an online advertisement might indicate click fraud [8]. Intrusions over the Internet often exhibit a bursty traffic pattern [6]. In astrophysics, a Gamma ray burst might indicate an interesting phenomenon [10, 9]. Labelling a window in time as "bursty" calls for at least two thresholds: one on the number of events (k) and the other on the length of the window (t). We call a window (k, t)-bursty if at least k events occur in a time window of length at most t. While the problem of identifying (k, t)-bursty windows, given k and t, is interesting in itself, knowing the right thresholds is part of the problem. For a given t, to know what value of k should be termed "unusually high", we first need to know typically how many events to expect in a window of length t. Similarly, for a given k, to know what value of t is "unusually low", we first need to know typically how long it takes to generate k events.

* This work was done while this author was an intern at Siemens Corporate Research, Princeton.


Before we formally define the problem of finding the critical thresholds, we need to quantify the notions of "usual" and "unusual". We define a metric called "coverage": given a threshold pair (k, t), and a sequence of timestamps of n events, the coverage Ck,t is the fraction of the n events that are included in some (k, t)-bursty window. Note that a single event can be part of more than one (k, t)-bursty window. For a given pair (k, t), if we find that Ck,t is quite close to 1, then we are not actually interested in such a pair, because that implies having at least k events in a window of length at most t is not unusual, and hence should hardly be labelled a burst. On the other hand, a Ck,t value quite close to 0 implies having k events in a window of length at most t is unusual, and hence demands attention. Note that this definition ensures Ck,t ∈ [0, 1], and makes Ck,t monotonically non-increasing in k and non-decreasing in t, properties that we prove and take advantage of in our algorithms. We focus on identifying critical pairs (k*, t*) such that Ck*,t* is abruptly different from the values of Ck,t for pairs (k, t) in the neighborhood of (k*, t*) (with k < k*): this implies having k* events in a window of length at most t* is not the norm, yet there are some rare situations in which it has happened. Note that for a given pair (k, t), Ck,t can be computed by making a single pass over the data; but if the ranges of possible values for k and t have sizes K and T respectively, then evaluating Ck,t at every point in the two-dimensional space would have a computational overhead of O(KT). Since for most applications we hardly have any a priori idea of which combinations of thresholds are critical, each of K and T can be rather large, e.g., t might range from a few minutes to a few hours, depending on the nature of the application, and k might take any value from 2 to a few thousand.
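To make the coverage definition concrete, here is a small brute-force Python sketch (ours, not the paper's): an event is covered if some window of span at most t that contains it holds at least k events, and since any such window can be shifted left until it starts at an event, it suffices to anchor candidate windows at each timestamp.

```python
def coverage_bruteforce(timestamps, k, t):
    """Fraction of events lying inside some (k, t)-bursty window,
    i.e., a window [x, x + t] that contains at least k events."""
    ts = sorted(timestamps)
    n = len(ts)
    covered = [False] * n
    # Any bursty window can be slid left to start at one of the event
    # timestamps without losing events, so trying [ts[i], ts[i] + t]
    # for every i is exhaustive.
    for i in range(n):
        members = [j for j in range(n) if ts[i] <= ts[j] <= ts[i] + t]
        if len(members) >= k:
            for j in members:
                covered[j] = True
    return sum(covered) / n

# Three events packed within span 2, then two isolated ones.
print(coverage_bruteforce([1, 2, 3, 50, 100], k=3, t=5))  # -> 0.6
```

This is O(n^2) and only illustrates the definition; the paper's Algorithm 1 computes the same quantity in a single pass.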
Our contributions can be summarized as follows:

– We formally define the problem of finding critical threshold pairs that should label a subsequence of a time series as unusually bursty, and formulate it as a two-dimensional search problem.

– We prove monotonicity properties of the coverage function rigorously, and exploit them to design deterministic and randomized divide-and-conquer heuristics that explore the search space efficiently. Under some weak assumptions, we show the deterministic heuristic computes Ck,t at O(log K log T) different points, and, under identical assumptions, the randomized heuristic also computes Ck,t at O(log K log T) different points in expectation in the worst case. For lack of space, we only present the claims here; the proofs can be found in the full version [3].

– We experimentally compared the performance of our deterministic and randomized heuristics with that of a naive algorithm that evaluates Ck,t at at most KT points, though typically far fewer. Even with some optimizations of the naive algorithm, the savings made by our heuristics are in the range of 41% to 97%. Note that although our analysis (Section 6) assumes we stop after getting the first (k*, t*), in our experiments (Section 7) we continued until we got all possible values of (k*, t*).


2 Related Work

Zhu and Shasha [10] addressed "elastic" burst detection, where they kept a different threshold for each among a number of different window sizes and identified windows over the time series when an aggregate function (sum) computed over the window exceeded the corresponding threshold. Their algorithm builds on the time series a Shifted (binary) Wavelet Tree (SWT) data structure, which was generalized in [9] to a faster data structure called the Aggregation Pyramid. [9] and [5] revealed correlated bursts among multiple data streams in stock exchange and text data. Kleinberg [1] investigated how keywords in document streams like emails and news articles show a bursty pattern, and developed a framework in the form of an infinite-state automaton to model bursts in a stream. Kumar et al. [2] extended the ideas of [1] to discover bursts in the hyperlinking among blogs in Blogspace, which occur when one author publishes an entry on her blog that increases the traffic to her blog and stimulates activities in other blogs that focus on the same topic at the same time. Vlachos et al. [4] addressed the problem of burst detection for search engine queries to detect periodic (e.g., weekly or monthly) trends. Yuan et al. [7] worked on trend detection from high-speed short text streams. While all this earlier literature has focused on the detection of bursts, we focus on finding the thresholds that define a burst. A heuristic like ours can be used to learn from historical data what choice of thresholds separates a burst from a non-burst; the learned thresholds can later be fed to any burst-detection algorithm in a real monitoring system for the same application.

3 Problem Statement

We have a sequence of events S′ = (e1, e2, ..., en). Let te be the timepoint at which event e occurs, so the corresponding sequence of timestamps is S = (te1, te2, ..., ten). Let Nk,t be the number of events that are in some (k, t)-bursty window. As defined in Section 1, the coverage for the pair (k, t), denoted Ck,t, is Ck,t = Nk,t/n. Let Kmin and Kmax be the minimum and maximum possible values of k, known a priori, and K = Kmax − Kmin. Similarly, let Tmin and Tmax be the minimum and maximum possible values of t, also known a priori, and T = Tmax − Tmin. We focus on the following problem:

Problem 1. Given the sequence S, and a user-given parameter θ > 1, find a set α = {(k*, t*)} such that α ⊂ [Kmin + 1, Kmax] × [Tmin, Tmax], and for any pair (k*, t*) ∈ α, Ck*−1,t* / Ck*,t* ≥ θ.

We first focus on simpler, one-dimensional versions of the problem. Assuming we are dealing with a fixed value of the maximum window length t, this becomes


Problem 2. For a fixed t, and a user-given parameter θ > 1, find a subset K* ⊂ [Kmin + 1, Kmax] such that for any k* ∈ K*, Ck*−1,t / Ck*,t ≥ θ.

Alternatively, if we deal with a fixed value of the threshold k on the number of events, this becomes

Problem 3. For a fixed k, and a user-given parameter θ > 1, find a subset T* ⊂ [Tmin, Tmax] such that for any t* ∈ T*, Ck−1,t* / Ck,t* ≥ θ.

We observed from our experiments that Ck−1,t / Ck,t remains close to 1 most of the time; however, for very few combinations of k and t, it attains values like 2 or 3 or higher, and these are the combinations we are interested in. Since K and T can be quite large, searching for the few critical combinations calls for efficient search heuristics.

4 Monotonicity of the coverage function

Note that, Ck,t is a monotonically non-increasing function of k, and a monotonically non-decreasing function of t. Intuitively, the reason is that for a fixed t, as k increases, (k, t)-bursty windows become rarer in the data; and for a fixed k, as t increases, it becomes easier to find (k, t)-bursty windows in the same data. The formal proofs can be found in [3].
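These monotonicity properties are easy to sanity-check empirically. Below is a self-contained illustrative sketch (ours, not the paper's code) that recomputes coverage by brute force on a random event sequence and verifies both directions:

```python
import random

def coverage(ts, k, t):
    """Fraction of events covered by some window [ts[i], ts[i] + t]
    holding at least k events (ts assumed sorted)."""
    n = len(ts)
    cov = set()
    for i in range(n):
        members = [j for j in range(n) if ts[i] <= ts[j] <= ts[i] + t]
        if len(members) >= k:
            cov.update(members)
    return len(cov) / n

random.seed(0)
ts = sorted(random.randrange(1000) for _ in range(200))
# Non-increasing in k (for fixed t): a window with >= k+1 events has >= k.
assert all(coverage(ts, k, 50) >= coverage(ts, k + 1, 50) for k in range(2, 10))
# Non-decreasing in t (for fixed k): a longer window contains the shorter one.
assert all(coverage(ts, 5, t) <= coverage(ts, 5, t + 10) for t in range(10, 100, 10))
print("monotonicity holds on this sample")
```

The assertions hold by the same set-inclusion argument as the formal proofs: raising k can only shrink the set of bursty windows, and raising t can only grow each window's membership.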

5 The Divide-and-Conquer Heuristics

5.1 The One-Dimensional Problem

We first discuss the solution for Problem 2; the solution to Problem 3 is similar. Given the sequence S = (te1, te2, ..., ten) and a pair (k, t), Ck,t can be computed on S in a single pass by Algorithm 1. A naive approach would be to invoke Algorithm 1 with the pairs (k, t) for every k ∈ [Kmin + 1, Kmax], and check when Ck−1,t/Ck,t exceeds θ. This would take O(K) calls to Algorithm 1. To cut down the number of invocations of Algorithm 1, we take a simple divide-and-conquer approach, coupled with backtracking, and exploit the monotonicity of the function Ck,t discussed in Section 4. We present two variations of the approach, one deterministic and the other randomized. The intuition is as follows:

Intuition: We split the range K of all possible inputs into two sub-intervals. We devise a simple test to decide which of the two sub-intervals a value k* may lie within. The test is based on the observation that if a sub-interval X = [ks, ke] contains a k* where the coverage function shows an abrupt jump (i.e., Ck*−1,t/Ck*,t ≥ θ), then the ratio of the coverages evaluated at the two endpoints must also exceed θ (i.e., Cks−1,t/Cke,t ≥ θ), by the monotonicity of Ck,t in k. Note that the reverse is not necessarily true, as Cks−1,t/Cke,t might exceed θ because of a gradual change (of factor θ or more) from ks − 1 to ke (if the interval [ks, ke] is long enough). Thus, the test may return a


positive result on a sub-interval even if there is no such value k* within that sub-interval. However, we repeat this process iteratively, cutting down the length of the interval in each iteration; the factor by which it is cut down depends on whether the heuristic is deterministic or randomized. The number of iterations taken to reduce the original interval of width K to a point is O(log K). Note that, in the case when there is no such point k*, the intervals might pass the test for the first few iterations (because of the gradual change from ks to ke), but eventually the interval will be reduced to one for which Cks−1,t/Cke,t falls below θ, and hence it will no longer pass the test.

Deterministic vs. randomized divide-and-conquer: We always split an interval of width w into two intervals of length p·w and (1 − p)·w respectively, where p ∈ (0, 1). For the deterministic heuristic, p is always 1/2; for the randomized one, p is chosen uniformly at random in (0, 1). If both sub-intervals pass the test, then the deterministic heuristic probes the sub-intervals serially, whereas the randomized one processes the smaller one first. The reasons for probing the smaller sub-interval first are the following:

1. If it contains a point k*, then it can be found in fewer iterations.
2. If it does not contain any point k*, and passed the test of Lemma 1 falsely, then the algorithm backtracks after a few iterations.

Lemma 1. For a fixed t, if a sub-interval X = [ks, ke] contains a point k* such that Ck*−1,t/Ck*,t ≥ θ, then Cks−1,t/Cke,t ≥ θ.

The search for t* in Problem 3 proceeds similarly to the search for k* explained above, the difference being that the test on the sub-intervals is performed using Lemma 2 below.
Note the difference: in Lemma 1, the ratio is of the coverage at the start-point to that at the end-point, whereas in Lemma 2 it is the other way round; the difference arises from the opposite directions of the monotonicities in k and t.

Lemma 2. For a fixed k, if a sub-interval X = [ts, te] contains a point t* such that Ck−1,t*/Ck,t* ≥ θ, then Ck−1,te/Ck,ts ≥ θ.
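The 1D search with the Lemma 1 endpoint test and smaller-half-first probing can be sketched in Python as follows (illustrative, not the paper's code: coverage(k) stands for an already memoized Ck,t at the fixed t, and the toy coverage values are hypothetical, with a jump planted at k = 6):

```python
import random

def search_k(coverage, ks, ke, theta, hits):
    """Collect every k* in [ks, ke] with coverage(k*-1)/coverage(k*) >= theta.
    coverage(k) must be monotonically non-increasing in k."""
    if ks == ke:
        if coverage(ks) > 0 and coverage(ks - 1) / coverage(ks) >= theta:
            hits.append(ks)
        return
    kq = random.randint(ks, ke - 1)   # deterministic variant: kq = (ks + ke) // 2
    halves = sorted([(ks, kq), (kq + 1, ke)], key=lambda h: h[1] - h[0])
    for lo, hi in halves:             # probe the smaller sub-interval first
        if coverage(lo - 1) == 0:     # then coverage(hi) = 0 too: nothing to find
            continue
        # Lemma 1 test: a jump inside [lo, hi] forces coverage(lo-1)/coverage(hi) >= theta
        if coverage(hi) == 0 or coverage(lo - 1) / coverage(hi) >= theta:
            search_k(coverage, lo, hi, theta, hits)

# Toy coverage values, monotone non-increasing in k, with a jump at k = 6.
C = {1: 1.0, 2: 1.0, 3: 0.95, 4: 0.9, 5: 0.86, 6: 0.2, 7: 0.18, 8: 0.15, 9: 0.12, 10: 0.1}
hits = []
search_k(C.get, 2, 10, 2.0, hits)
print(hits)  # -> [6]
```

By the Lemma 1 argument, any half containing the jump always passes the test, so the jump is found regardless of where the random splits fall; halves that pass only because of gradual change are pruned at their leaves.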

5.2 The Two-Dimensional Problem

We now advance to the original and more general problem in two dimensions, i.e., Problem 1. Our algorithm for 2D extends the 1D algorithm of Section 5.1 in the sense that it progressively divides the 2D range of all possible values of k and t, i.e., [Kmin + 1, Kmax] × [Tmin, Tmax], into four sub-ranges/rectangles. For the 2D problem, the pair(s) (k*, t*) for which Ck*−1,t*/Ck*,t* exceeds θ will come from one or a few of these four sub-ranges. We devise a test similar to the one in Section 5.1 to identify which of the four sub-ranges may include the pair (k*, t*), and then probe into that sub-range in the next iteration, cutting down its size again, and so on. If the ranges of possible values for k and t are of unequal length, i.e., if K ≠ T, then the length of the range reduces to unity for the smaller dimension first, and the rest of the search becomes a 1D search on the other dimension, like the ones in Section 5.1.


The test for identifying the correct sub-range in our 2D algorithm is based on the observation in the following lemma.

Lemma 3. If a sub-range X = [ks, ke] × [ts, te] contains a point (k*, t*) such that Ck*−1,t*/Ck*,t* ≥ θ, then Cks−1,te/Cke,ts ≥ θ.

Like the algorithms in Section 5.1 for the 1D problems, here also we have two variants: one deterministic and the other randomized. If more than one of the four sub-rectangles passes the test in Lemma 3, then the deterministic algorithm probes into them serially, and the randomized one probes into the rectangles in increasing order of their areas.

Algorithm 1 computes the coverage Ck,t, given a sequence of timestamps S = (te1, te2, ..., ten), a lower bound k on the number of events and an upper bound t on the window length. As we have already pointed out, even if a single timestamp tei is included in multiple (k, t)-bursty windows, its contribution to Nk,t, in the definition of Ck,t, is only 1. Hence, we maintain a bitmap (b1, b2, ..., bn) of length n, one bit for each timestamp in S. We slide a window over S, marking the starting and ending points of the sliding window by s and f respectively. Once all the timepoints in a window are "picked up", we check (in lines 1 and 3) whether the number of events in the current window [s, f], i.e., f − s + 1, reaches the threshold k. If it does, then all the bits in the sub-sequence (bs, ..., bf) of the bitmap are set to 1 (lines 2 and 4) to indicate that the timepoints indexed by these bits are part of some bursty window.

Algorithm 2 performs a 1D search over the interval [ks, ke] for a fixed t (Problem 2), and is called from Algorithm 3 once the range of t-values reduces to a single point. A 1D search over the interval [ts, te] can be performed similarly for a fixed k (Problem 3) (we call it RandomSearcht*), and is called from Algorithm 3 once the range of k-values reduces to a single point.
In Algorithm 2 and its counterpart for [ts, te], whenever we need to compute Ck,t, we first check whether it has already been computed, in which case it exists in a hashtable with (k|t) as the key; otherwise, we compute it by invoking Algorithm 1 and store it in the hashtable with key (k|t). Note that in lines 2 and 3 of Algorithm 2, r might be evaluated as exceeding θ because of a division by zero. In case that happens, we explore the interval only if Cks_small−1,t > 0, because Cks_small−1,t = 0 implies Cke_small,t = 0 by monotonicity, and it is then not worth exploring [ks_small, ke_small].

Algorithm 3 performs the search over the 2D interval [ks, ke] × [ts, te] to solve Problem 1. A rectangle is defined as a four-tuple (tl, th, kl, kh), i.e., the set of all points in the 2D range [tl, th] × [kl, kh]; its area is thus (th − tl + 1)·(kh − kl + 1). Since the input rectangle is split into only four rectangles, we use insertion sort in line 2 to sort them by their areas, which takes O(1) time because the input size is constant.

The deterministic heuristics are very similar to their randomized counterparts (Algorithms 2 and 3), with the following differences:

– In line 1 of Algorithm 2 and line 1 of Algorithm 3, kq and tq are the midpoints (⌊(ks + ke)/2⌋ and ⌊(ts + te)/2⌋) of the intervals [ks, ke] and [ts, te] respectively.


– In Algorithm 3, we can do away with the sorting step of line 2, since the four rectangles are of (almost) equal size.
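The hashtable lookup described above amounts to simple memoization. A minimal sketch (the names compute_coverage and coverage_memo are ours, and the call counter exists only for the demo):

```python
calls = {"n": 0}

def compute_coverage(S, k, t):   # stand-in for Algorithm 1
    calls["n"] += 1
    return 0.5                   # dummy coverage value for the demo

cache = {}

def coverage_memo(S, k, t):
    """Evaluate C_{k,t} at most once per (k, t) pair; the paper keys its
    hashtable by the concatenation (k|t), and a tuple plays that role here."""
    if (k, t) not in cache:
        cache[(k, t)] = compute_coverage(S, k, t)
    return cache[(k, t)]

S = [1, 2, 3, 10]
coverage_memo(S, 3, 5)
coverage_memo(S, 3, 5)           # served from the cache
print(calls["n"])  # -> 1
```

In the paper's setting this matters because both the naive algorithm and the heuristics evaluate the same Ck,t in the ratios Ck−1,t/Ck,t and Ck,t/Ck+1,t.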

Algorithm 1: ComputeCoverage(S = (te1, te2, ..., ten), t, k)
output: Ck,t: the fraction of events that are in some (k, t)-bursty window

   n ← |S|; initialize a bitmap (b1, b2, ..., bn) to all zeros;
   /* sliding window is [s, ..., f], s ∈ {1, ..., n}, f ∈ {1, ..., n} */
   /* Note: tei is the ith timepoint in S. */
   s ← 0, f ← 0;
   while (tef < tes + t) ∧ (f < n) do f ← f + 1;
   if tef > tes + t then f ← f − 1;
   while f < n do
      nw ← f − s + 1;
1:     if nw ≥ k then
2:        set the bits (bs, ..., bf) to 1;
      /* Move the window, storing pointers to the previous window */
      sp ← s, fp ← f;
      while (s ≥ sp) ∧ (f ≤ fp) do
         s ← s + 1;
         while (tef < tes + t) ∧ (f < n) do f ← f + 1;
         if tef > tes + t then f ← f − 1;
   /* If the last point is within the last window, it will be counted.
      Otherwise, it is an isolated point and hence not interesting. */
   if f = n then
      nw ← f − s + 1;
3:     if nw ≥ k then
4:        set the bits (bs, ..., bf) to 1;
   Ck,t ← (Σ(j=1..n) bj)/n;
   return Ck,t;
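The pseudocode above can be rendered compactly in Python (an illustrative sketch: it slides the same two pointers s and f over sorted timestamps, but re-extends f for each start s instead of tracking the previous-window pointers sp and fp):

```python
def compute_coverage(ts, k, t):
    """Return C_{k,t}: the fraction of events lying in some window of
    time-span at most t containing at least k events. ts must be sorted."""
    n = len(ts)
    covered = [False] * n
    f = 0                                  # right end of the sliding window
    for s in range(n):                     # left end of the sliding window
        if f < s:
            f = s
        # extend the window as far as the span bound t allows;
        # f never moves backwards because ts[s] + t is non-decreasing in s
        while f + 1 < n and ts[f + 1] <= ts[s] + t:
            f += 1
        if f - s + 1 >= k:                 # (k, t)-bursty: mark its events
            for j in range(s, f + 1):
                covered[j] = True
    return sum(covered) / n

# Three events packed within span 2, then two isolated ones.
print(compute_coverage([1, 2, 3, 50, 100], k=3, t=5))  # -> 0.6
```

Like the bitmap in Algorithm 1, the covered list ensures an event counts only once toward Nk,t even if it belongs to several bursty windows.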

6 Complexity Analysis

Let C(K) be the number of calls made to Algorithm 1 from Algorithm 2. We compute C(K) assuming that in Algorithm 2 and its deterministic equivalent, only one of the two sub-intervals passes the test of Lemma 1 in lines 2-3,


Algorithm 2: RandomSearchk*(S = (te1, te2, ..., ten), t, ks, ke, θ)

   if ks = ke then
      r ← Cks−1,t / Cks,t;
      if (r ≥ θ) ∧ (Cks,t > 0) then
         output (ks, t) as a critical threshold pair;
      return;
   else
      /* U([a, b]) returns a number uniformly at random in [a, b] */
1:     kq ← U([ks, ke − 1]);
      between [ks, kq] and [kq + 1, ke], let [ks_big, ke_big] be the bigger window and [ks_small, ke_small] be the smaller;
      r_small ← Cks_small−1,t / Cke_small,t;
2:     if (r_small ≥ θ) ∧ (Cks_small−1,t > 0) then
         RandomSearchk*(S, t, ks_small, ke_small, θ);
      r_big ← Cks_big−1,t / Cke_big,t;
3:     if (r_big ≥ θ) ∧ (Cks_big−1,t > 0) then
         RandomSearchk*(S, t, ks_big, ke_big, θ);

Algorithm 3: RandomSearch2D(S = (te1, te2, ..., ten), ks, ke, ts, te, θ)

   if (ts = te) ∧ (ks = ke) then
      r ← Cks−1,te / Cke,ts;
      if (r ≥ θ) ∧ (Cke,ts > 0) then
         output (ks, ts) as a critical threshold pair;
      return;
   else if ts = te then RandomSearchk*(S, ts, ks, ke, θ);
   else if ks = ke then RandomSearcht*(S, ks, ts, te, θ);
   else
1:     kq ← U([ks, ke − 1]); tq ← U([ts, te − 1]);
      let R be an array of rectangles with R[1] = (ts, tq, ks, kq), R[2] = (tq + 1, te, ks, kq), R[3] = (ts, tq, kq + 1, ke) and R[4] = (tq + 1, te, kq + 1, ke);
2:     sort R in increasing order of the areas of the rectangles;
      for p = 1 to 4 do
         let (tl, th, kl, kh) be the 4-tuple for rectangle R[p];
         r ← Ckl−1,th / Ckh,tl;
3:        if (r ≥ θ) ∧ (Ckl−1,th > 0) then
            RandomSearch2D(S, kl, kh, tl, th, θ);
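The recursive structure of Algorithm 3, with the Lemma 3 corner test and smallest-area-first probing, can be sketched in Python (illustrative: cov here is a synthetic coverage function, monotone in both arguments with a jump planted between k = 6 and k = 7, not real data):

```python
import random

def split(lo, hi):
    q = random.randint(lo, hi - 1)        # deterministic variant: q = (lo + hi) // 2
    return [(lo, q), (q + 1, hi)]

def search_2d(cov, kl, kh, tl, th, theta, hits):
    """Collect every (k*, t*) in [kl, kh] x [tl, th] with
    cov(k* - 1, t*) / cov(k*, t*) >= theta."""
    if kl == kh and tl == th:
        if cov(kl, tl) > 0 and cov(kl - 1, tl) / cov(kl, tl) >= theta:
            hits.append((kl, tl))
        return
    # Split each dimension that still has width; with one degenerate
    # dimension this reduces to the 1D search of Section 5.1.
    k_parts = [(kl, kh)] if kl == kh else split(kl, kh)
    t_parts = [(tl, th)] if tl == th else split(tl, th)
    rects = [(a, b, c, d) for a, b in k_parts for c, d in t_parts]
    rects.sort(key=lambda r: (r[1] - r[0] + 1) * (r[3] - r[2] + 1))  # smallest area first
    for a, b, c, d in rects:
        if cov(a - 1, d) == 0:            # then cov(b, c) = 0 too: skip
            continue
        # Lemma 3 test on the rectangle's extreme corners
        if cov(b, c) == 0 or cov(a - 1, d) / cov(b, c) >= theta:
            search_2d(cov, a, b, c, d, theta, hits)

# Synthetic coverage: non-increasing in k, non-decreasing in t, with a
# planted jump between k = 6 and k = 7 at every t.
def cov(k, t):
    return min(1.0, t / (10.0 * k)) * (0.1 if k >= 7 else 1.0)

hits = []
search_2d(cov, 2, 10, 1, 20, 3.0, hits)
print(sorted(hits))  # all pairs (7, t) for t = 1, ..., 20
```

Any rectangle containing a critical pair passes the corner test (its top-left corner coverage is at least Ck*−1,t* and its bottom-right at most Ck*,t*), so every planted pair is found regardless of the random splits.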


so we never probe into the other interval, and we stop as soon as we get the first k* that satisfies our criterion.

Theorem 1. For the deterministic counterpart of Algorithm 2, C(K) = O(log K).

Let C(T) be the number of calls made to Algorithm 1 for solving Problem 3. Analogous to Theorem 1, we can claim the following:

Theorem 2. For the deterministic algorithm for solving Problem 3, C(T) = O(log T).

For the two-dimensional version, let C(K, T) be the number of calls made to Algorithm 1 for solving Problem 1. For the following theorem, and also for Theorem 6, we assume that in Algorithm 3 and its deterministic counterpart, only one of the four rectangles passes the test of Lemma 3 in line 3, and we stop as soon as we get the first (k*, t*).

Theorem 3. For the deterministic counterpart of Algorithm 3, C(K, T) = O(log K log T).

Theorem 4. For Algorithm 2, the expected complexity in the worst case is E[C(K)] = O(ln K).

Analogous to Theorem 4, we can claim the following for the complexity C(T) of the randomized algorithm for Problem 3.

Theorem 5. For the randomized algorithm for Problem 3, the expected complexity in the worst case is E[C(T)] = O(ln T).

Theorem 6. For Algorithm 3, the expected complexity in the worst case is E[C(K, T)] = O(ln K ln T).

7 Evaluation

Dataset: We implemented both heuristics and compared them with a naive algorithm (which also gave us the ground truth to begin with), by running all three on a set of logs collected during the operation of large, complex equipment sold by Siemens Healthcare. We chose 32 different types of events that occurred on this equipment, each event identified by a unique code. Each event code occurred on multiple (up to 291 different) machines, so we had to take care of some additional details (described in Section 7 of [3]) while computing Ck,t and finding the critical thresholds. The event codes had up to 300,000 distinct time points.

Experiments: We implemented our heuristics in Java on a Windows desktop machine with 2 GB RAM. We set Tmin = 1 minute, Tmax = 100 minutes, Kmin = 2 and Kmax = 100 for all the event codes. We made the following simple optimizations to the naive algorithm:

1. For each event code e, C(e)k,t for each combination of k and t is computed at most once and stored in a hashtable with the key e|k|t, a concatenation of e, k and t. The stored value of C(e)k,t is used in evaluating both C(e)k−1,t/C(e)k,t and C(e)k,t/C(e)k+1,t. We followed the same practice for our heuristics, too.

2. Once C(e)k,t reaches 0 for some k, C(e)k,t is not computed for any larger value of k, since the monotonicity property discussed in Section 4 guarantees those values will also be 0.

The ratios C(e)k−1,t/C(e)k,t for all possible combinations of k and t, obtained from the naive algorithm, formed our ground truth. While running our heuristics for each e, we picked the highest value of C(e)k−1,t/C(e)k,t and set θ to that value. We ran the heuristics for an event code only if θ set in this way was at least 1.5. For each event code, we ran the randomized heuristic 10 times, each time with a different seed for the pseudo-random number generator, noted the number of calls to Algorithm 1 for each run, and calculated the mean (NR), the standard deviation (σ) and the coefficient of variation (CV = σ/NR). Our observations about the number of calls to Algorithm 1 by the naive algorithm (NN), the deterministic heuristic (ND) and the mean for the randomized one (NR) are as follows:

1. For all but one event code, we found NR < ND. The probable reason is that after partitioning the original interval, when the four sub-intervals are unequal, if the smaller interval does not contain (k*, t*), then it has less chance of falsely passing the test of Lemma 3 in line 3 of Algorithm 3. Also, as we discussed in Subsection 5.1, even if the smaller interval passes the test falsely, we are likely to backtrack from it earlier, because its sub-intervals have even less chance of falsely passing the test, and so on. Even for the single event code where ND beats NR, the latter makes only 0.4% more function calls than the former. Depending on the event code, NR is 4% to 70% less than ND.

2. We define the "improvement" by the randomized (IR) and the deterministic (ID) heuristics as IR = NR/NN and ID = ND/NN, which are both plotted in Figure 2. The savings are greater when NN is close to or above a million: the ratio in those cases is 3-11%. In other cases, it is mostly in the range of 40-50%. Hence, the curves for both IR and ID in Figure 2 show a roughly decreasing pattern from left to right.

3. For 28 out of 32 event codes, the CV (σ/NR) of the randomized heuristic is less than 0.1, which implies quite stable performance across runs; hence we would not need multiple runs (to obtain an average) in a real setting, which would otherwise erode the savings obtained by the heuristics. In fact, for 22 of these 28, the CV is less than 0.05. The maximum CV for any event code is only 0.18.

4. We show the time taken (in minutes) for 10·NR + ND function calls for each event code in Figure 3. The time taken increased with the number of function calls, as expected. For 16 out of 32 event codes, the time taken for 10·NR + ND function calls was less than 15 minutes, and for 27 out of 32, it was less than 2 hours. As an example, an event code that took about 19 minutes for 10·NR + ND function calls had NN = 883,080, ND = 35,100 and NR = 24,655, so the 19 minutes plotted in Figure 3 is for 10·24,655 + 35,100 = 281,650 function calls, which is less than 32% of NN.


Fig. 1. The critical threshold pairs (k*, t*) for all the 32 event codes.

References

1. Kleinberg, J.M.: Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery 7(4), 373–397 (2003)
2. Kumar, R., Novak, J., Raghavan, P., Tomkins, A.: On the bursty evolution of blogspace. In: WWW, pp. 568–576 (2003)
3. Lahiri, B., Akrotirianakis, I., Moerchen, F.: Finding critical thresholds for defining bursts in event logs. http://home.eng.iastate.edu/~bibudh/techreport/burst_detection.pdf
4. Vlachos, M., Meek, C., Vagena, Z., Gunopulos, D.: Identifying similarities, periodicities and bursts for online search queries. In: SIGMOD Conference, pp. 131–142 (2004)
5. Wang, X., Zhai, C., Hu, X., Sproat, R.: Mining correlated bursty topic patterns from coordinated text streams. In: KDD, pp. 784–793 (2007)
6. Xu, K., Zhang, Z.L., Bhattacharyya, S.: Reducing unwanted traffic in a backbone network. In: SRUTI (2005)
7. Yuan, Z., Jia, Y., Yang, S.: Online burst detection over high speed short text streams. In: ICCS, pp. 717–725 (2007)
8. Zhang, L., Guan, Y.: Detecting click fraud in pay-per-click streams of online advertising networks. In: ICDCS (2008)
9. Zhang, X., Shasha, D.: Better burst detection. In: ICDE, p. 146 (2006)
10. Zhu, Y., Shasha, D.: Efficient elastic burst detection in data streams. In: KDD, pp. 336–345 (2003)


Fig. 2. The improvements by NR and ND over NN. On the X-axis we have NN (the event codes are sorted by NN); on the Y-axis we have IR and ID. Note that the savings are in general greater for larger values of NN, and the randomized heuristic consistently outperforms the deterministic one. The X-axis is logarithmic.


Fig. 3. X-axis shows the total number of function calls for 1 run of the deterministic and 10 runs of the randomized heuristic for each event code, i.e., 10·NR + ND; Y-axis shows the total time (in minutes) for all these runs. Both axes are logarithmic.