Fast, Small-Space Algorithms for Approximate Histogram Maintenance

Anna C. Gilbert*   Sudipto Guha†   S. Muthukrishnan*   Piotr Indyk‡   Yannis Kotidis*   Martin J. Strauss*
November 12, 2001
Abstract
A vector A of length N is defined implicitly, via a stream of updates of the form "add 5 to A[3]." We give a sketching algorithm that constructs a small sketch from the stream of updates, and a reconstruction algorithm that, from the sketch, produces a B-bucket piecewise-constant representation (histogram) H for A such that ‖A − H‖ ≤ (1 + ε)‖A − H_opt‖, where the error ‖A − H‖ is either the ℓ1 (absolute) or ℓ2 (root-mean-square) error. The time to process a single update, the time to reconstruct the histogram, and the size of the sketch are each bounded by poly(B, log(N), 1/ε). Our result is obtained in two steps. First we obtain what we call a robust histogram approximation for A, a histogram such that adding a small number of buckets does not help improve the representation quality significantly. From the robust histogram, we cull a histogram of desired accuracy and B buckets in the second step. This technique also provides similar results for Haar wavelet representations, under ℓ2 error. Our results have applications in summarizing data distributions fast and succinctly, even in distributed settings.
* AT&T Labs—Research, 180 Park Avenue, Florham Park, NJ 07932 USA, {agilbert, kotidis, muthu, mstrauss}@research.att.com
† Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, [email protected]
‡ MIT Laboratory for Computer Science, 545 Technology Square, NE43-373, Cambridge, Massachusetts 02139-3594, [email protected]
1 Introduction

Histograms are succinct and space-efficient approximations of distributions of numerical values. One often visualizes histograms as a sequence of vertical bars of equal width and height varying from bar to bar. More generally, histograms have varying width as well; that is, they are piecewise-constant approximations of data distributions. Formally, suppose A is a function (or a "distribution" or a "signal") on N points, given on [0, N). A B-bucket histogram H of A is defined by a partition of the domain [0, N) into B intervals (buckets) B_i, as well as by B spline parameters b_i. For any x ∈ [0, N), the value H(x) is equal to the b_i such that x ∈ B_i. Since B is typically (much) smaller than N, this is a lossy representation. The quantity ‖A − H‖_p, where ‖·‖_p is the ℓ_p norm, is the error in approximating A by a B-bucket histogram H. Typically the norms of interest are ℓ1 (average absolute error) and ℓ2 (root-mean-square error).

The branch of mathematics called approximation theory deals with finding representations for functions. Histograms are among the simplest classes of representations, and perhaps the most fundamental. They are the easiest to visualize, and statistical analyses frequently involve histograms. Histograms also find many applications in computer systems. For example, most commercial database engines keep histograms of the various value distributions in a database for optimizing query executions and for approximately processing queries; image processing systems handle color histograms extensively; and so on. Finally, an emerging context we describe in some detail is that of distributed databases on large-scale networks such as the Internet. Routers generate a data stream of logs of the traffic that goes over the various incident links. For real-time traffic control, operators must know traffic patterns at various routers at any moment. However, it is prohibitively bandwidth-expensive to transfer data streams of traffic logs from routers to central monitoring stations on a continuous basis. Hence, histograms may be constructed at routers to summarize the traffic distribution; they are compact and, while not precise, they may suffice for most trend-related analyses. Building histograms at network routers thus saves the distribution cost of transmitting raw data.

Our focus here is on maintaining a histogram representation for dynamic data distributions. Our study is mainly motivated by applications of histograms where data changes rapidly.
Many commercial databases see a large number of transactions, e.g., stock transactions, during a work day. Transactions change the underlying data distribution (e.g., the volume of stock sold per ticker symbol, every minute). The outstanding challenge in using histograms in such transactional databases is to maintain them during these rapid changes. In the context of data streams comprising traffic logs, data distributions change even faster: each IP packet that passes through a router changes the data distribution (e.g., the number of bytes from each IP address passing through that router). In order to capture these traffic distributions using histograms, we inherently need mechanisms to build histograms as the data distribution evolves rapidly.
In all the above contexts, the basic problem is to find a "good" histogram for a given data distribution. For applications involving the visualization of the histogram, finding a good histogram is an art. But in a formal setting, one seeks optimal histograms, that is, ones that minimize ‖A − H‖₁ or ‖A − H‖₂. An optimal (static) histogram can be found in a straightforward way using dynamic programming, taking O(N²B) time and O(NB) space. Formally, the problem of dynamic histogram maintenance is as follows. We are given a function A_0 = 0 on [0, N) and a parameter B. (1) Suppose the j'th operation is update(i, x); then the operation is to add x to A_{j−1}[i], where x is an arbitrary integer parameter (in particular, x may be negative). In the resulting A_j, we have A_j[i] = A_{j−1}[i] + x, and A_j is identical to A_{j−1} otherwise. (2) Suppose the j'th operation is Hist. Then the goal is to output a B-bucket histogram H for A_{j−1} such that ‖A_{j−1} − H‖₁ (or ‖A_{j−1} − H‖₂)
is minimized. As is standard, we must design a data structure that supports these operations with little time (here, polylogarithmic in N). Additionally, we must use working space that is sublinear in N (again, polylogarithmic in N). This is a departure from the standard dynamic data structures literature, and it is motivated by the applications cited above. For example, memory is at a premium in routers, and hence only a few hundred bytes may be allocated for representing data distributions even though N may be 2³² or higher; likewise, transactional databases allocate only a few hundred bytes for maintaining histograms on value distributions, even of product codes where N may be 2³² or larger. Hence, B is likely to be very small; for the same reasons, the working space for maintaining histograms must also be very small with respect to N. Despite its obviously fundamental nature and its importance in many applications, including traditional databases and distributed data streams, no nontrivial algorithms were previously known for this problem. Known (approximate and randomized) algorithms for constructing histograms either face the linear time bottleneck, that is, take time at least linear in N for the Hist operation, or face the linear space bottleneck, that is, use space at least linear in N, or both.
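To fix the semantics, here is a deliberately naive, linear-space reference implementation (ours; the class and method names are illustrative) of the two operations. The whole point of this paper is to support them without storing A explicitly.

# Naive Theta(N)-space baseline that pins down the operation semantics.
class NaiveDynamicArray:
    def __init__(self, n):
        self.a = [0] * n                      # the signal A, stored explicitly

    def update(self, i, x):
        self.a[i] += x                        # "add x to A[i]"; x may be negative

    def error(self, buckets, p=1):
        # buckets: disjoint (start, end, height) triples covering [0, n);
        # returns the l_p error ||A - H|| of the histogram H they define.
        total = 0
        for start, end, height in buckets:
            for i in range(start, end):
                total += abs(self.a[i] - height) ** p
        return total ** (1.0 / p)

The algorithms below replace the array self.a by a sketch of size poly(B, log N, 1/ε) while still answering Hist with nearly optimal error.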
1.1 Our Results

Our main result is the first known algorithm that simultaneously breaks both the linear time and linear space bottlenecks for the problem of dynamic maintenance of histograms. We present:
- An algorithm supporting the Update and Hist operations in space and time poly(B, 1/ε, log N). The reconstruction of the best B-bucket histogram for a Hist query produces one with error at most (1 + ε) times the optimal error, in ℓ1 or ℓ2 norm¹. The result is nearly best possible, since any algorithm for this problem uses Ω(B log(N)) space, and, further, any algorithm whose dependence on N is polylogarithmic uses Ω̃(1/ε) space.
- The algorithm supports not only updates of individual values of A[i], but in fact any linear operation on one or two signals/distributions. This feature allows us to address the issues of distributed data collection that arise in the context of large-scale networks. In particular, we can combine the distribution information from many different sources (e.g., routers) and compute the histogram approximation for the cumulative data. Our techniques also give similar results for Haar wavelet representations and piecewise-linear histograms, but these hold under ℓ2 error only.
1.2 Previous Work

The problem of histogram construction and maintenance is fundamental to databases and therefore has been studied extensively in the past decade. Different optimization versions of histograms have been formulated; see [18] for an overview. Our problem here is known as the V-optimal histogram in the database literature, and it is the one sought after. Heuristic algorithms for building and maintaining histograms have been proposed using sampling [5], wavelets [15], discrete cosine transforms [14], local adjustments [1], etc. None of these approaches gives any provable bounds. The static version of our histogram problem allows no updates. An O(N²B)-time dynamic programming algorithm appears in [12]. In addition, [12] also presented an O((N + B log N) log|OPT|)-time algorithm with a (3, 3)-approximation; that is, using at most 3B buckets and guaranteed to have error at most 3|OPT|. This was improved to a (1 + ε)-approximation in [9], in time O(N + (Bε⁻¹ log N)³). In the data stream model, let us first consider the aggregated, sorted model; here, (static) A[i]'s are scanned in increasing order of i. The result above in [12] works in this data stream model.
¹ Our algorithms succeed with high probability over the algorithm's random choices.
It uses O(B + log ‖A‖₂) space and provides the same (3, 3) guarantee as above. The algorithm of [10] provided a (1 + ε)-approximation preserving the number of buckets, using O(B² log N) space and taking time O(NB² log N). In [6], the authors provided a simple O(B + log N)-space, exact algorithm for finding a B-term wavelet representation; this immediately translates into a histogram using at most B log N buckets and error at most that of the best B-bucket histogram, in space O(B log(N)). The most general data stream model (which we use in this paper) is known as the cash register model [6]. Few algorithms are known for computing in the cash register model: estimating stream norms [2, 3, 11] and clustering [13]. Computing histograms or other representations is significantly more involved than merely estimating the norm, because the identification of the relevant coefficients is crucially needed in our algorithm. Besides [6], [7], and [8], which we discuss next, no other nontrivial results are known. One of the most closely related works is [6], which gives an algorithm for our dynamic problem (in the wavelet formulation), using poly(B, 1/ε, log N) space. Our present work improves [6] in several important respects: in construction time, error bound, and generality of the technique. The algorithm presented here improves a construction time of Ω̃(N) to Õ(poly(B, 1/ε)), improves an additive ε‖A‖ error to ε times the optimal error, and can handle the ℓ1 norm as well. Papers [7] and [8] are experimental evaluations of heuristics for histogram construction motivated by the results of this paper. Both work in the context of distributed data from Internet routers, with the former focused on detecting traffic anomalies and characteristics and the latter on visualizing multidimensional traffic patterns. Both represent our ongoing work to explore the application of the ideas in this paper in a practical context.
1.3 Technical Overview and Notation

A dyadic interval is of the form [i2^j, ..., (i + 1)2^j), for integers i and j. A technical primitive we use in our algorithms is a synopsis data structure for an array that supports updates, identification of dyadic intervals with large projections, estimation of the best spline parameters, and estimation of the norm. Typically a synopsis data structure [4] is defined to be of small space, but we additionally require it to support all necessary computations in small time as well. By "small", we mean a value at most poly(B, log N, 1/ε). The technical crux in building this synopsis data structure is that we are required to sum various ranges of random variables and perform "group testing" on sets of coefficients to identify the large projections. This is described in Section 2.

Using this synopsis data structure, our algorithm proceeds by repeatedly adding to our (partial) histogram the dyadic interval which reduces the error of approximation the most. We run this process until we achieve a stable representation of the signal with poly(B, log N, 1/ε) buckets. This is what we term a robust approximation of the signal, which abstracts the notion that we have extracted all the information available to any B-bucket approximation of the original signal. This procedure in itself (extended to a pair of dyadic intervals) produces a B-term wavelet representation (see Section 3) which minimizes the representation error. To maintain continuity in the presentation, we first present the wavelet result, and subsequently the robust approximation in Section 4.

Finally, in Section 5, we show how to use a robust approximation H_r to produce a B-bucket approximation H. At a high level, we use a dynamic programming argument introduced in [12] for construction of optimal histograms, modified for desired approximations in [10, 9]. But all of these assume knowledge of the exact or approximate value of the error of a histogram when projected on a subinterval. This is not possible to have in a sketch setting, since the sketch is constructed for the entire interval. The sketch may suggest the subintervals with large projection (which we use in the previous section) but cannot evaluate norms projected onto subintervals. For this reason we have to use a fairly sophisticated technique of creating a set of histograms that allows us to add intervals left to right and circumvents the necessity of knowing projections. The argument is similar to hybridization, where we construct our final solution via a series of changes, each adding
an interval at a time. If the error introduced by any of these intervals were significantly more than the error in the optimal solution restricted to the interval in question, we would contradict the robustness we started with. That completes the overview of our algorithm. In what follows, all proofs are deferred to the corresponding section of the appendix.
Notation. A is a vector (or "signal") of length N. All uppercase boldface letters represent vectors. For an interval I ⊆ [0, N), we write (A, I) to denote the projection of the vector A onto the interval I, i.e., (A, I) equals A on I and zero elsewhere. The vector χ_I equals 1 on I and 0 elsewhere. The set of vectors χ_I forms a (highly redundant) basis for histograms. We use the ℓ1 norm except where noted. All results also hold under the ℓ2 norm.
2 Array Sketches
In this section, we give a data structure for a dynamic array A, called an array sketch, that supports identification of dyadic intervals I such that, given a representation H for A, for some c, H + cχ_I is a significantly better representation than H. The structure also gives a near-optimal value for c and an estimate for ‖A‖. The data structure is parametrized by ε_s, η, and N. In this section, "small" means of value at most (log(N)/η)^{O(1)} + (log(N)/ε_s)^{O(1)}, "compact" means of small size, and "quickly" means using a small amount of time. For an understood signal A, let c_I^{opt} denote the c that minimizes ‖A − cχ_I‖.
Definition 1 An (ε_s, η, N)-array sketch of a signal A is a synopsis data structure that represents an array of length N and supports the following operations:
- Update. Given a number c and an interval I, we can quickly compute an array sketch for A + cχ_I.
- Identify. We can return a compact list that contains all dyadic intervals I such that ‖A − c_I^{opt}χ_I‖ ≤ (1 − η)‖A‖, but contains no interval I such that ‖A − c_I^{opt}χ_I‖ > (1 − η/2)‖A‖.
- Estimate norms. Return ‖A‖_s such that ‖A‖ ≤ ‖A‖_s ≤ (1 + ε_s)‖A‖.
- Estimate parameters. Given an interval I, return a value c such that ‖A − cχ_I‖ ≤ (1 + ε_s)‖A − c_I^{opt}χ_I‖.
An array sketch for a signal A under the ℓ_p norm takes the following form. As in [11], choose a random vector V according to a symmetric p-stable distribution. (The 1-stable distribution family is the Cauchy and the 2-stable distribution family is the Gaussian; we want distributions symmetric about zero.) For some set X ⊆ [0, N), to be specified below, a sub-basic sketch of A is ⟨(A, X), V⟩. Keeping V fixed, choose other sets X (specified below) and repeat the process, to get a basic sketch. Finally, generate several independent basic sketches for independent copies of V, to drive down the distortion and the probability of error. An array sketch comprises the several basic sketches. We now specify the sets X and show that the fully specified data structure satisfies the above.
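For intuition, the following toy snippet (ours, not the paper's data structure) implements a sub-basic sketch for X = [0, N) under ℓ1: it maintains the dot products ⟨V, A⟩ with Cauchy vectors and estimates ‖A‖₁ as a median, following [11]. It stores V explicitly, which is exactly what Lemma 4 below avoids by generating V pseudorandomly.

import math, random

def cauchy(rng, width=1.0):
    # Sample C(0, width) by inverting the Cauchy CDF.
    return width * math.tan(math.pi * (rng.random() - 0.5))

class ToyL1Sketch:
    def __init__(self, n, copies=61, seed=1):
        rng = random.Random(seed)
        self.v = [[cauchy(rng) for _ in range(n)] for _ in range(copies)]
        self.dots = [0.0] * copies            # the sketch: one <V, A> per copy

    def update(self, i, x):
        # A[i] += x is a linear update of every dot product.
        for c, vc in enumerate(self.v):
            self.dots[c] += x * vc[i]

    def norm_estimate(self):
        # median |<V, A>| concentrates around ||A||_1 (1-stability of Cauchy).
        vals = sorted(abs(d) for d in self.dots)
        return vals[len(vals) // 2]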
Definition 2 The random variable X is Cauchy-distributed with width w, written X ~ C(0, w), if the density f_w(x) of X is (w/π) · 1/(x² + w²). The width of a Gaussian-distributed random variable is its standard deviation.

Lemma 3 For p = 1 or 2, let X and Y be p-stably distributed random variables with widths w_X and w_Y, respectively. Let Z = X + Y. Then, for each number z, one can sample from the distribution X|Z = z, to k digits of precision, in time (number of operations times precision) (k log(w_X) log(w_Y) log(z) log(N))^{O(1)}.

We will use this lemma with w_X and w_Y at most N, so log(w_X) and log(w_Y) are small. As in [11], with high probability, log(z) is small. Finally, a small number k of bits of precision suffices for our computation.
Lemma 4 (Naor and Reingold [16]) Suppose there is a set X ⊆ [0, N) with a compact description such that, for each interval I, we can quickly compute |X ∩ I|. Then we can construct a sequence of pseudorandom p-stably distributed random variables r_i such that (i) the construction requires small space and (ii) given I, we can quickly compute the range sum Σ_{i∈X∩I} r_i.
For the proof of the above lemma, see Appendix A. The central idea of the lemma is to compute dyadic range sums in a tree and to use a pseudorandom generator for the underlying randomness. For example, suppose X = [0, N), N = 4, and the four random variables are A, B, C, and D. First generate A + B + C + D, then generate A + B conditioned on A + B + C + D, noting that A + B and C + D are Cauchy or Gaussian random variables, so the sums satisfy Lemma 3. This determines C + D. Finally, generate A conditioned on A + B, and C conditioned on C + D, thereby determining B and D. The above is useful even for the degenerate case X = [0, N), as we see next.

Lemma 5 There exists a synopsis data structure that supports the following array sketch operations: Update, Estimate norms, and Estimate parameters.

We now turn to identification. First suppose there is a single overwhelmingly large interval to find. Let H_j = {i : bit j of the binary expansion of i is 1} (the j'th "bit test"), to be used in Lemmas 6 and 8.
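The Gaussian case makes the conditional sampling of Lemma 3 explicit: for independent X ~ N(0, w_X²) and Y ~ N(0, w_Y²) with Z = X + Y, the law of X given Z = z is N(z·w_X²/(w_X² + w_Y²), w_X²w_Y²/(w_X² + w_Y²)). A small sketch of ours of the top-down tree generation; it materializes the whole tree, which is what Nisan's generator lets the paper avoid.

import math, random

def dyadic_range_sums(n, seed=2):
    # Generate unit Gaussians r_0..r_{n-1} through their dyadic sums,
    # root first; sums[(j, k)] = sum of r_i over [j*2^k, (j+1)*2^k).
    rng = random.Random(seed)
    sums = {}
    def gen(j, k, value):
        sums[(j, k)] = value
        if k == 0:
            return
        m = 2 ** (k - 1)                       # points under each child
        # Left child X ~ N(0, m) conditioned on X + Y = value is
        # N(value/2, m/2), since the two children have equal widths.
        left = rng.gauss(value / 2.0, math.sqrt(m / 2.0))
        gen(2 * j, k - 1, left)
        gen(2 * j + 1, k - 1, value - left)    # determined, no new randomness
    k = n.bit_length() - 1                     # n must be a power of two
    gen(0, k, rng.gauss(0.0, math.sqrt(n)))    # root: sum of n units ~ N(0, n)
    return sums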
Lemma 6 Fix j. There is a synopsis data structure for A that supports update and supports finding the (single) dyadic interval I of length 2^j with ‖(A, I)‖ ≥ (2/3)‖A‖, if there is such an I.
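A toy version of the bit-test search (ours; exact projected norms stand in for the sketch estimates of Lemma 5):

def locate_heavy_position(a):
    # a holds one entry with most of the l1 mass (len(a) a power of 2);
    # recover its index with log N bit tests, most significant bit first.
    t = len(a).bit_length() - 1
    total = sum(abs(v) for v in a)
    pos = 0
    for j in reversed(range(t)):
        # H_j = { i : bit j of i is 1 }; compare ||(A, H_j)|| with ||A||/2.
        mass = sum(abs(v) for i, v in enumerate(a) if (i >> j) & 1)
        if mass > total / 2:
            pos |= 1 << j                      # the heavy index has bit j set
    return pos

For example, with a = [0.1]*16 except a[11] = 50, the four tests recover index 11.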
The lemma above gives an "oracle" to find a (dyadic) interval of large energy. Now we show how to reduce the general case (no overwhelmingly large interval) to the above case. This contains the elements of "group testing."

Lemma 7 Let N be a power of 2 and let Y be a field of characteristic 2, with |Y| small. Fix j ≤ log(N). There is a function f_s : [0, N) → Y, parametrized by a short random seed s, such that:
- f_s is constant on dyadic ranges of length 2^j;
- f_s is a uniform pairwise independent mapping on the set of dyadic ranges of length 2^j;
- given an interval I ⊆ [0, N), a bit test H_j, and y ∈ Y, we can quickly compute |{i ∈ I ∩ H_j : f_s(i) = y}|.
Lemma 8 We can construct a synopsis structure, supporting updates, that finds a compact list containing all dyadic intervals I with ‖(A, I)‖ ≥ η‖A‖.
Theorem 9 We can construct an (ε_s, η, N)-array sketch that uses small space and supports its operations in small time.
3 Efficient Wavelet Approximation

In this section, we give an algorithm that finds a (1 + ε)-approximation to the best B-term wavelet representation for the signal A, given only a sketch of A. We use the synopsis data structure from the previous section. The results in this section motivate the discussion of the main result which follows, but they may be of independent interest as well. The results in this section hold for the ℓ2 norm only. (Our main result in the next section works for both ℓ2 and ℓ1.) Let ψ_j denote the j'th wavelet basis vector. The intuition behind this algorithm is that, by Parseval's equality (for A = Σ_j d_j ψ_j we have Σ_j d_j² = ‖A‖²), we simply want to find the biggest B coefficients. We cannot seek them directly, however, since one of the biggest coefficients may be small, and we cannot try to identify all small coefficients since there are too many. Thus the algorithm proceeds greedily.
Initially R = 0, S = ∅, and sketch = sketch(A). Repeat:
1. If A − R = Σ_j d′_j ψ_j, then using sketch find Λ = {j : (d′_j)² > (ε/(4B))‖A − R‖₂²}.
2. If Λ = ∅, or if |S| = k and S ∩ Λ = ∅, exit the loop and output R.
3. For all j ∈ Λ, let d̃_j ← estimate(j). This uses the sketch.
4. If |S| < k, then S ← S ∪ {j′}, where j′ is the index with the largest value of |d̃_j|; set R ← R + d̃_{j′} ψ_{j′}.
5. For each j ∈ S (which may have changed since the start of the loop): sketch ← sketch − d̃_j · sketch(ψ_j).
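A toy rendering of this loop (ours), with exact orthonormal Haar coefficients standing in for the sketch-based set Λ and for estimate(j); by Parseval, the residual energy below equals ‖A − R‖₂².

import math

def haar_coeffs(a):
    # Orthonormal Haar coefficients of a (length a power of two).
    out, s = [], list(a)
    while len(s) > 1:
        out += [(s[i] - s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
        s = [(s[i] + s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
    return out + s

def greedy_b_term(a, b, eps):
    resid = haar_coeffs(a)        # coefficients of A - R; initially R = 0
    chosen = {}                   # j -> coefficient retained in R
    while True:
        energy = sum(c * c for c in resid)     # ||A - R||_2^2
        lam = {j for j, c in enumerate(resid)
               if c * c > (eps / (4 * b)) * energy}
        if not lam or (len(chosen) == b and not (lam & chosen.keys())):
            return chosen
        if len(chosen) < b:
            j = max(lam, key=lambda j: abs(resid[j]))
            chosen.setdefault(j, 0.0)
        for j in lam & chosen.keys():
            chosen[j] += resid[j]              # estimate(j) is exact here
            resid[j] = 0.0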
We show that we achieve a (1 + ε)-approximation to the error of approximating A by a B-term wavelet representation. We relegate the entire proof to the appendix. The proof first shows that there always exists a good B-term wavelet approximation of A whose coefficients come from the index set S. Next we show that we evaluate every coefficient in the index set S to sufficient precision. Efficient algorithms for these are straightforward generalizations of the algorithms in Section 2. See Appendix B for details.
Theorem 10 The above algorithm constructs a (1 + ε)-approximation, ‖A − R‖ ≤ (1 + ε)‖A − R_opt‖, in small time and space.
Any B-bucket histogram can be regarded as a wavelet representation with O(B log(N)) terms, and any B-term wavelet representation can be regarded as a histogram with O(B) buckets. Thus we also immediately have a result for histograms with an O(log(N)) blowup in the number of buckets:

Theorem 11 The best O(B log(N))-term wavelet representation R, found efficiently by the above algorithm, is an O(B log(N))-bucket histogram with error at most (1 + ε) times that of the best B-bucket histogram.
4 Finding a Dyadic Robust Approximation
In this section, we define and show how to find a robust approximation H_r to A with poly(B, log(N), 1/ε) buckets. We use the synopsis data structures from Section 2. As mentioned before, the notion of robustness conceptualizes the fact that we have extracted almost as much information as can be provided by a B-bucket approximation of the signal. The formal definition is:
Definition 12 Given a signal A, a histogram H_r is a (B_r, ε_r)-robust approximation to A if, given any collection X of |X| ≤ B_r non-overlapping intervals, any histogram H which can be expressed as

H = H_r on [0, N) − ∪_{I∈X} I, and H = c_I χ_I on each I ∈ X,

satisfies (1 − ε_r)‖A − H_r‖ ≤ ‖A − H‖.
That is, a robust histogram is not improved much if it is refined by a small number of additional buckets. Note that |X| ≤ B_r is small, but ∪_{I∈X} I can be large, even equal to [0, N). Throughout this section we use the term robust to denote (B_r, ε_r)-robust. We will eventually apply the theorems with B_r = B and ε_r = O(ε/B). Before we start, we note the following straightforward property of the ℓ1 norm.²

² The property is true for ℓ2², and minimizing ℓ2² error is equivalent to minimizing ℓ2 error. We avoid the confusion and problem of subscripts by sticking with ℓ1; the reader can verify that the final result holds also for ℓ2², since intermediate statements hold either for ℓ2 or ℓ2², as appropriate.
Lemma 13 (Decomposability) If H = H′ everywhere except on a non-overlapping set of intervals X, then

‖A − H‖ − ‖A − H′‖ = Σ_{I∈X} (‖(A − H, I)‖ − ‖(A − H′, I)‖).
We show that if H is not a robust approximation to A, it can be improved. This part is similar to the wavelet construction proof of the previous section; we improve the robust histogram by repeatedly identifying a set of large coefficients.
Lemma 14 Given a histogram H which is not robust, there exists an interval I and a parameter c such that the histogram H′ which agrees with H everywhere except I, and takes the value c on I, approximates A better by a fraction 1 − ε_r/(4B_r log N).
Using the array sketch we can construct a list of dyadic intervals I such that, refining H by I (and changing the value only on I), we can improve the representation by a 1 − ε_r/(4B_r log N) fraction. Since the minimum non-zero value is 1/4 and the numbers are polynomially bounded, this process of adding intervals converges in at most O((B_r/ε_r) log² N) rounds. Thus we claim the following lemma.
Lemma 15 Given ε_r, B_r, in time t we can find a (B_r, ε_r)-robust histogram H_r of O((B_r/ε_r) log² N) buckets from an (ε_s, η)-array sketch for A, where t, 1/ε_s, 1/η ∈ poly(B_r, 1/ε_r, log N).
In particular, consider refining H_r by the optimal histogram H_opt with B ≤ B_r buckets (so that none of H_r remains). It follows that ‖A − H_r‖ ≤ ‖A − H_opt‖/(1 − ε_r).
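To summarize Section 4 operationally, here is a small reference loop (ours; an exhaustive search over dyadic intervals stands in for the sketch's Identify and Estimate parameters operations, so it uses linear space and serves only to illustrate the refinement process of Lemma 14):

def dyadic_intervals(n):
    length = 1
    while length <= n:
        for start in range(0, n, length):
            yield start, start + length
        length *= 2

def robust_histogram(a, rounds):
    # Repeatedly overwrite the dyadic interval whose best constant
    # (the median, which is l1-optimal) reduces the error the most.
    h = [0.0] * len(a)
    for _ in range(rounds):       # O((B_r/eps_r) log^2 N) rounds suffice
        best = None
        for s, e in dyadic_intervals(len(a)):
            c = sorted(a[s:e])[(e - s) // 2]
            gain = sum(abs(a[i] - h[i]) - abs(a[i] - c) for i in range(s, e))
            if best is None or gain > best[0]:
                best = (gain, s, e, c)
        gain, s, e, c = best
        if gain <= 0:
            break                 # no single dyadic refinement helps: robust
        h[s:e] = [c] * (e - s)
    return h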
5 A (1 + ε)-Approximation Histogram Construction
Before we present the approximation algorithm, we recall an optimal dynamic program for histogram construction, which uses O(N) space and O(N²B) time. The algorithm stores, for each x ∈ [0, N) and k < B, the best histogram with k buckets which approximates the signal on the interval [0, x). Using this, it constructs the best (k + 1)-bucket approximation for a particular x ∈ [0, N): the best (k + 1)-bucket histogram extends some k-bucket histogram on [0, x′), and, by optimality, this k-bucket histogram has to be the best possible k-bucket histogram for [0, x′). Since the algorithm is allowed O(N) space, it can store the signal and appropriate prefix sums to compute the best value to attribute to a bucket. Intuitively, the dynamic program constructs the best approximations of subintervals and tries to extend them. In that sense, for a fixed k, we call {best k-bucket histogram on [0, x) | 0 ≤ x < N} a k-extension-basis (briefly, a k-basis), since it supports constructing the best (k + 1)-bucket histogram on [0, N) for any x. Although not explicitly stated, this idea of constructing an extension basis was considered in [10], but for a weaker non-dynamic model.

We will proceed along similar lines, but the similarity is restricted to the high-level idea of constructing an extension basis and performing extensions to any [0, x). Since we are allowed poly(log N, 1/ε, B) space, our extension basis has to be small—we construct {best k-bucket histogram on [0, x_i) | i = 0, 1, ..., ℓ}, for ℓ ∈ poly(log N, 1/ε, B), and show that we can still (approximately) construct a (k + 1)-basis from a k-basis. Furthermore, assigning the best spline parameter to an interval is approximate (through sketches). More complications follow, since we can only estimate the norm of the signal on the entire interval [0, N); there are too many prefixes [0, x) to store sketches on each in order to estimate norms on each. We first show the required properties that allow extension, assuming we are allowed to evaluate errors exactly. We then show how to use this to achieve the desired (1 + ε)-approximation under the assumption that we get a (1 + ε_s)-approximation to the error. Finally, we show how to construct a small extension basis.
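A minimal implementation of this dynamic program (ours), under squared ℓ2 error, with prefix sums making each bucket cost O(1) and the whole table costing O(N²B) time:

import math

def optimal_histogram_error(a, b):
    n = len(a)
    p1 = [0.0] * (n + 1)          # prefix sums of a
    p2 = [0.0] * (n + 1)          # prefix sums of a^2
    for i, v in enumerate(a):
        p1[i + 1] = p1[i] + v
        p2[i + 1] = p2[i] + v * v

    def cost(l, r):               # min_c sum_{l <= i < r} (a[i] - c)^2
        if l == r:
            return 0.0
        s = p1[r] - p1[l]
        return (p2[r] - p2[l]) - s * s / (r - l)

    # dp[k][x] = best error of a k-bucket histogram on the prefix [0, x)
    dp = [[math.inf] * (n + 1) for _ in range(b + 1)]
    dp[0][0] = 0.0
    for k in range(1, b + 1):
        for x in range(1, n + 1):
            dp[k][x] = min(dp[k - 1][y] + cost(y, x) for y in range(x))
    return math.sqrt(dp[b][n])    # l2 error of the best B-bucket histogram

The extension-basis construction below keeps only poly(B, log N, 1/ε) of the N prefixes that this table ranges over.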
5.1 The Extension Basis and Recursive Extension

Define a collection of histogram-interval pairs {(H_k^i, I_i)} to be a δ-separated k-basis of size ℓ, with error parameters ε_k and E(k), if the following holds:
1. ∅ = I_0, [0, 1) = I_1, ..., I_ℓ = [0, N), and 0 ∈ I_i for all i ≥ 1 ({I_i} defines a sequence of prefixes of I_ℓ).
2. Each H_k^i is a k-bucket histogram on I_i and is equal to H_r on I_ℓ − I_i.
3. ‖A − H_k^i‖ ≤ ‖A − H_k^{i−1}‖ + δ if I_i extends I_{i−1} by more than one element, i.e., if |I_i| > |I_{i−1}| + 1.
4. ERROR PROPERTY: Each H_k^i is a good k-bucket approximation for A amongst all histograms H that have k buckets in I_i and agree with H_r on I_ℓ − I_i; that is, ‖A − H_k^i‖ ≤ (1 + ε_k)‖A − H‖ + E(k).
The reason why we declare the above to be an extension basis is immediate from the following construction procedure for a nearly optimal histogram H_sol that approximates A on the interval [0, x) while using at most k + 1 buckets in [0, x). In fact, it is immediate that if we could generate such an approximation, we could generate a δ-separated (k + 1)-basis by iterating through the values of x and retaining the necessary histograms. Of course, iterating through [0, N) is prohibitive; we will show how to construct an extension basis efficiently, but first we need to show a construction process that preserves the ERROR PROPERTY for H_sol with a small error E(k + 1). Recall that we can only compare histograms that are defined on the entire domain [0, n). Consider the following construction procedure:

1. Let I = [0, x).
2. For each i ≤ ℓ:
   (a) Consider extending H_k^i to H_i by adding a bucket I − I_i.
   (b) H_i agrees with H_k^i on I_i, has the value c_i on I − I_i, and agrees with H_r on [0, n) − I.
   (c) c_i is chosen such that ‖A − H_i‖ is minimized over all choices of c_i.
3. Pick j minimizing ‖A − H_j‖ over all choices j ≤ ℓ (so ‖A − H_j‖ ≤ ‖A − H_i‖ for any i).
4. Set H_sol = H_j.

Lemma 16 If H is a histogram with at most k + 1 buckets intersecting the interval I, and H agrees with H_r on I_ℓ − I, then

‖A − H_sol‖ ≤ (1 + ε_k)‖A − H‖ + E(k) + δ + (1 + ε_k)ε_r‖A − H_r‖.

Starting with k = 0, which is H_r, we inductively construct the (B − 1)-basis, and therefore approximate the best B-bucket histogram. The final solution is H_sol when k + 1 = B and x = N. Before we describe how to construct the extension basis, there is an issue we need to deal with: we do not know how to evaluate ‖·‖, but we can compute ‖·‖_s. To use ‖·‖_s we have to evaluate any histogram linear combination involving A, i.e., the function A − H, on the entire domain [0, N). This is because we only have access to sketches on prescribed intervals, and [0, N) is the only useful interval on which to keep sketches in this context (there are too many prefixes [0, x) to keep sketches on a useful number of them). In fact, this is why the construction process (and the δ-separability in the extension basis) is defined on the whole domain [0, N), via hybridization with H_r, rather than on prefixes of [0, N). The following is a variant of the previous lemma.

Lemma 17 In the above construction procedure, if we use ‖·‖_s instead of ‖·‖, then

‖A − H_sol‖ ≤ (1 + ε_s)[(1 + ε_k)‖A − H‖ + E(k) + δ + (1 + ε_k)ε_r‖A − H_r‖].
5.2 The Final Approximation Guarantee

Suppose we start from H_r. Given x_i we can construct H_1^i such that any other parameter setting on [0, x_i) gives a suboptimal sketch evaluation; so for any histogram H with 1 bucket on [0, x_i),

‖A − H_1^i‖ ≤ (1 + ε_s)‖A − H‖.

This shows ε_1 = ε_s and E(1) = 0. We can easily enumerate over all x_i and construct an extension basis of size N. After constructing the H_1^i, we recursively construct the H_2^i, H_3^i, etc. This constructs an extension basis with δ = 0 and ℓ = N. To construct a single H_{k+1}^j from {(H_k^i, I_i)}, we would take O(ℓ) sketch evaluations, for the minimization. Then, to construct {H_{B−1}^i}, which would give us the solution, we would make O(ℓ²B) sketch minimizations. We will show how to construct a small extension basis later, where ℓ = O(poly(B, log N, 1/ε)). But first we prove the quality of the answer we get using the recursive extension and the extension basis that we construct.

Lemma 18 The best B-bucket approximation H_sol to A on [0, N) produced from a δ-separated extension basis, for δ = ε|OPT|/B, has error ‖A − H_sol‖ ≤ (1 + O(ε))|OPT|.
5.3 The Construction of a Small Extension Basis

We now show how to construct a small extension basis, of size poly(B, log N, 1/ε). First let us assume that we know |OPT| up to a factor between 1 and 1 + ε, by exhaustively trying powers of (1 + ε) as candidates for |OPT|. Recall that a δ = 0-separated extension basis can have N elements, which is prohibitively large, so we non-trivially use the fact that δ = ε|OPT|/B. Let δ′ = 2δ. Let H_1(x) be the extension basis element we would generate if I = [0, x). After constructing H_1^i, we evaluate the possibility of H_1^{i+1} = H_1(x_i + 1). Three cases can happen. If ‖A − H_1(x_i + 1)‖_s ≤ ‖A − H_1^i‖_s + δ′, set x_{i+1} = x_i + 1. In the second case, if ‖A − H_1(N)‖_s ≤ ‖A − H_1^i‖_s + δ′, then set x_{i+1} = N. Otherwise, perform a bisection search on [x_i + 1, N), and set x_{i+1} = x and x_{i+2} = x + 1, where x satisfies

‖A − H_1(x)‖_s ≤ ‖A − H_1^i‖_s + δ′ and ‖A − H_1(x + 1)‖_s > ‖A − H_1^i‖_s + δ′.

Lemma 19 To achieve ℓ = O(B/ε), it is sufficient to construct H_k^i where ‖A − H_k^i‖_s ≤ (3/2)|OPT|.

The same procedure is repeated for constructing a k-basis with k ≥ 2. Therefore, using our robust histogram, with construction in time O(poly(B, log(N), 1/ε)), we can claim the following:

Theorem 20 In time and space O(poly(B, log(N), 1/ε)) we can construct a (1 + ε)-approximation of the best B-bucket histogram.
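The bisection step above only uses that x ↦ ‖A − H₁(x)‖_s is nondecreasing; here is a generic rendering (ours, with err standing in for that sketch evaluation):

def next_breakpoint(err, x_i, n, threshold):
    # Find x in [x_i + 1, n) with err(x) <= threshold < err(x + 1),
    # assuming err is nondecreasing: O(log N) evaluations of err.
    lo, hi = x_i + 1, n - 1
    if err(lo) > threshold:
        return None               # even a one-element extension overshoots
    while lo < hi:                # invariant: err(lo) <= threshold
        mid = (lo + hi + 1) // 2
        if err(mid) <= threshold:
            lo = mid
        else:
            hi = mid - 1
    return lo                     # then x_{i+1} = lo and x_{i+2} = lo + 1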
Acknowledgment We are very grateful to Moni Naor and Omer Reingold for providing a general construction for Lemma 4 and for generously allowing us to include our variant of the construction here. Although published results [6] give a construction for Bernoulli random variables (which can substitute for Gaussian random variables, leading to our result under ℓ2 error), the Naor-Reingold construction is much more elegant than other constructions. Furthermore, theirs is the only known construction for Cauchy random variables, which is needed for our result under ℓ1 error.
References

[1] Ashraf Aboulnaga, Surajit Chaudhuri. Self-tuning Histograms: Building Histograms Without Looking at Data. Proceedings of the ACM SIGMOD Conference, 1999, 181–192.
[2] Noga Alon, Yossi Matias, Mario Szegedy. The Space Complexity of Approximating the Frequency Moments. JCSS 58(1): 137–147, 1999.
[3] Joan Feigenbaum, Sampath Kannan, Martin Strauss, Mahesh Viswanathan. An Approximate L1-Difference Algorithm for Massive Data Streams. FOCS 1999, 501–511.
[4] Phillip B. Gibbons, Yossi Matias. Synopsis Data Structures for Massive Data Sets. SODA 1999, 909–910.
[5] Phillip B. Gibbons, Yossi Matias, Viswanath Poosala. Fast Incremental Maintenance of Approximate Histograms. VLDB 1997, 466–475.
[6] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB 2001, 79–88.
[7] Anna C. Gilbert, Yannis Kotidis, S. Muthukrishnan, Martin Strauss. QuickSAND: Quick Summary and Analysis of Network Data. DIMACS Technical Report 2001-43.
[8] Sudipto Guha, Piotr Indyk, Nick Koudas, Nitin Thaper. Dynamic Multidimensional Histograms. Submitted to SIGMOD 2002.
[9] Sudipto Guha, Nick Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. ICDE 2002.
[10] Sudipto Guha, Nick Koudas, Kyuseok Shim. Data-streams and histograms. STOC 2001, 471–475.
[11] Piotr Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation. FOCS 2000, 189–197.
[12] H. V. Jagadish, Nick Koudas, S. Muthukrishnan, Viswanath Poosala, Kenneth C. Sevcik, Torsten Suel. Optimal Histograms with Quality Guarantees. VLDB 1998, 275–286.
[13] Flip Korn, S. Muthukrishnan, D. Srivastava. Reverse Nearest Neighbor Aggregates over Data Streams. Manuscript, November 2, 2001.
[14] J.-H. Lee, D.-H. Kim, C.-W. Chung. Multi-dimensional selectivity estimation using compressed histogram information. SIGMOD 1999, 205–214.
[15] Yossi Matias, Jeffrey Scott Vitter, Min Wang. Dynamic Maintenance of Wavelet-Based Histograms. VLDB 2000, 101–110.
[16] Moni Naor, Omer Reingold. Private communication, March 1999.
[17] Noam Nisan. Pseudorandom Generators for Space-Bounded Computation. STOC 1990, 204–212.
[18] Viswanath Poosala. Histograms for selectivity estimation. PhD Thesis, University of Wisconsin, Madison, 1997.
Appendix

A Proofs in Array Sketch Construction
Lemma 3 For p = 1 or 2, let X and Y be p-stably distributed random variables with widths w_X and w_Y, respectively. Let Z = X + Y. Then, for each number z, one can sample from the distribution X|Z = z, to k digits of precision, in time (number of operations times precision) (k log(w_X) log(w_Y) log(z) log(N))^{O(1)}.

Proof: We consider the Cauchy case. The conditional density of X is
f(x) = f_{w_X}(x) f_{w_Y}(z − x) / f_{w_X+w_Y}(z) = (w_X w_Y / (π(w_X + w_Y))) · ((w_X + w_Y)² + z²) / ((w_X² + x²)(w_Y² + (z − x)²)).

This is a function of x whose coefficients are polynomially bounded in w_X, w_Y, and z. Since f is a rational function of x, the indefinite integral F(x) of f(x) can be computed explicitly. By [11], it thus suffices to pick r ∈ [0, 1] uniformly at random to (k log(w_X) log(w_Y) log(z) log(N))^{O(1)} bits of precision and solve F(x) = r for x by the bisection method. Also by [11], we can compute k bits of x in time (k log(w_X) log(w_Y) log(z) log(N))^{O(1)}. A similar statement holds with Gaussian random variables replacing Cauchy random variables; in that case, the conditional distribution X|Z = z is itself a shifted and scaled Gaussian.

Lemma 4 (Naor and Reingold [16]) Suppose there is a set X ⊆ [0, N) with a compact description such that, for each interval I, we can quickly compute |X ∩ I|. Then we can construct a sequence of pseudorandom p-stably distributed random variables r_i such that
- the construction requires small space;
- given I, we can quickly compute Σ_{i∈X∩I} r_i.
Proof: We first give an ideal algorithm that uses too much space; we then show how to reduce the space using Nisan's generator. Specifically, we will store a tree S of p-stable random variable outcomes (with widths of our choosing—not necessarily unit widths) that, ultimately, can be generated by the generator. We give the proof for Cauchy random variables.

We construct dyadic range sums of A in a tree, in depth-first search order. That is, we first compute |X|, then generate and store an outcome for r_0^{log(N)} = Σ_{i∈X} r_i ~ C(0, |X|). Next, we compute |X ∩ [0, N/2)| and generate and store an outcome for r_0^{log(N)−1} = Σ_{i∈[0,N/2)∩X} r_i, from the distribution conditioned on the previously stored value for Σ_{i∈X} r_i, using Lemma 3. These two values determine r_1^{log(N)−1} = Σ_{i∈[N/2,N)∩X} r_i (without further randomness). Continuing this way, we store outcomes for r_j^k, for 0 ≤ k ≤ log(N) and 0 ≤ j < N/2^k, with r_j^k = r_{2j}^{k−1} + r_{2j+1}^{k−1}.

Given a dyadic interval I, we can recover Σ_{i∈X∩I} r_i from S by descending the tree. Let I_0 = I and, for each I_m, let I_{m+1} be the dyadic interval containing I_m; suppose the M'th interval I_M is [0, N). Then we can recover Σ_{i∈X∩I_M} r_i directly from S; given that, we can recover Σ_{i∈X∩I_{M−1}} r_i from the stored outcome conditioned on Σ_{i∈X∩I_M} r_i, and so on, eventually getting Σ_{i∈X∩I} r_i, as desired, via a telescoping chain of conditional samples. For an arbitrary I (not necessarily dyadic), partition I into dyadic intervals and proceed as above.

Finally, instead of storing the outcomes in S as above, we use Nisan's generator, which is secure against small-space tests. We then need to verify that our algorithm can be regarded as a small-space test of the randomness. That is, one needs to verify that our overall algorithm, which has one-way access to the data and two-way access to its source of randomness, could instead be implemented in small space given one-way access to the randomness and two-way access to the data. Since the two implementations always compute exactly the same output, and since, with high probability, the latter implementation produces the same output on true randomness as on pseudorandomness, the pseudorandom generator suffices. Roughly speaking, our algorithm needs to compute ⟨V, A⟩. Given only one-way access to the randomness that generates S (in depth-first search order) and two-way access to the updates that define A, proceed as follows. Using log(N) space, generate and store

r_0^{log(N)}, r_0^{log(N)−1}, ..., r_0^0 = r_0.

Then make a pass over the data (a stream of updates), compute a_0 r_0, and, using two-way access to the stream of updates, return the read head to the beginning of the stream. Next, compute r_1^0 = r_1 as r_0^1 − r_0^0 = (r_0 + r_1) − r_0, and discard r_0. Return to the data, compute a_1 r_1, and accumulate it with a_0 r_0 toward Σ_i a_i r_i. Next, to generate r_2, first compute r_2 + r_3 as (r_0 + r_1 + r_2 + r_3) − (r_0 + r_1) from stored values, discard (r_0 + r_1) and store (r_2 + r_3), then generate and store r_2 conditioned on r_2 + r_3. With r_2, compute and accumulate a_2 r_2. Continuing in this way, we compute ⟨V, A⟩ using only one-way access to S and small space. Modification to accommodate the set X is straightforward.

Lemma 5 There exists a synopsis data structure that supports the following array sketch operations: (1) Update; (2) Estimate norms; (3) Estimate parameters.
Proof: Define the synopsis data structure for A as multiple independent copies of ⟨V, A⟩, where V is drawn from a p-stable distribution. Item 1 follows by construction. (For updating, observe that, by linearity of the dot product, we just need the sketch c⟨V, χ_I⟩ of cχ_I to add to an existing sketch of a signal.) Item 2 is from [11]. We turn now to item 3. Let h = h(A) = BA, where B is a projection matrix. By Lemma 4 with X = [0, N), we can quickly compute Bχ_I, and thus also h′(c) = B(A − cχ_I), carried to any small number of bits of precision, which suffices. In the following we show how to find c minimizing

median(|h′(c)₁|, ..., |h′(c)_d|),

where h′(c)_j is the j'th independent copy of a basic sketch. By [11], this implies the lemma. To this end, observe that h′(c) is a linear function of c. Therefore, we need to solve the following optimization problem:

min_{0≤c≤R} median(|a₁c + b₁|, ..., |a_d c + b_d|).

This problem can be trivially solved in O(d³) time as follows. First, we compute all points c such that |a_i c + b_i| = |a_j c + b_j| for i ≠ j; there are at most d² such points. These points split the range [0, R] into at most d² + 1 intervals i₁, ..., i_t. Observe that if we restrict c to any i_j = [l_j, r_j], the problem amounts to finding the median of the values min(|a_i l_j + b_i|, |a_i r_j + b_i|), for i = 1, ..., d, which can be done in O(d) time. Therefore, the problem can be solved in O(d³) time by enumerating all intervals i_j.
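A direct rendering of this enumeration (ours; O(d³ log d) rather than the O(d³) achievable with more care):

def minimize_median(lines, r):
    # lines: [(a_i, b_i)]; minimize median(|a_i*c + b_i|) over c in [0, r].
    d = len(lines)
    cands = {0.0, float(r)}
    for i in range(d):
        for j in range(i + 1, d):
            (ai, bi), (aj, bj) = lines[i], lines[j]
            for s in (1.0, -1.0):          # |ai c + bi| = +/-(aj c + bj)
                if ai - s * aj != 0:
                    c = (s * bj - bi) / (ai - s * aj)
                    if 0.0 <= c <= r:
                        cands.add(c)
    def med(c):
        vals = sorted(abs(a * c + b) for a, b in lines)
        return vals[d // 2]
    # The line ordering is fixed between consecutive crossing points, so
    # the candidates plus interval midpoints suffice.
    xs = sorted(cands)
    pts = xs + [(u + v) / 2 for u, v in zip(xs, xs[1:])]
    return min(pts, key=med)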
Lemma 6 Fix j. There is a synopsis data structure for A that supports update and supports finding the (single) dyadic interval I of length 2^j with ‖(A, I)‖ ≥ (2/3)‖A‖, if there is such an I.
Proof: We show the lemma for j = 0; other cases are similar. Let H_j = {i : bit j of the binary expansion of i is 1}. We call H_j the j'th "bit test." For 0 ≤ j < log N, use the norm estimation data structure of Lemma 5 on the projected signals (A, H_j). Also use the norm estimation data structure on A itself. We find the position of the interval I (of length 1 in this example) bit by bit. To find the most significant bit, compare ‖(A, H_{log(N)−1})‖ with ‖A‖. If ‖(A, H_{log(N)−1})‖_s > (1/2)‖A‖_s for reasonably good estimates, then ‖(A, H_{log(N)−1})‖ > (1/3)‖A‖, and we conclude I ⊆ H_{log(N)−1} = [N/2, N); otherwise, I ⊆ [0, N/2). In general, the j'th test yields the j'th bit of the position of I.
Lemma 7 Let N be a power of 2 and let Y be a field of characteristic 2, with |Y| small. Fix j ≤ log(N). There is a function f_s : [0, N) → Y, parametrized by a short random seed s, such that:
- the function f_s is constant on dyadic ranges of length 2^j;
- the seed s is short;
- the function f_s, regarded as a function on the set of dyadic ranges of length 2^j, is a uniform pairwise independent mapping;
- given s, given y ∈ Y, given an interval I ⊆ [0, N), and given a bit test H_j, one can quickly compute |{i ∈ I ∩ H_j : f_s(i) = y}|.
Proof: We illustrate the statement for j = 0. Other cases are similar. Let H be the extended Hamming matrix of length N , gotten by adding a row of 1’s to the collection of all columns of height log(N ), in order. For example, for N = 16, H is
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
Let s be a vector of length log(N) + 1 over Y, and let f_s(i) be the i'th element of the vector-matrix product sH over Y. The claimed properties are straightforward to check (see, e.g., [3]).
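In code (ours), with elements of Y represented as integer bit masks so that field addition is XOR, f_s(i) is the seed-entry sum selected by the column (1, bits of i); shifting away the low j bits makes f_s constant on dyadic ranges of length 2^j.

def make_fs(seed, log_n, j=0):
    # seed: log_n + 1 random bit masks (elements of Y); f_s = s * H over GF(2).
    def fs(i):
        i >>= j                    # constant on dyadic blocks of length 2^j
        y = seed[0]                # the all-ones row of H always contributes
        for t in range(log_n):
            if (i >> t) & 1:
                y ^= seed[t + 1]   # row t+1 contributes when bit t of i is set
        return y
    return fs

Pairwise independence follows because distinct columns of H differ in at least one row; the paper additionally computes the counts |{i ∈ I ∩ H_j : f_s(i) = y}| in closed form rather than by enumeration.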
Definition 21 Fix a signal A. Pre-identification consists of finding a compact list that contains all dyadic intervals I for which ‖(A, I)‖ ≥ η‖A‖.
Lemma 8 There is a synopsis data structure that supports update and pre-identification.

Proof: We show the lemma for (dyadic) intervals of length 1; the other log(N) cases are similar. Let Y be a field of characteristic 2 and size Ω(1/η). Pick a seed s ∈ Y^{log(N)+1} at random and fix y ∈ Y. Let X = {i : f_s(i) = y}. We show that, with high probability, if X contains a position i with ‖(A, {i})‖ ≥ η‖A‖, then ‖((A, X), {i})‖ = ‖(A, {i})‖ ≥ (2/3)‖(A, X)‖.

To see this, observe that, by pairwise independence and linearity of expectation,

E[‖(A, X \ {i})‖ | i ∈ X] = E[‖(A, X \ {i})‖] ≤ E[‖(A, X)‖] ≤ (η/8)‖A‖,

if |Y| = Ω(1/η) is large enough. Thus, with probability at least 3/4,

‖(A, X \ {i})‖ ≤ (η/2)‖A‖, conditioned on i ∈ X.

Since ‖(A, {i})‖ ≥ η‖A‖, it follows that (A, X) has a single position with 2/3 of the total energy, if it contains any position of energy at least η‖A‖. For each seed s, the y ∈ Y induce a partition of [0, N); namely, [0, N) = ∪_y {i : f_s(i) = y}. Thus, for each position i with an η fraction of the signal energy, there is some y, namely y = f_s(i), such that i ∈ {i : f_s(i) = y}. The position i is identified for this choice of y. By boosting the probability of success from 3/4 to 1 − η/4 through repetition, it follows that all intervals are isolated and pre-identified this way. Note that X ∩ H_j satisfies the properties of Lemma 4, where X = {i : f_s(i) = y} and H_j is a bit test of Lemma 6. It follows that there is a data structure that supports update and norm estimation of (A, X ∩ H_j), whence support for pre-identification follows.
Theorem 9 We can construct an (ε_s, η, N)-array sketch that uses small space and supports its operations in small time.

Proof: By Lemma 5, it remains only to give a structure that supports update and identification. The array sketch of a signal A is defined as follows. Pick a random seed s. Let Y be a finite field of characteristic 2 with approximately 1/η² elements. For each y ∈ Y and each j < log(N), sketch (A, {i : f_s(i) = y} ∩ H_j). For each y, also sketch (A, {i : f_s(i) = y}). The sketch of a projected signal is the dot product with V, generated via the tree construction of conditional samples by Nisan's generator [17]. Support for update follows by construction. By Lemma 8, we can find a compact list of intervals that includes all intervals I with ‖(A, I)‖ ≥ η‖A‖. Observe that ‖A‖ − ‖A − cχ_I‖ ≤ ‖(A, I)‖, so the compact list of intervals contains all I such that ‖A − c_I^{opt}χ_I‖ ≤ (1 − η)‖A‖; it might also contain other I's, with ‖A − c_I^{opt}χ_I‖ > (1 − η/2)‖A‖. By Lemma 5, for each I on the list, we can find a parameter c such that ‖A − c_I^{opt}χ_I‖ ≤ ‖A − cχ_I‖ ≤ (1 + ε_s)‖A − c_I^{opt}χ_I‖, thereby estimating ‖A − c_I^{opt}χ_I‖ to within ε_s‖A‖. This and an estimate ‖A‖_s for ‖A‖ are sufficient to select a set of I's satisfying the desired properties, provided ε_s ≤ η/8.
B Proofs of Section 3

Assume ε < 1 and (1 + ε_s)² ≤ (1 + ε)(1 − ε_s)²; this is true if ε_s < ε/8. For ease of readability we repeat the greedy algorithm.

Initially R = 0, S = ∅, and sketch = sketch(A). Loop:
1. Using sketch (which is the sketch of A − R), find Λ, the index set of the large wavelet coefficients of A − R. If A − R = Σ_j d′_j ψ_j, then Λ contains {j : (d′_j)² > (ε/(4B))‖A − R‖₂²} and is contained in {j : (d′_j)² > (ε/(8B))‖A − R‖₂²}.
2. If Λ = ∅, or if |S| = k and S ∩ Λ = ∅, exit the loop and output R.
3. For all j ∈ Λ, let d̃_j ← estimate(j). This uses the sketch.
4. If |S| < k, then S ← S ∪ {j′}, where j′ is the index with the largest value of |d̃_j|; set R ← R + d̃_{j′} ψ_{j′}.
5. For each j ∈ S (which may have changed since the start of the loop): sketch ← sketch − d̃_j · sketch(ψ_j).

Let A = Σ_j d_j ψ_j and R = Σ_{j∈S} d̃_j ψ_j. For any j′, we estimate the coefficient to satisfy

(1 − ε_s)|d′_{j′}| ≤ |estimate(j′)| ≤ (1 + ε_s)|d′_{j′}|.   (1)

Lemma 22 At termination of the above algorithm, for any i ∉ S and j ∈ S, we have |d_i| ≤ ((1 + ε_s)/(1 − ε_s))|d_j|.
A R
i)j jestimate(j )j (1 + s) jd j. i, estimate(i) estimate(j ) and from Equation 1 jdi j jestimate( (1 ? ) (1 ? ) (1 ? ) j s
We will show it is sufficient to concentrate on the indices in S. Let R′ = Σ_{j∈S} d_j ψ_j (using exact coefficients).

Lemma 23 If |S| < B, then ‖A − R′‖₂² ≤ ‖A − R_opt‖₂² + (ε/4)‖A − R‖₂².
Proof: If |S| < B, then we could have added i but did not; thus i ∉ Λ when we decided to output. Therefore d_i² ≤ (ε/(4B))‖A − R‖₂². Let S_opt be the set of basis vectors in R_opt. Thus,

Σ_{i∈S_opt−S} d_i² ≤ |S_opt − S| · (ε/(4B))‖A − R‖₂² ≤ (ε/4)‖A − R‖₂².

Therefore, ‖A − R′‖₂² = Σ_{j∉S} d_j² ≤ Σ_{j∉S_opt} d_j² + Σ_{j∈S_opt−S} d_j² ≤ ‖A − R_opt‖₂² + (ε/4)‖A − R‖₂².

Lemma 24 ‖A − R′‖₂² ≤ (1 + ε)‖A − R_opt‖₂² + (ε/2)‖A − R‖₂².

Proof: Once again let S_opt be the set of basis vectors in R_opt. If |S| < B, then the previous lemma applies. In case |S| = B = |S_opt|, every i ∈ S_opt − S can be matched to an index m(i) ∈ S − S_opt. Using Lemma 22,

d_i² ≤ ((1 + ε_s)/(1 − ε_s))² d_{m(i)}² ≤ (1 + ε) d_{m(i)}².

Summing over all such pairs,

Σ_{i∈S_opt−S} d_i² ≤ (1 + ε) Σ_{i∈S−S_opt} d_i².

By adding terms not in S or S_opt, ‖A − R′‖₂² = Σ_{i∉S} d_i² ≤ (1 + ε) Σ_{i∉S_opt} d_i² = (1 + ε)‖A − R_opt‖₂².
Thus there exists a "good" approximation if we concentrate on the indices in S. We now show that we compute the coefficients accurately.

Theorem 10 The above algorithm constructs a (1 + ε)-approximation, ‖A − R‖ ≤ (1 + ε)‖A − R_opt‖.

Proof: Let A − R = Σ_j d̂_j ψ_j. If j ∉ S then d̂_j = d_j. Since when we output R we had Λ ∩ S = ∅, for j ∈ S we have d̂_j² at most (ε/(4B))‖A − R‖₂². Therefore,

‖A − R‖₂² = Σ_j d̂_j² = Σ_{j∉S} d_j² + Σ_{j∈S} d̂_j² ≤ Σ_{j∉S} d_j² + B · (ε/(4B))‖A − R‖₂².

This evaluates to ‖A − R‖₂² ≤ ‖A − R′‖₂² + (ε/4)‖A − R‖₂². Using the previous lemma, ‖A − R‖₂² ≤ (1 + ε)‖A − R_opt‖₂² + (3ε/4)‖A − R‖₂². Rewriting the equation and taking the square root, since (1 − 3ε/4)⁻¹(1 + ε) = 1 + O(ε), the theorem follows after rescaling ε.
We are almost done, except we need to prove that the above algorithm terminates in few steps.

Lemma 25 The above algorithm terminates in at most O(B log_{1/ε_s} n) steps.

Proof: Suppose in some iteration we update the coefficient j. Let d′_j and d″_j be the coefficients before and after the update, respectively. From Equation (1) it follows that |d″_j| = |d′_j − estimate(j)| ≤ ε_s|d′_j|. Thus after j has been chosen for updating 2 log_{1/ε_s}(nM) + 1 times, the coefficient d′_j of ψ_j in A − R will satisfy

(d′_j)² ≤ (ε_s)^{2 log_{1/ε_s}(nM)+1} d_j² ≤ (ε_s/(n²M²)) d_j²,

where d_j is the coefficient of ψ_j in A and M is the largest integer used in any representation. Since |d_j| < M and n > B (assuming n > 16 and ε_s < ε/8), we can bound (d′_j)² by ε/(8n²). Since ‖A − R‖₂² ≥ 1/16, the square of the coefficient j will then not be larger than (ε/(8B))‖A − R‖₂², and therefore j will not be evaluated to be a large coefficient. Since we will choose at most B coefficients, the result follows.

C Proofs in Section 4

Lemma 14 Given a histogram H which is not robust, there exists an interval I and a parameter c such that the histogram H′ which agrees with H everywhere except I, and takes the value c on I, approximates A better by a fraction 1 − ε_r/(4B_r log N).
Proof: Since H is not robust, there exists a set X of non-overlapping intervals with coefficients c_I such that the histogram H′, which agrees with H everywhere except on ∪_{I∈X} I and takes value c_I on each I, improves the approximation:

(1 − ε_r)‖A − H‖ ≥ ‖A − H′‖.

Let X^D be the minimum collection of dyadic intervals whose union is equal to the union of the intervals in X. Thus X^D has at most 2B_r log N intervals. For each interval I′ ∈ X^D, if the interval in X containing I′ is I, let c_{I′} = c_I. Suppose no improving interval exists. Then, for I ∈ X^D, let the histogram H_I agree with H everywhere except I, where it has value c_I; H_I does not improve the approximation to the signal. By Lemma 13,

‖A − H_I‖ = ‖A − H‖ − ‖(A − H, I)‖ + ‖(A − c_I χ_I, I)‖ ≥ (1 − ε_r/(4B_r log N))‖A − H‖.
Using the lemma again on the collection X of intervals,

‖A − H‖ − ‖A − H′‖ = Σ_{I∈X} (‖(A − H, I)‖ − ‖(A − c_I χ_I, I)‖) ≤ Σ_{I∈X^D} (ε_r/(4B_r log N)) ‖A − H‖.

Since |X^D| ≤ 2B_r log N, this contradicts the improvement achieved by H′. Thus there must be a single interval I ∈ X^D which improves the error by a 1 − ε_r/(4B_r log N) fraction.
D Proofs in Section 5

Lemma 16 If H is a histogram with at most k + 1 buckets intersecting the interval I, and H agrees with H_r on I_ℓ − I, then

‖A − H_sol‖ ≤ (1 + ε_k)‖A − H‖ + E(k) + δ + (1 + ε_k)ε_r‖A − H_r‖.
Proof: Let the last bucket of H be [y, x) with parameter c, where x_{i−1} ≤ y < x_i. Recall I_{i−1} = [0, x_{i−1}), I_i = [0, x_i), and I_ℓ = [0, n). Let H′ be a histogram that agrees with H on I_{i−1} and agrees with H_r elsewhere. Since H′ restricted to I_{i−1} has at most k buckets, we have

‖A − H_k^{i−1}‖ ≤ (1 + ε_k)‖A − H′‖ + E(k) = (1 + ε_k)(‖A − H‖ − ‖(A − H, I − I_{i−1})‖ + ‖(A − H_r, I − I_{i−1})‖) + E(k).   (2)

The last part of the above equation follows from Lemma 13 applied to H and H′, which agree everywhere except on I − I_{i−1}, where they agree with H and H_r respectively. We will relate the error ‖A − H_sol‖ to the error ‖A − H_k^{i−1}‖, and thus to ‖A − H‖. Two possible cases occur: x_{i−1} = y or x_{i−1} < y.

In the first case (x_{i−1} = y), observe that we evaluated H_{i−1}, and we had the possibility of setting c_{i−1} = c. Therefore, if H′_{i−1} is a histogram which agrees with H_k^{i−1} except on I − I_{i−1}, where it takes the value c (= H),

‖A − H_{i−1}‖ ≤ ‖A − H′_{i−1}‖.

Since H_k^{i−1} and H′_{i−1} agree everywhere except on I − I_{i−1}, where they agree with H_r and H respectively, by Lemma 13 and Equation (2),

‖A − H′_{i−1}‖ = ‖A − H_k^{i−1}‖ + ‖(A − H, I − I_{i−1})‖ − ‖(A − H_r, I − I_{i−1})‖ ≤ (1 + ε_k)‖A − H‖ + E(k) + ε_k(‖(A − H_r, I − I_{i−1})‖ − ‖(A − H, I − I_{i−1})‖).

Since H can have at most k ≤ B buckets in the interval I − I_{i−1}, by Lemma 15 we get that ‖(A − H_r, I − I_{i−1})‖ − ‖(A − H, I − I_{i−1})‖ is at most ε_r‖A − H_r‖. Since we chose the best possible H_sol,

‖A − H_sol‖ ≤ ‖A − H_{i−1}‖ ≤ ‖A − H′_{i−1}‖,

which proves the lemma in this (x_{i−1} = y < x_i) case.

Consider the case x_{i−1} < y < x_i; in this case |I_i| > |I_{i−1}| + 1. Consider H_i and let H′_i be equal to H_i = H_k^i everywhere except on I − I_i, where it takes the value c (= H). As in the previous case, ‖A − H_i‖ ≤ ‖A − H′_i‖, so

‖A − H′_i‖ = ‖A − H_k^i‖ + ‖(A − H, I − I_i)‖ − ‖(A − H_r, I − I_i)‖ ≤ ‖A − H_k^{i−1}‖ + δ + ‖(A − H, I − I_i)‖ − ‖(A − H_r, I − I_i)‖

by the basis conditions. Using Equation (2) this becomes

‖A − H′_i‖ ≤ (1 + ε_k)(‖A − H‖ − ‖(A − H, I − I_{i−1})‖ + ‖(A − H_r, I − I_{i−1})‖) + E(k) + δ + ‖(A − H, I − I_i)‖ − ‖(A − H_r, I − I_i)‖.

The equation rewrites as

‖A − H′_i‖ ≤ (1 + ε_k)‖A − H‖ + ε_k(‖(A − H_r, I − I_{i−1})‖ − ‖(A − H, I − I_{i−1})‖) + (‖(A − H_r, I_i − I_{i−1})‖ − ‖(A − H, I_i − I_{i−1})‖) + E(k) + δ.

Since H can have at most k ≤ B buckets in the interval I_i − I_{i−1} or in the interval I − I_i, by Lemma 15 on both these intervals we get

‖A − H′_i‖ ≤ (1 + ε_k)‖A − H‖ + ε_k ε_r‖A − H_r‖ + ε_r‖A − H_r‖ + E(k) + δ.

Since ‖A − H_sol‖ ≤ ‖A − H_i‖ ≤ ‖A − H′_i‖, the lemma follows.

Lemma 17 In the above construction procedure, if we use ‖·‖_s instead of ‖·‖, then

‖A − H_sol‖ ≤ (1 + ε_s)[(1 + ε_k)‖A − H‖ + E(k) + δ + (1 + ε_k)ε_r‖A − H_r‖].

Proof: If we inspect the proof, there are only two places where we use the value ‖·‖ to make a decision. (Everywhere else, ‖·‖ is used for purposes of analysis; we do not compute it.) The first place is where we estimate the best parameter c_i to be set on I − I_i (or the parameter c_{i−1} to be set on I − I_{i−1}); the other is where we choose the best H_i. Assume that we are in the case x_{i−1} < y < x_i; the other case is exactly the same, replacing i by i − 1. It is possible that ‖A − H_i‖ > ‖A − H′_i‖, but what we can guarantee is that

‖A − H_sol‖ ≤ ‖A − H_sol‖_s ≤ ‖A − H_i‖_s ≤ ‖A − H′_i‖_s ≤ (1 + ε_s)‖A − H′_i‖.

The lemma follows.
Lemma 18 The best B-bucket approximation H_sol to A on [0, N) produced from a δ-separated extension basis, for δ = ε|OPT|/B, has error ‖A − H_sol‖ ≤ (1 + O(ε))|OPT|.
Proof: Let H_sol be the final B-bucket histogram we construct, and let the error of the best B-bucket approximation be |OPT|. Put ε_s = ε/(4B), δ = ε|OPT|/B, and ε_r = ε/B. From Lemma 17, we get

E(B) = (1 + ε_s)E(B − 1) + (1 + ε_s)[δ + 2ε_r‖A − H_r‖].

Note that ‖A − H_r‖ − ‖A − H_opt‖ ≤ ε_r‖A − H_r‖, because H_opt can be thought of as a refinement of H_r. Thus the solution to the above recurrence, supposing E(1) = 0, is

E(B) ≤ Σ_{i=1}^{B} (1 + ε_s)^i [δ + 2ε_r|OPT|/(1 − ε_r)] ≤ (1 + ε) B · (4ε|OPT|/B) = 4(1 + ε)ε|OPT|.

Now consider ε_{k+1}. Above we saw that ε_1 = ε_s. From this and Lemma 17, it follows that

(1 + ε_{k+1}) = (1 + ε_s)(1 + ε_k) = (1 + ε_s)^{k+1}.

Combining all of these, we get

‖A − H_sol‖ ≤ (1 + ε)|OPT| + 4(1 + ε)ε|OPT|.

Lemma 19 To achieve ℓ = O(B/ε), it is sufficient to construct H_k^i where ‖A − H_k^i‖_s ≤ (3/2)|OPT|.
Proof: Observe that since we construct an interval of length 1, we have nothing to prove about x_{i+2}. Now,

‖A − H_1^{i+1}‖ ≤ ‖A − H_1^{i+1}‖_s ≤ ‖A − H_1^i‖_s + δ′ ≤ (1 + ε_s)‖A − H_1^i‖ + δ′.

If ε_s‖A − H_1^i‖ ≤ δ′ we are done; but that just means ‖A − H_1^i‖ has to be less than 2|OPT|. On the other hand, if ‖A − H_1^i‖ is greater than 1.5|OPT|, which we detect by ‖A − H_1^i‖_s ≥ 1.5(1 + ε_s)|OPT|, then such an H_1^i will never be useful for a (1 + ε)-approximation. In that case we just set ℓ = i + 1 and x_ℓ = N, and we are done constructing the extension basis. Notice that for every two elements we add, the error goes up by at least δ′. Thus we cannot have more than 2(2|OPT|)/δ′ elements, since otherwise the error would exceed 2|OPT| and therefore be useless for a (1 + ε)-approximation. Thus the size of the basis is at most O(B/ε).