CasAB: Building Precise Bitmap Indices via Cascaded Bloom Filters

2009 Fourth International Conference on Internet Computing for Science and Engineering

Zhuo Wang
School of Information Science and Engineering, Shenyang Ligong University, Shenyang, China
[email protected]

978-0-7695-4027-6/10 $26.00 © 2010 IEEE. DOI 10.1109/ICICSE.2009.19

Abstract: Bitmap indices are widely used in massive and read-mostly datasets such as data warehouses and scientific databases. Recently, Bloom filters were used to encode bitmap indices into approximate bitmaps (AB). The salient advantage of this technique is that bitmaps can be accessed directly without decompression, and the query time is proportional to the size of the region being queried. This technique, however, introduces false positives due to the nature of Bloom filters, so only approximate query results can be achieved. To eliminate false positives, we propose a novel bitmap index encoding scheme, namely cascaded approximate bitmaps (CasAB), based on multilevel Bloom filter cascading, which achieves precise query results at the cost of slightly more space and time overhead. An efficient CasAB construction algorithm and a query algorithm are given. Space and time complexities of CasAB are analyzed theoretically, and the minimum space size can be pre-computed based on the cardinality of the attribute. Experiments show that the query precision of CasAB is always 100% and that its space and time overhead is similar to that of AB.

Keywords: bitmap index; Bloom filter; compression

I. INTRODUCTION

Bitmap indices have been widely used in scientific and commercial databases for processing complex ad-hoc queries in read-mostly environments. Bitmap indices inherently involve vastly redundant data, which can be compressed for smaller storage cost and faster retrieval. Researchers have proposed various encoding schemes for compressing bitmap indices. Range encoding and Interval encoding, for instance, were introduced by Chan and Ioannidis to reduce the number of bitmaps to be stored [1], [2]. Run-length encoding (RLE) based compression schemes are shown to be more effective for sparse datasets. For instance, BBC is a byte-aligned bitmap compression technique that was adopted in Oracle in 1995 [3]. Inspired by BBC, K. Wu et al. developed the Word-Aligned Hybrid code (WAH), which exploits modern computer architecture to answer queries efficiently. Though WAH is larger in size than BBC, it is one or two orders of magnitude faster than BBC [4]. For very large cardinalities, binning was introduced to reduce the number of distinct attribute values [5], [6], [7].

Recently, T. Apaydin et al. proposed a Bloom filter-based bitmap encoding scheme [8]. Bloom filters [9] were first proposed by B. Bloom in 1970 and are widely used in database and networking applications; they yield an extremely compact data structure that supports membership queries on a set. Bloom filters unfortunately introduce false positives, i.e., a data item not belonging to the set may be claimed to be in the set, yet false negatives are guaranteed not to occur. Nonetheless, the space requirement of a Bloom filter falls significantly below the information-theoretic lower bounds for error-free data structures. In the case of a bitmap index, a Bloom filter is used to encode each bitmap. The bit array of the Bloom filter is thus called an approximate bitmap (AB). This technique is especially effective for queries that ask for only a small subset of the data, with query criteria based on row identifiers. For example, consider a data warehouse where the data is physically ordered by date. A query that asks for the total sales of every Monday for the last three months would effectively select only twelve rows. In other bitmap indices, all the rows of data would have to be scanned for potential answers to the query. In AB, only the rows involved in the query constraints are processed by the Bloom filter, so queries can be extremely fast. Experiments show that AB can be 1 to 3 orders of magnitude faster than the state-of-the-art bitmap index, WAH, when less than 15% of the rows in the dataset are queried [8], provided both use comparable index sizes. Moreover, the index size, and thus the query speed, of RLE-based bitmap indexing schemes depends heavily on the distribution of the data, so tuple re-ordering techniques [10], [11] are frequently applied before the actual indices are built. Unlike WAH, a Bloom filter-based bitmap index is not sensitive to the distribution of the data.

As a Bloom filter-based encoding scheme, however, AB also suffers from false positives and thus can only return an approximate query result. Although this approximation is tolerable in many scientific and data warehouse environments, our experiments show that when the size of AB is relatively small, the query precision is very poor: false positive tuples1 dominate the query result, making it unacceptable. AB suggests a subsequent step that prunes away the false positive tuples by checking the candidate tuples against the original dataset, but this incurs enormous extra I/O costs.

1 By "false positive tuple" we mean a tuple that is returned in the query result by AB but does not actually satisfy the query condition.


The optimal value of k that minimizes the false positive rate can be found by taking the derivative of f with respect to k, setting it to zero, and truncating the result to an integer, which gives

In this paper, we try to eliminate the false positives in AB by means of cascaded Bloom filters. With respect to the original bitmap, such a cascade can produce both false positives and false negatives, which are stored in pairs of Bloom filters level by level. The number of false positives and false negatives quickly gets smaller and smaller, until the number of false negatives is small enough to be searched quickly in a sequential search table. We call these cascaded Bloom filters cascaded approximate bitmaps (CasAB). The contributions of this paper include:
• We propose a novel data structure, CasAB, to encode bitmap indices efficiently based on the Bloom filter cascading technique, which guarantees 100% query precision;
• We give an efficient algorithm to construct CasAB, and a fast retrieval algorithm to answer queries with CasAB;
• We analyze the space size of CasAB and the time complexity of the retrieval algorithm theoretically. A method to calculate the minimum space size of CasAB based on cardinality is given;
• We report experiments on both synthetic and real-world datasets to evaluate the effectiveness of CasAB. The experiments reveal that CasAB can answer queries precisely with slightly more space and time overhead than AB.

$$k_{opt} = \lfloor (\ln 2)\, m/n \rfloor$$

When k is optimal for the minimum false positive rate, p is 1/2. So the minimum false positive rate is

$$f_{min} = (1 - 1/2)^{k_{opt}} \approx (0.6185)^{m/n} \tag{1}$$
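As a quick numerical check of Equation (1), the following sketch evaluates the optimal number of hash functions and the resulting false positive rate for a few values of m/n. The function names are illustrative only and do not come from the paper.

```python
import math


def optimal_k(bits_per_element: float) -> int:
    """k_opt = floor(ln 2 * m/n); at least one hash function is always used."""
    return max(1, math.floor(math.log(2) * bits_per_element))


def false_positive_rate(bits_per_element: float, k: int) -> float:
    """f ~= (1 - e^{-k n / m})^k, written in terms of the ratio m/n."""
    return (1.0 - math.exp(-k / bits_per_element)) ** k


def min_false_positive_rate(bits_per_element: float) -> float:
    """Equation (1): f_min ~= 0.6185^(m/n)."""
    return 0.6185 ** bits_per_element


if __name__ == "__main__":
    for ratio in (8, 13, 20):
        k = optimal_k(ratio)
        print(ratio, k, false_positive_rate(ratio, k), min_false_positive_rate(ratio))
```

Because k is truncated to an integer, the rate obtained with `optimal_k` is close to, but not exactly equal to, the closed-form minimum.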

Note that, given n elements and the expected false positive rate f of a Bloom filter, the optimal k for the minimum space size of that Bloom filter is also ⌊(ln 2)m/n⌋ [12].

A. Kirsch and M. Mitzenmacher demonstrated that only two hash functions are necessary to effectively implement a Bloom filter without any loss in the asymptotic false positive probability [13]. This leads to less computation and potentially less need for randomness in practice.

B. Chazelle et al. proposed the Bloomier filter [14]. A Bloomier filter is a data structure for compactly encoding a function with static support in order to support approximate evaluation queries. The Bloomier filter is designed by cascading pairs of Bloom filters, and still maintains an economical use of storage.

B. Approximate bitmaps

Bloom filters were applied to bitmap indices to encode bitmaps in a way that provides fast and integrated querying over compressed bitmaps with direct access [8]. A bitmap index can be treated as a boolean matrix. Most of the bits in the matrix are zeros, and only a fraction of them are ones. The ones indicate where a particular value occurs in the dataset. Consider the simple bitmap index of an attribute A in Table I. The cardinality of the attribute is 4, so there are 4 columns in the bit matrix. Any bit in the matrix is uniquely identified by its row number and column number, so a unique key can be formed for each bit. To encode the bitmap index, we only need to insert the keys of all set bits into a Bloom filter. Such a Bloom filter is referred to as an approximate bitmap (AB).
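The encoding just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [8]: the key function `make_key` (row × cardinality + column), the SHA-256 based hash pair, and the byte-per-bit array are all choices made here for clarity. The k positions are derived from two base hashes in the style of [13].

```python
import hashlib


class BloomFilter:
    """Bit array of size m probed at k positions derived from two base hashes."""

    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for readability only

    def _positions(self, key: int):
        digest = hashlib.sha256(str(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep the stride non-zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: int) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key: int) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def make_key(row: int, col: int, cardinality: int) -> int:
    """Hypothetical key function: a unique integer per (row, column) pair."""
    return row * cardinality + col


def build_ab(values, cardinality: int, alpha: int, k: int) -> BloomFilter:
    """Insert the key of every set bit; there is one set bit per row."""
    ab = BloomFilter(alpha * len(values), k)
    for row, value in enumerate(values):
        ab.add(make_key(row, value, cardinality))
    return ab


if __name__ == "__main__":
    column_a = [0, 1, 1, 0, 3, 2]  # attribute A from Table I
    ab = build_ab(column_a, cardinality=4, alpha=8, k=5)
    print(make_key(4, 3, 4) in ab)  # a set bit is always reported as present
```

A set bit is never missed, whereas a 0-bit may occasionally be reported as set; how often depends on alpha and k.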

II. BACKGROUND AND RELATED WORK

A. Bloom filters

A Bloom filter is a space-efficient data structure used to represent a set S = {s1, s2, · · ·, sn} of n elements. It consists of a bit array of size m and k independent hash functions h1, h2, ..., hk whose outputs are uniformly distributed over the discrete range {0, 1, · · ·, m − 1}. The bit array is initially set to zero. The n elements are inserted into the Bloom filter one by one as follows. The k hash functions are computed over each element to obtain positions in the bit array, and the k bits at those positions are set to 1. To check whether an element x belongs to S, the same k hash functions are computed over x again. If any of the corresponding bits is zero, x ∉ S for sure. If all the checked bits are set, x ∈ S with high probability. It is possible that an element not in S is reported as a member of S, creating a false positive, but no false negative is ever incurred. The precision of a Bloom filter is measured by the false positive rate, which can be calculated as follows. After n elements are inserted, the probability of a particular bit being zero is given as

Table I. A SIMPLE BITMAP INDEX

RowID   A   A=0   A=1   A=2   A=3
0       0   1     0     0     0
1       1   0     1     0     0
2       1   0     1     0     0
3       0   1     0     0     0
4       3   0     0     0     1
5       2   0     0     1     0

In AB, a query is issued on a subset of the boolean matrix, denoted by a sequence of pairs of row and column numbers, rather than the whole boolean matrix. For example, in Table I, if the query condition is A ≥ 2, and the candidate row set is {1,2,4}, then the query region is a subset {(1,2),(1,3),(2,2),(2,3),(4,2),(4,3)}. Keys formed by pairs of

$$p = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$$

Now if we test an element that was not inserted into the Bloom filter, the probability of that element being reported as a false positive is

$$f = (1 - p)^k \approx \left(1 - e^{-kn/m}\right)^k$$


row and column numbers are then searched by the Bloom filter. The query result is a bit array of the same size as the candidate row set. For each row, the pairs formed by that row are searched in the Bloom filter sequentially. If the Bloom filter claims that a pair is in the set, the result for that row is 1, and the remaining pairs of that row are skipped. For example, the query mentioned above returns the bit array {0,0,1} if no false positive is generated. However, due to the characteristics of Bloom filters, false positive tuples might be returned; e.g., the bit array above can accidentally be {1,0,1}, where row 1 is a false positive tuple, if either pair (1,2) or (1,3) produces a false positive in the Bloom filter. If a good set of hash functions is selected, and the bit array of the Bloom filter is big enough, the number of false positive tuples should be very small compared to the total query size. The query time is in O(ck), where c is the size of the rectangular query region (the number of row-column pairs), and k is the number of hash functions, which is usually very small and likely to remain constant in a particular application.

Multi-dimensional queries involving relational predicates on multiple attributes, such as A ≥ 3 and 5 ≤ B < 20, are also well supported by AB. In such cases, the candidate row number is combined into pairs with the column numbers of the attribute values involved in the relational predicates, and then all the pairs are searched in the Bloom filter sequentially. Whenever the final result for a certain row is determined, the leftover predicates need not be considered further.

In AB, let s denote the number of set bits in the boolean matrix, and let α = ⌈m/s⌉. The false positive rate of the AB is:

$$f = (1 - p)^k \approx \left(1 - e^{-k/\alpha}\right)^k \tag{2}$$

and the optimal k for the minimum false positive rate is thus $k_{opt} = \lfloor (\ln 2)\,\alpha \rfloor$.

III. PROPOSED SCHEME

A. Motivation

Although AB can answer queries efficiently with a high compression ratio, its approximate results restrict its application area. Moreover, the query precision of AB is hard to measure by the false positive rate of the Bloom filter, which only measures the probability that a single bit in the boolean matrix is a false positive. During our experimental study, we observed that the number of false positive tuples increases steadily as the cardinality of the attribute becomes larger. That is because a row is reported as a false positive tuple as soon as any false positive bit occurs on that row. In other words, cardinality also matters to query precision: the larger the cardinality, the more likely a row is to be reported as a false positive tuple. When the query result is dominated by false positive tuples, the final query precision is unacceptable. Instead of measuring the false positive rate on bits, we define a new measurement of query precision, the tuple false positive rate (TFPR), as

$$TFPR = \frac{n_{fp}}{n_{tp} + n_{fp}}$$

where $n_{fp}$ is the number of false positive tuples in the query result, and $n_{tp}$ is the number of true positive tuples in the query result. It is easy to see that TFPR becomes larger as the cardinality increases.

B. CasAB encoding scheme

To obtain a completely false positive-free AB, we propose a novel bitmap encoding scheme, CasAB, which compresses bitmaps efficiently while retaining the ability of direct access. Our scheme is inspired by Bloomier filters. Bloomier filters are used to associate a value with each element that has been inserted, implementing an associative array, and achieve this goal via cascaded Bloom filters. But Bloomier filters do cause false positives. What CasAB and Bloomier filters have in common is that both use cascaded Bloom filters; CasAB, however, can eliminate false positives because they can be verified against the original dataset ahead of query time.

The basic idea of CasAB is as follows. We use the first Bloom filter, say $BF_0$, to store the keys of the set bits in the boolean matrix, exactly as in AB. For any set bit in the boolean matrix, $BF_0$ always claims it as a true positive. On the other hand, the unset bits (0-bits) are potential false positives for $BF_0$. We use another Bloom filter, say $BF_1$, to store all the keys that cause false positives in $BF_0$. Note that we can always tell whether a 0-bit in the boolean matrix causes a false positive by feeding its key into $BF_0$ and checking whether $BF_0$ claims to contain that key, which would indicate that a 0-bit has been taken for a 1-bit. $BF_1$ contains all the false positives caused by $BF_0$. But $BF_1$ is also a Bloom filter and causes its own false positives. The difference is that, when $BF_1$ causes a false positive, it reports the bit to be zero, causing a false negative in terms of the boolean matrix.

To memorize these false negatives, we use a third Bloom filter, say $BF_2$, to store all the false negatives caused by $BF_1$. Similarly, $BF_2$ also causes false positives of its own, and these are again false positives in terms of the boolean matrix. Recursively, we need another pair of Bloom filters, say $BF_3$ and $BF_4$, where $BF_3$ stores the false positives caused by $BF_2$ and $BF_4$ stores the false negatives caused by $BF_3$, and so on. Because the false positive rate is usually very small, the number of keys in succeeding Bloom filters is extremely likely to drop quickly to a very small quantity that can easily be stored in an ordinary deterministic sequential table, such as a small array with linear search. The total space required is almost entirely occupied by $BF_0$. Moreover, the average total search time is also almost constant, because almost all queries are resolved by $BF_0$, almost all remaining queries by the first pair of Bloom filters ($BF_1$ and $BF_2$), and so on. Figure 1 shows the process of a bit in the boolean matrix being tested through the cascaded


Bloom filters, where BFi =1 denotes that the key is claimed to be contained in BFi with high probability, while BFi =0 reveals that the key is definitely not contained in BFi . To get the accurate query result, we only need to replace the last Bloom filter with an ordinary deterministic table.
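The cascade described above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: keys are plain (row, column) tuples, the hash pair is derived from SHA-256, each filter is sized as alpha bits per inserted key, the final table is a Python set, and `levels` is assumed to be at least 1.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_keys: int, alpha: int, k: int):
        self.m = max(1, alpha * n_keys)  # alpha bits per key, never empty
        self.k = k
        self.bits = bytearray(self.m)    # one byte per bit, for readability

    def _positions(self, key):
        digest = hashlib.sha256(repr(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


def _make_filter(keys, alpha, k):
    bf = BloomFilter(len(keys), alpha, k)
    for key in keys:
        bf.add(key)
    return bf


def build_casab(values, cardinality, levels, alpha, k):
    """Return ([BF_0, ..., BF_{2L-1}], exact_table) for one attribute column."""
    ones = {(row, value) for row, value in enumerate(values)}
    zeros = {(r, c) for r in range(len(values)) for c in range(cardinality)} - ones
    filters = [_make_filter(ones, alpha, k)]                  # BF_0: all 1-bits
    fp = {x for x in zeros if x in filters[0]}                # 0-bits BF_0 claims
    fn = ones
    for level in range(1, levels + 1):
        filters.append(_make_filter(fp, alpha, k))            # BF_{2l-1}
        fn = {x for x in fn if x in filters[-1]}              # 1-bits it also claims
        if level == levels:
            return filters, fn                                # table replaces BF_{2L}
        filters.append(_make_filter(fn, alpha, k))            # BF_{2l}
        fp = {x for x in fp if x in filters[-1]}              # 0-bits it also claims


def retrieve(filters, table, row, col) -> int:
    """Exact bit value at (row, col), following the cascade level by level."""
    x = (row, col)
    if x not in filters[0]:
        return 0                                              # true negative
    levels = len(filters) // 2
    for level in range(1, levels + 1):
        if x not in filters[2 * level - 1]:
            return 1                                          # not a false positive
        if level == levels:
            return 1 if x in table else 0                     # deterministic answer
        if x not in filters[2 * level]:
            return 0                                          # not a false negative
```

Choosing a deliberately small alpha and k (for example alpha=2, k=1) makes the early filters produce many false positives, which is a convenient way to exercise the deeper levels; the retrieved bits should still match the original column exactly.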

[Figure 1. State diagram of CasAB. Edges are labeled with Bloom filter outcomes (BF0=0, BF0=1, BF1=0, BF1=1, BF2=0, BF2=1, ...); states are marked deterministic or non-deterministic and annotated with tp, tn, tp+fp and fn+tn (tp = true positive, etc.).]

Definition 1. An L-level CasAB, denoted as CasAB-L, is an extended structure of AB. It consists of a base Bloom filter, $BF_0$, and L pairs of Bloom filters, $BF_{2i-1}$ and $BF_{2i}$, $1 \le i \le L$, except that the last Bloom filter, $BF_{2L}$, is replaced with a deterministic sequential table. All Bloom filters have the same α and k.

C. CasAB construction

To efficiently construct a CasAB from given attribute values, we propose a theorem which reduces the construction time significantly.

Theorem 1. In CasAB-L, let $S_i$ denote the set of keys inserted into $BF_i$; then $S_i \subseteq S_{i-2}$, $i = 2, 3, \cdots, 2L$.

Proof: Let bit(key) denote the bit value for key in the boolean matrix. According to the definition of CasAB, we have

$$S_0 = \{key \mid bit(key) = 1\}$$
$$S_{2i-1} = \{key \mid bit(key) = 0 \wedge BF_{2i-2}(key) = 1\}$$
$$S_{2i} = \{key \mid bit(key) = 1 \wedge BF_{2i-1}(key) = 1\}$$

for $i = 1, 2, 3, \cdots, L$. First, we prove 1) $S_{2i} \subseteq S_{2i-2}$, $i = 1, 2, \cdots, L$. For $i = 1$, it is easy to see that $S_2 \subseteq S_0$. For $i \ge 2$, every $key \in S_{2i}$ satisfies $bit(key) = 1$ and $BF_{2i-1}(key) = 1$. If $key \notin S_{2i-2}$, then $bit(key) = 0$ or $BF_{2i-3}(key) = 0$. But $bit(key) = 1$, so $BF_{2i-3}(key) = 0$. This leads CasAB to report $bit(key) = 1$ at $BF_{2i-3}$ (see Figure 1), so this key is not inserted into any Bloom filter thereafter. This contradicts $key \in S_{2i}$. So $key \in S_{2i-2}$, and thus $S_{2i} \subseteq S_{2i-2}$. Similarly, we can prove that 2) $S_{2i+1} \subseteq S_{2i-1}$, $i = 1, 2, \cdots, L-1$. According to 1) and 2), we get $S_i \subseteq S_{i-2}$, $i = 2, 3, \cdots, 2L$.

Theorem 1 says that, rather than constructing each Bloom filter from the original boolean matrix, we only need to construct it from the keys in the Bloom filter two positions ahead of it. Thus a CasAB-L can be constructed efficiently in O(NC) time by Algorithm 1, where C is the cardinality of the attribute and N the number of rows in the dataset. It first initializes 2L Bloom filters with the same α and k and a sequential table ST, whereas the size of the bit array of each Bloom filter is not yet decided (line 1). Then, keys of set bits are formed by using function F(i, j) and inserted into $BF_0$ (lines 2-4). The remaining Bloom filters are constructed level by level (lines 5-32). In the first level, we add all the false-positive-candidate keys into the set FPSet based on whether the key is reported as a false positive by $BF_0$ (lines 6-11). Similarly, the set FNSet, containing all the keys causing false negatives, is initialized by testing whether the key is contained in $BF_1$ (lines 18-23). In each level l, each key x ∈ FPSet is tested by $BF_{2l-2}$; keys not contained in $BF_{2l-2}$ are removed from FPSet (lines 13-15), and the remaining keys are inserted into $BF_{2l-1}$ (lines 16-17). Similarly, keys that are contained in $BF_{2l-1}$ are inserted into $BF_{2l}$, except that, in the last level, keys are inserted into the sequential table ST instead of a Bloom filter (lines 28-32).

Algorithm 1: CreateCasAB(M)
input : M: the boolean matrix.
output: a CasAB index for the attribute.
 1  Initialize BF_0, BF_1, ..., BF_{2L-1}, and a sequential table ST;
 2  foreach 1-bit at position (i,j) in M do        // construct BF_0
 3      x ← F(i,j);
 4      Insert x to BF_0;
 5  for l ← 1 to L do
 6      if l = 1 then                              // initialize fp candidates
 7          FPSet ← ∅;
 8          foreach 0-bit at position (i,j) in M do
 9              x ← F(i,j);
10              if BF_0 contains x then
11                  FPSet ← FPSet ∪ {x};
12      else                                       // generate new fp candidates
13          foreach x in FPSet do
14              if BF_{2l-2} not contains x then
15                  FPSet ← FPSet − {x};
16      foreach x in FPSet do                      // construct BF_{2l-1}
17          Insert x to BF_{2l-1};
18      if l = 1 then                              // initialize fn candidates
19          FNSet ← ∅;
20          foreach 1-bit at position (i,j) in M do
21              x ← F(i,j);
22              if BF_1 contains x then
23                  FNSet ← FNSet ∪ {x};
24      else                                       // generate new fn candidates
25          foreach x in FNSet do
26              if BF_{2l-1} not contains x then
27                  FNSet ← FNSet − {x};
28      foreach x in FNSet do                      // construct BF_{2l}
29          if l = L then
30              Insert x into ST;
31          else
32              Insert x into BF_{2l};

D. CasAB query

Given the position of a bit in the boolean matrix, the bit value at that position can be retrieved accurately by Algorithm 2. First, the row number and the column number are combined to form a key x (line 1), and x is fed into $BF_0$. If $BF_0$ does not contain x, the bit at (r, c) is definitely 0 (lines 2-3). Otherwise, x is fed into the next Bloom filter. In level l, if $BF_{2l-1}$ does not contain x, we can infer that the bit at (r, c) is 1, because it is not a false positive and is therefore a true positive (lines 6-7). Otherwise, we test $BF_{2l}$. If $BF_{2l}$ does not contain x, then the bit at (r, c) must be 0, because it is not a false negative and is therefore a true negative (lines 15-16). Otherwise, we enter the next level. If we reach the last level, we look up x in the sequential table instead of a Bloom filter, and we always get the deterministic value of the bit (lines 9-13).

Range queries can be answered by CasAB in the same way as by AB. For example, given a candidate row set, R, and a multi-dimensional query condition, 'c1 ≤ A ≤ c2 and c3 ≤ B ≤ c4', each row number in R is combined with every value of attributes A and B within the queried ranges to form a key. For each dimension, the keys are tested by the Retrieve() algorithm. Once a key returns 1, the successive keys are skipped, since the range condition for that attribute is already satisfied. The results from all dimensions are AND-ed to determine whether the row satisfies the query condition.

Here we propose a simple query optimization technique, called narrow range first, to determine the order in which dimensions are processed in multi-dimensional queries. Since CasAB can terminate its loop once the outcome of the query condition is determined, we should process the dimensions in ascending order of the ranges involved in the query, so that fewer bits need to be retrieved. For example, for the query '20 ≤ A ≤ 400 and 10 ≤ B ≤ 15 and 10 ≤ C ≤ 20', the processing order of the dimensions is B, C, A.

Algorithm 2: Retrieve(r, c)
input : r, c: the row ID and column ID in the boolean matrix.
output: the bit value at position (r,c) in the boolean matrix.
 1  x ← F(r,c);
 2  if BF_0 not contains x then
 3      return 0
 4  else
 5      for l ← 1 to L do
 6          if BF_{2l-1} not contains x then
 7              return 1
 8          else
 9              if l = L then
10                  if x is in ST then
11                      return 1
12                  else
13                      return 0
14              else
15                  if BF_{2l} not contains x then
16                      return 0

[Figure 2. Analysis of space size. (a) Space vs α and L, C = 100; curves: AB, L=1, L=2. (b) Space vs α and C, L = 1; curves: AB, C=50, C=200, C=500, C=1000. x-axis: alpha; y-axis: space (×N bits).]

IV. ANALYSIS OF PROPOSED SCHEME

In this section, we analyze the space and query time complexities of CasAB theoretically.

A. Space complexity

According to Equation (2), all Bloom filters have the same false positive rate because they have the same α and k parameters. Let $s_i$ denote the size in bits of $BF_i$. We have the following:

$$s_0 = N\alpha$$
$$s_1 = N(C-1) f \alpha$$
$$s_2 = N f \alpha$$
$$\cdots$$
$$s_{2l-1} = N(C-1) f^{l} \alpha$$
$$s_{2l} = N f^{l} \alpha$$
$$\cdots$$

It is easy to see that $s_{2l-1} + s_{2l} = N C f^{l} \alpha$, $l = 1, 2, \cdots, L$. Because we set L to a value such that $BF_{2L}$ is very small, the size of $BF_{2L}$ is approximately the same as that of the sequential table. So the total space size of CasAB can be represented by a convergent series:


$$s_{all} = N\alpha\left(1 + C\sum_{l=1}^{L} f^{l}\right) = N\alpha\left(1 + \frac{C f (1 - f^{L})}{1 - f}\right)$$

Generally, f can be made arbitrarily small, so $s_{all}$ increases very slowly as L increases and is bounded above. Given the maximum number of elements in the sequential table, $n_{st}$, the number of levels required can be calculated from $N f^{L} \le n_{st}$, that is

$$L = \left\lceil \frac{\ln(n_{st}/N)}{\ln f} \right\rceil$$

Figure 2(a) shows the curves of the total space size of CasAB as a function of α with L = 1 and 2 for C = 100. Figure 2(b) shows the curves of the total space size as a function of α with varying C for L = 1. We use the minimum f calculated by Equation (1) for the given α in both figures. Figure 2(a) shows that the increment in space size drops quickly as the number of levels increases. For example, the sizes of CasAB-1 and CasAB-2 are almost the same for α ≥ 13, and both are slightly larger than AB. Figure 2(b) reveals that larger cardinalities require a larger α to obtain a smaller CasAB. Both figures suggest that, given a certain cardinality, there is an optimal α for the minimum space size when we always use the minimum false positive rate for the given α. This optimal α can be calculated numerically after f is replaced with $f_{min} = (1/2)^{\alpha \ln 2}$. E.g., we get the optimal α = 13 for C = 100, and 23 for C = 7037.

B. Query time complexity

In Figure 1, for zeros in the boolean matrix, we need to check at least one Bloom filter, $BF_0$, to report the bit successfully. For ones in the boolean matrix, we need to check at least two Bloom filters, $BF_0$ and $BF_1$. Let $p_l$ denote the probability that the value of a zero bit in the boolean matrix is successfully reported by using $BF_{2l}$, $l = 0, 1, \cdots, L$. We have the following:

$$p_0 = 1 - f$$
$$p_1 = f(1 - f)$$
$$p_2 = f^{2}(1 - f)$$
$$\cdots$$
$$p_l = f^{l}(1 - f)$$

If a zero bit is successfully predicted by $BF_{2l}$, then $2l + 1$ Bloom filters have to be accessed. So the total number of times accessing Bloom filters for a 0-bit is

$$T_0 = (1 - f) + 3f(1 - f) + \cdots + (2L + 1) f^{L} (1 - f) = \sum_{l=0}^{L} (2l + 1) f^{l} (1 - f)$$

Similarly, the total number of times accessing Bloom filters for a 1-bit is

$$T_1 = 2(1 - f) + 4f(1 - f) + \cdots + 2L f^{L-1} (1 - f) = \sum_{l=1}^{L} 2l f^{l-1} (1 - f)$$

On the assumption of a uniform distribution of attribute values, the average number of times accessing Bloom filters is

$$T = \frac{C-1}{C} T_0 + \frac{1}{C} T_1$$

$T_0$ and $T_1$ are both convergent series, and we have

$$\lim_{f \to 0} T = \frac{C+1}{C}$$

that is, T asymptotically approaches 1 as C increases. This means CasAB tends to access Bloom filters only once for each bit in the boolean matrix, as AB does.

V. EXPERIMENTAL STUDY

A. Experimental setup

We conducted several experiments on a PC with an AMD Athlon X2 2.1 GHz CPU and 2 GB of RAM. We used two datasets in our experiments. The first one is a synthetic dataset which is uniformly distributed, with N = 1,000,000 and C = 100 for all 10 attributes. The second one is a real-world dataset (the weather dataset), which is highly skewed, with N = 507685. It has 9 attributes with cardinalities 7037, 352, 179, 152, 101, 15, 10, 8 and 2.

A set of good hash functions is critical for both AB and CasAB. Well-selected hash functions not only reduce the space size of CasAB, but also dramatically speed up queries. To reduce the computation cost of hashing, we used two simple and efficient hash functions in our implementation. The first one, $h_1(x)$, is a circular hash function introduced in [8], and the second one, $h_2(x)$, is a general purpose hash function from [15]. Other hash functions are produced by $g_i(x) = h_1(x) + i \cdot h_2(x)$ according to [13].

[Figure 3. Space size. (a) Space vs α, C = 100 (synthetic dataset). (b) Space vs α, C = 7037 (weather dataset). Curves: CasAB-1, CasAB-2, AB. x-axis: alpha; y-axis: size (KB).]
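The closed forms of Section IV are easy to evaluate numerically. The sketch below is a hedged illustration with made-up function names: it computes the total space per row, searches integer values of α for the minimum (using $f_{min} = (1/2)^{\alpha \ln 2}$ as in Section IV-A), and evaluates the number of levels and the average number of Bloom filter accesses. The search range for α and the table capacity used in the example are assumptions.

```python
import math


def f_min(alpha: float) -> float:
    """Minimum false positive rate for a given alpha: (1/2)^(alpha * ln 2)."""
    return 0.5 ** (alpha * math.log(2))


def space_bits_per_row(alpha: float, cardinality: int, levels: int) -> float:
    """s_all / N = alpha * (1 + C * f * (1 - f^L) / (1 - f))."""
    f = f_min(alpha)
    return alpha * (1 + cardinality * f * (1 - f ** levels) / (1 - f))


def optimal_alpha(cardinality: int, levels: int = 1, candidates=range(2, 65)) -> int:
    """Integer alpha with the smallest total space (brute-force search)."""
    return min(candidates, key=lambda a: space_bits_per_row(a, cardinality, levels))


def levels_needed(n_rows: int, f: float, table_capacity: int) -> int:
    """L = ceil(ln(n_st / N) / ln f), with at least one level."""
    return max(1, math.ceil(math.log(table_capacity / n_rows) / math.log(f)))


def average_accesses(f: float, cardinality: int, levels: int) -> float:
    """T = (C-1)/C * T0 + 1/C * T1 for uniformly distributed attribute values."""
    t0 = sum((2 * l + 1) * f ** l * (1 - f) for l in range(levels + 1))
    t1 = sum(2 * l * f ** (l - 1) * (1 - f) for l in range(1, levels + 1))
    return (cardinality - 1) / cardinality * t0 + t1 / cardinality


if __name__ == "__main__":
    for c in (100, 7037):
        a = optimal_alpha(c)
        print(c, a, space_bits_per_row(a, c, 1))
    print(levels_needed(1_000_000, f_min(13), table_capacity=1000))
    print(average_accesses(f_min(13), 100, 2))
```

Restricting the search to integer α is a simplification; a finer grid or a numerical minimizer could be substituted without changing the formulas.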

B. Space size

Figure 3(a) and 3(b) show the space sizes of AB, CasAB-1 and CasAB-2 as the parameter α varies for the synthetic dataset and the weather dataset, respectively. In these experiments, we set k to the optimal value for the minimum false positive rate based on Equation (1). The results are very similar to our theoretical analysis of space size in Figure 2(a) and 2(b). As expected, CasAB-1 and CasAB-2 have almost the same sizes for any α, approaching the size of AB as α increases. It is easy to see that the optimal value of α for the minimum space size of CasAB is 13 for the synthetic dataset and 23 for the weather dataset, conforming to our theoretical predictions by the method described in subsection IV-A.

[Figure 4. Query performance. (a) Precision vs α, C=100, #dim=1 (synthetic dataset). (b) Precision vs α ratio, C=352, #dim=1 (weather dataset). (c) Precision vs #dim, C=100 (synthetic dataset). (d) Exec. time vs α, C=100, #dim=1 (synthetic dataset). (e) Exec. time vs α ratio, C=352, #dim=1 (weather dataset). (f) Exec. time vs #dim, C=100 (synthetic dataset). Precision is plotted as average TFPR (%), time in seconds; curves: AB, CasAB (CasAB-1, CasAB-2).]

[Figure 5. The number of hash functions. (a) Space vs k, C = 100, α = 13. (b) Precision vs k, C = 100, α = 13. (c) Time vs k, C = 100, α = 13. Synthetic dataset; curves: AB, CasAB (CasAB-1, CasAB-2).]

C. Query performance

In the query performance experiments, we generated 400 multi-dimensional range queries. Let r denote the percentage of rows being queried by each query, and let sel denote the attribute selectivity, i.e., the percentage of distinct values in the cardinality of the attribute. A query is generated as follows. We randomly select a row with row number l from the dataset as the lower bound of the range of rows, and the upper bound is given by l + r * N. If the upper bound is greater than N, we set it to N. The range width is calculated as C * sel. For the query precision of AB, we use the average TFPR of all queries. Query elapsed time is represented by the total execution time of the 400 queries. We set r to 5% and sel to 20% in all experiments below. In the weather dataset, the cardinality varies from dimension to dimension, and the optimal α is not the same for each dimension. So we change α by multiplying it by a coefficient, which we call the α ratio.

Figure 4(a) and 4(b) show the query precision on the synthetic dataset and the weather dataset, respectively. Both CasAB-1 and CasAB-2 guarantee that no false positive tuples occur for any α, while AB cannot reach 100% precision for any α. Note that the TFPR of AB is very high when α is relatively small, while CasAB always returns precise answers, at the cost of more space as α decreases.

Figure 4(d) and 4(e) show the total execution time for both datasets. We can see that the execution times of AB, CasAB-1 and CasAB-2 are approximately the same, although AB is always faster than CasAB-1 and CasAB-2. The difference between CasAB-1 and CasAB-2 is not significant. The execution time remains approximately constant for any α, although k increases as α increases. This is because, although the hashing cost increases, fewer false positives occur, which reduces the time spent accessing Bloom filters. Consequently, the overall time remains unchanged.

Figure 4(c) and 4(f) show the precision and execution time of AB and CasAB as the number of query dimensions changes. We can see that neither AB nor CasAB is sensitive to the number of dimensions, due to the short-cut logical operation optimization.
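The short-cut evaluation mentioned above, together with the narrow range first ordering of Section III-D and the TFPR measure of Section III-A, can be sketched as follows. This is an illustrative sketch only: the `retrieve(attr, row, col)` callable is a hypothetical stand-in for a per-attribute AB or CasAB lookup, and predicates are assumed to be inclusive column ranges.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Range = Tuple[int, int]  # inclusive (low, high) column numbers


def narrow_range_first(predicates: Dict[str, Range]) -> List[str]:
    """Order attributes by ascending width of their queried range."""
    return sorted(predicates, key=lambda attr: predicates[attr][1] - predicates[attr][0])


def row_matches(row: int, predicates: Dict[str, Range],
                retrieve: Callable[[str, int, int], int]) -> bool:
    """AND over dimensions; each dimension is an OR over the columns in its range."""
    for attr in narrow_range_first(predicates):
        low, high = predicates[attr]
        # any() stops at the first column whose bit is 1
        if not any(retrieve(attr, row, col) for col in range(low, high + 1)):
            return False  # one failed dimension decides the row
    return True


def tfpr(reported: Iterable[int], truth: Iterable[int]) -> float:
    """n_fp / (n_tp + n_fp), assuming the index never misses a true row."""
    reported, truth = set(reported), set(truth)
    if not reported:
        return 0.0
    return len(reported - truth) / len(reported)


if __name__ == "__main__":
    query = {"A": (20, 400), "B": (10, 15), "C": (10, 20)}
    print(narrow_range_first(query))  # ['B', 'C', 'A']
```

Sorting by range width is the simplest reading of narrow range first; a cost model that also accounts for attribute selectivity would be a natural refinement.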

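The observation that k increases with α can be illustrated with the standard Bloom filter approximations. The sketch below assumes that α denotes the number of filter bits per inserted element and that the hash functions are independent and uniform; the function names are illustrative only. Under these assumptions the false positive rate is approximately (1 - e^(-k/α))^k, which is minimized near k = α ln 2.

```python
import math


def optimal_k(alpha):
    """Number of hash functions minimizing the false positive rate.

    Assumes alpha is the number of Bloom filter bits per inserted element.
    """
    return max(1, round(alpha * math.log(2)))


def false_positive_rate(alpha, k):
    """Standard approximation of the Bloom filter false positive rate."""
    return (1.0 - math.exp(-k / alpha)) ** k


# For alpha = 13 the optimum is k = 9; smaller k gives a higher error rate.
alpha = 13
for k in range(4, optimal_k(alpha) + 1):
    print(f"k = {k}: approximate false positive rate = {false_positive_rate(alpha, k):.5f}")
```

A smaller k lowers the hashing cost per lookup but raises the false positive rate of each filter, which is the trade-off between query speed and precision (for AB) or space (for CasAB) examined in the experiments.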
D. The number of hash functions

In the above experiments, we always used the optimal k, i.e., the k that minimizes the false positive rate of the Bloom filter for the given α. However, one can always specify a smaller k for faster queries. Figure 5(a) shows the space sizes of AB and CasAB-2 as k varies from 4 to 9 for α = 13. The figure reveals that a smaller k leads to a larger space size for CasAB. Figure 5(c) shows the corresponding query speed of AB and CasAB-2, which indicates that a smaller k reduces the query time. Since a smaller k reduces the computational cost of hashing and thus speeds up queries, there is a trade-off between query speed and space size when choosing k for CasAB. Figure 5(b) reveals that CasAB is always 100% accurate no matter how small k is, while the precision of AB drops quickly as k decreases.

VI. CONCLUSION

The AB index has many distinct properties compared to other bitmap indices. However, the approximate query results of AB severely limit its application. In this paper, we improve AB into a new indexing scheme via cascaded Bloom filters. Our proposed CasAB indexing scheme completely eliminates false positives with only a small space and time overhead. We give a theoretical analysis of the space and time complexities of CasAB, and give a method to predict the optimal parameters for minimum space size. Experiments verify that CasAB answers queries accurately and that its space and execution time are slightly larger than those of AB, in agreement with our theoretical analysis. Future work includes further improving the efficiency of the CasAB construction algorithm for large cardinalities and the performance of the query algorithm for ad-hoc queries.

ACKNOWLEDGMENT

We are very grateful to Dr. Wei Wang (The University of New South Wales), who suggested that we study this topic and offered many valuable suggestions during the research. Thanks also go to Guadalupe Canahuate (The Ohio State University) for her help with hash functions and her many valuable suggestions on our manuscript.

REFERENCES

[1] C. Y. Chan and Y. E. Ioannidis, “Bitmap index design and evaluation,” in SIGMOD Conference, 1998, pp. 355–366.
[2] C. Y. Chan and Y. E. Ioannidis, “An efficient bitmap encoding scheme for selection queries,” in SIGMOD Conference, 1999, pp. 215–226.
[3] E. J. O’Neil, P. E. O’Neil, and K. Wu, “Bitmap index design choices and their performance implications,” in IDEAS, 2007, pp. 72–84.
[4] K. Wu, E. J. Otoo, and A. Shoshani, “Optimizing bitmap indices with efficient compression,” ACM Trans. Database Syst., vol. 31, no. 1, pp. 1–38, 2006.
[5] N. Koudas, “Space efficient bitmap indexing,” in CIKM, 2000, pp. 194–201.
[6] R. R. Sinha and M. Winslett, “Multi-resolution bitmap indexes for scientific data,” ACM Trans. Database Syst., vol. 32, no. 3, p. 16, 2007.
[7] K. Wu, K. Stockinger, and A. Shoshani, “Breaking the curse of cardinality on bitmap indexes,” in SSDBM, 2008, pp. 348–365.
[8] T. Apaydin, G. Canahuate, H. Ferhatosmanoglu, and A. S. Tosun, “Approximate encoding for direct access and query processing over compressed bitmaps,” in VLDB, 2006, pp. 846–857.
[9] B. H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Commun. ACM, vol. 13, no. 7, pp. 422–426, 1970.
[10] A. Pinar, T. Tao, and H. Ferhatosmanoglu, “Compressing bitmap indices by data reorganization,” in ICDE, 2005, pp. 310–321.
[11] D. Lemire, O. Kaser, and K. Aouiche, “Sorting improves word-aligned bitmap indexes,” CoRR, vol. abs/0901.3751, 2009.
[12] K. Xie, J.-G. Wen, D.-F. Zhang, and G.-G. Xie, “Bloom filter query algorithm,” Journal of Software, vol. 20, no. 1, pp. 96–108, 2009.
[13] A. Kirsch and M. Mitzenmacher, “Less hashing, same performance: Building a better Bloom filter,” in ESA, 2006, pp. 456–467.
[14] B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal, “The Bloomier filter: an efficient data structure for static support lookup tables,” in SODA, 2004, pp. 30–39.
[15] A. Partow, “General purpose hash function algorithms library,” http://www.partow.net/programming/hashfunctions/index.html, 2002.
