IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 8, AUGUST 2014
Bloom Filter Based Associative Deletion
Jiangbo Qian, Qiang Zhu, Senior Member, IEEE, and Yongli Wang
Abstract: Bloom filters are widely used, powerful tools for processing set membership queries. However, they are not entirely suitable for many new applications, such as deleting one attribute value according to another attribute value for a set of data objects/items with two correlated attributes. In this paper, we introduce a concept for such an operation, called the associative deletion. To realize this operation, we propose a new Bloom filter data structure, named IABF (Improved Associative deletion Bloom Filter), which keeps the association information on the two correlated attributes of items in the given data set. Based on IABF, we present an algorithm to perform associative deletions, which can be applied to both normal data and streaming data. To further accelerate the operation, we also illustrate a hardware coprocessor implementation for a crucial component of the algorithm. Detailed theoretical analysis and experimental results demonstrate that the presented IABF technique can accurately process associative deletions with controlled false positive and negative rates.
Index Terms: Bloom filter, associative deletion, false positive, false negative, algorithm, hardware acceleration
1 INTRODUCTION
The standard Bloom filter (BF), a hash-based data structure, is a powerful tool for processing set membership queries, such as "does element x belong to set S?" [1]. The standard Bloom filter is an effective space-saving tool because it uses a bit array and several independent hash functions to represent a set of elements. Although this method introduces a small probability of false positives when answering a query, many applications accept it as a tradeoff among memory overhead, computation cost, and errors.
Recent developments in areas like network services and data processing have led to a greater need for the ability to process queries over multi-attribute items/objects, rather than merely single-attribute items. Although some theories of multi-attribute Bloom filters in various forms have been developed [2], [3], they were designed only for processing membership queries. Some useful operations, such as the associative deletion, semi-join, and expired item elimination, which need to perform deleting operations on two correlated attributes of items in a data set, are seldom studied.
Assume that a network filtering service has an access control list, with each item in the list having two attributes (IP address, hostname), to control the access destinations by IP addresses or hostnames (Fig. 1a). To save memory,
J. Qian is with the School of Information Science and Engineering, Ningbo University, Zhejiang 315211, China. E-mail: [email protected].
Q. Zhu is with the Department of Computer and Information Science, University of Michigan, Dearborn, MI 48128 USA. E-mail: qzhu@umich.edu.
Y. Wang is with the School of Computer Science and Technology, Nanjing University of Science and Technology, Jiangsu 210094, China. E-mail: [email protected].
the system uses two BFs, i.e., CBFA and CBFB in the figure, to represent IP addresses and hostnames, respectively. If a set of items (ip, host) is to be deleted from the two BFs, the conventional process is to update CBFA and CBFB by hashing ip and host, respectively. However, in many cases, we have only the deleted IP addresses, which are represented by another BF (i.e., CBFAD). Of course, we can get the updated CBFA by subtracting CBFAD from CBFA. But how can we get the updated CBFB if no hostname is given? That is, how do we maintain the consistency under which deleting an IP address also removes the associated hostname?
Another example of a similar problem is the removal of expired items for a sliding window in streaming data (Fig. 1b). Assume that each item has two attributes, namely, a timestamp and a URL. To save space, the system represents the items received in the last 10 seconds using two BFs, i.e., CBFA' for timestamps and CBFB' for URLs in the figure. Suppose that the current time is the 100th second. It is then time to remove the URLs that arrived at the 90th second. It is not difficult for the system to maintain CBFA' for the 90th second, as 90 is given. But how does the system maintain CBFB'?
We call such an operation the Bloom filter based associative deletion. If this deletion is supported, space cost and network transmission overhead are reduced, and the performance of relevant jobs improves significantly because they are performed on Bloom filters directly.
The high performance requirements of applications such as the above access control in a network service and item update for streaming data motivate our exploration of efficient solutions in both software and hardware. The technical contributions of this paper are as follows:
1. The concept of the Bloom filter based associative deletion for data with two correlated attributes is introduced.
2. A novel data structure and its relevant algorithm for the associative deletion are proposed to handle both normal and streaming data.
Manuscript received 10 Apr. 2013; revised 12 Aug. 2013; accepted 19 Aug. 2013. Date of publication 5 Sept. 2013; date of current version 16 July 2014. Recommended for acceptance by X. Tang. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2013.223
1045-9219 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Fig. 1. Examples of Bloom filter based associative deletions. (a) Updating CBFB according to deleted IP addresses. (b) Removing expired items in a sliding window.
3. The optimization based on theoretical analysis for the proposed data structure and algorithm is given, which shows that the false positive and negative rates can be controlled.
4. A hardware coprocessor design to accelerate the associative deletion is provided.
5. The experiments with real and synthetic data show that our Bloom filter data structure can provide high performance for the associative deletion with low false rates, which is consistent with our theoretical analysis.
The rest of the paper is organized as follows. Section 2 highlights the preliminaries and problem description. Section 3 illustrates a basic data structure for the associative deletion. Section 4 presents an improved data structure and its algorithm. The theoretical analysis to optimize the data structure is provided in Section 5 and Appendices A-F. A hardware coprocessor design to accelerate the associative deletion is given in Section 6. The performance evaluation is reported in Section 7 and Appendices G-H. Section 8 concludes the paper. The related work is summarized in Appendix I. All the appendices are provided in an online supplementary file, available in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2013.223.
2 PRELIMINARIES AND PROBLEM DESCRIPTION
In this section, we first review some basic knowledge of Bloom filters that is essential to our proposed technique. We then describe the associative deletion problem.
2.1 Bloom Filter
A standard BF is an $m$-bit array representing an $n$-element set $W$. All bits in the array are initially set to 0. A standard BF construction operation uses $k$ independent hash functions $\{h_1, h_2, \ldots, h_k\}$ to hash each element $x$ of $W$ and sets the bits at locations $\{h_1(x), h_2(x), \ldots, h_k(x)\}$ of the array to 1. To check if an element $y$ is a member of $W$, one needs to check whether the bits at all locations $h_i(y)$ are 1 $(1 \le i \le k)$. If not, $y$ is not in $W$. If so, $y$ is regarded as a member of $W$. False positives are possible; that is, the result of a query may indicate that $y$ belongs to $W$ although $y$ in fact does not. However, there is no false negative; that is, a negative query result definitely means that the corresponding $y$ is not in $W$. The false positive rate can be calculated by the following formula [4]:
$$f_p = \left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k. \tag{1}$$
Therefore, parameters $k$, $m$, and $n$ influence the false positive rate. Broder et al. [4] summarized that, when $m$ and $n$ are given, the optimal number $k$ of hash functions is given as follows:
$$k = \ln 2 \cdot (m/n). \tag{2}$$
In this case the false positive rate is
$$f_p \approx (1/2)^k \approx 0.6185^{m/n}. \tag{3}$$
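As a quick numerical illustration of Formulas (1)-(3), the following minimal Python sketch evaluates the exact form of Formula (1) at the (rounded) optimal $k$ from Formula (2) and compares it with the approximation in Formula (3). The values of `m` and `n` are illustrative assumptions, not parameters taken from the paper.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Exact form of Formula (1): (1 - (1 - 1/m)^(k*n))^k."""
    return (1.0 - (1.0 - 1.0 / m) ** (k * n)) ** k


def optimal_k(m: int, n: int) -> float:
    """Formula (2): k = ln 2 * (m / n); real-valued, rounded in practice."""
    return math.log(2) * m / n


# Illustrative sizes (assumed for this sketch only).
m, n = 9600, 1000
k = round(optimal_k(m, n))

print("optimal k (rounded):", k)
print("Formula (1):", false_positive_rate(m, n, k))
print("Formula (3):", 0.6185 ** (m / n))
```

Because $k$ must be an integer in practice, the rate from Formula (1) is only approximately equal to the value of Formula (3).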
The standard Bloom filter can elegantly represent a set. However, a deletion cannot be supported by reversing the insertion process since a location set to 0 by one deleted
Fig. 2. Inserting an item into the BABF structure.
element may affect another element. To handle this problem, Fan et al. [5] proposed a Counting Bloom Filter (CBF), in which each array entry in the Bloom filter is not a single bit but a small counter instead. The corresponding counters increase or decrease when an element is inserted or deleted. The analysis from [4] reveals that 4 bits per counter should yield an acceptable overflow probability for most applications. If $k = \ln 2 \cdot (m/n)$, which is the optimal parameter, the probability for a CBF to overflow is $P(\max_i \{c(i)\} \ge 16) \le 1.37 \times 10^{-15} \times m$, where $c(i)$ is the counting number at location $i$. Generally,
$$P\left(\max_i \{c(i)\} \ge j\right) \le m \left(\frac{enk}{jm}\right)^j \tag{4}$$
where the number $e$ is the base of the natural logarithm. Moreover, we can use the following formula [4]:
$$P(\mathrm{counter}(i) = j) = \binom{nk}{j} \left(\frac{1}{m}\right)^j \left(1 - \frac{1}{m}\right)^{nk - j} \tag{5}$$
to calculate the probability of the $i$th counter being incremented $j$ times. Unless stated otherwise, all the BFs used in this paper are CBFs.
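The counting scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `CountingBloomFilter`, the salted SHA-256 hashing, and the saturating-counter behavior are all choices made for this sketch.

```python
import hashlib


class CountingBloomFilter:
    """Minimal CBF sketch: m counters, k hash functions, b-bit counters."""

    def __init__(self, m, k, bits=4):
        self.m, self.k = m, k
        self.max_count = (1 << bits) - 1  # 4 bits -> counters saturate at 15
        self.counters = [0] * m

    def locations(self, x):
        # Assumption: k hash functions simulated by salting SHA-256.
        locs = []
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            locs.append(int.from_bytes(digest[:8], "big") % self.m)
        return locs

    def insert(self, x):
        for loc in self.locations(x):
            if self.counters[loc] < self.max_count:  # saturate, do not wrap
                self.counters[loc] += 1

    def delete(self, x):
        for loc in self.locations(x):
            if self.counters[loc] > 0:
                self.counters[loc] -= 1

    def contains(self, x):
        return all(self.counters[loc] > 0 for loc in self.locations(x))


cbf = CountingBloomFilter(m=1024, k=3)
cbf.insert("a")
cbf.insert("b")
cbf.delete("a")
print(cbf.contains("b"))
```

Deleting "a" only decrements the counters that its own insertion incremented, so "b" remains a member, which a plain bit array cannot guarantee.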
2.2 Problem Description
Assume $S$ is a set of items with two (correlated) attributes $A$ and $B$, and $X$ is a set of items to be deleted from $S$. Let CBFA, CBFAD, CBFB, CBFBD, and CBFBR be the CBFs representing different sets of (attribute) elements resulting from extracting a specific attribute from an item set. Letter A (after the prefix CBF) denotes attribute $A$, letter B denotes attribute $B$, letter D denotes a deletion set, and letter R denotes the remaining set after the relevant elements are deleted. These CBFs all have the same number $m$ of locations and the same number $k$ of hash functions. Specifically, CBFA (respectively, CBFB) represents the set of elements obtained from extracting attribute $A$ (respectively, $B$) from $S$; CBFAD represents the set of elements obtained from extracting attribute $A$ from $X$, i.e., the deleted elements for attribute $A$, which is used to associatively delete the relevant elements from CBFB (for attribute $B$). CBFBD represents the set of elements obtained from extracting attribute $B$ from $X$, and CBFBR represents the set of remaining elements in CBFB after elements in CBFBD are removed; namely, CBFBD + CBFBR = CBFB. Fig. 1a shows an example.
Definition 1 (Bloom Filter Based Associative Deletion). A (BF-based) associative deletion is an operation to obtain CBFBR and CBFBD from CBFAD, CBFB, and the association information between attributes $A$ and $B$ without directly seeing items from $S$ and $X$. In other words, the associative deletion is a decomposition of CBFB into CBFBR and CBFBD according to CBFAD and the relevant association information.
A CBF (e.g., CBFD in Section 3) may have false positives, which can cause false positive deletions from another related CBF (e.g., CBFB) in the same structure. As a result, a CBF (e.g., CBFBR, CBFBD) in such a structure with multiple CBFs may have false negatives in addition to false positives. The false rates are defined as follows.
Definition 2 (False Negative Rate). $f_n$ = (number of false negative elements) / (number of positive elements).
Definition 3 (False Positive Rate). $f_p$ = (number of false positive elements) / (number of negative elements).
To facilitate reading, we list the important notations used in the paper in Table 5 in Appendix J.
3 BASIC STRUCTURE FOR ASSOCIATIVE DELETION
In this section, we use a basic structure to realize the associative deletion and analyze its accuracy.
3.1 Basic Structure
A structure which supports the associative deletion should first represent items with the two attributes used for the deletion. An initial idea is to have a separate CBF for each of the attributes. As a result, for two items $(a_1, b_1)$ and $(a_2, b_2)$ to be inserted into the structure, $a_1$ and $a_2$ are inserted into BF 1, and $b_1$ and $b_2$ are inserted into BF 2, where BF 1 and BF 2 are the two CBFs. Since this naive approach would also accidentally insert other unwanted items such as $(a_1, b_2)$ and $(a_2, b_1)$ into the structure, an extra filter is desired to capture the inherent association between the values of the two attributes. Based on the above idea, a Basic Bloom Filter structure (BABF) supporting the associative deletion is suggested in Fig. 2, which is similar to PBF [3] but serves a different purpose. Specifically, we use two CBFs (i.e., CBFA and CBFB) to store attributes $A$ and $B$ and use an extra CBF (i.e.,
TABLE 1 BABF Parameters
CBFD) to store the association between $A$ and $B$. Each of CBFA and CBFB has $m$ locations and uses $k$ independent hash functions $\{h_1, h_2, \ldots, h_k\}$. CBFD has $m'$ locations and uses $p$ independent hash functions $\{h'_1, h'_2, \ldots, h'_p\}$.
Fig. 2 illustrates the procedure of inserting an item $(a, b)$. We first increase by 1 the counter at each of the locations $h_1(a), h_2(a), \ldots, h_k(a)$ in CBFA, and then increase by 1 the counter at each of the locations $h_1(b), h_2(b), \ldots, h_k(b)$ in CBFB. Furthermore, we take $h_1(a) + h_1(b)$, $h_2(a) + h_2(b)$, $\ldots$, $h_k(a) + h_k(b)$ as $k$ new elements, i.e., $c_1, c_2, \ldots, c_k$, and insert them into CBFD to capture the association, where '+' denotes the concatenating operation. For example, if $h_1(a)$ = '00' and $h_1(b)$ = '17', then $c_1 = h_1(a) + h_1(b)$ = '0017'. For simplicity, we call such a concatenated result an addressline. As a result, to check if an item is inserted, the values/elements of its two attributes must match CBFA and CBFB, respectively, and the relevant addresslines must match CBFD.
The idea of handling the associative deletion is to find all addresslines between CBFAD (having $d$ deleted elements for attribute $A$) and CBFB using CBFD verification. For each pair of addresses, one in CBFAD and the other in CBFB, if their counters are not zero, they are concatenated as a candidate addressline. For each candidate addressline $c$, we use $\{h'_1, h'_2, \ldots, h'_p\}$ and CBFD to check/verify whether it is a true addressline. If the verification result is positive, we perform the following 2 steps:
1. Decrease by 1 the counters at the locations corresponding to $c$ in CBFD.
2. Decompose $c$ into two addresses $A_1$ and $A_2$ for CBFAD and CBFB, respectively. We then decrease by 1 the counter at $A_1$ in CBFAD, decrease by 1 the counter at $A_2$ in CBFB, and increase by 1 the counter at $A_2$ in CBFBD.
After verifying all candidate addresslines, CBFBD and CBFBR are obtained.
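The insertion and deletion procedures above can be sketched as follows. This is an illustrative Python sketch under stated assumptions rather than the authors' implementation: the helper `h` (salted SHA-256), the class name `BABF`, the textual encoding of an addressline, and the `while` loop that removes an addressline as many times as its counters allow are all assumptions made here.

```python
import hashlib
from collections import Counter


def h(salt, x, m):
    """Hypothetical hash helper: salted SHA-256 reduced modulo m."""
    digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def can_take(counters, locs):
    """True if each location holds at least as many counts as requested."""
    return all(counters[loc] >= c for loc, c in Counter(locs).items())


class BABF:
    def __init__(self, m, k, m_d, p):
        self.m, self.k, self.m_d, self.p = m, k, m_d, p
        self.cbfa, self.cbfb, self.cbfd = [0] * m, [0] * m, [0] * m_d

    def addresses(self, x):
        # h_1(x), ..., h_k(x): shared by CBFA and CBFB.
        return [h(f"h{i}", x, self.m) for i in range(self.k)]

    def line_locations(self, a1, a2):
        # h'_1(c), ..., h'_p(c) for the addressline c = a1 + a2.
        line = f"{a1}+{a2}"
        return [h(f"hd{j}", line, self.m_d) for j in range(self.p)]

    def insert(self, a, b):
        for a1, a2 in zip(self.addresses(a), self.addresses(b)):
            self.cbfa[a1] += 1
            self.cbfb[a2] += 1
            for loc in self.line_locations(a1, a2):
                self.cbfd[loc] += 1

    def associative_delete(self, cbfad):
        """Split CBFB into (CBFBD, CBFBR) using CBFAD and CBFD only."""
        cbfad, cbfbd = list(cbfad), [0] * self.m
        for a1 in [i for i, c in enumerate(cbfad) if c]:
            for a2 in [i for i, c in enumerate(self.cbfb) if c]:
                locs = self.line_locations(a1, a2)
                while cbfad[a1] and self.cbfb[a2] and can_take(self.cbfd, locs):
                    for loc in locs:            # step 1: update CBFD
                        self.cbfd[loc] -= 1
                    cbfad[a1] -= 1              # step 2: update the other CBFs
                    self.cbfb[a2] -= 1
                    cbfbd[a2] += 1
        return cbfbd, self.cbfb                 # CBFB now plays the role of CBFBR


f = BABF(m=1 << 16, k=3, m_d=1 << 16, p=8)
items = [("10.0.0.1", "a.example"), ("10.0.0.2", "b.example"), ("10.0.0.3", "c.example")]
for ip, host in items:
    f.insert(ip, host)

cbfad = [0] * f.m                               # deletion set holds one IP address
for a1 in f.addresses("10.0.0.1"):
    cbfad[a1] += 1

cbfbd, cbfbr = f.associative_delete(cbfad)
print(sum(cbfbd), sum(cbfbr))
```

With these sizes a false positive addressline is extremely unlikely, so one would expect the counts for "a.example" to move from CBFB into CBFBD while the other two hostnames stay in CBFBR. Note that no hostname is ever passed to `associative_delete`.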
3.2 Accuracy Analysis Scenarios
A key requirement for an associative deletion technique is accuracy. To analyze the accuracy of the BABF technique, the true addresslines are regarded as elements in CBFD, and all candidate addresslines generated between CBFAD and CBFB are regarded as test data. Because there are many candidate addresslines, many false positive addresslines may result, which may lead to a large error in the associative deletion. Therefore, the number $p$ of hash functions and the number of true addresslines are very important. They directly determine the accuracy when a Bloom filter structure is given. In this section and Section 4, we will illustrate the accuracy analysis for four specific simple non-optimized
scenarios. In Section 5, we will then present an optimization based on theoretical analysis.
Scenario 1 Assume that the parameters of the BABF are given in Table 1. Because each of CBFA and CBFB contains 200 elements and uses 3 hash functions, the number of true addresslines mapped into CBFD is $nk = 200 \times 3 = 600$. When an associative deletion is performed, some candidate addresslines may be mistakenly regarded as true addresslines because of the false positives. Based on Formula (1), the number of false positive addresslines can be estimated as follows:
$$l_c = \left(1 - \left(1 - \frac{1}{65536}\right)^{8 \times 600}\right)^8 \times \left((10 \times 3) \times (200 \times 3) - 10 \times 3\right) \approx 0.000011.$$
Here, $10 \times 3$ is the number of true addresslines, and $((10 \times 3) \times (200 \times 3) - 10 \times 3)$ is the number of test data/candidate addresslines to be verified by CBFD. We can see that the number of false positive addresslines is 0.000011 and the number of true addresslines is $l_o = 30$.
If a false positive addressline is regarded as a true one and the corresponding counters in CBFs are decreased, accuracy may be affected. Specifically, false positives may cause wrong elements in CBFB to be deleted, which results in the following impact: attribute elements of some items are missing from CBFB, and then false negatives for CBFBD and CBFBR are generated because CBFB is in an inaccurate state. Fortunately, the probability of incorrect deletions in this scenario is very small because $l_o \gg l_c$. However, Scenario 2 below shows a different result.
Scenario 2 Consider larger $n = 2000$ and $d = 1000$, and keep the other parameters the same as those in Table 1. We have
$$l_c = \left(1 - \left(1 - \frac{1}{65536}\right)^{8 \times 6000}\right)^8 \times \left((1000 \times 3) \times (2000 \times 3) - 1000 \times 3\right) \approx 95121.$$
Here $l_c \approx 95121$, which is far larger than $l_o = 3000$, may cause a large number of false positive deletions. To overcome this problem, an improved structure/method should be introduced to prune most of the false positive addresslines.
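The two estimates above can be reproduced with a short Python sketch. The function name is hypothetical, and the parameter values ($k = 3$, $p = 8$, $m' = 65536$) are read off the formulas in Scenarios 1 and 2 rather than from Table 1 itself, which is an assumption of this sketch.

```python
def expected_false_addresslines(n, d, k, p, m_d):
    """Estimate l_c: CBFD false positive rate times the number of false candidates."""
    # Formula (1) applied to CBFD, which holds n*k true addresslines.
    fp = (1.0 - (1.0 - 1.0 / m_d) ** (p * n * k)) ** p
    # All (CBFAD address, CBFB address) pairs minus the d*k true addresslines.
    candidates = (d * k) * (n * k) - d * k
    return fp * candidates


lc_scenario1 = expected_false_addresslines(n=200, d=10, k=3, p=8, m_d=65536)
lc_scenario2 = expected_false_addresslines(n=2000, d=1000, k=3, p=8, m_d=65536)

print(f"Scenario 1: l_c ~ {lc_scenario1:.6f} versus l_o = {10 * 3}")
print(f"Scenario 2: l_c ~ {lc_scenario2:.0f} versus l_o = {1000 * 3}")
```

The candidate count grows with the product $d \times n$ while the CBFD load grows with $n$, which is why the estimate jumps from far below one false positive to tens of thousands.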
4 IMPROVED STRUCTURE FOR ASSOCIATIVE DELETION
In this section, we propose an improved structure to increase the accuracy and give two specific scenarios to discuss its accuracy.
4.1 Improved Structure
To prune false positive addresslines, we propose a new IABF (Improved Associative deletion Bloom Filter) structure, which uses an additional filter, i.e., CBFI, with hash functions $\{h''_1, h''_2, \ldots, h''_q\}$ to store more association information. For
Fig. 3. IABF structure.
each item $(a, b)$, it stores a new element called a linegroup, i.e., $h_1(a) + h_1(b) + h_2(a) + h_2(b) + \cdots + h_k(a) + h_k(b)$, where '+' denotes the concatenation. In fact, a linegroup is a concatenation of all addresslines from an element in the sequential order of the hash functions. The basic idea is to make better use of the information on the dependency of the attributes to improve accuracy.
There are two stages to process an associative deletion using IABF. The first stage is to verify whether a candidate addressline (e.g., $c_1$ or $c_2$ in Fig. 3) is a true positive addressline using CBFD, and then put the passed addresslines (which may contain false positives) into a set $V$. The second stage is to verify whether a permutation of $k$ addresslines from $V$ is a true positive linegroup using CBFI. Here $k$ is the number of hash functions of CBFA/CBFB. For example, if $k = 2$ and $c_1'$ and $c_2'$ are both in $V$, we should verify $c_1' + c_2'$ and $c_2' + c_1'$, where '+' denotes the concatenation. We then take the passed permutations as linegroups. For each linegroup $g$, we perform the following 3 steps:
1. Decrease by 1 the counters at the locations corresponding to $g$ in CBFI.
2. Decompose $g$ into addresslines $c_1, c_2, \ldots, c_k$. For each of $c_1, c_2, \ldots, c_k$, use hash functions $\{h'_1, h'_2, \ldots, h'_p\}$ to decrease by 1 the counters at all its locations in CBFD.
3. For each of $c_1, c_2, \ldots, c_k$, decompose it into two addresses $A_1$ and $A_2$ for CBFAD and CBFB, respectively. We then decrease by 1 the counter at $A_1$ in CBFAD, decrease by 1 the counter at $A_2$ in CBFB, and increase by 1 the counter at $A_2$ in CBFBD.
After examining all candidate linegroups, CBFBD is obtained.
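The two-stage procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper `h`, the class name `IABF`, the tuple encoding of addresslines and linegroups, and the extra counter-sufficiency checks before each decrement are all choices made for this sketch. An item whose $k$ addresslines coincide is not handled here.

```python
import hashlib
from collections import Counter
from itertools import permutations


def h(salt, x, m):
    """Hypothetical hash helper: salted SHA-256 reduced modulo m."""
    digest = hashlib.sha256(f"{salt}|{x}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m


def can_take(counters, locs):
    """True if each location holds at least as many counts as requested."""
    return all(counters[loc] >= c for loc, c in Counter(locs).items())


class IABF:
    def __init__(self, m, k, m_d, p, m_i, q):
        self.m, self.k, self.p, self.q = m, k, p, q
        self.cbfa, self.cbfb = [0] * m, [0] * m
        self.cbfd, self.cbfi = [0] * m_d, [0] * m_i

    def addresses(self, x):
        return [h(f"h{i}", x, self.m) for i in range(self.k)]

    def line_locations(self, line):      # line = (A1, A2), hashed by h'_1..h'_p
        return [h(f"hd{j}", line, len(self.cbfd)) for j in range(self.p)]

    def group_locations(self, group):    # group = k lines, hashed by h''_1..h''_q
        return [h(f"hi{j}", group, len(self.cbfi)) for j in range(self.q)]

    def insert(self, a, b):
        group = tuple(zip(self.addresses(a), self.addresses(b)))
        for a1, a2 in group:
            self.cbfa[a1] += 1
            self.cbfb[a2] += 1
            for loc in self.line_locations((a1, a2)):
                self.cbfd[loc] += 1
        for loc in self.group_locations(group):
            self.cbfi[loc] += 1

    def associative_delete(self, cbfad):
        cbfad, cbfbd = list(cbfad), [0] * self.m
        # Stage 1: candidate addresslines that pass CBFD form the set V.
        passed = [(a1, a2)
                  for a1, ca in enumerate(cbfad) if ca
                  for a2, cb in enumerate(self.cbfb) if cb
                  if can_take(self.cbfd, self.line_locations((a1, a2)))]
        # Stage 2: verify each permutation of k passed addresslines with CBFI.
        for group in permutations(passed, self.k):
            i_locs = self.group_locations(group)
            d_locs = [loc for line in group for loc in self.line_locations(line)]
            if not (can_take(self.cbfi, i_locs) and can_take(self.cbfd, d_locs)
                    and can_take(cbfad, [a1 for a1, _ in group])
                    and can_take(self.cbfb, [a2 for _, a2 in group])):
                continue
            for loc in i_locs:           # step 1: update CBFI
                self.cbfi[loc] -= 1
            for loc in d_locs:           # step 2: update CBFD
                self.cbfd[loc] -= 1
            for a1, a2 in group:         # step 3: update CBFAD, CBFB, CBFBD
                cbfad[a1] -= 1
                self.cbfb[a2] -= 1
                cbfbd[a2] += 1
        return cbfbd, self.cbfb          # CBFB now plays the role of CBFBR


f = IABF(m=1 << 16, k=2, m_d=1 << 16, p=8, m_i=1 << 16, q=16)
items = [("10.0.0.1", "a.example"), ("10.0.0.2", "b.example"), ("10.0.0.3", "c.example")]
for ip, host in items:
    f.insert(ip, host)

cbfad = [0] * f.m
for a1 in f.addresses("10.0.0.1"):
    cbfad[a1] += 1

cbfbd, cbfbr = f.associative_delete(cbfad)
print(sum(cbfbd), sum(cbfbr))
```

Enumerating all permutations is the costly part of stage 2, which is one reason stage 1 pruning with CBFD matters for the hash calculation cost.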
4.2 Accuracy Analysis Scenarios
Scenario 3 Let us calculate the false positives using the parameters specified in Table 2. For stage 1, the number of false positive addresslines is
$$l_c = \left(1 - \left(1 - \frac{1}{65536}\right)^{8 \times 4000}\right)^8 \times \left((1000 \times 2) \times (2000 \times 2) - 1000 \times 2\right) \approx 3968.$$
For stage 2, there are 2000 true addresslines and 3968 false positive addresslines. By the permutation verification, the number of false positive linegroups is
$$l_e = \left(1 - \left(1 - \frac{1}{65536}\right)^{16 \times 2000}\right)^{16} \times \left((3968 + 2000) \times (3968 + 2000 - 1) - 1000\right) \approx 8.$$
In other words, 8 false positive linegroups may be generated when 1000 items are associatively deleted in a 2000-item set. The accuracy is much higher than that of BABF, as the number of false positives for BABF is 3968. When the number of items is large, the effect of false positive linegroups should be considered, as illustrated in the following scenario.
Scenario 4 Let us calculate the false positives when $n = 2300$ and $d = 1600$ and the other parameters in Table 2 remain the same. For stage 1, the number of false positive addresslines is
$$l_c = \left(1 - \left(1 - \frac{1}{65536}\right)^{8 \times 4600}\right)^8 \times \left((1600 \times 2) \times (2300 \times 2) - 1600 \times 2\right) \approx 17093.$$
For stage 2, there are 3200 true addresslines and 17093 false positive addresslines. By the permutation verification, the number of false positive linegroups is
$$l_e = \left(1 - \left(1 - \frac{1}{65536}\right)^{16 \times 2300}\right)^{16} \times \left((17093 + 3200) \times (17093 + 3200 - 1) - 1600\right) \approx 555.$$
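The stage 1 and stage 2 estimates for Scenarios 3 and 4 can be recomputed with the following Python sketch. The function names are hypothetical, and the default parameter values ($k = 2$, $p = 8$, $q = 16$, $m' = m'' = 65536$) are read off the formulas in the two scenarios rather than from Table 2 itself, which is an assumption of this sketch.

```python
def fp_rate(locations, elements, hashes):
    """Formula (1) for a filter with the given size, load, and hash count."""
    return (1.0 - (1.0 - 1.0 / locations) ** (hashes * elements)) ** hashes


def iabf_stage_estimates(n, d, k=2, p=8, q=16, m_d=65536, m_i=65536):
    """Return (l_c, l_e): false positive addresslines and linegroups."""
    true_lines = d * k
    # Stage 1: CBFD holds n*k addresslines; the candidates are all address pairs.
    l_c = fp_rate(m_d, n * k, p) * (true_lines * (n * k) - true_lines)
    # Stage 2: CBFI holds n linegroups; ordered pairs of passed addresslines
    # are tested, and the d true linegroups are excluded.
    v = l_c + true_lines
    l_e = fp_rate(m_i, n, q) * (v * (v - 1) - d)
    return l_c, l_e


lc3, le3 = iabf_stage_estimates(n=2000, d=1000)
lc4, le4 = iabf_stage_estimates(n=2300, d=1600)
print(f"Scenario 3: l_c ~ {lc3:.0f}, l_e ~ {le3:.1f}")
print(f"Scenario 4: l_c ~ {lc4:.0f}, l_e ~ {le4:.1f}")
```

The sketch keeps the unrounded $l_c$ when computing $l_e$, so its stage 2 values can differ slightly from figures obtained with the rounded $l_c$.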
TABLE 2 IABF Parameters
From the above two scenarios, we can see that the accuracy of IABF is higher than that of BABF. Nevertheless, the accuracy of IABF decreases as the numbers of elements in CBFAD and CBFB increase. In Scenario 4, the number of false positive linegroups $l_e \approx 555$ should not be ignored because the number of true linegroups is only $d = 1600$. In the next section, we will discuss the optimal parameters for IABF and estimate the false negative and positive rates.
5 THEORETICAL ANALYSIS
In this section, we discuss some important parameters for optimizing IABF with respect to the accuracy, cost and counter size.
5.1 Accuracy Analysis
Theorem 1. In the IABF structure, the optimal number $k$ of hash functions for CBFA/CBFB is 2.
Proof. The proof is shown in Appendix A. □
Note that, although $k = 2$ is not obtained by using Formula (2), it is a globally optimized parameter for IABF. It is better than an optimized parameter for CBFA or CBFB individually. The false positive rates of CBFA and CBFB can be controlled by parameter $m$, which will be discussed later on.
Theorem 2. Assume that the IABF structure has $n$ (input) items, each of CBFA, CBFB, CBFAD, CBFBR, and CBFBD has $m$ locations with 2 hash functions (i.e., $k = 2$), CBFAD has $d$ elements, CBFD has $m'$ locations with $p$ hash functions, and CBFI has $m''$ locations with $q$ hash functions. Then, the number of false positive items (linegroups) can be estimated as
$$l_e \approx \left(1 - e^{-qn/m''}\right)^q \left(\left(1 - e^{-2pn/m'}\right)^p 4dn + 2d\right)^2.$$
With appropriate $p$ and $q$, we have the smallest number of false positive items (linegroups) as follows:
$$l_{e_{\mathrm{lowest}}} \approx \left(4dn(0.6185)^{m'/(2n)} + 2d\right)^2 (0.6185)^{m''/n}.$$
Proof. The proof is shown in Appendix B. □
In fact, CBFAD has up to $k \times d = 2d$ hash locations. The candidate addresslines formed from these locations will yield $d$ true items (linegroups) and $l_e$ false positive items (linegroups). It often occurs that a group of false positive items and true items are bound together via one or more shared hash locations in CBFAD such that deleting a true item from the group will make the false positive items in the group disappear, and vice versa. Furthermore, some false positive items and true items may also be bound (via hash locations for their corresponding addresslines and/or linegroups) in this way in CBFD and/or CBFI. Assume that there is a group $\{r_1, r_2, g\}$ of bound items, where $r_1$, $r_2$ are true items and $g$ is a false positive item. Also assume that each item in the group has an equal probability of being deleted. Hence, deleting any item in $\{r_1, r_2\}$ will make $g$ disappear, and deleting $g$ will also make both $r_1$ and $r_2$ disappear. It is easy to see that the mathematical expectation of the number of $r$ deletions is $2 \times (2/(2 + 1))$, and the mathematical expectation of the number of $g$ deletions is $1 \times (1/(2 + 1))$. Thus, we have:
Theorem 3. For the IABF structure, the mathematical expectation of the number of deleted true items can be estimated as $I_r \approx d \times (d/(d + l_e)) = d^2/(d + l_e)$, and the mathematical expectation of the number of deleted false positive items can be estimated as $I_e \approx l_e \times (l_e/(d + l_e)) = l_e^2/(d + l_e)$.
Proof. The proof is shown in Appendix C. □
We then have the following theorem:
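Theorem 4. For the IABF with parameters given in Theorem 2,
1. the false negative rate of CBFBD can be estimated as $f_n(\mathrm{CBFBD}) \approx \dfrac{l_e}{d + l_e}$;
2. the false negative rate of CBFBR can be estimated as $f_n(\mathrm{CBFBR}) \approx \dfrac{P(\mathrm{counter}(i) = 1)}{P(\mathrm{counter}(i) \ge 1)} \cdot \dfrac{I_e}{n}$, where $i$ is the location number (assuming all locations have the same probability); in particular, since the ratio $P(\mathrm{counter}(i) = 1)/P(\mathrm{counter}(i) \ge 1)$ is close to 1, a simpler estimate is $f_n(\mathrm{CBFBR}) \approx I_e/n$;
3. the false positive rate of CBFBD can be estimated as
$$f_p(\mathrm{CBFBD}) = \left(1 - \left(1 - \frac{1}{m}\right)^{2(I_r + I_e)}\right)^2 \approx \left(1 - e^{-2(I_r + I_e)/m}\right)^2;$$
4. the false positive rate of CBFBR can be estimated as
$$f_p(\mathrm{CBFBR}) = \left(1 - \left(1 - \frac{1}{m}\right)^{2(n - I_r - I_e)}\right)^2 \approx \left(1 - e^{-2(n - I_r - I_e)/m}\right)^2.$$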
Proof. The proof is shown in Appendix D. □
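The estimates in Theorems 2-4 chain together naturally, as the following Python sketch illustrates. The function name and dictionary keys are hypothetical; the sketch assumes $k = 2$, the optimal $p$ and $q$ (so that the lowest-$l_e$ form of Theorem 2 applies), and the simpler estimate $I_e/n$ for the false negative rate of CBFBR.

```python
import math


def iabf_estimates(n, d, m, m_d, m_i):
    """Chain Theorems 2-4 for k = 2 with optimal p and q."""
    # Theorem 2 (lowest number of false positive linegroups).
    le = (4 * d * n * 0.6185 ** (m_d / (2 * n)) + 2 * d) ** 2 * 0.6185 ** (m_i / n)
    # Theorem 3 (expected numbers of deleted true / false positive items).
    i_r = d * d / (d + le)
    i_e = le * le / (d + le)
    # Theorem 4 (false rates of CBFBD and CBFBR).
    return {
        "le": le,
        "Ir": i_r,
        "Ie": i_e,
        "fn_CBFBD": le / (d + le),
        "fn_CBFBR": i_e / n,
        "fp_CBFBD": (1 - math.exp(-2 * (i_r + i_e) / m)) ** 2,
        "fp_CBFBR": (1 - math.exp(-2 * (n - i_r - i_e) / m)) ** 2,
    }


est = iabf_estimates(n=2300, d=1600, m=65536, m_d=65536, m_i=65536)
for name, value in est.items():
    print(f"{name}: {value:.6g}")
```

The values printed here come from the closed-form approximations only; they are not the experimental figures reported by the authors.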
To demonstrate the accuracy of the formulas given in Theorem 4, a comparison of experimental results (exp) and calculated estimates (cal) is shown in Table 3. The parameters, except $d$, are given in Scenario 4. From the table, we can see that most of the theoretical and experimental data match well.
From the above analysis, it can be seen that the optimized numbers of hash functions are as follows: $k = 2$; $p$ and $q$ are determined by Formula (2). Therefore, the numbers of locations in CBFD and CBFI are important to the accuracy of the associative deletion. Theorem 5 discusses how to select an appropriate number of locations.
Theorem 5. Assume that the IABF structure has $n$ items, each of CBFA, CBFB, CBFAD, CBFBR, and CBFBD has $m$ locations with 2 hash functions (i.e., $k = 2$), and CBFAD has $d$ elements; these are the known parameters. CBFD (respectively, CBFI) has an unknown number $m'$ (respectively, $m''$)
TABLE 3 Experimental Evaluations of Theorem 4
of locations. Use the following two criteria to select $m' + m''$ to ensure the accuracy of the associative deletion:
1. If the requirement is $f_n(\mathrm{CBFBD}) < F_1$, then $m' + m'' > -\dfrac{n}{0.48} \ln \dfrac{F_1}{16dn^2(1 - F_1)}$.
2. If the requirement is $f_n(\mathrm{CBFBR}) < F_2$, then $m' + m'' > -\dfrac{n}{0.48} \ln \dfrac{nF_2/2 + \sqrt{ndF_2 + n^2F_2^2/4}}{16d^2n^2}$;
where $F_1$ and $F_2$ are two given real numbers in (0, 1).
Proof. The proof is shown in Appendix E. □
For the parameters in Table 3, when the requirement is $f_n(\mathrm{CBFBD}) < F_1 = 30\%$, $n = 2300$, $d = 2000$, we have $(m' + m'')/2 > 63969$; when $f_n(\mathrm{CBFBR}) < F_2 = 5.9\%$, $n = 2300$, $d = 1600$, we have $(m' + m'')/2 > 64017$. The theoretical estimates are close to our experimental results, i.e., $m' = 65536$ and $m'' = 65536$.
As the numbers of elements in CBFA, CBFB, and CBFAD and the number of hash functions ($k = 2$) for them are known, their false positive rates can be tuned by $m$ using Formula (1). Note that CBFBD and CBFBR never have more elements than CBFA and CBFB. Thus, CBFBD and CBFBR can use the same $m$ to achieve the same or a better false positive rate. If we want to lower $m$ while meeting false positive rate requirements only for CBFBD and CBFBR, we can use Theorem 6 below. Note that lowering $m$ may increase the false positive rates of CBFA, CBFB, and CBFAD.
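The two numerical bounds quoted above can be rechecked with a short Python sketch of the Theorem 5 criteria. The function names are hypothetical; the constant 0.48 is the rounded value used in the theorem, so the results agree with the quoted figures only up to rounding.

```python
import math


def min_total_for_fn_cbfbd(n, d, f1):
    """Criterion 1 of Theorem 5: lower bound on m' + m'' for fn(CBFBD) < F1."""
    return -(n / 0.48) * math.log(f1 / (16 * d * n ** 2 * (1 - f1)))


def min_total_for_fn_cbfbr(n, d, f2):
    """Criterion 2 of Theorem 5: lower bound on m' + m'' for fn(CBFBR) < F2."""
    x = n * f2 / 2 + math.sqrt(n * d * f2 + (n * f2) ** 2 / 4)
    return -(n / 0.48) * math.log(x / (16 * d ** 2 * n ** 2))


half_1 = min_total_for_fn_cbfbd(n=2300, d=2000, f1=0.30) / 2
half_2 = min_total_for_fn_cbfbr(n=2300, d=1600, f2=0.059) / 2
print(f"(m' + m'')/2 > {half_1:.0f} for fn(CBFBD) < 30%")
print(f"(m' + m'')/2 > {half_2:.0f} for fn(CBFBR) < 5.9%")
```

Both bounds fall just below 65536, the size used for $m'$ and $m''$ in the experiments mentioned in the text.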
Theorem 6. Assume that the IABF structure has $n$ items, each of CBFA, CBFB, CBFAD, CBFBR, and CBFBD has $m$ locations with 2 hash functions (i.e., $k = 2$), CBFAD has $d$ elements, and CBFD and CBFI have $m'$ and $m''$ locations, respectively. The above parameters are known except $m$. Use the following two criteria for selecting $m$ to ensure the accuracy of the associative deletion:
1. If the requirement is $f_p(\mathrm{CBFBD}) < F_3$, then $m > -\dfrac{2(I_r + I_e)}{\ln(1 - \sqrt{F_3})} \approx -\dfrac{2(d^2 + l_e^2)}{(d + l_e)\ln(1 - \sqrt{F_3})}$;
2. If the requirement is $f_p(\mathrm{CBFBR}) < F_4$, then $m > -\dfrac{2(n - I_r - I_e)}{\ln(1 - \sqrt{F_4})} \approx -\dfrac{2(nd + nl_e - d^2 - l_e^2)}{(d + l_e)\ln(1 - \sqrt{F_4})}$;
where $F_3$ and $F_4$ are two given real numbers in (0, 1).
Proof. The proof is shown in Appendix F. □
For the parameters in Table 3, when $f_p(\mathrm{CBFBD}) < F_3 = 0.09\%$, $n = 2300$, $d = 1200$, $(m' + m'')/2 = 65536$, and $l_e = 16d^2n^2(0.6185)^{(m' + m'')/n}$, we have $m > 70907$; when $f_p(\mathrm{CBFBR}) < F_4 = 0.15\%$, $n = 2300$, $d = 1200$, $(m' + m'')/2 = 65536$, and $l_e = 16d^2n^2(0.6185)^{(m' + m'')/n}$, we have $m > 61779$. The theoretical estimates are close to our experimental results, i.e., $m = 65536$.
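The Theorem 6 bounds can be evaluated in the same way. In the Python sketch below the function names are hypothetical, and $l_e$ is computed with the base 0.6185 as written in the text; because of rounding in the constants, the results land close to, but not exactly on, the quoted values 70907 and 61779.

```python
import math


def false_items(n, d, m_total):
    """l_e = 16 d^2 n^2 (0.6185)^((m' + m'')/n), as used in the text."""
    return 16 * d ** 2 * n ** 2 * 0.6185 ** (m_total / n)


def min_m_for_fp_cbfbd(n, d, le, f3):
    """Criterion 1 of Theorem 6."""
    return -2 * (d ** 2 + le ** 2) / ((d + le) * math.log(1 - math.sqrt(f3)))


def min_m_for_fp_cbfbr(n, d, le, f4):
    """Criterion 2 of Theorem 6."""
    numerator = 2 * (n * d + n * le - d ** 2 - le ** 2)
    return -numerator / ((d + le) * math.log(1 - math.sqrt(f4)))


n, d = 2300, 1200
le = false_items(n, d, m_total=2 * 65536)
m_bd = min_m_for_fp_cbfbd(n, d, le, f3=0.0009)
m_br = min_m_for_fp_cbfbr(n, d, le, f4=0.0015)
print(f"l_e ~ {le:.1f}")
print(f"m > {m_bd:.0f} for fp(CBFBD) < 0.09%")
print(f"m > {m_br:.0f} for fp(CBFBR) < 0.15%")
```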
Fig. 4. Trade-off on cost and accuracy.
5.2 Trade-off on Cost and Accuracy
Although the false negative rates can be controlled by $(m' + m'')$ using Theorem 5, we are still interested in knowing how to choose each of $m'$ and $m''$ with respect to the calculation cost and accuracy. The hash calculation cost is the summation of the numbers of candidate addresslines and linegroups multiplied by their respective numbers of hash functions (recall Formula (2)) in the two stages of the associative deletion using IABF. Thus, the hash cost is given as follows:
$$C_H = 4dn \cdot \frac{m'}{2n} \ln 2 + \left(4dn(0.6185)^{m'/(2n)} + 2d\right)^2 \frac{m''}{n} \ln 2. \tag{6}$$
The accuracy is directly related to $l_e$, where $l_e$ is the number of false positive items. We have
$$l_{e_{\mathrm{lowest}}} \approx \left(4dn(0.6185)^{m'/(2n)} + 2d\right)^2 (0.6185)^{m''/n}. \tag{7}$$
Theoretically, we can set the derivative of $(C_H + l_{e_{\mathrm{lowest}}})$ to 0 to get an appropriate $m'$ or $m''$ with fixed $(m' + m'')$ to minimize the cost and error. Usually, $C_H$ is larger than $l_{e_{\mathrm{lowest}}}$ by several orders of magnitude. Thus, we use a coefficient to balance them. However, $\partial(C_H + l_e)/\partial m' = 0$ or $\partial(C_H + l_e)/\partial m'' = 0$ is difficult to solve. To show the changes with different $m'$ values and fixed $(m' + m'')$, we draw the calculated results of Formulas (6) and (7) with different parameter values in Fig. 4. We can see that with the increase of $m'$, the hash cost is decreasing, and the number of false positives $l_{e_{\mathrm{lowest}}}$ (i.e., (7), error in Fig. 4) is increasing. When $m'$ is small in Formula (7), we have $l_{e_{\mathrm{lowest}}} \approx 16d^2n^2(0.6185)^{(m' + m'')/n}$, as $4dn(0.6185)^{m'/(2n)} \gg 2d$.
It is interesting to see that, if $m' = 0$, that is, CBFD is omitted and all candidate linegroups are directly verified by CBFI, the number of false positive items is $(2d \times 2n)^2 (0.6185)^{m''/n} = 16d^2n^2(0.6185)^{(m' + m'')/n}$. In other words, using CBFD does not improve the accuracy much, which can also be seen in Fig. 4: with the increase of $m'$ from a small number to $(m' + m'')/2$, the accuracy does not change much. However, if CBFD is omitted, the hash cost is large, which can be observed in Formula (6) and Fig. 4. Thus, omitting CBFD is not a good choice. We may choose an appropriate $m'$ to
Fig. 5. False positive/negative rates variation. (a) Variation with ðm; nÞ. (b) Variation with ðm=n; nÞ.
meet the requirements of both the hash cost and accuracy. From Fig. 4, we have observed that setting $m' = m''$ in IABF may obtain a trade-off between the hash calculation cost and the accuracy.
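The trend described in this subsection can be explored numerically by evaluating Formulas (6) and (7) for several splits of a fixed budget $m' + m''$. In the Python sketch below the function names are hypothetical, and the values of $n$, $d$, and the total budget are illustrative choices, not the parameter sets plotted in Fig. 4.

```python
import math


def hash_cost(n, d, m_d, m_i):
    """Formula (6): hash calculations in stage 1 plus stage 2."""
    stage1 = 4 * d * n * (m_d / (2 * n)) * math.log(2)
    passed = 4 * d * n * 0.6185 ** (m_d / (2 * n)) + 2 * d
    return stage1 + passed ** 2 * (m_i / n) * math.log(2)


def lowest_false_items(n, d, m_d, m_i):
    """Formula (7): lowest number of false positive items."""
    passed = 4 * d * n * 0.6185 ** (m_d / (2 * n)) + 2 * d
    return passed ** 2 * 0.6185 ** (m_i / n)


n, d, total = 2300, 1600, 2 * 65536    # illustrative values
sweep = []
for m_d in (16384, 32768, 65536, 98304):
    m_i = total - m_d
    sweep.append((m_d, hash_cost(n, d, m_d, m_i), lowest_false_items(n, d, m_d, m_i)))

for m_d, cost, err in sweep:
    print(f"m' = {m_d:6d}  C_H ~ {cost:.3e}  l_e ~ {err:.1f}")
```

For these values the cost falls and the error rises as $m'$ grows, with the error changing slowly until $m'$ exceeds $(m' + m'')/2$.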
5.3 Counter Optimization
Now let us discuss how many bits are needed for each counter of the CBFs used in the IABF structure. Because we use Formula (2) to optimize CBFD, according to [4], the probability of a CBF overflow is given by Formula (4). If we allow 4 bits per counter, the CBF overflow will occur when some counter reaches the value 16. Thus, $P(\max_i \{\mathrm{counter}(i)\} \ge 16) \le 1.37 \times 10^{-15} \times m'$. This suffices for most applications. The analysis for CBFI is similar.
As CBFA, CBFB, CBFAD, CBFBR, and CBFBD use only 2 hash functions to represent $n$ items, $m$ should be large enough to decrease the false positive rates of CBFA and CBFB. Assume that the given false positive rate is $F_0 \in (0, 1)$ for CBFA and CBFB, that is, $(1 - e^{-2n/m})^2 < F_0$ from Formula (1). Then,
$$m > \frac{-2n}{\ln(1 - \sqrt{F_0})}. \tag{8}$$
If we allow 3 bits per counter in those CBFs, from Formula (4), the overflow probability is
$$P\left(\max_i \{\mathrm{counter}(i)\} \ge 8\right) \le m \left(\frac{en}{4m}\right)^8 = \frac{e^8 n^8}{4^8 m^7}. \tag{9}$$
From Formulas (8) and (9), we have
$$P\left(\max_i \{\mathrm{counter}(i)\} \ge 8\right) \le \frac{2ne^8\left(-\ln(1 - \sqrt{F_0})\right)^7}{8^8}.$$
This probability is also sufficiently small. For example, if $n = 2300$ and $F_0 = 0.01$, then $P(\max_i \{\mathrm{counter}(i)\} \ge 8) \le 1.177 \times 10^{-7}$. Because CBFAD, CBFBR, and CBFBD have fewer elements than CBFA and CBFB, the same bound applies to them. Thus, we have:
Theorem 7. Each counter in CBFA, CBFB, CBFAD, CBFBR, and CBFBD can 3 bits, with an overflow probability less pffiffiffiffi use 8 F0 ÞÞ7 Þ , where F0 is a required false positive rate than 2ne ðlnð1 8 8 within (0, 1). The counters for CBFD and CBFI can use 4 bits.
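The bound in Theorem 7 can be checked numerically. The following is a minimal sketch, assuming Formula (8) reads $m > -2n/\ln(1 - \sqrt{F_0})$ and Formula (9) reads $e^8 n^8 / (4^8 m^7)$; the function names are hypothetical. Substituting the smallest admissible $m$ from Formula (8) into Formula (9) should reproduce the closed-form bound of the theorem.

```python
import math

def min_filter_size(n, f0):
    """Formula (8): m must exceed this value so that (1 - e^(-2n/m))^2 < f0."""
    return -2 * n / math.log(1 - math.sqrt(f0))

def overflow_at_size(n, m):
    """Formula (9): bound on P(max counter >= 8) with 2 hash functions, size m."""
    return m * (math.e * n / (4 * m)) ** 8

def overflow_bound_3bit(n, f0):
    """Theorem 7: bound on the 3-bit counter overflow probability."""
    return 2 * n * math.e ** 8 * (-math.log(1 - math.sqrt(f0))) ** 7 / 8 ** 8

if __name__ == "__main__":
    n, f0 = 2300, 0.01  # example values used in Section 5.3
    print(f"m must exceed {min_filter_size(n, f0):.0f}")
    print(f"overflow bound: {overflow_bound_3bit(n, f0):.3e}")
```

For $n = 2300$ and $F_0 = 0.01$, the required size is roughly 43,660 locations, so a filter with 65536 locations satisfies Formula (8) and has an even smaller overflow bound.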
5.4 Scalability of IABF
An IABF structure can be constructed by the following steps:
1) For the given parameters $n$, $d$ and the desired false positive and negative rates, we use Theorem 5 to decide $(m' + m'')$ and then choose $m' = m'' = (m' + m'')/2$;
2) According to Theorem 6, we choose $m$;
3) Since the accuracy decreases as $d$ increases (the more deletions we perform, the more false positives/negatives we would get from BFs), we consider a complementary problem when $d > n/2$ to achieve better accuracy. Specifically, we let CBFAD′ = CBFA − CBFAD, use CBFAD′ in place of CBFAD in IABF to obtain CBFBD′ and CBFBR′, and finally let CBFBD = CBFBR′ and CBFBR = CBFBD′.

To illustrate the scalability of the proposed technique, we calculated the false (positive and negative) rates for different numbers of items with different sizes of CBFs in IABF. We set $m' = m'' = m$ and ignored the values of $p$ and $q$. We simply used Theorem 4 to get the estimated false rates for a given CBFAD that had 1000 elements. Our results are shown in Fig. 5. From Fig. 5a, we can see that, when $m$ and $n$ increase in the same proportion, the false rates also increase because the number of candidate addresslines grows. To keep the false rates low as the number of items increases, the Bloom filters should be expanded. Fig. 5b shows that a slight increase of $m/n$ may keep the false rates under 1 percent.

In this paper, we focus on discussing the associative deletion for two correlated attributes, which is the most important problem and provides a base for more complicated cases. In some applications, there may exist an associative deletion problem for more than two correlated attributes. For example, a user may want to associatively filter persons by their names, social security numbers, or driver license numbers. One way to solve this multi-attribute associative deletion problem is to split it into three two-attribute associative deletions.
However, this method may incur more space overhead, as we have to construct CBFD and CBFI for every pair of attributes. An extended version of IABF would store the association information for all three correlated attributes in a unified CBFD and CBFI. Let us consider the insertion of an item $(a, b, c)$ as an example. In CBFD, we store $k$ elements, i.e., $h_1(a) + h_1(b) + h_1(c)$, $h_2(a) + h_2(b) + h_2(c)$, ..., $h_k(a) + h_k(b) + h_k(c)$. In CBFI, we store one element, i.e., $h_1(a) + h_1(b) + h_1(c) + h_2(a) + h_2(b) + h_2(c) + \cdots + h_k(a) + h_k(b) + h_k(c)$. Nevertheless, as the numbers of candidate addresslines and linegroups increase in this extended IABF, the accuracy may decrease. Can we use more Bloom filters to further improve the capturing of multi-attribute association? Intuitively, adding more Bloom filters would improve accuracy. However, the disadvantages of doing so may include an increase in computing overhead and a reduction in space saving. A trade-off among these factors appears to be an interesting research issue, which will be discussed in a separate paper.

Fig. 6. SPAD logic design.
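The keys stored by the extended IABF for a three-attribute item can be sketched as follows. This is only an illustration under stated assumptions: the hash function `h` (SHA-256 with an index prefix, reduced modulo 65536) is a hypothetical stand-in for the paper's hash functions, addresses are rendered as four hexadecimal digits, and '+' is taken to mean concatenation. The sketch builds the keys only; inserting them into CBFD and CBFI is omitted.

```python
import hashlib

def h(i, value, m=65536):
    """Hypothetical i-th hash function mapping a value to an address in [0, m)."""
    digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % m

def addresslines(item, k):
    """The k elements for CBFD: h_i(a) + h_i(b) + h_i(c), with '+' as concatenation."""
    return ["".join(f"{h(i, attr):04x}" for attr in item) for i in range(1, k + 1)]

def linegroup(item, k):
    """The single element for CBFI: all k addresslines concatenated in order."""
    return "".join(addresslines(item, k))

if __name__ == "__main__":
    item = ("Alice", "123-45-6789", "D1234567")  # made-up (a, b, c) values
    for line in addresslines(item, k=2):
        print("CBFD element:", line)
    print("CBFI element:", linegroup(item, k=2))
```

With $k$ hash functions, each item contributes $k$ elements to CBFD and one element to CBFI, so the element counts match the two-attribute case while each key grows by one address per extra attribute.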
6 HARDWARE ACCELERATION
In this section, we discuss another important issue about the associative deletion, namely, hardware acceleration. The associative deletion operation using IABF brings the advantage of space saving. However, it also brings the significant overhead of verifying many addresslines/linegroups. In some applications, such as the examples shown in Fig. 1, processing speed is as important as space saving. Fortunately, these verification operations are simple and repetitive. Therefore, they can be accelerated by a specially designed modular hardware coprocessor, called SPAD (Special coProcessor for Associative Deletion), as shown in Fig. 6. An SPAD does not actually execute deletions; it just finds all true items and false positive items using four standard BFs, which are the degenerated versions of CBFAD, CBFB, CBFD and CBFI, respectively. SPAD is composed of an LU (addressLine verifying Unit) for stage 1 of the IABF-based associative deletion and a GU (lineGroup verifying Unit) for stage 2.

In Fig. 6, the LU is composed of 64 cells. Each of CBFAD and CBFB is split into 8 sub-vectors. Each sub-vector pair of CBFAD and CBFB is assigned a cell to verify addresslines in a nested loop manner. Each cell is equipped with an H3 hash function [6] core, which can verify one line in one clock cycle. In one cycle of LU, an H3 hash core hashes a 32-bit addressline into eight 16-bit addresses, then feeds the values at those locations of CBFD into an 8-input AND gate, and uses the output of the AND gate as a write enable signal ($wr$) for storing the addressline. Once signal $wr$ is 1, the corresponding addressline is passed into the pipeline buffer to be processed in GU for stage 2 verification. As the passed addresslines are generated by LU one by one, there is always an interval between two generated addresslines. We use a pipeline buffer as a bridge connecting LU to GU and fully use this interval to realize the pipeline acceleration.

There are 8 memories, i.e., mem0, ..., mem7, in GU; each memory is composed of an inserting sub-memory and a probing sub-memory. Once there is an addressline, e.g., $t_9$ in Fig. 6, in the pipeline buffer, GU begins to work in two steps: 1) inserting $t_9$ into one of the inserting sub-memories in a round-robin fashion to avoid generating duplicate items; 2) inserting $t_9$ into all probing sub-memories, and triggering the hash verifications in all 8 memories. In each memory of mem0, ..., mem7, $t_9$ will be concatenated with each old addressline in the inserting sub-memory for verification. Assume that there are two addresslines $\{t_0, t_8\}$ in the inserting sub-memory in mem0. GU should then verify the following 4 permutations in 2 clock cycles: $t_0 + t_9$, $t_9 + t_0$, $t_8 + t_9$, and $t_9 + t_8$, where '+' denotes the concatenation operation. To complete the verification for two directional permutations, e.g., $t_0 + t_9$ and $t_9 + t_0$, in one cycle, we duplicate the H3 hash function core. The H3 hash core in GU works in the same way as in LU, except that it hashes a 64-bit linegroup into sixteen 16-bit addresses and executes the AND operation with a 16-input AND gate.
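The stage 1 check performed by one LU cell can be modelled in software. The following is a minimal sketch, assuming the standard H3 construction (each output is the XOR of random rows selected by the set bits of the input) and a plain bit vector standing in for the degenerated CBFD. The class and function names, the seed, and the example addressline are hypothetical; the sketch models the logic only, not the one-cycle hardware timing.

```python
import random

class H3Hash:
    """H3 hash: XOR of the random rows selected by the set bits of the input."""

    def __init__(self, in_bits, out_bits, rng):
        self.in_bits = in_bits
        self.rows = [rng.getrandbits(out_bits) for _ in range(in_bits)]

    def __call__(self, x):
        result = 0
        for i in range(self.in_bits):
            if (x >> i) & 1:
                result ^= self.rows[i]
        return result

def make_core(num_hashes, in_bits=32, out_bits=16, seed=0):
    """A hash core: num_hashes independent H3 functions (8 in LU, 16 in GU)."""
    rng = random.Random(seed)
    return [H3Hash(in_bits, out_bits, rng) for _ in range(num_hashes)]

def insert(bit_vector, core, addressline):
    """Set the bits of a standard Bloom filter for one addressline."""
    for hash_fn in core:
        bit_vector[hash_fn(addressline)] = 1

def write_enable(bit_vector, core, addressline):
    """AND of the bits at all hashed locations; 1 means the line passes stage 1."""
    wr = 1
    for hash_fn in core:
        wr &= bit_vector[hash_fn(addressline)]
    return wr

if __name__ == "__main__":
    core = make_core(8)
    cbfd_bits = [0] * (1 << 16)
    line = (0x1234 << 16) | 0xABCD  # made-up 32-bit addressline
    insert(cbfd_bits, core, line)
    print(write_enable(cbfd_bits, core, line))
```

A GU core would follow the same pattern with `make_core(16, in_bits=64)`, applied to each of the two concatenation orders of a pair of addresslines.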
To ensure that SPAD can work in a pipelined manner, we should deploy appropriate numbers of cells in LU and GU. As discussed in Scenario 4, there are 3200 true addresslines and 17093 false positive addresslines. Therefore, each cell will produce $(17093 + 3200)/64 \approx 317$ addresslines in $8192 \times 8192 \approx 67$ M cycles. To process the passed addresslines in a pipeline and keep LU and GU non-blocking, we design eight cells in GU. When GU is working, the maximal mathematical expectation of the number of addresslines in each inserting sub-memory is $(17093 + 3200)/8 = 20293/8 \approx 2537$. However, the interval between two consecutive addresslines generated in LU is $67\,\mathrm{M}/20293 \approx 3307$ cycles, which is longer than 2537. Therefore, in expectation, once a line is generated by LU, it can be processed by GU before the next line arrives, so no backlog accumulates. From the above analysis, we know that, as the numbers of cells in LU and GU increase in proportion, the associative deletion will be sped up.
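The pipeline balance argument above is simple arithmetic and can be reproduced directly. In this sketch the variable names are hypothetical, and the sub-vector length of 8192 is inferred from a 65536-location filter split into 8 sub-vectors.

```python
TRUE_LINES = 3200             # true addresslines in Scenario 4
FALSE_POSITIVE_LINES = 17093  # false positive addresslines in Scenario 4
LU_CELLS = 64
GU_CELLS = 8
SUBVECTOR_LEN = 8192          # 65536 locations / 8 sub-vectors (inferred)

passed = TRUE_LINES + FALSE_POSITIVE_LINES

# Each LU cell scans one sub-vector pair in a nested loop.
lu_cycles = SUBVECTOR_LEN * SUBVECTOR_LEN

# Addresslines produced per LU cell over the whole scan.
lines_per_lu_cell = passed / LU_CELLS

# Worst-case expected occupancy of one inserting sub-memory in GU, which is
# also the number of cycles GU needs to pair a new line with every old one.
max_lines_per_gu_memory = passed / GU_CELLS

# Average number of cycles between two lines leaving LU.
interval = lu_cycles / passed

# GU keeps up if it finishes one line before the next one arrives.
pipeline_keeps_up = interval > max_lines_per_gu_memory

if __name__ == "__main__":
    print(round(lines_per_lu_cell), round(max_lines_per_gu_memory),
          round(interval), pipeline_keeps_up)
```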
7 EXPERIMENTAL RESULTS
To evaluate the accuracy and performance of our proposed technique, we conducted two types of experiments: one for the software implementation and the other for the hardware implementation. The setup of the experiments is described in Appendix G.

7.1 Associative Deletion
In the first set of experiments, three sets of real trace data objects/items with sizes 2300, 2100, and 1900, respectively, were used. As discussed before, false positive deletions from CBFB, CBFD, and CBFI may cause false negatives for CBFBR and CBFBD. Fig. 7 shows the false negative rates of CBFBR and CBFBD for CBFADs with different numbers of elements (for associatively deleting elements from attribute B). As the number of elements for deletion increases, the number of addressline permutations increases too, and the false negatives of CBFBD and CBFBR grow. The false negative rates of CBFBR are smaller than those of CBFBD. As the number $n$ of data items in the IABF structure decreases, the false negative rates fall quickly. In fact, when $n = 1900$, all the false negative rates are zero in the experiments.

Using the same sets of real trace data items, Fig. 8 illustrates the false positive rates of CBFBR and CBFBD for CBFADs with different numbers of elements. As the number of elements (for attribute A) in CBFAD and the number of items in IABF increase, there are more and more elements in CBFBD (for attribute B) and fewer and fewer elements in CBFBR. In general, the false positive rate is positively correlated with the number of elements in a Bloom filter. Therefore, all the false positive rates grow with more elements in the Bloom filters. The false positive rates for our IABF structure are acceptable because the largest rate is less than 0.35 percent.

The experiments illustrated in Figs. 9 and 10 are analogous to those in Figs. 7 and 8. The difference is that the former use randomly generated synthetic data, while the latter use real data. From the figures, we see that the performance patterns from the two sets of experiments are similar. For the synthetic data, the false negative rates are all less than 5.6 percent, and the false positive rates are all less than 0.32 percent. When $n = 1800$, the false negative rates are all zero in the experiments.

Figs. 11 and 12 show the false negative and false positive rates, respectively, of CBFBD and CBFBR for CBFBs with different numbers of elements that were associatively deleted by a CBFAD with 300 elements (for deletion), using both real and synthetic data. With more elements in CBFB, the number of addressline permutations increases, and the number of errors also increases. When the number of elements in CBFB is less than 2000, the result is highly accurate, with a false negative rate less than 1 percent and a false positive rate less than 0.3 percent.

Fig. 7. False negative rates varied by CBFAD.
Fig. 8. False positive rates varied by CBFAD.
Fig. 9. False negative rates varied by CBFAD.
The above experiments show that, as the number $n$ of items in the IABF structure and/or the number $d$ of elements for deletion increases, the false negative rates of CBFBD and CBFBR increase too. The false positive rates for CBFBD and CBFBR are maintained at a low level. When $n$ is less than 2000 and each Bloom filter has 65536 locations, the proposed associative deletion technique is highly accurate. If a specific false positive/negative rate is needed, one could use Theorems 5 and 6 to choose appropriate $m'$, $m''$, and $m$.
7.2 Sliding Window Associative Deletion
In this section, we consider the problem of sliding window associative deletion in Fig. 1b, which represents a special case with only one element in CBFAD. In general, a sliding window has two operations: one is inserting/adding an item, and the other is invalidating/deleting an item. The two operations run continuously in an interleaved way to move the sliding window forward. Fig. 13 shows the accuracy of the associative deletion over different sliding windows. When the window has 2100 items, regardless of whether real or synthetic data is used, the technique maintains a high level of accuracy, with the false negative rate being zero and the false positive rate being less than 0.35 percent in the experiments. As the window size increases, the accuracy declines, and the accumulated false negative deletions may lead to a significant error. When the window has 2200 items, the technique can process up to $47.5 \times 10^5$ synthetic data items with high accuracy. After that, the accuracy drops dramatically. When processing real data, the technique maintains high accuracy even when the window has 2300 items. Once the window reaches 2400 items, the technique can only accurately process up to $4.32 \times 10^4$ real data items. The experiments show that the technique can deal with sliding windows of up to 2100 items at a high accuracy level. When the window size is expanded, the Bloom filters may need to be reset after sliding over a certain number of items to ensure high accuracy.
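The interleaving of insertions and invalidations described above can be illustrated with a much simpler structure than IABF. The following is a minimal sketch, assuming a single counting Bloom filter with two hypothetical SHA-256-based hash functions and an exact queue of window contents. It is not the IABF-based associative deletion evaluated in this section, which locates the item to invalidate through the filters rather than through an exact queue; the sketch only shows the insert-then-invalidate pattern that moves the window forward.

```python
import hashlib
from collections import deque

class CountingBloomFilter:
    """Counting Bloom filter with k hypothetical hash functions over m counters."""

    def __init__(self, m=65536, k=2):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _addresses(self, element):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, element):
        for addr in self._addresses(element):
            self.counters[addr] += 1

    def delete(self, element):
        for addr in self._addresses(element):
            self.counters[addr] -= 1

    def __contains__(self, element):
        return all(self.counters[a] > 0 for a in self._addresses(element))

class SlidingWindow:
    """Window of the most recent `size` items, mirrored in a counting Bloom filter."""

    def __init__(self, size):
        self.size = size
        self.items = deque()  # exact contents, kept only to make the sketch checkable
        self.cbf = CountingBloomFilter()

    def slide(self, item):
        """Insert the new item, then invalidate the oldest one if the window is full."""
        self.items.append(item)
        self.cbf.insert(item)
        if len(self.items) > self.size:
            self.cbf.delete(self.items.popleft())

if __name__ == "__main__":
    window = SlidingWindow(size=3)
    for i in range(10):
        window.slide(f"item{i}")
    print(list(window.items), "item9" in window.cbf)
```

Because every deletion here is exact, the filter never accumulates errors; in the IABF setting, false positive deletions are what produce the accumulated false negatives reported in Fig. 13.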
Fig. 11. False negative rates varied by CBFB.
7.3 FPGA Implementation
Let us now present experimental results that were obtained by implementing SPAD on different FPGAs from Altera. The first three data rows in Table 4 show the frequencies and hardware/power consumptions of three types of SPADs using a Stratix III FPGA (EP3SL340F1760C2). We can see that, as the number of cells increases, the hardware consumption and power consumption also increase, and the maximal working frequencies decrease slowly. From the simulation experiments (parameters were set as in Scenario 4), we recorded the average number of clock cycles needed to process an associative deletion; combined with the maximal frequency, this gives the time to process one operation. The fastest processing time is 87.7 ms, with 256 cells in LU and 32 cells in GU.

To validate our technique, we also implemented two SPADs on a low-end FPGA development board, the Altera DE2-115. The board contains a Cyclone IV FPGA (EP4CE115F29C7) and provides some interfaces such as LEDs. Although the frequencies for the relevant rows 4 and 5 in Table 4 are higher than 50 MHz, we set the clock of the prototype SPAD to 50 MHz because of the limitations of the DE2-115 board. To obtain an accurate processing rate, we added a counter in the prototype SPAD to record the number of processing cycles. Because of the limited conditions, such as the low-end FPGA and fewer cells, the performance of the prototype is inferior to the simulation results. In fact, a much higher performance could be obtained if the SPAD were implemented on a high-end FPGA development board.

The above results show that one associative deletion can be accelerated to complete within dozens of milliseconds, which can meet the requirements of most applications. If the speed is not fast enough for some on-line processing, like the examples in Fig. 1, SPAD can be built with more LUs and GUs using a high-end FPGA or ASIC, which would further enhance the processing speed.
Fig. 10. False positive rates varied by CBFAD.
Fig. 12. False positive rates varied by CBFB.
Fig. 13. False positive/negative rates over different windows.
8 CONCLUSION
In this paper, we have introduced the concept of an associative deletion and presented an algorithm and analysis to efficiently process this operation. In particular, a novel Improved Associative deletion Bloom Filter (IABF) structure, which effectively captures the dependency/association between two correlated attributes, is proposed. To accelerate the operation, we also illustrate a coprocessor design that can provide high-performance computation. Detailed analysis and experimental results demonstrate that the false positive/negative rates of our technique can be kept under a user-defined threshold value with substantial space savings.

Further studies will be conducted in the future. As mentioned in Section 5.4, an associative deletion for more than two attributes is an interesting research issue. Although this issue can be supported by applying the two-attribute scheme in a pair-wise manner or by an extended version of IABF, these two methods may not provide the best performance. A careful trade-off among accuracy, computing overhead, and space saving needs to be further studied. In addition, a controlled approximate associative deletion leads to another interesting research direction.
ACKNOWLEDGMENT
This work was supported in part by China NSF Grants No. 60803021 and No. 61170035, China Scholarship Fund Grant No. 2011833129, Zhejiang NSF Grant No. LY13F020040, as well as programs sponsored by K.C. Wong Magna Fund in Ningbo University. The authors appreciate Laura Bottomley at Duke University for providing trace data. They wish to thank the anonymous reviewers for their valuable time and suggestions to improve the paper.
TABLE 4 Results of Simulation and Prototype

REFERENCES
[1] B.H. Bloom, "Space/Time Trade-Offs in Hash Coding with Allowable Errors," Commun. ACM, vol. 13, no. 7, pp. 422-426, July 1970.
[2] D. Guo, J. Wu, H. Chen, and X. Luo, "Theory and Network Application of Dynamic Bloom Filters," in Proc. IEEE INFOCOM, 2006, pp. 1-12.
[3] B. Xiao and Y. Hua, "Using Parallel Bloom Filters for Multiattribute Representation on Network Services," IEEE Trans. Parallel Distrib. Syst., vol. 21, no. 1, pp. 20-32, Jan. 2010.
[4] A. Broder and M. Mitzenmacher, "Network Applications of Bloom Filters: A Survey," Internet Math., vol. 1, no. 4, pp. 485-509, 2005.
[5] L. Fan, P. Cao, J. Almeida, and A.Z. Broder, "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol," IEEE/ACM Trans. Netw., vol. 8, no. 3, pp. 281-293, June 2000.
[6] J.L. Carter and M.N. Wegman, "Universal Classes of Hash Functions," J. Comput. Syst. Sci., vol. 18, no. 2, pp. 143-154, Apr. 1979.
Jiangbo Qian received the PhD degree in computer science from Southeast University, China, in 2006. He is currently a Professor in the College of Information Science and Engineering at Ningbo University, China. He was a visiting scholar in the Department of Computer and Information Science at The University of Michigan-Dearborn, MI, USA. His research interests include database management, streaming data processing, hardware/software co-design, and logic circuit design.
Qiang Zhu received the PhD degree in computer science from the University of Waterloo, Canada, in 1995. He is currently a Professor in the Department of Computer and Information Science at The University of Michigan-Dearborn, USA. He is also an ACM Distinguished Scientist, an IBM CAS Faculty Fellow and an IEEE Senior Member. His current research interests include query optimization, streaming data processing, multidimensional indexing, self-managing databases and Web information systems.
Yongli Wang received the PhD degree in computer science from Southeast University, China, in 2006. He is currently an Associate Professor at Nanjing University of Science and Technology, China. His main research interests include sensor networks, data stream management, data mining, cyber-physical systems, and biodata analysis. For his research activities, he also spent an extended period of time at Drexel University, USA.